Metrics#
The TabEval metrics module provides a comprehensive suite of evaluation metrics for synthetic tabular data. All metrics inherit from the base MetricEvaluator
class and are organized by evaluation category.
Overview#
TabEval provides metrics across several evaluation dimensions:
Statistical Metrics: Distribution-based comparisons between real and synthetic data
Privacy Metrics: Privacy protection and differential privacy guarantees
Structure Metrics: Feature-level utility and relationships
Density Metrics: High-order and low-order data quality assessments
Main Interface#
Base Classes#
- class tabeval.metrics.core.MetricEvaluator(reduction='mean', n_histogram_bins=10, n_folds=3, task_type='classification', random_state=0, workspace=PosixPath('logs/tabeval_workspace'), use_cache=True, default_metric=None)[source]#
Bases:
object
Base class for all metrics.
- Each derived class must implement the following methods:
evaluate() - compare two datasets and return a dictionary of metrics.
direction() - direction of the metric (bigger is better or smaller is better).
type() - type of the metric.
name() - name of the metric.
If any method implementation is missing, the class constructor will fail.
- Constructor Args:
- reduction: str
The way to aggregate metrics across folds. Default: ‘mean’.
- n_histogram_bins: int
The number of bins used in histogram calculation. Default: 10.
- n_folds: int
The number of folds in cross validation. Default: 3.
- task_type: str
The type of downstream task. Default: ‘classification’.
- workspace: Path
The directory to save intermediate models or results. Default: Path(“logs/tabeval_workspace”).
- use_cache: bool
Whether to use cache. If True, it will try to load saved results from the workspace directory where possible.
The base class provides common functionality including:
Caching mechanism for expensive computations
Standardized evaluation interface
Reduction operations (mean, max, min, median)
OneClass representation for advanced metrics
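To make the required interface concrete, below is a minimal sketch of a custom metric. The class, its scoring logic, and the exact method signatures are illustrative assumptions; consult the source for the authoritative interface.
# Minimal sketch of a custom metric, assuming the interface listed above
# (evaluate/direction/type/name). The class and its scoring logic are
# hypothetical; check the source for the exact signatures.
import numpy as np
import pandas as pd
from tabeval.metrics.core import MetricEvaluator

class MeanAbsoluteColumnGap(MetricEvaluator):
    @staticmethod
    def name():
        return "mean_abs_column_gap"

    @staticmethod
    def type():
        return "stats"

    @staticmethod
    def direction():
        return "minimize"  # smaller gap = distributions closer

    def evaluate(self, X_gt, X_syn):
        # Hypothetical score: mean absolute gap between numeric column means
        gaps = (X_gt.mean(numeric_only=True) - X_syn.mean(numeric_only=True)).abs()
        return {"score": float(np.mean(gaps))}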
Statistical Metrics#
Statistical metrics compare the distributional properties between real and synthetic data.
- class tabeval.metrics.eval_statistical.StatisticalEvaluator(**kwargs)[source]#
Bases:
MetricEvaluator
Base class for all statistical metrics.
Jensen-Shannon Distance#
- class tabeval.metrics.eval_statistical.JensenShannonDistance(normalize=True, **kwargs)[source]#
Bases:
StatisticalEvaluator
Evaluate the average Jensen-Shannon distance (metric) between two probability arrays.
Score Range: [0, 1]
Direction: minimize (0 = identical distributions, 1 = completely different)
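For intuition, the per-column quantity behind this metric can be reproduced with scipy on shared histogram bins (the evaluator handles binning, encoding, and averaging internally):
# Illustrative only: the per-column quantity behind the metric, computed
# with scipy on a shared set of histogram bins.
import numpy as np
from scipy.spatial.distance import jensenshannon

real = np.random.normal(0, 1, 1000)
synth = np.random.normal(0.5, 1, 1000)

# Shared bins so both histograms describe the same support
bins = np.histogram_bin_edges(np.concatenate([real, synth]), bins=10)
p, _ = np.histogram(real, bins=bins)
q, _ = np.histogram(synth, bins=bins)

print(jensenshannon(p, q, base=2))  # base=2 keeps the distance in [0, 1]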
Inverse KL Divergence#
- class tabeval.metrics.eval_statistical.InverseKLDivergence(**kwargs)[source]#
Bases:
StatisticalEvaluator
Returns the average inverse of the Kullback–Leibler Divergence metric.
- Score:
0: the datasets are from different distributions.
1: the datasets are from the same distribution.
Score Range: [0, 1]
Direction: maximize (1 = same distribution, 0 = different distributions)
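As a sketch of the score's behaviour, one common mapping of a KL divergence D into (0, 1] is 1 / (1 + D); whether TabEval uses exactly this transform should be verified in the source:
# One common mapping of KL divergence D into (0, 1] is 1 / (1 + D);
# whether TabEval uses exactly this transform should be checked in the
# source. The example only illustrates the score behaviour described above.
from scipy.stats import entropy

p = [0.25, 0.25, 0.25, 0.25]  # real histogram
q = [0.40, 0.30, 0.20, 0.10]  # synthetic histogram

kl = entropy(p, q)            # Kullback-Leibler divergence D(p || q)
print(1.0 / (1.0 + kl))       # 1 when identical, approaches 0 as D grows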
Kolmogorov-Smirnov Test#
- class tabeval.metrics.eval_statistical.KolmogorovSmirnovTest(**kwargs)[source]#
Bases:
StatisticalEvaluator
Performs the Kolmogorov-Smirnov test for goodness of fit.
- Score:
0: the distributions are totally different.
1: the distributions are identical.
Score Range: [0, 1]
Direction: maximize (1 = identical distributions, 0 = totally different)
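A per-column illustration with scipy's two-sample KS test; mapping the statistic to 1 - statistic matches the direction above, though whether TabEval scores via the statistic or the p-value should be confirmed in the source:
# Per-column two-sample KS test via scipy. Mapping the statistic to a
# score as 1 - statistic matches the direction above; whether TabEval
# uses the statistic or the p-value should be confirmed in the source.
import numpy as np
from scipy.stats import ks_2samp

real = np.random.normal(0, 1, 1000)
synth = np.random.normal(0, 1, 1000)

stat, p_value = ks_2samp(real, synth)
print(f"KS statistic: {stat:.3f}, score (1 - stat): {1 - stat:.3f}")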
Chi-Squared Test#
- class tabeval.metrics.eval_statistical.ChiSquaredTest(**kwargs)[source]#
Bases:
StatisticalEvaluator
Performs the one-way chi-square test.
- Returns:
The p-value. A small value indicates that we can reject the null hypothesis and that the distributions are different.
- Return type:
float
- Score:
0: the distributions are different.
1: the distributions are identical.
Score Range: [0, 1]
Direction: maximize (1 = identical distributions, 0 = different distributions)
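For reference, the one-way chi-square test as scipy exposes it, on per-category counts:
# The one-way chi-square test as scipy exposes it, on per-category counts.
# f_obs and f_exp must share the same total count.
from scipy.stats import chisquare

observed = [18, 22, 30, 30]   # synthetic category counts (total 100)
expected = [25, 25, 25, 25]   # real category counts (total 100)

result = chisquare(f_obs=observed, f_exp=expected)
print(result.pvalue)  # large p-value: no evidence the distributions differ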
Maximum Mean Discrepancy#
- class tabeval.metrics.eval_statistical.MaximumMeanDiscrepancy(kernel='rbf', **kwargs)[source]#
Bases:
StatisticalEvaluator
Empirical maximum mean discrepancy. The lower the result the more evidence that distributions are the same.
- Parameters:
kernel (str) – “rbf”, “linear” or “polynomial”
- Score:
0: the distributions are the same.
Larger values: more evidence that the distributions differ.
Supported Kernels: “rbf”, “linear”, “polynomial”
Score Range: [0, ∞)
Direction: minimize (0 = same distributions, higher = more different)
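A compact sketch of the biased empirical MMD² estimator with an RBF kernel; TabEval's implementation may differ in bandwidth selection and estimator details:
# A compact biased estimator of MMD^2 with an RBF kernel, for intuition;
# TabEval's implementation may differ in bandwidth selection and
# estimator details.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_rbf(X, Y, gamma=1.0):
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 4))
Y = rng.normal(0, 1, size=(500, 4))
print(mmd_rbf(X, Y))  # near 0 when the distributions match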
Wasserstein Distance#
- class tabeval.metrics.eval_statistical.WassersteinDistance(**kwargs)[source]#
Bases:
StatisticalEvaluator
Compare Wasserstein distance between original data and synthetic data.
- Parameters:
X – original data
X_syn – synthetically generated data
- Returns:
Wasserstein distance
- Return type:
float
Score Range: [0, ∞)
Direction: minimize (0 = identical distributions)
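An illustration of the underlying quantity using scipy's 1-D Wasserstein distance per column; this is not necessarily TabEval's exact aggregation:
# The 1-D Wasserstein distance per column via scipy; treat this as an
# illustration of the quantity, not of TabEval's exact aggregation.
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

real = pd.DataFrame({"a": np.random.normal(0.0, 1, 1000)})
synth = pd.DataFrame({"a": np.random.normal(0.3, 1, 1000)})

for col in real.columns:
    print(col, wasserstein_distance(real[col], synth[col]))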
PRDC Score#
- class tabeval.metrics.eval_statistical.PRDCScore(nearest_k=5, **kwargs)[source]#
Bases:
StatisticalEvaluator
Computes precision, recall, density, and coverage given two manifolds.
- Parameters:
nearest_k (int) – number of nearest neighbours used to estimate the manifolds.
Returns: Dictionary with precision, recall, density, and coverage scores
Direction: maximize (all metrics range from 0 to 1)
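For reference, a condensed version of the standard k-NN PRDC estimator (after Naeem et al., 2020); TabEval's implementation should behave similarly but may differ in distance metric and details:
# Condensed k-NN PRDC estimator (after Naeem et al., 2020). Illustrative:
# TabEval's version may differ in distance metric and implementation details.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def prdc(real, fake, nearest_k=5):
    def knn_radii(X):
        # n_neighbors=k+1 because each point is its own nearest neighbour
        d, _ = NearestNeighbors(n_neighbors=nearest_k + 1).fit(X).kneighbors(X)
        return d[:, -1]

    r_real, r_fake = knn_radii(real), knn_radii(fake)
    # Pairwise real-to-fake distances, shape (n_real, n_fake)
    dist = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)

    precision = (dist < r_real[:, None]).any(axis=0).mean()
    recall = (dist < r_fake[None, :]).any(axis=1).mean()
    density = (dist < r_real[:, None]).sum(axis=0).mean() / nearest_k
    coverage = (dist.min(axis=1) < r_real).mean()
    return dict(precision=precision, recall=recall,
                density=density, coverage=coverage)

rng = np.random.default_rng(0)
print(prdc(rng.normal(size=(300, 4)), rng.normal(size=(300, 4))))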
Alpha Precision#
- class tabeval.metrics.eval_statistical.AlphaPrecision(**kwargs)[source]#
Bases:
StatisticalEvaluator
Evaluates the alpha-precision, beta-recall, and authenticity scores.
The class evaluates the synthetic data using a tuple of three metrics: alpha-precision, beta-recall, and authenticity. Note that these metrics can be evaluated for each synthetic data point (which are useful for auditing and post-processing). Here we average the scores to reflect the overall quality of the data. The formal definitions can be found in the reference below:
Alaa, Ahmed, Boris Van Breugel, Evgeny S. Saveliev, and Mihaela van der Schaar. “How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models.” In International Conference on Machine Learning, pp. 290-306. PMLR, 2022.
Returns: Dictionary with delta_precision_alpha, delta_coverage_beta, and authenticity scores
Direction: maximize (all metrics range from 0 to 1)
Survival KM Distance#
- class tabeval.metrics.eval_statistical.SurvivalKMDistance(**kwargs)[source]#
Bases:
StatisticalEvaluator
The distance between two Kaplan-Meier plots. Used for survival analysis.
Task Type: survival_analysis only
Direction: minimize
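As a hedged sketch, one way to quantify the gap between two Kaplan-Meier curves is the mean absolute difference of the fitted survival functions on a shared timeline (here via lifelines); TabEval's exact distance may differ:
# Hedged sketch: compare two Kaplan-Meier curves via the mean absolute
# difference of the fitted survival functions on a shared timeline.
# TabEval's exact distance may differ; the data here is simulated.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
t_real, e_real = rng.exponential(30, 500), rng.integers(0, 2, 500)
t_syn, e_syn = rng.exponential(35, 500), rng.integers(0, 2, 500)

timeline = np.linspace(0, 100, 200)
km_real = KaplanMeierFitter().fit(t_real, event_observed=e_real, timeline=timeline)
km_syn = KaplanMeierFitter().fit(t_syn, event_observed=e_syn, timeline=timeline)

gap = np.abs(km_real.survival_function_.values
             - km_syn.survival_function_.values).mean()
print(gap)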
Frechet Inception Distance#
- class tabeval.metrics.eval_statistical.FrechetInceptionDistance(**kwargs)[source]#
Bases:
StatisticalEvaluator
Calculates the Frechet Inception Distance (FID) to evaluate GANs.
Paper: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.
The FID metric calculates the distance between two distributions of images. Typically, we have summary statistics (mean & covariance matrix) for one of these distributions, while the second distribution is given by a GAN.
Adapted by Boris van Breugel (bv292@cam.ac.uk)
Data Type: images only
Direction: minimize
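The FID formula itself is straightforward to state on feature vectors; for non-image data the "features" would be encoded columns rather than Inception activations. A minimal sketch:
# The FID formula on feature vectors:
# ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2)).
import numpy as np
from scipy.linalg import sqrtm

def fid(X, Y):
    mu1, mu2 = X.mean(axis=0), Y.mean(axis=0)
    c1, c2 = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical noise
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))

rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 6)), rng.normal(size=(500, 6))))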
Privacy Metrics#
Privacy metrics assess the privacy protection offered by synthetic data generation.
DCR (Baseline Protection)#
Structure Metrics#
Structure metrics evaluate the utility and relationships preserved in synthetic data.
Utility Per Feature#
Density Metrics#
Density metrics assess the quality of synthetic data through density-based comparisons.
Low Order Metrics#
High Order Metrics#
Visualization#
The metrics module also provides visualization utilities for comparing real and synthetic data.
Available Functions:
plot_marginal_comparison(): Creates marginal distribution comparison plots
plot_tsne(): Generates t-SNE plots for data comparison
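The call pattern below is an assumption (the import path and argument names are not documented here); check the actual signatures in the source before relying on it:
# Hypothetical call pattern -- the import path and argument names are
# assumptions; check the actual signatures in the tabeval source.
from tabeval.metrics import plot_marginal_comparison, plot_tsne

plot_marginal_comparison(real_data, synthetic_data)
plot_tsne(real_data, synthetic_data)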
Utility Functions#
Internal utility functions for metric computation.
- tabeval.metrics._utils.get_frequency(X_gt, X_synth, n_histogram_bins=10)[source]#
Get percentage frequencies for each possible real categorical value.
- Returns:
The observed and expected frequencies (as a percent).
- tabeval.metrics._utils.get_features(X, sensitive_features=[])[source]#
Return the non-sensitive features from dataset X
- tabeval.metrics._utils.load_dataset(data_train=None, data_valid=None, data_test=None, device=device(type='cpu'), batch_dim=50)[source]#
- tabeval.metrics._utils.create_model(n_dims, n_flows=5, n_layers=3, hidden_dim=32, residual='gated', verbose=False, device=device(type='cpu'), batch_dim=50)[source]#
- tabeval.metrics._utils.save_model(model, optimizer, epoch, save=False, workspace=PosixPath('logs/tabeval_workspace'))[source]#
- tabeval.metrics._utils.load_model(model, optimizer, workspace=PosixPath('logs/tabeval_workspace'))[source]#
- tabeval.metrics._utils.train(model, optimizer, scheduler, data_loader_train, data_loader_valid, data_loader_test, workspace=PosixPath('logs/tabeval_workspace'), start_epoch=0, device=device(type='cpu'), epochs=50, save=False, clip_norm=0.1)[source]#
- tabeval.metrics._utils.density_estimator_trainer(data_train, data_val=None, data_test=None, batch_dim=50, flows=5, layers=3, hidden_dim=32, residual='gated', workspace=PosixPath('logs/tabeval_workspace'), decay=0.5, patience=20, cooldown=10, min_lr=0.0005, early_stopping=100, device=device(type='cpu'), epochs=50, learning_rate=0.01, clip_norm=0.1, polyak=0.998, save=True, load=True)[source]#
Metric Categories and Default Configuration#
The metrics are organized into the following categories with their default configurations:
{
'stats': [
'jensenshannon_dist', 'chi_squared_test', 'inv_kl_divergence',
'ks_test', 'max_mean_discrepancy', 'wasserstein_dist',
'prdc', 'alpha_precision', 'survival_km_distance'
],
'privacy': ['dcr'],
'structure': ['utility_per_feature'],
'density': ['low_order', 'high_order']
}
Usage Examples#
Basic Usage#
from tabeval.metrics import Metrics
import pandas as pd
# Load your datasets
real_data = pd.read_csv("real_data.csv")
synthetic_data = pd.read_csv("synthetic_data.csv")
# Run all default metrics
results = Metrics.evaluate(real_data, synthetic_data)
print(results)
Custom Metric Selection#
from pathlib import Path

# Select specific metrics to run
custom_metrics = {
'stats': ['jensenshannon_dist', 'wasserstein_dist'],
'privacy': ['dcr']
}
results = Metrics.evaluate(
real_data,
synthetic_data,
metrics=custom_metrics,
task_type='classification',
n_folds=3,
workspace=Path('my_workspace')
)
Survival Analysis Example#
# For survival analysis data
results = Metrics.evaluate(
X_gt=survival_real_data,
X_syn=survival_synthetic_data,
task_type='survival_analysis',
metrics={'stats': ['survival_km_distance']}
)
Advanced Configuration#
# Advanced configuration with all parameters
results = Metrics.evaluate(
X_gt=real_data,
X_syn=synthetic_data,
X_train=training_data, # Optional: training data for some metrics
reduction='median', # Aggregation method: 'mean', 'median', 'min', 'max'
n_histogram_bins=20, # Number of bins for histogram-based metrics
task_type='regression', # Task type affects model evaluation
random_state=42, # For reproducibility
workspace=Path('cache'), # Caching directory
use_cache=True, # Enable result caching
n_folds=5 # Cross-validation folds
)
Notes#
All metrics support caching to avoid recomputation of expensive operations
Most metrics work with any tabular data, but some are specialized (e.g., survival analysis, images)
The evaluation framework automatically handles data encoding and preprocessing
Results are returned as pandas DataFrames for easy analysis and visualization