Metrics#

The TabEval metrics module provides a comprehensive suite of evaluation metrics for synthetic tabular data. All metrics inherit from the base MetricEvaluator class and are organized by evaluation category.

Overview#

TabEval provides metrics across several evaluation dimensions:

  • Statistical Metrics: Distribution-based comparisons between real and synthetic data

  • Privacy Metrics: Privacy protection and differential privacy guarantees

  • Structure Metrics: Feature-level utility and relationships

  • Density Metrics: High-order and low-order data quality assessments

Main Interface#

Base Classes#

class tabeval.metrics.core.MetricEvaluator(reduction='mean', n_histogram_bins=10, n_folds=3, task_type='classification', random_state=0, workspace=PosixPath('logs/tabeval_workspace'), use_cache=True, default_metric=None)[source]#

Bases: object

Base class for all metrics.

Each derived class must implement the following methods:

evaluate() - compare two datasets and return a dictionary of metrics.
direction() - the direction of the metric (bigger is better or smaller is better).
type() - the type of the metric.
name() - the name of the metric.

If any method implementation is missing, the class constructor will fail.

Constructor Args:
reduction: str

The way to aggregate metrics across folds. Default: ‘mean’.

n_histogram_bins: int

The number of bins used in histogram calculation. Default: 10.

n_folds: int

The number of folds in cross validation. Default: 3.

task_type: str

The type of downstream task. Default: ‘classification’.

random_state: int

The random seed for reproducibility. Default: 0.

workspace: Path

The directory to save intermediate models or results. Default: Path(“logs/tabeval_workspace”).

use_cache: bool

Whether to use the cache. If True, it will try to load saved results from the workspace directory where possible. Default: True.

The base class also provides common functionality, including:

  • Caching mechanism for expensive computations

  • Standardized evaluation interface

  • Reduction operations (mean, max, min, median)

  • OneClass representation for advanced metrics

abstract evaluate(X_gt, X_syn)[source]#
abstract evaluate_default(X_gt, X_syn)[source]#
abstract static direction()[source]#
abstract static type()[source]#
abstract static name()[source]#
classmethod fqdn()[source]#
reduction()[source]#
use_cache(path)[source]#
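
As an illustration, the following minimal sketch implements the required interface for a custom metric. The class name, the “stats” type string, and the dictionary return shape are assumptions for this example; only the abstract methods listed above are prescribed by the base class.

from tabeval.metrics.core import MetricEvaluator

class MeanAbsoluteDifference(MetricEvaluator):
    # Hypothetical metric: mean absolute gap between per-column means.

    @staticmethod
    def name():
        return "mean_abs_diff"

    @staticmethod
    def type():
        return "stats"  # assumed category label

    @staticmethod
    def direction():
        return "minimize"

    def evaluate(self, X_gt, X_syn):
        # Assumes X_gt and X_syn are pandas DataFrames with shared columns.
        diff = (X_gt.mean() - X_syn.mean()).abs().mean()
        return {"score": float(diff)}

    def evaluate_default(self, X_gt, X_syn):
        return self.evaluate(X_gt, X_syn)["score"]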

Statistical Metrics#

Statistical metrics compare the distributional properties between real and synthetic data.

class tabeval.metrics.eval_statistical.StatisticalEvaluator(**kwargs)[source]#

Bases: MetricEvaluator

Base class for all statistical metrics.

static type()[source]#
evaluate(X_gt, X_syn)[source]#
evaluate_default(X_gt, X_syn)[source]#

Jensen-Shannon Distance#

class tabeval.metrics.eval_statistical.JensenShannonDistance(normalize=True, **kwargs)[source]#

Bases: StatisticalEvaluator

Evaluates the average Jensen-Shannon distance (a metric) between the probability distributions of the real and synthetic data.

Score Range: [0, 1]

Direction: minimize (0 = identical distributions, 1 = completely different)

static name()[source]#
static direction()[source]#
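
A minimal sketch of the underlying computation for a single feature, using SciPy; this is a plausible approach for illustration, and TabEval’s exact binning may differ:

import numpy as np
from scipy.spatial.distance import jensenshannon

real = np.random.normal(0, 1, 1000)
synth = np.random.normal(0.1, 1, 1000)

# Histogram both samples on shared bin edges so the supports match.
bins = np.histogram_bin_edges(np.concatenate([real, synth]), bins=10)
p, _ = np.histogram(real, bins=bins)
q, _ = np.histogram(synth, bins=bins)

# The Jensen-Shannon distance lies in [0, 1] with base-2 logarithms.
print(jensenshannon(p, q, base=2))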

Inverse KL Divergence#

class tabeval.metrics.eval_statistical.InverseKLDivergence(**kwargs)[source]#

Bases: StatisticalEvaluator

Returns the average inverse of the Kullback–Leibler divergence across features.

Score Range: [0, 1]

Direction: maximize (1 = same distribution, 0 = different distributions)

static name()[source]#
static direction()[source]#
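
One common way to map a KL divergence into the stated [0, 1] range is score = 1 / (1 + KL); whether TabEval uses exactly this transformation is an assumption here. A per-feature sketch:

import numpy as np
from scipy.special import rel_entr

# Binned frequencies for one feature (shared bin edges, both sum to 1).
p = np.array([0.2, 0.5, 0.3])     # real
q = np.array([0.25, 0.45, 0.30])  # synthetic

kl = rel_entr(p, q).sum()   # KL(p || q)
score = 1.0 / (1.0 + kl)    # assumed inverse mapping into (0, 1]
print(score)                # close to 1 for near-identical distributions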

Kolmogorov-Smirnov Test#

class tabeval.metrics.eval_statistical.KolmogorovSmirnovTest(**kwargs)[source]#

Bases: StatisticalEvaluator

Performs the Kolmogorov-Smirnov test for goodness of fit.

Score Range: [0, 1]

Direction: maximize (1 = identical distributions, 0 = totally different)

static name()[source]#
static direction()[source]#
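
A per-feature sketch using SciPy’s two-sample test; mapping the KS statistic to a “higher is better” score via 1 - statistic is an assumption for illustration:

import numpy as np
from scipy.stats import ks_2samp

real = np.random.normal(0, 1, 1000)
synth = np.random.normal(0, 1, 1000)

stat, p_value = ks_2samp(real, synth)  # stat lies in [0, 1]
print(1.0 - stat)  # close to 1 when the empirical CDFs coincide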

Chi-Squared Test#

class tabeval.metrics.eval_statistical.ChiSquaredTest(**kwargs)[source]#

Bases: StatisticalEvaluator

Performs the one-way chi-square test and returns the p-value. A small p-value indicates that the null hypothesis can be rejected, i.e., that the distributions differ.

Score Range: [0, 1]

Direction: maximize (1 = identical distributions, 0 = different distributions)

static name()[source]#
static direction()[source]#
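
A sketch of the underlying test with SciPy, comparing category counts for one feature; the exact frequency construction inside TabEval is not shown here:

import numpy as np
from scipy.stats import chisquare

observed = np.array([48, 30, 22])   # counts observed in the synthetic data
expected = np.array([50, 30, 20])   # counts expected from the real data
# scipy requires both frequency vectors to have the same total.
expected = expected * observed.sum() / expected.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(p_value)  # a large p-value gives no evidence the distributions differ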

Maximum Mean Discrepancy#

class tabeval.metrics.eval_statistical.MaximumMeanDiscrepancy(kernel='rbf', **kwargs)[source]#

Bases: StatisticalEvaluator

Empirical maximum mean discrepancy. The lower the result, the stronger the evidence that the two distributions are the same.

Parameters:

kernel (str) – “rbf”, “linear” or “polynomial”

Score Range: [0, ∞)

Direction: minimize (0 = same distributions, higher = more different)

static name()[source]#
static direction()[source]#
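
A self-contained sketch of the biased empirical MMD² with an RBF kernel; the gamma choice and estimator variant are assumptions for illustration:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_rbf(X, Y, gamma=1.0):
    # Biased empirical estimate: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    kxx = rbf_kernel(X, X, gamma=gamma).mean()
    kyy = rbf_kernel(Y, Y, gamma=gamma).mean()
    kxy = rbf_kernel(X, Y, gamma=gamma).mean()
    return kxx + kyy - 2 * kxy

X = np.random.normal(0, 1, (500, 4))
Y = np.random.normal(0, 1, (500, 4))
print(mmd_rbf(X, Y))  # near 0 when the distributions match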

Wasserstein Distance#

class tabeval.metrics.eval_statistical.WassersteinDistance(**kwargs)[source]#

Bases: StatisticalEvaluator

Compares the Wasserstein distance between the original data and the synthetic data.

Parameters:
  • X – original data

  • X_syn – synthetically generated data

Returns:

The Wasserstein distance (a non-negative float).

Score Range: [0, ∞)

Direction: minimize (0 = identical distributions)

static name()[source]#
static direction()[source]#
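
For a single numeric feature, the computation reduces to SciPy’s one-dimensional earth mover’s distance; how TabEval aggregates across features is not specified here:

import numpy as np
from scipy.stats import wasserstein_distance

real = np.random.normal(0.0, 1, 1000)
synth = np.random.normal(0.5, 1, 1000)

# 1-D Wasserstein distance between the two empirical distributions.
print(wasserstein_distance(real, synth))  # roughly 0.5, the mean shift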

PRDC Score#

class tabeval.metrics.eval_statistical.PRDCScore(nearest_k=5, **kwargs)[source]#

Bases: StatisticalEvaluator

Computes precision, recall, density, and coverage given two manifolds.

Parameters:

nearest_k (int) – the number of nearest neighbors used to estimate the manifolds. Default: 5.

Returns: Dictionary with precision, recall, density, and coverage scores

Direction: maximize (all metrics range from 0 to 1)

static name()[source]#
static direction()[source]#
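
A compact sketch of the precision and recall components from Naeem et al. (2020), on which PRDC is based; density and coverage are omitted, and this is an illustration rather than TabEval’s implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_radii(X, k):
    # Distance from each point to its k-th nearest neighbor within X.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 skips the point itself
    dists, _ = nn.kneighbors(X)
    return dists[:, -1]

def precision_recall(real, synth, k=5):
    d = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=-1)
    # Precision: synthetic points inside some real point's k-NN ball.
    precision = (d < knn_radii(real, k)[:, None]).any(axis=0).mean()
    # Recall: real points inside some synthetic point's k-NN ball.
    recall = (d < knn_radii(synth, k)[None, :]).any(axis=1).mean()
    return precision, recall

real = np.random.normal(0, 1, (300, 4))
synth = np.random.normal(0, 1, (300, 4))
print(precision_recall(real, synth, k=5))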

Alpha Precision#

class tabeval.metrics.eval_statistical.AlphaPrecision(**kwargs)[source]#

Bases: StatisticalEvaluator

Evaluates the alpha-precision, beta-recall, and authenticity scores.

The class evaluates the synthetic data using a tuple of three metrics: alpha-precision, beta-recall, and authenticity. Note that these metrics can be evaluated for each synthetic data point (which is useful for auditing and post-processing). Here the scores are averaged to reflect the overall quality of the data. The formal definitions can be found in the reference below:

Alaa, Ahmed, Boris Van Breugel, Evgeny S. Saveliev, and Mihaela van der Schaar. “How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models.” In International Conference on Machine Learning, pp. 290-306. PMLR, 2022.

Returns: Dictionary with delta_precision_alpha, delta_coverage_beta, and authenticity scores

Direction: maximize (all metrics range from 0 to 1)

static name()[source]#
static direction()[source]#
metrics(X, X_syn, emb_center=None)[source]#

Survival KM Distance#

class tabeval.metrics.eval_statistical.SurvivalKMDistance(**kwargs)[source]#

Bases: StatisticalEvaluator

The distance between the Kaplan-Meier curves of the real and the synthetic data. Used for survival analysis.

Task Type: survival_analysis only

Direction: minimize

static name()[source]#
static direction()[source]#
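
A hedged sketch of one way to measure the gap between two Kaplan-Meier curves, using the lifelines library; the time grid and the mean-absolute-gap summary are assumptions for illustration, not TabEval’s exact procedure:

import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
T_real, E_real = rng.exponential(10, size=500), rng.integers(0, 2, size=500)
T_syn, E_syn = rng.exponential(11, size=500), rng.integers(0, 2, size=500)

kmf_real = KaplanMeierFitter().fit(T_real, event_observed=E_real)
kmf_syn = KaplanMeierFitter().fit(T_syn, event_observed=E_syn)

# Evaluate both survival curves on a shared time grid and compare.
grid = np.linspace(0, min(T_real.max(), T_syn.max()), 100)
s_real = kmf_real.survival_function_at_times(grid).values
s_syn = kmf_syn.survival_function_at_times(grid).values
print(np.abs(s_real - s_syn).mean())  # smaller = closer KM curves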

Frechet Inception Distance#

class tabeval.metrics.eval_statistical.FrechetInceptionDistance(**kwargs)[source]#

Bases: StatisticalEvaluator

Calculates the Frechet Inception Distance (FID) to evaluate GANs.

Paper: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.

The FID metric calculates the distance between two distributions of images. Typically, we have summary statistics (mean and covariance matrix) for one of these distributions, while the second distribution is given by a GAN.

Adapted by Boris van Breugel (bv292@cam.ac.uk).

Data Type: images only

Direction: minimize

static name()[source]#
static direction()[source]#
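
A sketch of the Frechet distance between Gaussians fit to two samples; the formula FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½) is standard, but the feature-extraction step TabEval applies beforehand is not shown:

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(X, Y):
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu_x - mu_y
    return diff @ diff + np.trace(cov_x + cov_y - 2 * covmean)

X = np.random.normal(0, 1, (1000, 8))
Y = np.random.normal(0, 1, (1000, 8))
print(frechet_distance(X, Y))  # near 0 when the distributions match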

Privacy Metrics#

Privacy metrics assess the privacy protection offered by synthetic data generation.

DCR (Baseline Protection)#

Structure Metrics#

Structure metrics evaluate the utility and relationships preserved in synthetic data.

Utility Per Feature#

Density Metrics#

Density metrics assess the quality of synthetic data through density-based comparisons.

Low Order Metrics#

High Order Metrics#

Visualization#

The metrics module also provides visualization utilities for comparing real and synthetic data.

Available Functions:

tabeval.metrics.plots.plot_marginal_comparison(plt, X_gt, X_syn, normalize=True)[source]#
tabeval.metrics.plots.plot_tsne(plt, X_gt, X_syn)[source]#
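
A hedged usage sketch, assuming both helpers accept the pyplot module followed by two pandas DataFrames, as their signatures above suggest:

import matplotlib.pyplot as plt
import pandas as pd
from tabeval.metrics.plots import plot_marginal_comparison, plot_tsne

real_data = pd.read_csv("real_data.csv")
synthetic_data = pd.read_csv("synthetic_data.csv")

# Per-feature marginal comparison, real vs. synthetic.
plot_marginal_comparison(plt, real_data, synthetic_data, normalize=True)
plt.show()

# Joint t-SNE embedding of both datasets.
plot_tsne(plt, real_data, synthetic_data)
plt.show()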

Utility Functions#

Internal utility functions for metric computation.

tabeval.metrics._utils.get_frequency(X_gt, X_synth, n_histogram_bins=10)[source]#

Get percentage frequencies for each possible real categorical value.

Returns:

The observed and expected frequencies (as a percent).

Return type:

dict
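
A hedged usage sketch; the dict-per-column return shape is inferred from the description above and may differ in detail:

import pandas as pd
from tabeval.metrics._utils import get_frequency

X_gt = pd.DataFrame({"color": ["red", "red", "blue", "green"]})
X_synth = pd.DataFrame({"color": ["red", "blue", "blue", "green"]})

# Expected: a dict mapping each column to (observed, expected) frequencies.
freqs = get_frequency(X_gt, X_synth, n_histogram_bins=10)
print(freqs["color"])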

tabeval.metrics._utils.get_features(X, sensitive_features=[])[source]#

Returns the non-sensitive features from dataset X.

tabeval.metrics._utils.get_y_pred_proba_hlpr(y_pred_proba, nclasses)[source]#
tabeval.metrics._utils.evaluate_auc(y_test, y_pred_proba, classes=None)[source]#
class tabeval.metrics._utils.gaussian(X)[source]#

Bases: object

pdf(Z)[source]#
class tabeval.metrics._utils.normal_func(X)[source]#

Bases: object

pdf(Z)[source]#
class tabeval.metrics._utils.normal_func_feat(X, continuous)[source]#

Bases: object

pdf(Z)[source]#
class tabeval.metrics._utils.GeneratorInterface[source]#

Bases: object

abstract fit(data)[source]#
abstract generate(count)[source]#
tabeval.metrics._utils.compute_wd(X_syn, X)[source]#
tabeval.metrics._utils.load_dataset(data_train=None, data_valid=None, data_test=None, device=device(type='cpu'), batch_dim=50)[source]#
tabeval.metrics._utils.create_model(n_dims, n_flows=5, n_layers=3, hidden_dim=32, residual='gated', verbose=False, device=device(type='cpu'), batch_dim=50)[source]#
tabeval.metrics._utils.save_model(model, optimizer, epoch, save=False, workspace=PosixPath('logs/tabeval_workspace'))[source]#
tabeval.metrics._utils.load_model(model, optimizer, workspace=PosixPath('logs/tabeval_workspace'))[source]#
tabeval.metrics._utils.compute_log_p_x(model, x_mb)[source]#
tabeval.metrics._utils.train(model, optimizer, scheduler, data_loader_train, data_loader_valid, data_loader_test, workspace=PosixPath('logs/tabeval_workspace'), start_epoch=0, device=device(type='cpu'), epochs=50, save=False, clip_norm=0.1)[source]#
tabeval.metrics._utils.density_estimator_trainer(data_train, data_val=None, data_test=None, batch_dim=50, flows=5, layers=3, hidden_dim=32, residual='gated', workspace=PosixPath('logs/tabeval_workspace'), decay=0.5, patience=20, cooldown=10, min_lr=0.0005, early_stopping=100, device=device(type='cpu'), epochs=50, learning_rate=0.01, clip_norm=0.1, polyak=0.998, save=True, load=True)[source]#
tabeval.metrics._utils.compute_metrics_baseline(y_scores, y_true, sample_weight=None)[source]#

Metric Categories and Default Configuration#

The metrics are organized into the following categories with their default configurations:

{
    'stats': [
        'jensenshannon_dist', 'chi_squared_test', 'inv_kl_divergence',
        'ks_test', 'max_mean_discrepancy', 'wasserstein_dist',
        'prdc', 'alpha_precision', 'survival_km_distance'
    ],
    'privacy': ['dcr'],
    'structure': ['utility_per_feature'],
    'density': ['low_order', 'high_order']
}

Usage Examples#

Basic Usage#

from tabeval.metrics import Metrics
import pandas as pd

# Load your datasets
real_data = pd.read_csv("real_data.csv")
synthetic_data = pd.read_csv("synthetic_data.csv")

# Run all default metrics
results = Metrics.evaluate(real_data, synthetic_data)
print(results)

Custom Metric Selection#

from pathlib import Path

# Select specific metrics to run
custom_metrics = {
    'stats': ['jensenshannon_dist', 'wasserstein_dist'],
    'privacy': ['dcr']
}

results = Metrics.evaluate(
    real_data,
    synthetic_data,
    metrics=custom_metrics,
    task_type='classification',
    n_folds=3,
    workspace=Path('my_workspace')
)

Survival Analysis Example#

# For survival analysis data
results = Metrics.evaluate(
    X_gt=survival_real_data,
    X_syn=survival_synthetic_data,
    task_type='survival_analysis',
    metrics={'stats': ['survival_km_distance']}
)

Advanced Configuration#

from pathlib import Path

# Advanced configuration with all parameters
results = Metrics.evaluate(
    X_gt=real_data,
    X_syn=synthetic_data,
    X_train=training_data,  # Optional: training data for some metrics
    reduction='median',     # Aggregation method: 'mean', 'median', 'min', 'max'
    n_histogram_bins=20,    # Number of bins for histogram-based metrics
    task_type='regression', # Task type affects model evaluation
    random_state=42,        # For reproducibility
    workspace=Path('cache'), # Caching directory
    use_cache=True,         # Enable result caching
    n_folds=5              # Cross-validation folds
)

Notes#

  • All metrics support caching to avoid recomputation of expensive operations

  • Most metrics work with any tabular data, but some are specialized (e.g., survival analysis, images)

  • The evaluation framework automatically handles data encoding and preprocessing

  • Results are returned as pandas DataFrames for easy analysis and visualization