Metrics#

The TabEval metrics module provides a comprehensive suite of evaluation metrics for synthetic tabular data. All metrics inherit from the base MetricEvaluator class and are organized by evaluation category.

Overview#

TabEval provides metrics across several evaluation dimensions:

  • Statistical Metrics: Distribution-based comparisons between real and synthetic data

  • Privacy Metrics: Privacy protection and differential privacy guarantees

  • Structure Metrics: Feature-level utility and relationships

  • Density Metrics: High-order and low-order data quality assessments

Main Interface#

Base Classes#

class tabeval.metrics.core.MetricEvaluator(reduction='mean', n_histogram_bins=10, n_folds=3, task_type='classification', random_state=0, workspace=PosixPath('logs/tabeval_workspace'), use_cache=True, default_metric=None)[source]#

Bases: object

Base class for all metrics.

Each derived class must implement the following methods:

evaluate() - compare two datasets and return a dictionary of metrics.
direction() - the direction of the metric (bigger is better or smaller is better).
type() - the type of the metric.
name() - the name of the metric.

If any method implementation is missing, the class constructor will fail.

Constructor Args:
reduction: str

The way to aggregate metrics across folds. Default: ‘mean’.

n_histogram_bins: int

The number of bins used in histogram calculation. Default: 10.

n_folds: int

The number of folds in cross validation. Default: 3.

task_type: str

The type of downstream task. Default: ‘classification’.

random_state: int

The random seed for reproducibility. Default: 0.

workspace: Path

The directory to save intermediate models or results. Default: Path(“logs/tabeval_workspace”).

use_cache: bool

Whether to use the cache. If True, it will try to load saved results from the workspace directory where possible. Default: True.

The base class also provides common functionality, including:

  • Caching mechanism for expensive computations

  • Standardized evaluation interface

  • Reduction operations (mean, max, min, median)

  • OneClass representation for advanced metrics

abstract evaluate(X_gt, X_syn)[source]#
abstract evaluate_default(X_gt, X_syn)[source]#
abstract static direction()[source]#
abstract static type()[source]#
abstract static name()[source]#
classmethod fqdn()[source]#
reduction()[source]#
use_cache(path)[source]#
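
As an illustration, the following minimal sketch implements the required interface for a custom metric. The class name, the “stats” type string, and the dictionary return shape are assumptions for this example; only the abstract methods listed above are prescribed by the base class.

from tabeval.metrics.core import MetricEvaluator

class MeanAbsoluteDifference(MetricEvaluator):
    # Hypothetical metric: mean absolute gap between per-column means.

    @staticmethod
    def name():
        return "mean_abs_diff"

    @staticmethod
    def type():
        return "stats"  # assumed category label

    @staticmethod
    def direction():
        return "minimize"

    def evaluate(self, X_gt, X_syn):
        # Assumes X_gt and X_syn are pandas DataFrames with shared columns.
        diff = (X_gt.mean() - X_syn.mean()).abs().mean()
        return {"score": float(diff)}

    def evaluate_default(self, X_gt, X_syn):
        return self.evaluate(X_gt, X_syn)["score"]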

Statistical Metrics#

Statistical metrics compare the distributional properties between real and synthetic data.

class tabeval.metrics.eval_statistical.StatisticalEvaluator(**kwargs)[source]#

Bases: MetricEvaluator

Base class for all statistical metrics.

static type()[source]#
evaluate(X_gt, X_syn)[source]#
evaluate_default(X_gt, X_syn)[source]#

Jensen-Shannon Distance#

class tabeval.metrics.eval_statistical.JensenShannonDistance(normalize=True, **kwargs)[source]#

Bases: StatisticalEvaluator

Evaluates the average Jensen-Shannon distance (a metric) between the probability distributions of the real and synthetic data.

Score Range: [0, 1]

Direction: minimize (0 = identical distributions, 1 = completely different)

static name()[source]#
static direction()[source]#
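
A minimal sketch of the underlying computation for a single feature, using SciPy; this is a plausible approach for illustration, and TabEval’s exact binning may differ:

import numpy as np
from scipy.spatial.distance import jensenshannon

real = np.random.normal(0, 1, 1000)
synth = np.random.normal(0.1, 1, 1000)

# Histogram both samples on shared bin edges so the supports match.
bins = np.histogram_bin_edges(np.concatenate([real, synth]), bins=10)
p, _ = np.histogram(real, bins=bins)
q, _ = np.histogram(synth, bins=bins)

# The Jensen-Shannon distance lies in [0, 1] with base-2 logarithms.
print(jensenshannon(p, q, base=2))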

Inverse KL Divergence#

class tabeval.metrics.eval_statistical.InverseKLDivergence(**kwargs)[source]#

Bases: StatisticalEvaluator

Returns the average inverse of the Kullback–Leibler divergence across features.

Score Range: [0, 1]

Direction: maximize (1 = same distribution, 0 = different distributions)

static name()[source]#
static direction()[source]#
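
One common way to map a KL divergence into the stated [0, 1] range is score = 1 / (1 + KL); whether TabEval uses exactly this transformation is an assumption here. A per-feature sketch:

import numpy as np
from scipy.special import rel_entr

# Binned frequencies for one feature (shared bin edges, both sum to 1).
p = np.array([0.2, 0.5, 0.3])     # real
q = np.array([0.25, 0.45, 0.30])  # synthetic

kl = rel_entr(p, q).sum()   # KL(p || q)
score = 1.0 / (1.0 + kl)    # assumed inverse mapping into (0, 1]
print(score)                # close to 1 for near-identical distributions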

Kolmogorov-Smirnov Test#

class tabeval.metrics.eval_statistical.KolmogorovSmirnovTest(**kwargs)[source]#

Bases: StatisticalEvaluator

Performs the Kolmogorov-Smirnov test for goodness of fit.

Score Range: [0, 1]

Direction: maximize (1 = identical distributions, 0 = totally different)

static name()[source]#
static direction()[source]#
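
A per-feature sketch using SciPy’s two-sample test; mapping the KS statistic to a “higher is better” score via 1 - statistic is an assumption for illustration:

import numpy as np
from scipy.stats import ks_2samp

real = np.random.normal(0, 1, 1000)
synth = np.random.normal(0, 1, 1000)

stat, p_value = ks_2samp(real, synth)  # stat lies in [0, 1]
print(1.0 - stat)  # close to 1 when the empirical CDFs coincide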

Chi-Squared Test#

class tabeval.metrics.eval_statistical.ChiSquaredTest(**kwargs)[source]#

Bases: StatisticalEvaluator

Performs the one-way chi-square test and returns the p-value. A small p-value indicates that the null hypothesis can be rejected, i.e., that the distributions differ.

Score Range: [0, 1]

Direction: maximize (1 = identical distributions, 0 = different distributions)

static name()[source]#
static direction()[source]#
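
A sketch of the underlying test with SciPy, comparing category counts for one feature; the exact frequency construction inside TabEval is not shown here:

import numpy as np
from scipy.stats import chisquare

observed = np.array([48, 30, 22])   # counts observed in the synthetic data
expected = np.array([50, 30, 20])   # counts expected from the real data
# scipy requires both frequency vectors to have the same total.
expected = expected * observed.sum() / expected.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(p_value)  # a large p-value gives no evidence the distributions differ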

Maximum Mean Discrepancy#

class tabeval.metrics.eval_statistical.MaximumMeanDiscrepancy(kernel='rbf', **kwargs)[source]#

Bases: StatisticalEvaluator

Empirical maximum mean discrepancy. The lower the result, the stronger the evidence that the two distributions are the same.

Parameters:

kernel (str) – “rbf”, “linear” or “polynomial”

Score Range: [0, ∞)

Direction: minimize (0 = same distributions, higher = more different)

static name()[source]#
static direction()[source]#
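
A self-contained sketch of the biased empirical MMD² with an RBF kernel; the gamma choice and estimator variant are assumptions for illustration:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_rbf(X, Y, gamma=1.0):
    # Biased empirical estimate: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    kxx = rbf_kernel(X, X, gamma=gamma).mean()
    kyy = rbf_kernel(Y, Y, gamma=gamma).mean()
    kxy = rbf_kernel(X, Y, gamma=gamma).mean()
    return kxx + kyy - 2 * kxy

X = np.random.normal(0, 1, (500, 4))
Y = np.random.normal(0, 1, (500, 4))
print(mmd_rbf(X, Y))  # near 0 when the distributions match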

Wasserstein Distance#

class tabeval.metrics.eval_statistical.WassersteinDistance(**kwargs)[source]#

Bases: StatisticalEvaluator

Compares the Wasserstein distance between the original data and the synthetic data.

Parameters:
  • X – original data

  • X_syn – synthetically generated data

Returns:

The Wasserstein distance (a non-negative float).

Score Range: [0, ∞)

Direction: minimize (0 = identical distributions)

static name()[source]#
static direction()[source]#
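
For a single numeric feature, the computation reduces to SciPy’s one-dimensional earth mover’s distance; how TabEval aggregates across features is not specified here:

import numpy as np
from scipy.stats import wasserstein_distance

real = np.random.normal(0.0, 1, 1000)
synth = np.random.normal(0.5, 1, 1000)

# 1-D Wasserstein distance between the two empirical distributions.
print(wasserstein_distance(real, synth))  # roughly 0.5, the mean shift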

PRDC Score#

class tabeval.metrics.eval_statistical.PRDCScore(nearest_k=5, **kwargs)[source]#

Bases: StatisticalEvaluator

Computes precision, recall, density, and coverage given two manifolds.

Parameters:

nearest_k (int) – the number of nearest neighbors used to estimate the manifolds. Default: 5.

Returns: Dictionary with precision, recall, density, and coverage scores

Direction: maximize (all metrics range from 0 to 1)

static name()[source]#
static direction()[source]#
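
A compact sketch of the precision and recall components from Naeem et al. (2020), on which PRDC is based; density and coverage are omitted, and this is an illustration rather than TabEval’s implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_radii(X, k):
    # Distance from each point to its k-th nearest neighbor within X.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 skips the point itself
    dists, _ = nn.kneighbors(X)
    return dists[:, -1]

def precision_recall(real, synth, k=5):
    d = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=-1)
    # Precision: synthetic points inside some real point's k-NN ball.
    precision = (d < knn_radii(real, k)[:, None]).any(axis=0).mean()
    # Recall: real points inside some synthetic point's k-NN ball.
    recall = (d < knn_radii(synth, k)[None, :]).any(axis=1).mean()
    return precision, recall

real = np.random.normal(0, 1, (300, 4))
synth = np.random.normal(0, 1, (300, 4))
print(precision_recall(real, synth, k=5))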

Alpha Precision#

class tabeval.metrics.eval_statistical.AlphaPrecision(**kwargs)[source]#

Bases: StatisticalEvaluator

Evaluates the alpha-precision, beta-recall, and authenticity scores.

The class evaluates the synthetic data using a tuple of three metrics: alpha-precision, beta-recall, and authenticity. Note that these metrics can be evaluated for each synthetic data point (which is useful for auditing and post-processing). Here the scores are averaged to reflect the overall quality of the data. The formal definitions can be found in the reference below:

Alaa, Ahmed, Boris Van Breugel, Evgeny S. Saveliev, and Mihaela van der Schaar. “How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models.” In International Conference on Machine Learning, pp. 290-306. PMLR, 2022.

Returns: Dictionary with delta_precision_alpha, delta_coverage_beta, and authenticity scores

Direction: maximize (all metrics range from 0 to 1)

static name()[source]#
static direction()[source]#
metrics(X, X_syn, emb_center=None)[source]#

Survival KM Distance#

class tabeval.metrics.eval_statistical.SurvivalKMDistance(**kwargs)[source]#

Bases: StatisticalEvaluator

The distance between the Kaplan-Meier curves of the real and the synthetic data. Used for survival analysis.

Task Type: survival_analysis only

Direction: minimize

static name()[source]#
static direction()[source]#
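
A hedged sketch of one way to measure the gap between two Kaplan-Meier curves, using the lifelines library; the time grid and the mean-absolute-gap summary are assumptions for illustration, not TabEval’s exact procedure:

import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
T_real, E_real = rng.exponential(10, size=500), rng.integers(0, 2, size=500)
T_syn, E_syn = rng.exponential(11, size=500), rng.integers(0, 2, size=500)

kmf_real = KaplanMeierFitter().fit(T_real, event_observed=E_real)
kmf_syn = KaplanMeierFitter().fit(T_syn, event_observed=E_syn)

# Evaluate both survival curves on a shared time grid and compare.
grid = np.linspace(0, min(T_real.max(), T_syn.max()), 100)
s_real = kmf_real.survival_function_at_times(grid).values
s_syn = kmf_syn.survival_function_at_times(grid).values
print(np.abs(s_real - s_syn).mean())  # smaller = closer KM curves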

Frechet Inception Distance#

class tabeval.metrics.eval_statistical.FrechetInceptionDistance(**kwargs)[source]#

Bases: StatisticalEvaluator

Calculates the Frechet Inception Distance (FID) to evaluate GANs.

Paper: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.

The FID metric calculates the distance between two distributions of images. Typically, we have summary statistics (mean and covariance matrix) for one of these distributions, while the second distribution is given by a GAN.

Adapted by Boris van Breugel (bv292@cam.ac.uk).

Data Type: images only

Direction: minimize

static name()[source]#
static direction()[source]#
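
A sketch of the Frechet distance between Gaussians fit to two samples; the formula FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½) is standard, but the feature-extraction step TabEval applies beforehand is not shown:

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(X, Y):
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu_x - mu_y
    return diff @ diff + np.trace(cov_x + cov_y - 2 * covmean)

X = np.random.normal(0, 1, (1000, 8))
Y = np.random.normal(0, 1, (1000, 8))
print(frechet_distance(X, Y))  # near 0 when the distributions match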

Privacy Metrics#

Privacy metrics assess the privacy protection offered by synthetic data generation.

DCR (Baseline Protection)#

Structure Metrics#

Structure metrics evaluate the utility and relationships preserved in synthetic data.

Utility Per Feature#

Density Metrics#

Density metrics assess the quality of synthetic data through density-based comparisons.

Low Order Metrics#

High Order Metrics#

Visualization#

The metrics module also provides visualization utilities for comparing real and synthetic data.

Available Functions:

tabeval.metrics.plots.plot_marginal_comparison(plt, X_gt, X_syn, normalize=True)[source]#
tabeval.metrics.plots.plot_tsne(plt, X_gt, X_syn)[source]#
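
A hedged usage sketch, assuming both helpers accept the pyplot module followed by two pandas DataFrames, as their signatures above suggest:

import matplotlib.pyplot as plt
import pandas as pd
from tabeval.metrics.plots import plot_marginal_comparison, plot_tsne

real_data = pd.read_csv("real_data.csv")
synthetic_data = pd.read_csv("synthetic_data.csv")

# Per-feature marginal comparison, real vs. synthetic.
plot_marginal_comparison(plt, real_data, synthetic_data, normalize=True)
plt.show()

# Joint t-SNE embedding of both datasets.
plot_tsne(plt, real_data, synthetic_data)
plt.show()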

Utility Functions#

Internal utility functions for metric computation.

tabeval.metrics._utils.get_frequency(X_gt, X_synth, n_histogram_bins=10)[source]#

Get percentage frequencies for each possible real categorical value.

Returns:

The observed and expected frequencies (as a percent).

Return type:

dict
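
A hedged usage sketch; the dict-per-column return shape is inferred from the description above and may differ in detail:

import pandas as pd
from tabeval.metrics._utils import get_frequency

X_gt = pd.DataFrame({"color": ["red", "red", "blue", "green"]})
X_synth = pd.DataFrame({"color": ["red", "blue", "blue", "green"]})

# Expected: a dict mapping each column to (observed, expected) frequencies.
freqs = get_frequency(X_gt, X_synth, n_histogram_bins=10)
print(freqs["color"])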

tabeval.metrics._utils.get_features(X, sensitive_features=[])[source]#

Returns the non-sensitive features from dataset X.

tabeval.metrics._utils.get_y_pred_proba_hlpr(y_pred_proba, nclasses)[source]#
tabeval.metrics._utils.evaluate_auc(y_test, y_pred_proba, classes=None)[source]#
class tabeval.metrics._utils.gaussian(X)[source]#

Bases: object

pdf(Z)[source]#
class tabeval.metrics._utils.normal_func(X)[source]#

Bases: object

pdf(Z)[source]#
class tabeval.metrics._utils.normal_func_feat(X, continuous)[source]#

Bases: object

pdf(Z)[source]#
class tabeval.metrics._utils.GeneratorInterface[source]#

Bases: object

abstract fit(data)[source]#
abstract generate(count)[source]#
tabeval.metrics._utils.compute_wd(X_syn, X)[source]#
tabeval.metrics._utils.load_dataset(data_train=None, data_valid=None, data_test=None, device=device(type='cpu'), batch_dim=50)[source]#
tabeval.metrics._utils.create_model(n_dims, n_flows=5, n_layers=3, hidden_dim=32, residual='gated', verbose=False, device=device(type='cpu'), batch_dim=50)[source]#
tabeval.metrics._utils.save_model(model, optimizer, epoch, save=False, workspace=PosixPath('logs/tabeval_workspace'))[source]#
tabeval.metrics._utils.load_model(model, optimizer, workspace=PosixPath('logs/tabeval_workspace'))[source]#
tabeval.metrics._utils.compute_log_p_x(model, x_mb)[source]#
tabeval.metrics._utils.train(model, optimizer, scheduler, data_loader_train, data_loader_valid, data_loader_test, workspace=PosixPath('logs/tabeval_workspace'), start_epoch=0, device=device(type='cpu'), epochs=50, save=False, clip_norm=0.1)[source]#
tabeval.metrics._utils.density_estimator_trainer(data_train, data_val=None, data_test=None, batch_dim=50, flows=5, layers=3, hidden_dim=32, residual='gated', workspace=PosixPath('logs/tabeval_workspace'), decay=0.5, patience=20, cooldown=10, min_lr=0.0005, early_stopping=100, device=device(type='cpu'), epochs=50, learning_rate=0.01, clip_norm=0.1, polyak=0.998, save=True, load=True)[source]#
tabeval.metrics._utils.compute_metrics_baseline(y_scores, y_true, sample_weight=None)[source]#

Metric Categories and Default Configuration#

The metrics are organized into the following categories with their default configurations:

{
    'stats': [
        'jensenshannon_dist', 'chi_squared_test', 'inv_kl_divergence',
        'ks_test', 'max_mean_discrepancy', 'wasserstein_dist',
        'prdc', 'alpha_precision', 'survival_km_distance'
    ],
    'privacy': ['dcr'],
    'structure': ['utility_per_feature'],
    'density': ['low_order', 'high_order']
}

Usage Examples#

Basic Usage#

from tabeval.metrics import Metrics
import pandas as pd

# Load your datasets
real_data = pd.read_csv("real_data.csv")
synthetic_data = pd.read_csv("synthetic_data.csv")

# Run all default metrics
results = Metrics.evaluate(real_data, synthetic_data)
print(results)

Custom Metric Selection#

from pathlib import Path

# Select specific metrics to run
custom_metrics = {
    'stats': ['jensenshannon_dist', 'wasserstein_dist'],
    'privacy': ['dcr']
}

results = Metrics.evaluate(
    real_data,
    synthetic_data,
    metrics=custom_metrics,
    task_type='classification',
    n_folds=3,
    workspace=Path('my_workspace')
)

Survival Analysis Example#

# For survival analysis data
results = Metrics.evaluate(
    X_gt=survival_real_data,
    X_syn=survival_synthetic_data,
    task_type='survival_analysis',
    metrics={'stats': ['survival_km_distance']}
)

Advanced Configuration#

from pathlib import Path

# Advanced configuration with all parameters
results = Metrics.evaluate(
    X_gt=real_data,
    X_syn=synthetic_data,
    X_train=training_data,  # Optional: training data for some metrics
    reduction='median',     # Aggregation method: 'mean', 'median', 'min', 'max'
    n_histogram_bins=20,    # Number of bins for histogram-based metrics
    task_type='regression', # Task type affects model evaluation
    random_state=42,        # For reproducibility
    workspace=Path('cache'), # Caching directory
    use_cache=True,         # Enable result caching
    n_folds=5              # Cross-validation folds
)

Notes#

  • All metrics support caching to avoid recomputation of expensive operations

  • Most metrics work with any tabular data, but some are specialized (e.g., survival analysis, images)

  • The evaluation framework automatically handles data encoding and preprocessing

  • Results are returned as pandas DataFrames for easy analysis and visualization