TabStruct - Tabular Structural Fidelity#

TabStruct logo

arxiv ci pypi downloads license

Important

Official code for the paper “TabStruct: Measuring Structural Fidelity of Tabular Data” (https://arxiv.org/abs/2509.11950), published in The Fourteenth International Conference on Learning Representations (ICLR 2026 Oral).

Authored by Xiangjian Jiang, Nikola Simidjievski, and Mateja Jamnik, University of Cambridge, UK.

Overview#

TabStruct is an end-to-end benchmark for tabular data generation, prediction, and evaluation. It ships with ready-to-use pipelines for:

  • generating high-quality synthetic tables

  • training predictive models

  • analysing results with a rich suite of metrics, especially those that quantify structural fidelity

All components are designed to plug-and-play, so you can mix, match, and extend them to suit your own workflow.

Key Features#

Data generation#

  • Out-of-the-box support for popular tabular generators: SMOTE, TVAE, CTGAN, NFlow, TabDDPM, ARF, and more.

Evaluation dimensions#

  • Density estimation - How well does the synthetic data approximate the real distribution?

  • Privacy preservation - Does the generator leak sensitive records?

  • ML efficacy - How do models trained on synthetic data perform compared to real data?

  • Structural fidelity - Does the generator respect the causal structures of real data?

Predictive tasks#

  • Classification and regression pipelines built on scikit-learn, with optional neural-network backbones.

Installation#

We recommend managing dependencies with conda + mamba.

# 1. Upgrade conda and activate the base env
conda update -n base -c conda-forge conda
conda activate base

# 2. Install the high-performance dependency resolver
conda install conda-libmamba-solver --yes
conda config --set solver libmamba
conda install -c conda-forge mamba --yes

# 3. Create a new conda env
conda create --name tabstruct python=3.10.18 --no-default-packages
conda activate tabstruct

# 4. Set up the env
bash scripts/utils/install.sh

Logging with W&B#

TabStruct logs every experiment to Weights & Biases (W&B). Use the default project or set your own credentials in src/tabstruct/common/__init__.py:

WANDB_ENTITY  = "tabular-data-generation"
WANDB_PROJECT = "TabStruct"

Quick sanity check#

Run a toy classification job (K-NN on the Adult dataset):

python -m src.tabstruct.experiment.run_experiment \
  --model knn \
  --save_model \
  --dataset adult \
  --test_size 0.2 \
  --valid_size 0.1 \
  --tags ENV-TEST

A successful run prints a series of green log lines like:

[YYYY-MM-DD] Codebase: >>>>>>>>>> Launching create_data_module() <<<<<<<<<<<
...

If you see those, your environment is ready.

Example Workflows#

1. Generate synthetic data#

python -m src.tabstruct.experiment.run_experiment \
    --pipeline "generation" \
    --generation_only \
    --model "smote" \
    --dataset "mfeat-fourier" \
    --test_size 0.2 \
    --valid_size 0.1 \
    --tags "dev"

Template script: docs/tutorial/example_scripts/generation/train.sh.

2. Evaluate synthetic data#

python -m src.tabstruct.experiment.run_experiment \
    --pipeline "generation" \
    --model "smote" \
    --eval_only \
    --dataset "mfeat-fourier" \
    --test_size 0.2 \
    --valid_size 0.1 \
    --generator_tags "dev" \
    --tags "dev"

Template script: docs/tutorial/example_scripts/generation/eval.sh.

3. Predict on tabular data#

python -m src.tabstruct.experiment.run_experiment \
    --model "mlp" \
    --save_model \
    --max_steps_tentative 1500 \
    --dataset "adult" \
    --test_size 0.2 \
    --valid_size 0.1 \
    --tags "dev"

Template script: docs/tutorial/example_scripts/prediction/train.sh.

Citation#

For attribution in academic contexts, please cite this work as:

@inproceedings{jiang2026tabstruct,
  title={TabStruct: Measuring Structural Fidelity of Tabular Data},
  author={Jiang, Xiangjian and Simidjievski, Nikola and Jamnik, Mateja},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

@inproceedings{jiang2025well,
  title={How Well Does Your Tabular Generator Learn the Structure of Tabular Data?},
  author={Jiang, Xiangjian and Simidjievski, Nikola and Jamnik, Mateja},
  booktitle={ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy}
}

Contents#