A Collection of Cross-Sectional and Time-Series Generators


firmai/tabular-data-generators

Time-Series

MTSS-GAN - Multivariate conditional time series simulation using stacked generative adversarial learning for multi-attribute generation.

Developed by Derek Snow @firmai

Time-GAN - Multivariate time series generation with an emphasis on autoregression errors to preserve temporal correlations.

Developed by Jinsung Yoon, Daniel Jarrett, Mihaela van der Schaar, forked code and colab hosted by @firmai

DoppelGANger - Modelling time series and mixed-type data

Cross-Sectional

Privacy

VAE-DP - Variational Autoencoder with Differential Privacy

Original model developed at MIT; differential privacy added by Derek Snow @firmai

PrivBN - PrivBayes: private data release via Bayesian networks

General

CTGAN - Conditional GAN for Tabular Data

CTGAN:

Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes, and discrete columns are sometimes imbalanced, which makes modeling difficult; existing statistical and deep neural network models fail to model this type of data properly. CTGAN uses a conditional generative adversarial network to address these challenges. It introduces mode-specific normalization, in which a variational Gaussian mixture model (VGM) normalizes each continuous column (roughly a 25% improvement), to overcome non-Gaussian and multimodal distributions (Section 4.2), and it pairs a conditional generator (roughly a 20% improvement on imbalanced data) with training-by-sampling (a conditional loss) to handle imbalanced discrete columns (Section 4.3). The paper includes an ablation study, trains with WGAN-GP, and finds that the identity benchmark (the original data) always performs best.
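The mode-specific normalization idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: CTGAN fits a variational Gaussian mixture with scikit-learn and samples each value's mode from the responsibilities, whereas this sketch fits a plain mixture with a tiny EM loop and picks the most responsible mode deterministically.

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=50):
    """Tiny EM fit of a one-dimensional Gaussian mixture (a stand-in for the
    variational Gaussian mixture CTGAN actually fits)."""
    mu = np.quantile(x, np.linspace(0.2, 0.8, k))   # spread-out initial means
    sigma = np.full(k, x.std() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each mode for each sample
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and spreads
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-6
        pi = n / len(x)
    return mu, sigma, pi

def mode_specific_normalize(x, mu, sigma, pi):
    """Encode each value as a within-mode scalar plus a one-hot mode indicator."""
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    mode = dens.argmax(axis=1)                      # most responsible mode
    alpha = (x - mu[mode]) / (4 * sigma[mode])      # value normalized inside its mode
    beta = np.eye(len(mu))[mode]                    # one-hot mode indicator
    return alpha, beta

# A bimodal column that a single global mean/std normalization would mangle.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])
mu, sigma, pi = fit_gmm_1d(x)
alpha, beta = mode_specific_normalize(x, mu, sigma, pi)
```

Each value is then represented by the pair (alpha, beta) instead of a single globally normalized scalar, so the generator sees which mode a value came from.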

TVAE (same paper as CTGAN) - Variational autoencoder (VAE) for mixed-type tabular data generation. Unlike GANs, VAEs use the data directly to build the generator.

TVAE:

A benefit here is that no conditioning is used for imbalanced data, yet the model still performs at a level similar to CTGAN above. On real datasets, TVAE and CTGAN outperform CLBN and PrivBN, whereas other GAN models cannot match the results of the Bayesian networks.

TVAE outperforms CTGAN in several cases, but GANs do have several favorable attributes, so this does not mean we should always use VAEs rather than GANs to model tables. The generator in a GAN never has access to the real data during training, which should make differential privacy [14] easier to achieve for CTGAN than for TVAE. (This might not be true; it has to be tested.)

CLBN - Approximating discrete probability distributions with dependence trees (a Chow-Liu tree Bayesian network)

CLBN and PrivBN

For simulated data from a Gaussian mixture, CLBN and PrivBN suffer because continuous numeric data has to be discretized before it can be modeled with a Bayesian network. On large-scale real datasets, learning a high-quality Bayesian network is difficult, so models trained on CLBN and PrivBN synthetic data are 36.1% and 51.8% worse, respectively, than models trained on real data.
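The discretization handicap is easy to see in isolation. This is a hedged sketch: equal-width binning is chosen here for brevity, while the actual CLBN and PrivBN implementations pick their own binning schemes.

```python
import numpy as np

# Bayesian-network learners such as CLBN and PrivBN model only discrete
# variables, so a continuous column must be binned first; this sketch shows
# the information lost in that step.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)            # a continuous column

edges = np.linspace(x.min(), x.max(), 16)   # 15 equal-width bins (a crude choice)
codes = np.digitize(x, edges[1:-1])         # the discrete codes a BN would model
centers = (edges[:-1] + edges[1:]) / 2
x_back = centers[codes]                     # best reconstruction after binning

quantization_error = np.abs(x - x_back).mean()
```

Whatever the network then learns, it can never recover detail finer than the bin width, which is exactly the handicap the Gaussian-mixture experiment exposes.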

TableGAN - Data synthesis based on generative adversarial networks

VEEGAN - VEEGAN: reducing mode collapse in GANs using implicit variational learning. Chris Russell, The Alan Turing Institute.

VEEGAN:

A GAN variant that avoids mode collapse. VEEGAN adds a reconstruction network that maps the data distribution to a Gaussian while also approximately reversing the action of the generator. Intuitively, if the reconstructor learns both to map all of the true data to the noise distribution and to approximately invert the generator, the generator is encouraged to map the noise distribution onto the entirety of the true data distribution, which resolves mode collapse. An earlier iteration did not use conditional GANs.

Other

Evaluation - A range of implementations and evaluation metrics developed as part of a master's thesis by Bauke Brenninkmeijer.

We always want to compare a method against benchmarks. First, use the original data exactly as it was used to train the synthesizers; second, sample each column from a uniform distribution; third, sample each column from a Gaussian mixture model if continuous and from a probability mass function if discrete.
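The second and third baselines can be sketched as follows. This is an illustrative numpy version: a single Gaussian per column stands in for the Gaussian mixture for brevity, and the function names are ours, not from the thesis.

```python
import numpy as np

def uniform_baseline(cont, n, rng):
    """Baseline 2: sample each continuous column uniformly over its range."""
    return np.column_stack([rng.uniform(c.min(), c.max(), n) for c in cont.T])

def independent_column_baseline(cont, disc, n, rng):
    """Baseline 3: per-column Gaussian for continuous columns (a one-component
    stand-in for the Gaussian-mixture fit) and an empirical PMF for discrete ones."""
    cont_out = np.column_stack([rng.normal(c.mean(), c.std(), n) for c in cont.T])
    disc_cols = []
    for c in disc.T:
        vals, counts = np.unique(c, return_counts=True)
        disc_cols.append(rng.choice(vals, size=n, p=counts / counts.sum()))
    return cont_out, np.column_stack(disc_cols)

rng = np.random.default_rng(0)
cont = rng.normal(size=(200, 3))                            # 3 continuous columns
disc = rng.choice(["a", "b", "c"], size=(200, 2), p=[0.7, 0.2, 0.1])
u = uniform_baseline(cont, 100, rng)
g, d = independent_column_baseline(cont, disc, 100, rng)
```

Any synthesizer worth using should beat the independent-column baseline, since that baseline destroys every inter-column dependency by construction.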
