While data sharing is crucial for knowledge development, privacy concerns and strict regulation (e.g., European General Data Protection Regulation (GDPR)) unfortunately limit its full effectiveness. Synthetic tabular data emerges as an alternative to enable data sharing while fulfilling regulatory and privacy constraints. The state-of-the-art tabular data synthesizers draw methodologies from Generative Adversarial Networks (GAN) and address two main data types in the industry, i.e., continuous and categorical.
In this blog post, we shed light on CTAB-GAN, a novel conditional table GAN architecture that can effectively model diverse data types, including a mix of continuous and categorical variables. The model also addresses data imbalance and long-tail issues in real tabular datasets, i.e., cases where certain variables have drastic frequency differences across a wide range of values. This is made possible by combining the information loss and classification loss with the conditional GAN. In addition, the model features a novel conditional vector, which efficiently encodes the mixed data type and skewed distribution of data variables. CTAB-GAN was evaluated against the current state of the art in terms of data similarity and analysis utility. The results on five datasets show that the synthetic data of CTAB-GAN closely resembles the real data for all three types of variables (continuous, categorical, and mixed) and results in higher accuracy for five machine learning algorithms, by up to 17%.
Industrial datasets (held by stakeholders such as banks, insurance companies, and health care providers) present multi-fold challenges. First, such datasets are organized in tables and populated with continuous variables, categorical variables, or a mix of the two, e.g., the value of the mortgage for a loan holder, which can be either 0 (no mortgage) or some continuous positive number. We call such a variable a mixed variable. Second, continuous variables often have a wide range of values and can exhibit a heavy long-tailed distribution, e.g., the transaction amount for a credit card. Most transactions fall between 0 and 500 dollars (e.g., daily shopping for food and clothes), but occasional very high transaction amounts do exist. Third, continuous variables may also follow distributions with multiple modes of skewed frequencies. In Figure 2 below, we show how these issues manifest when using current state-of-the-art techniques.
In summary, the following challenges formed the main motivation for this research:
Tabular data comprises mixed variables that consist of both a continuous and a discrete component. Similarly, missing values embedded in continuous variables may also be regarded as a categorical component of a mixed variable.
Continuous variables exhibit heavy long-tailed distributions which are difficult to model and reproduce authentically.
Continuous variables contain multiple modes of skewed frequencies, which further complicates modelling.
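To make the first challenge concrete, here is a minimal sketch of how a mixed column such as the mortgage value could be separated into a categorical part and a continuous part. The function name `split_mixed`, the tag labels, and the toy values are assumptions made for illustration only; this is not the encoding implemented in CTAB-GAN, just the idea that zeros and missing values behave like categories while the remaining values are continuous.

```python
from typing import List, Optional, Tuple


def split_mixed(
    values: List[Optional[float]], special: float = 0.0
) -> Tuple[List[str], List[float]]:
    """Split a mixed column into a categorical tag per row and the continuous values.

    Hypothetical helper: `special` is the discrete value (here 0 = no mortgage),
    and None stands for a missing entry, which is also treated as a category.
    """
    tags: List[str] = []
    continuous: List[float] = []
    for v in values:
        if v is None:
            tags.append("missing")
        elif v == special:
            tags.append("special")
        else:
            tags.append("continuous")
            continuous.append(v)
    return tags, continuous


# Toy mortgage column (made-up numbers, not taken from the Loan dataset).
mortgage = [0.0, 120000.0, None, 0.0, 85500.5]
tags, cont = split_mixed(mortgage)
print(tags)  # ['special', 'continuous', 'missing', 'special', 'continuous']
print(cont)  # [120000.0, 85500.5]
```

A generator that only sees the column as continuous has to approximate the spike at 0 with a smooth density, which is why treating the special value as its own category helps.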
We devise a novel conditional tabular data synthesizer, CTAB-GAN, that addresses the limitations of the prior state-of-the-art: (i) encoding mixed data type of continuous and categorical variables, (ii) efficient modelling of long-tail continuous variables and (iii) increased robustness to imbalanced categorical variables along with skewed continuous variables. Furthermore, two key features of CTAB-GAN are the introduction of classification loss in conditional GAN, and novel encoding for the conditional vector that efficiently encodes mixed variables and helps to deal with highly skewed distributions for continuous variables.
Hence, the main contributions can be summarized as follows:
Novel conditional adversarial network which introduces a classifier providing additional supervision to improve its utility for ML applications.
Efficient modelling of continuous, categorical, and mixed variables via novel data encoding and conditional vector.
Light-weight data pre-processing to mitigate the impact of long-tail distribution of continuous variables using a simple log transform.
Providing an effective data synthesizer for the relevant stakeholders.
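The log-transform pre-processing mentioned in the contributions can be sketched as follows. This is an illustrative version that assumes non-negative values and uses `log1p`/`expm1` so that zeros are handled; the exact transform used by CTAB-GAN may differ, and the helper names and sample amounts are hypothetical.

```python
import math
from typing import List


def log_compress(values: List[float]) -> List[float]:
    """Compress a long-tailed, non-negative column with log(1 + x)."""
    return [math.log1p(v) for v in values]


def log_expand(values: List[float]) -> List[float]:
    """Invert log_compress, mapping generated values back to the original scale."""
    return [math.expm1(v) for v in values]


# Made-up transaction amounts: mostly small, with one large outlier.
amounts = [3.5, 42.0, 480.0, 25000.0]
compressed = log_compress(amounts)
restored = log_expand(compressed)

# The outlier is several thousand times larger than the smallest value in the
# raw column, but only a few times larger after the transform.
print(max(amounts) / min(amounts))
print(max(compressed) / min(compressed))
```

The generator is trained on the compressed column, and the inverse transform is applied to its output, so the rare large values no longer dominate the range the model has to cover.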
Let us now review CTAB-GAN’s performance on the three motivating cases introduced earlier in Sec. 2.
Mixed variables: Figure 3(a), shown above, compares the real and CTAB-GAN generated data for the variable “Mortgage” in the Loan dataset. CTAB-GAN encodes this variable as a mixed type. We can see that CTAB-GAN generates clear 0 values, unlike existing state-of-the-art techniques.
Long-tail continuous variables: Figure 3(b) compares the cumulative frequency graph for the “Amount” variable in the Credit dataset. This variable follows a typical long-tail distribution. One can see that CTAB-GAN recovers the real distribution very closely. Thanks to the log-transform pre-processing, CTAB-GAN learns this structure significantly better than the state-of-the-art methods.
Skewed multi-mode continuous variables: Figure 3(c) compares the frequency distribution for the continuous variable “Hours-per-week” from the Adult dataset. Besides the dominant peak at 40, there are many side peaks, which makes synthesizing this column extremely difficult. However, CTAB-GAN recovers the skewed multi-modal distribution better than existing methods, thanks to its novel construction of the conditional vector, which is designed to make the generation process more robust to such distributions.
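The cumulative-frequency comparison used for the long-tail case can also be expressed as a single number. The sketch below computes the largest vertical gap between the empirical cumulative distributions of a real and a synthetic column (the two-sample Kolmogorov-Smirnov statistic). This is offered as one plausible way to quantify such a plot, not as the similarity metric reported for CTAB-GAN; the function names and toy samples are assumptions.

```python
from bisect import bisect_right
from typing import Callable, List


def ecdf(sample: List[float]) -> Callable[[float], float]:
    """Return the empirical CDF of a sample as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x: float) -> float:
        # Fraction of observations that are <= x.
        return bisect_right(ordered, x) / n

    return cdf


def max_cdf_gap(real: List[float], synthetic: List[float]) -> float:
    """Largest absolute difference between the two empirical CDFs."""
    f_real, f_syn = ecdf(real), ecdf(synthetic)
    # Both CDFs are step functions, so the gap only changes at observed values.
    points = sorted(set(real) | set(synthetic))
    return max(abs(f_real(x) - f_syn(x)) for x in points)


# Toy samples (made up for illustration).
real = [1.0, 2.0, 3.0, 4.0]
close = [1.0, 2.0, 3.0, 5.0]
far = [10.0, 20.0, 30.0, 40.0]

print(max_cdf_gap(real, close))  # 0.25
print(max_cdf_gap(real, far))    # 1.0
```

A value near 0 means the two cumulative curves almost overlap, while a value near 1 means the synthetic column occupies a completely different range.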
Motivated by the importance of data sharing and compliance with governmental regulations, we proposed CTAB-GAN, a conditional GAN based tabular data generator. CTAB-GAN advances beyond the prior state-of-the-art methods by modelling mixed variables, and it provides strong generation capability for imbalanced categorical variables and for continuous variables with complex distributions. To this end, the core features of CTAB-GAN include (i) the introduction of a classifier into the conditional GAN, (ii) effective data encoding for mixed variables, and (iii) a novel construction of conditional vectors. We extensively evaluated CTAB-GAN against four tabular data generators on a wide range of metrics, namely resulting ML utility, statistical similarity, and privacy preservation. The results show that the synthetic data of CTAB-GAN achieves high utility, high similarity, and a reasonable privacy guarantee compared to existing state-of-the-art techniques. The improvement on complex datasets is up to 17% in accuracy compared to all state-of-the-art algorithms. These results demonstrate CTAB-GAN's potential for a wide range of applications that greatly benefit from data sharing, such as banking, insurance, manufacturing, and telecommunications.
We thank you for your attention and encourage you to dive deeper into our paper, available here.