By __Aditya Kunar__, __Robert Birke__, __Zilong Zhao__ & __Lydia Y. Chen__

## Introduction

**Tabular generative adversarial networks (TGANs)** have recently emerged to meet the need for synthesizing tabular data, the most widely used data format. While synthetic tabular data helps with complying with privacy regulations, there is still a *risk of privacy leakage via inference attacks*, because the model interpolates the properties of real data during training. **Differentially private (DP) training algorithms** provide theoretical guarantees for training machine learning models by *injecting statistical noise to prevent privacy leaks*. However, the **challenges of applying DP to TGANs** are **to determine the best framework (i.e., PATE or DP-SGD)** and the **neural network (i.e., generator or discriminator) in which to inject noise**, such that data utility is well maintained under a given privacy guarantee.

In this blog, we highlight **DTGAN**, *a novel conditional Wasserstein tabular GAN* that comes in two variants, **DTGAN_G** and **DTGAN_D**, to provide a detailed comparison of tabular GANs trained using **DP-SGD** for the *generator vs. the discriminator*, respectively. Moreover, we present the privacy analysis associated with training the generator with *complex loss functions (i.e., **classification and information losses**)* needed for high-quality tabular data synthesis. Additionally, we **empirically evaluate the theoretical privacy guarantees offered by DP against membership** and **attribute inference attacks**.

## DTGAN

**DTGAN** is a novel approach to generating tabular datasets with strong DP guarantees. It uses the **DP-SGD framework** to preserve privacy and the **subsampled RDP moments accountant technique** to account for the privacy cost. In addition, it uses the **Wasserstein loss with gradient penalty** to effectively *bound the gradient norms*, thereby providing an *analytically derived optimal clipping value* that better preserves gradient information after clipping in DP-SGD, as shown in the GS-WGAN work.
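To make the clipping and noise step concrete, here is a minimal sketch of one DP-SGD gradient computation in plain Python. It is not the DTGAN implementation: the function names, the toy gradients, and the values `clip_norm=1.0` and `noise_scale=1.0` are illustrative assumptions.

```python
import math
import random

def clip_gradient(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_example_grads, clip_norm, noise_scale, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, clip_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # Noise standard deviation is noise_scale * clip_norm (standard DP-SGD).
    noisy = [t + rng.gauss(0.0, noise_scale * clip_norm) for t in total]
    return [v / batch_size for v in noisy]

# Hypothetical toy batch with two per-example gradients.
rng = random.Random(0)
toy_grads = [[3.0, 4.0], [0.3, 0.4]]
print(dp_sgd_gradient(toy_grads, clip_norm=1.0, noise_scale=1.0, rng=rng))
```

The first toy gradient has norm 5 and is scaled down to norm 1, while the second already has norm 0.5 and passes through unchanged. When the gradient penalty keeps norms close to the clipping value, little gradient information is lost at this step.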

**DP-Discriminator**

Each discriminator update satisfies (λ, 2Bλ/σ^2)-RDP, where B is the batch size, λ is the order of the *Rényi divergence* and σ is the *noise scale*.

Pros:

- The **discriminator directly interacts with real data**.
- The **discriminator gradient norms are directly bounded** due to the gradient penalty.
- **Subsampling is highly efficient** as it is defined as γ = B/N, where B is the batch size and N is the size of the training dataset.

Cons:

- The use of the **Wasserstein loss requires multiple updates** to the discriminator for performing a single update to the generator.
- Training multiple discriminators to perform **distributed GAN training increases the privacy cost significantly**.
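The per-update bound and the sampling rate for the discriminator can be sketched in a few lines of Python. This is a simplified illustration: it ignores the amplification that the subsampled RDP accountant applies, so the accounted cost would be lower. The function names and the example values (batch size 64, dataset size 6400, order 2, noise scale 8, five discriminator updates per generator update) are assumptions.

```python
def discriminator_rdp_per_update(batch_size, order, noise_scale):
    """Per-update RDP bound quoted above: 2 * B * order / noise_scale**2."""
    return 2.0 * batch_size * order / noise_scale ** 2

def sampling_rate(batch_size, dataset_size):
    """Subsampling rate for the discriminator: batch size over dataset size."""
    return batch_size / dataset_size

def composed_rdp(per_update_rdp, num_updates):
    """At a fixed order, RDP costs add up across successive updates."""
    return per_update_rdp * num_updates

# Hypothetical settings, chosen only to keep the arithmetic easy to follow.
per_update = discriminator_rdp_per_update(batch_size=64, order=2, noise_scale=8.0)
print(per_update)                    # cost of one discriminator update
print(composed_rdp(per_update, 5))   # cost of 5 discriminator updates
print(sampling_rate(64, 6400))       # fraction of the data seen per update
```

Because RDP composes additively, every extra discriminator update required by the Wasserstein loss adds to the total privacy cost, which is the first drawback listed above.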

**DP-Generator**

Each generator update satisfies (λ, 6Bλ/σ^2)-RDP, where B is the batch size, λ is the order of the *Rényi divergence* and σ is the *noise scale*.

Pros:

- **DP-generator makes for a safe public release** after training.
- **Distributed discriminators do not increase privacy** costs.

Cons:

- **Subsampling introduces complexity via multiple discriminators** and is defined as 1/N_d, where N_d is the number of discriminators.
- **Added loss functions on the generator increase privacy** cost.
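To compare the two variants numerically, the sketch below evaluates both per-update bounds and converts an RDP value into an (epsilon, delta)-DP guarantee using the standard RDP conversion, epsilon = rdp + log(1/delta) / (order - 1). It leaves out subsampling amplification, and the function names and example values are assumptions rather than the settings used for DTGAN.

```python
import math

def rdp_per_update(batch_size, order, noise_scale, network):
    """Per-update RDP bound: factor 2 for the discriminator, 6 for the generator."""
    factor = {"discriminator": 2.0, "generator": 6.0}[network]
    return factor * batch_size * order / noise_scale ** 2

def rdp_to_epsilon(rdp, order, delta):
    """Standard conversion from (order, rdp)-RDP to (epsilon, delta)-DP."""
    return rdp + math.log(1.0 / delta) / (order - 1.0)

# Hypothetical settings: same batch size, order and noise scale for both networks.
d_cost = rdp_per_update(64, 2.0, 8.0, "discriminator")
g_cost = rdp_per_update(64, 2.0, 8.0, "generator")
print(g_cost / d_cost)                         # generator update is 3x as costly
print(rdp_to_epsilon(d_cost, 2.0, delta=1e-5))
print(rdp_to_epsilon(g_cost, 2.0, delta=1e-5))
```

At equal batch size and noise scale, a single generator update costs three times as much as a discriminator update, which reflects the extra loss terms on the generator mentioned above.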

## Inference Attacks

**Membership Inference Attack**

It is a *binary classification problem* in which an attacker tries to predict if a particular target data point has been used to train a victim generative model. This post assumes that the attacker only needs access to a black-box tabular GAN model, a reference dataset and a target data point for which the inference must be made.

**Attribute Inference Attack**

It is defined as a *regression problem* in which the attacker attempts to predict the values of a sensitive target column, provided they have black-box access to a generative model.
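As a rough illustration of the regression framing, the sketch below fits a one-variable least-squares line on rows sampled from a generator and uses it to guess a target's sensitive value from an attribute the attacker already knows. This is a deliberately simple stand-in, not the attack evaluated in this post. The function names and the synthetic rows are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def infer_sensitive(synthetic_rows, known_value):
    """Predict the sensitive column from a known attribute using synthetic rows.

    Each row is (known_attribute, sensitive_attribute), as sampled from the
    black-box generator.
    """
    xs = [row[0] for row in synthetic_rows]
    ys = [row[1] for row in synthetic_rows]
    slope, intercept = fit_line(xs, ys)
    return slope * known_value + intercept

# Hypothetical synthetic samples drawn from the generator.
synthetic_rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
print(infer_sensitive(synthetic_rows, known_value=4.0))
```

The closer the synthetic data follows the real relationship between columns, the more accurate this kind of guess becomes, which is why higher data quality can make attribute inference easier.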

## Results

**ML Utility**

**Only the DTGAN_D model consistently improves across all metrics with a looser privacy budget**. It also shows the best performance on both the F1-score and APR metrics across all baselines and privacy budgets. This suggests that **training the discriminator with DP guarantees, i.e., DTGAN_D, is more effective than training the generator with DP guarantees, i.e., DTGAN_G**.

**Statistical Similarity**

Among all DP models, **DTGAN_D is the only model which consistently improves across all three metrics when the privacy budget is increased**. Similarly, *DTGAN_G* improves on both *Avg-JSD* and *Avg-WD*. The same is not true for *PATE-GAN* and *DP-WGAN*, of which DP-WGAN performs better across all metrics. Moreover, both perform worse than the two variants of *DTGAN* at both levels of epsilon. This highlights their inability to capture the statistical distributions during training despite a looser privacy budget, which we attribute to the lack of an effective training framework.
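The exact metric definitions are not spelled out in this post. The sketch below assumes that Avg-JSD averages the Jensen-Shannon divergence over categorical columns and that Avg-WD averages the one-dimensional Wasserstein distance over continuous columns. The function names and toy columns are illustrative, and the Wasserstein helper assumes both samples have the same length.

```python
import math
from collections import Counter

def category_probs(values, categories):
    """Empirical probability of each category in a column."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]

def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(real_col, synth_col):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two columns."""
    categories = sorted(set(real_col) | set(synth_col))
    p = category_probs(real_col, categories)
    q = category_probs(synth_col, categories)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def wasserstein_1d(real_col, synth_col):
    """1-D Wasserstein distance for two equal-length samples."""
    pairs = zip(sorted(real_col), sorted(synth_col))
    return sum(abs(a - b) for a, b in pairs) / len(real_col)

def average(values):
    """Mean over per-column scores, e.g. to form Avg-JSD or Avg-WD."""
    return sum(values) / len(values)

# Hypothetical toy columns.
print(jsd(["a", "a", "b"], ["a", "b", "b"]))
print(wasserstein_1d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

For both quantities, lower values mean the synthetic column distribution is closer to the real one.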

**Resilience to inference attacks**

With respect to *membership inference attacks*, all DP baselines provide an empirical privacy gain close to 0.25 for both feature extraction methods. This indicates that differentially private methods provide strong privacy protection against membership attacks: the **average probability of success for any attack stays close to the attacker's original prior, i.e., 0.5**.

In terms of *attribute inference attacks*, *PATE-GAN* provides the greatest resilience, followed by *DP-WGAN*, *DTGAN_D* and *DTGAN_G*. The lower resilience of the *DTGAN* variants stems from the superior quality of their synthetic data, which increases the attacker's probability of successfully inferring sensitive information. Even though both variants of *DTGAN* are less resilient than the two DP baselines, they remain significantly more resilient than *TGAN*, which provides the worst (effectively no) resilience. These results highlight the **inherent trade-off between privacy and data utility, i.e., increasing the utility directly worsens privacy and vice versa**.

## Conclusion

Motivated by the risk of privacy leakage through synthetic tabular data, **we propose a novel DP conditional Wasserstein tabular GAN, DTGAN**. We rigorously analyse *DTGAN* using its two variants, namely *DTGAN_D* and *DTGAN_G*, via the *theoretical Rényi DP framework* and highlight the privacy cost of the additional losses used by the generator to enhance data quality. Moreover, we empirically showcase the data utility achieved by applying *DP-SGD* to train the *discriminator vs. the generator*, respectively. Additionally, **we rigorously evaluate the privacy robustness against practical membership and attribute inference attacks**.

**Our results on three tabular datasets show that synthetic tabular data generated with DP-SGD achieves higher data utility than data generated with the PATE framework**. Moreover, we find that **DTGAN_D outperforms DTGAN_G, illustrating that training the discriminator with DP guarantees is the better choice under stringent privacy budgets**. Finally, in terms of data utility and resilience to privacy attacks, **DTGAN_D improves upon prior work by 18% across 4 ML models in terms of the average precision score, and all DP baselines reduce the success rate of membership attacks by approx. 50%**. This showcases the effectiveness of DP for protecting the privacy of sensitive datasets used for training tabular GANs. **However, further enhancement of the quality of synthetic data at strict privacy budgets (i.e., epsilon < 1) is still needed. Ultimately, there is an inherent trade-off between privacy and utility, and finding the best balance between the two requires future work**.

*Thank you for reading. Please feel free to access our full research paper **here**.*