Tabular generative adversarial networks (TGANs) have recently emerged to meet the need for synthesizing tabular data, the most widely used data format. While synthetic tabular data offers the advantage of complying with privacy regulations, there is still a risk of privacy leakage via inference attacks, because the generator interpolates the properties of real data during training. Differentially private (DP) training algorithms provide theoretical privacy guarantees for machine learning models by injecting statistical noise during training. However, applying DP to TGANs raises two challenges: choosing the best framework (PATE or DP-SGD) and choosing the neural network (generator or discriminator) into which noise is injected, so that data utility is well maintained under a given privacy guarantee.
In this blog post, we highlight DTGAN, a novel conditional Wasserstein tabular GAN that comes in two variants, DTGAN_G and DTGAN_D, which train the generator and the discriminator with DP-SGD, respectively, and so allow a detailed comparison of the two choices. Moreover, we present the privacy analysis associated with training the generator with the complex loss functions (i.e., classification and information losses) needed for high-quality tabular data synthesis. Additionally, we empirically evaluate the theoretical privacy guarantees offered by DP against membership and attribute inference attacks.
DTGAN is a novel approach to generating tabular datasets with strong DP guarantees. It uses the DP-SGD framework to preserve privacy and the subsampled RDP moments accountant to track the privacy cost. In addition, it uses the Wasserstein loss with gradient penalty to bound the gradient norms effectively. This yields an analytically derived optimal clipping value, so that more gradient information survives the clipping step of DP-SGD, as shown in the GS-WGAN work.
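To make the clipping and noising step concrete, here is a minimal pure-Python sketch of one DP-SGD gradient computation. It is not DTGAN's implementation: the function names are hypothetical, the gradients are plain lists rather than network parameters, and `clip_norm=1.0` is only an illustrative value, not necessarily the clipping value DTGAN derives.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, clip_norm, noise_scale, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average.

    noise_scale plays the role of sigma; the noise standard deviation
    is noise_scale * clip_norm.
    """
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_scale * clip_norm) for s in summed]
    return [v / batch_size for v in noisy]


# Toy batch of three per-example gradients (illustrative values only).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_scale=1.0, rng=rng))
```

Gradients whose norm already sits below the clipping value pass through unchanged, which is why a well-chosen clipping value preserves more gradient information.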
Each discriminator update satisfies (𝜆,2𝐵𝜆/𝜎^2)-RDP, where 𝐵 is the batch size, 𝜆 is the order of the Rényi divergence, and 𝜎 is the noise scale.
The discriminator directly interacts with real data.
The discriminator gradient norms are directly bounded due to the gradient penalty.
Subsampling is highly efficient: the subsampling rate is 𝛾=𝐵/𝑁, where 𝐵 is the batch size and 𝑁 is the size of the training dataset.
The use of the Wasserstein loss requires multiple updates to the discriminator for performing a single update to the generator.
Training multiple discriminators to perform distributed GAN training increases the privacy cost significantly.
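The per-update bound above can be turned into an overall budget with two standard RDP facts: costs add up across updates at a fixed order, and (𝜆, ε)-RDP implies (ε + log(1/δ)/(𝜆−1), δ)-DP. The sketch below applies them under a loud simplification: it ignores privacy amplification by subsampling, so it gives a loose upper bound rather than the budgets a subsampled RDP accountant would report. All function names and input values are hypothetical.

```python
import math


def rdp_per_discriminator_update(order, batch_size, noise_scale):
    """RDP epsilon of one discriminator update: 2 * B * order / sigma^2."""
    return 2.0 * batch_size * order / noise_scale ** 2


def compose_rdp(per_step_eps, num_steps):
    """RDP composes additively across steps at a fixed order."""
    return per_step_eps * num_steps


def rdp_to_dp(order, rdp_eps, delta):
    """Convert (order, rdp_eps)-RDP into the epsilon of an (epsilon, delta)-DP guarantee."""
    return rdp_eps + math.log(1.0 / delta) / (order - 1.0)


# Illustrative values only; they are not taken from the DTGAN experiments.
per_step = rdp_per_discriminator_update(order=2, batch_size=32, noise_scale=8.0)
total = compose_rdp(per_step, num_steps=10)
print(rdp_to_dp(order=2, rdp_eps=total, delta=1e-5))
```

In practice an accountant evaluates many orders 𝜆 and reports the smallest resulting epsilon; this sketch fixes a single order for readability.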
Each generator update satisfies (𝜆,6𝐵𝜆/𝜎^2)-RDP, where 𝐵 is the batch size, 𝜆 is the order of the Rényi divergence, and 𝜎 is the noise scale.
A DP-trained generator can be safely released to the public after training.
Distributed discriminators do not increase privacy costs.
Subsampling requires multiple discriminators, which adds complexity; the subsampling rate is 1/N_d, where N_d is the number of discriminators.
Added loss functions on the generator increase privacy cost.
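The two variants can be put side by side with a little arithmetic. The sketch below computes the subsampling rate of each variant and the unamplified RDP cost incurred per generator iteration, using the two bounds quoted above. The value `n_critic=5` (discriminator updates per generator update) is an assumption, as are the function names and the example inputs. Amplification by subsampling is not modelled, so the numbers show only the raw per-iteration cost, not which variant ends up cheaper overall.

```python
def discriminator_sampling_rate(batch_size, dataset_size):
    """Subsampling rate when the discriminator is trained with DP: B / N."""
    return batch_size / dataset_size


def generator_sampling_rate(num_discriminators):
    """Subsampling rate when the generator is trained with DP: 1 / N_d."""
    return 1.0 / num_discriminators


def rdp_cost_per_generator_step(variant, order, batch_size, noise_scale, n_critic=5):
    """Unamplified RDP cost accumulated during one generator iteration."""
    per_unit = batch_size * order / noise_scale ** 2
    if variant == "DTGAN_D":
        # n_critic discriminator updates, each costing 2 * B * order / sigma^2.
        return n_critic * 2.0 * per_unit
    if variant == "DTGAN_G":
        # One generator update costing 6 * B * order / sigma^2.
        return 6.0 * per_unit
    raise ValueError(f"unknown variant: {variant}")


# Illustrative values only.
print(discriminator_sampling_rate(batch_size=64, dataset_size=6400))
print(generator_sampling_rate(num_discriminators=4))
for variant in ("DTGAN_D", "DTGAN_G"):
    print(variant, rdp_cost_per_generator_step(variant, order=2, batch_size=32, noise_scale=8.0))
```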
A membership inference attack is a binary classification problem in which an attacker tries to predict whether a particular target data point was used to train a victim generative model. This post assumes that the attacker only needs black-box access to a tabular GAN model, a reference dataset, and the target data point for which the inference must be made.
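The attack game can be sketched with a toy example. Everything here is an assumption made for illustration: `toy_generator` is a hypothetical stand-in for a black-box tabular GAN (it simply resamples training rows with small noise, so it overfits heavily), the attack feature is the distance from the target to its nearest synthetic row, and the classifier is a single threshold. The feature extraction methods evaluated in this post are not reproduced.

```python
import random


def toy_generator(train_rows, n_samples, rng, jitter=0.05):
    """Hypothetical stand-in for a tabular GAN: resample training rows with small noise."""
    return [
        [x + rng.gauss(0.0, jitter) for x in rng.choice(train_rows)]
        for _ in range(n_samples)
    ]


def min_distance(target, synthetic_rows):
    """Attack feature: squared distance from the target to its nearest synthetic row."""
    return min(sum((a - b) ** 2 for a, b in zip(target, row)) for row in synthetic_rows)


def shadow_features(reference, target, include_target, n_shadows, rng):
    """Train shadow models on reference subsets, with or without the target."""
    feats = []
    for _ in range(n_shadows):
        train = rng.sample(reference, len(reference) // 2)
        if include_target:
            train = train + [target]
        feats.append(min_distance(target, toy_generator(train, 200, rng)))
    return feats


def fit_threshold(in_feats, out_feats):
    """Midpoint between class means; predict 'member' when the feature falls below it."""
    return (sum(in_feats) / len(in_feats) + sum(out_feats) / len(out_feats)) / 2.0


rng = random.Random(0)
reference = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
target = [4.0, 4.0]  # an outlier, which makes membership easier to detect

threshold = fit_threshold(
    shadow_features(reference, target, True, 20, rng),
    shadow_features(reference, target, False, 20, rng),
)

# Evaluate against fresh "victim" models that the attacker did not train.
trials, correct = 40, 0
for i in range(trials):
    is_member = i % 2 == 0
    feat = shadow_features(reference, target, is_member, 1, rng)[0]
    correct += (feat < threshold) == is_member
accuracy = correct / trials
print("attack accuracy:", accuracy)
```

Against a non-private, overfitting generator like this one, the attacker should do much better than the 0.5 prior. A DP-trained model aims to push that accuracy back toward 0.5.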
An attribute inference attack is defined as a regression problem in which the attacker attempts to predict the values of a sensitive target column, given black-box access to a generative model.
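As a minimal sketch of this regression setup, the attacker below fits a one-feature least-squares model on synthetic rows and uses it to predict the sensitive value of a target record whose other attribute is known. The column meanings, the toy data, and the function names are all hypothetical, and a real attack would use many known columns rather than one.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def infer_sensitive_value(synthetic_rows, known_value):
    """Fit on synthetic (known, sensitive) pairs, then predict for the target."""
    xs = [row[0] for row in synthetic_rows]
    ys = [row[1] for row in synthetic_rows]
    slope, intercept = fit_linear(xs, ys)
    return slope * known_value + intercept


# Hypothetical synthetic release: column 0 is known to the attacker (e.g. age),
# column 1 is the sensitive column (e.g. income, in thousands).
synthetic = [[25, 30.0], [35, 40.0], [45, 50.0], [55, 60.0]]
guess = infer_sensitive_value(synthetic, known_value=40)
print(guess)  # prints 45.0
```

The closer the synthetic data tracks the real relationships between columns, the more accurate such a prediction becomes, which is why utility and resilience to this attack pull in opposite directions.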
Only the DTGAN_D model consistently improves across all metrics as the privacy budget is loosened. It also achieves the best F1-score and APR across all baselines and privacy budgets. This suggests that training the discriminator with DP guarantees (DTGAN_D) is preferable to training the generator with DP guarantees (DTGAN_G).
Among all DP models, DTGAN_D is the only one that consistently improves across all three metrics when the privacy budget is increased. Similarly, DTGAN_G improves on both Avg-JSD and Avg-WD. The same is not true for PATE-GAN and DP-WGAN, of which DP-WGAN performs better across all metrics. Moreover, both perform worse than the two variants of DTGAN at both levels of epsilon. This highlights their inability to capture the statistical distributions during training despite a looser privacy budget, which we attribute to the lack of an effective training framework.
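For readers unfamiliar with these similarity metrics, the sketch below assumes that Avg-JSD is the Jensen-Shannon divergence averaged over categorical columns and that Avg-WD is the Wasserstein distance averaged over continuous columns, each computed between a real column and its synthetic counterpart. The exact preprocessing used in the experiments (for example, any normalisation) is not reproduced, and the toy columns are made up.

```python
import math
from collections import Counter


def jensen_shannon_divergence(real_col, synth_col):
    """JSD (base 2, in [0, 1]) between the category frequencies of two columns."""
    categories = sorted(set(real_col) | set(synth_col))
    real_counts, synth_counts = Counter(real_col), Counter(synth_col)
    p = [real_counts[c] / len(real_col) for c in categories]
    q = [synth_counts[c] / len(synth_col) for c in categories]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_distance_1d(real_col, synth_col):
    """1-D Wasserstein-1 distance for equally sized samples: mean gap between sorted values."""
    if len(real_col) != len(synth_col):
        raise ValueError("this sketch assumes equally sized samples")
    pairs = zip(sorted(real_col), sorted(synth_col))
    return sum(abs(a - b) for a, b in pairs) / len(real_col)


def average(values):
    return sum(values) / len(values)


# Toy columns (hypothetical): two categorical and one continuous.
real_cat = [["a", "a", "b", "b"], ["x", "y", "y", "y"]]
synth_cat = [["a", "b", "b", "b"], ["x", "x", "y", "y"]]
real_num = [[0.0, 1.0, 2.0, 3.0]]
synth_num = [[0.5, 1.5, 2.5, 3.5]]

avg_jsd = average([jensen_shannon_divergence(r, s) for r, s in zip(real_cat, synth_cat)])
avg_wd = average([wasserstein_distance_1d(r, s) for r, s in zip(real_num, synth_num)])
print(avg_jsd, avg_wd)
```

For both metrics, lower values mean the synthetic column distributions sit closer to the real ones.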
Resilience to inference attacks
With respect to membership inference attacks, all DP baselines provide an empirical privacy gain close to 0.25 for both feature extraction methods. This indicates that differentially private methods provide strong protection against membership attacks: the average probability of success for any attack stays close to the attacker’s original prior, i.e., 0.5.
In terms of attribute inference attacks, PATE-GAN provides the greatest resilience, followed by DP-WGAN, DTGAN_D, and DTGAN_G. The DTGAN variants are less resilient because their synthetic data is of higher quality, which raises the attacker’s probability of successfully inferring sensitive information. Even so, both variants of DTGAN remain significantly more resilient than TGAN, which provides the worst (effectively no) resilience. These results highlight the inherent trade-off between privacy and data utility: increasing utility directly worsens privacy, and vice versa.
Motivated by the risk of privacy leakage through synthetic tabular data, we propose a novel DP conditional Wasserstein tabular GAN, DTGAN. We rigorously analyse DTGAN through its two variants, DTGAN_D and DTGAN_G, using the theoretical Rényi DP framework, and highlight the privacy cost of the additional losses used by the generator to enhance data quality. Moreover, we empirically compare the data utility achieved by applying DP-SGD to the discriminator versus the generator. Additionally, we rigorously evaluate the privacy robustness against practical membership and attribute inference attacks.
Our results on three tabular datasets show that synthetic tabular data generated with DP-SGD achieves higher data utility than data generated with the PATE framework. Moreover, we find that DTGAN_D outperforms DTGAN_G, illustrating that training the discriminator with DP guarantees is preferable under stringent privacy budgets. Finally, in terms of data utility and resilience to privacy attacks, DTGAN_D improves upon prior work by 18% in average precision score across 4 ML models, and all DP baselines reduce the success rate of membership attacks by approximately 50%. This showcases the effectiveness of DP in protecting the privacy of sensitive datasets used to train tabular GANs. However, the quality of synthetic data at strict privacy budgets (i.e., epsilon < 1) still needs improvement. Ultimately, there is an inherent trade-off between privacy and utility, and finding the best balance between the two requires future work.
Thank you for reading. Please feel free to access our full research paper here.