The rise of big and open data is driving the use of machine learning models in an ever-growing set of domains, including time series. Sharing data is needed to encourage growth and stimulate fast progress. However, the datasets most valuable to corporations and researchers frequently contain private data, so they cannot be distributed under privacy laws such as GDPR. In this blog, we propose combining Generative Adversarial Networks (GANs) with Differential Privacy to enable sharing of valuable time-series datasets without violating users’ privacy.
GANs and Differential Privacy
GANs have been shown to be effective for numerous data augmentation tasks, such as increasing image resolution and enriching datasets by generating new samples, as shown in the work of Tanaka et al. Models trained on the augmented data can then perform better. GANs achieve this by training two models jointly. The first is a generative model G, which aims to capture the distribution of the data. The second is a discriminative model D, which tries to differentiate between samples from the training data and samples generated by G. During the joint training, D gets better at telling real from fake and penalizes G when it creates unrealistic samples, so G creates more realistic samples over time. Unfortunately, GANs have been shown to be susceptible to attacks that aim to recover training samples, as shown by Chen et al., which makes standard GANs unsuitable for use cases where privacy needs to be guaranteed.
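The interplay between G and D can be made concrete with a minimal sketch of the two losses. It assumes the standard binary cross-entropy formulation with a non-saturating generator loss, which is a common choice but not necessarily the one used by the models discussed later in this blog. The function names and the example scores are hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); purely a numerical safeguard


def discriminator_loss(d_real, d_fake):
    """D is rewarded for scoring real samples near 1 and generated samples near 0."""
    return -np.mean(np.log(d_real + EPS)) - np.mean(np.log(1.0 - d_fake + EPS))


def generator_loss(d_fake):
    """G is penalized when D assigns a low score to its generated samples."""
    return -np.mean(np.log(d_fake + EPS))


# Hypothetical discriminator outputs (probabilities that a sample is real).
confident_d = discriminator_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2]))
confused_d = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
good_g = generator_loss(np.array([0.9]))  # D is fooled, so the loss is low
bad_g = generator_loss(np.array([0.1]))   # D spots the fake, so the loss is high
```

A discriminator that separates real from generated samples has a lower loss than one that cannot, while the generator's loss drops as its samples fool D more often.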
Differential Privacy (DP) is a methodology that preserves the privacy of the individuals in a dataset by injecting calibrated statistical noise. DP techniques differ in what type of noise they inject and when they inject it, e.g., directly into the data or during the learning process. The first approach adds noise to the answers returned by queries to a database. With this approach the dataset itself is not differentially private and therefore cannot be released in full. In contrast, adding noise during the learning process of a GAN yields a differentially private generator, so a dataset sampled from it can be released to the public with no additional privacy loss.
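As an illustration of the query-based approach, here is a minimal sketch of the Laplace mechanism applied to a counting query. The Laplace mechanism is one standard DP technique chosen here for illustration; the function name `private_count` and the example data are hypothetical.

```python
import numpy as np


def private_count(values, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one record changes a count by at most 1 (sensitivity 1),
    # so Laplace noise with scale sensitivity / epsilon gives epsilon-DP.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)


rng = np.random.default_rng(0)
ages = [23, 35, 41, 52, 67]  # hypothetical private records
noisy_answer = private_count(ages, lambda a: a > 40, epsilon=0.5, rng=rng)
near_exact = private_count(ages, lambda a: a > 40, epsilon=1e9, rng=rng)
```

A smaller ε means more noise and stronger privacy. Only the noisy answers are protected, which is why the underlying records still cannot be published.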
Differentially private GAN training can be achieved with the DP Stochastic Gradient Descent (DP-SGD) algorithm proposed by Abadi et al., which carefully controls the influence a single training sample can have during training. DP-SGD preserves privacy by clipping the gradients and adding Gaussian noise to them. The model currently showing the best performance when tasked with generating differentially private (MNIST) images is GS-WGAN, which uses the Wasserstein loss in combination with a gradient penalty so that as little information as possible is destroyed while the learning process is made differentially private.
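The clipping and noising step of DP-SGD can be sketched as follows. This is a simplified illustration of the gradient sanitization described by Abadi et al., not the GS-WGAN implementation. The function name and the parameters `clip_norm` and `noise_multiplier` are assumptions made for this example, and privacy accounting is left out.

```python
import numpy as np


def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Return a privatized average gradient.

    per_sample_grads has shape (batch_size, num_params).
    """
    grads = np.asarray(per_sample_grads, dtype=np.float64)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale each per-sample gradient so its L2 norm is at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Gaussian noise whose standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # L2 norms 5.0 and 0.5
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
private = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

With the noise switched off, the first gradient is scaled down to norm 1 and the second is left unchanged, which shows how clipping bounds the contribution of any single sample before the noise is added.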
However, sensitive datasets that would benefit from being distributed in a private manner are rarely MNIST-like images. We therefore explore the performance of GS-WGAN on another type of data: time series such as location traces, browsing history, or credit card transactions. These all contain highly personal information and are thus not suited for release to the public, which limits progress in industries that deal with these kinds of data.
Applying GS-WGAN to Time-series
Exploring the applicability of GS-WGAN to time-series data involved two steps: first, the model was altered to allow training on time series, and second, the performance of the model was compared against baselines.
Model changes and data preparation
GS-WGAN was originally only applicable to 28x28 pixel images, which severely limited where the model could be used. Therefore, every hard-coded inner layer dimension that restricted the input size was replaced with a dynamic dimension, so the model scales with the input image size and supports square images of arbitrary size.
Since GS-WGAN has proven to be effective on image datasets, we convert the problem of generating time series into the problem of generating images, which enables direct application of GS-WGAN to the data. A similar approach was shown to be effective in the work of E. Brophy et al. To achieve this, we wrap each row vector into the closest square matrix of dimensions d×d, where d is the ceiled square root of the row dimensionality, and pad the missing values with zeros, as shown in figure 1 below. The GS-WGAN model is then trained on the resulting images.
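The conversion described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes row-major wrapping and trailing zero padding; the function names are hypothetical, and the synthetic ramp stands in for one real sample of length 187.

```python
import math

import numpy as np


def series_to_image(series):
    """Pad a 1-D series with zeros and wrap it into the smallest d x d square."""
    series = np.asarray(series, dtype=np.float32)
    d = math.ceil(math.sqrt(series.size))  # ceiled square root of the length
    padded = np.zeros(d * d, dtype=np.float32)
    padded[: series.size] = series  # the trailing cells stay zero
    return padded.reshape(d, d)


def image_to_series(image, length):
    """Invert series_to_image by flattening and dropping the zero padding."""
    return np.asarray(image).reshape(-1)[:length]


heartbeat = np.linspace(0.0, 1.0, 187, dtype=np.float32)  # stand-in sample
image = series_to_image(heartbeat)
restored = image_to_series(image, heartbeat.size)
```

For a series of length 187, d is 14, so the image has 196 cells and the last 9 are padding. Generated images can be mapped back to time series by flattening them and discarding those padded cells.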
Several GANs have already been shown to be capable of generating such time series, for example RDP-CGAN, PATE-GAN, and DP-GAN. They differ slightly in model architecture and in how they apply differential privacy during learning. An overview is shown in Table 1 below.
We show how GS-WGAN, originally designed for image synthesis, compares to these GANs when tasked with generating DP time series. We simultaneously compare the two architectures of GS-WGAN, DCGAN and ResNet. The most notable difference between them is the number of trainable parameters: DCGAN has about 1M, whereas ResNet has more than 40M. The comparison is done on the PTB dataset, in which each sample is one heartbeat represented by a time series of length 187. The results are shown below.
The quality of the data generated under low privacy budgets (small ε, i.e., strong privacy guarantees) is not ideal, as shown in table 1 and figure 3. However, GS-WGAN with the DCGAN architecture shows promising results for higher privacy budgets, see figure 2. For ε > 10 the model performs close to the baselines, with competitive AUPRC as well as good AUROC, as shown in figure 2. This suggests that converting time series into images is a valid technique for leveraging image-based GANs to synthesize time series. The difference between the two architectures is also interesting: DCGAN performs better than ResNet on the PTB dataset, as shown in figure 3. This is likely because DCGAN has significantly fewer parameters than ResNet and therefore learns faster.
With the changes made to the architecture and to the way the data is prepared, GS-WGAN is capable of creating high-quality synthetic time-series data in a differentially private setting. For low privacy budgets, the model is outperformed by other state-of-the-art DP GANs, performing better only when measured by AUPRC. Still, these results are promising: we have shown that GS-WGAN, although designed for image synthesis, can learn the characteristics of time-series data and generate high-quality synthetic DP time series for high privacy budgets.