Improve your ML workflows with synthetic data

As a data scientist, you know that high-performance machine learning models cannot exist without a large amount of high-quality data to train and test them. Most of the time, building an appropriate machine learning model is not the problem: there are plenty of architectures available, and since it is part of your job, you know which one best suits your use case. Obtaining a large amount of high-quality data is much more challenging: you need a labeled, clean dataset that matches your use case exactly. Unfortunately, such a dataset is rarely already available. Maybe you only have a handful of samples matching your requirements, maybe you have data but they do not match exactly what you want (they may contain biases or imbalanced classes, for example), or maybe a suitable dataset exists but you cannot access it because it contains private information. You therefore need to collect, label and clean new data, which can be time-consuming and costly, or simply not possible at all.

29/03/2023

Trustworthy AI


Takeaway

Synthetic data can be a valuable tool for improving machine learning workflows.
They can allow you to:
- Obtain more training and testing data for your models.
- Improve the diversity and representativeness of your data.
- Share a synthetic version of a database in order to protect private information.
However, to make the most of synthetic data, you need to ensure their quality and avoid generating inconsistent records. Tools such as Faker let you generate fake data easily, but the result is not complex enough to replace a real-world dataset. Machine learning models can generate more realistic data, and by combining them with tricks that enforce business constraints, you can unlock synthetic data that are both useful and consistent.

How can synthetic data help you?

Synthetic data are artificially generated data that imitate the characteristics of real-world data, so that you can use them in its place. They can help you in various scenarios:

  • When you already have some data but you want more. It can be because your model is not generalizing well since you do not have enough training data, or because you already have a trained model but you want to test its performance on new data.
  • When you have data but you are not satisfied with them. As mentioned before, datasets often contain biases, or some of their classes are under-represented. If your model is trained on this kind of data, it is very likely to reproduce those biases or to perform poorly on the under-represented classes. By generating unbiased synthetic data, and more data for the under-represented classes, you can build a more diverse and representative training set for your model.
  • When a dataset containing private but potentially useful data cannot be shared directly. Some datasets, such as medical datasets, contain private information and cannot be shared freely, even though sharing them would make a lot of useful information available through machine learning or statistical analysis. By generating a synthetic version of such a dataset, it is possible to share data with the same statistical characteristics that do not expose the information of real people. Actually, if you have read our blog post about Differential Privacy, you know that your synthetic data generator should be differentially private if you really want to guarantee the privacy of its training data; however, this topic is not the focus of this post.

From synthetic data to useful and consistent synthetic data

So, is synthetic data generation really that amazing? It can be, but only if the quality of the synthetic data you generate is good enough. You could generate by yourself a synthetic dataset in which every row has the same value, but would this dataset be useful to train or test a machine learning model? Of course not. Let's see what options we have to generate synthetic data.

Level 0 of synthetic data generation: Faker

Faker is a library that allows you to quickly generate fake data: random names, dates, ages, and so on. It is mostly used by developers to populate databases for testing purposes. However, when you need data to train or test a machine learning model, you are not looking for a random dataset but for a high-quality dataset with complex distributions and correlations. Faker has a lot to offer, but not the ability to handle complex distributions and correlations. If we want our machine learning model to learn something useful from our synthetic data, we have to move on to level 1.
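As a quick illustration, here is a minimal sketch of the kind of output Faker produces (the fields shown are arbitrary, chosen just for the example):

from faker import Faker

fake = Faker()

# Each call returns an independent random value:
# no correlations between fields, no distribution learned from real data
for _ in range(3):
    print({
        "name": fake.name(),
        "birth_date": fake.date_of_birth(minimum_age=18, maximum_age=90),
        "city": fake.city(),
        "email": fake.email(),
    })

This is perfect for populating a test database, but the generated rows carry no statistical signal a model could learn from.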

Level 1 of synthetic data generation: machine learning based generators

Unsupervised machine learning models are specifically designed to learn the complex distribution of real-world data and to generate new samples from this distribution: exactly what we are looking for! You can use the data you already have as a training set for an unsupervised machine learning model, and generate new data once it is trained. Deep learning models such as Generative Adversarial Networks (GANs) [1] or Variational Autoencoders (VAEs) [2] are the most effective for this kind of task.

Since many companies work with tabular data, adaptations of these models have been proposed to handle it better. For instance, the Conditional Tabular GAN (CTGAN) [3], an adaptation of the GAN architecture, is very popular. Fortunately, you do not have to implement it yourself: libraries such as the Synthetic Data Vault (SDV) let you instantiate architectures such as CTGAN, train them with the data you already have, and generate new data very easily.
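As a sketch, and assuming the same sdv.tabular API used in the example later in this post, with a pandas DataFrame real_data standing in for your own dataset:

from sdv.tabular import CTGAN

# Train a CTGAN generator on the real tabular data
model = CTGAN()
model.fit(real_data)

# Sample as many synthetic rows as there are real ones
synthetic_data = model.sample(num_rows=len(real_data))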

So, are we done now? Sometimes, yes! But not always. You should look carefully at your synthetic data, because your newly generated dataset may still be inconsistent. Indeed, the goal of a synthetic data generator is to estimate the distribution of the training set you feed it, but this training set is not always enough for it to learn specific properties of your data. For instance, a numeric attribute $a$ may have to be greater than a second numeric attribute $b$, but if the distributions of $a$ and $b$ overlap, the generator can produce rows for which $b > a$. Moreover, if you are generating synthetic data for privacy reasons, as mentioned before you should use Differential Privacy (DP), and unfortunately DP adds random noise to your generator that may slightly deteriorate its performance.

Level 2 of synthetic data generation: machine learning based generators with user-specified constraints

Fortunately, if your generator mostly learned the distribution of its training data but still produces inconsistencies in some samples, you can reach fully consistent synthetic data by adding constraints to your synthetic data generator. As long as the constraints you add do not involve sharing private information about individuals in your dataset, they can improve the utility of your dataset without privacy leaks. There are two strategies for this, and they can be combined: reject sampling and transformations.

  • Reject sampling: Reject sampling is remarkably simple: rules that the synthetic data must respect are defined, and the generated rows that do not respect them are discarded, so that only valid rows remain. Its advantages are that it can be applied whatever the rule and that it is extremely simple to implement. Its disadvantage is that, for a particularly complex constraint satisfied by only a small proportion of the generated data, the generation time can become very long, as the generator may keep sampling indefinitely without ever producing enough consistent rows.
  • Transformations: To force the generator to return consistent data directly, without a reject sampling phase that may never end, it is possible to apply clever transformations to the data so that the desired business constraints are respected by construction. However, each transformation is designed for a particular type of constraint, and there is not always an obvious transformation for the constraint you need; in that case, you can still fall back on reject sampling. To explain how transformations work, let's say we want to ensure that a numerical column $a$ is greater than another numerical column $b$ in every generated row. Before sending the data to the generator, the transformation $a = \ln(a - b)$ is applied. Once the model is trained, the generator samples two new columns $a'$ and $b'$, corresponding respectively to what it learned from $\ln(a - b)$ and from $b$. Then the reverse transformation $a' = e^{a'} + b'$ is applied, and for every row the inequality $a' > b'$ holds. Wait ... what just happened? Since $e^{a'} > 0$ for any $a' \in \mathbb{R}$, we have $e^{a'} + b' > b'$. We gave $a'$ the value $e^{a'} + b'$, which leads to $a' > b'$. The transformation $a = \ln(a - b)$ was applied only in anticipation of the reverse transformation $a' = e^{a'} + b'$ that guarantees the constraint afterward. This kind of trick can be adapted to many other constraints, provided that a useful transformation is found (see the sketch after the diagram below).
Figure: the transformation process
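Here is a minimal sketch of this trick with numpy and pandas; the toy columns and the stand-in "generator" are purely illustrative:

import numpy as np
import pandas as pd

# Toy data where the business constraint a > b holds
real = pd.DataFrame({"a": [5.0, 7.0, 10.0, 4.2], "b": [1.0, 6.5, 2.0, 4.1]})

# Forward transformation applied before fitting the generator: a becomes ln(a - b)
train = pd.DataFrame({"a": np.log(real["a"] - real["b"]), "b": real["b"]})

# Stand-in for a trained generator: here we simply resample the transformed rows
sampled = train.sample(n=len(train), replace=True).reset_index(drop=True)

# Reverse transformation on the generated rows: a' = exp(a') + b'
sampled["a"] = np.exp(sampled["a"]) + sampled["b"]

# The constraint now holds for every generated row
assert (sampled["a"] > sampled["b"]).all()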

Practice: a toy example with the SDV library on the breast cancer dataset

Let's see how it is possible to generate constrained synthetic data with SDV on the breast cancer dataset, and how it can improve both the consistency and the utility of the synthetic data generated.

The dataset contains features describing the cells of breast samples, with the aim of determining, for each sample, whether the observed cells indicate the presence of breast cancer or not. Several features (radius, perimeter, texture, etc.) are described, and for each of them both a mean value (mean radius, mean texture, etc.) and a maximum value (worst radius, worst texture, etc.) are available. For the data to be consistent, the mean value associated with a feature must be less than or equal to the corresponding maximum value for every example. For instance, the attribute mean perimeter cannot be strictly greater than the attribute worst perimeter of the same instance; otherwise it would mean that the mean perimeter of the cells in the sample is greater than the maximum perimeter of that same set of cells, which does not make sense.

Overview of the breast cancer dataset

Let's generate a synthetic version of the dataset with the SDV library.
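In the snippets below, data is assumed to be the breast cancer dataset loaded as a pandas DataFrame, for instance via scikit-learn:

from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset as a single DataFrame (30 numerical features + target)
data = load_breast_cancer(as_frame=True).frame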

from sdv.tabular import GaussianCopula

# Fit a Gaussian copula generator on the real data,
# then sample as many synthetic rows as there are real ones
model = GaussianCopula()
model.fit(data)
synthetic_breast_cancer = model.sample(num_rows=len(data))

Now, let's generate another synthetic version, but with the constraint that, for each feature, the mean value generated has to be lower than or equal to the maximum value generated.

from sdv.constraints import Inequality

# Pair each "mean ..." column with its "worst ..." counterpart
cols_mean = [col for col in data.columns if col.startswith("mean ")]
cols_max = [col.replace("mean ", "worst ") for col in cols_mean]

# One Inequality constraint per feature: the mean value must not exceed the maximum value
constraints = [
    Inequality(low_column_name=col_mean, high_column_name=col_max)
    for col_mean, col_max in zip(cols_mean, cols_max)
]

model_with_constraints = GaussianCopula(constraints=constraints)
model_with_constraints.fit(data)
synthetic_data_with_constraints = model_with_constraints.sample(num_rows=len(data))

Now, for each feature and for each dataset, let's check for how many rows the mean value is greater than the maximum value.
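A quick way to count these invalid rows, reusing the cols_mean / cols_max pairs defined above (a minimal sketch, not necessarily the exact code behind the table below):

def count_invalid_rows(df):
    """For each feature, count rows where the mean value exceeds the maximum value."""
    return {
        col_mean: int((df[col_mean] > df[col_max]).sum())
        for col_mean, col_max in zip(cols_mean, cols_max)
    }

print(count_invalid_rows(data))                             # original dataset
print(count_invalid_rows(synthetic_breast_cancer))          # synthetic, no constraints
print(count_invalid_rows(synthetic_data_with_constraints))  # synthetic, with constraints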

Invalid rows in the dataset (out of 569 rows)

Both the original dataset and the synthetic dataset generated with constraints are perfectly consistent. In contrast, the synthetic dataset generated without constraints suffers from inconsistencies for every feature.

Now, let's compare the utility of each dataset on the prediction of whether the cells observed are representative of the presence of breast cancer or not.
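One common way to measure this utility (a sketch of a possible protocol, not necessarily the exact one used for the results below) is to train the same classifier on each dataset and evaluate it on held-out real data; the "target" column name comes from the scikit-learn loading above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Keep a held-out split of real data for evaluation
real_train, real_test = train_test_split(data, test_size=0.3, random_state=0)

def utility_score(train_df, target="target"):
    """Train a classifier on train_df and return its accuracy on real held-out data."""
    clf = RandomForestClassifier(random_state=0)
    # Synthetic targets may be continuous, so round them back to class labels
    clf.fit(train_df.drop(columns=target), train_df[target].round().astype(int))
    predictions = clf.predict(real_test.drop(columns=target))
    return accuracy_score(real_test[target], predictions)

for name, df in [
    ("original", real_train),
    ("synthetic without constraints", synthetic_breast_cancer),
    ("synthetic with constraints", synthetic_data_with_constraints),
]:
    print(name, utility_score(df))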

Adding constraints to the synthetic data generator improved the utility of the generated data, even though there is still a drop in utility compared to the original dataset. Indeed, adding constraints brings your synthetic data generator closer to perfection, but does not quite get it there: there is still a lot of work to be done on synthetic data generation!

Conclusion

Synthetic data generators can be helpful tools to get more data for training and testing ML models, as well as to share a synthetic version of private data. That is why, at Craft AI, we are working on this subject, so that we can help you easily integrate a synthetic data generation step into your machine learning pipelines.

References:
[1] Goodfellow, I., et al. Generative Adversarial Nets. 2014.
[2] Kingma, D. P., Welling, M. Auto-Encoding Variational Bayes. 2013.
[3] Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K. Modeling Tabular Data using Conditional GAN. 2019.

Written by Lucas Preel
