Overcome Data Shortages for ML Model Training with Synthetic Data
Recent innovations produce synthetic data that is richer, more varied, and similar to real data, making it more useful than ever in providing the missing data machine-learning models need.
- By Sigal Shaked
- June 14, 2021
There are many roadblocks to developing and deploying machine learning models: matching business objectives with technological capabilities, moving workloads between cloud and on-premises, finding experienced staff, and breaking down data silos. All of these challenges are complex and difficult to solve. However, another obstacle -- the shortage of data for machine-learning model training -- is closer to being overcome, thanks to recent innovations.
Models Starved for Data
A machine-learning model often performs poorly because its training data does not reflect the full depth, granularity, and variety of real-life conditions. A large volume of varied data is required to train an unbiased ML model that produces meaningful insights across all types of scenarios. Different model types have varying data requirements, but finding data is always a challenge. Linear algorithms need hundreds of examples per class; more complex algorithms need tens of thousands (possibly millions) of data points. A rule of thumb is that you need roughly ten times as many examples as there are degrees of freedom in your model.
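The ten-times rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: the function names are hypothetical, and counting a linear model's degrees of freedom as one weight per feature plus an intercept is an assumption, not something the rule itself prescribes.

```python
def linear_degrees_of_freedom(n_features: int) -> int:
    """Assumed count for a linear model: one weight per feature plus an intercept."""
    return n_features + 1


def min_examples(degrees_of_freedom: int, factor: int = 10) -> int:
    """Rule-of-thumb lower bound: roughly `factor` examples per degree of freedom."""
    if degrees_of_freedom < 1:
        raise ValueError("degrees_of_freedom must be positive")
    return factor * degrees_of_freedom


if __name__ == "__main__":
    dof = linear_degrees_of_freedom(n_features=20)
    print(f"{dof} degrees of freedom -> at least ~{min_examples(dof)} examples")
```

By this estimate, a linear model with 20 features would want on the order of 210 examples. The figure is a floor for planning purposes, not a guarantee of good performance.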
If there is insufficient data, a model is more prone to overfitting, making it unable to generalize to new data. If the training data is missing specific populations, the model could be biased and fail to reflect the realities of the environment where it will run. Training data needs to include a proportionally accurate sample of each segment of a population, including all types of instances and combinations. The problem is even more severe in anomaly detection, where the unusual pattern that needs to be detected may be underrepresented. Enterprises may also face incomplete data, where attribute values are missing from records.
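One way to spot missing or underrepresented populations is to compare each group's share of the training set against its expected share of the real population. The following is a minimal sketch under stated assumptions: the function name, the group labels, the expected shares, and the 5-point tolerance are all hypothetical choices made for illustration.

```python
from collections import Counter


def representation_gaps(labels, expected_share, tolerance=0.05):
    """Return groups whose share of `labels` falls short of the expected
    population share by more than `tolerance` (absolute difference).

    The result maps each flagged group to (observed_share, expected_share).
    """
    counts = Counter(labels)
    total = len(labels)
    gaps = {}
    for group, expected in expected_share.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if expected - observed > tolerance:
            gaps[group] = (observed, expected)
    return gaps


if __name__ == "__main__":
    # Hypothetical training set: rural customers are 5% of the sample
    # but are assumed to make up 30% of the real customer base.
    train = ["urban"] * 950 + ["rural"] * 50
    print(representation_gaps(train, {"urban": 0.70, "rural": 0.30}))
```

Groups flagged by a check like this are natural candidates for targeted data collection or for synthetic augmentation.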
Causes of Data Shortages
There are several reasons why there is insufficient data available for AI/ML models. The first is that data privacy laws do not allow enterprises to use sensitive customer data without the customers' explicit permission, and too few customers, employees, or users agree to have their data used for research purposes.
Another reason is that ML models might be designed to work with new trends or respond to new technologies, processes, or product features for which no historical data is yet available.
The nature of the data itself can result in smaller sample sizes. For example, a model that measures stock prices' sensitivity to the consumer price index is limited to indices published once a month. Even 50 years of monthly CPI history yields only 600 records (50 x 12) -- a very small data set.
Sometimes the effort to label data is not timely or cost-effective. For example, a model predicting customer satisfaction might require an excessive number of hours to manually inspect hundreds of recordings of service calls, text messages, and emails to measure customer sentiment.
New Advances for Creating Synthetic Data
Synthetic data can be generated in large volumes without exposing sensitive records, which keeps enterprises compliant. It provides the data that models need while filling in the gaps that keep training data from being balanced and complete. Recent innovations that improve the accuracy of synthetic data have made it even more useful in providing the missing data machine-learning models need.
Already used successfully to improve the quality of images, generative adversarial networks (GANs) are now used to improve the accuracy of synthesized tabular data. A GAN uses two neural networks: a generator that produces new plausible samples and a discriminator that tries to tell generated examples from actual data. The two work against each other. The generator produces samples meant to trick the discriminator, and as both networks are trained and fine-tuned, the synthetic data becomes more realistic.
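The tug-of-war between the two networks shows up in their training losses. The sketch below computes the standard cross-entropy losses from the discriminator's output probabilities; the networks themselves are omitted, the function names are hypothetical, and the non-saturating generator loss is one common variant rather than the formulation any particular tabular synthesizer is known to use.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for the discriminator (lower is better for it).

    d_real: probabilities of "real" that the discriminator assigns to actual samples.
    d_fake: probabilities of "real" that it assigns to generated samples.
    All probabilities must lie strictly between 0 and 1.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


if __name__ == "__main__":
    # Early in training: the discriminator separates real from generated easily.
    print(discriminator_loss([0.9, 0.8], [0.1, 0.2]), generator_loss([0.1, 0.2]))
    # Later: generated samples are hard to tell apart, so outputs hover near 0.5.
    print(discriminator_loss([0.5, 0.5], [0.5, 0.5]), generator_loss([0.5, 0.5]))
```

When the discriminator can no longer tell the two sources apart, its outputs settle near 0.5 and its loss approaches 2 ln 2 (about 1.386), while the generator's loss falls relative to the early, easily detected case.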
An even more recent advancement is the Wasserstein GAN, or WGAN. Instead of using a discriminator to predict the probability that a generated sample is real, the WGAN uses a critic that scores the realness of a given sample. The critic neural network estimates the distance between the distribution of the data observed in the training data set and the distribution observed in generated examples, and the generator model is then trained to shrink that distance and create more realistic data.
Unlike a standard GAN, which seeks stability by finding an equilibrium between two opposing models, the WGAN seeks convergence between the models, resulting in synthetic data whose characteristics are more closely aligned with real life.
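The distance a WGAN works with is the Wasserstein (earth mover's) distance between the real and generated distributions. A WGAN's critic approximates it with a neural network over high-dimensional data; the sketch below is not a WGAN, only an illustration of the quantity itself in the special case of two equal-size, one-dimensional samples, where it has a simple closed form. The function name and the sample values are hypothetical.

```python
def wasserstein_1d(real, generated):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For empirical distributions with the same number of points, the
    optimal transport plan pairs the i-th smallest real value with the
    i-th smallest generated value, so the distance is the mean absolute
    difference between the sorted samples.
    """
    if len(real) != len(generated) or not real:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(real), sorted(generated))
    return sum(abs(r - g) for r, g in pairs) / len(real)


if __name__ == "__main__":
    real = [1.0, 2.0, 3.0, 4.0]
    poor_synthetic = [6.0, 7.0, 8.0, 9.0]    # far from the real distribution
    better_synthetic = [1.5, 2.5, 3.5, 4.5]  # close to the real distribution
    print(wasserstein_1d(real, poor_synthetic))
    print(wasserstein_1d(real, better_synthetic))
```

A smaller value means the generated sample sits closer to the real one, which is the direction the generator is pushed during WGAN training.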
As technologies evolve to make synthetic data richer, more varied, and more similar to real data, synthetic data is likely to become easy to generate and use. As it comes to be used to solve the data shortage, synthetic data will also protect privacy and enable enterprises to stay compliant while improving the speed and accuracy of ML models.
Sigal Shaked is the co-founder and CTO at Datomize. In her career, Sigal gained vast experience as a data scientist and a researcher for industry and government projects. Her Ph.D. investigated the utilization of machine learning techniques to protect data privacy. You can reach the author via LinkedIn.