Synthetic Data: The Diamonds of Machine Learning
Refined and labeled data is imperative for advances in AI. When your supply of good data does not match your demand, look to synthetic data to fill the gap.
- By Troy Hiltbrand
- November 22, 2019
We have all heard the saying, “Diamonds are a girl’s best friend.” This saying was made famous by Marilyn Monroe in the 1953 film “Gentlemen Prefer Blondes.” The unparalleled brilliance and permanence of the diamond contribute to its desirability. Its unique molecular structure results in its incredible strength, making it highly desirable not only as jewelry that looks beautiful but also for industrial tools that cut, grind, and drill.
However, the worldwide supply of diamonds is limited because they take millions of years to form naturally. In the middle of the last century, corporations set out to develop a process for producing lab-grown diamonds. Over the past 70 years, scientists have not only been able to replicate the strength and durability of natural diamonds but, more recently, have been able to match their color and clarity as well.
Just as with diamonds in the mid-twentieth century, there is today a mismatch between the supply of and demand for the high-quality data needed to power the artificial intelligence revolution. Raw data is abundant, but the refined, labeled data needed to train machine learning models is not.
What is the answer to this mismatch of supply and demand? Many companies are pursuing lab-generated synthetic data that can be used to support the explosion of artificial intelligence.
The goal of synthetic data generation is to produce data refined enough to train effective machine learning models, whether for classification, regression, or clustering. A model trained on synthetic data must perform as well on real-world data as one trained on natural data.
Synthetic data can be extremely valuable in industries where data is sparse, scarce, or expensive to acquire. Common use cases include outlier detection and problems that deal with highly sensitive data, such as private health information. Whether the challenge arises from data sensitivity or data scarcity, synthetic data can fill in the gaps.
There are three common methods of generating synthetic data: enhanced sampling, generative adversarial networks, and agent-based simulations.
Enhanced Sampling
In problems such as rare disease detection or fraud detection, one of the most common challenges is the rarity of instances representing the target for which you are searching. Class imbalance in your data limits how accurately the machine learning model can be trained. Without sufficient exposure to instances of the minority class during training, it is difficult for the model to recognize those instances when evaluating production data. In fraud cases, if the model is not trained with sufficient instances of fraud, it will tend to classify every transaction as non-fraudulent when deployed in production.
To balance your data, one option is to over-sample the minority class or under-sample the majority class, creating a synthetic distribution of the data. This ensures that the model sees an equal balance of each class during training. Statisticians have long used this approach to address class imbalance.
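Random over-sampling can be sketched in a few lines of plain Python. The record layout, label values, and counts below are illustrative assumptions, not from a real data set:

```python
import random

def oversample_minority(records, label_key="label"):
    """Randomly duplicate under-represented records until classes are balanced.

    A minimal sketch of random over-sampling; `records` is a list of dicts
    carrying a class label under `label_key` (names are illustrative).
    """
    by_class = {}
    for r in records:
        by_class.setdefault(r[label_key], []).append(r)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to make up each class's shortfall.
        balanced.extend(random.choices(group, k=target - len(group)))
    return balanced

# Hypothetical fraud data: 95 legitimate transactions, 5 fraudulent ones.
data = [{"amount": i, "label": "ok"} for i in range(95)]
data += [{"amount": 1000 + i, "label": "fraud"} for i in range(5)]
balanced = oversample_minority(data)
```

Under-sampling is the mirror image: shrink each class to the size of the smallest one instead of growing it to the largest.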
Another method is to use k-means or another generalized clustering technique to define a boundary around the data points of your minority class. That boundary delimits the region where the minority class lives. With the region defined, you can create synthetic data points that share the same statistical characteristics as the real data and use them to augment your minority class.
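One simple way to realize this idea, in the spirit of the well-known SMOTE technique, is to interpolate between nearby real minority points so that every synthetic point falls inside the region the minority class occupies. The function name, the five-neighbor choice, and the example points are all illustrative assumptions:

```python
import numpy as np

def synthesize_minority(points, n_new, rng=None):
    """Generate synthetic minority-class points by interpolating between
    real ones. `points` is an (n, d) array of minority-class samples."""
    rng = rng or np.random.default_rng(0)
    points = np.asarray(points, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # Pick a real point and one of its nearest minority-class neighbors.
        i = rng.integers(len(points))
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbors = np.argsort(dists)[1:6]  # up to 5 nearest, excluding self
        j = rng.choice(neighbors)
        # The new point lies on the segment between the two real points,
        # so it stays inside the space the minority class occupies.
        t = rng.random()
        synthetic.append(points[i] + t * (points[j] - points[i]))
    return np.array(synthetic)

# Hypothetical minority cluster around (5, 5); generate 20 synthetic points.
pts = np.array([[5.0, 5.0], [5.5, 5.2], [4.8, 5.1], [5.2, 4.9],
                [5.1, 5.3], [4.9, 4.8], [5.3, 5.1]])
syn = synthesize_minority(pts, 20)
```

Because each synthetic point is a blend of two real neighbors, it inherits the statistical character of the minority class rather than landing in empty space.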
Generative Adversarial Networks (GANs)
The next method of synthetic data creation involves computer models creating data points that cannot be differentiated from real data. Imagine two computers playing a game against one another. In each round, the first computer (the generator) passes either a real data point or a fabricated one to the second computer (the discriminator), whose goal is to guess which it received. If the discriminator correctly differentiates between the two, the generator uses this information to improve its next attempt, learning from the process. As the game progresses, the generator gets so good at creating synthetic data that the discriminator can no longer distinguish actual data from computer-generated data.
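The game above can be made concrete with a deliberately tiny example: the real data is drawn from a normal distribution centered at 4, the generator learns a single offset `mu` to shift noise toward the real data, and the discriminator is a one-input logistic classifier. The model sizes, learning rate, and analytic gradient updates are illustrative simplifications of a real GAN, not a production recipe:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

mu = 0.0          # generator parameter: a fake sample is mu + noise
w, b = 0.0, 0.0   # discriminator: D(x) = sigmoid(w * x + b)
lr = 0.05

for step in range(3000):
    real = random.gauss(4.0, 1.0)          # a genuine data point
    fake = mu + random.gauss(0.0, 1.0)     # the generator's attempt

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    # (gradient ascent on log D(real) + log(1 - D(fake))).
    d_real = sigmoid(w * real + b)
    d_fake = sigmoid(w * fake + b)
    w += lr * ((1 - d_real) * real - d_fake * fake)
    b += lr * ((1 - d_real) - d_fake)

    # Generator step: nudge mu so the discriminator scores fakes as real
    # (gradient ascent on log D(fake) with respect to mu).
    d_fake = sigmoid(w * fake + b)
    mu += lr * (1 - d_fake) * w
```

As the two updates alternate, `mu` drifts toward 4: whenever the fakes sit below the real data the discriminator learns a positive `w`, which pushes the generator upward, and vice versa, until the two distributions overlap and neither player can improve.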
This computer-generated data is then used as input to other machine learning models. Many advances in artificial intelligence have been achieved using this methodology, including fabricated videos, images, and art that are indistinguishable from the real thing, all built by learning from real-world examples.
One area where GANs are being used is in generating scenarios for testing autonomous driving algorithms. This approach lets companies produce millions of scenarios and determine whether their algorithms are ready to operate safely in the real world.
Agent-Based Simulations
The final method of creating synthetic data is simulation: agents are built to represent real-world entities, and their interactions with one another are observed and measured to generate data.
Modern gaming engines already create agents that obey real-world physics and social behavior and interact as if they were alive, and the same techniques are being leveraged for synthetic data creation. Take, for instance, the game The Sims, which lets players set up a life in a virtual world and guide characters through daily activities. As such agents become more intelligent, embodying real-world characteristics, they can be combined virtually, and the outcomes of their interactions become your synthetic data.
One real-world example of this is the modeling of nuclear reactions. Before scientists build actual nuclear facilities and set off sub-atomic reactions to observe the energy released and how to manage safety, they create agents that represent the elementary particles with their associated physical properties. Through modeling and simulation, they can observe what happens inside the reaction and between the particles and their external environment. Because trillions of calculations are needed to represent these reactions, they run the models on some of the world's fastest supercomputers. Although these supercomputers are a huge up-front investment, the resulting data saves money in the long run and allows energy innovation to proceed safely.
In business, these agents can represent customers interacting with a physical store layout or with a company's e-commerce site. The data generated from these virtual simulations of intelligent agents is highly valuable: a company can run through millions of permutations, creating a robust data set to drive its machine learning models.
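A minimal agent-based sketch of the store example might look like the following: hypothetical "customer" agents walk the aisles, decide whether to buy, and every decision is logged as a synthetic data record. The aisle names, prices, budgets, and decision rule are all illustrative assumptions:

```python
import random

random.seed(42)

# Hypothetical store layout: aisle name -> typical item price.
AISLES = {"produce": 4.0, "dairy": 3.5, "electronics": 60.0, "snacks": 2.5}

def simulate_customer(customer_id):
    """One agent walks every aisle; each visit yields a synthetic record."""
    budget = random.uniform(10, 100)
    records = []
    for aisle, price in AISLES.items():
        interest = random.random()                # how appealing the aisle is
        bought = interest > 0.5 and price <= budget
        if bought:
            budget -= price                       # agents spend a finite budget
        records.append({
            "customer": customer_id,
            "aisle": aisle,
            "price": price,
            "bought": bought,
        })
    return records

# Run many agents; their logged interactions become the synthetic data set.
synthetic_data = [row for cid in range(1000) for row in simulate_customer(cid)]
```

Each run of the simulation is a permutation of customer behavior; richer agents (loyalty, time of day, store layout) would make the logged data correspondingly richer.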
Although synthetic data has huge potential to provide just what your machine learning models need, be aware that it carries some risk: artificial data can lead to artificial results in your models, which in turn can lead to bad decision making. Statistical testing and oversight can help reduce this possibility.
Taking the Next Step
With each of these methods of creating synthetic data, the objective is the same: to come away with highly valuable, refined, and labeled data that can drive artificial intelligence projects. Lab-grown diamonds went from being merely structural duplicates to being virtually identical in appearance to natural diamonds. Likewise, synthetic data will continue to advance until it is not just similar to your real-world data but a highly accurate representation of the real world, completely interchangeable with real data in the machine learning process.
Where do you go from here? Look for places in your business where an artificial intelligence model could be transformative but where scarce or costly data has kept you from fully implementing it. Then evaluate whether one of these methods could provide the foundation to make that vision a reality.