The Machine Learning Data Dilemma
Machine learning applications are dependent on, and sensitive to, the data they train on. These best practices will help you ensure that training data is of high quality.
- By Greg Council
- April 15, 2019
To be effective, machine learning (ML) has a significant requirement: data. Lots of data. We can expect a child to understand what a cat is and identify other cats after just a few encounters or by being shown a few examples of cats, but ML algorithms require many, many more examples. Unlike humans, these algorithms can't easily develop inferences on their own. For example, machine learning algorithms interpret a picture of a cat against a grassy background differently from a picture of a cat in front of a fireplace.
The algorithms need a lot of data to separate the relevant "features" of the cat from the background noise. It is the same for other noise such as lighting and weather. Unfortunately, such data hunger does not stop at the separation of signal from noise. The algorithms also need to identify meaningful features that distinguish the cat itself. Variations that humans do not need extra data to understand -- such as a cat's color or size -- are difficult for machine learning.
Without an adequate number of samples, machine learning provides little benefit.
Not All ML Techniques Are Equally Hungry
Many types of machine learning techniques exist, and some have been around for several decades. Each has its own strengths and weaknesses. These differences also extend to the nature and amount of data required to build effective models. For instance, deep learning neural networks (DLNNs) are an exciting area of machine learning because they are capable of delivering dramatic results. DLNNs typically require more data than more established machine learning algorithms, as well as a hefty amount of computing horsepower. In fact, DLNNs were considered feasible only after the advent of big data (which provided the large data sets) and cloud computing (which provided the number-crunching capability).
Other factors affect the need for data. General machine learning algorithms do not include domain-specific information; they must overcome this limitation through large, representative data sets. Referring back to the cat example, these machine learning algorithms don't understand basic features of cats, nor do they understand that backgrounds are noise. So they need many examples of this data to learn such distinctions.
To reduce the data required in these situations, machine learning algorithms can incorporate some domain knowledge so that key features and attributes of the target data are already known. Then the focus of learning can be strictly on optimizing output. This need to "imbue" human knowledge into the machine learning system from the start is a direct result of the data-hungry nature of machine learning.
Training Data Sets Need Improvement
To truly drive innovation using machine learning, a good amount of innovation needs to first occur around how input data is selected.
Curating (that is, selecting the data for a training data set) is, at heart, about monitoring data quality. "Garbage-in, garbage-out" is especially true with machine learning. Exacerbating this problem is the relative "black box" nature of machine learning, which makes it hard to understand why a model produces a certain output. When machine learning creates unexpected output, the cause is often inappropriate input data, but determining the specific nature of the problem data is a challenge.
Two common problems caused by poor data curation are overfitting and bias. Overfitting occurs when a model fits its training data too closely, which is especially likely when the training data set does not adequately represent the actual variance of production data; the resulting model performs well on the training set but can only deal with a portion of the entire data stream.
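A common way to spot overfitting is to compare accuracy on the training set with accuracy on held-out data. The sketch below is illustrative only: the memorizing model, the toy data, and the 0.2 gap threshold are all assumptions chosen to make the effect visible, not a method prescribed by this article.

```python
from collections import Counter


def accuracy(predict, samples):
    """Fraction of (feature, label) pairs that predict() classifies correctly."""
    return sum(predict(x) == y for x, y in samples) / len(samples)


def fit_memorizer(train):
    """Deliberately overfit model: a lookup table of the training pairs.

    Features that were never seen fall back to the most common training label.
    """
    table = dict(train)
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: table.get(x, fallback)


# Hypothetical data: the true rule is "label is 1 when x >= 4".
train = [(x, int(x >= 4)) for x in range(10)]
heldout = [(x + 0.5, int(x + 0.5 >= 4)) for x in range(10)]

model = fit_memorizer(train)
train_acc = accuracy(model, train)      # perfect: every training pair is memorized
heldout_acc = accuracy(model, heldout)  # lower: no held-out feature was memorized
gap = train_acc - heldout_acc

GAP_THRESHOLD = 0.2  # assumed cutoff; tune it for the task at hand
looks_overfit = gap > GAP_THRESHOLD
print(f"train={train_acc:.2f} heldout={heldout_acc:.2f} overfit={looks_overfit}")
```

A large gap between the two scores suggests the model has learned the particulars of its training set and will not generalize to production data.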
Bias is a deeper problem that relates to the same root cause as overfitting but is harder to identify and understand. Biased data sets are not representative, have a skewed distribution, or do not contain the correct data in the first place. This biased training data results in biased output that leads to incorrect conclusions, which may be difficult to identify as incorrect. Although there is significant optimism about machine learning applications, data quality problems should be a major concern as machine-learning-as-a-service offerings come online.
A related problem is having access to high-quality data sets. Big data has created numerous data sets, but rarely do these sets involve the type of information required for machine learning. Supervised machine learning requires both the input data and the outcome associated with that data. Using the cat example, images need to be tagged indicating whether a cat is present.
Other machine learning tasks can require much more complex data. The need for large volumes of sample data combined with the need to have this data adequately and accurately described creates an environment of data haves and have-nots. Only the largest organizations with access to the best data and deep pockets to curate it will be able to easily take advantage of machine learning. Unless the playing field is level, innovation will be muted.
How Innovation Can Solve Data Problems
Just as machine learning can be applied to actual problem solving, the same technologies and techniques used to sift through millions of pages of data to identify key insights can be used to help with the problems of finding high-quality training data.
To improve data quality, some interesting options are available for automating problem detection and correction. For instance, clustering or regression algorithms can be used to scan proposed input data sets to detect unseen anomalies. Alternatively, the process of determining whether data is representative can be automated. If not properly addressed, unseen anomalies and unrepresentative data can lead to overfitting and bias.
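One simple way to automate a representativeness check is to compare the category mix of the training set with a sample of production data. The sketch below uses total variation distance for that comparison; the document categories, counts, and the 0.1 threshold are hypothetical, and other distance measures would serve equally well.

```python
from collections import Counter


def category_shares(labels):
    """Map each category to its share of the data set (shares sum to 1)."""
    counts = Counter(labels)
    return {category: n / len(labels) for category, n in counts.items()}


def total_variation_distance(train_labels, production_labels):
    """Half the sum of absolute share differences: 0 = same mix, 1 = disjoint."""
    p = category_shares(train_labels)
    q = category_shares(production_labels)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical category tags for a training set and a production sample.
train_labels = ["invoice"] * 80 + ["receipt"] * 20
production_labels = ["invoice"] * 50 + ["receipt"] * 30 + ["contract"] * 20

distance = total_variation_distance(train_labels, production_labels)
missing = sorted(set(production_labels) - set(train_labels))

MAX_DISTANCE = 0.1  # assumed tolerance; choose one that suits the application
is_representative = distance <= MAX_DISTANCE and not missing
print(f"distance={distance:.2f} missing={missing} ok={is_representative}")
```

Here the training set under-represents receipts and contains no contracts at all, so the check fails before any model is trained.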
If the input data stream is meant to be fairly uniform, regression algorithms can identify outliers that might represent garbage data that could adversely affect a learning session. Clustering algorithms can help analyze a data set that consists of a specific number of document categories to identify if the data really contains more or fewer types -- either of which can lead to poor results. Other ML techniques can be used to verify the accuracy of the tags on the sample data. We are still at the early stages of automated input data quality control, but it looks promising.
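The regression-based outlier check described above can be sketched as follows. This is a minimal illustration, assuming a roughly linear input stream; the toy values and the cutoff of two standard deviations are assumptions, and a more robust spread measure may be preferable when outliers are extreme.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def find_outliers(xs, ys, cutoff=2.0):
    """Indexes whose residual exceeds `cutoff` standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > cutoff * spread]


# Hypothetical stream that follows y = 2x + 1, with one corrupted reading.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40  # garbage value injected for the demonstration

flagged = find_outliers(xs, ys)
print("suspect sample indexes:", flagged)
```

Flagged samples can then be reviewed or removed before the learning session starts.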
To increase access to useful data sets, one new technique deals with synthetic data. Rather than attempt to collect real sample sets and then tag them, organizations use generative adversarial networks to create and tag the data. In this scenario, one neural network creates the data and another neural network tries to determine if the data is real. This process can run largely unattended, sometimes with remarkable results.
Reinforcement learning is also gaining real traction to address the lack of data. Systems that employ this technique can take data from interactions with their immediate environment in order to learn. Over time, the system can develop new inferences without requiring curated sample data.
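To make the idea of learning from interaction concrete, the sketch below uses a multi-armed bandit, one of the simplest reinforcement-learning settings. It is an illustration under stated assumptions rather than a production approach: the environment pays a fixed, hypothetical reward per action, and the agent explores through optimistic initial estimates instead of curated samples.

```python
def run_bandit(payouts, steps, initial_estimate=1.0):
    """Greedy agent with optimistic initial value estimates.

    `payouts` is the (hypothetical) fixed reward of each action. The initial
    estimate must be at least the largest payout so that every action is
    tried once before the agent settles on the best one.
    """
    estimates = [initial_estimate] * len(payouts)
    pulls = [0] * len(payouts)
    for _ in range(steps):
        # Act: pick the action that currently looks best (first one on ties).
        arm = max(range(len(payouts)), key=lambda a: estimates[a])
        # Observe: the environment returns a reward; no labeled data is used.
        reward = payouts[arm]
        # Learn: move the running average for that action toward the reward.
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls


payouts = [0.2, 0.5, 0.9]  # unknown to the agent; revealed only through rewards
estimates, pulls = run_bandit(payouts, steps=20)
best_arm = max(range(len(payouts)), key=lambda a: estimates[a])
print("pulls per action:", pulls, "preferred action:", best_arm)
```

The agent tries each action once, then concentrates on the most rewarding one, using only feedback from its environment.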
Data Is Driving Innovation
Promising and ongoing work using machine learning technologies is solving a variety of problems and automating work that is expensive, time-consuming, and complex (or a mix of all three). Yet without the necessary source data, machine learning can go nowhere. Efforts to simplify and broaden access to large volumes of high-quality input data are essential to increase the use of ML in a much broader set of domains and continue to drive innovation.
Greg Council is vice president of product management at Parascript, responsible for market vision and product strategy. Greg has over 20 years of experience in solution development and marketing within the information management market. This includes search, content management, and data capture for both on-premises solutions and SaaS.