When Preparing Data for Machine Learning, Avoid These Common Mistakes
Follow these tips to ensure your machine learning training data is the best it can be.
- By Mayank Madan
- October 9, 2023
As Miguel de Cervantes rightly said, “Diligence is the mother of good fortune.” For developers working on machine learning and AI projects, this diligence needs to be focused acutely on the quality of the data used to train models. Models that are trained on data of low quality, with poor structure, or of insufficient volume will perform poorly, resulting in bad decisions and wasted resources.
Most data engineers and AI developers recognize this fact, and often make data preparation a routine part of their AI and machine learning workflows. However, that doesn't mean they always get the intended results.
The Basics of Data Preparation for Machine Learning
The goal of data preparation in this context is to ensure that you make the best data available to train your models. Without the right training data, your models may not perform accurately, resulting in unexpected outcomes for your use case(s). Furthermore, your models may turn out to make biased decisions. For example, if your model is trained to recognize voices using only examples from a particular region, it may not recognize the voices of people with accents from other regions.
To mitigate such risks, data engineers routinely perform several core steps to prepare and clean data for machine learning.
Fixing data issues. This involves handling missing and null values, fixing inconsistent data, and removing duplicates. You may have a column with mostly blank values, for example, in which case deleting the column is likely to lead to better training. Similarly, you may have a column in a customer database that records countries of residence in inconsistent ways -- some entries may read "U.K." while others read "United Kingdom," for instance. In such a case, you'd cleanse the data by converting it to a standard format or set of values.
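The three fixes described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample records, the `COUNTRY_ALIASES` mapping, and the 50 percent missing-value threshold are all hypothetical and would need to be chosen for your own data.

```python
# Hypothetical customer records; None marks a missing value.
rows = [
    {"id": 1, "country": "U.K.", "fax": None},
    {"id": 2, "country": "United Kingdom", "fax": None},
    {"id": 2, "country": "United Kingdom", "fax": None},  # exact duplicate
    {"id": 3, "country": "uk", "fax": "555-0100"},
    {"id": 4, "country": "France", "fax": None},
]

# Assumed mapping from observed spellings to one standard value.
COUNTRY_ALIASES = {"u.k.": "United Kingdom", "uk": "United Kingdom"}


def clean(rows, max_missing_ratio=0.5):
    """Drop mostly-empty columns, standardize countries, remove duplicates."""
    columns = list(rows[0].keys())
    # 1. Keep only columns whose share of missing values is within the limit.
    keep = [
        c for c in columns
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing_ratio
    ]
    cleaned, seen = [], set()
    for r in rows:
        rec = {c: r[c] for c in keep}
        # 2. Convert inconsistent country spellings to the standard value.
        country = rec.get("country")
        if country is not None:
            stripped = country.strip()
            rec["country"] = COUNTRY_ALIASES.get(stripped.lower(), stripped)
        # 3. Skip records that are identical to one already kept.
        key = tuple(rec[c] for c in keep)
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return cleaned


cleaned_rows = clean(rows)
for rec in cleaned_rows:
    print(rec)
```

Here the mostly blank `fax` column is dropped, the three spellings of the United Kingdom collapse into one, and the repeated record is removed.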
Reducing dimensionality. It often happens that not all the features or columns you've collected for training are relevant. You may have columns or attributes that won’t be important for the problem you’re solving. Dimensionality-reduction techniques fall into two broad groups: feature selection, which keeps a subset of the original columns, and feature extraction, which derives a smaller set of new features from them. Principal component analysis (PCA) is a common example of the latter.
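As one small illustration of feature selection, the sketch below drops columns that barely vary, since a near-constant column carries little information for a model. The column names, the sample values, and the variance threshold are assumptions made for this example, and a variance filter is only one of many selection methods.

```python
from statistics import pvariance


def select_by_variance(columns, min_variance=1e-3):
    """Keep only the columns whose population variance exceeds min_variance."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > min_variance
    }


# Hypothetical feature columns, stored as name -> list of values.
features = {
    "age": [23, 45, 31, 52],
    "store_id": [7, 7, 7, 7],  # constant, so it cannot help a model
    "spend": [120.0, 80.5, 200.0, 95.0],
}

selected = select_by_variance(features)
print(list(selected))
```

Note that raw variance depends on a column's units, so in practice the threshold is usually applied after the columns have been brought to a comparable scale.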
Sometimes you may instead need to create new attributes from an existing one, depending on your use case. For example, you might break out data within a field that records timestamps of customer purchases such that it becomes two separate columns -- one for time of day and another for day of the week. Doing so would be valuable if you want your models to make predictions based on each of these variables in isolation.
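The timestamp example can be sketched with Python's standard `datetime` module. The function name, the output field names, and the ISO 8601 input format are assumptions made for illustration.

```python
from datetime import datetime


def split_timestamp(timestamp):
    """Split an ISO 8601 timestamp into two separate features."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "hour_of_day": dt.hour,        # 0 to 23
        "day_of_week": dt.weekday(),   # Monday is 0, Sunday is 6
    }


# Hypothetical purchase timestamp.
purchase = split_timestamp("2023-10-09T14:30:00")
print(purchase)
```

Each purchase now contributes two independent columns that a model can weigh separately.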
Normalizing data. Data normalization is a data processing technique used to bring data to a common scale. This technique is useful when you have data sets with different value ranges. For example, you may have data with an age range from 1 to 125 and sales data from 1,000 to 10,000,000. In such scenarios, it is important to normalize the data to a common scale so that, for many algorithms, the feature with the larger range does not dominate training.
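One widely used way to do this is min-max scaling, which maps each value to the range 0 to 1 using x' = (x - min) / (max - min). The sketch below applies it to the age and sales ranges from the example above; the specific sample values are made up, and other methods such as standardization may suit your data better.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical samples spanning the ranges mentioned in the text.
ages = [1, 32, 63, 125]
sales = [1_000, 5_000_500, 10_000_000]

scaled_ages = min_max_scale(ages)
scaled_sales = min_max_scale(sales)
print(scaled_ages)
print(scaled_sales)
```

After scaling, both columns lie between 0 and 1, so neither outweighs the other simply because of its units.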
In short, the goal of data preparation for machine learning is to ensure that the data you feed into your models when training them is as accurate and consistent as possible. Preparation also helps align data with the specific use cases that a given model needs to support.
Depending on how you want your models to perform, the data features you select, the data pre-processing methods you use, and so on can vary widely.
Common Mistakes in AI Data Preparation
Note, though, that these data preparation basics are only that -- the basics. Too often, engineers handle the basics but still make one or more of the following mistakes:
Lack of representative data. When preparing data, it's vital to ensure that you choose data that fully represents the information your models will need to process in the real world. Failing to do this is exactly the type of data training mistake that leads to biased models, such as the voice recognition model mentioned above.
To obtain representative data, you often need to think creatively about data sources. The data that is most easily available to you may not fully represent real-world data, so it’s important to critically assess the use cases of your model and where you can acquire data that addresses all of them in an unbiased way.
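A simple first check is to compare how often each group appears in your training data with how often you expect it to appear in real-world use. The sketch below does this for the accent example; the group names, the expected shares, and the tolerance are all hypothetical, and in practice the expected shares would have to come from your own knowledge of the target population.

```python
from collections import Counter


def underrepresented_groups(samples, expected_shares, tolerance=0.05):
    """Return groups whose share of the samples falls short of the expected
    share by more than the tolerance."""
    counts = Counter(samples)
    total = len(samples)
    flagged = []
    for group, expected in expected_shares.items():
        observed = counts.get(group, 0) / total
        if expected - observed > tolerance:
            flagged.append(group)
    return sorted(flagged)


# Hypothetical accent labels attached to voice recordings in a training set.
training_accents = ["north"] * 8 + ["south"] * 2

# Assumed real-world mix of accents among the model's users.
expected = {"north": 0.5, "south": 0.4, "west": 0.1}

gaps = underrepresented_groups(training_accents, expected)
print(gaps)
```

A non-empty result is a prompt to look for additional data sources for the flagged groups before training.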
Selecting the wrong volume of data. One of the hallmarks of machine learning and AI is the volume of data required to properly train models -- but how much training data is enough?
If you have a relatively simple model, or if the model only needs to support a very narrow and specific use case, it may not require a large volume of training data. More complex models or those that must support a wide variety of use cases typically require a large volume of data.
The point is that when you’re training models, you will need to assess the unique requirements of the model you are training and choose an appropriate volume of data. There is no one-size-fits-all approach to the data volume question.
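One practical way to assess this is a learning curve: train the same model on progressively larger subsets and watch how accuracy on held-out data changes. The sketch below is a toy version only. The synthetic one-dimensional data, the nearest-centroid classifier, and the subset sizes are all assumptions chosen to keep the example self-contained, not a recommendation for real projects.

```python
import random
from statistics import mean


def make_data(n, rng):
    """Synthetic 1-D points: class 0 centred at 0.0, class 1 centred at 2.0."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def fit_centroids(train):
    """Compute the mean feature value for each class seen in the training set."""
    labels = {y for _, y in train}
    return {lbl: mean(x for x, y in train if y == lbl) for lbl in labels}


def accuracy(centroids, test):
    """Classify each test point by its nearest centroid and score the result."""
    hits = sum(
        min(centroids, key=lambda lbl: abs(x - centroids[lbl])) == y
        for x, y in test
    )
    return hits / len(test)


def learning_curve(train, test, sizes):
    """Return (training size, held-out accuracy) for each subset size."""
    return [(n, accuracy(fit_centroids(train[:n]), test)) for n in sizes]


rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = make_data(400, rng), make_data(200, rng)
curve = learning_curve(train, test, sizes=[10, 50, 100, 400])
for size, acc in curve:
    print(f"{size:>4} training examples -> accuracy {acc:.2f}")
```

If accuracy stops improving as the subsets grow, more data of the same kind is unlikely to help much; if it is still climbing at the largest size, the model may benefit from a larger volume.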
Fixing the wrong data quality issues. Data quality issues are like bugs in software: they're virtually unavoidable, they're widespread, they’re frustrating, and some of them matter much more than others. You can spend days or weeks chasing data quality problems in a training data set without achieving worthwhile results, especially if you focus on fixing data that isn't relevant to your model.
To avoid this mistake, determine which information within your training data set is most important for your use cases and focus on improving just that data. It would be great to perfect all your data, but doing so is not practical, so invest in the data quality changes that yield the greatest returns based on your needs.
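One rough way to decide where to invest is to weight each column's defect rate by how much that column matters to your use case. In the sketch below, the importance weights are hypothetical numbers that you would assign yourself, and missing values stand in for data quality issues in general.

```python
def prioritize(rows, importance):
    """Rank columns by importance weight multiplied by their missing-value rate."""
    scores = {}
    for column, weight in importance.items():
        missing = sum(1 for r in rows if r.get(column) in (None, ""))
        scores[column] = weight * missing / len(rows)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical training records.
rows = [
    {"price": 9.5, "region": "EMEA", "notes": None},
    {"price": None, "region": "EMEA", "notes": None},
    {"price": 7.0, "region": None, "notes": None},
    {"price": 8.0, "region": "APAC", "notes": None},
]

# Assumed relevance of each column to the model (0 means irrelevant).
importance = {"price": 1.0, "region": 0.5, "notes": 0.0}

ranking = prioritize(rows, importance)
for column, score in ranking:
    print(f"{column}: {score:.3f}")
```

The `notes` column is entirely empty, yet it ranks last because the model does not depend on it, which is exactly the kind of issue that is not worth chasing.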
Leaving humans out of the loop. Training machine learning models at scale requires automation. Ideally, you'll be able to automate most of the work required to tag and organize data among other tasks. However, it's unrealistic to expect automated tools to handle every aspect of data preparation.
Automation doesn’t work well when your tools encounter unexpected conditions or missing information. For that reason, it's essential to keep human oversight in the loop, especially when you do not have enough data. Humans can make decisions such as inferring which value should exist in a missing field, helping to make the most of insufficient data. Human review can also manage unexpected outlier cases that your tools won’t understand how to classify.
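A common pattern for keeping humans in the loop is to let automation handle the clear-cut records and route everything else to a review queue. The sketch below assumes each record carries a `confidence` score from an automated labeling tool; that field, the threshold, and the record layout are hypothetical.

```python
def route(records, required_fields, min_confidence=0.8):
    """Split records into those safe to process automatically and those
    that need human review, with the reasons for each review item."""
    auto, review = [], []
    for rec in records:
        reasons = [
            f"missing {field}"
            for field in required_fields
            if rec.get(field) in (None, "")
        ]
        if rec.get("confidence", 0.0) < min_confidence:
            reasons.append("low confidence")
        if reasons:
            review.append({"record": rec, "reasons": reasons})
        else:
            auto.append(rec)
    return auto, review


# Hypothetical output of an automated labeling tool.
records = [
    {"id": 1, "label": "invoice", "confidence": 0.97},
    {"id": 2, "label": None, "confidence": 0.91},
    {"id": 3, "label": "receipt", "confidence": 0.42},
]

auto, review = route(records, required_fields=["label"])
print("automatic:", [r["id"] for r in auto])
for item in review:
    print("review:", item["record"]["id"], item["reasons"])
```

Recording the reason alongside each queued record helps reviewers see at a glance whether they need to fill in a missing value or judge an outlier.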
Conclusion: Streamlining Data Preparations
There's no getting around the fact that data preparation is hard work. It's critical work though, because if you avoid the heavy lifting of data preparation -- or if you only perform the basics without going above and beyond where necessary -- you end up with models that fall short of expectations. Instead, think strategically about which types of data your AI models need to train on and give them that data in formats and volumes optimized for their machine learning needs.
Mayank Madan is head of the data and analytics practice for EMEA at Lemongrass, where he is responsible for developing solutions that drive business performance and potential value from business and external data, leveraging the modern data and AI ecosystem. You can reach the author via email or LinkedIn.