How to Judge a Training Data Set
AI practitioners need to be aware of best practices in training data preparation, as well as the myriad ways to avoid or reduce bias in their data sets.
- By MingKuan Liu
- August 6, 2021
What makes a good training data set? In artificial intelligence (AI), this is one of the most important questions for practitioners to answer. A foundation of good training data puts you on a path to developing accurate, unbiased models. Actively sourcing representative data, labeling it correctly, and monitoring it for bias are essential steps in launching AI that works well for all of your end users.
As an AI practitioner, you need to be aware of best practices in training data preparation, as well as the myriad ways in which to avoid or reduce bias in your data set.
Data Best Practices
A good data set is both representative and balanced. Let's explore what that should look like in key facets of the data preparation process:
Data sourcing. If possible, collect data from multiple sources to maximize data diversity. Research your end users in advance. Are they adequately represented in your data set? Be aware of all the potential use cases your end users will need (including outliers) and ensure the data you've collected matches those scenarios. As you collect data, continually analyze your data set from the perspective of your end users; you may be surprised to find you must acquire additional data to fill in the gaps.
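One way to check whether end users are adequately represented is to compare each group's share of the data set with the share you expect among your users. The sketch below is only an illustration: the function name, the 5 percent tolerance, the group labels, and the numbers are all assumptions made for the example, not values from any particular project.

```python
from collections import Counter

def representation_gaps(samples, expected_shares, tolerance=0.05):
    """Return groups whose share of the data falls short of the expected share.

    samples: one group label per record (a hypothetical metadata field).
    expected_shares: dict mapping group -> expected fraction of end users.
    tolerance: how far below the expected share a group may fall before
               it is flagged (an arbitrary illustrative threshold).
    """
    samples = list(samples)
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for group, expected in expected_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if observed < expected - tolerance:
            gaps[group] = {"observed": round(observed, 3), "expected": expected}
    return gaps

# Made-up example: speaker accents in a speech data set.
records = ["us"] * 77 + ["uk"] * 18 + ["in"] * 5
expected = {"us": 0.5, "uk": 0.2, "in": 0.3}
print(representation_gaps(records, expected))
```

Any group the function returns is a candidate for additional data collection. Where the expected shares come from (user research, market data, product analytics) is a decision for your team.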
Data labeling. Create a gold standard for your data labeling. If you need assistance, consider leveraging data collection and data labeling domain expertise from a third-party data provider. They can review your data management guidelines and offer additional improvements and best practices based on their knowledge and experience. In any case, provide your data annotators with clear guidelines so they are aware of what's expected of them. If needed, continue to adjust these guidelines based on feedback from annotators.
Data monitoring. The data your model encounters in the real world will often shift over time because models typically don't operate in static environments. Even after deploying your model, monitor and analyze your data routinely to catch potential model drift. Establish a plan for retraining your model with new training data when model drift does occur.
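Routine monitoring can start with a simple statistic that compares a feature's distribution at training time with its distribution in production. The sketch below uses the population stability index (PSI), which is one common choice among several and is not a method prescribed by this article. The bin count, the sample values, and the function name are assumptions for illustration, and both samples are assumed to be non-empty.

```python
import math

def population_stability_index(baseline, current, bins=10):
    """Compare two samples of one numeric feature using PSI.

    Bins are equal-width and derived from the baseline sample. Values in
    `current` that fall outside the baseline range go into the edge bins.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all baseline values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

# Made-up feature values: training data vs. a shifted production sample.
training_values = list(range(100))
production_values = [v + 50 for v in training_values]
print(population_stability_index(training_values, production_values))
```

A PSI near zero suggests the two distributions are similar. Practitioners often treat values above roughly 0.1 as worth watching and above roughly 0.25 as a significant shift, but those cutoffs are rules of thumb and should be tuned to your own data and retraining plan.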
Reducing bias is one of the top concerns for AI practitioners and a crucial factor in determining model performance. A biased model won't perform well for certain user groups and will require retraining on data that's more representative of those groups. To avoid this outcome, be aware of the steps your team can take to mitigate bias and build more responsible AI. First, let's review common bias patterns to watch out for:
Sample bias or selection bias occurs when a data set doesn't reflect the realities of the environment in which a model will run. For example, certain facial recognition systems trained primarily on images of one gender will have lower levels of accuracy for any other gender.
Exclusion bias most commonly occurs at the data preprocessing stage and is often a case of deleting valuable data thought to be unimportant.
Measurement bias occurs when the data collected for training differs from that collected in the real world or when faulty measurements result in data distortion. For example, measurement bias can occur in image recognition data sets, where the training data is collected with one type of camera but the production data is collected with another. This type of bias can also occur due to inconsistent annotation during the labeling stage of a project.
Recall bias is a kind of measurement bias that is common at the data labeling stage. It occurs when similar types of data are labeled inconsistently, resulting in lower accuracy. Let's say, for instance, you have a team labeling images of phones as damaged, partially damaged, or undamaged. If an annotator labels one image as damaged but labels a similar image as partially damaged, your data labels will be inconsistent.
Association bias occurs when the data for a machine learning model reinforces or expands a cultural bias. A data set that includes only male doctors and female nurses, for example, doesn't mean that only men can be doctors and only women can be nurses -- but your model will operate under the assumption that women can't be doctors and that men can't be nurses. Association bias often leads to gender bias.
Data drift/model drift, mentioned in the previous section, occurs when your end users or your model's environment changes over time or develops new patterns.
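The labeling inconsistency described under recall bias can be measured when two annotators label the same items. The sketch below computes Cohen's kappa, a standard agreement statistic that corrects for agreement expected by chance. The article does not name a specific metric, so treat this as one reasonable option. The labels reuse the phone example above, and the annotations themselves are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Made-up annotations for six phone images.
annotator_1 = ["damaged", "damaged", "partially damaged",
               "undamaged", "undamaged", "damaged"]
annotator_2 = ["damaged", "partially damaged", "partially damaged",
               "undamaged", "undamaged", "damaged"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

A kappa of 1 means perfect agreement and a kappa of 0 means agreement no better than chance. Low values on a shared batch are a signal to revisit the labeling guidelines before inconsistent labels reach the training set.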
How to Reduce Bias
There are many types of bias to monitor, and applying the best practices we described in the previous section of this article will go a long way toward reducing them. Your team should also consider the following actions to reduce bias:
- Understand how your data was generated. Once you have mapped the data generation process, you can anticipate the types of bias that may appear and design interventions to either preprocess data or obtain additional data.
- Perform comprehensive exploratory data analysis. This approach involves analyzing data sets to capture their main characteristics (usually in the form of statistical graphs or other data visualization methods). This analysis provides key insight into areas of bias in your data.
- Make bias testing a part of your development cycle and a key performance indicator. If you're working with a third-party data provider, ask if they have bias detection tools you can leverage.
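Treating bias testing as a key performance indicator can be as simple as tracking accuracy per user group and reporting the gap between the best-served and worst-served groups. The sketch below assumes each evaluation example carries a group label. The function names, the group labels, and the data are hypothetical, and accuracy gap is only one of many possible fairness measures.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group to the model's accuracy on that group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

def max_accuracy_gap(per_group):
    """Difference between the best and worst per-group accuracy."""
    return max(per_group.values()) - min(per_group.values())

# Made-up evaluation set with two user groups, "a" and "b".
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a"] * 4 + ["b"] * 4

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group, max_accuracy_gap(per_group))
```

A check like this can run alongside your other tests in each development cycle, with a team-chosen limit on the acceptable gap. A large gap points to groups that may need more representative training data.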
Creating a training data set that is representative of your end users and balanced across your use cases is a proactive process. Your team will likely want to incorporate these and other best practices into a data governance framework to ensure consistency across all your projects and alignment among the people involved in building your AI application.
An ideal data governance framework will set expectations about data collection, the labeling process, data monitoring, and bias mitigation. As much as you can, create rules and processes up front that address common data concerns, but always be open to incorporating team feedback along the way.
MingKuan Liu is the senior director of data science at Appen. He has been working in automatic speech recognition, natural language processing, and search relevance ranking areas for two decades. MingKuan has led multiple teams of researchers and engineers to bring cutting-edge algorithms into real-world AI and ML solutions running at a large scale in companies including eBay, Microsoft, and Garmin. In the past few years, he has been leading the Appen data science team to develop ML-based automation solutions that combine both human and machine advantages to improve crowd workers' quality and efficiency with reduced bias.