Connecting the Dots: One of Your Greatest Analytics Tools
Filling the gaps in data can mean the difference between an analytics project's success and failure. Choosing the right method is a combination of art and science that requires understanding the business, the data, and the targeted decision.
- By Troy Hiltbrand
- May 9, 2016
Dirty data. Bad data. Missing data. These challenges are faced by analytics practitioners every day and follow the old adage "garbage in, garbage out." Such data can result in faulty or weakened analytics and ultimately skewed decision making.
To accomplish great insights through analytics, first you must plug the holes in your data. This requires you to connect the dots between known data points and fill in the missing values.
Numeric, continuous data is often the most difficult type to backfill due to the many potential values that the missing data points can have. This is especially important in time-series analysis. If you are missing values from a specific time period, creating a model that extrapolates into the future is difficult and the model's predictive value is weakened if you've replaced missing data values with poor estimates.
With missing numeric data points, there are several ways to create a surrogate value that can represent the missing data, each with different benefits. Although these proxies are better than a data set that looks like Swiss cheese, the ultimate goal is to have as much "real" data as possible.
Use a Default Value
The first and easiest method of filling in missing data is by applying default values. This eliminates the problem of having a null data point but can come with unintended side effects if you don't fully understand the business environment and the impact of the default value.
For example, a common default value is zero. When working with statistical calculations, a zero in the data can artificially skew the results towards that zero value and potentially give you an incorrect answer.
If you are looking at month-over-month sales, a zero means that no sales occurred in a particular time period. This value, when plugged in for missing data, could create a highly inaccurate model. A zero represents a 100 percent decline in business from one period to the next and then infinite percentage growth in the following period (because you can't divide by zero).
If you are looking at a near-term forecast, such as two months, this creates a very sharp decline or increase. If you projected that trend into the future, you would create an untrustworthy prediction that does little to help in your decision-making.
Use a Population Average
Another approach is to take the average or median of the entire data set and use that as a default for all missing values. This assumes that history is repeating itself at the point of the missing data.
With this approach you must consider where to define the boundaries of the calculation. Including too many periods of data can be problematic because you are assuming that a consistent pattern has been occurring for a long time. In periods of recent growth or decline, a long-term average could mask this pattern by using data that is "too old." On the other hand, using too few data points can miss longer-term patterns.
This method is often better than replacing the missing values with a default value such as zero, because it is much more representative of the business, but it is not foolproof.
Use a Rolling Average
Going one step further, a third approach is to take the points that directly surround the missing data point (before and/or after it) and create a rolling average from that limited scope of data points. This is similar to using a population average from a small set of data, but the result varies for each missing point because the sample set includes only the points immediately adjacent to the missing data point.
This might not represent the actual missing value, but it will provide a relatively close value, assuming that there was not a huge disruption in business during that time period. You should decide how many periods to use to determine the rolling average depending on the patterns in the data and your business needs.
Examine Cyclical Patterns or Other Data Categories
If the data is cyclical, a relative set of periods might be more advantageous to average. Whether these cycles occur based on the week of the month, month of the quarter, or month of the year, taking a set of data points from corresponding periods in the cycle and using those as the basis for determining an average or median can create a more accurate replacement value.
At times, you may work with data which is not part of a time series and still have missing numeric, continuous values. In these cases, you can still apply population averages and medians to create a replacement value for missing data, but you have to define appropriate population boundaries. Just as before, a complete population is the easiest but not always the most accurate.
You can also segment your data based either on categories that exist within the data or are extrapolated through unsupervised learning, in which natural clusters of data manifest themselves through statistical proximity. Algorithms, such as k-means clustering or k-medians clustering, can find patterns in the data to create categorical boundaries. It is not necessary to understand fully what the data categories represent in real world terms to be able to use them to ascertain missing values.
Whether or not you're dealing with time-series data, missing numerical data can impede your ability to create effective models for advanced analytics. Filling those data gaps can greatly improve analytics effectiveness. Your ability to connect the dots can be the difference between success and failure and is a critical tool in any analytics professional's toolbox.
Troy Hiltbrand is the chief digital officer at Kyäni where he is responsible for digital strategy and transformation. You can reach the author at firstname.lastname@example.org.