TDWI Articles

Managing by Exception: How to Identify Outliers

With information creation at an all-time high, data professionals have to be able to distinguish between what matters and what doesn't and present only key insights to their users.

The rate of data creation has never been greater. According to IBM, a 2011 study estimated that 2.5 quintillion bytes of data were created every day and that 90 percent of the world's data had been created in the previous two years. With all this data, it is no wonder that people -- including your end users -- are experiencing information overload and fatigue.

The key to combating information overload is to deliver only those insights that have the highest value. To do this, you must identify exceptions to the norm within the data. Much of your data falls within the range of what is considered typical or "normal." When data points fall outside of that normal range, that often piques users' interest.

Once you have identified these outliers, you can build response processes around them to get the right information to the right people at the right time. Use these common techniques to help isolate these outside-the-norm data points.

Thresholds

One of the first methods you can employ is to predefine thresholds related to key performance indicators. These thresholds are often in the form of specified numbers (for example, upper or lower daily sales limits) or a percentage increase or decrease in a KPI (such as an increase in sales over the same day last year).

Thresholds are defined as the upper and lower limits of what the business considers typical or usual. When data moves out of this predefined band, it may require an action in response, such as notifying key stakeholders.
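A threshold check can be sketched in a few lines. This is a minimal illustration, not a production alerting system; the band limits, the 20 percent year-over-year cutoff, and the sample sales figures are all assumed values for the example.

```python
def outside_threshold(value, lower, upper):
    """Return True when a KPI value falls outside the predefined band."""
    return value < lower or value > upper

def pct_change_alert(today, same_day_last_year, max_change=0.20):
    """Flag a KPI whose year-over-year change exceeds max_change (e.g., 20%)."""
    change = (today - same_day_last_year) / same_day_last_year
    return abs(change) > max_change

# Illustrative daily sales figures checked against an assumed band.
daily_sales = [9800, 10500, 14200, 9900]
alerts = [s for s in daily_sales if outside_threshold(s, lower=8000, upper=12000)]
print(alerts)  # [14200]
```

In practice, the band and the percentage cutoff would come from the business definition of "typical," and the alert would feed a notification process rather than a print statement.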

Statistical Outliers

Statistics provides another method for identifying outliers. One of the fundamental concepts of statistics is the normal distribution. In normally distributed data, approximately 68 percent of data points fall within one standard deviation on either side of the average value. Moving out to two standard deviations on either side of the average captures roughly 95 percent of the data, and expanding to three standard deviations covers 99.7 percent.

Depending on your target objective, looking for outliers could entail finding those data points that fall outside of these standard deviation bands. The average value and how data is distributed -- the basis of the standard deviation -- can change, so unlike preset thresholds, the standard deviation dynamically fluctuates depending on the data set over time.
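This standard-deviation test can be expressed directly with the Python standard library. The sketch below assumes a two-sigma cutoff and a made-up list of readings; because the mean and standard deviation are recomputed from whatever data is passed in, the bands adjust as the data set changes, unlike a fixed threshold.

```python
import statistics

def sigma_outliers(data, n_sigmas=3):
    """Return points more than n_sigmas standard deviations from the mean."""
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > n_sigmas * stdev]

# Illustrative readings: 45 sits far outside the rest of the data.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 45]
print(sigma_outliers(readings, n_sigmas=2))  # [45]
```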

Data Clustering

Related to statistical processing is the concept of clustering data. In this case, you group data points together based on similarities. Popular clustering methods include k-means and k-medians. These methods organize the data set into a predefined number of groups based on how similar the attributes of the data points are.

You don't have to name these clusters up front; instead, you let the k-means or k-medians method bring order to a disorganized data set. After you see the groupings, you can evaluate the data points within a cluster and specify what the instances within that cluster represent. At this point you can associate a human-friendly name with each cluster.

For example, you might be looking at demographic data and divide it into two groups. By looking at the similarities within those two groups, you might observe that one cluster represents Millennials and the other Gen Xers. These become more understandable categories for an information consumer, but the process did not need to know that you were looking for Millennials or Gen Xers before it identified the clusters.

Once you have identified these clusters, you have a fuller model to work from in identifying data points of interest. If you have a cluster that represents those data points of interest to you, a new data point that falls inside that cluster can trigger an action.

In addition, with multiple clusters, you have multiple independent, normally distributed data sets. Within each cluster, you can apply the statistical outlier methodology previously discussed to identify data points of interest using the mean and standard deviation for each cluster. Data outside a cluster could also represent data of interest.
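Applying the standard-deviation check within each cluster independently might look like the following sketch. The cluster memberships, ages, and two-sigma cutoff are assumed values for illustration; in practice the clusters would come from the clustering step itself.

```python
import statistics

# Assumed cluster memberships; 39 is unusually high for the first group.
clusters = {
    "millennials": [24, 26, 27, 25, 23, 24, 26, 39],
    "gen_x":       [44, 46, 45, 43, 47],
}

def cluster_outliers(members, n_sigmas=2):
    """Flag points far from their own cluster's mean, not the global mean."""
    mean = statistics.mean(members)
    stdev = statistics.stdev(members)
    return [x for x in members if abs(x - mean) > n_sigmas * stdev]

for name, members in clusters.items():
    print(name, cluster_outliers(members))
# millennials [39]
# gen_x []
```

The point of scoping the check per cluster is that 39 is unremarkable globally but stands out within its own group.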

With k-means and k-medians clustering, the first step is to specify how many clusters to break the data into. As you look at outliers within these clusters, you might find that your original number of clusters is not adequate. You may discover additional clusters that exist within the data. Regenerating your clusters with this additional insight will help you more accurately identify outliers.

Markov Chain

The final method of identifying outliers we'll discuss deals with patterns in sequential data. A Markov chain uses the probabilities of event sequences in historical data to predict likely outcomes for a future event.

For example, if the pattern of events X-Y-Z is common in your historical data and your process encounters a series of events that represent X-Y, it is highly probable that Z will follow. If a sequence occurs that was rarely or never observed before, that is cause for action. If an X-Y sequence occurs and the next event is A instead of Z, this could be an indicator of something abnormal and is an opportunity to alert your end users.
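A first-order version of this idea can be sketched as follows: estimate transition probabilities from historical sequences, then flag any observed transition whose probability falls below a floor. The event names, the historical sequences, and the 5 percent floor are illustrative assumptions.

```python
from collections import Counter, defaultdict

def transition_probs(sequences):
    """Estimate P(next event | current event) from historical sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(following.values())
                for nxt, n in following.items()}
        for state, following in counts.items()
    }

def is_anomalous(probs, current, nxt, floor=0.05):
    """A transition below the floor -- or never seen -- is anomalous."""
    return probs.get(current, {}).get(nxt, 0.0) < floor

# Assumed history in which Z always follows Y.
history = [["X", "Y", "Z"], ["X", "Y", "Z"], ["X", "Y", "Z"]]
probs = transition_probs(history)
print(is_anomalous(probs, "Y", "Z"))  # False: Z is the expected next event
print(is_anomalous(probs, "Y", "A"))  # True: A was never observed after Y
```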

Taking Action

Each of these methods helps you identify data points that represent something outside the norm. When one of these data points is identified, there are multiple possible courses of action.

The data could be directed to key decision makers who can evaluate and address it. The identification of an outlier could fire an event to downstream processes, which could automatically investigate further or enrich the data to create a fuller picture of the specific instance before it is sent on to the information consumer. Alternatively, that process could evaluate the outlier against other known information (possibly from other systems) and discard it before it reaches the information consumer. The possibilities are endless.

With today's information overload, it is critical that data providers comb through the data and identify those points that are of significant interest to their end users. There are multiple methods both for identifying these outliers and for responding to them in your systems.

In the end, the goal is to provide actionable data to your decision makers so they can most effectively run your business and respond to a constantly changing environment.

About the Author

Troy Hiltbrand is the senior vice president of digital product management and analytics at Partner.co, where he is responsible for its enterprise analytics and digital product strategy. You can reach the author via email.

