Do You Understand the Knowns and Unknowns in Your Data?
Effective data preparation depends on recognizing how to handle what we know and what we don't know about a set of data. Our goal as analytics professionals is to make less unknown and more known.
- By Troy Hiltbrand
- May 17, 2016
In 2002, Donald Rumsfeld was asked a question regarding Iraq's alleged weapons of mass destruction. In response, he uttered a now-infamous description of intelligence:
"There are known knowns; there are things we know we know. ... There are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."
As we dissect this statement, we also get a picture of the state of analytics in any business intelligence endeavor. We can break data down into a matrix of knowns and unknowns.
There are known knowns, there are known unknowns, and there are unknown unknowns, but I would argue that there are unknown knowns as well.
To simplify this further, we can express it in terms that analytics professionals know well. On one axis we lay out our understanding of the attributes, or columns, of our data. On the other axis we lay out our understanding of the instances, or rows. This gives us the four quadrants of data understanding.
The question is: How do you clean and prepare the data in each of these quadrants to make the best use of it?
Known (Attributes) Known (Instances)
The easiest place to start with our data is where we know both the attributes and the instances in our data set. This data is tangible and present. It is not always clean, but it is a known quantity. Without making any structural changes to the data set, we can apply logic to further refine these known-known elements.
Functions such as trimming, capitalization, rounding, and character substitution are all methods of cleansing this category of data. Other cleansing functions you can use on this data analyze the relationship between two elements, for example, correcting addresses based on postal code. Once complete, we still have the same set of data attributes and instances, but the data itself is cleaner and better prepared to generate effective intelligence.
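As a rough illustration of these cleansing functions, the Python sketch below applies trimming, capitalization, character substitution, and rounding to a single record. The field names (name, phone, balance) and the specific rules are assumptions made for the example, not a prescription for any particular tool.

```python
def clean_record(record):
    """Return a cleaned copy of one record; the input dict is not modified.

    The keys "name", "phone", and "balance" are hypothetical field names.
    """
    cleaned = dict(record)
    # Trimming: drop leading/trailing whitespace and collapse internal runs.
    name = " ".join(record["name"].split())
    # Capitalization: normalize the name to title case.
    cleaned["name"] = name.title()
    # Character substitution: strip everything except digits from the phone.
    cleaned["phone"] = "".join(ch for ch in record["phone"] if ch.isdigit())
    # Rounding: keep currency amounts to two decimal places.
    cleaned["balance"] = round(record["balance"], 2)
    return cleaned


raw = {"name": "  jane   DOE ", "phone": "(208) 555-0100", "balance": 1234.5678}
print(clean_record(raw))
# {'name': 'Jane Doe', 'phone': '2085550100', 'balance': 1234.57}
```

Note that the record comes out with the same attributes and the same instance it went in with; only the values have changed.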
Known (Attributes) Unknown (Instances)
Another segment to address is where we have known data attributes but unknown values. With these data elements, we know that the attribute is part of our data set, but we don't know its value for a specific instance. This is where we might try to recreate the missing data by substituting default values or by extrapolating from known data points. (For more, read Connecting the Dots: One of Your Greatest Analytics Tools.)
Filling in a missing city and state based on known postal codes is an example of this type of data cleansing. It's about taking what is known and filling in the data that is unknown.
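The postal code example might be sketched in Python as follows. The lookup table here is a tiny hand-built stand-in for a real postal reference data set, and the field names (postal_code, city, state) are assumptions for illustration.

```python
# Hypothetical reference table; a real one would come from a postal data source.
POSTAL_LOOKUP = {
    "10001": ("New York", "NY"),
    "60601": ("Chicago", "IL"),
}


def fill_city_state(record, lookup=POSTAL_LOOKUP):
    """Return a copy of record with missing city/state filled from postal code."""
    filled = dict(record)
    match = lookup.get(record.get("postal_code"))
    if match is None:
        # Postal code is absent or not in the table: the value stays unknown.
        return filled
    city, state = match
    # Only fill values that are missing (None or empty); never overwrite data.
    if not filled.get("city"):
        filled["city"] = city
    if not filled.get("state"):
        filled["state"] = state
    return filled


print(fill_city_state({"postal_code": "60601", "city": None, "state": ""}))
# {'postal_code': '60601', 'city': 'Chicago', 'state': 'IL'}
```

This sketch deliberately leaves existing values alone. Whether a conflicting city should be corrected rather than preserved is a separate cleansing decision.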
Another kind of known unknown arises when entire instances are missing from our data set. Data latency is a major cause of this incompleteness. As ongoing transactions continue to create data, analytic snapshots grow stale and drift further from representing the complete data. The further out of sync a snapshot becomes, the less accurate any analysis of it will be.
Unknown (Attributes) Unknown (Instances)
There is also unknown unknown data. This is where we have undefined data attributes relating to instances of data that we have not accessed yet. As analytics practitioners, we are often asked questions by our business peers that we understand but can't answer, because we are missing the information we need.
This means that we are searching for one or more additional data sources. It could be a spreadsheet on someone's desktop that holds that missing piece of information or an additional external data source that will greatly simplify our analytics and help us to produce better intelligence.
We don't always know what this data will look like (what its attributes will be) and we don't know what it is or where it is, but intuitively we will know it is important when we see it.
Unknown (Attributes) Known (Instances)
The last category is the one that Rumsfeld neglected in his statement. This is where we have known instances of data, but we don't have the attributes related to this data. Many analytics practitioners spend a large portion of their time dealing with this data through feature extraction. It is often referred to as "making the data give up its secrets." The base data is there, but there is much more that can be learned about your data set by extracting unknown attributes.
This process can be as simple as extracting the month, day, and year from a date field, or as complex as extracting all of the n-gram combinations from a body of text or generating a customer value score from a customer's full transaction history.
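A minimal Python sketch of these three kinds of feature extraction follows. The function names are hypothetical, the n-grams are word-level and split on whitespace only, and the "value score" is simply total historical spend, which is an assumption made for the example; real scoring models are usually far richer.

```python
from datetime import date


def date_parts(d):
    """Split one date attribute into three new attributes."""
    return {"year": d.year, "month": d.month, "day": d.day}


def word_ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    # range() is empty when the text has fewer than n words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def customer_value(amounts):
    """Hypothetical value score: total of all historical transaction amounts."""
    return round(sum(amounts), 2)


print(date_parts(date(2016, 5, 17)))
# {'year': 2016, 'month': 5, 'day': 17}
print(word_ngrams("Making the data give up", 2))
# [('making', 'the'), ('the', 'data'), ('data', 'give'), ('give', 'up')]
```

In each case the instances are unchanged; what grows is the set of attributes that describe them.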
In the world of feature extraction, the key is to pull your known data elements apart, spin them around, and put them back together in different ways, or to create logical functions on top of the data that present that same data in a whole new way -- in a way that increases the data's analytical power.
When done correctly, the data preparation phase is all about managing the four key areas of the known-unknown matrix. When you have examined all four types of data, the data set will be clean, complete, and ready to generate the highest value intelligence for your organization.
Troy Hiltbrand is the Chief Digital Officer at Kyäni where he is responsible for digital strategy and transformation. You can reach the author at firstname.lastname@example.org.