Without Context, Big Data is Flat Data
It's not enough to know what your customers are doing. You need context to understand why customers do what they do.
- By Jeff Catlin
- March 1, 2016
Let's face it: these days, big data is big. It's in the press every day, but what exactly is big data? In truth, it can be anything from highly structured numerical content, such as the output of the Large Hadron Collider, to more mundane sources, such as Google search logs or all the tweets made in the last 30 days.
For our purposes, let's stick with the semi-structured or unstructured content sources of big data. The lack of structure in this sort of content makes it harder to understand what's going on in the data and spot the patterns and insights hiding within it (the hidden business value). To solve this problem, a variety of technologies have emerged, including "machine learning," "deep learning," and text mining. Before diving in and extolling the virtues of these technologies, let's take a step back and look at the problem we want big data to solve: understanding not just what is happening, but why it's happening (the context of the data).
Why is it so important to understand the "context" of the information we're dealing with? To put it in layman's terms, context is the flavor in food. We could live without a sense of taste and smell, but without them we'd miss out on what makes food worth eating. Similarly, we can mine big data and see that Apple and the iPhone are mentioned everywhere, but it's the context that tells us why iPhone users are such rabid loyalists (hint: it's the superior user experience). With context we gain some understanding of why iPhones are so popular.
How do we dig in and understand the context? We've already mentioned two of these technologies: machine learning and text analysis. Machine learning is everywhere these days, so it's almost certain you've heard the phrase, most likely associated with IBM Watson and the Jeopardy! TV show, where Watson beat two human champions in a general knowledge contest. What most people don't realize is that Watson is much more than simple machine learning.
The machine that beat those champions was a blend of technologies, and it's that mix of machine learning and text analysis that made Watson what it is. Machine learning is a wonderful technology for classifying and cataloging information (this is about baseball, that is a car review), but it's not very good at adding flavor to that classification. The sentence "I wish GM created a cool new sports car" is very different from "GM created a cool new sports car," and it's text analysis that lets us dig deeper and understand the flavor of the statement: the first is a desire for something, and the second is a positive statement about something that exists.
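The distinction can be sketched in a few lines of code. In this toy illustration (the categories, keyword lists, and desire-verb list are invented for the example), a bag-of-words classifier files both sentences under the same topic, while a simple lexical check for desire verbs is what tells a wish apart from a statement of fact:

```python
# Toy sketch: a bag-of-words topic classifier sees both sentences as
# "automotive"; only a closer look at the wording separates a wish
# from a statement. Keyword lists and categories are invented.
TOPIC_KEYWORDS = {
    "automotive": {"gm", "car", "sports"},
    "baseball": {"pitcher", "inning", "home", "run"},
}

DESIRE_VERBS = {"wish", "want", "hope"}

def classify_topic(sentence):
    words = set(sentence.lower().split())
    # Pick the topic whose keyword set overlaps the sentence the most.
    return max(TOPIC_KEYWORDS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))

def is_desire(sentence):
    # A desire verb turns "GM created..." into a wish, not a fact.
    return any(w in DESIRE_VERBS for w in sentence.lower().split())

a = "I wish GM created a cool new sports car"
b = "GM created a cool new sports car"

print(classify_topic(a), classify_topic(b))  # automotive automotive
print(is_desire(a), is_desire(b))            # True False
```

Both sentences classify identically; only the desire check captures the difference in flavor.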
How do you extract context from the content (the flavor of the food)? As humans, we understand the difference between "I wish GM created" and "GM created." One is a desire and the other is an opinion about an actual thing. As a person who might be looking to buy a car, we would put more emphasis on the opinion than the desire. How do we get a machine to understand that these nearly identical sentences have very different contexts? Grammar parsing answers this need, allowing a machine to codify human understanding, particularly as it relates to things like sentiment. Let's look at the grammar parse of "I wish GM created a cool new car."
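The essential structure of such a parse can be sketched with a minimal hand-built tree (the role labels below are illustrative only; a real grammar parser would use its own label scheme). The key point is that "created" sits under "wish," so anything positive about the car is governed by a desire verb:

```python
# Hand-built sketch of a dependency-style parse of
# "I wish GM created a cool new car". Role labels are illustrative.
parse = {
    "word": "wish", "role": "root",            # governing verb: a desire
    "children": [
        {"word": "I", "role": "subject", "children": []},
        {"word": "created", "role": "complement",  # the wished-for event
         "children": [
             {"word": "GM", "role": "subject", "children": []},
             {"word": "car", "role": "object",
              "children": [
                  {"word": "a", "role": "determiner", "children": []},
                  {"word": "cool", "role": "modifier", "children": []},
                  {"word": "new", "role": "modifier", "children": []},
              ]},
         ]},
    ],
}

def show(node, depth=0):
    # Print the tree with indentation, e.g. "wish (root)".
    print("  " * depth + f"{node['word']} ({node['role']})")
    for child in node["children"]:
        show(child, depth + 1)

show(parse)
```

Because "cool" modifies "car" inside the complement of "wish," any sentiment it carries is attached to a desired event, not a real one.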
The correct grammar parse shows the beginnings of how we ascertain that this isn't really good news for GM because it represents a desire. With this parse we have rules for how things such as sentiment get attached to the entities described. "GM created a cool new car" is clearly good news for GM because cool is a positive word in the context of cars.
How do we then figure out that "wish" weakens the sentiment? It turns out that action words such as "wish" or "instructed" carry rules that modify the sentiment in the tree: "wish" weakens the sentiment to its right, while "instructed" moves the sentiment on its right to the instructing party on its left ("I"), so "I instructed GM to create a cool new car" would actually be good sentiment for me. It's this in-depth understanding of the text that lets us glean the context of a document and make better business decisions.
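These two rules can be sketched as a toy sentiment pass over flat token lists (the lexicon, the half-weight factor, and the rule mechanics are all invented for this sketch; a production system would apply such rules over a full parse tree):

```python
# Toy illustration of the two rules described above. The lexicon,
# weights, and rule behavior are invented for this sketch.
POSITIVE = {"cool": 1.0, "great": 1.0}
ENTITIES = {"GM", "I"}

def sentence_sentiment(tokens):
    """Attach sentiment to an entity, applying two toy rules:
    "wish" weakens (halves) sentiment appearing after it, and
    "instructed" redirects sentiment to the entity before it."""
    raw = sum(POSITIVE.get(t.lower(), 0.0) for t in tokens)
    lowered = [t.lower() for t in tokens]
    # Default target: the first named entity other than the speaker.
    target = next((t for t in tokens if t in ENTITIES and t != "I"), "I")
    if "instructed" in lowered:
        # Sentiment moves to the instructing party on the left.
        target = tokens[lowered.index("instructed") - 1]
    elif "wish" in lowered:
        # The event is only desired, so its sentiment is weakened.
        raw *= 0.5
    return target, raw

print(sentence_sentiment("GM created a cool new car".split()))           # ('GM', 1.0)
print(sentence_sentiment("I wish GM created a cool new car".split()))    # ('GM', 0.5)
print(sentence_sentiment("I instructed GM to create a cool new car".split()))  # ('I', 1.0)
```

The plain statement credits GM with full positive sentiment, the wish weakens it, and the instruction transfers it to the speaker, matching the behavior the rules describe.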
A big data collection of car aficionados is going to have information about "cool new sports cars", but without context we'll never understand whether they are talking about rumored production of new products or the recent release of a new Corvette. If we don't have the context of the discussion, we could easily make bad inferences about what the data is telling us and use this factually accurate yet incomplete picture to make really bad business decisions.
Without context, big data is flat data. That may even understate context's importance, because without context, big data insights can be bad data insights. With good context, however, you'll really understand the "what" and the "why" of these mountains of information so you can make insightful and reliable decisions. As you begin to consider what big data to leverage in your enterprise and how, ensure that, if it includes unstructured information, you select a technology that can mine this information deeply enough to lead you to reasoned and accurate decisions.
Jeff Catlin is the CEO of Lexalytics, the leader in cloud and on-prem text analytics solutions. With over 20 years of experience in the fields of search, classification and text analytics products and services, Jeff held senior management positions at Thomson Financial, Sovereign Hill Software and LightSpeed Software prior to founding Lexalytics. You may contact Jeff at jeff.catlin (at) lexalytics.com or sales (at) lexalytics.com.