3 Things You Need to Know about Big Data
Before you jump into a big data project, be sure to think about these three big data issues related to data integrity, data quality, and the nuances of analysis.
- By Fern Halper, Ph.D.
- March 5, 2013
Big data analytics is exciting to many organizations, and for good reason. The ability to analyze big data opens up new kinds of analysis. For example, instead of being limited to sampling large data sets, organizations can analyze much more granular and complete data, which may reveal patterns they hadn't seen before. Additionally, the ability to analyze data in real time and gain real-time insights can be a competitive differentiator for an organization.
However, analyzing big data does have its own set of issues. Here are three related to data integrity, data quality, and the nuances of analysis worth thinking about.
#1: Big data can come from untrusted sources
Big data analysis often involves aggregating data from various sources. These may include both internal and external data sources. For instance, you may want to aggregate data extracted from unstructured data sources such as e-mails and call center notes together with structured data about your customers from your data warehouse, or you may be interested in analyzing social data.
The question is: how trustworthy are those external sources of information? For example, how trustworthy is social media data such as a tweet? The information may come from an unverified source, or the information itself may be false or misleading. The integrity of this data therefore needs to be considered in the analysis. Before you start using this data, understand your data sources and decide what you are comfortable including in your analysis.
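As an illustration, here is a minimal sketch of one way to screen social posts by source before analysis. The field names ("source", "verified", "text") and the trusted-source list are hypothetical; adapt them to whatever your feed actually provides.

```python
# A minimal sketch of screening social posts by source before analysis.
# The field names and the trusted-source list are hypothetical --
# substitute whatever metadata your social media feed actually carries.

TRUSTED_SOURCES = {"company_blog", "newswire", "verified_twitter"}

def is_trustworthy(post: dict) -> bool:
    """Keep a post only if its source is on our trusted list or the
    account behind it has been verified."""
    return post.get("source") in TRUSTED_SOURCES or post.get("verified", False)

posts = [
    {"source": "verified_twitter", "verified": True, "text": "Great launch!"},
    {"source": "anonymous_forum", "verified": False, "text": "Product X is a scam"},
]

trusted = [p for p in posts if is_trustworthy(p)]
print(f"Kept {len(trusted)} of {len(posts)} posts for analysis")
```

The point is not the specific rule but that the trust decision is made explicitly, up front, rather than silently mixing unverified data into the analysis.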
#2: Big data is dirty
Dirty data is inaccurate, incomplete, incorrect, duplicate, or otherwise erroneous data. It may include misspelled words, data produced by equipment that is broken or corrupted in some way, or even duplicates such as retweets or company press releases that appear numerous times in social media.
Here's a simple example of how this can impact your analysis. Suppose you are performing a competitive analysis using social media data. You want to see how often your competitor's product appears in the external sources you are monitoring and the sentiment associated with those posts. You look at the data and see that the number of positive posts about the competitor is twice the number about your own product. Aside from the fact that sentiment scoring needs to be checked carefully, this could simply be a case where the competitor is pushing its own press releases out to multiple sources (in essence, tooting its own horn) or getting lots of people to retweet an announcement.
In this case, you need to decide how you want to handle duplicate social media data; there are cases where you might want to keep the duplicates because the repetition itself is valuable information (a minimal deduplication sketch follows below).
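To make this concrete, here is a minimal deduplication sketch. The retweet and URL patterns are simplified assumptions about what social media text looks like; real feeds need more robust normalization. Note that it keeps duplicate counts rather than discarding them, since the repetition itself can be informative.

```python
import re

def normalize(text: str) -> str:
    """Strip retweet prefixes, URLs, and extra whitespace so that
    near-identical posts (retweets, syndicated press releases) collide."""
    text = re.sub(r"^RT @\w+:\s*", "", text)   # drop retweet prefix
    text = re.sub(r"https?://\S+", "", text)   # drop links
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(posts: list[str]) -> dict[str, int]:
    """Collapse duplicate posts, keeping a count of how often each
    appeared -- the count itself may be worth analyzing."""
    counts: dict[str, int] = {}
    for text in posts:
        key = normalize(text)
        counts[key] = counts.get(key, 0) + 1
    return counts

posts = [
    "Competitor X launches new widget! http://example.com/pr",
    "RT @competitorx: Competitor X launches new widget! http://t.co/abc",
    "Loving my new widget from Competitor X",
]
print(deduplicate(posts))  # the press release collapses to one entry with count 2
```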
In general, the cleansing strategy you employ will depend on the source and type of data and the goal of your analysis. Big data cleansing strategies are evolving (perhaps not rapidly enough), and although some people debate whether big data needs to be cleaned, the reality is that it does -- except, perhaps, if you're developing a filter where your goal is to detect bad elements in the data.
#3: Big data changes, especially in data streams
If you're going to analyze big data streams, be aware that the quality of the data can change, or that the data itself can change, because the conditions under which you're capturing it may change.
For instance, imagine a utility company collecting weather data as part of a big data analysis that also uses smart-meter data and other geospatial data to predict events or actions. What happens when, while analyzing this data in real time, you discover a gap in the smart-meter data? What if an environmental variable monitored by the smart meters changes?
How do you deal with this in your own analysis? Change detection algorithms in stream mining are an active area of research in data quality and analysis. (See, for example, work done by Tamraparni Dasu et al. at http://www.research.att.com/export/sites/att_labs/techdocs/TD_100014.pdf.) At a minimum, you need to be aware of these issues before you start trying to analyze a big data stream.
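To illustrate the basic idea, here is a toy sketch of windowed change detection on a numeric stream such as smart-meter readings. It simply compares the mean of a recent window against an earlier reference window; the window size and threshold are illustrative, and the research-grade detectors cited above are far more principled.

```python
from collections import deque

def detect_change(stream, window=20, threshold=5.0):
    """Toy change detector: flag a reading when the mean of the most
    recent `window` values drifts more than `threshold` away from the
    mean of an initial reference window. Parameters are illustrative."""
    reference = deque(maxlen=window)
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        if len(reference) < window:
            reference.append(value)   # still building the baseline
            continue
        recent.append(value)
        if len(recent) == window:
            ref_mean = sum(reference) / window
            rec_mean = sum(recent) / window
            if abs(rec_mean - ref_mean) > threshold:
                yield i, ref_mean, rec_mean

# Simulated stream: stable readings around 10, then a shift to around 20.
stream = [10.0] * 50 + [20.0] * 50
for index, before, after in detect_change(stream):
    print(f"Change detected at reading {index}: mean {before:.1f} -> {after:.1f}")
    break
```

A production detector would also handle gaps (missing timestamps), seasonality, and gradual drift, but even this sketch shows why you cannot assume a stream's statistical behavior stays fixed.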
A Final Word
Of course, there are many more issues than the three listed here. Some are unique to big data analytics, and some are important in all kinds of data analysis. Organizations using big data analytics must make sure they have access to the skills needed to deal with the nuances of analyzing big data and the governance in place to manage it.