Improving the Quality of Data on Hadoop
Data quality on Hadoop is becoming more important as more critical data is being stored there. Consider the automation and performance advantages of an on-Hadoop data quality solution that cleanses data without it ever leaving the cluster.
- By Jake Dolezal
- October 3, 2016
As the value and volume of data explode, so does the need for mature data management. Big data is now receiving the same treatment as relational data -- integration, transformation, process orchestration, and error recovery -- so the quality of big data is becoming critical.
Hadoop Needs Scalable Data Quality Practices
Because of the promise and capacity of Hadoop, data quality was initially overlooked. However, not all Hadoop use cases are for analytics; some are driving critical business processes. Data quality is now a key consideration for process improvement and decision making based on data coming out of Hadoop.
With the size of our data stores in Hadoop, we must consider whether data quality practices can scale to the potential immensity of big data. Hadoop obviously shatters the limits of data storage, not only in terms of data volume and variety but also in terms of structure. One way that data quality is maintained in a conventional data warehouse is by imposing strict limits on the volume, variety, and structure of data. This is in direct opposition to the advantages that Hadoop and NoSQL offer.
Data Quality Can Be Relative
We must also consider the cost of poor data quality within a Hadoop cluster. From an analytics perspective, "bad data" may not be as troublesome as it once was, if we consider the statistical insignificance of incorrect, incomplete, or inaccurate records. The effect of a statistical outlier or anomaly is reduced by the massive amounts of data around it; the sheer volume effectively drowns it out.
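The "drowning out" effect can be checked with simple arithmetic. The following is a minimal sketch, not a statistical proof: it assumes one isolated bad record among otherwise identical good records, and the function name and values are hypothetical. It shows only how the shift in an average shrinks as volume grows; it says nothing about systematic errors that affect many records at once.

```python
def mean_shift_from_bad_record(n_good, good_value, bad_value):
    """Return how far one bad record moves the mean of n_good identical good records."""
    clean_mean = good_value
    dirty_mean = (n_good * good_value + bad_value) / (n_good + 1)
    return abs(dirty_mean - clean_mean)


# Hypothetical values: good records all read 50.0, the single bad record reads 5000.0.
for n in (10, 1_000, 1_000_000):
    shift = mean_shift_from_bad_record(n, good_value=50.0, bad_value=5000.0)
    print(f"{n:>9} good records: mean shifts by {shift:.4f}")
```

The shift is proportional to 1 / (n + 1), so the same bad record that badly distorts a ten-row sample barely registers against a million rows.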
In conventional data analysis and data warehousing practice, "bad data" was something to be detected, cleansed, reconciled, and purged. Rigorous data hygiene measures and practices were put in place, including record-by-record manual data correction, in order to purify our sources and data stores. People are bothered by the thought of poor-quality data being present in their analytics (garbage-in-garbage-out thinking), even though it is always a reality.
However, the volume and variety of big data make conventional data quality measures impractical. Remediating row by row in a Hadoop store would take an army of data custodians, and the result wouldn't be worth the time because the business value density of big data is low.
It may be useful to think of big data as a "big picture" view rather than a "perfect picture" view.
Benefits of On-Hadoop Data Quality Tools
However, in a growing number of cases, the data going into or coming out of Hadoop may be mission critical. In those cases, higher vigilance is required.
The solution may be to turn to an on-Hadoop data quality tool. These data cleansing tools actually run the data standardization engine on Hadoop itself, taking advantage of the cluster's massive parallel performance.
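To make the idea of running standardization on the cluster more concrete, here is a minimal sketch of a record-level cleansing step written as a plain Python function. It is not how any particular data quality product works. The three-field, tab-separated layout (customer ID, name, state) and the function names are assumptions made for illustration. Because the function handles one line at a time with no shared state, a step like this could in principle be run as a mapper (for example, under Hadoop Streaming, reading lines from standard input), so each node cleanses the data it already holds.

```python
def standardize(line):
    """Standardize one tab-separated record (hypothetical layout: id, name, state).

    Returns the cleaned line, or None if the record is malformed.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    customer_id, name, state = (field.strip() for field in fields)
    name = " ".join(name.split()).title()  # collapse whitespace, normalize case
    state = state.upper()
    return "\t".join((customer_id, name, state))


def run_mapper(lines):
    """Yield cleaned records; malformed ones are skipped in this sketch.

    A real job would more likely route rejects to a separate output.
    """
    for line in lines:
        cleaned = standardize(line)
        if cleaned is not None:
            yield cleaned


# In a streaming job, `lines` would be standard input; here it is a small sample.
sample = ["42\t  jOHN   smith \t tx\n", "not a valid record\n"]
for record in run_mapper(sample):
    print(record)
```

The point of the design is that nothing in `standardize` depends on any other record, which is what lets the work spread across the cluster without moving the data.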
An off-Hadoop data quality tool is typically a data integration tool with data quality components and capabilities; it takes the data from Hadoop, cleanses it, and puts it back. This method involves a lot of performance overhead, but an off-Hadoop tool makes sense if you are moving data off your Hadoop cluster and into other data stores anyway.
One other key to handling big data quality is to automate conventional data quality and stewardship processes to detect, correct, and prevent data quality issues. This means enforcing rule-based data quality within the data en masse. Architecturally, this strengthens the case for an on-Hadoop data cleansing capability.
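Rule-based enforcement of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not a reference design: the `Rule` class, the two rules, and the record fields are all hypothetical. Each rule detects a problem, optionally corrects it, and otherwise flags the record so it can be kept out of downstream data.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    is_valid: Callable[[dict], bool]
    fix: Optional[Callable[[dict], dict]] = None  # None means "detect only"


def apply_rules(records, rules):
    """Return (accepted, rejected); rejected pairs each record with its failed rule names."""
    accepted, rejected = [], []
    for record in records:
        current = dict(record)  # work on a copy, never mutate the input
        failed = []
        for rule in rules:
            if rule.is_valid(current):
                continue
            if rule.fix is not None:
                current = rule.fix(current)  # attempt an automatic correction
            if not rule.is_valid(current):
                failed.append(rule.name)
        if failed:
            rejected.append((current, failed))
        else:
            accepted.append(current)
    return accepted, rejected


# Hypothetical rules for illustration.
RULES = [
    Rule(
        name="state_is_two_uppercase_letters",
        is_valid=lambda r: isinstance(r.get("state"), str)
        and len(r["state"]) == 2 and r["state"].isalpha() and r["state"].isupper(),
        fix=lambda r: {**r, "state": str(r.get("state", "")).strip().upper()},
    ),
    Rule(
        name="age_in_plausible_range",
        is_valid=lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    ),
]

records = [
    {"id": 1, "state": " tx", "age": 34},    # correctable
    {"id": 2, "state": "OK", "age": -5},     # no fix available for age
    {"id": 3, "state": "Texas", "age": 51},  # fix does not produce a valid state
]
accepted, rejected = apply_rules(records, RULES)
print(accepted)
print(rejected)
```

Because the rules are data rather than manual procedures, the same list can be applied to every record in bulk, which is what makes automation feasible at this scale.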
Dr. Jake Dolezal is practice leader of Analytics in Action at McKnight Consulting Group Global Services, where he is responsible for helping clients build programs around data and analytics.