Top 10 Priorities for Data Quality Solutions
Data quality solutions need to grow into more technologies and address new business requirements
- By Philip Russom, Ph.D.
- February 12, 2013
The 10 priorities listed below provide an inventory of techniques, team structures, tool types, methods, mindsets, and other characteristics that are desirable for a fully modern, next-generation data quality (DQ) solution. Few organizations will need or want to embrace all ten priorities; you should pick and choose according to your organization's business and technology requirements. My intent is to help user organizations prioritize and plan their next-generation data quality program or solution.
Priority #1: Broader scope for data quality
We say "data quality" as if it were a single, solid monolith. In reality, DQ is a family of eight or more related techniques. Data standardization is the most commonly used technique, followed by verification, validation, monitoring, profiling, matching, and so on. TDWI regularly encounters user organizations that apply just one technique, sometimes to just one dataset or one data domain. Most DQ solutions need to expand into more DQ techniques, datasets, and data domains.
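To make the family concrete, here is a minimal Python sketch (the field and sample values are hypothetical) showing how three of these techniques differ when applied to the same phone-number column: validation tests a rule, standardization reshapes the value, and profiling summarizes which formats are actually present.

```python
import re
from collections import Counter

def validate_phone(value: str) -> bool:
    """Validation: does the value conform to a rule (10 digits)?"""
    return len(re.sub(r"\D", "", value)) == 10

def standardize_phone(value: str) -> str:
    """Standardization: reshape the value into one agreed format."""
    digits = re.sub(r"\D", "", value)
    return f"({digits[0:3]}) {digits[3:6]}-{digits[6:10]}" if len(digits) == 10 else value

def profile_phones(values: list[str]) -> Counter:
    """Profiling: summarize the formats actually present in a column."""
    return Counter(re.sub(r"\d", "9", v) for v in values)

phones = ["555-867-5309", "(555) 867-5309", "5558675309", "n/a"]
print([validate_phone(p) for p in phones])     # [True, True, True, False]
print([standardize_phone(p) for p in phones])  # valid values all become "(555) 867-5309"
print(profile_phones(phones))                  # frequency of each format pattern
```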
Priority #2: Real-time data quality
According to a TDWI survey, real-time data quality (RTDQ) is the second fastest growing data management discipline, after master data management (MDM) and just before real-time data integration. Make RTDQ a high priority so data can be cleansed and standardized as it's created or updated.
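As a minimal sketch of the idea, assuming a hypothetical per-record hook supplied by the application or message queue, cleansing can run inline so data is already standardized the moment it lands:

```python
# A minimal sketch of real-time DQ: cleanse each record as it is created,
# before it is written, rather than in a nightly batch. The hook, field
# names, and cleansing rules here are hypothetical.

def cleanse(record: dict) -> dict:
    """Standardize and validate one record at creation time."""
    record["email"] = record.get("email", "").strip().lower()
    record["country"] = record.get("country", "").strip().upper()[:2]  # ISO-2 style code
    record["valid"] = "@" in record["email"]
    return record

def on_record_created(record: dict, write):
    """Hook invoked per record by the application or message queue."""
    write(cleanse(record))  # data is clean before it reaches storage

# Example: a record flows through cleansing on its way to storage.
store = []
on_record_created({"email": " Alice@Example.COM ", "country": "usa"}, store.append)
print(store)  # [{'email': 'alice@example.com', 'country': 'US', 'valid': True}]
```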
Priority #3: Data quality services
DQ techniques need to be generalized so they are available as services that can be called from a wide range of tools, applications, databases, and business processes. Data quality services enable greater interoperability among tools and modern application architectures, as well as reuse and consistency across DQ solutions.
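A minimal sketch of the pattern, using a hypothetical in-process registry in place of a real service bus or HTTP layer: each DQ function is written once, registered under a service name, and invoked through a single entry point, so every caller gets the same behavior.

```python
# Hypothetical service registry: DQ functions written once, callable by
# name from any tool or process that can reach the entry point.
DQ_SERVICES = {}

def dq_service(name):
    """Register a DQ function under a service name."""
    def register(func):
        DQ_SERVICES[name] = func
        return func
    return register

@dq_service("standardize_name")
def standardize_name(payload: dict) -> dict:
    return {"value": " ".join(payload["value"].split()).title()}

@dq_service("validate_zip")
def validate_zip(payload: dict) -> dict:
    v = payload["value"]
    return {"valid": v.isdigit() and len(v) in (5, 9)}

def call(service: str, payload: dict) -> dict:
    """Single entry point; an HTTP or message-bus wrapper would sit here."""
    return DQ_SERVICES[service](payload)

print(call("standardize_name", {"value": "  philip   RUSSOM "}))  # {'value': 'Philip Russom'}
print(call("validate_zip", {"value": "02134"}))                   # {'valid': True}
```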
Priority #4: Coordination with other data management disciplines
DQ functions are beneficial to related data management disciplines. For example, DQ functions should be applied to the reference data managed by an MDM solution, and data integration solutions invariably uncover DQ problems and opportunities.
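As a minimal sketch of that coordination, assuming a hypothetical MDM-managed reference table: the same DQ check keeps reference data clean in the hub and flags exceptions inside a data integration step.

```python
# One DQ check serving two disciplines: an MDM hub (keep reference data
# clean) and an ETL job (surface problems in transit). The reference
# values and records here are hypothetical.
VALID_COUNTRY_CODES = {"US", "CA", "MX", "GB", "DE"}  # MDM-managed reference data

def check_country(record: dict) -> dict:
    record["country_ok"] = record.get("country") in VALID_COUNTRY_CODES
    return record

def etl_step(source_rows):
    """Data integration step that surfaces DQ problems as it moves data."""
    checked = [check_country(dict(r)) for r in source_rows]
    exceptions = [r for r in checked if not r["country_ok"]]
    return checked, exceptions  # load checked rows; route exceptions to stewards

rows = [{"id": 1, "country": "US"}, {"id": 2, "country": "XX"}]
loaded, bad = etl_step(rows)
print(bad)  # [{'id': 2, 'country': 'XX', 'country_ok': False}]
```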
Priority #5: Data stewardship and governance
Instead of reinventing the wheel, user organizations can borrow some of the organizational structures and processes of DQ's data stewardship and apply them to data governance. This reduces risk and shortens the time-to-use for data governance. Likewise, the stewardship capabilities built into many DQ tools can help document, automate, and scale up data governance processes.
Priority #6: Nontraditional data types
New types and sources of data are coming from many directions, and all need a DQ strategy. As data is derived and extracted from the Web, multi-structured sources, and social media, it should be subject to the same DQ functions and quality metrics as any other data.
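A minimal sketch of that idea, assuming a hypothetical JSON payload shape for social media posts: even loosely structured records can be scored for parseability and completeness.

```python
import json

# Apply completeness and validity metrics to multi-structured social
# media records. The required fields and sample payloads are hypothetical.
REQUIRED = ("user", "text", "timestamp")

def quality_metrics(raw: str) -> dict:
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return {"parseable": False, "completeness": 0.0}
    present = sum(1 for f in REQUIRED if rec.get(f))
    return {"parseable": True, "completeness": present / len(REQUIRED)}

posts = [
    '{"user": "ann", "text": "great product", "timestamp": "2013-02-12"}',
    '{"user": "bob", "text": ""}',
    "not json at all",
]
print([quality_metrics(p) for p in posts])
# [{'parseable': True, 'completeness': 1.0},
#  {'parseable': True, 'completeness': 0.333...},
#  {'parseable': False, 'completeness': 0.0}]
```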
Priority #7: Internationalization
This is a second-, third-, or later-generation priority for most DQ solutions. Prepare for it by selecting vendor tools that support internationalization functions for national postal standards, Unicode code pages, and localization of the DQ tool's GUI.
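One internationalization wrinkle is easy to demonstrate with Python's standard library: the same name can arrive as different Unicode code point sequences, and comparisons behave only after normalization. (National postal rules are far harder, which is exactly what vendor tools supply.)

```python
import unicodedata

# Two byte-level different spellings of the same name compare equal
# only after Unicode normalization.
name_combining   = "Mu\u0308ller"   # 'u' followed by a combining diaeresis
name_precomposed = "M\u00fcller"    # precomposed 'ü'

print(name_combining == name_precomposed)  # False: different code points
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(name_combining) == nfc(name_precomposed))  # True after NFC normalization
```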
Priority #8: Value-add process
Techniques such as standardization and data append add value by repurposing and augmenting data, respectively. De-duplication adds value to data by reducing its redundancies. Data profiling reveals opportunities for more value-adding actions by DQ techniques. Focus on the value-add process to ensure the continuous improvement expected of a DQ program.
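As a minimal sketch of de-duplication's value-add, with hypothetical records and a hypothetical match key: duplicates are collapsed and the most complete record survives.

```python
# Collapse records that match on a normalized key, keeping the most
# complete survivor. The records and survivorship rule are hypothetical.
def match_key(rec: dict) -> str:
    return rec["name"].lower().replace(" ", "") + "|" + rec["zip"]

def deduplicate(records: list[dict]) -> list[dict]:
    survivors = {}
    for rec in records:
        key = match_key(rec)
        best = survivors.get(key)
        # Survivorship rule: keep the record with more populated fields.
        if best is None or sum(bool(v) for v in rec.values()) > sum(bool(v) for v in best.values()):
            survivors[key] = rec
    return list(survivors.values())

customers = [
    {"name": "Pat Lee",  "zip": "02134", "phone": ""},
    {"name": "pat  lee", "zip": "02134", "phone": "555-0100"},  # duplicate, more complete
]
print(deduplicate(customers))  # one survivor: the record with the phone number
```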
Priority #9: Deeper profiling
Data profiling is too often shallow, just generating simple statistics for values found in a single database, table, or column. It should be broadened to enable more profound discoveries within data. Profile data repeatedly as a kind of monitoring that tests whether data's quality is truly improving.
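A minimal sketch of profiling-as-monitoring, with hypothetical columns and snapshots: compute the same profile repeatedly and compare the results to verify that quality is actually moving in the right direction.

```python
from collections import Counter

# Compute a column profile repeatedly and compare snapshots to test
# whether quality is improving. The column data and metrics chosen
# here are hypothetical.
def profile_column(values: list) -> dict:
    non_null = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "null_rate": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "type_mix": Counter(type(v).__name__ for v in non_null).most_common(3),
    }

last_month = profile_column(["a", "b", None, "", "b", "c"])
this_month = profile_column(["a", "b", "b", "", "b", "c"])
improving = this_month["null_rate"] < last_month["null_rate"]
print(last_month["null_rate"], this_month["null_rate"], improving)
# 0.333...  0.166...  True
```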
Priority #10: Vendor tools
Many first-generation DQ solutions are homegrown and hand-coded. For example, standardization is the most commonly used DQ technique, and (at the low end) standardization can be hand-coded in SQL or developed using a tool for extract, transform, and load (ETL). Hand-coded DQ solutions can prove the usefulness of software automation for DQ, but you should anticipate life cycle stages that demand functionality that very few organizations can build themselves, such as identity resolution, probabilistic matching, internationalization, real-time operation, DQ services, and hub-based architecture.
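To see why, here is a minimal hand-coded matching sketch using only Python's standard library (the names and threshold are hypothetical): it handles an easy spelling variant but stumbles on initials, which is exactly where probabilistic, multi-field vendor matching earns its keep.

```python
from difflib import SequenceMatcher

# A naive, hand-coded similarity score. It catches simple variants but
# is far from the probabilistic, multi-field matching vendor tools
# provide; the 0.8 threshold is an arbitrary assumption.
def naive_match(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [("Philip Russom", "Phillip Russom"),  # easy: one extra letter
         ("Philip Russom", "P. Russom"),       # harder: initials
         ("Philip Russom", "Paula Rust")]      # different person
for a, b in pairs:
    score = naive_match(a, b)
    print(f"{a!r} vs {b!r}: {score:.2f} -> {'match' if score > 0.8 else 'no match'}")
```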
For a more detailed discussion, read the article Ten Goals for Next Generation Data Quality in TDWI's What Works Magazine Volume 33. TDWI Members can access the magazine at http://tdwi.org/whitepapers/2012/05/what-works-volume-33/asset.aspx?tc=assetpg.