Q&A: Why Is Data Quality So Elusive?
In this Q&A, data management and governance platform strategist Marek Ovcacek discusses why enterprises haven’t made more progress on data quality.
- By Upside Staff
- March 17, 2022
It’s widely known that the consequences of poor data quality are growing every day, so why is data quality still an issue? In this Q&A, Ataccama’s VP of platform strategy Marek Ovcacek discusses why enterprises haven’t made more progress on data quality and how a data quality fabric can help.
Upside: Enterprises know that poor data results in poor decisions, but even though data quality has become part of every enterprise’s data strategy, it remains elusive. Why haven’t enterprises made greater progress in upping the quality of their data?
Marek Ovcacek: Modern organizations’ data landscapes have become exceedingly complex. There is a web of different processes, transformations, and data pipelines between data creation and data consumption. Data quality (DQ) needs to be tracked along its journey through all these layers because there can be potential DQ issues at every point.
For example, there could be process issues as data moves through an organization, such as poor integration or technical accidents. Data could be outdated. Different data points could be mistaken for each other. In many cases, even tracking the data lineage through the organization is a very difficult task.
Given the sheer scale of the data quality task, it has to be solved through automation via metadata-driven or (ideally) AI-assisted approaches. This leads many organizations to dump the problem onto IT. Unfortunately, IT cannot solve the problem alone -- data quality must be a business priority that IT and the business solve collaboratively.
What best practices can you recommend an enterprise follow to improve its data quality?
The first step is to properly catalog your data and keep that metadata fresh. To do this, you need a process that automatically examines data at its source so you can better interpret and understand it. This includes using a data profiler that examines various statistics, identifies data domains and data patterns, and infers dependencies and relationships. Ultimately, it provides an overview of where information of interest is located and identifies inconsistencies.
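To make the profiling step concrete, here is a minimal sketch in Python with pandas. The column checks and regexes are simplified assumptions, not any vendor's implementation; the point is the kind of per-column statistics and domain guesses a catalog might collect automatically.

```python
# A minimal, illustrative profiler sketch: compute basic per-column
# statistics and guess simple domains, the kind of metadata a data
# catalog would gather automatically at the source.
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Collect simple profiling statistics for one column."""
    stats = {
        "dtype": str(series.dtype),
        "null_ratio": float(series.isna().mean()),
        "distinct_count": int(series.nunique(dropna=True)),
    }
    sample = series.dropna().astype(str).head(1000)
    # Rough "domain" guesses from common patterns (assumption: these two
    # regexes stand in for a far richer pattern library).
    if len(sample) > 0 and sample.str.fullmatch(r"\d{4}-\d{2}-\d{2}").all():
        stats["inferred_domain"] = "date (ISO 8601)"
    elif len(sample) > 0 and sample.str.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+").all():
        stats["inferred_domain"] = "email"
    else:
        stats["inferred_domain"] = "unknown"
    return stats

def profile_frame(df: pd.DataFrame) -> dict:
    """Profile every column and return the results as catalog metadata."""
    return {col: profile_column(df[col]) for col in df.columns}
```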
The next step is to identify and monitor the quality of the data, not only against technical parameters but also using business rules. For example, you may want to validate the format of an address or credit card number field, but also check values against reference data or run checksums and row-level and aggregation controls. This is a time-consuming process to set up and maintain, so the ideal tooling uses a combination of metadata-driven automation, AI automation (at the orchestration level), and self-learning anomaly detection (i.e., rule-less DQ).
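As an illustration of such rules, the sketch below shows a simple technical format check, a Luhn checksum control of the kind commonly applied to card numbers, and an aggregation-level control. These are generic examples under assumed field formats, not rules from any specific product.

```python
# Illustrative rule-based checks: a format rule, a checksum rule (Luhn),
# and a batch-level aggregation control.
import re

def check_postcode_format(value: str) -> bool:
    """Technical rule: value matches a simple 5-digit postal code format."""
    return bool(re.fullmatch(r"\d{5}", value or ""))

def check_card_checksum(number: str) -> bool:
    """Business rule: card number passes the Luhn checksum."""
    digits = [int(d) for d in (number or "") if d.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def failure_rate(values, rule) -> float:
    """Aggregation control: share of rows in a batch that fail a rule."""
    values = list(values)
    return sum(not rule(v) for v in values) / max(len(values), 1)
```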
Finally, you need to automate data cleansing and transformation, which includes standardizing formats, breaking data down into separate attributes (such as transforming a full name into a first name and surname), enriching data with external sources, and removing duplicates. This process should occur any time data is being consumed, be it for analysis, before the data preparation phase, or when loading to a target system. The tooling you are looking for here should also support automation and provide a wide variety of integration options.
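The snippet below is a simplified illustration of those cleansing steps -- standardizing fields, splitting a full name into first name and surname, and removing duplicates -- using pandas and hypothetical column names. Production tooling would add fuzzy matching and enrichment from external sources.

```python
# A simplified cleansing sketch: standardize formats, split one attribute
# into several, and drop exact duplicates. Column names are assumptions.
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Standardize formats: trim and title-case names, lower-case emails.
    out["full_name"] = out["full_name"].str.strip().str.title()
    out["email"] = out["email"].str.strip().str.lower()
    # Break one attribute into several: full name -> first name + surname.
    parts = out["full_name"].str.split(" ", n=1, expand=True)
    out["first_name"] = parts[0]
    out["surname"] = parts[1]
    # Remove duplicates (exact match on the standardized email here;
    # real systems use probabilistic matching across many attributes).
    return out.drop_duplicates(subset=["email"], keep="first")

cleaned = cleanse(pd.DataFrame({
    "full_name": ["  jane DOE ", "Jane Doe"],
    "email": ["Jane.Doe@example.com", "jane.doe@example.com "],
}))
```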
How have data quality efforts expanded?
For starters, data quality is no longer the sole domain of database administrators and data geeks. It’s a business function, and organizations that still relegate it to IT are likely to experience ongoing data quality problems.
It’s also no longer a manual, SQL-based process. Modern data quality is highly automated. We’ve also moved beyond metadata-driven methods that collect data source metadata and depend on metadata rules. These methods got us 90 percent of the way there, but the industry has further improved the process by using AI and machine learning to augment data stewards’ experience and simplify the configuration of data quality projects by suggesting rules and detecting anomalies.
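As a deliberately simple stand-in for that kind of anomaly detection, the sketch below learns what "normal" looks like from recent history (here, daily row counts) and flags values that drift too far, using a z-score. Real rule-less DQ uses far richer, self-learning models; only the principle is the same.

```python
# Simplified "rule-less" check: learn normal from recent history and flag
# metrics that deviate too far. The threshold and metric are assumptions.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's metric if it is more than `threshold` std devs from the mean."""
    if len(history) < 2:
        return False  # not enough history to learn from
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

daily_row_counts = [10_120, 10_340, 9_980, 10_205, 10_150]
print(is_anomalous(daily_row_counts, 2_500))    # True: suspicious drop in volume
print(is_anomalous(daily_row_counts, 10_290))   # False: within the normal range
```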
Today, the most advanced iteration of data quality automation is the use of a data quality fabric, which sits on the backbone of a data catalog. This data quality fabric maintains a current version of metadata and uses both AI and a rules-based approach to automate configuration and measurement.
How are data quality efforts tied up with master and reference data management and metadata management?
Standardization and cleansing are vital for efficient master data management (MDM). It’s impossible to get high-quality data without mastering and consolidating it. We discovered through years of experience and countless successful MDM implementations that DQ integrated into MDM provides business benefits (such as higher accuracy) as well as IT benefits (including lower cost and latency).
Reference data management (RDM) is vital for reporting consistency, analytics, and cross-departmental understanding of various business categories. Industries such as transportation and insurance, which rely heavily on reference data, can simplify data quality management if it’s connected to a single source of reference data. RDM is usually viewed as the backbone of your data quality and data management efforts. It has also become a critical part of the infrastructure for metadata management and domain classification. Centralized reference data management is increasingly becoming a must-have capability for organizations of all sizes in various industries.
You’ve mentioned to me that this has all led to a new idea -- the data quality fabric. What can you tell us about this? What is it? What’s its purpose? How does it work and where is it deployed? Is this for IT or for everyday business users?
The concept of the data quality fabric builds on basic data fabric principles but adds additional capabilities, such as AI-based anomaly detection, embedded data profiling and classification, MDM, and RDM. It automates data quality management, including assessment, monitoring, standardization, cleansing, enrichment, and issue resolution.
This fabric connects and integrates data from all important and useful data sources, such as data lakes, ERPs, and file servers. The data fabric automatically ingests this data, processes it, and integrates it to create a unified metadata layer that self-maintains and infers additional metadata from existing information.
As a result, it automatically ensures data quality throughout an organization. In a traditional data fabric, when you consume data, you get it at the speed, in the format, and at the granularity your task requires (whether the data consumer is an analyst or an automated process). A data quality fabric ensures that, on top of all that, your data will also be of sufficient quality. It can even choose between different data sets based on automatically computed data quality metrics.
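One way to picture that last point: if each candidate data set carries automatically computed quality metrics as metadata, choosing between them can be as simple as a weighted score. The metric names, values, and weights below are illustrative assumptions, not a standard.

```python
# Toy illustration of choosing between data sets by computed quality metrics.
candidates = [
    {"name": "crm_extract", "completeness": 0.97, "validity": 0.92, "freshness": 0.60},
    {"name": "data_lake_customers", "completeness": 0.88, "validity": 0.95, "freshness": 0.99},
]
weights = {"completeness": 0.4, "validity": 0.4, "freshness": 0.2}

def score(ds: dict) -> float:
    """Weighted quality score computed from the data set's metadata."""
    return sum(weights[m] * ds[m] for m in weights)

best = max(candidates, key=score)
print(best["name"], round(score(best), 3))  # data_lake_customers 0.93
```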
Where are data quality efforts headed?
Ultimately, a data quality fabric automates data quality monitoring and simplifies the process of providing data, ensuring that business users can easily access high-quality data when they need it. As a result, most data-related requests will be self-service for users and automated for machines.
Here’s how it could work in the real world: a person or a machine -- machines request far more data than individuals, so we need to include them as well -- makes a data request. The data quality fabric will find the best-quality relevant data sets and decide whether that data needs to be integrated and consolidated. The fabric will then decide what transformations are required to meet the needs of the request. Finally, it will provide that data based on the use case and the user’s permissions.
For example, a data analyst requires their data to be integrated daily, but a stock ticker program requires a real-time stream. Although they are accessing the same data source, they need it at a different speed, in a different format, and at a different granularity. This usually results in data pipelines being built separately for both these requests. A data quality fabric is not only aware of the required data format, but it also knows the context of the required use case. With this knowledge (saved as metadata and accessed through a knowledge graph), the fabric can optimize data delivery and can share part of the pipeline (for example, parsing, batching, and standardization) while automatically branching it at the appropriate moment for different use cases.
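A schematic way to picture that shared-then-branched pipeline: one standardization stage runs once, and the branch point is where the real-time and daily-batch consumers diverge. The stage names and fields below are illustrative assumptions, not a description of any specific product.

```python
# Schematic shared-then-branched pipeline: the standardization stage runs
# once, and itertools.tee marks where the two delivery branches diverge.
import itertools

def parse_and_standardize(record: dict) -> dict:
    """Shared stage: parsing and standardization reused by both consumers."""
    return {**record, "symbol": record["symbol"].strip().upper()}

def shared_stage(records):
    for r in records:
        yield parse_and_standardize(r)

raw = [{"symbol": " aapl ", "price": 189.4}, {"symbol": "msft", "price": 411.2}]
stream_feed, batch_feed = itertools.tee(shared_stage(raw), 2)  # branch point

realtime_events = list(stream_feed)  # real-time branch: forward record by record
daily_load = list(batch_feed)        # batch branch: integrate once per day
```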
In the end, it simplifies data quality and availability throughout the business, ensuring that business users and applications can get exactly the data they need in the form that they require -- and can do so trusting that it will be high quality.
Editor’s Note: As the VP of platform strategy, Marek is focused on building next-generation data management and governance platforms that cover both data processing and metadata management. By using his extensive knowledge and experience, he leverages Ataccama’s custom-built, high-performance data processing engine to deliver data fabrics for the future.