Q&A: The Fundamentals of Data Quality (Part 2 of 2)
In this Q&A, the authors of O'Reilly's first-ever book on data quality answer questions about how data teams are architecting systems for reliability and trustworthiness.
- By Barr Moses, Lior Gavish, Molly Vorwerck
- January 12, 2022
As the amount of data companies rely on to do business grows exponentially, the consequences of poor data quality grow proportionally. In this TDWI Q&A, Barr Moses, Lior Gavish, and Molly Vorwerck -- authors of O'Reilly's The Fundamentals of Data Quality: How to Build More Trustworthy Data Pipelines and members of the founding team at data reliability company Monte Carlo -- talk to us about data quality and observability. (Read Part 1 of the conversation here.)
Upside: What are some of the biggest factors contributing to broken data pipelines and unreliable data?
Lior Gavish: In theory, finding the root cause of data quality issues sounds as easy as running a few SQL queries to segment the data, but in practice this process can be quite challenging. Incidents can manifest in non-obvious ways across an entire pipeline and impact multiple, sometimes hundreds, of tables.
In our experience, we've found that data pipelines break for three key reasons: changes in your data, changes in your code, and changes in your operational environment.
An unexpected change in the data feeding into the job, pipeline, or system often manifests in broken reports and dashboards that aren't discovered until days or even weeks later. To understand what's broken, you will need to find the most upstream nodes of your system that exhibit the issue -- that's where things started and that's where the answer lies.
- Is the data wrong for all records? For some records?
- Is the data wrong for a particular time period?
- Is the data wrong for a particular subset or segment of the data, e.g., only your Android users or only orders from France?
- Are there new segments of the data that your code may not account for yet or missing segments that your code relies on?
- Has the schema changed recently in a way that might explain the problem?
- Have your numbers changed from dollars to cents? Your timestamps from PST to EST?
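Questions like these usually come down to segmentation queries. Here is a minimal sketch of the idea in Python, using an in-memory SQLite table with hypothetical `orders` data (the table, columns, and values are illustrative, not from any particular pipeline): count bad records per segment, and the segment where the problem concentrates points you at the most upstream culprit.

```python
import sqlite3

# Hypothetical orders table with a data quality issue hiding in one segment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, platform TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("FR", "android", None),   # suspicious: missing amount
        ("FR", "android", None),
        ("FR", "ios", 12.5),
        ("US", "android", 30.0),
        ("US", "ios", 8.0),
    ],
)

# Is the data wrong for all records, or only a particular subset?
# Count null amounts per (country, platform) segment.
rows = conn.execute(
    """
    SELECT country, platform,
           COUNT(*) AS total,
           SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS bad
    FROM orders
    GROUP BY country, platform
    ORDER BY bad DESC
    """
).fetchall()

for country, platform, total, bad in rows:
    print(country, platform, f"{bad}/{total} bad")
```

Here every bad record falls in the French Android segment, which immediately narrows the hypothesis space: the issue is not global, so look at whatever upstream source or transformation is specific to that segment.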
A change in the logic (ETL, SQL, Spark jobs, etc.) transforming the data is a primary cause of data quality issues. A peek into the logic that created the table, or even at the particular field or fields implicated in the incident, will help you form plausible hypotheses about what's wrong.
- What code most recently updated the table and when?
- How are the relevant fields calculated? What could possibly have created the problem data given this logic?
- Have there been any recent changes to the logic, potentially introducing an issue?
- Have there been any ad hoc writes to the table? Has it been backfilled recently?
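Many warehouses expose a query or access history that can answer these questions directly. As a rough sketch, assuming you have pulled such a history into Python (the records, user names, and the convention of a dedicated pipeline service account are all hypothetical), you can flag recent ad hoc writes to the affected table:

```python
from datetime import datetime

# Hypothetical audit log of writes to a table, as a warehouse's
# query history might report them (records are illustrative).
writes = [
    {"table": "analytics.orders", "user": "etl_service",
     "sql": "INSERT INTO analytics.orders SELECT ...",
     "at": datetime(2022, 1, 10, 2, 0)},
    {"table": "analytics.orders", "user": "jane.doe",
     "sql": "UPDATE analytics.orders SET amount = amount / 100",
     "at": datetime(2022, 1, 11, 15, 30)},
]

# Flag ad hoc writes: anything not performed by the scheduled pipeline user.
ad_hoc = [w for w in writes if w["user"] != "etl_service"]
for w in ad_hoc:
    print(w["at"], w["user"], w["sql"])
```

In this toy history, a manual `UPDATE` that rescaled the amount column the day before the incident is exactly the kind of change worth investigating first.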
Operational issues, such as runtime errors, permission problems, or infrastructure failures, can affect the structure, format, and versioning of your data. Given that data pipelines are becoming more complicated and layered, these types of data downtime are becoming quite common. A look at logs and error traces from your ETL engines can help answer some of the following questions:
- Have relevant jobs had any errors?
- Were there unusual delays in starting jobs?
- Have any long-running queries or low-performing jobs caused delays?
- Have there been any permissions, networking, or infrastructure issues impacting execution? Have there been any changes made to these recently?
- Have there been any changes to the job schedule that accidentally dropped a job or misplaced it in the dependency tree?
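Checks like these lend themselves to a simple sweep over job-run records. A minimal sketch, assuming run metadata shaped like what an ETL scheduler typically exposes (the job names, times, and ten-minute delay threshold are illustrative):

```python
from datetime import datetime

# Hypothetical job-run records from an ETL scheduler.
runs = [
    {"job": "load_orders", "scheduled": datetime(2022, 1, 12, 1, 0),
     "started": datetime(2022, 1, 12, 1, 1), "status": "success"},
    {"job": "build_marts", "scheduled": datetime(2022, 1, 12, 2, 0),
     "started": datetime(2022, 1, 12, 2, 45), "status": "success"},
    {"job": "refresh_dashboards", "scheduled": datetime(2022, 1, 12, 3, 0),
     "started": None, "status": "error"},
]

# Have relevant jobs had any errors?
errors = [r["job"] for r in runs if r["status"] == "error"]

# Were there unusual delays in starting jobs? (threshold: 10 minutes)
delayed = [r["job"] for r in runs
           if r["started"] and (r["started"] - r["scheduled"]).total_seconds() > 600]

print("errors:", errors)
print("delayed:", delayed)
```

Even this crude pass separates two distinct failure modes: a job that errored outright and a job that succeeded but started late, each of which points to a different part of the operational environment.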
Although these suggestions just scratch the surface of how to conduct root cause analysis on broken data pipelines, they're a solid starting point.
Data organizations are becoming increasingly distributed to keep pace with analytics demands. What are some best practices or emerging trends data teams are using to ensure data democratization while maintaining high data quality?
Molly Vorwerck: As data becomes central to business operations, more functional teams across the company have become involved in data management and analytics to streamline and speed up the insight-gathering process. Consequently, more data teams are adopting a distributed, decentralized model that mimics the industry-wide migration from monolithic to microservice architectures that took the software engineering world by storm in the mid-2010s.
For instance, your 200-person company may support a team of three data engineers and 10 data analysts, with the analysts distributed across functional teams to better support the needs of the business. These analysts may report to operational teams or to a centralized data team, but either way they will own specific data sets and reporting functions. Multiple domains will generate and leverage data, making it inevitable that data sets used by multiple teams will be duplicated, go missing, or grow stale over time. To combat these issues, data teams should rely on a centralized governance model that applies universal standards of data quality across the business.
You introduce a new term -- data observability -- in your book. What is data observability and how does it differ from traditional forms of data quality management?
Barr Moses: Traditionally, data teams have relied on data testing alone to ensure that pipelines are resilient; in 2021, as companies ingest ever-increasing volumes of data and pipelines become more complex, this approach is no longer sufficient.
Over the last two decades, DevOps engineers have developed best practices of observability to ensure applications stay up, running, and reliable. Just as application observability includes monitoring, tracking, and triaging incidents to prevent downtime, modern data engineers are applying the same principles to data.
Data observability refers to a team's ability to understand the health of their data at each stage in its life cycle, from ingestion in the data warehouse or lake to its manifestation in the BI layer.
Effective observability provides end-to-end lineage that allows you to expose downstream dependencies and automatically monitor your data at rest -- without extracting data from your data store and risking your security or compliance. Having observability makes audits, breach investigations, and other possible data disasters much easier to understand and resolve while keeping your CTO from having an ulcer!
What are some best practices for getting up and running with data observability?
Lior Gavish: Data observability can be broken down into five pillars (or data features) that data practitioners should measure to better track data quality and reliability:
- Freshness: Is the data recent? When was the last time it was generated? What upstream data is included/omitted?
- Distribution: Is the data within accepted ranges? Is it properly formatted? Is it complete?
- Volume: Has all the data arrived?
- Schema: What is the schema and how has it changed? Who has made these changes and for what reasons?
- Lineage: For a given data asset, what are its upstream sources, and what downstream assets are impacted by it? Who are the people generating this data, and who is relying on it for decision making?
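To make the first few pillars concrete, here is a minimal sketch of freshness and volume checks against table metadata, such as a warehouse's information schema might report. All of the names, timestamps, thresholds, and the row-count baseline below are hypothetical; a real observability system would learn such baselines from history rather than hard-code them:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical per-table metadata, as a warehouse's information
# schema or metadata API might report it.
table_meta = {
    "last_updated": datetime(2022, 1, 11, 4, 0, tzinfo=timezone.utc),
    "row_count": 98_000,
}
expected_rows = 100_000            # illustrative baseline from prior loads
max_staleness = timedelta(hours=6)  # illustrative freshness SLA
now = datetime(2022, 1, 12, 9, 0, tzinfo=timezone.utc)

# Freshness: is the data recent?
fresh = (now - table_meta["last_updated"]) <= max_staleness

# Volume: has all the data arrived? Allow 5% tolerance around the baseline.
volume_ok = abs(table_meta["row_count"] - expected_rows) / expected_rows <= 0.05

print("fresh:", fresh)
print("volume_ok:", volume_ok)
```

In this toy case the table's row count is within tolerance but the table hasn't been updated in over a day, so a freshness alert would fire even though a volume check alone would pass -- which is exactly why the pillars need to be monitored together rather than in isolation.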
A robust and holistic approach to data observability requires the consistent and reliable monitoring of these five pillars through a centralized interface that serves as a main source of truth about the health of your data. Unlike ad hoc queries or simple SQL wrappers, such monitoring doesn't stop at "field X in table Y has values lower than Z today."
An effective, proactive data observability solution will also provide end-to-end lineage that allows you to track downstream dependencies. Additionally, it will automatically monitor your data at rest without requiring the extraction of data from your data store. This approach ensures that you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.