Q&A: The Fundamentals of Data Quality (Part 1 of 2)
In this Q&A, the authors of O'Reilly's first-ever book on data quality answer questions about how data teams are architecting systems for reliability and trustworthiness.
- By Barr Moses, Lior Gavish, Molly Vorwerck
- January 11, 2022
As the amount of data companies rely on to do business grows exponentially, the consequences of poor data quality grow proportionally. In this TDWI Q&A, Barr Moses, Lior Gavish, and Molly Vorwerck -- authors of O'Reilly's The Fundamentals of Data Quality: How to Build More Trustworthy Data Pipelines and members of the founding team at data reliability company Monte Carlo -- talk to us about data quality and observability.
Upside: In your book, you argue that data quality is more than just ensuring your source data is clean and accurate. How would you define data quality and how has this definition evolved as companies become increasingly data driven?
Lior Gavish: Technical teams have been tracking -- and seeking to improve -- data quality for as long as they've been tracking analytical data, but only in the 2020s has data quality become a top-line priority for many businesses. As data becomes not just an output but a financial commodity for many organizations, it's important that this information can be trusted.
As a result, companies are increasingly treating their data like code, applying frameworks and paradigms long standard among software engineering teams to their data organizations and architectures. DevOps spawned industry-leading best practices such as site reliability engineering (SRE), continuous integration/continuous deployment (CI/CD), and microservices-based architectures. In short, the goal of DevOps is to release more reliable and performant software through automation.
Over the past few years, more companies have been applying these concepts to data in the form of DataOps. DataOps refers to the process of improving the reliability and performance of your data through automation, reducing data silos, and fostering quicker, more fault-tolerant analytics.
Now, the definition of data quality has started to crystallize as a function of measuring the reliability, completeness, and accuracy of data as it relates to the state of what is being reported on. As they say, you can't manage what you don't measure, and high data quality is the first stage of any robust analytics program. Data quality is also an extremely powerful way to understand whether your data fits the needs of your business, and DataOps is the next frontier when it comes to managing data quality at scale.
Data quality has been a priority for teams since the early days of data science and analytics, yet it continues to be a sticking point. Why is data quality so hard to get right?
Barr Moses: There are four main factors contributing to the rise of "data downtime," or periods of time when data is missing, erroneous, or otherwise inaccurate: the need for "fresh" data, the migration to the cloud, increased data ingestion, and the increasing complexity of data pipelines.
Lack of freshness -- i.e., when data is unusually out-of-date -- can have any number of causes, including a job stuck in a queue, a timeout, a partner that did not deliver its data set on time, an error, or an accidental scheduling change that removed jobs from your data pipeline.
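To make the freshness idea concrete, here is a minimal Python sketch of a staleness check. It is an illustration under stated assumptions, not a method from the book: the function name `is_stale`, the `expected_interval` parameter, and the 1.5x `tolerance` default are all hypothetical choices.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated: datetime,
             expected_interval: timedelta,
             now: datetime,
             tolerance: float = 1.5) -> bool:
    """Return True when a data set has gone unusually long without an update.

    Assumption: "unusually out-of-date" means the time since the last update
    exceeds the expected update interval multiplied by a tolerance factor.
    """
    return (now - last_updated) > expected_interval * tolerance


# Hypothetical example: a table that is expected to refresh every 24 hours.
now = datetime(2022, 1, 11, 12, 0, tzinfo=timezone.utc)
late_table = now - timedelta(hours=40)     # last refreshed 40 hours ago
on_time_table = now - timedelta(hours=30)  # last refreshed 30 hours ago

print(is_stale(late_table, timedelta(hours=24), now))
print(is_stale(on_time_table, timedelta(hours=24), now))
```

With a 24-hour interval and a 1.5x tolerance, the threshold is 36 hours, so the first table is flagged and the second is not. In practice the threshold would likely be learned from a table's historical update pattern rather than hard-coded.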
With the rise of data-driven analytics, cross-functional data teams, and most important, the cloud, cloud data warehousing solutions such as Amazon Redshift, Snowflake, and Google BigQuery have become increasingly popular options for companies bullish on data. In many ways, the cloud makes data easier to manage, more accessible to a wider variety of users, and far faster to process.
Nowadays, companies use anywhere from dozens to hundreds of internal and external data sources to produce analytics and machine learning models. Any one of these sources can change in unexpected ways and without notice, compromising the data the company uses to make decisions.
Over the past few years, data pipelines have become increasingly complex with multiple stages of processing and non-trivial dependencies between various data assets as a result of more advanced (and disparate) tooling, more data sources, and more diligence afforded to data by executive leadership. Without visibility into these dependencies, however, any change made to one data set can have unintended consequences impacting the correctness of dependent data assets.
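As a rough illustration of what visibility into these dependencies can look like, the Python sketch below walks a lineage graph to list every asset downstream of a changed data set. The graph structure, the asset names, and the `downstream_assets` function are hypothetical, chosen only for this example.

```python
from collections import deque
from typing import Dict, List


def downstream_assets(lineage: Dict[str, List[str]], changed: str) -> List[str]:
    """List every asset that depends, directly or indirectly, on `changed`.

    `lineage` maps each asset to the assets that read from it.
    A breadth-first traversal returns the nearest dependents first.
    """
    seen = {changed}
    affected: List[str] = []
    queue = deque([changed])
    while queue:
        asset = queue.popleft()
        for child in lineage.get(asset, []):
            if child not in seen:  # guard against revisiting shared or cyclic nodes
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Hypothetical lineage: raw table -> staging table -> models -> dashboard.
lineage = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_revenue", "dim_customers"],
    "fct_revenue": ["revenue_dashboard"],
}

print(downstream_assets(lineage, "raw_orders"))
# ['stg_orders', 'fct_revenue', 'dim_customers', 'revenue_dashboard']
```

A list like this tells the person changing `raw_orders` which owners to notify before the change ships.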
There are additional factors that contribute to data downtime, but these four factors are particularly common.
How do you measure the impact or return on investment for data quality initiatives at your company?
Molly Vorwerck: We've found that the following metrics (borrowed from the DevOps world) offer a good start: time to detection and time to resolution for data issues.
Time to detection (TTD) refers to the amount of time it takes for your data engineering team to identify a data quality issue, whether that's a freshness anomaly, a model that failed to run, or even a schema change that sent an entire pipeline into chaos. For many data teams, TTD is often measured in days to weeks, and sometimes even months, because the primary means of detection is waiting for downstream data consumers to communicate that the data "looks off."
Next, data engineering teams should measure time to resolution (TTR), a metric that seeks to answer the question: How quickly were you able to resolve a data incident once you were alerted? Measured in minutes, hours, or days, TTR metrics allow you to understand the severity of your data issue and track the amount of time it takes to resolve it. When converted to dollars (i.e., how much money is spent/saved), it becomes much easier to communicate the impact of a data incident to your stakeholders.
In this way, you can measure the financial impact of your data by understanding how much money it costs when it's not operational.
The equation might go like this: (TTD + TTR hours) * downtime hourly cost = cost of data downtime. Of course, this equation doesn't even take into account the opportunity cost of bad data; we can save that for another interview.
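The equation above can be sketched in a few lines of Python. The function name and the sample figures (48 hours to detect, 6 hours to resolve, $500 per hour of downtime) are hypothetical and serve only to show the arithmetic.

```python
def data_downtime_cost(ttd_hours: float, ttr_hours: float, hourly_cost: float) -> float:
    """Cost of data downtime = (TTD + TTR hours) * downtime hourly cost."""
    if min(ttd_hours, ttr_hours, hourly_cost) < 0:
        raise ValueError("durations and hourly cost must be non-negative")
    return (ttd_hours + ttr_hours) * hourly_cost


# Hypothetical incident: 48 hours to detect, 6 hours to resolve,
# at an assumed cost of $500 per hour of downtime.
print(data_downtime_cost(ttd_hours=48, ttr_hours=6, hourly_cost=500))  # 27000
```

In this example, detection accounts for 48 of the 54 downtime hours, so shortening TTD would reduce the cost far more than shortening TTR.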
Who in the data organization is responsible for managing data quality?
Barr Moses: In short, everyone is responsible for ensuring that data can be trusted, but it truly depends on the needs and structure of your data organization.
At companies that ingest and transform terabytes of data (such as Netflix or Uber), we've found that it's common for data engineers and data product managers to tackle the responsibility of monitoring and alerting for data reliability issues.
Outside of these behemoths, though, the responsibility often falls on data engineers and product managers. They must balance the organization's demand for data with what can be provided reliably. Notably, the brunt of any bad choices made here is often borne by the BI analysts, whose dashboards may wind up containing bad information or breaking because of uncommunicated changes. In data organizations at a very early stage of maturity, these roles are often combined into a jack-of-all-trades data person or a product manager.
NOTE: The conversation continues in Part 2.
Barr Moses is CEO and co-founder of Monte Carlo, a data reliability company and creator of a data observability platform. Previously, she was VP customer operations at customer success company Gainsight, where she helped scale the company 10x in revenue and, among other functions, built the data/analytics team. Prior to that, she was a management consultant at Bain & Company and a research assistant at the statistics department at Stanford University. She also served in the Israeli Air Force as a commander of an intelligence data analyst unit. Barr graduated from Stanford with a B.Sc. in mathematical and computational science.
Lior Gavish is CTO and co-founder of Monte Carlo, a data observability company. Prior to Monte Carlo, Lior co-founded cybersecurity startup Sookasa, which was acquired by Barracuda in 2016. At Barracuda, Lior was SVP of engineering, launching ML products for fraud prevention. Lior holds an MBA from Stanford and an M.Sc. in computer science from Tel Aviv University.
Molly Vorwerck is the head of content and community for Monte Carlo, a data reliability company and creator of the Monte Carlo Data Observability Platform. Previously, she led the tech brand team at Uber, where she managed editorial strategy for the Uber engineering blog, the Uber research review program, and Uber AI. Prior to that, she wrote for USA Today, covering U.S. history, politics, and culture. She graduated from Stanford University with a B.A. in American studies and served as managing editor for The Stanford Daily. When she’s not writing or thinking about data, she’s probably watching The Great British Baking Show or reading a murder mystery.