TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Q&A: The Fundamentals of Data Quality (Part 2 of 2)

In this Q&A, the authors of O'Reilly's first-ever book on data quality answer questions about how data teams are architecting systems for reliability and trustworthiness.

By Barr Moses, Lior Gavish, Molly Vorwerck
January 12, 2022

As the amount of data companies rely on to do business grows exponentially, the consequences of poor data quality grow proportionally. In this TDWI Q&A, Barr Moses, Lior Gavish, and Molly Vorwerck -- authors of O'Reilly's The Fundamentals of Data Quality: How to Build More Trustworthy Data Pipelines and members of the founding team at data reliability company Monte Carlo -- talk to us about data quality and observability. (Read Part 1 of the conversation here.)

For Further Reading:

Artificial Intelligence and the Data Quality Conundrum

Study Finds Three Out of Four Executives Lack Confidence in Their Data's Quality

Banking on Semantic Technology: AI-Powered Data Quality Balances Fraud Prevention and Customer Excellence

Upside: What are some of the biggest factors contributing to broken data pipelines and unreliable data?

Lior Gavish: In theory, finding the root cause of data quality issues sounds as easy as running a few SQL queries to segment the data, but in practice this process can be quite challenging. Incidents can manifest in non-obvious ways across an entire pipeline and impact multiple, sometimes hundreds of, tables.

In our experience, we've found that data pipelines break for three key reasons: changes in your data, changes in your code, and changes in your operational environment.

An unexpected change in the data feeding into the job, pipeline, or system often manifests in broken reports and dashboards that aren't discovered until days or even weeks later. To understand what's broken, you will need to find the most upstream nodes of your system that exhibit the issue -- that's where things started and that's where the answer lies.

Ask yourself:

Is the data wrong for all records? For some records?
Is the data wrong for a particular time period?
Is the data wrong for a particular subset or segment of the data, e.g., only your Android users or only orders from France?
Are there new segments of the data that your code may not account for yet or missing segments that your code relies on?
Has the schema changed recently in a way that might explain the problem?
Have your numbers changed from dollars to cents? Your timestamps from PST to EST?
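
As a rough illustration of the segmentation questions above, the sketch below computes the share of bad records per segment so that a problem confined to one platform or country stands out. The record layout, the `bad_rate_by_segment` helper, and the sample orders are all hypothetical; in practice this would more likely be a GROUP BY query against the warehouse.

```python
from collections import defaultdict

def bad_rate_by_segment(records, segment_key, is_bad):
    """Return {segment: fraction of records in that segment flagged as bad}."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for rec in records:
        seg = rec.get(segment_key, "<missing>")
        totals[seg] += 1
        if is_bad(rec):
            bad[seg] += 1
    return {seg: bad[seg] / totals[seg] for seg in totals}

# Hypothetical sample: order amounts are missing only for Android users.
orders = [
    {"platform": "android", "country": "FR", "amount": None},
    {"platform": "android", "country": "US", "amount": None},
    {"platform": "ios", "country": "FR", "amount": 12.5},
    {"platform": "ios", "country": "US", "amount": 30.0},
]

def amount_missing(rec):
    return rec["amount"] is None

by_platform = bad_rate_by_segment(orders, "platform", amount_missing)
by_country = bad_rate_by_segment(orders, "country", amount_missing)
print(by_platform)
print(by_country)
```

In this made-up sample the platform split isolates the problem (every Android record is bad, no iOS record is), while the country split shows the same rate everywhere, which suggests country is not the relevant dimension.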

A change in the logic (ETL, SQL, Spark jobs, etc.) transforming the data is a primary cause of data quality issues. A peek into the logic that created the table, or even the particular field or fields that are impacting the incident, will help you come up with plausible hypotheses about what's wrong.

Ask yourself:

What code most recently updated the table and when?
How are the relevant fields calculated? What could possibly have created the problem data given this logic?
Have there been any recent changes to the logic, potentially introducing an issue?
Have there been any ad hoc writes to the table? Has it been backfilled recently?

Operational issues, such as runtime errors, permission issues, or infrastructure failures, can affect the structure, format, and versioning of your data. Given that data pipelines are becoming more complicated and layered, these types of data downtime are becoming quite common. A look at logs and error traces from your ETL engines can help answer some of the following questions:

Have relevant jobs had any errors?
Were there unusual delays in starting jobs?
Have any long-running queries or low-performing jobs caused delays?
Have there been any permissions, networking, or infrastructure issues impacting execution? Have there been any changes made to these recently?
Have there been any changes to the job schedule that accidentally dropped a job or misplaced it in the dependency tree?
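
Several of these operational questions can be answered mechanically once job run records are available. The sketch below is a minimal example under assumed inputs: the run-record fields (`job`, `status`, `scheduled_at`, `started_at`), the 15-minute delay threshold, and both helper functions are hypothetical and do not reflect any particular scheduler's API.

```python
from datetime import datetime, timedelta

def flag_job_runs(runs, max_start_delay=timedelta(minutes=15)):
    """Return (job, reason) pairs for runs that failed or started late."""
    flagged = []
    for run in runs:
        if run["status"] != "success":
            flagged.append((run["job"], "error"))
        if run["started_at"] - run["scheduled_at"] > max_start_delay:
            flagged.append((run["job"], "late start"))
    return flagged

def missing_jobs(expected_jobs, runs):
    """Return expected jobs with no run at all, e.g. dropped from the schedule."""
    seen = {run["job"] for run in runs}
    return sorted(set(expected_jobs) - seen)

# Hypothetical run log for one night.
runs = [
    {"job": "load_orders", "status": "success",
     "scheduled_at": datetime(2022, 1, 10, 2, 0),
     "started_at": datetime(2022, 1, 10, 2, 1)},
    {"job": "load_users", "status": "failed",
     "scheduled_at": datetime(2022, 1, 10, 2, 0),
     "started_at": datetime(2022, 1, 10, 2, 0)},
    {"job": "build_report", "status": "success",
     "scheduled_at": datetime(2022, 1, 10, 3, 0),
     "started_at": datetime(2022, 1, 10, 3, 40)},
]
expected = ["load_orders", "load_users", "build_report", "export_kpis"]

flagged = flag_job_runs(runs)
dropped = missing_jobs(expected, runs)
print(flagged)
print(dropped)
```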

Although these suggestions just scratch the surface of how to conduct root cause analysis on broken data pipelines, they're a solid starting point.
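
The earlier advice to find the most upstream nodes that exhibit the issue can be sketched as a small graph search. This assumes lineage is available as a mapping from each table to its direct upstream tables and that the set of affected tables is already known; the table names and the `most_upstream_affected` helper are hypothetical.

```python
def ancestors(upstream_of, table):
    """Return every table upstream of `table`, directly or transitively."""
    seen = set()
    stack = list(upstream_of.get(table, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(upstream_of.get(node, []))
    return seen

def most_upstream_affected(upstream_of, affected):
    """Return affected tables that have no affected ancestor."""
    affected = set(affected)
    return sorted(
        t for t in affected if not (ancestors(upstream_of, t) & affected)
    )

# Hypothetical lineage: each table maps to its direct upstream tables.
upstream_of = {
    "stg_orders": ["raw_orders"],
    "stg_users": ["raw_users"],
    "fct_orders": ["stg_orders", "stg_users"],
    "revenue_dashboard": ["fct_orders"],
}
affected = {"stg_orders", "fct_orders", "revenue_dashboard"}

roots = most_upstream_affected(upstream_of, affected)
print(roots)
```

Here the search points at `stg_orders`: its own upstream source looks healthy, so the investigation would start with the data and logic feeding that table.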

Data organizations are becoming increasingly distributed to keep pace with analytics demands. What are some best practices or emerging trends data teams are using to ensure data democratization while maintaining high data quality?

Molly Vorwerck: As data becomes central to business operations, more functional teams across the company have become involved in data management and analytics to streamline and speed up the insight-gathering process. Consequently, more data teams are adopting a distributed, decentralized model that mimics the industry-wide migration from monolithic to microservice architectures that took the software engineering world by storm in the mid-2010s.

For instance, your 200-person company may support a team of three data engineers and 10 data analysts, with the analysts distributed across functional teams to better support the needs of the business. These analysts will report either to operational teams or to centralized data teams, but they will own specific data sets and reporting functions. Multiple domains will generate and leverage data, so data sets used by multiple teams will inevitably be duplicated, go missing, or go stale over time. To combat these issues, data teams should rely on a centralized governance model that applies universal standards of data quality across the business.

You introduce a new term -- data observability -- in your book. What is data observability and how does it differ from traditional forms of data quality management?

Barr Moses: Traditionally, data teams have relied on data testing alone to ensure that pipelines are resilient; in 2021, as companies ingest ever-increasing volumes of data and pipelines become more complex, this approach is no longer sufficient.

Over the last two decades, DevOps engineers have developed best practices of observability to ensure applications stay up, running, and reliable. Just as application observability includes monitoring, tracking, and triaging incidents to prevent downtime, modern data engineers are applying the same principles to data.

Data observability refers to a team's ability to understand the health of their data at each stage in its life cycle, from ingestion in the data warehouse or lake to its manifestation in the BI layer.

Effective observability provides end-to-end lineage that allows you to expose downstream dependencies and automatically monitor your data-at-rest -- without extracting data from your data store and risking your security or compliance. Having observability makes audits, breach investigations, and other possible data disasters much easier to understand and resolve while keeping your CTO from having an ulcer!

What are some best practices for getting up and running with data observability?

Lior Gavish: Data observability can be broken down into five pillars (or data features) data practitioners should measure to better track data quality and reliability:

Freshness: Is the data recent? When was the last time it was generated? What upstream data is included/omitted?
Distribution: Is the data within accepted ranges? Is it properly formatted? Is it complete?
Volume: Has all the data arrived?
Schema: What is the schema and how has it changed? Who has made these changes and for what reasons?
Lineage: For a given data asset, what are its upstream sources, and what downstream assets are impacted by it? Who are the people generating this data, and who is relying on it for decision making?
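
To make four of these pillars concrete (lineage is a graph problem and is left out here), the sketch below checks one table snapshot against a set of expectations. The snapshot fields, thresholds, and the `check_table` helper are assumptions chosen for illustration, not the API of any observability product.

```python
from datetime import datetime, timedelta

def check_table(snapshot, expectations, now):
    """Return a list of human-readable issues for one table snapshot."""
    issues = []
    # Freshness: was the table updated recently enough?
    if now - snapshot["last_updated"] > expectations["max_age"]:
        issues.append("freshness: table is stale")
    # Volume: did roughly all the expected data arrive?
    if snapshot["row_count"] < expectations["min_rows"]:
        issues.append("volume: fewer rows than expected")
    # Schema: were columns added or removed?
    changed = sorted(set(snapshot["columns"]) ^ set(expectations["columns"]))
    if changed:
        issues.append("schema: columns changed: " + ", ".join(changed))
    # Distribution: is the share of null values within the accepted range?
    if snapshot["null_rate"] > expectations["max_null_rate"]:
        issues.append("distribution: null rate too high")
    return issues

# Hypothetical metadata collected for an orders table.
snapshot = {
    "last_updated": datetime(2022, 1, 10, 0, 0),
    "row_count": 900,
    "columns": ["id", "amount_cents"],
    "null_rate": 0.01,
}
expectations = {
    "max_age": timedelta(days=1),
    "min_rows": 1000,
    "columns": ["id", "amount"],
    "max_null_rate": 0.05,
}

issues = check_table(snapshot, expectations, now=datetime(2022, 1, 12, 0, 0))
for issue in issues:
    print(issue)
```

Fixed thresholds like these are the simplest possible choice; they stand in for whatever baseline a team actually uses to decide what counts as normal for a given table.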

A robust and holistic approach to data observability requires the consistent and reliable monitoring of these five pillars through a centralized interface that serves as a main source of truth about the health of your data. Unlike ad hoc queries or simple SQL wrappers, such monitoring doesn't stop at "field X in table Y has values lower than Z today."

An effective, proactive data observability solution will also provide end-to-end lineage that allows you to track downstream dependencies. Additionally, it will automatically monitor your data at rest without requiring the extraction of data from your data store. This approach ensures that you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.
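
Tracking downstream dependencies amounts to walking the lineage graph from a given asset. The following is a minimal sketch that assumes lineage is stored as a mapping from each asset to its direct downstream assets; the asset names and the `downstream_assets` helper are hypothetical.

```python
from collections import deque

def downstream_assets(downstream_of, start):
    """Return every asset reachable downstream of `start`, breadth-first."""
    found = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in downstream_of.get(node, []):
            if child not in visited:
                visited.add(child)
                found.append(child)
                queue.append(child)
    return found

# Hypothetical lineage: each asset maps to its direct downstream assets.
lineage = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_orders"],
    "fct_orders": ["revenue_dashboard", "finance_export"],
}

impacted = downstream_assets(lineage, "stg_orders")
print(impacted)
```

A list like this tells the team which reports and exports to flag, and whom to notify, when an upstream table breaks.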

About the Authors

Barr Moses is CEO and co-founder of Monte Carlo, a data reliability company and creator of a data observability platform. Previously, she was VP customer operations at customer success company Gainsight, where she helped scale the company 10x in revenue and, among other functions, built the data/analytics team. Prior to that, she was a management consultant at Bain & Company and a research assistant at the statistics department at Stanford University. She also served in the Israeli Air Force as a commander of an intelligence data analyst unit. Barr graduated from Stanford with a B.Sc. in mathematical and computational science.

Lior Gavish is CTO and co-founder of Monte Carlo, a data observability company. Prior to Monte Carlo, Lior co-founded cybersecurity startup Sookasa, which was acquired by Barracuda in 2016. At Barracuda, Lior was SVP of engineering, launching ML products for fraud prevention. Lior holds an MBA from Stanford and an M.Sc. in computer science from Tel-Aviv University.

Molly Vorwerck is the head of content and community for Monte Carlo, a data reliability company and creator of the Monte Carlo Data Observability Platform. Previously, she led the tech brand team at Uber, where she managed editorial strategy for the Uber engineering blog, the Uber research review program, and Uber AI. Prior to that, she wrote for USA Today, covering U.S. history, politics, and culture. She graduated from Stanford University with a B.A. in American studies and served as managing editor for The Stanford Daily. When she’s not writing or thinking about data, she’s probably watching The Great British Baking Show or reading a murder mystery.
