Datafold Launches Open Source Data-diff to Compare Tables of Any Size Across Databases
Data engineers can now validate data pipelines at scale and high speed.
Note: TDWI’s editors carefully choose vendor-issued press releases about new or upgraded products and services. We have edited and/or condensed this release to highlight key features but make no claims as to the accuracy of the vendor's statements.
Datafold, a data reliability company, has announced data-diff, a new open source cross-database diffing package. This new product is an open source extension to Datafold’s original Data Diff tool for comparing data sets. Open source data-diff validates the consistency of data across databases using high-performance algorithms.
In the modern data stack, companies extract data from sources, load that data into a warehouse, and transform it so that it can be used for analysis, activation, or data science use cases. This process is familiarly known as ELT (extract, load, and transform). Datafold has focused on automated testing during the transformation step with Data Diff, ensuring that a change made to a data model does not break a dashboard or feed a predictive algorithm the wrong data.
With the launch of open source data-diff, Datafold can now help with the extract and load parts of the process. Open source data-diff verifies that the data loaded into a destination matches the source from which it was extracted. All parts of the data stack need testing for data engineers to create reliable data products, and Datafold now gives them coverage throughout the ELT process.
“Data-diff fulfills a need that wasn’t previously being met,” said Gleb Mezhanskiy, Datafold founder and CEO. “Every data-savvy business today replicates data between databases in some way -- for example, to integrate all available data in a warehouse or data lake to leverage it for analytics and machine learning. Replicating data at scale is a complex and often error-prone process, and although multiple vendors and open source tools provide replication solutions, there was no tooling to validate the correctness of such replication. As a result, engineering teams resorted to manual one-off checks and tedious investigations of discrepancies, and data consumers couldn’t fully trust the data replicated from other systems.”
Mezhanskiy continued, “Data-diff solves this problem elegantly by providing an easy way to validate consistency of data sets across databases at scale. It relies on state-of-the-art algorithms to achieve incredible speed. For example, comparing one-billion-row data sets across different databases takes less than five minutes on a regular laptop using data-diff. In addition, as an open source tool, it can be easily embedded into existing workflows and systems.”
Answering an Important Need
Today’s organizations are using data replication to consolidate information from multiple sources into data warehouses or data lakes for analytics. They’re integrating operational systems with real-time data pipelines, consolidating data for search, and migrating data from legacy systems to modern databases.
Thanks to amazing tools such as Fivetran, Airbyte, and Stitch, it’s easier than ever to sync data across multiple systems and applications. Most data synchronization scenarios call for 100 percent guaranteed data integrity, yet the practical reality is that in any interconnected system, records are sometimes lost due to dropped packets, general replication issues, or configuration errors. To ensure data integrity, it’s necessary to perform validation checks using a data diff tool.
Datafold’s approach constitutes a significant step forward for developers and data analysts who wish to compare multiple databases rapidly and efficiently, without building a makeshift diff tool themselves. Currently, data engineers use multiple comparison methods, ranging from simple row counts to comprehensive row-level analysis. The former is fast but not comprehensive, whereas the latter approach is slow but guarantees complete validation. Open source data-diff is fast and provides complete validation.
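The trade-off between the two traditional methods can be illustrated with a small sketch. It is not taken from data-diff: the tables are modeled as in-memory Python dictionaries keyed by a primary key, and the function names `row_count_matches` and `row_level_diff` are hypothetical. It shows that a row-count check passes even when a value has changed, whereas a row-level comparison finds the change but has to examine every row.

```python
def row_count_matches(source, target):
    """Cheap check: only compares how many rows each table holds."""
    return len(source) == len(target)


def row_level_diff(source, target):
    """Exhaustive check: compares every row by primary key.

    Returns a list of (key, source_row, target_row) for rows that differ;
    a missing row is reported as None on the side where it is absent.
    """
    diffs = []
    for key in sorted(source.keys() | target.keys()):
        if source.get(key) != target.get(key):
            diffs.append((key, source.get(key), target.get(key)))
    return diffs


# Hypothetical tables: primary key -> row values.
source = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
target = {1: ("alice", 10), 2: ("bob", 25), 3: ("carol", 30)}

counts_ok = row_count_matches(source, target)   # True: counts agree
changed = row_level_diff(source, target)        # finds the altered row for key 2
```

In a real deployment the row-level comparison would mean transferring every row out of both databases, which is why it is slow at scale.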
Open Source Data-diff for Building and Managing Data Quality
Data-diff uses checksums to verify complete consistency between two data sources quickly and efficiently. According to Datafold, this method completes a row-level comparison of 100 million records in just a few seconds without sacrificing the granularity of the resulting comparison.
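The release does not describe the algorithm beyond its use of checksums. One common way to combine checksums with row-level granularity is to checksum a range of keys on each side, stop if the two values agree, and otherwise split the range in half and repeat until the range is small enough to compare row by row. The following is a minimal sketch of that idea, not data-diff's actual implementation. It assumes integer primary keys and in-memory dictionaries standing in for tables; the names `segment_checksum` and `diff_segment` and the `threshold` parameter are hypothetical.

```python
import hashlib


def segment_checksum(rows, lo, hi):
    """Checksum of all rows whose key satisfies lo <= key < hi."""
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(repr((key, rows[key])).encode())
    return digest.hexdigest()


def diff_segment(source, target, lo, hi, threshold=2):
    """Return the keys in [lo, hi) whose rows differ between the tables."""
    # Matching checksums: treat the whole segment as identical and stop.
    if segment_checksum(source, lo, hi) == segment_checksum(target, lo, hi):
        return []
    # Small segment: fall back to comparing the rows directly.
    if hi - lo <= threshold:
        keys = sorted(k for k in source.keys() | target.keys() if lo <= k < hi)
        return [k for k in keys if source.get(k) != target.get(k)]
    # Otherwise split the key range in half and check each half.
    mid = (lo + hi) // 2
    return (diff_segment(source, target, lo, mid, threshold)
            + diff_segment(source, target, mid, hi, threshold))


# Hypothetical tables: 1,000 rows, with one changed and one missing in target.
source = {i: f"row-{i}" for i in range(1000)}
target = dict(source)
target[417] = "changed"
del target[900]

differing_keys = diff_segment(source, target, 0, 1000)
```

In this sketch the checksums are computed in Python for illustration. Against real databases, each checksum would presumably be computed by the database itself so that only a single value per segment crosses the network, and only the segments that disagree are examined further.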
Datafold has released data-diff under the MIT license. Currently, the software includes connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Presto, and Oracle. Datafold plans to invite contributors to build connectors for additional data sources and for specific business applications.
To learn more about Datafold’s open source data-diff, visit https://github.com/datafold/data-diff/.