Executive Q&A: All About Data Contracts
In this Q&A, Monte Carlo CTO Shane Murray explains the value and purpose of a new data tool -- data contracts -- to help data teams manage pipelines.
- By Upside Staff
- May 15, 2023
With the increasing complexity of data and analytics, organizations are facing growing difficulty keeping their data creation and data consumption teams on the same page. However, data contracts may be just the solution these companies need.
Upside: What do you mean by data contract? In the broadest terms, what is it?
Shane Murray: Despite what the name might imply, data contracts are not in-depth legal documents. Rather, data contracts are processes to help data producers and data consumers get on the same page, paired with a simple technical architecture to enforce them. This agreement on process between internal data stakeholders might include things like what data schema is being emitted, the range of values expected for that data, agreements regarding the freshness and availability of data, and details on security and access to the data.
The technical architecture will generally include a standardized format for contract definition and a registry to manage contract iterations.
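To make the ingredients listed above concrete, here is a minimal sketch of a contract definition in Python. It is an illustration only, not a format described in the interview: the class names (`FieldSpec`, `DataContract`), the field names, and the example values such as `checkout-team` are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One column or attribute covered by the contract (hypothetical shape)."""
    name: str
    type: str                          # e.g. "string", "integer", "number"
    required: bool = True
    min_value: Optional[float] = None  # expected range of values, if any
    max_value: Optional[float] = None


@dataclass(frozen=True)
class DataContract:
    """Schema, value ranges, freshness, and access in a single definition."""
    name: str
    version: int
    owner: str                         # the producing team
    fields: Tuple[FieldSpec, ...]
    freshness_minutes: int             # data should be no older than this
    allowed_readers: Tuple[str, ...]   # who may consume the data


# Example contract for a hypothetical "orders" data set.
orders_contract = DataContract(
    name="orders",
    version=1,
    owner="checkout-team",
    fields=(
        FieldSpec("order_id", "string"),
        FieldSpec("amount_usd", "number", min_value=0.0, max_value=100000.0),
        FieldSpec("coupon_code", "string", required=False),
    ),
    freshness_minutes=60,
    allowed_readers=("analytics", "data-science"),
)

# Serialize to JSON so the definition can be stored and reviewed like code.
contract_json = json.dumps(asdict(orders_contract), indent=2, sort_keys=True)
print(contract_json)
```

Keeping the definition as plain, serializable data is one possible design choice; it lets producers and consumers review changes to the contract the same way they review any other change.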
Who is the data contract between? (That is, is this between software engineers and a data owner, data scientist, or chief data officer?)
In broad strokes, data contracts are between data producers (typically software engineers) and data consumers (typically data engineers, data scientists, analysts, or even business stakeholders). The decision to establish and implement data contracts often begins with data leaders, such as chief data officers or VPs of data, given that they are solving a system-wide problem and will invariably require buy-in across data and technology organizations. It’s important to note that data contracts are only successful if all parties involved in the production and consumption of data buy into the approach.
How does a data contract work?
Although data contracts remain a fairly new idea and therefore haven’t seen much in the way of standardization across organizations, there are a few common components.
First, the data requirements pertaining to the contract must be gathered and documented, perhaps in something as simple as a Google document. You wouldn’t expect all emitted data to fall under a contract, only the data that is essential to business-critical reporting or machine learning applications.
The contract is implemented using a standard format -- commonly JSON, Protobuf, or Avro -- to specify the schema and related details. Then each iteration is stored in a registry or repository, allowing for managed evolution of the contract.
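The registry idea can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions, not a real schema registry product: `ContractRegistry` and its methods are hypothetical names, and a production setup would persist versions in a repository or a dedicated service.

```python
import json


class ContractRegistry:
    """Hypothetical in-memory registry that keeps every contract iteration."""

    def __init__(self):
        # contract name -> list of serialized versions, oldest first
        self._versions = {}

    def register(self, name, contract):
        """Store a contract and return its 1-based version number.

        Re-registering an unchanged contract does not create a new version.
        """
        serialized = json.dumps(contract, sort_keys=True)
        history = self._versions.setdefault(name, [])
        if not history or history[-1] != serialized:
            history.append(serialized)
        return len(history)

    def get(self, name, version=None):
        """Return a specific version, or the latest one if none is given."""
        history = self._versions[name]
        index = len(history) if version is None else version
        if not 1 <= index <= len(history):
            raise KeyError(f"{name} has no version {version}")
        return json.loads(history[index - 1])


registry = ContractRegistry()
first = {"fields": [{"name": "order_id", "type": "string"}]}
second = {"fields": [{"name": "order_id", "type": "string"},
                     {"name": "amount_usd", "type": "number"}]}

print(registry.register("orders", first))    # first iteration
print(registry.register("orders", second))   # second iteration
print(registry.get("orders", version=1))     # earlier iterations stay readable
```

Because earlier iterations remain available, consumers can compare versions and producers can evolve the contract in a managed way rather than overwriting it.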
Finally, you require mechanisms to proactively enforce the contract. These mechanisms include preventing changes to the schema that will break the agreement, monitoring the inevitable changes that occur, and stopping the flow of data if it doesn’t match the terms of the agreement.
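Two of these mechanisms, blocking breaking schema changes and stopping nonconforming data, can be sketched in a few lines of Python. The contract shape (a dict with a `fields` list), the function names, and the rules for what counts as "breaking" are assumptions made for illustration; each organization would define its own policy.

```python
# Mapping from contract type names to Python types (illustrative subset).
PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


class ContractViolation(Exception):
    """Raised to stop the flow of data that does not match the contract."""


def breaking_changes(old, new):
    """List changes in `new` that would break consumers relying on `old`."""
    new_fields = {f["name"]: f for f in new["fields"]}
    problems = []
    for spec in old["fields"]:
        name = spec["name"]
        if name not in new_fields:
            problems.append(f"field removed: {name}")
        elif new_fields[name]["type"] != spec["type"]:
            problems.append(
                f"type changed: {name} ({spec['type']} -> {new_fields[name]['type']})"
            )
        elif spec.get("required", True) and not new_fields[name].get("required", True):
            problems.append(f"field no longer required: {name}")
    return problems


def validate_record(contract, record):
    """Return a list of contract violations for a single record."""
    violations = []
    for spec in contract["fields"]:
        name, value = spec["name"], record.get(spec["name"])
        if value is None:
            if spec.get("required", True):
                violations.append(f"missing required field: {name}")
            continue
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool = isinstance(value, bool)
        if (is_bool and spec["type"] != "boolean") or not isinstance(
            value, PYTHON_TYPES[spec["type"]]
        ):
            violations.append(f"wrong type for {name}: expected {spec['type']}")
            continue
        low, high = spec.get("min_value"), spec.get("max_value")
        if low is not None and value < low:
            violations.append(f"{name} below minimum {low}")
        if high is not None and value > high:
            violations.append(f"{name} above maximum {high}")
    return violations


def enforce(contract, records):
    """Pass records through only if every one of them satisfies the contract."""
    for i, record in enumerate(records):
        violations = validate_record(contract, record)
        if violations:
            raise ContractViolation(f"record {i}: {'; '.join(violations)}")
    return records
```

In this sketch, `breaking_changes` would run before a producer's change is merged, while `enforce` would run inside the pipeline, halting a batch as soon as one record violates the agreement.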
I encourage you to check out the 7 lessons learned from Andrew Jones, a principal data engineer at GoCardless, to learn more about what this architecture could look like. Also, Chad Sanderson, head of product, data platform at Convoy, shared their approach to data contracts in the warehouse.
What are some of the biggest challenges associated with data contracts?
Organizational silos remain the biggest obstacle to even getting data contracts off the ground, because data and software teams work with disparate systems, reporting structures, objectives, and priorities. These differences make it difficult for data producers and consumers to agree on whether to adopt data contracts in the first place, let alone on what a successful data contract looks like. In other words, data teams are faced with the challenging task of answering the age-old question that software teams are undoubtedly going to ask them: what’s in it for me?
Before your organization chooses to adopt data contracts, there are several questions your team needs to ask itself. What incentives will need to be put in place for the agreement to hold? Can you get people to slow down enough to adhere to the contracts? Can data consumers truly know and understand their requirements ahead of time?
How do data contracts differ from data testing?
One could argue data contracts are just another form of data testing. However, data contracts differ from data testing in scope and effect. In terms of scope, although a data contract will include schema and data validation tests, it should go beyond them to specify an overarching agreement that includes schema and semantics, clear SLAs, and accountability.
In terms of effect, contracts differ from data testing in that they at least prevent nonconforming data from moving through pipelines and ideally prevent the upstream code change from being made in the first place.
How do you measure the success of these contracts? What kind of success have enterprises reported after implementing such a contract?
Because data contracts are a relatively novel concept in data, what “good” looks like can vary from business to business. Measures of success align with those used to manage data reliability more broadly. These include the total number of data incidents, the time to detect and resolve them, and the uptime of the data products that are supported by the contract.
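As a rough illustration of how these measures might be computed, the sketch below summarizes a list of incidents over a reporting window. The `Incident` record, the function name, and the definitions used (time to detect is start to detection, time to resolve is detection to resolution, and incidents do not overlap) are assumptions for the example, not definitions given in the interview.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """A data incident affecting a contract-backed data product (hypothetical)."""
    started: datetime
    detected: datetime
    resolved: datetime


def reliability_summary(incidents, window_start, window_end):
    """Summarize incident count, mean detection/resolution time, and uptime.

    Assumes incidents fall inside the window and do not overlap.
    """
    window = window_end - window_start
    count = len(incidents)
    if count == 0:
        return {"incidents": 0, "mean_hours_to_detect": None,
                "mean_hours_to_resolve": None, "uptime_pct": 100.0}
    one_hour = timedelta(hours=1)
    detect = sum((i.detected - i.started for i in incidents), timedelta())
    resolve = sum((i.resolved - i.detected for i in incidents), timedelta())
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return {
        "incidents": count,
        "mean_hours_to_detect": (detect / count) / one_hour,
        "mean_hours_to_resolve": (resolve / count) / one_hour,
        "uptime_pct": 100.0 * (1 - downtime / window),
    }


# Example: two incidents in a 30-day window (made-up timestamps).
start = datetime(2023, 4, 1)
incidents = [
    Incident(start + timedelta(days=3),
             start + timedelta(days=3, hours=2),
             start + timedelta(days=3, hours=6)),
    Incident(start + timedelta(days=10),
             start + timedelta(days=10, hours=4),
             start + timedelta(days=10, hours=12)),
]
print(reliability_summary(incidents, start, start + timedelta(days=30)))
```

Tracking the same summary before and after a contract is introduced gives a simple baseline for judging whether incidents and downtime are actually falling.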
A holistic approach pairs data observability, which provides the tools to monitor, alert on, triage, and troubleshoot incidents, with data contracts, which reduce the likelihood of an incident affecting business-critical data products.
A reduction in incidents, paired with increased uptime of data products, will signal that your data contract initiative is on the right track.
[Editor’s note: Shane Murray is the field CTO at Monte Carlo, where he partners with Monte Carlo’s customers on their data strategy and operations to realize the maximal value from initiatives including data quality, while evangelizing the growing data observability category. Prior to Monte Carlo, Shane was the SVP of data and insights at The New York Times, leading over 150 employees across data science, analytics, governance, and data platforms. Under his leadership, the team expanded into areas such as applied machine learning, experimentation, and data privacy, delivering research and insights that improved the Times’ ability to draw and retain a large audience and scale the digital subscription business, which grew 10-fold to 8 million subscriptions during his tenure.]