Data Mesh 101: What It Is and Why You Should Care (Part 1 of 3)
We explain the basics of the data mesh and whether it’s a good fit for your enterprise.
- By Sumit Pal
- February 12, 2024
Although the enterprise data landscape is littered with new data technology and offerings, the most pressing problem data teams face today isn’t a lack of technology or skills; it’s not knowing how to create a modern data experience. Data teams are struggling to tame the chaos and control the entropy of the data ecosystem and are evaluating modern approaches such as the data mesh to manage the confusion.
Why Data Mesh?
With the disaggregation of the data stack and the profusion of tools and data available, data engineering teams are often left to duct-tape the pieces together to build their end-to-end solutions. The data mesh, an idea first put forward by Zhamak Dehghani in 2019, is an emerging concept in the data world. It proposes a technological, architectural, and organizational approach to solving data management problems by breaking up the monolithic data platform and decentralizing data management across different domain teams and services.
In a centralized architecture, data is copied from source systems into a data lake or data warehouse to create a single source of truth serving analytics use cases. This approach quickly becomes difficult to scale: teams run into data discovery and data versioning issues, schema evolution, tight coupling, and a lack of semantic metadata.
The ultimate goal of the data mesh is to change the way data projects are managed within organizations, empowering teams across different business units to build data products autonomously under unified governance principles. It is a mindset shift from centralized to decentralized ownership, with the idea of creating an ecosystem of data products built by cross-functional domain data teams.
In a data mesh, each domain implements its own products and is responsible for its pipelines, from ingestion, storage, and processing through to delivery. This approach facilitates communication between different parts of the organization as data sets are distributed across different locations. The responsibility for generating value from data is delegated to the people who understand the business and domain-specific rules. Organizations such as ABN AMRO, Roche, JP Morgan Chase, Intuit, and many others have embraced the intrinsic principles of the data mesh to model, build, and govern data platforms faster by decoupling domains.
Components of a Data Mesh
The data mesh introduces new concepts such as data contracts and data domains, but it also leverages existing components such as data catalogs and data-sharing platforms to put its principles into practice. Important concepts to understand include:
A data domain is a fundamental concept for a data mesh. A domain has a functional context and is assigned to perform a certain task, and that is the reason it exists. Subject to organizational constraints, think of a data domain as a logical grouping of organizational units to fulfill a functional context. Once these domains interact and share data with each other, the mesh emerges. For example, marketing, sales, and products can be different data domains in an organization.
A domain distributes responsibility to people who are closest to the data, know the business rules, and understand the semantics of the data for that domain. Each domain consists of a team of full-stack developers, business analysts, and data stewards who ingest operational data, build data products, and publish them with data contracts to serve other domains. Examples include a marketing domain, sales domain, customer domain, etc.
Data products are nodes on the mesh that encapsulate structural components such as code, data, metadata, and infrastructure. Data products are carefully crafted, curated, and presented to consumers as self-service, providing a reliable and trustworthy source for sharing data across the organization.
Some examples of data products are data sets, tables, machine learning models, and APIs. A data domain owns its data products and is accountable for managing its service-level agreements (SLAs), data quality, and governance. In order to build data products, the domains need data catalogs, data-sharing platforms, and data contracts.
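To make the idea of a data product as a node on the mesh more concrete, here is a minimal Python sketch. It assumes a simple in-memory catalog; the names DataProduct, ProductKind, and DataCatalog, along with the example domains and teams, are hypothetical illustrations rather than part of any particular data mesh platform.

```python
from dataclasses import dataclass
from enum import Enum


class ProductKind(Enum):
    """The kinds of data products mentioned in the article."""
    DATASET = "dataset"
    TABLE = "table"
    ML_MODEL = "ml_model"
    API = "api"


@dataclass(frozen=True)
class DataProduct:
    """A node on the mesh: a product plus the metadata needed to find and trust it."""
    name: str
    domain: str                  # owning data domain, e.g. "sales"
    kind: ProductKind
    owner: str                   # team accountable for SLAs, quality, and governance
    tags: tuple[str, ...] = ()   # free-form labels used for discovery


class DataCatalog:
    """Hypothetical in-memory catalog that supports self-service discovery."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Products are keyed by "<domain>.<name>" so ownership is always explicit.
        key = f"{product.domain}.{product.name}"
        if key in self._products:
            raise ValueError(f"{key} is already registered")
        self._products[key] = product

    def by_domain(self, domain: str) -> list[DataProduct]:
        return [p for p in self._products.values() if p.domain == domain]

    def search(self, tag: str) -> list[DataProduct]:
        return [p for p in self._products.values() if tag in p.tags]


catalog = DataCatalog()
catalog.register(
    DataProduct("orders", "sales", ProductKind.TABLE, "sales-data-team", tags=("revenue",))
)
catalog.register(
    DataProduct("campaign_roi", "marketing", ProductKind.DATASET,
                "marketing-analytics", tags=("revenue", "campaigns"))
)

# A consumer in another domain discovers products by tag instead of asking a central team.
revenue_products = catalog.search("revenue")
```

Keying each product by its owning domain is a design choice of this sketch; it keeps accountability visible whenever a consumer looks a product up.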
Data contracts enable domain developers to create products according to specifications. These contracts ensure interface compatibility and include terms of service and an SLA. They cover the utilization of data and specify the required data quality.
The primary objective of data contracts is to establish transparency for data usage and dependencies while also outlining the terms of service and SLAs. However, when contracts are first introduced, users need time to familiarize themselves with the idea and to understand the importance of data ownership. Data contracts should also include information about schema, semantics, and lineage.
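One way to picture a data contract is as a small, machine-checkable specification that travels with the data product. The sketch below is only an illustration under stated assumptions: the DataContract class, its field names, and the null-fraction quality rule are hypothetical, and real contracts typically cover far more (semantics, lineage, versioning).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: schema, a quality threshold, an SLA, and terms of use."""
    product: str
    schema: dict[str, type]      # column name -> expected Python type
    max_null_fraction: float     # data quality: tolerated share of missing values
    freshness_hours: int         # SLA: maximum age of delivered data
    terms_of_use: str

    def validate(self, rows: list[dict]) -> list[str]:
        """Return a list of violations; an empty list means the rows satisfy the contract."""
        violations = []
        for column, expected in self.schema.items():
            values = [row.get(column) for row in rows]
            nulls = sum(v is None for v in values)
            # Quality check: too many missing values in this column.
            if rows and nulls / len(rows) > self.max_null_fraction:
                violations.append(f"{column}: too many nulls ({nulls}/{len(rows)})")
            # Interface check: every non-null value must match the declared type.
            if any(v is not None and not isinstance(v, expected) for v in values):
                violations.append(f"{column}: expected {expected.__name__}")
        return violations


contract = DataContract(
    product="sales.orders",
    schema={"order_id": int, "amount": float},
    max_null_fraction=0.0,
    freshness_hours=24,
    terms_of_use="internal analytics only",
)

good_rows = [{"order_id": 1, "amount": 9.5}]
bad_rows = [{"order_id": "A1", "amount": None}]

good_result = contract.validate(good_rows)   # no violations
bad_result = contract.validate(bad_rows)     # a type violation and a null violation
```

A producing domain could run a check like this before publishing, so that consumers see violations as contract failures rather than as surprises in their own pipelines.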
Data sharing is another important component of the data mesh. It allows data domains to share data products rather than copying them. This minimizes data movement and discrepancies across systems and ensures that data contracts are adhered to when data products are shared in a secure, scalable, and zero-copy way.
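The difference between sharing and copying can be sketched as handing consumers a reference to the producer's data once they have accepted its terms. The following Python example is a simplified illustration: SharingPlatform, AccessGrant, and the storage location shown are hypothetical names, and an in-memory registry stands in for a real data-sharing platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedProduct:
    name: str
    location: str        # where the owning domain stores the data (hypothetical URI)
    terms_of_use: str    # taken from the product's data contract


@dataclass(frozen=True)
class AccessGrant:
    """A reference to the producer's data, not a copy of it."""
    consumer: str
    product: str
    location: str
    read_only: bool = True


class SharingPlatform:
    """Hypothetical registry that issues grants only after terms are accepted."""

    def __init__(self) -> None:
        self._products: dict[str, SharedProduct] = {}
        self._accepted: set[tuple[str, str]] = set()   # (consumer, product) pairs

    def publish(self, product: SharedProduct) -> None:
        self._products[product.name] = product

    def accept_terms(self, consumer: str, product_name: str) -> None:
        if product_name not in self._products:
            raise KeyError(product_name)
        self._accepted.add((consumer, product_name))

    def request_access(self, consumer: str, product_name: str) -> AccessGrant:
        product = self._products[product_name]
        if (consumer, product_name) not in self._accepted:
            raise PermissionError(f"{consumer} has not accepted the terms for {product_name}")
        # The grant points at the producer's storage; no data is moved or duplicated.
        return AccessGrant(consumer, product_name, product.location)


platform = SharingPlatform()
platform.publish(
    SharedProduct("sales.orders", "s3://sales-domain/orders/", "internal analytics only")
)

platform.accept_terms("marketing", "sales.orders")
platform.accept_terms("finance", "sales.orders")
marketing_grant = platform.request_access("marketing", "sales.orders")
finance_grant = platform.request_access("finance", "sales.orders")
```

Because both grants resolve to the same location, the two consuming domains read one authoritative version of the data instead of maintaining their own copies.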
Data Mesh: When to Use and When Not to Use
Today, there is no generally accepted standard for implementing a data mesh -- adopting a data mesh isn’t a one-size-fits-all solution and requires a fundamental shift in how distributed data teams across the organization think about data platforms. General indicators of when data mesh is the right choice for an organization include:
- Global deployments of data platforms span multiple business units across multiple domains
- Autonomous teams work independently with diverse data and analytics, each with their own full-stack tools that are disparate from other units
- Long delays, iterations, and scaling bottlenecks are due to centralized IT and data engineering teams who are responsible for implementing new data sets and building data pipelines for the entire organization
- Domain-specific data has nuanced semantics and complex preparation needs that require domain specialists rather than engineering specialists
- The overall business philosophy defaults to decentralizing control to business units and domains, with a relatively thin layer of shared services provided by centralized corporate services
On the flip side, data mesh is not a fit for enterprises with a fully decentralized setup, as it needs centralized coordination to align, enable, and support the decentralized data teams. Likewise, data mesh is recommended only if an organization has a critical mass of data talent or its data teams’ engineering practices are mature.
Data mesh is not a silver bullet for data management problems. However, large organizations looking to scale their data platforms -- both operational and analytical -- eventually adopt some form of data mesh. Data mesh with a data product approach is emerging as a compelling way to manage data ecosystems within enterprises, with the goal of bringing agility and velocity by creating a self-serve approach to data management.
In Part 2 of this three-part series, I will delve deeper into how organizations should go about adopting the data mesh paradigm.
Sumit Pal is the strategic technology director at Ontotext, a leading global provider of enterprise knowledge graph (EKG) technology and semantic database engines, and a former VP at Gartner. Follow the company on LinkedIn or X/Twitter.