 
        
        
Is a Semantic Data Plane the Answer to Poor Data Management?

With so many obstacles to a successful data management strategy, can a semantic data plane make a difference?

By Bharti Patel
April 22, 2024
In spite of -- or perhaps because of -- the decades-long shift to hybrid computing and distributed cloud architectures, the jaw-dropping hardware and software improvements, and the breakneck pace of developments in generative AI, poor data management is still commonplace. In fact, results from the recent TDWI Data Management Maturity Assessment show that, although 71% of IT experts agreed their organization values data, only 19% said a strong data management strategy was in place, and close to half (45%) said their strategy wasn’t communicated. What’s going on here?  
A truly promising way to improve data management has emerged, but not from some vaguely defined AI solution. Rather, it comes from the semantic data plane -- a data fabric for structured data, unstructured data, and metadata that supports data virtualization, fast distributed query processing, and local data transformations. It may leverage AI to improve efficiency, but it is designed from the ground up primarily to resolve core data issues and their complexities.
Put the Business First
Above all else, an effective data management system must solve business problems and drive business value. That’s its raison d'être. Yet even though organizations know this, data management in practice is often not designed to serve business use cases quickly.
A system must be natively flexible enough to handle new kinds of use cases that emerge. Teams prioritizing business problems and use cases are asking: What kind of data is required to tackle the problem in front of me? What kind of data access is needed and by whom? How fast can I get the data I need to swiftly pivot to solving an unexpected or new business problem tomorrow, next month, or next year?
A semantic data plane prioritizes business problems by allowing users to focus on the content and meaning of structured and unstructured data instead of having to figure out how to connect to different storage locations, how to deal with different data formats, and how to move data from one source’s system to another. All data sets and objects are virtualized. If, for instance, a user creates BI queries, reports, and dashboards on data sets, the queries will continue to work if those data sets are moved.
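To make the idea of virtualized data sets concrete, here is a minimal sketch, not a description of any particular product. It assumes a hypothetical `Catalog` class that maps a logical data set name to a physical file. The "report" (`total_sales`) only ever refers to the logical name `sales`, so it keeps working after the file is moved from one directory to another. The class, the function names, and the sample data are all invented for illustration.

```python
import csv
import shutil
import tempfile
from pathlib import Path


class Catalog:
    """Hypothetical catalog mapping logical data set names to physical paths."""

    def __init__(self):
        self._locations = {}

    def register(self, name, path):
        self._locations[name] = Path(path)

    def move(self, name, new_path):
        # Relocate the file, then update the mapping; the logical name is unchanged.
        new_path = Path(new_path)
        new_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(self._locations[name]), str(new_path))
        self._locations[name] = new_path

    def read(self, name):
        # Callers pass only the logical name, never a path.
        with self._locations[name].open(newline="") as f:
            return list(csv.DictReader(f))


def total_sales(catalog):
    """A 'report' written purely against the logical name 'sales'."""
    return sum(float(row["amount"]) for row in catalog.read("sales"))


# Invented sample data in a temporary directory standing in for on-premises storage.
root = Path(tempfile.mkdtemp())
on_prem = root / "on_prem" / "sales.csv"
on_prem.parent.mkdir(parents=True)
on_prem.write_text("customer,amount\nacme,120.5\nglobex,79.5\n")

catalog = Catalog()
catalog.register("sales", on_prem)
before = total_sales(catalog)

# Move the data set to a different location; the report code does not change.
catalog.move("sales", root / "cloud" / "sales.csv")
after = total_sales(catalog)
```

A real data plane would resolve names across object stores, databases, and file systems rather than local paths, but the design choice is the same: consumers bind to a name, and only the catalog knows the location.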
Semantic search allows us to locate relevant documents quickly, based on their meaning, and advanced retrieval-augmented generation (RAG) pipelines make it easy to submit queries and get summaries and translations of data without having to be familiar with generative AI or having to identify specific documents in the first place.
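The retrieval step behind semantic search can be sketched in a few lines. This is a toy under stated assumptions: the `embed` function below is a simple word-count vector that stands in for a real embedding model, so it matches shared vocabulary rather than true meaning. The ranking logic (cosine similarity between a query vector and document vectors) is the part that carries over. The document IDs and texts are invented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a learned model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, documents, top_k=2):
    """Return the top_k (doc_id, score) pairs ranked by similarity to the query."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Invented sample documents.
documents = {
    "contract-17": "Master services contract with Acme covering renewal terms",
    "invoice-204": "Invoice for Acme, payment due in thirty days",
    "memo-9": "Office relocation schedule for the third quarter",
}

results = search("Acme contract renewal", documents)
```

In a RAG pipeline, the top-ranked passages would then be handed to a generative model as context for a summary or answer; that step is omitted here because it depends on the model being used.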
Because the semantic data plane supports structured data, unstructured data, and metadata, a business user can easily “connect the dots.” For instance, a query about a specific customer could lead to a sales report, contracts, and invoices. The same tool (a chat-based interface) can provide answers in natural language or provide tables or charts depending on the source of the data or the requested output format.
Clear the Path to the Right Data
Access to the right data is a non-negotiable requirement in the design of modern data management systems regardless of where data resides, what format it’s in, whether it’s structured or unstructured, or how it is stored, moved, or migrated. When data is poorly managed, those who need the latest clean data most have no direct access to it. A company may have loads of data, but much of it is dark -- that is, collected and stored but not used for business purposes. In fact, large organizations never use about half of all the data they store -- an average of 17 PB, according to Hitachi Vantara’s Modern Data Infrastructure Dynamics Report.
In an AI-focused world, large language models and customized, smaller models need the right data access, too. If models train on problematic data, they’ll deliver problematic responses, such as biased information and hallucinations. You may have trillions of parameters, yet large swaths of the underlying data may be unusable. What constitutes the right data changes over time as well: data decays, and stale data yields erroneous results. Data access can also be hampered by movement from on-premises storage to the cloud or vice versa, or from system to system via different data pipelines. When data moves, systems frequently break and data copies multiply.
Because a semantic data plane provides an abstraction layer to data, business users can access data in the same way no matter where the data is located. A user authenticated and authorized to access data can access that data using a consistent API across disparate data sources. For example, for relational data, access can be provided via an efficient API such as Apache Arrow Flight/Arrow Flight SQL, or ADBC (Arrow Database Connectivity). For legacy clients, the much slower JDBC and ODBC APIs can be supported. Again, domain experts do not need to know where the data is physically located as long as data access is fast across the hybrid cloud environments that support their work. Similar APIs are available for unstructured data.
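What a "consistent API" buys the client can be shown with a small sketch. ADBC's Python packages expose a DB-API 2.0 (PEP 249) style interface, the same connection-and-cursor shape used by many Python database drivers. The example below is written only against that shape. An in-memory SQLite database stands in for a remote source, which is an assumption made so the sketch runs anywhere; the table and its rows are invented.

```python
import sqlite3


def fetch_rows(connection, sql):
    """Run a query through any DB-API 2.0 (PEP 249) connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        cursor.close()


# Stand-in source (assumption): an in-memory SQLite database with invented rows.
# With an ADBC driver installed, a connection to a Flight SQL endpoint could be
# passed to fetch_rows instead, and the client code above would stay the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 120.5), ("globex", 79.5), ("acme", 30.0)],
)

rows = fetch_rows(
    conn,
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer",
)
conn.close()
```

The point of the design is that `fetch_rows` never learns which system answered the query. Arrow-based drivers add a further benefit this sketch does not show: results can be fetched as columnar batches rather than row by row, which is where the speed advantage over JDBC and ODBC mentioned above comes from.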
Reduce Excessive Duplication
Enterprises struggle mightily with having multiple copies of the same data. As their data grows, so does their data duplication -- and the money wasted on storing and maintaining copies. Whether knowingly or not, enterprises retain these multiple copies, either because they are not aware of what copies exist and where they live or because, even once they get a handle on all this extant data, they haven’t yet implemented ways to mitigate the problem. Data lakehouses have attempted to solve this problem by bringing data warehouse functionality directly to object storage and data lakes, but their approaches to versioning and their integration with distributed enterprise systems vary in efficacy.
The semantic data plane further mitigates the multiple copy problem by maintaining data lineage for all data sets (and files/objects) that are duplicated or moved. This allows an organization to quickly determine how one data set was derived from other data sets. Plus, adding decentralized semantic search capabilities means local document embeddings can be created that can be compared globally. By performing a similarity search, users can determine whether documents contain the same information.
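Lineage tracking of this kind is, at its core, a graph of "derived from" relationships. The following is a minimal sketch under that assumption. The `LineageGraph` class and the data set names are hypothetical and are not taken from any specific product. It shows how recording each copy or derivation lets you walk back to the original sources, and how two data sets that share the same roots can be flagged as candidate duplicates.

```python
from collections import deque


class LineageGraph:
    """Hypothetical lineage store: each data set records what it was derived from."""

    def __init__(self):
        self._parents = {}

    def record(self, dataset, derived_from=()):
        self._parents.setdefault(dataset, set()).update(derived_from)
        for parent in derived_from:
            self._parents.setdefault(parent, set())

    def upstream(self, dataset):
        """Every data set that `dataset` was derived from, directly or indirectly."""
        seen = set()
        queue = deque(self._parents.get(dataset, ()))
        while queue:
            current = queue.popleft()
            if current not in seen:
                seen.add(current)
                queue.extend(self._parents.get(current, ()))
        return seen

    def roots(self, dataset):
        """Original sources: data sets in the chain with no recorded parents."""
        chain = self.upstream(dataset) | {dataset}
        return {d for d in chain if not self._parents.get(d)}


# Invented data set names for illustration.
lineage = LineageGraph()
lineage.record("crm_export")
lineage.record("billing_db.invoices")
lineage.record("customer_360", derived_from=["crm_export", "billing_db.invoices"])
lineage.record("customer_360_copy_eu", derived_from=["customer_360"])
lineage.record("q1_revenue_report", derived_from=["customer_360_copy_eu"])

report_sources = lineage.upstream("q1_revenue_report")
same_origin = lineage.roots("customer_360") == lineage.roots("customer_360_copy_eu")
```

Shared roots are only a hint, not proof, that two data sets are redundant; a content comparison such as the similarity search described above would be the natural follow-up check.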
Improve Integration of Heterogeneous Systems
Another fundamental problem is that business systems are often not well integrated, so two different business units may be using different tools and solutions. Sometimes, even within the same business unit, there will be different systems that have not been integrated. Data might exist in different and inconsistent formats. As a result, data lives in different silos that aren’t interoperable. If systems are particularly complex -- as in the case of systems cobbled together from multiple vendor solutions and tools -- integration will be especially difficult.
Although technical teams often do have (or can learn) the skills needed for integration projects, finding the time, resources, and bandwidth is often the greater challenge. Business users are even worse off in these scenarios. AI may one day be able to help with some of these integration and documentation processes, but we’re not there yet. Rather, it’s architecture and tools that abstract away these integration challenges that will have the most immediate impact.
A semantic data plane serves this end by providing data virtualization and distributed query processing across disparate data sources. Even when joining data sets across heterogeneous systems, a business user does not need to know where data is physically located. There is no integration problem because all structured data sets (e.g., database tables, Parquet files, JSON files, CSV files, ORC files) are automatically converted into a highly efficient columnar data format that is also its own serialization format. This way, data can be efficiently transferred without any additional serialization and deserialization, which, in turn, supports joins of any kind of structured data across disparate data sources.
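A small sketch can show why a common columnar layout makes cross-format joins straightforward. The assumptions are stated plainly: a plain Python dict of lists (`{column name: values}`) stands in for a real columnar format, the text above does not name the format used, and the sample CSV and JSON data are invented. Once both sources are in the same layout, one join routine serves any pair of them.

```python
import csv
import io
import json


def columns_from_csv(text):
    """Parse CSV text into a column-oriented table: {column name: list of values}."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: [row[name] for row in rows] for name in rows[0]} if rows else {}


def columns_from_json(text):
    """Parse a JSON array of objects into the same column-oriented layout."""
    records = json.loads(text)
    return {name: [rec[name] for rec in records] for name in records[0]} if records else {}


def inner_join(left, right, key):
    """Hash join two columnar tables on `key`.

    Assumes `key` is the only column name the two tables share.
    """
    index = {}
    for i, value in enumerate(right[key]):
        index.setdefault(value, []).append(i)
    right_cols = [c for c in right if c != key]
    out = {c: [] for c in list(left) + right_cols}
    for i, value in enumerate(left[key]):
        for j in index.get(value, []):
            for c in left:
                out[c].append(left[c][i])
            for c in right_cols:
                out[c].append(right[c][j])
    return out


# Invented sample data from two differently formatted sources.
customers_csv = "customer_id,name\n1,Acme\n2,Globex\n"
invoices_json = (
    '[{"customer_id": "1", "amount": 120.5},'
    ' {"customer_id": "2", "amount": 79.5},'
    ' {"customer_id": "1", "amount": 30.0}]'
)

customers = columns_from_csv(customers_csv)
invoices = columns_from_json(invoices_json)
joined = inner_join(customers, invoices, "customer_id")
```

A production columnar format adds typed, contiguous buffers that can be sent over the network as-is, which is what removes the serialization and deserialization cost described above; the dict-of-lists here only illustrates the shared shape.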
The Semantic Data Plane Puts Us Nearer to the Answer
It’s worth paying close attention this year to how data management through the semantic data plane, AI-powered processes in data systems, and open source technologies leading to a “sixth data platform” will all unfold and collide. Perhaps we’re closer to understanding data than ever before, which makes tackling poor data management easier than ever before.