TDWI’s Predictions for Data Management 2024
As we turn the calendar to 2024, TDWI predicts the following generative AI-driven innovations in data management.
- By James G. Kobielus
- December 19, 2023
Innovations in data are coming fast and furious. Most noticeably, generative AI tools, powered by large language models (LLMs) and other advanced neural network algorithms, are driving fresh innovations in data platforms, governance practices, query optimization tools, and much more. Here are the data management trends we expect to see in 2024.
Prediction #1: Vector embeddings become a core enterprise data type
Generative AI has boosted the importance of vector embeddings. These embeddings, generated by neural network algorithms, represent the dimensions, connections, patterns, and structures implicit in complex unstructured and semistructured data collections.
In 2024, we’ll see enterprise data professionals place a higher priority on vector embeddings in digital strategies. This trend will be evident in the accelerating enterprise adoption of vector databases, which are optimized for fast vectorization of source data, accelerated query and search of vector embeddings, and execution of the high-dimensionality similarity analyses needed for generative AI.
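To make the similarity analysis mentioned above concrete, here is a minimal sketch of a nearest-neighbor lookup over embeddings using cosine similarity. The document names, the three-dimensional vectors, and the `top_k` helper are all hypothetical, chosen for illustration. A production vector database would typically rely on approximate indexes rather than the brute-force scan shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional embeddings; real models emit far more dimensions.
index = {
    "invoice_policy": [0.9, 0.1, 0.0],
    "travel_policy": [0.1, 0.9, 0.1],
    "security_policy": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # assumed embedding of a user's question

results = top_k(query, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The brute-force scan compares the query against every stored vector, which is fine for a toy index but is exactly the cost that purpose-built vector indexes are designed to avoid.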
Vector embeddings will become ubiquitous in enterprise data platforms within the next two to three years. Next-generation database management platforms will be optimized for fast, efficient, and agile processing of vector embeddings. To bring this essential feature into the core of enterprise architectures, many legacy database management systems will evolve to natively support generation, indexing, storage, query, and processing of vector embeddings. Newer database platforms will be deployed into cloud-native hyperscale platforms equipped with vectorization-optimized chip architectures.
Prediction #2: Retrieval-augmented generation becomes a core data governance practice
Enterprises are increasingly concerned about the possibility that LLMs will produce hallucinations -- incorrect assumptions made by the LLMs’ neural networks that lead to the generation of inaccurate but superficially plausible output.
In 2024, we’ll see more enterprises adopt retrieval-augmented generation (RAG) as a core data integration and governance practice. RAG is a technique for improving the accuracy of LLM output through automated verification against certified facts stored in external document stores, databases, and other sources. RAG can enhance the value of LLM output through greater contextualization. This approach can also enable traceability of LLM outputs to verifiable source links, which is critical for trustworthiness and compliance purposes.
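The retrieval and traceability steps described above can be sketched in a few lines. This is an illustrative toy, not a reference implementation: the documents, file paths, and helper names are hypothetical, retrieval uses simple word overlap as a stand-in for vector search, and the final call to an LLM is deliberately left out.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Rank documents by word overlap with the question (a stand-in for vector search)."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(doc["text"])), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]  # drop documents with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def build_prompt(question, passages):
    """Assemble a prompt that carries each passage's source label for traceability."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return (
        "Answer using only the context below and cite the bracketed source.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical certified documents; the paths and contents are made up.
documents = [
    {"source": "hr/leave-policy.txt", "text": "Employees accrue 20 days of paid leave per year."},
    {"source": "finance/expense-policy.txt", "text": "Expense reports are due within 30 days of travel."},
]

question = "How many days of paid leave do employees accrue?"
passages = retrieve(question, documents, k=1)
prompt = build_prompt(question, passages)
print(prompt)  # this prompt would then be sent to the LLM of your choice
```

Because every passage enters the prompt with its source label attached, an answer can be traced back to the document that supported it.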
Adoption of LLMs is heightening the risk that hallucinations will not only be placed into circulation but will also come back to corrupt the “single version of truth” data that is central to decision-making, business operations, and stakeholder engagement. To mitigate this risk over the next two to three years, enterprises will adopt RAG as a new and essential discipline under their data governance practices. Supplementing established approaches such as data profiling, data cleansing, and master data management, RAG will be an essential set of skills, tools, processes, and platforms for ensuring the veracity and trustworthiness of all enterprise data, including that which was automatically generated by AI systems.
Prediction #3: Synthetic data significantly grows its footprint in enterprise data lakes
Enterprises can scarcely find all the data they need to build and train today’s most sophisticated AI, machine learning, and other advanced analytics applications. Complicating the data professional’s life are legal, regulatory, and budgetary constraints (among others) that prevent them from accessing and using privacy-relevant, proprietary, and other data sources.
In 2024, more enterprise data professionals will use generative AI tools to produce synthetic data -- in other words, data that is entirely ersatz but statistically patterned on source data -- that meets their data science requirements while protecting privacy, avoiding bias, and steering clear of other such business risks. Synthetic data will be used by enterprise data scientists to build and train machine learning models for uncommon use cases for which valid source data is costly, sparse, or entirely unavailable. Indeed, synthetic data -- milled algorithmically within simulation environments -- will be an increasingly popular means of training robotic, embedded, and edge devices for complex scenarios that can’t be realistically or safely enacted in the field.
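As a rough illustration of what "statistically patterned on source data" means, the sketch below fits a mean and standard deviation to each column of a small table and then samples new rows from those distributions. It is a deliberately simple stand-in, assuming independent, normally distributed columns. Generative AI tools model far richer structure, including correlations between columns. The source rows (age, salary) are made up.

```python
import random
import statistics

def fit_columns(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def sample_rows(params, n, seed=0):
    """Draw n synthetic rows, sampling each column independently from a Gaussian."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

# Hypothetical source records: [age, salary]
source = [
    [34, 52000.0],
    [45, 61000.0],
    [29, 48000.0],
    [52, 75000.0],
    [41, 58000.0],
]

params = fit_columns(source)
synthetic = sample_rows(params, n=1000, seed=42)

synthetic_age_mean = statistics.mean(row[0] for row in synthetic)
print(f"source mean age: {params[0][0]:.1f}, synthetic mean age: {synthetic_age_mean:.1f}")
```

None of the synthetic rows is copied from the source table, yet their aggregate statistics track it. Note that independent sampling discards any relationship between age and salary, which is one reason real synthetic data generators are considerably more elaborate.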
Within the next two to three years, enterprise data lakes will become repositories of diverse sets of synthetic data for building and training AI for a wide range of MLOps use cases. In addition, we’ll see more enterprises begin to monetize their synthetic data through cloud data marketplaces and other channels, serving a growing market of developers for whom it makes more sense to acquire these critical assets from third parties than go to the trouble and expense of generating all the synthetic data they need from scratch.
The Bottom Line
Even those data management professionals whose companies are not at the forefront of generative AI will be impacted by these three trends.
If nothing else, your established providers of data management infrastructure will incorporate such innovations as vector embedding, retrieval-augmented generation, and synthetic data into their solution portfolios. Given the longstanding commitment of IT solution providers to AIOps to automate management of their offerings, it’s only a matter of time before data management professionals encounter this fresh wave of AI-centric innovations and leverage them for greater productivity.
James Kobielus is senior director of research for data management at TDWI. He is a veteran industry analyst, consultant, author, speaker, and blogger in analytics and data management. At TDWI he focuses on data management, artificial intelligence, and cloud computing. Previously, Kobielus held positions at Futurum Research, SiliconANGLE Wikibon, Forrester Research, Current Analysis, and the Burton Group. He has also served as senior program director, product marketing for big data analytics for IBM, where he was both a subject matter expert and a strategist on thought leadership and content marketing programs targeted at the data science community. You can reach him by email ([email protected]), on Twitter (@jameskobielus), and on LinkedIn (https://www.linkedin.com/in/jameskobielus/).