Data Management: What’s Ahead for 2022
Data management has evolved into a very different discipline from what it was just 10 years ago.
- By James G. Kobielus
- December 21, 2021
In TDWI’s crystal ball for 2022, the most noteworthy trends we’re following relate to fundamental changes in the underlying configuration of cloud data environments, the shape of the data itself, and the growing footprint of synthetic -- in other words, fake but valuable -- data in the DataOps and MLOps pipelines.
Trend #1: Containerized data will be orchestrated far and wide in the multicloud
On one level, containerized databases are old news and widely adopted. Every public cloud provider offers customers the ability to run their databases in containers and to orchestrate these across Kubernetes clusters.
Kubernetes is the foundation for the new generation of cloud-native data management architectures. The most noteworthy trend over the past few years has been the recrystallization of the data ecosystem around this orchestration platform. However, it’s not clear how long it will take for this trend to play itself out through the cloud data and analytics platform stack. Over the past four years, there have been industry efforts to standardize how Spark, TensorFlow, Hadoop, streaming, distributed object store, block storage, and other components of this stack are decoupled, containerized, and orchestrated over cloud-native fabrics.
However, as we push into 2022, one would be hard-pressed to point to a single DBMS that is thoroughly containerized for agile cloud-to-edge deployment. One of the chief stumbling blocks has been the awkward manner in which Kubernetes handles stateful applications.
A DBMS running ACID transactions and other core enterprise functions is a stateful application and is one of the most complex workloads to run in containerized backplanes such as Kubernetes. The crux of the issue is that a pod -- the Kubernetes unit that runs one or more containers -- relies on local storage that does not persist data once that pod is terminated. This Kubernetes constraint prevents containerization of this essential DBMS function -- persistent state management -- as a composable microservice.
To store data, and thus persist state across transient instances of containerized processes, Kubernetes environments must attach to external storage volumes. This generally involves a Kubernetes abstraction known as a StatefulSet, which binds each pod to a stable identity and its own persistent volume claim, so that external storage systems -- rather than the containers themselves -- handle the distribution, management, and persistence of application data, allowing it to survive the failure of any node on which the database is running.
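To make that pattern concrete, here is a minimal sketch -- with a hypothetical name, container image, and storage size -- of the StatefulSet shape described above, expressed as a plain Python dictionary that could be serialized and applied to a cluster. The volumeClaimTemplates section is the key piece: it gives each database replica its own persistent volume claim backed by external storage, so the data outlives any individual pod.

```python
import json

# Minimal sketch of a StatefulSet for a containerized DBMS (hypothetical names and sizes).
statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "example-db"},
    "spec": {
        "serviceName": "example-db",            # stable network identity per replica
        "replicas": 3,
        "selector": {"matchLabels": {"app": "example-db"}},
        "template": {
            "metadata": {"labels": {"app": "example-db"}},
            "spec": {
                "containers": [{
                    "name": "db",
                    "image": "postgres:14",     # any containerized DBMS image would do
                    "volumeMounts": [{"name": "data",
                                      "mountPath": "/var/lib/postgresql/data"}],
                }],
            },
        },
        # One claim per replica; the claim -- and the data behind it -- outlives the pod.
        "volumeClaimTemplates": [{
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": "10Gi"}},
            },
        }],
    },
}

print(json.dumps(statefulset, indent=2))  # serialize for use with kubectl apply -f -
```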
Consequently, the future of Kubernetes-orchestrated database platforms depends on the maturity of infrastructure for persisting application state, because Docker and other container runtimes cannot be relied on to manage this critical function. It’s no surprise that there is a growing assortment of commercial solutions that do this, but this is a nascent requirement that TDWI expects will be the focus of hot start-ups in 2022 and beyond.
In 2022, TDWI expects to see more Kubernetes-ready distributed-RDBMS platforms that address these stateful persistence challenges head-on. For example, Cockroach Labs offers this capability through CockroachDB, its distributed SQL DBMS. The offering functions like a single logical database while supporting multimaster, guaranteed transactions and enabling scalable deployment across regions and Kubernetes clusters without the need for federation.
Other serverless data platform vendors have rolled out their own stateful persistence infrastructures to handle the same functions that Cockroach Labs has baked into its distributed DBMS. Indeed, none of these solution providers would be able to offer Kubernetes-based multinode serverless platforms if they hadn’t built their own state persistence infrastructures.
We also expect to see an increasing focus on Kubernetes-based distributed DBMS deployments from vendors that have built their commercial database offerings on open source platforms (e.g., Cassandra, MongoDB, or Elasticsearch) that natively handle stateful operations such as sharding, failover, and replication.
Trend #2: Graph-shaped data will become the lifeblood of edge computing
Graph-shaped data arises wherever data sets are intricately connected and context-sensitive. It has long been the secret sauce in many AI applications. It is integral to cybersecurity, fraud prevention, influence analysis, sentiment monitoring, market segmentation, engagement optimization, geospatial analysis, and other AI applications where complex patterns must be rapidly identified.
In 2021, the graph database market continued its long streak of solid growth, though it still tends to be lumped under the too-broad NoSQL umbrella. MarketsandMarkets predicts that the graph database market will reach $2.4 billion by 2023, up from $821.8 million in 2018. Gartner predicts that by 2025 graph databases will be used in 80 percent of data analytics systems, a substantial rise from the current 10 percent.
Looking ahead to 2022 and beyond, graph-shaped data will form the backbone of our “new normal” existence. Graphs can illuminate the shifting relationships among users, nodes, applications, edge devices, and other entities. They’re becoming more ubiquitous with the growth of edge computing, for which graphs can describe how the “things” themselves -- such as sensor-equipped endpoints for consumer, industrial, and other uses -- are configured in nonhierarchical grids of incredible complexity.
However, graph databases are increasingly gaining a reputation as resource hogs. They are among the most ravenous consumers of processing, storage, I/O bandwidth, and other resources. If you're driving the results of graph processing into real-time applications, such as fraud prevention, you need an end-to-end low-latency database architecture.
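To illustrate the kind of relationship query these applications depend on -- independent of any particular vendor’s engine -- the following sketch uses the open source networkx library and a small, hypothetical transaction graph to flag every account within two hops of a known fraudulent account.

```python
import networkx as nx

# Hypothetical transaction graph: nodes are accounts, edges are payments between them.
G = nx.Graph()
G.add_edges_from([
    ("acct_A", "acct_B"),
    ("acct_B", "acct_C"),
    ("acct_C", "acct_D"),
    ("acct_E", "acct_F"),   # unrelated cluster
])

known_fraud = "acct_A"

# Find every account within two hops of the flagged account -- the sort of
# connection-chasing query that is awkward in a relational schema but natural in a graph.
nearby = nx.single_source_shortest_path_length(G, known_fraud, cutoff=2)
suspects = [acct for acct, hops in nearby.items() if 0 < hops <= 2]

print(suspects)  # ['acct_B', 'acct_C']
```

In a production fraud-prevention pipeline, the same traversal would run continuously over billions of edges, which is exactly why the low-latency, resource-hungry architectures described above matter.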
In the new year, we will see more enterprise data analytics environments designed and optimized to support extreme-scale graph analysis. For example, consider TigerGraph’s recent scalability enhancements in version 3.2 of its parallel graph database. This version adds the ability to scale the database up and down as needed, replicate database clusters across regions, manage multinode parallel-processing deployments across Kubernetes clusters, and process hundreds of terabytes of graph data in a single job; it also doubles the number of graph algorithms available for data science use cases.
Considering how much fresh funding flowed into graph database vendors such as TigerGraph, Neo4j, and ArangoDB in 2021, we can expect to see a sustained R&D focus on scaling their respective platforms to handle new challenges in global-scale, real-time graph analysis all the way to the edge. These and other graph database vendors will also invest in beefing up their multimodel database bona fides, including partnerships with leading cloud providers, in order to position their offerings for a broader range of enterprise opportunities and break away from the perception that they’re simply a niche technology segment.
Trend #3: Synthetic training data will occupy a growing footprint in enterprise data lakes
Today’s cloud powerhouses have made huge investments in data science. AWS, Microsoft, Google, and others have amassed growing sets of training data from their ongoing operations. However, we’re moving into an era in which anyone can tap into cloud-based resources to cheaply automate the development, deployment, and optimization of innovative artificial intelligence (AI) and machine learning (ML) apps.
AI/ML is playing a growing role in automating the generation and labeling of synthetic training data. Synthetic training data is AI/ML-generated data that can substitute for data obtained from real operational applications and other sources. Its utility stems from the fact that it is consistent with the statistical and mathematical patterns of operationally sourced training data, yet is entirely devoid of any real-world information. Being entirely artificial, it is unlikely to compromise privacy, expose intellectual property, or reveal trade secrets.
The next-generation data scientist will be able to generate synthetic but good-enough labeled training data on the fly to tune new apps for their intended purposes. Synthetic data generators are being used to create data that is free from demographic biases that may otherwise disadvantage some groups in some AI/ML applications. Synthetic data is also useful in traditional AI/ML scenarios when one needs to supplement an unbalanced training data set. It is also useful for generating data characteristic of fraud, cybersecurity, and “black swan” disaster scenarios that might be too rare to find in operational data sources.
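As a minimal sketch of that rebalancing use case -- using only NumPy and entirely hypothetical feature values -- the following example fits a simple multivariate normal distribution to a rare fraud class and samples statistically similar, but entirely artificial, rows to augment the training set. Production-grade synthetic data generators use far more sophisticated models, but the principle is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical operational data: 1,000 legitimate rows but only 20 fraud rows,
# each with three numeric features. In practice these would come from real systems.
legit = rng.normal(loc=[100.0, 2.0, 0.5], scale=[20.0, 0.5, 0.1], size=(1000, 3))
fraud = rng.normal(loc=[900.0, 8.0, 0.9], scale=[150.0, 1.0, 0.05], size=(20, 3))

# Fit a simple multivariate normal to the rare class: the synthetic rows follow the
# same statistical pattern as the real fraud cases but contain no real-world records.
mean = fraud.mean(axis=0)
cov = np.cov(fraud, rowvar=False)
synthetic_fraud = rng.multivariate_normal(mean, cov, size=480)

# Augment the training set so the classes are roughly balanced (1,000 vs. 500).
X = np.vstack([legit, fraud, synthetic_fraud])
y = np.concatenate([np.zeros(len(legit)), np.ones(len(fraud) + len(synthetic_fraud))])

print(X.shape, y.mean())  # (1500, 3) with roughly a one-third positive rate
```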
By the middle of this decade, free open source synthetic data will be everywhere. Gartner predicts that by 2024, 60 percent of the data used for the development of AI and analytics projects will be synthetically generated. As the availability of low-cost synthetic training data grows, the established software companies’ massive data lakes, in which their developers maintain petabytes of authentic training data, may become more of a burden than a strategic asset. Likewise, managing the complex data preparation logic required to use this source data may become a bottleneck that impedes the ability of developers to rapidly build, train, and deploy new AI apps.
When any developer can routinely make AI apps as accurate as Google’s or Facebook’s but with far less expertise, expense, and training data, a new era will have dawned. When we reach that tipping point, possibly in 2022, the next generation of data science-powered disruptors will start to eat away at yesteryear’s software start-ups.
The Bottom Line
To sum up, TDWI expects the following data management trends to continue and deepen:
- Kubernetes is becoming the principal cloud platform for distributed databases as the industry continues to develop innovative ways to manage persistent application state in spite of container technologies’ limitations in this regard.
- Graph databases are becoming the largest, most resource-consuming databases on the planet as they become a contextualization backbone for IoT, edge, mobility, cybersecurity, and other global infrastructure.
- Synthetic data is becoming an essential ingredient for boosting productivity of MLOps pipelines while reducing data scientists’ need for sensitive personal information and other operational data to train their models.
Your feedback about these prognostications is welcome and eagerly anticipated.