

Data Management: What’s Ahead for 2022

Data management has evolved into a very different discipline from what it was just 10 years ago.

In TDWI’s crystal ball for 2022, the most noteworthy trends we’re following relate to fundamental changes in the underlying configuration of cloud data environments, the shape of the data itself, and the growing footprint of synthetic data -- fake but valuable -- in DataOps and MLOps pipelines.

For Further Reading:

Executive Q&A: Containers and Kubernetes Accelerating Digital Transformation in 2021

Overcome Data Shortages for ML Model Training with Synthetic Data

Understanding Connected Data Is the Key to Understanding Your Customer

Trend #1: Containerized data will be orchestrated far and wide in the multicloud

On one level, containerized databases are old news and widely adopted. Every public cloud provider offers customers the ability to run their databases in containers and to orchestrate these across Kubernetes clusters.

Kubernetes is the foundation for the new generation of cloud-native data management architectures. The most noteworthy trend over the past few years has been the recrystallization of the data ecosystem around this orchestration platform. However, it’s not clear how long it will take for this trend to play itself out through the cloud data and analytics platform stack. Over the past four years, there have been industry efforts to standardize how Spark, TensorFlow, Hadoop, streaming, distributed object store, block storage, and other components of this stack are decoupled, containerized, and orchestrated over cloud-native fabrics.

However, as we push into 2022, one would be hard-pressed to point to a single DBMS that is thoroughly containerized for agile cloud-to-edge deployment. One of the chief stumbling blocks has been the awkward manner in which Kubernetes handles stateful applications.

A DBMS running ACID transactions and other core enterprise functions is a stateful application, and it is one of the most complex workloads to run on container orchestration platforms such as Kubernetes. The crux of the issue is that a pod -- the Kubernetes unit that runs one or more containers -- uses local storage that, by default, does not persist data once the pod terminates. This constraint prevents the containerization of an essential DBMS function -- persistent state management -- as a composable microservice.

To store data, and thus persist state across transient instances of containerized processes, Kubernetes environments must attach external storage volumes. This generally involves a Kubernetes abstraction known as the StatefulSet, which gives each pod a stable identity and binds it to a persistent volume -- typically provisioned from storage outside the cluster -- so that application data survives the failure of any pod or node on which the database is running.
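To make this concrete, here is a minimal sketch -- using the official Kubernetes Python client, with invented names, images, and sizes -- of a StatefulSet whose volumeClaimTemplates give each database pod its own persistent volume, so the data outlives any individual container:

```python
# Minimal sketch: a StatefulSet for a containerized database. Each replica
# gets its own PersistentVolumeClaim via volumeClaimTemplates, so its data
# survives pod restarts and rescheduling. All names and sizes are
# illustrative, not tied to any particular vendor's offering.
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig is available locally

stateful_set = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "demo-db"},
    "spec": {
        "serviceName": "demo-db",
        "replicas": 3,
        "selector": {"matchLabels": {"app": "demo-db"}},
        "template": {
            "metadata": {"labels": {"app": "demo-db"}},
            "spec": {
                "containers": [{
                    "name": "db",
                    "image": "postgres:14",  # any stateful DBMS image
                    "volumeMounts": [{
                        "name": "data",
                        "mountPath": "/var/lib/postgresql/data",
                    }],
                }],
            },
        },
        # One PersistentVolumeClaim per replica; the claim -- and thus the
        # database's state -- persists independently of the pod's lifecycle.
        "volumeClaimTemplates": [{
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": "10Gi"}},
            },
        }],
    },
}

client.AppsV1Api().create_namespaced_stateful_set(
    namespace="default", body=stateful_set
)
```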

Consequently, the future of Kubernetes-orchestrated database platforms depends on the maturity of infrastructure for persisting application state, because Docker and other container runtimes cannot manage this critical function on their own. It’s no surprise that a growing assortment of commercial solutions does exactly this, but it remains a nascent requirement that TDWI expects will be the focus of hot start-ups in 2022 and beyond.

In 2022, TDWI expects to see more Kubernetes-ready distributed-RDBMS platforms that have addressed the stateful persistence challenges head-on in their respective solutions. For example, Cockroach Labs offers this capability through a distributed SQL DBMS solution architecture. This offering functions like a single logical database while supporting multimaster, guaranteed transactions and enabling scalable deployment across regions and Kubernetes clusters without the need for federation.
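Because CockroachDB is wire-compatible with PostgreSQL, an ordinary Postgres driver is enough to work with such a cluster. The sketch below -- host, user, and database names are placeholders -- runs a standard ACID transaction against what the application sees as one logical database, even though rows may be replicated across nodes and regions:

```python
# Hedged sketch: connecting to a CockroachDB cluster with psycopg2, a
# standard PostgreSQL driver. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="demo-db-public",  # hypothetical Kubernetes service name
    port=26257,             # CockroachDB's default SQL port
    user="root",
    dbname="defaultdb",
    sslmode="disable",      # fine for a local demo; use TLS in production
)

# The connection context manager commits the transaction atomically on
# success, even though the rows may live on different nodes.
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)"
    )
    cur.execute("UPSERT INTO accounts (id, balance) VALUES (1, 100)")
conn.close()
```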

Other serverless data platform vendors have rolled out their own stateful persistence infrastructures to handle the same functions that Cockroach has baked into its distributed DBMS. Indeed, none of these solution providers could offer Kubernetes-based multinode serverless platforms had they not built their own state persistence infrastructures.


We also expect to see an increasing focus on Kubernetes-based distributed DBMS deployments from the vendors who’ve built their commercial database offerings on open source platforms (e.g., Cassandra, MongoDB, or Elasticsearch) that natively handle stateful operations such as sharding, failover, and replication.

Trend #2: Graph-shaped data will become the lifeblood of edge computing

Graph-shaped data arises wherever data sets are intricately connected and context-sensitive. It has long been the secret sauce in many AI applications. It is integral to cybersecurity, fraud prevention, influence analysis, sentiment monitoring, market segmentation, engagement optimization, geospatial analysis, and other AI applications where complex patterns must be identified rapidly.
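As a toy illustration of what "intricately connected" means in practice, consider this fraud-prevention sketch (data invented for illustration) using the open source networkx Python library: accounts that share a device or mailing address fall into the same connected component -- a pattern that is awkward to express over flat tables but trivial with a graph traversal:

```python
# Toy sketch: finding candidate fraud rings as connected components of a
# graph linking accounts to shared identifiers. All data is invented.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("acct:A", "device:1"), ("acct:B", "device:1"),  # two accounts, one device
    ("acct:B", "addr:X"),   ("acct:C", "addr:X"),    # shared mailing address
    ("acct:D", "device:2"),                          # unconnected account
])

# Any component containing more than one account is a candidate ring.
for component in nx.connected_components(G):
    accounts = {n for n in component if n.startswith("acct:")}
    if len(accounts) > 1:
        print("possible fraud ring:", sorted(accounts))
```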

In 2021, the graph database market continued its long streak of solid growth, though it still tends to be lumped under the too-broad NoSQL umbrella. MarketsandMarkets predicts that the graph database market will reach $2.4 billion by 2023, up from $821.8 million in 2018. Gartner predicts that by 2025 graph databases will be used in 80 percent of data analytics systems, a substantial rise from the current 10 percent.

Looking ahead to 2022 and beyond, graph-shaped data will form the backbone of our “new normal” existence. Graphs can illuminate the shifting relationships among users, nodes, applications, edge devices, and other entities. They’re becoming ubiquitous with the growth of edge computing, where graphs can describe how the “things” themselves -- such as sensor-equipped endpoints for consumer, industrial, and other uses -- are configured in nonhierarchical grids of incredible complexity.

However, graph databases are gaining a reputation as resource hogs. They are among the most ravenous consumers of processing, storage, I/O bandwidth, and other resources. If you're driving the results of graph processing into real-time applications, such as fraud prevention, you need an end-to-end low-latency database architecture.

In the new year, we will see more enterprise data analytics environments designed and optimized to support extreme-scale graph analysis. Consider, for example, TigerGraph’s recent scalability enhancements in version 3.2 of its parallel graph database. This version adds the ability to scale the database up and down as needed, replicate database clusters across regions, manage multinode parallel-processing deployments across Kubernetes clusters, and process hundreds of terabytes of graph data in a single job; it also doubles the number of built-in graph algorithms available for data science use cases.

Considering how much fresh funding flowed into graph database vendors such as TigerGraph, Neo4j, and ArangoDB in 2021, we can expect to see a sustained R&D focus on scaling their respective platforms to handle new challenges in global-scale, real-time graph analysis all the way to the edge. These and other graph database vendors will also invest in beefing up their multimodel database bona fides, including partnerships with leading cloud providers, in order to position their offerings for a broader range of enterprise opportunities and break away from the perception that they’re simply a niche technology segment.

Trend #3: Synthetic training data will occupy a growing footprint in enterprise data lakes

Today’s cloud powerhouses have made huge investments in data science. AWS, Microsoft, Google, and others have amassed growing sets of training data from their ongoing operations. However, we’re moving into an era in which anyone can tap cloud-based resources to cheaply automate the development, deployment, and optimization of innovative artificial intelligence (AI) and machine learning (ML) apps.


AI/ML is playing a growing role in automating the generation and labeling of synthetic training data. Synthetic training data is AI/ML-generated data that can substitute for data obtained from real operational applications and other sources. Its utility stems from the fact that it is consistent with the statistical and mathematical patterns of operationally sourced training data but is entirely devoid of any real-world information. Being entirely artificial, it is not likely to compromise privacy, expose intellectual property, or reveal trade secrets.
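The core idea can be illustrated in a few lines of Python. This is a deliberately naive sketch -- production generators use GANs, variational autoencoders, and other far more sophisticated techniques -- that fits the statistical shape of real records, then samples brand-new records matching those statistics but corresponding to no real individual:

```python
# Naive sketch of synthetic data generation: fit a multivariate normal to
# "real" records, then sample fresh rows with the same statistics. The
# stand-in columns (income, age, balance) are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for sensitive operational data.
real = rng.multivariate_normal(
    mean=[52_000, 41, 8_300],
    cov=[[9e7, 1e4, 4e6],
         [1e4, 9e1, 2e3],
         [4e6, 2e3, 5e6]],
    size=1_000,
)

# Sample synthetic rows from the fitted mean and covariance; they share the
# originals' statistical patterns but describe no real person.
synthetic = rng.multivariate_normal(
    real.mean(axis=0), np.cov(real, rowvar=False), size=1_000
)

print(np.round(real.mean(axis=0)))
print(np.round(synthetic.mean(axis=0)))  # closely matches the real means
```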

The next-generation data scientist will be able to generate synthetic but good-enough labeled training data on the fly to tune new apps for their intended purposes. Synthetic data generators are being used to create data that is free from demographic biases that may otherwise disadvantage some groups in some AI/ML applications. Synthetic data is also useful in traditional AI/ML scenarios when one needs to supplement an unbalanced training data set. It is also useful for generating data characteristic of fraud, cybersecurity, and “black swan” disaster scenarios that might be too rare to find in operational data sources.
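On the unbalanced-data point, one widely used approach is to synthesize additional minority-class examples. The sketch below uses SMOTE from the open source imbalanced-learn package on an invented, heavily skewed data set:

```python
# Sketch: oversampling a rare positive class (e.g., fraud) with SMOTE.
# The data set is synthetic and the 97/3 class skew is invented.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2_000, weights=[0.97, 0.03],
                           random_state=0)
print("before:", Counter(y))      # heavily skewed toward class 0

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # minority class synthetically balanced
```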

By the middle of this decade, free open source synthetic data will be everywhere. Gartner predicts that by 2024, 60 percent of the data used for the development of AI and analytics projects will be synthetically generated. As the availability of low-cost synthetic training data grows, the established software companies’ massive data lakes, in which their developers maintain petabytes of authentic training data, may become more of a burden than a strategic asset. Likewise, managing the complex data preparation logic required to use this source data may become a bottleneck that impedes developers’ ability to rapidly build, train, and deploy new AI apps.

When any developer can routinely make AI apps as accurate as Google’s or Facebook’s but with far less expertise, expense, and training data, a new era will have dawned. When we reach that tipping point, possibly in 2022, the next generation of data science-powered disruptors will start to eat away at yesteryear’s software start-ups.

The Bottom Line

To sum up, TDWI expects the following data management trends to continue and deepen:

  • Kubernetes is becoming the principal cloud platform for distributed databases as the industry continues to develop innovative ways to manage persistent application state in spite of container technologies’ limitations in this regard.

  • Graph databases are becoming the largest, most resource-consuming databases on the planet as they become a contextualization backbone for IoT, edge, mobility, cybersecurity, and other global infrastructure.

  • Synthetic data is becoming an essential ingredient for boosting the productivity of MLOps pipelines while reducing data scientists’ need for sensitive personal information and other operational data to train their models.

Your feedback about these prognostications is welcome and eagerly anticipated.
