Reach Real-time Analytics on the Data Lake with GPU Acceleration

Hadoop was a significant improvement when one gigabit networking was the norm, but a GPU database is a much better fit for real-time analytics than a traditional data lake.

The data lake is often defined as a single store for all the raw data that anyone in an organization might need to analyze. The metaphor over time has been extended to include multiple feeder streams that fill the lake and multiple lakefronts with different views.

In traditional data warehousing terms, these just refer to source systems and data marts. There is often quite a bit of marketing speak flung around, such as "analytics sandboxes" and "logical data warehouses," but it all describes the tried-and-true data mart. To create a data mart, users take the raw data stored in the data lake and transform it into a query-friendly format to solve the business problems that are paying the bills.

The data lake is typically described at petabyte scale; the data mart in terabytes or tens of terabytes. In today's ever-changing business environment, how quickly one can derive insights from this data is a key competitive advantage.

Hadoop's Benefits and Drawbacks

Hadoop emerged as a popular early choice for building data lakes. Hadoop systems provide large-scale data processing and storage at low cost. The Hadoop Distributed File System (HDFS), coupled with MapReduce, a batch processing framework that schedules tasks to run where the data resides, quickly became a hit. Hadoop allowed inexpensive clusters of commodity servers to solve massive-scale problems by coupling storage and compute in the same node.

This data locality was a significant performance improvement when one gigabit networking was the norm. It was also significantly cheaper than previous data warehousing technology that used monolithic SANs for storage. The major downside is that as use cases become more complex and compute bound, storage-heavy compute nodes must be added, increasing the expense. Although a major breakthrough at the time, MapReduce can be brutally slow.

Technologies such as Apache Spark promise to take the Hadoop stack beyond batch, but even they rely on a "microbatch" approach instead of truly streaming in real time. Further, developing and maintaining Apache Spark code written to deliver real-time responses can be costly and overwhelming.
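
To make the "microbatch" point concrete, here is a minimal sketch in Scala of a Spark Structured Streaming job (the Kafka broker, topic, and S3 paths are hypothetical, and the Kafka connector is assumed to be on the classpath). Even with a short trigger interval, every cycle is still planned and executed as a small batch job, which is where the residual latency comes from.

    // Minimal Structured Streaming sketch: hypothetical Kafka topic in,
    // Parquet files on S3 out. Each trigger fires a small batch job.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object MicroBatchExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("microbatch-example")
          .getOrCreate()

        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
          .option("subscribe", "clickstream")                // hypothetical topic
          .load()

        // Even at a 10-second trigger, every cycle is planned, scheduled,
        // executed, and committed like a batch job, adding latency per batch.
        val query = events
          .selectExpr("CAST(value AS STRING) AS value")
          .writeStream
          .format("parquet")
          .option("path", "s3a://my-lake/events/")           // hypothetical path
          .option("checkpointLocation", "s3a://my-lake/checkpoints/events/")
          .trigger(Trigger.ProcessingTime("10 seconds"))
          .start()

        query.awaitTermination()
      }
    }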

Instead of using the familiar declarative language of SQL, analysts must dive into the bowels of Scala serialization, which adds significant complexity to what should be a simple task. Spark is another compute-bound (and sometimes memory-bound) framework that forces Hadoop clusters to grow as more use cases are found for the underlying data.
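
As a rough illustration (not code from the article), here is a simple aggregation written two ways against a hypothetical sales table: one line of declarative SQL versus the typed Dataset API, where case classes, encoders, and closure serialization all enter the picture.

    // Illustrative contrast: SQL versus typed Scala for one aggregation.
    import org.apache.spark.sql.{Dataset, SparkSession}

    // Any custom type in a typed Dataset needs an Encoder so Spark can
    // serialize it across the cluster; lambdas must be serializable too.
    case class Sale(region: String, amount: Double)

    object SqlVersusScala {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sql-vs-scala").getOrCreate()
        import spark.implicits._

        // Hypothetical Parquet data whose schema matches the Sale case class.
        val sales: Dataset[Sale] = spark.read.parquet("s3a://my-lake/sales/").as[Sale]
        sales.createOrReplaceTempView("sales")

        // Declarative: one line of familiar SQL.
        val totalsViaSql = spark.sql(
          "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

        // Typed API: groupByKey and mapGroups drag in encoders and closure
        // serialization, which is where the extra complexity creeps in.
        val totalsViaScala = sales
          .groupByKey(_.region)
          .mapGroups((region, rows) => (region, rows.map(_.amount).sum))
          .toDF("region", "total")

        totalsViaSql.show()
        totalsViaScala.show()
      }
    }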

Data Lake Storage and Analytics Options

With the advent of the cloud, object stores such as Amazon Web Services Simple Storage Service (S3) and Azure Data Lake Storage (ADLS) have begun to serve as the core storage of the data lake. This allows greater flexibility and cost control by separating compute from storage. Now, on-demand compute clusters can be spun up against the shared data lake, scaling compute as needed. MapReduce and Apache Spark have been modified to work directly with these object stores, so HDFS is no longer required to be the center of the data lake.
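
A minimal sketch of that separation, using a hypothetical bucket: an on-demand Spark cluster reads raw data straight from S3 through the s3a connector, with no HDFS layer in the middle, and can be torn down as soon as the job completes.

    // Compute reads directly from the object store; nothing persists on
    // the cluster's own disks, so the cluster itself is disposable.
    import org.apache.spark.sql.SparkSession

    object ObjectStoreQuery {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("object-store-query")
          .getOrCreate()

        // Hypothetical raw zone of the data lake on S3.
        val rawEvents = spark.read.json("s3a://my-data-lake/raw/events/")

        rawEvents.createOrReplaceTempView("events")
        spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type")
          .show()

        spark.stop()
      }
    }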

Data locality becomes less critical now that the major cloud providers routinely run 40 and 100 gigabit networks to each node, enabling massive read/write throughput. This pushes the bottleneck squarely onto RAM and CPU, which, as mentioned, can be scaled as needed in the cloud. No matter how many nodes are added, however, all the data must be read remotely. That adds significant latency, making it next to impossible to meet real-time requirements.

There are many frameworks that bring SQL and other analytics capabilities to the data lake. Most are built on top of either Spark or MapReduce and read data from query-friendly formats in logical partitions of the data lake. They often suffer from a shortage of compute cycles on a shared Hadoop cluster or from higher latency when deployed in the cloud.
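
The query-friendly format those frameworks read is typically columnar and partitioned. As a hypothetical example, raw JSON landed in the lake can be rewritten as Parquet partitioned by date, so a query engine reads only the partitions it actually needs instead of scanning the whole lake.

    // Hypothetical curation step: raw JSON to date-partitioned Parquet.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date}

    object CurateEvents {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("curate-events").getOrCreate()

        // Raw zone: newline-delimited JSON with an event_ts timestamp column.
        val raw = spark.read.json("s3a://my-data-lake/raw/events/")

        // Columnar, date-partitioned copy; engines that understand the layout
        // prune to the partitions a query touches.
        raw
          .withColumn("event_date", to_date(col("event_ts")))
          .write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("s3a://my-data-lake/curated/events/")

        spark.stop()
      }
    }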

In the last several years, the cost of RAM has decreased dramatically while its density has increased. This has led to the advent of in-memory databases, which remove disk throughput and latency from the equation, enabling real-time data access and massively parallel ingest. That is critical for serving today's high-volume data flows. The Achilles' heel of in-memory databases is that they become instantly compute bound when doing any analytics at scale. They provide real-time access to data, but not necessarily real-time analytics.

Reaching Real Time with GPUs

To meet real-time needs for both data access and analytics, in-memory GPU databases have emerged. GPUs may originally have been designed for graphics processing, but their massively parallel designs lend themselves to embarrassingly parallel problems (what Hadoop excels at) and to highly iterative tasks such as machine learning.

Adding a couple of GPU cards to the same commodity server that once ran Hadoop can deliver hundreds of times more processing power. This greatly accelerates functions such as filtering, grouping, summations, joins, and many others. Simply adding GPUs isn't Nirvana, however. GPUs designed for compute are great at numerical calculations but far less so at text manipulation.

All isn't lost, though, because the data can be manipulated by the CPUs or otherwise preprocessed -- the familiar extract, transform, and load -- using one of the previously mentioned frameworks. This step is the equivalent of creating the query-friendly format. At their core, GPU databases are still databases, with varying levels of SQL compliance. SQL is much faster for developing solutions to business problems than Scala or Java code. These advantages make the GPU database a much better fit than a traditional data lake for applications that need real-time analytics.
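
As a hedged sketch of the consuming side: most GPU databases expose their SQL dialect over standard interfaces such as JDBC or ODBC, so a real-time analytics query is just SQL. The driver URL, credentials, table, and interval syntax below are hypothetical placeholders, not any particular vendor's API.

    // Hypothetical JDBC query against a SQL-speaking GPU database.
    import java.sql.DriverManager

    object GpuDbQuery {
      def main(args: Array[String]): Unit = {
        // Placeholder connection string; a real deployment would use the
        // vendor's JDBC driver, host, and credentials.
        val conn = DriverManager.getConnection(
          "jdbc:gpudb://gpu-db-host:9191/analytics", "user", "password")
        try {
          val stmt = conn.createStatement()
          // Filters, grouping, and aggregation are exactly the operations
          // that spread well across thousands of GPU cores. The interval
          // syntax varies by SQL dialect.
          val rs = stmt.executeQuery(
            """SELECT region, COUNT(*) AS trips, AVG(fare) AS avg_fare
              |FROM taxi_trips
              |WHERE pickup_ts >= NOW() - INTERVAL '5' MINUTE
              |GROUP BY region""".stripMargin)
          while (rs.next()) {
            // Columns: 1 = region, 2 = trips, 3 = avg_fare
            println(s"${rs.getString(1)}: ${rs.getLong(2)} trips, avg fare ${rs.getDouble(3)}")
          }
        } finally {
          conn.close()
        }
      }
    }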

About the Author

Woody Christy is principal partner engineer at Kinetica and was previously senior manager of partner engineering at Cloudera. He has been fortunate to work in distributed systems his entire career. He led design and deployment for video-on-demand systems that scaled out to millions of end users, then moved on to developing real-time analytics systems, simulation software, and virtual systems. After joining Cloudera, he led the early integration with SAS and other advanced analytics partners. Woody earned a master’s degree in computer science from Western Illinois University.
