Reach Real-time Analytics on the Data Lake with GPU Acceleration
Hadoop was a significant improvement when one gigabit networking was the norm, but a GPU database is a much better fit for real-time analytics than a traditional data lake.
- By Woody Christy
- October 6, 2017
The data lake is often defined as a single store for all the raw data that anyone in an organization might need to analyze. Over time, the metaphor has been extended to include multiple feeder streams that fill the lake and multiple lakefronts that offer different views.
In traditional data warehousing terms, these are simply source systems and data marts. Plenty of marketing speak gets flung around, such as "analytics sandboxes" and "logical data warehouses," but it all describes the tried-and-true data mart. To create a data mart, users take the raw data stored in the data lake and transform it into a query-friendly format to solve the business problems that pay the bills.
A data lake is typically measured in petabytes; a data mart is usually measured in terabytes or tens of terabytes. In today's ever-changing business environment, the speed at which an organization can derive insights from this data is a key competitive advantage.
Hadoop's Benefits and Drawbacks
Hadoop emerged as a popular early choice for building data lakes. Hadoop systems provide large-scale data processing and storage at low cost. The Hadoop Distributed File System (HDFS) coupled with MapReduce, a batch processing framework that schedules tasks to where the data is, quickly became a hit. Hadoop allowed inexpensive clusters of commodity servers to solve massive scale problems by coupling storage and compute in the same node.
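The MapReduce model can be illustrated with a minimal, single-process sketch. This is not the Hadoop API; the function names and sample log lines are hypothetical, and each inner list merely stands in for an HDFS block that a real cluster would process on the node that stores it.

```python
from collections import defaultdict

def map_phase(record):
    # Emit a (word, 1) pair for every word in one input record.
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)

def run_job(blocks):
    # Each block stands in for an HDFS block; on a real cluster the map
    # tasks would be scheduled on the nodes that hold those blocks.
    mapped = [pair
              for block in blocks
              for record in block
              for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Hypothetical sample data: two "blocks" of log lines.
blocks = [["error disk full", "warn disk slow"], ["error network down"]]
counts = run_job(blocks)
print(counts)
```

The whole job must finish its map, shuffle, and reduce phases before any result is available, which is why the model suits batch work rather than interactive queries.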
This data locality was a significant performance improvement when one gigabit networking was the norm. It was also significantly cheaper than previous data warehousing technology that used monolithic SANs for storage. The major downside is that because storage and compute are coupled, the only way to add compute as use cases become more complex and compute-bound is to add more storage-heavy nodes, which increases the expense. Although a major breakthrough at the time, MapReduce can be brutally slow.
Technologies such as Apache Spark promise to take the Hadoop stack beyond batch, but even they rely on a "microbatch" approach instead of truly streaming in real time. Further, the complexities associated with development and ongoing management of Apache Spark code written to deliver real-time responses can be costly and overwhelming.
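To make the microbatch idea concrete, here is a minimal sketch that is not Spark code. The `micro_batches` helper, the one-second interval, and the sample events are all assumptions chosen for illustration. Events are collected into fixed windows, and nothing in a window is processed until the window closes.

```python
def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into consecutive fixed windows."""
    batch, window_end = [], None
    for ts, value in events:
        if window_end is None:
            window_end = ts + interval_ms
        # Close every window that ended before this event arrived.
        while ts >= window_end:
            yield window_end, batch
            batch, window_end = [], window_end + interval_ms
        batch.append((ts, value))
    if batch:
        yield window_end, batch

# Hypothetical events: (arrival time in milliseconds, payload).
events = [(0, "a"), (200, "b"), (1100, "c"), (2500, "d")]

# How long each event waits before its batch is even handed to the engine.
delays = {}
for window_end, batch in micro_batches(events, interval_ms=1000):
    for ts, value in batch:
        delays[value] = window_end - ts
print(delays)
```

In this sketch every event waits up to a full interval before processing begins, so the batch interval sets a floor on response time regardless of how fast the engine itself runs.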
Instead of using the familiar declarative language of SQL, analysts must dive into the bowels of Scala serialization, which adds significant complexity to what appears to be a simple task. Spark is also yet another compute-bound (and sometimes memory-bound) framework that forces Hadoop clusters to grow as more use cases are found for the underlying data.
Data Lake Storage and Analytics Options
The advent of the cloud has seen object stores, such as Amazon Web Services Simple Storage Service (S3) and Azure Data Lake Storage (ADLS), start to serve as the core storage of the data lake. This has allowed for greater flexibility and cost controls by separating the compute from the storage. Now, on-demand compute clusters can be spun up against the shared data lake, scaling compute as needed. MapReduce and Apache Spark have been modified to work directly with these object stores, so HDFS is no longer required to be the center of the data lake.
Data locality becomes less critical now that the major cloud providers often run 40- and 100-gigabit networks to each node, enabling massive read/write throughput. This pushes the bottleneck squarely onto RAM and CPU, which, as mentioned, can be scaled as needed in the cloud. No matter how many nodes are added, however, all the data must still be read remotely. This adds significant latency, making it next to impossible to meet real-time requirements.
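A rough back-of-the-envelope calculation shows the scale involved. The sketch below assumes a single node pulling one terabyte (an illustrative size) at the theoretical line rate, ignoring protocol overhead, object store request latency, and contention, so real numbers would be worse.

```python
def transfer_seconds(num_bytes, link_gbps):
    # Time to move num_bytes over a link at its theoretical line rate.
    return num_bytes * 8 / (link_gbps * 1e9)

ONE_TB = 10**12  # assumed data size: one decimal terabyte

times = {gbps: transfer_seconds(ONE_TB, gbps) for gbps in (1, 40, 100)}
for gbps, seconds in times.items():
    print(f"{gbps:>3} Gbit/s: {seconds:,.0f} seconds")
```

Under these assumptions, a terabyte takes more than two hours at one gigabit and still well over a minute at 100 gigabits. Faster networks shrink the problem considerably, but a remote read of that size is still far from a real-time response.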
There are many frameworks that bring SQL and other analytics capabilities to the data lake. Most of these are built on top of either Spark or MapReduce and read data from query-friendly formats in logical partitions of the data lake. These often suffer from a shortage of compute cycles on a shared Hadoop cluster or higher latency when deployed in the cloud.
In the last several years, the cost of RAM has decreased dramatically while its density has increased. This has led to the advent of in-memory databases that remove disk throughput and latency issues from the equation, enabling real-time data access and massively parallel ingest. This is critical for serving today's high-volume data flows. The Achilles' heel of in-memory databases is that they become compute-bound as soon as they perform any analytics at scale. Keeping data in memory therefore delivers real-time access to data, but not necessarily real-time analytics.
Reaching Real Time with GPUs
To meet the real-time needs for both access and analytics of data, in-memory GPU databases have emerged. GPUs may originally have been designed for graphics processing, but their massively parallel designs lend themselves to embarrassingly parallel problems (the kind Hadoop excels at) and to highly iterative tasks such as machine learning.
In the same commodity server that was once running Hadoop, adding a couple of GPU cards can now deliver hundreds of times more processing power. This enables functions -- such as filtering, grouping, summations, joins, and many others -- to be greatly accelerated. Not everything is nirvana just by adding GPUs, however. GPUs designed for compute are great at numerical calculations but less adept at text manipulation.
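Why do filtering, grouping, and summation parallelize so well? Each slice of the data can be processed independently and the partial results merged at the end. The sketch below shows only that structure; it runs on CPU threads as a stand-in, it is not GPU code, and the function names and sample rows are hypothetical. A GPU applies the same pattern across thousands of cores at once.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_sums(chunk, min_amount):
    # Filter and aggregate one slice of the data, independent of all others.
    totals = Counter()
    for region, amount in chunk:
        if amount >= min_amount:
            totals[region] += amount
    return totals

def grouped_sum(rows, min_amount, workers=4):
    # Split the rows into roughly equal chunks, one per worker.
    size = max(1, -(-len(rows) // workers))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is processed without coordination; only the small
        # partial results need to be combined afterward.
        for partial in pool.map(lambda c: partial_sums(c, min_amount), chunks):
            merged.update(partial)
    return dict(merged)

# Hypothetical sample rows: (region, amount).
rows = [("east", 10), ("west", 5), ("east", 1), ("west", 20), ("north", 7)]
result = grouped_sum(rows, min_amount=5)
print(result)
```

Python threads will not make this example faster; the point is that no chunk depends on any other, which is exactly the property that lets massively parallel hardware spread the work out.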
All is not lost, though, because the data can be manipulated by the CPUs or otherwise preprocessed -- the familiar extract, transform, and load -- using one of the previously mentioned frameworks. This step is the equivalent of creating the query-friendly format. At their core, GPU databases are databases with varying levels of SQL compliance. Developing solutions to business problems is much faster in SQL than in Scala or Java code. These advantages make the GPU database a much better fit for applications that need real-time analytics than a traditional data lake.
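As a small illustration of how compact the declarative approach is, the sketch below expresses a filter, a grouping, and a summation in one SQL statement. It uses SQLite's in-memory mode purely as a stand-in (SQLite is not a GPU database), and the `sales` table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a convenient SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("west", 5), ("east", 1), ("west", 20), ("north", 7)],
)

# One declarative statement covers the filter, the grouping, and the sum.
totals = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= 5
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()
print(totals)
```

The analyst states what result is wanted and leaves the execution strategy to the database, which is where an engine is free to apply whatever parallel hardware it has.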
Woody Christy is principal partner engineer at Kinetica and was previously at Cloudera, where he was senior manager of partner engineering. Woody has been fortunate to work in distributed systems his entire career. He led design and deployment for video-on-demand systems that scaled out to millions of end users, then moved on to developing real-time analytics systems, simulation software, and virtual systems. On joining Cloudera, Woody led the early integration with SAS and other advanced analytics partners. Woody earned a master’s degree in computer science from Western Illinois University.