Tech Talk: Bringing Sexy (Hardware) Back
Big Data, meet Hyper Density. A real-life use case illustrates what multiprocessor parallelism at a massive scale might mean for information processing and analysis.
- By Stephen Swoyer
- September 10, 2013
During the recent Pacific Northwest BI Summit, Colin White, a research analyst with information management consultancy BI Research, offered attendees a provocative glimpse into his technology crystal ball.
Even though White's presentation nominally addressed in-memory computing, he spent a portion of his time talking about mega-scale multiprocessor computing.
White's wasn't a strictly speculative presentation, however. He used a real-life use case to talk about what hardware parallelism at a massive scale might mean for information processing and analysis. His point of departure was what BI This Week has called Big Density, but his actual focus was the rising prominence of massively dense multiprocessor architectures, the exemplar of which is graphics processing unit (GPU) computing -- which, in Nvidia Corp.'s Tesla implementation, packs almost 2,700 cores onto a single GPU compute card.
Think of it as "Hyper Density." We're already computing in the era of Big Density. Thanks to trends in CPU development, the processor in even a modest OLTP database server can efficiently execute several threads simultaneously. Intel's Sandy Bridge-powered Xeon chips (the E5-2600 series) ship with up to eight cores and can execute 16 simultaneous threads; Intel's "Westmere" architecture Xeon E7-class chips integrate up to 10 cores onto a single CPU. They can be configured in an eight-CPU arrangement, for a total of 80 processor cores -- and 160 threads.
Chip-making rival Advanced Micro Devices Inc. (AMD) has pushed core counts even more aggressively: AMD's top-of-the-line Opteron chip -- dubbed "Abu Dhabi" -- ships as a multi-chip module (MCM) and supports up to 16 processor cores. Both Intel's Xeon and AMD's Opteron architectures are likewise available in quad-CPU configurations. The upshot is that a quad-CPU server can be populated with up to 32 Xeon cores (for a total of 64 simultaneous threads) or up to 48 Opteron cores.
This is symmetric multi-processing, or SMP -- albeit at large scale. Even SMP at large scale isn't a new phenomenon, however; system OEMs have been shipping server kits with 12, 16, 24, 32, and even 48 integrated processor cores for a few years now.
Moreover, in the data management (DM) space, several vendors -- including Kognitio, Actian Vectorwise, and Actian Pervasive -- claim to optimize for SMP parallelism. Vectorwise and Pervasive both claim to exploit processor-level parallelism, too, via support for single instruction, multiple data (SIMD); SIMD is supported by processors from both AMD and Intel.
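For readers who haven't bumped into SIMD, a minimal sketch helps. The C++ fragment below uses Intel's SSE intrinsics (supported on AMD processors, too) to add four packed floating-point values with a single instruction. The function and data are illustrative -- they aren't drawn from any vendor's engine -- but the mechanism is what SIMD-aware databases exploit when scanning column data.

```cpp
#include <emmintrin.h>  // SSE intrinsics, supported by both Intel and AMD x86 CPUs
#include <cstdio>

// Add two float arrays four elements at a time. A scalar loop issues one add
// per element; _mm_add_ps performs four additions in a single instruction.
// For simplicity this assumes n is a multiple of 4 and 16-byte alignment.
void simd_add(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             // load 4 packed floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // 4 adds, 1 instruction
    }
}

int main() {
    alignas(16) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(16) float c[8];
    simd_add(a, b, c, 8);
    printf("%g %g\n", c[0], c[7]);  // both print 9
    return 0;
}
```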
From Big Density to Hyper Density: A Primer
Hyper Density is relatively new, however -- at least as a mainstream phenomenon.
Ironically, the newest revision of Intel's Xeon silicon -- code-named "Haswell" -- actually reduces on-chip core density. At launch, Haswell Xeon integrates fewer processor cores -- four, at least initially -- than its Ivy Bridge predecessor. In fact, Intel's initial Haswell road map focuses on the low-end (Xeon E3) server segment: this June, for example, it released its line of dual- and quad-core Haswell Xeon E3 CPUs, which dissipate less than half the power of their Sandy Bridge-based Xeon E3 predecessors.
Intel's entry in the Hyper Density sweepstakes isn't a CPU but a dedicated co-processor -- in this case, Xeon Phi, a PCI Express (PCI-E) card based on Intel's Many Integrated Core (MIC) architecture. MIC itself derives from Intel's aborted "Larrabee" GPU effort, which proposed to exploit its legacy P54C architecture (i.e., the original Pentium microarchitecture) at massive scale, integrating dozens of simple x86 cores into a single package. (Intel ultimately abandoned Larrabee, repurposing its assets for MIC.)
Currently, Xeon Phi -- which is based on the "Knights Corner" implementation of the MIC architecture -- exposes 60 x86 cores with 8 GB of integrated memory. (Multiple Xeon Phi cards can be installed in a single system.) The beauty of MIC is its x86 instruction set: legacy code can run on Xeon Phi without modification, although -- in almost all cases -- it must be recompiled to effectively exploit MIC parallelism. MIC is already winning accolades, at least in the high-performance computing (HPC) space: with the recent christening of the Tianhe-2 supercomputer at China's National Supercomputing Center as the world's fastest such machine, Intel racked up a big win for MIC and Xeon Phi. Tianhe-2 uses a mix of Ivy Bridge Xeon and Xeon Phi parts.
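What does "recompile to exploit MIC parallelism" look like in practice? A minimal sketch, assuming Intel's tool chain: the OpenMP loop below is ordinary x86 C++ that runs anywhere; rebuilt with Intel's compiler and its native-MIC flag (icc -mmic -openmp, at the time of writing), the same source executes across Xeon Phi's 60 cores. The workload itself is a toy chosen for brevity.

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 24;
    double sum = 0.0;

    // Plain OpenMP: the runtime fans the loop out across however many
    // hardware threads the target exposes -- a handful on a laptop,
    // hundreds on a 60-core Xeon Phi with multiple threads per core.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++) {
        sum += 1.0 / (i + 1.0);  // toy workload: a partial harmonic sum
    }

    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```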
Xeon Phi competes with Nvidia's seminal Tesla GPU compute products.
Nvidia helped to kick start GPGPU -- i.e., general-purpose (computing) on GPU -- as a legitimate performance category, much like Netezza helped to kick start the market for data warehouse (DW) appliances. Nvidia's contribution is its programmable parallel computing platform, dubbed the Compute Unified Device Architecture (CUDA), which it first unveiled more than half a decade ago. Think of it this way: by writing Java or Pig Latin code for Hadoop's MapReduce compute engine, a programmer can efficiently parallelize a workload across a Hadoop cluster. In the same way, a programmer can use Nvidia's nvcc compiler to build C and C++ binaries optimized for CUDA. These workloads can then be parallelized across the hundreds of CUDA cores consolidated onto desktop GPUs or GPUGP co-processor cards. (Nvidia's desktop GPU cards also support CUDA.)
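To make the analogy concrete, here is a minimal CUDA sketch: a kernel that scales a million array elements, with each GPU thread handling exactly one element. The kernel, launch syntax, and runtime calls are standard CUDA C; the launch geometry is illustrative rather than tuned. Compiled with nvcc (e.g., nvcc scale.cu -o scale), the same source fans out across however many CUDA cores the installed card exposes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// A CUDA kernel: each GPU thread scales one array element. The loop a CPU
// would need is gone -- parallelism across thousands of cores is implicit.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                    // ~1 million elements
    size_t bytes = n * sizeof(float);
    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float* d;
    cudaMalloc(&d, bytes);                    // allocate GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                        // illustrative launch geometry
    scale<<<(n + threads - 1) / threads, threads>>>(d, 2.0f, n);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("%g\n", h[0]);                     // prints 2
    cudaFree(d);
    free(h);
    return 0;
}
```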
CUDA wrappers are available for Perl, Python, Java, Ruby, R, and other languages; this makes it possible to parallelize programs written in those languages across Nvidia's CUDA-compliant cards. CUDA also supports the Open Computing Language (OpenCL), a would-be open framework for GPGPU. Nvidia's current top-of-the-line Tesla part (the Tesla K20X) integrates almost 2,700 cores onto a single GPU compute card; as with Xeon Phi, multiple cards can be used.
Rival AMD also markets hyper-dense GPGPU cards. AMD's strategy emphasizes adherence to the OpenCL standard instead of (as with Nvidia) a proprietary GPGPU platform. Both AMD and Nvidia support Microsoft Corp.'s DirectCompute GPGPU API, too.
Hyper Density in Practice
GPGPU has always been a popular play in scientific or HPC circles. Intel's Xeon Phi -- which isn't a GPGPU offering, but which nonetheless consolidates dozens of chip cores into a single add-on card -- has likewise had early success on the HPC circuit.
Hyper-dense co-processor cards from AMD, Intel, and Nvidia are cropping up in analytics-intense use cases, too, White noted. "I'm blown away by how much analytical processing goes on on GPUs," he said. "Salesforce.com is building huge servers [with] Nvidia Tesla chips to process Twitter streams, analyzing and selling the results to its customers."
Nvidia explicitly markets its Tesla K20X and K20 cards for "data analytics." In addition to Salesforce.com, it cites several other reference customers -- such as Shazam Entertainment Ltd. and Cortexica -- that use Nvidia Tesla GPGPU cards to augment their analytic efforts. Nvidia touts Tesla as a means to parallelize everything from database joins to analytic queries: it even promotes a pair of CUDA-specific projects -- GPUMiner and Cmatch -- that implement parallel data mining and exact string matching on its Tesla GPGPU cards.
White sees both increasing SMP densities and the use of co-processor cards as part of the ongoing (and inevitable) shift toward in-memory computing. "In-memory computing is not standalone; you also look at it in conjunction with the availability of multiple processors," he points out. "For many corporations, large-scale symmetric multiprocessing [systems] have a lot of attraction. The result is that in-memory computing is gaining traction, not only from a business intelligence viewpoint but also from an online transaction processing viewpoint."
He sees a significant business upside, too.
"The business benefit comes from both the production side and the development side," he continues. "From production, if I could do real-time fraud detection, that's attractive ... [and] with the performance boost we're getting [from in-memory and more massive multi-processor technologies] we can process a lot more data ... particularly when we're doing data mining. This gives us more accuracy. Developing a risk model used to take hours to run. Now we're seeing people run risk models during development. Business-wise, this has a significant benefit because the models you develop are much better."
The idea isn't to run a database or an application in-memory on CUDA-, OpenCL-, or Open Multi-Processing (OpenMP)-compliant hardware; it's to parallelize analytic workloads across a co-processor node or nodes. In other words, the dedicated analytic node -- stuffed with Xeon Phi or GPGPU cards from AMD or Nvidia -- becomes one more tool in the analytic arsenal.
Imagine, for example, that a database or an application runs on dedicated hardware, whether it's a large (single-system) SMP box, a massively parallel processing (MPP) DBMS cluster, or a Hadoop cluster. In this scheme -- and let's use Nvidia's CUDA architecture for simplicity's sake -- a database spins out an analytic workload to a dedicated CUDA co-processor node or nodes. The workload then executes (in-memory, with high parallelism) on that system; the results are returned to the initiating database.
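At the code level, that spin-out might look something like the hedged CUDA sketch below. The entry point and column names are hypothetical, and a production kernel would use a proper parallel reduction rather than atomics; the pattern is what matters: copy the column in, compute in parallel on the card, and copy only a tiny result back.

```cuda
#include <cuda_runtime.h>

// Each thread tests one value against the threshold; matches are tallied
// with an atomic add. (Illustrative only -- a production kernel would use
// a parallel reduction instead of a single atomic counter.)
__global__ void count_over(const float* vals, int n, float threshold, int* hits) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && vals[i] > threshold) atomicAdd(hits, 1);
}

// Hypothetical entry point the initiating database would invoke on the
// co-processor node: ship the column over, compute in parallel, return
// only the small aggregate.
int offload_count_over(const float* column, int n, float threshold) {
    float* d_vals; int* d_hits; int hits = 0;
    cudaMalloc(&d_vals, n * sizeof(float));
    cudaMalloc(&d_hits, sizeof(int));
    cudaMemcpy(d_vals, column, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_hits, 0, sizeof(int));

    int threads = 256;
    count_over<<<(n + threads - 1) / threads, threads>>>(d_vals, n, threshold, d_hits);

    cudaMemcpy(&hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_vals); cudaFree(d_hits);
    return hits;   // the result returned to the initiating database
}
```

The shape of the pattern is the economic argument: the transfer happens once per batch, while the per-element work runs at the card's memory bandwidth.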
Why would anybody do this? Cost could be a consideration: a Tesla K20 card sells for about $3,500; Xeon Phi cards are even cheaper. For some analytic workloads -- such as the Twitter processing use case described by White -- it can be more economical to parallelize on one or more massively dense co-processor cards than to scale out additional nodes. According to a whitepaper from IBM Corp., for example, a Tesla GPGPU achieves a measured peak memory bandwidth of 153 GB/s, compared with 18 GB/s for Intel's Core i7-2600 CPU. In terms of random access memory performance, the GPGPU card is likewise the better performer: Tesla achieves 6.6 GB/s, as distinct from the i7-2600's 800 MB/s. The same IBM whitepaper clocks Tesla performing star schema joins at 5.7 GB/s.
Xeon Phi or GPGPU cards are arguably cheap to scale, too. The scalable unit in the co-processor or GPGPU segment is a PCI-E card: multiple PCI-E cards can be installed and configured in a single system. The scalable unit in an MPP configuration is the system node: MPP clusters scale (almost) linearly by adding system nodes. Nodes tend to be cheap because hardware tends to be cheap; increasingly, however, the cost of hardware isn't the only (or, for that matter, the most important) consideration.
"The hardware's cheap, but [cooling] and power cost money," White pointed out, noting that organizations are increasingly struggling with power, cooling, spatial, and other data center issues.