Big Data Management: Deploying Best-of-Breed Solutions to Do More with Less
Savvy data managers are using multiple approaches to maximize the strengths of each technology.
By Mike Lamble, President, XtremeData, Inc.
Not so long ago, there were relatively few database options for big data warehouses. The options typically converged to Oracle, DB2, or (at the high end) Teradata. Today, much has changed: data volumes have grown, Linux-based commodity hardware has become ubiquitous, and more users are deploying on both public and private clouds.
If there were 200 data warehouses of at least 10 TB a dozen years ago, there are thousands today and plenty in excess of a petabyte. Large volumes of machine-generated source data -- click streams, for example -- are increasingly commonplace. Open source software and inexpensive hardware have lowered both the cost of entry and the growth cost curve for big data applications.
To do more with less, savvy data management shops are employing a suite of best-of-breed database tools in their data management platforms (DMPs). This is in contrast to the old school philosophy of a "one-size-fits-all" database. By way of example, in most kitchens you'll find an oven, a stove, and a microwave. Each has a function, there's overlap, and all compete for the same kitchen appliance budget.
That's a fitting analogy for a modern data management shop that builds large-scale, data-intensive solutions. Data managers can choose from different database tools, work within a limited budget, and serve demanding clients. As data volumes have grown, specialized database solutions have emerged to meet differentiated needs for structured data management affordably and at scale. It is important to note that other classes of tools exist for unstructured data, such as text documents and XML.
Big data management platforms should be built on an architecture of loosely-coupled, massively parallel processing (MPP) in order to scale to ultra-high volumes. They should also be designed to effectively leverage today's converged Linux-based IT infrastructure and cloud platforms. Pooling and sharing of virtualized computing resources lowers the cost of IT infrastructure and reduces time-to-market for new IT solutions.
A cloud consists of hundreds or thousands of core processors that are virtualized and abstracted away from applications. Big data solutions need to scale across as many cores as applications demand to meet performance service-level agreements. Legacy database management systems (DBMSes) were designed in a different era, when the hardware consisted of a single, dedicated box. Today's hardware environment demands a new breed of MPP solutions, and these are evolving quickly with different "sweet spot" uses.
To make sense of the database landscape, we must narrow the options. We must:
- Filter out databases built for transaction-oriented applications and focus on databases that are designed for analytics. Analytic databases are designed for data-intensive processing involving millions of records in a single operation.
- Focus only on MPP solutions because single box solutions cannot scale with data growth.
- Examine solutions that run on virtualized commodity hardware because the industry trend is towards clouds -- private and public.
Applying these filters leaves three main solution categories: Hadoop-based solutions, row-oriented databases and column-oriented databases.
Understanding Solution Strengths and How They Co-Exist
Today, the three main solution categories are increasingly viewed as complementing each other and not mutually exclusive. Each option has its own strengths and can reduce the cost-per-answer and time-to-answer for big-data-intensive solutions. Let's examine each in greater detail.
Hadoop solutions are ideal for landing and staging large data volumes, both structured and unstructured. The underlying Hadoop Distributed File System (HDFS) brings excellent scalability and high availability to storage on inexpensive commodity hardware. HDFS handles data appends well but does not support inserts, updates, or deletes. Higher layers of Hadoop, such as Hive and HBase, provide limited SQL capability for simple operations, such as data profiling and transformations. The execution speed of that SQL is determined by the MapReduce framework that overlays HDFS. More recently, Cloudera's Impala tool provides similar SQL capability while bypassing the performance penalty imposed by MapReduce.
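The MapReduce model that Hive compiles SQL into can be sketched in a few lines of Python. This is a deliberately simplified toy, not Hadoop code: the map, shuffle, and reduce functions below stand in for distributed phases that in reality run across many nodes, and the sample click records are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, 1) pairs -- analogous to SELECT page, COUNT(*) ... GROUP BY page
    for record in records:
        yield (record["page"], 1)

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the values for each key to produce the final counts.
    return {key: sum(values) for key, values in groups.items()}

clicks = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(clicks)))
print(counts)  # {'/home': 2, '/pricing': 1}
```

Even this tiny example hints at the overhead: a one-line SQL aggregate becomes three materialized phases, which is the performance penalty that engines such as Impala were built to bypass.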
However, none of these SQL interfaces transforms Hadoop into a true ACID-compliant relational database system. These solutions struggle when applications require maintaining complex relationships between tables, managing user access permissions and concurrent access, or performing data-intensive processing such as grouping, aggregating, and non-trivial joins.
For more sophisticated metadata management and complex SQL-based operations, true relational DBMS solutions offer a richer environment and much faster performance. Faster performance translates to cost savings on hardware that more than offset the savings from free, open source software. DBMS solutions offer a well-understood, easy-to-use environment for the rapid development of analytics applications. One can differentiate between row-oriented and column-oriented DBMS solutions for the reasons explained below.
MPP Column-Oriented Databases
MPP column-store databases are the tool of choice for applications that require ultra-fast response times -- e.g., BI dashboards. Column stores deliver efficient data compression and support fast appends. They are best suited for "write once/read many" workloads with pre-determined queries against relatively static data sets -- that is, applications that don't require ongoing inserts, updates, and deletes and are not subject to frequent inflows of large data volumes. Amazon Web Services' Redshift offering is an example of an MPP column-oriented database.
Because queries must reassemble rows from many separately stored columns, these databases do not work well for the complex queries often found in extract-transform-load (ETL) job streams and exploratory analytics.
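The compression advantage of column stores comes from storing each column's values contiguously, which exposes long runs of repeated values. A minimal Python sketch of run-length encoding (RLE) illustrates the idea; the toy "country" column is an assumption for illustration, not any vendor's actual storage format.

```python
def rle_encode(column):
    """Encode a list of column values as (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            # Extend the current run of the same value.
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            # Start a new run.
            runs.append((value, 1))
    return runs

# A sorted "country" column from a fact table collapses dramatically...
country_column = ["US"] * 4 + ["UK"] * 3 + ["DE"] * 2
print(rle_encode(country_column))  # [('US', 4), ('UK', 3), ('DE', 2)]
```

A row store interleaves values from every column, which breaks up these runs; that is one reason the compression ratios cited for columnar engines are hard to match in row-oriented layouts.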
MPP Row-Oriented Databases
For applications that require data-intensive joins, grouping, and aggregates, row-oriented database solutions are the fastest and most efficient. Designed for mixed write and read workloads, row-oriented databases are also a good fit when record-level inserts, updates, and deletes are required. They also provide industrial-strength metadata management capabilities such as data audit trails, access security controls, and concurrency management. If the application calls for significant ELT (extract, load into database, then transform) processing, row-oriented databases are the tool of choice.
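The kind of "heavy lifting" a row store excels at is the join-then-aggregate pattern. In production this runs as SQL inside the MPP database; the Python below merely shows the logical operation, and the record layout, field names, and cost figures (in integer cents) are hypothetical.

```python
from collections import defaultdict

# Click events (fact records) and a customer dimension to join against.
clicks = [
    {"customer_id": 1, "ad": "A", "cost_cents": 10},
    {"customer_id": 1, "ad": "B", "cost_cents": 25},
    {"customer_id": 2, "ad": "A", "cost_cents": 10},
]
customers = {1: "enterprise", 2: "consumer"}  # customer_id -> segment

# Logically: JOIN clicks TO customers, GROUP BY segment, SUM(cost_cents).
spend_by_segment = defaultdict(int)
for click in clicks:
    segment = customers[click["customer_id"]]
    spend_by_segment[segment] += click["cost_cents"]

print(dict(spend_by_segment))  # {'enterprise': 35, 'consumer': 10}
```

A row layout suits this workload because each record is read and written as a unit, so the per-record joins, inserts, and updates typical of ELT job streams touch one contiguous location rather than many column files.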
However, there are relatively few row-based analytics DBMS solutions that are MPP and can be deployed on virtualized commodity hardware running on-premise and in the cloud.
Cohabitation Brings Real-World Results
Hadoop-based solutions are great for landing, staging, and transformation of massive volumes of structured and unstructured data. Column-oriented databases are ideal for write once/read many data stores that require fast querying. Row-oriented databases are the best choice for mixed workloads (writes as well as reads) and for ongoing "heavy lifting" ELT.
Let's apply this point of view to a real-world example: a rapidly growing firm in the exploding business of online digital advertising. The company is building out a data management platform that relies on all three of these solutions in order to get the most productivity at a reasonable cost and to ensure that costs won't hemorrhage as data volumes grow.
The company collects data about online advertising activity, aligns the data with offline customer information, and uses analytics to optimize ad placement, pricing, and campaigns. The data management platform ingests nearly 10TB of granular data each day and uses Hadoop to stage the data. The platform also uses an MPP row-oriented database for its strengths: complicated data integration that relates impression and click-through activity to anonymized customer information. The platform performs a variety of roll-ups, both daily and hourly. The database runs on the same inexpensive Hadoop cluster hardware. Finally, the DMP creates multiple data marts using a column-oriented database.
These data marts exploit their fast query response times to enable external customer access for reporting purposes. Each weekend the column-oriented data marts are re-built with new data.
The company is facing the shape of things to come. Its commodity hardware handles many times the amount of data at a fraction of the cost of legacy appliance solutions. By using best-of-breed MPP tools rather than a one-size-fits-all approach, they run a near-real-time, analytics-driven business, while providing customers sub-second query responses at a manageable cost. The company expects that today's 20 TB data warehouse will grow to 200 TB in 18 months. This digital advertiser is moving portions of its DMP to the cloud because that's where some of its supply chain (data supply, that is) partners operate exclusively.
Mike Lamble is president of XtremeData, Inc.; the company provides a massively scalable high-performance DBMS for big data analytics deployable on premise and in the cloud. You can contact the author at [email protected].