Benefits of the Hadoop-Based Data Lake
In some emerging best practices, a free-form data lake implemented on Hadoop complements a structured relational data warehouse.
- By Philip Russom
- November 3, 2016
The data lake is a new design pattern that specifies a few rules for organizing data, similar to how older design patterns did. The primary rule for a data lake is that it should be a repository for raw, detailed data that's captured, stored, and managed in its original schema or format with little or no transformation.
A data lake focuses on detailed source data so that the source can be repurposed many ways as new requirements in advanced analytics evolve and emerge. The data lake "future-proofs" analytics by provisioning ample source data for a wide range of analytics that cannot yet be foreseen because of the rapid pace of change within organizations and across marketplaces.
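The primary rule described above (land the data as-is and defer transformation) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: a local directory stands in for Hadoop storage, and the function names and the `raw/<source>/<date>/` folder layout are hypothetical choices made for this example.

```python
import hashlib
import shutil
import tempfile
from datetime import date
from pathlib import Path


def land_raw_file(source_file: Path, lake_root: Path,
                  source_name: str, ingest_date: date) -> Path:
    """Copy a source file into the lake byte for byte, with no transformation."""
    # Hypothetical layout: <lake_root>/raw/<source_name>/<YYYY-MM-DD>/<file name>
    target_dir = lake_root / "raw" / source_name / ingest_date.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copyfile(source_file, target)  # contents are copied unchanged
    return target


def sha256_of(path: Path) -> str:
    """Return a checksum so the landed copy can be compared with the source."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def demo() -> None:
    # A local temporary directory stands in for the lake (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "orders.csv"
        source.write_bytes(b"id;amount\n1;9,50\n")  # original delimiter kept
        landed = land_raw_file(source, Path(tmp) / "lake", "erp", date.today())
        print("landed at:", landed)
        print("identical to source:", sha256_of(source) == sha256_of(landed))


if __name__ == "__main__":
    demo()
```

Because the landed copy is identical to the source, any parsing or cleansing decision is left to whichever analytics application reads the file later.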
Given today's exploding data volumes, a data lake needs to scale to tens or hundreds of terabytes and sometimes petabytes. Because it provides massive scalability at a reasonable cost, Hadoop has become the most common data platform for data lakes. Even so, data lakes and similar emerging data-driven design patterns (e.g., data vaults, enterprise data hubs) may also be deployed on relational database management systems (RDBMSs) or on file systems other than Hadoop.
How Data Lakes and Data Warehouses Can Work Together
The Hadoop-based data lake is important because it can extend the life and capabilities of a data warehouse.
One of the stronger trends in data warehousing is to diversify the portfolio of data platforms so that technical users can choose just the right platform for storing, processing, or delivering data sets and the products based on them. In the modern multiplatform data warehouse environment (DWE), almost all core warehouses still run on RDBMSs, but they may be integrated with other platforms -- typically Hadoop and specialized RDBMSs (based on appliances, columns, clouds, or specific forms of analytics).
In this hybrid environment, the core warehouse continues to be the preferred platform for reporting (from standard reports to dashboards), dimensional data (for OLAP, cubes, star schema, etc.), and data that requires extensive improvement or accuracy (e.g., financial reports).
However, the raw data for advanced forms of analytics is progressively being stored and processed on the other platforms of the DWE. This relieves the core warehouse of that workload so it can scale and focus on data that requires mature relational functionality (as reports and dimensional data do). It also moves raw, detailed data to platforms that are well suited to advanced forms of analytics (based on mining, clustering, statistics, graph, etc.) at scale and at a reasonable cost.
The Hadoop-based data lake is emerging as a natural fit for the large volumes of data for advanced analytics that are being relocated as organizations modernize their DWEs. Even so, TDWI also sees columnar databases and other specialty RDBMSs playing roles within the DWE.
The trend is toward having the Hadoop-based data lake be the ingestion platform and analytics archive for the DWE, while sandboxing and set-based analytics are done on specialty RDBMSs (but perhaps on Hadoop, too) and reporting and related functions are provisioned by the core warehouse.
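The division of labor described above can be written down as a simple lookup. The sketch below is only an illustration of the trend as stated in this article: the workload labels, the `Platform` enum, and the `candidate_platforms` helper are hypothetical names, and a real DWE would base such decisions on far more than a static table.

```python
from enum import Enum
from typing import Dict, List


class Platform(Enum):
    HADOOP_DATA_LAKE = "Hadoop-based data lake"
    SPECIALTY_RDBMS = "specialty RDBMS"
    CORE_WAREHOUSE = "core data warehouse"


# Preferred platform per workload, following the roles named in the text.
# The workload keys are hypothetical labels chosen for this sketch.
PREFERRED_PLATFORM: Dict[str, Platform] = {
    "ingestion": Platform.HADOOP_DATA_LAKE,
    "analytics_archive": Platform.HADOOP_DATA_LAKE,
    "sandboxing": Platform.SPECIALTY_RDBMS,
    "set_based_analytics": Platform.SPECIALTY_RDBMS,
    "reporting": Platform.CORE_WAREHOUSE,
    "dimensional_data": Platform.CORE_WAREHOUSE,
}

# Sandboxing and set-based analytics may "perhaps" run on Hadoop, too.
ALTERNATIVE_PLATFORMS: Dict[str, List[Platform]] = {
    "sandboxing": [Platform.HADOOP_DATA_LAKE],
    "set_based_analytics": [Platform.HADOOP_DATA_LAKE],
}


def candidate_platforms(workload: str) -> List[Platform]:
    """Return the preferred platform first, then any alternatives."""
    if workload not in PREFERRED_PLATFORM:
        raise ValueError(f"unknown workload: {workload!r}")
    return [PREFERRED_PLATFORM[workload],
            *ALTERNATIVE_PLATFORMS.get(workload, [])]


if __name__ == "__main__":
    for name in PREFERRED_PLATFORM:
        options = ", ".join(p.value for p in candidate_platforms(name))
        print(f"{name}: {options}")
```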
Architectures Still Evolving
Note that it is still early days for the multiplatform architecture of the DWE, as well as for Hadoop and the data lake. It is difficult to say into what architectural patterns the DWE will eventually evolve.
However, one direction is sure: a wide range of organizations will continue to diversify their data platforms as they let go of older paradigms that sought to make a single data warehouse instance handle all or most data workloads. Instead, most DW programs are moving toward multiple best-of-breed and purpose-built platforms that are tightly integrated. (The survey for the 2016 TDWI Best Practices Report: Data Warehouse Modernization corroborates this claim.)
The Hadoop-based data lake fits these and other trends quite well. The real driver is that enterprises need a broader range of analytics types so they can get better at making fact-based decisions, optimizing their organizational performance, and competing on analytics.
The Hadoop-based data lake is gaining in popularity because it can capture the volume of big data and other new sources that enterprises want to leverage via analytics, and it does so at a low cost and with good interoperability with other platforms in the DWE. In this sense, Hadoop and data lakes add value to the DW and its environment without ripping and replacing mature investments.
In other words, in the emerging best practices of the DWE, a free-form data lake complements a structured data warehouse. That's why TDWI expects to see both working together in more and more DWEs.
Further Reading:
For a deeper definition of the data lake, read my article from June 2016: "The Data Lake -- What it is, What it's for, Where it's going."
For more about the trend toward multiplatform data warehouse environments (DWEs), read TDWI Best Practices Report: Data Warehouse Modernization.
About the Author
Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at @prussom on Twitter and on LinkedIn at linkedin.com/in/philiprussom.