The Data Lake: What It Is, What It's For, Where It's Going
Evolving approaches to analytics and data management are driving users toward the data lake as a new way of managing certain data.
- By Philip Russom
- June 10, 2016
What is a data lake? In its extreme form, a data lake ingests data in its raw, original state, straight from data sources, without any cleansing, standardization, remodeling, or transformation. These and other sacrosanct data management disciplines are applied on the fly, at runtime, to enable ad hoc queries, data exploration, and discovery-oriented analytics.
The early ingestion of data means that operational data is captured and made available to analytics as soon as possible. The raw state of the data ensures that data analysts, data scientists, data warehouse (DW) professionals, and similar users have ample raw material they can repurpose into many diverse data sets, as needed by unanticipated analytics questions.
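To make this "ingest early, process late" idea concrete, here is a minimal schema-on-read sketch in PySpark. The file path and field names are hypothetical; the point is that raw events land untouched and structure is applied only when a question is asked.

```python
# Minimal schema-on-read sketch (hypothetical paths and field names).
# Raw events land in the lake as-is; structure is applied only at query time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-schema-on-read").getOrCreate()

# Ingest: raw JSON events sit in the lake without cleansing or remodeling.
raw = spark.read.json("hdfs:///lake/raw/web_events/")   # schema inferred, data untouched

# Late processing: shape the raw detail for one ad hoc question at runtime.
sessions_by_day = (
    raw.filter(F.col("event_type") == "page_view")
       .groupBy(F.to_date("event_ts").alias("day"))
       .count())

sessions_by_day.show()
```

A different question tomorrow would simply reshape the same raw files a different way, with no rework of an upstream model.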
A common myth says that data lakes require open-source Apache Hadoop. Actually, almost all data lakes are on Hadoop today, but the older practice of the data vault (which is very similar) proves that very large stores (or archives) of detailed source data can be managed on large configurations of massively parallel processing (MPP) relational database management systems (RDBMSs). (More on the data vault later in this article.)
In short, a data lake doesn't technically require Hadoop but theoretically could be managed on an RDBMS.
Note that a data lake depends on the early ingestion and late processing of raw data. This is quite different from data warehousing best practices, which demand substantial improvements to data before it is allowed into a warehouse or similar database. However, the two are not incompatible.
TDWI sees users implementing Hadoop-based data lakes within data warehouse environments (DWEs), where the lake becomes the persistence platform of choice for data landing, data staging, ELT push-down processing, and some forms of processing for analytics (but not reporting).
In these cases, Hadoop is an emerging platform for some pieces of the larger, multi-platform data warehouse architecture, but Hadoop coexists with the RDBMS-based data warehouse -- still the platform of choice for multi-dimensional data, performance management, standard reports, and online analytical processing (OLAP).
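As a rough illustration of how the lake can carry landing, staging, and ELT push-down duties while the RDBMS warehouse receives only refined results, consider the following PySpark sketch. The table names, file locations, and JDBC connection details are placeholders, not a prescribed design.

```python
# Illustrative ELT push-down sketch (hypothetical tables and connection settings):
# land raw data in the lake, transform it with the cluster's own engine,
# and hand only the refined result to the RDBMS-based warehouse.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("elt-pushdown")
         .enableHiveSupport()
         .getOrCreate())

# Stage: register the landed files so SQL engines can see them.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS staging_orders (
    order_id STRING, customer_id STRING, amount DOUBLE, order_ts TIMESTAMP)
  STORED AS PARQUET LOCATION 'hdfs:///lake/landing/orders/'
""")

# Transform: push the heavy lifting down to Hadoop rather than the warehouse.
daily_revenue = spark.sql("""
  SELECT to_date(order_ts) AS order_date, SUM(amount) AS revenue
  FROM staging_orders
  GROUP BY to_date(order_ts)
""")

# Load: only the small, refined result moves to the RDBMS warehouse.
(daily_revenue.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://dw-host:5432/dw")  # placeholder connection
   .option("dbtable", "mart.daily_revenue")
   .option("user", "etl_user")
   .option("password", "***")                           # placeholder credential
   .mode("append")
   .save())
```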
What's a Data Lake For?
Advanced forms of algorithmic analytics need large volumes of detailed source data for mining, clustering, graphing, statistics, and other approaches to analysis. Detailed source data may also be the input to set-based analytics enabled by SQL or OLAP. The point of working with raw, unaltered, detailed source data is that it can be altered on the fly at runtime as new and unique requirements for analytics arise.
After all, once you alter data for a specific purpose (e.g., standard reports or performance management), the resulting data set is of limited use for other purposes (e.g., data exploration or discovery-oriented analytics). Algorithmic analytics and data exploration are strong drivers for data lakes now, but lakes will eventually provide flexibility for agile data warehousing and reporting.
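The following sketch, again with hypothetical field names and paths, shows the same raw transaction detail reshaped two different ways at runtime: one aggregate for reporting and one per-customer feature set for mining or clustering. Neither shape forecloses the other, because the raw data itself is never altered.

```python
# Hedged sketch (hypothetical fields): repurposing the same raw detail two ways.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repurpose-raw-detail").getOrCreate()
raw = spark.read.parquet("hdfs:///lake/raw/transactions/")

# Shape 1: a report-style monthly aggregate.
monthly_totals = (raw.groupBy(F.date_format("txn_ts", "yyyy-MM").alias("month"))
                     .agg(F.sum("amount").alias("total")))

# Shape 2: per-customer features for clustering or churn modeling.
customer_features = (raw.groupBy("customer_id")
                        .agg(F.count("*").alias("txn_count"),
                             F.avg("amount").alias("avg_amount"),
                             F.max("txn_ts").alias("last_txn")))

monthly_totals.show()
customer_features.show(5)
```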
Furthermore, note that a data lake serves multiple purposes. When deployed on Hadoop within a DWE, the lake is for landing, staging, archiving, ELT push-down, exploration, and analytics processing. In fact, when first working with Hadoop, DW professionals implement those functions in that priority order, atop a multi-functional data lake. The list reveals life-cycle stages that can guide your implementation of a lake or Hadoop.
Where Are Data Lakes Going?
The discussion so far assumes a very raw, original state of data in the data lake or similar repository (such as the data vault). However, some users have now been running a form of data lake for years (even on new-ish Hadoop), and we can learn from their successful maturation.
Users have learned that they get more business use and other value from a lake when they selectively impose some form of structure onto parts of the lake (but never the whole thing). For example, the Cloudera Enterprise Data Hub meets users' demand for "just enough structure"; it also shows how older concepts about data hubs and operational data stores (ODSs) are being adapted to data lakes and the Hadoop environment.
In a related trend, TDWI has seen many Hadoop users "forklift" ODSs and data marts to Hadoop. These types of data collections have rather simple data structures -- just a few tables and keys, maybe a fact table -- compared to the very complex multi-dimensionality, hierarchies, cubes, and time series found in a true DW.
Almost all ODSs and most marts function well with relational-ish Hadoop tools such as Hive tables, the HBase column-family store, and miscellaneous SQL-based query tools (especially Impala and Drill, with Spark SQL also emerging).
Even so, the structure imposed on these data sets typically covers a small percentage of the lake's data, and it rarely changes the underlying data, if at all. This means that most of the lake remains true to its original goal: a store (or archive) for detailed source data in its original, raw state, despite structure here and there for ODSs and other lightly structured data sets. Even where structure has been imposed, the raw source is still there to be repurposed for just about anything that comes up.
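To picture such a forklifted mart, the sketch below, assuming hypothetical table names and locations, defines a small fact table and one dimension as Hive external tables. They add structure to one corner of the lake while the rest of the raw data stays untouched, and Hive, Impala, Drill, or Spark SQL can query them in place.

```python
# Sketch of a "forklifted" data mart on Hadoop (hypothetical tables and paths).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("forklifted-mart")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS dim_customer (
    customer_id STRING, segment STRING, region STRING)
  STORED AS PARQUET LOCATION 'hdfs:///lake/mart/dim_customer/'
""")

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS fact_sales (
    customer_id STRING, product_id STRING, amount DOUBLE, sale_date DATE)
  STORED AS PARQUET LOCATION 'hdfs:///lake/mart/fact_sales/'
""")

# The same simple star-style SQL that ran against the old ODS or mart.
spark.sql("""
  SELECT c.region, SUM(f.amount) AS total_sales
  FROM fact_sales f
  JOIN dim_customer c ON f.customer_id = c.customer_id
  GROUP BY c.region
""").show()
```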
Note that a number of "virtual" technologies can provide structured views of a lake's data without physically relocating the data or altering its persisted state. These technologies include data federation, distributed queries, virtual tables, semantic views, and materialized views. In the near future, we will probably start talking about the "virtual data lake" (similar to the logical DW) due to the rising use of virtual technologies with lakes.
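As one minimal example of such a virtual, structured view, consider the PySpark sketch below; the paths and field names are hypothetical. Only view definitions are created, while the persisted raw files are never relocated or altered.

```python
# Minimal sketch of a "virtual" structured view over lake data (hypothetical
# paths and names): the raw files stay where they are; only views are defined.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("virtual-lake-view")
         .enableHiveSupport()
         .getOrCreate())

# Expose raw clickstream files through a temporary table, without copying them.
spark.read.json("hdfs:///lake/raw/clickstream/").createOrReplaceTempView("clicks_raw")

# A semantic view that renames and types fields for business users;
# the persisted raw data is never relocated or changed.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW v_clicks AS
  SELECT CAST(ts AS TIMESTAMP) AS click_time,
         user_id               AS visitor,
         page                  AS page_url
  FROM clicks_raw
""")

spark.sql("SELECT visitor, COUNT(*) AS clicks FROM v_clicks GROUP BY visitor").show()
```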
What is a Data Vault?
A data vault is very similar to the mature data lake just described, where the data store is typically an archive of detailed source data but lightly structured with multiple semantic and virtual layers atop it. Whereas data lakes tend to go this direction after three or four project phases, data vaults typically have such layers designed into them from the first phase. As noted earlier, lakes are usually on Hadoop, whereas vaults are typically on RDBMSs.
TDWI has seen data vaults in European firms for years, and the vault is now appearing in North American firms. For example, in 2014, TDWI gave a Best Practices Award to Caisse de dépôt et placement du Québec for their data vault approach to enterprise data warehousing. You can read about their implementation on tdwi.org.
As another example, at the TDWI Executive Summit in May 2016, a speaker from an American insurance firm explained how they consolidated multiple DWs, marts, and ODSs into a data vault as a DW modernization strategy.
What Are the Perils of Data Lakes?
First and foremost, it's too easy to dump data into a data lake without assuring an audit trail, data integrity, anything resembling data quality, tweaks that accelerate queries and analytics, governance, stewardship, data life cycle management, or data protection (meaning masking and encryption, not merely the Kerberos-based security that Hadoop provides).
I don't see the open source community caring much about these issues in a Hadoop environment, but vendors with diverse products and services -- from Hadoop distributions to data integration tools -- are working hard to provide tools that make Hadoop more manageable in general, whether with lakes or not.
Furthermore, these perils don't seem to stop many users. Don't forget that data feeding into or coming out of advanced analytics applications need not be as precise, standardized, documented, and auditable as standard reports and the DWs that feed reports.
For example, think of the estimates, probabilities, churn models, vague visualizations, and customer segments that are common outputs of analytics applications. Analytics teams can ignore or work around the data issues of Hadoop and data lakes without much trouble, and still deliver analytics outcomes that are valuable to the business despite their imprecise, "ballpark" nature.
Emerging practices around analytics are a good fit for lakes, whereas the precision and auditability of the average financial report continues to require a well-modeled and well-governed data warehouse.