Data Lake Management Innovations
When designed and managed properly, a data lake can enable faster, more trusted big data analytics.
By Philip Russom
January 23, 2017
I recently spoke in a webinar run by Informatica Corporation, sharing the stage with Informatica's Murthy Mathiprakasam and Cognizant's Tavo De Leon. We three had an interactive conversation about the technology and business requirements of data lakes as faced today by data management professionals and the organizations they serve.
There's a lot to say about data lakes, but I focused on the roles played by metadata and data governance because these are two of the most pressing requirements. Below I've summarized my portion of the webinar.
The data lake is about earlier data ingestion and later data preparation on the fly.
A data lake ingests data in its raw, original state, straight from data sources, with little or no cleansing, standardization, remodeling, or transformation. Data management best practices can then be applied flexibly later as diverse use cases demand.
Data in a lake can be improved on the fly during exploration (for ad hoc discovery and analytics), at intermediate stages to prep data for recurring tasks (such as reporting and performance management), or much later (as new analytics applications are envisioned).
Many other scenarios are possible as data lake practices evolve. The trend, however, is toward less "pre-preparation" of data so that data "discovery zones" can be refreshed in an agile manner. Some data preparation is still in use, but nowhere near the heavy level required for warehouses and reports. Early ingestion means operational data is captured and made available as soon as possible, yet the data is still prepped enough to be fit for the intended purposes of exploration, discovery, and analytics.
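To make the ingest-early, prepare-later pattern concrete, here is a minimal sketch assuming PySpark running against a Hadoop cluster; the paths, dataset, and column names are hypothetical, not drawn from any particular deployment.

```python
# Minimal sketch of early ingestion plus on-the-fly preparation,
# assuming PySpark on Hadoop. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-ingest-and-prep").getOrCreate()

# 1. Early ingestion: land operational data raw, with no cleansing or
#    remodeling, so it is available as soon as possible.
raw = spark.read.json("hdfs:///landing/clickstream/2017-01-23/")
raw.write.mode("append").parquet("hdfs:///lake/raw/clickstream/")

# 2. Later, on-the-fly preparation: apply "just enough" structure only
#    when a specific exploration or analytics task demands it.
prepped = (
    spark.read.parquet("hdfs:///lake/raw/clickstream/")
         .filter(F.col("event_type").isNotNull())          # drop unusable rows
         .withColumn("event_date", F.to_date("event_ts"))  # standardize one field
         .select("customer_id", "event_type", "event_date")
)
prepped.write.mode("overwrite").parquet("hdfs:///lake/discovery/clickstream_prepped/")
```

The point of the sketch is the separation of steps: landing raw data costs almost nothing up front, and the lightweight preparation is deferred until an actual use case defines what fit-for-purpose means.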
Most data lakes are built atop Hadoop, which lets them capture big data and support advanced analytics processing. Hadoop allows a data lake to capture, process, and repurpose a wide range of data types and structures with linear scalability and high availability.
Although it may sound very new, a Hadoop-based data lake still needs established best practices and tools for data management. That way the lake can participate in both old and new data supply chains in a fast, flexible, systematic, and repeatable fashion.
Likewise, Hadoop-based data lakes are proving that they can integrate with a wide range of enterprise data ecosystems and be managed according to policy-based data governance. In these contexts, good data management and governance can raise the quality and usefulness of the data and keep the lake from deteriorating into a so-called data swamp.
Data lakes are already deployed in real-world use cases.
Physically speaking, the data lakes described below may all reside in one enterprisewide Hadoop cluster; logically speaking, however, they are separate data lakes.
Analytics data lakes. These can be as simple as standalone data lakes built for one application, such as sentiment analysis or money laundering detection. In other cases, a Hadoop-based data lake can extend a data warehouse and reduce the burden on it by supporting data staging, archiving, and processing for analytics. In short, a Hadoop-based data lake can augment and modernize a data warehouse to embrace big data and advanced analytics without replacing the warehouse.
Marketing data lakes. These are hot right now as marketers discover that a data lake is excellent for making correlations and predictions across multiple customer channels, which in turn leads to higher conversion rates in cross-selling. The same lake can also enable new levels of accuracy and insight for customer segmentation and complete views of customers.
At TDWI, we're also seeing other data lakes with a business function or industry focus -- for example, sales performance data lakes, healthcare data lakes, and financial fraud data lakes.
Diverse metadata makes a data lake more accessible, valuable, and trusted for a wider range of user types.
Today, a growing number of nontechnical or somewhat technical users want to work hands-on with data. Instead of raw technical metadata, these users need business metadata, which employs human-language descriptions of data. In fact, without business metadata, the range of users who can access the data of a lake is seriously limited.
Note that technical users create business metadata in addition to technical metadata. For that business metadata to be truly useful and accurate, the mappings between metadata types should be based on a governed business glossary, which specifies the data the business owns in industry- and corporate-standard language.
For many users, the point of implementing a Hadoop-based data lake is to enable self-service practices, including data access, exploration, discovery-oriented analytics, data prep, visualization, and advanced forms of analytics. Note that all these emerging self-service practices rely heavily on the cataloguing of business metadata. Without it, nontechnical users cannot search for data using business terms, cannot work quickly, independently, and collaboratively, and cannot get full value from a data lake.
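As an illustration of that dependency, here is a sketch in plain Python, not the API of any particular catalog product: a governed glossary maps technical field names to business terms, which lets a nontechnical user find lake data in business language. All dataset, field, and term names here are hypothetical.

```python
# Illustrative sketch of a business-metadata catalog (not any product's API).
# All dataset, field, and glossary names are hypothetical.
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    path: str              # where the dataset lives in the lake
    business_terms: dict   # technical field name -> governed glossary term
    steward: str           # businessperson accountable for the dataset

catalog = {
    "clickstream_prepped": CatalogEntry(
        path="hdfs:///lake/discovery/clickstream_prepped/",
        business_terms={
            "cust_id": "Customer Identifier",
            "evt_type": "Customer Interaction Type",
        },
        steward="marketing-data-steward",
    ),
}

def search_by_business_term(term: str) -> list:
    """Find datasets whose governed business terms match a plain-language query."""
    term = term.lower()
    return [
        name
        for name, entry in catalog.items()
        if any(term in glossary.lower() for glossary in entry.business_terms.values())
    ]

print(search_by_business_term("customer"))  # -> ['clickstream_prepped']
```

The glossary terms do the work here: the search never touches cryptic technical field names, which is exactly what makes self-service access feasible for nontechnical users.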
A data lake must be governed or else it may become a data swamp.
When a data lake is not managed properly, it deteriorates into a data swamp -- an undocumented and disorganized data store that is nearly impossible to navigate, trust, or leverage for organizational advantage. However, this risk is easily managed and mitigated by data governance and other process-driven data management best practices.
As with any important data asset, lake data should be curated by a steward who is responsible for driving trust and understanding of the data in the data store. TDWI feels that the best stewards are businesspeople (rather than technical staff) because they can prioritize based on business needs to keep data management aligned with business goals. Improvements to data lakes should give priority to metadata, "just enough" structure, and diversifying data and tools.
Conclusions
A data lake is a bit of a balancing act. On the one hand, the data lake's primary benefit is that it liberates analytics users by enabling new practices in agile data ingestion, with a focus on consolidating large volumes of diverse data in the lake. That, in turn, helps many users discover new opportunities and work with advanced analytics.
On the other hand, the data lake still needs some of the established best practices of data management and governance. Otherwise, the data managed in the lake can become redundant (and skew analytics results), lack a trusted audit trail, suffer integrity and quality problems, and be difficult to find and query. When those maladies beset a data lake, it becomes the dreaded data swamp.
Get the full benefit from your data lake by ingesting all kinds of data and by preparing and improving data to an appropriate degree so the data is accessible, trusted, and insightful.
If you'd like to hear more of my discussion with Informatica's Murthy Mathiprakasam and Cognizant's Tavo De Leon, please visit Database Trends and Applications to replay the Informatica webinar.
About the Author
Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at [email protected], @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.