Managing the Data Lake Monster
Data lakes offer organizations many positives, but they also have the potential to grow aimlessly, consuming an ever-larger share of storage, computing horsepower, technology personnel, and budget.
- By David Stodder
- May 23, 2017
Fans of classic science fiction movies know The Blob, the story about an "alien amoeboidal entity that crashes to Earth from outer space inside a meteorite," as Wikipedia describes it. Practically from the moment it lands, the blob starts to consume everyone and everything, growing as big as a building. Led by a local teenager, played by Steve McQueen in the original 1958 movie, the townspeople go through a rather costly trial and error process to learn how to stop it. I won't give away how they finally deal with the blob, but you are left with the uneasy feeling that their solution is only temporary. The blob will be back.
Is your data lake starting to resemble such a science fiction movie monster? Data lakes can offer organizations many positives, but they also have the potential to become like the blob: that is, growing aimlessly until they are monsters, consuming an ever-larger share of your organization's storage, computing horsepower, technology personnel, and budget. Some data lakes are now overflowing their on-premises systems and expanding into the cloud.
A data lake becomes even more of a mystery if the personnel who set it up and built the core analytics programs move on without leaving much documentation or establishing a metadata catalog. Data lakes can also become security and governance headaches if they expose sensitive data to cyber threats and other abuse. It's critical that organizations have a strategy for their data lakes and don't let them become amorphous, insatiable blobs that consume everything in sight.
Purposes for Data Lakes
Data lakes, which usually contain a mixture of diverse data types in their raw, natural format, have proven to be a convenient place to put data for analytics, including execution of machine learning algorithms.
Many data lakes are built on Hadoop clusters, so organizations need to keep up to speed on the latest technologies emerging in the Hadoop ecosystem, including those that enable applications and analytics jobs to get the most out of massively parallel processing, columnar databases, in-memory computing, "fast" data interaction, and streaming. Depending on the organization's needs, these technologies could include open source projects such as Spark, Kafka, Impala, Presto, Drill, and Kudu. Kudu is a new storage engine that Cloudera began shipping this year as part of Cloudera Enterprise, primarily to support real-time analytics.
As user interest grows in the contents of the data lake, some organizations set it up as a kind of operational data store from which they can move, transform, and cleanse data for the data warehouse, data marts, or business intelligence (BI) tools as needed by users. Another common use is as a platform to offload ETL or other data preparation and profiling routines from the data warehousing system so that organizations are not forced to expand their existing data warehouse and data integration systems -- and pay what that expansion costs.
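To make the "move, transform, and cleanse" step concrete, here is a minimal Python sketch of cleansing raw lake records before they are loaded into a warehouse table. It is an illustration only, not a description of any particular product. The field names (customer_id, amount), the cleanse function, and the rejection rules are all hypothetical assumptions chosen for the example.

```python
def cleanse(raw_records):
    """Split raw lake records into cleaned rows and rejected records.

    Hypothetical rules for this sketch: a row needs a non-empty
    customer_id, a numeric amount, and must not repeat an earlier id.
    """
    clean, rejected = [], []
    seen_ids = set()
    for record in raw_records:
        raw_id = record.get("customer_id")
        customer_id = "" if raw_id is None else str(raw_id).strip()
        try:
            amount = float(str(record.get("amount", "")).strip())
        except ValueError:
            rejected.append(record)  # amount is missing or not numeric
            continue
        if not customer_id or customer_id in seen_ids:
            rejected.append(record)  # missing or duplicate identifier
            continue
        seen_ids.add(customer_id)
        clean.append({"customer_id": customer_id, "amount": round(amount, 2)})
    return clean, rejected


# Made-up sample data standing in for raw records read from the lake.
raw = [
    {"customer_id": " 1001 ", "amount": "19.990"},
    {"customer_id": "1001", "amount": "5"},
    {"customer_id": "1002", "amount": "n/a"},
    {"customer_id": None, "amount": "3.50"},
    {"customer_id": "1003", "amount": 42},
]
clean_rows, rejected_rows = cleanse(raw)
```

Keeping the rejected records instead of silently dropping them leaves the raw data in the lake auditable, which matters once the lake feeds downstream BI tools.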
Less common today, but likely to become more so as the latest open source technologies mature, is using the data lake for everything, including traditional data warehouse activities. In other words, some organizations are looking to retire their data warehouse and "rip and replace" it with a data lake. Such organizations are evaluating new BI and visual analytics tools that can work directly on data in Hadoop clusters and cloud-based storage using in-database processing techniques to avoid unnecessary data movement.
Data Lake Management: Facilitating Better Order
As data lakes grow, organizations run into numerous challenges. Inconsistent and reactive management no longer works as the data lake takes on strategic importance to the users. Quick fixes and patches start to resemble the spaghetti code of old, making it hard for IT to optimize performance, manage security, and provide overall governance.
The need for better data lake management is fueling a hot software market, which was on display in meetings I had at Strata + Hadoop World earlier this year and has been gaining momentum since. Some vendors, including Podium Data and Zaloni, offer integrated tool suites that provide "data lake in a box" solutions aimed at making it easy and fast for organizations to set up and manage data lakes and support analytics.
These and other products provide self-service workflow and pipelines that enable organizations to get value from data along the way, not just at the end of data ingestion processes. Zaloni centers its data lake management, data quality, and governance on an automated metadata management capability so that organizations are building knowledge about their data and can govern it more effectively.
Established vendors are jumping into the data lake management market as well. Teradata, for example, has introduced Kylo, an open source data lake management platform that the company sponsors and has released under the Apache 2.0 license. Kylo provides a template-driven approach aimed at shortening data lake development cycles. Kylo and similar tools help organizations reduce dependence on the few programmers who know data lake technologies so that they can sustain the development of the data lake as personnel change. Many of the tools offer self-service graphical interfaces that enable less technical business users to ingest and integrate data in a lake.
A Strong Strategy Supports Ambitious Plans
With good data lake management tools, organizations can move beyond just storing data in the lake; they can actively refine and enrich the data. They can also create facilities such as metadata catalogs that help reduce data confusion and errors and improve user and developer productivity.
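As a rough illustration of what a metadata catalog records, the following Python sketch registers datasets with an owner, a location, tags, and a sensitivity flag, then looks them up. It is a toy model under stated assumptions, not how any commercial tool works. The class names (DatasetEntry, MetadataCatalog), the fields, and the sample datasets and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record describing a dataset stored in the lake."""
    name: str
    location: str  # hypothetical storage path
    owner: str
    description: str
    contains_sensitive_data: bool = False
    tags: set = field(default_factory=set)


class MetadataCatalog:
    """In-memory catalog; a real one would persist its entries."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if entry.name in self._entries:
            raise ValueError(f"dataset already registered: {entry.name}")
        self._entries[entry.name] = entry

    def search(self, tag):
        """Return the names of datasets carrying the given tag."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def sensitive(self):
        """Return the names of datasets flagged as sensitive."""
        return sorted(
            e.name for e in self._entries.values() if e.contains_sensitive_data
        )


catalog = MetadataCatalog()
catalog.register(DatasetEntry(
    name="web_clickstream",
    location="/lake/raw/web_clickstream",
    owner="marketing-analytics",
    description="Raw page-view events from the public website",
    tags={"web", "raw"},
))
catalog.register(DatasetEntry(
    name="customer_profiles",
    location="/lake/raw/customer_profiles",
    owner="crm-team",
    description="Customer master records exported from the CRM system",
    contains_sensitive_data=True,
    tags={"customer", "raw"},
))
```

Even a record this small answers the questions that go unanswered when the original builders leave: what a dataset is, where it lives, who owns it, and whether it needs special handling.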
These steps are important if organizations expect their data lakes to support ambitious plans for analytics-driven applications that require continuous data-intensive computing. Data lakes are critical for application processes that depend on near real-time analytics about nonrelational data such as multimedia and sensor data from Internet of Things sources. Most existing data warehouses are not designed for these types of applications.
Thus, it is important for organizations to develop a strategy for data lake management and evaluate tool options. With better management, organizations can more rapidly increase the value of their data lake -- and suffer fewer nightmares about the lake morphing into an uncontrollable entity that only gets bigger, swallowing precious money and resources.