Data Lake Platform Modernization: 4 New Directions
Early adopters of the data lake have arrived at a maturation stage where they need to modernize their platforms before expanding their implementations.
- By Philip Russom
- March 15, 2019
Most data lake users are committed to the data lake's method of managing diverse big data -- but far less committed to Hadoop as the preferred platform. These users want to continue with the data lake method and even expand it into more use cases, but they know they cannot successfully modernize and mature their data lake on the current state of Hadoop. Enterprises with data lakes on relational databases or other on-premises systems face a similar challenge.
On the one hand, data lakes originated on Hadoop, and Hadoop-based data lakes have proved themselves valuable in mission-critical use cases such as data warehousing, advanced analytics, multichannel marketing, complete customer views, digital supply chains, and the modernization of data management in general.
On the other hand, data lake early adopters have hit a ceiling, held back by Hadoop's numerous omissions and weaknesses in key areas such as cluster maintenance, admin cost, resource management, metadata management, and support for SQL and other relational techniques.
In a related trend, a number of users have prototyped their data lake on relational databases or some other on-premises system. These users know that they must select a more affordable or more easily scaled platform before going into production.
Users with data lakes that are Hadoop-based, relational, or on premises are now contemplating platform migrations and modernizations that will position their data lake for growth into new use cases, larger data volumes, and greater data source and data type diversity. Here are four of the directions that users are contemplating for the modernization of their data lake's platform(s).
**Keep using the whole Hadoop stack but migrate the lake's data from on premises to the cloud.** A complaint that all Hadoop users share (regardless of the use cases they implemented on Hadoop) is that the cluster required for the Hadoop Distributed File System (HDFS) is far more complex to design, set up, and maintain than they thought. Even worse, a successful data lake will increasingly demand more nodes for the cluster, which gets very expensive in terms of administrative payroll and on-premises server hardware.
A straightforward solution for the HDFS cluster problem is to migrate the lake's data from an on-premises cluster to one in the cloud. This way you keep your Hadoop architecture intact, protecting that investment and allowing for minimal tweaking after the migration. This is a viable and low-risk approach because a number of cloud providers have HDFS, most Apache tools you'd use with HDFS, and related vendor tools already set up on cloud partitions that are optimized for them. In addition, the same providers may also offer a managed service that can set up and maintain an HDFS cluster more expertly and more cheaply than the average user organization can.
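The mechanics of such a migration boil down to copying the lake's files to the cloud cluster and verifying each one before cutover (in practice a tool such as Apache Hadoop's DistCp does the copying). A minimal Python sketch of that copy-and-verify loop, using in-memory dictionaries to stand in for the two file systems -- all paths and names here are illustrative, not from the article:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Content hash used to verify each file after copying."""
    return hashlib.sha256(data).hexdigest()

def migrate(on_prem: dict, cloud: dict) -> list:
    """Copy every file from the on-premises store to the cloud store,
    verifying checksums so the cutover can be audited. Returns the
    list of paths that failed verification (ideally empty)."""
    failed = []
    for path, data in on_prem.items():
        cloud[path] = data                           # copy the file
        if checksum(cloud[path]) != checksum(data):  # verify the copy
            failed.append(path)
    return failed

# Toy stand-ins for an on-premises HDFS cluster and its cloud twin.
on_prem = {"/lake/raw/events.json": b'{"id": 1}',
           "/lake/raw/customers.csv": b"id,name\n1,Ana"}
cloud = {}

failures = migrate(on_prem, cloud)
```

Because the directory layout is preserved as-is, the Hadoop tools pointed at the new cluster need little or no reconfiguration -- which is the "minimal tweaking" advantage of this direction.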
**Replace HDFS with cloud-based storage.** This approach to Hadoop and data lake modernization breaks up the Hadoop architecture to decouple compute from storage, so you can then remove HDFS and replace it with cloud-based storage. Among the current approaches to cloud storage (file, block, and object), data management professionals prefer object storage (which may or may not be compatible with the de facto S3 API) because it resembles the database management systems they know and love.
The downside is that this approach may require a significant amount of work. When you migrate data between two very different platforms, making the data performant and fit for purpose on the new platform can be more like new development than a "lift and shift" migration. The effort can be worth it, however, when users need to rethink the lake's organization.
For example, most users set up their Hadoop-based data lakes as structure-less repositories for algorithmic advanced analytics. Unfortunately, that kind of lake is poorly suited to the query-based self-service and data exploration that users increasingly expect of a lake. Hence, migrating to cloud storage can be an opportunity to add just enough structure to a data lake to make it more conducive to new self-service and exploratory practices.
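One common way to add "just enough structure" is directory-based partitioning, where path segments such as `region=EU/date=2019-03-15` let a query engine skip irrelevant data by inspecting paths alone. A hedged Python sketch of the idea (the file layout, column names, and helper functions are hypothetical illustrations, not a specific product's API):

```python
import os
import tempfile

def write_partitioned(root: str, records: list) -> None:
    """Lay records out in partition directories
    (region=<r>/date=<d>/part.csv) so queries can prune by path."""
    for rec in records:
        part_dir = os.path.join(root,
                                f"region={rec['region']}",
                                f"date={rec['date']}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part.csv"), "a") as f:
            f.write(f"{rec['customer']},{rec['amount']}\n")

def scan(root: str, region: str) -> list:
    """Read only the partitions matching the predicate -- the
    directory names alone decide which files are opened."""
    rows = []
    target = os.path.join(root, f"region={region}")
    for dirpath, _, files in os.walk(target):
        for name in files:
            with open(os.path.join(dirpath, name)) as f:
                rows += [line.strip() for line in f]
    return rows

root = tempfile.mkdtemp()
write_partitioned(root, [
    {"region": "EU", "date": "2019-03-15", "customer": "Ana", "amount": 10},
    {"region": "US", "date": "2019-03-15", "customer": "Bob", "amount": 20},
])
eu_rows = scan(root, "EU")
```

A structure-less lake forces every query to scan everything; even this small amount of organization is what makes query-based exploration of a large lake tolerable for self-service users.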
**Start over in the cloud with data platforms designed specifically for the cloud.** Abandoning Hadoop is destructive for IT investments and disruptive for users who depend on the data lake. Yet, some organizations do exactly that. In part, they are escaping Hadoop's weaknesses described earlier. However, a stronger driver is to give the data lake a more feature-rich database and tool environment that is a better fit for today's business use cases. Another driver is to reap the general benefits of cloud (elastic scale, short time to use, low cost), which apply directly to data lakes and their platforms.
A number of things have changed since data management professionals started building data lakes atop Hadoop. Early data lakes were built for analytics, and they served a very short list of highly technical users, including data analysts, data scientists, data warehouse architects, and data integration specialists. These people still need the data lake, but the current expectation is that a lake must continue to support advanced analytics while also serving nontechnical business users from marketing, finance, and other departments. These users perform self-service practices -- namely data access, exploration, discovery, prep, and visualization. Such self-service practices demand far better metadata management and support for SQL and other relational techniques than the Hadoop ecosystem has been able to muster. For users facing self-service and relational requirements, it makes a lot of sense to migrate the data lake to one of the many cloud-based database and data warehouse platforms now available.
**Go hybrid and/or virtual by distributing your data lake across multiple platforms.** As we've seen here, data lakes are evolving to serve a broadening range of user types, use cases, analytics, and data types. The list of technical and business requirements keeps getting longer, to the point that some user organizations cannot satisfy all requirements on a single platform. This leads them to deploy multiple types of data platforms, each optimized for a specific data type, user type, or use case. Platforms may be located on premises, in the cloud, or both. The lake's data is physically distributed across these platforms.
The result is a hybrid data lake that can also be a virtual data lake when users rely on data virtualization techniques to unify data from multiple locations. A hybrid/virtual data lake is, by nature, complex and expensive, but many users feel it is worthwhile because a hybrid data lake lets them satisfy more business and technical requirements more effectively than a single platform allows.
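The unification a virtual lake provides can be illustrated with a toy federation layer. In the sketch below (class and table names are hypothetical; a real deployment would use a data virtualization or federated query product), a SQLite database stands in for an on-premises relational platform and a list of records stands in for cloud object storage, yet callers see one logical dataset:

```python
import sqlite3

class VirtualLake:
    """Federates queries across physically separate platforms,
    presenting one logical view of the lake's data."""

    def __init__(self, warehouse: sqlite3.Connection, object_store: list):
        self.warehouse = warehouse        # on-premises relational platform
        self.object_store = object_store  # cloud semi-structured platform

    def customers(self) -> list:
        """Union rows from both platforms at query time; nothing is
        copied or consolidated ahead of time."""
        rows = [{"id": i, "name": n} for i, n in
                self.warehouse.execute("SELECT id, name FROM customers")]
        return rows + self.object_store

# On-premises platform: a relational warehouse (SQLite stands in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Ana')")

# Cloud platform: semi-structured records (a list stands in
# for object storage).
cloud_records = [{"id": 2, "name": "Bob"}]

lake = VirtualLake(db, cloud_records)
all_customers = lake.customers()
```

The complexity the article mentions lives in that federation layer: every query must be routed, translated, and merged across platforms, which is the price of satisfying more requirements than any single platform can.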
Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at [email protected], @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.