Welcome to the Lakehouse
A new decade, a new phrase to conjure with. The "lakehouse" is generating some interest and debate, but it needs to prove itself as an architecture.
- By Barry Devlin
- March 10, 2020
January 2020 has brought a new concept to the fore in the data management space. In a recent blog post, Ben Lorica (until recently, chief data scientist and Strata organizer at O'Reilly Media) and some senior names at Databricks report seeing "a new data management paradigm that emerged independently across many customers and use cases: the lakehouse."
What is a lakehouse? Simply put, it's a cross between a data lake and a data warehouse. Its starting point is that data lakes are too loosely governed and structured for many business needs. As Lorica and his colleagues describe it, a lakehouse arises from "implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes." At a more technical level, the concept hews closely to the functionality of Databricks' Delta Lake product.
Ten years ago, James Dixon (then-CTO of Pentaho) introduced the data lake concept in a similar manner, directly linking to the introduction of Pentaho's first Hadoop release.
To be fair, this is a common route for the introduction of concepts in IT. The original data warehouse architecture I documented in the mid-1980s was driven largely by the possible uses of the DB2 relational database in meeting the internal decision support and management information needs of IBM, where I worked at the time.
The key question for me regarding the lakehouse concept is: how will it evolve? Will it grow into a full-fledged architecture of value to implementers of complex data management and insight delivery solutions or will it remain marketing hype used mainly by consultants and software vendors as a sales tool?
My early-February LinkedIn post on the lakehouse generated a lively debate between those who focus on technological function and those who favor a more holistic approach. I fall firmly in the latter camp, but exploring the concept from both approaches is worthwhile.
In the Data Lake -- Waving or Drowning?
The first impetus for data lakes was the emergence of a new technological environment -- Hadoop and its cutely named companions. Lakers saw an opportunity -- indeed, a necessity -- to address some fundamental IT problems with the then-dominant data warehouse paradigm. In their view, the relational environment was too rigid to change, too difficult to grow to internet data volumes, and prohibitively expensive to scale.
The extended Hadoop ecosystem offered the opportunity to simply store all the data (at least in theory) as it arrived and figure out later what to do with it and how. The term schema-on-read perfectly captured the zeitgeist of data scientists digging nuggets of business insight from (near) real-time data.
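The difference between applying a schema when data is written and applying it when data is read can be shown in a few lines. What follows is a minimal sketch only, under loud assumptions: the two-field schema, the function names, and the in-memory lists standing in for warehouse and lake storage are all hypothetical and do not come from Hadoop, Delta Lake, or any other product.

```python
import json

# Hypothetical schema (an assumption for illustration): field name -> type.
SCHEMA = {"customer_id": int, "amount": float}


def conform(record, schema):
    """Coerce a parsed record to the schema; raise ValueError if it cannot fit."""
    try:
        return {name: cast(record[name]) for name, cast in schema.items()}
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"record does not match schema: {record!r}") from exc


def write_with_schema(store, raw_line, schema):
    """Schema-on-write: validate before storing, so bad records fail at load time."""
    store.append(conform(json.loads(raw_line), schema))


def write_raw(store, raw_line):
    """Schema-on-read, write side: keep the line exactly as it arrived."""
    store.append(raw_line)


def read_with_schema(store, schema):
    """Schema-on-read, read side: apply the schema at query time, skipping misfits."""
    rows = []
    for raw_line in store:
        try:
            rows.append(conform(json.loads(raw_line), schema))
        except ValueError:  # also covers malformed JSON
            continue
    return rows
```

Given one well-formed line and one whose customer_id cannot be read as an integer, write_with_schema rejects the second at load time, whereas write_raw stores both and the problem surfaces only when read_with_schema is called. Neither approach removes the need to decide what the data means; they differ in when that decision is enforced.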
Early and obvious warnings of the dangers of such lakes silting into swamps turned out to be prescient, spawning the recent surge of data catalogs and similarly scoped tools promising to clean up and curate the mess of data that was being collected.
Of course, there were and are numerous success stories. However, they typically apply to strictly bounded and functionally focused subsets of data, where laser-sharp business needs succeed in defying the ever-increasing entropy of the rest of the lake.
Whither the Warehouse?
Whether because of its birth era or its early and long-term promoters, data warehousing today places a solid focus on data architecture and governance issues. Technology and function are, of course, important, but best architectural practice is to begin from a business-focused and product-agnostic stance. Although still centered on relational database technology, modern data warehousing has evolved to a multiplatform, distributed approach to meet a key business need for enterprise data: delivering high-quality, well-managed, reconciled (where needed), and timely data to business decision makers.
I firmly believe that the concepts, principles, and methods of data warehousing are not only still relevant, but increasingly mandatory in today's enterprise data environment -- where disparate and dirty data abounds -- to provide an island of consistency and a bulwark of agreed meaning in an increasingly storm-tossed data lake environment and beyond.
Data lakers' criticism of the base data warehousing technology is in part valid but has led to a simplistic schema-on-read versus schema-on-write debate. Such an either/or argument is largely misleading and ultimately pointless when we should focus on business needs, both for specific functions and for more generic data governance. However, it is this simplistic debate that appears to have led to the emergence of the lakehouse concept.
Does the Lakehouse Have a Future?
In an Instagram world, most good, visually provocative memes have a future. A lakehouse meets those criteria and, by that measure, should prosper. The already emerging competition between Databricks and Snowflake over who coined the term cannot but help.
However, I wonder if either of these vendors has the scale, skills, or even the interest to expand what is currently largely marketing speak into a real architecture. If they do, it will be important to start with a conceptual architecture that defines and describes a shared business/IT vocabulary of exactly what business may expect from a lakehouse and what IT can deliver. This conceptual architecture should be extended to the logical level to detail the functional components required. Finally, it can be linked to existing and required technology at the physical level. My book, "Business unIntelligence," dating from 2013, describes conceptual and logical architectures that could, I believe, cover the complete scope of a lakehouse from a product-neutral and functionally complete point of view.
Building an esthetically elegant and visually appealing lakehouse is relatively easy. Architecting its plumbing and organizing its occupants will be the real challenge.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.