Cloud Confusion for Data Warehousing (Part 1 of 2)
If you are planning to move your data warehouse or data lake to the cloud, the current outlook is somewhat foggy.
- By Barry Devlin
- August 7, 2023
Data warehousing’s journey to the cloud began a few years ago, so you’d be forgiven for thinking it’s all plain sailing now. However, a brief review of the marketplace will reveal a larger number of even more disparate architecture patterns in the cloud than there have ever been on the ground. Cloud data warehouse, data lakehouse, data fabric, and data mesh are the current crop. Not only do they differ in their original definitions, but different vendors spin their own stories about what each one means and how their products align to one or more of the original terms.
How can we clear up the confusion? The starting point is a clear, concise definition of what we mean by the term data warehousing. After all, data warehousing has been with us since the 1980s. So, let me offer one.
In my new book Cloud Data Warehousing, Volume I, I suggest “one definition to rule them all.” The first and most fundamental purpose of data warehousing is to deliver consistent, integrated, timely, quality, useful, and usable data (or, better, information) to help decision makers of all levels -- from factory-floor operatives to data scientists to CEOs -- understand what is happening in the business and the world in which it operates and why it is happening, enabling them to do something about it, now and in the future.
This definition ties together the desired business outcome of data warehousing -- insight and relevant action -- with the means of achieving it. Only with the highest-quality information can businesses succeed in making decisions and taking action. This applies to everything from basic reporting to advanced AI-driven analytics.
The meaning of cloud data warehousing follows naturally: as above but in the cloud.
Cloud Data Warehouse or Data Lakehouse?
All four of the patterns mentioned in the first paragraph adhere to the above definition.
The first, cloud data warehouse, is most easily described as the migration of a data warehouse as developed on premises to a cloud environment. As happened with on-premises vendors, some cloud vendors offer little more than a bare relational database, optimized for querying, and call it a cloud data warehouse. Others, better informed, understand that a warehouse requires more than a database. They add the population and management tools, the metadata, query function, and other features needed for a full solution.
Some are on-premises vendors porting their solutions to the cloud; others are cloud-native. In either case, they optimize their solution to use best-of-breed cloud technology.
The data lakehouse -- which is almost always in the cloud -- was introduced by Databricks in 2020. The original description defines a platform that “combines the best elements of data lakes and data warehouses -- delivering data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes.” It builds on a data lake foundation because “today, the vast majority of enterprise data lands in data lakes” and they thus allegedly contain “more than 90% of the data in the enterprise.”
Much of the design thinking focuses on the delivery of data warehouse-like functionality in a cloud environment. In particular, a data lakehouse emphasizes the need to maintain the integrity of streamed inbound data and its management in relational-like tables. The data lake aspects (simply speaking, a collection of loosely managed and unstructured data stores) remain unchanged.
A look under the data lakehouse hood reveals a data storage approach based on object stores with open-source table management tools (such as Delta Lake or Apache Iceberg) on top. Databricks has built its processing function on Apache Spark. Other data lakehouse vendors, such as Microsoft, favor their own relational database offerings, which are also usually built on the same open-source and object-store foundations.
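To make the idea of a table management layer on top of an object store more concrete, here is a minimal Python sketch. It is a toy illustration only and does not reproduce the actual file formats or APIs of Delta Lake or Apache Iceberg. The names ObjectStore and LogTable, the key layout, and the JSON encoding are all assumptions made for this example. The sketch shows the general pattern: data is written as immutable files, and an ordered commit log records which files make up the table at each version.

```python
import json


class ObjectStore:
    """In-memory stand-in (hypothetical) for a cloud object store of immutable blobs."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        # Refuse overwrites so that an object, once written, never changes.
        if key in self._blobs:
            raise KeyError(f"object {key!r} already exists")
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


class LogTable:
    """Toy table layer (hypothetical): data files plus an ordered commit log."""

    def __init__(self, store, name):
        self.store = store
        self.name = name
        self.version = -1  # no commits yet

    def commit(self, rows, remove=()):
        """Write rows as a new data file, then publish it with one log entry."""
        next_version = self.version + 1
        data_key = f"{self.name}/data/part-{next_version:05d}.json"
        self.store.put(data_key, json.dumps(rows))
        entry = {"add": [data_key], "remove": list(remove)}
        # Because put() refuses overwrites, a given version can be claimed only once.
        self.store.put(f"{self.name}/_log/{next_version:05d}.json", json.dumps(entry))
        self.version = next_version
        return data_key

    def snapshot(self, version=None):
        """Replay the log up to `version` and return the rows visible at that point."""
        if version is None:
            version = self.version
        live = []
        for v in range(version + 1):
            entry = json.loads(self.store.get(f"{self.name}/_log/{v:05d}.json"))
            live = [k for k in live if k not in entry["remove"]] + entry["add"]
        rows = []
        for key in live:
            rows.extend(json.loads(self.store.get(key)))
        return rows


store = ObjectStore()
orders = LogTable(store, "orders")
first_file = orders.commit([{"id": 1, "amount": 10}])
orders.commit([{"id": 2, "amount": 25}])
# Replace the first file with a corrected row in a single commit.
orders.commit([{"id": 1, "amount": 12}], remove=[first_file])

latest = orders.snapshot()
original = orders.snapshot(version=0)
```

In this sketch, nothing in the store is ever modified in place. A correction is a new file plus a log entry that retires the old one, so a reader replaying the log sees either the table before the commit or the table after it, and earlier versions remain readable.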
In fact, the same cloud technology underpins many cloud data warehouses, so it might be argued that the data lakehouse is largely a marketing-led rebranding of the cloud data warehouse approach. Beyond the naming, there is very little difference between a cloud data warehouse and a data lakehouse.
Beyond the Centralized Warehouse Paradigm
Both cloud data warehouses and data lakehouses hew closely to the original concept of a single store of data. The very first data warehouse architecture pattern actually envisaged a “single logical storehouse of all information… [that] may physically reside in multiple locations.” However, practical implementation was limited to a single centralized physical database by the technological and performance limitations of relational databases of the time. This physically centralized paradigm is not fundamental to data warehousing. Nonetheless, it is generally easier to design and implement the consistent, integrated, timely, high-quality, useful, and usable data warehouse I mentioned earlier in a centralized approach.
As we shall see in my next article, the logical data warehouse pattern that emerged in the early 2010s was a first step away from this centralized paradigm. That will lead us to a discussion of data fabric and why, despite its similar-sounding name, data mesh is a completely different beast. Together, these articles will dispel most of the cloudy confusion of today’s cloud data warehousing market.
Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing in 1988. With over 40 years of IT experience, including 20 years with IBM as a Distinguished Engineer, he is a widely respected analyst, consultant, lecturer, and author of “Data Warehouse -- from Architecture to Implementation” and “Business unIntelligence -- Insight and Innovation beyond Analytics and Big Data” as well as numerous white papers. As founder and principal of 9sight Consulting, Devlin develops new architectural models and provides international, strategic thought leadership from Cornwall. His latest book, “Cloud Data Warehousing, Volume I: Architecting Data Warehouse, Lakehouse, Mesh, and Fabric,” is now available.