TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

What Is a Data Lakehouse? One Architecture to Replace Two

To understand what a data lakehouse is, you have to understand the problem it's trying to solve, which means understanding why data warehouses and data lakes exist as separate things in the first place.

Data warehouses were built for analytics. They store structured, processed data in formats optimized for query performance, enforce schemas that keep data consistent, and integrate with the BI tools that business users rely on for reporting. For decades, the data warehouse was the destination for organizational data that needed to be analyzed. The tradeoff was inflexibility: getting data into a warehouse required transformation upfront, unstructured data was difficult to handle, and storing large volumes of raw data was expensive.

Data lakes emerged as a response to those limitations. The idea was to store everything in its raw form, as cheaply as possible, and figure out how to use it later. Object storage, the kind that underlies services like Amazon S3, made it economical to store enormous quantities of data without committing to a schema in advance. Data scientists could work directly with raw files. Machine learning pipelines could access data at scale. The tradeoff was the opposite of the warehouse: flexibility and low cost, but poor query performance, weak data quality guarantees, and limited support for the SQL-based analytics that most business users need.

Most organizations ended up with both. Data landed in the lake first, raw and cheap. A subset got transformed and loaded into the warehouse for analytics. Two systems, two sets of infrastructure, two teams managing them, and the inevitable question of which system was the source of truth when they disagreed.

The lakehouse architecture, a term popularized by Databricks around 2020, proposes collapsing this into one. The core idea is to add a metadata and transaction layer on top of object storage that gives it the properties that made data warehouses valuable: ACID transactions, schema enforcement, efficient query execution, and support for standard SQL. The data stays in open file formats in cheap object storage, but it can be queried with warehouse-like performance and managed with warehouse-like governance.
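To make two of those properties concrete, here is a minimal Python sketch of schema enforcement and all-or-nothing commits. It is a toy model, not how any real lakehouse engine is implemented: the `ToyTable` class, its method names, and the in-memory version list are all hypothetical, and real systems store data in files and resolve concurrent writes with more nuance than the simple "reject if stale" rule assumed here.

```python
from dataclasses import dataclass, field


class SchemaError(ValueError):
    """Raised when a row does not match the table's declared schema."""


class CommitConflict(RuntimeError):
    """Raised when a writer commits against an out-of-date snapshot."""


@dataclass
class ToyTable:
    # Hypothetical schema representation: column name -> Python type.
    schema: dict
    # versions[i] holds every row visible at snapshot i; snapshot 0 is empty.
    versions: list = field(default_factory=lambda: [[]])

    def snapshot(self) -> int:
        """Return the id of the latest committed snapshot."""
        return len(self.versions) - 1

    def read(self, version=None) -> list:
        """Read a consistent snapshot; readers never see a partial write."""
        v = self.snapshot() if version is None else version
        return list(self.versions[v])

    def commit(self, rows, based_on: int) -> int:
        """Validate all rows, then publish them as one new snapshot."""
        rows = list(rows)
        # Schema enforcement: reject the whole batch if any row is malformed.
        for row in rows:
            if set(row) != set(self.schema):
                raise SchemaError(f"unexpected columns: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise SchemaError(f"{col!r} must be {typ.__name__}")
        # Simplified optimistic concurrency: the writer must have started
        # from the latest snapshot, otherwise it has to retry.
        if based_on != self.snapshot():
            raise CommitConflict("table changed since this write began")
        # Atomic publish: the new snapshot appears in a single step.
        self.versions.append(self.versions[-1] + rows)
        return self.snapshot()


table = ToyTable(schema={"order_id": int, "amount": float})
v1 = table.commit([{"order_id": 1, "amount": 9.5}], based_on=0)
```

In this sketch a batch containing one bad row leaves the table untouched, and a reader holding snapshot 0 still sees an empty table after the commit, which is the kind of isolation a transaction layer is meant to provide.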

The technical innovation that makes this possible is the open table format. The three dominant formats are Delta Lake (developed by Databricks), Apache Iceberg, and Apache Hudi. Each provides a metadata layer on top of files in object storage that tracks which files belong to which table, manages transactions so that concurrent reads and writes don't interfere with each other, and maintains statistics (such as per-file minimum and maximum column values) that query engines can use to skip irrelevant data. Without this layer, querying files in object storage directly is slow and unreliable. With it, performance approaches that of a purpose-built warehouse on many workloads.
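The file-tracking and data-skipping ideas can be illustrated with a short Python sketch. This is a simplified model rather than the actual on-disk layout of Delta Lake, Iceberg, or Hudi: the `DataFile` record, the `("add" | "remove", file)` log entries, the file names, and the `ts` column range are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # Hypothetical per-file metadata: a path plus min/max of a "ts" column.
    path: str
    min_ts: int
    max_ts: int


def live_files(log):
    """Replay the log in order to find the files that make up the table now."""
    live = {}
    for action, data_file in log:
        if action == "add":
            live[data_file.path] = data_file
        elif action == "remove":
            live.pop(data_file.path, None)
    return list(live.values())


def files_to_scan(log, lo, hi):
    """Return only files whose [min_ts, max_ts] range overlaps [lo, hi]."""
    return [
        f.path
        for f in live_files(log)
        if f.max_ts >= lo and f.min_ts <= hi
    ]


# An append-only log: three files are written, then one is rewritten
# (for example by a compaction job) as a remove followed by an add.
log = [
    ("add", DataFile("part-000.parquet", 0, 99)),
    ("add", DataFile("part-001.parquet", 100, 199)),
    ("add", DataFile("part-002.parquet", 200, 299)),
    ("remove", DataFile("part-000.parquet", 0, 99)),
    ("add", DataFile("part-003.parquet", 0, 99)),
]

# A query for 150 <= ts <= 220 needs two of the three live files.
selected = files_to_scan(log, 150, 220)
print(selected)  # ['part-001.parquet', 'part-002.parquet']
```

The point of the sketch is that the engine decides which files to open by reading metadata alone. Without the log and statistics, it would have to list the storage location and inspect every file to answer the same query.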

The practical appeal is significant. Organizations that have invested heavily in data lakes built on object storage can add lakehouse capabilities without migrating their data to a different storage system. Data scientists and data engineers who work with raw data and machine learning pipelines can work from the same storage layer as the analysts running SQL queries in BI tools. The separation between the lake and the warehouse, and the pipelines that moved data between them, becomes less necessary.

The limitations are real too, and worth being clear about. Lakehouse query performance on complex analytical workloads still lags behind purpose-built columnar data warehouses like Snowflake or BigQuery in many cases, though the gap has narrowed substantially. The operational complexity of managing an open table format at scale is non-trivial. And the ecosystem, while maturing quickly, is less settled than the warehouse ecosystem that has been developing for decades.

Whether the lakehouse fully replaces both the warehouse and the lake in most organizations is a question that different vendors answer differently, with predictable self-interest. What's clear is that the architecture has moved from a concept to a production reality at a significant number of organizations, and that the open table format layer it depends on has become a meaningful part of the modern data infrastructure conversation. For anyone trying to understand how data infrastructure is evolving, the lakehouse is where the warehouse and the lake are currently meeting.
