Sunrise at the Lakehouse: Why the Future Looks Bright for the Data Lake’s Successor
A data lakehouse offers plenty of benefits -- many of them not immediately obvious -- that together mark a turning point in the evolution of data analytics.
- By Billy Bosworth
- July 11, 2022
After suffering from inflated expectations and well-known management challenges, data lakes have made an impressive comeback in the past few years. In fact, they have evolved so much that the industry is rapidly recognizing their full potential by giving them a new name: lakehouses. The term “lakehouse” connotes that data lakes are now robust enough to be considered on par with data warehouses. Beyond just catching up, lakehouses have hidden benefits that offer distinct advantages over the data warehouse architecture.
The Rise of the Data Lakehouse
Let’s look at how we got here. There are two primary catalysts for the rise of the current generation of lakehouses: application developers and the evolution of cloud storage.
First, consider developers. The world of infrastructure choices has more or less always been shaped by developer preference. Because developers write the line-of-business applications that generate revenue, they mostly get what they want in terms of tools and platforms. To build applications that function properly, developers have to fine-tune applications and databases to meet performance goals and service-level agreements (SLAs).
Once their fine-tuned applications are live, some developers get sensitive about who can touch them. However, the data from those applications is critical for data analysts. The data needs to be copied and moved from the application database to some other location -- often combined with data from other systems -- where it can then be analyzed.
This process of copying and moving data is largely captured under the broad heading of extract, transform, and load (ETL). To extract data means that at some level you will need to interact with the database. That’s where the application teams get nervous: they just don’t like other teams touching their databases. There’s too much at stake. Therefore, most of the time they create “data dumps” themselves. These data dumps are just extracted data stored in some common file format. Even if you are not a database person, chances are that at some point you opened a file in Excel that was in comma-separated values (CSV) format -- a common format used by application teams when extracting data from their databases.
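The extract step described above can be sketched with Python's standard library alone. This is a minimal illustration, not any team's actual pipeline: it assumes a SQLite application database, and the table and column names are made up.

```python
import csv
import sqlite3

# Illustrative application database: an in-memory SQLite table of orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.50), (2, "globex", 80.00), (3, "initech", 42.25)],
)

# The application team creates the "data dump" themselves -- the extract
# step of ETL -- so no other team has to touch the live database.
rows = conn.execute("SELECT id, customer, total FROM orders").fetchall()
with open("orders_dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer", "total"])  # header row
    writer.writerows(rows)
```

The resulting CSV file is exactly the kind of artifact that, at scale, ends up accumulating in cloud storage.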
With modern applications, the cumulative size of these files can get large very quickly. Where will they be stored? In the last decade, the first version of data lakes stored such data in on-premises Hadoop clusters. There are many reasons why that approach has largely fallen out of favor, some rooted in the rise of competitive cloud services.
The attractiveness of cloud storage to application developer teams has soared. Examples of cloud storage are S3 on AWS, Azure Data Lake Storage on Azure, and Google Cloud Storage on Google Cloud. The appeal of these storage layers is that they are, for all practical purposes, infinitely scalable, extremely easy to interface with, and available at very low cost. Add the fact that new applications are usually cloud-native and you have an easy, effective, and inexpensive way for developers to store their data dumps.
Hidden Benefits by Design
This brings us to the first hidden benefit of lakehouses: they reduce data copies. Data engineers do not like moving and copying data, because it adds complexity to the environment every time it’s done. There are governance concerns, limitations on batch window times, intricate job dependencies, increased costs for duplicate storage and additional computation resources, questions of which data set is endorsed for the business to use, and so on.
With lakehouses, those challenges are dramatically diminished because the largest, fastest-growing data sets are those being dumped from cloud applications into cloud storage. That yields an amazing benefit: being able to query the data directly where it lands versus forcing it through numerous ETL jobs and then sending it to a data warehouse for analysis.
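The idea of querying data where it lands can be sketched in miniature with the standard library. Real lakehouse engines operate on columnar formats at far larger scale; here a hypothetical analyst simply filters a landed CSV dump in place, with no load into a warehouse (file and column names are illustrative):

```python
import csv

def query_landed_file(path, min_total):
    """Scan a CSV file where it landed and filter rows directly --
    no copy into a warehouse, no intermediate ETL job."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [row["customer"] for row in reader if float(row["total"]) >= min_total]

# Write a small landed file so this sketch is self-contained.
with open("landed_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer", "total"])
    writer.writerows([(1, "acme", 120.50), (2, "globex", 80.00), (3, "initech", 42.25)])

big_spenders = query_landed_file("landed_orders.csv", 100.0)  # ["acme"]
```

The point is the shape of the workflow: the query runs against the file where it already sits, rather than after a chain of copy-and-load jobs.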
The second benefit comes from the breadth of the data available for analysis. Data analysts can never get enough data. However, due to the numerous concerns about copying data already noted, only a subset of application data is typically copied into data warehouses for analysis. The data engineering teams work hard to ensure that the subset of data is what the business needs, but what if you didn’t have to worry about that at all?
With a lakehouse, you can point your data analysts to the entire data set without worrying about subsets and extracts. Data analysts really appreciate this approach, and it stops them from trying to “backdoor” the data warehouse teams in search of their own personal copies of data. However, hearing the words “entire data set” may raise a security question. Do you really want all your data available to all users? Of course not, and that is not what is being offered.
On the contrary, having a centralized lakehouse is actually quite advantageous from a security standpoint because lakehouse platforms now provide fine-grained access control. Companies can control who can see what data -- at the table level or even at the column and row level. In fact, eliminating the need to copy data into other systems is a massive security benefit. Permissions don’t travel with data: as soon as you start copying data into data warehouses (and then creating various derivatives within the data warehouse, and extracts outside it), the IT team loses the ability to control who can access what data -- or even to see who is accessing what data and when.
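The column- and row-level controls described above can be sketched as a simple policy filter. This is a toy model only -- real lakehouse platforms enforce these rules inside the engine -- and the role names and rules here are entirely hypothetical:

```python
# Hypothetical fine-grained access policy: which columns a role may see,
# plus a row-level predicate applied before any data is returned.
POLICIES = {
    "analyst": {"columns": ["customer", "total"],
                "row_filter": lambda r: r["region"] == "us"},
    "admin":   {"columns": ["customer", "total", "region"],
                "row_filter": lambda r: True},
}

def read_with_policy(rows, role):
    """Apply row- and column-level rules centrally, so permissions never
    have to travel with copies of the data."""
    policy = POLICIES[role]
    visible = [r for r in rows if policy["row_filter"](r)]
    return [{c: r[c] for c in policy["columns"]} for r in visible]

orders = [
    {"customer": "acme", "total": 120.5, "region": "us"},
    {"customer": "globex", "total": 80.0, "region": "eu"},
]

analyst_view = read_with_policy(orders, "analyst")  # only US rows, no region column
admin_view = read_with_policy(orders, "admin")      # all rows, all columns
```

Because the policy lives in one place alongside the single copy of the data, there are no downstream extracts where these rules silently stop applying.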
Security risks increase dramatically with every data copy. That’s why it makes more security sense to limit the physical locations of the data. By doing so, you are also limiting the number of security controls you have to implement.
Hidden Benefits and Open Standards
Finally, there are two additional hidden benefits available -- provided you design your lakehouse in an “open” way, which is to say you design it using open standards and open architectures.
For the entire history of databases, whenever you wanted to do real analytics, you moved your data into a query engine. Those query engines were generically called databases (and when it came to analytics, data warehouses). For them to work, you had to put your data into their engine.
With open lakehouses, that paradigm changes dramatically. Instead of bringing the data to the engine, you can now bring the engines to the data. Whether it’s a SQL query engine, a Spark engine, or a streaming engine, in an open lakehouse architecture they all have access to your data -- which lives independent from any of them -- via open standards and open formats. This helps teams avoid lock-in with a specific vendor and makes it easy to adopt new, best-of-breed engines on the horizon.
Avoiding unnecessary copies, giving analysts the full breadth of the data set, avoiding vendor lock-in, and adopting modern data engines are all powerful hidden benefits that a company can derive from a well-designed, open lakehouse. These benefits mark a turning point in the evolution of data analytics in service of growing business value. The lakehouse is where accessible data now lives independently from any particular vendor and in an architecture ready for whatever new cloud services the future may bring.
Billy Bosworth has been in the tech industry for over 30 years in roles ranging from engineer to CEO to public company board member. He has served as the CEO of Dremio Corporation, a privately held company in the data analytics market, since February 2020. Prior to joining Dremio, Billy served as the CEO of DataStax, Inc. Billy frequently writes and speaks on topics such as data autonomy, data analytics, and BI, and as a coach at heart, he also speaks broadly on career management and leadership. You can find out more about Billy on Dremio’s website or LinkedIn. You can follow Dremio on Twitter.