Executive Q&A: Data Lakehouses
Data lakehouses held much promise, but as Lewis Carr, senior director of product marketing at Actian, explains in this interview, they've fallen short. How can lakehouses be improved, and will new features be enough to encourage enterprises to adopt lakehouses?
- By James E. Powell
- October 8, 2021
Are data lakehouses just hype, or do they have real value? What can be done to make them better, and is it realistic to expect that new features can or will be added? We asked Lewis Carr, senior director of product marketing at Actian, for his perspective.
What's the problem data lakehouses are supposed to fix?
Data lakehouses are meant to bring together the best of two worlds -- data lakes and data warehouses -- meaning they allow organizations to deal with large sets of unstructured and semistructured data, such as those found in data lakes, with the precision and structure of a data warehouse.
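To make that idea concrete, here is a minimal sketch (the schema and records are hypothetical, chosen only for illustration): semistructured JSON events of the kind a data lake accumulates are flattened into a fixed, typed table so they can be queried with SQL, the kind of precision a warehouse offers.

```python
import json
import sqlite3

# Hypothetical semistructured records, as might land in a data lake.
raw_events = [
    '{"user": "a1", "action": "view", "meta": {"device": "mobile"}}',
    '{"user": "b2", "action": "buy",  "meta": {"device": "desktop"}}',
    '{"user": "a1", "action": "buy"}',  # missing fields are common in lakes
]

# Impose warehouse-style structure: a fixed schema with typed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT, device TEXT)")
for line in raw_events:
    rec = json.loads(line)
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (rec.get("user"), rec.get("action"), rec.get("meta", {}).get("device")),
    )

# A precise, structured query over formerly loose data.
rows = conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
).fetchall()
print(rows)  # [('buy', 2), ('view', 1)]
```

A real lakehouse does this at far greater scale and without hand-written flattening, but the trade-off is the same: structure must be imposed somewhere before warehouse-grade queries are possible.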
Why haven't they fulfilled that promise?
Part of the problem is that neither data lakes nor data warehouses solve the key problems of data access, preparation, and enrichment, or how to enable that functionality for both data lake and warehouse users. Neither addresses how to curate and ingest disparate and diverse sets of data, which is the real issue data lakes face. Data lakes can take anything in and have historically become data dumps; data warehouses tend to be inflexible and slow to change and incorporate new data.
The other reason data lakehouses have not fulfilled their promise is that they are mostly data lakes attempting to do part of what data warehouses can do, but they are not architected for large numbers of concurrent end users coming through several applications for operational workloads. Without the capability to handle these jobs -- and without architects, sellers, and buyers focused on these mundane yet business-critical operations -- it's hard to see how data lakehouses will capture much of the market share owned by data warehouses. Data warehouses that are moving to the cloud can already handle real-time streaming and semistructured data, and data lakehouses are trying to achieve the same merger of capabilities.
How do you define data hubs versus analytics hubs?
A data hub provides a centralized "way station" for transformation of disparate and diverse data on its way to another destination. An analytics hub is a centralized point for analytics processing -- typically advanced analytics, artificial intelligence (AI), and machine learning (ML). Both are points of centralization, but data hubs are concerned with data format transformation and analytics hubs are for analytics processing. These hubs also tend to be used by different roles in their organizations -- IT integration specialists use data hubs in support of others and non-IT users leverage analytics hubs in a self-service fashion.
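The data hub's "way station" role can be sketched in a few lines (the feeds, field names, and canonical schema below are hypothetical): disparate incoming formats are transformed into one canonical shape before being forwarded to their destination.

```python
import csv
import io
import json

# Hypothetical inputs arriving at the hub in two disparate formats.
csv_feed = "id,amount\n1,9.99\n2,4.50\n"
json_feed = '[{"id": 3, "amount": "12.00"}]'

def to_canonical(record):
    """Transform one record into the hub's canonical, typed schema."""
    return {"id": int(record["id"]), "amount": float(record["amount"])}

# Normalize both feeds into a single unified stream.
canonical = [to_canonical(r) for r in csv.DictReader(io.StringIO(csv_feed))]
canonical += [to_canonical(r) for r in json.loads(json_feed)]

# The hub would now forward these records on to their destination.
print(canonical)
```

The point is that the hub is a pass-through for format transformation, not a home for analytics processing -- that part belongs to the analytics hub.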
What features are data lakehouses missing that data hubs or analytics hubs offer?
A data lakehouse is not generally a place where data is stored in an organized way, nor is it meant for operationalized data. In general, in-line business processes use transactional databases, which feed into data warehouses. Historically, the problem with data warehouses has been a lack of agility when faced with new data sets and users, though this has been changing over the last decade. In some cases, data warehouses can sufficiently handle transactional operations directly -- these are typically called hybrid transactional/analytical processing (HTAP) databases, which act as a bridge between the two technologies.
It will be interesting to see if transactional databases start to flow into data lakehouses, which generally inherit document stores, social media feeds, and other semistructured data ingestion feeds that flow into data lakes.
How would those features improve lakehouses and would they be enough to encourage more enterprises to adopt lakehouses?
Data lakehouses will need to show they can perform subsecond operations on billions of records of data to meet operational analytics needs. Data lakehouses can store terabytes of data, run advanced analytics for use cases such as AI modeling and algorithm tuning, and crank through analysis of large sets of data over extended time periods as part of a research project. However, they can't always run queries and other ad hoc analytics on these large data sets in near real time, limiting their capabilities and, for some, their usefulness.
Data lakehouses represent a merger of data lakes and data warehouses but are largely data lakes with a few data warehouse features rather than the reverse. Just as data hubs and analytics hubs have a distinct constituency, so does the data lakehouse -- and that's largely among developers and data scientists. Data warehouses tend to be managed by IT but service the entire range of non-IT users, from highly skilled analysts to retail transactional workers.
Is there any hope that data lakehouses will actually acquire these missing features? If not, what alternatives exist for organizations?
Cloud data warehouses will increasingly be used by non-IT users, analysts, and others to handle much of what was done by data lakes. There's a reason some data lake vendors are trying to rebrand themselves as data lakehouses, but both will need to harness more of what analytics hubs do to make their tools self-service for non-IT users. Data hubs, for their part, already provide robust data access, preparation, and enrichment. If one or both of these architectures merge with data hub and analytics hub capabilities, then we'll see the real democratization of data we've all been waiting for.
[Editor's Note: Lewis Carr is senior director of product marketing at Actian. In his role, Lewis leads product management, marketing, and solutions strategies and execution. Lewis has extensive experience in cloud, big data analytics, IoT, mobility, and security, as well as a background in original content development and diverse team management. He is an individual contributor and manager in engineering, pre-sales, business development, and most areas of marketing targeted at enterprise, government, OEM, and embedded marketplaces.]
James E. Powell is the editorial director of TDWI, including research reports, the Business Intelligence Journal, and the Upside newsletter. You can contact him via email.