TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

Four Tips for Achieving Lasting ROI with a Data Lakehouse

Data lakehouses provide new data storage possibilities, but implementing them can be challenging. These four tips will help you achieve long-lasting ROI from this architecture.

By Michael Hay
July 10, 2024

For more than a decade, the data lake has been evolving, and the last few years have seen a logical progression of the architecture. Today, the data lakehouse positions modern data warehouse analytics, performance, security, and governance functionality directly onto the lake while still embracing data’s many formats, intake sources, and naturally distributed state.

For Further Reading:

Avoid Ending Up with a Marshy Mess Instead of a Data Lakehouse

Data Lakehouses: The Key to Unlocking the Value of Your Unstructured Data

Sunrise at the Lakehouse: Why the Future Looks Bright for the Data Lake’s Successor

The data lakehouse is popular because it opens new efficiencies and reduces the friction inherent in constant data movement. It can empower different enterprise teams to directly access data for business intelligence, streaming analytics, data science, machine learning, and product development, using their favorite query engines and tools and leveraging computational resources on premises, in the cloud, or in a hybrid mix. The lakehouse enables an end-to-end experience where data is easily accessible across an organization as a reusable product. Teams can have a conversation with their data, query data sets in the lake across multiple file and table formats, and crack open those Apache Parquet files or Iceberg tables to solve real problems, whether with a simple ad hoc request or a machine learning task.

That’s not to say implementing a lakehouse always goes swimmingly. Here are four tips to achieve lasting ROI with this promising architecture.

Think Deeply About Data as a Reusable Product

Implementing a lakehouse involves thinking deeply and differently. You still have to meet known challenges, such as considering where your data originates and why you're capturing it. If you’re ingesting data from emails, logs, IoT sensors, sales and marketing tools, and many other sources, you still need to prepare it, clean it, and make sure it’s properly anonymized, masked, and aligned with governance policies, compliance regulations, and laws. You’ll need workflows to analyze, transform, index, enrich, and search data so it’s readily usable by query engines or even retrieval-augmented generation (RAG) techniques, when considering generative AI.
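As a rough illustration of the preparation step, here is a minimal Python sketch that cleans free-text records and masks email addresses before they land in the lake. The sample records, the `MASKING_KEY` value, and the `<email:...>` token format are assumptions made for this example, not part of any particular lakehouse product; a real pipeline would follow your own governance policies.

```python
import hashlib
import hmac
import re

# Hypothetical secret for illustration; load a real key from a secrets manager.
MASKING_KEY = b"example-key-not-for-production"

# Deliberately simple email pattern; production masking usually needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_email(match):
    """Replace an email address with a stable keyed-hash token."""
    address = match.group(0).lower().encode("utf-8")
    digest = hmac.new(MASKING_KEY, address, hashlib.sha256).hexdigest()
    return f"<email:{digest[:8]}>"


def clean_record(raw):
    """Collapse whitespace, then mask every email address in the text."""
    text = " ".join(raw.split())
    return EMAIL_RE.sub(mask_email, text)


# Made-up sample records standing in for ingested email or log text.
records = [
    "  Password reset requested by alice@example.com  ",
    "Ticket 42 escalated\tto bob@example.com by Alice@Example.com",
]
cleaned = [clean_record(r) for r in records]
```

Because the token is derived from the lowercased address, the same person maps to the same token across records, so masked data can still be joined and counted without exposing the address itself.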

Imagining data as a reusable product from the outset and ensuring it can be repurposed for new, as-yet-unknown tasks is also vital when implementing lakehouse architecture to advance long-term gains. Managing data as if it’s a product means gathering requirements and thinking about data within the context of an interactive and agile development life cycle, where you’re preparing products for people who come after you and use those products for creative applications that aren’t even on the drawing board yet. You might deploy an advanced catalog and inverted index that bolsters new data use and reuse.
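To make the catalog-plus-inverted-index idea concrete, here is a small Python sketch. The catalog entries and the `build_inverted_index` and `search` helpers are hypothetical names invented for this illustration; a production catalog would add metadata, ranking, and access controls.

```python
from collections import defaultdict

# Hypothetical catalog: data set name -> short description.
catalog = {
    "sales_orders": "daily sales orders by region and product",
    "web_logs": "raw web server logs with request latency",
    "iot_sensors": "sensor readings by region and device",
}


def build_inverted_index(entries):
    """Map each lowercase term to the set of data sets that mention it."""
    index = defaultdict(set)
    for name, description in entries.items():
        for term in description.lower().split():
            index[term].add(name)
    return index


def search(index, query):
    """Return the data sets whose descriptions contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_inverted_index(catalog)
by_region = search(index, "region")  # data sets that mention "region"
```

The point of the index is that someone who arrives later, with a question nobody anticipated, can find candidate data sets by the terms that describe them instead of by knowing where they live.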

A data lakehouse that enables the fast repurposing of data provides a key condition for effective self-service, too. As the lakehouse opens massive reservoirs of structured, semistructured, and unstructured enterprise data stores, business teams with different kinds of domain expertise can explore data widely and bring their good ideas to fruition to produce new value in ways never before possible with the bottlenecks and limited access of the past.

Turn a Data Tax Into a Data Asset

In one of the world’s largest economies, there’s a painful but essential banking regulation that requires all banks to store all logs for seven to 10 years. All logs. This is a multipetabyte-scale compliance challenge, a kind of data tax banks must pay to do business. They need to constrain costs and, from a regulation perspective, keep the log data in a biased-for-action format (that is, structured so a SQL query engine can query it efficiently) that enables the bank to respond quickly to internal auditors or external regulators.

You can use a lakehouse to transform this scenario into a refreshing new opportunity.

Log data might comprise a list of messages and time sequences. There's a text payload there that can be mined for machine learning and generative AI applications. You might mine that data to look for advanced persistent threats or security problems. You might create a new analytics application. A data lakehouse makes all this possible in an efficient way. Perhaps the query engine the bank uses is the wrong tool to solve a new problem. Fortunately, you can bring something as simple as Python, or an analytical engine, or a tool to execute statistics and mathematics within the SQL query engine, to bear on your problem. They’re all easy to dock at the lakehouse. It enables the creation of derived data products and transforms the log data tax into a data asset.
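The pattern can be sketched in a few lines of Python. In this illustration, an in-memory SQLite database stands in for the lakehouse's SQL query engine, and the log line layout, table name, and sample messages are all assumptions made up for the example.

```python
import re
import sqlite3
from collections import Counter

# Assumed log layout: "<ISO timestamp> <LEVEL> <free-text message>".
LOG_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")

raw_logs = [
    "2024-07-01T09:15:02Z INFO login ok user=u123",
    "2024-07-01T09:15:07Z WARN login failed user=u456",
    "2024-07-01T09:15:09Z WARN login failed user=u456",
    "2024-07-01T09:16:30Z ERROR transfer rejected user=u789",
]

# Structure the raw lines into rows a SQL engine can query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT, level TEXT, message TEXT)")
rows = [m.group("ts", "level", "message") for m in map(LOG_RE.match, raw_logs) if m]
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# Compliance-style question: how many failed logins in a given hour?
failed_logins = conn.execute(
    "SELECT COUNT(*) FROM logs "
    "WHERE message LIKE 'login failed%' AND ts >= ? AND ts < ?",
    ("2024-07-01T09:00:00Z", "2024-07-01T10:00:00Z"),
).fetchone()[0]

# Derived data product: mine the text payload in plain Python.
warnings = [r[0] for r in conn.execute("SELECT message FROM logs WHERE level = 'WARN'")]
failed_by_user = Counter(msg.split("user=")[1] for msg in warnings if "user=" in msg)
```

The same stored rows answer the auditor's question through SQL and feed a small security-oriented summary through Python, which is the tax-to-asset shift in miniature.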

Beyond this use case, time series and log data management opens up exciting scenarios for businesses and nonprofits alike. Consider government-oriented sectors that need to store logs efficiently so they can query that data, build reports, and feed the data sets into AI pipelines or other advanced analytics.

Refresh Stale Data and Leverage Specialized Processing

To facilitate database retirement and new efficiency gains, many businesses are taking a close look at their legacy data infrastructure and warehouses and moving old, stale database objects into data lakehouses. Leveraging the lakehouse’s open format keeps data from previous systems usable beyond the life of various applications while lowering costs. The lakehouse offers efficiencies when it comes to replication, backup windows, load times, and all kinds of concerns around the transition to a new environment and responsible maintenance of data for the long term.

Lakehouse-enabled ad hoc queries offer another kind of advantage and a data mart-type experience. For example, consider a telecommunications company with detailed call records and signal strength records that wants to optimize placement of dozens of new cell towers. Such a project requires various advanced queries, but it’s a one-off. A lakehouse allows you to bring specialized processing to the data for the immediate need, with secondary use cases always possible down the line.
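A one-off analysis like the cell tower example might look something like the following Python sketch. The record fields (`cell`, `signal_dbm`, `dropped`), the sample values, and the ranking rule are invented for illustration; a real study would work from actual call detail records and far richer criteria.

```python
from collections import defaultdict
from statistics import mean

# Made-up call records: serving cell, signal strength in dBm, dropped flag.
call_records = [
    {"cell": "A", "signal_dbm": -95, "dropped": True},
    {"cell": "A", "signal_dbm": -101, "dropped": True},
    {"cell": "A", "signal_dbm": -89, "dropped": False},
    {"cell": "B", "signal_dbm": -70, "dropped": False},
    {"cell": "B", "signal_dbm": -75, "dropped": False},
    {"cell": "C", "signal_dbm": -105, "dropped": True},
    {"cell": "C", "signal_dbm": -99, "dropped": False},
]


def rank_cells(records):
    """Rank cells by drop rate (worst first), breaking ties on weaker signal."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["cell"]].append(record)
    summary = [
        {
            "cell": cell,
            "mean_signal_dbm": mean(r["signal_dbm"] for r in rows),
            "drop_rate": sum(r["dropped"] for r in rows) / len(rows),
        }
        for cell, rows in grouped.items()
    ]
    return sorted(summary, key=lambda s: (-s["drop_rate"], s["mean_signal_dbm"]))


ranking = rank_cells(call_records)
```

The ranking is disposable once the towers are placed, but the call records it read from stay in the lake for whatever question comes next.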

In fact, the ease with which teams can use different, specialized query engines on the lakehouse is a major feature. Hook up an engine that speaks Solr SQL, the SQL interface to Solr, an unstructured data search engine that can also talk to databases, structured systems, NAS devices, and object stores. Get in the middle of an indexing pipeline and add customizations that help it “speak” oil and gas data or seismic data. Develop data products that combine data from the lake and from external sources across clouds and on-premises environments. The possibilities are limitless.

Use Durable Open Formats to Pass the Test of Time

Lakehouses embrace open formats, which are best for keeping data accessible over long time frames. Whether an open format is as simple as CSV or as complex as an Iceberg table, keeping data in it means you’re much more likely to be able to read that data five years or a decade from now and still reuse it. Simpler, more durable open formats pass the test of time.
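As a small illustration of why simple open formats age well, the Python sketch below writes records to CSV and reads them back using nothing beyond the standard library. The sample rows are made up, and an in-memory buffer stands in for a file in the lake.

```python
import csv
import io

# Made-up rows, including a comma and quotes to exercise CSV escaping.
rows = [
    {"ts": "2024-07-01T09:15:02Z", "level": "INFO", "message": "login ok, user=u123"},
    {"ts": "2024-07-01T09:15:07Z", "level": "WARN", "message": 'quota "soft" limit reached'},
]

# Write: the header row makes the file self-describing.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["ts", "level", "message"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read: any CSV-aware tool can recover the same records later.
restored = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

The trade-off is that CSV carries no type information, so every value comes back as text; table formats such as Iceberg add schema and other table metadata on top of open file formats at the cost of more moving parts.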

It's also more important than ever to think about the control you need. For a long time, organizations have had to outsource IT control and choice; they’ve paid big cloud bills that never delivered the promised efficiency. Open source tools and open data formats put you squarely back in the driver's seat, though this also means taking some responsibility for business outcomes.

Today, data lives much, much longer than applications. It’s vital that organizations distinguish between data and applications and never obscure their data to the point that it can no longer yield new value. Data living in open formats in the lakehouse, instead of bottled up in applications, remains available and reusable for a multiplicity of consumers, including those launching or refining AI projects in years to come.

Lakehouses aren’t an automatic panacea, and the architecture requires thoughtful implementation. Managing data in a lakehouse environment with these tips top of mind can help your enterprise better dive into its data and achieve a lasting return on its infrastructure investments.

About the Author

Michael Hay is a technologist and product planning expert currently serving in Hitachi Vantara’s CTO office as VP, technology and research. His expertise is centered on devising practical futures and innovative solutions by working backwards from the user. Previously, Michael held the role of vice president of products at Teradata, responsible for the Vantage platform on private clouds and business continuity as a service for Vantage. Before Teradata, Michael worked at Hitachi in a series of roles, culminating in his position as vice president and chief engineer. There he played a crucial role in developing data-driven applications for the oil and gas and financial services industries. He holds a master of science in industrial and systems engineering from San Jose State University and a bachelor of science in electrical engineering from the University of New Mexico. Outside of work, he is a devoted husband and father with a deep appreciation for Japanese culture.
