Data Lakehouses: The Key to Unlocking the Value of Your Unstructured Data
To reap the benefits of a data lakehouse, your organization needs to approach implementation the right way. Here are three best practices to help you.
- By Craig Kelly
- September 1, 2022
For many years, organizations relied on the structured data in their data warehouses for all their data needs. However, since the rise of unstructured and semistructured data, many organizations have begun employing a combination of data warehouses and data lakes for data storage. Often, that means moving unstructured raw data from data lakes to more structured data warehouses to perform analytics -- a clunky and cumbersome process.
There’s a way to centralize your storage architecture under one platform and get the best of both storage methods while improving efficiency. This architecture, known as a data lakehouse, combines the expansive storage capabilities of data lakes with the structure of data warehouses. By using data lakehouse best practices, your organization can perform analytics on both structured and unstructured data to solve complex business problems.
Data Lakehouses Offer the Best of Both Worlds
Data warehouses revolutionized storage architecture when IBM researchers Paul Murphy and Barry Devlin introduced the concept in the 1980s. Here was a method that would empower inexperienced users to quickly access structured data from a centralized system.
As organizations increasingly digitized their workflows, adopting sophisticated digital tools such as ERPs, CRMs, and IoT devices, data became more complex. Now, businesses could capture rich, unstructured data such as text, audio, and video, but they needed a data storage option that could handle these new data types.
Enter the data lake -- a storage repository that holds a large amount of unstructured data in its raw form. Although this storage method allowed organizations to take advantage of the enormous quantities of unstructured data they produce, there were some limitations. For example, the increased scale of a data lake makes it harder for users to find the information they’re looking for.
Many organizations solve this issue by transferring information between data lakes and data warehouses, but the data lake’s lack of structure makes this process slow. With 80 percent of worldwide data expected to be unstructured by 2025, organizations need more effective data management options.
The solution may lie in the data lakehouse. Data lakehouses enable more efficient management of both structured and unstructured data, as well as semistructured data such as social media analytics. Cloud-based lakehouses create a structured metadata layer that sits on top of a data lake. This layer allows IT teams to perform analytics against the lake itself or move the data to a traditional warehouse environment to build out a data dashboard. Data lakehouses are a new concept, but they've caught on quickly -- 73 percent of organizations are combining their data warehouses and lakes in some way.
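The metadata-layer idea can be illustrated with a minimal, hypothetical sketch: a catalog records each "table's" declared schema and points at raw lake files, so analytics can run against the lake directly. The `Catalog` class, the table name, and the sample data are all illustrative assumptions, not a real lakehouse API.

```python
import csv
import io

# Illustrative sketch of a lakehouse-style metadata layer: a catalog maps
# table names to raw lake files plus a declared schema, so queries can run
# against the raw data directly instead of copying it to a warehouse first.
class Catalog:
    def __init__(self):
        self.tables = {}  # table name -> {"schema": ..., "raw": raw bytes/text}

    def register(self, name, schema, raw_csv):
        self.tables[name] = {"schema": schema, "raw": raw_csv}

    def scan(self, name):
        entry = self.tables[name]
        reader = csv.DictReader(io.StringIO(entry["raw"]))
        for row in reader:
            # The declared schema (type coercion) is applied on read.
            yield {col: typ(row[col]) for col, typ in entry["schema"].items()}

# A raw "lake" file: plain CSV with no structure beyond the bytes themselves.
raw = "product,units\nbike,120\ntreadmill,80\n"

cat = Catalog()
cat.register("sales", {"product": str, "units": int}, raw)

# Analytics directly against the lake, via the metadata layer.
total = sum(row["units"] for row in cat.scan("sales"))
print(total)  # 200
```

Real lakehouse table formats do essentially this at scale, tracking schemas and file locations in a transaction log rather than an in-memory dictionary.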
Let’s say an enterprise wants to forecast demand generation for a product over the next six months; it would go to its data lake, where all the relevant product data is sitting. Using a data lakehouse, the company can run analytics directly against the data in the lake to create an intelligent prediction of demand for the next six months. The lakehouse also allows the enterprise to easily move the data to a data warehouse if it wants to build a dashboard that displays these insights.
This example shows how data lakehouses combine the benefits of data warehouses and lakes to unlock the potential of unstructured data. To reap the benefits of a data lakehouse, though, your organization needs to approach implementation the right way.
Rome Wasn’t Built in a Day and Your Data Lakehouse Won’t Be, Either
More companies today see the data lakehouse as a valuable addition to their existing storage architecture. As with any new trend, it’s easy to buy into the hype and ignore the work it takes to make the concept satisfy your organization’s needs. Here are three best practices for businesses looking to adopt a data lakehouse architecture:
Build up to the lakehouse. Data lakehouses are ideal for mature organizations that have large quantities of data to manage, including major companies such as Netflix and Uber. However, smaller organizations can still benefit from the structure and flexibility a lakehouse can provide -- even if they don’t have as much data to handle. After deciding to move forward with a data lakehouse, your organization should manipulate small data sets before incorporating all of your data into the lakehouse. Starting small will help your team familiarize itself with the architecture and develop a plan based on these initial experiments.
Know where your data comes from. It’s a major challenge to ingest and sync data from multiple sources, especially if the data is complex. For example, take Peloton, which relies on real-time data streaming to operate its live leaderboard. Peloton uses a lakehouse to ingest large amounts of structured, semistructured, and unstructured data during a class. The company then consolidates and relays that data back to the end user in the form of rankings, heart rates, and other KPI-related dashboards. Without the right architecture in place, Peloton wouldn’t be able to provide customers this real-time feedback.
Your organization may not need to process data as quickly as Peloton does, but you still need to know where your data is coming from. Audit your data pipelines to see where your structured and unstructured data originates. Visibility into your data sources will help you determine which storage architecture is best suited to a specific project.
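A first pass at such an audit can be sketched in a few lines: classify incoming files as structured, semistructured, or unstructured to see roughly what each pipeline carries. The file list and the extension buckets below are illustrative assumptions, not a standard taxonomy.

```python
from collections import Counter

# Hypothetical buckets for a quick data-source audit.
STRUCTURED = {".csv", ".parquet"}
SEMISTRUCTURED = {".json", ".xml", ".avro"}

def classify(filename):
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    if ext in STRUCTURED:
        return "structured"
    if ext in SEMISTRUCTURED:
        return "semistructured"
    return "unstructured"  # text, audio, video, images, ...

# Sample of files landing in a (made-up) ingestion pipeline.
incoming = ["orders.csv", "clicks.json", "call.mp3", "demo.mp4", "dim.parquet"]
audit = Counter(classify(f) for f in incoming)
print(dict(audit))  # {'structured': 2, 'semistructured': 1, 'unstructured': 2}
```

If the audit shows mostly structured data, a warehouse may suffice; a heavy unstructured share is a signal that lake or lakehouse storage is worth the investment.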
Take advantage of the lakehouse's predictive capabilities. A major goal of traditional storage architecture is to centralize and provide access to data. As businesses recognize the value of AI and machine learning, however, many have shifted their focus to using data to solve business problems. That's why your organization should take advantage of the predictive elements of a data lakehouse.
Begin by identifying a specific business problem that a data lakehouse can solve. For example, let's go back to our previous scenario about an enterprise looking to forecast product demand. If it were at the beginning of its data lakehouse journey, that enterprise could develop a proof-of-concept (POC) project to determine how a data lakehouse could help it forecast demand before diving into full implementation and analysis.
Gaining an early win with the architecture would help secure organizational buy-in and give the enterprise a solid foundation for expanding its data lakehouse use over time.
Upgrade Your Storage Architecture to Reach New Heights
Unstructured data can unlock new possibilities for your organization, but only if you have the storage architecture to manage it. Data lakehouses enable you to organize the vast amounts of data within data lakes to solve complex business problems. Keep in mind they aren’t the right architecture for everyone -- you need to have a high data volume to justify the investment. As your organization relies more on unstructured data, a data lakehouse can be a strong foundation for more intelligent problem-solving.
Craig Kelly is VP of analytics at Syntax, where he leads professional and managed services around analytics and product and application development for the analytics practice. Before working at Syntax, Craig was a co-founder of EmeraldCube Solutions. He has been in the analytics space for the last 20 years, working with IBM Cognos, Oracle BI, and GoodData tools to build solutions for ERP customers. Craig and his team now focus primarily on AWS analytics, integrating traditional data warehousing and BI along with forward-looking ML and forecasting capabilities.