Unifying DW and DL: 6 Key Questions
Questions our analysts didn't have time to answer in a recent webinar highlight key considerations for moving to a unified data warehouse/data lake.
Enterprises are beginning to unify their data warehouse (DW) and data lake (DL) in a powerful new architecture. Respondents to a recent TDWI Best Practices Report survey see the unified DW/DL as an opportunity to improve management of an increasingly diverse range of data structures, end user types, and business use cases.
In a recent webinar, the report's co-authors -- TDWI Research VP Fern Halper and senior research director James Kobielus -- discussed their study's findings and provided strategic recommendations for enterprise data professionals. Due to the webinar's time limitation, Halper and Kobielus weren't able to respond to all of the audience's questions.
Here instead are their in-depth responses to some of the most salient questions the audience posed during the live event.
When you migrate data to the cloud, what data should go into the data warehouse and what data should go into data lakes?
Historically, the kind of data that would go into the data warehouse would include structured data, which would be used in reports and dashboards. This was the sort of data set that was well understood, cleansed, and curated. However, as organizations wanted to collect other kinds of data and perform analysis on new data types, the data lake rose in prominence. Here, organizations could store raw data of different types. That could be text data (text data collection is already mainstream). That could be machine data -- 20 percent of organizations are collecting machine data, according to TDWI research. All sorts of data could go into this data lake.
Now the move is to the cloud data warehouse and cloud data lake. These platforms are already mainstream. In fact, we're seeing more respondents to our surveys utilizing a cloud data lake than a data lake on premises. They are moving new and raw data there. The diverse data in the cloud data lake is often being used for advanced analytics. That isn't to say that some organizations aren't moving unstructured data to the cloud data warehouse -- they are doing that, too.
However, we're also seeing the convergence of the data warehouse and the data lake enabled by vendors such as Snowflake, Databricks, and others. We're talking about the converged data warehouse and data lake. In some ways the constructs of the data warehouse and the data lake become arbitrary, where you can put data into a data cloud or "Delta Lake" or something similar so the whole thing is basically a unified platform. We're not sure that most organizations are there yet, but it's an interesting idea.
During data warehouse migration to a unified DW/DL, does it make sense to store old data from the data warehouse in the new unified platform?
In migrating a data warehouse to the cloud or migrating a data warehouse to a unified DW/DL environment, the old data generally refers to historical data that is less commonly accessed, queried, and analyzed than the new data.
The old data generally is stored or backed up to an archive. A data archive might be on an entirely different platform, such as a Hadoop platform, which could also be in the cloud. You wouldn't move the old data that had been in the data warehouse to a data lake unless that older data was relevant to some advanced analytics modeling, machine learning, or deep learning initiative that the data lake itself enables.
In other words, the data lake is fundamentally a platform for persisting and managing data of various sorts -- structured and unstructured alike -- so that it is available for exploration and for modeling (as in machine learning models used to develop advanced analytics and AI applications of various sorts).
Older data from the data warehouse that is still relevant to historical analysis for building predictive models should go into the unified DW/DL. Older transactional data that may not be relevant to any intended predictive models should go not to the data lake but to an archival store of some sort, such as Amazon S3.
It could be stored in the Hadoop Distributed File System when that's used as an archive. HDFS can also be used as a data lake for exploration and modeling or conceivably as a data store for BI and decision-support applications.
Underlying platforms such as HDFS can be used for multiple use cases, one of which may, in fact, be a data lake. However, all the old data that you move from a data warehouse you're migrating could be moved to the same platform you're using -- perhaps serving as both a data lake and an archive. It might be architected in terms of different zones, domains, or clusters within a unified Hadoop deployment.
The bottom line: when you're migrating data from a data warehouse, the older data doesn't need to be stored in a data lake. More likely, it needs to be stored in an archival database in the cloud or elsewhere.
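The routing rule described above -- recent data stays in the warehouse, old data goes to the lake only if it feeds predictive models, and otherwise to an archive -- can be sketched in a few lines. The field names (`loaded_on`, `used_in_models`) and the five-year cutoff are illustrative assumptions, not anything prescribed in the discussion.

```python
from datetime import date, timedelta

# Assumed retention threshold; real cutoffs depend on your own policies.
CUTOFF = date.today() - timedelta(days=5 * 365)

def route_record(record):
    """Return 'warehouse', 'lake', or 'archive' for a migrating record.

    `record` is assumed to carry a `loaded_on` date and a `used_in_models`
    flag; both names are hypothetical, chosen for this sketch.
    """
    if record["loaded_on"] >= CUTOFF:
        return "warehouse"   # recent data stays on the primary platform
    if record["used_in_models"]:
        return "lake"        # old but relevant to ML goes to the data lake
    return "archive"         # old and unused goes to archival storage
```

In practice the "archive" branch would write to something like S3 or an HDFS archive zone; the point of the sketch is only that relevance to modeling, not age alone, decides whether old data belongs in the lake.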
What are the key drivers for deciding whether to use physical or logical architecture to unify the DW and DL?
TDWI research shows that the key driver to help organizations decide if they want to go to a unified DW/DL oftentimes revolves around analytics, and it depends on where you are in your analytics journey.
If you're still at the report and dashboard stage, if your on-premises data warehouse is supporting that, and you're not ready to make a move into self-service or more compute-intensive kinds of analysis such as machine learning and natural language processing, then stick with your data warehouse.
As organizations start to mature analytically and want to do more, they want to capture diverse data types to perform different kinds of analysis. That's when they want to marry data that might be in the data warehouse with data that's in the data lake.
For example, I may be building a model where I want to understand why I'm losing customers. I want to use data (such as customers' average monthly bill) in the data warehouse starting from when the person first became a customer -- I'm analyzing structured data. On the flip side, I want to capture unstructured data about (for example) whether they file trouble tickets, what was in the trouble ticket, what they complained about, their sentiment, and the like. When you bring those data sets together, you'll have a good data set from which to build a model.
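The churn example above boils down to a join by customer ID between structured warehouse records and features already extracted from unstructured lake data. A minimal sketch, with invented customer IDs, field names, and values:

```python
# Structured DW data: billing facts per customer (values are illustrative).
warehouse = {
    "c1": {"avg_monthly_bill": 80.0, "tenure_months": 36},
    "c2": {"avg_monthly_bill": 45.0, "tenure_months": 6},
}

# DL-derived features: ticket counts and a sentiment score distilled from
# unstructured ticket text (both hypothetical).
lake = {
    "c1": {"ticket_count": 0, "avg_sentiment": 0.4},
    "c2": {"ticket_count": 5, "avg_sentiment": -0.7},
}

def build_features(warehouse, lake):
    """Merge DW and DL records by customer ID into one modeling data set."""
    return {cid: {**warehouse[cid], **lake.get(cid, {})} for cid in warehouse}

features = build_features(warehouse, lake)
```

The merged `features` table is the "good data set" the authors describe: one row per customer combining billing history with complaint signals, ready to feed a churn model.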
You would want a well-governed set of unified data to do that more easily. That is what we see as one of the key drivers of building a unified DW/DL. It's obviously going to depend on what you already have deployed. If you have a cloud data lake, an on-premises data warehouse, and some other platforms, you want to do it logically. If you want to put everything in the cloud, you may want to use a cloud platform such as a lakehouse.
If you have existing investments in a data warehouse platform and a data lake platform that, for example, are both on premises, and it would be too expensive to deploy a brand-new physical, unified platform to unify the data and the analytics running on those legacy platforms, you might more cost-effectively and rapidly integrate them as a logically unified entity behind a virtualization layer with common governance. This may be the cheaper or faster way to unify them.
If some of the data in the DW or the DL is sensitive and needs to stay on premises, you can't move it into the cloud without running afoul of laws and regulations. You might choose a cloud-based platform for the other component of the unified platform and then unify those two platforms on premises and in the cloud using a hybrid architecture. In other words, these are the considerations when deciding whether to go all physical, all logical, or a combination of both with your unification strategy.
How does an operational data store fit into a unified DW/DL platform?
The operational data store (ODS) is an established pattern for analytics databases. Its primary function is to support enterprise reporting, online analytical processing, and fast query against structured data records.
The ODS has traditionally been geared to a single data domain, such as financial data that's used purely by the chief financial officer and their organization. An ODS is primarily for analytics on historical data rather than for predictive analytics or other advanced analytics.
An ODS is typically not the same as a data warehouse, regardless of whether you subscribe to the Inmon or Kimball school of what a true data warehouse is: a multidomain repository of historical data that is governed and maintains a single version of the truth. An ODS is far more narrowly scoped in terms of domain and applications than a data warehouse.
What are the use cases of data virtualization in a unified DW/DL, and where should it be avoided?
Data virtualization is a logical model for integrating data across multiple data sources and enabling the data to basically stay where it is. We've discussed use cases of data virtualization in terms of multiple platforms when you want a logical view across the platforms.
Data virtualization should be considered when the data you're accessing is not all consolidated physically into one server or cluster but is scattered across your enterprise. You need to bring it together in terms of a unified query, unified visualization, and unified modeling of data from various locations. You will use data virtualization for a variety of requirements that can't be supported by complete, physical consolidation of the data.
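The idea of a unified query over data that stays where it is can be illustrated with a toy virtualization layer. The source names, row shapes, and query interface here are all invented for the sketch; real virtualization products push predicates down to each source rather than pulling rows into memory.

```python
# A toy virtualization layer: queries fan out to registered sources while
# the data stays put in each one, and the results form one unified view.
class VirtualLayer:
    def __init__(self):
        self.sources = {}

    def register(self, name, rows):
        """Register a data source (modeled here as a list of dicts)."""
        self.sources[name] = rows

    def query(self, predicate):
        """Run the same predicate against every source; union the results."""
        return [row
                for rows in self.sources.values()
                for row in rows
                if predicate(row)]

layer = VirtualLayer()
layer.register("on_prem_dw", [{"region": "EU", "revenue": 10}])
layer.register("cloud_lake", [{"region": "US", "revenue": 7},
                              {"region": "EU", "revenue": 3}])
eu_rows = layer.query(lambda r: r["region"] == "EU")
```

The single call to `query` returns EU rows from both the on-premises warehouse and the cloud lake, which is the "unified query" behavior the paragraph above describes -- without either source physically moving its data.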
One example is in B2B supply chains, where your partners are accessing and moving data back and forth. In that case, there are several federated data interchange, data integration, data management, and data analysis scenarios that absolutely must be met by some form of data virtualization.
Data virtualization is a good idea as long as it doesn't impact performance. We don't recommend data virtualization if you are running a highly compute-intensive machine learning model, because if you have to bring all of that data together and iterate on it repeatedly, performance will suffer.
Data virtualization should also be avoided if scattering the data beyond its standard on-premises deployment might run afoul of security, legal, or compliance mandates. Be careful to ensure that the data is always persisted and managed under the full control of whoever may go to prison if they allow the data to be leaked, disclosed, or otherwise accessed in ways that could cause harm or invade privacy.
How does the Delta Lake concept relate to the unification of the DW and DL, if at all?
The Delta Lake, as defined by Databricks and implemented in their solution architecture, is an open format storage layer that enables building a unified DW/DL -- aka lakehouse -- on top of existing storage systems such as S3.
Departing from the traditional architecture of a DW or DL, the Delta Lake also supports ACID (atomicity, consistency, isolation, durability) transactions, a feature that makes it similar to what some refer to as a NewSQL data platform architecture. This feature enables the Delta Lake to function like an online transactional processing database because it supports two-phase commit, among other features.
Delta Lake can be considered a grand unification framework for a unified analytics platform that also supports a full range of robust enterprise transactional requirements. It also appears to implement a so-called Lambda architecture in support of both batch and streaming data within a unified architecture. It is a multimodel database that can store data in its native format, and you can query it all without a separate ETL layer.
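Delta Lake achieves its ACID behavior on top of plain object storage by recording every commit as a sequentially numbered file in a transaction log. The following is a minimal sketch of that idea only -- the file naming, JSON shape, and helper names are simplified stand-ins, not the real Delta protocol.

```python
import json
import os
import tempfile

def commit(log_dir, actions):
    """Append one commit to a toy transaction log and return its version.

    Each commit is an atomically written, sequentially numbered JSON file,
    so a reader always sees a consistent, complete table version.
    """
    version = len([f for f in os.listdir(log_dir) if f.endswith(".json")])
    # Write to a temp file first, then rename: the rename is atomic on the
    # local filesystem, so no reader ever observes a half-written commit.
    fd, tmp = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    os.rename(tmp, os.path.join(log_dir, f"{version:020d}.json"))
    return version

def latest_version(log_dir):
    """Return the highest committed version, or None for an empty log."""
    files = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
    return len(files) - 1 if files else None
```

Replaying the log from version 0 upward reconstructs the table state at any point in time, which is the mechanism behind features like snapshot reads and time travel in the real system.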
What's most noteworthy about Databricks' Delta Lake architecture is that the company has open-sourced this technology for cloud deployments.