Executive Q&A: A Closer Look at Open Data Lake Analytics
An open data lake analytics approach will augment (and over time surpass) adoption of data warehousing given the significant long-term strategic benefits it provides to companies, Ahana's Dipti Borkar argues.
- By James E. Powell
- April 12, 2021
Organizations today have data in multiple systems, and most of that data is now in the cloud or will be soon. The data is usually a mix of structured and unstructured, static and streaming, and stored in many formats. It's dispersed across data warehouses, open source databases, proprietary databases, and cloud warehouses, but increasingly most of it will end up in data lakes. Given this, enterprises are trying hard to avoid getting locked into proprietary systems again for running various types of analytics.
Dipti Borkar, Ahana's co-founder and chief product officer, says that data-driven companies of all sizes -- from startups to giants -- are turning to an open data lake analytics approach. We asked her to explain.
Upside: What is this open data lake analytics approach you advocate?
Dipti Borkar: Open data lake analytics is a new approach to traditional data analytics, and it can provide many strategic and operational benefits. It refers to using a technology stack built on open source, open formats, open interfaces, and open cloud. Each of these characteristics is important: together they let end users and companies apply different types of analytics processing to the same data without being locked in to proprietary formats and technologies.
Let's drill down into each of those. Let's start with open source.
Data lakes are only meant for storage and by themselves provide no direct value. They have, however, become extremely affordable as data volumes have grown and are now ubiquitous. The value to enterprises comes from the compute engine, or more commonly the SQL engine, that runs on top of a data lake, along with other components that are needed such as an operational data catalog.
Using an engine that's open source is strategically important because it allows the data to be queried without the need to ingest it into a proprietary system. If ingested into another system, data is typically locked into the formats of that closed source system. In addition, an open source engine gives you the collective power of a community working together, which accelerates development and troubleshooting.
What about open formats?
Over the past few years, many optimized open formats have emerged to store data in a structured yet highly compressed form. Because these formats are open source, the popular query engines support them; ideally, every open query engine should support them so users can decide which engine to use for different use cases on the same set of data. Using open formats gives companies the flexibility to pick the right engine for the right job without the need for an expensive migration. Some popular open formats include Apache ORC, Apache Parquet, Avro, JSON, and CSV.
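The layout idea behind columnar formats such as Parquet and ORC can be sketched with standard-library Python. This is a toy illustration (plain JSON plus zlib, not a real Parquet writer): it only shows what "storing all values of a column together" means compared to the row-at-a-time layout of JSON or CSV.

```python
import json
import zlib

# Toy dataset: many rows with low-cardinality columns, as is typical
# of event or fact tables stored in a data lake.
rows = [
    {"id": i, "country": "US" if i % 3 else "DE", "status": "active"}
    for i in range(1000)
]

# Row-oriented layout (how JSON or CSV store data): one record after another,
# with every column of a record kept together.
row_layout = json.dumps(rows).encode()

# Column-oriented layout (the idea behind Parquet and ORC):
# all values of one column stored together.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_layout = json.dumps(columns).encode()

row_compressed = len(zlib.compress(row_layout))
col_compressed = len(zlib.compress(col_layout))

print(f"row-oriented compressed:    {row_compressed} bytes")
print(f"column-oriented compressed: {col_compressed} bytes")
```

Grouping a column's values together puts long runs of similar data next to each other, which is one reason the real columnar formats compress so well and let engines read only the columns a query touches.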
Talk to me about open interfaces.
Seamless integration with existing SQL systems and support for ANSI SQL are the standard to strive for; SQL analytics is key because SQL has become the lingua franca of data systems and is still growing in popularity. In an open data lake analytics stack there are no proprietary extensions; you should be able to access data through standard drivers such as ODBC and JDBC as well as through standard programming languages and libraries.
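In Python, the analogue of those standard drivers is the DB-API cursor interface. The sketch below uses the standard library's sqlite3 module as a stand-in engine so it runs without a cluster; the assumption is that your engine's client (for Presto, libraries such as prestodb or PyHive) exposes the same connect/cursor/execute shape, so application code written against plain ANSI SQL stays portable.

```python
import sqlite3

# sqlite3 implements the Python DB-API, the same interface shape that
# open engine clients expose -- used here purely as a local stand-in.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Plain ANSI SQL, no engine-specific extensions.
cur.execute("CREATE TABLE orders (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("emea", 120.0), ("amer", 75.5), ("emea", 30.0)],
)
cur.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
)
result = cur.fetchall()
print(result)  # [('amer', 75.5), ('emea', 150.0)]
```

Because only the connection line names a specific engine, swapping the stand-in for a real open query engine means changing the connect call, not the SQL.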
Finally, you mentioned open cloud.
In the open data lake analytics stack, your query engine should be able to access any storage, natively align with containers, and run on any cloud. In addition, an advantage of open query engines is that they are stateless and don't actually manage data. This makes them a very good fit for running in containers and lets users leverage technologies such as Kubernetes for simpler deployments. As an example, Ahana is a managed service that brings open data lake analytics to users and leverages containers and the disaggregated SQL engine Presto.
You spoke of "many benefits" earlier. Would you elaborate on these, please?
Although the traditional data warehousing approach has gained momentum over the past few years, it serves a specific set of use cases. Increasingly, an open data lake analytics approach will augment (and over time surpass) adoption of data warehousing given the significant long-term strategic benefits it provides to companies. For years, enterprises have looked at ways to break free from proprietary formats and technology lock-ins. An open data lake analytics stack makes that possible.
Technology has finally evolved to a point where separation of storage and compute is now a reality, ultimately giving businesses the ability to be more data-driven and make faster, more-informed decisions. Although cloud data warehouses are great fits for some reporting and analytics use cases, they can get very expensive very fast. In addition, new use cases are emerging, such as ad hoc data discovery, that need to process more data than existing stacks can support. With ad hoc discovery, you can use SQL to run queries whenever you want, wherever your data resides; open query engines allow you to query data where it is stored, so you don't have to ETL it into a separate system.
What's it take to get started with this approach and what are the roadblocks?
Depending on where your data is, there are a few different ways to get started. If the data is already in the data lake, there are three approaches:
Do it yourself: Large internet companies take these open source technologies and deploy and manage them on their own, at massive scale; for example, Facebook runs PrestoDB, and Uber runs PrestoDB and Apache Pinot. This approach is probably the most difficult because it requires distributed systems experience, fundamental knowledge of these open source technologies, and specific skills on the platform team.
Cloud services: AWS, Google, and Azure have "Hadoop-in-a-box" types of services that can get you started with parts of this stack. However, they don't give you the expertise needed to manage the full stack. You still have to do that on your own.
Cloud-native managed services: The first two approaches are obviously fairly challenging, particularly for data platform teams that are smaller in size. This is where many data vendors are taking an approach of building cloud-native managed services around these complicated data processing technologies. This significantly reduces the learning curve for data platform teams as well as the advanced skills needed to manage this stack while still getting the value of the best query engines built by the internet giants.
If the data is not already in the data lake, this can be a blocker. There are many approaches to moving data from other systems into data lakes such as S3, ranging from traditional change data capture techniques such as replaying logs, to stream processing, to cloud-based ETL tools. Once the data lands, it typically needs to be optimized into open formats such as Apache ORC and Apache Parquet so that disaggregated query engines can be leveraged.
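The log-replay idea behind change data capture can be illustrated with a short standard-library sketch. The event shape and field names here are hypothetical; real CDC tools read the source database's own write-ahead or binary log rather than a hand-built list.

```python
# Hypothetical ordered log of row-level changes from a source database.
change_log = [
    {"op": "insert", "id": 1, "row": {"name": "alice", "plan": "free"}},
    {"op": "insert", "id": 2, "row": {"name": "bob", "plan": "pro"}},
    {"op": "update", "id": 1, "row": {"name": "alice", "plan": "pro"}},
    {"op": "delete", "id": 2, "row": None},
]

def replay(log):
    """Fold an ordered change log into the latest state, keyed by primary key."""
    state = {}
    for event in log:
        if event["op"] == "delete":
            state.pop(event["id"], None)
        else:
            # Inserts and updates both set the latest row image.
            state[event["id"]] = event["row"]
    return state

snapshot = replay(change_log)
print(snapshot)  # {1: {'name': 'alice', 'plan': 'pro'}}
```

In a real pipeline, a snapshot like this (or the raw events themselves) would then be written to the lake and compacted into Parquet or ORC for the query engines to read.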
Creating the data lake can be a blocker depending on the skills of the data platform team; the good news is that the cloud giants as well as independent vendors continue to make this easier, and open query engines keep adding features that reduce their dependence on how the stored data is structured.
What best practices can you recommend for enterprises considering this approach?
First, encourage engineers to research and participate in open source projects. Next, prioritize your use cases and decide which one will bring the most value to your company. Finally, evaluate the different approaches and pick the one that best fits your data platform team's profile and strengths.