The Open Analytics Stack, the Next Wave of SaaS on Kubernetes, and the In-VPC Deployment Model: What We’ll See in 2021
As cloud adoption accelerates, users want the best of all possible worlds -- flexibility and ease of use along with tight security around their data. The virtual private cloud (VPC) may be the answer.
- By Dipti Borkar
- December 1, 2020
As the shift to the cloud (and to multicloud environments) has accelerated during the past year, new challenges have arisen in managing data and workloads. Companies want to keep their data secure in their own accounts and environments but still leverage the cloud for analytics. They want quick analytics on their data lakes. They also want to take advantage of containers for their applications across different cloud environments. Here are the key data trends I believe we’ll see in 2021.
Trend #1: A New Approach to Analyzing Data: The Open Analytics Stack
It's become clear over the past year that the cloud has won over the traditional on-premises data warehouse. In the past six months alone we've seen a massive uptick in cloud adoption. Cloud data warehouses have shown that simplifying the software enables users to focus on innovating in their domain versus spending time managing data analytics software.
Although cloud data warehouses may be the solution for the traditional workloads of reporting and dashboarding, there is still a gap when it comes to analyzing data in data lakes.
Data lake analytics is complex. In 2021, new approaches will address that gap. More companies will adopt an open-source approach so they can run many types of analytics on data lakes using open formats and open interfaces, without moving the data around or ingesting it into proprietary technologies that lock users in.
These workloads will augment the traditional data warehousing use cases and over time will become more mission-critical given that almost all data within an enterprise is moving into data lakes. I call this new approach the "Open Analytics Stack." This stack uses open-source technologies at every layer -- the engine, the formats, and the interfaces.
How do you get started? Here are a few open technologies to evaluate when building your stack:
- For the core engine, use an open-source SQL query engine such as Presto
- For open formats, the most popular are JSON, Apache ORC, and Apache Parquet (and there are many others)
- For open interfaces, JDBC/ODBC drivers can connect to any dashboarding, reporting, or notebook tool
- Use an open cloud so you're not locked in
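To make the open-interface point concrete, here is a minimal sketch in Python. It assumes Python's DB-API 2.0 as a rough analog of the JDBC/ODBC layer mentioned above, and it uses an in-memory SQLite database purely as a stand-in engine so the example runs anywhere. The `run_report` helper and the `page_views` table are hypothetical names chosen for illustration. In a real stack the connection would presumably come from a DB-API driver for an engine such as Presto, and the reporting code would not need to change.

```python
import sqlite3


def run_report(conn, sql):
    """Run a query through any DB-API 2.0 connection and return rows as dicts."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        # cursor.description holds one tuple per column; item 0 is the name.
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        cur.close()


# Stand-in engine (assumption for illustration only): in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (region TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("us-east", 120), ("us-east", 80), ("eu-west", 50)],
)

report = run_report(
    conn,
    "SELECT region, SUM(views) AS total_views "
    "FROM page_views GROUP BY region ORDER BY region",
)
print(report)
```

Because `run_report` depends only on the standard connection and cursor interface, swapping the engine underneath it is a change to the connection setup rather than to the analytics code, which is the practical benefit of an open interface.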
Trend #2: Kubernetes for Multicloud, Kubernetes for SaaS
Container orchestration technologies such as Kubernetes continue to grow in usage and popularity -- that's not new. What is new is figuring out how to run containerized workloads in multicloud environments. As more companies move to a multicloud approach (a shift that has accelerated in the past year), they will start attempting to run container workloads across multiple clouds. Decisions about which technologies to adopt may hinge on whether a technology is multicloud ready and containerized out of the box. A technology or application that doesn't meet those requirements will be passed over for one that does.
Take, for example, distributed data systems. Hadoop was heavyweight and not very container friendly. However, a technology such as Presto, a distributed SQL query engine, can be easily containerized and orchestrated using Kubernetes. Because it meets these requirements, Presto will grow even more popular and become the core SQL engine of multicloud containerized workloads.
Consumer-oriented SaaS applications already widely use Kubernetes. SaaS applications focused on data infrastructure, on the other hand, haven't adopted Kubernetes as aggressively. Another container trend I see for 2021 is a rise in managed SaaS apps for data analytics and processing that run on Kubernetes, specifically using cloud services such as Amazon EKS and Google GKE. Managing containers is often not easy. Companies with SaaS applications that can take advantage of the scalability, portability, extensibility, and availability that containers offer -- while abstracting the management complexities away from their end users -- will emerge as the winners.
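As a rough illustration of what "containerized and orchestrated using Kubernetes" can look like, the sketch below builds a Kubernetes Deployment manifest for a pool of Presto workers and prints it as JSON, a format Kubernetes tooling accepts alongside YAML. The function name, the labels, and the image reference `prestodb/presto:latest` are assumptions made for this example, and a real deployment would also need coordinator and worker configuration that is omitted here.

```python
import json


def presto_worker_deployment(replicas, image="prestodb/presto:latest"):
    """Build a Kubernetes Deployment manifest (as a dict) for Presto workers.

    The image name and labels are illustrative assumptions, not a
    recommended production setup.
    """
    labels = {"app": "presto", "role": "worker"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "presto-worker", "labels": labels},
        "spec": {
            # Scaling the worker pool is a one-field change.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "presto",
                            "image": image,
                            # 8080 is Presto's default HTTP port.
                            "ports": [{"containerPort": 8080}],
                        }
                    ]
                },
            },
        },
    }


manifest = presto_worker_deployment(replicas=3)
print(json.dumps(manifest, indent=2))
```

The same manifest can be applied to a cluster in any cloud, which is what makes a container-friendly engine a natural fit for multicloud workloads.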
Trend #3: Riding the Cloud, Owning Your Data: The In-VPC Deployment Model
We know that cloud adoption has become mainstream. More companies than ever are creating and storing the majority of their data in the cloud, especially in cost-efficient Amazon S3-based data lakes. However, security concerns are still real. Although the public cloud offers ease of use, scale, and speed of deployment, it also means that companies don't have as much control over how and where data is sent, used, and accessed. That's especially true if data has to be ingested into other environments.
Users want the best of both worlds -- the flexibility and ease of the cloud along with tight security around their data. Most would prefer for data to remain in their own cloud account, one that they can control and have full visibility into. This is where I see a new cloud-native architecture emerging, especially when it comes to data-focused managed services. I'm calling it the "In-VPC" deployment model. In this model, the control plane (delivered as SaaS, running in the vendor's VPC) is separate from the compute and data planes (where your compute and data reside, running in a customer's VPC).
The control plane oversees, orchestrates, and manages the environment outside of compute and data. That covers everything from the VPC and networking to data, compute, and the operating system. It never sees any of the customer's data because the vendor's compute is brought to the user's data in the customer's VPC, rather than the user's data being brought to the vendor's compute.
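The separation described above can be sketched as a toy model. Everything in it is hypothetical: the `ControlPlane` and `DataPlaneAgent` classes, the command format, and the sample rows are invented for illustration and do not describe any vendor's implementation. The point it tries to show is that the control plane exchanges only commands and status with the customer's environment, while queries and their results stay in the customer's VPC.

```python
from dataclasses import dataclass, field


@dataclass
class DataPlaneAgent:
    """Runs inside the customer's VPC, next to the data (hypothetical)."""
    rows: list          # customer data; never sent to the control plane
    workers: int = 0

    def apply(self, command):
        # Execute a control-plane command locally and report status only.
        if command["action"] == "scale":
            self.workers = command["workers"]
        return {"status": "ok", "workers": self.workers}

    def query(self, predicate):
        # Queries run in the customer's VPC; results go to the customer.
        return [row for row in self.rows if predicate(row)]


@dataclass
class ControlPlane:
    """Runs in the vendor's VPC; handles commands and status, not data."""
    audit_log: list = field(default_factory=list)

    def scale(self, agent, workers):
        status = agent.apply({"action": "scale", "workers": workers})
        self.audit_log.append(status)
        return status


agent = DataPlaneAgent(rows=[{"user": "a", "spend": 10}, {"user": "b", "spend": 99}])
control = ControlPlane()
control.scale(agent, workers=4)

# The customer queries their own data without involving the control plane.
big_spenders = agent.query(lambda row: row["spend"] > 50)
print(control.audit_log)
print(big_spenders)
```

In this model the only information that crosses the boundary is operational metadata such as worker counts, which mirrors the claim that the vendor manages the environment without seeing customer data.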
This new deployment model will become more widely adopted in 2021. It addresses the gap that many companies face -- wanting to take advantage of the cloud without losing ownership of data.
Dipti Borkar is a co-founder and CPO of Ahana with over 15 years of experience in distributed data and database technology including relational, NoSQL, and federated systems. She is also the Presto Foundation Outreach Chairperson. Prior to Ahana, Dipti held VP roles at Alluxio, Kinetica, and Couchbase. Earlier in her career, Dipti managed development teams at IBM DB2 Distributed, where she started her career as a database software engineer. Dipti holds an M.S. in computer science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.