Executive Q&A: Kubernetes, Databases, and Distributed SQL
To take advantage of the scale and resilience of Kubernetes, Jim Walker, VP of product marketing at Cockroach Labs, says you have to rethink the database that underpins this powerful, distributed, and cloud-native platform.
- By James E. Powell
- October 30, 2020
Upside: What does it mean to be "cloud-native"?
Jim Walker: The term "cloud-native" has emerged over the past few years as an approach to development and delivery of applications in cloud environments that takes direct advantage of the scale, resilience and availability of these resources. The term can be traced back to Google. In fact, many of the core principles of being cloud-native were defined by the approach they took to scale their services during the past decade or so.
A key technology at the center of the cloud-native discussion is Kubernetes, which is an orchestration platform that automates and simplifies the deployment and management of container-based services across many (even thousands of) servers. Cloud-native services typically will be self-contained and non-reliant on other services; designed to scale autonomously and survive any failure; and reusable across multiple applications. Ultimately, cloud-native is the phase two approach to optimizing the value and benefit of cloud computing.
What does it mean to be "cloud-native" for a database?
A cloud-native database takes advantage of the core primitives of the cloud to ease scale and survive any failure while allowing applications to thrive everywhere. It implements features and APIs that are applicable across all public and private clouds and could even logically span these environments. Ultimately, it has to look and feel like a monolithic, traditional database yet take advantage of cloud infrastructure.
What does it mean to be a cloud-native distributed SQL database?
A new breed of relational database has emerged over the past few years that takes advantage of the cloud, called distributed SQL. However, to be cloud-native and a distributed SQL database, you must meet five key criteria.
First and foremost, a database must look and act like our traditional databases and this means it speaks SQL. This is important for developer productivity and integration with and across additional tooling.
Second, it must be resilient and be able to survive the failure of any piece of hardware yet still provide access to the database with limited or no impact on query performance. It is bulletproof, always on, and always available and will avoid any single point of failure.
Third, the database needs to be architected for scale. The cloud promises infinite scale, and a cloud-native relational database needs to simplify the use of these resources without causing any additional operational overhead. It should automate and deliver effortless scale.
Fourth, a cloud-native OLTP database must not lose data or allow for discrepancies, as these can lead to errors or, even worse, malicious attacks. A cloud-native database is required to implement and enforce serializable isolation so all transactions are guaranteed consistent and are not casually consistent.
Finally, a cloud database will be accessed anywhere and everywhere, and it should meet the 100ms rule (the time for a transaction to appear instant). A cloud-native database will allow you to tie data to a location so you can meet these latency objectives. This capability will also allow you to use the database to meet some of the stringent data sovereignty regulations found from country to country.
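The fourth criterion above, serializable isolation, usually has a practical consequence for application code: when two transactions conflict, the database may reject one of them rather than let both commit, and the client is expected to retry. The sketch below illustrates that pattern under stated assumptions. SQLSTATE 40001 is the standard SQL code for a serialization failure, but `SerializationFailure`, `run_with_retry`, and `make_flaky_transfer` are hypothetical names invented for this example and do not belong to any particular driver or database.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""

    sqlstate = "40001"


def run_with_retry(txn, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run txn() and retry it when a serialization conflict is reported."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # Give up and surface the conflict to the caller.
            # Exponential backoff with jitter so competing clients spread out.
            sleep(base_delay * (2 ** (attempt - 1)) * random.random())


def make_flaky_transfer(conflicts):
    """Build a fake transaction that conflicts a fixed number of times, then commits."""
    state = {"calls": 0}

    def transfer():
        state["calls"] += 1
        if state["calls"] <= conflicts:
            raise SerializationFailure("restart transaction")
        return "committed"

    return transfer, state
```

A transaction built with `make_flaky_transfer(2)` fails twice and commits on the third call, so `run_with_retry` returns `"committed"` after three attempts. In a real application the retried function would have to redo the whole transaction, reads included, because the earlier reads may be stale.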
What does it mean for data to be distributed?
The idea of distributed data rose out of the big data movement. Distributed data is typically replicated and stored across different physical locations and then accessed where it resides. The challenge with distributed data for relational databases (OLTP) is twofold. First, we must deliver acceptable performance for queries, and second, we must guarantee transactional consistency across all copies of this data.
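One common way to keep replicated copies consistent, which the interview does not spell out and which is offered here only as an assumption, is to require acknowledgment from a majority of replicas before a write counts as committed. The arithmetic behind that idea is short enough to sketch. The function names are hypothetical.

```python
def majority(replicas):
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


def write_commits(acks, replicas):
    """Under the majority-quorum assumption, a write commits once most replicas acknowledge it."""
    return acks >= majority(replicas)
```

With three replicas a majority is two, so one replica can fail. With five replicas, two can fail. Going from three replicas to four adds cost without raising the number of failures tolerated, which is why odd replica counts are the usual choice under this scheme.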
What is Kubernetes and what does it enable?
Kubernetes is an open-source project, originally created by Google, that allows you to "orchestrate" containerized services. It basically automates the deployment, scaling, and management of services and applications. This capability, while seemingly simple, delivers huge value to organizations especially as they scale out their cloud-native initiatives and have to manage and operate hundreds or thousands of services. It delivers a wide range of services that help you keep apps and services up and running, scale individual services to meet user demand, and even achieve rolling updates in production so you can avoid planned downtime.
A direct descendant of Borg (the Google internal orchestration platform), Kubernetes brings Google-like power to operations and makes running instances of stateless application logic simple. It simplifies and eases delivery for applications architected to take advantage of the resources available in cloud environments. Every public cloud provider offers Kubernetes as a platform.
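The rolling updates mentioned above can be pictured with a toy simulation. Kubernetes Deployments expose a `maxUnavailable` setting that caps how many pods may be out of service during an update. The sketch below borrows only that idea. It ignores surge capacity, health checks, and scheduling, and `rolling_update` is a hypothetical function, not Kubernetes code.

```python
def rolling_update(replicas, old, new, max_unavailable=1):
    """Replace pods in batches and record how many are serving during each step."""
    pods = [old] * replicas
    history = []
    for start in range(0, replicas, max_unavailable):
        batch = range(start, min(start + max_unavailable, replicas))
        # Take the batch out of service; the remaining pods keep serving traffic.
        for i in batch:
            pods[i] = None
        history.append(sum(p is not None for p in pods))
        # Bring the batch back on the new version before touching the next one.
        for i in batch:
            pods[i] = new
    return pods, history
```

For four replicas and `max_unavailable=1`, every step leaves three pods serving, and all four end up on the new version. Capacity dips but never reaches zero, which is how planned downtime is avoided.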
What should your relational data strategy look like in Kubernetes?
Running a legacy relational database on Kubernetes is a challenge, and most organizations will just run it alongside the platform to simplify operations. However, this often creates a bottleneck or, worse, a single point of failure for the application -- a violation of a core principle of being cloud-native. Running a NoSQL database is better aligned, but you will still suffer from transactional consistency issues. For both legacy relational and NoSQL, you will need to create complex operators to help manage these databases in the environment, as they simply weren't built with the same architectural primitives.
Using a cloud-native database -- our company's CockroachDB is one example -- allows you to deploy a relational database seamlessly on top of Kubernetes so you can take advantage of all its benefits across your entire application.
What is the (quick) history of distributed SQL?
Distributed SQL has emerged out of the shift to the cloud that organizations are undergoing. It defines a group of databases that are similar to our traditional, legacy relational stores but under the covers implement a distributed transactional layer that allows you to take advantage of the scale and resilience of the cloud. It is a reimagining of the database for our OLTP workloads.
Although we have had distributed systems for decades, the first big push into broad use came with HDFS/MapReduce and Hadoop. This new approach allowed organizations to start collecting and exploring all of their data but was limited to exploration; early projects such as Hive and HBase were limited in their transactional capabilities. NoSQL soon emerged. These databases took the guardrails off transactions and used limited SQL notation so we could scale accessibility of data across the planet. Although databases such as MongoDB and Cassandra have served the developer well, they struggle to deliver a reliable system of record and can be complex to deploy.
During the past few years, we've deployed legacy databases on cloud infrastructure, but these instances are held back by their legacy design. They simply weren't built to scale easily, and we still have to rely on complex, resource-intensive manual sharding. Further, to gain disaster recovery we are beholden to an active-passive architecture that is both expensive and not fully resilient. Finally, no matter how we re-architect these systems to distribute reads, there is no way to scale writes beyond a single region with acceptable transactional latencies. They simply weren't built for the cloud.
Distributed SQL delivers on the scale and resilience of NoSQL but incorporates the transactional consistency we expect in databases used by our system-of-record type workloads. It is aligned with the technologies and infrastructure that are driving the move to the cloud and delivers the familiar SQL that developers expect and the scale and resilience of this new platform. It is a reimagining of the relational database for the cloud.
What advice would you give data architects looking to design and deploy on Kubernetes?
Make sure that whatever you are using, it was architected less than four years ago and that it speaks SQL. You should look for a database that is aligned with the compute platform so you can enjoy all the benefits it delivers. Choose distributed SQL.
[Editor's Note: Jim Walker is the VP of product marketing at Cockroach Labs, which he joined in November 2018. He is a recovering developer turned product marketer and has spent his career in emerging tech. He believes product marketing is one of the most strategic functions in early-stage companies and helps organizations translate complex concepts into a compelling and effective core narrative and market strategy. Before Cockroach Labs, Jim spent time as VP of marketing at emerging tech companies including OverOps, EverString, and CoreOS. You can reach Jim via email, on Twitter, or on LinkedIn.]