Q&A: The Revival of the Column-Oriented Database
In this Q&A, Edgemesh CEO Jake Loveless talks about column-oriented databases and the advantages they hold over ordinary row-oriented versions.
- By Upside Staff
- August 19, 2022
Upside: What led to the development of column-oriented databases (CODs)?
Jake Loveless: The idea of storing data based on columns of data versus rows of data goes back to the early days of database design in the 1950s and 1960s. The primary motivation was that data, especially in analytics workloads, is often accessed for only certain fields. For example, the question “what is the total revenue by month” only requires two fields to evaluate: the column of month and the column of revenue. Because accessing data was (and still is) slow, there was a significant speed advantage to reducing the disk access time. The simplest solution was to read only the data needed.
What are the benefits of CODs?
The primary advantage has always been speed. For many workloads, column-oriented databases can be 10 to100 times faster than a row-oriented counterpart. Again, this is driven by reducing the access time to read data into the system. As we moved from a single CPU core to multiple CPU cores and from slower magnetic disks to faster solid state disks, column-oriented designs got even faster.
Taking the earlier example “what is the total revenue by month,” we can imagine reading an array of months (one for each record), and finding the start and offset for each changing month. For example, if there are 1000 records, but only two values for month (say records 0–499 have a value of Jan-2022 and records 500–999 have a value of Feb-2022), we can then issue two parallel reads and compute the sums of revenue independently (and 2x faster).
In the cloud-native world, though, another significant advantage is compression. Data that is similar compresses more efficiently, meaning compressing a file of all numbers will yield a higher compression rate than a file of numbers and strings. CODs store each column as a distinct -- and compressible -- file, so it’s not uncommon to see 10x or more savings on data storage. At Edgemesh our average compression ratio is 24:1, meaning we store 24x less data (and pay 24x less storage fees) every month.
Are there any special benefits for large enterprises?
Aside from the increase in speed and decrease in storage costs, one of the biggest savings is developer overhead. Developers and database administrators are very expensive resources, and optimizing database query performance is complex. With CODs, optimization is much more intuitive. For example, assume a query such as:
“select sum EXPENSES by MONTH from TABLE where (not ALLOCATION=”internal”) AND DEPARTMENT=”sales”
In a column-oriented design, the optimization is often: “find the WHERE conditional that is the most restrictive and take that first.” In this scenario, if our data has only 20 departments, we should make that the first condition (get the column of data for the department), then get the offsets for DEPARTMENT=”sales” and then only read those records from the ALLOCATION column. In my former career on Wall Street, we would often see orders of magnitude performance improvements just by reordering WHERE clauses.
What are some use cases where column-oriented databases are a good fit and where are they not a good fit?
In general, any time series or analytics workload is a good fit for a column-oriented design. If there are many rows, especially of numeric data, CODs can have massive impacts on speed and savings. Alternatively, if you have a smaller, truly relational set of data, then a row-oriented system is likely a better fit. Examples include applications such as HR databases or supplier databases where each row (an employee or supplier) is fairly unique and the workload is traditional relational queries (e.g., “get all employees by department”).
If column-oriented databases have so many benefits, why did they go out of style?
Cost -- more specifically a lack of free, open source options. Whereas there was a large movement towards non-proprietary relational databases (MySQL, PostgreSQL, SQL Lite, etc.) in the 1990s and beyond, column-oriented databases remained both expensive and proprietary. Systems such as KX’s Kdb+, Vertica, and Sybase IQ have six-figure licensing costs and, until very recently, no open source alternatives. Today, we’re seeing the rise of open source column stores, such as Clickhouse (from Yandex), MonetDB, and Citrus. As these solutions continue to mature and are deployed at scale, I expect we will continue to see an increased rate of adoption.
At Edgemesh, you’ve spoken about how your database helps provide a competitive edge. Can you explain how?
At Edgemesh, we store billions of records per database, and do so with a column-oriented database. The speed and cost savings help give us a material advantage over our competitors because we can allow clients to store and query across hundreds of billions of records with sub-second response times and minimal storage costs. The storage savings are significant; we have about 7.5 TB of storage in use across the fleet with a 24:1 compression rate. Without a COD, our storage costs would very quickly require us to either archive data, thus removing our customers’ ability to look back over long periods of time, or increase prices.