TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

00 Days

00 Hrs

00 Min

00 Sec

Columnar vs. Row Storage: Why the Same Data Is Stored Two Completely Different Ways

Picture a simple table of sales data. Each row is one transaction: a customer name, a product, a price, a date, a region. On screen it looks like a grid, neat rows and columns, and it's tempting to assume the computer stores it the same way it looks.

It doesn't.

A computer's storage is fundamentally a long, one-dimensional sequence, and a two-dimensional table has to be flattened into that sequence somehow. There are two ways to do it, and the one you pick changes almost everything about how the database performs.

You can store the data row by row, keeping all the values from one transaction together before moving to the next. Or you can store it column by column, keeping all the customer names together, then all the products, then all the prices. Same data, same table, two completely different physical arrangements. This single decision is one of the deepest reasons some databases excel at recording transactions and others excel at analyzing them.

Start with row storage, which is the older and more intuitive of the two. In a row-oriented database, all the values belonging to a single record sit next to each other in storage. The whole first transaction, then the whole second transaction, and so on. If you want to grab one complete record, or add a new one, or update an existing one, everything you need is in one place. You read it or write it in a single, efficient motion.

This is exactly what operational systems need. When a customer places an order, the database has to write one complete new record fast, and it may need to do that for thousands of customers at once. When a support agent pulls up an account, they want every detail of that one account immediately. These are row-shaped operations: touch one whole record at a time, frequently, with speed and reliability. Row storage is built for them.

Now consider a completely different kind of question. How much total revenue did we make last year? What's the average order value by region? Which product sold the most units?

These questions have a different shape. They don't care about any single transaction. They reach across millions of records but only touch one or two columns, the price, maybe the region. And this is where row storage starts to struggle, because to add up every price, a row-oriented database has to read through every complete record, dragging along the customer name, the product, the date, and everything else, just to extract the one number it actually wants. It's like reading an entire book cover to cover to count how many times a single word appears.

Columnar storage solves this by flipping the arrangement. All the prices are stored together in one continuous block. All the regions in another. When the database needs to total up revenue, it goes straight to the price column, reads that block, and ignores everything else entirely. It never has to touch the columns it doesn't need. For an analytical query that scans millions of rows but only a handful of columns, this is dramatically faster, often by orders of magnitude.

There's a second advantage to columnar storage that's easy to overlook but enormously important: compression.

When you store all the values from one column together, you're storing a lot of similar things in a row. A region column might contain the same dozen region names repeated millions of times. A date column contains values that are all close together and follow patterns. Data that's similar compresses extremely well, because the computer can store "this value repeats 40,000 times" instead of writing it out 40,000 times. Row storage can't take advantage of this nearly as well, because each row mixes together values of completely different types, a name next to a price next to a date, with little similarity to exploit. Columnar databases routinely shrink data to a fraction of its original size, which means less storage cost and even faster queries, since there's physically less to read.

So why not just use columnar storage for everything? Because it pays for those analytical strengths with transactional weaknesses. Writing a single new record to a columnar database means breaking that record apart and inserting each value into a different column block scattered across storage. That's slow and awkward compared to the single clean write a row store handles with ease. Updating one record has the same problem. Columnar storage is built for reading and aggregating enormous volumes, not for the constant small writes and updates that operational systems generate all day.

This is the tradeoff in a single sentence: row storage is optimized for writing and retrieving whole records one at a time, columnar storage is optimized for reading and summarizing specific columns across many records. Neither is better. They're built for opposite jobs.

And this is precisely why the two major families of database systems exist. Transactional systems, the ones running operational workloads like orders and accounts, are typically row-oriented, because their work is full of small writes and single-record lookups. Analytical systems, the data warehouses and analytics platforms where reporting and business intelligence happen, are typically columnar, because their work is full of large scans over a few columns. The familiar distinction between systems built for transactions and systems built for analysis rests, at the physical level, on this choice about how to lay the data down in storage.

It's a useful thing to understand even if you never configure a database yourself. When someone explains that a particular system is "great for transactions but slow for analytics," or the reverse, they're describing the downstream consequence of a decision made far below the surface, in how the bytes were arranged in the first place. The grid on your screen looks the same either way. Underneath, two databases storing that identical grid can be built for entirely different purposes.

Data 101

Columnar vs. Row Storage: Why the Same Data Is Stored Two Completely Different Ways

TDWI

Engage

Research