TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

00 Days

00 Hrs

00 Min

00 Sec

Why Database Queries Get Fast: The Case for Indexing

Imagine a library with a million books and no catalog. You need to find every book published in 1987. Your only option is to walk through every shelf, check every book, and pull the ones that match. That's what a database does when it has no index on the column you're querying: it reads every row in the table, checks whether it matches your condition, and returns the ones that do. In database terminology, this is called a full table scan, and at sufficient scale it's what turns a query that should take milliseconds into one that takes minutes.

An index is the catalog that makes the search fast.

The most common type of database index is a B-tree index, named for the balanced tree data structure it's built on. When you create an index on a column, the database builds a separate data structure that stores the values in that column in sorted order, along with pointers to the actual rows in the table where those values appear. When a query filters on that column, the database uses the index to find matching values quickly through binary search rather than scanning every row. For a table with a million rows, the difference between a full scan and an index lookup is roughly the difference between reading a million values and reading about twenty.

The sorted order is what makes the B-tree index effective for range queries too, not just equality lookups. A query asking for all rows where a date column falls between two values can use an index on that column to find the start of the range and read forward from there, rather than checking every row in the table. Without the index, the database has no choice but to examine everything.

Primary key indexes are automatic in most databases. When you declare a primary key on a table, the database creates an index on that column without being asked, because primary keys are used for lookups constantly and the performance cost of not indexing them would be immediately apparent. Secondary indexes, on other columns, have to be created explicitly, which means someone has to decide which columns deserve them.

That decision involves tradeoffs that aren't always obvious. Indexes make reads faster. They make writes slower. Every time a row is inserted, updated, or deleted, the database has to update not just the table but every index on that table. A table with ten indexes on it takes roughly ten times as much work to write to as a table with no indexes. For a table that gets written to millions of times a day, that overhead matters. For a table that gets read millions of times a day and written to rarely, indexes are straightforwardly beneficial. The right answer depends on the read-write ratio of the workload, which is why indexing decisions require understanding how the table will actually be used.

Index selectivity is the property that determines how useful an index actually is. A highly selective index is one where the indexed column has many distinct values relative to the number of rows, so that filtering on it significantly narrows down the result set. An index on a user ID column in a table with millions of users is highly selective: filtering for a specific user ID returns one row out of millions. An index on a boolean column that's true for half the rows is nearly useless: filtering for true still returns half the table, and the database might decide a full scan is faster than using the index at all.

Composite indexes, built on multiple columns together, are useful when queries frequently filter on a combination of columns. An index on last_name alone helps queries that filter by last name. An index on last_name and first_name together helps queries that filter by both, and the column order matters: the index is most effective when the query filters on the leftmost columns first. Understanding this leftmost prefix rule explains a lot of seemingly arbitrary database query behavior that puzzles developers who haven't encountered it before.

For data practitioners working with databases at scale, indexing is one of the first places to look when queries are slow. The presence of a missing index is often the explanation, and adding the right index can produce order-of-magnitude improvements in query performance without changing any application code. The absence of an obviously needed index is usually an oversight. The presence of too many indexes on a write-heavy table is an equally real problem that shows up as write performance degradation rather than read slowness. Getting indexing right is less about following rigid rules and more about understanding the workload and designing indexes to serve it.

Data 101

Why Database Queries Get Fast: The Case for Indexing

TDWI

Engage

Research