TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

00 Days

00 Hrs

00 Min

00 Sec

Surrogate Keys 101: What You Need To Know

When you design a database table, one of the first decisions you make is how to uniquely identify each row. The obvious answer is to use something that already exists in the data: a Social Security number for people, an order number for orders, a product code for products. These are called natural keys, and they have the appeal of being meaningful. They connect the database record to something recognizable in the real world. They're also, in many cases, a source of problems that become apparent only after the data model is in production and difficult to change.

A surrogate key is an alternative: an identifier with no meaning outside the database, generated purely to serve as a unique row identifier. Typically an auto-incrementing integer or a UUID, it exists for one purpose and one purpose only. Understanding when and why to use one, and what the tradeoffs are, is a foundational skill in database and data warehouse design.

The case for surrogate keys starts with the limitations of natural keys. Natural keys assume that the real-world identifier is stable, unique, and always present. In practice, those assumptions break more often than you'd expect. Business rules change. A product code format that made sense when a catalog had five hundred items becomes ambiguous when it has fifty thousand. A customer identifier that was unique within one system stops being unique after a merger with another company that used overlapping identifiers. An employee ID gets reassigned after someone leaves, creating a collision with historical records. Each of these situations creates a problem in a schema built around a natural key that a surrogate key simply avoids, because a surrogate key has no meaning to violate.

There are also performance considerations. Natural keys are often strings, and string comparisons are more expensive than integer comparisons. When a key is used in joins across large tables, that difference accumulates. Integer surrogate keys, particularly sequential ones, also tend to behave better with B-tree indexes, which is the index structure most relational databases use by default. Sequential integers create index pages that fill predictably and require less reorganization than random or non-sequential values.

UUIDs deserve their own note here because they're widely used as surrogate keys, particularly in distributed systems where you can't rely on a central sequence generator to produce unique values. A UUID can be generated independently on any machine without coordination, which makes it attractive for distributed architectures. The tradeoff is that UUIDs are large, random, and create fragmented indexes that perform worse than sequential integer keys at scale. Newer UUID variants, particularly UUID v7, address the randomness problem by incorporating a timestamp component that makes the values roughly sequential, giving you the distributed generation benefit without the index fragmentation penalty.

In data warehousing specifically, surrogate keys are standard practice for a reason that goes beyond the arguments above. Dimension tables in a data warehouse often need to track historical changes to the entities they describe, a pattern called a slowly changing dimension. The canonical implementation of this pattern, Type 2, works by creating a new row for each version of a dimension record rather than overwriting the existing one. That means the same real-world entity, a customer, a product, an employee, can have multiple rows in the dimension table, one for each version of their attributes over time. A natural key can't distinguish between these rows because they all represent the same real-world entity. A surrogate key can, because each row gets its own generated identifier regardless of what natural key it corresponds to. This is one of the main reasons surrogate keys are treated as non-negotiable in dimensional modeling.

The argument against surrogate keys, where one exists, is that they add a layer of indirection that can make data harder to inspect and debug. When you're looking at a fact table full of integer keys with no obvious connection to the real world, understanding what any given row represents requires joining to dimension tables, which adds friction. Some teams address this by including the natural key as a non-primary attribute alongside the surrogate key, giving you the lookup stability of the surrogate and the human readability of the natural key in the same table. This is generally considered good practice in data warehouse design, where both auditability and model stability matter.

The decision between natural and surrogate keys is not always obvious in the moment, but it tends to become obvious in retrospect. Schemas built on natural keys often work well until the first time the underlying business reality changes in a way the key wasn't designed to accommodate. At that point, migrating to a surrogate key is possible but painful. Understanding the tradeoffs before you start building is considerably cheaper than discovering them after the fact.

Data 101

Surrogate Keys 101: What You Need To Know

TDWI

Engage

Research