TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

00 Days

00 Hrs

00 Min

00 Sec

Database Normalization: The Concept That Shapes Relational Design

When you design a relational database, you're making decisions about how to organize data into tables and how those tables relate to each other. Normalization is the theoretical framework that guides those decisions. It was developed by Edgar Codd in the early 1970s alongside the relational model itself, and it remains the basis of how relational schemas get evaluated and critiqued today.

The goal is to eliminate redundancy and the problems that redundancy causes.

That sounds abstract, so it's worth being concrete about what those problems actually are before getting into the forms themselves.

Imagine a table that stores customer orders, and instead of having a separate customers table, it stores the customer's name and address directly in every order row. Now that customer moves. To update their address you have to find every order they've ever placed and update it in each row. Miss one and you have inconsistent data. Delete their last order and you've lost their address entirely. Insert a new order before you have a complete address and you have a row with missing required information. These are called update anomalies, deletion anomalies, and insertion anomalies respectively, and they are what normalization is designed to prevent.

Normalization achieves this through a series of rules called normal forms, each building on the last. You don't have to memorize all of them, but understanding the first three is genuinely useful for anyone working with relational data.

First Normal Form, 1NF, is the baseline. A table is in first normal form if every column contains atomic values, meaning values that can't be meaningfully broken down further, and every row is unique. A column that stores a comma-separated list of product tags violates 1NF because the value isn't atomic. A table with no primary key violates 1NF because rows aren't uniquely identifiable. Most modern databases enforce parts of this implicitly, but violating it in subtler ways, storing structured data as a string, for instance, is still common and still causes problems.

Second Normal Form, 2NF, applies to tables with composite primary keys, meaning primary keys made up of more than one column. A table is in second normal form if it's in 1NF and every non-key column is fully dependent on the entire primary key, not just part of it. If you have an order_items table with a composite key of order_id and product_id, and you also store the product name in that table, you have a partial dependency: the product name depends only on product_id, not on the combination of order_id and product_id. The fix is to move product name to a separate products table where it belongs. In practice, 2NF violations most commonly appear when people add columns to junction tables without thinking carefully about what those columns actually depend on.

Third Normal Form, 3NF, goes further. A table is in third normal form if it's in 2NF and no non-key column depends on another non-key column. This is called a transitive dependency. If an employees table stores both department_id and department_name, the department_name depends on department_id rather than directly on the employee's primary key. That's a transitive dependency, and it means you're storing department_name in two places once you have more than one employee per department. The fix, again, is a separate departments table.

Beyond 3NF, the normal forms continue: Boyce-Codd Normal Form is a stricter version of 3NF that handles some edge cases 3NF misses. Fourth Normal Form addresses multi-valued dependencies. Fifth Normal Form addresses join dependencies. These higher forms matter in specific contexts, particularly in academic database theory and in complex schemas with unusual dependency structures, but most practitioners work primarily within the first three.

It's also worth knowing that normalization is not always the right answer. Analytical databases and data warehouses frequently denormalize deliberately, combining data that normalization theory would keep separate in order to reduce the number of joins required for analytical queries. A highly normalized schema is efficient for writes and good at preventing anomalies. It can be slow for reads that need to assemble information from many tables. Dimensional modeling, the dominant approach in data warehousing, is explicitly denormalized for this reason.

Understanding normalization doesn't mean applying it rigidly in every situation. It means knowing what the rules are, why they exist, and what you're trading away when you choose to break them. That judgment, knowing when the theory applies and when the practical constraints of your situation point somewhere else, is what separates schema design from schema copying.

Data 101

Database Normalization: The Concept That Shapes Relational Design

TDWI

Engage

Research