TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

00 Days

00 Hrs

00 Min

00 Sec

Grain: The First Decision in Dimensional Modeling

Dimensional modeling has a lot of moving parts. Fact tables, dimension tables, surrogate keys, slowly changing dimensions, conformed dimensions. It's easy to get drawn into the details of any one of these before you've answered the question that should come first.

That question is: what is the grain?

Grain is the precise definition of what one row in a fact table represents. Not a vague description of the subject area, not a general statement about what the table is "about," but a specific, unambiguous statement that a developer or analyst could use to determine whether a given event belongs in the table or not.

The difference between a vague grain and a precise one matters more than it might seem. "One row per sale" is not a grain definition. Does that mean one row per order? One row per line item within an order? One row per shipment? One row per payment? These are four different grains, and they produce four different fact tables with different measures, different dimensions, and different analytical capabilities. Choosing between them is not a technical decision. It's a business decision about what level of detail the warehouse needs to support.

A precise grain statement looks something like this: one row per line item per order. Or: one row per customer per day, representing that customer's aggregate activity on that day. Or: one row per insurance claim payment transaction. Each of these statements is specific enough that you can look at any candidate row and determine whether it belongs.

Defining grain before anything else is the right sequence because grain determines everything downstream. The grain determines which dimensions are possible. If your grain is one row per order header, you can have a customer dimension and a date dimension, but you can't have a product dimension at the order level because a single order typically contains multiple products. To support a product dimension, you'd need to go to the line item grain. Trying to attach dimensions that are finer than the grain produces either incorrect results or a fact table that violates its own definition.

Grain also determines which measures make sense. At the order line item grain, quantity and line item revenue are natural measures. At the order header grain, total order value and number of line items are natural measures. Mixing measures from different grains in the same fact table is a common mistake that produces a table where some rows mean one thing and other rows mean something else, which makes every query against that table a potential source of error.

There is a recurring temptation in dimensional modeling to choose a coarser grain than the data actually supports, usually in the name of simplicity or storage efficiency. A table summarized to the daily level is smaller and simpler than one at the transaction level. But a coarser grain destroys detail that can never be recovered. If you later need to analyze behavior at the transaction level, or filter on a dimension that varies within a day, a daily summary table can't answer those questions. Choosing the finest grain the source data supports, and that the business might plausibly need, is generally the right default. Storage is cheap. Lost granularity is permanent.

The converse mistake is choosing a grain that's finer than the data actually supports, producing a table with many rows that all represent the same event at a level of detail that doesn't exist in the source. This creates artificial precision and can produce double-counting in aggregations if the grain isn't clearly understood by everyone writing queries against the table.

In practice, a data warehouse often contains multiple fact tables at different grains covering the same subject area. A transaction-level sales fact table sits alongside a daily summary table, each serving different analytical purposes. This is normal and expected. What matters is that each table has a precisely defined and consistently applied grain, documented clearly enough that anyone querying it understands what one row means before they start writing SQL.

Ralph Kimball, whose dimensional modeling methodology this concept comes from, is emphatic that grain declaration is the first step in designing any fact table, completed before dimensions are chosen and before measures are defined. That sequencing reflects a deeper point: a fact table without a clearly defined grain is not just poorly designed. It's undefined. You can't reason correctly about what it contains, and you can't trust the results it produces.

Data 101

Grain: The First Decision in Dimensional Modeling

TDWI

Engage

Research