TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

Data Temperature: Why Some Data Lives on Fast Storage and Some on Cheap

Storage costs money, and not all storage costs the same. Fast storage that returns data in an instant is expensive. Slow storage that takes its time is cheap. This simple fact creates a tension at the heart of every large data system: you want everything to be fast, but you can't afford to put everything on fast storage. The way out of that tension is to recognize that you don't need everything to be fast, because you don't use all your data equally.

Some data gets accessed constantly. Some gets accessed once in a blue moon. Treating those two kinds the same way, putting both on the same expensive fast storage, is a waste, because the rarely-touched data is paying premium prices for speed it almost never uses. Sorting data by how often it's actually accessed, and matching each kind to appropriately priced storage, is the idea behind data temperature.

The temperature metaphor is intuitive once you hear it. Hot data is data that's accessed frequently, right now, all the time. Cold data is data that's rarely touched, sitting quietly, accessed once in a great while if ever. Warm data sits in between, used sometimes but not constantly. The temperature describes how active the data is, how often something reaches for it, and that activity level is what determines where it ought to live.
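One way to make the metaphor concrete is to label data by how often it is read. The sketch below is illustrative only: the `classify` function, the `Temperature` names, and both thresholds are assumptions made for this example, not standard values, and a real system would choose cutoffs from its own workload.

```python
from enum import Enum


class Temperature(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


# Hypothetical thresholds, in average reads per day. Tune to your own workload.
HOT_MIN_READS_PER_DAY = 100.0
WARM_MIN_READS_PER_DAY = 1.0


def classify(reads_in_window: int, window_days: int) -> Temperature:
    """Label a dataset by its average daily read rate over a trailing window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    reads_per_day = reads_in_window / window_days
    if reads_per_day >= HOT_MIN_READS_PER_DAY:
        return Temperature.HOT
    if reads_per_day >= WARM_MIN_READS_PER_DAY:
        return Temperature.WARM
    return Temperature.COLD


# Made-up read counts over a 30-day window.
print(classify(reads_in_window=50_000, window_days=30))  # a product catalog
print(classify(reads_in_window=90, window_days=30))      # last quarter's figures
print(classify(reads_in_window=2, window_days=30))       # seven-year-old records
```

With these assumed thresholds, the three examples come out hot, warm, and cold respectively. The point is only that temperature is a property measured from access activity, not from what the data is.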

Consider what hot data looks like in practice. It's the current week's transactions that reports run against every few minutes. It's the active user's session data, the product catalog the website hits on every page load, the recent records that dashboards refresh from constantly. This data needs to be fast, because slowness here is felt immediately by users and systems waiting on it. Hot data earns its place on expensive, high-speed storage, because the speed is genuinely being used, over and over, all day long.

Cold data is the opposite. It's the transaction records from seven years ago, kept because regulations require it but almost never queried. It's old logs, archived projects, historical data retained for the rare audit or the occasional deep look back. This data has to exist, often for legal reasons, but it doesn't have to be fast, because almost nothing ever asks for it, and on the rare occasion something does, waiting a bit is perfectly acceptable. Putting cold data on expensive fast storage would be like renting a downtown parking spot for a car you drive once a year. It works, but you're paying enormously for convenience you don't use.

So cold data goes on cheap, slow storage. The tradeoff is explicit and sensible: you accept that retrieving it will be slow, because slow retrieval of something you almost never retrieve costs you almost nothing in practice, while the storage savings are large and constant. Warm data lands on a middle tier, reasonably quick and reasonably affordable, matching its in-between usage. The general principle is to align the cost of the storage with the value the speed actually provides, paying for fast access only where fast access is genuinely used.

The practice of organizing storage this way is called tiering, and the tiers form a hierarchy from fast-and-expensive at the top to slow-and-cheap at the bottom. A well-run system places data on whichever tier matches its temperature, and the savings can be substantial, because in most organizations the great majority of data is cold. Only a small slice is genuinely hot. If all of it were kept on fast storage, most of the budget would be spent keeping rarely-touched data needlessly fast. Tiering directs the expensive storage to the small hot slice that actually benefits and lets the large cold remainder rest cheaply.
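The size of the saving is easy to see with rough arithmetic. The sketch below compares keeping everything on the fast tier with spreading it across three tiers. Every number in it is an assumption chosen for illustration: the per-terabyte prices are not vendor quotes, and the 50/150/800 TB split simply stands in for "mostly cold". It also ignores any retrieval or migration charges a real cold tier might carry.

```python
# Hypothetical monthly prices per terabyte. Real prices vary by vendor and tier.
PRICE_PER_TB = {"hot": 100.0, "warm": 40.0, "cold": 5.0}

# Hypothetical 1,000 TB estate in which most of the data is cold.
DATA_TB = {"hot": 50.0, "warm": 150.0, "cold": 800.0}


def uniform_cost(data_tb, price_per_tb, tier="hot"):
    """Monthly cost if every terabyte sits on a single tier."""
    return sum(data_tb.values()) * price_per_tb[tier]


def tiered_cost(data_tb, price_per_tb):
    """Monthly cost if each terabyte sits on the tier matching its temperature."""
    return sum(tb * price_per_tb[temp] for temp, tb in data_tb.items())


all_fast = uniform_cost(DATA_TB, PRICE_PER_TB)
tiered = tiered_cost(DATA_TB, PRICE_PER_TB)
saving = all_fast - tiered

print(f"all on fast storage: {all_fast:,.0f} per month")
print(f"tiered:              {tiered:,.0f} per month")
print(f"saving:              {saving:,.0f} ({saving / all_fast:.0%})")
```

Under these made-up figures the all-fast bill is 100,000 per month and the tiered bill is 15,000, a saving of 85 percent. The exact percentage depends entirely on the assumed prices and mix, but the shape of the result holds whenever the cold share is large and the price gap between tiers is wide.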

What makes this more than a one-time sorting exercise is that temperature changes over time, usually in a predictable direction. Today's hot data, such as this week's transactions, will be warm next month and cold next year. Data tends to cool as it ages: fresh and frequently accessed at first, then gradually less so, until it settles into the rarely-touched archive. A system that sorted data by temperature once and never revisited it would slowly fill its expensive fast tier with data that has gone cold but never moved. So mature systems migrate data between tiers as it cools, shifting it down the hierarchy as its access frequency drops, often automatically according to rules about age and usage.

That automatic migration is where data temperature becomes a managed lifecycle rather than a static arrangement. Policies can specify that data older than a certain age, or untouched for a certain period, moves down a tier. New hot data lands at the top, ages through the middle, and eventually comes to rest at the bottom, all without anyone manually shuffling it. The system continuously keeps its storage spending aligned with actual usage, reclaiming expensive capacity from data that no longer needs it and handing that capacity to data that does.
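A policy of this kind can be sketched as a small set of demotion rules evaluated on a schedule. The code below is a minimal illustration, not any particular product's lifecycle engine: the `Dataset` and `DemotionRule` types, the tier names, and the day thresholds are all hypothetical. Each pass moves a dataset down at most one tier, and only when it is older than the rule's age limit or has gone untouched longer than the idle limit.

```python
from dataclasses import dataclass
from datetime import date

TIERS = ["hot", "warm", "cold"]  # ordered fast-and-expensive to slow-and-cheap


@dataclass
class Dataset:
    name: str
    tier: str
    created: date
    last_accessed: date


@dataclass(frozen=True)
class DemotionRule:
    """Move data off `from_tier` once it is old enough or idle long enough."""
    from_tier: str
    max_age_days: int
    max_idle_days: int


# Hypothetical thresholds, chosen only for illustration.
RULES = [
    DemotionRule("hot", max_age_days=30, max_idle_days=7),
    DemotionRule("warm", max_age_days=365, max_idle_days=90),
]


def next_tier(ds: Dataset, today: date, rules=RULES) -> str:
    """Return the tier the dataset should occupy after one policy pass."""
    age = (today - ds.created).days
    idle = (today - ds.last_accessed).days
    for rule in rules:
        too_old_or_idle = age > rule.max_age_days or idle > rule.max_idle_days
        if ds.tier == rule.from_tier and too_old_or_idle:
            lower = min(TIERS.index(ds.tier) + 1, len(TIERS) - 1)
            return TIERS[lower]
    return ds.tier  # no rule fired, so the data stays where it is


today = date(2024, 6, 1)
datasets = [
    Dataset("this_week_txns", "hot", date(2024, 5, 28), date(2024, 6, 1)),
    Dataset("april_txns", "hot", date(2024, 4, 1), date(2024, 5, 30)),
    Dataset("logs_2022", "warm", date(2022, 3, 1), date(2023, 1, 15)),
]
plan = {ds.name: next_tier(ds, today) for ds in datasets}
print(plan)
```

In this example the current week's transactions stay hot, April's transactions drop to warm because they have passed the assumed 30-day age limit, and the 2022 logs drop to cold. Running the same pass on a schedule is what turns a one-time sort into a lifecycle.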

There's a subtlety worth naming, which is that getting the temperature judgment wrong has real costs in both directions. Misjudge hot data as cold and push it to slow storage, and the things that need to be fast become painfully slow, frustrating users and bottlenecking systems. Misjudge cold data as hot and keep it on fast storage, and you quietly bleed money on speed nobody uses. Good tiering depends on understanding the access patterns honestly, on knowing which data is genuinely hot rather than assuming, because the assumption is often wrong. Data that people are sure is accessed constantly sometimes turns out to be touched rarely, and vice versa. Only looking at the actual usage settles it.
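Checking assumptions against real usage can be as simple as counting reads in an access log and comparing the result with what people believe. The sketch below assumes a hypothetical log with one entry per read and a hypothetical table of assumed labels. The dataset names, the read counts, and the cutoffs in `measured_label` are all invented for the example.

```python
from collections import Counter

# Hypothetical access log for one month: one entry per read, keyed by dataset.
ACCESS_LOG = ["catalog"] * 400 + ["weekly_sales"] * 12 + ["legacy_crm"] * 1

# What the team believes about each dataset (also hypothetical).
ASSUMED = {
    "catalog": "hot",
    "weekly_sales": "warm",
    "legacy_crm": "hot",   # "everyone uses this constantly"
    "audit_2017": "cold",
}


def measured_label(reads: int, hot_min: int = 100, warm_min: int = 5) -> str:
    """Label by observed reads in the log window. Cutoffs are illustrative."""
    if reads >= hot_min:
        return "hot"
    if reads >= warm_min:
        return "warm"
    return "cold"


def find_mismatches(assumed, access_log):
    """Return {dataset: (assumed, measured)} wherever the two labels disagree."""
    reads = Counter(access_log)
    mismatches = {}
    for name, assumed_label in assumed.items():
        actual = measured_label(reads[name])  # Counter gives 0 if never read
        if actual != assumed_label:
            mismatches[name] = (assumed_label, actual)
    return mismatches


print(find_mismatches(ASSUMED, ACCESS_LOG))
```

Here the only disagreement is `legacy_crm`, which the team assumed was hot but which was read once in the window. A report like this is a starting point for a conversation rather than an automatic verdict, since a short log window can miss seasonal or end-of-quarter access.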

The broader lesson is one of matching resources to need, a theme that runs through cost-conscious system design generally. Uniform treatment is simple but wasteful: it forces you to provision for the most demanding case everywhere, even where the demand isn't there. Sorting by temperature is the recognition that data is not monolithic in how it's used, and that storage spending should follow usage rather than ignore it. The fast storage goes where the speed is used; the cheap storage holds everything else. It's an unglamorous discipline, invisible to anyone using the system, but it's one of the steady, quiet ways that large data operations keep from spending far more than they need to.
