TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

00 Days

00 Hrs

00 Min

00 Sec

Sharding: How a Database Splits Itself to Handle More Than One Machine Can

A single database server can hold a lot and handle a lot, but not an unlimited amount. There comes a point, for a growing system, where the data is too large to fit on one machine, or the requests are coming too fast for one machine to answer, or both. At that point you've outgrown what a single server can do, and you face a choice about how to grow past it.

One answer is to buy a bigger machine. That works for a while, but bigger machines get expensive fast and eventually run out of "bigger" to buy. The other answer is to spread the database across many machines, so that the combined fleet can hold and handle far more than any single server could. That spreading is called sharding, and while the idea sounds simple, the details of how you split things up are where all the difficulty lives.

Start with the basic idea. Sharding means breaking a database into pieces, called shards, and putting each piece on a different machine. Each shard holds a portion of the total data, and together the shards make up the whole. No single machine holds everything; each holds its slice. When the system needs a particular piece of data, it goes to the shard that holds it. When it needs to add data, the new data goes to the appropriate shard. The total capacity of the system becomes the sum of all the machines, rather than the limit of any one of them, and you can grow by adding more machines rather than buying a bigger one.

This is fundamentally different from simply copying the database onto several machines. Copies, where every machine holds the same complete data, help you handle more reads and survive failures, but they don't help with size, because each copy still has to hold everything. Sharding actually divides the data, so each machine holds only a fraction, which is what lets the total grow beyond what one machine could store. The two techniques solve different problems and are often used together, but sharding is specifically the one that addresses being too big for a single machine.

The entire challenge of sharding comes down to one decision: how do you split the data? You need a rule that determines, for any given piece of data, which shard it belongs to. This rule is built around what's called a shard key, some attribute of the data that decides its home. You might shard customers by their customer ID, or by geographic region, or by some other attribute. Whatever you choose, the shard key is what the system uses to figure out where each piece of data goes and, later, where to find it again.

Choosing this key well is the part that separates a sharding scheme that works from one that causes endless trouble, and the reason is that a bad key creates imbalance. The goal is for the data and the workload to spread evenly across the shards, so each machine carries a roughly equal share. If the key spreads things evenly, every machine does a fair portion of the work and the system scales smoothly. If the key spreads things unevenly, some shards end up holding far more data or fielding far more requests than others, and those overloaded shards become bottlenecks while others sit nearly idle. You've split the database, but you haven't actually balanced the load, which defeats much of the point.

An overloaded shard is sometimes called a hot shard, and hot shards are the classic failure of a poorly chosen shard key. Suppose you shard by region and one region has ten times the customers of any other. That region's shard is now doing vastly more work than the rest, and it becomes the limit of the whole system, the single overworked machine you were trying to get away from, recreated inside your sharded design. Or suppose you shard on a key where one particular value is enormously more common than others; all that data piles onto one shard. The art of choosing a shard key is largely the art of avoiding these lopsided outcomes, picking something that divides the data and the activity evenly rather than concentrating it.

Sharding also introduces costs that a single-machine database doesn't have, and being honest about them explains why people don't shard until they need to. The most significant is that queries spanning multiple shards become harder. As long as a query needs data from just one shard, the system goes to that shard and answers it easily. But a query that needs to gather and combine data from many shards has to reach out to all of them and assemble the pieces, which is slower and more complex than doing the same thing within a single database. Operations that were trivial on one machine, like combining records that now live on different shards, become genuinely awkward once the data is split. A well-designed shard key tries to keep related data that's queried together on the same shard, precisely to minimize these expensive cross-shard operations, but it can't eliminate them entirely.

There's also the matter of what happens when the data grows further or shifts, because sharding isn't a one-time act. As data accumulates, shards can become unbalanced over time even if they started even, and the system may need to redistribute data, splitting shards or moving data between them. Doing this while the system keeps running is delicate work, since you're rearranging where data lives without losing track of any of it or interrupting the requests that keep arriving. Managing the ongoing balance of a sharded system is a real operational responsibility, not something you set up once and forget.

Given all this added complexity, the practical wisdom is to shard when you need to and not before. A single machine, possibly with some copies for resilience, is simpler and entirely sufficient for a great many systems, and that simplicity is worth keeping as long as one machine can do the job. Sharding is what you reach for when you've genuinely outgrown a single machine, when the data is too large or the traffic too heavy for any one server to handle. At that point the complexity is worth accepting, because the alternative is a hard ceiling on how big you can get. The skill is in recognizing when you've actually hit that point, rather than taking on the costs of sharding before the benefits are real.

The deeper idea sharding illustrates is how systems scale by division. When one of something can't keep up, you move to many, and the central problem becomes how to divide the work among them so that the many actually function as a coordinated whole rather than a collection of imbalanced parts. The splitting is the easy half. Deciding where everything goes, so that the load lands evenly and related things stay together, is the half that takes judgment, and it's what determines whether a sharded database delivers the scale it promised or just spreads the original bottleneck across more machines.

Data 101

Sharding: How a Database Splits Itself to Handle More Than One Machine Can

TDWI

Engage

Research