Compaction: The Background Cleanup That Keeps a Lakehouse Fast
A data lakehouse table in active use has a tendency to get messy. Not in any way a user would see, but underneath, in how the data is physically organized into files. Data streams in and creates a litter of small files. Updates leave behind piles of pending changes. Old versions of the table accumulate. None of this is visible from the outside, but all of it slows the table down, and left unattended, a once-fast table becomes sluggish for reasons that have nothing to do with how much data it actually holds.
Compaction is the housekeeping that fixes this. It's a maintenance process that periodically reorganizes a table's files into a cleaner, more efficient state, and it's one of the most important and least glamorous parts of keeping a lakehouse healthy. The data doesn't change. What changes is how it's packaged, and that packaging is what determines whether the table is fast or slow.
The most familiar job compaction does is consolidating small files. A lakehouse fed by frequent writes (streaming ingestion, regular small updates) naturally produces many tiny files, one or a few per write, which pile up over time into thousands or millions. Because reading each file carries a fixed overhead regardless of its size, a table fragmented into countless tiny files spends most of its effort opening and closing files rather than reading data, and queries crawl. Compaction takes those many small files and rewrites their contents into a smaller number of larger files, typically each close to a configured target size. After it runs, a query reads a handful of large files instead of thousands of small ones, and the table is fast again.
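To make the consolidation step concrete, here is a minimal sketch of how a compactor might decide which small files to rewrite together. It is an illustration only: the `DataFile` type, the `plan_compaction` function, the first-fit-decreasing grouping, and the half-of-target cutoff for what counts as "small" are all assumptions made for this example, not the behavior of any particular table format.

```python
from dataclasses import dataclass
from typing import List, Optional

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Hypothetical description of one data file in a table."""
    path: str
    size_bytes: int


def plan_compaction(
    files: List[DataFile],
    target_bytes: int,
    small_threshold: Optional[int] = None,
) -> List[List[DataFile]]:
    """Group small files so each group rewrites into one file <= target_bytes."""
    if small_threshold is None:
        # Assumed cutoff: anything under half the target counts as "small".
        small_threshold = target_bytes // 2

    # Largest first, so big pieces are placed before the gaps are filled.
    small = sorted(
        (f for f in files if f.size_bytes < small_threshold),
        key=lambda f: f.size_bytes,
        reverse=True,
    )

    groups: List[List[DataFile]] = []
    totals: List[int] = []
    for f in small:
        for i, total in enumerate(totals):
            if total + f.size_bytes <= target_bytes:
                groups[i].append(f)
                totals[i] += f.size_bytes
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([f])
            totals.append(f.size_bytes)

    # Rewriting a single file on its own gains nothing, so drop such groups.
    return [g for g in groups if len(g) > 1]


files = [DataFile(f"part-{i}.parquet", 10 * MB) for i in range(10)]
files.append(DataFile("part-big.parquet", 120 * MB))
plan = plan_compaction(files, target_bytes=128 * MB)
```

In this example the ten 10 MB files land in a single group, to be rewritten as one file of about 100 MB, while the 120 MB file is already large enough and is left alone.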
A second job, in tables that handle updates lazily, is folding in accumulated changes. Some lakehouse tables record updates as small, separate change files rather than rewriting the data immediately, which keeps writes cheap but means every read has to merge those pending changes onto the base data. As the changes pile up, reads get slower, because each one has more deltas to reconcile. Compaction resolves this by merging the accumulated changes into the base files, producing clean files that already reflect every update. The pending changes are cleared, and reads no longer have to do the reconciliation work, because it's been done once, in bulk, by the compaction process.
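The difference between merging on every read and merging once during compaction can be sketched in a few lines. This is a toy model under stated assumptions: the base data is a dictionary keyed by row id, pending changes are an ordered list of `("upsert" | "delete", key, value)` tuples, and the function names are hypothetical. Real table formats store changes in files and have richer semantics.

```python
from typing import Any, Dict, List, Optional, Tuple

Change = Tuple[str, int, Optional[Any]]  # (operation, row key, new value)


def apply_changes(base: Dict[int, Any], changes: List[Change]) -> Dict[int, Any]:
    """Return a new mapping with the changes applied in order."""
    merged = dict(base)
    for op, key, value in changes:
        if op == "upsert":
            merged[key] = value
        elif op == "delete":
            merged.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return merged


def read_table(base: Dict[int, Any], pending: List[Change]) -> Dict[int, Any]:
    """Merge-on-read: every read pays to reconcile all pending changes."""
    return apply_changes(base, pending)


def compact(
    base: Dict[int, Any], pending: List[Change]
) -> Tuple[Dict[int, Any], List[Change]]:
    """Fold the pending changes into the base once, leaving nothing pending."""
    return apply_changes(base, pending), []


base = {1: "a", 2: "b"}
pending = [("upsert", 2, "B"), ("delete", 1, None), ("upsert", 3, "c")]

before = read_table(base, pending)
new_base, new_pending = compact(base, pending)
after = read_table(new_base, new_pending)
```

Both reads return the same rows, which is the point: compaction changes how much work a read has to do, not what the read returns.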
A third job is cleaning up obsolete versions. Because a lakehouse preserves old versions of a table rather than overwriting data in place, the files behind those old versions accumulate, especially in a frequently changed table. Keeping every version forever would mean unbounded storage growth, so a cleanup process removes versions older than a configured retention window, deleting the data files that belonged only to those expired versions. This reclaims storage and keeps the metadata that tracks all these files from becoming unwieldy. The same general maintenance machinery that consolidates files and folds in changes also retires the history that's no longer needed.
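The rule "delete only the files that belonged solely to expired versions" is easy to get wrong, so a small sketch may help. It assumes each version is recorded as a snapshot listing the files it references; the `Snapshot` type, the `expire_snapshots` function, and the choice to always keep the newest snapshot are assumptions for illustration, not a specific system's API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one table version and the files it references."""
    snapshot_id: int
    committed_at: float  # seconds since some epoch
    files: FrozenSet[str]


def expire_snapshots(
    snapshots: List[Snapshot], now: float, retention_seconds: float
) -> Tuple[List[Snapshot], Set[str]]:
    """Return (retained snapshots, files that are safe to delete)."""
    if not snapshots:
        return [], set()

    cutoff = now - retention_seconds
    # Assumption: the current version is kept even if it is older than the window.
    latest = max(snapshots, key=lambda s: s.committed_at)
    retained = [
        s for s in snapshots
        if s.committed_at >= cutoff or s.snapshot_id == latest.snapshot_id
    ]
    retained_ids = {s.snapshot_id for s in retained}
    expired = [s for s in snapshots if s.snapshot_id not in retained_ids]

    # A file is deletable only if no retained snapshot still references it.
    live: Set[str] = set().union(*(s.files for s in retained))
    candidates: Set[str] = set().union(*(s.files for s in expired))
    return retained, candidates - live


history = [
    Snapshot(1, 0.0, frozenset({"a", "b"})),
    Snapshot(2, 100.0, frozenset({"b", "c"})),
    Snapshot(3, 200.0, frozenset({"c", "d"})),
]
retained, deletable = expire_snapshots(history, now=250.0, retention_seconds=100.0)
```

Here snapshots 1 and 2 fall outside the window, but file `c` survives because the retained snapshot 3 still uses it; only `a` and `b` are reclaimed.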
What ties these jobs together is a single observation: the byproducts of normal use degrade a table's physical organization over time, and compaction is how that organization gets restored. Every useful thing a lakehouse does (ingesting fresh data, accepting updates, preserving history) creates some form of clutter as a side effect. The table works correctly throughout, but it works less and less efficiently as the clutter builds. Compaction is the counterforce that periodically resets it to a clean, efficient state.
Because this clutter accumulates continuously in an active table, compaction isn't a one-time fix but an ongoing necessity. A table fed by constant ingestion is always producing new small files; a frequently updated table is always accumulating new pending changes. So compaction has to run regularly, and in modern lakehouse systems it increasingly runs automatically, as a background process that monitors tables and compacts them when fragmentation or accumulated changes cross some threshold. The goal is for the cleanup to keep pace with the mess, so the table never drifts far from its efficient state.
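A background monitor of this kind boils down to comparing a table's current state against a policy. The sketch below shows one plausible shape for that check; `TableStats`, `CompactionPolicy`, `compaction_reasons`, and the threshold values are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List

MB = 1024 * 1024


@dataclass(frozen=True)
class TableStats:
    """Hypothetical summary a monitor might collect for one table."""
    file_sizes: List[int]          # size in bytes of each data file
    pending_change_files: int      # change files not yet folded into the base


@dataclass(frozen=True)
class CompactionPolicy:
    """Hypothetical thresholds; real values depend on the workload."""
    small_file_bytes: int
    max_small_files: int
    max_pending_change_files: int


def compaction_reasons(stats: TableStats, policy: CompactionPolicy) -> List[str]:
    """Return the reasons compaction is due; an empty list means leave it alone."""
    reasons: List[str] = []
    small_files = sum(1 for size in stats.file_sizes if size < policy.small_file_bytes)
    if small_files > policy.max_small_files:
        reasons.append("small_files")
    if stats.pending_change_files > policy.max_pending_change_files:
        reasons.append("pending_changes")
    return reasons


policy = CompactionPolicy(
    small_file_bytes=64 * MB,
    max_small_files=100,
    max_pending_change_files=20,
)
fragmented = TableStats(file_sizes=[1 * MB] * 500, pending_change_files=0)
healthy = TableStats(file_sizes=[128 * MB] * 10, pending_change_files=3)
```

With these numbers, `compaction_reasons(fragmented, policy)` reports the small-file threshold as crossed, while the healthy table yields no reasons and is skipped.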
That automation matters because compaction itself isn't free, and this is the tension at the center of running it well. Rewriting files consumes computational resources and time. Running compaction too aggressively wastes effort reorganizing data that didn't really need it, and consumes resources that could be serving queries. Running it too rarely lets the clutter build until performance suffers. Good compaction management is a balancing act: doing enough cleanup to keep the table fast, but not so much that the cleanup itself becomes a burden. Tuning when and how often it runs, against the specific way a table is used, is part of operating a lakehouse well.
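One rough way to reason about that balance is a break-even estimate: a compaction run pays for itself once the per-query time it saves, summed over the queries that follow, exceeds the time spent rewriting. The model below is deliberately crude and every number in it is made up for illustration; it assumes the only saving is the fixed per-file overhead, and the function names are hypothetical.

```python
import math


def read_seconds(
    n_files: int,
    total_bytes: int,
    per_file_overhead_s: float,
    throughput_bytes_per_s: float,
) -> float:
    """Crude read cost: a fixed overhead per file plus time to scan the bytes."""
    return n_files * per_file_overhead_s + total_bytes / throughput_bytes_per_s


def breakeven_queries(
    files_before: int,
    files_after: int,
    per_file_overhead_s: float,
    rewrite_seconds: float,
) -> float:
    """Number of full-table reads after which the rewrite has paid for itself."""
    saved_per_query = (files_before - files_after) * per_file_overhead_s
    if saved_per_query <= 0:
        return math.inf  # compaction saves nothing, so it never pays off
    return math.ceil(rewrite_seconds / saved_per_query)


# Illustrative numbers only: 10,000 files compacted to 10, 10 ms overhead per
# file, and a rewrite that takes ten minutes.
queries_needed = breakeven_queries(
    files_before=10_000,
    files_after=10,
    per_file_overhead_s=0.01,
    rewrite_seconds=600.0,
)
```

Under these invented figures the rewrite pays off after a handful of full scans, which is why heavily read tables justify frequent compaction. A table that is rarely read, or barely fragmented, pushes the break-even point out and argues for running it less often.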
There's also a question of timing that good systems handle carefully, because compaction replaces the very files that queries are reading. Done naively, it could interfere with the queries running against the table. Lakehouse table formats are designed so that compaction can happen in the background without disrupting reads: compaction writes new files rather than modifying existing ones, queries continue to see a consistent version of the table while those new files are prepared, and only once they are ready does the table switch to them in a single atomic commit. The cleanup happens quietly alongside normal operation, rather than requiring the table to be taken offline.
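The idea that readers keep a consistent view while compaction swaps files in can be sketched with an in-memory stand-in for a table's metadata. This is a simplified model, not any format's actual commit protocol: the `Table` class and its methods are hypothetical, file lists are immutable tuples, and a rewrite is rejected whenever the table has changed since the compactor took its snapshot, which is stricter than real systems tend to be.

```python
import threading
from typing import Iterable, Set, Tuple


class Table:
    """Toy table metadata: a version number plus an immutable list of files."""

    def __init__(self, files: Iterable[str]) -> None:
        self._lock = threading.Lock()
        self._version = 0
        self._files: Tuple[str, ...] = tuple(files)

    def snapshot(self) -> Tuple[int, Tuple[str, ...]]:
        """Pin the current version; the returned tuple never changes."""
        with self._lock:
            return self._version, self._files

    def commit_rewrite(
        self, base_version: int, replaced: Set[str], added: Iterable[str]
    ) -> bool:
        """Atomically swap `replaced` files for `added`, or report a conflict."""
        with self._lock:
            if base_version != self._version:
                return False  # table changed since the snapshot; caller retries
            remaining = tuple(f for f in self._files if f not in replaced)
            self._files = remaining + tuple(added)
            self._version += 1
            return True


table = Table(["a", "b", "c"])

reader_view = table.snapshot()             # a query starts and pins its view
base_version, _ = table.snapshot()         # compaction plans against the same version
committed = table.commit_rewrite(base_version, {"a", "b"}, ["ab"])

current = table.snapshot()                 # new queries see the compacted files
stale = table.commit_rewrite(base_version, {"c"}, ["c2"])  # planned on an old version
```

The query that started before the commit still holds its original file list, new queries see the compacted one, and a rewrite planned against a stale version is refused rather than applied. In this model the old files `a` and `b` are not deleted by the commit itself; removing them is left to the version cleanup described earlier.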
The reason compaction is worth understanding, even though it's pure background maintenance, is that it reveals something about how these systems actually stay fast. A lakehouse's performance isn't a fixed property set once at design time; it's something that degrades with use and has to be actively maintained. The same flexibility that makes a lakehouse powerful (storing data as open files, accepting frequent writes, preserving history) is exactly what generates the clutter that compaction has to clean up. Compaction is the quiet, continuous work that pays for that flexibility, keeping the accumulated cost of normal use from slowly choking the table. The data sits there unchanged; compaction is what keeps it arranged so the system can still move through it quickly.