The Small Files Problem: Why a Lakehouse Slows Down When It Has Too Many Tiny Files

A data lakehouse stores its tables as collections of files sitting in cloud storage. This is part of what makes the approach flexible and cheap: the data lives as ordinary files in an open format, not locked inside a proprietary database. But it introduces a problem that catches many teams off guard: a system that should be fast becomes mysteriously, frustratingly slow, and the culprit isn't the amount of data at all. It's the number of files that data is split across.

This is the small files problem, and it's one of the most common performance issues in lakehouse and data lake systems. A table holding a modest amount of data can perform terribly if that data is fragmented into thousands or millions of tiny files, while the same data consolidated into a sensible number of larger files runs smoothly. The total size barely changed. The file count is what made the difference.

To understand why, you have to look at what happens when the system reads a table. To answer a query, it has to open and read the relevant files. Each file carries a fixed overhead that has nothing to do with how much data is inside it: the system has to locate the file, open it, read its metadata, and close it. That per-file cost is small, but it's paid once for every single file, regardless of whether the file holds a gigabyte or a few hundred bytes.

Now the arithmetic turns against you. Reading one large file means paying that overhead once. Reading the same data spread across ten thousand tiny files means paying it ten thousand times. The actual data is identical, but the system spends the overwhelming majority of its effort on the mechanical work of opening and closing files rather than on reading and processing the data inside them. The overhead, trivial per file, becomes the dominant cost in aggregate, and the query crawls.
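To make that arithmetic concrete, here is a minimal sketch of a toy cost model. It is not a measurement of any real system. The 50 ms per-file overhead, the 100 MiB/s read throughput, and the 1 GiB table size are illustrative assumptions, as are the names `ScanCostModel` and `scan_seconds`. The model also assumes files are read one after another, whereas real engines read in parallel, which shrinks the absolute numbers but not the shape of the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanCostModel:
    """Toy model: scan time = file_count * fixed overhead + bytes / throughput."""

    per_file_overhead_s: float     # assumed fixed cost to locate, open, read metadata, close
    throughput_bytes_per_s: float  # assumed sustained read rate once a file is open

    def overhead_seconds(self, file_count: int) -> float:
        # Paid once per file, independent of how much data the file holds.
        return file_count * self.per_file_overhead_s

    def transfer_seconds(self, total_bytes: int) -> float:
        # Depends only on total data volume, not on how it is split up.
        return total_bytes / self.throughput_bytes_per_s

    def scan_seconds(self, file_count: int, total_bytes: int) -> float:
        return self.overhead_seconds(file_count) + self.transfer_seconds(total_bytes)


# Hypothetical numbers chosen only for illustration.
model = ScanCostModel(per_file_overhead_s=0.05,
                      throughput_bytes_per_s=100 * 1024**2)
total_bytes = 1024**3  # the same 1 GiB of data in both layouts

one_file = model.scan_seconds(1, total_bytes)
many_files = model.scan_seconds(10_000, total_bytes)
overhead_share = model.overhead_seconds(10_000) / many_files

print(f"1 file:       {one_file:.2f} s")
print(f"10,000 files: {many_files:.2f} s")
print(f"overhead share with 10,000 files: {overhead_share:.0%}")
```

Under these assumptions the single-file scan takes about 10 seconds and the ten-thousand-file scan takes about 510 seconds, with roughly 98% of that time spent on per-file overhead instead of on reading data. The transfer term is identical in both cases, so only the file count changed.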

This overhead is especially punishing in the cloud storage that lakehouses typically rely on. Cloud object storage is wonderful for holding vast amounts of data cheaply, but each request