What Is Late-Arriving Data and How Do You Handle It?
Most data pipeline designs start with an implicit assumption: that data arrives roughly in order and roughly on time. That assumption is reasonable enough in a controlled environment, but it tends to encounter reality fairly quickly. Events get buffered. Network latency varies. Mobile devices go offline and sync later. Third-party systems batch their exports on unpredictable schedules. IoT sensors lose connectivity. The result is that data describing something that happened at one time often arrives at your pipeline significantly later, sometimes by minutes, sometimes by days, occasionally by weeks.
This is called late-arriving data, and it's one of the more consequential design considerations in data engineering because it forces a choice: how long do you wait for data before you consider a window of time closed, and what do you do when data arrives after you've already processed that window?
The problem is particularly acute in streaming systems, where the whole point is to process data quickly and produce timely results. Streaming frameworks like Apache Flink and Apache Kafka Streams make an explicit distinction between event time, the timestamp recording when something actually happened, and processing time, the timestamp recording when the system received and processed the record. In a well-behaved system these are close together. In a system dealing with late-arriving data they can diverge significantly, and which time you use for windowing and aggregation determines whether your results are accurate or not.
Consider a simple example: counting user sessions per hour. If you aggregate on processing time, a session that started at 11:50pm but whose final event arrived at 12:10am gets split across two hourly buckets, neither of which reflects what actually happened. If you aggregate on event time, you can correctly attribute all events in that session to the hour they occurred. But event-time aggregation requires holding your windows open long enough to receive all the events that belong to them, and deciding how long to wait is where the engineering tradeoff lives.
The concept used to manage this tradeoff in streaming systems is the watermark. A watermark is a threshold that tells the system how far behind event time it's willing to tolerate before declaring a window closed. A watermark of five minutes means the system assumes that any event with an event timestamp more than five minutes behind the current processing time is too late to be included in its original window. Events that fall within the watermark are processed normally. Events that arrive after the watermark has passed are considered late.
What happens to those late events depends on how the system is configured. The simplest approach is to drop them, which is appropriate when timeliness matters more than completeness and the volume of late data is small enough to be acceptable as measurement error. A more sophisticated approach is to allow late events to trigger updates to already-emitted results, producing a corrected output that supersedes the earlier one. This works well when downstream consumers can handle result corrections, which not all of them can. A third approach is to route late-arriving records to a separate dead-letter or late-data store for separate handling, which preserves the data without contaminating the main pipeline's output.
In batch systems the problem manifests differently but is no less significant. A daily batch job that processes yesterday's data may be missing records that were generated yesterday but haven't arrived yet. The common pattern is to introduce a processing delay, running yesterday's job not at midnight but at some point the following morning, giving late data time to arrive. How long a delay is appropriate depends on the latency characteristics of the upstream systems and is generally something that has to be measured and tuned rather than guessed.
Slowly changing dimensions in data warehouses present their own version of the problem. If a transaction record arrives late and the dimension it references has already been updated since the transaction occurred, joining them produces a result that reflects the current state of the dimension rather than its state at the time of the transaction. Handling this correctly requires either Type 2 slowly changing dimension design, which preserves historical versions of dimension records, or effective dating on the fact records themselves, so late-arriving facts can be matched to the correct historical version of the dimension.
The design questions that late-arriving data raises don't have universal answers. The right watermark setting, the right late-data policy, and the right processing delay all depend on how late your data actually tends to arrive, how much lateness you can tolerate before results become misleading, and what downstream systems expect to receive. Those are empirical questions that require profiling your data sources and understanding your consumers, not just designing in the abstract.