TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

CDC: The Data Engineering Technique That Keeps Systems in Sync

Every organization that runs more than one data system eventually faces the same problem: keeping them in sync.

An operational database records what's happening in the business right now. A data warehouse needs to reflect those changes for analytical purposes. A downstream service needs to react when a customer record is updated. A search index needs to stay current as products are added and modified. The naive solution is to periodically copy everything from the source system to wherever it needs to go. That works at small scale. At larger scale, it becomes expensive, slow, and increasingly impractical.

Change data capture, almost universally abbreviated as CDC, is a more efficient approach. Rather than copying everything, it captures only what changed.

The core idea is to monitor a source system for changes, specifically inserts, updates, and deletions, and propagate only those changes to downstream systems. Instead of asking "what does the entire table look like right now?" on some schedule, a CDC system asks "what changed since the last time I checked?" The result is a stream of change events that downstream systems can consume and apply, keeping themselves current without the overhead of full table scans and bulk transfers.
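To make the idea of "a stream of change events that downstream systems can consume and apply" concrete, here is a minimal sketch. The `ChangeEvent` shape and the `apply_change` helper are assumptions made up for illustration; real CDC tools define their own event formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event: one insert, update, or delete."""
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # new row image; None for deletes


def apply_change(replica: dict, event: ChangeEvent) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    if event.op in ("insert", "update"):
        replica[event.key] = dict(event.row or {})
    elif event.op == "delete":
        replica.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op!r}")


# The downstream copy starts empty and is kept current by applying
# only the changes, never by re-reading the whole source table.
replica = {}
events = [
    ChangeEvent("insert", 1, {"name": "Ada", "tier": "free"}),
    ChangeEvent("insert", 2, {"name": "Lin", "tier": "free"}),
    ChangeEvent("update", 1, {"name": "Ada", "tier": "pro"}),
    ChangeEvent("delete", 2),
]
for event in events:
    apply_change(replica, event)

print(replica)
```

After the four events are applied, the replica holds only row 1, with its updated tier. The order of events matters: applying the update before the insert, or the delete before the insert, would leave the replica in a different state.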

The most common implementation approach uses the transaction log of the source database. Relational databases maintain a transaction log (called a write-ahead log, a redo log, or a binary log, depending on the database system) that records every change made to the database, in most systems before the change is applied to the data files. This log exists primarily for recovery and replication, but CDC tools can read it to capture changes as they happen. Because the log records changes in the order they were committed, log-based CDC is accurate, and because the database writes the log anyway, reading it adds little load to the source system.
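The log-reading pattern can be sketched with a toy model. Nothing below talks to a real database: `ToyTransactionLog` is an in-memory list standing in for a transaction log, and `LogReader` is a hypothetical CDC reader that remembers the last log position it processed. Real logs are binary, database-specific, and read through dedicated interfaces.

```python
class ToyTransactionLog:
    """Append-only list standing in for a database transaction log."""

    def __init__(self):
        self._entries = []

    def append(self, op, table, key, row=None):
        # Positions start at 1 and only ever increase.
        position = len(self._entries) + 1
        self._entries.append(
            {"pos": position, "op": op, "table": table, "key": key, "row": row}
        )
        return position

    def read_after(self, position):
        """Return entries with pos > position, in the order they were written."""
        return self._entries[position:]


class LogReader:
    """Hypothetical CDC reader that tracks how far into the log it has read."""

    def __init__(self, log):
        self.log = log
        self.checkpoint = 0  # last log position fully processed

    def poll(self):
        batch = self.log.read_after(self.checkpoint)
        if batch:
            self.checkpoint = batch[-1]["pos"]
        return batch


log = ToyTransactionLog()
reader = LogReader(log)

log.append("insert", "orders", 1, {"status": "new"})
log.append("update", "orders", 1, {"status": "paid"})
first = reader.poll()    # both entries written so far

log.append("delete", "orders", 1)
second = reader.poll()   # only the entry written since the last poll
third = reader.poll()    # nothing new, so an empty batch
```

The writer never calls the reader or waits for it. The reader's only state is its checkpoint, which is what lets it resume after a restart without rescanning the table.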

Alternative approaches exist. Query-based CDC periodically polls the source table for rows where a timestamp or version column indicates a recent change. This is simpler to implement but has significant limitations: it can't detect deletions unless the table uses soft deletes, it sees only the latest state of a row and so misses intermediate changes made between polls, it requires the source table to have a reliably maintained timestamp or version column, and at high polling frequency it can place meaningful load on the source database. Trigger-based CDC uses database triggers to record each change, usually in a separate change table. It captures deletions and every intermediate change, but it adds overhead to every write operation on the source system. Log-based CDC is generally preferred for production systems because it avoids both problems: it sees every change, including deletions, without adding work to the write path.
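The deletion blind spot of query-based CDC is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `products` table, the `updated_at` column, and the `poll_changes` helper are illustrative assumptions, and integer values stand in for real timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)


def poll_changes(conn, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM products "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_seen,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


conn.execute("INSERT INTO products VALUES (1, 'lamp', 100)")
conn.execute("INSERT INTO products VALUES (2, 'desk', 101)")
first_poll, watermark = poll_changes(conn, 0)

# One row is updated and another is hard-deleted before the next poll.
conn.execute("UPDATE products SET name = 'floor lamp', updated_at = 102 WHERE id = 1")
conn.execute("DELETE FROM products WHERE id = 2")
second_poll, watermark = poll_changes(conn, watermark)

print(first_poll)
print(second_poll)
```

The second poll returns the updated lamp row but says nothing about the deleted desk: the row is gone, so no timestamp remains to match the query. A downstream copy fed only by these polls would keep the desk indefinitely.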

What CDC enables matters as much as how it works. Real-time data warehousing becomes practical when you're propagating changes continuously rather than running nightly bulk loads. The lag between an event occurring in an operational system and that event being reflected in analytical systems shrinks from hours to seconds or minutes. Event-driven architectures can react to changes in source systems without those source systems needing to be modified to emit events explicitly. Database replication, including replication across different database technologies, becomes feasible. And audit logging, maintaining a complete record of every change to a dataset over time, becomes a natural output of the CDC process rather than something that has to be designed separately.

Several tools have built significant adoption around CDC. Debezium is the most widely used open-source option, supporting log-based CDC for PostgreSQL, MySQL, MongoDB, SQL Server, and other databases, and typically publishing change events to Kafka topics that downstream systems can consume. Vendors like Fivetran, Airbyte (which also maintains an open-source edition), and Striim offer managed CDC with broader connectivity and operational support. Cloud data platforms have also added native CDC capabilities, reducing the need for separate tooling in some architectures.
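As an illustration of what a consumer of such events deals with, the sketch below interprets a simplified, hand-written message shaped like a Debezium change event, which carries `before` and `after` row images, an `op` code, and `source` metadata. The sample values, the trimmed-down `source` block, and the `summarize` helper are assumptions for this example; real messages carry more fields and depend on connector and converter configuration.

```python
import json

# Debezium op codes: create, update, delete, and snapshot read.
OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot-read"}


def summarize(raw_message: str) -> dict:
    """Reduce one change event to its operation, table, and relevant row image."""
    envelope = json.loads(raw_message)
    # Some configurations wrap the event in {"schema": ..., "payload": ...}.
    payload = envelope.get("payload", envelope)
    op = payload["op"]
    # A delete has no "after" image, so fall back to the "before" image.
    row = payload["before"] if op == "d" else payload["after"]
    return {
        "operation": OP_NAMES.get(op, op),
        "table": payload["source"]["table"],
        "row": row,
    }


# Hand-written sample messages, not captured from a real connector.
sample_update = json.dumps({
    "payload": {
        "before": {"id": 1004, "email": "old@example.com"},
        "after": {"id": 1004, "email": "new@example.com"},
        "source": {"connector": "postgresql", "db": "shop", "table": "customers"},
        "op": "u",
        "ts_ms": 1700000000000,
    }
})
sample_delete = json.dumps({
    "payload": {
        "before": {"id": 1004, "email": "new@example.com"},
        "after": None,
        "source": {"connector": "postgresql", "db": "shop", "table": "customers"},
        "op": "d",
        "ts_ms": 1700000005000,
    }
})

print(summarize(sample_update))
print(summarize(sample_delete))
```

In a real deployment the raw messages would arrive from a Kafka consumer rather than from literals, and how much of the `before` image is populated depends on how the source database is configured.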

The operational considerations are real and worth understanding. Log-based CDC requires the source database to be configured to retain transaction logs long enough for the CDC system to read them, which has storage implications. Schema changes in the source system (adding or removing columns, changing data types) can break CDC pipelines if not handled carefully. The initial snapshot, which establishes a consistent starting point before CDC begins tracking changes, has to be coordinated with the ongoing log reading to avoid gaps or duplicates. These are manageable problems, but they require deliberate design; CDC is not plug-and-play.
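The snapshot handoff is the least intuitive of these, so here is a minimal sketch of one way to coordinate it: record the log position at which the snapshot was taken, then apply only log entries after that position. The `bootstrap_replica` function, the entry format, and the integer positions are all hypothetical; real tools use database-specific log positions and more involved protocols.

```python
def bootstrap_replica(snapshot_rows, snapshot_pos, log_entries):
    """Build a replica from a snapshot plus the log entries that follow it.

    snapshot_rows: table contents as of log position snapshot_pos
    log_entries:   change entries in log order, each with a "pos" field
    """
    replica = {key: dict(row) for key, row in snapshot_rows.items()}
    applied = []
    for entry in log_entries:
        if entry["pos"] <= snapshot_pos:
            continue  # already reflected in the snapshot; applying it would duplicate
        if entry["op"] == "delete":
            replica.pop(entry["key"], None)
        else:  # insert or update
            replica[entry["key"]] = dict(entry["row"])
        applied.append(entry["pos"])
    return replica, applied


log_entries = [
    {"pos": 1, "op": "insert", "key": 1, "row": {"qty": 5}},
    {"pos": 2, "op": "update", "key": 1, "row": {"qty": 7}},
    {"pos": 3, "op": "insert", "key": 2, "row": {"qty": 1}},
    {"pos": 4, "op": "delete", "key": 1},
]

# The snapshot was taken after entry 2 committed, so it already shows qty 7.
snapshot_rows = {1: {"qty": 7}}
replica, applied = bootstrap_replica(snapshot_rows, snapshot_pos=2, log_entries=log_entries)

print(replica)
print(applied)
```

Starting the log reader too late (say, after position 3) would create a gap and lose the insert of row 2. Starting it too early replays changes the snapshot already contains, which is harmless only if every change is applied idempotently.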

For practitioners working with data integration, replication, or real-time analytics, CDC is a technique that, once understood, starts appearing everywhere. Capturing what changed rather than copying everything applies to a wide range of data movement problems, and the log-based approach is particularly elegant because it reuses work the source system is already doing instead of imposing additional overhead.
