How to Build Exactly-Once CDC Pipelines for Open Lakehouses with Apache Flink, Kafka, and Iceberg
Olakunle Ebenezer Aribisala
As data systems grow increasingly complex, organizations want real-time analytics, low-latency reporting, and reliable batch processing, all on cost-effective object storage. One promising approach is building a Change Data Capture (CDC) pipeline that ensures exactly-once delivery from operational databases to an open lakehouse architecture. This article explores how Apache Flink, Kafka, and Iceberg work together to achieve that goal.
- Why Exactly-Once CDC Matters
Traditional pipelines often struggle with late events, failover scenarios, or rescaling, leading to duplicate records or missing data. For real-time analytics and reproducible batch processing, the system must guarantee one and only one committed write per logical database change, even in the face of failures.
Using Debezium for database change capture, Kafka for streaming transport, Flink for stateful processing, and Iceberg for table storage, it’s possible to create an end-to-end system that combines ACID guarantees, schema evolution, and real-time performance on object storage like S3 or GCS.
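To make the wiring concrete, here is a minimal sketch of one way the pieces could fit together, using Flink's Table API in Java: a Kafka topic carrying Debezium JSON feeds a changelog query that upserts into an Iceberg table. The topic name, catalog type, warehouse path, column names, and checkpoint interval are illustrative assumptions, not prescribed settings.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcToIcebergJob {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // Iceberg commits happen on checkpoints, so checkpointing must be enabled.
        tEnv.getConfig().getConfiguration()
            .setString("execution.checkpointing.interval", "30 s");

        // Changelog source: a Kafka topic with Debezium-formatted change events.
        tEnv.executeSql(
            "CREATE TABLE orders_cdc (" +
            "  order_id BIGINT," +
            "  amount DECIMAL(10, 2)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'dbserver1.public.orders'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.isolation.level' = 'read_committed'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'debezium-json')");

        // Iceberg catalog on object storage; the 'hadoop' catalog type is one option.
        tEnv.executeSql(
            "CREATE CATALOG lake WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 's3a://lakehouse/warehouse')");
        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS lake.db");

        // Format v2 plus upsert mode makes the sink apply changes as
        // equality deletes followed by inserts, keyed on the primary key.
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS lake.db.orders (" +
            "  order_id BIGINT," +
            "  amount DECIMAL(10, 2)," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH ('format-version' = '2', 'write.upsert.enabled' = 'true')");

        // Continuously apply the change stream to the lakehouse table.
        tEnv.executeSql("INSERT INTO lake.db.orders SELECT * FROM orders_cdc").await();
    }
}
```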
- Handling Ordering, Deduplication, and Upserts
To preserve event ordering, Kafka partitions data by primary key (or a stable hash of it), so all changes to a given row land in the same partition and arrive in order. Deduplication uses the source offsets Debezium attaches to each event: Flink keeps the last applied offset per key in checkpointed state and drops anything it has already processed, so the same change is never committed twice.
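Here is a sketch of that per-key deduplication step in the DataStream API, assuming each parsed change event carries the Debezium source offset (for example a Postgres LSN or a MySQL binlog position). ChangeEvent is an illustrative POJO with public key and sourceOffset fields, not a library type.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class DedupByOffset
        extends KeyedProcessFunction<String, ChangeEvent, ChangeEvent> {

    // Highest source offset already emitted for the current key; lives in
    // Flink's checkpointed state, so it survives failures and restarts.
    private transient ValueState<Long> lastOffset;

    @Override
    public void open(Configuration parameters) {
        lastOffset = getRuntimeContext().getState(
            new ValueStateDescriptor<>("last-offset", Long.class));
    }

    @Override
    public void processElement(ChangeEvent event, Context ctx,
                               Collector<ChangeEvent> out) throws Exception {
        Long seen = lastOffset.value();
        // Emit only events strictly newer than what this key has already produced.
        if (seen == null || event.sourceOffset > seen) {
            lastOffset.update(event.sourceOffset);
            out.collect(event);
        }
    }
}
```

It would be applied as events.keyBy(e -> e.key).process(new DedupByOffset()) after parsing, so events replayed during recovery are filtered before they ever reach the sink.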
For updates, Iceberg tables (format version 2) represent each change as an equality delete followed by an insert, while schema evolution ensures that new columns propagate automatically through the pipeline, from Debezium to Kafka, into Flink, and finally to Iceberg, which tracks columns by field ID for compatibility.
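For the schema-evolution half of that story, here is a sketch using the Iceberg Java API: adding a column assigns it a new field ID, so existing data files stay readable and the new column simply reads as null until it is populated. The Hadoop catalog, warehouse path, table name, and column name are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class AddColumnToOrders {
    public static void main(String[] args) {
        HadoopCatalog catalog =
            new HadoopCatalog(new Configuration(), "s3a://lakehouse/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "orders"));

        // The change is committed atomically to table metadata; readers and the
        // running sink pick up the new schema on their next metadata refresh.
        table.updateSchema()
             .addColumn("region", Types.StringType.get())
             .commit();
    }
}
```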
- Configuring Kafka and Debezium for CDC
The pipeline typically uses one Kafka topic per source table, configured with both compaction (keeping only the last event per key) and retention (preserving enough history for replays or backfills).
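A sketch of creating such a topic with the Kafka AdminClient, combining log compaction with time-based retention. The topic name, partition count, replication factor, and seven-day retention window are illustrative assumptions.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCdcTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("dbserver1.public.orders", 6, (short) 3)
                .configs(Map.of(
                    // Compact the log to keep the latest event per key...
                    TopicConfig.CLEANUP_POLICY_CONFIG, "compact,delete",
                    // ...while also pruning history older than seven days.
                    TopicConfig.RETENTION_MS_CONFIG,
                        String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```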
Serialization formats like Avro or JSON carry essential metadata, including the operation type (create, update, delete, read), timestamps, and source offsets. Every event must have a well-defined key, and in cases where the source table lacks a natural primary key, a synthetic key is introduced for CDC correctness.
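As an illustration, a small Jackson-based parser for the Debezium JSON envelope that pulls out the operation type, timestamp, and source offset, and falls back to a synthetic key built from assumed business-key columns when the topic key is empty. It reuses the illustrative ChangeEvent POJO from the deduplication sketch, assumes tombstone records are filtered out upstream, and the fallback column choice is purely hypothetical; Debezium can also construct such a key at the connector level.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EnvelopeParser {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static ChangeEvent parse(String key, String value) throws Exception {
        JsonNode envelope = MAPPER.readTree(value);

        String op = envelope.path("op").asText();      // c, u, d, or r
        long tsMs = envelope.path("ts_ms").asLong();    // when the change occurred
        JsonNode source = envelope.path("source");      // connector metadata
        // Postgres exposes an LSN, MySQL a binlog position; both increase per source.
        long offset = source.has("lsn") ? source.path("lsn").asLong()
                                        : source.path("pos").asLong();

        JsonNode after = envelope.path("after");
        // Synthetic key for tables without a primary key; the columns used here
        // (customer_id + order_ts) are illustrative, not a general rule.
        String effectiveKey = (key != null && !key.isEmpty())
            ? key
            : after.path("customer_id").asText() + ":" + after.path("order_ts").asText();

        return new ChangeEvent(effectiveKey, op, tsMs, offset, after);
    }
}
```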
- Using Flink for Event Time and State Management
Flink processes events based on event time rather than arrival time, using watermarks to handle late events while keeping state growth under control.
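A minimal sketch of event-time assignment with bounded out-of-orderness, using the change event's timestamp as the event time. The 30-second bound and the idleness timeout are illustrative tuning choices, and ChangeEvent is the same illustrative POJO as before.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class EventTime {
    public static DataStream<ChangeEvent> withEventTime(DataStream<ChangeEvent> events) {
        return events.assignTimestampsAndWatermarks(
            WatermarkStrategy
                // Tolerate events arriving up to 30 seconds out of order.
                .<ChangeEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                .withTimestampAssigner((event, kafkaTimestamp) -> event.tsMs)
                // Let watermarks keep advancing when some partitions go quiet.
                .withIdleness(Duration.ofMinutes(1)));
    }
}
```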
To ensure correctness, Flink reads only committed Kafka transactions using the read_committed isolation level. Checkpoints store source offsets, sink transaction IDs, and processing state atomically so that the system can recover without data loss or duplication after a failure.
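A sketch of the corresponding source and checkpoint settings in the DataStream API: the consumer skips events from aborted producer transactions, and offsets are snapshotted atomically with operator state. The checkpoint interval, topic, and consumer group are illustrative.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Offsets, keyed state, and sink transactions are committed per checkpoint.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("kafka:9092")
            .setTopics("dbserver1.public.orders")
            .setGroupId("cdc-to-iceberg")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            // Skip events that belong to aborted producer transactions.
            .setProperty("isolation.level", "read_committed")
            .build();

        DataStream<String> rawEvents = env.fromSource(
            source, WatermarkStrategy.noWatermarks(), "debezium-cdc-source");

        // ...parse, deduplicate, and write to Iceberg as in the other sketches...
        env.execute("exactly-once-cdc");
    }
}
```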
- Achieving Exactly-Once Guarantees End-to-End
The pipeline relies on three major components for exactly-once semantics:
1. Kafka transactions ensure that only committed events are visible to consumers.
2. Flink checkpoints capture consistent snapshots of the processing state, so recovery resumes from the last known good point.
3. Iceberg’s two-phase commit writes data to a staging area first and atomically updates the table metadata only when the commit succeeds.
If a crash happens between pre-commit and commit, the unreferenced data files are safely garbage-collected, preventing duplicates in the final table.
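Routine table maintenance is what actually reclaims those leftovers. Below is a sketch using the Iceberg Java API with the same illustrative catalog as above: expiring old snapshots releases the data files only they reference, while staged files that no snapshot ever referenced are removed by a separate orphan-file cleanup action (typically run through an engine such as Spark). The retention window is an illustrative choice.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class ExpireOldSnapshots {
    public static void main(String[] args) {
        HadoopCatalog catalog =
            new HadoopCatalog(new Configuration(), "s3a://lakehouse/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "orders"));

        // Drop snapshots older than three days, but always keep the last ten,
        // so recent history stays available for time travel and incremental reads.
        long cutoff = System.currentTimeMillis() - 3L * 24 * 60 * 60 * 1000;
        table.expireSnapshots()
             .expireOlderThan(cutoff)
             .retainLast(10)
             .commit();
    }
}
```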
- Testing for Failure Scenarios
To confirm the pipeline’s robustness, it’s critical to simulate real-world failures. This includes killing a Flink task during a commit, triggering network partitions, introducing backpressure, or restarting with new parallelism settings. Each scenario tests whether the pipeline preserves correctness without data duplication or key reordering.
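After each drill, a simple correctness check helps: scan the committed table and confirm that no primary key appears more than once. Here is a sketch of such a check as a batch Flink job, reusing the illustrative catalog, warehouse path, and table name from the earlier sketches.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DuplicateCheck {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inBatchMode());
        tEnv.executeSql(
            "CREATE CATALOG lake WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 's3a://lakehouse/warehouse')");

        // Any row returned here means a logical change was committed more than once.
        tEnv.executeSql(
            "SELECT order_id, COUNT(*) AS copies " +
            "FROM lake.db.orders " +
            "GROUP BY order_id " +
            "HAVING COUNT(*) > 1").print();
    }
}
```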
- The Result: Real-Time, Reliable Lakehouse Pipelines
By integrating Debezium, Kafka, Flink, and Iceberg, organizations can build vendor-neutral, real-time CDC pipelines with ACID guarantees on inexpensive object storage. The key lies in orchestrating transactions, checkpoints, and atomic commits correctly. Get these details right, and the system delivers reliable upserts, low-latency queries, and reproducible batch results at scale.