Flink 2.0: DynamicIcebergSink drops commits during recovery #14090

@aiborodin

Apache Iceberg version

1.10.0 (latest release)

Query engine

Flink

Please describe the bug 🐞

DynamicWriteResultAggregator in DynamicIcebergSink currently produces multiple dynamic committables per table/branch/checkpoint triplet because it aggregates write results by WriteTarget, which is distinct for each combination of schemaId, specId, and equality fields. This violates the idempotence contract of the DynamicCommitter, which relies on exactly one commit request per triplet to identify and skip already committed requests during recovery.
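To make the grouping problem concrete, here is a minimal sketch of the two aggregation strategies. It is a simplified model, not the actual Iceberg code: the class `AggregationSketch`, the `Result` type, and both grouping methods are hypothetical, and equality fields are left out of the key for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of write-result aggregation (not Iceberg classes).
final class AggregationSketch {

  // Stand-in for a write result tagged with the target it was written for.
  static final class Result {
    final String table;
    final String branch;
    final int schemaId;
    final int specId;
    final String dataFile;

    Result(String table, String branch, int schemaId, int specId, String dataFile) {
      this.table = table;
      this.branch = branch;
      this.schemaId = schemaId;
      this.specId = specId;
      this.dataFile = dataFile;
    }
  }

  // Groups by the full write target: one group per schema/spec combination.
  static Map<String, List<String>> byWriteTarget(List<Result> results) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (Result r : results) {
      String key = r.table + "/" + r.branch + "/schema=" + r.schemaId + "/spec=" + r.specId;
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r.dataFile);
    }
    return groups;
  }

  // Groups by table and branch only: one group per table/branch within a checkpoint.
  static Map<String, List<String>> byTableAndBranch(List<Result> results) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (Result r : results) {
      String key = r.table + "/" + r.branch;
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r.dataFile);
    }
    return groups;
  }

  // Two results from the same checkpoint for the same table and branch,
  // written with different schemas.
  static List<Result> sample() {
    return List.of(
        new Result("db.events", "main", 1, 0, "file-a.parquet"),
        new Result("db.events", "main", 2, 0, "file-b.parquet"));
  }

  public static void main(String[] args) {
    List<Result> results = sample();
    System.out.println("groups by write target: " + byWriteTarget(results).size());
    System.out.println("groups by table/branch: " + byTableAndBranch(results).size());
  }
}
```

In this model, grouping by write target yields two committables for a single table/branch/checkpoint triplet, while grouping by table and branch yields one.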

For example, data loss occurs in the following scenario:

  1. Sink creates two commit requests with properties CheckpointID = 1, JobID = a, OperatorID = abc:
     • Commit 1 (data with Schema 1)
     • Commit 2 (data with Schema 2)
  2. Commit requests are saved to a checkpoint
  3. Committer commits Commit 1
  4. Flink job restarts due to a commit failure or an autoscaling event
  5. Commit requests are retrieved from the checkpoint
  6. DynamicCommitter checks the Iceberg snapshots and identifies that Commit 1 has already been committed, as it matches CheckpointID = 1, JobID = a, OperatorID = abc
     • DynamicCommitter also skips the subsequent Commit 2, which is part of the same checkpoint, so its data is never committed
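The scenario above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the DynamicCommitter implementation: `RecoverySketch`, `Committable`, `Snapshot`, `Table`, and `commitPending` are hypothetical names, and the skip rule (ignore any request whose checkpoint ID is at or below the highest one already recorded for the same job and operator) is an assumed form of the check described in step 6.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the recovery scenario (not Iceberg classes).
final class RecoverySketch {

  static final class Committable {
    final String jobId;
    final String operatorId;
    final long checkpointId;
    final String payload;

    Committable(String jobId, String operatorId, long checkpointId, String payload) {
      this.jobId = jobId;
      this.operatorId = operatorId;
      this.checkpointId = checkpointId;
      this.payload = payload;
    }
  }

  // Stand-in for the job/operator/checkpoint properties recorded with a snapshot.
  static final class Snapshot {
    final String jobId;
    final String operatorId;
    final long checkpointId;

    Snapshot(String jobId, String operatorId, long checkpointId) {
      this.jobId = jobId;
      this.operatorId = operatorId;
      this.checkpointId = checkpointId;
    }
  }

  static final class Table {
    final List<Snapshot> snapshots = new ArrayList<>();
    final List<String> committedPayloads = new ArrayList<>();

    // Highest checkpoint ID already committed by this job and operator, or -1.
    long maxCommittedCheckpointId(String jobId, String operatorId) {
      long max = -1L;
      for (Snapshot s : snapshots) {
        if (s.jobId.equals(jobId) && s.operatorId.equals(operatorId)) {
          max = Math.max(max, s.checkpointId);
        }
      }
      return max;
    }

    void commit(Committable c) {
      snapshots.add(new Snapshot(c.jobId, c.operatorId, c.checkpointId));
      committedPayloads.add(c.payload);
    }
  }

  // Assumed idempotence check: skip requests that look already committed.
  static void commitPending(Table table, List<Committable> pending) {
    for (Committable c : pending) {
      if (c.checkpointId <= table.maxCommittedCheckpointId(c.jobId, c.operatorId)) {
        continue; // treated as a duplicate of an earlier commit
      }
      table.commit(c);
    }
  }

  static List<String> simulate() {
    Table table = new Table();
    Committable commit1 = new Committable("a", "abc", 1L, "data with Schema 1");
    Committable commit2 = new Committable("a", "abc", 1L, "data with Schema 2");

    table.commit(commit1);                            // step 3: only Commit 1 succeeds
    commitPending(table, List.of(commit1, commit2));  // steps 5-6: replay after restart
    return table.committedPayloads;
  }

  public static void main(String[] args) {
    System.out.println(simulate());
  }
}
```

In this model, the table ends up holding only the Schema 1 data: Commit 2 shares the triplet with Commit 1, so the check cannot tell it apart from a request that was already committed.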

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
