@ZuebeyirEser
Contributor

Purpose

Linked issue: close #2242

Fixes the flaky integration test AdjustIsrITCase.testIsrShrinkAndExpand by accounting for the delay in ISR shrinkage detection.

Brief change log

The Problem:
The test was failing intermittently because it attempted to execute a produceLog request immediately after stopping a follower replica. Due to the LOG_REPLICA_MAX_LAG_TIME configuration (3 seconds), there is a delay before the Leader detects the follower failure and shrinks the In-Sync Replicas (ISR) set. During this window:

  1. The Leader waits for acknowledgment from the stopped follower (still in ISR)
  2. The request fails with NOT_ENOUGH_REPLICAS or TIMEOUT
  3. The test fails with AssertionFailedError: Expecting value to be false but was true
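The window described above can be illustrated with a small model of lag-based ISR shrinking. This is only a sketch under stated assumptions: `IsrShrinkSketch`, `shrink`, and the caught-up timestamp map are hypothetical names for illustration, not Fluss classes, and the rule shown (drop a follower once it has not caught up for longer than the maximum lag time) is a simplification of what the server does.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of lag-based ISR shrinking; not Fluss server code. */
final class IsrShrinkSketch {

    /**
     * Returns the ISR that remains after dropping followers whose last
     * caught-up time is older than maxLagMs. The leader always stays.
     */
    static Set<Integer> shrink(
            Set<Integer> isr,
            Map<Integer, Long> lastCaughtUpMs,
            int leaderId,
            long nowMs,
            long maxLagMs) {
        Set<Integer> result = new HashSet<>();
        for (int replica : isr) {
            if (replica == leaderId) {
                result.add(replica);
                continue;
            }
            Long last = lastCaughtUpMs.get(replica);
            // A follower with no recorded timestamp is treated as lagging.
            if (last != null && nowMs - last <= maxLagMs) {
                result.add(replica);
            }
        }
        return result;
    }
}
```

With a 3000 ms maximum lag, a follower that stopped at t = 1000 ms is still in the ISR at t = 2000 ms and only drops out once more than 3000 ms have passed, which is the period during which a write that waits for the full ISR cannot complete.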

The Fix:
Modified testIsrShrinkAndExpand to wrap the produce request in a retry block, allowing the Leader sufficient time to detect the follower failure and shrink the ISR before proceeding with the write operation.
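A retry block of this kind might look roughly like the following. This is a minimal sketch, not the code from the PR: `RetrySketch` and `retryUntilTrue` are hypothetical names, and the real test would pass its own produce call as the attempt.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical retry helper for illustration; not the helper used in the PR. */
final class RetrySketch {

    /**
     * Runs the attempt until it returns true or the timeout elapses,
     * sleeping for the given pause between attempts.
     */
    static boolean retryUntilTrue(BooleanSupplier attempt, Duration timeout, Duration pause) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            // Subtraction keeps the comparison safe if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The timeout passed to such a helper needs to be comfortably longer than LOG_REPLICA_MAX_LAG_TIME, otherwise the retries can run out before the ISR has shrunk.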

Tests

  • Reproduction: Reproduced the failure locally by increasing LOG_REPLICA_MAX_LAG_TIME to 60s, causing 100% test failure with the original code
  • Validation: Applied the fix and verified consistent test passes, even with artificial delays
  • Regression: Ran mvn test -pl fluss-server -Dtest=AdjustIsrITCase to ensure stability

API and Format

  • Dependencies: no
  • Public API: no
  • Schema: no
  • Default configuration values: no
  • Wire protocol: no

Documentation

no - This is a test-only fix with no user-facing changes.

@wuchong
Member

wuchong commented Jan 25, 2026

Thanks, @ZuebeyirEser, for the thorough investigation. While the current fix works, I think there’s a simpler approach.

Since produceLog is only used to trigger ISR shrinking and doesn’t require strong durability guarantees, it doesn’t need to wait for full ISR acknowledgment. We can set its acks to 1, so the produce request succeeds immediately without waiting for all in-sync replicas.

This still triggers ISR shrinking: since a follower has been stopped, it will exceed LOG_REPLICA_MAX_LAG_TIME, prompting the controller to update the LeaderAndIsr in ZooKeeper accordingly.
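To make the difference between the two acknowledgment modes concrete, here is a toy model of the leader's decision. It is a sketch under assumptions: `AckSketch` and `canAcknowledge` are hypothetical names, and it assumes the Kafka-style convention that acks = -1 means "wait for every replica in the ISR" while acks = 1 means "leader only".

```java
import java.util.Set;

/** Hypothetical model of the acknowledgment rule; not Fluss server code. */
final class AckSketch {

    /**
     * Decides whether a produce request can be acknowledged.
     * acks = 1: the leader alone must have the write.
     * acks = -1 (assumed to mean "all"): every ISR member must have it.
     */
    static boolean canAcknowledge(
            int acks, Set<Integer> isr, Set<Integer> caughtUp, int leaderId) {
        if (acks == 1) {
            return caughtUp.contains(leaderId);
        }
        return caughtUp.containsAll(isr);
    }
}
```

In this model, with ISR {1, 2, 3}, leader 1, and follower 3 stopped, a request with acks = -1 stays pending until the ISR shrinks to {1, 2}, whereas a request with acks = 1 is acknowledged as soon as the leader has written it.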

What do you think, @ZuebeyirEser and @swuferhong?

@ZuebeyirEser ZuebeyirEser force-pushed the fix/flaky-adjust-isr-test branch from acdc176 to 0959741 Compare January 25, 2026 10:39
@ZuebeyirEser
Contributor Author

@wuchong Agreed.
I took a closer look at RpcMessageTestUtils to verify the timing and found that it has a 10s timeout. Since LOG_REPLICA_MAX_LAG_TIME is 3s, the ZK overhead on CI was pushing the total wait past that 10s limit.

Switching to acks=1 bypasses that client timeout entirely while still triggering the shrink. Much cleaner.

@wuchong wuchong merged commit 6eef968 into apache:main Jan 25, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

Unstable test AdjustIsrITCase.testIsrShrinkAndExpand