[test] Fix flaky AdjustIsrITCase by retrying produce request #2468

ZuebeyirEser · 2026-01-24T15:13:33Z

Purpose

Linked issue: close #2242

Fixes the flaky integration test AdjustIsrITCase.testIsrShrinkAndExpand by accounting for the delay in ISR shrinkage detection.

Brief change log

The Problem:
The test was failing intermittently because it attempted to execute a produceLog request immediately after stopping a follower replica. Due to the LOG_REPLICA_MAX_LAG_TIME configuration (3 seconds), there is a delay before the Leader detects the follower failure and shrinks the In-Sync Replicas (ISR) set. During this window:

The Leader waits for acknowledgment from the stopped follower (still in ISR)
The request fails with NOT_ENOUGH_REPLICAS or TIMEOUT
The test fails with AssertionFailedError: Expecting value to be false but was true
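
The detection window described above can be sketched as a simple time comparison. This is a minimal model, assuming the Leader treats a follower as out of sync once the time since it last caught up exceeds the maximum lag; `IsrLagSketch` and `isOutOfSync` are hypothetical names for illustration, not Fluss code, and a real server would add the cost of its periodic check on top.

```java
import java.time.Duration;

/** Toy model of follower lag detection; hypothetical names, not Fluss code. */
final class IsrLagSketch {

    /** True once the follower has been behind for longer than maxLag. */
    static boolean isOutOfSync(long nowMs, long lastCaughtUpMs, Duration maxLag) {
        return nowMs - lastCaughtUpMs > maxLag.toMillis();
    }

    public static void main(String[] args) {
        Duration maxLag = Duration.ofSeconds(3); // value of LOG_REPLICA_MAX_LAG_TIME cited above
        long stoppedAtMs = 0;                    // follower last caught up at t = 0

        // One second after the stop, the follower still counts as in sync.
        System.out.println(isOutOfSync(1_000, stoppedAtMs, maxLag)); // false
        // Just past three seconds, it becomes eligible for removal from the ISR.
        System.out.println(isOutOfSync(3_001, stoppedAtMs, maxLag)); // true
    }
}
```

Any produce request that needs the full ISR and arrives while the first case still holds has to wait on a replica that will never respond.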

The Fix:
Modified testIsrShrinkAndExpand to wrap the produce request in a retry block, allowing the Leader sufficient time to detect the follower failure and shrink the ISR before proceeding with the write operation.
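
For illustration, a retry block of this kind could look roughly like the sketch below. `RetrySketch` and `retryUntil` are hypothetical names, not the helper used in the Fluss test utilities, and the lambda only stands in for the real produce request.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical retry helper for illustration; not part of Fluss. */
final class RetrySketch {

    /**
     * Calls the action until it returns without throwing or the timeout elapses.
     * Once the deadline has passed, the most recent failure is rethrown.
     */
    static <T> T retryUntil(Callable<T> action, Duration timeout, Duration pause)
            throws Exception {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                return action.call();
            } catch (Exception | AssertionError e) {
                // Assertion failures are retried too, since the flaky test failed on one.
                if (System.nanoTime() - deadline >= 0) {
                    throw e;
                }
                Thread.sleep(pause.toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the produce request: fails twice, then succeeds.
        int[] attempts = {0};
        String result = retryUntil(
                () -> {
                    if (++attempts[0] < 3) {
                        throw new IllegalStateException("NOT_ENOUGH_REPLICAS");
                    }
                    return "ok";
                },
                Duration.ofSeconds(5),
                Duration.ofMillis(10));
        System.out.println(result + " after " + attempts[0] + " attempts"); // ok after 3 attempts
    }
}
```

The timeout passed to such a helper has to exceed LOG_REPLICA_MAX_LAG_TIME plus whatever the ISR update itself costs, otherwise the retries give up before the shrink lands.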

Tests

Reproduction: Reproduced the failure locally by increasing LOG_REPLICA_MAX_LAG_TIME to 60s, causing 100% test failure with the original code
Validation: Applied the fix and verified consistent test passes, even with artificial delays
Regression: Ran mvn test -pl fluss-server -Dtest=AdjustIsrITCase to ensure stability

API and Format

Dependencies: no
Public API: no
Schema: no
Default configuration values: no
Wire protocol: no

Documentation

no - This is a test-only fix with no user-facing changes.

wuchong · 2026-01-25T04:36:17Z

Thanks, @ZuebeyirEser, for the thorough investigation. While the current fix works, I think there’s a simpler approach.

Since produceLog is only used to trigger ISR shrinking and doesn’t require strong durability guarantees, it doesn’t need to wait for full ISR acknowledgment. We can set its acks to 1, so the produce request succeeds immediately without waiting for all in-sync replicas.

This still triggers ISR shrinking: since a follower has been stopped, it will exceed LOG_REPLICA_MAX_LAG_TIME, prompting the controller to update the LeaderAndIsr in ZooKeeper accordingly.
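
The difference between the two acknowledgment modes can be shown with a toy model. This is only a sketch, assuming Kafka-style semantics where acks=-1 waits for every replica in the ISR and acks=1 waits for the leader alone; `AcksSketch` and its methods are hypothetical and do not mirror the Fluss server classes.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of leader-side acknowledgment; hypothetical names, not Fluss code. */
final class AcksSketch {
    static final int ACKS_LEADER = 1; // assumed: leader-only acknowledgment
    static final int ACKS_ALL = -1;   // assumed: all in-sync replicas must acknowledge

    private final int leaderId;
    private final Set<Integer> isr = new HashSet<>();
    private final Set<Integer> alive = new HashSet<>();

    AcksSketch(int leaderId, Set<Integer> replicas) {
        this.leaderId = leaderId;
        isr.addAll(replicas);
        alive.addAll(replicas);
    }

    void stopFollower(int id) {
        alive.remove(id);
    }

    /** Models the outcome of lag detection: stopped replicas leave the ISR. */
    void shrinkIsr() {
        isr.retainAll(alive);
    }

    /** True if a write with the given acks setting can be acknowledged right now. */
    boolean canAck(int acks) {
        if (acks == ACKS_LEADER) {
            return alive.contains(leaderId);
        }
        return alive.containsAll(isr);
    }

    public static void main(String[] args) {
        AcksSketch leader = new AcksSketch(0, Set.of(0, 1, 2));
        leader.stopFollower(2);
        System.out.println(leader.canAck(ACKS_ALL));    // false: stopped follower is still in the ISR
        System.out.println(leader.canAck(ACKS_LEADER)); // true: only the leader has to persist the write
        leader.shrinkIsr();
        System.out.println(leader.canAck(ACKS_ALL));    // true: ISR no longer contains the stopped follower
    }
}
```

In this model the acks=1 write never depends on when the shrink happens, which is why the test no longer races against the lag timer.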

What do you think, @ZuebeyirEser and @swuferhong?

ZuebeyirEser · 2026-01-25T11:00:20Z

@wuchong Agreed.
I took a closer look at RpcMessageTestUtils to verify the timing and found that it has a 10s timeout. With LOG_REPLICA_MAX_LAG_TIME at 3s, the additional ZK overhead on CI was pushing the total wait past that 10s limit.

Switching to acks=1 bypasses that client timeout entirely while still triggering the shrink. Much cleaner.

Fix flaky AdjustIsrITCase#testIsrShrinkAndExpand by using acks=1

0959741

ZuebeyirEser force-pushed the fix/flaky-adjust-isr-test branch from acdc176 to 0959741 on January 25, 2026 10:39

wuchong approved these changes Jan 25, 2026

wuchong merged commit 6eef968 into apache:main Jan 25, 2026
6 checks passed

2 participants
