(Backport to 7.4): DD finishMoveKeys: move waitForShardReady outside transaction (#13364) by saintstack · Pull Request #13642 · apple/foundationdb

saintstack · 2026-07-02T18:49:23Z

Backport 0887ae6 from main to release-7.4

  ## Backport notes

  - Branch `release-7.4` uses Flow actors (`.actor.cpp`, `state`
    keyword, `wait()` macro) rather than main's C++20 coroutines
    (`co_await`). Local variables that need to cross a `wait()` point
    are declared with `state`; return statements use `return`, not
    `co_return`.
  - `constexpr` is not compatible with `state` variables on 7.4.
  - Otherwise the change is behavior-identical to the main PR.

DD finishMove*: drop the transaction across waitForShardReady

SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction with a ~5s lifetime (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers were slow to respond, the wait alone consumed the txn budget, and commits failed with transaction_too_old — retries too, cascading into the DD pipeline stalls observed in incidents.

Restructure both finish-move functions (finishMoveKeys / finishMoveShards, dispatched on SHARD_ENCODE_LOCATION_METADATA) into a two-transaction pattern with the wait in between:

Transaction 1: read keyServers / serverTags / serverList (and dataMove
metadata for finishMoveShards) via the new
readShardState() helper. Save the read version and
drop the transaction (tr.reset()).
Wait: waitForShardReady — runs OUTSIDE any transaction;
the 15s timeout is now safe.
Transaction 2: re-verify via destUnchanged() (dest hasn't changed,
dataMove still in Running phase for finishMoveShards),
then commit metadata writes.

If the destination changed during the wait, the inner loop retries from the top — same as today's behaviour on transient errors, just without burning the txn budget on the wait itself.

Verification & retry details:

destUnchanged() loops every sub-range, not just keyServers[0]. The keys-flavor (expectedDataMoveId={}) tolerates empty-dest entries whose src ⊆ expectedDest — matching the planning loop's alreadyMoved = dest2.empty() && isSubset branch, which lets sibling iterations of OUR move that already completed pass through. A foreign src on an empty-dest entry signals a different move owns the range and forces a retry to avoid clobbering it. The shards-flavor uses the dataMoveId stamp as the per-sub-range invariant.
All retry paths (dest-changed, data-move-deleted, phase-changed, plus the count-mismatch else-branch in finishMoveShards) are bounded by FINISH_MOVE_KEYS_MAX_RETRIES and back off via finishMoveKeysBackoff(). Without the cap the dest-changed branch could livelock; the shards- side count-mismatch path was previously unbounded.
Txn 2's writes use the post-wait snapshot: keyServersValue uses reread.uidToTagMap; finishMoveShards introduces a postWaitDataMove local so the partial-complete / deleteCheckpoints / dataMoveValue writes reflect the fresh dataMove state.
runPreCheck=false is set inline on each retry path. The partial-complete success-continue deliberately leaves runPreCheck alone so chunks 2..N of a multi-transaction move each get their own AUDIT_DATAMOVE_PRE_CHECK.
finishMoveShards now takes finishMoveKeysParallelismLock per iteration (mirroring finishMoveKeys). The lock had been function- scoped and released mid-iteration before the wait, so retries ran lockless and silently exceeded MOVE_KEYS_PARALLELISM.
taskID = TaskPriority::MoveKeys is restored on txn 2 in both functions (tr.reset() drops it).
New CODE_PROBEs ("dest changed", "data move deleted", "phase changed") are marked probe::decoration::rare; reaching them requires a concurrent reassignment racing into the wait window.
SHARD_READY_DELAY is buggified to 5.0s to exercise the slow-dest scenario in simulation.

Validation:

k8s rig: 3 transaction_too_old events vs 360,289 in the prior run without the patch; 0 OOMs.
Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated; transaction_too_old goes to zero; residual TryFinishMoveShardsError events are not_committed which retry cleanly.

` 20260701-232737-stack-backport-74-028aaba11a9b3a30 compressed=True data_size=41562049 duration=3423521 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:29:46 sanity=False started=100000 stopped=20260701-235723 submitted=20260701-232737 timeout=5400 username=stack-backport-74

`

DD finishMoveKeys: move waitForShardReady outside transaction (apple#13364) * DD finishMove*: drop the transaction across waitForShardReady SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction with a ~5s lifetime (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers were slow to respond, the wait alone consumed the txn budget, and commits failed with transaction_too_old — retries too, cascading into the DD pipeline stalls observed in incidents. Restructure both finish-move functions (finishMoveKeys / finishMoveShards, dispatched on SHARD_ENCODE_LOCATION_METADATA) into a two-transaction pattern with the wait in between: Transaction 1: read keyServers / serverTags / serverList (and dataMove metadata for finishMoveShards) via the new readShardState() helper. Save the read version and drop the transaction (tr.reset()). Wait: waitForShardReady — runs OUTSIDE any transaction; the 15s timeout is now safe. Transaction 2: re-verify via destUnchanged() (dest hasn't changed, dataMove still in Running phase for finishMoveShards), then commit metadata writes. If the destination changed during the wait, the inner loop retries from the top — same as today's behaviour on transient errors, just without burning the txn budget on the wait itself. Verification & retry details: * destUnchanged() loops every sub-range, not just keyServers[0]. The keys-flavor (`expectedDataMoveId={}`) tolerates empty-dest entries whose src ⊆ expectedDest — matching the planning loop's `alreadyMoved = dest2.empty() && isSubset` branch, which lets sibling iterations of OUR move that already completed pass through. A foreign src on an empty-dest entry signals a different move owns the range and forces a retry to avoid clobbering it. The shards-flavor uses the dataMoveId stamp as the per-sub-range invariant. * All retry paths (dest-changed, data-move-deleted, phase-changed, plus the count-mismatch else-branch in finishMoveShards) are bounded by FINISH_MOVE_KEYS_MAX_RETRIES and back off via finishMoveKeysBackoff(). Without the cap the dest-changed branch could livelock; the shards- side count-mismatch path was previously unbounded. * Txn 2's writes use the post-wait snapshot: keyServersValue uses reread.uidToTagMap; finishMoveShards introduces a postWaitDataMove local so the partial-complete / deleteCheckpoints / dataMoveValue writes reflect the fresh dataMove state. * runPreCheck=false is set inline on each retry path. The partial-complete success-continue deliberately leaves runPreCheck alone so chunks 2..N of a multi-transaction move each get their own AUDIT_DATAMOVE_PRE_CHECK. * finishMoveShards now takes finishMoveKeysParallelismLock per iteration (mirroring finishMoveKeys). The lock had been function- scoped and released mid-iteration before the wait, so retries ran lockless and silently exceeded MOVE_KEYS_PARALLELISM. * taskID = TaskPriority::MoveKeys is restored on txn 2 in both functions (tr.reset() drops it). * New CODE_PROBEs ("dest changed", "data move deleted", "phase changed") are marked probe::decoration::rare; reaching them requires a concurrent reassignment racing into the wait window. * SHARD_READY_DELAY is buggified to 5.0s to exercise the slow-dest scenario in simulation. Validation: * k8s rig: 3 transaction_too_old events vs 360,289 in the prior run without the patch; 0 OOMs. * Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated; transaction_too_old goes to zero; residual TryFinishMoveShardsError events are not_committed which retry cleanly.

foundationdb-ci · 2026-07-02T19:24:23Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: 0097a5f
Duration 0:34:47
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-07-02T19:33:23Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 0097a5f
Duration 0:43:50
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-07-02T19:37:19Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: 0097a5f
Duration 0:47:43
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-07-02T19:41:44Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: 0097a5f
Duration 0:52:07
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-07-02T19:44:15Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 0097a5f
Duration 0:54:38
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-07-02T19:50:50Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 0097a5f
Duration 1:01:13
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

saintstack requested a review from sbodagala July 2, 2026 18:49

saintstack requested a review from spraza as a code owner July 2, 2026 18:49

saintstack changed the title ~~DD finishMoveKeys: move waitForShardReady outside transaction (#13364)~~ (Backport to 7.4): DD finishMoveKeys: move waitForShardReady outside transaction (#13364) Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

(Backport to 7.4): DD finishMoveKeys: move waitForShardReady outside transaction (#13364)#13642

(Backport to 7.4): DD finishMoveKeys: move waitForShardReady outside transaction (#13364)#13642
saintstack wants to merge 1 commit into
apple:release-7.4from
saintstack:backport-0887ae6d29

saintstack commented Jul 2, 2026

Uh oh!

foundationdb-ci commented Jul 2, 2026

Uh oh!

foundationdb-ci commented Jul 2, 2026

Uh oh!

foundationdb-ci commented Jul 2, 2026

Uh oh!

foundationdb-ci commented Jul 2, 2026

Uh oh!

foundationdb-ci commented Jul 2, 2026

Uh oh!

foundationdb-ci commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

saintstack commented Jul 2, 2026

Uh oh!

foundationdb-ci commented Jul 2, 2026

Result of foundationdb-pr-macos-m1 on macOS 14.x

Uh oh!

foundationdb-ci commented Jul 2, 2026

Result of foundationdb-pr-clang on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jul 2, 2026

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jul 2, 2026

Result of foundationdb-pr-macos on macOS 14.x

Uh oh!

foundationdb-ci commented Jul 2, 2026

Result of foundationdb-pr on Linux RHEL 9

Uh oh!

foundationdb-ci commented Jul 2, 2026

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants