Skip to content

(Backport to 7.4): DD finishMoveKeys: move waitForShardReady outside transaction (#13364)#13642

Open
saintstack wants to merge 1 commit into
apple:release-7.4from
saintstack:backport-0887ae6d29
Open

(Backport to 7.4): DD finishMoveKeys: move waitForShardReady outside transaction (#13364)#13642
saintstack wants to merge 1 commit into
apple:release-7.4from
saintstack:backport-0887ae6d29

Conversation

@saintstack

Copy link
Copy Markdown
Contributor

Backport 0887ae6 from main to release-7.4

  ## Backport notes

  - Branch `release-7.4` uses Flow actors (`.actor.cpp`, `state`
    keyword, `wait()` macro) rather than main's C++20 coroutines
    (`co_await`). Local variables that need to cross a `wait()` point
    are declared with `state`; return statements use `return`, not
    `co_return`.
  - `constexpr` is not compatible with `state` variables on 7.4.
  - Otherwise the change is behavior-identical to the main PR.

DD finishMove*: drop the transaction across waitForShardReady

SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction with a ~5s lifetime (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers were slow to respond, the wait alone consumed the txn budget, and commits failed with transaction_too_old — retries too, cascading into the DD pipeline stalls observed in incidents.

Restructure both finish-move functions (finishMoveKeys / finishMoveShards, dispatched on SHARD_ENCODE_LOCATION_METADATA) into a two-transaction pattern with the wait in between:

Transaction 1: read keyServers / serverTags / serverList (and dataMove
metadata for finishMoveShards) via the new
readShardState() helper. Save the read version and
drop the transaction (tr.reset()).
Wait: waitForShardReady — runs OUTSIDE any transaction;
the 15s timeout is now safe.
Transaction 2: re-verify via destUnchanged() (dest hasn't changed,
dataMove still in Running phase for finishMoveShards),
then commit metadata writes.

If the destination changed during the wait, the inner loop retries from the top — same as today's behaviour on transient errors, just without burning the txn budget on the wait itself.

Verification & retry details:

  • destUnchanged() loops every sub-range, not just keyServers[0]. The keys-flavor (expectedDataMoveId={}) tolerates empty-dest entries whose src ⊆ expectedDest — matching the planning loop's alreadyMoved = dest2.empty() && isSubset branch, which lets sibling iterations of OUR move that already completed pass through. A foreign src on an empty-dest entry signals a different move owns the range and forces a retry to avoid clobbering it. The shards-flavor uses the dataMoveId stamp as the per-sub-range invariant.

  • All retry paths (dest-changed, data-move-deleted, phase-changed, plus the count-mismatch else-branch in finishMoveShards) are bounded by FINISH_MOVE_KEYS_MAX_RETRIES and back off via finishMoveKeysBackoff(). Without the cap the dest-changed branch could livelock; the shards- side count-mismatch path was previously unbounded.

  • Txn 2's writes use the post-wait snapshot: keyServersValue uses reread.uidToTagMap; finishMoveShards introduces a postWaitDataMove local so the partial-complete / deleteCheckpoints / dataMoveValue writes reflect the fresh dataMove state.

  • runPreCheck=false is set inline on each retry path. The partial-complete success-continue deliberately leaves runPreCheck alone so chunks 2..N of a multi-transaction move each get their own AUDIT_DATAMOVE_PRE_CHECK.

  • finishMoveShards now takes finishMoveKeysParallelismLock per iteration (mirroring finishMoveKeys). The lock had been function- scoped and released mid-iteration before the wait, so retries ran lockless and silently exceeded MOVE_KEYS_PARALLELISM.

  • taskID = TaskPriority::MoveKeys is restored on txn 2 in both functions (tr.reset() drops it).

  • New CODE_PROBEs ("dest changed", "data move deleted", "phase changed") are marked probe::decoration::rare; reaching them requires a concurrent reassignment racing into the wait window.

  • SHARD_READY_DELAY is buggified to 5.0s to exercise the slow-dest scenario in simulation.

Validation:

  • k8s rig: 3 transaction_too_old events vs 360,289 in the prior run without the patch; 0 OOMs.
  • Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated; transaction_too_old goes to zero; residual TryFinishMoveShardsError events are not_committed which retry cleanly.

` 20260701-232737-stack-backport-74-028aaba11a9b3a30 compressed=True data_size=41562049 duration=3423521 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:29:46 sanity=False started=100000 stopped=20260701-235723 submitted=20260701-232737 timeout=5400 username=stack-backport-74

`

DD finishMoveKeys: move waitForShardReady outside transaction (apple#13364)
* DD finishMove*: drop the transaction across waitForShardReady

SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction with a
~5s lifetime (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination
servers were slow to respond, the wait alone consumed the txn budget,
and commits failed with transaction_too_old — retries too, cascading
into the DD pipeline stalls observed in incidents.

Restructure both finish-move functions (finishMoveKeys / finishMoveShards,
dispatched on SHARD_ENCODE_LOCATION_METADATA) into a two-transaction
pattern with the wait in between:

  Transaction 1: read keyServers / serverTags / serverList (and dataMove
                 metadata for finishMoveShards) via the new
                 readShardState() helper. Save the read version and
                 drop the transaction (tr.reset()).
  Wait:          waitForShardReady — runs OUTSIDE any transaction;
                 the 15s timeout is now safe.
  Transaction 2: re-verify via destUnchanged() (dest hasn't changed,
                 dataMove still in Running phase for finishMoveShards),
                 then commit metadata writes.

If the destination changed during the wait, the inner loop retries from
the top — same as today's behaviour on transient errors, just without
burning the txn budget on the wait itself.

Verification & retry details:

* destUnchanged() loops every sub-range, not just keyServers[0]. The
  keys-flavor (`expectedDataMoveId={}`) tolerates empty-dest entries
  whose src ⊆ expectedDest — matching the planning loop's
  `alreadyMoved = dest2.empty() && isSubset` branch, which lets sibling
  iterations of OUR move that already completed pass through. A foreign
  src on an empty-dest entry signals a different move owns the range
  and forces a retry to avoid clobbering it. The shards-flavor uses
  the dataMoveId stamp as the per-sub-range invariant.

* All retry paths (dest-changed, data-move-deleted, phase-changed, plus
  the count-mismatch else-branch in finishMoveShards) are bounded by
  FINISH_MOVE_KEYS_MAX_RETRIES and back off via finishMoveKeysBackoff().
  Without the cap the dest-changed branch could livelock; the shards-
  side count-mismatch path was previously unbounded.

* Txn 2's writes use the post-wait snapshot: keyServersValue uses
  reread.uidToTagMap; finishMoveShards introduces a postWaitDataMove
  local so the partial-complete / deleteCheckpoints / dataMoveValue
  writes reflect the fresh dataMove state.

* runPreCheck=false is set inline on each retry path. The
  partial-complete success-continue deliberately leaves runPreCheck
  alone so chunks 2..N of a multi-transaction move each get their own
  AUDIT_DATAMOVE_PRE_CHECK.

* finishMoveShards now takes finishMoveKeysParallelismLock per
  iteration (mirroring finishMoveKeys). The lock had been function-
  scoped and released mid-iteration before the wait, so retries ran
  lockless and silently exceeded MOVE_KEYS_PARALLELISM.

* taskID = TaskPriority::MoveKeys is restored on txn 2 in both functions
  (tr.reset() drops it).

* New CODE_PROBEs ("dest changed", "data move deleted",
  "phase changed") are marked probe::decoration::rare; reaching them
  requires a concurrent reassignment racing into the wait window.

* SHARD_READY_DELAY is buggified to 5.0s to exercise the slow-dest
  scenario in simulation.

Validation:

* k8s rig: 3 transaction_too_old events vs 360,289 in the prior run
  without the patch; 0 OOMs.
* Simulation (DDPipelineStall.toml, knob ON): cascade trigger
  eliminated; transaction_too_old goes to zero; residual
  TryFinishMoveShardsError events are not_committed which retry
  cleanly.
@saintstack saintstack requested a review from sbodagala July 2, 2026 18:49
@saintstack saintstack requested a review from spraza as a code owner July 2, 2026 18:49
@saintstack saintstack changed the title DD finishMoveKeys: move waitForShardReady outside transaction (#13364) (Backport to 7.4): DD finishMoveKeys: move waitForShardReady outside transaction (#13364) Jul 2, 2026
@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: 0097a5f
  • Duration 0:34:47
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 0097a5f
  • Duration 0:43:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: 0097a5f
  • Duration 0:47:43
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: 0097a5f
  • Duration 0:52:07
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 0097a5f
  • Duration 0:54:38
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 0097a5f
  • Duration 1:01:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants