
fix(model_gateway): reclaim per-worker mutation lock for absent worker ids #1684

Open
slin1237 wants to merge 2 commits into main from fix/worker-mutation-lock-leak

Conversation

@slin1237 slin1237 commented Jun 11, 2026

Collaborator

Description

Problem

The worker registry keeps a per-worker mutation-lock map (worker_mutation_locks) to serialize register/replace/remove/status mutations for a given worker id. Three mutation methods — apply_if_revision, transition_status_inner, replace_inner — acquire the lock with worker_mutation_locks.entry(id).or_insert_with(...) before checking the worker exists. The only cleanup is in remove_inner, gated on the worker having been present. So any mutation call for an id that is no longer registered re-inserts a lock entry that is never reclaimed. This is reachable in production: a health probe that completes after its worker was removed calls registry.apply_if_revision (worker/manager.rs:479), re-creating the lock entry for the dead id every time, so the map grows unbounded over the process lifetime.
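To make the leak concrete, here is a minimal model of the pre-fix shape. It is a sketch, not the registry code: `LeakyRegistry` is a hypothetical type, the maps are std `HashMap`s behind a `Mutex` rather than `DashMap`, and a worker is reduced to a revision counter. Only the ordering (create the lock entry first, check for the worker second) is taken from the description above.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the registry. Field names follow the PR text;
/// the types are assumptions chosen so the sketch needs only std.
#[derive(Default)]
pub struct LeakyRegistry {
    workers: Mutex<HashMap<String, u64>>, // worker id -> revision
    worker_mutation_locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl LeakyRegistry {
    pub fn register(&self, id: &str) {
        self.workers.lock().unwrap().insert(id.to_string(), 0);
    }

    /// Pre-fix ordering: the lock entry is created before the existence check.
    pub fn apply_if_revision(&self, id: &str, expected: u64) -> Option<u64> {
        let lock = self
            .worker_mutation_locks
            .lock()
            .unwrap()
            .entry(id.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone();
        let _guard = lock.lock().unwrap();

        let mut workers = self.workers.lock().unwrap();
        // Absent worker: `?` returns here and the entry created above stays behind.
        let rev = workers.get_mut(id)?;
        if *rev != expected {
            return None;
        }
        *rev += 1;
        Some(*rev)
    }

    pub fn mutation_lock_count(&self) -> usize {
        self.worker_mutation_locks.lock().unwrap().len()
    }
}
```

In this model every distinct absent id passed to `apply_if_revision` leaves one entry in the lock map, and nothing ever removes it.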

Solution

Add a private helper drop_lock_if_orphaned that the three methods call on their worker-absent return path (and only there — not on a revision mismatch, where the worker still exists). It removes the entry with DashMap::remove_if using Arc::ptr_eq(existing, lock) && Arc::strong_count(lock) == 2. Because remove_if evaluates the predicate while holding the shard write lock, the check is atomic against a concurrent register_inner/remove_inner that clones the same Arc: if any other party holds a clone the strong count exceeds 2 and the entry is left in place; otherwise only the map and this call hold it, so removal is safe and cannot strand a lock a concurrent live mutation is using. The existing serialization (lock held across the index diff + event emit) is untouched.
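The guard can be sketched in isolation. This is an approximation under stated assumptions: a single std `Mutex` around a `HashMap` stands in for the DashMap shard write lock, `acquire` is a hypothetical helper for the `entry(id).or_insert_with(...)` step, and the `bool` return exists only so the outcome is observable here (the real helper returns nothing).

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Assumed stand-in for `worker_mutation_locks`.
pub type LockMap = Mutex<HashMap<String, Arc<Mutex<()>>>>;

/// Get or create the per-worker lock and return the caller's clone.
pub fn acquire(map: &LockMap, id: &str) -> Arc<Mutex<()>> {
    map.lock()
        .unwrap()
        .entry(id.to_string())
        .or_insert_with(|| Arc::new(Mutex::new(())))
        .clone()
}

/// Remove the entry only if it is still the same `Arc` and the only holders
/// are the map and this caller. Returns whether the entry was removed.
pub fn drop_lock_if_orphaned(map: &LockMap, id: &str, lock: &Arc<Mutex<()>>) -> bool {
    // Holding the map lock while checking mirrors evaluating the predicate
    // under the shard write lock: nobody can clone the Arc out of the map
    // between the check and the removal.
    let mut guard = map.lock().unwrap();
    let orphaned = match guard.get(id) {
        Some(existing) => Arc::ptr_eq(existing, lock) && Arc::strong_count(lock) == 2,
        None => false,
    };
    if orphaned {
        guard.remove(id);
    }
    orphaned
}
```

With two outstanding clones of the same entry the strong count is 3 and the call leaves the entry alone; once the second clone is dropped the count is back to 2 and the entry is removed. If the key has since been re-inserted with a different `Arc`, `Arc::ptr_eq` fails and the newer entry is kept.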

Changes

  • Add private WorkerRegistry::drop_lock_if_orphaned(worker_id, lock) that conditionally removes an orphaned per-worker mutation lock under the shard lock (Arc::ptr_eq + strong_count == 2 guard).
  • Call it on the worker-absent return path of replace_inner, apply_if_revision, and transition_status_inner; convert the latter two from ? to an explicit match so the absent branch reclaims the lock without firing on a revision mismatch.
  • Add a #[cfg(test)] pub(crate) mutation_lock_count() accessor.
  • Add regression tests.
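The `?`-to-`match` conversion is what lets the absent branch and the revision-mismatch branch diverge. A minimal sketch of that shape follows, assuming a hypothetical `Registry` model (std `HashMap`s behind `Mutex`es, a worker reduced to a revision counter, a hypothetical `lock_for` helper); the real method bodies are not shown in this PR page, so the control flow here is illustrative only.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical model of the registry; types are assumptions.
#[derive(Default)]
pub struct Registry {
    workers: Mutex<HashMap<String, u64>>, // worker id -> revision
    worker_mutation_locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl Registry {
    fn lock_for(&self, id: &str) -> Arc<Mutex<()>> {
        self.worker_mutation_locks
            .lock()
            .unwrap()
            .entry(id.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }

    fn drop_lock_if_orphaned(&self, id: &str, lock: &Arc<Mutex<()>>) {
        let mut locks = self.worker_mutation_locks.lock().unwrap();
        let orphaned = locks.get(id).map_or(false, |existing| {
            Arc::ptr_eq(existing, lock) && Arc::strong_count(lock) == 2
        });
        if orphaned {
            locks.remove(id);
        }
    }

    pub fn register(&self, id: &str) {
        let lock = self.lock_for(id);
        let _guard = lock.lock().unwrap();
        self.workers.lock().unwrap().insert(id.to_string(), 0);
    }

    pub fn apply_if_revision(&self, id: &str, expected: u64) -> Option<u64> {
        let lock = self.lock_for(id);
        let _guard = lock.lock().unwrap();
        let mut workers = self.workers.lock().unwrap();
        // Explicit match instead of `?`: only the absent branch reclaims.
        let rev = match workers.get_mut(id) {
            Some(rev) => rev,
            None => {
                self.drop_lock_if_orphaned(id, &lock);
                return None;
            }
        };
        if *rev != expected {
            return None; // worker present, stale revision: its lock is kept
        }
        *rev += 1;
        Some(*rev)
    }

    pub fn mutation_lock_count(&self) -> usize {
        self.worker_mutation_locks.lock().unwrap().len()
    }
}
```

In this model, repeated calls for a never-registered id leave `mutation_lock_count()` at 0, while a stale-revision call against a registered worker returns `None` and leaves the count at 1.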

Test Plan

  • test_mutation_on_absent_worker_does_not_leak_lock — register W, remove it (assert mutation_lock_count() == 0), then call apply_if_revision/transition_status/replace 32× each for W's removed id plus apply_if_revision 32× for a never-registered id; assert count stays 0. Without the fix the map grows (observed len 2).
  • test_revision_mismatch_keeps_lock_for_present_worker — register W, call apply_if_revision with a stale revision; assert it returns None and mutation_lock_count() == 1, proving the reclaim fires only for truly-absent workers and never over-removes a live worker's lock.

Authoritative gate (sccache disabled, RUSTC_WRAPPER=""):

$ rustup run nightly cargo fmt --all -- --check
FMT CLEAN (exit 0)

$ cargo clippy --workspace --all-targets --all-features -- -D warnings
Finished `dev` profile target(s) in 12.63s
exit 0

$ cargo test -p smg --lib (mutation_on_absent_worker / revision_mismatch_keeps_lock)
test result: ok — test_mutation_on_absent_worker_does_not_leak_lock (1 passed), test_revision_mismatch_keeps_lock_for_present_worker (1 passed)

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • Tests

    • Added tests to verify lock cleanup and prevent resource leaks in concurrent scenarios.
  • Chores

    • Improved internal lock management to prevent orphaned resource entries from persisting during worker removal or concurrent operations.

…r ids

apply_if_revision, transition_status_inner, and replace_inner each insert a
per-worker entry into worker_mutation_locks before confirming the worker
exists, but the only cleanup lives in remove_inner and is gated on the
worker having been present. A mutation call for an already-removed id (e.g.
a late health-probe completion driving apply_if_revision) re-inserts a lock
that is never reclaimed, so the map grows without bound.

Add a drop_lock_if_orphaned helper invoked on the worker-absent return path
of all three methods. It removes the entry via remove_if with an
Arc::ptr_eq + strong_count == 2 predicate evaluated under the DashMap shard
write lock, so it drops the entry only when it is still the exact Arc this
call created and unshared. This stays atomic against a concurrent insert
reusing the key and never drops a lock a live mutation needs; the entry is
reclaimed only when the worker is truly absent, not on a revision mismatch.

Signed-off-by: Simo Lin <25425177+slin1237@users.noreply.github.com>
@slin1237 slin1237 requested a review from CatherineSue as a code owner June 11, 2026 21:18
@coderabbitai

coderabbitai Bot commented Jun 11, 2026


Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cd3f3e4e-fb61-440f-851c-426f0150d708

📥 Commits

Reviewing files that changed from the base of the PR and between 93da198 and 035b520.

📒 Files selected for processing (1)
  • model_gateway/src/worker/registry.rs

📝 Walkthrough

Walkthrough

Fixes a per-worker mutation lock leak in WorkerRegistry. A new drop_lock_if_orphaned helper conditionally removes a worker_mutation_locks entry when the worker ID is absent after lock acquisition, using Arc::ptr_eq and strong_count == 2 to avoid races. The three mutation paths (replace_inner, apply_if_revision, transition_status_inner) now call this helper on the absent-worker early-return paths.

Changes

Per-worker mutation lock orphan reclaim

| Layer / File(s) | Summary |
| --- | --- |
| drop_lock_if_orphaned helper and introspection method (model_gateway/src/worker/registry.rs) | Adds drop_lock_if_orphaned, which removes a per-worker lock entry only when the stored Arc pointer matches the passed lock and Arc::strong_count == 2. Also adds a #[cfg(test)] mutation_lock_count method to expose the map size for test assertions. |
| Orphan reclaim wired into three mutation paths (model_gateway/src/worker/registry.rs) | replace_inner, apply_if_revision, and transition_status_inner each call drop_lock_if_orphaned at the point where the target worker is absent after acquiring the per-worker lock, replacing the previous silent return that left the lock entry dangling. |
| Lock-leak prevention and regression tests (model_gateway/src/worker/registry.rs) | Tests assert mutation_lock_count remains bounded under repeated absent-worker-ID calls across all three mutation APIs, and a regression test confirms the lock is not erroneously reclaimed for a present worker when apply_if_revision no-ops due to a stale revision mismatch. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • CatherineSue
  • key4ng

Poem

🐇 A lock left behind, like a burrow forgot,
Now vanishes cleanly—no leak, not a dot!
ptr_eq and strong_count, two guards at the gate,
Orphaned Arcs swept away before growing too late.
The registry hums, every entry accounted—
No phantom locks left where no worker is mounted! 🔒

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the fix: reclaiming per-worker mutation locks when worker IDs are absent, which is the primary objective of this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a mechanism to reclaim per-worker mutation locks for absent or removed workers in WorkerRegistry to prevent memory leaks, specifically handling cases like late health probes. It adds a helper method drop_lock_if_orphaned and integrates it into replace, apply_if_revision, and transition_status when a worker is not found, alongside new unit tests. The review feedback correctly points out that drop_lock_if_orphaned is missing from the absent/abort paths of remove_inner, which could still lead to a memory leak if remove is called on an already-removed worker. Additionally, it is suggested to expand the unit tests to verify that duplicate remove calls do not leak locks.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1104 to +1109
fn drop_lock_if_orphaned(&self, worker_id: &WorkerId, lock: &Arc<parking_lot::Mutex<()>>) {
    self.worker_mutation_locks
        .remove_if(worker_id, |_, existing| {
            Arc::ptr_eq(existing, lock) && Arc::strong_count(lock) == 2
        });
}

high

While drop_lock_if_orphaned is correctly called in the absent paths of replace_inner, apply_if_revision, and transition_status_inner, it is currently missing from the absent/abort paths of remove_inner.

If remove_inner is called for an already-removed or absent worker (or if it aborts due to an origin mismatch while the worker is absent), it will insert a lock entry via .entry().or_insert_with(...) but return None without reclaiming it, leading to a memory leak.

To fix this, remove_inner should also call drop_lock_if_orphaned when returning None on those paths:

fn remove_inner(
    &self,
    worker_id: &WorkerId,
    expect_origin: Option<WorkerOrigin>,
) -> Option<Arc<dyn Worker>> {
    // ...
    if let Some(expected) = expect_origin {
        if self.origin_of(worker_id) != Some(expected) {
            // ...
            if !self.workers.contains_key(worker_id) {
                self.drop_lock_if_orphaned(worker_id, &lock);
            }
            return None;
        }
    }

    if let Some((_, worker)) = self.workers.remove(worker_id) {
        // ...
        Some(worker)
    } else {
        self.drop_lock_if_orphaned(worker_id, &lock);
        None
    }
}

.build(),
);
let worker_id = registry.register(worker).unwrap();
registry.remove(&worker_id);

medium

To prevent regressions and verify that calling remove on an already-removed or absent worker does not leak the lock, we should call registry.remove(&worker_id) a second time and assert that the lock count remains 0.

registry.remove(&worker_id);
registry.remove(&worker_id);

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86387ba9f7

fn drop_lock_if_orphaned(&self, worker_id: &WorkerId, lock: &Arc<parking_lot::Mutex<()>>) {
    self.worker_mutation_locks
        .remove_if(worker_id, |_, existing| {
            Arc::ptr_eq(existing, lock) && Arc::strong_count(lock) == 2

P2 Badge Reclaim shared orphan locks after the last waiter

When an absent-worker mutation shares this Arc with a concurrent remove(&id) for the same already-removed worker id, the strong count is greater than 2 so this helper leaves the map entry behind. The waiting remove_inner call then takes the same mutex, finds no worker, and returns None without any orphan-lock cleanup, so the lock entry remains forever; this preserves the leak for concurrent late probe/status/replace work plus duplicate remove calls on removed ids. Either make every absent path participate in the cleanup or ensure the final waiter removes the orphaned entry.
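The interleaving described here can be replayed deterministically without threads. This is a sketch under assumptions: a std `Mutex<HashMap>` stands in for the DashMap, `acquire` and `replay` are hypothetical helpers, and each "caller" is represented by the `Arc` clone it would be holding at that point. It shows why the count-of-2 guard skips the entry while a second caller holds a clone, and what changes if the last waiter also runs the helper.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Assumed stand-in for `worker_mutation_locks`.
pub type LockMap = Mutex<HashMap<String, Arc<Mutex<()>>>>;

pub fn acquire(map: &LockMap, id: &str) -> Arc<Mutex<()>> {
    map.lock()
        .unwrap()
        .entry(id.to_string())
        .or_insert_with(|| Arc::new(Mutex::new(())))
        .clone()
}

pub fn drop_lock_if_orphaned(map: &LockMap, id: &str, lock: &Arc<Mutex<()>>) {
    let mut guard = map.lock().unwrap();
    let orphaned = guard.get(id).map_or(false, |existing| {
        Arc::ptr_eq(existing, lock) && Arc::strong_count(lock) == 2
    });
    if orphaned {
        guard.remove(id);
    }
}

/// Replays the scenario for an already-removed id and returns how many
/// entries are left in the map afterwards.
pub fn replay(final_waiter_cleans_up: bool) -> usize {
    let map: LockMap = Mutex::new(HashMap::new());

    let probe = acquire(&map, "gone"); // late probe: map + probe = 2 holders
    let remover = acquire(&map, "gone"); // concurrent remove: 3 holders

    // The probe finds no worker and tries to reclaim. The count is 3,
    // so the guard refuses and the entry stays.
    drop_lock_if_orphaned(&map, "gone", &probe);
    drop(probe);

    // The remover runs next and also finds no worker. Whether the entry
    // survives depends on whether this path calls the helper too.
    if final_waiter_cleans_up {
        drop_lock_if_orphaned(&map, "gone", &remover);
    }
    drop(remover);

    let remaining = map.lock().unwrap().len();
    remaining
}
```

`replay(false)` leaves one entry behind, matching the leak described in this comment; `replay(true)` leaves none, which corresponds to the "ensure the final waiter removes the orphaned entry" option.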

@github-actions github-actions Bot added the model-gateway Model gateway crate changes label Jun 11, 2026

@claude claude Bot left a comment

Clean, focused fix for the per-worker mutation-lock leak. The drop_lock_if_orphaned helper is well-designed — the Arc::ptr_eq + strong_count == 2 guard under the shard write lock correctly handles concurrent callers (last waiter cleans up) and re-registration races. Tests cover both the leak scenario and the over-removal guard. LGTM.

Labels

model-gateway Model gateway crate changes

1 participant