fix(credentials): recover from stale auth-profiles.lock#1636
Conversation
📝 WalkthroughWalkthroughThis PR resolves issue ChangesStale Auth-Profiles Lock Recovery
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/credentials/profiles.rs (1)
468-470:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle pid-file write failures before returning the guard.
If
writeln!fails here, this still reports the lock as acquired and leaves behind a malformed/empty lock file. After a crash, stale recovery can no longer parse an owner PID, so later callers fall back to the full timeout path instead of recovering.Suggested fix
Ok(mut file) => { - let _ = writeln!(file, "pid={}", std::process::id()); + if let Err(e) = writeln!(file, "pid={}", std::process::id()) { + let _ = fs::remove_file(&self.lock_path); + return Err(e).with_context(|| { + format!( + "Failed to write auth profile lock owner to {}", + self.lock_path.display() + ) + }); + } return Ok(AuthProfileLockGuard { lock_path: self.lock_path.clone(), }); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/credentials/profiles.rs` around lines 468 - 470, The code currently ignores the result of writeln! when writing "pid={}", so an I/O failure can leave a malformed lock file while still returning Ok(AuthProfileLockGuard) which misleads recovery; change the logic around the writeln! call in the lock-acquisition branch so you check its Result and, on Err, clean up and return an Err instead of returning the guard — e.g., flush/close and remove the created lock file (or propagate a clear error) if writeln! fails, only returning Ok(AuthProfileLockGuard { ... }) when the pid write succeeds.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/credentials/profiles.rs`:
- Around line 485-487: The loop that handles the AlreadyExists path repeatedly
calls clear_lock_if_stale() until it returns true, causing repeated sysinfo
probes and log spam; change the logic so that you set cleared_stale = true
immediately when you perform the first probe (i.e., as soon as you call
clear_lock_if_stale() inside the AlreadyExists handling), not only when
clear_lock_if_stale() returns true, so that subsequent iterations skip further
stale-recovery attempts for this acquire attempt (adjust control flow around the
call to clear_lock_if_stale() and the cleared_stale flag in the AlreadyExists
loop accordingly).
---
Outside diff comments:
In `@src/openhuman/credentials/profiles.rs`:
- Around line 468-470: The code currently ignores the result of writeln! when
writing "pid={}", so an I/O failure can leave a malformed lock file while still
returning Ok(AuthProfileLockGuard) which misleads recovery; change the logic
around the writeln! call in the lock-acquisition branch so you check its Result
and, on Err, clean up and return an Err instead of returning the guard — e.g.,
flush/close and remove the created lock file (or propagate a clear error) if
writeln! fails, only returning Ok(AuthProfileLockGuard { ... }) when the pid
write succeeds.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 86fca8d8-edb0-40a4-8cc9-d4ed4e1ff4ea
📒 Files selected for processing (13)
app/src/utils/__tests__/toolTimelineFormatting.test.tsapp/src/utils/toolTimelineFormatting.tssrc/openhuman/agent/agents/orchestrator/prompt.mdsrc/openhuman/agent/agents/orchestrator/prompt.rssrc/openhuman/agent/harness/definition.rssrc/openhuman/agent/harness/session/builder.rssrc/openhuman/agent/harness/tool_loop.rssrc/openhuman/channels/runtime/dispatch.rssrc/openhuman/credentials/profiles.rssrc/openhuman/credentials/profiles_tests.rssrc/openhuman/tools/impl/agent/dispatch.rssrc/openhuman/tools/impl/agent/skill_delegation.rssrc/openhuman/tools/orchestrator_tools.rs
| if !cleared_stale && self.clear_lock_if_stale() { | ||
| cleared_stale = true; | ||
| continue; |
There was a problem hiding this comment.
Flip cleared_stale after the first probe, not only after a successful delete.
Right now clear_lock_if_stale() is retried on every AlreadyExists loop whenever the lock belongs to a live pid, is malformed, or can't be read. That defeats the intended “one stale-recovery attempt per acquire” behavior and turns the contended path into repeated sysinfo probes and warning spam for up to 10 seconds.
Suggested fix
- if !cleared_stale && self.clear_lock_if_stale() {
- cleared_stale = true;
- continue;
+ if !cleared_stale {
+ cleared_stale = true;
+ if self.clear_lock_if_stale() {
+ continue;
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if !cleared_stale && self.clear_lock_if_stale() { | |
| cleared_stale = true; | |
| continue; | |
| if !cleared_stale { | |
| cleared_stale = true; | |
| if self.clear_lock_if_stale() { | |
| continue; | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/credentials/profiles.rs` around lines 485 - 487, The loop that
handles the AlreadyExists path repeatedly calls clear_lock_if_stale() until it
returns true, causing repeated sysinfo probes and log spam; change the logic so
that you set cleared_stale = true immediately when you perform the first probe
(i.e., as soon as you call clear_lock_if_stale() inside the AlreadyExists
handling), not only when clear_lock_if_stale() returns true, so that subsequent
iterations skip further stale-recovery attempts for this acquire attempt (adjust
control flow around the call to clear_lock_if_stale() and the cleared_stale flag
in the AlreadyExists loop accordingly).
|
@obchain please resolve merge conflicts before review. |
Previously a previous-run crash that left auth-profiles.lock on disk would block every RPC path touching the auth profile store for the full LOCK_TIMEOUT_MS window, and `app_state_snapshot` polling turned that into a per-2s error storm — the user effectively had to delete the file by hand. The lock writer already records its pid, but the acquirer never looked at it. On `AlreadyExists`, peek at the recorded pid once before busy-waiting: if the owning process is no longer alive (sysinfo lookup), remove the file and retry. Live owners and malformed locks still fall through to the existing busy-wait + timeout path, so contention against a real sibling process keeps serialising correctly. The first acquire on startup doubles as the crash-cleanup pass, so no separate startup hook is needed. Refs tinyhumansai#1612
0cb15ef to
e42c023
Compare
|
@graycyrus Rebased on main and pushed. Both review comments addressed in the same commit. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/credentials/profiles.rs (1)
494-517: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winAdd debug/trace logs for the new probe/wait branches.
The new recovery flow only logs after read/remove failures or successful stale cleanup. There is still no low-level diagnostic for the first stale probe, the “pid still alive” fallback, or the timeout path, which will make lock-contention incidents hard to reconstruct.
Suggested instrumentation
Err(e) if e.kind() == std::io::ErrorKind::AlreadyExists => { + tracing::debug!( + target: "auth-profiles", + lock_path = %self.lock_path.display(), + waited_ms = waited, + stale_probe_attempted = cleared_stale, + "[credentials] auth profile lock already exists" + ); if !cleared_stale { cleared_stale = true; if self.clear_lock_if_stale() { + tracing::debug!( + target: "auth-profiles", + lock_path = %self.lock_path.display(), + "[credentials] stale auth profile lock cleared; retrying acquire" + ); continue; } } if waited >= LOCK_TIMEOUT_MS { + tracing::warn!( + target: "auth-profiles", + lock_path = %self.lock_path.display(), + waited_ms = waited, + "[credentials] timed out waiting for auth profile lock" + ); anyhow::bail!("Timed out waiting for auth profile lock"); }if is_pid_alive(pid) { + tracing::trace!( + target: "auth-profiles", + lock_path = %self.lock_path.display(), + pid, + "[credentials] auth profile lock owner still alive; keeping lock in place" + ); return false; }As per coding guidelines: "
src/**/*.rs: All new/changed behavior in Rust core must include verbose diagnostics logging with stable grep-friendly prefixes" and "uselog/tracingatdebugortracelevel for development-oriented diagnostics on new/changed flows, including logs at ... branch decisions, external calls, retries/timeouts, state transitions, and error handling paths".Also applies to: 558-560
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/credentials/profiles.rs` around lines 494 - 517, The auth-profile lock recovery branch lacks diagnostic logs; add trace/debug logs with a stable prefix (e.g. "auth-profile-lock:") around the stale-probe and wait logic: log when cleared_stale is first flipped and a probe is attempted, log the result of self.clear_lock_if_stale() (success vs. pid still alive vs. unreadable), log each retry/sleep including waited and LOCK_WAIT_MS, and log the timeout branch just before the anyhow::bail! for LOCK_TIMEOUT_MS; use the existing identifiers cleared_stale, clear_lock_if_stale(), LOCK_WAIT_MS and LOCK_TIMEOUT_MS and use log/tracing at debug/trace level so these branches are easily greppable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/credentials/profiles.rs`:
- Around line 484-487: The error path removes the lock file while the File
handle `file` is still open, which can make `fs::remove_file(&self.lock_path)`
fail on Windows; change the error branch so you explicitly close the handle
(e.g. call `drop(file)` or let the `file` go out of scope) before calling
`fs::remove_file(&self.lock_path)`, keeping the same returned error context from
the `writeln!(file, "pid={}", std::process::id())` failure; update the block
around `writeln!(file, ...)`, the `file` variable, and `self.lock_path` to
ensure the file is closed prior to removal.
---
Outside diff comments:
In `@src/openhuman/credentials/profiles.rs`:
- Around line 494-517: The auth-profile lock recovery branch lacks diagnostic
logs; add trace/debug logs with a stable prefix (e.g. "auth-profile-lock:")
around the stale-probe and wait logic: log when cleared_stale is first flipped
and a probe is attempted, log the result of self.clear_lock_if_stale() (success
vs. pid still alive vs. unreadable), log each retry/sleep including waited and
LOCK_WAIT_MS, and log the timeout branch just before the anyhow::bail! for
LOCK_TIMEOUT_MS; use the existing identifiers cleared_stale,
clear_lock_if_stale(), LOCK_WAIT_MS and LOCK_TIMEOUT_MS and use log/tracing at
debug/trace level so these branches are easily greppable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0d8debe8-40fe-4aa3-ac80-0cf0225577aa
📒 Files selected for processing (2)
src/openhuman/credentials/profiles.rssrc/openhuman/credentials/profiles_tests.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/openhuman/credentials/profiles_tests.rs
| if let Err(e) = writeln!(file, "pid={}", std::process::id()) { | ||
| let _ = fs::remove_file(&self.lock_path); | ||
| return Err(e).with_context(|| { | ||
| "Failed to write auth profile lock owner".to_string() |
There was a problem hiding this comment.
Drop the file handle before removing the incomplete lock.
remove_file runs while file is still open. On Windows that delete commonly fails, so this error path can still leave an empty/malformed auth-profiles.lock behind and reintroduce the 10-second lockout on the next acquire.
Suggested fix
if let Err(e) = writeln!(file, "pid={}", std::process::id()) {
- let _ = fs::remove_file(&self.lock_path);
+ drop(file);
+ let _ = fs::remove_file(&self.lock_path);
return Err(e).with_context(|| {
"Failed to write auth profile lock owner".to_string()
});
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/credentials/profiles.rs` around lines 484 - 487, The error path
removes the lock file while the File handle `file` is still open, which can make
`fs::remove_file(&self.lock_path)` fail on Windows; change the error branch so
you explicitly close the handle (e.g. call `drop(file)` or let the `file` go out
of scope) before calling `fs::remove_file(&self.lock_path)`, keeping the same
returned error context from the `writeln!(file, "pid={}", std::process::id())`
failure; update the block around `writeln!(file, ...)`, the `file` variable, and
`self.lock_path` to ensure the file is closed prior to removal.
Summary
auth-profiles.lockleft behind by a previous-run crash, so the user is not stuck in a per-2s error storm until they delete the lock by hand.AlreadyExists, peek at the recordedpid=once and remove the lock if that process is no longer alive (viasysinfo, already a transitive dep — no new crate).Problem
src/openhuman/credentials/profiles.rs'sAuthProfilesStore::acquire_lockwritespid={n}into the lock file on acquire but never reads it back. If a previous openhuman process crashed while holding the lock, the file survives across the restart, every RPC path that touches the auth profile store fails for the fullLOCK_TIMEOUT_MS(10 s) window, andapp_state_snapshotpolling (~2 s cadence) turns that into a rapid-fire error loop — the app is effectively bricked for that user until they manuallyrmthe lock.Sentry shows this is observable in production:
OPENHUMAN-TAURI-CM— 16 events from one macOS user inside a 30-second window.OPENHUMAN-TAURI-CE,-CD— same shape on Windows.The error message in all three cases is
Failed to create auth profile lock at <user_home>/.openhuman/users/<id>/auth-profiles.lock.Solution
Close the loop the writer already opened. In
AuthProfilesStore::acquire_lock, whencreate_new(true)returnsAlreadyExists, attempt the stale-recovery path once before falling back to the busy-wait:pid=<u32>line.sysinfo::System::refresh_processes_specifics(ProcessesToUpdate::Some(&[pid]), …).fs::remove_fileand re-enter the loop, which will succeed on the nextcreate_newattempt.LOCK_WAIT_MSsleep +LOCK_TIMEOUT_MSdeadline.The clear-stale pass is gated by a
cleared_staleflag so a single acquire never tries it twice. That keeps contention against a live sibling process correctly serialising — we are not racing the lock owner, we are only reaping locks whose owner is provably gone.The first acquire on application startup naturally doubles as the crash-cleanup pass, so no separate "scan for stale locks at boot" hook is needed.
Behaviour matrix
pid=lineSubmission Checklist
profiles_tests.rs, see Test plan below.profiles.rsis exercised by the new unit tests (acquire_lock_clears_stale_lock_with_dead_pid,acquire_lock_recovers_after_upsert_when_dead_pid_lock_left_behind,clear_lock_if_stale_*,is_pid_alive_*).Closes #1612in the## Relatedsection.Impact
read_to_string+ one targetedsysinforefresh per blocked acquire. Only triggered when the lock already exists, and only once per acquire.pid=line format is unchanged.Test plan
8 new unit tests, full run is
25 passed; 0 failed:is_pid_alive_detects_current_processis_pid_alive_returns_false_for_synthetic_dead_pidacquire_lock_clears_stale_lock_with_dead_pidacquire_lock_recovers_after_upsert_when_dead_pid_lock_left_behindclear_lock_if_stale_leaves_live_pid_aloneclear_lock_if_stale_leaves_malformed_lock_aloneclear_lock_if_stale_is_noop_when_lock_missingacquire_lock_writes_pid_so_future_callers_can_recoverAlso verified
cargo fmt+cargo check --manifest-path Cargo.tomlclean.Related
app_state_snapshot.Summary by CodeRabbit
Bug Fixes