fix(credentials): recover from stale auth-profiles.lock #1636

Merged
senamakel merged 1 commit into tinyhumansai:main from obchain:fix/1612-stale-auth-lock-recovery
May 13, 2026

Conversation

Contributor

@obchain obchain commented May 13, 2026

Summary

  • Detect and recover from a stale auth-profiles.lock left behind by a previous-run crash, so the user is not stuck in a per-2s error storm until they delete the lock by hand.
  • On AlreadyExists, peek at the recorded pid= once and remove the lock if that process is no longer alive (via sysinfo, already a transitive dep — no new crate).
  • Live owners and malformed locks still fall through to the existing busy-wait + 10 s timeout path, so contention against a real sibling process keeps serialising correctly.

Problem

AuthProfilesStore::acquire_lock in src/openhuman/credentials/profiles.rs writes pid={n} into the lock file on acquire but never reads it back. If a previous openhuman process crashed while holding the lock, the file survives the restart, and every RPC path that touches the auth profile store blocks for the full LOCK_TIMEOUT_MS (10 s) window and then fails. app_state_snapshot polling (~2 s cadence) turns that into a rapid-fire error loop, so the app is effectively bricked for that user until they manually rm the lock.

Sentry shows this is observable in production:

The error message in all three cases is Failed to create auth profile lock at <user_home>/.openhuman/users/<id>/auth-profiles.lock.

Solution

Close the loop the writer already opened. In AuthProfilesStore::acquire_lock, when create_new(true) returns AlreadyExists, attempt the stale-recovery path once before falling back to the busy-wait:

  1. Read the lock file, look for a pid=<u32> line.
  2. Probe the pid via sysinfo::System::refresh_processes_specifics(ProcessesToUpdate::Some(&[pid]), …).
  3. If the pid is not present, fs::remove_file and re-enter the loop, which will succeed on the next create_new attempt.
  4. Otherwise (live pid, or no parseable pid line, or read error), fall through to the existing LOCK_WAIT_MS sleep + LOCK_TIMEOUT_MS deadline.
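
The four steps above can be sketched with the standard library alone. This is a minimal illustration rather than the PR's implementation: parse_lock_pid is a hypothetical helper name, and the is_alive closure is an assumed injection point standing in for the sysinfo-based probe so that the example needs no external crate.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

/// Extract the owner pid from lock-file contents, if a well-formed
/// `pid=<u32>` line is present. (Hypothetical helper name.)
fn parse_lock_pid(contents: &str) -> Option<u32> {
    contents
        .lines()
        .find_map(|line| line.trim().strip_prefix("pid="))
        .and_then(|raw| raw.trim().parse::<u32>().ok())
}

/// Returns true only when the lock is gone because its recorded owner is dead.
/// `is_alive` is an assumption of this sketch; the PR probes via sysinfo.
fn clear_lock_if_stale(lock_path: &Path, is_alive: impl Fn(u32) -> bool) -> bool {
    // Step 1: read the lock file. Any read error falls through to the wait path.
    let contents = match fs::read_to_string(lock_path) {
        Ok(c) => c,
        Err(_) => return false,
    };
    // No parseable pid line: stay conservative and leave the lock alone.
    let pid = match parse_lock_pid(&contents) {
        Some(pid) => pid,
        None => return false,
    };
    // Step 2: a live owner keeps its lock.
    if is_alive(pid) {
        return false;
    }
    // Step 3: reap the lock. If it vanished in the meantime, treat it as cleared.
    match fs::remove_file(lock_path) {
        Ok(()) => true,
        Err(e) if e.kind() == ErrorKind::NotFound => true,
        Err(_) => false,
    }
}
```

Passing the probe as a closure also makes the dead-pid and live-pid branches easy to exercise in tests without depending on real process state.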

The clear-stale pass is gated by a cleared_stale flag so a single acquire never tries it twice. That keeps contention against a live sibling process correctly serialising — we are not racing the lock owner, we are only reaping locks whose owner is provably gone.
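
The gating can be sketched as a loop. This is an illustration under stated assumptions, not the PR's code: acquire_lock_sketch, the timeout_ms parameter, the injected clear_lock_if_stale closure, and the 50 ms LOCK_WAIT_MS value are all made up for the example. The flag is flipped before the probe runs, so at most one probe happens per acquire whether or not it clears anything.

```rust
use std::fs::OpenOptions;
use std::io::{self, ErrorKind, Write};
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

// Assumed value for illustration; the real constant is not shown in this PR.
const LOCK_WAIT_MS: u64 = 50;

/// Hypothetical sketch of the acquire loop. `clear_lock_if_stale` is injected
/// so the one-probe-per-acquire gating can be observed in isolation.
fn acquire_lock_sketch(
    lock_path: &Path,
    timeout_ms: u64,
    mut clear_lock_if_stale: impl FnMut() -> bool,
) -> io::Result<()> {
    let mut waited = 0u64;
    let mut cleared_stale = false;
    loop {
        match OpenOptions::new().write(true).create_new(true).open(lock_path) {
            Ok(mut file) => {
                // Record the owner so a later acquirer can detect a stale lock.
                // (Cleanup on write failure is omitted from this sketch.)
                writeln!(file, "pid={}", std::process::id())?;
                return Ok(());
            }
            Err(e) if e.kind() == ErrorKind::AlreadyExists => {
                if !cleared_stale {
                    // Flip the flag first: one stale probe per acquire, at most.
                    cleared_stale = true;
                    if clear_lock_if_stale() {
                        continue; // lock reaped, retry create_new immediately
                    }
                }
                if waited >= timeout_ms {
                    return Err(io::Error::new(
                        ErrorKind::TimedOut,
                        "Timed out waiting for auth profile lock",
                    ));
                }
                sleep(Duration::from_millis(LOCK_WAIT_MS));
                waited += LOCK_WAIT_MS;
            }
            Err(e) => return Err(e),
        }
    }
}
```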

The first acquire on application startup naturally doubles as the crash-cleanup pass, so no separate "scan for stale locks at boot" hook is needed.

Behaviour matrix

| Lock state | Behaviour |
| --- | --- |
| No lock file | Created as before. |
| Lock recording a dead pid | Removed once, retried, succeeds. |
| Lock recording a live pid | Left alone; busy-wait + 10 s timeout (existing behaviour). |
| Lock with no parseable pid= line | Left alone; busy-wait + timeout. Conservative on purpose: avoids racing a writer that opened the file but has not yet flushed its pid. |
| Lock disappears between read and remove | Treated as already cleared; retry. |

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — 8 new tests in profiles_tests.rs, see Test plan below.
  • Diff coverage ≥ 80% — every new/changed line in profiles.rs is exercised by the new unit tests (acquire_lock_clears_stale_lock_with_dead_pid, acquire_lock_recovers_after_upsert_when_dead_pid_lock_left_behind, clear_lock_if_stale_*, is_pid_alive_*).
  • N/A — no user-visible feature row added/removed/renamed; this is a fault-tolerance fix.
  • N/A — no matrix-tracked feature IDs touched.
  • No new external network dependencies introduced (mock backend used per Testing Strategy).
  • N/A — not a release-cut UX surface.
  • Linked issue closed via Closes #1612 in the ## Related section.

Impact

  • Runtime: removes a class of "app is unusable until the user finds the lock file" failures. No behaviour change for the happy path — first-acquire-from-clean-state is identical.
  • Performance: one extra read_to_string + one targeted sysinfo refresh per blocked acquire. Only triggered when the lock already exists, and only once per acquire.
  • Security: pid file is only consulted, never executed against. Malformed content (including attacker-controlled junk in the lock file) cannot cause the file to be removed — only a parseable-and-dead pid does.
  • Migration / compatibility: none. Existing pid= line format is unchanged.

Test plan

cargo test --manifest-path Cargo.toml --lib openhuman::credentials::profiles

8 new unit tests; the full run reports 25 passed; 0 failed:

  • is_pid_alive_detects_current_process
  • is_pid_alive_returns_false_for_synthetic_dead_pid
  • acquire_lock_clears_stale_lock_with_dead_pid
  • acquire_lock_recovers_after_upsert_when_dead_pid_lock_left_behind
  • clear_lock_if_stale_leaves_live_pid_alone
  • clear_lock_if_stale_leaves_malformed_lock_alone
  • clear_lock_if_stale_is_noop_when_lock_missing
  • acquire_lock_writes_pid_so_future_callers_can_recover
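
For illustration, the pid-recording behaviour that the last test in this list covers could look roughly like the sketch below. LockGuardSketch and acquire_once are hypothetical names, and the Drop-based cleanup is an assumption about how the guard releases the lock; the diff in this PR does not show it.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the lock guard: removes the lock file on drop.
#[derive(Debug)]
struct LockGuardSketch {
    lock_path: PathBuf,
}

impl Drop for LockGuardSketch {
    fn drop(&mut self) {
        // Best-effort release; a failure here cannot be reported to the caller.
        let _ = fs::remove_file(&self.lock_path);
    }
}

/// Single acquire attempt: create the lock exclusively and record our pid
/// so that a later process can tell whether the owner is still alive.
fn acquire_once(lock_path: &Path) -> io::Result<LockGuardSketch> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(lock_path)?;
    writeln!(file, "pid={}", std::process::id())?;
    Ok(LockGuardSketch {
        lock_path: lock_path.to_path_buf(),
    })
}
```

A test in this style would acquire once, read the file back to check for the current process id, and then drop the guard and assert that the file is gone.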

Also verified that cargo fmt and cargo check --manifest-path Cargo.toml run clean.

Related

  • Closes: Stale auth-profiles.lock blocks all RPC calls — user stuck in error loop #1612
  • Follow-up PR(s)/TODOs: the issue also lists "graceful degradation: return cached data when the lock is contended" as a longer-term improvement. Out of scope for this PR — the stale-detection fix already resolves the rapid-fire error loop captured in Sentry; the cached-fallback path is a separate change against app_state_snapshot.

Summary by CodeRabbit

Bug Fixes

  • Improved credential profile lock handling: The system now detects and removes stale locks from processes that are no longer running, preventing the app from becoming stuck.
  • Enhanced lock file validation: Lock file creation now ensures process ID information is written correctly and reports errors if validation fails.

@obchain obchain requested a review from a team May 13, 2026 12:26
Contributor

coderabbitai Bot commented May 13, 2026

Walkthrough

This PR resolves issue #1612 by implementing stale lock detection and recovery in the auth profiles lock mechanism. When acquire_lock encounters an existing lock file, it now checks whether the recorded process ID is still alive; if the process is dead, it safely removes the stale lock and retries acquisition. The implementation uses sysinfo for cross-platform PID liveness checking, carefully handles malformed lock files, and enforces valid PID writes on lock creation.

Changes

Stale Auth-Profiles Lock Recovery

| Layer / File(s) | Summary |
| --- | --- |
| PID liveness detection infrastructure (src/openhuman/credentials/profiles.rs) | Introduces is_pid_alive(pid: u32) -> bool using sysinfo to check whether a given process ID currently exists on the system. |
| Stale lock cleanup helper with PID parsing (src/openhuman/credentials/profiles.rs) | Adds clear_lock_if_stale function that safely reads the lock file, parses the pid= line as u32, invokes is_pid_alive to verify the recorded process, and removes the lock only when the PID is not alive; malformed or unreadable locks are preserved with tracing logs for failures. |
| Lock acquisition integration with stale detection (src/openhuman/credentials/profiles.rs) | Updates acquire_lock to add a cleared_stale flag limiting stale cleanup to once per acquisition, invokes the cleanup helper on AlreadyExists error to remove dead-pid locks and retry, and enforces that PID write failures trigger lock removal and return an error instead of leaving a malformed lock behind. |
| Tests for stale lock recovery and PID tracking (src/openhuman/credentials/profiles_tests.rs) | Validates stale-lock removal when a dead PID is recorded, ensures live-pid and malformed locks are preserved, verifies no-op when lock file is missing, confirms acquire_lock persists the owning PID and removes it on guard drop, and tests is_pid_alive behavior for the current process vs synthetic dead PIDs. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tinyhumansai/openhuman#1515: Modifies the same AuthProfilesStore::acquire_lock method in profiles.rs, focusing on lock-acquisition error-message formatting while this PR introduces stale-lock detection and recovery.

Poem

🐰 A lock file lingers, a ghost in the cache,
Its process long gone—the lock didn't pass.
With sysinfo now checking if PIDs still live,
Stale locks are cleared; fresh starts we forgive.
No more stuck users, the error loop's gone! 🔐✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(credentials): recover from stale auth-profiles.lock' clearly and concisely describes the main change: detecting and recovering from stale lock files. |
| Linked Issues check | ✅ Passed | The PR implements stale lock detection using PID probing, removes dead-process locks, and preserves live-owner serialization, directly addressing all coding objectives from issue #1612. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the stale lock recovery implementation: AuthProfilesStore::acquire_lock modifications and comprehensive unit tests with no unrelated alterations. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/openhuman/credentials/profiles.rs (1)

468-470: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle pid-file write failures before returning the guard.

If writeln! fails here, this still reports the lock as acquired and leaves behind a malformed/empty lock file. After a crash, stale recovery can no longer parse an owner PID, so later callers fall back to the full timeout path instead of recovering.

Suggested fix
                 Ok(mut file) => {
-                    let _ = writeln!(file, "pid={}", std::process::id());
+                    if let Err(e) = writeln!(file, "pid={}", std::process::id()) {
+                        let _ = fs::remove_file(&self.lock_path);
+                        return Err(e).with_context(|| {
+                            format!(
+                                "Failed to write auth profile lock owner to {}",
+                                self.lock_path.display()
+                            )
+                        });
+                    }
                     return Ok(AuthProfileLockGuard {
                         lock_path: self.lock_path.clone(),
                     });
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/credentials/profiles.rs` around lines 468 - 470, The code
currently ignores the result of writeln! when writing "pid={}", so an I/O
failure can leave a malformed lock file while still returning
Ok(AuthProfileLockGuard) which misleads recovery; change the logic around the
writeln! call in the lock-acquisition branch so you check its Result and, on
Err, clean up and return an Err instead of returning the guard — e.g.,
flush/close and remove the created lock file (or propagate a clear error) if
writeln! fails, only returning Ok(AuthProfileLockGuard { ... }) when the pid
write succeeds.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/credentials/profiles.rs`:
- Around line 485-487: The loop that handles the AlreadyExists path repeatedly
calls clear_lock_if_stale() until it returns true, causing repeated sysinfo
probes and log spam; change the logic so that you set cleared_stale = true
immediately when you perform the first probe (i.e., as soon as you call
clear_lock_if_stale() inside the AlreadyExists handling), not only when
clear_lock_if_stale() returns true, so that subsequent iterations skip further
stale-recovery attempts for this acquire attempt (adjust control flow around the
call to clear_lock_if_stale() and the cleared_stale flag in the AlreadyExists
loop accordingly).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 86fca8d8-edb0-40a4-8cc9-d4ed4e1ff4ea

📥 Commits

Reviewing files that changed from the base of the PR and between 7beed3a and 0cb15ef.

📒 Files selected for processing (13)
  • app/src/utils/__tests__/toolTimelineFormatting.test.ts
  • app/src/utils/toolTimelineFormatting.ts
  • src/openhuman/agent/agents/orchestrator/prompt.md
  • src/openhuman/agent/agents/orchestrator/prompt.rs
  • src/openhuman/agent/harness/definition.rs
  • src/openhuman/agent/harness/session/builder.rs
  • src/openhuman/agent/harness/tool_loop.rs
  • src/openhuman/channels/runtime/dispatch.rs
  • src/openhuman/credentials/profiles.rs
  • src/openhuman/credentials/profiles_tests.rs
  • src/openhuman/tools/impl/agent/dispatch.rs
  • src/openhuman/tools/impl/agent/skill_delegation.rs
  • src/openhuman/tools/orchestrator_tools.rs

Comment thread src/openhuman/credentials/profiles.rs Outdated
Comment on lines +485 to +487
if !cleared_stale && self.clear_lock_if_stale() {
    cleared_stale = true;
    continue;

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Flip cleared_stale after the first probe, not only after a successful delete.

Right now clear_lock_if_stale() is retried on every AlreadyExists loop whenever the lock belongs to a live pid, is malformed, or can't be read. That defeats the intended “one stale-recovery attempt per acquire” behavior and turns the contended path into repeated sysinfo probes and warning spam for up to 10 seconds.

Suggested fix
-                    if !cleared_stale && self.clear_lock_if_stale() {
-                        cleared_stale = true;
-                        continue;
+                    if !cleared_stale {
+                        cleared_stale = true;
+                        if self.clear_lock_if_stale() {
+                            continue;
+                        }
                     }

@graycyrus
Contributor

@obchain please resolve merge conflicts before review.

Previously a previous-run crash that left auth-profiles.lock on disk
would block every RPC path touching the auth profile store for the
full LOCK_TIMEOUT_MS window, and `app_state_snapshot` polling turned
that into a per-2s error storm — the user effectively had to delete
the file by hand. The lock writer already records its pid, but the
acquirer never looked at it.

On `AlreadyExists`, peek at the recorded pid once before busy-waiting:
if the owning process is no longer alive (sysinfo lookup), remove the
file and retry. Live owners and malformed locks still fall through to
the existing busy-wait + timeout path, so contention against a real
sibling process keeps serialising correctly. The first acquire on
startup doubles as the crash-cleanup pass, so no separate startup
hook is needed.

Refs tinyhumansai#1612
@obchain obchain force-pushed the fix/1612-stale-auth-lock-recovery branch from 0cb15ef to e42c023 on May 13, 2026 17:59
Contributor Author

obchain commented May 13, 2026

@graycyrus Rebased on main and pushed. Both review comments addressed in the same commit.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/openhuman/credentials/profiles.rs (1)

494-517: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add debug/trace logs for the new probe/wait branches.

The new recovery flow only logs after read/remove failures or successful stale cleanup. There is still no low-level diagnostic for the first stale probe, the “pid still alive” fallback, or the timeout path, which will make lock-contention incidents hard to reconstruct.

Suggested instrumentation
                 Err(e) if e.kind() == std::io::ErrorKind::AlreadyExists => {
+                    tracing::debug!(
+                        target: "auth-profiles",
+                        lock_path = %self.lock_path.display(),
+                        waited_ms = waited,
+                        stale_probe_attempted = cleared_stale,
+                        "[credentials] auth profile lock already exists"
+                    );
                     if !cleared_stale {
                         cleared_stale = true;
                         if self.clear_lock_if_stale() {
+                            tracing::debug!(
+                                target: "auth-profiles",
+                                lock_path = %self.lock_path.display(),
+                                "[credentials] stale auth profile lock cleared; retrying acquire"
+                            );
                             continue;
                         }
                     }
                     if waited >= LOCK_TIMEOUT_MS {
+                        tracing::warn!(
+                            target: "auth-profiles",
+                            lock_path = %self.lock_path.display(),
+                            waited_ms = waited,
+                            "[credentials] timed out waiting for auth profile lock"
+                        );
                         anyhow::bail!("Timed out waiting for auth profile lock");
                     }
         if is_pid_alive(pid) {
+            tracing::trace!(
+                target: "auth-profiles",
+                lock_path = %self.lock_path.display(),
+                pid,
+                "[credentials] auth profile lock owner still alive; keeping lock in place"
+            );
             return false;
         }

As per coding guidelines: "src/**/*.rs: All new/changed behavior in Rust core must include verbose diagnostics logging with stable grep-friendly prefixes" and "use log / tracing at debug or trace level for development-oriented diagnostics on new/changed flows, including logs at ... branch decisions, external calls, retries/timeouts, state transitions, and error handling paths".

Also applies to: 558-560

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/credentials/profiles.rs` around lines 494 - 517, The
auth-profile lock recovery branch lacks diagnostic logs; add trace/debug logs
with a stable prefix (e.g. "auth-profile-lock:") around the stale-probe and wait
logic: log when cleared_stale is first flipped and a probe is attempted, log the
result of self.clear_lock_if_stale() (success vs. pid still alive vs.
unreadable), log each retry/sleep including waited and LOCK_WAIT_MS, and log the
timeout branch just before the anyhow::bail! for LOCK_TIMEOUT_MS; use the
existing identifiers cleared_stale, clear_lock_if_stale(), LOCK_WAIT_MS and
LOCK_TIMEOUT_MS and use log/tracing at debug/trace level so these branches are
easily greppable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/credentials/profiles.rs`:
- Around line 484-487: The error path removes the lock file while the File
handle `file` is still open, which can make `fs::remove_file(&self.lock_path)`
fail on Windows; change the error branch so you explicitly close the handle
(e.g. call `drop(file)` or let the `file` go out of scope) before calling
`fs::remove_file(&self.lock_path)`, keeping the same returned error context from
the `writeln!(file, "pid={}", std::process::id())` failure; update the block
around `writeln!(file, ...)`, the `file` variable, and `self.lock_path` to
ensure the file is closed prior to removal.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0d8debe8-40fe-4aa3-ac80-0cf0225577aa

📥 Commits

Reviewing files that changed from the base of the PR and between 0cb15ef and e42c023.

📒 Files selected for processing (2)
  • src/openhuman/credentials/profiles.rs
  • src/openhuman/credentials/profiles_tests.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/credentials/profiles_tests.rs

Comment on lines +484 to +487
if let Err(e) = writeln!(file, "pid={}", std::process::id()) {
    let _ = fs::remove_file(&self.lock_path);
    return Err(e).with_context(|| {
        "Failed to write auth profile lock owner".to_string()

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop the file handle before removing the incomplete lock.

remove_file runs while file is still open. On Windows that delete commonly fails, so this error path can still leave an empty/malformed auth-profiles.lock behind and reintroduce the 10-second lockout on the next acquire.

Suggested fix
                     if let Err(e) = writeln!(file, "pid={}", std::process::id()) {
-                        let _ = fs::remove_file(&self.lock_path);
+                        drop(file);
+                        let _ = fs::remove_file(&self.lock_path);
                         return Err(e).with_context(|| {
                             "Failed to write auth profile lock owner".to_string()
                         });
                     }
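
The ordering in this suggestion (close the handle, then delete) can be shown in isolation. The sketch below is a hedged illustration only: create_lock_recording_owner and write_pid_line are hypothetical names, and the injected write_owner closure exists purely so the failure branch can be triggered without a real I/O error.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Create the lock file and record its owner. If recording fails, close the
/// handle and remove the file so no empty or malformed lock is left behind.
/// `write_owner` is an assumption of this sketch, injected for testability.
fn create_lock_recording_owner(
    lock_path: &Path,
    write_owner: impl FnOnce(&mut File) -> io::Result<()>,
) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(lock_path)?;
    if let Err(e) = write_owner(&mut file) {
        // Close the handle first: deleting a file that is still open
        // commonly fails on Windows.
        drop(file);
        let _ = fs::remove_file(lock_path);
        return Err(e);
    }
    Ok(())
}

/// The normal owner record: a single `pid=<n>` line.
fn write_pid_line(file: &mut File) -> io::Result<()> {
    writeln!(file, "pid={}", std::process::id())
}
```

Calling create_lock_recording_owner(path, write_pid_line) gives the normal path; passing a closure that returns an error exercises the cleanup branch and should leave no file at path.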

@senamakel senamakel merged commit d46abe1 into tinyhumansai:main May 13, 2026
24 checks passed