fix(onboarding): raise snapshot timeout + staged still-working UI (#2156) by obchain · Pull Request #2179 · tinyhumansai/openhuman

obchain · 2026-05-19T06:53:22Z

Summary

Add per-call timeoutMs override to callCoreRpc so slow-but-alive RPCs can opt into a longer-but-still-bounded budget without changing the 30s default for fast calls.
Raise the first-launch openhuman.app_state_snapshot and learning_save_profile timeouts to 90s — they legitimately run 30–40s on M-series Macs while memory tree init + Composio warmup compete for the snapshot critical path.
Stage the onboarding profile-build UI: once the pipeline has been pending for 30s, the copy swaps to a calmer "Still working on your profile…" message with a core.ping alive indicator that distinguishes slow-but-alive from truly-unreachable cores.
Continue to chat stays available throughout — no UX regression on the existing escape hatch.

Problem

On slow first-launches (reported on M-series Mac Mini, macOS 26.1), openhuman.app_state_snapshot completes successfully but takes 32–37s while heavy boot work (memory tree init, Composio registry warmup, mascot rasterization, scheduler gate at 97% CPU) competes for the snapshot path. The global 30s RPC timeout kills the call mid-flight and parks users on the post-login profile-build fallback even though the backend would have answered moments later. Users read the fallback as broken login.

The existing build-profile UI also gives no signal during a long wait, so a slow-but-alive 35–40s snapshot looks identical to a hang.

Solution

callCoreRpc gains an optional timeoutMs parameter, clamped to the same [1s, 10min] window as the global default. Default behaviour is unchanged for the ~100 existing call sites that do not pass it.
fetchCoreAppSnapshot and saveProfile pass 90_000 so the first-launch slow-success completes inline. Real failures still abort within 90s.
ContextGatheringStep adds a 30s threshold timer that flips the title/description to a calmer "still working" variant and starts a 5s core.ping probe loop. The probe's result drives a small dot indicator (alive / probing / unreachable) so users can tell whether they should wait or take action.
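
A minimal sketch of the clamping rule described above, assuming a standalone helper. The names `resolveTimeoutMs`, `DEFAULT_TIMEOUT_MS`, `MIN_TIMEOUT_MS`, and `MAX_TIMEOUT_MS` are hypothetical; only the 30s default and the [1s, 10min] window come from this PR, and treating a missing or non-finite value as "use the default" is an assumption rather than the confirmed behaviour of callCoreRpc.

```typescript
// Hypothetical constants: the values mirror the PR text, the names do not.
const DEFAULT_TIMEOUT_MS = 30_000;
const MIN_TIMEOUT_MS = 1_000;
const MAX_TIMEOUT_MS = 10 * 60 * 1_000;

// Returns the effective timeout for one call.
// Assumption: undefined / NaN / Infinity fall back to the global default.
function resolveTimeoutMs(timeoutMs?: number): number {
  if (timeoutMs === undefined || !Number.isFinite(timeoutMs)) {
    return DEFAULT_TIMEOUT_MS;
  }
  // Clamp into [MIN_TIMEOUT_MS, MAX_TIMEOUT_MS] so no caller can ask for
  // an unbounded (or effectively zero) budget.
  return Math.min(MAX_TIMEOUT_MS, Math.max(MIN_TIMEOUT_MS, Math.floor(timeoutMs)));
}
```

Under these assumptions, `resolveTimeoutMs(90_000)` stays at 90s, a call without an override gets the 30s default, and a 2-hour request is capped at 10 minutes.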

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
Diff coverage ≥ 80% — CI is expected to validate this; locally Vitest covers all three new code paths (per-call timeout override, snapshot timeout threading, staged UI with alive/unreachable probe).
Coverage matrix updated — N/A: behaviour-only change on the existing onboarding context-gathering flow; no new feature row added/removed/renamed.
All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: no coverage-matrix feature ID changed.
No new external network dependencies introduced (mock backend used per Testing Strategy)
Manual smoke checklist updated if this touches release-cut surfaces (docs/RELEASE-MANUAL-SMOKE.md) — N/A: narrow onboarding fallback behaviour change.
Linked issue closed via Closes #NNN in the ## Related section

Impact

Desktop onboarding UI + frontend RPC client. No persistence, security, migration, or backend changes.
The 90s budget is still bounded; nothing can hang forever.
Fast-path RPCs continue to honour the 30s global default — no behaviour change for the existing ~100 call sites that do not opt into a longer timeout.
12 locales updated with the 5 new onboarding.contextGathering.* keys (stillWorkingTitle, stillWorkingDesc, coreAlive, coreAliveProbing, coreUnreachable).

AI Authored PR Metadata (required for Codex/Linear PRs)

Keep this section for AI-authored PRs. For human-only PRs, mark each field N/A.

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: fix/2156-snapshot-timeout-staged-fallback
Commit SHA: c6b910b7f7a1a12969d271226dadc9946acf6690

Validation Run

pnpm --filter openhuman-app format:check — pre-push hook ran prettier; format-only fixes amended in.
pnpm typecheck — clean.
Focused tests: pnpm debug unit ContextGatheringStep 14/14, pnpm debug unit coreRpcClient 72/72, pnpm debug unit coreStateApi 13/13.
Rust fmt/check (if changed): N/A — no Rust changes.
Tauri fmt/check (if changed): N/A — no Tauri changes.

Validation Blocked

command: N/A
error: N/A
impact: N/A

Behavior Changes

Intended behavior change: first-launch profile build no longer treats a slow-but-alive 35–40s snapshot as a failure; a calmer staged UI with an alive-indicator replaces the silent wait between 30s and ~90s.
User-visible effect: post-login "Almost there!" fallback fires far less often; when the wait genuinely is long, users see honest progress copy plus a core-status dot, with continue-to-chat still available throughout.

Parity Contract

Legacy behavior preserved: pre-30s happy path renders identical copy ("Building your profile…"). Continue-to-chat button still appears throughout. Existing error fallback unchanged.
Guard/fallback/dispatch parity checks: callCoreRpc callers that do not pass timeoutMs continue to use the 30s global default. The 90s budget is still bounded; nothing can hang forever.

Summary by CodeRabbit

New Features
- A "Still working" onboarding state appears after ~30s for slow first-launches and shows core reachability status to indicate progress.
- Longer per-operation timeouts reduce failures during initial profile build and first-run loading.
Translations
- Updated localized copy for the onboarding context/gathering flow across multiple languages.

The global CORE_RPC_TIMEOUT_MS (30s) treats every RPC as equally fast/slow, but slow-but-alive paths such as first-launch openhuman.app_state_snapshot legitimately run for 35-40s on M-series Macs while memory tree init, Composio warmup, and other boot work compete for the snapshot critical path. Capping every call at 30s forces those users to a failure-style fallback even though the call would have succeeded a few seconds later. Add an optional timeoutMs override, clamped to the same [1s, 10min] window as the global default, so callers like fetchCoreAppSnapshot can opt into a longer-but-still-bounded budget without changing the default for fast RPCs. Refs tinyhumansai#2156.

…ansai#2156) First-launch openhuman.app_state_snapshot can take 30-40s on slow hardware while memory tree init and Composio registry warmup compete for the snapshot critical path. The previous global 30s timeout killed those calls and parked users on the post-login fallback even though the backend would have answered moments later. Pass SNAPSHOT_TIMEOUT_MS=90s via the new callCoreRpc per-call timeoutMs option so slow-but-alive cores complete inline. Real failures still abort within 90s rather than hanging forever. Refs tinyhumansai#2156.

…hot (tinyhumansai#2156) When the post-login profile build runs past the 30s hardcoded threshold, users today see no signal that the system is still making progress — the existing build animation looks identical to a hang. On a slow-but-alive core that legitimately needs ~40s to complete app_state_snapshot or learning_save_profile, this reads as broken login. Add a staged transition: after STILL_WORKING_THRESHOLD_MS the copy swaps to a calmer 'Still working on your profile…' message, and a lightweight core.ping probe runs on a 5s interval so the indicator can distinguish slow-but-alive vs truly unreachable cores. Continue to chat stays available throughout. Also pass SAVE_PROFILE_TIMEOUT_MS=90s on learning_save_profile so the legitimate slow-success completes inline instead of falling to the error path. i18n strings shipped for 12 locales. Refs tinyhumansai#2156.

coderabbitai · 2026-05-19T06:53:38Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 13886009-1bc8-4ddd-b62f-a25284a885e3

📥 Commits

Reviewing files that changed from the base of the PR and between 080c04c and 0dbe9df.

📒 Files selected for processing (16)

app/src/lib/i18n/chunks/ar-4.ts
app/src/lib/i18n/chunks/bn-4.ts
app/src/lib/i18n/chunks/en-4.ts
app/src/lib/i18n/chunks/es-4.ts
app/src/lib/i18n/chunks/fr-4.ts
app/src/lib/i18n/chunks/hi-4.ts
app/src/lib/i18n/chunks/id-4.ts
app/src/lib/i18n/chunks/it-4.ts
app/src/lib/i18n/chunks/pt-4.ts
app/src/lib/i18n/chunks/ru-4.ts
app/src/lib/i18n/chunks/zh-CN-4.ts
app/src/lib/i18n/en.ts
app/src/pages/onboarding/steps/ContextGatheringStep.tsx
app/src/pages/onboarding/steps/__tests__/ContextGatheringStep.test.tsx
app/src/services/__tests__/coreRpcClient.test.ts
app/src/services/coreRpcClient.ts

✅ Files skipped from review due to trivial changes (8)

app/src/lib/i18n/chunks/en-4.ts
app/src/lib/i18n/chunks/ar-4.ts
app/src/lib/i18n/chunks/bn-4.ts
app/src/lib/i18n/chunks/zh-CN-4.ts
app/src/lib/i18n/en.ts
app/src/lib/i18n/chunks/ru-4.ts
app/src/lib/i18n/chunks/pt-4.ts
app/src/lib/i18n/chunks/it-4.ts

📝 Walkthrough

Walkthrough

This PR adds a staged "still working" onboarding UI with periodic core probes, per-call RPC timeout overrides (clamped), a 90s snapshot timeout, tests for timeout/probing behavior, and i18n updates across locale bundles.

Changes

Onboarding Staged Loading & Core Reachability

Layer / File(s)	Summary
Multi-language translations for staged loading UI `app/src/lib/i18n/chunks/*-4.ts`, `app/src/lib/i18n/en.ts`	Translation keys added/updated for `onboarding.contextGathering`: `coreAlive`, `coreAliveProbing`, `coreUnreachable`, `stillWorkingDesc`, `stillWorkingTitle`, and updated `errorDesc`.
Per-call RPC timeout override infrastructure & tests `app/src/services/coreRpcClient.ts`, `app/src/services/__tests__/coreRpcClient.test.ts`	`callCoreRpc` accepts optional `timeoutMs` validated/clamped to bounds; fetch abort uses effective timeout and surfaces it in errors. `testCoreRpcConnection` accepts an AbortSignal. Tests cover override, clamp, and fallback.
Snapshot RPC timeout extension to 90 seconds `app/src/services/coreStateApi.ts`, `app/src/services/coreStateApi.test.ts`	Added exported `SNAPSHOT_TIMEOUT_MS = 90_000` and pass it as `timeoutMs` to `openhuman.app_state_snapshot`; tests assert the timeout is passed and within bounds.
ContextGatheringStep staged loading & core probing implementation `app/src/pages/onboarding/steps/ContextGatheringStep.tsx`	Component flips to a "still working" UI after ~30s while pipeline pending, periodically probes core reachability with abort-bounded probes and single-flight guarding, displays alive/unreachable indicator, and calls `learning_save_profile` with extended timeout.
Test setup and staged loading UI test suite `app/src/pages/onboarding/steps/__tests__/ContextGatheringStep.test.tsx`	Expanded `coreRpcClient` mock (`getCoreRpcUrl`, `testCoreRpcConnection`) and added staged tests validating the 90s timeout override, UI transition after ~30s, core probe outcomes, and probe abort behavior.

Sequence Diagram(s)

sequenceDiagram
  participant UI as ContextGatheringStep
  participant Pipeline as OnboardingPipeline
  participant CoreRpcClient as coreRpcClient
  participant Core as CoreRPC
  UI->>Pipeline: start profile build
  Pipeline->>CoreRpcClient: openhuman.learning_save_profile(timeoutMs=90s)
  UI->>CoreRpcClient: getCoreRpcUrl() / testCoreRpcConnection() (periodic while pending)
  CoreRpcClient->>Core: probe testCoreRpcConnection (with AbortSignal)
  Core-->>CoreRpcClient: respond (200/401 or error)
  CoreRpcClient-->>UI: alive/unreachable status
  Pipeline-->>UI: completes -> normal UI

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

tinyhumansai/openhuman#2196: Overlaps on coreRpcClient timeout/abort behavior changes.
tinyhumansai/openhuman#2057: Edits the same onboarding.contextGathering.errorDesc i18n key across locales.
tinyhumansai/openhuman#2042: Related edits to zh-CN onboarding contextGathering copy.

Suggested labels

feature

Suggested reviewers

graycyrus

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: raising snapshot timeout and adding staged still-working UI for onboarding.
Linked Issues check	✅ Passed	All acceptance criteria from issue `#2156` are met: false fallback reduced via 90s timeout override, progress UI added with 30s threshold, backend alive detection via core.ping probes, failure remains bounded at 10min max, and tests cover slow-success/timeout/failure paths.
Out of Scope Changes check	✅ Passed	All changes are within scope: i18n updates for onboarding strings, RPC client timeout override feature, coreStateApi/ContextGatheringStep modifications, and comprehensive test coverage all directly support issue `#2156` requirements.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

app/src/pages/onboarding/steps/ContextGatheringStep.tsx (1)

376-434: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Hide the slow-path UI once the pipeline has finished.

After a slow success, stillWorking stays true, so the component can keep showing “still working…” and the ping indicator during the 800ms auto-advance window or while onNext() is still pending. Gate that copy/indicator on !finished && !hasError instead of the raw state.

♻️ Suggested guard

-  const titleKey = stillWorking
+  const showStillWorking = stillWorking && !finished && !hasError;
+
+  const titleKey = showStillWorking
     ? 'onboarding.contextGathering.stillWorkingTitle'
     : 'onboarding.contextGathering.buildingProfile';
-  const descKey = stillWorking
+  const descKey = showStillWorking
     ? 'onboarding.contextGathering.stillWorkingDesc'
     : 'onboarding.contextGathering.buildingDesc';
@@
-        {stillWorking && (
+        {showStillWorking && (
           <div

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/pages/onboarding/steps/ContextGatheringStep.tsx` around lines 376 -
434, The component currently uses stillWorking to show the "still working" copy
and the core alive indicator, which causes the slow-path UI to remain visible
after the pipeline finishes; update the guards so both the title/description
selection and the alive indicator are shown only when stillWorking is true AND
finished is false AND hasError is false (i.e., use !finished && !hasError &&
stillWorking in place of just stillWorking), and keep references to
aliveLabelKey and aliveState for the alive indicator rendering so the visual
state logic remains unchanged.

🧹 Nitpick comments (1)

app/src/services/__tests__/coreRpcClient.test.ts (1)

360-365: ⚡ Quick win

Assert pending state before the timeout boundary in the override/clamp tests.

These tests currently validate only the final timeout message, so an early-abort timing regression can slip through if the message still reflects the override/clamped value. Add a pre-boundary “still pending” assertion before the expected cutoff.

Proposed test hardening

   const pending = callCoreRpc({ method: 'openhuman.app_state_snapshot', timeoutMs: 60_000 });
   pending.catch(() => {});
+  let settled = false;
+  pending.finally(() => {
+    settled = true;
+  });

   // 30s passes — global default would have aborted by now, but the
   // per-call 60s override keeps the request alive.
   await vi.advanceTimersByTimeAsync(31_000);
+  await Promise.resolve();
+  expect(settled).toBe(false);
   // Not yet rejected. Advance to the override boundary.
   await vi.advanceTimersByTimeAsync(30_000);

   const pending = callCoreRpc({
     method: 'openhuman.app_state_snapshot',
     timeoutMs: 2 * 60 * 60 * 1_000,
   });
   pending.catch(() => {});
+  let settled = false;
+  pending.finally(() => {
+    settled = true;
+  });

   const MAX_MS = 10 * 60 * 1_000;
-  await vi.advanceTimersByTimeAsync(MAX_MS + 1);
+  await vi.advanceTimersByTimeAsync(MAX_MS - 1);
+  await Promise.resolve();
+  expect(settled).toBe(false);
+  await vi.advanceTimersByTimeAsync(2);

Also applies to: 400-402

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/services/__tests__/coreRpcClient.test.ts` around lines 360 - 365, Add
an explicit "still pending" assertion before advancing to the per-call timeout
boundary in the override/clamp tests so an early-abort regression is caught;
locate the test cases around the advancing timers (the block that does await
vi.advanceTimersByTimeAsync(31_000) then await
vi.advanceTimersByTimeAsync(30_000)) in coreRpcClient.test.ts and insert a
pre-boundary check that the RPC call/promise is not settled (e.g., the pending
promise or mock callback has not been called) before advancing the final
30_000ms, and make the same change in the analogous test block at the other spot
referenced (around lines 400-402).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src/pages/onboarding/steps/ContextGatheringStep.tsx`:
- Around line 318-342: The probe function can overlap because setInterval fires
every ALIVE_PROBE_INTERVAL_MS regardless of pending async work; add a guard
(e.g., an inFlight boolean inside the useEffect) so probe immediately returns if
a previous probe is still running, set inFlight = true at start and false in a
finally block, and only update aliveState when not cancelled; alternatively
replace setInterval with a recursive setTimeout that awaits probe before
scheduling the next call. Apply the guard to the probe used with getCoreRpcUrl()
and testCoreRpcConnection(), and ensure cleanup still sets cancelled and clears
the timer/interval.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d0222f47-04b2-4686-9c1e-aad27a76eccd

📥 Commits

Reviewing files that changed from the base of the PR and between c25fc8e and c6b910b.

📒 Files selected for processing (18)

app/src/lib/i18n/chunks/ar-4.ts
app/src/lib/i18n/chunks/bn-4.ts
app/src/lib/i18n/chunks/en-4.ts
app/src/lib/i18n/chunks/es-4.ts
app/src/lib/i18n/chunks/fr-4.ts
app/src/lib/i18n/chunks/hi-4.ts
app/src/lib/i18n/chunks/id-4.ts
app/src/lib/i18n/chunks/it-4.ts
app/src/lib/i18n/chunks/pt-4.ts
app/src/lib/i18n/chunks/ru-4.ts
app/src/lib/i18n/chunks/zh-CN-4.ts
app/src/lib/i18n/en.ts
app/src/pages/onboarding/steps/ContextGatheringStep.tsx
app/src/pages/onboarding/steps/__tests__/ContextGatheringStep.test.tsx
app/src/services/__tests__/coreRpcClient.test.ts
app/src/services/coreRpcClient.ts
app/src/services/coreStateApi.test.ts
app/src/services/coreStateApi.ts

- ContextGatheringStep alive probe: add inFlight single-flight guard so overlapping core.ping calls cannot stack when the previous probe is still pending. This prevents stale responses racing aliveState on the unreachable path.
- Hide the still-working title/desc/alive-indicator once finished or hasError flips true. Without this, slow-success users see the still-working copy during the 800ms auto-advance window after the pipeline actually completes.
- coreRpcClient timeout-override + clamp tests: assert the pending promise is still in flight before the expected cutoff so an early-abort timing regression cannot slip through with only the final timeout message matching.

obchain · 2026-05-19T07:31:48Z

Thanks — addressed in 4318b84:

Overlapping probes: added inFlight single-flight guard so consecutive 5s ticks can't stack pending core.ping promises.
Slow-path UI after finish: gated the title/desc/alive-indicator on showStillWorking = stillWorking && !finished && !hasError, so the 800ms auto-advance window no longer shows the "still working…" copy.
Test hardening: added pre-boundary settled === false assertions on the timeoutMs-override and clamp tests so an early-abort regression can't slip through.

All 14 ContextGatheringStep + 72 coreRpcClient + 13 coreStateApi tests green locally.
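
The single-flight guard mentioned above can be sketched in isolation roughly as follows. This is an illustration, not the code from 4318b84: `singleFlight` is a hypothetical wrapper, whereas the PR only describes an `inFlight` guard on the probe itself.

```typescript
// Hypothetical helper: wraps an async task so that a new run is skipped
// while the previous one is still pending.
function singleFlight(task: () => Promise<void>): () => Promise<boolean> {
  let inFlight = false;
  return async () => {
    if (inFlight) return false; // previous run still pending: skip this tick
    inFlight = true;
    try {
      await task();
      return true;
    } finally {
      inFlight = false; // always release, even if the task throws
    }
  };
}
```

A guarded probe built this way could be passed to a 5s interval: ticks that arrive while an earlier probe is pending resolve to `false` without starting a second request. Note that the guard alone does not bound a probe that never settles; that still needs a per-probe timeout.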

CodeGhost21

Reviewed against issue #2156. Acceptance criteria check:

Criterion	Status
False fallback reduced (35–40s succeeds inline)	Met — snapshot + save_profile bumped to 90s.
Timeout policy updated with rationale	Met — `resolvePerCallTimeoutMs` is clamped [1s, 10min]; documented in `coreRpcClient.ts` and `coreStateApi.ts`.
Progress UI added ("still working")	Met — `ContextGatheringStep` swaps copy at 30s.
Failure remains bounded	Met — 90s per-call timeout still aborts; staged UI auto-clears on `hasError`.
Backend alive detection (slow-alive vs unreachable)	Partially met — see inline on the probe loop; the probe has no timeout, so a TCP-black-hole core leaves the indicator stuck at "probing" forever.
Regression tests (slow-success/timeout/failure)	Met — new tests cover override, clamp, default, still-working transition, alive, and unreachable.

Overall the change does what the issue asks. Two correctness items below — one is a UX hole in the unreachable-detection path, the other is a misleading code comment that future maintainers will get burned by. Translations across 12 locales look fine.

CodeGhost21 · 2026-05-19T18:36:02Z

+  // Periodic alive probe while in still-working state. `core.ping` bypasses
+  // bearer auth and resolves quickly even when the busy snapshot RPC is
+  // holding the worker, so a green ping during a slow snapshot is exactly
+  // the alive-but-slow signal users need to see.


Comment is incorrect: core.ping does not bypass bearer auth. Per src/core/auth.rs, POST /rpc always requires the bearer token — only /, /health, /auth/telegram, /schema, /events, /events/webhooks, and /ws/dictation are in PUBLIC_PATHS. The probe happens to work because testCoreRpcConnection does thread Authorization: Bearer <token> through. If the token has not yet been resolved when the probe fires (cold start, IPC race on first launch — exactly the scenario this UI targets), every probe gets a 401 unauthorized and the indicator flips to unreachable even though the core is fine.

Suggest either:

Re-word the comment to drop the "bypasses bearer auth" claim, AND treat HTTP 401 as alive (auth not ready ≠ core down) rather than unreachable; or

Add core.ping to PUBLIC_PATHS in src/core/auth.rs if you actually want the documented behavior — but that's a separate decision that needs its own review.

Fixed in e9451a6: rewrote the comment to drop the false "bypasses bearer auth" claim, and the probe now treats HTTP 401 as alive (cold-start IPC race where the token hasn't been resolved yet ≠ core down). Added test "treats HTTP 401 as alive (auth not ready yet, core is up)".
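
The status mapping described in this fix might look roughly like the sketch below. `classifyProbe`, `ProbeResult`, and the `AliveState` union are hypothetical names; the only behaviour taken from the thread is that a successful response or an HTTP 401 counts as alive and that a failed or aborted probe counts as unreachable. Mapping other non-OK statuses to unreachable mirrors the original `response.ok ? 'alive' : 'unreachable'` line.

```typescript
// 'probing' is the state shown before the first probe has settled.
type AliveState = 'alive' | 'probing' | 'unreachable';

// Hypothetical shape: `null` stands for a probe that threw or was aborted.
type ProbeResult = { ok: boolean; status: number } | null;

function classifyProbe(result: ProbeResult): AliveState {
  // No response at all (network error, abort after timeout).
  if (result === null) return 'unreachable';
  // 401 means the core answered but the bearer token was not ready yet,
  // so the process is up: auth-not-ready is not the same as core-down.
  if (result.ok || result.status === 401) return 'alive';
  return 'unreachable';
}
```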

CodeGhost21 · 2026-05-19T18:36:02Z

+        const url = await getCoreRpcUrl();
+        const response = await testCoreRpcConnection(url);
+        if (!cancelled) {
+          setAliveState(response.ok ? 'alive' : 'unreachable');


testCoreRpcConnection calls raw fetch() with no AbortController (see coreRpcClient.ts:331-345), so it has no timeout. The inline comment a few lines up claims "each probe times out after the global fetch budget" — there is no global fetch budget on this code path; the global budget lives in callCoreRpc, not testCoreRpcConnection.

Consequence: on a TCP-black-hole core (firewall drops SYN, suspended laptop coming back online, etc. — the exact "unreachable" case this indicator exists to surface), the first probe's fetch hangs indefinitely. The inFlight single-flight guard then blocks every subsequent 5s tick, and the user is parked on "Checking core connection…" forever — which is the same UX failure mode #2156 is trying to fix, just one layer up.

Fix: bound the probe with its own AbortController + ~3s timeout, and treat the abort as unreachable. Example:

const probeController = new AbortController();
const probeTimeout = window.setTimeout(() => probeController.abort(), 3_000);
try {
  const response = await testCoreRpcConnection(url, undefined, { signal: probeController.signal });
  // ...
} finally {
  window.clearTimeout(probeTimeout);
}

(testCoreRpcConnection will need an optional RequestInit-ish parameter to forward signal.)

Also worth covering this with a test — current test suite mocks testCoreRpcConnection so the missing timeout is invisible to CI.

Fixed in e9451a6: testCoreRpcConnection now accepts an optional { signal }, and the probe wraps each call in its own AbortController with PROBE_TIMEOUT_MS = 3_000. The finally clears the timeout and the inFlight guard, so a TCP black-hole probe aborts cleanly and the next 5s tick fires. Added test "passes an AbortSignal so a TCP black-hole probe cannot hang forever" that asserts both the signal is forwarded and the state flips to unreachable after the abort.

# Conflicts:
# app/src/lib/i18n/chunks/ar-4.ts
# app/src/lib/i18n/chunks/bn-4.ts
# app/src/lib/i18n/chunks/en-4.ts
# app/src/lib/i18n/chunks/es-4.ts
# app/src/lib/i18n/chunks/fr-4.ts
# app/src/lib/i18n/chunks/id-4.ts
# app/src/lib/i18n/chunks/pt-4.ts
# app/src/lib/i18n/chunks/ru-4.ts
# app/src/lib/i18n/chunks/zh-CN-4.ts
# app/src/lib/i18n/en.ts

…t21 review on tinyhumansai#2179)

- testCoreRpcConnection: accept optional { signal } so callers can bound the raw fetch(). Without this, an unreachable core (TCP black-hole, suspended laptop) hangs the probe forever: the single-flight guard then blocks every 5s tick and the user is parked on "Checking core connection…" indefinitely, the exact failure mode tinyhumansai#2156 fixes one layer up.
- ContextGatheringStep alive probe: wrap each probe with its own AbortController + 3s timeout (PROBE_TIMEOUT_MS).
- Treat HTTP 401 as 'alive': on cold start the bearer token resolution can race the first probe; the response is 401 even though the core is fine. Auth-not-ready ≠ core-down.
- Rewrite the misleading comment that claimed core.ping bypasses bearer auth; it does not (see src/core/auth.rs PUBLIC_PATHS).
- Align en/en-4 errorDesc with the staged-fallback copy the PR tests already assert ("Your chat is ready…"). Other locale chunks left untouched; re-translating is out of scope for this fix.
- Add tests: 401 → alive, AbortSignal forwarded → probe abort → unreachable.

# Conflicts: # app/src/services/coreRpcClient.ts

…nyhumansai#2156) (tinyhumansai#2179) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>

obchain added 4 commits May 19, 2026 12:17

chore: prettier format fixes from pre-push hook

c6b910b

obchain requested a review from a team May 19, 2026 06:53

coderabbitai Bot added the "working" label (A PR that is being worked on by the team.) May 19, 2026

coderabbitai Bot requested changes May 19, 2026

Comment thread app/src/pages/onboarding/steps/ContextGatheringStep.tsx

obchain added 2 commits May 19, 2026 12:58

chore: prettier auto-fixes from pre-push hook

080c04c

coderabbitai Bot previously approved these changes May 19, 2026

CodeGhost21 self-requested a review May 19, 2026 18:29

CodeGhost21 reviewed May 19, 2026

senamakel added 2 commits May 19, 2026 20:02

senamakel dismissed coderabbitai[bot]’s stale review via e9451a6 May 20, 2026 03:11

Merge remote-tracking branch 'upstream/main' into pr/2179

0dbe9df

# Conflicts: # app/src/services/coreRpcClient.ts

coderabbitai Bot added the "feature" label (Net-new user-facing capability or product behavior.) May 20, 2026

coderabbitai Bot approved these changes May 20, 2026

senamakel merged commit 9376ffc into tinyhumansai:main May 20, 2026
28 of 31 checks passed

This was referenced May 20, 2026

fix(onboarding): capture completeAndExit rejection in Sentry (#2081) #2327

Merged

when i press the button "continue to chat",nothing happen. #2081

Closed

coderabbitai Bot mentioned this pull request May 20, 2026

perf(app-state): parallelize runtime snapshot and add per-stage timeouts #2209

Merged

12 tasks

mtkik pushed a commit to mtkik/openhuman-meet that referenced this pull request May 21, 2026

fix(onboarding): raise snapshot timeout + staged still-working UI (ti…

b6726e3

…nyhumansai#2156) (tinyhumansai#2179) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>

coderabbitai Bot mentioned this pull request May 22, 2026

fix(app): normalize cloud core RPC URLs #2480

Merged

12 tasks

CodeGhost21 pushed a commit to CodeGhost21/openhuman that referenced this pull request May 22, 2026

fix(onboarding): raise snapshot timeout + staged still-working UI (ti…

a36647e

…nyhumansai#2156) (tinyhumansai#2179) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>

coderabbitai Bot mentioned this pull request May 23, 2026

fix(auth): deliver OAuth JWT to remote core in cloud mode #2453

Open

13 tasks
