fix(onboarding): raise snapshot timeout + staged still-working UI (#2156)#2179
Conversation
The global CORE_RPC_TIMEOUT_MS (30s) treats every RPC as equally fast/slow, but slow-but-alive paths such as first-launch openhuman.app_state_snapshot legitimately run for 35-40s on M-series Macs while memory tree init, Composio warmup, and other boot work compete for the snapshot critical path. Capping every call at 30s forces those users to a failure-style fallback even though the call would have succeeded a few seconds later. Add an optional timeoutMs override, clamped to the same [1s, 10min] window as the global default, so callers like fetchCoreAppSnapshot can opt into a longer-but-still-bounded budget without changing the default for fast RPCs. Refs tinyhumansai#2156.
…ansai#2156) First-launch openhuman.app_state_snapshot can take 30-40s on slow hardware while memory tree init and Composio registry warmup compete for the snapshot critical path. The previous global 30s timeout killed those calls and parked users on the post-login fallback even though the backend would have answered moments later. Pass SNAPSHOT_TIMEOUT_MS=90s via the new callCoreRpc per-call timeoutMs option so slow-but-alive cores complete inline. Real failures still abort within 90s rather than hanging forever. Refs tinyhumansai#2156.
…hot (tinyhumansai#2156) When the post-login profile build runs past the 30s hardcoded threshold, users today see no signal that the system is still making progress — the existing build animation looks identical to a hang. On a slow-but-alive core that legitimately needs ~40s to complete app_state_snapshot or learning_save_profile, this reads as broken login. Add a staged transition: after STILL_WORKING_THRESHOLD_MS the copy swaps to a calmer 'Still working on your profile…' message, and a lightweight core.ping probe runs on a 5s interval so the indicator can distinguish slow-but-alive vs truly unreachable cores. Continue to chat stays available throughout. Also pass SAVE_PROFILE_TIMEOUT_MS=90s on learning_save_profile so the legitimate slow-success completes inline instead of falling to the error path. i18n strings shipped for 12 locales. Refs tinyhumansai#2156.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (16)
✅ Files skipped from review due to trivial changes (8)
📝 WalkthroughWalkthroughThis PR adds a staged "still working" onboarding UI with periodic core probes, per-call RPC timeout overrides (clamped), a 90s snapshot timeout, tests for timeout/probing behavior, and i18n updates across locale bundles. ChangesOnboarding Staged Loading & Core Reachability
Sequence Diagram(s)sequenceDiagram
participant UI as ContextGatheringStep
participant Pipeline as OnboardingPipeline
participant CoreRpcClient as coreRpcClient
participant Core as CoreRPC
UI->>Pipeline: start profile build
Pipeline->>CoreRpcClient: openhuman.learning_save_profile(timeoutMs=90s)
UI->>CoreRpcClient: getCoreRpcUrl() / testCoreRpcConnection() (periodic while pending)
CoreRpcClient->>Core: probe testCoreRpcConnection (with AbortSignal)
Core-->>CoreRpcClient: respond (200/401 or error)
CoreRpcClient-->>UI: alive/unreachable status
Pipeline-->>UI: completes -> normal UI
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/src/pages/onboarding/steps/ContextGatheringStep.tsx (1)
376-434:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winHide the slow-path UI once the pipeline has finished.
After a slow success,
stillWorkingstaystrue, so the component can keep showing “still working…” and the ping indicator during the 800ms auto-advance window or whileonNext()is still pending. Gate that copy/indicator on!finished && !hasErrorinstead of the raw state.♻️ Suggested guard
- const titleKey = stillWorking + const showStillWorking = stillWorking && !finished && !hasError; + + const titleKey = showStillWorking ? 'onboarding.contextGathering.stillWorkingTitle' : 'onboarding.contextGathering.buildingProfile'; - const descKey = stillWorking + const descKey = showStillWorking ? 'onboarding.contextGathering.stillWorkingDesc' : 'onboarding.contextGathering.buildingDesc'; @@ - {stillWorking && ( + {showStillWorking && ( <div🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src/pages/onboarding/steps/ContextGatheringStep.tsx` around lines 376 - 434, The component currently uses stillWorking to show the "still working" copy and the core alive indicator, which causes the slow-path UI to remain visible after the pipeline finishes; update the guards so both the title/description selection and the alive indicator are shown only when stillWorking is true AND finished is false AND hasError is false (i.e., use !finished && !hasError && stillWorking in place of just stillWorking), and keep references to aliveLabelKey and aliveState for the alive indicator rendering so the visual state logic remains unchanged.
🧹 Nitpick comments (1)
app/src/services/__tests__/coreRpcClient.test.ts (1)
360-365: ⚡ Quick winAssert pending state before the timeout boundary in the override/clamp tests.
These tests currently validate only the final timeout message, so an early-abort timing regression can slip through if the message still reflects the override/clamped value. Add a pre-boundary “still pending” assertion before the expected cutoff.
Proposed test hardening
const pending = callCoreRpc({ method: 'openhuman.app_state_snapshot', timeoutMs: 60_000 }); pending.catch(() => {}); + let settled = false; + pending.finally(() => { + settled = true; + }); // 30s passes — global default would have aborted by now, but the // per-call 60s override keeps the request alive. await vi.advanceTimersByTimeAsync(31_000); + await Promise.resolve(); + expect(settled).toBe(false); // Not yet rejected. Advance to the override boundary. await vi.advanceTimersByTimeAsync(30_000);const pending = callCoreRpc({ method: 'openhuman.app_state_snapshot', timeoutMs: 2 * 60 * 60 * 1_000, }); pending.catch(() => {}); + let settled = false; + pending.finally(() => { + settled = true; + }); const MAX_MS = 10 * 60 * 1_000; - await vi.advanceTimersByTimeAsync(MAX_MS + 1); + await vi.advanceTimersByTimeAsync(MAX_MS - 1); + await Promise.resolve(); + expect(settled).toBe(false); + await vi.advanceTimersByTimeAsync(2);Also applies to: 400-402
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src/services/__tests__/coreRpcClient.test.ts` around lines 360 - 365, Add an explicit "still pending" assertion before advancing to the per-call timeout boundary in the override/clamp tests so an early-abort regression is caught; locate the test cases around the advancing timers (the block that does await vi.advanceTimersByTimeAsync(31_000) then await vi.advanceTimersByTimeAsync(30_000)) in coreRpcClient.test.ts and insert a pre-boundary check that the RPC call/promise is not settled (e.g., the pending promise or mock callback has not been called) before advancing the final 30_000ms, and make the same change in the analogous test block at the other spot referenced (around lines 400-402).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/src/pages/onboarding/steps/ContextGatheringStep.tsx`:
- Around line 318-342: The probe function can overlap because setInterval fires
every ALIVE_PROBE_INTERVAL_MS regardless of pending async work; add a guard
(e.g., an inFlight boolean inside the useEffect) so probe immediately returns if
a previous probe is still running, set inFlight = true at start and false in a
finally block, and only update aliveState when not cancelled; alternatively
replace setInterval with a recursive setTimeout that awaits probe before
scheduling the next call. Apply the guard to the probe used with getCoreRpcUrl()
and testCoreRpcConnection(), and ensure cleanup still sets cancelled and clears
the timer/interval.
---
Outside diff comments:
In `@app/src/pages/onboarding/steps/ContextGatheringStep.tsx`:
- Around line 376-434: The component currently uses stillWorking to show the
"still working" copy and the core alive indicator, which causes the slow-path UI
to remain visible after the pipeline finishes; update the guards so both the
title/description selection and the alive indicator are shown only when
stillWorking is true AND finished is false AND hasError is false (i.e., use
!finished && !hasError && stillWorking in place of just stillWorking), and keep
references to aliveLabelKey and aliveState for the alive indicator rendering so
the visual state logic remains unchanged.
---
Nitpick comments:
In `@app/src/services/__tests__/coreRpcClient.test.ts`:
- Around line 360-365: Add an explicit "still pending" assertion before
advancing to the per-call timeout boundary in the override/clamp tests so an
early-abort regression is caught; locate the test cases around the advancing
timers (the block that does await vi.advanceTimersByTimeAsync(31_000) then await
vi.advanceTimersByTimeAsync(30_000)) in coreRpcClient.test.ts and insert a
pre-boundary check that the RPC call/promise is not settled (e.g., the pending
promise or mock callback has not been called) before advancing the final
30_000ms, and make the same change in the analogous test block at the other spot
referenced (around lines 400-402).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d0222f47-04b2-4686-9c1e-aad27a76eccd
📒 Files selected for processing (18)
app/src/lib/i18n/chunks/ar-4.tsapp/src/lib/i18n/chunks/bn-4.tsapp/src/lib/i18n/chunks/en-4.tsapp/src/lib/i18n/chunks/es-4.tsapp/src/lib/i18n/chunks/fr-4.tsapp/src/lib/i18n/chunks/hi-4.tsapp/src/lib/i18n/chunks/id-4.tsapp/src/lib/i18n/chunks/it-4.tsapp/src/lib/i18n/chunks/pt-4.tsapp/src/lib/i18n/chunks/ru-4.tsapp/src/lib/i18n/chunks/zh-CN-4.tsapp/src/lib/i18n/en.tsapp/src/pages/onboarding/steps/ContextGatheringStep.tsxapp/src/pages/onboarding/steps/__tests__/ContextGatheringStep.test.tsxapp/src/services/__tests__/coreRpcClient.test.tsapp/src/services/coreRpcClient.tsapp/src/services/coreStateApi.test.tsapp/src/services/coreStateApi.ts
- ContextGatheringStep alive probe: add inFlight single-flight guard so overlapping core.ping calls cannot stack when the previous probe is still pending — prevents stale responses racing aliveState on the unreachable path. - Hide the still-working title/desc/alive-indicator once finished or hasError flips true. Without this, slow-success users see the still-working copy during the 800ms auto-advance window after the pipeline actually completes. - coreRpcClient timeout-override + clamp tests: assert the pending promise is still in flight before the expected cutoff so an early-abort timing regression cannot slip through with only the final timeout message matching.
|
Thanks — addressed in 4318b84:
All 14 ContextGatheringStep + 72 coreRpcClient + 13 coreStateApi tests green locally. |
CodeGhost21
left a comment
There was a problem hiding this comment.
Reviewed against issue #2156. Acceptance criteria check:
| Criterion | Status |
|---|---|
| False fallback reduced (35–40s succeeds inline) | Met — snapshot + save_profile bumped to 90s. |
| Timeout policy updated with rationale | Met — resolvePerCallTimeoutMs is clamped [1s, 10min]; documented in coreRpcClient.ts and coreStateApi.ts. |
| Progress UI added ("still working") | Met — ContextGatheringStep swaps copy at 30s. |
| Failure remains bounded | Met — 90s per-call timeout still aborts; staged UI auto-clears on hasError. |
| Backend alive detection (slow-alive vs unreachable) | Partially met — see inline on the probe loop; the probe has no timeout, so a TCP-black-hole core leaves the indicator stuck at "probing" forever. |
| Regression tests (slow-success/timeout/failure) | Met — new tests cover override, clamp, default, still-working transition, alive, and unreachable. |
Overall the change does what the issue asks. Two correctness items below — one is a UX hole in the unreachable-detection path, the other is a misleading code comment that future maintainers will get burned by. Translations across 12 locales look fine.
| // Periodic alive probe while in still-working state. `core.ping` bypasses | ||
| // bearer auth and resolves quickly even when the busy snapshot RPC is | ||
| // holding the worker, so a green ping during a slow snapshot is exactly | ||
| // the alive-but-slow signal users need to see. |
There was a problem hiding this comment.
Comment is incorrect: core.ping does not bypass bearer auth. Per src/core/auth.rs, POST /rpc always requires the bearer token — only /, /health, /auth/telegram, /schema, /events, /events/webhooks, and /ws/dictation are in PUBLIC_PATHS. The probe happens to work because testCoreRpcConnection does thread Authorization: Bearer <token> through. If the token has not yet been resolved when the probe fires (cold start, IPC race on first launch — exactly the scenario this UI targets), every probe gets a 401 unauthorized and the indicator flips to unreachable even though the core is fine.
Suggest either:
- Re-word the comment to drop the "bypasses bearer auth" claim, AND treat HTTP 401 as
alive(auth not ready ≠ core down) rather thanunreachable; or - Add
core.pingtoPUBLIC_PATHSinsrc/core/auth.rsif you actually want the documented behavior — but that's a separate decision that needs its own review.
There was a problem hiding this comment.
Fixed in e9451a6: rewrote the comment to drop the false "bypasses bearer auth" claim, and the probe now treats HTTP 401 as alive (cold-start IPC race where the token hasn't been resolved yet ≠ core down). Added test "treats HTTP 401 as alive (auth not ready yet, core is up)".
| const url = await getCoreRpcUrl(); | ||
| const response = await testCoreRpcConnection(url); | ||
| if (!cancelled) { | ||
| setAliveState(response.ok ? 'alive' : 'unreachable'); |
There was a problem hiding this comment.
testCoreRpcConnection calls raw fetch() with no AbortController (see coreRpcClient.ts:331-345), so it has no timeout. The inline comment a few lines up claims "each probe times out after the global fetch budget" — there is no global fetch budget on this code path; the global budget lives in callCoreRpc, not testCoreRpcConnection.
Consequence: on a TCP-black-hole core (firewall drops SYN, suspended laptop coming back online, etc. — the exact "unreachable" case this indicator exists to surface), the first probe's fetch hangs indefinitely. The inFlight single-flight guard then blocks every subsequent 5s tick, and the user is parked on "Checking core connection…" forever — which is the same UX failure mode #2156 is trying to fix, just one layer up.
Fix: bound the probe with its own AbortController + ~3s timeout, and treat the abort as unreachable. Example:
const probeController = new AbortController();
const probeTimeout = window.setTimeout(() => probeController.abort(), 3_000);
try {
const response = await testCoreRpcConnection(url, undefined, { signal: probeController.signal });
// ...
} finally {
window.clearTimeout(probeTimeout);
}(testCoreRpcConnection will need an optional RequestInit-ish parameter to forward signal.)
Also worth covering this with a test — current test suite mocks testCoreRpcConnection so the missing timeout is invisible to CI.
There was a problem hiding this comment.
Fixed in e9451a6: testCoreRpcConnection now accepts an optional { signal }, and the probe wraps each call in its own AbortController with PROBE_TIMEOUT_MS = 3_000. The finally clears the timeout and the inFlight guard, so a TCP black-hole probe aborts cleanly and the next 5s tick fires. Added test "passes an AbortSignal so a TCP black-hole probe cannot hang forever" that asserts both the signal is forwarded and the state flips to unreachable after the abort.
# Conflicts: # app/src/lib/i18n/chunks/ar-4.ts # app/src/lib/i18n/chunks/bn-4.ts # app/src/lib/i18n/chunks/en-4.ts # app/src/lib/i18n/chunks/es-4.ts # app/src/lib/i18n/chunks/fr-4.ts # app/src/lib/i18n/chunks/id-4.ts # app/src/lib/i18n/chunks/pt-4.ts # app/src/lib/i18n/chunks/ru-4.ts # app/src/lib/i18n/chunks/zh-CN-4.ts # app/src/lib/i18n/en.ts
…t21 review on tinyhumansai#2179) - testCoreRpcConnection: accept optional { signal } so callers can bound the raw fetch(). Without this, an unreachable core (TCP black-hole, suspended laptop) hangs the probe forever — the single-flight guard then blocks every 5s tick and the user is parked on "Checking core connection…" indefinitely, the exact failure mode tinyhumansai#2156 fixes one layer up. - ContextGatheringStep alive probe: wrap each probe with its own AbortController + 3s timeout (PROBE_TIMEOUT_MS). - Treat HTTP 401 as 'alive': on cold start the bearer token resolution can race the first probe; the response is 401 even though the core is fine. Auth-not-ready ≠ core-down. - Rewrite the misleading comment that claimed core.ping bypasses bearer auth — it does not (see src/core/auth.rs PUBLIC_PATHS). - Align en/en-4 errorDesc with the staged-fallback copy the PR tests already assert ("Your chat is ready…"). Other locale chunks left untouched — re-translating is out of scope for this fix. - Add tests: 401 → alive, AbortSignal forwarded → probe abort → unreachable.
# Conflicts: # app/src/services/coreRpcClient.ts
…nyhumansai#2156) (tinyhumansai#2179) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
…nyhumansai#2156) (tinyhumansai#2179) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
Summary
timeoutMsoverride tocallCoreRpcso slow-but-alive RPCs can opt into a longer-but-still-bounded budget without changing the 30s default for fast calls.openhuman.app_state_snapshotandlearning_save_profiletimeouts to 90s — they legitimately run 30–40s on M-series Macs while memory tree init + Composio warmup compete for the snapshot critical path.core.pingalive indicator that distinguishes slow-but-alive from truly-unreachable cores.Problem
On slow first-launches (reported on M-series Mac Mini, macOS 26.1),
openhuman.app_state_snapshotcompletes successfully but takes 32–37s while heavy boot work (memory tree init, Composio registry warmup, mascot rasterization, scheduler gate at 97% CPU) competes for the snapshot path. The global 30s RPC timeout kills the call mid-flight and parks users on the post-login profile-build fallback even though the backend would have answered moments later. Users read the fallback as broken login.The existing build-profile UI also gives no signal during a long wait, so a slow-but-alive 35–40s snapshot looks identical to a hang.
Solution
callCoreRpclearns an optionaltimeoutMsparameter, clamped to the same[1s, 10min]window as the global default. Default behaviour is unchanged for the ~hundred existing call sites.fetchCoreAppSnapshotandsaveProfilepass90_000so the first-launch slow-success completes inline. Real failures still abort within 90s.ContextGatheringStepadds a 30s threshold timer that flips the title/description to a calmer "still working" variant and starts a 5score.pingprobe loop. The probe's result drives a small dot indicator (alive / probing / unreachable) so users can tell whether they should wait or take action.Submission Checklist
## Related— N/A: no coverage-matrix feature ID changed.docs/RELEASE-MANUAL-SMOKE.md) — N/A: narrow onboarding fallback behaviour change.Closes #NNNin the## RelatedsectionImpact
onboarding.contextGathering.*keys (stillWorkingTitle,stillWorkingDesc,coreAlive,coreAliveProbing,coreUnreachable).Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/2156-snapshot-timeout-staged-fallbackc6b910b7f7a1a12969d271226dadc9946acf6690Validation Run
pnpm --filter openhuman-app format:check— pre-push hook ran prettier; format-only fixes amended in.pnpm typecheck— clean.pnpm debug unit ContextGatheringStep14/14,pnpm debug unit coreRpcClient72/72,pnpm debug unit coreStateApi13/13.Validation Blocked
command:N/Aerror:N/Aimpact:N/ABehavior Changes
Parity Contract
callCoreRpccallers that do not passtimeoutMscontinue to use the 30s global default. The 90s budget is still bounded; nothing can hang forever.Summary by CodeRabbit
New Features
Translations