Skip to content

fix(onboarding): raise snapshot timeout + staged still-working UI (#2156)#2179

Merged
senamakel merged 9 commits into
tinyhumansai:mainfrom
obchain:fix/2156-snapshot-timeout-staged-fallback
May 20, 2026
Merged

fix(onboarding): raise snapshot timeout + staged still-working UI (#2156)#2179
senamakel merged 9 commits into
tinyhumansai:mainfrom
obchain:fix/2156-snapshot-timeout-staged-fallback

Conversation

@obchain
Copy link
Copy Markdown
Contributor

@obchain obchain commented May 19, 2026

Summary

  • Add per-call timeoutMs override to callCoreRpc so slow-but-alive RPCs can opt into a longer-but-still-bounded budget without changing the 30s default for fast calls.
  • Raise the first-launch openhuman.app_state_snapshot and learning_save_profile timeouts to 90s — they legitimately run 30–40s on M-series Macs while memory tree init + Composio warmup compete for the snapshot critical path.
  • Stage the onboarding profile-build UI: after 30s of pending pipeline the copy swaps to a calmer "Still working on your profile…" message with a core.ping alive indicator that distinguishes slow-but-alive from truly-unreachable cores.
  • Continue to chat stays available throughout — no UX regression on the existing escape hatch.

Problem

On slow first-launches (reported on M-series Mac Mini, macOS 26.1), openhuman.app_state_snapshot completes successfully but takes 32–37s while heavy boot work (memory tree init, Composio registry warmup, mascot rasterization, scheduler gate at 97% CPU) competes for the snapshot path. The global 30s RPC timeout kills the call mid-flight and parks users on the post-login profile-build fallback even though the backend would have answered moments later. Users read the fallback as broken login.

The existing build-profile UI also gives no signal during a long wait, so a slow-but-alive 35–40s snapshot looks identical to a hang.

Solution

  1. callCoreRpc learns an optional timeoutMs parameter, clamped to the same [1s, 10min] window as the global default. Default behaviour is unchanged for the ~hundred existing call sites.
  2. fetchCoreAppSnapshot and saveProfile pass 90_000 so the first-launch slow-success completes inline. Real failures still abort within 90s.
  3. ContextGatheringStep adds a 30s threshold timer that flips the title/description to a calmer "still working" variant and starts a 5s core.ping probe loop. The probe's result drives a small dot indicator (alive / probing / unreachable) so users can tell whether they should wait or take action.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • Diff coverage ≥ 80% — CI is expected to validate this; locally Vitest covers all three new code paths (per-call timeout override, snapshot timeout threading, staged UI with alive/unreachable probe).
  • Coverage matrix updated — N/A: behaviour-only change on the existing onboarding context-gathering flow; no new feature row added/removed/renamed.
  • All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: no coverage-matrix feature ID changed.
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces (docs/RELEASE-MANUAL-SMOKE.md) — N/A: narrow onboarding fallback behaviour change.
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Desktop onboarding UI + frontend RPC client. No persistence, security, migration, or backend changes.
  • The 90s budget is still bounded; nothing can hang forever.
  • Fast-path RPCs continue to honour the 30s global default — no behaviour change for the existing ~100 call sites that do not opt into a longer timeout.
  • 12 locales updated with the 5 new onboarding.contextGathering.* keys (stillWorkingTitle, stillWorkingDesc, coreAlive, coreAliveProbing, coreUnreachable).

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

Keep this section for AI-authored PRs. For human-only PRs, mark each field N/A.

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/2156-snapshot-timeout-staged-fallback
  • Commit SHA: c6b910b7f7a1a12969d271226dadc9946acf6690

Validation Run

  • pnpm --filter openhuman-app format:check — pre-push hook ran prettier; format-only fixes amended in.
  • pnpm typecheck — clean.
  • Focused tests: pnpm debug unit ContextGatheringStep 14/14, pnpm debug unit coreRpcClient 72/72, pnpm debug unit coreStateApi 13/13.
  • Rust fmt/check (if changed): N/A — no Rust changes.
  • Tauri fmt/check (if changed): N/A — no Tauri changes.

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: first-launch profile build no longer treats a slow-but-alive 35–40s snapshot as a failure; a calmer staged UI with an alive-indicator replaces the silent wait between 30s and ~90s.
  • User-visible effect: post-login "Almost there!" fallback fires far less often; when the wait genuinely is long, users see honest progress copy plus a core-status dot, with continue-to-chat still available throughout.

Parity Contract

  • Legacy behavior preserved: pre-30s happy path renders identical copy ("Building your profile…"). Continue-to-chat button still appears throughout. Existing error fallback unchanged.
  • Guard/fallback/dispatch parity checks: callCoreRpc callers that do not pass timeoutMs continue to use the 30s global default. The 90s budget is still bounded; nothing can hang forever.

Summary by CodeRabbit

  • New Features

    • A "Still working" onboarding state appears after ~30s for slow first-launches and shows core reachability status to indicate progress.
    • Longer per-operation timeouts reduce failures during initial profile build and first-run loading.
  • Translations

    • Updated localized copy for the onboarding context/gathering flow across multiple languages.

Review Change Stack

obchain added 4 commits May 19, 2026 12:17
The global CORE_RPC_TIMEOUT_MS (30s) treats every RPC as equally
fast/slow, but slow-but-alive paths such as first-launch
openhuman.app_state_snapshot legitimately run for 35-40s on M-series
Macs while memory tree init, Composio warmup, and other boot work
compete for the snapshot critical path. Capping every call at 30s
forces those users to a failure-style fallback even though the call
would have succeeded a few seconds later.

Add an optional timeoutMs override, clamped to the same [1s, 10min]
window as the global default, so callers like fetchCoreAppSnapshot
can opt into a longer-but-still-bounded budget without changing the
default for fast RPCs.

Refs tinyhumansai#2156.
…ansai#2156)

First-launch openhuman.app_state_snapshot can take 30-40s on slow
hardware while memory tree init and Composio registry warmup compete
for the snapshot critical path. The previous global 30s timeout
killed those calls and parked users on the post-login fallback even
though the backend would have answered moments later.

Pass SNAPSHOT_TIMEOUT_MS=90s via the new callCoreRpc per-call
timeoutMs option so slow-but-alive cores complete inline. Real
failures still abort within 90s rather than hanging forever.

Refs tinyhumansai#2156.
…hot (tinyhumansai#2156)

When the post-login profile build runs past the 30s hardcoded
threshold, users today see no signal that the system is still
making progress — the existing build animation looks identical to
a hang. On a slow-but-alive core that legitimately needs ~40s to
complete app_state_snapshot or learning_save_profile, this reads
as broken login.

Add a staged transition: after STILL_WORKING_THRESHOLD_MS the
copy swaps to a calmer 'Still working on your profile…' message,
and a lightweight core.ping probe runs on a 5s interval so the
indicator can distinguish slow-but-alive vs truly unreachable
cores. Continue to chat stays available throughout.

Also pass SAVE_PROFILE_TIMEOUT_MS=90s on learning_save_profile so
the legitimate slow-success completes inline instead of falling to
the error path.

i18n strings shipped for 12 locales.

Refs tinyhumansai#2156.
@obchain obchain requested a review from a team May 19, 2026 06:53
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 13886009-1bc8-4ddd-b62f-a25284a885e3

📥 Commits

Reviewing files that changed from the base of the PR and between 080c04c and 0dbe9df.

📒 Files selected for processing (16)
  • app/src/lib/i18n/chunks/ar-4.ts
  • app/src/lib/i18n/chunks/bn-4.ts
  • app/src/lib/i18n/chunks/en-4.ts
  • app/src/lib/i18n/chunks/es-4.ts
  • app/src/lib/i18n/chunks/fr-4.ts
  • app/src/lib/i18n/chunks/hi-4.ts
  • app/src/lib/i18n/chunks/id-4.ts
  • app/src/lib/i18n/chunks/it-4.ts
  • app/src/lib/i18n/chunks/pt-4.ts
  • app/src/lib/i18n/chunks/ru-4.ts
  • app/src/lib/i18n/chunks/zh-CN-4.ts
  • app/src/lib/i18n/en.ts
  • app/src/pages/onboarding/steps/ContextGatheringStep.tsx
  • app/src/pages/onboarding/steps/__tests__/ContextGatheringStep.test.tsx
  • app/src/services/__tests__/coreRpcClient.test.ts
  • app/src/services/coreRpcClient.ts
✅ Files skipped from review due to trivial changes (8)
  • app/src/lib/i18n/chunks/en-4.ts
  • app/src/lib/i18n/chunks/ar-4.ts
  • app/src/lib/i18n/chunks/bn-4.ts
  • app/src/lib/i18n/chunks/zh-CN-4.ts
  • app/src/lib/i18n/en.ts
  • app/src/lib/i18n/chunks/ru-4.ts
  • app/src/lib/i18n/chunks/pt-4.ts
  • app/src/lib/i18n/chunks/it-4.ts

📝 Walkthrough

Walkthrough

This PR adds a staged "still working" onboarding UI with periodic core probes, per-call RPC timeout overrides (clamped), a 90s snapshot timeout, tests for timeout/probing behavior, and i18n updates across locale bundles.

Changes

Onboarding Staged Loading & Core Reachability

Layer / File(s) Summary
Multi-language translations for staged loading UI
app/src/lib/i18n/chunks/*-4.ts, app/src/lib/i18n/en.ts
Translation keys added/updated for onboarding.contextGathering: coreAlive, coreAliveProbing, coreUnreachable, stillWorkingDesc, stillWorkingTitle, and updated errorDesc.
Per-call RPC timeout override infrastructure & tests
app/src/services/coreRpcClient.ts, app/src/services/__tests__/coreRpcClient.test.ts
callCoreRpc accepts optional timeoutMs validated/clamped to bounds; fetch abort uses effective timeout and surfaces it in errors. testCoreRpcConnection accepts an AbortSignal. Tests cover override, clamp, and fallback.
Snapshot RPC timeout extension to 90 seconds
app/src/services/coreStateApi.ts, app/src/services/coreStateApi.test.ts
Added exported SNAPSHOT_TIMEOUT_MS = 90_000 and pass it as timeoutMs to openhuman.app_state_snapshot; tests assert the timeout is passed and within bounds.
ContextGatheringStep staged loading & core probing implementation
app/src/pages/onboarding/steps/ContextGatheringStep.tsx
Component flips to a "still working" UI after ~30s while pipeline pending, periodically probes core reachability with abort-bounded probes and single-flight guarding, displays alive/unreachable indicator, and calls learning_save_profile with extended timeout.
Test setup and staged loading UI test suite
app/src/pages/onboarding/steps/__tests__/ContextGatheringStep.test.tsx
Expanded coreRpcClient mock (getCoreRpcUrl, testCoreRpcConnection) and added staged tests validating the 90s timeout override, UI transition after ~30s, core probe outcomes, and probe abort behavior.

Sequence Diagram(s)

sequenceDiagram
  participant UI as ContextGatheringStep
  participant Pipeline as OnboardingPipeline
  participant CoreRpcClient as coreRpcClient
  participant Core as CoreRPC
  UI->>Pipeline: start profile build
  Pipeline->>CoreRpcClient: openhuman.learning_save_profile(timeoutMs=90s)
  UI->>CoreRpcClient: getCoreRpcUrl() / testCoreRpcConnection() (periodic while pending)
  CoreRpcClient->>Core: probe testCoreRpcConnection (with AbortSignal)
  Core-->>CoreRpcClient: respond (200/401 or error)
  CoreRpcClient-->>UI: alive/unreachable status
  Pipeline-->>UI: completes -> normal UI
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

feature

Suggested reviewers

  • graycyrus

Poem

🐰 I tweak the waits and hush the fright,

"Still working..." glows through patient night,
I ping the core with gentle taps,
Abort the stalled and skip the traps,
Slow builds finish — huzzah, delight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: raising snapshot timeout and adding staged still-working UI for onboarding.
Linked Issues check ✅ Passed All acceptance criteria from issue #2156 are met: false fallback reduced via 90s timeout override, progress UI added with 30s threshold, backend alive detection via core.ping probes, failure remains bounded at 10min max, and tests cover slow-success/timeout/failure paths.
Out of Scope Changes check ✅ Passed All changes are within scope: i18n updates for onboarding strings, RPC client timeout override feature, coreStateApi/ContextGatheringStep modifications, and comprehensive test coverage all directly support issue #2156 requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 19, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/src/pages/onboarding/steps/ContextGatheringStep.tsx (1)

376-434: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Hide the slow-path UI once the pipeline has finished.

After a slow success, stillWorking stays true, so the component can keep showing “still working…” and the ping indicator during the 800ms auto-advance window or while onNext() is still pending. Gate that copy/indicator on !finished && !hasError instead of the raw state.

♻️ Suggested guard
-  const titleKey = stillWorking
+  const showStillWorking = stillWorking && !finished && !hasError;
+
+  const titleKey = showStillWorking
     ? 'onboarding.contextGathering.stillWorkingTitle'
     : 'onboarding.contextGathering.buildingProfile';
-  const descKey = stillWorking
+  const descKey = showStillWorking
     ? 'onboarding.contextGathering.stillWorkingDesc'
     : 'onboarding.contextGathering.buildingDesc';
@@
-        {stillWorking && (
+        {showStillWorking && (
           <div
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/pages/onboarding/steps/ContextGatheringStep.tsx` around lines 376 -
434, The component currently uses stillWorking to show the "still working" copy
and the core alive indicator, which causes the slow-path UI to remain visible
after the pipeline finishes; update the guards so both the title/description
selection and the alive indicator are shown only when stillWorking is true AND
finished is false AND hasError is false (i.e., use !finished && !hasError &&
stillWorking in place of just stillWorking), and keep references to
aliveLabelKey and aliveState for the alive indicator rendering so the visual
state logic remains unchanged.
🧹 Nitpick comments (1)
app/src/services/__tests__/coreRpcClient.test.ts (1)

360-365: ⚡ Quick win

Assert pending state before the timeout boundary in the override/clamp tests.

These tests currently validate only the final timeout message, so an early-abort timing regression can slip through if the message still reflects the override/clamped value. Add a pre-boundary “still pending” assertion before the expected cutoff.

Proposed test hardening
   const pending = callCoreRpc({ method: 'openhuman.app_state_snapshot', timeoutMs: 60_000 });
   pending.catch(() => {});
+  let settled = false;
+  pending.finally(() => {
+    settled = true;
+  });

   // 30s passes — global default would have aborted by now, but the
   // per-call 60s override keeps the request alive.
   await vi.advanceTimersByTimeAsync(31_000);
+  await Promise.resolve();
+  expect(settled).toBe(false);
   // Not yet rejected. Advance to the override boundary.
   await vi.advanceTimersByTimeAsync(30_000);
   const pending = callCoreRpc({
     method: 'openhuman.app_state_snapshot',
     timeoutMs: 2 * 60 * 60 * 1_000,
   });
   pending.catch(() => {});
+  let settled = false;
+  pending.finally(() => {
+    settled = true;
+  });

   const MAX_MS = 10 * 60 * 1_000;
-  await vi.advanceTimersByTimeAsync(MAX_MS + 1);
+  await vi.advanceTimersByTimeAsync(MAX_MS - 1);
+  await Promise.resolve();
+  expect(settled).toBe(false);
+  await vi.advanceTimersByTimeAsync(2);

Also applies to: 400-402

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/services/__tests__/coreRpcClient.test.ts` around lines 360 - 365, Add
an explicit "still pending" assertion before advancing to the per-call timeout
boundary in the override/clamp tests so an early-abort regression is caught;
locate the test cases around the advancing timers (the block that does await
vi.advanceTimersByTimeAsync(31_000) then await
vi.advanceTimersByTimeAsync(30_000)) in coreRpcClient.test.ts and insert a
pre-boundary check that the RPC call/promise is not settled (e.g., the pending
promise or mock callback has not been called) before advancing the final
30_000ms, and make the same change in the analogous test block at the other spot
referenced (around lines 400-402).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src/pages/onboarding/steps/ContextGatheringStep.tsx`:
- Around line 318-342: The probe function can overlap because setInterval fires
every ALIVE_PROBE_INTERVAL_MS regardless of pending async work; add a guard
(e.g., an inFlight boolean inside the useEffect) so probe immediately returns if
a previous probe is still running, set inFlight = true at start and false in a
finally block, and only update aliveState when not cancelled; alternatively
replace setInterval with a recursive setTimeout that awaits probe before
scheduling the next call. Apply the guard to the probe used with getCoreRpcUrl()
and testCoreRpcConnection(), and ensure cleanup still sets cancelled and clears
the timer/interval.

---

Outside diff comments:
In `@app/src/pages/onboarding/steps/ContextGatheringStep.tsx`:
- Around line 376-434: The component currently uses stillWorking to show the
"still working" copy and the core alive indicator, which causes the slow-path UI
to remain visible after the pipeline finishes; update the guards so both the
title/description selection and the alive indicator are shown only when
stillWorking is true AND finished is false AND hasError is false (i.e., use
!finished && !hasError && stillWorking in place of just stillWorking), and keep
references to aliveLabelKey and aliveState for the alive indicator rendering so
the visual state logic remains unchanged.

---

Nitpick comments:
In `@app/src/services/__tests__/coreRpcClient.test.ts`:
- Around line 360-365: Add an explicit "still pending" assertion before
advancing to the per-call timeout boundary in the override/clamp tests so an
early-abort regression is caught; locate the test cases around the advancing
timers (the block that does await vi.advanceTimersByTimeAsync(31_000) then await
vi.advanceTimersByTimeAsync(30_000)) in coreRpcClient.test.ts and insert a
pre-boundary check that the RPC call/promise is not settled (e.g., the pending
promise or mock callback has not been called) before advancing the final
30_000ms, and make the same change in the analogous test block at the other spot
referenced (around lines 400-402).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d0222f47-04b2-4686-9c1e-aad27a76eccd

📥 Commits

Reviewing files that changed from the base of the PR and between c25fc8e and c6b910b.

📒 Files selected for processing (18)
  • app/src/lib/i18n/chunks/ar-4.ts
  • app/src/lib/i18n/chunks/bn-4.ts
  • app/src/lib/i18n/chunks/en-4.ts
  • app/src/lib/i18n/chunks/es-4.ts
  • app/src/lib/i18n/chunks/fr-4.ts
  • app/src/lib/i18n/chunks/hi-4.ts
  • app/src/lib/i18n/chunks/id-4.ts
  • app/src/lib/i18n/chunks/it-4.ts
  • app/src/lib/i18n/chunks/pt-4.ts
  • app/src/lib/i18n/chunks/ru-4.ts
  • app/src/lib/i18n/chunks/zh-CN-4.ts
  • app/src/lib/i18n/en.ts
  • app/src/pages/onboarding/steps/ContextGatheringStep.tsx
  • app/src/pages/onboarding/steps/__tests__/ContextGatheringStep.test.tsx
  • app/src/services/__tests__/coreRpcClient.test.ts
  • app/src/services/coreRpcClient.ts
  • app/src/services/coreStateApi.test.ts
  • app/src/services/coreStateApi.ts

Comment thread app/src/pages/onboarding/steps/ContextGatheringStep.tsx
obchain added 2 commits May 19, 2026 12:58
- ContextGatheringStep alive probe: add inFlight single-flight guard so
  overlapping core.ping calls cannot stack when the previous probe is
  still pending — prevents stale responses racing aliveState on the
  unreachable path.
- Hide the still-working title/desc/alive-indicator once finished or
  hasError flips true. Without this, slow-success users see the
  still-working copy during the 800ms auto-advance window after the
  pipeline actually completes.
- coreRpcClient timeout-override + clamp tests: assert the pending
  promise is still in flight before the expected cutoff so an
  early-abort timing regression cannot slip through with only the
  final timeout message matching.
@obchain
Copy link
Copy Markdown
Contributor Author

obchain commented May 19, 2026

Thanks — addressed in 4318b84:

  • Overlapping probes: added inFlight single-flight guard so consecutive 5s ticks can't stack pending core.ping promises.
  • Slow-path UI after finish: gated the title/desc/alive-indicator on showStillWorking = stillWorking && !finished && !hasError, so the 800ms auto-advance window no longer shows the "still working…" copy.
  • Test hardening: added pre-boundary settled === false assertions on the timeoutMs-override and clamp tests so an early-abort regression can't slip through.

All 14 ContextGatheringStep + 72 coreRpcClient + 13 coreStateApi tests green locally.

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 19, 2026
@CodeGhost21 CodeGhost21 self-requested a review May 19, 2026 18:29
Copy link
Copy Markdown
Contributor

@CodeGhost21 CodeGhost21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed against issue #2156. Acceptance criteria check:

Criterion Status
False fallback reduced (35–40s succeeds inline) Met — snapshot + save_profile bumped to 90s.
Timeout policy updated with rationale Met — resolvePerCallTimeoutMs is clamped [1s, 10min]; documented in coreRpcClient.ts and coreStateApi.ts.
Progress UI added ("still working") Met — ContextGatheringStep swaps copy at 30s.
Failure remains bounded Met — 90s per-call timeout still aborts; staged UI auto-clears on hasError.
Backend alive detection (slow-alive vs unreachable) Partially met — see inline on the probe loop; the probe has no timeout, so a TCP-black-hole core leaves the indicator stuck at "probing" forever.
Regression tests (slow-success/timeout/failure) Met — new tests cover override, clamp, default, still-working transition, alive, and unreachable.

Overall the change does what the issue asks. Two correctness items below — one is a UX hole in the unreachable-detection path, the other is a misleading code comment that future maintainers will get burned by. Translations across 12 locales look fine.

// Periodic alive probe while in still-working state. `core.ping` bypasses
// bearer auth and resolves quickly even when the busy snapshot RPC is
// holding the worker, so a green ping during a slow snapshot is exactly
// the alive-but-slow signal users need to see.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is incorrect: core.ping does not bypass bearer auth. Per src/core/auth.rs, POST /rpc always requires the bearer token — only /, /health, /auth/telegram, /schema, /events, /events/webhooks, and /ws/dictation are in PUBLIC_PATHS. The probe happens to work because testCoreRpcConnection does thread Authorization: Bearer <token> through. If the token has not yet been resolved when the probe fires (cold start, IPC race on first launch — exactly the scenario this UI targets), every probe gets a 401 unauthorized and the indicator flips to unreachable even though the core is fine.

Suggest either:

  • Re-word the comment to drop the "bypasses bearer auth" claim, AND treat HTTP 401 as alive (auth not ready ≠ core down) rather than unreachable; or
  • Add core.ping to PUBLIC_PATHS in src/core/auth.rs if you actually want the documented behavior — but that's a separate decision that needs its own review.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e9451a6: rewrote the comment to drop the false "bypasses bearer auth" claim, and the probe now treats HTTP 401 as alive (cold-start IPC race where the token hasn't been resolved yet ≠ core down). Added test "treats HTTP 401 as alive (auth not ready yet, core is up)".

const url = await getCoreRpcUrl();
const response = await testCoreRpcConnection(url);
if (!cancelled) {
setAliveState(response.ok ? 'alive' : 'unreachable');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testCoreRpcConnection calls raw fetch() with no AbortController (see coreRpcClient.ts:331-345), so it has no timeout. The inline comment a few lines up claims "each probe times out after the global fetch budget" — there is no global fetch budget on this code path; the global budget lives in callCoreRpc, not testCoreRpcConnection.

Consequence: on a TCP-black-hole core (firewall drops SYN, suspended laptop coming back online, etc. — the exact "unreachable" case this indicator exists to surface), the first probe's fetch hangs indefinitely. The inFlight single-flight guard then blocks every subsequent 5s tick, and the user is parked on "Checking core connection…" forever — which is the same UX failure mode #2156 is trying to fix, just one layer up.

Fix: bound the probe with its own AbortController + ~3s timeout, and treat the abort as unreachable. Example:

const probeController = new AbortController();
const probeTimeout = window.setTimeout(() => probeController.abort(), 3_000);
try {
  const response = await testCoreRpcConnection(url, undefined, { signal: probeController.signal });
  // ...
} finally {
  window.clearTimeout(probeTimeout);
}

(testCoreRpcConnection will need an optional RequestInit-ish parameter to forward signal.)

Also worth covering this with a test — current test suite mocks testCoreRpcConnection so the missing timeout is invisible to CI.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e9451a6: testCoreRpcConnection now accepts an optional { signal }, and the probe wraps each call in its own AbortController with PROBE_TIMEOUT_MS = 3_000. The finally clears the timeout and the inFlight guard, so a TCP black-hole probe aborts cleanly and the next 5s tick fires. Added test "passes an AbortSignal so a TCP black-hole probe cannot hang forever" that asserts both the signal is forwarded and the state flips to unreachable after the abort.

senamakel added 2 commits May 19, 2026 20:02
# Conflicts:
#	app/src/lib/i18n/chunks/ar-4.ts
#	app/src/lib/i18n/chunks/bn-4.ts
#	app/src/lib/i18n/chunks/en-4.ts
#	app/src/lib/i18n/chunks/es-4.ts
#	app/src/lib/i18n/chunks/fr-4.ts
#	app/src/lib/i18n/chunks/id-4.ts
#	app/src/lib/i18n/chunks/pt-4.ts
#	app/src/lib/i18n/chunks/ru-4.ts
#	app/src/lib/i18n/chunks/zh-CN-4.ts
#	app/src/lib/i18n/en.ts
…t21 review on tinyhumansai#2179)

- testCoreRpcConnection: accept optional { signal } so callers can bound
  the raw fetch(). Without this, an unreachable core (TCP black-hole,
  suspended laptop) hangs the probe forever — the single-flight guard then
  blocks every 5s tick and the user is parked on "Checking core
  connection…" indefinitely, the exact failure mode tinyhumansai#2156 fixes one layer
  up.
- ContextGatheringStep alive probe: wrap each probe with its own
  AbortController + 3s timeout (PROBE_TIMEOUT_MS).
- Treat HTTP 401 as 'alive': on cold start the bearer token resolution can
  race the first probe; the response is 401 even though the core is fine.
  Auth-not-ready ≠ core-down.
- Rewrite the misleading comment that claimed core.ping bypasses bearer
  auth — it does not (see src/core/auth.rs PUBLIC_PATHS).
- Align en/en-4 errorDesc with the staged-fallback copy the PR tests
  already assert ("Your chat is ready…"). Other locale chunks left
  untouched — re-translating is out of scope for this fix.
- Add tests: 401 → alive, AbortSignal forwarded → probe abort → unreachable.
# Conflicts:
#	app/src/services/coreRpcClient.ts
@coderabbitai coderabbitai Bot added the feature Net-new user-facing capability or product behavior. label May 20, 2026
@senamakel senamakel merged commit 9376ffc into tinyhumansai:main May 20, 2026
28 of 31 checks passed
mtkik pushed a commit to mtkik/openhuman-meet that referenced this pull request May 21, 2026
CodeGhost21 pushed a commit to CodeGhost21/openhuman that referenced this pull request May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Net-new user-facing capability or product behavior. working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Slow profile snapshot triggers fallback before backend succeeds

3 participants