Skip to content

FEA-1363: Fix SQLite write contention causing dashboard corruption#234

Open
thadeusb wants to merge 2 commits into
mainfrom
feat/fea-1363
Open

FEA-1363: Fix SQLite write contention causing dashboard corruption#234
thadeusb wants to merge 2 commits into
mainfrom
feat/fea-1363

Conversation

@thadeusb
Copy link
Copy Markdown
Contributor

Feature

FEA-1363 — Fix SQLite write contention causing dashboard corruption under 22+ concurrent agents

Summary

  • Use BEGIN IMMEDIATE instead of BEGIN DEFERRED in compat-sqlite transaction wrapper to prevent deadlocks
  • Move transcriptCache.extract() disk I/O outside the SQLite write transaction (50-200ms lock hold per event)
  • Add setImmediate-based write queue to batch concurrent hook events into single transactions (~20x contention reduction)
  • Wrap watchdog read-check-write cycle in atomic transaction to prevent stale-state decisions
  • Bump agent-session-sync-service busy_timeout from 1000ms to 5000ms

Architecture

flowchart TD
    A[Hook POST] --> B{Validate}
    B -->|Valid| C[Extract transcript outside lock]
    C --> D[Enqueue + respond immediately]
    D --> E[setImmediate drain]
    E --> F[BEGIN IMMEDIATE - batch]
    F --> G[processEventCore per item with try/catch]
    G --> H[COMMIT]
    H --> I[Broadcast + plan upserts]
    J[Watchdog 15s] --> K[Extract transcripts outside tx]
    K --> L[BEGIN IMMEDIATE - atomic read+write]
    L --> M[COMMIT + broadcast]
Loading

Plan

PLN-705 — Challenge: Sonnet PASS, Opus PASS, Codex FAIL (all findings addressed in v2)

Key Decisions

  • POST /event now responds { ok: true } before DB write commits (CLI only checks HTTP status)
  • processEventCore extracted as unwrapped function; processEvent wrapper kept for potential future single-event callers
  • All changes flow through the build/patch layer — no manual .generated/ edits

Decision Table

Verified at .closedloop-ai/decision-tables/fea-1363.md — all 20 delta checklist items and 7 acceptance criteria passed.

Review Summary

4 parallel reviewers: correctness (Opus) PASS, security (Sonnet) PASS, conventions (Opus) PASS, Codex FAIL. 6 findings total, 2 fixed (outer try/catch on drainHookQueue, restored SubagentJsonlImported broadcast), 4 noted for follow-up.

Downstream Updates

None — standalone feature.

Feature Flags

None — bug fix restoring intended behavior.

Database/Migration Safety

No schema/migration changes.

Test Plan

  • just desktop-typecheck passes
  • just desktop-lint passes
  • just desktop-test passes (1797 tests, 0 failures)
  • 15 new static analysis tests verify all patches landed correctly
  • Build script assertGeneratedTree() hard-gates all 5 patch artifacts
  • Manual: Run a ClosedLoop build spawning 20+ parallel agents and verify dashboard survives

- Use BEGIN IMMEDIATE instead of DEFERRED in compat-sqlite transaction
  wrapper to prevent deadlocks when concurrent transactions upgrade locks
- Move transcriptCache.extract() disk I/O outside the SQLite write
  transaction in processEvent (50-200ms held lock per event)
- Add setImmediate-based write queue to batch concurrent hook events
  into single transactions, reducing lock contention ~20x
- Wrap watchdog read-check-write cycle in atomic transaction to prevent
  stale-state decisions between queued hook commits
- Bump agent-session-sync-service busy_timeout from 1000ms to 5000ms
  to match the sidecar's own timeout on separate DB connection
- Add sourceCompatSqlite to build stamp hash inputs
- Add 15 static analysis tests verifying all patches landed

Testing:
- just desktop-typecheck passes
- just desktop-lint passes
- just desktop-test passes (1797 tests, 0 failures)
- Build script patches apply cleanly with assertGeneratedTree gates

Risks:
- POST /event now responds before DB write commits (CLI only checks
  HTTP status, doesn't parse response body)
- Write queue batches events arriving in the same tick; if the batch
  transaction fails, events are lost (dashboard self-heals via
  watchdog/re-import from transcripts)
@thadeusb thadeusb requested a review from a team May 22, 2026 23:45
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aef3535c8c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread apps/desktop/scripts/build-agent-monitor.mjs
Comment thread apps/desktop/scripts/build-agent-monitor.mjs Outdated
- Requeue hook events on batch transaction failure instead of dropping
- Track per-event success and only run plan upserts / subagent imports
  for events that were actually persisted
" processEventCore(batch[i].hookType, batch[i].data, batch[i].transcriptData);",
" succeeded.add(i);",
" } catch (err) {",
' console.warn("[hooks] batch event failed:", batch[i].hookType, err?.message || err);',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching processEventCore(...) exceptions here means a hook is no longer atomic. processEventCore still performs many writes before later transcript/metadata/event insertion work, so if it throws midway through one hook, the outer batch transaction can still commit those earlier mutations and just skip the rest of that hook. Upstream wrapped each hook in its own db.transaction(...), so a failing hook rolled back entirely instead of leaving partial session/agent state behind.

' console.warn("[plans] hook capture failed:", e && e.message);',
" }",
" enqueueHookEvent(hook_type, data, transcriptData, planCapture);",
" res.json({ ok: true });",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the request contract from "event durably written before 200" to "event only enqueued in memory before 200". enqueueHookEvent(...) just pushes into process memory and schedules setImmediate, but the hook client exits as soon as it gets the HTTP response. If the desktop app crashes or the queued write later fails after this line, the caller has already been told the hook was accepted and there is no retry signal, so the event is silently lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants