FEA-1363: Fix SQLite write contention causing dashboard corruption#234
FEA-1363: Fix SQLite write contention causing dashboard corruption#234thadeusb wants to merge 2 commits into
Conversation
- Use BEGIN IMMEDIATE instead of DEFERRED in compat-sqlite transaction wrapper to prevent deadlocks when concurrent transactions upgrade locks - Move transcriptCache.extract() disk I/O outside the SQLite write transaction in processEvent (50-200ms held lock per event) - Add setImmediate-based write queue to batch concurrent hook events into single transactions, reducing lock contention ~20x - Wrap watchdog read-check-write cycle in atomic transaction to prevent stale-state decisions between queued hook commits - Bump agent-session-sync-service busy_timeout from 1000ms to 5000ms to match the sidecar's own timeout on separate DB connection - Add sourceCompatSqlite to build stamp hash inputs - Add 15 static analysis tests verifying all patches landed Testing: - just desktop-typecheck passes - just desktop-lint passes - just desktop-test passes (1797 tests, 0 failures) - Build script patches apply cleanly with assertGeneratedTree gates Risks: - POST /event now responds before DB write commits (CLI only checks HTTP status, doesn't parse response body) - Write queue batches events arriving in the same tick; if the batch transaction fails, events are lost (dashboard self-heals via watchdog/re-import from transcripts)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aef3535c8c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Requeue hook events on batch transaction failure instead of dropping - Track per-event success and only run plan upserts / subagent imports for events that were actually persisted
| " processEventCore(batch[i].hookType, batch[i].data, batch[i].transcriptData);", | ||
| " succeeded.add(i);", | ||
| " } catch (err) {", | ||
| ' console.warn("[hooks] batch event failed:", batch[i].hookType, err?.message || err);', |
There was a problem hiding this comment.
Catching processEventCore(...) exceptions here means a hook is no longer atomic. processEventCore still performs many writes before later transcript/metadata/event insertion work, so if it throws midway through one hook, the outer batch transaction can still commit those earlier mutations and just skip the rest of that hook. Upstream wrapped each hook in its own db.transaction(...), so a failing hook rolled back entirely instead of leaving partial session/agent state behind.
| ' console.warn("[plans] hook capture failed:", e && e.message);', | ||
| " }", | ||
| " enqueueHookEvent(hook_type, data, transcriptData, planCapture);", | ||
| " res.json({ ok: true });", |
There was a problem hiding this comment.
This changes the request contract from "event durably written before 200" to "event only enqueued in memory before 200". enqueueHookEvent(...) just pushes into process memory and schedules setImmediate, but the hook client exits as soon as it gets the HTTP response. If the desktop app crashes or the queued write later fails after this line, the caller has already been told the hook was accepted and there is no retry signal, so the event is silently lost.
Feature
FEA-1363 — Fix SQLite write contention causing dashboard corruption under 22+ concurrent agents
Summary
BEGIN IMMEDIATEinstead ofBEGIN DEFERREDin compat-sqlite transaction wrapper to prevent deadlockstranscriptCache.extract()disk I/O outside the SQLite write transaction (50-200ms lock hold per event)setImmediate-based write queue to batch concurrent hook events into single transactions (~20x contention reduction)agent-session-sync-servicebusy_timeoutfrom 1000ms to 5000msArchitecture
flowchart TD A[Hook POST] --> B{Validate} B -->|Valid| C[Extract transcript outside lock] C --> D[Enqueue + respond immediately] D --> E[setImmediate drain] E --> F[BEGIN IMMEDIATE - batch] F --> G[processEventCore per item with try/catch] G --> H[COMMIT] H --> I[Broadcast + plan upserts] J[Watchdog 15s] --> K[Extract transcripts outside tx] K --> L[BEGIN IMMEDIATE - atomic read+write] L --> M[COMMIT + broadcast]Plan
PLN-705 — Challenge: Sonnet PASS, Opus PASS, Codex FAIL (all findings addressed in v2)
Key Decisions
/eventnow responds{ ok: true }before DB write commits (CLI only checks HTTP status)processEventCoreextracted as unwrapped function;processEventwrapper kept for potential future single-event callers.generated/editsDecision Table
Verified at
.closedloop-ai/decision-tables/fea-1363.md— all 20 delta checklist items and 7 acceptance criteria passed.Review Summary
4 parallel reviewers: correctness (Opus) PASS, security (Sonnet) PASS, conventions (Opus) PASS, Codex FAIL. 6 findings total, 2 fixed (outer try/catch on drainHookQueue, restored SubagentJsonlImported broadcast), 4 noted for follow-up.
Downstream Updates
None — standalone feature.
Feature Flags
None — bug fix restoring intended behavior.
Database/Migration Safety
No schema/migration changes.
Test Plan
just desktop-typecheckpassesjust desktop-lintpassesjust desktop-testpasses (1797 tests, 0 failures)assertGeneratedTree()hard-gates all 5 patch artifacts