# Issue 10: Resume across runs/crashes (deferred, low priority)

**Epic:** Task-queue orchestrator runner · **Sequenced after:** #628 ·
**Functionally builds on:** #622, #624, #627 · **Priority:** low

> Deferred. The disk reflection is already built in #622 (queue, handoffs, audit
> log). This issue is the one place that reads a leftover queue back. Pick it up
> after the experiment is running, and only if resume proves worth it.

## Why

A wizard run can crash or be killed mid-drain. The disk reflection (#622) means the
state survives. This issue makes a subsequent run continue that state instead of
starting fresh. It is additive on top of the persisted schema.

Prior art in the PostHog monorepo, worth following rather than reinventing: the
Tasks product keeps per-run JSON state with a `resume_from_run_id` chain and
atomic state mutation under `select_for_update` (`products/tasks/backend/models.py`,
`get_resume_chain` and `mutate_state_atomic`), and Signals avoids double-spawning
on reruns with `run_count` in workflow ids and `ack_id` dedupe. Borrow the
resume-chain shape for inheriting prior outputs, and the dedupe so a resumed queue
does not re-run or double-spawn a task.
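
As a rough illustration of the dedupe half of that borrowing, the sketch below derives a stable key from a task's type and inputs, so a resumed queue can refuse a task it already holds. It is not the Signals or Tasks implementation, and it is not the #623 guard itself: `canonical`, `dedupKey`, and `DedupGuard` are hypothetical names, and the assumption is that task inputs are plain JSON values.

```typescript
// Hypothetical sketch: a stable dedup key over a task's type and inputs.
// Assumes inputs are plain JSON (no Dates, functions, or cycles).
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialise with object keys sorted, so key order does not change the result.
function canonical(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((v) => canonical(v)).join(",")}]`;
  const keys = Object.keys(value).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(value[k])}`).join(",")}}`;
}

function dedupKey(task: { type: string; inputs: Json }): string {
  return `${task.type}:${canonical(task.inputs)}`;
}

// Remembers every key it has admitted; a second task with the same key is refused.
class DedupGuard {
  private readonly seen = new Set<string>();

  admit(task: { type: string; inputs: Json }): boolean {
    const key = dedupKey(task);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

On resume, a guard like this would be seeded from the tasks already in the persisted queue before any new task is enqueued, so that a task spawned again after the crash collapses onto its earlier entry.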

## Scope / deliverable

1. **Resume detection.** On construction, if
   `<installDir>/.posthog-wizard/queue.json` exists, matches the current `version`,
   and its `runId` and installDir identify the same run, continue it instead of
   starting fresh (the #622 default).
2. **`in_progress` recovery.** A task left `in_progress` is suspect: it implies a
   crash mid-task. If `attempts < maxAttempts`, reset it to `pending` and re-run it;
   otherwise mark it `failed`. Record the reset in `audit.jsonl`.
3. **Idempotency.** Resume relies on task bodies being safely re-runnable (read
   before write), with the `{type, inputs}` dedup guard (#623) as a backstop. Audit
   the three real task bodies (#627) for safe re-execution, and fix any that are not.
4. **Stale-queue safety.** A queue with a `version` mismatch, or with a `runId`
   from an unrelated run, is discarded rather than adopted.
5. **Telemetry.** Add the `resumed: boolean` flag to `orchestrator run finished`
   (stubbed in #628), and emit an `orchestrator resumed { tasks_pending, tasks_done }`
   event.
6. **Gating.** Decide automatic versus opt-in resume. The recommendation is
   automatic, guarded by the runId and installDir match, and clearly logged.
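
Items 1, 2, and 4 can be sketched together as one pure decision function. This is a minimal sketch under stated assumptions, not the #622 schema: only `version`, `runId`, `attempts`, `maxAttempts`, and the `pending` / `in_progress` / `failed` statuses come from this issue. The `done` status, the `installDir` field on the persisted queue, the `task_reset` audit shape, and the name `decideResume` are hypothetical. It also assumes `attempts` is incremented when a task starts, so a crashed attempt is already counted.

```typescript
// Hypothetical shapes; the real persisted schema lives in #622.
type TaskStatus = "pending" | "in_progress" | "done" | "failed";

interface QueuedTask {
  id: string;
  status: TaskStatus;
  attempts: number; // assumed to be incremented when the task starts
  maxAttempts: number;
}

interface PersistedQueue {
  version: number;
  runId: string;
  installDir: string;
  tasks: QueuedTask[];
}

// Hypothetical audit entry for a reset; one of these would be appended to audit.jsonl.
interface ResetAudit {
  event: "task_reset";
  taskId: string;
  attempts: number;
}

type ResumeDecision =
  | { resume: false; reason: "no_queue" | "version_mismatch" | "foreign_run" }
  | { resume: true; queue: PersistedQueue; audit: ResetAudit[] };

function decideResume(
  found: PersistedQueue | null,
  current: { version: number; runId: string; installDir: string },
): ResumeDecision {
  // Stale-queue safety: anything that does not match is discarded, never adopted.
  if (found === null) return { resume: false, reason: "no_queue" };
  if (found.version !== current.version) return { resume: false, reason: "version_mismatch" };
  if (found.runId !== current.runId || found.installDir !== current.installDir) {
    return { resume: false, reason: "foreign_run" };
  }

  // in_progress recovery: retry while attempts remain, otherwise fail the task.
  const audit: ResetAudit[] = [];
  const tasks = found.tasks.map((task): QueuedTask => {
    if (task.status !== "in_progress") return task;
    if (task.attempts < task.maxAttempts) {
      audit.push({ event: "task_reset", taskId: task.id, attempts: task.attempts });
      return { ...task, status: "pending" };
    }
    return { ...task, status: "failed" };
  });

  // The input queue is left untouched; the caller persists the returned copy.
  return { resume: true, queue: { ...found, tasks }, audit };
}
```

Keeping the decision free of file I/O leaves reading `queue.json` and appending to `audit.jsonl` to the caller, which makes the discard and recovery rules straightforward to unit-test. The `reason` on a discarded queue is also a natural value to log, in line with the "clearly logged" recommendation in item 6.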

## Key files

- `src/lib/programs/orchestrator/queue.ts` (resume path on construction)
- `src/lib/programs/orchestrator/orchestrator-runner.ts` (`resumed` telemetry)
- the `install`, `init`, and `instrument-events` agent prompts and mini-skills
  (idempotency audit)

## Acceptance criteria

- [ ] A `kill -9` mid-drain, followed by a re-run, resumes the queue. The
      `in_progress` task resets to `pending` and completes, and no task runs twice
      to completion.
- [ ] A re-run of the real slice (#627) is idempotent: it does not double-instrument.
- [ ] A queue with a `version` mismatch, or a foreign `runId`, is discarded rather
      than adopted.
- [ ] `orchestrator run finished` carries `resumed: true` on a resumed run.
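
The "no task runs twice to completion" criterion could be checked mechanically from the audit log after a kill-and-resume test. The sketch below assumes each line of `audit.jsonl` is a JSON object with a `taskId` and an `event`, and that a successful finish is recorded as `"completed"`. Both are hypothetical, since the real audit schema is defined in #622. It skips unparseable lines, on the assumption that a `kill -9` can leave a truncated final line.

```typescript
// Hypothetical audit line shape; the real audit.jsonl schema is defined in #622.
interface AuditLine {
  taskId: string;
  event: string; // assumed to be "completed" when a task finishes successfully
}

// Returns the ids of tasks that were recorded as completed more than once.
function findDoubleCompletions(jsonl: string): string[] {
  const completions = new Map<string, number>();
  for (const raw of jsonl.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    let entry: AuditLine;
    try {
      entry = JSON.parse(line) as AuditLine;
    } catch {
      continue; // tolerate a line truncated by the crash
    }
    if (entry.event !== "completed") continue;
    completions.set(entry.taskId, (completions.get(entry.taskId) ?? 0) + 1);
  }
  return Array.from(completions.entries())
    .filter(([, count]) => count > 1)
    .map(([taskId]) => taskId);
}
```

A test for the first criterion would then read the log once the resumed run finishes and expect `findDoubleCompletions` to return an empty array.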


Issue 10: Resume across runs/crashes (deferred, low priority) #629

Issue 10: Resume across runs/crashes (deferred, low priority)

Why

Scope / deliverable

Key files

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue 10: Resume across runs/crashes (deferred, low priority) #629

Description

Issue 10: Resume across runs/crashes (deferred, low priority)

Why

Scope / deliverable

Key files

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions