Issue 10: Resume across runs/crashes (deferred, low priority) #629

@gewenyu99

Epic: Task-queue orchestrator runner · Sequenced after: #628 ·
Functionally builds on: #622, #624, #627 · Priority: low

Deferred. The disk reflection is already built in #622 (queue, handoffs, audit
log). This issue is the one place that reads a leftover queue back. Pick it up
after the experiment is running, and only if resume proves worth it.

Why

A wizard run can crash or be killed mid-drain. The disk reflection (#622) means the
state survives. This issue makes a subsequent run continue that state instead of
starting fresh. It is additive on top of the persisted schema.

Prior art in the PostHog monorepo, worth following rather than reinventing: the
Tasks product keeps per-run JSON state with a resume_from_run_id chain and
atomic state mutation under select_for_update (products/tasks/backend/models.py,
get_resume_chain and mutate_state_atomic), and Signals avoids double-spawning
on reruns with run_count in workflow ids and ack_id dedupe. Borrow the
resume-chain shape for inheriting prior outputs, and the dedupe so a resumed queue
does not re-run or double-spawn a task.
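One way to picture the resume-chain shape is the sketch below. It is not the Tasks product's implementation (that lives in Python in products/tasks/backend/models.py); the RunRecord type, its field names, and both helper functions are hypothetical and only illustrate how a run could inherit outputs from the runs it resumed from.

```typescript
// Hypothetical record shape: each run points at the run it resumed from.
interface RunRecord {
  runId: string;
  resumeFromRunId: string | null; // mirrors the resume_from_run_id idea
  outputs: Record<string, unknown>;
}

// Walk from the given run back to the first run and return the chain oldest
// first. The `seen` map guards against a corrupted chain that loops.
function getResumeChain(runs: Record<string, RunRecord>, runId: string): RunRecord[] {
  const chain: RunRecord[] = [];
  const seen: Record<string, boolean> = {};
  let cursor: string | null = runId;
  while (cursor !== null && !seen[cursor]) {
    const run: RunRecord | undefined = runs[cursor];
    if (run === undefined) break; // dangling link: stop rather than throw
    seen[cursor] = true;
    chain.push(run);
    cursor = run.resumeFromRunId;
  }
  return chain.reverse();
}

// Merge outputs oldest first, so a later run overrides an earlier one.
function inheritedOutputs(
  runs: Record<string, RunRecord>,
  runId: string,
): Record<string, unknown> {
  const merged: Record<string, unknown> = {};
  for (const run of getResumeChain(runs, runId)) {
    for (const key of Object.keys(run.outputs)) {
      merged[key] = run.outputs[key];
    }
  }
  return merged;
}
```

Merging oldest first is a design choice in this sketch: the most recent run wins on conflicting keys, which matches the intuition that a resumed run supersedes the one it continues.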

Scope / deliverable

  1. Resume detection. On construction, if
    <installDir>/.posthog-wizard/queue.json exists, matches the current version,
    and its runId and installDir identify the same run, continue it instead of
    starting fresh (the default from #622, Queue + persistence layer).
  2. in_progress recovery. A task left in_progress is suspect: it indicates a
    crash mid-task. If attempts < maxAttempts, reset it to pending and re-run;
    otherwise mark it failed. Record the reset in audit.jsonl.
  3. Idempotency. Resume relies on task bodies being re-runnable (read before
    write); the {type, inputs} dedup guard from #623 (Orchestrator MCP tools, in
    wizard-tools) backs it up. Audit the three real task bodies (#627) for safe
    re-execution, and fix any that are not.
  4. Stale-queue safety. A queue with a version mismatch, or with a runId from
    an unrelated run, is discarded rather than adopted.
  5. Telemetry. Add the resumed: boolean flag to orchestrator run finished
    (stubbed in #628, Telemetry + experiment instrumentation), and emit an
    orchestrator resumed { tasks_pending, tasks_done } event.
  6. Gating. Decide automatic versus opt-in resume. The recommendation is
    automatic, guarded by the runId and installDir match, and clearly logged.
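Items 1, 2, and 4 above can be sketched as one pure decision function. This is a minimal sketch under stated assumptions, not the queue.ts implementation: the PersistedQueue and Task shapes, the "done" status, the audit entry names, and the idea that the caller already knows the expected runId are all hypothetical. It also assumes attempts was incremented when the crashed attempt started, and it logs the failed case to the audit trail as well as the reset.

```typescript
type TaskStatus = "pending" | "in_progress" | "done" | "failed";

// Hypothetical shapes; the real schema is whatever #622 persists.
interface Task {
  id: string;
  status: TaskStatus;
  attempts: number;
  maxAttempts: number;
}

interface QueueIdentity {
  version: number;
  runId: string;
  installDir: string;
}

interface PersistedQueue extends QueueIdentity {
  tasks: Task[];
}

// One entry per line would be appended to audit.jsonl by the caller.
interface AuditEntry {
  event: "task_reset" | "task_failed_on_resume";
  taskId: string;
  attempts: number;
}

type ResumeDecision =
  | { resume: false; reason: "no_queue" | "version_mismatch" | "foreign_run" }
  | { resume: true; queue: PersistedQueue; audit: AuditEntry[] };

// `persisted` is the parsed queue.json, or null when the file does not exist.
function decideResume(
  persisted: PersistedQueue | null,
  current: QueueIdentity,
): ResumeDecision {
  if (persisted === null) return { resume: false, reason: "no_queue" };
  // Stale-queue safety: discard rather than adopt.
  if (persisted.version !== current.version) {
    return { resume: false, reason: "version_mismatch" };
  }
  if (persisted.runId !== current.runId || persisted.installDir !== current.installDir) {
    return { resume: false, reason: "foreign_run" };
  }

  // in_progress recovery: retry if attempts remain, otherwise fail the task.
  const audit: AuditEntry[] = [];
  const tasks = persisted.tasks.map((task): Task => {
    if (task.status !== "in_progress") return task;
    const canRetry = task.attempts < task.maxAttempts;
    audit.push({
      event: canRetry ? "task_reset" : "task_failed_on_resume",
      taskId: task.id,
      attempts: task.attempts,
    });
    return { ...task, status: canRetry ? "pending" : "failed" };
  });

  return { resume: true, queue: { ...persisted, tasks }, audit };
}
```

Keeping the decision pure (no file reads, no logging) means the discard and recovery rules can be unit-tested without touching disk; the caller would read queue.json, call the function, then write the audit entries and the recovered queue back.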

Key files

  • src/lib/programs/orchestrator/queue.ts (resume path on construction)
  • src/lib/programs/orchestrator/orchestrator-runner.ts (resumed telemetry)
  • the install, init, and instrument-events agent prompts and mini-skills
    (idempotency audit)

Acceptance criteria

  • A kill -9 mid-drain, then a re-run, resumes the queue. The in_progress task
    resets to pending and completes, and no task runs twice to completion.
  • An idempotent re-run of the real slice (#627) does not double-instrument.
  • A queue with a version mismatch, or a foreign runId, is discarded rather than adopted.
  • orchestrator run finished carries resumed: true on a resumed run.
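The "no task runs twice to completion" criterion leans on the {type, inputs} dedup guard. A minimal sketch of such a guard follows; it is not the #623 implementation. The TaskSpec shape, the DedupGuard class, and the key format are hypothetical. The one substantive choice is serializing inputs with sorted keys, so that two specs with the same inputs in a different property order map to the same key.

```typescript
// Serialize with sorted object keys so property order does not affect the key.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) {
    return "[" + value.map((item) => stableStringify(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + stableStringify(obj[key]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Hypothetical task identity: what kind of task, and with which inputs.
interface TaskSpec {
  type: string;
  inputs: Record<string, unknown>;
}

function dedupeKey(spec: TaskSpec): string {
  return spec.type + ":" + stableStringify(spec.inputs);
}

// Tracks specs that already ran to completion, e.g. rebuilt from the done
// tasks of a resumed queue.
class DedupGuard {
  private completed: Record<string, boolean> = {};

  constructor(alreadyDone: TaskSpec[] = []) {
    for (const spec of alreadyDone) this.markDone(spec);
  }

  shouldRun(spec: TaskSpec): boolean {
    return !this.completed[dedupeKey(spec)];
  }

  markDone(spec: TaskSpec): void {
    this.completed[dedupeKey(spec)] = true;
  }
}
```

On resume, the guard would be seeded from the tasks already marked done in the persisted queue, and each task body would call shouldRun before doing work. This only backs up idempotent task bodies; it does not replace the read-before-write audit.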
