feat: persist automation scheduler state#998
Conversation
|
Warning Review limit reached
More reviews will be available in 19 minutes and 25 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughConverts automation definitions and runs from in-memory caches to durable DB tables, implements cron scheduling with timezone-aware next-fire computation, and adds distributed scheduler ownership and run leasing (Flock). API routes now await scheduler ownership settlement and error mappings return 409 for conflicts/active-run ownership. ChangesAutomation Durability & Cron Scheduling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request transitions the automation definitions and runs from in-memory maps to persistent SQLite tables using Drizzle ORM, and introduces cron-based recurring schedules along with a distributed locking mechanism using Flock to prevent concurrent scheduler executions. The review feedback highlights several critical performance issues: hasDurableActiveWriter performs N+1 database queries inside a loop under an active transaction, computeNextCronFireAt repeatedly parses cron expressions and instantiates sets inside a large loop causing CPU overhead, and functions like list, runs, and reconcileInterruptedRuns fetch entire collections into memory to filter them instead of utilizing targeted SQL queries.
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/opencode/src/automation/index.ts (1)
798-807:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftPrevent misclassifying executor failures as
cancelledwhen the run is stillscheduled
executeRunwaits forawait executor(...)before advancingscheduledtorunning, but itscatchpath treats any error whilecurrent.state === "scheduled"asstopReason: "cancelled"(packages/opencode/src/automation/index.ts around 830-837).sessionPromptExecutoronly callsAutomation.markRunStarted(...)afterawait Session.create(...), so failures during pre-start steps can leave the runscheduledand then be mislabeled ascancelled(packages/opencode/src/automation/runner.ts).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/opencode/src/automation/index.ts` around lines 798 - 807, The executor error path is incorrectly converting pre-start failures into a "cancelled" stop for runs that are still "scheduled"; update executeRun's catch logic so it does not call stopRun(..., "cancelled") for errors when latest/current state is "scheduled" (only treat controller.signal.aborted as cancellation), and instead mark those runs as failed/errored or leave them scheduled until an explicit start occurs; additionally, in sessionPromptExecutor move or add the call to Automation.markRunStarted(...) to occur before any awaited pre-start operations (e.g., before Session.create) so that start-related failures are attributed to a running run rather than a scheduled one—use the symbols executor, executeRun, getRun, stopRun, publishRunUpdated, markRunStarted, sessionPromptExecutor, controller.signal to locate and change the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/opencode/src/automation/index.ts`:
- Around line 858-885: The runs() function materializes the entire ledger by
calling allRuns() which uses .all() then slices; change this to push
cursor/limit into the DB query in allRuns(automationID) so SQL returns only the
requested page. Modify allRuns to accept limit and cursor (or cursor fields) and
implement a DB query that applies ordering (desc triggered_at, desc id), uses a
cursor predicate to fetch rows after the cursor (compare triggered_at and id)
and a limit of requestedLimit+1 to detect nextCursor, removing the in-memory
.all() + slice; update runs() to call allRuns(automationID, limit, cursor) and
compute nextCursor from the extra row. Ensure symbols referenced: runs, allRuns,
Database.use, AutomationRunTable, .orderBy, and remove the post-query .all().
- Around line 456-478: The durable write in writeDefinition (and the similar
block at lines 481-510) must use optimistic concurrency: instead of blind upsert
by primary key, perform a compare-and-swap on the stored revision/state (e.g.,
data.revision or a dedicated revision column) inside the same Database.use
transaction so the update only applies if the current row's revision matches the
expected previous revision; if it doesn't match, abort/throw so the caller can
retry or reconcile. Locate the AutomationDefinitionTable upsert in
writeDefinition and change the onConflict/update to include a WHERE or
conditional that checks AutomationDefinitionTable.data->>'revision' (or
AutomationDefinitionTable.revision) equals the expected prior revision from the
Definition being written, increment the revision on success, and apply the same
pattern to the other durable write blocks referenced (lines ~481-510) so all
state transitions use compare-and-swap semantics and surface a conflict error
when revisions diverge.
In `@packages/opencode/src/automation/scheduler.ts`:
- Around line 120-145: cronValues and parseCronSchedule must reject malformed
cron input before iterating to avoid infinite loops and blowups (e.g. "*/0",
negative steps, bad ranges, too few/too many fields); update parseCronSchedule
to validate the expression has exactly five whitespace-separated fields, and in
cronValues validate that stepRaw parses to an integer > 0, that range bounds are
numeric and within [min,max], that start <= end, and that wildcard/step combos
produce a sensible start/end; on any invalid condition throw a clear error so
the caller (e.g. settleOwner) can skip/bail rather than entering a
non-terminating loop or producing huge sets.
In `@packages/opencode/src/server/instance/automation.ts`:
- Around line 239-243: Move scheduler ownership settlement before deleting the
automation: call AutomationScheduler.current() and await scheduler.settleOwner()
first, then call Automation.remove(...) to produce the tombstone, and finally
call scheduler.cancel(removed.tombstone.id); ensure you preserve the same
identifiers (Automation.remove, AutomationScheduler.current,
scheduler.settleOwner, scheduler.cancel, removed.tombstone.id) and await the
settleOwner promise before performing the remove so ownership reconciliation
happens prior to the tombstone flow.
In `@packages/opencode/test/server/automation-routes.test.ts`:
- Around line 202-220: The test currently stubs AutomationScheduler.install with
settleOwner flipping a synchronous flag, which only proves it was called not
awaited; change the stub for settleOwner in the test "list route waits for
scheduler owner settle..." (and the similar test at 224-248) to return a
deferred Promise (e.g., create a resolve function outside the stub, have
settleOwner return that Promise) so the HTTP request remains pending until you
explicitly call the resolver, then assert that settled remains false before
resolving and true after resolving, confirming the route actually awaited
settleOwner (refer to AutomationScheduler.install and settleOwner in the test).
---
Outside diff comments:
In `@packages/opencode/src/automation/index.ts`:
- Around line 798-807: The executor error path is incorrectly converting
pre-start failures into a "cancelled" stop for runs that are still "scheduled";
update executeRun's catch logic so it does not call stopRun(..., "cancelled")
for errors when latest/current state is "scheduled" (only treat
controller.signal.aborted as cancellation), and instead mark those runs as
failed/errored or leave them scheduled until an explicit start occurs;
additionally, in sessionPromptExecutor move or add the call to
Automation.markRunStarted(...) to occur before any awaited pre-start operations
(e.g., before Session.create) so that start-related failures are attributed to a
running run rather than a scheduled one—use the symbols executor, executeRun,
getRun, stopRun, publishRunUpdated, markRunStarted, sessionPromptExecutor,
controller.signal to locate and change the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b66a6d3b-49d2-4cd4-9da5-26917a7cd83c
📒 Files selected for processing (11)
packages/opencode/migration/20260530170000_automation_persistence/migration.sqlpackages/opencode/package.jsonpackages/opencode/src/automation/automation.sql.tspackages/opencode/src/automation/index.tspackages/opencode/src/automation/scheduler.tspackages/opencode/src/server/instance/automation.tspackages/opencode/src/storage/schema.tspackages/opencode/src/util/flock.tspackages/opencode/test/server/automation-routes.test.tspackages/opencode/test/server/automation-runner.test.tspackages/opencode/test/server/automation-scheduler.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/opencode/src/automation/index.ts`:
- Around line 628-656: The replaceRun function currently returns next when the
DB row is missing or owned by another project, which silently signals success;
update the check inside replaceRun (the block that reads AutomationRunTable and
tests row/project_id/owner_directory) to instead surface the failure—either
throw the same NotFoundError used by replaceDefinition or return previous to
indicate no change; ensure the change occurs before any DB update and keep the
revision-conflict behavior (returning current) unchanged so callers can
distinguish not-found vs. conflict cases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d9dd494b-69fb-4a3d-8109-24571b0e65a3
⛔ Files ignored due to path filters (2)
packages/sdk/js/src/v2/gen/sdk.gen.tsis excluded by!**/gen/**packages/sdk/js/src/v2/gen/types.gen.tsis excluded by!**/gen/**
📒 Files selected for processing (6)
packages/opencode/src/automation/index.tspackages/opencode/src/automation/scheduler.tspackages/opencode/src/server/instance/automation.tspackages/opencode/test/server/automation-routes.test.tspackages/opencode/test/server/automation-runner.test.tspackages/opencode/test/server/automation-scheduler.test.ts
🚧 Files skipped from review as they are similar to previous changes (4)
- packages/opencode/src/server/instance/automation.ts
- packages/opencode/test/server/automation-routes.test.ts
- packages/opencode/src/automation/scheduler.ts
- packages/opencode/test/server/automation-scheduler.test.ts
Closes the backend gaps left after PR1-5 so the PR6/PR7 frontend slice can
build on a complete, stable automation contract.
Goal / change boundary:
- model { providerID, modelID } now required on every AutomationDefinition;
runner passes it through so runs no longer depend on the runtime model
fallback chain (which drifts across restarts).
- variant? optional reasoning/effort, validated against the live provider
catalog on create/update; invalid variant returns 422 instead of failing
late in the run.
- stop.kind === "condition" rejected at create/update with structured
{ field: "stop", message: "unsupported_stop_condition" } (scheduler never
schedules condition stops, so accepting them only leaked dead UI surface).
- Derived fields nextFireAt / nextFires / failureStreak populated at
create/update and refreshed after every terminal run; scheduler re-publishes
automation.definition.updated with a bumped revision for global-sync.
- cron validation consolidated into src/automation/cron.ts; scheduler and
derived both consume it.
- automate tool schema picks up model/variant + Provider-backed create-time
validation.
- Migration 20260601100000 drops pre-release rows so model can be NOT NULL.
Verification:
- CI green on 5e177ba (unit-opencode flake on an unrelated VCS-routes
20s timeout cleared on rerun).
- Local: 129 automation tests (routes/runner/scheduler/event-fixtures/tool/
cron) + 35 session processor tests pass.
Review follow-ups addressed:
- codex: scheduler self-loop guard keyed by id:revision (not id alone).
- P2: recordRunOutcome retries on ConflictError instead of silently dropping
the run outcome.
- CodeRabbit: 422 create test uses a guaranteed-invalid model; fixed sleeps
replaced with terminal-state polling.
- Test seam for the ConflictError path moved off the public Automation API
into an internal __test_hooks module.
Residual risk / deferred (tracked, not blocking PR6):
- needs_user_input / loop_gate run-error codes remain reserved; producing
them needs prompt-loop semantics changes.
- Session.automationID reverse lookup deferred (session-contract migration).
- stop=condition kept in the schema layer for SDK shape parity but rejected
at validate time; collapses once a condition evaluator lands.
Linked: issue #950; follows PR #960 #983 #984 #998 #1004; unblocks PR6/PR7.
Summary
Adds issue #950 PR4 backend persistence and durable scheduling behavior for automations:
runNowcannot bypass same-project writer protection across processesWhy
PR1 froze the automation contract, PR2 made manual
runNowexecute, and PR3 added in-memory scheduler timers. PR4 makes those records durable and prevents duplicated background firing when multiple app/server processes are open.This PR follows the merged PR1-3 contract: terminal non-attempts use
state: "stopped"plusstopReasonsuch asmissed_schedule,expired, orblocker_lost. It does not reopen the route/schema surface back to separateskippedorexpiredstates.Related Issue
Closes part of #950. This is PR4 in the agreed 7-PR breakdown.
Human Review Status
Pending
Review Focus
Please focus on the persisted automation table shape, owner-directory scoping, durable active-run guard, scheduler owner fallback/retry behavior, cron next-fire semantics, and stale run reconciliation mapping.
Risk Notes
Migration/data behavior: this adds new SQLite tables for automation definitions and runs. Definitions and run ledgers are scoped by project plus owner directory.
Concurrency behavior: background timers only run in the process that owns the durable scheduler lock. Non-owners can still serve CRUD and
runNow, but execution is guarded by durable active-run state.Cron scope: cron supports the existing frozen 5-field validation subset. This PR computes next fires by minute iteration with the definition timezone; it does not add a broader cron parser or PR5 worktree support.
Skipped checklist items:
How To Verify
Screenshots or Recordings
Not applicable. No visible UI changes.
Checklist
bug,enhancement,task,documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.app,ui,platform,harness,ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.P0,P1,P2,P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.Pending,Approved by @<reviewer>, orNot required: <reason>(default isPending; "not required" is restricted to bot-authored low-risk PRs).dev, and my PR title and commit messages use Conventional Commits in English.Summary by CodeRabbit
New Features
Refactor
Tests