fix(gasoline): fix postgres contention and stuck workflow race condition #4388

Draft
NathanFlurry wants to merge 1 commit into main from fix/gasoline-postgres-contention

@NathanFlurry
Member

Summary

  • Fix stuck workflows: Write Immediate wake condition in commit_workflow when wake_signals is non-empty, preventing a race where workflows get permanently stuck with no WorkflowWakeConditionKey
  • Reduce contention: Disable worker partition overlap for Postgres, switch read-only operations (get_workflows, pull_workflow_history, pull_next_signals) from Serializable to Snapshot isolation
  • Split mega-transaction: Break pull_workflows into a Snapshot scan phase + per-workflow Serializable lease transactions, eliminating the O(n²) conflict range pressure

Also adds gasoline-load-test crate for reproducing and verifying these issues under concurrent signal load with Postgres.

Detail

Six fixes targeting gasoline on Postgres, where the GiST exclusion constraint used for conflict detection is much more expensive than FDB's native conflict tracking:

  1. Stuck workflow race: commit_workflow writes WakeSignalKeys but not a WorkflowWakeConditionKey, relying on a future publish_signal to create one. If signals arrive while the keys are cleared, no wake condition is ever created. Fix: write an Immediate wake condition when wake_signals is non-empty.
  2. Partition overlap: each worker pulls its own partition plus the next worker's for redundancy, doubling the GiST conflict surface. Fix: disable overlap for Postgres and rely on the 30s WORKER_LOST_THRESHOLD_MS for dead worker recovery.
  3. get_workflows Snapshot: a read-only status check doesn't need Serializable.
  4. pull_workflow_history Snapshot: the workflow is already leased, so no concurrent writes are possible.
  5. pull_next_signals Snapshot: the retry loop plus fix 1 provide correctness without Serializable.
  6. pull_workflows split: a Snapshot scan for candidates, then per-workflow lease transactions via buffer_unordered(64).
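
The race behind fix 1 can be reproduced with a small in-memory model. This is only a sketch under stated assumptions, not the gasoline code: `WorkflowKeys`, the boolean `has_wake_condition`, and the `write_immediate` flag are hypothetical stand-ins, and the model assumes publish_signal creates a wake condition only when the workflow currently lists that signal in its wake signal keys.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the keys stored per workflow.
#[derive(Default)]
struct WorkflowKeys {
    /// Signals the sleeping workflow is waiting on (models WakeSignalKeys).
    wake_signal_keys: HashSet<String>,
    /// Whether any wake condition exists (models WorkflowWakeConditionKey).
    has_wake_condition: bool,
}

/// Assumed behavior: a published signal only creates a wake condition
/// if the workflow is currently registered as waiting on that signal.
fn publish_signal(keys: &mut WorkflowKeys, signal: &str) {
    if keys.wake_signal_keys.contains(signal) {
        keys.has_wake_condition = true;
    }
}

/// Models the commit that puts a workflow to sleep.
/// `write_immediate` toggles the behavior described in fix 1.
fn commit_workflow(keys: &mut WorkflowKeys, wake_signals: &[&str], write_immediate: bool) {
    keys.wake_signal_keys = wake_signals.iter().map(|s| s.to_string()).collect();
    // Fix 1: do not rely on a future publish_signal to create the wake condition.
    if write_immediate && !wake_signals.is_empty() {
        keys.has_wake_condition = true;
    }
}
```

In this model, a signal published before the commit (while the wake signal keys are cleared) leaves the workflow without a wake condition unless `write_immediate` is set, which is the stuck state the fix is meant to prevent.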

Test plan

  • Load test: 50 workflows, 10 signals each, concurrency 10, 20ms delay, NATS pubsub — all workflows complete in 2-6 seconds
  • Verified stuck workflow race is eliminated (previously 2/70 stuck at 40 workflows)
  • Verified worker no longer crashes from pull_workflows timeout under concurrent load
  • Run full CI suite

🤖 Generated with Claude Code

Six fixes for gasoline running on Postgres:

1. Write Immediate wake condition in commit_workflow when wake_signals
   is non-empty, preventing a race where workflows get permanently stuck
   with no WorkflowWakeConditionKey.

2. Disable worker partition overlap for Postgres to halve GiST exclusion
   constraint conflict surface in pull_workflows.

3. Change get_workflows reads from Serializable to Snapshot (read-only
   status check needs no conflict tracking).

4. Change pull_workflow_history reads from Serializable to Snapshot
   (workflow is already leased, no concurrent writes).

5. Change pull_next_signals scan/reads from Serializable to Snapshot
   (retry loop + fix 1 provide correctness guarantees).

6. Split pull_workflows mega-transaction into Snapshot scan phase +
   per-workflow Serializable lease transactions via buffer_unordered(64).
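
The shape of the split in item 6 can be sketched with an in-memory store. This is an illustration under assumptions, not the actual change: `Store`, `scan_candidates`, and `try_lease` are hypothetical names, the lease attempts run sequentially here (the commit runs them concurrently via buffer_unordered(64)), and isolation levels are only mentioned in comments.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the workflow store.
struct Store {
    /// Workflow id mapped to the worker currently holding its lease, if any.
    leases: BTreeMap<u64, Option<u32>>,
}

impl Store {
    /// Phase 1: read-only scan for unleased candidates (Snapshot in the commit).
    fn scan_candidates(&self, limit: usize) -> Vec<u64> {
        self.leases
            .iter()
            .filter(|(_, holder)| holder.is_none())
            .map(|(id, _)| *id)
            .take(limit)
            .collect()
    }

    /// Phase 2: lease one workflow in its own transaction (Serializable in the commit).
    /// The lease state is re-checked because the scan result may be stale.
    fn try_lease(&mut self, workflow_id: u64, worker_id: u32) -> Result<bool, String> {
        let holder = self
            .leases
            .get_mut(&workflow_id)
            .ok_or_else(|| format!("workflow {} not found", workflow_id))?;
        if holder.is_some() {
            return Ok(false);
        }
        *holder = Some(worker_id);
        Ok(true)
    }
}

/// Scan once, then attempt each lease independently so one conflict
/// does not abort the whole pull.
fn pull_workflows(store: &mut Store, worker_id: u32, limit: usize) -> Vec<u64> {
    let candidates = store.scan_candidates(limit);
    let mut leased = Vec::new();
    for id in candidates {
        match store.try_lease(id, worker_id) {
            Ok(true) => leased.push(id),
            Ok(false) => {} // another worker leased it after the scan
            Err(err) => eprintln!("lease failed for workflow {}: {}", id, err),
        }
    }
    leased
}
```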

Also adds gasoline-load-test crate for reproducing and verifying fixes.
@railway-app

railway-app bot commented Mar 9, 2026

🚅 Deployed to the rivet-pr-4388 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| website | ❌ Build Failed (View Logs) | Web | Mar 9, 2026 at 4:44 pm |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | Mar 9, 2026 at 4:44 pm |
| frontend-cloud | ❌ Build Failed (View Logs) | Web | Mar 9, 2026 at 4:43 pm |
| ladle | ❌ Build Failed (View Logs) | Web | Mar 9, 2026 at 4:43 pm |
| mcp-hub | ✅ Success (View Logs) | Web | Mar 9, 2026 at 4:43 pm |

@pkg-pr-new

pkg-pr-new bot commented Mar 9, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4388

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4388

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4388

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4388

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4388

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4388

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4388

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4388

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4388

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4388

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4388

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4388

commit: 45d9e36

@claude

claude bot commented Mar 9, 2026

PR Review: fix(gasoline): fix postgres contention and stuck workflow race condition

Overall this is a solid, well-motivated set of fixes. The PR description clearly maps each fix to a root cause. A few things worth discussing:

1. Error swallowing in Phase 2 may silently drop workflows

engine/packages/gasoline/src/db/kv/mod.rs - Phase 2 lease loop

A lease failure now silently skips the workflow (tracing::warn + Ok(None)) rather than propagating the error. This is intentional for resiliency but changes the failure semantics significantly. If there is a persistent error (connection failure, serialization bug, logic error), every workflow will be skipped on every tick indefinitely with no signal to the caller. Consider tracking a lease_error_count and propagating if it exceeds a threshold within one pull_workflows call, or at minimum escalating to tracing::error! since a failed lease is abnormal.
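
One possible shape for the threshold idea is sketched below. It is a hedged illustration only: `collect_leases` and `max_lease_errors` are hypothetical names, the per-workflow results are passed in as plain values rather than produced by real transactions, and `eprintln!` stands in for the tracing calls.

```rust
/// Collect per-workflow lease results, tolerating a bounded number of failures.
///
/// `Ok(Some(_))` is a successful lease, `Ok(None)` means the workflow was skipped
/// without error, and `Err(_)` is a failed lease attempt.
fn collect_leases<T, E: std::fmt::Display>(
    results: impl IntoIterator<Item = Result<Option<T>, E>>,
    max_lease_errors: usize,
) -> Result<Vec<T>, String> {
    let mut leased = Vec::new();
    let mut lease_error_count = 0usize;
    for result in results {
        match result {
            Ok(Some(workflow)) => leased.push(workflow),
            Ok(None) => {}
            Err(err) => {
                lease_error_count += 1;
                // Stand-in for tracing::warn! in the real code.
                eprintln!("lease attempt failed: {}", err);
                // Propagate once failures look persistent rather than transient.
                if lease_error_count > max_lease_errors {
                    return Err(format!(
                        "{} lease errors in one pull, last: {}",
                        lease_error_count, err
                    ));
                }
            }
        }
    }
    Ok(leased)
}
```

With `max_lease_errors` set to 1, a single conflict is still skipped quietly, while two failures in the same pull surface as an error to the caller.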

2. Subtle correctness coupling: pull_next_signals Snapshot relies on fix 1

Fix 5 (pull_next_signals Snapshot) is only safe because fix 1 (Immediate wake condition in commit_workflow) exists. If fix 1 were ever reverted, fix 5 would reintroduce the stuck-workflow race. Worth adding a brief cross-reference in the comment pointing to commit_workflow's Immediate wake condition write.

3. Phase 2 introduces O(n × m) scan per transaction

The inner loop iterates all wake_keys for every candidate workflow (up to 1000 candidates × 10000 wake keys = 10M comparisons per pull_workflows). The old code did a similar scan inside the single transaction so this is not a complexity regression, but it is now spread across up to 1000 separate transactions. If profiling shows this matters, indexing wake_keys by workflow_id into a HashMap before Phase 2 would reduce this to O(1) lookups per transaction.
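
A minimal sketch of that indexing step, assuming a simplified `WakeKey` with a numeric `workflow_id` and a free-form `condition` label (both hypothetical, not the real key types):

```rust
use std::collections::HashMap;

/// Hypothetical, simplified wake key.
#[derive(Debug, Clone, PartialEq, Eq)]
struct WakeKey {
    workflow_id: u64,
    condition: String,
}

/// Group wake keys by workflow id once, before Phase 2, so each lease
/// transaction does a single map lookup instead of scanning every key.
fn index_wake_keys(wake_keys: &[WakeKey]) -> HashMap<u64, Vec<&WakeKey>> {
    let mut index: HashMap<u64, Vec<&WakeKey>> = HashMap::new();
    for key in wake_keys {
        index.entry(key.workflow_id).or_default().push(key);
    }
    index
}
```

Building the index is a single pass over the wake keys, and keys for the same workflow keep their original relative order.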

4. Removed comments had non-delta explanatory value

Two removed comments did more than document deltas; they explained why limits exist. The 10k dedup hard limit comment explained the two-stage limit relationship, and the metrics TODO ("This will record metrics even if the txn fails, which is wrong") should remain if the split does not resolve it, or be removed with a note that it was fixed.

5. Load test: dead workflows do not fail fast

In bombarder_logic, the monitoring loop logs dead workflows but continues running until the full 120s timeout. If fix 1 is correct, dead workflows should never appear. Consider breaking early when dead > 0 so the load test fails faster and more clearly when a regression occurs.
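
A fail-fast monitoring loop could look roughly like the sketch below. The names `StatusCounts`, `MonitorOutcome`, and `monitor` are hypothetical, and the loop is bounded by a poll count instead of the 120s wall-clock timeout so the example stays deterministic; a real loop would sleep between polls.

```rust
/// Hypothetical per-poll status counts reported by the load test.
struct StatusCounts {
    complete: usize,
    dead: usize,
}

#[derive(Debug, PartialEq, Eq)]
enum MonitorOutcome {
    AllComplete,
    /// Returned as soon as any dead workflow is observed.
    DeadWorkflows(usize),
    TimedOut,
}

/// Poll until every workflow completes, a dead workflow appears, or the poll budget runs out.
fn monitor(
    total: usize,
    max_polls: usize,
    mut poll: impl FnMut() -> StatusCounts,
) -> MonitorOutcome {
    for _ in 0..max_polls {
        let status = poll();
        // Break early: a dead workflow means the regression is already visible.
        if status.dead > 0 {
            return MonitorOutcome::DeadWorkflows(status.dead);
        }
        if status.complete >= total {
            return MonitorOutcome::AllComplete;
        }
    }
    MonitorOutcome::TimedOut
}
```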

6. Minor: run_standalone does not call dont_stop_docker_containers_on_drop()

run_worker and run_bombarder both call this method, but run_standalone does not, so the Postgres container is torn down on exit in standalone mode. If intentional for auto-cleanup that is fine, but worth confirming.

Overall

The root cause analysis is thorough and the fixes are well-targeted. The two-phase pull_workflows split is the right architectural change for Postgres. The Immediate wake condition fix for commit_workflow is the correctness anchor that makes several other isolation relaxations safe. The load test crate is a valuable addition for reproducing and guarding against regressions.

Generated with Claude Code
