fix(gasoline): fix postgres contention and stuck workflow race condition by NathanFlurry · Pull Request #4388 · rivet-dev/rivet

NathanFlurry · 2026-03-09T16:42:07Z

Summary

Fix stuck workflows: Write Immediate wake condition in commit_workflow when wake_signals is non-empty, preventing a race where workflows get permanently stuck with no WorkflowWakeConditionKey
Reduce contention: Disable worker partition overlap for Postgres, switch read-only operations (get_workflows, pull_workflow_history, pull_next_signals) from Serializable to Snapshot isolation
Split mega-transaction: Break pull_workflows into a Snapshot scan phase + per-workflow Serializable lease transactions, eliminating the O(n²) conflict range pressure

Also adds gasoline-load-test crate for reproducing and verifying these issues under concurrent signal load with Postgres.

Detail

Six fixes targeting gasoline on Postgres where the GiST exclusion constraint for conflict detection is much more expensive than FDB's native conflict tracking:

Stuck workflow race — commit_workflow writes WakeSignalKeys but not a WorkflowWakeConditionKey, relying on a future publish_signal. If signals arrive while keys are cleared, no wake condition is ever created. Fix: write Immediate wake condition when wake_signals is non-empty.
Partition overlap — Each worker pulls its own partition + next worker's for redundancy, doubling GiST conflict surface. Fix: disable overlap for Postgres; rely on 30s WORKER_LOST_THRESHOLD_MS for dead worker recovery.
get_workflows Snapshot — Read-only status check doesn't need Serializable.
pull_workflow_history Snapshot — Workflow is already leased; no concurrent writes possible.
pull_next_signals Snapshot — Retry loop + fix 1 provide correctness without Serializable.
pull_workflows split — Snapshot scan for candidates, then per-workflow lease transactions via buffer_unordered(64).

Test plan

Load test: 50 workflows, 10 signals each, concurrency 10, 20ms delay, NATS pubsub — all workflows complete in 2-6 seconds
Verified stuck workflow race is eliminated (previously 2/70 stuck at 40 workflows)
Verified worker no longer crashes from pull_workflows timeout under concurrent load
Run full CI suite

🤖 Generated with Claude Code

Six fixes for gasoline running on Postgres: 1. Write Immediate wake condition in commit_workflow when wake_signals is non-empty, preventing a race where workflows get permanently stuck with no WorkflowWakeConditionKey. 2. Disable worker partition overlap for Postgres to halve GiST exclusion constraint conflict surface in pull_workflows. 3. Change get_workflows reads from Serializable to Snapshot (read-only status check needs no conflict tracking). 4. Change pull_workflow_history reads from Serializable to Snapshot (workflow is already leased, no concurrent writes). 5. Change pull_next_signals scan/reads from Serializable to Snapshot (retry loop + fix 1 provide correctness guarantees). 6. Split pull_workflows mega-transaction into Snapshot scan phase + per-workflow Serializable lease transactions via buffer_unordered(64). Also adds gasoline-load-test crate for reproducing and verifying fixes.

railway-app · 2026-03-09T16:42:19Z

🚅 Deployed to the rivet-pr-4388 environment in rivet-frontend

Service	Status	Web	Updated (UTC)
website	❌ Build Failed (View Logs)	Web	Mar 9, 2026 at 4:44 pm
frontend-inspector	❌ Build Failed (View Logs)	Web	Mar 9, 2026 at 4:44 pm
frontend-cloud	❌ Build Failed (View Logs)	Web	Mar 9, 2026 at 4:43 pm
ladle	❌ Build Failed (View Logs)	Web	Mar 9, 2026 at 4:43 pm
mcp-hub	✅ Success (View Logs)	Web	Mar 9, 2026 at 4:43 pm

pkg-pr-new · 2026-03-09T16:45:42Z

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4388

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4388

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4388

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4388

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4388

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4388

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4388

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4388

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4388

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4388

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4388

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4388

commit: 45d9e36

claude · 2026-03-09T16:47:04Z

PR Review: fix(gasoline): fix postgres contention and stuck workflow race condition

Overall this is a solid, well-motivated set of fixes. The PR description clearly maps each fix to a root cause. A few things worth discussing:

1. Error swallowing in Phase 2 may silently drop workflows

engine/packages/gasoline/src/db/kv/mod.rs - Phase 2 lease loop

A lease failure now silently skips the workflow (tracing::warn + Ok(None)) rather than propagating the error. This is intentional for resiliency but changes the failure semantics significantly. If there is a persistent error (connection failure, serialization bug, logic error), every workflow will be skipped on every tick indefinitely with no signal to the caller. Consider tracking a lease_error_count and propagating if it exceeds a threshold within one pull_workflows call, or at minimum escalating to tracing::error! since a failed lease is abnormal.

2. Subtle correctness coupling: pull_next_signals Snapshot relies on fix 1

Fix 5 (pull_next_signals Snapshot) is only safe because fix 1 (Immediate wake condition in commit_workflow) exists. If fix 1 were ever reverted, fix 5 would reintroduce the stuck-workflow race. Worth adding a brief cross-reference in the comment pointing to commit_workflow's Immediate wake condition write.

3. Phase 2 introduces O(n x m) scan per transaction

The inner loop iterates all wake_keys for every candidate workflow (up to 1000 candidates x 10000 wake keys = 10M comparisons per pull_workflows). The old code did a similar scan inside the single transaction so this is not a complexity regression, but it is now spread across up to 1000 separate transactions. If profiling shows this matters, indexing wake_keys by workflow_id into a HashMap before Phase 2 would reduce this to O(1) lookups per transaction.

4. Removed comments had non-delta explanatory value

Two removed comments were not just documenting deltas but explained why limits exist: the 10k dedup hard limit comment explained the two-stage limit relationship, and the metrics TODO (This will record metrics even if the txn fails, which is wrong) should remain if not resolved by the split, or be removed with a note that it was fixed.

5. Load test: dead workflows do not fail fast

In bombarder_logic, the monitoring loop logs dead workflows but continues running until the full 120s timeout. If fix 1 is correct, dead workflows should never appear. Consider breaking early when dead > 0 so the load test fails faster and more clearly when a regression occurs.

6. Minor: run_standalone does not call dont_stop_docker_containers_on_drop()

run_worker and run_bombarder both call this method, but run_standalone does not, so the Postgres container is torn down on exit in standalone mode. If intentional for auto-cleanup that is fine, but worth confirming.

Overall

The root cause analysis is thorough and the fixes are well-targeted. The two-phase pull_workflows split is the right architectural change for Postgres. The Immediate wake condition fix for commit_workflow is the correctness anchor that makes several other isolation relaxations safe. The load test crate is a valuable addition for reproducing and guarding against regressions.

Generated with Claude Code

railway-app bot had a problem deploying to rivet-frontend / rivet-pr-4388 March 9, 2026 16:42 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gasoline): fix postgres contention and stuck workflow race condition#4388

fix(gasoline): fix postgres contention and stuck workflow race condition#4388
NathanFlurry wants to merge 1 commit intomainfrom
fix/gasoline-postgres-contention

NathanFlurry commented Mar 9, 2026

Uh oh!

railway-app bot commented Mar 9, 2026 •

edited

Loading

Uh oh!

pkg-pr-new bot commented Mar 9, 2026

Uh oh!

claude bot commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NathanFlurry commented Mar 9, 2026

Summary

Detail

Test plan

Uh oh!

railway-app bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pkg-pr-new bot commented Mar 9, 2026

Uh oh!

claude bot commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

railway-app bot commented Mar 9, 2026 •

edited

Loading