fix(gasoline): fix postgres contention and stuck workflow race condition#4388
fix(gasoline): fix postgres contention and stuck workflow race condition#4388NathanFlurry wants to merge 1 commit intomainfrom
Conversation
Six fixes for gasoline running on Postgres: 1. Write Immediate wake condition in commit_workflow when wake_signals is non-empty, preventing a race where workflows get permanently stuck with no WorkflowWakeConditionKey. 2. Disable worker partition overlap for Postgres to halve GiST exclusion constraint conflict surface in pull_workflows. 3. Change get_workflows reads from Serializable to Snapshot (read-only status check needs no conflict tracking). 4. Change pull_workflow_history reads from Serializable to Snapshot (workflow is already leased, no concurrent writes). 5. Change pull_next_signals scan/reads from Serializable to Snapshot (retry loop + fix 1 provide correctness guarantees). 6. Split pull_workflows mega-transaction into Snapshot scan phase + per-workflow Serializable lease transactions via buffer_unordered(64). Also adds gasoline-load-test crate for reproducing and verifying fixes.
|
🚅 Deployed to the rivet-pr-4388 environment in rivet-frontend
|
|
PR Review: fix(gasoline): fix postgres contention and stuck workflow race condition Overall this is a solid, well-motivated set of fixes. The PR description clearly maps each fix to a root cause. A few things worth discussing: 1. Error swallowing in Phase 2 may silently drop workflows engine/packages/gasoline/src/db/kv/mod.rs - Phase 2 lease loop A lease failure now silently skips the workflow (tracing::warn + Ok(None)) rather than propagating the error. This is intentional for resiliency but changes the failure semantics significantly. If there is a persistent error (connection failure, serialization bug, logic error), every workflow will be skipped on every tick indefinitely with no signal to the caller. Consider tracking a lease_error_count and propagating if it exceeds a threshold within one pull_workflows call, or at minimum escalating to tracing::error! since a failed lease is abnormal. 2. Subtle correctness coupling: pull_next_signals Snapshot relies on fix 1 Fix 5 (pull_next_signals Snapshot) is only safe because fix 1 (Immediate wake condition in commit_workflow) exists. If fix 1 were ever reverted, fix 5 would reintroduce the stuck-workflow race. Worth adding a brief cross-reference in the comment pointing to commit_workflow's Immediate wake condition write. 3. Phase 2 introduces O(n x m) scan per transaction The inner loop iterates all wake_keys for every candidate workflow (up to 1000 candidates x 10000 wake keys = 10M comparisons per pull_workflows). The old code did a similar scan inside the single transaction so this is not a complexity regression, but it is now spread across up to 1000 separate transactions. If profiling shows this matters, indexing wake_keys by workflow_id into a HashMap before Phase 2 would reduce this to O(1) lookups per transaction. 4. Removed comments had non-delta explanatory value Two removed comments were not just documenting deltas but explained why limits exist: the 10k dedup hard limit comment explained the two-stage limit relationship, and the metrics TODO (This will record metrics even if the txn fails, which is wrong) should remain if not resolved by the split, or be removed with a note that it was fixed. 5. Load test: dead workflows do not fail fast In bombarder_logic, the monitoring loop logs dead workflows but continues running until the full 120s timeout. If fix 1 is correct, dead workflows should never appear. Consider breaking early when dead > 0 so the load test fails faster and more clearly when a regression occurs. 6. Minor: run_standalone does not call dont_stop_docker_containers_on_drop() run_worker and run_bombarder both call this method, but run_standalone does not, so the Postgres container is torn down on exit in standalone mode. If intentional for auto-cleanup that is fine, but worth confirming. Overall The root cause analysis is thorough and the fixes are well-targeted. The two-phase pull_workflows split is the right architectural change for Postgres. The Immediate wake condition fix for commit_workflow is the correctness anchor that makes several other isolation relaxations safe. The load test crate is a valuable addition for reproducing and guarding against regressions. Generated with Claude Code |
Summary
Immediatewake condition incommit_workflowwhenwake_signalsis non-empty, preventing a race where workflows get permanently stuck with noWorkflowWakeConditionKeyget_workflows,pull_workflow_history,pull_next_signals) fromSerializabletoSnapshotisolationpull_workflowsinto a Snapshot scan phase + per-workflow Serializable lease transactions, eliminating the O(n²) conflict range pressureAlso adds
gasoline-load-testcrate for reproducing and verifying these issues under concurrent signal load with Postgres.Detail
Six fixes targeting gasoline on Postgres where the GiST exclusion constraint for conflict detection is much more expensive than FDB's native conflict tracking:
commit_workflowwritesWakeSignalKeysbut not aWorkflowWakeConditionKey, relying on a futurepublish_signal. If signals arrive while keys are cleared, no wake condition is ever created. Fix: writeImmediatewake condition whenwake_signalsis non-empty.WORKER_LOST_THRESHOLD_MSfor dead worker recovery.get_workflowsSnapshot — Read-only status check doesn't need Serializable.pull_workflow_historySnapshot — Workflow is already leased; no concurrent writes possible.pull_next_signalsSnapshot — Retry loop + fix 1 provide correctness without Serializable.pull_workflowssplit — Snapshot scan for candidates, then per-workflow lease transactions viabuffer_unordered(64).Test plan
pull_workflowstimeout under concurrent load🤖 Generated with Claude Code