fix(storage): guarantee forward progress for visits on transient flush failures#1197
Draft
vringar wants to merge 1 commit into
Draft
fix(storage): guarantee forward progress for visits on transient flush failures#1197vringar wants to merge 1 commit into
vringar wants to merge 1 commit into
Conversation
…h failures A visit whose records were cached in the StorageController could be stranded perpetually in-progress -- never committed and never marked incomplete -- so the visit reached no terminal state, the work-queue client never observed completion, and the data was lost. This was observed in production parquet->S3 crawls. Root cause: a single transient structured-storage write error (e.g. an S3 PutObject failure) was fatal to forward progress: - ArrowProvider.flush_cache wrote batches table-by-table and only afterwards cleared the batches and signalled the per-finalize flush_events. If a write_table call raised partway through, the batches and events were left in an inconsistent state: the still-cached records were stranded and the finalize tokens for those visits could never resolve, marking the visits perpetually in-progress. - save_batch_if_past_timeout (the only periodic flush driver while a crawl is live) had no error handling around flush_cache. One transient error killed the coroutine permanently, so finalized visits were never flushed again and stayed pending forever. - shutdown's final drain flush was likewise unguarded, so a transient error there aborted the drain and left finalized visits out of the completion queue. Fixes: - Make ArrowProvider.flush_cache atomic: write all tables first; only on full success clear the batches and set the flush_events. On failure keep both intact and re-raise so a later flush retries cleanly without losing data or stranding tokens. - Make save_batch_if_past_timeout survive transient flush errors: log and retry on the next tick instead of dying. - Retry the shutdown drain flush a bounded number of times so a transient error no longer aborts the drain. Adds property-based (Hypothesis) liveness tests in test/storage/test_storage_controller_liveness.py that drive the controller with arbitrary interleavings of start/finalize(success)/finalize(failure)/ never-finalize/concurrent-visits and transient write failures, asserting the forward-progress invariants: successfully finalized visits are fully committed, every observed visit reaches a terminal state (never silently stranded), shutdown drains all pending data, and per-visit bookkeeping stays bounded. The transient-failure property reproduces the original stranding bug and is green with this fix. Also adds hypothesis to the dev/runtime environment (environment.yaml and scripts/environment-unpinned-dev.yaml); it was previously absent, so the new liveness tests could not be collected in CI (ModuleNotFoundError: hypothesis).
02fd17c to
e815b75
Compare
Contributor
Author
|
Follow-up: the branch was CI-red on all 7 Added |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1197 +/- ##
==========================================
+ Coverage 62.14% 62.28% +0.13%
==========================================
Files 40 40
Lines 3918 3937 +19
==========================================
+ Hits 2435 2452 +17
- Misses 1483 1485 +2 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
In production parquet->S3 crawls, some visits were observed to get claimed and
perpetually marked in-progress by the work-queue client while their data
never got committed or pushed out. Records were produced and cached in the
StorageController, but the cache never flushed and the visit never reached aterminal state -- neither committed nor explicitly marked incomplete. The visit
was stranded forever and its data lost. This is a liveness (forward-progress)
failure.
Root cause
A single transient structured-storage write error (e.g. an S3 PutObject
failure) was fatal to forward progress:
ArrowProvider.flush_cachewas not atomic. It wrote batchestable-by-table and only afterwards cleared
_batchesand set the per-finalizeflush_events. If awrite_tablecall raised partway through, the cachedrecords were stranded and the finalize tokens for those visits could never
resolve, so those visits stayed perpetually in-progress.
save_batch_if_past_timeout-- the only periodic flush driver while a crawlis live -- had no error handling around
flush_cache. One transient errorkilled the coroutine permanently (it is a
NoReturnloop), so finalizedvisits were never flushed again and stayed pending forever.
shutdown's final drain flush was likewise unguarded, so a transienterror there aborted the drain and left finalized visits out of the completion
queue (and could deadlock on the never-resolving token).
The work-queue client never observes completion for a stranded visit, so it
keeps the site claimed indefinitely.
Fix
ArrowProvider.flush_cacheis now atomic: write all tables first; only onfull success clear
_batchesand setflush_events. On failure, keep bothintact and re-raise so a later flush retries cleanly -- no data loss, no
stranded tokens.
save_batch_if_past_timeoutsurvives transient flush errors: it logs andretries on the next tick instead of dying.
(
SHUTDOWN_FLUSH_RETRIES) so a transient error no longer aborts the drain. Thebound guarantees shutdown can never hang on a permanently-failing backend.
Tests
New property-based (Hypothesis) liveness tests in
test/storage/test_storage_controller_liveness.pydrive the controller in-process(no socket, no subprocess, no Firefox) with arbitrary interleavings of
start / finalize(success) / finalize(failure) / never-finalize / concurrent
visits and transient write failures, asserting the forward-progress invariants:
completion queue / recorded incomplete), never silently stranded;
cache);
test_forward_progress_survives_transient_write_failuresreproduces the originalstranding bug (fails on the pre-fix code, green with this fix).
Documented contract boundary
A visit that emits zero records and is never finalized is invisible to
the
StorageController(it has nothing to key on); driving such aclaimed-but-silent visit to a terminal state is the work-queue client's
responsibility (e.g. a visit-claim timeout). The tests encode this boundary
explicitly.
🤖 Generated with Claude Code