fix(storage): guarantee forward progress for visits on transient flush failures#1197

Draft
vringar wants to merge 1 commit into master from fix/storage-forward-progress

@vringar vringar commented Jun 20, 2026

Contributor

Problem

In production parquet->S3 crawls, some visits were observed to get claimed and
perpetually marked in-progress by the work-queue client while their data
never got committed or pushed out. Records were produced and cached in the
StorageController, but the cache never flushed and the visit never reached a
terminal state -- neither committed nor explicitly marked incomplete. The visit
was stranded forever and its data lost. This is a liveness (forward-progress)
failure.

Root cause

A single transient structured-storage write error (e.g. an S3 PutObject
failure) was fatal to forward progress:

  1. ArrowProvider.flush_cache was not atomic. It wrote batches
    table-by-table and only afterwards cleared _batches and set the per-finalize
    flush_events. If a write_table call raised partway through, the cached
    records were stranded and the finalize tokens for those visits could never
    resolve, so those visits stayed perpetually in-progress.

  2. save_batch_if_past_timeout -- the only periodic flush driver while a crawl
    is live -- had no error handling around flush_cache. One transient error
    killed the coroutine permanently (it is a NoReturn loop), so finalized
    visits were never flushed again and stayed pending forever.

  3. shutdown's final drain flush was likewise unguarded, so a transient
    error there aborted the drain and left finalized visits out of the completion
    queue (and could deadlock on the never-resolving token).

The work-queue client never observes completion for a stranded visit, so it
keeps the site claimed indefinitely.
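
To make root cause 2 concrete, here is a minimal sketch of how one exception ends an unguarded periodic asyncio loop, next to a guarded variant that logs and carries on. FlakyProvider, unguarded_loop, guarded_loop and the tick length are hypothetical names and values chosen for illustration; this is not the StorageController code.

```python
import asyncio
import logging

logger = logging.getLogger("flush-demo")


class FlakyProvider:
    """Hypothetical stand-in for a storage provider; only the first flush fails."""

    def __init__(self) -> None:
        self.calls = 0
        self.flushed = 0

    async def flush_cache(self) -> None:
        self.calls += 1
        if self.calls == 1:
            raise OSError("simulated transient PutObject failure")
        self.flushed += 1


async def unguarded_loop(provider: FlakyProvider, tick: float) -> None:
    while True:
        await asyncio.sleep(tick)
        await provider.flush_cache()  # the first error ends this task for good


async def guarded_loop(provider: FlakyProvider, tick: float) -> None:
    while True:
        await asyncio.sleep(tick)
        try:
            await provider.flush_cache()
        except Exception:
            logger.exception("flush failed; retrying on the next tick")


async def run_for(loop_fn, duration: float):
    provider = FlakyProvider()
    task = asyncio.create_task(loop_fn(provider, 0.01))
    await asyncio.sleep(duration)
    died = task.done()  # True if the loop already exited on its own
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return provider, died


def demo() -> dict:
    async def main() -> dict:
        results = {}
        for name, fn in (("unguarded", unguarded_loop), ("guarded", guarded_loop)):
            provider, died = await run_for(fn, 0.2)
            results[name] = (provider.flushed, died)
        return results

    return asyncio.run(main())
```

In this toy, the unguarded task is already finished (with the OSError stored on it) when it is inspected and has flushed nothing, while the guarded task is still running and has flushed on later ticks.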

Fix

  • ArrowProvider.flush_cache is now atomic: write all tables first; only on
    full success clear _batches and set flush_events. On failure, keep both
    intact and re-raise so a later flush retries cleanly -- no data loss, no
    stranded tokens.
  • save_batch_if_past_timeout survives transient flush errors: it logs and
    retries on the next tick instead of dying.
  • The shutdown drain flush is retried a bounded number of times
    (SHUTDOWN_FLUSH_RETRIES) so a transient error no longer aborts the drain. The
    bound guarantees shutdown can never hang on a permanently-failing backend.
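
The first bullet can be sketched as a two-phase flush. ToyArrowProvider, its cache and finalize_token methods, the injected write_table callable and the table names are assumptions made for illustration; the real ArrowProvider is more involved and its exact signatures are not shown in this PR description.

```python
import asyncio
from typing import Callable


class ToyArrowProvider:
    """Hypothetical, simplified stand-in; not the real ArrowProvider."""

    def __init__(self, write_table: Callable[[str, list], None]) -> None:
        self._write_table = write_table
        self._batches: dict = {}
        self.flush_events: list = []

    def cache(self, table: str, record: dict) -> None:
        self._batches.setdefault(table, []).append(record)

    def finalize_token(self) -> asyncio.Event:
        event = asyncio.Event()
        self.flush_events.append(event)
        return event

    async def flush_cache(self) -> None:
        # Phase 1: write every table. An exception propagates from here
        # with _batches and flush_events untouched.
        for table, records in self._batches.items():
            self._write_table(table, list(records))
        # Phase 2: reached only when every write succeeded.
        self._batches.clear()
        for event in self.flush_events:
            event.set()
        self.flush_events.clear()


async def demo():
    written: dict = {}
    failures = {"remaining": 1}

    def write_table(table: str, records: list) -> None:
        if table == "table_b" and failures["remaining"] > 0:
            failures["remaining"] -= 1
            raise OSError("simulated transient write failure")
        written[table] = records

    provider = ToyArrowProvider(write_table)
    provider.cache("table_a", {"visit_id": 1})
    provider.cache("table_b", {"visit_id": 1})
    token = provider.finalize_token()

    try:
        await provider.flush_cache()  # fails on table_b
    except OSError:
        pass
    set_after_failure = token.is_set()
    await provider.flush_cache()  # the retry succeeds
    return set_after_failure, token.is_set(), written
```

One caveat of this toy: a retry rewrites tables that had already been written before the failure, so it assumes writes are idempotent or overwrite earlier output. How the real provider deals with partially written tables is not stated here.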

Tests

New property-based (Hypothesis) liveness tests in
test/storage/test_storage_controller_liveness.py drive the controller in-process
(no socket, no subprocess, no Firefox) with arbitrary interleavings of
start / finalize(success) / finalize(failure) / never-finalize / concurrent
visits and transient write failures, asserting the forward-progress invariants:

  • I1 every visit finalized(success) with records is fully committed;
  • I2 every visit the controller observed reaches a terminal state (in the
    completion queue / recorded incomplete), never silently stranded;
  • I3 shutdown drains all pending committable data (nothing left only in
    cache);
  • I4 per-visit bookkeeping stays bounded.
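
As a rough illustration of what checking these invariants looks like, the sketch below runs a toy model through every short sequence of visit outcomes and every short write-failure schedule. It swaps Hypothesis for exhaustive enumeration with itertools so it needs only the standard library, and it does not model concurrent visits. ToyController, its methods and the retry count are hypothetical and are not the code under test in this PR.

```python
from itertools import product

SHUTDOWN_FLUSH_RETRIES = 3  # illustrative value, not taken from the PR


class ToyController:
    """Hypothetical model of the controller's bookkeeping."""

    def __init__(self, failures):
        self._failures = list(failures)  # True means the next write fails
        self.cache = {}      # visit_id -> records not yet written
        self.pending = set() # finalized(success) visits awaiting a flush
        self.committed = {}  # visit_id -> records written
        self.terminal = {}   # visit_id -> "committed" or "incomplete"

    def record(self, visit_id, rec):
        self.cache.setdefault(visit_id, []).append(rec)

    def finalize(self, visit_id, success):
        if success:
            self.pending.add(visit_id)
        else:
            self.cache.pop(visit_id, None)
            self.terminal[visit_id] = "incomplete"

    def flush(self):
        # Atomic in the toy: a failure leaves all state untouched.
        if self._failures and self._failures.pop(0):
            raise OSError("simulated transient write failure")
        for visit_id in self.pending:
            self.committed[visit_id] = self.cache.pop(visit_id, [])
            self.terminal[visit_id] = "committed"
        self.pending.clear()

    def tick(self):
        try:
            self.flush()
        except OSError:
            pass  # survive and retry on the next tick

    def shutdown(self):
        for _ in range(SHUTDOWN_FLUSH_RETRIES):
            try:
                self.flush()
                break
            except OSError:
                continue
        # Whatever is still cached is recorded as incomplete.
        for visit_id in list(self.cache):
            del self.cache[visit_id]
            self.terminal[visit_id] = "incomplete"
        self.pending.clear()


def check_invariants(actions, failures):
    ctl = ToyController(failures)
    for visit_id, action in enumerate(actions):
        ctl.record(visit_id, {"visit_id": visit_id})
        if action != "never":
            ctl.finalize(visit_id, success=(action == "success"))
        ctl.tick()
    ctl.shutdown()
    for visit_id, action in enumerate(actions):
        if action == "success":
            assert ctl.committed.get(visit_id) == [{"visit_id": visit_id}]  # I1
        assert visit_id in ctl.terminal  # I2
    assert not ctl.cache  # I3: nothing left only in cache
    assert not ctl.pending  # I4: no leftover per-visit bookkeeping


def check_all():
    cases = 0
    for n in range(1, 4):
        for actions in product(("success", "failure", "never"), repeat=n):
            for failures in product((False, True), repeat=2):
                check_invariants(actions, failures)
                cases += 1
    return cases
```

The failure schedules are kept shorter than the retry bound, which is what makes I1 hold in the toy: with a backend that fails on every attempt, the bounded drain gives up and the remaining visits are recorded as incomplete instead.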

test_forward_progress_survives_transient_write_failures reproduces the original
stranding bug (fails on the pre-fix code, green with this fix).

Documented contract boundary

A visit that emits zero records and is never finalized is invisible to
the StorageController (it has nothing to key on); driving such a
claimed-but-silent visit to a terminal state is the work-queue client's
responsibility (e.g. a visit-claim timeout). The tests encode this boundary
explicitly.
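
One way a work-queue client might cover this boundary is a claim timeout. The sketch below is an assumption about what such a mechanism could look like; ClaimTracker and its methods are hypothetical and are not part of this PR or of any existing client.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimTracker:
    """Hypothetical client-side bookkeeping for claimed sites."""

    claim_timeout: float
    _claims: dict = field(default_factory=dict)  # site -> claim time

    def claim(self, site: str, now: float) -> None:
        self._claims[site] = now

    def complete(self, site: str) -> None:
        # Called when the storage side reports a terminal state.
        self._claims.pop(site, None)

    def expire(self, now: float) -> list:
        # Release claims that never saw a terminal state within the timeout.
        expired = [
            site
            for site, claimed_at in self._claims.items()
            if now - claimed_at >= self.claim_timeout
        ]
        for site in expired:
            del self._claims[site]
        return expired
```

Time is passed in explicitly so the behaviour is easy to test; a real client would more likely read a monotonic clock and decide whether an expired site is re-queued or marked incomplete.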

🤖 Generated with Claude Code

…h failures

A visit whose records were cached in the StorageController could be stranded
perpetually in-progress -- never committed and never marked incomplete -- so the
visit reached no terminal state, the work-queue client never observed completion,
and the data was lost. This was observed in production parquet->S3 crawls.

Root cause: a single transient structured-storage write error (e.g. an S3
PutObject failure) was fatal to forward progress:

- ArrowProvider.flush_cache wrote batches table-by-table and only afterwards
  cleared the batches and signalled the per-finalize flush_events. If a
  write_table call raised partway through, the batches and events were left in
  an inconsistent state: the still-cached records were stranded and the finalize
  tokens for those visits could never resolve, marking the visits perpetually
  in-progress.

- save_batch_if_past_timeout (the only periodic flush driver while a crawl is
  live) had no error handling around flush_cache. One transient error killed the
  coroutine permanently, so finalized visits were never flushed again and stayed
  pending forever.

- shutdown's final drain flush was likewise unguarded, so a transient error
  there aborted the drain and left finalized visits out of the completion queue.

Fixes:

- Make ArrowProvider.flush_cache atomic: write all tables first; only on full
  success clear the batches and set the flush_events. On failure keep both
  intact and re-raise so a later flush retries cleanly without losing data or
  stranding tokens.
- Make save_batch_if_past_timeout survive transient flush errors: log and retry
  on the next tick instead of dying.
- Retry the shutdown drain flush a bounded number of times so a transient error
  no longer aborts the drain.
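
The bounded shutdown retry from the last bullet could look roughly like this. drain_with_retries, make_flaky and the retry count of 3 are hypothetical; the PR names a SHUTDOWN_FLUSH_RETRIES constant but its value and the surrounding shutdown code are not shown here.

```python
import asyncio
import logging

logger = logging.getLogger("shutdown-demo")

SHUTDOWN_FLUSH_RETRIES = 3  # illustrative value, not taken from the PR


async def drain_with_retries(flush_cache, retries: int = SHUTDOWN_FLUSH_RETRIES) -> bool:
    """Try the final flush up to `retries` times; report whether it succeeded."""
    for attempt in range(1, retries + 1):
        try:
            await flush_cache()
            return True
        except Exception:
            logger.exception("shutdown flush attempt %d/%d failed", attempt, retries)
    return False  # give up so shutdown cannot hang on a dead backend


def make_flaky(failures: int):
    """Build a flush coroutine function that fails the first `failures` calls."""
    state = {"calls": 0}

    async def flush_cache() -> None:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise OSError("simulated write failure")

    return flush_cache, state
```

With one simulated failure the drain succeeds on the second attempt; with a backend that always fails it stops after the bound and returns False, leaving the caller to mark the remaining visits incomplete. Whether the real code waits between attempts is not stated in this description.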

Adds property-based (Hypothesis) liveness tests in
test/storage/test_storage_controller_liveness.py that drive the controller with
arbitrary interleavings of start/finalize(success)/finalize(failure)/
never-finalize/concurrent-visits and transient write failures, asserting the
forward-progress invariants: successfully finalized visits are fully committed,
every observed visit reaches a terminal state (never silently stranded),
shutdown drains all pending data, and per-visit bookkeeping stays bounded. The
transient-failure property reproduces the original stranding bug and is green
with this fix.

Also adds hypothesis to the dev/runtime environment (environment.yaml and
scripts/environment-unpinned-dev.yaml); it was previously absent, so the new
liveness tests could not be collected in CI (ModuleNotFoundError: hypothesis).
vringar force-pushed the fix/storage-forward-progress branch from 02fd17c to e815b75 on June 20, 2026 01:35

vringar commented Jun 20, 2026

Contributor Author

Follow-up: the branch was CI-red on all 7 test shards with ModuleNotFoundError: No module named 'hypothesis' at collection time. The new liveness tests import hypothesis, but the dependency was absent from the CI environment.

Added hypothesis=6.155.6 to environment.yaml and hypothesis to scripts/environment-unpinned-dev.yaml (the "environment.yaml pins match unpinned sources" pre-commit hook passes). All 4 liveness invariants pass locally (4 passed, hypothesis 6.155.6); CI re-triggered.

@codecov

codecov Bot commented Jun 20, 2026


Codecov Report

❌ Patch coverage is 81.81818% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.28%. Comparing base (315e667) to head (e815b75).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| openwpm/storage/storage_controller.py | 73.91% | 6 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1197      +/-   ##
==========================================
+ Coverage   62.14%   62.28%   +0.13%     
==========================================
  Files          40       40              
  Lines        3918     3937      +19     
==========================================
+ Hits         2435     2452      +17     
- Misses       1483     1485       +2     
