fix(storage): guarantee forward progress for visits on transient flush failures by vringar · Pull Request #1197 · openwpm/OpenWPM

vringar · 2026-06-20T01:12:54Z

Problem

In production parquet->S3 crawls, some visits were observed to get claimed and
perpetually marked in-progress by the work-queue client while their data
never got committed or pushed out. Records were produced and cached in the
StorageController, but the cache never flushed and the visit never reached a
terminal state -- neither committed nor explicitly marked incomplete. The visit
was stranded forever and its data lost. This is a liveness (forward-progress)
failure.

Root cause

A single transient structured-storage write error (e.g. an S3 PutObject
failure) was fatal to forward progress:

- ArrowProvider.flush_cache was not atomic. It wrote batches
  table-by-table and only afterwards cleared _batches and set the per-finalize
  flush_events. If a write_table call raised partway through, the cached
  records were stranded and the finalize tokens for those visits could never
  resolve, so those visits stayed perpetually in-progress.
- save_batch_if_past_timeout -- the only periodic flush driver while a crawl
  is live -- had no error handling around flush_cache. One transient error
  killed the coroutine permanently (it is a NoReturn loop), so finalized
  visits were never flushed again and stayed pending forever.
- shutdown's final drain flush was likewise unguarded, so a transient
  error there aborted the drain and left finalized visits out of the completion
  queue (and could deadlock on the never-resolving token).

The work-queue client never observes completion for a stranded visit, so it
keeps the site claimed indefinitely.
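
The second failure mode above can be reproduced in isolation. The following is a minimal sketch, not OpenWPM code: `flush_cache` and `unguarded_periodic_flush` are hypothetical stand-ins, and the only assumption is that the periodic driver is an asyncio coroutine with no error handling around the flush call.

```python
import asyncio

flush_attempts = 0


async def flush_cache() -> None:
    """Hypothetical flush that fails on the first call and would succeed afterwards."""
    global flush_attempts
    flush_attempts += 1
    if flush_attempts == 1:
        raise OSError("transient write failure")


async def unguarded_periodic_flush(interval: float) -> None:
    # Nothing catches the exception, so the first error ends the loop for good.
    while True:
        await asyncio.sleep(interval)
        await flush_cache()


async def demo() -> tuple[bool, int]:
    task = asyncio.create_task(unguarded_periodic_flush(0.01))
    await asyncio.sleep(0.1)  # long enough for several ticks, had the loop survived
    died = task.done() and isinstance(task.exception(), OSError)
    return died, flush_attempts


died, attempts = asyncio.run(demo())
print(died, attempts)
```

The task finishes with the exception after a single attempt, and no later tick ever runs, even though the second flush would have succeeded.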

Fix

- ArrowProvider.flush_cache is now atomic: write all tables first; only on
  full success clear _batches and set flush_events. On failure, keep both
  intact and re-raise so a later flush retries cleanly -- no data loss, no
  stranded tokens.
- save_batch_if_past_timeout survives transient flush errors: it logs and
  retries on the next tick instead of dying.
- The shutdown drain flush is retried a bounded number of times
  (SHUTDOWN_FLUSH_RETRIES) so a transient error no longer aborts the drain. The
  bound guarantees shutdown can never hang on a permanently-failing backend.
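
The shape of the first and third fixes can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `ToyProvider`, `flush_with_bounded_retries`, the token list, and the table names are hypothetical, and the value given to `SHUTDOWN_FLUSH_RETRIES` is a guess. A retry here re-writes tables that succeeded before the failure, so the sketch also assumes writes are idempotent or that duplicates are acceptable.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)

SHUTDOWN_FLUSH_RETRIES = 3  # name from the PR description; this value is an assumption


class ToyProvider:
    """Hypothetical stand-in for ArrowProvider; not the OpenWPM implementation."""

    def __init__(self, write_table: Callable[[str, List[dict]], None]) -> None:
        self._write_table = write_table
        self._batches: Dict[str, List[dict]] = {}
        self._pending_tokens: List[int] = []
        self.flushed_tokens: List[int] = []  # stands in for setting flush_events

    def store_record(self, table: str, record: dict) -> None:
        self._batches.setdefault(table, []).append(record)

    def finalize(self, token: int) -> None:
        self._pending_tokens.append(token)

    def flush_cache(self) -> None:
        # Phase 1: write every table. An exception propagates before any state changes.
        for table, records in self._batches.items():
            self._write_table(table, records)
        # Phase 2: reached only if every write succeeded.
        self._batches.clear()
        self.flushed_tokens.extend(self._pending_tokens)
        self._pending_tokens.clear()


def flush_with_bounded_retries(
    provider: ToyProvider, retries: int = SHUTDOWN_FLUSH_RETRIES
) -> bool:
    """Attempt the drain flush at most `retries` times; report whether it succeeded."""
    for attempt in range(1, retries + 1):
        try:
            provider.flush_cache()
            return True
        except Exception:
            logger.exception("flush attempt %d/%d failed", attempt, retries)
    return False


calls = {"n": 0}


def flaky_write(table: str, records: List[dict]) -> None:
    calls["n"] += 1
    if calls["n"] == 2:  # fail only on the second write_table call
        raise OSError("transient write failure")


provider = ToyProvider(flaky_write)
provider.store_record("http_requests", {"visit_id": 1})
provider.store_record("javascript", {"visit_id": 1})
provider.finalize(1)
ok = flush_with_bounded_retries(provider)
```

In this run the first attempt fails on the second table and leaves both the cached records and the pending token untouched; the second attempt writes both tables and only then resolves the token. Wrapping the flush call in the same try/except inside the periodic loop would give the "log and retry on the next tick" behavior described for save_batch_if_past_timeout.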

Tests

New property-based (Hypothesis) liveness tests in
test/storage/test_storage_controller_liveness.py drive the controller in-process
(no socket, no subprocess, no Firefox) with arbitrary interleavings of
start / finalize(success) / finalize(failure) / never-finalize / concurrent
visits and transient write failures, asserting the forward-progress invariants:

I1 every visit finalized(success) with records is fully committed;
I2 every visit the controller observed reaches a terminal state (in the
completion queue / recorded incomplete), never silently stranded;
I3 shutdown drains all pending committable data (nothing left only in
cache);
I4 per-visit bookkeeping stays bounded.

test_forward_progress_survives_transient_write_failures reproduces the original
stranding bug (fails on the pre-fix code, green with this fix).
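
To give a feel for the style of test described here, the following is a much smaller property in the same spirit. It is a sketch against a hypothetical `ToyCache` model, not the real StorageController or the tests in this PR; it only checks that records survive an arbitrary finite schedule of transient flush failures, roughly the idea behind I1 and I3.

```python
from typing import List

from hypothesis import given, settings, strategies as st


class ToyCache:
    """Hypothetical model of an atomic flush; not the real StorageController."""

    def __init__(self, failures: List[bool]) -> None:
        # One flag per upcoming flush attempt; True means that attempt fails.
        self._failures = list(failures)
        self.cached: List[int] = []
        self.committed: List[int] = []

    def add(self, visit_id: int) -> None:
        self.cached.append(visit_id)

    def flush(self) -> None:
        if self._failures and self._failures.pop(0):
            raise OSError("transient write failure")
        self.committed.extend(self.cached)
        self.cached.clear()


@settings(max_examples=200, deadline=None)
@given(
    visit_ids=st.lists(st.integers(min_value=0), unique=True, max_size=20),
    failures=st.lists(st.booleans(), max_size=10),
)
def test_every_visit_is_eventually_committed(visit_ids, failures):
    cache = ToyCache(failures)
    for visit_id in visit_ids:
        cache.add(visit_id)
        try:
            cache.flush()
        except OSError:
            pass  # transient: the records must still be cached for the next attempt
    # Drain: the failure schedule is finite, so len(failures) + 1 attempts suffice.
    for _ in range(len(failures) + 1):
        try:
            cache.flush()
            break
        except OSError:
            continue
    assert cache.cached == []
    assert sorted(cache.committed) == sorted(visit_ids)


test_every_visit_is_eventually_committed()
```

Calling the decorated function directly runs the property over generated inputs; under pytest it would be collected like any other test.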

Documented contract boundary

A visit that emits zero records and is never finalized is invisible to
the StorageController (it has nothing to key on); driving such a
claimed-but-silent visit to a terminal state is the work-queue client's
responsibility (e.g. a visit-claim timeout). The tests encode this boundary
explicitly.
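
One way a client could cover its side of this boundary is a claim timeout. The sketch below is purely illustrative: `ClaimTable` and its methods are hypothetical and are not part of OpenWPM, this PR, or any particular work-queue client.

```python
import time
from typing import Dict, List, Optional


class ClaimTable:
    """Hypothetical client-side bookkeeping for claimed sites."""

    def __init__(self, claim_timeout_s: float) -> None:
        self._timeout = claim_timeout_s
        self._claimed_at: Dict[str, float] = {}

    def claim(self, site: str, now: Optional[float] = None) -> None:
        self._claimed_at[site] = time.monotonic() if now is None else now

    def complete(self, site: str) -> None:
        self._claimed_at.pop(site, None)

    def expire(self, now: Optional[float] = None) -> List[str]:
        """Release and return every claim at least as old as the timeout."""
        now = time.monotonic() if now is None else now
        expired = [s for s, t in self._claimed_at.items() if now - t >= self._timeout]
        for site in expired:
            del self._claimed_at[site]
        return expired


claims = ClaimTable(claim_timeout_s=60.0)
claims.claim("example.com", now=0.0)
claims.claim("example.org", now=0.0)
claims.complete("example.org")         # this visit reported completion
still_fresh = claims.expire(now=30.0)  # nothing is old enough yet
timed_out = claims.expire(now=61.0)    # the silent claim is released
```

A released site can then be re-queued or recorded as incomplete by the client, so a claimed-but-silent visit still reaches a terminal state.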

🤖 Generated with Claude Code

…h failures

A visit whose records were cached in the StorageController could be stranded perpetually in-progress -- never committed and never marked incomplete -- so the visit reached no terminal state, the work-queue client never observed completion, and the data was lost. This was observed in production parquet->S3 crawls.

Root cause: a single transient structured-storage write error (e.g. an S3 PutObject failure) was fatal to forward progress:

- ArrowProvider.flush_cache wrote batches table-by-table and only afterwards cleared the batches and signalled the per-finalize flush_events. If a write_table call raised partway through, the batches and events were left in an inconsistent state: the still-cached records were stranded and the finalize tokens for those visits could never resolve, marking the visits perpetually in-progress.
- save_batch_if_past_timeout (the only periodic flush driver while a crawl is live) had no error handling around flush_cache. One transient error killed the coroutine permanently, so finalized visits were never flushed again and stayed pending forever.
- shutdown's final drain flush was likewise unguarded, so a transient error there aborted the drain and left finalized visits out of the completion queue.

Fixes:

- Make ArrowProvider.flush_cache atomic: write all tables first; only on full success clear the batches and set the flush_events. On failure keep both intact and re-raise so a later flush retries cleanly without losing data or stranding tokens.
- Make save_batch_if_past_timeout survive transient flush errors: log and retry on the next tick instead of dying.
- Retry the shutdown drain flush a bounded number of times so a transient error no longer aborts the drain.

Adds property-based (Hypothesis) liveness tests in test/storage/test_storage_controller_liveness.py that drive the controller with arbitrary interleavings of start/finalize(success)/finalize(failure)/never-finalize/concurrent-visits and transient write failures, asserting the forward-progress invariants: successfully finalized visits are fully committed, every observed visit reaches a terminal state (never silently stranded), shutdown drains all pending data, and per-visit bookkeeping stays bounded. The transient-failure property reproduces the original stranding bug and is green with this fix.

Also adds hypothesis to the dev/runtime environment (environment.yaml and scripts/environment-unpinned-dev.yaml); it was previously absent, so the new liveness tests could not be collected in CI (ModuleNotFoundError: hypothesis).

vringar · 2026-06-20T01:37:35Z

Follow-up: the branch was CI-red on all 7 test shards with ModuleNotFoundError: No module named 'hypothesis' at collection, because the new liveness tests import hypothesis but the dependency was absent from the CI environment.

Added hypothesis=6.155.6 to environment.yaml and hypothesis to scripts/environment-unpinned-dev.yaml (the "environment.yaml pins match unpinned sources" pre-commit hook passes). All 4 liveness invariants pass locally (4 passed, hypothesis 6.155.6); CI re-triggered.

codecov · 2026-06-20T01:49:09Z

Codecov Report

❌ Patch coverage is 81.81818% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.28%. Comparing base (315e667) to head (e815b75).

Files with missing lines	Patch %	Lines
openwpm/storage/storage_controller.py	73.91%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1197      +/-   ##
==========================================
+ Coverage   62.14%   62.28%   +0.13%     
==========================================
  Files          40       40              
  Lines        3918     3937      +19     
==========================================
+ Hits         2435     2452      +17     
- Misses       1483     1485       +2


vringar force-pushed the fix/storage-forward-progress branch from 02fd17c to e815b75 Compare June 20, 2026 01:35

vringar mentioned this pull request Jun 20, 2026

feat(storage): data-failure policy — fail visits on constraint violations (crosslink #46) #1203

Draft

vringar wants to merge 1 commit into master from fix/storage-forward-progress
