Skip to content

fix(get): retry against next candidate when stream assembly fails#4367

Merged
sanity merged 4 commits into
mainfrom
fix-4345-get-retry
Jun 8, 2026
Merged

fix(get): retry against next candidate when stream assembly fails#4367
sanity merged 4 commits into
mainfrom
fix-4345-get-retry

Conversation

@sanity

@sanity sanity commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Problem

#4345: a GET for a multi-fragment contract state intermittently fails with stream assembly: no fragments received within inactivity timeout. The post-v0.2.70 analysis on the issue showed the per-stream transport failure chain is still intact, and on top of it the GET driver compounds the problem at the op layer:

The driver classifies the ResponseStreaming header as terminal success and exits its retry loop. If assemble_and_cache_stream then fails (fragments lost on a lossy path, the sender aborting on the 3s cwnd-wait timeout, a relay's pipe dying after the header was forwarded), the driver only logs a WARN and synthesizes the generic GET succeeded on wire but local store lookup failed client error. One transport hiccup burns the whole GET — the MAX_RETRIES = 3 budget is never consumed, even though other candidates could serve the contract. The relay driver already treats header-without-stream as a retryable routing failure (drive_relay_get_inner's claim-failure continue); the originator had no equivalent.

This is the failure behind riverctl invite accept alternating between the store-lookup error and 60s timeouts, and freenet/mail#288's inbox-GET failures.

Approach

Op layer (commit 1, hardened in commit 4): new drive_get_with_assembly_retry wrapper around the shared retry loop, used by both the client driver and the sub-op driver. On assembly failure it:

  • records RouteOutcome::Failure for the candidate whose header never became a stream (client path only — sub-op GETs have never fed the router, mirroring pre-existing design);
  • advances to the next candidate via the existing MAX_RETRIES budget;
  • re-enters the loop with a fresh attempt tx — reusing the original tx would collide with the relay dedup gates (active_relay_get_txs) while the failed attempt's relay chain is still draining.

Once a streaming header has been seen, exhaustion can never surface as NotFound (the contract provably exists): assembly-time budget exhaustion keeps the original Done(Streaming) outcome, and a wire exhaustion on a re-entered loop is converted back to the remembered header (review finding, commit 4). Either way the client sees a diagnostic OperationError carrying the assembly cause. The shape (outer wrapper rather than retrying inside the loop's Terminal arm) is dictated by the op_ctx.rs pin test requiring the Terminal arm to return synchronously.

Observability (commit 2): transfer_failed / transport_snapshot / timeout telemetry events carried an empty peer_id, and no event carried a version — the v0.2.70 post-release analysis could not attribute the failing senders. The emitting node's own peer id is now threaded into TelemetryReporter/TelemetryWorker (same construction and caveats as the shadow-RTT events in p2p_impl.rs), and the standard service.version OTLP resource attribute is added. Additive only; the dashboard parses attributes permissively and events.peer_id was already nullable.

Out of scope: the underlying transport failure chain (loss-pause cwnd cap + 3s cwnd-wait abort + flightsize release semantics). That fix needs a stream→packet index in SentPacketTracker and touches congestion-control invariants — analysis posted on #4345 for the assignees. This PR makes individual GETs survive that chain; it does not reduce the per-stream failure rate.

Testing

  • E2E regression (test_streaming_get_retries_after_assembly_failure): two-phase cold-node isolation on a routable sparse topology; arms exactly one deterministic assembly failure for the contract key via the new get_assembly_fault_injection hook (keyed by ContractKey so parallel tests can't consume each other's budget; the injection fails before the claim — the relay claim-failure shape — leaving the inbound stream orphaned for GC; driver-visible behavior is identical to the production mid-assembly timeout (both surface as Err from assemble_and_cache_stream)). Asserts the driver took the retry path and the requester ended up with the full, correct state. Verified to fail without the fix (every candidate: retry never taken, no state) and pass with it.
  • New source-scrape pin tests: assembly failure must capture the failed target before advance(), must use a fresh attempt tx, and exhaustion must never become NotFound; both drivers must route through the wrapper.
  • Unit tests for the injection hook semantics and for telemetry (transfer_failed carries the local peer id; OTLP resource carries service.version).
  • Full streaming_e2e suite green (11 passed, 2 pre-existing ignores); GET driver unit tests green (83); cargo fmt + CI-equivalent cargo clippy --locked clean.

Part of #4345 — the issue stays open for the transport-layer root cause.

[AI-assisted - Claude]

sanity and others added 2 commits June 7, 2026 20:31
)

The GET driver classified the ResponseStreaming *header* as terminal
success and exited its retry loop; if the stream assembly then failed
(fragments lost on a lossy path, the sender aborting on cwnd-wait
timeout, a relay's pipe dying after the header was forwarded), the
driver only logged a WARN and synthesized a generic client error —
one transport hiccup burned the whole GET with the MAX_RETRIES budget
untouched. The relay driver already treats header-without-stream as a
retryable routing failure; this gives the originator (and the sub-op
driver) the same semantics.

- New drive_get_with_assembly_retry wrapper: on assembly failure,
  penalize the failed candidate (RouteOutcome::Failure), advance to
  the next candidate (shared retry budget), and re-enter the loop
  with a fresh attempt tx (reusing the old tx would collide with the
  relay dedup gates). Exhaustion keeps the Done(Streaming) outcome so
  the client sees a diagnostic OperationError, never a false NotFound.
- The synthesized client error now carries the assembly failure cause
  instead of the generic store-lookup message.
- Deterministic per-contract-key fault injection
  (get_assembly_fault_injection) + e2e regression test that fails
  without the fix and passes with it, plus source-scrape pin tests
  for the new invariants.

Part of #4345 (op-layer resilience; the underlying transport
flightsize/loss-pause interaction is tracked separately in the issue).

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d and version (#4345)

transfer_failed / transfer_started / transfer_completed,
transport_snapshot, and timeout telemetry events carried an empty
peer_id, making sender attribution in the collector impossible — the
v0.2.70 post-release analysis of #4345 could not tell which versions
the failing senders ran. The transport layer genuinely has no peer
identity, but the emitting node's own id is what these events need
and it is available at reporter construction.

- Thread the local peer id (public key + best-effort address, same
  construction and caveats as the shadow-RTT events in p2p_impl.rs)
  into TelemetryReporter/TelemetryWorker and stamp it on the three
  event families that previously sent an empty peer_id.
- Add the standard service.version OTLP resource attribute so every
  event is attributable to a crate version without joining against
  peer_startup events.

Additive fields only; the dashboard parses unknown attributes
permissively and events.peer_id was already nullable.

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

I have all the context I need. Here is the review:


Rule Review: No blocking issues; one scope note and one minor style observation

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md
Files reviewed: 7

Warnings

None.

Info

  • crates/core/src/tracing/telemetry.rs (and node.rs, p2p_impl.rs) — The commit feat(telemetry): attribute transport-level events to the local peer id and version (#4345) is a new capability (propagating local_peer_id into TelemetryReporter / TelemetryWorker, adding service.version to OTLP output) bundled into a PR whose title is fix(get). The git-workflow rule requires one logical change per PR and flags features bundled into fix PRs as scope creep. The telemetry work is motivated by the same issue (GET fails with 'stream assembly: no fragments received within inactivity timeout' for multi-fragment state; contract never caches, SUBSCRIBE then rejects #4345) and clearly labeled with a feat: prefix (not smuggled), but a reviewer enforcing the one-change rule could reasonably ask for it to be split. (rule: git-workflow.md — one logical change per PR)

  • crates/core/src/operations/get/op_ctx_task.rs:340-352 — The let Some(peer_addr) = driver.current_target.socket_addr() else { … break result; } early-exit branch in drive_get_with_assembly_retry is new code that directly affects AssemblyOutcome (sets assembly.error, breaks without retrying). This defensive guard mirrors a branch that existed before this PR in drive_client_get_inner, but there is no unit or source-pin test that exercises it specifically. It is a genuine if rare failure mode (target with no socket address). A small source-pin assertion confirming socket_addr() is checked before the assembly call would harden this invariant the same way the other structural pins do. (rule: testing.md — edge-case paths should have test coverage)

No other issues found: fix: commits include a comprehensive regression test (test_streaming_get_retries_after_assembly_failure) plus source-pin tests for the key structural invariants (assembly_failure_advances_with_fresh_tx_and_never_exhausts_to_notfound, streaming_terminal_calls_assemble_and_cache_stream); no .unwrap() in production code; no fire-and-forget spawns; the assembly-failure retry loop has no sleep, so the jitter rule does not apply; the AdvanceOutcome match is exhaustive; the operations.md documentation update is accurate and consistent with the new code structure.


Rule review against .claude/rules/. WARNING findings block merge.

sanity and others added 2 commits June 7, 2026 20:42
… wrapper

Match the retry-loop result by reference (the Streaming fields are all
Copy) so the original outcome value stays whole; the three give-up
exits in drive_get_with_assembly_retry now `break result` instead of
each rebuilding RetryLoopOutcome::Done(Terminal::Streaming { ... }).
This removes the thrice-repeated rewrap, drops the now-unused
total_size binding, and lists the non-streaming pass-through variants
explicitly (future Terminal variants force a decision here).

Also derive Default for AssemblyOutcome and update the source-scrape
pin test anchor from the rebuilt-terminal text to `break result` in
the same edit; the negative Exhausted-conversion assertion is
unchanged. No behavior change.

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… exhaustion, stale diagnostics, test gaps

Review findings from the four-perspective + external (Gemini) pass on
PR #4367, all addressed:

- False NotFound (code-first + testing reviewers, blocking class): an
  assembly failure that advanced and re-entered the loop could then
  exhaust on the wire, and the pass-through Exhausted outcome mapped
  to ContractResponse::NotFound — a false NotFound for a contract
  whose streaming header proved existence. The wrapper now remembers
  the failed header and converts a subsequent wire exhaustion back to
  Done(Streaming), so the client always gets the diagnostic
  OperationError. Pinned in the assembly-retry pin test.
- Stale assembly.error (both reviewers): cleared when a retry
  resolves via a non-streaming terminal, so an unrelated store-lookup
  miss is not mislabeled as an assembly failure.
- Stale cross-attempt timing on the conversion path (adversarial
  reviewer): request_sent_at/response_received_at are cleared in the
  conversion arm so the caller's received<sent guard doesn't fire its
  clock-regression WARN with an incoherent pair.
- Test gaps (testing reviewer): cause-string construction extracted
  into pure helpers (synthesized_get_error_cause /
  sub_op_not_found_cause) with unit tests; route-failure emission and
  header-memory pinned in the Next arm (and pinned absent in the
  Exhausted arm); transport_snapshot construction extracted and
  tested; timeout-event peer-id stamping tested; tighter pin anchors.
- E2E CI bound (testing reviewer): candidate iteration capped (the
  deterministic seed demonstrates on the 5th candidate; cap at 6).
- Peer-id construction dedup (code-first nit): shared
  NodeConfig::local_peer_id_string() used by both the telemetry
  reporter and the shadow-RTT/reference-ping path.
- operations.md: documented the GET assembly-retry wrapper entry
  point (big-picture reviewer).

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sanity

sanity commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

Consolidated multi-perspective review (Full tier)

Reviewers: code-first, testing, big-picture, adversarial (4 Claude perspectives) + Gemini as the external non-Claude pass (Codex unavailable — usage limit until Jun 10). All findings addressed in b8140fc or dismissed with justification below.

Fixed (b8140fc)

  1. False NotFound after assembly-retry exhaustion (code-first Overall architecture RFC #1, testing Overall architecture RFC #1 — the one blocking-class finding, found independently by both): an assembly failure that advanced and re-entered the loop could then exhaust on the wire, and the pass-through Exhausted mapped to NotFound — false for a contract whose streaming header proved existence, and it silently dropped the assembly diagnostic. The wrapper now remembers the failed header and converts a later wire exhaustion back to Done(Streaming) → diagnostic OperationError. Pinned.
  2. Stale assembly.error (code-first Peer resource usage balancing #4, testing Social credit #6): cleared when a retry resolves via InlineFound/LocalCompletion, so unrelated store-lookup misses aren't mislabeled as assembly failures.
  3. Incoherent cross-attempt timing on the conversion path (adversarial Overall architecture RFC #1): request_sent_at/response_received_at cleared in the conversion arm — the received<sent "clock-source regression" WARN no longer fires with a stale pair.
  4. Test gaps (testing NAT traversal #2Intelligent routing #5, Implement join ring op #8): cause strings extracted into pure helpers with unit tests; route-failure emission + header memory pinned in the Next arm and pinned absent in the Exhausted arm; transport_snapshot construction extracted + tested; timeout-event peer-id stamping tested; tighter pin anchoring on match assemble_and_cache_stream.
  5. E2E CI bound (testing): candidate iteration capped at 6 (deterministic seed demonstrates on the 5th candidate); process-global counter caveat documented.
  6. Peer-id construction dedup (code-first Contract-key API #7): shared NodeConfig::local_peer_id_string() now used by both the telemetry reporter and the shadow-RTT/reference-ping path.
  7. .claude/rules/operations.md (big-picture): GET assembly-retry wrapper documented as the canonical entry shape for retryable post-terminal side effects.

Dismissed, with justification

  • Worst-case GET latency up to ~8 min (Gemini P2): real but bounded by the same MAX_RETRIES budget that already allowed ~4 min of wire timeouts pre-fix; clients keep their own timeouts (riverctl 60s, mail 30s) so the user-visible wait is unchanged — and the background completion that follows is precisely the desirable behavior observed in the v0.2.70 field report (contract eventually cached + auto-subscribed, so the user's next attempt succeeds). Documented in the wrapper rustdoc.
  • Leaf-node peer-id address staleness (Gemini P2): known caveat shared with the shadow-RTT events — the pubkey half (the part that identifies the node) is always correct; the address half falls back to the listener until external-address discovery. Refresh is transport: make reference-ping target + enable configurable (#4074 Phase 1.5 follow-up) #4294's scope; the shared helper means one fix covers all emitters.
  • Sub-op retries don't feed the router (Gemini P3): deliberate — sub-op GETs have never emitted route events (pre-existing design); enabling that is a behavior change beyond this fix's scope.
  • Addressless-target break instead of advance (Gemini P3): only reachable when the initial target was the addressless own_location() fallback on an isolated node; preserves pre-fix semantics exactly for a path where a streaming header is already pathological.
  • Success route event can credit the last exhausted candidate when the local store happens to hold a copy (adversarial NAT traversal #2, nit): rare, telemetry-noise-level; returning the local copy to the client is correct, and re-attributing the route event would complicate the shared Done arm for negligible model impact.
  • maybe_subscribe_child unconditional on host_result (adversarial Decentralized email proof-of-concept #3): pre-existing asymmetry, reachable before this PR via the inline assembly-failure path; gating it changes explicit-subscribe semantics and belongs in its own change if wanted.
  • testing feature unification via fdev (adversarial Peer resource usage balancing #4): pre-existing repo-wide property; the new hook is inert unless inject_failures is called, matching the existing test-hook pattern.

Follow-ups noted (not this PR)

  • Merge-order coordination with draft fix(get): populate hop_count on streaming GET successes #4329 (same-file conflict guaranteed; whichever lands second needs conflict-resolution re-review).
  • bug-prevention-patterns row for the incident class "terminal success classified before a fallible side effect completes → retry budget never consumed" (lives in the parent freenet config repo).
  • Optional: preserve the original client tx in structured logs across assembly retries (big-picture nit).

Adversarial reviewer verified explicitly: loop termination (budget consumed before any candidate including the bootstrap fallback), orphan-stream GC bounds the abandoned stream (60s TTL, no cross-attempt claim collision), per-attempt pending_op_results slots released on all paths, no missed TelemetryReporter::new call sites.

[AI-assisted - Claude]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant