fix(get): retry against next candidate when stream assembly fails by sanity · Pull Request #4367 · freenet/freenet-core

sanity · 2026-06-08T01:32:36Z

Problem

#4345: a GET for a multi-fragment contract state intermittently fails with stream assembly: no fragments received within inactivity timeout. The post-v0.2.70 analysis on the issue showed the per-stream transport failure chain is still intact, and on top of it the GET driver compounds the problem at the op layer:

The driver classifies the ResponseStreaming header as terminal success and exits its retry loop. If assemble_and_cache_stream then fails (fragments lost on a lossy path, the sender aborting on the 3s cwnd-wait timeout, a relay's pipe dying after the header was forwarded), the driver only logs a WARN and synthesizes the generic GET succeeded on wire but local store lookup failed client error. One transport hiccup burns the whole GET — the MAX_RETRIES = 3 budget is never consumed, even though other candidates could serve the contract. The relay driver already treats header-without-stream as a retryable routing failure (drive_relay_get_inner's claim-failure continue); the originator had no equivalent.

This is the failure behind riverctl invite accept alternating between the store-lookup error and 60s timeouts, and freenet/mail#288's inbox-GET failures.

Approach

Op layer (commit 1, hardened in commit 4): new drive_get_with_assembly_retry wrapper around the shared retry loop, used by both the client driver and the sub-op driver. On assembly failure it:

records RouteOutcome::Failure for the candidate whose header never became a stream (client path only — sub-op GETs have never fed the router, mirroring pre-existing design);
advances to the next candidate via the existing MAX_RETRIES budget;
re-enters the loop with a fresh attempt tx — reusing the original tx would collide with the relay dedup gates (active_relay_get_txs) while the failed attempt's relay chain is still draining.

Once a streaming header has been seen, exhaustion can never surface as NotFound (the contract provably exists): assembly-time budget exhaustion keeps the original Done(Streaming) outcome, and a wire exhaustion on a re-entered loop is converted back to the remembered header (review finding, commit 4). Either way the client sees a diagnostic OperationError carrying the assembly cause. The shape (outer wrapper rather than retrying inside the loop's Terminal arm) is dictated by the op_ctx.rs pin test requiring the Terminal arm to return synchronously.

Observability (commit 2): transfer_failed / transport_snapshot / timeout telemetry events carried an empty peer_id, and no event carried a version — the v0.2.70 post-release analysis could not attribute the failing senders. The emitting node's own peer id is now threaded into TelemetryReporter/TelemetryWorker (same construction and caveats as the shadow-RTT events in p2p_impl.rs), and the standard service.version OTLP resource attribute is added. Additive only; the dashboard parses attributes permissively and events.peer_id was already nullable.

Out of scope: the underlying transport failure chain (loss-pause cwnd cap + 3s cwnd-wait abort + flightsize release semantics). That fix needs a stream→packet index in SentPacketTracker and touches congestion-control invariants — analysis posted on #4345 for the assignees. This PR makes individual GETs survive that chain; it does not reduce the per-stream failure rate.

Testing

E2E regression (test_streaming_get_retries_after_assembly_failure): two-phase cold-node isolation on a routable sparse topology; arms exactly one deterministic assembly failure for the contract key via the new get_assembly_fault_injection hook (keyed by ContractKey so parallel tests can't consume each other's budget; the injection fails before the claim — the relay claim-failure shape — leaving the inbound stream orphaned for GC; driver-visible behavior is identical to the production mid-assembly timeout (both surface as Err from assemble_and_cache_stream)). Asserts the driver took the retry path and the requester ended up with the full, correct state. Verified to fail without the fix (every candidate: retry never taken, no state) and pass with it.
New source-scrape pin tests: assembly failure must capture the failed target before advance(), must use a fresh attempt tx, and exhaustion must never become NotFound; both drivers must route through the wrapper.
Unit tests for the injection hook semantics and for telemetry (transfer_failed carries the local peer id; OTLP resource carries service.version).
Full streaming_e2e suite green (11 passed, 2 pre-existing ignores); GET driver unit tests green (83); cargo fmt + CI-equivalent cargo clippy --locked clean.

Part of #4345 — the issue stays open for the transport-layer root cause.

[AI-assisted - Claude]

) The GET driver classified the ResponseStreaming *header* as terminal success and exited its retry loop; if the stream assembly then failed (fragments lost on a lossy path, the sender aborting on cwnd-wait timeout, a relay's pipe dying after the header was forwarded), the driver only logged a WARN and synthesized a generic client error — one transport hiccup burned the whole GET with the MAX_RETRIES budget untouched. The relay driver already treats header-without-stream as a retryable routing failure; this gives the originator (and the sub-op driver) the same semantics. - New drive_get_with_assembly_retry wrapper: on assembly failure, penalize the failed candidate (RouteOutcome::Failure), advance to the next candidate (shared retry budget), and re-enter the loop with a fresh attempt tx (reusing the old tx would collide with the relay dedup gates). Exhaustion keeps the Done(Streaming) outcome so the client sees a diagnostic OperationError, never a false NotFound. - The synthesized client error now carries the assembly failure cause instead of the generic store-lookup message. - Deterministic per-contract-key fault injection (get_assembly_fault_injection) + e2e regression test that fails without the fix and passes with it, plus source-scrape pin tests for the new invariants. Part of #4345 (op-layer resilience; the underlying transport flightsize/loss-pause interaction is tracked separately in the issue). [AI-assisted - Claude] Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d and version (#4345) transfer_failed / transfer_started / transfer_completed, transport_snapshot, and timeout telemetry events carried an empty peer_id, making sender attribution in the collector impossible — the v0.2.70 post-release analysis of #4345 could not tell which versions the failing senders ran. The transport layer genuinely has no peer identity, but the emitting node's own id is what these events need and it is available at reporter construction. - Thread the local peer id (public key + best-effort address, same construction and caveats as the shadow-RTT events in p2p_impl.rs) into TelemetryReporter/TelemetryWorker and stamp it on the three event families that previously sent an empty peer_id. - Add the standard service.version OTLP resource attribute so every event is attributable to a crate version without joining against peer_startup events. Additive fields only; the dashboard parses unknown attributes permissively and events.peer_id was already nullable. [AI-assisted - Claude] Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-08T01:36:38Z

I have all the context I need. Here is the review:

Rule Review: No blocking issues; one scope note and one minor style observation

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md
Files reviewed: 7

Warnings

None.

Info

crates/core/src/tracing/telemetry.rs (and node.rs, p2p_impl.rs) — The commit feat(telemetry): attribute transport-level events to the local peer id and version (#4345) is a new capability (propagating local_peer_id into TelemetryReporter / TelemetryWorker, adding service.version to OTLP output) bundled into a PR whose title is fix(get). The git-workflow rule requires one logical change per PR and flags features bundled into fix PRs as scope creep. The telemetry work is motivated by the same issue (GET fails with 'stream assembly: no fragments received within inactivity timeout' for multi-fragment state; contract never caches, SUBSCRIBE then rejects #4345) and clearly labeled with a feat: prefix (not smuggled), but a reviewer enforcing the one-change rule could reasonably ask for it to be split. (rule: git-workflow.md — one logical change per PR)
crates/core/src/operations/get/op_ctx_task.rs:340-352 — The let Some(peer_addr) = driver.current_target.socket_addr() else { … break result; } early-exit branch in drive_get_with_assembly_retry is new code that directly affects AssemblyOutcome (sets assembly.error, breaks without retrying). This defensive guard mirrors a branch that existed before this PR in drive_client_get_inner, but there is no unit or source-pin test that exercises it specifically. It is a genuine if rare failure mode (target with no socket address). A small source-pin assertion confirming socket_addr() is checked before the assembly call would harden this invariant the same way the other structural pins do. (rule: testing.md — edge-case paths should have test coverage)

No other issues found: fix: commits include a comprehensive regression test (test_streaming_get_retries_after_assembly_failure) plus source-pin tests for the key structural invariants (assembly_failure_advances_with_fresh_tx_and_never_exhausts_to_notfound, streaming_terminal_calls_assemble_and_cache_stream); no .unwrap() in production code; no fire-and-forget spawns; the assembly-failure retry loop has no sleep, so the jitter rule does not apply; the AdvanceOutcome match is exhaustive; the operations.md documentation update is accurate and consistent with the new code structure.

Rule review against .claude/rules/. WARNING findings block merge.

… wrapper Match the retry-loop result by reference (the Streaming fields are all Copy) so the original outcome value stays whole; the three give-up exits in drive_get_with_assembly_retry now `break result` instead of each rebuilding RetryLoopOutcome::Done(Terminal::Streaming { ... }). This removes the thrice-repeated rewrap, drops the now-unused total_size binding, and lists the non-streaming pass-through variants explicitly (future Terminal variants force a decision here). Also derive Default for AssemblyOutcome and update the source-scrape pin test anchor from the rebuilt-terminal text to `break result` in the same edit; the negative Exhausted-conversion assertion is unchanged. No behavior change. [AI-assisted - Claude] Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… exhaustion, stale diagnostics, test gaps Review findings from the four-perspective + external (Gemini) pass on PR #4367, all addressed: - False NotFound (code-first + testing reviewers, blocking class): an assembly failure that advanced and re-entered the loop could then exhaust on the wire, and the pass-through Exhausted outcome mapped to ContractResponse::NotFound — a false NotFound for a contract whose streaming header proved existence. The wrapper now remembers the failed header and converts a subsequent wire exhaustion back to Done(Streaming), so the client always gets the diagnostic OperationError. Pinned in the assembly-retry pin test. - Stale assembly.error (both reviewers): cleared when a retry resolves via a non-streaming terminal, so an unrelated store-lookup miss is not mislabeled as an assembly failure. - Stale cross-attempt timing on the conversion path (adversarial reviewer): request_sent_at/response_received_at are cleared in the conversion arm so the caller's received<sent guard doesn't fire its clock-regression WARN with an incoherent pair. - Test gaps (testing reviewer): cause-string construction extracted into pure helpers (synthesized_get_error_cause / sub_op_not_found_cause) with unit tests; route-failure emission and header-memory pinned in the Next arm (and pinned absent in the Exhausted arm); transport_snapshot construction extracted and tested; timeout-event peer-id stamping tested; tighter pin anchors. - E2E CI bound (testing reviewer): candidate iteration capped (the deterministic seed demonstrates on the 5th candidate; cap at 6). - Peer-id construction dedup (code-first nit): shared NodeConfig::local_peer_id_string() used by both the telemetry reporter and the shadow-RTT/reference-ping path. - operations.md: documented the GET assembly-retry wrapper entry point (big-picture reviewer). [AI-assisted - Claude] Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sanity · 2026-06-08T02:11:49Z

Consolidated multi-perspective review (Full tier)

Reviewers: code-first, testing, big-picture, adversarial (4 Claude perspectives) + Gemini as the external non-Claude pass (Codex unavailable — usage limit until Jun 10). All findings addressed in b8140fc or dismissed with justification below.

Fixed (`b8140fc`)

False NotFound after assembly-retry exhaustion (code-first Overall architecture RFC #1, testing Overall architecture RFC #1 — the one blocking-class finding, found independently by both): an assembly failure that advanced and re-entered the loop could then exhaust on the wire, and the pass-through Exhausted mapped to NotFound — false for a contract whose streaming header proved existence, and it silently dropped the assembly diagnostic. The wrapper now remembers the failed header and converts a later wire exhaustion back to Done(Streaming) → diagnostic OperationError. Pinned.
Stale assembly.error (code-first Peer resource usage balancing #4, testing Social credit #6): cleared when a retry resolves via InlineFound/LocalCompletion, so unrelated store-lookup misses aren't mislabeled as assembly failures.
Incoherent cross-attempt timing on the conversion path (adversarial Overall architecture RFC #1): request_sent_at/response_received_at cleared in the conversion arm — the received<sent "clock-source regression" WARN no longer fires with a stale pair.
Test gaps (testing NAT traversal #2–Intelligent routing #5, Implement join ring op #8): cause strings extracted into pure helpers with unit tests; route-failure emission + header memory pinned in the Next arm and pinned absent in the Exhausted arm; transport_snapshot construction extracted + tested; timeout-event peer-id stamping tested; tighter pin anchoring on match assemble_and_cache_stream.
E2E CI bound (testing): candidate iteration capped at 6 (deterministic seed demonstrates on the 5th candidate); process-global counter caveat documented.
Peer-id construction dedup (code-first Contract-key API #7): shared NodeConfig::local_peer_id_string() now used by both the telemetry reporter and the shadow-RTT/reference-ping path.
.claude/rules/operations.md (big-picture): GET assembly-retry wrapper documented as the canonical entry shape for retryable post-terminal side effects.

Dismissed, with justification

Worst-case GET latency up to ~8 min (Gemini P2): real but bounded by the same MAX_RETRIES budget that already allowed ~4 min of wire timeouts pre-fix; clients keep their own timeouts (riverctl 60s, mail 30s) so the user-visible wait is unchanged — and the background completion that follows is precisely the desirable behavior observed in the v0.2.70 field report (contract eventually cached + auto-subscribed, so the user's next attempt succeeds). Documented in the wrapper rustdoc.
Leaf-node peer-id address staleness (Gemini P2): known caveat shared with the shadow-RTT events — the pubkey half (the part that identifies the node) is always correct; the address half falls back to the listener until external-address discovery. Refresh is transport: make reference-ping target + enable configurable (#4074 Phase 1.5 follow-up) #4294's scope; the shared helper means one fix covers all emitters.
Sub-op retries don't feed the router (Gemini P3): deliberate — sub-op GETs have never emitted route events (pre-existing design); enabling that is a behavior change beyond this fix's scope.
Addressless-target break instead of advance (Gemini P3): only reachable when the initial target was the addressless own_location() fallback on an isolated node; preserves pre-fix semantics exactly for a path where a streaming header is already pathological.
Success route event can credit the last exhausted candidate when the local store happens to hold a copy (adversarial NAT traversal #2, nit): rare, telemetry-noise-level; returning the local copy to the client is correct, and re-attributing the route event would complicate the shared Done arm for negligible model impact.
maybe_subscribe_child unconditional on host_result (adversarial Decentralized email proof-of-concept #3): pre-existing asymmetry, reachable before this PR via the inline assembly-failure path; gating it changes explicit-subscribe semantics and belongs in its own change if wanted.
testing feature unification via fdev (adversarial Peer resource usage balancing #4): pre-existing repo-wide property; the new hook is inert unless inject_failures is called, matching the existing test-hook pattern.

Follow-ups noted (not this PR)

Merge-order coordination with draft fix(get): populate hop_count on streaming GET successes #4329 (same-file conflict guaranteed; whichever lands second needs conflict-resolution re-review).
bug-prevention-patterns row for the incident class "terminal success classified before a fallible side effect completes → retry budget never consumed" (lives in the parent freenet config repo).
Optional: preserve the original client tx in structured logs across assembly retries (big-picture nit).

Adversarial reviewer verified explicitly: loop termination (budget consumed before any candidate including the bootstrap fallback), orphan-stream GC bounds the abandoned stream (60s TTL, no cross-attempt claim collision), per-attempt pending_op_results slots released on all paths, no missed TelemetryReporter::new call sites.

[AI-assisted - Claude]

sanity and others added 2 commits June 7, 2026 20:31

sanity mentioned this pull request Jun 8, 2026

GET fails with 'stream assembly: no fragments received within inactivity timeout' for multi-fragment state; contract never caches, SUBSCRIBE then rejects #4345

Closed

sanity and others added 2 commits June 7, 2026 20:42

sanity added this pull request to the merge queue Jun 8, 2026

Merged via the queue into main with commit be69a61 Jun 8, 2026
15 checks passed

sanity deleted the fix-4345-get-retry branch June 8, 2026 03:25

iduartgomez mentioned this pull request Jun 8, 2026

fix: deliver prune disconnect cause through the waiter channel #4360

Merged

netsirius mentioned this pull request Jun 13, 2026

fix: drain tracker by stream_id on abort #4395

Closed

This was referenced Jun 15, 2026

test: make test_streaming_get_retries_after_assembly_failure load-robust #4420

Merged

Silent node deadlock: untimed event-log channel send blocks the event loop under telemetry-sink stall #4466

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(get): retry against next candidate when stream assembly fails#4367

fix(get): retry against next candidate when stream assembly fails#4367
sanity merged 4 commits into
mainfrom
fix-4345-get-retry

sanity commented Jun 8, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

sanity commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

sanity commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Approach

Testing

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rule Review: No blocking issues; one scope note and one minor style observation

Warnings

Info

Uh oh!

sanity commented Jun 8, 2026

Consolidated multi-perspective review (Full tier)

Fixed (b8140fc)

Dismissed, with justification

Follow-ups noted (not this PR)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sanity commented Jun 8, 2026 •

edited

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading

Fixed (`b8140fc`)