fix(get): retry against next candidate when stream assembly fails #4367
The GET driver classified the ResponseStreaming *header* as terminal success and exited its retry loop. If the stream assembly then failed (fragments lost on a lossy path, the sender aborting on cwnd-wait timeout, a relay's pipe dying after the header was forwarded), the driver only logged a WARN and synthesized a generic client error: one transport hiccup burned the whole GET with the MAX_RETRIES budget untouched. The relay driver already treats header-without-stream as a retryable routing failure; this gives the originator (and the sub-op driver) the same semantics.

- New drive_get_with_assembly_retry wrapper: on assembly failure, penalize the failed candidate (RouteOutcome::Failure), advance to the next candidate (shared retry budget), and re-enter the loop with a fresh attempt tx (reusing the old tx would collide with the relay dedup gates). Exhaustion keeps the Done(Streaming) outcome so the client sees a diagnostic OperationError, never a false NotFound.
- The synthesized client error now carries the assembly failure cause instead of the generic store-lookup message.
- Deterministic per-contract-key fault injection (get_assembly_fault_injection) + e2e regression test that fails without the fix and passes with it, plus source-scrape pin tests for the new invariants.

Part of #4345 (op-layer resilience; the underlying transport flightsize/loss-pause interaction is tracked separately in the issue).

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
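The retry shape described in this commit can be sketched as below. This is a minimal illustration, not the code in the PR: `GetOutcome`, `get_with_assembly_retry`, the `attempt` callback and the integer candidate and tx ids are all hypothetical stand-ins, and the sketch assumes the budget caps the total number of attempts.

```rust
/// Hypothetical budget constant; the sketch treats it as a cap on total attempts.
pub const MAX_RETRIES: usize = 3;

/// Hypothetical outcome type, not the real freenet `RetryLoopOutcome`.
#[derive(Debug, Clone, PartialEq)]
pub enum GetOutcome {
    /// The stream was fully assembled from this candidate.
    Assembled { candidate: u32, attempt_tx: u64 },
    /// A streaming header was seen, but no attempt produced a full stream.
    StreamingFailed { cause: String },
    /// No candidate ever answered with a header.
    NotFound,
}

/// `attempt(candidate, tx)` models one GET attempt:
/// `Ok(true)` = header arrived and the stream assembled,
/// `Ok(false)` = the candidate did not have the contract,
/// `Err(cause)` = header arrived but assembly failed.
/// Returns the outcome plus the candidates that were penalized.
pub fn get_with_assembly_retry<F>(candidates: &[u32], mut attempt: F) -> (GetOutcome, Vec<u32>)
where
    F: FnMut(u32, u64) -> Result<bool, String>,
{
    let mut penalized = Vec::new();
    let mut last_cause = None;
    for (i, &candidate) in candidates.iter().take(MAX_RETRIES).enumerate() {
        // Fresh attempt tx per try, so a retry cannot collide with a dedup
        // gate still holding the previous attempt's tx.
        let attempt_tx = i as u64;
        match attempt(candidate, attempt_tx) {
            Ok(true) => return (GetOutcome::Assembled { candidate, attempt_tx }, penalized),
            Ok(false) => {}
            Err(cause) => {
                // Header without a stream: penalize and move to the next candidate.
                penalized.push(candidate);
                last_cause = Some(cause);
            }
        }
    }
    // Once any header was seen, exhaustion reports the assembly cause, never NotFound.
    let outcome = match last_cause {
        Some(cause) => GetOutcome::StreamingFailed { cause },
        None => GetOutcome::NotFound,
    };
    (outcome, penalized)
}
```

With candidates `[10, 20, 30]` and an `attempt` that fails assembly only for `10`, the sketch penalizes `10` and returns `Assembled` from `20` on the second attempt tx.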
…d and version (#4345)

transfer_failed / transfer_started / transfer_completed, transport_snapshot, and timeout telemetry events carried an empty peer_id, making sender attribution in the collector impossible: the v0.2.70 post-release analysis of #4345 could not tell which versions the failing senders ran. The transport layer genuinely has no peer identity, but the emitting node's own id is what these events need and it is available at reporter construction.

- Thread the local peer id (public key + best-effort address, same construction and caveats as the shadow-RTT events in p2p_impl.rs) into TelemetryReporter/TelemetryWorker and stamp it on the three event families that previously sent an empty peer_id.
- Add the standard service.version OTLP resource attribute so every event is attributable to a crate version without joining against peer_startup events.

Additive fields only; the dashboard parses unknown attributes permissively and events.peer_id was already nullable.

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
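For illustration, the stamping described in this commit could look roughly like the sketch below. `Reporter`, `Event` and the `key@address` id format are hypothetical and do not reflect the real `TelemetryReporter` API or the actual peer-id construction; only the `service.version` attribute name comes from the commit message.

```rust
use std::collections::BTreeMap;

/// Hypothetical telemetry event; the real event types carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub name: &'static str,
    /// Nullable on the collector side, so `None` stays representable.
    pub peer_id: Option<String>,
}

/// Hypothetical reporter that knows the local node's identity at construction.
pub struct Reporter {
    local_peer_id: String,
    resource: BTreeMap<&'static str, String>,
}

impl Reporter {
    /// The address is best-effort: it may be unknown when the reporter is built.
    /// The `key@address` format is an assumption made for this sketch.
    pub fn new(public_key: &str, address: Option<&str>, crate_version: &str) -> Self {
        let local_peer_id = match address {
            Some(addr) => format!("{public_key}@{addr}"),
            None => public_key.to_string(),
        };
        let mut resource = BTreeMap::new();
        // Resource attributes apply to every event this reporter emits.
        resource.insert("service.version", crate_version.to_string());
        Self { local_peer_id, resource }
    }

    /// Stamp the emitting node's own id instead of leaving `peer_id` empty.
    pub fn event(&self, name: &'static str) -> Event {
        Event { name, peer_id: Some(self.local_peer_id.clone()) }
    }

    pub fn resource(&self) -> &BTreeMap<&'static str, String> {
        &self.resource
    }
}
```

Putting the version on the resource rather than on each event keeps the per-event payload unchanged, which matches the "additive fields only" constraint stated above.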
I have all the context I need. Here is the review:

Rule Review: No blocking issues; one scope note and one minor style observation

Rules checked:

Warnings

None.

Info
No other issues found: Rule review against |
… wrapper
Match the retry-loop result by reference (the Streaming fields are all
Copy) so the original outcome value stays whole; the three give-up
exits in drive_get_with_assembly_retry now `break result` instead of
each rebuilding RetryLoopOutcome::Done(Terminal::Streaming { ... }).
This removes the thrice-repeated rewrap, drops the now-unused
total_size binding, and lists the non-streaming pass-through variants
explicitly (future Terminal variants force a decision here).
Also derive Default for AssemblyOutcome and update the source-scrape
pin test anchor from the rebuilt-terminal text to `break result` in
the same edit; the negative Exhausted-conversion assertion is
unchanged. No behavior change.
[AI-assisted - Claude]
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
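The refactor pattern in this commit (match by reference, copy the fields out, then `break result` with the value still whole) can be sketched as follows. The types and the `drive` function are hypothetical simplifications, not the real `RetryLoopOutcome` or driver.

```rust
/// Hypothetical streaming payload; all fields are `Copy`, as in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Streaming {
    pub stream_id: u64,
    pub includes_contract: bool,
}

/// Hypothetical loop outcome with one streaming and two pass-through variants.
#[derive(Debug, Clone, PartialEq)]
pub enum LoopOutcome {
    Streaming(Streaming),
    Found(Vec<u8>),
    Exhausted,
}

/// Re-run the loop until assembly succeeds or the budget is spent,
/// returning the last loop result unchanged.
pub fn drive<R, A>(mut run_loop: R, mut assemble: A, budget: usize) -> LoopOutcome
where
    R: FnMut() -> LoopOutcome,
    A: FnMut(u64) -> bool,
{
    let mut attempts = 0;
    loop {
        let result = run_loop();
        // Matching on `&result` only borrows; the owned value is still intact
        // once the copied field has been read.
        match &result {
            LoopOutcome::Streaming(s) => {
                let stream_id = s.stream_id; // `Copy` field, borrow ends here
                attempts += 1;
                if assemble(stream_id) || attempts >= budget {
                    break result; // no need to rebuild the variant
                }
            }
            // Listed explicitly so a new variant forces a decision here.
            LoopOutcome::Found(_) | LoopOutcome::Exhausted => break result,
        }
    }
}
```

Listing the pass-through variants instead of using a wildcard arm means the compiler rejects the match when a variant is added, which is the property the commit message calls out.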
… exhaustion, stale diagnostics, test gaps

Review findings from the four-perspective + external (Gemini) pass on PR #4367, all addressed:

- False NotFound (code-first + testing reviewers, blocking class): an assembly failure that advanced and re-entered the loop could then exhaust on the wire, and the pass-through Exhausted outcome mapped to ContractResponse::NotFound, a false NotFound for a contract whose streaming header proved existence. The wrapper now remembers the failed header and converts a subsequent wire exhaustion back to Done(Streaming), so the client always gets the diagnostic OperationError. Pinned in the assembly-retry pin test.
- Stale assembly.error (both reviewers): cleared when a retry resolves via a non-streaming terminal, so an unrelated store-lookup miss is not mislabeled as an assembly failure.
- Stale cross-attempt timing on the conversion path (adversarial reviewer): request_sent_at/response_received_at are cleared in the conversion arm so the caller's received<sent guard doesn't fire its clock-regression WARN with an incoherent pair.
- Test gaps (testing reviewer): cause-string construction extracted into pure helpers (synthesized_get_error_cause / sub_op_not_found_cause) with unit tests; route-failure emission and header-memory pinned in the Next arm (and pinned absent in the Exhausted arm); transport_snapshot construction extracted and tested; timeout-event peer-id stamping tested; tighter pin anchors.
- E2E CI bound (testing reviewer): candidate iteration capped (the deterministic seed demonstrates on the 5th candidate; cap at 6).
- Peer-id construction dedup (code-first nit): shared NodeConfig::local_peer_id_string() used by both the telemetry reporter and the shadow-RTT/reference-ping path.
- operations.md: documented the GET assembly-retry wrapper entry point (big-picture reviewer).

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
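The first two findings (false NotFound and stale assembly error) amount to a small piece of state carried across attempts. A minimal sketch, assuming hypothetical `RetryMemory`, `StreamingHeader` and `LoopOutcome` types that are not the real wrapper's:

```rust
/// Hypothetical header summary; `Copy` so it can be remembered cheaply.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct StreamingHeader {
    pub stream_id: u64,
    pub total_size: u64,
}

/// Hypothetical loop outcome: streaming terminal, non-streaming terminal, exhaustion.
#[derive(Debug, Clone, PartialEq)]
pub enum LoopOutcome {
    Streaming(StreamingHeader),
    Found,
    Exhausted,
}

/// What the wrapper remembers between attempts in this sketch.
#[derive(Debug, Default)]
pub struct RetryMemory {
    pub failed_header: Option<StreamingHeader>,
    pub assembly_error: Option<String>,
}

impl RetryMemory {
    /// Called when a header arrived but the stream could not be assembled.
    pub fn record_assembly_failure(&mut self, header: StreamingHeader, cause: &str) {
        self.failed_header = Some(header);
        self.assembly_error = Some(cause.to_string());
    }

    /// Map the outcome of a re-entered loop to what the caller should see.
    pub fn finalize(&mut self, outcome: LoopOutcome) -> LoopOutcome {
        match (outcome, self.failed_header) {
            // A header proved the contract exists, so wire exhaustion must not
            // fall through to a NotFound-style result.
            (LoopOutcome::Exhausted, Some(header)) => LoopOutcome::Streaming(header),
            // A non-streaming terminal resolved the GET: drop the stale cause so
            // an unrelated failure is not labeled as an assembly failure.
            (LoopOutcome::Found, _) => {
                self.assembly_error = None;
                LoopOutcome::Found
            }
            (other, _) => other,
        }
    }
}
```

Without a remembered header, `finalize(Exhausted)` stays `Exhausted`, so a genuinely missing contract still reports as such.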
Consolidated multi-perspective review (Full tier)

Reviewers: code-first, testing, big-picture, adversarial (4 Claude perspectives) + Gemini as the external non-Claude pass (Codex unavailable: usage limit until Jun 10). All findings addressed in b8140fc or dismissed with justification below.

Fixed (b8140fc)
Dismissed, with justification
Follow-ups noted (not this PR)
Adversarial reviewer verified explicitly: loop termination (budget consumed before any candidate including the bootstrap fallback), orphan-stream GC bounds the abandoned stream (60s TTL, no cross-attempt claim collision), per-attempt

[AI-assisted - Claude]
Problem

#4345: a GET for a multi-fragment contract state intermittently fails with `stream assembly: no fragments received within inactivity timeout`. The post-v0.2.70 analysis on the issue showed the per-stream transport failure chain is still intact, and on top of it the GET driver compounds the problem at the op layer:

The driver classifies the `ResponseStreaming` header as terminal success and exits its retry loop. If `assemble_and_cache_stream` then fails (fragments lost on a lossy path, the sender aborting on the 3s cwnd-wait timeout, a relay's pipe dying after the header was forwarded), the driver only logs a WARN and synthesizes the generic `GET succeeded on wire but local store lookup failed` client error. One transport hiccup burns the whole GET: the `MAX_RETRIES = 3` budget is never consumed, even though other candidates could serve the contract. The relay driver already treats header-without-stream as a retryable routing failure (`drive_relay_get_inner`'s claim-failure `continue`); the originator had no equivalent.

This is the failure behind `riverctl invite accept` alternating between the store-lookup error and 60s timeouts, and freenet/mail#288's inbox-GET failures.

Approach
Op layer (commit 1, hardened in commit 4): new `drive_get_with_assembly_retry` wrapper around the shared retry loop, used by both the client driver and the sub-op driver. On assembly failure it:

- reports `RouteOutcome::Failure` for the candidate whose header never became a stream (client path only; sub-op GETs have never fed the router, mirroring pre-existing design);
- advances to the next candidate under the shared `MAX_RETRIES` budget;
- re-enters the loop with a fresh attempt tx, since reusing the old tx would collide with the relay dedup gates (`active_relay_get_txs`) while the failed attempt's relay chain is still draining.

Once a streaming header has been seen, exhaustion can never surface as `NotFound` (the contract provably exists): assembly-time budget exhaustion keeps the original `Done(Streaming)` outcome, and a wire exhaustion on a re-entered loop is converted back to the remembered header (review finding, commit 4). Either way the client sees a diagnostic `OperationError` carrying the assembly cause. The shape (outer wrapper rather than retrying inside the loop's Terminal arm) is dictated by the `op_ctx.rs` pin test requiring the Terminal arm to return synchronously.

Observability (commit 2): `transfer_failed` / `transport_snapshot` / `timeout` telemetry events carried an empty `peer_id`, and no event carried a version, so the v0.2.70 post-release analysis could not attribute the failing senders. The emitting node's own peer id is now threaded into `TelemetryReporter`/`TelemetryWorker` (same construction and caveats as the shadow-RTT events in `p2p_impl.rs`), and the standard `service.version` OTLP resource attribute is added. Additive only; the dashboard parses attributes permissively and `events.peer_id` was already nullable.

Out of scope: the underlying transport failure chain (loss-pause cwnd cap + 3s cwnd-wait abort + flightsize release semantics). That fix needs a stream→packet index in `SentPacketTracker` and touches congestion-control invariants; analysis is posted on #4345 for the assignees. This PR makes individual GETs survive that chain; it does not reduce the per-stream failure rate.

Testing
- E2E regression test (`test_streaming_get_retries_after_assembly_failure`): two-phase cold-node isolation on a routable sparse topology; arms exactly one deterministic assembly failure for the contract key via the new `get_assembly_fault_injection` hook (keyed by `ContractKey` so parallel tests can't consume each other's budget). The injection fails before the claim (the relay claim-failure shape), leaving the inbound stream orphaned for GC; driver-visible behavior is identical to the production mid-assembly timeout (both surface as `Err` from `assemble_and_cache_stream`). Asserts the driver took the retry path and the requester ended up with the full, correct state. Verified to fail without the fix (every candidate: retry never taken, no state) and pass with it.
- Source-scrape pin tests: assembly failure must `advance()`, must use a fresh attempt tx, and exhaustion must never become `NotFound`; both drivers must route through the wrapper.
- Telemetry tests (`transfer_failed` carries the local peer id; OTLP resource carries `service.version`).
- `streaming_e2e` suite green (11 passed, 2 pre-existing ignores); GET driver unit tests green (83); `cargo fmt` + CI-equivalent `cargo clippy --locked` clean.

Part of #4345; the issue stays open for the transport-layer root cause.
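The per-key fault budget used by the regression test can be pictured as below. This is a sketch under assumptions: `AssemblyFaults`, `arm` and `should_fail` are hypothetical names, and a `String` stands in for `ContractKey`; the real `get_assembly_fault_injection` hook may be shaped differently.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical fault registry: each key carries its own failure budget,
/// so tests running in parallel cannot consume each other's injections.
#[derive(Default)]
pub struct AssemblyFaults {
    armed: Mutex<HashMap<String, u32>>,
}

impl AssemblyFaults {
    /// Arm `failures` injected assembly failures for one contract key.
    pub fn arm(&self, key: &str, failures: u32) {
        self.armed.lock().unwrap().insert(key.to_string(), failures);
    }

    /// Returns true if the next assembly for `key` should fail,
    /// consuming one unit of that key's budget.
    pub fn should_fail(&self, key: &str) -> bool {
        let mut armed = self.armed.lock().unwrap();
        match armed.get_mut(key) {
            Some(remaining) if *remaining > 0 => {
                *remaining -= 1;
                true
            }
            // Unarmed keys and spent budgets never fail.
            _ => false,
        }
    }
}
```

Arming exactly one failure makes the first assembly for that key fail and every later one proceed, which gives the "fail once, then retry succeeds" shape the test asserts on.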
[AI-assisted - Claude]