Skip to content

fix(get): populate hop_count on streaming GET successes#4329

Draft
sanity wants to merge 3 commits into
mainfrom
fix-4249
Draft

fix(get): populate hop_count on streaming GET successes#4329
sanity wants to merge 3 commits into
mainfrom
fix-4249

Conversation

@sanity

@sanity sanity commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Problem

PR #4245 added a hop_count field to GetMsg::Response so terminal GET telemetry (GetSuccess) could report routing depth, but left GetMsg::ResponseStreaming — the variant used for GET responses over streaming_threshold (default 64 KB) — without the field.

Worse, in the task-per-tx GET architecture the originator's GET reply is delivered straight to its driver's pending_op_result waiter rather than through the event loop's inbound dispatch, so the originator never runs from_inbound_msg_v1 for its own reply. The implicit GetSuccess arm there only fires at relays forwarding an inline Response{Found}, and not at all for streamed responses (ResponseStreaming fell into the Get(_) => Ignored catch-all).

Net effect: streamed GET successes — the large, data-rich contracts most likely to surface routing problems — produced no terminal GET telemetry at all. The dashboard's GET success-rate and hop-depth panels were silently blank for any contract large enough to stream, and the simulation regression test for this (test_hop_count_populated_on_terminal_get_events, #4250) had to stay #[ignore]d because it never saw a terminal GET event.

Approach

The issue anticipated this ("If they don't emit GetSuccess: route streaming completion through GetSuccess (preferred)"). Empirically confirmed via instrumentation that from_inbound_msg_v1 is never reached for the originator's own GET reply, so the fix routes terminal GET telemetry through the driver:

  • Add hop_count: usize to GetMsg::ResponseStreaming and thread it through the driver's Terminal::Streaming, classify, the storer/upgrade producer (relay_send_found), and the relay fork+pipe forward — mirroring the established PutMsg::ResponseStreaming pattern.
  • Emit GetSuccess explicitly from the originator's GET driver (drive_client_get_inner) on client-visible success, carrying the wire-carried hop_count clamped to max_hops_to_live — mirroring how PUT emits from finalize_put_at_originator. Covers inline AND streaming, direct AND relayed GETs uniformly.
  • Add the matching from_inbound_msg_v1 ResponseStreaming arm so relays emit GetSuccess for streamed responses too, restoring symmetry with the inline Response{Found} arm.
  • Bump version / min-compatible-version to 0.2.69 for the positional bincode wire-format change.

No double-emission: the originator emits via the driver (relays never run that branch); relays emit via from_inbound_msg_v1 (the originator never reaches it for its own reply). Each peer attributes one GetSuccess to itself, exactly as the inline path already does.

Testing

  • test_get_msg_response_streaming_hop_count_roundtrip (get.rs): wire roundtrip for the new field across 5 hop-count values.
  • classify_response_streaming_is_streaming_terminal (op_ctx_task.rs): classifier preserves hop_count into Terminal::Streaming.
  • test_streaming_get_emits_get_success_with_hop_count (streaming_e2e): a deterministic 1 MB relayed streaming GET emits a GetSuccess event with a populated, in-range hop_count; guarded by the streaming-forward counter so it can't pass on a local-cache shortcut. Verified to fail before the fix (0 GetSuccess events) and pass after.
  • Un-ignored test_hop_count_populated_on_terminal_get_events (test: un-ignore test_hop_count_populated_on_terminal_get_events once a GET-producing workload is wired through TestConfig #4250): rewired onto a deterministic controlled PUT-then-GET workload (sparse topology forces a routed GET) so it reliably produces terminal GET successes. Stable across repeated runs.

cargo fmt, cargo clippy --locked -- -D warnings, and the full streaming_e2e + GET simulation suites are green locally; a test_direct_runner_determinism run confirms the telemetry change is deterministic.

Closes #4249
Closes #4250

[AI-assisted - Claude]

## Problem

PR #4245 added a `hop_count` field to `GetMsg::Response` so terminal GET
telemetry (`GetSuccess`) could report routing depth, but left
`GetMsg::ResponseStreaming` — the variant used for GET responses over
`streaming_threshold` (default 64 KB) — without the field. Worse, in the
task-per-tx GET architecture the originator's GET reply is delivered
straight to its driver's `pending_op_result` waiter rather than through
the event loop's inbound dispatch, so the originator never runs
`from_inbound_msg_v1` for its own reply. The implicit `GetSuccess` arm
there only fires at relays forwarding an inline `Response{Found}`, and
not at all for streamed responses (`ResponseStreaming` fell into the
`Get(_) => Ignored` catch-all). Net effect: streamed GET successes — the
large, data-rich contracts most likely to surface routing problems —
produced no terminal GET telemetry at all, and the simulation regression
test for this (`test_hop_count_populated_on_terminal_get_events`, #4250)
had to stay `#[ignore]`d because it never saw a terminal GET event.

## Approach

- Add `hop_count: usize` to `GetMsg::ResponseStreaming` and thread it
  through the driver's `Terminal::Streaming`, `classify`, the storer/
  upgrade producer (`relay_send_found`), and the relay fork+pipe forward,
  exactly mirroring the established `PutMsg::ResponseStreaming` pattern.
- Emit `GetSuccess` explicitly from the originator's GET driver
  (`drive_client_get_inner`) on client-visible success, carrying the
  wire-carried `hop_count` clamped to `max_hops_to_live` — mirroring how
  PUT emits from `finalize_put_at_originator`. This covers inline AND
  streaming, direct AND relayed GETs uniformly.
- Add the matching `from_inbound_msg_v1` `ResponseStreaming` arm so relays
  emit `GetSuccess` for streamed responses too, restoring symmetry with
  the inline `Response{Found}` arm.
- Bump `version`/`min-compatible-version` to 0.2.69 for the wire-format
  change (positional bincode field).

## Testing

- `test_get_msg_response_streaming_hop_count_roundtrip` (get.rs): wire
  roundtrip for the new field.
- `classify_response_streaming_is_streaming_terminal` (op_ctx_task.rs):
  the classifier preserves `hop_count` into `Terminal::Streaming`.
- `test_streaming_get_emits_get_success_with_hop_count` (streaming_e2e):
  a deterministic 1MB relayed streaming GET emits a `GetSuccess` event
  with a populated, in-range `hop_count`. Fails before the fix (0 events).
- Un-ignored `test_hop_count_populated_on_terminal_get_events` (#4250):
  rewired onto a deterministic controlled PUT-then-GET workload so it
  reliably produces terminal GET successes.

Closes #4249
Closes #4250

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Rule Review: No issues found

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md
Files reviewed: 7 (Cargo.lock, crates/core/Cargo.toml, crates/core/src/operations/get.rs, crates/core/src/operations/get/op_ctx_task.rs, crates/core/src/tracing.rs, crates/core/tests/simulation_integration.rs, crates/core/tests/streaming_e2e.rs, crates/fdev/Cargo.toml)

No rule violations detected.

Key properties verified:

  • Regression tests present: test_get_msg_response_streaming_hop_count_roundtrip (bincode roundtrip with 5 boundary values), test_streaming_get_emits_get_success_with_hop_count (streaming_e2e.rs E2E), and the un-ignored test_hop_count_populated_on_terminal_get_events simulation test all satisfy the fix: regression-test requirement.
  • No production .unwrap(): All new .unwrap() calls are in test code only.
  • No fire-and-forget spawns: The new telemetry emission uses register_events(...).await on op_manager.ring — not a spawned task.
  • No push-before-send violation: The GetSuccess telemetry emission in drive_client_get_inner occurs after the stream is assembled and host_result is determined — it's not a network send and doesn't affect upstream-forward ordering.
  • Double-count prevention: is_streaming_reply gate correctly restricts driver-side GetSuccess emission to the streaming path only; the max_dup <= 1 simulation assertion guards against future regressions.
  • Wire-compat handled correctly: min-compatible-version bumped to 0.2.69, matching the version bump, with an explicit code comment explaining why #[serde(default)] doesn't provide bincode backward compatibility.
  • Previously-ignored test properly un-ignored: The test was rewritten on run_controlled_simulation to produce deterministic terminal GET events; no tracking issue reference is needed since the reason for the original ignore is resolved.
  • hop_count clamped to max_hops_to_live: Mirrors the defense-in-depth applied on the inline from_inbound_msg_v1 path.

Rule review against .claude/rules/. WARNING findings block merge.

sanity and others added 2 commits June 2, 2026 12:40
Codex review (P2): the `from_inbound_msg_v1` `ResponseStreaming` arm
recorded `GetSuccess` on the stream HEADER, before the payload arrived.
Unlike inline `Response{Found}` (whose envelope carries the full payload,
so receipt IS success), a `ResponseStreaming` message is only metadata —
the payload streams separately and can still fail to be delivered,
claimed, assembled, or deserialized. Emitting success on the header would
report success for a payload the node may never receive, and risked
double-counting against the new driver-side emission.

Fix: remove that arm entirely (let `ResponseStreaming` stay in the
`Get(_) => Ignored` catch-all). The terminal streaming `GetSuccess` is now
emitted solely by the originator's GET driver, which fires only after
`build_host_response`'s local-store re-query confirms the assembled
payload is present (`host_result.is_ok()`) — so success accurately
reflects payload arrival and there is exactly one emission per peer.

Also: give `test_streaming_get_emits_get_success_with_hop_count` a unique
seed (was shared with `test_streaming_get_through_relay`). The shared seed
collided in seed-keyed global registries under parallel execution, making
the suite intermittently fail; the streaming-forward counter guard still
proves the streaming path fires regardless of seed.

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…count

Codex review (P2): an inline `GetMsg::Response` reply flows through
`handle_pure_network_message_v1`, whose unconditional `from_inbound_msg_v1`
call already emits a `GetSuccess` for the inline `Response{Found}` arm
BEFORE the driver bypass forwards the reply. So the previous unconditional
driver-side emission produced a SECOND `GetSuccess` for the same
`(tx, peer)` on inline GETs — a double-count. Reproduced in
`test_hop_count_populated_on_terminal_get_events`: one peer logged 2
GetSuccess events for a single tx.

Fix: gate the driver-side emission to the streaming reply path
(`Terminal::is_streaming()`). `GetMsg::ResponseStreaming` is the only GET
reply `from_inbound_msg_v1` deliberately does not emit for (the header is
not proof the payload arrived), so the driver emission fills exactly that
gap without duplicating the inline path.

Also add a permanent no-double-count guard to the un-ignored #4250 test:
assert no peer emits more than one GetSuccess per transaction. Verified it
fails (max_dup=2) if the emission is ungated and passes (max_dup=1) when
gated to streaming.

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sanity

sanity commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

Review summary (multi-lens self-review + external Codex pass)

Ran a self-review across four lenses (code-first, testing-gaps, skeptical/adversarial, big-picture) plus two rounds of external codex review --base main. The Codex passes surfaced two real P2 findings, both now fixed; the final Codex pass is clean ("No discrete correctness issues were identified in the diff").

Finding 1 (Codex P2) — premature streaming-header success.
The first revision added a from_inbound_msg_v1 ResponseStreaming arm that recorded GetSuccess on the stream header, before the payload arrived. Unlike inline Response{Found} (whose envelope carries the full payload), a streaming header is only metadata and the transfer can still fail. Resolved: removed that arm; the streaming GetSuccess is now emitted solely by the originator's GET driver, only after build_host_response's local-store re-query confirms the assembled payload is present (host_result.is_ok()).

Finding 2 (Codex P2) — inline GetSuccess double-count.
An inline Response reply flows through handle_pure_network_message_v1, whose unconditional from_inbound_msg_v1 call already emits GetSuccess for inline Response{Found} before the driver bypass. The unconditional driver emission therefore double-counted inline GET successes (reproduced: one peer logged 2 GetSuccess for one tx). Resolved: gated the driver emission to the streaming path (Terminal::is_streaming()), which is exactly the path from_inbound_msg_v1 ignores. Added a permanent no-double-count guard to the un-ignored #4250 test (asserts no peer emits >1 GetSuccess per tx); verified it fails ungated and passes gated.

Investigation note. Instrumentation confirmed that in the task-per-tx architecture the originator's GET reply is delivered straight to its driver's pending_op_result waiter; for streaming the reply never reaches from_inbound_msg_v1, so the originator produced no terminal streaming GET telemetry at all — the core #4249 gap. The driver-side emission (mirroring PUT's finalize_put_at_originator) is the correct fix.

Verification. cargo fmt, cargo clippy --locked -- -D warnings, full streaming_e2e suite (incl. the new regression test, run 3× in parallel — no flakiness), the un-ignored #4250 test, the get unit tests, and test_direct_runner_determinism are all green locally. The streaming regression test is verified to fail before the fix and pass after.

[AI-assisted - Claude]

@sanity sanity added this pull request to the merge queue Jun 2, 2026
@sanity sanity marked this pull request as draft June 2, 2026 18:11
@sanity

sanity commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

Holding this PR for human review before merge. It turned out to require a wire-format / protocol change (bump to v0.2.69, hop_count added to GetMsg::ResponseStreaming / Terminal::Streaming). Wire-format/protocol changes are our highest-risk surface and warrant a human look before they reach the network — even with green CI and a clean external (Codex) review.

This was produced by an autonomous fix agent whose guardrails said to stop and report on wire-format changes rather than auto-merge; flagging it here so @sanity can review the protocol-version bump and cross-version compatibility before it lands. Converted to draft + auto-merge disabled to take it out of the merge queue meanwhile. The code itself looks solid (3 regression tests, 2 Codex P2s fixed, un-ignores #4250) — this is purely the merge gate for a protocol change.

[AI-assisted - Claude]

@sanity sanity removed this pull request from the merge queue due to a manual request Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant