fix: populate hop_count on terminal GET events#4245
Conversation
Previously, `EventKind::Get::{GetSuccess,GetNotFound,GetFailure}.hop_count`
was computed at log time via `op_manager.get_current_hop(id)` — which is a
stub that always returns `None` (op_state_manager.rs:615). The result: the
field was effectively never populated, so any consumer of hop_count data
(routing-claim analysis, sweep benchmark) saw `None` for ~99% of events.
This commit carries hop_count on the wire instead:
- Add `hop_count: usize` field to `GetMsg::Response`. Semantics: the
forward-path depth (`max_htl - htl_at_responder`) at the node that
produced the Response. Relays preserve this value when bubbling a
downstream Response upstream — they do NOT increment on the return
path. The originator reads it directly from the inbound message.
- Thread the value through the relay driver:
- `relay_send_not_found(htl)` / `relay_send_found(hop_count)` now take
the value at construction.
- `Terminal::InlineFound` gains a `hop_count` field so the bubble-up
callsite can preserve the downstream storer's value.
- All 5 production callsites compute `max_htl - htl` (own production)
or pass through `downstream_hop_count` (bubble-up R12a).
- `tracing.rs` `from_inbound_msg_v1` reads `hop_count` directly from the
Response message; the `get_current_hop` call is dropped.
Wire compat: bincode is positional so adding a field is binary-incompatible
without `MIN_COMPATIBLE_VERSION` consideration. Same pattern as the earlier
`subscribe: bool` addition on `GetMsg::Request`. The repo's handshake-time
version check governs cross-version peering.
Scope notes:
- This commit covers GET only. PUT, UPDATE, and SUBSCRIBE terminal events
ALSO carry `hop_count: Option<usize>` and ALSO go through the same
`get_current_hop` stub — they're broken too, but fixing them needs the
same wire-format threading and is out of scope for this PR. The whitepaper
benchmark is focused on GET routing, so this commit is sufficient.
- `GetMsg::ResponseStreaming` (large-payload streaming path) is NOT updated.
The simulator's small-state GETs use the inline `Response` path; streaming
is for >64KB payloads. Can be extended if streaming-mode hop data is
needed later.
[AI-assisted - Claude]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a relay GET exhausts downstream candidates without finding the contract it sends NotFound upstream. Previously the event-log NotFound construction passed hop_count=None even though the relay knows its own forward depth (max_htl - htl). Compute and pass it.
|
I have all the context I need. Let me write up the review. Rule Review: Regression test gap in tracing.rs fix pathRules checked: Warnings
Info
Summary: One warning around the untested tracing path — the wire-format and classifier tests are solid, but a direct unit test asserting that Rule review against |
Adds a public EventKind::hop_count() accessor so external tests can read hop_count from terminal GET events without pattern-matching on the pub(crate) GetEvent enum. Adds test_hop_count_populated_on_terminal_get_events as a regression test for the bug fixed in the previous two commits: before the fix, hop_count was None on ~99% of terminal GET events because op_manager had cleaned up the op before the event was logged. The test runs a small SimNetwork and asserts at least one terminal GET event has populated hop_count — which fails without the wire-format fix and passes with it.
Two findings from the Claude rule review on the previous push: 1. **MIN_COMPATIBLE_VERSION not bumped.** Adding a positional field to bincode-serialized GetMsg::Response breaks wire compat with peers running pre-fix binaries (0.2.61 / 0.2.62 admit-then-deserialization- fail). Bumps both crates/core/Cargo.toml's package version and min-compatible-version to 0.2.63. Also bumps the fdev dep on core to match. Build-script check (build.rs:78) verifies the invariant min_compatible_version <= package_version. 2. **Regression test only asserted presence, not correctness.** The simulation-integration test asserted populated > 0, which would pass even if every hop_count = 0. In practice the default event_chain workload at CI scales doesn't produce terminal GET events at all (the test was failing in CI with 2840 events and 0 GET events). Replacing the simulation test with a direct unit test in crates/core/src/operations/get.rs that asserts the new hop_count field roundtrips through bincode for both Found and NotFound variants of GetMsg::Response across a range of values (0, 1, 4, 10, 64). That is the specific regression scenario this PR addresses; the bincode- positional caveat from the existing subscribe-roundtrip test applies identically here. The simulation-integration test is kept in tree but marked #[ignore] with a docstring pointing at the unit test. It can be rebuilt later once the run_controlled_simulation pattern from test_get_reliability_diagnostic is generalised to non-nightly CI.
Synthesised findings from code-first / testing / skeptical / big-picture /
Codex reviews — fixes the actionable ones; tracking the rest.
**Testing-reviewer (was blocking):** added classifier-side regression test
`classify_response_found_preserves_hop_count` in op_ctx_task.rs that
asserts `classify()` propagates the wire-carried hop_count into
`Terminal::InlineFound` unchanged across hop_count values
{0,1,4,10,64}. The existing bincode-roundtrip test in get.rs covered
the wire format only; this test pins the classifier so a future refactor
that synthesised 0 (or dropped the field) on the relay bubble-up path
trips immediately.
**Skeptical-reviewer M1 (bound check on peer-supplied hop_count):**
clamp `hop_count` to `ring.max_hops_to_live` at the two
GetSuccess/GetNotFound extraction sites in tracing.rs. A malicious or
buggy peer can ship hop_count = usize::MAX in the bincode payload;
the clamp keeps telemetry/dashboards within sensible bounds.
**Skeptical-reviewer M2 (NotFound semantics documentation):** added a
docstring paragraph at the NotFound extraction site clarifying that
hop_count there is the *exhaustion depth* (deepest peer the request
reached before exhausting / store-missing), not a path-to-storer
depth — the typical analytics misinterpretation.
**Big-picture nit:** updated the stale line reference in
op_state_manager.rs:614 (was 'tracing.rs:1266', actually 1271) and
clarified that only the PUT path still consumes that stub; GET was
migrated to a wire-carried field by this PR.
Skeptical M1 verification confirmed wire-format handshake gracefully
rejects mixed 0.2.62/0.2.63 in both directions (no crash, just
ack_error rejection). All InlineFound construction sites verified to
use the correct hop_count source.
Not addressed in this commit (tracked for follow-up):
- ResponseStreaming variant doesn't carry hop_count (would affect
large-payload GETs returning via streaming if they emit GetSuccess
through this path; needs a separate audit).
- PutSuccess / UpdateSuccess / SubscribeSuccess still use the
get_current_hop() stub and have the same None-hop_count gap. Will
follow up as a separate PR once GET is merged.
- Un-#[ignore]ing the simulation-integration test (needs a workload
that deterministically produces terminal GETs at CI scales).
Codex re-review flagged the #[ignore] reason as missing the required tracking-issue reference per repo testing conventions. Now references #4250 which captures the work to un-ignore.
sanity
left a comment
There was a problem hiding this comment.
Comprehensive PR Review: #4245
Summary
- PR Title: fix: populate hop_count on terminal GET events
- Type: fix (wire-format change)
- Review tier: Full (wire format is a high-risk surface)
- Reviewers run: freenet:code-first, freenet:testing, freenet:skeptical, freenet:big-picture, Codex (twice — once on initial submission, once on the fix commits)
- HEAD reviewed:
85557cf7(post fix commits)
Code-First Analysis
Independent understanding: PR adds positional hop_count: usize to GetMsg::Response. Storer fills max_htl - htl, relay bubble-up preserves it verbatim, HTL-exhaustion NotFound fills its own max_htl - htl. Tracing-layer extraction reads from the wire instead of the stub op_manager.get_current_hop().
Alignment: matches the PR description for inline Response. Two scope gaps that the description correctly acknowledges:
GetMsg::ResponseStreamingnot addressed (streaming GETs unchanged).GetFailureevent has no live emission site in the tree — the "Success/NotFound/Failure" framing is partly aspirational.
Wire-compat: verified at handshake — connection_handler.rs:2584-2609 cleanly rejects mixed 0.2.62↔0.2.63 in both directions via ack_error.
Testing Assessment
| Test Type | Status | Notes |
|---|---|---|
| Unit (wire roundtrip) | ✅ | test_get_msg_response_hop_count_roundtrip in get.rs — Found/NotFound across {0,1,4,10,64} |
| Unit (classifier) | ✅ | classify_response_found_preserves_hop_count in op_ctx_task.rs — pins the bubble-up path |
| Simulation | test_hop_count_populated_on_terminal_get_events is #[ignore]d; tracks #4250 |
|
| E2E | N/A |
Regression Test: present and adequate at the unit level. Initial testing-reviewer concern about "presence-only" assertion has been addressed: there are now two unit tests pinning both the wire format and the classifier-side propagation. The blocking ask was the classifier pin — done.
Remaining gap: no end-to-end integration test exercises the storer→relay→originator chain. The two unit tests catch the most likely regression modes (field dropped, classifier synthesised 0, refactor lost the field) but not "a relay overwrites with its own htl in relay_send_found". Mitigation: the skeptical reviewer verified by code reading that all relay_send_found callers correctly forward downstream_hop_count. Tracking #4250 covers building the simulation harness for end-to-end coverage.
Skeptical Findings
| Concern | Severity | Disposition |
|---|---|---|
| Wire-format handshake mixed-version crash risk | HIGH (theoretical) | Verified safe by reviewer at connection_handler.rs:2584-2609 — graceful ack_error rejection in both directions |
| Relay-preservation invariant could regress | HIGH (test gap) | All current GetMsg::Response construction sites verified to use the correct source (skeptical); classifier pin test added; relay_send_found end-to-end coverage tracked in #4250 |
Peer-supplied hop_count = usize::MAX pollutes telemetry |
MED | Addressed: clamped to op_manager.ring.max_hops_to_live at both extraction sites in tracing.rs |
NotFound.hop_count analytics ambiguity |
MED | Addressed: docstring clarifies it's exhaustion depth, not path-to-storer |
usize overflow / off-by-one at HTL=0 boundary |
LOW | Reviewer-verified math is correct; comment-clarification ask not acted on |
Old hop_count: 0 literals in tests don't assert preservation |
LOW | Untouched — pre-existing; not regressed |
Big Picture
- Goal alignment: yes. PR does exactly what its title and body claim.
- Scope: tightly scoped, all 7 changed files justified.
- Removed code: none destructive.
get_current_hop()stub kept for the remaining PUT caller; documentation updated to reflect current state. - Counterpart bugs (
PutSuccess/UpdateSuccess/SubscribeSuccess): confirmed to exist via the same root cause. Tracked in #4248. - Streaming gap: tracked in #4249.
#[ignore]d test follow-up: tracked in #4250.
Documentation
- Code docs: complete.
hop_countfield has a load-bearing docstring atget.rs:96-112explaining semantics and wire-compat handling. op_state_manager.rs:613stub docstring updated to reflect that only PUT still consumes it.tracing.rs:1322-1333clarifies that NotFoundhop_countis exhaustion depth.
Recommendations
Must Fix (Blocking)
All blocking findings from the initial review were addressed in the fix commits (9299f54e, c040837e, 85557cf7).
Should Fix (Important)
All "should fix" items either addressed in-PR or tracked as follow-up issues (#4248, #4249, #4250).
Consider (Suggestions)
- Off-by-one chain-of-reasoning comment at
op_ctx_task.rs:1967-1970could be tightened (low-priority readability). - Several existing
hop_count: 0literals in pre-existing tests (get.rs:283, 304, 366, 384;op_ctx_task.rs:2481, 2497, 2582) could carry non-zero values and assert preservation — would catch a future struct-shape refactor that loses the field. Not blocking.
False positives dismissed
- Codex flagged
fdevbinstall URLs pointing atv0.2.62—crates/fdev/Cargo.toml:13-16has an explicit comment thatrelease.ymlrewrites these URLs on each version bump (issue #3995). Not a finding. - Big-picture reviewer's "29 file scope" measurement (using three-dot diff) was against a stale local main — actual PR is the expected 7 files.
Verdict
State: Ready to Merge
HEAD SHA reviewed: 85557cf7
All blocking findings addressed in the fix commits on this branch:
- Testing-reviewer's classifier-preservation pin →
classify_response_found_preserves_hop_count - Skeptical M1 (bound clamp) → tracing.rs clamp to
max_hops_to_live - Skeptical M2 (NotFound semantics) → docstring
- Big-picture stale-line-number nit → fixed
- Codex re-review's ignore-reason missing tracking issue → references #4250
- Streaming gap, non-GET counterparts, sim-test un-ignore → tracked in #4249/#4248/#4250
The wire-format change is the most consequential part. The skeptical reviewer verified the handshake guards are correct in both directions, and every GetMsg::Response construction site uses the correct hop_count source. The unit tests pin the two most-likely-to-regress paths (wire roundtrip + classifier propagation).
[AI-assisted - Claude]
Codex flagged the fdev binstall metadata URLs pointing at v0.2.62 in the initial PR review. I dismissed it as a false positive because crates/fdev/Cargo.toml has an explicit comment saying release.yml rewrites these on each release (issue #3995). But CI caught the inconsistency: the binstall_metadata test asserts the URL embeds the current freenet dep version, and that test failed in Unit & Integration. Fix the URLs to v0.2.63 to match the freenet dep version bump in this PR. The release.yml rewrite path still runs on release; this just keeps the PR-level invariant intact.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Production telemetry bug:
hop_countis currentlyNoneon the vast majority of terminal GET events because it's computed at log time viaop_manager.get_current_hop(id), which returnsNonewhenever the operation has been cleaned up before the response is logged. As a result every dashboard / telemetry / benchmark that looks at GET routing depth is silently working with empty data.This PR threads the hop count through the wire (positional field on
GetMsg::Response) so the originator has a populated value when it constructsGetSuccess/GetNotFoundlog events. Also fills inhop_countat the relay that emits an exhaustionNotFound(it knows its ownmax_htl - htl).After this fix,
hop_countis populated on 100% of inline terminal GET events in deterministic-simulation testing across N ∈ {20, 50, 100, 200} (verified during the hop-count benchmark sweep — see #4237).Scope
Inline GET responses only. Specifically:
GetMsg::Response{Found, value: Some(state)}→ emitsGetSuccesswith populated hop_countGetMsg::Response{NotFound}→ emitsGetNotFoundwith populated hop_count (interpreted as exhaustion depth, not path-to-storer)GetMsg::ResponseStreaming(large-payload GETs above streaming threshold) — does NOT carry hop_count. Tracked in telemetry: hop_count not populated on streaming GET successes (GetMsg::ResponseStreaming) #4249.GetMsg::Response{Found, value: None}(failure branch) —GetEvent::GetFailureexists in tracing but has no live emission site in the current tree.get_current_hop()bug. Tracked in telemetry: populate hop_count on PutSuccess and SubscribeSuccess events #4248.Wire-format compatibility
Adds a positional field to a bincode-serialized type. Bincode does not honour
#[serde(default)]for positional encoding, so cross-version compatibility is gated byMIN_COMPATIBLE_VERSIONat the handshake layer:crates/core/Cargo.tomlversionandmin-compatible-versionboth from0.2.62→0.2.63.crates/fdev/Cargo.tomlfreenet path dep correspondingly.crates/core/src/transport/connection_handler.rs:2584-2609: handshake gracefully rejects mixed-version peers in both directions viaack_error(no crash).Test plan
cargo build -p freenet --features "testing" --testscleantest_get_msg_response_hop_count_roundtrip(new) — asserts the new field roundtrips through bincode forFound/NotFoundacross values {0,1,4,10,64}.classify_response_found_preserves_hop_count(new) — asserts the classifier propagates the wire-carried hop_count intoTerminal::InlineFoundunchanged across the same set of values. Catches a regression where a refactor synthesised 0 (or dropped the field) on the relay bubble-up path.test_hop_count_populated_on_terminal_get_events(simulation-level) — kept in tree but#[ignore]d, tracking test: un-ignore test_hop_count_populated_on_terminal_get_events once a GET-producing workload is wired through TestConfig #4250 to un-ignore once a deterministic GET-producing workload is wired through TestConfig.Defence-in-depth
hop_counttoop_manager.ring.max_hops_to_liveso a malicious or buggy peer can't pollute telemetry withusize::MAX.Review history
This PR went through a Full multi-perspective review (code-first / testing / skeptical / big-picture / Codex). Findings addressed in commits on this branch:
b3527ea29299f54eOrigin
Extracted from PR #4237 (hop-count sweep benchmark for the Freenet whitepaper). #4237 remains draft for the benchmark scaffolding; this PR is the production fix portion.
[AI-assisted - Claude]