Skip to content

fix: populate hop_count on terminal GET events#4245

Merged
sanity merged 8 commits into
mainfrom
fix/hop-count-population
May 25, 2026
Merged

fix: populate hop_count on terminal GET events#4245
sanity merged 8 commits into
mainfrom
fix/hop-count-population

Conversation

@sanity

@sanity sanity commented May 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

Production telemetry bug: hop_count is currently None on the vast majority of terminal GET events because it's computed at log time via op_manager.get_current_hop(id), which returns None whenever the operation has been cleaned up before the response is logged. As a result every dashboard / telemetry / benchmark that looks at GET routing depth is silently working with empty data.

This PR threads the hop count through the wire (positional field on GetMsg::Response) so the originator has a populated value when it constructs GetSuccess / GetNotFound log events. Also fills in hop_count at the relay that emits an exhaustion NotFound (it knows its own max_htl - htl).

After this fix, hop_count is populated on 100% of inline terminal GET events in deterministic-simulation testing across N ∈ {20, 50, 100, 200} (verified during the hop-count benchmark sweep — see #4237).

Scope

Inline GET responses only. Specifically:

Wire-format compatibility

Adds a positional field to a bincode-serialized type. Bincode does not honour #[serde(default)] for positional encoding, so cross-version compatibility is gated by MIN_COMPATIBLE_VERSION at the handshake layer:

  • Bumps crates/core/Cargo.toml version and min-compatible-version both from 0.2.620.2.63.
  • Bumps crates/fdev/Cargo.toml freenet path dep correspondingly.
  • Verified at crates/core/src/transport/connection_handler.rs:2584-2609: handshake gracefully rejects mixed-version peers in both directions via ack_error (no crash).

Test plan

  • cargo build -p freenet --features "testing" --tests clean
  • test_get_msg_response_hop_count_roundtrip (new) — asserts the new field roundtrips through bincode for Found/NotFound across values {0,1,4,10,64}.
  • classify_response_found_preserves_hop_count (new) — asserts the classifier propagates the wire-carried hop_count into Terminal::InlineFound unchanged across the same set of values. Catches a regression where a refactor synthesised 0 (or dropped the field) on the relay bubble-up path.
  • Existing 68 GET unit tests still pass.
  • test_hop_count_populated_on_terminal_get_events (simulation-level) — kept in tree but #[ignore]d, tracking test: un-ignore test_hop_count_populated_on_terminal_get_events once a GET-producing workload is wired through TestConfig #4250 to un-ignore once a deterministic GET-producing workload is wired through TestConfig.

Defence-in-depth

  • Tracing-layer extraction clamps peer-supplied hop_count to op_manager.ring.max_hops_to_live so a malicious or buggy peer can't pollute telemetry with usize::MAX.

Review history

This PR went through a Full multi-perspective review (code-first / testing / skeptical / big-picture / Codex). Findings addressed in commits on this branch:

  • Initial rule-review wire-compat / regression-test warnings → b3527ea2
  • Full-tier review findings (classifier preservation test, bound clamp, NotFound semantics doc, stale line-number nit) → 9299f54e
  • Codex re-review (test reference + streaming-gap acknowledgement) → this commit + the issue links above.

Origin

Extracted from PR #4237 (hop-count sweep benchmark for the Freenet whitepaper). #4237 remains draft for the benchmark scaffolding; this PR is the production fix portion.

[AI-assisted - Claude]

sanity and others added 2 commits May 24, 2026 17:00
Previously, `EventKind::Get::{GetSuccess,GetNotFound,GetFailure}.hop_count`
was computed at log time via `op_manager.get_current_hop(id)` — which is a
stub that always returns `None` (op_state_manager.rs:615). The result: the
field was effectively never populated, so any consumer of hop_count data
(routing-claim analysis, sweep benchmark) saw `None` for ~99% of events.

This commit carries hop_count on the wire instead:

- Add `hop_count: usize` field to `GetMsg::Response`. Semantics: the
  forward-path depth (`max_htl - htl_at_responder`) at the node that
  produced the Response. Relays preserve this value when bubbling a
  downstream Response upstream — they do NOT increment on the return
  path. The originator reads it directly from the inbound message.

- Thread the value through the relay driver:
  - `relay_send_not_found(htl)` / `relay_send_found(hop_count)` now take
    the value at construction.
  - `Terminal::InlineFound` gains a `hop_count` field so the bubble-up
    callsite can preserve the downstream storer's value.
  - All 5 production callsites compute `max_htl - htl` (own production)
    or pass through `downstream_hop_count` (bubble-up R12a).

- `tracing.rs` `from_inbound_msg_v1` reads `hop_count` directly from the
  Response message; the `get_current_hop` call is dropped.

Wire compat: bincode is positional so adding a field is binary-incompatible
without `MIN_COMPATIBLE_VERSION` consideration. Same pattern as the earlier
`subscribe: bool` addition on `GetMsg::Request`. The repo's handshake-time
version check governs cross-version peering.

Scope notes:

- This commit covers GET only. PUT, UPDATE, and SUBSCRIBE terminal events
  ALSO carry `hop_count: Option<usize>` and ALSO go through the same
  `get_current_hop` stub — they're broken too, but fixing them needs the
  same wire-format threading and is out of scope for this PR. The whitepaper
  benchmark is focused on GET routing, so this commit is sufficient.

- `GetMsg::ResponseStreaming` (large-payload streaming path) is NOT updated.
  The simulator's small-state GETs use the inline `Response` path; streaming
  is for >64KB payloads. Can be extended if streaming-mode hop data is
  needed later.

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a relay GET exhausts downstream candidates without finding the
contract it sends NotFound upstream. Previously the event-log NotFound
construction passed hop_count=None even though the relay knows its own
forward depth (max_htl - htl). Compute and pass it.
@github-actions

github-actions Bot commented May 24, 2026

Copy link
Copy Markdown
Contributor

I have all the context I need. Let me write up the review.

Rule Review: Regression test gap in tracing.rs fix path

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md
Files reviewed: 7


Warnings

  • crates/core/src/tracing.rs:1302–1355 — The core bug fix (switching from op_manager.get_current_hop(id) to reading hop_count directly from the wire message) is not covered by any enabled test. test_get_msg_response_hop_count_roundtrip verifies bincode round-trip of the field, and classify_response_found_preserves_hop_count verifies the classifier propagates it — but neither asserts anything about the EventKind produced by from_msg_to_event. If the tracing.rs change were reverted to op_manager.get_current_hop(), both unit tests would still pass and the bug would silently reappear. The only test that covers the end-to-end path (test_hop_count_populated_on_terminal_get_events) is #[ignore]d. The rule requires a test that "would catch this exact bug if reintroduced." (rule: testing.md — fix: PRs must include a regression test specific enough to catch the exact bug)

Info

  • crates/core/src/node/op_state_manager.rs:615 — New doc comment embeds a hardcoded line number (tracing.rs:1271) and a PR reference (PR #4245). Line numbers go stale with every surrounding edit; PR references belong in commit history, not code. Both will mislead future readers. (rule: code-style.md — comments explain WHY not WHAT; don't reference callers or tickets in code)

  • crates/core/src/operations/get.rs:78#[serde(default)] on hop_count: usize has no effect for bincode (positional encoding ignores serde defaults). The accompanying comment explains this, but the attribute itself implies a backward-compat guarantee that doesn't exist for the actual serializer. Removing it would make the intent clearer and match the pattern used for the subscribe field's existing comment in the same file. (rule: code-style.md — don't add constructs that misrepresent behavior)


Summary: One warning around the untested tracing path — the wire-format and classifier tests are solid, but a direct unit test asserting that from_msg_to_event produces hop_count: Some(n) (rather than None) would close the gap. Two minor style notes on the doc comment and the no-op serde attribute.


Rule review against .claude/rules/. WARNING findings block merge. ⚠️ 1 warning(s) — fix or add review-override label

sanity added 3 commits May 24, 2026 17:56
Adds a public EventKind::hop_count() accessor so external tests can
read hop_count from terminal GET events without pattern-matching on
the pub(crate) GetEvent enum.

Adds test_hop_count_populated_on_terminal_get_events as a regression
test for the bug fixed in the previous two commits: before the fix,
hop_count was None on ~99% of terminal GET events because op_manager
had cleaned up the op before the event was logged. The test runs a
small SimNetwork and asserts at least one terminal GET event has
populated hop_count — which fails without the wire-format fix and
passes with it.
Two findings from the Claude rule review on the previous push:

1. **MIN_COMPATIBLE_VERSION not bumped.** Adding a positional field to
   bincode-serialized GetMsg::Response breaks wire compat with peers
   running pre-fix binaries (0.2.61 / 0.2.62 admit-then-deserialization-
   fail).  Bumps both crates/core/Cargo.toml's package version and
   min-compatible-version to 0.2.63.  Also bumps the fdev dep on core
   to match.  Build-script check (build.rs:78) verifies the invariant
   min_compatible_version <= package_version.

2. **Regression test only asserted presence, not correctness.** The
   simulation-integration test asserted populated > 0, which would pass
   even if every hop_count = 0.  In practice the default event_chain
   workload at CI scales doesn't produce terminal GET events at all
   (the test was failing in CI with 2840 events and 0 GET events).

Replacing the simulation test with a direct unit test in
crates/core/src/operations/get.rs that asserts the new hop_count field
roundtrips through bincode for both Found and NotFound variants of
GetMsg::Response across a range of values (0, 1, 4, 10, 64).  That is
the specific regression scenario this PR addresses; the bincode-
positional caveat from the existing subscribe-roundtrip test applies
identically here.

The simulation-integration test is kept in tree but marked #[ignore]
with a docstring pointing at the unit test.  It can be rebuilt later
once the run_controlled_simulation pattern from
test_get_reliability_diagnostic is generalised to non-nightly CI.
Synthesised findings from code-first / testing / skeptical / big-picture /
Codex reviews — fixes the actionable ones; tracking the rest.

**Testing-reviewer (was blocking):** added classifier-side regression test
`classify_response_found_preserves_hop_count` in op_ctx_task.rs that
asserts `classify()` propagates the wire-carried hop_count into
`Terminal::InlineFound` unchanged across hop_count values
{0,1,4,10,64}.  The existing bincode-roundtrip test in get.rs covered
the wire format only; this test pins the classifier so a future refactor
that synthesised 0 (or dropped the field) on the relay bubble-up path
trips immediately.

**Skeptical-reviewer M1 (bound check on peer-supplied hop_count):**
clamp `hop_count` to `ring.max_hops_to_live` at the two
GetSuccess/GetNotFound extraction sites in tracing.rs.  A malicious or
buggy peer can ship hop_count = usize::MAX in the bincode payload;
the clamp keeps telemetry/dashboards within sensible bounds.

**Skeptical-reviewer M2 (NotFound semantics documentation):** added a
docstring paragraph at the NotFound extraction site clarifying that
hop_count there is the *exhaustion depth* (deepest peer the request
reached before exhausting / store-missing), not a path-to-storer
depth — the typical analytics misinterpretation.

**Big-picture nit:** updated the stale line reference in
op_state_manager.rs:614 (was 'tracing.rs:1266', actually 1271) and
clarified that only the PUT path still consumes that stub; GET was
migrated to a wire-carried field by this PR.

Skeptical M1 verification confirmed wire-format handshake gracefully
rejects mixed 0.2.62/0.2.63 in both directions (no crash, just
ack_error rejection).  All InlineFound construction sites verified to
use the correct hop_count source.

Not addressed in this commit (tracked for follow-up):
- ResponseStreaming variant doesn't carry hop_count (would affect
  large-payload GETs returning via streaming if they emit GetSuccess
  through this path; needs a separate audit).
- PutSuccess / UpdateSuccess / SubscribeSuccess still use the
  get_current_hop() stub and have the same None-hop_count gap.  Will
  follow up as a separate PR once GET is merged.
- Un-#[ignore]ing the simulation-integration test (needs a workload
  that deterministically produces terminal GETs at CI scales).
Codex re-review flagged the #[ignore] reason as missing the required
tracking-issue reference per repo testing conventions. Now references
#4250 which captures the work to un-ignore.

@sanity sanity left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comprehensive PR Review: #4245

Summary

  • PR Title: fix: populate hop_count on terminal GET events
  • Type: fix (wire-format change)
  • Review tier: Full (wire format is a high-risk surface)
  • Reviewers run: freenet:code-first, freenet:testing, freenet:skeptical, freenet:big-picture, Codex (twice — once on initial submission, once on the fix commits)
  • HEAD reviewed: 85557cf7 (post fix commits)

Code-First Analysis

Independent understanding: PR adds positional hop_count: usize to GetMsg::Response. Storer fills max_htl - htl, relay bubble-up preserves it verbatim, HTL-exhaustion NotFound fills its own max_htl - htl. Tracing-layer extraction reads from the wire instead of the stub op_manager.get_current_hop().

Alignment: matches the PR description for inline Response. Two scope gaps that the description correctly acknowledges:

  • GetMsg::ResponseStreaming not addressed (streaming GETs unchanged).
  • GetFailure event has no live emission site in the tree — the "Success/NotFound/Failure" framing is partly aspirational.

Wire-compat: verified at handshake — connection_handler.rs:2584-2609 cleanly rejects mixed 0.2.62↔0.2.63 in both directions via ack_error.


Testing Assessment

Test Type Status Notes
Unit (wire roundtrip) test_get_msg_response_hop_count_roundtrip in get.rs — Found/NotFound across {0,1,4,10,64}
Unit (classifier) classify_response_found_preserves_hop_count in op_ctx_task.rs — pins the bubble-up path
Simulation ⚠️ test_hop_count_populated_on_terminal_get_events is #[ignore]d; tracks #4250
E2E N/A

Regression Test: present and adequate at the unit level. Initial testing-reviewer concern about "presence-only" assertion has been addressed: there are now two unit tests pinning both the wire format and the classifier-side propagation. The blocking ask was the classifier pin — done.

Remaining gap: no end-to-end integration test exercises the storer→relay→originator chain. The two unit tests catch the most likely regression modes (field dropped, classifier synthesised 0, refactor lost the field) but not "a relay overwrites with its own htl in relay_send_found". Mitigation: the skeptical reviewer verified by code reading that all relay_send_found callers correctly forward downstream_hop_count. Tracking #4250 covers building the simulation harness for end-to-end coverage.


Skeptical Findings

Concern Severity Disposition
Wire-format handshake mixed-version crash risk HIGH (theoretical) Verified safe by reviewer at connection_handler.rs:2584-2609 — graceful ack_error rejection in both directions
Relay-preservation invariant could regress HIGH (test gap) All current GetMsg::Response construction sites verified to use the correct source (skeptical); classifier pin test added; relay_send_found end-to-end coverage tracked in #4250
Peer-supplied hop_count = usize::MAX pollutes telemetry MED Addressed: clamped to op_manager.ring.max_hops_to_live at both extraction sites in tracing.rs
NotFound.hop_count analytics ambiguity MED Addressed: docstring clarifies it's exhaustion depth, not path-to-storer
usize overflow / off-by-one at HTL=0 boundary LOW Reviewer-verified math is correct; comment-clarification ask not acted on
Old hop_count: 0 literals in tests don't assert preservation LOW Untouched — pre-existing; not regressed

Big Picture

  • Goal alignment: yes. PR does exactly what its title and body claim.
  • Scope: tightly scoped, all 7 changed files justified.
  • Removed code: none destructive. get_current_hop() stub kept for the remaining PUT caller; documentation updated to reflect current state.
  • Counterpart bugs (PutSuccess/UpdateSuccess/SubscribeSuccess): confirmed to exist via the same root cause. Tracked in #4248.
  • Streaming gap: tracked in #4249.
  • #[ignore]d test follow-up: tracked in #4250.

Documentation

  • Code docs: complete. hop_count field has a load-bearing docstring at get.rs:96-112 explaining semantics and wire-compat handling.
  • op_state_manager.rs:613 stub docstring updated to reflect that only PUT still consumes it.
  • tracing.rs:1322-1333 clarifies that NotFound hop_count is exhaustion depth.

Recommendations

Must Fix (Blocking)

All blocking findings from the initial review were addressed in the fix commits (9299f54e, c040837e, 85557cf7).

Should Fix (Important)

All "should fix" items either addressed in-PR or tracked as follow-up issues (#4248, #4249, #4250).

Consider (Suggestions)

  1. Off-by-one chain-of-reasoning comment at op_ctx_task.rs:1967-1970 could be tightened (low-priority readability).
  2. Several existing hop_count: 0 literals in pre-existing tests (get.rs:283, 304, 366, 384; op_ctx_task.rs:2481, 2497, 2582) could carry non-zero values and assert preservation — would catch a future struct-shape refactor that loses the field. Not blocking.

False positives dismissed

  • Codex flagged fdev binstall URLs pointing at v0.2.62crates/fdev/Cargo.toml:13-16 has an explicit comment that release.yml rewrites these URLs on each version bump (issue #3995). Not a finding.
  • Big-picture reviewer's "29 file scope" measurement (using three-dot diff) was against a stale local main — actual PR is the expected 7 files.

Verdict

State: Ready to Merge
HEAD SHA reviewed: 85557cf7

All blocking findings addressed in the fix commits on this branch:

  • Testing-reviewer's classifier-preservation pin → classify_response_found_preserves_hop_count
  • Skeptical M1 (bound clamp) → tracing.rs clamp to max_hops_to_live
  • Skeptical M2 (NotFound semantics) → docstring
  • Big-picture stale-line-number nit → fixed
  • Codex re-review's ignore-reason missing tracking issue → references #4250
  • Streaming gap, non-GET counterparts, sim-test un-ignore → tracked in #4249/#4248/#4250

The wire-format change is the most consequential part. The skeptical reviewer verified the handshake guards are correct in both directions, and every GetMsg::Response construction site uses the correct hop_count source. The unit tests pin the two most-likely-to-regress paths (wire roundtrip + classifier propagation).

[AI-assisted - Claude]

@sanity sanity enabled auto-merge May 25, 2026 01:23
Codex flagged the fdev binstall metadata URLs pointing at v0.2.62 in
the initial PR review.  I dismissed it as a false positive because
crates/fdev/Cargo.toml has an explicit comment saying release.yml
rewrites these on each release (issue #3995).  But CI caught the
inconsistency: the binstall_metadata test asserts the URL embeds the
current freenet dep version, and that test failed in Unit & Integration.

Fix the URLs to v0.2.63 to match the freenet dep version bump in this
PR.  The release.yml rewrite path still runs on release; this just
keeps the PR-level invariant intact.
@sanity sanity added this pull request to the merge queue May 25, 2026
Merged via the queue into main with commit 089263a May 25, 2026
13 of 14 checks passed
@sanity sanity deleted the fix/hop-count-population branch May 25, 2026 17:23
Basedfloppa pushed a commit to Basedfloppa/freenet-core that referenced this pull request May 27, 2026
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant