fix: deliver prune disconnect cause through the waiter channel#4360
Conversation
|
Based on my review of the diff, rule files, and surrounding context, I have one finding. Rule Review: Minor stale doc commentRules checked: WarningsNone. Info
Summary of substantive checks:
Rule review against |
iduartgomez
left a comment
There was a problem hiding this comment.
Comprehensive PR Review: #4360
Summary
- PR Title: fix: deliver prune disconnect cause through the waiter channel
- Type: fix (bug fix — concurrency / async / channel semantics)
- CI Status: ✅ all green (Clippy, Fmt, Unit & Integration, Simulation, NAT Validation, Windows, macOS bundle-swap, Conventional Commits, Rule Review)
- Linked Issues: Closes #4313 (approved: T-bug, P-high, assigned to author)
- Review tier: Full (high-risk surface: concurrency/async + channel item-type change across every recv site)
- Reviewers run: code-first, testing, skeptical, big-picture (all 4) + my own verification. External model pass (Codex) could not run — auth token expired (
refresh_token_reused/token_expired); re-login withcodex loginto enable it on the next pass. Gemini not installed. - HEAD SHA reviewed:
e14ba9ec73f0cc8b5e3201edb98756abcca2a1d0 - Already approved by: iduartgomez
Code-First Analysis
Independent Understanding: The PR changes the pending_op_results[tx] waiter-channel item from NetMessage to a new enum WaiterReply { Reply(NetMessage), PeerDisconnected { peer } } (node/network_bridge.rs:222). A new NodeEvent::TransactionOrphaned { tx, peer } handler (p2p_protoc.rs:2039) removes the sender from the map, try_sends WaiterReply::PeerDisconnected, then drops the sender — relying on tokio mpsc's "buffered item delivered before None" guarantee so a parked driver observes the cause deterministically (send-before-drop). A new chokepoint recv_waiter_reply (op_ctx.rs:372) maps Reply→Ok, PeerDisconnected→Err(OpError::PeerDisconnected), bare-None→Err(OpError::NotificationError). The new OpError::PeerDisconnected routes to the generic advance arm in drive_retry_loop (peer gone → next route), not the same-peer infra-retry that NotificationError uses.
Stated Intent: Stop the "failed notifying, channel closed" FORBIDDEN_MARKER (which is OpError::NotificationError's Display, operations.rs:88) from leaking to clients on prune-orphaned transactions; deliver the disconnect cause structurally instead of via a racy side mechanism.
Alignment: Matches — the implementation does exactly what the Solution describes, at the structural level claimed. Verified OpError::PeerDisconnected's Display is "awaited peer {peer} disconnected before replying" (operations.rs:95) — contains no marker.
Gaps — the PR description misdescribes history (the code is correct): code-first and big-picture independently confirmed against the merge-base (875b97a47) and full git log --all -S:
OrphanCauseRegistry/orphan_causes/drop_entrynever existed anywhere in the repo or git history. The PR deletes no such registry. The entire second "Problem" paragraph (the "~0.01% win-rate registry was inert" analysis) and the "Deletes the entire OrphanCauseRegistry" bullet describe an intermediate in-branch iteration that was squashed away — not anything present at the PR's base.OpError::WaiterClosednever existed —PeerDisconnectedis added alongside the untouchedNotificationError, not a rename ofWaiterClosed.recv_register_waiternever existed —recv_waiter_replyis extracted from inline logic insend_and_await, not renamed from anything.try_release_pending_op_slot[_on]was renamed, not deleted, tonotify_orphaned_transaction[_on](all #4154/#4238 behavior preserved).
This is a description-vs-code defect only; no functional impact. Recommend rewriting the Solution section to match the shipped commit.
Testing Assessment
Coverage Level: adequate — meets the fix:-must-have-a-regression-test bar; well-layered (chokepoint unit + op-state pipeline + source-scrape pins + multi-thread churn + integration). All 11 targeted tests pass locally (incl. the 2000-iteration churn test); clippy clean; error_notification integration compiles.
| Test Type | Status | Notes |
|---|---|---|
| Unit | ✅ | recv_waiter_reply_* (3 cases: reply / disconnect / bare-close fallback); notify_orphaned_transaction_* (4, incl. #4238 full→DEBUG / closed→WARN guards) |
| Integration | ✅/ |
test_connection_drop_error_notification strengthened (asserts no FORBIDDEN_MARKER on both arms; 500ms→50ms pre-drop) — but see false-negative note below |
| Simulation | ✅ | Simulation CI job green |
| E2E | The 50ms drop window is wall-clock against real loopback with no start_paused; on a loaded runner the PUT may not be in flight yet → orphan path silently untested. It is a false-negative risk only (won't flake red), because the test asserts only marker-absence + non-timeout, with no positive assertion that the orphan path was taken |
Regression Test: present and valid (parked_driver_always_observes_peer_disconnected_under_churn, multi-thread/2-worker/2000-iter) — but the PR's claim that it "fails on the deleted registry approach" is overstated: it drives a hand-rolled mock of the post-fix send-before-drop ordering, proving the tokio "buffered-before-None" property; it does not drive the production prune → TransactionOrphaned → handler sequence, so it would not have failed against the (never-merged) registry approach. The production ordering is guarded by the source-scrape pin transaction_orphaned_handler_sends_cause_before_dropping_sender, which is load-bearing-but-brittle (substring-position match).
Missing Tests (non-blocking):
- No behavioral coverage for the new CONNECT
PeerDisconnectedarms (connect/op_ctx_task.rs:287clientbreak,:909relayreturn Ok(())) — not even a source pin. - PUT relay
recv_waiter_replyarms (put/op_ctx_task.rs:1313,:2249) covered only transitively. - Full-channel branch of the orphan handler (real reply already buffered →
PeerDisconnecteddropped) asserted by comment, not test. - Multi-orphan batch (
handle_orphaned_transactionsloopsVec<Transaction>; every test uses N=1).
Skeptical Findings
Risk Level: low — adversarial pass found no high/medium-severity bug introduced by this PR. Every raised concern traced to a verified-safe conclusion.
| Concern | Severity | Location | Details |
|---|---|---|---|
| Real reply dropped in favor of orphan signal | low | node.rs:847 (reply try_send) vs p2p_protoc.rs:2052 (orphan try_send) |
Orphan signal is now a buffered item racing the real reply into the capacity-1 slot (separate tasks). If PeerDisconnected wins, the terminal reply is dropped and the driver retries with a fresh attempt_tx. Benign for idempotent GET/PUT/SUBSCRIBE; never double-delivers (client result gated on the single consumed item). Slightly more retries than the old close-based wake. Worth a description note. |
| OPERATION_TTL fallback window | low | op_state_manager.rs:1075 |
If TransactionOrphaned is emitted before the driver's waiter is inserted, remove(&tx) finds nothing → driver waits full TTL. Same residual #4154 race; not a regression. |
| Orphan-signal loss on full/closed channel | verified safe | p2p_protoc.rs:2047-2054 |
Benign in every fill state (cap-1: buffered Reply consumed, or driver gone; cap-N CONNECT: FIFO drains all accepts then break). No re-wedge to hang or marker leak. |
| Prune racing real reply | verified safe | node.rs:847 |
Single cap-1 slot, exactly one winner (tokio atomic reservation); driver acts once; completed() fires once. No double-delivery, no lost client result. |
PeerDisconnected routing at every recv site |
verified safe | GET op_ctx.rs:571, SUBSCRIBE subscribe:623, PUT relay put:1356/:2275, CONNECT connect:287/907, UPDATE fire-and-forget never recvs |
Reaches client only as a real cause string, never the marker |
| Correct disconnected-peer addr at all 5 prune sites | verified safe | p2p_protoc.rs:1470, 2273, 2371, 3341, 3789 |
Each passes the same addr used to build the pruned PeerId |
| Pre-existing: loopback same-tx waiter / cross-peer-rebind spurious advance | pre-existing | documented in-code | Neither introduced nor worsened by this PR |
Big Picture Assessment
Goal Alignment: yes — addresses #4313's "prune cause and op-result must be mutually ordered" by making the cause delivered by the same operation that closes the slot. Replaces a probabilistically-correct mechanism (the description measured ~99.99% marker-leak on the prior approach) with a deterministic one grounded in tokio mpsc semantics. The WaiterReply enum converts a runtime race into a compile-time obligation: every recv site must handle PeerDisconnected; GET/SUBSCRIBE inherit it for free through the send_and_await chokepoint.
Anti-Patterns Detected: none — the opposite of CI-chasing. The integration test is strengthened, not weakened; the churn test is explicitly designed to be a strong guard; no assertions weakened, no tests deleted, no #[ignore] added. State-cleanup parity with TransactionCompleted holds (same three maps: tx_to_client, pending_op_results, live_tx_tracker).
Removed Code Concerns: none — no separately-merged prior fix is torn out (the "registry" never existed on main). #4154 wake intent + #4238 log-flood guard both preserved.
Scope Assessment: focused — the true diff is 13 files / 536+ / 170− (git diff 875b97a47 HEAD). gh pr diff / diff-against-origin/main shows ~1300 phantom deletions (get/op_ctx_task.rs, tracing.rs, telemetry.rs, streaming_e2e.rs, etc.) that are artifacts of the stale base, not this PR. Confirmed by all three reviewers that the PR touches none of those files. Rebase onto current main before merge so the published diff/CI don't appear to revert #4364/#4367.
Minor: apps/freenet-ping/Cargo.lock bumps freenet 0.2.67→0.2.70 (lockfile sync); harmless, routine.
Documentation
- Code docs: complete — handler comment documents send-before-drop;
WaiterReplyenum doc explains thelarge_enum_variantallow (hot-pathReplynot boxed, footprint unchanged);OpError::PeerDisconnecteddoc explains the routing. - Architecture/rules docs:
.claude/rules/operations.mdupdated correctly with a new "WHEN awaiting a reply onpending_op_results[tx](#4313)" section (chokepoint, send-before-drop invariant, advance-arm routing, pin list). ReferencedOpCtx::send_to_and_awaitexists. Zero stale references to the renamed/never-existent symbols anywhere indocs/,AGENTS.md,.claude/rules/, crateCLAUDE.md(theOrphanStreamRegistryhits indocs/architecture/transport/are an unrelated subsystem). - User docs: n/a (internal mechanism).
Recommendations
Must Fix (Blocking)
- Rebase onto current
origin/mainbefore merge. The branch predates #4364/#4367; the published diff otherwise appears to revert ~1300 lines of GET/telemetry code. The GET path handlesPeerDisconnectedviadrive_retry_loop's generic advance arm and the PR doesn't touchget/op_ctx_task.rs, so the rebase should be conflict-free — re-confirm integration tests after.
Should Fix (Important)
- Rewrite the PR "Solution" section to match the shipped commit. Remove/mark-as-historical the claims about deleting
OrphanCauseRegistry/orphan_causes/drop_entry, replacingOpError::WaiterClosed, and renamingrecv_register_waiter— none of those symbols exist at the PR's base. Accurate framing: "addsWaiterReply+OpError::PeerDisconnected; extractsrecv_waiter_reply; renamestry_release_pending_op_slot[_on]→notify_orphaned_transaction[_on]; replaces theTransactionCompleted-reuse orphan-wake with a dedicatedTransactionOrphanedevent carrying the cause through the channel." Drop or reframe the win-rate paragraph. - Temper the churn test's claim (
op_ctx.rs:1611comment + PR body): it pins the tokio send-before-drop property, it does not reproduce the pre-fix race. Either reword, or have it drive the realnotify_orphaned_transaction_on → handlersequence inside the loop to make it a genuine end-to-end reproducer.
Consider (Suggestions)
- Add a behavioral test (or at least a source pin) for the CONNECT
PeerDisconnectedarms (connect/op_ctx_task.rs:287,:909) — currently zero coverage on a fan-in path where "stop early on disconnect" can be subtly wrong with a partialacceptedset. - Add a positive assertion to the E2E that the orphan path was actually exercised (e.g. response is the disconnect/error variant, or scrape a log marker), so the 50ms window can't silently stop testing the fix.
- Add a full-channel test for the orphan handler (pre-fill cap-1 with
Reply, fire orphan, assert driver still getsReply) and a multi-orphan-batch test. - Optional: note in the description that a disconnect racing a successful reply now triggers a retry slightly more often (skeptical Low #1) — behavioral, harmless for idempotent ops.
Verdict
State: Ready to Merge (after the rebase) — no correctness defects; the blocking item is mechanical (stale base), and the Should/Consider items are description/test-quality, not code bugs.
The implementation is correct, well-tested, and well-documented; all four reviewers + independent verification converged on this. The defects are in the PR description (fabricated deletion history) and the base (stale), not the code. The send-before-drop design is sound and the failure modes are genuinely benign. Because none of the findings require a code change to the fix itself, a follow-up rebase + description edit does not require a full re-review — a diff-of-the-diff check on the rebased HEAD is sufficient (confirm the phantom deletions are gone and the 13-file change is unchanged).
[AI-assisted - Claude]
## Problem
Under connection churn, prune_connection produces orphaned transactions
(ops whose awaited downstream peer just disconnected). The prune path woke
each parked driver by *closing* its pending_op_results[tx] waiter channel.
The originator, parked in OpCtx::send_and_await -> recv(), saw the close
(None) and returned OpError::NotificationError, whose Display is the
"failed notifying, channel closed" FORBIDDEN_MARKER that
tests/error_notification.rs asserts must never reach a client. This is the
low-rate flake in test_put_error_notification* / test_put_error_*.
## Solution
Carry the disconnect cause through the waiter channel itself, so the close
*delivers* the reason instead of relying on a bare channel close that
collapses every teardown into NotificationError.
- The waiter channel item changes from NetMessage to
WaiterReply { Reply(NetMessage), PeerDisconnected { peer } }.
- New NodeEvent::TransactionOrphaned { tx, peer }. Its event-loop handler
takes the sender out of pending_op_results, try_sends
WaiterReply::PeerDisconnected { peer }, *then* drops the sender.
- Key insight: tokio mpsc delivers a buffered item before the channel reads
None, so the parked driver observes PeerDisconnected deterministically
(send-before-drop). No side registry, no race.
- recv_waiter_reply is extracted from the inline recv in send_and_await as
the single chokepoint; the WaiterReply enum makes the compiler force every
recv site -- including the CONNECT and PUT relays -- to handle
PeerDisconnected.
- New OpError::PeerDisconnected { peer } (added alongside the unchanged
NotificationError). It routes to the generic *advance* arm in
drive_retry_loop (the peer is gone -- try the next route), not the
same-peer infra-retry that NotificationError uses.
- The #4154 orphan-wake emitter try_release_pending_op_slot[_on] is renamed
to notify_orphaned_transaction[_on] and enhanced to carry the peer; its
#4154/#4238 behavior (best-effort try_send, Full->DEBUG, Closed->WARN) is
preserved. The prior wake reused NodeEvent::TransactionCompleted, which
only dropped the sender; this replaces it with the dedicated
TransactionOrphaned event that carries the cause through the channel.
This removes the failure class structurally rather than narrowing a race
window.
## Testing
- Concurrency stress -- parked_driver_always_observes_peer_disconnected_under_churn
(op_ctx, multi-thread, 2 workers, 2000 iterations): pins the load-bearing
tokio buffered-item-before-None property by driving the post-fix
send-before-drop order against a parked driver. (It pins the channel
property, not the production prune->handler sequence -- that ordering is
guarded by the source-scrape pin below.)
- Chokepoint behaviour (op_ctx): recv_waiter_reply_returns_reply_on_reply_item,
recv_waiter_reply_returns_peer_disconnected_signal,
recv_waiter_reply_falls_back_to_notification_error_on_bare_close.
- End-to-end at op-state level (op_state_manager):
orphaned_transaction_wakes_parked_waiter_with_peer_disconnected, plus
ported #4154/#4238 guards (notify_orphaned_transaction_emits_transaction_orphaned,
_handles_dropped_receiver, _full_does_not_emit_warn, _closed_still_emits_warn).
- Source-scrape pins (p2p_protoc):
transaction_orphaned_handler_sends_cause_before_dropping_sender (guards the
send-before-drop order the type system can't),
handle_orphaned_transactions_wakes_parked_drivers.
- CONNECT arms (connect/op_ctx_task): both_connect_drivers_handle_peer_disconnected
pins the explicit PeerDisconnected arm in both connect drivers (they
pattern-match WaiterReply inline rather than via the recv_waiter_reply
chokepoint).
- Integration: test_connection_drop_error_notification asserts the response
carries no FORBIDDEN_MARKER and drops the peer at 50 ms so the PUT is in
flight at prune time. It warns (not fails) when timing makes the run
vacuous, since it is a real-network test with no mocked time.
- .claude/rules/operations.md updated to document the channel-carried design.
## Fixes
Closes #4313.
e14ba9e to
e03da4c
Compare
Rebased + applied review suggestionsPushed Rebase (blocking item)
PR description / commit message rewritten
Test-quality suggestions applied
Verified locally: [AI-assisted - Claude] |
Problem
Under connection churn, prune_connection produces orphaned transactions
(ops whose awaited downstream peer just disconnected). The prune path woke
each parked driver by closing its pending_op_results[tx] waiter channel.
The originator, parked in OpCtx::send_and_await -> recv(), saw the close
(None) and returned OpError::NotificationError, whose Display is the
"failed notifying, channel closed" FORBIDDEN_MARKER that
tests/error_notification.rs asserts must never reach a client. This is the
low-rate flake in test_put_error_notification* / test_put_error_*.
Solution
Carry the disconnect cause through the waiter channel itself, so the close
delivers the reason instead of relying on a bare channel close that
collapses every teardown into NotificationError.
WaiterReply { Reply(NetMessage), PeerDisconnected { peer } }.
takes the sender out of pending_op_results, try_sends
WaiterReply::PeerDisconnected { peer }, then drops the sender.
None, so the parked driver observes PeerDisconnected deterministically
(send-before-drop). No side registry, no race.
the single chokepoint; the WaiterReply enum makes the compiler force every
recv site -- including the CONNECT and PUT relays -- to handle
PeerDisconnected.
NotificationError). It routes to the generic advance arm in
drive_retry_loop (the peer is gone -- try the next route), not the
same-peer infra-retry that NotificationError uses.
to notify_orphaned_transaction[_on] and enhanced to carry the peer; its
GET initiated during peer-connection churn stalls silently for 60+s instead of fail-fast / re-route #4154/Log spam: try_release_pending_op_slot WARN fires 30K+/hr under load #4238 behavior (best-effort try_send, Full->DEBUG, Closed->WARN) is
preserved. The prior wake reused NodeEvent::TransactionCompleted, which
only dropped the sender; this replaces it with the dedicated
TransactionOrphaned event that carries the cause through the channel.
This removes the failure class structurally rather than narrowing a race
window.
Testing
(op_ctx, multi-thread, 2 workers, 2000 iterations): pins the load-bearing
tokio buffered-item-before-None property by driving the post-fix
send-before-drop order against a parked driver. (It pins the channel
property, not the production prune->handler sequence -- that ordering is
guarded by the source-scrape pin below.)
recv_waiter_reply_returns_peer_disconnected_signal,
recv_waiter_reply_falls_back_to_notification_error_on_bare_close.
orphaned_transaction_wakes_parked_waiter_with_peer_disconnected, plus
ported GET initiated during peer-connection churn stalls silently for 60+s instead of fail-fast / re-route #4154/Log spam: try_release_pending_op_slot WARN fires 30K+/hr under load #4238 guards (notify_orphaned_transaction_emits_transaction_orphaned,
_handles_dropped_receiver, _full_does_not_emit_warn, _closed_still_emits_warn).
transaction_orphaned_handler_sends_cause_before_dropping_sender (guards the
send-before-drop order the type system can't),
handle_orphaned_transactions_wakes_parked_drivers.
pins the explicit PeerDisconnected arm in both connect drivers (they
pattern-match WaiterReply inline rather than via the recv_waiter_reply
chokepoint).
carries no FORBIDDEN_MARKER and drops the peer at 50 ms so the PUT is in
flight at prune time. It warns (not fails) when timing makes the run
vacuous, since it is a real-network test with no mocked time.
Fixes
Closes #4313.