The single-threaded contract_handling event loop runs network related-contract
sub-GETs inline, blocking every other queued contract op — including GetQuery reads that would otherwise be served instantly from the local store —
for up to 10s (validation path) or 120s (update-requires path) per
stuck op.
This is the op/executor-layer half of the "first send hangs the recipient node,
all contract GETs fail" symptom (freenet/mail#288). It is distinct from the
transport-layer mechanism in #4345 (cwnd-wait abort / stream-assembly wedge):
even with the transport fixes (#4353, #4367, #4374) fully in place, a single
related-contract fetch that can't be retrieved quickly still head-of-line-blocks
the loop. This is the piece that explains why locally-cached GETs stall, which
the transport explanation alone does not.
Mechanism (file:line, current main)
contract_handling is a single consumer task; it processes one event to
completion before the next:
- spawned once: node/p2p_impl.rs:453-470
- serial loop: contract.rs:667-727, awaits handle_contract_event(...).await at contract.rs:709-711 before popping the next event
- the pool's own doc comment confirms it: "the sequential event loop means at most one contract is ever in flight at a time" (contract/executor/runtime.rs:382-383, and :122-125)
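The shape of such a single-consumer loop can be sketched as follows. This is a minimal illustration using std threads and channels, not the actual freenet-core code (which is async); `ContractEvent`, `drive`, and the handler body here are hypothetical stand-ins.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the real contract event type.
#[derive(Debug)]
enum ContractEvent {
    Put(u32),
    Get(u32),
}

/// Handles one event to completion and reports what it did.
fn handle_contract_event(ev: ContractEvent) -> String {
    match ev {
        ContractEvent::Put(id) => format!("put {}", id),
        ContractEvent::Get(id) => format!("get {}", id),
    }
}

/// Single consumer: the next event is not popped until the current
/// handler call has returned.
fn contract_handling(rx: mpsc::Receiver<ContractEvent>) -> Vec<String> {
    let mut completed = Vec::new();
    while let Ok(ev) = rx.recv() {
        completed.push(handle_contract_event(ev));
    }
    completed
}

/// Spawns the consumer once, feeds it the events, and returns the
/// completion order after the queue has drained.
fn drive(events: Vec<ContractEvent>) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    let consumer = thread::spawn(move || contract_handling(rx));
    for ev in events {
        tx.send(ev).expect("consumer is alive");
    }
    drop(tx); // closing the channel lets the consumer loop exit
    consumer.join().expect("consumer panicked")
}
```

Because there is exactly one consumer, completion order always equals queue order; anything slow inside the handler delays every event behind it.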
A PutQuery/UpdateQuery whose validate_state/update_state returns RequestRelated/requires(missing) and whose related contract is not local
escalates to a network sub-op GET that is .awaited inline:
- fetch_related_via_network (runtime.rs:4266-4298): start_sub_op_get then rx.await, bounded by RELATED_FETCH_TIMEOUT = 10s (runtime.rs:12)
- local_state_or_from_network (runtime.rs:4200-4249): start_sub_op_get then rx.await under SUB_OP_FETCH_TIMEOUT = 120s (runtime.rs:4227-4228)
- both are &mut self methods reached from handle_contract_event's Put/Update arms, so the await happens on the contract_handling task itself.
A local-cache-hit GetQuery is serviced by the same loop and does only
local reads (handle_contract_event GET arm → fetch_contract → perform_contract_get, local state_store/contract-store only, contract.rs:924-1003). It would return instantly — but only once it reaches
the head of the queue. If a related-fetch-blocked Put/Update sits ahead of it,
it waits the full 10s/120s.
The worst case is not a single stall: each queued op that independently needs the
unreachable related contract pays its own timeout serially, so a burst of
mailbox-triggered upserts can pin the loop far longer than one timeout interval.
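The serial accumulation can be illustrated with a small virtual-time model. This is only a sketch under stated assumptions: the `Event` type is hypothetical, a local GET is assumed to cost about 1ms, and a blocked upsert is assumed to hold the loop for its full timeout.

```rust
use std::time::Duration;

/// Hypothetical queued events; not the real freenet-core types.
#[derive(Debug, Clone, Copy)]
enum Event {
    /// GET served from the local store (assumed cost: 1ms).
    LocalGet,
    /// Put/Update whose related contract is unreachable, so the loop is
    /// held for the whole timeout.
    BlockedUpsert { timeout: Duration },
}

/// Time the loop is occupied by a single event in this model.
fn service_time(ev: Event) -> Duration {
    match ev {
        Event::LocalGet => Duration::from_millis(1),
        Event::BlockedUpsert { timeout } => timeout,
    }
}

/// Completion time of each event, in queue order, when events are
/// processed strictly one after another.
fn completion_times(queue: &[Event]) -> Vec<Duration> {
    let mut now = Duration::ZERO;
    queue
        .iter()
        .map(|ev| {
            now += service_time(*ev);
            now
        })
        .collect()
}
```

In this model, a local GET queued behind three upserts that each hit the 10s timeout completes at 30.001s, even though its own work takes 1ms.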
This is also flagged by the repo's own rule .claude/rules/operations.md
("Forward Upstream Before Contract-Handling Work": "validate_state and the 10s
RELATED_FETCH_TIMEOUT [are] directly on the upstream-visible critical path").
Impact
- Any contract whose validation/merge needs a related contract that isn't held locally and can't be fetched promptly will head-of-line-block all contract ops on that node for 10–120s (or longer under a burst).
- freenet-email is the prime trigger: inbox/AFT upserts request related contracts on first cross-identity send; on a small/churny network the related contract is frequently not local. This matches the user report in freenet/mail#288 ("all contract GETs fail … clears up after some time").
- The 120s inline hold on the local_state_or_from_network path is the most severe: a single sub-op GET for a missing contract can freeze the contract pipeline for two minutes.
Proposed fix direction (under design — not final)
Lift the network related-fetch off the serial loop so the loop is never held
on a network round-trip:
- When validation discovers a missing related contract that requires a network fetch, kick off the sub-op GET (it already runs as its own task via start_sub_op_get) without awaiting it inline, release the loop, and re-enqueue the upsert as a continuation once the fetch completes and the related contract is cached locally. The depth=1 / one-round invariant (.claude/rules/contracts.md) is preserved: the resumed upsert still performs exactly one fetch round, and a second RequestRelated is still the depth>1 error.
- Local-only related lookups stay inline (they're fast and don't touch the network).
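A rough sketch of the deferral-and-continuation idea, using std threads and channels in place of the real async machinery. Everything here is hypothetical (the `Event` shape, the `fetch_round_used` flag, and the assumption that the fetch succeeds); it only illustrates that the loop moves on while the fetch runs elsewhere and that a second fetch round is rejected.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical event shape; not the real freenet-core types.
#[derive(Debug)]
enum Event {
    Upsert { id: u32, related_local: bool, fetch_round_used: bool },
    LocalGet { id: u32 },
}

/// Runs a serial loop over `initial`, deferring network related-fetches
/// to worker threads that re-enqueue the upsert as a continuation.
fn run_loop(initial: Vec<Event>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<Event>();
    // Events still to be handled, including continuations not yet enqueued.
    let mut outstanding = initial.len();
    for ev in initial {
        tx.send(ev).expect("receiver is alive");
    }
    let mut log = Vec::new();
    let mut workers = Vec::new();
    while outstanding > 0 {
        let ev = rx.recv().expect("loop holds a sender, so the channel stays open");
        outstanding -= 1;
        match ev {
            Event::LocalGet { id } => log.push(format!("get {} served", id)),
            Event::Upsert { id, related_local: true, .. } => {
                log.push(format!("upsert {} applied", id))
            }
            // Related contract still missing after the one allowed round.
            Event::Upsert { id, related_local: false, fetch_round_used: true } => {
                log.push(format!("upsert {} rejected: second fetch round", id))
            }
            Event::Upsert { id, related_local: false, fetch_round_used: false } => {
                log.push(format!("upsert {} deferred", id));
                outstanding += 1; // a continuation will arrive later
                let tx = tx.clone();
                workers.push(thread::spawn(move || {
                    // Assumption: the fetch succeeds and caches the contract.
                    let _ = tx.send(Event::Upsert {
                        id,
                        related_local: true,
                        fetch_round_used: true,
                    });
                }));
            }
        }
    }
    for w in workers {
        w.join().expect("fetch worker panicked");
    }
    log
}
```

In this sketch a GET that was queued behind the upsert is served before the upsert resumes, because the continuation is appended to the back of the queue rather than holding the loop.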
A smaller complementary mitigation (serve local-cache-hit GetQuery reads off a
concurrent read path so they can't queue behind a blocked upsert) is possible but
does not fix non-cached ops stalling, so it is secondary to the off-loop fetch.
A regression test must reproduce the stall: enqueue an upsert that triggers a
network related-fetch for an unreachable contract, queue a local-cache-hit GET
behind it, and assert that the GET completes promptly instead of waiting out the
related-fetch timeout.
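The assertion such a test makes could look roughly like the following virtual-time model. It is a toy, not a real test: an actual regression test would drive the node's contract_handling loop. `FetchPolicy`, the 1ms GET cost, and the 100ms latency budget are assumptions made for illustration; only the 10s timeout value comes from the description above.

```rust
use std::time::Duration;

/// Hypothetical switch between current behavior and the proposed fix.
#[derive(Clone, Copy)]
enum FetchPolicy {
    /// Related fetch is awaited on the loop (current behavior).
    Inline,
    /// Related fetch runs off the loop (proposed direction).
    OffLoop,
}

/// Timeout value quoted in the issue for the validation path.
const RELATED_FETCH_TIMEOUT: Duration = Duration::from_secs(10);
/// Assumed cost of a local-cache-hit GET.
const LOCAL_GET_COST: Duration = Duration::from_millis(1);
/// Assumed latency budget for "completes promptly".
const GET_BUDGET: Duration = Duration::from_millis(100);

/// Modelled latency of a local GET queued directly behind one upsert
/// whose related contract is unreachable.
fn get_latency_behind_stuck_upsert(policy: FetchPolicy) -> Duration {
    let loop_hold = match policy {
        FetchPolicy::Inline => RELATED_FETCH_TIMEOUT,
        FetchPolicy::OffLoop => Duration::ZERO,
    };
    loop_hold + LOCAL_GET_COST
}

/// The check a regression test would make: the GET must finish within
/// the budget rather than waiting out the related-fetch timeout.
fn check_get_not_blocked(policy: FetchPolicy) -> Result<(), String> {
    let latency = get_latency_behind_stuck_upsert(policy);
    if latency < GET_BUDGET {
        Ok(())
    } else {
        Err(format!("GET took {:?}, budget is {:?}", latency, GET_BUDGET))
    }
}
```

Under this model the check fails for the inline policy and passes for the off-loop policy, which is the before/after behavior the real test should pin down.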
Related: #4345 (transport half, still open), freenet/mail#288 (user-visible
symptom), PR #4006 (where the network escalation was introduced).
cc @iduartgomez @netsirius
[AI-assisted - Claude]