Executor head-of-line blocking: inline network related-fetch on contract_handling loop stalls all contract ops (incl. cached GETs) #4391

@sanity

Summary

The single-threaded contract_handling event loop runs network related-contract
sub-GETs inline, blocking every other queued contract op — including
GetQuery reads that would otherwise be served instantly from the local store —
for up to 10s (validation path) or 120s (update-requires path) per
stuck op.

This is the op/executor-layer half of the "first send hangs the recipient node,
all contract GETs fail" symptom (freenet/mail#288). It is distinct from the
transport-layer mechanism in #4345 (cwnd-wait abort / stream-assembly wedge):
even with the transport fixes (#4353, #4367, #4374) fully in place, a single
related-contract fetch that can't be retrieved quickly still head-of-line-blocks
the loop. This is the piece that explains why locally-cached GETs stall, which
the transport explanation alone does not.

Mechanism (file:line, current main)

  1. contract_handling is a single consumer task; it processes one event to
    completion before the next:

    • spawned once: node/p2p_impl.rs:453-470
    • serial loop: contract.rs:667-727, awaits handle_contract_event(...).await
      at contract.rs:709-711 before popping the next event
    • the pool's own doc comment confirms it: "the sequential event loop means at
      most one contract is ever in flight at a time" (contract/executor/runtime.rs:382-383,
      and :122-125)
  2. A PutQuery/UpdateQuery whose validate_state/update_state returns
    RequestRelated/requires(missing) and whose related contract is not local
    escalates to a network sub-op GET that is .awaited inline:

    • fetch_related_via_network (runtime.rs:4266-4298): start_sub_op_get then
      rx.await, bounded by RELATED_FETCH_TIMEOUT = 10s (runtime.rs:12)
    • local_state_or_from_network (runtime.rs:4200-4249): start_sub_op_get
      then rx.await under SUB_OP_FETCH_TIMEOUT = 120s (runtime.rs:4227-4228)
    • these are &mut self methods reached from handle_contract_event's
      Put/Update arms, so the await happens on the contract_handling task itself.
  3. A local-cache-hit GetQuery is serviced by the same loop and does only
    local reads (handle_contract_event GET arm → fetch_contract
    perform_contract_get, local state_store/contract-store only,
    contract.rs:924-1003). It would return instantly — but only once it reaches
    the head of the queue. If a related-fetch-blocked Put/Update sits ahead of it,
    it waits the full 10s/120s.
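
The three steps above can be reproduced in miniature. The following is a minimal sketch using only the standard library, not the actual executor code: `Event`, `serial_loop`, and `cached_get_wait` are hypothetical names, a thread plus `std::sync::mpsc` stands in for the async task and its queue, and `thread::sleep` stands in for the inline `rx.await` on the network fetch.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-ins for the real contract events.
enum Event {
    /// An upsert whose related contract has to be fetched over the network.
    UpsertNeedingFetch { fetch_time: Duration },
    /// A GET answerable from the local store; replies with the time since `start`.
    CachedGet { start: Instant, reply: mpsc::Sender<Duration> },
}

/// Serial consumer: each event is handled to completion before the next is popped.
fn serial_loop(rx: mpsc::Receiver<Event>) {
    for event in rx {
        match event {
            // The simulated network fetch is waited on inline, so the loop is held here.
            Event::UpsertNeedingFetch { fetch_time } => thread::sleep(fetch_time),
            // Pure local work, but it only runs once it reaches the head of the queue.
            Event::CachedGet { start, reply } => {
                let _ = reply.send(start.elapsed());
            }
        }
    }
}

/// Queues one blocked upsert ahead of a cached GET and returns how long the
/// GET took to be served, measured from just before both events were queued.
fn cached_get_wait(fetch_time: Duration) -> Duration {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || serial_loop(rx));
    let (reply_tx, reply_rx) = mpsc::channel();

    let start = Instant::now();
    tx.send(Event::UpsertNeedingFetch { fetch_time }).unwrap();
    tx.send(Event::CachedGet { start, reply: reply_tx }).unwrap();
    drop(tx); // close the queue so the loop exits after draining it

    let waited = reply_rx.recv().unwrap();
    worker.join().unwrap();
    waited
}
```

Calling `cached_get_wait(Duration::from_millis(200))` returns a wait of at least 200 ms even though the GET itself does no slow work, which is the same shape as a cached `GetQuery` waiting out the 10s or 120s timeout.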

The worst case is not a single stall: each queued op that independently needs the
unreachable related contract pays its own timeout serially, so a burst of
mailbox-triggered upserts can pin the loop far longer than one timeout interval.
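
As a rough bound on that serial cost, assume (this is an assumption of the sketch, not something the issue states) that no op benefits from an earlier op's fetch attempt. If $k$ ops queued ahead each time out on the related fetch, with per-op timeout $T_i$ of either 10 s or 120 s, an op behind them waits approximately up to:

```latex
W_{\max} = \sum_{i=1}^{k} T_i ,
\qquad T_i \in \{10\,\mathrm{s},\ 120\,\mathrm{s}\}
```

For example, five validation-path upserts would hold the loop for up to $5 \times 10\,\mathrm{s} = 50\,\mathrm{s}$, and five update-requires-path ops for up to $5 \times 120\,\mathrm{s} = 600\,\mathrm{s}$.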

This is also flagged by the repo's own rule .claude/rules/operations.md
("Forward Upstream Before Contract-Handling Work": "validate_state and the 10s
RELATED_FETCH_TIMEOUT [are] directly on the upstream-visible critical path").

Impact

  • Any contract whose validation/merge needs a related contract that isn't held
    locally and can't be fetched promptly will head-of-line-block all contract ops
    on that node for 10–120s (or longer under a burst).
  • freenet-email is the prime trigger: inbox/AFT upserts request related contracts
    on first cross-identity send; on a small/churny network the related contract is
    frequently not local. Matches the user report in freenet/mail#288 ("First send
    to a recipient hangs recipient node, all contract GETs fail"): "all contract
    GETs fail … clears up after some time".
  • The 120s inline hold on the local_state_or_from_network path is the most
    severe: a single sub-op GET for a missing contract can freeze the contract
    pipeline for two minutes.

Proposed fix direction (under design — not final)

Lift the network related-fetch off the serial loop so the loop is never held
on a network round-trip:

  • When validation discovers a missing related contract that requires a network
    fetch, kick off the sub-op GET (it already runs as its own task via
    start_sub_op_get) without awaiting it inline, release the loop, and
    re-enqueue the upsert as a continuation once the fetch completes / caches the
    related contract locally. The depth=1 / one-round invariant
    (.claude/rules/contracts.md) is preserved — the resumed upsert still performs
    exactly one fetch round, and a second RequestRelated is still the depth>1
    error.
  • Local-only related lookups stay inline (they're fast and don't touch the
    network).
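
One possible shape for the off-loop fetch is sketched below. It is a standard-library model under stated assumptions, not the proposed implementation: `Event`, `event_loop`, and `run_demo` are hypothetical names, threads and `std::sync::mpsc` stand in for tasks and the event queue, and `thread::sleep` stands in for the sub-op GET. The continuation carries the fetch outcome back to the loop, and a resumed upsert never starts a second fetch, which mirrors the one-round rule.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical event types; the real contract events differ.
enum Event {
    /// Upsert whose validation needs `related` to be available locally.
    Upsert { related: &'static str },
    /// Continuation re-enqueued by the fetch thread once the fetch finishes.
    ResumeUpsert { related: &'static str, fetched: bool },
    /// GET answered from the local store; replies with the time it was served.
    CachedGet { reply: Sender<Instant> },
    Shutdown,
}

/// Serial loop that never waits on the network: a missing related contract
/// starts a background fetch and the loop moves straight on to the next event.
fn event_loop(
    rx: Receiver<Event>,
    tx: Sender<Event>,
    done: Sender<()>,
    fetch_time: Duration,
) -> Vec<String> {
    let mut local: HashSet<&'static str> = HashSet::new();
    let mut log = Vec::new();
    for event in rx {
        match event {
            // Local-only lookup: stays inline.
            Event::Upsert { related } if local.contains(related) => {
                log.push(format!("upsert applied ({} local)", related));
                let _ = done.send(());
            }
            // Missing related contract: fetch off-loop, resume later.
            Event::Upsert { related } => {
                let tx = tx.clone();
                thread::spawn(move || {
                    thread::sleep(fetch_time); // stand-in for the network sub-op GET
                    let _ = tx.send(Event::ResumeUpsert { related, fetched: true });
                });
                log.push(format!("fetch started for {}", related));
            }
            Event::ResumeUpsert { related, fetched: true } => {
                local.insert(related);
                log.push(format!("upsert resumed and applied ({} fetched)", related));
                let _ = done.send(());
            }
            // No second fetch round: a failed fetch fails the upsert.
            Event::ResumeUpsert { related, fetched: false } => {
                log.push(format!("upsert failed ({} unavailable)", related));
                let _ = done.send(());
            }
            Event::CachedGet { reply } => {
                let _ = reply.send(Instant::now());
                log.push("cached GET served".to_string());
            }
            Event::Shutdown => break,
        }
    }
    log
}

/// Queues an upsert that needs a fetch, then a cached GET, and returns the
/// GET latency together with the loop's event log.
fn run_demo(fetch_time: Duration) -> (Duration, Vec<String>) {
    let (tx, rx) = mpsc::channel();
    let (done_tx, done_rx) = mpsc::channel();
    let (reply_tx, reply_rx) = mpsc::channel();
    let loop_tx = tx.clone();
    let worker = thread::spawn(move || event_loop(rx, loop_tx, done_tx, fetch_time));

    let start = Instant::now();
    tx.send(Event::Upsert { related: "related-contract" }).unwrap();
    tx.send(Event::CachedGet { reply: reply_tx }).unwrap();
    let get_latency = reply_rx.recv().unwrap().duration_since(start);

    done_rx.recv().unwrap(); // wait for the resumed upsert to finish
    tx.send(Event::Shutdown).unwrap();
    (get_latency, worker.join().unwrap())
}
```

With `run_demo(Duration::from_millis(300))`, the GET is served while the fetch is still in flight, and the upsert completes afterwards as a separate queue entry. In the real executor the continuation would presumably also need to carry the pending upsert's state and reply channel, which this sketch leaves out.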

A smaller complementary mitigation (serve local-cache-hit GetQuery reads off a
concurrent read path so they can't queue behind a blocked upsert) is possible but
does not fix non-cached ops stalling, so it is secondary to the off-loop fetch.
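
For illustration only, the concurrent read path could look roughly like the sketch below. It assumes a hypothetical shared in-memory map behind an `RwLock`; the real state store and its locking are not described in this issue, and `Store` and `try_cached_get` are made-up names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical shared view of locally cached contract state, keyed by contract id.
type Store = Arc<RwLock<HashMap<String, Vec<u8>>>>;

/// Read path that bypasses the serial queue: returns the cached state if it is
/// present, or `None` so the caller can fall back to queueing a regular GET.
/// A poisoned lock is also treated as a miss.
fn try_cached_get(store: &Store, key: &str) -> Option<Vec<u8>> {
    store.read().ok()?.get(key).cloned()
}
```

Readers only contend with writers for the duration of a map update, so a stalled upsert that holds no lock would not delay them. Ops that miss the cache still go through the queue, which is why this stays a secondary mitigation.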

The regression test must reproduce the stall: enqueue an upsert that triggers a
network related-fetch for an unreachable contract behind a local-cache-hit GET,
and assert the GET completes promptly instead of waiting out the
related-fetch timeout.

Related: #4345 (transport half, still open), freenet/mail#288 (user-visible
symptom), PR #4006 (where the network escalation was introduced).

cc @iduartgomez @netsirius

[AI-assisted - Claude]

Metadata

Assignees

No one assigned

    Labels

    A-contracts (Area: Contract runtime, SDK, and execution)
    A-networking (Area: Networking, ring protocol, peer discovery)
    E-hard (Experience needed to fix/implement: Hard / a lot)
    P-high (High priority)
    S-needs-design (Status: Needs architectural design or RFC)
    T-bug (Type: Something is broken)

    Type

    No type

    Projects

    Status
    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests