Executor head-of-line blocking: inline network related-fetch on contract_handling loop stalls all contract ops (incl. cached GETs)

## Summary

The single-threaded `contract_handling` event loop runs network *related-contract*
sub-GETs **inline**, blocking every other queued contract op — including
`GetQuery` reads that would otherwise be served instantly from the local store —
for up to **10s** (validation path) or **120s** (update-`requires` path) per
stuck op.

This is the op/executor-layer half of the "first send hangs the recipient node,
all contract GETs fail" symptom (freenet/mail#288). It is **distinct from** the
transport-layer mechanism in #4345 (cwnd-wait abort / stream-assembly wedge):
even with the transport fixes (#4353, #4367, #4374) fully in place, a single
related-contract fetch that can't be retrieved quickly still head-of-line-blocks
the loop. This is the piece that explains why *locally-cached* GETs stall, which
the transport explanation alone does not.

## Mechanism (file:line, current `main`)

1. `contract_handling` is a single consumer task; it processes one event to
   completion before the next:
   - spawned once: `node/p2p_impl.rs:453-470`
   - serial loop: `contract.rs:667-727`, awaits `handle_contract_event(...).await`
     at `contract.rs:709-711` before popping the next event
   - the pool's own doc comment confirms it: "the sequential event loop means at
     most one contract is ever in flight at a time" (`contract/executor/runtime.rs:382-383`,
     and `:122-125`)

2. A `PutQuery`/`UpdateQuery` whose `validate_state`/`update_state` returns
   `RequestRelated`/`requires(missing)` and whose related contract is **not local**
   escalates to a **network** sub-op GET that is `.await`ed inline:
   - `fetch_related_via_network` (`runtime.rs:4266-4298`): `start_sub_op_get` then
     `rx.await`, bounded by `RELATED_FETCH_TIMEOUT = 10s` (`runtime.rs:12`)
   - `local_state_or_from_network` (`runtime.rs:4200-4249`): `start_sub_op_get`
     then `rx.await` under `SUB_OP_FETCH_TIMEOUT = 120s` (`runtime.rs:4227-4228`)
   - these are `&mut self` methods reached from `handle_contract_event`'s
     Put/Update arms, so the await happens on the `contract_handling` task itself.

3. A local-cache-hit `GetQuery` is serviced by the *same* loop and does only
   local reads (`handle_contract_event` GET arm → `fetch_contract` →
   `perform_contract_get`, local `state_store`/contract-store only,
   `contract.rs:924-1003`). It would return instantly — but only once it reaches
   the head of the queue. If a related-fetch-blocked Put/Update sits ahead of it,
   it waits the full 10s/120s.
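
The three steps above can be reproduced in a toy model. This is a minimal sketch, not the real executor: it assumes a plain `std::sync::mpsc` queue and a thread in place of the async `contract_handling` task, and `Event`, `serial_loop`, and `cached_get_latency_behind_upsert` are hypothetical names. A `thread::sleep` stands in for the inline `rx.await` on the sub-op GET.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-ins for the real contract events.
enum Event {
    /// Upsert whose related-contract fetch is awaited inline for `hold`.
    BlockedUpsert { hold: Duration },
    /// GET served from local state only; reports how long it sat in the queue.
    CachedGet { enqueued: Instant, reply: mpsc::Sender<Duration> },
}

/// Serial consumer: each event runs to completion before the next is popped.
fn serial_loop(rx: mpsc::Receiver<Event>) {
    for event in rx {
        match event {
            // Stands in for the inline wait on the network sub-op GET.
            Event::BlockedUpsert { hold } => thread::sleep(hold),
            // The local read itself is instant; only the queueing delay is measured.
            Event::CachedGet { enqueued, reply } => {
                let _ = reply.send(enqueued.elapsed());
            }
        }
    }
}

/// Queue one blocked upsert ahead of one cached GET and return the GET's latency.
fn cached_get_latency_behind_upsert(hold: Duration) -> Duration {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || serial_loop(rx));
    let (reply_tx, reply_rx) = mpsc::channel();

    // Timestamp taken before either send, so the latency covers the whole hold.
    let enqueued = Instant::now();
    tx.send(Event::BlockedUpsert { hold }).unwrap();
    tx.send(Event::CachedGet { enqueued, reply: reply_tx }).unwrap();
    drop(tx); // closing the queue lets the loop exit after the GET

    let latency = reply_rx.recv().unwrap();
    worker.join().unwrap();
    latency
}
```

Calling `cached_get_latency_behind_upsert(Duration::from_millis(50))` returns a latency of at least 50 ms even though the GET does no work of its own, which is the same shape as the 10s/120s stall described here.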

The worst case is not a single stall: each queued op that independently needs the
unreachable related contract pays its own timeout serially, so a burst of
mailbox-triggered upserts can pin the loop far longer than one timeout interval.
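
As a rough worked example of the serial accumulation, the sketch below multiplies the per-op timeout by the number of queued ops that each need the unreachable contract. The two constants mirror the values quoted in this issue; `worst_case_stall` is a hypothetical helper, and it counts only time spent waiting on timeouts, not the ops' own processing time.

```rust
use std::time::Duration;

/// Timeout values as quoted in this issue.
const RELATED_FETCH_TIMEOUT: Duration = Duration::from_secs(10);
const SUB_OP_FETCH_TIMEOUT: Duration = Duration::from_secs(120);

/// Time the loop spends waiting when `blocked_ops` queued ops each pay
/// `per_op_timeout` one after another (hypothetical helper).
fn worst_case_stall(blocked_ops: u32, per_op_timeout: Duration) -> Duration {
    per_op_timeout * blocked_ops
}
```

Under these assumptions, six upserts on the validation path give `worst_case_stall(6, RELATED_FETCH_TIMEOUT)` = 60s, and three on the update-`requires` path give `worst_case_stall(3, SUB_OP_FETCH_TIMEOUT)` = 360s.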

This is also flagged by the repo's own rule `.claude/rules/operations.md`
("Forward Upstream Before Contract-Handling Work": "validate_state and the 10s
RELATED_FETCH_TIMEOUT [are] directly on the upstream-visible critical path").

## Impact

- Any contract whose validation/merge needs a related contract that isn't held
  locally and can't be fetched promptly will head-of-line-block all contract ops
  on that node for 10–120s (or longer under a burst).
- freenet-email is the prime trigger: inbox/AFT upserts request related contracts
  on first cross-identity send; on a small/churny network the related contract is
  frequently not local. Matches the user report in freenet/mail#288 ("all contract
  GETs fail … clears up after some time").
- The 120s inline hold on the `local_state_or_from_network` path is the most
  severe: a single sub-op GET for a missing contract can freeze the contract
  pipeline for two minutes.

## Proposed fix direction (under design — not final)

Lift the **network** related-fetch off the serial loop so the loop is never held
on a network round-trip:

- When validation discovers a missing related contract that requires a network
  fetch, kick off the sub-op GET (it already runs as its own task via
  `start_sub_op_get`) **without** awaiting it inline, release the loop, and
  re-enqueue the upsert as a continuation once the fetch completes / caches the
  related contract locally. The depth=1 / one-round invariant
  (`.claude/rules/contracts.md`) is preserved — the resumed upsert still performs
  exactly one fetch round, and a second `RequestRelated` is still the depth>1
  error.
- Local-only related lookups stay inline (they're fast and don't touch the
  network).
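
One possible shape for the continuation, sketched in the same toy model (std threads and `std::sync::mpsc` instead of the real async executor). `Event`, `Done`, and `off_loop_order` are hypothetical names, the spawned thread stands in for the task that `start_sub_op_get` already runs, and fetch failure, timeout handling, and the depth>1 error path are deliberately left out.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Done {
    Get,
    Upsert,
}

/// Hypothetical events for the off-loop design.
enum Event {
    /// First attempt: the related contract is missing locally.
    Upsert { fetch_time: Duration },
    /// Continuation re-enqueued once the fetch has cached the related contract.
    ResumeUpsert,
    /// GET served from local state only.
    CachedGet,
}

/// Queue a blocked upsert ahead of a cached GET and return the completion order.
fn off_loop_order(fetch_time: Duration) -> Vec<Done> {
    let (tx, rx) = mpsc::channel::<Event>();
    tx.send(Event::Upsert { fetch_time }).unwrap();
    tx.send(Event::CachedGet).unwrap();

    let mut completed = Vec::new();
    // Two ops were queued; stop once both have completed.
    while completed.len() < 2 {
        match rx.recv().unwrap() {
            Event::Upsert { fetch_time } => {
                // Start the fetch without waiting on it, then release the loop.
                let resume = tx.clone();
                thread::spawn(move || {
                    thread::sleep(fetch_time); // stands in for the network sub-op GET
                    let _ = resume.send(Event::ResumeUpsert);
                });
            }
            // The resumed upsert now finds the related contract locally.
            Event::ResumeUpsert => completed.push(Done::Upsert),
            Event::CachedGet => completed.push(Done::Get),
        }
    }
    completed
}
```

In this model `off_loop_order(Duration::from_millis(20))` yields `[Done::Get, Done::Upsert]`: the GET completes while the fetch is still outstanding, and the upsert finishes only when its continuation is dequeued.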

A smaller complementary mitigation (serve local-cache-hit `GetQuery` reads off a
concurrent read path so they can't queue behind a blocked upsert) is possible but
does not fix non-cached ops stalling, so it is secondary to the off-loop fetch.
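
For illustration only, the complementary read path could look roughly like the sketch below. It assumes the local store can be shared behind an `Arc<RwLock<...>>`, which is an assumption and not how the current `&mut self` executor is structured; `Store`, `read_cached`, and `get_latency_while_loop_blocked` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical shared view of locally cached contract state.
type Store = Arc<RwLock<HashMap<String, Vec<u8>>>>;

/// Read path that never enters the serial queue.
fn read_cached(store: &Store, key: &str) -> Option<Vec<u8>> {
    store.read().unwrap().get(key).cloned()
}

/// Perform a cached read while a simulated serial loop is stuck for `hold`.
fn get_latency_while_loop_blocked(hold: Duration) -> (Option<Vec<u8>>, Duration) {
    let store: Store = Arc::new(RwLock::new(HashMap::new()));
    store.write().unwrap().insert("inbox".to_string(), vec![1, 2, 3]);

    // Stands in for the loop waiting on an inline related fetch; it holds
    // no store lock while it waits.
    let blocked_loop = thread::spawn(move || thread::sleep(hold));

    let start = Instant::now();
    let state = read_cached(&store, "inbox");
    let latency = start.elapsed();

    blocked_loop.join().unwrap();
    (state, latency)
}
```

The read returns without waiting for the blocked loop, but a Put/Update queued behind the stuck op would still wait, which is why this stays secondary to the off-loop fetch.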

The regression test must reproduce the stall: enqueue an upsert that triggers a
network related-fetch for an unreachable contract, queue a local-cache-hit GET
behind it, and assert that the GET completes promptly instead of waiting out the
related-fetch timeout.

Related: #4345 (transport half, still open), freenet/mail#288 (user-visible
symptom), PR #4006 (where the network escalation was introduced).

cc @iduartgomez @netsirius

[AI-assisted - Claude]

