Symptom
User reported that a GET against a low-popularity contract on a local node took over 30 seconds, and the HTTP client timed out. A retry issued shortly after the timeout succeeded in 18s (client-observed). Telemetry shows the first GET did complete network-side, after 62 seconds: it didn't fail, it stalled.
Evidence from telemetry (2026-05-19, contract 6FzSeAUKcqJrveKyU8RJgGKc5jRB1Z2juvxXtwTA4Em9)
Originating peer jMphpuy7z7PPG4Xt@73.98.109.226 (user's local node):
| Attempt | TX ID | Started (UTC) | Completed (UTC) | elapsed_ms |
|---|---|---|---|---|
| 1st (client timed out at 30s) | 01KRZ6ADBAT9D0WZYAXC6SQSR2 | ~04:01:20.7 | 04:02:22.462 | 61716 |
| 2nd (retry, succeeded) | 01KRZ6BDQE7DG7W8V2BCDMX0G2 | 04:01:53.823 | 04:02:07.367 | 13544 |
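The ~04:01:20.7 start time for the first attempt is derived rather than logged. A ULID encodes its creation time as a 48-bit millisecond Unix timestamp in its first 10 Crockford base32 characters, so the start can be read from the TX ID and cross-checked against `Completed - elapsed_ms`. A minimal, dependency-free Rust sketch of that check, assuming the TX IDs are standard ULIDs (the helper names are hypothetical):

```rust
/// Crockford base32 alphabet used by ULIDs (no I, L, O, U).
const CROCKFORD: &[u8; 32] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/// Decode the millisecond Unix timestamp stored in the first 10 characters
/// of a 26-character ULID. Returns None on a bad length or character.
fn ulid_timestamp_ms(ulid: &str) -> Option<u64> {
    if ulid.len() != 26 {
        return None;
    }
    let mut ms: u64 = 0;
    for c in ulid.bytes().take(10) {
        let digit = CROCKFORD.iter().position(|&a| a == c.to_ascii_uppercase())? as u64;
        ms = ms * 32 + digit; // 5 bits per character, most significant first
    }
    Some(ms)
}

/// Render the UTC time-of-day part of a millisecond timestamp as HH:MM:SS.mmm.
fn utc_time_of_day(ms: u64) -> String {
    let ms_of_day = ms % 86_400_000;
    let h = ms_of_day / 3_600_000;
    let m = ms_of_day / 60_000 % 60;
    let s = ms_of_day / 1_000 % 60;
    let frac = ms_of_day % 1_000;
    format!("{h:02}:{m:02}:{s:02}.{frac:03}")
}

/// Parse "HH:MM:SS.mmm" (millisecond precision assumed) into ms since midnight.
fn parse_time_of_day(t: &str) -> Option<u64> {
    let (hms, frac) = t.split_once('.')?;
    let mut parts = hms.split(':').map(|p| p.parse::<u64>().ok());
    let (h, m, s) = (parts.next()??, parts.next()??, parts.next()??);
    let frac: u64 = frac.parse().ok()?;
    Some(((h * 60 + m) * 60 + s) * 1_000 + frac)
}
```

For 01KRZ6ADBAT9D0WZYAXC6SQSR2, both the ULID timestamp and 04:02:22.462 minus 61716 ms give 04:01:20.746, consistent with the table.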
Originating peer's events around the first GET:
- 04:01:19.557: `disconnect` (peer connection lost)
- 04:01:20.7: first GET issued (derived from ULID + `elapsed_ms`)
- 04:01:32.919: `connect_request_received` + `connect_rejected`
- 04:01:32.999: `connect_request_sent`
- 04:01:33.321: `connect_connected` (recovery)
- 04:01:53.823: retry GET, routes cleanly via 3 hops, succeeds in 13.5s
- 04:02:22.462: original first GET finally reports `get_success` (62s after issue)
Notable: the first transaction emitted zero `get_request` events in telemetry; the only event recorded for it is the eventual `get_success`. The retry, by contrast, emitted 9 `get_request` events along its route within the first 400ms. The op was held in a pre-routing state while the originating node was reconnecting, and no peer-forwarding telemetry was emitted until the eventual response.
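Splitting the timeline into intervals makes the stall concrete. A small Rust sketch (hypothetical helper names; times are the UTC timestamps above, with the first GET's start taken as 04:01:20.746, i.e. completion minus `elapsed_ms`):

```rust
/// Parse "HH:MM:SS.mmm" (millisecond precision assumed) into ms since midnight.
fn parse_time_of_day(t: &str) -> Option<u64> {
    let (hms, frac) = t.split_once('.')?;
    let mut parts = hms.split(':').map(|p| p.parse::<u64>().ok());
    let (h, m, s) = (parts.next()??, parts.next()??, parts.next()??);
    let frac: u64 = frac.parse().ok()?;
    Some(((h * 60 + m) * 60 + s) * 1_000 + frac)
}

/// Milliseconds between two same-day timestamps; None if `to` precedes `from`.
fn gap_ms(from: &str, to: &str) -> Option<u64> {
    parse_time_of_day(to)?.checked_sub(parse_time_of_day(from)?)
}
```

On these numbers the GET was issued about 1.2s after the disconnect, peers were back about 12.6s after issue, and the op then sat for roughly 49s after `connect_connected` before completing. That last interval is the part active reroute would target.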
Hypothesis
When a GET is initiated and the originating node has just lost peer connections (or otherwise lacks a viable forwarding target), the op gets parked in an internal queue waiting for routing to recover. It is not actively re-evaluated, not aggressively re-routed once new peers connect (~13s later in this case), and not failed fast back to the caller. It simply waits: 62 seconds in this case.
The client-visible 30s timeout is just the symptom; the underlying issue is that the operation has no fail-fast or active-reroute semantics during peer-state churn.
What I'd expect
Some combination of:
- Fail-fast: if a GET has zero viable forwarding targets at issue-time, return an error to the client immediately rather than parking it.
- Active reroute: when new peers connect while a GET is parked, re-evaluate routing and dispatch.
- Bounded park time: explicit timeout on "waiting for routable peer" with telemetry on what we were waiting for.
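One way the three options could fit together, as a minimal Rust sketch. `GetRouter`, `Action`, and every method here are hypothetical names for illustration, not Freenet's actual op-manager API; time is passed in as plain milliseconds so the logic stays deterministic:

```rust
use std::collections::VecDeque;

/// What happens to a GET at a given moment (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
enum Action {
    Dispatch { tx: String, peer: String },
    FailFast { tx: String },
    Parked { tx: String, deadline_ms: u64 },
    TimedOut { tx: String, waited_ms: u64 },
}

struct GetRouter {
    fail_fast: bool,                 // reject at once when no peer is routable
    max_park_ms: u64,                // bounded wait for a routable peer
    parked: VecDeque<(String, u64)>, // (tx id, parked_at_ms), oldest first
}

impl GetRouter {
    fn new(fail_fast: bool, max_park_ms: u64) -> Self {
        Self { fail_fast, max_park_ms, parked: VecDeque::new() }
    }

    /// Called when the client issues a GET.
    fn issue(&mut self, tx: &str, peers: &[&str], now_ms: u64) -> Action {
        if let Some(peer) = peers.first() {
            return Action::Dispatch { tx: tx.to_string(), peer: peer.to_string() };
        }
        if self.fail_fast {
            return Action::FailFast { tx: tx.to_string() };
        }
        self.parked.push_back((tx.to_string(), now_ms));
        Action::Parked { tx: tx.to_string(), deadline_ms: now_ms + self.max_park_ms }
    }

    /// Active reroute: a peer connected, so dispatch every parked GET now.
    /// (A real router would re-run peer selection instead of using `peer` blindly.)
    fn on_peer_connected(&mut self, peer: &str) -> Vec<Action> {
        self.parked
            .drain(..)
            .map(|(tx, _)| Action::Dispatch { tx, peer: peer.to_string() })
            .collect()
    }

    /// Bounded park time: periodic sweep failing GETs parked past the deadline.
    fn expire(&mut self, now_ms: u64) -> Vec<Action> {
        let mut out = Vec::new();
        while let Some(&(_, parked_at)) = self.parked.front() {
            if now_ms.saturating_sub(parked_at) < self.max_park_ms {
                break; // queue is ordered by park time, so the rest are younger
            }
            let (tx, _) = self.parked.pop_front().unwrap();
            out.push(Action::TimedOut { tx, waited_ms: now_ms.saturating_sub(parked_at) });
        }
        out
    }
}
```

With `fail_fast` set, the no-peer case returns immediately; otherwise the GET parks with a deadline, is re-dispatched as soon as any peer connects, and is failed by the sweep once `max_park_ms` elapses, so the caller gets an answer well before a 30s client timeout.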
Today there's apparently no telemetry between op-creation and op-completion in this path — adding instrumentation for "GET parked waiting for routing" would already help diagnose this class of failures.
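For the instrumentation alone, a pair of events bracketing the parked interval would have explained this trace directly. A sketch of what such events might carry; the event names (`get_parked`, `get_unparked`, `get_park_timeout`), fields, and reasons are hypothetical and do not correspond to existing telemetry:

```rust
use std::fmt;

/// Why a GET could not be forwarded at issue time (hypothetical reasons).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParkReason {
    NoConnectedPeers,
    NoViableTarget,
}

/// Hypothetical telemetry events emitted at the start and end of a park.
#[derive(Debug, PartialEq)]
enum ParkEvent {
    Parked { tx: String, at_ms: u64, reason: ParkReason, connected_peers: usize },
    Unparked { tx: String, at_ms: u64, parked_for_ms: u64 },
    ParkTimeout { tx: String, at_ms: u64, parked_for_ms: u64 },
}

/// One key=value log line per event, easy to grep in a collector's output.
impl fmt::Display for ParkEvent {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParkEvent::Parked { tx, at_ms, reason, connected_peers } => write!(
                f,
                "get_parked tx={tx} at_ms={at_ms} reason={reason:?} connected_peers={connected_peers}"
            ),
            ParkEvent::Unparked { tx, at_ms, parked_for_ms } => {
                write!(f, "get_unparked tx={tx} at_ms={at_ms} parked_for_ms={parked_for_ms}")
            }
            ParkEvent::ParkTimeout { tx, at_ms, parked_for_ms } => {
                write!(f, "get_park_timeout tx={tx} at_ms={at_ms} parked_for_ms={parked_for_ms}")
            }
        }
    }
}
```

Recording `connected_peers` and the reason at park time answers "what were we waiting for", and `parked_for_ms` on the closing event gives the stall duration without reconstructing it from ULIDs.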
Repro
Hard to reliably repro without inducing a disconnect at the right moment, but the conditions are:
- A node loses one or more peer connections.
- Within ~1s, a contract GET is issued via HTTP.
- Result: the GET hangs for tens of seconds until either peers recover and routing finally fires, or the client times out.
Cross-reference
- Original investigation: freenet/river debug session 2026-05-19
- Telemetry data: nova OTLP collector logs.jsonl, contract 6FzSeAUKcqJrveKyU8RJgGKc5jRB1Z2juvxXtwTA4Em9, TX 01KRZ6ADBAT9D0WZYAXC6SQSR2
[AI-assisted - Claude]