GET initiated during peer-connection churn stalls silently for 60+s instead of fail-fast / re-route #4154

Description

@sanity

Symptom

A user reported that a GET against a low-popularity contract on a local node took 30+ seconds, at which point the HTTP client timed out. A retry one second later succeeded in 18s. Telemetry shows the first GET actually completed network-side after 62 seconds: it didn't fail, it just stalled.

Evidence from telemetry (2026-05-19, contract 6FzSeAUKcqJrveKyU8RJgGKc5jRB1Z2juvxXtwTA4Em9)

Originating peer jMphpuy7z7PPG4Xt@73.98.109.226 (user's local node):

| Attempt | TX ID | Started (UTC) | Completed (UTC) | elapsed_ms |
| --- | --- | --- | --- | --- |
| 1st (client timed out @30s) | 01KRZ6ADBAT9D0WZYAXC6SQSR2 | ~04:01:20.7 | 04:02:22.462 | 61716 |
| 2nd (retry, succeeded) | 01KRZ6BDQE7DG7W8V2BCDMX0G2 | 04:01:53.823 | 04:02:07.367 | 13544 |

Originating peer's events around the first GET:

  • 04:01:19.557 disconnect (peer connection lost)
  • 04:01:20.7 — first GET issued (derived from ULID + elapsed_ms)
  • 04:01:32.919 connect_request_received + connect_rejected
  • 04:01:32.999 connect_request_sent
  • 04:01:33.321 connect_connected (recovery)
  • 04:01:53.823 — retry GET — routes cleanly via 3 hops, succeeds in 13.5s
  • 04:02:22.462 — original first GET finally get_success (62s after issue)
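
The issue time of the first GET can be cross-checked from the transaction ID alone. The sketch below assumes the TX IDs are standard ULIDs, whose first 10 Crockford base32 characters encode a 48-bit Unix timestamp in milliseconds. The function names are illustrative and not taken from the codebase.

```rust
/// Crockford base32 alphabet used by ULIDs (no I, L, O, U).
const CROCKFORD: &[u8; 32] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/// Decode the millisecond Unix timestamp from the first 10 characters of a ULID.
/// Returns None if the input is shorter than 10 bytes or has a character
/// outside the alphabet.
pub fn ulid_timestamp_ms(ulid: &str) -> Option<u64> {
    let prefix = ulid.as_bytes().get(..10)?;
    let mut ms: u64 = 0;
    for &b in prefix {
        let upper = b.to_ascii_uppercase();
        let digit = CROCKFORD.iter().position(|&c| c == upper)? as u64;
        // Each character contributes 5 bits, most significant first.
        ms = (ms << 5) | digit;
    }
    Some(ms)
}

/// Split a Unix millisecond timestamp into (hour, minute, second, millisecond)
/// within its UTC day.
pub fn utc_time_of_day(ms: u64) -> (u64, u64, u64, u64) {
    let ms_of_day = ms % 86_400_000;
    (
        ms_of_day / 3_600_000,
        (ms_of_day / 60_000) % 60,
        (ms_of_day / 1_000) % 60,
        ms_of_day % 1_000,
    )
}
```

For TX 01KRZ6ADBAT9D0WZYAXC6SQSR2 this yields 04:01:20.746 UTC, which agrees with the completion time minus elapsed_ms (04:02:22.462 minus 61.716s).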

Notable: the first transaction emitted zero get_request events in telemetry — only the eventual get_success. Compare with the retry, which emitted 9 get_request events (one per hop) within the first 400ms. The op was held in a pre-routing state while the originating node was reconnecting, and no peer-forwarding telemetry was emitted until the eventual response.
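
To find other transactions with the same signature in the collector logs, a filter along these lines could help. It is a minimal sketch: the `Event` struct is a hypothetical, simplified view of one telemetry record, and parsing of logs.jsonl is left out.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified telemetry record (one per logged event).
pub struct Event<'a> {
    pub tx_id: &'a str,
    pub kind: &'a str, // e.g. "get_request", "get_success"
}

/// Return the IDs of transactions that reported get_success without ever
/// emitting a get_request, sorted for stable output.
pub fn silent_gets<'a>(events: &[Event<'a>]) -> Vec<&'a str> {
    // Per transaction: (get_request count, get_success count).
    let mut counts: HashMap<&'a str, (usize, usize)> = HashMap::new();
    for e in events {
        let entry = counts.entry(e.tx_id).or_default();
        match e.kind {
            "get_request" => entry.0 += 1,
            "get_success" => entry.1 += 1,
            _ => {}
        }
    }
    let mut out: Vec<&'a str> = counts
        .into_iter()
        .filter(|(_, (requests, successes))| *requests == 0 && *successes > 0)
        .map(|(tx, _)| tx)
        .collect();
    out.sort_unstable();
    out
}
```

Fed the events for the two transactions above, this would flag only the first one.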

Hypothesis

When a GET is initiated and the originating node has just lost peer connections (or otherwise lacks a viable forwarding target), the op gets parked in an internal queue waiting for routing to recover. It is not actively re-evaluated, not aggressively re-routed once new peers connect (~13s later in this case), and not failed-fast back to the caller. It just waits — in this case, 62 seconds.

The client-visible 30s timeout is just the symptom; the underlying issue is that the operation has no fail-fast or active-reroute semantics during peer-state churn.

What I'd expect

Some combination of:

  • Fail-fast: if a GET has zero viable forwarding targets at issue-time, return an error to the client immediately rather than parking it.
  • Active reroute: when new peers connect while a GET is parked, re-evaluate routing and dispatch.
  • Bounded park time: explicit timeout on "waiting for routable peer" with telemetry on what we were waiting for.
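
One way the three behaviours above might combine is a single decision function that is called at issue time, on every peer-connect event, and on a periodic tick. This is a sketch only: `ParkPolicy`, `Decision`, and `decide` are hypothetical names, not existing types, and the right `max_park` value is a design question.

```rust
use std::time::{Duration, Instant};

/// Outcome of evaluating a GET that has not been forwarded yet.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Dispatch,     // a viable target exists: route now (covers active reroute)
    FailFast,     // no target at issue time: return an error to the client
    Park,         // wait for routing to recover
    ParkTimedOut, // bounded park time exceeded: fail the op
}

/// Hypothetical policy knobs.
#[derive(Debug, Clone, Copy)]
pub struct ParkPolicy {
    pub fail_fast_at_issue: bool,
    pub max_park: Duration,
}

/// `parked_since` is None when the op is being evaluated at issue time,
/// and Some(start) when it is already parked.
pub fn decide(
    policy: ParkPolicy,
    viable_targets: usize,
    parked_since: Option<Instant>,
    now: Instant,
) -> Decision {
    if viable_targets > 0 {
        return Decision::Dispatch;
    }
    match parked_since {
        None if policy.fail_fast_at_issue => Decision::FailFast,
        None => Decision::Park,
        Some(start) if now.saturating_duration_since(start) >= policy.max_park => {
            Decision::ParkTimedOut
        }
        Some(_) => Decision::Park,
    }
}
```

In the incident above, re-running `decide` when connect_connected fired at 04:01:33.321 would have returned `Dispatch` about 13s after the GET was issued, assuming the new peer counted as a viable target.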

Today there's apparently no telemetry between op-creation and op-completion in this path — adding instrumentation for "GET parked waiting for routing" would already help diagnose this class of failures.
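
As a starting point for that instrumentation, a parked-GET event might carry the transaction ID, how long the op has been parked, and what it is waiting for. The event name `get_parked` and all field names below are hypothetical and do not exist in the telemetry today.

```rust
use std::fmt;
use std::time::Duration;

/// Why the GET could not be forwarded (hypothetical reasons).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ParkReason {
    NoConnectedPeers,
    NoViableTarget { connected_peers: usize },
}

/// Hypothetical telemetry event emitted while a GET waits for routing.
#[derive(Debug, Clone)]
pub struct GetParked<'a> {
    pub tx_id: &'a str,
    pub reason: ParkReason,
    pub parked_for: Duration,
}

impl fmt::Display for GetParked<'_> {
    /// Render the event as a single key=value log line.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "get_parked tx={} parked_ms={} reason=",
            self.tx_id,
            self.parked_for.as_millis()
        )?;
        match &self.reason {
            ParkReason::NoConnectedPeers => write!(f, "no_connected_peers"),
            ParkReason::NoViableTarget { connected_peers } => {
                write!(f, "no_viable_target connected_peers={}", connected_peers)
            }
        }
    }
}
```

Emitting such an event when the op is parked and again when it is released would close the gap between op-creation and op-completion described above.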

Repro

Hard to reliably repro without inducing a disconnect at the right moment, but the conditions are:

  1. A node loses one or more peer connections.
  2. Within ~1s, a contract GET is issued via HTTP.
  3. Result: GET hangs for tens of seconds until either peers recover and routing finally fires, or client times out.

Cross-reference

  • Original investigation: freenet/river debug session 2026-05-19
  • Telemetry data: nova OTLP collector logs.jsonl, contract 6FzSeAUKcqJrveKyU8RJgGKc5jRB1Z2juvxXtwTA4Em9, TX 01KRZ6ADBAT9D0WZYAXC6SQSR2

[AI-assisted - Claude]

Labels

  • A-networking (Area: Networking, ring protocol, peer discovery)
  • E-medium (Experience needed to fix/implement: Medium / intermediate)
  • P-high (High priority)
  • S-needs-design (Status: Needs architectural design or RFC)
  • T-bug (Type: Something is broken)
