feat: drone swarm telemetry tuning — configurable timeouts, queue capacity, sync fix, shutdown()#59

Open
thewoodfish wants to merge 2 commits into main from feature/drone-swarm-telemetry-tuning

@thewoodfish

Member

Summary

Four general-purpose improvements extracted from the ds-swarm drone-swarm integration, each a good upstream candidate.

  • Configurable recv_from_network polling (prelude.rs, mod.rs): replaces hardcoded NETWORK_READ_TIMEOUT = 30 s / TASK_SLEEP_DURATION = 3 s with CoreBuilder::with_network_timeout(max_polls, poll_interval_ms). Defaults are identical to the old behaviour (10 × 3 000 ms = 30 s). Also fixes the MutexGuard being held across the sleep call, which blocked other tasks from writing responses.

  • Configurable event queue capacity (prelude.rs, mod.rs): DataQueue previously had a hardcoded capacity of 300. Added DataQueue::with_capacity() and CoreBuilder::with_event_queue_capacity(). Default unchanged.

  • Fix: ReplNetworkConfig::Custom sync_wait_time ignored (replication.rs): sync_with_eventual_consistency was always sleeping SYNC_WAIT_TIME (5 s) regardless of the user-supplied value. Now reads sync_wait_time from self.config, matching the pattern used for data_aging_period in the same function.

  • Core::shutdown() (mod.rs): stores AbortHandles for the two background tasks spawned by build() in a shared Arc<Mutex<Vec<AbortHandle>>>. Core::shutdown() aborts them, releasing the libp2p Swarm and its TCP/UDP listeners immediately — enabling port reuse in tests and graceful restarts. Tokio-runtime only, no behaviour change unless shutdown() is called.
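
The shutdown idea in the last bullet can be sketched without the crate itself. This is only an illustration under stated assumptions: the PR stores tokio `AbortHandle`s and aborts them, whereas the sketch below swaps in plain `std` threads with a cooperative stop flag so that it runs with no dependencies. `Node`, `start`, and the field names are hypothetical and are not part of the library's API.

```rust
use std::io::ErrorKind;
use std::net::TcpListener;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for `Core`: keeps the handles of its background workers.
struct Node {
    stop: Arc<AtomicBool>,
    workers: Arc<Mutex<Vec<JoinHandle<()>>>>,
    port: u16,
}

impl Node {
    /// Binds an ephemeral port and moves the listener into a background worker.
    fn start() -> std::io::Result<Node> {
        let listener = TcpListener::bind("127.0.0.1:0")?;
        listener.set_nonblocking(true)?;
        let port = listener.local_addr()?.port();
        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        let worker = thread::spawn(move || {
            // The worker owns the listener; the socket closes when this closure returns.
            while !flag.load(Ordering::SeqCst) {
                match listener.accept() {
                    Ok(_) => {} // accepted connections are dropped immediately
                    Err(e) if e.kind() == ErrorKind::WouldBlock => {
                        thread::sleep(Duration::from_millis(5));
                    }
                    Err(_) => break,
                }
            }
        });
        Ok(Node {
            stop,
            workers: Arc::new(Mutex::new(vec![worker])),
            port,
        })
    }

    /// Signals every worker to stop and waits for it, so the port is free on return.
    fn shutdown(&self) {
        self.stop.store(true, Ordering::SeqCst);
        let handles: Vec<JoinHandle<()>> = self.workers.lock().unwrap().drain(..).collect();
        for handle in handles {
            let _ = handle.join();
        }
    }
}

fn main() -> std::io::Result<()> {
    let node = Node::start()?;
    let port = node.port;
    node.shutdown();
    // After shutdown the same port can be bound again.
    let rebound = TcpListener::bind(("127.0.0.1", port))?;
    println!("rebound on {}", rebound.local_addr()?);
    Ok(())
}
```

Aborting a tokio task drops its future at the next await point instead of waiting for a flag check, but the effect on the socket is the same: once the task that owns the listener is gone, the port can be reused.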

Test plan

  • cargo check --features tokio-runtime passes (verified clean)
  • Confirm defaults are unchanged: with_network_timeout not called → 30 s ceiling as before
  • Confirm with_event_queue_capacity not called → queue capacity 300 as before
  • Confirm ReplNetworkConfig::Default → sync_wait_time still 5 s
  • Call core.shutdown().await after GossipsubExitNetwork and verify port is released for re-use
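
The two "defaults are unchanged" checks above reduce to simple arithmetic, which the following sketch spells out. `BuilderSketch` is a hypothetical stand-in for `CoreBuilder`; the method names come from the PR description, while the parameter types and the `timeout_ceiling` helper are assumptions made for illustration.

```rust
use std::time::Duration;

// Default values quoted in the PR description.
const DEFAULT_MAX_POLLS: u32 = 10;
const DEFAULT_POLL_INTERVAL_MS: u64 = 3_000;
const DEFAULT_QUEUE_CAPACITY: usize = 300;

/// Hypothetical stand-in for `CoreBuilder`; only the two new knobs are modelled.
#[derive(Debug, Clone, PartialEq)]
struct BuilderSketch {
    max_polls: u32,
    poll_interval_ms: u64,
    event_queue_capacity: usize,
}

impl Default for BuilderSketch {
    fn default() -> Self {
        BuilderSketch {
            max_polls: DEFAULT_MAX_POLLS,
            poll_interval_ms: DEFAULT_POLL_INTERVAL_MS,
            event_queue_capacity: DEFAULT_QUEUE_CAPACITY,
        }
    }
}

impl BuilderSketch {
    /// Overrides the polling parameters (argument types are assumed).
    fn with_network_timeout(mut self, max_polls: u32, poll_interval_ms: u64) -> Self {
        self.max_polls = max_polls;
        self.poll_interval_ms = poll_interval_ms;
        self
    }

    /// Overrides the event queue capacity.
    fn with_event_queue_capacity(mut self, capacity: usize) -> Self {
        self.event_queue_capacity = capacity;
        self
    }

    /// Longest time a caller waits before polling gives up (assumed helper).
    fn timeout_ceiling(&self) -> Duration {
        Duration::from_millis(self.poll_interval_ms * u64::from(self.max_polls))
    }
}

fn main() {
    // Neither setter called: 10 polls x 3 000 ms = 30 s, capacity 300.
    let defaults = BuilderSketch::default();
    assert_eq!(defaults.timeout_ceiling(), Duration::from_secs(30));
    assert_eq!(defaults.event_queue_capacity, 300);

    // A tuned configuration for a high-frequency workload.
    let tuned = BuilderSketch::default()
        .with_network_timeout(20, 50)
        .with_event_queue_capacity(1024);
    assert_eq!(tuned.timeout_ceiling(), Duration::from_secs(1));
    println!("{:?}", tuned);
}
```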

🤖 Generated with Claude Code

thewoodfish and others added 2 commits April 25, 2026 02:31
…figurable

Three targeted fixes for high-frequency telemetry workloads (drone swarm at
200 ms gossip interval, 50 nodes):

- CoreBuilder::with_network_timeout(max_polls, poll_interval_ms): replaces the
  hardcoded 3 s poll sleep and 10-retry limit in recv_from_network.  Defaults
  are unchanged (10 × 3 000 ms = 30 s).  Also fixes the MutexGuard being held
  across the sleep call, which blocked concurrent response writers.

- CoreBuilder::with_event_queue_capacity(capacity): DataQueue capacity was a
  hardcoded 300-element constant.  It is now a per-instance runtime value with
  the same default.

- Fix: ReplNetworkConfig::Custom sync_wait_time was stored but never read by
  the eventual-consistency background loop, which always slept for the constant
  SYNC_WAIT_TIME (5 s).  The configured value is now honoured.

See CHANGES_FOR_DRONE_SWARM.md for full rationale and upstream PR guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…listeners

Stores AbortHandles for the two tokio tasks spawned by build() in a
shared Arc<Mutex<Vec<AbortHandle>>> on Core.  Core::shutdown() aborts
all handles, releasing the libp2p Swarm (and its TCP/UDP listeners)
immediately — enabling port reuse in tests and graceful drone restarts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sacha-l

sacha-l commented Apr 27, 2026

Collaborator

@thewoodfish - can you refer to the specific issues this closes where relevant?

@thewoodfish

Member Author

Closes issue #57 — "Core::recv_from_network polls with hardcoded 3s TASK_SLEEP_DURATION — every RPC has a 3s floor"

  • Added CoreBuilder::with_network_timeout(max_polls, poll_interval_ms) to replace the hardcoded 3 s/30 s polling constants.
  • Bonus fix: the MutexGuard on stream_response_buffer was held across the sleep call, blocking concurrent response writers; that is also fixed.
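
A minimal sketch of the guard fix, assuming the buffer is a map keyed by a request id. The real code is async and uses the crate's own types; this version swaps in `std` threads and `std::sync::Mutex` so it runs without dependencies. `ResponseBuffer` and `recv_sketch` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical shape of the shared response buffer.
type ResponseBuffer = Arc<Mutex<HashMap<u64, String>>>;

/// Polls `buffer` for `id` up to `max_polls` times, sleeping between attempts.
fn recv_sketch(
    buffer: &ResponseBuffer,
    id: u64,
    max_polls: u32,
    poll_interval: Duration,
) -> Option<String> {
    for _ in 0..max_polls {
        {
            // The guard lives only inside this block.
            let mut guard = buffer.lock().unwrap();
            if let Some(response) = guard.remove(&id) {
                return Some(response);
            }
        } // Guard dropped here, so writers can lock the buffer while we sleep.
        thread::sleep(poll_interval);
    }
    None
}

fn main() {
    let buffer: ResponseBuffer = Arc::new(Mutex::new(HashMap::new()));

    // A writer that delivers the response shortly after polling has started.
    let writer_buffer = Arc::clone(&buffer);
    let writer = thread::spawn(move || {
        thread::sleep(Duration::from_millis(30));
        writer_buffer.lock().unwrap().insert(7, "pong".to_string());
    });

    let got = recv_sketch(&buffer, 7, 10, Duration::from_millis(20));
    writer.join().unwrap();
    assert_eq!(got.as_deref(), Some("pong"));
}
```

If the guard were still alive during `thread::sleep`, the writer could only take the lock in the brief gap between iterations, which is the contention the PR describes.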

Partially addresses #50 ("Gossipsub mesh takes ~5 s to form; broadcasts before then silently fail"). This is not a direct fix for mesh formation, but with a tunable poll interval callers are no longer forced to wait up to 30 s for a response, which narrows the window in which the symptom appears.

Silent bug fix (no issue filed) — ReplNetworkConfig::Custom { sync_wait_time } was stored but the eventual consistency loop always slept 5 s regardless. Now correctly honored.
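
The shape of that fix can be sketched as follows. `ReplConfigSketch` is a hypothetical, trimmed-down stand-in for `ReplNetworkConfig`, and treating `sync_wait_time` as a number of seconds is an assumption; only the 5 s default is taken from the PR.

```rust
use std::time::Duration;

/// Default wait between sync rounds, as quoted in the PR (5 s).
const SYNC_WAIT_TIME_SECS: u64 = 5;

/// Hypothetical, trimmed-down version of `ReplNetworkConfig`.
enum ReplConfigSketch {
    Default,
    Custom { sync_wait_time: u64 }, // seconds (assumed unit)
}

impl ReplConfigSketch {
    /// Returns the wait the background loop should use.
    /// Before the fix, the loop used the constant for both variants.
    fn sync_wait_time(&self) -> Duration {
        match self {
            ReplConfigSketch::Default => Duration::from_secs(SYNC_WAIT_TIME_SECS),
            ReplConfigSketch::Custom { sync_wait_time } => Duration::from_secs(*sync_wait_time),
        }
    }
}

fn main() {
    let default_cfg = ReplConfigSketch::Default;
    let custom_cfg = ReplConfigSketch::Custom { sync_wait_time: 1 };
    assert_eq!(default_cfg.sync_wait_time(), Duration::from_secs(5));
    assert_eq!(custom_cfg.sync_wait_time(), Duration::from_secs(1));
}
```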

