feat: drone swarm telemetry tuning — configurable timeouts, queue capacity, sync fix, shutdown() #59
Open
thewoodfish wants to merge 2 commits into
…figurable

Three targeted fixes for high-frequency telemetry workloads (drone swarm at 200 ms gossip interval, 50 nodes):

- CoreBuilder::with_network_timeout(max_polls, poll_interval_ms): replaces the hardcoded 3 s poll sleep and 10-retry limit in recv_from_network. Defaults are unchanged (10 × 3 000 ms = 30 s). Also fixes the MutexGuard being held across the sleep call, which blocked concurrent response writers.
- CoreBuilder::with_event_queue_capacity(capacity): DataQueue capacity was a hardcoded 300-element constant. It is now a per-instance runtime value with the same default.
- Fix: ReplNetworkConfig::Custom sync_wait_time was stored but never read by the eventual-consistency background loop, which always slept for the constant SYNC_WAIT_TIME (5 s). The configured value is now honoured.

See CHANGES_FOR_DRONE_SWARM.md for full rationale and upstream PR guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
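To illustrate the MutexGuard fix described above, here is a minimal sketch of a bounded polling loop that releases the lock before sleeping. It is not the crate's code: `Responses`, `recv_with_timeout`, and the request-id map are hypothetical stand-ins, and it uses blocking `std` primitives rather than the async runtime so it can run on its own.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the shared response buffer (not the crate's real type).
type Responses = Arc<Mutex<HashMap<u64, String>>>;

/// Poll for a response at most `max_polls` times, sleeping
/// `poll_interval_ms` between attempts. Returns `None` on timeout.
fn recv_with_timeout(
    responses: &Responses,
    request_id: u64,
    max_polls: u32,
    poll_interval_ms: u64,
) -> Option<String> {
    for _ in 0..max_polls {
        {
            // The guard lives only inside this block, so it is dropped
            // before the sleep and writers are never blocked by a sleeper.
            let mut guard = responses.lock().unwrap();
            if let Some(resp) = guard.remove(&request_id) {
                return Some(resp);
            }
        }
        thread::sleep(Duration::from_millis(poll_interval_ms));
    }
    None
}

fn main() {
    let responses: Responses = Arc::new(Mutex::new(HashMap::new()));
    let writer = Arc::clone(&responses);

    // A concurrent writer delivers the response while the reader is polling.
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(30));
        writer.lock().unwrap().insert(7, "pong".to_string());
    });

    // 20 polls at 10 ms gives a 200 ms ceiling, analogous to 10 x 3 000 ms = 30 s.
    let got = recv_with_timeout(&responses, 7, 20, 10);
    handle.join().unwrap();
    println!("{:?}", got);
}
```

With the guard scoped this way, the worst-case wait is roughly `max_polls × poll_interval_ms`, and the writer only contends for the lock during the brief map lookup.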
…listeners

Stores AbortHandles for the two tokio tasks spawned by build() in a shared Arc<Mutex<Vec<AbortHandle>>> on Core. Core::shutdown() aborts all handles, releasing the libp2p Swarm (and its TCP/UDP listeners) immediately. This enables port reuse in tests and graceful drone restarts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
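The port-reuse effect can be sketched without tokio. The example below is a std-only analogue, not the PR's implementation: it swaps tokio's `AbortHandle` for a cooperative stop flag plus a thread join, and `Node`, `start`, and `shutdown` are hypothetical names. The point it shows is the same: once the background task that owns the listener has ended, the port can be bound again.

```rust
use std::net::TcpListener;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for `Core`: keeps what is needed to stop its background task.
struct Node {
    stop: Arc<AtomicBool>,
    task: Option<JoinHandle<()>>,
    port: u16,
}

impl Node {
    /// Bind an ephemeral loopback port and move the listener into a background thread.
    fn start() -> std::io::Result<Node> {
        let listener = TcpListener::bind("127.0.0.1:0")?;
        let port = listener.local_addr()?.port();
        listener.set_nonblocking(true)?;

        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        let task = thread::spawn(move || {
            while !flag.load(Ordering::SeqCst) {
                // Returns WouldBlock when idle; any accepted stream is dropped.
                let _ = listener.accept();
                thread::sleep(Duration::from_millis(5));
            }
            // `listener` is dropped here, closing the socket.
        });

        Ok(Node { stop, task: Some(task), port })
    }

    /// Signal the task to stop and wait until it has released the listener.
    fn shutdown(&mut self) {
        self.stop.store(true, Ordering::SeqCst);
        if let Some(task) = self.task.take() {
            task.join().unwrap();
        }
    }
}

fn main() -> std::io::Result<()> {
    let mut node = Node::start()?;
    let port = node.port;

    node.shutdown();

    // After shutdown the same port can be bound again.
    let rebound = TcpListener::bind(("127.0.0.1", port))?;
    println!("rebound on {}", rebound.local_addr()?);
    Ok(())
}
```

Unlike this cooperative flag, aborting a tokio task cancels it at its next await point, so the real `shutdown()` does not need the task to check anything itself.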
Collaborator
@thewoodfish - can you refer to the specific issues this closes where relevant?
Member
Author
Closes issue #57 — "Core::recv_from_network polls with hardcoded 3s TASK_SLEEP_DURATION — every RPC has a 3s floor"
Partially addresses #50, "Gossipsub mesh takes ~5 s to form; broadcasts before then silently fail". This is not a direct fix for mesh formation, but the tunable poll interval means callers are no longer forced to wait up to 30 s for a response, which narrows the symptom window.

Silent bug fix (no issue filed): ReplNetworkConfig::Custom { sync_wait_time } was stored, but the eventual-consistency loop always slept 5 s regardless. The configured value is now honoured.
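The shape of the sync_wait_time fix can be sketched as follows. This is an illustration under assumptions, not the crate's code: `ReplConfig` is a hypothetical, trimmed-down mirror of `ReplNetworkConfig`, `sync_wait` is a made-up helper, and the value is assumed to be in seconds.

```rust
use std::time::Duration;

/// Default wait used when nothing is configured (5 s, per the PR description).
const SYNC_WAIT_TIME: u64 = 5;

/// Hypothetical, trimmed-down mirror of `ReplNetworkConfig`.
/// The real enum carries more fields; seconds are an assumed unit.
enum ReplConfig {
    Default,
    Custom { sync_wait_time: u64 },
}

/// Resolve the sleep duration from the config instead of always using the constant.
fn sync_wait(config: &ReplConfig) -> Duration {
    match config {
        ReplConfig::Default => Duration::from_secs(SYNC_WAIT_TIME),
        // Before the fix, this value was stored but never consulted.
        ReplConfig::Custom { sync_wait_time } => Duration::from_secs(*sync_wait_time),
    }
}

fn main() {
    println!("{:?}", sync_wait(&ReplConfig::Default));
    println!("{:?}", sync_wait(&ReplConfig::Custom { sync_wait_time: 1 }));
}
```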
Summary
Four general-purpose improvements extracted from the `ds-swarm` drone-swarm integration, each a good upstream candidate.

- Configurable `recv_from_network` polling (`prelude.rs`, `mod.rs`): replaces hardcoded `NETWORK_READ_TIMEOUT = 30 s` / `TASK_SLEEP_DURATION = 3 s` with `CoreBuilder::with_network_timeout(max_polls, poll_interval_ms)`. Defaults are identical to the old behaviour (10 × 3 000 ms = 30 s). Also fixes the `MutexGuard` being held across the sleep call, which blocked other tasks from writing responses.
- Configurable event queue capacity (`prelude.rs`, `mod.rs`): `DataQueue` previously had a hardcoded capacity of 300. Added `DataQueue::with_capacity()` and `CoreBuilder::with_event_queue_capacity()`. Default unchanged.
- Fix: `ReplNetworkConfig::Custom` `sync_wait_time` ignored (`replication.rs`): `sync_with_eventual_consistency` was always sleeping `SYNC_WAIT_TIME` (5 s) regardless of the user-supplied value. Now reads `sync_wait_time` from `self.config`, matching the pattern used for `data_aging_period` in the same function.
- `Core::shutdown()` (`mod.rs`): stores `AbortHandle`s for the two background tasks spawned by `build()` in a shared `Arc<Mutex<Vec<AbortHandle>>>`. `Core::shutdown()` aborts them, releasing the libp2p Swarm and its TCP/UDP listeners immediately, which enables port reuse in tests and graceful restarts. Tokio-runtime only, no behaviour change unless `shutdown()` is called.

Test plan

- `cargo check --features tokio-runtime` passes (verified clean)
- `with_network_timeout` not called → 30 s ceiling as before
- `with_event_queue_capacity` not called → queue capacity 300 as before
- `ReplNetworkConfig::Default` → `sync_wait_time` still 5 s
- `core.shutdown().await` after `GossipsubExitNetwork` and verify port is released for re-use

🤖 Generated with Claude Code
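For reviewers who want a concrete picture of the per-instance queue capacity, here is a minimal sketch of a bounded queue with a default of 300 and a `with_capacity` constructor. `BoundedQueue` is a hypothetical type, not the crate's `DataQueue`, and the drop-oldest behaviour when full is an assumption; the PR text does not say what `DataQueue` does at capacity.

```rust
use std::collections::VecDeque;

/// Default matching the previously hardcoded capacity mentioned in the PR.
const DEFAULT_CAPACITY: usize = 300;

/// Hypothetical bounded FIFO; dropping the oldest item when full is an assumed policy.
struct BoundedQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
}

impl<T> BoundedQueue<T> {
    /// Same behaviour as before the change: capacity 300.
    fn new() -> Self {
        Self::with_capacity(DEFAULT_CAPACITY)
    }

    /// Capacity chosen per instance at runtime.
    fn with_capacity(capacity: usize) -> Self {
        Self { items: VecDeque::with_capacity(capacity), capacity }
    }

    /// Append an item, evicting the oldest one if the queue is full.
    fn push(&mut self, item: T) {
        if self.capacity == 0 {
            return;
        }
        if self.items.len() == self.capacity {
            self.items.pop_front();
        }
        self.items.push_back(item);
    }

    /// Remove and return the oldest item, if any.
    fn pop(&mut self) -> Option<T> {
        self.items.pop_front()
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}

fn main() {
    let default_queue: BoundedQueue<u8> = BoundedQueue::new();
    println!("default capacity: {}", default_queue.capacity);

    // A small queue for a high-frequency producer: only the newest items survive.
    let mut q = BoundedQueue::with_capacity(2);
    for i in 0..5 {
        q.push(i);
    }
    println!("len = {}, oldest = {:?}", q.len(), q.pop());
}
```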