fix(network): add simultaneous-dial tie-breaking in SessionManager#3436
Open
peilun-conflux wants to merge 1 commit intoConflux-Chain:masterfrom
Open
Conversation
03d7a44 to
f2c871f
Compare
When two nodes simultaneously dial each other, the network layer's `SessionManager::update_ingress_node_id` had no tie-breaking: each side saw the incoming ingress as a duplicate of its own outbound, unconditionally replaced the node_id_index entry, and killed its own egress via `kill_connection_by_token`. Both sides did this in parallel, so BOTH TCP connections died — the killed outbound on side A is the killed inbound on side B, and vice versa. This commit adds a deterministic tie-breaker inside `update_ingress_node_id`. When an existing entry is an egress, both sides compare own_node_id against remote_node_id the same way and converge on the same surviving TCP connection: the connection initiated by the higher-NodeId peer wins. The loser side kills its own egress; the winner side returns DropNew and the new ingress is disconnected via a Custom disconnect reason (the caller's existing Err path kills the new session). To avoid deadlocking on the per-session `RwLock<Session>` while holding `node_id_index.write()`, the `originated` flag is now stored alongside the slab token in a new `IndexEntry` struct. The tie-breaker reads it from the index directly, not from the session lock. Follow-up to Conflux-Chain#3435: Conflux-Chain#3435 fixes the PoS HotStuff handler's `disconnect_peer(old_peer_id)` wrong-session kill at the protocol layer; this PR fixes the complementary simultaneous-dial bug at the network layer. The two bugs are orthogonal and both needed to fully address the testnet stall. Verified via a new Python integration test (tests/network_tests/simultaneous_dial_test.py) that triggers simultaneous dials on two node pairs in parallel and asserts that "Remove old session from the same node" kills appear on at most one side of each pair. The test reliably fails on master without this fix (kills on both sides) and passes with this fix (kills only on the loser side). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f2c871f to
740898d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When two nodes simultaneously dial each other,
SessionManager::update_ingress_node_idhad no tie-breaking: each side replaced itsnode_id_indexentry and killed its own egress, so both TCP connections died.This PR adds a deterministic tie-breaker: keep the connection initiated by the higher-
NodeIdpeer. Both sides compute the same comparison and converge on the same surviving connection. The loser kills its own egress (existingReplacedpath); the winner returns a newDropNewresult and the new ingress is disconnected viaCustom("simultaneous dial: drop new connection").The
originatedflag is stored alongside the slab token in a newIndexEntrystruct so the tie-breaker can read it without acquiring the per-sessionRwLock<Session>— which would deadlock against any thread holdingSession::write()and waiting fornode_id_index.write().Follow-up to #3435, which fixes the complementary wrong-session-kill bug at the PoS protocol layer. Orthogonal, can be merged in either order.
Test
tests/network_tests/simultaneous_dial.py— runs 100 simultaneous-dial trials across two node pairs and asserts that"Remove old session from the same node"kills appear on at most one side of each pair. Reliably fails on master, passes with this fix. Opt-in (named without_test.py, skipped bytest_all.py); run directly when needed.This change is