Skip to content

fix(network): add simultaneous-dial tie-breaking in SessionManager#3436

Open
peilun-conflux wants to merge 1 commit intoConflux-Chain:masterfrom
peilun-conflux:fix/network-simultaneous-dial-tie-breaking
Open

fix(network): add simultaneous-dial tie-breaking in SessionManager#3436
peilun-conflux wants to merge 1 commit intoConflux-Chain:masterfrom
peilun-conflux:fix/network-simultaneous-dial-tie-breaking

Conversation

@peilun-conflux
Copy link
Copy Markdown
Contributor

@peilun-conflux peilun-conflux commented Apr 10, 2026

Summary

When two nodes simultaneously dial each other, SessionManager::update_ingress_node_id had no tie-breaking: each side replaced its node_id_index entry and killed its own egress, so both TCP connections died.

This PR adds a deterministic tie-breaker: keep the connection initiated by the higher-NodeId peer. Both sides compute the same comparison and converge on the same surviving connection. The loser kills its own egress (existing Replaced path); the winner returns a new DropNew result and the new ingress is disconnected via Custom("simultaneous dial: drop new connection").

The originated flag is stored alongside the slab token in a new IndexEntry struct so the tie-breaker can read it without acquiring the per-session RwLock<Session> — which would deadlock against any thread holding Session::write() and waiting for node_id_index.write().

Follow-up to #3435, which fixes the complementary wrong-session-kill bug at the PoS protocol layer. Orthogonal, can be merged in either order.

Test

tests/network_tests/simultaneous_dial.py — runs 100 simultaneous-dial trials across two node pairs and asserts that "Remove old session from the same node" kills appear on at most one side of each pair. Reliably fails on master, passes with this fix. Opt-in (named without _test.py, skipped by test_all.py); run directly when needed.


This change is Reviewable

@peilun-conflux peilun-conflux force-pushed the fix/network-simultaneous-dial-tie-breaking branch 2 times, most recently from 03d7a44 to f2c871f Compare April 10, 2026 16:54
When two nodes simultaneously dial each other, the network layer's
`SessionManager::update_ingress_node_id` had no tie-breaking: each side
saw the incoming ingress as a duplicate of its own outbound,
unconditionally replaced the node_id_index entry, and killed its own
egress via `kill_connection_by_token`. Both sides did this in parallel,
so BOTH TCP connections died — the killed outbound on side A is the
killed inbound on side B, and vice versa.

This commit adds a deterministic tie-breaker inside
`update_ingress_node_id`. When an existing entry is an egress, both
sides compare own_node_id against remote_node_id the same way and
converge on the same surviving TCP connection: the connection initiated
by the higher-NodeId peer wins. The loser side kills its own egress;
the winner side returns DropNew and the new ingress is disconnected via
a Custom disconnect reason (the caller's existing Err path kills the
new session).

To avoid deadlocking on the per-session `RwLock<Session>` while holding
`node_id_index.write()`, the `originated` flag is now stored alongside
the slab token in a new `IndexEntry` struct. The tie-breaker reads it
from the index directly, not from the session lock.

Follow-up to Conflux-Chain#3435: Conflux-Chain#3435 fixes the PoS HotStuff handler's
`disconnect_peer(old_peer_id)` wrong-session kill at the protocol
layer; this PR fixes the complementary simultaneous-dial bug at the
network layer. The two bugs are orthogonal and both needed to fully
address the testnet stall.

Verified via a new Python integration test
(tests/network_tests/simultaneous_dial_test.py) that triggers
simultaneous dials on two node pairs in parallel and asserts that
"Remove old session from the same node" kills appear on at most one
side of each pair. The test reliably fails on master without this fix
(kills on both sides) and passes with this fix (kills only on the
loser side).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@peilun-conflux peilun-conflux force-pushed the fix/network-simultaneous-dial-tie-breaking branch from f2c871f to 740898d Compare April 10, 2026 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant