Rewrite network protocol/service to use channels by gterzian · Pull Request #1340 · paritytech/substrate

gterzian · 2018-12-31T04:07:30Z

Here is a work in progress/proof of concept for a slightly different approach to concurrency involving running components in their own threads, based on their own "event-loop", mainly consisting of receiving and handling messages sequentially.

The main benefit is that, since components are running in their own dedicated thread, there is no need for locking of any kind. Also, you're essentially building your own "runtime", and can enable components to run independently of each other, logically "in parallel".

Some examples:

Protocol is now running in a thread, handling ProtocolMsg, including the periodic Tick and PropagateExtrinsics. Protocol itself isn't shared, nor are its methods called, outside of its own thread. The result is this:
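A minimal sketch of what such a thread-owned event loop could look like, using only the standard library. The message variants and the fields of `Protocol` below are simplified stand-ins, not the actual types from this PR, and the helper names `start_protocol` and `start_ticker` are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Simplified stand-in for `ProtocolMsg`; variants and payloads are illustrative.
enum ProtocolMsg {
    Tick,
    PropagateExtrinsics,
    BlockImported(u64),
    Stop,
}

/// State owned by exactly one thread, so no lock is needed around it.
struct Protocol {
    ticks: u32,
    propagations: u32,
    best_block: u64,
}

impl Protocol {
    /// Handle one message; returns `false` when the loop should exit.
    fn handle_msg(&mut self, msg: ProtocolMsg) -> bool {
        match msg {
            ProtocolMsg::Tick => self.ticks += 1,
            ProtocolMsg::PropagateExtrinsics => self.propagations += 1,
            ProtocolMsg::BlockImported(n) => self.best_block = self.best_block.max(n),
            ProtocolMsg::Stop => return false,
        }
        true
    }
}

/// Spawn the protocol thread; callers only ever get a sender (hypothetical helper).
fn start_protocol() -> (Sender<ProtocolMsg>, JoinHandle<Protocol>) {
    let (sender, receiver): (Sender<ProtocolMsg>, Receiver<ProtocolMsg>) = channel();
    let handle = thread::spawn(move || {
        let mut protocol = Protocol { ticks: 0, propagations: 0, best_block: 0 };
        // The "event-loop": receive and handle messages sequentially.
        while let Ok(msg) = receiver.recv() {
            if !protocol.handle_msg(msg) {
                break;
            }
        }
        protocol
    });
    (sender, handle)
}

/// Periodic work becomes just another message, sent by a small timer thread.
fn start_ticker(sender: Sender<ProtocolMsg>, every: Duration, count: u32) -> JoinHandle<()> {
    thread::spawn(move || {
        for _ in 0..count {
            thread::sleep(every);
            if sender.send(ProtocolMsg::Tick).is_err() {
                break;
            }
        }
    })
}
```

Since all state changes happen inside `handle_msg` on one thread, the logic stays sequential even though any number of other threads may hold a sender.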

The network Service now contains senders to both Protocol and NetworkService(libp2p), and instead of calling methods on those, it sends messages to be handled on their respective event-loops.

The libp2p Service is still polled by the tokio runtime. However, it isn't shared with other components via SyncIo anymore. Instead, it receives messages from Protocol (and others) via a channel, and those messages are handled inside poll. So while this still seems to require locking Service before doing a poll, that seems to be the only case left (previously it would be locked at each call to SyncIo). The handling of ServiceEvent, enqueued on a stream by the network service, is still done on the tokio runtime as well; however, it now only consists of sending messages to Protocol and, in the case of reporting a peer, back to the network service. This could be seen as a "bridge" between the tokio runtime/libp2p and the protocol.
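To illustrate the "messages are handled inside poll" part, here is a rough sketch that drains a channel without blocking at the start of a poll. It deliberately leaves out tokio and the task wake-up; `Service`, its fields, and the `NetworkMsg` payloads are simplified assumptions rather than the real libp2p service.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};

/// Simplified stand-in for `NetworkMsg`; payloads are illustrative.
enum NetworkMsg {
    SendCustomMessage(usize, Vec<u8>),
    ReportPeer(usize),
}

/// Stand-in for the network service, which owns the receiving end.
struct Service {
    receiver: Receiver<NetworkMsg>,
    outgoing: Vec<(usize, Vec<u8>)>,
    reported: Vec<usize>,
}

impl Service {
    fn new() -> (Sender<NetworkMsg>, Service) {
        let (sender, receiver) = channel();
        let service = Service { receiver, outgoing: Vec::new(), reported: Vec::new() };
        (sender, service)
    }

    /// Drain pending messages without blocking; returns how many were handled.
    fn poll(&mut self) -> usize {
        let mut handled = 0;
        loop {
            match self.receiver.try_recv() {
                Ok(NetworkMsg::SendCustomMessage(peer, data)) => self.outgoing.push((peer, data)),
                Ok(NetworkMsg::ReportPeer(peer)) => self.reported.push(peer),
                // Nothing left to do for now (or all senders are gone).
                Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
            }
            handled += 1;
        }
        // A real service would go on to poll the underlying network here.
        handled
    }
}
```

Because the service is already exclusively borrowed for the duration of `poll`, handling the messages there needs no additional locking.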

Note that net_sync above is essentially a wrapper around the network service itself. So previously not only was protocol "running" on the same thread as the network service, but the network service was then also shared with protocol through net_sync (implementing SyncIo). Now, the network service and protocol both run independently of each other. Protocol can send messages to the network service, and the network service communicates back via incoming NetworkServiceEvent. The handling of NetworkServiceEvent, still on the tokio runtime, then acts as a thin layer of glue between the two (which prevents the need for the network service to depend on anything from the protocol).

The NetworkLink, shared with the import queue, is now also just a wrapper around a network and protocol sender. So instead of calling methods, which required locking, it now sends messages, to be handled on the respective event-loops of protocol or the network service.

Note that the two sends are non-blocking, meaning the import_queue can go back to work right after (which would be especially relevant if the import queue were to be an independently running component, as is done in #1327), and both the network and protocol will handle these messages independently.
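As a sketch of the idea, a link of this kind could be little more than two senders. The method name `useless_peer_and_restart`, the helper `new_link`, and the message variants below are hypothetical and only serve to show two non-blocking sends in a row.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Illustrative message types; the real enums in the PR have more variants.
#[derive(Debug, PartialEq)]
enum ProtocolMsg {
    BlockImported(u64),
    RestartSync,
}

#[derive(Debug, PartialEq)]
enum NetworkMsg {
    ReportPeer(usize, String),
}

/// A link that only holds senders: cheap to clone, nothing to lock.
#[derive(Clone)]
struct NetworkLink {
    protocol_sender: Sender<ProtocolMsg>,
    network_sender: Sender<NetworkMsg>,
}

impl NetworkLink {
    fn block_imported(&self, number: u64) {
        // Sending on an unbounded std channel never blocks the caller.
        let _ = self.protocol_sender.send(ProtocolMsg::BlockImported(number));
    }

    /// Hypothetical method showing two independent, non-blocking sends.
    fn useless_peer_and_restart(&self, who: usize, reason: &str) {
        let _ = self.network_sender.send(NetworkMsg::ReportPeer(who, reason.to_string()));
        let _ = self.protocol_sender.send(ProtocolMsg::RestartSync);
    }
}

/// Build a link together with the receiving ends (hypothetical helper).
fn new_link() -> (NetworkLink, Receiver<ProtocolMsg>, Receiver<NetworkMsg>) {
    let (protocol_sender, protocol_port) = channel();
    let (network_sender, network_port) = channel();
    (NetworkLink { protocol_sender, network_sender }, protocol_port, network_port)
}
```

The caller returns as soon as both messages are queued; when each one is handled is up to the receiving event loop.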

As a general note, this not only removes locks (which makes the logic of Protocol really just single-threaded sequential stuff), it also removes dependencies between the various "components" in your system, in various ways:

At the level of the "generic" signature of structs:

Service itself has nothing to do with Specialization or ExHashT, these are related to Protocol, yet because service contained a protocol, these needed to be included in the signature.

At the level of the "parallelism" of the various components.

remove_reserved_peer previously required locking the network service (meaning it couldn't be polled in the meantime), and the method call on protocol also required various locks, some of which could block the import-queue, while others would block the network service (each method call on SyncIo would lock the network service, meaning it couldn't be polled in the meantime).

When is on_peer_disconnected called now, since it's not done in remove_reserved_peer anymore?

When the network service receives the RemoveReservedPeer message, it will enqueue a new DisconnectNode event containing info about nodes to be disconnected.

This event will later be handled, and result in a message sent to protocol.

This results in protocol removing nodes from its own state, and network removing reserved peers, with neither operation blocking the other. This setup could also further free up the import-queue, since the shared dependency/lock on ChainSync is removed.
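The flow described above can be sketched in three small steps. For brevity the sketch runs on one thread and uses a plain queue for events; in the PR each step runs on its own event loop. All type and function names (`bridge`, `PeerDisconnected`, the fields) are assumptions made for illustration.

```rust
use std::collections::{HashSet, VecDeque};

enum NetworkMsg {
    RemoveReservedPeer(usize),
}

enum ServiceEvent {
    DisconnectNode(usize),
}

enum ProtocolMsg {
    PeerDisconnected(usize),
}

/// Stand-in for the network service state.
struct NetworkService {
    reserved: HashSet<usize>,
    events: VecDeque<ServiceEvent>,
}

impl NetworkService {
    /// Step 1: the network service updates its own state and enqueues an event.
    fn handle_msg(&mut self, msg: NetworkMsg) {
        match msg {
            NetworkMsg::RemoveReservedPeer(who) => {
                if self.reserved.remove(&who) {
                    self.events.push_back(ServiceEvent::DisconnectNode(who));
                }
            }
        }
    }
}

/// Step 2: the "bridge" turns a service event into a message for protocol.
fn bridge(event: ServiceEvent) -> ProtocolMsg {
    match event {
        ServiceEvent::DisconnectNode(who) => ProtocolMsg::PeerDisconnected(who),
    }
}

/// Stand-in for the protocol state.
struct Protocol {
    peers: HashSet<usize>,
}

impl Protocol {
    /// Step 3: protocol updates its own state when the message arrives.
    fn handle_msg(&mut self, msg: ProtocolMsg) {
        match msg {
            ProtocolMsg::PeerDisconnected(who) => {
                self.peers.remove(&who);
            }
        }
    }
}
```

Between step 1 and step 3 the two components briefly disagree about the peer, which is the price of neither one waiting on the other.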

Finally, if you absolutely must block one component, while waiting for an "answer" from another one, this can be implemented like this:
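A minimal sketch of such a blocking request, assuming a reply sender is carried inside the message. `start_network` and `peer_id_blocking` are hypothetical helpers, and the peer table is a plain map used only for the example.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

enum NetworkMsg {
    /// The second field is the channel on which the answer is sent back.
    GetPeerId(usize, Sender<Option<String>>),
    Stop,
}

/// Spawn a stand-in network component answering `GetPeerId` requests.
fn start_network(peers: HashMap<usize, String>) -> (Sender<NetworkMsg>, JoinHandle<()>) {
    let (sender, receiver) = channel();
    let handle = thread::spawn(move || {
        while let Ok(msg) = receiver.recv() {
            match msg {
                NetworkMsg::GetPeerId(who, reply) => {
                    // The requester may have gone away; ignore a failed send.
                    let _ = reply.send(peers.get(&who).cloned());
                }
                NetworkMsg::Stop => break,
            }
        }
    });
    (sender, handle)
}

/// Send the request, then block the calling thread until the answer arrives.
fn peer_id_blocking(network: &Sender<NetworkMsg>, who: usize) -> Option<String> {
    let (reply_sender, reply_port) = channel();
    network.send(NetworkMsg::GetPeerId(who, reply_sender)).ok()?;
    // `recv` fails only if the network thread dropped the reply sender.
    reply_port.recv().ok().flatten()
}
```

Only the requesting thread blocks; the component answering the request keeps running its loop as usual.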

The tests are going to have to be rewritten for them to pass, however I first wanted to get your point of view on this approach.

Note that when I mention "parallelism", I'm thinking less in terms of performance and more in terms of logic (obviously, the actual parallelism will be limited by the physical reality of the machine the code is running on). So it's less about boosting performance, more about ensuring components are executing independently. It might bring some performance bonus as well, but that's not the goal.

cc @andresilva @tomusdrw This is a concrete example of some of the stuff I mentioned in relation to "actors"...

parity-cla-bot · 2018-12-31T04:07:34Z

It looks like @gterzian signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

gterzian · 2019-01-11T06:38:51Z

Ok this one is now ready for review. @tomaka ?

I've added one commit that introduces a basic back pressure mechanism between libp2p and protocol (since the protocol operations are now in their own thread, the events from libp2p could in theory otherwise pile up in the unbounded channel used to communicate with protocol).
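The commit's actual mechanism isn't reproduced here. As one generic illustration of back pressure between a producer and a slower consumer, the sketch below uses a bounded `sync_channel` and stops forwarding as soon as the consumer's queue is full. The function `forward_events` and the `u32` event type are assumptions made for the example.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Forward as many pending events as the consumer has room for.
/// Returns the number forwarded; the rest stay queued on the producer side.
fn forward_events(pending: &mut VecDeque<u32>, to_protocol: &SyncSender<u32>) -> usize {
    let mut forwarded = 0;
    while let Some(event) = pending.pop_front() {
        match to_protocol.try_send(event) {
            Ok(()) => forwarded += 1,
            Err(TrySendError::Full(event)) => {
                // Consumer is behind: put the event back and stop for now.
                pending.push_front(event);
                break;
            }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    forwarded
}

/// Build a bounded channel with room for `capacity` in-flight events.
fn bounded_protocol_channel(capacity: usize) -> (SyncSender<u32>, Receiver<u32>) {
    sync_channel(capacity)
}
```

With a capacity of 2 and five pending events, the first call forwards two and leaves three queued; once the consumer has received one, the next call forwards exactly one more.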

Please note that in a few files with "lots" of changes, and because my editor was set to indent with spaces, I ended up running cargo fmt with hard_tabs = true in order to remove the spaces. This introduced a few extra changes, which I hope is ok.

tomaka · 2019-01-11T09:38:34Z

After reading the diff, I think I misunderstood your description.
To me the whole fact that we're using channels and separate threads is a good thing, but should be an implementation detail hidden deep in the code, and not exposed in the API at all.

I don't really see the point of exposing a channel in the API of network-libp2p, considering that it's network that manages the thread. It could simply be the network crate that polls the channel in its thread, and passes on the messages to network-libp2p, and this is in fact already more or less what is happening.

gterzian · 2019-01-11T10:12:24Z

> using channels and separate threads is a good thing, but should be an implementation detail hidden deep in the code, and not exposed in the API at all.

@tomaka I can re-introduce the SyncIo and make network use that instead of a channel directly.

We could also move the equivalent of SyncIo to network-libp2p(what I removed was actually in network), and completely hide the use of NetworkMsg from network.

Would that be better?

> I don't really see the point of exposing a channel in the API of network-libp2p, considering that it's network that manages the thread.

network manages the thread that receives ProtocolMsg, sent by network-libp2p from here.

network-libp2p receives NetworkMsg while running on Tokio. Currently, those messages are received, non-blockingly, and handled, as part of poll.

So network "owns" a sender to send NetworkMsg (which could be wrapped inside SyncIo), and a receiver to receive and handle ProtocolMsg, in its own thread.

network-libp2p is the mirror of that, it owns a sender to send ProtocolMsg, and a receiver to receive NetworkMsg in "its own thread", a tokio runtime actually.

The nice thing is that since the ProtocolMsg are sent from here, in response to incoming ServiceEvent, network-libp2p doesn't need to send ProtocolMsg directly.

This use of a stream of events is really related to network-libp2p running on tokio, yet on the other side we could indeed make network use a trait from network-libp2p, with only the implementation sending NetworkMsg, which would equally hide the messaging from network.

tomaka · 2019-01-11T10:57:10Z

What I'm suggesting is to instead move handle_protocol_messages in the network crate and call it from here.

network-libp2p doesn't own any thread or events loop.
The API surface of libp2p's NetworkService is poll() and send_custom_message(), and a few other methods. This makes its internal state very coherent, easy to figure out, and easy to test (although it doesn't have any test at the moment).

For example if deny_unreserved_peers() returns a node ID, then you know we were connected to it. There is no "later" or "was maybe connected". We were connected to it, and the only way we can move to the disconnected state is by calling one of the methods of the service (like poll()). The API of NetworkService is a state machine, and its state can only be updated by calling methods and doesn't automatically change over time.

Sure, internally, libp2p uses sub-tasks so that the network connections are actually asynchronous. And internally libp2p is quite similar to an actor model. However having a simple synchronous API exposed is a very nice property in my opinion.

On the other hand network already has a lot of internal asynchronicity to handle, and I'm not sure about introducing a spaghetti of events into network-libp2p when to me it looks like this belongs to network.

gterzian · 2019-01-11T13:05:14Z

> For example if deny_unreserved_peers() returns a node ID, then you know we were connected to it. There is no "later" or "was maybe connected". We were connected to it, and the only way we can move to the disconnected state is by calling one of the methods of the service.

In this proposal, you get exactly the same behavior from the perspective of libp2p, the async-ness is only introduced with regards to sending the message, not handling it.

When libp2p handles a NetworkMsg::DenyUnresservedPeers message, it will call its own deny_unreserved_peers method in an absolutely sync way.

The question is whether network cares about whether the node is connected when it sends the message. It appears to me that it doesn't. If libp2p disconnects a node, it will enqueue an event back for network to handle, and disconnect the peer internally if necessary.

This particular case actually requires the "node id" of a disconnected node to be communicated back to network, but most operations, like sending a custom message over the network, do not require any immediate response. What benefit do we get from waiting on a lock to be able to call send_custom_message?

> However having a simple synchronous API exposed is a very nice property in my opinion.

If we're using threads, perhaps we might as well go for as much parallelism as the integrity of the system can tolerate, and only do things synchronously when necessary?

If a component absolutely needs to do a state transition in a "sync" way, that can be dealt with using a reply channel sent on the original message (example).

Also, is the current "synchronous API" really that simple? What about the various concurrent components competing to acquire locks?

> What I'm suggesting is to instead move handle_protocol_messages in the network crate and call it from here.

I actually did handle the messages at the place you suggest in an earlier version of this PR, and I moved the message handling to poll for a couple of reasons:

Handling the messages in stream::poll_fn().for_each instead of just in poll requires locking for each method call, so it can't be done in parallel with poll and might as well be done together with it (any method call inside poll doesn't require locking, since the whole thing is locked in order to do a poll).
If you do the message handling in stream::poll_fn().for_each, you end up with several clones like let network_service2 = network_service.clone();. Now we're actually able to remove the let network_service2 = network_service.clone(); which was previously used by SyncIo, since all we need is one handle to do a network_service.lock().poll() (see the previous code).

rphmeier · 2019-02-05T18:50:31Z

 		let extrinsics = self.transaction_pool.transactions();
-
+		// TODO: see if self.send_message can be made &self,
+		// so that this vec would become uncessary.


TODO should be linked to issue

i don't think that's the only problem; even if self.send_message took &self, we would still be invoking it with self.context_data.peers mutably borrowed (not allowed). There is a free function send_message you can try, but it needs an &mut peers.

there are a couple other similar TODOs, they should be addressed the same way.
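For readers unfamiliar with the borrow problem being discussed, here is a small sketch of the "collect into a vec first" pattern the TODO refers to. The types and fields are simplified assumptions; only the shape of the borrow conflict mirrors the discussion.

```rust
use std::collections::{HashMap, HashSet};

struct Peer {
    known_extrinsics: HashSet<u32>,
}

struct Protocol {
    peers: HashMap<usize, Peer>,
    sent: Vec<(usize, u32)>,
}

impl Protocol {
    /// Reads `self.peers`, so it cannot be called while `self.peers`
    /// is being iterated mutably.
    fn send_message(&mut self, who: usize, extrinsic: u32) {
        if self.peers.contains_key(&who) {
            self.sent.push((who, extrinsic));
        }
    }

    fn propagate_extrinsics(&mut self, extrinsics: &[u32]) {
        // First pass: mutate peers and remember what needs sending.
        let mut to_send = Vec::new();
        for (who, peer) in self.peers.iter_mut() {
            for extrinsic in extrinsics {
                // `insert` returns true only if the peer didn't know it yet.
                if peer.known_extrinsics.insert(*extrinsic) {
                    to_send.push((*who, *extrinsic));
                }
            }
        }
        // Second pass: the mutable borrow of `self.peers` has ended here.
        for (who, extrinsic) in to_send {
            self.send_message(who, extrinsic);
        }
    }
}
```

Calling `self.send_message` inside the first loop would be rejected by the borrow checker, which is why the intermediate vector exists.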

rphmeier · 2019-02-05T18:51:37Z

+				let _ = self
+					.network_chan
+					.send(NetworkMsg::GetPeerId(who.clone(), sender));
+				let node_id = port.recv().expect("Failed to receive GetPeerId response");


TBH i don't like blocking on waiting for messages here. The expectation is not really a convincing argument of impossibility either.

I also don't like the blocking here, but I am less concerned about it than in the futures case discussed previously, because this only blocks the Protocol thread and nothing else.

It would of course be better to replace this with a non-blocking workflow, and I believe @tomaka has various plans to move some logic currently here into network-libp2p, so what is going on here is perhaps a good candidate for that...

I have updated the expectation.

rphmeier · 2019-02-05T18:53:59Z

+	who: NodeIndex,
+	mut message: Message<B>,
+) {
+	match &mut message {


why match &mut message? seems that match message will do.

rphmeier · 2019-02-05T18:57:17Z

 	}

+	/// Get a clone of the channel to network/libp2p.
+	pub fn network_sender(&self) -> NetworkChan {


doesn't that effectively expose all internal APIs?

rphmeier · 2019-02-05T18:58:00Z

-		self.handler.on_block_imported(&mut NetSyncIo::new(&self.network, self.protocol_id), hash, header)
+		let _ = self
+			.protocol_sender
+			.send(ProtocolMsg::BlockImported(hash, header.clone()));


I wonder if these functions should be changed to take the header by-value explicitly so the clone isn't hidden.

rphmeier · 2019-02-05T18:59:27Z

+impl NetworkChan {
+	/// Create a new network chan.
+	pub fn new(sender: Sender<NetworkMsg>, task_notify: Arc<AtomicTask>) -> Self {
+		Self {


style: using Self like this is weird.

rphmeier · 2019-02-05T19:03:15Z

-				protocol.on_clogged_peer(&mut net_sync, node_index,
-					messages.iter().map(|d| d.as_ref()));
+				debug!(target: "sync", "{} clogging messages:", messages.len());
+				for msg_bytes in messages.iter().take(5) {


why take 5?

This code is from master at

substrate/core/network/src/protocol.rs

Line 436 in d0f824f

for msg_bytes in clogging_messages.take(5) {

@tomaka would know the reason for it...

I didn't put this. I imagine someone did because it was spamming the logs too much. It's a bit stupid as the whole point of dumping all the messages was to see the frequency of each one.

gterzian · 2019-02-06T07:05:22Z

@rphmeier Thanks for the review, I think all your comments have been addressed...


andresilva · 2019-02-06T14:24:02Z

FWIW I tested this on polkadot by syncing the alexander chain from scratch, worked fine. 👍

gterzian · 2019-02-06T14:47:45Z

@andresilva thank you!

gterzian added the A3-in_progress Pull request is in progress. No review needed at this stage. label Dec 31, 2018

gterzian force-pushed the rewrite_protocol_to_use_channels branch from 03c3504 to 3f82e52 Compare January 2, 2019 11:00

gterzian force-pushed the rewrite_protocol_to_use_channels branch 10 times, most recently from 0f956fd to 7f434bc Compare January 11, 2019 05:53

gterzian added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Jan 11, 2019

gterzian force-pushed the rewrite_protocol_to_use_channels branch 5 times, most recently from d0c9630 to f9f4198 Compare January 11, 2019 06:15

gterzian force-pushed the rewrite_protocol_to_use_channels branch 4 times, most recently from 029c255 to d153086 Compare January 11, 2019 06:57

gterzian changed the title ~~[WIP] Rewrite network protocol/service to use channels~~ Rewrite network protocol/service to use channels Jan 11, 2019

rphmeier reviewed Feb 5, 2019


gterzian mentioned this pull request Feb 6, 2019

Network: improve code that sends message to multiple peers #1698

Closed

gterzian added 12 commits February 6, 2019 14:00

rename job to task

9e61153

style: re-add comma

9efd1ee

remove extra string allocs

90f400a

rename use of channel

eda969f

turn TODO into FIXME

4271a14

remove mut in match

dc58d81

remove Self in new

575508f

pass headers by value to network service

69ab201

remove network sender from service

d9e89eb

remove TODO

663d736

better expect

9f63fbe

rationalize use of network sender in ondemand

0e82f19

bkchr added 2 commits February 6, 2019 11:48

Merge remote-tracking branch 'upstream/master' into rewrite_protocol_…

03bf5ae

…to_use_channels

Merge remote-tracking branch 'upstream/master' into rewrite_protocol_…

b95cc87

…to_use_channels

andresilva mentioned this pull request Feb 6, 2019

core: network: fix sync on testing network #1713

Merged

This was referenced Feb 8, 2019

update to latest substrate - protocol API update paritytech/polkadot#130

Merged

Remove wait on future in network bridge #1765

Merged

andresilva mentioned this pull request Feb 13, 2019

core: grandpa: collect garbage for topic #1780

Merged

gterzian mentioned this pull request May 8, 2019

Remove with_gossip #2500

Closed
