Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Remove the NetworkChan#2577

Merged
tomaka merged 10 commits into
paritytech:masterfrom
tomaka:network-chan-rm
May 21, 2019
Merged

Remove the NetworkChan#2577
tomaka merged 10 commits into
paritytech:masterfrom
tomaka:network-chan-rm

Conversation

@tomaka

@tomaka tomaka commented May 14, 2019

Copy link
Copy Markdown
Contributor
  • protocol.rs now communicates back its intentions towards the network through a NetworkOut trait that is passed whenever a method is called.
  • Removes the NetworkChan type altogether, and replaces it internally with an UnboundedSender from the futures crate.

After that, the codes of Protocol and Service are much already much cleaner in my opinion. There are a lot of small adjustments to make that would be off-topic for this PR, so I'll do them on top.

This PR really makes the code of the test helpers much more dirty (it already was), but the next step will be to rework that.

@tomaka tomaka added A0-please_review Pull request needs code review. B2-breaksapi labels May 14, 2019
@tomaka tomaka requested review from arkpar and gterzian May 14, 2019 12:45
@tomaka tomaka added the I7-refactor Code needs refactoring. label May 14, 2019
@gavofyork gavofyork added A3-needsresolving and removed A0-please_review Pull request needs code review. I7-refactor Code needs refactoring. labels May 14, 2019
@tomaka tomaka force-pushed the network-chan-rm branch from ebcbbbc to 6768549 Compare May 14, 2019 14:34
@tomaka tomaka added A0-please_review Pull request needs code review. and removed A3-needsresolving labels May 14, 2019

@DemiMarie-temp DemiMarie-temp left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI is failing, but otherwise the code looks good.

Comment thread core/network/src/protocol.rs Outdated
type Error = void::Void;

fn poll(&mut self) -> Poll<Self::Item, Self::Error> {
pub fn poll(&mut self, network_out: &mut dyn NetworkOut<B>) -> Poll<(), void::Void> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always returns Async::NotReady. If this is deliberate, we might be able to return Poll<void::Void, void::Void> here.

let outcome = match poll_value {
Ok(Async::NotReady) => break,
Ok(Async::Ready(Some(NetworkServiceEvent::OpenedCustomProtocol { peer_id, version, debug_info, .. }))) => {
debug_assert!(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this assert should either be on in release builds, or removed entirely.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're kind of doing defensive programming. We don't want release builds to panic, ever, but debug_asserts are fine.

Ok(NetworkMsg::Synchronized) => return Ok(()),
Err(error) => return Err(error),
Ok(msg) => self.buffered_messages.lock().push_back(msg),
pub fn wait_sync(&self) -> Result<(), ()> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this would return a future, but I would not consider that urgent.

@tomaka

tomaka commented May 14, 2019

Copy link
Copy Markdown
Contributor Author

The CI is failing because it turns out that one of the aura tests uses the network testnet helper from within a tokio runtime.

I will have to do some cleanup in the network tests for this to work.

@tomaka tomaka added A3-in_progress Pull request is in progress. No review needed at this stage. and removed A0-please_review Pull request needs code review. labels May 14, 2019

@gterzian gterzian left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks correct, and I have some questions regarding the overall direction:

  1. Is the idea to fully merge protocol and the libp2p network service over time? I'm asking because like I noted in earlier changes, this is moving into a direction where it would become increasingly harder to run those two independently, either as tasks or threads. My impression was that protocol was a layer between the libp2p service and consensus, that was still independent from both and could benefit from running independently.
  2. If the answer to 1 is "yes, we want to merge these two concepts", than is the plan to take that a step further and basically put protocol inside the network service and remove this run_thread function? I'm asking because run_thread, is essentially sequential code running inside a single task. There is a lot to be said for consolidating logic to within a single task, however only if we want to merge protocol with the libp2p service, and in that case the run_thread function could be futher simplified by just doing everything inside the poll function of the network service, which could poll the network_rx, call methods of protocol, maybe still using something like the NetworkOut introduced, or perhaps the various parts of protocol could just live directly on the network service.
    Another way to say this is that run_thread basically is already a merged protocol and network service, but consisting of various back and forth with polling various things and using NetworkOut. Could all that not be hidden in a single struct implementing future or stream?
  3. Previously NetworkChan was used both by a protocol running in it's own thread(later task), and the Service that is shared with other parts of the system. This had the benefit that all "network operations" went through a single channel. Now protocol is effectively calling methods of the libp2p network service "directly" through NetworkOut from the same task. This potentially changes the ordering of operations across the system in subtle ways.

An example:

  1. Import-queue does a request_justification, this enqueues a message on the protocol_sender.
  2. Inside run_thread, we do a protocol_rx.poll() and handle the message, resulting in a call to protocol.request_justification.
  3. In some cases this then results in a call to network_out.send_message.
  4. Previously, the send_message was implemented by queueing a message onto the NetworkChan.
  5. Currently, a method on the network service will be called.
  6. The difference is essentially that previously the send_message would only be handled by the network service when it handled the message enqueued inside send_message, and only after having handled any other messages that would have been enqueued "earlier" onto the NetworkChan. Now, the send_message will be handled immediately after protocol has handled the initial message ProtocolMsg.
  7. In other words, any send_message will be done "sync" during the handling of the ProtocolMsg::RequestJustification, while previously will be done "async" via the handling of a subsequent message, enqueued as a result of handling the ProtocolMsg::RequestJustification.

On the other hand, when the import queue does something like report_peer, it still goes over a channel, which previously was NetworkChan.

So NetworkChan was not just an implementation detail, it was also a way to consolidate all network operations requested by non-networking components into a single message queue.

The subtle changes in ordering of operation due to some now being sync method calls, versus async message handling, might not matter, or be the intention, and I would just like to make sure they are considered(they might not otherwise become obvious until later, since the test-suite might not catch them).

This obviously goes back to the whole "do we want to run protocol and networking in the same task/thread?".

At this point, it appears to me that one part of the system, the import-queue and so on, will always live in a different thread, and communicate in an async way back with protocol and/or networking. Previously protocol and networking also ran in separate threads, and communicated in an async way.

Now we're moving in a direction of merging protocol with networking, and running them fully in sync, yet the other part involving things like the import-queue is still async. So moving one part of the system into the same task and making everything sync could be the right thing to do, and I think it should be discussed explicitly, since it will influence the ordering of operations coming from other, still async, parts.

Also, by merging protocol and network into one task, we lose all opportunities for parallelism between the two.

For example, do we really want all the stuff happening in ChainSync to always fully block networking, and vice-versa? I was under the impression that there is quite a lot of logic being run, and that it might be best to separate those two components, or at least retaining an easy option of separating them later.

In any case, I'd like to see an explicit discussion about how we intend to "run" protocol, networking, and the other components communicating with them, that addresses the concurrency of the system(or lack thereof). Otherwise we run the risk of doing everything in a single task, which in the immediate might appear "simpler", and then later realizing that we'd like to get some parallelism between various operations, which would then be much harder to achieve.

@tomaka

tomaka commented May 15, 2019

Copy link
Copy Markdown
Contributor Author
  1. For the libp2p service and the protocol in particular, the answer is yes. The libp2p service does almost nothing. It reads events from a channel (that's in libp2p's code), eventually updates some local state (libp2p and network-libp2p), then sends messages to other channels (in network::service.rs). Its CPU usage is minimal, and to me it doesn't really make sense to me to run this as its own thread.

However it is clearly not my intention to make them impossible to split.
My intention is that instead of having, say, structs A, B and C that expose an asynchronous API, you would split each of them between Async, Aasync, Bsync, Basync, and Csync, CAsync. Then you leave Async, Bsync and Csync in their own module with their synchronous API, and you move Aasync, Basync and Casync to a single module and handle all the asynchronicity there.
The logic would remain in Async, Bsync and Csync, while the additional complexity introduced by concurrency would be contained in a single module that would be responsible for figuring out things such as ordering.

That's what I'm doing for protocol.rs: I'm basically splitting it in two: ProtocolSync (whose name stays Protocol), and ProtocolAsync (which is unnamed in the code but more or less consists in the run_thread function) that resides in service.rs.

Another big advantage of this, which is a bit tricky to illustrate, is that communications have a more precise meaning. By having services directly communicate to each other, you sometimes have to pass some context which would be out of scope of any of the two services individually.
For example right now the import queue API has an import_block method. If you move all the asynchronicity at the same place, you can easily know which blocks came from the networking and which blocks came from mining, and can for example prioritize the blocks coming from the mining. Contrast this with having to pass some sort of "priority level value" to the networking and the mining codes.

  1. Yes, it is indeed my intention to entirely remove run_thread and move what it's doing to a poll() method that could spawn background tasks. That's more of a long term objective for Add a light client code that compiles in the browser #2416 .

  2. Yes I've noticed that the ordering could potentially change here. In practice this only concerns the OnDemandService, as it is the only object that still indirectly uses the channel. Its requests could be delayed a bit compared to now. In practice it doesn't matter anyway, because: 1- on the network we only send requests that are all independent of each other, 2- the OnDemandService is running in a different thread than the networking, and therefore ordering wasn't guaranteed before anyway.

However note that this kind of tricky situations is exactly what I want to avoid by reworking the concurrency story of the network. I totally agree that it can indeed lead to subtle bugs, and more determinism not only makes things more simple to reason about, but also makes it easier to detect and reproduce bugs.

For example, do we really want all the stuff happening in ChainSync to always fully block networking, and vice-versa?

If something expensive is happening in ChainSync, then that specific thing should be made asynchronous and not ChainSync as a whole.
For example, right now importing a chunk of blocks can take up to 30 seconds in debug mode (for a reason I have yet to examine). If everything was put into a single task/thread, then indeed this would block the entire networking for 30 seconds.
However, note that even if ChainSync is in its own dedicated task, importing a chunk of blocks would still block any incoming request for 30 seconds. The solution to me is to specifically extract the thing that takes a lot of CPU, and not to make the entire ChainSync in its own task.

In any case, I'd like to see an explicit discussion about how we intend to "run" protocol, networking, and the other components communicating with them, that addresses the concurrency of the system(or lack thereof). Otherwise we run the risk of doing everything in a single task, which in the immediate might appear "simpler", and then later realizing that we'd like to get some parallelism between various operations, which would then be much harder to achieve.

First of all, if we have to choose between "simple and one task" and "hard and multithreaded", I would argue that we should lean towards "hard and multithreaded" only if there's a benefit to multithreading things. Otherwise we run into premature optimizations. I clearly think that we can "asyncify" things later on top of synchronous API (with the Async/Aasync splitting thingy), without having to take decisions too far ahead.

As for a precise answer, the way I see it in the very long term is:

  • Each TCP connection has its own dedicated task (that's already the case in libp2p's code, but hidden beneath the API) that handle parsing networking messages.
  • The protocol/network task gathers messages received by these per-connection tasks, handles all the logic related to protocols (ie. if we receive an answer), and generates messages for the consensus, the import queue, and requests for the client. This would also contain the on-demand service.
  • The client has its own task, and asynchronously handles all database read/write requests. For example it would read a request for the headers of blocks 8022 to 8290, fetch them from its database, and answer. For the light client, it optionally generates network requests itself that are handled by the protocol/network task. Right now this is already more or less the case, except that it's done by locking mutexes instead of sending messages.
  • The import queue has its own task, and is between the protocol/network and the client, as it is already the case.

@tomaka tomaka force-pushed the network-chan-rm branch from 97d6bbe to 0848d9a Compare May 15, 2019 11:34
@tomaka

tomaka commented May 15, 2019

Copy link
Copy Markdown
Contributor Author

Fixing the network tests is hell on earth.

@gterzian

gterzian commented May 15, 2019

Copy link
Copy Markdown
Contributor

Thanks for providing additional info, overall I think it sounds like a reasonable approach, and I do have a few more comments:

I'm basically splitting it in two: ProtocolSync (whose name stays Protocol), and ProtocolAsync (which is unnamed in the code but more or less consists in the run_thread function) that resides in service.rs.

I can support the goal of having simple sync api, with some sort of wrapper around them to make them async where needed, and the only thing I am worried about is that if making things "async" means having to share entire stucts like protocol across threads, we will be back to having comments everywhere about "take the locks in that order" and so on(and few operations will be really async, due to the synchronization requirements).

For example if I now look at run_thread, I think that making the various operations taking place async, as opposed to all in a single task, would require sharing state across threads(or tasks), because the API of the various components are now method based, versus message based. Except off-course if you were to re-introduce channels and call the "sync" api in response to receiving "async" messages inside some sort of wrapper. And wouldn't that be sort-of re-introducing what has been removed in this PR?

If something expensive is happening in ChainSync, then that specific thing should be made asynchronous and not ChainSync as a whole.

I'm not sure if there is anything really "expensive" going in in ChainSync, or other parts of protocol like gossip, yet I do think quite a few of these operations could probably be done while the network service does other things. However splitting those off separately from Protocol as a whole is probably not worth it since it introduce complications around sharing the state necessary to perform those.

That's essentially while I though having protocol run in a thread, or maybe rather a task, as a whole and separate from the actual network service, was a pretty decent idea.

On the other hand perhaps other boundaries will emerge, and some operations taking place in protocol, such as validating gossip messages, could be moved to other threads of tasks, leaving network/protocol only doing minimal network-related work.

@tomaka

tomaka commented May 15, 2019

Copy link
Copy Markdown
Contributor Author

I can support the goal of having simple sync api, with some sort of wrapper around them to make them async where needed, and the only thing I am worried about is that if making things "async" means having to share entire stucts like protocol across threads, we will be back to having comments everywhere about "take the locks in that order" and so on(and few operations will be really async, due to the synchronization requirements).

Well, the difference would be that all locks are in the same place, which makes it much easier to figure out whether a deadlock could happen.

However, I would go for channels and not arc-mutexes.

And wouldn't that be sort-of re-introducing what has been removed in this PR?

For specifically network and protocol I think they should be merged, and therefore this PR moves in the right direction. If these were two different services and we wanted to keep them separate, then yes we would keep channels and messages, but isolated in a single module.

@gterzian

Copy link
Copy Markdown
Contributor

Ok, thanks for the explanations and looking forward to see where this will lead...

@tomaka tomaka force-pushed the network-chan-rm branch 4 times, most recently from 185df97 to d1cbec4 Compare May 20, 2019 11:24
@tomaka tomaka force-pushed the network-chan-rm branch from d1cbec4 to 13baa5e Compare May 20, 2019 12:54
@tomaka

tomaka commented May 20, 2019

Copy link
Copy Markdown
Contributor Author

Ready for review.
I initially had the intention to rework the network tests, but this task is still way too huge at the moment.
Instead, I added a uses_tokio() -> bool method on TestNetFactory.

@tomaka tomaka added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels May 20, 2019
if self.use_tokio {
fut.wait()
} else {
tokio::runtime::current_thread::block_on_all(fut)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does using fut.wait() when this is running already inside tokio give any benefit to just always doing a tokio::runtime::current_thread::block_on_all(fut)?

@tomaka tomaka May 20, 2019

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block_on_all function always creates a new runtime, which results in a panic if we are already within a runtime. This is why I had to introduce this use_tokio thing.
As far as I know, there is no way to know whether you are already within a tokio runtime. The closest to that is DefaultExecutor::status(), but for some reason it returns Err despite being in a tokio runtime.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for the info...

@tomaka tomaka merged commit 4794d98 into paritytech:master May 21, 2019
@tomaka tomaka deleted the network-chan-rm branch May 21, 2019 12:07
MTDK1 pushed a commit to bdevux/substrate that referenced this pull request Jul 10, 2019
* Remove the NetworkChan from the API

* Remove the NetworkChan altogether

* Address review

* Fix line widths

* More line width fixes

* Remove pub visibility from entire world

* Fix tests
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

A0-please_review Pull request needs code review.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants