Remove the NetworkChan #2577
DemiMarie-temp
left a comment
The CI is failing, but otherwise the code looks good.
type Error = void::Void;

fn poll(&mut self) -> Poll<Self::Item, Self::Error> {
pub fn poll(&mut self, network_out: &mut dyn NetworkOut<B>) -> Poll<(), void::Void> {
This always returns Async::NotReady. If this is deliberate, we might be able to return Poll<void::Void, void::Void> here.
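To illustrate the suggestion, here is a minimal sketch using only the standard library. `Async`, `Poll`, and `Void` below are local stand-ins for the futures 0.1 and `void` crate types, and `Driver` is a hypothetical name, not a type from this PR. The point is only that an uninhabited `Ready` type lets the signature state that the function never completes.

```rust
/// Local stand-in for the futures 0.1 `Async` type (assumption: simplified).
#[derive(Debug, PartialEq)]
pub enum Async<T> {
    Ready(T),
    NotReady,
}

/// Local stand-in for the futures 0.1 `Poll` alias.
pub type Poll<T, E> = Result<Async<T>, E>;

/// An enum with no variants: no value of this type can ever be constructed.
#[derive(Debug, PartialEq)]
pub enum Void {}

/// Hypothetical driver that does its work on every poll and never completes.
pub struct Driver {
    pub polls: u32,
}

impl Driver {
    /// With `Poll<Void, Void>`, neither `Ready` nor `Err` can be produced,
    /// so `Ok(Async::NotReady)` is the only possible return value.
    pub fn poll(&mut self) -> Poll<Void, Void> {
        self.polls += 1;
        Ok(Async::NotReady)
    }
}

/// A caller can discharge the impossible arms with an empty match.
pub fn poll_once(driver: &mut Driver) -> bool {
    match driver.poll() {
        Ok(Async::NotReady) => true,
        Ok(Async::Ready(never)) => match never {},
        Err(never) => match never {},
    }
}
```

With `Poll<(), void::Void>`, by contrast, the compiler cannot rule out `Ready(())`, so callers still have to write a branch for it.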
let outcome = match poll_value {
    Ok(Async::NotReady) => break,
    Ok(Async::Ready(Some(NetworkServiceEvent::OpenedCustomProtocol { peer_id, version, debug_info, .. }))) => {
        debug_assert!(
It seems that this assert should either be on in release builds, or removed entirely.
We're kind of doing defensive programming. We don't want release builds to panic, ever, but debug_asserts are fine.
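A minimal sketch of that defensive style, assuming a made-up condition: the actual assertion in the diff is not shown here, so `version`, `supported`, and `Outcome` are purely illustrative. The `debug_assert!` flags the unexpected state in debug and test builds, while the code after it still handles that state without panicking when assertions are compiled out.

```rust
/// Hypothetical outcome of handling an "opened custom protocol" event.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Accepted,
    Ignored,
}

/// `version` and `supported` are illustrative parameters, not the PR's real fields.
pub fn on_protocol_opened(version: u8, supported: &[u8]) -> Outcome {
    // Checked only when debug assertions are enabled (the default for debug
    // and test builds); compiled out of optimized builds by default.
    debug_assert!(
        supported.contains(&version),
        "peer opened unsupported protocol version {}",
        version
    );

    // Release builds still take a non-panicking path for the same condition.
    if supported.contains(&version) {
        Outcome::Accepted
    } else {
        Outcome::Ignored
    }
}
```

The trade-off raised in the review remains: the unexpected state is loud during development and silent in production, which is only safe if the fallback branch is itself correct.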
Ok(NetworkMsg::Synchronized) => return Ok(()),
Err(error) => return Err(error),
Ok(msg) => self.buffered_messages.lock().push_back(msg),
pub fn wait_sync(&self) -> Result<(), ()> {
Ideally, this would return a future, but I would not consider that urgent.
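For context, the blocking shape under discussion can be sketched as below. This is an approximation built on `std::sync::mpsc` and `std::sync::Mutex`, not the PR's code: `TestPort`, the `Outgoing` variant, and `buffered_len` are assumptions added for illustration. Only `Synchronized`, `buffered_messages`, and the `wait_sync` signature come from the diff.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Receiver;
use std::sync::Mutex;

/// Illustrative subset of messages; only `Synchronized` appears in the diff.
#[derive(Debug, PartialEq)]
pub enum NetworkMsg {
    Synchronized,
    Outgoing(Vec<u8>),
}

/// Hypothetical test helper holding the receiving end of the channel.
pub struct TestPort {
    receiver: Receiver<NetworkMsg>,
    buffered_messages: Mutex<VecDeque<NetworkMsg>>,
}

impl TestPort {
    pub fn new(receiver: Receiver<NetworkMsg>) -> Self {
        TestPort {
            receiver,
            buffered_messages: Mutex::new(VecDeque::new()),
        }
    }

    /// Blocks the calling thread until `Synchronized` arrives, buffering every
    /// other message. Returns `Err(())` once all senders have been dropped.
    pub fn wait_sync(&self) -> Result<(), ()> {
        loop {
            match self.receiver.recv() {
                Ok(NetworkMsg::Synchronized) => return Ok(()),
                Err(_) => return Err(()),
                Ok(msg) => self.buffered_messages.lock().unwrap().push_back(msg),
            }
        }
    }

    /// Number of messages set aside while waiting.
    pub fn buffered_len(&self) -> usize {
        self.buffered_messages.lock().unwrap().len()
    }
}
```

Because `recv()` parks the thread, this helper cannot be driven from inside an event loop, which is what returning a future would address.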
The CI is failing because it turns out that one of the aura tests uses the network testnet helper from within a tokio runtime. I will have to do some cleanup in the network tests for this to work.
The code looks correct, and I have some questions regarding the overall direction:
1. Is the idea to fully merge protocol and the libp2p network service over time? I'm asking because, as I noted in earlier changes, this is moving in a direction where it would become increasingly hard to run those two independently, either as tasks or threads. My impression was that protocol was a layer between the libp2p service and consensus that was still independent from both and could benefit from running independently.
2. If the answer to 1 is "yes, we want to merge these two concepts", then is the plan to take that a step further and basically put protocol inside the network service and remove this run_thread function? I'm asking because run_thread is essentially sequential code running inside a single task. There is a lot to be said for consolidating logic within a single task, but only if we want to merge protocol with the libp2p service. In that case the run_thread function could be further simplified by doing everything inside the poll function of the network service, which could poll the network_rx and call methods of protocol, maybe still using something like the NetworkOut introduced here, or perhaps the various parts of protocol could just live directly on the network service.
   Another way to say this is that run_thread basically already is a merged protocol and network service, but one consisting of various back-and-forth between polling various things and using NetworkOut. Could all that not be hidden in a single struct implementing a future or stream?
3. Previously NetworkChan was used both by a protocol running in its own thread (later task) and by the Service that is shared with other parts of the system. This had the benefit that all "network operations" went through a single channel. Now protocol is effectively calling methods of the libp2p network service "directly" through NetworkOut from the same task. This potentially changes the ordering of operations across the system in subtle ways.
   An example:
   - Import-queue does a request_justification; this enqueues a message on the protocol_sender.
   - Inside run_thread, we do a protocol_rx.poll() and handle the message, resulting in a call to protocol.request_justification.
   - In some cases this then results in a call to network_out.send_message.
   - Previously, send_message was implemented by queueing a message onto the NetworkChan.
   - Currently, a method on the network service will be called.
   - The difference is essentially that previously the send_message would only be handled by the network service when it handled the message enqueued inside send_message, and only after having handled any other messages that had been enqueued "earlier" onto the NetworkChan. Now, the send_message will be handled immediately after protocol has handled the initial ProtocolMsg.
   - In other words, any send_message will be done "sync" during the handling of the ProtocolMsg::RequestJustification, while previously it was done "async" via the handling of a subsequent message, enqueued as a result of handling the ProtocolMsg::RequestJustification.
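The ordering difference in this example can be shown with a toy model. Everything below is hypothetical and deliberately simplified: `NetworkOp`, `handle_request_justification`, and the `direct_call` flag are invented names, and the real network service is reduced to "perform operations in the order they reach it".

```rust
use std::collections::VecDeque;

/// Toy operations reaching the network service (illustrative only).
#[derive(Debug, Clone, PartialEq)]
pub enum NetworkOp {
    ReportPeer(&'static str),
    SendMessage(&'static str),
}

/// Returns the order in which the toy network service performs operations
/// while protocol handles one justification request.
/// `queue` holds operations that other components enqueued earlier.
pub fn handle_request_justification(
    mut queue: VecDeque<NetworkOp>,
    direct_call: bool,
) -> Vec<NetworkOp> {
    let mut performed = Vec::new();
    let send = NetworkOp::SendMessage("justification request");

    if direct_call {
        // Current behaviour: protocol calls the network service right away.
        performed.push(send);
    } else {
        // Previous behaviour: the send is queued behind earlier operations.
        queue.push_back(send);
    }

    // The network service then drains whatever is queued, in FIFO order.
    performed.extend(queue);
    performed
}
```

With one `ReportPeer` already queued, the queued variant performs `ReportPeer` and then `SendMessage`, while the direct variant performs `SendMessage` first. Whether that reordering matters in practice is exactly the open question.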
On the other hand, when the import queue does something like report_peer, it still goes over a channel, which previously was NetworkChan.
So NetworkChan was not just an implementation detail, it was also a way to consolidate all network operations requested by non-networking components into a single message queue.
The subtle changes in the ordering of operations, due to some now being sync method calls rather than async message handling, might not matter, or might be the intention, and I would just like to make sure they are considered (they might not otherwise become obvious until later, since the test suite might not catch them).
This obviously goes back to the whole "do we want to run protocol and networking in the same task/thread?".
At this point, it appears to me that one part of the system, the import-queue and so on, will always live in a different thread, and communicate in an async way back with protocol and/or networking. Previously protocol and networking also ran in separate threads, and communicated in an async way.
Now we're moving in a direction of merging protocol with networking, and running them fully in sync, yet the other part involving things like the import-queue is still async. So moving one part of the system into the same task and making everything sync could be the right thing to do, and I think it should be discussed explicitly, since it will influence the ordering of operations coming from other, still async, parts.
Also, by merging protocol and network into one task, we lose all opportunities for parallelism between the two.
For example, do we really want all the stuff happening in ChainSync to always fully block networking, and vice versa? I was under the impression that there is quite a lot of logic being run, and that it might be best to separate those two components, or at least retain an easy option of separating them later.
In any case, I'd like to see an explicit discussion about how we intend to "run" protocol, networking, and the other components communicating with them, one that addresses the concurrency of the system (or lack thereof). Otherwise we run the risk of doing everything in a single task, which in the immediate term might appear "simpler", and then later realizing that we'd like to get some parallelism between various operations, which would then be much harder to achieve.
However it is clearly not my intention to make them impossible to split. That's what I'm doing for Another big advantage of this, which is a bit tricky to illustrate, is that communications have a more precise meaning. By having services directly communicate to each other, you sometimes have to pass some context which would be out of scope of any of the two services individually.
However, note that this kind of tricky situation is exactly what I want to avoid by reworking the concurrency story of the network. I totally agree that it can indeed lead to subtle bugs, and more determinism not only makes things simpler to reason about, but also makes it easier to detect and reproduce bugs.
If something expensive is happening in
First of all, if we have to choose between "simple and one task" and "hard and multithreaded", I would argue that we should lean towards "hard and multithreaded" only if there's a benefit to multithreading things. Otherwise we run into premature optimizations. I clearly think that we can "asyncify" things later on top of synchronous API (with the As for a precise answer, the way I see it in the very long term is:
Fixing the network tests is hell on earth.
Thanks for providing additional info, overall I think it sounds like a reasonable approach, and I do have a few more comments:
I can support the goal of having a simple sync API, with some sort of wrapper around it to make it async where needed. The only thing I am worried about is that if making things "async" means having to share entire structs like For example if I now look at
I'm not sure if there is anything really "expensive" going on in ChainSync, or other parts of protocol like gossip, yet I do think quite a few of these operations could probably be done while the network service does other things. However, splitting those off separately from Protocol as a whole is probably not worth it, since it would introduce complications around sharing the state necessary to perform them. That's essentially why I thought having On the other hand perhaps other boundaries will emerge, and some operations taking place in protocol, such as validating gossip messages, could be moved to other threads or tasks, leaving network/protocol only doing minimal network-related work.
Well, the difference would be that all locks are in the same place, which makes it much easier to figure out whether a deadlock could happen. However, I would go for channels and not arc-mutexes.
For specifically network and protocol I think they should be merged, and therefore this PR moves in the right direction. If these were two different services and we wanted to keep them separate, then yes we would keep channels and messages, but isolated in a single module.
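As a rough sketch of what "a synchronous API with channels isolated in one place" could look like, assuming nothing about the real types: `Protocol` here is a stripped-down stand-in with a single field, and `ProtocolHandle`, `Msg`, and `spawn` are hypothetical names. The core stays plain and single-threaded, and the only code that knows about messages is the wrapper.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Synchronous core: plain methods, no locks, no channels (simplified stand-in).
pub struct Protocol {
    best_block: u64,
}

impl Protocol {
    pub fn on_block_imported(&mut self, number: u64) {
        if number > self.best_block {
            self.best_block = number;
        }
    }

    pub fn best_block(&self) -> u64 {
        self.best_block
    }
}

/// Messages are private to the wrapper and never leak into `Protocol`.
enum Msg {
    BlockImported(u64),
    BestBlock(Sender<u64>),
}

/// Cloneable handle that other threads use instead of sharing `Protocol`.
#[derive(Clone)]
pub struct ProtocolHandle {
    sender: Sender<Msg>,
}

impl ProtocolHandle {
    /// Moves a `Protocol` into a worker thread that owns it exclusively.
    pub fn spawn() -> (ProtocolHandle, thread::JoinHandle<()>) {
        let (sender, receiver) = channel();
        let worker = thread::spawn(move || {
            let mut protocol = Protocol { best_block: 0 };
            // The loop ends once every handle (sender) has been dropped.
            for msg in receiver {
                match msg {
                    Msg::BlockImported(n) => protocol.on_block_imported(n),
                    Msg::BestBlock(reply) => {
                        let _ = reply.send(protocol.best_block());
                    }
                }
            }
        });
        (ProtocolHandle { sender }, worker)
    }

    pub fn block_imported(&self, number: u64) {
        let _ = self.sender.send(Msg::BlockImported(number));
    }

    /// Returns `None` if the worker thread is gone.
    pub fn best_block(&self) -> Option<u64> {
        let (reply_tx, reply_rx) = channel();
        self.sender.send(Msg::BestBlock(reply_tx)).ok()?;
        reply_rx.recv().ok()
    }
}
```

Since `Protocol` is owned by exactly one thread, there is no lock to reason about, and dropping the wrapper leaves the synchronous API untouched for callers that want to run it in their own task.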
Ok, thanks for the explanations and looking forward to seeing where this will lead...
185df97 to d1cbec4
Ready for review.
if self.use_tokio {
    fut.wait()
} else {
    tokio::runtime::current_thread::block_on_all(fut)
Does using fut.wait() when this is already running inside tokio give any benefit over just always doing a tokio::runtime::current_thread::block_on_all(fut)?
The block_on_all function always creates a new runtime, which results in a panic if we are already within a runtime. This is why I had to introduce this use_tokio thing.
As far as I know, there is no way to know whether you are already within a tokio runtime. The closest to that is DefaultExecutor::status(), but for some reason it returns Err despite being in a tokio runtime.
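The constraint can be modelled with a toy, std-only sketch. This is not tokio's API: `enter`, `EnterGuard`, `block_on_new_runtime`, and `run` are invented for illustration, and the toy returns `Err(())` where the real helper panics. It only shows why the caller has to pass in the "am I already inside a runtime" information when the helper itself cannot find out.

```rust
use std::cell::Cell;

thread_local! {
    /// Toy marker for "this thread is already driving a runtime".
    static ENTERED: Cell<bool> = Cell::new(false);
}

/// Clears the marker when the toy runtime is left.
pub struct EnterGuard;

impl Drop for EnterGuard {
    fn drop(&mut self) {
        ENTERED.with(|e| e.set(false));
    }
}

/// Fails if the current thread has already entered the toy runtime.
pub fn enter() -> Result<EnterGuard, ()> {
    ENTERED.with(|e| {
        if e.get() {
            Err(())
        } else {
            e.set(true);
            Ok(EnterGuard)
        }
    })
}

/// Stand-in for a helper that always starts a fresh runtime and blocks on it.
pub fn block_on_new_runtime<T>(work: impl FnOnce() -> T) -> Result<T, ()> {
    let _guard = enter()?;
    Ok(work())
}

/// Mirrors the shape of the `use_tokio` switch: the caller states whether a
/// runtime is already active, and the helper picks the matching strategy.
pub fn run<T>(already_in_runtime: bool, work: impl FnOnce() -> T) -> Result<T, ()> {
    if already_in_runtime {
        Ok(work())
    } else {
        block_on_new_runtime(work)
    }
}
```

In this model, nesting `block_on_new_runtime` inside itself fails, while calling `run(true, ...)` from inside it succeeds, which is the same split the flag encodes in the test helper.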
Ok, thanks for the info...
* Remove the NetworkChan from the API
* Remove the NetworkChan altogether
* Address review
* Fix line widths
* More line width fixes
* Remove pub visibility from entire world
* Fix tests
protocol.rs now communicates back its intentions towards the network through a NetworkOut trait that is passed whenever a method is called.
Removes the NetworkChan type altogether, and replaces it internally with an UnboundedSender from the futures crate.
After that, the code of Protocol and Service is already much cleaner in my opinion. There are a lot of small adjustments to make that would be off-topic for this PR, so I'll do them on top.
This PR really makes the code of the test helpers much dirtier (it already was dirty), but the next step will be to rework that.
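A reduced sketch of the pattern described above, with several assumptions: the real trait is generic over the block type and its exact method set is not shown in this conversation, so the two methods here, the `u64` peer ids, and the `Recorder` type are illustrative only. The idea is that protocol holds no channel or handle to the network and instead receives a `&mut dyn NetworkOut` on each call.

```rust
/// Hypothetical, reduced version of a `NetworkOut`-style trait.
pub trait NetworkOut {
    fn send_message(&mut self, who: u64, message: Vec<u8>);
    fn report_peer(&mut self, who: u64, reputation_change: i32);
}

/// Simplified stand-in for the protocol state machine.
pub struct Protocol {
    known_peers: Vec<u64>,
}

impl Protocol {
    pub fn new() -> Self {
        Protocol { known_peers: Vec::new() }
    }

    /// The network is borrowed only for the duration of the call.
    pub fn on_peer_connected(&mut self, network_out: &mut dyn NetworkOut, who: u64) {
        self.known_peers.push(who);
        network_out.send_message(who, b"status".to_vec());
    }

    /// Messages from unknown peers are answered with a reputation penalty.
    pub fn on_message(&mut self, network_out: &mut dyn NetworkOut, who: u64, _data: &[u8]) {
        if !self.known_peers.contains(&who) {
            network_out.report_peer(who, -10);
        }
    }
}

/// Test double that records what protocol asked the network to do.
#[derive(Default)]
pub struct Recorder {
    pub sent: Vec<(u64, Vec<u8>)>,
    pub reports: Vec<(u64, i32)>,
}

impl NetworkOut for Recorder {
    fn send_message(&mut self, who: u64, message: Vec<u8>) {
        self.sent.push((who, message));
    }

    fn report_peer(&mut self, who: u64, reputation_change: i32) {
        self.reports.push((who, reputation_change));
    }
}
```

One consequence of this shape is that tests can pass a recording implementation like `Recorder` and assert on it directly, without spinning up channels or threads.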