Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
We use substrate (branch polkadot-v0.9.26) on our testnet (Aleph Zero testnet) and quite often (some nodes get that every couple of days) experience a problem with the sync protocol getting stuck. The way it looks like is that till some point everything works OK, after which the node enters a strange state and keeps logging:
2022-08-11 13:12:51 ⚙️ Syncing 0.0 bps, target=#7930720 (26 peers), best: #7927340 (0xa832…7ebb), finalized #7927337 (0x292d…fd2a), ⬇ 2.9kiB/s ⬆ 0.4kiB/s
without making any progress anymore. Restarting the node always helps.
So far we haven't really found any plausible explanation for this -- it doesn't seem to be happening at any specific blocks -- different nodes get stuck on different blocks. We found a strange way to (sometimes) reproduce it -- run a node on a laptop, then hibernate it for some time, and then turn it back on -- once the node reconnects and tries to sync, it gets stuck in the same way (and restart is necessary). This way we were able to produce the attached log (it has trace logs on sync). Attached below.
verbose-logs.log
This is however not a typical situation in which the problem arises (i.e. long network disconnect) -- in fact normally nodes work normally and just like that they stop syncing :/
The main difference between AlephZero and Polkadot and standard parachain nodes is that we have low block-time (1sec) and that we use Aura + AlephBFT instead of Babe + Grandpa. However, the latter shouldn't really matter because this affects mostly non-validator nodes, who don't even run any consensus code.
Our only guess was that maybe this is because we don't have this fix #11817 in our substrate dependency, but by inspecting the code it doesn't seem that it could have such an effect...
We would appreciate any help in finding out what the culprit of that could be... Thanks!
Steps to reproduce
No response
Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
We use substrate (branch polkadot-v0.9.26) on our testnet (Aleph Zero testnet) and quite often (some nodes get that every couple of days) experience a problem with the sync protocol getting stuck. The way it looks like is that till some point everything works OK, after which the node enters a strange state and keeps logging:
2022-08-11 13:12:51 ⚙️ Syncing 0.0 bps, target=#7930720 (26 peers), best: #7927340 (0xa832…7ebb), finalized #7927337 (0x292d…fd2a), ⬇ 2.9kiB/s ⬆ 0.4kiB/swithout making any progress anymore. Restarting the node always helps.
So far we haven't really found any plausible explanation for this -- it doesn't seem to be happening at any specific blocks -- different nodes get stuck on different blocks. We found a strange way to (sometimes) reproduce it -- run a node on a laptop, then hibernate it for some time, and then turn it back on -- once the node reconnects and tries to sync, it gets stuck in the same way (and restart is necessary). This way we were able to produce the attached log (it has
tracelogs onsync). Attached below.verbose-logs.log
This is however not a typical situation in which the problem arises (i.e. long network disconnect) -- in fact normally nodes work normally and just like that they stop syncing :/
The main difference between AlephZero and Polkadot and standard parachain nodes is that we have low block-time (1sec) and that we use Aura + AlephBFT instead of Babe + Grandpa. However, the latter shouldn't really matter because this affects mostly non-validator nodes, who don't even run any consensus code.
Our only guess was that maybe this is because we don't have this fix #11817 in our substrate dependency, but by inspecting the code it doesn't seem that it could have such an effect...
We would appreciate any help in finding out what the culprit of that could be... Thanks!
Steps to reproduce
No response