This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Sync protocol gets stuck and node requires a restart. #12101

Description

@DamianStraszak

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

We use Substrate (branch polkadot-v0.9.26) on our testnet (Aleph Zero testnet), and quite often (some nodes hit this every couple of days) the sync protocol gets stuck. Everything works fine up to some point, after which the node enters a strange state and keeps logging:
2022-08-11 13:12:51 ⚙️ Syncing 0.0 bps, target=#7930720 (26 peers), best: #7927340 (0xa832…7ebb), finalized #7927337 (0x292d…fd2a), ⬇ 2.9kiB/s ⬆ 0.4kiB/s
without making any further progress. Restarting the node always helps.
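
For anyone wanting to catch this state automatically, here is a minimal sketch of a log watchdog. It is not part of Substrate: the regular expression is inferred only from the single log line quoted above, and `StallDetector` and its `max_repeats` threshold are hypothetical names chosen for illustration. It flags a stall when the `best` block number stays unchanged over several consecutive "Syncing" lines while the node is still behind its target.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The log line quoted in this report, used as the reference format.
EXAMPLE = (
    "2022-08-11 13:12:51 ⚙️ Syncing 0.0 bps, target=#7930720 (26 peers), "
    "best: #7927340 (0xa832…7ebb), finalized #7927337 (0x292d…fd2a), "
    "⬇ 2.9kiB/s ⬆ 0.4kiB/s"
)

# Assumption: the pattern is derived from the one line above and may need
# adjusting for other node versions or log formats.
SYNC_LINE = re.compile(
    r"Syncing\s+(?P<bps>\d+(?:\.\d+)?)\s*bps, "
    r"target=#(?P<target>\d+) \((?P<peers>\d+) peers\), "
    r"best: #(?P<best>\d+)"
)


@dataclass
class SyncStatus:
    bps: float
    target: int
    peers: int
    best: int

    @property
    def gap(self) -> int:
        """Number of blocks the node is behind its sync target."""
        return self.target - self.best


def parse_sync_line(line: str) -> Optional[SyncStatus]:
    """Return the parsed status, or None if this is not a 'Syncing' line."""
    m = SYNC_LINE.search(line)
    if m is None:
        return None
    return SyncStatus(
        bps=float(m.group("bps")),
        target=int(m.group("target")),
        peers=int(m.group("peers")),
        best=int(m.group("best")),
    )


class StallDetector:
    """Hypothetical helper: reports a stall once `best` has repeated
    `max_repeats` times in a row while the node is behind its target."""

    def __init__(self, max_repeats: int = 60):
        self.max_repeats = max_repeats
        self.last_best: Optional[int] = None
        self.repeats = 0

    def observe(self, line: str) -> bool:
        status = parse_sync_line(line)
        if status is not None:
            if status.best != self.last_best:
                # Progress (or first observation): reset the counter.
                self.last_best = status.best
                self.repeats = 0
            elif status.gap > 0:
                # Same best block again while still behind the target.
                self.repeats += 1
        # Lines that are not 'Syncing' lines leave the state unchanged.
        return self.repeats >= self.max_repeats
```

Applied to the line above, the parser gives a gap of 7930720 - 7927340 = 3380 blocks. A detector like this could be fed the node's log stream line by line and used to trigger an alert or the restart that currently serves as the workaround.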

So far we haven't found a plausible explanation for this. It doesn't seem to happen at any specific block: different nodes get stuck on different blocks. We found a strange way to (sometimes) reproduce it: run a node on a laptop, hibernate the laptop for some time, and then wake it up. Once the node reconnects and tries to sync, it gets stuck in the same way (and a restart is necessary). This is how we produced the log attached below, which has trace-level logging enabled for sync.
verbose-logs.log

However, a long network disconnect is not the typical situation in which the problem arises. Usually nodes are running normally and then simply stop syncing :/

The main differences between Aleph Zero and Polkadot or standard parachain nodes are that we have a low block time (1 sec) and that we use Aura + AlephBFT instead of Babe + Grandpa. However, the latter shouldn't really matter, because the problem mostly affects non-validator nodes, which don't even run any consensus code.

Our only guess was that this might be because our Substrate dependency doesn't include the fix from #11817, but from inspecting the code it doesn't seem that it could have such an effect...

We would appreciate any help in finding out what the culprit could be. Thanks!

Steps to reproduce

No response

Labels

J2-unconfirmed: Issue might be valid, but it’s not yet known.