This repository was archived by the owner on Aug 24, 2020. It is now read-only.
We ran into a situation where the certs on the scheduler did not match the certs on the log cache nodes. I expected the health endpoints to point me toward the cause, but they gave little guidance.
Proposed new metrics (all counters):
- Scheduler
  - SuccessfulRoutingTableSets
  - FailedRoutingTableSets
- Log Cache
  - SuccessfulRoutedEnvelopes
  - FailedRoutedEnvelopes
  - RoutingTableSets
In this failure, if I had seen LogCache.RoutingTableSets at 0 on every node, I would have been guided to the Scheduler quickly. Instead I looked at the log cache nodes first, assuming there were routing problems between nodes. The lack of metrics about communication between nodes (SuccessfulRoutedEnvelopes and FailedRoutedEnvelopes would help here) leaves operators somewhat in the dark.