Skip to content

fix: track rate limited errors#8116

Merged
wemeetagain merged 7 commits into
peerDASfrom
te/detect_rate_limit_error
Aug 8, 2025
Merged

fix: track rate limited errors#8116
wemeetagain merged 7 commits into
peerDASfrom
te/detect_rate_limit_error

Conversation

@twoeths

@twoeths twoeths commented Aug 6, 2025

Copy link
Copy Markdown
Contributor

Motivation

  • when we end request to peers, they may respond with rate limit error and we don't track it
  • this is a prerequisite for rate limit prevention that I'll implement later. We want to make sure REQUEST_RATE_LIMITED does not happen in that branch

Description

  • fix the malformed extracted error message, see the unit test
  • throw REQUEST_RATE_LIMITED if message contains "rate limit", see rate limited error messages of all clients here
  • new metric to track out going error by reason

Closes #8065
Closes #8110

Test on dev nodes

Screenshot 2025-08-06 at 15 52 46

@twoeths

twoeths commented Aug 6, 2025

Copy link
Copy Markdown
Contributor Author

new rate limited error messages with this branch:

  • Prysm: verbose: Req error method=beacon_blocks_by_range, version=2, encoding=ssz_snappy, client=Prysm, peer=16...iQxcdU, requestId=96163, status=1, errorMessage=rate limited
  • Lighthouse: verbose: Req error method=beacon_blocks_by_range, version=2, encoding=ssz_snappy, client=Lighthouse, peer=16...xwThs6, requestId=75188, status=139, errorMessage=Rate limited. There are already 2 active requests with the same protocol
  • NA: verbose: Req error method=data_column_sidecars_by_range, version=1, encoding=ssz_snappy, client=NA, peer=16...evfBuU, requestId=74333, status=1, errorMessage=rate limited

@twoeths twoeths marked this pull request as ready for review August 6, 2025 10:14
@twoeths twoeths requested a review from a team as a code owner August 6, 2025 10:14
@twoeths twoeths changed the title fix: decode p2p error message fix: track rate limited errors Aug 6, 2025
Comment thread packages/reqresp/src/request/errors.ts Outdated
}
}

// other errors like RequestErrorCode.REQUEST_RATE_LIMITED could come from ourself, not the peer so we should not penalize them

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this. Wouldnt an outgoing request that we "rate limit" mean that we are not making the request. How would the error "come from ourself"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per spec we should not send more than 2 requests per protocol to the same peer
in the future we should implement that at client side, ie do not send more requests to the same peer/protocol
then we can avoid that error

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh... as in we throw RequestErrorCode.REQUEST_RATE_LIMITED from the client instead of just not making the request?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now it'll throw RESP_RATE_LIMITED, ie we makes a request and peer returns that error. We don't have self rate limiter yet, the strategy for now is to control at the client side, ie in SyncChain (I have a local branch for it, will create a PR soon but want some more testings)

I'll also implement self rate limiter soon, in that case it'll throw REQUEST_RATE_LIMITED. Then we'll track this error and see which module needs to control active requests like in SyncChain

return {code: RequestErrorCode.RESP_TIMEOUT};
}

switch (status) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a RespStatus.RATE_LIMITED (which == 139) and check in this switch?
This will help in case error messages in clients change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rate limited error is not specified in the response code https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/p2p-interface.md#responding-side

lighthouse returned 139 but other clients returned 1

I don't think we can improve unless it's specified in the spec

{
name: "NA - rate limited",
errorMessage: "rate limited",
},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should add a test case for the "wait #.###" condition?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more test cases in 7624a60

Comment on lines +30 to +34
outgoingErrorReasons: register.gauge<{reason: RequestErrorCode}>({
name: "beacon_reqresp_outgoing_requests_error_reason_total",
help: "Count total outgoing request errors by reason",
labelNames: ["reason"],
}),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric is defined as a gauge but is being used with the inc() method in ReqResp.ts, which is counter behavior. For accurate metrics reporting, this should be defined as a counter instead of a gauge. Gauges represent values that can go up and down, while counters are for values that only increase over time, which appears to be the intended usage here.

Suggested change
outgoingErrorReasons: register.gauge<{reason: RequestErrorCode}>({
name: "beacon_reqresp_outgoing_requests_error_reason_total",
help: "Count total outgoing request errors by reason",
labelNames: ["reason"],
}),
outgoingErrorReasons: register.counter<{reason: RequestErrorCode}>({
name: "beacon_reqresp_outgoing_requests_error_reason_total",
help: "Count total outgoing request errors by reason",
labelNames: ["reason"],
}),

Spotted by Diamond

Is this helpful? React 👍 or 👎 to let us know.

@wemeetagain wemeetagain merged commit 68b6feb into peerDAS Aug 8, 2025
15 of 19 checks passed
@wemeetagain wemeetagain deleted the te/detect_rate_limit_error branch August 8, 2025 11:27
@nflaig

nflaig commented Aug 9, 2025

Copy link
Copy Markdown
Member

seems like since we merged this PR our unit tests are failing consistently

wemeetagain added a commit that referenced this pull request Aug 11, 2025
**Motivation**

- fix unit tests and e2e tests of `peerDAS` branch

**Description**

- when sending status, based on its body we set correct version,
otherwise peers cannot deserialize the request body
- at fulu fork transition, update local status cache so that it sends
the correct version of status message
- fix failed unit test as introduced by
[PR-8116](#8116 (comment))

---------

Co-authored-by: Tuyen Nguyen <twoeths@users.noreply.github.com>
Co-authored-by: Nico Flaig <nflaig@protonmail.com>
Co-authored-by: Cayman <caymannava@gmail.com>
wemeetagain added a commit that referenced this pull request Aug 28, 2025
**Motivation**

- track req/resp outgoing request error by reason

**Description**

- the metric was added in #8116

<img width="835" height="609" alt="Screenshot 2025-08-28 at 15 25 59"
src="https://github.com/user-attachments/assets/92b97adc-9ae1-4ce0-a1d3-ef32378d5ee0"
/>

---------

Co-authored-by: Tuyen Nguyen <twoeths@users.noreply.github.com>
Co-authored-by: Cayman <caymannava@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@wemeetagain

Copy link
Copy Markdown
Member

🎉 This PR is included in v1.34.0 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants