fix: combine error status and message into single chunk in reqresp response #8908

Merged: nazarhussain merged 4 commits into ChainSafe:unstable from lodekeeper:fix/e2e-reqresp-flaky on Feb 16, 2026

@lodekeeper

Contributor

Motivation

The e2e reqresp tests "should handle a server error" and "should handle a server error after emitting two blocks" have been consistently flaky, appearing in ~90% of all CI E2E test failures. Analysis of ~100 recent CI runs confirmed this as the #1 source of E2E flakiness.

The failure pattern:

expected { code: "REQUEST_ERROR_SERVER_ERROR", errorMessage: "" }
to deeply equal { code: "REQUEST_ERROR_SERVER_ERROR", errorMessage: "TEST_EXAMPLE_ERROR_1234" }

The error status code is received correctly, but the error message is empty.

Root Cause

responseEncodeError() yields the error status byte and snappy-encoded error message as separate chunks through the async generator:

yield Buffer.from([status]);  // chunk 1
yield* encodeErrorMessage(errorMessage, protocol.encoding);  // chunk 2+

When piped through stream.sink, libp2p can close/flush the stream after the first yield completes but before the subsequent error message chunks are delivered to the reader side. The readErrorMessage() function on the receiving end then finds no data after the status byte and returns an empty string.
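
The failure mode described above can be reproduced in a deliberately simplified, synchronous model. This is only a sketch: it is not libp2p code, the names `encodeSeparate`, `readBeforeClose` and `ascii` are hypothetical, the message is left unencoded rather than snappy-encoded, and the status value `2` is chosen for illustration.

```typescript
// Simplified, synchronous model of the failure; this is not libp2p code.
// encodeSeparate, readBeforeClose and ascii are hypothetical names.

function ascii(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i);
  return out;
}

// Writer: status byte first, then the (here unencoded) error message.
function* encodeSeparate(statusByte: number, errorBytes: Uint8Array): Generator<Uint8Array> {
  yield Uint8Array.of(statusByte); // chunk 1
  yield errorBytes; // chunk 2
}

// Reader: sees the stream end after `chunksBeforeClose` chunks have arrived.
function readBeforeClose(
  source: Iterable<Uint8Array>,
  chunksBeforeClose: number
): {status: number | undefined; errorMessage: string} {
  const bytes: number[] = [];
  let seen = 0;
  for (const chunk of source) {
    if (seen >= chunksBeforeClose) break;
    bytes.push(...Array.from(chunk));
    seen++;
  }
  const errorMessage = bytes.slice(1).map((b) => String.fromCharCode(b)).join("");
  return {status: bytes[0], errorMessage};
}

const SERVER_ERROR = 2; // status value chosen for illustration
const payload = ascii("TEST_EXAMPLE_ERROR_1234");

// Stream closes after one chunk: the status survives, the message is lost.
console.log(readBeforeClose(encodeSeparate(SERVER_ERROR, payload), 1));
// Stream stays open for both chunks: the message is intact.
console.log(readBeforeClose(encodeSeparate(SERVER_ERROR, payload), 2));
```

In this model, closing after the first chunk yields the same shape as the failing assertion: a correct status with an empty `errorMessage`.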

Fix

Collect the status byte and encoded error message into a single Buffer.concat() yield, ensuring they are delivered atomically through the stream. This eliminates the race condition without changing the wire format.

Notes

  • All existing reqresp unit tests pass (85/85)
  • The wire format is unchanged — the same bytes are sent, just in a single chunk instead of multiple
  • This is consistent with how other protocols handle similar issues (combining header + payload)
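
The claim that only the chunking changes, not the bytes, can be checked with a small sketch. The encoder names and the sample message parts below are hypothetical stand-ins, not the actual Lodestar functions, and `Uint8Array` is used in place of `Buffer` to keep the example self-contained.

```typescript
// Illustrative sketch only: names are hypothetical and the message parts are
// placeholder bytes rather than real snappy-encoded output.

// Join chunks into one contiguous byte array.
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

// Before: status byte and message parts leave the encoder as separate chunks.
function* encodeSeparately(statusByte: number, messageParts: Uint8Array[]): Generator<Uint8Array> {
  yield Uint8Array.of(statusByte);
  yield* messageParts;
}

// After: the same bytes, emitted as one chunk.
function* encodeAsSingleChunk(statusByte: number, messageParts: Uint8Array[]): Generator<Uint8Array> {
  yield concatChunks([Uint8Array.of(statusByte), ...messageParts]);
}

// Placeholder parts: a length prefix followed by three payload bytes.
const parts = [Uint8Array.of(3), Uint8Array.of(0x61, 0x62, 0x63)];
const before = Array.from(encodeSeparately(2, parts));
const after = Array.from(encodeAsSingleChunk(2, parts));

// Chunk counts differ, flattened bytes do not.
console.log(before.length, after.length);
console.log(Array.from(concatChunks(before)), Array.from(concatChunks(after)));
```

A reader that reassembles the stream byte by byte cannot tell the two encoders apart; only the number of write operations differs.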

This PR was authored by an AI contributor. All code was reviewed by sub-agents before submission.

…sponse

The responseEncodeError() function was yielding the error status byte and
snappy-encoded error message as separate chunks through the async generator.
When piped through libp2p stream.sink, this created a race condition where
the stream could close after the status byte was flushed but before the
error message bytes arrived on the reader side.

This resulted in the requesting side receiving the correct error status code
but an empty errorMessage, causing flaky failures in the e2e reqresp tests
('should handle a server error' and 'should handle a server error after
emitting two blocks'). These two tests were the #1 cause of CI E2E failures,
appearing in ~90% of all E2E test failures.

The fix collects the status byte and encoded error message into a single
Buffer.concat() yield, ensuring they are delivered atomically through the
stream.
@lodekeeper requested a review from a team as a code owner on February 14, 2026 17:17
@gemini-code-assist

Contributor

Summary of Changes

Hello @lodekeeper, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a significant source of E2E test flakiness by eliminating a race condition in how reqresp error responses are encoded and transmitted. By ensuring the error status and message are delivered as a single atomic unit, the system now reliably receives complete error information, improving test stability and overall robustness without altering the underlying wire protocol.

Highlights

  • Flaky E2E Tests Fixed: Addressed the primary cause of E2E test flakiness, the reqresp server error tests that appeared in approximately 90% of CI E2E test failures, by ensuring complete error message delivery.
  • Race Condition Resolution: Resolved a race condition where responseEncodeError() yielded error status and message as separate chunks, leading to incomplete error message reception on the client side.
  • Atomic Error Response Delivery: Implemented a fix to combine the error status byte and the encoded error message into a single Buffer.concat() yield, guaranteeing atomic delivery through the stream.
  • Wire Format Preservation: Ensured that the fix maintains the existing wire format, sending the same bytes but consolidated into a single chunk to prevent partial message delivery.

Changelog
  • packages/reqresp/src/encoders/responseEncode.ts
    • Modified the responseEncodeError function to combine the error status byte and the encoded error message into a single Buffer before yielding, preventing partial message delivery.
    • Introduced logic to collect chunks from encodeErrorMessage and concatenate them with the status byte when an error message is present.
Activity
  • No human activity has been recorded on this pull request yet.

@gemini-code-assist bot left a comment

Contributor

Code Review

This pull request effectively addresses a race condition that was causing flaky tests by combining the error status and message into a single atomic chunk. The fix is well-explained and correctly implemented. I've included one suggestion to refactor the code slightly, which will improve its readability and maintainability by reducing code duplication and nesting.

Comment on lines 58 to 71
  if (errorMessage) {
-   yield* encodeErrorMessage(errorMessage, protocol.encoding);
+   // Collect <result> and <error_message> into a single chunk to ensure they are delivered
+   // atomically through the stream. Yielding them separately can cause a race condition where
+   // the stream closes after the status byte is flushed but before the error message arrives
+   // on the reader side, resulting in an empty errorMessage on the request side.
+   const chunks: Buffer[] = [Buffer.from([status])];
+   for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) {
+     chunks.push(chunk);
+   }
+   yield Buffer.concat(chunks);
+ } else {
+   // <result> only, no error message
+   yield Buffer.from([status]);
  }

Contributor

medium

While the logic is correct, the if/else structure introduces some code duplication (e.g., Buffer.from([status])) and nesting. You can improve readability and maintainability by using a guard clause for the case where there is no error message. This flattens the code structure and makes the logic more linear and easier to follow.

  const statusChunk = Buffer.from([status]);

  if (!errorMessage) {
    // <result> only, no error message
    yield statusChunk;
    return;
  }

  // Collect <result> and <error_message> into a single chunk to ensure they are delivered
  // atomically through the stream. Yielding them separately can cause a race condition where
  // the stream closes after the status byte is flushed but before the error message arrives
  // on the reader side, resulting in an empty errorMessage on the request side.
  const chunks: Buffer[] = [statusChunk];
  for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) {
    chunks.push(chunk);
  }
  yield Buffer.concat(chunks);

Contributor Author

Good call @gemini-code-assist — applied the guard clause pattern in be60171. Cleaner flow, no duplication.

Contributor

That's great to hear! I'm glad the suggestion helped improve the code's readability and maintainability.

lodekeeper and others added 2 commits February 14, 2026 17:22
Co-authored-by: gemini-code-assist[bot] <175061342+gemini-code-assist[bot]@users.noreply.github.com>
nflaig added a commit that referenced this pull request Feb 15, 2026
## Motivation

The prover e2e tests (`web3_batch_request`, `web3_provider`,
`cli/cmds/start`) frequently fail with hook timeouts:

```
Error: Hook timed out in 96000ms.
Error: Hook timed out in 80000ms.
```

These 3 tests are the most common E2E failures after the reqresp fix
(#8908).

## Root Cause

`minFinalizedTimeMs` is set to `3 * 8 * 4 * 1000 = 96000ms` (3 epochs of
chain time), but this does **not** include genesis delay. The e2e test
environment has a genesis delay of `30 * 4 = 120s` before the chain
starts producing blocks.

The actual time from env start to finalization:
- Genesis delay: ~120s
- 3 epochs to finalize: ~96s  
- **Total: ~216s** vs **96s timeout**

The tests only passed when other tests happened to run first and consume
enough wall time for the chain to progress past finalization.

Additionally, `start.test.ts` used a hardcoded `80000ms` timeout — even
shorter than `minFinalizedTimeMs`.

## Fix

1. Include `genesisDelaySeconds` in `minFinalizedTimeMs`: `(genesisDelay
+ 3 epochs) * 1000 = 216s`
2. Use `minFinalizedTimeMs` consistently in `start.test.ts` instead of
hardcoded `80000ms`
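
The timeout arithmetic above can be written out as a short sketch. The numeric values come from this commit message; the constant names (other than `minFinalizedTimeMs` and `genesisDelaySeconds`) are illustrative and may not match the test utilities.

```typescript
// Values taken from the commit message; most constant names are illustrative.
const SECONDS_PER_SLOT = 4;
const SLOTS_PER_EPOCH = 8;
const GENESIS_DELAY_SLOTS = 30;
const EPOCHS_TO_FINALIZE = 3;

// Wall-clock time before the chain starts producing blocks.
const genesisDelaySeconds = GENESIS_DELAY_SLOTS * SECONDS_PER_SLOT;

// Chain time needed to finalize once blocks are being produced.
const finalizationSeconds = EPOCHS_TO_FINALIZE * SLOTS_PER_EPOCH * SECONDS_PER_SLOT;

// Old timeout: ignores the genesis delay.
const oldMinFinalizedTimeMs = finalizationSeconds * 1000;

// New timeout: genesis delay plus finalization time.
const minFinalizedTimeMs = (genesisDelaySeconds + finalizationSeconds) * 1000;

console.log({genesisDelaySeconds, finalizationSeconds, oldMinFinalizedTimeMs, minFinalizedTimeMs});
```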

## Notes

- No production logic changes — test config only
- This PR was authored by an AI contributor.

---------

Co-authored-by: lodekeeper <lodekeeper@users.noreply.github.com>
Co-authored-by: Nico Flaig <nflaig@protonmail.com>
Comment on lines +70 to +74
const chunks: Buffer[] = [statusChunk];
for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) {
  chunks.push(chunk);
}
yield Buffer.concat(chunks);

Contributor

The logic seems correct, but the implementation seems more complex than before.

The loop gives the impression that the error message can span multiple chunks, while there can only be one, and a longer error message is truncated to 255 bytes.

Why do we need this particular change, and how does its behavior differ from the earlier implementation?

> before the subsequent error message chunks are delivered to the reader side. The readErrorMessage() function on the receiving end then finds no data after the status byte and returns an empty string.

@lodekeeper Shouldn't that be fixed on the reader side, so that it reads all the response chunks before closing the stream, even when the first chunk carries an error status?

Contributor Author

Good questions @nazarhussain, let me address them:

Why multiple chunks from encodeErrorMessage: It's not just one chunk actually — encodeErrorMessage calls writeSszSnappyPayload which yields at least 2 chunks (varint length prefix + snappy-compressed data), and encodeSnappy may yield additional frames. But I agree the loop makes this less obvious — I could simplify to just collect into a single buffer directly.

Why the writer-side fix vs reader-side: The race condition occurs because the writer yields <result> and <error_message> as separate write operations. The transport layer (libp2p stream) can flush/close between them. On the reader side, readErrorMessage() already tries to wait for data via for await (const buffer of bufferedSource), but it can see the stream end before the error message arrives because the writer's generator hasn't completed its next yield yet.

You're right that making the reader more robust is also worth doing — e.g., waiting with a timeout for error message data rather than accepting an empty stream. But I think the writer fix is the more correct approach: <result> and <error_message> are logically one unit (<error_response>), so they should be written atomically. Splitting what the spec defines as a single <error_response> into multiple write calls is the root cause.

On complexity: I can simplify this. Since we know the error message is max 256 bytes, we could skip the async generator collection and just encode directly into a buffer. Would that address the complexity concern? Or would you prefer exploring the reader-side fix instead?

Contributor

For now simplify the iterator logic.

Contributor Author

Simplified in 8541d03 — extracted encodeErrorMessageToBuffer() that returns a single buffer, so responseEncodeError no longer has the loop/collection logic.
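
The body of `encodeErrorMessageToBuffer()` is not shown in this thread, so the following is only a guess at the general shape of such a refactor, not the merged code. `collectToSingleChunk`, `fakeEncodedMessage` and `responseEncodeErrorSketch` are hypothetical names, and `Uint8Array` stands in for `Buffer`.

```typescript
// Hypothetical sketch; the actual encodeErrorMessageToBuffer() is not shown in this thread.

// Drain an async source and return its bytes, prefixed, as one contiguous array.
async function collectToSingleChunk(
  prefix: Uint8Array,
  source: AsyncIterable<Uint8Array>
): Promise<Uint8Array> {
  const parts: Uint8Array[] = [prefix];
  let total = prefix.length;
  for await (const part of source) {
    parts.push(part);
    total += part.length;
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}

// Stand-in for an encoder that yields a length prefix and a payload separately.
async function* fakeEncodedMessage(): AsyncGenerator<Uint8Array> {
  yield Uint8Array.of(3);
  yield Uint8Array.of(0x61, 0x62, 0x63);
}

// The response encoder keeps a single yield and no collection loop of its own.
async function* responseEncodeErrorSketch(statusByte: number): AsyncGenerator<Uint8Array> {
  yield await collectToSingleChunk(Uint8Array.of(statusByte), fakeEncodedMessage());
}
```

Moving the collection into a helper keeps the loop out of the response encoder while still producing one chunk per error response.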

@nazarhussain enabled auto-merge (squash) on February 16, 2026 09:56
@nazarhussain merged commit 279c11d into ChainSafe:unstable on Feb 16, 2026
18 of 19 checks passed
@codecov bot commented Feb 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.33%. Comparing base (21f7588) to head (8541d03).
⚠️ Report is 4 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #8908      +/-   ##
============================================
- Coverage     52.33%   52.33%   -0.01%     
============================================
  Files           848      848              
  Lines         63437    63433       -4     
  Branches       4702     4702              
============================================
- Hits          33199    33195       -4     
  Misses        30169    30169              
  Partials         69       69              

@wemeetagain

Member

🎉 This PR is included in v1.41.0 🎉
