fix: combine error status and message into single chunk in reqresp response by lodekeeper · Pull Request #8908 · ChainSafe/lodestar

lodekeeper · 2026-02-14T17:17:24Z

Motivation

The e2e reqresp tests "should handle a server error" and "should handle a server error after emitting two blocks" have been consistently flaky, appearing in ~90% of all CI E2E test failures. Analysis of ~100 recent CI runs confirmed this as the #1 source of E2E flakiness.

The failure pattern:

expected { code: "REQUEST_ERROR_SERVER_ERROR", errorMessage: "" }
to deeply equal { code: "REQUEST_ERROR_SERVER_ERROR", errorMessage: "TEST_EXAMPLE_ERROR_1234" }

The error status code is received correctly, but the error message is empty.

Root Cause

responseEncodeError() yields the error status byte and snappy-encoded error message as separate chunks through the async generator:

yield Buffer.from([status]);  // chunk 1
yield* encodeErrorMessage(errorMessage, protocol.encoding);  // chunk 2+

When piped through stream.sink, libp2p can close/flush the stream after the first yield completes but before the subsequent error message chunks are delivered to the reader side. The readErrorMessage() function on the receiving end then finds no data after the status byte and returns an empty string.

Fix

Collect the status byte and encoded error message into a single Buffer.concat() yield, ensuring they are delivered atomically through the stream. This eliminates the race condition without changing the wire format.

Notes

All existing reqresp unit tests pass (85/85)
The wire format is unchanged — the same bytes are sent, just in a single chunk instead of multiple
This is consistent with how other protocols handle similar issues (combining header + payload)

This PR was authored by an AI contributor. All code was reviewed by sub-agents before submission.

…sponse The responseEncodeError() function was yielding the error status byte and snappy-encoded error message as separate chunks through the async generator. When piped through libp2p stream.sink, this created a race condition where the stream could close after the status byte was flushed but before the error message bytes arrived on the reader side. This resulted in the requesting side receiving the correct error status code but an empty errorMessage, causing flaky failures in the e2e reqresp tests ('should handle a server error' and 'should handle a server error after emitting two blocks'). These two tests were the #1 cause of CI E2E failures, appearing in ~90% of all E2E test failures. The fix collects the status byte and encoded error message into a single Buffer.concat() yield, ensuring they are delivered atomically through the stream.

gemini-code-assist · 2026-02-14T17:17:36Z

Summary of Changes

Hello @lodekeeper, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a significant source of E2E test flakiness by eliminating a race condition in how reqresp error responses are encoded and transmitted. By ensuring the error status and message are delivered as a single atomic unit, the system now reliably receives complete error information, improving test stability and overall robustness without altering the underlying wire protocol.

Highlights

Flaky E2E Tests Fixed: Addressed the primary cause of E2E test flakiness, accounting for approximately 90% of reqresp server error test failures, by ensuring complete error message delivery.
Race Condition Resolution: Resolved a race condition where responseEncodeError() yielded error status and message as separate chunks, leading to incomplete error message reception on the client side.
Atomic Error Response Delivery: Implemented a fix to combine the error status byte and the encoded error message into a single Buffer.concat() yield, guaranteeing atomic delivery through the stream.
Wire Format Preservation: Ensured that the fix maintains the existing wire format, sending the same bytes but consolidated into a single chunk to prevent partial message delivery.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

packages/reqresp/src/encoders/responseEncode.ts
- Modified the responseEncodeError function to combine the error status byte and the encoded error message into a single Buffer before yielding, preventing partial message delivery.
- Introduced logic to collect chunks from encodeErrorMessage and concatenate them with the status byte when an error message is present.

Activity

No human activity has been recorded on this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request effectively addresses a race condition that was causing flaky tests by combining the error status and message into a single atomic chunk. The fix is well-explained and correctly implemented. I've included one suggestion to refactor the code slightly, which will improve its readability and maintainability by reducing code duplication and nesting.

gemini-code-assist · 2026-02-14T17:18:45Z

  if (errorMessage) {
-    yield* encodeErrorMessage(errorMessage, protocol.encoding);
+    // Collect <result> and <error_message> into a single chunk to ensure they are delivered
+    // atomically through the stream. Yielding them separately can cause a race condition where
+    // the stream closes after the status byte is flushed but before the error message arrives
+    // on the reader side, resulting in an empty errorMessage on the request side.
+    const chunks: Buffer[] = [Buffer.from([status])];
+    for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) {
+      chunks.push(chunk);
+    }
+    yield Buffer.concat(chunks);
+  } else {
+    // <result> only, no error message
+    yield Buffer.from([status]);
  }


While the logic is correct, the if/else structure introduces some code duplication (e.g., Buffer.from([status])) and nesting. You can improve readability and maintainability by using a guard clause for the case where there is no error message. This flattens the code structure and makes the logic more linear and easier to follow.

const statusChunk = Buffer.from([status]); if (!errorMessage) { // <result> only, no error message yield statusChunk; return; } // Collect <result> and <error_message> into a single chunk to ensure they are delivered // atomically through the stream. Yielding them separately can cause a race condition where // the stream closes after the status byte is flushed but before the error message arrives // on the reader side, resulting in an empty errorMessage on the request side. const chunks: Buffer[] = [statusChunk]; for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) { chunks.push(chunk); } yield Buffer.concat(chunks);

Good call @gemini-code-assist — applied the guard clause pattern in be60171. Cleaner flow, no duplication.

That's great to hear! I'm glad the suggestion helped improve the code's readability and maintainability.

Co-authored-by: gemini-code-assist[bot] <175061342+gemini-code-assist[bot]@users.noreply.github.com>

## Motivation The prover e2e tests (`web3_batch_request`, `web3_provider`, `cli/cmds/start`) frequently fail with hook timeouts: ``` Error: Hook timed out in 96000ms. Error: Hook timed out in 80000ms. ``` These 3 tests are the most common E2E failures after the reqresp fix (#8908). ## Root Cause `minFinalizedTimeMs` is set to `3 * 8 * 4 * 1000 = 96000ms` (3 epochs of chain time), but this does **not** include genesis delay. The e2e test environment has a genesis delay of `30 * 4 = 120s` before the chain starts producing blocks. The actual time from env start to finalization: - Genesis delay: ~120s - 3 epochs to finalize: ~96s - **Total: ~216s** vs **96s timeout** The tests only passed when other tests happened to run first and consume enough wall time for the chain to progress past finalization. Additionally, `start.test.ts` used a hardcoded `80000ms` timeout — even shorter than `minFinalizedTimeMs`. ## Fix 1. Include `genesisDelaySeconds` in `minFinalizedTimeMs`: `(genesisDelay + 3 epochs) * 1000 = 216s` 2. Use `minFinalizedTimeMs` consistently in `start.test.ts` instead of hardcoded `80000ms` ## Notes - No production logic changes — test config only - This PR was authored by an AI contributor. --------- Co-authored-by: lodekeeper <lodekeeper@users.noreply.github.com> Co-authored-by: Nico Flaig <nflaig@protonmail.com>

nazarhussain · 2026-02-16T09:28:20Z

+  const chunks: Buffer[] = [statusChunk];
+  for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) {
+    chunks.push(chunk);
+  }
+  yield Buffer.concat(chunks);


The logic seems correct, but implementation seems more complex than before.

The loop gives an impression that there are multiple chunks for error message, while there can only be one and the longer error message is stripped down to 255 bytes.

Why we need this particular change? and how it's different in behavior from earlier implementation?

before the subsequent error message chunks are delivered to the reader side. The readErrorMessage() function on the receiving end then finds no data after the status byte and returns an empty string.

@lodekeeper Should not that be fixed on the reader side that it must read all the response chunks before closing the stream even in case of error status in first chunk?

Good questions @nazarhussain, let me address them:

Why multiple chunks from encodeErrorMessage: It's not just one chunk actually — encodeErrorMessage calls writeSszSnappyPayload which yields at least 2 chunks (varint length prefix + snappy-compressed data), and encodeSnappy may yield additional frames. But I agree the loop makes this less obvious — I could simplify to just collect into a single buffer directly.

Why the writer-side fix vs reader-side: The race condition occurs because the writer yields <result> and <error_message> as separate write operations. The transport layer (libp2p stream) can flush/close between them. On the reader side, readErrorMessage() already tries to wait for data via for await (const buffer of bufferedSource), but it can see the stream end before the error message arrives because the writer's generator hasn't completed its next yield yet.

You're right that making the reader more robust is also worth doing — e.g., waiting with a timeout for error message data rather than accepting an empty stream. But I think the writer fix is the more correct approach: <result> and <error_message> are logically one unit (<error_response>), so they should be written atomically. Splitting what the spec defines as a single <error_response> into multiple write calls is the root cause.

On complexity: I can simplify this. Since we know the error message is max 256 bytes, we could skip the async generator collection and just encode directly into a buffer. Would that address the complexity concern? Or would you prefer exploring the reader-side fix instead?

For now simplify the iterator logic.

Simplified in 8541d03 — extracted encodeErrorMessageToBuffer() that returns a single buffer, so responseEncodeError no longer has the loop/collection logic.

codecov · 2026-02-16T10:12:57Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.33%. Comparing base (21f7588) to head (8541d03).
⚠️ Report is 4 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #8908      +/-   ##
============================================
- Coverage     52.33%   52.33%   -0.01%     
============================================
  Files           848      848              
  Lines         63437    63433       -4     
  Branches       4702     4702              
============================================
- Hits          33199    33195       -4     
  Misses        30169    30169              
  Partials         69       69

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

wemeetagain · 2026-03-19T18:11:24Z

🎉 This PR is included in v1.41.0 🎉

lodekeeper requested a review from a team as a code owner February 14, 2026 17:17

gemini-code-assist Bot reviewed Feb 14, 2026

View reviewed changes

lodekeeper and others added 2 commits February 14, 2026 17:22

refactor: use guard clause for cleaner control flow

be60171

Co-authored-by: gemini-code-assist[bot] <175061342+gemini-code-assist[bot]@users.noreply.github.com>

Merge branch 'unstable' into fix/e2e-reqresp-flaky

3c844fe

lodekeeper mentioned this pull request Feb 14, 2026

test: increase prover e2e hook timeout to include genesis delay #8910

Merged

nazarhussain approved these changes Feb 16, 2026

View reviewed changes

nazarhussain requested changes Feb 16, 2026

View reviewed changes

refactor: simplify iterator logic with encodeErrorMessageToBuffer

8541d03

nazarhussain approved these changes Feb 16, 2026

View reviewed changes

nazarhussain enabled auto-merge (squash) February 16, 2026 09:56

nazarhussain merged commit 279c11d into ChainSafe:unstable Feb 16, 2026
18 of 19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: combine error status and message into single chunk in reqresp response#8908

fix: combine error status and message into single chunk in reqresp response#8908
nazarhussain merged 4 commits into
ChainSafe:unstablefrom
lodekeeper:fix/e2e-reqresp-flaky

lodekeeper commented Feb 14, 2026

Uh oh!

gemini-code-assist Bot commented Feb 14, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Uh oh!

lodekeeper Feb 14, 2026

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Uh oh!

nazarhussain Feb 16, 2026

Uh oh!

lodekeeper Feb 16, 2026

Uh oh!

nazarhussain Feb 16, 2026

Uh oh!

lodekeeper Feb 16, 2026

Uh oh!

Uh oh!

codecov Bot commented Feb 16, 2026

Uh oh!

wemeetagain commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

lodekeeper commented Feb 14, 2026

Motivation

Root Cause

Fix

Notes

Uh oh!

gemini-code-assist Bot commented Feb 14, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

lodekeeper Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

nazarhussain Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

lodekeeper Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

nazarhussain Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

lodekeeper Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented Feb 16, 2026

Codecov Report

Uh oh!

wemeetagain commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants