fix: combine error status and message into single chunk in reqresp response#8908
Conversation
…sponse
The responseEncodeError() function was yielding the error status byte and
snappy-encoded error message as separate chunks through the async generator.
When piped through libp2p stream.sink, this created a race condition where
the stream could close after the status byte was flushed but before the
error message bytes arrived on the reader side.
This resulted in the requesting side receiving the correct error status code
but an empty errorMessage, causing flaky failures in the e2e reqresp tests
('should handle a server error' and 'should handle a server error after
emitting two blocks'). These two tests were the #1 cause of CI E2E failures,
appearing in ~90% of all E2E test failures.
The fix collects the status byte and encoded error message into a single
Buffer.concat() yield, ensuring they are delivered atomically through the
stream.
Summary of ChangesHello @lodekeeper, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a significant source of E2E test flakiness by eliminating a race condition in how Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request effectively addresses a race condition that was causing flaky tests by combining the error status and message into a single atomic chunk. The fix is well-explained and correctly implemented. I've included one suggestion to refactor the code slightly, which will improve its readability and maintainability by reducing code duplication and nesting.
| if (errorMessage) { | ||
| yield* encodeErrorMessage(errorMessage, protocol.encoding); | ||
| // Collect <result> and <error_message> into a single chunk to ensure they are delivered | ||
| // atomically through the stream. Yielding them separately can cause a race condition where | ||
| // the stream closes after the status byte is flushed but before the error message arrives | ||
| // on the reader side, resulting in an empty errorMessage on the request side. | ||
| const chunks: Buffer[] = [Buffer.from([status])]; | ||
| for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) { | ||
| chunks.push(chunk); | ||
| } | ||
| yield Buffer.concat(chunks); | ||
| } else { | ||
| // <result> only, no error message | ||
| yield Buffer.from([status]); | ||
| } |
There was a problem hiding this comment.
While the logic is correct, the if/else structure introduces some code duplication (e.g., Buffer.from([status])) and nesting. You can improve readability and maintainability by using a guard clause for the case where there is no error message. This flattens the code structure and makes the logic more linear and easier to follow.
const statusChunk = Buffer.from([status]);
if (!errorMessage) {
// <result> only, no error message
yield statusChunk;
return;
}
// Collect <result> and <error_message> into a single chunk to ensure they are delivered
// atomically through the stream. Yielding them separately can cause a race condition where
// the stream closes after the status byte is flushed but before the error message arrives
// on the reader side, resulting in an empty errorMessage on the request side.
const chunks: Buffer[] = [statusChunk];
for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) {
chunks.push(chunk);
}
yield Buffer.concat(chunks);There was a problem hiding this comment.
Good call @gemini-code-assist — applied the guard clause pattern in be60171. Cleaner flow, no duplication.
There was a problem hiding this comment.
That's great to hear! I'm glad the suggestion helped improve the code's readability and maintainability.
Co-authored-by: gemini-code-assist[bot] <175061342+gemini-code-assist[bot]@users.noreply.github.com>
## Motivation The prover e2e tests (`web3_batch_request`, `web3_provider`, `cli/cmds/start`) frequently fail with hook timeouts: ``` Error: Hook timed out in 96000ms. Error: Hook timed out in 80000ms. ``` These 3 tests are the most common E2E failures after the reqresp fix (#8908). ## Root Cause `minFinalizedTimeMs` is set to `3 * 8 * 4 * 1000 = 96000ms` (3 epochs of chain time), but this does **not** include genesis delay. The e2e test environment has a genesis delay of `30 * 4 = 120s` before the chain starts producing blocks. The actual time from env start to finalization: - Genesis delay: ~120s - 3 epochs to finalize: ~96s - **Total: ~216s** vs **96s timeout** The tests only passed when other tests happened to run first and consume enough wall time for the chain to progress past finalization. Additionally, `start.test.ts` used a hardcoded `80000ms` timeout — even shorter than `minFinalizedTimeMs`. ## Fix 1. Include `genesisDelaySeconds` in `minFinalizedTimeMs`: `(genesisDelay + 3 epochs) * 1000 = 216s` 2. Use `minFinalizedTimeMs` consistently in `start.test.ts` instead of hardcoded `80000ms` ## Notes - No production logic changes — test config only - This PR was authored by an AI contributor. --------- Co-authored-by: lodekeeper <lodekeeper@users.noreply.github.com> Co-authored-by: Nico Flaig <nflaig@protonmail.com>
| const chunks: Buffer[] = [statusChunk]; | ||
| for await (const chunk of encodeErrorMessage(errorMessage, protocol.encoding)) { | ||
| chunks.push(chunk); | ||
| } | ||
| yield Buffer.concat(chunks); |
There was a problem hiding this comment.
The logic seems correct, but implementation seems more complex than before.
The loop gives an impression that there are multiple chunks for error message, while there can only be one and the longer error message is stripped down to 255 bytes.
Why we need this particular change? and how it's different in behavior from earlier implementation?
before the subsequent error message chunks are delivered to the reader side. The readErrorMessage() function on the receiving end then finds no data after the status byte and returns an empty string.
@lodekeeper Should not that be fixed on the reader side that it must read all the response chunks before closing the stream even in case of error status in first chunk?
There was a problem hiding this comment.
Good questions @nazarhussain, let me address them:
Why multiple chunks from encodeErrorMessage: It's not just one chunk actually — encodeErrorMessage calls writeSszSnappyPayload which yields at least 2 chunks (varint length prefix + snappy-compressed data), and encodeSnappy may yield additional frames. But I agree the loop makes this less obvious — I could simplify to just collect into a single buffer directly.
Why the writer-side fix vs reader-side: The race condition occurs because the writer yields <result> and <error_message> as separate write operations. The transport layer (libp2p stream) can flush/close between them. On the reader side, readErrorMessage() already tries to wait for data via for await (const buffer of bufferedSource), but it can see the stream end before the error message arrives because the writer's generator hasn't completed its next yield yet.
You're right that making the reader more robust is also worth doing — e.g., waiting with a timeout for error message data rather than accepting an empty stream. But I think the writer fix is the more correct approach: <result> and <error_message> are logically one unit (<error_response>), so they should be written atomically. Splitting what the spec defines as a single <error_response> into multiple write calls is the root cause.
On complexity: I can simplify this. Since we know the error message is max 256 bytes, we could skip the async generator collection and just encode directly into a buffer. Would that address the complexity concern? Or would you prefer exploring the reader-side fix instead?
There was a problem hiding this comment.
For now simplify the iterator logic.
There was a problem hiding this comment.
Simplified in 8541d03 — extracted encodeErrorMessageToBuffer() that returns a single buffer, so responseEncodeError no longer has the loop/collection logic.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #8908 +/- ##
============================================
- Coverage 52.33% 52.33% -0.01%
============================================
Files 848 848
Lines 63437 63433 -4
Branches 4702 4702
============================================
- Hits 33199 33195 -4
Misses 30169 30169
Partials 69 69 🚀 New features to boost your workflow:
|
|
🎉 This PR is included in v1.41.0 🎉 |
Motivation
The e2e reqresp tests "should handle a server error" and "should handle a server error after emitting two blocks" have been consistently flaky, appearing in ~90% of all CI E2E test failures. Analysis of ~100 recent CI runs confirmed this as the #1 source of E2E flakiness.
The failure pattern:
The error status code is received correctly, but the error message is empty.
Root Cause
responseEncodeError()yields the error status byte and snappy-encoded error message as separate chunks through the async generator:When piped through
stream.sink, libp2p can close/flush the stream after the first yield completes but before the subsequent error message chunks are delivered to the reader side. ThereadErrorMessage()function on the receiving end then finds no data after the status byte and returns an empty string.Fix
Collect the status byte and encoded error message into a single
Buffer.concat()yield, ensuring they are delivered atomically through the stream. This eliminates the race condition without changing the wire format.Notes