perf: avoid O(N²) when a single SSE event spans many chunks #28
Merged
Replace the per-feed `incompleteLine + chunk` concat with a `pendingFragments: string[]` joined once when a terminator arrives. ~640× faster on the new `huge-line-drip` bench fixture; ~14× faster end-to-end on a real 280 MB MCP workload (93 s → 6.8 s). Hot path unchanged, all 42 existing tests pass.
Contributor
Author
Let me know if there is anything wrong with this! I had to pnpm patch it in our repo but would love to remove that patch and have it be fixed at the source! I tried to benchmark it as much as possible with your framework but also via the
Owner
Thanks a bunch! Appreciate it ❤️
Summary
`createParser().feed(chunk)` is O(N²) in total bytes when a single SSE line spans many chunks. The cost is `incompleteLine = incompleteLine + chunk`, which allocates a new string sized at all-bytes-seen-so-far on every call.
This swaps the string accumulator for `pendingFragments: string[]`, joined exactly once when a line terminator finally arrives. The hot path is gated by a single `pendingFragments.length === 0` check and delegates straight to `processLines(chunk)` exactly like the original.
End-to-end win on a real MCP workload: 93 s → 6.8 s parsing a 280 MB payload (≈14×). Synthetic bench: ~640× on the worst-case shape.
Background
Prior work addressed intra-chunk line splitting: #19 closed by 3.0.1's `splitLines` rewrite (8952917), and vercel/ai#5862 closed after 3.0.1.
Neither touched the per-feed concat. It only really hurts when a single SSE event carries a large payload streamed in many small chunks. That shape shows up with LLM responses without intra-event newlines, MCP-over-SSE servers like `mcp-clickhouse` that emit results as one content block, or any consumer chunking on small TCP/TLS frames. 3.0.7's perf refactor improved per-line work but left the concatenation pattern intact.
The bug
`src/parse.ts` (3.0.7):
For a stream where one SSE line is N bytes split across K chunks (no terminator until the very end), every `feed()` allocates a string sized at the running total. Total work is `Σ(i × chunk_size)` ≈ `O(N²/chunk_size)`.
Reproducing
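To make the quadratic shape concrete without running the bench, here is a small cost-model sketch. It does not time anything and is not part of the library: the function names are hypothetical, and it simply assumes that every `buffer + chunk` copies the whole running buffer, as described above, while a join copies each fragment once.

```typescript
// Cost model only (hypothetical helper names): counts characters copied,
// assuming each concat allocates a string sized at everything seen so far.
function copiedByConcat(totalBytes: number, chunkSize: number): number {
  let buffered = 0;
  let copied = 0;
  while (buffered < totalBytes) {
    const size = Math.min(chunkSize, totalBytes - buffered);
    buffered += size;
    copied += buffered; // new string holds the full running total
  }
  return copied;
}

// With an array of fragments joined once, each character is copied once.
function copiedByJoin(totalBytes: number): number {
  return totalBytes;
}

const payloadBytes = 256 * 1024; // same payload size as the bench fixture
const chunkBytes = 4; // assumed fixed chunk size, for simplicity
const concatCost = copiedByConcat(payloadBytes, chunkBytes);
const joinCost = copiedByJoin(payloadBytes);
console.log({ concatCost, joinCost, ratio: concatCost / joinCost });
```

With K = N / chunk_size equal chunks the concat model sums to chunk_size × K(K+1)/2, which is roughly N² / (2 × chunk_size), matching the estimate above. Real engines may defer the copy, so treat the ratio as a shape indicator rather than a predicted speedup.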
The new `huge-line-drip` bench fixture (256 KiB payload chunked into 1–8 byte slices, no terminator until the end) on `main`:
Per-chunk cost grows linearly with the buffer already accumulated. Classic "string concat in a loop" fingerprint. Measured against a real 280 MB MCP payload:
For a real-world reproducer: point `@ai-sdk/mcp` at `mcp-clickhouse` against any ClickHouse instance with a ≥1M-row table, issue `SELECT * LIMIT 1000000`. `mcp-clickhouse` builds the full result string and emits it as a single SSE `message` event.
The fix
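For readers who want the idea in isolation, here is a minimal sketch of the fragment-buffer approach. It is not the code in `src/parse.ts`: `createLineBuffer` and `onLine` are hypothetical names, the local `processLines` is a simplified stand-in that only treats `\n` as a terminator (real SSE also allows `\r` and `\r\n`), and BOM handling is omitted. Only the `pendingFragments` array and the `length === 0` gate are taken from the PR description.

```typescript
// Illustrative sketch, not the library implementation.
function createLineBuffer(onLine: (line: string) => void) {
  let pendingFragments: string[] = [];

  // Emits every complete line in `text`; returns the unterminated tail ('' if none).
  function processLines(text: string): string {
    let start = 0;
    let end = text.indexOf("\n", start);
    while (end !== -1) {
      onLine(text.slice(start, end));
      start = end + 1;
      end = text.indexOf("\n", start);
    }
    return text.slice(start);
  }

  function feed(chunk: string): void {
    if (pendingFragments.length === 0) {
      // Hot path: nothing buffered, process the chunk directly.
      const rest = processLines(chunk);
      if (rest !== "") pendingFragments.push(rest);
      return;
    }
    pendingFragments.push(chunk);
    if (chunk.indexOf("\n") === -1) {
      return; // still no terminator: keep the fragment, copy nothing
    }
    // A terminator arrived: join exactly once, then process as usual.
    const joined = pendingFragments.join("");
    pendingFragments = [];
    const rest = processLines(joined);
    if (rest !== "") pendingFragments.push(rest);
  }

  return { feed };
}

// Usage: one line split across several chunks, then a blank line.
const lines: string[] = [];
const lineBuffer = createLineBuffer((line) => lines.push(line));
for (const chunk of ["da", "ta: he", "llo\nda", "ta: x\n\n"]) {
  lineBuffer.feed(chunk);
}
console.log(lines);
```

The design point is that a chunk without a terminator costs one array push, regardless of how much is already buffered, and the full-size copy happens once per line instead of once per chunk.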
`processLines` and `parseLine` internals are unchanged. `reset()` adapts to the array form. `feedFirst` is inlined into `feed` (one BOM check on the first chunk, gated by the existing `isFirstChunk` flag).
Why the hot path stays free: well-formed SSE chunks end with `\n\n`. `processLines` consumes them, returns `''`, and nothing gets pushed. The next call sees `pendingFragments.length === 0` and takes the same path the original code did. The buffering only kicks in once a chunk leaves a partial line behind.
Validation
Synthetic bench
Matched-clock comparison (Apple M4 Max, Node 24.11, mitata defaults):
Run-to-run variance on the same code across three identical invocations ranged from 1% (`edge-cases`) to 52% (`data-only`) on this hardware, so sub-10% deltas on small fixtures aren't meaningful signals. The wins on `idle-stream`, `small-chunk`, and `huge-line-drip` are well outside that band.
End-to-end (real workload)
1M rows / 280 MB payload via `mcp-clickhouse` → ClickHouse Cloud, fetched through `@ai-sdk/mcp`'s streamable-HTTP transport. Numbers came from a downstream MCP integration during an unrelated latency investigation:
3.0.7 lowered RSS but didn't change wall time on this shape, which points to the per-feed concat being the dominant cost for streams where a single event spans many chunks.
Instrumented call distribution on the 280 MB payload: 4,436 `feed()` calls hit the new fast path (no concat, chunk had no terminator, just appended to the buffer); 2 hit the slow path (the first chunk carrying the `event:`/`data:` header, and the last chunk carrying the `\n\n` terminator). Exactly what the design predicted.
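A call distribution of this kind can be approximated offline with a small classifier. The sketch below is an assumption-laden stand-in, not the instrumentation used for the numbers above: `classifyFeeds` and the three path labels are hypothetical, and only `\n` is treated as a terminator.

```typescript
// Hypothetical classifier: which branch would each chunk take in a
// fragment-buffering feed()? Only "\n" counts as a terminator here.
type FeedPath = "direct" | "append" | "join";

function classifyFeeds(chunks: string[]): Record<FeedPath, number> {
  const counts: Record<FeedPath, number> = { direct: 0, append: 0, join: 0 };
  let buffered = false; // true while an unterminated fragment is pending
  for (const chunk of chunks) {
    const hasTerminator = chunk.indexOf("\n") !== -1;
    if (!buffered) counts.direct++; // nothing pending: processed directly
    else if (!hasTerminator) counts.append++; // pending, no terminator: push only
    else counts.join++; // pending and terminator: join once
    // Something stays pending iff the text seen so far ends mid-line.
    buffered = hasTerminator
      ? chunk.charAt(chunk.length - 1) !== "\n"
      : buffered || chunk.length > 0;
  }
  return counts;
}

// One event whose single data line is dripped in 4-character chunks.
const eventText = "data: " + "x".repeat(20) + "\n\n";
const dripChunks: string[] = [];
for (let i = 0; i < eventText.length; i += 4) {
  dripChunks.push(eventText.slice(i, i + 4));
}
const distribution = classifyFeeds(dripChunks);
console.log(distribution);
```

For a single long line, this yields one direct call for the first chunk, one join for the chunk carrying the terminator, and appends for everything in between, which is the same shape as the distribution reported above.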