perf: avoid O(N²) when a single SSE event spans many chunks #28

Merged
rexxars merged 1 commit into rexxars:main from emoralesb05:perf/streaming-investigation on Apr 19, 2026
@emoralesb05

Summary

createParser().feed(chunk) is O(N²) in total bytes when a single SSE line spans many chunks. The cost is incompleteLine = incompleteLine + chunk, which allocates a new string sized at all-bytes-seen-so-far on every call.

This swaps the string accumulator for pendingFragments: string[], joined exactly once when a line terminator finally arrives. The hot path is gated by a single pendingFragments.length === 0 check and delegates straight to processLines(chunk) exactly like the original.

End-to-end win on a real MCP workload: 93 s → 6.8 s parsing a 280 MB payload (≈14×). Synthetic bench: ~640× on the worst-case shape.

Background

Prior work addressed intra-chunk line splitting: #19 closed by 3.0.1's splitLines rewrite (8952917), and vercel/ai#5862 closed after 3.0.1.

Neither touched the per-feed concat. It only really hurts when a single SSE event carries a large payload streamed in many small chunks. That shape shows up with LLM responses without intra-event newlines, MCP-over-SSE servers like mcp-clickhouse that emit results as one content block, or any consumer chunking on small TCP/TLS frames. 3.0.7's perf refactor improved per-line work but left the concatenation pattern intact.

The bug

src/parse.ts (3.0.7):

let incompleteLine = ''

function feed(chunk: string) {
  // ...
  const input = incompleteLine === '' ? chunk : incompleteLine + chunk
  incompleteLine = processLines(input)
}

For a stream where one SSE line is N bytes split across K chunks (no terminator until the very end), every feed() allocates a string sized at the running total. Total work is Σ(i × chunk_size) for i = 1…K, which is O(N²/chunk_size).
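The bound follows from an arithmetic series. As a sketch, assume K equal chunks of c bytes each (so N = Kc; the symbol c for chunk_size is introduced here for illustration) and that the i-th feed() touches all i·c bytes buffered so far:

```latex
\sum_{i=1}^{K} i\,c
  \;=\; c\,\frac{K(K+1)}{2}
  \;\approx\; \frac{c\,K^{2}}{2}
  \;=\; \frac{N^{2}}{2c}
  \;=\; O\!\left(\frac{N^{2}}{c}\right)
```

Under this model, halving the chunk size doubles the total work for the same payload, which is why small TCP/TLS frames make the pattern worse.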

Reproducing

The new huge-line-drip bench fixture (256 KiB payload chunked into 1–8 byte slices, no terminator until the end) on main:

feed() — huge-line-drip   1300.00 ms/iter   (~30 MB allocated)
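For readers who want to build a similar input by hand, here is a minimal sketch of a drip-style fixture. It is not the bench code from this PR: `makeHugeLineDrip`, the seeded generator, and the `'x'` filler payload are all assumptions made for illustration.

```typescript
// Hypothetical fixture builder; names and the PRNG are illustrative,
// not the repository's actual bench code.
function makeHugeLineDrip(payloadBytes: number, seed = 1): string[] {
  // One SSE data line with no newline inside the payload.
  const body = 'data: ' + 'x'.repeat(payloadBytes)

  // Small deterministic LCG so the slicing is reproducible across runs.
  let state = seed >>> 0
  const nextSize = (): number => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0
    return (state >>> 29) + 1 // top 3 bits give 0..7, so sizes are 1..8
  }

  const chunks: string[] = []
  let offset = 0
  while (offset < body.length) {
    const size = Math.min(nextSize(), body.length - offset)
    chunks.push(body.slice(offset, offset + size))
    offset += size
  }
  chunks.push('\n\n') // the terminator arrives only in the final chunk
  return chunks
}

// 256 KiB payload, dripped in 1-8 character slices.
const dripChunks = makeHugeLineDrip(256 * 1024)
```

Feeding `dripChunks` one by one into a parser exercises the worst case described above: every chunk except the last leaves a partial line behind.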

Per-chunk cost grows linearly with the buffer already accumulated. Classic "string concat in a loop" fingerprint. Measured against a real 280 MB MCP payload:

| Buffered so far | Per-chunk processing |
| --- | --- |
| 3 MB | 1 ms |
| 66 MB | 8 ms |
| 131 MB | 16 ms |
| 197 MB | 27 ms |
| 262 MB | 37 ms |
| 288 MB | 79 ms |

For a real-world reproducer: point @ai-sdk/mcp at mcp-clickhouse against any ClickHouse instance with a ≥1M-row table, issue SELECT * LIMIT 1000000. mcp-clickhouse builds the full result string and emits it as a single SSE message event.

The fix

// Hot path: no buffered prefix from a prior partial line. Hand the chunk
// straight to processLines, exactly like the original implementation.
// Zero new work in the common case (every chunk ends with `\n\n`).
if (pendingFragments.length === 0) {
  const trailing = processLines(chunk)
  if (trailing !== '') pendingFragments.push(trailing)
  return
}

// We have a buffered prefix. If this chunk also has no terminator, append
// to the buffer without concatenating. That's the O(N²) trap we're avoiding.
if (chunk.indexOf('\n') === -1 && chunk.indexOf('\r') === -1) {
  pendingFragments.push(chunk)
  return
}

// Terminator arrived. Join the accumulated fragments + this chunk once,
// process, and buffer any new trailing partial line.
pendingFragments.push(chunk)
const input = pendingFragments.join('')
pendingFragments.length = 0
const trailing = processLines(input)
if (trailing !== '') pendingFragments.push(trailing)

processLines and parseLine internals are unchanged. reset() adapts to the array form. feedFirst is inlined into feed (one BOM check on the first chunk, gated by the existing isFirstChunk flag).

Why the hot path stays free: well-formed SSE chunks end with \n\n. processLines consumes them, returns '', and nothing gets pushed. The next call sees pendingFragments.length === 0 and takes the same path the original code did. The buffering only kicks in once a chunk leaves a partial line behind.
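The three branches can be tried in isolation with a small stand-in. This is a sketch under stated assumptions, not the library's parser: `createLineBuffer`, its `stats()` helper, and the simplified `processLines` (which only recognizes `\n` and `\r\n` endings and does no SSE field parsing) are hypothetical.

```typescript
// Illustrative only: a line-level stand-in, not the library's parser.
// `createLineBuffer` and `stats` are hypothetical names.
function createLineBuffer(onLine: (line: string) => void) {
  let pendingFragments: string[] = []
  let joins = 0 // how many times buffered fragments were joined

  // Emits every complete line in `input`, returns the trailing partial line.
  // Assumption: only '\n' and '\r\n' are treated as line endings here.
  function processLines(input: string): string {
    let start = 0
    let end = input.indexOf('\n', start)
    while (end !== -1) {
      const line = input.slice(start, end)
      onLine(line.endsWith('\r') ? line.slice(0, -1) : line)
      start = end + 1
      end = input.indexOf('\n', start)
    }
    return input.slice(start)
  }

  function feed(chunk: string): void {
    if (pendingFragments.length === 0) {
      // Hot path: nothing buffered, so process the chunk directly.
      const trailing = processLines(chunk)
      if (trailing !== '') pendingFragments.push(trailing)
      return
    }
    if (chunk.indexOf('\n') === -1 && chunk.indexOf('\r') === -1) {
      // No terminator yet: remember the fragment, copy nothing.
      pendingFragments.push(chunk)
      return
    }
    // Terminator arrived: join once, process, keep any new partial line.
    pendingFragments.push(chunk)
    const input = pendingFragments.join('')
    pendingFragments = []
    joins++
    const trailing = processLines(input)
    if (trailing !== '') pendingFragments.push(trailing)
  }

  return {
    feed,
    reset: (): void => { pendingFragments = [] },
    stats: () => ({ joins, buffered: pendingFragments.length }),
  }
}

const seen: string[] = []
const buffer = createLineBuffer((line) => seen.push(line))
for (const chunk of ['data: he', 'llo ', 'wor', 'ld\r\n', '\n', 'data: next\n\n']) {
  buffer.feed(chunk)
}
// seen is ['data: hello world', '', 'data: next', ''] and stats().joins is 1
```

The first line arrives in four fragments but is joined only once, and the last two chunks never touch the buffer at all because nothing was pending when they arrived.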

Validation

Synthetic bench

Matched-clock comparison (Apple M4 Max, Node 24.11, mitata defaults):

| Bench | Baseline | Fix | Δ |
| --- | --- | --- | --- |
| data-only | 13.10 µs | 13.22 µs | +1% |
| named-event | 7.00 µs | 7.28 µs | +4% |
| identified-event | 10.14 µs | 10.65 µs | +5% |
| multibyte | 10.56 µs | 10.79 µs | +2% |
| heartbeat | 7.44 µs | 7.36 µs | −1% |
| idle-stream | 20.66 µs | 7.95 µs | −62% |
| small-chunk | 143.78 µs | 114.69 µs | −22% |
| large-multiline-data | 13.22 µs | 13.30 µs | +1% |
| huge-line-drip | 1.18 s | 1.84 ms | ~640× |
| edge-cases | 5.79 µs | 6.21 µs | +7% |

Run-to-run variance on the same code across three identical invocations ranged from 1% (edge-cases) to 52% (data-only) on this hardware, so sub-10% deltas on small fixtures aren't meaningful signals. The wins on idle-stream, small-chunk, and huge-line-drip are well outside that band.

End-to-end (real workload)

1M rows / 280 MB payload via mcp-clickhouse → ClickHouse Cloud, fetched through @ai-sdk/mcp's streamable-HTTP transport. Numbers came from a downstream MCP integration during an unrelated latency investigation:

| Version | Total time | RSS peak |
| --- | --- | --- |
| 3.0.6 | 93 s | 3.9 GB |
| 3.0.7 (the perf refactor) | 101 s | 2.3 GB |
| 3.0.7 + this fix | 6.8 s | 2.4 GB |

3.0.7 lowered peak RSS (3.9 GB → 2.3 GB) but didn't improve wall time on this shape (93 s → 101 s), which points to the per-feed concat being the dominant cost for streams where a single event spans many chunks.

Instrumented call distribution on the 280 MB payload: 4,436 feed() calls hit the new fast path (chunk had no terminator, so it was just appended to the buffer with no concat); 2 hit the slow path (the first chunk carrying the event:/data: header, and the last chunk carrying the \n\n terminator). Exactly what the design predicted.
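A distribution like this can be approximated without a real parser by classifying each chunk according to the branch feed() would take. The helper below is a hypothetical sketch, not the instrumentation used for the numbers above: `countFeedPaths` and its path names are made up, it assumes a partial line is whatever follows the last `\n` or `\r` in a chunk, and the 16-character middle chunks are placeholders rather than the real payload.

```typescript
type FeedPath = 'direct' | 'append' | 'join'

// Hypothetical helper: classifies each chunk by the branch feed() would take,
// assuming a partial line is whatever follows the last '\n' or '\r'.
function countFeedPaths(chunks: string[]): Record<FeedPath, number> {
  const counts: Record<FeedPath, number> = { direct: 0, append: 0, join: 0 }
  let hasPending = false
  for (const chunk of chunks) {
    const lastTerminator = Math.max(chunk.lastIndexOf('\n'), chunk.lastIndexOf('\r'))
    if (!hasPending) {
      counts.direct++ // nothing buffered: chunk goes straight to processLines
    } else if (lastTerminator === -1) {
      counts.append++ // buffered prefix, no terminator: push only
      continue // still buffering, pending state unchanged
    } else {
      counts.join++ // buffered prefix plus terminator: join once
    }
    // A partial line remains if any text follows the last terminator.
    hasPending = lastTerminator !== chunk.length - 1
  }
  return counts
}

// One event: header chunk, many terminator-free payload chunks, then '\n\n'.
const middle: string[] = Array.from({ length: 4436 }, () => 'x'.repeat(16))
const pathCounts = countFeedPaths(['event: message\ndata: ', ...middle, '\n\n'])
// pathCounts is { direct: 1, append: 4436, join: 1 }
```

The shape mirrors the measurement: one call for the header chunk, one join when the terminator lands, and every call in between is a plain array push.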

Replace the per-feed `incompleteLine + chunk` concat with a `pendingFragments: string[]` joined once when a terminator arrives.

~640× faster on the new `huge-line-drip` bench fixture; ~14× faster end-to-end on a real 280 MB MCP workload (93 s → 6.8 s). Hot path unchanged, all 42 existing tests pass.
@emoralesb05

Let me know if there is anything wrong with this! I had to pnpm patch it in our repo but would love to remove that patch and have it be fixed at the source!

I tried to benchmark it as much as possible with your framework, but also via @ai-sdk/mcp directly against ClickHouse with different sizes and payloads. I hope this helps!

@rexxars rexxars merged commit 4c41223 into rexxars:main Apr 19, 2026
1 check passed
@rexxars

rexxars commented Apr 19, 2026

Thanks a bunch! Appreciate it ❤️

@emoralesb05 emoralesb05 deleted the perf/streaming-investigation branch April 19, 2026 17:52
@emoralesb05 emoralesb05 restored the perf/streaming-investigation branch April 19, 2026 18:29