fix(anthropic): preserve all tool_calls when an OpenAI delta contains multiple #26636

Open
freddyhaddad wants to merge 1 commit into BerriAI:main from freddyhaddad:fix/anthropic-stream-multi-tool-call

Conversation

@freddyhaddad


Relevant issues

No existing issue. Surfaced by a real-world Claude Code + mlx_lm.server (kimi-k2.6) deployment where parallel-subagent dispatch was failing with InputValidationError: description / prompt missing.

Pre-Submission checklist

  • I have Added testing in the tests/test_litellm/ directory — new file tests/test_litellm/llms/anthropic/test_anthropic_stream_multi_tool_call_split.py with 12 cases covering the static splitter, sync iteration, async iteration, and construction-time wiring
  • My PR passes the relevant unit tests — 266 passed across the new test file plus all tests/test_litellm/llms/anthropic/experimental_pass_through/, including the previously-existing TestAnthropicStreamWrapperToolArgs::test_sync_tool_args_not_dropped and test_async_tool_args_not_dropped regression tests
  • My PR's scope is as isolated as possible — single bug, single file change in litellm/, single new test file
  • Greptile review — not yet (I'll request after CI runs).

Type

🐛 Bug Fix

Changes

Problem

AnthropicStreamWrapper (the OpenAI→Anthropic streaming converter behind /v1/messages when use_chat_completions_url_for_anthropic_messages: true or anthropic-format passthrough) silently drops every tool_call beyond the first when an upstream OpenAI streaming chunk contains multiple complete tool_calls in a single delta.

Root cause: the helper _translate_streaming_openai_chunk_to_anthropic_content_block in transformation.py indexes choices[0].delta.tool_calls[0], and the streaming loop in AnthropicStreamWrapper.__next__ / __anext__ uses _should_start_new_content_block to decide WHEN to emit a new content_block_start. The combination only ever observes the first tool_call per chunk, so subsequent tool_calls in the same delta produce no content_block_start / input_json_delta / content_block_stop triple and are lost from the converted Anthropic event stream.

Trigger

mlx_lm.server emits all parallel tool_calls together in a single final delta after their text has been fully generated (this matches its single-pass, non-incremental tool-call streaming model). Likely affects other providers with similar patterns; OpenAI itself usually emits one tool_call per delta so this bug doesn't surface there.
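For illustration, a chunk of the shape described above might look like the following. This is a hand-written sketch using plain dicts: the ids, the argument strings, and the `first_tool_call_only` helper are hypothetical and only mimic the `tool_calls[0]` indexing named in the root-cause analysis, not LiteLLM's actual converter.

```python
# Hypothetical OpenAI-style streaming chunk carrying two complete tool_calls
# in a single delta. Ids and arguments are invented for illustration.
chunk = {
    "choices": [
        {
            "index": 0,
            "delta": {
                "tool_calls": [
                    {
                        "index": 0,
                        "id": "call_a",
                        "type": "function",
                        "function": {
                            "name": "Agent",
                            "arguments": '{"description": "first", "prompt": "p1"}',
                        },
                    },
                    {
                        "index": 1,
                        "id": "call_b",
                        "type": "function",
                        "function": {
                            "name": "Agent",
                            "arguments": '{"description": "second", "prompt": "p2"}',
                        },
                    },
                ]
            },
        }
    ]
}


def first_tool_call_only(chunk):
    """Stand-in for a converter that only ever reads tool_calls[0]."""
    return chunk["choices"][0]["delta"]["tool_calls"][0]


seen = first_tool_call_only(chunk)
# Everything after slot 0 is never observed by such a converter.
dropped = chunk["choices"][0]["delta"]["tool_calls"][1:]
print("seen:", seen["id"], "dropped:", [tc["id"] for tc in dropped])
```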

Real-world symptom

When a user asks Claude Code (kimi-k2.6 backend, via this LiteLLM passthrough) to dispatch parallel subagents — e.g., "fetch these URLs concurrently with two Agent tool calls" — the model emits a 2-tool-call OpenAI delta, but Claude Code's harness only sees one tool_use content block. Its conversation state diverges from the model's, the model gets confused on the next turn and emits more tool_use blocks with empty input (mode-collapse on validation failure), and the harness loops with:

InputValidationError: Agent failed due to the following issues:
The required parameter `description` is missing
The required parameter `prompt` is missing

Fix

Introduce a small dual-protocol wrapper class _MultiToolCallSplitter that sits between the upstream completion stream and the existing converter, splitting any chunk whose delta.tool_calls has length > 1 into N chunks (one tool_call each) via deep-copy. The wrapper supports both __iter__ / __next__ and __aiter__ / __anext__ and matches whichever protocol the consumer (AnthropicStreamWrapper.__next__ vs __anext__) drives — necessary because some upstream stream wrappers (e.g. CustomStreamWrapper) expose both protocols, and a single-protocol wrapping at construction time would break the unused side. Single-tool-call chunks pass through as the same instance with no copy.

The downstream converter and the rest of AnthropicStreamWrapper's state machine are untouched; they continue to assume one tool_call per chunk, an assumption the splitter now guarantees by construction.
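The splitting step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in the diff: `split_chunk_by_tool_calls` and `make_chunk` are hypothetical names, and `SimpleNamespace` objects stand in for LiteLLM's real chunk types.

```python
import copy
from types import SimpleNamespace
from typing import Any, List


def split_chunk_by_tool_calls(chunk: Any) -> List[Any]:
    """Return one chunk per tool_call; pass every other chunk through unchanged."""
    choices = getattr(chunk, "choices", None)
    if not choices:
        return [chunk]  # covers None, sentinel strings, and empty choices
    delta = getattr(choices[0], "delta", None)
    tool_calls = getattr(delta, "tool_calls", None) if delta is not None else None
    if not tool_calls or len(tool_calls) <= 1:
        return [chunk]  # common path: same instance, no copy
    out: List[Any] = []
    for tc in tool_calls:
        sub = copy.deepcopy(chunk)
        # Copy the tool call too, so sub-chunks do not alias the original.
        sub.choices[0].delta.tool_calls = [copy.deepcopy(tc)]
        out.append(sub)
    return out


def make_chunk(*names: str) -> SimpleNamespace:
    """Build a stand-in chunk with one tool_call per name (hypothetical helper)."""
    tcs = [
        SimpleNamespace(index=i, function=SimpleNamespace(name=n, arguments="{}"))
        for i, n in enumerate(names)
    ]
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(tool_calls=tcs))])


single = make_chunk("Agent")
double = make_chunk("Agent", "Agent")
single_result = split_chunk_by_tool_calls(single)
double_result = split_chunk_by_tool_calls(double)
```

In this sketch the single-call chunk comes back as the same object, while the two-call chunk yields two independent copies that each carry exactly one tool call.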

Code change

One new module-level class _MultiToolCallSplitter, one static helper AnthropicStreamWrapper._split_chunk_by_tool_calls, and one line in __init__ that wires them in. 107 lines added, 1 line removed in litellm/llms/anthropic/experimental_pass_through/adapters/streaming_iterator.py.

Screenshots / Proof of Fix

Before

Direct /v1/messages streaming test against mlx_lm.server for a prompt that produces 2 parallel tool_calls in one delta:

Total events: 8
  message_start
  content_block_start index=0 type=text
  content_block_stop index=0
  content_block_start index=1 type=tool_use name=Agent     ← only one tool_use
  content_block_delta index=1 input_json_delta partial='{"description": "...", "prompt": "..."}'
  content_block_stop index=1
  message_delta
  message_stop

After

Total events: 11
  message_start
  content_block_start index=0 type=text
  content_block_stop index=0
  content_block_start index=1 type=tool_use name=Agent     ← first tool_use
  content_block_delta index=1 input_json_delta partial='{"description": "count .py files...", ...}'
  content_block_stop index=1
  content_block_start index=2 type=tool_use name=Agent     ← second tool_use, was dropped before
  content_block_delta index=2 input_json_delta partial='{"description": "count .md files...", ...}'
  content_block_stop index=2
  message_delta
  message_stop
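A quick way to sanity-check a listing like the one above is to confirm that every content block that starts also stops. The sketch below transcribes the "After" events as (type, index) pairs; the `unbalanced_blocks` helper is hypothetical and not part of this PR or of LiteLLM.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, Optional[int]]

# (event type, content block index) pairs transcribed from the "After" listing.
after_events: List[Event] = [
    ("message_start", None),
    ("content_block_start", 0),
    ("content_block_stop", 0),
    ("content_block_start", 1),
    ("content_block_delta", 1),
    ("content_block_stop", 1),
    ("content_block_start", 2),
    ("content_block_delta", 2),
    ("content_block_stop", 2),
    ("message_delta", None),
    ("message_stop", None),
]


def unbalanced_blocks(events: List[Event]) -> List[Optional[int]]:
    """Return block indices with a delta/stop outside an open block,
    a duplicate start, or a start that is never stopped."""
    open_blocks = set()
    problems: List[Optional[int]] = []
    for etype, idx in events:
        if etype == "content_block_start":
            if idx in open_blocks:
                problems.append(idx)
            open_blocks.add(idx)
        elif etype == "content_block_delta":
            if idx not in open_blocks:
                problems.append(idx)
        elif etype == "content_block_stop":
            if idx not in open_blocks:
                problems.append(idx)
            open_blocks.discard(idx)
    return problems + sorted(open_blocks)


started = [idx for etype, idx in after_events if etype == "content_block_start"]
print(len(after_events), started, unbalanced_blocks(after_events))
```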

Claude Code TUI verification (end-to-end)

Same prompt that previously looped 8x with Invalid tool parameters now succeeds:

⏺ 2 agents finished
   ├ Count .py files in mlx_lm  ·  Done
   └ Count .md files in mlx-lm  ·  Done

⏺ Results:
  - 175 .py files in mlx_lm/
  - 31 .md files in mlx-lm/

Tests

$ uv run pytest tests/test_litellm/llms/anthropic/test_anthropic_stream_multi_tool_call_split.py -v
========================== 12 passed in 0.11s ==========================

$ uv run pytest tests/test_litellm/llms/anthropic/experimental_pass_through/ -q
========================= 254 passed in 6.4s =========================

The 12 new tests cover:

  • _split_chunk_by_tool_calls static helper: passthrough for None / "None" sentinels, single-tool-call chunks (no copy), 2-call splits, 3-call splits, chunks with no choices, chunks with delta=None
  • _MultiToolCallSplitter sync iteration
  • _MultiToolCallSplitter async iteration
  • AnthropicStreamWrapper construction with a dual-protocol upstream stream (verifies both sync and async paths reach the splitter)
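The sync and async cases listed above can be pictured with a small dual-protocol wrapper. The following is a sketch under assumptions rather than the PR's implementation: `DualProtocolSplitter`, `BothProtocols`, and `split_fn` are hypothetical names, and tuples stand in for multi-tool-call chunks.

```python
import asyncio
from collections import deque
from typing import Any, Callable, List


class DualProtocolSplitter:
    """Expands each upstream item via split_fn, for sync or async consumers."""

    def __init__(self, stream: Any, split_fn: Callable[[Any], List[Any]]):
        self._stream = stream
        self._split_fn = split_fn  # must return at least one element
        self._buffer: deque = deque()
        self._sync_iter: Any = None
        self._async_iter: Any = None

    def __iter__(self):
        self._sync_iter = iter(self._stream)
        return self

    def __next__(self):
        if self._buffer:
            return self._buffer.popleft()
        chunk = next(self._sync_iter)  # StopIteration propagates at EOF
        first, *rest = self._split_fn(chunk)
        self._buffer.extend(rest)
        return first

    def __aiter__(self):
        self._async_iter = self._stream.__aiter__()
        return self

    async def __anext__(self):
        if self._buffer:
            return self._buffer.popleft()
        chunk = await self._async_iter.__anext__()  # StopAsyncIteration at EOF
        first, *rest = self._split_fn(chunk)
        self._buffer.extend(rest)
        return first


class BothProtocols:
    """Stand-in upstream stream exposing both iteration protocols."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

    def __aiter__(self):
        return self._agen()

    async def _agen(self):
        for item in self._items:
            yield item


def split_fn(item):
    # A tuple plays the role of a multi-tool-call chunk in this sketch.
    return list(item) if isinstance(item, tuple) else [item]


async def collect_async():
    splitter = DualProtocolSplitter(BothProtocols(["a", ("b", "c"), "d"]), split_fn)
    return [x async for x in splitter]


sync_out = list(DualProtocolSplitter(BothProtocols(["a", ("b", "c"), "d"]), split_fn))
async_out = asyncio.run(collect_async())
```

Each consumer path only touches the iterator for its own protocol, which is the property the dual-protocol test in the list above is described as checking.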

Backwards compatibility

  • Single-tool-call chunks (the OpenAI default) pass through as the same Python instance — no deep-copy cost on the common path.
  • Public API of AnthropicStreamWrapper unchanged. Behavior unchanged for any provider that already emits one tool_call per delta.
  • The previously-existing TestAnthropicStreamWrapperToolArgs::test_sync_tool_args_not_dropped and test_async_tool_args_not_dropped (added in #24134, "Anthropic adapter streaming drops tool_use input args on content block transition") continue to pass. This was verified explicitly because they use a SimpleIterator fixture that exposes both sync and async protocols, which the dual-protocol splitter handles correctly.

🤖 Generated with Claude Code

@CLAassistant

CLAassistant commented Apr 27, 2026


CLA assistant check
All committers have signed the CLA.

@freddyhaddad freddyhaddad force-pushed the fix/anthropic-stream-multi-tool-call branch 2 times, most recently from 85fda80 to 6c34140 (April 27, 2026 21:22)
@codspeed-hq

codspeed-hq Bot commented Apr 27, 2026

Contributor

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing freddyhaddad:fix/anthropic-stream-multi-tool-call (2e3da15) with main (3d2b8fe)


@codecov

codecov Bot commented Apr 27, 2026


Codecov Report

❌ Patch coverage is 96.07843% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...mental_pass_through/adapters/streaming_iterator.py | 96.07% | 2 Missing ⚠️ |


@greptile-apps

greptile-apps Bot commented Apr 27, 2026

Contributor

Greptile Summary

This PR fixes a real-world bug in AnthropicStreamWrapper where parallel tool_calls emitted in a single OpenAI streaming delta were silently dropped after the first. The fix introduces _MultiToolCallSplitter, a thin wrapper that splits any multi-tool-call chunk into one chunk per tool_call before it reaches the downstream converter. The approach is minimal, backward-compatible (single-tool-call chunks pass through as the same instance), and the previous review-thread concerns (object aliasing in sub-chunks, missing end-to-end tests) appear to have been addressed in the current revision.

Confidence Score: 5/5

Safe to merge — targeted, minimal change with correct logic and solid test coverage including end-to-end event-sequence assertions.

No P0/P1 issues found. The previous thread's concerns (sub-chunk object aliasing, missing end-to-end tests) have been addressed in the current revision. Single-tool-call common path is zero-copy, multi-tool-call splitting logic is correct, and both sync/async protocols are handled cleanly.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| litellm/llms/anthropic/experimental_pass_through/adapters/streaming_iterator.py | Adds _MultiToolCallSplitter class and _split_chunk_by_tool_calls static helper; wires splitter into AnthropicStreamWrapper.__init__; switches __next__/__anext__ loops from self.completion_stream to self._completion_stream_splitter. Logic is correct: single-tool-call chunks pass through as the same instance, multi-call chunks are deep-copied per slot, both sync and async protocols are supported via a shared deque buffer. |
| tests/test_litellm/llms/anthropic/test_anthropic_stream_multi_tool_call_split.py | New test file covering the static splitter helper (passthrough sentinels, single/2-call/3-call splits, edge cases), _MultiToolCallSplitter sync and async iteration, and two full end-to-end AnthropicStreamWrapper tests that assert on the emitted content_block_start/content_block_delta event sequence. No real network calls; all mocks or LiteLLM-internal types. |

Sequence Diagram

sequenceDiagram
    participant Upstream as Upstream OpenAI Stream
    participant Splitter as _MultiToolCallSplitter
    participant Converter as AnthropicStreamWrapper.__next__
    participant Client as Anthropic Client

    Note over Upstream,Splitter: Single delta with 2 parallel tool_calls
    Upstream->>Splitter: chunk{tool_calls:[TC_A, TC_B]}
    Splitter->>Splitter: _split_chunk_by_tool_calls → [sub_A, sub_B]
    Splitter->>Splitter: buffer.append(sub_B)
    Splitter-->>Converter: sub_A {tool_calls:[TC_A]}

    Converter->>Client: content_block_start index=1 (tool_use TC_A)
    Converter->>Client: content_block_delta index=1 (args_A)
    Converter->>Client: content_block_stop index=1

    Note over Splitter,Converter: Next __next__ call drains the buffer
    Splitter-->>Converter: sub_B {tool_calls:[TC_B]} (from buffer)

    Converter->>Client: content_block_start index=2 (tool_use TC_B)
    Converter->>Client: content_block_delta index=2 (args_B)
    Converter->>Client: content_block_stop index=2

Reviews (2): Last reviewed commit: "fix(anthropic): preserve all tool_calls ..." | Re-trigger Greptile

Comment on lines +58 to +67
def __next__(self) -> Any:
    while True:
        if self._buffer:
            return self._buffer.popleft()
        chunk = next(self._sync_iter_obj)  # raises StopIteration at EOF
        splits = AnthropicStreamWrapper._split_chunk_by_tool_calls(chunk)
        if len(splits) <= 1:
            return splits[0]
        self._buffer.extend(splits[1:])
        return splits[0]


P2 while True loop body always returns on first iteration

The while True: loop in __next__ (and identically in __anext__) never runs a second iteration. _split_chunk_by_tool_calls always returns a list with ≥ 1 element (it returns [chunk] for every edge case), so every code path inside the loop body hits a return. The loop will never cycle back to check self._buffer a second time within a single __next__ call; that check only matters on the next call, which Python's for machinery handles by invoking __next__ again.

Suggested change
- def __next__(self) -> Any:
-     while True:
-         if self._buffer:
-             return self._buffer.popleft()
-         chunk = next(self._sync_iter_obj)  # raises StopIteration at EOF
-         splits = AnthropicStreamWrapper._split_chunk_by_tool_calls(chunk)
-         if len(splits) <= 1:
-             return splits[0]
-         self._buffer.extend(splits[1:])
-         return splits[0]
+ def __next__(self) -> Any:
+     if self._buffer:
+         return self._buffer.popleft()
+     chunk = next(self._sync_iter_obj)  # raises StopIteration at EOF
+     splits = AnthropicStreamWrapper._split_chunk_by_tool_calls(chunk)
+     if len(splits) <= 1:
+         return splits[0]
+     self._buffer.extend(splits[1:])
+     return splits[0]

Comment on lines +156 to +161
out: List[Any] = []
for one_tc in tcs:
    sub = copy.deepcopy(chunk)
    sub.choices[0].delta.tool_calls = [one_tc]
    out.append(sub)
return out


P2 Sub-chunks share references to original tool-call objects

After deep-copying the full chunk into sub, the code immediately replaces the copied tool_calls list with a list containing the original (not deep-copied) one_tc object:

sub = copy.deepcopy(chunk)          # deep copies tool_calls inside
sub.choices[0].delta.tool_calls = [one_tc]  # drops copy; re-assigns original

This means each emitted sub-chunk's tool_calls[0] is the same Python object as the corresponding slot in the original chunk's tool_calls. Any downstream mutation of that object (e.g. setting function.arguments) would silently alias back to the original chunk. Consider deep-copying one_tc explicitly to make sub-chunks fully independent:

for one_tc in tcs:
    sub = copy.deepcopy(chunk)
    sub.choices[0].delta.tool_calls = [copy.deepcopy(one_tc)]
    out.append(sub)

Alternatively, avoid the wasted deep-copy of tcs by only copying the outer structure and deep-copying each individual tool_call.
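The aliasing concern can be reproduced in isolation. The sketch below uses `SimpleNamespace` stand-ins (an assumption; the real chunk and tool-call types are LiteLLM's) to contrast re-attaching the original tool call with deep-copying it.

```python
import copy
from types import SimpleNamespace

# Stand-in objects; the attribute names mirror the snippet above.
original_tc = SimpleNamespace(function=SimpleNamespace(arguments='{"a": 1}'))
chunk = SimpleNamespace(tool_calls=[original_tc])

# Flagged pattern: deep-copy the chunk, then re-attach the ORIGINAL tool call.
aliased = copy.deepcopy(chunk)
aliased.tool_calls = [original_tc]

# Suggested pattern: deep-copy the tool call as well.
independent = copy.deepcopy(chunk)
independent.tool_calls = [copy.deepcopy(original_tc)]

# A downstream mutation through the aliased sub-chunk leaks into the original.
aliased.tool_calls[0].function.arguments = "mutated"

print(original_tc.function.arguments)                  # the original changed
print(independent.tool_calls[0].function.arguments)    # the copy did not
```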

Comment on lines +1 to +20
"""
Tests for AnthropicStreamWrapper's multi-tool-call splitting.

Some upstream providers (e.g. mlx_lm.server) emit multiple complete
tool_calls inside a SINGLE OpenAI streaming delta when the model produces
parallel tool calls. Anthropic's streaming format requires one
``content_block`` per ``tool_use`` and the downstream converter in
``AnthropicStreamWrapper`` indexes ``tool_calls[0]`` — so without splitting,
all but the first tool_call are silently dropped from the converted
``/v1/messages`` stream.

These tests verify that ``AnthropicStreamWrapper`` splits such chunks into
one tool_call per chunk before the converter sees them.
"""

import asyncio
import os
import sys
from typing import List
from unittest.mock import MagicMock


P2 No end-to-end test for the full Anthropic SSE event sequence

The new tests thoroughly validate _split_chunk_by_tool_calls and _MultiToolCallSplitter in isolation, and verify that wrapper.completion_stream expands correctly. However, none of them exercise the complete AnthropicStreamWrapper.__next__ / __anext__ state machine with a multi-tool-call input to confirm that the final emitted Anthropic event sequence actually contains paired content_block_start / content_block_delta / content_block_stop triples for each parallel tool call. A regression in _should_start_new_content_block or the content_block_index accounting would go undetected by the current suite. An integration-style test (similar to the existing TestAnthropicStreamWrapperToolArgs tests) that consumes list(wrapper) and asserts on the type and index of each emitted event would close this gap.

… multiple

The Anthropic /v1/messages streaming adapter (AnthropicStreamWrapper)
silently drops every tool_call beyond the first when an upstream OpenAI
streaming chunk contains multiple complete tool_calls in a single delta.
The downstream converter
(_translate_streaming_openai_chunk_to_anthropic_content_block) indexes
tool_calls[0], so without splitting at the wrapper level only one
content_block_start / input_json_delta / content_block_stop triple is
emitted regardless of how many parallel tool_calls the model produced.

The provider that triggered this is mlx_lm.server, which emits all
parallel tool_calls in one final delta after their text has been fully
generated. Real-world impact: Claude Code (which speaks /v1/messages)
sees only the first parallel subagent dispatch, then loops on
"InputValidationError: required parameter description / prompt missing"
because the model's follow-up tool_use blocks arrive without their
content.

Fix: a small _MultiToolCallSplitter wraps the upstream stream inside
__init__. It supports both __iter__/__next__ and __aiter__/__anext__
transparently so AnthropicStreamWrapper.__next__ and __anext__ each see
the matching protocol. When a chunk's delta has more than one tool_call,
the splitter deep-copies the chunk N times (one tool_call each) and
buffers the rest. Single-tool-call chunks pass through unchanged
(returns the same instance, no copy).

Verified end-to-end against mlx_lm.server kimi-k2.6 + Claude Code's
parallel-Agent-tool dispatch: before patch, /v1/messages stream emitted
8 events with 1 content_block_start for tool_use; after patch, 11
events with 2 content_block_starts (index 1 + index 2), each carrying
its own input_json_delta. Claude Code TUI confirms both subagents now
spawn, run, and return cleanly.

Test added: tests/test_litellm/llms/anthropic/test_anthropic_stream_multi_tool_call_split.py
covers the static splitter, sync iteration, async iteration, and
construction-time wiring through AnthropicStreamWrapper.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@freddyhaddad freddyhaddad force-pushed the fix/anthropic-stream-multi-tool-call branch from 6c34140 to 2e3da15 (April 27, 2026 21:55)
@freddyhaddad

Author

Hi maintainers — pushed an update addressing the bot review:

  • lint (mypy) ✅: fixed by typing the lazy iterator slots as Any instead of Optional[Any] (Python's iteration protocol guarantees __iter__ runs before __next__)
  • Greptile #1 (while True unreachable) ✅: removed; the loop only ever ran once
  • Greptile #2 (sub-chunk references) ✅: added copy.deepcopy(one_tc) so each split chunk has fully independent state
  • Greptile #3 (no end-to-end test) ✅: added TestAnthropicStreamWrapperEndToEnd (sync + async) that drives the full AnthropicStreamWrapper and asserts two content_block_start of type tool_use are emitted for a multi-tool-call delta, each with its input_json_delta
  • CodeQL (attribute shadow) ✅: moved the splitter onto a separate self._completion_stream_splitter attribute instead of overwriting self.completion_stream from the superclass

One open issue I can't fix from this side: the Verify PR source branch workflow rejects external contributions and points users at a branch named litellm_oss_branch, but that branch doesn't appear to exist in this repo. The closest matches are date-stamped litellm-oss-staging-* branches (most recent: litellm-oss-staging-04-25-2026), but I don't want to guess at the right target. Could you let me know which branch I should retarget to, or is the contribution-via-fork-to-main path expected to fail this gate at this stage of review?

(Or if it's easier, the workflow text at .github/workflows/guard-main-branch.yml references a branch that doesn't exist — happy to open a tiny follow-up PR fixing that pointer if you let me know the canonical name.)

Bojun-Vvibe added a commit to Bojun-Vvibe/oss-contributions that referenced this pull request Apr 27, 2026