fix(anthropic): preserve all tool_calls when an OpenAI delta contains multiple#26636
fix(anthropic): preserve all tool_calls when an OpenAI delta contains multiple#26636freddyhaddad wants to merge 1 commit into
Conversation
85fda80 to
6c34140
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR fixes a real-world bug in Confidence Score: 5/5Safe to merge — targeted, minimal change with correct logic and solid test coverage including end-to-end event-sequence assertions. No P0/P1 issues found. The previous thread's concerns (sub-chunk object aliasing, missing end-to-end tests) have been addressed in the current revision. Single-tool-call common path is zero-copy, multi-tool-call splitting logic is correct, and both sync/async protocols are handled cleanly. No files require special attention.
|
| Filename | Overview |
|---|---|
| litellm/llms/anthropic/experimental_pass_through/adapters/streaming_iterator.py | Adds _MultiToolCallSplitter class and _split_chunk_by_tool_calls static helper; wires splitter into AnthropicStreamWrapper.__init__; switches __next__/__anext__ loops from self.completion_stream to self._completion_stream_splitter. Logic is correct: single-tool-call chunks pass through as the same instance, multi-call chunks are deep-copied per slot, both sync and async protocols are supported via a shared deque buffer. |
| tests/test_litellm/llms/anthropic/test_anthropic_stream_multi_tool_call_split.py | New test file covering the static splitter helper (passthrough sentinels, single/2-call/3-call splits, edge cases), _MultiToolCallSplitter sync and async iteration, and two full end-to-end AnthropicStreamWrapper tests that assert on the emitted content_block_start/content_block_delta event sequence. No real network calls; all mocks or LiteLLM-internal types. |
Sequence Diagram
sequenceDiagram
participant Upstream as Upstream OpenAI Stream
participant Splitter as _MultiToolCallSplitter
participant Converter as AnthropicStreamWrapper.__next__
participant Client as Anthropic Client
Note over Upstream,Splitter: Single delta with 2 parallel tool_calls
Upstream->>Splitter: chunk{tool_calls:[TC_A, TC_B]}
Splitter->>Splitter: _split_chunk_by_tool_calls → [sub_A, sub_B]
Splitter->>Splitter: buffer.append(sub_B)
Splitter-->>Converter: sub_A {tool_calls:[TC_A]}
Converter->>Client: content_block_start index=1 (tool_use TC_A)
Converter->>Client: content_block_delta index=1 (args_A)
Converter->>Client: content_block_stop index=1
Note over Splitter,Converter: Next __next__ call drains the buffer
Splitter-->>Converter: sub_B {tool_calls:[TC_B]} (from buffer)
Converter->>Client: content_block_start index=2 (tool_use TC_B)
Converter->>Client: content_block_delta index=2 (args_B)
Converter->>Client: content_block_stop index=2
Reviews (2): Last reviewed commit: "fix(anthropic): preserve all tool_calls ..." | Re-trigger Greptile
| def __next__(self) -> Any: | ||
| while True: | ||
| if self._buffer: | ||
| return self._buffer.popleft() | ||
| chunk = next(self._sync_iter_obj) # raises StopIteration at EOF | ||
| splits = AnthropicStreamWrapper._split_chunk_by_tool_calls(chunk) | ||
| if len(splits) <= 1: | ||
| return splits[0] | ||
| self._buffer.extend(splits[1:]) | ||
| return splits[0] |
There was a problem hiding this comment.
while True loop body always returns on first iteration
The while True: loop in __next__ (and identically in __anext__) is unreachable beyond the first iteration. _split_chunk_by_tool_calls always returns a list with ≥ 1 element (it returns [chunk] for every edge case), so every code path inside the loop body hits a return. The loop will never cycle back to check self._buffer a second time from a single __next__ call — that check only matters on the next call, which Python's for machinery handles by invoking __next__ again.
| def __next__(self) -> Any: | |
| while True: | |
| if self._buffer: | |
| return self._buffer.popleft() | |
| chunk = next(self._sync_iter_obj) # raises StopIteration at EOF | |
| splits = AnthropicStreamWrapper._split_chunk_by_tool_calls(chunk) | |
| if len(splits) <= 1: | |
| return splits[0] | |
| self._buffer.extend(splits[1:]) | |
| return splits[0] | |
| def __next__(self) -> Any: | |
| if self._buffer: | |
| return self._buffer.popleft() | |
| chunk = next(self._sync_iter_obj) # raises StopIteration at EOF | |
| splits = AnthropicStreamWrapper._split_chunk_by_tool_calls(chunk) | |
| if len(splits) <= 1: | |
| return splits[0] | |
| self._buffer.extend(splits[1:]) | |
| return splits[0] |
| out: List[Any] = [] | ||
| for one_tc in tcs: | ||
| sub = copy.deepcopy(chunk) | ||
| sub.choices[0].delta.tool_calls = [one_tc] | ||
| out.append(sub) | ||
| return out |
There was a problem hiding this comment.
Sub-chunks share references to original tool-call objects
After deep-copying the full chunk into sub, the code immediately replaces the copied tool_calls list with a list containing the original (not deep-copied) one_tc object:
sub = copy.deepcopy(chunk) # deep copies tool_calls inside
sub.choices[0].delta.tool_calls = [one_tc] # drops copy; re-assigns originalThis means each emitted sub-chunk's tool_calls[0] is the same Python object as the corresponding slot in the original chunk's tool_calls. Any downstream mutation of that object (e.g. setting function.arguments) would silently alias back to the original chunk. Consider deep-copying one_tc explicitly to make sub-chunks fully independent:
for one_tc in tcs:
sub = copy.deepcopy(chunk)
sub.choices[0].delta.tool_calls = [copy.deepcopy(one_tc)]
out.append(sub)Alternatively, avoid the wasted deep-copy of tcs by only copying the outer structure and deep-copying each individual tool_call.
| """ | ||
| Tests for AnthropicStreamWrapper's multi-tool-call splitting. | ||
|
|
||
| Some upstream providers (e.g. mlx_lm.server) emit multiple complete | ||
| tool_calls inside a SINGLE OpenAI streaming delta when the model produces | ||
| parallel tool calls. Anthropic's streaming format requires one | ||
| ``content_block`` per ``tool_use`` and the downstream converter in | ||
| ``AnthropicStreamWrapper`` indexes ``tool_calls[0]`` — so without splitting, | ||
| all but the first tool_call are silently dropped from the converted | ||
| ``/v1/messages`` stream. | ||
|
|
||
| These tests verify that ``AnthropicStreamWrapper`` splits such chunks into | ||
| one tool_call per chunk before the converter sees them. | ||
| """ | ||
|
|
||
| import asyncio | ||
| import os | ||
| import sys | ||
| from typing import List | ||
| from unittest.mock import MagicMock |
There was a problem hiding this comment.
No end-to-end test for the full Anthropic SSE event sequence
The new tests thoroughly validate _split_chunk_by_tool_calls and _MultiToolCallSplitter in isolation, and verify that wrapper.completion_stream expands correctly. However, none of them exercise the complete AnthropicStreamWrapper.__next__ / __anext__ state machine with a multi-tool-call input to confirm that the final emitted Anthropic event sequence actually contains paired content_block_start / content_block_delta / content_block_stop triples for each parallel tool call. A regression in _should_start_new_content_block or the content_block_index accounting would go undetected by the current suite. An integration-style test (similar to the existing TestAnthropicStreamWrapperToolArgs tests) that consumes list(wrapper) and asserts on the type and index of each emitted event would close this gap.
… multiple The Anthropic /v1/messages streaming adapter (AnthropicStreamWrapper) silently drops every tool_call beyond the first when an upstream OpenAI streaming chunk contains multiple complete tool_calls in a single delta. The downstream converter (_translate_streaming_openai_chunk_to_anthropic_content_block) indexes tool_calls[0], so without splitting at the wrapper level only one content_block_start / input_json_delta / content_block_stop triple is emitted regardless of how many parallel tool_calls the model produced. The provider that triggered this is mlx_lm.server, which emits all parallel tool_calls in one final delta after their text has been fully generated. Real-world impact: Claude Code (which speaks /v1/messages) sees only the first parallel subagent dispatch, then loops on "InputValidationError: required parameter description / prompt missing" because the model's follow-up tool_use blocks arrive without their content. Fix: a small _MultiToolCallSplitter wraps the upstream stream inside __init__. It supports both __iter__/__next__ and __aiter__/__anext__ transparently so AnthropicStreamWrapper.__next__ and __anext__ each see the matching protocol. When a chunk's delta has more than one tool_call, the splitter deep-copies the chunk N times (one tool_call each) and buffers the rest. Single-tool-call chunks pass through unchanged (returns the same instance, no copy). Verified end-to-end against mlx_lm.server kimi-k2.6 + Claude Code's parallel-Agent-tool dispatch: before patch, /v1/messages stream emitted 8 events with 1 content_block_start for tool_use; after patch, 11 events with 2 content_block_starts (index 1 + index 2), each carrying its own input_json_delta. Claude Code TUI confirms both subagents now spawn, run, and return cleanly. Test added: tests/test_litellm/llms/anthropic/test_anthropic_stream_multi_tool_call_split.py covers the static splitter, sync iteration, async iteration, and construction-time wiring through AnthropicStreamWrapper. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
6c34140 to
2e3da15
Compare
|
Hi maintainers — pushed an update addressing the bot review:
One open issue I can't fix from this side: the (Or if it's easier, the workflow text at |
…rge-after-nits, BerriAI/litellm#26636 merge-after-nits
Relevant issues
No existing issue. Surfaced by a real-world Claude Code +
mlx_lm.server(kimi-k2.6) deployment where parallel-subagent dispatch was failing withInputValidationError: description / prompt missing.Pre-Submission checklist
tests/test_litellm/directory — new filetests/test_litellm/llms/anthropic/test_anthropic_stream_multi_tool_call_split.pywith 12 cases covering the static splitter, sync iteration, async iteration, and construction-time wiringtests/test_litellm/llms/anthropic/experimental_pass_through/, including the previously-existingTestAnthropicStreamWrapperToolArgs::test_sync_tool_args_not_droppedandtest_async_tool_args_not_droppedregression testslitellm/, single new test fileType
🐛 Bug Fix
Changes
Problem
AnthropicStreamWrapper(the OpenAI→Anthropic streaming converter behind/v1/messageswhenuse_chat_completions_url_for_anthropic_messages: trueor anthropic-format passthrough) silently drops every tool_call beyond the first when an upstream OpenAI streaming chunk contains multiple complete tool_calls in a single delta.Root cause: the helper
_translate_streaming_openai_chunk_to_anthropic_content_blockintransformation.pyindexeschoices[0].delta.tool_calls[0], and the streaming loop inAnthropicStreamWrapper.__next__/__anext__uses_should_start_new_content_blockto decide WHEN to emit a newcontent_block_start. The combination only ever observes the first tool_call per chunk, so subsequent tool_calls in the same delta produce nocontent_block_start/input_json_delta/content_block_stoptriple and are lost from the converted Anthropic event stream.Trigger
mlx_lm.serveremits all parallel tool_calls together in a single final delta after their text has been fully generated (this matches its single-pass, non-incremental tool-call streaming model). Likely affects other providers with similar patterns; OpenAI itself usually emits one tool_call per delta so this bug doesn't surface there.Real-world symptom
When a user asks Claude Code (kimi-k2.6 backend, via this LiteLLM passthrough) to dispatch parallel subagents — e.g., "fetch these URLs concurrently with two
Agenttool calls" — the model emits a 2-tool-call OpenAI delta, but Claude Code's harness only sees onetool_usecontent block. Its conversation state diverges from the model's, the model gets confused on the next turn and emits more tool_use blocks with emptyinput(mode-collapse on validation failure), and the harness loops with:Fix
Introduce a small dual-protocol wrapper class
_MultiToolCallSplitterthat sits between the upstream completion stream and the existing converter, splitting any chunk whosedelta.tool_callshas length > 1 into N chunks (one tool_call each) via deep-copy. The wrapper supports both__iter__/__next__and__aiter__/__anext__and matches whichever protocol the consumer (AnthropicStreamWrapper.__next__vs__anext__) drives — necessary because some upstream stream wrappers (e.g.CustomStreamWrapper) expose both protocols, and a single-protocol wrapping at construction time would break the unused side. Single-tool-call chunks pass through as the same instance with no copy.The downstream converter and the rest of
AnthropicStreamWrapper's state machine are untouched; they continue to assume one tool_call per chunk, which is now invariant by construction.Code change
One new module-level class
_MultiToolCallSplitter, one static helperAnthropicStreamWrapper._split_chunk_by_tool_calls, and one line in__init__that wires them in. 107 lines added, 1 line removed inlitellm/llms/anthropic/experimental_pass_through/adapters/streaming_iterator.py.Screenshots / Proof of Fix
Before
Direct
/v1/messagesstreaming test againstmlx_lm.serverfor a prompt that produces 2 parallel tool_calls in one delta:After
Claude Code TUI verification (end-to-end)
Same prompt that previously looped 8x with
Invalid tool parametersnow succeeds:Tests
The 12 new tests cover:
_split_chunk_by_tool_callsstatic helper: passthrough forNone/"None"sentinels, single-tool-call chunks (no copy), 2-call splits, 3-call splits, chunks with nochoices, chunks withdelta=None_MultiToolCallSplittersync iteration_MultiToolCallSplitterasync iterationAnthropicStreamWrapperconstruction with a dual-protocol upstream stream (verifies both sync and async paths reach the splitter)Backwards compatibility
AnthropicStreamWrapperunchanged. Behavior unchanged for any provider that already emits one tool_call per delta.TestAnthropicStreamWrapperToolArgs::test_sync_tool_args_not_droppedandtest_async_tool_args_not_dropped(added in Anthropic adapter streaming drops tool_use input args on content block transition #24134) continue to pass — verified explicitly because they use aSimpleIteratorfixture that exposes both sync and async protocols, which the dual-protocol splitter handles correctly.🤖 Generated with Claude Code