fix(advisor): route through proxy, render native blocks, and attribute per-model usage#30546
Conversation
…ng env vars The advisor orchestration handler called anthropic_messages() directly for sub-calls, which required ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL to be set. In a proxy setup, the workaround was pointing those env vars at localhost with the master key, creating a pointless HTTP round-trip. Now _call_messages_handler checks for the proxy's llm_router and routes through it when available. The router resolves model deployments and credentials from the proxy config in-process. Falls back to the direct anthropic_messages() call for standalone SDK usage (no proxy). Also fixes a bug where api_key=None / api_base=None were always passed to the advisor sub-call, which would override the router's deployment credentials during kwargs merging. Now only passes them when explicitly set in the advisor tool definition.
The advisor sub-call routed through the proxy router but carried none of the caller's litellm_metadata (user_api_key/team/budget/session), so the advisor model's spend was never attributed to the caller's key or budget and the call did not group under the session in the UI. Forward litellm_metadata into the advisor leg so it is tracked exactly like the executor leg. Forward only litellm_metadata, not the executor's generation params: feeding the advisor the executor's agent system prompt (and tool_choice) made it mimic the executor and echo the advisor call instead of answering the question. api_key/api_base still come from the advisor tool definition only. Regression test asserts the advisor leg receives litellm_metadata but not the executor's system/tool_choice.
In our orchestration the advisor is handed a plain /v1/messages request rather
than Anthropic's native server-side framing. With no role of its own it adopted
the executor's persona from the forwarded conversation and refused or punted
("there's no separate advisor I can query, I'm answering directly"), which read
to the executor as the advisor kicking the task back.
Add ADVISOR_SYSTEM_PROMPT and pass it as the advisor leg's system prompt so the
advisor answers as the advisor. The executor's own system prompt is still not
forwarded, and the forwarded conversation context is unchanged; this only adds
the role. A regression test asserts the advisor leg carries the advisor role
prompt and not the executor's.
The rebase conflict resolution accidentally stripped comments from upstream's test_named_params_forwarded_into_advisor_executor_subcall and test_pre_request_hook_override_does_not_collide_with_explicit_kwargs. Restoring them so the PR diff only adds new tests without modifying upstream code.
…dvisor blocks
The non-native advisor orchestrator flattened the executor/advisor loop into a
single final response, so Claude Code (and any client) saw a plain message with
no signal that an advisor was consulted. Surface each advisor exchange as the
native server_tool_use (name "advisor") and advisor_tool_result blocks that
clients key their advisor UI on
Streaming now runs a real orchestrator: the server_tool_use block is flushed
before the advisor sub-call is awaited, so the advisor renders as in-progress
for its real latency and resolves when the result arrives, matching the native
server-side experience. Non-streaming and streaming share one _run_loop
generator so the two paths cannot diverge
Also fix strip_advisor_blocks_from_messages to recover advice text from the
native {"type":"advisor_result","text":...} content shape; it previously
handled only str/list content and silently dropped the dict's text, which
broke the multi-turn round-trip once we started emitting these blocks
…ations[] Clients such as Claude Code compute per-model usage from two sources: the top-level usage is attributed to the executor model, and each usage.iterations[] entry of type "advisor_message" is attributed to that entry's own model. Our orchestrator returned only the final executor usage with no iterations array, so the advisor model's tokens folded into the executor's line and its cost showed as zero Emit usage.iterations[] from the orchestrator: one advisor_message entry per advisor sub-call carrying the advisor model and its token counts, on both the non-streaming response and the streaming message_delta. The list ends with the executor's final turn so the client's context-window readout, which reads the last iteration, stays correct
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR delivers three coherent fixes to the advisor orchestration loop for non-Anthropic executor providers: routing sub-calls through the proxy's
Confidence Score: 5/5Safe to merge; all changes are additive to an experimental pass-through path with no impact on existing non-advisor routes. The orchestration refactor is well-structured and the new _run_loop generator correctly covers both the streaming and non-streaming paths with a single execution trace. The auth check follows the established MCP sampling pattern, the credential-isolation logic is explicitly tested, and the streaming ordering guarantee is verified with an interleaving test. The two findings are narrow in scope: one restores a type filter accidentally dropped during a helper extraction, and the other broadens an exception guard that currently only catches ImportError. No files require special attention beyond the two inline suggestions.
|
| Filename | Overview |
|---|---|
| litellm/constants.py | Adds ADVISOR_SYSTEM_PROMPT constant — a clean, well-commented addition with no issues. |
| litellm/llms/anthropic/common_utils.py | Adds _advisor_result_text() helper and refactors strip_advisor_blocks_from_messages to use it; the list-case branch drops the original type=="text" filter, allowing any dict with a "text" field to be selected. |
| litellm/llms/anthropic/experimental_pass_through/messages/fake_stream_iterator.py | Extracts _sse() and build_content_block_chunks() as module-level helpers; adds server_tool_use and advisor_tool_result block types; fixes the orphan content_block_stop bug for unknown types. |
| litellm/llms/anthropic/experimental_pass_through/messages/interceptors/advisor.py | Major refactor: introduces _run_loop async generator, _collect/_stream consumers, router routing, advisor model access validation, and usage.iterations[] attribution; logic is sound with one note about _get_llm_router only catching ImportError. |
| tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_transformation.py | Adds test for the dict-shaped advisor_result content fix; new test is well-targeted and uses mocks only. |
| tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_advisor_integration.py | Adds extensive integration tests (router routing, credential forwarding, metadata isolation, streaming ordering, usage attribution, model access auth) — all mocked, good coverage of the new behaviors. |
| tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_fake_stream_iterator.py | New test file covering server_tool_use and advisor_tool_result streaming, and the no-orphan-stop fix; well-structured and all mocked. |
| tests/test_litellm/llms/anthropic/messages/test_advisor_orchestration.py | Renames and strengthens test_loop_streaming_wraps_response: moves iteration inside the patch context (correct for lazy generators) and adds content and message_stop assertions. |
Reviews (2): Last reviewed commit: "ci: retrigger checks" | Re-trigger Greptile
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 1 · PR risk: 0/10 |
The spread-merge of usage with iterations produces a plain dict, which mypy rejects against the AnthropicUsage TypedDict. Cast it explicitly.
The advisor_model comes from client-supplied tool input and was routed through llm_router without checking can_key_call_model. A caller whose key only covered the executor model could invoke any router model as advisor. Now validates the advisor model against the caller's UserAPIKeyAuth before entering the orchestration loop. The check only runs inside the proxy (when a router is available); standalone SDK usage has no auth layer and skips it.
Import _sse from fake_stream_iterator.py instead of defining an identical copy in advisor.py.
The streaming path emitted message_start with input_tokens: 0 before any executor call ran. Anthropic's native SSE puts real input_tokens in message_start. Now peeks the first event from the loop to extract the executor's actual input_tokens before emitting message_start.
build_content_block_chunks unconditionally emitted content_block_stop even when no content_block_start was emitted for unrecognized block types, producing an orphan stop event that violates the SSE protocol. Now only emits content_block_stop when a start was emitted.
|
Thanks for the contribution! One thing to address before we can move forward:
Once those are addressed, we'll take a closer look — thanks again! |
Relevant issues
Upstream advisor rollout: #25516. Ported from scaledata/litellm PRs #4, #7, #8
Pre-Submission checklist
make test-unit(the touched/related suites pass; 32 tests across 4 test files)@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewWhat and why
Three related fixes to the advisor orchestration loop for non-Anthropic executor providers, combined into a single PR since they form a coherent stack (each builds on the prior). Follow-up commits address all review findings from Greptile and Veria-AI.
1. Route sub-calls through the proxy router (scaledata/litellm#4)
Routes advisor and executor sub-calls through the proxy's
llm_routerin-process when available (falling back to directanthropic_messages()for standalone SDK use), instead of requiringANTHROPIC_API_KEYandANTHROPIC_BASE_URLpointed at localhost with the master key. Stops passingapi_key=None/api_base=Noneto the advisor sub-call, which would override the router's resolved deployment credentials during kwargs merging. Forwards the caller'slitellm_metadatainto the advisor sub-call so the advisor model's spend is attributed to the caller's key and budget. Gives the advisor sub-call its own role system prompt so it answers as the advisor rather than adopting the executor's persona.2. Render advisor activity in clients (scaledata/litellm#7)
Through the proxy, a non-Anthropic executor with an
advisor_20260301tool was orchestrated entirely in litellm: the executor/advisor loop ran in-process and only the final, flattened executor response was returned. Claude Code drives its advisor UI fromserver_tool_use(nameadvisor) andadvisor_tool_resultcontent blocks; when the executor is non-Anthropic the server cannot run that loop, so this change has our orchestrator synthesize the same blocks, making the proxy path render identically to the native one. For streaming, theserver_tool_useblock is flushed before the advisor sub-call is awaited, so the advisor renders as in-progress for its real latency and resolves when the result arrives.Also fixes
strip_advisor_blocks_from_messagesto recover advice text from the native{"type":"advisor_result","text":...}content shape (previously handled only str/list content and silently dropped the dict's text, breaking the multi-turn round-trip).Also extracts the shared
build_content_block_chunkshelper infake_stream_iterator.pyand teaches it theserver_tool_useandadvisor_tool_resultblock types (previously unknown types were silently dropped from the stream).3. Attribute advisor spend to its own model (scaledata/litellm#8)
When an advisor consult ran through the proxy, Claude Code's
/usageattributed all spend to the executor model and showed nothing for the advisor model. Claude Code computes per-model usage fromusage.iterations[]entries oftype:"advisor_message". This change emitsusage.iterations[]from the orchestrator: oneadvisor_messageentry per advisor sub-call, carrying the advisor model and its token counts, on both the non-streaming response and the streamingmessage_delta.4. Review feedback fixes
Addresses all automated review findings:
advisor_modelagainst the caller'sUserAPIKeyAuthviacan_key_call_modelbefore entering the loop, following the same pattern as the MCP sampling handler. Skipped in standalone SDK mode (no auth layer)_ssehelper (Greptile): removed the copy fromadvisor.py, now imports fromfake_stream_iterator.pymessage_start.usagezero tokens (Greptile): the streaming path now peeks the first loop event to extract the executor's realinput_tokensbefore emittingmessage_start, matching the Anthropic SSE protocol andFakeAnthropicMessagesStreamIteratorcontent_block_stop(Greptile):build_content_block_chunksnow only emitscontent_block_stopwhen acontent_block_startwas emitted, so unknown block types produce no chunks instead of a protocol-violating orphan stopCI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
Run a proxy with a non-Anthropic executor and an Anthropic advisor, then drive the advisor through the streaming
/v1/messagespassthrough hitting real providers:Expected SSE order:
content_block_startforserver_tool_use(name advisor), a pause for the advisor latency, thencontent_block_startforadvisor_tool_result, then the executor text. Themessage_startcarries realinput_tokensfrom the executor, and themessage_deltacarriesusage.iterations[]with the advisor model's token counts. In Claude Code pointed at the proxy with/advisor opus-advisor, the advisor dot shows in-progress then resolves, and/usageshows a separate cost line for the advisor modelType
🐛 Bug Fix
Changes
See the "What and why" section above