
fix(otel): export SERVER span on management-endpoint success without http_request #28792

Closed
yassin-berriai wants to merge 4 commits into litellm_internal_staging from claude/funny-fermat-xUS28


@yassin-berriai

Contributor

Summary

This PR fixes one real gap in #28405's coverage. The gap surfaced while building an end-to-end verification matrix for the four OTEL-fidelity PRs shipped in v1.87.0: #28273 (team attrs on all spans), #28362 (serialize guardrail_response), #28364 (guardrail span on failure + status/categories), and #28405 (http.response.status_code on the SERVER span).

The bug

management_endpoint_wrapper invokes async_management_endpoint_success_hook (which stamps http.response.status_code=200 and calls end() on the parent SERVER span) only when the handler declares an http_request parameter. This is the if _http_request: guard at litellm/proxy/management_helpers/utils.py:471.

45 of 65 wrapped management endpoints, including all /key/*, /user/*, /mcp/*, and several /team/member* routes, have no http_request param. On their success path the SERVER span (created in user_api_key_auth) was therefore never ended and never exported. The failure path was unaffected (it ends the parent via a func.__name__ fallback), so dashboards saw failed admin calls but not successful ones. That contradicts the goal of "HTTP status + path + duration on the root span for all statuses."

The fix

Mirror the failure branch in the success branch: run the success hook regardless, falling back to func.__name__ for the route when http_request is absent. One file: litellm/proxy/management_helpers/utils.py.
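The pattern described above can be sketched as follows. This is a minimal illustration, not the actual litellm code: `FakeSpan`, `success_hook`, `management_endpoint_wrapper_sketch`, the `parent_span` keyword, the dict-shaped `http_request`, and the `http.route` attribute name are all assumptions made for the example. Only the idea (run the success hook regardless, fall back to `func.__name__` for the route) comes from the PR description.

```python
import asyncio
import functools


class FakeSpan:
    """Hypothetical stand-in for the parent SERVER span."""

    def __init__(self):
        self.attributes = {}
        self.ended = False

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def end(self):
        self.ended = True


async def success_hook(span, route):
    # Stamp the status code and route, then end the span so it can be exported.
    span.set_attribute("http.response.status_code", 200)
    span.set_attribute("http.route", route)  # attribute name is an assumption
    span.end()


def management_endpoint_wrapper_sketch(func):
    @functools.wraps(func)
    async def wrapper(*args, parent_span=None, **kwargs):
        result = await func(*args, **kwargs)
        http_request = kwargs.get("http_request")
        # Fall back to the handler name when no http_request was passed.
        route = http_request["path"] if http_request else func.__name__
        if parent_span is not None:
            await success_hook(parent_span, route)
        return result

    return wrapper


@management_endpoint_wrapper_sketch
async def generate_key_fn(data):
    # Handler with no http_request parameter, like the /key/* endpoints.
    return {"key": "sk-example"}


span = FakeSpan()
result = asyncio.run(generate_key_fn({"duration": "1h"}, parent_span=span))
```

In this sketch the span ends on success even though the handler never sees an `http_request`, which is the behavior the fix is meant to guarantee.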

Verification

Unit: new test_management_wrapper_success_ends_server_span_without_http_request drives the real wrapper around an http_request-less handler and asserts the SERVER span exports with 200. It fails before this change ("SERVER span never finished") and passes after. Full tests/test_litellm/integrations/open_telemetry/ suite: 110 passed.
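The shape of such a regression check can be sketched with a toy wrapper. This is an assumption-laden illustration rather than the real test: `RecordingSpan`, `wrap`, and `run_case` are hypothetical names, and the `require_http_request` flag only imitates the old guard so that the before and after behavior can be compared side by side.

```python
import asyncio


class RecordingSpan:
    """Hypothetical span double that records whether end() was called."""

    def __init__(self):
        self.status_code = None
        self.ended = False

    def end(self):
        self.ended = True


def wrap(handler, require_http_request):
    """Toy wrapper; require_http_request=True imitates the old guard."""

    async def wrapper(span, **kwargs):
        result = await handler(**kwargs)
        if require_http_request and "http_request" not in kwargs:
            return result  # old behavior: the span is never ended
        span.status_code = 200
        span.end()
        return result

    return wrapper


async def handler_without_http_request(**kwargs):
    return "ok"


def run_case(require_http_request):
    span = RecordingSpan()
    asyncio.run(wrap(handler_without_http_request, require_http_request)(span))
    return span


before = run_case(require_http_request=True)
after = run_case(require_http_request=False)
assert not before.ended  # mirrors the "SERVER span never finished" failure
assert after.ended and after.status_code == 200
```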

End-to-end (otel_verify/): a self-contained harness (mock model + custom file span-exporter + a Bedrock-style guardrail) drives a 12-case curl matrix against a running proxy, each with a deterministic traceparent, and asserts the spans:

| Case | HTTP | Checks | Before | After |
|---|---|---|---|---|
| chat success (team key) | 200 | #28405 status/route/dur + #28273 team attrs on SERVER/litellm_request/raw_gen_ai_request | | |
| chat invalid-JSON / bad-key / unknown-model | 400/401/400 | #28405 | | |
| chat mock 429 / 500 (team key) | 429/500 | #28405 + #28273 team attrs on SERVER + Failed Proxy Server Request | | |
| /v1/messages success | 200 | #28405 | | |
| /key/generate success | 200 | #28405 SERVER span exported | ❌ leaked | |
| /key/generate failure / 422 | 500/422 | #28405 | | |
| guardrail block | 400 | #28364 span on failure path + status/action/violation_categories; #28362 guardrail_response valid JSON | | |
| guardrail allow | 200 | #28364 status=success; #28362 valid JSON | | |

Result: 12/12 pass after the fix (was 11/12). All four original PRs are confirmed working at runtime.

Harness files: config.yaml, setup.sh (team+key), otel_file_exporter.py, verify_guardrail.py, run_matrix.sh, verify_spans.py. Runtime artifacts (incl. the generated virtual key) are gitignored.
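The span-verification step can be sketched as grouping exported spans by trace ID and checking the SERVER span's status code. The JSONL schema here (`trace_id`, `name`, `kind`, `attributes` keys) is an assumption for illustration. The real otel_file_exporter.py output may differ, and `group_spans_by_trace` and `server_span_status` are hypothetical helpers.

```python
import json
from collections import defaultdict


def group_spans_by_trace(lines):
    """Parse one JSON span per line and bucket the spans by trace ID."""
    traces = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        span = json.loads(line)
        traces[span["trace_id"]].append(span)
    return traces


def server_span_status(traces, trace_id):
    """Return the SERVER span's status code, or None if no SERVER span exists."""
    for span in traces.get(trace_id, []):
        if span.get("kind") == "SERVER":
            return span.get("attributes", {}).get("http.response.status_code")
    return None


# Assumed sample data standing in for the exporter's output file.
sample = [
    json.dumps({
        "trace_id": "aaaa",
        "name": "POST /key/generate",
        "kind": "SERVER",
        "attributes": {"http.response.status_code": 200},
    }),
    json.dumps({
        "trace_id": "bbbb",
        "name": "litellm_request",
        "kind": "INTERNAL",
        "attributes": {},
    }),
]
traces = group_spans_by_trace(sample)
```

A trace with no SERVER span yields `None`, which is how a leaked (never-ended) span would show up in a check like this.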

Notes

  • /v1/responses was not exercised live (the matrix focuses on chat/messages/admin). The existing unit suite covers it.
  • Draft pending maintainer review.

Generated by Claude Code

…ut http_request

management_endpoint_wrapper only ran async_management_endpoint_success_hook —
which stamps http.response.status_code=200 and end()s the parent SERVER span —
when the handler declared an http_request param. 45 of 65 wrapped endpoints
(all /key/*, /user/*, /mcp/*, ...) lack it, so on success their SERVER span,
created in auth, was never ended and thus never exported. Mirror the failure
branch: invoke the success hook regardless, falling back to func.__name__ for
the route.

Add a wrapper-level regression test (fails before this change) and an
otel_verify/ end-to-end harness (config + curl matrix + span verifier)
verifying PRs #28273, #28362, #28364, #28405 against a running proxy.

https://claude.ai/code/session_016u6Pe2S2zBVrUuFF1N6GkJ
@CLAassistant

CLAassistant commented May 25, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codspeed-hq

codspeed-hq Bot commented May 25, 2026

Contributor

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing claude/funny-fermat-xUS28 (8323ea3) with main (06f6cfc)


@yassin-berriai yassin-berriai changed the base branch from main to litellm_oss_branch May 25, 2026 16:58
@codecov

codecov Bot commented May 25, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.


@yassin-berriai yassin-berriai changed the base branch from litellm_oss_branch to litellm_internal_staging May 25, 2026 17:00

def main():
    traces = {}
    for line in open(sys.argv[1] if len(sys.argv) > 1 else "spans.jsonl"):
Comment thread otel_verify/verify_spans.py Outdated
spans = traces.get(tid, [])
by_name = {}
for s in spans:
    by_name.setdefault(s["name"], s).update() if False else by_name.setdefault(s["name"], s)
cats_ok = False
try:
    cats_ok = sorted(json.loads(cats_raw)) == sorted(g["categories"])
except Exception:
claude added 3 commits May 25, 2026 17:04
Re-runs CI against the corrected base branch and tidies the verification
harness (Black formatting + removal of an unused by_name block).

https://claude.ai/code/session_016u6Pe2S2zBVrUuFF1N6GkJ
Adding the success-path fix pushed management_endpoint_wrapper to 52
statements (ruff PLR0915 limit is 50). Extract the shared success/failure
OTEL span emission into _emit_management_endpoint_otel_span, which also
de-duplicates the two near-identical inline blocks. Behavior is unchanged:
the parent SERVER span is stamped + ended on both paths, including for
handlers without an http_request param.

https://claude.ai/code/session_016u6Pe2S2zBVrUuFF1N6GkJ
…guards

Bring patch coverage of _emit_management_endpoint_otel_span and the wrapper
to 100%: add tests for the failure path, the http_request-present path, the
no-OTEL-logger early return, and the non-blocking post-success error path.
Remove two unreachable `if kwargs is None` guards (**kwargs is always a dict).

https://claude.ai/code/session_016u6Pe2S2zBVrUuFF1N6GkJ
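The reasoning behind removing the `if kwargs is None` guards can be checked with a short example. `handler` is a hypothetical function used only to show that Python always binds `**kwargs` to a dict, even when no keyword arguments are passed.

```python
def handler(**kwargs):
    # kwargs is always a dict here; it can be empty but never None.
    return kwargs


empty = handler()
filled = handler(team_id="t1")
```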

4 participants