
feat(otel): OTel-standard attributes on the proxy SERVER span (status code, route/path, preprocessing latency)#28040

Merged
yassin-berriai merged 7 commits into litellm_internal_staging from litellm_otel_status_code_attr
May 16, 2026

Conversation

@ryan-crabbe-berri

@ryan-crabbe-berri ryan-crabbe-berri commented May 16, 2026

Collaborator

Summary

Adds the three OTel-standard attributes the FIL telemetry ask needs, all on the proxy SERVER span (Received Proxy Server Request) — the logging handlers write a child span, so each is set where the SERVER span is in hand (auth path / post-call hooks). Squash-and-merge.

  • http.response.status_code (int) — on failures; legacy error.code kept
  • http.route (route template) + url.path (literal path)
  • litellm.preprocessing.duration_ms — proxy-receive → first provider handoff (excludes retries)

Screenshots

error curl
Screenshot 2026-05-16 at 11 04 08 AM
success curl
Screenshot 2026-05-16 at 10 13 56 AM
error trace
Screenshot 2026-05-16 at 10 47 17 AM

success trace
Screenshot 2026-05-16 at 10 25 05 AM

Test plan

  • Automated: test_opentelemetry.py, test_auth_utils.py, test_litellm_logging.py — 447 passed, no regression. Lint (Black/Ruff/MyPy) clean.
  • Manual: console-exporter run; all three attributes land on one SERVER span for both success and failure.

Resolves LIT-3086

Set the OTel-standard http.response.status_code (integer) on failure
spans alongside the existing OpenInference error.code (kept for
back-compat). error.type is already emitted via ERROR_TYPE.

Crucially, also record structured error attributes on the proxy SERVER
span ('Received Proxy Server Request') from async_post_call_failure_hook
- the only place the SERVER span is in hand. _handle_failure records on
the litellm_request child span (the parent span is not propagated into
its kwargs), so prior to this change the SERVER span that dashboards
query carried only span status, never error.code/error.type. Reuses
_record_exception_on_span + StandardLoggingPayloadSetup.get_error_information
so values match the child span.

Tests: recorder unit coverage + a hook-driven test asserting the SERVER
span is stamped (the gap recorder-only tests missed). Full
test_opentelemetry.py suite: 197 passed.
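
The coercion described above can be illustrated with a minimal sketch. It is not the PR's code: `FakeSpan`, `coerce_status_code`, and `record_error_attributes` are hypothetical names, and the span is a stdlib stand-in that models only `set_attribute`. The assumption shown is that the legacy string `error.code` is always kept, while the integer `http.response.status_code` is added only when the code parses as a number.

```python
from typing import Any, Dict, Optional


class FakeSpan:
    """Hypothetical stand-in for an OTel span; only set_attribute is modeled."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


def coerce_status_code(error_code: Optional[str]) -> Optional[int]:
    """Return the error code as an int, or None when missing or non-numeric."""
    if error_code is None:
        return None
    try:
        return int(error_code)
    except ValueError:
        return None


def record_error_attributes(
    span: Optional[FakeSpan], error_code: Optional[str], error_type: str
) -> None:
    """Hypothetical helper: stamp legacy and OTel-standard error attributes."""
    if span is None:  # no span in hand: degrade to a no-op
        return
    if error_code is not None:
        span.set_attribute("error.code", error_code)  # legacy string, kept for back-compat
    span.set_attribute("error.type", error_type)
    status = coerce_status_code(error_code)
    if status is not None:
        span.set_attribute("http.response.status_code", status)  # integer form


numeric = FakeSpan()
record_error_attributes(numeric, "429", "RateLimitError")

non_numeric = FakeSpan()
record_error_attributes(non_numeric, "context_length_exceeded", "BadRequestError")
```

With a numeric code both attributes land on the span; with a non-numeric code only the legacy `error.code` and `error.type` are written.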
@codecov

codecov Bot commented May 16, 2026


Codecov Report

❌ Patch coverage is 89.70588% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| litellm/integrations/opentelemetry.py | 90.90% | 4 Missing ⚠️ |
| litellm/proxy/auth/user_api_key_auth.py | 40.00% | 3 Missing ⚠️ |


@greptile-apps

greptile-apps Bot commented May 16, 2026

Contributor

Greptile Summary

Adds three OTel-standard attributes to the proxy SERVER span (Received Proxy Server Request): http.response.status_code (int) on failures, http.route + url.path stamped in the auth path, and litellm.preprocessing.duration_ms (proxy-receive → first provider handoff, retries excluded) on both success and failure paths.

  • Error attributes (http.response.status_code, error.type, legacy error.code) are now set directly on the SERVER span via async_post_call_failure_hook, which previously marked only the span status and set no structured attributes. The http.response.status_code value is the string error_code coerced to int; it is silently omitted when the code is non-numeric.
  • Route attributes are set immediately after the SERVER span is created in the auth builder — the only point where both the span and the full Request object are available together.
  • Preprocessing duration is anchored by a set-once first_api_call_start_time (written on the first pre_call only, not on retries) and a litellm_received_at datetime propagated through request state → internal metadata → failure-path top-level key, with careful guard rails to avoid injecting datetimes into user-facing metadata.

Confidence Score: 5/5

Additive-only changes to the OTel layer; all new code is guarded with None checks and broad exception handlers so a missing span, missing timestamp, or non-numeric error code degrades gracefully to a no-op rather than surfacing an error.

The three new attributes are set on spans that already exist, the timing anchors are propagated through channels that already carry similar internal metadata (headers, parent_otel_span), and all edge cases (pre-auth failure, missing logging obj, clock skew, non-numeric error code, None span) have explicit tests. No auth logic, routing logic, or DB access is modified.

No files require special attention. The most non-trivial logic is in set_preprocessing_duration_attribute and the lift-before-pop in proxy/utils.py, both of which are directly covered by the new tests.

Important Files Changed

| Filename | Overview |
|---|---|
| litellm/integrations/opentelemetry.py | Adds four module-level attribute-name constants, exposes http.response.status_code (as int) in _record_exception_on_span, stamps error attrs + preprocessing duration on the SERVER span in async_post_call_failure_hook, adds the same duration in async_post_call_success_hook, and introduces two new well-guarded helpers (set_proxy_request_route_attributes, set_preprocessing_duration_attribute). |
| litellm/litellm_core_utils/litellm_logging.py | Adds set-once first_api_call_start_time to model_call_details in pre_call(); deliberately does not touch litellm_params["metadata"] to avoid echoing a datetime into provider request bodies or batch objects. |
| litellm/proxy/auth/user_api_key_auth.py | Captures the true proxy-receive instant into request.state immediately at the top of _user_api_key_auth_builder, then calls set_proxy_request_route_attributes on the freshly-created SERVER span inside the existing open_telemetry_logger guard. |
| litellm/proxy/litellm_pre_call_utils.py | Propagates litellm_received_at (datetime or None) from request.state into the internal metadata dict via the same channel as endpoint/headers, making it accessible to the OTel layer for preprocessing latency. |
| litellm/proxy/utils.py | Lifts first_api_call_start_time off the logging object into the top level of request_data before the non-serialisable logging object is popped, so failure-path OTel callbacks can still compute preprocessing latency. |

Reviews (3): Last reviewed commit: "Merge remote-tracking branch 'origin/lit..."

Add the OTel-standard http.route (low-cardinality route template, e.g.
/v1/threads/{thread_id}/runs) and url.path (literal path) to the SERVER
span ('Received Proxy Server Request') so dashboards can group traffic
by endpoint instead of seeing every path param as a unique value.

Same architectural gap as the status-code commit: the success/failure
logging handlers write the litellm_request CHILD span, and
_handle_success explicitly refuses to copy to the SERVER span. Verified
with a console-exporter run that the SERVER span was bare on success.

Unlike error info, route/path are known at request time, so set them
directly on the freshly-created SERVER span in user_api_key_auth (one
edit point, works for success and failure, no hook-ordering risk):
- http.route from the matched FastAPI route (scope['route'].path),
  empirically confirmed populated at auth-dependency time.
- url.path from the existing literal-path variable.
New get_request_route_template helper + set_proxy_request_route_attributes
(no-op on None span, so the Langfuse override stays safe).

Tests: route-attribute setter + route-template helper edges. Full
test_opentelemetry.py and test_auth_utils.py green.
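
As a rough illustration of the template/literal split, the sketch below reads the route template from an ASGI-style scope and stamps both attributes. Everything here is an assumption made for the example: `FakeRoute`, `FakeSpan`, `route_template_from_scope`, and `stamp_route_attributes` are hypothetical names, and the scope is a plain dict rather than a real FastAPI request.

```python
from typing import Any, Dict, Optional


class FakeRoute:
    """Hypothetical stand-in for a matched route object exposing .path."""

    def __init__(self, path: str) -> None:
        self.path = path


class FakeSpan:
    """Hypothetical stand-in for an OTel span; only set_attribute is modeled."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


def route_template_from_scope(scope: Dict[str, Any]) -> Optional[str]:
    """Return the matched route template, or None if no route was matched."""
    path = getattr(scope.get("route"), "path", None)
    return path if isinstance(path, str) and path else None


def stamp_route_attributes(
    span: Optional[FakeSpan], scope: Dict[str, Any], literal_path: Optional[str]
) -> None:
    """Set http.route (template) and url.path (literal); no-op on a None span."""
    if span is None:
        return
    template = route_template_from_scope(scope)
    if template is not None:
        span.set_attribute("http.route", template)
    if literal_path:
        span.set_attribute("url.path", literal_path)


span = FakeSpan()
scope = {"route": FakeRoute("/v1/threads/{thread_id}/runs")}
stamp_route_attributes(span, scope, "/v1/threads/thread_abc123/runs")
```

Grouping by `http.route` then yields one series per endpoint, while `url.path` keeps the concrete path for drill-down.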
@ryan-crabbe-berri ryan-crabbe-berri changed the title from "feat(otel): expose http.response.status_code on failure spans (incl. SERVER span)" to "feat(otel): OTel-standard HTTP attributes on the proxy SERVER span (status code, http.route, url.path)" May 16, 2026
… span

Expose the total time LiteLLM spends before the upstream provider
request begins (auth + parsing + pre-call hooks) as a single number on
the SERVER span ('Received Proxy Server Request'). Window:
proxy-receive -> FIRST provider handoff.

Retry semantics: first attempt only (pure preprocessing, excludes
retry loops + backoff). api_call_start_time is overwritten on every
attempt, so a set-once first_api_call_start_time pins the first handoff.

Same architectural gap as the prior two commits: the success/failure
logging handlers write the litellm_request CHILD span, not the SERVER
span. Set it instead from the post-call hooks on
user_api_key_dict.parent_otel_span.

Failure-path subtlety: request_data.pop('litellm_logging_obj') runs
before the failure-hook loop, so the failure hook can't read the
logging object. litellm_received_at is propagated via the existing
request->metadata channel, and first_api_call_start_time is mirrored
onto litellm_params.metadata, so both anchors survive into request_data
and the OTel helper reads them uniformly for success and failure.

Edits: user_api_key_auth (stash receive instant), litellm_pre_call_utils
(propagate it), litellm_logging (set-once first handoff + metadata
mirror), opentelemetry (constant + set_preprocessing_duration_attribute,
called from both post-call hooks).

Tests: duration helper (both container shapes, missing/negative/None
edges) + set-once invariant (retry doesn't overwrite, metadata mirror).
test_opentelemetry.py + test_auth_utils.py + test_litellm_logging.py:
447 passed. Verified live: SERVER span carries the attribute on success
and failure, coexisting with the status-code and route attributes.
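
The set-once anchor and the resulting window can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `mark_api_call_start` and `preprocessing_duration_ms` are hypothetical names, the container is a plain dict, and dropping a negative window (for example from clock skew) is an assumed policy.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def mark_api_call_start(details: Dict[str, Any], now: datetime) -> None:
    """Record a provider handoff; the 'first' anchor is written only once."""
    details["api_call_start_time"] = now  # overwritten on every attempt
    details.setdefault("first_api_call_start_time", now)  # pinned to the first attempt


def preprocessing_duration_ms(
    received_at: Optional[datetime], first_handoff: Optional[datetime]
) -> Optional[float]:
    """Milliseconds from proxy-receive to first handoff, or None if unusable."""
    if received_at is None or first_handoff is None:
        return None  # a missing anchor means no attribute is emitted
    elapsed_ms = (first_handoff - received_at).total_seconds() * 1000.0
    return elapsed_ms if elapsed_ms >= 0 else None  # assumed: negative windows are dropped


received = datetime(2026, 5, 16, 12, 0, 0, tzinfo=timezone.utc)
details: Dict[str, Any] = {}
mark_api_call_start(details, received + timedelta(milliseconds=40))   # first attempt
mark_api_call_start(details, received + timedelta(milliseconds=900))  # retry
duration = preprocessing_duration_ms(received, details["first_api_call_start_time"])
```

Because the retry only moves `api_call_start_time`, the computed duration reflects the first handoff and excludes the retry and its backoff.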
@ryan-crabbe-berri ryan-crabbe-berri changed the title from "feat(otel): OTel-standard HTTP attributes on the proxy SERVER span (status code, http.route, url.path)" to "feat(otel): OTel-standard attributes on the proxy SERVER span (status code, route/path, preprocessing latency)" May 16, 2026
No behavior change. MyPy (CI lint) flagged:
- error_information["error_code"] is str|None: narrow via a None-checked
  local before int().
- _to_timestamp returns Optional[float]: resolve both anchors and return
  early if either is None instead of subtracting possibly-None floats.
@ryan-crabbe-berri

Collaborator Author

@greptileai re review

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using high mode and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 6bf8db3. Configure here.

# logging object directly).
_lp = self.model_call_details.get("litellm_params")
if isinstance(_lp, dict) and isinstance(_lp.get("metadata"), dict):
    _lp["metadata"]["first_api_call_start_time"] = _first_handoff


Preprocessing duration lost on thread/assistant failure path

Low Severity

For thread/assistant endpoints, first_api_call_start_time is written to a copy of litellm_params["metadata"] instead of the original request_data["litellm_metadata"]. This prevents set_preprocessing_duration_attribute from finding the start time on the failure path, causing the preprocessing duration to be omitted.


…tart_time

The PR3 set-once preprocessing anchor was mirrored into
litellm_params["metadata"] from core litellm_logging.py. That dict is
the caller's request metadata, mutated in place and shared across every
call path including pure SDK (litellm.acreate_batch). It got echoed into
LiteLLMBatch(metadata=...), which the OpenAI batch schema types as
Dict[str, str] -> pydantic ValidationError on a datetime value.

- litellm_logging.py: set first_api_call_start_time only on
  model_call_details (success path reads it there directly).
- proxy/utils.py: post_call_failure_hook lifts it off the logging object
  into request_data (internal top-level key, same convention as the
  other proxy-internal request_data keys) right before the existing
  litellm_logging_obj pop. Never touches user metadata.
- opentelemetry.py: read the anchor from the container top level
  (model_call_details on success, request_data on failure).
- Tests updated; add TestPostCallFailureHookLiftsFirstApiCallStartTime.

Fixes the batches_testing regression introduced on this branch.
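
The lift-before-pop step can be sketched like this. It is a simplified illustration, not the code in proxy/utils.py: `FakeLoggingObj` and `lift_anchor_then_pop` are hypothetical names, and `request_data` is a plain dict. The point shown is that the anchor moves to an internal top-level key while the caller's metadata dict is left untouched.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


class FakeLoggingObj:
    """Hypothetical stand-in for the logging object and its model_call_details."""

    def __init__(self, first_handoff: Optional[datetime]) -> None:
        self.model_call_details: Dict[str, Any] = {
            "first_api_call_start_time": first_handoff
        }


def lift_anchor_then_pop(request_data: Dict[str, Any]) -> None:
    """Copy the anchor to the top level of request_data, then drop the logging object."""
    logging_obj = request_data.get("litellm_logging_obj")
    details = getattr(logging_obj, "model_call_details", None)
    if isinstance(details, dict):
        anchor = details.get("first_api_call_start_time")
        if anchor is not None:
            # Internal top-level key; user-facing metadata is never written to.
            request_data["first_api_call_start_time"] = anchor
    request_data.pop("litellm_logging_obj", None)


handoff = datetime(2026, 5, 16, 12, 0, 0, tzinfo=timezone.utc)
user_metadata = {"team": "search"}
request_data: Dict[str, Any] = {
    "metadata": user_metadata,
    "litellm_logging_obj": FakeLoggingObj(handoff),
}
lift_anchor_then_pop(request_data)
```

After the call the failure hooks can still read the anchor from `request_data`, and `user_metadata` contains only the string values the caller supplied, so a `Dict[str, str]` schema downstream never sees a datetime.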
@cursor

cursor Bot commented May 16, 2026


Bugbot is paused — on-demand spend limit reached

Bugbot uses usage-based billing for this team and has hit its on-demand spend limit.

A team admin can raise the spend limit in the Cursor dashboard, or wait for the next billing cycle to continue.

@ryan-crabbe-berri

Collaborator Author

@greptile re review

Collapse multi-line why-blocks to one or two lines and drop process/plan references (PR-numbering, "the plan") from test comments. No behavior change.
@yassin-berriai yassin-berriai merged commit 0300333 into litellm_internal_staging May 16, 2026
112 of 114 checks passed
@yassin-berriai yassin-berriai deleted the litellm_otel_status_code_attr branch May 16, 2026 20:45

2 participants