feat(otel): OTel-standard attributes on the proxy SERVER span (status code, route/path, preprocessing latency) #28040
Set the OTel-standard http.response.status_code (integer) on failure
spans alongside the existing OpenInference error.code (kept for
back-compat). error.type is already emitted via ERROR_TYPE.
Crucially, also record structured error attributes on the proxy SERVER
span ('Received Proxy Server Request') from async_post_call_failure_hook
- the only place the SERVER span is in hand. _handle_failure records on
the litellm_request child span (the parent span is not propagated into
its kwargs), so prior to this change the SERVER span that dashboards
query carried only span status, never error.code/error.type. Reuses
_record_exception_on_span + StandardLoggingPayloadSetup.get_error_information
so values match the child span.
Tests: recorder unit coverage + a hook-driven test asserting the SERVER
span is stamped (the gap recorder-only tests missed). Full
test_opentelemetry.py suite: 197 passed.
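A minimal sketch of the dual stamping described in this commit, assuming only that a span exposes `set_attribute`. `FakeSpan` and `record_error_attributes` are hypothetical names used for illustration, not the helpers in this PR; the attribute keys are the ones named in the commit message, and treating `error.code` as a string is an assumption.

```python
from typing import Any, Dict, Optional


class FakeSpan:
    """Hypothetical stand-in for an OTel span; it only records attributes."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


def record_error_attributes(
    span: Optional[FakeSpan],
    error_code: Optional[str],
    error_type: Optional[str],
) -> None:
    """Stamp legacy and OTel-standard error attributes; no-op on a None span."""
    if span is None:
        return
    if error_type is not None:
        span.set_attribute("error.type", error_type)
    if error_code is None:
        return
    # Legacy attribute, kept for back-compat (assumed to stay a string here).
    span.set_attribute("error.code", error_code)
    try:
        # OTel-standard attribute must be an integer.
        span.set_attribute("http.response.status_code", int(error_code))
    except ValueError:
        # Non-numeric codes simply skip the integer attribute.
        pass


span = FakeSpan()
record_error_attributes(span, "429", "RateLimitError")
```

Narrowing `error_code` through the early `None` return before calling `int()` is one way to keep a type checker satisfied when the code is typed as `Optional[str]`.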
Codecov Report❌ Patch coverage is
Greptile Summary
Adds three OTel-standard attributes to the proxy SERVER span (
Confidence Score: 5/5
Additive-only changes to the OTel layer; all new code is guarded with None checks and broad exception handlers so a missing span, missing timestamp, or non-numeric error code degrades gracefully to a no-op rather than surfacing an error.
The three new attributes are set on spans that already exist, the timing anchors are propagated through channels that already carry similar internal metadata (headers, parent_otel_span), and all edge cases (pre-auth failure, missing logging obj, clock skew, non-numeric error code, None span) have explicit tests. No auth logic, routing logic, or DB access is modified.
No files require special attention. The most non-trivial logic is in set_preprocessing_duration_attribute and the lift-before-pop in proxy/utils.py, both of which are directly covered by the new tests.
| Filename | Overview |
|---|---|
| litellm/integrations/opentelemetry.py | Adds four module-level attribute-name constants, exposes http.response.status_code (as int) in _record_exception_on_span, stamps error attrs + preprocessing duration on the SERVER span in async_post_call_failure_hook, adds the same duration in async_post_call_success_hook, and introduces two new well-guarded helpers (set_proxy_request_route_attributes, set_preprocessing_duration_attribute). |
| litellm/litellm_core_utils/litellm_logging.py | Adds set-once first_api_call_start_time to model_call_details in pre_call(); deliberately does not touch litellm_params["metadata"] to avoid echoing a datetime into provider request bodies or batch objects. |
| litellm/proxy/auth/user_api_key_auth.py | Captures the true proxy-receive instant into request.state immediately at the top of _user_api_key_auth_builder, then calls set_proxy_request_route_attributes on the freshly-created SERVER span inside the existing open_telemetry_logger guard. |
| litellm/proxy/litellm_pre_call_utils.py | Propagates litellm_received_at (datetime or None) from request.state into the internal metadata dict via the same channel as endpoint/headers, making it accessible to the OTel layer for preprocessing latency. |
| litellm/proxy/utils.py | Lifts first_api_call_start_time off the logging object into the top level of request_data before the non-serialisable logging object is popped, so failure-path OTel callbacks can still compute preprocessing latency. |
Reviews (3): Last reviewed commit: "Merge remote-tracking branch 'origin/lit..."
Add the OTel-standard http.route (low-cardinality route template, e.g.
/v1/threads/{thread_id}/runs) and url.path (literal path) to the SERVER
span ('Received Proxy Server Request') so dashboards can group traffic
by endpoint instead of seeing every path param as a unique value.
Same architectural gap as the status-code commit: the success/failure
logging handlers write the litellm_request CHILD span, and
_handle_success explicitly refuses to copy to the SERVER span. Verified
with a console-exporter run that the SERVER span was bare on success.
Unlike error info, route/path are known at request time, so set them
directly on the freshly-created SERVER span in user_api_key_auth (one
edit point, works for success and failure, no hook-ordering risk):
- http.route from the matched FastAPI route (scope['route'].path),
empirically confirmed populated at auth-dependency time.
- url.path from the existing literal-path variable.
New get_request_route_template helper + set_proxy_request_route_attributes
(no-op on None span, so the Langfuse override stays safe).
Tests: route-attribute setter + route-template helper edges. Full
test_opentelemetry.py and test_auth_utils.py green.
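A rough sketch of the route-template lookup described above, written against a plain ASGI-style scope dict so it runs without FastAPI. `route_template_from_scope` and `route_attributes` are hypothetical names, not the PR's `get_request_route_template` or `set_proxy_request_route_attributes`, and the `SimpleNamespace` route is a stand-in for the matched route object.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def route_template_from_scope(scope: Dict[str, Any]) -> Optional[str]:
    """Return the matched route's path template, or None if no route matched."""
    route = scope.get("route")
    path = getattr(route, "path", None)
    return path if isinstance(path, str) and path else None


def route_attributes(scope: Dict[str, Any], literal_path: str) -> Dict[str, str]:
    """Build the span attributes: url.path always, http.route only when known."""
    attrs = {"url.path": literal_path}
    template = route_template_from_scope(scope)
    if template is not None:
        attrs["http.route"] = template
    return attrs


# Stand-in for a request whose route has already been matched.
scope = {"route": SimpleNamespace(path="/v1/threads/{thread_id}/runs")}
attrs = route_attributes(scope, "/v1/threads/thread_abc123/runs")
```

Grouping a dashboard by `http.route` then collapses every thread ID into one series, while `url.path` still carries the literal value for debugging a single request.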
… span
Expose the total time LiteLLM spends before the upstream provider
request begins (auth + parsing + pre-call hooks) as a single number on
the SERVER span ('Received Proxy Server Request'). Window:
proxy-receive -> FIRST provider handoff.
Retry semantics: first attempt only (pure preprocessing, excludes
retry loops + backoff). api_call_start_time is overwritten on every
attempt, so a set-once first_api_call_start_time pins the first handoff.
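The set-once behaviour can be sketched as follows, assuming the timestamps live in a plain dict. `mark_api_call_start` is a hypothetical name for illustration; only the two key names come from the commit message.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def mark_api_call_start(model_call_details: Dict[str, Any], now: datetime) -> None:
    """Record an attempt's start; pin the very first attempt separately."""
    # Overwritten on every attempt, including retries.
    model_call_details["api_call_start_time"] = now
    # Written only if absent, so retries cannot move the first handoff.
    model_call_details.setdefault("first_api_call_start_time", now)


details: Dict[str, Any] = {}
first_attempt = datetime(2024, 1, 1, tzinfo=timezone.utc)
retry_attempt = first_attempt + timedelta(seconds=2)
mark_api_call_start(details, first_attempt)
mark_api_call_start(details, retry_attempt)
```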
Same architectural gap as the prior two commits: the success/failure
logging handlers write the litellm_request CHILD span, not the SERVER
span. Set it instead from the post-call hooks on
user_api_key_dict.parent_otel_span.
Failure-path subtlety: request_data.pop('litellm_logging_obj') runs
before the failure-hook loop, so the failure hook can't read the
logging object. litellm_received_at is propagated via the existing
request->metadata channel, and first_api_call_start_time is mirrored
onto litellm_params.metadata, so both anchors survive into request_data
and the OTel helper reads them uniformly for success and failure.
Edits: user_api_key_auth (stash receive instant), litellm_pre_call_utils
(propagate it), litellm_logging (set-once first handoff + metadata
mirror), opentelemetry (constant + set_preprocessing_duration_attribute,
called from both post-call hooks).
Tests: duration helper (both container shapes, missing/negative/None
edges) + set-once invariant (retry doesn't overwrite, metadata mirror).
test_opentelemetry.py + test_auth_utils.py + test_litellm_logging.py:
447 passed. Verified live: SERVER span carries the attribute on success
and failure, coexisting with the status-code and route attributes.
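A minimal sketch of the duration computation, assuming both anchors arrive as `datetime` objects or epoch seconds. The `_to_timestamp` body here is a guess at the helper's shape rather than the PR's code, `preprocessing_duration_ms` is a hypothetical name, and dropping negative windows is an assumption based on the "negative" test edge mentioned above.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def _to_timestamp(value: Any) -> Optional[float]:
    """Convert a datetime or epoch-seconds number to float seconds, else None."""
    if isinstance(value, datetime):
        return value.timestamp()
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return None


def preprocessing_duration_ms(received_at: Any, first_handoff: Any) -> Optional[float]:
    """Milliseconds from proxy-receive to the first provider handoff."""
    start = _to_timestamp(received_at)
    end = _to_timestamp(first_handoff)
    # Resolve both anchors first; bail out if either is missing.
    if start is None or end is None:
        return None
    elapsed_ms = (end - start) * 1000.0
    # Assumed policy: a negative window (clock skew) is not reported.
    if elapsed_ms < 0:
        return None
    return elapsed_ms


received = datetime(2024, 1, 1, tzinfo=timezone.utc)
handoff = received + timedelta(milliseconds=250)
duration = preprocessing_duration_ms(received, handoff)
```

Returning `None` instead of raising lets the caller skip the span attribute entirely when an anchor is missing, which matches the no-op behaviour the tests describe.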
No behavior change. MyPy (CI lint) flagged:
- error_information["error_code"] is str|None: narrow via a None-checked local before int().
- _to_timestamp returns Optional[float]: resolve both anchors and return early if either is None instead of subtracting possibly-None floats.
@greptileai re review
Cursor Bugbot has reviewed your changes using high mode and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 6bf8db3.
    # logging object directly).
    _lp = self.model_call_details.get("litellm_params")
    if isinstance(_lp, dict) and isinstance(_lp.get("metadata"), dict):
        _lp["metadata"]["first_api_call_start_time"] = _first_handoff
Preprocessing duration lost on thread/assistant failure path
Low Severity
For thread/assistant endpoints, first_api_call_start_time is written to a copy of litellm_params["metadata"] instead of the original request_data["litellm_metadata"]. This prevents set_preprocessing_duration_attribute from finding the start time on the failure path, causing the preprocessing duration to be omitted.
…tart_time
The PR3 set-once preprocessing anchor was mirrored into litellm_params["metadata"] from core litellm_logging.py. That dict is the caller's request metadata, mutated in place and shared across every call path including pure SDK (litellm.acreate_batch). It got echoed into LiteLLMBatch(metadata=...), which the OpenAI batch schema types as Dict[str, str] -> pydantic ValidationError on a datetime value.
- litellm_logging.py: set first_api_call_start_time only on model_call_details (success path reads it there directly).
- proxy/utils.py: post_call_failure_hook lifts it off the logging object into request_data (internal top-level key, same convention as the other proxy-internal request_data keys) right before the existing litellm_logging_obj pop. Never touches user metadata.
- opentelemetry.py: read the anchor from the container top level (model_call_details on success, request_data on failure).
- Tests updated; add TestPostCallFailureHookLiftsFirstApiCallStartTime.
Fixes the batches_testing regression introduced on this branch.
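The lift-before-pop step can be sketched like this, assuming the logging object exposes a `model_call_details` dict. `lift_then_pop` is a hypothetical name and `SimpleNamespace` stands in for the real logging object; the key names are the ones given in the commit message.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any, Dict


def lift_then_pop(request_data: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the first-handoff anchor to a top-level key, then drop the logging object."""
    logging_obj = request_data.get("litellm_logging_obj")
    details = getattr(logging_obj, "model_call_details", None)
    if isinstance(details, dict):
        first_handoff = details.get("first_api_call_start_time")
        if first_handoff is not None:
            # Internal top-level key; the caller's metadata dict is never touched.
            request_data["first_api_call_start_time"] = first_handoff
    # The logging object is not serialisable, so it is removed afterwards.
    request_data.pop("litellm_logging_obj", None)
    return request_data


handoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
data: Dict[str, Any] = {
    "metadata": {"team": "search"},
    "litellm_logging_obj": SimpleNamespace(
        model_call_details={"first_api_call_start_time": handoff}
    ),
}
lift_then_pop(data)
```

Keeping the datetime out of `metadata` is the point of the design: user metadata may be forwarded to schemas that accept only string values, while a top-level internal key stays inside the proxy.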
…itellm_otel_status_code_attr
@greptile re review
Collapse multi-line why-blocks to one or two lines and drop process/plan references (PR-numbering, "the plan") from test comments. No behavior change.
0300333
into
litellm_internal_staging


Summary
Adds the three OTel-standard attributes the FIL telemetry ask needs, all on the proxy SERVER span (Received Proxy Server Request). The logging handlers write a child span, so each is set where the SERVER span is in hand (auth path / post-call hooks). Squash-and-merge.
- http.response.status_code (int): on failures; legacy error.code kept
- http.route (route template) + url.path (literal path)
- litellm.preprocessing.duration_ms: proxy-receive → first provider handoff (excludes retries)
Screenshots
[Images: error curl, success curl, error trace, success trace]

Test plan
test_opentelemetry.py, test_auth_utils.py, test_litellm_logging.py: 447 passed, no regression. Lint (Black/Ruff/MyPy) clean.
Resolves LIT-3086