
feat(otel-v2): emit the 6 gen_ai.client.* metrics at parity with v1#30326

Merged: yassin-berriai merged 1 commit into litellm_internal_staging from litellm_fix_otel_v2_metrics on Jun 15, 2026

@yassin-berriai

Contributor

Relevant issues

Extends #30257 (LIT-3600) to the OpenTelemetry v2 integration.

Dependency

This is stacked on #30257 and is based on that branch, because it reuses the metric attribute filter helpers introduced there. The base is litellm_fix_otel_metrics_cardinality for a clean v2-only diff; retarget to litellm_internal_staging once #30257 merges.

Changes

The v2 OTEL integration (litellm/integrations/otel/) was a span engine; it declared two metric histograms but never created a meter or recorded anything, so a v2-default deployment emitted no gen_ai.client.* metrics at all. This brings it to parity with v1.

It adds the four missing metric names, all six histograms, a meter-provider builder that mirrors v1's exporter selection, and a GenAIMetricRecorder that records token usage (split input/output), cost, operation duration, time to first token (streaming only), time per output token, and response duration on the success hook. Everything is gated on config.enable_metrics, so the default is unchanged.
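
As a rough illustration of what gets recorded per call, here is a stdlib-only sketch. It is not the PR's GenAIMetricRecorder: `CallTimings`, `derive_points`, and the exact timing formulas are assumptions made for this example, and the real recorder writes to OpenTelemetry histograms rather than returning tuples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# (metric name, value, attributes); a stand-in for histogram.record(value, attrs)
Point = Tuple[str, float, Dict[str, str]]


@dataclass
class CallTimings:
    """Hypothetical container; all timestamps are in seconds."""
    start: float                  # request sent
    end: float                    # full response received
    first_token: Optional[float]  # first streamed chunk; None when not streaming
    input_tokens: int
    output_tokens: int
    cost_usd: float


def derive_points(t: CallTimings) -> List[Point]:
    """Build the data points for one successful call (formulas are assumed)."""
    points: List[Point] = [
        ("gen_ai.client.operation.duration", t.end - t.start, {}),
        # Token usage is one instrument split by the gen_ai.token.type attribute.
        ("gen_ai.client.token.usage", float(t.input_tokens), {"gen_ai.token.type": "input"}),
        ("gen_ai.client.token.usage", float(t.output_tokens), {"gen_ai.token.type": "output"}),
        ("gen_ai.client.token.cost", t.cost_usd, {}),
    ]
    if t.first_token is not None:
        # Time to first token only makes sense for streaming calls.
        points.append(
            ("gen_ai.client.response.time_to_first_token", t.first_token - t.start, {})
        )
    # Assumption: generation starts at the first token when one was observed.
    generation_start = t.first_token if t.first_token is not None else t.start
    response_duration = t.end - generation_start
    points.append(("gen_ai.client.response.duration", response_duration, {}))
    if t.output_tokens > 0:
        points.append(
            ("gen_ai.client.response.time_per_output_token",
             response_duration / t.output_tokens, {})
        )
    return points
```

Under these assumptions a streaming call yields all six instrument names, while a non-streaming call yields five because time to first token is skipped.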

The attribute cardinality filter is reused from v1 by import (no duplication of the valid-name set or validation) and resolved lazily from callback_settings.otel.attributes, the same as v1. A misconfigured filter raises out of the recorder; the logger surfaces it once at ERROR and records nothing, rather than silently disabling metrics, and a corrected config recovers without a restart.
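
The log-once-and-recover behaviour described above could look roughly like the following stdlib-only sketch. The class names, the `VALID_NAMES` set, and the choice to re-read settings on every call are assumptions for illustration; the PR reuses v1's helpers and reads `callback_settings.otel.attributes`, neither of which is reproduced here.

```python
import logging
from typing import Callable, Dict, List, Tuple

log = logging.getLogger("otel_v2_sketch")

# Illustrative set only; the real valid-name set lives in the v1 helpers.
VALID_NAMES = {"gen_ai.request.model", "hidden_params", "metadata.requester_ip_address"}


class LazyFilterRecorder:
    """Resolves the exclude list at record time, so config edits are picked up."""

    def __init__(self, get_settings: Callable[[], dict]) -> None:
        self._get_settings = get_settings
        self.recorded: List[Tuple[float, Dict[str, str]]] = []

    def _resolve_exclude(self) -> set:
        exclude = self._get_settings().get("exclude_list", [])
        unknown = [name for name in exclude if name not in VALID_NAMES]
        if unknown:
            # A misconfigured filter raises out of the recorder.
            raise ValueError(f"unknown attribute names: {unknown}")
        return set(exclude)

    def record(self, value: float, attrs: Dict[str, str]) -> None:
        exclude = self._resolve_exclude()
        kept = {k: v for k, v in attrs.items() if k not in exclude}
        self.recorded.append((value, kept))


class SketchLogger:
    """Surfaces a filter error once at ERROR and records nothing for that call."""

    def __init__(self, recorder: LazyFilterRecorder) -> None:
        self._recorder = recorder
        self._error_logged = False
        self.errors_logged = 0

    def on_success(self, value: float, attrs: Dict[str, str]) -> None:
        try:
            self._recorder.record(value, attrs)
        except ValueError as exc:
            if not self._error_logged:
                log.error("invalid metric filter, nothing recorded: %s", exc)
                self._error_logged = True
                self.errors_logged += 1
```

Because the filter is resolved inside `record`, fixing the settings object makes the next call succeed without rebuilding the logger, which is the recovery property the description claims.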

Type

New Feature

Screenshots / Proof of Fix

Live proxy on the v2 path (LITELLM_OTEL_V2=true) with metrics on, the console metric exporter, and an exclude_list configured. Run with the repo root on PYTHONPATH so the branch code loads (otherwise python litellm/proxy/proxy_cli.py resolves litellm from an installed copy).

PYTHONPATH="$(pwd)" LITELLM_OTEL_V2=true LITELLM_OTEL_INTEGRATION_ENABLE_METRICS=true \
  OTEL_EXPORTER=console OTEL_ENDPOINT="" \
  python litellm/proxy/proxy_cli.py --config config.yaml --detailed_debug 2>&1 | tee litellm.log

curl -s -N http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" -H "Content-Type: application/json" \
  -d '{"model":"gpt","messages":[{"role":"user","content":"count to three"}],"stream":true}'

A streaming call emits all six instruments:

gen_ai.client.operation.duration
gen_ai.client.token.usage
gen_ai.client.token.cost
gen_ai.client.response.time_to_first_token
gen_ai.client.response.time_per_output_token
gen_ai.client.response.duration

With the exclude_list set, the recorded gen_ai.client.token.usage data point carries the identity attributes and gen_ai.token.type but none of the excluded keys (hidden_params, metadata.requester_metadata, metadata.requester_ip_address, ...), matching v1.

Docs: BerriAI/litellm-docs#344

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov

codecov Bot commented Jun 13, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 44 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| litellm/integrations/otel/plumbing/providers.py | 12.12% | 29 Missing ⚠️ |
| litellm/integrations/otel/plumbing/metrics.py | 88.88% | 13 Missing ⚠️ |
| litellm/integrations/otel/logger.py | 90.90% | 2 Missing ⚠️ |

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Contributor

Greptile Summary

This PR brings the OTel v2 integration to metric parity with v1 by adding the four missing histogram instruments (token.cost, response.time_to_first_token, response.time_per_output_token, response.duration), wiring a MeterProvider into OpenTelemetryV2.__init__, and introducing a GenAIMetricRecorder that records all six gen_ai.client.* histograms on the async success hook. The cardinality filter is reused from v1 with lazy resolution, and everything is gated on the existing enable_metrics flag so default behaviour is unchanged.

  • metrics.py (new): GenAIMetricRecorder handles attribute building, cardinality filtering, and per-metric recording; mirrors v1 timing math for TTFT, TPOT, and response duration.
  • providers.py: adds build_metric_reader / build_meter_provider factory functions mirroring the span-exporter selection, plus _otlp_metrics_endpoint for OTLP/HTTP signal-path rewriting.
  • Tests: test_otel_v2_metrics.py covers the six-histogram path, streaming vs non-streaming split, include/exclude cardinality filtering, and the gen_ai.token.type discriminator guard; test_otel_v2_logger.py gains three integration tests for the logger layer.
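
For readers unfamiliar with OTLP/HTTP signal paths: traces conventionally go to `/v1/traces` and metrics to `/v1/metrics`. The sketch below shows one plausible rewriting rule. The function name and its behaviour are assumptions for illustration; the body of the PR's `_otlp_metrics_endpoint` is not shown in this conversation.

```python
def otlp_metrics_endpoint_sketch(endpoint: str) -> str:
    """Map a base or traces OTLP/HTTP endpoint to the metrics signal path."""
    base = endpoint.rstrip("/")
    if base.endswith("/v1/metrics"):
        # Already a metrics endpoint: leave it alone.
        return base
    if base.endswith("/v1/traces"):
        # Swap the traces signal path for the metrics one.
        return base[: -len("/v1/traces")] + "/v1/metrics"
    # Bare collector URL: append the metrics signal path.
    return base + "/v1/metrics"
```

For example, `otlp_metrics_endpoint_sketch("http://collector:4318/v1/traces")` returns `"http://collector:4318/v1/metrics"` under this rule.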

Confidence Score: 4/5

The change is purely additive and gated on enable_metrics=False by default, so existing deployments are unaffected; both findings are non-blocking edge-case issues.

The core recording logic correctly mirrors v1 behaviour, the cardinality filter is reused rather than duplicated, and all six histograms are well-tested with an InMemoryMetricReader. The two findings — a silent console-exporter fallback for in_memory in build_metric_reader, and a broad ValueError catch that can mis-attribute SDK errors as filter misconfiguration — are both in low-traffic or edge-case paths and do not affect the default configuration.

litellm/integrations/otel/plumbing/providers.py (missing in_memory handler in build_metric_reader) and litellm/integrations/otel/logger.py (broad ValueError catch scope in _record_metrics)

Important Files Changed

| Filename | Overview |
|---|---|
| litellm/integrations/otel/plumbing/metrics.py | Core new file: adds GenAIMetrics dataclass, create_genai_metrics factory, and GenAIMetricRecorder with lazy cardinality filter resolution; generally well-structured at parity with v1 |
| litellm/integrations/otel/plumbing/providers.py | Adds build_metric_reader and build_meter_provider factories; _otlp_metrics_endpoint mirrors the traces counterpart but build_metric_reader silently falls back to ConsoleMetricExporter for unrecognized kinds including "in_memory" |
| litellm/integrations/otel/logger.py | Wires metric recorder into async_log_success_event; _record_metrics is best-effort and won't break request path; ValueError catch scope is slightly broad |
| litellm/integrations/otel/model/semconv.py | Adds four new Metric constants (TOKEN_COST, TIME_TO_FIRST_TOKEN, TIME_PER_OUTPUT_TOKEN, RESPONSE_DURATION); clean additive change |
| tests/test_litellm/integrations/otel/test_otel_v2_logger.py | Adds three integration tests for metrics (invalid filter, six-metric happy path, disabled-by-default); all use InMemoryMetricReader with no real network calls |
| tests/test_litellm/integrations/otel/test_otel_v2_metrics.py | Comprehensive unit tests for the recorder layer; _drive_success mutates litellm.callback_settings globally without monkeypatch, inconsistent with other tests in the same file |

Reviews (1): Last reviewed commit: "test(otel-v2): drop duplicate misconfig ..." | Re-trigger Greptile

Comment on lines +194 to +195
kind = (config.exporter or "console").lower()
if kind in ("otlp_http", "http", "http/protobuf", "http/json"):

P2 build_metric_reader doesn't recognise "in_memory" / "inmemory" / "memory" as exporter kinds, while the span-exporter path (_exporter_from_spec) does. Any config that reaches this function with exporter="in_memory" silently falls through to ConsoleMetricExporter, which is hard to diagnose because metrics appear to "work" in the console but aren't available for programmatic inspection. Adding the same guard that _exporter_from_spec already has keeps the two paths consistent.

Suggested change
-     kind = (config.exporter or "console").lower()
-     if kind in ("otlp_http", "http", "http/protobuf", "http/json"):
+     kind = (config.exporter or "console").lower()
+     if kind in ("in_memory", "inmemory", "memory"):
+         from opentelemetry.sdk.metrics.export import InMemoryMetricReader
+         return InMemoryMetricReader()
+     if kind in ("otlp_http", "http", "http/protobuf", "http/json"):

Comment on lines +244 to +253
        self._metrics_recorder.record(kwargs, response_obj, start_time, end_time)
    except ValueError as exc:
        if not self._metric_filter_error_logged:
            verbose_logger.error(
                "OpenTelemetryV2: invalid otel.attributes metric filter, metrics disabled: %s",
                exc,
            )
            self._metric_filter_error_logged = True
    except Exception as exc:
        verbose_logger.debug("OpenTelemetryV2: metric recording failed: %s", exc)

P2 Broad ValueError catch mis-attributes SDK errors as filter misconfiguration

_ensure_filter is the intended source of ValueError here, but any ValueError raised deeper in record() — e.g. if the OTEL SDK or an attribute coercion rejects a value — will be caught by the same branch and logged as "invalid otel.attributes metric filter". On the first occurrence it also permanently sets _metric_filter_error_logged = True, silencing all future ValueErrors for the lifetime of the logger. Narrowing the catch to ValueError raised specifically by _ensure_filter (or using a dedicated sentinel exception) would prevent the misleading message and the premature silencing of unrelated errors.
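
One way to picture the dedicated-exception option is the sketch below. `MetricFilterConfigError`, `ensure_filter`, and `record_once` are hypothetical names invented for this example, not identifiers from the PR; the point is only that a subclass lets the filter branch ignore unrelated ValueErrors.

```python
import logging
from typing import Callable, Dict, Iterable

log = logging.getLogger("otel_v2_sketch")


class MetricFilterConfigError(ValueError):
    """Hypothetical sentinel raised only by filter validation."""


def ensure_filter(exclude: Iterable[str], valid: Iterable[str]) -> set:
    """Validate the exclude list; raise the sentinel on unknown names."""
    unknown = sorted(set(exclude) - set(valid))
    if unknown:
        raise MetricFilterConfigError(f"unknown attribute names: {unknown}")
    return set(exclude)


def record_once(record: Callable[[], None], state: Dict[str, bool]) -> str:
    """Run record() and report which branch handled the outcome."""
    try:
        record()
        return "ok"
    except MetricFilterConfigError as exc:
        # Only genuine filter misconfiguration reaches this branch.
        if not state.get("filter_error_logged"):
            log.error("invalid otel.attributes metric filter: %s", exc)
            state["filter_error_logged"] = True
        return "filter_error"
    except Exception as exc:
        # A plain ValueError from elsewhere lands here and is not silenced
        # by the log-once flag.
        log.debug("metric recording failed: %s", exc)
        return "other_error"
```

Keeping the sentinel a subclass of ValueError means existing callers that catch ValueError continue to work.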

Base automatically changed from litellm_fix_otel_metrics_cardinality to litellm_internal_staging June 13, 2026 00:29
@yassin-berriai yassin-berriai enabled auto-merge (squash) June 13, 2026 00:32
@yassin-berriai yassin-berriai force-pushed the litellm_fix_otel_v2_metrics branch from 770fe2b to 73a3cf1 Compare June 15, 2026 23:01

std_log = kwargs.get("standard_logging_object")
md = getattr(std_log, "metadata", None) or (std_log or {}).get("metadata", {})
for key in METRIC_METADATA_KEYS:

Medium: Unbounded metric cardinality from request metadata

With metrics enabled, requester_metadata is copied from the client's request metadata and this loop records it as an OTEL histogram attribute by default. An authenticated caller can send a unique metadata object on each completion request and force the process/exporter to allocate a distinct metric series for every call; default to excluding requester-controlled metadata from metrics unless an operator explicitly includes it, or bound/normalize the values before recording.
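
As a sketch of the "bound/normalize" option, the helper below caps the number of distinct values admitted per attribute key and folds the rest into a single overflow bucket. `CardinalityLimiter`, its default limit, and the overflow label are all assumptions for illustration; nothing like it is confirmed to exist in the PR.

```python
from collections import defaultdict
from typing import Dict, Set


class CardinalityLimiter:
    """Admit at most max_values_per_key distinct values for each attribute key."""

    def __init__(self, max_values_per_key: int = 100, overflow: str = "__other__") -> None:
        self._max = max_values_per_key
        self._overflow = overflow
        self._seen: Dict[str, Set[str]] = defaultdict(set)

    def bound(self, attrs: Dict[str, object]) -> Dict[str, str]:
        out: Dict[str, str] = {}
        for key, value in attrs.items():
            text = str(value)  # normalize to a string before comparing
            seen = self._seen[key]
            if text in seen or len(seen) < self._max:
                seen.add(text)
                out[key] = text
            else:
                # Past the cap: collapse into one shared series.
                out[key] = self._overflow
        return out
```

This bounds series growth per key, which is sufficient only when the set of keys is itself fixed, as with a static key list such as METRIC_METADATA_KEYS.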

@veria-ai

veria-ai Bot commented Jun 15, 2026

Contributor

PR overview

This PR adds OpenTelemetry v2 support for emitting the six gen_ai.client.* metrics in parity with the existing v1 instrumentation. The touched metrics plumbing records completion-related metric attributes for LiteLLM requests.

There is one open security concern: requester-supplied metadata is included as metric attributes by default, which can let an authenticated caller create unbounded metric cardinality and pressure the process or metrics exporter. No issues have been fixed yet, so the PR still needs a guardrail such as excluding, bounding, or normalizing caller-controlled metadata before it is ready from a security perspective.

Open issues (1)

Fixed/addressed: 0 · PR risk: 6/10

@yassin-berriai yassin-berriai merged commit 45d5153 into litellm_internal_staging Jun 15, 2026
119 checks passed
@yassin-berriai yassin-berriai deleted the litellm_fix_otel_v2_metrics branch June 15, 2026 23:12
mateo-berri added a commit that referenced this pull request Jun 16, 2026
The default pull_request checkout uses refs/pull/N/merge, which folds the
latest base commits into HEAD. The diff-based gates (ruff delta, Any
discipline) then diff against the event's older base.sha and blame base's
own new commits on this branch; staging's otel-v2 and streaming changes
(#30326, #30485) tripped the Any gate on files this branch never touched.
Checking out the PR head sha makes the gates diff the real branch tip
against base, and pins the tree the mypy/basedpyright budgets were captured
against so their counts stay deterministic as the base advances.
mateo-berri added a commit that referenced this pull request Jun 16, 2026
…pyright) (#30379)

* ci: enable ruff preview rules under the budgeted strict gate

Turn on ruff preview in the strict-budget lane (ruff-strict.toml) only,
leaving the clean gate (ruff.toml) untouched so make lint-ruff stays at
zero. Enumerate the 118 firing codes explicitly with
explicit-preview-rules so the gate is deterministic and stable across
ruff upgrades rather than depending on preview auto-selecting the broad
catalog.

Grandfather the existing 58438 violations into ruff-strict-budget.json
as per-rule baselines with headroom, so only net-new violations fail CI.
The existing ten rules keep their hand-tuned slack; the new rules get
slack 10 when the baseline is 50 or more and 3 otherwise.

* ci: add ANN return-type rules to the budgeted strict gate

Add ANN201/202/204/205/206 (missing return annotations) to the strict
lane and grandfather the existing counts into ruff-strict-budget.json so
the codebase ratchets toward explicit return types without breaking CI.

* ci: add mypy (disallow_untyped_defs) and basedpyright strict gates with baselines

Add two type-check gates, each grandfathering the current tree so only
net-new violations fail CI, matching the ruff strict-budget ratchet.

mypy gains disallow_untyped_defs in litellm/mypy.ini (the config the CI
invocation actually reads; the root [tool.mypy] is not picked up from the
litellm/ working dir). The 4885 existing missing-annotation errors are
captured in litellm/.mypy-baseline.txt and the run is piped through
mypy-baseline filter so new untyped defs are rejected.

basedpyright runs in strict mode over litellm/, with
enableTypeIgnoreComments disabled so it only honors '# pyright: ignore'
and never polices mypy's '# type: ignore'. The existing strict diagnostics
are grandfathered into .basedpyright/baseline.json.

Both tools are pinned in the dev group and uv.lock; the lint workflow and
Makefile run them filtered through their baselines, with
lint-mypy-baseline-update and lint-basedpyright-baseline-update to ratchet.

* ci: raise lint job timeout to 15m for the basedpyright strict pass

* ci: pin pythonVersion 3.12 and regenerate baselines against merged base

Merge litellm_internal_staging so the baselines cover code the CI merge
includes (e.g. the cisco_ai_defense guardrail), which otherwise tripped
the mypy gate with 3 ungrandfathered no-untyped-def errors. Pin
pythonVersion 3.12 in pyrightconfig so basedpyright's strict analysis is
reproducible across interpreter versions (CI runs 3.12).

* ci: regenerate basedpyright baseline against the frozen lint env

The previous baseline was generated with optional provider deps (azure,
google, anthropic, mcp, numpydoc, google-genai) installed locally, so CI's
dev-only env surfaced ~3500 reportUnknown*/reportMissingTypeStubs errors
not in the baseline. Regenerate after uv sync --frozen so the baseline
reflects the same dependency set the lint job sees.

* ci: regenerate basedpyright baseline on python 3.12 frozen env

The prior baseline still carried proxy-dev packages (e.g. prisma) that the
lint job's dev-only, python 3.12 env lacks, leaving 2 unresolved-import
errors ungrandfathered. Regenerate in a python 3.12 venv synced to the
frozen lock with default groups only, so the baseline matches exactly what
CI sees.

* ci: replace type-check baselines with per-file count budgets

The mypy and basedpyright baselines were position-sensitive (and the
basedpyright one was a 27MB file), so ordinary line shifts churned them.
Replace both with a per-file count gate: scripts/type_check_gate.py reduces
each tool's output to errors-per-file and checks it against a committed
{file: max} budget, ignoring line and column numbers. A file fails only
when it gains more errors than its ceiling; debt can't be shuffled between
files because each file has its own cap and new files default to zero.

Budgets (mypy-file-budget.json 48K, basedpyright-file-budget.json 96K) are
generated in the python 3.12 frozen lint env so they match CI. Drops the
mypy-baseline dependency; basedpyright runs without its native baseline.
Ratchet via make lint-mypy-budget-update / lint-basedpyright-budget-update.

* ci: add a small per-file slack to the type-check gate

Allow each file to drift PER_FILE_SLACK (5) errors past its recorded count
before failing, so a basedpyright inference ripple in an unrelated file
doesn't break the build over a couple of errors. Budgets still record exact
counts; the tolerance is applied at check time.

* ci: move type-check slack into the budget json and trim lint timeout

Make slack declarative: the budget is now {"slack": N, "files": {path: count}}
so the tolerance is tuned in JSON without editing the script, mirroring how
ruff-strict-budget.json carries its slack. --update preserves the existing
slack. Also drop the lint job timeout from 15m to 10m; the mypy and
basedpyright passes add ~2m, leaving the job around 4-5m, so 10m is a
comfortable margin.

* ci: collapse fully-adopted ruff categories and drop inert preview flag

ANN (all nine non-removed rules) and BLE (its only rule) were spelled out
code-by-code; replace each with its category selector, which is exactly
equivalent in 0.15.3 (the removed ANN101/ANN102 are skipped by a category
selector and error when named explicitly). explicit-preview-rules was inert:
every selected rule is stable and nothing is selected by category, so the flag
had nothing to gate. Verified the strict-rule counts are identical before and
after (62379 each, zero per-rule drift), so no budget change.

* ci: drop redundant pyright dev dependency

Nothing invokes bare pyright in the Makefile, the linting workflow, or
scripts; the basedpyright gate added on this branch is the only type
checker that runs. basedpyright is a superset fork that reads the same
pyrightconfig.json and honors the same "# pyright: ignore" comments, so
pyright==1.1.408 in the ci group was dead weight. Regenerated uv.lock
under the same exclude-newer cutoff so the only change is removing
pyright and its package stanza

* ci: un-weaken mypy and error on Any in basedpyright

mypy: enable warn_return_any, drop the valid-type silencer, and stop globally ignoring missing first-party imports via [mypy-litellm.*] ignore_missing_imports = False, which surfaced eight real broken litellm.* imports the blanket ignore was hiding; third-party imports stay ignored. The per-file budget moves 4888 -> 5799 (902 no-any-return, 1 valid-type, 8 import-not-found), all grandfathered so only net-new errors fail and the ceilings ratchet down

basedpyright: error on reportExplicitAny and reportAny. The per-file budget moves 117033 -> 148946 (6931 explicit-Any, 24954 Any-typed expressions), grandfathered the same way

* ci: add Any-discipline gate on changed lines under litellm/

Add scripts/check_any_discipline.py, a type-aware gate that fails when a
changed line holds a value typed Any -- including the X | Any unions that
mypy --strict / basedpyright accept (e.g. re.Match.group() -> str | Any,
json.loads() -> Any, bare dict -> dict[Any, Any]).

It reuses the repo's mypyc-compiled mypy 1.19 via a custom generic AST
walker (mypyc precludes subclassing TraverserVisitor), loads litellm/mypy.ini
for parity with lint-mypy, and uses a dedicated incremental cache
(.mypy_cache_any) with mtime+hash invalidation to force re-checks. Scope is
changed-lines-only so editing a legacy file never forces cleaning its
existing Any debt; suppress a genuine typed/untyped boundary with
# any-ok: <reason> (ANY002 requires the reason).

Wire it into the Makefile (lint-any, lint, lint-dev), a parallel
any-discipline CI job with its own actions/cache, .gitignore, and the
CLAUDE.md / CONTRIBUTING.md docs.

* ci: move Any-gate codes into the shared LIT namespace

Renumber the Any-discipline checker into the LIT*** scheme owned by
scripts/check_type_discipline.py (PR #30500) so the two checkers share one
rule namespace and suppression convention:

  ANY001 -> LIT002  (Any-typed value; LIT002 was the retired/free slot)
  ANY002 -> LIT005  (any-ok without a reason; the shared suppression-reason code)
  ANY000 -> LIT000  (setup/build/read error; the shared error code)

Messages and behavior are unchanged; LIT005's text already matches the
"<token> requires a reason" shape used for cast-ok/guard-ok.

* ci: gate mypy and basedpyright per error rule, not per file

Switch the mypy/basedpyright budget gate from per-file error counts to
per-rule-code totals, mirroring the {rule: {baseline, slack}} shape of
ruff-strict-budget.json. A rule fails when its codebase-wide error count
exceeds baseline + slack, so violations are tracked by category rather
than by file location.

scripts/type_check_gate.py now parses mypy from its text output (trailing
[code]) and basedpyright from --outputjson (the JSON `rule` field), since
basedpyright's wrapped text diagnostics mis-attribute the rule on
continuation lines. Replace the *-file-budget.json files with freshly
captured *-code-budget.json baselines and update the Makefile, CI, and
CLAUDE.md accordingly.

* docs: prefer Pydantic validation over any-ok suppression

Point the Any-discipline guidance at validating Any with Pydantic (a model
or TypeAdapter that returns a typed value or raises) and frame
# any-ok as a last resort that should ideally never be used.

* chore: remove extraneous comment

* chore: make the CLAUDE.md more concise

* chore: clean up bloated CONTRIBUTING.md additions

* chore: make Makefile more concise

* ci: add the lint-budget-update target CLAUDE.md references

CLAUDE.md tells contributors to run make lint-budget-update, but the
target was never defined. Add it as an aggregate that re-captures the
ruff, mypy, and basedpyright budgets in one shot.

* ci: recapture mypy and basedpyright budgets in the lint env

The per-rule baselines were captured in a richer dependency env than the
CI lint job's uv sync --frozen, so CI resolved fewer types and reported
more errors than the budgets allowed (no-any-return 902 over cap 900, plus
several basedpyright reportUnknown* rules). Regenerate both in the frozen
env so they grandfather the true CI debt: mypy 5786 -> 5799 (no-any-return
890 -> 902, valid-type 1 restored), basedpyright 146213 -> 148942.

* ci: check out PR head sha in lint and any-discipline jobs

The default pull_request checkout uses refs/pull/N/merge, which folds the
latest base commits into HEAD. The diff-based gates (ruff delta, Any
discipline) then diff against the event's older base.sha and blame base's
own new commits on this branch; staging's otel-v2 and streaming changes
(#30326, #30485) tripped the Any gate on files this branch never touched.
Checking out the PR head sha makes the gates diff the real branch tip
against base, and pins the tree the mypy/basedpyright budgets were captured
against so their counts stay deterministic as the base advances.

* ci(lint): renumber Any-typed-value rule LIT002 -> LIT009

Free up LIT002 for the sibling type-discipline gate (check_type_discipline.py,
#30500), which groups its mutable-collection family at LIT001 (annotation) and
LIT002 (construction). This gate's Any-typed-value rule moves to LIT009 so the
shared LIT namespace stays contiguous with no holes; LIT000 and LIT005 are
unchanged.

* style: rename lint-strict-budget -> lint-ruff-budget

* ci: harden type-check gates against silent passes (greptile review)

type_check_gate.py: refuse to certify a vacuous run. The CI pipe swallows
the tool's exit code ('tool || true'), so a crashed mypy/basedpyright that
emits nothing would parse to zero errors, breach no ceiling, and pass.
is_vacuous_run() now fails when nothing was parsed but the budget expects
errors. Also wrap basedpyright's json.loads in a JSONDecodeError handler
that prints the offending output instead of dumping a raw traceback.

check_any_discipline.py: ALL_LINES was None, which dict.get() also returns
for a path absent from the line map, so a path-normalisation mismatch could
let a violation on an unchanged file pass the scope filter. Make ALL_LINES a
distinct sentinel object so 'whole file' and 'path missing' are unambiguous.

Adds tests for all three.
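
The ALL_LINES ambiguity in the last item can be reproduced in a few lines. This is a stdlib-only sketch under assumptions: `in_scope` and the shape of the line map are invented for illustration and are not taken from check_any_discipline.py.

```python
from typing import Dict, Set, Union


class _AllLines:
    """Distinct sentinel type so 'whole file' can never be confused with None."""

    def __repr__(self) -> str:
        return "ALL_LINES"


ALL_LINES = _AllLines()

LineMap = Dict[str, Union[_AllLines, Set[int]]]


def in_scope(line_map: LineMap, path: str, line: int) -> bool:
    """Return True when (path, line) is part of the changed region."""
    entry = line_map.get(path)
    if entry is None:
        # dict.get() returns None only for a missing path now.
        return False
    if entry is ALL_LINES:
        return True
    return line in entry
```

With `ALL_LINES = None`, the first two branches would be indistinguishable, so a path that failed to normalise to a key in the map would be treated as a fully changed file.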
mateo-berri added a commit that referenced this pull request Jun 16, 2026
…30554)

* chore(codecov): add Batches, Videos, and Realtime components (#30517)

* chore(codecov): add Batches, Videos, and Realtime components

Define per-feature Codecov components so PR comments track coverage
for batch API, video generation, and realtime streaming paths.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(codecov): use wildcard path for Batches proxy component

Align batches_endpoints glob with Videos, Realtime, and Proxy_Authentication.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(batches): move orphan tests into tests/test_litellm for CI coverage (#30510)

Four batch-related tests lived under tests/litellm/ and were never picked
up by GitHub Actions. Relocate them and fix gemini multimodal e2e to use
the batchEmbedContents path expected for gemini/ provider.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(guardrails): run pre_call hook once for model-level guardrails (#30543)

* fix(guardrails): run pre_call hook once for model-level guardrails

A CustomGuardrail attached to a deployment via litellm_params.guardrails
gets its async_pre_call_hook invoked twice per request: once by the proxy
pre-call loop and again by async_pre_call_deployment_hook after the router
spreads the model-level guardrails into the top-level request kwargs.

Record in request metadata that the proxy pre-call loop already ran a given
guardrail, and have the deployment hook skip it when the marker is present.
Direct-SDK usage never runs the proxy loop, so the deployment hook stays the
sole invocation there and still fires exactly once.

The marker key is stripped from untrusted caller metadata so a request body
cannot suppress a model-only guardrail by pre-seeding it.

* fix(guardrails): mark pre_call dedup on the post-hook request data

Record the exactly-once marker after async_pre_call_hook runs, on the data
object that flows downstream, rather than before it. A guardrail whose hook
returns a brand-new request dict (instead of mutating or spreading the one it
received) would otherwise discard the marker, letting the deployment hook
re-run the guardrail a second time.

* fix(guardrails): stop re-initializing DB guardrails on every poll (#30542)

* fix(guardrails): stop re-initializing DB guardrails on every poll

InMemoryGuardrailHandler._has_guardrail_params_changed compared the
in-memory LitellmParams against the raw dict loaded from the DB. The
in-memory side carries every field default and coerces enums via
model_dump(), while the DB side only holds the keys originally stored,
so the two shapes never compared equal and the guardrail was rebuilt on
every poll cycle.

Each rebuild created a fresh instance, but delete_in_memory_guardrail
only removed the old callback from litellm.callbacks. Request handling
promotes guardrail callbacks into the success/failure/async lists, so
the previous instance stayed referenced there and instances accumulated.

Normalize both sides through LitellmParams(...).model_dump() before
diffing, and purge the callback from every callback list on delete.

* refactor(guardrails): narrow params-normalization fallback to ValidationError

The comparison normalizer caught a bare Exception and silently fell back
to the raw dict, which hid the cause and quietly degraded the affected
guardrail back to re-initializing on every poll. Catch only the
ValidationError that LitellmParams construction can raise, log a warning
so the offending row is diagnosable, and let any other error surface
instead of being swallowed.

* refactor(callbacks): add remove_callback_from_all_lists helper to manager

Move the knowledge of which callback lists a callback can be promoted
into out of the guardrail registry and into LoggingCallbackManager, where
the rest of the callback-list bookkeeping already lives. delete_in_memory_guardrail
now delegates to the new helper instead of iterating the lists itself.

* chore(oss): litellm oss staging 150626 (#30463)

* fix(pricing): add GitHub Copilot MAI Code Flash pricing (#30415)

* fix(pricing): add GitHub Copilot MAI Code Flash pricing

Add GitHub Copilot pricing entries for MAI-Code-1-Flash and the internal Copilot CLI model name so cost calculation can price input, cached input, and output tokens.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(pricing): cover GitHub Copilot MAI Code Flash pricing

Add regression coverage for both GitHub Copilot MAI-Code-1-Flash model names, including cached input pricing, chat endpoint metadata, and cost_per_token arithmetic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210) (#30213)

* fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210)

#28990 added ownership recording for streaming /v1/responses via
_wrap_responses_stream_for_container_ownership, which reads
`getattr(stream_response, 'completed_response', None)` to extract the
ResponsesAPIResponse. The unit test bypassed the Router, so it never
exercised the production wrapping path.

Through the Router (every proxy deployment), the stream is wrapped by
FallbackResponsesStreamWrapper (router.py:2527). Its __init__ set
`self.completed_response = None` and __anext__ only forwarded chunks
— the inner source iterator's terminal event never bubbled up to the
attribute the ownership hook reads, so the hook silently recorded
nothing and every follow-up /v1/containers/<id>/files call returned
403 for non-admin keys.

This commit:

- router.py: pre-resolves the responses-API terminal event tuple
  (response.completed / .incomplete / .failed) once per
  _aresponses_streaming_iterator call, and has the wrapper's __anext__
  sniff each forwarded chunk's .type. First terminal event hit gets
  stored on the wrapper's completed_response. Iterator-agnostic — works
  for source_iterator AND any future wrapper.

- common_request_processing.py: when _extract_completed_responses_response
  returns None we now warn instead of silently skipping. Reporter on
  #30210 lost a day to this exact silent skip; the warning surfaces
  future regressions of the same shape directly in operator logs.

Fixes #30210

* fix(router): type-ignore wrapper getattr-defaults; broaden ownership-skip warning

CI lint (mypy) flagged the three pre-existing getattr(..., None) assignments
in FallbackResponsesStreamWrapper.__init__:

  router.py:2564 self.response = getattr(source_iterator, 'response', None)
  router.py:2565 self.model    = getattr(source_iterator, 'model', None)
  router.py:2566 self.logging_obj = getattr(..., None)

Those lines also exist on litellm_internal_staging and pass mypy there.
Adding the typed terminal-event tuple above the class made the function
body more narrowable, which surfaced the pre-existing mismatch — base
class declares non-Optional types but the bridge path
(LiteLLMCompletionStreamingIterator) legitimately omits these. Keep
the None fallback and silence with type: ignore[assignment].

Greptile 4/5 note: the ownership-skip warning hard-named code_interpreter
which misleads operators when a non-code_interpreter stream aborts.
Generalize to 'any tool container (e.g. code_interpreter)'.

* fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198) (#30201)

* fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198)

get_model_info synthesizes input_cost_per_token / output_cost_per_token = 0
when they are absent from the raw entry (the price-unknown and free cases
share the same representation). register_model then merges that result back
into litellm.model_cost, which flips a sparse entry from 'no cost keys'
(priced via model name) to 'cost keys = 0' (free).

That defeats _is_cost_explicitly_configured (#24949) on re-registration:
_is_model_cost_zero returns True, common_checks skips every tag / key /
team / user / org budget check for the group, and over-budget traffic
keeps returning 200. Spend keeps recording because cost calc still resolves
by model name, so the symptom is silent and only triggers on the second
register_model pass (router rebuild, /model/update, config sync).

Mirror the existing litellm_provider-None guard one block above and pop
the cost fields from the synthesized result when they are absent from the
raw entry and not in the caller's value. Caller-provided zeros (genuinely
free models, BYOK overrides) are preserved.
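
The guard described above can be sketched roughly as follows. `drop_synthesized_costs` and its parameters are hypothetical names chosen for illustration; the real change lives inside register_model and may be shaped differently.

```python
from typing import Any, Dict, Optional

COST_KEYS = ("input_cost_per_token", "output_cost_per_token")


def drop_synthesized_costs(
    synthesized: Dict[str, Any],
    raw_entry: Optional[Dict[str, Any]],
    caller_value: Dict[str, Any],
) -> Dict[str, Any]:
    """Remove cost keys that exist only because a lookup defaulted them to 0."""
    result = dict(synthesized)
    for key in COST_KEYS:
        in_raw = raw_entry is not None and key in raw_entry
        # Keep the key if the raw entry or the caller set it explicitly.
        if not in_raw and key not in caller_value:
            result.pop(key, None)
    return result


# Sparse entry: neither the raw entry nor the caller mentions cost.
sparse = drop_synthesized_costs(
    synthesized={"input_cost_per_token": 0, "output_cost_per_token": 0, "mode": "chat"},
    raw_entry={"mode": "chat"},
    caller_value={"mode": "chat"},
)

# Genuinely free model: the caller passed explicit zeros, so they survive.
free = drop_synthesized_costs(
    synthesized={"input_cost_per_token": 0, "output_cost_per_token": 0},
    raw_entry={},
    caller_value={"input_cost_per_token": 0, "output_cost_per_token": 0},
)
```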

Fixes #30198

* fix(register_model): switch _raw_entry to is-None checks + drop dead test assertion

Greptile #30201 review notes:
- the `or`-chain in the raw-entry lookup treated an empty dict (a key
  with no fields) as falsy and fell through to the second arm — replace
  with explicit `is None` checks so a present-but-empty entry is still
  taken at face value.
- the first assertion in `test_router_double_init_keeps_db_model_entry_sparse`
  used `in (None, 0)` which passes under the bug condition (cost = 0
  matches the tuple); the strong follow-up assertion already covers
  every shape, so drop the dead branch.
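
The first review note comes down to a general Python pitfall, shown here with a toy `model_cost` dict. The keys and helper names are made up for the example and do not mirror the real lookup.

```python
from typing import Any, Dict, Optional

model_cost: Dict[str, Dict[str, Any]] = {
    "my-model": {},  # present, but with no fields
    "provider/my-model": {"mode": "chat"},
}


def lookup_with_or(key: str, fallback_key: str) -> Optional[Dict[str, Any]]:
    # An empty dict is falsy, so this silently falls through to the fallback.
    return model_cost.get(key) or model_cost.get(fallback_key)


def lookup_with_is_none(key: str, fallback_key: str) -> Optional[Dict[str, Any]]:
    # Only a missing key triggers the fallback; an empty entry is kept.
    entry = model_cost.get(key)
    if entry is None:
        entry = model_cost.get(fallback_key)
    return entry


via_or = lookup_with_or("my-model", "provider/my-model")
via_is_none = lookup_with_is_none("my-model", "provider/my-model")
```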

* fix(bedrock mantle): use unique function-call id for responses->chat tool calls (#30426)

* fix(bedrock mantle): use unique function-call id for responses->chat tool calls

...

* fix(bedrock mantle): scope unique tool-call id fallback to degenerate call_id

The previous revision preferred the Responses item id for every tool call, which broke providers (and existing tests) where call_id is a unique, canonical correlation key. Restrict the fallback to the degenerate index-based call_id that Bedrock Mantle returns (call_0, call_1, ... resetting per response) and keep call_id otherwise. Revert the change to the OUTPUT_ITEM_DONE streaming handler, whose tool_call_chunk is never emitted (dead code, per review). Extend the regression tests to assert a normal call_id is preserved.
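
One way to express the scoped fallback, as a sketch only. It assumes "degenerate" means exactly the `call_<index>` shape quoted above; `pick_tool_call_id` and the regex are illustrative and not taken from the actual patch.

```python
import re
from typing import Optional

# Assumption: a degenerate call_id is "call_" followed only by digits.
_DEGENERATE_CALL_ID = re.compile(r"^call_\d+$")


def pick_tool_call_id(call_id: Optional[str], item_id: Optional[str]) -> Optional[str]:
    """Prefer call_id unless it is index-based, then fall back to the item id."""
    if call_id and not _DEGENERATE_CALL_ID.match(call_id):
        return call_id
    # Degenerate or missing call_id: use the item id when one exists.
    return item_id or call_id


degenerate = pick_tool_call_id("call_0", "fc_abc")
canonical = pick_tool_call_id("call_9f3a", "fc_abc")
no_item_id = pick_tool_call_id("call_0", None)
```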

* fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235) (#30241)

* fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235)

Router.get_deployment_credentials_with_provider re-validates a
deployment's litellm_params through CredentialLiteLLMParams before
handing them to file/batch/passthrough callers:

    return CredentialLiteLLMParams(
        **deployment.litellm_params.model_dump(exclude_none=True)
    ).model_dump(exclude_none=True)

Any field NOT declared on CredentialLiteLLMParams gets silently dropped
on the way through. azure_ad_token was undeclared, so Azure deployments
using OAuth/M2M (azure_ad_token instead of a static api_key) silently
lost their token at the files endpoint and the proxy returned:

    Missing credentials. Please pass one of api_key, azure_ad_token,
    azure_ad_token_provider, ...

Declare azure_ad_token on CredentialLiteLLMParams alongside api_key /
api_base / api_version so it rides through the round-trip. Static-key
deployments stay unaffected (Optional, default None, dropped by
exclude_none=True). Provider-callable (azure_ad_token_provider) is a
separate concern and out of scope here.

Fixes #30235

* fix(ui-types): regenerate schema.d.ts for new azure_ad_token field

CI's 'Verify schema.d.ts matches the proxy OpenAPI spec' check
auto-detected the new field and emitted the exact diff to apply.
Two schemas had `aws_secret_access_key` from CredentialLiteLLMParams,
both get the new azure_ad_token marker next to it.

* fix(proxy): org_admin with own user_id now sees all org teams on /v2/team/list (#30247)

When the UI sends the caller's own user_id (as it does for non-Admin

global roles), _enforce_list_team_v2_access now nulls it out for org
admins so _build_team_list_where_conditions scopes by organization_id
only -- matching the legacy /team/list behavior and the documented intent.

Fixes #30215

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* test(vertex_ai): multi-region regression coverage for cachedContents host (#29571) (#29707)

litellm_internal_staging already routes the cachedContents URL through
get_vertex_base_url, fixing the multi-region 404 reported in #29571 —
but carries no test coverage for the actual regression scenario (eu/us
must resolve to the REP host aiplatform.{geo}.rep.googleapis.com).

Add TestContextCachingMultiRegionUrls: parametrized eu/us REP-host
assertions (including absence of the old broken {geo}-aiplatform host),
plus regional (us-central1) and global no-regression checks.

* fix(proxy): close upstream LLM stream when client disconnects mid-stream (#30245)

* fix(proxy): close upstream LLM stream when client disconnects mid-stream

When a streaming client disconnects, Starlette abandons the response
body iterator without calling aclose(), so the proxy's connection to
the upstream backend stays open until garbage collection, which may
never come. The backend (e.g. vLLM) keeps generating into a dead pipe:
small responses drain invisibly into TCP buffers while large ones block
the backend on a full send buffer indefinitely (observed via lsof as an
ESTABLISHED proxy->backend connection minutes after the client left)

create_response now returns a StreamingResponse subclass that closes
both its body iterator and the wrapped upstream-facing generator in a
shielded finally. The upstream generator is closed directly rather than
through a cascade because aclose() on a never-started generator skips
its body, which would make the cascade a no-op when the client
disconnects before the first chunk is sent.
async_streaming_data_generator also gains the same shielded
finally-aclose that async_data_generator in proxy_server.py already
had, covering the Anthropic and Google SSE paths

With this, killing a streaming client causes the backend to observe the
abort within about a second and free its slot, while completed streams
are unaffected. No flag is needed, unlike the non-streaming opt-in
cancel in #30223: this only releases resources after the client is
already gone and does not change any response a client can observe
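
The point about never-started generators can be reproduced in isolation. The sketch below uses a hypothetical `FakeUpstream` and `body` generator rather than the proxy's real classes: closing an async generator that was never iterated does not run its `finally`, so an upstream closed only through that cascade stays open.

```python
import asyncio
from typing import AsyncIterator, Tuple


class FakeUpstream:
    """Hypothetical upstream stream that records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def __aiter__(self) -> "FakeUpstream":
        return self

    async def __anext__(self) -> bytes:
        raise StopAsyncIteration

    async def aclose(self) -> None:
        self.closed = True


async def body(upstream: FakeUpstream) -> AsyncIterator[bytes]:
    try:
        async for chunk in upstream:
            yield chunk
    finally:
        # Runs only if the generator body was entered at least once.
        await upstream.aclose()


async def demo() -> Tuple[bool, bool]:
    # Cascade only: the generator never started, so its finally is skipped.
    cascade_only = FakeUpstream()
    await body(cascade_only).aclose()

    # Direct close: close the generator, then the upstream itself.
    direct = FakeUpstream()
    await body(direct).aclose()
    await direct.aclose()

    return cascade_only.closed, direct.closed


cascade_closed, direct_closed = asyncio.run(demo())
```

Here `cascade_closed` ends up `False` and `direct_closed` ends up `True`, which is the case a client disconnect before the first chunk would hit.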

Fixes #30244

* fix(proxy): close upstream even when body iterator aclose raises BaseException

Addresses the Greptile finding on #30245: the cleanup loop caught only
Exception while the generator-level cleanup catches BaseException, so a
CancelledError or GeneratorExit escaping body_iterator.aclose() would
skip closing the upstream generator. Both sites now use the same scope
and a regression test pins that the upstream is closed even when the
body iterator explodes with a BaseException

* fix(llms): expose aclose on BaseModelResponseIterator so stream close reaches the provider connection

The response-level close added for #30244 only worked for SDK-based
providers (e.g. openai), whose streams expose aclose all the way down.
Providers served by base_llm_http_handler (hosted_vllm and most modern
transformation-based providers) wrap a bare response.aiter_lines()
generator in BaseModelResponseIterator, which had no aclose or close at
all, and nothing retained the httpx response object; so
CustomStreamWrapper.aclose() silently did nothing and the upstream
connection stayed open. Verified with a vLLM-style mock: with
hosted_vllm/ the backend streamed all 100 chunks to completion after
the client disconnected, while openai/ aborted at chunk 6

BaseModelResponseIterator now carries an optional http_response and an
aclose() that closes it; make_async_call_stream_helper attaches the
response after building the iterator. With this, hosted_vllm aborts the
backend within ~1.6s of the client dropping, and completed streams are
unaffected

---------

Co-authored-by: kursad <kursad.lacin@brado.net>

* feat(anthropic): surface compaction usage iterations data (#27065)

* feat(anthropic): surface compaction usage iterations data

* style: apply black formatting to fix lint checks

* fix(usage): correctly calculate usage with cached tokens when using ChatCompletionUsageBlock (#30422)

* fix(usage): correctly calculate usage with cached tokens when using ChatCompletionUsageBlock

* fix(usage): optimize test imports

* feat: add fastCRW search provider (#30434)

* feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider (#30203)

* feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider

* libertai: update served endpoints backup + add mode/matrix tests

Addresses review feedback:
- Add libertai to litellm/provider_endpoints_support_backup.json, the file
  actually served by GET /public/supported_endpoints (the root
  provider_endpoints_support.json already had it).
- Add tests asserting bge-m3 normalizes to mode='embedding' and that the
  served matrix lists libertai. embeddings stays false: the JSON-configured
  provider path only wires chat routing (OpenAILike embedding handler is
  reached only for literal openai_like/llamafile/lm_studio), matching the
  llamagate precedent; bge-m3 remains in the cost map for metadata.

---------

Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com>

* feat(provider): add ModelScope as an OpenAI-compatible provider (#28460)

* add ModelScope API support

* add modelscope api support

* update modelscope model list

* add image-generation support

* update test and multimodal

* fix: address PR review feedback for modelscope provider

* update README

* fix(customer_endpoints): restrict /customer/daily/activity to admin-only (#28849)

* fix(customer_endpoints): restrict /customer/daily/activity to admin-only

* fix(customer_endpoints): check role before prisma_client guard

* fix(custom_guardrail): key disable_global_guardrails takes precedence over team guardrail list (#28563)

* fix(fallbacks): preserve fallback model in SDK fallback responses (#28260)

* fix(fallbacks): preserve fallback model in response when using SDK-level fallbacks

* fix(fallbacks): gate x-litellm-* passthrough to trusted callers only

The previous patch unconditionally let `x-litellm-*` keys bypass the
`llm_provider-` prefix in `process_response_headers`. That function is
also called on raw upstream-provider response headers (e.g. from
`llm_http_handler.py`), so a malicious provider could return
`x-litellm-attempted-fallbacks` and spoof a LiteLLM-internal marker,
bypassing the proxy model-override guard.

Add a `preserve_litellm_internal_headers` flag (default False). Only
`response_metadata.py`, which re-processes the already-built
`_hidden_params["additional_headers"]` dict (LiteLLM-owned), passes
True. Raw provider header callsites keep the default False, so upstream
`x-litellm-*` still gets the `llm_provider-` prefix.

Adds a regression test for the spoofing case and renames the existing
preserve test to make the trusted-path semantics explicit.
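
A simplified sketch of the trust gate described above. `prefix_provider_headers` is a hypothetical reduction of process_response_headers: it keeps only the `llm_provider-` prefixing and the `preserve_litellm_internal_headers` flag and omits everything else the real function does.

```python
from typing import Dict


def prefix_provider_headers(
    headers: Dict[str, str],
    preserve_litellm_internal_headers: bool = False,
) -> Dict[str, str]:
    """Prefix headers with llm_provider- unless the caller is trusted."""
    processed: Dict[str, str] = {}
    for name, value in headers.items():
        is_internal = name.lower().startswith("x-litellm-")
        if preserve_litellm_internal_headers and is_internal:
            # Trusted path: the dict was built by LiteLLM itself.
            processed[name] = value
        else:
            # Default path: treat the header as coming from the provider.
            processed[f"llm_provider-{name}"] = value
    return processed


spoofed = {"x-litellm-attempted-fallbacks": "1"}
from_provider = prefix_provider_headers(spoofed)
from_litellm = prefix_provider_headers(spoofed, preserve_litellm_internal_headers=True)
```

Defaulting the flag to False means a new call site that forgets it fails safe: an upstream `x-litellm-*` header is prefixed rather than trusted.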

* fix(fallbacks): ignore preserve_litellm_internal_headers for raw httpx.Headers inputs

* style(core_helpers): apply black formatting

* fix(lint): remove banned typing.List/Dict/Any imports and suppress PLR0913 on interface overrides

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(lint): apply black formatting to modelscope chat transformation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(lint): replace noqa with proper fixes — use **kwargs and Awaitable instead of Any/List

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(lint): remove unused AllMessageValues import

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* revert: restore base_model_iterator.py to original PR state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(lint): restore full method signatures for MyPy compatibility; bump PLR0913 budget for new provider files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(lint): use @override to suppress PLR0913 on inherited signatures instead of bumping budget

The overrides keep their full base-class signatures for MyPy compatibility, but those signatures carry more than five parameters, which tripped PLR0913 on each subclass redeclaration. Since the arity is dictated by the base class and cannot be reduced, decorate the overrides with typing_extensions.override; ruff treats that as the intended signal that the parameter count is not under the author's control and skips PLR0913. This restores the PLR0913 baseline to 1813.

* fix(lint): add @override to modelscope image generation overrides

Apply the same typing_extensions.override treatment to the image generation config so its inherited-signature overrides do not count against PLR0913.

---------

Co-authored-by: Joel Tony <github@jaytau.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: ztko <96878659+koztkozt@users.noreply.github.com>
Co-authored-by: Nahrin <nahrin@nahrinoda.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Humphrey <a739376838@gmail.com>
Co-authored-by: kursadlacin <kursadlacin@gmail.com>
Co-authored-by: kursad <kursad.lacin@brado.net>
Co-authored-by: Dushyant Acharya <dushyantacharya873@gmail.com>
Co-authored-by: Yuriy <yuriy.shuyskiy@gmail.com>
Co-authored-by: Recep S <22618852+us@users.noreply.github.com>
Co-authored-by: Moshe Malawach <moshe.malawach@protonmail.com>
Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com>
Co-authored-by: Rongkun Yan <2493404415@qq.com>
Co-authored-by: Varshith <kvarshithgowda@gmail.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>

* ci(lint): add blanket-noqa, dataclass-default, and unused-noqa Ruff rules (#30516)

* ci(lint): enforce blanket-noqa, dataclass-default, and unused-noqa rules

Enable PGH004 (blanket-noqa), RUF008 (mutable-dataclass-default),
RUF009 (function-call-in-dataclass-default-argument), and RUF100
(unused-noqa) in ruff.toml, and clean up every resulting violation.

RUF008/RUF009 were already clean. PGH004/RUF100 surfaced ~335 stale or
blanket noqas: blanket `# noqa` are now scoped to the rule they actually
suppress (mostly T201), dead directives are removed, and inapplicable
codes are trimmed (e.g. F401 dropped from `import *`).

lint.external lists rules enforced outside this config (the strict-rule
gate via ruff-strict.toml and upstream litellm's own ruff config) so
RUF100 keeps the noqa directives that protect them instead of stripping
coverage this config can't see.

* ci(lint): trim RUF100 external list to load-bearing codes only

Drop the 9 precautionary strict-gate codes (ANN001/002/003/401, B006,
PLR0913, PLW0603, RUF012, TID251) that have zero `# noqa` references in
the gated source. Keep only the 11 codes with live suppressions so
RUF100 doesn't flag them as unused. Future strict-gate suppressions can
re-add codes here (or fix the underlying issue) as needed.

* ci: ratchet lint and type-check gates (ruff preview, ANN, mypy, basedpyright) (#30379)

* ci: enable ruff preview rules under the budgeted strict gate

Turn on ruff preview in the strict-budget lane (ruff-strict.toml) only,
leaving the clean gate (ruff.toml) untouched so make lint-ruff stays at
zero. Enumerate the 118 firing codes explicitly with
explicit-preview-rules so the gate is deterministic and stable across
ruff upgrades rather than depending on preview auto-selecting the broad
catalog.

Grandfather the existing 58438 violations into ruff-strict-budget.json
as per-rule baselines with headroom, so only net-new violations fail CI.
The existing ten rules keep their hand-tuned slack; the new rules get
slack 10 when the baseline is 50 or more and 3 otherwise.

* ci: add ANN return-type rules to the budgeted strict gate

Add ANN201/202/204/205/206 (missing return annotations) to the strict
lane and grandfather the existing counts into ruff-strict-budget.json so
the codebase ratchets toward explicit return types without breaking CI.

* ci: add mypy (disallow_untyped_defs) and basedpyright strict gates with baselines

Add two type-check gates, each grandfathering the current tree so only
net-new violations fail CI, matching the ruff strict-budget ratchet.

mypy gains disallow_untyped_defs in litellm/mypy.ini (the config the CI
invocation actually reads; the root [tool.mypy] is not picked up from the
litellm/ working dir). The 4885 existing missing-annotation errors are
captured in litellm/.mypy-baseline.txt and the run is piped through
mypy-baseline filter so new untyped defs are rejected.

basedpyright runs in strict mode over litellm/, with
enableTypeIgnoreComments disabled so it only honors '# pyright: ignore'
and never polices mypy's '# type: ignore'. The existing strict diagnostics
are grandfathered into .basedpyright/baseline.json.

Both tools are pinned in the dev group and uv.lock; the lint workflow and
Makefile run them filtered through their baselines, with
lint-mypy-baseline-update and lint-basedpyright-baseline-update to ratchet.

* ci: raise lint job timeout to 15m for the basedpyright strict pass

* ci: pin pythonVersion 3.12 and regenerate baselines against merged base

Merge litellm_internal_staging so the baselines cover code the CI merge
includes (e.g. the cisco_ai_defense guardrail), which otherwise tripped
the mypy gate with 3 ungrandfathered no-untyped-def errors. Pin
pythonVersion 3.12 in pyrightconfig so basedpyright's strict analysis is
reproducible across interpreter versions (CI runs 3.12).

* ci: regenerate basedpyright baseline against the frozen lint env

The previous baseline was generated with optional provider deps (azure,
google, anthropic, mcp, numpydoc, google-genai) installed locally, so CI's
dev-only env surfaced ~3500 reportUnknown*/reportMissingTypeStubs errors
not in the baseline. Regenerate after uv sync --frozen so the baseline
reflects the same dependency set the lint job sees.

* ci: regenerate basedpyright baseline on python 3.12 frozen env

The prior baseline still carried proxy-dev packages (e.g. prisma) that the
lint job's dev-only, python 3.12 env lacks, leaving 2 unresolved-import
errors ungrandfathered. Regenerate in a python 3.12 venv synced to the
frozen lock with default groups only, so the baseline matches exactly what
CI sees.

* ci: replace type-check baselines with per-file count budgets

The mypy and basedpyright baselines were position-sensitive (and the
basedpyright one was a 27MB file), so ordinary line shifts churned them.
Replace both with a per-file count gate: scripts/type_check_gate.py reduces
each tool's output to errors-per-file and checks it against a committed
{file: max} budget, ignoring line and column numbers. A file fails only
when it gains more errors than its ceiling; debt can't be shuffled between
files because each file has its own cap and new files default to zero.

Budgets (mypy-file-budget.json 48K, basedpyright-file-budget.json 96K) are
generated in the python 3.12 frozen lint env so they match CI. Drops the
mypy-baseline dependency; basedpyright runs without its native baseline.
Ratchet via make lint-mypy-budget-update / lint-basedpyright-budget-update.

* ci: add a small per-file slack to the type-check gate

Allow each file to drift PER_FILE_SLACK (5) errors past its recorded count
before failing, so a basedpyright inference ripple in an unrelated file
doesn't break the build over a couple of errors. Budgets still record exact
counts; the tolerance is applied at check time.

* ci: move type-check slack into the budget json and trim lint timeout

Make slack declarative: the budget is now {"slack": N, "files": {path: count}}
so the tolerance is tuned in JSON without editing the script, mirroring how
ruff-strict-budget.json carries its slack. --update preserves the existing
slack. Also drop the lint job timeout from 15m to 10m; the mypy and
basedpyright passes add ~2m, leaving the job around 4-5m, so 10m is a
comfortable margin.

* ci: collapse fully-adopted ruff categories and drop inert preview flag

ANN (all nine non-removed rules) and BLE (its only rule) were spelled out
code-by-code; replace each with its category selector, which is exactly
equivalent in 0.15.3 (the removed ANN101/ANN102 are skipped by a category
selector and error when named explicitly). explicit-preview-rules was inert:
every selected rule is stable and nothing is selected by category, so the flag
had nothing to gate. Verified the strict-rule counts are identical before and
after (62379 each, zero per-rule drift), so no budget change.

* ci: drop redundant pyright dev dependency

Nothing invokes bare pyright in the Makefile, the linting workflow, or
scripts; the basedpyright gate added on this branch is the only type
checker that runs. basedpyright is a superset fork that reads the same
pyrightconfig.json and honors the same "# pyright: ignore" comments, so
pyright==1.1.408 in the ci group was dead weight. Regenerated uv.lock
under the same exclude-newer cutoff so the only change is removing
pyright and its package stanza

* ci: un-weaken mypy and error on Any in basedpyright

mypy: enable warn_return_any, drop the valid-type silencer, and stop globally ignoring missing first-party imports via [mypy-litellm.*] ignore_missing_imports = False, which surfaced eight real broken litellm.* imports the blanket ignore was hiding; third-party imports stay ignored. The per-file budget moves 4888 -> 5799 (902 no-any-return, 1 valid-type, 8 import-not-found), all grandfathered so only net-new errors fail and the ceilings ratchet down

basedpyright: error on reportExplicitAny and reportAny. The per-file budget moves 117033 -> 148946 (6931 explicit-Any, 24954 Any-typed expressions), grandfathered the same way

* ci: add Any-discipline gate on changed lines under litellm/

Add scripts/check_any_discipline.py, a type-aware gate that fails when a
changed line holds a value typed Any -- including the X | Any unions that
mypy --strict / basedpyright accept (e.g. re.Match.group() -> str | Any,
json.loads() -> Any, bare dict -> dict[Any, Any]).

It reuses the repo's mypyc-compiled mypy 1.19 via a custom generic AST
walker (mypyc precludes subclassing TraverserVisitor), loads litellm/mypy.ini
for parity with lint-mypy, and uses a dedicated incremental cache
(.mypy_cache_any) with mtime+hash invalidation to force re-checks. Scope is
changed-lines-only so editing a legacy file never forces cleaning its
existing Any debt; suppress a genuine typed/untyped boundary with
# any-ok: <reason> (ANY002 requires the reason).

Wire it into the Makefile (lint-any, lint, lint-dev), a parallel
any-discipline CI job with its own actions/cache, .gitignore, and the
CLAUDE.md / CONTRIBUTING.md docs.

* ci: move Any-gate codes into the shared LIT namespace

Renumber the Any-discipline checker into the LIT*** scheme owned by
scripts/check_type_discipline.py (PR #30500) so the two checkers share one
rule namespace and suppression convention:

  ANY001 -> LIT002  (Any-typed value; LIT002 was the retired/free slot)
  ANY002 -> LIT005  (any-ok without a reason; the shared suppression-reason code)
  ANY000 -> LIT000  (setup/build/read error; the shared error code)

Messages and behavior are unchanged; LIT005's text already matches the
"<token> requires a reason" shape used for cast-ok/guard-ok.

* ci: gate mypy and basedpyright per error rule, not per file

Switch the mypy/basedpyright budget gate from per-file error counts to
per-rule-code totals, mirroring the {rule: {baseline, slack}} shape of
ruff-strict-budget.json. A rule fails when its codebase-wide error count
exceeds baseline + slack, so violations are tracked by category rather
than by file location.

scripts/type_check_gate.py now parses mypy from its text output (trailing
[code]) and basedpyright from --outputjson (the JSON `rule` field), since
basedpyright's wrapped text diagnostics mis-attribute the rule on
continuation lines. Replace the *-file-budget.json files with freshly
captured *-code-budget.json baselines and update the Makefile, CI, and
CLAUDE.md accordingly.
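
A rough sketch of a per-rule gate of the kind described above, covering only the mypy text path. The function names, the regex, and the choice to treat a rule missing from the budget as a zero ceiling are assumptions for illustration, not the contents of scripts/type_check_gate.py.

```python
import re
from collections import Counter
from typing import Dict, List

# mypy error lines end with the error code in square brackets.
_MYPY_CODE = re.compile(r": error: .*\[([a-z][a-z0-9-]*)\]\s*$")


def count_by_rule(mypy_output: str) -> Counter:
    """Count errors per rule code, ignoring file, line, and column."""
    counts: Counter = Counter()
    for line in mypy_output.splitlines():
        match = _MYPY_CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def breached_rules(counts: Counter, budget: Dict[str, Dict[str, int]]) -> List[str]:
    """Return one message per rule whose count exceeds baseline + slack."""
    failures: List[str] = []
    for rule, count in sorted(counts.items()):
        # Assumption: a rule absent from the budget has a ceiling of zero.
        entry = budget.get(rule, {"baseline": 0, "slack": 0})
        ceiling = entry["baseline"] + entry["slack"]
        if count > ceiling:
            failures.append(f"{rule}: {count} > {ceiling}")
    return failures


sample_output = "\n".join(
    [
        "litellm/a.py:10: error: Returning Any from function  [no-any-return]",
        "litellm/b.py:3: error: Function is missing a return type annotation  [no-untyped-def]",
        "litellm/b.py:9: error: Function is missing a type annotation  [no-untyped-def]",
        "litellm/b.py:9: note: Use -> None if function does not return a value",
    ]
)
sample_budget = {
    "no-any-return": {"baseline": 1, "slack": 0},
    "no-untyped-def": {"baseline": 1, "slack": 0},
}

counts = count_by_rule(sample_output)
failures = breached_rules(counts, sample_budget)
```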

* docs: prefer Pydantic validation over any-ok suppression

Point the Any-discipline guidance at validating Any with Pydantic (a model
or TypeAdapter that returns a typed value or raises) and frame
# any-ok as a last resort that should ideally never be used.

* chore: remove extraneous comment

* chore: make the CLAUDE.md more concise

* chore: clean up bloated CONTRIBUTING.md additions

* chore: make Makefile more concise

* ci: add the lint-budget-update target CLAUDE.md references

CLAUDE.md tells contributors to run make lint-budget-update, but the
target was never defined. Add it as an aggregate that re-captures the
ruff, mypy, and basedpyright budgets in one shot.

* ci: recapture mypy and basedpyright budgets in the lint env

The per-rule baselines were captured in a richer dependency env than the
CI lint job's uv sync --frozen, so CI resolved fewer types and reported
more errors than the budgets allowed (no-any-return 902 over cap 900, plus
several basedpyright reportUnknown* rules). Regenerate both in the frozen
env so they grandfather the true CI debt: mypy 5786 -> 5799 (no-any-return
890 -> 902, valid-type 1 restored), basedpyright 146213 -> 148942.

* ci: check out PR head sha in lint and any-discipline jobs

The default pull_request checkout uses refs/pull/N/merge, which folds the
latest base commits into HEAD. The diff-based gates (ruff delta, Any
discipline) then diff against the event's older base.sha and blame base's
own new commits on this branch; staging's otel-v2 and streaming changes
(#30326, #30485) tripped the Any gate on files this branch never touched.
Checking out the PR head sha makes the gates diff the real branch tip
against base, and pins the tree the mypy/basedpyright budgets were captured
against so their counts stay deterministic as the base advances.

* ci(lint): renumber Any-typed-value rule LIT002 -> LIT009

Free up LIT002 for the sibling type-discipline gate (check_type_discipline.py,
#30500), which groups its mutable-collection family at LIT001 (annotation) and
LIT002 (construction). This gate's Any-typed-value rule moves to LIT009 so the
shared LIT namespace stays contiguous with no holes; LIT000 and LIT005 are
unchanged.

* style: rename lint-strict-budget -> lint-ruff-budget

* ci: harden type-check gates against silent passes (greptile review)

type_check_gate.py: refuse to certify a vacuous run. The CI pipe swallows
the tool's exit code ('tool || true'), so a crashed mypy/basedpyright that
emits nothing would parse to zero errors, breach no ceiling, and pass.
is_vacuous_run() now fails when nothing was parsed but the budget expects
errors. Also wrap basedpyright's json.loads in a JSONDecodeError handler
that prints the offending output instead of dumping a raw traceback.

check_any_discipline.py: ALL_LINES was None, which dict.get() also returns
for a path absent from the line map, so a path-normalisation mismatch could
let a violation on an unchanged file pass the scope filter. Make ALL_LINES a
distinct sentinel object so 'whole file' and 'path missing' are unambiguous.

Adds tests for all three.
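
The ALL_LINES fix follows a common sentinel pattern, sketched here. `in_scope`, `_AllLines`, and the sample paths are illustrative and are not the checker's actual code.

```python
from typing import Dict, Set, Union


class _AllLines:
    """Unique marker meaning 'every line of this file is in scope'."""


ALL_LINES = _AllLines()
LineScope = Union[_AllLines, Set[int]]


def in_scope(line_map: Dict[str, LineScope], path: str, line: int) -> bool:
    scope = line_map.get(path)
    if scope is None:
        # Path missing from the map: the file is out of scope.
        return False
    if scope is ALL_LINES:
        # Whole file in scope, for example a newly added file.
        return True
    return line in scope


line_map: Dict[str, LineScope] = {
    "litellm/new_file.py": ALL_LINES,
    "litellm/old_file.py": {4},
}

whole_file = in_scope(line_map, "litellm/new_file.py", 99)
changed_line = in_scope(line_map, "litellm/old_file.py", 4)
unchanged_line = in_scope(line_map, "litellm/old_file.py", 5)
unknown_path = in_scope(line_map, "./litellm/other.py", 1)
```

With `None` doubling as the marker, the first two branches could not be told apart, which is the ambiguity the commit removes.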

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Joel Tony <github@jaytau.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: ztko <96878659+koztkozt@users.noreply.github.com>
Co-authored-by: Nahrin <nahrin@nahrinoda.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Humphrey <a739376838@gmail.com>
Co-authored-by: kursadlacin <kursadlacin@gmail.com>
Co-authored-by: kursad <kursad.lacin@brado.net>
Co-authored-by: Dushyant Acharya <dushyantacharya873@gmail.com>
Co-authored-by: Yuriy <yuriy.shuyskiy@gmail.com>
Co-authored-by: Recep S <22618852+us@users.noreply.github.com>
Co-authored-by: Moshe Malawach <moshe.malawach@protonmail.com>
Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com>
Co-authored-by: Rongkun Yan <2493404415@qq.com>
Co-authored-by: Varshith <kvarshithgowda@gmail.com>
michaelxer pushed a commit to michaelxer/litellm that referenced this pull request Jun 17, 2026
…erriAI#30326)

* fix(otel): cap metric attribute cardinality with include/exclude lists

OTEL metrics stamped every per-request hidden_params and metadata.* field
onto each gen_ai.client.* sample, so near-unique values created one metric
time series per request and backends like Splunk Observability Cloud throttled
and dropped the data.

Add an attributes block under callback_settings.otel with mutually-exclusive
include_list (allowlist) and exclude_list (denylist), validated against the
known attribute names at startup and applied once to the metric attributes in
_record_metrics. Spans are untouched, and with no config every attribute is
still emitted so existing setups are unaffected.
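
A minimal sketch of the allowlist/denylist behavior described above. `filter_metric_attributes` and the sample attribute names are hypothetical, and the validation against the known attribute names is left out.

```python
from typing import Any, Dict, FrozenSet, Optional


def filter_metric_attributes(
    attrs: Dict[str, Any],
    include: Optional[FrozenSet[str]] = None,
    exclude: Optional[FrozenSet[str]] = None,
) -> Dict[str, Any]:
    """Apply an allowlist or a denylist to metric attributes, never both."""
    if include is not None and exclude is not None:
        raise ValueError("include_list and exclude_list are mutually exclusive")
    if include is not None:
        return {k: v for k, v in attrs.items() if k in include}
    if exclude is not None:
        return {k: v for k, v in attrs.items() if k not in exclude}
    # No filter configured: every attribute is still emitted.
    return dict(attrs)


# Illustrative attributes; the second one is near-unique per request.
attrs = {"gen_ai.request.model": "gpt-4o", "metadata.request_id": "req-8f2c"}

unfiltered = filter_metric_attributes(attrs)
allowlisted = filter_metric_attributes(attrs, include=frozenset({"gen_ai.request.model"}))
denylisted = filter_metric_attributes(attrs, exclude=frozenset({"metadata.request_id"}))
```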

Resolves LIT-3600

* fix(otel): resolve metric attribute filter from callback_settings

The proxy usually constructs the OpenTelemetry logger without forwarding the
attributes kwarg, while the filter lives under
litellm.callback_settings["otel"]["attributes"]. __init__ only read the kwarg,
so the recording instance kept config.attributes=None and shipped metrics at
full cardinality even when the filter was configured; a live proxy run exposed
this. Fall back to the global at init for the base otel logger, and add a
regression test that drives the real success hook through the callback_settings
path (the unit tests passed before because they injected the config directly).

* fix(otel): reject gen_ai.token.type from metric attribute filter lists

gen_ai.token.type was a member of VALID_METRIC_ATTRIBUTE_NAMES, so an
operator could list it in include_list or exclude_list and pass startup
validation. The attribute is injected into the input/output token series
after _filter_metric_attributes runs, so the filter never sees it and the
request silently has no effect.

Reject it loudly from either list instead, matching the contract that a
non-actionable attribute name fails fast rather than falling through to a
no-op. It stays a structural discriminator on the token-usage histogram.

* fix(otel): resolve metric attribute filter lazily at record time

The proxy constructs the OpenTelemetry logger before it populates
litellm.callback_settings["otel"]["attributes"], so resolving the filter at
__init__ left config.attributes None and shipped metrics at full cardinality. A
live proxy run confirmed the leak. Resolve the filter on the first metric record
instead, when callback_settings is populated, while still validating an explicit
config eagerly so a bad SDK config fails at startup. The regression test now
constructs the logger before populating callback_settings to mirror that
ordering, so it fails if the filter is resolved too early.

* fix(otel): don't cache invalid filter on lazy callback_settings path

On the lazy callback_settings resolution path, _ensure_metric_attribute_filter
wrote self.config.attributes before validating it. When validation then failed,
_metric_attr_filter_resolved stayed False while config.attributes held the bad
filter, so the next record skipped the callback_settings re-read and re-raised
the stale error indefinitely; fixing the misconfiguration required a restart.

Drop the premature write and resolve from the local value. A subsequent record
now re-reads callback_settings, so a corrected config takes effect without a
restart. The write was dead on the success path anyway, since the resolved
frozensets are what the filter reads.
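
The ordering fix can be sketched as follows. This is an illustration, not the actual `_ensure_metric_attribute_filter`: the `LazyFilterResolver` class and its two injected callables are hypothetical, and treating a missing settings block as "no filter" is an assumption.

```python
from typing import Callable, FrozenSet, Optional


class LazyFilterResolver:
    """Resolve a filter on first use; cache it only after validation passes."""

    def __init__(
        self,
        read_settings: Callable[[], Optional[dict]],
        validate: Callable[[dict], FrozenSet[str]],
    ) -> None:
        self._read_settings = read_settings
        self._validate = validate
        self._resolved = False
        self._names: FrozenSet[str] = frozenset()

    def ensure(self) -> FrozenSet[str]:
        if self._resolved:
            return self._names
        raw = self._read_settings()  # re-read on every unresolved call
        # Validate the local value first; if this raises, nothing is cached,
        # so the next call re-reads the (possibly corrected) settings.
        names = self._validate(raw) if raw is not None else frozenset()
        self._names = names
        self._resolved = True
        return names
```

Because no state is written before validation succeeds, a failed call leaves the resolver exactly as it was, which is what allows recovery without a restart.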

* feat(otel-v2): emit the 6 gen_ai.client.* metrics at parity with v1

The v2 OpenTelemetry integration was a span engine: it declared two metric
histograms but never created a meter or recorded anything. Bring it to parity
with v1 so a v2-default deployment gets bounded metrics.

Adds the 4 missing metric names, all 6 histograms, a meter-provider builder that
mirrors v1's exporter selection, and a GenAIMetricRecorder that records token
usage (split input/output), cost, operation duration, TTFT (streaming), TPOT,
and response duration on the success hook. Gated on config.enable_metrics so the
default is unchanged.

The attribute cardinality filter is reused from v1 by import (no duplication of
the valid-name set or validation) and resolved lazily from
callback_settings.otel.attributes, matching v1. A misconfigured filter raises
out of the recorder; the logger surfaces it once at ERROR and records nothing,
rather than silently disabling metrics, and a corrected config recovers without
a restart.
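
One way to picture the "surface once at ERROR, record nothing" behavior is the sketch below. The `MetricsGuard` class, the logger name, and the choice of `ValueError` as the misconfiguration signal are all assumptions made for illustration; the real logger and recorder may be wired differently.

```python
import logging
from typing import Callable

logger = logging.getLogger("otel_v2_sketch")  # hypothetical logger name


class MetricsGuard:
    """Wrap a recorder so a bad filter is reported once and never half-applied."""

    def __init__(self, record: Callable[[dict], None]) -> None:
        self._record = record
        self._error_logged = False

    def on_success(self, payload: dict) -> bool:
        """Return True if metrics were recorded, False if recording was skipped."""
        try:
            self._record(payload)
        except ValueError as exc:
            if not self._error_logged:
                logger.error("metric attribute filter misconfigured: %s", exc)
                self._error_logged = True
            return False  # nothing recorded for this request
        return True
```

The guard keeps calling the recorder on every request, so a corrected configuration starts recording again as soon as the recorder stops raising.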

* test(otel-v2): drop duplicate misconfig logger test (covered in test_otel_v2_logger)
michaelxer pushed a commit to michaelxer/litellm that referenced this pull request Jun 17, 2026
…erriAI#30326)

koladefaj pushed a commit to koladefaj/litellm that referenced this pull request Jun 17, 2026
…erriAI#30326)

koladefaj pushed a commit to koladefaj/litellm that referenced this pull request Jun 17, 2026
…pyright) (BerriAI#30379)

* ci: enable ruff preview rules under the budgeted strict gate

Turn on ruff preview in the strict-budget lane (ruff-strict.toml) only,
leaving the clean gate (ruff.toml) untouched so make lint-ruff stays at
zero. Enumerate the 118 firing codes explicitly with
explicit-preview-rules so the gate is deterministic and stable across
ruff upgrades rather than depending on preview auto-selecting the broad
catalog.

Grandfather the existing 58438 violations into ruff-strict-budget.json
as per-rule baselines with headroom, so only net-new violations fail CI.
The existing ten rules keep their hand-tuned slack; the new rules get
slack 10 when the baseline is 50 or more and 3 otherwise.
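
The slack rule for newly budgeted rules can be sketched as below. The `{rule: {baseline, slack}}` shape follows the description of ruff-strict-budget.json elsewhere in this history; the function names are hypothetical, and treating an unbudgeted rule as baseline 0 with no slack is an assumption.

```python
from typing import Dict, List


def default_slack(baseline: int) -> int:
    """Slack 10 when the baseline is 50 or more, 3 otherwise."""
    return 10 if baseline >= 50 else 3


def build_budget(counts: Dict[str, int]) -> Dict[str, Dict[str, int]]:
    """Grandfather current per-rule counts as baselines with headroom."""
    return {
        rule: {"baseline": n, "slack": default_slack(n)}
        for rule, n in sorted(counts.items())
    }


def failing_rules(
    counts: Dict[str, int], budget: Dict[str, Dict[str, int]]
) -> List[str]:
    """Rules whose current count exceeds baseline + slack."""
    failures = []
    for rule, n in sorted(counts.items()):
        entry = budget.get(rule, {"baseline": 0, "slack": 0})  # assumption
        if n > entry["baseline"] + entry["slack"]:
            failures.append(rule)
    return failures
```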

* ci: add ANN return-type rules to the budgeted strict gate

Add ANN201/202/204/205/206 (missing return annotations) to the strict
lane and grandfather the existing counts into ruff-strict-budget.json so
the codebase ratchets toward explicit return types without breaking CI.

* ci: add mypy (disallow_untyped_defs) and basedpyright strict gates with baselines

Add two type-check gates, each grandfathering the current tree so only
net-new violations fail CI, matching the ruff strict-budget ratchet.

mypy gains disallow_untyped_defs in litellm/mypy.ini (the config the CI
invocation actually reads; the root [tool.mypy] is not picked up from the
litellm/ working dir). The 4885 existing missing-annotation errors are
captured in litellm/.mypy-baseline.txt and the run is piped through
mypy-baseline filter so new untyped defs are rejected.

basedpyright runs in strict mode over litellm/, with
enableTypeIgnoreComments disabled so it only honors '# pyright: ignore'
and never polices mypy's '# type: ignore'. The existing strict diagnostics
are grandfathered into .basedpyright/baseline.json.

Both tools are pinned in the dev group and uv.lock; the lint workflow and
Makefile run them filtered through their baselines, with
lint-mypy-baseline-update and lint-basedpyright-baseline-update to ratchet.

* ci: raise lint job timeout to 15m for the basedpyright strict pass

* ci: pin pythonVersion 3.12 and regenerate baselines against merged base

Merge litellm_internal_staging so the baselines cover code the CI merge
includes (e.g. the cisco_ai_defense guardrail), which otherwise tripped
the mypy gate with 3 ungrandfathered no-untyped-def errors. Pin
pythonVersion 3.12 in pyrightconfig so basedpyright's strict analysis is
reproducible across interpreter versions (CI runs 3.12).

* ci: regenerate basedpyright baseline against the frozen lint env

The previous baseline was generated with optional provider deps (azure,
google, anthropic, mcp, numpydoc, google-genai) installed locally, so CI's
dev-only env surfaced ~3500 reportUnknown*/reportMissingTypeStubs errors
not in the baseline. Regenerate after uv sync --frozen so the baseline
reflects the same dependency set the lint job sees.

* ci: regenerate basedpyright baseline on python 3.12 frozen env

The prior baseline still carried proxy-dev packages (e.g. prisma) that the
lint job's dev-only, python 3.12 env lacks, leaving 2 unresolved-import
errors ungrandfathered. Regenerate in a python 3.12 venv synced to the
frozen lock with default groups only, so the baseline matches exactly what
CI sees.

* ci: replace type-check baselines with per-file count budgets

The mypy and basedpyright baselines were position-sensitive (and the
basedpyright one was a 27MB file), so ordinary line shifts churned them.
Replace both with a per-file count gate: scripts/type_check_gate.py reduces
each tool's output to errors-per-file and checks it against a committed
{file: max} budget, ignoring line and column numbers. A file fails only
when it gains more errors than its ceiling; debt can't be shuffled between
files because each file has its own cap and new files default to zero.
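
A rough sketch of the per-file reduction, assuming mypy's default text output (`path:line[:col]: error: message  [code]`). The function names are hypothetical and this is not the code in scripts/type_check_gate.py.

```python
import re
from collections import Counter
from typing import Dict, Tuple

# Matches "path:line: error: ..." and "path:line:col: error: ...".
_ERROR_LINE = re.compile(r"^(?P<path>[^:\s][^:]*):\d+(?::\d+)?: error: ")


def errors_per_file(tool_output: str) -> Counter:
    """Count error lines per file, ignoring line and column numbers."""
    counts: Counter = Counter()
    for line in tool_output.splitlines():
        match = _ERROR_LINE.match(line)
        if match:
            counts[match.group("path")] += 1
    return counts


def files_over_budget(
    counts: Dict[str, int], budget: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Map each failing file to (actual, allowed); unknown files allow zero."""
    return {
        path: (n, budget.get(path, 0))
        for path, n in sorted(counts.items())
        if n > budget.get(path, 0)
    }
```

Since only counts are compared, moving an existing error to a different line leaves the result unchanged, while a new file with any error fails immediately.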

Budgets (mypy-file-budget.json 48K, basedpyright-file-budget.json 96K) are
generated in the python 3.12 frozen lint env so they match CI. Drops the
mypy-baseline dependency; basedpyright runs without its native baseline.
Ratchet via make lint-mypy-budget-update / lint-basedpyright-budget-update.

* ci: add a small per-file slack to the type-check gate

Allow each file to drift PER_FILE_SLACK (5) errors past its recorded count
before failing, so a basedpyright inference ripple in an unrelated file
doesn't break the build over a couple of errors. Budgets still record exact
counts; the tolerance is applied at check time.

* ci: move type-check slack into the budget json and trim lint timeout

Make slack declarative: the budget is now {"slack": N, "files": {path: count}}
so the tolerance is tuned in JSON without editing the script, mirroring how
ruff-strict-budget.json carries its slack. --update preserves the existing
slack. Also drop the lint job timeout from 15m to 10m; the mypy and
basedpyright passes add ~2m, leaving the job around 4-5m, so 10m is a
comfortable margin.
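
The declarative budget shape can be illustrated as follows. Only the `{"slack": N, "files": {path: count}}` layout and the "update preserves slack" behavior come from the message above; the helper names are hypothetical, and applying the slack to files missing from the budget is an assumption.

```python
import json
from typing import Dict, List


def over_ceiling(counts: Dict[str, int], budget: dict) -> List[str]:
    """Files whose error count exceeds recorded count + slack."""
    slack = budget.get("slack", 0)
    files = budget.get("files", {})
    return sorted(
        path for path, n in counts.items() if n > files.get(path, 0) + slack
    )


def updated_budget(old: dict, counts: Dict[str, int]) -> dict:
    """Re-record exact counts while keeping the hand-tuned slack."""
    return {"slack": old.get("slack", 0), "files": dict(sorted(counts.items()))}


budget = json.loads('{"slack": 5, "files": {"litellm/a.py": 12}}')
print(over_ceiling({"litellm/a.py": 17}, budget))  # within 12 + 5, so empty
print(json.dumps(updated_budget(budget, {"litellm/a.py": 9})))
```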

* ci: collapse fully-adopted ruff categories and drop inert preview flag

ANN (all nine non-removed rules) and BLE (its only rule) were spelled out
code-by-code; replace each with its category selector, which is exactly
equivalent in 0.15.3 (the removed ANN101/ANN102 are skipped by a category
selector and error when named explicitly). explicit-preview-rules was inert:
every selected rule is stable and nothing is selected by category, so the flag
had nothing to gate. Verified the strict-rule counts are identical before and
after (62379 each, zero per-rule drift), so no budget change.

* ci: drop redundant pyright dev dependency

Nothing invokes bare pyright in the Makefile, the linting workflow, or
scripts; the basedpyright gate added on this branch is the only type
checker that runs. basedpyright is a superset fork that reads the same
pyrightconfig.json and honors the same "# pyright: ignore" comments, so
pyright==1.1.408 in the ci group was dead weight. Regenerated uv.lock
under the same exclude-newer cutoff so the only change is removing
pyright and its package stanza.

* ci: un-weaken mypy and error on Any in basedpyright

mypy: enable warn_return_any, drop the valid-type silencer, and stop globally ignoring missing first-party imports by setting ignore_missing_imports = False under [mypy-litellm.*]. That surfaced eight real broken litellm.* imports the blanket ignore was hiding; third-party imports stay ignored. The per-file budget moves 4888 -> 5799 (902 no-any-return, 1 valid-type, 8 import-not-found), all grandfathered so only net-new errors fail and the ceilings ratchet down.

basedpyright: error on reportExplicitAny and reportAny. The per-file budget moves 117033 -> 148946 (6931 explicit-Any, 24954 Any-typed expressions), grandfathered the same way.

* ci: add Any-discipline gate on changed lines under litellm/

Add scripts/check_any_discipline.py, a type-aware gate that fails when a
changed line holds a value typed Any -- including the X | Any unions that
mypy --strict / basedpyright accept (e.g. re.Match.group() -> str | Any,
json.loads() -> Any, bare dict -> dict[Any, Any]).

It reuses the repo's mypyc-compiled mypy 1.19 via a custom generic AST
walker (mypyc precludes subclassing TraverserVisitor), loads litellm/mypy.ini
for parity with lint-mypy, and uses a dedicated incremental cache
(.mypy_cache_any) with mtime+hash invalidation to force re-checks. Scope is
changed-lines-only so editing a legacy file never forces cleaning its
existing Any debt; suppress a genuine typed/untyped boundary with
# any-ok: <reason> (ANY002 requires the reason).

Wire it into the Makefile (lint-any, lint, lint-dev), a parallel
any-discipline CI job with its own actions/cache, .gitignore, and the
CLAUDE.md / CONTRIBUTING.md docs.

* ci: move Any-gate codes into the shared LIT namespace

Renumber the Any-discipline checker into the LIT*** scheme owned by
scripts/check_type_discipline.py (PR BerriAI#30500) so the two checkers share one
rule namespace and suppression convention:

  ANY001 -> LIT002  (Any-typed value; LIT002 was the retired/free slot)
  ANY002 -> LIT005  (any-ok without a reason; the shared suppression-reason code)
  ANY000 -> LIT000  (setup/build/read error; the shared error code)

Messages and behavior are unchanged; LIT005's text already matches the
"<token> requires a reason" shape used for cast-ok/guard-ok.

* ci: gate mypy and basedpyright per error rule, not per file

Switch the mypy/basedpyright budget gate from per-file error counts to
per-rule-code totals, mirroring the {rule: {baseline, slack}} shape of
ruff-strict-budget.json. A rule fails when its codebase-wide error count
exceeds baseline + slack, so violations are tracked by category rather
than by file location.

scripts/type_check_gate.py now parses mypy from its text output (trailing
[code]) and basedpyright from --outputjson (the JSON `rule` field), since
basedpyright's wrapped text diagnostics mis-attribute the rule on
continuation lines. Replace the *-file-budget.json files with freshly
captured *-code-budget.json baselines and update the Makefile, CI, and
CLAUDE.md accordingly.
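
The two parsing strategies can be sketched like this. It assumes mypy's text lines end in a bracketed error code and that the JSON output carries a `generalDiagnostics` list whose entries have `severity` and an optional `rule` field, as pyright's JSON output does; the function names and the "unknown" bucket are hypothetical.

```python
import json
import re
from collections import Counter

# An error line ending in "[some-code]".
_MYPY_CODE = re.compile(r": error: .*\[(?P<code>[a-z0-9-]+)\]\s*$")


def mypy_rule_counts(text: str) -> Counter:
    """Count mypy errors per trailing [code]; notes and summaries are skipped."""
    counts: Counter = Counter()
    for line in text.splitlines():
        match = _MYPY_CODE.search(line)
        if match:
            counts[match.group("code")] += 1
    return counts


def pyright_rule_counts(raw_json: str) -> Counter:
    """Count error-severity diagnostics per rule from the JSON report."""
    data = json.loads(raw_json)
    counts: Counter = Counter()
    for diag in data.get("generalDiagnostics", []):
        if diag.get("severity") == "error":
            # Diagnostics without a rule (e.g. syntax errors) share one bucket.
            counts[diag.get("rule", "unknown")] += 1
    return counts
```

Reading the rule from a structured field sidesteps the wrapped-line attribution problem entirely, since each diagnostic is one JSON object regardless of how long its message is.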

* docs: prefer Pydantic validation over any-ok suppression

Point the Any-discipline guidance at validating Any with Pydantic (a model
or TypeAdapter that returns a typed value or raises) and frame
# any-ok as a last resort that should ideally never be used.

* chore: remove extraneous comment

* chore: make the CLAUDE.md more concise

* chore: clean up bloated CONTRIBUTING.md additions

* chore: make Makefile more concise

* ci: add the lint-budget-update target CLAUDE.md references

CLAUDE.md tells contributors to run make lint-budget-update, but the
target was never defined. Add it as an aggregate that re-captures the
ruff, mypy, and basedpyright budgets in one shot.

* ci: recapture mypy and basedpyright budgets in the lint env

The per-rule baselines were captured in a richer dependency env than the
CI lint job's uv sync --frozen, so CI resolved fewer types and reported
more errors than the budgets allowed (no-any-return 902 over cap 900, plus
several basedpyright reportUnknown* rules). Regenerate both in the frozen
env so they grandfather the true CI debt: mypy 5786 -> 5799 (no-any-return
890 -> 902, valid-type 1 restored), basedpyright 146213 -> 148942.

* ci: check out PR head sha in lint and any-discipline jobs

The default pull_request checkout uses refs/pull/N/merge, which folds the
latest base commits into HEAD. The diff-based gates (ruff delta, Any
discipline) then diff against the event's older base.sha and blame base's
own new commits on this branch; staging's otel-v2 and streaming changes
(BerriAI#30326, BerriAI#30485) tripped the Any gate on files this branch never touched.
Checking out the PR head sha makes the gates diff the real branch tip
against base, and pins the tree the mypy/basedpyright budgets were captured
against so their counts stay deterministic as the base advances.

* ci(lint): renumber Any-typed-value rule LIT002 -> LIT009

Free up LIT002 for the sibling type-discipline gate (check_type_discipline.py,
BerriAI#30500), which groups its mutable-collection family at LIT001 (annotation) and
LIT002 (construction). This gate's Any-typed-value rule moves to LIT009 so the
shared LIT namespace stays contiguous with no holes; LIT000 and LIT005 are
unchanged.

* style: rename lint-strict-budget -> lint-ruff-budget

* ci: harden type-check gates against silent passes (greptile review)

type_check_gate.py: refuse to certify a vacuous run. The CI pipe swallows
the tool's exit code ('tool || true'), so a crashed mypy/basedpyright that
emits nothing would parse to zero errors, breach no ceiling, and pass.
is_vacuous_run() now fails when nothing was parsed but the budget expects
errors. Also wrap basedpyright's json.loads in a JSONDecodeError handler
that prints the offending output instead of dumping a raw traceback.
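
A sketch of the two guards, with assumptions called out: `is_vacuous_run` is named in the message above, but its signature here is a guess, and `parse_json_or_exit` with its exit code 2 is a hypothetical helper.

```python
import json
import sys
from typing import Any, Dict


def is_vacuous_run(
    parsed_counts: Dict[str, int], budget: Dict[str, Dict[str, int]]
) -> bool:
    """True when nothing was parsed although the budget expects errors."""
    expected = sum(entry.get("baseline", 0) for entry in budget.values())
    return sum(parsed_counts.values()) == 0 and expected > 0


def parse_json_or_exit(raw: str) -> Any:
    """Parse tool output; on failure show the output instead of a traceback."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        print(
            f"could not parse tool output as JSON ({exc}); output was:\n{raw}",
            file=sys.stderr,
        )
        raise SystemExit(2) from exc
```

A genuinely clean tree with an empty budget still passes, because the check only fires when the budget says errors should exist.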

check_any_discipline.py: ALL_LINES was None, which dict.get() also returns
for a path absent from the line map, so a path-normalisation mismatch could
let a violation on an unchanged file pass the scope filter. Make ALL_LINES a
distinct sentinel object so 'whole file' and 'path missing' are unambiguous.
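
The sentinel fix can be illustrated with a small sketch. `ALL_LINES` is the name used above; the `in_scope` helper and the shape of the line map are assumptions for illustration.

```python
from typing import Dict, Final

# A unique object: no lookup default or missing key can ever be identical to it.
ALL_LINES: Final = object()


def in_scope(line_map: Dict[str, object], path: str, line: int) -> bool:
    """Decide whether a finding at path:line falls inside the changed scope."""
    scope = line_map.get(path)  # None now means only "path not in the map"
    if scope is None:
        return False
    if scope is ALL_LINES:
        return True  # whole file is in scope
    return isinstance(scope, (set, frozenset)) and line in scope


line_map = {"litellm/new_file.py": ALL_LINES, "litellm/utils.py": frozenset({10, 11})}
print(in_scope(line_map, "litellm/new_file.py", 500))  # whole-file entry
print(in_scope(line_map, "./litellm/utils.py", 10))    # unnormalised path: missing
```

With `None` as the sentinel, the second lookup would have been indistinguishable from a whole-file entry; the identity check against a dedicated object removes that ambiguity.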

Adds tests for all three.