Skip to content

chore(ci): merge dev branch#28290

Merged
yuneng-berri merged 1 commit into
litellm_internal_stagingfrom
litellm_yj_may19
May 21, 2026
Merged

chore(ci): merge dev branch#28290
yuneng-berri merged 1 commit into
litellm_internal_stagingfrom
litellm_yj_may19

Conversation

@yuneng-berri

Copy link
Copy Markdown
Collaborator
  • fix: end user logs

  • fix(auth): address PR review feedback on end-user id validation

  • Gate DB validation behind litellm.validate_end_user_id_in_db (default False) so arbitrary client-supplied identifiers still pass through.
  • Reuse get_end_user_object / get_user_object / _get_fuzzy_user_object instead of issuing raw Prisma queries in the auth hot path.
  • Consolidate: builder does the resolution once and stores it on the auth obj; centralized checks reuse it, the outer user_api_key_auth copy is removed.
  • Preserve end_user_id when litellm.max_end_user_budget_id is set so the default end-user budget can still apply to new customers.
  • fix(auth): gate JSON-blob user-id rejection behind validate_end_user_id_in_db

Addresses PR review feedback: the JSON-encoded dict/list rejection in _coerce_user_id_to_str was unconditionally applied, which would silently stop tracking spend for deployments passing JSON-encoded user identifiers on upgrade. Per the backwards-compatibility rule, default-path behavior changes must be opt-in.

Now only strings that decode to a JSON object/array are dropped when litellm.validate_end_user_id_in_db is True. Non-string dict/list/tuple values are still always dropped, since stringifying them produces unusable "{'device_id': ...}"-shaped spend-log rows.

  • fix(auth): route email end-user lookup through get_user_object cache

The email-shaped end-user id branch called _get_fuzzy_user_object directly, bypassing get_user_object's _should_check_db throttle and user_api_key_cache. Every unique email would hit an unbudgeted raw Prisma query on the critical auth path. Collapsing the two calls into one get_user_object invocation with user_email=end_user_id routes through the cached helper per PR review feedback.

  • fix(auth): keep end-user safety net at user_api_key_auth tail

Krrish flagged that removing the tail-of-user_api_key_auth assignment was a regression risk: _user_api_key_auth_builder has multiple early-return paths (master_key=None, /user/auth, JWT short-circuits) that bypass the end-user resolution block, so dropping the safety net silently strips end-user attribution from those paths.

Restore the assignment but route it through resolve_and_validate_end_user_id so the same validation rules apply. Skip the second pass when the builder already set an id.

Adds two tests pinning the behaviour: one for the early-return safety net and one verifying we don't double-resolve when the builder set the id.

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

* fix: end user logs

* fix(auth): address PR review feedback on end-user id validation

- Gate DB validation behind litellm.validate_end_user_id_in_db (default
  False) so arbitrary client-supplied identifiers still pass through.
- Reuse get_end_user_object / get_user_object / _get_fuzzy_user_object
  instead of issuing raw Prisma queries in the auth hot path.
- Consolidate: builder does the resolution once and stores it on the
  auth obj; centralized checks reuse it, the outer user_api_key_auth
  copy is removed.
- Preserve end_user_id when litellm.max_end_user_budget_id is set so
  the default end-user budget can still apply to new customers.

* fix(auth): gate JSON-blob user-id rejection behind validate_end_user_id_in_db

Addresses PR review feedback: the JSON-encoded dict/list rejection in
_coerce_user_id_to_str was unconditionally applied, which would silently
stop tracking spend for deployments passing JSON-encoded user identifiers
on upgrade. Per the backwards-compatibility rule, default-path behavior
changes must be opt-in.

Now only strings that decode to a JSON object/array are dropped when
litellm.validate_end_user_id_in_db is True. Non-string dict/list/tuple
values are still always dropped, since stringifying them produces
unusable "{'device_id': ...}"-shaped spend-log rows.

* fix(auth): route email end-user lookup through get_user_object cache

The email-shaped end-user id branch called _get_fuzzy_user_object directly,
bypassing get_user_object's _should_check_db throttle and user_api_key_cache.
Every unique email would hit an unbudgeted raw Prisma query on the critical
auth path. Collapsing the two calls into one get_user_object invocation
with user_email=end_user_id routes through the cached helper per PR review
feedback.

* fix(auth): keep end-user safety net at user_api_key_auth tail

Krrish flagged that removing the tail-of-user_api_key_auth assignment
was a regression risk: ``_user_api_key_auth_builder`` has multiple
early-return paths (master_key=None, /user/auth, JWT short-circuits)
that bypass the end-user resolution block, so dropping the safety net
silently strips end-user attribution from those paths.

Restore the assignment but route it through resolve_and_validate_end_user_id
so the same validation rules apply. Skip the second pass when the builder
already set an id.

Adds two tests pinning the behaviour: one for the early-return safety
net and one verifying we don't double-resolve when the builder set the id.
@yuneng-berri yuneng-berri changed the title [Infra] Merge dev branch chore(ci): merge dev branch May 19, 2026
@codecov

codecov Bot commented May 19, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps

greptile-apps Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR refines the end-user ID attribution path in the proxy auth layer: it adds an opt-in validate_end_user_id_in_db flag to validate caller-supplied user IDs against known DB rows, introduces _coerce_user_id_to_str to drop non-string structured values from all extraction sites, and restores the tail safety-net in user_api_key_auth so early-return builder paths still attribute spend logs correctly.

  • resolve_and_validate_end_user_id and _end_user_id_exists_in_db are added to auth_checks.py; they reuse get_end_user_object / get_user_object so validation shares the existing cache and avoids raw Prisma calls on the hot auth path.
  • _coerce_user_id_to_str replaces all str() coercions in get_end_user_id_from_request_body, dropping dict/list/tuple values unconditionally and gating JSON-string rejection behind the new flag for backward compatibility.
  • The tail safety-net in user_api_key_auth re-routes through resolve_and_validate_end_user_id, and two new integration tests pin the no-double-resolve and early-return-fallback invariants.

Confidence Score: 4/5

Safe to merge for default configurations; the new opt-in validation path has a minor exception-handling gap worth addressing before enabling the flag in production

The core logic is well-structured with careful backward-compatibility gating throughout. Three findings are all non-blocking: a BudgetExceededError gap in the tail safety-net (only reachable with the opt-in flag plus a builder early-return), two trivially-true test assertions, and one test that misses its documented branch. No issues affect the default flag-off code path.

litellm/proxy/auth/user_api_key_auth.py tail section (lines 2188-2208) and the corresponding tests in test_auth_checks.py

Important Files Changed

Filename Overview
litellm/init.py Adds validate_end_user_id_in_db: bool = False global flag — clean, backwards-compatible addition with accurate inline documentation
litellm/proxy/auth/auth_checks.py New validation helpers reuse existing cache/DB helpers correctly; BudgetExceededError propagation from the tail safety-net call site is outside the exception handler
litellm/proxy/auth/auth_utils.py _coerce_user_id_to_str wired through all extraction sites with correct backward-compat gating
litellm/proxy/auth/user_api_key_auth.py Builder, centralized checks, and tail safety-net correctly layered; BudgetExceededError from tail block bypasses exception handler
tests/test_litellm/proxy/auth/test_auth_checks.py Good coverage but _get_fuzzy_user_object.assert_not_awaited() assertions are trivially true
tests/test_litellm/proxy/auth/test_auth_utils.py Thorough coercion tests; brace-string test doesn't enable the validation flag
tests/test_litellm/proxy/auth/test_user_api_key_auth.py Two new integration tests correctly pin the early-return and no-double-resolve invariants

Comments Outside Diff (1)

  1. tests/test_litellm/proxy/auth/test_auth_checks.py, line 525 (link)

    P2 _get_fuzzy_user_object.assert_not_awaited() assertions are trivially true

    _get_fuzzy_user_object is never called directly from _end_user_id_exists_in_db; it is only invoked as an internal detail of get_user_object, which is itself mocked. The assertion is therefore always satisfied regardless of what the production code does, and would not catch a regression where _end_user_id_exists_in_db is changed to call _get_fuzzy_user_object directly. The same trivially-true assertion appears at line 555. The meaningful check for the email-routing behaviour is already the user_kwargs["user_email"] assertion at lines 552-553 and 576.

Reviews (1): Last reviewed commit: "fix: end user logs (#27758)" | Re-trigger Greptile

Comment on lines +864 to +868

def test_dict_metadata_user_id_returns_none(self):
request_body = {
"metadata": {"user_id": {"device_id": "abc"}},
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test doesn't exercise the JSON-parse branch it documents

The docstring says "A string starting with { but failing to parse stays as-is", implying that the safe_json_loads guard inside _coerce_user_id_to_str is being tested. However, the test doesn't set litellm.validate_end_user_id_in_db = True, so the if litellm.validate_end_user_id_in_db and stripped[:1] in ("{", "[") condition is never entered. A companion test with the flag enabled is needed to confirm that invalid-JSON brace strings are preserved after safe_json_loads returns None.

Comment on lines +2188 to +2208
if user_api_key_auth_obj.end_user_id is None:
from litellm.proxy.proxy_server import (
prisma_client,
proxy_logging_obj,
user_api_key_cache,
)

raw_end_user_id = get_end_user_id_from_request_body(
request_data, _safe_get_request_headers(request)
)
if raw_end_user_id is not None:
resolved_end_user_id = await resolve_and_validate_end_user_id(
raw_end_user_id=raw_end_user_id,
prisma_client=prisma_client,
user_api_key_cache=user_api_key_cache,
parent_otel_span=user_api_key_auth_obj.parent_otel_span,
proxy_logging_obj=proxy_logging_obj,
route=route,
)
if resolved_end_user_id is not None:
user_api_key_auth_obj.end_user_id = resolved_end_user_id

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 BudgetExceededError from tail safety-net bypasses the exception handler

When validate_end_user_id_in_db=True, the tail block calls resolve_and_validate_end_user_id, which calls _end_user_id_exists_in_db -> get_end_user_object. get_end_user_object re-raises litellm.BudgetExceededError for over-budget users, and _end_user_id_exists_in_db propagates it. Because this block sits outside the try/except around _run_centralized_common_checks (lines 2165-2180), a budget-exceeded exception here escapes user_api_key_auth without going through UserAPIKeyAuthExceptionHandler._handle_authentication_error, producing a raw 500 instead of a formatted 429/403 on builder early-return paths when the flag is enabled.

@yuneng-berri yuneng-berri merged commit 697a90e into litellm_internal_staging May 21, 2026
115 checks passed
lorenzbaraldi pushed a commit to lorenzbaraldi/litellm that referenced this pull request May 21, 2026
* fix: end user logs

* fix(auth): address PR review feedback on end-user id validation

- Gate DB validation behind litellm.validate_end_user_id_in_db (default
  False) so arbitrary client-supplied identifiers still pass through.
- Reuse get_end_user_object / get_user_object / _get_fuzzy_user_object
  instead of issuing raw Prisma queries in the auth hot path.
- Consolidate: builder does the resolution once and stores it on the
  auth obj; centralized checks reuse it, the outer user_api_key_auth
  copy is removed.
- Preserve end_user_id when litellm.max_end_user_budget_id is set so
  the default end-user budget can still apply to new customers.

* fix(auth): gate JSON-blob user-id rejection behind validate_end_user_id_in_db

Addresses PR review feedback: the JSON-encoded dict/list rejection in
_coerce_user_id_to_str was unconditionally applied, which would silently
stop tracking spend for deployments passing JSON-encoded user identifiers
on upgrade. Per the backwards-compatibility rule, default-path behavior
changes must be opt-in.

Now only strings that decode to a JSON object/array are dropped when
litellm.validate_end_user_id_in_db is True. Non-string dict/list/tuple
values are still always dropped, since stringifying them produces
unusable "{'device_id': ...}"-shaped spend-log rows.

* fix(auth): route email end-user lookup through get_user_object cache

The email-shaped end-user id branch called _get_fuzzy_user_object directly,
bypassing get_user_object's _should_check_db throttle and user_api_key_cache.
Every unique email would hit an unbudgeted raw Prisma query on the critical
auth path. Collapsing the two calls into one get_user_object invocation
with user_email=end_user_id routes through the cached helper per PR review
feedback.

* fix(auth): keep end-user safety net at user_api_key_auth tail

Krrish flagged that removing the tail-of-user_api_key_auth assignment
was a regression risk: ``_user_api_key_auth_builder`` has multiple
early-return paths (master_key=None, /user/auth, JWT short-circuits)
that bypass the end-user resolution block, so dropping the safety net
silently strips end-user attribution from those paths.

Restore the assignment but route it through resolve_and_validate_end_user_id
so the same validation rules apply. Skip the second pass when the builder
already set an id.

Adds two tests pinning the behaviour: one for the early-return safety
net and one verifying we don't double-resolve when the builder set the id.

Co-authored-by: Dennis Henry <dennis.henry@okta.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants