
feat(tests): behavior-pinning harness + Key Tier-1 matrix #28321

Merged
yuneng-berri merged 31 commits into litellm_internal_staging from litellm_/silly-wright-1b8559
May 21, 2026

Conversation

@yuneng-berri yuneng-berri commented May 20, 2026

Collaborator

Summary

Adds an HTTP-boundary regression-test surface for the six Key management
endpoints: /key/generate, /key/info, /key/list, /key/update,
/key/regenerate, /key/delete. Tests boot the real proxy app once per
session via an in-process ASGI transport, connect to a real Postgres pointed
at by DATABASE_URL, and assert at the HTTP boundary with no mocks — auth
runs, prisma runs, integrations run.

Each scenario pins the proxy's current authorization behaviour. Future PRs
that change response codes or visibility on these endpoints will turn the
matrix red.

What's added

  • tests/proxy_behavior/management/ — 129 scenarios (8 actors × applicable
    targets) covering the six endpoints, plus harness fixtures: session-scoped
    ASGI client + real FastAPI lifespan, an immutable 8-actor world seed, and
    a per-test scratch namespace with delete_many teardown so write tests
    don't pollute each other.
  • .github/workflows/test-unit-proxy-mgmt-behavior.yml — PR-triggered
    workflow that delegates to the existing reusable services-base workflow
    with enable-postgres: true. ~2m30s on ubuntu-latest.
  • [tool.mutmut].tests_dir extended to include the new suite so the
    existing (manual-trigger) mutation-test workflow exercises it next time
    it runs.

What's untouched

  • The existing tests/test_litellm/proxy/management_endpoints/ mock suite
    stays in place — both suites coexist.
  • No production code changes.

Regression-replay check

To confirm the matrix actually catches the class of authz regression these
endpoints have historically shipped, one recent fix-PR was replayed: the
handler module was reverted to the fix's parent commit and the matrix was
run red, then restored and run green.

  • c7c3df2b02 fix(proxy): extend /key/update admin check to non-budget
    fields (parent 662d05531d). Under the pre-fix handler, 6 scenarios in
    test_key_update.py (self/{team_admin, internal_user, owner,
    unrelated_same_org, cross_org_user, service_account}) return 200: a
    non-admin could rewrite models on a key. Post-fix they are pinned at
    403. The matrix flips 6-failed → all-pass purely on the handler swap,
    confirming it catches the behaviour change.

Known coverage gaps deferred to follow-up PRs (the matrix shape is correct;
these are scope extensions): 404-on-missing-key scenarios; budget/limit
counter assertions on denied /key/update; upperbound enforcement on
/key/regenerate; /key/list filter-param views.

Test plan

  • test-unit-proxy-mgmt-behavior.yml passes on this PR
  • test-unit-proxy-endpoints.yml is unaffected (the new test dir is
    outside its glob)
  • Future: trigger mutation-test.yml manually to record a
    mutation-score baseline against the new suite

Local repro

docker run --rm -d --name litellm-test-pg \
  -e POSTGRES_USER=litellm -e POSTGRES_PASSWORD=litellm -e POSTGRES_DB=litellm_test \
  -p 5432:5432 postgres:14
export DATABASE_URL=postgresql://litellm:litellm@localhost:5432/litellm_test
uv run prisma generate --schema litellm/proxy/schema.prisma
uv run prisma db push --schema litellm/proxy/schema.prisma --accept-data-loss
uv run pytest tests/proxy_behavior/management/

Whole-suite wall-time: ~6s locally, ~2m30s on CI.

…eness smoke

Slice 2 of the management-endpoints behavior-pinning effort. New top-level dir
tests/proxy_behavior/management/ outside every existing pytest glob.

conftest.py initialises the proxy app once per session against the DATABASE_URL
the harness boots Postgres at, wraps it in httpx.AsyncClient via in-process
ASGITransport. The one smoke test asserts /health/liveliness returns 200, which
exercises the full FastAPI middleware stack against a real app — no mocks.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
…de-risk

Slice 3 of the management-endpoints behavior-pinning effort. The fixture now
enters the real FastAPI lifespan (proxy_startup_event) instead of just calling
initialize() — that is where prisma_client is connected, password migration is
kicked off, and the rest of the startup wiring runs.

Tests pin the loop to the session scope so the AsyncClient created in the
session fixture and the prisma connection opened in the lifespan share the
same loop as the test bodies.

New de-risk smoke: POST /key/generate with the master key returns 200, the
returned sk- token resolves to a hashed row in LiteLLM_VerificationToken, and
the cleartext token is never stored. Proves auth + handler + helper + prisma
all wire together end-to-end against a real Postgres.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

Slice 4 of the management-endpoints behavior-pinning effort. New
``actors.py`` defines the actor enum + seeds an immutable world (2 orgs,
2 teams, 8 users, 8 verification tokens) under the ``behavior-pin-``
prefix so the rows are identifiable in psql and ``_wipe_world`` is
targeted.

Each actor key is created with its cleartext form generated locally and
its hashed form (via ``litellm.proxy.utils.hash_token``) stored in
``LiteLLM_VerificationToken`` — so the real ``user_api_key_auth`` accepts
the cleartext bearer token. Roles, ``team_id``, ``organization_id``, and
the service-account metadata flag are all set on the seeded rows so the
auth layer resolves the same scopes a real proxy would.

The session-scoped ``world`` fixture re-seeds at session start (idempotent
via wipe-then-create), and the smoke test confirms each of the 8 actor
keys can call ``/key/info`` on itself and receive its own row back.
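
A minimal sketch of the seed-then-authenticate idea described above. It assumes the hashing step is a SHA-256 hex digest; `hash_token_sketch`, `seed_actor_key`, `resolve`, and the in-memory `tokens` dict are hypothetical stand-ins, not the suite's actual helpers or the real `hash_token` import.

```python
import hashlib
import secrets


def hash_token_sketch(token: str) -> str:
    # Assumption: stand-in for litellm.proxy.utils.hash_token, taken here to
    # be a SHA-256 hex digest of the cleartext.
    return hashlib.sha256(token.encode()).hexdigest()


def seed_actor_key(table: dict, user_id: str) -> str:
    """Store only the hashed token; return the cleartext for use as a bearer."""
    cleartext = "sk-" + secrets.token_urlsafe(24)
    table[hash_token_sketch(cleartext)] = {"user_id": user_id}
    return cleartext


def resolve(table: dict, bearer: str):
    # Auth-side lookup: hash the presented bearer and look up the row.
    return table.get(hash_token_sketch(bearer))


tokens = {}
cleartext = seed_actor_key(tokens, "behavior-pin-owner")
assert resolve(tokens, cleartext) == {"user_id": "behavior-pin-owner"}
assert cleartext not in tokens  # the cleartext itself is never a stored key
```

Because only the hash is stored, the seeded bearer goes through the same lookup shape a real client token would.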

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
…ny teardown

Slice 5 of the management-endpoints behavior-pinning effort. Adds the
``scratch`` function-scoped fixture: each test gets a uuid4-derived
namespace prefix, tags writes with it (``key_alias``, ``team_alias``,
``user_id``, ``budget_id``), and the fixture teardown ``delete_many``-s
any row whose namespace column starts with that prefix.

Cleanup uses Prisma model methods only (no raw SQL, per CLAUDE.md) and
orders deletes children-before-parents to avoid FK conflicts. The Slice 3
de-risk smoke is migrated onto the same fixture so it stops accumulating
untagged tokens across repeated local runs.

Smoke proves both halves of the contract: one test writes a scratch-tagged
key and asserts it lands; a second test runs after the first's teardown
and asserts no rows in the scratch namespace survived.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

Slice 6 of the management-endpoints behavior-pinning effort. Two new tests
walk every .py file under tests/proxy_behavior/ and assert:

  * no ``from litellm.proxy.management_endpoints`` import — the suite is
    deliberately constrained to the HTTP boundary so it survives handler
    refactors;
  * no ``mock``/``patch`` on ``user_api_key_auth`` — mocking auth is the
    structural failure mode of the existing 11k-line mock suite, and the
    point of this harness is that the real auth layer runs.

Codifying G3 as a CI test removes the "did someone forget to check the
PR-description checklist" failure mode.
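
One way such a sentinel could be written, shown as a sketch over source strings rather than files on disk. The function names are hypothetical and the auth-mock check is a deliberately crude textual match; the suite's actual test may differ.

```python
import ast

FORBIDDEN_MODULE = "litellm.proxy.management_endpoints"


def imports_handler_module(source: str) -> bool:
    """True if the source imports anything from the handler package."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            if (node.module or "").startswith(FORBIDDEN_MODULE):
                return True
        elif isinstance(node, ast.Import):
            if any(alias.name.startswith(FORBIDDEN_MODULE) for alias in node.names):
                return True
    return False


def patches_auth(source: str) -> bool:
    """Crude check: a line naming user_api_key_auth alongside patch or mock."""
    return any(
        "user_api_key_auth" in line and ("patch" in line or "mock" in line.lower())
        for line in source.splitlines()
    )
```

A test would read each `.py` file under the suite directory and assert both functions return `False`. Parsing with `ast` never executes the scanned imports.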

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

Follow-up to 6f588c7: line-length fixes only, no behavior change.

Slice 7 of the management-endpoints behavior-pinning effort. Parametrized
matrix across two axes: actor (8 seeded) × target scope (self, team_alpha
in org_a, team_beta in org_b). 18 scenarios after dropping non-applicable
combos. Whole-suite wall-time stays at ~4.7s (well under the 10-min G2
budget for the eventual CI job).

While pinning, the test surfaced one seed gap: ``_get_user_in_team`` reads
``members_with_roles`` (a JSON list of ``{user_id, role}``), not the plain
``members`` String[]. Both columns are now populated in the seed to match
what the real ``/team/new`` handler would produce.

Expected status codes are intentionally heterogeneous (200, 400, 401)
because the current handler emits different statuses depending on which
check fails first (role gate, team-member-perm gate, "not assigned"
check). Pinning the *observed* codes — not what they "should" be — is
exactly the regression signal we want.
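
The matrix shape can be sketched as a cross product filtered by a pinned table. The actor list is a subset chosen for illustration, and the status codes in `PINNED` are placeholders, not the codes pinned in the real suite.

```python
from itertools import product

ACTORS = ["proxy_admin", "org_admin", "team_admin", "internal_user"]  # illustrative subset
TARGETS = ["self", "team_alpha", "team_beta"]

# Hypothetical pinned table: (actor, target) -> observed status. Combos that
# are absent are treated as non-applicable and dropped from the matrix.
PINNED = {
    ("proxy_admin", "self"): 200,
    ("proxy_admin", "team_alpha"): 200,
    ("internal_user", "self"): 200,
    ("internal_user", "team_beta"): 401,
}


def scenarios():
    """Cross actors with targets, keeping only combos that have a pinned code."""
    return [
        (actor, target, PINNED[(actor, target)])
        for actor, target in product(ACTORS, TARGETS)
        if (actor, target) in PINNED
    ]
```

The resulting list could be passed to `pytest.mark.parametrize("actor,target,expected", scenarios())`, so that a change in any observed code fails exactly one scenario.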

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

Slice 8 of the management-endpoints behavior-pinning effort. 8 actors ×
3 target keys (own, OWNER's key in org_a, CROSS_ORG_USER's key in org_b)
covering self-read, same-team-peer read, and cross-org read.

Notable pinned behaviors (intentionally surfaced for review, not "fixed"):

  * ORG_ADMIN gets 403 on individual key info even within their own org
    — visibility is scoped to "your own keys" + "your team's keys", not
    "your org's keys".
  * Same-team peers (INTERNAL_USER, UNRELATED_SAME_ORG, SERVICE_ACCOUNT)
    DO see each other's keys. Whether that is desired is for the team
    to decide; this PR only pins the existing behavior so unintentional
    changes flip the matrix red.

Wall-time is unchanged (~4.3s for the slice on its own).

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
…arios)

Slice 9 of the management-endpoints behavior-pinning effort. For /key/list
the response IS the matrix: each of the 8 seeded actors calls the endpoint
with default filters and the test asserts set-equality between the returned
visible-token set (filtered to seeded tokens only, so unrelated rows can't
flap the assertion) and a pinned expected actor-set.

Pinned default visibility:

  * PROXY_ADMIN sees all 8 actors' keys.
  * Every other actor sees only their own key — including ORG_ADMIN
    (which had broader expectations going in but currently behaves
    same-as-internal-user for /key/list defaults) and TEAM_ADMIN (no
    team-aggregation without include_team_keys=true).

Future changes that broaden or narrow any single actor's default
visibility will turn this matrix red — exactly the regression signal we
want. Parameter-driven views (include_team_keys, filters) are deferred to
Slice 13 / PR2 follow-up.
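
A sketch of the set-equality assertion, assuming the seed is available as a mapping from token hash to actor name. `visible_actors` and the sample hashes are hypothetical.

```python
def visible_actors(returned_tokens, seeded):
    """Map returned token hashes to actor names, ignoring non-seeded rows.

    `seeded` maps token hash -> actor name (hypothetical shape).
    """
    return {seeded[t] for t in returned_tokens if t in seeded}


seeded = {"h-admin": "proxy_admin", "h-owner": "owner"}
# An unrelated row ("h-stray") must not affect the comparison.
assert visible_actors(["h-owner", "h-stray"], seeded) == {"owner"}
assert visible_actors(["h-admin", "h-owner"], seeded) == {"proxy_admin", "owner"}
```

Filtering to seeded tokens before comparing is what lets the assertion use strict equality even when the database holds rows from other runs.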

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
… (21 scenarios)

Slice 10 of the management-endpoints behavior-pinning effort. 8 actors ×
3 target shapes (self-owned, OWNER-scoped in org_a/team_alpha,
CROSS_ORG_USER-scoped in org_b/team_beta), which leaves 21 scenarios after
dropping non-applicable combos.

Each test:
  1. Master-key-seeds a fresh scratch key with the target's (user_id,
     team_id) scope (so the read-world stays untouched).
  2. Has the actor under test POST /key/update flipping ``models`` to
     a known marker list.
  3. Asserts the status code AND the DB row's ``models`` field — present
     when 200, unchanged otherwise — so a handler that silently mutates
     on a denied response surfaces red.
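
Step 3 can be sketched as a helper that checks the persisted side effect alongside the status code. `MARKER` and `check_update_outcome` are hypothetical names; the real tests read the row back through Prisma.

```python
MARKER = ["behavior-pin-marker-model"]  # hypothetical marker list


def check_update_outcome(status: int, models_before: list, models_after: list) -> None:
    """Pin both the HTTP status and what actually landed in the row."""
    if status == 200:
        assert models_after == MARKER, "accepted update did not persist"
    else:
        assert models_after == models_before, "denied update mutated the row"


check_update_outcome(200, [], MARKER)      # accepted and persisted
check_update_outcome(403, ["a"], ["a"])    # denied and untouched
```

A handler that returned 403 but still wrote `MARKER` would fail the second branch, which is the silent-mutation case the step is meant to surface.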

Observed gating (pinned, not endorsed):

  * PROXY_ADMIN bypasses every check.
  * ORG_ADMIN is blocked by an early role gate, always 401.
  * Every other (INTERNAL_USER-roled) actor hits one of three failure
    modes — 403 "user can only create keys for themselves", 403
    "only proxy admins, team admins, or org admins", or 401
    "team_member_permission_error" — depending on whether they own the
    target and whether they're a team admin / member of its team.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
…tract (22 scenarios)

Slice 11 of the management-endpoints behavior-pinning effort. 21 matrix
scenarios (8 actors × 3 target shapes, minus the cross_org/owner combo
that exists in the seed but isn't applicable) plus one smoke for the
``/key/{key:path}/regenerate`` route registration.

On 200 outcomes the test verifies the full rotation contract:
  * the regenerate response key differs from the old cleartext,
  * the OLD cleartext returns 401 on a follow-up ``/key/info``,
  * the NEW cleartext returns 200 on a follow-up ``/key/info``.

On denied outcomes the test verifies the OLD cleartext still works —
catching any handler that mutates the token row on a failed call.
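
The rotation contract, sketched against an in-memory stand-in. `FakeKeyStore` is hypothetical and only mimics the status codes named above; it is not the proxy's handler.

```python
import secrets


class FakeKeyStore:
    """In-memory stand-in for the proxy's key endpoints."""

    def __init__(self):
        self.valid = set()

    def generate(self) -> str:
        key = "sk-" + secrets.token_urlsafe(16)
        self.valid.add(key)
        return key

    def info_status(self, key: str) -> int:
        return 200 if key in self.valid else 401

    def regenerate(self, key: str, allowed: bool):
        if not allowed:
            return 401, None  # denied: the store is left untouched
        self.valid.discard(key)
        return 200, self.generate()


def check_rotation(store: FakeKeyStore, old: str, status: int, new) -> None:
    if status == 200:
        assert new != old
        assert store.info_status(old) == 401  # old cleartext is dead
        assert store.info_status(new) == 200  # new cleartext is live
    else:
        assert store.info_status(old) == 200  # denial must not rotate
```

Checking the old key on the denied branch is the part that catches a handler which mutates the token row before rejecting the call.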

Pinned authz divergence vs /key/update: regenerate routes most denials
through the team-member-perm 401 path rather than the role-gate 403
path. The matrices for both endpoints are now in tree side-by-side, so
any future refactor that "harmonises" the codes will turn one of the two
red.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
…ract (21 scenarios)

Slice 12 of the management-endpoints behavior-pinning effort. Mirrors
slices 10/11. On success: cleartext can no longer authenticate
(handles both hard-delete and soft-delete to LiteLLM_DeletedVerificationToken).
On denial: row survives and cleartext still authenticates.

Notable behavior gap with /key/info: same-team peers (internal_user,
unrelated_same_org, etc.) get 403 on /key/delete for OWNER's key, so they
cannot delete each other's keys, whereas they CAN read each other's
keys (Slice 8). Delete is stricter than read. Pinned as-is.

Cumulative whole-suite wall-time is 5.9s for all 128 tests on the local
runner — well under the 10-min G2 budget for the CI job in Slice 13.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
…uite

Slice 13 of the management-endpoints behavior-pinning effort. New
workflow ``test-unit-proxy-mgmt-behavior.yml`` fires ``on: pull_request``
for the same branch set every other proxy unit-test workflow watches
(main, litellm_internal_staging, litellm_oss_branch, litellm_**).

It delegates to the existing reusable ``_test-unit-services-base.yml``
with ``enable-postgres: true``, which already provisions a postgres:14
service container and runs ``prisma db push`` against it before pytest
collects. ``reruns: 0`` because a behavior-pinning matrix that needs
reruns is itself a regression — flakes are signal.

``timeout-minutes: 15`` gives generous headroom over the local 5.9s
whole-suite wall-time; the binding G2 budget is 10 min.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

Slice 14 of the management-endpoints behavior-pinning effort. Documents
the regression-replay verification methodology + a 12-row table mapping
recent fix-PRs touching key_management_endpoints.py to the catching
scenarios in the PR1 matrix.

One canonical RED→GREEN cycle is captured verbatim: c7c3df2
"extend /key/update admin check to non-budget fields". Under the
parent-of-fix code, 6 scenarios in test_key_update.py fail because they
return 200 where the matrix pins 403; under HEAD code, all 21 pass. The
handler swap is the only change between the two runs, confirming the
matrix catches the behavior shift the fix introduced.

The table also calls out 4 genuine coverage gaps deferred to PR2/PR3:
404-on-missing-key, budget-limit counter assertions, /key/regenerate
upperbound enforcement, and /key/list filter-param views.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

Slice 15 of the management-endpoints behavior-pinning effort. Appends
``tests/proxy_behavior/management/`` to ``[tool.mutmut].tests_dir`` so
the existing mutation-test workflow runs against both the legacy mock
suite AND the new behavior suite — the latter is where the regression
signal will actually surface.

Adds a stub at ``tests/proxy_behavior/management/mutmut_triage/pr1.md``
documenting the G5 triage protocol (zero unreviewed survivors in the 6
Tier-1 handler functions) and a placeholder baseline-metrics table to
fill in after the first manually-triggered mutmut run completes — runs
take hours and run on a manual cadence, so PR1 ships with the wiring +
protocol, not the numbers. The actual baseline is recorded in a
follow-up once ``gh workflow run mutation-test.yml`` finishes.

The kill rate stays telemetry-only, never a gate. G5 (per-survivor
classification) is the binding mutation gate.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
…ates

Slice 16 of the management-endpoints behavior-pinning effort. The README
documents:

  * The same three commands the CI workflow runs locally (BYO-DATABASE_URL,
    no new tooling).
  * Suite layout — what each test file covers, which slice it lands.
  * The asyncio loop_scope convention required for session fixtures
    (httpx AsyncClient + prisma connection) to share a loop with each
    test body.
  * G3 strict-import convention + the test that enforces it.
  * Read-world vs scratch-world fixture conventions.
  * Behavior-pinning philosophy: pin observed codes; flag, don't judge.
  * Where each G1–G5 + PR1.M1–M3 gate's evidence lives.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d
@codecov

codecov Bot commented May 20, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

First run on PR #28321 failed with UniqueViolation on
``behavior-pin-budget`` plus cascading missing-membership FK errors. Both
xdist workers entered ``seed_world()`` concurrently against the shared
Postgres service container; whichever lost the race left the world in a
half-seeded state and downstream tests ran against missing
team_membership rows.

Whole-suite wall-time is ~7s sequentially, so disabling xdist here costs
nothing — and the seed itself is the wrong place to add per-worker
isolation (the world is intentionally shared so set-equality assertions
in /key/list have a deterministic expected set).
… master

Second CI run failed: ``/key/generate`` with explicit ``user_id`` returned
403 "User can only create keys for themselves. Got user_id=X, Your ID=None"
in every test that called ``_create_scratch_key`` with a per-actor user_id.
The bare master key's auth path was producing ``user_id=None`` in the
fresh CI Postgres, which doesn't trigger the PROXY_ADMIN bypass in
``_user_can_only_create_keys_for_themselves`` reliably. Locally the same
master key path worked, masking the issue.

Fix: every ``_create_scratch_key`` helper now takes a seeder cleartext
and the test bodies pass ``world.keys[Actor.PROXY_ADMIN].cleartext``.
That actor was seeded with ``user_role=PROXY_ADMIN`` AND a concrete
``user_id``, so the bypass fires deterministically in both environments.

No behavior shift in the matrices themselves — all 128 scenarios still
pass locally; only the setup helper's auth identity changed.

The bare-master smoke (test_smoke + test_scratch_teardown) is intentionally
left on the master key path: those tests don't pass ``user_id`` in the
body so they don't hit the user_id-mismatch gate.
…failures

Third CI run failed identically: seeded PROXY_ADMIN actor's auth resolves
to ``user_id=None`` even though the DB row has the right ``user_id``. The
suite was aborting at maxfail=10 inside test_key_delete, so test_world_seed
(which would tell us whether the seed itself is reachable) never ran in CI.

Two diagnostic moves on this push, no behavior change:

  * Rename ``test_world_seed.py`` → ``test_aaa_world_seed.py`` so it's
    the first collected file. If it passes in CI we know the seed is
    fine and the bug lives downstream; if it fails the same way the
    bug is in the auth resolution path.
  * Bump ``max-failures`` to 200 for this workflow so we see the full
    failure surface instead of stopping at the first cascading setup
    error. Will tighten back down once the suite is green.

Adds one new test ``test_proxy_admin_actor_can_create_keys_for_others``
that explicitly exercises the PROXY_ADMIN bypass via /key/generate with
an explicit user_id — the same shape the matrix setup helper uses but
without the matrix machinery muddying the diagnostic.
… in fixture

Fourth CI run still failed because the proxy's lifespan kicks off
``prisma_client.check_view_exists()`` as a fire-and-forget background
task — that task is what creates ``LiteLLM_VerificationTokenView``, the
SQL view ``user_api_key_auth`` queries to resolve a token to its
user_id / user_role / team.

On a fresh Postgres (CI), the first test races the background task. The
view doesn't exist when the first auth call runs, the resolver falls
through to a degraded path that returns ``user_id=None``, and every
matrix test that depends on the seeded actor's identity then fails
confusingly with "Got user_id=X, Your ID=None" 403s. Locally the view
persists across pytest runs so the race is invisible.

Fix: await ``prisma_client.check_view_exists()`` explicitly inside the
session ``proxy_app`` fixture, after the lifespan enters but before the
fixture yields. Deterministic regardless of whether the underlying DB is
fresh (CI) or warm (local).
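
The race and the fix can be reproduced in miniature with plain asyncio. All names below are hypothetical; `create_view` stands in for the view-creating startup task and `state` for the database.

```python
import asyncio


async def create_view(state: dict) -> None:
    await asyncio.sleep(0)  # stands in for the DB round-trip
    state["view_exists"] = True


async def startup_fire_and_forget(state: dict) -> asyncio.Task:
    # Mirrors the racy shape: schedule the task and return immediately.
    return asyncio.create_task(create_view(state))


async def first_request_sees_view(await_view: bool) -> bool:
    state = {"view_exists": False}
    task = await startup_fire_and_forget(state)
    if await_view:
        await create_view(state)  # explicit await before serving any request
    seen = state["view_exists"]
    await task  # let the background task finish so nothing is left pending
    return seen


assert asyncio.run(first_request_sees_view(await_view=False)) is False
assert asyncio.run(first_request_sees_view(await_view=True)) is True
```

Without the explicit await, the scheduled task has not run by the time the first check happens, which matches the fresh-database behavior seen in CI.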
… shape

The fifth CI run isolated the failure to ``/key/generate`` with explicit
user_id while ``/key/info`` works for the same seeded PROXY_ADMIN actor.
The auth context's user_id is None even though the DB row has it set.

This commit widens the diagnostic test: on failure, dump the raw token
row's user_id, the user row's user_role, and what
``LiteLLM_VerificationTokenView`` actually returns for the seeded token.
If the view returns user_id=None we know the view shape is the problem;
if the view returns the right user_id we know it's a downstream code
path stripping it.

Previous diagnostic's raw SQL had an ambiguous user_id column from
joining the view with the user table, so the diagnostic itself crashed
before printing useful state. Simplified to query just the view's columns.

Six runs and the underlying data (token row, user row, view row) all
verified correct in CI, but auth still returns user_id=None. This
diagnostic calls the resolver primitives directly:

  1. ``prisma.get_data(table_name="combined_view")`` → raw view object
  2. ``get_key_object(...)`` → cached/DB UserAPIKeyAuth
  3. ``get_user_object(...)`` → LiteLLM_UserTable row
  4. ``_is_user_proxy_admin`` / ``_get_user_role``

and prints each intermediate via captured stdout (-s). Whichever step
returns None/False in CI is where the chain breaks. Imports come from
``litellm.proxy.auth`` (not management_endpoints), so G3 still passes.
…'t wipe it

Real root cause of every CI run that returned ``Your ID=None`` for the
seeded actors:

  * In ``initialize()``, ``master_key`` is set from the config YAML's
    ``general_settings.master_key`` (load_config code path at
    proxy_server.py:4174).
  * Then the FastAPI lifespan (``proxy_startup_event``) runs and at line
    776 does ``master_key = get_secret_str("LITELLM_MASTER_KEY")``,
    which UNCONDITIONALLY overwrites the global.
  * In CI the env var is unset, so the post-lifespan ``master_key`` is
    None.

Downstream every auth path degrades: master-key requests don't bypass
because ``secrets.compare_digest(api_key, None)`` raises and is caught
to ``is_master_key_valid=False``; seeded-actor requests cache a
``UserAPIKeyAuth`` whose ``user_role`` never resolves through the
PROXY_ADMIN bypass; ``_is_allowed_to_make_key_request`` then hits the
``user_id`` mismatch path with ``Your ID=None``.

Locally my shell happened to have ``LITELLM_MASTER_KEY`` set from a prior
session, which is why every local run was green and CI red — exactly the
"don't generalize from your environment to CI" memory.

Fix: ``os.environ.setdefault("LITELLM_MASTER_KEY", MASTER_KEY)`` and
``os.environ.setdefault("CONFIG_FILE_PATH", config_path)`` before
entering the lifespan, so its re-read produces the same value as
``initialize()``.
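
A small sketch of the failure and the fix, using a plain dict in place of `os.environ` so it has no side effects. `lifespan_reread`, `is_master_key`, and the `MASTER_KEY` value are hypothetical stand-ins for the proxy's own code paths.

```python
import secrets

MASTER_KEY = "sk-test-master"  # hypothetical test value


def lifespan_reread(env: dict):
    # Stands in for the lifespan re-reading the key from the environment.
    return env.get("LITELLM_MASTER_KEY")


def is_master_key(api_key: str, master_key) -> bool:
    try:
        return secrets.compare_digest(api_key, master_key)
    except TypeError:
        return False  # comparing against None raises, so the bypass never fires


ci_env = {}  # CI: variable unset, so the re-read yields None
assert is_master_key(MASTER_KEY, lifespan_reread(ci_env)) is False

ci_env.setdefault("LITELLM_MASTER_KEY", MASTER_KEY)
assert is_master_key(MASTER_KEY, lifespan_reread(ci_env)) is True
```

Note that `setdefault` leaves an already-set value alone, so a shell that exports a different key would still win over the test's value.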

Whole-suite still green locally (130 tests, ~6.4s).
…sn't gated

Ninth CI run cleared every ``Your ID=None`` failure (the master_key env
fix worked end-to-end) and exposed the next thin layer of failures:
``/key/regenerate`` returns 500 "Regenerating Virtual Keys is an
Enterprise feature" in CI because the proxy can't see a
``LITELLM_LICENSE``. Locally my license is set, so the matrix passes.

The behavior matrix is supposed to pin authz, not licensing — so flip
``proxy_server.premium_user = True`` directly, both before and after the
lifespan (the lifespan re-runs ``_license_check.is_premium()`` and would
otherwise reset it). With premium gating disabled, the regenerate matrix
exercises the same authz path /key/update does.

Whole-suite still green locally (130 tests, ~6.3s).
…lures

Followup to the CI-bring-up sequence: now that the suite is green in CI
(130 → 129 tests after this trim; 156s wall-time on ubuntu-latest), drop
the diagnostic noise left over from debugging the master_key wipe:

  * Rename ``test_aaa_world_seed.py`` back to ``test_world_seed.py`` —
    no longer needs to run first.
  * Remove ``test_auth_resolver_returns_correct_user_id_and_role`` —
    that test reached into private auth helpers to localize the bug
    between the DB and ``UserAPIKeyAuth``; it has served its purpose
    and isn't HTTP-boundary.
  * Keep ``test_proxy_admin_actor_can_create_keys_for_others`` (without
    the failure-time dump) — it's a real authz contract that pins the
    PROXY_ADMIN bypass on /key/generate, and would catch a regression
    of the same conftest interaction this sequence revealed.
  * Drop the workflow's ``max-failures: 200`` override — that was a
    debug aid for seeing the full failure surface in CI. Default of 10
    is right for a stable suite.
…nto README

The mutmut_triage/pr1.md file was a placeholder for numbers and
classifications that don't exist yet — the first mutmut run is a manual
follow-up. Empty stubs aren't evidence; deleting it.

The G5 protocol (run the workflow, triage survivors in the six Tier-1
handler functions, kill-or-accept-with-reason, zero unreviewed) moves
into the suite README's "Gate evidence" block. The real triage file
will land alongside the first mutmut follow-up.

pyproject.toml's [tool.mutmut].tests_dir entry stays — that's the
one-line wiring that makes the existing (manual-trigger) mutation-test
workflow include our suite next time someone runs it. Comment updated
to drop the dead file reference.

Removes the suite README — its contents (local repro, layout, conventions)
were either restated by the file structure or already covered by the
workflow YAML and pyproject.toml. Trims docstrings and inline comments
across every test file to keep only non-obvious WHY (the masking
``_get_user_in_team`` reads, the LiteLLM_VerificationTokenView models-can't-
be-NULL gotcha, the org_admin/peer-visibility surprise, the rotation
contract).

Suite still 129 green locally.
@yuneng-berri yuneng-berri marked this pull request as ready for review May 21, 2026 01:21
@yuneng-berri yuneng-berri requested a review from a team May 21, 2026 01:21
@greptile-apps

greptile-apps Bot commented May 21, 2026

Contributor

Greptile Summary

Introduces an HTTP-boundary behavior-pinning harness for the six Key management endpoints (/key/generate, /key/info, /key/list, /key/update, /key/regenerate, /key/delete). Tests boot the real proxy app once per session via in-process ASGI transport against a real Postgres, with no handler-level mocks, so the matrix turns red on any authz regression.

  • 129 scenarios across an 8-actor world seed; FK-aware teardown via a per-test scratch namespace keeps write tests isolated.
  • Pagination-aware /key/list helper walks all pages before asserting PROXY_ADMIN visibility, addressing the prior truncation concern.
  • test_no_management_imports.py acts as a static sentinel preventing the suite from drifting into handler-import or auth-mock patterns.

Confidence Score: 5/5

Pure test addition with no production code changes; the harness correctly isolates state and verifies both the HTTP boundary and DB side-effects.

All changed files are new test infrastructure and workflow config. The authz matrices pin observed behavior with DB-level verification, the scratch fixture reliably tears down between tests, and the world seed wipe prevents cross-session contamination. The only findings are a temp file that is not cleaned up after the session and a dangling comment reference — neither affects correctness or CI reliability.

conftest.py — temp config file created with delete=False is never removed; pyproject.toml — comment references a README that does not yet exist.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/proxy_behavior/management/conftest.py | Session-scoped ASGI client and world seed fixtures are well-structured; temp config file created with delete=False is never removed after the session fixture tears down. |
| pyproject.toml | Extends mutmut tests_dir to include the new behavior suite; comment references README.md that does not yet exist in the repository. |
| tests/proxy_behavior/management/actors.py | Defines the immutable 8-actor world seed with careful FK-aware wipe ordering and dual members/members_with_roles population. |
| tests/proxy_behavior/management/test_key_list.py | Pagination-aware visibility helper correctly walks all pages before asserting; addresses the previous review comment about size=100 truncation. |

Reviews (2): Last reviewed commit: "test(proxy_behavior): address Greptile r..."

Comment thread tests/proxy_behavior/management/conftest.py Outdated
Comment thread tests/proxy_behavior/management/test_key_list.py Outdated
Comment thread tests/proxy_behavior/management/regression_replay/README.md Outdated
…, dedup

- conftest: force LITELLM_MASTER_KEY / CONFIG_FILE_PATH unconditionally
  instead of setdefault. An ambient LITELLM_MASTER_KEY with a different
  value would make the proxy authenticate on that key while the tests
  still send MASTER_KEY → silent 401s.
- test_key_list: paginate /key/list instead of a single size=100 request.
  size is capped at 100 by the endpoint, so on a non-fresh DB a single
  page could truncate PROXY_ADMIN's view and a seeded key could fall off
  the page. Walk total_pages.
- conftest: hoist the duplicated _create_scratch_key helper (copy-pasted
  and already diverged across test_key_{update,regenerate,delete}.py)
  into a single shared create_scratch_key.
- Delete regression_replay/README.md — G4 regression-replay evidence
  belongs in the PR description, not a committed doc file (repo docs
  policy + the effort's own plan both say so). Content moved to the PR.
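
The pagination walk could look roughly like this. `collect_all_keys` and `fetch_page` are hypothetical, and the response shape (a dict with "keys" and "total_pages") is an assumption based on the `total_pages` field mentioned above.

```python
def collect_all_keys(fetch_page, size: int = 100) -> list:
    """Walk every page of a paginated listing.

    `fetch_page(page, size)` is a hypothetical callable returning a dict
    with "keys" and "total_pages".
    """
    keys, page = [], 1
    while True:
        body = fetch_page(page, size)
        keys.extend(body["keys"])
        if page >= body["total_pages"]:
            return keys
        page += 1
```

In a test, `fetch_page` would wrap the authenticated GET to /key/list; collecting every page first means a seeded key cannot fall off the end of a single capped page.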
@yuneng-berri

Collaborator Author

@greptile

@yuneng-berri yuneng-berri enabled auto-merge (squash) May 21, 2026 02:14
@yuneng-berri yuneng-berri merged commit 79a5a7a into litellm_internal_staging May 21, 2026
117 of 118 checks passed
@yuneng-berri yuneng-berri deleted the litellm_/silly-wright-1b8559 branch May 21, 2026 02:27
lorenzbaraldi pushed a commit to lorenzbaraldi/litellm that referenced this pull request May 21, 2026
)

* test(proxy_behavior): scaffold session-scoped async ASGI client + liveness smoke

* test(proxy_behavior): connect prisma via real lifespan; key/generate de-risk

Slice 3 of the management-endpoints behavior-pinning effort. The fixture now
enters the real FastAPI lifespan (proxy_startup_event) instead of just calling
initialize() — that is where prisma_client is connected, password migration is
kicked off, and the rest of the startup wiring runs.

Tests pin the loop to the session scope so the AsyncClient created in the
session fixture and the prisma connection opened in the lifespan share the
same loop as the test bodies.

New de-risk smoke: POST /key/generate with the master key returns 200, the
returned sk- token resolves to a hashed row in LiteLLM_VerificationToken, and
the cleartext token is never stored. Proves auth + handler + helper + prisma
all wire together end-to-end against a real Postgres.
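The "hashed row, never the cleartext" assertion can be sketched as follows. This is an illustration under assumptions: a SHA-256 hex digest stands in for ``litellm.proxy.utils.hash_token``, and a plain dict stands in for ``LiteLLM_VerificationToken``; ``generate_key`` and ``cleartext_is_stored`` are hypothetical helpers, not the proxy's code.

```python
import hashlib
import secrets


def hash_token_sketch(token: str) -> str:
    # Assumption: SHA-256 hex digest stands in for the proxy's real hash_token.
    return hashlib.sha256(token.encode()).hexdigest()


def generate_key(table: dict) -> str:
    """Mint a cleartext key, persist only its hash, return the cleartext once."""
    cleartext = "sk-" + secrets.token_urlsafe(16)
    table[hash_token_sketch(cleartext)] = {"user_id": None}
    return cleartext


def cleartext_is_stored(table: dict, cleartext: str) -> bool:
    """True if the cleartext appears anywhere in the stored keys or rows."""
    return cleartext in repr(table)


table = {}
cleartext = generate_key(table)
row_found = hash_token_sketch(cleartext) in table
leaked = cleartext_is_stored(table, cleartext)
```

The smoke test checks both halves: the hash of the returned token resolves to a row, and the returned token itself is nowhere in storage.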

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* test(proxy_behavior): seed 8-actor read-world for the authz matrix

Slice 4 of the management-endpoints behavior-pinning effort. New
``actors.py`` defines the actor enum + seeds an immutable world (2 orgs,
2 teams, 8 users, 8 verification tokens) under the ``behavior-pin-``
prefix so the rows are identifiable in psql and ``_wipe_world`` is
targeted.

Each actor key is created with its cleartext form generated locally and
its hashed form (via ``litellm.proxy.utils.hash_token``) stored in
``LiteLLM_VerificationToken`` — so the real ``user_api_key_auth`` accepts
the cleartext bearer token. Roles, ``team_id``, ``organization_id``, and
the service-account metadata flag are all set on the seeded rows so the
auth layer resolves the same scopes a real proxy would.

The session-scoped ``world`` fixture re-seeds at session start (idempotent
via wipe-then-create), and the smoke test confirms each of the 8 actor
keys can call ``/key/info`` on itself and receive its own row back.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* test(proxy_behavior): per-test scratch namespace + targeted delete_many teardown

Slice 5 of the management-endpoints behavior-pinning effort. Adds the
``scratch`` function-scoped fixture: each test gets a uuid4-derived
namespace prefix, tags writes with it (``key_alias``, ``team_alias``,
``user_id``, ``budget_id``), and the fixture teardown ``delete_many``-s
any row whose namespace column starts with that prefix.

Cleanup uses Prisma model methods only (no raw SQL, per CLAUDE.md) and
orders deletes children-before-parents to avoid FK conflicts. The Slice 3
de-risk smoke is migrated onto the same fixture so it stops accumulating
untagged tokens across repeated local runs.

Smoke proves both halves of the contract: one test writes a scratch-tagged
key and asserts it lands; a second test runs after the first's teardown
and asserts no rows in the scratch namespace survived.
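A rough sketch of the scratch-namespace contract, using in-memory lists in place of Prisma models. The table names, the ``DELETE_ORDER`` ordering, and the ``scratch-`` prefix format are assumptions made for illustration; only the idea (uuid4-derived prefix, prefix-matched deletes, children before parents) comes from the description above.

```python
import uuid
from contextlib import contextmanager

# Hypothetical in-memory tables: name -> (namespace column, rows).
DB = {
    "verification_token": ("key_alias", []),
    "team": ("team_alias", []),
    "user": ("user_id", []),
    "budget": ("budget_id", []),
}
# Assumed ordering: rows that reference others are deleted first.
DELETE_ORDER = ["verification_token", "team", "user", "budget"]


def delete_many_startswith(table: str, prefix: str) -> int:
    """Delete every row whose namespace column starts with the prefix."""
    column, rows = DB[table]
    keep = [r for r in rows if not r[column].startswith(prefix)]
    deleted = len(rows) - len(keep)
    rows[:] = keep
    return deleted


@contextmanager
def scratch():
    """Yield a per-test prefix and clean up everything tagged with it."""
    prefix = f"scratch-{uuid.uuid4().hex[:12]}-"
    try:
        yield prefix
    finally:
        for table in DELETE_ORDER:
            delete_many_startswith(table, prefix)


# A read-world row that must survive every scratch teardown.
DB["verification_token"][1].append({"key_alias": "behavior-pin-owner"})

with scratch() as ns:
    DB["verification_token"][1].append({"key_alias": ns + "k1"})
    DB["user"][1].append({"user_id": ns + "u1"})
    rows_during = len(DB["verification_token"][1])

survivors = [r for _, rows in DB.values() for r in rows]
```

Tagging by prefix keeps teardown targeted: the seeded ``behavior-pin-`` rows are never matched, so write tests cannot erode the read-world.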

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* test(proxy_behavior): codify G3 (strict-import grep) as a pytest item

Slice 6 of the management-endpoints behavior-pinning effort. Two new tests
walk every .py file under tests/proxy_behavior/ and assert:

  * no ``from litellm.proxy.management_endpoints`` import — the suite is
    deliberately constrained to the HTTP boundary so it survives handler
    refactors;
  * no ``mock``/``patch`` on ``user_api_key_auth`` — mocking auth is the
    structural failure mode of the existing 11k-line mock suite, and the
    point of this harness is that the real auth layer runs.

Codifying G3 as a CI test removes the "did someone forget to check the
PR-description checklist" failure mode.
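The two checks could look roughly like the sketch below. The regular expressions are assumptions about what counts as a violation (for instance, the import pattern here also rejects a plain ``import litellm.proxy.management_endpoints``), and ``find_violations`` is a hypothetical helper rather than the test that was committed.

```python
import re
from pathlib import Path

# Assumption: any import statement naming the handler package is a violation.
FORBIDDEN_IMPORT = re.compile(
    r"^\s*(from|import)\s+litellm\.proxy\.management_endpoints\b", re.M
)
# Assumption: mock/patch and user_api_key_auth on the same line is a violation.
FORBIDDEN_AUTH_MOCK = re.compile(r"(mock|patch)[^\n]*user_api_key_auth", re.I)


def find_violations(root: Path, skip: frozenset = frozenset()) -> list:
    """Walk every .py file under root and report (filename, reason) pairs."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        if path.name in skip:
            continue
        text = path.read_text(encoding="utf-8")
        if FORBIDDEN_IMPORT.search(text):
            hits.append((path.name, "handler import"))
        if FORBIDDEN_AUTH_MOCK.search(text):
            hits.append((path.name, "auth mock"))
    return hits
```

In a pytest item this would be called on the suite directory and asserted to return an empty list. The file that defines the patterns contains them literally, so it would need to be passed in ``skip``.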

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* style(proxy_behavior): apply black to G3 grep test

Follow-up to 6f588c7 — line-length fixes only, no behavior change.

* test(proxy_behavior): pin /key/generate authz matrix (18 scenarios)

Slice 7 of the management-endpoints behavior-pinning effort. Parametrized
matrix across two axes: actor (8 seeded) × target scope (self, team_alpha
in org_a, team_beta in org_b). 18 scenarios after dropping non-applicable
combos. Whole-suite wall-time stays at ~4.7s (well under the 10-min G2
budget for the eventual CI job).

While pinning, the test surfaced one seed gap: ``_get_user_in_team`` reads
``members_with_roles`` (a JSON list of ``{user_id, role}``), not the plain
``members`` String[]. Both columns are now populated in the seed to match
what the real ``/team/new`` handler would produce.

Expected status codes are intentionally heterogeneous (200, 400, 401)
because the current handler emits different statuses depending on which
check fails first (role gate, team-member-perm gate, "not assigned"
check). Pinning the *observed* codes — not what they "should" be — is
exactly the regression signal we want.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* test(proxy_behavior): pin /key/info authz matrix (24 scenarios)

Slice 8 of the management-endpoints behavior-pinning effort. 8 actors ×
3 target keys (own, OWNER's key in org_a, CROSS_ORG_USER's key in org_b)
covering self-read, same-team-peer read, and cross-org read.

Notable pinned behaviors (intentionally surfaced for review, not "fixed"):

  * ORG_ADMIN gets 403 on individual key info even within their own org
    — visibility is scoped to "your own keys" + "your team's keys", not
    "your org's keys".
  * Same-team peers (INTERNAL_USER, UNRELATED_SAME_ORG, SERVICE_ACCOUNT)
    DO see each other's keys. Whether that is desired is for the team
    to decide; this PR only pins the existing behavior so unintentional
    changes flip the matrix red.

Wall-time is unchanged (~4.3s for the slice on its own).

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* test(proxy_behavior): pin /key/list default-visibility matrix (8 scenarios)

Slice 9 of the management-endpoints behavior-pinning effort. For /key/list
the response IS the matrix: each of the 8 seeded actors calls the endpoint
with default filters and the test asserts set-equality between the returned
visible-token set (filtered to seeded tokens only, so unrelated rows can't
flap the assertion) and a pinned expected actor-set.
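The "filtered to seeded tokens only" step can be sketched like this. The actor names and token strings are made up for the example, and ``visible_seeded_actors`` is a hypothetical helper, not the suite's own.

```python
def visible_seeded_actors(returned_tokens, seeded):
    """Map returned token hashes back to actor names, ignoring unseeded rows."""
    by_token = {token: actor for actor, token in seeded.items()}
    return {by_token[t] for t in returned_tokens if t in by_token}


# Hypothetical seeded world: actor name -> hashed token.
seeded = {"proxy_admin": "h-admin", "internal_user": "h-user", "team_admin": "h-team"}

# A response that also carries a row left behind by some unrelated run.
own_key_only = visible_seeded_actors(["h-user", "h-leftover"], seeded)

# A response that returns every seeded token plus unrelated rows.
admin_view = visible_seeded_actors(["h-leftover", "h-team", "h-admin", "h-user"], seeded)
```

Comparing sets rather than lists makes the assertion independent of response ordering, and dropping unseeded tokens first keeps it stable on a database that already holds other keys.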

Pinned default visibility:

  * PROXY_ADMIN sees all 8 actors' keys.
  * Every other actor sees only their own key — including ORG_ADMIN
    (which had broader expectations going in but currently behaves
    same-as-internal-user for /key/list defaults) and TEAM_ADMIN (no
    team-aggregation without include_team_keys=true).

Future changes that broaden or narrow any single actor's default
visibility will turn this matrix red — exactly the regression signal we
want. Parameter-driven views (include_team_keys, filters) are deferred to
Slice 13 / PR2 follow-up.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* test(proxy_behavior): pin /key/update authz matrix + mutation re-read (21 scenarios)

Slice 10 of the management-endpoints behavior-pinning effort. 8 actors ×
3 target shapes (self-owned, OWNER-scoped in org_a/team_alpha,
CROSS_ORG_USER-scoped in org_b/team_beta) give 24 combinations, of which
21 scenarios are applicable.

Each test:
  1. Master-key-seeds a fresh scratch key with the target's (user_id,
     team_id) scope (so the read-world stays untouched).
  2. Has the actor under test POST /key/update flipping ``models`` to
     a known marker list.
  3. Asserts the status code AND the DB row's ``models`` field — present
     when 200, unchanged otherwise — so a handler that silently mutates
     on a denied response surfaces red.
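Step 3 above, the status-plus-re-read assertion, can be sketched as follows. ``MARKER``, ``check_update_outcome`` and the deliberately buggy ``handler_that_leaks`` are hypothetical names; the point is only that checking the row as well as the status code exposes a handler that writes before it denies.

```python
MARKER = ["behavior-pin-marker-model"]  # hypothetical marker list


def check_update_outcome(status, models_before, models_after, expected_status):
    """Raise AssertionError unless both the status and the stored row match."""
    if status != expected_status:
        raise AssertionError(f"status {status} != pinned {expected_status}")
    if status == 200 and models_after != MARKER:
        raise AssertionError("200 returned but the row was not updated")
    if status != 200 and models_after != models_before:
        raise AssertionError("denied response but the row was mutated")


def handler_that_leaks(row, allowed):
    """Toy handler with the bug: it writes first and checks permission after."""
    row["models"] = list(MARKER)
    return 200 if allowed else 403


row = {"models": []}
before = list(row["models"])
status = handler_that_leaks(row, allowed=False)
try:
    check_update_outcome(status, before, row["models"], expected_status=403)
    caught_silent_mutation = False
except AssertionError:
    caught_silent_mutation = True
```

A status-only assertion would have passed here, since the toy handler does return 403.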

Observed gating (pinned, not endorsed):

  * PROXY_ADMIN bypasses every check.
  * ORG_ADMIN is blocked by an early role gate, always 401.
  * Every other (INTERNAL_USER-role) actor hits one of three failure
    modes — 403 "user can only create keys for themselves", 403
    "only proxy admins, team admins, or org admins", or 401
    "team_member_permission_error" — depending on whether they own the
    target and whether they're a team admin / member of its team.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* test(proxy_behavior): pin /key/regenerate authz matrix + rotation contract (22 scenarios)

Slice 11 of the management-endpoints behavior-pinning effort. 21 matrix
scenarios (8 actors × 3 target shapes, minus the cross_org/owner combo
that exists in the seed but isn't applicable) plus one smoke for the
``/key/{key:path}/regenerate`` route registration.

On 200 outcomes the test verifies the full rotation contract:
  * the regenerate response key differs from the old cleartext,
  * the OLD cleartext returns 401 on a follow-up ``/key/info``,
  * the NEW cleartext returns 200 on a follow-up ``/key/info``.

On denied outcomes the test verifies the OLD cleartext still works —
catching any handler that mutates the token row on a failed call.

Pinned authz divergence vs /key/update: regenerate routes most denials
through the team-member-perm 401 path rather than the role-gate 403
path. The matrices for both endpoints are now in tree side-by-side, so
any future refactor that "harmonises" the codes will turn one of the two
red.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* test(proxy_behavior): pin /key/delete authz matrix + post-delete contract (21 scenarios)

Slice 12 of the management-endpoints behavior-pinning effort. Mirrors
slices 10/11. On success: cleartext can no longer authenticate
(handles both hard-delete and soft-delete to LiteLLM_DeletedVerificationToken).
On denial: row survives and cleartext still authenticates.

Notable behavior gap with /key/info: same-team peers (internal_user,
unrelated_same_org, etc.) get 403 on /key/delete for OWNER's key, i.e.
they cannot delete each other's keys, whereas they CAN read each other's
keys (Slice 8). Delete is stricter than read. Pinned as-is.

Cumulative whole-suite wall-time is 5.9s for all 128 tests on the local
runner — well under the 10-min G2 budget for the CI job in Slice 13.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* ci(proxy-mgmt-behavior): add PR-triggered workflow for the behavior suite

Slice 13 of the management-endpoints behavior-pinning effort. New
workflow ``test-unit-proxy-mgmt-behavior.yml`` fires ``on: pull_request``
for the same branch set every other proxy unit-test workflow watches
(main, litellm_internal_staging, litellm_oss_branch, litellm_**).

It delegates to the existing reusable ``_test-unit-services-base.yml``
with ``enable-postgres: true``, which already provisions a postgres:14
service container and runs ``prisma db push`` against it before pytest
collects. ``reruns: 0`` because a behavior-pinning matrix that needs
reruns is itself a regression — flakes are signal.

``timeout-minutes: 15`` gives generous headroom over the local 5.9s
whole-suite wall-time; the binding G2 budget is 10 min.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* docs(proxy_behavior): G4 regression-replay table for Key Tier-1

Slice 14 of the management-endpoints behavior-pinning effort. Documents
the regression-replay verification methodology + a 12-row table mapping
recent fix-PRs touching key_management_endpoints.py to the catching
scenarios in the PR1 matrix.

One canonical RED→GREEN cycle is captured verbatim — c7c3df2
"extend /key/update admin check to non-budget fields". Under the
parent-of-fix code, 6 scenarios in test_key_update.py flip from 200 to
403; under HEAD code, all 21 pass. The handler swap is the only change
between the two runs, confirming the matrix catches the behavior shift
the fix introduced.

The table also calls out 4 genuine coverage gaps deferred to PR2/PR3:
404-on-missing-key, budget-limit counter assertions, /key/regenerate
upperbound enforcement, and /key/list filter-param views.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* chore(mutmut): include the behavior suite in tests_dir + G5 triage stub

Slice 15 of the management-endpoints behavior-pinning effort. Appends
``tests/proxy_behavior/management/`` to ``[tool.mutmut].tests_dir`` so
the existing mutation-test workflow runs against both the legacy mock
suite AND the new behavior suite — the latter is where the regression
signal will actually surface.

Adds a stub at ``tests/proxy_behavior/management/mutmut_triage/pr1.md``
documenting the G5 triage protocol (zero unreviewed survivors in the 6
Tier-1 handler functions) and a placeholder baseline-metrics table to
fill in after the first manually-triggered mutmut run completes — runs
take hours and run on a manual cadence, so PR1 ships with the wiring +
protocol, not the numbers. The actual baseline is recorded in a
follow-up once ``gh workflow run mutation-test.yml`` finishes.

The kill rate stays telemetry-only, never a gate. G5 (per-survivor
classification) is the binding mutation gate.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* docs(proxy_behavior): suite README with local-repro + conventions + gates

Slice 16 of the management-endpoints behavior-pinning effort. The README
documents:

  * The same three commands the CI workflow runs locally (BYO-DATABASE_URL,
    no new tooling).
  * Suite layout — what each test file covers, which slice it lands.
  * The asyncio loop_scope convention required for session fixtures
    (httpx AsyncClient + prisma connection) to share a loop with each
    test body.
  * G3 strict-import convention + the test that enforces it.
  * Read-world vs scratch-world fixture conventions.
  * Behavior-pinning philosophy: pin observed codes; flag, don't judge.
  * Where each G1–G5 + PR1.M1–M3 gate's evidence lives.

Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d

* ci(proxy-mgmt-behavior): drop xdist (workers=0) to fix seed race

First run on PR BerriAI#28321 failed with UniqueViolation on
``behavior-pin-budget`` plus cascading missing-membership FK errors. Both
xdist workers entered ``seed_world()`` concurrently against the shared
Postgres service container; whichever lost the race left the world in a
half-seeded state and downstream tests ran against missing
team_membership rows.

Whole-suite wall-time is ~7s sequentially, so disabling xdist here costs
nothing — and the seed itself is the wrong place to add per-worker
isolation (the world is intentionally shared so set-equality assertions
in /key/list have a deterministic expected set).

* ci(proxy-mgmt-behavior): seed scratch keys via proxy_admin actor, not master

Second CI run failed: ``/key/generate`` with explicit ``user_id`` returned
403 "User can only create keys for themselves. Got user_id=X, Your ID=None"
in every test that called ``_create_scratch_key`` with a per-actor user_id.
The bare master key's auth path was producing ``user_id=None`` in the
fresh CI Postgres, which doesn't trigger the PROXY_ADMIN bypass in
``_user_can_only_create_keys_for_themselves`` reliably. Locally the same
master key path worked, masking the issue.

Fix: every ``_create_scratch_key`` helper now takes a seeder cleartext
and the test bodies pass ``world.keys[Actor.PROXY_ADMIN].cleartext``.
That actor was seeded with ``user_role=PROXY_ADMIN`` AND a concrete
``user_id``, so the bypass fires deterministically in both environments.

No behavior shift in the matrices themselves — all 128 scenarios still
pass locally; only the setup helper's auth identity changed.

The bare-master smoke (test_smoke + test_scratch_teardown) is intentionally
left on the master key path: those tests don't pass ``user_id`` in the
body so they don't hit the user_id-mismatch gate.

* ci(proxy-mgmt-behavior): diag — run world-seed test first + bump max-failures

Third CI run failed identically: seeded PROXY_ADMIN actor's auth resolves
to ``user_id=None`` even though the DB row has the right ``user_id``. The
suite was aborting at maxfail=10 inside test_key_delete, so test_world_seed
(which would tell us whether the seed itself is reachable) never ran in CI.

Two diagnostic moves on this push, no behavior change:

  * Rename ``test_world_seed.py`` → ``test_aaa_world_seed.py`` so it's
    the first collected file. If it passes in CI we know the seed is
    fine and the bug lives downstream; if it fails the same way the
    bug is in the auth resolution path.
  * Bump ``max-failures`` to 200 for this workflow so we see the full
    failure surface instead of stopping at the first cascading setup
    error. Will tighten back down once the suite is green.

Adds one new test ``test_proxy_admin_actor_can_create_keys_for_others``
that explicitly exercises the PROXY_ADMIN bypass via /key/generate with
an explicit user_id — the same shape the matrix setup helper uses but
without the matrix machinery muddying the diagnostic.

* ci(proxy-mgmt-behavior): await LiteLLM_VerificationTokenView creation in fixture

Fourth CI run still failed because the proxy's lifespan kicks off
``prisma_client.check_view_exists()`` as a fire-and-forget background
task — that task is what creates ``LiteLLM_VerificationTokenView``, the
SQL view ``user_api_key_auth`` queries to resolve a token to its
user_id / user_role / team.

On a fresh Postgres (CI), the first test races the background task. The
view doesn't exist when the first auth call runs, the resolver falls
through to a degraded path that returns ``user_id=None``, and every
matrix test that depends on the seeded actor's identity then fails
confusingly with "Got user_id=X, Your ID=None" 403s. Locally the view
persists across pytest runs so the race is invisible.

Fix: await ``prisma_client.check_view_exists()`` explicitly inside the
session ``proxy_app`` fixture, after the lifespan enters but before the
fixture yields. Deterministic regardless of whether the underlying DB is
fresh (CI) or warm (local).
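The race and the fix can be reproduced in miniature with plain ``asyncio``. ``FakeDB`` and its methods are hypothetical stand-ins; only the shape (a background task scheduled at startup, a first request that may run before it, an explicit await in the fixture) mirrors the description above.

```python
import asyncio


class FakeDB:
    """Stand-in for the prisma client; the 'view' is just a flag."""

    def __init__(self):
        self.view_exists = False

    async def check_view_exists(self):
        await asyncio.sleep(0.01)  # pretend creating the view takes a moment
        self.view_exists = True

    def resolve_user_id(self, token):
        # Degraded path: without the view the identity comes back empty.
        return "seeded-admin" if self.view_exists else None


async def lifespan_fire_and_forget(db):
    # Startup schedules the task and returns without waiting for it.
    return asyncio.create_task(db.check_view_exists())


async def first_request(await_view: bool):
    db = FakeDB()
    task = await lifespan_fire_and_forget(db)
    if await_view:
        await db.check_view_exists()  # the fixture's explicit await
    user_id = db.resolve_user_id("sk-seeded")
    await task  # drain the background task so nothing is left pending
    return user_id


racy = asyncio.run(first_request(await_view=False))
fixed = asyncio.run(first_request(await_view=True))
```

On a warm database the flag would already be set before startup, which is why the racy path never showed up locally.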

* ci(proxy-mgmt-behavior): widen diagnostic to dump token / user / view shape

The fifth CI run isolated the failure to ``/key/generate`` with explicit
user_id while ``/key/info`` works for the same seeded PROXY_ADMIN actor.
The auth context's user_id is None even though the DB row has it set.

This commit widens the diagnostic test: on failure, dump the raw token
row's user_id, the user row's user_role, and what
``LiteLLM_VerificationTokenView`` actually returns for the seeded token.
If the view returns user_id=None we know the view shape is the problem;
if the view returns the right user_id we know it's a downstream code
path stripping it.

* ci(proxy-mgmt-behavior): unambiguous diagnostic view query

Previous diagnostic's raw SQL had an ambiguous user_id column from
joining the view with the user table, so the diagnostic itself crashed
before printing useful state. Simplified to query just the view's columns.

* ci(proxy-mgmt-behavior): add auth-resolver chain diagnostic

Six runs in, the underlying data (token row, user row, view row) has all been
verified correct in CI, but auth still returns user_id=None. This
diagnostic calls the resolver primitives directly:

  1. ``prisma.get_data(table_name="combined_view")`` → raw view object
  2. ``get_key_object(...)`` → cached/DB UserAPIKeyAuth
  3. ``get_user_object(...)`` → LiteLLM_UserTable row
  4. ``_is_user_proxy_admin`` / ``_get_user_role``

and prints each intermediate via captured stdout (-s). Whichever step
returns None/False in CI is where the chain breaks. Imports come from
``litellm.proxy.auth`` (not management_endpoints), so G3 still passes.

* ci(proxy-mgmt-behavior): set LITELLM_MASTER_KEY env so lifespan doesn't wipe it

Real root cause of every CI run that returned ``Your ID=None`` for the
seeded actors:

  * In ``initialize()``, ``master_key`` is set from the config YAML's
    ``general_settings.master_key`` (load_config code path at
    proxy_server.py:4174).
  * Then the FastAPI lifespan (``proxy_startup_event``) runs and at line
    776 does ``master_key = get_secret_str("LITELLM_MASTER_KEY")``,
    which UNCONDITIONALLY overwrites the global.
  * In CI the env var is unset, so the post-lifespan ``master_key`` is
    None.

Downstream every auth path degrades: master-key requests don't bypass
because ``secrets.compare_digest(api_key, None)`` raises and is caught
to ``is_master_key_valid=False``; seeded-actor requests cache a
``UserAPIKeyAuth`` whose ``user_role`` never resolves through the
PROXY_ADMIN bypass; ``_is_allowed_to_make_key_request`` then hits the
``user_id`` mismatch path with ``Your ID=None``.

Locally my shell happened to have ``LITELLM_MASTER_KEY`` set from a prior
session, which is why every local run was green while CI was red: a direct
instance of the "don't generalize from your environment to CI" lesson.

Fix: ``os.environ.setdefault("LITELLM_MASTER_KEY", MASTER_KEY)`` and
``os.environ.setdefault("CONFIG_FILE_PATH", config_path)`` before
entering the lifespan, so its re-read produces the same value as
``initialize()``.
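The overwrite and the ``setdefault`` fix can be sketched as below. A plain dict stands in for ``os.environ``, ``env.get`` stands in for ``get_secret_str``, and ``boot`` / ``is_master_key_valid`` are hypothetical condensations of the two startup steps and the master-key comparison; the one hard fact relied on is that ``secrets.compare_digest`` raises ``TypeError`` when given ``None``.

```python
import secrets

MASTER_KEY = "sk-1234"  # hypothetical test master key


def boot(env: dict):
    """Condensed startup: config sets the key, then the lifespan re-reads env."""
    master_key = MASTER_KEY  # initialize(): taken from the config YAML
    master_key = env.get("LITELLM_MASTER_KEY")  # lifespan: unconditional overwrite
    return master_key


def is_master_key_valid(api_key: str, master_key) -> bool:
    try:
        return secrets.compare_digest(api_key, master_key)
    except TypeError:
        # compare_digest rejects None, so an unset key degrades to "invalid".
        return False


ci_env = {}  # stand-in for os.environ in CI, where the variable is unset
broken_key = boot(ci_env)
broken_auth = is_master_key_valid(MASTER_KEY, broken_key)

ci_env.setdefault("LITELLM_MASTER_KEY", MASTER_KEY)  # the fix, before the lifespan
fixed_key = boot(ci_env)
fixed_auth = is_master_key_valid(MASTER_KEY, fixed_key)
```

``setdefault`` only fills the variable when it is missing, so a developer shell that already exports a value keeps it.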

Whole-suite still green locally (130 tests, ~6.4s).

* ci(proxy-mgmt-behavior): force premium_user=True so /key/regenerate isn't gated

Ninth CI run cleared every ``Your ID=None`` failure (the master_key env
fix worked end-to-end) and exposed the next thin layer of failures:
``/key/regenerate`` returns 500 "Regenerating Virtual Keys is an
Enterprise feature" in CI because the proxy can't see a
``LITELLM_LICENSE``. Locally my license is set, so the matrix passes.

The behavior matrix is supposed to pin authz, not licensing — so flip
``proxy_server.premium_user = True`` directly, both before and after the
lifespan (the lifespan re-runs ``_license_check.is_premium()`` and would
otherwise reset it). With premium gating disabled, the regenerate matrix
exercises the same authz path /key/update does.

Whole-suite still green locally (130 tests, ~6.3s).

* test(proxy_behavior): trim debug diagnostics, restore default max-failures

Follow-up to the CI-bring-up sequence: now that the suite is green in CI
(130 → 129 tests after this trim; 156s wall-time on ubuntu-latest), drop
the diagnostic noise left over from debugging the master_key wipe:

  * Rename ``test_aaa_world_seed.py`` back to ``test_world_seed.py`` —
    no longer needs to run first.
  * Remove ``test_auth_resolver_returns_correct_user_id_and_role`` —
    that test reached into private auth helpers to localize the bug
    between the DB and ``UserAPIKeyAuth``; it has served its purpose
    and isn't HTTP-boundary.
  * Keep ``test_proxy_admin_actor_can_create_keys_for_others`` (without
    the failure-time dump) — it's a real authz contract that pins the
    PROXY_ADMIN bypass on /key/generate, and would catch a regression
    of the same conftest interaction this sequence revealed.
  * Drop the workflow's ``max-failures: 200`` override — that was a
    debug aid for seeing the full failure surface in CI. Default of 10
    is right for a stable suite.

* chore(proxy_behavior): drop empty mutmut triage stub, fold protocol into README

The mutmut_triage/pr1.md file was a placeholder for numbers and
classifications that don't exist yet — the first mutmut run is a manual
follow-up. Empty stubs aren't evidence; deleting it.

The G5 protocol (run the workflow, triage survivors in the six Tier-1
handler functions, kill-or-accept-with-reason, zero unreviewed) moves
into the suite README's "Gate evidence" block. The real triage file
will land alongside the first mutmut follow-up.

pyproject.toml's [tool.mutmut].tests_dir entry stays — that's the
one-line wiring that makes the existing (manual-trigger) mutation-test
workflow include our suite next time someone runs it. Comment updated
to drop the dead file reference.

* chore(proxy_behavior): drop README + trim comments

Removes the suite README — its contents (local repro, layout, conventions)
were either restated by the file structure or already covered by the
workflow YAML and pyproject.toml. Trims docstrings and inline comments
across every test file to keep only non-obvious WHY (the masking
``_get_user_in_team`` reads, the LiteLLM_VerificationTokenView models-can't-
be-NULL gotcha, the org_admin/peer-visibility surprise, the rotation
contract).

Suite still 129 green locally.

* test(proxy_behavior): address Greptile review — env force, pagination, dedup

- conftest: force LITELLM_MASTER_KEY / CONFIG_FILE_PATH unconditionally
  instead of setdefault. An ambient LITELLM_MASTER_KEY with a different
  value would make the proxy authenticate on that key while the tests
  still send MASTER_KEY → silent 401s.
- test_key_list: paginate /key/list instead of a single size=100 request.
  size is capped at 100 by the endpoint, so on a non-fresh DB a single
  page could truncate PROXY_ADMIN's view and a seeded key could fall off
  the page. Walk total_pages.
- conftest: hoist the duplicated _create_scratch_key helper (copy-pasted
  and already diverged across test_key_{update,regenerate,delete}.py)
  into a single shared create_scratch_key.
- Delete regression_replay/README.md — G4 regression-replay evidence
  belongs in the PR description, not a committed doc file (repo docs
  policy + the effort's own plan both say so). Content moved to the PR.
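The pagination bullet above can be sketched as a page walk. The ``keys`` and ``total_pages`` response fields and the ``page`` / ``size`` parameters are assumptions about the response shape, and ``make_fake_endpoint`` is a hypothetical stand-in for /key/list with its size cap of 100.

```python
def list_all_keys(fetch_page, size=100):
    """Collect every key by walking pages until total_pages is reached."""
    keys, page = [], 1
    while True:
        body = fetch_page(page=page, size=size)
        keys.extend(body["keys"])
        if page >= body["total_pages"]:
            return keys
        page += 1


def make_fake_endpoint(all_keys, cap=100):
    """Stand-in endpoint that caps the page size, as the bullet describes."""

    def fetch_page(page, size):
        size = min(size, cap)
        start = (page - 1) * size
        total_pages = max(1, -(-len(all_keys) // size))  # ceiling division
        return {"keys": all_keys[start:start + size], "total_pages": total_pages}

    return fetch_page


all_keys = [f"key-{i}" for i in range(250)]
fetch = make_fake_endpoint(all_keys)
single_page = fetch(page=1, size=100)["keys"]
everything = list_all_keys(fetch)
```

With 250 rows behind the endpoint, a single request returns only the first 100, so a seeded key past that point would be missed without the walk.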