[LIT-2878] Epic B — VM provisioning (EC2 + AMI + provider abstraction)#27335
ishaan-berri wants to merge 32 commits into
…ema) Per-team BYOC AWS config consumed by the agent-session EC2 VM provider. `aws_creds_enc` is a JSON blob with each field individually encrypted via `encrypt_value_helper` so a partial DB leak doesn't expose the secret. Owned by Epic G's Settings UI (LIT-2891); consumed by Epic B's EC2 provider. LIT-2878
…hema) Mirrors the root-level schema.prisma change so the bundled proxy schema stays in sync. LIT-2878
…tras schema) Mirrors the schema in the published `litellm-proxy-extras` package so the bundled migrations match what Prisma actually applies on proxy startup. LIT-2878
Picked up by `prisma migrate deploy` on proxy startup (the `litellm-proxy-extras` package bundles its own migration directory). LIT-2878
…tatus The pluggable VM-provider ABC for agent sessions. Per-session VMs are provisioned via this interface; v1 implementation is EC2 (BYOC AWS). Key types: `ProvisionContext` carries the team's AWS creds + EC2 overrides through to the provider. `AwsCreds.__repr__` redacts secrets so we cannot accidentally print them. `InvalidCredentialsError` (400) and `ProvisionError` (500) are the user-facing error types. LIT-2878
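A minimal sketch of the redacting `__repr__` this commit describes. The field names (`access_key_id`, `secret_access_key`, `region`) and the masking format are assumptions for illustration; the real class lives in `base.py` and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class AwsCreds:
    """Hypothetical shape of the credential container."""

    access_key_id: str
    secret_access_key: str
    region: str = "us-west-2"

    def __repr__(self) -> str:
        # Keep only the last four characters of the key id; never show the secret.
        tail = self.access_key_id[-4:] if len(self.access_key_id) > 4 else ""
        return (
            f"AwsCreds(access_key_id='***{tail}', "
            f"secret_access_key='***', region={self.region!r})"
        )


creds = AwsCreds("AKIAEXAMPLEKEY1234", "not-a-real-secret")
# str() falls back to __repr__, so f-strings and log calls are covered too.
rendered = f"{creds!r} / {creds}"
```

Passing `repr=False` keeps the dataclass machinery from generating its own `__repr__`, so the hand-written one is the only rendering path.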
In-memory `AgentVMProvider` used by the unit tests and as the default when `agent_settings.vm_provider` is unset. The factory returns `NoopProvider` when no AWS-backed provider is configured so the proxy boots cleanly without AWS credentials. LIT-2878
Reads `agent_settings.vm_provider` and `agent_settings.<provider>` from the loaded proxy config and builds the matching provider. Defaults to `noop`. Unknown values raise `ValueError` listing the supported providers so config typos surface fast. Validation #1 (test_factory) covers this path. Validation #6 (provider swap is config-only) is also exercised here. LIT-2878
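The lookup described here could be sketched as a small registry. The `_PROVIDERS` dict, the stand-in classes, and the signature taking a plain dict are assumptions; the PR's `get_vm_provider` reads from the loaded proxy config and may be shaped differently.

```python
from typing import Any, Callable, Dict


class NoopProvider:
    """Stand-in for the in-memory provider."""


class EC2Provider:
    """Stand-in for the EC2 provider; receives its own config block."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config


# Hypothetical registry: provider name -> constructor taking that provider's config.
_PROVIDERS: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "noop": lambda cfg: NoopProvider(),
    "ec2": EC2Provider,
}


def get_vm_provider(agent_settings: Dict[str, Any]) -> Any:
    # An unset or empty vm_provider falls back to noop.
    name = agent_settings.get("vm_provider") or "noop"
    if name not in _PROVIDERS:
        supported = ", ".join(sorted(_PROVIDERS))
        raise ValueError(f"Unknown vm_provider {name!r}; supported: {supported}")
    return _PROVIDERS[name](agent_settings.get(name) or {})
```

With this shape, swapping providers only changes the `vm_provider` key and its sibling config block, which is the property validation #6 checks.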
…t helpers Reads the team's BYOC AWS creds from `LiteLLM_AgentVMConfig` (decrypts each field individually) and falls back to `LITELLM_AGENT_AWS_*` env vars for local dev. Raises `InvalidCredentialsError` if neither path yields creds (validation #11 fail-fast). Falls back gracefully when the table doesn't exist yet (Epic G hasn't shipped its migration), so this code can land before LIT-2891. LIT-2878
…back
One EC2 per session, launched in the team's AWS account using the team's
BYOC creds (decrypted at use, never logged). Spot first, on-demand fallback
when capacity unavailable. Per-session tags (litellm-session-id,
litellm-team-id, litellm-agent-id) for cleanup.
Safety:
- `set_stream_logger('botocore', WARNING)` so SigV4 payloads can't leak the
access key into proxy logs (regression-tested in #13)
- creds enter via ProvisionContext, never leave this module
- `_safe_aws_error` formats ClientError without echoing the request payload
- `InvalidClientTokenId` / `SignatureDoesNotMatch` → fail-fast InvalidCredentialsError
- terminate is idempotent on already-gone instances
Validations covered: #5 (spot fallback), #8 (terminate idempotent), #11
(invalid creds fail-fast), #13 (creds never logged).
LIT-2878
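The spot-first launch with fail-fast credential handling might look roughly like the following. The set of capacity error codes that trigger the on-demand retry is an assumption, as are the function and class names; only `run_instances`, `InstanceMarketOptions`, and the two credential error codes come from AWS and the commit message.

```python
from typing import Any, Dict

# Assumed capacity-related codes that justify retrying on-demand.
SPOT_FALLBACK_CODES = {
    "InsufficientInstanceCapacity",
    "SpotMaxPriceTooLow",
    "MaxSpotInstanceCountExceeded",
}
# Codes named in the commit message as fail-fast.
INVALID_CRED_CODES = {"InvalidClientTokenId", "SignatureDoesNotMatch"}


class InvalidCredentialsError(Exception):
    """Stand-in for the PR's 400-level error type."""


def _error_code(exc: Exception) -> str:
    # botocore's ClientError carries the parsed error at .response["Error"]["Code"].
    return getattr(exc, "response", {}).get("Error", {}).get("Code", "")


def launch_with_fallback(ec2_client: Any, params: Dict[str, Any]) -> str:
    """Try a spot launch first; retry on-demand only for capacity errors."""
    spot_params = dict(params, InstanceMarketOptions={"MarketType": "spot"})
    try:
        resp = ec2_client.run_instances(**spot_params)
    except Exception as exc:
        code = _error_code(exc)
        if code in INVALID_CRED_CODES:
            # Surface only the error code, not the original exception chain.
            raise InvalidCredentialsError(code) from None
        if code not in SPOT_FALLBACK_CODES:
            raise
        resp = ec2_client.run_instances(**params)
    return resp["Instances"][0]["InstanceId"]
```

Taking the client as a parameter keeps the sketch testable with a fake, which is also how the mocked tests in this PR are described.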
…sessions
Three sweepers run on the same 30s tick:
- bootstrap_timeout: sessions stuck in `provisioning` past the timeout
- heartbeat_timeout: `ready` sessions whose daemon stopped checking in
- max_session_minutes: sessions older than the configured ceiling
Each sweeper:
- bounds its batch to 100 rows so a backlog doesn't stall the loop
- re-fetches the row (optimistic lock) before terminating so multiple proxy replicas don't double-terminate
- treats terminate failures as non-fatal (retry next tick)
Uses Prisma model methods (`find_many` / `find_unique` / `update`); no raw SQL per project rules.
Validations covered: #7 (max_session_minutes), #9 (bootstrap_timeout), #10 (heartbeat_loss).
LIT-2878
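One tick of a sweeper following those three rules could be sketched as below. The `sessions` accessor, the `session_id` / `status` / `created_at` field names, and the `terminated` status value are assumptions; only the Prisma-style method names and the batch size of 100 come from the commit message.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Awaitable, Callable

BATCH_SIZE = 100


async def bootstrap_timeout_tick(
    sessions: Any,  # object exposing Prisma-style find_many / find_unique / update
    terminate: Callable[[Any], Awaitable[None]],
    timeout: timedelta,
) -> int:
    """Sweep one bounded batch; return how many sessions were terminated."""
    cutoff = datetime.now(timezone.utc) - timeout
    stale = await sessions.find_many(
        where={"status": "provisioning", "created_at": {"lt": cutoff}},
        take=BATCH_SIZE,
    )
    swept = 0
    for row in stale:
        # Re-fetch: another replica may have handled this row since the query.
        fresh = await sessions.find_unique(where={"session_id": row.session_id})
        if fresh is None or fresh.status != "provisioning":
            continue
        try:
            await terminate(fresh)
        except Exception:
            continue  # non-fatal: the row stays stale and is retried next tick
        await sessions.update(
            where={"session_id": fresh.session_id},
            data={"status": "terminated"},
        )
        swept += 1
    return swept
```

The re-fetch narrows the window for double termination but does not close it, which is why the provider's terminate also has to be idempotent.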
Documents the agent_settings YAML shape consumed by `get_vm_provider`. B0's AWS resource IDs are referenced via `default_ami_id: ami-CHANGEME`; the user fills in the real AMI after running `packer build`. LIT-2878
Builds an Ubuntu 24.04 AMI with node 24, python 3.13, git, gh, uv, bun, and the agent-runtime systemd unit autostarted on boot. The daemon honours `LITELLM_AGENT_MODE` from EC2 user-data: `session` for cold-boot, `warm` for warm-pool prewarming (B2). Uses IMDSv2 only. Tags every resource Packer creates for easy cleanup. `ami_users` lets us share the AMI cross-account without rebuilding. Validation #2 (`packer build`) covers this file. LIT-2878
Provisioner script driven by `litellm-agent-runtime.pkr.hcl`. Each tool is pinned to a specific version and verified against a SHA-256 sidecar where upstream provides one (uv) — see CLAUDE.md "CI Supply-Chain Safety". The bun installer has no checksum sidecar, so we pin a version and pull the artifact directly (not the install script). LIT-2878
Reads runtime config from /etc/litellm-agent/runtime.env (written by EC2 user-data, mode 600). Restart=on-failure with a 5s backoff so transient network blips during bootstrap don't permanently kill the session. LIT-2878
…l daemon
Behaviour:
- reads runtime config from systemd's EnvironmentFile
- session mode: bootstrap → heartbeat every 30s, exit cleanly on HTTP 410
- warm mode: idle (real warm-pool hydrate lands in B2)
- redacts JWT in log output
Zero non-system deps (only `requests`, installed by the AMI builder). Replaced wholesale by Epic C.
LIT-2878
Covers: prerequisites (Packer + AWS profile), `packer init` + `packer build` invocation, sharing the AMI cross-account via `ami_users`, the `LITELLM_AGENT_MODE` boot-mode contract, the Epic C migration path, and the leak-cleanup one-liner that targets `LitellmManagedBy=agent-vm-provider`. LIT-2878
…-leak Mocked tests using a fake boto3 client. Covers validations #5 (spot → on-demand fallback), #8 (terminate idempotent on already-gone), #11 (invalid creds raise InvalidCredentialsError without launching anything), and #13 (AWS keys never appear in log records, repr, or exception messages). LIT-2878
Validations #3 (real boot), #11 (real BYOC fail-fast), and the real-cloud piece of #8. Skipped by default; enable with `pytest -m slow` when the `LITELLM_AGENT_AWS_*` + `LITELLM_TEST_*` env vars are set. Each test wraps RunInstances in try/finally with TerminateInstances and installs a 60-min process watchdog (per the AWS safety boundary) so a hung test cannot leak an instance. LIT-2878
…emap python3 The cnf-update-db post-invoke hook fails (exit 100) when we install a non-default python3 alongside, because cnf-update-db imports apt_pkg which is bound to /usr/bin/python3 -> python3.12. Disabling the hook makes apt update + install idempotent for AMI builds. Also stop remapping /usr/bin/python3 via update-alternatives — it breaks Ubuntu's python-coupled apt tooling. Tools that need 3.13 invoke it explicitly; the systemd unit already uses /usr/bin/python3.13. LIT-2878
Avoids the PytestUnknownMarkWarning when collecting `tests/test_litellm/proxy/agent_session_endpoints/vm_providers/test_ec2_provider_real.py` without `-m slow`. LIT-2878
Defensive: if both `session_id` and `id` are missing, return early. The str-cast keeps mypy happy when the row uses a UUID-typed column. LIT-2878
`creds` was reassigned from `AwsCreds` to `Optional[AwsCreds]` in the fallback branch — rename the second binding so mypy can narrow the type. LIT-2878
Built by `packer build` against the BYOC PoC account (us-west-2). Customers running their own BYOC account should re-run the Packer build and replace this value. LIT-2878
EC2 `ModifyImageAttribute` rejects "Character sets beyond ASCII are not supported" when registering the AMI description. The whole AMI rolls back on this error. Replace em-dash with hyphen. LIT-2878
| set -e | ||
| mkdir -p /etc/litellm-agent | ||
| cat > /etc/litellm-agent/runtime.env <<EOF | ||
| LITELLM_SESSION_ID={ctx.session_id} |
Medium: Shell injection in user-data
session_id, team_id, agent_id, base_url, mode, and daemon_jwt are inserted directly into a boot-time shell script. A user who can create a session with a value containing a newline plus shell syntax can add commands to the EC2 user-data and run them as root when the instance starts; serialize these values safely, for example by base64-encoding a complete env-file payload or shell-quoting each assignment before interpolation.
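A minimal sketch of the base64 approach suggested above, mirroring how the same script already handles `repos.json` and `env.json`. The helper names and the choice to reject control characters are assumptions, not the PR's code.

```python
import base64
import shlex
from typing import Mapping


def render_env_file(values: Mapping[str, str]) -> str:
    """One KEY=value per line; reject characters that would break line framing."""
    lines = []
    for key, value in values.items():
        if "\n" in value or "\r" in value or "\x00" in value:
            raise ValueError(f"{key} contains a control character")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def build_user_data(values: Mapping[str, str]) -> str:
    # base64 output is limited to [A-Za-z0-9+/=], so nothing in it is shell-active.
    payload = base64.b64encode(render_env_file(values).encode()).decode()
    target = shlex.quote("/etc/litellm-agent/runtime.env")
    return "\n".join(
        [
            "#!/bin/bash",
            "set -e",
            "umask 077",
            "mkdir -p /etc/litellm-agent",
            f"echo {shlex.quote(payload)} | base64 -d > {target}",
            f"chmod 600 {target}",
            "",
        ]
    )
```

Because the shell only ever sees the encoded payload, a value such as `abc; $(reboot)` reaches the env file as literal text instead of being evaluated at boot.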
Medium: Shell injection in EC2 user-data
This PR adds the EC2 VM provider and supporting BYOC configuration. The user-data script still interpolates session/runtime fields directly into a shell heredoc, so a caller who can influence those values can add commands that run during instance boot.
Status: 1 open
Greptile Summary
This PR introduces the full EC2 VM-provisioning substrate for agent sessions: a pluggable
Confidence Score: 3/5
Not safe to merge until the real-cloud tests are moved out of the mock-only directory and the NodeSource supply-chain check is fixed.
The sweeper and provider logic is well-structured and the mocked test coverage is solid. However, the real-cloud test placement directly violates an enforced repository rule, the Node.js installer runs without any checksum verification on every AMI build (the sidecar URL does not exist), and the AgentVMProvider ABC omits the aws_creds parameter that EC2Provider unconditionally requires, leaving a silent failure trap for any future caller using the abstract type.
tests/test_litellm/proxy/agent_session_endpoints/vm_providers/test_ec2_provider_real.py (must be relocated), infra/ami/scripts/install-runtime.sh (NodeSource verification), litellm/proxy/agent_session_endpoints/vm_providers/base.py (ABC signature)
| Filename | Overview |
|---|---|
| tests/test_litellm/proxy/agent_session_endpoints/vm_providers/test_ec2_provider_real.py | Real-cloud AWS tests placed in the mock-only tests/test_litellm/ directory, violating the repo rule; must be moved outside this directory |
| infra/ami/scripts/install-runtime.sh | NodeSource SHA256 sidecar URL always returns 404, so the sha256sum verification is silently skipped every build and the unverified setup script is run with sudo |
| litellm/proxy/agent_session_endpoints/vm_providers/base.py | ABC defines terminate/status without aws_creds; EC2Provider extends these with an extra optional arg that is mandatory in practice, creating a contract mismatch |
| litellm/proxy/agent_session_endpoints/vm_providers/ec2.py | EC2 provider with spot/on-demand fallback, idempotent terminate, and boto3 log silencing; daemon JWT is written in plaintext to user-data (visible via IMDS and AWS console) |
| litellm-proxy-extras/litellm_proxy_extras/migrations/20260506220000_add_agent_vm_config/migration.sql | New LiteLLM_AgentVMConfig table; updated_at is NOT NULL without DEFAULT CURRENT_TIMESTAMP, relying solely on Prisma ORM for the default |
| litellm/proxy/agent_session_endpoints/sweepers.py | Three background sweepers with batched Prisma queries, optimistic re-fetch lock, and clean asyncio loop; logic is correct and idempotent |
| litellm/proxy/agent_session_endpoints/vm_providers/team_config.py | Per-team BYOC creds resolver with per-field encryption and env-var fallback; region is stored unencrypted (intentional, not sensitive) |
| infra/ami/litellm-agent-runtime.pkr.hcl | Packer config with IMDSv2-only, ami_users sharing, and proper tags; install script is invoked as a separate shell step (not an inline curl pipe) |
| litellm/proxy/agent_session_endpoints/vm_providers/factory.py | Clean provider registry pattern; defaults to noop, raises ValueError for unknown names |
| infra/ami/files/daemon-stub.py | Daemon stub with bootstrap + heartbeat loop, JWT redaction in logs, and clean signal handling; intended as a temporary placeholder for Epic C |
Reviews (1): Last reviewed commit: "fix: ASCII-only AMI description (AWS rej..."
| """ | ||
| Real-cloud tests for `EC2Provider`. | ||
| | ||
| Skipped by default. Enable with `pytest -m slow` and the BYOC env vars set: | ||
| | ||
| LITELLM_AGENT_AWS_ACCESS_KEY_ID | ||
| LITELLM_AGENT_AWS_SECRET_ACCESS_KEY | ||
| LITELLM_AGENT_AWS_REGION (default us-west-2) | ||
| LITELLM_TEST_SUBNET_ID | ||
| LITELLM_TEST_SECURITY_GROUP_ID | ||
| LITELLM_TEST_IAM_INSTANCE_PROFILE | ||
| LITELLM_TEST_AMI_ID | ||
| | ||
| These are the resources B0 captured for the BYOC PoC account | ||
| (see LIT-2888 deliverables comment). | ||
| | ||
| Each test is wrapped in a try/finally that calls TerminateInstances. A 60-min | ||
| process-wide watchdog is also installed so a hung test cannot leak an | ||
| instance overnight. | ||
| | ||
| Cost per run: ~$0.03 per session (one t3.large for under a minute). | ||
| """ | ||
| | ||
| from __future__ import annotations | ||
| | ||
| import os | ||
| import threading | ||
| import time | ||
| from typing import List, Optional | ||
Real-cloud tests placed in mock-only directory
The repository rule requires that tests/test_litellm/ contain only mock tests with no real network calls. test_ec2_provider_real.py calls RunInstances, TerminateInstances, and DescribeInstances against live AWS whenever the LITELLM_AGENT_AWS_* env vars are set. Marking the tests @pytest.mark.slow and gating them behind skipif(_have_creds()) still places real-cloud tests in the prohibited directory: anyone running pytest -m slow with those env vars set will execute them, which violates the rule CI is meant to enforce. These tests should live outside tests/test_litellm/ (e.g. tests/integration/ or similar).
Rule Used: What: prevent any tests from being added here that... (source)
| if [ -n "$NODESOURCE_SHA" ]; then | ||
| echo "$NODESOURCE_SHA $TMP_SETUP" | sha256sum -c - | ||
| fi | ||
| sudo -E bash "$TMP_SETUP" | ||
| sudo apt-get install -y nodejs | ||
| | ||
| # --- gh CLI --- | ||
| sudo mkdir -p -m 755 /etc/apt/keyrings | ||
| GH_KEY=/etc/apt/keyrings/githubcli-archive-keyring.gpg | ||
| sudo curl -fsSL "https://cli.github.com/packages/githubcli-archive-keyring.gpg" -o "$GH_KEY" | ||
| sudo chmod go+r "$GH_KEY" | ||
| echo "deb [arch=$(dpkg --print-architecture) signed-by=$GH_KEY] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list >/dev/null | ||
| sudo apt-get update -y | ||
| sudo apt-get install -y gh |
NodeSource SHA256 sidecar does not exist — verification always silently skipped
NodeSource does not publish a .sha256 file alongside setup_24.x (the URL https://deb.nodesource.com/setup_24.x.sha256 returns 404). Because of the || true, NODESOURCE_SHA is always an empty string, the if [ -n "$NODESOURCE_SHA" ] branch is never entered, and sudo -E bash "$TMP_SETUP" runs the unverified downloaded script unconditionally. The comment claims this satisfies the "no curl|sh" policy, but in practice the policy is never enforced for the Node.js installer. A compromised or tampered setup_24.x would be executed without detection. The fix is to either pin the Node.js binary/tarball directly (as is done for uv and bun) or use the NodeSource APT repository with GPG key verification instead of the setup script.
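Whichever fix is chosen, the check needs to fail closed when no digest is available. A sketch of that behaviour, with a hypothetical `verify_sha256` helper and the digest assumed to be pinned in the repository instead of fetched from a sidecar URL:

```python
import hashlib
import hmac
from pathlib import Path


def verify_sha256(path: Path, expected_hex: str) -> None:
    """Raise unless the file's SHA-256 matches a non-empty pinned digest."""
    if not expected_hex:
        # Fail closed: a missing digest must stop the build, not skip the check.
        raise ValueError(f"no pinned SHA-256 for {path.name}")
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if not hmac.compare_digest(actual, expected_hex.strip().lower()):
        raise ValueError(f"SHA-256 mismatch for {path.name}: got {actual}")
```

The shell equivalent is to drop the `if [ -n ... ]` guard so that an empty digest makes `sha256sum -c` fail under `set -e`.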
| @abstractmethod | ||
| async def terminate(self, vm: VMHandle) -> None: | ||
| """Terminate the VM. Must be idempotent (no-op if already terminated).""" | ||
| | ||
| @abstractmethod | ||
| async def status(self, vm: VMHandle) -> VMStatus: | ||
| """Return the current VM status.""" |
ABC interface omits aws_creds that EC2Provider requires in practice
The AgentVMProvider ABC declares terminate(self, vm: VMHandle) and status(self, vm: VMHandle), but EC2Provider always raises InvalidCredentialsError when aws_creds is None. Any caller that relies on the abstract type, including future code that plugs in after Epic A, will pass type checking and then fail with InvalidCredentialsError at runtime if it calls provider.terminate(handle) without the extra keyword. The sweeper works around this today with type: ignore[call-arg], but that suppresses the error rather than fixing the contract. The aws_creds parameter should be part of the abstract signature (or the BYOC credential flow should be moved out of the terminate/status call site entirely, e.g. embedded in VMHandle.metadata).
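The first option, putting `aws_creds` in the abstract signature, might look like this. The class bodies are placeholders and only `terminate` is shown; the real types in `base.py` carry more fields.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class VMHandle:
    instance_id: str


class AwsCreds:
    """Placeholder for the PR's credential type."""


class InvalidCredentialsError(Exception):
    pass


class AgentVMProvider(ABC):
    @abstractmethod
    async def terminate(
        self, vm: VMHandle, aws_creds: Optional[AwsCreds] = None
    ) -> None:
        """Terminate the VM. Must be idempotent (no-op if already terminated)."""


class NoopProvider(AgentVMProvider):
    async def terminate(
        self, vm: VMHandle, aws_creds: Optional[AwsCreds] = None
    ) -> None:
        return None  # nothing to clean up; credentials are ignored


class EC2Provider(AgentVMProvider):
    async def terminate(
        self, vm: VMHandle, aws_creds: Optional[AwsCreds] = None
    ) -> None:
        if aws_creds is None:
            raise InvalidCredentialsError("EC2 terminate requires team credentials")
        # A real implementation would call TerminateInstances here.
```

Callers typed against `AgentVMProvider` can then pass `aws_creds` without a `type: ignore`, and providers that do not need credentials simply ignore the argument.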
| return f"""#!/bin/bash | ||
| set -e | ||
| mkdir -p /etc/litellm-agent | ||
| cat > /etc/litellm-agent/runtime.env <<EOF | ||
| LITELLM_SESSION_ID={ctx.session_id} | ||
| LITELLM_TEAM_ID={ctx.team_id} | ||
| LITELLM_AGENT_ID={ctx.agent_id or ''} | ||
| LITELLM_BASE_URL={base_url} | ||
| LITELLM_AGENT_MODE={mode} | ||
| LITELLM_DAEMON_JWT={daemon_jwt} | ||
| EOF | ||
| chmod 600 /etc/litellm-agent/runtime.env | ||
| echo "{repos_b64}" | base64 -d > /etc/litellm-agent/repos.json | ||
| echo "{env_b64}" | base64 -d > /etc/litellm-agent/env.json | ||
| chmod 600 /etc/litellm-agent/repos.json /etc/litellm-agent/env.json | ||
| systemctl enable --now litellm-agent-runtime.service || true | ||
| """ |
Daemon JWT written in plaintext to EC2 user-data
LITELLM_DAEMON_JWT={daemon_jwt} is embedded verbatim in the user-data shell script. EC2 user-data is stored unencrypted and is accessible via IMDS (http://169.254.169.254/latest/user-data) to any process running on the instance (IMDSv2 only raises the bar, it doesn't prevent instance-local access). It is also visible in the AWS console and via DescribeInstanceAttribute to any principal with ec2:DescribeInstanceAttribute in the BYOC account. The JWT should instead be injected through AWS SSM Parameter Store or Secrets Manager and retrieved by the daemon post-boot, keeping it out of the stored user-data payload.
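On the daemon side, the Parameter Store variant could be as small as the sketch below. The parameter naming scheme and the `fetch_daemon_jwt` helper are hypothetical; `get_parameter` with `WithDecryption=True` is the boto3 SSM call, and the instance profile would need permission to read that parameter.

```python
from typing import Any


def fetch_daemon_jwt(ssm_client: Any, session_id: str) -> str:
    """Read the session's JWT from a SecureString parameter after boot."""
    # Hypothetical naming scheme; only the session id travels in user-data.
    name = f"/litellm-agent/sessions/{session_id}/daemon-jwt"
    resp = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return resp["Parameter"]["Value"]
```

In the daemon this would be called as `fetch_daemon_jwt(boto3.client("ssm"), session_id)`, which adds boto3 as a dependency of the AMI.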
| "network_access" JSONB, | ||
| "created_at" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP, | ||
| "created_by" TEXT NOT NULL, | ||
| "updated_at" TIMESTAMP(3) NOT NULL, |
updated_at is declared NOT NULL without DEFAULT CURRENT_TIMESTAMP in the migration SQL. Prisma's @updatedAt is an ORM-level directive and does not inject a database-level default, so raw SQL inserts (e.g. backfills, admin tooling, or other non-Prisma writers) will fail without explicitly providing a value. Adding DEFAULT CURRENT_TIMESTAMP makes the column consistent with created_at and prevents insert failures.
| "updated_at" TIMESTAMP(3) NOT NULL, | |
| "updated_at" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP, |
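The failure mode can be reproduced in a few lines. This sketch uses SQLite as a stand-in for Postgres and a hypothetical `team_id` column; the NOT NULL behaviour for a raw insert is the same in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE without_default ("team_id" TEXT PRIMARY KEY, '
    '"updated_at" TIMESTAMP NOT NULL)'
)
conn.execute(
    'CREATE TABLE with_default ("team_id" TEXT PRIMARY KEY, '
    '"updated_at" TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP)'
)


def raw_insert(table: str) -> bool:
    """Insert a row without supplying updated_at; report whether it succeeded."""
    try:
        conn.execute(f'INSERT INTO {table} ("team_id") VALUES (?)', ("team-1",))
        return True
    except sqlite3.IntegrityError:
        return False


results = {t: raw_insert(t) for t in ("without_default", "with_default")}
```

The database default only covers inserts; keeping `updated_at` current on updates still depends on the writer, which is what Prisma's `@updatedAt` does for ORM writes.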
Relevant issues
Linear ticket
Resolves LIT-2878
Pre-Submission checklist
- tests/test_litellm/ directory: 42 mocked tests + 3 slow-marked real-cloud tests.
- make test-unit.
CI (LiteLLM team)
Type
🆕 New Feature
🚄 Infrastructure
Changes
Substrate for the agent-session VM provider, the BYOC AWS path, the AMI built via Packer, and the bootstrap/heartbeat sweepers.
Provider abstraction (`litellm/proxy/agent_session_endpoints/vm_providers/`):
- `base.py`: `AgentVMProvider` ABC, `ProvisionContext`, `VMHandle`, `VMStatus`, `AwsCreds` (redacting `__repr__`), `Ec2Config`.
- `factory.py`: `get_vm_provider()` keyed off `agent_settings.vm_provider`. Defaults to `noop`. Unknown values raise `ValueError`.
- `noop.py`: in-memory provider for tests + default config.
- `ec2.py`: boto3 wrapper with BYOC creds, spot → on-demand fallback, idempotent terminate, `set_stream_logger('botocore', WARNING)` so SigV4 payloads can't leak the access key.
- `team_config.py`: reads `LiteLLM_AgentVMConfig.aws_creds_enc` (each field individually encrypted via `encrypt_value_helper`) with env-var fallback.

Sweepers (`litellm/proxy/agent_session_endpoints/sweepers.py`):
- `bootstrap_timeout_sweeper`: sessions stuck in `provisioning` past the timeout
- `heartbeat_timeout_sweeper`: `ready` sessions whose daemon stopped checking in
- `max_session_minutes_sweeper`: sessions older than the ceiling

Schema:
- `LiteLLM_AgentVMConfig` table (one row per team) in all three `schema.prisma` copies.
- Migration in `litellm-proxy-extras/litellm_proxy_extras/migrations/20260506220000_add_agent_vm_config/`.

AMI (`infra/ami/`):
- `LITELLM_AGENT_MODE` env in user-data picks `session` (cold-boot) vs `warm` (B2).
- `ami-074a518157fe137b4` (us-west-2).

Tests (`tests/test_litellm/proxy/agent_session_endpoints/`):
- Real-cloud tests require `pytest -m slow` and the `LITELLM_AGENT_AWS_*` / `LITELLM_TEST_*` env vars; each test wraps `RunInstances` in try/finally + 60-min watchdog per the AWS safety boundary.

Ruff + mypy clean. Black-formatted.

Validation status (from LIT-2878):
- `pytest test_factory.py::test_factory_*`: passing
- `packer build`: succeeded, `ami-074a518157fe137b4`
- `test_ec2_provider_real.py::test_real_boot`: slow-marked, requires BYOC env
- `test_provision_spot_fallback_to_on_demand`: passing
- `test_provider_swap_is_config_only`: passing
- `test_max_session_minutes_sweeper_*`: passing
- `test_terminate_already_gone_is_noop`: passing
- `test_bootstrap_timeout_sweeper_*`: passing
- `test_heartbeat_timeout_sweeper_*`: passing
- `test_provision_invalid_creds_aws_response_*` + `test_no_db_no_env_raises_invalid_credentials`: passing
- `test_byoc_cross_team_isolation`: passing
- `test_aws_creds_never_logged_during_provision`, `test_aws_creds_repr_redacts`, `test_aws_creds_never_in_exception_message`: passing

Validations #4 and #8 (full end-to-end via `POST /v1/sessions`) need Epic A's session endpoints to be merged before they can run; the substrate is in place.

Follow-ups:
- Epic C: replace `infra/ami/files/daemon-stub.py` with the real daemon.
- Epic G's Settings UI (LIT-2891) owns `LiteLLM_AgentVMConfig.aws_creds_enc`.