feat(metering): flow cost+cache aggregation, per-session warm routes, consumer /v1/session #22

Merged
jmlago merged 2 commits into main from metering-warm-sidebar on Jun 24, 2026
Conversation

@jmlago jmlago commented Jun 24, 2026

Member

Follow-on to #18 (the work landed on cache-metrics after it was squash-merged, so it had no PR).

What

  • Flow cost/cache: ensemble (flow) calls were showing $0 / 0% cached downstream, because the flow aggregated only tokens in/out and its synthetic 'flow' chosen has no price. The flow now aggregates cost + cached tokens across nodes.
  • Standard cache usage: emit usage.prompt_tokens_details.cached_tokens (not just x_router) so OpenAI-compatible clients (opencode) parse cache reads.
  • Streaming meter: fold the session meter on the streaming paths too (opencode always streams).
  • Header session: read X-Unhardcoded-Session as the conversation id (clients that can set headers but not body fields).
  • Warm routes: per-(session,family) warm map (family/provider/served_by — the real peer/backend) for display.
  • Consumer endpoint: GET /v1/session/{sid} (consumer-key authed) → {calls, tokens_in/out, tokens_cached, cost_usd, warm[]}, so a harness (opencode plugin) shows live economics without operator /x/* access.
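
The flow aggregation in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the llm_router_host code: the function name `aggregate_flow` and the per-node field names (`tokens_in`, `tokens_out`, `tokens_cached`, `cost_usd`) are assumptions chosen to match the response fields listed above.

```python
from typing import Any


def aggregate_flow(nodes: list[dict[str, Any]]) -> dict[str, Any]:
    """Sum tokens, cached tokens, and cost across the nodes of one flow call.

    Hypothetical helper: the real node trace layout may differ.
    """
    totals: dict[str, Any] = {
        "tokens_in": 0,
        "tokens_out": 0,
        "tokens_cached": 0,
        "cost_usd": 0.0,
    }
    for node in nodes:
        for key in totals:
            # Missing or None fields count as zero, so one sparse node
            # does not break the aggregate for the whole flow.
            totals[key] += node.get(key) or 0
    return totals


if __name__ == "__main__":
    example_nodes = [
        {"tokens_in": 100, "tokens_out": 20, "tokens_cached": 64, "cost_usd": 0.01},
        {"tokens_in": 80, "tokens_out": 10, "tokens_cached": None, "cost_usd": 0.02},
    ]
    print(aggregate_flow(example_nodes))
```

Summing at the flow level matters because the synthetic flow entry itself carries no price, so any per-call cost has to come from the nodes.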

Verified live

flow cost_usd 0.034, cached 2304; GET /v1/session/{sid} → warm [gpt-5.5/openrouter, z-ai/glm-5.2/openrouter_market].

Summary by CodeRabbit

  • New Features

    • Added a new session lookup endpoint for retrieving session details.
    • Streaming and non-streaming responses now include cached token usage details when available.
    • Session-based usage tracking now covers streamed requests and displays warm-route information.
  • Bug Fixes

    • Improved error handling for session requests with clearer authentication and upstream failure responses.
    • Ensured session metrics are recorded consistently across all streaming paths.
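
For the cached-token usage detail mentioned above, a usage object in the OpenAI-compatible shape might be built like this. The helper name `openai_usage` is hypothetical; only the `usage.prompt_tokens_details.cached_tokens` path comes from the PR description, and omitting the field when there are no cache reads is an assumption based on "when available".

```python
def openai_usage(tokens_in: int, tokens_out: int, tokens_cached: int = 0) -> dict:
    """Build an OpenAI-style usage object, adding cache reads when present.

    Hypothetical helper, not the shim's actual function.
    """
    usage: dict = {
        "prompt_tokens": tokens_in,
        "completion_tokens": tokens_out,
        "total_tokens": tokens_in + tokens_out,
    }
    if tokens_cached:
        # Standard location that OpenAI-compatible clients read for cache hits.
        usage["prompt_tokens_details"] = {"cached_tokens": tokens_cached}
    return usage
```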

…rm routes, consumer /v1/session

Post-#18 follow-on (stranded on the merged cache-metrics branch). Lands:
- route_session_meter: /x/session endpoints wiring + per-(session,family) warm
  map (observe_route/warm) for display.
- shim: read session from X-Unhardcoded-Session header; meter the session on the
  STREAMING paths too; emit standard usage.prompt_tokens_details.cached_tokens
  so OpenAI-compatible clients (opencode) see cache; /x/session returns warm.
- llm_router_host: aggregate cost + cached across flow nodes (the synthetic
  'flow' chosen has no price); record warm route per node on success.
- auth_proxy: consumer-facing GET /v1/session/{sid} (consumer-key authed) so a
  harness reads cost/tokens/cache/warm without operator /x/* access.

Verified live: flow cost_usd 0.034 + cached 2304; GET /v1/session/{sid} returns
warm [gpt-5.5/openrouter, z-ai/glm-5.2/openrouter_market].
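
Reading the session from the X-Unhardcoded-Session header could look roughly like the sketch below. The function name and the handling of blank values are assumptions; the commit message only states that the header is read as the session id.

```python
from typing import Mapping, Optional

# Header name from the PR description, lowercased for comparison.
SESSION_HEADER = "x-unhardcoded-session"


def session_from_header(headers: Mapping[str, str]) -> Optional[str]:
    """Return the trimmed session id from the header, or None if absent or blank.

    Hypothetical sketch: treating a blank header as "no session" is an assumption.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == SESSION_HEADER:
            return value.strip() or None
    return None
```

This covers clients that can set request headers but cannot add fields to the request body.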

coderabbitai Bot commented Jun 24, 2026

Warning

Review limit reached

@jmlago, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 35 minutes and 33 seconds. Learn how PR review limits work.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d8906aad-3445-4ed6-9454-566ef11f55ac

📥 Commits

Reviewing files that changed from the base of the PR and between 5cd5350 and 20c2d13.

📒 Files selected for processing (5)
  • auth_proxy.py
  • route_session_meter.py
  • shim.py
  • tests/test_metering.py
  • tests/test_shim.py
📝 Walkthrough

Walkthrough

Adds per-session warm-route metadata tracking to route_session_meter with observe_route() and warm() functions. llm_router_host gains flow-level cost aggregation using tokens_cached/cost_reported and calls observe_route() after successful routing. shim.py threads req.session into all streaming final-chunk paths, adds prompt_tokens_details.cached_tokens to OpenAI usage responses, and includes warm-route data in the /x/session/{sid} meter endpoint. auth_proxy.py adds a new consumer-facing GET /v1/session/{sid} endpoint that authenticates and proxies to the router.
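
A per-(session, family) warm map of the kind described in the walkthrough might be sketched as follows. The names `observe_route`, `warm`, `reset`, and `_warm` come from the walkthrough; the argument order, the lock, and the "latest route per family wins" behavior are assumptions made for illustration.

```python
import threading

_lock = threading.Lock()
# session id -> family -> {"family", "provider", "served_by"}
_warm: dict[str, dict[str, dict[str, str]]] = {}


def observe_route(session: str, family: str, provider: str, served_by: str) -> None:
    """Record the route that last served `family` in this session (assumed semantics)."""
    with _lock:
        _warm.setdefault(session, {})[family] = {
            "family": family,
            "provider": provider,
            "served_by": served_by,
        }


def warm(session: str) -> list[dict[str, str]]:
    """Return copies of the warm entries for a session, or an empty list."""
    with _lock:
        return [dict(entry) for entry in _warm.get(session, {}).values()]


def reset() -> None:
    """Test hook: clear the store."""
    with _lock:
        _warm.clear()
```

Returning copies keeps callers from mutating the shared store outside the lock.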

Changes

Session Warm-Route Tracking and Display

| Layer / File(s) | Summary |
|---|---|
| Warm-route store in route_session_meter (route_session_meter.py) | Adds _warm in-memory dict keyed by session/family; introduces observe_route() to record provider and served-by metadata, warm() to retrieve session entries, and extends reset() to clear the new store. |
| Flow cost aggregation and warm-route recording in llm_router_host (llm_router_host.py) | Imports route_session_meter; extends node_trace with tokens_cached/cost_reported; introduces _node_cost closure that prefers provider-reported cost or computes from pricing with cached-token discount; sums into flow_cost on the return payload; calls observe_route() in _fold_route_outcome on successful calls. |
| Session attribution and cache reporting in shim streaming paths (shim.py) | Adds _session_from_header helper reading X-Unhardcoded-Session; extends _final_chunk_parts with optional session param that triggers route_session_meter.observe(); adds prompt_tokens_details.cached_tokens to streaming and non-streaming usage objects; propagates req.session through _pseudo_stream, _flow_stream, and _sse_gen (including its error path) and _router_response_to_openai. |
| Session display endpoints in shim and auth_proxy (shim.py, auth_proxy.py) | /x/session/{sid} now appends a warm field from route_session_meter.warm(sid); auth_proxy adds GET /v1/session/{sid} that authenticates the caller via Bearer token, proxies to the router's internal endpoint, and returns upstream JSON or a 502 error. |
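
The per-node cost rule summarized in the table (prefer the provider-reported cost, otherwise compute from pricing with a cached-token discount) could be sketched like this. Everything beyond that rule is an assumption: the per-million-token price parameters, the 50% default discount, and treating cached tokens as a subset of `tokens_in` are illustrative and not taken from the PR.

```python
from typing import Any


def node_cost(
    node: dict[str, Any],
    price_in_per_mtok: float,
    price_out_per_mtok: float,
    cached_discount: float = 0.5,  # hypothetical discount, not from the PR
) -> float:
    """Cost of one flow node in USD.

    Prefers the provider-reported cost; otherwise prices fresh input,
    discounted cached input, and output tokens separately.
    """
    reported = node.get("cost_reported")
    if reported is not None:
        return float(reported)

    tokens_in = node.get("tokens_in") or 0
    tokens_out = node.get("tokens_out") or 0
    # Assumption: cached tokens are counted inside tokens_in.
    cached = min(node.get("tokens_cached") or 0, tokens_in)
    fresh = tokens_in - cached

    input_cost = (fresh + cached * (1.0 - cached_discount)) * price_in_per_mtok / 1e6
    output_cost = tokens_out * price_out_per_mtok / 1e6
    return input_cost + output_cost
```

A flow-level cost would then be the sum of `node_cost` over the node trace.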

Sequence Diagram(s)

sequenceDiagram
  participant Consumer
  participant auth_proxy as auth_proxy<br/>(GET /v1/session/{sid})
  participant shim_router as shim<br/>(/x/session/{sid})
  participant route_session_meter

  rect rgba(100, 149, 237, 0.5)
    note over Consumer,auth_proxy: Authentication
    Consumer->>auth_proxy: GET /v1/session/{sid}<br/>Authorization: Bearer token
    auth_proxy->>auth_proxy: _extract_token + _caller_auth
    auth_proxy-->>Consumer: 401/403 on auth failure
  end

  rect rgba(60, 179, 113, 0.5)
    note over auth_proxy,route_session_meter: Proxy + warm data assembly
    auth_proxy->>shim_router: GET /x/session/{sid}
    shim_router->>route_session_meter: warm(sid)
    route_session_meter-->>shim_router: [{family, provider, served_by}, ...]
    shim_router-->>auth_proxy: {calls, tokens, cost, warm: [...]}
    auth_proxy-->>Consumer: upstream JSON or 502
  end
sequenceDiagram
  participant Client
  participant shim_handler
  participant llm_router_host
  participant _fold_route_outcome
  participant route_session_meter

  Client->>shim_handler: POST /v1/chat/completions<br/>[X-Unhardcoded-Session: sid]
  shim_handler->>shim_handler: _session_from_header → req.session = sid
  shim_handler->>llm_router_host: execute_flow_async(session=sid)
  llm_router_host->>_fold_route_outcome: fold each node result
  _fold_route_outcome->>route_session_meter: observe_route(session, provider, family, served_by)
  llm_router_host-->>shim_handler: response with tokens_cached + cost_reported
  shim_handler->>shim_handler: _final_chunk_parts(session=sid)
  shim_handler->>route_session_meter: observe(sid, tokens_in, tokens_out, cost)
  shim_handler-->>Client: SSE chunk with usage.prompt_tokens_details.cached_tokens

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • genlayerlabs/unhardcoded#18: Directly adds the per-session usage/cost accumulator in route_session_meter and the initial tokens_cached/cost_reported propagation through llm_router_host and shim that this PR extends with warm-route metadata and streaming session attribution.

Poem

🐇 Hop hop, the session glows warm,
Each cache hit tracked, each routed norm.
The meter records which provider stayed near,
Costs summed from nodes, the flow path clear.
/v1/session now speaks to the crowd —
A bunny's data panel, fluffy and proud! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly captures the main additions: metering aggregation, warm route tracking, and the new consumer session endpoint. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
route_session_meter.py (1)

20-24: 🧹 Nitpick | 🔵 Trivial

Unbounded per-session growth of in-memory stores.

_warm (like the existing _acc) accumulates one entry per distinct session id with no eviction or TTL, and reset() is only a test hook. In a long-running process, sessions are never reclaimed, so both maps grow without bound. Consider a bounded LRU or TTL-based expiry for _warm/_acc.
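
One way to bound the stores, as the nitpick suggests, is an LRU keyed by session id. The class below is a sketch under assumptions: the name `BoundedSessions`, the default capacity, and the choice of LRU over TTL are illustrative, and a real version would still need the module's locking around it.

```python
from collections import OrderedDict
from typing import Any, Optional


class BoundedSessions:
    """LRU-bounded map from session id to a per-session dict (hypothetical)."""

    def __init__(self, max_sessions: int = 10_000) -> None:
        self._max = max_sessions
        self._data: "OrderedDict[str, dict[str, Any]]" = OrderedDict()

    def touch(self, sid: str) -> dict[str, Any]:
        """Return the entry for `sid`, creating it and marking it most recent."""
        entry = self._data.setdefault(sid, {})
        self._data.move_to_end(sid)
        # Evict least recently used sessions once over capacity.
        while len(self._data) > self._max:
            self._data.popitem(last=False)
        return entry

    def get(self, sid: str) -> Optional[dict[str, Any]]:
        """Read without changing recency; None if evicted or never seen."""
        return self._data.get(sid)

    def __len__(self) -> int:
        return len(self._data)
```

An LRU caps memory at a fixed number of sessions, while a TTL would instead bound how long an idle session's totals stay readable; which one fits depends on how long a harness expects to query a session.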

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@route_session_meter.py` around lines 20 - 24, The in-memory session maps in
route_session_meter are unbounded, so `_warm` and `_acc` can grow forever as new
session IDs are seen. Update the session storage logic in `route_session_meter`
to use bounded eviction or TTL-based expiration instead of keeping entries
indefinitely, and make sure any helper methods that read/write these maps still
work with the new lifecycle. If there is a central accessor or mutation path for
`_warm`/`_acc`, adjust that code so stale sessions are removed automatically
without relying on `reset()`.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a805402b-1045-4168-8dbe-56cb40ec489b

📥 Commits

Reviewing files that changed from the base of the PR and between d35af18 and 5cd5350.

📒 Files selected for processing (4)
  • auth_proxy.py
  • llm_router_host.py
  • route_session_meter.py
  • shim.py

Comment thread auth_proxy.py
Comment on lines +1962 to +1965
assert _client is not None
try:
    r = await _client.get(f"{UPSTREAM}/x/session/{sid}", timeout=5.0)
    return JSONResponse(status_code=r.status_code, content=r.json())


🔒 Security & Privacy | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm whether other upstream proxy calls encode path params, and how UPSTREAM is defined
rg -nP -C2 'UPSTREAM\}/x/' --type=py
rg -nP 'from urllib.parse import|quote' auth_proxy.py

Repository: genlayerlabs/unhardcoded

Length of output: 162


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the relevant file and locate the session handler plus related routing.
git ls-files auth_proxy.py
ast-grep outline auth_proxy.py --view expanded || true

echo
echo "=== session-related lines ==="
rg -n -C 3 'session|UPSTREAM|sid' auth_proxy.py

echo
echo "=== upstream URL construction across repository ==="
rg -n -C 2 'f"\{UPSTREAM\}/x/' .

echo
echo "=== urllib.parse usage in auth_proxy.py ==="
rg -n -C 2 'urllib\.parse|quote\(' auth_proxy.py

Repository: genlayerlabs/unhardcoded

Length of output: 50380


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the handler around the cited lines and the route definition for `sid`.
wc -l auth_proxy.py
sed -n '1930,1985p' auth_proxy.py

echo
echo "=== route definitions for session endpoints ==="
rg -n -C 2 '@app\.(get|post|put|delete|api_route)\(.*/x/session|/x/session/\{sid' auth_proxy.py

Repository: genlayerlabs/unhardcoded

Length of output: 3708


Quote sid before proxying it upstream.

sid comes from a path parameter, so a value containing ? or other reserved characters can change the upstream request target. Use quote(sid, safe="") to keep it a single path segment.
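
The suggested encoding can be illustrated with the standard library's `urllib.parse.quote`. The helper name `session_url` and the example upstream base are hypothetical; only the `/x/session/{sid}` path and the `safe=""` argument come from the review comment.

```python
from urllib.parse import quote


def session_url(upstream: str, sid: str) -> str:
    """Build the upstream session URL with `sid` kept inside one path segment.

    safe="" makes quote() percent-encode "/" as well, so a crafted sid
    cannot add path segments or a query string to the upstream request.
    """
    return f"{upstream}/x/session/{quote(sid, safe='')}"


if __name__ == "__main__":
    # Hypothetical upstream base, for illustration only.
    print(session_url("http://router.internal", "a/b?x=1"))
```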

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@auth_proxy.py` around lines 1962 - 1965, The upstream session proxy in the
code path using _client.get and sid should URL-encode the sid path parameter
before building the request target. Update the request in the session handler to
quote sid with safe="" so reserved characters stay within a single path segment
and do not alter the upstream URL.

Source: Linters/SAST tools

Comment thread shim.py
Comment on lines +944 to 956
# Per-session meter on the STREAMING path too (mirrors the non-streaming
# _router_response_to_openai). Streaming clients (e.g. opencode, stream:true)
# otherwise never fold into the session total. Idempotent per request: this
# runs once per final chunk.
if session:
    acc = route_session_meter.observe(
        session,
        tokens_in=resp.get("tokens_in") or 0,
        tokens_out=resp.get("tokens_out") or 0,
        tokens_cached=resp.get("tokens_cached") or 0,
        cost_usd=x_router["cost_usd"] or 0.0)
    x_router["session_acc"] = acc
return resp, usage or None, x_router


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Session meter counts failed streaming calls but not failed non-streaming calls.

_final_chunk_parts is also invoked on the streaming failure paths (_sse_gen error branch at Line 1041 and _flow_stream at Line 993, which runs before the ok check). Since observe() unconditionally does calls += 1, a failed streaming request inflates session_acc.calls. The non-streaming failure path (_openai_error_from_router) never meters, so the two paths diverge. Consider gating the fold on result.get("ok") for consistency.

Proposed guard
-        if session:
+        if session and result.get("ok"):
             acc = route_session_meter.observe(
                 session,
                 tokens_in=resp.get("tokens_in") or 0,
                 tokens_out=resp.get("tokens_out") or 0,
                 tokens_cached=resp.get("tokens_cached") or 0,
                 cost_usd=x_router["cost_usd"] or 0.0)
             x_router["session_acc"] = acc
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shim.py` around lines 944 - 956, The session meter fold in _final_chunk_parts
is counting failed streaming requests because route_session_meter.observe is
called unconditionally whenever session is present. Add a guard so the
per-session meter is only updated for successful results, using result.get("ok")
(or the equivalent response status available in _final_chunk_parts) before
calling observe and setting x_router["session_acc"]. Keep the change localized
to _final_chunk_parts so streaming and non-streaming paths stay consistent with
_openai_error_from_router.

…ner)

GET /v1/session/{sid} authed only that the caller was *a* valid consumer,
never that the sid belonged to *that* consumer. route_session_meter is keyed
by sid alone, so any authed consumer could read any other consumer's session
economics (cost/tokens/cache) and `warm[]` — which discloses the real
family/provider/served_by peers serving someone else's conversation.

Bind sid -> owning consumer key and enforce it:

- route_session_meter: add _owner (sid->consumer key), guarded by _lock.
  observe() gains owner=; records first-writer-wins (setdefault) so a consumer
  reusing another's opaque sid cannot steal/overwrite ownership. add owner();
  reset() clears it.
- shim: thread the authed consumer (the ingress proxy's x-llm-router-caller
  header, captured onto ChatRequest.caller) into BOTH observe sites (streaming
  _final_chunk_parts and non-streaming _router_response_to_openai) as owner=.
- shim /x/session/{sid}: when a caller header is present (the consumer-facing
  path), only the owner may read; anyone else gets 404 — NOT 403, so the
  endpoint never confirms another consumer's sid exists. Operator /x/* (no
  header) stays unscoped.
- auth_proxy session_view: forward the authed caller as x-llm-router-caller so
  the upstream meter scopes the read.

Tests: meter-level first-writer-wins binding (test_metering) and an end-to-end
endpoint test (test_shim): owner reads 200, other consumer/unknown sid 404,
operator unscoped.
@jmlago jmlago merged commit 6ee4a49 into main Jun 24, 2026
1 check passed
jmlago added a commit that referenced this pull request Jun 26, 2026
…26)

Reconstructed from Edgars Nemše's #24, carrying only the family half. The
provider-adapter half is in its sibling PR; this one makes discovered
OpenRouter marketplace families provider-neutral.

What it does:
- sources/openrouter.py: a discovered marketplace model's policy-facing
  family is the provider-neutral name (`openai/gpt-5-mini` -> `gpt-5-mini`),
  while `wire_model_id` keeps the exact OpenRouter slug for the wire.
  `service_aliases` (config) handle the canonicalization exceptions where
  stripping the vendor isn't the right family (dated/suffixed slugs).
  Curated families served by the static `openrouter` provider are deduped
  out, so a marketplace row never shadows a curated family.
- This pairs with the engine's `provider_eq` (#18): family is the model,
  provider is a separate axis the algebra filters. It lets a single
  provider-agnostic policy span curated + marketplace routes for one model,
  which is the point of the router ("stop hardcoding models").

Core bump: `core` -> 97d0333 (unhardcoded-engine #22), which makes
`served_model_id` the offer's wire id (`offer.wire_model_id or model_family`).
With a neutral family + a distinct wire slug, the engine now wires the slug
on both the curated and discovered paths without every adapter special-casing
`wire_model_id`, and the replayable trace records the real wire id.

The §3 determinism boundary holds: `model_meta.lua` stays curated-keyed
(refresh_model_meta untouched), so an on-chain/genvm host lacking a discovered
offer still fails closed. Pinned by test_openrouter_model_meta_still_keyed_by
_curated_family and the dedup assertion in
test_openrouter_discovery_derives_policy_families_from_raw_model_ids.

Verification: nix-shell --run 'python -m pytest tests -q' -> 333 passed,
2 skipped, 0 failed, against the bumped core. No new dependency.
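
The family derivation described in this commit message might look roughly like the sketch below. It is illustrative only: the shape of `service_aliases` (a slug-to-family mapping), the offer dict layout, and the helper names `policy_family` and `discovered_offers` are assumptions. The `openai/gpt-5-mini` to `gpt-5-mini` mapping and the `offer.wire_model_id or model_family` fallback are taken from the message.

```python
def policy_family(slug: str, service_aliases: dict[str, str]) -> str:
    """Provider-neutral family for a marketplace slug.

    Aliases cover slugs where stripping the vendor prefix is not enough
    (assumed here to be a slug -> family mapping).
    """
    if slug in service_aliases:
        return service_aliases[slug]
    return slug.split("/", 1)[-1]


def discovered_offers(slugs: list[str], curated_families: set[str],
                      service_aliases: dict[str, str]) -> list[dict[str, str]]:
    """Build marketplace offers, skipping families that are already curated."""
    offers = []
    for slug in slugs:
        family = policy_family(slug, service_aliases)
        if family in curated_families:
            # A marketplace row must never shadow a curated family.
            continue
        offers.append({"model_family": family, "wire_model_id": slug})
    return offers


def served_model_id(offer: dict[str, str]) -> str:
    """Wire id sent upstream: the exact slug if present, else the family."""
    return offer.get("wire_model_id") or offer["model_family"]
```

Keeping the slug in `wire_model_id` while policies see only the neutral family is what lets a single policy match both curated and marketplace routes for the same model.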