
feat(dashboard): cost-accuracy panel — measured spend vs list price (D, monitoring) #61

Merged
jmlago merged 2 commits into main from cost-accuracy-overview
Jun 30, 2026

@jmlago jmlago commented Jun 30, 2026

Member

What

PR D, reframed as monitoring (not the original auto-correction). A read-only "Cost accuracy" card in the dashboard Overview: per provider, the measured effective $/Mtok (from the calls ledger) vs the advertised list price, with a drift flag.

Why monitoring, not auto-correction

Auto-correcting the ranking from measured cost would: (a) form a routing feedback loop (cheaper-measured → more traffic → changes the measured cost), (b) be circular for compute-from-price providers (their cost_usd is derived from list price), and (c) change routing opaquely from a derived number. So D observes and flags; the operator decides (adjust C's manual multiplier, investigate) — the same human-in-the-loop stance as C.

B = list price · C = manual ranking lever · D = does reality match? (a warning, not an auto-tune)

How

  • host_store.cost_by_route(window_s): per (provider, family) ledger aggregate (calls, tokens in/out/cached, cost_usd), derived by query (#41, "feat(host-store): the route_* family derived on the fly (#4a+#4b+#4c)": store raw, derive at read; no in-process fold).
  • _cost_accuracy_rows (pure, unit-tested) — joins that with the live ranked price from /x/runtime, dividing the fictitious price_multiplier back out so the comparison is measured-spend vs advertised-list. Computes expected cost with the same cache-read discount as billing, rolls up per provider, flags drift > 15% with ≥ 20 calls, sorts worst-first.
  • GET /dashboard/api/cost-accuracy (admin) + a card in the Overview.
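
A minimal sketch of the "store raw, derive at read" aggregate, for illustration only. It assumes a SQLite ledger and a hypothetical `calls` schema (`ts`, `provider`, `family`, `tokens_in`, `tokens_out`, `tokens_cached`, `cost_usd`); the real `host_store.cost_by_route` and its storage backend may differ.

```python
import sqlite3
import time

def cost_by_route(conn, window_s, now=None):
    """Aggregate the raw calls ledger per (provider, family) over the last window_s seconds.

    Nothing is pre-folded in process: the totals are derived by the query at read time.
    Table and column names are assumptions made for this sketch.
    """
    now = time.time() if now is None else now
    rows = conn.execute(
        """
        SELECT provider, family,
               COUNT(*)           AS calls,
               SUM(tokens_in)     AS tokens_in,
               SUM(tokens_out)    AS tokens_out,
               SUM(tokens_cached) AS tokens_cached,
               SUM(cost_usd)      AS cost_usd
        FROM calls
        WHERE ts >= ?
        GROUP BY provider, family
        ORDER BY provider, family
        """,
        (now - window_s,),
    ).fetchall()
    keys = ("provider", "family", "calls", "tokens_in",
            "tokens_out", "tokens_cached", "cost_usd")
    return [dict(zip(keys, r)) for r in rows]

# Tiny in-memory ledger to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (ts REAL, provider TEXT, family TEXT, "
    "tokens_in INTEGER, tokens_out INTEGER, tokens_cached INTEGER, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1000.0, "openrouter", "gpt", 1000, 200, 0, 0.5),
        (1010.0, "openrouter", "gpt", 3000, 300, 500, 0.25),
        (10.0, "openrouter", "gpt", 9999, 9999, 0, 9.0),  # older than the window
    ],
)
agg = cost_by_route(conn, window_s=100, now=1100.0)
```

Because the window is applied in the `WHERE` clause, changing `window_s` needs no stored state to be rebuilt.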

Honest about signal strength

  • Reported-cost providers (openrouter): cost_usd is authoritative → drift reveals real discounts/surprises. High value.
  • Compute-from-price providers (direct openai/anthropic/google): cost_usd derives from list → reads ~1.0 by construction. That's itself informative ("no independent signal here") and a sanity check that billing matches the scrape.

Never touches routing

Read-only. No ranking/admission impact.

Verification

  • _cost_accuracy_rows: +25% drift flags (≥20 calls), at-list reads no drift, the multiplier is divided out before comparison, unpriced routes are skipped, and a big drift with <20 calls does not warn.
  • JS render verified directly (balanced markup, drift badge, empty state).
  • Suite 434 passed / 0 failed.
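
The verified cases above can be illustrated with a small stand-alone sketch. This is not the PR's `_cost_accuracy_rows`: the input shapes, the key names (`in_per_mtok`, `out_per_mtok`, `price_multiplier`), and the assumption that `tokens_in` includes cached tokens are all hypothetical. Only the 15% threshold, the 20-call minimum, the 0.1 cache-read factor, and the worst-first sort come from this thread.

```python
CACHE_READ_FACTOR = 0.1   # value quoted in the review discussion
DRIFT_THRESHOLD = 0.15    # flag drift above 15%
MIN_CALLS = 20            # but only with at least 20 calls

def cost_accuracy_rows(ledger, ranked_prices):
    """Join ledger aggregates with ranked prices and roll up per provider.

    ledger: list of per-(provider, family) dicts (hypothetical shape).
    ranked_prices: {(provider, family): {"in_per_mtok", "out_per_mtok", "price_multiplier"}}.
    """
    totals = {}
    for row in ledger:
        price = ranked_prices.get((row["provider"], row["family"]))
        if price is None:
            continue  # unpriced route: skipped
        mult = price.get("price_multiplier") or 1.0
        # Divide the multiplier back out so we compare against the advertised list price.
        list_in = price["in_per_mtok"] / mult
        list_out = price["out_per_mtok"] / mult
        fresh_in = row["tokens_in"] - row["tokens_cached"]  # assumes tokens_in includes cached
        expected = (
            fresh_in * list_in
            + row["tokens_cached"] * list_in * CACHE_READ_FACTOR
            + row["tokens_out"] * list_out
        ) / 1e6
        acc = totals.setdefault(row["provider"], {"calls": 0, "measured": 0.0, "expected": 0.0})
        acc["calls"] += row["calls"]
        acc["measured"] += row["cost_usd"]
        acc["expected"] += expected

    rows = []
    for provider, acc in totals.items():
        if acc["expected"] <= 0:
            continue
        deviation = acc["measured"] / acc["expected"]
        rows.append({
            "provider": provider,
            "calls": acc["calls"],
            "deviation": deviation,
            "warn": abs(deviation - 1.0) > DRIFT_THRESHOLD and acc["calls"] >= MIN_CALLS,
        })
    rows.sort(key=lambda r: abs(r["deviation"] - 1.0), reverse=True)  # worst first
    return rows

def _agg(provider, calls, cost_usd):
    return {"provider": provider, "family": "x", "calls": calls,
            "tokens_in": 1_000_000, "tokens_cached": 0, "tokens_out": 0,
            "cost_usd": cost_usd}

ledger = [
    _agg("a", 40, 2.5),   # +25% over list, enough calls
    _agg("b", 40, 2.0),   # exactly at list
    _agg("c", 5, 4.0),    # big drift, too few calls
    _agg("d", 40, 1.0),   # no price entry
]
ranked_prices = {
    ("a", "x"): {"in_per_mtok": 4.0, "out_per_mtok": 16.0, "price_multiplier": 2.0},
    ("b", "x"): {"in_per_mtok": 2.0, "out_per_mtok": 8.0, "price_multiplier": 1.0},
    ("c", "x"): {"in_per_mtok": 2.0, "out_per_mtok": 8.0, "price_multiplier": 1.0},
}
rows = cost_accuracy_rows(ledger, ranked_prices)
by_provider = {r["provider"]: r for r in rows}
```

Provider `a` carries a multiplier of 2.0 on its ranked price; without dividing it out, its +25% overspend would read as a 37.5% underspend instead.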

Closes the B/C/D pricing arc: list price → manual lever → measured reality-check.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026


Warning

Review limit reached

@jmlago, you've reached your PR review limit, so we couldn't start this review.

Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ea67b5ab-a5b2-4d62-874f-256250786aad

📥 Commits

Reviewing files that changed from the base of the PR and between d4760e9 and 7e1cda8.

📒 Files selected for processing (5)
  • auth_proxy.py
  • host_store.py
  • shim.py
  • tests/test_auth_proxy_dashboard_full.py
  • tests/test_shim.py

jmlago added 2 commits June 30, 2026 20:56
…D, monitoring)

PR D, reframed from "auto-correct the ranking" to OBSERVABILITY, per the decision
that an automatic measured-cost correction would (a) form a routing feedback loop,
(b) be circular for compute-from-price providers, and (c) hide an opaque routing
change. Instead D surfaces the deviation and lets the operator act — the same
human-in-the-loop stance as C's manual lever.

A read-only Overview card showing, per provider, the measured effective $/Mtok
(from the calls ledger) vs the advertised list price, with a drift flag.
- host_store.cost_by_route: per (provider, family) ledger aggregate over a window,
  derived by query (#41: store raw, derive at read — no in-process fold).
- _cost_accuracy_rows (pure, unit-tested): joins that with the live ranked price
  from /x/runtime, dividing the fictitious price_multiplier back out so the
  comparison is measured-spend vs advertised-list. Rolls up per provider, flags
  drift > 15% with >= 20 calls, sorts worst-first.
- GET /dashboard/api/cost-accuracy (admin) + a card in the Overview.

Never touches routing. The signal is strongest for providers that report their own
cost (openrouter — reveals real discounts/surprises); a compute-from-price provider
reads ~1.0 by construction, which is itself a useful "no independent signal here".
Suite 434/0; the join/deviation logic + the JS render verified directly.
…y (review)

Review (Axis 8): the panel rendered tautological rows identically to real-signal
ones. For a compute-from-price provider (the direct openai/anthropic/google),
measured and expected both derive from the same list price → deviation ~1.0 by
construction, and any drift is reprice noise (ledger cost_usd sealed at the
price-of-then vs the current ema), not an effective discount. Flagging "drift" on
those is non-actionable and trains the operator to ignore the badge — the opposite
of what a monitoring panel wants.

Record the cost basis as a raw fact and use it:
- shim._cost_basis (single source of the cost tiering _executed_cost_usd already
  used): 'subscription' | 'reported' (provider's own usage.cost — INDEPENDENT
  signal) | 'computed' (derived from list price — tautological) | None. Stamped on
  x_router and threaded to the ledger; new calls.cost_basis column (ALTER ADD
  COLUMN IF NOT EXISTS).
- cost_by_route aggregates n_reported; _cost_accuracy_rows labels each row
  reported|derived and only warns where the signal is real (reported). The panel
  shows the signal tag; derived drift renders muted, never badged.

Review (Axis 1): import _CACHE_READ_FACTOR from shim instead of redefining 0.1 —
the panel's expected cost must track the billing factor; a copy could silently
diverge into false drift.

Suite 439/0 (incl. _cost_basis tiers + a derived provider with big drift that does
NOT warn).
@jmlago jmlago force-pushed the cost-accuracy-overview branch from 902e75b to 7e1cda8 June 30, 2026 20:05
@jmlago

jmlago commented Jun 30, 2026

Member Author

Both correct — addressed in the latest commit (rebased on C/#60).

Axis 8 (signal vs tautology). You're right: the panel mixed rows where measured cost is an independent signal (reported) with rows where it's tautological (computed from the same list price → ~1.0 by construction; any drift is reprice noise from sealed-then vs ema-now, not a discount), and could badge a non-actionable "drift" on a direct provider. Fixed by recording the cost basis as a raw fact and using it:

  • shim._cost_basis — now the single source of the cost tiering _executed_cost_usd already used: subscription | reported | computed | None. Stamped on x_router, threaded to the ledger via a new calls.cost_basis column (ALTER ADD COLUMN IF NOT EXISTS).
  • cost_by_route aggregates n_reported; _cost_accuracy_rows labels each row reported|derived and only warns where the signal is real (reported). The card shows the signal tag; a derived row's drift renders muted and is never badged.
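
A rough sketch of the tiering and the warn gate described in the two bullets above. The four basis values and the "warn only on reported" rule come from this thread; the function signatures, the tier precedence, and the rule that maps `n_reported` to a row label are assumptions made for illustration, not the code in `shim.py` or `auth_proxy.py`.

```python
from typing import Optional

DRIFT_THRESHOLD = 0.15
MIN_CALLS = 20

def cost_basis(usage: dict, is_subscription: bool, has_list_price: bool) -> Optional[str]:
    """Classify where a call's cost_usd came from (hypothetical inputs and precedence)."""
    if is_subscription:
        return "subscription"
    if usage.get("cost") is not None:
        return "reported"   # provider's own usage.cost: independent signal
    if has_list_price:
        return "computed"   # derived from list price: tautological
    return None

def signal_label(n_reported: int) -> str:
    """Label a rolled-up row; treating any reported call as 'reported' is an assumption."""
    return "reported" if n_reported > 0 else "derived"

def should_warn(signal: str, deviation: float, calls: int) -> bool:
    """Badge drift only where the measured cost is an independent signal."""
    return (
        signal == "reported"
        and abs(deviation - 1.0) > DRIFT_THRESHOLD
        and calls >= MIN_CALLS
    )

# A derived row with a large apparent drift stays quiet; a reported +25% still flags.
derived_warns = should_warn(signal_label(0), deviation=1.5, calls=100)
reported_warns = should_warn(signal_label(30), deviation=1.25, calls=30)
```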

Axis 1 (_CACHE_READ_FACTOR). Imported from shim now instead of redefining 0.1 — agreed it's not just DRY, the expected-cost calc must track the billing factor or it silently shows false drift.
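
To make the false-drift risk concrete, here is a small worked example. The cost formula, the token counts, the prices, and the hypothetical future billing factor of 0.25 are all invented for illustration; only the 0.1 factor is taken from this thread.

```python
def cost_usd(tokens_in, tokens_cached, tokens_out,
             in_per_mtok, out_per_mtok, cache_read_factor):
    """Cost with cached input tokens billed at a discounted rate (sketch formula).

    Assumes tokens_in includes the cached tokens.
    """
    fresh = tokens_in - tokens_cached
    return (
        fresh * in_per_mtok
        + tokens_cached * in_per_mtok * cache_read_factor
        + tokens_out * out_per_mtok
    ) / 1e6

# A cache-heavy workload at list price: 1M input tokens, 800k of them cache reads.
workload = dict(tokens_in=1_000_000, tokens_cached=800_000, tokens_out=0,
                in_per_mtok=2.0, out_per_mtok=8.0)

# Suppose billing later changes its factor to 0.25 (hypothetical).
billing_factor = 0.25
measured = cost_usd(**workload, cache_read_factor=billing_factor)

# A panel that kept its own copy of 0.1 now disagrees with billing.
expected_stale = cost_usd(**workload, cache_read_factor=0.1)
deviation_stale = measured / expected_stale

# A panel that imports the shared factor stays aligned with billing.
expected_shared = cost_usd(**workload, cache_read_factor=billing_factor)
deviation_shared = measured / expected_shared
```

With the stale copy the deviation comes out near 1.43, well past the 15% threshold, even though the provider charged exactly list price; with the shared factor it is 1.0.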

New tests: _cost_basis tiers; a derived provider with a big apparent drift that does not warn; the reported provider still flags +25%. Suite 439/0.

Thanks — this turns the panel from "a number per provider" into "a number that means something only where it can."

@jmlago jmlago merged commit f92ba70 into main Jun 30, 2026
1 check passed
1 participant