feat(dashboard): cost-accuracy panel — measured spend vs list price (D, monitoring)#61
Conversation
|
Warning Review limit reached
Next review available in: 17 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…D, monitoring) PR D, reframed from "auto-correct the ranking" to OBSERVABILITY, per the decision that an automatic measured-cost correction would (a) form a routing feedback loop, (b) be circular for compute-from-price providers, and (c) hide an opaque routing change. Instead D surfaces the deviation and lets the operator act — the same human-in-the-loop stance as C's manual lever. A read-only Overview card showing, per provider, the measured effective $/Mtok (from the calls ledger) vs the advertised list price, with a drift flag. - host_store.cost_by_route: per (provider, family) ledger aggregate over a window, derived by query (#41: store raw, derive at read — no in-process fold). - _cost_accuracy_rows (pure, unit-tested): joins that with the live ranked price from /x/runtime, dividing the fictitious price_multiplier back out so the comparison is measured-spend vs advertised-list. Rolls up per provider, flags drift > 15% with >= 20 calls, sorts worst-first. - GET /dashboard/api/cost-accuracy (admin) + a card in the Overview. Never touches routing. The signal is strongest for providers that report their own cost (openrouter — reveals real discounts/surprises); a compute-from-price provider reads ~1.0 by construction, which is itself a useful "no independent signal here". Suite 434/0; the join/deviation logic + the JS render verified directly.
…y (review) Review (Axis 8): the panel rendered tautological rows identically to real-signal ones. For a compute-from-price provider (the direct openai/anthropic/google), measured and expected both derive from the same list price → deviation ~1.0 by construction, and any drift is reprice noise (ledger cost_usd sealed at the price-of-then vs the current ema), not an effective discount. Flagging "drift" on those is non-actionable and trains the operator to ignore the badge — the opposite of what a monitoring panel wants. Record the cost basis as a raw fact and use it: - shim._cost_basis (single source of the cost tiering _executed_cost_usd already used): 'subscription' | 'reported' (provider's own usage.cost — INDEPENDENT signal) | 'computed' (derived from list price — tautological) | None. Stamped on x_router and threaded to the ledger; new calls.cost_basis column (ALTER ADD COLUMN IF NOT EXISTS). - cost_by_route aggregates n_reported; _cost_accuracy_rows labels each row reported|derived and only warns where the signal is real (reported). The panel shows the signal tag; derived drift renders muted, never badged. Review (Axis 1): import _CACHE_READ_FACTOR from shim instead of redefining 0.1 — the panel's expected cost must track the billing factor; a copy could silently diverge into false drift. Suite 439/0 (incl. _cost_basis tiers + a derived provider with big drift that does NOT warn).
902e75b to
7e1cda8
Compare
|
Both correct — addressed in the latest commit (rebased on C/#60). Axis 8 (signal vs tautology). You're right: the panel mixed rows where measured cost is an independent signal (reported) with rows where it's tautological (computed from the same list price → ~1.0 by construction; any drift is reprice noise from sealed-then vs ema-now, not a discount), and could badge a non-actionable "drift" on a direct provider. Fixed by recording the cost basis as a raw fact and using it:
Axis 1 ( New tests: Thanks — this turns the panel from "a number per provider" into "a number that means something only where it can." |
What
PR D, reframed as monitoring (not the original auto-correction). A read-only "Cost accuracy" card in the dashboard Overview: per provider, the measured effective $/Mtok (from the
callsledger) vs the advertised list price, with a drift flag.Why monitoring, not auto-correction
Auto-correcting the ranking from measured cost would: (a) form a routing feedback loop (cheaper-measured → more traffic → changes the measured cost), (b) be circular for compute-from-price providers (their
cost_usdis derived from list price), and (c) change routing opaquely from a derived number. So D observes and flags; the operator decides (adjust C's manual multiplier, investigate) — the same human-in-the-loop stance as C.How
host_store.cost_by_route(window_s)— per(provider, family)ledger aggregate (calls, tokens in/out/cached, cost_usd), derived by query (feat(host-store): the route_* family derived on the fly (#4a+#4b+#4c) #41: store raw, derive at read; no in-process fold)._cost_accuracy_rows(pure, unit-tested) — joins that with the live ranked price from/x/runtime, dividing the fictitiousprice_multiplierback out so the comparison is measured-spend vs advertised-list. Computes expected cost with the same cache-read discount as billing, rolls up per provider, flags drift > 15% with ≥ 20 calls, sorts worst-first.GET /dashboard/api/cost-accuracy(admin) + a card in the Overview.Honest about signal strength
cost_usdis authoritative → drift reveals real discounts/surprises. High value.cost_usdderives from list → reads ~1.0 by construction. That's itself informative ("no independent signal here") and a sanity check that billing matches the scrape.Never touches routing
Read-only. No ranking/admission impact.
Verification
_cost_accuracy_rows: +25% drift flags (≥20 calls), at-list reads no drift, the multiplier is divided out before comparison, unpriced routes are skipped, and a big drift with <20 calls does not warn.Closes the B/C/D pricing arc: list price → manual lever → measured reality-check.