feat(host-store): the route_* family derived on the fly (#4a+#4b+#4c) #41
…he fly (#4a)
The per-route reliability/latency EMAs were in-process dicts (per-pod, reset on
restart — §8 debt). Replace them with on-the-fly derivation from a raw
per-ATTEMPT ledger, so the measurement is fleet-consistent and survives restarts
while staying the host's job (not the algebra's).
- host_store: a `route_observations` table (one row per provider call the engine
made, including failed fallbacks — a grain `calls` lacks: `calls` is
per-request). `route_stats(window)` derives {route_key: success_rate,
latency_ms, count} in ONE aggregate query (not per-candidate); latency averages
successful calls only. The background writer is generalized to thunks so it
serves both the call ledger and route observations.
- llm_router_host: the fold writes a route observation (async, off the latency
path) instead of updating the in-process EMAs.
- sources/antseed.offers_sync + sources/openrouter.pricing: fetch route_stats
once and stamp success_rate/latency_ms per candidate from it; shim /x/market
perf reads it too.
- route_reliability is reduced to the `route_key` identity (reused by route_cache
/ tool_capability / the stamps); route_latency is deleted.
Design (ratified): on-the-fly, NO cache. Benchmarked against the prod RDS — RTT
0.8ms and real volume is tiny (hundreds of calls), so route_stats is a few ms; a
TTL memo would be premature (Axis 3). Scaling trigger noted: add a 1-2s memo only
if route_stats exceeds ~20ms (≈100x current traffic). The smoothing changes from
an EMA to a windowed average (default 15 min) — more honest (no restart reset).
§3 preserved: the host MEASURES, stamps success_rate/latency_ms as per-candidate
fields, route_key stays host-internal, nothing returns to the algebra.
Verification: full suite 405 passed, 3 skipped, 0 failed (3x, deterministic — an
async-writer test-isolation flake fixed by draining the write queue before the
pool is reset/closed).
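A minimal sketch of the single-aggregate derivation described in this commit message. It uses an in-memory SQLite table as a stand-in for the Postgres `route_observations` table so it can run anywhere. The column names follow the observation payload (`ts`, `provider_id`, `model_family`, `served_by`, `ok`, `latency_ms`); the SQL text, the `|`-joined key and the helpers `make_db` / `observe` are assumptions for illustration, not the PR's actual code.

```python
import sqlite3
import time

WINDOW_MS = 15 * 60 * 1000  # default window named in the commit message (15 min)


def make_db():
    # Hypothetical helper: in-memory SQLite stands in for Postgres.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE route_observations ("
        " ts INTEGER, provider_id TEXT, model_family TEXT,"
        " served_by TEXT, ok INTEGER, latency_ms INTEGER)"
    )
    return db


def observe(db, provider, family, served_by, ok, latency_ms, ts=None):
    # Hypothetical helper: one row per provider-call attempt.
    ts = int(time.time() * 1000) if ts is None else ts
    db.execute(
        "INSERT INTO route_observations VALUES (?, ?, ?, ?, ?, ?)",
        (ts, provider, family, served_by, int(ok), latency_ms),
    )


def route_stats(db, window_ms=WINDOW_MS, now_ms=None):
    """One GROUP BY over the window; latency averages successful attempts only."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    rows = db.execute(
        "SELECT provider_id, model_family, served_by,"
        "       AVG(ok),"
        "       AVG(CASE WHEN ok = 1 THEN latency_ms END),"
        "       COUNT(*)"
        " FROM route_observations WHERE ts > ?"
        " GROUP BY provider_id, model_family, served_by",
        (now_ms - window_ms,),
    ).fetchall()
    return {
        # Missing key parts become "" (assumed normalization).
        "|".join(part or "" for part in (prov, fam, sby)): {
            "success_rate": rate,
            "latency_ms": round(lat) if lat is not None else None,
            "count": n,
        }
        for prov, fam, sby, rate, lat, n in rows
    }
```

Because the whole map comes back from one query, a source can fetch it once per sync and stamp each candidate with a dictionary lookup instead of issuing one query per candidate.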
📝 Walkthrough
Replaces in-process EMA-based per-route reliability and latency tracking (
Changes
Route observations: Postgres persistence replacing in-process EMA
Sequence Diagram(s)
sequenceDiagram
participant Router as _fold_route_outcome
participant HS as host_store
participant Q as _write_q (background thread)
participant PG as route_observations (Postgres)
participant Consumer as shim / antseed / openrouter
Router->>HS: observe_route_call_async({ts, provider_id, model_family, served_by, ok, latency_ms})
HS->>Q: enqueue thunk(_insert_route_observation, snapshot)
Q->>PG: INSERT row
Consumer->>HS: route_stats(window_ms)
HS->>PG: SELECT aggregated success_rate, avg latency_ms, count WHERE ts > cutoff
PG-->>HS: per-route aggregated rows
HS-->>Consumer: dict[route_key, {success_rate, latency_ms, count}]
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
host_store.py (1)
496-496: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Drop the redundant `int()` here.
`round(lat)` already returns an `int` when `ndigits` is omitted, so this can be simplified to:
Proposed cleanup
- "latency_ms": int(round(lat)) if lat is not None else None,
+ "latency_ms": round(lat) if lat is not None else None,
Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@llm_router_host.py`:
- Around line 504-509: The synchronous execute() path is missing route
observation recording, so only _resolve_call_async ends up calling
_fold_route_outcome. Update the execute flow in llm_router_host.py so
_h_call_provider (or the code path that returns resp) also invokes
_fold_route_outcome(request, result, session=...) just like the async path,
ensuring route_observations is populated for both sync and async executions.
In `@shim.py`:
- Around line 471-476: The latency aggregate in the route summary is being
weighted by total attempts instead of the number of successful latency samples,
so update the aggregation logic around the route_stats/summary path to use a
dedicated latency sample count. Add a latency_count field in
host_store.route_stats() using the same success-and-latency filter, then change
the provider/family aggregation in shim.py to weight latency_ms by that new
latency_count rather than count. Keep the success_rate weighting unchanged and
use the existing route_stats() and aggregate code paths to locate the fix.
In `@tests/conftest.py`:
- Around line 35-38: Bound the write-queue drain so tests cannot hang forever
when a background insert stalls. Replace the direct host_store._write_q.join()
usage in the conftest fixture and the repeated joins in the async
concurrency/host/shim tests with a small shared helper that waits for _write_q
to empty using a deadline and then skips or fails cleanly if the writer does not
finish. Keep the helper near the test support code and have it coordinate with
_writer_loop/task_done semantics without closing the pool.
In `@tests/test_host_store.py`:
- Around line 256-278: Add a test that covers a route observation with a missing
component in the route key so the key normalization contract is locked down. Use
the existing host store helpers and `route_stats()` to assert that a route with
a missing `served_by` or family is still grouped under the same normalized key
as `_route_key()` would produce, instead of a raw `f"{prov}|{fam}|{sby}"` key.
Keep the new case close to the current
`test_route_stats_derives_reliability_and_latency` /
`test_route_stats_window_excludes_old_observations` coverage so regressions in
`host_store._route_key()` vs `host_store.route_stats()` are caught.
---
Nitpick comments:
In `@host_store.py`:
- Line 496: The latency mapping in host_store.py is doing a redundant int()
around round(lat). Simplify the expression in the host_store serialization logic
that sets latency_ms by removing the unnecessary int() wrapper and leaving the
rounded value as-is when lat is not None.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9f796a6f-f81e-41ca-a143-cdfc4c8a13be
📒 Files selected for processing (15)
- host_store.py
- llm_router_host.py
- route_latency.py
- route_reliability.py
- shim.py
- sources/antseed.py
- sources/openrouter.py
- tests/conftest.py
- tests/test_antseed_offers.py
- tests/test_async_concurrency.py
- tests/test_host.py
- tests/test_host_store.py
- tests/test_route_latency.py
- tests/test_route_reliability.py
- tests/test_shim.py
💤 Files with no reviewable changes (3)
- tests/test_route_latency.py
- route_latency.py
- tests/test_route_reliability.py
# Record the outcome here (not in the hook) so the streaming/override path
# (all of opencode's traffic, and every flow node) writes a route
# observation too, the host-owned perf the algebra reads (derived) and the
# market view surfaces (#15/#4a). Mocks record as well, so a mocked call
# is measured exactly like a live one.
_fold_route_outcome(request, result, session=session)
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Cover the synchronous execute() path too.
Line 509 records observations only for _resolve_call_async; public execute() still goes through _h_call_provider, which returns resp without calling _fold_route_outcome, so sync executions won’t populate route_observations after the EMA removal.
Proposed fix
 def _h_call_provider(self, request):
     py_req = _to_py(request) or {}
     provider = py_req.get("provider_id")
     model = py_req.get("model_family")
     if (provider, model) in self._mock_responses:
         resp = self._mock_responses[(provider, model)]
     else:
         resp = self._call_hook(py_req)
+    _fold_route_outcome(py_req, resp)
     return _to_lua(self.lua, resp)
sr_rows = [r for r in rows if r.get("success_rate") is not None]
lt_rows = [r for r in rows if r.get("latency_ms") is not None]
sr_calls = sum(r["count"] for r in sr_rows)
lt_calls = sum(r["count"] for r in lt_rows)
sr = sum(r["success_rate"] * r["count"] for r in sr_rows)
lt = sum(r["latency_ms"] * r["count"] for r in lt_rows)
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Don’t weight successful latency averages by total attempts.
Line 474 uses count, but route_stats().latency_ms excludes failed rows while count includes them. A route with many failures and one fast success will overweight its latency in the provider/family aggregate. Expose a latency sample count from route_stats() and weight with that instead.
Proposed direction
- lt_calls = sum(r["count"] for r in lt_rows)
+ lt_calls = sum(r["latency_count"] for r in lt_rows)
  sr = sum(r["success_rate"] * r["count"] for r in sr_rows)
- lt = sum(r["latency_ms"] * r["count"] for r in lt_rows)
+ lt = sum(r["latency_ms"] * r["latency_count"] for r in lt_rows)
Also add `latency_count = count(*) FILTER (WHERE ok AND latency_ms IS NOT NULL)` to `host_store.route_stats()`.
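A small worked example of why the weight matters, under made-up numbers: one route with a single fast success out of 100 attempts, and one route with 100 slow successes out of 100 attempts. The `weighted_latency` helper and the row values are hypothetical; only the `count` / `latency_count` / `latency_ms` field names come from the review.

```python
def weighted_latency(rows, weight_key):
    """Weighted mean of latency_ms over rows that carry a latency sample."""
    lt_rows = [r for r in rows if r.get("latency_ms") is not None]
    total = sum(r[weight_key] for r in lt_rows)
    if not total:
        return None
    return sum(r["latency_ms"] * r[weight_key] for r in lt_rows) / total


rows = [
    # 100 attempts, only 1 succeeded (fast)
    {"latency_ms": 50, "count": 100, "latency_count": 1},
    # 100 attempts, all 100 succeeded (slow)
    {"latency_ms": 1000, "count": 100, "latency_count": 100},
]

by_attempts = weighted_latency(rows, "count")         # 525.0
by_samples = weighted_latency(rows, "latency_count")  # 100050 / 101, about 990.6
```

Weighting by `latency_count` reproduces the true mean over the 101 successful samples, whereas weighting by `count` lets the mostly failing route pull the aggregate down to 525 ms.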
# Drain pending async writes from a prior test FIRST so they can't land
# after this truncate; don't close the pool (the background writer shares
# it, and closing it mid-write races and flakes isolation).
host_store._write_q.join()
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Bound the write-queue drain.
_write_q.join() can block forever if a background insert hangs before _writer_loop() reaches task_done(), so this "skip if Postgres is unreachable" fixture can wedge the whole suite instead of skipping/failing. The same raw join is now repeated in tests/test_async_concurrency.py Line 86, tests/test_host.py Lines 293/305/321, and tests/test_shim.py Line 464, so it’s worth centralizing behind a small helper with a deadline.
Suggested direction
+def _drain_host_store_writes(timeout_s=5.0):
+    import time
+    deadline = time.monotonic() + timeout_s
+    while host_store._write_q.unfinished_tasks and time.monotonic() < deadline:
+        time.sleep(0.01)
+    if host_store._write_q.unfinished_tasks:
+        pytest.fail("timed out draining host_store write queue")
+
 def host_store_clean():
     """Truncate the operational store before the test (isolation against the
     shared Postgres). Skips the test if Postgres is unreachable."""
     try:
-        host_store._write_q.join()
+        _drain_host_store_writes()
         host_store.truncate_all_for_tests()
def test_route_stats_derives_reliability_and_latency(store):
    from conftest import seed_route_obs
    # peerA: 4 ok + 1 fail -> success 0.8; latency avg over OK only
    seed_route_obs("antseed", "m", "peerA", ok=True, latency_ms=100, n=3)
    seed_route_obs("antseed", "m", "peerA", ok=True, latency_ms=300)  # ok=4, lat avg=150
    seed_route_obs("antseed", "m", "peerA", ok=False, latency_ms=9999)  # failure: latency ignored
    seed_route_obs("antseed", "m", "peerB", ok=True, latency_ms=50)
    st = store.route_stats()
    assert st["antseed|m|peerA"]["success_rate"] == 0.8
    assert st["antseed|m|peerA"]["latency_ms"] == 150  # avg(100,100,100,300), failure excluded
    assert st["antseed|m|peerA"]["count"] == 5
    assert st["antseed|m|peerB"]["success_rate"] == 1.0
    assert "antseed|m|missing" not in st


def test_route_stats_window_excludes_old_observations(store):
    from conftest import seed_route_obs
    import time
    now = int(time.time() * 1000)
    seed_route_obs("p", "m", "fresh", ok=True, ts=now)
    seed_route_obs("p", "m", "stale", ok=True, ts=now - 20 * 60 * 1000)  # 20 min ago
    assert set(store.route_stats(window_ms=15 * 60 * 1000)) == {"p|m|fresh"}
    assert set(store.route_stats(window_ms=30 * 60 * 1000)) == {"p|m|fresh", "p|m|stale"}
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Add a missing-component route-key case.
These tests only cover fully populated keys, so they won't catch the current contract drift between host_store._route_key() (missing parts normalize to "") and host_store.route_stats() (raw f"{prov}|{fam}|{sby}"). A route with a missing served_by or family would be looked up under a different key and stay unstamped.
Suggested test to lock the contract
+def test_route_stats_normalizes_keys_like_route_key(store):
+    from conftest import seed_route_obs
+    seed_route_obs("p", "m", None, ok=True)
+    st = store.route_stats()
+    assert "p|m|" in st
+    assert "p|m|None" not in st
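To make the drift concrete, here is a sketch of the two key constructions the comment contrasts. Both function bodies are assumed reconstructions based only on the review's description (missing parts normalize to `""` in one, a raw f-string in the other); the names `route_key` and `raw_key` are illustrative and not the module's actual code.

```python
def route_key(provider_id, model_family, served_by):
    # Assumed behaviour of the normalizing key: missing parts become "".
    return "|".join(part or "" for part in (provider_id, model_family, served_by))


def raw_key(provider_id, model_family, served_by):
    # The raw f-string form the review attributes to route_stats().
    return f"{provider_id}|{model_family}|{served_by}"
```

For a fully populated route both forms agree, which is why the existing tests pass. With `served_by=None` the first yields `p|m|` and the second `p|m|None`, so a lookup by the normalized key would miss the stats row.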
…s (#4b)
The per-session cache-affinity (route_cache) and usage meter
(route_session_meter) were in-process dicts (per-pod, reset on restart; §8
debt). Derive them on the fly from `calls` (per-request; it carries session_id /
status / provider / family / served_by / tokens / cost / caller, everything the
folds held), so the views are fleet-consistent without any per-session fold.
- host_store: hot_route(session) = the route of the session's most recent
SUCCESSFUL call; session_totals / session_warm / session_owner /
all_session_totals.
- shim: /v1/session + /x/sessions + the cache_hot_route resolution read these;
x_router.session_acc = committed session_totals + THIS in-flight call (not yet
in the ledger). The owner is the caller of the session's earliest call
(first-writer-wins, cross-consumer isolation), derived, with no explicit binding.
- llm_router_host: the fold no longer records cache/session state (the route
observation + the call ledger carry it); route_cache + route_session_meter are
deleted.
Trade-off (documented): session_acc's prior-call totals come from the
async-written ledger, so a burst of near-simultaneous calls on one session can
lag by the queue drain (ms). The object marks the meter "measurement only, NOT
billing", so this is acceptable; it is also a net improvement over the per-pod
in-process meter (which was already wrong across a fleet).
Verification: full suite 402 passed, 3 skipped, 0 failed (3x deterministic).
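The derivations named in this commit can be sketched over a plain list of ledger rows. The function names come from the commit message; the row field names, the `"ok"` status value and the in-memory list standing in for the `calls` table are assumptions, and the real versions are presumably SQL queries.

```python
def hot_route(calls, session_id):
    """Route of the session's most recent SUCCESSFUL call, or None."""
    ok = [c for c in calls if c["session_id"] == session_id and c["status"] == "ok"]
    if not ok:
        return None
    last = max(ok, key=lambda c: c["ts"])
    return (last["provider"], last["family"], last["served_by"])


def session_owner(calls, session_id):
    """Caller of the session's earliest call (first-writer-wins)."""
    mine = [c for c in calls if c["session_id"] == session_id]
    return min(mine, key=lambda c: c["ts"])["caller"] if mine else None


def session_acc(calls, session_id, in_flight):
    """Committed ledger totals plus the in-flight call not yet in the ledger."""
    mine = [c for c in calls if c["session_id"] == session_id]
    return {
        "calls": len(mine) + 1,
        "tokens": sum(c["tokens"] for c in mine) + in_flight["tokens"],
        "cost_usd": sum(c["cost_usd"] for c in mine) + in_flight["cost_usd"],
    }
```

Adding the in-flight call explicitly is what keeps `session_acc` correct for the current request even though its own ledger row is still sitting in the write queue.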
The learned per-route tool-incapability (route_tool_capability) was an
in-process dict (per-pod, reset on restart; §8 debt). Derive it on the fly from
the per-attempt route_observations ledger (#4a), which gains two signal columns.
- route_observations: + tools_requested, tool_calls_emitted (the fold writes them
on every attempt; only tools-requests carry signal).
- host_store.tool_incapable_routes(window): a route is incapable when it had at
least 20 tools-requests in the last 30 min with ZERO tool_calls. The window IS
the re-test horizon (a route ages out if it stops being tool-tested); any
tool_call in the window clears it. Capable is the default.
- sources/antseed.offers_sync: fetches the incapable set once and drops
supports_tools for those routes (was per-candidate is_capable()).
- route_tool_capability is deleted; the fold no longer observes it.
The behaviour shifts from "any tool_call proves capability PERMANENTLY" to a
windowed verdict: more honest (a route that degrades is re-detected) and
consistent with the EMA->windowed change in #4a.
Verification: full suite 402 passed, 3 skipped, 0 failed (3x deterministic).
This completes the route_* family migration (#4a reliability/latency, #4b
cache/session, #4c tool capability): all five in-process folds are now derived
on the fly, fleet-consistent, from the raw ledgers.
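A sketch of the windowed verdict, written over a list of observation dicts rather than the real table. The threshold (20) and window (30 min) come from the commit message; the `route_key` field on each row and the truthiness test on `tool_calls_emitted` are assumptions, since the message does not say whether that column is a flag or a count.

```python
MIN_TOOL_REQUESTS = 20      # threshold from the commit message
WINDOW_MS = 30 * 60 * 1000  # 30-minute window from the commit message


def tool_incapable_routes(observations, now_ms, window_ms=WINDOW_MS):
    """Routes with enough recent tools-requests and zero emitted tool calls."""
    requested, emitted = {}, {}
    for o in observations:
        # Only in-window attempts that actually requested tools carry signal.
        if o["ts"] <= now_ms - window_ms or not o["tools_requested"]:
            continue
        key = o["route_key"]
        requested[key] = requested.get(key, 0) + 1
        emitted[key] = emitted.get(key, 0) + (1 if o["tool_calls_emitted"] else 0)
    return {
        key
        for key, n in requested.items()
        if n >= MIN_TOOL_REQUESTS and emitted[key] == 0
    }
```

A route that is absent from the returned set is treated as capable, which matches the "capable is the default" rule: too few samples, or a single emitted tool call inside the window, both keep it out.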
…D, monitoring)
PR D, reframed from "auto-correct the ranking" to OBSERVABILITY, per the
decision that an automatic measured-cost correction would (a) form a routing
feedback loop, (b) be circular for compute-from-price providers, and (c) hide an
opaque routing change. Instead D surfaces the deviation and lets the operator
act, the same human-in-the-loop stance as C's manual lever.
A read-only Overview card showing, per provider, the measured effective $/Mtok
(from the calls ledger) vs the advertised list price, with a drift flag.
- host_store.cost_by_route: per (provider, family) ledger aggregate over a
window, derived by query (#41: store raw, derive at read; no in-process fold).
- _cost_accuracy_rows (pure, unit-tested): joins that with the live ranked price
from /x/runtime, dividing the fictitious price_multiplier back out so the
comparison is measured-spend vs advertised-list. Rolls up per provider, flags
drift > 15% with >= 20 calls, sorts worst-first.
- GET /dashboard/api/cost-accuracy (admin) + a card in the Overview. Never
touches routing.
The signal is strongest for providers that report their own cost (openrouter,
which reveals real discounts/surprises); a compute-from-price provider reads ~1.0
by construction, which is itself a useful "no independent signal here".
Suite 434/0; the join/deviation logic + the JS render verified directly.
…D, monitoring) (#61)
* feat(dashboard): cost-accuracy panel: measured spend vs list price (D, monitoring)
* fix(dashboard): cost-accuracy distinguishes real signal from tautology (review)
Review (Axis 8): the panel rendered tautological rows identically to real-signal
ones. For a compute-from-price provider (the direct openai/anthropic/google),
measured and expected both derive from the same list price → deviation ~1.0 by
construction, and any drift is reprice noise (ledger cost_usd sealed at the
price-of-then vs the current ema), not an effective discount. Flagging "drift" on
those is non-actionable and trains the operator to ignore the badge, the
opposite of what a monitoring panel wants.
Record the cost basis as a raw fact and use it:
- shim._cost_basis (single source of the cost tiering _executed_cost_usd already
used): 'subscription' | 'reported' (provider's own usage.cost, an INDEPENDENT
signal) | 'computed' (derived from list price, tautological) | None. Stamped on
x_router and threaded to the ledger; new calls.cost_basis column (ALTER ADD
COLUMN IF NOT EXISTS).
- cost_by_route aggregates n_reported; _cost_accuracy_rows labels each row
reported|derived and only warns where the signal is real (reported). The panel
shows the signal tag; derived drift renders muted, never badged.
Review (Axis 1): import _CACHE_READ_FACTOR from shim instead of redefining 0.1,
because the panel's expected cost must track the billing factor; a copy could
silently diverge into false drift.
Suite 439/0 (incl. _cost_basis tiers + a derived provider with big drift that
does NOT warn).
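A sketch of the warn rule these two commits describe: a row warns only when the drift exceeds 15%, the sample has at least 20 calls, and the cost signal is reported rather than derived. The thresholds come from the commit messages; the function name `cost_accuracy_row`, its parameters, and the rule "reported if `n_reported > 0`" are assumptions, because the messages do not spell out how a row is labelled.

```python
DRIFT_THRESHOLD = 0.15  # "drift > 15%" from the commit message
MIN_CALLS = 20          # ">= 20 calls" from the commit message


def cost_accuracy_row(provider, measured_per_mtok, list_per_mtok, n_calls, n_reported):
    # deviation = measured effective $/Mtok divided by advertised list $/Mtok
    deviation = measured_per_mtok / list_per_mtok if list_per_mtok else None
    # Assumed labelling rule: any provider-reported cost makes the row "reported".
    signal = "reported" if n_reported > 0 else "derived"
    drift = deviation is not None and abs(deviation - 1.0) > DRIFT_THRESHOLD
    return {
        "provider": provider,
        "deviation": deviation,
        "signal": signal,
        # Derived rows never warn: their drift is reprice noise, not a discount.
        "warn": bool(drift and n_calls >= MIN_CALLS and signal == "reported"),
    }
```

Under these assumptions a derived provider with a large deviation still renders its number but carries `warn=False`, which mirrors the "big drift that does NOT warn" case the suite is said to cover.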
#4: the `route_*` family → derived on the fly (no in-process folds)
The five per-route / per-session host folds were in-process dicts (per-pod, reset
on restart; §8 debt). This block replaces ALL of them with on-the-fly
derivation from the raw ledgers (`calls` from #3, plus a new per-attempt
`route_observations`), so the measurements are fleet-consistent and survive
restarts, while staying the host's job, never the algebra's (§3).
Three atomic commits:
#4a: reliability + latency (the hot path)
- `route_observations` (one RAW row per provider-call ATTEMPT, incl. failed
fallbacks; a grain `calls` lacks). `route_stats(window)` derives
`{route_key: {success_rate, latency_ms, count}}` in one aggregate query, not
per-candidate.
- `route_reliability` → just `route_key`; `route_latency` deleted.
- On the fly, NO cache: benchmarked against the prod RDS, RTT is 0.8ms and real
volume is tiny (~400 calls), so a derive is a few ms. (The 73ms figure was a
200k-rows synthetic, ~500x prod.) A TTL memo would be premature (Axis 3);
scaling trigger noted (add a 1-2s memo only if it exceeds ~20ms). EMA → windowed
average (15 min).
#4b: cache-affinity + the session meter
- `hot_route` / `session_totals` / `session_warm` / `session_owner` derived from
`calls` (per-request; carries session/route/outcome/tokens/cost/caller).
- `x_router.session_acc` = committed totals + the in-flight call; the owner is the
caller of the session's earliest call (first-writer-wins).
- `route_cache` + `route_session_meter` deleted. Trade-off documented: prior-call
totals come from the async-written ledger (ms lag for bursts), which is fine for
a "measurement, NOT billing" number and a net gain over the per-pod meter.
#4c: learned tool capability
- `route_observations` gains `tools_requested` / `tool_calls_emitted`.
- `tool_incapable_routes(window)` = routes with ≥20 tools-requests in 30 min and
zero tool_calls. `offers_sync` drops `supports_tools` for those.
- `route_tool_capability` deleted. "Permanently capable" → windowed verdict (a
degrading route is re-detected).
Invariants (§3 / anti-telos)
The host MEASURES and stamps `success_rate` / `latency_ms` / `supports_tools` as
per-candidate fields; `route_key` stays host-internal; nothing returns to the
algebra. Store raw, derive by query. Fail-soft throughout (DB error → field
default / capable / empty).
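One way to picture the fail-soft rule is a small decorator that swaps any read error for a neutral default. This is a hypothetical sketch: `fail_soft` and the two wrapped readers are illustrative names, and the PR may well handle errors inline instead.

```python
import functools


def fail_soft(default):
    """Return `default` (or `default()` if callable) when the wrapped read raises."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                return default() if callable(default) else default
        return wrapper
    return decorate


@fail_soft(dict)  # DB error -> empty stats, so candidates keep their field defaults
def route_stats_or_empty(query):
    return query()


@fail_soft(set)   # DB error -> no route marked incapable, i.e. every route "capable"
def incapable_routes_or_empty(query):
    return query()
```

Passing `dict` or `set` rather than a literal gives each failure a fresh empty container, so one caller mutating its result cannot leak into the next.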
Verification
Full suite 402 passed, 3 skipped, 0 failed, 3x deterministic per
increment. (Fixed an async-writer test-isolation flake: drain the write queue
before resetting/closing the pool.) Net negative lines (5 modules + their
EMA tests replaced by derivations).
Closes the in-process-state §8 debt for the `route_*` family. Deploys via the
normal cycle (router image → schema auto-creates `route_observations` + the new
columns; Python/psycopg, no Node, so no conninfo gotcha).