feat(host-store): the route_* family derived on the fly (#4a+#4b+#4c) by jmlago · Pull Request #41 · genlayerlabs/unhardcoded

jmlago · 2026-06-29T18:56:12Z

#4: the `route_*` family → derived on the fly (no in-process folds)

The five per-route / per-session host folds were in-process dicts (per-pod, reset
on restart — §8 debt). This block replaces ALL of them with on-the-fly
derivation from the raw ledgers (calls from #3, plus a new per-attempt
route_observations), so the measurements are fleet-consistent and survive
restarts — while staying the host's job, never the algebra's (§3).

Three atomic commits:

#4a — reliability + latency (the hot path)

route_observations (one RAW row per provider-call ATTEMPT, incl. failed
fallbacks — a grain calls lacks). route_stats(window) derives
{route_key: {success_rate, latency_ms, count}} in one aggregate query, not
per-candidate. route_reliability→just route_key; route_latency deleted.

On the fly, NO cache — benchmarked against the prod RDS: RTT 0.8ms and real
volume is tiny (~400 calls), so a derive is a few ms. (The 73ms figure was a
200k-rows synthetic, ~500x prod.) A TTL memo would be premature (Axis 3);
scaling trigger noted (add a 1-2s memo only if it exceeds ~20ms). EMA → windowed
average (15 min).
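
A minimal sketch of the one-query derivation described in #4a, using an in-memory SQLite database as a stand-in for Postgres. The column names follow the observation payload (ts, provider_id, model_family, served_by, ok, latency_ms); `make_store`, the `now_ms` parameter and the SQL text are assumptions for illustration, not the code in host_store.py.

```python
import sqlite3
import time


def make_store():
    """Create an in-memory stand-in for the route_observations ledger."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE route_observations ("
        " ts INTEGER, provider_id TEXT, model_family TEXT,"
        " served_by TEXT, ok INTEGER, latency_ms INTEGER)"
    )
    return conn


def route_stats(conn, window_ms=15 * 60 * 1000, now_ms=None):
    """Derive per-route stats for the window in ONE aggregate query."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    rows = conn.execute(
        "SELECT provider_id, model_family, served_by,"
        "       COUNT(*),"
        "       AVG(ok),"
        # Latency averages successful attempts only; failures contribute NULL.
        "       AVG(CASE WHEN ok = 1 THEN latency_ms END)"
        "  FROM route_observations"
        " WHERE ts > ?"
        " GROUP BY provider_id, model_family, served_by",
        (now_ms - window_ms,),
    ).fetchall()
    stats = {}
    for prov, fam, sby, count, rate, lat in rows:
        # Missing key parts normalize to "" (assumed key contract).
        key = "|".join(part or "" for part in (prov, fam, sby))
        stats[key] = {
            "success_rate": rate,
            "latency_ms": round(lat) if lat is not None else None,
            "count": count,
        }
    return stats
```

A caller fetches the dict once per sync and looks candidates up by key, so the cost is one round trip regardless of how many candidates are stamped.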

#4b — cache-affinity + the session meter

hot_route / session_totals / session_warm / session_owner derived from
calls (per-request; carries session/route/outcome/tokens/cost/caller).
x_router.session_acc = committed totals + the in-flight call; the owner is the
caller of the session's earliest call (first-writer-wins). route_cache +
route_session_meter deleted. Trade-off documented: prior-call totals come from
the async-written ledger (ms lag for bursts) — fine for a "measurement, NOT
billing" number, and a net gain over the per-pod meter.
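
The session views in #4b can be sketched as plain queries over the per-request ledger. This is an illustrative sketch on SQLite, not the host_store implementation: the `calls` column layout, the `'ok'` status value and the shape of the in-flight call dict are assumptions.

```python
import sqlite3


def make_calls_ledger():
    """In-memory stand-in for the per-request `calls` ledger (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls (ts INTEGER, session_id TEXT, caller TEXT,"
        " provider_id TEXT, model_family TEXT, served_by TEXT,"
        " status TEXT, tokens INTEGER, cost_usd REAL)"
    )
    return conn


def session_totals(conn, session_id):
    """Committed totals for a session, derived by query."""
    calls, tokens, cost = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(tokens), 0), COALESCE(SUM(cost_usd), 0.0)"
        " FROM calls WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    return {"calls": calls, "tokens": tokens, "cost_usd": cost}


def session_owner(conn, session_id):
    """First-writer-wins: the caller of the session's earliest call."""
    row = conn.execute(
        "SELECT caller FROM calls WHERE session_id = ? ORDER BY ts ASC LIMIT 1",
        (session_id,),
    ).fetchone()
    return row[0] if row else None


def hot_route(conn, session_id):
    """Route of the session's most recent SUCCESSFUL call, or None."""
    row = conn.execute(
        "SELECT provider_id, model_family, served_by FROM calls"
        " WHERE session_id = ? AND status = 'ok' ORDER BY ts DESC LIMIT 1",
        (session_id,),
    ).fetchone()
    return "|".join(part or "" for part in row) if row else None


def session_acc(conn, session_id, in_flight):
    """Committed totals plus the in-flight call that is not yet in the ledger."""
    totals = session_totals(conn, session_id)
    return {
        "calls": totals["calls"] + 1,
        "tokens": totals["tokens"] + in_flight["tokens"],
        "cost_usd": totals["cost_usd"] + in_flight["cost_usd"],
    }
```

Because the ledger is written asynchronously, `session_totals` can briefly trail a burst of calls; adding the in-flight call explicitly keeps the current request visible even when its row has not landed yet.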

#4c — learned tool capability

route_observations gains tools_requested / tool_calls_emitted.
tool_incapable_routes(window) = routes with ≥20 tools-requests in 30 min and
zero tool_calls. offers_sync drops supports_tools for those.
route_tool_capability deleted. "Permanently capable" → windowed verdict (a
degrading route is re-detected).
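
The windowed verdict in #4c reduces to a small rule, sketched here in plain Python over already-fetched rows. The row shape (`route_key`, `ts`, `tools_requested`, `tool_calls_emitted` as a count) and the parameter names are assumptions; the real derivation is described as a query in host_store.

```python
from collections import defaultdict


def tool_incapable_routes(observations, now_ms,
                          window_ms=30 * 60 * 1000, min_requests=20):
    """Routes with >= min_requests tools-requests in the window and zero tool_calls."""
    requested = defaultdict(int)
    emitted = defaultdict(int)
    cutoff = now_ms - window_ms
    for obs in observations:
        # Only in-window attempts that actually requested tools carry signal.
        if obs["ts"] <= cutoff or not obs["tools_requested"]:
            continue
        requested[obs["route_key"]] += 1
        emitted[obs["route_key"]] += int(obs["tool_calls_emitted"])
    # Capable is the default: a route must accumulate evidence to be flagged,
    # and any emitted tool_call inside the window clears it.
    return {
        key for key, n in requested.items()
        if n >= min_requests and emitted[key] == 0
    }
```

Since old rows fall out of the window, a flagged route that stops receiving tools-requests ages back to the default, and one that starts emitting tool_calls is cleared on the next derivation.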

Invariants (§3 / anti-telos)

The host MEASURES and stamps success_rate/latency_ms/supports_tools as
per-candidate fields; route_key stays host-internal; nothing returns to the
algebra. Store raw, derive by query. Fail-soft throughout (DB error → field
default / capable / empty).
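
One way to read the invariants above is as a stamping step that fails soft. The sketch below is an assumption-heavy illustration: `stamp_candidates`, the candidate dict shape, `None` as the field default, and removing the `supports_tools` key (rather than setting it to False) are all hypothetical choices, not the project's code.

```python
def stamp_candidates(candidates, fetch_stats, fetch_incapable):
    """Stamp measured fields onto candidates; any store error falls back to defaults."""
    try:
        stats = fetch_stats()          # e.g. one route_stats() call per sync
    except Exception:
        stats = {}                     # DB error -> field defaults
    try:
        incapable = fetch_incapable()  # e.g. one tool_incapable_routes() call
    except Exception:
        incapable = set()              # DB error -> every route stays capable

    stamped = []
    for cand in candidates:
        key = cand["route_key"]
        measured = stats.get(key, {})
        # route_key stays host-internal: it is not copied into the output.
        out = {k: v for k, v in cand.items() if k != "route_key"}
        out["success_rate"] = measured.get("success_rate")
        out["latency_ms"] = measured.get("latency_ms")
        if key in incapable:
            out.pop("supports_tools", None)
        stamped.append(out)
    return stamped
```

The two fetches happen once per call, outside the loop, which mirrors the "fetch once, stamp per candidate" pattern described for offers_sync.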

Verification

Full suite 402 passed, 3 skipped, 0 failed, 3x deterministic per
increment. (Fixed an async-writer test-isolation flake — drain the write queue
before resetting/closing the pool.) Net negative lines (5 modules + their
EMA tests replaced by derivations).

Closes the in-process-state §8 debt for the route_* family. Deploys via the
normal cycle (router image → schema auto-creates route_observations + the new
columns; Python/psycopg, no Node, so no conninfo gotcha).

…he fly (#4a)

The per-route reliability/latency EMAs were in-process dicts (per-pod, reset on restart; §8 debt). Replace them with on-the-fly derivation from a raw per-ATTEMPT ledger, so the measurement is fleet-consistent and survives restarts while staying the host's job (not the algebra's).

- host_store: a `route_observations` table (one row per provider call the engine made, including failed fallbacks; a grain `calls` lacks, since `calls` is per-request). `route_stats(window)` derives {route_key: success_rate, latency_ms, count} in ONE aggregate query (not per-candidate); latency averages successful calls only. The background writer is generalized to thunks so it serves both the call ledger and route observations.
- llm_router_host: the fold writes a route observation (async, off the latency path) instead of updating the in-process EMAs.
- sources/antseed.offers_sync + sources/openrouter.pricing: fetch route_stats once and stamp success_rate/latency_ms per candidate from it; shim /x/market perf reads it too.
- route_reliability is reduced to the `route_key` identity (reused by route_cache / tool_capability / the stamps); route_latency is deleted.

Design (ratified): on-the-fly, NO cache. Benchmarked against the prod RDS: RTT 0.8ms and real volume is tiny (hundreds of calls), so route_stats is a few ms; a TTL memo would be premature (Axis 3). Scaling trigger noted: add a 1-2s memo only if route_stats exceeds ~20ms (≈100x current traffic). The smoothing changes from an EMA to a windowed average (default 15 min), which is more honest (no restart reset).

§3 preserved: the host MEASURES, stamps success_rate/latency_ms as per-candidate fields, route_key stays host-internal, nothing returns to the algebra.

Verification: full suite 405 passed, 3 skipped, 0 failed (3x, deterministic; an async-writer test-isolation flake was fixed by draining the write queue before the pool is reset/closed).

coderabbitai · 2026-06-29T18:56:28Z

Warning

Review limit reached

@jmlago, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 19 minutes


Review details

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 95eebfe0-f390-48be-a122-abb3ee5b4c61

📥 Commits

Reviewing files that changed from the base of the PR and between c7d72f9 and 01dd295.

📒 Files selected for processing (14)

host_store.py
llm_router_host.py
route_cache.py
route_reliability.py
route_session_meter.py
route_tool_capability.py
shim.py
sources/antseed.py
tests/conftest.py
tests/test_antseed_offers.py
tests/test_metering.py
tests/test_route_cache.py
tests/test_route_tool_capability.py
tests/test_shim.py

📝 Walkthrough

Walkthrough

Replaces in-process EMA-based per-route reliability and latency tracking (route_reliability.py, route_latency.py) with a Postgres-backed route_observations table in host_store. Observations are written asynchronously via _fold_route_outcome; route_stats() aggregates them. All consumers (shim, sources, tests) are updated to read from host_store.route_stats().

Changes

Route observations: Postgres persistence replacing in-process EMA

| Layer | File(s) | Summary |
| --- | --- | --- |
| `host_store` schema, insertion, stats, and lifecycle | `host_store.py` | Adds `route_observations` table and indexes, extends retention pruning to millisecond-timestamped rows, refactors the async write queue to use thunks, adds `_insert_route_observation` and `observe_route_call_async`, adds `route_stats(window_ms)`, and updates `reset()`/`truncate_all_for_tests()` for test isolation. |
| Remove in-process EMA from `route_reliability` and `route_latency` | `route_reliability.py`, `route_latency.py` | Strips all EMA state, observers, and accessors from `route_reliability.py` (retaining only `route_key` and a no-op `reset`); deletes `route_latency.py` entirely. |
| Wire `_fold_route_outcome` to `host_store.observe_route_call_async` | `llm_router_host.py` | Updates imports and replaces direct `route_reliability`/`route_latency` observe calls in `_fold_route_outcome` with `host_store.observe_route_call_async` emitting a timestamped per-attempt payload. |
| Update shim and sources to read `host_store.route_stats()` | `shim.py`, `sources/antseed.py`, `sources/openrouter.py` | Replaces snapshot helper calls with `host_store.route_stats()` lookups; `shim.py`'s `_perf()` now aggregates weighted success rate and latency across matching route keys. |
| Test infrastructure and updated coverage | `tests/conftest.py`, `tests/test_host_store.py`, `tests/test_host.py`, `tests/test_antseed_offers.py`, `tests/test_async_concurrency.py`, `tests/test_shim.py`, `tests/test_route_reliability.py` | Updates `host_store_clean` fixture to drain the write queue before truncation; adds `seed_route_obs` helper; rewrites integration tests to drain `_write_q` and assert via `route_stats()`; adds new aggregation and windowing tests for `route_stats`; removes deleted EMA test files. |

Sequence Diagram(s)

sequenceDiagram
  participant Router as _fold_route_outcome
  participant HS as host_store
  participant Q as _write_q (background thread)
  participant PG as route_observations (Postgres)
  participant Consumer as shim / antseed / openrouter

  Router->>HS: observe_route_call_async({ts, provider_id, model_family, served_by, ok, latency_ms})
  HS->>Q: enqueue thunk(_insert_route_observation, snapshot)
  Q->>PG: INSERT row
  Consumer->>HS: route_stats(window_ms)
  HS->>PG: SELECT aggregated success_rate, avg latency_ms, count WHERE ts > cutoff
  PG-->>HS: per-route aggregated rows
  HS-->>Consumer: dict[route_key, {success_rate, latency_ms, count}]
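
The write path in the diagram (enqueue a thunk, a background thread runs it) can be sketched with the standard library alone. The names `writer_loop`, `start_writer` and `enqueue` are hypothetical, and swallowing writer errors is an assumption based on the fail-soft stance in the PR description.

```python
import queue
import threading
from functools import partial


def writer_loop(q):
    """Run queued thunks forever; one failing write must not stop the writer."""
    while True:
        thunk = q.get()
        try:
            thunk()
        except Exception:
            pass  # fail-soft: a lost observation is preferable to a dead writer
        finally:
            q.task_done()  # always mark done so q.join() can return


def start_writer(q):
    """Start the background writer as a daemon thread."""
    thread = threading.Thread(target=writer_loop, args=(q,), daemon=True)
    thread.start()
    return thread


def enqueue(q, insert_fn, payload):
    """Queue `insert_fn(snapshot)` off the caller's latency path.

    The payload is copied at enqueue time so later mutation by the caller
    cannot change what gets written.
    """
    q.put(partial(insert_fn, dict(payload)))
```

Because the queue holds callables rather than rows, the same writer can serve both the call ledger and route observations; tests can call `q.join()` to wait for pending writes before truncating.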

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

genlayerlabs/unhardcoded#36: Introduced the Postgres-backed host_store operational store schema, async write machinery, and reset/truncate lifecycle hooks that this PR extends with route_observations.


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly captures the main change: moving route reliability and latency derivation into host-store on the fly.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

host_store.py (1)

496-496: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Drop the redundant int() here.
round(lat) already returns an int when ndigits is omitted, so this can be simplified to:

Proposed cleanup

-                    "latency_ms": int(round(lat)) if lat is not None else None,
+                    "latency_ms": round(lat) if lat is not None else None,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@host_store.py` at line 496, The latency mapping in host_store.py is doing a
redundant int() around round(lat). Simplify the expression in the host_store
serialization logic that sets latency_ms by removing the unnecessary int()
wrapper and leaving the rounded value as-is when lat is not None.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@llm_router_host.py`:
- Around line 504-509: The synchronous execute() path is missing route
observation recording, so only _resolve_call_async ends up calling
_fold_route_outcome. Update the execute flow in llm_router_host.py so
_h_call_provider (or the code path that returns resp) also invokes
_fold_route_outcome(request, result, session=...) just like the async path,
ensuring route_observations is populated for both sync and async executions.

In `@shim.py`:
- Around line 471-476: The latency aggregate in the route summary is being
weighted by total attempts instead of the number of successful latency samples,
so update the aggregation logic around the route_stats/summary path to use a
dedicated latency sample count. Add a latency_count field in
host_store.route_stats() using the same success-and-latency filter, then change
the provider/family aggregation in shim.py to weight latency_ms by that new
latency_count rather than count. Keep the success_rate weighting unchanged and
use the existing route_stats() and aggregate code paths to locate the fix.

In `@tests/conftest.py`:
- Around line 35-38: Bound the write-queue drain so tests cannot hang forever
when a background insert stalls. Replace the direct host_store._write_q.join()
usage in the conftest fixture and the repeated joins in the async
concurrency/host/shim tests with a small shared helper that waits for _write_q
to empty using a deadline and then skips or fails cleanly if the writer does not
finish. Keep the helper near the test support code and have it coordinate with
_writer_loop/task_done semantics without closing the pool.

In `@tests/test_host_store.py`:
- Around line 256-278: Add a test that covers a route observation with a missing
component in the route key so the key normalization contract is locked down. Use
the existing host store helpers and `route_stats()` to assert that a route with
a missing `served_by` or family is still grouped under the same normalized key
as `_route_key()` would produce, instead of a raw `f"{prov}|{fam}|{sby}"` key.
Keep the new case close to the current
`test_route_stats_derives_reliability_and_latency` /
`test_route_stats_window_excludes_old_observations` coverage so regressions in
`host_store._route_key()` vs `host_store.route_stats()` are caught.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f796a6f-f81e-41ca-a143-cdfc4c8a13be

📥 Commits

Reviewing files that changed from the base of the PR and between f270400 and c7d72f9.

📒 Files selected for processing (15)

host_store.py
llm_router_host.py
route_latency.py
route_reliability.py
shim.py
sources/antseed.py
sources/openrouter.py
tests/conftest.py
tests/test_antseed_offers.py
tests/test_async_concurrency.py
tests/test_host.py
tests/test_host_store.py
tests/test_route_latency.py
tests/test_route_reliability.py
tests/test_shim.py

💤 Files with no reviewable changes (3)

tests/test_route_latency.py
route_latency.py
tests/test_route_reliability.py

coderabbitai · 2026-06-29T19:03:43Z

+        # Record the outcome here (not in the hook) so the streaming/override path —
+        # all of opencode's traffic, and every flow node — writes a route
+        # observation too, the host-owned perf the algebra reads (derived) and the
+        # market view surfaces (#15/#4a). Mocks record as well, so a mocked call
        # is measured exactly like a live one.
        _fold_route_outcome(request, result, session=session)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Cover the synchronous execute() path too.

Line 509 records observations only for _resolve_call_async; public execute() still goes through _h_call_provider, which returns resp without calling _fold_route_outcome, so sync executions won’t populate route_observations after the EMA removal.

Proposed fix

     def _h_call_provider(self, request):
         py_req = _to_py(request) or {}
         provider = py_req.get("provider_id")
         model = py_req.get("model_family")
         if (provider, model) in self._mock_responses:
             resp = self._mock_responses[(provider, model)]
         else:
             resp = self._call_hook(py_req)
+        _fold_route_outcome(py_req, resp)
         return _to_lua(self.lua, resp)


coderabbitai · 2026-06-29T19:03:43Z

+            sr_rows = [r for r in rows if r.get("success_rate") is not None]
+            lt_rows = [r for r in rows if r.get("latency_ms") is not None]
+            sr_calls = sum(r["count"] for r in sr_rows)
+            lt_calls = sum(r["count"] for r in lt_rows)
+            sr = sum(r["success_rate"] * r["count"] for r in sr_rows)
+            lt = sum(r["latency_ms"] * r["count"] for r in lt_rows)


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Don’t weight successful latency averages by total attempts.

Line 474 uses count, but route_stats().latency_ms excludes failed rows while count includes them. A route with many failures and one fast success will overweight its latency in the provider/family aggregate. Expose a latency sample count from route_stats() and weight with that instead.

Proposed direction

-            lt_calls = sum(r["count"] for r in lt_rows)
+            lt_calls = sum(r["latency_count"] for r in lt_rows)
             sr = sum(r["success_rate"] * r["count"] for r in sr_rows)
-            lt = sum(r["latency_ms"] * r["count"] for r in lt_rows)
+            lt = sum(r["latency_ms"] * r["latency_count"] for r in lt_rows)

Also add latency_count = count(*) FILTER (WHERE ok AND latency_ms IS NOT NULL) to host_store.route_stats().
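
A small worked example of why the weight matters. The two rows below are made-up numbers, and `latency_count` is the field proposed in this comment (it does not exist in `route_stats()` yet); `aggregate_latency` is a hypothetical helper mirroring the aggregation in the diff.

```python
def aggregate_latency(rows, weight_field):
    """Weighted mean of per-route latency_ms, weighted by the given field."""
    lt_rows = [r for r in rows if r.get("latency_ms") is not None]
    total = sum(r[weight_field] for r in lt_rows)
    if not total:
        return None
    return sum(r["latency_ms"] * r[weight_field] for r in lt_rows) / total


rows = [
    # 10 attempts, only 1 succeeded (fast): latency_ms is the mean of 1 sample.
    {"latency_ms": 50, "count": 10, "latency_count": 1},
    # 10 attempts, all succeeded (slow): latency_ms is the mean of 10 samples.
    {"latency_ms": 500, "count": 10, "latency_count": 10},
]

by_attempts = aggregate_latency(rows, "count")         # (50*10 + 500*10) / 20
by_samples = aggregate_latency(rows, "latency_count")  # (50*1 + 500*10) / 11
```

Weighting by `count` gives 275.0, treating the single fast success as if it were ten samples; weighting by `latency_count` gives about 459.1, which is the true mean of the eleven successful calls.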


coderabbitai · 2026-06-29T19:03:43Z

+        # Drain pending async writes from a prior test FIRST so they can't land
+        # after this truncate; don't close the pool (the background writer shares
+        # it — closing it mid-write races and flakes isolation).
+        host_store._write_q.join()


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Bound the write-queue drain.

_write_q.join() can block forever if a background insert hangs before _writer_loop() reaches task_done(), so this "skip if Postgres is unreachable" fixture can wedge the whole suite instead of skipping/failing. The same raw join is now repeated in tests/test_async_concurrency.py Line 86, tests/test_host.py Lines 293/305/321, and tests/test_shim.py Line 464, so it’s worth centralizing behind a small helper with a deadline.

Suggested direction

+def _drain_host_store_writes(timeout_s=5.0):
+    import time
+    deadline = time.monotonic() + timeout_s
+    while host_store._write_q.unfinished_tasks and time.monotonic() < deadline:
+        time.sleep(0.01)
+    if host_store._write_q.unfinished_tasks:
+        pytest.fail("timed out draining host_store write queue")
+
 def host_store_clean():
     """Truncate the operational store before the test (isolation against the
     shared Postgres). Skips the test if Postgres is unreachable."""
     try:
-        host_store._write_q.join()
+        _drain_host_store_writes()
         host_store.truncate_all_for_tests()


coderabbitai · 2026-06-29T19:03:43Z

+def test_route_stats_derives_reliability_and_latency(store):
+    from conftest import seed_route_obs
+    # peerA: 4 ok + 1 fail -> success 0.8; latency avg over OK only
+    seed_route_obs("antseed", "m", "peerA", ok=True, latency_ms=100, n=3)
+    seed_route_obs("antseed", "m", "peerA", ok=True, latency_ms=300)   # ok=4, lat avg=150
+    seed_route_obs("antseed", "m", "peerA", ok=False, latency_ms=9999)  # failure: latency ignored
+    seed_route_obs("antseed", "m", "peerB", ok=True, latency_ms=50)
+    st = store.route_stats()
+    assert st["antseed|m|peerA"]["success_rate"] == 0.8
+    assert st["antseed|m|peerA"]["latency_ms"] == 150   # avg(100,100,100,300), failure excluded
+    assert st["antseed|m|peerA"]["count"] == 5
+    assert st["antseed|m|peerB"]["success_rate"] == 1.0
+    assert "antseed|m|missing" not in st
+
+
+def test_route_stats_window_excludes_old_observations(store):
+    from conftest import seed_route_obs
+    import time
+    now = int(time.time() * 1000)
+    seed_route_obs("p", "m", "fresh", ok=True, ts=now)
+    seed_route_obs("p", "m", "stale", ok=True, ts=now - 20 * 60 * 1000)  # 20 min ago
+    assert set(store.route_stats(window_ms=15 * 60 * 1000)) == {"p|m|fresh"}
+    assert set(store.route_stats(window_ms=30 * 60 * 1000)) == {"p|m|fresh", "p|m|stale"}


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Add a missing-component route-key case.

These tests only cover fully populated keys, so they won't catch the current contract drift between host_store._route_key() (missing parts normalize to "") and host_store.route_stats() (raw f"{prov}|{fam}|{sby}"). A route with a missing served_by or family would be looked up under a different key and stay unstamped.

Suggested test to lock the contract

+def test_route_stats_normalizes_keys_like_route_key(store):
+    from conftest import seed_route_obs
+    seed_route_obs("p", "m", None, ok=True)
+    st = store.route_stats()
+    assert "p|m|" in st
+    assert "p|m|None" not in st
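
The drift this comment describes is easy to reproduce in isolation. The two helpers below are hypothetical stand-ins for the two key builders: one normalizes missing parts to "" (the contract attributed to `_route_key()`), the other formats the raw values as the f-string does.

```python
def normalized_route_key(provider_id, model_family, served_by):
    """Missing parts become "" so writers and readers agree on the key."""
    parts = (provider_id, model_family, served_by)
    return "|".join(str(part) if part else "" for part in parts)


def raw_route_key(provider_id, model_family, served_by):
    """Naive f-string key: a missing part is rendered as the text 'None'."""
    return f"{provider_id}|{model_family}|{served_by}"


lookup_key = normalized_route_key("p", "m", None)  # what a consumer looks up
stored_key = raw_route_key("p", "m", None)         # what a raw aggregate emits
```

Here `lookup_key` is `"p|m|"` while `stored_key` is `"p|m|None"`, so a lookup by the normalized key misses the aggregated row and the candidate stays unstamped. Building both sides from one shared function removes the mismatch.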


…s (#4b)

The per-session cache-affinity (route_cache) and usage meter (route_session_meter) were in-process dicts (per-pod, reset on restart; §8 debt). Derive them on the fly from `calls` (per-request; it carries session_id / status / provider / family / served_by / tokens / cost / caller, which is everything the folds held), so the views are fleet-consistent without any per-session fold.

- host_store: hot_route(session) = the route of the session's most recent SUCCESSFUL call; session_totals / session_warm / session_owner / all_session_totals.
- shim: /v1/session + /x/sessions + the cache_hot_route resolution read these; x_router.session_acc = committed session_totals + THIS in-flight call (not yet in the ledger). The owner is the caller of the session's earliest call (first-writer-wins, cross-consumer isolation), derived, with no explicit binding.
- llm_router_host: the fold no longer records cache/session state (the route observation + the call ledger carry it); route_cache + route_session_meter are deleted.

Trade-off (documented): session_acc's prior-call totals come from the async-written ledger, so a burst of near-simultaneous calls on one session can lag by the queue drain (ms). The object marks the meter "measurement only, NOT billing", so this is acceptable; it is also a net improvement over the per-pod in-process meter (which was already wrong across a fleet).

Verification: full suite 402 passed, 3 skipped, 0 failed (3x deterministic).

The learned per-route tool-incapability (route_tool_capability) was an in-process dict (per-pod, reset on restart; §8 debt). Derive it on the fly from the per-attempt route_observations ledger (#4a), which gains two signal columns.

- route_observations: + tools_requested, tool_calls_emitted (the fold writes them on every attempt; only tools-requests carry signal).
- host_store.tool_incapable_routes(window): a route is incapable when it had >= 20 tools-requests in the last 30 min with ZERO tool_calls. The window IS the re-test horizon (a route ages out if it stops being tool-tested); any tool_call in the window clears it. Capable is the default.
- sources/antseed.offers_sync: fetches the incapable set once and drops supports_tools for those routes (was per-candidate is_capable()).
- route_tool_capability is deleted; the fold no longer observes it.

The behaviour shifts from "any tool_call proves capability PERMANENTLY" to a windowed verdict, which is more honest (a route that degrades is re-detected) and consistent with the EMA->windowed change in #4a.

Verification: full suite 402 passed, 3 skipped, 0 failed (3x deterministic).

This completes the route_* family migration (#4a reliability/latency, #4b cache/session, #4c tool capability): all five in-process folds are now derived on the fly, fleet-consistent, from the raw ledgers.

…D, monitoring)

PR D, reframed from "auto-correct the ranking" to OBSERVABILITY, per the decision that an automatic measured-cost correction would (a) form a routing feedback loop, (b) be circular for compute-from-price providers, and (c) hide an opaque routing change. Instead D surfaces the deviation and lets the operator act: the same human-in-the-loop stance as C's manual lever.

A read-only Overview card showing, per provider, the measured effective $/Mtok (from the calls ledger) vs the advertised list price, with a drift flag.

- host_store.cost_by_route: per (provider, family) ledger aggregate over a window, derived by query (#41: store raw, derive at read, no in-process fold).
- _cost_accuracy_rows (pure, unit-tested): joins that with the live ranked price from /x/runtime, dividing the fictitious price_multiplier back out so the comparison is measured-spend vs advertised-list. Rolls up per provider, flags drift > 15% with >= 20 calls, sorts worst-first.
- GET /dashboard/api/cost-accuracy (admin) + a card in the Overview.

Never touches routing. The signal is strongest for providers that report their own cost (openrouter, where it reveals real discounts/surprises); a compute-from-price provider reads ~1.0 by construction, which is itself a useful "no independent signal here".

Suite 434/0; the join/deviation logic + the JS render verified directly.
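The deviation arithmetic described above can be sketched as follows. This is a simplified stand-in for `_cost_accuracy_rows`, not its actual code: the input row shape, the field names, and treating the price multiplier as a single global value are assumptions, and the per-(provider, family) roll-up is omitted.

```python
def cost_accuracy_rows(rows, price_multiplier=1.0,
                       drift_threshold=0.15, min_calls=20):
    """Compare measured $/Mtok against the advertised list price per provider."""
    out = []
    for r in rows:
        if r["tokens"] <= 0 or r["ranked_price_per_mtok"] <= 0:
            continue  # nothing to compare against
        measured = r["cost_usd"] * 1_000_000 / r["tokens"]
        # The ranked price includes the multiplier; divide it back out.
        advertised = r["ranked_price_per_mtok"] / price_multiplier
        ratio = measured / advertised
        out.append({
            "provider": r["provider"],
            "measured_per_mtok": measured,
            "advertised_per_mtok": advertised,
            "ratio": ratio,
            # Flag only with enough calls to trust the aggregate.
            "drift": r["calls"] >= min_calls and abs(ratio - 1.0) > drift_threshold,
        })
    # Worst deviation first.
    out.sort(key=lambda row: abs(row["ratio"] - 1.0), reverse=True)
    return out

sample = [
    {"provider": "x", "calls": 30, "tokens": 1_000_000, "cost_usd": 2.0,
     "ranked_price_per_mtok": 3.0},
    {"provider": "y", "calls": 25, "tokens": 2_000_000, "cost_usd": 3.0,
     "ranked_price_per_mtok": 3.0},
    {"provider": "z", "calls": 5, "tokens": 1_000_000, "cost_usd": 4.0,
     "ranked_price_per_mtok": 3.0},
]
rows = cost_accuracy_rows(sample, price_multiplier=1.5)
```

With a multiplier of 1.5 the advertised price is 2.0 $/Mtok for all three providers. Provider `y` measures 1.5 $/Mtok (ratio 0.75) and is flagged; `z` deviates most but has only 5 calls, so it sorts first without a flag.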

…D, monitoring) (#61)

* feat(dashboard): cost-accuracy panel - measured spend vs list price (D, monitoring)

* fix(dashboard): cost-accuracy distinguishes real signal from tautology (review)

Review (Axis 8): the panel rendered tautological rows identically to real-signal ones. For a compute-from-price provider (the direct openai/anthropic/google), measured and expected both derive from the same list price → deviation ~1.0 by construction, and any drift is reprice noise (ledger cost_usd sealed at the price-of-then vs the current ema), not an effective discount. Flagging "drift" on those is non-actionable and trains the operator to ignore the badge, which is the opposite of what a monitoring panel wants.

Record the cost basis as a raw fact and use it:

- shim._cost_basis (single source of the cost tiering _executed_cost_usd already used): 'subscription' | 'reported' (provider's own usage.cost, an INDEPENDENT signal) | 'computed' (derived from list price, tautological) | None. Stamped on x_router and threaded to the ledger; new calls.cost_basis column (ALTER ADD COLUMN IF NOT EXISTS).
- cost_by_route aggregates n_reported; _cost_accuracy_rows labels each row reported|derived and only warns where the signal is real (reported). The panel shows the signal tag; derived drift renders muted, never badged.

Review (Axis 1): import _CACHE_READ_FACTOR from shim instead of redefining 0.1. The panel's expected cost must track the billing factor; a copy could silently diverge into false drift.

Suite 439/0 (incl. _cost_basis tiers + a derived provider with big drift that does NOT warn).
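A minimal sketch of the tiering and the warn rule from the fix commit, under stated assumptions: the order in which the tiers are checked, the function signatures, and the rule that a row counts as "reported" when `n_reported > 0` are guesses made for illustration. The commit message names the four tiers and the reported|derived labels but not these details.

```python
def cost_basis(on_subscription, usage, list_price_per_mtok):
    """Classify where a call's cost figure came from (assumed tier order)."""
    if on_subscription:
        return "subscription"
    if usage.get("cost") is not None:
        return "reported"   # the provider's own usage.cost: independent signal
    if list_price_per_mtok is not None:
        return "computed"   # derived from list price: tautological
    return None

def signal_tag(n_reported):
    """Label an aggregated row; assumes any reported call makes the row 'reported'."""
    return "reported" if n_reported > 0 else "derived"

def should_warn(ratio, calls, n_reported, drift_threshold=0.15, min_calls=20):
    """Badge drift only where the measured cost is an independent signal."""
    return (signal_tag(n_reported) == "reported"
            and calls >= min_calls
            and abs(ratio - 1.0) > drift_threshold)
```

Under these assumptions a derived row with a large deviation, for example `should_warn(2.0, 100, 0)`, stays unbadged, which matches the test case the commit message calls out.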

coderabbitai Bot reviewed Jun 29, 2026

jmlago added 2 commits June 29, 2026 20:24

jmlago changed the title ~~feat(host-store): derive route reliability + latency on the fly (#4a)~~ feat(host-store): the route_* family derived on the fly (#4a+#4b+#4c) Jun 29, 2026

jmlago merged commit 093d2c9 into main Jun 29, 2026
1 check passed

This was referenced Jun 30, 2026

feat(dashboard): cost-accuracy panel — measured spend vs list price (D, monitoring) #61

Merged

feat(dynamic-pricing) F1: measured per-route economics (route_economics) #35

Closed

jmlago merged 3 commits into main from host-store-route-stats


-        # Record the outcome here (not in the hook) so the streaming/override path —
-        # all of opencode's traffic, and every flow node — writes a route
-        # observation too, the host-owned perf the algebra reads (derived) and the
-        # market view surfaces (#15/#4a). Mocks record as well, so a mocked call
-        # is measured exactly like a live one.
-        _fold_route_outcome(request, result, session=session)
+ def _h_call_provider(self, request):
+     py_req = _to_py(request) or {}
+     provider = py_req.get("provider_id")
+     model = py_req.get("model_family")
+     if (provider, model) in self._mock_responses:
+         resp = self._mock_responses[(provider, model)]
+     else:
+         resp = self._call_hook(py_req)
+     _fold_route_outcome(py_req, resp)
+     return _to_lua(self.lua, resp)

-        # Drain pending async writes from a prior test FIRST so they can't land
-        # after this truncate; don't close the pool (the background writer shares
-        # it — closing it mid-write races and flakes isolation).
-        host_store._write_q.join()
+def _drain_host_store_writes(timeout_s=5.0):
+    import time
+    deadline = time.monotonic() + timeout_s
+    while host_store._write_q.unfinished_tasks and time.monotonic() < deadline:
+        time.sleep(0.01)
+    if host_store._write_q.unfinished_tasks:
+        pytest.fail("timed out draining host_store write queue")
+
+
+def host_store_clean():
+    """Truncate the operational store before the test (isolation against the
+    shared Postgres). Skips the test if Postgres is unreachable."""
+    try:
+        # Drain pending async writes from a prior test FIRST so they can't land
+        # after this truncate; don't close the pool (the background writer shares
+        # it — closing it mid-write races and flakes isolation).
+        _drain_host_store_writes()
+        host_store.truncate_all_for_tests()
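
The change from `_write_q.join()` to a polling helper reads as a way to bound the wait: `queue.Queue.join()` takes no timeout, so a stuck writer would hang the fixture indefinitely. The sketch below shows the same pattern in isolation. The worker, the queue, and the names here are illustrative stand-ins rather than host_store internals, and it relies on the `unfinished_tasks` attribute that CPython's `queue.Queue` exposes.

```python
import queue
import threading
import time

def drain(q, timeout_s=5.0, poll_s=0.01):
    """Wait until every queued item is marked done; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while q.unfinished_tasks and time.monotonic() < deadline:
        time.sleep(poll_s)
    return q.unfinished_tasks == 0

def worker(q, sink):
    """Background writer stand-in: consume items until a None sentinel arrives."""
    while True:
        item = q.get()
        try:
            if item is None:
                return
            sink.append(item)
        finally:
            q.task_done()  # this is what decrements unfinished_tasks

q = queue.Queue()
written = []
threading.Thread(target=worker, args=(q, written), daemon=True).start()
for i in range(100):
    q.put(i)
drained = drain(q)

# A queue with no consumer never drains; the helper gives up instead of hanging.
stuck = queue.Queue()
stuck.put("orphan")
stuck_drained = drain(stuck, timeout_s=0.05)
```

Returning a boolean leaves the reaction to the caller; the fixture in the diff turns the timeout into a `pytest.fail` so a stuck writer shows up as a test failure rather than a hang.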

#4: the route_* family → derived on the fly (no in-process folds)

#4a — reliability + latency (the hot path)

#4b — cache-affinity + the session meter

#4c — learned tool capability

Invariants (§3 / anti-telos)

Verification
