feat(host-store): the route_* family derived on the fly (#4a+#4b+#4c)#41

Merged
jmlago merged 3 commits into main from host-store-route-stats
Jun 29, 2026

jmlago commented Jun 29, 2026

#4: the route_* family → derived on the fly (no in-process folds)

The five per-route / per-session host folds were in-process dicts (per-pod, reset
on restart; §8 debt). This block replaces ALL of them with on-the-fly derivation
from the raw ledgers (calls from #3, plus a new per-attempt route_observations),
so the measurements are fleet-consistent and survive restarts, while staying the
host's job, never the algebra's (§3).

Three atomic commits:

#4a — reliability + latency (the hot path)

route_observations stores one RAW row per provider-call ATTEMPT, including failed
fallbacks (a grain calls lacks, since calls is per-request). route_stats(window)
derives {route_key: {success_rate, latency_ms, count}} in one aggregate query, not
per-candidate. route_reliability is reduced to just route_key; route_latency is
deleted.
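
To make the "one aggregate query" shape concrete, here is a minimal sketch. It is not the PR's code: it uses an in-memory SQLite table instead of Postgres/psycopg so it runs standalone, and the column names and the `now_ms` parameter are assumptions taken from the field names mentioned in this PR. Success rate is averaged over all attempts, latency over successful attempts only.

```python
import sqlite3

WINDOW_MS = 15 * 60 * 1000  # the 15 min default window named in this PR


def route_stats(conn, now_ms, window_ms=WINDOW_MS):
    """Derive per-route stats from raw attempt rows with one GROUP BY query."""
    rows = conn.execute(
        """
        SELECT provider_id || '|' || model_family || '|' || served_by,
               AVG(ok),                                   -- all attempts
               AVG(CASE WHEN ok = 1 THEN latency_ms END), -- successes only
               COUNT(*)
        FROM route_observations
        WHERE ts > ?
        GROUP BY provider_id, model_family, served_by
        """,
        (now_ms - window_ms,),
    ).fetchall()
    return {
        key: {
            "success_rate": success_rate,
            "latency_ms": round(lat) if lat is not None else None,
            "count": count,
        }
        for key, success_rate, lat, count in rows
    }


# Hypothetical schema and data, mirroring the numbers used in the PR's tests.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE route_observations ("
    "ts INTEGER, provider_id TEXT, model_family TEXT, served_by TEXT, "
    "ok INTEGER, latency_ms INTEGER)"
)
now = 1_800_000_000_000
attempts = (
    [(now, "antseed", "m", "peerA", 1, 100)] * 3
    + [(now, "antseed", "m", "peerA", 1, 300)]
    + [(now, "antseed", "m", "peerA", 0, 9999)]  # failure: latency ignored
    + [(now - 20 * 60 * 1000, "antseed", "m", "peerB", 1, 50)]  # outside window
)
conn.executemany("INSERT INTO route_observations VALUES (?, ?, ?, ?, ?, ?)", attempts)

stats = route_stats(conn, now_ms=now)
print(stats)
```

Callers fetch the whole dict once and look up each candidate's route key in it, rather than issuing one query per candidate.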

On the fly, NO cache. Benchmarked against the prod RDS: RTT is 0.8ms and real
volume is tiny (~400 calls), so a derive takes a few ms. (The 73ms figure came
from a 200k-row synthetic run, ~500x prod.) A TTL memo would be premature
(Axis 3); scaling trigger noted (add a 1-2s memo only if a derive exceeds ~20ms).
Smoothing changes from an EMA to a windowed average (15 min).

#4b — cache-affinity + the session meter

hot_route / session_totals / session_warm / session_owner are derived from
calls (per-request; it carries session/route/outcome/tokens/cost/caller).
x_router.session_acc = committed totals + the in-flight call; the owner is the
caller of the session's earliest call (first-writer-wins). route_cache +
route_session_meter are deleted. Trade-off documented: prior-call totals come from
the async-written ledger (ms lag for bursts), which is fine for a "measurement, NOT
billing" number and a net gain over the per-pod meter.
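
A sketch of the session meter arithmetic described above, over a plain list standing in for the `calls` ledger. The row fields (`session_id`, `ts`, `caller`, `tokens`, `cost`) and the shape of the returned totals are assumptions for illustration; the real functions query Postgres.

```python
def session_totals(calls, session_id):
    """Committed totals: everything already written to the ledger."""
    totals = {"calls": 0, "tokens": 0, "cost": 0.0}
    for call in calls:
        if call["session_id"] == session_id:
            totals["calls"] += 1
            totals["tokens"] += call["tokens"]
            totals["cost"] += call["cost"]
    return totals


def session_owner(calls, session_id):
    """First-writer-wins: the caller of the session's earliest call."""
    mine = [c for c in calls if c["session_id"] == session_id]
    return min(mine, key=lambda c: c["ts"])["caller"] if mine else None


def session_acc(calls, session_id, in_flight):
    """Committed totals plus the in-flight call (not yet in the ledger)."""
    acc = session_totals(calls, session_id)
    acc["calls"] += 1
    acc["tokens"] += in_flight["tokens"]
    acc["cost"] += in_flight["cost"]
    return acc


ledger = [
    {"session_id": "s1", "ts": 10, "caller": "alice", "tokens": 100, "cost": 0.25},
    {"session_id": "s1", "ts": 20, "caller": "bob", "tokens": 50, "cost": 0.5},
    {"session_id": "s2", "ts": 5, "caller": "carol", "tokens": 7, "cost": 0.125},
]
acc = session_acc(ledger, "s1", in_flight={"tokens": 30, "cost": 0.25})
owner = session_owner(ledger, "s1")
print(acc, owner)
```

If an earlier call of a burst has not been drained to the ledger yet, `session_totals` simply misses it for a moment, which is the lag the trade-off refers to.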

#4c — learned tool capability

route_observations gains tools_requested / tool_calls_emitted.
tool_incapable_routes(window) = routes with ≥20 tools-requests in 30 min and
zero tool_calls. offers_sync drops supports_tools for those.
route_tool_capability deleted. "Permanently capable" → windowed verdict (a
degrading route is re-detected).
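
The windowed verdict can be sketched in plain Python as follows. The thresholds (20 requests, 30 min) come from the description above; the row layout, the precomputed `route_key` field, and treating `tool_calls_emitted` as a truthy flag are assumptions.

```python
from collections import Counter

WINDOW_MS = 30 * 60 * 1000
MIN_TOOL_REQUESTS = 20


def tool_incapable_routes(observations, now_ms,
                          window_ms=WINDOW_MS, min_requests=MIN_TOOL_REQUESTS):
    """Routes with enough tools-requests in the window and zero tool calls."""
    requested, emitted = Counter(), Counter()
    for obs in observations:
        # Only in-window attempts that actually requested tools carry signal.
        if obs["ts"] <= now_ms - window_ms or not obs["tools_requested"]:
            continue
        requested[obs["route_key"]] += 1
        if obs["tool_calls_emitted"]:
            emitted[obs["route_key"]] += 1
    return {
        key for key, n in requested.items()
        if n >= min_requests and emitted[key] == 0
    }


def attempts(route_key, n, ts, emitted=False):
    """Hypothetical seeding helper: n identical tools-requesting attempts."""
    row = {"route_key": route_key, "ts": ts,
           "tools_requested": True, "tool_calls_emitted": emitted}
    return [row] * n


now = 1_800_000_000_000
observations = (
    attempts("p|m|silent", 20, now)                   # 20 requests, no tool call
    + attempts("p|m|works", 24, now)
    + attempts("p|m|works", 1, now, emitted=True)     # one tool call clears it
    + attempts("p|m|quiet", 19, now)                  # below the threshold
    + attempts("p|m|old", 20, now - 31 * 60 * 1000)   # aged out of the window
)
incapable = tool_incapable_routes(observations, now)
print(incapable)
```

Capable stays the default: a route only lands in the set while the evidence against it is both sufficient and recent.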

Invariants (§3 / anti-telos)

The host MEASURES and stamps success_rate/latency_ms/supports_tools as
per-candidate fields; route_key stays host-internal; nothing returns to the
algebra. Store raw, derive by query. Fail-soft throughout (DB error → field
default / capable / empty).
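
One way to express the fail-soft rule is a small decorator that turns any store error into the neutral default for that derivation. This is only a sketch: the decorator name `fail_soft` and the simulated failures are hypothetical, and the PR may implement the same behaviour inline.

```python
import functools
import logging

log = logging.getLogger("host_store_sketch")


def fail_soft(default_factory):
    """Wrap a derivation so a store error yields a neutral default instead."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                log.warning("%s failed; using default", fn.__name__, exc_info=True)
                return default_factory()
        return wrapper
    return decorate


@fail_soft(dict)  # no stats: candidates keep their field defaults
def route_stats():
    raise ConnectionError("simulated DB outage")


@fail_soft(set)   # empty incapable set: every route is treated as capable
def tool_incapable_routes():
    raise ConnectionError("simulated DB outage")


stats = route_stats()
incapable = tool_incapable_routes()
print(stats, incapable)
```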

Verification

Full suite 402 passed, 3 skipped, 0 failed, 3x deterministic per
increment. (Fixed an async-writer test-isolation flake — drain the write queue
before resetting/closing the pool.) Net negative lines (5 modules + their
EMA tests replaced by derivations).

Closes the in-process-state §8 debt for the route_* family. Deploys via the
normal cycle (router image → schema auto-creates route_observations + the new
columns; Python/psycopg, no Node, so no conninfo gotcha).

…he fly (#4a)

The per-route reliability/latency EMAs were in-process dicts (per-pod, reset on
restart — §8 debt). Replace them with on-the-fly derivation from a raw
per-ATTEMPT ledger, so the measurement is fleet-consistent and survives restarts
while staying the host's job (not the algebra's).

- host_store: a `route_observations` table (one row per provider call the engine
  made, including failed fallbacks; `calls` lacks this grain because it is
  per-request). `route_stats(window)` derives {route_key: {success_rate,
  latency_ms, count}} in ONE aggregate query (not per-candidate); latency averages
  successful calls only. The background writer is generalized to thunks so it
  serves both the call ledger and route observations.
- llm_router_host: the fold writes a route observation (async, off the latency
  path) instead of updating the in-process EMAs.
- sources/antseed.offers_sync + sources/openrouter.pricing: fetch route_stats
  once and stamp success_rate/latency_ms per candidate from it; shim /x/market
  perf reads it too.
- route_reliability is reduced to the `route_key` identity (reused by route_cache
  / tool_capability / the stamps); route_latency is deleted.

Design (ratified): on-the-fly, NO cache. Benchmarked against the prod RDS — RTT
0.8ms and real volume is tiny (hundreds of calls), so route_stats is a few ms; a
TTL memo would be premature (Axis 3). Scaling trigger noted: add a 1-2s memo only
if route_stats exceeds ~20ms (≈100x current traffic). The smoothing changes from
an EMA to a windowed average (default 15 min) — more honest (no restart reset).
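
A small illustration of why the windowed average has no restart reset. The EMA smoothing factor `alpha=0.2` and the sample rows are made up for the example; only the 15 min window comes from this PR.

```python
WINDOW_MS = 15 * 60 * 1000


def ema_update(prev, sample, alpha=0.2):
    """In-process EMA step; alpha is a hypothetical smoothing factor."""
    return sample if prev is None else alpha * sample + (1 - alpha) * prev


def windowed_average(rows, now_ms, window_ms=WINDOW_MS):
    """Plain mean of the samples whose timestamp falls inside the window."""
    recent = [lat for ts, lat in rows if ts > now_ms - window_ms]
    return sum(recent) / len(recent) if recent else None


now = 1_800_000_000_000
minute = 60 * 1000
ledger = [(now - 20 * minute, 1000), (now - 10 * minute, 100), (now - minute, 300)]

# EMA: the state is a variable in one pod's memory.
state = None
for _, latency in ledger:
    state = ema_update(state, latency)
state_after_restart = None  # a restarted pod starts from nothing

# Windowed average: recomputed from stored rows, so any pod gets the same answer.
avg = windowed_average(ledger, now)
print(state, state_after_restart, avg)
```

The 20-minute-old sample still drags the EMA upward, while the windowed average drops it once it ages out.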

§3 preserved: the host MEASURES, stamps success_rate/latency_ms as per-candidate
fields, route_key stays host-internal, nothing returns to the algebra.

Verification: full suite 405 passed, 3 skipped, 0 failed (3x, deterministic — an
async-writer test-isolation flake fixed by draining the write queue before the
pool is reset/closed).

coderabbitai Bot commented Jun 29, 2026

Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 95eebfe0-f390-48be-a122-abb3ee5b4c61

📥 Commits

Reviewing files that changed from the base of the PR and between c7d72f9 and 01dd295.

📒 Files selected for processing (14)
  • host_store.py
  • llm_router_host.py
  • route_cache.py
  • route_reliability.py
  • route_session_meter.py
  • route_tool_capability.py
  • shim.py
  • sources/antseed.py
  • tests/conftest.py
  • tests/test_antseed_offers.py
  • tests/test_metering.py
  • tests/test_route_cache.py
  • tests/test_route_tool_capability.py
  • tests/test_shim.py

Walkthrough

Replaces in-process EMA-based per-route reliability and latency tracking (route_reliability.py, route_latency.py) with a Postgres-backed route_observations table in host_store. Observations are written asynchronously via _fold_route_outcome; route_stats() aggregates them. All consumers (shim, sources, tests) are updated to read from host_store.route_stats().

Changes

Route observations: Postgres persistence replacing in-process EMA

| Layer / File(s) | Summary |
|---|---|
| host_store schema, insertion, stats, and lifecycle (host_store.py) | Adds route_observations table and indexes, extends retention pruning to millisecond-timestamped rows, refactors the async write queue to use thunks, adds _insert_route_observation and observe_route_call_async, adds route_stats(window_ms), and updates reset()/truncate_all_for_tests() for test isolation. |
| Remove in-process EMA from route_reliability and route_latency (route_reliability.py, route_latency.py) | Strips all EMA state, observers, and accessors from route_reliability.py (retaining only route_key and a no-op reset); deletes route_latency.py entirely. |
| Wire _fold_route_outcome to host_store.observe_route_call_async (llm_router_host.py) | Updates imports and replaces direct route_reliability/route_latency observe calls in _fold_route_outcome with host_store.observe_route_call_async emitting a timestamped per-attempt payload. |
| Update shim and sources to read host_store.route_stats() (shim.py, sources/antseed.py, sources/openrouter.py) | Replaces snapshot helper calls with host_store.route_stats() lookups; shim.py's _perf() now aggregates weighted success rate and latency across matching route keys. |
| Test infrastructure and updated coverage (tests/conftest.py, tests/test_host_store.py, tests/test_host.py, tests/test_antseed_offers.py, tests/test_async_concurrency.py, tests/test_shim.py, tests/test_route_reliability.py) | Updates host_store_clean fixture to drain the write queue before truncation; adds seed_route_obs helper; rewrites integration tests to drain _write_q and assert via route_stats(); adds new aggregation and windowing tests for route_stats; removes deleted EMA test files. |

Sequence Diagram(s)

sequenceDiagram
  participant Router as _fold_route_outcome
  participant HS as host_store
  participant Q as _write_q (background thread)
  participant PG as route_observations (Postgres)
  participant Consumer as shim / antseed / openrouter

  Router->>HS: observe_route_call_async({ts, provider_id, model_family, served_by, ok, latency_ms})
  HS->>Q: enqueue thunk(_insert_route_observation, snapshot)
  Q->>PG: INSERT row
  Consumer->>HS: route_stats(window_ms)
  HS->>PG: SELECT aggregated success_rate, avg latency_ms, count WHERE ts > cutoff
  PG-->>HS: per-route aggregated rows
  HS-->>Consumer: dict[route_key, {success_rate, latency_ms, count}]

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • genlayerlabs/unhardcoded#36: Introduced the Postgres-backed host_store operational store schema, async write machinery, and reset/truncate lifecycle hooks that this PR extends with route_observations.

Poem

🐇 Hoppity-hop, the EMA is gone,
No more in-memory state to rely on!
Each route's fate now lives in Postgres rows,
Aggregated neatly wherever stats flows.
The rabbit stamps latency into the store—
Reliable routes, forevermore! 🗄️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly captures the main change: moving route reliability and latency derivation into host-store on the fly. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
host_store.py (1)

496-496: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Drop the redundant int() here.
round(lat) already returns an int when ndigits is omitted, so this can be simplified to:

Proposed cleanup
-                    "latency_ms": int(round(lat)) if lat is not None else None,
+                    "latency_ms": round(lat) if lat is not None else None,
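
A quick check of the claim behind this nitpick, using only the standard library. The `Decimal` case is included on the assumption that the database driver may hand back `Decimal` for an SQL average; the sample values are arbitrary.

```python
from decimal import Decimal

samples = (149.6, 150.5, Decimal("149.6"))

for lat in samples:
    rounded = round(lat)               # ndigits omitted
    assert isinstance(rounded, int)    # already an int for float and Decimal
    assert rounded == int(round(lat))  # so the extra int() changes nothing

results = [round(lat) for lat in samples]
print(results)  # [150, 150, 150]
```

Note that `round(150.5)` is 150, not 151: Python rounds halves to the nearest even integer, with or without the `int()` wrapper.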
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@host_store.py` at line 496, The latency mapping in host_store.py is doing a
redundant int() around round(lat). Simplify the expression in the host_store
serialization logic that sets latency_ms by removing the unnecessary int()
wrapper and leaving the rounded value as-is when lat is not None.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@llm_router_host.py`:
- Around line 504-509: The synchronous execute() path is missing route
observation recording, so only _resolve_call_async ends up calling
_fold_route_outcome. Update the execute flow in llm_router_host.py so
_h_call_provider (or the code path that returns resp) also invokes
_fold_route_outcome(request, result, session=...) just like the async path,
ensuring route_observations is populated for both sync and async executions.

In `@shim.py`:
- Around line 471-476: The latency aggregate in the route summary is being
weighted by total attempts instead of the number of successful latency samples,
so update the aggregation logic around the route_stats/summary path to use a
dedicated latency sample count. Add a latency_count field in
host_store.route_stats() using the same success-and-latency filter, then change
the provider/family aggregation in shim.py to weight latency_ms by that new
latency_count rather than count. Keep the success_rate weighting unchanged and
use the existing route_stats() and aggregate code paths to locate the fix.

In `@tests/conftest.py`:
- Around line 35-38: Bound the write-queue drain so tests cannot hang forever
when a background insert stalls. Replace the direct host_store._write_q.join()
usage in the conftest fixture and the repeated joins in the async
concurrency/host/shim tests with a small shared helper that waits for _write_q
to empty using a deadline and then skips or fails cleanly if the writer does not
finish. Keep the helper near the test support code and have it coordinate with
_writer_loop/task_done semantics without closing the pool.

In `@tests/test_host_store.py`:
- Around line 256-278: Add a test that covers a route observation with a missing
component in the route key so the key normalization contract is locked down. Use
the existing host store helpers and `route_stats()` to assert that a route with
a missing `served_by` or family is still grouped under the same normalized key
as `_route_key()` would produce, instead of a raw `f"{prov}|{fam}|{sby}"` key.
Keep the new case close to the current
`test_route_stats_derives_reliability_and_latency` /
`test_route_stats_window_excludes_old_observations` coverage so regressions in
`host_store._route_key()` vs `host_store.route_stats()` are caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f796a6f-f81e-41ca-a143-cdfc4c8a13be

📥 Commits

Reviewing files that changed from the base of the PR and between f270400 and c7d72f9.

📒 Files selected for processing (15)
  • host_store.py
  • llm_router_host.py
  • route_latency.py
  • route_reliability.py
  • shim.py
  • sources/antseed.py
  • sources/openrouter.py
  • tests/conftest.py
  • tests/test_antseed_offers.py
  • tests/test_async_concurrency.py
  • tests/test_host.py
  • tests/test_host_store.py
  • tests/test_route_latency.py
  • tests/test_route_reliability.py
  • tests/test_shim.py
💤 Files with no reviewable changes (3)
  • tests/test_route_latency.py
  • route_latency.py
  • tests/test_route_reliability.py

Comment thread llm_router_host.py
Comment on lines +504 to 509
# Record the outcome here (not in the hook) so the streaming/override path —
# all of opencode's traffic, and every flow node — writes a route
# observation too, the host-owned perf the algebra reads (derived) and the
# market view surfaces (#15/#4a). Mocks record as well, so a mocked call
# is measured exactly like a live one.
_fold_route_outcome(request, result, session=session)

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Cover the synchronous execute() path too.

Line 509 records observations only for _resolve_call_async; public execute() still goes through _h_call_provider, which returns resp without calling _fold_route_outcome, so sync executions won’t populate route_observations after the EMA removal.

Proposed fix
 def _h_call_provider(self, request):
     py_req = _to_py(request) or {}
     provider = py_req.get("provider_id")
     model = py_req.get("model_family")
     if (provider, model) in self._mock_responses:
         resp = self._mock_responses[(provider, model)]
     else:
         resp = self._call_hook(py_req)
+    _fold_route_outcome(py_req, resp)
     return _to_lua(self.lua, resp)

Comment thread shim.py
Comment on lines +471 to +476
sr_rows = [r for r in rows if r.get("success_rate") is not None]
lt_rows = [r for r in rows if r.get("latency_ms") is not None]
sr_calls = sum(r["count"] for r in sr_rows)
lt_calls = sum(r["count"] for r in lt_rows)
sr = sum(r["success_rate"] * r["count"] for r in sr_rows)
lt = sum(r["latency_ms"] * r["count"] for r in lt_rows)

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Don’t weight successful latency averages by total attempts.

Line 474 uses count, but route_stats().latency_ms excludes failed rows while count includes them. A route with many failures and one fast success will overweight its latency in the provider/family aggregate. Expose a latency sample count from route_stats() and weight with that instead.

Proposed direction
-            lt_calls = sum(r["count"] for r in lt_rows)
+            lt_calls = sum(r["latency_count"] for r in lt_rows)
             sr = sum(r["success_rate"] * r["count"] for r in sr_rows)
-            lt = sum(r["latency_ms"] * r["count"] for r in lt_rows)
+            lt = sum(r["latency_ms"] * r["latency_count"] for r in lt_rows)

Also add latency_count = count(*) FILTER (WHERE ok AND latency_ms IS NOT NULL) to host_store.route_stats().
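
A worked example of the difference, with made-up numbers. `latency_count` is the field proposed in this comment, not something `route_stats()` returns today; `weighted_latency` is a hypothetical helper mirroring the aggregation in shim.py.

```python
def weighted_latency(rows, weight_field):
    """Weighted mean of per-route latency_ms, weighted by the given field."""
    lt_rows = [r for r in rows if r.get("latency_ms") is not None]
    total = sum(r[weight_field] for r in lt_rows)
    if not total:
        return None
    return sum(r["latency_ms"] * r[weight_field] for r in lt_rows) / total


rows = [
    # 9 failures + 1 fast success: latency_ms is the mean of ONE sample
    {"success_rate": 0.1, "latency_ms": 100, "count": 10, "latency_count": 1},
    # 10 successes: latency_ms is the mean of TEN samples
    {"success_rate": 1.0, "latency_ms": 500, "count": 10, "latency_count": 10},
]

by_attempts = weighted_latency(rows, "count")         # (100*10 + 500*10) / 20
by_samples = weighted_latency(rows, "latency_count")  # (100*1 + 500*10) / 11
print(by_attempts, round(by_samples))  # 300.0 464
```

The true mean over the eleven successful calls is about 464 ms; weighting by total attempts reports 300 ms because the single fast success counts ten times.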

Comment thread tests/conftest.py
Comment on lines +35 to +38
# Drain pending async writes from a prior test FIRST so they can't land
# after this truncate; don't close the pool (the background writer shares
# it — closing it mid-write races and flakes isolation).
host_store._write_q.join()

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Bound the write-queue drain.

_write_q.join() can block forever if a background insert hangs before _writer_loop() reaches task_done(), so this "skip if Postgres is unreachable" fixture can wedge the whole suite instead of skipping/failing. The same raw join is now repeated in tests/test_async_concurrency.py Line 86, tests/test_host.py Lines 293/305/321, and tests/test_shim.py Line 464, so it’s worth centralizing behind a small helper with a deadline.

Suggested direction
+def _drain_host_store_writes(timeout_s=5.0):
+    import time
+    deadline = time.monotonic() + timeout_s
+    while host_store._write_q.unfinished_tasks and time.monotonic() < deadline:
+        time.sleep(0.01)
+    if host_store._write_q.unfinished_tasks:
+        pytest.fail("timed out draining host_store write queue")
+
 def host_store_clean():
     """Truncate the operational store before the test (isolation against the
     shared Postgres). Skips the test if Postgres is unreachable."""
     try:
-        host_store._write_q.join()
+        _drain_host_store_writes()
         host_store.truncate_all_for_tests()

Comment thread tests/test_host_store.py
Comment on lines +256 to +278
def test_route_stats_derives_reliability_and_latency(store):
    from conftest import seed_route_obs
    # peerA: 4 ok + 1 fail -> success 0.8; latency avg over OK only
    seed_route_obs("antseed", "m", "peerA", ok=True, latency_ms=100, n=3)
    seed_route_obs("antseed", "m", "peerA", ok=True, latency_ms=300)    # ok=4, lat avg=150
    seed_route_obs("antseed", "m", "peerA", ok=False, latency_ms=9999)  # failure: latency ignored
    seed_route_obs("antseed", "m", "peerB", ok=True, latency_ms=50)
    st = store.route_stats()
    assert st["antseed|m|peerA"]["success_rate"] == 0.8
    assert st["antseed|m|peerA"]["latency_ms"] == 150  # avg(100,100,100,300), failure excluded
    assert st["antseed|m|peerA"]["count"] == 5
    assert st["antseed|m|peerB"]["success_rate"] == 1.0
    assert "antseed|m|missing" not in st


def test_route_stats_window_excludes_old_observations(store):
    from conftest import seed_route_obs
    import time
    now = int(time.time() * 1000)
    seed_route_obs("p", "m", "fresh", ok=True, ts=now)
    seed_route_obs("p", "m", "stale", ok=True, ts=now - 20 * 60 * 1000)  # 20 min ago
    assert set(store.route_stats(window_ms=15 * 60 * 1000)) == {"p|m|fresh"}
    assert set(store.route_stats(window_ms=30 * 60 * 1000)) == {"p|m|fresh", "p|m|stale"}

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Add a missing-component route-key case.

These tests only cover fully populated keys, so they won't catch the current contract drift between host_store._route_key() (missing parts normalize to "") and host_store.route_stats() (raw f"{prov}|{fam}|{sby}"). A route with a missing served_by or family would be looked up under a different key and stay unstamped.

Suggested test to lock the contract
+def test_route_stats_normalizes_keys_like_route_key(store):
+    from conftest import seed_route_obs
+    seed_route_obs("p", "m", None, ok=True)
+    st = store.route_stats()
+    assert "p|m|" in st
+    assert "p|m|None" not in st
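
The drift this test guards against is easy to reproduce in isolation. The sketch below assumes "missing" means `None` and that normalization maps it to an empty string, as described above; `route_key` and `raw_key` are hypothetical stand-ins for the two code paths.

```python
def route_key(provider_id, model_family, served_by):
    """Normalized key: a missing component becomes an empty string."""
    parts = (provider_id, model_family, served_by)
    return "|".join("" if part is None else str(part) for part in parts)


def raw_key(prov, fam, sby):
    """Raw f-string key: a missing component is rendered as the text 'None'."""
    return f"{prov}|{fam}|{sby}"


normalized = route_key("p", "m", None)
raw = raw_key("p", "m", None)
print(normalized, raw)  # p|m| p|m|None

# A stats dict built with raw keys is never hit by a normalized lookup.
stats = {raw: {"success_rate": 1.0}}
found = stats.get(normalized)
```

Building both the write-side and read-side keys through one function removes this class of mismatch.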

jmlago added 2 commits June 29, 2026 20:24
…s (#4b)

The per-session cache-affinity (route_cache) and usage meter (route_session_meter)
were in-process dicts (per-pod, reset on restart — §8 debt). Derive them on the
fly from `calls` (per-request; it carries session_id / status / provider /
family / served_by / tokens / cost / caller — everything the folds held), so the
views are fleet-consistent without any per-session fold.

- host_store: hot_route(session) = the route of the session's most recent
  SUCCESSFUL call; session_totals / session_warm / session_owner / all_session_totals.
- shim: /v1/session + /x/sessions + the cache_hot_route resolution read these;
  x_router.session_acc = committed session_totals + THIS in-flight call (not yet
  in the ledger). The owner is the caller of the session's earliest call
  (first-writer-wins, cross-consumer isolation), derived — no explicit binding.
- llm_router_host: the fold no longer records cache/session state (the route
  observation + the call ledger carry it); route_cache + route_session_meter are
  deleted.
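The derivations listed above can be sketched against an in-memory SQLite table. This is only an illustration of the "store raw, derive at read" idea: the real schema and queries are not shown in this conversation, so the column names `ts_ms` and `cost_usd`, the status value `'ok'`, and the helper `make_calls_table` are assumptions.

```python
import sqlite3


def make_calls_table():
    """Create a hypothetical, simplified `calls` ledger in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE calls (
               ts_ms INTEGER, session_id TEXT, status TEXT, provider TEXT,
               family TEXT, served_by TEXT, tokens INTEGER, cost_usd REAL,
               caller TEXT)"""
    )
    return conn


def hot_route(conn, session_id):
    """Route of the session's most recent successful call, or None."""
    row = conn.execute(
        """SELECT provider, family, served_by FROM calls
           WHERE session_id = ? AND status = 'ok'
           ORDER BY ts_ms DESC LIMIT 1""",
        (session_id,),
    ).fetchone()
    return None if row is None else "|".join(part or "" for part in row)


def session_totals(conn, session_id):
    """Committed usage totals for one session (zeros if it has no calls)."""
    calls, tokens, cost = conn.execute(
        """SELECT COUNT(*), COALESCE(SUM(tokens), 0), COALESCE(SUM(cost_usd), 0.0)
           FROM calls WHERE session_id = ?""",
        (session_id,),
    ).fetchone()
    return {"calls": calls, "tokens": tokens, "cost_usd": cost}


def session_owner(conn, session_id):
    """Caller of the session's earliest call (first-writer-wins), or None."""
    row = conn.execute(
        """SELECT caller FROM calls WHERE session_id = ?
           ORDER BY ts_ms ASC LIMIT 1""",
        (session_id,),
    ).fetchone()
    return None if row is None else row[0]
```

Because every view is a query over the shared ledger, two pods asking about the same session get the same answer, which the per-pod dicts could not guarantee.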

Trade-off (documented): session_acc's prior-call totals come from the async-written
ledger, so a burst of near-simultaneous calls on one session can lag by the queue
drain (ms). The object marks the meter "measurement only, NOT billing", so this is
acceptable; it is also a net improvement over the per-pod in-process meter (which
was already wrong across a fleet).
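The "committed totals + the in-flight call" composition behind `x_router.session_acc` amounts to a field-wise sum. The sketch below assumes a totals dict with `calls`, `tokens`, and `cost_usd` keys; the function name and field names are hypothetical, not the shim's actual shape.

```python
TOTAL_FIELDS = ("calls", "tokens", "cost_usd")


def session_acc(committed, in_flight):
    """Add the current call, which is not yet in the ledger, to committed totals.

    `committed` may lag by the async queue drain during a burst, so the
    result is a measurement, not a billing figure.
    """
    return {
        field: committed.get(field, 0) + in_flight.get(field, 0)
        for field in TOTAL_FIELDS
    }
```

If two calls on one session finish within the queue-drain lag, each sees the other as neither committed nor in flight, which is exactly the brief undercount the trade-off describes.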

Verification: full suite 402 passed, 3 skipped, 0 failed (3x deterministic).

The learned per-route tool-incapability (route_tool_capability) was an in-process
dict (per-pod, reset on restart — §8 debt). Derive it on the fly from the
per-attempt route_observations ledger (#4a), which gains two signal columns.

- route_observations: + tools_requested, tool_calls_emitted (the fold writes them
  on every attempt; only tools-requests carry signal).
- host_store.tool_incapable_routes(window): a route is incapable when it had
  >= 20 tools-requests in the last 30 min with ZERO tool_calls. The window IS the
  re-test horizon (a route ages out if it stops being tool-tested); any tool_call
  in the window clears it. Capable is the default.
- sources/antseed.offers_sync: fetches the incapable set once and drops
  supports_tools for those routes (was per-candidate is_capable()).
- route_tool_capability is deleted; the fold no longer observes it.

The behaviour shifts from "any tool_call proves capability PERMANENTLY" to a
windowed verdict — more honest (a route that degrades is re-detected), consistent
with the EMA->windowed change in #4a.
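The windowed verdict can be expressed as a single grouped query. The sketch below runs it on an in-memory SQLite table; the thresholds (20 requests, 30 minutes) come from the commit message, while the `ts_ms` and `route_key` columns, the integer encoding of `tools_requested`, and the helper `make_observations_table` are assumptions.

```python
import sqlite3

MIN_TOOL_REQUESTS = 20          # from the commit message
WINDOW_MS = 30 * 60 * 1000      # 30 minutes, from the commit message


def make_observations_table():
    """Create a hypothetical, simplified route_observations ledger in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE route_observations (
               ts_ms INTEGER, route_key TEXT,
               tools_requested INTEGER, tool_calls_emitted INTEGER)"""
    )
    return conn


def tool_incapable_routes(conn, now_ms, window_ms=WINDOW_MS,
                          min_requests=MIN_TOOL_REQUESTS):
    """Routes with enough tools-requests in the window and zero tool calls.

    Rows outside the window are ignored, so a route ages out once it stops
    being tool-tested, and any tool call inside the window clears it.
    """
    rows = conn.execute(
        """SELECT route_key FROM route_observations
           WHERE ts_ms >= ? AND tools_requested = 1
           GROUP BY route_key
           HAVING COUNT(*) >= ? AND SUM(tool_calls_emitted) = 0""",
        (now_ms - window_ms, min_requests),
    ).fetchall()
    return {row[0] for row in rows}
```

A caller such as the offers sync could fetch this set once per pass and drop `supports_tools` for its members, instead of asking about each candidate separately.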

Verification: full suite 402 passed, 3 skipped, 0 failed (3x deterministic). This
completes the route_* family migration (#4a reliability/latency, #4b
cache/session, #4c tool capability) — all five in-process folds now derived
on the fly, fleet-consistent, from the raw ledgers.
@jmlago jmlago changed the title feat(host-store): derive route reliability + latency on the fly (#4a) feat(host-store): the route_* family derived on the fly (#4a+#4b+#4c) Jun 29, 2026
@jmlago jmlago merged commit 093d2c9 into main Jun 29, 2026
1 check passed
jmlago added a commit that referenced this pull request Jun 30, 2026
…D, monitoring)

PR D, reframed from "auto-correct the ranking" to OBSERVABILITY, per the decision
that an automatic measured-cost correction would (a) form a routing feedback loop,
(b) be circular for compute-from-price providers, and (c) hide an opaque routing
change. Instead D surfaces the deviation and lets the operator act — the same
human-in-the-loop stance as C's manual lever.

A read-only Overview card showing, per provider, the measured effective $/Mtok
(from the calls ledger) vs the advertised list price, with a drift flag.
- host_store.cost_by_route: per (provider, family) ledger aggregate over a window,
  derived by query (#41: store raw, derive at read — no in-process fold).
- _cost_accuracy_rows (pure, unit-tested): joins that with the live ranked price
  from /x/runtime, dividing the fictitious price_multiplier back out so the
  comparison is measured-spend vs advertised-list. Rolls up per provider, flags
  drift > 15% with >= 20 calls, sorts worst-first.
- GET /dashboard/api/cost-accuracy (admin) + a card in the Overview.

Never touches routing. The signal is strongest for providers that report their own
cost (openrouter — reveals real discounts/surprises); a compute-from-price provider
reads ~1.0 by construction, which is itself a useful "no independent signal here".
Suite 434/0; the join/deviation logic + the JS render verified directly.
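The join and drift flag described above can be sketched as a pure function. This is not the repository's `_cost_accuracy_rows`; the input shapes, the field names, and the reading of "drift > 15%" as an absolute deviation from 1.0 are assumptions, while the 15% threshold and the 20-call minimum come from the commit message.

```python
DRIFT_THRESHOLD = 0.15   # from the commit message
MIN_CALLS = 20           # from the commit message


def cost_accuracy_rows(measured, ranked):
    """Compare measured effective $/Mtok with the advertised list price.

    measured: {provider: {"cost_usd": float, "tokens": int, "calls": int}}
    ranked:   {provider: {"price_per_mtok": float, "price_multiplier": float}}
    Both shapes are hypothetical simplifications of the ledger aggregate and
    the runtime ranking.
    """
    rows = []
    for provider, m in measured.items():
        r = ranked.get(provider)
        if r is None or m["tokens"] == 0:
            continue  # nothing to compare against
        # Divide the multiplier back out to recover the advertised list price.
        list_price = r["price_per_mtok"] / r["price_multiplier"]
        effective = m["cost_usd"] * 1_000_000 / m["tokens"]
        deviation = effective / list_price
        rows.append({
            "provider": provider,
            "effective_per_mtok": effective,
            "list_per_mtok": list_price,
            "deviation": deviation,
            "drift": abs(deviation - 1.0) > DRIFT_THRESHOLD
                     and m["calls"] >= MIN_CALLS,
        })
    # Worst-first: the largest distance from 1.0 comes first.
    rows.sort(key=lambda row: abs(row["deviation"] - 1.0), reverse=True)
    return rows
```

The function only reads its inputs and returns rows for display, in keeping with the panel's rule that it never touches routing.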
jmlago added a commit that referenced this pull request Jun 30, 2026
…D, monitoring) (#61)

* feat(dashboard): cost-accuracy panel — measured spend vs list price (D, monitoring)

* fix(dashboard): cost-accuracy distinguishes real signal from tautology (review)

Review (Axis 8): the panel rendered tautological rows identically to real-signal
ones. For a compute-from-price provider (the direct openai/anthropic/google),
measured and expected both derive from the same list price → deviation ~1.0 by
construction, and any drift is reprice noise (ledger cost_usd sealed at the
price-of-then vs the current ema), not an effective discount. Flagging "drift" on
those is non-actionable and trains the operator to ignore the badge — the opposite
of what a monitoring panel wants.

Record the cost basis as a raw fact and use it:
- shim._cost_basis (single source of the cost tiering _executed_cost_usd already
  used): 'subscription' | 'reported' (provider's own usage.cost — INDEPENDENT
  signal) | 'computed' (derived from list price — tautological) | None. Stamped on
  x_router and threaded to the ledger; new calls.cost_basis column (ALTER ADD
  COLUMN IF NOT EXISTS).
- cost_by_route aggregates n_reported; _cost_accuracy_rows labels each row
  reported|derived and only warns where the signal is real (reported). The panel
  shows the signal tag; derived drift renders muted, never badged.
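The cost-basis tiers and the "warn only on real signal" rule can be sketched as follows. This is an illustration, not `shim._cost_basis`: the argument names, the precedence order (subscription, then reported, then computed), and the rule that a row counts as `reported` when `n_reported > 0` are all assumptions.

```python
def cost_basis(is_subscription, reported_cost, list_price):
    """Classify where a call's cost figure came from.

    Assumed precedence: a subscription call first, then the provider's own
    usage.cost, then a value derived from the list price, else None.
    """
    if is_subscription:
        return "subscription"
    if reported_cost is not None:
        return "reported"   # independent signal
    if list_price is not None:
        return "computed"   # tautological against the list price
    return None


def signal_tag(n_reported):
    """Label an aggregate row by whether any call carried a reported cost."""
    return "reported" if n_reported > 0 else "derived"


def should_warn(deviation, calls, n_reported, threshold=0.15, min_calls=20):
    """Badge drift only when the row carries an independent signal."""
    drifted = abs(deviation - 1.0) > threshold and calls >= min_calls
    return drifted and signal_tag(n_reported) == "reported"
```

Under these assumptions a compute-from-price provider with a large deviation still returns `False` from `should_warn`, matching the suite's case of a derived provider with big drift that does not warn.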

Review (Axis 1): import _CACHE_READ_FACTOR from shim instead of redefining 0.1 —
the panel's expected cost must track the billing factor; a copy could silently
diverge into false drift.

Suite 439/0 (incl. _cost_basis tiers + a derived provider with big drift that does
NOT warn).