[WIP] agentx#348

Draft
cquil11 wants to merge 108 commits into master from feat/agentx

cquil11 commented May 14, 2026

Summary

  • add per-agentic-point interactivity (1 / TPOT) and TTFT time-series charts
  • default both charts to P90 with independently selectable rolling P75/P90 lines over the trailing 50 profiling requests
  • add red cumulative P75/P90 convergence lines that follow the selected percentile
  • persist TPOT in versioned request timelines and ignore warmup, cancelled, missing, and invalid samples
  • show elapsed-from-start timestamps beside dataset flamegraph turns and subturns
  • show subagent headers as elapsed start-end ranges, with child timings as a fallback
  • standardize dataset distributions on P50/P75/P90/P95 guide lines and summaries
  • add a zero-preserving log histogram for uncached input tokens per request
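
The rolling and cumulative percentile lines in the summary could be computed roughly as below. This is a minimal sketch, not the PR's implementation: the `RequestSample` shape, the nearest-rank percentile, and the millisecond unit for TPOT are all assumptions made for illustration.

```typescript
// Hypothetical sample shape; field names are assumptions, not the PR's schema.
interface RequestSample {
  tpotMs: number | null; // time per output token in milliseconds (assumed unit)
  warmup?: boolean;
  cancelled?: boolean;
}

// Nearest-rank percentile of an unsorted array, q in [0, 1].
function percentile(values: number[], q: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1] ?? NaN; // NaN for an empty input
}

// Drop warmup, cancelled, missing, and invalid samples.
function validTpot(samples: RequestSample[]): number[] {
  const out: number[] = [];
  for (const s of samples) {
    if (s.warmup || s.cancelled) continue;
    if (s.tpotMs === null || !Number.isFinite(s.tpotMs) || s.tpotMs <= 0) continue;
    out.push(s.tpotMs);
  }
  return out;
}

// For each request index: percentile over the trailing `window` values
// (rolling line) and over everything seen so far (cumulative line).
function percentileSeries(values: number[], q: number, window = 50) {
  return values.map((_, i) => ({
    rolling: percentile(values.slice(Math.max(0, i + 1 - window), i + 1), q),
    cumulative: percentile(values.slice(0, i + 1), q),
  }));
}

// Interactivity as 1 / TPOT, expressed in tokens per second.
const interactivity = (tpotMs: number): number => 1000 / tpotMs;
```

The slices are recomputed per point, which is quadratic but simple; for a few hundred requests per point that is unlikely to matter.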

Data updates

  • backfilled all 746 stored request timelines to version 3 with TPOT populated
  • re-ingested both cc-traces-weka-062126 variants so flamegraph structures include timing
  • backfilled both dataset aggregates to chart-data v2 with ISL, OSL, and uncached-input percentiles
  • purged the API cache after each dataset refresh

Unofficial-run overlays cannot open the persisted agentic point-detail route because they do not have a benchmark_results id or stored request timeline. The new point-detail charts are therefore intentionally limited to DB-backed official points.

Validation

  • pnpm typecheck
  • pnpm lint
  • pnpm fmt
  • app unit suite: 118 files / 2195 tests passed
  • DB unit suite: 19 files / 294 tests passed
  • focused Cypress component: dataset distribution card, 4 tests passed
  • focused Cypress E2E: agentic time-series, flamegraph timing, and dataset distributions, 5 tests passed

cquil11 and others added 12 commits April 23, 2026 13:40
Adds agentic_traces scenario end-to-end:
- Schema migrations for agentic scenario, availability, and KV offload mode
- DB ingest/ETL + query updates to carry scenario, offload_mode, and
  server/theoretical cache-hit rates through to the API layer
- Frontend types, filters (GlobalFilterContext / InferenceContext /
  ChartControls), URL state, and tooltip rows for agentic-only fields
- ScatterGraph: subtle dashed halo on Pareto-frontier points that used
  KV offload so the tradeoff is visible at a glance
- ScatterGraph: include `offload_mode` in `buildPointConfigId` so d3's data
  join keeps both `on` and `off` variants for the same (config, conc).
  Without it, the second variant collapsed onto the first key, so FP8
  offload-on points (and their halos) silently disappeared.
- benchmark-mapper: handle older artifacts that emit `users`/`offload_mode`
  AND newer ones that emit `conc`/`offloading` (with 'none' → 'off' mapping).
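
To illustrate the `buildPointConfigId` fix above: a keyed join can only keep two variants apart if the key differs between them. The sketch below uses a plain `Set` as a stand-in for the join (it is not d3), and the `Point` shape and key format are hypothetical.

```typescript
// Hypothetical point shape; the real buildPointConfigId may use other fields.
interface Point {
  config: string;
  conc: number;
  offload_mode: "on" | "off";
}

// Key without offload_mode: both variants of the same (config, conc) collide.
const keyWithoutOffload = (p: Point): string => `${p.config}|${p.conc}`;

// Key with offload_mode: each variant keeps its own identity.
const keyWithOffload = (p: Point): string =>
  `${p.config}|${p.conc}|${p.offload_mode}`;

// Number of distinct identities a keyed join would see for these points.
function survivors(points: Point[], key: (p: Point) => string): number {
  return new Set(points.map(key)).size;
}
```

With one `on` and one `off` point for the same config and concurrency, the first key yields a single identity and the second yields two.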

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The halo's purpose is to surface KV-offload usage; restricting it to
Pareto-frontier-only points hid the indicator on most runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b300-p1 (and similar) artifacts were skipping ingest because the runner-pool
suffix wasn't in the strip list and didn't normalize to the canonical b300
GPU key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Label text now includes `C=<conc>` alongside the GPU/parallelism tag
  (default `<tp> C=<conc>`, advanced `<getPointLabel> C=<conc>`)
- Bumped point-label font-weight to 700 so the labels read clearly against
  the chart fill
- Greedy collision-avoidance pass on render and zoom: tries placing each
  label above/below the point through 4 candidate dy offsets, hiding the
  label only when no slot is free

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oint

Tspans now ride above the text's `dy` anchor — the LAST line sits at the
anchor (just above the point) and earlier lines stack above it. Previously
the second tspan landed below the anchor and crashed into the marker.

Also widened collision candidates by label height so the flipped-below
position fully clears the point on multi-line labels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pass

When a `<text>` contains tspans, the parent's `dy` does not shift the bbox
cleanly — its (unused) y=0 origin still factors in, so the rendered text
ended up centered on the point. Move the absolute offset into the FIRST
tspan's `dy`; later tspans cascade by 1.1em.

Collision avoidance now drives the first tspan's `dy` and tries four
candidate baselines (primary above, primary below, secondary above,
secondary below), accounting for full label height when picking a
non-overlapping slot. Labels are still hidden as a last resort.
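
A greedy pass of the kind described above might look like the following. It is a sketch under assumptions: the candidate offsets, the `gap` value, and the box model are invented for illustration, and it only checks labels against other labels, not against point markers.

```typescript
interface Box { x: number; y: number; w: number; h: number }

// Axis-aligned rectangle overlap; touching edges do not count as overlap.
const overlaps = (a: Box, b: Box): boolean =>
  a.x < b.x + b.w && b.x < a.x + a.w && a.y < b.y + b.h && b.y < a.y + a.h;

// px/py is the data point; w/h is the measured label size.
interface LabelInput { px: number; py: number; w: number; h: number }

// Returns the chosen top edge (y) of each label box, or null when hidden.
function placeLabels(labels: LabelInput[], gap = 4): (number | null)[] {
  const placed: Box[] = [];
  return labels.map((l) => {
    // Order: primary above, primary below, secondary above, secondary below.
    const candidates = [
      l.py - gap - l.h,
      l.py + gap,
      l.py - gap - 2 * l.h,
      l.py + gap + l.h,
    ];
    for (const y of candidates) {
      const box: Box = { x: l.px - l.w / 2, y, w: l.w, h: l.h };
      if (!placed.some((p) => overlaps(p, box))) {
        placed.push(box);
        return y;
      }
    }
    return null; // no free slot: hide the label
  });
}
```

Because the pass is greedy, earlier labels always win; the order in which labels are visited therefore determines which ones get hidden.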

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two complementary fixes for runs whose `results_bmk` aggregated artifact
ends up containing both a successful row and a failed-attempt row for the
same (config, conc, offload) — the failed row's null metrics were
overwriting the good row via ON CONFLICT DO UPDATE.

1. Artifact-level: strip the trailing `_<runner-pool>_<attempt>` suffix
   from each artifact name and group by the logical name, keeping only the
   most recent per group.

2. Row-level: skip rows with `num_requests_successful === 0` AND
   `num_requests_total > 0`. The aggregated artifact merges rows from all
   runners — including failed ones — so artifact-level dedup alone can't
   reach inside it.
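
Both dedup levels can be sketched as follows. The artifact naming and the `createdAt` field are assumptions (the regex assumes a numeric attempt after the runner-pool segment); only the row-level predicate is taken directly from the description above.

```typescript
interface Artifact { name: string; createdAt: number }
interface Row { num_requests_successful: number; num_requests_total: number }

// Assumed suffix shape: `_<runner-pool>_<attempt>`, e.g. `_b300-p1_2`.
const logicalName = (name: string): string => name.replace(/_[^_]+_\d+$/, "");

// Artifact-level: keep only the most recent artifact per logical name.
function latestPerLogicalName(artifacts: Artifact[]): Artifact[] {
  const latest = new Map<string, Artifact>();
  for (const a of artifacts) {
    const key = logicalName(a.name);
    const prev = latest.get(key);
    if (!prev || a.createdAt > prev.createdAt) latest.set(key, a);
  }
  return Array.from(latest.values());
}

// Row-level: a row that attempted requests but completed none is a failed
// attempt and should not overwrite a good row on upsert.
const isFailedAttempt = (r: Row): boolean =>
  r.num_requests_successful === 0 && r.num_requests_total > 0;
```

A row with zero total requests is deliberately not treated as failed here, matching the `num_requests_total > 0` condition.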

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	packages/app/src/components/GlobalFilterContext.tsx
#	packages/app/src/components/inference/utils/tooltipUtils.ts
#	packages/db/src/etl/normalizers.ts
Tag display name for the `aiperf` spec_method suffix used by the
alternate-harness runs ingested for the agentic minimax sweep.
Without this entry the legend shows 'AIPERF' from the default
toUpperCase fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bigint workflow_run_id sometimes deserializes as a number on the
frontend depending on the postgres adapter's behavior; strict ===
between a number and a string silently dropped every match, so the
changelog popover always reported "no changelog data available."
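
One way to make the comparison robust is to normalize both sides to strings first. This is a sketch of the idea rather than the actual patch, and `sameRunId` is a hypothetical helper name.

```typescript
// A bigint column may arrive as a string or as a number, depending on the driver.
type RunId = string | number;

// Compare ids by their string form instead of with strict === on mixed types.
const sameRunId = (a: RunId, b: RunId): boolean => String(a) === String(b);
```

Note that ids above `Number.MAX_SAFE_INTEGER` would already have lost precision if they arrived as numbers, so keeping them as strings end to end is the safer option where possible.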

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If the selected model has agentic_traces data, prefer that over the
default 8K/1K fixed-seq when the user hasn't explicitly chosen via URL.
effectiveSequence already falls back to availableSequences[0] for models
without agentic, so models with only fixed-seq data still render correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vercel Bot commented May 14, 2026

The latest updates on your projects.

Project: inferencemax-app
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): Jun 26, 2026 6:28am

# Conflicts:
#	packages/app/src/components/inference/ui/ChartControls.tsx
#	packages/app/src/components/inference/utils/tooltipUtils.ts
#	packages/db/src/etl/normalizers.ts
rowToAggDataEntry was only copying median/p99 metric variants — picking
p90/p99.9 in the percentile selector silently fell back to 0 and
collapsed every point into a vertical line at x=0. Copy the full
median/p90/p99/p99.9 set into AggDataEntry.

Hide the X-Axis Metric dropdown for agentic mode (it doubled up with the
percentile selector) and route the input-metric chart through
withPercentile so picking p99 actually plots p99_ttft instead of the
hard-coded config default. Percentile options pared back to
median + p99.
cquil11 added 2 commits May 15, 2026 12:30
# Conflicts:
#	packages/app/src/components/GlobalFilterContext.tsx
#	packages/app/src/components/inference/InferenceContext.tsx
#	packages/app/src/components/inference/hooks/useChartData.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the TTFT x-axis selectors with the percentile selector — only
p90 is offered everywhere. Default x-axis metric and chart config
input-throughput x are p90_ttft.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `!isAgentic` gate on the e2e TTFT override branch dropped the
user's `p90_ttft` pick in agentic mode, leaving the chart on the
default p90_e2el. The trailing withPercentile pass is idempotent
when xAxisField is already at the right percentile, so the gate is
unnecessary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cquil11 and others added 8 commits June 22, 2026 16:16
- /datasets: methodology prose + dataset registry cards (DatasetList)
- /datasets/[slug]: summary stats, model mix, 5 precomputed-histogram
  distribution cards (DistributionCard, log/linear), and a
  searchable/sortable/paginated conversation table
- /datasets/[slug]/conversations/[convId]: per-conversation TraceFlamegraph —
  one bar per turn (cached prefix + uncached input + output), subagent groups
  collapsible (collapsed by default) with expand/collapse-all
- header nav 'Datasets' link
- query-layer test (mock DbClient): not-found paths + numeric coercion

Verified end-to-end against the live branch DB: both datasets list with real
stats, distributions render, flamegraph shows the prefix-reuse signature
(turn 2 fully uncached, later turns mostly cached), expand-all surfaces
subagent subturns. Zero console errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wrap rows in a fixed-height (max-h-[520px]) vertically scrollable bordered box.
Subagent group headers carry aggregate token totals that dwarf any single turn,
which made their bars overflow the row (width >> 100%). Now turns/subturns use a
per-turn scale while group headers use a separate group-aggregate scale (slim
muted strips), both clamped to the track — groups stay comparable to each other
and nothing overflows.
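
The two-scale idea can be sketched with a single clamped width helper. The choice of scale maximum (largest single turn versus largest group aggregate) is an assumption for illustration.

```typescript
// Width of a bar as a percentage of its track, clamped to [0, 100].
function barWidthPct(tokens: number, scaleMax: number): number {
  if (scaleMax <= 0) return 0;
  return Math.min(100, Math.max(0, (tokens / scaleMax) * 100));
}

// Per-turn scale: max over individual turns.
// Group scale: max over subagent group aggregates.
function scales(turnTotals: number[], groupTotals: number[]) {
  return {
    turnMax: Math.max(0, ...turnTotals),
    groupMax: Math.max(0, ...groupTotals),
  };
}
```

A group header measured against `turnMax` would clamp to 100% and lose information, while against `groupMax` it stays comparable to other groups.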

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add run_datasets (workflow_run → dataset slug) mapping (migration 012) and
surface it through the benchmark-siblings sku. The agentic detail page's request
timeline now deep-links each request bar to its exact conversation in the
/datasets viewer — the request cid, stripped of any ::sa:/::fa: suffix, is the
dataset conv_id. Tooltip shows a 'click to view in dataset' hint; bars get a
pointer cursor only when a mapping exists. Backfilled workflow_run 27915787191
(the dsv4/b300/vllm run incl. point 422083) → cc-traces-weka-062126.

Verified: clicking a timeline bar on /inference/agentic/422083 navigates to the
matching /datasets/cc-traces-weka-062126/conversations/<conv_id>.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The timeline link now carries ?turn=<ti> (and &sa=<agentId> for subagent
requests). The flamegraph resolves the target node — main turns by ordinal,
subagent turns by matching the group's agentId then the ti-th child — expands
the subagent group if needed, scrolls the row into view, and flashes a ring.

subagentIdOf strips the harness stream suffix (:s<n> and :aux:<n>) so the cid's
agent id matches the dataset SubagentNode.agentId. Verified end-to-end: clicking
a subagent bar on /inference/agentic/422083 opens the conversation, expands the
right group, and highlights the exact subturn.
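
The cid handling in this commit and the previous one can be sketched as below. The cid layout is an assumption inferred from the suffixes named in the messages (`<convId>::sa:<agentId>` optionally followed by `:s<n>` or `:aux:<n>`); the real `subagentIdOf` may differ, and `convIdOf` and `datasetLink` are hypothetical names.

```typescript
// Strip any ::sa: or ::fa: suffix to recover the dataset conv_id.
function convIdOf(cid: string): string {
  const i = cid.search(/::(sa|fa):/);
  return i === -1 ? cid : cid.slice(0, i);
}

// Extract the subagent id and drop the harness stream suffix (:s<n> or :aux:<n>).
function subagentIdOf(cid: string): string | null {
  const m = /::sa:(.+)$/.exec(cid);
  if (!m) return null;
  return (m[1] ?? "").replace(/:(s\d+|aux:\d+)$/, "");
}

// Build the deep link carrying ?turn=<ti> and, for subagent requests, &sa=<agentId>.
function datasetLink(slug: string, cid: string, turn: number): string {
  const base = `/datasets/${slug}/conversations/${encodeURIComponent(convIdOf(cid))}`;
  const sa = subagentIdOf(cid);
  const saPart = sa === null ? "" : `&sa=${encodeURIComponent(sa)}`;
  return `${base}?turn=${turn}${saPart}`;
}
```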

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ooltip

- Deep-link highlight is now state-driven (bg-primary/20 + ring, fades over
  700ms) instead of fragile classList mutation, so it's clearly visible and
  survives re-renders. Subagent groups still auto-expand and scroll into view.
- Portal the hover tooltip to document.body so its position:fixed is
  viewport-relative — an ancestor transform was offsetting it away from the
  cursor. Now it sits at pointer+12px.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The conversation page read ?turn/&sa from window.location.search in a useState
initializer, which captures stale/empty params during a client-side navigation —
so scroll+highlight+expand only worked after a manual reload. Switch to the
reactive useSearchParams (page wrapped in Suspense) so the params are present on
the first nav. Also make the flamegraph expand the target subagent group via an
effect (reacting to target changes), and defer the scroll one frame so the
just-expanded child row exists. Verified via a real timeline click — no reload.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In HC mode the iwanthue palette is sized and indexed by the key set it's
generated over. ScatterGraph generated it from the *active* (selected) hw set,
so deselecting a line shrank the set, re-sized the palette, and shifted every
remaining line's hue — most visible on single-vendor agentic runs (which span
the full hue wheel since 2c06009), where deselecting B300 could recolor B200
from red to blue.

Pass the stable full set of hw-types-with-data as hcKeys so the palette and
per-key index are fixed; toggling now only hides/shows lines without recoloring
the rest. Adds a useThemeColors regression test asserting a line's HC color is
identical across active subsets when hcKeys is the full set.
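
The recoloring effect and the fix can be reproduced with a toy palette. Evenly spaced hues stand in for iwanthue here; `paletteFor` and `colorsForActive` are hypothetical names, and only the "sized and indexed by the key set" behavior is modeled.

```typescript
// Stand-in palette: one evenly spaced hue per key, indexed by sorted position.
function paletteFor(keys: string[]): Map<string, string> {
  const sorted = [...keys].sort();
  const out = new Map<string, string>();
  sorted.forEach((k, i) => {
    out.set(k, `hsl(${Math.round((360 * i) / sorted.length)}, 70%, 50%)`);
  });
  return out;
}

// Colors for the active lines, with the palette generated over hcKeys.
function colorsForActive(active: string[], hcKeys: string[]): Map<string, string> {
  const palette = paletteFor(hcKeys);
  const out = new Map<string, string>();
  for (const k of active) out.set(k, palette.get(k) ?? "#888");
  return out;
}
```

Passing the active set as `hcKeys` reproduces the bug (hues shift when a line is deselected); passing the full set of keys with data keeps each line's color fixed.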

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cquil11 and others added 5 commits June 23, 2026 16:10
Replace the per-row P# badges with a colored left-gutter bracket that
groups requests in the same main-agent or subagent scope whose original
execution intervals overlapped (ran in parallel). Non-transitive overlap
chains get their own side-by-side lanes; the gutter only renders when an
overlap group exists, so non-parallel traces have no extra whitespace.
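
One simple way to find overlap groups is a sweep over intervals sorted by start time. This sketch merges chains transitively, which is a simplification and may not match the lane logic in this commit; the `Interval` shape is hypothetical.

```typescript
interface Interval { id: string; startS: number; endS: number }

// An interval joins the current group when it starts before the group's
// latest end. Groups of size one are dropped, since they need no bracket.
function overlapGroups(intervals: Interval[]): string[][] {
  const sorted = [...intervals].sort((a, b) => a.startS - b.startS);
  const groups: string[][] = [];
  let current: string[] = [];
  let groupEnd = -Infinity;
  for (const iv of sorted) {
    if (current.length > 0 && iv.startS < groupEnd) {
      current.push(iv.id);
      groupEnd = Math.max(groupEnd, iv.endS);
    } else {
      if (current.length > 1) groups.push(current);
      current = [iv.id];
      groupEnd = iv.endS;
    }
  }
  if (current.length > 1) groups.push(current);
  return groups;
}
```

An empty result corresponds to the case where the gutter is not rendered at all.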

Legend swatch and conversation-view copy updated to describe the bracket;
e2e assertions check data-overlap-group on bracket segments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…races

A pathological conversation (1621 turns, a subagent fanning out into 622
children with 17-way concurrency) produced 49 bracket lanes — a 686px
gutter that pushed the bars off-screen, plus one DOM node per lane per
row (~110k empty divs, 157k total nodes on Expand all).

Cap displayed lanes at MAX_LANES (6): overflow groups fold into the last
"dense" lane, so every parallel row still carries a marker but the gutter
width stays bounded. Render the gutter sparsely (only lanes a row touches,
absolutely positioned) instead of a dense lane-per-row matrix. A subtle
note surfaces when lanes are capped so the fold isn't silent.

Outlier now: gutter 686px -> 84px, DOM on Expand all 157k -> 35k nodes.
Normal multi-lane traces are unchanged (<=6 lanes hit the identity path).
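
The lane cap amounts to a clamp on the lane index. In the sketch below the 14px lane width is inferred from the figures above (686px / 49 lanes, 84px / 6 lanes) rather than read from the code, and the helper names are hypothetical.

```typescript
const MAX_LANES = 6;

// Map a raw lane index to a displayed lane; overflow folds into the last lane.
function displayLane(lane: number, maxLanes = MAX_LANES): number {
  return Math.min(lane, maxLanes - 1);
}

// Gutter width is bounded by the number of displayed lanes.
function gutterWidthPx(laneCount: number, laneWidthPx: number, maxLanes = MAX_LANES): number {
  return Math.min(laneCount, maxLanes) * laneWidthPx;
}

// True when some lanes were folded, so the UI can surface a note.
const isCapped = (laneCount: number, maxLanes = MAX_LANES): boolean =>
  laneCount > maxLanes;
```

For traces with at most `MAX_LANES` lanes, `displayLane` is the identity, so nothing changes for them.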

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@adibarra

@claude fix the build, it is failing. pnpm typecheck might help in diagnosing. once you fix it merge master branch into this one and fix conflicts.

github-actions Bot commented Jun 26, 2026

Claude finished @adibarra's task in 7m 5s.


Task: Fix build + merge master ✅

  • Diagnose build failure with pnpm typecheck
  • Fix type/build errors
  • Merge origin/master into feat/agentx and resolve conflicts
  • Re-run typecheck/lint/fmt to confirm green
  • Commit and push

Build fix

tsc failed in trace-flamegraph.tsx:

Property 'endS' does not exist on type 'TurnNode'.

The TurnNode interface in packages/db/src/etl/weka-structure.ts had startS but no endS, even though the flamegraph reads node.endS/child.endS (for timestamps and parallel-bracket overlap detection) and the ETL's own subagentTimeRange already treats a turn as spanning start → start + api_time. Added endS?: number to TurnNode and populated it (startS + api_time) for both top-level turns and subagent child turns, factoring the existing logic into a shared turnEndS helper.
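
A helper of this shape would cover the fix described above. It is a sketch, not the committed code: the `TurnNode` subset, the parameter names, and the assumption that api_time is in seconds are all illustrative.

```typescript
// Hypothetical subset of TurnNode; only startS and endS are named above.
interface TurnNode { startS: number; endS?: number }

// End of a turn: start plus API time, when both are known and finite.
function turnEndS(
  startS: number | undefined,
  apiTimeS: number | undefined,
): number | undefined {
  if (startS === undefined || apiTimeS === undefined) return undefined;
  if (!Number.isFinite(startS) || !Number.isFinite(apiTimeS)) return undefined;
  return startS + apiTimeS;
}

// Build a node, leaving endS unset when it cannot be derived.
function makeTurnNode(startS: number, apiTimeS?: number): TurnNode {
  const endS = turnEndS(startS, apiTimeS);
  return endS === undefined ? { startS } : { startS, endS };
}
```

Keeping `endS` optional means older structures without timing still type-check, and readers must handle the missing case explicitly.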

Merge

The runner had a shallow clone, so origin/master initially showed "unrelated histories" — unshallowed (git fetch --unshallow) to recover the common ancestor, then merged. Three files conflicted, all on the same theme (this branch added offload_mode as a chart-line dimension; master added br.id to the benchmark query). Resolved by combining both sides:

  • packages/db/src/queries/benchmarks.ts: kept offload_mode in the line-selection key (DISTINCT ON, ORDER BY, winners join) and master's br.id SELECT.
  • packages/db/src/json-provider.ts: lineKey keeps offload_mode; merged comment wording.
  • packages/db/src/json-provider.line-single-run.test.ts: kept the per-offload-mode test coverage.

Validation

  • pnpm typecheck
  • pnpm lint
  • pnpm fmt
  • db tests for the touched files (weka-structure, json-provider.line-single-run, queries/*) ✅

Note: 3 tests in packages/db/src/connection.test.ts fail in this CI runner only because it exports DATABASE_DRIVER=neon and DATABASE_SSL=true, which override the defaults those tests assert. They are environment-induced (fail identically on master) and unrelated to this change.
Branch: feat/agentx

github-actions Bot and others added 3 commits June 26, 2026 04:13
Co-authored-by: Alec Ibarra <adibarra@users.noreply.github.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	packages/db/src/json-provider.line-single-run.test.ts
#	packages/db/src/json-provider.ts
#	packages/db/src/queries/benchmarks.ts
…-end

Agentic artifacts ship *_intvty under two harness definitions: slow-tail
1/p(ITL) (what the charts assume) vs fast-tail p(1/ITL), which inverts
percentile order (p90 lands at ~1/p10(ITL)). Ingest stored the artifact
value verbatim and the frontend only filled intvty when missing, so newer
"timing fix" runs landed with the wrong definition — e.g. p90 reading 23.9
instead of 11.2 — silently contaminating cross-run Pareto comparisons.

Enforce the invariant in every path:
- ingest mapper: derive agentic mean/median/p75/p90/p95/p99 *_intvty from
  *_itl, discarding the artifact value (self-correcting ingest).
- frontend agenticAliases: always derive intvty = 1/itl (override, not
  fill-if-missing) so overlay / ?unofficialrun= rows match.
- backfill-agentic-intvty script: one-time fix for stored rows (already run
  against the DB: 164 rows / 656 values rewritten, 0 contaminated after).
- ingest agent doc: note the invariant + the backfill escape hatch.

std_intvty is intentionally left alone (reciprocal of a std is meaningless;
the API strips it). Unit tests added on both the mapper and the transform.
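
The difference between the two definitions, and the derive-from-ITL rule, can be sketched as follows. The nearest-rank percentile, the row shape, and the handling of a missing `*_itl` (set to null) are assumptions; only the `intvty = 1/itl` rule and the stat list come from the message above.

```typescript
// Nearest-rank percentile of an unsorted array, q in [0, 1].
function percentile(values: number[], q: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

// Slow-tail definition: reciprocal of the ITL percentile.
const slowTail = (itl: number[], q: number): number => 1 / percentile(itl, q);

// Fast-tail definition: percentile of the per-sample reciprocals.
const fastTail = (itl: number[], q: number): number =>
  percentile(itl.map((v) => 1 / v), q);

const STATS = ["mean", "median", "p75", "p90", "p95", "p99"] as const;

// Always derive <stat>_intvty from <stat>_itl, discarding any stored value.
// std_intvty is not in STATS and is therefore left untouched.
function deriveIntvty(
  row: Record<string, number | null | undefined>,
): Record<string, number | null | undefined> {
  const out = { ...row };
  for (const s of STATS) {
    const itl = row[`${s}_itl`];
    out[`${s}_intvty`] = typeof itl === "number" && itl > 0 ? 1 / itl : null;
  }
  return out;
}
```

On any sample with spread, `fastTail(itl, 0.9)` matches `1 / percentile(itl, 0.1)` and exceeds `slowTail(itl, 0.9)`, which is the inversion the message describes.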

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>