[WIP] agentx#348
Adds agentic_traces scenario end-to-end:
- Schema migrations for agentic scenario, availability, and KV offload mode
- DB ingest/ETL + query updates to carry scenario, offload_mode, and server/theoretical cache-hit rates through to the API layer
- Frontend types, filters (GlobalFilterContext / InferenceContext / ChartControls), URL state, and tooltip rows for agentic-only fields
- ScatterGraph: subtle dashed halo on Pareto-frontier points that used KV offload so the tradeoff is visible at a glance
- ScatterGraph: include `offload_mode` in `buildPointConfigId` so d3's data join keeps both `on` and `off` variants for the same (config, conc). Without it, the second variant collapsed onto the first key, so FP8 offload-on points (and their halos) silently disappeared.
- benchmark-mapper: handle older artifacts that emit `users`/`offload_mode` AND newer ones that emit `conc`/`offloading` (with 'none' → 'off' mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
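To illustrate why the key matters, here is a minimal sketch of a join key that includes the offload mode. The `Point` shape and every field other than `offload_mode` are assumptions for illustration; the real `buildPointConfigId` signature is not shown in this PR text.

```typescript
// Hypothetical point shape; only offload_mode is named in the commit message.
interface Point {
  hwKey: string;
  precision: string;
  tp: number;
  conc: number;
  offload_mode?: "on" | "off";
}

// Key used for the data join. Including offload_mode keeps the on/off
// variants of the same (config, conc) as two distinct keys.
function buildPointConfigId(p: Point): string {
  return [p.hwKey, p.precision, p.tp, p.conc, p.offload_mode ?? "off"].join("|");
}

const points: Point[] = [
  { hwKey: "b300", precision: "fp8", tp: 8, conc: 16, offload_mode: "off" },
  { hwKey: "b300", precision: "fp8", tp: 8, conc: 16, offload_mode: "on" },
];

// With offload_mode in the key, both variants survive a keyed join.
const uniqueKeys = new Set(points.map(buildPointConfigId));
```

If `offload_mode` were dropped from the joined fields, both points would map to the same key and a keyed data join would keep only one of them.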
The halo's purpose is to surface KV-offload usage; restricting it to Pareto-frontier-only points hid the indicator on most runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b300-p1 (and similar) artifacts were skipping ingest because the runner-pool suffix wasn't in the strip list and didn't normalize to the canonical b300 GPU key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
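A rough sketch of the normalization described above, assuming the runner-pool suffix always looks like `-p<digits>` (inferred from the single `b300-p1` example) and using an illustrative list of canonical keys. Neither the regex nor the list comes from the repository.

```typescript
// Assumed suffix shape, based only on the "b300-p1" example.
const RUNNER_POOL_SUFFIX = /-p\d+$/;

// Illustrative canonical GPU keys; the real list lives in the ETL normalizers.
const CANONICAL_GPUS = ["b200", "b300", "h100", "h200"];

// Returns the canonical GPU key, or null when the name is unknown
// (in which case the artifact would be skipped at ingest).
function normalizeGpuKey(raw: string): string | null {
  const key = raw.trim().toLowerCase().replace(RUNNER_POOL_SUFFIX, "");
  return CANONICAL_GPUS.indexOf(key) !== -1 ? key : null;
}
```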
- Label text now includes `C=<conc>` alongside the GPU/parallelism tag (default `<tp> C=<conc>`, advanced `<getPointLabel> C=<conc>`)
- Bumped point-label font-weight to 700 so the labels read clearly against the chart fill
- Greedy collision-avoidance pass on render and zoom: tries placing each label above/below the point through 4 candidate dy offsets, hiding the label only when no slot is free

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oint

Tspans now ride above the text's `dy` anchor: the LAST line sits at the anchor (just above the point) and earlier lines stack above it. Previously the second tspan landed below the anchor and crashed into the marker. Also widened collision candidates by label height so the flipped-below position fully clears the point on multi-line labels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pass

When a `<text>` contains tspans, the parent's `dy` does not shift the bbox cleanly: its (unused) y=0 origin still factors in, so the rendered text ended up centered on the point. Move the absolute offset into the FIRST tspan's `dy`; later tspans cascade by 1.1em. Collision avoidance now drives the first tspan's `dy` and tries four candidate baselines (primary above, primary below, secondary above, secondary below), accounting for full label height when picking a non-overlapping slot. Labels are still hidden as a last resort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
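The greedy pass can be sketched roughly as follows. This is a simplified model, not the ScatterGraph code: labels are plain rectangles, `dy` is the offset of the label's top edge from the point, and the `gap` value and candidate ordering are assumptions.

```typescript
interface Box { x: number; y: number; w: number; h: number }
interface LabelInput { cx: number; cy: number; w: number; h: number }
interface Placement { dy: number | null } // null means the label is hidden

// Axis-aligned rectangle overlap; touching edges do not count as overlap.
function overlaps(a: Box, b: Box): boolean {
  return a.x < b.x + b.w && b.x < a.x + a.w && a.y < b.y + b.h && b.y < a.y + a.h;
}

// Greedy: place labels in input order, taking the first free candidate slot.
function placeLabels(labels: LabelInput[], gap = 6): Placement[] {
  const placed: Box[] = [];
  return labels.map((l) => {
    // primary above, primary below, secondary above, secondary below
    const candidates = [-(gap + l.h), gap, -(gap + 2 * l.h), gap + l.h];
    for (const dy of candidates) {
      const box: Box = { x: l.cx - l.w / 2, y: l.cy + dy, w: l.w, h: l.h };
      if (!placed.some((p) => overlaps(p, box))) {
        placed.push(box);
        return { dy };
      }
    }
    return { dy: null }; // no free slot: hide as a last resort
  });
}

// Five identical labels on the same point: four slots fill, the fifth is hidden.
const stacked = placeLabels(
  [0, 1, 2, 3, 4].map(() => ({ cx: 0, cy: 0, w: 40, h: 10 })),
);
```

Because the pass is greedy, the result depends on label order; earlier labels always get the primary slot.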
Two complementary fixes for runs whose `results_bmk` aggregated artifact ends up containing both a successful row and a failed-attempt row for the same (config, conc, offload). The failed row's null metrics were overwriting the good row via ON CONFLICT DO UPDATE.
1. Artifact-level: strip the trailing `_<runner-pool>_<attempt>` suffix from each artifact name and group by the logical name, keeping only the most recent per group.
2. Row-level: skip rows with `num_requests_successful === 0` AND `num_requests_total > 0`. The aggregated artifact merges rows from all runners, including failed ones, so artifact-level dedup alone can't reach inside it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
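Both filters can be sketched as below. The `Artifact` shape, the `createdAt` field, and the assumption that `<attempt>` is numeric are illustrative; only the two `num_requests_*` fields and the suffix pattern come from the commit message.

```typescript
interface Artifact { name: string; createdAt: string } // createdAt: ISO timestamp (assumed field)
interface BenchmarkRow { num_requests_successful: number; num_requests_total: number }

// Strip a trailing _<runner-pool>_<attempt> suffix (attempt assumed numeric).
function logicalName(name: string): string {
  return name.replace(/_[^_]+_\d+$/, "");
}

// Artifact-level dedup: keep only the most recent artifact per logical name.
function latestPerLogicalName(artifacts: Artifact[]): Artifact[] {
  const latest: Record<string, Artifact> = {};
  for (const a of artifacts) {
    const key = logicalName(a.name);
    const prev = latest[key];
    if (prev === undefined || Date.parse(a.createdAt) > Date.parse(prev.createdAt)) {
      latest[key] = a;
    }
  }
  return Object.keys(latest).map((k) => latest[k]);
}

// Row-level filter: a row that attempted requests but completed none is a failed attempt.
function isFailedAttempt(r: BenchmarkRow): boolean {
  return r.num_requests_successful === 0 && r.num_requests_total > 0;
}

const kept = latestPerLogicalName([
  { name: "results_bmk_fp8_tp8_pool-a_1", createdAt: "2025-01-01T00:00:00Z" },
  { name: "results_bmk_fp8_tp8_pool-b_2", createdAt: "2025-01-02T00:00:00Z" },
]);
```

The row-level check runs on every row of the surviving artifacts, since a single aggregated artifact can still carry failed rows.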
# Conflicts:
# packages/app/src/components/GlobalFilterContext.tsx
# packages/app/src/components/inference/utils/tooltipUtils.ts
# packages/db/src/etl/normalizers.ts
Tag display name for the `aiperf` spec_method suffix used by the alternate-harness runs ingested for the agentic minimax sweep. Without this entry the legend shows 'AIPERF' from the default toUpperCase fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bigint workflow_run_id sometimes deserializes as a number on the frontend depending on the postgres adapter's behavior; strict === between a number and a string silently dropped every match, so the changelog popover always reported "no changelog data available." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
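A minimal sketch of a type-tolerant comparison, assuming the id arrives as either a string or a number. The helper name `sameRunId` is hypothetical.

```typescript
type RunId = string | number;

// Compare ids by their string form so "27915787191" and 27915787191 match.
// Caveat: a number above Number.MAX_SAFE_INTEGER has already lost precision
// by the time it reaches this function, so string ids are safer upstream.
function sameRunId(a: RunId, b: RunId): boolean {
  return String(a) === String(b);
}

const fromApi: RunId = 27915787191;    // deserialized as a number
const fromRow: RunId = "27915787191";  // deserialized as a string

const strictMatch = (fromApi as unknown) === (fromRow as unknown); // false
const tolerantMatch = sameRunId(fromApi, fromRow);                 // true
```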
If the selected model has agentic_traces data, prefer that over the default 8K/1K fixed-seq when the user hasn't explicitly chosen via URL. effectiveSequence already falls back to availableSequences[0] for models without agentic, so models with only fixed-seq data still render correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
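The selection order described above might look like this. The function name `pickSequence` and the `"8k/1k"` key are hypothetical; only `agentic_traces` and the `availableSequences[0]` fallback come from the commit message.

```typescript
const AGENTIC = "agentic_traces";
const DEFAULT_FIXED = "8k/1k"; // hypothetical key for the 8K/1K fixed-seq default

// Order: explicit URL choice, then agentic if the model has it,
// then the fixed-seq default, then whatever is available first.
function pickSequence(urlChoice: string | null, available: string[]): string | undefined {
  if (urlChoice !== null && available.indexOf(urlChoice) !== -1) return urlChoice;
  if (available.indexOf(AGENTIC) !== -1) return AGENTIC;
  if (available.indexOf(DEFAULT_FIXED) !== -1) return DEFAULT_FIXED;
  return available[0];
}
```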
# Conflicts:
# packages/app/src/components/inference/ui/ChartControls.tsx
# packages/app/src/components/inference/utils/tooltipUtils.ts
# packages/db/src/etl/normalizers.ts
rowToAggDataEntry was only copying median/p99 metric variants, so picking p90/p99.9 in the percentile selector silently fell back to 0 and collapsed every point into a vertical line at x=0. Copy the full median/p90/p99/p99.9 set into AggDataEntry. Hide the X-Axis Metric dropdown for agentic mode (it doubled up with the percentile selector) and route the input-metric chart through withPercentile so picking p99 actually plots p99_ttft instead of the hard-coded config default. Percentile options pared back to median + p99.
# Conflicts:
# packages/app/src/components/GlobalFilterContext.tsx
# packages/app/src/components/inference/InferenceContext.tsx
# packages/app/src/components/inference/hooks/useChartData.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the TTFT x-axis selectors with the percentile selector — only p90 is offered everywhere. Default x-axis metric and chart config input-throughput x are p90_ttft. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `!isAgentic` gate on the e2e TTFT override branch dropped the user's `p90_ttft` pick in agentic mode, leaving the chart on the default p90_e2el. The trailing withPercentile pass is idempotent when xAxisField is already at the right percentile, so the gate is unnecessary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
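The idempotence claim is easy to see in a small model of the helper. This is a hypothetical reimplementation: the real `withPercentile` signature is not shown here, and the `<percentile>_<metric>` field naming is assumed from names like `p90_ttft`.

```typescript
type Percentile = "median" | "p90" | "p99";

// Swap the percentile prefix of a metric field name.
// Fields without a recognized prefix are returned unchanged.
function withPercentile(field: string, pct: Percentile): string {
  return field.replace(/^(?:median|mean|p\d+(?:\.\d+)?)_/, `${pct}_`);
}

const once = withPercentile("p99_ttft", "p90");  // "p90_ttft"
const twice = withPercentile(once, "p90");       // unchanged on the second pass
```

Since a second pass leaves an already-correct field untouched, running it after the override branch cannot undo the user's pick.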
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- /datasets: methodology prose + dataset registry cards (DatasetList)
- /datasets/[slug]: summary stats, model mix, 5 precomputed-histogram distribution cards (DistributionCard, log/linear), and a searchable/sortable/paginated conversation table
- /datasets/[slug]/conversations/[convId]: per-conversation TraceFlamegraph, one bar per turn (cached prefix + uncached input + output), subagent groups collapsible (collapsed by default) with expand/collapse-all
- header nav 'Datasets' link
- query-layer test (mock DbClient): not-found paths + numeric coercion

Verified end-to-end against the live branch DB: both datasets list with real stats, distributions render, flamegraph shows the prefix-reuse signature (turn 2 fully uncached, later turns mostly cached), expand-all surfaces subagent subturns. Zero console errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wrap rows in a fixed-height (max-h-[520px]) vertically scrollable bordered box. Subagent group headers carry aggregate token totals that dwarf any single turn, which made their bars overflow the row (width >> 100%). Now turns/subturns use a per-turn scale while group headers use a separate group-aggregate scale (slim muted strips), both clamped to the track — groups stay comparable to each other and nothing overflows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
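A small sketch of the two-scale idea, with hypothetical token counts. The function name and the choice of percent units are assumptions.

```typescript
// Bar width as a percentage of the track, clamped to [0, 100].
function barWidthPct(tokens: number, scaleMax: number): number {
  if (scaleMax <= 0) return 0;
  return Math.min(100, Math.max(0, (tokens / scaleMax) * 100));
}

const perTurnMax = 200;     // largest single turn (hypothetical)
const groupMax = 20000;     // largest group aggregate (hypothetical)

const turnBar = barWidthPct(50, perTurnMax);          // 25
const groupOnTurnScale = barWidthPct(5000, perTurnMax); // clamped to 100
const groupOnGroupScale = barWidthPct(5000, groupMax);  // 25
```

On the per-turn scale every large group would pin at 100% and look identical; the separate group scale keeps groups comparable to each other.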
Add run_datasets (workflow_run → dataset slug) mapping (migration 012) and surface it through the benchmark-siblings sku. The agentic detail page's request timeline now deep-links each request bar to its exact conversation in the /datasets viewer — the request cid, stripped of any ::sa:/::fa: suffix, is the dataset conv_id. Tooltip shows a 'click to view in dataset' hint; bars get a pointer cursor only when a mapping exists. Backfilled workflow_run 27915787191 (the dsv4/b300/vllm run incl. point 422083) → cc-traces-weka-062126. Verified: clicking a timeline bar on /inference/agentic/422083 navigates to the matching /datasets/cc-traces-weka-062126/conversations/<conv_id>. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The timeline link now carries ?turn=<ti> (and &sa=<agentId> for subagent requests). The flamegraph resolves the target node — main turns by ordinal, subagent turns by matching the group's agentId then the ti-th child — expands the subagent group if needed, scrolls the row into view, and flashes a ring. subagentIdOf strips the harness stream suffix (:s<n> and :aux:<n>) so the cid's agent id matches the dataset SubagentNode.agentId. Verified end-to-end: clicking a subagent bar on /inference/agentic/422083 opens the conversation, expands the right group, and highlights the exact subturn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
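Putting the cid parsing and link building together, a sketch under stated assumptions: the cid layout `<conv_id>::sa:<agentId>:s<n>` is inferred from the two commit messages above, and `convIdOf` and `conversationHref` are hypothetical helper names. Only `subagentIdOf`, the suffix tokens, and the route shape come from the text.

```typescript
// Dataset conv_id: the cid with any ::sa: / ::fa: suffix removed.
function convIdOf(cid: string): string {
  return cid.replace(/::(?:sa|fa):.*$/, "");
}

// Subagent id: the part after ::sa:, minus the harness stream suffix
// (:s<n> or :aux:<n>). Returns null for main-agent requests.
function subagentIdOf(cid: string): string | null {
  const m = /::sa:(.+)$/.exec(cid);
  if (m === null) return null;
  return m[1].replace(/(?::s\d+|:aux:\d+)$/, "");
}

// Deep link into the dataset viewer: ?turn=<ti> plus &sa=<agentId> for subagents.
function conversationHref(slug: string, cid: string, turnIndex: number): string {
  const params = new URLSearchParams({ turn: String(turnIndex) });
  const sa = subagentIdOf(cid);
  if (sa !== null) params.set("sa", sa);
  const conv = encodeURIComponent(convIdOf(cid));
  return `/datasets/${slug}/conversations/${conv}?${params.toString()}`;
}
```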
…ooltip

- Deep-link highlight is now state-driven (bg-primary/20 + ring, fades over 700ms) instead of fragile classList mutation, so it's clearly visible and survives re-renders. Subagent groups still auto-expand and scroll into view.
- Portal the hover tooltip to document.body so its position:fixed is viewport-relative; an ancestor transform was offsetting it away from the cursor. Now it sits at pointer+12px.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The conversation page read ?turn/&sa from window.location.search in a useState initializer, which captures stale/empty params during a client-side navigation — so scroll+highlight+expand only worked after a manual reload. Switch to the reactive useSearchParams (page wrapped in Suspense) so the params are present on the first nav. Also make the flamegraph expand the target subagent group via an effect (reacting to target changes), and defer the scroll one frame so the just-expanded child row exists. Verified via a real timeline click — no reload. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In HC mode the iwanthue palette is sized and indexed by the key set it's generated over. ScatterGraph generated it from the *active* (selected) hw set, so deselecting a line shrank the set, re-sized the palette, and shifted every remaining line's hue — most visible on single-vendor agentic runs (which span the full hue wheel since 2c06009), where deselecting B300 could recolor B200 from red to blue. Pass the stable full set of hw-types-with-data as hcKeys so the palette and per-key index are fixed; toggling now only hides/shows lines without recoloring the rest. Adds a useThemeColors regression test asserting a line's HC color is identical across active subsets when hcKeys is the full set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
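The recoloring effect can be reproduced with a toy palette. Here `evenHues` is a stand-in that spaces hues evenly; it is not iwanthue, and `hcColor` is a hypothetical helper, but any palette sized and indexed by its key set behaves the same way.

```typescript
// Stand-in palette generator: n evenly spaced hues (not the real iwanthue).
function evenHues(n: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < n; i++) out.push(`hsl(${Math.round((360 * i) / n)}, 70%, 50%)`);
  return out;
}

// Color for a key, given the key set the palette is generated over.
function hcColor(key: string, hcKeys: string[]): string | undefined {
  const sorted = hcKeys.slice().sort();
  const idx = sorted.indexOf(key);
  return idx === -1 ? undefined : evenHues(sorted.length)[idx];
}

const allHw = ["b200", "b300", "h200"];

// Palette built from the active subset: deselecting b200 recolors b300.
const fromActive = hcColor("b300", ["b300", "h200"]);
// Palette built from the stable full set: b300 keeps its hue regardless of selection.
const fromFull = hcColor("b300", allHw);
```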
…elector/lockfile conflicts
…te x-axis toggle test for single-chart mode buttons
…ph (incl deep-link), and dataset list states
…ring doesn't collide with master
Replace the per-row P# badges with a colored left-gutter bracket that groups requests in the same main-agent or subagent scope whose original execution intervals overlapped (ran in parallel). Non-transitive overlap chains get their own side-by-side lanes; the gutter only renders when an overlap group exists, so non-parallel traces have no extra whitespace. Legend swatch and conversation-view copy updated to describe the bracket; e2e assertions check data-overlap-group on bracket segments. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
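One way to find overlap groups is a sweep over intervals sorted by start time. The sketch below is a simplification: it merges overlap chains transitively into a single group, whereas the commit gives non-transitive chains their own lanes. The `Req` shape and the rule that touching intervals do not overlap are assumptions.

```typescript
interface Req { id: string; start: number; end: number }

// Returns a group index per request id, or null when the request
// overlapped nothing (so no bracket is drawn for it).
function overlapGroups(reqs: Req[]): Record<string, number | null> {
  const sorted = reqs.slice().sort((a, b) => a.start - b.start);
  const out: Record<string, number | null> = {};
  let members: Req[] = [];
  let maxEnd = -Infinity;
  let nextGroup = 0;

  const flush = (): void => {
    const g = members.length > 1 ? nextGroup++ : null;
    for (const m of members) out[m.id] = g;
    members = [];
  };

  for (const r of sorted) {
    // A request starting at or after the running max end closes the current group.
    if (members.length > 0 && r.start >= maxEnd) {
      flush();
      maxEnd = -Infinity;
    }
    members.push(r);
    maxEnd = Math.max(maxEnd, r.end);
  }
  flush();
  return out;
}

const groups = overlapGroups([
  { id: "a", start: 0, end: 10 },
  { id: "b", start: 5, end: 15 },
  { id: "c", start: 20, end: 30 },
  { id: "d", start: 12, end: 14 },
]);
```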
…races

A pathological conversation (1621 turns, a subagent fanning out into 622 children with 17-way concurrency) produced 49 bracket lanes: a 686px gutter that pushed the bars off-screen, plus one DOM node per lane per row (~110k empty divs, 157k total nodes on Expand all).

Cap displayed lanes at MAX_LANES (6): overflow groups fold into the last "dense" lane, so every parallel row still carries a marker but the gutter width stays bounded. Render the gutter sparsely (only lanes a row touches, absolutely positioned) instead of a dense lane-per-row matrix. A subtle note surfaces when lanes are capped so the fold isn't silent.

Outlier now: gutter 686px -> 84px, DOM on Expand all 157k -> 35k nodes. Normal multi-lane traces are unchanged (<=6 lanes hit the identity path).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
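The cap itself reduces to a clamp. In this sketch `LANE_WIDTH_PX` is inferred from the reported numbers (686px / 49 lanes = 14px) and is an assumption, as are the helper names; only `MAX_LANES = 6` is stated in the commit.

```typescript
const MAX_LANES = 6;
const LANE_WIDTH_PX = 14; // inferred from 686px / 49 lanes; not taken from the code

// Lanes below the cap map to themselves (the identity path);
// overflow lanes fold into the last "dense" lane.
function displayLane(lane: number): number {
  return Math.min(lane, MAX_LANES - 1);
}

// Gutter width stays bounded no matter how many logical lanes exist.
function gutterWidthPx(laneCount: number): number {
  return Math.min(laneCount, MAX_LANES) * LANE_WIDTH_PX;
}

// True when the "lanes capped" note should be shown.
function isCapped(laneCount: number): boolean {
  return laneCount > MAX_LANES;
}
```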
@claude fix the build, it is failing. pnpm typecheck might help in diagnosing. once you fix it merge master branch into this one and fix conflicts. |
Claude finished @adibarra's task in 7m 5s
Task: Fix build + merge master ✅
Build fix
The …
Merge
The runner had a shallow clone, so …
Validation
Co-authored-by: Alec Ibarra <adibarra@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
# packages/db/src/json-provider.line-single-run.test.ts
# packages/db/src/json-provider.ts
# packages/db/src/queries/benchmarks.ts
…-end

Agentic artifacts ship *_intvty under two harness definitions: slow-tail 1/p(ITL) (what the charts assume) vs fast-tail p(1/ITL), which inverts percentile order (p90 lands at ~1/p10(ITL)). Ingest stored the artifact value verbatim and the frontend only filled intvty when missing, so newer "timing fix" runs landed with the wrong definition (e.g. p90 reading 23.9 instead of 11.2), silently contaminating cross-run Pareto comparisons.

Enforce the invariant in every path:
- ingest mapper: derive agentic mean/median/p75/p90/p95/p99 *_intvty from *_itl, discarding the artifact value (self-correcting ingest).
- frontend agenticAliases: always derive intvty = 1/itl (override, not fill-if-missing) so overlay / ?unofficialrun= rows match.
- backfill-agentic-intvty script: one-time fix for stored rows (already run against the DB: 164 rows / 656 values rewritten, 0 contaminated after).
- ingest agent doc: note the invariant + the backfill escape hatch.

std_intvty is intentionally left alone (reciprocal of a std is meaningless; the API strips it). Unit tests added on both the mapper and the transform.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
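The override-not-fill rule can be sketched as a pure transform. Assumptions: ITL is stored in seconds, rows are flat `<stat>_itl` / `<stat>_intvty` maps, and a missing or non-positive ITL yields `null`. The name `deriveIntvty` is hypothetical.

```typescript
const STATS = ["mean", "median", "p75", "p90", "p95", "p99"] as const;

type MetricRow = Record<string, number | null>;

// Always recompute <stat>_intvty as 1 / <stat>_itl, discarding whatever the
// artifact carried. std_intvty is deliberately not in STATS and is left as-is.
function deriveIntvty(row: MetricRow): MetricRow {
  const out: MetricRow = { ...row };
  for (const s of STATS) {
    const itl = row[`${s}_itl`];
    // Assumption: ITL in seconds; missing or non-positive ITL gives null.
    out[`${s}_intvty`] = typeof itl === "number" && itl > 0 ? 1 / itl : null;
  }
  return out;
}

// The artifact shipped a fast-tail value (23.9); it is overwritten from ITL.
const fixed = deriveIntvty({ p90_itl: 0.125, p90_intvty: 23.9, std_intvty: 3 });
```

Deriving on every path means a row reads the same whether it came from ingest, the backfill, or an overlay.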
Summary
1 / TPOT) and TTFT time-series charts
Data updates
`cc-traces-weka-062126` variants so flamegraph structures include timing
Unofficial-run overlays cannot open the persisted agentic point-detail route because they do not have a `benchmark_results` id or stored request timeline. The new point-detail charts are therefore intentionally limited to DB-backed official points.
Validation
- `pnpm typecheck`
- `pnpm lint`
- `pnpm fmt`