docs(canonical-api): close the anti-reinvention gaps + de-reinvent the examples by drewstone · Pull Request #370 · tangle-network/agent-runtime

drewstone · 2026-06-24T08:54:24Z

What

The anti-reinvention decision table in docs/canonical-api.md (§2 — the doc agents are told to read before writing orchestration/measurement code) had zero rows for whole families of live exports, so agents kept hand-rolling them. This adds them, and de-reinvents the one example that hand-rolled primitives.

1. Reference gaps closed (`docs/canonical-api.md` §2)

Six grouped rows added, in the existing "I want to ___ → use ___ → NOT ___" voice:

| Family | Use (the real primitive) | Was being hand-rolled as |
| --- | --- | --- |
| statistics | every test on the agent-eval main barrel: `pairedBootstrap`, `wilson`, `pairedTTest`, `wilcoxonSignedRank`, `mcnemar`/`mcnemarPower`/`mcnemarRequiredN`, `mannWhitneyU`, `passAtK`, `confidenceInterval`, `benjaminiHochberg`/`bonferroni`, `cohensD`/`cliffsDelta`, `requiredSampleSize`/`pairedMde`, `eProcess`, `weightedComposite`, `corpusInterRaterAgreement`, `pairedRiskDifference`, `pearsonR`/`spearmanR`, `paretoFrontier`/`dominates` | a bootstrap-with-PRNG, a Wilson/McNemar/power calc, a Pareto sweep, a Cohen's-d per gate (the #1 measurement reinvention) |
| judge | a campaign `JudgeConfig`, or `ensembleJudge` for a multi-model panel | a hand-built judge prompt loop, or the `@deprecated` `JudgeFn` factories |
| authenticity | `scoreAuthenticity` → `gateRealness` over an `AuthenticitySignals` (`agent-eval/authenticity` subpath) | a regex realness scorer, or trusting a buildability score that rewards a polished fake |
| verification | `new MultiLayerVerifier(layers).run(...)` + `gradeSemanticStatus(...)` | a hand-chained compile→lint→test→semantic pipeline |
| reference-replay | `runReferenceReplay`/`scoreReferenceReplay` + `decideReferenceReplayPromotion` over `jsonlReferenceReplayStore` | a hand-rolled "does the candidate reproduce the reference" matcher |
| token/usage seam | `extractLlmCallEvent` / `reportLoopUsage` (`UsageSink`), `/loops` | re-walking sandbox events to tally tokens yourself |

The judge row also kills a recurring confusion: there is no llmJudge export in this version — that name appears only in a campaign-types docstring; the live judge surface is JudgeConfig + ensembleJudge. A future agent reading this doc can no longer conclude "no judge primitive exists."

The freshness gate (scripts/check-docs-freshness.mjs) now loads the agent-eval/authenticity subpath barrel into the §2 export universe, so a row that recommends importing from that subpath resolves — exactly parallel to how contract/campaign/index are already loaded. This strengthens coverage (a real published export surface the gate previously couldn't see), it does not weaken the gate.
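As a rough illustration of why loading one more barrel makes a §2 row resolve, here is a minimal sketch of the "export universe" idea. It is not the code in `scripts/check-docs-freshness.mjs`: the `Barrel` type, both function names, and the tiny export lists are assumptions made up for this example.

```typescript
// Hypothetical sketch: NOT the real scripts/check-docs-freshness.mjs.
// A barrel is modelled as a subpath plus the names it exports.
type Barrel = { subpath: string; exports: string[] };

// Union the exports of every loaded barrel into one lookup set.
function buildExportUniverse(barrels: Barrel[]): Set<string> {
  const universe = new Set<string>();
  for (const barrel of barrels) {
    for (const name of barrel.exports) universe.add(name);
  }
  return universe;
}

// Return the table symbols that no loaded barrel exports.
function findUnresolved(tableSymbols: string[], universe: Set<string>): string[] {
  return tableSymbols.filter((symbol) => !universe.has(symbol));
}

// Illustrative export lists only; the real barrels are far larger.
const mainBarrel: Barrel = {
  subpath: "agent-eval",
  exports: ["pairedBootstrap", "wilson"],
};
const authenticityBarrel: Barrel = {
  subpath: "agent-eval/authenticity",
  exports: ["scoreAuthenticity", "gateRealness"],
};

// Symbols a §2 row might recommend.
const tableSymbols = ["pairedBootstrap", "scoreAuthenticity", "gateRealness"];

// Before: only the main barrel is loaded, so the authenticity symbols are invisible.
const unresolvedBefore = findUnresolved(tableSymbols, buildExportUniverse([mainBarrel]));
// After: the subpath barrel is loaded too, so every symbol resolves.
const unresolvedAfter = findUnresolved(
  tableSymbols,
  buildExportUniverse([mainBarrel, authenticityBarrel]),
);

console.log(unresolvedBefore, unresolvedAfter);
```

Widening the universe this way only adds names the gate can verify against; any symbol that no loaded barrel exports still fails the check.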

2. Example de-reinvented

examples/self-improving-loop was the only example (of 21 audited) that hand-rolled substrate primitives:

- analyst: local `interface AnalystFinding { rootCause; proposedMutation }` + `runAnalyst()` → now the canonical `AnalystFinding` stamped by `makeFinding` (the exact shape `improve(profile, findings, opts)` reflects on); mutation rides `recommended_action`.
- gate: bare `v1Mean - v0Mean >= 0.5` point comparison → now `pairedBootstrap(v0Scores, v1Scores, { seed })`, shipping only when the paired-bootstrap CI lower bound clears 0. This is the statistical core the production held-out gate (`HeldOutGate` / `improve()` over `selfImprove`) is built on.

Stays offline + deterministic; the demo still ships v1, now on a +5.00 paired median, 95% CI [5.00, 6.00], n=3. README + comments mirror the already-clean companion examples (improve/, intelligence-recommend/). Only the analyst body, proposer, and LLM remain scripted (for reproducibility) — the finding type and gate statistic are now the real substrate.
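To make the difference between the two gates concrete, here is a self-contained sketch of a seeded paired bootstrap over the median of per-item differences. It is a stand-in for intuition only: the example itself calls the real `pairedBootstrap`, whose internals and return shape are not shown in this PR. The function `pairedBootstrapSketch`, the `PairedCi` shape, the percentile method, and all score arrays are assumptions.

```typescript
// Illustrative stand-in only: real code should call pairedBootstrap from agent-eval.

// Small seeded PRNG (mulberry32) so resampling is reproducible for a given seed.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid]! : (sorted[mid - 1]! + sorted[mid]!) / 2;
}

// Hypothetical result shape; the real library's shape may differ.
interface PairedCi {
  median: number;
  lower: number;
  upper: number;
}

// Percentile bootstrap over the median of paired differences (candidate - baseline).
function pairedBootstrapSketch(
  baseline: number[],
  candidate: number[],
  seed: number,
  resamples = 2000,
): PairedCi {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error("baseline and candidate must be non-empty and the same length");
  }
  const diffs = baseline.map((score, i) => candidate[i]! - score);
  const random = mulberry32(seed);
  const medians: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample the paired differences with replacement.
    const sample = diffs.map(() => diffs[Math.floor(random() * diffs.length)]!);
    medians.push(median(sample));
  }
  medians.sort((a, b) => a - b);
  const lower = medians[Math.floor(0.025 * resamples)]!;
  const upper = medians[Math.min(resamples - 1, Math.floor(0.975 * resamples))]!;
  return { median: median(diffs), lower, upper };
}

// Hypothetical scores: a consistent improvement on every item.
const steady = pairedBootstrapSketch([1, 2, 3], [6, 7, 9], 42);
const shipSteady = steady.lower > 0;

// Hypothetical scores: one big win, one big loss. The mean gain is about 0.67,
// so a bare `v1Mean - v0Mean >= 0.5` check would pass.
const noisy = pairedBootstrapSketch([5, 5, 5], [1, 5, 11], 42);
const shipNoisy = noisy.lower > 0;

console.log({ steady, shipSteady, noisy, shipNoisy });
```

In the steady case every paired difference is positive, so no resample can pull the lower bound to 0 and the gate ships. In the noisy case a point comparison of means would ship, while the interval on the paired differences reaches well below 0 and the gate holds.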

Verify (all green)

- `pnpm docs:check`: docs freshness: OK (§2 table symbols against 4134 public exports; prose symbols against 5795 resolvable; docs/api regen diff clean)
- `pnpm run build`: clean
- `pnpm run typecheck` + `typecheck:examples`: clean (the fixed example compiles)
- `pnpm run lint`: Checked 319 files, no fixes applied
- Example runs offline e2e: ships v1 on the paired CI [5.00, 6.00]

DO NOT MERGE — operator review.


…e examples

The anti-reinvention decision table (§2) had zero rows for whole families of live exports, so agents kept hand-rolling them. Add grouped rows for:

- statistics: every test on the agent-eval main barrel (pairedBootstrap, wilson, pairedTTest, wilcoxonSignedRank, mcnemar/mcnemarPower, mannWhitneyU, passAtK, confidenceInterval, benjaminiHochberg/bonferroni, cohensD/cliffsDelta, requiredSampleSize/pairedMde, eProcess, weightedComposite, corpusInterRaterAgreement, pairedRiskDifference, pearsonR/spearmanR, paretoFrontier/dominates)
- judge: JudgeConfig + ensembleJudge, with a warn-off for the @deprecated JudgeFn factories and an explicit note that there is no llmJudge export
- authenticity: scoreAuthenticity / gateRealness / AuthenticitySignals (agent-eval/authenticity subpath)
- verification: MultiLayerVerifier + gradeSemanticStatus
- reference-replay: runReferenceReplay / scoreReferenceReplay / decideReferenceReplayPromotion over jsonlReferenceReplayStore
- token/usage seam: extractLlmCallEvent + reportLoopUsage / UsageSink (/loops)

Teach the freshness gate the agent-eval/authenticity subpath barrel so a §2 row that recommends importing from it resolves (parallel to contract/campaign/index).

De-reinvent the one example that hand-rolled primitives: examples/self-improving-loop replaces its local AnalystFinding interface + runAnalyst with the canonical AnalystFinding stamped by makeFinding, and its bare v1Mean-v0Mean>=0.5 point-comparison gate with pairedBootstrap (the statistical core the production held-out gate is built on). Stays offline and deterministic; ships v1 on a +5.00 [5.00, 6.00] paired CI. README + comments mirror the clean companion examples (improve/, intelligence-recommend/).

tangletools

✅ Auto-approved PR — `70e098d8`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T08:54:31Z}

…fImprove not a bare mean gate

Folds the one real example fix from #370 (otherwise superseded by the generated catalog) into this PR: self-improving-loop hand-rolled the ship gate as a bare mean comparison instead of the real HeldOutGate/selfImprove primitives.

drewstone · 2026-06-24T09:10:09Z

Superseded by #371. The primitive inventory is now GENERATED (docs/api/primitive-catalog.md, freshness-gate-enforced) so it can't go stale by hand — the root cause this PR hand-patched. The ~30 primitives catalogued here are now auto-generated; the one real example fix (self-improving-loop → HeldOutGate/selfImprove) was folded into #371.

…erence cannot go stale (#371)

* docs(api): generate the primitive catalog so the anti-reinvention reference cannot go stale

  The hand-listed primitive inventory in docs/canonical-api.md drifted from source: it had zero mentions of live exports (scoreAuthenticity, gateRealness, MultiLayerVerifier, wilson, pairedTTest, runProfileMatrix, extractUsage, …). Anything derivable from source must be generated, not hand-written; only judgment stays curated.

  - scripts/gen-primitive-catalog.mjs reads the LIVE exports of (a) this package's own public subpaths (from package.json `exports`) and (b) a curated category->subpath map of the @tangle-network/agent-eval substrate surfaces agents should reuse (judge, authenticity, verification, statistics, campaign, token/usage). Extraction is via the TypeScript compiler API over a virtual re-export entry, so it follows aliased re-exports and content-hashed bundle files, the exact things that rot a hand list. Emits docs/api/primitive-catalog.md with a GENERATED header (name, import path, one-line summary per export, grouped by surface).
  - Wired into `docs:api` (runs after TypeDoc). The freshness gate gains a seventh class (CATALOG): it regenerates the catalog to a temp file and byte-compares to the committed copy, so a new/removed/renamed live export absent from the catalog is a RED BUILD.
  - Shrank canonical-api.md: removed the export-inventory enumeration from the banner and the §2 preamble, replaced with pointers to docs/api/primitive-catalog.md. Kept all the judgment: the decision gate, §1.5 AgentProfile law, the §2 "I want to -> use -> NOT" table and every "Do NOT". The version + substrate-peer pins stay (gate-enforced).
  - MAINTAINING.md documents the generated-inventory layer, CLASS 7, and its fix path.

* chore(deps): bump agent-eval to 0.99.0 + regenerate primitive catalog

  agent-eval 0.99.0 adds llmJudge (+ the full current judge/auth/verify/stats surface); regenerating the generated catalog picks it up with zero hand-work, which is the point of the generator. Lockfile was pinned at 0.97.0 (pre-llmJudge) despite agent-eval already being in minimumReleaseAgeExclude.

* fix(deps): keep agent-eval peer floor at >=0.97.0

  Only the examples (devDependency) need 0.99.0 for llmJudge; agent-runtime's src does not, so the peer floor must not force consumers onto 0.99.0. Catalog + lockfile stay on the resolved 0.99.0 so the examples get llmJudge.

* docs(examples): de-reinvent self-improving-loop: use HeldOutGate/selfImprove not a bare mean gate

  Folds the one real example fix from #370 (otherwise superseded by the generated catalog) into this PR: self-improving-loop hand-rolled the ship gate as a bare mean comparison instead of the real HeldOutGate/selfImprove primitives.
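The CATALOG class described in the #371 commit message (regenerate, then compare against the committed copy) can be sketched roughly as follows. This is not the code in `scripts/gen-primitive-catalog.mjs` or the freshness gate: `ExportEntry`, `renderCatalog`, `checkCatalog`, the header text, the import paths, and the summaries are all hypothetical, and the sketch compares strings where the real gate byte-compares files.

```typescript
// Hypothetical sketch of a generate-then-compare gate; not the repo's real scripts.
interface ExportEntry {
  name: string;
  importPath: string;
  summary: string;
}

// Locale-independent string ordering so output is identical on every machine.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Render deterministically: the output depends only on the set of entries,
// never on the order in which they were discovered.
function renderCatalog(entries: ExportEntry[]): string {
  const sorted = [...entries].sort(
    (a, b) => compareStrings(a.importPath, b.importPath) || compareStrings(a.name, b.name),
  );
  const lines = ["<!-- GENERATED: do not edit by hand -->"];
  for (const entry of sorted) {
    lines.push(`- \`${entry.name}\` (${entry.importPath}): ${entry.summary}`);
  }
  return lines.join("\n") + "\n";
}

// Regenerate from the live exports and compare with the committed text.
function checkCatalog(liveExports: ExportEntry[], committed: string): "ok" | "stale" {
  return renderCatalog(liveExports) === committed ? "ok" : "stale";
}

// Illustrative entries; import paths and summaries are made up for the example.
const scoreAuthenticityEntry: ExportEntry = {
  name: "scoreAuthenticity",
  importPath: "agent-eval/authenticity",
  summary: "placeholder summary",
};
const wilsonEntry: ExportEntry = {
  name: "wilson",
  importPath: "agent-eval",
  summary: "placeholder summary",
};
const llmJudgeEntry: ExportEntry = {
  name: "llmJudge",
  importPath: "agent-eval",
  summary: "placeholder summary",
};

const committedCatalog = renderCatalog([scoreAuthenticityEntry, wilsonEntry]);

// Same export set in a different discovery order: still matches.
const sameSet = checkCatalog([wilsonEntry, scoreAuthenticityEntry], committedCatalog);
// A dependency bump adds a live export the committed catalog lacks: gate goes red.
const afterBump = checkCatalog(
  [scoreAuthenticityEntry, wilsonEntry, llmJudgeEntry],
  committedCatalog,
);

console.log(sameSet, afterBump);
```

The deterministic ordering matters for this kind of gate: without it, a harmless change in discovery order would look the same as a real export drift.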

tangletools approved these changes Jun 24, 2026


drewstone closed this Jun 24, 2026

drewstone wants to merge 1 commit into main from docs/kill-example-reinvention

2 participants
