docs(canonical-api): close the anti-reinvention gaps + de-reinvent the examples #370

Closed
drewstone wants to merge 1 commit into main from docs/kill-example-reinvention

@drewstone

Contributor

What

The anti-reinvention decision table in docs/canonical-api.md (§2 — the doc agents are told to read before writing orchestration/measurement code) had zero rows for whole families of live exports, so agents kept hand-rolling them. This adds them, and de-reinvents the one example that hand-rolled primitives.

1. Reference gaps closed (docs/canonical-api.md §2)

Six grouped rows added, in the existing "I want to ___ → use ___ → NOT ___" voice:

| Family | Use (the real primitive) | Was being hand-rolled as |
| --- | --- | --- |
| statistics | every test on the agent-eval main barrel: pairedBootstrap, wilson, pairedTTest, wilcoxonSignedRank, mcnemar/mcnemarPower/mcnemarRequiredN, mannWhitneyU, passAtK, confidenceInterval, benjaminiHochberg/bonferroni, cohensD/cliffsDelta, requiredSampleSize/pairedMde, eProcess, weightedComposite, corpusInterRaterAgreement, pairedRiskDifference, pearsonR/spearmanR, paretoFrontier/dominates | a bootstrap-with-PRNG, a Wilson/McNemar/power calc, a Pareto sweep, a Cohen's-d per gate (the #1 measurement reinvention) |
| judge | a campaign JudgeConfig, or ensembleJudge for a multi-model panel | a hand-built judge prompt loop, or the @deprecated JudgeFn factories |
| authenticity | scoreAuthenticity / gateRealness over an AuthenticitySignals (agent-eval/authenticity subpath) | a regex realness scorer, or trusting a buildability score that rewards a polished fake |
| verification | new MultiLayerVerifier(layers).run(...) + gradeSemanticStatus(...) | a hand-chained compile→lint→test→semantic pipeline |
| reference-replay | runReferenceReplay/scoreReferenceReplay + decideReferenceReplayPromotion over jsonlReferenceReplayStore | a hand-rolled "does the candidate reproduce the reference" matcher |
| token/usage seam | extractLlmCallEvent / reportLoopUsage (UsageSink; /loops) | re-walking sandbox events to tally tokens yourself |

The judge row also kills a recurring confusion: there is no llmJudge export in this version — that name appears only in a campaign-types docstring; the live judge surface is JudgeConfig + ensembleJudge. A future agent reading this doc can no longer conclude "no judge primitive exists."

The freshness gate (scripts/check-docs-freshness.mjs) now loads the agent-eval/authenticity subpath barrel into the §2 export universe, so a row that recommends importing from that subpath resolves, exactly parallel to how contract/campaign/index are already loaded. This strengthens coverage by adding a real published export surface the gate previously couldn't see; it does not weaken the gate.
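To make the "export universe" idea concrete, here is a minimal sketch of how a row can fail to resolve until its subpath barrel is loaded. Everything in it (`findUnresolved`, `TableRow`, `BarrelExports`, the sample export lists) is a hypothetical illustration and not the actual logic of scripts/check-docs-freshness.mjs.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the real script's API.

/** Export names published by each barrel, keyed by import path. */
type BarrelExports = Record<string, readonly string[]>;

/** One decision-table row: the symbols it tells the reader to use. */
interface TableRow {
  family: string;
  symbols: readonly string[];
}

/** Returns every (family, symbol) pair that no loaded barrel exports. */
function findUnresolved(
  rows: readonly TableRow[],
  barrels: BarrelExports,
): { family: string; symbol: string }[] {
  // The export universe is the union of every loaded barrel's exports.
  const universe = new Set<string>();
  for (const path of Object.keys(barrels)) {
    for (const name of barrels[path]) universe.add(name);
  }

  const unresolved: { family: string; symbol: string }[] = [];
  for (const row of rows) {
    for (const symbol of row.symbols) {
      if (!universe.has(symbol)) unresolved.push({ family: row.family, symbol });
    }
  }
  return unresolved;
}

const rows: TableRow[] = [
  { family: "statistics", symbols: ["pairedBootstrap", "wilson"] },
  { family: "authenticity", symbols: ["scoreAuthenticity", "gateRealness"] },
];

// Only the main barrel is loaded: the authenticity row cannot resolve.
const mainOnly: BarrelExports = {
  "agent-eval": ["pairedBootstrap", "wilson"],
};

// The subpath barrel joins the universe and the same row now resolves.
const withSubpath: BarrelExports = {
  ...mainOnly,
  "agent-eval/authenticity": ["scoreAuthenticity", "gateRealness"],
};

console.log(findUnresolved(rows, mainOnly).map((u) => u.symbol));
console.log(findUnresolved(rows, withSubpath).length);
```

Adding a barrel only grows the set of names a row may reference, which is why loading the subpath widens coverage without loosening the check on any existing row.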

2. Example de-reinvented

examples/self-improving-loop was the only example (of 21 audited) that hand-rolled substrate primitives:

  • analyst — local interface AnalystFinding { rootCause; proposedMutation } + runAnalyst() → now the canonical AnalystFinding stamped by makeFinding (the exact shape improve(profile, findings, opts) reflects on); mutation rides recommended_action.
  • gate — bare v1Mean - v0Mean >= 0.5 point comparison → now pairedBootstrap(v0Scores, v1Scores, { seed }), shipping only when the paired-bootstrap CI lower bound clears 0 — the statistical core the production held-out gate (HeldOutGate / improve() over selfImprove) is built on.

Stays offline + deterministic; the demo still ships v1, now on a +5.00 paired median, 95% CI [5.00, 6.00], n=3. README + comments mirror the already-clean companion examples (improve/, intelligence-recommend/). Only the analyst body, proposer, and LLM remain scripted (for reproducibility) — the finding type and gate statistic are now the real substrate.
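For readers unfamiliar with the gate statistic, the sketch below illustrates the general idea of a seeded paired bootstrap and a "ship only if the CI lower bound clears 0" rule. It is a standalone illustration, not the library's pairedBootstrap: the function names, option names, the percentile method, and the choice of the mean of paired differences as the statistic are all assumptions made here, and the scores are made up. Real code should call the library primitive rather than copy this.

```typescript
// Illustrative only: in real code, call the library's pairedBootstrap instead.

/** Small seeded PRNG (mulberry32) so resampling is reproducible. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapInterval {
  meanDiff: number;
  lower: number;
  upper: number;
}

/** Percentile bootstrap CI for the mean of paired differences (candidate - baseline). */
function pairedBootstrapSketch(
  baseline: readonly number[],
  candidate: readonly number[],
  opts: { seed: number; resamples?: number; confidence?: number },
): BootstrapInterval {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error("paired samples must be non-empty and of equal length");
  }
  // Pairing: each item is compared with itself across the two versions.
  const diffs = baseline.map((b, i) => candidate[i] - b);
  const n = diffs.length;
  const resamples = opts.resamples ?? 2000;
  const confidence = opts.confidence ?? 0.95;
  const rand = mulberry32(opts.seed);

  // Resample the differences with replacement and record each resample's mean.
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  const alpha = 1 - confidence;
  const lowerIdx = Math.floor((alpha / 2) * (resamples - 1));
  const upperIdx = Math.ceil((1 - alpha / 2) * (resamples - 1));
  const meanDiff = diffs.reduce((s, d) => s + d, 0) / n;
  return { meanDiff, lower: means[lowerIdx], upper: means[upperIdx] };
}

/** Ship only when the whole interval sits above zero. */
function shouldShip(ci: BootstrapInterval): boolean {
  return ci.lower > 0;
}

// Made-up scores for illustration; every paired difference is positive.
const v0Scores = [60, 62, 58, 61, 59];
const v1Scores = [66, 67, 63, 67, 64];
const improved = pairedBootstrapSketch(v0Scores, v1Scores, { seed: 42 });
console.log(shouldShip(improved)); // true

// No change at all: the interval collapses to zero, so the gate holds.
const unchanged = pairedBootstrapSketch(v0Scores, v0Scores, { seed: 42 });
console.log(shouldShip(unchanged)); // false
```

Compared with a fixed threshold on the difference of means, the interval-based rule accounts for how consistent the improvement is across items, and the seed keeps the decision reproducible from run to run.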

Verify (all green)

  • pnpm docs:check → docs freshness: OK (§2 table symbols against 4134 public exports; prose symbols against 5795 resolvable; docs/api regen diff clean)
  • pnpm run build — clean
  • pnpm run typecheck + typecheck:examples — clean (the fixed example compiles)
  • pnpm run lint — Checked 319 files, no fixes applied
  • Example runs offline e2e: ships v1 on the paired CI [5.00, 6.00]

DO NOT MERGE — operator review.

…e examples

The anti-reinvention decision table (§2) had zero rows for whole families of
live exports, so agents kept hand-rolling them. Add grouped rows for:

- statistics — every test on the agent-eval main barrel (pairedBootstrap,
  wilson, pairedTTest, wilcoxonSignedRank, mcnemar/mcnemarPower, mannWhitneyU,
  passAtK, confidenceInterval, benjaminiHochberg/bonferroni, cohensD/cliffsDelta,
  requiredSampleSize/pairedMde, eProcess, weightedComposite,
  corpusInterRaterAgreement, pairedRiskDifference, pearsonR/spearmanR,
  paretoFrontier/dominates)
- judge — JudgeConfig + ensembleJudge, with a warn-off for the @deprecated
  JudgeFn factories and an explicit note that there is no llmJudge export
- authenticity — scoreAuthenticity / gateRealness / AuthenticitySignals
  (agent-eval/authenticity subpath)
- verification — MultiLayerVerifier + gradeSemanticStatus
- reference-replay — runReferenceReplay / scoreReferenceReplay /
  decideReferenceReplayPromotion over jsonlReferenceReplayStore
- token/usage seam — extractLlmCallEvent + reportLoopUsage / UsageSink (/loops)

Teach the freshness gate the agent-eval/authenticity subpath barrel so a §2 row
that recommends importing from it resolves (parallel to contract/campaign/index).

De-reinvent the one example that hand-rolled primitives:
examples/self-improving-loop replaces its local AnalystFinding interface +
runAnalyst with the canonical AnalystFinding stamped by makeFinding, and its
bare v1Mean-v0Mean>=0.5 point-comparison gate with pairedBootstrap (the
statistical core the production held-out gate is built on). Stays offline and
deterministic; ships v1 on a +5.00 [5.00, 6.00] paired CI. README + comments
mirror the clean companion examples (improve/, intelligence-recommend/).

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 70e098d8

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T08:54:31Z

drewstone added a commit that referenced this pull request Jun 24, 2026
…fImprove not a bare mean gate

Folds the one real example fix from #370 (otherwise superseded by the generated
catalog) into this PR: self-improving-loop hand-rolled the ship gate as a bare
mean comparison instead of the real HeldOutGate/selfImprove primitives.
@drewstone

Contributor Author

Superseded by #371. The primitive inventory is now GENERATED (docs/api/primitive-catalog.md, freshness-gate-enforced) so it can't go stale by hand — the root cause this PR hand-patched. The ~30 primitives catalogued here are now auto-generated; the one real example fix (self-improving-loop → HeldOutGate/selfImprove) was folded into #371.

@drewstone drewstone closed this Jun 24, 2026
drewstone added a commit that referenced this pull request Jun 24, 2026
…erence cannot go stale (#371)

* docs(api): generate the primitive catalog so the anti-reinvention reference cannot go stale

The hand-listed primitive inventory in docs/canonical-api.md drifted from source:
it had zero mentions of live exports (scoreAuthenticity, gateRealness,
MultiLayerVerifier, wilson, pairedTTest, runProfileMatrix, extractUsage, …). Anything
derivable from source must be generated, not hand-written — only judgment stays curated.

- scripts/gen-primitive-catalog.mjs reads the LIVE exports of (a) this package's own
  public subpaths (from package.json `exports`) and (b) a curated category->subpath map
  of the @tangle-network/agent-eval substrate surfaces agents should reuse (judge,
  authenticity, verification, statistics, campaign, token/usage). Extraction is via the
  TypeScript compiler API over a virtual re-export entry, so it follows aliased
  re-exports and content-hashed bundle files — the exact things that rot a hand list.
  Emits docs/api/primitive-catalog.md with a GENERATED header (name, import path,
  one-line summary per export, grouped by surface).
- Wired into `docs:api` (runs after TypeDoc). The freshness gate gains a seventh class
  (CATALOG): it regenerates the catalog to a temp file and byte-compares to the committed
  copy, so a new/removed/renamed live export absent from the catalog is a RED BUILD.
- Shrank canonical-api.md: removed the export-inventory enumeration from the banner and
  the §2 preamble, replaced with pointers to docs/api/primitive-catalog.md. Kept all the
  judgment — the decision gate, §1.5 AgentProfile law, the §2 "I want to -> use -> NOT"
  table and every "Do NOT". The version + substrate-peer pins stay (gate-enforced).
- MAINTAINING.md documents the generated-inventory layer, CLASS 7, and its fix path.

* chore(deps): bump agent-eval to 0.99.0 + regenerate primitive catalog

agent-eval 0.99.0 adds llmJudge (+ the full current judge/auth/verify/stats
surface); regenerating the generated catalog picks it up with zero hand-work,
which is the point of the generator. Lockfile was pinned at 0.97.0 (pre-llmJudge)
despite agent-eval already being in minimumReleaseAgeExclude.

* fix(deps): keep agent-eval peer floor at >=0.97.0

Only the examples (devDependency) need 0.99.0 for llmJudge; agent-runtime's src
does not, so the peer floor must not force consumers onto 0.99.0. Catalog +
lockfile stay on the resolved 0.99.0 so the examples get llmJudge.

* docs(examples): de-reinvent self-improving-loop — use HeldOutGate/selfImprove not a bare mean gate

Folds the one real example fix from #370 (otherwise superseded by the generated
catalog) into this PR: self-improving-loop hand-rolled the ship gate as a bare
mean comparison instead of the real HeldOutGate/selfImprove primitives.
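
The CATALOG freshness class described in the commit message above (regenerate the catalog, then compare it to the committed copy) can be sketched roughly as follows. This is a hypothetical illustration only: `compareCatalog`, `DriftReport`, and the sample catalog text are invented for this sketch, exact string equality stands in for the byte comparison, and the real gate in the repository's scripts may work differently.

```typescript
// Hypothetical sketch; the real gate lives in the repo's scripts and may differ.

interface DriftReport {
  fresh: boolean;
  /** 1-based line number of the first difference, when not fresh. */
  firstDriftLine?: number;
}

/** Compares the committed catalog text with a freshly regenerated one. */
function compareCatalog(committed: string, regenerated: string): DriftReport {
  // Exact comparison: any difference at all means the committed copy is stale.
  if (committed === regenerated) return { fresh: true };

  // Locate the first differing line so the failure message is actionable.
  const a = committed.split("\n");
  const b = regenerated.split("\n");
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return { fresh: false, firstDriftLine: i + 1 };
  }
  // One text is a strict prefix of the other: drift starts right after it.
  return { fresh: false, firstDriftLine: shared + 1 };
}

// Made-up catalog contents: a new export appears after a dependency bump.
const committedCatalog = "# GENERATED\nensembleJudge\n";
const regeneratedCatalog = "# GENERATED\nensembleJudge\nllmJudge\n";

const report = compareCatalog(committedCatalog, regeneratedCatalog);
if (!report.fresh) {
  console.log(`catalog is stale, first drift at line ${report.firstDriftLine}`);
}
```

Because the check is an exact comparison against generator output, the only fix path is to rerun the generator and commit the result, which is what keeps the inventory from being edited by hand.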