docs(canonical-api): close the anti-reinvention gaps + de-reinvent the examples #370
Closed
drewstone wants to merge 1 commit into
…e examples

The anti-reinvention decision table (§2) had zero rows for whole families of live exports, so agents kept hand-rolling them. Add grouped rows for:

- statistics: every test on the agent-eval main barrel (pairedBootstrap, wilson, pairedTTest, wilcoxonSignedRank, mcnemar/mcnemarPower, mannWhitneyU, passAtK, confidenceInterval, benjaminiHochberg/bonferroni, cohensD/cliffsDelta, requiredSampleSize/pairedMde, eProcess, weightedComposite, corpusInterRaterAgreement, pairedRiskDifference, pearsonR/spearmanR, paretoFrontier/dominates)
- judge: JudgeConfig + ensembleJudge, with a warn-off for the @deprecated JudgeFn factories and an explicit note that there is no llmJudge export
- authenticity: scoreAuthenticity / gateRealness / AuthenticitySignals (agent-eval/authenticity subpath)
- verification: MultiLayerVerifier + gradeSemanticStatus
- reference-replay: runReferenceReplay / scoreReferenceReplay / decideReferenceReplayPromotion over jsonlReferenceReplayStore
- token/usage seam: extractLlmCallEvent + reportLoopUsage / UsageSink (/loops)

Teach the freshness gate the agent-eval/authenticity subpath barrel so a §2 row that recommends importing from it resolves (parallel to contract/campaign/index).

De-reinvent the one example that hand-rolled primitives: examples/self-improving-loop replaces its local AnalystFinding interface + runAnalyst with the canonical AnalystFinding stamped by makeFinding, and its bare v1Mean-v0Mean>=0.5 point-comparison gate with pairedBootstrap (the statistical core the production held-out gate is built on). Stays offline and deterministic; ships v1 on a +5.00 [5.00, 6.00] paired CI. README + comments mirror the clean companion examples (improve/, intelligence-recommend/).
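To picture why teaching the gate a new subpath barrel makes a row resolve, here is a minimal sketch. It assumes the gate amounts to a set-membership check of each recommended symbol against the union of loaded barrels. The `Barrel` type, both helper functions, and the sample export lists are hypothetical illustrations, not the contents of scripts/check-docs-freshness.mjs.

```typescript
// Hypothetical model of a loaded barrel: a subpath plus the names it exports.
type Barrel = { subpath: string; exports: string[] };

// Union the exports of every loaded barrel into one lookup set.
function buildExportUniverse(barrels: Barrel[]): Set<string> {
  const universe = new Set<string>();
  for (const barrel of barrels) {
    for (const name of barrel.exports) universe.add(name);
  }
  return universe;
}

// Return the symbols a table row recommends that no loaded barrel exports.
function unresolvedSymbols(rowSymbols: string[], universe: Set<string>): string[] {
  return rowSymbols.filter((name) => !universe.has(name));
}

// Illustrative data only: a tiny slice of each surface named in the commit message.
const mainBarrel: Barrel = { subpath: "agent-eval", exports: ["pairedBootstrap", "wilson"] };
const authenticityBarrel: Barrel = {
  subpath: "agent-eval/authenticity",
  exports: ["scoreAuthenticity", "gateRealness"],
};
const authenticityRow = ["scoreAuthenticity", "gateRealness"];

// Without the subpath barrel the row cannot resolve; with it, nothing is left over.
const unresolvedBefore = unresolvedSymbols(authenticityRow, buildExportUniverse([mainBarrel]));
const unresolvedAfter = unresolvedSymbols(
  authenticityRow,
  buildExportUniverse([mainBarrel, authenticityBarrel]),
);
console.log(unresolvedBefore.length, unresolvedAfter.length);
```

Under this model, adding a barrel can only grow the set of resolvable symbols, which matches the claim that the change strengthens coverage.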
tangletools approved these changes on Jun 24, 2026
tangletools (Contributor) left a comment:
✅ Auto-approved PR — 70e098d8
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T08:54:31Z
drewstone added a commit that referenced this pull request on Jun 24, 2026:

…fImprove not a bare mean gate

Folds the one real example fix from #370 (otherwise superseded by the generated catalog) into this PR: self-improving-loop hand-rolled the ship gate as a bare mean comparison instead of the real HeldOutGate/selfImprove primitives.
drewstone (Contributor, Author) commented:

Superseded by #371. The primitive inventory is now GENERATED (docs/api/primitive-catalog.md, freshness-gate-enforced) so it can't go stale by hand, which is the root cause this PR hand-patched. The ~30 primitives catalogued here are now auto-generated; the one real example fix (self-improving-loop → HeldOutGate/selfImprove) was folded into #371.
drewstone added a commit that referenced this pull request on Jun 24, 2026:

…erence cannot go stale (#371)

* docs(api): generate the primitive catalog so the anti-reinvention reference cannot go stale

  The hand-listed primitive inventory in docs/canonical-api.md drifted from source: it had zero mentions of live exports (scoreAuthenticity, gateRealness, MultiLayerVerifier, wilson, pairedTTest, runProfileMatrix, extractUsage, …). Anything derivable from source must be generated, not hand-written; only judgment stays curated.

  - scripts/gen-primitive-catalog.mjs reads the LIVE exports of (a) this package's own public subpaths (from package.json `exports`) and (b) a curated category->subpath map of the @tangle-network/agent-eval substrate surfaces agents should reuse (judge, authenticity, verification, statistics, campaign, token/usage). Extraction is via the TypeScript compiler API over a virtual re-export entry, so it follows aliased re-exports and content-hashed bundle files, the exact things that rot a hand list. Emits docs/api/primitive-catalog.md with a GENERATED header (name, import path, one-line summary per export, grouped by surface).
  - Wired into `docs:api` (runs after TypeDoc). The freshness gate gains a seventh class (CATALOG): it regenerates the catalog to a temp file and byte-compares to the committed copy, so a new/removed/renamed live export absent from the catalog is a RED BUILD.
  - Shrank canonical-api.md: removed the export-inventory enumeration from the banner and the §2 preamble, replaced with pointers to docs/api/primitive-catalog.md. Kept all the judgment: the decision gate, §1.5 AgentProfile law, the §2 "I want to -> use -> NOT" table and every "Do NOT". The version + substrate-peer pins stay (gate-enforced).
  - MAINTAINING.md documents the generated-inventory layer, CLASS 7, and its fix path.

* chore(deps): bump agent-eval to 0.99.0 + regenerate primitive catalog

  agent-eval 0.99.0 adds llmJudge (+ the full current judge/auth/verify/stats surface); regenerating the generated catalog picks it up with zero hand-work, which is the point of the generator. Lockfile was pinned at 0.97.0 (pre-llmJudge) despite agent-eval already being in minimumReleaseAgeExclude.

* fix(deps): keep agent-eval peer floor at >=0.97.0

  Only the examples (devDependency) need 0.99.0 for llmJudge; agent-runtime's src does not, so the peer floor must not force consumers onto 0.99.0. Catalog + lockfile stay on the resolved 0.99.0 so the examples get llmJudge.

* docs(examples): de-reinvent self-improving-loop, use HeldOutGate/selfImprove not a bare mean gate

  Folds the one real example fix from #370 (otherwise superseded by the generated catalog) into this PR: self-improving-loop hand-rolled the ship gate as a bare mean comparison instead of the real HeldOutGate/selfImprove primitives.
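The CATALOG check in the first commit above (regenerate, then compare against the committed copy) can be pictured with a small sketch. It assumes the generator is deterministic for a given export list. `renderCatalog` and `catalogIsFresh` are hypothetical names, the output format is invented for illustration, and the comparison is done on in-memory strings rather than on a temp file as the real gate is described as doing.

```typescript
// Hypothetical stand-in for the generator: same export list in, same text out.
// Sorting makes the output independent of the order exports are discovered in.
function renderCatalog(exportNames: string[]): string {
  const rows = exportNames
    .slice()
    .sort()
    .map((name) => `- ${name}`);
  return ["GENERATED: do not edit by hand", ...rows, ""].join("\n");
}

// The committed catalog is fresh only if regenerating from the live exports
// reproduces it exactly; any added, removed, or renamed export changes the text.
function catalogIsFresh(committedCatalog: string, liveExports: string[]): boolean {
  return renderCatalog(liveExports) === committedCatalog;
}

// Illustrative run: the committed copy was generated before llmJudge existed.
const committedCatalog = renderCatalog(["wilson", "pairedTTest"]);
const freshWhenUnchanged = catalogIsFresh(committedCatalog, ["pairedTTest", "wilson"]);
const freshAfterNewExport = catalogIsFresh(committedCatalog, ["pairedTTest", "wilson", "llmJudge"]);
console.log(freshWhenUnchanged, freshAfterNewExport);
```

An exact comparison is deliberately strict: the only way to turn the check green after an export changes is to rerun the generator and commit its output, which is the fix path the commit describes.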
What

The anti-reinvention decision table in `docs/canonical-api.md` (§2, the doc agents are told to read before writing orchestration/measurement code) had zero rows for whole families of live exports, so agents kept hand-rolling them. This adds them, and de-reinvents the one example that hand-rolled primitives.

1. Reference gaps closed (`docs/canonical-api.md` §2)

Six grouped rows added, in the existing "I want to ___ → use ___ → NOT ___" voice:

- statistics: `pairedBootstrap`, `wilson`, `pairedTTest`, `wilcoxonSignedRank`, `mcnemar` / `mcnemarPower` / `mcnemarRequiredN`, `mannWhitneyU`, `passAtK`, `confidenceInterval`, `benjaminiHochberg` / `bonferroni`, `cohensD` / `cliffsDelta`, `requiredSampleSize` / `pairedMde`, `eProcess`, `weightedComposite`, `corpusInterRaterAgreement`, `pairedRiskDifference`, `pearsonR` / `spearmanR`, `paretoFrontier` / `dominates`
- judge: `JudgeConfig`, or `ensembleJudge` for a multi-model panel; NOT the `@deprecated` `JudgeFn` factories
- authenticity: `scoreAuthenticity` → `gateRealness` over an `AuthenticitySignals` (`agent-eval/authenticity` subpath)
- verification: `new MultiLayerVerifier(layers).run(...)` + `gradeSemanticStatus(...)`
- reference-replay: `runReferenceReplay` / `scoreReferenceReplay` + `decideReferenceReplayPromotion` over `jsonlReferenceReplayStore`
- token/usage seam: `extractLlmCallEvent` / `reportLoopUsage` (`UsageSink`), from `/loops`

The judge row also kills a recurring confusion: there is no `llmJudge` export in this version. That name appears only in a campaign-types docstring; the live judge surface is `JudgeConfig` + `ensembleJudge`. A future agent reading this doc can no longer conclude "no judge primitive exists."

The freshness gate (`scripts/check-docs-freshness.mjs`) now loads the `agent-eval/authenticity` subpath barrel into the §2 export universe, so a row that recommends importing from that subpath resolves, exactly parallel to how `contract` / `campaign` / `index` are already loaded. This strengthens coverage (a real published export surface the gate previously couldn't see); it does not weaken the gate.

2. Example de-reinvented

`examples/self-improving-loop` was the only example (of 21 audited) that hand-rolled substrate primitives:

- `interface AnalystFinding { rootCause; proposedMutation }` + `runAnalyst()` → now the canonical `AnalystFinding` stamped by `makeFinding` (the exact shape `improve(profile, findings, opts)` reflects on); mutation rides `recommended_action`.
- `v1Mean - v0Mean >= 0.5` point comparison → now `pairedBootstrap(v0Scores, v1Scores, { seed })`, shipping only when the paired-bootstrap CI lower bound clears 0. This is the statistical core the production held-out gate (`HeldOutGate` / `improve()` over `selfImprove`) is built on.

Stays offline + deterministic; the demo still ships v1, now on a +5.00 paired median, 95% CI [5.00, 6.00], n=3. README + comments mirror the already-clean companion examples (`improve/`, `intelligence-recommend/`). Only the analyst body, proposer, and LLM remain scripted (for reproducibility); the finding type and gate statistic are now the real substrate.

Verify (all green)

- `pnpm docs:check`: docs freshness: OK (§2 table symbols against 4134 public exports; prose symbols against 5795 resolvable; `docs/api` regen diff clean)
- `pnpm run build`: clean
- `pnpm run typecheck` + `typecheck:examples`: clean (the fixed example compiles)
- `pnpm run lint`: Checked 319 files, no fixes applied
- `[5.00, 6.00]`

DO NOT MERGE: operator review.
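The gate change in section 2 (ship only when the paired-bootstrap CI lower bound clears 0, instead of comparing two means against a fixed margin) can be sketched as below. This is a self-contained percentile bootstrap over paired mean differences, written for illustration. It is not the agent-eval `pairedBootstrap` implementation, whose signature, resampling scheme, and reported statistic (the PR mentions a paired median) may differ. `pairedBootstrapCi`, `shouldShip`, `mulberry32`, and the score arrays are all hypothetical.

```typescript
// Illustrative sketch only: not the agent-eval pairedBootstrap implementation.
interface PairedCi {
  meanDiff: number;
  lower: number;
  upper: number;
}

// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for the mean of per-item differences (candidate - baseline).
function pairedBootstrapCi(
  baseline: number[],
  candidate: number[],
  seed: number,
  resamples = 2000,
  alpha = 0.05,
): PairedCi {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const diffs = candidate.map((score, i) => score - baseline[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample the paired differences with replacement and record their mean.
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((a, b) => a - b);
  const lowerIdx = Math.floor((alpha / 2) * resamples);
  const upperIdx = Math.min(resamples - 1, Math.ceil((1 - alpha / 2) * resamples) - 1);
  const meanDiff = diffs.reduce((sum, d) => sum + d, 0) / diffs.length;
  return { meanDiff, lower: means[lowerIdx], upper: means[upperIdx] };
}

// Ship only when the whole interval sits above zero.
const shouldShip = (ci: PairedCi): boolean => ci.lower > 0;

// Consistent improvement on every item: the interval stays above zero.
const clearWin = pairedBootstrapCi([2, 3, 2, 3, 2], [7, 8, 8, 9, 7], 42);
// Gains and losses that cancel out: the interval straddles zero.
const noise = pairedBootstrapCi([5, 5, 5, 5], [6, 4, 6, 4], 42);
console.log(shouldShip(clearWin), shouldShip(noise));
```

The difference from a point comparison is that the interval reflects how consistent the improvement is across items, and resampling pairs rather than each arm independently keeps item difficulty from inflating the variance. Fixing the seed is what lets such a demo stay deterministic.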