Skip to content

fix(skill-creator): run_eval.py always reports 0% recall — install the eval artifact as a real skill; fix Windows stream reading, trigger detection, and parallel workers#1298

Open
MartinCajiao wants to merge 3 commits into
anthropics:mainfrom
MartinCajiao:fix/run-eval-skill-triggering

Conversation

@MartinCajiao

Copy link
Copy Markdown

Problem

run_eval.py — and therefore run_loop.py and improve_description.py, which consume its signal — reports recall=0% for every skill description regardless of content (#556, 10+ independent reproductions). The description-optimization loop is currently optimizing against noise.

There are four independent causes. Each one alone is sufficient for ~0% recall, which is why partial fixes haven't moved the number.

1. The eval artifact can never auto-trigger (all platforms)

The harness installs the candidate description at .claude/commands/<name>-skill-<uuid>.md and expects the model to auto-trigger it. But commands and skills are different surfaces: the init event of claude -p lists command files under slash_commands, while the model-facing auto-trigger list (skills) is built only from .claude/skills/*/SKILL.md.

Verified on Claude Code 2.1.149 by planting both artifact types side by side and inspecting the init event:

"slash_commands": ["pokemon-name-rater", "pokemon-name-rater-cmd", ...],
"skills":         ["pokemon-name-rater", ...]

The command file (pokemon-name-rater-cmd) appears in slash_commands but is absent from skills — it is structurally invisible to auto-triggering. The real skill in .claude/skills/ auto-triggered immediately in the same headless session (Skill {"skill": "pokemon-name-rater"} from a natural query). Same conclusion as @hysteric-lab's and @samuelcastro's experiments in #556.

2. On Windows, the harness crashes before measuring anything

select.select() on subprocess pipes only works on sockets on Windows. Every query raises OSError, the generic except Exception in run_eval() converts it to a stderr warning, and the query is recorded as not-triggered:

Warning: query failed: [WinError 10038] An operation was attempted on something that is not a socket
...
[FAIL] rate=0/1 expected=True: Rate the POKEMON name Snorlax for me

So on Windows recall is structurally 0% even with cause 1 fixed. Stream reading now goes through reader threads + a queue, which behaves identically on all platforms.

Relatedly: Popen(["claude", ...]) only finds claude.exe on Windows. npm installs ship a claude.cmd shim, which raises FileNotFoundError — also swallowed by the same handler, also producing silent 0%. The executable is now resolved once with shutil.which() (verified: bare .cmd name fails, which()-resolved full path works).

3. Detection counted exploration as a miss

The streaming detector returned False at the first tool_use that wasn't Skill/Read. Claude frequently starts with TodoWrite/Glob/Bash before consulting a skill, so legitimate triggers were recorded as misses (reported by @Vaikri-costume in #556; survives the existing fix PRs). Now non-Skill tools are simply ignored, and only a terminal result event, process exit, or timeout decides not-triggered. Conversation length is capped with --max-turns so non-triggering queries stay cheap.

4. Parallel workers raced each other

Each worker wrote its own UUID-named copy of the artifact with an identical description. Claude picks one arbitrarily, so sibling workers miss triggers that land on another worker's copy — trigger rate scales as ~1/N (@rhys-childs measured 100% at --num-workers 1 vs 11% at the default 10). One shared artifact per run is now created before the pool starts and removed after it drains, so full parallelism is safe.

What changed

  • Install the candidate description as a real skill at <project_root>/.claude/skills/<name>-eval-<id>/SKILL.md; one shared artifact per run, removed in a finally.
  • Replace select.select() with portable reader threads (stdout queue + bounded stderr tail).
  • Resolve the claude executable with shutil.which(); fail fast with a clear error if absent.
  • Never early-return False on non-Skill tools; decide not-triggered only on result/exit/timeout; cap with --max-turns 8.
  • Capture stderr and surface an excerpt on nonzero exit or timeout instead of DEVNULL (these silent failures are why run_eval.py: claude -p never triggers skills/commands (0% trigger rate across all queries) #556 was hard to diagnose).
  • Warn when an installed skill shares the candidate's name (shadowing deflates measured rates).
  • Sweep stale <name>-eval-* artifacts left behind by killed runs.

run_eval()'s signature, the CLI flags, and the output JSON are unchanged — run_loop.py and improve_description.py work unmodified.

Why not swap the real SKILL.md in place

#1027 takes that approach. Writing into the user's actual skill directory is destructive if the run dies mid-swap (backups and signal handlers mitigate but cannot cover SIGKILL/power loss). This PR never touches user files: the artifact is a new, uniquely named directory, removed on completion, swept if a previous run crashed, and shadowing is handled with a loud warning instead of moving the user's skill aside (the concern #996 addresses by relocation).

Tradeoff kept deliberately: the eval skill's name carries an -eval-<id> suffix while production skills don't. The description — which is what routing decides on — is identical, and the suffix avoids collisions with installed skills. This matches the previous design's tradeoff.

Verification (Windows 11, Claude Code 2.1.149, Python 3.12)

  1. Init-event A/B proving commands never enter the model's skills list (snippet above).
  2. Offline harness test against a stub claude.cmd emitting canned stream-json — exercises .cmd resolution, reader threads, 4-way parallelism on one shared artifact, TodoWrite-before-Skill detection, and artifact cleanup, with zero API cost: 4/4 pass.
  3. Error-path suite (16 checks, offline): late trigger after TodoWrite+Bash; streaming detection with the skill name split across input_json_delta chunks; garbage stdout lines tolerated; nonzero exit counted as not-triggered with the stderr excerpt surfaced; hanging CLI killed promptly at timeout; missing CLI raising a clear RuntimeError; artifact frontmatter round-trip; age-guarded stale sweep removing a 2h-old leftover while keeping a fresh one and ignoring similarly named non-artifact dirs. All pass.
  4. Integration: run_loop.py (unmodified) run end-to-end against the stub — precision/recall/accuracy computed, result schema consumed intact, all_passed exit path reached.
  5. Real before/after on the same 4-query eval set (--model haiku, 2 workers):
    • main: four WinError 10038 warnings, should-trigger 0/1 + 0/1 (recall 0%), summary 2/4 — and the 2 "passes" are the should-not-trigger queries passing vacuously.
    • this PR: should-trigger 1/1 + 1/1, should-not-trigger 0/1 + 0/1, summary 4/4. Re-run after every revision; no artifacts left behind.
  6. Lint: ruff check clean on default rules; a strict rule set (F,E,W,B,UP,SIM,RET,ARG,RUF) only flags lines inherited unchanged from the previous version of main().

Fixes #556. Related: #794 (cause 4 only), #996 (cause 1 + shadowing via relocation), #1027 (cause 1 via in-place swap). Upstream context: anthropics/claude-code#32184 (closed stale, no maintainer response).

run_eval.py reported 0% recall for every skill description (anthropics#556).
Four independent causes, each sufficient on its own:

- The candidate description was installed as a .claude/commands/ file,
  which claude -p lists under slash_commands but never under the
  model-facing skills list, so it could never auto-trigger. Install it
  as a real skill under .claude/skills/<name>-eval-<id>/ instead.
- Stream reading used select.select() on pipes, which only works on
  sockets on Windows: every query raised WinError 10038, was swallowed
  by the generic exception handler, and silently counted as
  not-triggered. Replace with portable reader threads.
- Detection returned False at the first non-Skill tool_use, counting
  runs where Claude explores (TodoWrite/Glob/Bash) before consulting
  the skill as misses. Only a terminal result event, process exit, or
  timeout decides not-triggered now.
- Each parallel worker created its own UUID-named copy of the
  artifact, so Claude picked one arbitrarily and sibling workers
  missed their triggers. One shared artifact per run, created before
  the pool starts and removed after it drains.

Also: resolve the claude executable with shutil.which() (npm installs
ship a claude.cmd shim that bare Popen cannot find on Windows), capture
stderr and surface it on failures instead of discarding it, warn when
an installed skill with the same name could shadow the candidate, sweep
stale eval artifacts left by crashed runs, and cap conversation length
with --max-turns.

Verified on Windows 11 with Claude Code 2.1.149: a 4-query eval set
went from 0% recall (per-query WinError warnings) to 2/2
should-trigger and 0/2 should-not-trigger. run_eval() signature, CLI
flags, and output JSON are unchanged; run_loop.py works unmodified.

Fixes anthropics#556
- Only sweep stale eval artifacts older than 1h so a sweep never
  deletes the live artifact of a concurrent eval of the same skill.
- Bound the post-EOF process.wait() so a claude process that lingers
  after closing stdout cannot hang a worker past its timeout.
@xg-gh-25

Copy link
Copy Markdown

Brutal diagnosis — four independent causes each sufficient for 0% recall, which explains why partial fixes didn't move the number. The "optimizing against noise" framing is exactly right: a metric that always reads zero might as well not exist.

The cross-platform testing rigor (Windows stream handling, executable resolution) and the systematic verification strategy (stub harness + offline tests before real integration) are both production patterns we recognize. A few observations:

On the artifact installation strategy: The decision to install as .claude/skills/<name>-eval-<id>/SKILL.md instead of swapping the real skill in-place is the safer choice. The tradeoff you note (suffix on the skill name) is acceptable because routing happens on description content, not the name. One risk to flag: if the model's routing logic includes edit distance on skill names as a tie-breaker (some frameworks do this), the suffix could deflate trigger rates slightly. Have you verified that Claude Code's skill picker ignores name similarity entirely?

On cause 3 (exploration counted as miss): The fix — "non-Skill tools are ignored, only terminal events decide not-triggered" — is correct for auto-triggering, but it changes what you're measuring. Before: "did the skill trigger first?" After: "did the skill trigger eventually?" The latter is the right metric for description optimization (you want the model to reach the skill at all), but if you're also evaluating routing efficiency (how many irrelevant tools were used before the right one), you'd want to track that separately. Consider logging steps_to_trigger as optional metadata.

On cause 4 (parallel worker race): The shared artifact solution is clean, but there's a hidden complexity: if multiple runs execute concurrently (e.g., CI pipeline with parallel jobs), they'll still collide on the artifact name. The -eval-<id> suffix mitigates this if <id> is globally unique (UUID), but if it's process-scoped (PID), you can get cross-process races. Suggest documenting the <id> uniqueness guarantee — or add a lockfile if not already present.

On the verification strategy: The offline stub harness is excellent (we use this pattern too — "canned stream-json" for zero-cost iteration). One gap: your offline tests cover Windows stream handling and parallel workers, but they don't cover the actual trigger detection logic against malformed/truncated tool_use events. Consider adding a fuzzer that injects partial JSON chunks mid-event to verify your detection doesn't false-positive.

Minor: The PR description says "Tradeoff kept deliberately: the eval skill's name carries an -eval- suffix while production skills don't." But you also say "the suffix avoids collisions with installed skills." These are compatible, but clarify: does the suffix prevent shadowing (good), or does it introduce name-mismatch risk (potential concern)? The warning about shadowing suggests you're aware of both sides.

Last: The "sweep stale <name>-eval-* artifacts" logic is defensive (we do this too). What's the age threshold? 2 hours? 24 hours? If it's too aggressive, you'll delete artifacts from paused debugging sessions. If too permissive, you'll accumulate cruft. Consider making it configurable or at least documenting the default.


Eval loop signal integrity recognized via SwarmAI. Discussion: T-MEM

@MartinCajiao

MartinCajiao commented Jun 10, 2026 via email

Copy link
Copy Markdown
Author

Addresses review feedback on the eval artifact strategy:

- The candidate skill now keeps its real name. Isolation moved from a
  name suffix to the project directory: each run creates a throwaway
  project under <tmp>/claude-skill-evals/<name>-<uuid4> and claude -p
  executes with it as cwd, so the name-similarity question around the
  -eval-<id> suffix is moot and the caller project is never touched.
- Cross-run collisions are impossible by construction: the project
  directory is keyed by a full UUID4 and scoped to a dedicated per-run
  directory (the previous id was already UUID-derived, not
  process-scoped; now the property is structural).
- The stale sweep threshold (previously hardcoded to 1 hour) is now
  --stale-artifact-hours, defaulting to 12 hours: long enough to
  survive paused debugging sessions, short enough not to accumulate
  cruft. The sweep only ever runs inside the dedicated temp root.
- The shadowing warning now covers user-level skills only; everything
  project-level is isolated by construction.

run_eval() stays backward compatible (project_root retained but
deprecated); run_loop.py works unmodified. Verified on Windows 11 +
Claude Code 2.1.149: offline stub suite 16/16 and 4/4 eval queries,
real end-to-end 2/2 should-trigger and 0/2 should-not with the clean
skill name, zero leftover directories.
@MartinCajiao

MartinCajiao commented Jun 11, 2026

Copy link
Copy Markdown
Author

Pushed 79b3df9 delivering the refinements discussed above:

1. Sandboxed eval environment (name-mismatch + shadowing): went one step further than promised. Each run now creates an isolated throwaway project under the OS temp dir (<tmp>/claude-skill-evals/<skill>-<uuid4>/.claude/skills/<real-name>/SKILL.md) and claude -p executes with that project as its working directory. The skill keeps its real name -- isolation comes from the project directory, not from a name suffix -- so the edit-distance question is moot by construction, and the caller's project is never touched by eval artifacts at all. On your original question: empirically the suffixed artifact was already triggering 2/2 on matching queries in the Windows e2e (routing follows the description), but removing the suffix removes the question entirely.

2. Configurable sweep threshold: now --stale-artifact-hours, defaulting to 12h as proposed. One correction to my earlier reply: the previous hardcoded window was 1 hour, not 2. The sweep also only ever operates inside the dedicated temp root now, so it cannot touch anything outside it.

3. Cross-run collisions: to be precise, the previous id was already UUID-derived (a uuid4 hex prefix), not process-scoped -- but the new layout makes the guarantee structural rather than probabilistic: a full 32-hex UUID4 keys a dedicated per-run project directory, so concurrent runs (parallel CI jobs included) cannot collide regardless of id entropy. No lockfile needed.

On steps_to_trigger: agreed -- that is the right metric for routing efficiency as opposed to routing reach. Kept out of this PR to keep the output schema stable for run_loop.py/improve_description.py, but it is a clean additive follow-up.

On fuzzing the detection against malformed/truncated tool_use events: the streaming detector accumulates input_json_delta fragments into a buffer and substring-matches on the accumulated state, so split/partial chunks are handled by construction, and malformed JSON lines are skipped. I would like to land that as in-repo tests, but the scripts currently have no test scaffolding -- verification lives in an external stub harness (canned stream-json, zero API cost) for now. Happy to contribute test scaffolding as a separate PR if maintainers want it.

Re-verified end to end on Windows 11 + Claude Code 2.1.149 after the change: 2/2 should-trigger, 0/2 should-not-trigger, zero leftover directories in temp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

run_eval.py: claude -p never triggers skills/commands (0% trigger rate across all queries)

2 participants