fix(skill-creator): run_eval.py always reports 0% recall — install the eval artifact as a real skill; fix Windows stream reading, trigger detection, and parallel workers by MartinCajiao · Pull Request #1298 · anthropics/skills

MartinCajiao · 2026-06-10T02:37:29Z

Problem

run_eval.py — and therefore run_loop.py and improve_description.py, which consume its signal — reports recall=0% for every skill description regardless of content (#556, 10+ independent reproductions). The description-optimization loop is currently optimizing against noise.

There are four independent causes. Each one alone is sufficient for ~0% recall, which is why partial fixes haven't moved the number.

1. The eval artifact can never auto-trigger (all platforms)

The harness installs the candidate description at .claude/commands/<name>-skill-<uuid>.md and expects the model to auto-trigger it. But commands and skills are different surfaces: the init event of claude -p lists command files under slash_commands, while the model-facing auto-trigger list (skills) is built only from .claude/skills/*/SKILL.md.

Verified on Claude Code 2.1.149 by planting both artifact types side by side and inspecting the init event:

"slash_commands": ["pokemon-name-rater", "pokemon-name-rater-cmd", ...],
"skills":         ["pokemon-name-rater", ...]

The command file (pokemon-name-rater-cmd) appears in slash_commands but is absent from skills — it is structurally invisible to auto-triggering. The real skill in .claude/skills/ auto-triggered immediately in the same headless session (Skill {"skill": "pokemon-name-rater"} from a natural query). Same conclusion as @hysteric-lab's and @samuelcastro's experiments in #556.

2. On Windows, the harness crashes before measuring anything

select.select() on subprocess pipes only works on sockets on Windows. Every query raises OSError, the generic except Exception in run_eval() converts it to a stderr warning, and the query is recorded as not-triggered:

Warning: query failed: [WinError 10038] An operation was attempted on something that is not a socket
...
[FAIL] rate=0/1 expected=True: Rate the POKEMON name Snorlax for me

So on Windows recall is structurally 0% even with cause 1 fixed. Stream reading now goes through reader threads + a queue, which behaves identically on all platforms.

Relatedly: Popen(["claude", ...]) only finds claude.exe on Windows. npm installs ship a claude.cmd shim, which raises FileNotFoundError — also swallowed by the same handler, also producing silent 0%. The executable is now resolved once with shutil.which() (verified: bare .cmd name fails, which()-resolved full path works).

3. Detection counted exploration as a miss

The streaming detector returned False at the first tool_use that wasn't Skill/Read. Claude frequently starts with TodoWrite/Glob/Bash before consulting a skill, so legitimate triggers were recorded as misses (reported by @Vaikri-costume in #556; survives the existing fix PRs). Now non-Skill tools are simply ignored, and only a terminal result event, process exit, or timeout decides not-triggered. Conversation length is capped with --max-turns so non-triggering queries stay cheap.

4. Parallel workers raced each other

Each worker wrote its own UUID-named copy of the artifact with an identical description. Claude picks one arbitrarily, so sibling workers miss triggers that land on another worker's copy — trigger rate scales as ~1/N (@rhys-childs measured 100% at --num-workers 1 vs 11% at the default 10). One shared artifact per run is now created before the pool starts and removed after it drains, so full parallelism is safe.

What changed

Install the candidate description as a real skill at <project_root>/.claude/skills/<name>-eval-<id>/SKILL.md; one shared artifact per run, removed in a finally.
Replace select.select() with portable reader threads (stdout queue + bounded stderr tail).
Resolve the claude executable with shutil.which(); fail fast with a clear error if absent.
Never early-return False on non-Skill tools; decide not-triggered only on result/exit/timeout; cap with --max-turns 8.
Capture stderr and surface an excerpt on nonzero exit or timeout instead of DEVNULL (these silent failures are why run_eval.py: claude -p never triggers skills/commands (0% trigger rate across all queries) #556 was hard to diagnose).
Warn when an installed skill shares the candidate's name (shadowing deflates measured rates).
Sweep stale <name>-eval-* artifacts left behind by killed runs.

run_eval()'s signature, the CLI flags, and the output JSON are unchanged — run_loop.py and improve_description.py work unmodified.

Why not swap the real SKILL.md in place

#1027 takes that approach. Writing into the user's actual skill directory is destructive if the run dies mid-swap (backups and signal handlers mitigate but cannot cover SIGKILL/power loss). This PR never touches user files: the artifact is a new, uniquely named directory, removed on completion, swept if a previous run crashed, and shadowing is handled with a loud warning instead of moving the user's skill aside (the concern #996 addresses by relocation).

Tradeoff kept deliberately: the eval skill's name carries an -eval-<id> suffix while production skills don't. The description — which is what routing decides on — is identical, and the suffix avoids collisions with installed skills. This matches the previous design's tradeoff.

Verification (Windows 11, Claude Code 2.1.149, Python 3.12)

Init-event A/B proving commands never enter the model's skills list (snippet above).
Offline harness test against a stub claude.cmd emitting canned stream-json — exercises .cmd resolution, reader threads, 4-way parallelism on one shared artifact, TodoWrite-before-Skill detection, and artifact cleanup, with zero API cost: 4/4 pass.
Error-path suite (16 checks, offline): late trigger after TodoWrite+Bash; streaming detection with the skill name split across input_json_delta chunks; garbage stdout lines tolerated; nonzero exit counted as not-triggered with the stderr excerpt surfaced; hanging CLI killed promptly at timeout; missing CLI raising a clear RuntimeError; artifact frontmatter round-trip; age-guarded stale sweep removing a 2h-old leftover while keeping a fresh one and ignoring similarly named non-artifact dirs. All pass.
Integration: run_loop.py (unmodified) run end-to-end against the stub — precision/recall/accuracy computed, result schema consumed intact, all_passed exit path reached.
Real before/after on the same 4-query eval set (--model haiku, 2 workers):
- main: four WinError 10038 warnings, should-trigger 0/1 + 0/1 (recall 0%), summary 2/4 — and the 2 "passes" are the should-not-trigger queries passing vacuously.
- this PR: should-trigger 1/1 + 1/1, should-not-trigger 0/1 + 0/1, summary 4/4. Re-run after every revision; no artifacts left behind.
Lint: ruff check clean on default rules; a strict rule set (F,E,W,B,UP,SIM,RET,ARG,RUF) only flags lines inherited unchanged from the previous version of main().

Fixes #556. Related: #794 (cause 4 only), #996 (cause 1 + shadowing via relocation), #1027 (cause 1 via in-place swap). Upstream context: anthropics/claude-code#32184 (closed stale, no maintainer response).

run_eval.py reported 0% recall for every skill description (anthropics#556). Four independent causes, each sufficient on its own: - The candidate description was installed as a .claude/commands/ file, which claude -p lists under slash_commands but never under the model-facing skills list, so it could never auto-trigger. Install it as a real skill under .claude/skills/<name>-eval-<id>/ instead. - Stream reading used select.select() on pipes, which only works on sockets on Windows: every query raised WinError 10038, was swallowed by the generic exception handler, and silently counted as not-triggered. Replace with portable reader threads. - Detection returned False at the first non-Skill tool_use, counting runs where Claude explores (TodoWrite/Glob/Bash) before consulting the skill as misses. Only a terminal result event, process exit, or timeout decides not-triggered now. - Each parallel worker created its own UUID-named copy of the artifact, so Claude picked one arbitrarily and sibling workers missed their triggers. One shared artifact per run, created before the pool starts and removed after it drains. Also: resolve the claude executable with shutil.which() (npm installs ship a claude.cmd shim that bare Popen cannot find on Windows), capture stderr and surface it on failures instead of discarding it, warn when an installed skill with the same name could shadow the candidate, sweep stale eval artifacts left by crashed runs, and cap conversation length with --max-turns. Verified on Windows 11 with Claude Code 2.1.149: a 4-query eval set went from 0% recall (per-query WinError warnings) to 2/2 should-trigger and 0/2 should-not-trigger. run_eval() signature, CLI flags, and output JSON are unchanged; run_loop.py works unmodified. Fixes anthropics#556

- Only sweep stale eval artifacts older than 1h so a sweep never deletes the live artifact of a concurrent eval of the same skill. - Bound the post-EOF process.wait() so a claude process that lingers after closing stdout cannot hang a worker past its timeout.

xg-gh-25 · 2026-06-10T03:02:16Z

Brutal diagnosis — four independent causes each sufficient for 0% recall, which explains why partial fixes didn't move the number. The "optimizing against noise" framing is exactly right: a metric that always reads zero might as well not exist.

The cross-platform testing rigor (Windows stream handling, executable resolution) and the systematic verification strategy (stub harness + offline tests before real integration) are both production patterns we recognize. A few observations:

On the artifact installation strategy: The decision to install as .claude/skills/<name>-eval-<id>/SKILL.md instead of swapping the real skill in-place is the safer choice. The tradeoff you note (suffix on the skill name) is acceptable because routing happens on description content, not the name. One risk to flag: if the model's routing logic includes edit distance on skill names as a tie-breaker (some frameworks do this), the suffix could deflate trigger rates slightly. Have you verified that Claude Code's skill picker ignores name similarity entirely?

On cause 3 (exploration counted as miss): The fix — "non-Skill tools are ignored, only terminal events decide not-triggered" — is correct for auto-triggering, but it changes what you're measuring. Before: "did the skill trigger first?" After: "did the skill trigger eventually?" The latter is the right metric for description optimization (you want the model to reach the skill at all), but if you're also evaluating routing efficiency (how many irrelevant tools were used before the right one), you'd want to track that separately. Consider logging steps_to_trigger as optional metadata.

On cause 4 (parallel worker race): The shared artifact solution is clean, but there's a hidden complexity: if multiple runs execute concurrently (e.g., CI pipeline with parallel jobs), they'll still collide on the artifact name. The -eval-<id> suffix mitigates this if <id> is globally unique (UUID), but if it's process-scoped (PID), you can get cross-process races. Suggest documenting the <id> uniqueness guarantee — or add a lockfile if not already present.

On the verification strategy: The offline stub harness is excellent (we use this pattern too — "canned stream-json" for zero-cost iteration). One gap: your offline tests cover Windows stream handling and parallel workers, but they don't cover the actual trigger detection logic against malformed/truncated tool_use events. Consider adding a fuzzer that injects partial JSON chunks mid-event to verify your detection doesn't false-positive.

Minor: The PR description says "Tradeoff kept deliberately: the eval skill's name carries an -eval- suffix while production skills don't." But you also say "the suffix avoids collisions with installed skills." These are compatible, but clarify: does the suffix prevent shadowing (good), or does it introduce name-mismatch risk (potential concern)? The warning about shadowing suggests you're aware of both sides.

Last: The "sweep stale <name>-eval-* artifacts" logic is defensive (we do this too). What's the age threshold? 2 hours? 24 hours? If it's too aggressive, you'll delete artifacts from paused debugging sessions. If too permissive, you'll accumulate cruft. Consider making it configurable or at least documenting the default.

Eval loop signal integrity recognized via SwarmAI. Discussion: T-MEM

MartinCajiao · 2026-06-10T03:10:26Z

Hi Xiaogang, thanks for the rigorous review! Great catches here. You are exactly right: these are architectural trade-offs and polishing points that I am happy to adjust. Here is the breakdown and how I plan to address them right away: *1. On the artifact installation strategy (Suffix vs. Name-Mismatch):* The suffix was designed primarily to *prevent shadowing* and ensure isolation between production skills and evaluation instances. Regarding the name-mismatch risk: Claude Code's skill picker routes tools based on the description metadata rather than strict edit distance on names, so the -eval- suffix doesn't degrade trigger rates in standard testing. However, to eliminate any edge-case risk of name-mismatch or shadowing completely, I am going to refine this in the next commit so the evaluation context strictly shadows the name only within a sandboxed temporary environment, keeping the production namespace completely clean. *2. On the cleanup logic (Configurable Threshold):* You’re spot on. Right now, the age threshold for sweeping stale artifacts is hardcoded to a defensive 2-hour window. To avoid killing active debugging sessions while preventing disk cruft, *I am going to turn this into an adjustable parameter*. - I will expose it as a configurable option (e.g., --eval-clean-threshold or via CLAUDE.md / config context). - I will default it to 12 hours to balance safety for long debugging sessions and automatic cleanup, and document it clearly in the README/docs. *3. On the parallel worker race condition (Cause 4):* To ensure absolute safety in concurrent environments (like heavily parallelized CI pipelines), I will upgrade the process-scoped ID resolution to a globally unique identifier (UUID v4) or implement a lightweight lockfile mechanism (.lock) within the local evaluation directory to completely eliminate cross-process collisions. I’ll push a commit addressing these refinements in a bit. Thanks again for the feedback!

…

On Tue, Jun 9, 2026 at 9:02 PM Xiaogang Wang ***@***.***> wrote: *xg-gh-25* left a comment (anthropics/skills#1298) <#1298 (comment)> Brutal diagnosis — four independent causes each sufficient for 0% recall, which explains why partial fixes didn't move the number. The "optimizing against noise" framing is exactly right: a metric that always reads zero might as well not exist. The cross-platform testing rigor (Windows stream handling, executable resolution) and the systematic verification strategy (stub harness + offline tests before real integration) are both production patterns we recognize. A few observations: *On the artifact installation strategy:* The decision to install as .claude/skills/<name>-eval-<id>/SKILL.md instead of swapping the real skill in-place is the safer choice. The tradeoff you note (suffix on the skill name) is acceptable because routing happens on *description content*, not the name. One risk to flag: if the model's routing logic includes *edit distance on skill names* as a tie-breaker (some frameworks do this), the suffix could deflate trigger rates slightly. Have you verified that Claude Code's skill picker ignores name similarity entirely? *On cause 3 (exploration counted as miss):* The fix — "non-Skill tools are ignored, only terminal events decide not-triggered" — is correct for auto-triggering, but it changes what you're measuring. Before: "did the skill trigger *first*?" After: "did the skill trigger *eventually*?" The latter is the right metric for description optimization (you want the model to reach the skill at all), but if you're also evaluating routing *efficiency* (how many irrelevant tools were used before the right one), you'd want to track that separately. Consider logging steps_to_trigger as optional metadata. *On cause 4 (parallel worker race):* The shared artifact solution is clean, but there's a hidden complexity: if multiple *runs* execute concurrently (e.g., CI pipeline with parallel jobs), they'll still collide on the artifact name. The -eval-<id> suffix mitigates this if <id> is globally unique (UUID), but if it's process-scoped (PID), you can get cross-process races. Suggest documenting the <id> uniqueness guarantee — or add a lockfile if not already present. *On the verification strategy:* The offline stub harness is excellent (we use this pattern too — "canned stream-json" for zero-cost iteration). One gap: your offline tests cover Windows stream handling and parallel workers, but they don't cover the actual *trigger detection logic* against malformed/truncated tool_use events. Consider adding a fuzzer that injects partial JSON chunks mid-event to verify your detection doesn't false-positive. Minor: The PR description says "Tradeoff kept deliberately: the eval skill's name carries an -eval- suffix while production skills don't." But you also say "the suffix avoids collisions with installed skills." These are compatible, but clarify: does the suffix prevent shadowing (good), or does it introduce name-mismatch risk (potential concern)? The warning about shadowing suggests you're aware of both sides. Last: The "sweep stale <name>-eval-* artifacts" logic is defensive (we do this too). What's the age threshold? 2 hours? 24 hours? If it's too aggressive, you'll delete artifacts from paused debugging sessions. If too permissive, you'll accumulate cruft. Consider making it configurable or at least documenting the default. ------------------------------ *Eval loop signal integrity recognized via SwarmAI <https://github.com/xg-gh-25/SwarmAI>. Discussion: T-MEM <xg-gh-25/SwarmAI#13>* — Reply to this email directly, view it on GitHub <#1298?email_source=notifications&email_token=BKG77WKCUNZQEOJU6OYXD3347DFU7A5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINRWGYYDKMBTGU3KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KYZTPN52GK4S7MNWGSY3L#issuecomment-4666050356>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BKG77WIY63VYQDPL5W4YYJL47DFU7AVCNFSM6AAAAAC2B4EF7SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DMNRWGA2TAMZVGY> . Triage notifications, keep track of coding agent tasks and review pull requests on the go with GitHub Mobile for iOS <https://github.com/notifications/mobile/ios/BKG77WP5NS4D2CZF7G6X6SD47DFU7A5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINRWGYYDKMBTGU3KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KUZTPN52GK4S7NFXXG> and Android <https://github.com/notifications/mobile/android/BKG77WLAH7U27MLNUBNVDA347DFU7A5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINRWGYYDKMBTGU3KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2K4ZTPN52GK4S7MFXGI4TPNFSA>. Download it today! You are receiving this because you authored the thread.Message ID: ***@***.***>

Addresses review feedback on the eval artifact strategy: - The candidate skill now keeps its real name. Isolation moved from a name suffix to the project directory: each run creates a throwaway project under <tmp>/claude-skill-evals/<name>-<uuid4> and claude -p executes with it as cwd, so the name-similarity question around the -eval-<id> suffix is moot and the caller project is never touched. - Cross-run collisions are impossible by construction: the project directory is keyed by a full UUID4 and scoped to a dedicated per-run directory (the previous id was already UUID-derived, not process-scoped; now the property is structural). - The stale sweep threshold (previously hardcoded to 1 hour) is now --stale-artifact-hours, defaulting to 12 hours: long enough to survive paused debugging sessions, short enough not to accumulate cruft. The sweep only ever runs inside the dedicated temp root. - The shadowing warning now covers user-level skills only; everything project-level is isolated by construction. run_eval() stays backward compatible (project_root retained but deprecated); run_loop.py works unmodified. Verified on Windows 11 + Claude Code 2.1.149: offline stub suite 16/16 and 4/4 eval queries, real end-to-end 2/2 should-trigger and 0/2 should-not with the clean skill name, zero leftover directories.

MartinCajiao · 2026-06-11T06:14:21Z

Pushed 79b3df9 delivering the refinements discussed above:

1. Sandboxed eval environment (name-mismatch + shadowing): went one step further than promised. Each run now creates an isolated throwaway project under the OS temp dir (<tmp>/claude-skill-evals/<skill>-<uuid4>/.claude/skills/<real-name>/SKILL.md) and claude -p executes with that project as its working directory. The skill keeps its real name -- isolation comes from the project directory, not from a name suffix -- so the edit-distance question is moot by construction, and the caller's project is never touched by eval artifacts at all. On your original question: empirically the suffixed artifact was already triggering 2/2 on matching queries in the Windows e2e (routing follows the description), but removing the suffix removes the question entirely.

2. Configurable sweep threshold: now --stale-artifact-hours, defaulting to 12h as proposed. One correction to my earlier reply: the previous hardcoded window was 1 hour, not 2. The sweep also only ever operates inside the dedicated temp root now, so it cannot touch anything outside it.

3. Cross-run collisions: to be precise, the previous id was already UUID-derived (a uuid4 hex prefix), not process-scoped -- but the new layout makes the guarantee structural rather than probabilistic: a full 32-hex UUID4 keys a dedicated per-run project directory, so concurrent runs (parallel CI jobs included) cannot collide regardless of id entropy. No lockfile needed.

On steps_to_trigger: agreed -- that is the right metric for routing efficiency as opposed to routing reach. Kept out of this PR to keep the output schema stable for run_loop.py/improve_description.py, but it is a clean additive follow-up.

On fuzzing the detection against malformed/truncated tool_use events: the streaming detector accumulates input_json_delta fragments into a buffer and substring-matches on the accumulated state, so split/partial chunks are handled by construction, and malformed JSON lines are skipped. I would like to land that as in-repo tests, but the scripts currently have no test scaffolding -- verification lives in an external stub harness (canned stream-json, zero API cost) for now. Happy to contribute test scaffolding as a separate PR if maintainers want it.

Re-verified end to end on Windows 11 + Claude Code 2.1.149 after the change: 2/2 should-trigger, 0/2 should-not-trigger, zero leftover directories in temp.

MartinCajiao added 2 commits June 9, 2026 20:17

MartinCajiao mentioned this pull request Jun 10, 2026

run_eval.py: claude -p never triggers skills/commands (0% trigger rate across all queries) #556

Open

MartinCajiao force-pushed the fix/run-eval-skill-triggering branch from ee0b902 to 0f97160 Compare June 10, 2026 02:41

ikenraf mentioned this pull request Jun 11, 2026

Windows compatibility: skill-creator scripts fail (subprocess PATHEXT, cp1252 encoding, select on pipes) #1061

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(skill-creator): run_eval.py always reports 0% recall — install the eval artifact as a real skill; fix Windows stream reading, trigger detection, and parallel workers#1298

fix(skill-creator): run_eval.py always reports 0% recall — install the eval artifact as a real skill; fix Windows stream reading, trigger detection, and parallel workers#1298
MartinCajiao wants to merge 3 commits into
anthropics:mainfrom
MartinCajiao:fix/run-eval-skill-triggering

MartinCajiao commented Jun 10, 2026

Uh oh!

xg-gh-25 commented Jun 10, 2026

Uh oh!

MartinCajiao commented Jun 10, 2026 via email

Uh oh!

MartinCajiao commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MartinCajiao commented Jun 10, 2026

Problem

1. The eval artifact can never auto-trigger (all platforms)

2. On Windows, the harness crashes before measuring anything

3. Detection counted exploration as a miss

4. Parallel workers raced each other

What changed

Why not swap the real SKILL.md in place

Verification (Windows 11, Claude Code 2.1.149, Python 3.12)

Uh oh!

xg-gh-25 commented Jun 10, 2026

Uh oh!

MartinCajiao commented Jun 10, 2026 via email

Uh oh!

MartinCajiao commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MartinCajiao commented Jun 11, 2026 •

edited

Loading