fix(skill-creator): run_eval.py always reports 0% recall — install the eval artifact as a real skill; fix Windows stream reading, trigger detection, and parallel workers#1298
Conversation
run_eval.py reported 0% recall for every skill description (anthropics#556). Four independent causes, each sufficient on its own: - The candidate description was installed as a .claude/commands/ file, which claude -p lists under slash_commands but never under the model-facing skills list, so it could never auto-trigger. Install it as a real skill under .claude/skills/<name>-eval-<id>/ instead. - Stream reading used select.select() on pipes, which only works on sockets on Windows: every query raised WinError 10038, was swallowed by the generic exception handler, and silently counted as not-triggered. Replace with portable reader threads. - Detection returned False at the first non-Skill tool_use, counting runs where Claude explores (TodoWrite/Glob/Bash) before consulting the skill as misses. Only a terminal result event, process exit, or timeout decides not-triggered now. - Each parallel worker created its own UUID-named copy of the artifact, so Claude picked one arbitrarily and sibling workers missed their triggers. One shared artifact per run, created before the pool starts and removed after it drains. Also: resolve the claude executable with shutil.which() (npm installs ship a claude.cmd shim that bare Popen cannot find on Windows), capture stderr and surface it on failures instead of discarding it, warn when an installed skill with the same name could shadow the candidate, sweep stale eval artifacts left by crashed runs, and cap conversation length with --max-turns. Verified on Windows 11 with Claude Code 2.1.149: a 4-query eval set went from 0% recall (per-query WinError warnings) to 2/2 should-trigger and 0/2 should-not-trigger. run_eval() signature, CLI flags, and output JSON are unchanged; run_loop.py works unmodified. Fixes anthropics#556
- Only sweep stale eval artifacts older than 1h so a sweep never deletes the live artifact of a concurrent eval of the same skill. - Bound the post-EOF process.wait() so a claude process that lingers after closing stdout cannot hang a worker past its timeout.
ee0b902 to
0f97160
Compare
|
Brutal diagnosis — four independent causes each sufficient for 0% recall, which explains why partial fixes didn't move the number. The "optimizing against noise" framing is exactly right: a metric that always reads zero might as well not exist. The cross-platform testing rigor (Windows stream handling, executable resolution) and the systematic verification strategy (stub harness + offline tests before real integration) are both production patterns we recognize. A few observations: On the artifact installation strategy: The decision to install as On cause 3 (exploration counted as miss): The fix — "non-Skill tools are ignored, only terminal events decide not-triggered" — is correct for auto-triggering, but it changes what you're measuring. Before: "did the skill trigger first?" After: "did the skill trigger eventually?" The latter is the right metric for description optimization (you want the model to reach the skill at all), but if you're also evaluating routing efficiency (how many irrelevant tools were used before the right one), you'd want to track that separately. Consider logging On cause 4 (parallel worker race): The shared artifact solution is clean, but there's a hidden complexity: if multiple runs execute concurrently (e.g., CI pipeline with parallel jobs), they'll still collide on the artifact name. The On the verification strategy: The offline stub harness is excellent (we use this pattern too — "canned stream-json" for zero-cost iteration). One gap: your offline tests cover Windows stream handling and parallel workers, but they don't cover the actual trigger detection logic against malformed/truncated Minor: The PR description says "Tradeoff kept deliberately: the eval skill's name carries an -eval- suffix while production skills don't." But you also say "the suffix avoids collisions with installed skills." These are compatible, but clarify: does the suffix prevent shadowing (good), or does it introduce name-mismatch risk (potential concern)? The warning about shadowing suggests you're aware of both sides. Last: The "sweep stale Eval loop signal integrity recognized via SwarmAI. Discussion: T-MEM |
|
Hi Xiaogang, thanks for the rigorous review! Great catches here. You are
exactly right: these are architectural trade-offs and polishing points that
I am happy to adjust. Here is the breakdown and how I plan to address them
right away:
*1. On the artifact installation strategy (Suffix vs. Name-Mismatch):* The
suffix was designed primarily to *prevent shadowing* and ensure isolation
between production skills and evaluation instances. Regarding the
name-mismatch risk: Claude Code's skill picker routes tools based on the
description metadata rather than strict edit distance on names, so the
-eval- suffix doesn't degrade trigger rates in standard testing. However,
to eliminate any edge-case risk of name-mismatch or shadowing completely, I
am going to refine this in the next commit so the evaluation context
strictly shadows the name only within a sandboxed temporary environment,
keeping the production namespace completely clean.
*2. On the cleanup logic (Configurable Threshold):* You’re spot on. Right
now, the age threshold for sweeping stale artifacts is hardcoded to a
defensive 2-hour window. To avoid killing active debugging sessions while
preventing disk cruft, *I am going to turn this into an adjustable
parameter*.
-
I will expose it as a configurable option (e.g., --eval-clean-threshold
or via CLAUDE.md / config context).
-
I will default it to 12 hours to balance safety for long debugging
sessions and automatic cleanup, and document it clearly in the README/docs.
*3. On the parallel worker race condition (Cause 4):* To ensure absolute
safety in concurrent environments (like heavily parallelized CI pipelines),
I will upgrade the process-scoped ID resolution to a globally unique
identifier (UUID v4) or implement a lightweight lockfile mechanism (.lock)
within the local evaluation directory to completely eliminate cross-process
collisions.
I’ll push a commit addressing these refinements in a bit. Thanks again for
the feedback!
…On Tue, Jun 9, 2026 at 9:02 PM Xiaogang Wang ***@***.***> wrote:
*xg-gh-25* left a comment (anthropics/skills#1298)
<#1298 (comment)>
Brutal diagnosis — four independent causes each sufficient for 0% recall,
which explains why partial fixes didn't move the number. The "optimizing
against noise" framing is exactly right: a metric that always reads zero
might as well not exist.
The cross-platform testing rigor (Windows stream handling, executable
resolution) and the systematic verification strategy (stub harness +
offline tests before real integration) are both production patterns we
recognize. A few observations:
*On the artifact installation strategy:* The decision to install as
.claude/skills/<name>-eval-<id>/SKILL.md instead of swapping the real
skill in-place is the safer choice. The tradeoff you note (suffix on the
skill name) is acceptable because routing happens on *description content*,
not the name. One risk to flag: if the model's routing logic includes *edit
distance on skill names* as a tie-breaker (some frameworks do this), the
suffix could deflate trigger rates slightly. Have you verified that Claude
Code's skill picker ignores name similarity entirely?
*On cause 3 (exploration counted as miss):* The fix — "non-Skill tools
are ignored, only terminal events decide not-triggered" — is correct for
auto-triggering, but it changes what you're measuring. Before: "did the
skill trigger *first*?" After: "did the skill trigger *eventually*?" The
latter is the right metric for description optimization (you want the model
to reach the skill at all), but if you're also evaluating routing
*efficiency* (how many irrelevant tools were used before the right one),
you'd want to track that separately. Consider logging steps_to_trigger as
optional metadata.
*On cause 4 (parallel worker race):* The shared artifact solution is
clean, but there's a hidden complexity: if multiple *runs* execute
concurrently (e.g., CI pipeline with parallel jobs), they'll still collide
on the artifact name. The -eval-<id> suffix mitigates this if <id> is
globally unique (UUID), but if it's process-scoped (PID), you can get
cross-process races. Suggest documenting the <id> uniqueness guarantee —
or add a lockfile if not already present.
*On the verification strategy:* The offline stub harness is excellent (we
use this pattern too — "canned stream-json" for zero-cost iteration). One
gap: your offline tests cover Windows stream handling and parallel workers,
but they don't cover the actual *trigger detection logic* against
malformed/truncated tool_use events. Consider adding a fuzzer that
injects partial JSON chunks mid-event to verify your detection doesn't
false-positive.
Minor: The PR description says "Tradeoff kept deliberately: the eval
skill's name carries an -eval- suffix while production skills don't." But
you also say "the suffix avoids collisions with installed skills." These
are compatible, but clarify: does the suffix prevent shadowing (good), or
does it introduce name-mismatch risk (potential concern)? The warning about
shadowing suggests you're aware of both sides.
Last: The "sweep stale <name>-eval-* artifacts" logic is defensive (we do
this too). What's the age threshold? 2 hours? 24 hours? If it's too
aggressive, you'll delete artifacts from paused debugging sessions. If too
permissive, you'll accumulate cruft. Consider making it configurable or at
least documenting the default.
------------------------------
*Eval loop signal integrity recognized via SwarmAI
<https://github.com/xg-gh-25/SwarmAI>. Discussion: T-MEM
<xg-gh-25/SwarmAI#13>*
—
Reply to this email directly, view it on GitHub
<#1298?email_source=notifications&email_token=BKG77WKCUNZQEOJU6OYXD3347DFU7A5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINRWGYYDKMBTGU3KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KYZTPN52GK4S7MNWGSY3L#issuecomment-4666050356>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BKG77WIY63VYQDPL5W4YYJL47DFU7AVCNFSM6AAAAAC2B4EF7SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DMNRWGA2TAMZVGY>
.
Triage notifications, keep track of coding agent tasks and review pull
requests on the go with GitHub Mobile for iOS
<https://github.com/notifications/mobile/ios/BKG77WP5NS4D2CZF7G6X6SD47DFU7A5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINRWGYYDKMBTGU3KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KUZTPN52GK4S7NFXXG>
and Android
<https://github.com/notifications/mobile/android/BKG77WLAH7U27MLNUBNVDA347DFU7A5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINRWGYYDKMBTGU3KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2K4ZTPN52GK4S7MFXGI4TPNFSA>.
Download it today!
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Addresses review feedback on the eval artifact strategy: - The candidate skill now keeps its real name. Isolation moved from a name suffix to the project directory: each run creates a throwaway project under <tmp>/claude-skill-evals/<name>-<uuid4> and claude -p executes with it as cwd, so the name-similarity question around the -eval-<id> suffix is moot and the caller project is never touched. - Cross-run collisions are impossible by construction: the project directory is keyed by a full UUID4 and scoped to a dedicated per-run directory (the previous id was already UUID-derived, not process-scoped; now the property is structural). - The stale sweep threshold (previously hardcoded to 1 hour) is now --stale-artifact-hours, defaulting to 12 hours: long enough to survive paused debugging sessions, short enough not to accumulate cruft. The sweep only ever runs inside the dedicated temp root. - The shadowing warning now covers user-level skills only; everything project-level is isolated by construction. run_eval() stays backward compatible (project_root retained but deprecated); run_loop.py works unmodified. Verified on Windows 11 + Claude Code 2.1.149: offline stub suite 16/16 and 4/4 eval queries, real end-to-end 2/2 should-trigger and 0/2 should-not with the clean skill name, zero leftover directories.
|
Pushed 79b3df9 delivering the refinements discussed above: 1. Sandboxed eval environment (name-mismatch + shadowing): went one step further than promised. Each run now creates an isolated throwaway project under the OS temp dir ( 2. Configurable sweep threshold: now 3. Cross-run collisions: to be precise, the previous id was already UUID-derived (a uuid4 hex prefix), not process-scoped -- but the new layout makes the guarantee structural rather than probabilistic: a full 32-hex UUID4 keys a dedicated per-run project directory, so concurrent runs (parallel CI jobs included) cannot collide regardless of id entropy. No lockfile needed. On On fuzzing the detection against malformed/truncated Re-verified end to end on Windows 11 + Claude Code 2.1.149 after the change: 2/2 should-trigger, 0/2 should-not-trigger, zero leftover directories in temp. |
Problem
run_eval.py— and thereforerun_loop.pyandimprove_description.py, which consume its signal — reportsrecall=0%for every skill description regardless of content (#556, 10+ independent reproductions). The description-optimization loop is currently optimizing against noise.There are four independent causes. Each one alone is sufficient for ~0% recall, which is why partial fixes haven't moved the number.
1. The eval artifact can never auto-trigger (all platforms)
The harness installs the candidate description at
.claude/commands/<name>-skill-<uuid>.mdand expects the model to auto-trigger it. But commands and skills are different surfaces: theinitevent ofclaude -plists command files underslash_commands, while the model-facing auto-trigger list (skills) is built only from.claude/skills/*/SKILL.md.Verified on Claude Code 2.1.149 by planting both artifact types side by side and inspecting the init event:
The command file (
pokemon-name-rater-cmd) appears inslash_commandsbut is absent fromskills— it is structurally invisible to auto-triggering. The real skill in.claude/skills/auto-triggered immediately in the same headless session (Skill {"skill": "pokemon-name-rater"}from a natural query). Same conclusion as @hysteric-lab's and @samuelcastro's experiments in #556.2. On Windows, the harness crashes before measuring anything
select.select()on subprocess pipes only works on sockets on Windows. Every query raisesOSError, the genericexcept Exceptioninrun_eval()converts it to a stderr warning, and the query is recorded as not-triggered:So on Windows recall is structurally 0% even with cause 1 fixed. Stream reading now goes through reader threads + a queue, which behaves identically on all platforms.
Relatedly:
Popen(["claude", ...])only findsclaude.exeon Windows. npm installs ship aclaude.cmdshim, which raisesFileNotFoundError— also swallowed by the same handler, also producing silent 0%. The executable is now resolved once withshutil.which()(verified: bare.cmdname fails,which()-resolved full path works).3. Detection counted exploration as a miss
The streaming detector returned
Falseat the firsttool_usethat wasn'tSkill/Read. Claude frequently starts withTodoWrite/Glob/Bashbefore consulting a skill, so legitimate triggers were recorded as misses (reported by @Vaikri-costume in #556; survives the existing fix PRs). Now non-Skill tools are simply ignored, and only a terminalresultevent, process exit, or timeout decides not-triggered. Conversation length is capped with--max-turnsso non-triggering queries stay cheap.4. Parallel workers raced each other
Each worker wrote its own UUID-named copy of the artifact with an identical description. Claude picks one arbitrarily, so sibling workers miss triggers that land on another worker's copy — trigger rate scales as ~1/N (@rhys-childs measured 100% at
--num-workers 1vs 11% at the default 10). One shared artifact per run is now created before the pool starts and removed after it drains, so full parallelism is safe.What changed
<project_root>/.claude/skills/<name>-eval-<id>/SKILL.md; one shared artifact per run, removed in afinally.select.select()with portable reader threads (stdout queue + bounded stderr tail).claudeexecutable withshutil.which(); fail fast with a clear error if absent.result/exit/timeout; cap with--max-turns 8.DEVNULL(these silent failures are why run_eval.py: claude -p never triggers skills/commands (0% trigger rate across all queries) #556 was hard to diagnose).<name>-eval-*artifacts left behind by killed runs.run_eval()'s signature, the CLI flags, and the output JSON are unchanged —run_loop.pyandimprove_description.pywork unmodified.Why not swap the real SKILL.md in place
#1027 takes that approach. Writing into the user's actual skill directory is destructive if the run dies mid-swap (backups and signal handlers mitigate but cannot cover SIGKILL/power loss). This PR never touches user files: the artifact is a new, uniquely named directory, removed on completion, swept if a previous run crashed, and shadowing is handled with a loud warning instead of moving the user's skill aside (the concern #996 addresses by relocation).
Tradeoff kept deliberately: the eval skill's name carries an
-eval-<id>suffix while production skills don't. The description — which is what routing decides on — is identical, and the suffix avoids collisions with installed skills. This matches the previous design's tradeoff.Verification (Windows 11, Claude Code 2.1.149, Python 3.12)
claude.cmdemitting canned stream-json — exercises.cmdresolution, reader threads, 4-way parallelism on one shared artifact, TodoWrite-before-Skill detection, and artifact cleanup, with zero API cost: 4/4 pass.input_json_deltachunks; garbage stdout lines tolerated; nonzero exit counted as not-triggered with the stderr excerpt surfaced; hanging CLI killed promptly at timeout; missing CLI raising a clearRuntimeError; artifact frontmatter round-trip; age-guarded stale sweep removing a 2h-old leftover while keeping a fresh one and ignoring similarly named non-artifact dirs. All pass.run_loop.py(unmodified) run end-to-end against the stub — precision/recall/accuracy computed, result schema consumed intact,all_passedexit path reached.--model haiku, 2 workers):main: fourWinError 10038warnings, should-trigger 0/1 + 0/1 (recall 0%), summary 2/4 — and the 2 "passes" are the should-not-trigger queries passing vacuously.ruff checkclean on default rules; a strict rule set (F,E,W,B,UP,SIM,RET,ARG,RUF) only flags lines inherited unchanged from the previous version ofmain().Fixes #556. Related: #794 (cause 4 only), #996 (cause 1 + shadowing via relocation), #1027 (cause 1 via in-place swap). Upstream context: anthropics/claude-code#32184 (closed stale, no maintainer response).