feat(batch-bug-shepherd): operator visibility + fold-in invariant #1451
Merged
danielmeppiel merged 3 commits on May 22, 2026
Refactors the batch-bug-shepherd skill to address two genesis-validated
blockers and ship two missing capabilities discovered during a real
sweep-all run over 14 microsoft/apm bugs:
1. OPERATOR VISIBILITY (was: silent 30-minute fan-outs)
- New asset assets/progress-diagram.md: mermaid template + 5-state
palette (pending/active/done/blocked/skipped) + per-phase render
rules + dispatch-table contract.
- SKILL.md adds 'Operator visibility is a contract' invariant; each
phase boundary re-renders the diagram with current-phase coloring
and prints a subagent_id -> target dispatch table BEFORE fan-out.
- Operator can follow long sagas at a glance instead of waiting in
the dark for the next checkpoint.
2. FOLD-IN INVARIANT (was: panel recommendations silently dropped)
- assets/verdict-schema.json: shepherd_return gains required
recommended_followups[] channel; completion_return gains
folded_followups[] + deferred_followups[]; extracted reusable
followup_item definition.
- assets/shepherd-prompt.md: fixed verdict mapping bug
(ship_with_followups + 0 blocking -> ready-to-merge, not
needs-author-changes); added recommended_followups extraction
step with required source_persona + optional fold_hint tagging.
- assets/completion-prompt.md: full rewrite. Adds
RECOMMENDED_FOLLOWUPS input; encodes FOLD vs DEFER classifier
(FOLD: touches diff / single helper / regression trap / hermetic
test / inline comment; DEFER: cross-cutting refactor / new
feature / broad doc / architectural addition); per-FOLD item
consultation with source_persona + python-architect lens;
DEFER items filed as gh issue create tracking issues (never
silently dropped); mid-flight reclassify rule to avoid stalls.
- SKILL.md adds 'Bias toward folding recommended items' invariant
and rewrites Phase 4 spawn contract (9 steps) to thread the
recommended_followups channel end-to-end.
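The FOLD vs DEFER classifier described above can be sketched roughly as follows. This is a minimal illustration, not the wording of completion-prompt.md: the signal tags, the `Followup` dataclass, the `fold_hint` values, and the precedence (DEFER signals win, everything unclassified folds) are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Decision(Enum):
    FOLD = "fold"
    DEFER = "defer"


# Hypothetical tags paraphrasing the FOLD / DEFER criteria in the commit
# message; they are not fields of the real verdict schema.
FOLD_SIGNALS = frozenset(
    {"touches_diff", "single_helper", "regression_trap", "hermetic_test", "inline_comment"}
)
DEFER_SIGNALS = frozenset(
    {"cross_cutting_refactor", "new_feature", "broad_doc", "architectural_addition"}
)


@dataclass(frozen=True)
class Followup:
    summary: str
    source_persona: str                      # required by the shepherd prompt
    signals: FrozenSet[str] = frozenset()    # hypothetical tagging
    fold_hint: Optional[str] = None          # optional; values assumed here


def classify(item: Followup) -> Decision:
    """Classify one recommended follow-up, biased toward folding."""
    # Assumed precedence: any DEFER signal outweighs FOLD signals.
    if item.signals & DEFER_SIGNALS:
        return Decision.DEFER
    if item.signals & FOLD_SIGNALS:
        return Decision.FOLD
    # No signal matched: honor an explicit hint, otherwise fold by default.
    if item.fold_hint == "defer":
        return Decision.DEFER
    return Decision.FOLD
```

Defaulting to FOLD mirrors the 'Bias toward folding recommended items' invariant; a DEFER result would then be handed to whatever step files the tracking issue.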
Eval gate
- +3 rubric anchors per content fixture
(progress-diagram-header, mermaid-flowchart-rendered,
dispatch-table-before-fanout) and +3 invariant anchors
(recommended-followups-channel, fold-defer-classifier,
tracking-issue-for-defer).
- All 12 new anchors MATCH with_skill fixtures and MISS
without_skill fixtures (clean value delta).
- +3 no-fire trigger items for single-PR fold-in phrasing so the
dispatcher will not misfire the batch outer-loop on single-PR
fold work (e.g. 'fold the panel recommendations into PR #1234'
remains apm-review-panel completion territory).
Validation
- Schema validates via jsonschema Draft7; accepts new shapes,
rejects shepherd_return missing recommended_followups[].
- SKILL.md: 367 lines / 4483 tokens (caps: 500 / 5000).
- Description: 965 / 1024 chars; mentions FOLD invariant.
- 0 non-ASCII bytes across all modified files.
- All 4 changed JSON files parse.
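A rough sketch of how the line, description, and ASCII checks above could be automated. The function name and return shape are hypothetical, and the token cap is left out because it depends on a tokenizer the commit message does not name.

```python
def check_skill_budget(skill_text: str, description: str,
                       max_lines: int = 500, max_desc_chars: int = 1024) -> list:
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []

    line_count = len(skill_text.splitlines())
    if line_count > max_lines:
        problems.append(f"SKILL.md has {line_count} lines (cap {max_lines})")

    if len(description) > max_desc_chars:
        problems.append(
            f"description has {len(description)} chars (cap {max_desc_chars})"
        )

    # Count bytes outside the 7-bit ASCII range in the UTF-8 encoding.
    non_ascii = sum(1 for byte in skill_text.encode("utf-8") if byte > 127)
    if non_ascii:
        problems.append(f"{non_ascii} non-ASCII bytes found")

    return problems
```

Returning a list rather than raising lets a caller report every violation in one pass.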
Real-task evidence (this skill iteration was driven by a live run)
- 5 of 6 in-flight community PRs had their panel recommendations
folded in-PR by completion subagents following the new contract,
yielding 22 folded items and 8 deferred-to-tracking items across
PRs #1387, #1396, #1441, #1443, #1444. The 6th (#1442) is in
flight as this lands.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Refactors the batch-bug-shepherd skill contract to improve long-running saga operator visibility (progress diagram + dispatch tables at phase boundaries) and to stop silently dropping non-blocking review-panel recommendations by flowing them through a recommended_followups[] channel and enforcing a FOLD-vs-DEFER completion workflow.
Changes:
- Adds an operator-visibility contract (assets/progress-diagram.md) and threads “render progress diagram + dispatch table” requirements into SKILL.md, the root prompt, and eval fixtures.
- Extends the verdict schema to carry recommended_followups[] from shepherd -> completion, plus completion reporting via folded_followups[] / deferred_followups[].
- Updates prompts/evals to encode the fold-in bias (classify recommended items as FOLD vs DEFER; file DEFER items as tracking issues).
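As an illustration of the "render progress diagram + dispatch table" requirement, here is a minimal sketch that emits a mermaid flowchart with one state class per node and a subagent_id -> target table. The function names and the markdown-table layout are assumptions; the `classDef` palette lines from assets/progress-diagram.md are deliberately omitted.

```python
STATES = ("pending", "active", "done", "blocked", "skipped")


def render_progress(phases, states):
    """phases: ordered list of (node_id, label); states: node_id -> state name."""
    active = [node for node, _ in phases if states.get(node, "pending") == "active"]
    if len(active) > 1:
        raise ValueError(f"more than one active phase: {active}")

    lines = ["flowchart TB"]
    for node_id, label in phases:
        state = states.get(node_id, "pending")  # unlisted phases are pending
        if state not in STATES:
            raise ValueError(f"unknown state {state!r} for {node_id}")
        lines.append(f"    {node_id}[{label}]:::{state}")
    # Chain the phases in order: P0 --> P1 --> ...
    lines.append("    " + " --> ".join(node for node, _ in phases))
    return "\n".join(lines)


def render_dispatch_table(dispatch):
    """dispatch: subagent_id -> target, printed before fan-out."""
    rows = ["| subagent_id | target |", "|---|---|"]
    rows += [f"| {sid} | {target} |" for sid, target in dispatch.items()]
    return "\n".join(rows)
```

An orchestrator could call both functions at each phase boundary, flipping the finished phase to `done` and the next one to `active` before printing.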
Show a summary per file
| File | Description |
|---|---|
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/SKILL.md | Updates architecture invariants and phase-by-phase contract to require progress renders and fold-in handling. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/evals/triggers.json | Adds new “no-fire” trigger cases to prevent mis-dispatch on single-PR fold-in phrasing. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/evals/fixtures/three-issues-mixed.with_skill.md | Updates idealized “with_skill” transcript to include progress/dispatch visibility and fold/defer outcomes. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/evals/fixtures/sweep-bug-queue.with_skill.md | Updates sweep-all transcript to include progress/dispatch visibility and fold/defer outcomes at scale. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/evals/content/three-issues-mixed.json | Adds rubric anchors for progress diagram, dispatch table, recommended followups, FOLD/DEFER behavior. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/evals/content/sweep-bug-queue.json | Mirrors new rubric anchors for the sweep-bug-queue scenario. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/assets/verdict-schema.json | Extends schema with recommended_followups and completion follow-up reporting structures. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/assets/shepherd-prompt.md | Fixes verdict mapping for ship_with_followups with 0 blocking; extracts recommended_followups[]. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/assets/progress-diagram.md | Introduces the canonical progress-diagram template, palette, and dispatch-table contract. |
| packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/assets/completion-prompt.md | Rewrites completion workflow to classify recommended items (FOLD/DEFER) and file tracking issues for DEFER. |
| packages/batch-bug-shepherd/.apm/prompts/batch-bug-shepherd.prompt.md | Threads the progress/dispatch rendering requirements into the top-level prompt procedure. |
Copilot's findings
- Files reviewed: 11/11 changed files
- Comments generated: 8
Comment on lines +72 to +76
- **Pending** -- every phase at the very first render.
- **Active** -- the phase the orchestrator is about to spawn into.
  EXACTLY ONE node should be `active` at any time. Use stroke-width
  3px so the operator's attention lands there.
- **Done** -- the phase completed and the table reflects its
Comment on lines +23 to +27
    classDef pending fill:#eee,stroke:#999,color:#333
    classDef active fill:#fff3b0,stroke:#b8860b,color:#333,stroke-width:2px
    classDef done fill:#cfe8c9,stroke:#2e7d32,color:#1b5e20
    classDef blocked fill:#f8c7c7,stroke:#b71c1c,color:#7f0000
    classDef skipped fill:#f0f0f0,stroke:#bbb,color:#888,stroke-dasharray:3 3
Comment on lines +22 to +26
    classDef pending fill:#eee,stroke:#999,color:#333
    classDef active fill:#fff3b0,stroke:#b8860b,color:#333,stroke-width:2px
    classDef done fill:#cfe8c9,stroke:#2e7d32,color:#1b5e20
    classDef blocked fill:#f8c7c7,stroke:#b71c1c,color:#7f0000
    classDef skipped fill:#f0f0f0,stroke:#bbb,color:#888,stroke-dasharray:3 3
Comment on lines +15 to +16
3. ONCE at the end of the run with every phase styled `done` (or
   `blocked` where the human-escalation queue is non-empty).
Comment on lines +28 to +34
| state    | fill      | stroke    | stroke-width | semantics                          |
|----------|-----------|-----------|--------------|------------------------------------|
| pending  | `#f5f5f5` | `#9ca3af` | 1px          | not started yet                    |
| active   | `#dbeafe` | `#2563eb` | 3px          | currently executing (one at a time)|
| done     | `#dcfce7` | `#16a34a` | 1px          | completed cleanly                  |
| blocked  | `#fef3c7` | `#d97706` | 2px          | partial completion, human follow-up|
| skipped  | `#f3f4f6` | `#6b7280` | 1px,dasharray| no work in this phase (e.g. 0 fix) |
Comment on lines +15 to +19
flowchart TB
    P0[Phase 0 scope]:::active
    P1[Phase 1 triage]:::pending
    P2[Phase 2 cross-ref]:::pending
    P3[Phase 3 shepherd-or-fix]:::pending
Comment on lines +18 to +21
    P3[Phase 3 shepherd-or-fix]:::pending
    P4[Phase 4 completion]:::pending
    P5[Phase 5 final report]:::pending
    P0 --> P1 --> P2 --> P3 --> P4 --> P5
a tracking issue), then act on each per classification.

Push to the contributor's fork if possible, otherwise open a
superseding PR that preserves author authorship, and post ONE
Adds a post-wave gate that re-probes mergeability for every PR the
saga marked ready-to-merge, dispatches one conflict-resolution
subagent per CONFLICTING PR, and partitions returns into four
post-gate statuses before the final report claims anything is
mergeable.
Mergeability is post-wave truth, not pre-wave assumption: a PR that
Phase 4 marked ready can stop being mergeable the moment the
maintainer lands another PR onto main. Without this gate the report
ships stale ready claims.
Design driven through the genesis skill end-to-end (steps 1-6 handoff
packet, steps 7a-7b coder pass, step 8 validation):
- NEW Phase 5 (mergeability gate) between completion (Phase 4) and
renamed final report (Phase 5 -> Phase 6).
- Sub-phases 5a probe (read-only, single-thread, gh pr view --json
mergeStateStatus), 5b fan-out (one conflict-resolution subagent per
CONFLICTING PR), 5c trust-but-verify re-probe + four-way partition
(resolved / requires-author-action / requires-human-judgment /
resolution-failed).
- NEW assets/conflict-resolution-prompt.md spawn body for 5b. Encodes
rebase, faithful merge of both intents, mutation-break re-check,
lint silent, --force-with-lease push, re-probe,
resolution-confirmation comment.
- NEW references/mergeability-gate.md load-on-demand orchestrator
step-by-step (load trigger: WHEN ENTERING PHASE 5). Keeps SKILL.md
under 5000-token budget.
- Schema extends verdict-schema.json oneOf with
conflict_resolution_return; --force-with-lease enforced via regex
pattern guard on push_command field; bare --force rejected. Five
rejection cases validated.
- Two-comment-per-PR cap as new architecture invariant: at most one
completion-confirmation (Phase 4) + one resolution-confirmation
(Phase 5b) per PR.
- Progress diagram extended with WAVE4 subgraph (P5a/P5b/P5c),
skipped-state semantics, P5b dispatch table requirement.
- Final report extended with three new partition sections plus a
RESOLUTION CONFIRMATION COMMENT block and mergeability-gate
disciplines line.
- Evals: +3 content rubric anchors (mergeability-probe-cli,
force-with-lease-on-push, post-wave-partition-columns) + 1 optional
anchor (two-comment-cap); +1 fire + 2 no-fire trigger items; fixture
diff shows the gate firing on a sweep with 2 conflicting PRs and the
without-skill failure mode (stale ready claim).
SKILL.md: 388 lines / 4867 tokens (budget 500/5000). ASCII only. CI
lint pair silent.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
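The push_command guard mentioned above can be illustrated with a small stand-in. The real guard is a regex pattern inside verdict-schema.json that is not shown here, so this sketch checks tokens in Python instead; the function name and the example commands are hypothetical, and it does not catch `+refspec` style force pushes.

```python
import re
import shlex

# Accepts --force-with-lease and --force-with-lease=<ref>[:<expect>].
_LEASE_FLAG = re.compile(r"^--force-with-lease(=.+)?$")


def is_safe_push(push_command: str) -> bool:
    """True only for a git push that uses --force-with-lease and no bare force."""
    tokens = shlex.split(push_command)
    if tokens[:2] != ["git", "push"]:
        return False
    flags = tokens[2:]
    has_lease = any(_LEASE_FLAG.match(token) for token in flags)
    has_bare_force = any(token in ("--force", "-f") for token in flags)
    return has_lease and not has_bare_force


if __name__ == "__main__":
    for cmd in (
        "git push --force-with-lease origin fix-branch",
        "git push --force origin fix-branch",
    ):
        print(cmd, "->", "ok" if is_safe_push(cmd) else "rejected")
```

Requiring the lease flag outright, rather than only banning `--force`, also rejects a plain `git push`, which would fail after a rebase anyway.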
Adds a new wave between Phase 1 (triage) and Phase 2 (PR-in-flight
cross-reference) that checks every LEGIT bug against the project's
rejection contract before spending shepherd / fix / completion work
on it.
What changes:
- NEW PRINCIPLES.md at repo root: 7 numbered principles encoding
the project's hard nos (P1 no invented frontmatter; P2 multi-harness
with traction gating; P3 vendor neutral; P4 UX floor is not a
trade) plus 3 supporting principles (P5 portability; P6 reliability
over magic; P7 community over feature count). Bound to apm-ceo +
bbs Phase 1.5 + apm-triage-panel + apm-review-panel as the
rejection contract.
- NEW bbs Phase 1.5 strategic-alignment gate (WAVE 1.5):
- one apm-ceo subagent per LEGIT row, in parallel
- 4-state verdict: aligned | aligned-with-reservations |
out-of-scope | wrong-direction
- schema-validated returns; FAILS OPEN on infrastructure failure
(malformed-x2 or non-citable principle) so legit bugs are
never silently demoted under gate breakage
- ABORTS only when apm-ceo.agent.md or PRINCIPLES.md itself is
missing (operator-actionable error)
- demoted rows flip to status triaged-deferred and SKIP Phase
2/3/4/5; surface in Phase 6 under 'Recommend close as
out-of-scope' partition
- aligned-with-reservations rows stay in saga; downstream phases
surface the reservations in review prose
- deferred-PR strategic-rejection comment subagent (S7+S4+A9)
posts a courtesy comment on any open PR whose underlying issue
was demoted, using the would-be Phase-4 completion-comment
slot (two-comments-per-PR cap preserved)
- Verdict schema extended with 5th oneOf member
strategic_alignment_return (kind, issue, verdict, cited_principle,
rationale, reservations).
- Ground-truth table grows two columns (strategic_verdict +
strategic_rationale) and one status value (triaged-deferred).
- Progress diagram inserts P15 between P1 and P2; dispatch-table
contract extends to Phase 1.5.
- Final-report template adds 'Recommend close as out-of-scope'
partition and 'Aligned with reservations' surfacing section.
- 2 new fire trigger evals + 1 no-fire (PRINCIPLES.md authoring
guard) + 1 new rubric anchor on the three-issues-mixed scenario.
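The gate's demote / fail-open behavior listed above could look roughly like this for a single ground-truth row. It is a sketch under stated assumptions: the row is a plain dict, the starting status value "LEGIT" is illustrative, and retry handling (the "malformed-x2" case) is reduced to "the return passed in is the last attempt".

```python
DEMOTING = {"out-of-scope", "wrong-direction"}
VERDICTS = {"aligned", "aligned-with-reservations"} | DEMOTING
KNOWN_PRINCIPLES = {f"P{i}" for i in range(1, 8)}  # P1..P7 from PRINCIPLES.md


def gate_row(row: dict, verdict_return) -> dict:
    """Apply one strategic_alignment_return to a row; fail open if unusable."""
    out = dict(row)  # never mutate the caller's table

    # Malformed return: fail open, the row stays in the saga untouched.
    if not isinstance(verdict_return, dict) or verdict_return.get("verdict") not in VERDICTS:
        return out

    verdict = verdict_return["verdict"]
    if verdict in DEMOTING:
        # A demotion must cite a real principle; otherwise fail open.
        if verdict_return.get("cited_principle") not in KNOWN_PRINCIPLES:
            return out
        out["status"] = "triaged-deferred"  # skips Phase 2/3/4/5

    out["strategic_verdict"] = verdict
    out["strategic_rationale"] = verdict_return.get("rationale", "")
    return out
```

Failing open keeps a broken gate from silently demoting a legit bug; the separate ABORT path for a missing apm-ceo.agent.md or PRINCIPLES.md would sit in the orchestrator before any row is processed.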
Genesis design artifact lives in the session plan store; SKILL.md
body remains within 500-line / 5000-token budget (406 lines /
4943 tokens after trimming pre-existing verbose passages on
operator-visibility, mergeability, fold-in, composition, and
operating-contract sections to make room).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TL;DR
Refactors the batch-bug-shepherd skill to (1) make long sagas visible to the operator at every phase boundary, (2) stop silently dropping non-blocking panel recommendations, and (3) add a post-wave mergeability gate that re-probes every "ready" PR before the report claims anything is mergeable. All three gaps were caught by genesis validation passes and confirmed by real microsoft/apm sweep-all runs.
Problem (WHY)
Three failure modes surfaced during live sweep-all runs:
- apm-review-panel returned severity:recommended follow-ups, but the skill had no channel for them. Completion subagents only acted on severity:blocking; everything else was dropped on the floor.
Each gap drove a genesis design pass with a persisted handoff packet; this PR ships the union of all three.
Approach (WHAT)
Operator visibility is a contract
New asset assets/progress-diagram.md defines a mermaid template + 5-state palette (pending/active/done/blocked/skipped) + per-phase render rules + dispatch-table contract. SKILL.md adds the invariant; each phase boundary re-renders the diagram with current-phase coloring and prints a subagent_id -> target dispatch table BEFORE fan-out.
Bias toward folding recommended items
- assets/verdict-schema.json: shepherd_return gains required recommended_followups[] channel; completion_return gains folded_followups[] + deferred_followups[]; extracted reusable followup_item definition.
- assets/shepherd-prompt.md: fixed verdict mapping bug (ship_with_followups + 0 blocking now maps to ready-to-merge).
- assets/completion-prompt.md: encodes FOLD vs DEFER classifier; DEFER items filed as gh issue create tracking issues (never silently dropped).
- SKILL.md Phase 4 rewritten as a 9-step contract that threads the channel end-to-end.
Mergeability is post-wave truth (NEW)
- Sub-phases: 5a probe (gh pr view --json mergeStateStatus,mergeable,maintainerCanModify), 5b fan-out one conflict-resolution subagent per CONFLICTING PR, 5c trust-but-verify re-probe + four-way partition.
- assets/conflict-resolution-prompt.md spawn body: rebase, faithful merge of both intents, mutation-break re-check, lint silent, --force-with-lease push (NEVER bare --force), re-probe, single resolution-confirmation comment.
- references/mergeability-gate.md load-on-demand orchestrator step-by-step. Load trigger: WHEN ENTERING PHASE 5. Keeps SKILL.md under budget.
- verdict-schema.json oneOf with conflict_resolution_return; --force-with-lease enforced via regex pattern guard on push_command; bare --force rejected. 5 rejection cases validated.
- Four-way partition: resolved / requires-author-action (fork + maintainerCanModify=false) / requires-human-judgment (semantic conflict) / resolution-failed.
Validation evidence
- with_skill fixtures, MISS without_skill
- ruff check src/ tests/ + ruff format --check src/ tests/ both silent
Real-task evidence
This iteration was driven by a live sweep-all run over microsoft/apm:
- 41ee035b and 36d878f4), pushed with --force-with-lease, re-probed CLEAN. Without the gate, the report would have claimed both ready-to-merge and the maintainer would have hit the merge failure manually.
How to test
Evals at packages/batch-bug-shepherd/.apm/skills/batch-bug-shepherd/evals/:
- content/{three-issues-mixed,sweep-bug-queue}.json -- rubric runs prompt with/without the skill, asserts value delta via anchors including the new mergeability-probe-cli, force-with-lease-on-push, post-wave-partition-columns, two-comment-cap.
- triggers.json -- should-fire vs should-not-fire dispatch eval, train/val split. Validation split is the ship gate.
- fixtures/sweep-bug-queue.with_skill.md -- ideal output (Phase 5 mermaid + 5b dispatch table + 4-column report).
- fixtures/sweep-bug-queue.without_skill.md -- explicit failure mode (stale ready-to-merge claim, maintainer hits the conflict at merge time).
Trade-offs
- gh pr view per resolved PR; pays for itself the first time GitHub finishes computing mergeStateStatus after the subagent already returned.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
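The 5c trust-but-verify re-probe and four-way partition described in this PR can be sketched as a pure function over the JSON that `gh pr view --json mergeStateStatus,mergeable,maintainerCanModify` returns. The function name, the `is_fork` and `semantic_conflict` inputs, the check order, and treating only a CLEAN re-probe as resolved are assumptions made for this example.

```python
def post_gate_status(reprobe: dict, is_fork: bool, semantic_conflict: bool) -> str:
    """Partition one CONFLICTING PR after its conflict-resolution subagent returns.

    reprobe: parsed output of the re-probe (mergeStateStatus, mergeable,
             maintainerCanModify).
    is_fork / semantic_conflict: hypothetical flags taken from the subagent return.
    """
    # The orchestrator cannot push to a fork the author has locked.
    if is_fork and not reprobe.get("maintainerCanModify", False):
        return "requires-author-action"

    # The subagent could not merge both intents faithfully.
    if semantic_conflict:
        return "requires-human-judgment"

    # Trust but verify: only a clean re-probe counts as resolved, whatever
    # the subagent claimed.
    if reprobe.get("mergeStateStatus") == "CLEAN":
        return "resolved"

    return "resolution-failed"
```

Because the function reads only the re-probe result, a subagent that reports success while GitHub still shows a conflict lands in resolution-failed rather than in the report's ready column.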
Update: Phase 1.5 strategic-alignment gate (commit 3786d9c6)
A FOURTH gap surfaced during the live sweep, and this PR ships a fix for it: the skill happily shepherded reproducible-on-HEAD bugs whose fixes the project did not actually want. A LEGIT triage was necessary but not sufficient.
What changed
- NEW PRINCIPLES.md at repo root. Numbered rejection contract (P1..P7) encoding the project's hard nos.
- Bound to apm-ceo, bbs Phase 1.5, apm-triage-panel, apm-review-panel as the rejection contract.
- One apm-ceo subagent per LEGIT row, in parallel.
- 4-state verdict (aligned / aligned-with-reservations / out-of-scope / wrong-direction).
- ABORTS only when apm-ceo.agent.md or PRINCIPLES.md itself is missing (operator-actionable).
- Demoted rows flip to triaged-deferred and SKIP Phase 2/3/4/5, surface in Phase 6 under "Recommend close as out-of-scope".
- Verdict schema extended with strategic_alignment_return (kind / issue / verdict / cited_principle / rationale / reservations).
- Ground-truth table grows two columns (strategic_verdict, strategic_rationale) +1 status value (triaged-deferred).
Retroactive Phase 1.5 evidence (this sweep's 8 open PRs)
The gate was authored, then run against the open PRs in this sweep. Results:
Zero out-of-scope / wrong-direction verdicts. The single reservation on #1444 was tracked in #1452 (extend MCP v0.1 variables handling to all client adapters) to satisfy P3.
Updated validation
- reservations when verdict=aligned-with-reservations
- PRINCIPLES.md)