Style skill validation results comment#35713
Conversation
Format the Skill Validation Results PR comment with the same status badge and session layout used by AI Summary and Test Failure Review comments. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35713Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35713" |
Render the Skill Validation Results session collapsed by default to match the requested comment behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Normalize Skill Validation Results comments so no details blocks render expanded by default. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/review -b feature/enhanced-reviewer -p android |
MauiBot
left a comment
There was a problem hiding this comment.
Expert Review — 1 findings
See inline comments for details.
MauiBot
left a comment
There was a problem hiding this comment.
AI Review Summary
@kubaflo — new AI review results are available based on this last commit:
5e02847.
Collapse all skill validation details by default To request a fresh review after new comments or commits, comment/review rerun.
Review Sessions — click to expand
Gate — Test Before & After Fix
Gate Result: ⚠️ SKIPPED
No tests were detected in this PR.
Recommendation: Add tests to verify the fix using the write-tests-agent.
Pre-Flight — Context & Validation
Issue: N/A - No linked issue filed
PR: #35713 - Style skill validation results comment
Platforms Affected: CI workflow only (review requested for android, but diff has no Android/runtime code)
Files Changed: 1 implementation, 0 test
Key Findings
- PR modifies
.github/workflows/skill-validation.ymlto restyle the skill-validation results comment with author/commit context, badges, and a collapsible session. - Gate was already skipped because no tests were detected in this PR; no
gate/content.mdwas created or modified here. - Code review found a status aggregation bug: cancelled/unknown LLM evaluation can still produce an overall green Passed badge when static checks pass.
Code Review Summary
Verdict: NEEDS_CHANGES
Confidence: high
Errors: 1 | Warnings: 1 | Suggestions: 1
Key code review findings:
- ❌
.github/workflows/skill-validation.yml:1038—overallStatuscan showPassedwhen eval status iscancelledorUnknown. ⚠️ Failure investigation prompt is computed but still absent from the PR comment body.- 💡 Badge-only summary depends on shields.io availability; a plain-text fallback would improve resilience.
Fix Candidates
| # | Source | Approach | Test Result | Files Changed | Notes |
|---|---|---|---|---|---|
| PR | PR #35713 | Restyle skill validation PR comment with PR metadata, shields.io badges, and a collapsible per-commit session while preserving existing result details | .github/workflows/skill-validation.yml |
Original PR |
Code Review — Deep Analysis
Code Review — PR #35713 · .github/workflows/skill-validation.yml
GitHub CLI authentication was unavailable; analysis used public GitHub API metadata and the local squashed PR diff (
HEAD^..HEAD).
Independent Assessment
What this changes: The workflow's skill-validation result comment is restyled from a flat Markdown body into a richer status report with PR author/commit context, shields.io badges, a single expandable session block, and per-commit session markers. It adds a github.rest.pulls.get() call to retrieve PR title, author, and head SHA.
Inferred motivation: Make skill validation comments match the visual style used by other AI-generated PR comments while preserving existing static-check and LLM-evaluation details inside the expanded session.
Reconciliation with PR Narrative
Author claims: PR #35713 says it styles the Skill Validation Results comment with a stable marker, author/commit context, status badges, a single expandable session with commit metadata, and preserved existing details. It reports validation via extracted actions/github-script syntax check and git diff --check.
Agreement/disagreement: The diff matches the narrative. One status aggregation edge case appears incorrect: overallStatus can report Passed when the evaluation job is cancelled or unknown as long as static checks passed.
Findings
❌ Error — overallStatus can show Passed when eval is cancelled or unknown
.github/workflows/skill-validation.yml:1038 computes overallStatus as Passed whenever staticStatus === 'Passed' and neither status is Failed. That means staticStatus='Passed' with evalStatus='cancelled' or evalStatus='Unknown' produces a green overall badge even though LLM evaluation did not complete. Expected behavior is Passed only when static passed and eval either passed or was intentionally skipped.
⚠️ Warning — Failure investigation prompt remains outside the PR comment
The workflow computes investigatePrompt for failed evals, but the new PR comment body still does not include it inside the result session. This is pre-existing behavior and not required by the styling goal, but it leaves the author-facing comment less actionable when LLM evaluation fails.
💡 Suggestion — Badge-only summary depends on shields.io availability
The badge row uses https://img.shields.io/badge/.... If shields.io is unavailable, GitHub will show broken images rather than readable badge text. A plain-text fallback line would improve resilience, though this mirrors the referenced styling pattern.
Devil's Advocate
The overallStatus behavior might be intentional if cancelled evals are considered non-blocking, but the label is "Overall" and a cancelled/unknown phase is materially different from a passed phase. The prompt and badge fallback concerns are lower severity because they are either pre-existing or consistent with the style being copied.
Verdict: NEEDS_CHANGES
Confidence: high
Summary: The style refactor is structurally sound and safe, but the derived overall badge can be misleading for cancelled or unknown eval outcomes. A small status aggregation fix should be applied before this change is considered complete.
Blast Radius
Workflow-only change. Affects skill-validation PR comments for all platforms; no MAUI runtime or Android app code is touched.
Failure-Mode Probes
| Scenario | Expected overallStatus |
|---|---|
| static Passed, eval Passed | Passed |
| static Passed, eval Skipped | Passed |
| static Passed, eval Failed | Failed |
| static Passed, eval cancelled | Needs attention |
| static Passed, eval Unknown | Needs attention |
| static Failed, eval Passed | Failed |
| static Failed, eval Skipped | Failed |
| static Unknown, eval Passed | Needs attention |
| static Skipped, eval Skipped | Needs attention |
Fix — Analysis & Comparison
Fix Candidates
| # | Source | Approach | Test Result | Files Changed | Notes |
|---|---|---|---|---|---|
| 1 | code-review + maui-expert-reviewer | Tighten the inline overallStatus success condition to require static Passed and eval Passed/Skipped |
✅ PASS | 1 file | Smallest targeted fix; prevents false green Overall badge for cancelled/unknown eval |
| 2 | code-review + maui-expert-reviewer | Extract getOverallStatus(staticStatus, evalStatus) helper with explicit guard clauses |
✅ PASS | 1 file | More readable policy encoding, but larger than candidate 1 |
| PR | PR #35713 | Restyle skill validation comment with PR metadata, badges, and collapsible session | 1 file | Original PR; expert review found false green Overall badge edge case |
Cross-Pollination
| Model | Round | New Ideas? | Details |
|---|---|---|---|
| maui-expert-reviewer | 1 | Yes | Identified status aggregation flaw and suggested tightening pass eligibility for eval statuses |
| local test loop | 1 | Yes | Candidate 1 validated minimal inline condition against syntax, status matrix, and diff whitespace checks |
| local test loop | 2 | Yes | Candidate 2 validated helper-based policy encoding against the same checks |
| orchestrator | 2 | No | Further candidates would be trivial variations of the same boolean policy; no meaningfully different approach remains for this workflow-only bug |
Exhausted: Yes — two meaningful implementation strategies passed; additional variants would only rename variables or rearrange equivalent conditions.
Selected Fix: Candidate #1 — It passes all targeted checks, fixes the expert-reviewed cancelled/unknown eval failure mode, and is simpler than Candidate #2 while changing fewer lines.
Report — Final Recommendation
Comparative Fix Report - PR #35713
Gate: Skipped; no tests were detected in this PR. Regression ranking below uses the recorded STEP 5a targeted checks and the expert-review sandbox validation. Candidates that failed regression tests would rank below passing candidates; none of the STEP 5a candidates failed.
Candidates Compared
| Rank | Candidate | Result | Assessment |
|---|---|---|---|
| 1 | pr-plus-reviewer |
Passed targeted status matrix | Best choice. It keeps the submitted PR's styling implementation and applies the expert reviewer's minimal correctness fix so non-completed LLM states no longer show a green overall badge. |
| 2 | try-fix-1 |
Passed targeted status matrix | Equivalent code change to pr-plus-reviewer and the smallest standalone alternative. Ranked behind pr-plus-reviewer only because pr-plus-reviewer preserves the PR fix plus applied review feedback as the integrated candidate. |
| 3 | try-fix-2 |
Passed targeted status matrix | Correct and clearer if the status policy grows, but it adds a helper for a single call site and is larger than necessary for this workflow-only bug. |
| 4 | pr |
Gate skipped; expert review found a correctness error | The raw PR improves comment presentation but can falsely report Overall: Passed when static checks pass and LLM evaluation is cancelled, Unknown, or another non-success/non-skipped state. |
Winner
Winning candidate: pr-plus-reviewer
pr-plus-reviewer is the single best candidate because it retains the PR's intended UI/styling behavior and incorporates the expert review's minimal status aggregation fix. It is functionally equivalent to the strongest try-fix candidate while avoiding the raw PR's false-green overall status.
Notes
The absent failure-investigation prompt in the PR comment body and the shields.io badge fallback concern were reviewed but not treated as blocking. They are either pre-existing behavior or resilience improvements rather than correctness failures in this PR's submitted fix.
Future Action — review latest findings
No alternative fix was selected for this run. Review the session findings and CI results before merging.
|
/review rerun |
PureWeen
left a comment
There was a problem hiding this comment.
Adversarial review summary
3 independent reviewers analyzed this PR in parallel with adversarial consensus (disputed 1/3 findings were re-evaluated by the other two reviewers; only validated findings are included).
Findings
| Severity | Finding | Consensus |
|---|---|---|
| ❌ Logic | Overall badge shows Passed when LLM eval is cancelled/Unknown or any non-final value (line 1038) |
3/3 reviewers |
| ❌ Logic | Comment attributes results to the wrong commit when a push lands during a /evaluate-skills run (line 693) |
3/3 after dispute |
Unguarded github.rest.pulls.get() (line 688) can sink the entire results comment on a transient API blip |
3/3 after dispute |
See inline comments for repro scenarios and concrete fixes.
Discarded findings
Two additional findings were investigated and discarded as non-issues after dispute:
- HTML attribute injection in badge
alt—skillCount/agentCountarewc -linteger output from a trusted setup step, not raw PR-controlled file content; all other badge messages come from a constrained string vocabulary; GitHub's comment HTML sanitizer also blocks the worst cases. <details open>stripper at line 1076 collapses sub-report formatting — verified intentional per the commit history (5e028470"Collapse all skill validation details by default"); no current generated section uses<details open>.
Verified safe (do not need changes)
- Marker upsert (
body.includes(marker)) — relocation ofmarkerfromlines[0]to the first element ofbodyis detected correctly; existing comments are updated in place, no duplicates. prTitleXSS —&/</>are escaped and the value sits in element text (<strong>...</strong>), not an attribute.headSha7in HTML comment markers — derived from hex-onlypr.data.head.sha, can't contain-->.lines = []downstream indexing — no consumer indexeslines[0];markeris correctly placed at top ofbody.- Optional chaining on
pr.data.user?.login— falls through to the no-mention author line variant.
Prior review context
MauiBot's expert review (2026-06-07) independently flagged the overallStatus aggregation bug as [major]. That finding is still unresolved — this review confirms it with 3-reviewer consensus and proposes the same minimal fix.
Test coverage
The PR's node --check validation only catches JavaScript syntax errors. None of the findings above are syntax errors — they're semantic/logic issues that require executing the comment-rendering path against the matrix of evalResult/staticResult combinations. Consider adding a small JS unit test (or a workflow_dispatch dry-run) that exercises the status-derivation logic to prevent regressions.
Methodology: 3 independent reviewers from different model families; 1/3 findings re-evaluated by the other two with no anchoring on the originator's wording. Verdicts here reflect consensus, not single-reviewer opinion.
…tricter Overall status - Use the gated head_sha (needs.pr-gate/slash-gate) for the results comment instead of the live PR head, so a push during a slow /evaluate-skills run can't misattribute results. - Wrap the decorative pulls.get() in try/catch with fallbacks so a transient API failure (rate-limit/5xx) doesn't sink the whole badges/details comment. - Overall = Passed only when LLM eval is explicitly Passed/Skipped; cancelled/Unknown now reads 'Needs attention', consistent with report-status. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@PureWeen addressed all three findings in |
PureWeen
left a comment
There was a problem hiding this comment.
Round 2 Adversarial Review — Multi-Model Code Review
Re: previous review #4498374195. Re-reviewed badeb78d1c with 3 independent reviewers from different model families (adversarial consensus).
✅ Round 1 fixes verified clean (3/3 reviewers)
All three findings from the previous round were addressed at root cause; no overcorrection or new regressions introduced by the fixes themselves:
| Finding | Round-1 line | Fix in badeb78d1c |
Verdict |
|---|---|---|---|
overallStatus = Passed on cancelled/Unknown LLM eval |
L1038 | L1049 tightened to staticStatus === 'Passed' && (evalStatus === 'Passed' || evalStatus === 'Skipped') |
✅ 3/3 |
Comment attributes results to wrong commit during /evaluate-skills race |
L693 | New HEAD_SHA env at L680 sourced from pr-gate/slash-gate outputs; process.env.HEAD_SHA used for headSha/commit-link |
✅ 3/3 |
Unguarded pulls.get() can sink entire comment |
L688 | Wrapped in try/catch with prData = {} fallback (L690–700); decorative metadata degrades gracefully |
✅ 3/3 |
⚠️ One new finding (2/3 after dispute round)
One reviewer flagged a logic path that was not present in Round 1. After dispute, both other reviewers confirmed the trace is factually correct but argued for report-status shell at L1170+ and check-run github-script at L1219+) that this PR does not touch. See the inline comment at L1034 for the trace and a minimal in-scope guard suggestion.
Methodology
- 3 reviewers (different model families) → adversarial consensus
- 1/3 findings sent to a dispute round with the other 2 reviewers
- Inline comments only on paths/lines validated against the actual PR diff
event: COMMENT— approval is a human decision
When the discover-eval job fails (transient runner / gh api error) for a contributor PR with static passing, evalStatus fell through to 'Skipped' and Overall showed green ✅. Guard on discoverResult !== 'success' → 'Needs attention', distinguishing a discovery infra failure from the legitimate 'no eval entries' case (report-status/check-run unchanged). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@PureWeen addressed the round-2 finding in |
PureWeen
left a comment
There was a problem hiding this comment.
Round 3 Adversarial Review — Multi-Model Code Review
Re-reviewed a5936c1c7c with 3 independent reviewers from different model families (adversarial consensus). This round verifies the fix for the Round 2 finding (review #4501101962).
✅ Round 2 finding resolved (3/3 consensus)
The new guard in a5936c1c7c is verified correct by all three reviewers, with no regression:
} else if (discoverResult !== 'success') {
evalStatus = 'Needs attention'; // discovery job failed → don't mask as Skipped/Passed
}Branch ordering is the key detail, and it's right: the new branch sits after !evalRan && !isContributor, so:
| Scenario | evalStatus |
Overall |
|---|---|---|
| External (non-contributor) PR, discover-eval skipped | Skipped |
✅ Passed (unchanged — intended) |
| Contributor PR, discover-eval failed/cancelled | Needs attention | |
| Contributor PR, discover-eval success, no entries | Skipped |
✅ Passed (unchanged) |
| Contributor PR, eval ran with results | Passed/Failed | per results (unchanged) |
The guard fires only when discovery didn't succeed and the actor is a contributor/slash-triggered — exactly the targeted case. External PRs are still caught first and map to Skipped, so there are no false "Needs attention" alarms.
✅ Round 1 fixes still hold (3/3)
All three earlier fixes re-verified correct and complete — gated HEAD_SHA used first (no wrong-commit attribution), pulls.get() guarded with graceful metadata fallback (no remaining unguarded deref), and overallStatus only green when staticStatus === 'Passed' && (evalStatus === 'Passed' || evalStatus === 'Skipped'). Security spot-checks pass: PR title / badge alt are HTML-escaped, badge label/message go through encodeURIComponent, and the SHA only flows into hex-only contexts.
💡 Optional follow-up (non-blocking, 1/3 reviewers — out of scope for this PR)
For the same contributor + discover-failure case, the new badge now correctly reads LLM: Needs attention, but the pre-existing detailed body section (and the sibling report-status / check-run reporters) still fall through to "⏭️ LLM Evaluation: Skipped". Those are pre-existing and outside this styling PR's diff — a sensible follow-up issue would add the same discoverResult !== 'success' guard to all reporters so the badge and body agree. Not a blocker for this PR.
Verdict: No blocking issues. All four findings from the prior two rounds are resolved with 3/3 consensus and no overcorrection. The only remaining item is the optional consistency follow-up above.
Methodology: 3 independent reviewers, adversarial consensus. Changed path is workflow YAML with embedded github-script; no automated test coverage exists for this comment-rendering logic (validated by the author with node --check). event: COMMENT — approval is a human decision.
|
Thanks for the round-3 verification, @PureWeen — glad the discover-eval guard holds up across all three reviewers with no overcorrection. Agreed on the optional follow-up: the badge now reads |
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Description of Change
Updates the
Skill Validation ResultsPR comment to match the visual style used by AI Summary and Test Failure Review comments:## Skill Validation ResultstitleIssues Fixed
No issue filed.
Validation
actions/github-scriptpost-comment JavaScript withnode --checkgit diff --check