Improve code-review eval: add regression detection scenarios by PureWeen · Pull Request #34884 · dotnet/maui

PureWeen · 2026-04-08T18:30:47Z

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Summary

Improves the code-review skill's eval.yaml with regression detection scenarios based on real regressed-in-10.0.60 issues.

Context

The code-review skill already has 345 lines of regression prevention rules in review-rules.md (Section 21: 13 rules from 30 reverted PRs, Section 22: 8 most-regressed component rankings)
None of the existing 6 eval scenarios tested these regression rules — this PR closes that gap
Based on analysis of 5 real regression issues ([MAUI] I3_Sizing_Dynamic item sizing - After dragging the scrollbar, all images will automatically return to their original size. #34783, C6-The C6 page cannot scroll on Windows and Android platforms. #34666, [MAUI] I2_Spacing_ItemSpacing - First and last item on the list is truncated after changing Spacing value. #34636, [MacOS] Misaligned items before resizing the window on MacOS #34635, [.NET 10] Increasing gap in the bottom while scrolling. #34634) and their fix PRs
Follows eval best practices from try-fix and verify-tests evaluation cycles (PRs Improve try-fix skill: add eval.yaml and fix prompt issues #34807, Add eval.yaml for verify-tests-fail-without-fix skill #34815)
Part of eval coverage expansion tracked in issue Establish eval.yaml lifecycle and CI integration for skill validation #34814

Changes

3 new regression-focused scenarios (6 → 9 total):

#	Scenario	What it tests	Based on
7	CollectionView regression risk	Agent flags CV as high-risk component, calls for broad scenario coverage	Issue #34666 (scroll inside disabled RefreshView)
8	Side-effect risk on shared paths	Agent identifies that modifying a shared invalidation path risks breaking adjacent behaviors	Issue #34636 (ItemSpacing clipping)
9	Safe change — no false warnings	Agent does NOT flag regression risk for documentation-only changes	Prevents false positive regression noise

Also improved existing scenarios:

Added expect_activation: false to both negative trigger scenarios (2 and 6)
Added version changelog header

Eval Stats

9 scenarios (was 6)
0 output_contains (no vocabulary overfitting)
17 output_not_contains (anti-pattern guards)
4 output_matches (verdict pattern matching)
2 expect_activation: false (native negative triggers)

- Add 3 new scenarios testing Section 21/22 regression prevention rules: - Scenario 7: CollectionView high-regression-risk component detection - Scenario 8: Side-effect risk when modifying shared code paths - Scenario 9: Safe doc-only change should NOT get false regression warnings - Add expect_activation: false to both negative trigger scenarios - Add version changelog comment header - Based on real regressed-in-10.0.60 issues (#34666, #34636, #34635) - 9 scenarios total (was 6), 0 output_contains, 17 output_not_contains Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-04-08T18:31:32Z

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34884

Or

Run remotely in PowerShell:

iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34884"

Copilot

Pull request overview

This PR expands the code-review skill’s evaluation coverage by adding new scenarios specifically aimed at detecting regression-risk reasoning, based on real “regressed-in-10.0.60” issue patterns. It also strengthens two existing negative-trigger scenarios by explicitly asserting the skill should not activate.

Changes:

Added a v2 changelog header to the eval file.
Added 3 new regression-focused eval scenarios (raising total scenarios from 6 to 9).
Added expect_activation: false to two negative-trigger scenarios to ensure the workflow doesn’t activate on informational prompts.

Copilot · 2026-04-08T18:33:43Z

+        value: "no concerns"
+      - type: "output_not_contains"
+        value: "looks safe"


The output_not_contains: "no concerns" / "looks safe" assertions are both case-sensitive and very generic. This can make the scenario flaky: the model might respond with capitalized variants ("No concerns"/"Looks safe") which would bypass the guard, or use these substrings in a longer sentence while still raising concerns. Consider tightening these to more specific phrases you want to forbid and/or include the common capitalization variants so the assertion matches the intended anti-pattern reliably.

Suggested change

value: "no concerns"

- type: "output_not_contains"

value: "looks safe"

value: "no concerns about side effects"

- type: "output_not_contains"

value: "No concerns about side effects"

- type: "output_not_contains"

value: "looks safe to me"

- type: "output_not_contains"

value: "Looks safe to me"

Copilot · 2026-04-08T18:33:43Z

+        value: "regression risk"
+      - type: "output_not_contains"


The assertion output_not_contains: "regression risk" conflicts with the rubric expectation that the agent should identify this as doc-only and explicitly state there's no regression risk. A correct response like "no regression risk" would still contain the banned substring and fail the scenario. Consider removing this assertion or narrowing it to warning-style phrasing (e.g., "high regression risk" / "regression-prone") so correct low-risk language remains allowed.

Suggested change

value: "regression risk"

- type: "output_not_contains"

…gression priority eval.yaml: - Scenario 1: Add output_contains 'review' - Scenario 7: Add output_contains 'regression' - Scenario 8: Add output_contains 'side effect' - Scenario 9: Add output_contains 'documentation', narrow 'adjacent scenario' to 'test adjacent scenario' SKILL.md: - Add regression risk check instruction in Step 3 referencing Section 21/22 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Create .github/docs/skill-eval-guide.md documenting both eval tools: - Anthropic skill-creator (trigger/routing evaluation) - dotnet/skills skill-validator (behavioral evaluation) - Add mandatory pre-evaluation checklist to prevent missing tools - Document multi-agent evaluation protocol (3 workers, 4-round max) - Add eval quality standards (assertion/rubric conflict check, coverage requirements) - Update README-AI.md to reference the new guide Root cause: Anthropic evaluator worker missed skill-creator tool for 5+ rounds because it never enumerated anthropics/skills/skills/ contents. This guide ensures future evaluations always check both toolchains first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kubaflo · 2026-05-24T14:28:56Z

/review -b feature/refactor-copilot-yml

MauiBot

Expert Review — 2 findings

See inline comments for details.

MauiBot · 2026-05-24T14:55:33Z

+    assertions:
+      - type: "output_not_contains"
+        value: "LGTM"
+      - type: "output_contains"


[moderate] Eval robustness — This assertion only checks for the substring regression, so an answer like "there is no regression concern" (or one that mentions regression without asking for adjacent CollectionView coverage) would pass the automated check. Since the scenario is intended to prevent false positives/false negatives around high-risk CollectionView fixes, assert the concrete expected behavior with less ambiguous terms, e.g. adjacent scenario coverage for scroll/tap/swipe/gesture or a NEEDS_CHANGES/NEEDS_DISCUSSION verdict.

kubaflo

Could you please check the ai's suggestions?

kubaflo · 2026-06-07T15:33:01Z

/review -b feature/enhanced-reviewer -p android

MauiBot

AI Review Summary

@PureWeen — new AI review results are available based on this last commit: c5d204e.
Remove skill-eval-guide.md -- belongs in polypilot, not MAUI repo To request a fresh review after new comments or commits, comment /review rerun.

Review Sessions — click to expand

Gate — Test Before & After Fix

Gate Result: ⚠️ SKIPPED

No tests were detected in this PR.

Recommendation: Add tests to verify the fix using the write-tests-agent.

Pre-Flight — Context & Validation

Issue: #34884 - Improve code-review eval: add regression detection scenarios
PR: #34884 - Improve code-review eval: add regression detection scenarios
Platforms Affected: infrastructure / skill validation (requested testing platform: android; no Android runtime code changed)
Files Changed: 1 implementation/configuration, 1 eval test

Key Findings

PR adds regression-oriented eval scenarios for the code-review skill and a regression-risk instruction in .github/skills/code-review/SKILL.md.
Gate was already provided as skipped because no tests were detected; no gate rerun was performed and gate/content.md was not modified.
Expert review found the new regression-risk instruction uses stale/non-actionable Section 21/22 references; current guidance lives under Dimension 6, Regression Prevention and Test Coverage, with Frequently Regressed Components.
Expert/prior inline review found eval robustness issues: the doc-only scenario rejects correct "no regression risk" wording, the CollectionView scenario accepts generic regression, and side-effect assertions are brittle.

Code Review Summary

Verdict: NEEDS_CHANGES
Confidence: high
Errors: 2 | Warnings: 1 | Suggestions: 0

Key code review findings:

❌ .github/skills/code-review/SKILL.md:103 references Section 21/22, but the intended regression guidance is under Dimension 6 and named subsections in .github/agents/maui-expert-reviewer.md.
❌ .github/skills/code-review/tests/eval.yaml:151 bans regression risk, which blocks correct doc-only language such as "no regression risk".
⚠️ .github/skills/code-review/tests/eval.yaml:111 and side-effect guards are too generic/brittle to reliably enforce the intended behavior.

Fix Candidates

#	Source	Approach	Test Result	Files Changed	Notes
PR	PR #34884	Add regression-risk check using Section 21/22 numbering plus three regression eval scenarios and negative-trigger metadata	⚠️ SKIPPED (Gate: no tests detected)	`.github/skills/code-review/SKILL.md`, `.github/skills/code-review/tests/eval.yaml`	Original PR; needs alternative to stale section references and fragile assertions.

Code Review — Deep Analysis

Code Review — PR #34884

Independent Assessment

What this changes: Adds regression-oriented evaluation scenarios for the code-review skill and adds a regression-risk reminder to the skill workflow.
Inferred motivation: Improve the skill's ability to detect MAUI regression-prone changes and avoid false positives for documentation-only changes.

Reconciliation with PR Narrative

Author claims: The PR expands eval coverage for regression prevention rules, adds negative-trigger metadata, and documents regression-risk scrutiny for frequently regressed components.
Agreement/disagreement: The eval expansion matches the stated goal, but the new skill guidance references stale/non-actionable section numbers and one doc-only assertion would fail correct language such as "no regression risk." Prior inline review also flagged a too-generic CollectionView regression assertion and brittle side-effect negative assertions.

Findings

❌ Error — Regression guidance references stale/non-actionable sections

.github/skills/code-review/SKILL.md:103 points agents to Section 22's "Most Frequently Regressed Components" table and Section 21 regression prevention rules. The current MAUI expert reviewer defines this content under Dimension 6, Regression Prevention and Test Coverage, with a named Frequently Regressed Components subsection; Sections 21/22 are unrelated CollectionView shared models and Android platform specifics. Agents following the PR text can apply the wrong checklist or fail to find the intended guidance.

❌ Error — Doc-only eval blocks correct "no regression risk" wording

.github/skills/code-review/tests/eval.yaml:151 bans the substring regression risk, while the rubric expects the agent to identify documentation-only changes as having no regression risk. A correct answer like "documentation-only; no regression risk" would fail this scenario.

⚠️ Warning — Regression eval assertions are under-specified or brittle

Prior inline review flagged .github/skills/code-review/tests/eval.yaml:111 because output_contains: "regression" can pass without requiring adjacent CollectionView scenario coverage. Review comments also flagged .github/skills/code-review/tests/eval.yaml:131-134 because generic, case-sensitive negative assertions can miss common variants.

Devil's Advocate

The PR description mentions an older review-rules.md structure, so the section references may have been correct when originally drafted. However, the current repository routes code-review through .github/agents/maui-expert-reviewer.md, where the target content is clearly named and differently numbered; using semantic anchors or removing duplication is safer than preserving numeric references.

Verdict: NEEDS_CHANGES

Confidence: high
Summary: The goal is sound, but the current PR fix has actionable eval/instruction robustness issues. Alternative candidates should remove numeric section coupling and make assertions validate behavior without blocking correct low-risk wording.

Fix — Analysis & Comparison

Fix Candidates

#	Source	Approach	Test Result	Files Changed	Notes
1	try-fix	Semantic reviewer references plus behavior-focused assertions	⚠️ BLOCKED (static checks pass; validator blocked)	2 files	Best balance: keeps the PR intent, removes numeric coupling, and tightens eval assertions without duplicating the full reviewer table.
2	try-fix	Remove duplicated regression guidance from SKILL.md and enforce behavior through evals	⚠️ BLOCKED (static checks pass; validator blocked)	2 files	Lowest duplication, but less explicit for agents reading only SKILL.md.
3	try-fix	Inline high-risk component guidance in SKILL.md and conservatively fix eval assertions	⚠️ BLOCKED (static checks pass; validator blocked)	2 files	Most self-contained, but duplicates reviewer knowledge and may drift.
PR	PR #34884	Add regression-risk check using Section 21/22 numbering plus regression eval scenarios	⚠️ SKIPPED (Gate: no tests detected)	2 files	Original PR; expert review found stale guidance and fragile eval assertions.

Cross-Pollination

Model	Round	New Ideas?	Details
maui-expert-reviewer	1	Yes	Proposed semantic-reference, remove-duplicated-guidance, and inline-guidance strategies.
Test loop	1	Yes	`skill-validator` is blocked by host GLIBC, so subsequent candidates were compared using YAML parse plus targeted assertion-shape checks and treated full validation as blocked.
Expert synthesis	2	No	No additional materially different approach beyond semantic reference, delegation/removal, and inline guidance was identified.

Exhausted: Yes — three materially different strategies were generated and no further non-trivial alternatives were identified.
Selected Fix: Candidate #1 — it is the best candidate because it preserves the PR's intended regression-review instruction, avoids stale numeric section references, and tightens the eval assertions without copying the full expert-reviewer table into SKILL.md.

Report — Final Recommendation

Comparative Report — PR #34884

Candidate ranking

Rank	Candidate	Regression/static result	Assessment
1	`try-fix-1`	Static checks passed; full `skill-validator` blocked by host GLIBC	Best overall. Removes stale numeric section coupling, strengthens the CollectionView eval with behavior-specific adjacent scenario assertions, narrows doc-only false-positive assertions, and also improves brittle side-effect negative assertions.
2	`pr-plus-reviewer`	Static checks passed in sandbox	Resolves all expert inline findings against the PR, but is narrower than `try-fix-1` because it does not address the previously identified brittle side-effect assertions.
3	`try-fix-2`	Static checks passed; full `skill-validator` blocked by host GLIBC	Removes duplicated guidance from `SKILL.md`, which avoids drift, but leaves the standalone skill less explicit for agents reading only the skill workflow.
4	`try-fix-3`	Static checks passed; full `skill-validator` blocked by host GLIBC	Self-contained and understandable, but duplicates the expert-reviewer component table in `SKILL.md`, increasing future drift risk. It also does not address the side-effect assertion brittleness.
5	`pr`	Gate skipped; expert review found actionable issues	Raw PR intent is good, but it keeps stale Section 21/22 references and fragile/broad eval assertions.

No candidate failed regression tests. The raw PR gate was skipped because no tests were detected, and each try-fix candidate was blocked only by the environmental GLIBC_2.38 requirement for skill-validator; their available static regression checks passed.

Winner

try-fix-1 is the single winning candidate. It provides the best balance of correctness and maintainability: it keeps explicit regression guidance in the code-review skill, points to stable semantic reviewer anchors instead of stale numeric sections, tightens the evals around the intended behaviors, and avoids duplicating the full regression-risk table.

Future Action — alternative fix proposed (try-fix-1)

Automated review — alternative fix proposed

The expert-reviewer evaluation compared the PR fix against automatically generated candidates and selected try-fix-1 as the strongest fix.

Why: try-fix-1 wins because it fixes the raw PR's stale Section 21/22 references and fragile regression eval assertions while preserving explicit regression-review guidance. It also improves the side-effect assertions beyond the narrower pr-plus-reviewer candidate, and no candidate failed regression tests.

Please consider applying the candidate diff below (or use it as guidance). Once you push an update, this workflow will re-trigger and re-evaluate.

Candidate diff (try-fix-1)

--- a/.github/skills/code-review/SKILL.md
+++ b/.github/skills/code-review/SKILL.md
@@ -100,7 +100,7 @@
 3. **Is the approach sound?** Would a simpler alternative work?
 4. **What problems do you see?** Run through the agent's dimension CHECKs for matched dimensions
 
-**Regression risk check:** If changed files touch components in Section 22's "Most Frequently Regressed Components" table (CollectionView, Image/Graphics, Theme/Style, CarouselView, Gesture/Tap, Button/Entry, Toolbar, Shell/TabBar), apply Section 21 regression prevention rules with extra scrutiny.
+**Regression risk check:** If changed files touch components listed in the `maui-expert-reviewer` Dimension 6 `Frequently Regressed Components` table (CollectionView, Image/Graphics, Theme/Style, CarouselView, Gesture/Tap, Button/Entry, Toolbar, Shell/TabBar), apply the `Regression Prevention and Test Coverage` rules with extra scrutiny.
 
 ### Step 4: Read PR Narrative and Reconcile
 
--- a/.github/skills/code-review/tests/eval.yaml
+++ b/.github/skills/code-review/tests/eval.yaml
@@ -109,7 +109,11 @@
       - type: "output_not_contains"
         value: "LGTM"
       - type: "output_contains"
-        value: "regression"
+        value: "adjacent scenarios"
+      - type: "output_contains"
+        value: "scroll"
+      - type: "output_contains"
+        value: "tap"
     rubric:
       - "Agent flags CollectionView as a high-regression-risk component and calls for broader scenario coverage"
       - "Agent recommends testing adjacent scenarios (scroll, tap, swipe, gesture) in addition to the reported fix scenario"
@@ -129,9 +133,13 @@
       Should I be concerned about side effects?
     assertions:
       - type: "output_not_contains"
-        value: "no concerns"
+        value: "no concerns about side effects"
       - type: "output_not_contains"
-        value: "looks safe"
+        value: "No concerns about side effects"
+      - type: "output_not_contains"
+        value: "looks safe to me"
+      - type: "output_not_contains"
+        value: "Looks safe to me"
       - type: "output_contains"
         value: "side effect"
     rubric:
@@ -149,7 +157,9 @@
       Please review.
     assertions:
       - type: "output_not_contains"
-        value: "regression risk"
+        value: "high regression risk"
+      - type: "output_not_contains"
+        value: "regression-prone"
       - type: "output_not_contains"
         value: "test adjacent scenario"
       - type: "output_not_contains"

kubaflo

Could you please check the ai's suggestions?

kubaflo

🤖 Multi-model code review — Request changes

Three models reviewed this PR independently (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro), then cross-pollinated.

Consensus: NEEDS_CHANGES (light) — the regression-scenario intent is good and the SKILL.md addition is fine; the concerns are all about eval assertion quality + coordination, none about code safety.

Good 👍

The new "Regression risk check" step in SKILL.md and the version header comments are accurate, the YAML parses, and the new fields (expect_activation: false, output_contains) are consistent with the convention already used across the sibling skill eval files — they're not stray/no-op fields.

Assertion quality (inline)

Self-contradicting assertion (line 152) — the docs-only "should NOT warn" scenario asserts output_not_contains: "regression risk", but its rubric wants the agent to say "no regression risk" — the ideal answer contains that substring and would false-fail.
Gameable (line 112) — output_contains: "regression" passes even if the agent says "not a regression"; substring matches are direction-blind.
Prompt-echo (line 135) — output_contains: "side effect" just echoes a word already in the prompt.

Two coordination notes

No CI runs these evals. There is no skill-validation workflow or skill-validator in the repo, and copilot-evaluate-tests only reviews src/** PR test diffs — so eval.yaml scenarios are currently unverified (nothing catches issues like #1 above). Worth wiring up a runner, or at least noting these are aspirational.
Format migration in flight. PR #35942 (draft) migrates this same eval.yaml to the new Vally format — these additions to the legacy format will likely need re-doing. Consider landing against Vally, or confirm the sequencing.

_{Independent verdicts: Opus 4.8 — LGTM (med, "tighten assertions") · GPT-5.5 — NEEDS_DISCUSSION (high) · Gemini 3.1 Pro — NEEDS_DISCUSSION (high). (One model asserted a skill-validation.yml runner exists — I verified it does not, so that claim is excluded.) Consensus: safe & additive, but tighten the assertions and coordinate with the Vally migration.}

kubaflo · 2026-06-16T16:13:49Z

+      Please review.
+    assertions:
+      - type: "output_not_contains"
+        value: "regression risk"


⚠️ This negative scenario (docs-only change → should NOT get regression warnings) asserts output_not_contains: "regression risk", but its own rubric is "Agent correctly identifies this as a documentation-only change with no regression risk." The ideal answer literally contains the substring "regression risk", so the correct output would false-fail this assertion. Substring output_not_contains can't distinguish "no regression risk" (good) from "high regression risk" (bad). Assert the desired signal instead — e.g. require the LGTM verdict and output_not_contains a phrase that only appears in a false positive (like "high-regression-risk component" or "broad scenario coverage").

kubaflo · 2026-06-16T16:13:49Z

+      - type: "output_not_contains"
+        value: "LGTM"
+      - type: "output_contains"
+        value: "regression"


⚠️ output_contains: "regression" for the CollectionView scenario passes even if the agent says "this is not a regression risk" — a plain substring match is direction-blind. Pair it with the intended verdict (it already asserts output_not_contains: "LGTM", good) and/or a more specific positive like output_matches on "high[- ]regression" / "regression(-| )risk", so the scenario actually discriminates a correct flag from an incidental mention.

kubaflo · 2026-06-16T16:13:49Z

+        value: "no concerns"
+      - type: "output_not_contains"
+        value: "looks safe"
+      - type: "output_contains"


💡 output_contains: "side effect" is weak here because the prompt itself asks "Should I be concerned about side effects?" — the agent will echo "side effect" regardless of whether it correctly reasons about the shared invalidation path. Assert something the agent must derive (e.g. references to the specific adjacent operations: scroll position / item animations / header-footer), or the verdict, rather than a word lifted from the prompt.

kubaflo

✏️ Correction to my earlier review

I need to correct a factual error in my review above. I stated "there is no skill-validation workflow or skill-validator in the repo, so eval.yaml scenarios are currently unverified" and excluded a model's claim to the contrary. That was wrong.

.github/workflows/skill-validation.yml does exist on main — it runs static checks on every PR touching .github/skills/**, plus an LLM evaluation of the skill eval.yaml scenarios (via the dotnet/skills validator binary), automatically for contributor PRs or when a contributor comments /evaluate-skills. My mistake: I verified the file's existence against this PR's branch head, which predates the workflow (it 404s on c5d204e9…), instead of against main. A rebase onto current main will pick up the runner — and credit to the reviewer model whose "it runs" claim I wrongly dismissed.

This makes the assertion-quality findings more important, not less: since these scenarios actually execute in CI, the self-contradicting output_not_contains: "regression risk" (line 152) would genuinely false-fail the validator on the correct "no regression risk" output. The three inline findings stand; please treat them as real eval failures rather than hypotheticals.

_{Posted by the multi-model review workflow — self-correcting the record.}

Copilot AI review requested due to automatic review settings April 8, 2026 18:30

Copilot started reviewing on behalf of PureWeen April 8, 2026 18:31 View session

Copilot AI reviewed Apr 8, 2026

View reviewed changes

PureWeen and others added 3 commits April 8, 2026 13:37

Remove skill-eval-guide.md -- belongs in polypilot, not MAUI repo

c5d204e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

This was referenced Apr 9, 2026

[repo-status] Daily Repo Status - April 9, 2026 🌟 #34893

Closed

[repo-status] Daily Repo Status - April 13, 2026 🌟 #34924

Closed

MauiBot reviewed May 24, 2026

View reviewed changes

MauiBot added s/agent-review-incomplete s/agent-fix-win AI found a better alternative fix than the PR s/agent-reviewed PR was reviewed by AI agent workflow (full 4-phase review) labels May 24, 2026

This comment has been minimized.

Sign in to view

kubaflo removed the s/agent-review-incomplete label May 24, 2026

kubaflo requested changes May 25, 2026

View reviewed changes

PureWeen added the area-ai-agents Copilot CLI agents, agent skills, AI-assisted development label Jun 1, 2026

MauiBot suggested changes Jun 7, 2026

View reviewed changes

kubaflo requested changes Jun 7, 2026

View reviewed changes

PureWeen mentioned this pull request Jun 11, 2026

[Release Readiness] .NET 10 SR9 — candidate from main #35867

Open

kubaflo requested changes Jun 16, 2026

View reviewed changes

kubaflo reviewed Jun 16, 2026

View reviewed changes

This was referenced Jun 16, 2026

Add automated regression-corpus scanner for the code-review skill #35925

Draft

ci(skills): migrate skill-eval suite from skill-validator to Vally #35942

Merged

Conversation

PureWeen commented Apr 8, 2026

Summary

Context

Changes

Eval Stats

Uh oh!

github-actions Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo commented May 24, 2026

Uh oh!

MauiBot left a comment

Choose a reason for hiding this comment

Expert Review — 2 findings

Uh oh!

MauiBot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

kubaflo left a comment

Choose a reason for hiding this comment

Uh oh!

kubaflo commented Jun 7, 2026

Uh oh!

MauiBot left a comment

Choose a reason for hiding this comment

AI Review Summary

Gate Result: ⚠️ SKIPPED

Key Findings

Code Review Summary

Fix Candidates

Code Review — PR #34884

Independent Assessment

Reconciliation with PR Narrative

Findings

❌ Error — Regression guidance references stale/non-actionable sections

❌ Error — Doc-only eval blocks correct "no regression risk" wording

⚠️ Warning — Regression eval assertions are under-specified or brittle

Devil's Advocate

Verdict: NEEDS_CHANGES

Fix Candidates

Cross-Pollination

Comparative Report — PR #34884

Candidate ranking

Winner

Uh oh!

kubaflo left a comment

Choose a reason for hiding this comment

Uh oh!

kubaflo left a comment

Choose a reason for hiding this comment

🤖 Multi-model code review — Request changes

Good 👍

Assertion quality (inline)

Two coordination notes

Uh oh!

kubaflo Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo left a comment

Choose a reason for hiding this comment

✏️ Correction to my earlier review

Uh oh!

github-actions Bot commented Apr 8, 2026 •

edited

Loading