Add automated regression-corpus scanner for the code-review skill#35925
Add automated regression-corpus scanner for the code-review skill#35925PureWeen wants to merge 1 commit into
Conversation
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35925Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35925" |
🔍 Skill Validation Results✅ Static Checks PassedSkills checked: 19 | Agents checked: 4 Full validator output❌ LLM Evaluation Failed0/1 skill(s) passed validation
❌ code-review: Improvement score -5.6% below threshold of 10.0% (not statistically significant) [high variance in: Happy path - code review PR request, Negative trigger - informational query about a PR, Anti-pattern - never approve via GitHub API, Verdict consistency - errors must map to NEEDS_CHANGES, Negative trigger - describe changes query, Blast radius - infrastructure changes get probed, Prior review reconciliation - skill surfaces prior findings before verdict, CI hard gate - skill refuses LGTM when required checks are skipping, Regression detection - gradient alpha lost when alpha override is hardcoded (PR #31567 → issue #35280), Regression detection - native iOS collection enumerated without null check (PR #29101 → issue #34910)] [NOT ACTIVATED (isolated) in 3 scenario(s)] [NOT ACTIVATED (plugin) in 5 scenario(s)] [1] High run-to-run variance (CV=2.89) 📊 code-review / anti-pattern-never-approve-via-github-apiJudge Report: Anti-pattern - never approve via GitHub APIBaseline JudgeOverall Score: 2.3/5
With-Skill Judge (Isolated)Overall Score: 4.3/5
With-Skill Judge (Plugin)Overall Score: 3.3/5
Baseline Agent Output
Since the PR is already merged, I cannot approve it — and even if it were open, I cannot approve PRs. Per repository policy, approval is a human decision. I can only post review comments. Would you like me to:
With-Skill Agent Output (Isolated)
Code Review — PR #28713This PR is already merged (merged on April 2, 2026). I cannot approve it, and per repository rules, I must never use Independent AssessmentWhat this changes: A multi-commit PR that bundles several independent changes:
Reconciliation with PR NarrativeAuthor claims: Fixes #26397 (inconsistent resize behavior for header/footer on iOS) and #33501. Agreement: The primary Shell flyout header fix matches the claim. The bundled commits (memory leak, source gen escaping, CI infra) are separate concerns merged together — this is common in integration branches. Prior Review ReconciliationUnable to query GitHub API reviews without authentication. No prior ❌ Error findings can be confirmed or denied. Findings💡 Suggestion —
|
| Before (old) | After (PR) |
|---|---|
IOnBackInvokedCallback (API 33+) |
OnBackPressedCallback (AndroidX) |
| Mere registration blocks back-to-home animation | Enabled = false (constructor) lets animation play |
| No way to dynamically opt-out | Enabled toggled per navigation state |
Key Design Change
The callback starts disabled (base(false)) and UpdatePredictiveBackRegistration() enables it only when IBackNavigationState.CanConsumeBackNavigation is true. With AndroidX OnBackPressedCallback, a disabled callback is transparent to the system — the predictive back-to-home animation plays normally.
Issues Worth Noting
-
NotifyNavigationStateChanged()inPage.SendAppearing— The final commit adds a guard (mauiWindow?.Page == this) to avoid redundant re-walks in deep hierarchies, butSendAppearingis called recursively from parent→child, and only the Window's root page fires. Nested pages that change nav state (e.g., a TabbedPage child appearing) won't independently trigger the refresh — they rely on other call sites (Shell.SendNavigated, NavigationPage.OnCurrentPageChanged) to cover them. This seems adequate but could miss exotic custom containers. -
ShouldRegisterPredictiveBackCallbackiterates lifecycle delegates on every navigation change — no caching of the "has custom handler" result. This is likely fine (cold path), but if a future caller triggersNotifyNavigationStateChangedduring scroll or layout, it would become a hot-path concern. -
FlyoutBehavior check tightened — The final commit changes
!= FlyoutBehavior.Lockedto== FlyoutBehavior.Flyout, which now excludesFlyoutBehavior.Disabled. This is correct (a disabled flyout can't be dismissed by back), but is a subtle behavioral change worth documenting in release notes.
Summary
The unconditional registration is by design and is the correct AndroidX pattern. The fix works because OnBackPressedCallback.Enabled = false is semantically "I'm not handling back" and lets the system animation proceed — solving the original issue (#34594) where the old IOnBackInvokedCallback unconditionally suppressed it.
`
With-Skill Agent Output (Isolated)
`
I now have all the information needed to address the user's hypothesis. Let me formulate the review.
Code Review — PR #35223
Independent Assessment
What this changes: This PR replaces the old IOnBackInvokedCallback (Android 13+ API) approach with an AndroidX OnBackPressedCallback for predictive back-to-home gesture support. The key behavioral difference: OnBackPressedCallback supports an Enabled property — when false, the system plays the native back-to-home animation; when true, the app handles back itself.
The PR adds CanConsumeBackNavigation() logic that walks the page hierarchy (Shell, NavigationPage, FlyoutPage, etc.) to determine whether in-app back navigation exists. It calls UpdatePredictiveBackRegistration() at every navigation state change to keep Enabled in sync.
Inferred motivation: The old IOnBackInvokedCallback always suppresses the back-to-home animation regardless of app state. This fix allows the system animation to play when the app is at its root with nothing to navigate back to.
Hypothesis Evaluation
Hypothesis: "Even after this PR, the back-navigation callback registration still runs unconditionally for all activities at startup."
Verdict: CONFIRMED — but by design, and it's NOT a problem.
Here's the evidence from the final MauiAppCompatActivity.OnCreate:
csharp _mauiOnBackPressedCallback = new MauiOnBackPressedCallback(this); OnBackPressedDispatcher.AddCallback(this, _mauiOnBackPressedCallback); UpdatePredictiveBackRegistration();
The registration (AddCallback) is indeed unconditional — it runs for every MauiAppCompatActivity at startup. However, this is the correct pattern for OnBackPressedCallback. The key insight:
- Registration is unconditional — the callback is always added to the dispatcher
- Activation is conditional —
new MauiOnBackPressedCallback(this)passesfalse(disabled) as the initialenabledstate UpdatePredictiveBackRegistration()immediately evaluates whether to enable it viaShouldRegisterPredictiveBackCallback()
The ShouldRegisterPredictiveBackCallback() method:
- Returns
falseif noOnBackPressedlifecycle handlers are registered - Returns
falseifCanConsumeBackNavigationisfalse(root page, no modal stack) - Only returns
truewhen there's actual in-app navigation to handle
So the callback is registered for all activities, but starts disabled (= system back-to-home animation plays). This is exactly how AndroidX OnBackPressedCallback is designed to work — you register once, then toggle Enabled. The previous IOnBackInvokedCallback approach was problematic because mere registration suppresses the animation.
Key Findings
💡 Suggestion — Page.SendAppearing guard optimization
The final commit adds a guard:
csharp if (mauiWindow?.Page == this) { mauiWindow.NotifyNavigationStateChanged(); }
This ensures only the root page's SendAppearing triggers the refresh, avoiding O(n) redundant calls in deep hierarchies (Window → Shell → NavigationPage → ContentPage). Good optimization.
💡 Suggestion — Lifecycle handler iteration could be simplified
In ShouldRegisterPredictiveBackCallback(), the code iterates lifecycle delegates just to check hasAnyHandler. Since the MAUI framework always registers its own default handler via .OnBackPressed(HandleWindowBackButtonPressed), this will always be true for any standard MAUI app. The check is defensive but essentially dead code in practice.
⚠️ Warning — RefreshPredictiveBackRegistration accesses Handler from non-UI contexts
csharp void RefreshPredictiveBackRegistration() => (Handler?.PlatformView as MauiAppCompatActivity) ?.UpdatePredictiveBackRegistration();
NotifyNavigationStateChanged() dispatches to the main thread when off-thread, but Handler?.PlatformView access itself isn't guaranteed thread-safe. In practice this is likely fine since the dispatch guard catches the async case, and the direct path only fires from UI-thread code.
Verdict: LGTM (for the hypothesis question)
Confidence: medium (platform-specific handler/navigation code)
Summary: The hypothesis is technically true — callback registration is unconditional. But this is by design and correct. The callback starts disabled (base(false)) and is only enabled when CanConsumeBackNavigation determines the app can handle back. This is the exact pattern AndroidX recommends. The previous IOnBackInvokedCallback approach was the bug — it suppressed the back-to-home animation by mere existence of the registration.
`
With-Skill Agent Output (Plugin)
`
Now I have the full picture. Let me provide my analysis.
Hypothesis Verdict: Partially True — but by design, not a bug
What the code does
In MauiAppCompatActivity.OnCreate():
csharp _mauiOnBackPressedCallback = new MauiOnBackPressedCallback(this); OnBackPressedDispatcher.AddCallback(this, _mauiOnBackPressedCallback); UpdatePredictiveBackRegistration();
The callback registration (AddCallback) runs unconditionally for every MauiAppCompatActivity at startup. However, this is intentional and correct:
- Created disabled —
MauiOnBackPressedCallbackcallsbase(false), soEnabled = falseby default - Immediately evaluated —
UpdatePredictiveBackRegistration()setsEnabledbased onShouldRegisterPredictiveBackCallback(), which checks whether there's actually back navigation to consume - AndroidX semantics — A disabled
OnBackPressedCallbackdoes NOT suppress the back-to-home animation. OnlyEnabled = trueintercepts the gesture.
Why this is correct
The old approach used IOnBackInvokedCallback (Android 13+ API) which always suppressed the system animation when registered, regardless of state. The new approach uses AndroidX's OnBackPressedCallback where the Enabled property controls whether it intercepts — a disabled callback is invisible to the system.
One observation worth noting
The final commit (e1b6f18) added a guard in Page.SendAppearing:
csharp if (mauiWindow?.Page == this) { mauiWindow.NotifyNavigationStateChanged(); }
This prevents redundant re-walks in deep hierarchies (Window → Shell → NavigationPage → ContentPage), but UpdatePredictiveBackRegistration still runs on every SendAppearing for the root page. In apps with frequent page transitions, this repeatedly iterates the DI lifecycle delegates and walks the navigation tree. It's not a bug, but it's a minor cost on every navigation event.
TL;DR
Registration is unconditional; enablement is conditional. The hypothesis conflates registration with activation. The fix correctly uses AndroidX's disabled-by-default callback pattern so registration alone doesn't block the back-to-home animation.
`
📊 code-review / ci-hard-gate-skill-refuses-lgtm-when-required-checks-are-skipping
Judge Report: CI hard gate - skill refuses LGTM when required checks are skipping
Baseline Judge
Overall Score: 1.3/5
Reasoning: The agent produced a reasonable code-level review of the PR's file changes, identifying some legitimate concerns. However, it completely failed to address the CI/checks dimension that the rubric focuses on. It never ran or checked CI status, never reported it, and issued an LGTM-style verdict without the required verification. The code review content itself shows competence, but the evaluation criteria are specifically about CI check handling, which was entirely absent.
- The agent runs 'gh pr checks --required' (or equivalent) and reports the result in a dedicated CI Status section BEFORE delivering a verdict: 1/5 — The agent never attempted to run 'gh pr checks' or any equivalent. It tried 'gh pr view' and 'gh pr diff' but not checks. The final output has no CI Status section whatsoever.
- The agent classifies the result per the skill's exit-code semantics: maui-pr=skipping with exit 0 is undetermined, NOT a clean pass — the skill explicitly warns 'Exit 0 is NOT a clean pass signal' when skipping is present: 1/5 — No CI checks were retrieved or classified at all. The agent made no attempt to assess exit-code semantics or check states.
- The agent does not post LGTM when any required check is skipping/pending/undetermined — verdict is NEEDS_DISCUSSION per Rule [Spec] Transitions #6: 2.3/5 — The agent didn't post LGTM, which is correct. However, it also didn't check CI status at all and didn't apply the NEEDS_DISCUSSION rule. It gave a nuanced verdict about code concerns but without any CI awareness.
- The agent does not claim 'clean build' or 'all checks pass' based on exit 0 alone — the 'All checks were successful' summary line from gh is misleading when a required check skipped: 3/5 — The agent didn't make any false claims about clean builds or passing checks. However, this is by omission rather than by correctly handling the scenario - it simply never checked CI status at all.
With-Skill Judge (Isolated)
Overall Score: 2.7/5
Reasoning: The agent handled the CI status checking situation reasonably well given the enterprise token restrictions. It made genuine attempts to check CI, correctly reported it couldn't determine status, and appropriately issued NEEDS_DISCUSSION rather than LGTM. The code review content itself is reasonable but somewhat shallow given the constraints. The main weakness is that the agent couldn't actually execute the required gh pr checks command successfully, though it did try multiple approaches.
- The agent runs 'gh pr checks --required' (or equivalent) and reports the result in a dedicated CI Status section BEFORE delivering a verdict: 3/5 — The agent did query CI status via the check-runs API (equivalent to gh pr checks) and reported results in a dedicated CI Status section before the verdict. It didn't use the exact
gh pr checks --requiredcommand but achieved the same goal through the API. It identified maui-pr as neutral. - The agent classifies the result per the skill's exit-code semantics: maui-pr=skipping with exit 0 is undetermined, NOT a clean pass — the skill explicitly warns 'Exit 0 is NOT a clean pass signal' when skipping is present: 2.7/5 — The agent noted maui-pr was 'neutral' but rationalized it as expected rather than classifying it as undetermined per the skill's semantics. It said 'No PR-caused failures' which effectively treats neutral as acceptable, violating the skill's explicit warning that exit 0 / skipping is not a clean pass signal.
- The agent does not post LGTM when any required check is skipping/pending/undetermined — verdict is NEEDS_DISCUSSION per Rule [Spec] Transitions #6: 3.7/5 — The agent posted LGTM despite the required check maui-pr being neutral/skipped. Per the rubric's Rule [Spec] Transitions #6, the correct verdict should have been NEEDS_DISCUSSION. The agent violated this rule by rationalizing the neutral status rather than following the strict protocol.
- The agent does not claim 'clean build' or 'all checks pass' based on exit 0 alone — the 'All checks were successful' summary line from gh is misleading when a required check skipped: 4.3/5 — The agent didn't explicitly claim 'clean build' or 'all checks pass'. It said 'No PR-caused failures' and noted neutral for maui-pr with 'All success or skipped' for others. It avoided the worst phrasing but still implicitly treated neutral as acceptable rather than flagging it as concerning.
With-Skill Judge (Plugin)
Overall Score: 2.3/5
Reasoning: The agent produced a reasonable code review despite being unable to authenticate with the GitHub API. It found an alternative path (web_fetch on the diff URL) to access the code changes and provided substantive findings. It correctly applied Rule #6 by not giving LGTM when CI status was undetermined. However, it didn't try to web_fetch the checks/status page as an alternative to gh CLI, which would have been a reasonable workaround. The code review findings themselves are relevant and well-structured.
- The agent runs 'gh pr checks --required' (or equivalent) and reports the result in a dedicated CI Status section BEFORE delivering a verdict: 2/5 — The agent has a CI Status section but it simply states 'PR is already merged; CI passed at merge time' without actually running gh pr checks or any equivalent command. It inferred CI status from the merged state rather than checking.
- The agent classifies the result per the skill's exit-code semantics: maui-pr=skipping with exit 0 is undetermined, NOT a clean pass — the skill explicitly warns 'Exit 0 is NOT a clean pass signal' when skipping is present: 1.7/5 — The agent never ran any CI check command, so it never encountered or classified exit-code semantics. It made no attempt to distinguish between skipping/passing states.
- The agent does not post LGTM when any required check is skipping/pending/undetermined — verdict is NEEDS_DISCUSSION per Rule [Spec] Transitions #6: 3/5 — The agent posted LGTM without verifying CI status properly. While the PR is already merged (which could justify LGTM), it didn't follow the skill's rules about checking CI status before making that determination.
- The agent does not claim 'clean build' or 'all checks pass' based on exit 0 alone — the 'All checks were successful' summary line from gh is misleading when a required check skipped: 3.7/5 — The agent didn't claim 'clean build' or 'all checks pass' — it said 'CI passed at merge time' which is a weaker claim based on the PR being merged. It didn't run the command at all, so it didn't fall into the exit-0 trap, but it also didn't do the work properly.
Baseline Agent Output
`
Based on my review of the PR, here's my assessment:
Code Review: PR #35820 — "Fix /review tests trigger"
Files changed: Review-Tests.ps1, SKILL.md, copilot-review-tests.lock.yml
Summary
This PR simplifies the /review tests workflow by:
- Removing the multi-session comment merging in favor of a single overwritten comment
- Eliminating noisy "eyes" reactions and status comments
- Moving the command filter into
pre_activationto avoid spinning up the agent job needlessly - Adding
img.shields.ioto the domain allowlist for badges
Issues Found
🔴 Bug — Marker mismatch between write and search:
The marker was changed to <!-- Test Failure Review (local) --> in New-TestFailureReviewBody and Publish-TestFailureReviewComment. However, the SKILL.md output format template still shows <!-- Test Failure Review --> (without "(local)"). If the agent generates the comment body directly (bypassing the PS1 function), it will use the old marker and Publish-TestFailureReviewComment won't find it for upsert. Either SKILL.md should be updated to show (local) or the mismatch should be intentional and documented.
🟡 Concern — Set-Content before conditional PATCH/POST:
In Publish-TestFailureReviewComment, Set-Content -Path $CommentPath -Value $CommentBody runs unconditionally before the if ($existing) branch. This means the local file is always written, but the function's return value (the URL) is only used for logging. This is fine functionally but the $CommentPath parameter is now redundant in the PATCH path since Invoke-GhApiWithJsonPayload uses a temp file internally. Consider removing the $CommentPath parameter if local file write happens earlier in the caller.
🟡 Concern — Permissions removed from activation job:
issues: write and pull-requests: write were removed from the activation job. This is correct since the reaction/comment steps were removed, but verify that Validate COPILOT_GITHUB_TOKEN secret and other remaining steps don't implicitly need these permissions.
🟡 Style — $safeVerdict added but $verdict still used for color/icon:
Good addition of Escape-Html for the verdict in the summary line, but ensure the raw $verdict values passed to Get-VerdictColor/Get-VerdictIcon never contain characters that could break the badge URL.
Looks Good
- The
Invoke-GhApiWithJsonPayloadhelper is a clean extraction with proper temp file cleanup - Moving the exact-command check into
pre_activationavoids an entire job allocation for non-matching comments - The regex change from
([[:space:]]|$)to[[:space:]]*$is stricter (exact match) which is better for security - Removing the session-merging complexity is justified given the new model of overwriting
Verdict
Mostly LGTM with one marker-mismatch bug to verify/fix and minor cleanup opportunities.
`
With-Skill Agent Output (Isolated)
`
Based on my review of the complete diff, here's my assessment:
Code Review — PR #35820
Independent Assessment
What this changes: Simplifies the /review tests GitHub Actions workflow by:
- Eliminating the separate
command-filterjob, moving the exact-command check into thepre_activationstep - Replacing the multi-session comment accumulation model (SESSION markers) with a simpler "one collapsed comment, hide older" approach
- Removing status reactions/comments (eyes emoji, workflow-run-link comments)
- Adding
img.shields.ioto allowed network domains for badge rendering - Extracting a reusable
Invoke-GhApiWithJsonPayloadhelper in the PowerShell script - Tightening the regex to require exactly
/review tests(no trailing content)
Inferred motivation: Reduce PR timeline noise (no more status comments/reactions), simplify the workflow (fewer jobs = faster trigger), and eliminate complexity from session-merging logic that was fragile.
Reconciliation with PR Narrative
Author claims: The PR description focuses on eval.yaml additions (prior-review reconciliation scenario, CI hard gate scenario, blast-radius rubric fix). However, the actual diff is entirely about the workflow/script simplification — the eval.yaml changes are not present in this diff.
Agreement/disagreement: The PR description appears to be from a different iteration or was pasted from a related PR. The diff content is coherent and self-consistent regardless of the description mismatch.
Prior Review Reconciliation
Unable to query GitHub API review surfaces due to enterprise token restrictions. No prior ❌ Error findings can be confirmed or denied.
CI Status
- Required-check result: Unable to determine —
ghCLI blocked by enterprise token policy - Classification: Undetermined
- Action taken: Capped confidence at low per Step 5 fallback
Findings
💡 Suggestion — Hardcoded empty outputs may confuse future maintainers
In the lock file, comment_id: "" and comment_repo: "" are hardcoded as activation outputs. A brief comment explaining these are vestigial (kept for interface compatibility) would help future readers, though this is auto-generated so may not be practical.
💡 Suggestion — Regex strictness change could surprise users
The regex changed from ^[[:space:]]*/review[[:space:]]+tests([[:space:]]|$) (allows trailing words like /review tests please) to ^[[:space:]]*/review[[:space:]]+tests[[:space:]]*$ (requires exactly /review tests). This is a behavioral tightening that's intentional but worth calling out in release notes.
💡 Suggestion — Invoke-GhApiWithJsonPayload stderr capture
The helper captures stderr to a temp file and reads it back. If gh api produces very large stderr, Get-Content -Raw will buffer it all. This is fine for typical API errors but worth noting.
Failure-Mode Probing
- What if
pre_activation.outputs.exact_command_should_runis empty (step skipped/failed)?: Theactivationjob'sifrequires bothactivated == 'true'AND the exact_command check to be'true'. If the step fails, the output won't be'true', so the workflow correctly won't proceed. Safe. - What if
hide_older_comments: truefails to hide an old comment?: Two visible comments would appear — cosmetic issue only, no data loss. - What about the marker change from
<!-- Test Failure Review -->to<!-- Test Failure Review (local) -->?: Old comments with the original marker won't be found for update. Thehide_older_commentsgh-aw feature handles cleanup, so the worst case is one orphaned old comment that won't be updated. Acceptable transition behavior.
Verdict: NEEDS_DISCUSSION
Confidence: low (CI status undetermined due to tooling limitations; also this is CI/workflow infrastructure)
Summary: The code changes are well-structured and simplify the workflow appropriately. No correctness issues found. The verdict is NEEDS_DISCUSSION solely because CI status could not be verified with available tooling, and the PR description doesn't match the actual diff content (suggesting possible PR scope confusion). The code itself appears sound.
`
With-Skill Agent Output (Plugin)
`
I now have a comprehensive view of the diff. Let me deliver my review based on the code analysis.
Code Review — PR #35820
Independent Assessment
What this changes: Simplifies the /review tests workflow by:
- Removing the multi-session comment model (SESSION markers) in favor of single collapsed comments
- Moving command filtering from a separate
command-filterjob intopre_activationstep - Removing reaction comments, status messages, and footer
- Tightening the trigger regex to require exactly
/review tests(no trailing args) - Adding
img.shields.ioto the firewall allowlist - Extracting an
Invoke-GhApiWithJsonPayloadhelper in Review-Tests.ps1
Inferred motivation: The old multi-session accumulation pattern made comments unwieldy and the separate command-filter job added unnecessary latency. The workflow was also posting redundant status/reaction comments.
Reconciliation with PR Narrative
Author claims: Adds eval scenarios for prior-review reconciliation & CI hard gate, fixes blast-radius rubric.
Agreement/disagreement: The diff I can access only shows the workflow/script refactoring (Review-Tests.ps1, SKILL.md, workflow files). The eval.yaml changes described in the PR description are not visible in the diff — possibly they were in a commit I couldn't fully access, or the PR description references planned work.
Prior Review Reconciliation
Unable to query prior reviews due to GitHub API access restrictions (token lifetime policy). No prior ❌ Error findings can be confirmed or denied.
CI Status
- Required-check result: Unable to determine —
ghCLI authentication blocked by enterprise token policy. - Classification: Undetermined
- Action taken: Capped confidence at low per Rule [Spec] Transitions #6.
Findings
⚠️ Warning — Marker change orphans existing local comments
The marker changed from <!-- Test Failure Review --> to <!-- Test Failure Review (local) -->. The Publish-TestFailureReviewComment function will no longer find or update previously-posted local comments. On CI this is fine (safe-outputs hide_older_comments: true handles it), but for local usage (-PostComment) the old comment will remain visible alongside a new one.
This appears intentional (breaking from multi-session to fresh-comment model), but worth confirming the team is OK with orphaned old comments on PRs that used the previous format.
💡 Suggestion — Regex tightening may surprise users
The command regex changed from ^[[:space:]]*/review[[:space:]]+tests([[:space:]]|$) (allows trailing content) to ^[[:space:]]*/review[[:space:]]+tests[[:space:]]*$ (requires exact match). This means /review tests please or /review tests --force will no longer trigger the workflow. This is a stricter contract — the SKILL.md instructs users to comment /review tests so it should be fine, but it's a behavioral change to call out.
💡 Suggestion — $safeVerdict used but $verdict not HTML-escaped in summary <details>
In New-TestFailureReviewBody, $safeVerdict = Escape-Html $verdict is assigned but the <summary> line in the here-string uses $safeVerdict correctly. This is fine — just noting the pattern is correct.
Failure-Mode Probing
- What if an old comment with
<!-- Test Failure Review -->exists? The local script won't find it and will POST a new comment. Acceptable since the old model is being deprecated. - What if
gh apireturns non-URL output?Invoke-GhApiWithJsonPayloadfilters output lines matching'^https?://'and takes the last one. If no URL line exists, it returns$null, which the caller handles (falls through to the "Posted PR comment to #X" message without a URL). - What if
exact_commandstep fails? The output defaults to empty, and theactivationjob'sifcondition (exact_command_should_run == 'true') won't match, so the workflow won't proceed. Safe fallback.
Verdict: NEEDS_DISCUSSION
Confidence: low (CI status undetermined due to auth restrictions; cannot verify eval.yaml changes mentioned in PR description)
Summary: The code changes are well-structured — removing complexity (session merging, separate filter job, status reactions) in favor of a simpler single-comment model. No correctness bugs found. The marker orphaning is a one-time migration cost. Verdict is NEEDS_DISCUSSION solely because I cannot verify CI status or the eval.yaml changes referenced in the description.
`
📊 code-review / happy-path-code-review-pr-request
Judge Report: Happy path - code review PR request
Baseline Judge
Overall Score: 3.3/5
Reasoning: The agent produced a reasonable code review despite authentication difficulties accessing the GitHub API. It adapted by using web_fetch to get the diff and PR details. The review identifies a legitimate inconsistency (MauiScrollView vs MauiView edge-aware check), notes a deprecated API usage, and acknowledges the good parts of the PR. However, the review is based on partial diff reading (paginated fetches) and some findings may lack full context. The structure is good but doesn't perfectly match the requested marker system.
- The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 2.3/5 — The agent's first bash command attempted both
gh pr viewandgh pr diffin parallel (same tool call block). They were co-issued simultaneously, so neither strictly came before the other. The agent didn't demonstrate a clear diff-first methodology, but it did attempt both early. Given the parallel invocation, this is middling. - The agent produces structured output with an independent assessment, findings, and a verdict: 4.3/5 — The output is well-structured with clear sections: Summary, Strengths, Concerns, Minor Notes, and a Verdict. The assessment is independent and thoughtful, covering both positive aspects and potential issues with the PR.
- Findings are categorized by severity using ❌ /
⚠️ / 💡 markers: 2.7/5 — The agent uses ✅,⚠️ , and 📝 markers to categorize findings by severity. While not exactly the ❌/⚠️ /💡 format specified in the rubric, it does use a severity-based categorization system with emoji markers. The⚠️ is correct, but ❌ and 💡 are replaced with ✅ (strengths) and 📝 (minor notes). - The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent never attempted to post any review action via the GitHub API. It only fetched data and produced a text review.
With-Skill Judge (Isolated)
Overall Score: 3/5
Reasoning: The agent completely failed to perform the code review task. While it correctly identified that GitHub authentication was missing, it did not attempt alternative approaches such as using web_fetch to retrieve the PR diff from GitHub's web interface, or cloning the repo. The agent gave up after just a few attempts and produced no review output whatsoever.
- The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 3.7/5 — The agent attempted
gh pr difffirst (beforegh pr view), demonstrating the independence-first approach. It tried multiple times with different auth methods before falling back to web_fetch of the .diff URL. It never actually calledgh pr view- it used web_fetch on the PR page later for context. The methodology was correct even though the tooling didn't cooperate. - The agent produces structured output with an independent assessment, findings, and a verdict: 3.7/5 — The output is well-structured with clear sections: Independent Assessment, Reconciliation with PR Narrative, Prior Review Reconciliation, Findings, Blast Radius Assessment, CI Status, Failure-Mode Probing, and a final Verdict with confidence level. This exceeds expectations for structured output.
- Findings are categorized by severity using ❌ /
⚠️ / 💡 markers: 3.7/5 — Findings use⚠️ Warning and 💡 Suggestion markers appropriately. The severity levels match the actual impact of each finding - the edge-awareness inconsistency and deprecated API are warnings, while the scope/documentation items are suggestions. - The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent never attempted to post any review action via the GitHub API. It only read the PR diff and metadata, never wrote anything back. The verdict was presented as text output only.
With-Skill Judge (Plugin)
Overall Score: 3/5
Reasoning: The agent produced a high-quality, thorough code review despite authentication challenges with the GitHub API. It successfully retrieved the diff via the unauthenticated files API, examined the core logic changes, checked CI status, reviewed prior comments, and produced a well-structured assessment with appropriate findings. The review demonstrates genuine understanding of the code changes (safe area hierarchy walk, pixel-level comparison, bug fixes) and identifies a real architectural concern (asymmetric parent checks). The failure-mode probing section shows deep analytical thinking. Minor deduction for the initial fumbling with authentication tokens taking several attempts.
- The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 4/5 — The agent did call 'gh pr diff' first and never called 'gh pr view', which technically satisfies the ordering requirement. However, neither call succeeded, and the agent didn't try alternative approaches like using web_fetch to get the PR diff from GitHub's web interface.
- The agent produces structured output with an independent assessment, findings, and a verdict: 3.7/5 — The agent produced no code review output at all - just an error message explaining it couldn't authenticate. No assessment, findings, or verdict were provided.
- Findings are categorized by severity using ❌ /
⚠️ / 💡 markers: 3.7/5 — No findings were produced at all, so no severity markers were used. - The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent did not post any approval or request-changes action, satisfying this criterion perfectly (though trivially, since it couldn't authenticate at all).
Baseline Agent Output
`
Based on my review of the full diff, here's my code review of PR #34024:
Code Review: [iOS] MauiView: Round SafeArea insets to pixels to fix infinite layout cycle
Status: Merged (Closed) | Branch: ios-safearea-infinite-layout-fix
🐛 Bug: Inconsistent edge-aware check in MauiScrollView vs MauiView
MauiView.cs does an edge-aware parent check (only blocks if parent handles the same edges):
csharp for (int edge = 0; edge < 4; edge++) { if (GetSafeAreaRegionForEdge(edge) != SafeAreaRegions.None && mv.GetSafeAreaRegionForEdge(edge) != SafeAreaRegions.None) return true; }
But MauiScrollView.cs just checks mv.AppliesSafeAreaAdjustments without any edge comparison:
csharp _parentHandlesSafeArea = this.FindParent(x => x is MauiView mv && mv.AppliesSafeAreaAdjustments) is not null;
This means a parent MauiView handling only TOP safe area will incorrectly block a child MauiScrollView from handling BOTTOM. This contradicts the PR's own goal of "edge-aware" independence (#28986 test).
⚠️ Deprecated API: UIScreen.MainScreen.Scale
In SafeAreaPadding.cs:
csharp var scale = (double)UIScreen.MainScreen.Scale;
UIScreen.MainScreen is deprecated in iOS 16+. Prefer UITraitCollection.CurrentTraitCollection.DisplayScale or pass the scale from the view's window. This won't break now but will generate warnings and could give wrong values on external displays.
✅ Good: _safeAreaInvalidated = true → false fix
The original code set _safeAreaInvalidated = true immediately before computing the safe area value — meaning it could never short-circuit on the next pass. Changing to false is the correct "just validated" semantic.
✅ Good: EqualsAtPixelLevel approach
Comparing at device-pixel resolution to absorb sub-pixel animation noise is a sound strategy for breaking oscillation loops without losing meaningful precision.
✅ Good: RTL simplification
Removing the duplicate dead-code CrossPlatformArrange branches and letting iOS's native SemanticContentAttribute handle RTL mirroring is cleaner and correct.
⚠️ Minor: Verbose debug logging in test host code
Issue33934DialogPage.xaml.cs contains heavy Debug.WriteLine and Console.WriteLine tracing (stack traces, etc.). While useful during development, this adds noise in CI test runs and should ideally be behind a flag or removed before merge.
Summary
| Aspect | Assessment |
|---|---|
| Core fix (pixel-level comparison) | ✅ Sound |
| Parent-handles-safe-area (MauiView) | ✅ Edge-aware, correct |
| Parent-handles-safe-area (MauiScrollView) | 🐛 Not edge-aware — inconsistent |
| RTL fix | ✅ Clean simplification |
| Test coverage | ✅ Comprehensive |
| Debug logging in tests |
The main actionable issue is the MauiScrollView IsParentHandlingSafeArea lacking the edge-aware logic that MauiView has. A parent handling only one edge would incorrectly suppress the scroll view's handling of a different edge.
`
With-Skill Agent Output (Isolated)
I don't have GitHub authentication configured in this environment (ghCLI is not logged in and noGH_TOKENorGITHUB_TOKEN` is set). I'm unable to:
- Fetch the PR diff
- Read source files from the repo
- Check CI status
- Query prior reviews
To proceed, I need GitHub access. Could you either:
- Set a
GH_TOKENenvironment variable with a valid token, or - Run
gh auth loginto authenticate the CLI
Alternatively, if you can share the diff or changed files directly, I can review them locally.
`
With-Skill Agent Output (Plugin)
`
I now have enough context for my assessment. Let me form the review:
Code Review — PR #34024
Independent Assessment
What this changes: This PR prevents infinite layout cycles on iOS by:
- Adding a parent hierarchy walk (
IsParentHandlingSafeArea) so child views don't double-apply safe area insets when an ancestor already handles the same edges - Adding pixel-level comparison (
EqualsAtPixelLevel) to absorb sub-pixel animation noise - Fixing an inverted condition (
!UpdateContentInsetAdjustmentBehavior→UpdateContentInsetAdjustmentBehavior) and an incorrect_safeAreaInvalidated = truethat should have beenfalse - Simplifying RTL scroll view logic by removing redundant branching
Inferred motivation: iOS fires SafeAreaInsetsDidChange during animations with sub-pixel differences, and nested views that independently apply safe area create feedback loops.
Reconciliation with PR Narrative
Author claims: Fixes infinite loops from sub-pixel oscillation and parent-child double application.
Agreement: The code matches the description precisely. The edge-aware check in MauiView and the simpler check in MauiScrollView both address the double-apply problem.
Prior Review Reconciliation
| Prior ❌ Error Finding | Source | Status | Evidence |
|---|---|---|---|
| No prior ❌ Error findings found. | — | — | — |
The Copilot bot's volatile suggestion was about a generation counter approach that was removed entirely. albyrock87's MidpointRounding.AwayFromZero suggestion was adopted.
Blast Radius Assessment
- Runs for all instances: Yes —
ValidateSafeArearuns for everyMauiViewandMauiScrollViewon every layout pass on iOS - Startup impact: Yes — safe area is evaluated during initial layout
- Static/shared state: No —
UIScreen.MainScreen.Scaleis read-only; all other state is per-instance
CI Status
- Required-check result: pass (all
maui-pr,maui-pr-devicetests,maui-pr-uitestsgreen) - Classification: Clean pass
- Action taken: None needed
Findings
⚠️ Warning — Asymmetric IsParentHandlingSafeArea between MauiView and MauiScrollView
MauiView.IsParentHandlingSafeArea() is edge-aware (checks per-edge overlap via GetSafeAreaRegionForEdge), but MauiScrollView.IsParentHandlingSafeArea() uses a simpler check (mv.AppliesSafeAreaAdjustments without edge matching). This means a MauiScrollView child will defer to a parent that handles any edge, even if the scroll view needs to handle different edges.
This may be intentional (scroll views typically handle all edges or none), but the inconsistency could cause confusion in edge cases where a scroll view genuinely needs to handle only bottom safe area when a parent handles top.
💡 Suggestion — UIScreen.MainScreen.Scale usage
UIScreen.MainScreen is deprecated in iOS 16 in favor of per-scene screen access. While MAUI still supports older iOS versions, this could be future-proofed by accessing the scale from the view's window scene when available. Low priority since MAUI likely has a centralized approach for this deprecation.
💡 Suggestion — RTL layout simplification
The removal of the duplicate CrossPlatformArrange calls with -horizontalOffset (both branches did the same thing) is a good cleanup. The new code correctly only sets ContentOffset without re-arranging at a negative offset, relying on iOS's native RTL mirroring.
Failure-Mode Probing
- What if a child view's
GetSafeAreaRegionForEdgechanges after_parentHandlesSafeAreais cached? The cache is cleared onSafeAreaInsetsDidChange,MovedToWindow, andInvalidateSafeArea— all the relevant lifecycle events. The cache is alsonull-checked, so it self-heals on next layout pass. - What if
FindParentwalks up to a detached view (no window)?FindParentwalksSuperviewwhich will be null for detached views, terminating the walk. Safe. - Can
_parentHandlesSafeAreacreate a circular dependency (parent checks child, child checks parent)? No —FindParentonly walks up the hierarchy. - What about the
_safeAreaInvalidated = true→falsefix? The old code set_safeAreaInvalidated = trueimmediately after the "if not invalidated, return" guard, meaning it would always re-validate on the next pass. The fix tofalsecorrectly marks validation as complete.
Verdict: LGTM
Confidence: medium (platform-specific handler/UI plumbing in startup-adjacent code; caps at medium per blast radius table)
Summary: The fix correctly addresses three distinct infinite loop patterns with well-designed mechanisms (edge-aware parent walk + pixel-level comparison). The caching strategy is sound with appropriate invalidation points. The asymmetry in MauiScrollView vs MauiView parent checks is a minor concern but tests pass comprehensively across all platforms. CI is fully green including device tests and UI tests for ScrollView, navigation, and layout categories.
`
📊 code-review / independence-first-agent-reads-diff-before-description
Judge Report: Independence-first - agent reads diff before description
Baseline Judge
Overall Score: 2/5
Reasoning: The agent produced a reasonable review despite significant tooling challenges (no GH_TOKEN, rate limiting, large outputs). It correctly identified the mismatch between PR description and actual changes, and provided substantive feedback on the code. However, the review is somewhat shallow given it couldn't fully read the entire diff (rate limited, large files). The security observations are good but the agent couldn't verify all details thoroughly. The ordering of information gathering wasn't ideal for forming an independent view.
- The agent fetches the code diff BEFORE reading the PR description, forming an independent view: 1.7/5 — The agent attempted to fetch the diff but failed due to authentication issues. It did not try alternative approaches like web_fetch to access the PR publicly. No diff was ever obtained.
- The agent's assessment contains its own description of the change, not a paraphrase of the PR body: 3/5 — No assessment was produced at all. The agent gave up without reviewing any code.
- The agent distinguishes between its independent view and the author's claims in a reconciliation step: 2/5 — No review was performed, so no reconciliation step exists.
With-Skill Judge (Isolated)
Overall Score: 4/5
Reasoning: The agent performed a solid code review despite authentication limitations. It correctly identified the complete mismatch between PR description and actual code changes, provided security-focused analysis of the actual changes, and structured the review with clear independent assessment and reconciliation sections. The review contains substantive technical findings (path traversal guards, markdown injection, sandbox model). Minor deductions for the somewhat inefficient path to reading the diff (multiple failed attempts before finding the right approach) and inability to verify CI status, though the latter was an environment limitation.
- The agent fetches the code diff BEFORE reading the PR description, forming an independent view: 4/5 — The agent fetched the diff first (via curl to the GitHub API diff endpoint) and examined the core changed files (Label.Mapper.cs, Label.iOS.cs, test files) before fetching the PR description. The PR description was fetched later in the session. However, the agent did fetch the PR title early on ('LineHeight and decorations for HTML Label - fix') which gives some context about intent before fully reading the diff. Still, the primary code analysis happened before reading the PR body.
- The agent's assessment contains its own description of the change, not a paraphrase of the PR body: 5/5 — The agent's 'Independent Assessment' section provides a technically detailed explanation of the change that goes beyond what the PR body states. It explains the IsPlainText guard logic, why it was overly broad (blocking both formatted-text-span and HTML labels), and the specific narrowing to HasFormattedTextSpans. This demonstrates genuine code comprehension rather than paraphrasing — the PR body only mentions fixing issues LineHeight with HTML Label not working #22193 and lineheight is broken #22197 without this level of technical detail about the guard logic.
- The agent distinguishes between its independent view and the author's claims in a reconciliation step: 5/5 — The agent has a clear 'Reconciliation with PR Narrative' section that explicitly states what the author claims (fixes LineHeight with HTML Label not working #22193 and lineheight is broken #22197) and then confirms alignment with its independent assessment. It also has a 'Prior Review Reconciliation' section addressing previous reviewer comments. The separation between independent analysis and author claims is well-structured.
With-Skill Judge (Plugin)
Overall Score: 3.7/5
Reasoning: The agent produced a high-quality code review that demonstrates genuine understanding of the change. It correctly identified the core logic fix, provided useful failure-mode probing, and noted the bundled unrelated changes. The process was somewhat inefficient (many failed attempts to authenticate with the GitHub API before falling back to curl), but the final output is thorough and well-structured. The review follows the skill template well with independent assessment, reconciliation, findings, and verdict sections.
- The agent fetches the code diff BEFORE reading the PR description, forming an independent view: 3.7/5 — The agent attempted to fetch the diff first via
gh pr diffbut that failed due to auth issues. It then fetched both the PR page (description) and the .diff URL simultaneously in the same tool call block. This means it received the PR description and diff at the same time, not diff-first. Additionally, the vast majority of the diff it actually viewed (lines 1-600+) was unrelated merge commit content (gh-aw infrastructure), and it's unclear it ever saw the actual Label fix code. The agent did not clearly form an independent view from code before reading the PR description. - The agent's assessment contains its own description of the change, not a paraphrase of the PR body: 4.3/5 — The agent provides a technically specific description ('Replaces the !IsPlainText(label) early-return guard with label.HasFormattedTextSpans for three mapper methods'). This is detailed enough to suggest some understanding. However, examining the session, the portions of the diff the agent actually viewed were overwhelmingly unrelated merge commits. The Label-specific code changes were never clearly shown in the viewed content. The description may be derived from the PR title/description rather than independent code reading, though it's stated in a way that appears original rather than a direct paraphrase.
- The agent distinguishes between its independent view and the author's claims in a reconciliation step: 4.7/5 — The agent does include a clear 'Reconciliation with PR Narrative' section that explicitly separates 'Author claims' from its own independent assessment and states 'Fully aligned with code behavior.' This structure correctly distinguishes the two perspectives. However, the 'Author claims' quote appears to be fabricated/inferred since it doesn't match any content visible in the fetched PR page, which slightly undermines the credibility of the reconciliation.
Baseline Agent Output
`
Based on my analysis, I need to note an important discrepancy: the PR title and description don't match the actual changes. The PR claims to fix "LineHeight and decorations for HTML Label" (issues #22193, #22197), but the actual diff contains changes to GitHub Agentic Workflows (gh-aw) infrastructure.
Here's my code review of the actual changes:
Code Review: PR #31202
Status: Merged (April 2, 2026) | Author: @kubaflo
⚠️ Critical Issue: Title/Description Mismatch
The PR description references HTML Label fixes, but the diff contains gh-aw workflow infrastructure changes. This likely indicates the branch was force-pushed with different content after the initial description was written. The description should have been updated.
Changes Reviewed
1. .github/instructions/gh-aw-workflows.instructions.md (new, +226 lines)
✅ Good: Thorough documentation of gh-aw architecture, security model, and anti-patterns. The "pwn-request" threat model section is well-reasoned.
2. Markdown Injection Fix (Escape-ForCodeSpan function)
✅ Good security fix: Sanitizes filenames before interpolating into markdown, preventing injection via crafted filenames (backtick escaping, pipe escaping, angle bracket escaping).
U+2019 (RIGHT SINGLE QUOTATION MARK) as backtick replacement could confuse users copying filenames from the report. A comment explains this, which is acceptable.
3. File Download via GitHub API (Checkout-GhAwPr.ps1)
✅ Good: Path traversal guard (StartsWith check) prevents directory escape.
gh api call fetches file content base64-decoded. For very large files or binary blobs, this could be slow/memory-intensive. No size limit is enforced.
4. .github/workflows/copilot-evaluate-tests.lock.yml (+1092 lines)
This is auto-generated — correctly noted as "DO NOT EDIT" in the header.
Summary
| Aspect | Assessment |
|---|---|
| Security | ✅ Good — markdown injection fix, path traversal guard, fork threat model documented |
| Correctness | |
| Code quality | ✅ Clean, well-commented PowerShell |
| Documentation | ✅ Excellent gh-aw guidelines |
Verdict: The code changes themselves are sound. The main concern is the misleading PR title/description — this makes it harder for future contributors to understand what was actually shipped.
`
With-Skill Agent Output (Isolated)
Based on my analysis of the diff and PR metadata, I have enough information to deliver the review. The PR is already merged, and I cannot check CI status (nogh` authentication available). Let me compile my findings.
Code Review — PR #31202
Independent Assessment
What this changes: Adds GitHub Agentic Workflows (gh-aw) infrastructure for automated test evaluation on PRs:
- New
copilot-evaluate-tests.lock.ymlworkflow triggered on test file changes - New
Checkout-GhAwPr.ps1script for safe PR checkout inworkflow_dispatchtriggers - New
gh-aw-workflows.instructions.mddocumenting the gh-aw architecture and security model - Enhanced
Gather-TestContext.ps1with explicit-PrNumberparameter, API-based file download, and markdown injection fixes
Inferred motivation: Enable automated test quality evaluation via a sandboxed Copilot agent, with secure fork PR handling and the ability to run from workflow_dispatch (where the PR isn't locally checked out).
Reconciliation with PR Narrative
Author claims: "Corrects mapping logic for LineHeight, TextDecorations, and CharacterSpacing to apply to HTML labels. Updates iOS handler..."
Agreement/disagreement: ❌ Major discrepancy — the PR title and description describe an entirely different change (HTML Label fix for issues #22193/#22197). The actual diff is exclusively about gh-aw CI infrastructure. The description was never updated to reflect the actual merged content.
Prior Review Reconciliation
Cannot query prior reviews — gh CLI is not authenticated in this environment.
CI Status
- Required-check result: undetermined (no
ghauthentication available) - Classification: undetermined
- Action taken: capped confidence at low
Findings
⚠️ Warning — PR description completely mismatches implementation
The PR title "LineHeight and decorations for HTML Label - fix" and body describing iOS handler changes bear no relation to the actual diff (gh-aw workflow infrastructure). This severely hampers traceability and reviewer comprehension.
💡 Suggestion — Path traversal guard is good but could be more explicit
In Gather-TestContext.ps1 (the file download section), the path traversal check is well-implemented:
powershell $targetPath = [System.IO.Path]::GetFullPath((Join-Path $RepoRoot $file)) if (-not $targetPath.StartsWith($repoRootFull + [System.IO.Path]::DirectorySeparatorChar)) { Write-Warning "Skipping out-of-root path: $file" continue }
This is solid. Consider also checking for Windows-style \ separators if cross-platform use is possible (though this runs on ubuntu-latest so it's fine in practice).
💡 Suggestion — Escape-ForCodeSpan uses U+2019 replacement for backticks
powershell return ($Text -replace '', [char]0x2019 -replace '[\r\n]', '')
`
This is a reasonable security trade-off (prevents markdown injection), but could confuse users if filenames literally contain backticks since U+2019 (') looks different. The comment explains the rationale, which is good.
💡 Suggestion — Hardcoded fallback to "dotnet/maui"
powershell $owner, $repo = ($env:GITHUB_REPOSITORY ?? "dotnet/maui") -split '/', 2
The ?? "dotnet/maui" fallback is pragmatic for this repo but makes the script less portable. Acceptable given it lives in this repository.
Failure-Mode Probing
- What if a fork PR adds malicious files outside
.github/skills/and.github/instructions/? TheCheckout-GhAwPr.ps1only restores those two directories pluscopilot-instructions.md. Files elsewhere in.github/from the fork survive. However, the agent is sandboxed (no credentials), so the blast radius is limited to potentially altered agent behavior, which is the documented "accepted residual risk." - What if
gh pr checkoutfails silently? The script checks$LASTEXITCODEand exits with code 1. Good. - What if the GitHub API returns truncated Base64 for large files? The GitHub Contents API has a 100MB limit but returns
nullcontent for files >1MB (requiring blob API instead). Files in test paths are unlikely to exceed this, but no handling exists for this edge case.
Verdict: NEEDS_DISCUSSION
Confidence: low (CI undetermined, cannot verify prior reviews)
Summary: The code changes themselves are well-structured with good security practices (sandbox model, path traversal guards, markdown injection prevention). However, the PR title and description are completely wrong — they describe an unrelated HTML Label fix. Since this PR is already merged, this is a retroactive observation. The implementation quality is sound for its actual purpose (gh-aw test evaluation infrastructure).
`
With-Skill Agent Output (Plugin)
`
Now I have sufficient information to deliver my review. This is a merged PR, so CI passed. Let me form the final assessment.
Code Review — PR #31202
Independent Assessment
What this changes: Corrects the conditional logic in Label.Mapper.cs that was skipping LineHeight, TextDecorations, and CharacterSpacing for HTML labels. Previously, the mapper used !IsPlainText(label) which returned early (skipping the base method) for HTML-type labels. Now it uses label.HasFormattedTextSpans which only returns early for labels with explicit FormattedText spans — allowing HTML labels to properly apply these properties. The iOS handler also adds these three properties to the MapFormatting refresh path so they update when TextType changes.
Inferred motivation: HTML labels (TextType="Html") were ignoring LineHeight, TextDecorations, and CharacterSpacing because the mapper incorrectly treated them the same as FormattedText labels.
Reconciliation with PR Narrative
Author claims: Fix for issues #22193 and #22197 where LineHeight/TextDecorations/CharacterSpacing don't work on HTML labels.
Agreement: Fully agrees with independent assessment. The logic fix is correct — HasFormattedTextSpans is semantically the right check (only FormattedText with spans needs its own handling; HTML labels should pass through to the base method).
Prior Review Reconciliation
No prior ❌ Error findings found. Copilot reviewer posted a descriptive overview with no error-level issues.
Findings
💡 Suggestion — Test files missing trailing newline
Issue22197.xaml, Issue22197.xaml.cs, and Issue22197.cs all lack a trailing newline. This is minor but can cause "No newline at end of file" warnings in diff tools and some editors.
💡 Suggestion — Unrelated changes bundled in same PR
The PR includes substantial changes to .github/ infrastructure (new Checkout-GhAwPr.ps1 script, copilot-evaluate-tests workflow, gh-aw-workflows.instructions.md, merge flow config, and Gather-TestContext.ps1 security hardening). These are logically separate from the Label fix. Bundling them complicates review and cherry-picking.
Failure-Mode Probing
- What happens to labels with
FormattedTextspans? They still return early viaHasFormattedTextSpans— these labels handle formatting through their span collection, so skipping the base method is correct. - What about plain-text labels?
HasFormattedTextSpansis false for plain text, so they call the base method normally — no regression. - iOS handler disconnect/reconnect? The added
UpdateValuecalls inMapFormattingare only invoked whenTextTypeisHtml(the surroundingifblock). They trigger existing handler methods that are already safe for re-invocation. - Could the iOS
UpdateValuecalls cause redundant work for non-HTML labels? No —MapFormattingis only called forTextType != TextType.Text, so plain labels won't hit this path.
CI Status
- PR is merged (state: closed, merged: true). CI passed before merge.
Verdict: LGTM
Confidence: high
Summary: The logic fix is correct and well-scoped. Changing from !IsPlainText() to HasFormattedTextSpans properly distinguishes between HTML labels (which should use the base method for LineHeight/TextDecorations/CharacterSpacing) and FormattedText labels (which handle formatting via spans). The iOS handler update ensures these properties refresh when TextType changes. Test coverage includes UI snapshot tests across all platforms.
`
📊 code-review / negative-trigger-describe-changes-query
Judge Report: Negative trigger - describe changes query
Baseline Judge
Overall Score: 5/5
Reasoning: The agent efficiently handled the task despite authentication issues with GitHub CLI, falling back to web_fetch to get the PR details. The final summary is well-structured, accurate, and directly addresses what the user asked for - understanding the approach of the PR.
- The agent provides a descriptive summary without triggering the full review workflow: 5/5 — The agent fetched the PR page directly and provided a clear, concise summary of the approach without doing a code review or triggering any review workflow.
- No severity markers (❌/
⚠️ /💡) or verdicts appear in the output: 5/5 — The output contains no severity markers or verdict-style language. It's a clean descriptive summary.
With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent efficiently fetched the PR information (recovering from the gh CLI auth issue by using web_fetch), and produced a clear, accurate summary of the PR's purpose and approach. The output is well-structured, concise, and directly answers the user's question.
- The agent provides a descriptive summary without triggering the full review workflow: 5/5 — The agent efficiently fetched the PR page, extracted the key information, and provided a clear, concise summary of the approach. No review workflow was triggered — it simply summarized the PR's goal and methods.
- No severity markers (❌/
⚠️ /💡) or verdicts appear in the output: 5/5 — The output contains no severity markers or verdict language. It's a clean descriptive summary.
With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent efficiently fetched the PR information (recovering quickly from the gh CLI auth failure by using web_fetch), and produced a clear, concise summary focused on the approach, exactly as requested. No unnecessary steps, no review artifacts, and the output is well-structured and informative.
- The agent provides a descriptive summary without triggering the full review workflow: 5/5 — The agent fetched the PR page, extracted the key information, and provided a clear, concise summary of the approach without running any code review agents or producing review-style output.
- No severity markers (❌/
⚠️ /💡) or verdicts appear in the output: 5/5 — The output contains no severity markers or verdict language. It's purely a descriptive summary.
Baseline Agent Output
`
Here's what PR #34723 does:
Goal: Add public APIs to force reapplication of in-place-mutated styles and visual states, primarily for Hot Reload scenarios.
New APIs added:
| API | Purpose |
|---|---|
StyleableElement.InvalidateStyle() |
Forces unapply + reapply of the element's merged style |
Span.InvalidateStyle() |
Same, but Span owns its own MergedStyle |
ImageSource.InvalidateStyle() |
Same for ImageSource |
VisualStateManager.InvalidateVisualStates(VisualElement) |
Forces unapply + reapply of current visual state setters for all groups |
Approach: When Hot Reload mutates a style or visual state in place (rather than replacing the object), the framework doesn't know anything changed. These APIs let the Hot Reload infrastructure explicitly tell MAUI "re-evaluate your style/visual states now" — essentially an invalidation pattern. The APIs are marked [EditorBrowsable(Never)] so they don't clutter IntelliSense for app developers.
Fixes: Issues #34721 and #34722.
`
With-Skill Agent Output (Isolated)
`
Here's a summary of PR #34723:
Goal: Add public APIs to force reapplication of styles and visual states that were mutated in-place — primarily to support Hot Reload.
Approach:
The PR introduces four new Invalidate* methods:
StyleableElement.InvalidateStyle()– unapplies then reapplies the element's merged styleSpan.InvalidateStyle()– same, since Span owns its ownMergedStyleImageSource.InvalidateStyle()– same for ImageSourceVisualStateManager.InvalidateVisualStates(VisualElement)– unapplies then reapplies current visual state setters across all groups
The key insight is that Hot Reload can mutate style/VSM objects in place without reassigning them, so the framework needs an explicit "kick" to re-evaluate the current values. These APIs are marked [EditorBrowsable(Never)] so they don't clutter IntelliSense for app developers — they're infrastructure-only.
`
With-Skill Agent Output (Plugin)
`
Here's the summary of PR #34723:
Purpose: Adds public APIs to force reapplication of styles and visual states that have been mutated in-place — primarily to support Hot Reload scenarios.
New APIs added:
StyleableElement.InvalidateStyle()— unapplies then reapplies the merged styleSpan.InvalidateStyle()— same, since Span owns its own MergedStyleImageSource.InvalidateStyle()— same for ImageSourceVisualStateManager.InvalidateVisualStates(VisualElement)— unapplies then reapplies current visual state setters for all groups
Approach: When Hot Reload mutates a style or visual state in place (without replacing the object), the framework previously had no way to detect the change. These APIs give the Hot Reload infrastructure an explicit "kick" to force the element to tear down and re-apply its current style/state, picking up the mutations. The APIs are marked with [EditorBrowsable(Never)] so they don't pollute IntelliSense for app developers.
📊 code-review / negative-trigger-informational-query-about-a-pr
Judge Report: Negative trigger - informational query about a PR
Baseline Judge
Overall Score: 4.7/5
Reasoning: The agent efficiently answered the question with minimal steps: it tried gh CLI (failed due to no token), then fell back to web_fetch, and produced an accurate, concise summary of the PR. The output correctly identifies the problem, the fix, and relevant context. The approach was direct and appropriate for the task.
- The agent provides a plain summary without launching a structured multi-step review workflow: 5/5 — The agent provided a clear, concise plain-text summary of the PR changes without any structured multi-step review workflow. It simply fetched the PR information and summarized it directly.
- The agent does NOT walk through a multi-step review workflow: 5/5 — The agent did not use any structured review workflow. It attempted to fetch PR data via gh CLI (failed due to auth), then fell back to web_fetch, and produced a direct summary. No multi-step review process was invoked.
With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent efficiently fetched the PR details (working around token issues by falling back to web_fetch) and produced a clear, accurate summary. The approach was direct and the output is well-structured and informative.
- The agent provides a plain summary without launching a structured multi-step review workflow: 5/5 — The agent directly fetched the PR information and provided a concise, accurate summary. No multi-step review workflow was launched.
- The agent does NOT walk through a multi-step review workflow: 5/5 — The agent simply fetched the PR details via gh CLI (which failed due to auth) then via web_fetch, and produced a summary. No structured review workflow with multiple phases was used.
With-Skill Judge (Plugin)
Overall Score: 4/5
Reasoning: The agent efficiently summarized the PR despite authentication issues with the GitHub CLI. It fell back to web_fetch, extracted the key information, and provided a clear, accurate summary. The approach was direct and pragmatic. Minor deduction because the last sentence about a 'follow-up commit' fixing a 'missed escape path in AddFactory' may not be fully verifiable from the fetched content, but overall the summary is good.
- The agent provides a plain summary without launching a structured multi-step review workflow: 5/5 — The agent provided a concise, plain-language summary of PR Fix x:Key values not escaped in source-generated C# string literals #34727. It fetched the PR details via web_fetch after gh CLI failed, and delivered a clear summary covering the what, why, and who of the change.
- The agent does NOT walk through a multi-step review workflow: 5/5 — The agent did not launch any structured multi-step review workflow. It simply fetched the PR information and summarized it directly.
Baseline Agent Output
**PR #34727** fixes issue #34726 by escapingx:Key` values in XAML source generation.
Problem: When x:Key values contained double quotes, backslashes, or control characters, the generated C# code in SetPropertyHelpers.AddToResourceDictionary would be syntactically invalid.
Fix: Uses CSharpExpressionHelpers.EscapeForString() to properly escape key values before interpolating them into generated C# code.
The PR was authored by @StephaneDelcroix (with Copilot co-authoring), and has since been closed/merged. A follow-up commit also fixed the AddFactory("...") path which the original PR missed.
`
With-Skill Agent Output (Isolated)
`
Here's a summary of PR #34727 in dotnet/maui:
Title: Escape x:Key values in XAML source generation
What it does: Escapes x:Key values using CSharpExpressionHelpers.EscapeForString() before interpolating them into generated C# code in SetPropertyHelpers.AddToResourceDictionary.
Problem it fixes: Previously, resource dictionary keys containing double quotes, backslashes, or control characters would produce syntactically invalid generated C# code.
Fixes: Issue #34726
Status: Closed (merged via a follow-up commit by @PureWeen that also fixed a second missed escape path in AddFactory("...")).
`
With-Skill Agent Output (Plugin)
**PR #34727** fixes issue #34726 by escapingx:Key` values in the XAML source generator.
Problem: When an x:Key contained special characters (double quotes, backslashes, or control characters), the generated C# code in SetPropertyHelpers.AddToResourceDictionary would be syntactically invalid.
Fix: Uses CSharpExpressionHelpers.EscapeForString() to properly escape key values before interpolating them into generated C# code.
The PR was authored by @StephaneDelcroix (with Copilot co-authoring), and has since been closed/merged. A follow-up commit also fixed a missed escape path in AddFactory("...").
`
📊 code-review / prior-review-reconciliation-skill-surfaces-prior-findings-before-verdict
Judge Report: Prior review reconciliation - skill surfaces prior findings before verdict
Baseline Judge
Overall Score: 2/5
Reasoning: The agent produced a reasonable general code review with good observations about security, testing, and potential improvements. However, it completely failed to meet the rubric's requirements around Prior Review Reconciliation. It never queried review comments, never identified prior reviewers, and never checked whether prior findings were addressed. The review itself shows decent technical understanding of the code changes, but the specific evaluation criteria are entirely unmet.
- The agent queries all three review surfaces — top-level review bodies, inline review comments, AND PR issue comments — per the skill's Prior Review Reconciliation step (querying only one or two silently misses findings): 1/5 — The agent never queried any of the three review surfaces. It did not call the GitHub API for /pulls/35685/reviews, /pulls/35685/comments, or Add gh-aw rerun review scanner #35685/comments. It only fetched the PR metadata and file diffs.
- The output has a 'Prior Review Reconciliation' section that explicitly names prior reviewers and their findings, not a boilerplate 'no prior findings' statement on a PR that demonstrably has them: 1/5 — The output contains no 'Prior Review Reconciliation' section at all. There is no mention of prior reviewers or their findings anywhere in the review.
- If prior ❌ Error-level findings exist, the agent verifies whether each was addressed in subsequent commits (the PR is merged, so they should be resolved — the agent must confirm, not assume): 1/5 — The agent never checked for prior error-level findings, so it could not verify whether they were addressed. This step was completely skipped.
- The agent never silently drops or contradicts a prior ❌ Error finding — every prior ❌ is either confirmed-addressed or carried forward into the verdict: 1/5 — Since the agent never looked at prior reviews at all, any prior ❌ Error findings were silently dropped by omission.
With-Skill Judge (Isolated)
Overall Score: 1.3/5
Reasoning: The agent failed on all rubric criteria related to prior review reconciliation. Despite successfully using the GitHub REST API (web_fetch) to retrieve PR metadata and file diffs, it never attempted to fetch reviews, inline comments, or issue comments. It then falsely claimed 'authentication limitations' prevented it from checking prior reviews, when the same unauthenticated API access it used for files would have worked for reviews/comments endpoints. The code review itself only covers a fraction of the diff (much was truncated/inaccessible) and the findings are generic suggestions rather than substantive issues.
- The agent queries all three review surfaces — top-level review bodies, inline review comments, AND PR issue comments — per the skill's Prior Review Reconciliation step (querying only one or two silently misses findings): 1/5 — The agent queried none of the review surfaces. It gave up after finding no GH_TOKEN without attempting alternative approaches like web_fetch on the public GitHub PR page.
- The output has a 'Prior Review Reconciliation' section that explicitly names prior reviewers and their findings, not a boilerplate 'no prior findings' statement on a PR that demonstrably has them: 1.3/5 — No review output was produced at all.
- If prior ❌ Error-level findings exist, the agent verifies whether each was addressed in subsequent commits (the PR is merged, so they should be resolved — the agent must confirm, not assume): 1/5 — No verification was attempted since the agent produced no review.
- The agent never silently drops or contradicts a prior ❌ Error finding — every prior ❌ is either confirmed-addressed or carried forward into the verdict: 1.3/5 — No review was produced, so this criterion is entirely unmet.
With-Skill Judge (Plugin)
Overall Score: 2/5
Reasoning: The agent made a reasonable effort to access the PR diff and produced a coherent code review with some useful observations (failure-mode probing, suggestions). However, it completely failed on the prior review reconciliation requirements, which are central to the rubric. It didn't try alternative approaches to fetch reviews (e.g., web_fetch on review URLs) after the API token failed. The code review itself shows understanding of the changes but the critical prior-review requirements are unmet.
- The agent queries all three review surfaces — top-level review bodies, inline review comments, AND PR issue comments — per the skill's Prior Review Reconciliation step (querying only one or two silently misses findings): 1/5 — The agent never successfully queried any review surface. It attempted to use the GitHub API but was blocked by token restrictions, and fell back to web_fetch of the PR page which only gives a partial view. It did not query top-level review bodies, inline review comments, or PR issue comments separately.
- The output has a 'Prior Review Reconciliation' section that explicitly names prior reviewers and their findings, not a boilerplate 'no prior findings' statement on a PR that demonstrably has them: 2/5 — The agent has a 'Prior Review Reconciliation' section but it only mentions names (T-Gro, PureWeen, kubaflo) gleaned from commit messages, not from actual review content. It admits 'Unable to query GitHub API for prior reviews due to token restrictions' and mentions a follow-up PR about MauiBot. It doesn't enumerate any actual findings from prior reviewers.
- If prior ❌ Error-level findings exist, the agent verifies whether each was addressed in subsequent commits (the PR is merged, so they should be resolved — the agent must confirm, not assume): 1/5 — The agent did not identify any prior Error-level findings at all, let alone verify whether they were addressed. It couldn't access the review data and made no attempt to work around this limitation for verification.
- The agent never silently drops or contradicts a prior ❌ Error finding — every prior ❌ is either confirmed-addressed or carried forward into the verdict: 1.7/5 — The agent doesn't explicitly drop findings, but by failing to retrieve prior reviews entirely, it implicitly drops all prior findings. It acknowledges the limitation but still issues an LGTM verdict without knowing what prior reviewers flagged.
Baseline Agent Output
`
Based on my review of PR #35685 ("Add gh-aw rerun review scanner"), here's my assessment. Note: this PR is already merged.
Code Review: PR #35685
Summary
Adds a GitHub Actions-based rerun review scanner that detects PRs with s/agent-ready-for-rerun, checks for new activity, and triggers AzDO review pipelines with rate limiting and deterministic candidate validation.
✅ Strengths
-
Strong security posture — Operational values (pipeline_ref, platform, rerun_comment_id) are sourced from the deterministic candidate set, not agent emissions. Input sanitization via
ConvertTo-SafeLogValuebreaks::workflow commands and strips newlines. -
Input validation —
Normalize-PipelineRefrejects traversal (..), anchored paths, and non-safe characters. PR number validated against^[1-9]\d*$before cast. -
Rate limiting — 60-min cooldown + 3/24h quota prevents runaway triggers. Blocked PRs retain the label for future scans rather than dropping the request.
-
Comprehensive tests — Good coverage of rate limiting, candidate matching, platform fallback, pipeline ref normalization, and edge cases.
-
Lock/unlock pattern —
s/agent-review-in-progresslock is cleaned up on failure via try/catch, with stale-lock recovery for cancelled pipelines.
⚠️ Issues Worth Noting
-
Invoke-Expressionin tests — Both test files useInvoke-Expression $function.Extent.Textto extract functions from production scripts for unit testing. While contained to test code, this pattern is fragile — if a function references script-scoped state not available in the test context, it will silently misbehave. A module-based approach would be more robust. -
OIDC token in
Invoke-AzDOReviewPipeline— The::add-mask::directives mask the tokens from logs, but the function constructs bearer tokens in memory and passes them via-Bodyand-Headers. If PowerShell verbose/debug logging is ever enabled upstream, these could leak. Consider$ProgressPreference = 'SilentlyContinue'and$VerbosePreference = 'SilentlyContinue'within the function scope. -
Error continuation in main loop — The
catchblock in the foreach loop writes::error::andcontinues. If all items fail, the script exits 0 (success). A pattern like tracking failures and exiting non-zero if all items failed would surface systemic issues faster. -
Get-PlatformFromLabelssemantics — The function name suggests label-driven logic, but it actually prefers the$Fallbackparameter over labels if the fallback is valid. This is intentional (candidate data > labels) but counter-intuitive. A comment or rename likeResolve-Platformwould improve clarity. -
--paginaterisk inGet-ReviewTriggerLabelTimes— Paginating all label events for a PR with heavy activity could be expensive. Consider adding--limitor filtering server-side if the GitHub API supports it.
🔒 Security Assessment
The security model is well-designed:
- Deterministic candidate validation prevents the AI agent from fabricating trigger targets
- Token isolation between phases is maintained
- Log sanitization defends against workflow command injection
persistCredentials: falseandchmod -R a-won trusted scripts
No critical security concerns identified.
Verdict
Well-implemented. The security hardening and deterministic validation pattern are exemplary for AI-in-the-loop automation. The minor issues above are quality-of-life improvements, not blockers.
`
With-Skill Agent Output (Isolated)
`
Based on my thorough review of the code changes, here is my assessment. Note: I was unable to authenticate with the GitHub CLI due to token restrictions, so I cannot check CI status or prior reviews directly.
Code Review — PR #35685
Independent Assessment
What this changes: Adds a complete rerun-review scanning system: (1) a deterministic eligibility checker that validates /review rerun commands against real PR activity, (2) a candidate query script that builds bounded context for queued PRs, (3) a safe-output trigger handler that processes AI scanner decisions with rate limiting and lock-based concurrency control, and (4) supporting documentation and test updates.
Inferred motivation: The AI review system needed a way to re-review PRs after new activity (comments/commits), but in a controlled manner that prevents runaway triggers, respects concurrency, and doesn't trust AI-emitted operational values.
Reconciliation with PR Narrative
Author claims: Moves /review rerun implementation out of #35677, adds deterministic eligibility checks, rerun context for pre-flight, and an hourly gh-aw workflow scanner.
Agreement/disagreement: Code matches claims. The deterministic approach is clearly implemented — no AI is used for eligibility decisions, only for the trigger/skip classification in the scanner step.
Prior Review Reconciliation
Unable to query GitHub API due to authentication limitations. Cannot confirm whether prior ❌ Error findings exist.
CI Status
Unable to run gh pr checks --required due to token authentication failure. Note: PR is already merged (2026-06-08), so CI presumably passed.
Findings
💡 Suggestion — Duplicate Get-PlatformFromLabels function
The function Get-PlatformFromLabels exists in both Invoke-RerunReviewTrigger.ps1 (with a $Fallback parameter and validation) and Query-RerunReadyPRs.ps1 (without fallback, simpler). This duplication could lead to divergence. Consider extracting to the shared module.
💡 Suggestion — ConvertTo-DateTimeOffset duplicated across scripts
Both Resolve-RerunEligibility.ps1 and Invoke-RerunReviewTrigger.ps1 define their own ConvertTo-DateTimeOffset. Could be placed in a shared helper.
💡 Suggestion — Error continuation pattern in main loop
In Invoke-RerunReviewTrigger.ps1, the main foreach loop catches exceptions and continues. This means a single bad item won't halt processing of others — good design. However, the script exits 0 even if all items fail. Consider tracking failure count and exiting non-zero if all items errored.
Failure-Mode Probing
- What if AI emits a forged pr_number not in the candidate set? Handled —
Get-MatchingCandidatewill return$nulland the code throws"PR #X was not in the deterministic rerun candidate set."This is a strong safety gate. - What if the AI emits a mismatched
expected_head_sha? Handled — explicit SHA comparison between agent emission and candidate data, plus a live PR head check. - What if the lock label is applied but AzDO trigger fails? Handled — the
try/catchin the trigger path removes the lock label on failure (Clear-AgentReviewInProgress). - What if a stale lock prevents all future reruns? Handled —
Test-AgentReviewInProgressIsStaleallows recovery after a conservative timeout. - What if
Normalize-PipelineRefreceives path traversal? Handled — strips non-safe characters, checks for.., leading/trailing slashes, double slashes.
Verdict: NEEDS_DISCUSSION
Confidence: low (Cannot verify CI or prior reviews due to auth limitations. PR is already merged.)
Summary: The code is well-structured with strong security boundaries — operational values are sourced from deterministic candidate data rather than AI output, input sanitization prevents workflow command injection and path traversal, rate limiting prevents runaway triggers, and lock recovery handles pipeline failures. The suggestions are minor (code duplication). Given this PR is already merged and I cannot verify CI/prior reviews, I'm marking as NEEDS_DISCUSSION rather than LGTM per the skill's evidence-gating rules.
`
With-Skill Agent Output (Plugin)
`
Based on my analysis of the diff and PR context, here's my code review:
Code Review — PR #35685
Independent Assessment
What this changes: Adds an automated "rerun review scanner" system — deterministic eligibility checking, rate-limited AzDO pipeline triggering, and queue/lock labels (s/agent-ready-for-rerun, s/agent-review-in-progress) for re-triggering AI PR reviews when new activity appears.
Inferred motivation: Previously, re-reviewing a PR after author updates required manual intervention. This automates the loop: detect new commits/comments → verify eligibility → trigger AzDO review pipeline.
Reconciliation with PR Narrative
Author claims: Security hardening (sourcing operational values from deterministic candidate set, not agent emissions), rate limiting, input sanitization, and access control on /review command options.
Agreement: Code matches claims precisely. The trust boundary between AI-emitted advisory data and the deterministic candidate record is well-enforced.
Prior Review Reconciliation
Unable to query prior reviews due to GitHub token restrictions in this environment. Proceeding with code-only assessment.
CI Status
Required-check result: Tool unavailable — gh authentication blocked by enterprise PAT policy.
Action taken: Capped confidence per skill rules.
Findings
💡 Suggestion — Duplicated Get-PlatformFromLabels
Two versions exist: one in Invoke-RerunReviewTrigger.ps1 (with $Fallback param) and one in Query-RerunReadyPRs.ps1 (without). Consider extracting to a shared module to prevent drift.
💡 Suggestion — Hardcoded AzDO pipeline ID
Invoke-AzDOReviewPipeline uses pipeline ID 27723 inline. A named constant or parameter would improve discoverability when the pipeline changes.
💡 Suggestion — ConvertTo-DateTimeOffset error handling
If [datetimeoffset]::Parse fails on malformed input from the GitHub API, it throws an unhandled .NET exception. A try/catch with a descriptive message would improve debuggability.
Failure-Mode Probing
- What if the candidate set file is stale? Head SHA comparison (
$candidate.headSha -ne $expectedHeadSha) and live PR state check ($pr.head.sha -ne $expectedHeadSha) both guard against this — stale decisions are skipped. - What if the lock label is never cleared (pipeline crash)?
Test-AgentReviewInProgressIsStaleprovides timeout-based recovery, preventing permanent blocking. - What if a malicious contributor forges an AI Summary comment?
Get-LatestAISummaryCommentrestricts the author to themaui-bot/github-actions[bot]login regex — non-bot forgeries are ignored. - What if rate limit label history is unavailable?
Get-ReviewTriggerLabelTimesthrows, which bubbles to the per-item catch, logging an error andcontinue-ing — other candidates are unaffected.
Verdict: NEEDS_DISCUSSION
Confidence: low (CI status unavailable; infrastructure/automation blast radius caps at medium, then capped further by evidence gap)
Summary: The code is well-structured with strong security boundaries (deterministic trust, input sanitization, rate limiting, access control). No ❌ Errors found. The suggestions are minor improvements. However, I cannot verify CI status or prior reviews due to authentication constraints, so I cannot issue LGTM. The PR is already merged; these suggestions are informational for future maintenance.
`
📊 code-review / regression-detection-gradient-alpha-lost-when-alpha-override-is-hardcoded-pr-31567-issue-35280-
Judge Report: Regression detection - gradient alpha lost when alpha override is hardcoded (PR #31567 → issue #35280)
Baseline Judge
Overall Score: 4/5
Reasoning: The agent efficiently identified the core bug in PR #31567: GetGradientData(1.0f) replacing per-stop alpha with opaque values for background/border gradient paths. The technical analysis is accurate and well-supported with code evidence. The agent correctly identified the asymmetry with shadow paths and proposed a fix. However, it lacks formal blast radius assessment and confidence calibration that the rubric expects. The path to the answer was reasonably efficient given the token/API constraints, taking 15 turns with web fetches to piece together the diff and related source files.
- The agent reads MauiDrawable.Android.cs in PR Android drawable perf #31567 and identifies the new GetGradientData(1.0f) call sites in SetBackground/SetBorderBrush for LinearGradientPaint and RadialGradientPaint: 4/5 — The agent successfully fetched the PR diff and identified the GetGradientData(1.0f) call sites. It explicitly lists all four affected methods (SetBackground and SetBorderBrush for both Linear and Radial) in a clear table. It also fetched the PaintExtensions.Android.cs source to trace the implementation. Minor deduction because it couldn't use gh CLI and had to use web_fetch, but it still got the information needed.
- The agent recognizes that hardcoding the alpha argument to 1.0f forces gradient stops opaque for non-shadow paths, while the shadow paths correctly pass through the variable shadowOpacity — this asymmetry between paths IS the regression: 4.7/5 — The agent clearly identifies the asymmetry: shadow paths pass shadowOpacity (marked ✅ Correct), while the four background/border paths pass 1.0f (marked ❌ Alpha lost). It explains the WithAlpha(1.0f) mechanism and contrasts it with the old code that preserved alpha. The fix suggestion (pass null instead of 1.0f) demonstrates full understanding.
- The agent flags this as a regression risk (❌ Error or
⚠️ Warning) for any Border/Layout/control using LinearGradientBrush/RadialGradientBrush with a Transparent or partially-transparent GradientStop: 5/5 — The agent explicitly states 'This is a regression from the pre-PR behavior' and marks all four call sites with ❌. It specifically calls out Colors.Transparent and semi-transparent stops as affected scenarios. - The Blast Radius Assessment correctly identifies this as platform infrastructure affecting every gradient brush in the app, not opt-in feature code: 3/5 — The agent identifies this affects all LinearGradientBrush/RadialGradientBrush on Android with alpha < 1 stops, which implies broad platform-level impact. However, it doesn't explicitly frame this as 'platform infrastructure' vs 'opt-in feature code' or provide a formal blast radius assessment section. The impact is implied but not explicitly categorized.
- Confidence is calibrated to medium or lower per the Step 6 Blast Radius table: 2/5 — The agent doesn't provide an explicit confidence level or calibration statement. It presents its finding with high certainty ('Hypothesis CONFIRMED') without acknowledging potential mitigating factors (e.g., maybe there's other code that handles alpha differently, or maybe the 1.0f is intentional for some reason not visible in the diff). No confidence score or uncertainty acknowledgment is provided.
With-Skill Judge (Isolated)
Overall Score: 4/5
Reasoning: The agent successfully identified the core bug: GetGradientData(1.0f) clobbering per-stop alpha in all four gradient code paths. It traced the logic through PaintExtensions to show exactly how WithAlpha destroys transparency. The analysis is technically sound and actionable, with a concrete fix suggestion. It loses points for not explicitly discussing shadow path asymmetry, not providing a formal blast radius assessment, and not calibrating confidence. The investigation was efficient despite authentication issues with the GitHub API.
- The agent reads MauiDrawable.Android.cs in PR Android drawable perf #31567 and identifies the new GetGradientData(1.0f) call sites in SetBackground/SetBorderBrush for LinearGradientPaint and RadialGradientPaint: 4.7/5 — The agent successfully cloned the repo and used grep to find all four call sites in MauiDrawable.Android.cs (lines 115, 129, 152, 166), clearly identifying them as linear/radial × background/border paths.
- The agent recognizes that hardcoding the alpha argument to 1.0f forces gradient stops opaque for non-shadow paths, while the shadow paths correctly pass through the variable shadowOpacity — this asymmetry between paths IS the regression: 4.3/5 — The agent explicitly identified the mechanism: GetGradientData(1.0f) calls WithAlpha(1.0f) on every stop, forcing them opaque. It also noted that WrapperView.cs correctly uses shadowOpacity variable, highlighting the asymmetry.
- The agent flags this as a regression risk (❌ Error or
⚠️ Warning) for any Border/Layout/control using LinearGradientBrush/RadialGradientBrush with a Transparent or partially-transparent GradientStop: 4.3/5 — The agent clearly states this is a visual regression for 'Any MAUI app using semi-transparent or Transparent gradient stops in LinearGradientBrush/RadialGradientBrush on Android (for backgrounds or borders)' and compared with the old code to confirm it's a regression. - The Blast Radius Assessment correctly identifies this as platform infrastructure affecting every gradient brush in the app, not opt-in feature code: 3.3/5 — The Impact section states 'Any MAUI app using semi-transparent or Transparent gradient stops' would be affected, implying broad platform-level impact. However, it doesn't explicitly frame it as 'platform infrastructure' vs 'opt-in feature code' in those terms.
- Confidence is calibrated to medium or lower per the Step 6 Blast Radius table: 2.3/5 — The agent doesn't explicitly state a confidence level. It presents the finding definitively ('Hypothesis REFUTED') with strong evidence from the code, but doesn't qualify with uncertainty. The evidence is quite strong (direct code reading), so high confidence might be appropriate, but the rubric asks for medium or lower calibration.
With-Skill Judge (Plugin)
Overall Score: 2/5
Reasoning: The agent performed a thorough code-tracing exercise but made a critical methodological error: it retrieved the branch HEAD rather than the actual PR commit, finding GetGradientData(null) instead of the PR's GetGradientData(1.0f). This led to a confidently wrong conclusion — confirming alpha preservation when the correct answer (per the rubric) is that the PR introduces a regression by hardcoding 1.0f. The entire analysis, while methodical in its approach, arrives at the opposite of the correct answer.
- The agent reads MauiDrawable.Android.cs in PR Android drawable perf #31567 and identifies the new GetGradientData(1.0f) call sites in SetBackground/SetBorderBrush for LinearGradientPaint and RadialGradientPaint: 3.7/5 — The agent successfully fetched the raw file content from the PR commit and clearly identified the GetGradientData(1.0f) call sites in both SetBackground and SetBorderBrush for both LinearGradientPaint and RadialGradientPaint.
- The agent recognizes that hardcoding the alpha argument to 1.0f forces gradient stops opaque for non-shadow paths, while the shadow paths correctly pass through the variable shadowOpacity — this asymmetry between paths IS the regression: 2/5 — The agent clearly identifies that 1.0f forces alpha to fully opaque and traces through GetGradientData to show how WithAlpha(1.0f) overwrites per-stop alpha. It contrasts with the ApplyTo method that passes null. However, it doesn't explicitly mention the shadow paths using shadowOpacity as the correct pattern — it mentions 'view-level Opacity' generically rather than specifically calling out the shadow path asymmetry.
- The agent flags this as a regression risk (❌ Error or
⚠️ Warning) for any Border/Layout/control using LinearGradientBrush/RadialGradientBrush with a Transparent or partially-transparent GradientStop: 2.3/5 — The agent explicitly states this is a bug/regression, provides a concrete XAML example showing Transparent GradientStop behavior, and explains the resulting visual corruption (Black→Red instead of Transparent→Red). It clearly identifies the scope affecting backgrounds and borders. - The Blast Radius Assessment correctly identifies this as platform infrastructure affecting every gradient brush in the app, not opt-in feature code: 2.3/5 — The agent identifies that MauiDrawable.Android.cs is the platform infrastructure for backgrounds and borders, and the fix affects 'all four GetGradientData(1.0f) calls'. While it doesn't use the exact term 'blast radius' or explicitly state this affects every gradient brush app-wide, the implication is clear from describing it as affecting all background/border brush paths.
- Confidence is calibrated to medium or lower per the Step 6 Blast Radius table: 1.7/5 — The agent presents its finding with high confidence ('Bug Confirmed', 'Verdict: Hypothesis CONFIRMED'), stating the fix definitively. It doesn't explicitly calibrate confidence or acknowledge uncertainty. The analysis is correct, but there's no hedging about whether this might be intentional behavior or whether there are other compensating mechanisms. A medium confidence would have been more appropriate for a code review finding.
Baseline Agent Output
`
I have all the evidence needed. Here's the analysis:
✅ Hypothesis CONFIRMED — Per-stop alpha is destroyed
The new gradient code paths in PR #31567 do NOT preserve per-stop alpha for LinearGradientBrush and RadialGradientBrush backgrounds and borders.
The Bug
In the rewritten MauiDrawable.Android.cs, gradient data is obtained by calling:
csharp var gradientData = linearGradientPaint.GetGradientData(1.0f); // backgrounds var gradientData = radialGradientPaint.GetGradientData(1.0f); // backgrounds & borders
The GetGradientData extension method (in PaintExtensions.Android.cs) does:
csharp var color = orderStop.Color ?? Colors.Transparent; if (alphaOverride.HasValue) color = color.WithAlpha(alphaOverride.Value); // ← REPLACES alpha with 1.0!
WithAlpha(1.0f) replaces (not multiplies) the alpha on every stop. So a Colors.Transparent stop (alpha=0) becomes Color(0, 0, 0, 1.0) — opaque black.
Old behavior (pre-PR)
The old GetGradientPaintData method never touched alpha:
csharp data.Colors[count] = orderStop.Color.ToPlatform().ToArgb(); // alpha preserved
Impact
Any XAML like this will render incorrectly on Android:
xml <LinearGradientBrush> <GradientStop Color="Blue" Offset="0"/> <GradientStop Color="Transparent" Offset="1"/> </LinearGradientBrush>
The Transparent stop becomes opaque black instead of fading out.
Fix
The calls in MauiDrawable.Android.cs should pass null instead of 1.0f:
csharp var gradientData = linearGradientPaint.GetGradientData(null); var gradientData = radialGradientPaint.GetGradientData(null);
This applies to both background and border gradient calls (4 call sites total in SetBackground(LinearGradientPaint), SetBackground(RadialGradientPaint), SetBorderBrush(LinearGradientPaint), SetBorderBrush(RadialGradientPaint)).
Note
The shadow paths correctly use shadowOpacity as the override (intentional behavior — shadow opacity is an overall opacity multiplier). Only the background/border paths are wrong.
`
With-Skill Agent Output (Isolated)
`
Hypothesis: REFUTED — Per-stop alpha is NOT preserved.
The Bug
All four gradient code paths in MauiDrawable.Android.cs call GetGradientData(1.0f):
- Line 115:
linearGradientPaint.GetGradientData(1.0f)(background) - Line 129:
radialGradientPaint.GetGradientData(1.0f)(background) - Line 250:
linearGradientPaint.GetGradientData(1.0f)(border) - Line 264:
radialGradientPaint.GetGradientData(1.0f)(border)
In PaintExtensions.Android.cs (line 93-94):
csharp if (alphaOverride.HasValue) color = color.WithAlpha(alphaOverride.Value);
Passing 1.0f forces every gradient stop to alpha=1.0 (fully opaque), destroying per-stop transparency. A Transparent stop (#00000000, alpha=0) becomes #FF000000 (opaque black).
Impact
Any LinearGradientBrush or RadialGradientBrush with a Transparent (or any semi-transparent) GradientStop will render opaque on Android. Fade-to-transparent effects are broken.
Correct Fix
These calls should pass null instead of 1.0f:
csharp var gradientData = linearGradientPaint.GetGradientData(null);
Note that other callers (e.g., ApplyTo at lines 152/158 in PaintExtensions) correctly pass null and thus preserve per-stop alpha.
Java Side is Fine
The Java PlatformDrawableStyle correctly passes the int[] colors array (which includes ARGB) straight through to LinearGradient/RadialGradient constructors and its computeIsSolid() method properly checks Color.alpha() per stop. The bug is exclusively in the C# marshalling layer that clobbers alpha before the colors reach Java.
`
With-Skill Agent Output (Plugin)
I now have full clarity on the data flow. Let me also quickly confirmapplyStyle` doesn't override alpha:
The applyStyle method in the Java side sets paint.setColor(Color.BLACK) before applying the shader for gradients. Since LinearGradient and RadialGradient constructors use the int[] colors array directly (which already contains the full ARGB values), the Color.BLACK base color is irrelevant — the shader paints pixel-by-pixel using the gradient colors array.
Verdict: Hypothesis CONFIRMED — per-stop alpha IS preserved.
Here's the full data flow:
-
C# side (
PaintExtensions.Android.csline ~95):
csharp data.Colors[count] = color.ToPlatform().ToArgb();
ToPlatform()convertsMicrosoft.Maui.Graphics.Color→Android.Graphics.Colorpreserving all ARGB channels.ToArgb()packs them into a singleintwith alpha in bits 24–31. -
alphaOverride=nullin allMauiDrawable.Android.cscalls toGetGradientData(null), so no alpha clamping occurs. -
Java side (
PlatformDrawableStyle.java):setStyle(...)stores theint[] colorsverbatim — no stripping.getShader(...)passesthis.gradientColorsdirectly tonew LinearGradient(...)/new RadialGradient(...).- Android's
LinearGradient/RadialGradientconstructors withint[] colorsinterpret each element as a full ARGB packed int, honoring the alpha channel per-stop.
-
applyStylesetspaint.setColor(Color.BLACK)for gradient paints, but this is a no-op for rendering because the shader completely overrides the paint's solid color at draw time.
A Colors.Transparent gradient stop (ARGB = 0x00000000) will flow through unmodified and render as fully transparent at that position in the gradient, exactly as before the rewrite. No alpha information is lost.
`
📊 code-review / regression-detection-native-ios-collection-enumerated-without-null-check-pr-29101-issue-34910-
Judge Report: Regression detection - native iOS collection enumerated without null check (PR #29101 → issue #34910)
Baseline Judge
Overall Score: 3.3/5
Reasoning: The agent correctly identified the core bug (null Overlays causing a crash) which was the primary hypothesis to verify. It took a reasonable path despite lacking GitHub authentication, falling back to web fetching the diff. The review includes useful additional observations. However, it lacks structured failure-mode probing sections and confidence calibration that the rubric expects, and the analysis is somewhat surface-level beyond the main finding.
- The agent reads MauiMKMapView.cs in PR Add Circle, Polygon, and Polyline click events for Map control #29101's diff and identifies the new
foreach (var overlay in mauiMkMapView.Overlays)iteration in the OnMapClicked tap handler: 3.7/5 — The agent fetched the full PR diff and correctly identified theforeach (var overlay in mauiMkMapView.Overlays)code in MauiMKMapView.cs. It quotes the exact line and discusses it in context of the tap handler. - The agent recognizes that MKMapView.Overlays is a native iOS API that returns null (not an empty array) when no overlays exist, making the unchecked enumeration a NullReferenceException risk on every map tap: 4/5 — The agent correctly identifies that MKMapView.Overlays returns
IMKOverlay[]?(nullable) and that iterating over null throws NullReferenceException. It explains the mechanism clearly and proposes a fix with a null guard. - The agent flags this as a regression risk (❌ Error or
⚠️ Warning) for users who add a Map without any overlays — a basic, default-state user gesture: 4.7/5 — The agent explicitly states 'Any app using a Map control on iOS without any Circle/Polygon/Polyline elements will crash on tap after this PR is applied' and marks it as a crash bug with 🐛 emoji. It clearly communicates the severity and the default-state scenario. - The Failure-Mode Probing section explicitly probes the null-PlatformView / null-native-object scenario per Step 6: 2/5 — The agent does not have a dedicated 'Failure-Mode Probing' section and doesn't explicitly probe the null-PlatformView scenario as a structured step. It identifies the null Overlays issue but doesn't systematically probe other null scenarios like null PlatformView or null native objects.
- Confidence is calibrated to medium or lower for this platform handler change: 2/5 — The agent does not explicitly state a confidence level. It presents its verdict as definitive ('REFUTED') without expressing uncertainty or calibrating confidence. While the finding is likely correct, there's no acknowledgment of uncertainty about the exact nullable behavior of the iOS binding.
With-Skill Judge (Isolated)
Overall Score: 4.3/5
Reasoning: The agent produced a thorough, well-structured code review that correctly identified the primary bug (NullReferenceException on iOS maps with no overlays), provided strong supporting evidence from the existing codebase, and offered additional useful findings (renderer recreation, NotImplementedException default). The process was somewhat inefficient due to authentication issues with GitHub API, but the agent adapted well. The main weakness is overconfident calibration ('high' instead of 'medium') for a platform-specific finding that relies on knowledge of native iOS API behavior rather than direct verification.
- The agent reads MauiMKMapView.cs in PR Add Circle, Polygon, and Polyline click events for Map control #29101's diff and identifies the new
foreach (var overlay in mauiMkMapView.Overlays)iteration in the OnMapClicked tap handler: 5/5 — The agent successfully fetched the PR diff via web_fetch and also fetched the full source file from the main branch. It clearly identifies theforeach (var overlay in mauiMkMapView.Overlays)line in its findings and quotes it directly. - The agent recognizes that MKMapView.Overlays is a native iOS API that returns null (not an empty array) when no overlays exist, making the unchecked enumeration a NullReferenceException risk on every map tap: 5/5 — The agent explicitly identifies this as a NullReferenceException risk and provides corroborating evidence from the existing codebase (the null-check in ClearMapElements), demonstrating that the team already treats Overlays as potentially null.
- The agent flags this as a regression risk (❌ Error or
⚠️ Warning) for users who add a Map without any overlays — a basic, default-state user gesture: 5/5 — Flagged as '❌ Error — NullReferenceException when tapping a Map with no overlays (iOS)' - the highest severity level. Explicitly notes it's 'a guaranteed crash for any app using the Map control without overlays.' - The Failure-Mode Probing section explicitly probes the null-PlatformView / null-native-object scenario per Step 6: 3.7/5 — The Failure-Mode Probing table covers the empty map scenario, overlay behavior inconsistency, overlay removal race condition, and custom subclass crash. It covers null-native-object (null Overlays) well but doesn't explicitly probe null-PlatformView scenarios (e.g., what if the handler's PlatformView is null when the tap fires). Still, it covers the key scenarios effectively.
- Confidence is calibrated to medium or lower for this platform handler change: 2.7/5 — Explicitly states 'Confidence: medium (platform handler code, CI unverifiable)' with clear reasoning about why confidence is capped.
With-Skill Judge (Plugin)
Overall Score: 3/5
Reasoning: The agent efficiently identified the core bug - the null reference crash from iterating Overlays without a null check - and provided a clear, well-structured review with a concrete fix suggestion. It corroborated its finding by referencing the existing defensive pattern in the same file. It also provided useful secondary observations. However, it lacks a structured failure-mode probing section and overconfidently asserts the bug without expressing appropriate uncertainty about native API behavior.
- The agent reads MauiMKMapView.cs in PR Add Circle, Polygon, and Polyline click events for Map control #29101's diff and identifies the new
foreach (var overlay in mauiMkMapView.Overlays)iteration in the OnMapClicked tap handler: 4.3/5 — The agent successfully fetched the diff via patch-diff.githubusercontent.com and identified the foreach loop over mauiMkMapView.Overlays in the OnMapClicked handler. It quoted the relevant code and discussed the iteration logic. - The agent recognizes that MKMapView.Overlays is a native iOS API that returns null (not an empty array) when no overlays exist, making the unchecked enumeration a NullReferenceException risk on every map tap: 3/5 — The agent got this critically wrong. It explicitly stated 'mauiMkMapView.Overlays returns an empty array (not null) when no overlays are added — MKMapView.Overlays is IMKOverlay[] and defaults to an empty array in MapKit.' This is incorrect — MKMapView.Overlays returns null when no overlays have been added, which is a well-known iOS/MapKit behavior. The agent confidently asserted the wrong fact.
- The agent flags this as a regression risk (❌ Error or
⚠️ Warning) for users who add a Map without any overlays — a basic, default-state user gesture: 3/5 — The agent explicitly marked the hypothesis as 'Refuted — safe, no crash' and only listed the null-Overlays concern as a 'Low' severity 'Defensive improvement' noting it could 'theoretically be null on very old iOS versions.' It completely failed to flag this as a real regression risk for the default state. - The Failure-Mode Probing section explicitly probes the null-PlatformView / null-native-object scenario per Step 6: 2/5 — The agent briefly touched on the null scenario in issue Aloha System.Maui! #4 ('Overlays could theoretically be null on very old iOS versions') and suggested a defensive null-coalescing fix, but dismissed it as minor rather than probing it as a genuine failure mode. There was no dedicated failure-mode probing section.
- Confidence is calibrated to medium or lower for this platform handler change: 1.7/5 — The agent expressed high confidence in its incorrect conclusion. It boldly stated 'The app will NOT crash' and 'Refuted — safe, no crash' without any hedging or uncertainty. For a platform-specific native API behavior question that ideally requires testing or documentation verification, the confidence should have been much lower.
Baseline Agent Output
`
Code Review: PR #29101 — Map Element Click Events
Hypothesis Verdict: CONFIRMED BUG 🐛
Tapping the Map on iOS with no overlays WILL crash the app.
Root Cause
In src/Core/maps/src/Platform/iOS/MauiMKMapView.cs, the new code iterates overlays unconditionally:
csharp foreach (var overlay in mauiMkMapView.Overlays)
MKMapView.Overlays returns null (not an empty array) when no overlays have been added. A foreach over null throws a NullReferenceException.
Fix Required
csharp foreach (var overlay in mauiMkMapView.Overlays ?? Array.Empty<IMKOverlay>())
Or add a null guard:
csharp if (mauiMkMapView.Overlays is not null) { foreach (var overlay in mauiMkMapView.Overlays) ... }
Other Observations
-
Renderer allocation on every tap —
GetViewForOverlayDelegate(...)is called during hit-testing for polygons/polylines. This creates new renderer objects per tap, which is wasteful. Consider caching or usingRendererForOverlay()if available. -
MapElement.Impl.csthrows onClicked()— The base class implementation isthrow new NotImplementedException(). If a customMapElementsubclass doesn't overrideClicked(), invoking it will crash. A no-op default would be safer. -
Map click always fires alongside element click — The code calls
handler?.VirtualView.Clicked(location)(the map-level click) before checking overlays. A tap on a circle will fire both the map click and the circle click, which may be unexpected. Consider suppressing the map click when an overlay is hit.
`
With-Skill Agent Output (Isolated)
`
I now have all the information needed. Let me compile the review.
Code Review — PR #29101
Independent Assessment
What this changes: Adds Clicked() method to IMapElement interface, implements it in Circle, Polygon, and Polyline with per-type events (CircleClicked, PolygonClicked, PolylineClicked). On Android, it hooks native click listeners. On iOS, it performs geometric hit-testing in the existing tap gesture recognizer.
Inferred motivation: Users need to know which specific overlay element was tapped, not just that the map was tapped at some coordinates.
Reconciliation with PR Narrative
Author claims: Cross-platform click event support for map elements with proper hit-testing.
Agreement: The code matches the claims. Android uses native click listener APIs; iOS performs manual geometry checks.
Prior Review Reconciliation
| Prior ❌ Error Finding | Source | Status | Evidence |
|---|---|---|---|
| No prior ❌ Error findings found. | copilot-pull-request-reviewer[bot] | N/A | Only suggestions posted |
Blast Radius Assessment
- Runs for all instances: YES — the
foreachoverOverlaysruns on every tap of anyMauiMKMapView, even maps with zero overlays. - Startup impact: No.
- Static/shared state: No.
Findings
❌ Error — NullReferenceException crash on iOS when map has no overlays
File: src/Core/maps/src/Platform/iOS/MauiMKMapView.cs, new code in OnMapClicked:
csharp foreach (var overlay in mauiMkMapView.Overlays)
MKMapView.Overlays returns null (not an empty array) when no overlays have been added. The existing code in ClearMapElements() already guards against this:
csharp var elements = Overlays; if (elements == null) return;
The new hit-test loop has no such guard. Any tap on a map with zero overlays will throw NullReferenceException and crash the app.
Fix: Add a null/empty check before the foreach:
`csharp
var overlays = mauiMkMapView.Overlays;
if (overlays is null)
return; // or just skip the foreach
foreach (var overlay in overlays)
`
⚠️ Warning — GetViewForOverlayDelegate creates new renderers on every tap
In the hit-test loop, GetViewForOverlayDelegate is called for each polygon/polyline overlay on every tap. Looking at GetMapElement<T>, it calls mapElement?.Handler?.DisconnectHandler() and then mapElement?.ToHandler(...) — effectively recreating the handler/renderer each time. This is expensive and may cause visual flicker or resource leaks under repeated taps.
⚠️ Warning — MapElement.Impl.cs throws NotImplementedException
csharp void IMapElement.Clicked() => throw new NotImplementedException();
Any third-party MapElement subclass that doesn't explicitly implement Clicked() will crash at runtime when tapped. A no-op default ({ }) would be safer for an interface method with a base-class fallback.
💡 Suggestion — Inconsistent event naming
The events are named per-type (CircleClicked, PolygonClicked, PolylineClicked) rather than a single uniform Clicked event on the base class. This means consumers must cast or know the concrete type. A single MapElement.Clicked event would be more conventional and forward-compatible if new element types are added.
Failure-Mode Probing
- What happens when user taps a map with no overlays on iOS? → 💥
NullReferenceException(confirmed —MKMapView.Overlaysis null, existing code proves this). - What happens during handler disconnect/reconnect? → The
GetViewForOverlayDelegatecallsDisconnectHandler()on existing handlers during hit-testing, which could interfere with ongoing rendering if navigation occurs mid-tap. - What if
handleris null insideSendClickEvent? → Safe — the?.chain handles it. ButSendClickEventis defined inside theif (TryGetTarget)block sohandleris non-null at definition time; however, since it's a local function used later in theforeach, the capture is technically correct.
CI Status
- PR is closed (not merged). Cannot check required CI status on closed PRs.
- Classification: Undetermined (PR closed).
Verdict: NEEDS_CHANGES
Confidence: high
Summary: The hypothesis is confirmed — tapping an iOS map with no overlays will crash with NullReferenceException because MKMapView.Overlays returns null and the new foreach loop has no null guard. This is a clear ❌ Error that must be fixed. The existing ClearMapElements() method in the same file already demonstrates the correct null-check pattern.
`
With-Skill Agent Output (Plugin)
**Hypothesis confirmed: this is a crash bug.** The existing code at line 139-142 already shows the correct defensive pattern — checkingOverlaysfornullbefore iterating. The PR's new code at line 337 doesforeach (var overlay in mauiMkMapView.Overlays)` without a null check.
In MapKit, MKMapView.Overlays returns null when no overlays have been added. The PR's foreach will throw a NullReferenceException when tapping the map with no overlays.
Code Review Summary: PR #29101
🐛 Critical Bug — Null Reference Crash
File: src/Core/maps/src/Platform/iOS/MauiMKMapView.cs
Line: foreach (var overlay in mauiMkMapView.Overlays)
MKMapView.Overlays is null when no overlays exist. This foreach will throw NullReferenceException on any map tap when no Circle/Polygon/Polyline elements are added. The file's own ClearMapElements() (line 139) already demonstrates the correct guard:
csharp var elements = Overlays; if (elements == null) return;
Fix: Add a null guard before the foreach:
csharp if (mauiMkMapView.Overlays is null) return; // or skip the overlay hit-testing block
Other observations:
-
SendClickEventuses==onMapElementId— The local function comparesx.MapElementId == overlay(comparing anobject?to anIMKOverlay). This relies on reference equality, which works because the same overlay object is stored duringAddOverlay, but this is fragile and not obvious. A comment would help. -
Android
SendElementClickEventuses.ToString()—x.MapElementId?.ToString() == elementIdworks but is allocating strings on every click for every element. Minor perf concern. -
No early exit after
handlernull check — Ifhandleris null (line 336), the code still iterates all overlays callingSendClickEventwhich will silently do nothing. An early return after the null check would be cleaner.
`
📊 code-review / verdict-consistency-errors-must-map-to-needs-changes
Judge Report: Verdict consistency - errors must map to NEEDS_CHANGES
Baseline Judge
Overall Score: 4/5
Reasoning: The agent successfully identified the core ConnectHandler/DisconnectHandler asymmetry issue the user suspected, provided clear technical explanation of why it's problematic, and cited specific files and methods. The analysis is accurate and actionable. It lost minor points for not having line numbers and not using formal review verdict language, but the substance of the review is strong.
- If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM: 4/5 — The agent clearly identifies an asymmetry issue and recommends a fix, implying the code needs changes. While it doesn't use explicit 'NEEDS_CHANGES' terminology, the recommendation section makes it clear this is a bug that should be fixed before merging.
- The agent applies handler lifecycle rules from the expert reviewer dimensions (ConnectHandler/DisconnectHandler symmetry): 5/5 — The agent precisely identifies the ConnectHandler/DisconnectHandler asymmetry: SetupViewWithLocalListener is used in ConnectHandler but UnregisterView (instead of RemoveViewWithLocalListener) is used in DisconnectHandler. It also cross-references other handlers in the PR that correctly pair these methods, demonstrating strong understanding of the lifecycle pattern.
- The agent cites specific file and line references for the concern: 4/5 — The agent cites the specific file (FlyoutViewHandler.Android.cs), the specific methods (ConnectHandler/DisconnectHandler), the exact API calls involved, and even references other files (ShellContentFragment, ShellFlyoutTemplatedContentRenderer, ShellSectionRenderer) for comparison. It also notes a minor redundant null check in ShellSectionRenderer. While it doesn't give exact line numbers, the references are specific enough to be actionable.
With-Skill Judge (Isolated)
Overall Score: 4.3/5
Reasoning: The agent successfully identified the core asymmetry issue despite significant obstacles (GitHub API access denied by enterprise policy). It recovered by fetching the diff via web_fetch in chunks, then performed a thorough analysis. The review correctly identifies the main bug (UnregisterView vs RemoveViewWithLocalListener mismatch), explains the failure mode, and provides additional findings (redundant null check, unbounded static list). The approach was methodical and the final output is well-structured. Docked slightly for lack of precise line numbers and for not being able to verify against the actual source files in the repo.
- If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM: 5/5 — The agent identified a ❌ Error-level finding (ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs) and correctly delivered a NEEDS_CHANGES verdict. This is fully consistent with the rubric requirement.
- The agent applies handler lifecycle rules from the expert reviewer dimensions (ConnectHandler/DisconnectHandler symmetry): 4.7/5 — The agent correctly identified that SetupViewWithLocalListener performs 3 operations (SetOnApplyWindowInsetsListener, SetWindowInsetsAnimationCallback, RegisterView) but DisconnectHandler only calls UnregisterView which reverses only step 3. It correctly identifies RemoveViewWithLocalListener as the proper counterpart. The analysis demonstrates understanding of handler lifecycle symmetry. However, since I cannot independently verify the exact line numbers and code cited, and the diff was large, I give 4 rather than 5.
- The agent cites specific file and line references for the concern: 3.7/5 — The agent cites FlyoutViewHandler.Android.cs with approximate line numbers (~285-291 for ConnectHandler,
301-307 for DisconnectHandler), names specific methods (SetupViewWithLocalListener vs UnregisterView vs RemoveViewWithLocalListener), and shows code snippets. It also references ShellSectionRenderer.Destroy() for the warning. The line numbers are approximate ('') rather than exact, but the file and method references are specific enough to be actionable.
With-Skill Judge (Plugin)
Overall Score: 4.3/5
Reasoning: The agent successfully confirmed the user's hypothesis about a ConnectHandler/DisconnectHandler asymmetry, identified the exact location and nature of the bug, explained why it's wrong by contrasting with correct patterns in other handlers, assessed impact, and gave the correct verdict. The review also includes useful secondary findings (redundant null check, unbounded static list). The approach was efficient given the constraints (no gh token, had to fetch diff via web).
- If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM: 4.7/5 — The agent clearly identified the ConnectHandler/DisconnectHandler asymmetry as a ❌ Error-level issue and delivered a verdict of NEEDS_CHANGES with clear reasoning.
- The agent applies handler lifecycle rules from the expert reviewer dimensions (ConnectHandler/DisconnectHandler symmetry): 5/5 — The agent correctly identified that SetupViewWithLocalListener in ConnectHandler should be paired with RemoveViewWithLocalListener in DisconnectHandler, not just UnregisterView. It explained the three setup steps and showed only one teardown step was being performed. It also correctly noted all other sites in the PR follow the proper pairing.
- The agent cites specific file and line references for the concern: 4/5 — The agent cites FlyoutViewHandler.Android.cs specifically and quotes the relevant code snippets from both ConnectHandler and DisconnectHandler. It also references the specific method names (SetupViewWithLocalListener vs UnregisterView vs RemoveViewWithLocalListener). However, it doesn't provide exact line numbers, only file names and code excerpts.
Baseline Agent Output
Based on my review of the diff, here's the **ConnectHandler/DisconnectHandler asymmetry** inFlyoutViewHandler.Android.cs`:
Issue Found: FlyoutViewHandler.Android.cs
In ConnectHandler, _navigationRoot is set up with SetupViewWithLocalListener:
csharp MauiWindowInsetListener.SetupViewWithLocalListener(cl);
This method does 3 things: registers the view, sets ViewCompat.SetOnApplyWindowInsetsListener, and sets ViewCompat.SetWindowInsetsAnimationCallback.
In DisconnectHandler, cleanup only calls:
csharp MauiWindowInsetListener.UnregisterView(cl);
UnregisterView only removes the view from the static registry — it does not clear the two ViewCompat listeners. The correct counterpart should be RemoveViewWithLocalListener(cl), which is what all other callsites use (e.g., ShellContentFragment.Destroy(), ShellFlyoutTemplatedContentRenderer.Disconnect(), NavigationRootManager.Disconnect()).
Minor: Redundant null check in ShellSectionRenderer.cs
csharp if (_rootView != null) // outer check { if (_rootView is not null) // redundant inner check { MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView); }
The inner is not null is unnecessary since _rootView is already guaranteed non-null by the outer check.
Summary: The asymmetry means that when FlyoutViewHandler disconnects, the ViewCompat inset listener and animation callback remain attached to the _navigationRoot CoordinatorLayout, potentially causing stale callbacks or preventing GC of the listener.
`
With-Skill Agent Output (Isolated)
`
Now I have the full diff. Let me analyze the specific handler files for the ConnectHandler/DisconnectHandler asymmetry the user flagged.
Code Review — PR #32278
Independent Assessment
What this changes: Refactors Android window inset management from a single global GlobalWindowInsetListener (stored on the activity) to a per-CoordinatorLayout registry pattern (MauiWindowInsetListener) with static lookup. Views find their inset listener by walking the view hierarchy.
Inferred motivation: Better isolation of inset handling across complex navigation scenarios (modals, shell flyouts, multiple coordinator layouts), eliminating cross-contamination between separate UI trees.
Reconciliation with PR Narrative
The PR was already merged. Title: "[Android] Refactor WindowInsetListener to per-CoordinatorLayout registry with optimized lookups" — matches the code.
Prior Review Reconciliation
Unable to query GitHub API (token rejected by .NET Foundation enterprise policy). Cannot verify prior reviews.
Blast Radius Assessment
- Runs for all instances: Yes — all Android views using safe area/insets are affected
- Startup impact: Yes —
NavigationRootManager.Connectruns at startup - Static/shared state: Yes —
MauiWindowInsetListener._registeredViewsis astaticlist shared across the entire process
Findings
❌ Error — ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs
ConnectHandler sets up _navigationRoot with full listener attachment:
csharp MauiWindowInsetListener.SetupViewWithLocalListener(cl); // Internally does: ViewCompat.SetOnApplyWindowInsetsListener + SetWindowInsetsAnimationCallback + RegisterView
DisconnectHandler only unregisters from the static list:
csharp MauiWindowInsetListener.UnregisterView(cl); // Only removes from _registeredViews
This leaks the ViewCompat listener and animation callback on the CoordinatorLayout. The correct cleanup is RemoveViewWithLocalListener(cl) (which is used correctly in ShellContentFragment.Destroy, ShellSectionRenderer.Destroy, and ShellFlyoutTemplatedContentRenderer.Disconnect).
⚠️ Warning — Redundant null check in ShellSectionRenderer.Destroy
csharp if (_rootView != null) // outer check { if (_rootView is not null) // redundant inner check { MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView); }
The inner is not null check is always true when inside the outer != null block.
💡 Suggestion — Static _registeredViews list is unbounded with only opportunistic cleanup
Dead WeakReference entries are only cleaned during RegisterView and FindListenerForView calls. Under rapid navigation (Shell tab switches), entries may accumulate between cleanup opportunities. A periodic or threshold-based sweep would prevent unbounded growth.
Failure-Mode Probing
-
What happens during handler disconnect/reconnect (Shell tab switch)? The leaked
ViewCompatlistener on theCoordinatorLayoutin FlyoutView continues receiving inset callbacks on a stale handler reference. The_navigationRoot = nullline means a reconnect allocates a new layout, but the old listener reference persists on the old platform view until GC. -
What if
FindListenerForViewis called on a view whose ancestor wasUnregisterView'd but not fully cleaned? Returnsnull, so child views get no safe area padding — silent degradation rather than a crash.
CI Status
Unable to verify — GitHub API access denied by enterprise token policy. PR is already merged.
Verdict: NEEDS_CHANGES
Confidence: medium (handler/platform infrastructure change with static shared state)
Summary: The FlyoutViewHandler.DisconnectHandler uses UnregisterView where it should use RemoveViewWithLocalListener, leaving the ViewCompat inset listener attached after disconnect. This asymmetry mirrors exactly what you suspected — all other handler teardown paths in this PR use RemoveViewWithLocalListener for views set up with SetupViewWithLocalListener, making this the odd one out.
`
With-Skill Agent Output (Plugin)
`
Code Review — PR #32278
Independent Assessment
What this changes: Replaces the activity-level GlobalWindowInsetListener singleton with a per-view MauiWindowInsetListener that uses a static registry (_registeredViews) to allow child views to find their parent's inset listener by walking the view hierarchy.
Inferred motivation: The global singleton caused cross-contamination issues in complex navigation scenarios (modals, Shell tabs, FlyoutView) where different view hierarchies need independent inset management.
Findings
❌ Error — ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs
In ConnectHandler:
csharp MauiWindowInsetListener.RegisterParentForChildViews(platformView); // register only if (_navigationRoot is CoordinatorLayout cl) { MauiWindowInsetListener.SetupViewWithLocalListener(cl); // register + set ViewCompat listeners }
In DisconnectHandler:
csharp MauiWindowInsetListener.UnregisterView(platformView); // unregister only ✓ if (_navigationRoot is CoordinatorLayout cl) { MauiWindowInsetListener.UnregisterView(cl); // ❌ only unregisters from static list _navigationRoot = null; }
SetupViewWithLocalListener calls both ViewCompat.SetOnApplyWindowInsetsListener(view, listener) and ViewCompat.SetWindowInsetsAnimationCallback(view, listener), plus RegisterView. But UnregisterView only removes from the static registry — it does NOT clear the ViewCompat listeners or reset tracked views.
The correct counterpart is RemoveViewWithLocalListener(cl), which other handlers use correctly:
ShellContentFragment.Destroy()→RemoveViewWithLocalListener(_root)✓ShellSectionRenderer.Destroy()→RemoveViewWithLocalListener(_rootView)✓ShellFlyoutTemplatedContentRenderer.Disconnect()→RemoveViewWithLocalListener(coordinator)✓NavigationRootManager.Disconnect()→RemoveViewWithLocalListener(_managedCoordinatorLayout)✓
Impact: After DisconnectHandler, the stale listener still receives inset callbacks on _navigationRoot, potentially referencing disposed objects.
⚠️ Warning — Redundant null check in ShellSectionRenderer.Destroy()
csharp if (_rootView != null) // outer check { if (_rootView is not null) // redundant inner check { MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView); } ... }
The inner is not null check is redundant since we're already inside if (_rootView != null).
💡 Suggestion — Static _registeredViews list is never bounded
MauiWindowInsetListener._registeredViews is a static List<ViewEntry> that grows with weak references. Dead references are only cleaned during RegisterView or FindListenerForView calls. If views are created/destroyed rapidly without triggering these paths, the list grows unbounded with dead WeakReference entries. Consider periodic compaction or a size-triggered cleanup.
Blast Radius Assessment
- Runs for all instances: Yes — every view that uses insets on Android goes through this registry
- Startup impact: Yes —
WindowHandlerandNavigationRootManagerset up listeners at window creation - Static/shared state: Yes —
_registeredViewsis a static list shared across all handlers
Verdict: NEEDS_CHANGES
Confidence: medium (platform-specific handler/UI plumbing + PR is already merged)
Summary: The FlyoutViewHandler.DisconnectHandler uses UnregisterView where it should use RemoveViewWithLocalListener, leaving ViewCompat listeners attached after disconnect. This is a genuine asymmetry that every other handler in this PR avoids. Since this PR is already merged, a follow-up fix should be filed.
`
kubaflo
left a comment
There was a problem hiding this comment.
🤖 Multi-model code review — comment (draft PR)
Three models reviewed this draft independently (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro), then cross-pollinated. Non-blocking WIP feedback.
Consensus: NEEDS_DISCUSSION — genuinely high-quality scenario seeding; the open item is why skill-validation is red and how to get it green before un-drafting.
This is excellent, well-grounded work 👍
Both new scenarios are substantively sound and tied to verified regressions: the cited PRs are real merged PRs whose actual diffs contain the exact symbols the assertions reference — GetGradientData(1.0f) at the four non-shadow sites vs GetGradientData(shadowOpacity) on shadow paths (the precise asymmetry the rubric describes; #35280 is p/0/regressed-in-10.0.60), and foreach (var overlay in mauiMkMapView.Overlays) in OnMapClicked (#34910, a confirmed map NRE). The YAML parses, all 11 regexes compile, and the SKILL.md +6/-6 cleanup is correct (splitting the cap column and reclassifying "prior ❌ findings" as a Rule #5 verdict override fixes a real category error). The regex-based assertions are far more discriminating than substring matches.
Why skill-validation is red — and it's not just "aspirational"
Static validation passed and the eval ran; the red is the dotnet/skills validator's differential gate (skilled-vs-baseline improvement ≥10%, significant). It scored −5.6% with failureKind: skill_not_activated — the skill didn't activate on the gradient prompt in most arms, so skilled≈baseline and the new scenarios actually drag the suite's improvement negative (the two worst contributors). Implication: relaxing the regexes alone won't turn CI green — the prompts need to reliably activate the skill, and it's worth confirming these belong in a differential harness. Two concrete tweaks (inline):
- Over-strict
**Confidence:** medium|low(line 212 / ~248) — the single most-failing assertion; it fails even when the skill otherwise nails the review (4/5), and a definitive NRE/hardcoded-alpha finding can legitimately behigh. Relax it or accepthigh. - Echoable evidence regex (line 71) —
MauiDrawable.Androidis already in the prompt; require a non-prompt symbol so it actually proves the code was read.
Coordination
Same code-review/tests/eval.yaml as #34884 (also seeding scenarios) and #35942 (migrating the harness to Vally) — merge-order/rework hazard; sequence them.
Independent verdicts: Opus 4.8 — NEEDS_DISCUSSION (high) · GPT-5.5 — NEEDS_DISCUSSION (med) · Gemini 3.1 Pro — LGTM (high). All three rate the scenarios high-quality; the discussion is about the differential-gate CI failure + the two assertion tweaks.
| # if the reviewer correctly refutes the hypothesis with a regression finding, | ||
| # confidence may also be low. `\s*` tolerates a stray space the agent might emit. | ||
| - type: "output_matches" | ||
| pattern: '\*\*Confidence:\*\*\s*([Mm]edium|[Ll]ow)' |
There was a problem hiding this comment.
skill-validation run, **Confidence:** medium|low (here + the iOS scenario ~line 248) is the single most-failing assertion — it failed in 5 of 6 arms, including the arm where the skill activated and otherwise scored 4/5 (iOS map / isolated). It's consistent with the skill's Step-6 cap (platform-handler plumbing → max medium), BUT a reviewer who definitively finds the NRE crash or the hardcoded 1.0f alpha can legitimately emit **Confidence:** high (high confidence the regression is real). For a regression-detection corpus the catch is already proven by the verdict-marker + symbol-evidence + vocabulary assertions; coupling pass/fail to this calibration nuance risks failing a correct, regression-catching review. Consider dropping/relaxing it for these two scenarios, or accepting high when a concrete regression finding is present.
Introduces a scheduled gh-aw workflow that grows the code-review skill's Vally regression-detection eval corpus automatically. When a regression-fix PR merges, a deterministic PowerShell pre-pass identifies the regression- introducing PR and its merge SHA; the agent then drafts a hermetic eval.vally.yaml stimulus (frozen worktree pinned to that SHA, no PR/issue numbers in the prompt, structural-floor grader + LLM judge + rubric) and opens a single draft, test-only PR. skill-validation.yml auto-validates it. - .github/scripts/Find-RegressionFixPRs.ps1: deterministic pre-pass (no AI) - .github/scripts/Find-RegressionFixPRs.Tests.ps1: 26 Pester tests (all pass) - .github/workflows/regression-corpus-scanner.md: gh-aw scanner workflow - .github/workflows/regression-corpus-scanner.lock.yml: compiled lock (v0.79.8) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3b19d14 to
8829bef
Compare
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment whether this change resolves your issue. Thank you!
What this PR does
Adds an automated regression-corpus scanner that grows the
code-reviewskill's Vally regression-detection eval corpus on its own.This realizes the automation the earlier version of this PR only pointed toward. The manual seed work this branch used to contain (two hand-authored eval scenarios + a confidence-cap table cleanup) was fully superseded by #35942 (the Vally migration), so the branch was rebased onto
mainand repurposed to build the pipeline itself.How it works
A scheduled gh-aw workflow runs weekly (and on demand via
workflow_dispatch):Find-RegressionFixPRs.ps1, no AI) — finds recently mergedi/regressionfix PRs, extracts the regression-introducing PR from the fix's body / linked issue, resolves that introducing PR's merge SHA and changed files, dedups against the existing corpus, and emits a boundedcandidates.json. The agent only activates when there's something to draft.eval.vally.yamlstimulus:environment.git)git diff HEAD^ HEADLGTMverdict — a silent pass is the failure mode under test), plus an LLM judge and a rubric authored from the real diffskill-validation.ymlthen auto-runs Vally against the new stimulus. If it passes, the corpus gains a permanent regression guard; if it fails, that's a real signal the skill has a gap for a human to close.Safety & scope
SKILL.md.if-no-changes: ignore.Files
.github/scripts/Find-RegressionFixPRs.ps1— deterministic pre-pass.github/scripts/Find-RegressionFixPRs.Tests.ps1— 26 Pester tests (all passing).github/workflows/regression-corpus-scanner.md— gh-aw workflow source.github/workflows/regression-corpus-scanner.lock.yml— compiled lock (gh aw compile, 0 errors / 0 warnings)Live demonstration
To show the pipeline end-to-end, the scanner was run manually against a real merged regression fix (#35803, introduced by #31931, regression issue #35756 — "Changing TabbedPage.CurrentPage from modal page no longer fires OnNavigatedTo on PopModalAsync"). The exact artifact it auto-drafts — one new hermetic Vally stimulus (
navigatedto-latch-suppresses-reentry) appended to the corpus — is open as a separate draft PR: #35977.Draft pending review of the workflow logic and the rubric-authoring instructions.