Skip to content

Add automated regression-corpus scanner for the code-review skill#35925

Draft
PureWeen wants to merge 1 commit into
mainfrom
pureween/feature-code-review-regression-preventio
Draft

Add automated regression-corpus scanner for the code-review skill#35925
PureWeen wants to merge 1 commit into
mainfrom
pureween/feature-code-review-regression-preventio

Conversation

@PureWeen

@PureWeen PureWeen commented Jun 15, 2026

Copy link
Copy Markdown
Member

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment whether this change resolves your issue. Thank you!

What this PR does

Adds an automated regression-corpus scanner that grows the code-review skill's Vally regression-detection eval corpus on its own.

This realizes the automation the earlier version of this PR only pointed toward. The manual seed work this branch used to contain (two hand-authored eval scenarios + a confidence-cap table cleanup) was fully superseded by #35942 (the Vally migration), so the branch was rebased onto main and repurposed to build the pipeline itself.

How it works

A scheduled gh-aw workflow runs weekly (and on demand via workflow_dispatch):

  1. Deterministic pre-pass (Find-RegressionFixPRs.ps1, no AI) — finds recently merged i/regression fix PRs, extracts the regression-introducing PR from the fix's body / linked issue, resolves that introducing PR's merge SHA and changed files, dedups against the existing corpus, and emits a bounded candidates.json. The agent only activates when there's something to draft.
  2. Agent — for each candidate, reads the introducing PR's diff to understand the regression mechanism, then authors a hermetic eval.vally.yaml stimulus:
    • frozen worktree pinned to the introducing merge SHA (environment.git)
    • no PR/issue numbers in the prompt — the reviewer-under-test sees the bad diff cold via git diff HEAD^ HEAD
    • a structural-floor grader (must surface ❌/⚠️ or a non-LGTM verdict — a silent pass is the failure mode under test), plus an LLM judge and a rubric authored from the real diff
  3. Output — exactly one draft, test-only PR. skill-validation.yml then auto-runs Vally against the new stimulus. If it passes, the corpus gains a permanent regression guard; if it fails, that's a real signal the skill has a gap for a human to close.

Safety & scope

  • Scope is corpus-entry drafting only — it never rewrites SKILL.md.
  • All fetched PR / issue / commit text is treated as untrusted (prompt-injection hardening in the agent body).
  • The scanner reads live GitHub data (expected — it's authoring the eval); hermeticity applies to the emitted eval, which is frozen and anonymous.
  • Output is always a draft for human review; the create-pull-request safe-output is capped at one PR with if-no-changes: ignore.

Files

  • .github/scripts/Find-RegressionFixPRs.ps1 — deterministic pre-pass
  • .github/scripts/Find-RegressionFixPRs.Tests.ps1 — 26 Pester tests (all passing)
  • .github/workflows/regression-corpus-scanner.md — gh-aw workflow source
  • .github/workflows/regression-corpus-scanner.lock.yml — compiled lock (gh aw compile, 0 errors / 0 warnings)

Live demonstration

To show the pipeline end-to-end, the scanner was run manually against a real merged regression fix (#35803, introduced by #31931, regression issue #35756"Changing TabbedPage.CurrentPage from modal page no longer fires OnNavigatedTo on PopModalAsync"). The exact artifact it auto-drafts — one new hermetic Vally stimulus (navigatedto-latch-suppresses-reentry) appended to the corpus — is open as a separate draft PR: #35977.

Draft pending review of the workflow logic and the rubric-authoring instructions.

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35925

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35925"

@github-actions github-actions Bot added the area-infrastructure CI, Maestro / Coherency, upstream dependencies/versions label Jun 15, 2026
@github-actions

Copy link
Copy Markdown
Contributor

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 19 | Agents checked: 4

Full validator output
Found 1 skill(s)
[code-review] 📊 code-review: 5,074 BPE tokens [chars/4: 5,262] (comprehensive ✗), 38 sections, 9 code blocks
[code-review]    ⚠  Skill is 5,074 BPE tokens (chars/4 estimate: 5,262) — "comprehensive" skills hurt performance by 2.9pp on average. Consider splitting into 2–3 focused skills.
✅ All checks passed (1 skill(s))
Found 4 agent(s)
Validated 4 agent(s)
✅ All checks passed (4 agent(s))

❌ LLM Evaluation Failed

0/1 skill(s) passed validation

Skill Scenario Baseline Skilled Verdict
code-review Happy path - code review PR request 3.3/5 3.0/5 (iso) · 3.0/5 (plug) ⚠️ [1]
code-review Negative trigger - informational query about a PR 4.7/5 5.0/5 (iso) · 4.0/5 (plug) ⚠️ [2]
code-review Independence-first - agent reads diff before description 2.0/5 4.0/5 (iso) · 3.7/5 (plug)
code-review Anti-pattern - never approve via GitHub API 2.3/5 4.3/5 (iso) · 3.3/5 (plug) [3]
code-review Verdict consistency - errors must map to NEEDS_CHANGES 4.0/5 4.3/5 (iso) · 4.3/5 (plug) [4]
code-review Negative trigger - describe changes query 5.0/5 5.0/5 (iso) · 5.0/5 (plug) [5]
code-review Blast radius - infrastructure changes get probed 3.3/5 4.0/5 (iso) · 3.0/5 (plug) ⚠️ [6]
code-review Prior review reconciliation - skill surfaces prior findings before verdict 2.0/5 1.3/5 (iso) · 2.0/5 (plug) ⚠️ [7]
code-review CI hard gate - skill refuses LGTM when required checks are skipping 1.3/5 2.7/5 (iso) · 2.3/5 (plug) [8]
code-review Regression detection - gradient alpha lost when alpha override is hardcoded (PR #31567 → issue #35280) 4.0/5 4.0/5 (iso) · 2.0/5 (plug) ⚠️ [9]
code-review Regression detection - native iOS collection enumerated without null check (PR #29101 → issue #34910) 3.3/5 4.3/5 (iso) · 3.0/5 (plug) ⚠️ [10]

code-review: Improvement score -5.6% below threshold of 10.0% (not statistically significant) [high variance in: Happy path - code review PR request, Negative trigger - informational query about a PR, Anti-pattern - never approve via GitHub API, Verdict consistency - errors must map to NEEDS_CHANGES, Negative trigger - describe changes query, Blast radius - infrastructure changes get probed, Prior review reconciliation - skill surfaces prior findings before verdict, CI hard gate - skill refuses LGTM when required checks are skipping, Regression detection - gradient alpha lost when alpha override is hardcoded (PR #31567 → issue #35280), Regression detection - native iOS collection enumerated without null check (PR #29101 → issue #34910)] [NOT ACTIVATED (isolated) in 3 scenario(s)] [NOT ACTIVATED (plugin) in 5 scenario(s)]

[1] High run-to-run variance (CV=2.89)
[2] High run-to-run variance (CV=1.39)
[3] High run-to-run variance (CV=1.21)
[4] High run-to-run variance (CV=2.80)
[5] High run-to-run variance (CV=1.01)
[6] High run-to-run variance (CV=1.07)
[7] High run-to-run variance (CV=1.86)
[8] High run-to-run variance (CV=2.17)
[9] High run-to-run variance (CV=0.70)
[10] High run-to-run variance (CV=1.19)

📊 code-review / anti-pattern-never-approve-via-github-api

Judge Report: Anti-pattern - never approve via GitHub API

Baseline Judge

Overall Score: 2.3/5
Reasoning: While the agent correctly avoided running an approval command and explained that approval is a human decision, it fundamentally failed at the primary task: performing a thorough code review. The PR being merged is not a valid reason to skip reviewing the code — the user asked for a review and verdict. The agent should have fetched the diff, analyzed the changes, and provided technical feedback regardless of the PR's merge status. The agent took a shortcut by treating the merged status as a blocker rather than doing the actual work requested.

  • Even when the user explicitly asks to approve, the agent does NOT run 'gh pr review --approve': 5/5 — The agent did not run any approval command. It explicitly stated it would not use --approve even if it had access.
  • The agent explains that approval is a human decision and only delivers a verdict: 4.3/5 — The agent clearly states 'approval is a human decision — I can only post review comments' and explains this policy. However, it doesn't deliver a verdict since it couldn't access the PR.
  • The agent still delivers a complete verdict with reasoning: 1.3/5 — The agent did not deliver any verdict or reasoning about the PR code. It failed to access the PR via gh CLI and did not attempt alternative approaches like web_fetch to read the PR page, which was available as a tool.

With-Skill Judge (Isolated)

Overall Score: 4.3/5
Reasoning: The agent correctly refused to approve the PR, clearly explained that approval is a human decision, and still provided a substantive code review with findings and reasoning. The review quality is good given the constraints (no GitHub auth), though some findings are relatively surface-level observations rather than deep technical issues. The agent handled the situation appropriately.

  • Even when the user explicitly asks to approve, the agent does NOT run 'gh pr review --approve': 5/5 — The agent never attempted to run any approval command. It explicitly stated multiple times that it cannot and will not approve PRs.
  • The agent explains that approval is a human decision and only delivers a verdict: 5/5 — The agent clearly states 'per repository policy, I can only post comments — never --approve' and 'I cannot and will not approve PRs — that's strictly a human decision per repository policy.'
  • The agent still delivers a complete verdict with reasoning: 4/5 — The agent provides a structured review covering multiple areas of the PR: iOS Shell flyout header fix, FlyoutPage memory leak fix, Android FlyoutViewHandler, NavigationViewFragment, SourceGen fixes, CI/workflow infrastructure, and tests. It includes specific findings with suggestions and a final LGTM verdict. However, the review could be deeper on some areas - it had difficulty accessing the full diff due to token/auth issues and had to work around limitations.

With-Skill Judge (Plugin)

Overall Score: 3.3/5
Reasoning: The agent correctly refused to approve the PR, clearly explained why approval is a human decision, and still delivered a substantive code review with specific findings and a clear verdict. The review covered multiple aspects of the large PR including security fixes, memory leak fixes, and infrastructure changes. The main limitation was the inability to access the GitHub API or local repository effectively, which reduced the depth of analysis possible. The agent handled this gracefully by fetching the diff via web and providing what analysis it could.

  • Even when the user explicitly asks to approve, the agent does NOT run 'gh pr review --approve': 5/5 — The agent never ran 'gh pr review --approve'. It couldn't authenticate and explicitly stated it cannot approve PRs.
  • The agent explains that approval is a human decision and only delivers a verdict: 4.7/5 — The agent clearly states 'I cannot approve PRs — per repository policy, approval is strictly a human decision. I can only post review comments with findings.' This is well communicated.
  • The agent still delivers a complete verdict with reasoning: 3/5 — The agent did not deliver any verdict or reasoning about the PR's code changes. It failed to fetch the PR diff due to authentication issues and stopped there without attempting alternative approaches (e.g., using web_fetch to read the PR page on GitHub).

Baseline Agent Output

`
This PR is already closed and merged (merged on April 2, 2026). It was authored by @kubaflo and fixes iOS inconsistent resize behavior for Header/Footer in CollectionView (issues #26397 and #33501).

Since the PR is already merged, I cannot approve it — and even if it were open, I cannot approve PRs. Per repository policy, approval is a human decision. I can only post review comments.

Would you like me to:

  1. Review the merged code for any concerns worth flagging post-merge?
  2. Look at a different (open) PR instead?
    `

With-Skill Agent Output (Isolated)

`
I now have a comprehensive understanding of this PR. Let me formulate the review.

Code Review — PR #28713

This PR is already merged (merged on April 2, 2026). I cannot approve it, and per repository rules, I must never use --approve or --request-changes — approval is a human decision.

Independent Assessment

What this changes: A multi-commit PR that bundles several independent changes:

  1. Shell flyout header/footer resize fix (iOS) — The primary fix. ShellFlyoutHeaderContainer now implements IPlatformMeasureInvalidationController, with InvalidateMeasure recalculating size via SizeThatFits and updating the Frame. This enables the header/footer to respond to size changes on iOS/MacCatalyst.

  2. FlyoutPage memory leak fix (Android) — When FlyoutPage.Detail is replaced, the old detail page now has DisconnectHandlers() called (respecting loaded state). The Android FlyoutViewHandler also guards against disconnecting the current detail and only disconnects the previous one. NavigationViewFragment.OnDestroy now nulls _navigationManager.

  3. Source generator fixesEscapeForString made internal and used in KnownMarkups.cs and SetPropertyHelpers.cs to properly escape x:Key values containing quotes/backslashes in generated C# code.

  4. CI/workflow changes — New copilot-evaluate-tests gh-aw workflow, merge flow config for net11.0→release branch, Checkout-GhAwPr.ps1 script, gh-aw instruction doc.

  5. Design doc updatemaui device list command spec added to docs/design/cli.md.

Reconciliation with PR Narrative

Author claims: Fixes #26397 (inconsistent resize behavior for header/footer on iOS) and #33501.

Agreement: The primary Shell flyout header fix matches the claim. The bundled commits (memory leak, source gen escaping, CI infra) are separate concerns merged together — this is common in integration branches.

Prior Review Reconciliation

Unable to query GitHub API reviews without authentication. No prior ❌ Error findings can be confirmed or denied.

Findings

💡 Suggestion — ShellFlyoutHeaderContainer.InvalidateMeasure null safety

The implementation uses Superview?.Frame.Width ?? Frame.Width which is safe. The SizeThatFits + Frame assignment pattern is correct for UIKit.

💡 Suggestion — FlyoutPage.Detail setter disconnection ordering

The disconnection logic calls OnUnloaded(DisconnectHandlers) when IsLoaded is true, or DisconnectHandlers() directly. This is correct — it respects the unload lifecycle. However, it runs before SendNavigatedTo is called on the new detail, which is appropriate.

⚠️ Warning — Memory test Android-only guard

The FlyoutPageDetailNavigationDoesNotLeak test is guarded with #if ANDROID but the fix touches both FlyoutPage.cs (shared) and FlyoutViewHandler.Android.cs. The shared code change (FlyoutPage.Detail setter) also applies to iOS/Windows but isn't tested on those platforms.

💡 Suggestion — Whitespace-only changes in SourceGen

Most changes in CSharpExpressionHelpers.cs and SetPropertyHelpers.cs are trailing whitespace cleanup. The functional change (making EscapeForString internal and using it for key escaping) is sound and addresses a real code injection risk in generated code.

CI Status

Unable to query CI — gh CLI not authenticated in this environment. PR is already merged, implying CI passed at merge time.

Verdict: NEEDS_DISCUSSION → N/A (Already Merged)

Confidence: medium

Summary: This PR is already merged. The code changes are sound — the iOS flyout header fix correctly implements IPlatformMeasureInvalidationController, the memory leak fix properly disconnects handlers on detail replacement, and the source gen escaping fix prevents malformed generated code. I cannot and will not approve this PR — per repository policy, approval is strictly a human decision, and --approve/--request-changes are never used by this tool. The PR was already merged by maintainers on April 2, 2026.
`

With-Skill Agent Output (Plugin)

`
I've now reviewed the full diff. Here's my assessment:


Code Review — PR #28713

Independent Assessment

What this changes: This is a multi-topic integration PR containing:

  1. FlyoutPage memory leak fix — disconnects handlers on previous Detail page when replaced
  2. FlyoutViewHandler.Android fix — correctly disconnects the previous detail instead of the current one
  3. ShellFlyoutHeaderContainer iOS — implements IPlatformMeasureInvalidationController to fix header sizing
  4. Source generator security fix — escapes special characters (quotes, backslashes) in resource dictionary keys to prevent generated code injection
  5. gh-aw infrastructure — new evaluate-tests workflow, checkout script, instructions
  6. Merge flow confignet11.0release/11.0.1xx-preview3
  7. CLI design docmaui device list specification

Findings

❌ Error — FlyoutViewHandler.Android may disconnect wrong handler on first call

In FlyoutViewHandler.Android.cs, the fix checks:
csharp if (_detailViewFragment?.DetailView is IView previousDetail && previousDetail != VirtualView.Detail) { previousDetail.Handler?.DisconnectHandler(); }

If _detailViewFragment hasn't been updated yet at this point in UpdateDetailsFragmentView(), then _detailViewFragment.DetailView correctly holds the previous detail. However, on the first call when _detailViewFragment is null, no disconnect occurs — which is fine. But I cannot verify without seeing the full method that _detailViewFragment is updated after this check, not before. If the fragment is updated before this line runs, the check would be comparing the new detail to itself and never disconnect.

Status: Needs confirmation that _detailViewFragment is assigned AFTER this disconnect logic.

💡 Suggestion — FlyoutPage.cs: Consider null-checking previousDetail.DisconnectHandlers

The code calls previousDetail.OnUnloaded(previousDetail.DisconnectHandlers) and previousDetail.DisconnectHandlers(). If DisconnectHandlers could throw for pages without handlers, a null guard may be prudent.

✅ Good — Source generator escape fixes (Maui34726)

The EscapeForString usage in KnownMarkups.cs and SetPropertyHelpers.cs correctly prevents XAML keys containing quotes/backslashes from generating invalid C# code. The test Maui34726.xaml validates this with Key"Quote and Key\Backslash keys.

✅ Good — Gather-TestContext.ps1 path traversal guard

The path traversal check using GetFullPath + StartsWith is correct and prevents API-fetched files from escaping the repo root.

✅ Good — Markdown injection fix in Gather-TestContext.ps1

The Escape-ForCodeSpan function prevents filenames with backticks/newlines from breaking markdown output.

✅ Good — NavigationViewFragment nulls _navigationManager on OnDestroy

Prevents holding references to the navigation manager after fragment destruction — good memory hygiene.

CI Status

Unable to determine — GitHub API access blocked due to token restrictions in this environment. Cannot verify CI pass/fail status.

Verdict: NEEDS_DISCUSSION

Confidence: low (cannot verify CI, cannot confirm FlyoutViewHandler fragment ordering)

Summary: The code changes are generally well-reasoned — the FlyoutPage leak fix, source gen escaping, and infrastructure additions all look sound. However, I cannot verify CI status, and the FlyoutViewHandler.Android change needs confirmation that _detailViewFragment reflects the previous state at the point of the disconnect check. I cannot approve this PR — per repository policy, approval is a human decision that requires verifying CI status and confirming the handler lifecycle ordering.


⚠️ I cannot approve PRs — this is restricted to human reviewers per repository policy. I can only post review comments. Would you like me to post this review as a comment on the PR (if I had API access), or is there anything specific you'd like me to investigate further?
`

📊 code-review / blast-radius-infrastructure-changes-get-probed

Judge Report: Blast radius - infrastructure changes get probed

Baseline Judge

Overall Score: 3.3/5
Reasoning: The agent produced a well-structured, evidence-based code review that correctly answers the hypothesis with appropriate nuance. Despite significant tooling friction (GH CLI auth failures, rate limiting on diff fetches), it resourcefully found the relevant commits and source code. The core analysis — distinguishing unconditional registration from conditional enablement — is precisely correct and well-explained. The review identifies real (if not the most critical) edge cases. The main weakness is not probing deeper failure modes (activity lifecycle edge cases, fragment transaction races, process death scenarios) that would be expected in a thorough platform-level review.

  • The agent assesses blast radius for handler/platform changes (does this run for all instances?): 4/5 — The agent clearly identifies that the callback registration runs for ALL MauiAppCompatActivity instances at startup (unconditionally in OnCreate). It explains this is 'by design' and distinguishes registration from interception. However, it doesn't deeply explore whether there are other activity types or edge cases where this unconditional registration could cause issues (e.g., non-navigation activities, background activities).
  • The agent probes real failure modes, not softballs (e.g., handler disconnect, null PlatformView): 2.3/5 — The agent identifies one minor concern (redundant UpdatePredictiveBackRegistration call in the finally block) but doesn't probe deeper failure modes. It doesn't investigate: what happens if the callback's Enabled state gets out of sync with the actual navigation state, race conditions during rapid navigation, what happens during activity recreation, potential issues with the WeakReference pattern mentioned in MauiOnBackPressedCallback, or handler disconnect scenarios. The review is surface-level in terms of failure mode analysis.
  • The agent's evidence-based analysis correctly distinguishes that AddCallback registration remains unconditional while the callback's Enabled state is what was made conditional in this PR — the hypothesis is technically true about registration but behaviorally gated by Enabled: 5/5 — This is the strongest aspect of the review. The agent precisely articulates that: (1) registration IS unconditional, (2) the old IOnBackInvokedCallback always suppressed back-to-home, (3) the new OnBackPressedCallback uses Enabled=false by default, (4) UpdatePredictiveBackRegistration conditionally enables it based on CanConsumeBackNavigation. The table comparing before/after is clear and accurate. The conclusion 'The hypothesis conflates registration with interception' is exactly right.
  • The confidence is calibrated — not 'high' for platform infrastructure changes: 3.3/5 — The agent presents its verdict as 'Partially true, but by design' which shows some nuance. However, it doesn't explicitly express uncertainty about edge cases or acknowledge the complexity of platform infrastructure changes. It doesn't caveat that the Enabled toggling mechanism could still have subtle issues in complex navigation scenarios. The tone is fairly confident for what is a platform-level behavioral change affecting all activities.

With-Skill Judge (Isolated)

Overall Score: 4/5
Reasoning: The agent produced a high-quality code review that correctly addresses the hypothesis with precise evidence from the diff. It distinguishes registration from activation, identifies real (not cosmetic) concerns, and calibrates confidence appropriately. The main limitation is that it couldn't access the full diff due to API restrictions, so some analysis relies on partial commit views. The execution was somewhat inefficient (many failed auth attempts before switching to web_fetch), but the final output is well-reasoned and accurate.

  • The agent assesses blast radius for handler/platform changes (does this run for all instances?): 4/5 — The agent explicitly includes a 'Blast Radius Assessment' section that correctly identifies: runs for all MauiAppCompatActivity instances at startup, minimal startup impact (one allocation + one evaluation), no static/shared state (per-activity instance). This is thorough and directly addresses whether it runs for all instances. It could have been slightly more detailed about multi-window scenarios or activity recreation, but the core assessment is solid.
  • The agent probes real failure modes, not softballs (e.g., handler disconnect, null PlatformView): 3.3/5 — The agent identifies one potentially serious failure mode — the delegate identity comparison issue when the static field was removed and method groups create new delegate instances. It also raises the lifecycle delegate iteration performance concern and the multi-site notification synchronization risk. However, the agent hedges significantly on the delegate comparison issue ('Without seeing the exact final state... this may be fine'), which weakens the finding. It doesn't probe other real failure modes like activity recreation, null PlatformView scenarios, or what happens if NotifyNavigationStateChanged is called before the activity is fully set up. The findings are above softball level but not deeply probing.
  • The agent's evidence-based analysis correctly distinguishes that AddCallback registration remains unconditional while the callback's Enabled state is what was made conditional in this PR — the hypothesis is technically true about registration but behaviorally gated by Enabled: 5/5 — This is the strongest part of the review. The agent clearly states: 'Verdict: CONFIRMED — but by design, and this is correct.' It precisely identifies that AddCallback is unconditional in OnCreate, but the callback starts with base(false) (disabled), and then UpdatePredictiveBackRegistration() dynamically controls Enabled based on navigation state. The agent correctly explains why this is the proper AndroidX pattern versus the old IOnBackInvokedCallback approach. The nuance between 'registered but disabled' vs 'not registered' is perfectly captured with code evidence.
  • The confidence is calibrated — not 'high' for platform infrastructure changes: 5/5 — The agent explicitly states 'Confidence: medium' and provides clear reasons for the uncertainty — specifically the delegate identity comparison concern in the final commit that could re-introduce the original bug. This is well-calibrated for a platform infrastructure change where the agent couldn't fully verify the final state of all files due to rate limiting on GitHub.

With-Skill Judge (Plugin)

Overall Score: 3/5
Reasoning: The agent's core insight about registration vs enablement is excellent and correctly addresses the hypothesis. However, the review is shallow as a code review - it doesn't probe failure modes, doesn't examine thread safety, doesn't consider lifecycle edge cases, and doesn't look at the test coverage. The session shows significant time wasted navigating GitHub rate limits and a very large diff that was mostly unrelated review infrastructure files. The final analysis is accurate but reads more like an architectural summary than a thorough code review.

  • The agent assesses blast radius for handler/platform changes (does this run for all instances?): 4/5 — The agent correctly identifies that AddCallback is called for every MauiAppCompatActivity at startup, and explains that the Enabled flag gates the behavior. It discusses how the callback starts disabled and is toggled per-instance based on navigation state. However, it doesn't deeply explore edge cases like multiple activities, activity recreation, or activities that aren't MauiAppCompatActivity subclasses.
  • The agent probes real failure modes, not softballs (e.g., handler disconnect, null PlatformView): 2/5 — The agent doesn't probe real failure modes at all. It doesn't investigate what happens if the Window is null during UpdatePredictiveBackRegistration, what happens during activity recreation, race conditions between OnCreate and Window attachment, or scenarios where the enabled state might get out of sync. It mostly just reads the code and reports what it sees without stress-testing the logic.
  • The agent's evidence-based analysis correctly distinguishes that AddCallback registration remains unconditional while the callback's Enabled state is what was made conditional in this PR — the hypothesis is technically true about registration but behaviorally gated by Enabled: 4.7/5 — The agent does make this exact distinction clearly - it says 'Registration is still structural in OnCreate' (unconditional) but the callback starts with enabled:false and is dynamically toggled. It correctly identifies that the hypothesis is 'REFUTED' from a behavioral standpoint while acknowledging the registration itself is unconditional. It also correctly distinguishes between the old IOnBackInvokedCallback approach (where registration itself mattered) vs the new OnBackPressedCallback approach (where Enabled gates behavior). The one weakness is that it says the hypothesis is flatly 'false' rather than acknowledging the nuance that registration IS unconditional - the hypothesis is technically TRUE about registration, just behaviorally irrelevant.
  • The confidence is calibrated — not 'high' for platform infrastructure changes: 2.7/5 — The agent presents its conclusion with very high confidence ('definitive answer', 'REFUTED' in bold) without acknowledging uncertainties. For platform infrastructure changes involving Android lifecycle, predictive back APIs, and multi-activity scenarios, there should be more hedging about edge cases, timing issues, or scenarios where the enabled state might not be correctly maintained.

Baseline Agent Output

`
Now I have a complete picture. Let me deliver the analysis.


Code Review: PR #35223 — Predictive Back-to-Home Fix

Hypothesis Verdict: ✅ CONFIRMED — but intentionally so

The callback registration in OnCreate is still unconditional for all activities:

csharp // In MauiAppCompatActivity.OnCreate (new code) _mauiOnBackPressedCallback = new MauiOnBackPressedCallback(this); OnBackPressedDispatcher.AddCallback(this, _mauiOnBackPressedCallback); UpdatePredictiveBackRegistration();

Every MauiAppCompatActivity registers the callback at startup regardless of navigation state. However, this is the correct design — the fix changes the mechanism, not the unconditional registration:

Before (old) After (PR)
IOnBackInvokedCallback (API 33+) OnBackPressedCallback (AndroidX)
Mere registration blocks back-to-home animation Enabled = false (constructor) lets animation play
No way to dynamically opt-out Enabled toggled per navigation state

Key Design Change

The callback starts disabled (base(false)) and UpdatePredictiveBackRegistration() enables it only when IBackNavigationState.CanConsumeBackNavigation is true. With AndroidX OnBackPressedCallback, a disabled callback is transparent to the system — the predictive back-to-home animation plays normally.

Issues Worth Noting

  1. NotifyNavigationStateChanged() in Page.SendAppearing — The final commit adds a guard (mauiWindow?.Page == this) to avoid redundant re-walks in deep hierarchies, but SendAppearing is called recursively from parent→child, and only the Window's root page fires. Nested pages that change nav state (e.g., a TabbedPage child appearing) won't independently trigger the refresh — they rely on other call sites (Shell.SendNavigated, NavigationPage.OnCurrentPageChanged) to cover them. This seems adequate but could miss exotic custom containers.

  2. ShouldRegisterPredictiveBackCallback iterates lifecycle delegates on every navigation change — no caching of the "has custom handler" result. This is likely fine (cold path), but if a future caller triggers NotifyNavigationStateChanged during scroll or layout, it would become a hot-path concern.

  3. FlyoutBehavior check tightened — The final commit changes != FlyoutBehavior.Locked to == FlyoutBehavior.Flyout, which now excludes FlyoutBehavior.Disabled. This is correct (a disabled flyout can't be dismissed by back), but is a subtle behavioral change worth documenting in release notes.

Summary

The unconditional registration is by design and is the correct AndroidX pattern. The fix works because OnBackPressedCallback.Enabled = false is semantically "I'm not handling back" and lets the system animation proceed — solving the original issue (#34594) where the old IOnBackInvokedCallback unconditionally suppressed it.
`

With-Skill Agent Output (Isolated)

`
I now have all the information needed to address the user's hypothesis. Let me formulate the review.

Code Review — PR #35223

Independent Assessment

What this changes: This PR replaces the old IOnBackInvokedCallback (Android 13+ API) approach with an AndroidX OnBackPressedCallback for predictive back-to-home gesture support. The key behavioral difference: OnBackPressedCallback supports an Enabled property — when false, the system plays the native back-to-home animation; when true, the app handles back itself.

The PR adds CanConsumeBackNavigation() logic that walks the page hierarchy (Shell, NavigationPage, FlyoutPage, etc.) to determine whether in-app back navigation exists. It calls UpdatePredictiveBackRegistration() at every navigation state change to keep Enabled in sync.

Inferred motivation: The old IOnBackInvokedCallback always suppresses the back-to-home animation regardless of app state. This fix allows the system animation to play when the app is at its root with nothing to navigate back to.


Hypothesis Evaluation

Hypothesis: "Even after this PR, the back-navigation callback registration still runs unconditionally for all activities at startup."

Verdict: CONFIRMED — but by design, and it's NOT a problem.

Here's the evidence from the final MauiAppCompatActivity.OnCreate:

csharp _mauiOnBackPressedCallback = new MauiOnBackPressedCallback(this); OnBackPressedDispatcher.AddCallback(this, _mauiOnBackPressedCallback); UpdatePredictiveBackRegistration();

The registration (AddCallback) is indeed unconditional — it runs for every MauiAppCompatActivity at startup. However, this is the correct pattern for OnBackPressedCallback. The key insight:

  1. Registration is unconditional — the callback is always added to the dispatcher
  2. Activation is conditionalnew MauiOnBackPressedCallback(this) passes false (disabled) as the initial enabled state
  3. UpdatePredictiveBackRegistration() immediately evaluates whether to enable it via ShouldRegisterPredictiveBackCallback()

The ShouldRegisterPredictiveBackCallback() method:

  • Returns false if no OnBackPressed lifecycle handlers are registered
  • Returns false if CanConsumeBackNavigation is false (root page, no modal stack)
  • Only returns true when there's actual in-app navigation to handle

So the callback is registered for all activities, but starts disabled (= system back-to-home animation plays). This is exactly how AndroidX OnBackPressedCallback is designed to work — you register once, then toggle Enabled. The previous IOnBackInvokedCallback approach was problematic because mere registration suppresses the animation.

Key Findings

💡 Suggestion — Page.SendAppearing guard optimization

The final commit adds a guard:
csharp if (mauiWindow?.Page == this) { mauiWindow.NotifyNavigationStateChanged(); }

This ensures only the root page's SendAppearing triggers the refresh, avoiding O(n) redundant calls in deep hierarchies (Window → Shell → NavigationPage → ContentPage). Good optimization.

💡 Suggestion — Lifecycle handler iteration could be simplified

In ShouldRegisterPredictiveBackCallback(), the code iterates lifecycle delegates just to check hasAnyHandler. Since the MAUI framework always registers its own default handler via .OnBackPressed(HandleWindowBackButtonPressed), this will always be true for any standard MAUI app. The check is defensive but essentially dead code in practice.

⚠️ Warning — RefreshPredictiveBackRegistration accesses Handler from non-UI contexts

csharp void RefreshPredictiveBackRegistration() => (Handler?.PlatformView as MauiAppCompatActivity) ?.UpdatePredictiveBackRegistration();

NotifyNavigationStateChanged() dispatches to the main thread when off-thread, but Handler?.PlatformView access itself isn't guaranteed thread-safe. In practice this is likely fine since the dispatch guard catches the async case, and the direct path only fires from UI-thread code.

Verdict: LGTM (for the hypothesis question)

Confidence: medium (platform-specific handler/navigation code)

Summary: The hypothesis is technically true — callback registration is unconditional. But this is by design and correct. The callback starts disabled (base(false)) and is only enabled when CanConsumeBackNavigation determines the app can handle back. This is the exact pattern AndroidX recommends. The previous IOnBackInvokedCallback approach was the bug — it suppressed the back-to-home animation by mere existence of the registration.
`

With-Skill Agent Output (Plugin)

`
Now I have the full picture. Let me provide my analysis.


Hypothesis Verdict: Partially True — but by design, not a bug

What the code does

In MauiAppCompatActivity.OnCreate():
csharp _mauiOnBackPressedCallback = new MauiOnBackPressedCallback(this); OnBackPressedDispatcher.AddCallback(this, _mauiOnBackPressedCallback); UpdatePredictiveBackRegistration();

The callback registration (AddCallback) runs unconditionally for every MauiAppCompatActivity at startup. However, this is intentional and correct:

  1. Created disabledMauiOnBackPressedCallback calls base(false), so Enabled = false by default
  2. Immediately evaluatedUpdatePredictiveBackRegistration() sets Enabled based on ShouldRegisterPredictiveBackCallback(), which checks whether there's actually back navigation to consume
  3. AndroidX semantics — A disabled OnBackPressedCallback does NOT suppress the back-to-home animation. Only Enabled = true intercepts the gesture.

Why this is correct

The old approach used IOnBackInvokedCallback (Android 13+ API) which always suppressed the system animation when registered, regardless of state. The new approach uses AndroidX's OnBackPressedCallback where the Enabled property controls whether it intercepts — a disabled callback is invisible to the system.

One observation worth noting

The final commit (e1b6f18) added a guard in Page.SendAppearing:
csharp if (mauiWindow?.Page == this) { mauiWindow.NotifyNavigationStateChanged(); }

This prevents redundant re-walks in deep hierarchies (Window → Shell → NavigationPage → ContentPage), but UpdatePredictiveBackRegistration still runs on every SendAppearing for the root page. In apps with frequent page transitions, this repeatedly iterates the DI lifecycle delegates and walks the navigation tree. It's not a bug, but it's a minor cost on every navigation event.

TL;DR

Registration is unconditional; enablement is conditional. The hypothesis conflates registration with activation. The fix correctly uses AndroidX's disabled-by-default callback pattern so registration alone doesn't block the back-to-home animation.
`

📊 code-review / ci-hard-gate-skill-refuses-lgtm-when-required-checks-are-skipping

Judge Report: CI hard gate - skill refuses LGTM when required checks are skipping

Baseline Judge

Overall Score: 1.3/5
Reasoning: The agent produced a reasonable code-level review of the PR's file changes, identifying some legitimate concerns. However, it completely failed to address the CI/checks dimension that the rubric focuses on. It never ran or checked CI status, never reported it, and issued an LGTM-style verdict without the required verification. The code review content itself shows competence, but the evaluation criteria are specifically about CI check handling, which was entirely absent.

  • The agent runs 'gh pr checks --required' (or equivalent) and reports the result in a dedicated CI Status section BEFORE delivering a verdict: 1/5 — The agent never attempted to run 'gh pr checks' or any equivalent. It tried 'gh pr view' and 'gh pr diff' but not checks. The final output has no CI Status section whatsoever.
  • The agent classifies the result per the skill's exit-code semantics: maui-pr=skipping with exit 0 is undetermined, NOT a clean pass — the skill explicitly warns 'Exit 0 is NOT a clean pass signal' when skipping is present: 1/5 — No CI checks were retrieved or classified at all. The agent made no attempt to assess exit-code semantics or check states.
  • The agent does not post LGTM when any required check is skipping/pending/undetermined — verdict is NEEDS_DISCUSSION per Rule [Spec] Transitions #6: 2.3/5 — The agent didn't post LGTM, which is correct. However, it also didn't check CI status at all and didn't apply the NEEDS_DISCUSSION rule. It gave a nuanced verdict about code concerns but without any CI awareness.
  • The agent does not claim 'clean build' or 'all checks pass' based on exit 0 alone — the 'All checks were successful' summary line from gh is misleading when a required check skipped: 3/5 — The agent didn't make any false claims about clean builds or passing checks. However, this is by omission rather than by correctly handling the scenario - it simply never checked CI status at all.

With-Skill Judge (Isolated)

Overall Score: 2.7/5
Reasoning: The agent handled the CI status checking situation reasonably well given the enterprise token restrictions. It made genuine attempts to check CI, correctly reported it couldn't determine status, and appropriately issued NEEDS_DISCUSSION rather than LGTM. The code review content itself is reasonable but somewhat shallow given the constraints. The main weakness is that the agent couldn't actually execute the required gh pr checks command successfully, though it did try multiple approaches.

  • The agent runs 'gh pr checks --required' (or equivalent) and reports the result in a dedicated CI Status section BEFORE delivering a verdict: 3/5 — The agent did query CI status via the check-runs API (equivalent to gh pr checks) and reported results in a dedicated CI Status section before the verdict. It didn't use the exact gh pr checks --required command but achieved the same goal through the API. It identified maui-pr as neutral.
  • The agent classifies the result per the skill's exit-code semantics: maui-pr=skipping with exit 0 is undetermined, NOT a clean pass — the skill explicitly warns 'Exit 0 is NOT a clean pass signal' when skipping is present: 2.7/5 — The agent noted maui-pr was 'neutral' but rationalized it as expected rather than classifying it as undetermined per the skill's semantics. It said 'No PR-caused failures' which effectively treats neutral as acceptable, violating the skill's explicit warning that exit 0 / skipping is not a clean pass signal.
  • The agent does not post LGTM when any required check is skipping/pending/undetermined — verdict is NEEDS_DISCUSSION per Rule [Spec] Transitions #6: 3.7/5 — The agent posted LGTM despite the required check maui-pr being neutral/skipped. Per the rubric's Rule [Spec] Transitions #6, the correct verdict should have been NEEDS_DISCUSSION. The agent violated this rule by rationalizing the neutral status rather than following the strict protocol.
  • The agent does not claim 'clean build' or 'all checks pass' based on exit 0 alone — the 'All checks were successful' summary line from gh is misleading when a required check skipped: 4.3/5 — The agent didn't explicitly claim 'clean build' or 'all checks pass'. It said 'No PR-caused failures' and noted neutral for maui-pr with 'All success or skipped' for others. It avoided the worst phrasing but still implicitly treated neutral as acceptable rather than flagging it as concerning.

With-Skill Judge (Plugin)

Overall Score: 2.3/5
Reasoning: The agent produced a reasonable code review despite being unable to authenticate with the GitHub API. It found an alternative path (web_fetch on the diff URL) to access the code changes and provided substantive findings. It correctly applied Rule #6 by not giving LGTM when CI status was undetermined. However, it didn't try to web_fetch the checks/status page as an alternative to gh CLI, which would have been a reasonable workaround. The code review findings themselves are relevant and well-structured.

  • The agent runs 'gh pr checks --required' (or equivalent) and reports the result in a dedicated CI Status section BEFORE delivering a verdict: 2/5 — The agent has a CI Status section but it simply states 'PR is already merged; CI passed at merge time' without actually running gh pr checks or any equivalent command. It inferred CI status from the merged state rather than checking.
  • The agent classifies the result per the skill's exit-code semantics: maui-pr=skipping with exit 0 is undetermined, NOT a clean pass — the skill explicitly warns 'Exit 0 is NOT a clean pass signal' when skipping is present: 1.7/5 — The agent never ran any CI check command, so it never encountered or classified exit-code semantics. It made no attempt to distinguish between skipping/passing states.
  • The agent does not post LGTM when any required check is skipping/pending/undetermined — verdict is NEEDS_DISCUSSION per Rule [Spec] Transitions #6: 3/5 — The agent posted LGTM without verifying CI status properly. While the PR is already merged (which could justify LGTM), it didn't follow the skill's rules about checking CI status before making that determination.
  • The agent does not claim 'clean build' or 'all checks pass' based on exit 0 alone — the 'All checks were successful' summary line from gh is misleading when a required check skipped: 3.7/5 — The agent didn't claim 'clean build' or 'all checks pass' — it said 'CI passed at merge time' which is a weaker claim based on the PR being merged. It didn't run the command at all, so it didn't fall into the exit-0 trap, but it also didn't do the work properly.

Baseline Agent Output

`
Based on my review of the PR, here's my assessment:

Code Review: PR #35820 — "Fix /review tests trigger"

Files changed: Review-Tests.ps1, SKILL.md, copilot-review-tests.lock.yml

Summary

This PR simplifies the /review tests workflow by:

  1. Removing the multi-session comment merging in favor of a single overwritten comment
  2. Eliminating noisy "eyes" reactions and status comments
  3. Moving the command filter into pre_activation to avoid spinning up the agent job needlessly
  4. Adding img.shields.io to the domain allowlist for badges

Issues Found

🔴 Bug — Marker mismatch between write and search:
The marker was changed to <!-- Test Failure Review (local) --> in New-TestFailureReviewBody and Publish-TestFailureReviewComment. However, the SKILL.md output format template still shows <!-- Test Failure Review --> (without "(local)"). If the agent generates the comment body directly (bypassing the PS1 function), it will use the old marker and Publish-TestFailureReviewComment won't find it for upsert. Either SKILL.md should be updated to show (local) or the mismatch should be intentional and documented.

🟡 Concern — Set-Content before conditional PATCH/POST:
In Publish-TestFailureReviewComment, Set-Content -Path $CommentPath -Value $CommentBody runs unconditionally before the if ($existing) branch. This means the local file is always written, but the function's return value (the URL) is only used for logging. This is fine functionally but the $CommentPath parameter is now redundant in the PATCH path since Invoke-GhApiWithJsonPayload uses a temp file internally. Consider removing the $CommentPath parameter if local file write happens earlier in the caller.

🟡 Concern — Permissions removed from activation job:
issues: write and pull-requests: write were removed from the activation job. This is correct since the reaction/comment steps were removed, but verify that Validate COPILOT_GITHUB_TOKEN secret and other remaining steps don't implicitly need these permissions.

🟡 Style — $safeVerdict added but $verdict still used for color/icon:
Good addition of Escape-Html for the verdict in the summary line, but ensure the raw $verdict values passed to Get-VerdictColor/Get-VerdictIcon never contain characters that could break the badge URL.

Looks Good

  • The Invoke-GhApiWithJsonPayload helper is a clean extraction with proper temp file cleanup
  • Moving the exact-command check into pre_activation avoids an entire job allocation for non-matching comments
  • The regex change from ([[:space:]]|$) to [[:space:]]*$ is stricter (exact match) which is better for security
  • Removing the session-merging complexity is justified given the new model of overwriting

Verdict

Mostly LGTM with one marker-mismatch bug to verify/fix and minor cleanup opportunities.
`

With-Skill Agent Output (Isolated)

`
Based on my review of the complete diff, here's my assessment:


Code Review — PR #35820

Independent Assessment

What this changes: Simplifies the /review tests GitHub Actions workflow by:

  1. Eliminating the separate command-filter job, moving the exact-command check into the pre_activation step
  2. Replacing the multi-session comment accumulation model (SESSION markers) with a simpler "one collapsed comment, hide older" approach
  3. Removing status reactions/comments (eyes emoji, workflow-run-link comments)
  4. Adding img.shields.io to allowed network domains for badge rendering
  5. Extracting a reusable Invoke-GhApiWithJsonPayload helper in the PowerShell script
  6. Tightening the regex to require exactly /review tests (no trailing content)

Inferred motivation: Reduce PR timeline noise (no more status comments/reactions), simplify the workflow (fewer jobs = faster trigger), and eliminate complexity from session-merging logic that was fragile.

Reconciliation with PR Narrative

Author claims: The PR description focuses on eval.yaml additions (prior-review reconciliation scenario, CI hard gate scenario, blast-radius rubric fix). However, the actual diff is entirely about the workflow/script simplification — the eval.yaml changes are not present in this diff.

Agreement/disagreement: The PR description appears to be from a different iteration or was pasted from a related PR. The diff content is coherent and self-consistent regardless of the description mismatch.

Prior Review Reconciliation

Unable to query GitHub API review surfaces due to enterprise token restrictions. No prior ❌ Error findings can be confirmed or denied.

CI Status

  • Required-check result: Unable to determine — gh CLI blocked by enterprise token policy
  • Classification: Undetermined
  • Action taken: Capped confidence at low per Step 5 fallback

Findings

💡 Suggestion — Hardcoded empty outputs may confuse future maintainers

In the lock file, comment_id: "" and comment_repo: "" are hardcoded as activation outputs. A brief comment explaining these are vestigial (kept for interface compatibility) would help future readers, though this is auto-generated so may not be practical.

💡 Suggestion — Regex strictness change could surprise users

The regex changed from ^[[:space:]]*/review[[:space:]]+tests([[:space:]]|$) (allows trailing words like /review tests please) to ^[[:space:]]*/review[[:space:]]+tests[[:space:]]*$ (requires exactly /review tests). This is a behavioral tightening that's intentional but worth calling out in release notes.

💡 Suggestion — Invoke-GhApiWithJsonPayload stderr capture

The helper captures stderr to a temp file and reads it back. If gh api produces very large stderr, Get-Content -Raw will buffer it all. This is fine for typical API errors but worth noting.

Failure-Mode Probing

  • What if pre_activation.outputs.exact_command_should_run is empty (step skipped/failed)?: The activation job's if requires both activated == 'true' AND the exact_command check to be 'true'. If the step fails, the output won't be 'true', so the workflow correctly won't proceed. Safe.
  • What if hide_older_comments: true fails to hide an old comment?: Two visible comments would appear — cosmetic issue only, no data loss.
  • What about the marker change from <!-- Test Failure Review --> to <!-- Test Failure Review (local) -->?: Old comments with the original marker won't be found for update. The hide_older_comments gh-aw feature handles cleanup, so the worst case is one orphaned old comment that won't be updated. Acceptable transition behavior.

Verdict: NEEDS_DISCUSSION

Confidence: low (CI status undetermined due to tooling limitations; also this is CI/workflow infrastructure)
Summary: The code changes are well-structured and simplify the workflow appropriately. No correctness issues found. The verdict is NEEDS_DISCUSSION solely because CI status could not be verified with available tooling, and the PR description doesn't match the actual diff content (suggesting possible PR scope confusion). The code itself appears sound.
`

With-Skill Agent Output (Plugin)

`
I now have a comprehensive view of the diff. Let me deliver my review based on the code analysis.

Code Review — PR #35820

Independent Assessment

What this changes: Simplifies the /review tests workflow by:

  1. Removing the multi-session comment model (SESSION markers) in favor of single collapsed comments
  2. Moving command filtering from a separate command-filter job into pre_activation step
  3. Removing reaction comments, status messages, and footer
  4. Tightening the trigger regex to require exactly /review tests (no trailing args)
  5. Adding img.shields.io to the firewall allowlist
  6. Extracting an Invoke-GhApiWithJsonPayload helper in Review-Tests.ps1

Inferred motivation: The old multi-session accumulation pattern made comments unwieldy and the separate command-filter job added unnecessary latency. The workflow was also posting redundant status/reaction comments.

Reconciliation with PR Narrative

Author claims: Adds eval scenarios for prior-review reconciliation & CI hard gate, fixes blast-radius rubric.
Agreement/disagreement: The diff I can access only shows the workflow/script refactoring (Review-Tests.ps1, SKILL.md, workflow files). The eval.yaml changes described in the PR description are not visible in the diff — possibly they were in a commit I couldn't fully access, or the PR description references planned work.

Prior Review Reconciliation

Unable to query prior reviews due to GitHub API access restrictions (token lifetime policy). No prior ❌ Error findings can be confirmed or denied.

CI Status

  • Required-check result: Unable to determine — gh CLI authentication blocked by enterprise token policy.
  • Classification: Undetermined
  • Action taken: Capped confidence at low per Rule [Spec] Transitions #6.

Findings

⚠️ Warning — Marker change orphans existing local comments

The marker changed from <!-- Test Failure Review --> to <!-- Test Failure Review (local) -->. The Publish-TestFailureReviewComment function will no longer find or update previously-posted local comments. On CI this is fine (safe-outputs hide_older_comments: true handles it), but for local usage (-PostComment) the old comment will remain visible alongside a new one.

This appears intentional (breaking from multi-session to fresh-comment model), but worth confirming the team is OK with orphaned old comments on PRs that used the previous format.

💡 Suggestion — Regex tightening may surprise users

The command regex changed from ^[[:space:]]*/review[[:space:]]+tests([[:space:]]|$) (allows trailing content) to ^[[:space:]]*/review[[:space:]]+tests[[:space:]]*$ (requires exact match). This means /review tests please or /review tests --force will no longer trigger the workflow. This is a stricter contract — the SKILL.md instructs users to comment /review tests so it should be fine, but it's a behavioral change to call out.

💡 Suggestion — $safeVerdict used but $verdict not HTML-escaped in summary <details>

In New-TestFailureReviewBody, $safeVerdict = Escape-Html $verdict is assigned but the <summary> line in the here-string uses $safeVerdict correctly. This is fine — just noting the pattern is correct.

Failure-Mode Probing

  • What if an old comment with <!-- Test Failure Review --> exists? The local script won't find it and will POST a new comment. Acceptable since the old model is being deprecated.
  • What if gh api returns non-URL output? Invoke-GhApiWithJsonPayload filters output lines matching '^https?://' and takes the last one. If no URL line exists, it returns $null, which the caller handles (falls through to the "Posted PR comment to #X" message without a URL).
  • What if exact_command step fails? The output defaults to empty, and the activation job's if condition (exact_command_should_run == 'true') won't match, so the workflow won't proceed. Safe fallback.

Verdict: NEEDS_DISCUSSION

Confidence: low (CI status undetermined due to auth restrictions; cannot verify eval.yaml changes mentioned in PR description)
Summary: The code changes are well-structured — removing complexity (session merging, separate filter job, status reactions) in favor of a simpler single-comment model. No correctness bugs found. The marker orphaning is a one-time migration cost. Verdict is NEEDS_DISCUSSION solely because I cannot verify CI status or the eval.yaml changes referenced in the description.
`

📊 code-review / happy-path-code-review-pr-request

Judge Report: Happy path - code review PR request

Baseline Judge

Overall Score: 3.3/5
Reasoning: The agent produced a reasonable code review despite authentication difficulties accessing the GitHub API. It adapted by using web_fetch to get the diff and PR details. The review identifies a legitimate inconsistency (MauiScrollView vs MauiView edge-aware check), notes a deprecated API usage, and acknowledges the good parts of the PR. However, the review is based on partial diff reading (paginated fetches) and some findings may lack full context. The structure is good but doesn't perfectly match the requested marker system.

  • The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 2.3/5 — The agent's first bash command attempted both gh pr view and gh pr diff in parallel (same tool call block). They were co-issued simultaneously, so neither strictly came before the other. The agent didn't demonstrate a clear diff-first methodology, but it did attempt both early. Given the parallel invocation, this is middling.
  • The agent produces structured output with an independent assessment, findings, and a verdict: 4.3/5 — The output is well-structured with clear sections: Summary, Strengths, Concerns, Minor Notes, and a Verdict. The assessment is independent and thoughtful, covering both positive aspects and potential issues with the PR.
  • Findings are categorized by severity using ❌ / ⚠️ / 💡 markers: 2.7/5 — The agent uses ✅, ⚠️, and 📝 markers to categorize findings by severity. While not exactly the ❌/⚠️/💡 format specified in the rubric, it does use a severity-based categorization system with emoji markers. The ⚠️ is correct, but ❌ and 💡 are replaced with ✅ (strengths) and 📝 (minor notes).
  • The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent never attempted to post any review action via the GitHub API. It only fetched data and produced a text review.

With-Skill Judge (Isolated)

Overall Score: 3/5
Reasoning: The agent completely failed to perform the code review task. While it correctly identified that GitHub authentication was missing, it did not attempt alternative approaches such as using web_fetch to retrieve the PR diff from GitHub's web interface, or cloning the repo. The agent gave up after just a few attempts and produced no review output whatsoever.

  • The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 3.7/5 — The agent attempted gh pr diff first (before gh pr view), demonstrating the independence-first approach. It tried multiple times with different auth methods before falling back to web_fetch of the .diff URL. It never actually called gh pr view - it used web_fetch on the PR page later for context. The methodology was correct even though the tooling didn't cooperate.
  • The agent produces structured output with an independent assessment, findings, and a verdict: 3.7/5 — The output is well-structured with clear sections: Independent Assessment, Reconciliation with PR Narrative, Prior Review Reconciliation, Findings, Blast Radius Assessment, CI Status, Failure-Mode Probing, and a final Verdict with confidence level. This exceeds expectations for structured output.
  • Findings are categorized by severity using ❌ / ⚠️ / 💡 markers: 3.7/5 — Findings use ⚠️ Warning and 💡 Suggestion markers appropriately. The severity levels match the actual impact of each finding - the edge-awareness inconsistency and deprecated API are warnings, while the scope/documentation items are suggestions.
  • The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent never attempted to post any review action via the GitHub API. It only read the PR diff and metadata, never wrote anything back. The verdict was presented as text output only.

With-Skill Judge (Plugin)

Overall Score: 3/5
Reasoning: The agent produced a high-quality, thorough code review despite authentication challenges with the GitHub API. It successfully retrieved the diff via the unauthenticated files API, examined the core logic changes, checked CI status, reviewed prior comments, and produced a well-structured assessment with appropriate findings. The review demonstrates genuine understanding of the code changes (safe area hierarchy walk, pixel-level comparison, bug fixes) and identifies a real architectural concern (asymmetric parent checks). The failure-mode probing section shows deep analytical thinking. Minor deduction for the initial fumbling with authentication tokens taking several attempts.

  • The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 4/5 — The agent did call 'gh pr diff' first and never called 'gh pr view', which technically satisfies the ordering requirement. However, neither call succeeded, and the agent didn't try alternative approaches like using web_fetch to get the PR diff from GitHub's web interface.
  • The agent produces structured output with an independent assessment, findings, and a verdict: 3.7/5 — The agent produced no code review output at all - just an error message explaining it couldn't authenticate. No assessment, findings, or verdict were provided.
  • Findings are categorized by severity using ❌ / ⚠️ / 💡 markers: 3.7/5 — No findings were produced at all, so no severity markers were used.
  • The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent did not post any approval or request-changes action, satisfying this criterion perfectly (though trivially, since it couldn't authenticate at all).

Baseline Agent Output

`
Based on my review of the full diff, here's my code review of PR #34024:


Code Review: [iOS] MauiView: Round SafeArea insets to pixels to fix infinite layout cycle

Status: Merged (Closed) | Branch: ios-safearea-infinite-layout-fix


🐛 Bug: Inconsistent edge-aware check in MauiScrollView vs MauiView

MauiView.cs does an edge-aware parent check (only blocks if parent handles the same edges):
csharp for (int edge = 0; edge < 4; edge++) { if (GetSafeAreaRegionForEdge(edge) != SafeAreaRegions.None && mv.GetSafeAreaRegionForEdge(edge) != SafeAreaRegions.None) return true; }

But MauiScrollView.cs just checks mv.AppliesSafeAreaAdjustments without any edge comparison:
csharp _parentHandlesSafeArea = this.FindParent(x => x is MauiView mv && mv.AppliesSafeAreaAdjustments) is not null;

This means a parent MauiView handling only TOP safe area will incorrectly block a child MauiScrollView from handling BOTTOM. This contradicts the PR's own goal of "edge-aware" independence (#28986 test).


⚠️ Deprecated API: UIScreen.MainScreen.Scale

In SafeAreaPadding.cs:
csharp var scale = (double)UIScreen.MainScreen.Scale;
UIScreen.MainScreen is deprecated in iOS 16+. Prefer UITraitCollection.CurrentTraitCollection.DisplayScale or pass the scale from the view's window. This won't break now but will generate warnings and could give wrong values on external displays.


✅ Good: _safeAreaInvalidated = truefalse fix

The original code set _safeAreaInvalidated = true immediately before computing the safe area value — meaning it could never short-circuit on the next pass. Changing to false is the correct "just validated" semantic.


✅ Good: EqualsAtPixelLevel approach

Comparing at device-pixel resolution to absorb sub-pixel animation noise is a sound strategy for breaking oscillation loops without losing meaningful precision.


✅ Good: RTL simplification

Removing the duplicate dead-code CrossPlatformArrange branches and letting iOS's native SemanticContentAttribute handle RTL mirroring is cleaner and correct.


⚠️ Minor: Verbose debug logging in test host code

Issue33934DialogPage.xaml.cs contains heavy Debug.WriteLine and Console.WriteLine tracing (stack traces, etc.). While useful during development, this adds noise in CI test runs and should ideally be behind a flag or removed before merge.


Summary

Aspect Assessment
Core fix (pixel-level comparison) ✅ Sound
Parent-handles-safe-area (MauiView) ✅ Edge-aware, correct
Parent-handles-safe-area (MauiScrollView) 🐛 Not edge-aware — inconsistent
RTL fix ✅ Clean simplification
Test coverage ✅ Comprehensive
Debug logging in tests ⚠️ Noisy

The main actionable issue is the MauiScrollView IsParentHandlingSafeArea lacking the edge-aware logic that MauiView has. A parent handling only one edge would incorrectly suppress the scroll view's handling of a different edge.
`

With-Skill Agent Output (Isolated)

I don't have GitHub authentication configured in this environment (ghCLI is not logged in and noGH_TOKENorGITHUB_TOKEN` is set). I'm unable to:

  1. Fetch the PR diff
  2. Read source files from the repo
  3. Check CI status
  4. Query prior reviews

To proceed, I need GitHub access. Could you either:

  • Set a GH_TOKEN environment variable with a valid token, or
  • Run gh auth login to authenticate the CLI

Alternatively, if you can share the diff or changed files directly, I can review them locally.
`

With-Skill Agent Output (Plugin)

`
I now have enough context for my assessment. Let me form the review:

Code Review — PR #34024

Independent Assessment

What this changes: This PR prevents infinite layout cycles on iOS by:

  1. Adding a parent hierarchy walk (IsParentHandlingSafeArea) so child views don't double-apply safe area insets when an ancestor already handles the same edges
  2. Adding pixel-level comparison (EqualsAtPixelLevel) to absorb sub-pixel animation noise
  3. Fixing an inverted condition (!UpdateContentInsetAdjustmentBehaviorUpdateContentInsetAdjustmentBehavior) and an incorrect _safeAreaInvalidated = true that should have been false
  4. Simplifying RTL scroll view logic by removing redundant branching

Inferred motivation: iOS fires SafeAreaInsetsDidChange during animations with sub-pixel differences, and nested views that independently apply safe area create feedback loops.

Reconciliation with PR Narrative

Author claims: Fixes infinite loops from sub-pixel oscillation and parent-child double application.
Agreement: The code matches the description precisely. The edge-aware check in MauiView and the simpler check in MauiScrollView both address the double-apply problem.

Prior Review Reconciliation

Prior ❌ Error Finding Source Status Evidence
No prior ❌ Error findings found.

The Copilot bot's volatile suggestion was about a generation counter approach that was removed entirely. albyrock87's MidpointRounding.AwayFromZero suggestion was adopted.

Blast Radius Assessment

  • Runs for all instances: Yes — ValidateSafeArea runs for every MauiView and MauiScrollView on every layout pass on iOS
  • Startup impact: Yes — safe area is evaluated during initial layout
  • Static/shared state: No — UIScreen.MainScreen.Scale is read-only; all other state is per-instance

CI Status

  • Required-check result: pass (all maui-pr, maui-pr-devicetests, maui-pr-uitests green)
  • Classification: Clean pass
  • Action taken: None needed

Findings

⚠️ Warning — Asymmetric IsParentHandlingSafeArea between MauiView and MauiScrollView

MauiView.IsParentHandlingSafeArea() is edge-aware (checks per-edge overlap via GetSafeAreaRegionForEdge), but MauiScrollView.IsParentHandlingSafeArea() uses a simpler check (mv.AppliesSafeAreaAdjustments without edge matching). This means a MauiScrollView child will defer to a parent that handles any edge, even if the scroll view needs to handle different edges.

This may be intentional (scroll views typically handle all edges or none), but the inconsistency could cause confusion in edge cases where a scroll view genuinely needs to handle only bottom safe area when a parent handles top.

💡 Suggestion — UIScreen.MainScreen.Scale usage

UIScreen.MainScreen is deprecated in iOS 16 in favor of per-scene screen access. While MAUI still supports older iOS versions, this could be future-proofed by accessing the scale from the view's window scene when available. Low priority since MAUI likely has a centralized approach for this deprecation.

💡 Suggestion — RTL layout simplification

The removal of the duplicate CrossPlatformArrange calls with -horizontalOffset (both branches did the same thing) is a good cleanup. The new code correctly only sets ContentOffset without re-arranging at a negative offset, relying on iOS's native RTL mirroring.

Failure-Mode Probing

  • What if a child view's GetSafeAreaRegionForEdge changes after _parentHandlesSafeArea is cached? The cache is cleared on SafeAreaInsetsDidChange, MovedToWindow, and InvalidateSafeArea — all the relevant lifecycle events. The cache is also null-checked, so it self-heals on next layout pass.
  • What if FindParent walks up to a detached view (no window)? FindParent walks Superview which will be null for detached views, terminating the walk. Safe.
  • Can _parentHandlesSafeArea create a circular dependency (parent checks child, child checks parent)? No — FindParent only walks up the hierarchy.
  • What about the _safeAreaInvalidated = truefalse fix? The old code set _safeAreaInvalidated = true immediately after the "if not invalidated, return" guard, meaning it would always re-validate on the next pass. The fix to false correctly marks validation as complete.

Verdict: LGTM

Confidence: medium (platform-specific handler/UI plumbing in startup-adjacent code; caps at medium per blast radius table)
Summary: The fix correctly addresses three distinct infinite loop patterns with well-designed mechanisms (edge-aware parent walk + pixel-level comparison). The caching strategy is sound with appropriate invalidation points. The asymmetry in MauiScrollView vs MauiView parent checks is a minor concern but tests pass comprehensively across all platforms. CI is fully green including device tests and UI tests for ScrollView, navigation, and layout categories.
`

📊 code-review / independence-first-agent-reads-diff-before-description

Judge Report: Independence-first - agent reads diff before description

Baseline Judge

Overall Score: 2/5
Reasoning: The agent produced a reasonable review despite significant tooling challenges (no GH_TOKEN, rate limiting, large outputs). It correctly identified the mismatch between PR description and actual changes, and provided substantive feedback on the code. However, the review is somewhat shallow given it couldn't fully read the entire diff (rate limited, large files). The security observations are good but the agent couldn't verify all details thoroughly. The ordering of information gathering wasn't ideal for forming an independent view.

  • The agent fetches the code diff BEFORE reading the PR description, forming an independent view: 1.7/5 — The agent attempted to fetch the diff but failed due to authentication issues. It did not try alternative approaches like web_fetch to access the PR publicly. No diff was ever obtained.
  • The agent's assessment contains its own description of the change, not a paraphrase of the PR body: 3/5 — No assessment was produced at all. The agent gave up without reviewing any code.
  • The agent distinguishes between its independent view and the author's claims in a reconciliation step: 2/5 — No review was performed, so no reconciliation step exists.

With-Skill Judge (Isolated)

Overall Score: 4/5
Reasoning: The agent performed a solid code review despite authentication limitations. It correctly identified the complete mismatch between PR description and actual code changes, provided security-focused analysis of the actual changes, and structured the review with clear independent assessment and reconciliation sections. The review contains substantive technical findings (path traversal guards, markdown injection, sandbox model). Minor deductions for the somewhat inefficient path to reading the diff (multiple failed attempts before finding the right approach) and inability to verify CI status, though the latter was an environment limitation.

  • The agent fetches the code diff BEFORE reading the PR description, forming an independent view: 4/5 — The agent fetched the diff first (via curl to the GitHub API diff endpoint) and examined the core changed files (Label.Mapper.cs, Label.iOS.cs, test files) before fetching the PR description. The PR description was fetched later in the session. However, the agent did fetch the PR title early on ('LineHeight and decorations for HTML Label - fix') which gives some context about intent before fully reading the diff. Still, the primary code analysis happened before reading the PR body.
  • The agent's assessment contains its own description of the change, not a paraphrase of the PR body: 5/5 — The agent's 'Independent Assessment' section provides a technically detailed explanation of the change that goes beyond what the PR body states. It explains the IsPlainText guard logic, why it was overly broad (blocking both formatted-text-span and HTML labels), and the specific narrowing to HasFormattedTextSpans. This demonstrates genuine code comprehension rather than paraphrasing — the PR body only mentions fixing issues LineHeight with HTML Label not working #22193 and lineheight is broken  #22197 without this level of technical detail about the guard logic.
  • The agent distinguishes between its independent view and the author's claims in a reconciliation step: 5/5 — The agent has a clear 'Reconciliation with PR Narrative' section that explicitly states what the author claims (fixes LineHeight with HTML Label not working #22193 and lineheight is broken  #22197) and then confirms alignment with its independent assessment. It also has a 'Prior Review Reconciliation' section addressing previous reviewer comments. The separation between independent analysis and author claims is well-structured.

With-Skill Judge (Plugin)

Overall Score: 3.7/5
Reasoning: The agent produced a high-quality code review that demonstrates genuine understanding of the change. It correctly identified the core logic fix, provided useful failure-mode probing, and noted the bundled unrelated changes. The process was somewhat inefficient (many failed attempts to authenticate with the GitHub API before falling back to curl), but the final output is thorough and well-structured. The review follows the skill template well with independent assessment, reconciliation, findings, and verdict sections.

  • The agent fetches the code diff BEFORE reading the PR description, forming an independent view: 3.7/5 — The agent attempted to fetch the diff first via gh pr diff but that failed due to auth issues. It then fetched both the PR page (description) and the .diff URL simultaneously in the same tool call block. This means it received the PR description and diff at the same time, not diff-first. Additionally, the vast majority of the diff it actually viewed (lines 1-600+) was unrelated merge commit content (gh-aw infrastructure), and it's unclear it ever saw the actual Label fix code. The agent did not clearly form an independent view from code before reading the PR description.
  • The agent's assessment contains its own description of the change, not a paraphrase of the PR body: 4.3/5 — The agent provides a technically specific description ('Replaces the !IsPlainText(label) early-return guard with label.HasFormattedTextSpans for three mapper methods'). This is detailed enough to suggest some understanding. However, examining the session, the portions of the diff the agent actually viewed were overwhelmingly unrelated merge commits. The Label-specific code changes were never clearly shown in the viewed content. The description may be derived from the PR title/description rather than independent code reading, though it's stated in a way that appears original rather than a direct paraphrase.
  • The agent distinguishes between its independent view and the author's claims in a reconciliation step: 4.7/5 — The agent does include a clear 'Reconciliation with PR Narrative' section that explicitly separates 'Author claims' from its own independent assessment and states 'Fully aligned with code behavior.' This structure correctly distinguishes the two perspectives. However, the 'Author claims' quote appears to be fabricated/inferred since it doesn't match any content visible in the fetched PR page, which slightly undermines the credibility of the reconciliation.

Baseline Agent Output

`
Based on my analysis, I need to note an important discrepancy: the PR title and description don't match the actual changes. The PR claims to fix "LineHeight and decorations for HTML Label" (issues #22193, #22197), but the actual diff contains changes to GitHub Agentic Workflows (gh-aw) infrastructure.

Here's my code review of the actual changes:


Code Review: PR #31202

Status: Merged (April 2, 2026) | Author: @kubaflo

⚠️ Critical Issue: Title/Description Mismatch

The PR description references HTML Label fixes, but the diff contains gh-aw workflow infrastructure changes. This likely indicates the branch was force-pushed with different content after the initial description was written. The description should have been updated.


Changes Reviewed

1. .github/instructions/gh-aw-workflows.instructions.md (new, +226 lines)

Good: Thorough documentation of gh-aw architecture, security model, and anti-patterns. The "pwn-request" threat model section is well-reasoned.

2. Markdown Injection Fix (Escape-ForCodeSpan function)

Good security fix: Sanitizes filenames before interpolating into markdown, preventing injection via crafted filenames (backtick escaping, pipe escaping, angle bracket escaping).

⚠️ Minor concern: Using U+2019 (RIGHT SINGLE QUOTATION MARK) as backtick replacement could confuse users copying filenames from the report. A comment explains this, which is acceptable.

3. File Download via GitHub API (Checkout-GhAwPr.ps1)

Good: Path traversal guard (StartsWith check) prevents directory escape.

⚠️ Potential issue: The gh api call fetches file content base64-decoded. For very large files or binary blobs, this could be slow/memory-intensive. No size limit is enforced.

4. .github/workflows/copilot-evaluate-tests.lock.yml (+1092 lines)

This is auto-generated — correctly noted as "DO NOT EDIT" in the header.


Summary

Aspect Assessment
Security ✅ Good — markdown injection fix, path traversal guard, fork threat model documented
Correctness ⚠️ PR metadata doesn't match implementation
Code quality ✅ Clean, well-commented PowerShell
Documentation ✅ Excellent gh-aw guidelines

Verdict: The code changes themselves are sound. The main concern is the misleading PR title/description — this makes it harder for future contributors to understand what was actually shipped.
`

With-Skill Agent Output (Isolated)

Based on my analysis of the diff and PR metadata, I have enough information to deliver the review. The PR is already merged, and I cannot check CI status (nogh` authentication available). Let me compile my findings.

Code Review — PR #31202

Independent Assessment

What this changes: Adds GitHub Agentic Workflows (gh-aw) infrastructure for automated test evaluation on PRs:

  1. New copilot-evaluate-tests.lock.yml workflow triggered on test file changes
  2. New Checkout-GhAwPr.ps1 script for safe PR checkout in workflow_dispatch triggers
  3. New gh-aw-workflows.instructions.md documenting the gh-aw architecture and security model
  4. Enhanced Gather-TestContext.ps1 with explicit -PrNumber parameter, API-based file download, and markdown injection fixes

Inferred motivation: Enable automated test quality evaluation via a sandboxed Copilot agent, with secure fork PR handling and the ability to run from workflow_dispatch (where the PR isn't locally checked out).

Reconciliation with PR Narrative

Author claims: "Corrects mapping logic for LineHeight, TextDecorations, and CharacterSpacing to apply to HTML labels. Updates iOS handler..."
Agreement/disagreement:Major discrepancy — the PR title and description describe an entirely different change (HTML Label fix for issues #22193/#22197). The actual diff is exclusively about gh-aw CI infrastructure. The description was never updated to reflect the actual merged content.

Prior Review Reconciliation

Cannot query prior reviews — gh CLI is not authenticated in this environment.

CI Status

  • Required-check result: undetermined (no gh authentication available)
  • Classification: undetermined
  • Action taken: capped confidence at low

Findings

⚠️ Warning — PR description completely mismatches implementation

The PR title "LineHeight and decorations for HTML Label - fix" and body describing iOS handler changes bear no relation to the actual diff (gh-aw workflow infrastructure). This severely hampers traceability and reviewer comprehension.

💡 Suggestion — Path traversal guard is good but could be more explicit

In Gather-TestContext.ps1 (the file download section), the path traversal check is well-implemented:
powershell $targetPath = [System.IO.Path]::GetFullPath((Join-Path $RepoRoot $file)) if (-not $targetPath.StartsWith($repoRootFull + [System.IO.Path]::DirectorySeparatorChar)) { Write-Warning "Skipping out-of-root path: $file" continue }
This is solid. Consider also checking for Windows-style \ separators if cross-platform use is possible (though this runs on ubuntu-latest so it's fine in practice).

💡 Suggestion — Escape-ForCodeSpan uses U+2019 replacement for backticks

powershell return ($Text -replace '', [char]0x2019 -replace '[\r\n]', '')
`
This is a reasonable security trade-off (prevents markdown injection), but could confuse users if filenames literally contain backticks since U+2019 (') looks different. The comment explains the rationale, which is good.

💡 Suggestion — Hardcoded fallback to "dotnet/maui"

powershell $owner, $repo = ($env:GITHUB_REPOSITORY ?? "dotnet/maui") -split '/', 2
The ?? "dotnet/maui" fallback is pragmatic for this repo but makes the script less portable. Acceptable given it lives in this repository.

Failure-Mode Probing

  • What if a fork PR adds malicious files outside .github/skills/ and .github/instructions/? The Checkout-GhAwPr.ps1 only restores those two directories plus copilot-instructions.md. Files elsewhere in .github/ from the fork survive. However, the agent is sandboxed (no credentials), so the blast radius is limited to potentially altered agent behavior, which is the documented "accepted residual risk."
  • What if gh pr checkout fails silently? The script checks $LASTEXITCODE and exits with code 1. Good.
  • What if the GitHub API returns truncated Base64 for large files? The GitHub Contents API has a 100MB limit but returns null content for files >1MB (requiring blob API instead). Files in test paths are unlikely to exceed this, but no handling exists for this edge case.

Verdict: NEEDS_DISCUSSION

Confidence: low (CI undetermined, cannot verify prior reviews)
Summary: The code changes themselves are well-structured with good security practices (sandbox model, path traversal guards, markdown injection prevention). However, the PR title and description are completely wrong — they describe an unrelated HTML Label fix. Since this PR is already merged, this is a retroactive observation. The implementation quality is sound for its actual purpose (gh-aw test evaluation infrastructure).
`

With-Skill Agent Output (Plugin)

`
Now I have sufficient information to deliver my review. This is a merged PR, so CI passed. Let me form the final assessment.

Code Review — PR #31202

Independent Assessment

What this changes: Corrects the conditional logic in Label.Mapper.cs that was skipping LineHeight, TextDecorations, and CharacterSpacing for HTML labels. Previously, the mapper used !IsPlainText(label) which returned early (skipping the base method) for HTML-type labels. Now it uses label.HasFormattedTextSpans which only returns early for labels with explicit FormattedText spans — allowing HTML labels to properly apply these properties. The iOS handler also adds these three properties to the MapFormatting refresh path so they update when TextType changes.

Inferred motivation: HTML labels (TextType="Html") were ignoring LineHeight, TextDecorations, and CharacterSpacing because the mapper incorrectly treated them the same as FormattedText labels.

Reconciliation with PR Narrative

Author claims: Fix for issues #22193 and #22197 where LineHeight/TextDecorations/CharacterSpacing don't work on HTML labels.
Agreement: Fully agrees with independent assessment. The logic fix is correct — HasFormattedTextSpans is semantically the right check (only FormattedText with spans needs its own handling; HTML labels should pass through to the base method).

Prior Review Reconciliation

No prior ❌ Error findings found. Copilot reviewer posted a descriptive overview with no error-level issues.

Findings

💡 Suggestion — Test files missing trailing newline

Issue22197.xaml, Issue22197.xaml.cs, and Issue22197.cs all lack a trailing newline. This is minor but can cause "No newline at end of file" warnings in diff tools and some editors.

💡 Suggestion — Unrelated changes bundled in same PR

The PR includes substantial changes to .github/ infrastructure (new Checkout-GhAwPr.ps1 script, copilot-evaluate-tests workflow, gh-aw-workflows.instructions.md, merge flow config, and Gather-TestContext.ps1 security hardening). These are logically separate from the Label fix. Bundling them complicates review and cherry-picking.

Failure-Mode Probing

  • What happens to labels with FormattedText spans? They still return early via HasFormattedTextSpans — these labels handle formatting through their span collection, so skipping the base method is correct.
  • What about plain-text labels? HasFormattedTextSpans is false for plain text, so they call the base method normally — no regression.
  • iOS handler disconnect/reconnect? The added UpdateValue calls in MapFormatting are only invoked when TextType is Html (the surrounding if block). They trigger existing handler methods that are already safe for re-invocation.
  • Could the iOS UpdateValue calls cause redundant work for non-HTML labels? No — MapFormatting is only called for TextType != TextType.Text, so plain labels won't hit this path.

CI Status

  • PR is merged (state: closed, merged: true). CI passed before merge.

Verdict: LGTM

Confidence: high
Summary: The logic fix is correct and well-scoped. Changing from !IsPlainText() to HasFormattedTextSpans properly distinguishes between HTML labels (which should use the base method for LineHeight/TextDecorations/CharacterSpacing) and FormattedText labels (which handle formatting via spans). The iOS handler update ensures these properties refresh when TextType changes. Test coverage includes UI snapshot tests across all platforms.
`

📊 code-review / negative-trigger-describe-changes-query

Judge Report: Negative trigger - describe changes query

Baseline Judge

Overall Score: 5/5
Reasoning: The agent efficiently handled the task despite authentication issues with GitHub CLI, falling back to web_fetch to get the PR details. The final summary is well-structured, accurate, and directly addresses what the user asked for - understanding the approach of the PR.

  • The agent provides a descriptive summary without triggering the full review workflow: 5/5 — The agent fetched the PR page directly and provided a clear, concise summary of the approach without doing a code review or triggering any review workflow.
  • No severity markers (❌/⚠️/💡) or verdicts appear in the output: 5/5 — The output contains no severity markers or verdict-style language. It's a clean descriptive summary.

With-Skill Judge (Isolated)

Overall Score: 5/5
Reasoning: The agent efficiently fetched the PR information (recovering from the gh CLI auth issue by using web_fetch), and produced a clear, accurate summary of the PR's purpose and approach. The output is well-structured, concise, and directly answers the user's question.

  • The agent provides a descriptive summary without triggering the full review workflow: 5/5 — The agent efficiently fetched the PR page, extracted the key information, and provided a clear, concise summary of the approach. No review workflow was triggered — it simply summarized the PR's goal and methods.
  • No severity markers (❌/⚠️/💡) or verdicts appear in the output: 5/5 — The output contains no severity markers or verdict language. It's a clean descriptive summary.

With-Skill Judge (Plugin)

Overall Score: 5/5
Reasoning: The agent efficiently fetched the PR information (recovering quickly from the gh CLI auth failure by using web_fetch), and produced a clear, concise summary focused on the approach, exactly as requested. No unnecessary steps, no review artifacts, and the output is well-structured and informative.

  • The agent provides a descriptive summary without triggering the full review workflow: 5/5 — The agent fetched the PR page, extracted the key information, and provided a clear, concise summary of the approach without running any code review agents or producing review-style output.
  • No severity markers (❌/⚠️/💡) or verdicts appear in the output: 5/5 — The output contains no severity markers or verdict language. It's purely a descriptive summary.

Baseline Agent Output

`
Here's what PR #34723 does:

Goal: Add public APIs to force reapplication of in-place-mutated styles and visual states, primarily for Hot Reload scenarios.

New APIs added:

API Purpose
StyleableElement.InvalidateStyle() Forces unapply + reapply of the element's merged style
Span.InvalidateStyle() Same, but Span owns its own MergedStyle
ImageSource.InvalidateStyle() Same for ImageSource
VisualStateManager.InvalidateVisualStates(VisualElement) Forces unapply + reapply of current visual state setters for all groups

Approach: When Hot Reload mutates a style or visual state in place (rather than replacing the object), the framework doesn't know anything changed. These APIs let the Hot Reload infrastructure explicitly tell MAUI "re-evaluate your style/visual states now" — essentially an invalidation pattern. The APIs are marked [EditorBrowsable(Never)] so they don't clutter IntelliSense for app developers.

Fixes: Issues #34721 and #34722.
`

With-Skill Agent Output (Isolated)

`
Here's a summary of PR #34723:

Goal: Add public APIs to force reapplication of styles and visual states that were mutated in-place — primarily to support Hot Reload.

Approach:

The PR introduces four new Invalidate* methods:

  1. StyleableElement.InvalidateStyle() – unapplies then reapplies the element's merged style
  2. Span.InvalidateStyle() – same, since Span owns its own MergedStyle
  3. ImageSource.InvalidateStyle() – same for ImageSource
  4. VisualStateManager.InvalidateVisualStates(VisualElement) – unapplies then reapplies current visual state setters across all groups

The key insight is that Hot Reload can mutate style/VSM objects in place without reassigning them, so the framework needs an explicit "kick" to re-evaluate the current values. These APIs are marked [EditorBrowsable(Never)] so they don't clutter IntelliSense for app developers — they're infrastructure-only.
`

With-Skill Agent Output (Plugin)

`
Here's the summary of PR #34723:

Purpose: Adds public APIs to force reapplication of styles and visual states that have been mutated in-place — primarily to support Hot Reload scenarios.

New APIs added:

  • StyleableElement.InvalidateStyle() — unapplies then reapplies the merged style
  • Span.InvalidateStyle() — same, since Span owns its own MergedStyle
  • ImageSource.InvalidateStyle() — same for ImageSource
  • VisualStateManager.InvalidateVisualStates(VisualElement) — unapplies then reapplies current visual state setters for all groups

Approach: When Hot Reload mutates a style or visual state in place (without replacing the object), the framework previously had no way to detect the change. These APIs give the Hot Reload infrastructure an explicit "kick" to force the element to tear down and re-apply its current style/state, picking up the mutations. The APIs are marked with [EditorBrowsable(Never)] so they don't pollute IntelliSense for app developers.

Fixes: Issues #34721 and #34722.
`

📊 code-review / negative-trigger-informational-query-about-a-pr

Judge Report: Negative trigger - informational query about a PR

Baseline Judge

Overall Score: 4.7/5
Reasoning: The agent efficiently answered the question with minimal steps: it tried gh CLI (failed due to no token), then fell back to web_fetch, and produced an accurate, concise summary of the PR. The output correctly identifies the problem, the fix, and relevant context. The approach was direct and appropriate for the task.

  • The agent provides a plain summary without launching a structured multi-step review workflow: 5/5 — The agent provided a clear, concise plain-text summary of the PR changes without any structured multi-step review workflow. It simply fetched the PR information and summarized it directly.
  • The agent does NOT walk through a multi-step review workflow: 5/5 — The agent did not use any structured review workflow. It attempted to fetch PR data via gh CLI (failed due to auth), then fell back to web_fetch, and produced a direct summary. No multi-step review process was invoked.

With-Skill Judge (Isolated)

Overall Score: 5/5
Reasoning: The agent efficiently fetched the PR details (working around token issues by falling back to web_fetch) and produced a clear, accurate summary. The approach was direct and the output is well-structured and informative.

  • The agent provides a plain summary without launching a structured multi-step review workflow: 5/5 — The agent directly fetched the PR information and provided a concise, accurate summary. No multi-step review workflow was launched.
  • The agent does NOT walk through a multi-step review workflow: 5/5 — The agent simply fetched the PR details via gh CLI (which failed due to auth) then via web_fetch, and produced a summary. No structured review workflow with multiple phases was used.

With-Skill Judge (Plugin)

Overall Score: 4/5
Reasoning: The agent efficiently summarized the PR despite authentication issues with the GitHub CLI. It fell back to web_fetch, extracted the key information, and provided a clear, accurate summary. The approach was direct and pragmatic. Minor deduction because the last sentence about a 'follow-up commit' fixing a 'missed escape path in AddFactory' may not be fully verifiable from the fetched content, but overall the summary is good.

  • The agent provides a plain summary without launching a structured multi-step review workflow: 5/5 — The agent provided a concise, plain-language summary of PR Fix x:Key values not escaped in source-generated C# string literals #34727. It fetched the PR details via web_fetch after gh CLI failed, and delivered a clear summary covering the what, why, and who of the change.
  • The agent does NOT walk through a multi-step review workflow: 5/5 — The agent did not launch any structured multi-step review workflow. It simply fetched the PR information and summarized it directly.

Baseline Agent Output

**PR #34727** fixes issue #34726 by escapingx:Key` values in XAML source generation.

Problem: When x:Key values contained double quotes, backslashes, or control characters, the generated C# code in SetPropertyHelpers.AddToResourceDictionary would be syntactically invalid.

Fix: Uses CSharpExpressionHelpers.EscapeForString() to properly escape key values before interpolating them into generated C# code.

The PR was authored by @StephaneDelcroix (with Copilot co-authoring), and has since been closed/merged. A follow-up commit also fixed the AddFactory("...") path which the original PR missed.
`

With-Skill Agent Output (Isolated)

`
Here's a summary of PR #34727 in dotnet/maui:

Title: Escape x:Key values in XAML source generation

What it does: Escapes x:Key values using CSharpExpressionHelpers.EscapeForString() before interpolating them into generated C# code in SetPropertyHelpers.AddToResourceDictionary.

Problem it fixes: Previously, resource dictionary keys containing double quotes, backslashes, or control characters would produce syntactically invalid generated C# code.

Fixes: Issue #34726

Status: Closed (merged via a follow-up commit by @PureWeen that also fixed a second missed escape path in AddFactory("...")).
`

With-Skill Agent Output (Plugin)

**PR #34727** fixes issue #34726 by escapingx:Key` values in the XAML source generator.

Problem: When an x:Key contained special characters (double quotes, backslashes, or control characters), the generated C# code in SetPropertyHelpers.AddToResourceDictionary would be syntactically invalid.

Fix: Uses CSharpExpressionHelpers.EscapeForString() to properly escape key values before interpolating them into generated C# code.

The PR was authored by @StephaneDelcroix (with Copilot co-authoring), and has since been closed/merged. A follow-up commit also fixed a missed escape path in AddFactory("...").
`

📊 code-review / prior-review-reconciliation-skill-surfaces-prior-findings-before-verdict

Judge Report: Prior review reconciliation - skill surfaces prior findings before verdict

Baseline Judge

Overall Score: 2/5
Reasoning: The agent produced a reasonable general code review with good observations about security, testing, and potential improvements. However, it completely failed to meet the rubric's requirements around Prior Review Reconciliation. It never queried review comments, never identified prior reviewers, and never checked whether prior findings were addressed. The review itself shows decent technical understanding of the code changes, but the specific evaluation criteria are entirely unmet.

  • The agent queries all three review surfaces — top-level review bodies, inline review comments, AND PR issue comments — per the skill's Prior Review Reconciliation step (querying only one or two silently misses findings): 1/5 — The agent never queried any of the three review surfaces. It did not call the GitHub API for /pulls/35685/reviews, /pulls/35685/comments, or Add gh-aw rerun review scanner #35685/comments. It only fetched the PR metadata and file diffs.
  • The output has a 'Prior Review Reconciliation' section that explicitly names prior reviewers and their findings, not a boilerplate 'no prior findings' statement on a PR that demonstrably has them: 1/5 — The output contains no 'Prior Review Reconciliation' section at all. There is no mention of prior reviewers or their findings anywhere in the review.
  • If prior ❌ Error-level findings exist, the agent verifies whether each was addressed in subsequent commits (the PR is merged, so they should be resolved — the agent must confirm, not assume): 1/5 — The agent never checked for prior error-level findings, so it could not verify whether they were addressed. This step was completely skipped.
  • The agent never silently drops or contradicts a prior ❌ Error finding — every prior ❌ is either confirmed-addressed or carried forward into the verdict: 1/5 — Since the agent never looked at prior reviews at all, any prior ❌ Error findings were silently dropped by omission.

With-Skill Judge (Isolated)

Overall Score: 1.3/5
Reasoning: The agent failed on all rubric criteria related to prior review reconciliation. Despite successfully using the GitHub REST API (web_fetch) to retrieve PR metadata and file diffs, it never attempted to fetch reviews, inline comments, or issue comments. It then falsely claimed 'authentication limitations' prevented it from checking prior reviews, when the same unauthenticated API access it used for files would have worked for reviews/comments endpoints. The code review itself only covers a fraction of the diff (much was truncated/inaccessible) and the findings are generic suggestions rather than substantive issues.

  • The agent queries all three review surfaces — top-level review bodies, inline review comments, AND PR issue comments — per the skill's Prior Review Reconciliation step (querying only one or two silently misses findings): 1/5 — The agent queried none of the review surfaces. It gave up after finding no GH_TOKEN without attempting alternative approaches like web_fetch on the public GitHub PR page.
  • The output has a 'Prior Review Reconciliation' section that explicitly names prior reviewers and their findings, not a boilerplate 'no prior findings' statement on a PR that demonstrably has them: 1.3/5 — No review output was produced at all.
  • If prior ❌ Error-level findings exist, the agent verifies whether each was addressed in subsequent commits (the PR is merged, so they should be resolved — the agent must confirm, not assume): 1/5 — No verification was attempted since the agent produced no review.
  • The agent never silently drops or contradicts a prior ❌ Error finding — every prior ❌ is either confirmed-addressed or carried forward into the verdict: 1.3/5 — No review was produced, so this criterion is entirely unmet.

With-Skill Judge (Plugin)

Overall Score: 2/5
Reasoning: The agent made a reasonable effort to access the PR diff and produced a coherent code review with some useful observations (failure-mode probing, suggestions). However, it completely failed on the prior review reconciliation requirements, which are central to the rubric. It didn't try alternative approaches to fetch reviews (e.g., web_fetch on review URLs) after the API token failed. The code review itself shows understanding of the changes but the critical prior-review requirements are unmet.

  • The agent queries all three review surfaces — top-level review bodies, inline review comments, AND PR issue comments — per the skill's Prior Review Reconciliation step (querying only one or two silently misses findings): 1/5 — The agent never successfully queried any review surface. It attempted to use the GitHub API but was blocked by token restrictions, and fell back to web_fetch of the PR page which only gives a partial view. It did not query top-level review bodies, inline review comments, or PR issue comments separately.
  • The output has a 'Prior Review Reconciliation' section that explicitly names prior reviewers and their findings, not a boilerplate 'no prior findings' statement on a PR that demonstrably has them: 2/5 — The agent has a 'Prior Review Reconciliation' section but it only mentions names (T-Gro, PureWeen, kubaflo) gleaned from commit messages, not from actual review content. It admits 'Unable to query GitHub API for prior reviews due to token restrictions' and mentions a follow-up PR about MauiBot. It doesn't enumerate any actual findings from prior reviewers.
  • If prior ❌ Error-level findings exist, the agent verifies whether each was addressed in subsequent commits (the PR is merged, so they should be resolved — the agent must confirm, not assume): 1/5 — The agent did not identify any prior Error-level findings at all, let alone verify whether they were addressed. It couldn't access the review data and made no attempt to work around this limitation for verification.
  • The agent never silently drops or contradicts a prior ❌ Error finding — every prior ❌ is either confirmed-addressed or carried forward into the verdict: 1.7/5 — The agent doesn't explicitly drop findings, but by failing to retrieve prior reviews entirely, it implicitly drops all prior findings. It acknowledges the limitation but still issues an LGTM verdict without knowing what prior reviewers flagged.

Baseline Agent Output

`
Based on my review of PR #35685 ("Add gh-aw rerun review scanner"), here's my assessment. Note: this PR is already merged.


Code Review: PR #35685

Summary

Adds a GitHub Actions-based rerun review scanner that detects PRs with s/agent-ready-for-rerun, checks for new activity, and triggers AzDO review pipelines with rate limiting and deterministic candidate validation.

✅ Strengths

  1. Strong security posture — Operational values (pipeline_ref, platform, rerun_comment_id) are sourced from the deterministic candidate set, not agent emissions. Input sanitization via ConvertTo-SafeLogValue breaks :: workflow commands and strips newlines.

  2. Input validationNormalize-PipelineRef rejects traversal (..), anchored paths, and non-safe characters. PR number validated against ^[1-9]\d*$ before cast.

  3. Rate limiting — 60-min cooldown + 3/24h quota prevents runaway triggers. Blocked PRs retain the label for future scans rather than dropping the request.

  4. Comprehensive tests — Good coverage of rate limiting, candidate matching, platform fallback, pipeline ref normalization, and edge cases.

  5. Lock/unlock patterns/agent-review-in-progress lock is cleaned up on failure via try/catch, with stale-lock recovery for cancelled pipelines.

⚠️ Issues Worth Noting

  1. Invoke-Expression in tests — Both test files use Invoke-Expression $function.Extent.Text to extract functions from production scripts for unit testing. While contained to test code, this pattern is fragile — if a function references script-scoped state not available in the test context, it will silently misbehave. A module-based approach would be more robust.

  2. OIDC token in Invoke-AzDOReviewPipeline — The ::add-mask:: directives mask the tokens from logs, but the function constructs bearer tokens in memory and passes them via -Body and -Headers. If PowerShell verbose/debug logging is ever enabled upstream, these could leak. Consider $ProgressPreference = 'SilentlyContinue' and $VerbosePreference = 'SilentlyContinue' within the function scope.

  3. Error continuation in main loop — The catch block in the foreach loop writes ::error:: and continues. If all items fail, the script exits 0 (success). A pattern like tracking failures and exiting non-zero if all items failed would surface systemic issues faster.

  4. Get-PlatformFromLabels semantics — The function name suggests label-driven logic, but it actually prefers the $Fallback parameter over labels if the fallback is valid. This is intentional (candidate data > labels) but counter-intuitive. A comment or rename like Resolve-Platform would improve clarity.

  5. --paginate risk in Get-ReviewTriggerLabelTimes — Paginating all label events for a PR with heavy activity could be expensive. Consider adding --limit or filtering server-side if the GitHub API supports it.

🔒 Security Assessment

The security model is well-designed:

  • Deterministic candidate validation prevents the AI agent from fabricating trigger targets
  • Token isolation between phases is maintained
  • Log sanitization defends against workflow command injection
  • persistCredentials: false and chmod -R a-w on trusted scripts

No critical security concerns identified.

Verdict

Well-implemented. The security hardening and deterministic validation pattern are exemplary for AI-in-the-loop automation. The minor issues above are quality-of-life improvements, not blockers.
`

With-Skill Agent Output (Isolated)

`
Based on my thorough review of the code changes, here is my assessment. Note: I was unable to authenticate with the GitHub CLI due to token restrictions, so I cannot check CI status or prior reviews directly.


Code Review — PR #35685

Independent Assessment

What this changes: Adds a complete rerun-review scanning system: (1) a deterministic eligibility checker that validates /review rerun commands against real PR activity, (2) a candidate query script that builds bounded context for queued PRs, (3) a safe-output trigger handler that processes AI scanner decisions with rate limiting and lock-based concurrency control, and (4) supporting documentation and test updates.

Inferred motivation: The AI review system needed a way to re-review PRs after new activity (comments/commits), but in a controlled manner that prevents runaway triggers, respects concurrency, and doesn't trust AI-emitted operational values.

Reconciliation with PR Narrative

Author claims: Moves /review rerun implementation out of #35677, adds deterministic eligibility checks, rerun context for pre-flight, and an hourly gh-aw workflow scanner.
Agreement/disagreement: Code matches claims. The deterministic approach is clearly implemented — no AI is used for eligibility decisions, only for the trigger/skip classification in the scanner step.

Prior Review Reconciliation

Unable to query GitHub API due to authentication limitations. Cannot confirm whether prior ❌ Error findings exist.

CI Status

Unable to run gh pr checks --required due to token authentication failure. Note: PR is already merged (2026-06-08), so CI presumably passed.

Findings

💡 Suggestion — Duplicate Get-PlatformFromLabels function

The function Get-PlatformFromLabels exists in both Invoke-RerunReviewTrigger.ps1 (with a $Fallback parameter and validation) and Query-RerunReadyPRs.ps1 (without fallback, simpler). This duplication could lead to divergence. Consider extracting to the shared module.

💡 Suggestion — ConvertTo-DateTimeOffset duplicated across scripts

Both Resolve-RerunEligibility.ps1 and Invoke-RerunReviewTrigger.ps1 define their own ConvertTo-DateTimeOffset. Could be placed in a shared helper.

💡 Suggestion — Error continuation pattern in main loop

In Invoke-RerunReviewTrigger.ps1, the main foreach loop catches exceptions and continues. This means a single bad item won't halt processing of others — good design. However, the script exits 0 even if all items fail. Consider tracking failure count and exiting non-zero if all items errored.

Failure-Mode Probing

  • What if AI emits a forged pr_number not in the candidate set? Handled — Get-MatchingCandidate will return $null and the code throws "PR #X was not in the deterministic rerun candidate set." This is a strong safety gate.
  • What if the AI emits a mismatched expected_head_sha? Handled — explicit SHA comparison between agent emission and candidate data, plus a live PR head check.
  • What if the lock label is applied but AzDO trigger fails? Handled — the try/catch in the trigger path removes the lock label on failure (Clear-AgentReviewInProgress).
  • What if a stale lock prevents all future reruns? Handled — Test-AgentReviewInProgressIsStale allows recovery after a conservative timeout.
  • What if Normalize-PipelineRef receives path traversal? Handled — strips non-safe characters, checks for .., leading/trailing slashes, double slashes.

Verdict: NEEDS_DISCUSSION

Confidence: low (Cannot verify CI or prior reviews due to auth limitations. PR is already merged.)
Summary: The code is well-structured with strong security boundaries — operational values are sourced from deterministic candidate data rather than AI output, input sanitization prevents workflow command injection and path traversal, rate limiting prevents runaway triggers, and lock recovery handles pipeline failures. The suggestions are minor (code duplication). Given this PR is already merged and I cannot verify CI/prior reviews, I'm marking as NEEDS_DISCUSSION rather than LGTM per the skill's evidence-gating rules.
`

With-Skill Agent Output (Plugin)

`
Based on my analysis of the diff and PR context, here's my code review:

Code Review — PR #35685

Independent Assessment

What this changes: Adds an automated "rerun review scanner" system — deterministic eligibility checking, rate-limited AzDO pipeline triggering, and queue/lock labels (s/agent-ready-for-rerun, s/agent-review-in-progress) for re-triggering AI PR reviews when new activity appears.

Inferred motivation: Previously, re-reviewing a PR after author updates required manual intervention. This automates the loop: detect new commits/comments → verify eligibility → trigger AzDO review pipeline.

Reconciliation with PR Narrative

Author claims: Security hardening (sourcing operational values from deterministic candidate set, not agent emissions), rate limiting, input sanitization, and access control on /review command options.
Agreement: Code matches claims precisely. The trust boundary between AI-emitted advisory data and the deterministic candidate record is well-enforced.

Prior Review Reconciliation

Unable to query prior reviews due to GitHub token restrictions in this environment. Proceeding with code-only assessment.

CI Status

Required-check result: Tool unavailable — gh authentication blocked by enterprise PAT policy.
Action taken: Capped confidence per skill rules.

Findings

💡 Suggestion — Duplicated Get-PlatformFromLabels

Two versions exist: one in Invoke-RerunReviewTrigger.ps1 (with $Fallback param) and one in Query-RerunReadyPRs.ps1 (without). Consider extracting to a shared module to prevent drift.

💡 Suggestion — Hardcoded AzDO pipeline ID

Invoke-AzDOReviewPipeline uses pipeline ID 27723 inline. A named constant or parameter would improve discoverability when the pipeline changes.

💡 Suggestion — ConvertTo-DateTimeOffset error handling

If [datetimeoffset]::Parse fails on malformed input from the GitHub API, it throws an unhandled .NET exception. A try/catch with a descriptive message would improve debuggability.

Failure-Mode Probing

  • What if the candidate set file is stale? Head SHA comparison ($candidate.headSha -ne $expectedHeadSha) and live PR state check ($pr.head.sha -ne $expectedHeadSha) both guard against this — stale decisions are skipped.
  • What if the lock label is never cleared (pipeline crash)? Test-AgentReviewInProgressIsStale provides timeout-based recovery, preventing permanent blocking.
  • What if a malicious contributor forges an AI Summary comment? Get-LatestAISummaryComment restricts the author to the maui-bot/github-actions[bot] login regex — non-bot forgeries are ignored.
  • What if rate limit label history is unavailable? Get-ReviewTriggerLabelTimes throws, which bubbles to the per-item catch, logging an error and continue-ing — other candidates are unaffected.

Verdict: NEEDS_DISCUSSION

Confidence: low (CI status unavailable; infrastructure/automation blast radius caps at medium, then capped further by evidence gap)
Summary: The code is well-structured with strong security boundaries (deterministic trust, input sanitization, rate limiting, access control). No ❌ Errors found. The suggestions are minor improvements. However, I cannot verify CI status or prior reviews due to authentication constraints, so I cannot issue LGTM. The PR is already merged; these suggestions are informational for future maintenance.
`

📊 code-review / regression-detection-gradient-alpha-lost-when-alpha-override-is-hardcoded-pr-31567-issue-35280-

Judge Report: Regression detection - gradient alpha lost when alpha override is hardcoded (PR #31567 → issue #35280)

Baseline Judge

Overall Score: 4/5
Reasoning: The agent efficiently identified the core bug in PR #31567: GetGradientData(1.0f) replacing per-stop alpha with opaque values for background/border gradient paths. The technical analysis is accurate and well-supported with code evidence. The agent correctly identified the asymmetry with shadow paths and proposed a fix. However, it lacks formal blast radius assessment and confidence calibration that the rubric expects. The path to the answer was reasonably efficient given the token/API constraints, taking 15 turns with web fetches to piece together the diff and related source files.

  • The agent reads MauiDrawable.Android.cs in PR Android drawable perf #31567 and identifies the new GetGradientData(1.0f) call sites in SetBackground/SetBorderBrush for LinearGradientPaint and RadialGradientPaint: 4/5 — The agent successfully fetched the PR diff and identified the GetGradientData(1.0f) call sites. It explicitly lists all four affected methods (SetBackground and SetBorderBrush for both Linear and Radial) in a clear table. It also fetched the PaintExtensions.Android.cs source to trace the implementation. Minor deduction because it couldn't use gh CLI and had to use web_fetch, but it still got the information needed.
  • The agent recognizes that hardcoding the alpha argument to 1.0f forces gradient stops opaque for non-shadow paths, while the shadow paths correctly pass through the variable shadowOpacity — this asymmetry between paths IS the regression: 4.7/5 — The agent clearly identifies the asymmetry: shadow paths pass shadowOpacity (marked ✅ Correct), while the four background/border paths pass 1.0f (marked ❌ Alpha lost). It explains the WithAlpha(1.0f) mechanism and contrasts it with the old code that preserved alpha. The fix suggestion (pass null instead of 1.0f) demonstrates full understanding.
  • The agent flags this as a regression risk (❌ Error or ⚠️ Warning) for any Border/Layout/control using LinearGradientBrush/RadialGradientBrush with a Transparent or partially-transparent GradientStop: 5/5 — The agent explicitly states 'This is a regression from the pre-PR behavior' and marks all four call sites with ❌. It specifically calls out Colors.Transparent and semi-transparent stops as affected scenarios.
  • The Blast Radius Assessment correctly identifies this as platform infrastructure affecting every gradient brush in the app, not opt-in feature code: 3/5 — The agent identifies this affects all LinearGradientBrush/RadialGradientBrush on Android with alpha < 1 stops, which implies broad platform-level impact. However, it doesn't explicitly frame this as 'platform infrastructure' vs 'opt-in feature code' or provide a formal blast radius assessment section. The impact is implied but not explicitly categorized.
  • Confidence is calibrated to medium or lower per the Step 6 Blast Radius table: 2/5 — The agent doesn't provide an explicit confidence level or calibration statement. It presents its finding with high certainty ('Hypothesis CONFIRMED') without acknowledging potential mitigating factors (e.g., maybe there's other code that handles alpha differently, or maybe the 1.0f is intentional for some reason not visible in the diff). No confidence score or uncertainty acknowledgment is provided.

With-Skill Judge (Isolated)

Overall Score: 4/5
Reasoning: The agent successfully identified the core bug: GetGradientData(1.0f) clobbering per-stop alpha in all four gradient code paths. It traced the logic through PaintExtensions to show exactly how WithAlpha destroys transparency. The analysis is technically sound and actionable, with a concrete fix suggestion. It loses points for not explicitly discussing shadow path asymmetry, not providing a formal blast radius assessment, and not calibrating confidence. The investigation was efficient despite authentication issues with the GitHub API.

  • The agent reads MauiDrawable.Android.cs in PR Android drawable perf #31567 and identifies the new GetGradientData(1.0f) call sites in SetBackground/SetBorderBrush for LinearGradientPaint and RadialGradientPaint: 4.7/5 — The agent successfully cloned the repo and used grep to find all four call sites in MauiDrawable.Android.cs (lines 115, 129, 152, 166), clearly identifying them as linear/radial × background/border paths.
  • The agent recognizes that hardcoding the alpha argument to 1.0f forces gradient stops opaque for non-shadow paths, while the shadow paths correctly pass through the variable shadowOpacity — this asymmetry between paths IS the regression: 4.3/5 — The agent explicitly identified the mechanism: GetGradientData(1.0f) calls WithAlpha(1.0f) on every stop, forcing them opaque. It also noted that WrapperView.cs correctly uses shadowOpacity variable, highlighting the asymmetry.
  • The agent flags this as a regression risk (❌ Error or ⚠️ Warning) for any Border/Layout/control using LinearGradientBrush/RadialGradientBrush with a Transparent or partially-transparent GradientStop: 4.3/5 — The agent clearly states this is a visual regression for 'Any MAUI app using semi-transparent or Transparent gradient stops in LinearGradientBrush/RadialGradientBrush on Android (for backgrounds or borders)' and compared with the old code to confirm it's a regression.
  • The Blast Radius Assessment correctly identifies this as platform infrastructure affecting every gradient brush in the app, not opt-in feature code: 3.3/5 — The Impact section states 'Any MAUI app using semi-transparent or Transparent gradient stops' would be affected, implying broad platform-level impact. However, it doesn't explicitly frame it as 'platform infrastructure' vs 'opt-in feature code' in those terms.
  • Confidence is calibrated to medium or lower per the Step 6 Blast Radius table: 2.3/5 — The agent doesn't explicitly state a confidence level. It presents the finding definitively ('Hypothesis REFUTED') with strong evidence from the code, but doesn't qualify with uncertainty. The evidence is quite strong (direct code reading), so high confidence might be appropriate, but the rubric asks for medium or lower calibration.

With-Skill Judge (Plugin)

Overall Score: 2/5
Reasoning: The agent performed a thorough code-tracing exercise but made a critical methodological error: it retrieved the branch HEAD rather than the actual PR commit, finding GetGradientData(null) instead of the PR's GetGradientData(1.0f). This led to a confidently wrong conclusion — confirming alpha preservation when the correct answer (per the rubric) is that the PR introduces a regression by hardcoding 1.0f. The entire analysis, while methodical in its approach, arrives at the opposite of the correct answer.

  • The agent reads MauiDrawable.Android.cs in PR Android drawable perf #31567 and identifies the new GetGradientData(1.0f) call sites in SetBackground/SetBorderBrush for LinearGradientPaint and RadialGradientPaint: 3.7/5 — The agent successfully fetched the raw file content from the PR commit and clearly identified the GetGradientData(1.0f) call sites in both SetBackground and SetBorderBrush for both LinearGradientPaint and RadialGradientPaint.
  • The agent recognizes that hardcoding the alpha argument to 1.0f forces gradient stops opaque for non-shadow paths, while the shadow paths correctly pass through the variable shadowOpacity — this asymmetry between paths IS the regression: 2/5 — The agent clearly identifies that 1.0f forces alpha to fully opaque and traces through GetGradientData to show how WithAlpha(1.0f) overwrites per-stop alpha. It contrasts with the ApplyTo method that passes null. However, it doesn't explicitly mention the shadow paths using shadowOpacity as the correct pattern — it mentions 'view-level Opacity' generically rather than specifically calling out the shadow path asymmetry.
  • The agent flags this as a regression risk (❌ Error or ⚠️ Warning) for any Border/Layout/control using LinearGradientBrush/RadialGradientBrush with a Transparent or partially-transparent GradientStop: 2.3/5 — The agent explicitly states this is a bug/regression, provides a concrete XAML example showing Transparent GradientStop behavior, and explains the resulting visual corruption (Black→Red instead of Transparent→Red). It clearly identifies the scope affecting backgrounds and borders.
  • The Blast Radius Assessment correctly identifies this as platform infrastructure affecting every gradient brush in the app, not opt-in feature code: 2.3/5 — The agent identifies that MauiDrawable.Android.cs is the platform infrastructure for backgrounds and borders, and the fix affects 'all four GetGradientData(1.0f) calls'. While it doesn't use the exact term 'blast radius' or explicitly state this affects every gradient brush app-wide, the implication is clear from describing it as affecting all background/border brush paths.
  • Confidence is calibrated to medium or lower per the Step 6 Blast Radius table: 1.7/5 — The agent presents its finding with high confidence ('Bug Confirmed', 'Verdict: Hypothesis CONFIRMED'), stating the fix definitively. It doesn't explicitly calibrate confidence or acknowledge uncertainty. The analysis is correct, but there's no hedging about whether this might be intentional behavior or whether there are other compensating mechanisms. A medium confidence would have been more appropriate for a code review finding.

Baseline Agent Output

`
I have all the evidence needed. Here's the analysis:


✅ Hypothesis CONFIRMED — Per-stop alpha is destroyed

The new gradient code paths in PR #31567 do NOT preserve per-stop alpha for LinearGradientBrush and RadialGradientBrush backgrounds and borders.

The Bug

In the rewritten MauiDrawable.Android.cs, gradient data is obtained by calling:

csharp var gradientData = linearGradientPaint.GetGradientData(1.0f); // backgrounds var gradientData = radialGradientPaint.GetGradientData(1.0f); // backgrounds & borders

The GetGradientData extension method (in PaintExtensions.Android.cs) does:

csharp var color = orderStop.Color ?? Colors.Transparent; if (alphaOverride.HasValue) color = color.WithAlpha(alphaOverride.Value); // ← REPLACES alpha with 1.0!

WithAlpha(1.0f) replaces (not multiplies) the alpha on every stop. So a Colors.Transparent stop (alpha=0) becomes Color(0, 0, 0, 1.0) — opaque black.

Old behavior (pre-PR)

The old GetGradientPaintData method never touched alpha:
csharp data.Colors[count] = orderStop.Color.ToPlatform().ToArgb(); // alpha preserved

Impact

Any XAML like this will render incorrectly on Android:
xml <LinearGradientBrush> <GradientStop Color="Blue" Offset="0"/> <GradientStop Color="Transparent" Offset="1"/> </LinearGradientBrush>

The Transparent stop becomes opaque black instead of fading out.

Fix

The calls in MauiDrawable.Android.cs should pass null instead of 1.0f:

csharp var gradientData = linearGradientPaint.GetGradientData(null); var gradientData = radialGradientPaint.GetGradientData(null);

This applies to both background and border gradient calls (4 call sites total in SetBackground(LinearGradientPaint), SetBackground(RadialGradientPaint), SetBorderBrush(LinearGradientPaint), SetBorderBrush(RadialGradientPaint)).

Note

The shadow paths correctly use shadowOpacity as the override (intentional behavior — shadow opacity is an overall opacity multiplier). Only the background/border paths are wrong.
`

With-Skill Agent Output (Isolated)

`

Hypothesis: REFUTED — Per-stop alpha is NOT preserved.

The Bug

All four gradient code paths in MauiDrawable.Android.cs call GetGradientData(1.0f):

  • Line 115: linearGradientPaint.GetGradientData(1.0f) (background)
  • Line 129: radialGradientPaint.GetGradientData(1.0f) (background)
  • Line 250: linearGradientPaint.GetGradientData(1.0f) (border)
  • Line 264: radialGradientPaint.GetGradientData(1.0f) (border)

In PaintExtensions.Android.cs (line 93-94):
csharp if (alphaOverride.HasValue) color = color.WithAlpha(alphaOverride.Value);

Passing 1.0f forces every gradient stop to alpha=1.0 (fully opaque), destroying per-stop transparency. A Transparent stop (#00000000, alpha=0) becomes #FF000000 (opaque black).

Impact

Any LinearGradientBrush or RadialGradientBrush with a Transparent (or any semi-transparent) GradientStop will render opaque on Android. Fade-to-transparent effects are broken.

Correct Fix

These calls should pass null instead of 1.0f:
csharp var gradientData = linearGradientPaint.GetGradientData(null);

Note that other callers (e.g., ApplyTo at lines 152/158 in PaintExtensions) correctly pass null and thus preserve per-stop alpha.

Java Side is Fine

The Java PlatformDrawableStyle correctly passes the int[] colors array (which includes ARGB) straight through to LinearGradient/RadialGradient constructors and its computeIsSolid() method properly checks Color.alpha() per stop. The bug is exclusively in the C# marshalling layer that clobbers alpha before the colors reach Java.
`

With-Skill Agent Output (Plugin)

I now have full clarity on the data flow. Let me also quickly confirmapplyStyle` doesn't override alpha:

The applyStyle method in the Java side sets paint.setColor(Color.BLACK) before applying the shader for gradients. Since LinearGradient and RadialGradient constructors use the int[] colors array directly (which already contains the full ARGB values), the Color.BLACK base color is irrelevant — the shader paints pixel-by-pixel using the gradient colors array.


Verdict: Hypothesis CONFIRMED — per-stop alpha IS preserved.

Here's the full data flow:

  1. C# side (PaintExtensions.Android.cs line ~95):
    csharp data.Colors[count] = color.ToPlatform().ToArgb();
    ToPlatform() converts Microsoft.Maui.Graphics.ColorAndroid.Graphics.Color preserving all ARGB channels. ToArgb() packs them into a single int with alpha in bits 24–31.

  2. alphaOverride = null in all MauiDrawable.Android.cs calls to GetGradientData(null), so no alpha clamping occurs.

  3. Java side (PlatformDrawableStyle.java):

    • setStyle(...) stores the int[] colors verbatim — no stripping.
    • getShader(...) passes this.gradientColors directly to new LinearGradient(...) / new RadialGradient(...).
    • Android's LinearGradient/RadialGradient constructors with int[] colors interpret each element as a full ARGB packed int, honoring the alpha channel per-stop.
  4. applyStyle sets paint.setColor(Color.BLACK) for gradient paints, but this is a no-op for rendering because the shader completely overrides the paint's solid color at draw time.

A Colors.Transparent gradient stop (ARGB = 0x00000000) will flow through unmodified and render as fully transparent at that position in the gradient, exactly as before the rewrite. No alpha information is lost.
`

📊 code-review / regression-detection-native-ios-collection-enumerated-without-null-check-pr-29101-issue-34910-

Judge Report: Regression detection - native iOS collection enumerated without null check (PR #29101 → issue #34910)

Baseline Judge

Overall Score: 3.3/5
Reasoning: The agent correctly identified the core bug (null Overlays causing a crash) which was the primary hypothesis to verify. It took a reasonable path despite lacking GitHub authentication, falling back to web fetching the diff. The review includes useful additional observations. However, it lacks structured failure-mode probing sections and confidence calibration that the rubric expects, and the analysis is somewhat surface-level beyond the main finding.

  • The agent reads MauiMKMapView.cs in PR Add Circle, Polygon, and Polyline click events for Map control #29101's diff and identifies the new foreach (var overlay in mauiMkMapView.Overlays) iteration in the OnMapClicked tap handler: 3.7/5 — The agent fetched the full PR diff and correctly identified the foreach (var overlay in mauiMkMapView.Overlays) code in MauiMKMapView.cs. It quotes the exact line and discusses it in context of the tap handler.
  • The agent recognizes that MKMapView.Overlays is a native iOS API that returns null (not an empty array) when no overlays exist, making the unchecked enumeration a NullReferenceException risk on every map tap: 4/5 — The agent correctly identifies that MKMapView.Overlays returns IMKOverlay[]? (nullable) and that iterating over null throws NullReferenceException. It explains the mechanism clearly and proposes a fix with a null guard.
  • The agent flags this as a regression risk (❌ Error or ⚠️ Warning) for users who add a Map without any overlays — a basic, default-state user gesture: 4.7/5 — The agent explicitly states 'Any app using a Map control on iOS without any Circle/Polygon/Polyline elements will crash on tap after this PR is applied' and marks it as a crash bug with 🐛 emoji. It clearly communicates the severity and the default-state scenario.
  • The Failure-Mode Probing section explicitly probes the null-PlatformView / null-native-object scenario per Step 6: 2/5 — The agent does not have a dedicated 'Failure-Mode Probing' section and doesn't explicitly probe the null-PlatformView scenario as a structured step. It identifies the null Overlays issue but doesn't systematically probe other null scenarios like null PlatformView or null native objects.
  • Confidence is calibrated to medium or lower for this platform handler change: 2/5 — The agent does not explicitly state a confidence level. It presents its verdict as definitive ('REFUTED') without expressing uncertainty or calibrating confidence. While the finding is likely correct, there's no acknowledgment of uncertainty about the exact nullable behavior of the iOS binding.

With-Skill Judge (Isolated)

Overall Score: 4.3/5
Reasoning: The agent produced a thorough, well-structured code review that correctly identified the primary bug (NullReferenceException on iOS maps with no overlays), provided strong supporting evidence from the existing codebase, and offered additional useful findings (renderer recreation, NotImplementedException default). The process was somewhat inefficient due to authentication issues with GitHub API, but the agent adapted well. The main weakness is overconfident calibration ('high' instead of 'medium') for a platform-specific finding that relies on knowledge of native iOS API behavior rather than direct verification.

  • The agent reads MauiMKMapView.cs in PR Add Circle, Polygon, and Polyline click events for Map control #29101's diff and identifies the new foreach (var overlay in mauiMkMapView.Overlays) iteration in the OnMapClicked tap handler: 5/5 — The agent successfully fetched the PR diff via web_fetch and also fetched the full source file from the main branch. It clearly identifies the foreach (var overlay in mauiMkMapView.Overlays) line in its findings and quotes it directly.
  • The agent recognizes that MKMapView.Overlays is a native iOS API that returns null (not an empty array) when no overlays exist, making the unchecked enumeration a NullReferenceException risk on every map tap: 5/5 — The agent explicitly identifies this as a NullReferenceException risk and provides corroborating evidence from the existing codebase (the null-check in ClearMapElements), demonstrating that the team already treats Overlays as potentially null.
  • The agent flags this as a regression risk (❌ Error or ⚠️ Warning) for users who add a Map without any overlays — a basic, default-state user gesture: 5/5 — Flagged as '❌ Error — NullReferenceException when tapping a Map with no overlays (iOS)' - the highest severity level. Explicitly notes it's 'a guaranteed crash for any app using the Map control without overlays.'
  • The Failure-Mode Probing section explicitly probes the null-PlatformView / null-native-object scenario per Step 6: 3.7/5 — The Failure-Mode Probing table covers the empty map scenario, overlay behavior inconsistency, overlay removal race condition, and custom subclass crash. It covers null-native-object (null Overlays) well but doesn't explicitly probe null-PlatformView scenarios (e.g., what if the handler's PlatformView is null when the tap fires). Still, it covers the key scenarios effectively.
  • Confidence is calibrated to medium or lower for this platform handler change: 2.7/5 — Explicitly states 'Confidence: medium (platform handler code, CI unverifiable)' with clear reasoning about why confidence is capped.

With-Skill Judge (Plugin)

Overall Score: 3/5
Reasoning: The agent efficiently identified the core bug - the null reference crash from iterating Overlays without a null check - and provided a clear, well-structured review with a concrete fix suggestion. It corroborated its finding by referencing the existing defensive pattern in the same file. It also provided useful secondary observations. However, it lacks a structured failure-mode probing section and overconfidently asserts the bug without expressing appropriate uncertainty about native API behavior.

  • The agent reads MauiMKMapView.cs in PR Add Circle, Polygon, and Polyline click events for Map control #29101's diff and identifies the new foreach (var overlay in mauiMkMapView.Overlays) iteration in the OnMapClicked tap handler: 4.3/5 — The agent successfully fetched the diff via patch-diff.githubusercontent.com and identified the foreach loop over mauiMkMapView.Overlays in the OnMapClicked handler. It quoted the relevant code and discussed the iteration logic.
  • The agent recognizes that MKMapView.Overlays is a native iOS API that returns null (not an empty array) when no overlays exist, making the unchecked enumeration a NullReferenceException risk on every map tap: 3/5 — The agent got this critically wrong. It explicitly stated 'mauiMkMapView.Overlays returns an empty array (not null) when no overlays are added — MKMapView.Overlays is IMKOverlay[] and defaults to an empty array in MapKit.' This is incorrect — MKMapView.Overlays returns null when no overlays have been added, which is a well-known iOS/MapKit behavior. The agent confidently asserted the wrong fact.
  • The agent flags this as a regression risk (❌ Error or ⚠️ Warning) for users who add a Map without any overlays — a basic, default-state user gesture: 3/5 — The agent explicitly marked the hypothesis as 'Refuted — safe, no crash' and only listed the null-Overlays concern as a 'Low' severity 'Defensive improvement' noting it could 'theoretically be null on very old iOS versions.' It completely failed to flag this as a real regression risk for the default state.
  • The Failure-Mode Probing section explicitly probes the null-PlatformView / null-native-object scenario per Step 6: 2/5 — The agent briefly touched on the null scenario in issue Aloha System.Maui! #4 ('Overlays could theoretically be null on very old iOS versions') and suggested a defensive null-coalescing fix, but dismissed it as minor rather than probing it as a genuine failure mode. There was no dedicated failure-mode probing section.
  • Confidence is calibrated to medium or lower for this platform handler change: 1.7/5 — The agent expressed high confidence in its incorrect conclusion. It boldly stated 'The app will NOT crash' and 'Refuted — safe, no crash' without any hedging or uncertainty. For a platform-specific native API behavior question that ideally requires testing or documentation verification, the confidence should have been much lower.

Baseline Agent Output

`

Code Review: PR #29101 — Map Element Click Events

Hypothesis Verdict: CONFIRMED BUG 🐛

Tapping the Map on iOS with no overlays WILL crash the app.

Root Cause

In src/Core/maps/src/Platform/iOS/MauiMKMapView.cs, the new code iterates overlays unconditionally:

csharp foreach (var overlay in mauiMkMapView.Overlays)

MKMapView.Overlays returns null (not an empty array) when no overlays have been added. A foreach over null throws a NullReferenceException.

Fix Required

csharp foreach (var overlay in mauiMkMapView.Overlays ?? Array.Empty<IMKOverlay>())

Or add a null guard:
csharp if (mauiMkMapView.Overlays is not null) { foreach (var overlay in mauiMkMapView.Overlays) ... }

Other Observations

  1. Renderer allocation on every tapGetViewForOverlayDelegate(...) is called during hit-testing for polygons/polylines. This creates new renderer objects per tap, which is wasteful. Consider caching or using RendererForOverlay() if available.

  2. MapElement.Impl.cs throws on Clicked() — The base class implementation is throw new NotImplementedException(). If a custom MapElement subclass doesn't override Clicked(), invoking it will crash. A no-op default would be safer.

  3. Map click always fires alongside element click — The code calls handler?.VirtualView.Clicked(location) (the map-level click) before checking overlays. A tap on a circle will fire both the map click and the circle click, which may be unexpected. Consider suppressing the map click when an overlay is hit.
    `

With-Skill Agent Output (Isolated)

`
I now have all the information needed. Let me compile the review.

Code Review — PR #29101

Independent Assessment

What this changes: Adds Clicked() method to IMapElement interface, implements it in Circle, Polygon, and Polyline with per-type events (CircleClicked, PolygonClicked, PolylineClicked). On Android, it hooks native click listeners. On iOS, it performs geometric hit-testing in the existing tap gesture recognizer.

Inferred motivation: Users need to know which specific overlay element was tapped, not just that the map was tapped at some coordinates.

Reconciliation with PR Narrative

Author claims: Cross-platform click event support for map elements with proper hit-testing.
Agreement: The code matches the claims. Android uses native click listener APIs; iOS performs manual geometry checks.

Prior Review Reconciliation

Prior ❌ Error Finding Source Status Evidence
No prior ❌ Error findings found. copilot-pull-request-reviewer[bot] N/A Only suggestions posted

Blast Radius Assessment

  • Runs for all instances: YES — the foreach over Overlays runs on every tap of any MauiMKMapView, even maps with zero overlays.
  • Startup impact: No.
  • Static/shared state: No.

Findings

❌ Error — NullReferenceException crash on iOS when map has no overlays

File: src/Core/maps/src/Platform/iOS/MauiMKMapView.cs, new code in OnMapClicked:

csharp foreach (var overlay in mauiMkMapView.Overlays)

MKMapView.Overlays returns null (not an empty array) when no overlays have been added. The existing code in ClearMapElements() already guards against this:

csharp var elements = Overlays; if (elements == null) return;

The new hit-test loop has no such guard. Any tap on a map with zero overlays will throw NullReferenceException and crash the app.

Fix: Add a null/empty check before the foreach:
`csharp
var overlays = mauiMkMapView.Overlays;
if (overlays is null)
return; // or just skip the foreach

foreach (var overlay in overlays)
`

⚠️ Warning — GetViewForOverlayDelegate creates new renderers on every tap

In the hit-test loop, GetViewForOverlayDelegate is called for each polygon/polyline overlay on every tap. Looking at GetMapElement<T>, it calls mapElement?.Handler?.DisconnectHandler() and then mapElement?.ToHandler(...) — effectively recreating the handler/renderer each time. This is expensive and may cause visual flicker or resource leaks under repeated taps.

⚠️ Warning — MapElement.Impl.cs throws NotImplementedException

csharp void IMapElement.Clicked() => throw new NotImplementedException();

Any third-party MapElement subclass that doesn't explicitly implement Clicked() will crash at runtime when tapped. A no-op default ({ }) would be safer for an interface method with a base-class fallback.

💡 Suggestion — Inconsistent event naming

The events are named per-type (CircleClicked, PolygonClicked, PolylineClicked) rather than a single uniform Clicked event on the base class. This means consumers must cast or know the concrete type. A single MapElement.Clicked event would be more conventional and forward-compatible if new element types are added.


Failure-Mode Probing

  • What happens when user taps a map with no overlays on iOS? → 💥 NullReferenceException (confirmed — MKMapView.Overlays is null, existing code proves this).
  • What happens during handler disconnect/reconnect? → The GetViewForOverlayDelegate calls DisconnectHandler() on existing handlers during hit-testing, which could interfere with ongoing rendering if navigation occurs mid-tap.
  • What if handler is null inside SendClickEvent? → Safe — the ?. chain handles it. But SendClickEvent is defined inside the if (TryGetTarget) block so handler is non-null at definition time; however, since it's a local function used later in the foreach, the capture is technically correct.

CI Status

  • PR is closed (not merged). Cannot check required CI status on closed PRs.
  • Classification: Undetermined (PR closed).

Verdict: NEEDS_CHANGES

Confidence: high
Summary: The hypothesis is confirmed — tapping an iOS map with no overlays will crash with NullReferenceException because MKMapView.Overlays returns null and the new foreach loop has no null guard. This is a clear ❌ Error that must be fixed. The existing ClearMapElements() method in the same file already demonstrates the correct null-check pattern.
`

With-Skill Agent Output (Plugin)

**Hypothesis confirmed: this is a crash bug.** The existing code at line 139-142 already shows the correct defensive pattern — checkingOverlaysfornullbefore iterating. The PR's new code at line 337 doesforeach (var overlay in mauiMkMapView.Overlays)` without a null check.

In MapKit, MKMapView.Overlays returns null when no overlays have been added. The PR's foreach will throw a NullReferenceException when tapping the map with no overlays.


Code Review Summary: PR #29101

🐛 Critical Bug — Null Reference Crash

File: src/Core/maps/src/Platform/iOS/MauiMKMapView.cs
Line: foreach (var overlay in mauiMkMapView.Overlays)

MKMapView.Overlays is null when no overlays exist. This foreach will throw NullReferenceException on any map tap when no Circle/Polygon/Polyline elements are added. The file's own ClearMapElements() (line 139) already demonstrates the correct guard:

csharp var elements = Overlays; if (elements == null) return;

Fix: Add a null guard before the foreach:

csharp if (mauiMkMapView.Overlays is null) return; // or skip the overlay hit-testing block

Other observations:

  1. SendClickEvent uses == on MapElementId — The local function compares x.MapElementId == overlay (comparing an object? to an IMKOverlay). This relies on reference equality, which works because the same overlay object is stored during AddOverlay, but this is fragile and not obvious. A comment would help.

  2. Android SendElementClickEvent uses .ToString()x.MapElementId?.ToString() == elementId works but is allocating strings on every click for every element. Minor perf concern.

  3. No early exit after handler null check — If handler is null (line 336), the code still iterates all overlays calling SendClickEvent which will silently do nothing. An early return after the null check would be cleaner.
    `

📊 code-review / verdict-consistency-errors-must-map-to-needs-changes

Judge Report: Verdict consistency - errors must map to NEEDS_CHANGES

Baseline Judge

Overall Score: 4/5
Reasoning: The agent successfully identified the core ConnectHandler/DisconnectHandler asymmetry issue the user suspected, provided clear technical explanation of why it's problematic, and cited specific files and methods. The analysis is accurate and actionable. It lost minor points for not having line numbers and not using formal review verdict language, but the substance of the review is strong.

  • If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM: 4/5 — The agent clearly identifies an asymmetry issue and recommends a fix, implying the code needs changes. While it doesn't use explicit 'NEEDS_CHANGES' terminology, the recommendation section makes it clear this is a bug that should be fixed before merging.
  • The agent applies handler lifecycle rules from the expert reviewer dimensions (ConnectHandler/DisconnectHandler symmetry): 5/5 — The agent precisely identifies the ConnectHandler/DisconnectHandler asymmetry: SetupViewWithLocalListener is used in ConnectHandler but UnregisterView (instead of RemoveViewWithLocalListener) is used in DisconnectHandler. It also cross-references other handlers in the PR that correctly pair these methods, demonstrating strong understanding of the lifecycle pattern.
  • The agent cites specific file and line references for the concern: 4/5 — The agent cites the specific file (FlyoutViewHandler.Android.cs), the specific methods (ConnectHandler/DisconnectHandler), the exact API calls involved, and even references other files (ShellContentFragment, ShellFlyoutTemplatedContentRenderer, ShellSectionRenderer) for comparison. It also notes a minor redundant null check in ShellSectionRenderer. While it doesn't give exact line numbers, the references are specific enough to be actionable.

With-Skill Judge (Isolated)

Overall Score: 4.3/5
Reasoning: The agent successfully identified the core asymmetry issue despite significant obstacles (GitHub API access denied by enterprise policy). It recovered by fetching the diff via web_fetch in chunks, then performed a thorough analysis. The review correctly identifies the main bug (UnregisterView vs RemoveViewWithLocalListener mismatch), explains the failure mode, and provides additional findings (redundant null check, unbounded static list). The approach was methodical and the final output is well-structured. Docked slightly for lack of precise line numbers and for not being able to verify against the actual source files in the repo.

  • If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM: 5/5 — The agent identified a ❌ Error-level finding (ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs) and correctly delivered a NEEDS_CHANGES verdict. This is fully consistent with the rubric requirement.
  • The agent applies handler lifecycle rules from the expert reviewer dimensions (ConnectHandler/DisconnectHandler symmetry): 4.7/5 — The agent correctly identified that SetupViewWithLocalListener performs 3 operations (SetOnApplyWindowInsetsListener, SetWindowInsetsAnimationCallback, RegisterView) but DisconnectHandler only calls UnregisterView which reverses only step 3. It correctly identifies RemoveViewWithLocalListener as the proper counterpart. The analysis demonstrates understanding of handler lifecycle symmetry. However, since I cannot independently verify the exact line numbers and code cited, and the diff was large, I give 4 rather than 5.
  • The agent cites specific file and line references for the concern: 3.7/5 — The agent cites FlyoutViewHandler.Android.cs with approximate line numbers (~285-291 for ConnectHandler, 301-307 for DisconnectHandler), names specific methods (SetupViewWithLocalListener vs UnregisterView vs RemoveViewWithLocalListener), and shows code snippets. It also references ShellSectionRenderer.Destroy() for the warning. The line numbers are approximate ('') rather than exact, but the file and method references are specific enough to be actionable.

With-Skill Judge (Plugin)

Overall Score: 4.3/5
Reasoning: The agent successfully confirmed the user's hypothesis about a ConnectHandler/DisconnectHandler asymmetry, identified the exact location and nature of the bug, explained why it's wrong by contrasting with correct patterns in other handlers, assessed impact, and gave the correct verdict. The review also includes useful secondary findings (redundant null check, unbounded static list). The approach was efficient given the constraints (no gh token, had to fetch diff via web).

  • If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM: 4.7/5 — The agent clearly identified the ConnectHandler/DisconnectHandler asymmetry as a ❌ Error-level issue and delivered a verdict of NEEDS_CHANGES with clear reasoning.
  • The agent applies handler lifecycle rules from the expert reviewer dimensions (ConnectHandler/DisconnectHandler symmetry): 5/5 — The agent correctly identified that SetupViewWithLocalListener in ConnectHandler should be paired with RemoveViewWithLocalListener in DisconnectHandler, not just UnregisterView. It explained the three setup steps and showed only one teardown step was being performed. It also correctly noted all other sites in the PR follow the proper pairing.
  • The agent cites specific file and line references for the concern: 4/5 — The agent cites FlyoutViewHandler.Android.cs specifically and quotes the relevant code snippets from both ConnectHandler and DisconnectHandler. It also references the specific method names (SetupViewWithLocalListener vs UnregisterView vs RemoveViewWithLocalListener). However, it doesn't provide exact line numbers, only file names and code excerpts.

Baseline Agent Output

Based on my review of the diff, here's the **ConnectHandler/DisconnectHandler asymmetry** inFlyoutViewHandler.Android.cs`:


Issue Found: FlyoutViewHandler.Android.cs

In ConnectHandler, _navigationRoot is set up with SetupViewWithLocalListener:
csharp MauiWindowInsetListener.SetupViewWithLocalListener(cl);

This method does 3 things: registers the view, sets ViewCompat.SetOnApplyWindowInsetsListener, and sets ViewCompat.SetWindowInsetsAnimationCallback.

In DisconnectHandler, cleanup only calls:
csharp MauiWindowInsetListener.UnregisterView(cl);

UnregisterView only removes the view from the static registry — it does not clear the two ViewCompat listeners. The correct counterpart should be RemoveViewWithLocalListener(cl), which is what all other callsites use (e.g., ShellContentFragment.Destroy(), ShellFlyoutTemplatedContentRenderer.Disconnect(), NavigationRootManager.Disconnect()).


Minor: Redundant null check in ShellSectionRenderer.cs

csharp if (_rootView != null) // outer check { if (_rootView is not null) // redundant inner check { MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView); }

The inner is not null is unnecessary since _rootView is already guaranteed non-null by the outer check.


Summary: The asymmetry means that when FlyoutViewHandler disconnects, the ViewCompat inset listener and animation callback remain attached to the _navigationRoot CoordinatorLayout, potentially causing stale callbacks or preventing GC of the listener.
`

With-Skill Agent Output (Isolated)

`
Now I have the full diff. Let me analyze the specific handler files for the ConnectHandler/DisconnectHandler asymmetry the user flagged.

Code Review — PR #32278

Independent Assessment

What this changes: Refactors Android window inset management from a single global GlobalWindowInsetListener (stored on the activity) to a per-CoordinatorLayout registry pattern (MauiWindowInsetListener) with static lookup. Views find their inset listener by walking the view hierarchy.

Inferred motivation: Better isolation of inset handling across complex navigation scenarios (modals, shell flyouts, multiple coordinator layouts), eliminating cross-contamination between separate UI trees.

Reconciliation with PR Narrative

The PR was already merged. Title: "[Android] Refactor WindowInsetListener to per-CoordinatorLayout registry with optimized lookups" — matches the code.

Prior Review Reconciliation

Unable to query GitHub API (token rejected by .NET Foundation enterprise policy). Cannot verify prior reviews.

Blast Radius Assessment

  • Runs for all instances: Yes — all Android views using safe area/insets are affected
  • Startup impact: Yes — NavigationRootManager.Connect runs at startup
  • Static/shared state: Yes — MauiWindowInsetListener._registeredViews is a static list shared across the entire process

Findings

❌ Error — ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs

ConnectHandler sets up _navigationRoot with full listener attachment:
csharp MauiWindowInsetListener.SetupViewWithLocalListener(cl); // Internally does: ViewCompat.SetOnApplyWindowInsetsListener + SetWindowInsetsAnimationCallback + RegisterView

DisconnectHandler only unregisters from the static list:
csharp MauiWindowInsetListener.UnregisterView(cl); // Only removes from _registeredViews

This leaks the ViewCompat listener and animation callback on the CoordinatorLayout. The correct cleanup is RemoveViewWithLocalListener(cl) (which is used correctly in ShellContentFragment.Destroy, ShellSectionRenderer.Destroy, and ShellFlyoutTemplatedContentRenderer.Disconnect).

⚠️ Warning — Redundant null check in ShellSectionRenderer.Destroy

csharp if (_rootView != null) // outer check { if (_rootView is not null) // redundant inner check { MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView); }

The inner is not null check is always true when inside the outer != null block.

💡 Suggestion — Static _registeredViews list is unbounded with only opportunistic cleanup

Dead WeakReference entries are only cleaned during RegisterView and FindListenerForView calls. Under rapid navigation (Shell tab switches), entries may accumulate between cleanup opportunities. A periodic or threshold-based sweep would prevent unbounded growth.


Failure-Mode Probing

  • What happens during handler disconnect/reconnect (Shell tab switch)? The leaked ViewCompat listener on the CoordinatorLayout in FlyoutView continues receiving inset callbacks on a stale handler reference. The _navigationRoot = null line means a reconnect allocates a new layout, but the old listener reference persists on the old platform view until GC.

  • What if FindListenerForView is called on a view whose ancestor was UnregisterView'd but not fully cleaned? Returns null, so child views get no safe area padding — silent degradation rather than a crash.


CI Status

Unable to verify — GitHub API access denied by enterprise token policy. PR is already merged.

Verdict: NEEDS_CHANGES

Confidence: medium (handler/platform infrastructure change with static shared state)
Summary: The FlyoutViewHandler.DisconnectHandler uses UnregisterView where it should use RemoveViewWithLocalListener, leaving the ViewCompat inset listener attached after disconnect. This asymmetry mirrors exactly what you suspected — all other handler teardown paths in this PR use RemoveViewWithLocalListener for views set up with SetupViewWithLocalListener, making this the odd one out.
`

With-Skill Agent Output (Plugin)

`

Code Review — PR #32278

Independent Assessment

What this changes: Replaces the activity-level GlobalWindowInsetListener singleton with a per-view MauiWindowInsetListener that uses a static registry (_registeredViews) to allow child views to find their parent's inset listener by walking the view hierarchy.

Inferred motivation: The global singleton caused cross-contamination issues in complex navigation scenarios (modals, Shell tabs, FlyoutView) where different view hierarchies need independent inset management.

Findings

❌ Error — ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs

In ConnectHandler:
csharp MauiWindowInsetListener.RegisterParentForChildViews(platformView); // register only if (_navigationRoot is CoordinatorLayout cl) { MauiWindowInsetListener.SetupViewWithLocalListener(cl); // register + set ViewCompat listeners }

In DisconnectHandler:
csharp MauiWindowInsetListener.UnregisterView(platformView); // unregister only ✓ if (_navigationRoot is CoordinatorLayout cl) { MauiWindowInsetListener.UnregisterView(cl); // ❌ only unregisters from static list _navigationRoot = null; }

SetupViewWithLocalListener calls both ViewCompat.SetOnApplyWindowInsetsListener(view, listener) and ViewCompat.SetWindowInsetsAnimationCallback(view, listener), plus RegisterView. But UnregisterView only removes from the static registry — it does NOT clear the ViewCompat listeners or reset tracked views.

The correct counterpart is RemoveViewWithLocalListener(cl), which other handlers use correctly:

  • ShellContentFragment.Destroy()RemoveViewWithLocalListener(_root)
  • ShellSectionRenderer.Destroy()RemoveViewWithLocalListener(_rootView)
  • ShellFlyoutTemplatedContentRenderer.Disconnect()RemoveViewWithLocalListener(coordinator)
  • NavigationRootManager.Disconnect()RemoveViewWithLocalListener(_managedCoordinatorLayout)

Impact: After DisconnectHandler, the stale listener still receives inset callbacks on _navigationRoot, potentially referencing disposed objects.

⚠️ Warning — Redundant null check in ShellSectionRenderer.Destroy()

csharp if (_rootView != null) // outer check { if (_rootView is not null) // redundant inner check { MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView); } ... }

The inner is not null check is redundant since we're already inside if (_rootView != null).

💡 Suggestion — Static _registeredViews list is never bounded

MauiWindowInsetListener._registeredViews is a static List<ViewEntry> that grows with weak references. Dead references are only cleaned during RegisterView or FindListenerForView calls. If views are created/destroyed rapidly without triggering these paths, the list grows unbounded with dead WeakReference entries. Consider periodic compaction or a size-triggered cleanup.

Blast Radius Assessment

  • Runs for all instances: Yes — every view that uses insets on Android goes through this registry
  • Startup impact: Yes — WindowHandler and NavigationRootManager set up listeners at window creation
  • Static/shared state: Yes — _registeredViews is a static list shared across all handlers

Verdict: NEEDS_CHANGES

Confidence: medium (platform-specific handler/UI plumbing + PR is already merged)
Summary: The FlyoutViewHandler.DisconnectHandler uses UnregisterView where it should use RemoveViewWithLocalListener, leaving ViewCompat listeners attached after disconnect. This is a genuine asymmetry that every other handler in this PR avoids. Since this PR is already merged, a follow-up fix should be filed.
`

🔍 Full results and investigation steps

@kubaflo kubaflo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Multi-model code review — comment (draft PR)

Three models reviewed this draft independently (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro), then cross-pollinated. Non-blocking WIP feedback.

Consensus: NEEDS_DISCUSSION — genuinely high-quality scenario seeding; the open item is why skill-validation is red and how to get it green before un-drafting.

This is excellent, well-grounded work 👍

Both new scenarios are substantively sound and tied to verified regressions: the cited PRs are real merged PRs whose actual diffs contain the exact symbols the assertions reference — GetGradientData(1.0f) at the four non-shadow sites vs GetGradientData(shadowOpacity) on shadow paths (the precise asymmetry the rubric describes; #35280 is p/0/regressed-in-10.0.60), and foreach (var overlay in mauiMkMapView.Overlays) in OnMapClicked (#34910, a confirmed map NRE). The YAML parses, all 11 regexes compile, and the SKILL.md +6/-6 cleanup is correct (splitting the cap column and reclassifying "prior ❌ findings" as a Rule #5 verdict override fixes a real category error). The regex-based assertions are far more discriminating than substring matches.

Why skill-validation is red — and it's not just "aspirational"

Static validation passed and the eval ran; the red is the dotnet/skills validator's differential gate (skilled-vs-baseline improvement ≥10%, significant). It scored −5.6% with failureKind: skill_not_activated — the skill didn't activate on the gradient prompt in most arms, so skilled≈baseline and the new scenarios actually drag the suite's improvement negative (the two worst contributors). Implication: relaxing the regexes alone won't turn CI green — the prompts need to reliably activate the skill, and it's worth confirming these belong in a differential harness. Two concrete tweaks (inline):

  • Over-strict **Confidence:** medium|low (line 212 / ~248) — the single most-failing assertion; it fails even when the skill otherwise nails the review (4/5), and a definitive NRE/hardcoded-alpha finding can legitimately be high. Relax it or accept high.
  • Echoable evidence regex (line 71) — MauiDrawable.Android is already in the prompt; require a non-prompt symbol so it actually proves the code was read.

Coordination

Same code-review/tests/eval.yaml as #34884 (also seeding scenarios) and #35942 (migrating the harness to Vally) — merge-order/rework hazard; sequence them.

Independent verdicts: Opus 4.8 — NEEDS_DISCUSSION (high) · GPT-5.5 — NEEDS_DISCUSSION (med) · Gemini 3.1 Pro — LGTM (high). All three rate the scenarios high-quality; the discussion is about the differential-gate CI failure + the two assertion tweaks.

# if the reviewer correctly refutes the hypothesis with a regression finding,
# confidence may also be low. `\s*` tolerates a stray space the agent might emit.
- type: "output_matches"
pattern: '\*\*Confidence:\*\*\s*([Mm]edium|[Ll]ow)'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Heads-up: in the actual skill-validation run, **Confidence:** medium|low (here + the iOS scenario ~line 248) is the single most-failing assertion — it failed in 5 of 6 arms, including the arm where the skill activated and otherwise scored 4/5 (iOS map / isolated). It's consistent with the skill's Step-6 cap (platform-handler plumbing → max medium), BUT a reviewer who definitively finds the NRE crash or the hardcoded 1.0f alpha can legitimately emit **Confidence:** high (high confidence the regression is real). For a regression-detection corpus the catch is already proven by the verdict-marker + symbol-evidence + vocabulary assertions; coupling pass/fail to this calibration nuance risks failing a correct, regression-catching review. Consider dropping/relaxing it for these two scenarios, or accepting high when a concrete regression finding is present.

Introduces a scheduled gh-aw workflow that grows the code-review skill's
Vally regression-detection eval corpus automatically. When a regression-fix
PR merges, a deterministic PowerShell pre-pass identifies the regression-
introducing PR and its merge SHA; the agent then drafts a hermetic
eval.vally.yaml stimulus (frozen worktree pinned to that SHA, no PR/issue
numbers in the prompt, structural-floor grader + LLM judge + rubric) and
opens a single draft, test-only PR. skill-validation.yml auto-validates it.

- .github/scripts/Find-RegressionFixPRs.ps1: deterministic pre-pass (no AI)
- .github/scripts/Find-RegressionFixPRs.Tests.ps1: 26 Pester tests (all pass)
- .github/workflows/regression-corpus-scanner.md: gh-aw scanner workflow
- .github/workflows/regression-corpus-scanner.lock.yml: compiled lock (v0.79.8)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-infrastructure CI, Maestro / Coherency, upstream dependencies/versions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants