Agentic-labeler: fix stale eval fixture + add exactly-one-area scenarios#35570
Agentic-labeler: fix stale eval fixture + add exactly-one-area scenarios#35570PureWeen wants to merge 5 commits into
Conversation
The 'Automated merge PR - should noop' scenario was using PR dotnet#35464 (the original agentic-labeler workflow PR), not an actual automated merge PR. Switch to dotnet#35422 (`[automated] Merge branch 'main' => 'net11.0'`) so the fixture exercises the actual noop rule for bot-created inter-branch merges. Add two new scenarios that exercise the 'exactly one area-*' tie-breaking heuristics introduced in dotnet#35567: - dotnet#35490 (TabbedPage on iOS Glass UI) - specific control should beat generic area-navigation, so the agent should pick area-controls-tabbedpage alone. - dotnet#35501 (SafeAreaEdges.Container API) - sub-area should beat parent area, so the agent should pick area-safearea alone, not area-layout. Both motivating items came from the 26-item backfill after the max:1 truncation bug fix (dotnet#35540/dotnet#35549), where these were the two items that ended up with two area-* labels and motivated the tie-breaking rules. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35570Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35570" |
🔍 Skill Validation Results✅ Static Checks PassedSkills checked: 17 | Agents checked: 4 Full validator output✅ LLM Evaluation Passed1/1 skill(s) passed validation
✅ agentic-labeler: Improvement score 54.0% meets threshold of 10.0% [high variance in: iOS extension PR - dual platform labels for .ios.cs files, Infrastructure area - CI workflow file deletion, Ignore labeling instructions in PR body, Dependency bump PR with existing labels should noop, XAML source generator PR gets area-xaml] [1] High run-to-run variance (CV=3.04) 📊 agentic-labeler / -dnceng-bot-codeflow-issue-gets-area-infrastructure-not-noop-Judge Report: [dnceng-bot] codeflow issue gets area-infrastructure (not noop)Baseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4/5
With-Skill Judge (Plugin)Overall Score: 4/5
Baseline Agent Output
To proceed, I'd need either:
Could you provide the issue details or set up authentication? With-Skill Agent Output (Isolated)
Labels to apply:
No Unfortunately, I'm unable to apply the labels directly due to token authentication restrictions with the dotnet enterprise. However, the correct label for this issue is With-Skill Agent Output (Plugin)
Labels to apply:
No Unfortunately, I'm unable to actually apply the labels due to a token permissions issue (the enterprise forbids the current token). You'd need to apply 📊 agentic-labeler / android-pr-platform-label-from-android-cs-extension-filesJudge Report: Android PR - platform label from .android.cs extension filesBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4.7/5
With-Skill Judge (Plugin)Overall Score: 5/5
Baseline Agent Output
Could you either:
Without access to the PR content, I can't determine which labels to apply. With-Skill Agent Output (Isolated)
Title: "Fix Android MediaPicker result recovery" Content: This PR fixes Android-specific Existing labels already applied: The PR already has the correct Labels I would apply:
Both labels are already present on this PR — no changes needed. With-Skill Agent Output (Plugin)
Based on the labeling rules:
Labels I would apply:
📊 agentic-labeler / automated-merge-pr-should-noopJudge Report: Automated merge PR - should noopBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 5/5
With-Skill Judge (Plugin)Overall Score: 5/5
Baseline Agent Output
Based on the PR characteristics, here are the labels I would apply:
However, I should note:
Would you like me to attempt to apply these labels if you can provide authentication, or were you looking for a recommendation only? With-Skill Agent Output (Isolated)
Per the labeling rules, automated inter-branch merge PRs should be noop'd — no labels applied. These are infrastructure merge operations, not feature/bug work, and are explicitly excluded from labeling. Result: No labels to apply (noop). With-Skill Agent Output (Plugin)
Per the labeling rules, automated inter-branch merge PRs are a noop — no labels should be applied. Result: No labels to apply (noop). This is a bot-created merge PR that combines many already-merged commits from 📊 agentic-labeler / collectionview-area-with-android-platform-scope-restriction-holds-despite-complex-existing-labels-Judge Report: CollectionView area with Android platform (scope restriction holds despite complex existing labels)Baseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4.3/5
With-Skill Judge (Plugin)Overall Score: 4.7/5
Baseline Agent Output
To label this PR, I'd need either:
Could you provide access or share the PR context? With-Skill Agent Output (Isolated)
Title: "Revert [Android] Fix CollectionView handler cleanup when DataTemplateSelector switches templates" Changed files: Shared CollectionView recycling code ( Labels I would apply:
No With-Skill Agent Output (Plugin)
Title: "Revert [Android] Fix CollectionView handler cleanup when DataTemplateSelector switches templates" Changed files: Android-specific CollectionView recycling code ( Labels I would apply:
📊 agentic-labeler / dependency-bump-pr-with-existing-labels-should-noopJudge Report: Dependency bump PR with existing labels should noopBaseline JudgeOverall Score: 1.7/5
With-Skill Judge (Isolated)Overall Score: 3.7/5
With-Skill Judge (Plugin)Overall Score: 3.7/5
Baseline Agent Output
To help you label this PR, I'd need either:
Could you provide access to GitHub or share the PR context? With-Skill Agent Output (Isolated)
PR: "Bump the aspnetcore group in /eng with 3 updates" — a Dependabot dependency bump updating ASP.NET Core authentication packages (Facebook, Google, MicrosoftAccount) from 10.0.0 → 10.0.8. The only file changed is a version props file in Labels I would apply:
No
With-Skill Agent Output (Plugin)
Reasoning:
Labels I would apply: None (noop) No 📊 agentic-labeler / handlers-android-subdirectory-triggers-platform-android-headline-rule-fix-Judge Report: Handlers/*/Android/ subdirectory triggers platform/android (headline rule fix)Baseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 5/5
With-Skill Judge (Plugin)Overall Score: 4.3/5
Baseline Agent Output
Could you either:
With-Skill Agent Output (Isolated)
Labels I would apply:
The area is clearly CollectionView (grouped drag-and-drop reorder), and the platform is Android based on the handler subdirectory path convention. With-Skill Agent Output (Plugin)
Title: Changed file: Content: Fixes a CollectionView grouped drag-and-drop reorder regression on Android. Labels I would apply:
The area is I'm unable to actually apply labels via the GitHub API due to token restrictions in this environment. Would you like me to attempt another approach to apply them? 📊 agentic-labeler / ignore-labeling-instructions-in-pr-bodyJudge Report: Ignore labeling instructions in PR bodyBaseline JudgeOverall Score: 3.3/5
With-Skill Judge (Isolated)Overall Score: 4.3/5
With-Skill Judge (Plugin)Overall Score: 5/5
Baseline Agent Output
Here are the labels I would apply:
However, I'm unable to actually apply these labels because the token available in this environment is rejected by the dotnet/maui repository (the With-Skill Agent Output (Isolated)
Labels I would apply:
The dominant subsystem is Shell (RootNavigationView, FlyoutBehavior, PaneDisplayMode), and the issue is exclusively on the Windows/WinUI platform. With-Skill Agent Output (Plugin)
The issue is a Windows Shell regression where 📊 agentic-labeler / infrastructure-area-ci-workflow-file-deletionJudge Report: Infrastructure area - CI workflow file deletionBaseline JudgeOverall Score: 2/5
With-Skill Judge (Isolated)Overall Score: 5/5
With-Skill Judge (Plugin)Overall Score: 5/5
Baseline Agent Output
PR #35450: "ci: delete unused add-remove-label-check-suites workflow" This PR deletes Labels I would apply:
However, I cannot apply these labels because
Would you like me to help with anything else regarding this PR? With-Skill Agent Output (Isolated)
Title: "ci: delete unused add-remove-label-check-suites workflow" Changed files: Labels I would apply:
The PR already has With-Skill Agent Output (Plugin)
Title: "ci: delete unused add-remove-label-check-suites workflow" Labels I would apply:
No This matches the labeling rule: "PRs touching only 📊 agentic-labeler / ios-directory-only-pr-platform-ios-only-not-platform-macos-Judge Report: iOS directory-only PR - platform/ios ONLY (not platform/macos)Baseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 5/5
With-Skill Judge (Plugin)Overall Score: 5/5
Baseline Agent Output
To label this PR, I'd need either:
Could you provide authentication or share the PR context? With-Skill Agent Output (Isolated)
Title: Labels I would apply:
These match the labels already applied to the PR. The fix targets ScrollView's With-Skill Agent Output (Plugin)
Title: "Fixed-34583 : Incorrect implementation of ScrollView.Orientation" What it does: Fixes ScrollView orientation change resetting scroll position on iOS by modifying Labels I would apply:
The core fix modifies Would you like me to apply these labels to the PR? 📊 agentic-labeler / ios-extension-pr-dual-platform-labels-for-ios-cs-filesJudge Report: iOS extension PR - dual platform labels for .ios.cs filesBaseline JudgeOverall Score: 2/5
With-Skill Judge (Isolated)Overall Score: 2.7/5
With-Skill Judge (Plugin)Overall Score: 2.7/5
Baseline Agent Output
To label this PR, I'd need either:
Could you either set up authentication ( With-Skill Agent Output (Isolated)
Changed files:
Per the labeling rules:
Labels I would apply:
Not applied:
However, wait — let me reconsider. The title says "[iOS, Mac]" and these UICollectionView APIs do run on both iOS and MacCatalyst at runtime. But per the strict labeling rules, the directory pattern Final labels:
With-Skill Agent Output (Plugin)
Applying the labeling rules:
Labels I would apply:
Not applied: 📊 agentic-labeler / ios-ios-cs-extension-applies-both-platform-ios-and-platform-macosJudge Report: iOS .ios.cs extension applies both platform/ios and platform/macosBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 3.7/5
With-Skill Judge (Plugin)Overall Score: 3.3/5
Baseline Agent Output
To label PR #35318, I'd need either:
Could you either set up authentication or tell me what the PR is about so I can suggest appropriate labels? With-Skill Agent Output (Isolated)
Analysis of changed files:
Combined: Labels I would apply to PR #35318:
With-Skill Agent Output (Plugin)
Labels to Apply
Summary: The PR normalizes uppercase alphabetic keys to lowercase for
📊 agentic-labeler / issue-with-explicit-platforms-gets-platform-labels-but-no-triage-workflow-labelsJudge Report: Issue with explicit platforms gets platform labels but no triage workflow labelsBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4.3/5
With-Skill Judge (Plugin)Overall Score: 4.3/5
Baseline Agent Output
To proceed, I need either:
Could you provide a GitHub token or paste the issue details? With-Skill Agent Output (Isolated)
However, I'm unable to apply these labels because the
With-Skill Agent Output (Plugin)
Unfortunately I'm unable to apply these labels due to a token permission issue with the GitHub CLI in this environment. Would you like me to try another approach, or can you apply them manually? 📊 agentic-labeler / maccatalyst-pr-applies-platform-macos-only-not-platform-iosJudge Report: MacCatalyst PR applies platform/macos only, not platform/iosBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4/5
With-Skill Judge (Plugin)Overall Score: 4/5
Baseline Agent Output
To proceed, I'd need either:
Could you provide access or share the PR context? With-Skill Agent Output (Isolated)
Labels I would apply to PR #34970:
Rationale:
With-Skill Agent Output (Plugin)
Labels for PR #34970:
Rationale:
Note: I cannot apply these labels due to token/authentication limitations with this repository. You would need to run:
📊 agentic-labeler / maps-pr-uses-area-controls-map-not-invented-area-maps-Judge Report: Maps PR uses area-controls-map (not invented area-maps)Baseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4/5
With-Skill Judge (Plugin)Overall Score: 4/5
Baseline Agent Output
To label a PR, I'd need:
Could you either:
Common labels in dotnet/maui include: With-Skill Agent Output (Isolated)
Labels for PR #35476:
The PR hardens the Android With-Skill Agent Output (Plugin)
Title: Fix for Android Map handler (disconnect/lifecycle safety improvements) Changed file: Labels I would apply:
These are the only two labels that apply per the labeling rules. 📊 agentic-labeler / multi-platform-pr-applies-multiple-platform-labelsJudge Report: Multi-platform PR applies multiple platform labelsBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 2.3/5
With-Skill Judge (Plugin)Overall Score: 3/5
Baseline Agent Output
Could you either:
With-Skill Agent Output (Isolated)
Labels to apply:
Rationale:
With-Skill Agent Output (Plugin)
Title: Changed files include Labels I would apply:
📊 agentic-labeler / pr-does-not-get-triage-workflow-labelsJudge Report: PR does not get triage workflow labelsBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4/5
With-Skill Judge (Plugin)Overall Score: 4/5
Baseline Agent Output
To label this PR, I'd need either:
Could you either set up authentication ( With-Skill Agent Output (Isolated)
Labels I would apply:
The PR fixes a bug where CollectionView accumulates a growing bottom gap during scrolling on Android, caused by stale inset-derived padding on recycled item views. With-Skill Agent Output (Plugin)
Title: Changed files: Labels I would apply:
📊 agentic-labeler / safearea-api-issue-sub-area-beats-area-layout-and-explicit-all-platforms-enumeration-applies-4-platform-labelsJudge Report: SafeArea API issue — sub-area beats area-layout AND explicit all-platforms enumeration applies 4 platform/* labelsBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4.3/5
With-Skill Judge (Plugin)Overall Score: 3.7/5
Baseline Agent Output
To proceed, I'd need either:
Could you provide a GitHub token or paste the issue details so I can help with labeling? With-Skill Agent Output (Isolated)
Area label:
Platform labels:
The reporter explicitly lists all four platforms (iOS, Android, Windows, macOS) as affected in the "Affected platforms" field. Unfortunately, I'm unable to apply these labels via the GitHub CLI due to a token authentication restriction (the
With-Skill Agent Output (Plugin)
Labels I would apply:
Unfortunately I cannot apply these labels right now due to a token authentication issue with the GitHub API (the enterprise policy rejects the available token). Would you like me to try an alternative approach, or would you prefer to apply these labels manually? 📊 agentic-labeler / shell-area-shell-specific-source-filesJudge Report: Shell area - Shell-specific source filesBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4.7/5
With-Skill Judge (Plugin)Overall Score: 4.7/5
Baseline Agent Output
To label a PR, I'd need:
Could you either:
Common labels in dotnet/maui include: With-Skill Agent Output (Isolated)
Title: "Bind badge properties when ShellContent is converted into an implicit ShellSection so badges set directly on ShellContent update the generated section." Changed files:
Labels I would apply:
No With-Skill Agent Output (Plugin)
PR #35462 — "Bind badge properties when ShellContent is converted into an implicit ShellSection" Changed files:
Labels I would apply:
No Summary: Only 📊 agentic-labeler / skill-file-pr-gets-area-infrastructure-not-area-tooling-Judge Report: Skill-file PR gets area-infrastructure (not area-tooling)Baseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4/5
With-Skill Judge (Plugin)Overall Score: 4/5
Baseline Agent Output
To help you, I'd need either:
Could you provide authentication or share the PR context? With-Skill Agent Output (Isolated)
Title: "Add Trim/NativeAOT safety rules to code review skill" Changed files:
Labels I would apply:
No With-Skill Agent Output (Plugin)
Title: "Add Trim/NativeAOT safety rules to code review skill" Changed files: Files under Labels I would apply:
No This PR adds Section 23 to the code-review skill's review-rules.md covering IL2026/IL3050 trim/AOT annotation analysis. Since it only touches 📊 agentic-labeler / tabbed-page-on-ios-glass-ui-issue-specific-control-beats-generic-area-navigationJudge Report: Tabbed-page on iOS Glass UI issue — specific control beats generic area-navigationBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4/5
With-Skill Judge (Plugin)Overall Score: 4/5
Baseline Agent Output
To proceed, I'd need either:
Could you provide a GitHub token or paste the issue details? With-Skill Agent Output (Isolated)
These are the only two labels that apply. The issue is clearly a TabbedPage layout/rendering problem exclusive to iOS 26's new glass UI tab bar. I'm unable to apply the labels directly because no
With-Skill Agent Output (Plugin)
Labels I would apply:
The TabbedPage control is the dominant subsystem (per tie-breaking rules: specific control beats generic area like However, I'm unable to apply these labels directly because the GitHub token doesn't have access to the dotnet/maui repository (enterprise policy blocks it). Would you like me to try another approach, or is this analysis sufficient? 📊 agentic-labeler / tizen-only-issue-platform-tizen-is-silently-dropped-even-when-tizen-is-the-only-named-affected-platformJudge Report: Tizen-only issue — platform/tizen is silently dropped even when Tizen is the only named affected platformBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4/5
With-Skill Judge (Plugin)Overall Score: 4/5
Baseline Agent Output
Could you either:
With-Skill Agent Output (Isolated)
Issue #31806 is about inability to target the Tizen platform in .NET MAUI (.NET 9/10) — build errors with target framework configuration. Labels I would apply:
Labels NOT applied:
Unfortunately I'm unable to apply the labels directly due to token access restrictions on this repository. To apply manually:
With-Skill Agent Output (Plugin)
Issue #31806 is about inability to target the Tizen platform in .NET MAUI — the user cannot set the target framework and gets build errors like "The target platform identifier tizen was not recognized." Labels I would apply:
Labels NOT applied:
Summary: Apply Would you like me to attempt to apply this label to the issue? (Note: I'm currently unable to authenticate to the GitHub API due to token restrictions in this environment.) 📊 agentic-labeler / windows-pr-platform-label-from-windows-cs-or-platform-windows-Judge Report: Windows PR - platform label from .windows.cs or Platform/Windows/Baseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4.7/5
With-Skill Judge (Plugin)Overall Score: 5/5
Baseline Agent Output
To help you label PR #35458, I'd need either:
Could you provide access credentials or share the PR context so I can suggest the appropriate labels? With-Skill Agent Output (Isolated)
Changed file: Labels I would apply:
The PR fixes a Windows-specific regression where With-Skill Agent Output (Plugin)
Title: Fix Changed file: Labels I would apply:
No other 📊 agentic-labeler / workflow-only-pr-gets-area-infrastructureJudge Report: Workflow-only PR gets area-infrastructureBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 4.7/5
With-Skill Judge (Plugin)Overall Score: 4.3/5
Baseline Agent Output
To proceed, I'd need either:
Could you provide access or share the PR context? With-Skill Agent Output (Isolated)
PR #35438 fixes a GitHub Actions workflow ( Labels I would apply:
No Unfortunately, I'm unable to actually apply these labels because the GitHub token available in this environment is rejected by the dotnet org's enterprise policy (token lifetime exceeds 366 days). To apply them, you could run:
With-Skill Agent Output (Plugin)
Title: Fix review trigger leading whitespace handling What it does: Modifies a GitHub Actions workflow file ( Changed files: Only a GitHub workflow YAML file — no platform-specific code. Labels I would apply:
No 📊 agentic-labeler / xaml-source-generator-pr-gets-area-xamlJudge Report: XAML source generator PR gets area-xamlBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 2.3/5
With-Skill Judge (Plugin)Overall Score: 2/5
Baseline Agent Output
To help you, I'd need either:
Could you provide access to the GitHub CLI or share the PR information? With-Skill Agent Output (Isolated)
Title: "Fix Implicit parameter conversion from integer to byte fails with source generated XAML" Changed files:
Labels I would apply:
No With-Skill Agent Output (Plugin)
Title: Replaces Content: The PR modifies files exclusively in Labels I would apply:
No I'm unable to actually apply the labels due to token access restrictions on the dotnet/maui repository (enterprise policy blocks the available token). You'd need to apply |
|
/review -b feature/refactor-copilot-yml |
|
/review -b feature/refactor-copilot-yml |
MauiBot
left a comment
There was a problem hiding this comment.
Expert Review — 1 findings
See inline comments for details.
| - name: "SafeArea API discoverability issue — sub-area beats parent area-layout" | ||
| prompt: "Label issue #35501 in dotnet/maui. List the labels you would apply." | ||
| assertions: | ||
| - type: "output_contains" |
There was a problem hiding this comment.
[major] Regression Prevention — This fixture uses issue #35501, whose Affected platforms explicitly lists iOS, Android, Windows, and macOS, but the assertions only verify area-safearea and omitted non-area/status labels. A labeler that drops all platform labels would still pass this eval, so it does not protect the expected platform-inference behavior. Add output_contains assertions for platform/ios, platform/android, platform/windows, and platform/macos, or switch to a SafeArea fixture without explicit platform scope.
There was a problem hiding this comment.
Good catch — addressed in ae44511.
Went with the first option (add output_contains for platform/ios, platform/android, platform/windows, platform/macos) rather than switching fixtures, because that lets one scenario protect both behaviors (area-* tie-breaking and platform-non-regression) for one runner-slot cost. The other option would just avoid the gap instead of covering it. Rubric updated to make the dual intent explicit.
Verified #35490 (TabbedPage) already asserts platform/ios and its body only lists iOS, so no equivalent gap there.
kubaflo
left a comment
There was a problem hiding this comment.
Could you please check the ai's suggestions?
|
/review rerun |
|
/review -b feature/enhanced-reviewer -p android |
MauiBot
left a comment
There was a problem hiding this comment.
Expert Review — 1 findings
See inline comments for details.
| - type: "output_not_contains" | ||
| value: "s/triaged" | ||
| - type: "output_matches" | ||
| pattern: "area-(tooling|infrastructure)" |
There was a problem hiding this comment.
[moderate] Regression Prevention and Test Coverage — This fixture allows area-infrastructure for issue #31806, but the issue is about target-framework/workload setup and the skill rules route workload/build/tooling surface to area-tooling, while area-infrastructure is reserved for CI/agent/workflow infrastructure. As written, the eval would pass a regression that mislabels a user-facing tooling issue as infrastructure. Consider pinning this to area-tooling and adding a negative assertion for area-infrastructure.
kubaflo
left a comment
There was a problem hiding this comment.
The Agentic-labeler newer applies area-ai-agents; could you check it?
|
/review -b feature/enhanced-reviewer -p android |
MauiBot
left a comment
There was a problem hiding this comment.
AI Review Summary
@PureWeen — new AI review results are available based on this last commit:
645a685. To request a fresh review after new comments or commits, comment/review rerun.
Review Sessions — click to expand
Gate — Test Before & After Fix
Gate Result: ⚠️ SKIPPED
No tests were detected in this PR.
Recommendation: Add tests to verify the fix using the write-tests-agent.
Pre-Flight — Context & Validation
Issue: #unknown - unavailable (GitHub CLI unauthenticated)
PR: #35570 - unavailable via GitHub API; local branch pr-review-35570
Platforms Affected: CI/Copilot review pipeline; requested test platform: android
Files Changed: 12 implementation/instruction/workflow, 2 test/eval
Key Findings
- GitHub API context gathering was blocked by missing
ghauthentication, so issue body, PR body, comments, inline reviews, required checks, and linked issues could not be fetched. - Local diff changes 14 files, primarily
.github/scripts/*,.github/skills/*,.github/workflows/copilot-review-tests.*, andeng/pipelines/ci-copilot.yml. - Gate result was provided by the caller as skipped: no tests detected in this PR; gate verification was not re-run.
- Deep code review found a security issue:
AnalyzeCopilotTokenUsageruns.github/scripts/shared/Aggregate-CopilotTokenUsage.ps1from a checked-out worktree, then a later token-bearing dispatch consumes that generated summary.
Code Review Summary
Verdict: NEEDS_CHANGES
Confidence: low
Errors: 1 | Warnings: 0 | Suggestions: 0
Key code review findings:
- ❌
eng/pipelines/ci-copilot.yml:1782runs token-usage aggregation from PR-controlled checkout beforeVIGILANT_GUIDE_DISPATCH_TOKENdispatch consumes the output; this violates the trusted-copy script rule for ci-copilot pipeline code.
Fix Candidates
| # | Source | Approach | Test Result | Files Changed | Notes |
|---|---|---|---|---|---|
| PR | PR #35570 | Adds token usage aggregation/dispatch, but runs the aggregator from checkout self in AnalyzeCopilotTokenUsage. |
eng/pipelines/ci-copilot.yml, .github/scripts/shared/Aggregate-CopilotTokenUsage.ps1, related tests/docs |
Original PR implementation has a trusted/untrusted boundary issue. |
Code Review — Deep Analysis
Code Review — PR #35570
Independent Assessment
What this changes: Adds Copilot token-usage capture/aggregation for the PR-review pipeline, publishes a new token-usage artifact, dispatches usage statistics to dotnet/maui-vigilant-guide, improves review/gate diagnostics, and updates labeler/test-failure review instructions/evals.
Inferred motivation: Improve observability/cost tracking for Copilot review runs and make review/test automation outputs more accurate.
Reconciliation with PR Narrative
Author claims: Unavailable — gh is unauthenticated in this environment.
Agreement/disagreement: Blocked from validating PR description, linked issues, and comments. Assessment is based on origin/main...HEAD local diff and full changed files only.
Prior Review Reconciliation
| Prior ❌ Error Finding | Source | Status | Evidence |
|---|---|---|---|
| Unable to query prior reviews/comments | GitHub PR review, inline comment, and issue-comment APIs | Blocked | gh reports: "To get started with GitHub CLI, please run: gh auth login." |
Blast Radius Assessment
- Runs for all instances: Yes — this affects the shared
ci-copilotPR-review infrastructure for every reviewed PR. - Startup impact: No app startup impact, but pipeline-stage startup is affected.
- Static/shared state: No application static state; new cross-stage artifacts and a new token-bearing dispatch path are introduced.
CI Status
- Required-check result: unavailable
- Classification: undetermined
- Action taken:
gh pr checks 35570 --repo dotnet/maui --requiredblocked by missing GitHub CLI auth; confidence capped low andLGTMis not permitted.
Findings
❌ Error — Token-usage aggregation runs PR-controlled script before token-bearing dispatch
eng/pipelines/ci-copilot.yml:1782
The new AnalyzeCopilotTokenUsage stage checks out self and runs:
$script = ".github/scripts/shared/Aggregate-CopilotTokenUsage.ps1"
& $script ...That script path comes from the checked-out worktree, not from the trusted .github/ copy created before PR merge. This violates the ci-copilot security rule that post-merge/post-review scripts must be invoked from trusted copied script locations.
The next step reads the generated summary and uses VIGILANT_GUIDE_DISPATCH_TOKEN to dispatch it to dotnet/maui-vigilant-guide (eng/pipelines/ci-copilot.yml:1819, eng/pipelines/ci-copilot.yml:1854-1865). Even though the aggregation step itself does not receive that token, a PR can alter the aggregation script and fully control the payload consumed by the token-bearing dispatch step. Use a trusted script copy/artifact for aggregation, or move aggregation into an existing trusted-script path.
Failure-Mode Probing
- Malicious PR modifies
Aggregate-CopilotTokenUsage.ps1: The new stage runs the PR version and can forge the summary dispatched withVIGILANT_GUIDE_DISPATCH_TOKEN. - Missing
CopilotLogs: The aggregator handles this by producing zero-record artifacts; this path appears safe. checkout: selfcredential exposure: The new checkout usespersistCredentials: false, so repo credentials are not persisted.- gh-aw lock consistency: The
.mdworkflow and.lock.ymlwere updated together; no lock mismatch found locally.
Verdict: NEEDS_CHANGES
Confidence: low — infrastructure/security blast radius plus unavailable PR narrative, prior reviews, and CI.
Summary: The main code paths are understandable, but the new analysis stage crosses the trusted/untrusted boundary by executing a PR-controlled script before a token-bearing dispatch. That should be fixed before merge.
Fix — Analysis & Comparison
Fix Candidates
| # | Source | Approach | Test Result | Files Changed | Notes |
|---|---|---|---|---|---|
| 1 | try-fix | Publish pre-merge trusted .github copy as TrustedGithub, download it in AnalyzeCopilotTokenUsage, and run aggregation from that artifact instead of checkout self. |
✅ PASS — 20 focused Pester tests passed | 1 file | Fixes the trusted/untrusted boundary while preserving the PR feature. |
| PR | PR #35570 | Run token aggregation from .github/scripts/shared/Aggregate-CopilotTokenUsage.ps1 in the analysis-stage checkout. |
14 files | Original PR implementation; code review found a security issue. |
Cross-Pollination
| Model | Round | New Ideas? | Details |
|---|---|---|---|
| maui-expert-reviewer | 1 | Yes | Candidate 1: publish/download a trusted scripts artifact; Candidate 2: fetch target branch in the analysis stage; Candidate 3: move parsing to dotnet/maui-vigilant-guide. |
Exhausted: No — stopped because Candidate #1 passed the focused regression tests and is demonstrably better than the PR's current fix. Candidate #2 is weaker for this PR because the aggregation script may not exist on the target branch until after merge; Candidate #3 requires cross-repository changes outside this PR.
Selected Fix: Candidate #1 — it uses the repository's existing trusted-copy model, publishes that trusted copy before PR-controlled Gate code can run, removes the analysis-stage PR checkout, and keeps the token-bearing dispatch consuming output generated by trusted code.
Report — Final Recommendation
Comparative Report — PR #35570
Candidates Compared
| Rank | Candidate | Regression result | Assessment |
|---|---|---|---|
| 1 | pr-plus-reviewer |
✅ PASS — focused Pester command reported 20 passed, 0 failed, 0 skipped | Best candidate. It preserves the PR feature and applies the expert reviewer's trusted-artifact fix in a sandbox copy, eliminating the PR-controlled script execution before the token-bearing dispatch. |
| 2 | try-fix-1 |
✅ PASS — STEP 5a reported 20 focused Pester tests passed | Functionally equivalent security fix: publish/download TrustedGithub and run aggregation from the trusted artifact. It ranks just below pr-plus-reviewer only because pr-plus-reviewer is the same fix applied as PR reviewer feedback to the submitted PR. |
| 3 | pr |
Raw PR implementation leaves AnalyzeCopilotTokenUsage running .github/scripts/shared/Aggregate-CopilotTokenUsage.ps1 from the PR-controlled checkout and then feeds its output to a token-bearing dispatch. This violates the CI Copilot trusted-copy rule. |
No candidate from STEP 5a failed regression tests. If any had failed, it would rank below passing candidates per the requested rule.
Key Decision Points
pr is not acceptable as submitted because it crosses the trusted/untrusted boundary in AnalyzeCopilotTokenUsage: PR-controlled pipeline code can influence the generated token summary consumed by a later dispatch step that has VIGILANT_GUIDE_DISPATCH_TOKEN in scope.
try-fix-1 directly addresses the root cause by reusing the repository's existing trusted-copy model. Publishing the pre-merge trusted .github directory as TrustedGithub before PR-controlled gate/review work runs gives the analysis stage a trusted script source without needing a PR checkout.
pr-plus-reviewer applies that same fix to the PR in a sandbox copy and passes the same focused Pester regression command. It is therefore the safest and most mergeable candidate: it preserves the PR's intended token-usage feature while satisfying the expert reviewer's security feedback.
Winner
Winner: pr-plus-reviewer
Rationale: It is the PR fix with the expert reviewer's actionable security feedback applied, and it passes focused regression tests. It fixes the critical trusted-boundary issue while keeping the PR's intended Copilot token usage aggregation/dispatch behavior.
Future Action — review latest findings
No alternative fix was selected for this run. Review the session findings and CI results before merging.
kubaflo
left a comment
There was a problem hiding this comment.
Could you please check the ai's suggestions?
kubaflo
left a comment
There was a problem hiding this comment.
🤖 Multi-model code review — Request changes
Three models reviewed this PR independently (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro), then cross-pollinated.
Verdict: NEEDS_CHANGES (light) — the rule changes are solid; the asks are a few eval-robustness tweaks (the eval runs warn-only, so none are merge-blocking).
The rule changes are correct ✅
All three models verified the platform/tizen exclusion is applied consistently across every location in SKILL.md (scope line, the removed platform-table row, PR-file notes, issue rules, and "what NOT to do"), with no residual mention and no conflict with the noop rules (Tizen files still get an area-*). The new "named affected-platform list wins over a generic phrase" issue logic is clear and unambiguous, and the stale #35464 → #35422 automated-merge fixture swap is sensible.
Eval-robustness tweaks (inline)
These scenarios do run in CI (skill-validation.yml → skill-validator evaluate, --verdict-warn-only), so fragile assertions surface as spurious warnings:
- Self-fragile Tizen assertion (line 522) —
output_not_contains: platform/tizenon a test whose subject is Tizen; a correct agent will mention "platform/tizen" while explaining the exclusion → false-fail. Lean on the rubric / a finalLabels:line instead. - Named-list test doesn't isolate the rule (line 509) — #35501 names all four platforms, so it can't distinguish "named list wins" from "all mentioned → all". Add a strict-subset case (e.g. "all platforms (iOS, Android)" → only those two) and a bare "all platforms" → zero case.
- Echo-fragile fixture (line 226) — #35422 is merged and already carries
area-infrastructure, the label this scenario negatively asserts. - Area regex too narrow (line 542) —
area-(tooling|infrastructure)can fail for a reason orthogonal to Tizen; widen to includesetup|templates.
Independent verdicts: GPT-5.5 — LGTM (high) · Gemini 3.1 Pro — LGTM (high) · Opus 4.8 — NEEDS_DISCUSSION (med). The two LGTMs were on the (correct) rules; Opus's deeper pass — reading the validator and each live fixture — surfaced the warn-only assertion fragilities above. All agree the eval runs in CI and the Tizen rule is consistent.
| prompt: "Label issue #31806 in dotnet/maui. List the labels you would apply." | ||
| assertions: | ||
| - type: "output_not_contains" | ||
| value: "platform/tizen" |
There was a problem hiding this comment.
platform/tizen while explaining the exclusion ("normally Tizen → platform/tizen, but the rule says never apply it"). skill-validator matches output_not_contains against the full console output, so a correct agent can deterministically FAIL this output_not_contains: platform/tizen (and partner/tizen just below). The rubric already captures the real intent ("final label set does NOT include platform/tizen"). Drop/relax the hard substring guards here and lean on the rubric, or scope the check to a final Labels: line — otherwise the headline Tizen test is flaky against correct behaviour.
| rubric: | ||
| - "The final label set includes area-safearea as the single area-* label — the issue is specifically about SafeArea API surface; the tie-breaking heuristic prefers the sub-area over the parent area-layout" | ||
| - "The final label set does NOT include area-layout — the tie-breaking heuristic prefers the sub-area over the parent area" | ||
| - "The final label set includes platform/ios, platform/android, platform/windows, AND platform/macos — the issue body's 'Affected platforms' field explicitly enumerates all four platforms (iOS, Android, Windows, macOS), and per the platform-inference rules an explicit named list wins (apply one label per named platform)" |
There was a problem hiding this comment.
platform/ios + platform/android and output_not_contains windows/macos), and a bare "all platforms" → zero-platform-labels case.
|
|
||
| - name: "Automated merge PR - should noop" | ||
| prompt: "Label PR #35464 in dotnet/maui. List the labels you would apply." | ||
| prompt: "Label PR #35422 in dotnet/maui. List the labels you would apply." |
There was a problem hiding this comment.
💡 The new automated-merge fixture #35422 is MERGED and already carries area-infrastructure, while this scenario asserts output_not_contains: area-infrastructure. The agent fetches the PR and sees that label; if it surfaces/echoes the existing area-infrastructure it false-fails. The replaced fixture (#35464) had no such label, so the swap trades the staleness problem for echo-fragility on the very label being negatively asserted. Prefer an OPEN, unlabeled automated-merge PR (or rely on the rubric).
| - type: "output_not_contains" | ||
| value: "s/triaged" | ||
| - type: "output_matches" | ||
| pattern: "area-(tooling|infrastructure)" |
There was a problem hiding this comment.
💡 The area assertion is narrower than the skill's own guidance, risking a failure orthogonal to the Tizen rule under test. #31806 is "Unable to set Tizen targeted framework" (TFM/workload/SDK setup), and SKILL.md maps that surface to area-tooling / area-templates / area-setup. The live issue happens to carry area-tooling, but a reasonable agent could pick area-setup/area-templates and fail area-(tooling|infrastructure). Suggest widening to area-(tooling|infrastructure|setup|templates).
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
What
Follow-up improvements to the agentic-labeler skill (
.github/skills/agentic-labeler/) — both the rule definitions inSKILL.mdand the eval fixture suite intests/eval.yaml. Scope grew during multi-round adversarial review.1. Fix stale "automated merge PR" fixture (
eval.yaml)The "Automated merge PR - should noop" scenario was pointed at PR #35464 — which is the original agentic-labeler workflow PR, not an actual automated merge PR. That scenario was effectively testing nothing.
Switched to PR #35422 (
[automated] Merge branch 'main' => 'net11.0'), which actually exercises the noop rule for bot-created inter-branch merges (touches many platform-specific files but should still noop because the changes are mechanical, not feature/bug work). Rubric tightened to call outarea-infrastructureexplicitly so the test fails clearly if a regression starts mis-labeling automated merges.2. Add "exactly one area-*" tie-break scenarios (
eval.yaml)PR #35567 introduced the "exactly one area-* label" rule and tie-breaking heuristics, but no eval scenario explicitly tested those rules. Two scenarios added using the items that motivated the rule:
area-controls-tabbedpage+platform/iosarea-navigationarea-safearea+platform/{ios,android,windows,macos}area-layout; explicit named list wins over generic-phrase exclusionBoth items came from the 26-item backfill after the
max:1truncation bug fix (#35540/#35549), where they were the only two with twoarea-*labels — ideal ground truth.3. Clarify platform-inference rules for issues (
SKILL.md)The original rule was a single sentence covering many edge cases ambiguously. Rewrote into 4 explicit bullets distinguishing:
platform/*per name, even if all four are listed."all platforms","cross-platform") → no platform labels unless accompanied by an explicit named list, in which case the named list wins."tested on iOS","not reproduced on Android") → no label."please add platform/android") → no label.Surfaced after 4-model consensus (Sonnet 4.6 + GPT-5.4 + Gemini 3.1 Pro + GPT-5.5) that the prior phrasing let the labeler hallucinate "all platforms mentioned → no labels".
4. Exclude
platform/tizenfrom the labeler entirely (SKILL.md+eval.yaml)Per project direction,
platform/tizenis never applied by this labeler — even when:*.tizen.csor/Platforms/Tizen/filesThe Tizen label still exists in the repo (Samsung Tizen TV) but is owned/applied through other channels, not by this automation. The rule is documented in 4 places in
SKILL.md(Scope, file-pattern notes, issue rule, "What NOT to do") and guarded in eval fixtures bynot_contains: platform/tizenassertions. For Tizen content, the labeler still appliesarea-*normally based on the code's subject matter — only theplatform/*label is suppressed.A dedicated eval scenario using #31806 (Tizen-only "Affected platforms" + Tizen-targeting subject) verifies the exclusion in isolation.
Why
area-infrastructureor noop, so the test never had teeth.Risk
SKILL.mdrule changes (Tizen exclusion + issue platform-inference rewrite) — adjust labeler behavior on the next eval run. Validated by the eval suite (24 scenarios pass) and by the dedicated Tizen-only scenario.eval.yaml— additive; doesn't modify the workflow, runtime code, or existing scenario semantics. Scenario count grew 21 → 24.Reviewed across multiple adversarial rounds (claude-opus-4.6, gpt-5.5, gpt-5.3-codex) with consensus thresholds applied.
Cc @PureWeen