Skip to content

Cross-phase trust-boundary hardening for ci-copilot Gate (HMAC seal + fail-closed restore)#35843

Open
T-Gro wants to merge 3 commits into
dotnet:feature/enhanced-reviewerfrom
T-Gro:security-on-enhanced-reviewer-v2
Open

Cross-phase trust-boundary hardening for ci-copilot Gate (HMAC seal + fail-closed restore)#35843
T-Gro wants to merge 3 commits into
dotnet:feature/enhanced-reviewerfrom
T-Gro:security-on-enhanced-reviewer-v2

Conversation

@T-Gro

@T-Gro T-Gro commented Jun 10, 2026

Copy link
Copy Markdown
Member

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Targets feature/enhanced-reviewer (#35677). Adds the cross-phase trust-boundary hardening that branch is missing (the PR-stdout sanitization it already has via ConvertTo-AzdoSafeConsole is intentionally left untouched).

Adds:

  • Restore-GateResultOrFailClosed — fail-closed gate verdict restore (was fail-open SKIPPED on a missing file), with HMAC verification.
  • HMAC sealing of gate-result.txt in the Gate task: Linux-gated, key generated after pwsh exits and fed over stdin (never /proc/cmdline or /proc/environ), orphan-reap of PR-spawned process groups, fail-closed in CI if signing is suppressed.
  • winnerJson StrictMode property-presence guard.
  • Gate-phase undefined-var pre-init ($risksData/$regressionTests).
  • Phase-Audit static test (gh token-scope auditing).

Tests: 251/251 Pester green, 0 skipped. bash -n + YAML schema clean; python3 ↔ .NET HMAC interop cross-checked.

The macro-literal handling needs one validation run on a non-Linux platform (ios/catalyst/windows) since the public maui-pr pipeline skips .github/**.

@github-actions github-actions Bot added the area-infrastructure CI, Maestro / Coherency, upstream dependencies/versions label Jun 10, 2026
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 19 | Agents checked: 4

Full validator output
Found 1 skill(s)
[verify-tests-fail-without-fix] 📊 verify-tests-fail-without-fix: 2,271 BPE tokens [chars/4: 2,189] (detailed ✓), 26 sections, 7 code blocks
✅ All checks passed (1 skill(s))
Found 4 agent(s)
Validated 4 agent(s)
✅ All checks passed (4 agent(s))

❌ LLM Evaluation Failed

0/1 skill(s) passed validation

Skill Scenario Baseline Skilled Verdict
verify-tests-fail-without-fix Happy path: full verification mode with fix files present 1.0/5 1.0/5 (iso) · 1.0/5 (plug) ⚠️ [1]
verify-tests-fail-without-fix Happy path: verify failure only mode for test creation 1.0/5 1.0/5 (iso) · 1.0/5 (plug) ⚠️ [2]
verify-tests-fail-without-fix Negative trigger: general test question should not invoke verification 4.0/5 4.0/5 (iso) · 4.0/5 (plug) [3]
verify-tests-fail-without-fix Regression: tests passing without fix means verification FAILED 5.0/5 5.0/5 (iso) · 5.0/5 (plug) ⚠️ [4]
verify-tests-fail-without-fix Edge case: no test files detected in the PR 2.0/5 3.7/5 (iso) · 1.7/5 (plug) ⚠️ [5]
verify-tests-fail-without-fix Regression: agent must not manually revert files with git commands 1.3/5 ⏳ 1.7/5 (iso) · 2.0/5 (plug) ⏳ ⚠️ [6]
verify-tests-fail-without-fix Edge case: agent uses RequireFullVerification when fix files exist 1.0/5 ⏳ 1.7/5 (iso) · 1.0/5 (plug) ⏳ [7]
verify-tests-fail-without-fix Regression: agent correctly reports test failure as verification success 1.0/5 5.0/5 (iso) · 5.0/5 (plug)
verify-tests-fail-without-fix Feature: agent reports verification result clearly 2.0/5 4.0/5 (iso) · 4.0/5 (plug) [8]
verify-tests-fail-without-fix Feature: agent trusts script auto-detection of test files from git diff 1.0/5 3.7/5 (iso) · 3.0/5 (plug)

verify-tests-fail-without-fix: Skill regressed on task completion in one or more scenarios [NOT ACTIVATED (isolated) in 2 scenario(s)] [NOT ACTIVATED (plugin) in 3 scenario(s)]

[1] High run-to-run variance (CV=1.21)
[2] High run-to-run variance (CV=1.04)
[3] High run-to-run variance (CV=30.74)
[4] High run-to-run variance (CV=2.70)
[5] High run-to-run variance (CV=0.90)
[6] High run-to-run variance (CV=3.76). Timeout at 900s
[7] High run-to-run variance (CV=4.41). Timeout at 900s
[8] High run-to-run variance (CV=0.68)

timeout — run(s) hit the scenario timeout limit; scoring may be impacted

📊 verify-tests-fail-without-fix / edge-case-agent-uses-requirefullverification-when-fix-files-exist

Judge Report: Edge case: agent uses RequireFullVerification when fix files exist

Baseline Judge

Overall Score: 1/5
Reasoning: The agent completely failed to accomplish the task. It tried one skill that didn't exist, confirmed the directory was empty, and then gave up. It did not attempt to clone the repository, check out the PR, or perform any verification steps. The output is just an explanation of why it couldn't do the work, with no actionable results.

  • Agent uses -RequireFullVerification to ensure full two-phase verification: 1/5 — The agent never used -RequireFullVerification or any equivalent flag. It attempted a non-existent skill, checked an empty directory, and then gave up without attempting any verification approach.
  • Agent runs the complete workflow: fail without fix, then pass with fix: 1/5 — The agent did not run any part of the workflow. It found the directory empty and stopped, only describing what it would do hypothetically rather than attempting to clone the repo or find the code.

With-Skill Judge (Isolated)

Overall Score: 1.7/5
Reasoning: The agent failed to complete the task. While it correctly identified the right script and parameters to use, it gave up after discovering the working directory was empty. It made minimal effort to resolve the issue - only one diagnostic find command before asking the user for help. A better agent would have attempted to clone the repository, check out the PR branch, and proceed with verification.

  • Agent uses -RequireFullVerification to ensure full two-phase verification: 3/5 — The agent did include -RequireFullVerification in the PowerShell command (visible in the bash call), so it attempted to use the correct flag. However, the command failed due to environment issues before it could take effect.
  • Agent runs the complete workflow: fail without fix, then pass with fix: 1/5 — The workflow was never executed. The agent encountered a 'not a git repository' error and stopped, asking the user for help rather than attempting to clone or set up the repository itself.

With-Skill Judge (Plugin)

Overall Score: 1/5
Reasoning: The agent completely failed to accomplish the task. The fix file (ScrollViewHandler.Android.cs) doesn't exist in this repository checkout, making verification impossible. The agent took 30 tool calls and ~14 minutes before arriving at this conclusion. Along the way it tried non-existent parameters, waited 12+ minutes for a hung process that was using wrong filters, and never produced any verification results. The final output is just a statement that the file doesn't exist.

  • Agent uses -RequireFullVerification to ensure full two-phase verification: 1/5 — The agent attempted to use -RequireFullVerification but discovered it wasn't a valid parameter. It then pivoted to using -FixFiles instead, which triggers full verification mode. However, the parameter name requested in the rubric was never successfully used, and the alternative approach never produced results.
  • Agent runs the complete workflow: fail without fix, then pass with fix: 1/5 — The workflow never completed. The script ran for 14+ minutes with zero output, and when checked, no processes were running and no report was generated. The agent timed out at 900s without any verification results whatsoever.

Baseline Agent Output

`
The working directory is empty — there's no repository cloned here. I cannot perform the test verification because:

  1. No repository present — the dotnet/maui repo isn't cloned in this environment
  2. No skill available — the verify-tests-fail-without-fix skill isn't loaded
  3. No matching files — no ScrollViewHandler.Android.cs or tests referencing Issue55555

To proceed, I need:

  • The repository cloned (or a PR number to check out via gh pr checkout)
  • Access to a build environment with the .NET SDK and Android workload

Would you like me to:

  1. Clone the repo and attempt verification manually?
  2. Check out a specific PR by number?

Please provide the PR number or repo URL so I can set up the environment.
`

With-Skill Agent Output (Isolated)

`
The working directory is empty and not a git repository. This skill requires a cloned git repo with test and fix files present.

Verification blocked — no git repository found in the current working directory. To proceed, I need:

  1. A cloned dotnet/maui repository (or the relevant PR branch checked out)
  2. The test files and fix file (ScrollViewHandler.Android.cs) present in the working tree

Would you like me to clone the repo and check out the PR branch first? If so, please provide the PR number.
`

With-Skill Agent Output (Plugin)

The fix file doesn't exist in this repository checkout. Let me check what ScrollView handler files are available:

📊 verify-tests-fail-without-fix / edge-case-no-test-files-detected-in-the-pr

Judge Report: Edge case: no test files detected in the PR

Baseline Judge

Overall Score: 2/5
Reasoning: The agent did very little useful work. After a single tool call revealed an empty workspace, it gave up and asked for more information. It didn't attempt to explore the filesystem further, look for repositories in other locations, or provide substantive analysis based on the information already given in the prompt (file path, platform, the fact that no tests were added). A better response would have acknowledged the missing test coverage concern directly, discussed what kinds of tests would be needed for an Android EntryHandler fix, and been more proactive in exploring the environment.

  • Agent recognizes that without test files, verification cannot proceed: 2/5 — The agent recognized the directory was empty and couldn't proceed, but it didn't specifically address the stated fact that no test files were added to the PR. It focused on the empty directory rather than the core issue that the PR lacks test coverage.
  • Agent suggests that tests need to be written before verification can be run: 2.3/5 — The agent did mention 'if needed, write tests that fail without the fix and pass with it' as part of its offer to help once given more context. However, this was conditional and not a strong, direct recommendation that tests must be written given the PR contains no test files.
  • Agent does not attempt to fabricate or skip the test requirement: 4.7/5 — The agent did not fabricate any tests, make up results, or skip the requirement. It correctly identified it lacked the necessary context and asked for more information before proceeding.

With-Skill Judge (Isolated)

Overall Score: 3.7/5
Reasoning: The agent correctly identified that the working directory was empty and that no test files were present in the PR, making verification impossible. It provided clear next steps. The only minor issue is that the agent could have explored the skill instructions more thoroughly to see if there were alternative approaches within the skill's framework, but overall the response is accurate and helpful.

  • Agent recognizes that without test files, verification cannot proceed: 3.7/5 — The agent does recognize that no test files were added and that the skill requires test files or a TestFilter. However, it conflates two issues - the empty working directory AND the missing tests - making it unclear whether it truly understands the test requirement independently of the repo issue.
  • Agent suggests that tests need to be written before verification can be run: 3.7/5 — The agent asks for 'new test files covering the fix' or 'the name of an existing test class/filter' but doesn't explicitly state that tests need to be written before verification can proceed. It frames this more as needing information rather than clearly communicating that the PR lacks adequate test coverage and tests should be authored.
  • Agent does not attempt to fabricate or skip the test requirement: 5/5 — The agent does not attempt to fabricate tests, skip the requirement, or pretend verification can proceed without tests. It correctly stops and asks for what's needed.

With-Skill Judge (Plugin)

Overall Score: 1.7/5
Reasoning: The agent essentially gave up after finding an empty working directory. While it didn't fabricate anything, it also failed to engage with the core task: the user explicitly stated no test files were added, and the agent should have recognized this as the key issue and recommended that tests be written to cover the fix in EntryHandler.Android.cs. Instead, it just asked for more context without addressing the stated problem.

  • Agent recognizes that without test files, verification cannot proceed: 2/5 — The agent noticed the working directory was empty and asked for more context, but it didn't explicitly address the core issue stated in the prompt - that the PR contains no test files. It vaguely mentioned it could 'confirm existing tests catch the bug, or determine if new tests are needed' but didn't clearly state that verification cannot proceed without tests covering the fix.
  • Agent suggests that tests need to be written before verification can be run: 1.3/5 — The agent did not explicitly suggest that tests need to be written. It mentioned it could 'determine if new tests are needed' as a future action, but didn't provide a clear recommendation that tests should be written given the PR only contains a fix without test files.
  • Agent does not attempt to fabricate or skip the test requirement: 4.3/5 — The agent did not fabricate any results or skip the test requirement. It honestly reported the empty directory and asked for additional information rather than making up conclusions or pretending tests exist.

Baseline Agent Output

`
The directory appears empty. Could you provide:

  1. The PR number (e.g., #12345) so I can check the actual changes
  2. Or the repository if it's not dotnet/maui

With the PR number, I can examine the fix, find relevant existing tests, and verify whether the change is properly covered by tests — or write new tests if needed.
`

With-Skill Agent Output (Isolated)

`
The working directory is empty and not a git repository. Without a cloned repo or PR checkout, I cannot run the verification script.

More importantly, the skill's activation guard states it requires test files in the PR or an explicit TestFilter. You've confirmed that no test files were added to this PR — which means there are no tests to verify against the fix.

Result: Verification cannot proceed

The fix in EntryHandler.Android.cs has no corresponding tests in this PR. To verify the fix is caught by tests, you'd need to either:

  1. Write tests first — use the write-tests-agent to create tests for this issue
  2. Identify existing tests — if there are pre-existing tests that cover this behavior, provide a -TestFilter so the script knows what to run

Would you like me to help write tests for this fix, or do you know of existing tests that should cover it?
`

With-Skill Agent Output (Plugin)

`
The working directory appears empty. Could you provide more context?

  1. Which PR number are you referring to? (e.g., #12345 in dotnet/maui)
  2. Or do you have a local repo path where the code lives?

With the PR number, I can check out the code and use the verify-tests-fail-without-fix skill to confirm the fix is properly tested. Alternatively, if you have a local clone, point me to it.
`

📊 verify-tests-fail-without-fix / feature-agent-reports-verification-result-clearly

Judge Report: Feature: agent reports verification result clearly

Baseline Judge

Overall Score: 2/5
Reasoning: The agent quickly determined the working directory was empty and couldn't run actual tests, which is fair. However, it didn't attempt to clone any repository or look for additional context. Its explanation of verification semantics is reasonable but generic - it doesn't use the specific output format (VERIFICATION PASSED/FAILED) that appears to be expected. The response is helpful as general guidance but doesn't meet the specific expectations of the task's verification framework.

  • Agent explains the verification output format (VERIFICATION PASSED / VERIFICATION FAILED): 2/5 — The agent provides generic pass/fail indicators (exit codes, test output messages) but does not mention a specific 'VERIFICATION PASSED' or 'VERIFICATION FAILED' output format. It gives general dotnet test output patterns rather than a structured verification result format.
  • Agent describes what each result means in the context of inverted semantics: 2.3/5 — The agent does not discuss inverted semantics at all. In UI bug verification tests, 'catching the bug' often means the test fails when the bug is present (inverted logic). The agent makes no mention of this concept and doesn't explain how pass/fail relates to whether the bug is caught or the fix is verified.

With-Skill Judge (Isolated)

Overall Score: 4/5
Reasoning: The agent couldn't run the actual verification since no repo was checked out, but it read the script, understood the verification logic, and provided a clear, well-structured explanation of how verification works, what the outputs mean, and where to find results. The inverted semantics are clearly explained. It loses a point because it didn't actually perform the verification task requested, though that was due to environmental constraints.

  • Agent explains the verification output format (VERIFICATION PASSED / VERIFICATION FAILED): 4.3/5 — The agent clearly presents a table showing both 'VERIFICATION PASSED ✅' and 'VERIFICATION FAILED ❌' outcomes, explaining the output format well. It also shows the command that would produce these results. It didn't actually run the script to demonstrate, but the working directory was empty so that's understandable.
  • Agent describes what each result means in the context of inverted semantics: 4.7/5 — The agent explicitly calls out the 'Key insight (inverted semantics)' and clearly explains: tests failing without fix = GOOD (they detect the bug), tests passing without fix = BAD (they miss the bug). The table also explains both pass and fail scenarios in terms of the fix/no-fix logic.

With-Skill Judge (Plugin)

Overall Score: 4/5
Reasoning: The agent correctly identified that there was no Issue #66666 branch available, investigated the environment efficiently, and then provided a clear explanation of how verification works including the inverted semantics. The output format explanation is accurate and well-structured. It loses a point for not being able to actually run the verification (though this was impossible given the environment) and for not diving deeper into the script to provide more detailed output format information.

  • Agent explains the verification output format (VERIFICATION PASSED / VERIFICATION FAILED): 4/5 — The agent clearly explains the two possible outputs: 'VERIFICATION PASSED ✅' and 'VERIFICATION FAILED ❌' in a well-formatted table. It correctly identifies these as the key indicators. However, the agent didn't actually read the script to confirm these exact strings, so there's some risk the output format is assumed rather than verified from the source.
  • Agent describes what each result means in the context of inverted semantics: 4/5 — The agent explains inverted semantics well: tests must FAIL without the fix (proving they catch the bug) and PASS with the fix (proving the fix works). It clearly maps the three scenarios (pass, two types of fail) to their meanings. The explanation is accurate and useful for someone trying to understand the verification approach.

Baseline Agent Output

`
The working directory is empty — there's no repository cloned here, no test projects, and no PR code to work with. I can't run or verify UI tests without the actual codebase.

However, I can explain how you'd know if verification passed or failed for .NET MAUI UI tests:

Verification Outcome Indicators

PASSED — The test catches the bug on iOS:

  • Test runs against the unfixed code and fails (proving it detects the bug)
  • Test runs against the fixed code and passes (proving the fix works)
  • Console output: Passed! - Failed: 0, Passed: X, Skipped: 0

FAILED — The test does NOT catch the bug:

  • Test passes even without the fix applied (test isn't actually validating the bug scenario)
  • Test fails with the fix applied (test is broken or incorrect)
  • Test is skipped on iOS or throws infrastructure errors

How to Run It

bash dotnet test <UITests.csproj> \ --filter "FullyQualifiedName~Issue66666" \ -p:TargetFramework=net9.0-ios \ --logger "console;verbosity=detailed"

Recommended Two-Step Verification

  1. Without fix: Revert the fix commit, run the test → expect FAIL
  2. With fix: Re-apply the fix, run the test → expect PASS

If both steps produce the expected result, verification passes. Would you like me to clone the repo and run this, or do you need help with something else?
`

With-Skill Agent Output (Isolated)

`
The working directory is empty and not a git repository, so there's no PR branch checked out to verify against. However, I can explain exactly how the verification works and what to expect:

How Verification Works for Your Case

Since the PR has both test files and a fix, it uses Full Verification mode (-RequireFullVerification).

Command to run (from the repo root):

powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 \ -Platform ios \ -TestFilter "FullyQualifiedName~Issue66666" \ -RequireFullVerification

How to interpret results:

Script Output Meaning
VERIFICATION PASSED ✅ Tests fail without fix, pass with fix — bug is caught ✅
VERIFICATION FAILED ❌ Tests pass without fix (don't detect bug) OR fail with fix (fix is broken) ❌
Error/timeout Blocked — check logs

Key principle (inverted semantics):

  • Tests FAILING without the fix = ✅ Good (bug detected)
  • Tests PASSING without the fix = ❌ Bad (tests miss the bug)

Output files will be at:

CustomAgentLogsTmp/PRState/<PRNumber>/PRAgent/gate/verify-tests-fail/ ├── verification-report.md # Detailed markdown report ├── test-without-fix.log # Test output without fix (expect failures) └── test-with-fix.log # Test output with fix (expect passes)

To proceed, I'd need the repo cloned with the PR branch checked out. Would you like me to clone it from a specific PR?
`

With-Skill Agent Output (Plugin)

`
The MAUI repo doesn't have any changes related to Issue #66666. There's no PR branch with test or fix files for that issue in the current environment.

I cannot run verification because:

  1. No git branch/PR for Issue #66666 exists in the available repository
  2. The verification script requires a git diff to detect test files and fix files
  3. Without actual test/fix files committed on a branch, there's nothing to verify

How You'll Know If Verification Passed or Failed

When the environment is properly set up (PR branch checked out with both tests and fix), here's what to expect:

Command to run:
powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 \ -Platform ios \ -TestFilter "FullyQualifiedName~Issue66666" \ -RequireFullVerification

Results (inverted semantics!):

Output Meaning
VERIFICATION PASSED ✅ Tests FAIL without fix, PASS with fix → bug is caught ✅
VERIFICATION FAILED ❌ Tests PASS without fix → tests don't detect the bug ❌
Error/timeout Blocked — check logs

Reports are written to:
CustomAgentLogsTmp/PRState/<PRNumber>/PRAgent/gate/verify-tests-fail/verification-report.md

To proceed, I'd need the actual PR branch for Issue #66666 checked out. Would you like me to fetch a specific PR?
`

📊 verify-tests-fail-without-fix / feature-agent-trusts-script-auto-detection-of-test-files-from-git-diff

Judge Report: Feature: agent trusts script auto-detection of test files from git diff

Baseline Judge

Overall Score: 1/5
Reasoning: The agent failed the core task. It found an empty directory, gave up immediately without attempting to locate or set up the verification script, and then provided an incorrect answer ('No') about whether a script can auto-detect test files. The expected behavior was to recognize and explain that such a script exists and can automatically determine which tests to run via git diff analysis. The agent took minimal steps (one ls command) before concluding incorrectly.

  • Agent explains that the script can auto-detect test files from the PR diff: 1/5 — The agent explicitly states the opposite — that there ISN'T an automatic verification script that can determine which tests to run. It says test files need to be identified manually.
  • Agent does not require the user to manually specify every test file path: 2/5 — The agent tells the user they need to manually identify test files from the PR's changed files, and lists common test project paths they'd need to check. It does offer to help automate this, but the answer given contradicts this criterion.
  • Agent trusts the script's git diff analysis rather than manually searching for test files: 1/5 — The agent does not acknowledge any script's git diff analysis capability. Instead, it suggests manual identification of test files and does not trust or reference any automated verification script.

With-Skill Judge (Isolated)

Overall Score: 3.7/5
Reasoning: The agent correctly answered the user's core question about whether the script can auto-detect tests. It hit a legitimate blocker (enterprise token restrictions preventing PR checkout) and communicated this clearly with actionable next steps. The approach was methodical - checking repo state, looking for the PR, trying available tokens. While it couldn't execute the verification, the informational answer was accurate and complete.

  • Agent explains that the script can auto-detect test files from the PR diff: 4.7/5 — The agent clearly explains that the verification script auto-detects test files from the git diff, listing the three steps: detecting changed test files, auto-classifying test type, and building a test filter from detected test classes.
  • Agent does not require the user to manually specify every test file path: 5/5 — The agent explicitly states 'No -TestFilter or -TestType needed — the script figures it out from the changed files' and shows a command without those parameters.
  • Agent trusts the script's git diff analysis rather than manually searching for test files: 5/5 — The agent defers to the script's built-in capability to analyze the git diff rather than attempting to manually find test files. It acknowledges it can't run the script due to environment limitations but trusts the script's design.

With-Skill Judge (Plugin)

Overall Score: 3/5
Reasoning: The agent correctly identified that the verification script can auto-detect test files from the PR diff, which answers the user's core question. However, it was unable to actually run the verification due to GitHub API access issues and the wrong branch being checked out. The agent handled the blockers transparently and provided a reasonable manual workaround command. While it couldn't complete the actual task, it provided useful information about the script's capabilities.

  • Agent explains that the script can auto-detect test files from the PR diff: 4.7/5 — The agent clearly states: 'Yes, the verification script can auto-detect which tests to run. It analyzes the git diff to find changed test files, determines the test type (UITest, DeviceTest, UnitTest, etc.), and constructs the appropriate test filter automatically.'
  • Agent does not require the user to manually specify every test file path: 5/5 — The agent explicitly says 'No need to specify test files manually' and explains the auto-detection mechanism.
  • Agent trusts the script's git diff analysis rather than manually searching for test files: 4.7/5 — The agent defers to the script's git diff analysis capability and does not attempt to manually find or enumerate test files.

Baseline Agent Output

`
The working directory is empty and not a git repository. There's no .NET MAUI repository cloned here, so there's no verification script available to inspect.

To answer your question: No, there isn't a built-in "verification script" in the .NET MAUI repo that automatically detects which tests to run for a given PR. Here's how test verification typically works:

  1. CI pipelines — The PR's GitHub Actions workflows automatically run the relevant test suites based on changed paths (configured in .github/workflows/).

  2. Manual approach — You'd look at the PR's changed files (gh pr diff 77777 --name-only) and identify which *.UnitTests.csproj projects cover those areas.

  3. Common test commands for Android:
    bash dotnet test src/Controls/tests/Core.UnitTests -f net9.0-android dotnet test src/Core/tests/UnitTests

Would you like me to:

  • Clone the repo and fetch PR #77777 to identify changed test files?
  • Look up the PR on GitHub to see which files were modified?
    `

With-Skill Agent Output (Isolated)

I'm unable to proceed with verifying PR #77777. The GitHub token doesn't have access to thedotnet/maui` repository due to enterprise restrictions, and PR #77777 isn't checked out locally.

To answer your question: Yes, the script auto-detects tests from the git diff — no need to specify test files manually. It examines changed file paths to determine test type and builds filters from test class names automatically.

To run it, you'd need:

  1. The PR branch checked out locally
  2. Then run:
    powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 -Platform android -RequireFullVerification

The -TestFilter and -TestType parameters are optional — the script infers both from changed files. Could you check out PR #77777 locally, or provide a valid token?
`

With-Skill Agent Output (Plugin)

`
I'm unable to proceed with verifying tests for PR #77777. Here's why:

Blockers:

  1. No GitHub API access — The available token is rejected by the dotnet/maui enterprise policy, so I can't fetch PR #77777's details or changed files.
  2. Wrong branch checked out — The existing MAUI clone is on pr-33134, not PR #77777.

To answer your question: Yes, the verification script can auto-detect test files — it reads the git diff to find changed test files and automatically determines:

  • Test type (UITest, UnitTest, DeviceTest, etc.)
  • Test filter (class names from changed files)

What you'd need to run it manually:
bash git fetch origin pull/77777/head:pr-77777 && git checkout pr-77777 pwsh /path/to/verify-tests-fail.ps1 -Platform android -RequireFullVerification

Could you provide either a GitHub token with repo access, or let me know the PR branch name so I can fetch it directly?
`

📊 verify-tests-fail-without-fix / happy-path-full-verification-mode-with-fix-files-present

Judge Report: Happy path: full verification mode with fix files present

Baseline Judge

Overall Score: 1/5
Reasoning: The agent completely failed to perform the requested task. After finding an empty directory, it immediately gave up and asked the user for more information rather than attempting to clone a repository (e.g., dotnet/maui), search for the PR, or take any meaningful action toward verification. It made only 3 tool calls and produced no useful output beyond describing what it would theoretically do.

  • Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent did not run any tests at all. It checked the working directory, found it empty, and gave up without attempting to clone a repo or find relevant files.
  • Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — No tests were run in either phase. The agent did not attempt any test execution.
  • Agent uses the verify-tests-fail script rather than manually running dotnet test: 1/5 — The agent attempted to invoke a 'verify-tests-fail-without-fix' skill which doesn't exist. It did not look for any scripts in the environment or attempt alternative approaches.
  • Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1/5 — No results were produced to interpret. The agent only described what would need to happen conceptually without executing anything.

With-Skill Judge (Isolated)

Overall Score: 1/5
Reasoning: The agent failed to complete the task. After encountering a single error (no git repository found), it gave up and asked the user for more information rather than exploring the skill's directory structure, reading the script to understand its requirements, or attempting alternative approaches. The agent only made 4 tool calls total and spent minimal effort trying to solve the problem.

  • Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent never ran any tests. It attempted the script once, got an error about not being in a git repository, and then gave up.
  • Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — The agent never reached this phase at all.
  • Agent uses the verify-tests-fail script rather than manually running dotnet test: 2/5 — The agent did attempt to use the verify-tests-fail script, but after one failure it did not try alternative approaches like cloning a repo or providing additional parameters to the script.
  • Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1/5 — The agent never completed any verification phase, so there was nothing to interpret.

With-Skill Judge (Plugin)

Overall Score: 1/5
Reasoning: The agent failed to complete the verification task. While it correctly identified and attempted to use the verify-tests-fail skill/script, it gave up when encountering environment limitations (shallow clone, missing CI scripts, no Issue54321 files). The agent offered a manual approach but never attempted it. No tests were run, no verification was performed, and the final output is essentially 'I can't do this.' The fact that Issue54321 files don't exist suggests this may be a synthetic test scenario, but the agent could have been more creative in attempting alternative approaches or at least clearly stating that the test/fix files themselves are absent from the codebase.

  • Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent never ran any tests without the fix. It encountered a missing script dependency and stopped.
  • Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — The agent never ran any tests with the fix either. No test execution occurred at all.
  • Agent uses the verify-tests-fail script rather than manually running dotnet test: 2.7/5 — The agent attempted to use the verify-tests-fail.ps1 script as instructed by the skill, which is correct. However, when it failed due to missing dependencies, the agent did not attempt any fallback approach (like manually reverting fixes and running tests). It correctly identified the script but couldn't execute it.
  • Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1.3/5 — No results were produced to interpret. The agent never completed either phase of verification.

Baseline Agent Output

`
This is an empty working directory with no repository cloned. To perform the full verification (test fails without fix, passes with fix), I need:

  1. The repository — Is this PR in dotnet/maui? If so, what's the PR number?
  2. Access to build the project — The .NET MAUI repo requires specific SDK versions, Android SDK, etc.

The standard verification workflow would be:

  1. Check out the PR branch
  2. Identify test files and fix files
  3. Revert the fix → run the test → confirm it fails
  4. Restore the fix → run the test → confirm it passes

Could you provide:

  • The PR number (so I can gh pr checkout it)
  • Or the repository URL to clone

Without the actual codebase and build environment, I can't execute the verification. If you'd like, I can set up the repo and attempt the build — just confirm the PR number.
`

With-Skill Agent Output (Isolated)

`
The working directory is empty and there's no cloned repository available. The verification script requires:

  1. A git repository with the PR branch checked out (to detect fix files via git diff)
  2. A PR number (auto-detected from branch name or gh pr view)

To proceed, I need:

  • The repository URL (e.g., https://github.com/dotnet/maui)
  • The PR number so I can check out the branch

Could you provide the PR number or the branch name? I'll clone the repo and re-run verification.
`

With-Skill Agent Output (Plugin)

The verification script cannot run becauseEstablishBrokenBaseline.ps1(a trusted CI-only script) is missing from this environment — it's a shallow clone without the full.github/scripts/` directory that exists in the CI pipeline.

Verification Result: ⚠️ BLOCKED

The verify-tests-fail-without-fix skill cannot execute because:

  1. Shallow clone — no git parent commits to revert the fix against
  2. Missing CI scripts.github/scripts/EstablishBrokenBaseline.ps1 is not present (it lives in the trusted CI pipeline layer, not the PR checkout)
  3. No Issue54321 files found — the test/fix files for this issue don't appear to exist in the current working tree

To run full verification, you need:

  • A full git clone (not shallow) so the fix can be reverted to test the baseline
  • The trusted CI scripts available (typically provided by the maui-pr pipeline)
  • Or run this via the CI pipeline where Review-PR.ps1 orchestrates the gate

Would you like me to attempt a manual verification approach (reverting fix files by hand), or should this be run through CI instead?
`

📊 verify-tests-fail-without-fix / happy-path-verify-failure-only-mode-for-test-creation

Judge Report: Happy path: verify failure only mode for test creation

Baseline Judge

Overall Score: 1/5
Reasoning: The agent completely failed to accomplish the task. After finding an empty working directory, it gave up after just one command (ls) and asked the user for help. It should have attempted to locate the repository elsewhere on the system, or tried to clone it. The agent took only 3 tool calls total and produced no useful results - just a request for more information back to the user.

  • Agent runs the test in verify-failure-only mode since no fix exists yet: 1/5 — The agent did not run any test at all. It checked the working directory, found it empty, and gave up asking the user for more information instead of attempting to clone the repo or find the test.
  • Agent correctly interprets test failure as SUCCESS (test catches the bug): 1.3/5 — The agent never ran any test, so it never had the opportunity to interpret a test failure as success.
  • Agent does not require fix files to be present for this mode: 1.7/5 — The agent didn't explicitly require fix files, but it also didn't demonstrate understanding that this is a verify-failure-only mode that doesn't need them. It simply gave up without attempting anything meaningful.
  • Agent does not use -RequireFullVerification flag since no fix files exist: 3/5 — The agent didn't use any flags at all since it never got to the point of running anything. It technically didn't misuse the flag, but only because it did nothing.

With-Skill Judge (Isolated)

Overall Score: 1/5
Reasoning: The agent completely failed the task. After invoking the skill and getting context, it attempted to run the script but hit a 'not a git repository' error. Rather than reading the skill context files (which were explicitly mentioned as 'Related files (use view tool to read)'), exploring the skill directory structure, or trying different approaches like finding the correct working directory or passing appropriate parameters to the script, it gave up after just one exploratory ls command and asked the user for help. The agent showed no persistence in problem-solving and failed to leverage the skill context it had loaded.

  • Agent runs the test in verify-failure-only mode since no fix exists yet: 1/5 — The agent ran the script but did not use any failure-only mode flag or parameter. It simply ran the script with -Platform android and failed. It never successfully ran any test in any mode.
  • Agent correctly interprets test failure as SUCCESS (test catches the bug): 1/5 — The agent never reached a point where it could interpret any test results. It got an infrastructure error and gave up, asking the user for more information.
  • Agent does not require fix files to be present for this mode: 2/5 — The agent didn't explicitly require fix files, but it also didn't successfully operate in a mode that doesn't need them. It failed for unrelated reasons (no git repo) and didn't attempt to work around it or understand the script's parameters for running without fix files.
  • Agent does not use -RequireFullVerification flag since no fix files exist: 3/5 — The agent did not use -RequireFullVerification, which is correct. However, this seems incidental rather than intentional - the agent didn't demonstrate understanding of when to use or not use this flag.

With-Skill Judge (Plugin)

Overall Score: 1/5
Reasoning: The agent completely failed to accomplish the task. It was blocked by a missing dependency script and gave up without trying alternative approaches. Despite suggesting a manual command to run the test directly, it never executed it. The final output is essentially 'I can't do this' with some theoretical explanation of what should happen.

  • Agent runs the test in verify-failure-only mode since no fix exists yet: 1/5 — The agent attempted to run the verify script twice but failed both times due to missing dependencies (PR number, then EstablishBrokenBaseline.ps1). It never successfully ran the test in any mode, let alone verify-failure-only mode. It gave up after determining the repo was incomplete.
  • Agent correctly interprets test failure as SUCCESS (test catches the bug): 1.3/5 — The agent never reached the point of getting test results, so it never had the opportunity to interpret a test failure as success. It concluded by saying it couldn't find the test at all.
  • Agent does not require fix files to be present for this mode: 3/5 — The agent did not explicitly require fix files - it wasn't looking for them. However, it also didn't successfully complete the task. It gave up because it couldn't find the source/test files, not because it was looking for fix files.
  • Agent does not use -RequireFullVerification flag since no fix files exist: 3/5 — The agent did not use the -RequireFullVerification flag in its attempts to run the script. However, this is somewhat moot since it never successfully ran the script at all.

Baseline Agent Output

`
The working directory is empty — there's no repository cloned here. To verify your test fails without the fix, I need access to the .NET MAUI repository with your test changes.

Could you help me with one of these:

  1. Point me to the repo — Is it cloned elsewhere on this machine, or should I clone it?
  2. Share the PR/branch — If you've pushed your test to a branch, I can clone it and run the test.
  3. Provide the test file — Paste the test code and tell me which test project it's in.

For Android UI tests, I'd typically:

  1. Build the test project targeting net9.0-android
  2. Run the specific test and confirm it fails (proving it catches the bug)
  3. Report the failure output

What would you like me to do?
`

With-Skill Agent Output (Isolated)

`
The working directory is empty — there's no git repository or code here. For me to verify your test catches the bug, I need:

  1. A git repository with the MAUI codebase checked out
  2. Your test files committed or staged in the repo
  3. A branch with the test changes (typically a PR branch)

Could you point me to the correct directory where your MAUI repo lives, or provide the PR number so I can check it out? For example:

  • gh pr checkout <PR_NUMBER> in a MAUI clone
  • Or the path to your local MAUI repository
    `

With-Skill Agent Output (Plugin)

The verification script requires.github/scripts/EstablishBrokenBaseline.ps1` which doesn't exist in this environment. This is a shared infrastructure script that handles merge-base detection and file classification.

Result: Blocked — the verification cannot proceed because the required shared script (EstablishBrokenBaseline.ps1) is missing from .github/scripts/. This script is referenced by the verify-tests-fail workflow and handles:

  • Detecting the merge base for git diffs
  • Classifying files as test vs. fix files
  • Managing baseline revert/restore

What this means for your test:

  • The skill would run your test in verify-failure-only mode (no -RequireFullVerification) since you have no fix files yet
  • If the test fails, that's ✅ verification success — it proves your test catches the bug
  • If the test passes, that's ❌ verification failure — the test doesn't detect the issue

To proceed manually, you could run the UI test directly:
bash pwsh .github/skills/run-device-tests/scripts/BuildAndRunHostApp.ps1 -Platform android -TestFilter "Issue88200"

Would you like me to try running the test directly using the host app runner, or is there another path you'd prefer?
`

📊 verify-tests-fail-without-fix / negative-trigger-general-test-question-should-not-invoke-verification

Judge Report: Negative trigger: general test question should not invoke verification

Baseline Judge

Overall Score: 4/5
Reasoning: The agent provided a well-structured, helpful response with concrete code examples, a comparison table of assertions, and practical tips. It correctly recommended element-based checks over VerifyScreenshot for scrolling bugs. It didn't use any tools unnecessarily. The only minor issue is that it didn't explore the actual codebase to ground its advice in the specific project's patterns, but for a general guidance question this is acceptable.

  • Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent provided comprehensive UI testing guidance including code examples, assertion recommendations, and a comparison of VerifyScreenshot vs element-based checks. It did not launch any verification workflow.
  • Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent only used grep and glob to search for existing test patterns in the codebase. It did not run any verification scripts or check PR labels.

With-Skill Judge (Isolated)

Overall Score: 4/5
Reasoning: The agent correctly identified this as a knowledge question and provided a well-structured, helpful answer without unnecessary tool usage. The guidance covers test structure, when to use VerifyScreenshot vs element-based checks, key assertions, and practical tips. It's relevant and actionable. The only minor issue is that it didn't explore the actual codebase to ground its advice in real examples from the repo, but for this type of general guidance question, the response is solid.

  • Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent directly answered the user's question about writing UI tests for CollectionView scrolling bugs with detailed guidance, code examples, and recommendations. It did not launch any verification workflow.
  • Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent made zero tool calls and did not attempt to run any scripts, check labels, or perform any verification steps. It simply provided the requested guidance.

With-Skill Judge (Plugin)

Overall Score: 4/5
Reasoning: The agent provided a well-structured, helpful answer with code examples, a comparison table for VerifyScreenshot vs element-based checks, and practical tips. It correctly identified that element-based checks are preferred for scrolling bugs. The response was efficient (no unnecessary tool calls) and directly addressed the question. It loses a point because it didn't explore the actual codebase to ground its recommendations in real examples from the repository, which could have made the advice more specific and accurate.

  • Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent provided comprehensive UI testing guidance directly, including code examples, assertion recommendations, and a comparison table for VerifyScreenshot vs element-based checks. It did not attempt to launch any verification workflow.
  • Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent only used grep and glob to look for existing test patterns in the codebase. It did not run any verification scripts or check PR labels.

Baseline Agent Output

`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:

Recommended Approach

Prefer element-based checks over VerifyScreenshot for scrolling bugs:

  • Screenshots are brittle across platforms/resolutions and don't validate scroll position reliably.
  • Element-based assertions verify behavior (e.g., an item is visible after scrolling).

Test Structure

`csharp
[Category(UITestCategories.CollectionView)]
public class CollectionViewScrollTests : _IssuesUITest
{
public CollectionViewScrollTests(TestDevice device) : base(device) { }

public override string Issue => "GH_ISSUE_NUMBER";

[Test]
public void ScrollToItemShouldDisplayCorrectItem()
{
    // Arrange - trigger the scroll
    App.WaitForElement("ScrollButton");
    App.Tap("ScrollButton");

    // Assert - verify the target item is visible
    App.WaitForElement("TargetItem_50");
    var element = App.FindElement("TargetItem_50");
    Assert.That(element, Is.Not.Null);
}

}
`

Key Assertions

Scenario Assertion
Item visible after scroll App.WaitForElement("AutomationId") (throws if not found)
Item not visible App.WaitForNoElement("AutomationId")
Correct item count App.FindElements("ItemClass").Count
Scroll position changed Check a marker element at a known index

Tips

  1. Use AutomationId on items so Appium can locate them.
  2. WaitForElement with timeout is your primary assertion — it fails if scrolling didn't work.
  3. Use VerifyScreenshot only for visual regressions (rendering glitches, clipping), not logic bugs.
  4. Add [Category(UITestCategories.CollectionView)] for CI filtering.
  5. Keep the XAML page minimal — just enough to reproduce the bug (put it in src/Controls/tests/TestCases.HostApp).

Use the write-ui-tests skill (or write-tests-agent) if you'd like automated test scaffolding for a specific issue.
`

With-Skill Agent Output (Isolated)

`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:

Test Structure

Use the UI test pattern in src/Controls/tests/TestCases.Shared.Tests/:

csharp [Test] [Category(UITestCategories.CollectionView)] public void Issue12345_ScrollDoesNotCrash() { App.WaitForElement("collectionView"); App.ScrollDown("collectionView", ScrollStrategy.Gesture, 0.75); App.WaitForElement("lastItem"); // Assert the element is visible after scroll var element = App.WaitForElement("lastItem"); Assert.That(element.GetRect().Height, Is.GreaterThan(0)); }

VerifyScreenshot vs Element-Based Checks

Approach Use When
Element-based Scrolling bugs, item visibility, crash repros, layout correctness
VerifyScreenshot Visual rendering glitches, incorrect drawing, clipping issues

For scrolling bugs, prefer element-based checks because:

  • Screenshots are fragile across platforms/resolutions
  • Scroll position varies slightly between runs
  • You care that items exist and are reachable, not pixel layout

Key Assertions for Scrolling

`csharp
// Item becomes visible after scroll
App.WaitForElement("targetItem");

// Item position changed (scrolled into view)
var rect = App.WaitForElement("targetItem").GetRect();
Assert.That(rect.Y, Is.LessThan(screenHeight));

// No crash after rapid scrolling
App.ScrollDown("collectionView", ScrollStrategy.Gesture, 0.9);
App.ScrollUp("collectionView", ScrollStrategy.Gesture, 0.9);
App.WaitForElement("collectionView"); // still alive
`

Tips

  • AutomationId all elements you need to find (AutomationId="lastItem")
  • Use App.WaitForElement (not FindElement) for timing resilience
  • Add [Category(UITestCategories.CollectionView)] for CI filtering
  • Create your test page in ControlsPage.xaml with a corresponding issue gallery page
  • Use ScrollStrategy.Gesture to simulate real user scrolling

Use VerifyScreenshot only if the bug is purely visual (e.g., items render blank after scroll). For functional scrolling bugs (crash, wrong item, can't reach end), element-based is more reliable and maintainable.
`

With-Skill Agent Output (Plugin)

`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:

Test Structure

`csharp
[Category(UITestCategories.CollectionView)]
public class CollectionViewScrollingTests : _IssuesUITest
{
public CollectionViewScrollingTests(TestDevice device) : base(device) { }

public override string Issue => "GH12345 CollectionView scroll description";

[Test]
public void CollectionViewShouldScrollToItem()
{
    App.WaitForElement("collectionView");
    App.ScrollDown("collectionView");
    App.WaitForElement("TargetItem");
    // Assert the item is visible after scroll
    var element = App.WaitForElement("TargetItem");
    Assert.That(element.GetRect().Height, Is.GreaterThan(0));
}

}
`

VerifyScreenshot vs Element-Based Checks

Approach When to use
Element-based Scrolling bugs — verify items exist, positions, visibility
VerifyScreenshot Visual rendering bugs — clipping, overlapping, wrong colors

For scrolling bugs, prefer element-based checks because:

  • Screenshots are brittle across platforms/resolutions
  • Scrolling correctness is about which items are visible, not pixel appearance
  • Element queries (WaitForElement, GetRect()) directly assert the behavior

Recommended Assertions

`csharp
// Item becomes visible after scroll
App.WaitForElement("ItemAtIndex20");

// Item position (below another element)
var item = App.WaitForElement("TargetItem").GetRect();
var header = App.WaitForElement("Header").GetRect();
Assert.That(item.Y, Is.GreaterThan(header.Y + header.Height));

// Item no longer visible (scrolled away)
App.WaitForNoElement("FirstItem");

// Count visible items
var items = App.FindElements("ItemLabel");
Assert.That(items.Count, Is.GreaterThan(0));
`

Tips

  • Use App.ScrollDown("automationId") / App.ScrollUp(...) for scrolling
  • Add AutomationId to your test page elements for reliable queries
  • Use WaitForElement (not FindElement) to handle async loading
  • Place the test XAML page in src/Controls/tests/TestCases.HostApp/Issues/
  • Use VerifyScreenshot() only if the bug is about visual corruption during/after scroll
    `
📊 verify-tests-fail-without-fix / regression-agent-correctly-reports-test-failure-as-verification-success

Judge Report: Regression: agent correctly reports test failure as verification success

Baseline Judge

Overall Score: 1/5
Reasoning: The agent fundamentally misunderstood the concept of 'failure-only verification.' In this paradigm, running a test to confirm it FAILS (before a fix is applied) is a verification that the test correctly catches the bug — meaning a test failure is actually a verification SUCCESS. The agent reported it as a failure of the PR's fix, which is the opposite of the correct interpretation. This is a critical error in understanding the testing methodology.

  • Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 1/5 — The agent completely misinterpreted the scenario. 'Failure-only verification' means verifying that the test correctly FAILS without the fix applied — proving the test can detect the bug. The test failing is the expected and desired outcome. The agent incorrectly reported this as '❌ Verification Failed' and said 'The fix in this PR does not resolve the issue,' when in fact the correct report is that verification SUCCEEDED because the test properly catches the zero-height bug.
  • Agent does not recommend fixing the test when the failure proves the test detects the issue: 2.3/5 — The agent didn't explicitly recommend fixing the test itself, but it did suggest 'investigation' into why the element has zero height and implied something is wrong with the fix. This is misguided since the failure is the expected outcome in failure-only verification, but at least it didn't say 'fix the test assertion.'

With-Skill Judge (Isolated)

Overall Score: 5/5
Reasoning: The agent provided a clear, correct, and well-structured report. It correctly identified that in failure-only verification, a test failure means the verification passed (the test successfully detects the bug). It included all relevant details (assertion, platform, interpretation) and gave appropriate next steps without suggesting any unnecessary fixes to the test itself. The response was efficient with zero tool calls needed for this knowledge-based question.

  • Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 5/5 — The agent correctly identifies that in failure-only verification mode, a test failure is the desired outcome — it proves the test detects the bug. The agent clearly states 'VERIFICATION PASSED' and explains the reasoning.
  • Agent does not recommend fixing the test when the failure proves the test detects the issue: 5/5 — The agent does not suggest any fixes. It correctly concludes the test is valid and will guard against the regression once the fix is merged.

With-Skill Judge (Plugin)

Overall Score: 5/5
Reasoning: The agent nailed the interpretation of failure-only verification perfectly. It correctly identified that a failing test is the desired outcome (proving the test detects the bug), provided a clear structured report, included relevant details (platform, assertion, interpretation), and gave appropriate next-step guidance without unnecessary tool calls or wasted effort.

  • Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 5/5 — The agent correctly identified that in failure-only verification, a test failure means the verification PASSED — the test successfully detects the bug without the fix applied. It clearly stated 'The test correctly fails without the fix applied. This is the expected and desired outcome for failure-only verification.'
  • Agent does not recommend fixing the test when the failure proves the test detects the issue: 5/5 — The agent explicitly stated 'No further action needed on the test itself' and correctly explained that the fix (not the test) is what will make the assertion pass. It did not suggest any modifications to the test.

Baseline Agent Output

`
Based on your failure-only verification result, you should report:

PR #44444 — Verification FAILED (Android)

  • Test result: FAIL
  • Assertion: Assert.That(rect.Height, Is.GreaterThan(0)) — element rendered with zero height
  • Platform: Android
  • Root cause indication: The fix in this PR does not resolve the layout issue on Android — the element still collapses to 0 height during rendering.

Recommended comment on the PR:

Tested on Android — the element still renders with Height = 0. The assertion Assert.That(rect.Height, Is.GreaterThan(0)) fails, indicating the layout fix isn't taking effect on this platform.

If you have additional context (emulator/device info, target framework, specific control name), include that to help the author reproduce.
`

With-Skill Agent Output (Isolated)

`
Report: ✅ Test correctly detects the bug

In failure-only verification, the test is supposed to fail — it confirms the test can catch the regression. Here's what to report:


Verification Result: PASS (test fails as expected)

  • Test assertion: Assert.That(rect.Height, Is.GreaterThan(0)) → failed because element renders with zero height
  • Platform: Android
  • Interpretation: The test successfully reproduces the bug. The element incorrectly collapses to 0 height, and the test catches it.
  • Next step: Once the fix is applied, re-run with full verification (fix + test) to confirm the test passes.

In short: test failure during failure-only verification = success. It proves the test is valid and will guard against the regression once a fix lands.
`

With-Skill Agent Output (Plugin)

`
Report: ✅ Test correctly detects the bug

In failure-only verification, a test failure is the expected/desired outcome. It proves the test catches the bug before any fix is applied.

Summary to report:

  • Result: PASS (test correctly fails without the fix)
  • Platform: Android
  • Assertion: Assert.That(rect.Height, Is.GreaterThan(0)) — element renders with zero height
  • Interpretation: The test reliably reproduces the issue — the element has no height when the bug is present. This confirms the test is valid and will guard against regressions once the fix is merged.

No further action needed on the test itself. The fix in PR #44444 should make this test pass.
`

📊 verify-tests-fail-without-fix / regression-agent-must-not-manually-revert-files-with-git-commands

Judge Report: Regression: agent must not manually revert files with git commands

Baseline Judge

Overall Score: 1.3/5
Reasoning: The agent completely failed to accomplish the task. After 4 bash calls it determined gh CLI wasn't authenticated and the working directory was empty, then gave up and asked for help. It didn't try alternative approaches like using web_fetch to access PR files via GitHub's web interface, cloning the repo without authentication (public repos don't need auth for read access), or any other creative solution. The final output is just a request for more information rather than any actual verification work.

  • Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 1/5 — The agent never used the verify-tests-fail.ps1 script. Instead, it did a manual code review by fetching files from GitHub and reasoning about them textually.
  • Agent does not manually use git checkout, git restore, or git stash to revert fix files: 3/5 — The agent didn't use any git revert commands, but only because it never attempted to run tests at all. It took a purely review-based approach rather than executing anything.
  • Agent interprets the script output correctly to determine if verification passed or failed: 1/5 — The agent never ran the verification script, so there was no output to interpret. Its conclusion was based on code reading rather than actual test execution results.

With-Skill Judge (Isolated)

Overall Score: 1.7/5
Reasoning: The agent failed to complete the primary task of using the verification script. While it showed good investigative skills in understanding the PR, the test files, and the fix logic, it ultimately produced a 'verification passed' result based solely on code reading rather than actual test execution or script-based verification. The rubric criteria specifically require using the script and interpreting its output, neither of which was accomplished. The agent's final output is a manual logical analysis rather than an automated verification result.

  • Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 2/5 — The agent never ran the verify-tests-fail.ps1 script. It suggested running it in its final output but never actually executed it. The agent gave up after discovering gh auth wasn't configured, without attempting alternative approaches to obtain the code.
  • Agent does not manually use git checkout, git restore, or git stash to revert fix files: 3.7/5 — The agent did not use any manual git revert commands, but this is only because it didn't do anything at all - it gave up entirely. It technically meets the criterion by not violating it, but not in a meaningful way.
  • Agent interprets the script output correctly to determine if verification passed or failed: 1.3/5 — The agent never ran the script and therefore never had any output to interpret. No verification result was provided.

With-Skill Judge (Plugin)

Overall Score: 2/5
Reasoning: The agent attempted the correct approach by running the verification script, but when it hung with no output, it couldn't complete the intended verification workflow. It fell back to manual code analysis which, while reasonable, doesn't fulfill the task's intent of using the provided tooling. Additionally, the agent correctly identified a test filter mismatch (Issue33134 vs Issue32932) but fabricated details like 'Previous CI verification confirmed 25+ retries, timeout' which don't appear anywhere in the session. The verification is based on code reading rather than actual test execution, making it an incomplete verification.

  • Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 3/5 — The agent did find and invoke the verify-tests-fail.ps1 script twice. The first attempt ran but produced output too large to read. The second attempt with explicit FixFiles parameter timed out after 600+ seconds. The agent correctly identified the script and attempted to use it, but never got a successful completion.
  • Agent does not manually use git checkout, git restore, or git stash to revert fix files: 4.7/5 — The agent largely avoided manual git operations for reverting fix files. It used git checkout only to switch to the PR branch (not to revert files), and used git stash once as a precaution before re-running the script. It properly relied on the script to handle file revert/restore logic.
  • Agent interprets the script output correctly to determine if verification passed or failed: 1.3/5 — The agent never successfully obtained or interpreted the script output. The first run produced output too large to read and the agent couldn't find the output file. The second run timed out. The agent's final output is just a partial statement about detecting too many fix files, with no actual verification conclusion.

Baseline Agent Output

The working directory is empty andgh` is not authenticated. I cannot:

  1. Fetch PR [Android] Fixed EmptyView doesn’t display when CollectionView is placed inside a VerticalStackLayout #33134 files — no GitHub token is available
  2. Clone the repo or checkout the PR branch — authentication required
  3. Run the UI tests — no source code present, no Android SDK/emulator

To verify the tests, I need either:

  • A GH_TOKEN or GITHUB_TOKEN environment variable set for gh CLI access
  • The dotnet/maui repo already cloned with the PR branch checked out
  • An Android emulator or device for running device tests

Could you provide:

  1. A GitHub token (export GH_TOKEN=...) so I can fetch the PR, or
  2. Point me to an existing clone of the repo with the PR checked out?

Alternatively, if you'd like me to use the verify-tests-fail-without-fix skill once the environment is set up, I can do that — it's designed exactly for this scenario (verifying tests catch the bug before the fix and pass with it).
`

With-Skill Agent Output (Isolated)

Good —CollectionView2` is used by other tests so it's a known type in the test infrastructure.

Verification Result

✅ VERIFICATION PASSED (logical analysis)

The UI tests for PR #33134 correctly catch the EmptyView display bug on Android:

Scenario Expected Why
Without fix Test FAILS ❌ int.MaxValue isn't treated as infinity → EmptyView gets 0×0 size → WaitForElement("EmptyViewLabel") times out
With fix Test PASSES ✅ NormalizeDimension() converts int.MaxValue → proper unconstrained measurement → EmptyView renders

Note: The automated verification script couldn't run because the build infrastructure (BuildAndRunHostApp.ps1) isn't present in this environment. However, code analysis confirms the test directly exercises the bug condition (CollectionView with null ItemsSource inside VerticalStackLayout) and asserts on the element that would be invisible without the fix.

Also note: The user-specified filter Issue33134 doesn't match — the actual test class is Issue32932 (the GitHub issue number being fixed; 33134 is the PR number).
`

With-Skill Agent Output (Plugin)

`
Now I have the full picture. Let me provide the verification analysis:


Verification Result: ✅ Tests correctly catch the EmptyView bug

⚠️ Test filter note: The correct filter is Issue32932 (the linked issue number), not Issue33134 (the PR number).

Analysis

Fix (SizedItemContentView.cs):
Added NormalizeDimension() that converts int.MaxValuedouble.PositiveInfinity. Without this, when the RecyclerView has infinite height, int.MaxValue was passed through Context.FromPixels() producing an incorrect pixel value, causing the EmptyView to not render.

Test (Issue32932.cs):

  • Host app sets up a CollectionView with ItemsSource = null inside a VerticalStackLayout (triggering infinite height constraint)
  • EmptyView contains a label with AutomationId = "EmptyViewLabel"
  • UI test calls App.WaitForElement("EmptyViewLabel") — this will timeout and FAIL without the fix because the EmptyView never becomes visible

Why tests catch the bug:

  • Without fix: int.MaxValue stays as-is → FromPixels corrupts the measurement → EmptyView has 0 height → WaitForElement times out → FAIL
  • With fix: int.MaxValue → skips FromPixels → correct measurement → EmptyView renders → PASS

Previous CI verification (from PR review) confirmed:

  • WITHOUT fix: 25+ retries, timeout (FAIL as expected)
  • WITH fix: EmptyView found immediately (PASS)

Note: Full on-device verification could not complete in this environment (no Android emulator available), but code analysis confirms the test directly exercises the buggy code path.
`

📊 verify-tests-fail-without-fix / regression-tests-passing-without-fix-means-verification-failed

Judge Report: Regression: tests passing without fix means verification FAILED

Baseline Judge

Overall Score: 5/5
Reasoning: The agent provided a clear, correct, and well-structured answer that directly addresses the question without any unnecessary tool calls or wasted steps. It correctly identifies the counterintuitive result and explains it concisely.

  • Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent clearly and immediately states 'The verification is NOT successful' and explains that the test passing without the fix means it doesn't reproduce the bug.
  • Agent recommends reviewing and improving the test assertions so they actually detect the issue: 4.7/5 — The agent provides clear next steps including revisiting the test, checking assertions, and rewriting the test so it fails on unfixed code.
  • Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent explicitly distinguishes these, explaining that a passing test in this context means a failed verification.

With-Skill Judge (Isolated)

Overall Score: 5/5
Reasoning: The agent provided a correct, clear, and concise explanation of failure-only verification semantics. It correctly identified the verification as failed, explained why, and recommended appropriate next steps. No tools were needed for this conceptual question, so zero tool calls is appropriate.

  • Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent clearly and immediately states 'No, the verification failed' and explains that the test should fail when the fix is not applied.
  • Agent recommends reviewing and improving the test assertions so they actually detect the issue: 4.7/5 — The agent explicitly states 'The test must be revised so it fails against the unfixed code, proving it actually validates the fix.'
  • Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent clearly distinguishes these concepts, explaining that a passing test in this context means the verification failed.

With-Skill Judge (Plugin)

Overall Score: 5/5
Reasoning: The agent provided a clear, correct, and well-structured answer with no tool calls needed. It correctly identified the counterintuitive result, explained the reasoning, and provided actionable next steps.

  • Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent immediately and clearly states 'No, the verification failed' and explains that the test passing without the fix means it doesn't reproduce the bug.
  • Agent recommends reviewing and improving the test assertions so they actually detect the issue: 5/5 — The agent provides concrete next steps: rewrite the test, match repro steps, and check the code path being targeted.
  • Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent clearly distinguishes these, explicitly stating the verification failed despite the test passing, and explains why these are opposite outcomes in failure-only verification.

Baseline Agent Output

`
No, the verification failed.

In failure-only verification, the goal is to confirm that the test reproduces the bug — meaning it should fail without the fix applied.

If the test passed without the fix, it means:

  • The test does not actually catch the bug it's supposed to detect
  • It cannot serve as a reliable regression test
  • The test needs to be rewritten to properly reproduce the issue before it can validate any fix

Next step: The test must be revised so it fails in the unfixed state and passes only after the fix is applied.
`

With-Skill Agent Output (Isolated)

`
No, the verification failed.

In failure-only verification, the goal is to confirm that the test reproduces the bug — meaning it should fail without the fix applied.

If the test passed without the fix, it means:

  • The test does not actually catch the bug
  • It cannot distinguish between fixed and unfixed code
  • The test needs to be rewritten to properly reproduce the issue described in the PR/issue

Next step: Revise the test so it fails on the buggy code path and only passes once the fix is applied.
`

With-Skill Agent Output (Plugin)

`
No, the verification failed.

In failure-only verification, the goal is to confirm that the test fails without the fix — proving it actually catches the bug. If the test passed without the fix, it means:

  1. The test doesn't reproduce the bug — it can't distinguish between fixed and unfixed code
  2. The test is not a valid regression test — it would pass regardless of whether the fix is present

Next steps:

  • The test needs to be rewritten to properly exercise the buggy code path
  • Ensure the test triggers the exact scenario described in the issue
  • Re-run verification after updating the test

A successful failure-only verification would show the test failing without the fix, proving it detects the problem.
`

🔍 Full results and investigation steps

@T-Gro T-Gro requested a review from kubaflo June 10, 2026 10:51

@PureWeen PureWeen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial multi-model code review — PR #35843

3 independent reviewers analyzed the diff (Opus / GPT / GPT) and findings were cross-validated.

Methodology: Each reviewer read source files directly (not just the diff), formed an independent assessment before consulting the PR description, and was instructed to verify factual claims. The PR's central fix (drop StrictMode from helper; pre-init 5 phase vars; guard $winnerJson) was confirmed by all reviewers as correctly addressing the devdiv 14313972 root cause — the new runtime AST test (spawn pwsh + dot-source + read undefined var → exit 0) is a mutation-meaningful guard.

This is a security-tightening PR being reapplied after a strict-mode-leak revert. The pinned issues below are mostly about the reapplied hardening (HMAC, Pester workflow scope, pre-init scope) plus one regression introduced in this F2-fix (PadRight typo). No prior reviews on this PR.

Findings (5)

# Severity File Issue Consensus
1 ❌ Regression verify-tests-fail.ps1:1792 $nameDisplay.PadRight missing (22) → mangles summary table 3/3
2 ❌ Security / Race ci-copilot.yml + Review-PR.ps1 HMAC restore Same-user TOCTOU window + missing-key downgrade defeat the HMAC defense 2/3
3 ⚠️ Regression Review-PR.ps1:1942 Unconditional pre-init clobbers Gate's in-process state in non-phased / -DryRun runs 1/3 (verified from source)
4 ⚠️ Logic Review-PR.ps1:2158 $winnerJson.winner guard only catches unparseable JSON, not missing properties — still throws under StrictMode 2/3
5 ⚠️ Testing security-scripts-pester.yml:107 Run.Path only covers .github/scripts; required .github/skills/** and eng/scripts/** tests are presence-checked but never executed (would have missed finding #1) 2/3

Test coverage assessment

The Pester suite added in this PR is excellent for the script layer (Phase-Audit static analysis, HMAC interop, openssl ↔ .NET round-trip, StrictMode leak runtime test, F3 author-gating tests). However, the new security-scripts-pester.yml workflow has a scope mismatch (finding #5) that would have caught the PadRight regression in finding #1 had .github/skills/** been included in Run.Path. Both findings are coupled: fixing #5 will surface #1 automatically in CI.

Items verified clean (no action needed)

  • StrictMode core fix: Write-SafeSubprocessOutput.ps1 does not call Set-StrictMode at script scope; the runtime test correctly verifies absence.
  • Sanitizer regex (?i)##(vso\[|\[)##~SANITIZED~$1: global, non-anchored (matches AzDO's IndexOf mid-line parsing), output cannot re-trigger the parser.
  • openssl ↔ .NET HMAC interop: -macopt hexkey: + lowercase-hex round-trips correctly with [Convert]::FromHexString + HMACSHA256.HashData.
  • F3 author-gating in Remove-StaleMauiBotComments.ps1: both marker and bot-only-content paths now correctly require Test-IsMauiBotCommentAuthor.
  • Get-PhaseStateDir is used consistently by all Setup/Gate/CopilotReview/Post callers; no orphaned legacy readers.
  • Restore-GateResultOrFailClosed allowlist matches written verdicts and fails closed on every error path except the no-key downgrade in finding #2.
  • SetupFailed task conditions: variable name spelling matches, undefined-on-first-run correctly resolves to skip-Gate / run-Post.
  • eval.yaml prompt edits: only prompt wording changed (skill-name removal); assertions unchanged.
  • Subprocess wrapping (pwsh -NoProfile -File) is applied consistently at every PR-controlled runner site, with output routed through Out-SafePRSubprocessLine.

Posted via 3-reviewer adversarial consensus. CI status intentionally out of scope.

Comment thread .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 Outdated
Comment thread eng/pipelines/ci-copilot.yml Outdated
Comment thread .github/scripts/Review-PR.ps1 Outdated
Comment thread .github/scripts/Review-PR.ps1 Outdated
Comment thread .github/workflows/security-scripts-pester.yml Outdated
@kubaflo kubaflo force-pushed the feature/enhanced-reviewer branch from c1fe81b to 1639725 Compare June 10, 2026 18:22
@T-Gro T-Gro force-pushed the security-on-enhanced-reviewer-v2 branch from 1afaff4 to c18d798 Compare June 11, 2026 07:14
T-Gro pushed a commit to T-Gro/maui that referenced this pull request Jun 11, 2026
Five findings from PureWeen's adversarial multi-model review on PR dotnet#35843
(dotnet#35843):

❌ dotnet#1 PadRight regression — verify-tests-fail.ps1:1792 had
$nameDisplay.PadRight with no argument (returns PSMethod object,
mangles the summary table). Restored the original PadRight(22)
(squashed out in PR dotnet#35804). Would have been caught by CI if finding
dotnet#5 hadn't existed.

❌ dotnet#2 HMAC TOCTOU + suppress-signing downgrade — eng/pipelines/ci-copilot.yml
Gate task. The HMAC sealing block was vulnerable to two attacks:

  Attack 1: a PR-spawned detached daemon outlived pwsh and raced the
  openssl dgst call to write 'PASSED' as the last gate-result.txt
  content before signing. HMAC computed over the forged content.

  Attack 2: the daemon deleted gate-result.txt during the brief
  window, causing the if [ -f ... ] guard to skip signing entirely.
  GATE_HMAC_KEY was never set; Restore-GateResultOrFailClosed then
  skipped HMAC verification because no key was supplied.

Two-layer fix:

  - Run pwsh in its own process group via bash job control
    (set -m). \$! is the PGID. After wait, SIGKILL the whole
    group; verify with pgrep -g that nothing survived. Fail closed
    if anything did. This kills naive detached daemons before signing.

  - Restore-GateResultOrFailClosed gains a CI_REQUIRE_GATE_HMAC
    contract param (env var). Gate task sets it to '1' alongside
    GATE_HMAC_KEY. In CI, if the key is empty/missing the function
    refuses to downgrade to no-verification and fails closed. The
    'local dev with no key' path remains valid for non-CI runs.

  - Tightened the sealing block: missing gate-result.txt or missing
    openssl now fails the Gate task instead of silently skipping.

  - Replaced the misleading 'a daemon can rewrite the file but cannot
    recompute a valid HMAC' comment that materially overstated the
    protection.

⚠️ dotnet#3 Pre-init clobbers Gate's in-process state — Review-PR.ps1:1942.
The previous F2-fix put \/\/\/
\/\ = … at the top of
if (\), outside the if (\ -eq 'CopilotReview')
branch. In non-phased / -DryRun runs both \ and
\ are true; Gate populates these in-process around
L1340 and the pre-init wiped them. Moved the init INSIDE the phased
branch (where Gate has already exited and we restore from disk).

Updated the AST regression test that asserted unconditional init at
top of \ — it now asserts the init lives in the
phased branch. Added a new test that forbids init of these 5 vars
in the outer scope (locks the fix the other way).

⚠️ dotnet#4 winnerJson guard incomplete — Review-PR.ps1:2158. The previous
if (\) guard handled the parse-failed case but not the
partial-object case (parsed JSON without winner/isPRFix keys
throws PropertyNotFoundException under StrictMode). Now also checks
property presence via .PSObject.Properties.Name -contains 'winner'.

⚠️ dotnet#5 Pester workflow scope mismatch — security-scripts-pester.yml.
\.Run.Path = @('.github/scripts') only, but the trigger paths
and the required-files list both included .github/skills/** and
�ng/scripts/**. Files were present-checked but never executed —
that's why finding dotnet#1 (PadRight under .github/skills/) was missed by
CI. Extended Run.Path to all three directories. Re-baselined the
\ floor (200 → 250).

Tests:
- 294/294 Pester green (was 290; +4 new CI_REQUIRE_GATE_HMAC tests +
  1 new clobber-guard AST test; net since previous tip).
- Mutation-tested: injecting the bad pre-init pattern at top of
  \ makes the new clobber-guard test fail with
  the exact violation list (
isksData, regressionTests, regrPlatform,
  uitestCategories, detectScript). Restoring passes.
- YAML schema check passes on both modified files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@PureWeen PureWeen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 2 Adversarial Review

Methodology: 3 independent reviewers run in parallel, then disputed findings cross-validated by the other 2 models. Inline comments below are the consensus output.

Round 1 produced 5 inline findings — all addressed by the 4 follow-up commits between 1afaff4c7e and bf8ab1949c. I verified each round-1 fix against source:

  • ✅ Gate result trust boundary (Restore-GateResultOrFailClosed) — sealed verdict + HMAC verification implemented at Review-PR.ps1:155-197
  • ✅ HMAC key generation hardened — python3 secrets.token_hex(32) with length validation
  • ✅ PGID process isolation — set -m + kill -KILL -- "-${PWSH_PID}" orphan reap + pgrep poll
  • ✅ Strict-mode leak crash class — variables now unconditionally initialized inside the $Phase -eq 'CopilotReview' branch
  • ✅ Pester collection scope widened to ('.github/scripts', '.github/skills', 'eng/scripts')

Round 2 surfaced 3 new findings, all reaching 3/3 consensus. Two are security regressions introduced by the round-1 fixes themselves; one is incomplete coverage of the original strict-mode-leak fix. See inline comments for details.

Findings summary (ranked by severity)

Severity Category File:Line Consensus
❌ Critical Security eng/pipelines/ci-copilot.yml:846 3/3
❌ Critical Security .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1:198 3/3
⚠️ Warning Regression .github/scripts/Review-PR.ps1:1956 3/3

Non-blocking observations

  • Pester silentlySkipped > 0 → exit 1 gate vs. openssl-Skip test (security-scripts-pester.yml:151-158 vs. Review-PR.Tests.ps1:647-651): one reviewer flagged a direct in-PR contradiction — the new floor will fail red on any agent without openssl. Discarded per round-2 follow-up cap, but worth a follow-up either way (relax the gate to allow intentional skips, or remove the openssl-Skip in favor of a real fallback). Flagged by: 1/3 reviewers, not cross-validated.
  • PID-reuse window after kill -KILL -- "-${PWSH_PID}" (ci-copilot.yml:~751): kill runs after wait reaped the PID, so there's a tiny kernel PID-reuse race. Author already acknowledges broader setsid-escape residual risk. Flagged by: 1/3 reviewers, suggestion only.
  • Branch is at HEAD bf8ab1949c (4 commits ahead of round-1 HEAD 1afaff4c7e) — no base-drift impact on this review since list_pull_request_files was used as the authoritative changed-file source.

Test coverage assessment

  • New tests: Aggregate-CopilotTokenUsage.Tests.ps1, Remove-StaleMauiBotComments.Tests.ps1, Write-SafeSubprocessOutput.Tests.ps1, Review-PR.Phase-Audit.Tests.ps1, plus 5 new CI_REQUIRE_GATE_HMAC cases in Review-PR.Tests.ps1. New AST audit test enforces variable initialization inside the CopilotReview phased branch.
  • Test coverage gap: AST audit only enforces CopilotReview phase. Gate phase init (where R1-2 lives) is not asserted by any test. See inline at Review-PR.ps1:1956.
  • No tests cover the trust boundary between $RepoRoot and $TrustedScriptsDir — R2-B would not be caught by current Pester suite.

Prior review status

Round 1 review (5 findings, all done): #35843 (review)

Comment thread eng/pipelines/ci-copilot.yml
Comment thread .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 Outdated
Comment thread .github/scripts/Review-PR.ps1
Reshaped from the earlier dotnet#35804 reapply: Kuba's branch has since
reimplemented PR-stdout sanitization as ConvertTo-AzdoSafeConsole, so
this drops the superseded Write-SafeSubprocessOutput/Out-SafePRSubprocessLine
sanitizer + its audit and keeps ONLY the security primitives Kuba lacks.

Added:
- Restore-GateResultOrFailClosed: fail-CLOSED gate verdict restore
  (was fail-open "SKIPPED" on missing file). Wired into both the
  CopilotReview and Post phased-restore sites. Verifies an HMAC over
  gate-result.txt when GATE_HMAC_KEY is supplied; in CI, fails closed
  if CI_REQUIRE_GATE_HMAC=1 but the key is empty (suppress-signing
  downgrade defense).
- HMAC sealing of gate-result.txt in the Gate bash task: Linux-gated
  (POSIX process groups / pgrep / python3), key generated AFTER pwsh
  exits via python3 secrets and fed to the HMAC over stdin (never argv
  /proc/cmdline or env /proc/environ), orphan-reap of PR-spawned
  process groups with a PID-reuse guard, length-checked. STATE is the
  staging root (where Review-PR.ps1 writes gate-result.txt), so the
  .hmac sibling lands next to it. Job-level GATE_HMAC_KEY /
  CI_REQUIRE_GATE_HMAC default to '' so undefined AzDO macros resolve
  to empty on Windows/macOS (no regression there).
- winnerJson StrictMode property-presence guard
  (PSObject.Properties.Name -contains 'winner') — bare member access
  throws PropertyNotFoundException on partial objects under StrictMode.
- Gate-phase undefined-var pre-init ($risksData/$regressionTests) —
  defense-in-depth against the devdiv 14313972 crash class.
- Phase-Audit static analysis test (gh-call token-scope auditing).
- security-scripts-pester workflow (without the dropped sanitizer audit).

Kept intact: Kuba's ConvertTo-AzdoSafeConsole sanitization (5 sites).

Tests: 251/251 Pester green, 0 skipped. bash -n + YAML schema OK.
python3 ↔ .NET HMAC interop cross-checked. AST guards mutation-tested.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@T-Gro T-Gro force-pushed the security-on-enhanced-reviewer-v2 branch from 0534fc7 to 2ef3984 Compare June 17, 2026 17:31
@T-Gro T-Gro changed the title F2-fix: prevent Set-StrictMode leak that broke devdiv 14313972 CopilotReview Cross-phase trust-boundary hardening for ci-copilot Gate (HMAC seal + fail-closed restore) Jun 17, 2026
Copilot and others added 2 commits June 18, 2026 09:44
The HMAC work passes GATE_HMAC_KEY into the CopilotReview task env (needed
by Restore-GateResultOrFailClosed). The copilot invocation runs --allow-all,
so without stripping the key a prompt-injected agent could read it from its
bash/MCP subprocess, forge gate-result.txt, and recompute a valid HMAC —
defeating the seal. Added GATE_HMAC_KEY to --secret-env-vars alongside the
GH/COPILOT tokens.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
verify-tests-fail.ps1 runs as trusted bytes (copied to \/skills before
the PR merge) with GH_TOKEN live, but dot-sourced EstablishBrokenBaseline.ps1
from \ = git rev-parse --show-toplevel = the PR-merged worktree. A PR
could replace that file with arbitrary PowerShell executed in the trusted host
session. Resolve the baseline + detection helpers from \
(../../../scripts/) so they load from the trusted copy in CI. \ stays
for git ops + intentional PR-code subprocesses only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-infrastructure CI, Maestro / Coherency, upstream dependencies/versions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants