Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124) #15811
Merged
Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).
- README documents project intent, layout, local run instructions, and how to add a new scenario.
- .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.
- evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.
- fixtures/.gitkeep reserves the per-scenario fixtures layout.
Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.
Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:
- check-public-repo-then-validate
- validate-typespec
- typespec-generation-step02
- get-modified-typespec-projects (stub; needs git-repo fixture / setup hook)
- add-arm-resource (stub; needs fixtures + npx tsp compile post-check)
- create-release-plan
- link-namespace-approval-issue
- get-pr-link-current-branch
- check-sdk-generation-status
Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.
Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.
This was referenced Jun 2, 2026
#15183
- Move 11 multi-step scenario evals to evals/scenarios/
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
- Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
- Update .vally.yaml suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add .gitignore for vally-results/ and results/
Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.
…Azure/azure-sdk-tools into feat/vally-tool-scenarios-15124
Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.
helen229 added a commit that referenced this pull request on Jun 4, 2026
…list (#15852) (#15854)

* Add MCP tool-coverage drift check for Azure.Sdk.Tools.Mock (#15852)
  - New eng/scripts/Get-McpToolInventory.ps1 boots the live Azure.Sdk.Tools.Cli MCP server (via 'azsdk list -o json'), enumerates the IMockToolHandler implementations under Azure.Sdk.Tools.Mock, and reports the diff in three buckets: both / live-only / mock-only.
  - Cross-references mock-tier eval YAMLs under tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/ when present; gracefully no-ops when that folder hasn't landed yet (PR #15811).
  - '-CheckOnly' exits non-zero on (a) any stale handler that no longer maps to a live tool, or (b) any tool referenced by a mock-tier eval without a handler -- intended for the CI job tracked in #15829.
  - Documents the drift workflow in Azure.Sdk.Tools.Mock/README.md so a contributor flagged by the script knows how to add a handler.
  No stale handlers detected against the current live tool set.
* Add mock handlers for remaining live MCP tools; drop eval scanning from inventory script (#15852)
  - 13 new handler files covering 63 live tools that previously fell back to the default response (APIView, Codeowners, EngSys, GitHub, Package, Pipeline, ReleasePlan, TypeSpec, Verify, Core, Example).
  - Get-McpToolInventory.ps1: pure live-vs-mock parity (removes Vally eval cross-reference); -CheckOnly fails if either bucket is non-empty.
  - README: updated sync workflow to reflect parity-only check.
* Simplify Get-McpToolInventory.ps1: no parameters, always exits non-zero on drift (#15852)
* Fix 3 release-plan handler response types to match live tools (#15852)
  Addresses Copilot review on PR #15854:
  - azsdk_get_kpi_attestation_status: ReleaseWorkflowResponse -> ReleasePlanListResponse
  - azsdk_get_service_details_by_typespec_path: ReleaseWorkflowResponse -> ProductInfoResponse
  - azsdk_update_language_exclusion_justification: ReleaseWorkflowResponse -> DefaultCommandResponse
* Drop Get-McpToolInventory.ps1 (#15852)
  Per review discussion: the script only checked that an IMockToolHandler exists with the right ToolName; it could not detect handlers that exist but just return the placeholder DefaultCommandResponse. That blind spot makes the script of limited value. A unit test in Cli.Tests is a better fit for actual drift enforcement and is tracked as a follow-up. README updated to drop the script reference.
* Update Mock README: drop reference to removed inventory script (#15852)
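The three-bucket comparison described in the commit above (both / live-only / mock-only) amounts to a set partition. The following is only a minimal Python sketch of that idea, not the removed PowerShell script: the function names are hypothetical, the tool lists are illustrative inputs, and `azsdk_example_stale_tool` is an invented name used to show a stale handler.

```python
def bucket_tools(live_tools, mock_tools):
    """Partition tool names into both / live-only / mock-only buckets."""
    live, mock = set(live_tools), set(mock_tools)
    return {
        "both": sorted(live & mock),
        "live_only": sorted(live - mock),
        "mock_only": sorted(mock - live),
    }


def has_drift(buckets):
    """Parity check: drift means either one-sided bucket is non-empty."""
    return bool(buckets["live_only"] or buckets["mock_only"])


# Illustrative inputs only; a real check would read the live list from the
# server and the mock list from the handler implementations.
live = ["azsdk_get_release_plan", "azsdk_get_kpi_attestation_status"]
mock = ["azsdk_get_release_plan", "azsdk_example_stale_tool"]  # second name is hypothetical
buckets = bucket_tools(live, mock)
print(buckets, "drift" if has_drift(buckets) else "in sync")
```

As the later commit notes, a name-level comparison like this cannot tell a real handler from one that only returns a placeholder response, which is why the script was dropped in favor of a unit test.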
All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.
Pull request overview
This PR establishes tools/azsdk-cli/Azure.Sdk.Tools.Vally as the unified home for Azure SDK MCP tool invocation evals and multi-step workflow scenarios using @microsoft/vally-cli, porting prior benchmark coverage and consolidating per-tool trigger evals into a single surface area. It also updates existing skill-eval infrastructure to launch pre-built MCP server DLLs (avoiding dotnet run/MSBuild races under parallel workers).
Changes:
- Added the new `Azure.Sdk.Tools.Vally` project structure, including Vally config, eval suites (tool triggers + workflow scenarios), and local helper scripts.
- Ported and organized trigger eval YAMLs and scenario eval YAMLs to cover tool invocation drift and multi-tool workflows.
- Updated skill-eval pipeline/config to run MCP servers via pre-built DLLs (`dotnet <dll>`) rather than `dotnet run`.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md | Adds a design/spec document describing the eval strategy and intended suite structure. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/.vally.yaml | Defines Vally environments (mock/live MCP) and suites for running the new evals. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/.gitignore | Ignores local Vally output folders. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/README.md | Documents purpose, layout, and how to run Vally evals for tool scenarios and workflows. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/fixtures/.gitkeep | Establishes fixture folder conventions for eval inputs. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/scripts/Validate-EvalTools.ps1 | Adds a drift/coverage validator to cross-check trigger eval tool references vs server tool catalog. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/setup/ensure-specs-clone.ps1 | Adds helper to maintain a cached sparse clone of azure-rest-api-specs for scenarios needing a repo on disk. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-public-repo.eval.yaml | Adds a unit-tier tool-call eval for public-repo presence checks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/validate-typespec.eval.yaml | Adds a unit-tier tool-call eval for TypeSpec validation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-modified-typespec-projects.eval.yaml | Adds a unit-tier tool-call eval for listing modified TypeSpec projects. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/add-arm-resource.eval.yaml | Adds a (currently stub-like) authoring scenario expecting plan generation + edits. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/create-release-plan.eval.yaml | Adds a unit-tier tool-call eval for creating a release plan. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/link-namespace-approval-issue.eval.yaml | Adds a unit-tier tool-call eval for linking namespace approval to a release plan. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-pr-link-current-branch.eval.yaml | Adds a unit-tier tool-call eval for resolving PR link for current branch. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-sdk-generation-status.eval.yaml | Adds a unit-tier tool-call eval for pipeline status checks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-apiview.eval.yaml | Adds trigger stimuli covering APIView-related MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-config.eval.yaml | Adds trigger stimuli covering config/label MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-engsys.eval.yaml | Adds trigger stimuli covering engineering-system MCP tools (logs/tests/etc.). |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-github.eval.yaml | Adds trigger stimuli covering GitHub MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-package.eval.yaml | Adds trigger stimuli covering package generation/build/test/release MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-pipeline.eval.yaml | Adds trigger stimuli covering pipeline MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-releaseplan.eval.yaml | Adds trigger stimuli covering release-plan MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-typespec.eval.yaml | Adds trigger stimuli covering TypeSpec MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-verify.eval.yaml | Adds trigger stimuli covering setup verification MCP tool. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/check-public-repo-then-validate.eval.yaml | Adds a mock multi-tool workflow scenario (validate then public-repo check). |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/typespec-generation-step02.eval.yaml | Adds a mock workflow scenario for TypeSpec generation step 2 behavior. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/rename-client-property.eval.yaml | Adds a stub workflow scenario intended for a future expected-diff grader. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/release-planner-workflows.eval.yaml | Adds mock workflow stimuli for key release-planner scenarios. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/live/release-planner.eval.yaml | Adds a live end-to-end scenario that creates a plan, generates SDK, and links PR back. |
| eng/pipelines/skill-eval.yml | Pre-builds MCP servers so Vally can launch pre-built DLLs (reducing parallel-run flakiness). |
| .github/skills/.vally.yaml | Updates skill eval environment config to launch MCP servers via pre-built DLLs. |
…ot found' lookup

The create-release-plan-and-generate-sdk mock stimulus required the agent to call azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the azsdk-common-prepare-release-plan skill's create flow asks for it. The agent correctly skipped the tool, and the grader flapped.

The dedicated update-sdk-details-in-release-plan stimulus already covers that tool with an explicit prompt. Drop it from the create+generate grader so mock matches the live release-planner-e2e contract (create / get / generate / link).

Also patch GetReleasePlanForSpecPrHandler to return a deterministic 'not found' response (ReleasePlanDetails = null). The mock previously returned a 'plan exists' result for any spec PR, pushing the agent down the update path instead of the create path that the stimulus exercises. Stimuli that target an existing plan pass the work-item ID directly and call azsdk_get_release_plan, so this is safe.
…cache portability
- README/eval comments: evals/unit -> evals/tools, evals/scenarios -> evals/workflow-scenarios (Copilot C1/C5)
- Validate-EvalTools.ps1: default EvalPath -> evals/tools; return 1 -> exit 1 so CI fails loudly (Copilot C2/C3)
- MCP build output: dotnet build -o artifacts/mcp/{cli,mock}; pipeline switched to Release; .vally.yaml no longer hardcodes Debug/net8.0 (Praveen #1/#2)
- ensure-specs-clone.ps1 + workflow evals: repo-relative artifacts/specs-cache path instead of C:/Users/gaoh; Vally resolves it relative to the eval file so it works for all contributors + CI (Copilot C6/C7, Praveen #4)
- add-arm-resource/rename-client-property: comment clarifying 'edit' is the Copilot SDK built-in file tool, not an MCP tool (Praveen #5)
…solidate standalone single-tool evals
- Rename evals/tools/triggers-*.eval.yaml to prompt-to-tool-*.eval.yaml (Praveen review #6)
- Consolidate 7 standalone single-tool scenario evals into the matching namespace files as full-context checks (check-public-repo, check-sdk-generation-status, create-release-plan, get-modified-typespec-projects, get-pr-link-current-branch, link-namespace-approval-issue, validate-typespec)
- Keep add-arm-resource.eval.yaml standalone (produces a file edit, not a pure tool trigger)
- Switch tool evals to gpt-5.4 and add explicit 'use the available Azure SDK MCP tools' steering plus concrete grounding to bare trigger prompts so they invoke the MCP tool reliably
- Update README evals/tools section and Validate-EvalTools.ps1 to the new file names
praveenkuttappan approved these changes on Jun 16, 2026
Ground 13 previously-flaky prompts with concrete IDs/paths so they route deterministically to the intended MCP tool; make the mock check-service-label handler convention-driven (status derived from the requested serviceLabel); document common vally invocation recipes in the README.
Replace references to consolidated/non-existent eval files (create-release-plan, check-public-repo, link-namespace-approval-issue) with the real prompt-to-tool-* and workflow-scenario files; correct the default output path to ./vally-results/<timestamp>/; fix the cookbook results.jsonl parser to locate the newest timestamped run; add the missing release-planner-workflows mock scenario to the index.
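Locating the newest timestamped run under ./vally-results/<timestamp>/ and reading its results.jsonl, as the commit above describes for the cookbook parser, could look roughly like the Python sketch below. It is not the README's actual parser. It assumes that results.jsonl sits directly inside each run directory and that directory modification time is an acceptable stand-in for run order; both function names are hypothetical.

```python
import json
from pathlib import Path


def newest_run_dir(results_root):
    """Return the most recently modified run directory under results_root, or None."""
    root = Path(results_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    # Assumption: newest mtime == newest run; the name breaks ties deterministically.
    return max(runs, key=lambda p: (p.stat().st_mtime, p.name), default=None)


def load_results(run_dir):
    """Parse results.jsonl (one JSON object per line), skipping blank lines."""
    path = Path(run_dir) / "results.jsonl"  # assumed location inside the run directory
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


latest = newest_run_dir("vally-results")
if latest is not None and (latest / "results.jsonl").is_file():
    print(latest, len(load_results(latest)))
```

Sorting by directory name instead would work only if the timestamp format sorts chronologically, which the source does not state.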
The prompt (LLM-judge) grader schema uses 'prompt' for the rubric text, not 'rubric'. Rename the field and add 'scoring: binary' (the rubric is pass/fail) so the spec validates.
…feat/vally-tool-scenarios-15124
praveenkuttappan approved these changes on Jun 16, 2026
Closes #15124.
Stands up `Azure.Sdk.Tools.Vally` as the home for MCP-tool scenario and trigger evals, ports the legacy `Azure.Sdk.Tools.Cli.Evaluations` benchmarks, and folds in the per-tool trigger evals from #15183 so we have a single eval surface.

What's in the PR

New project: `tools/azsdk-cli/Azure.Sdk.Tools.Vally/`

- `.vally.yaml`: single `azsdk-mcp` environment that spawns `Azure.Sdk.Tools.Cli` via `dotnet run`; named suites for selective execution (`typespec`, `release-plan`, `github`, `pipeline`, `scenarios`, `triggers`, `all`).
- `.gitignore`: excludes local `vally-results/` and `results/`.
- `README.md`: explains how Vally evals relate to the per-skill evals under `.github/skills/`, lists scenario + trigger coverage, documents the run loop.

`evals/scenarios/`: 11 multi-step workflow evals (the #15124 port)

Ported from `Azure.Sdk.Tools.Cli.Evaluations` and reshaped for Vally's `tool-calls` grader:

- `check-public-repo`: … `azure-rest-api-specs`?
- `check-public-repo-then-validate`
- `validate-typespec`: … `tsp` linter/validation
- `typespec-generation-step02`
- `get-modified-typespec-projects`
- `add-arm-resource`: … `azsdk_typespec_generate_authoring_plan` for an ARM resource
- `create-release-plan`
- `link-namespace-approval-issue`
- `get-pr-link-current-branch`
- `check-sdk-generation-status`
- `rename-client-property`: … `expected-diff` grader (follow-up)

`evals/triggers/`: 9 per-tool trigger evals (ported from #15183)

One YAML per tool category; each stimulus is a single prompt expected to invoke one MCP tool. Used to catch tool-rename / description-drift regressions.
`apiview`, `config`, `engsys`, `github`, `package`, `pipeline`, `releaseplan`, `typespec`, `verify`, covering the bulk of the `azsdk_*` tool surface.

`scripts/Validate-EvalTools.ps1` (ported from #15183)

Drift detector. Runs `azsdk list --output json` and cross-checks:

- … `evals/triggers/` exists on the running MCP server (catches renames)
- … `hello_world`, `upgrade`, codeowner helpers) are filtered out

What's not in this PR (deliberate)
- `AZSDKTOOLS_AGENT_TESTING` toggle: currently `false`. There's a real-e2e vs. safe-replay tradeoff worth a separate discussion; flipping it would need either a second `azsdk-mcp-live` environment or a CI policy. Left for a follow-up.
- `rename-client-property` grader: still a stub awaiting a Vally `expected-diff` grader.
- … `ci.yml` under `eng/pipelines/templates` is a follow-up.

Acknowledgements

Trigger evals + `Validate-EvalTools.ps1` ported from @jeo02's #15183 (`migrate-evaluations-to-vally`); prefixes stripped from `azure-sdk-mcp-azsdk_*` → `azsdk_*` to match the bare MCP tool names emitted in trajectories. Once this PR merges, #15183 is superseded.

Verification

- `dotnet build` on `Azure.Sdk.Tools.Vally`: green.
- `vally eval --eval-spec evals/scenarios/check-public-repo-then-validate.eval.yaml`: runs end-to-end against the MCP server and grades against `tool-calls` (trajectory captured under `vally-results/`).
- `scripts/Validate-EvalTools.ps1`: runs against a live MCP server and produces the expected coverage report.
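
For readers unfamiliar with the drift detector described above, here is a rough Python sketch of the kind of cross-check it performs. It is not a translation of `Validate-EvalTools.ps1`. It assumes the two checks are (a) every `azsdk_*` name referenced in a trigger eval exists in the server's tool catalog, and (b) catalog tools with no trigger are reported unless they appear on an exclusion list. The regex and function names are hypothetical.

```python
import re

# Bare-name pattern: because "-" is a word boundary, this also picks the bare
# tool name out of a legacy "azure-sdk-mcp-azsdk_..." reference.
TOOL_NAME = re.compile(r"\bazsdk_[a-z0-9_]+\b")


def referenced_tools(eval_text):
    """Collect the bare azsdk_* tool names mentioned in one eval file's text."""
    return set(TOOL_NAME.findall(eval_text))


def check_drift(eval_texts, server_tools, excluded=frozenset()):
    """Return (unknown, uncovered): names missing from the server, tools with no trigger."""
    referenced = set()
    for text in eval_texts:
        referenced |= referenced_tools(text)
    catalog = set(server_tools)
    unknown = sorted(referenced - catalog)  # likely renamed or removed tools
    uncovered = sorted(catalog - referenced - set(excluded))
    return unknown, uncovered


# Illustrative inputs only; real inputs would be the eval YAML files and the
# tool list reported by the running MCP server.
evals = ["expect: azure-sdk-mcp-azsdk_get_release_plan",
         "expect: azsdk_typespec_check_project_in_public_repo"]
server = ["azsdk_get_release_plan", "azsdk_typespec_generate_authoring_plan", "hello_world"]
print(check_drift(evals, server, excluded={"hello_world"}))
```

A CI wrapper around such a check would exit non-zero when either list is non-empty, in the spirit of the `return 1 -> exit 1` fix noted in the review follow-ups.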