Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124) by helen229 · Pull Request #15811 · Azure/azure-sdk-tools

helen229 · 2026-06-01T17:22:50Z

Stands up Azure.Sdk.Tools.Vally as the home for MCP-tool scenario and trigger evals, ports the legacy Azure.Sdk.Tools.Cli.Evaluations benchmarks, and folds in the per-tool trigger evals from #15183 so we have a single eval surface.

What's in the PR

New project: `tools/azsdk-cli/Azure.Sdk.Tools.Vally/`

- .vally.yaml: single azsdk-mcp environment that spawns Azure.Sdk.Tools.Cli via dotnet run; named suites for selective execution (typespec, release-plan, github, pipeline, scenarios, triggers, all).
- .gitignore: excludes local vally-results/ and results/.
- README.md: explains how Vally evals relate to the per-skill evals under .github/skills/, lists scenario + trigger coverage, documents the run loop.

`evals/scenarios/` — 11 multi-step workflow evals (the #15124 port)

Ported from Azure.Sdk.Tools.Cli.Evaluations and reshaped for Vally's tool-calls grader:

Scenario	Shape
`check-public-repo`	Single-tool: is a TypeSpec project published in `azure-rest-api-specs`?
`check-public-repo-then-validate`	Multi-tool, ordered: validate then check
`validate-typespec`	Single-tool: `tsp` linter/validation
`typespec-generation-step02`	Step in the spec-PR generation flow
`get-modified-typespec-projects`	Git-aware tool against current branch
`add-arm-resource`	Calls `azsdk_typespec_generate_authoring_plan` for an ARM resource
`create-release-plan`	Single-tool: create a release-plan work item
`link-namespace-approval-issue`	Link an existing approval issue to a release plan
`get-pr-link-current-branch`	Resolve the PR for the active git branch
`check-sdk-generation-status`	Pipeline status lookup
`rename-client-property`	Stub — needs `expected-diff` grader (follow-up)

`evals/triggers/` — 9 per-tool trigger evals (ported from #15183)

One YAML per tool category; each stimulus is a single prompt expected to invoke one MCP tool. Used to catch tool-rename / description-drift regressions.

apiview, config, engsys, github, package, pipeline, releaseplan, typespec, verify — covering the bulk of the azsdk_* tool surface.

`scripts/Validate-EvalTools.ps1` (ported from #15183)

Drift detector. Runs azsdk list --output json and cross-checks:

- every tool referenced in evals/triggers/ exists on the running MCP server (catches renames)
- every server tool has at least one trigger eval (catches new tools landing without coverage)
- known-excluded tools (examples, hello_world, upgrade, codeowner helpers) are filtered out
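
The two cross-checks reduce to set differences. The following is a minimal Python sketch of that idea, not the PowerShell script itself: the function name `find_drift`, the report keys, and the sample tool names are hypothetical, and it assumes exclusions are matched by exact tool name.

```python
def find_drift(server_tools, eval_tools, excluded):
    """Compare the server's tool catalog with the tools named in trigger evals."""
    referenced = set(eval_tools)
    # Excluded tools are not expected to have trigger coverage.
    coverable = set(server_tools) - set(excluded)
    return {
        # Named in an eval but absent from the server: likely a rename.
        "missing_on_server": sorted(referenced - set(server_tools)),
        # On the server with no trigger eval: a coverage gap.
        "uncovered": sorted(coverable - referenced),
    }


# Hypothetical tool names, for illustration only.
report = find_drift(
    server_tools=["tool_alpha", "tool_beta", "tool_hello_world"],
    eval_tools=["tool_alpha", "tool_old_name"],
    excluded=["tool_hello_world"],
)
print(report)
```

A CI wrapper would typically fail when either list is non-empty.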

What's not in this PR (deliberate)

- AZSDKTOOLS_AGENT_TESTING toggle: currently false. There's a real-e2e vs. safe-replay tradeoff worth a separate discussion; flipping it would need either a second azsdk-mcp-live environment or a CI policy. Left for a follow-up.
- rename-client-property grader: still a stub awaiting a Vally expected-diff grader.
- CI wiring: the project builds and runs locally; a ci.yml under eng/pipelines/templates is a follow-up.

Acknowledgements

Trigger evals + Validate-EvalTools.ps1 ported from @jeo02's #15183 (migrate-evaluations-to-vally); prefixes stripped from azure-sdk-mcp-azsdk_* → azsdk_* to match the bare MCP tool names emitted in trajectories. Once this PR merges, #15183 is superseded.
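
The prefix stripping described above amounts to a one-line normalization. A minimal Python sketch, assuming the prefix is always the literal string `azure-sdk-mcp-` (the helper name `to_bare_tool_name` is hypothetical):

```python
PREFIX = "azure-sdk-mcp-"


def to_bare_tool_name(name: str) -> str:
    """Drop the server prefix so the name matches what trajectories record."""
    return name[len(PREFIX):] if name.startswith(PREFIX) else name


print(to_bare_tool_name("azure-sdk-mcp-azsdk_typespec_check_project_in_public_repo"))
```

Names that are already bare pass through unchanged, so the helper is safe to apply to a mixed list.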

Verification

- dotnet build on Azure.Sdk.Tools.Vally: green.
- vally eval --eval-spec evals/scenarios/check-public-repo-then-validate.eval.yaml: runs end-to-end against the MCP server and grades against tool-calls (trajectory captured under vally-results/).
- scripts/Validate-EvalTools.ps1: runs against a live MCP server and produces the expected coverage report.

Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).

- README documents project intent, layout, local run instructions, and how to add a new scenario.
- .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.
- evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.
- fixtures/.gitkeep reserves the per-scenario fixtures layout.

Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.

Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:

- check-public-repo-then-validate
- validate-typespec
- typespec-generation-step02
- get-modified-typespec-projects (stub; needs git-repo fixture / setup hook)
- add-arm-resource (stub; needs fixtures + npx tsp compile post-check)
- create-release-plan
- link-namespace-approval-issue
- get-pr-link-current-branch
- check-sdk-generation-status

Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.
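
To illustrate what a presence check covers and what an order assertion would add, here is a small Python sketch. It is not Vally's grader implementation: it assumes a trajectory can be reduced to a list of tool names, and `hypothetical_validate_tool` is a made-up name standing in for the validation tool.

```python
def has_required_calls(trajectory, required):
    """Presence check: every required tool appears somewhere in the trajectory."""
    called = set(trajectory)
    return all(tool in called for tool in required)


def calls_in_order(trajectory, ordered):
    """Order check: `ordered` must appear as a subsequence of the trajectory."""
    remaining = iter(trajectory)
    # `tool in remaining` advances the iterator, so each match must come
    # after the previous one.
    return all(tool in remaining for tool in ordered)


trajectory = [
    "azsdk_typespec_check_project_in_public_repo",
    "hypothetical_validate_tool",
]
wrong_order = list(reversed(trajectory))

print(has_required_calls(trajectory, wrong_order))  # presence ignores order
print(calls_in_order(trajectory, wrong_order))      # order check does not
```

A presence-only grader passes both orderings, which is why the ordered assertions from the original benchmark need a separate mechanism.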

Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.

#15183

- Move 11 multi-step scenario evals to evals/scenarios/
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
- Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
- Update .vally.yaml suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add .gitignore for vally-results/ and results/

Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.

…Azure/azure-sdk-tools into feat/vally-tool-scenarios-15124

Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.

…list (#15852) (#15854)

* Add MCP tool-coverage drift check for Azure.Sdk.Tools.Mock (#15852)
  - New eng/scripts/Get-McpToolInventory.ps1 boots the live Azure.Sdk.Tools.Cli MCP server (via 'azsdk list -o json'), enumerates the IMockToolHandler implementations under Azure.Sdk.Tools.Mock, and reports the diff in three buckets: both / live-only / mock-only.
  - Cross-references mock-tier eval YAMLs under tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/ when present; gracefully no-ops when that folder hasn't landed yet (PR #15811).
  - '-CheckOnly' exits non-zero on (a) any stale handler that no longer maps to a live tool, or (b) any tool referenced by a mock-tier eval without a handler -- intended for the CI job tracked in #15829.
  - Documents the drift workflow in Azure.Sdk.Tools.Mock/README.md so a contributor flagged by the script knows how to add a handler.

  No stale handlers detected against the current live tool set.

* Add mock handlers for remaining live MCP tools; drop eval scanning from inventory script (#15852)
  - 13 new handler files covering 63 live tools that previously fell back to the default response (APIView, Codeowners, EngSys, GitHub, Package, Pipeline, ReleasePlan, TypeSpec, Verify, Core, Example).
  - Get-McpToolInventory.ps1: pure live-vs-mock parity (removes Vally eval cross-reference); -CheckOnly fails if either bucket is non-empty.
  - README: updated sync workflow to reflect parity-only check.

* Simplify Get-McpToolInventory.ps1: no parameters, always exits non-zero on drift (#15852)

* Fix 3 release-plan handler response types to match live tools (#15852)

  Addresses Copilot review on PR #15854:
  - azsdk_get_kpi_attestation_status: ReleaseWorkflowResponse -> ReleasePlanListResponse
  - azsdk_get_service_details_by_typespec_path: ReleaseWorkflowResponse -> ProductInfoResponse
  - azsdk_update_language_exclusion_justification: ReleaseWorkflowResponse -> DefaultCommandResponse

* Drop Get-McpToolInventory.ps1 (#15852)

  Per review discussion: the script only checked that an IMockToolHandler exists with the right ToolName; it could not detect handlers that exist but just return the placeholder DefaultCommandResponse. That blind spot makes the script of limited value. A unit test in Cli.Tests is a better fit for actual drift enforcement and is tracked as a follow-up. README updated to drop the script reference.

* Update Mock README: drop reference to removed inventory script (#15852)

…rios-15124

All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.

Copilot

Pull request overview

This PR establishes tools/azsdk-cli/Azure.Sdk.Tools.Vally as the unified home for Azure SDK MCP tool invocation evals and multi-step workflow scenarios using @microsoft/vally-cli, porting prior benchmark coverage and consolidating per-tool trigger evals into a single surface area. It also updates existing skill-eval infrastructure to launch pre-built MCP server DLLs (avoiding dotnet run/MSBuild races under parallel workers).

Changes:

Added the new Azure.Sdk.Tools.Vally project structure, including Vally config, eval suites (tool triggers + workflow scenarios), and local helper scripts.
Ported and organized trigger eval YAMLs and scenario eval YAMLs to cover tool invocation drift and multi-tool workflows.
Updated skill-eval pipeline/config to run MCP servers via pre-built DLLs (dotnet <dll>) rather than dotnet run.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 13 comments.


File	Description
tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md	Adds a design/spec document describing the eval strategy and intended suite structure.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/.vally.yaml	Defines Vally environments (mock/live MCP) and suites for running the new evals.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/.gitignore	Ignores local Vally output folders.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/README.md	Documents purpose, layout, and how to run Vally evals for tool scenarios and workflows.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/fixtures/.gitkeep	Establishes fixture folder conventions for eval inputs.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/scripts/Validate-EvalTools.ps1	Adds a drift/coverage validator to cross-check trigger eval tool references vs server tool catalog.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/setup/ensure-specs-clone.ps1	Adds helper to maintain a cached sparse clone of `azure-rest-api-specs` for scenarios needing a repo on disk.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-public-repo.eval.yaml	Adds a unit-tier tool-call eval for public-repo presence checks.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/validate-typespec.eval.yaml	Adds a unit-tier tool-call eval for TypeSpec validation.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-modified-typespec-projects.eval.yaml	Adds a unit-tier tool-call eval for listing modified TypeSpec projects.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/add-arm-resource.eval.yaml	Adds a (currently stub-like) authoring scenario expecting plan generation + edits.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/create-release-plan.eval.yaml	Adds a unit-tier tool-call eval for creating a release plan.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/link-namespace-approval-issue.eval.yaml	Adds a unit-tier tool-call eval for linking namespace approval to a release plan.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-pr-link-current-branch.eval.yaml	Adds a unit-tier tool-call eval for resolving PR link for current branch.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-sdk-generation-status.eval.yaml	Adds a unit-tier tool-call eval for pipeline status checks.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-apiview.eval.yaml	Adds trigger stimuli covering APIView-related MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-config.eval.yaml	Adds trigger stimuli covering config/label MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-engsys.eval.yaml	Adds trigger stimuli covering engineering-system MCP tools (logs/tests/etc.).
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-github.eval.yaml	Adds trigger stimuli covering GitHub MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-package.eval.yaml	Adds trigger stimuli covering package generation/build/test/release MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-pipeline.eval.yaml	Adds trigger stimuli covering pipeline MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-releaseplan.eval.yaml	Adds trigger stimuli covering release-plan MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-typespec.eval.yaml	Adds trigger stimuli covering TypeSpec MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-verify.eval.yaml	Adds trigger stimuli covering setup verification MCP tool.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/check-public-repo-then-validate.eval.yaml	Adds a mock multi-tool workflow scenario (validate then public-repo check).
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/typespec-generation-step02.eval.yaml	Adds a mock workflow scenario for TypeSpec generation step 2 behavior.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/rename-client-property.eval.yaml	Adds a stub workflow scenario intended for a future expected-diff grader.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/release-planner-workflows.eval.yaml	Adds mock workflow stimuli for key release-planner scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/live/release-planner.eval.yaml	Adds a live end-to-end scenario that creates a plan, generates SDK, and links PR back.
eng/pipelines/skill-eval.yml	Pre-builds MCP servers so Vally can launch pre-built DLLs (reducing parallel-run flakiness).
.github/skills/.vally.yaml	Updates skill eval environment config to launch MCP servers via pre-built DLLs.

…ot found' lookup

The create-release-plan-and-generate-sdk mock stimulus required the agent to call azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the azsdk-common-prepare-release-plan skill's create flow asks for it. The agent correctly skipped the tool, and the grader flapped. The dedicated update-sdk-details-in-release-plan stimulus already covers that tool with an explicit prompt. Drop it from the create+generate grader so mock matches the live release-planner-e2e contract (create / get / generate / link).

Also patch GetReleasePlanForSpecPrHandler to return a deterministic 'not found' response (ReleasePlanDetails = null). The mock previously returned a 'plan exists' result for any spec PR, pushing the agent down the update path instead of the create path that the stimulus exercises. Stimuli that target an existing plan pass the work-item ID directly and call azsdk_get_release_plan, so this is safe.

…cache portability

- README/eval comments: evals/unit -> evals/tools, evals/scenarios -> evals/workflow-scenarios (Copilot C1/C5)
- Validate-EvalTools.ps1: default EvalPath -> evals/tools; return 1 -> exit 1 so CI fails loudly (Copilot C2/C3)
- MCP build output: dotnet build -o artifacts/mcp/{cli,mock}; pipeline switched to Release; .vally.yaml no longer hardcodes Debug/net8.0 (Praveen #1/#2)
- ensure-specs-clone.ps1 + workflow evals: repo-relative artifacts/specs-cache path instead of C:/Users/gaoh; Vally resolves it relative to the eval file so it works for all contributors + CI (Copilot C6/C7, Praveen #4)
- add-arm-resource/rename-client-property: comment clarifying 'edit' is the Copilot SDK built-in file tool, not an MCP tool (Praveen #5)
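
The portability fix relies on resolving the cache path against the eval file's own directory rather than a user-specific absolute path. A minimal Python sketch of that resolution rule follows; the helper name, the `/repo` root, the `example.eval.yaml` file name, and the number of `../` segments are all assumptions for illustration, not values taken from the PR.

```python
import posixpath


def resolve_relative_to_eval(eval_file: str, relative_path: str) -> str:
    """Resolve relative_path against the directory that contains eval_file."""
    eval_dir = posixpath.dirname(eval_file)
    return posixpath.normpath(posixpath.join(eval_dir, relative_path))


# Hypothetical layout: how many "../" segments are needed depends on
# where the eval file actually lives in the repository.
resolved = resolve_relative_to_eval(
    "/repo/evals/workflow-scenarios/mock/example.eval.yaml",
    "../../../artifacts/specs-cache",
)
print(resolved)
```

Because the result depends only on the eval file's location, the same YAML works on any contributor machine and in CI.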

…solidate standalone single-tool evals

- Rename evals/tools/triggers-*.eval.yaml to prompt-to-tool-*.eval.yaml (Praveen review #6)
- Consolidate 7 standalone single-tool scenario evals into the matching namespace files as full-context checks (check-public-repo, check-sdk-generation-status, create-release-plan, get-modified-typespec-projects, get-pr-link-current-branch, link-namespace-approval-issue, validate-typespec)
- Keep add-arm-resource.eval.yaml standalone (produces a file edit, not a pure tool trigger)
- Switch tool evals to gpt-5.4 and add explicit 'use the available Azure SDK MCP tools' steering plus concrete grounding to bare trigger prompts so they invoke the MCP tool reliably
- Update README evals/tools section and Validate-EvalTools.ps1 to the new file names

…ne in #15918)

Ground 13 previously-flaky prompts with concrete IDs/paths so they route deterministically to the intended MCP tool; make the mock check-service-label handler convention-driven (status derived from the requested serviceLabel); document common vally invocation recipes in the README.

Replace references to consolidated/non-existent eval files (create-release-plan, check-public-repo, link-namespace-approval-issue) with the real prompt-to-tool-* and workflow-scenario files; correct the default output path to ./vally-results/<timestamp>/; fix the cookbook results.jsonl parser to locate the newest timestamped run; add the missing release-planner-workflows mock scenario to the index.
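
Locating the newest timestamped run can be sketched in a few lines of Python. This is an illustration rather than the README's cookbook code: it assumes each run directory sits directly under `vally-results/`, that directory names sort chronologically, and that `results.jsonl` lives at the top of the run directory. The function names are hypothetical.

```python
import json
from pathlib import Path


def newest_run_dir(results_root):
    """Return the latest run directory under results_root, or None if there is none."""
    runs = sorted(p for p in Path(results_root).iterdir() if p.is_dir())
    return runs[-1] if runs else None


def load_results(results_root="vally-results"):
    """Parse results.jsonl from the newest run, one JSON object per line."""
    run = newest_run_dir(results_root)
    if run is None:
        return []
    lines = (run / "results.jsonl").read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```

If the timestamp format does not sort lexicographically, sorting by `p.stat().st_mtime` is a reasonable alternative.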

The prompt (LLM-judge) grader schema uses 'prompt' for the rubric text, not 'rubric'. Rename the field and add 'scoring: binary' (the rubric is pass/fail) so the spec validates.

…feat/vally-tool-scenarios-15124

helen229 added 3 commits June 1, 2026 10:22

Add rename-client-property stub eval to Vally suite (#15124)

26cc6ef


github-actions bot added the azsdk-cli label (Issues related to Azure/azure-sdk-tools::tools/azsdk-cli) Jun 1, 2026

This was referenced Jun 2, 2026

Wire vally eval CI job for Azure.Sdk.Tools.Vally tool-scenario evals #15829

Open

Wire workspace setup hooks + mock MCP environment for Vally tool-scenario evals #15831

Open

helen229 added 10 commits June 2, 2026 11:48

Fix tool name prefix in graders, timeout format, expand README

8e4f524

Merge branch 'main' into feat/vally-tool-scenarios-15124

c10063b

update the config and use gpt-5.4 model

02aee34

add disallowed

d1f212f

Merge branch 'feat/vally-tool-scenarios-15124' of https://github.com/…

66216b0

…Azure/azure-sdk-tools into feat/vally-tool-scenarios-15124

Vally: remove Run-LiveEvals.ps1 (local-only test wrapper)

a88ae11


some docs and test e2e one

bb47139

update docs

4d89bac

This was referenced Jun 3, 2026

Refresh Azure.Sdk.Tools.Mock + add MCP tool-coverage drift check #15852

Closed

Refresh Azure.Sdk.Tools.Mock handler coverage to match live MCP tool list (#15852) #15854

Merged

Vally results UX: trajectory HTML + history CSV + artifact upload #15861

Open

helen229 added 4 commits June 3, 2026 13:35

update design

f6f5c80

update with skill evals

3a8d609

reorg based on the design

b7005b2

remove the duplicates

6db7c5f

helen229 added 6 commits June 4, 2026 07:33

add new scenarios

b77dccb

update the doc

1264e9a

update doc

aa714ab

Merge remote-tracking branch 'origin/main' into feat/vally-tool-scena…

f26cf1f

…rios-15124

update names

fda9ef9

Copilot AI reviewed Jun 5, 2026


helen229 added 2 commits June 5, 2026 22:44

update readme for running steps

2ce5e7b

helen229 mentioned this pull request Jun 6, 2026

Skill: orphan tool azsdk_run_generate_sdk mis-routes to generate-sdk-locally #15950

Closed