docs: add agent-eval-strategy spec for azsdk-cli operations by helen229 · Pull Request #15918 · Azure/azure-sdk-tools

helen229 · 2026-06-04T22:16:37Z

Pulls the new agent-eval-strategy spec out of #15811 (the Vally port PR) so it can be reviewed standalone before the implementation lands.

What's in here

A single new file: tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md.

The spec covers:

- Eval pyramid: per-skill evals, mock workflow scenarios, live workflow scenarios, plus a hermetic tool-shape layer.
- Where each eval lives: skill evals stay next to SKILL.md; cross-skill / cross-tool evals live in tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/.
- Folder layout under evals/ (tools/, workflow-scenarios/{mock,live}/, setup/).
- Required graders by tier: mock workflows require tool-calls (skill-invocation optional, response grader N/A); live workflows require tool-calls + skill-invocation + a response grader (prompt / LLM-judge).
- Run cadence (PR vs nightly vs weekly), open questions, and known runner gaps from the first post-merge run (skill-dir requirement, MCP stdio boot race traced to dotnet run build contention).
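
To make the "required graders by tier" rule concrete, here is a minimal Python sketch of a check that reports which required graders a scenario is missing. The grader names (`tool-calls`, `skill-invocation`, `prompt`) come from this PR; the `REQUIRED_GRADERS` table and the `missing_graders` function are hypothetical illustrations and are not part of Vally.

```python
# Hypothetical sketch: grader names are taken from the PR description, but the
# table and function below are illustrative assumptions, not Vally's API.
REQUIRED_GRADERS = {
    # Mock workflows: tool-calls is required; skill-invocation is optional.
    "mock": {"tool-calls"},
    # Live workflows: tool-calls + skill-invocation + a response (prompt) grader.
    "live": {"tool-calls", "skill-invocation", "prompt"},
}


def missing_graders(tier: str, configured: list[str]) -> set[str]:
    """Return the graders the tier requires that the scenario does not configure."""
    return REQUIRED_GRADERS[tier] - set(configured)


if __name__ == "__main__":
    # A live scenario that only configures tool-calls is missing two graders.
    print(sorted(missing_graders("live", ["tool-calls"])))
```

A check like this could run as a lint step before any eval executes, so a scenario in the wrong tier fails fast rather than silently skipping a grader.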

New design doc covering the eval pyramid (per-skill, mock workflow, live workflow, hermetic tool-shape), folder layout, required graders per workflow tier, and where each eval lives. Pulled out of #15811 (Vally port PR) so it can be reviewed standalone before the implementation lands.

haolingdong-msft

Thanks @helen229 for the spec! I added some comments from the TypeSpec authoring side, and I'm happy to discuss further or provide more information.

haolingdong-msft · 2026-06-05T01:21:07Z

+  willing to spend before we move it back off the PR?
+
+We need owners' input on all four before turning the gate on.
+


Adding my two cents on the azure-typespec-author skill evaluation and skill publish workflow:
Skill code update -> PR-level validation (pick a few benchmark cases to run, and run a mock server to validate skill invocation) -> merge to main -> run the nightly build for all benchmarks -> publish if the nightly build succeeds.
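
The lifecycle in this comment can be sketched as a small state function. This is only an illustration of the proposed ordering; the function name, parameters, and stage labels are hypothetical and do not correspond to any existing pipeline code.

```python
from typing import Optional


def next_stage(pr_checks_passed: bool, merged: bool,
               nightly_passed: Optional[bool]) -> str:
    """Return the next step in the proposed skill-publish lifecycle.

    nightly_passed is None while the nightly run for the merged change
    has not finished yet. All stage labels are hypothetical.
    """
    if not pr_checks_passed:
        return "fix-pr"            # PR-level validation failed
    if not merged:
        return "merge"             # PR validation passed, not yet on main
    if nightly_passed is None:
        return "wait-for-nightly"  # full benchmark run still pending
    # Publishing is gated on the nightly run over all benchmarks.
    return "publish" if nightly_passed else "hold"


if __name__ == "__main__":
    print(next_stage(True, True, True))
```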

haolingdong-msft · 2026-06-05T01:23:53Z

+- **Workflow scenario**: a user prompt that crosses multiple tools / skills
+  end-to-end (e.g. *create release plan → generate SDK → link the SDK PR*).
+- **Stimulus**: one prompt + its expected behavior — the unit of an eval.
+- **Three graders per stimulus**: `skill-invocation` (right skill picked),


For TypeSpec authoring, the case might be more complex, and three graders are not enough. This is one of our cases: https://github.com/Azure/azure-sdk-tools/blob/feat/vally-tool-scenarios-15124/.github/skills/azure-typespec-author/evaluate/evals/001001.eval.yaml#L17-L73

This is the README on how to run the test cases: https://github.com/Azure/azure-sdk-tools/blob/feat/vally-tool-scenarios-15124/.github/skills/azure-typespec-author/evaluate/README.md

haolingdong-msft · 2026-06-05T01:26:16Z

+
+#### PR gate for essential workflows (open)
+
+A case for *narrow* PR gating: a small curated set of mock scenarios


For TypeSpec authoring, the PR gate might be a few typical benchmark cases run against the real MCP server, plus skill-invocation checks run against a mock MCP server.

praveenkuttappan · 2026-06-05T19:35:06Z

+### Folder layout
+
+```
+evals/


Is this layout specific to the tools repo? Where is the pipeline configured?

We should also consider the following:
- Evals for our common skills and tools
- Evals for repo-specific skills

praveenkuttappan · 2026-06-05T19:36:47Z

+
+| What it tests | Lives in |
+|---|---|
+| **One skill** (does this skill route, call its tools, return a sensible answer) | `.github/skills/<skill>/evals/` |


I think this will be good enough to cover the common skills and repo specific skills.

praveenkuttappan · 2026-06-05T19:42:51Z

+```
+Do you only care that the agent picks the right skill
+(you don't care which tools it then calls)?
+└── yes → .github/skills/<skill-name>/evals/   (not this project)


This should also be improved to grade that the right set of tools is picked.
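
As an illustration of what grading the set of tools could look like, the following Python sketch checks a list of called tools against a required set and a forbidden set. The tool names are the ones quoted in the spec's sample YAML in this PR; the `grade_tool_calls` function and its result shape are hypothetical and are not Vally's built-in tool-calls grader. The sketch checks set membership only and ignores call order and arguments.

```python
def grade_tool_calls(called: list[str], required: list[str],
                     forbidden: tuple[str, ...] = ()) -> dict:
    """Hypothetical set-based check: every required tool was called and no
    forbidden tool was called. Order and arguments are not inspected."""
    called_set = set(called)
    missing = [tool for tool in required if tool not in called_set]
    violations = [tool for tool in forbidden if tool in called_set]
    return {
        "passed": not missing and not violations,
        "missing": missing,
        "forbidden_called": violations,
    }


if __name__ == "__main__":
    result = grade_tool_calls(
        called=["azsdk_get_release_plan", "azsdk_run_generate_sdk"],
        required=["azsdk_get_release_plan", "azsdk_create_release_plan",
                  "azsdk_run_generate_sdk"],
        forbidden=("azsdk_verify_setup",),
    )
    print(result)
```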

praveenkuttappan · 2026-06-05T19:45:54Z

+capability stimulus. A `skill-eval-authoring` skill packages the
+pattern, grader catalog, and anti-patterns so other Azure SDK teams
+adopt without re-learning the gotchas.
+


Can you add a sample skill eval YAML and a sample workflow eval YAML, along with how each is configured and how to invoke the eval?

…po rollout, examples, and diagrams
- Merge cross-repo eval-platform design: discovery, change-detection, sharding, matrix fan-out, result aggregation, and the azure-sdk-tools -> language-repos rollout
- Address PR #15918 review: composable graders (not capped at three), language-repo reuse goal, common-vs-repo-specific layout, mandatory tool-calls grading
- Add sample skill + workflow eval YAML and vally invocation; describe TypeSpec mixed PR gate and skill-publish-gated-by-nightly lifecycle
- Add Mermaid diagrams: eval taxonomy, orchestration flow, decision flow, skill-publish lifecycle, cross-repo platform

praveenkuttappan · 2026-06-11T16:06:40Z

+| What it tests | Lives in |
+|---|---|
+| **One skill** (does this skill route, call its tools, return a sensible answer) | `.github/skills/<skill>/evals/` |
+| **Cross-skill / cross-tool** (multi-step chains, e2e flows, mock-server integration, anything that doesn't belong to one skill) | `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/` |


Each language repo and the specs repo might have cross-skill workflow scenarios as well, and we cannot expect them to put those in the tools repo.

- The tools repo can hold common workflow-based scenarios.
- Language repos can hold repo-specific scenarios.

The eval pipeline will collect all eval scenarios (common + repo-specific) before pipeline sharding.

Cross-skill workflow scenarios can be put in the .github/evals/workflow-scenarios directory in individual repos.
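
The collect-then-shard step described here can be sketched in Python as follows. The directory names come from this thread; the file names, function names, and the round-robin sharding policy are assumptions made for illustration, not the pipeline's actual behavior.

```python
def collect_scenarios(common: list[str], repo_specific: list[str]) -> list[str]:
    """Merge common and repo-specific eval paths, de-duplicated and sorted so
    that every pipeline run sees the same ordering."""
    return sorted(set(common) | set(repo_specific))


def shard(paths: list[str], shard_count: int) -> list[list[str]]:
    """Distribute paths round-robin across shard_count shards (assumed policy)."""
    shards: list[list[str]] = [[] for _ in range(shard_count)]
    for index, path in enumerate(paths):
        shards[index % shard_count].append(path)
    return shards


if __name__ == "__main__":
    # File names below are hypothetical examples.
    common = [
        "tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/a.eval.yaml",
    ]
    repo_specific = [
        ".github/evals/workflow-scenarios/b.eval.yaml",
        ".github/evals/workflow-scenarios/c.eval.yaml",
    ]
    for number, files in enumerate(shard(collect_scenarios(common, repo_specific), 2)):
        print(number, files)
```

Collecting before sharding keeps shard contents stable regardless of which repo contributed a scenario, which is the ordering this comment asks for.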

haolingdong-msft · 2026-06-12T07:26:30Z

+| Nightly (all) | schedule | everything | mock |
+| Live | manual / on demand | live-safe scenarios | live |
+
+#### Authoring stays self-service — generated pipelines, not hand-written YAML


Just wanted to check: do we have a preference between GitHub Actions and DevOps pipelines? Do we plan to support both?

I’m asking because the KB backend has improved authentication now, which should help us avoid the extra auth step. Previously, we were using DevOps pipelines mainly because of that requirement.

Given this, it feels like we may no longer have blockers to move to GitHub Actions. I’d be happy to try that approach, especially since we could later leverage GitHub’s agentic workflows to help analyze the pipeline evaluation results.

Would love to hear your thoughts!

… Vally (#15124) (#15811)

* Scaffold Azure.Sdk.Tools.Vally tool-scenario eval suite (#15124)
  Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).
  - README documents project intent, layout, local run instructions, and how to add a new scenario.
  - .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.
  - evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.
  - fixtures/.gitkeep reserves the per-scenario fixtures layout.

  Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.
* Port remaining 9 benchmark scenarios to Vally (#15124)
  Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:
  - check-public-repo-then-validate
  - validate-typespec
  - typespec-generation-step02
  - get-modified-typespec-projects (stub: needs git-repo fixture / setup hook)
  - add-arm-resource (stub: needs fixtures + npx tsp compile post-check)
  - create-release-plan
  - link-namespace-approval-issue
  - get-pr-link-current-branch
  - check-sdk-generation-status

  Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.
* Add rename-client-property stub eval to Vally suite (#15124)
  Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.
* Fix tool name prefix in graders, timeout format, expand README
* Reorganize evals into scenarios/ and triggers/; port trigger evals from #15183
  - Move 11 multi-step scenario evals to evals/scenarios/
  - Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
  - Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
  - Update .vally.yaml suites for new layout (scenarios, triggers, all)
  - Update README to document the split and per-trigger-file tool coverage
  - Add .gitignore for vally-results/ and results/
* update the config and use gpt-5.4 model
* add disallowed
* Vally: restructure evals into unit/integration/e2e test pyramid
  Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.
* Vally: remove Run-LiveEvals.ps1 (local-only test wrapper)
  Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.
* some docs and test e2e one
* update docs
* update design
* update with skill evals
* reorg based on the design
* remove the duplicates
* add new scenarios
* update the doc
* update doc
* update names
* Vally: align release-planner mock stimuli with live e2e pattern
  All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.
* update doc
* Vally: fix MCP boot race + drop misconfigured grader (#15948)
  - Launch pre-built DLLs via 'dotnet <dll>' in both .vally.yaml files instead of 'dotnet run', so N parallel workers no longer race on Roslyn's exclusive write lock for the output DLL.
  - Add 'Build MCP servers' step to eng/pipelines/skill-eval.yml so the CI runner has the DLLs ready before vally starts.
  - Drop the skill-invocation grader from generate-sdk-for-existing-release-plan (no preflight reasoning step required; tools-only).
  - Strip 'I'm in a checkout of azure-rest-api-specs.' preamble from prompts; the worktree already provides that context.
  - Remove stray '// tools skills response' artifact in live release-planner.eval.yaml.
  - README: document 'dotnet build' as a prereq; rewrite workers warning.

  Validated: scenarios-mock at --workers 6 -> 5/5 stimuli pass, 0 race hits, ~4 min.
* update readme for running steps
* Vally: align mock release-planner grader with live + deterministic 'not found' lookup
  The create-release-plan-and-generate-sdk mock stimulus required the agent to call azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the azsdk-common-prepare-release-plan skill's create flow asks for it. The agent correctly skipped the tool, and the grader flapped. The dedicated update-sdk-details-in-release-plan stimulus already covers that tool with an explicit prompt. Drop it from the create+generate grader so mock matches the live release-planner-e2e contract (create / get / generate / link). Also patch GetReleasePlanForSpecPrHandler to return a deterministic 'not found' response (ReleasePlanDetails = null). The mock previously returned a 'plan exists' result for any spec PR, pushing the agent down the update path instead of the create path that the stimulus exercises. Stimuli that target an existing plan pass the work-item ID directly and call azsdk_get_release_plan, so this is safe.
* update eval yaml
* Address PR #15811 review: fix stale paths, exit codes, build output, cache portability
  - README/eval comments: evals/unit -> evals/tools, evals/scenarios -> evals/workflow-scenarios (Copilot C1/C5)
  - Validate-EvalTools.ps1: default EvalPath -> evals/tools; return 1 -> exit 1 so CI fails loudly (Copilot C2/C3)
  - MCP build output: dotnet build -o artifacts/mcp/{cli,mock}; pipeline switched to Release; .vally.yaml no longer hardcodes Debug/net8.0 (Praveen #1/#2)
  - ensure-specs-clone.ps1 + workflow evals: repo-relative artifacts/specs-cache path instead of C:/Users/gaoh; Vally resolves it relative to the eval file so it works for all contributors + CI (Copilot C6/C7, Praveen #4)
  - add-arm-resource/rename-client-property: comment clarifying 'edit' is the Copilot SDK built-in file tool, not an MCP tool (Praveen #5)
* Refactor Vally tool evals: rename triggers-* to prompt-to-tool-*, consolidate standalone single-tool evals
  - Rename evals/tools/triggers-*.eval.yaml to prompt-to-tool-*.eval.yaml (Praveen review #6)
  - Consolidate 7 standalone single-tool scenario evals into the matching namespace files as full-context checks (check-public-repo, check-sdk-generation-status, create-release-plan, get-modified-typespec-projects, get-pr-link-current-branch, link-namespace-approval-issue, validate-typespec)
  - Keep add-arm-resource.eval.yaml standalone (produces a file edit, not a pure tool trigger)
  - Switch tool evals to gpt-5.4 and add explicit 'use the available Azure SDK MCP tools' steering plus concrete grounding to bare trigger prompts so they invoke the MCP tool reliably
  - Update README evals/tools section and Validate-EvalTools.ps1 to the new file names
* Remove agent-eval-strategy design spec from PR (now reviewed standalone in #15918)
* Drop flaky edit-tool assertion from add-arm-resource eval
* remove script
* Stabilize flaky tool-scenario prompts and add README command cookbook
  Ground 13 previously-flaky prompts with concrete IDs/paths so they route deterministically to the intended MCP tool; make the mock check-service-label handler convention-driven (status derived from the requested serviceLabel); document common vally invocation recipes in the README.
* Fix outdated command examples in Vally README
  Replace references to consolidated/non-existent eval files (create-release-plan, check-public-repo, link-namespace-approval-issue) with the real prompt-to-tool-* and workflow-scenario files; correct the default output path to ./vally-results/<timestamp>/; fix the cookbook results.jsonl parser to locate the newest timestamped run; add the missing release-planner-workflows mock scenario to the index.
* Fix invalid prompt-grader config in live release-planner eval
  The prompt (LLM-judge) grader schema uses 'prompt' for the rubric text, not 'rubric'. Rename the field and add 'scoring: binary' (the rubric is pass/fail) so the spec validates.

Copilot

Pull request overview

Adds a new design/spec document describing the intended evaluation strategy for agent-driven azsdk-cli workflows (using Vally), including eval taxonomy (skill/workflow/tool), folder layout, grader requirements, and CI cadence/reporting expectations.

Changes:

- Introduces an “eval pyramid” covering per-skill evals, cross-skill workflow scenarios (mock/live), and a hermetic tool-shape layer.
- Documents where eval content should live and how it should be structured under evals/.
- Defines required graders per tier and outlines CI execution + reporting approach (including cost/flake considerations).

+
+| Run mode | MCP | Repos? | When | Coverage |
+|---|---|---|---|---|
+| Workflows — Mock | mock (stub, no LLM) | azure-sdk-tools only | nightly + on demand | every scenario |


+            - azsdk_get_release_plan
+            - azsdk_create_release_plan
+            - azsdk_run_generate_sdk
+          forbidden: [azsdk_verify_setup]


+Both kinds are native Vally `*.eval.yaml`. Below are minimal, real-shaped
+samples plus how to configure and run them.
+
+**Skill eval** (`.github/skills/<skill>/evals/<name>.eval.yaml`) — proves


github-actions bot added the azsdk-cli ("Issues related to Azure/azure-sdk-tools::tools/azsdk-cli") and design-discussion ("An area of design currently under discussion and open to team and community feedback.") labels on Jun 4, 2026

helen229 added 2 commits June 4, 2026 15:31

Update rationale and scenario file explanations

a758e50

Refine rationale and clarify scenario file details.

Refine description of tools in evaluation strategy spec

fa56972

simpler explanation

haolingdong-msft reviewed Jun 5, 2026

praveenkuttappan reviewed Jun 5, 2026

Comment thread tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md Outdated

praveenkuttappan reviewed Jun 5, 2026

praveenkuttappan reviewed Jun 11, 2026

helen229 added 6 commits June 11, 2026 10:55

update

5dfb7d6

docs: rewrite CI section of agent-eval-strategy spec for clarity and …

095f006

…add diagrams

docs: fold meeting outcomes into eval-strategy spec (eval-gen automat…

cf7dc82

…ion, validator, phased reporting, eng/common, concurrency cap, implementation plan)

update format

70e461e

update wording

1666e5f

docs: revise CI to per-skill ci.yml pipelines per review

2f74ed8

haolingdong-msft reviewed Jun 12, 2026

helen229 added 2 commits June 12, 2026 09:53

docs: break implementation plan into dependency-ordered work items

c6a17ae

docs: declare cross-repo fixtures via metadata.repos + shared cross-p…

466292e

…latform clone helper

helen229 added a commit that referenced this pull request Jun 15, 2026

Remove agent-eval-strategy design spec from PR (now reviewed standalo…

f714a91

…ne in #15918)

helen229 marked this pull request as ready for review June 16, 2026 17:14

helen229 requested a review from a team as a code owner June 16, 2026 17:14

Copilot AI review requested due to automatic review settings June 16, 2026 17:14

Merge branch 'main' into docs-agent-eval-strategy-spec

cc795db

Copilot started reviewing on behalf of helen229 June 16, 2026 17:14

Copilot AI reviewed Jun 16, 2026

helen229 wants to merge 13 commits into main from docs-agent-eval-strategy-spec

4 participants
