docs: add agent-eval-strategy spec for azsdk-cli operations #15918
helen229 wants to merge 13 commits into
New design doc covering the eval pyramid (per-skill, mock workflow, live workflow, hermetic tool-shape), folder layout, required graders per workflow tier, and where each eval lives. Pulled out of #15811 (Vally port PR) so it can be reviewed standalone before the implementation lands.
Refine rationale and clarify scenario file details.
simpler explanation
haolingdong-msft
left a comment
Thanks @helen229 for the spec! I added some comments from the TypeSpec authoring side, and I'm happy to discuss further or provide more information.
| willing to spend before we move it back off the PR? |
|
| We need owners' input on all four before turning the gate on. |
Adding my two cents on the azure-typespec-author skill evaluation and skill publish workflow:
Skill code update -> PR-level validation (pick a few benchmark cases to run, and run the mock server to validate skill invocation) -> merge to main -> run the nightly build for all benchmarks -> publish if the nightly build succeeds.
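To make the ordering in that flow concrete, here is a minimal Python sketch of a gated lifecycle. The stage names and the `run_lifecycle` / `can_publish` helpers are hypothetical labels for the steps listed above, not anything defined by the spec or by Vally; the only point is that publish is reached solely when every earlier stage, including the nightly run, succeeds.

```python
from typing import Callable, Dict, List

# Hypothetical stage names mirroring the flow described in the comment.
STAGES: List[str] = [
    "pr_validation",       # a few benchmark cases + mock-server skill-invocation check
    "merge_to_main",
    "nightly_benchmarks",  # full benchmark run
    "publish",
]


def run_lifecycle(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run stages in order and stop at the first failing one.

    `checks` maps a stage name to a zero-argument callable that returns True
    on success. Stages without a check are treated as passing.
    Returns the names of the stages that completed.
    """
    completed: List[str] = []
    for stage in STAGES:
        check = checks.get(stage)
        if check is not None and not check():
            break
        completed.append(stage)
    return completed


def can_publish(checks: Dict[str, Callable[[], bool]]) -> bool:
    """True only when the lifecycle ran all the way through `publish`."""
    return run_lifecycle(checks)[-1:] == ["publish"]
```

For example, a failing nightly run (`{"nightly_benchmarks": lambda: False}`) stops the lifecycle after `merge_to_main`, so nothing is published.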
| - **Workflow scenario**: a user prompt that crosses multiple tools / skills |
| end-to-end (e.g. *create release plan → generate SDK → link the SDK PR*). |
| - **Stimulus**: one prompt + its expected behavior, the unit of an eval. |
| - **Three graders per stimulus**: `skill-invocation` (right skill picked), |
For typespec authoring, the case might be more complex. Three graders are not enough. This is one of our cases: https://github.com/Azure/azure-sdk-tools/blob/feat/vally-tool-scenarios-15124/.github/skills/azure-typespec-author/evaluate/evals/001001.eval.yaml#L17-L73
This is the readme about how to run the test cases: https://github.com/Azure/azure-sdk-tools/blob/feat/vally-tool-scenarios-15124/.github/skills/azure-typespec-author/evaluate/README.md
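One way to read "three graders are not enough" is that a stimulus should carry an open-ended list of graders rather than a fixed trio. The sketch below illustrates that idea in Python under stated assumptions: the transcript shape (`skills`, `tools`, `response` keys) and all helper names are hypothetical and are not the Vally grader API or the schema of the linked eval file.

```python
from typing import Callable, Dict, List

# Hypothetical transcript shape: what the agent did for one stimulus.
Transcript = Dict[str, object]
Grader = Callable[[Transcript], bool]


def skill_invoked(expected: str) -> Grader:
    """Pass when the expected skill appears among the invoked skills."""
    return lambda t: expected in t.get("skills", [])


def tools_called(*required: str) -> Grader:
    """Pass when every required tool was called at least once."""
    return lambda t: set(required) <= set(t.get("tools", []))


def response_mentions(text: str) -> Grader:
    """Pass when the final response contains the text (case-insensitive)."""
    return lambda t: text.lower() in str(t.get("response", "")).lower()


def failed_graders(transcript: Transcript, graders: Dict[str, Grader]) -> List[str]:
    """Return the names of graders that did not pass.

    An empty list means the stimulus passes; any number of graders can be
    attached to a single stimulus.
    """
    return [name for name, grader in graders.items() if not grader(transcript)]
```

A complex authoring case can then attach as many named graders as it needs, and the failure list says exactly which expectation was missed.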
|
| #### PR gate for essential workflows (open) |
|
| A case for *narrow* PR gating: a small curated set of mock scenarios |
For TypeSpec authoring, the PR gate might run a few typical benchmark cases against the real MCP server, plus skill-invocation checks against a mock MCP server.
| ### Folder layout |
|
| ``` |
| evals/ |
Is this layout specific to the tools repo? Where is the pipeline configured?
We should also consider the following:
- Evals for our common skills and tools
- Evals for repo-specific skills
|
| | What it tests | Lives in | |
| |---|---| |
| | **One skill** (does this skill route, call its tools, return a sensible answer) | `.github/skills/<skill>/evals/` | |
I think this will be good enough to cover the common skills and repo specific skills.
| ``` |
| Do you only care that the agent picks the right skill |
| (you don't care which tools it then calls)? |
| └── yes → .github/skills/<skill-name>/evals/ (not this project) |
This should also be improved to grade that the right set of tools is picked.
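As a rough illustration of what grading "the right set of tools" could mean, the Python sketch below checks a list of called tools against a required set and a forbidden set. The `grade_tool_calls` function and `ToolCallVerdict` type are hypothetical and do not reproduce Vally's built-in `tool-calls` grader; the tool names in the usage note are taken from the spec sample quoted in this PR.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ToolCallVerdict:
    passed: bool
    missing: List[str] = field(default_factory=list)        # required but never called
    forbidden_hit: List[str] = field(default_factory=list)  # forbidden but called


def grade_tool_calls(
    called: Iterable[str],
    required: Iterable[str],
    forbidden: Iterable[str] = (),
) -> ToolCallVerdict:
    """Pass only if every required tool was called and no forbidden tool was.

    Call order and arguments are deliberately ignored in this sketch.
    """
    called_set = set(called)
    missing = sorted(set(required) - called_set)
    forbidden_hit = sorted(set(forbidden) & called_set)
    return ToolCallVerdict(
        passed=not missing and not forbidden_hit,
        missing=missing,
        forbidden_hit=forbidden_hit,
    )
```

With `required` set to `azsdk_get_release_plan`, `azsdk_create_release_plan`, and `azsdk_run_generate_sdk`, and `forbidden` set to `azsdk_verify_setup`, a run that only called `azsdk_get_release_plan` and `azsdk_verify_setup` fails on both counts, and the verdict names the offending tools.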
| capability stimulus. A `skill-eval-authoring` skill packages the |
| pattern, grader catalog, and anti-patterns so other Azure SDK teams |
| adopt without re-learning the gotchas. |
Can you add a sample skill eval and a sample workflow eval YAML, showing how each is configured and how to invoke the eval?
…po rollout, examples, and diagrams
- Merge cross-repo eval-platform design: discovery, change-detection, sharding, matrix fan-out, result aggregation, and the azure-sdk-tools -> language-repos rollout
- Address PR #15918 review: composable graders (not capped at three), language-repo reuse goal, common-vs-repo-specific layout, mandatory tool-calls grading
- Add sample skill + workflow eval YAML and vally invocation; describe TypeSpec mixed PR gate and skill-publish-gated-by-nightly lifecycle
- Add Mermaid diagrams: eval taxonomy, orchestration flow, decision flow, skill-publish lifecycle, cross-repo platform
| | What it tests | Lives in | |
| |---|---| |
| | **One skill** (does this skill route, call its tools, return a sensible answer) | `.github/skills/<skill>/evals/` | |
| | **Cross-skill / cross-tool** (multi-step chains, e2e flows, mock-server integration, anything that doesn't belong to one skill) | `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/` | |
Each language repo and the specs repo might have cross-skill workflow scenarios as well, and we cannot expect them to put those in the tools repo.
- The tools repo can hold the common workflow-based scenarios.
- Each language repo can hold its repo-specific scenarios.
- The eval pipeline will collect all eval scenarios (common + repo-specific) before sharding.
- Cross-skill workflow scenarios can go in the `.github/evals/workflow-scenarios/` directory of each individual repo.
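The collect-then-shard step could look roughly like the Python sketch below. It assumes eval files are named `*.eval.yaml` and live under `.github/skills/<skill>/evals/` or `.github/evals/workflow-scenarios/` in each checked-out repo; the function names and the round-robin sharding are illustrative choices, not the pipeline's actual design.

```python
from pathlib import Path
from typing import Iterable, List


def discover_evals(repo_roots: Iterable[Path]) -> List[Path]:
    """Collect eval files from every repo checkout (common + repo-specific).

    Looks in the two locations discussed in this thread; anything else is
    ignored. The result is sorted so sharding is deterministic.
    """
    found = set()
    for root in repo_roots:
        github = root / ".github"
        found.update(github.glob("skills/*/evals/**/*.eval.yaml"))
        found.update(github.glob("evals/workflow-scenarios/**/*.eval.yaml"))
    return sorted(found)


def shard(items: List[Path], shard_count: int) -> List[List[Path]]:
    """Split items round-robin into `shard_count` lists of near-equal size."""
    if shard_count < 1:
        raise ValueError("shard_count must be >= 1")
    return [items[i::shard_count] for i in range(shard_count)]
```

Discovery runs first over all repos, so the shards see one merged, sorted list instead of each repo being sharded separately.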
…ion, validator, phased reporting, eng/common, concurrency cap, implementation plan)
| | Nightly (all) | schedule | everything | mock | |
| | Live | manual / on demand | live-safe scenarios | live | |
|
| #### Authoring stays self-service — generated pipelines, not hand-written YAML |
Just wanted to check: do we have a preference between GitHub Actions and DevOps pipelines? Do we plan to support both?
I’m asking because the KB backend has improved authentication now, which should help us avoid the extra auth step. Previously, we were using DevOps pipelines mainly because of that requirement.
Given this, it feels like we may no longer have blockers to move to GitHub Actions. I’d be happy to try that approach, especially since we could later leverage GitHub’s agentic workflows to help analyze the pipeline evaluation results.
Would love to hear your thoughts!
… Vally (#15124) (#15811)
* Scaffold Azure.Sdk.Tools.Vally tool-scenario eval suite (#15124)
  Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).
  - README documents project intent, layout, local run instructions, and how to add a new scenario.
  - .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.
  - evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.
  - fixtures/.gitkeep reserves the per-scenario fixtures layout.
  Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.
* Port remaining 9 benchmark scenarios to Vally (#15124)
  Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:
  - check-public-repo-then-validate
  - validate-typespec
  - typespec-generation-step02
  - get-modified-typespec-projects (stub: needs git-repo fixture / setup hook)
  - add-arm-resource (stub: needs fixtures + npx tsp compile post-check)
  - create-release-plan
  - link-namespace-approval-issue
  - get-pr-link-current-branch
  - check-sdk-generation-status
  Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.
* Add rename-client-property stub eval to Vally suite (#15124)
  Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.
* Fix tool name prefix in graders, timeout format, expand README
* Reorganize evals into scenarios/ and triggers/; port trigger evals from #15183
  - Move 11 multi-step scenario evals to evals/scenarios/
  - Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
  - Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
  - Update .vally.yaml suites for new layout (scenarios, triggers, all)
  - Update README to document the split and per-trigger-file tool coverage
  - Add .gitignore for vally-results/ and results/
* update the config and use gpt-5.4 model
* add disallowed
* Vally: restructure evals into unit/integration/e2e test pyramid
  Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.
* Vally: remove Run-LiveEvals.ps1 (local-only test wrapper)
  Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.
* some docs and test e2e one
* update docs
* udpate design
* update with skill evals
* reorg based on the design
* remove the duplicates
* add new scenarios
* update the doc
* update doc
* update names
* Vally: align release-planner mock stimuli with live e2e pattern
  All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.
* update doc
* Vally: fix MCP boot race + drop misconfigured grader (#15948)
  - Launch pre-built DLLs via 'dotnet <dll>' in both .vally.yaml files instead of 'dotnet run', so N parallel workers no longer race on Roslyn's exclusive write lock for the output DLL.
  - Add 'Build MCP servers' step to eng/pipelines/skill-eval.yml so the CI runner has the DLLs ready before vally starts.
  - Drop the skill-invocation grader from generate-sdk-for-existing-release-plan (no preflight reasoning step required; tools-only).
  - Strip 'I'm in a checkout of azure-rest-api-specs.' preamble from prompts; the worktree already provides that context.
  - Remove stray '// tools skills response' artifact in live release-planner.eval.yaml.
  - README: document 'dotnet build' as a prereq; rewrite workers warning.
  Validated: scenarios-mock at --workers 6 -> 5/5 stimuli pass, 0 race hits, ~4 min.
* update readme for runing steps
* Vally: align mock release-planner grader with live + deterministic 'not found' lookup
  The create-release-plan-and-generate-sdk mock stimulus required the agent to call azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the azsdk-common-prepare-release-plan skill's create flow asks for it. The agent correctly skipped the tool, and the grader flapped. The dedicated update-sdk-details-in-release-plan stimulus already covers that tool with an explicit prompt. Drop it from the create+generate grader so mock matches the live release-planner-e2e contract (create / get / generate / link).
  Also patch GetReleasePlanForSpecPrHandler to return a deterministic 'not found' response (ReleasePlanDetails = null). The mock previously returned a 'plan exists' result for any spec PR, pushing the agent down the update path instead of the create path that the stimulus exercises. Stimuli that target an existing plan pass the work-item ID directly and call azsdk_get_release_plan, so this is safe.
* update eval yaml
* Address PR #15811 review: fix stale paths, exit codes, build output, cache portability
  - README/eval comments: evals/unit -> evals/tools, evals/scenarios -> evals/workflow-scenarios (Copilot C1/C5)
  - Validate-EvalTools.ps1: default EvalPath -> evals/tools; return 1 -> exit 1 so CI fails loudly (Copilot C2/C3)
  - MCP build output: dotnet build -o artifacts/mcp/{cli,mock}; pipeline switched to Release; .vally.yaml no longer hardcodes Debug/net8.0 (Praveen #1/#2)
  - ensure-specs-clone.ps1 + workflow evals: repo-relative artifacts/specs-cache path instead of C:/Users/gaoh; Vally resolves it relative to the eval file so it works for all contributors + CI (Copilot C6/C7, Praveen #4)
  - add-arm-resource/rename-client-property: comment clarifying 'edit' is the Copilot SDK built-in file tool, not an MCP tool (Praveen #5)
* Refactor Vally tool evals: rename triggers-* to prompt-to-tool-*, consolidate standalone single-tool evals
  - Rename evals/tools/triggers-*.eval.yaml to prompt-to-tool-*.eval.yaml (Praveen review #6)
  - Consolidate 7 standalone single-tool scenario evals into the matching namespace files as full-context checks (check-public-repo, check-sdk-generation-status, create-release-plan, get-modified-typespec-projects, get-pr-link-current-branch, link-namespace-approval-issue, validate-typespec)
  - Keep add-arm-resource.eval.yaml standalone (produces a file edit, not a pure tool trigger)
  - Switch tool evals to gpt-5.4 and add explicit 'use the available Azure SDK MCP tools' steering plus concrete grounding to bare trigger prompts so they invoke the MCP tool reliably
  - Update README evals/tools section and Validate-EvalTools.ps1 to the new file names
* Remove agent-eval-strategy design spec from PR (now reviewed standalone in #15918)
* Drop flaky edit-tool assertion from add-arm-resource eval
* remove script
* Stabilize flaky tool-scenario prompts and add README command cookbook
  Ground 13 previously-flaky prompts with concrete IDs/paths so they route deterministically to the intended MCP tool; make the mock check-service-label handler convention-driven (status derived from the requested serviceLabel); document common vally invocation recipes in the README.
* Fix outdated command examples in Vally README
  Replace references to consolidated/non-existent eval files (create-release-plan, check-public-repo, link-namespace-approval-issue) with the real prompt-to-tool-* and workflow-scenario files; correct the default output path to ./vally-results/<timestamp>/; fix the cookbook results.jsonl parser to locate the newest timestamped run; add the missing release-planner-workflows mock scenario to the index.
* Fix invalid prompt-grader config in live release-planner eval
  The prompt (LLM-judge) grader schema uses 'prompt' for the rubric text, not 'rubric'. Rename the field and add 'scoring: binary' (the rubric is pass/fail) so the spec validates.
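The commit message mentions results landing under `./vally-results/<timestamp>/` and a cookbook parser that locates the newest timestamped run. A minimal Python sketch of that lookup follows. It is not the README's actual parser: it assumes `results.jsonl` sits directly inside each run directory and that run directory names sort chronologically, and the record fields used in any example data are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Optional


def newest_run_dir(results_root: Path) -> Optional[Path]:
    """Return the most recent run directory under `results_root`, if any.

    Assumption: run directories are named with timestamps that sort
    lexicographically in chronological order.
    """
    if not results_root.is_dir():
        return None
    runs = sorted(p for p in results_root.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def load_results(results_root: Path) -> List[dict]:
    """Parse results.jsonl from the newest run; empty list if nothing is found.

    Assumption: results.jsonl lives directly inside the run directory and
    holds one JSON object per non-blank line.
    """
    run = newest_run_dir(results_root)
    if run is None:
        return []
    path = run / "results.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Calling `load_results(Path("vally-results"))` after a run would return the parsed records of the latest run only, which avoids mixing results from older timestamped directories.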
Pull request overview
Adds a new design/spec document describing the intended evaluation strategy for agent-driven azsdk-cli workflows (using Vally), including eval taxonomy (skill/workflow/tool), folder layout, grader requirements, and CI cadence/reporting expectations.
Changes:
- Introduces an “eval pyramid” covering per-skill evals, cross-skill workflow scenarios (mock/live), and a hermetic tool-shape layer.
- Documents where eval content should live and how it should be structured under `evals/`.
- Defines required graders per tier and outlines CI execution + reporting approach (including cost/flake considerations).
|
| | Run mode | MCP | Repos? | When | Coverage | |
| |---|---|---|---|---| |
| | Workflows — Mock | mock (stub, no LLM) | azure-sdk-tools only | nightly + on demand | every scenario | |
| Both kinds are native Vally `*.eval.yaml`. Below are minimal, real-shaped |
| samples plus how to configure and run them. |
| - azsdk_get_release_plan |
| - azsdk_create_release_plan |
| - azsdk_run_generate_sdk |
| forbidden: [azsdk_verify_setup] |
| Both kinds are native Vally `*.eval.yaml`. Below are minimal, real-shaped |
| samples plus how to configure and run them. |
|
| **Skill eval** (`.github/skills/<skill>/evals/<name>.eval.yaml`) — proves |
Pulls the new agent-eval-strategy spec out of #15811 (the Vally port PR) so it can be reviewed standalone before the implementation lands.
What's in here
A single new file: `tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md`.
The spec covers:
- … `SKILL.md`; cross-skill / cross-tool evals live in `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/`.
- … `evals/` (`tools/`, `workflow-scenarios/{mock,live}/`, `setup/`).
- … `tool-calls` (skill-invocation optional, response grader N/A); live workflows require `tool-calls` + `skill-invocation` + a response grader (`prompt` / LLM-judge).
- … `dotnet run` build contention).
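The per-tier grader requirement lends itself to a small lint-style check. The Python sketch below encodes one reading of the description: `tool-calls` is the only required grader for mock workflows, and live workflows need `tool-calls`, `skill-invocation`, and a `prompt` response grader. The tier keys, the table, and the `missing_graders` helper are hypothetical and are not part of Vally or the spec.

```python
from typing import Dict, Iterable, Set

# Hypothetical encoding of the required graders per workflow tier,
# based on one reading of the PR description.
REQUIRED_GRADERS: Dict[str, Set[str]] = {
    "mock": {"tool-calls"},
    "live": {"tool-calls", "skill-invocation", "prompt"},
}


def missing_graders(tier: str, configured: Iterable[str]) -> Set[str]:
    """Return the required graders a scenario of this tier has not configured."""
    if tier not in REQUIRED_GRADERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return REQUIRED_GRADERS[tier] - set(configured)
```

A live scenario that configures only `tool-calls` would be reported as missing `skill-invocation` and `prompt`, while the same configuration is complete for a mock scenario.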