docs: add agent-eval-strategy spec for azsdk-cli operations #15918

Open
helen229 wants to merge 13 commits into main from docs-agent-eval-strategy-spec


@helen229 helen229 commented Jun 4, 2026

Member

Pulls the new agent-eval-strategy spec out of #15811 (the Vally port PR) so it can be reviewed standalone before the implementation lands.

What's in here

A single new file: tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md.

The spec covers:

  • Eval pyramid: per-skill evals, mock workflow scenarios, live workflow scenarios, plus a hermetic tool-shape layer.
  • Where each eval lives: skill evals stay next to SKILL.md; cross-skill / cross-tool evals live in tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/.
  • Folder layout under evals/ (tools/, workflow-scenarios/{mock,live}/, setup/).
  • Required graders by tier: mock workflows require tool-calls (skill-invocation optional, response grader N/A); live workflows require tool-calls + skill-invocation + a response grader (prompt / LLM-judge).
  • Run cadence (PR vs nightly vs weekly), open questions, and known runner gaps from the first post-merge run (skill-dir requirement, MCP stdio boot race traced to dotnet run build contention).
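As an illustration of the grader-by-tier rule above, here is a minimal Python sketch of a lint-style check. It assumes a scenario can be reduced to a tier name plus the set of grader names it declares. `REQUIRED_GRADERS` and `missing_graders` are hypothetical names, not part of Vally or the spec, and treating the response grader as `prompt` is an assumption.

```python
# Hypothetical lint check: which required graders does a workflow scenario lack?
# The tier rules follow the PR description; the data model is illustrative only.

REQUIRED_GRADERS = {
    # Mock workflows: tool-calls required, skill-invocation optional, no response grader.
    "mock": {"tool-calls"},
    # Live workflows: tool-calls + skill-invocation + a response grader (assumed "prompt").
    "live": {"tool-calls", "skill-invocation", "prompt"},
}


def missing_graders(tier: str, declared: set[str]) -> set[str]:
    """Return the required graders that a scenario of this tier does not declare."""
    if tier not in REQUIRED_GRADERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return REQUIRED_GRADERS[tier] - declared


if __name__ == "__main__":
    print(missing_graders("mock", {"tool-calls"}))
    print(sorted(missing_graders("live", {"tool-calls"})))
```

A check like this could run at lint time so that a live scenario without a response grader fails before any model is invoked.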

New design doc covering the eval pyramid (per-skill, mock workflow, live workflow, hermetic tool-shape), folder layout, required graders per workflow tier, and where each eval lives. Pulled out of #15811 (Vally port PR) so it can be reviewed standalone before the implementation lands.
@github-actions github-actions Bot added azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli design-discussion An area of design currently under discussion and open to team and community feedback. labels Jun 4, 2026
helen229 added 2 commits June 4, 2026 15:31
Refine rationale and clarify scenario file details.

@haolingdong-msft haolingdong-msft left a comment

Member

Thanks @helen229 for the spec! I added some comments from the TypeSpec authoring side, and I'm happy to discuss further or provide more information from that side.

willing to spend before we move it back off the PR?

We need owners' input on all four before turning the gate on.

Member

Adding my two cents on azure-typespec-author skill evaluation and skill publish workflow:
Skill code update -> PR-level validation (pick a few benchmark cases to run, and run a mock server to validate skill invocation) -> merge to main -> run the nightly build for all benchmarks -> publish if the nightly build succeeds.
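
To make the proposed lifecycle concrete, here is a small Python sketch of the gate ordering. `SkillChange` and `next_step` are hypothetical names used only for illustration. The real pipeline would take these signals from CI rather than from booleans.

```python
from dataclasses import dataclass


@dataclass
class SkillChange:
    """Hypothetical status record for one skill update."""
    pr_validation_passed: bool = False  # a few benchmark cases + mock-server skill invocation
    merged_to_main: bool = False
    nightly_passed: bool = False        # full benchmark run on main


def next_step(change: SkillChange) -> str:
    """Return the next lifecycle step; publishing is gated on a green nightly run."""
    if not change.pr_validation_passed:
        return "run PR-level validation"
    if not change.merged_to_main:
        return "merge to main"
    if not change.nightly_passed:
        return "wait for a green nightly run"
    return "publish"
```

The point of the ordering is that a merged change is never published until the full nightly benchmark run has passed on main.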

- **Workflow scenario**: a user prompt that crosses multiple tools / skills
end-to-end (e.g. *create release plan → generate SDK → link the SDK PR*).
- **Stimulus**: one prompt + its expected behavior — the unit of an eval.
- **Three graders per stimulus**: `skill-invocation` (right skill picked),

Member


#### PR gate for essential workflows (open)

A case for *narrow* PR gating: a small curated set of mock scenarios

Member

For TypeSpec authoring, the PR gate might combine a few typical benchmark cases run against the real MCP server with skill-invocation checks run against a mock MCP server.

Comment thread tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md Outdated
### Folder layout

```
evals/
```

Member

Is this layout specific to the tools repo? Where is the pipeline configured?

We should also consider the following:
- Evals for our common skills and tools
- Evals for repo-specific skills


| What it tests | Lives in |
|---|---|
| **One skill** (does this skill route, call its tools, return a sensible answer) | `.github/skills/<skill>/evals/` |

Member

I think this will be good enough to cover both the common skills and the repo-specific skills.

```
Do you only care that the agent picks the right skill
(you don't care which tools it then calls)?
└── yes → .github/skills/<skill-name>/evals/ (not this project)
```

Member

This should also be improved to grade that the right set of tools is picked.
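
For illustration, grading "the right set of tools" could look like the Python sketch below. It is not Vally's built-in `tool-calls` grader; `grade_tool_calls` and its result shape are hypothetical. The tool names in the usage example are the ones that appear in the spec's sample eval.

```python
def grade_tool_calls(called, required, forbidden=()):
    """Pass only if every required tool was called and no forbidden tool was.

    Call order and arguments are ignored here; this checks presence only.
    """
    called_set = set(called)
    missing = sorted(set(required) - called_set)
    forbidden_called = sorted(set(forbidden) & called_set)
    return {
        "passed": not missing and not forbidden_called,
        "missing": missing,
        "forbidden_called": forbidden_called,
    }


if __name__ == "__main__":
    result = grade_tool_calls(
        called=["azsdk_get_release_plan", "azsdk_verify_setup"],
        required=[
            "azsdk_get_release_plan",
            "azsdk_create_release_plan",
            "azsdk_run_generate_sdk",
        ],
        forbidden=["azsdk_verify_setup"],
    )
    print(result)
```

Returning the missing and forbidden lists, rather than a bare boolean, makes a failed stimulus easier to triage.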

capability stimulus. A `skill-eval-authoring` skill packages the
pattern, grader catalog, and anti-patterns so other Azure SDK teams
adopt without re-learning the gotchas.

Member

Can you add sample skill eval and workflow eval yaml? How it's configured and how to invoke the eval.

Member Author

added

…po rollout, examples, and diagrams

- Merge cross-repo eval-platform design: discovery, change-detection, sharding, matrix fan-out, result aggregation, and the azure-sdk-tools -> language-repos rollout

- Address PR #15918 review: composable graders (not capped at three), language-repo reuse goal, common-vs-repo-specific layout, mandatory tool-calls grading

- Add sample skill + workflow eval YAML and vally invocation; describe TypeSpec mixed PR gate and skill-publish-gated-by-nightly lifecycle

- Add Mermaid diagrams: eval taxonomy, orchestration flow, decision flow, skill-publish lifecycle, cross-repo platform
| What it tests | Lives in |
|---|---|
| **One skill** (does this skill route, call its tools, return a sensible answer) | `.github/skills/<skill>/evals/` |
| **Cross-skill / cross-tool** (multi-step chains, e2e flows, mock-server integration, anything that doesn't belong to one skill) | `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/` |

Member

Each language repo and the specs repo might have cross-skill workflow scenarios as well, and we cannot expect them to put those in the tools repo.

- The tools repo can hold the common workflow-based scenarios.
- Each language repo can hold its repo-specific scenarios.

The eval pipeline will collect all eval scenarios (common + repo-specific) before the pipeline shards them.

Cross-skill workflow scenarios can be placed in the `.github/evals/workflow-scenarios` directory of each individual repo.
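
A rough Python sketch of the "collect, then shard" step described above. It assumes eval files follow the `*.eval.yaml` naming used elsewhere in this PR. `collect_evals` and `shard` are hypothetical helpers, and round-robin is only one possible sharding policy.

```python
from pathlib import Path


def collect_evals(roots):
    """Gather *.eval.yaml files under every root (common + repo-specific).

    Missing roots are skipped; the result is sorted so sharding is deterministic.
    """
    found = []
    for root in roots:
        root_path = Path(root)
        if root_path.is_dir():
            found.extend(root_path.rglob("*.eval.yaml"))
    return sorted(found)


def shard(items, shard_count):
    """Round-robin split: item i goes to shard i % shard_count."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return [items[i::shard_count] for i in range(shard_count)]
```

For example, a pipeline could call `collect_evals` with the common scenario folder from the tools repo plus the language repo's own `.github/evals/workflow-scenarios` folder, then hand each list from `shard` to one matrix job.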

| Nightly (all) | schedule | everything | mock |
| Live | manual / on demand | live-safe scenarios | live |

#### Authoring stays self-service — generated pipelines, not hand-written YAML

Member

Just wanted to check: do we have a preference between GitHub Actions and DevOps pipelines? Do we plan to support both?

I’m asking because the KB backend has improved authentication now, which should help us avoid the extra auth step. Previously, we were using DevOps pipelines mainly because of that requirement.

Given this, it feels like we may no longer have blockers to move to GitHub Actions. I’d be happy to try that approach, especially since we could later leverage GitHub’s agentic workflows to help analyze the pipeline evaluation results.

Would love to hear your thoughts!

helen229 added a commit that referenced this pull request Jun 16, 2026
… Vally (#15124) (#15811)

* Scaffold Azure.Sdk.Tools.Vally tool-scenario eval suite (#15124)

Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).

- README documents project intent, layout, local run instructions, and how to add a new scenario.

- .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.

- evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.

- fixtures/.gitkeep reserves the per-scenario fixtures layout.

Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.

* Port remaining 9 benchmark scenarios to Vally (#15124)

Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:

- check-public-repo-then-validate

- validate-typespec

- typespec-generation-step02

- get-modified-typespec-projects (stub — needs git-repo fixture / setup hook)

- add-arm-resource (stub — needs fixtures + npx tsp compile post-check)

- create-release-plan

- link-namespace-approval-issue

- get-pr-link-current-branch

- check-sdk-generation-status

Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.

* Add rename-client-property stub eval to Vally suite (#15124)

Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.

* Fix tool name prefix in graders, timeout format, expand README

* Reorganize evals into scenarios/ and triggers/; port trigger evals from #15183

- Move 11 multi-step scenario evals to evals/scenarios/
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
- Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
- Update .vally.yaml suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add .gitignore for vally-results/ and results/

* update the config and use gpt-5.4 model

* add disallowed

* Vally: restructure evals into unit/integration/e2e test pyramid

Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.

* Vally: remove Run-LiveEvals.ps1 (local-only test wrapper)

Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.

* some docs and test e2e one

* update docs

* update design

* update with skill evals

* reorg based on the design

* remove the duplicates

* add new scenarios

* update the doc

* update doc

* update names

* Vally: align release-planner mock stimuli with live e2e pattern

All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.

* update doc

* Vally: fix MCP boot race + drop misconfigured grader (#15948)

- Launch pre-built DLLs via 'dotnet <dll>' in both .vally.yaml files instead of 'dotnet run', so N parallel workers no longer race on Roslyn's exclusive write lock for the output DLL.

- Add 'Build MCP servers' step to eng/pipelines/skill-eval.yml so the CI runner has the DLLs ready before vally starts.

- Drop the skill-invocation grader from generate-sdk-for-existing-release-plan (no preflight reasoning step required; tools-only).

- Strip 'I'm in a checkout of azure-rest-api-specs.' preamble from prompts; the worktree already provides that context.

- Remove stray '// tools skills response' artifact in live release-planner.eval.yaml.

- README: document 'dotnet build' as a prereq; rewrite workers warning.

Validated: scenarios-mock at --workers 6 -> 5/5 stimuli pass, 0 race hits, ~4 min.
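
The launch change above can be sketched as a small Python helper that prefers a pre-built DLL. `mcp_launch_argv` is a hypothetical name and the paths passed to it are placeholders. `dotnet <dll>` and `dotnet run --project <dir>` are the two standard .NET CLI invocations being contrasted.

```python
def mcp_launch_argv(prebuilt_dll=None, project_dir=None):
    """Build the argv for starting an MCP server over stdio.

    Prefers a pre-built DLL (`dotnet <dll>`), which starts without compiling.
    Falls back to `dotnet run --project <dir>`, which may build first and is
    where parallel workers contended for the output DLL.
    """
    if prebuilt_dll:
        return ["dotnet", str(prebuilt_dll)]
    if project_dir:
        return ["dotnet", "run", "--project", str(project_dir)]
    raise ValueError("need either prebuilt_dll or project_dir")
```

With the DLL built once up front (the 'Build MCP servers' step), each parallel worker only starts a process and never triggers a build.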

* update readme for running steps

* Vally: align mock release-planner grader with live + deterministic 'not found' lookup

The create-release-plan-and-generate-sdk mock stimulus required the agent to call
azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the
azsdk-common-prepare-release-plan skill's create flow asks for it. The agent
correctly skipped the tool, and the grader flapped. The dedicated
update-sdk-details-in-release-plan stimulus already covers that tool with an
explicit prompt. Drop it from the create+generate grader so mock matches the
live release-planner-e2e contract (create / get / generate / link).

Also patch GetReleasePlanForSpecPrHandler to return a deterministic
'not found' response (ReleasePlanDetails = null). The mock previously
returned a 'plan exists' result for any spec PR, pushing the agent down
the update path instead of the create path that the stimulus exercises.
Stimuli that target an existing plan pass the work-item ID directly and
call azsdk_get_release_plan, so this is safe.

* update eval yaml

* Address PR #15811 review: fix stale paths, exit codes, build output, cache portability

- README/eval comments: evals/unit -> evals/tools, evals/scenarios -> evals/workflow-scenarios (Copilot C1/C5)

- Validate-EvalTools.ps1: default EvalPath -> evals/tools; return 1 -> exit 1 so CI fails loudly (Copilot C2/C3)

- MCP build output: dotnet build -o artifacts/mcp/{cli,mock}; pipeline switched to Release; .vally.yaml no longer hardcodes Debug/net8.0 (Praveen #1/#2)

- ensure-specs-clone.ps1 + workflow evals: repo-relative artifacts/specs-cache path instead of C:/Users/gaoh; Vally resolves it relative to the eval file so it works for all contributors + CI (Copilot C6/C7, Praveen #4)

- add-arm-resource/rename-client-property: comment clarifying 'edit' is the Copilot SDK built-in file tool, not an MCP tool (Praveen #5)

* Refactor Vally tool evals: rename triggers-* to prompt-to-tool-*, consolidate standalone single-tool evals

- Rename evals/tools/triggers-*.eval.yaml to prompt-to-tool-*.eval.yaml (Praveen review #6)

- Consolidate 7 standalone single-tool scenario evals into the matching namespace files as full-context checks (check-public-repo, check-sdk-generation-status, create-release-plan, get-modified-typespec-projects, get-pr-link-current-branch, link-namespace-approval-issue, validate-typespec)

- Keep add-arm-resource.eval.yaml standalone (produces a file edit, not a pure tool trigger)

- Switch tool evals to gpt-5.4 and add explicit 'use the available Azure SDK MCP tools' steering plus concrete grounding to bare trigger prompts so they invoke the MCP tool reliably

- Update README evals/tools section and Validate-EvalTools.ps1 to the new file names

* Remove agent-eval-strategy design spec from PR (now reviewed standalone in #15918)

* Drop flaky edit-tool assertion from add-arm-resource eval

* remove script

* Stabilize flaky tool-scenario prompts and add README command cookbook

Ground 13 previously-flaky prompts with concrete IDs/paths so they route deterministically to the intended MCP tool; make the mock check-service-label handler convention-driven (status derived from the requested serviceLabel); document common vally invocation recipes in the README.

* Fix outdated command examples in Vally README

Replace references to consolidated/non-existent eval files (create-release-plan, check-public-repo, link-namespace-approval-issue) with the real prompt-to-tool-* and workflow-scenario files; correct the default output path to ./vally-results/<timestamp>/; fix the cookbook results.jsonl parser to locate the newest timestamped run; add the missing release-planner-workflows mock scenario to the index.
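
A minimal Python sketch of "locate the newest timestamped run, then parse `results.jsonl`". It is not the README's actual cookbook snippet. It assumes the timestamped folder names sort chronologically as plain strings and that `results.jsonl` sits directly inside the run folder; `newest_run_dir` and `load_results` are hypothetical names.

```python
import json
from pathlib import Path


def newest_run_dir(results_root="vally-results"):
    """Return the newest run folder under results_root, or None if there is none.

    Assumes timestamped folder names sort chronologically as strings.
    """
    root = Path(results_root)
    if not root.is_dir():
        return None
    runs = [entry for entry in root.iterdir() if entry.is_dir()]
    return max(runs, key=lambda entry: entry.name, default=None)


def load_results(run_dir):
    """Parse results.jsonl in run_dir: one JSON object per non-empty line."""
    path = Path(run_dir) / "results.jsonl"
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

If the folder names are not sortable as strings, sorting by modification time would be the obvious alternative.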

* Fix invalid prompt-grader config in live release-planner eval

The prompt (LLM-judge) grader schema uses 'prompt' for the rubric text, not 'rubric'. Rename the field and add 'scoring: binary' (the rubric is pass/fail) so the spec validates.
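
The field rename can be sketched in Python as a one-off migration over a grader config loaded as a dict. `fix_prompt_grader` is a hypothetical helper. Defaulting `scoring` to `binary` mirrors this commit's fix for a pass/fail rubric and is not a general rule.

```python
def fix_prompt_grader(config: dict) -> dict:
    """Return a copy with 'rubric' renamed to 'prompt' and a default scoring mode.

    An existing 'prompt' or 'scoring' value is left untouched.
    """
    fixed = {key: value for key, value in config.items() if key != "rubric"}
    if "rubric" in config and "prompt" not in fixed:
        fixed["prompt"] = config["rubric"]
    fixed.setdefault("scoring", "binary")
    return fixed
```
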
@helen229 helen229 marked this pull request as ready for review June 16, 2026 17:14
@helen229 helen229 requested a review from a team as a code owner June 16, 2026 17:14
Copilot AI review requested due to automatic review settings June 16, 2026 17:14

Copilot AI left a comment

Contributor

Pull request overview

Adds a new design/spec document describing the intended evaluation strategy for agent-driven azsdk-cli workflows (using Vally), including eval taxonomy (skill/workflow/tool), folder layout, grader requirements, and CI cadence/reporting expectations.

Changes:

  • Introduces an “eval pyramid” covering per-skill evals, cross-skill workflow scenarios (mock/live), and a hermetic tool-shape layer.
  • Documents where eval content should live and how it should be structured under evals/.
  • Defines required graders per tier and outlines CI execution + reporting approach (including cost/flake considerations).


| Run mode | MCP | Repos? | When | Coverage |
|---|---|---|---|---|
| Workflows — Mock | mock (stub, no LLM) | azure-sdk-tools only | nightly + on demand | every scenario |
### Folder layout

```
evals/
```

| What it tests | Lives in |
|---|---|
| **One skill** (does this skill route, call its tools, return a sensible answer) | `.github/skills/<skill>/evals/` |
Comment on lines +233 to +234
Both kinds are native Vally `*.eval.yaml`. Below are minimal, real-shaped
samples plus how to configure and run them.
- azsdk_get_release_plan
- azsdk_create_release_plan
- azsdk_run_generate_sdk
forbidden: [azsdk_verify_setup]

**Skill eval** (`.github/skills/<skill>/evals/<name>.eval.yaml`) — proves