Remove Azure.Sdk.Tools.Cli.Benchmarks (superseded by Vally skill evals) #15697

Merged
helen229 merged 1 commit into main from clean-azsdk-cli-benchmark on May 27, 2026

@helen229

Member

What

Removes the Azure.Sdk.Tools.Cli.Benchmarks project and its CI job. The
benchmark project has been superseded by the Vally-based skill evals living
under .github/skills/ (merged in #15376, the
"Migration waza skills to vally" PR).

Why

  • Skill evals run via @microsoft/vally-cli are now the canonical way we
    measure skill/tool behavior — see eng/pipelines/skill-eval.yml.
  • The benchmarks project is no longer the source of truth, and the
    Run_Benchmark CI job is off by default (it only runs when
    parameters.RunBenchmarks=true or Build.Reason=Schedule).
  • Keeping the dead project around adds maintenance cost, build time, and
    confusion about which framework owns coverage.

Changes

  • 🗑️ Delete the entire tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/
    project (182 files, scenarios, validators, test data, docs)
  • ✏️ Remove the project entry and its 12 build-configuration lines from
    tools/azsdk-cli/Azure.Sdk.Tools.Cli.sln
    (via dotnet sln remove)
  • ✏️ Remove the Run_Benchmark job and the AuthoringSpecRepo parameter
    from tools/azsdk-cli/ci.yml
  • ⚠️ Keep the RunBenchmarks parameter as a deprecated no-op so any
    external pipeline definition or scheduled run still passing
    RunBenchmarks: true does not break. Safe to delete in a follow-up once
    we've confirmed no caller sets it.
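
The project and solution removal described in the list above could be scripted roughly as follows. This is a minimal sketch, not the commands actually run for this PR: the paths come from the PR text and `dotnet sln ... remove` is the command the description names, while the `git rm` step, the `run` helper, the `remove_benchmarks_project` function, and the `DRY_RUN` switch are assumptions added for illustration. It assumes the working directory is the repository root.

```shell
#!/usr/bin/env bash
# Paths taken from the PR description; run from the repository root.
SLN="tools/azsdk-cli/Azure.Sdk.Tools.Cli.sln"
PROJ_DIR="tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks"
PROJ="$PROJ_DIR/Azure.Sdk.Tools.Cli.Benchmarks.csproj"

# Hypothetical helper: with DRY_RUN=1 (the default) commands are only printed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

remove_benchmarks_project() {
  # Drop the project from the solution first, while the .csproj still exists.
  run dotnet sln "$SLN" remove "$PROJ"
  # Assumption: delete the directory with git rm so the index is updated too.
  run git rm -r -q "$PROJ_DIR"
}
```

With the default `DRY_RUN=1`, calling `remove_benchmarks_project` only prints the two commands; `DRY_RUN=0 remove_benchmarks_project` would execute them.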

Safety check

| Check | Result |
| --- | --- |
| git grep "Cli.Benchmarks" across the rest of the repo | 0 matches in .cs/.csproj/.sln/.yml |
| dotnet build Azure.Sdk.Tools.Cli.sln | ✅ 0 errors |
| AzSDK_Eval_Variable_group, azuresdk-copilot-github-pat | Still used by eng/pipelines/skill-eval.yml; left alone |
| azuresdkqabot-dev service connection, qa-bot-service Go binary | Owned by tools/sdk-ai-bots/; left alone |
| CODEOWNERS reference to the Benchmarks dir | None |
| Generic prose mentions of "benchmark" (TypeSpec authoring spec, Mock README, LLM system instructions) | Refer to concepts, not the deleted project; left alone |
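
The `git grep` check above could be reproduced with a small helper like the one below. This is a sketch under assumptions: the function name `count_refs` and its argument order are hypothetical, and the source does not state the exact flags used for the PR's own check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many tracked files still mention a string.
# Usage: count_refs <repo_dir> <fixed_string> [pathspec...]
count_refs() {
  local repo_dir="$1" pattern="$2" out status
  shift 2
  # -l lists matching file names only; -F treats the pattern as a literal string.
  out="$(git -C "$repo_dir" grep -l -F -e "$pattern" -- "$@")"
  status=$?
  # git grep exits 0 on a match and 1 on no match; anything higher is a real error.
  if [ "$status" -gt 1 ]; then
    echo "git grep failed with status $status" >&2
    return "$status"
  fi
  if [ -z "$out" ]; then
    echo 0
  else
    printf '%s\n' "$out" | wc -l | tr -d ' '
  fi
}
```

Run from the repository root, `count_refs . "Cli.Benchmarks" '*.cs' '*.csproj' '*.sln' '*.yml'` should print `0` if no references remain in those file types.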

Residual risk

Only one path could break: an ADO pipeline definition that pins
RunBenchmarks: true outside the repo. The no-op parameter above absorbs
that case — the pipeline run will still succeed, just without running the
(now-deleted) benchmark job.

The benchmark project has been replaced by skill evals under .github/skills/ run via @microsoft/vally-cli. Removes:

- The Azure.Sdk.Tools.Cli.Benchmarks project and all scenarios/test data

- Its entry in Azure.Sdk.Tools.Cli.sln

- The Run_Benchmark job in tools/azsdk-cli/ci.yml

The RunBenchmarks parameter is kept as a deprecated no-op so any external pipeline definition or scheduled run still passing it does not break. Safe to delete once confirmed no caller sets it.
Copilot AI review requested due to automatic review settings May 21, 2026 18:04
@helen229 helen229 requested a review from a team as a code owner May 21, 2026 18:04
@github-actions github-actions Bot added the azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli label May 21, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR removes the deprecated Azure.Sdk.Tools.Cli.Benchmarks project and its CI integration, since skill behavior measurement is now owned by the Vally-based skill evals under .github/skills/ (and run via eng/pipelines/skill-eval.yml).

Changes:

  • Deleted the tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/ project (code, scenarios, validators, docs, and test data).
  • Removed the Benchmarks project from tools/azsdk-cli/Azure.Sdk.Tools.Cli.sln.
  • Updated tools/azsdk-cli/ci.yml to drop the benchmark job while retaining a deprecated no-op RunBenchmarks parameter for compatibility.

Reviewed changes

Copilot reviewed 183 out of 184 changed files in this pull request and generated no comments.

File Description
tools/azsdk-cli/ci.yml Removes the benchmark CI job and keeps RunBenchmarks as a deprecated no-op parameter for compatibility.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.sln Removes the Benchmarks project from the solution so builds no longer include it.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Azure.Sdk.Tools.Cli.Benchmarks.csproj Deleted as part of removing the Benchmarks project.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/docs/running-in-containers.md Deleted as part of removing Benchmarks documentation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/ScenarioDiscovery.cs Deleted as part of removing Benchmarks runtime/discovery infrastructure.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionConfigHelper.cs Deleted as part of removing Benchmarks runtime infrastructure.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Interaction/SyntheticAICustomer.cs Deleted as part of removing Benchmarks interaction harness.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkDefaults.cs Deleted as part of removing Benchmarks models/config.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkLog.cs Deleted as part of removing Benchmarks models/logging.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkOptions.cs Deleted as part of removing Benchmarks models/config.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs Deleted as part of removing Benchmarks models/results.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/CleanupPolicy.cs Deleted as part of removing Benchmarks models/config.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionConfig.cs Deleted as part of removing Benchmarks execution configuration.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs Deleted as part of removing Benchmarks execution result model.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs Deleted as part of removing Benchmarks validation models.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/JudgmentResult.cs Deleted as part of removing Benchmarks judgment model.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/RepoConfig.cs Deleted as part of removing Benchmarks repo configuration model.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/TokenUsage.cs Deleted as part of removing Benchmarks token-usage tracking.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs Deleted as part of removing Benchmarks tool-call capture.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs Deleted as part of removing Benchmarks validation context model.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationResult.cs Deleted as part of removing Benchmarks validation result model.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationSummary.cs Deleted as part of removing Benchmarks validation summary model.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/BenchmarkScenario.cs Deleted as part of removing Benchmarks scenario framework.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/GitHub/GetPrLinkCurrentBranchScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/Pipeline/CheckSdkGenerationStatusScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/RenameClientPropertyScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/TypespecGenerationStep02Scenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/ValidateTypespecScenario.cs Deleted as part of removing Benchmarks scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001001-version-spread-property/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001001-version-spread-property/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001002-version-default-value/readme.md Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001002-version-default-value/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001003-version-required-to-optional/readme.md Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001003-version-required-to-optional/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001004-version-property-decorator/readme.md Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001004-version-property-decorator/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/readme.md Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Operations_List_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Operations_List_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Employees_Update_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Employees_ListBySubscription_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Employees_ListBySubscription_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Employees_ListByResourceGroup_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Employees_ListByResourceGroup_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Employees_Get_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2024-10-01-preview/Employees_Delete_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Operations_List_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Operations_List_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Employees_Update_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Employees_ListBySubscription_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Employees_ListBySubscription_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Employees_ListByResourceGroup_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Employees_ListByResourceGroup_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Employees_Get_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001005-version-add-preview-after-preview/examples/2021-10-01/Employees_Delete_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/readme.md Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Operations_List_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Operations_List_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Employees_Update_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Employees_ListBySubscription_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Employees_ListBySubscription_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Employees_ListByResourceGroup_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Employees_ListByResourceGroup_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Employees_Get_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2024-10-01/Employees_Delete_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Operations_List_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Operations_List_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Employees_Update_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Employees_ListBySubscription_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Employees_ListBySubscription_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Employees_ListByResourceGroup_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Employees_ListByResourceGroup_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Employees_Get_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001006-version-add-preview-after-stable/examples/2021-10-01/Employees_Delete_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/readme.md Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Operations_List_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Operations_List_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Employees_Update_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Employees_ListBySubscription_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Employees_ListBySubscription_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Employees_ListByResourceGroup_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Employees_ListByResourceGroup_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Employees_Get_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2024-10-01-preview/Employees_Delete_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Operations_List_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Operations_List_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Employees_Update_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Employees_ListBySubscription_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Employees_ListBySubscription_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Employees_ListByResourceGroup_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Employees_ListByResourceGroup_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Employees_Get_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001007-version-add-stable-after-preview/examples/2021-10-01/Employees_Delete_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/readme.md Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Operations_List_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Operations_List_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Employees_Update_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Employees_ListBySubscription_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Employees_ListBySubscription_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Employees_ListByResourceGroup_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Employees_ListByResourceGroup_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Employees_Get_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2024-10-01/Employees_Delete_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Operations_List_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Operations_List_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Employees_Update_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Employees_ListBySubscription_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Employees_ListBySubscription_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Employees_ListByResourceGroup_MinimumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Employees_ListByResourceGroup_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Employees_Get_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/001008-version-add-stable-after-stable/examples/2021-10-01/Employees_Delete_MaximumSet_Gen.json Deleted as part of removing Benchmarks generated example data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/002003-ARM-define-full-update-operation/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/002004-ARM-define-extension-resource/badgeAssignment.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/002005-ARM-define-the-resource/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/003002-arm-action-lro/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/004001-decorate-mgmt-resource-name-parameter/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/004002-decorate-length-constrains-on-array-item/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/arm-action-sync-operation/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/arm-action-sync-operation/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/arm-add-patch-operation-to-resource/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/arm-add-patch-operation-to-resource/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-add-preview-after-preview/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-add-preview-after-preview/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-model-property-removed/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-model-property-removed/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-model-property-renamed/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-model-property-renamed/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-model-property-required/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-model-property-required/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-model-property-type-changed/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-model-property-type-changed/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-operation-return-type-changed/main.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/TestData/TypeSpec/version-operation-return-type-changed/employee.tsp Deleted as part of removing Benchmarks TypeSpec test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/IValidator.cs Deleted as part of removing Benchmarks validation framework.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/ValidatorRunner.cs Deleted as part of removing Benchmarks validation framework.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ContainsValidator.cs Deleted as part of removing Benchmarks validators.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/FileExistsValidator.cs Deleted as part of removing Benchmarks validators.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/InteractionValidator.cs Deleted as part of removing Benchmarks validators.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolAndSkillTriggerValidator.cs Deleted as part of removing Benchmarks validators.

@jeo02

jeo02 commented May 21, 2026

Member

Looks good, I'd say add @haolingdong-msft and @chunyu3 as reviewers just to make sure they migrated everything over to vally.

@helen229 helen229 merged commit 9561f37 into main May 27, 2026
22 checks passed
@helen229 helen229 deleted the clean-azsdk-cli-benchmark branch May 27, 2026 02:09

@haolingdong-msft haolingdong-msft left a comment

Member

Thanks @helen229 for cleaning up. LGTM

helen229 added a commit that referenced this pull request Jun 16, 2026
… Vally (#15124) (#15811)

* Scaffold Azure.Sdk.Tools.Vally tool-scenario eval suite (#15124)

Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).

- README documents project intent, layout, local run instructions, and how to add a new scenario.

- .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.

- evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.

- fixtures/.gitkeep reserves the per-scenario fixtures layout.

Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.

* Port remaining 9 benchmark scenarios to Vally (#15124)

Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:

- check-public-repo-then-validate

- validate-typespec

- typespec-generation-step02

- get-modified-typespec-projects (stub — needs git-repo fixture / setup hook)

- add-arm-resource (stub — needs fixtures + npx tsp compile post-check)

- create-release-plan

- link-namespace-approval-issue

- get-pr-link-current-branch

- check-sdk-generation-status

Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.
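A presence check of the kind described above can be sketched as follows. This is only an illustration of the idea, not Vally's grader implementation: the function names `missing_tool_calls` and `presence_check_passes` are hypothetical, and the sketch assumes the observed tool calls are available as a flat list of tool names.

```python
from typing import Iterable, List


def missing_tool_calls(expected: Iterable[str], observed: Iterable[str]) -> List[str]:
    """Return the expected tool names that never appear among the observed calls.

    Presence only: arguments, call order, and repeat counts are ignored,
    which is why those assertions need custom graders.
    """
    seen = set(observed)
    return [name for name in expected if name not in seen]


def presence_check_passes(expected: Iterable[str], observed: Iterable[str]) -> bool:
    """True when every expected tool was called at least once."""
    return not missing_tool_calls(expected, observed)
```

Because the check is set-based, an agent that calls the right tools in the wrong order, or with the wrong arguments, still passes. That gap is what the inline TODOs in the eval files record.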

* Add rename-client-property stub eval to Vally suite (#15124)

Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.

* Fix tool name prefix in graders, timeout format, expand README

* Reorganize evals into scenarios/ and triggers/; port trigger evals from #15183

- Move 11 multi-step scenario evals to evals/scenarios/
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
- Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
- Update .vally.yaml suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add .gitignore for vally-results/ and results/
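The prefix stripping mentioned in the list above amounts to a small normalization step. A minimal sketch, assuming the prefix is exactly `azure-sdk-mcp-` and that names without it are already bare; `bare_tool_name` is a hypothetical helper, not code from the PR:

```python
PREFIX = "azure-sdk-mcp-"  # prefix named in the commit message


def bare_tool_name(name: str) -> str:
    """Drop the server prefix so a grader entry matches the bare MCP tool name."""
    if name.startswith(PREFIX):
        return name[len(PREFIX):]
    return name  # already bare; leave untouched
```

For example, `bare_tool_name("azure-sdk-mcp-azsdk_get_release_plan")` gives `azsdk_get_release_plan`, and a name that is already bare passes through unchanged.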

* update the config and use gpt-5.4 model

* add disallowed

* Vally: restructure evals into unit/integration/e2e test pyramid

Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.

* Vally: remove Run-LiveEvals.ps1 (local-only test wrapper)

Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.

* some docs and test e2e one

* update docs

* update design

* update with skill evals

* reorg based on the design

* remove the duplicates

* add new scenarios

* update the doc

* update doc

* update names

* Vally: align release-planner mock stimuli with live e2e pattern

All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.

* update doc

* Vally: fix MCP boot race + drop misconfigured grader (#15948)

- Launch pre-built DLLs via 'dotnet <dll>' in both .vally.yaml files instead of 'dotnet run', so N parallel workers no longer race on Roslyn's exclusive write lock for the output DLL.

- Add 'Build MCP servers' step to eng/pipelines/skill-eval.yml so the CI runner has the DLLs ready before vally starts.

- Drop the skill-invocation grader from generate-sdk-for-existing-release-plan (no preflight reasoning step required; tools-only).

- Strip 'I'm in a checkout of azure-rest-api-specs.' preamble from prompts; the worktree already provides that context.

- Remove stray '// tools skills response' artifact in live release-planner.eval.yaml.

- README: document 'dotnet build' as a prereq; rewrite workers warning.

Validated: scenarios-mock at --workers 6 -> 5/5 stimuli pass, 0 race hits, ~4 min.

* update readme for running steps

* Vally: align mock release-planner grader with live + deterministic 'not found' lookup

The create-release-plan-and-generate-sdk mock stimulus required the agent to call
azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the
azsdk-common-prepare-release-plan skill's create flow asks for it. The agent
correctly skipped the tool, and the grader flapped. The dedicated
update-sdk-details-in-release-plan stimulus already covers that tool with an
explicit prompt. Drop it from the create+generate grader so mock matches the
live release-planner-e2e contract (create / get / generate / link).

Also patch GetReleasePlanForSpecPrHandler to return a deterministic
'not found' response (ReleasePlanDetails = null). The mock previously
returned a 'plan exists' result for any spec PR, pushing the agent down
the update path instead of the create path that the stimulus exercises.
Stimuli that target an existing plan pass the work-item ID directly and
call azsdk_get_release_plan, so this is safe.

* update eval yaml

* Address PR #15811 review: fix stale paths, exit codes, build output, cache portability

- README/eval comments: evals/unit -> evals/tools, evals/scenarios -> evals/workflow-scenarios (Copilot C1/C5)

- Validate-EvalTools.ps1: default EvalPath -> evals/tools; return 1 -> exit 1 so CI fails loudly (Copilot C2/C3)

- MCP build output: dotnet build -o artifacts/mcp/{cli,mock}; pipeline switched to Release; .vally.yaml no longer hardcodes Debug/net8.0 (Praveen #1/#2)

- ensure-specs-clone.ps1 + workflow evals: repo-relative artifacts/specs-cache path instead of C:/Users/gaoh; Vally resolves it relative to the eval file so it works for all contributors + CI (Copilot C6/C7, Praveen #4)

- add-arm-resource/rename-client-property: comment clarifying 'edit' is the Copilot SDK built-in file tool, not an MCP tool (Praveen #5)
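The `return 1 -> exit 1` item above fixes a common pitfall: a script that only returns a failure value still ends with exit code 0, so CI reports success. The actual fix is in PowerShell; the sketch below is a Python analogy of the same failure mode, with hypothetical script bodies, not a port of Validate-EvalTools.ps1.

```python
import subprocess
import sys

# Child script that only *returns* 1: the value is discarded and the process exits 0.
RETURNS = "def main():\n    return 1\nmain()\n"

# Child script that passes the value to sys.exit: the process exits 1.
EXITS = "import sys\ndef main():\n    return 1\nsys.exit(main())\n"


def exit_code_of(source: str) -> int:
    """Run the given source in a fresh interpreter and report its exit code."""
    return subprocess.run([sys.executable, "-c", source]).returncode
```

A CI step sees only the process exit code, so the first variant passes silently even though the validation failed.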

* Refactor Vally tool evals: rename triggers-* to prompt-to-tool-*, consolidate standalone single-tool evals

- Rename evals/tools/triggers-*.eval.yaml to prompt-to-tool-*.eval.yaml (Praveen review #6)

- Consolidate 7 standalone single-tool scenario evals into the matching namespace files as full-context checks (check-public-repo, check-sdk-generation-status, create-release-plan, get-modified-typespec-projects, get-pr-link-current-branch, link-namespace-approval-issue, validate-typespec)

- Keep add-arm-resource.eval.yaml standalone (produces a file edit, not a pure tool trigger)

- Switch tool evals to gpt-5.4 and add explicit 'use the available Azure SDK MCP tools' steering plus concrete grounding to bare trigger prompts so they invoke the MCP tool reliably

- Update README evals/tools section and Validate-EvalTools.ps1 to the new file names

* Remove agent-eval-strategy design spec from PR (now reviewed standalone in #15918)

* Drop flaky edit-tool assertion from add-arm-resource eval

* remove script

* Stabilize flaky tool-scenario prompts and add README command cookbook

Ground 13 previously-flaky prompts with concrete IDs/paths so they route deterministically to the intended MCP tool; make the mock check-service-label handler convention-driven (status derived from the requested serviceLabel); document common vally invocation recipes in the README.

* Fix outdated command examples in Vally README

Replace references to consolidated/non-existent eval files (create-release-plan, check-public-repo, link-namespace-approval-issue) with the real prompt-to-tool-* and workflow-scenario files; correct the default output path to ./vally-results/<timestamp>/; fix the cookbook results.jsonl parser to locate the newest timestamped run; add the missing release-planner-workflows mock scenario to the index.
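Locating the newest timestamped run can be sketched as below. This is not the README's cookbook parser. It assumes the run directories under `./vally-results/` have timestamp names that sort chronologically as plain strings, and that `results.jsonl` sits directly inside each run directory; both are assumptions, as are the helper names.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def newest_run_dir(results_root: Path) -> Optional[Path]:
    """Return the run directory with the greatest name, or None if there is none."""
    if not results_root.is_dir():
        return None
    runs = [p for p in results_root.iterdir() if p.is_dir()]
    # Assumption: timestamp names sort lexicographically in chronological order.
    return max(runs, key=lambda p: p.name, default=None)


def load_results(run_dir: Path) -> List[Dict]:
    """Parse results.jsonl (one JSON object per line), skipping blank lines."""
    with (run_dir / "results.jsonl").open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

If the timestamp format does not sort lexicographically, sorting by modification time (`p.stat().st_mtime`) is an alternative key.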

* Fix invalid prompt-grader config in live release-planner eval

The prompt (LLM-judge) grader schema uses 'prompt' for the rubric text, not 'rubric'. Rename the field and add 'scoring: binary' (the rubric is pass/fail) so the spec validates.
Labels

azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli

5 participants