Batch small Helix work items to reduce per-item overhead (-53% compute)#66808
Conversation
Reduces ~503 work items per build to ~50 by batching compatible small test assemblies (those without special dependencies like IIS/Playwright) into groups of up to 20. Each batched work item runs dotnet test sequentially for each assembly, sharing the per-item setup overhead (tool installs, env config, vstest launcher) across the batch. Expected impact: ~60-70% reduction in total compute per build. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The batch task was using non-existent metadata fields (RuntimeVersion, QueueName, etc.) resulting in empty command arguments. Ubuntu items were submitted with empty runtime/queue args and never picked up by agents. Now parses these values from the first item's existing Command string. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When one assembly in a batch fails, the entire batch exited with code 1, causing the Helix SDK to treat it as a failed work item and not report test results from the other 19 passing assemblies to AzDO. This caused ~1300 missing tests in the test count. Now batched runs always exit 0 so all results are reported. Individual test failures are visible through the test results XML. This also prevents wasteful retries of entire 20-assembly batches for a single flaky test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Hey @dotnet/aspnet-build, looks like this PR is something you want to take a look at. |
|
@wtgodbe THis is an attempt to reduce the test overhead in aspnetcore. It is NOT ready (prototype included some hacks for testing) |
|
Love this idea! Will take a closer look next week, but worth noting that the helix tests are currently broken into 2 subsets: Lines 236 to 297 in 27c660e aspnetcore/.azure/pipelines/ci-public.yml Lines 562 to 591 in 27c660e |
@wtgodbe Do you happen to remember why it cut that much time off PRs? Was it due to test time or time to submit the tests. |
I don't remember off the top of my head, but release/9.0 and release/8.0 PRs are still doing one mondo-job, so we should be able to check & find out |
| - script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -noBuildNative -noBuild -test | ||
| -projects eng\helix\helix.proj /p:IsHelixPRCheck=true /p:IsHelixJob=true | ||
| /p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs) | ||
| /p:VsTestUseMSBuildOutput=false /p:RunTemplateTests=false /p:HelixSubset=1 |
There was a problem hiding this comment.
Must get rid of ProjectsWithTestsSubset1 / ProjectsWithTestsSubset2 if you're gonna remove these properties:
Lines 296 to 297 in 2c960bf
There was a problem hiding this comment.
Pull request overview
This PR introduces Helix work item batching to reduce per-item overhead by combining multiple small test assemblies into a single Helix work item, while keeping assemblies with special dependencies unbatched. It also updates HelixTestRunner and CI pipelines to support the new batched execution model.
Changes:
- Add an MSBuild batching step in
eng/helix/helix.projto group eligible Helix work items (by TFM) and generate atargets.txtmanifest. - Extend
HelixTestRunnerto accept multiple targets / a--targets-fileinput, and update Helix run scripts to detect@targets.txt. - Simplify CI Helix execution by removing subset splitting and running a single combined Helix job.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| eng/tools/HelixTestRunner/TestRunner.cs | Adds multi-target discovery/execution support and adjusts artifact/log handling for batched payload layouts. |
| eng/tools/HelixTestRunner/Program.cs | Updates startup logging and exception formatting. |
| eng/tools/HelixTestRunner/HelixTestRunnerOptions.cs | Adds --targets-file and supports multiple --target values. |
| eng/targets/Helix.targets | Adds metadata used to determine batching eligibility (TFM + special dependency flags). |
| eng/helix/helix.proj | Implements the batching inline task and replaces eligible work items with batched ones. |
| eng/helix/content/runtests.sh | Adds @targets.txt detection and forwards either --target or --targets-file. |
| eng/helix/content/runtests.cmd | Adds @targets.txt detection and forwards either --target or --targets-file. |
| docs/helix-work-item-batching.md | New design document describing the batching approach and operational trade-offs. |
| .azure/pipelines/ci.yml | Collapses Helix subset jobs into a single Helix x64 job and updates binlog variables/artifact names. |
| .azure/pipelines/ci-unofficial.yml | Same Helix subset collapse as official CI. |
| .azure/pipelines/ci-public.yml | Same Helix subset collapse as public CI. |
| } | ||
| ProcessUtil.PrintMessage($"Copying artifacts/log/ to {HELIX_WORKITEM_UPLOAD_ROOT}/"); | ||
| if (Directory.Exists("artifacts/log")) | ||
|
|
||
| // Copy logs from each assembly's subdirectory | ||
| foreach (var target in Options.Targets) | ||
| { | ||
| foreach (var file in Directory.EnumerateFiles("artifacts/log", "*.log", SearchOption.AllDirectories)) | ||
| var workingDirectory = GetTargetWorkingDirectory(target); | ||
| var assemblyName = GetSanitizedAssemblyName(target); | ||
|
|
||
| var artifactsLogDirectory = Path.Combine(workingDirectory, "artifacts", "log"); | ||
| if (Directory.Exists(artifactsLogDirectory)) | ||
| { | ||
| // Combine the directory name + log name for the copied log file name to avoid overwriting | ||
| // duplicate test names in different test projects | ||
| var logName = $"{Path.GetFileName(Path.GetDirectoryName(file))}_{Path.GetFileName(file)}"; | ||
| ProcessUtil.PrintMessage($"Copying: {file} to {Path.Combine(HELIX_WORKITEM_UPLOAD_ROOT, logName)}"); | ||
| File.Copy(file, Path.Combine(HELIX_WORKITEM_UPLOAD_ROOT, logName)); | ||
| ProcessUtil.PrintMessage($"Copying artifacts/log/ to {HELIX_WORKITEM_UPLOAD_ROOT}/"); | ||
| foreach (var file in Directory.EnumerateFiles(artifactsLogDirectory, "*.log", SearchOption.AllDirectories)) | ||
| { | ||
| var logName = $"{assemblyName}_{Path.GetFileName(Path.GetDirectoryName(file))}_{Path.GetFileName(file)}"; | ||
| CopyFileToUploadRoot(file, HELIX_WORKITEM_UPLOAD_ROOT, logName); |
| #### Stage 3: Helix Execution (modified HelixTestRunner) | ||
|
|
||
| On the Helix agent, the `HelixTestRunner` reads `--targets-file targets.txt` and | ||
| iterates: | ||
|
|
||
| ``` | ||
| for each assembly in targets.txt: | ||
| cd {assembly_subdirectory} | ||
| dotnet test {assembly.dll} --logger xunit ... | ||
| collect results | ||
| merge all TestResults.xml → testResults.xml | ||
| upload merged results | ||
| ``` |
| ### Test Result Handling | ||
|
|
||
| **Problem:** When one assembly in a batch has a test failure, the HelixTestRunner | ||
| previously exited with code 1. The Helix SDK treated this as a work item failure, | ||
| which prevented test results from being reported to AzDO — losing visibility into | ||
| the other 19 passing assemblies' results. | ||
|
|
||
| **Solution:** Batched runs always exit 0. Individual test failures are reported | ||
| through the xUnit test results XML. The Helix SDK picks up the merged | ||
| `testResults.xml` and reports all individual test outcomes to AzDO. | ||
|
|
||
| ```csharp | ||
| // Program.cs | ||
| if (runner.Options.IsBatched && exitCode != 0) | ||
| { | ||
| ProcessUtil.PrintMessage($"Batched run had test failures but exiting 0 for result reporting."); | ||
| exitCode = 0; | ||
| } | ||
| ``` | ||
|
|
||
| **Trade-off:** Failed batches are not retried by Helix's built-in retry mechanism. | ||
| This is intentional — retrying a 20-assembly batch for one flaky test wastes compute. | ||
| Individual flaky tests should be addressed at the test level, not by re-running | ||
| entire batches. | ||
|
|
||
| ### Result Merging | ||
|
|
||
| The `HelixTestRunner.UploadResults()` method: | ||
|
|
||
| 1. Runs `dotnet test` in each assembly's subdirectory, producing per-assembly | ||
| `TestResults/TestResults.xml` | ||
| 2. Copies each to `testResults_{assemblyName}.xml` in the work item root | ||
| 3. If multiple results exist, merges them into a single `testResults.xml` by | ||
| combining all `<assembly>` elements under a unified `<assemblies>` root | ||
| 4. Copies all results to `HELIX_WORKITEM_UPLOAD_ROOT` |
| if (result.ExitCode != 0) | ||
| { | ||
| ProcessUtil.PrintMessage($"Failure in quarantined tests. Exit code: {result.ExitCode}."); | ||
| exitCode = result.ExitCode; |
The previous change swallowed non-zero exit codes for batched items, making genuine test failures invisible at the pipeline level. The exit code should always propagate — the Helix SDK's AzDO reporter picks up test results from testResults.xml regardless of exit code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move EnsureTargetRunnerConfiguration back to SetupEnvironment() so it runs once (matching original), remove duplicate calls from discovery and execution stages - Fix batch timeout: use 2min/assembly + 5min overhead instead of summing individual 45min timeouts (which always hit the 45min cap) - Remove unused ParseTimeout helper - Test results: keep the merge approach (Roslyn avoids merging by using a single vstest.console invocation via rsp file, but aspnetcore uses dotnet test with quarantine filtering which requires per-assembly invocations) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6bb2285 to
4e2fd46
Compare
The CopyDirectory approach recursively copied every file from each assembly's publish directory into the batch directory. With 20 assemblies per batch x 5 batches per subset, this was ~100 recursive directory copies causing the helix target step to take 90+ minutes and timeout at 240 min. Replace with Directory.CreateSymbolicLink (matching Roslyn's approach). The Helix SDK follows symlinks when zipping payloads. Only the batch root's shared files (runtests.cmd/sh, NuGet.config) are copied. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ef5a4e7 to
ea30c38
Compare
|
Hold on this..going to split it into smaller PRs. |
dotnet test accepts multiple DLL paths. Pass all target assemblies in a single invocation instead of iterating per-assembly. This: - Produces one unified test results file (no merging needed) - Saves ~2s per extra assembly (vstest host startup) - Preserves quarantine exit code behavior (don't fail on quarantined) - Simplifies UploadResults and AddHelixSourcesAsync Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests that reference HELIX_WORKITEM_ROOT to find content files (baselines, snapshots, nupkgs) are incompatible with work item batching because the content files are inside symlinked subdirectories, not at the batch root. Add TestUsesHelixWorkItemRoot property for projects with custom content, and include TestDependsOnAspNetPackages/TestDependsOnAspNetAppPackages in HasSpecialDependencies to exclude these from batching. Affected projects: - Http.Extensions.Tests (RequestDelegateGenerator baselines) - OpenApi.Tests & SourceGenerators.Tests (snapshots) - Validation.GeneratorTests (snapshots) - App.Analyzers.Test (NuGet.config lookup) - App.UnitTests (via TestDependsOnAspNetAppPackages) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests that find content files (baselines, snapshots, nupkgs) on Helix previously used HELIX_WORKITEM_ROOT which points to the work item execution root. This is incompatible with work item batching where content is in per-project subdirectories. AppContext.BaseDirectory resolves to the DLL's directory, which is correct in all modes: - Non-batched Helix: same as HELIX_WORKITEM_ROOT (publish dir) - Batched Helix: per-project symlink to publish dir - Local development: falls through to relative path (unchanged) This also removes the TestUsesHelixWorkItemRoot property and reverts the HasSpecialDependencies condition, allowing these projects to participate in batching for better compute efficiency. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the inline RoslynCodeTaskFactory task from helix.proj into the compiled RepoTasks assembly. This provides better IDE support, proper compilation errors, and uses Directory.CreateSymbolicLink directly instead of shelling out to mklink (now targeting modern .NET runtime). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fully qualify Microsoft.Build.Utilities.Task to avoid ambiguity with System.Threading.Tasks.Task. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the import after ArtifactsBinDir is defined so the task assembly path resolves correctly. The Helix SDK doesn't include Arcade SDK's ProjectLayout.props, so ArtifactsBinDir must be explicitly set first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Import was inserted mid-PropertyGroup, splitting it in two and causing an XML parse error. Move the import after the complete PropertyGroup block. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This assembly has 1999 tests and causes test host hangs on macOS when batched with other assemblies. Add SkipHelixWorkItemBatching property to opt specific projects out of work item batching. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. All work items now use @targets.txt command format. For non-batched items, the batcher writes a single-entry targets.txt. This removes the conditional logic from runtests.cmd and runtests.sh. 2. Rename HasSpecialDependencies metadata to SkipHelixWorkItemBatching. Set to true when any TestDepends* condition is met or when projects explicitly opt out. Defaults to not-set (eligible for batching). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove overwrite:true from File.Copy; skip if file already exists - Extract timeout constants as task parameters with defaults (TimeoutMinutesPerAssembly=2, TimeoutMinutesOverhead=5, MaxTimeoutMinutes=45) - Remove playwright handling from batched items; add assertion that batched items don't require Playwright - Replace command-line parsing with direct metadata reads (RuntimeVersion, QueueName, TestingArchitecture, RunQuarantined) - Remove CreateDirectorySymlink helper; inline Directory.CreateSymbolicLink Architecture safety: batching runs per-queue (inside HelixTargetQueue inner build), so all items in a batch share the same OS/architecture. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validate that all work items in a batch share the same RuntimeVersion, QueueName, TestingArchitecture, and RunQuarantined values. Logs an error if any item differs from the first item in the batch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| var nugetConfigPath = SkipOnHelixAttribute.OnHelix() ? | ||
| Path.Combine( | ||
| Environment.GetEnvironmentVariable("HELIX_WORKITEM_ROOT"), | ||
| AppContext.BaseDirectory, |
There was a problem hiding this comment.
The work item has individual test dirs for each work item (since they could have different deps)...the work item no longer executes at the work item root. Basically the nuget.config location is going to be along the test dll, which is no longer at the work item root.
There was a problem hiding this comment.
I wonder if you could just remove the ternary altogether and just use AppContext...
- Batched work items now inherit the same timeout each individual work item was given (HelixTimeout, 45 min) instead of an aggressive 2min/assembly estimate that risked killing legitimate slow batches. The per-test --blame-hang-timeout (15 min) still catches true hangs. - Remove the per-assembly/overhead/max timeout magic numbers; add a single DefaultTimeout fallback parameter. - Rename runtests.cmd \ to \/\. - Document why the default batch size is 20. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Template test projects declare long custom HelixTimeout values (1-2.5h) because each test scaffolds, restores, builds and runs a real app. When these heavy assemblies were batched with others, the batch inherited a single (shorter) timeout and was killed with exit 137 (Tests exceeded configured timeout). Exclude any project whose HelixTimeout differs from the default from batching. These projects are explicitly signaling they are heavy and need a dedicated work item timeout. Today this is the 3 ProjectTemplates test projects; the rule generalizes to any future heavy project. Also add Timeout to the batched work item consistency validation so a batch can never silently mix items that expect different timeouts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Subset 1 winds up w/ 17 workitems and Subset 2 winds up w/ 7, so it might be worth shuffling things around in eng/build.props |
Move 5 standalone Windows-only test projects (HttpSys/IIS/Diagnostics/Spa functional tests) from Helix Subset 1 to Subset 2 so the batched Windows work item counts are balanced (~12/12 instead of 17/7). These projects have special PreCommands and are not batchable, so they were the source of the imbalance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FailEventSync/FailParametersSync/FailBatchSync captured a Stopwatch timestamp and immediately invoked the synchronous metric method, then asserted the recorded duration was > 0. On fast machines fewer than one Stopwatch tick elapses, so the duration rounds to 0 and the assertion flakes (seen on OSX.26.Arm64 in this PR's batched Helix legs). Await a small delay before recording to guarantee a measurable duration, mirroring the async sibling tests. Fixes #67305 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
|
||
| // Act | ||
| var startTimestamp = Stopwatch.GetTimestamp(); | ||
| await Task.Delay(10); // Small delay to ensure measurable duration |
There was a problem hiding this comment.
We kept getting failures where https://github.com/dotnet/aspnetcore/pull/66808/changes#diff-b9bcdce5eeff740407c209e80396dbfcbd9b4e4f112fc696b230c57de9d4755dR139 was zero. Most tests here do a short delay after starting the stop watch. The hypothesis is that in some cases the test is finishing so fast that the measured value is zero.
There was a problem hiding this comment.
It looks like that was the issue.
Helix efficiency: batched (this PR) vs un-batched
Builds (pipeline
Work item counts (batched 1471086 vs main 1442279)
PR breakdown: 29 batched ( Total compute time (sum of every work item's run time on a Helix machine)
→ −52.4% total Helix compute (8.68 h → 4.13 h), from eliminating ~454 work items' worth of per-item overhead (tool installs, vstest startup, machine acquisition, result upload). Wall-clock critical path — PR-to-PR (same queues)Both builds' overall critical path is the Tests: Helix x64 Subset 1 leg. Stepping into that leg:
Helix Subset 1 leg across recent un-batched PRs on the current queues: 50.1 / 53.1 / 74.6 / 86.1 / 87.2 min — the batched 89.6 min sits within that spread. Wall-clock on these queues is dominated by queue congestion, not work-item structure. (For reference, the old Efficiency summary
Takeaways
Methodology: Helix job GUIDs from each leg's "Run build.cmd helix target" SendHelixJob output; per-item durations from the Helix |
wtgodbe
left a comment
There was a problem hiding this comment.
Talked offline, the wall clock time increase that the agent reported was just noise from a queue time spike.
Batch small Helix work items to reduce per-item overhead
Problem
Each ASP.NET Core CI build sends ~503 Helix work items, but most have only 2-10 seconds of actual test execution with ~20-40 seconds of per-item overhead (tool installs, vstest startup, result upload). This wastes ~5.6 compute-hours per build.
Solution
Batch compatible small test assemblies into groups of ~20 per Helix work item. Assemblies with special dependencies (IIS, Playwright, Java, Node, MSSQL) remain as individual items.
Measured Results
Per-queue breakdown:
How it works
eng/helix/helix.proj- newBatchSmallWorkItemsMSBuild inline task runs after the existingGathertarget. Groups eligible items by TFM, creates combined payload directories, writes atargets.txtmanifest.eng/tools/HelixTestRunner- accepts--targets-file targets.txtto rundotnet testsequentially for each assembly. Tool installs happen once per batch. Test results are merged.eng/helix/content/runtests.cmd/runtests.sh- detect@targets.txtprefix for batched mode. Fully backward compatible.Batching rules
PreCommands(no IIS, Playwright, Java, Node, MSSQL deps)CI validation