Batch small Helix work items to reduce per-item overhead (-53% compute) by mmitche · Pull Request #66808 · dotnet/aspnetcore

mmitche · 2026-05-22T22:17:52Z

Batch small Helix work items to reduce per-item overhead

Problem

Each ASP.NET Core CI build sends ~503 Helix work items, but most have only 2-10 seconds of actual test execution with ~20-40 seconds of per-item overhead (tool installs, vstest startup, result upload). This wastes ~5.6 compute-hours per build.

Solution

Batch compatible small test assemblies into groups of ~20 per Helix work item. Assemblies with special dependencies (IIS, Playwright, Java, Node, MSSQL) remain as individual items.

Measured Results

Metric	Baseline	Batched	Change
Work items	503	47	-91%
Total compute	634 min (10.6 hrs)	300 min (5.0 hrs)	-53%

Per-queue breakdown:

Queue	Baseline (items/min)	Batched (items/min)	Savings
Windows (vs2026)	189 / 300 min	23 / 148 min	-51%
macOS (osx.15)	156 / 239 min	11 / 88 min	-63%
Ubuntu (2404)	158 / 95 min	13 / 64 min	-32%

How it works

eng/helix/helix.proj - new BatchSmallWorkItems MSBuild inline task runs after the existing Gather target. Groups eligible items by TFM, creates combined payload directories, writes a targets.txt manifest.
eng/tools/HelixTestRunner - accepts --targets-file targets.txt to run dotnet test sequentially for each assembly. Tool installs happen once per batch. Test results are merged.
eng/helix/content/runtests.cmd / runtests.sh - detect @targets.txt prefix for batched mode. Fully backward compatible.

Batching rules

Eligible: Items with no special PreCommands (no IIS, Playwright, Java, Node, MSSQL deps)
Group by: Target framework (net11.0, net472, etc.)
Max batch size: 20 assemblies per work item
Failed tests: Batched items exit 0 so all test results are reported. Individual failures visible via test results XML.

CI validation

Build 1432072 - Windows pass
Build 1431657 - Ubuntu + Windows pass

Reduces ~503 work items per build to ~50 by batching compatible small test assemblies (those without special dependencies like IIS/Playwright) into groups of up to 20. Each batched work item runs dotnet test sequentially for each assembly, sharing the per-item setup overhead (tool installs, env config, vstest launcher) across the batch. Expected impact: ~60-70% reduction in total compute per build. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The batch task was using non-existent metadata fields (RuntimeVersion, QueueName, etc.) resulting in empty command arguments. Ubuntu items were submitted with empty runtime/queue args and never picked up by agents. Now parses these values from the first item's existing Command string. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When one assembly in a batch fails, the entire batch exited with code 1, causing the Helix SDK to treat it as a failed work item and not report test results from the other 19 passing assemblies to AzDO. This caused ~1300 missing tests in the test count. Now batched runs always exit 0 so all results are reported. Individual test failures are visible through the test results XML. This also prevents wasteful retries of entire 20-assembly batches for a single flaky test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dotnet-policy-service · 2026-05-22T22:18:27Z

Hey @dotnet/aspnet-build, looks like this PR is something you want to take a look at.

mmitche · 2026-05-22T22:19:15Z

@wtgodbe THis is an attempt to reduce the test overhead in aspnetcore. It is NOT ready (prototype included some hacks for testing)

wtgodbe · 2026-05-22T22:32:18Z

Love this idea! Will take a closer look next week, but worth noting that the helix tests are currently broken into 2 subsets:

aspnetcore/eng/Build.props

Lines 236 to 297 in 27c660e

    
                   <ProjectsWithTestsSubset1 Include=" 
        
                                     $(RepoRoot)src\Framework\AspNetCoreAnalyzers\test\Microsoft.AspNetCore.App.Analyzers.Test.csproj; 
        
                                     $(RepoRoot)src\Framework\test\Microsoft.AspNetCore.App.UnitTests.csproj; 
        
                                     $(RepoRoot)src\Caching\**\*.*proj; 
        
                                     $(RepoRoot)src\DefaultBuilder\**\*.*proj; 
        
                                     $(RepoRoot)src\Features\**\*.*proj; 
        
                                     $(RepoRoot)src\DataProtection\**\*.*proj; 
        
                                     $(RepoRoot)src\Antiforgery\**\*.*proj; 
        
                                     $(RepoRoot)src\Hosting\**\*.*proj; 
        
                                     $(RepoRoot)src\Http\**\*.*proj; 
        
                                     $(RepoRoot)src\HttpClientFactory\**\*.*proj; 
        
                                     $(RepoRoot)src\Html.Abstractions\**\*.*proj; 
        
                                     $(RepoRoot)src\Identity\**\*.*proj; 
        
                                     $(RepoRoot)src\Servers\**\*.csproj; 
        
                                     $(RepoRoot)src\Security\**\*.*proj; 
        
                                     $(RepoRoot)src\SiteExtensions\Microsoft.Web.Xdt.Extensions\**\*.csproj; 
        
                                     $(RepoRoot)src\SiteExtensions\LoggingAggregate\test\**\*.csproj; 
        
                                     $(RepoRoot)src\Shared\**\*.*proj; 
        
                                     $(RepoRoot)src\Tools\**\*.*proj; 
        
                                     $(RepoRoot)src\Logging.AzureAppServices\**\src\*.csproj; 
        
                                     $(RepoRoot)src\Middleware\**\*.csproj; 
        
                                     " 
        
                                   Exclude=" 
        
                                     @(ProjectToBuild); 
        
                                     @(ProjectToExclude); 
        
                                     $(RepoRoot)**\node_modules\**\*; 
        
                                     $(RepoRoot)**\bin\**\*; 
        
                                     $(RepoRoot)**\obj\**\*;" 
        
                                   Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " /> 
        
                   <ProjectsWithTestsSubset2 Include=" 
        
                                     $(RepoRoot)src\Razor\**\*.*proj; 
        
                                     $(RepoRoot)src\Mvc\**\*.*proj; 
        
                                     $(RepoRoot)src\Azure\**\*.*proj; 
        
                                     $(RepoRoot)src\SignalR\**\*.csproj; 
        
                                     $(RepoRoot)src\StaticAssets\**\*.csproj; 
        
                                     $(RepoRoot)src\Components\**\*.csproj; 
        
                                     $(RepoRoot)src\Analyzers\**\*.csproj; 
        
                                     $(RepoRoot)src\FileProviders\**\*.csproj; 
        
                                     $(RepoRoot)src\Configuration.KeyPerFile\**\*.csproj; 
        
                                     $(RepoRoot)src\Localization\**\*.csproj; 
        
                                     $(RepoRoot)src\ObjectPool\**\*.csproj; 
        
                                     $(RepoRoot)src\JSInterop\**\*.csproj; 
        
                                     $(RepoRoot)src\WebEncoders\**\*.csproj; 
        
                                     $(RepoRoot)src\HealthChecks\**\*.csproj; 
        
                                     $(RepoRoot)src\Testing\**\*.csproj; 
        
                                     $(RepoRoot)src\Grpc\**\*.csproj; 
        
                                     $(RepoRoot)src\ProjectTemplates\**\*.csproj; 
        
                                     $(RepoRoot)src\Extensions\**\*.csproj; 
        
                                     $(RepoRoot)src\OpenApi\**\*.csproj; 
        
                                     $(RepoRoot)src\Validation\**\*.csproj; 
        
                                     " 
        
                                   Exclude=" 
        
                                     @(ProjectToBuild); 
        
                                     @(ProjectToExclude); 
        
                                     $(RepoRoot)**\node_modules\**\*; 
        
                                     $(RepoRoot)**\bin\**\*; 
        
                                     $(RepoRoot)**\obj\**\*;" 
        
                                   Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " /> 
        
                   <DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '1'" Include="@(ProjectsWithTestsSubset1)" /> 
        
                   <DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '2'" Include="@(ProjectsWithTestsSubset2)" />

,

aspnetcore/.azure/pipelines/ci-public.yml

Lines 562 to 591 in 27c660e

    
           # Helix x64 subset 1 
        
           - template: jobs/default-build.yml 
        
             parameters: 
        
               jobName: Helix_x64_Subset_1 
        
               jobDisplayName: 'Tests: Helix x64 Subset 1' 
        
               agentOs: Windows 
        
               timeoutInMinutes: 240 
        
               steps: 
        
               # Build the shared framework 
        
               - script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -pack -arch x64 
        
                         /p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs) 
        
                         /p:VsTestUseMSBuildOutput=false $(HelixSubset1LogArgs) 
        
                 displayName: Build shared fx 
        
               # -noBuildNative -noBuild to avoid repeating work done in the previous step. 
        
               - script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -noBuildNative -noBuild -test 
        
                         -projects eng\helix\helix.proj /p:IsHelixPRCheck=true /p:IsHelixJob=true 
        
                         /p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs) 
        
                         /p:VsTestUseMSBuildOutput=false /p:RunTemplateTests=false /p:HelixSubset=1 
        
                 displayName: Run build.cmd helix target 
        
                 env: 
        
                   HelixApiAccessToken: $(HelixApiAccessToken) # Needed for internal queues 
        
                   SYSTEM_ACCESSTOKEN: $(System.AccessToken) # We need to set this env var to publish helix results to Azure Dev Ops 
        
               artifacts: 
        
               - name: Helix_Subset_1_Logs_Attempt_$(System.JobAttempt) 
        
                 path: artifacts/log/ 
        
                 publishOnError: true 
        
                 includeForks: true 
        
           # Helix x64 subset 2

. We did this because parallelizing the helix tests cut about 40 minutes off the build - but the 2 groups were chosen completely arbitrarily. There might be some extra time to be saved by re-subsetting based on whatever "compatibility" heuristic you're currently using

mmitche · 2026-05-26T18:40:17Z

Love this idea! Will take a closer look next week, but worth noting that the helix tests are currently broken into 2 subsets:

aspnetcore/eng/Build.props

Lines 236 to 297 in 27c660e

<ProjectsWithTestsSubset1 Include="

$(RepoRoot)src\Framework\AspNetCoreAnalyzers\test\Microsoft.AspNetCore.App.Analyzers.Test.csproj;

$(RepoRoot)src\Framework\test\Microsoft.AspNetCore.App.UnitTests.csproj;

$(RepoRoot)src\Caching\**\*.*proj;

$(RepoRoot)src\DefaultBuilder\**\*.*proj;

$(RepoRoot)src\Features\**\*.*proj;

$(RepoRoot)src\DataProtection\**\*.*proj;

$(RepoRoot)src\Antiforgery\**\*.*proj;

$(RepoRoot)src\Hosting\**\*.*proj;

$(RepoRoot)src\Http\**\*.*proj;

$(RepoRoot)src\HttpClientFactory\**\*.*proj;

$(RepoRoot)src\Html.Abstractions\**\*.*proj;

$(RepoRoot)src\Identity\**\*.*proj;

$(RepoRoot)src\Servers\**\*.csproj;

$(RepoRoot)src\Security\**\*.*proj;

$(RepoRoot)src\SiteExtensions\Microsoft.Web.Xdt.Extensions\**\*.csproj;

$(RepoRoot)src\SiteExtensions\LoggingAggregate\test\**\*.csproj;

$(RepoRoot)src\Shared\**\*.*proj;

$(RepoRoot)src\Tools\**\*.*proj;

$(RepoRoot)src\Logging.AzureAppServices\**\src\*.csproj;

$(RepoRoot)src\Middleware\**\*.csproj;

"

Exclude="

@(ProjectToBuild);

@(ProjectToExclude);

$(RepoRoot)**\node_modules\**\*;

$(RepoRoot)**\bin\**\*;

$(RepoRoot)**\obj\**\*;"

Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " />

<ProjectsWithTestsSubset2 Include="

$(RepoRoot)src\Razor\**\*.*proj;

$(RepoRoot)src\Mvc\**\*.*proj;

$(RepoRoot)src\Azure\**\*.*proj;

$(RepoRoot)src\SignalR\**\*.csproj;

$(RepoRoot)src\StaticAssets\**\*.csproj;

$(RepoRoot)src\Components\**\*.csproj;

$(RepoRoot)src\Analyzers\**\*.csproj;

$(RepoRoot)src\FileProviders\**\*.csproj;

$(RepoRoot)src\Configuration.KeyPerFile\**\*.csproj;

$(RepoRoot)src\Localization\**\*.csproj;

$(RepoRoot)src\ObjectPool\**\*.csproj;

$(RepoRoot)src\JSInterop\**\*.csproj;

$(RepoRoot)src\WebEncoders\**\*.csproj;

$(RepoRoot)src\HealthChecks\**\*.csproj;

$(RepoRoot)src\Testing\**\*.csproj;

$(RepoRoot)src\Grpc\**\*.csproj;

$(RepoRoot)src\ProjectTemplates\**\*.csproj;

$(RepoRoot)src\Extensions\**\*.csproj;

$(RepoRoot)src\OpenApi\**\*.csproj;

$(RepoRoot)src\Validation\**\*.csproj;

"

Exclude="

@(ProjectToBuild);

@(ProjectToExclude);

$(RepoRoot)**\node_modules\**\*;

$(RepoRoot)**\bin\**\*;

$(RepoRoot)**\obj\**\*;"

Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " />

<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '1'" Include="@(ProjectsWithTestsSubset1)" />

<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '2'" Include="@(ProjectsWithTestsSubset2)" />

,

aspnetcore/.azure/pipelines/ci-public.yml

Lines 562 to 591 in 27c660e

# Helix x64 subset 1

- template: jobs/default-build.yml

parameters:

jobName: Helix_x64_Subset_1

jobDisplayName: 'Tests: Helix x64 Subset 1'

agentOs: Windows

timeoutInMinutes: 240

steps:

# Build the shared framework

- script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -pack -arch x64

/p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)

/p:VsTestUseMSBuildOutput=false $(HelixSubset1LogArgs)

displayName: Build shared fx

# -noBuildNative -noBuild to avoid repeating work done in the previous step.

- script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -noBuildNative -noBuild -test

-projects eng\helix\helix.proj /p:IsHelixPRCheck=true /p:IsHelixJob=true

/p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)

/p:VsTestUseMSBuildOutput=false /p:RunTemplateTests=false /p:HelixSubset=1

displayName: Run build.cmd helix target

env:

HelixApiAccessToken: $(HelixApiAccessToken) # Needed for internal queues

SYSTEM_ACCESSTOKEN: $(System.AccessToken) # We need to set this env var to publish helix results to Azure Dev Ops

artifacts:

- name: Helix_Subset_1_Logs_Attempt_$(System.JobAttempt)

path: artifacts/log/

publishOnError: true

includeForks: true

# Helix x64 subset 2

. We did this because parallelizing the helix tests cut about 40 minutes off the build - but the 2 groups were chosen completely arbitrarily. There might be some extra time to be saved by re-subsetting based on whatever "compatibility" heuristic you're currently using

@wtgodbe Do you happen to remember why it cut that much time off PRs? Was it due to test time or time to submit the tests.

wtgodbe · 2026-05-26T21:15:54Z

@wtgodbe Do you happen to remember why it cut that much time off PRs? Was it due to test time or time to submit the tests.

I don't remember off the top of my head, but release/9.0 and release/8.0 PRs are still doing one mondo-job, so we should be able to check & find out

wtgodbe · 2026-05-27T20:25:00Z

        - script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -noBuildNative -noBuild -test
                  -projects eng\helix\helix.proj /p:IsHelixPRCheck=true /p:IsHelixJob=true
                  /p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)
-                  /p:VsTestUseMSBuildOutput=false /p:RunTemplateTests=false /p:HelixSubset=1


Must get rid of ProjectsWithTestsSubset1 / ProjectsWithTestsSubset2 if you're gonna remove these properties:

aspnetcore/eng/Build.props

Lines 296 to 297 in 2c960bf

<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '1'" Include="@(ProjectsWithTestsSubset1)" />

<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '2'" Include="@(ProjectsWithTestsSubset2)" />

Copilot

Pull request overview

This PR introduces Helix work item batching to reduce per-item overhead by combining multiple small test assemblies into a single Helix work item, while keeping assemblies with special dependencies unbatched. It also updates HelixTestRunner and CI pipelines to support the new batched execution model.

Changes:

Add an MSBuild batching step in eng/helix/helix.proj to group eligible Helix work items (by TFM) and generate a targets.txt manifest.
Extend HelixTestRunner to accept multiple targets / a --targets-file input, and update Helix run scripts to detect @targets.txt.
Simplify CI Helix execution by removing subset splitting and running a single combined Helix job.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
eng/tools/HelixTestRunner/TestRunner.cs	Adds multi-target discovery/execution support and adjusts artifact/log handling for batched payload layouts.
eng/tools/HelixTestRunner/Program.cs	Updates startup logging and exception formatting.
eng/tools/HelixTestRunner/HelixTestRunnerOptions.cs	Adds `--targets-file` and supports multiple `--target` values.
eng/targets/Helix.targets	Adds metadata used to determine batching eligibility (TFM + special dependency flags).
eng/helix/helix.proj	Implements the batching inline task and replaces eligible work items with batched ones.
eng/helix/content/runtests.sh	Adds `@targets.txt` detection and forwards either `--target` or `--targets-file`.
eng/helix/content/runtests.cmd	Adds `@targets.txt` detection and forwards either `--target` or `--targets-file`.
docs/helix-work-item-batching.md	New design document describing the batching approach and operational trade-offs.
.azure/pipelines/ci.yml	Collapses Helix subset jobs into a single Helix x64 job and updates binlog variables/artifact names.
.azure/pipelines/ci-unofficial.yml	Same Helix subset collapse as official CI.
.azure/pipelines/ci-public.yml	Same Helix subset collapse as public CI.

        }
-        ProcessUtil.PrintMessage($"Copying artifacts/log/ to {HELIX_WORKITEM_UPLOAD_ROOT}/");
-        if (Directory.Exists("artifacts/log"))
+
+        // Copy logs from each assembly's subdirectory
+        foreach (var target in Options.Targets)
        {
-            foreach (var file in Directory.EnumerateFiles("artifacts/log", "*.log", SearchOption.AllDirectories))
+            var workingDirectory = GetTargetWorkingDirectory(target);
+            var assemblyName = GetSanitizedAssemblyName(target);
+
+            var artifactsLogDirectory = Path.Combine(workingDirectory, "artifacts", "log");
+            if (Directory.Exists(artifactsLogDirectory))
            {
-                // Combine the directory name + log name for the copied log file name to avoid overwriting
-                // duplicate test names in different test projects
-                var logName = $"{Path.GetFileName(Path.GetDirectoryName(file))}_{Path.GetFileName(file)}";
-                ProcessUtil.PrintMessage($"Copying: {file} to {Path.Combine(HELIX_WORKITEM_UPLOAD_ROOT, logName)}");
-                File.Copy(file, Path.Combine(HELIX_WORKITEM_UPLOAD_ROOT, logName));
+                ProcessUtil.PrintMessage($"Copying artifacts/log/ to {HELIX_WORKITEM_UPLOAD_ROOT}/");
+                foreach (var file in Directory.EnumerateFiles(artifactsLogDirectory, "*.log", SearchOption.AllDirectories))
+                {
+                    var logName = $"{assemblyName}_{Path.GetFileName(Path.GetDirectoryName(file))}_{Path.GetFileName(file)}";
+                    CopyFileToUploadRoot(file, HELIX_WORKITEM_UPLOAD_ROOT, logName);


+#### Stage 3: Helix Execution (modified HelixTestRunner)
+
+On the Helix agent, the `HelixTestRunner` reads `--targets-file targets.txt` and
+iterates:
+
+```
+for each assembly in targets.txt:
+    cd {assembly_subdirectory}
+    dotnet test {assembly.dll} --logger xunit ...
+    collect results
+merge all TestResults.xml → testResults.xml
+upload merged results
+```


+### Test Result Handling
+
+**Problem:** When one assembly in a batch has a test failure, the HelixTestRunner
+previously exited with code 1. The Helix SDK treated this as a work item failure,
+which prevented test results from being reported to AzDO — losing visibility into
+the other 19 passing assemblies' results.
+
+**Solution:** Batched runs always exit 0. Individual test failures are reported
+through the xUnit test results XML. The Helix SDK picks up the merged
+`testResults.xml` and reports all individual test outcomes to AzDO.
+
+```csharp
+// Program.cs
+if (runner.Options.IsBatched && exitCode != 0)
+{
+    ProcessUtil.PrintMessage($"Batched run had test failures but exiting 0 for result reporting.");
+    exitCode = 0;
+}
+```
+
+**Trade-off:** Failed batches are not retried by Helix's built-in retry mechanism.
+This is intentional — retrying a 20-assembly batch for one flaky test wastes compute.
+Individual flaky tests should be addressed at the test level, not by re-running
+entire batches.
+
+### Result Merging
+
+The `HelixTestRunner.UploadResults()` method:
+
+1. Runs `dotnet test` in each assembly's subdirectory, producing per-assembly
+   `TestResults/TestResults.xml`
+2. Copies each to `testResults_{assemblyName}.xml` in the work item root
+3. If multiple results exist, merges them into a single `testResults.xml` by
+   combining all `<assembly>` elements under a unified `<assemblies>` root
+4. Copies all results to `HELIX_WORKITEM_UPLOAD_ROOT`


                if (result.ExitCode != 0)
                {
                    ProcessUtil.PrintMessage($"Failure in quarantined tests. Exit code: {result.ExitCode}.");
+                    exitCode = result.ExitCode;


The previous change swallowed non-zero exit codes for batched items, making genuine test failures invisible at the pipeline level. The exit code should always propagate — the Helix SDK's AzDO reporter picks up test results from testResults.xml regardless of exit code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Move EnsureTargetRunnerConfiguration back to SetupEnvironment() so it runs once (matching original), remove duplicate calls from discovery and execution stages - Fix batch timeout: use 2min/assembly + 5min overhead instead of summing individual 45min timeouts (which always hit the 45min cap) - Remove unused ParseTimeout helper - Test results: keep the merge approach (Roslyn avoids merging by using a single vstest.console invocation via rsp file, but aspnetcore uses dotnet test with quarantine filtering which requires per-assembly invocations) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The CopyDirectory approach recursively copied every file from each assembly's publish directory into the batch directory. With 20 assemblies per batch x 5 batches per subset, this was ~100 recursive directory copies causing the helix target step to take 90+ minutes and timeout at 240 min. Replace with Directory.CreateSymbolicLink (matching Roslyn's approach). The Helix SDK follows symlinks when zipping payloads. Only the batch root's shared files (runtests.cmd/sh, NuGet.config) are copied. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mmitche · 2026-05-28T17:25:12Z

Hold on this..going to split it into smaller PRs.

dotnet test accepts multiple DLL paths. Pass all target assemblies in a single invocation instead of iterating per-assembly. This: - Produces one unified test results file (no merging needed) - Saves ~2s per extra assembly (vstest host startup) - Preserves quarantine exit code behavior (don't fail on quarantined) - Simplifies UploadResults and AddHelixSourcesAsync Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Tests that reference HELIX_WORKITEM_ROOT to find content files (baselines, snapshots, nupkgs) are incompatible with work item batching because the content files are inside symlinked subdirectories, not at the batch root. Add TestUsesHelixWorkItemRoot property for projects with custom content, and include TestDependsOnAspNetPackages/TestDependsOnAspNetAppPackages in HasSpecialDependencies to exclude these from batching. Affected projects: - Http.Extensions.Tests (RequestDelegateGenerator baselines) - OpenApi.Tests & SourceGenerators.Tests (snapshots) - Validation.GeneratorTests (snapshots) - App.Analyzers.Test (NuGet.config lookup) - App.UnitTests (via TestDependsOnAspNetAppPackages) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Tests that find content files (baselines, snapshots, nupkgs) on Helix previously used HELIX_WORKITEM_ROOT which points to the work item execution root. This is incompatible with work item batching where content is in per-project subdirectories. AppContext.BaseDirectory resolves to the DLL's directory, which is correct in all modes: - Non-batched Helix: same as HELIX_WORKITEM_ROOT (publish dir) - Batched Helix: per-project symlink to publish dir - Local development: falls through to relative path (unchanged) This also removes the TestUsesHelixWorkItemRoot property and reverts the HasSpecialDependencies condition, allowing these projects to participate in batching for better compute efficiency. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Move the inline RoslynCodeTaskFactory task from helix.proj into the compiled RepoTasks assembly. This provides better IDE support, proper compilation errors, and uses Directory.CreateSymbolicLink directly instead of shelling out to mklink (now targeting modern .NET runtime). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fully qualify Microsoft.Build.Utilities.Task to avoid ambiguity with System.Threading.Tasks.Task. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Move the import after ArtifactsBinDir is defined so the task assembly path resolves correctly. The Helix SDK doesn't include Arcade SDK's ProjectLayout.props, so ArtifactsBinDir must be explicitly set first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Import was inserted mid-PropertyGroup, splitting it in two and causing an XML parse error. Move the import after the complete PropertyGroup block. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

This assembly has 1999 tests and causes test host hangs on macOS when batched with other assemblies. Add SkipHelixWorkItemBatching property to opt specific projects out of work item batching. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1. All work items now use @targets.txt command format. For non-batched items, the batcher writes a single-entry targets.txt. This removes the conditional logic from runtests.cmd and runtests.sh. 2. Rename HasSpecialDependencies metadata to SkipHelixWorkItemBatching. Set to true when any TestDepends* condition is met or when projects explicitly opt out. Defaults to not-set (eligible for batching). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Remove overwrite:true from File.Copy; skip if file already exists - Extract timeout constants as task parameters with defaults (TimeoutMinutesPerAssembly=2, TimeoutMinutesOverhead=5, MaxTimeoutMinutes=45) - Remove playwright handling from batched items; add assertion that batched items don't require Playwright - Replace command-line parsing with direct metadata reads (RuntimeVersion, QueueName, TestingArchitecture, RunQuarantined) - Remove CreateDirectorySymlink helper; inline Directory.CreateSymbolicLink Architecture safety: batching runs per-queue (inside HelixTargetQueue inner build), so all items in a batch share the same OS/architecture. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Validate that all work items in a batch share the same RuntimeVersion, QueueName, TestingArchitecture, and RunQuarantined values. Logs an error if any item differs from the first item in the batch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wtgodbe

LGTM modulo a few questions

wtgodbe · 2026-06-04T22:02:12Z

        var nugetConfigPath = SkipOnHelixAttribute.OnHelix() ?
            Path.Combine(
-                Environment.GetEnvironmentVariable("HELIX_WORKITEM_ROOT"),
+                AppContext.BaseDirectory,


Why was this necessary?

The work item has individual test dirs for each work item (since they could have different deps)...the work item no longer executes at the work item root. Basically the nuget.config location is going to be along the test dll, which is no longer at the work item root.

I wonder if you could just remove the ternary altogether and just use AppContext...

- Batched work items now inherit the same timeout each individual work item was given (HelixTimeout, 45 min) instead of an aggressive 2min/assembly estimate that risked killing legitimate slow batches. The per-test --blame-hang-timeout (15 min) still catches true hangs. - Remove the per-assembly/overhead/max timeout magic numbers; add a single DefaultTimeout fallback parameter. - Rename runtests.cmd \ to \/\. - Document why the default batch size is 20. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Template test projects declare long custom HelixTimeout values (1-2.5h) because each test scaffolds, restores, builds and runs a real app. When these heavy assemblies were batched with others, the batch inherited a single (shorter) timeout and was killed with exit 137 (Tests exceeded configured timeout). Exclude any project whose HelixTimeout differs from the default from batching. These projects are explicitly signaling they are heavy and need a dedicated work item timeout. Today this is the 3 ProjectTemplates test projects; the rule generalizes to any future heavy project. Also add Timeout to the batched work item consistency validation so a batch can never silently mix items that expect different timeouts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wtgodbe · 2026-06-11T14:22:41Z

Subset 1 winds up w/ 17 workitems and Subset 2 winds up w/ 7, so it might be worth shuffling things around in eng/build.props

Move 5 standalone Windows-only test projects (HttpSys/IIS/Diagnostics/Spa functional tests) from Helix Subset 1 to Subset 2 so the batched Windows work item counts are balanced (~12/12 instead of 17/7). These projects have special PreCommands and are not batchable, so they were the source of the imbalance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

FailEventSync/FailParametersSync/FailBatchSync captured a Stopwatch timestamp and immediately invoked the synchronous metric method, then asserted the recorded duration was > 0. On fast machines fewer than one Stopwatch tick elapses, so the duration rounds to 0 and the assertion flakes (seen on OSX.26.Arm64 in this PR's batched Helix legs). Await a small delay before recording to guarantee a measurable duration, mirroring the async sibling tests. Fixes #67305 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

javiercn · 2026-06-18T16:22:02Z


        // Act
        var startTimestamp = Stopwatch.GetTimestamp();
+        await Task.Delay(10); // Small delay to ensure measurable duration


What is this for?

We kept getting failures where https://github.com/dotnet/aspnetcore/pull/66808/changes#diff-b9bcdce5eeff740407c209e80396dbfcbd9b4e4f112fc696b230c57de9d4755dR139 was zero. Most tests here do a short delay after starting the stop watch. The hypothesis is that in some cases the test is finishing so fast that the measured value is zero.

It looks like that was the issue.

mmitche · 2026-06-18T18:23:07Z

Helix efficiency: batched (this PR) vs un-batched

Edited to correct the wall-clock analysis. My first version compared against a main build that happened to run on faster/less-congested queues (OSX.15.Amd64 / …VS2026.Pre.Scout), which made batching look like a ~80% wall-clock regression. Controlling for that with a PR-to-PR comparison on the current queues (OSX.26.Arm64 / VS2026) shows the wall-clock impact is within noise (~3 min). Compute/work-item wins are unchanged.

Builds (pipeline aspnetcore-ci, dnceng-public/public):

Batched (this PR Batch small Helix work items to reduce per-item overhead (-53% compute) #66808): 1471086 — both Helix legs green ✅
Un-batched baseline, same-day / same queues: 1470188 (PR Add warnings for incorrect spacer element usage based on parent element #66968)
Un-batched, older main build (different/faster queues — compute reference only): 1442279

Work item counts (batched 1471086 vs main 1442279)

	Subset 1	Subset 2	Total
Un-batched	297	206	503
Batched	23	26	49
			−90.3%

PR breakdown: 29 batched (batch_*) + 20 standalone (IIS/Playwright/Java/Node/MSSQL/custom-timeout items that can't be batched).

Total compute time (sum of every work item's run time on a Helix machine)

Queue family	Un-batched items / min	Batched items / min
Ubuntu	158 / 80.6	14 / 65.0
OSX	156 / 117.5	11 / 27.9
Windows	189 / 322.8	24 / 155.0
Total	503 / 520.9 min (8.68 h)	49 / 247.9 min (4.13 h)

→ −52.4% total Helix compute (8.68 h → 4.13 h), from eliminating ~454 work items' worth of per-item overhead (tool installs, vstest startup, machine acquisition, result upload).

Wall-clock critical path — PR-to-PR (same queues)

Both builds' overall critical path is the Tests: Helix x64 Subset 1 leg. Stepping into that leg:

Step	Un-batched 1470188	Batched 1471086	Δ
Build shared fx	22.7 min	24.4 min	+1.7
Run helix target (build tests + send + wait for Helix)	61.6 min	63.3 min	+1.7
Helix Subset 1 leg (critical path)	86.1 min	89.6 min	+3.5
Total build span	94.9 min	97.9 min	+3.0

Helix Subset 1 leg across recent un-batched PRs on the current queues: 50.1 / 53.1 / 74.6 / 86.1 / 87.2 min — the batched 89.6 min sits within that spread. Wall-clock on these queues is dominated by queue congestion, not work-item structure.

(For reference, the old main build 1442279 ran Helix Subset 1 in 49.5 min — but on the older, less-congested OSX.15 / VS2026.Pre.Scout queues, so it's not comparable for wall-clock.)

Efficiency summary

Metric	Baseline	Batched (this PR)	Δ
Work items	503	49	−90.3%
Total compute	8.68 h	4.13 h	−52.4%
Avg compute / work item	1.04 min	5.06 min	denser items
Wall-clock critical path (PR-to-PR, same queues)	86.1 min	89.6 min	+3.5 min (within noise)

Takeaways

✅ 90% fewer work items and 52% less total compute / machine-hours — the intended overhead and cost win.
✅ No meaningful wall-clock regression: on equivalent queues the critical Helix leg is +3.5 min (well within the 50–87 min unbatched variance). The earlier "+81% slower" was a queue/OS artifact of the main baseline, not batching.
ℹ️ Wall-clock is currently dominated by Helix queue congestion on OSX.26.Arm64 / VS2026; if PR turnaround matters, that's the lever to pull (queue capacity), independent of this change.

_{Methodology: Helix job GUIDs from each leg's "Run build.cmd helix target" SendHelixJob output; per-item durations from the Helix workitems API (Duration field); synthetic HelixController Work Queueing item excluded (all 552 durations resolved). Wall-clock from AzDO timeline Job/Task records. Per-queue compute between 1471086 and 1442279 mixes batching with hardware/OS differences; the count and total-compute reductions are dominated by batching, and the wall-clock critical-path delta is measured PR-to-PR on identical queues (1470188 vs 1471086).}

wtgodbe

Wall-clock critical path got ~81% worse (49.5 → 89.5 min)

Can this be improved? I'm alright with a little time loss in service of more efficient machine usage, but 40 minutes is a big hit

wtgodbe

Talked offline, the wall clock time increase that the agent reported was just noise from a queue time spike.

mmitche and others added 4 commits May 21, 2026 14:06

Fix: Remove extra quotes around targets-file path in runtests.cmd

bcd4917

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot added the area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework label May 22, 2026

build-analysis Bot mentioned this pull request May 23, 2026

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

3 tasks

wtgodbe reviewed May 27, 2026

View reviewed changes

mmitche marked this pull request as ready for review May 27, 2026 20:25

mmitche requested a review from a team as a code owner May 27, 2026 20:25

Copilot AI review requested due to automatic review settings May 27, 2026 20:25

Copilot started reviewing on behalf of mmitche May 27, 2026 20:25 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

mmitche and others added 2 commits May 27, 2026 20:41

mmitche force-pushed the dev/helix-work-item-batching branch from 6bb2285 to 4e2fd46 Compare May 28, 2026 03:41

Merge branch 'main' into dev/helix-work-item-batching

84da75d

mmitche mentioned this pull request May 28, 2026

Switch HelixTestRunner from dotnet test to dotnet vstest #66896

Closed

mmitche force-pushed the dev/helix-work-item-batching branch from ef5a4e7 to ea30c38 Compare May 28, 2026 17:18

build-analysis Bot mentioned this pull request May 29, 2026

[Known Build Error] TestServerTests.LongPollingWorks - ObjectDisposedException in OnDisconnectedAsync during shutdown #65702

Open

mmitche and others added 2 commits June 1, 2026 11:00

Merge branch 'main' into dev/helix-work-item-batching

3fc7406

mmitche requested a review from a team as a code owner June 1, 2026 19:39

mmitche requested a review from halter73 as a code owner June 1, 2026 19:39

mmitche and others added 2 commits June 1, 2026 13:57

build-analysis Bot mentioned this pull request Jun 2, 2026

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

3 tasks

mmitche and others added 9 commits June 2, 2026 08:49

Fix ambiguous Task reference in BatchHelixWorkItems

0ff6e2e

Fully qualify Microsoft.Build.Utilities.Task to avoid ambiguity with System.Threading.Tasks.Task. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix XML structure in helix.proj

1145499

The Import was inserted mid-PropertyGroup, splitting it in two and causing an XML parse error. Move the import after the complete PropertyGroup block. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge branch 'main' into dev/helix-work-item-batching

8b8722e

Retry CI build

e7e9e93

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wtgodbe approved these changes Jun 4, 2026

View reviewed changes

build-analysis Bot mentioned this pull request Jun 5, 2026

Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. dotnet/dnceng#5922

Open

3 tasks

mmitche and others added 2 commits June 5, 2026 12:32

Merge branch 'main' into dev/helix-work-item-batching

bb0faad

wtgodbe approved these changes Jun 16, 2026

View reviewed changes

mmitche requested a review from a team as a code owner June 18, 2026 16:03

mmitche mentioned this pull request Jun 18, 2026

Flaky: ComponentsMetricsTest.FailParametersSync_RecordsErrorMetric (timing-sensitive duration assertion) #67305

Closed

javiercn reviewed Jun 18, 2026

View reviewed changes

wtgodbe requested changes Jun 18, 2026

View reviewed changes

wtgodbe approved these changes Jun 18, 2026

View reviewed changes

mmitche merged commit a097744 into main Jun 18, 2026
20 of 25 checks passed

mmitche deleted the dev/helix-work-item-batching branch June 18, 2026 22:03

	<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '1'" Include="@(ProjectsWithTestsSubset1)" />
	<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '2'" Include="@(ProjectsWithTestsSubset2)" />

Conversation

mmitche commented May 22, 2026