Skip to content

Batch small Helix work items to reduce per-item overhead (-53% compute)#66808

Merged
mmitche merged 27 commits into
mainfrom
dev/helix-work-item-batching
Jun 18, 2026
Merged

Batch small Helix work items to reduce per-item overhead (-53% compute)#66808
mmitche merged 27 commits into
mainfrom
dev/helix-work-item-batching

Conversation

@mmitche

@mmitche mmitche commented May 22, 2026

Copy link
Copy Markdown
Member

Batch small Helix work items to reduce per-item overhead

Problem

Each ASP.NET Core CI build sends ~503 Helix work items, but most have only 2-10 seconds of actual test execution with ~20-40 seconds of per-item overhead (tool installs, vstest startup, result upload). This wastes ~5.6 compute-hours per build.

Solution

Batch compatible small test assemblies into groups of ~20 per Helix work item. Assemblies with special dependencies (IIS, Playwright, Java, Node, MSSQL) remain as individual items.

Measured Results

Metric Baseline Batched Change
Work items 503 47 -91%
Total compute 634 min (10.6 hrs) 300 min (5.0 hrs) -53%

Per-queue breakdown:

Queue Baseline (items/min) Batched (items/min) Savings
Windows (vs2026) 189 / 300 min 23 / 148 min -51%
macOS (osx.15) 156 / 239 min 11 / 88 min -63%
Ubuntu (2404) 158 / 95 min 13 / 64 min -32%

How it works

  1. eng/helix/helix.proj - new BatchSmallWorkItems MSBuild inline task runs after the existing Gather target. Groups eligible items by TFM, creates combined payload directories, writes a targets.txt manifest.

  2. eng/tools/HelixTestRunner - accepts --targets-file targets.txt to run dotnet test sequentially for each assembly. Tool installs happen once per batch. Test results are merged.

  3. eng/helix/content/runtests.cmd / runtests.sh - detect @targets.txt prefix for batched mode. Fully backward compatible.

Batching rules

  • Eligible: Items with no special PreCommands (no IIS, Playwright, Java, Node, MSSQL deps)
  • Group by: Target framework (net11.0, net472, etc.)
  • Max batch size: 20 assemblies per work item
  • Failed tests: Batched items exit 0 so all test results are reported. Individual failures visible via test results XML.

CI validation

mmitche and others added 4 commits May 21, 2026 14:06
Reduces ~503 work items per build to ~50 by batching compatible small test assemblies (those without special dependencies like IIS/Playwright) into groups of up to 20.

Each batched work item runs dotnet test sequentially for each assembly, sharing the per-item setup overhead (tool installs, env config, vstest launcher) across the batch.

Expected impact: ~60-70% reduction in total compute per build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The batch task was using non-existent metadata fields (RuntimeVersion,
QueueName, etc.) resulting in empty command arguments. Ubuntu items
were submitted with empty runtime/queue args and never picked up by
agents. Now parses these values from the first item's existing Command
string.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When one assembly in a batch fails, the entire batch exited with code 1,
causing the Helix SDK to treat it as a failed work item and not report
test results from the other 19 passing assemblies to AzDO. This caused
~1300 missing tests in the test count.

Now batched runs always exit 0 so all results are reported. Individual
test failures are visible through the test results XML. This also
prevents wasteful retries of entire 20-assembly batches for a single
flaky test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added the area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework label May 22, 2026
@dotnet-policy-service

Copy link
Copy Markdown
Contributor

Hey @dotnet/aspnet-build, looks like this PR is something you want to take a look at.

@mmitche

mmitche commented May 22, 2026

Copy link
Copy Markdown
Member Author

@wtgodbe THis is an attempt to reduce the test overhead in aspnetcore. It is NOT ready (prototype included some hacks for testing)

@wtgodbe

wtgodbe commented May 22, 2026

Copy link
Copy Markdown
Member

Love this idea! Will take a closer look next week, but worth noting that the helix tests are currently broken into 2 subsets:

aspnetcore/eng/Build.props

Lines 236 to 297 in 27c660e

<ProjectsWithTestsSubset1 Include="
$(RepoRoot)src\Framework\AspNetCoreAnalyzers\test\Microsoft.AspNetCore.App.Analyzers.Test.csproj;
$(RepoRoot)src\Framework\test\Microsoft.AspNetCore.App.UnitTests.csproj;
$(RepoRoot)src\Caching\**\*.*proj;
$(RepoRoot)src\DefaultBuilder\**\*.*proj;
$(RepoRoot)src\Features\**\*.*proj;
$(RepoRoot)src\DataProtection\**\*.*proj;
$(RepoRoot)src\Antiforgery\**\*.*proj;
$(RepoRoot)src\Hosting\**\*.*proj;
$(RepoRoot)src\Http\**\*.*proj;
$(RepoRoot)src\HttpClientFactory\**\*.*proj;
$(RepoRoot)src\Html.Abstractions\**\*.*proj;
$(RepoRoot)src\Identity\**\*.*proj;
$(RepoRoot)src\Servers\**\*.csproj;
$(RepoRoot)src\Security\**\*.*proj;
$(RepoRoot)src\SiteExtensions\Microsoft.Web.Xdt.Extensions\**\*.csproj;
$(RepoRoot)src\SiteExtensions\LoggingAggregate\test\**\*.csproj;
$(RepoRoot)src\Shared\**\*.*proj;
$(RepoRoot)src\Tools\**\*.*proj;
$(RepoRoot)src\Logging.AzureAppServices\**\src\*.csproj;
$(RepoRoot)src\Middleware\**\*.csproj;
"
Exclude="
@(ProjectToBuild);
@(ProjectToExclude);
$(RepoRoot)**\node_modules\**\*;
$(RepoRoot)**\bin\**\*;
$(RepoRoot)**\obj\**\*;"
Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " />
<ProjectsWithTestsSubset2 Include="
$(RepoRoot)src\Razor\**\*.*proj;
$(RepoRoot)src\Mvc\**\*.*proj;
$(RepoRoot)src\Azure\**\*.*proj;
$(RepoRoot)src\SignalR\**\*.csproj;
$(RepoRoot)src\StaticAssets\**\*.csproj;
$(RepoRoot)src\Components\**\*.csproj;
$(RepoRoot)src\Analyzers\**\*.csproj;
$(RepoRoot)src\FileProviders\**\*.csproj;
$(RepoRoot)src\Configuration.KeyPerFile\**\*.csproj;
$(RepoRoot)src\Localization\**\*.csproj;
$(RepoRoot)src\ObjectPool\**\*.csproj;
$(RepoRoot)src\JSInterop\**\*.csproj;
$(RepoRoot)src\WebEncoders\**\*.csproj;
$(RepoRoot)src\HealthChecks\**\*.csproj;
$(RepoRoot)src\Testing\**\*.csproj;
$(RepoRoot)src\Grpc\**\*.csproj;
$(RepoRoot)src\ProjectTemplates\**\*.csproj;
$(RepoRoot)src\Extensions\**\*.csproj;
$(RepoRoot)src\OpenApi\**\*.csproj;
$(RepoRoot)src\Validation\**\*.csproj;
"
Exclude="
@(ProjectToBuild);
@(ProjectToExclude);
$(RepoRoot)**\node_modules\**\*;
$(RepoRoot)**\bin\**\*;
$(RepoRoot)**\obj\**\*;"
Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " />
<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '1'" Include="@(ProjectsWithTestsSubset1)" />
<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '2'" Include="@(ProjectsWithTestsSubset2)" />
,
# Helix x64 subset 1
- template: jobs/default-build.yml
parameters:
jobName: Helix_x64_Subset_1
jobDisplayName: 'Tests: Helix x64 Subset 1'
agentOs: Windows
timeoutInMinutes: 240
steps:
# Build the shared framework
- script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -pack -arch x64
/p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)
/p:VsTestUseMSBuildOutput=false $(HelixSubset1LogArgs)
displayName: Build shared fx
# -noBuildNative -noBuild to avoid repeating work done in the previous step.
- script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -noBuildNative -noBuild -test
-projects eng\helix\helix.proj /p:IsHelixPRCheck=true /p:IsHelixJob=true
/p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)
/p:VsTestUseMSBuildOutput=false /p:RunTemplateTests=false /p:HelixSubset=1
displayName: Run build.cmd helix target
env:
HelixApiAccessToken: $(HelixApiAccessToken) # Needed for internal queues
SYSTEM_ACCESSTOKEN: $(System.AccessToken) # We need to set this env var to publish helix results to Azure Dev Ops
artifacts:
- name: Helix_Subset_1_Logs_Attempt_$(System.JobAttempt)
path: artifacts/log/
publishOnError: true
includeForks: true
# Helix x64 subset 2
. We did this because parallelizing the helix tests cut about 40 minutes off the build - but the 2 groups were chosen completely arbitrarily. There might be some extra time to be saved by re-subsetting based on whatever "compatibility" heuristic you're currently using

@mmitche

mmitche commented May 26, 2026

Copy link
Copy Markdown
Member Author

Love this idea! Will take a closer look next week, but worth noting that the helix tests are currently broken into 2 subsets:

aspnetcore/eng/Build.props

Lines 236 to 297 in 27c660e

<ProjectsWithTestsSubset1 Include="
$(RepoRoot)src\Framework\AspNetCoreAnalyzers\test\Microsoft.AspNetCore.App.Analyzers.Test.csproj;
$(RepoRoot)src\Framework\test\Microsoft.AspNetCore.App.UnitTests.csproj;
$(RepoRoot)src\Caching\**\*.*proj;
$(RepoRoot)src\DefaultBuilder\**\*.*proj;
$(RepoRoot)src\Features\**\*.*proj;
$(RepoRoot)src\DataProtection\**\*.*proj;
$(RepoRoot)src\Antiforgery\**\*.*proj;
$(RepoRoot)src\Hosting\**\*.*proj;
$(RepoRoot)src\Http\**\*.*proj;
$(RepoRoot)src\HttpClientFactory\**\*.*proj;
$(RepoRoot)src\Html.Abstractions\**\*.*proj;
$(RepoRoot)src\Identity\**\*.*proj;
$(RepoRoot)src\Servers\**\*.csproj;
$(RepoRoot)src\Security\**\*.*proj;
$(RepoRoot)src\SiteExtensions\Microsoft.Web.Xdt.Extensions\**\*.csproj;
$(RepoRoot)src\SiteExtensions\LoggingAggregate\test\**\*.csproj;
$(RepoRoot)src\Shared\**\*.*proj;
$(RepoRoot)src\Tools\**\*.*proj;
$(RepoRoot)src\Logging.AzureAppServices\**\src\*.csproj;
$(RepoRoot)src\Middleware\**\*.csproj;
"
Exclude="
@(ProjectToBuild);
@(ProjectToExclude);
$(RepoRoot)**\node_modules\**\*;
$(RepoRoot)**\bin\**\*;
$(RepoRoot)**\obj\**\*;"
Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " />
<ProjectsWithTestsSubset2 Include="
$(RepoRoot)src\Razor\**\*.*proj;
$(RepoRoot)src\Mvc\**\*.*proj;
$(RepoRoot)src\Azure\**\*.*proj;
$(RepoRoot)src\SignalR\**\*.csproj;
$(RepoRoot)src\StaticAssets\**\*.csproj;
$(RepoRoot)src\Components\**\*.csproj;
$(RepoRoot)src\Analyzers\**\*.csproj;
$(RepoRoot)src\FileProviders\**\*.csproj;
$(RepoRoot)src\Configuration.KeyPerFile\**\*.csproj;
$(RepoRoot)src\Localization\**\*.csproj;
$(RepoRoot)src\ObjectPool\**\*.csproj;
$(RepoRoot)src\JSInterop\**\*.csproj;
$(RepoRoot)src\WebEncoders\**\*.csproj;
$(RepoRoot)src\HealthChecks\**\*.csproj;
$(RepoRoot)src\Testing\**\*.csproj;
$(RepoRoot)src\Grpc\**\*.csproj;
$(RepoRoot)src\ProjectTemplates\**\*.csproj;
$(RepoRoot)src\Extensions\**\*.csproj;
$(RepoRoot)src\OpenApi\**\*.csproj;
$(RepoRoot)src\Validation\**\*.csproj;
"
Exclude="
@(ProjectToBuild);
@(ProjectToExclude);
$(RepoRoot)**\node_modules\**\*;
$(RepoRoot)**\bin\**\*;
$(RepoRoot)**\obj\**\*;"
Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " />
<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '1'" Include="@(ProjectsWithTestsSubset1)" />
<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '2'" Include="@(ProjectsWithTestsSubset2)" />

,

# Helix x64 subset 1
- template: jobs/default-build.yml
parameters:
jobName: Helix_x64_Subset_1
jobDisplayName: 'Tests: Helix x64 Subset 1'
agentOs: Windows
timeoutInMinutes: 240
steps:
# Build the shared framework
- script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -pack -arch x64
/p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)
/p:VsTestUseMSBuildOutput=false $(HelixSubset1LogArgs)
displayName: Build shared fx
# -noBuildNative -noBuild to avoid repeating work done in the previous step.
- script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -noBuildNative -noBuild -test
-projects eng\helix\helix.proj /p:IsHelixPRCheck=true /p:IsHelixJob=true
/p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)
/p:VsTestUseMSBuildOutput=false /p:RunTemplateTests=false /p:HelixSubset=1
displayName: Run build.cmd helix target
env:
HelixApiAccessToken: $(HelixApiAccessToken) # Needed for internal queues
SYSTEM_ACCESSTOKEN: $(System.AccessToken) # We need to set this env var to publish helix results to Azure Dev Ops
artifacts:
- name: Helix_Subset_1_Logs_Attempt_$(System.JobAttempt)
path: artifacts/log/
publishOnError: true
includeForks: true
# Helix x64 subset 2

. We did this because parallelizing the helix tests cut about 40 minutes off the build - but the 2 groups were chosen completely arbitrarily. There might be some extra time to be saved by re-subsetting based on whatever "compatibility" heuristic you're currently using

@wtgodbe Do you happen to remember why it cut that much time off PRs? Was it due to test time or time to submit the tests.

@wtgodbe

wtgodbe commented May 26, 2026

Copy link
Copy Markdown
Member

@wtgodbe Do you happen to remember why it cut that much time off PRs? Was it due to test time or time to submit the tests.

I don't remember off the top of my head, but release/9.0 and release/8.0 PRs are still doing one mondo-job, so we should be able to check & find out

- script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -noBuildNative -noBuild -test
-projects eng\helix\helix.proj /p:IsHelixPRCheck=true /p:IsHelixJob=true
/p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)
/p:VsTestUseMSBuildOutput=false /p:RunTemplateTests=false /p:HelixSubset=1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must get rid of ProjectsWithTestsSubset1 / ProjectsWithTestsSubset2 if you're gonna remove these properties:

aspnetcore/eng/Build.props

Lines 296 to 297 in 2c960bf

<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '1'" Include="@(ProjectsWithTestsSubset1)" />
<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '2'" Include="@(ProjectsWithTestsSubset2)" />

@mmitche mmitche marked this pull request as ready for review May 27, 2026 20:25
@mmitche mmitche requested a review from a team as a code owner May 27, 2026 20:25
Copilot AI review requested due to automatic review settings May 27, 2026 20:25

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces Helix work item batching to reduce per-item overhead by combining multiple small test assemblies into a single Helix work item, while keeping assemblies with special dependencies unbatched. It also updates HelixTestRunner and CI pipelines to support the new batched execution model.

Changes:

  • Add an MSBuild batching step in eng/helix/helix.proj to group eligible Helix work items (by TFM) and generate a targets.txt manifest.
  • Extend HelixTestRunner to accept multiple targets / a --targets-file input, and update Helix run scripts to detect @targets.txt.
  • Simplify CI Helix execution by removing subset splitting and running a single combined Helix job.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
eng/tools/HelixTestRunner/TestRunner.cs Adds multi-target discovery/execution support and adjusts artifact/log handling for batched payload layouts.
eng/tools/HelixTestRunner/Program.cs Updates startup logging and exception formatting.
eng/tools/HelixTestRunner/HelixTestRunnerOptions.cs Adds --targets-file and supports multiple --target values.
eng/targets/Helix.targets Adds metadata used to determine batching eligibility (TFM + special dependency flags).
eng/helix/helix.proj Implements the batching inline task and replaces eligible work items with batched ones.
eng/helix/content/runtests.sh Adds @targets.txt detection and forwards either --target or --targets-file.
eng/helix/content/runtests.cmd Adds @targets.txt detection and forwards either --target or --targets-file.
docs/helix-work-item-batching.md New design document describing the batching approach and operational trade-offs.
.azure/pipelines/ci.yml Collapses Helix subset jobs into a single Helix x64 job and updates binlog variables/artifact names.
.azure/pipelines/ci-unofficial.yml Same Helix subset collapse as official CI.
.azure/pipelines/ci-public.yml Same Helix subset collapse as public CI.

Comment thread eng/tools/HelixTestRunner/TestRunner.cs
Comment on lines 328 to +343
}
ProcessUtil.PrintMessage($"Copying artifacts/log/ to {HELIX_WORKITEM_UPLOAD_ROOT}/");
if (Directory.Exists("artifacts/log"))

// Copy logs from each assembly's subdirectory
foreach (var target in Options.Targets)
{
foreach (var file in Directory.EnumerateFiles("artifacts/log", "*.log", SearchOption.AllDirectories))
var workingDirectory = GetTargetWorkingDirectory(target);
var assemblyName = GetSanitizedAssemblyName(target);

var artifactsLogDirectory = Path.Combine(workingDirectory, "artifacts", "log");
if (Directory.Exists(artifactsLogDirectory))
{
// Combine the directory name + log name for the copied log file name to avoid overwriting
// duplicate test names in different test projects
var logName = $"{Path.GetFileName(Path.GetDirectoryName(file))}_{Path.GetFileName(file)}";
ProcessUtil.PrintMessage($"Copying: {file} to {Path.Combine(HELIX_WORKITEM_UPLOAD_ROOT, logName)}");
File.Copy(file, Path.Combine(HELIX_WORKITEM_UPLOAD_ROOT, logName));
ProcessUtil.PrintMessage($"Copying artifacts/log/ to {HELIX_WORKITEM_UPLOAD_ROOT}/");
foreach (var file in Directory.EnumerateFiles(artifactsLogDirectory, "*.log", SearchOption.AllDirectories))
{
var logName = $"{assemblyName}_{Path.GetFileName(Path.GetDirectoryName(file))}_{Path.GetFileName(file)}";
CopyFileToUploadRoot(file, HELIX_WORKITEM_UPLOAD_ROOT, logName);
Comment thread docs/helix-work-item-batching.md Outdated
Comment on lines +136 to +148
#### Stage 3: Helix Execution (modified HelixTestRunner)

On the Helix agent, the `HelixTestRunner` reads `--targets-file targets.txt` and
iterates:

```
for each assembly in targets.txt:
cd {assembly_subdirectory}
dotnet test {assembly.dll} --logger xunit ...
collect results
merge all TestResults.xml → testResults.xml
upload merged results
```
Comment thread docs/helix-work-item-batching.md Outdated
Comment on lines +219 to +253
### Test Result Handling

**Problem:** When one assembly in a batch has a test failure, the HelixTestRunner
previously exited with code 1. The Helix SDK treated this as a work item failure,
which prevented test results from being reported to AzDO — losing visibility into
the other 19 passing assemblies' results.

**Solution:** Batched runs always exit 0. Individual test failures are reported
through the xUnit test results XML. The Helix SDK picks up the merged
`testResults.xml` and reports all individual test outcomes to AzDO.

```csharp
// Program.cs
if (runner.Options.IsBatched && exitCode != 0)
{
ProcessUtil.PrintMessage($"Batched run had test failures but exiting 0 for result reporting.");
exitCode = 0;
}
```

**Trade-off:** Failed batches are not retried by Helix's built-in retry mechanism.
This is intentional — retrying a 20-assembly batch for one flaky test wastes compute.
Individual flaky tests should be addressed at the test level, not by re-running
entire batches.

### Result Merging

The `HelixTestRunner.UploadResults()` method:

1. Runs `dotnet test` in each assembly's subdirectory, producing per-assembly
`TestResults/TestResults.xml`
2. Copies each to `testResults_{assemblyName}.xml` in the work item root
3. If multiple results exist, merges them into a single `testResults.xml` by
combining all `<assembly>` elements under a unified `<assemblies>` root
4. Copies all results to `HELIX_WORKITEM_UPLOAD_ROOT`
Comment thread eng/tools/HelixTestRunner/TestRunner.cs Outdated
Comment on lines +271 to +276
if (result.ExitCode != 0)
{
ProcessUtil.PrintMessage($"Failure in quarantined tests. Exit code: {result.ExitCode}.");
exitCode = result.ExitCode;
mmitche and others added 2 commits May 27, 2026 20:41
The previous change swallowed non-zero exit codes for batched items,
making genuine test failures invisible at the pipeline level. The
exit code should always propagate — the Helix SDK's AzDO reporter
picks up test results from testResults.xml regardless of exit code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move EnsureTargetRunnerConfiguration back to SetupEnvironment() so it
  runs once (matching original), remove duplicate calls from discovery
  and execution stages
- Fix batch timeout: use 2min/assembly + 5min overhead instead of
  summing individual 45min timeouts (which always hit the 45min cap)
- Remove unused ParseTimeout helper
- Test results: keep the merge approach (Roslyn avoids merging by
  using a single vstest.console invocation via rsp file, but aspnetcore
  uses dotnet test with quarantine filtering which requires per-assembly
  invocations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mmitche mmitche force-pushed the dev/helix-work-item-batching branch from 6bb2285 to 4e2fd46 Compare May 28, 2026 03:41
The CopyDirectory approach recursively copied every file from each
assembly's publish directory into the batch directory. With 20
assemblies per batch x 5 batches per subset, this was ~100 recursive
directory copies causing the helix target step to take 90+ minutes
and timeout at 240 min.

Replace with Directory.CreateSymbolicLink (matching Roslyn's approach).
The Helix SDK follows symlinks when zipping payloads. Only the batch
root's shared files (runtests.cmd/sh, NuGet.config) are copied.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mmitche mmitche force-pushed the dev/helix-work-item-batching branch from ef5a4e7 to ea30c38 Compare May 28, 2026 17:18
@mmitche

mmitche commented May 28, 2026

Copy link
Copy Markdown
Member Author

Hold on this..going to split it into smaller PRs.

dotnet test accepts multiple DLL paths. Pass all target assemblies
in a single invocation instead of iterating per-assembly. This:
- Produces one unified test results file (no merging needed)
- Saves ~2s per extra assembly (vstest host startup)
- Preserves quarantine exit code behavior (don't fail on quarantined)
- Simplifies UploadResults and AddHelixSourcesAsync

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mmitche and others added 2 commits June 1, 2026 11:00
Tests that reference HELIX_WORKITEM_ROOT to find content files (baselines,
snapshots, nupkgs) are incompatible with work item batching because the
content files are inside symlinked subdirectories, not at the batch root.

Add TestUsesHelixWorkItemRoot property for projects with custom content,
and include TestDependsOnAspNetPackages/TestDependsOnAspNetAppPackages
in HasSpecialDependencies to exclude these from batching.

Affected projects:
- Http.Extensions.Tests (RequestDelegateGenerator baselines)
- OpenApi.Tests & SourceGenerators.Tests (snapshots)
- Validation.GeneratorTests (snapshots)
- App.Analyzers.Test (NuGet.config lookup)
- App.UnitTests (via TestDependsOnAspNetAppPackages)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mmitche mmitche requested a review from a team as a code owner June 1, 2026 19:39
@mmitche mmitche requested a review from halter73 as a code owner June 1, 2026 19:39
mmitche and others added 2 commits June 1, 2026 13:57
Tests that find content files (baselines, snapshots, nupkgs) on Helix
previously used HELIX_WORKITEM_ROOT which points to the work item
execution root. This is incompatible with work item batching where
content is in per-project subdirectories.

AppContext.BaseDirectory resolves to the DLL's directory, which is
correct in all modes:
- Non-batched Helix: same as HELIX_WORKITEM_ROOT (publish dir)
- Batched Helix: per-project symlink to publish dir
- Local development: falls through to relative path (unchanged)

This also removes the TestUsesHelixWorkItemRoot property and reverts
the HasSpecialDependencies condition, allowing these projects to
participate in batching for better compute efficiency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the inline RoslynCodeTaskFactory task from helix.proj into the
compiled RepoTasks assembly. This provides better IDE support, proper
compilation errors, and uses Directory.CreateSymbolicLink directly
instead of shelling out to mklink (now targeting modern .NET runtime).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mmitche and others added 9 commits June 2, 2026 08:49
Fully qualify Microsoft.Build.Utilities.Task to avoid ambiguity with
System.Threading.Tasks.Task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the import after ArtifactsBinDir is defined so the task assembly
path resolves correctly. The Helix SDK doesn't include Arcade SDK's
ProjectLayout.props, so ArtifactsBinDir must be explicitly set first.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Import was inserted mid-PropertyGroup, splitting it in two and
causing an XML parse error. Move the import after the complete
PropertyGroup block.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This assembly has 1999 tests and causes test host hangs on macOS when
batched with other assemblies. Add SkipHelixWorkItemBatching property
to opt specific projects out of work item batching.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. All work items now use @targets.txt command format. For non-batched
   items, the batcher writes a single-entry targets.txt. This removes
   the conditional logic from runtests.cmd and runtests.sh.

2. Rename HasSpecialDependencies metadata to SkipHelixWorkItemBatching.
   Set to true when any TestDepends* condition is met or when projects
   explicitly opt out. Defaults to not-set (eligible for batching).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove overwrite:true from File.Copy; skip if file already exists
- Extract timeout constants as task parameters with defaults
  (TimeoutMinutesPerAssembly=2, TimeoutMinutesOverhead=5, MaxTimeoutMinutes=45)
- Remove playwright handling from batched items; add assertion that
  batched items don't require Playwright
- Replace command-line parsing with direct metadata reads
  (RuntimeVersion, QueueName, TestingArchitecture, RunQuarantined)
- Remove CreateDirectorySymlink helper; inline Directory.CreateSymbolicLink

Architecture safety: batching runs per-queue (inside HelixTargetQueue
inner build), so all items in a batch share the same OS/architecture.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validate that all work items in a batch share the same RuntimeVersion,
QueueName, TestingArchitecture, and RunQuarantined values. Logs an
error if any item differs from the first item in the batch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@wtgodbe wtgodbe left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo a few questions

Comment thread eng/helix/content/runtests.cmd Outdated
Comment thread eng/helix/helix.proj
Comment thread eng/helix/helix.proj
Comment thread eng/tools/RepoTasks/BatchHelixWorkItems.cs Outdated
var nugetConfigPath = SkipOnHelixAttribute.OnHelix() ?
Path.Combine(
Environment.GetEnvironmentVariable("HELIX_WORKITEM_ROOT"),
AppContext.BaseDirectory,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this necessary?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The work item has individual test dirs for each work item (since they could have different deps)...the work item no longer executes at the work item root. Basically the nuget.config location is going to be along the test dll, which is no longer at the work item root.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you could just remove the ternary altogether and just use AppContext...

- Batched work items now inherit the same timeout each individual work
  item was given (HelixTimeout, 45 min) instead of an aggressive
  2min/assembly estimate that risked killing legitimate slow batches.
  The per-test --blame-hang-timeout (15 min) still catches true hangs.
- Remove the per-assembly/overhead/max timeout magic numbers; add a
  single DefaultTimeout fallback parameter.
- Rename runtests.cmd \ to \/\.
- Document why the default batch size is 20.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mmitche and others added 2 commits June 5, 2026 12:32
Template test projects declare long custom HelixTimeout values (1-2.5h)
because each test scaffolds, restores, builds and runs a real app. When
these heavy assemblies were batched with others, the batch inherited a
single (shorter) timeout and was killed with exit 137
(Tests exceeded configured timeout).

Exclude any project whose HelixTimeout differs from the default from
batching. These projects are explicitly signaling they are heavy and
need a dedicated work item timeout. Today this is the 3 ProjectTemplates
test projects; the rule generalizes to any future heavy project.

Also add Timeout to the batched work item consistency validation so a
batch can never silently mix items that expect different timeouts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wtgodbe

wtgodbe commented Jun 11, 2026

Copy link
Copy Markdown
Member

Subset 1 winds up w/ 17 workitems and Subset 2 winds up w/ 7, so it might be worth shuffling things around in eng/build.props

Move 5 standalone Windows-only test projects (HttpSys/IIS/Diagnostics/Spa
functional tests) from Helix Subset 1 to Subset 2 so the batched Windows
work item counts are balanced (~12/12 instead of 17/7). These projects have
special PreCommands and are not batchable, so they were the source of the
imbalance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FailEventSync/FailParametersSync/FailBatchSync captured a Stopwatch timestamp
and immediately invoked the synchronous metric method, then asserted the
recorded duration was > 0. On fast machines fewer than one Stopwatch tick
elapses, so the duration rounds to 0 and the assertion flakes (seen on
OSX.26.Arm64 in this PR's batched Helix legs). Await a small delay before
recording to guarantee a measurable duration, mirroring the async sibling
tests.

Fixes #67305

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

// Act
var startTimestamp = Stopwatch.GetTimestamp();
await Task.Delay(10); // Small delay to ensure measurable duration

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We kept getting failures where https://github.com/dotnet/aspnetcore/pull/66808/changes#diff-b9bcdce5eeff740407c209e80396dbfcbd9b4e4f112fc696b230c57de9d4755dR139 was zero. Most tests here do a short delay after starting the stop watch. The hypothesis is that in some cases the test is finishing so fast that the measured value is zero.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that was the issue.

@mmitche

mmitche commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

Helix efficiency: batched (this PR) vs un-batched

Edited to correct the wall-clock analysis. My first version compared against a main build that happened to run on faster/less-congested queues (OSX.15.Amd64 / …VS2026.Pre.Scout), which made batching look like a ~80% wall-clock regression. Controlling for that with a PR-to-PR comparison on the current queues (OSX.26.Arm64 / VS2026) shows the wall-clock impact is within noise (~3 min). Compute/work-item wins are unchanged.

Builds (pipeline aspnetcore-ci, dnceng-public/public):

Work item counts (batched 1471086 vs main 1442279)

Subset 1 Subset 2 Total
Un-batched 297 206 503
Batched 23 26 49
−90.3%

PR breakdown: 29 batched (batch_*) + 20 standalone (IIS/Playwright/Java/Node/MSSQL/custom-timeout items that can't be batched).

Total compute time (sum of every work item's run time on a Helix machine)

Queue family Un-batched items / min Batched items / min
Ubuntu 158 / 80.6 14 / 65.0
OSX 156 / 117.5 11 / 27.9
Windows 189 / 322.8 24 / 155.0
Total 503 / 520.9 min (8.68 h) 49 / 247.9 min (4.13 h)

−52.4% total Helix compute (8.68 h → 4.13 h), from eliminating ~454 work items' worth of per-item overhead (tool installs, vstest startup, machine acquisition, result upload).

Wall-clock critical path — PR-to-PR (same queues)

Both builds' overall critical path is the Tests: Helix x64 Subset 1 leg. Stepping into that leg:

Step Un-batched 1470188 Batched 1471086 Δ
Build shared fx 22.7 min 24.4 min +1.7
Run helix target (build tests + send + wait for Helix) 61.6 min 63.3 min +1.7
Helix Subset 1 leg (critical path) 86.1 min 89.6 min +3.5
Total build span 94.9 min 97.9 min +3.0

Helix Subset 1 leg across recent un-batched PRs on the current queues: 50.1 / 53.1 / 74.6 / 86.1 / 87.2 min — the batched 89.6 min sits within that spread. Wall-clock on these queues is dominated by queue congestion, not work-item structure.

(For reference, the old main build 1442279 ran Helix Subset 1 in 49.5 min — but on the older, less-congested OSX.15 / VS2026.Pre.Scout queues, so it's not comparable for wall-clock.)

Efficiency summary

Metric Baseline Batched (this PR) Δ
Work items 503 49 −90.3%
Total compute 8.68 h 4.13 h −52.4%
Avg compute / work item 1.04 min 5.06 min denser items
Wall-clock critical path (PR-to-PR, same queues) 86.1 min 89.6 min +3.5 min (within noise)

Takeaways

  • 90% fewer work items and 52% less total compute / machine-hours — the intended overhead and cost win.
  • No meaningful wall-clock regression: on equivalent queues the critical Helix leg is +3.5 min (well within the 50–87 min unbatched variance). The earlier "+81% slower" was a queue/OS artifact of the main baseline, not batching.
  • ℹ️ Wall-clock is currently dominated by Helix queue congestion on OSX.26.Arm64 / VS2026; if PR turnaround matters, that's the lever to pull (queue capacity), independent of this change.

Methodology: Helix job GUIDs from each leg's "Run build.cmd helix target" SendHelixJob output; per-item durations from the Helix workitems API (Duration field); synthetic HelixController Work Queueing item excluded (all 552 durations resolved). Wall-clock from AzDO timeline Job/Task records. Per-queue compute between 1471086 and 1442279 mixes batching with hardware/OS differences; the count and total-compute reductions are dominated by batching, and the wall-clock critical-path delta is measured PR-to-PR on identical queues (1470188 vs 1471086).

@wtgodbe wtgodbe left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wall-clock critical path got ~81% worse (49.5 → 89.5 min)

Can this be improved? I'm alright with a little time loss in service of more efficient machine usage, but 40 minutes is a big hit

@wtgodbe wtgodbe left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked offline, the wall clock time increase that the agent reported was just noise from a queue time spike.

@mmitche mmitche merged commit a097744 into main Jun 18, 2026
20 of 25 checks passed
@mmitche mmitche deleted the dev/helix-work-item-batching branch June 18, 2026 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants