Make scheduled outerloop builds succeed when only Helix tests fail #129049
The libraries outerloop pipeline runs on a daily schedule with always:false, meaning AzDO only re-queues a commit if there were changes since the last successful scheduled run. Because flaky outerloop tests cause the 'Send to Helix' task to fail on essentially every scheduled run, the build never succeeds, so AzDO re-queues the same commit every day and submits ever more Helix work for an unchanged sha.

Set shouldContinueOnError on the Send to Helix step for scheduled builds only (Build.Reason == 'Schedule'), so Helix work item failures no longer fail the build. Compile/build breaks still fail the build, and PR/CI/manual runs are unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
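For reference, a scheduled trigger of the kind described above looks roughly like the following in Azure Pipelines YAML. The cron expression, display name, and branch are placeholders and are not taken from the actual pipeline; only `always: false` is the setting under discussion.

```yaml
# Illustrative only: cron, displayName, and branch are assumed placeholders.
schedules:
- cron: "0 11 * * *"            # once a day (the time is an assumption)
  displayName: Daily outerloop run
  branches:
    include:
    - main
  # always: false -> run only if the source changed since the last
  # successful scheduled run. A failed run does not reset that baseline,
  # so an unchanged commit keeps getting queued.
  always: false
```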
Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
Pull request overview
This PR updates the libraries outerloop Azure DevOps pipeline to avoid failing scheduled runs due to Helix work item/test failures, with the intent of preventing always: false schedules from repeatedly re-queuing the same commit and submitting duplicate Helix work.
Changes:
- Pass `shouldContinueOnError: ${{ eq(variables['Build.Reason'], 'Schedule') }}` into the three `platform-matrix.yml` invocations in `outerloop.yml`.
- Add inline YAML comments explaining the rationale (avoid same-SHA daily re-queues and wasted Helix capacity).
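A sketch of what one of those three invocations might look like with the change described above. The expression is the one quoted in the overview; the template path, the exact nesting of the parameter, and the elided parameters are assumptions, not the actual file contents.

```yaml
# outerloop.yml (sketch): one of the three platform-matrix.yml invocations.
# The template path below is assumed; other parameters are omitted.
- template: /eng/pipelines/common/platform-matrix.yml
  parameters:
    # Evaluates to true only when the run was started by a schedule, so
    # PR, CI, and manual runs keep failing on Helix errors.
    shouldContinueOnError: ${{ eq(variables['Build.Reason'], 'Schedule') }}
```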
Bleh, it's right. `partiallySucceeded` won't stop AzDO from re-scheduling the same commit.
continueOnError only marks the build partiallySucceeded, which AzDO's always:false scheduler still treats as not-successful, so the same commit keeps getting re-queued daily.

Instead, for scheduled builds, tell the Helix SDK not to fail the build on work item / test failures by passing FailOnWorkItemFailure=false and FailOnTestFailure=false. The Send to Helix step then fully succeeds, so a perpetually-flaky scheduled run no longer causes AzDO to re-queue the same sha.

- helix.yml: add failOnTestFailures parameter (default true = current behavior) wired to the FailOnWorkItemFailure/FailOnTestFailure Helix SDK properties.
- outerloop.yml: pass failOnTestFailures=false only for scheduled builds (Build.Reason == 'Schedule'); replaces the earlier shouldContinueOnError approach.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
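The wiring described in this commit could look roughly like the sketch below. The parameter name `failOnTestFailures` and the two `/p:` properties come from the commit message; the shape of the step, the `<existing msbuild command>` placeholder, and the `ne(...)` expression used by the caller are assumptions made for illustration, and the real templates may differ.

```yaml
# helix.yml (sketch). Only the new parameter and the two properties are from
# the commit message; everything else is a placeholder.
parameters:
  failOnTestFailures: true   # default keeps the current fail-on-failure behavior

steps:
- script: >-
    <existing msbuild command>
    /p:FailOnWorkItemFailure=${{ parameters.failOnTestFailures }}
    /p:FailOnTestFailure=${{ parameters.failOnTestFailures }}
  displayName: Send to Helix
---
# outerloop.yml (sketch): one assumed way to pass false on scheduled runs only.
parameters:
  # ne(...) is false when Build.Reason is 'Schedule' and true otherwise.
  failOnTestFailures: ${{ ne(variables['Build.Reason'], 'Schedule') }}
```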
…will revert) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If this looks reasonable we should backport to 9.0 and 10.0 for outerloop.
/azp list

@lewing any concerns here? See https://dev.azure.com/dnceng-public/public/_build/results?buildId=1451767&view=results for a test run (conditional changed to "manual" to verify the functionality)
lewing left a comment
I'm fine with it. @steveisok @jeffschwMSFT for visibility.
Will watch to see whether outerloop runs on 8.0 stop happening when there are no changes. If so, will backport to 9.0 and 10.0.
Make scheduled outerloop Helix step succeed instead of continueOnError

PR dotnet#129049 set FailOnWorkItemFailure/FailOnTestFailure to false on scheduled outerloop runs so the Send to Helix step succeeds (avoiding always:false re-queue of the same commit). That hid work item failures entirely.

Add a WarnOnHelixTestFailure property that emits a build warning for each failed Helix work item, keeping them visible in the AzDO timeline without failing the build (the Helix step already disables warnaserror).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…9629)

## Problem

PR #129049 made scheduled outerloop builds succeed when only Helix tests fail, by setting `FailOnWorkItemFailure`/`FailOnTestFailure` to `false` on scheduled runs (via the `failOnTestFailures: false` parameter). This stopped AzDO's `always: false` scheduler from re-queueing the same commit day after day.

The side effect: failed Helix work items became **completely invisible** in the Azure DevOps timeline. The `Send to Helix` step is fully green, so there is no signal that work items failed (even though, for flaky outerloop, they almost always do).

## Fix

Surface failed work items as **warnings** instead of silently dropping them. Warnings keep the failures visible in the timeline but do **not** degrade the build below `succeeded` (so the `always: false` re-queue fix from #129049 is preserved).

- **`src/libraries/sendtohelixhelp.proj`**: new `WarnOnHelixWorkItemFailure` target (`AfterTargets=CheckHelixJobStatus`) that emits a `<Warning>` for each failed `@(CompletedWorkItem)` when `WarnOnHelixTestFailure=true`. This mirrors what the Arcade SDK's `CheckHelixJobStatus` would have *errored* on, but as a warning.
- **`eng/pipelines/libraries/helix.yml`**: new `warnOnTestFailures` parameter (default `false`) wired to `/p:WarnOnHelixTestFailure`.
- **`eng/pipelines/libraries/outerloop.yml`**: scheduled runs now set `warnOnTestFailures: true` alongside `failOnTestFailures: false` on all three legs.

No warn-as-error change was needed: the `Send to Helix` step already runs with warnaserror disabled (`_warnAsErrorParamHelixOverride`), so these warnings are not promoted back into build-failing errors.

## Validation

Ran the `runtime-libraries-coreclr outerloop` pipeline (dnceng-public def 125, [build 1472840](https://dev.azure.com/dnceng-public/public/_build/results?buildId=1472840)) with a temporary Manual gate. Multiple CoreCLR_Release legs completed **succeeded** with failed work items surfaced as warnings and **zero errors**, e.g.:

```
src/libraries/sendtohelixhelp.proj(364,5): warning : Work item System.Runtime.Numerics.Tests in job 2e01f1b1-... has failed. Failure log: https://helix.dot.net/api/.../console
```

Legs whose work items all passed produced no such warning, as expected.

> [!NOTE]
> This pull request was authored with the assistance of GitHub Copilot.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
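Put together, the scheduled-run settings described in this commit might be expressed as in the sketch below. The parameter names and the `/p:WarnOnHelixTestFailure` property are from the commit message; the conditional-insertion form and the `<existing msbuild command>` placeholder are assumptions, not the actual file contents.

```yaml
# outerloop.yml (sketch): applied to each of the three legs on scheduled runs.
parameters:
  ${{ if eq(variables['Build.Reason'], 'Schedule') }}:
    failOnTestFailures: false   # Send to Helix no longer fails the build
    warnOnTestFailures: true    # failed work items show up as warnings
---
# helix.yml (sketch): new parameter, off by default.
parameters:
  warnOnTestFailures: false

steps:
- script: >-
    <existing msbuild command>
    /p:WarnOnHelixTestFailure=${{ parameters.warnOnTestFailures }}
  displayName: Send to Helix
```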
Note
This pull request was authored with the assistance of GitHub Copilot.
Problem

Several scheduled outerloop pipelines (the `outerloop.yml` family: `runtime-libraries-coreclr outerloop` and its `-windows`/`-linux`/`-osx` variants) use an `always: false` scheduled trigger. With `always: false`, AzDO only starts a new scheduled run if the source changed since the last successful scheduled run.

Because the repo has many flaky outerloop tests, the Helix test work items virtually always have at least one failure, which fails the "Send to Helix" step and therefore the whole build. The build never reaches a `succeeded` state, so AzDO re-queues the same, unchanged commit day after day, submitting more and more Helix work for no benefit. (Empirically confirmed: a single commit was re-run and failed for 19 consecutive days; once a sibling definition produced a genuinely successful run, the same-SHA re-queue stopped.)

Why `continueOnError` is not enough

`continueOnError: true` only downgrades the build to `partiallySucceeded`, which AzDO's `always: false` scheduler still does not treat as successful, so the same commit keeps getting re-queued. The Helix step must end fully successful (exit 0).

Fix

Make the "Send to Helix" step actually succeed on scheduled runs by disabling the two Arcade `Microsoft.DotNet.Helix.Sdk` properties that fail the build (both default to `true`):

- `FailOnWorkItemFailure`: `CheckHelixJobStatus` errors when a work item exits non-zero.
- `FailOnTestFailure`: `CheckAzurePipelinesTestResults` errors when any published test failed.

Setting both to `false` lets the msbuild step exit 0, producing a fully `succeeded` build. Failed tests are still published and visible in the test results tab. AzDO does not auto-degrade a build to `partiallySucceeded` just because a published test run contains failures; only a failing task would.

Changes

- `eng/pipelines/libraries/helix.yml`: Added a `failOnTestFailures` parameter (default `true`, preserving today's behavior) wired to `/p:FailOnWorkItemFailure` and `/p:FailOnTestFailure` on the Send to Helix msbuild invocation.
- `eng/pipelines/libraries/outerloop.yml`: Passes `failOnTestFailures: false` only on scheduled runs (`Build.Reason == 'Schedule'`) for all three matrix legs (Release, Debug, NET48).

Behavior preservation

The new parameter defaults to `true`, so all other `helix.yml` callers are unaffected (none set `WaitForWorkItemCompletion` or these properties on this path, so they already resolve to `true`). Only scheduled outerloop runs change behavior. PR / rolling / manual outerloop runs continue to fail on Helix failures exactly as before. Build/compile breaks still fail scheduled runs (this only affects the Helix step).

Tradeoff

On scheduled runs, `FailOnWorkItemFailure=false` also masks work-item crashes/timeouts/infra failures, not just test-assertion failures. This is an accepted tradeoff for the goal of stopping the wasteful daily re-queue of unchanged commits; results remain visible in the Helix/test reporting.