Add runtime async support for saving and reusing continuation instances #125556

Merged
jakobbotsch merged 25 commits into dotnet:main from jakobbotsch:reuse-continuations-2
Mar 19, 2026

jakobbotsch (Member) commented Mar 14, 2026

This adds support for generating a single shared continuation layout for each runtime async method. The shared continuation layout is compatible with the state that needs to be stored for all suspension points in the function. For that reason it uses more memory than the previous separated continuation layouts.
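
As a rough illustration of why the shared layout costs more memory, the sketch below merges the live locals of every suspension point into one layout with a fixed slot per local. All names here (`LiveLocal`, `Layout`, `BuildSharedLayout`) are hypothetical, alignment and GC-reference ordering are ignored, and this is not the algorithm used in async.cpp.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one local that is live across a suspension point.
struct LiveLocal {
    std::string name;
    std::size_t size; // size in bytes
};

// Hypothetical layout: field name -> byte offset, plus the total size.
struct Layout {
    std::map<std::string, std::size_t> offsets;
    std::size_t size = 0;
};

// Previous scheme: one layout per suspension point, holding only its live locals.
Layout BuildSeparateLayout(const std::vector<LiveLocal>& live) {
    Layout layout;
    for (const LiveLocal& local : live) {
        layout.offsets[local.name] = layout.size;
        layout.size += local.size;
    }
    return layout;
}

// Shared scheme: every local that is live at any suspension point gets one
// fixed slot, so the same object can be used at all suspension points.
Layout BuildSharedLayout(const std::vector<std::vector<LiveLocal>>& suspensionPoints) {
    Layout layout;
    for (const std::vector<LiveLocal>& live : suspensionPoints) {
        for (const LiveLocal& local : live) {
            if (layout.offsets.count(local.name) == 0) {
                layout.offsets[local.name] = layout.size;
                layout.size += local.size;
            }
        }
    }
    return layout;
}
```

With two suspension points whose live sets are {a: 8 bytes, n: 4 bytes} and {a: 8 bytes, t: 8 bytes}, the separate layouts take 12 and 16 bytes, while the shared layout takes 20 bytes because it reserves space for `n` and `t` at the same time.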

The benefit is that once a single layout is compatible with all suspension points we can reuse the same continuation instance every time we suspend. That means a single invocation of a runtime async function allocates at most one continuation instance, no matter how many times it suspends.
On suspension-heavy benchmarks this improves performance by about 30% by significantly reducing the amount of garbage generated.

A complication arises for return values. Before this change the continuation object always stored its single possible return value in a known location, and resumption stubs would propagate return values into the caller's continuation at that location. With this change the continuation stores space for all possible types of return values, and the offset to store at changes for every suspension point. To handle that we now encode the offset in Continuation.Flags.
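
A minimal sketch of the idea of packing a per-suspension-point result index into a flags word. The shift, width, sentinel value and slot size below are assumptions made for illustration; the real definitions are in `CorInfoContinuationFlags` in corinfo.h and may differ.

```cpp
#include <cstdint>

// Hypothetical bit layout, for illustration only.
constexpr std::uint32_t kResultIndexShift = 8;
constexpr std::uint32_t kResultIndexBits  = 6;
constexpr std::uint32_t kResultIndexMask  = (1u << kResultIndexBits) - 1;
// Assumed sentinel: an all-ones index means "no result slot at this suspension point".
constexpr std::uint32_t kNoSlot = kResultIndexMask;

// Store the result index in its bitfield, leaving all other flag bits alone.
std::uint32_t EncodeResultIndex(std::uint32_t flags, std::uint32_t index) {
    flags &= ~(kResultIndexMask << kResultIndexShift);
    return flags | ((index & kResultIndexMask) << kResultIndexShift);
}

// Read the result index back out of the flags word.
std::uint32_t DecodeResultIndex(std::uint32_t flags) {
    return (flags >> kResultIndexShift) & kResultIndexMask;
}

// Compute where a resumption stub would write the return value, or nullptr
// when the current suspension point has no result slot.
unsigned char* GetResultStorageOrNull(unsigned char* data, std::uint32_t flags,
                                      std::uint32_t slotSize) {
    std::uint32_t index = DecodeResultIndex(flags);
    return index == kNoSlot ? nullptr : data + index * slotSize;
}
```

Because the index is rewritten at each suspension, the same continuation object can direct the callee's return value to a different offset each time it suspends.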

Example with warmup
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

namespace AsyncMicro;

public class Program
{
    static void Main()
    {
        NullAwaiter na = new NullAwaiter();

        for (int i = 0; i < 10; i++)
        {
            for (int j = 0; j < 100; j++)
            {
                Task t = Foo(100, na);
                while (!t.IsCompleted)
                {
                    na.Continue();
                }
            }

            Thread.Sleep(100);
        }

        for (int i = 0; i < 5; i++)
        {
            Task t = Foo(10_000_000, na);
            while (!t.IsCompleted)
            {
                na.Continue();
            }
        }
    }

    static int s_value;
    static async Task Foo(int n, NullAwaiter na)
    {
        for (int i = 0; i < n; i++)
        {
            s_value += i;
        }

        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            await na;
        }

        if (n > 100)
            Console.WriteLine("Took {0:F1} ms", timer.Elapsed.TotalMilliseconds);
    }

    private class NullAwaiter : ICriticalNotifyCompletion
    {
        public Action Continue;

        public NullAwaiter GetAwaiter() => this;

        public bool IsCompleted => false;

        public void GetResult()
        {
        }

        public void UnsafeOnCompleted(Action continuation)
        {
            Continue = continuation;
        }

        public void OnCompleted(Action continuation)
        {
            throw new NotImplementedException();
        }
    }
}
Codegen diff
diff --git "a/.\\out_base.txt" "b/.\\out.txt"
index 29d384f..0a17943 100644
--- "a/.\\out_base.txt"
+++ "b/.\\out.txt"
@@ -24,11 +24,11 @@ G_M000_IG01:                ;; offset=0x0000
        vmovdqu  ymmword ptr [rbp-0x60], ymm4
        mov      gword ptr [rbp+0x10], rcx
        mov      gword ptr [rbp+0x20], r8
-       mov      esi, edx
+       mov      ebx, edx
  
 G_M000_IG02:                ;; offset=0x0030
        test     rcx, rcx
-       jne      G_M000_IG24
+       jne      G_M000_IG26
        mov      rax, qword ptr GS:[0x0058]
        mov      rax, qword ptr [rax+0x28]
        cmp      dword ptr [rax+0x250], 2
@@ -47,7 +47,7 @@ G_M000_IG03:                ;; offset=0x0067
  
 G_M000_IG04:                ;; offset=0x007B
        xor      eax, eax
-       test     esi, esi
+       test     ebx, ebx
        jle      SHORT G_M000_IG07
  
 G_M000_IG05:                ;; offset=0x0081
@@ -57,7 +57,7 @@ G_M000_IG05:                ;; offset=0x0081
 G_M000_IG06:                ;; offset=0x008B
        add      dword ptr [rdx], eax
        inc      eax
-       cmp      eax, esi
+       cmp      eax, ebx
        jl       SHORT G_M000_IG06
  
 G_M000_IG07:                ;; offset=0x0093
@@ -81,36 +81,36 @@ G_M000_IG08:                ;; offset=0x00BD
  
 G_M000_IG09:                ;; offset=0x00D8
        mov      rdi, gword ptr [rbp-0x68]
-       test     esi, esi
+       test     ebx, ebx
        jle      SHORT G_M000_IG13
  
 G_M000_IG10:                ;; offset=0x00E0
-       mov      rbx, gword ptr [rbp+0x20]
-       cmp      byte  ptr [rbx], bl
-       mov      r14d, esi
+       mov      rsi, gword ptr [rbp+0x20]
+       cmp      byte  ptr [rsi], sil
+       mov      r14d, ebx
  
-G_M000_IG11:                ;; offset=0x00E9
-       mov      r8, rbx
+G_M000_IG11:                ;; offset=0x00EA
+       mov      r8, rsi
        mov      rcx, 0x7FF849136100
        xor      rdx, rdx
        call     [System.Runtime.CompilerServices.AsyncHelpers:UnsafeAwaitAwaiter[System.__Canon](System.__Canon)]
        test     rcx, rcx
        jne      G_M000_IG22
  
-G_M000_IG12:                ;; offset=0x0107
+G_M000_IG12:                ;; offset=0x0108
        dec      r14d
        jne      SHORT G_M000_IG11
  
-G_M000_IG13:                ;; offset=0x010C
-       cmp      esi, 100
+G_M000_IG13:                ;; offset=0x010D
+       cmp      ebx, 100
        jle      G_M000_IG16
        jmp      SHORT G_M000_IG15
  
-G_M000_IG14:                ;; offset=0x0117
+G_M000_IG14:                ;; offset=0x0118
        call     CORINFO_HELP_POLL_GC
        jmp      SHORT G_M000_IG09
  
-G_M000_IG15:                ;; offset=0x011E
+G_M000_IG15:                ;; offset=0x011F
        mov      rcx, rdi
        call     [System.Diagnostics.Stopwatch:get_ElapsedTicks():long:this]
        vxorps   xmm6, xmm6, xmm6
@@ -132,18 +132,18 @@ G_M000_IG15:                ;; offset=0x011E
        call     [System.TimeSpan:get_TotalMilliseconds():double:this]
        vmovsd   qword ptr [rbx+0x08], xmm0
        mov      rdx, rbx
-       mov      rcx, 0x254678105B8
+       mov      rcx, 0x20B60E005B8
        call     [System.Console:WriteLine(System.String,System.Object)]
        nop      
  
-G_M000_IG16:                ;; offset=0x01A5
+G_M000_IG16:                ;; offset=0x01A6
        cmp      gword ptr [rbp+0x10], 0
        je       SHORT G_M000_IG20
  
-G_M000_IG17:                ;; offset=0x01AC
+G_M000_IG17:                ;; offset=0x01AD
        xor      ecx, ecx
  
-G_M000_IG18:                ;; offset=0x01AE
+G_M000_IG18:                ;; offset=0x01AF
        vmovaps  xmm6, xmmword ptr [rsp+0x50]
        add      rsp, 104
        pop      rbx
@@ -154,12 +154,12 @@ G_M000_IG18:                ;; offset=0x01AE
        pop      rbp
        ret      
  
-G_M000_IG19:                ;; offset=0x01C1
+G_M000_IG19:                ;; offset=0x01C2
        mov      ecx, 2
        call     CORINFO_HELP_GETDYNAMIC_GCTHREADSTATIC_BASE_NOCTOR_OPTIMIZED
        jmp      G_M000_IG03
  
-G_M000_IG20:                ;; offset=0x01D0
+G_M000_IG20:                ;; offset=0x01D1
        mov      ecx, 2
        call     CORINFO_HELP_GETDYNAMIC_GCTHREADSTATIC_BASE_NOCTOR_OPTIMIZED
        mov      rbx, gword ptr [rax+0x10]
@@ -170,7 +170,7 @@ G_M000_IG20:                ;; offset=0x01D0
        mov      rdx, r8
        call     CORINFO_HELP_ASSIGN_REF
  
-G_M000_IG21:                ;; offset=0x01F4
+G_M000_IG21:                ;; offset=0x01F5
        mov      r8, gword ptr [rbx+0x08]
        mov      rdx, gword ptr [rbp-0x48]
        cmp      rdx, r8
@@ -179,23 +179,35 @@ G_M000_IG21:                ;; offset=0x01F4
        call     [System.Threading.ExecutionContext:RestoreChangedContextToThread(System.Threading.Thread,System.Threading.ExecutionContext,System.Threading.ExecutionContext)]
        jmp      SHORT G_M000_IG17
  
-G_M000_IG22:                ;; offset=0x020C
-       mov      rdx, rcx
-       mov      rcx, rdx
-       mov      rdx, 0x7FF84917DF68
-       call     CORINFO_HELP_ALLOC_CONTINUATION
+G_M000_IG22:                ;; offset=0x020D
+       mov      rax, rcx
+       mov      rcx, gword ptr [rbp+0x10]
+       mov      r15, rcx
+       test     r15, r15
+       je       SHORT G_M000_IG23
+       lea      rcx, bword ptr [rax+0x08]
+       mov      rdx, r15
+       call     CORINFO_HELP_ASSIGN_REF
+       jmp      SHORT G_M000_IG24
+ 
+G_M000_IG23:                ;; offset=0x022A
+       mov      rcx, rax
+       mov      rdx, 0x7FF84917D600
+       call     [CORINFO_HELP_ALLOC_CONTINUATION]
        mov      r15, rax
+ 
+G_M000_IG24:                ;; offset=0x0240
        lea      rcx, [reloc @RWD08]
        mov      qword ptr [r15+0x10], rcx
        xor      ecx, ecx
        mov      qword ptr [r15+0x18], rcx
        lea      rcx, bword ptr [r15+0x28]
-       mov      rdx, rbx
+       mov      rdx, rsi
        call     CORINFO_HELP_ASSIGN_REF
        lea      rcx, bword ptr [r15+0x30]
        mov      rdx, rdi
        call     CORINFO_HELP_ASSIGN_REF
-       mov      dword ptr [r15+0x38], esi
+       mov      dword ptr [r15+0x38], ebx
        mov      dword ptr [r15+0x3C], r14d
        call     [System.Runtime.CompilerServices.AsyncHelpers:CaptureExecutionContext():System.Threading.ExecutionContext]
        lea      rcx, bword ptr [r15+0x20]
@@ -209,7 +221,7 @@ G_M000_IG22:                ;; offset=0x020C
        call     [System.Runtime.CompilerServices.AsyncHelpers:RestoreContextsOnSuspension(bool,System.Threading.ExecutionContext,System.Threading.SynchronizationContext)]
        mov      rcx, r15
  
-G_M000_IG23:                ;; offset=0x0283
+G_M000_IG25:                ;; offset=0x029F
        vmovaps  xmm6, xmmword ptr [rsp+0x50]
        add      rsp, 104
        pop      rbx
@@ -220,22 +232,22 @@ G_M000_IG23:                ;; offset=0x0283
        pop      rbp
        ret      
  
-G_M000_IG24:                ;; offset=0x0296
+G_M000_IG26:                ;; offset=0x02B2
        mov      rcx, gword ptr [rbp+0x10]
        mov      rcx, gword ptr [rcx+0x20]
        call     [System.Runtime.CompilerServices.AsyncHelpers:RestoreExecutionContext(System.Threading.ExecutionContext)]
        mov      rcx, gword ptr [rbp+0x10]
-       mov      rbx, gword ptr [rcx+0x28]
+       mov      rsi, gword ptr [rcx+0x28]
        mov      rdi, gword ptr [rcx+0x30]
-       mov      esi, dword ptr [rcx+0x38]
+       mov      ebx, dword ptr [rcx+0x38]
        mov      r14d, dword ptr [rcx+0x3C]
        jmp      G_M000_IG12
  
-G_M000_IG25:                ;; offset=0x02BC
+G_M000_IG27:                ;; offset=0x02D8
        sub      rsp, 40
        vzeroupper 
  
-G_M000_IG26:                ;; offset=0x02C3
+G_M000_IG28:                ;; offset=0x02DF
        cmp      gword ptr [rbp+0x10], 0
        setne    cl
        movzx    rcx, cl
@@ -244,7 +256,7 @@ G_M000_IG26:                ;; offset=0x02C3
        call     [System.Runtime.CompilerServices.AsyncHelpers:RestoreContexts(bool,System.Threading.ExecutionContext,System.Threading.SynchronizationContext)]
        nop      
  
-G_M000_IG27:                ;; offset=0x02DD
+G_M000_IG29:                ;; offset=0x02F9
        add      rsp, 40
        ret      
  
@@ -252,10 +264,10 @@ RWD00  	dq	43E0000000000000h
 RWD08  	dq	(dynamicClass):IL_STUB_AsyncResume(System.Object,byref):System.Object
 	dq	G_M000_IG22 + 3
 
-; Total bytes of code 738
+; Total bytes of code 766
-Took 539.6 ms
-Took 538.6 ms
-Took 538.7 ms
-Took 542.8 ms
-Took 535.2 ms
+Took 404.3 ms
+Took 405.1 ms
+Took 404.6 ms
+Took 403.7 ms
+Took 404.9 ms

Improves performance by about 30%. It also increases size, but I have ideas to reduce it again (e.g. by sharing the "alloc new continuation" path).
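
For reference, the figure can be recomputed from the "Took" lines in the diff above. The helper names below are made up for this sketch. The means are about 538.98 ms before and 404.52 ms after, which is roughly 25% less time per run, or equivalently about 33% more runs per unit of time; "about 30%" sits between the two readings.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of a set of timings in milliseconds.
double Mean(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) / static_cast<double>(xs.size());
}

// Fraction of wall-clock time saved relative to the baseline.
double TimeReduction(double baseMs, double newMs) {
    return 1.0 - newMs / baseMs;
}

// Fractional increase in throughput (work per unit time) relative to the baseline.
double ThroughputGain(double baseMs, double newMs) {
    return baseMs / newMs - 1.0;
}
```

Calling `TimeReduction(Mean(base), Mean(reuse))` with the five timings from each side gives a value just under 0.25, and `ThroughputGain` gives a value just over 0.33.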

Based on #125497

Copilot AI review requested due to automatic review settings March 14, 2026 13:44
@github-actions github-actions bot added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label Mar 14, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces an optimization for CoreCLR “runtime async” methods by enabling a single shared continuation layout per async method and reusing the same continuation instance across multiple suspension points, reducing allocations and GC pressure. It updates the continuation flags contract to encode per-suspension-point field offsets (notably for return storage) in Continuation.Flags.

Changes:

  • JIT: Build shared continuation layouts across all suspension points and optionally reuse continuation instances (JitAsyncReuseContinuations).
  • ABI/contract: Redefine continuation flags to encode exception/context/result slot indices via bitfields.
  • BCL: Update AsyncHelpers continuation field accessors to decode indices from flags.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/coreclr/vm/object.h | Adjusts continuation object helpers used by the runtime/interpreter to locate data/exception storage. |
| src/coreclr/vm/interpexec.cpp | Updates interpreter suspend/resume handling to use new continuation flag semantics. |
| src/coreclr/jit/jitstd/vector.h | Adds data() const overload needed by new JIT code. |
| src/coreclr/jit/jitconfigvalues.h | Adds JitAsyncReuseContinuations config switch. |
| src/coreclr/jit/async.h | Refactors async transformation to build per-state layouts and shared layout support types. |
| src/coreclr/jit/async.cpp | Implements shared layout creation, continuation reuse logic, and index encoding into flags. |
| src/coreclr/interpreter/compiler.cpp | Updates interpreter continuation layout/flags creation for new encoding scheme. |
| src/coreclr/inc/corinfo.h | Redefines CorInfoContinuationFlags to include index bitfield definitions. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs | Updates managed Continuation to decode exception/context/result locations from flags. |

Comments suppressed due to low confidence (1)

src/coreclr/vm/interpexec.cpp:4430

  • Exception handling on interpreter resumption is gated on a placeholder flag literal (567), and GetExceptionObjectStorage currently returns the wrong slot. This will skip exception rethrow or read garbage. Update to the new index-based encoding (compare decoded exception index against the sentinel mask) and use the decoded offset when loading the exception object.
                    if (pAsyncSuspendData->flags & /*CORINFO_CONTINUATION_HAS_EXCEPTION */ 567)
                    {
                        // Throw exception if needed
                        OBJECTREF exception = *continuation->GetExceptionObjectStorage();

                        if (exception != NULL)
                        {

Copilot AI review requested due to automatic review settings March 14, 2026 14:36
@jakobbotsch jakobbotsch changed the title from "Add runtime async support for saving and reusing continuation layouts" to "Add runtime async support for saving and reusing continuation instances" Mar 14, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the runtime async continuation model to support a single shared continuation layout across suspension points and (optionally) reuse a single continuation instance to reduce allocations/GC pressure.

Changes:

  • Reworks continuation Flags to encode slot indices (exception/context/result) instead of simple presence bits, and updates VM + BCL consumers accordingly.
  • Introduces JIT infrastructure to build per-suspension sub-layouts, optionally merge them into a shared layout, and enable continuation reuse via a new config knob.
  • Updates interpreter suspension metadata generation to use the new index-encoding scheme.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/coreclr/vm/object.h | Computes result/exception storage addresses using index fields encoded in Flags. |
| src/coreclr/vm/interpexec.cpp | Decodes continuation-context index from flags and uses new exception-storage accessor. |
| src/coreclr/jit/jitstd/vector.h | Adds data() const to support const access patterns. |
| src/coreclr/jit/jitconfigvalues.h | Adds JitAsyncReuseContinuations config (default enabled). |
| src/coreclr/jit/async.h | Adds layout builder/state tracking types and shared-layout support plumbing. |
| src/coreclr/jit/async.cpp | Implements shared-layout creation, per-state suspend/resume emission, and continuation reuse logic. |
| src/coreclr/interpreter/compiler.cpp | Encodes exception/context/result indices into interpreter async suspend flags. |
| src/coreclr/inc/corinfo.h | Redefines CorInfoContinuationFlags to include index bitfields for slots. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs | Updates managed continuation flag decoding and storage offset computation to match new encoding. |

Copilot AI review requested due to automatic review settings March 17, 2026 21:28
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds an optimization for runtime async methods that enables using a single shared continuation layout across all suspension points, allowing the runtime to reuse a single continuation instance (reducing allocations/GC pressure). It also changes how continuation field locations (exception/context/result) are encoded and discovered across the JIT, VM, interpreter, and CoreLib.

Changes:

  • Introduces shared continuation layout generation and continuation instance reuse for runtime async (gated by JitAsyncReuseContinuations).
  • Updates continuation Flags to encode slot indices for exception/continuation-context/result, rather than “HasX” presence bits.
  • Adjusts VM/interpreter/CoreLib code to compute storage addresses using the new index encoding.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/coreclr/vm/object.h | Computes continuation field addresses from encoded indices in Flags. |
| src/coreclr/vm/interpexec.cpp | Uses new VM helpers to locate exception/context storage slots during suspend/resume. |
| src/coreclr/jit/jitconfigvalues.h | Adds JitAsyncReuseContinuations config switch (enabled by default). |
| src/coreclr/jit/async.h | Updates async transformation APIs to support shared layouts/reuse and updated return lookup. |
| src/coreclr/jit/async.cpp | Implements shared layout building, reuse control flow, and index encoding into flags. |
| src/coreclr/interpreter/compiler.cpp | Encodes continuation slot indices into flags for interpreter async continuations. |
| src/coreclr/inc/corinfo.h | Redefines CorInfoContinuationFlags to include bit positions/sizes for encoded indices. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs | Updates managed Continuation helpers to interpret Flags as index-encoded offsets. |

jakobbotsch (Member, Author) commented

cc @dotnet/jit-contrib PTAL @dhartglassMSFT @EgorBo

With this PR, continuation reuse is enabled by default. All suspension points therefore use the same continuation layout; fields that are not live at a particular suspension point keep their space but are left untouched. The continuation is also live throughout the entire function, which costs some prolog code (if it is stored on the stack) or a register.

On suspension the new codegen first checks if we have a continuation, and if so, reuses the existing continuation instead of calling CORINFO_HELP_ALLOC_CONTINUATION. This results in increased size but improves perf significantly (about 30% in the benchmark above). It will also allow us to make several optimizations in the future:

  1. When we reuse we can skip storing continuation fields that we know cannot have changed since a previous suspension. This optimization is implemented in #125615 ("JIT: Add a runtime async optimization to skip saving unmutated locals into reused continuations") and improves the benchmark above by a further 10% (it skips 2 write barriers on every suspension).
  2. To optimize size we can share the code that checks for reuse and otherwise allocates a new continuation, since now it is the same for all suspension points.
  3. We can furthermore start sharing suspension code to save (common) sets of locals.
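
The reuse check described above can be modelled roughly as follows. `Continuation`, `AllocContinuation` and `Suspend` are simplified stand-ins invented for this sketch; in the real codegen the allocation path is the call to CORINFO_HELP_ALLOC_CONTINUATION and the object layout is different.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the runtime's continuation object.
struct Continuation {
    std::uint32_t state = 0; // which suspension point to resume at
    int savedLocals[4] = {}; // shared storage used by every suspension point
};

int g_allocations = 0; // counts simulated continuation allocations

// Stand-in for the allocation helper call.
Continuation* AllocContinuation() {
    ++g_allocations;
    return new Continuation();
}

// On suspension: if the method was resumed with a continuation, reuse it;
// otherwise this is the first suspension of the invocation, so allocate one.
Continuation* Suspend(Continuation* resumedWith, std::uint32_t state, int local) {
    Continuation* cont = resumedWith != nullptr ? resumedWith : AllocContinuation();
    cont->state = state;
    cont->savedLocals[0] = local;
    return cont;
}
```

Suspending 1000 times while passing the returned continuation back in each time leaves `g_allocations` at 1, which is the effect the benchmark measures. The caller owns the returned object and deletes it once the method completes.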

For 2/3 we will need some way to make a function-local call. I am not totally sure how to represent it. One possibility is that we pass it the state number and then use the state number in a switch to return back.
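
One way to picture the state-number idea is the control-flow sketch below, written with `goto` to mirror basic blocks rather than as idiomatic C++. The function name and trace values are made up, and this is only one possible shape, not how the JIT represents it.

```cpp
#include <vector>

// Two "suspension points" jump to a single shared block. The shared block
// does the common work, then a switch on the state number returns control
// to the site that jumped there.
std::vector<int> RunWithSharedBlock() {
    std::vector<int> trace;
    int state = 0;

    // Suspension point 0: record the state number and "call" the shared block.
    state = 0;
    goto shared;
resume0:
    trace.push_back(100);

    // Suspension point 1.
    state = 1;
    goto shared;
resume1:
    trace.push_back(101);
    return trace;

shared:
    // Common work, e.g. the reuse-or-allocate check; here it logs the state.
    trace.push_back(state);
    switch (state) {
        case 0:
            goto resume0;
        default:
            goto resume1;
    }
}
```

The returned trace is 0, 100, 1, 101: each site passes through the shared block and comes back to its own continuation point.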

Copilot AI review requested due to automatic review settings March 17, 2026 23:25
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds “continuation reuse” for runtime-async methods by generating a single shared continuation layout per method (covering all suspension points) and reusing the same continuation instance across suspensions to reduce allocations and GC pressure. It also updates the continuation flags format to encode per-suspension-point storage indices (exception/context/result) needed with shared layouts.

Changes:

  • Switch continuation flags from “Has* presence bits” to encoded slot indices, updating CoreCLR VM, JIT, interpreter, and BCL helpers accordingly.
  • Add JIT support for creating a shared continuation layout across states and reusing an existing continuation instance on subsequent suspensions.
  • Introduce JitAsyncReuseContinuations config knob (enabled by default) to gate the optimization.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/coreclr/inc/corinfo.h | Redefines CorInfoContinuationFlags to encode slot indices and updates bit assignments. |
| src/coreclr/vm/object.h | Updates ContinuationObject helpers to compute storage addresses from encoded indices. |
| src/coreclr/vm/interpexec.cpp | Switches interpreter execution to use the new “*_StorageOrNull” helpers for exception/context storage. |
| src/coreclr/interpreter/compiler.cpp | Updates interpreter suspend emission to encode slot indices into flags. |
| src/coreclr/jit/jitconfigvalues.h | Adds JitAsyncReuseContinuations config setting. |
| src/coreclr/jit/async.h | Adds shared-layout builder API and updates return-lookup signature. |
| src/coreclr/jit/async.cpp | Implements shared layout creation and continuation-instance reuse, and encodes result/exception/context indices in flags. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs | Updates managed continuation flag decoding to match the new encoded-index format. |

jakobbotsch (Member, Author) commented

/ba-g Unknown failures were #125638

@jakobbotsch jakobbotsch merged commit 1f751da into dotnet:main Mar 19, 2026
129 of 135 checks passed
@jakobbotsch jakobbotsch deleted the reuse-continuations-2 branch March 19, 2026 11:08

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), runtime-async
