Move thin lock acquire/release in CoreCLR to managed code#129502
Move thin lock acquire/release in CoreCLR to managed code#129502VSadov wants to merge 36 commits into
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @VSadov |
There was a problem hiding this comment.
Pull request overview
This PR moves the thin-lock (object header) acquire/release fast paths from CoreCLR native code into managed implementations in System.Private.CoreLib, removing the associated FCALL/ecall surface and native inline helpers.
Changes:
- Removed native thin-lock helpers (
syncblk.inl,ObjHeader::*HeaderThinLock, FCALL entries) and associated includes/build references. - Implemented thin-lock acquire/release in managed
System.Threading.ObjectHeader(CoreCLR) and updatedMonitorto call the new managed entrypoints. - Kept NativeAOT parity by renaming/updating its thin-lock entrypoints and adjusting call sites accordingly.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/vm/syncblk.inl | Removes native inline thin-lock acquire/release implementation. |
| src/coreclr/vm/syncblk.h | Removes native HeaderLockResult and thin-lock method declarations from ObjHeader. |
| src/coreclr/vm/ecalllist.h | Drops the ObjectHeader FCALL mapping entries. |
| src/coreclr/vm/comsynchronizable.h | Removes FCDECLs for thin-lock FCALL entrypoints. |
| src/coreclr/vm/comsynchronizable.cpp | Removes FCIMPL implementations for thin-lock FCALL entrypoints. |
| src/coreclr/vm/common.h | Removes syncblk.inl from global inline includes. |
| src/coreclr/vm/CMakeLists.txt | Removes syncblk.inl from VM header lists. |
| src/coreclr/System.Private.CoreLib/src/System/Threading/ObjectHeader.CoreCLR.cs | Adds managed thin-lock acquire/release logic and exposes AcquireThinLock(...) and managed Release(...). |
| src/coreclr/System.Private.CoreLib/src/System/Threading/Monitor.CoreCLR.cs | Routes Monitor.Enter/TryEnter/... to the new managed thin-lock entrypoints. |
| src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/ObjectHeader.cs | Renames/reshapes NativeAOT thin-lock entrypoint to AcquireThinLock(...) and adjusts uncommon-path handling. |
| src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Monitor.NativeAot.cs | Updates NativeAOT Monitor to call AcquireThinLock(...). |
| return HeaderLockResult.UseSlowPath; | ||
| } | ||
|
|
||
| if (Interlocked.CompareExchange(pHeader, oldBits | currentThreadID, oldBits) == oldBits) |
There was a problem hiding this comment.
Doing CAS in managed code is one of the motivations. The native CAS needs to check for the presence of LSE on ARM64, JITed code does not need that.
This affects Linux-arm64 perhaps even more than Windows-arm64.
|
@MihuBot benchmark System.Collections.Concurrent -arm -intel |
|
@MihuBot benchmark System.Threading -arm |
System.Collections.Concurrent.IsEmpty_String_
System.Collections.Concurrent.IsEmpty_Int32_
System.Collections.Concurrent.Count_String_
System.Collections.Concurrent.Count_Int32_
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/coreclr/System.Private.CoreLib/src/System/Threading/Monitor.CoreCLR.cs:90
- Monitor.TryEnter(object, int) always falls back to GetLockObject(obj).TryEnter(millisecondsTimeout) after a failed one-shot thin-lock attempt. For millisecondsTimeout == 0, this can unnecessarily allocate/create the Lock (inflation) even though we already know the thin lock is currently owned by another thread. Since TryEnter(…, 0) is a one-shot operation, it can return false immediately on HeaderLockResult.Failure and avoid the slow path/inflation cost.
public static bool TryEnter(object obj, int millisecondsTimeout)
{
ArgumentOutOfRangeException.ThrowIfLessThan(millisecondsTimeout, -1);
ObjectHeader.HeaderLockResult result = ObjectHeader.AcquireThinLock(obj, isOneShot: true);
if (result == ObjectHeader.HeaderLockResult.Success)
return true;
return GetLockObject(obj).TryEnter(millisecondsTimeout);
|
@MihuBot benchmark System.Threading -arm |
|
@MihuBot benchmark System.Collections.Concurrent -arm |
System.Threading.Tests.Perf_Volatile
System.Threading.Tests.Perf_Timer
System.Threading.Tests.Perf_ThreadStatic
System.Threading.Tests.Perf_ThreadPool
System.Threading.Tests.Perf_Thread
System.Threading.Tests.Perf_SpinLock
System.Threading.Tests.Perf_SemaphoreSlim
System.Threading.Tests.Perf_Monitor
System.Threading.Tests.Perf_Lock
System.Threading.Tests.Perf_Interlocked
System.Threading.Tests.Perf_EventWaitHandle
System.Threading.Tests.Perf_CancellationToken
System.Threading.Tasks.Tests.Perf_AsyncMethods
System.Threading.Tasks.ValueTaskPerfTest
System.Threading.Channels.Tests.UnboundedChannelPerfTests
System.Threading.Channels.Tests.SpscUnboundedChannelPerfTests
System.Threading.Channels.Tests.BoundedChannelPerfTests
|
Move the reconciled code to https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Threading/Monitor.cs ? (It is ok to put it under |
I think all other parts of Monitor.CoreCLR.cs and Monitor.NativeAOT.cs can also be unified, except for |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
|
||
| Enter(obj); | ||
| lockTaken = true; | ||
| } |
| @@ -72,12 +91,47 @@ | |||
| lockTaken = TryEnter(obj, millisecondsTimeout); | |||
| } | |||
|
Benchmark results on x64 (AMD Ryzen 9 7950X 16-Core, 32 logical) Higher throughput score is better. === Baseline: === The PR: |
|
Benchmark results on x64 (AMD EPYC 7763, 32core VM) Higher throughput score is better. === Baseline: === The PR: |
|
Benchmark results on ARM64 (Ampere Altra, 32core VM) Higher throughput score is better. === Baseline: The PR: |
|
The impact from the accidental inlining was fairly noticeable. Undoing that reverted a good portion of gains. The overall effect of this PR is a lot more sharing between CoreCLR and NativeAOT.
The baseline benefited from partial inlining of the entry points as well since the entry points on CoreCLR were not NoInline. I do not think getting codegen on par with c++ here is a huge concern though. Pinning is a relatively rare case for the JIT and overall impact on real apps may be uninteresting. |
The overall effect of this PR is a lot more sharing between CoreCLR and NativeAOT.
The performance impact is:
These scenarios appear to be quite sensitive to minor changes in codegen or hardware differences
(i.e. minor improvement vs. regression depending on AMD Epyc 7763 vs. AMD Ryzen 7950), so I think we can assume small differences to be a kind of "architectural noise".