Summary
A macOS x64 hang dump from CI (runtime 11.0.0-preview.5.26272.112, macOS 15.2) shows a
deadlock caused by a cycle in a Virtual Stub Dispatch DispatchCache bucket chain. One
thread spins forever walking the cycle while holding m_writeLock; every other VSD resolve
piles up behind it, hanging the whole process.
Evidence from the dump
- 32 threads. 16 are blocked identically:
__psynch_mutexwait → CrstBase::Enter → DispatchCache::Insert → VirtualCallStubManager::ResolveWorker → VSD_ResolveWorker
- Thread 25 (tid 0x18) is running, PC at DispatchCache::Insert+192, inside the inlined Lookup
collision-chain walk (movq 0x18(%rax),%rax = follow pNext, loop if no match and not the sentinel).
- Walking that bucket in memory, ResolveCacheElem @ 0x10c887480 has pNext (+0x18) pointing at itself:
pMT=0x104ce34d0 token=0x200000001 target=0x592ff3a80 pNext=0x10c887480 (self)
- So thread 25 walks … → 0x10c887480 → 0x10c887480 → … forever while holding m_writeLock;
the other 16 threads deadlock on CrstBase::Enter(). The hung host then trips Helix's
15-min inactivity timer → createdump → "Test host process crashed."
Root cause
The cache enforces a single-writer invariant — DispatchCache::SetCacheEntry asserts
m_writeLock.OwnedByCurrentThread() (CHAIN_LOOKUP). The normal writers honor it:
- DispatchCache::Insert: takes m_writeLock ✔
- DispatchCache::PromoteChainEntry: takes m_writeLock ✔
But VirtualCallStubManager::~VirtualCallStubManager() purges this manager's entries from the
shared global g_resolveCache via DispatchCache::Iterator::UnlinkEntry() without taking
m_writeLock (it rewrites chain pointers directly, bypassing SetCacheEntry and its assert).
When a collectible LoaderAllocator (assembly/ALC) is unloaded while other live threads
concurrently Insert/PromoteChainEntry into the same buckets, the two writers interleave and
can splice elem->pNext = elem (or a larger cycle).
This is a collectible-unload-only path (non-collectible managers only run it at process exit),
which matches the failure profile: a different random assembly hangs each run, only under heavy
ALC churn.
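One interleaving that yields exactly the self-link seen in the dump can be replayed deterministically in a small model. The sketch below is not the runtime's code: Elem, Bucket, PromoteFindPrev, PromoteRelink and UnlinkHead are hypothetical stand-ins, and it assumes the promote path re-reads the bucket head after locating the predecessor, and that the unlink path rewrites the head slot and then resets the victim's pNext to the sentinel. Under those assumptions, an unlocked unlink that lands between the two halves of a promote leaves the promoted element pointing at itself.

```cpp
#include <cassert>

// Simplified stand-in for ResolveCacheElem: only the chain link matters here.
struct Elem {
    Elem* pNext;
};

// Hypothetical single-bucket cache; `empty` plays the role of the sentinel.
struct Bucket {
    Elem  empty{nullptr};
    Elem* head = &empty;
};

// First half of a promote: bail out if elem is already the head, otherwise
// find its predecessor in the chain.
Elem* PromoteFindPrev(Bucket& b, Elem* elem) {
    Elem* cur = b.head;
    if (cur == elem) return nullptr;
    while (cur->pNext != elem) cur = cur->pNext;
    return cur;
}

// Second half of a promote: splice elem out and relink it at the front.
// Note that the head is re-read here (assumption about the real code).
void PromoteRelink(Bucket& b, Elem* prev, Elem* elem) {
    prev->pNext = elem->pNext;
    elem->pNext = b.head;
    b.head = elem;
}

// Model of the unlocked purge: unlink whatever is currently at the head.
void UnlinkHead(Bucket& b) {
    Elem* victim = b.head;
    b.head = victim->pNext;
    victim->pNext = &b.empty;
}

// Replays: promoter finds prev, unloader unlinks the head, promoter relinks.
bool InterleavingCreatesSelfCycle() {
    Bucket b;
    Elem B{&b.empty};
    Elem A{&B};
    b.head = &A;                          // head -> A -> B -> empty

    Elem* prev = PromoteFindPrev(b, &B);  // promoter: prev == &A
    UnlinkHead(b);                        // unloader: head -> B, A -> empty
    PromoteRelink(b, prev, &B);           // promoter: B->pNext = head == B
    return B.pNext == &B;
}
```

Calling InterleavingCreatesSelfCycle() from a test harness returns true in this model. If the unlink step runs entirely before or after both promote steps, which is what holding m_writeLock around the purge would enforce, the same model ends with a well-formed chain.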
Note: the racy code is old (≈2015) and we did not find a recent change to this file. The
recent spike most likely comes from increased collectible-ALC unload activity rather than a
change here; a repro would confirm.
Proposed fix
Restore the single-writer invariant by taking the cache write lock around the unlink loops.
m_writeLock is CrstStubDispatchCache (level 0 / leaf, CRST_UNSAFE_ANYMODE), so it's
compatible with the destructor's GC_NOTRIGGER/NOTHROW contract and adds no lock-ordering risk.
diff --git a/src/coreclr/vm/virtualcallstub.h b/src/coreclr/vm/virtualcallstub.h
--- a/src/coreclr/vm/virtualcallstub.h
+++ b/src/coreclr/vm/virtualcallstub.h
@@ class DispatchCache
     // ... existing public members ...
+
+    // The cache enforces a single-writer invariant: callers that mutate a
+    // bucket chain (Insert / PromoteChainEntry / Iterator::UnlinkEntry) must
+    // hold this lock. Exposed so the collectible-unload purge in
+    // ~VirtualCallStubManager can serialize against concurrent inserts.
+    Crst *GetWriteLock() { LIMITED_METHOD_CONTRACT; return &m_writeLock; }
 private:
     Crst m_writeLock;
diff --git a/src/coreclr/vm/virtualcallstub.cpp b/src/coreclr/vm/virtualcallstub.cpp
--- a/src/coreclr/vm/virtualcallstub.cpp
+++ b/src/coreclr/vm/virtualcallstub.cpp
@@ VirtualCallStubManager::~VirtualCallStubManager()
 #ifdef FEATURE_VIRTUAL_STUB_DISPATCH
     // Go through each cache entry and if the cache element there is in
     // the cache entry heap of the manager being deleted, then we just
     // set the cache entry to empty.
+    // Serialize against concurrent Insert/PromoteChainEntry on the shared
+    // global cache; otherwise unlinking can race a concurrent insert and
+    // splice a chain into a self-referential cycle (process-wide VSD hang).
+    CrstHolder lh(g_resolveCache->GetWriteLock());
     DispatchCache::Iterator it(g_resolveCache);
     while (it.IsValid())
     {
         while (it.IsValid() && cache_entry_rangeList.IsInRange((TADDR)it.Entry()))
         {
             it.UnlinkEntry();
         }
         it.Next();
     }
 #endif // FEATURE_VIRTUAL_STUB_DISPATCH
@@ (stats-reset / manager-delete purge loop, ~line 420)
 #ifdef FEATURE_VIRTUAL_STUB_DISPATCH
     ...
+    CrstHolder lh(g_resolveCache->GetWriteLock());
     DispatchCache::Iterator it(g_resolveCache);
     while (it.IsValid())
     {
         it.UnlinkEntry();
     }
 #endif // FEATURE_VIRTUAL_STUB_DISPATCH
Optional defense-in-depth (not in the diff): bound the Lookup/Insert chain walk so any
future cycle degrades gracefully instead of hanging the host.
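A minimal sketch of what a bounded walk could look like, as a standalone model rather than a patch to DispatchCache::Lookup. Node, kMaxChainWalk and BoundedLookup are hypothetical names, and the cap of 1024 is an arbitrary placeholder, not a value taken from the runtime. When the cap is hit, the walk reports a miss, so a caller could fall back to the slow resolve path instead of spinning.

```cpp
#include <cassert>
#include <cstddef>

// Simplified chain node; `key` stands in for the (pMT, token) pair.
struct Node {
    int   key;
    Node* pNext;
};

// Hypothetical upper bound on chain length; a real value would need tuning.
constexpr std::size_t kMaxChainWalk = 1024;

// Walks the chain from head until the sentinel. Returns the matching node,
// or nullptr on a miss. If the walk exceeds kMaxChainWalk steps (for example
// because the chain contains a cycle), sets *hitCap and reports a miss.
const Node* BoundedLookup(const Node* head, const Node* sentinel,
                          int key, bool* hitCap) {
    *hitCap = false;
    std::size_t steps = 0;
    for (const Node* cur = head; cur != sentinel; cur = cur->pNext) {
        if (cur->key == key) return cur;
        if (++steps >= kMaxChainWalk) {
            *hitCap = true;   // corrupted or pathologically long chain
            return nullptr;
        }
    }
    return nullptr;
}
```

A step cap is the cheapest option because it adds one counter to the hot loop; it cannot distinguish a cycle from a legitimately long chain, so the hitCap flag is mainly useful for logging or asserting in checked builds.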
Open questions for the runtime team
- Confirm the collectible-unload teardown runs concurrently with other managed threads (the
evidence says yes; if it were stop-the-world the lock would be a no-op and the cause lies elsewhere).
- Confirm no lock held across ~VirtualCallStubManager would invert ordering with the leaf lock.
- Decide whether the bounded-walk hardening is worth adding alongside the lock fix.
Known Issue Error Message
DO NOT USE JSON BELOW IF THIS IS A BUILD BREAK otherwise build analysis will allow pull requests to merge that break the build worse. For a build break, do not use this issue form. Make a regular new issue.
Fill the error message using step by step known issues guidance.
{
"ErrorMessage": "",
"ErrorPattern": "",
"BuildRetry": false,
"ExcludeConsoleLog": false
}
Report
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 0 | 0 |