[VSD] DispatchCache self-loop: unlocked UnlinkEntry during collectible ALC unload races Insert → process hang #128859

Description

@wtgodbe

Summary

A macOS x64 hang dump from CI (runtime 11.0.0-preview.5.26272.112, macOS 15.2) shows a
process-wide hang caused by a cycle in a Virtual Stub Dispatch (VSD) DispatchCache bucket
chain. One thread spins forever walking the cycle while holding m_writeLock; every other
VSD resolve blocks behind it on that lock.

Evidence from the dump

  • 32 threads. 16 are blocked identically:
    __psynch_mutexwait → CrstBase::Enter → DispatchCache::Insert → VirtualCallStubManager::ResolveWorker → VSD_ResolveWorker
  • Thread 25 (tid 0x18) is running, PC at DispatchCache::Insert+192, inside the inlined
    Lookup collision-chain walk (movq 0x18(%rax),%rax = follow pNext, loop if no match
    and not the sentinel).
  • Walking that bucket in memory, ResolveCacheElem @ 0x10c887480 has pNext (+0x18)
    pointing at itself:
    • pMT=0x104ce34d0 token=0x200000001 target=0x592ff3a80 pNext=0x10c887480 (self)
  • So thread 25 walks … → 0x10c887480 → 0x10c887480 → … forever while holding m_writeLock;
    the 16 blocked threads wait indefinitely in CrstBase::Enter(). The hung host then trips
    Helix's 15-minute inactivity timer → createdump → "Test host process crashed."
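
When triaging a similar dump, a cycle check over the bucket chain can confirm the self-loop without walking it by hand. The following is a minimal sketch using Floyd's tortoise-and-hare check. The Elem struct and ChainHasCycle are hypothetical stand-ins for illustration, not the runtime's ResolveCacheElem or any existing helper.

```cpp
#include <cassert>

// Hypothetical stand-in for a cache element; only the link field matters here.
struct Elem {
    Elem* pNext;
};

// Returns true if following pNext from `head` revisits a node before reaching
// `sentinel` (Floyd's tortoise-and-hare). A healthy chain always terminates at
// the sentinel, so a `true` result indicates a corrupted chain.
bool ChainHasCycle(const Elem* head, const Elem* sentinel) {
    assert(head != nullptr && sentinel != nullptr);
    const Elem* slow = head;
    const Elem* fast = head;
    while (fast != sentinel && fast->pNext != sentinel) {
        slow = slow->pNext;          // advances one link
        fast = fast->pNext->pNext;   // advances two links
        if (slow == fast) {
            return true;             // the pointers met inside a cycle
        }
    }
    return false;                    // reached the sentinel: no cycle
}
```

An element whose pNext points at itself, like the one at 0x10c887480 in the dump, is the smallest cycle this check reports.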

Root cause

The cache enforces a single-writer invariant — DispatchCache::SetCacheEntry asserts
m_writeLock.OwnedByCurrentThread() (CHAIN_LOOKUP). The normal writers honor it:

  • DispatchCache::Insert — takes m_writeLock
  • DispatchCache::PromoteChainEntry — takes m_writeLock

But VirtualCallStubManager::~VirtualCallStubManager() purges this manager's entries from the
shared global g_resolveCache via DispatchCache::Iterator::UnlinkEntry() without taking
m_writeLock (it rewrites chain pointers directly, bypassing SetCacheEntry and its assert).
When a collectible LoaderAllocator (assembly/ALC) is unloaded while other live threads
concurrently Insert/PromoteChainEntry into the same buckets, the two writers interleave and
can splice elem->pNext = elem (or a larger cycle).

This is a collectible-unload-only path (non-collectible managers only run it at process exit),
which matches the failure profile: a different random assembly hangs each run, only under heavy
ALC churn.

Note: the racy code is old (≈2015) and we did not find a recent change to this file. The
recent spike most likely comes from increased collectible-ALC unload activity rather than a
change here; a repro would confirm.
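
To make the suspected race concrete, here is a single-threaded replay of one interleaving that yields elem->pNext = elem. It is a simplified model, not the runtime's code: Elem, Bucket, and both Replay functions are hypothetical, and the "promote" step is assumed to be a plain move-to-front, which may differ from what PromoteChainEntry actually does.

```cpp
#include <cassert>

// Simplified model: a bucket is a head pointer to a singly linked chain that
// ends at a sentinel element.
struct Elem {
    Elem* pNext;
};

struct Bucket {
    Elem* head;
};

// Replays an unlocked "unlink A" landing in the middle of a "promote B" on the
// chain  bucket -> A -> B -> sentinel.  Returns true if B ends up self-linked.
bool ReplayUnlockedInterleaving() {
    Elem sentinel{nullptr};
    Elem b{&sentinel};
    Elem a{&b};
    Bucket bucket{&a};

    // Promoter, step 1: walks the chain and finds B with predecessor A.
    Elem* promotePrev = &a;
    Elem* promoteElem = &b;

    // Unlinker (no lock taken): removes A, the current head.
    bucket.head = a.pNext;                    // bucket -> B -> sentinel
    assert(bucket.head == &b);

    // Promoter, step 2: detaches B using its now-stale predecessor.
    promotePrev->pNext = promoteElem->pNext;  // A -> sentinel, but A is already gone

    // Promoter, step 3: pushes B in front of the current head, which is B itself.
    promoteElem->pNext = bucket.head;         // B -> B
    bucket.head = promoteElem;

    return b.pNext == &b;
}

// Same two operations, but the unlink completes before the promoter starts,
// as it would if both ran under one lock.
bool ReplaySerialized() {
    Elem sentinel{nullptr};
    Elem b{&sentinel};
    Elem a{&b};
    Bucket bucket{&a};

    bucket.head = a.pNext;                    // unlink A to completion

    // The promoter now walks the current chain, so it sees B at the head.
    Elem* prev = nullptr;
    Elem* cur = bucket.head;
    while (cur != &b) {
        prev = cur;
        cur = cur->pNext;
    }
    if (prev != nullptr) {                    // only move B if it is not already first
        prev->pNext = cur->pNext;
        cur->pNext = bucket.head;
        bucket.head = cur;
    }
    return b.pNext == &b;
}
```

In this model the unlocked replay leaves B pointing at itself, while the serialized replay leaves the chain terminated at the sentinel. It shows that the interleaving is possible in principle; it does not prove that this exact sequence occurred in the dump.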

Proposed fix

Restore the single-writer invariant by taking the cache write lock around the unlink loops.
m_writeLock is CrstStubDispatchCache (level 0 / leaf, CRST_UNSAFE_ANYMODE), so it's
compatible with the destructor's GC_NOTRIGGER/NOTHROW contract and adds no lock-ordering risk.

diff --git a/src/coreclr/vm/virtualcallstub.h b/src/coreclr/vm/virtualcallstub.h
--- a/src/coreclr/vm/virtualcallstub.h
+++ b/src/coreclr/vm/virtualcallstub.h
@@ class DispatchCache
     // ... existing public members ...
+
+    // The cache enforces a single-writer invariant: callers that mutate a
+    // bucket chain (Insert / PromoteChainEntry / Iterator::UnlinkEntry) must
+    // hold this lock. Exposed so the collectible-unload purge in
+    // ~VirtualCallStubManager can serialize against concurrent inserts.
+    Crst *GetWriteLock() { LIMITED_METHOD_CONTRACT; return &m_writeLock; }

 private:
     Crst m_writeLock;

diff --git a/src/coreclr/vm/virtualcallstub.cpp b/src/coreclr/vm/virtualcallstub.cpp
--- a/src/coreclr/vm/virtualcallstub.cpp
+++ b/src/coreclr/vm/virtualcallstub.cpp
@@ VirtualCallStubManager::~VirtualCallStubManager()
 #ifdef FEATURE_VIRTUAL_STUB_DISPATCH
     // Go through each cache entry and if the cache element there is in
     // the cache entry heap of the manager being deleted, then we just
     // set the cache entry to empty.
+    // Serialize against concurrent Insert/PromoteChainEntry on the shared
+    // global cache; otherwise unlinking can race a concurrent insert and
+    // splice a chain into a self-referential cycle (process-wide VSD hang).
+    CrstHolder lh(g_resolveCache->GetWriteLock());
     DispatchCache::Iterator it(g_resolveCache);
     while (it.IsValid())
     {
         while (it.IsValid() && cache_entry_rangeList.IsInRange((TADDR)it.Entry()))
         {
             it.UnlinkEntry();
         }
         it.Next();
     }
 #endif // FEATURE_VIRTUAL_STUB_DISPATCH

@@ (stats-reset / manager-delete purge loop, ~line 420)
 #ifdef FEATURE_VIRTUAL_STUB_DISPATCH
     ...
+    CrstHolder lh(g_resolveCache->GetWriteLock());
     DispatchCache::Iterator it(g_resolveCache);
     while (it.IsValid())
     {
         it.UnlinkEntry();
     }
 #endif // FEATURE_VIRTUAL_STUB_DISPATCH

Optional defense-in-depth (not in the diff): bound the Lookup/Insert chain walk so any
future cycle degrades gracefully instead of hanging the host.
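
A bounded walk might look like the sketch below. Everything in it is an assumption for illustration: the Elem layout, the BoundedLookup name, and the kMaxChainWalk limit are hypothetical and are not taken from virtualcallstub.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical element layout: a (pMT, token) key plus the chain link.
struct Elem {
    std::uintptr_t pMT;
    std::uintptr_t token;
    Elem* pNext;
};

// Hypothetical cap on links followed per lookup; a real value would need tuning
// against the longest legitimate collision chain.
constexpr std::size_t kMaxChainWalk = 1024;

// Walks the chain looking for (pMT, token). Gives up and reports a miss after
// kMaxChainWalk links, so a corrupted (cyclic) chain cannot spin forever.
const Elem* BoundedLookup(const Elem* head, const Elem* sentinel,
                          std::uintptr_t pMT, std::uintptr_t token) {
    assert(head != nullptr && sentinel != nullptr);
    std::size_t steps = 0;
    for (const Elem* e = head; e != sentinel; e = e->pNext) {
        if (e->pMT == pMT && e->token == token) {
            return e;                 // hit
        }
        if (++steps >= kMaxChainWalk) {
            return nullptr;           // bound exceeded: treat as a miss
        }
    }
    return nullptr;                   // reached the sentinel: ordinary miss
}
```

Treating the bound as a miss would send the caller down the slow resolve path rather than hanging. Whether the caller should also log, assert in checked builds, or reset the bucket is a design decision left open here.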

Open questions for the runtime team

  1. Confirm the collectible-unload teardown runs concurrently with other managed threads (the
    evidence says yes; if it were stop-the-world the lock would be a no-op and the cause lies elsewhere).
  2. Confirm no lock held across ~VirtualCallStubManager would invert ordering with the leaf lock.
  3. Decide whether the bounded-walk hardening is worth adding alongside the lock fix.

Known Issue Error Message

DO NOT USE JSON BELOW IF THIS IS A BUILD BREAK otherwise build analysis will allow pull requests to merge that break the build worse. For a build break, do not use this issue form. Make a regular new issue.

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Report

Summary

24-Hour Hit Count    7-Day Hit Count    1-Month Count
0                    0                  0
