[REFACTOR][CODEGEN] Phase out tvm_global_barrier_state and tvm_prepare_global_barrier #19454
…e_global_barrier

This PR removes the legacy spin-on-global-memory barrier machinery from TVM. The implementation was a software synchronization primitive that used a global device counter with busy-wait polling; CUDA provides native cooperative groups / grid-sync for this use case, making this dead code.

Main changes:
- Remove `tvm_global_barrier_kinit` builtin op and `tirx.detect_global_barrier` pass config option
- Strip kGlobal barrier path from `ThreadSyncInserter` (deletes ~130 lines of init/make logic)
- Delete `CUDAModuleNode::GetGlobal`, `CUDAPrepGlobalBarrier`, and the `GetFunction` early-return for `tvm_prepare_global_barrier`
- Drop the "global" branch in `CodeGenCUDA::PrintStorageSync` and the `VisitStmt_(EvaluateNode*)` override that emitted the kinit shared memory setup
- Remove the `tirx.detect_global_barrier` opt-in from both default and Adreno GPU pipelines
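For readers unfamiliar with the pattern being removed, the following is a minimal host-side analogy of a counter-based busy-wait barrier, written in plain C++ with `std::atomic` and `std::thread`. It is not TVM's code and not the deleted CUDA implementation; `SpinBarrier` and `RunSpinBarrierDemo` are hypothetical names introduced only for illustration. It shows the general idea: every participant increments a shared counter and then polls it until all participants have arrived.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical single-use barrier (illustration only, not TVM code):
// each participant bumps a shared counter, then busy-waits until all
// `expected` participants have arrived.
struct SpinBarrier {
  std::atomic<unsigned> arrived{0};
  unsigned expected;

  explicit SpinBarrier(unsigned n) : expected(n) {}

  void Wait() {
    // Release our earlier writes, then poll until everyone has checked in.
    arrived.fetch_add(1, std::memory_order_acq_rel);
    while (arrived.load(std::memory_order_acquire) < expected) {
      std::this_thread::yield();  // polling loop
    }
  }
};

// Each worker publishes a value, waits at the barrier, then sums all values.
// Returns true if every worker observed every other worker's write.
bool RunSpinBarrierDemo(unsigned num_workers) {
  SpinBarrier barrier(num_workers);
  std::vector<int> slots(num_workers, 0);
  std::vector<int> sums(num_workers, 0);
  std::vector<std::thread> workers;

  for (unsigned i = 0; i < num_workers; ++i) {
    workers.emplace_back([&, i] {
      slots[i] = 1;    // write before the barrier
      barrier.Wait();  // all writes are visible after this point
      int total = 0;
      for (int v : slots) total += v;
      sums[i] = total;
    });
  }
  for (auto& t : workers) t.join();

  for (int s : sums) {
    if (s != static_cast<int>(num_workers)) return false;
  }
  return true;
}
```

A barrier of this shape only makes progress if every participant can actually run while the others spin. On a GPU that generally means all blocks must be resident at the same time, which is the guarantee a cooperative launch with grid-wide sync is designed to provide.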
Code Review
This pull request removes the legacy global barrier implementation across the TVM runtime, TIRX, and CUDA codegen. This involves deleting global barrier symbols, built-in operators, configuration options, and the corresponding logic in the thread synchronization transformation and CUDA code generation. Review feedback suggests that the CUDA codegen should explicitly throw an error if a global sync is encountered to avoid silent failures, and identifies a redundant header inclusion in the thread storage synchronization pass.
Per code review: the removal of the global barrier path left PrintStorageSync silently doing nothing when sync == "global". Add an explicit TVM_FFI_THROW(InternalError) so any stale IR reaching this path fails loudly instead of generating incorrect code.
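The fail-loudly behavior described above can be sketched in self-contained C++. This is an assumption-laden illustration, not the actual `CodeGenCUDA::PrintStorageSync`: the function name `EmitStorageSync` is hypothetical, the `"warp"` and `"shared"` branches and their emitted strings are placeholders, and `std::logic_error` stands in for `TVM_FFI_THROW(InternalError)`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a codegen sync emitter (not TVM's actual code).
// Returns the CUDA source text to emit for a given storage-sync scope.
std::string EmitStorageSync(const std::string& sync) {
  if (sync == "warp") {
    return "__syncwarp();";
  }
  if (sync == "shared") {
    return "__syncthreads();";
  }
  if (sync == "global") {
    // The global barrier lowering is gone, so stale IR must not fall
    // through and silently emit nothing.
    throw std::logic_error("storage sync scope \"global\" is no longer supported");
  }
  // Unknown scopes are rejected as well, rather than ignored.
  throw std::logic_error("unknown storage sync scope: " + sync);
}
```

The point of the explicit branch is that a missing synchronization produces code that compiles and runs but may compute wrong results, whereas an exception at codegen time surfaces the stale IR immediately.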
Thanks for the review, addressing both comments:

Comment 1 (PrintStorageSync silent no-op): ACCEPTED. You are correct. After removing the […]

Comment 2 (redundant […]): The include is intentional and required.
The `#include <tvm/tirx/op.h>` directive is required, not redundant. `builtin.h` includes `<tvm/ir/op.h>` but not `<tvm/tirx/op.h>`. The `make_zero` function is defined in `<tvm/tirx/op.h>` (line 300 of this file) and used by `ThreadSyncAfterWaitQueueInserter::VisitStmt_` at line 300. Empirical verification: removing `#include <tvm/tirx/op.h>` and attempting a focused build fails. The include is necessary.
Phase out the legacy spin-on-global-memory CUDA barrier machinery (`tvm_global_barrier_state` / `__tvm_prepare_global_barrier` / the `tvm_global_barrier_kinit()` builtin and the `tirx.detect_global_barrier` pass-config option). CUDA's native cooperative groups / grid sync primitives cover the use case better; the bespoke implementation has been dead in the active codegen pipelines.

This is a deletion-only refactor across 10 files (~−264 lines net):
- `include/tvm/runtime/device_api.h`
- `tvm_global_barrier_kinit()` (`include/tvm/tirx/builtin.h`, `src/tirx/op/builtin.cc`)
- `tirx.detect_global_barrier` (`src/tirx/ir/transform.cc`)
- `s_tir::ThreadSync` including `InitGlobalBarrier`, `MakeGlobalBarrier`, and supporting state in `ThreadSyncInserter`
- `CUDAPrepGlobalBarrier` runtime class + `CUDAModuleNode::GetGlobal()`
- `CodeGenCUDA::PrintStorageSync` "global" branch and the `VisitStmt_(EvaluateNode*)` override + 3 member fields

No Python or test references to these symbols. Build clean (260 targets), CUDA codegen 50/50 passed, tirx-base + tirx-transform 621 passed, IRF cpptests 8/8 passed.