Skip to content

[REFACTOR][CODEGEN] Phase out tvm_global_barrier_state and tvm_prepare_global_barrier#19454

Merged
tlopex merged 2 commits into
apache:mainfrom
tqchen:phase-out-global-barrier-state
Apr 27, 2026
Merged

[REFACTOR][CODEGEN] Phase out tvm_global_barrier_state and tvm_prepare_global_barrier#19454
tlopex merged 2 commits into
apache:mainfrom
tqchen:phase-out-global-barrier-state

Conversation

@tqchen

@tqchen tqchen commented Apr 27, 2026

Copy link
Copy Markdown
Member

Phase out the legacy spin-on-global-memory CUDA barrier machinery
(tvm_global_barrier_state / __tvm_prepare_global_barrier / the
tvm_global_barrier_kinit() builtin and the tirx.detect_global_barrier
pass-config option). CUDA's native cooperative groups / grid sync
primitives cover the use case better; the bespoke implementation has
been dead in the active codegen pipelines.

This is a deletion-only refactor across 10 files (~−264 lines net):

  • Public symbol constants in include/tvm/runtime/device_api.h
  • TIR builtin op tvm_global_barrier_kinit() (include/tvm/tirx/builtin.h,
    src/tirx/op/builtin.cc)
  • Pass config tirx.detect_global_barrier (src/tirx/ir/transform.cc)
  • Entire kGlobal branch of s_tir::ThreadSync including
    InitGlobalBarrier, MakeGlobalBarrier, and supporting state in
    ThreadSyncInserter
  • CUDAPrepGlobalBarrier runtime class + CUDAModuleNode::GetGlobal()
  • CodeGenCUDA::PrintStorageSync "global" branch and the
    VisitStmt_(EvaluateNode*) override + 3 member fields
  • Two Python pipeline opt-in blocks (s_tir/pipeline.py + adreno mirror)

No Python or test references to these symbols. Build clean
(260 targets), CUDA codegen 50/50 passed, tirx-base + tirx-transform
621 passed, IRF cpptests 8/8 passed.

…e_global_barrier

This PR removes the legacy spin-on-global-memory barrier machinery from
TVM. The implementation was a software synchronization primitive that
used a global device counter with busy-wait polling — CUDA provides
native cooperative groups / grid-sync for this use case, making this
dead code.

Main changes:

- Remove `tvm_global_barrier_kinit` builtin op and `tirx.detect_global_barrier` pass config option
- Strip kGlobal barrier path from `ThreadSyncInserter` (deletes ~130 lines of init/make logic)
- Delete `CUDAModuleNode::GetGlobal`, `CUDAPrepGlobalBarrier`, and the `GetFunction` early-return for `tvm_prepare_global_barrier`
- Drop the "global" branch in `CodeGenCUDA::PrintStorageSync` and the `VisitStmt_(EvaluateNode*)` override that emitted the kinit shared memory setup
- Remove the `tirx.detect_global_barrier` opt-in from both default and Adreno GPU pipelines

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the legacy global barrier implementation across the TVM runtime, TIRX, and CUDA codegen. This involves deleting global barrier symbols, built-in operators, configuration options, and the corresponding logic in the thread synchronization transformation and CUDA code generation. Review feedback suggests that the CUDA codegen should explicitly throw an error if a global sync is encountered to avoid silent failures, and identifies a redundant header inclusion in the thread storage synchronization pass.

Comment thread src/target/source/codegen_cuda.cc
Comment thread src/s_tir/transform/thread_storage_sync.cc
Per code review: the removal of the global barrier path left
PrintStorageSync silently doing nothing when sync == "global".
Add an explicit TVM_FFI_THROW(InternalError) so any stale IR
reaching this path fails loudly instead of generating incorrect code.
@tqchen

tqchen commented Apr 27, 2026

Copy link
Copy Markdown
Member Author

Thanks for the review, addressing both comments:

Comment 1 (PrintStorageSync silent no-op) — ACCEPTED

You are correct. After removing the global barrier branch, PrintStorageSync silently did nothing for sync == "global", which would produce incorrect code without any diagnostic. Added an explicit TVM_FFI_THROW(InternalError) for the "global" case (matching the existing error style in the file, e.g. PrintStorageScope already has TVM_FFI_ICHECK_NE(scope, "global")). Pushed as a follow-up commit (36a69d0).

Comment 2 (redundant <tvm/tirx/op.h> include) — REJECTED

The include is intentional and required. make_zero (used at line 300 in ThreadSyncAfterWaitQueueInserter::VisitStmt_) is declared in <tvm/tirx/op.h>. It was previously provided transitively through ../../tirx/transform/ir_utils.h, which we removed as part of the cleanup. Neither <tvm/tirx/builtin.h> nor <tvm/ir/op.h> (which builtin.h includes) declare make_zero — that symbol lives only in <tvm/tirx/op.h>. Removing it causes a build error.

@tqchen

tqchen commented Apr 27, 2026

Copy link
Copy Markdown
Member Author

The `#include <tvm/tirx/op.h>` include is required and not redundant.

`builtin.h` includes `<tvm/ir/op.h>` but not `<tvm/tirx/op.h>`. The `make_zero` function is defined in `<tvm/tirx/op.h>` (line 300 of this file) and used by `ThreadSyncAfterWaitQueueInserter::VisitStmt_` at line 300.

Empirical verification: removing `#include <tvm/tirx/op.h>` and attempting a focused build fails with:
```
error: 'make_zero' was not declared in this scope
```

The include is necessary.

@tlopex tlopex merged commit 216be63 into apache:main Apr 27, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants