[Transform] Add InjectTcgen05Fence pass#2003
Conversation
On Blackwell GPUs, the tcgen05 accumulator (TMEM) resides in a separate
address space that is not synchronized by regular thread barriers like
__syncthreads() or mbarrier. Two PTX fence instructions are required to
ensure cross-thread visibility of TMEM state:
tcgen05.fence::before_thread_sync -- flush TMEM before barrier
tcgen05.fence::after_thread_sync -- pull TMEM after barrier
This commit introduces the `InjectTcgen05Fence` TIR pass that
automatically wraps every `tvm_storage_sync("shared")` call with the
fence pair when the target is SM100+ and the function uses tcgen05/TMEM
operations.
Changes:
- Define two new TIR intrinsic Ops: `tcgen05_before_thread_sync` and
`tcgen05_after_thread_sync` (builtin.h/cc)
- Add codegen support to emit `tl::tcgen05_before_thread_sync()` and
`tl::tcgen05_after_thread_sync()` (codegen_cuda.cc)
- Implement the `InjectTcgen05Fence` pass (inject_tcgen05_fence.cc)
- Register the pass in the Python transform module (__init__.py)
- Insert the pass in OptimizeForTarget after ThreadSync (phase.py)
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds TileLang tcgen05 load/store intrinsics ( Changes
Sequence DiagramsequenceDiagram
rect rgba(200,200,255,0.5)
participant Pipeline as OptimizeForTarget
end
rect rgba(200,255,200,0.5)
participant Pass as InjectTcgen05Fence
participant Mutator as StmtExprMutator
end
rect rgba(255,200,200,0.5)
participant TIR as PrimFunc Body / TIR Nodes
participant Codegen as Lowering / Codegen
end
Pipeline->>Pass: run on PrimFunc (target SM100+)
Pass->>TIR: scan for tcgen05/TMEM uses and sync sites
TIR-->>Pass: report usage locations
Pass->>Mutator: traverse SeqStmt/Evaluate nodes
Mutator->>TIR: detect storage_sync / mbarrier / arrive/wait sites
Mutator-->>Pass: insert before_thread_sync / after_thread_sync (avoid duplicates)
Pass->>Pipeline: return transformed PrimFunc
Pipeline->>Codegen: subsequent lowering emits tl::tcgen05_ld/st and codegen emits templates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/op/builtin.h (1)
431-445: Drop the duplicate public declarations.These two APIs are already declared again at Lines 729-735. Keeping both blocks means the next signature or doc change has two places to update.
✂️ Proposed cleanup
-/*! - * \brief Emit tcgen05.fence::before_thread_sync on Blackwell (SM100+) - * - * tcgen05_before_thread_sync() - * - */ -TVM_DLL const Op &tcgen05_before_thread_sync(); - -/*! - * \brief Emit tcgen05.fence::after_thread_sync on Blackwell (SM100+) - * - * tcgen05_after_thread_sync() - * - */ -TVM_DLL const Op &tcgen05_after_thread_sync(); - /*! * \brief Indicate arrival of warp issuing TMA_STORE🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/op/builtin.h` around lines 431 - 445, This file contains duplicate public declarations for the two ops tcgen05_before_thread_sync() and tcgen05_after_thread_sync(); remove the earlier duplicate block (the declarations inside the commented brief around the first occurrence) so there is a single canonical declaration for each (keep the later declarations at the other location), ensuring only the unique symbols tcgen05_before_thread_sync and tcgen05_after_thread_sync remain declared once in the header.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 2034-2039: Remove the duplicate tcgen05 fence handler branches
that check op->op.same_as(tl::tcgen05_before_thread_sync()) and
tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call
print_extern_call_stmt), because they shadow the intended handler that performs
ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later
handler (the one that validates op->args.size()) runs and malformed IR triggers
the ICHECK_EQ failure instead of falling through to bad C++ emission.
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 109-112: The current gate uses TargetIsSm100 which restricts to
SM100–110; change the check to use TargetHasSMVersionGE(opt_target.value(), 100)
so the pass applies to all SM100+ (Blackwell) targets; specifically, in the
block that gets the target via Optional<Target> opt_target =
f->GetAttr<Target>(tvm::attr::kTarget) and currently calls
TargetIsSm100(opt_target.value()), replace that call with
TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing
opt_target.defined() guard and the early return (return f) behavior.
---
Nitpick comments:
In `@src/op/builtin.h`:
- Around line 431-445: This file contains duplicate public declarations for the
two ops tcgen05_before_thread_sync() and tcgen05_after_thread_sync(); remove the
earlier duplicate block (the declarations inside the commented brief around the
first occurrence) so there is a single canonical declaration for each (keep the
later declarations at the other location), ensuring only the unique symbols
tcgen05_before_thread_sync and tcgen05_after_thread_sync remain declared once in
the header.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6fce64a2-16bb-432b-9c7c-09bd38ee5ca1
📒 Files selected for processing (6)
src/op/builtin.ccsrc/op/builtin.hsrc/target/codegen_cuda.ccsrc/transform/inject_tcgen05_fence.cctilelang/engine/phase.pytilelang/transform/__init__.py
Extend InjectTcgen05Fence to cover shared storage sync and linear wait/use, use/arrive handoffs on SM100+ kernels, update the surrounding docs, and add transform coverage. Also lower tcgen05 ld/st copies through dedicated intrinsics instead of call_extern strings, with matching CUDA and CuTeDSL codegen support.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py (1)
95-173: Run the newtcgen05_ld/stchecks throughInjectTcgen05Fence()as well.These tests stop right after
LowerSharedTmem(), buttilelang/engine/phase.pyinsertsInjectTcgen05Fence()later, after TMEM copies have already been rewritten to directtl.tcgen05_ld/tl.tcgen05_stops. Adding one end-to-end transform test here would catch regressions where the fence pass still handles the legacycall_extern("tl::tcgen05_*")forms but misses the new lowered intrinsics.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py` around lines 95 - 173, The tests test_lower_tmem_copy_uses_tcgen05_ld_intrin and test_lower_tmem_copy_uses_tcgen05_st_intrin stop after LowerSharedTmem() but do not exercise InjectTcgen05Fence(), so regressions where the fence pass only handles legacy "tl::tcgen05_*" extern calls can be missed; update both tests to run the InjectTcgen05Fence() transform (the same transform applied in tilelang/engine/phase.py) after LowerSharedTmem() — i.e., call tl.transform.InjectTcgen05Fence()(mod) (or the pipeline wrapper used in phase.py) on the IRModule before inspecting the body, so the new tl.tcgen05_ld / tl.tcgen05_st intrinsics are validated end-to-end by the fence pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 101-118: IsTcgen05OrTmemCall currently treats tcgen05_mma_arrive()
as a "use", causing this pass to insert before/after-thread-sync fences around
arrive calls; remove the tcgen05_mma_arrive() check from the predicate so arrive
is not considered a TCGEN05/TMEM use (update both occurrences where
tcgen05_mma_arrive() is listed), leaving other checks (ptx_init_tensor_memory,
ptx_deallocate_tensor_memory, tcgen05_ld/st, ptx_tcgen05_mma_*, and extern-name
prefixes) intact so the sync-boundary logic can correctly detect arrive as a
sync boundary rather than a use.
---
Nitpick comments:
In `@testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py`:
- Around line 95-173: The tests test_lower_tmem_copy_uses_tcgen05_ld_intrin and
test_lower_tmem_copy_uses_tcgen05_st_intrin stop after LowerSharedTmem() but do
not exercise InjectTcgen05Fence(), so regressions where the fence pass only
handles legacy "tl::tcgen05_*" extern calls can be missed; update both tests to
run the InjectTcgen05Fence() transform (the same transform applied in
tilelang/engine/phase.py) after LowerSharedTmem() — i.e., call
tl.transform.InjectTcgen05Fence()(mod) (or the pipeline wrapper used in
phase.py) on the IRModule before inspecting the body, so the new tl.tcgen05_ld /
tl.tcgen05_st intrinsics are validated end-to-end by the fence pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 73b18376-200e-4839-862c-b0ddb947c31f
📒 Files selected for processing (10)
examples/gemm_sm100/README.mdsrc/op/builtin.ccsrc/op/builtin.hsrc/op/copy.ccsrc/target/codegen_cuda.ccsrc/target/codegen_cutedsl.ccsrc/transform/inject_tcgen05_fence.cctesting/python/transform/test_tilelang_transform_inject_tcgen05_fence.pytilelang/engine/phase.pytilelang/transform/__init__.py
✅ Files skipped from review due to trivial changes (1)
- examples/gemm_sm100/README.md
🚧 Files skipped from review as they are similar to previous changes (4)
- tilelang/transform/init.py
- src/op/builtin.h
- src/target/codegen_cuda.cc
- src/op/builtin.cc
| bool IsTcgen05OrTmemCall(const CallNode *call) { | ||
| if (!call || IsBeforeFenceCall(call) || IsAfterFenceCall(call)) { | ||
| return false; | ||
| } | ||
|
|
||
| if (call->op.same_as(ptx_tcgen05_mma_ss()) || | ||
| call->op.same_as(ptx_tcgen05_mma_ts()) || | ||
| call->op.same_as(tcgen05_ld()) || | ||
| call->op.same_as(tcgen05_st()) || | ||
| call->op.same_as(tcgen05_mma_arrive()) || | ||
| call->op.same_as(ptx_init_tensor_memory()) || | ||
| call->op.same_as(ptx_deallocate_tensor_memory())) { | ||
| return true; | ||
| } | ||
|
|
||
| return IsExternNameWithPrefix(call, "tl::tcgen05_ld_") || | ||
| IsExternNameWithPrefix(call, "tl::tcgen05_st_") || | ||
| IsExternNameWithPrefix(call, "tl::tcgen05_cp"); |
There was a problem hiding this comment.
tcgen05_mma_arrive() currently triggers the very fences this pass says it must not add.
tcgen05_mma_arrive() is counted as a TCGEN05/TMEM use here, and both linear scans check “use” before they check “sync boundary”. That means mbarrier_wait_parity(); tcgen05_mma_arrive(); will get an extra tcgen05_after_thread_sync(), and tcgen05_mma_arrive(); ptx_arrive_barrier(...); will get an extra tcgen05_before_thread_sync(), which contradicts the file header’s stated non-goal.
🛠️ Minimal fix
bool HasUpcomingTcgen05Use(const Array<Stmt> &seq, int start_index) {
for (int i = start_index + 1; i < static_cast<int>(seq.size()); ++i) {
const Stmt &stmt = seq[i];
if (IsAfterFenceStmt(stmt)) {
return false;
}
- if (StmtUsesTcgen05OrTmem(stmt)) {
- return true;
- }
if (IsBeforeFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) {
return false;
}
+ if (StmtUsesTcgen05OrTmem(stmt)) {
+ return true;
+ }
}
return false;
}
bool HasPriorTcgen05Use(const Array<Stmt> &seq, int start_index) {
for (int i = start_index - 1; i >= 0; --i) {
const Stmt &stmt = seq[i];
if (IsBeforeFenceStmt(stmt)) {
return false;
}
- if (StmtUsesTcgen05OrTmem(stmt)) {
- return true;
- }
if (IsAfterFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) {
return false;
}
+ if (StmtUsesTcgen05OrTmem(stmt)) {
+ return true;
+ }
}
return false;
}Also applies to: 148-177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/transform/inject_tcgen05_fence.cc` around lines 101 - 118,
IsTcgen05OrTmemCall currently treats tcgen05_mma_arrive() as a "use", causing
this pass to insert before/after-thread-sync fences around arrive calls; remove
the tcgen05_mma_arrive() check from the predicate so arrive is not considered a
TCGEN05/TMEM use (update both occurrences where tcgen05_mma_arrive() is listed),
leaving other checks (ptx_init_tensor_memory, ptx_deallocate_tensor_memory,
tcgen05_ld/st, ptx_tcgen05_mma_*, and extern-name prefixes) intact so the
sync-boundary logic can correctly detect arrive as a sync boundary rather than a
use.
Remove the hand-written tcgen05 before/after thread-sync fences from the ws and persistent GEMM examples now that InjectTcgen05Fence covers these linear wait/use and use/arrive handoffs.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/transform/inject_tcgen05_fence.cc (1)
102-118:⚠️ Potential issue | 🟠 MajorStop classifying
tcgen05_mma_arrive()as a TCGEN05/TMEM use.It is already treated as a sync boundary below, and the linear scans at Lines 148-177 check “use” before “boundary”. With the current predicate, sequences like
mbarrier_wait_parity(); tcgen05_mma_arrive();still get an extratcgen05_after_thread_sync(), andtcgen05_mma_arrive(); ptx_arrive_barrier(...);still get an extratcgen05_before_thread_sync(), which contradicts the pass’s stated non-goal.Minimal fix
bool IsTcgen05OrTmemCall(const CallNode *call) { if (!call || IsBeforeFenceCall(call) || IsAfterFenceCall(call)) { return false; } if (call->op.same_as(ptx_tcgen05_mma_ss()) || call->op.same_as(ptx_tcgen05_mma_ts()) || - call->op.same_as(tcgen05_ld()) || call->op.same_as(tcgen05_st()) || - call->op.same_as(tcgen05_mma_arrive()) || + call->op.same_as(tcgen05_ld()) || call->op.same_as(tcgen05_st()) || call->op.same_as(ptx_init_tensor_memory()) || call->op.same_as(ptx_deallocate_tensor_memory())) { return true; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/inject_tcgen05_fence.cc` around lines 102 - 118, The predicate IsTcgen05OrTmemCall(...) incorrectly treats tcgen05_mma_arrive() as a TCGEN05/TMEM use; remove tcgen05_mma_arrive() from the set of ops checked in IsTcgen05OrTmemCall (i.e., delete the call->op.same_as(tcgen05_mma_arrive()) branch) so that tcgen05_mma_arrive() remains only a sync boundary handled by the existing boundary logic and the subsequent linear scans and does not trigger extra before/after thread-sync insertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 102-118: The predicate IsTcgen05OrTmemCall(...) incorrectly treats
tcgen05_mma_arrive() as a TCGEN05/TMEM use; remove tcgen05_mma_arrive() from the
set of ops checked in IsTcgen05OrTmemCall (i.e., delete the
call->op.same_as(tcgen05_mma_arrive()) branch) so that tcgen05_mma_arrive()
remains only a sync boundary handled by the existing boundary logic and the
subsequent linear scans and does not trigger extra before/after thread-sync
insertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3ea21838-cd9a-4bb4-bec1-a43c470203a0
📒 Files selected for processing (2)
src/op/copy.ccsrc/transform/inject_tcgen05_fence.cc
Simplify InjectTcgen05Fence to recognize only the lowered tcgen05 intrinsics seen after LowerTileOp, and update the transform tests to construct tcgen05_ld directly instead of relying on legacy call_extern strings.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/transform/inject_tcgen05_fence.cc (1)
97-100:⚠️ Potential issue | 🟠 Major
tcgen05_mma_arrive()can still trigger unintended fence insertion due to scan order.
tcgen05_mma_arrive()is classified as both a use (Line 100) and a sync boundary (Lines 126-130), but both scans check “use” before “boundary” (Line 138 before Line 141, and Line 154 before Line 157). That makesmbarrier_wait_parity(); tcgen05_mma_arrive();look like an upcoming use andtcgen05_mma_arrive(); ptx_arrive_barrier(...);look like prior use, which contradicts the pass non-goal.🔧 Minimal fix (prefer boundary check before use)
bool HasUpcomingTcgen05Use(const Array<Stmt>& seq, int start_index) { for (int i = start_index + 1; i < static_cast<int>(seq.size()); ++i) { const Stmt& stmt = seq[i]; if (IsAfterFenceStmt(stmt)) { return false; } - if (StmtUsesTcgen05OrTmem(stmt)) { - return true; - } if (IsBeforeFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) { return false; } + if (StmtUsesTcgen05OrTmem(stmt)) { + return true; + } } return false; } bool HasPriorTcgen05Use(const Array<Stmt>& seq, int start_index) { for (int i = start_index - 1; i >= 0; --i) { const Stmt& stmt = seq[i]; if (IsBeforeFenceStmt(stmt)) { return false; } - if (StmtUsesTcgen05OrTmem(stmt)) { - return true; - } if (IsAfterFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) { return false; } + if (StmtUsesTcgen05OrTmem(stmt)) { + return true; + } } return false; }Also applies to: 132-142, 148-158
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/inject_tcgen05_fence.cc` around lines 97 - 100, The scan logic currently classifies tcgen05_mma_arrive() as a "use" before it checks for sync boundaries, causing fence insertion mistakes; change the scan order so boundary checks run before use checks (i.e., ensure the predicate that tests for sync boundaries includes call->op.same_as(tcgen05_mma_arrive()) or that the call->op.same_as(tcgen05_mma_arrive()) entry is removed from the early "use" list and placed/checked in the boundary list) for both scan passes that currently test "use" prior to "boundary" (the blocks that enumerate ptx_tcgen05_mma_ss(), ptx_tcgen05_mma_ts(), tcgen05_ld(), tcgen05_st(), tcgen05_mma_arrive(), etc. and the corresponding later boundary-check block); apply the same reordering to the other similar scan blocks in the file to prevent mbarrier_wait_parity(); tcgen05_mma_arrive(); and tcgen05_mma_arrive(); ptx_arrive_barrier(...) from being misclassified.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 97-100: The scan logic currently classifies tcgen05_mma_arrive()
as a "use" before it checks for sync boundaries, causing fence insertion
mistakes; change the scan order so boundary checks run before use checks (i.e.,
ensure the predicate that tests for sync boundaries includes
call->op.same_as(tcgen05_mma_arrive()) or that the
call->op.same_as(tcgen05_mma_arrive()) entry is removed from the early "use"
list and placed/checked in the boundary list) for both scan passes that
currently test "use" prior to "boundary" (the blocks that enumerate
ptx_tcgen05_mma_ss(), ptx_tcgen05_mma_ts(), tcgen05_ld(), tcgen05_st(),
tcgen05_mma_arrive(), etc. and the corresponding later boundary-check block);
apply the same reordering to the other similar scan blocks in the file to
prevent mbarrier_wait_parity(); tcgen05_mma_arrive(); and tcgen05_mma_arrive();
ptx_arrive_barrier(...) from being misclassified.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d50c5bce-87d6-4f80-8d23-790744d77f2c
📒 Files selected for processing (2)
src/transform/inject_tcgen05_fence.cctesting/python/transform/test_tilelang_transform_inject_tcgen05_fence.py
✅ Files skipped from review due to trivial changes (1)
- testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py
Summary
InjectTcgen05Fence, a conservative TIR pass for Blackwell (SM100+) kernels that use TCGEN05/TMEM operations.tcgen05.fence::before_thread_sync/tcgen05.fence::after_thread_syncat conservative handoff boundaries.SM100targets and on functions without TCGEN05/TMEM usage.Current insertion rules
When the function targets
SM100+and contains TCGEN05/TMEM operations, the pass currently handles three patterns:tvm_storage_sync("shared")/tvm_storage_sync("shared.dyn")as:tcgen05_after_thread_sync()aftermbarrier_wait_parity(...)when a forward linear scan reaches TCGEN05/TMEM use before another synchronization boundary.tcgen05_before_thread_sync()beforeptx_arrive_barrier(...)/ptx_arrive_cluster_barrier(...)when a backward linear scan reaches TCGEN05/TMEM use after the previous synchronization boundary.Explicit non-goals / exclusions in this PR
before_thread_syncaroundtcgen05_mma_arrive(), because the underlyingtcgen05.commit.*.mbarrieralready provides the producer-side ordering.Changes
src/transform/inject_tcgen05_fence.cctilelang/engine/phase.pyThreadSync("shared")/ThreadSync("shared.dyn")tilelang/transform/__init__.pysrc/target/codegen_cuda.ccsrc/op/builtin.htesting/python/transform/test_tilelang_transform_inject_tcgen05_fence.pyTest plan
cmake -S . -B buildcmake --build build -j 8python -m pytest -q testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.pypython -m pytest -q testing/python/transform/test_tilelang_transform_lower_shared_barrier.pytl::tcgen05_before_thread_sync()/tl::tcgen05_after_thread_sync()throughtilelang.lower(..., target="cuda")Summary by CodeRabbit
New Features
Documentation
Tests
Bug Fixes / Examples