[Transform] Add InjectTcgen05Fence pass by LeiWang1999 · Pull Request #2003 · tile-ai/tilelang

LeiWang1999 · 2026-03-31T07:45:47Z

Summary

Add InjectTcgen05Fence, a conservative TIR pass for Blackwell (SM100+) kernels that use TCGEN05/TMEM operations.
On Blackwell, regular thread synchronization primitives do not automatically synchronize TMEM state across threads. This pass inserts tcgen05.fence::before_thread_sync / tcgen05.fence::after_thread_sync at conservative handoff boundaries.
The current implementation is intentionally narrow. It is a no-op on non-SM100 targets and on functions without TCGEN05/TMEM usage.

Current insertion rules

When the function targets SM100+ and contains TCGEN05/TMEM operations, the pass currently handles three patterns:

Wrap every tvm_storage_sync("shared") / tvm_storage_sync("shared.dyn") as:

tl::tcgen05_before_thread_sync();
__syncthreads();
tl::tcgen05_after_thread_sync();

Insert tcgen05_after_thread_sync() after mbarrier_wait_parity(...) when a forward linear scan reaches TCGEN05/TMEM use before another synchronization boundary.
Insert tcgen05_before_thread_sync() before ptx_arrive_barrier(...) / ptx_arrive_cluster_barrier(...) when a backward linear scan reaches TCGEN05/TMEM use after the previous synchronization boundary.

Explicit non-goals / exclusions in this PR

This PR does not claim to infer arbitrary mbarrier protocols.
This PR does not add an extra before_thread_sync around tcgen05_mma_arrive(), because the underlying tcgen05.commit.*.mbarrier already provides the producer-side ordering.
The SM100 examples still keep manual fences where that makes the handoff points explicit.

Changes

File	Description
`src/transform/inject_tcgen05_fence.cc`	Conservative fence injection logic for shared syncs and simple linear wait/use, use/arrive patterns
`tilelang/engine/phase.py`	Run the pass after `ThreadSync("shared")` / `ThreadSync("shared.dyn")`
`tilelang/transform/__init__.py`	Python pass registration and docstring
`src/target/codegen_cuda.cc`	Keep the checked CUDA codegen path for TCGEN05 fence intrinsics
`src/op/builtin.h`	Remove duplicate TCGEN05 fence declarations
`testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py`	Regression coverage for insertion scope and non-duplication

Test plan

cmake -S . -B build
cmake --build build -j 8
python -m pytest -q testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py
python -m pytest -q testing/python/transform/test_tilelang_transform_lower_shared_barrier.py
Verified generated CUDA source still emits tl::tcgen05_before_thread_sync() / tl::tcgen05_after_thread_sync() through tilelang.lower(..., target="cuda")
End-to-end correctness/perf validation on Blackwell hardware

Summary by CodeRabbit

New Features
- Automatic conservative TCGEN05 fence insertion for SM100+ targets and new TCGEN05 load/store intrinsics with end-to-end codegen support.
Documentation
- README updated to describe automatic fence insertion behavior, targeted patterns, and limitations.
Tests
- New tests validating fence injection, lowered TCGEN05 intrinsics, barrier-handling behavior, and target-specific no-ops.
Bug Fixes / Examples
- SM100 example kernels cleaned up to remove redundant manual TCGEN05 thread-sync calls.

On Blackwell GPUs, the tcgen05 accumulator (TMEM) resides in a separate address space that is not synchronized by regular thread barriers like __syncthreads() or mbarrier. Two PTX fence instructions are required to ensure cross-thread visibility of TMEM state: tcgen05.fence::before_thread_sync -- flush TMEM before barrier tcgen05.fence::after_thread_sync -- pull TMEM after barrier This commit introduces the `InjectTcgen05Fence` TIR pass that automatically wraps every `tvm_storage_sync("shared")` call with the fence pair when the target is SM100+ and the function uses tcgen05/TMEM operations. Changes: - Define two new TIR intrinsic Ops: `tcgen05_before_thread_sync` and `tcgen05_after_thread_sync` (builtin.h/cc) - Add codegen support to emit `tl::tcgen05_before_thread_sync()` and `tl::tcgen05_after_thread_sync()` (codegen_cuda.cc) - Implement the `InjectTcgen05Fence` pass (inject_tcgen05_fence.cc) - Register the pass in the Python transform module (__init__.py) - Insert the pass in OptimizeForTarget after ThreadSync (phase.py)

github-actions · 2026-03-31T07:45:57Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

…t-tcgen05-fence

coderabbitai · 2026-03-31T09:44:44Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds TileLang tcgen05 load/store intrinsics (tl::tcgen05_ld, tl::tcgen05_st), reorders builtin fence registrations, implements a new TIR pass InjectTcgen05Fence to insert tcgen05_before_thread_sync/tcgen05_after_thread_sync around TMEM handoffs on SM100+, updates CUDA/CuTeDSL codegen and TMEM copy lowering to use the new intrinsics, integrates the pass into the pipeline, and adds tests/docs.

Changes

Cohort / File(s)	Summary
Intrinsic Declarations `src/op/builtin.h`, `src/op/builtin.cc`	Added exported ops `tl::tcgen05_ld()` and `tl::tcgen05_st()` and reorganized registration of `tcgen05_before_thread_sync` / `tcgen05_after_thread_sync`.
TMEM Copy Lowering `src/op/copy.cc`	LowerTmemCopy now emits `tl::tcgen05_ld` / `tl::tcgen05_st` calls with explicit `IntImm`/`Bool` args and reordered arguments, replacing stringified extern-name approach.
CUDA Codegen `src/target/codegen_cuda.cc`	Handle `tl::tcgen05_ld` / `tl::tcgen05_st` in Call visitor: ICHECK 6 args, extract ints/bool, set `need_tcgen05_common_h_`, and emit templated TileLang extern calls.
CuTeDSL Codegen `src/target/codegen_cutedsl.cc`	Added matching CuTeDSL emission for `tl::tcgen05_ld` / `tl::tcgen05_st` (arg checks, param extraction, emit `tl.tcgen05_*` calls).
Fence Injection Pass `src/transform/inject_tcgen05_fence.cc`	New `tl.transform.InjectTcgen05Fence()` pass: scans PrimFunc (SM100+), detects tcgen05/TMEM uses and sync sites, and injects `before_thread_sync`/`after_thread_sync` around shared syncs and mbarrier handoffs with duplicate-avoidance.
Pipeline Integration `tilelang/engine/phase.py`, `tilelang/transform/__init__.py`	Expose Python wrapper `InjectTcgen05Fence()` and invoke it in OptimizeForTarget after ThreadSync passes and before MergeIfStmt().
Tests `testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py`	New unit tests covering fence injection behavior, lowering to `tl.tcgen05_ld`/`st`, duplicate-avoidance, sync-scan semantics, and non-SM100 no-op behavior.
Examples / Docs `examples/gemm_sm100/README.md`, `examples/gemm_sm100/*.py`	Document the conservative InjectTcgen05Fence pass; removed redundant manual tcgen05 fence calls from GEMM examples.

Sequence Diagram

sequenceDiagram
    rect rgba(200,200,255,0.5)
    participant Pipeline as OptimizeForTarget
    end
    rect rgba(200,255,200,0.5)
    participant Pass as InjectTcgen05Fence
    participant Mutator as StmtExprMutator
    end
    rect rgba(255,200,200,0.5)
    participant TIR as PrimFunc Body / TIR Nodes
    participant Codegen as Lowering / Codegen
    end

    Pipeline->>Pass: run on PrimFunc (target SM100+)
    Pass->>TIR: scan for tcgen05/TMEM uses and sync sites
    TIR-->>Pass: report usage locations
    Pass->>Mutator: traverse SeqStmt/Evaluate nodes
    Mutator->>TIR: detect storage_sync / mbarrier / arrive/wait sites
    Mutator-->>Pass: insert before_thread_sync / after_thread_sync (avoid duplicates)
    Pass->>Pipeline: return transformed PrimFunc
    Pipeline->>Codegen: subsequent lowering emits tl::tcgen05_ld/st and codegen emits templates

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

[Refactor][Pipeline] Run pipeline rewriting before layout inference and stabilize tiled WS #2002 — touches src/op/copy.cc; related TCGEN05 lowering and TMEM copy changes.
[BugFix] Add missing fences in GEMM SM100 examples and canonicalize the order of blockIdx #1980 — modifies tcgen05 fence intrinsics and builtin registrations.
[Example] Flash Attention SM100 #1910 — overlaps on CopyNode::LowerTmemCopy and tcgen05 lowering semantics.

Suggested labels

enhancement

Poem

🐰
I hopped through code where fences now dwell,
Threads pass the carrot safe and well.
Loads and stores in tidy rows,
Syncs appear where TMEM flows.
A tiny hop — the kernel grows.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding a new InjectTcgen05Fence transformation pass to the codebase.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/op/builtin.h (1)

431-445: Drop the duplicate public declarations.

These two APIs are already declared again at Lines 729-735. Keeping both blocks means the next signature or doc change has two places to update.

✂️ Proposed cleanup

-/*!
- * \brief Emit tcgen05.fence::before_thread_sync on Blackwell (SM100+)
- *
- * tcgen05_before_thread_sync()
- *
- */
-TVM_DLL const Op &tcgen05_before_thread_sync();
-
-/*!
- * \brief Emit tcgen05.fence::after_thread_sync on Blackwell (SM100+)
- *
- * tcgen05_after_thread_sync()
- *
- */
-TVM_DLL const Op &tcgen05_after_thread_sync();
-
 /*!
  * \brief Indicate arrival of warp issuing TMA_STORE

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/op/builtin.h` around lines 431 - 445, This file contains duplicate public
declarations for the two ops tcgen05_before_thread_sync() and
tcgen05_after_thread_sync(); remove the earlier duplicate block (the
declarations inside the commented brief around the first occurrence) so there is
a single canonical declaration for each (keep the later declarations at the
other location), ensuring only the unique symbols tcgen05_before_thread_sync and
tcgen05_after_thread_sync remain declared once in the header.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 2034-2039: Remove the duplicate tcgen05 fence handler branches
that check op->op.same_as(tl::tcgen05_before_thread_sync()) and
tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call
print_extern_call_stmt), because they shadow the intended handler that performs
ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later
handler (the one that validates op->args.size()) runs and malformed IR triggers
the ICHECK_EQ failure instead of falling through to bad C++ emission.

In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 109-112: The current gate uses TargetIsSm100 which restricts to
SM100–110; change the check to use TargetHasSMVersionGE(opt_target.value(), 100)
so the pass applies to all SM100+ (Blackwell) targets; specifically, in the
block that gets the target via Optional<Target> opt_target =
f->GetAttr<Target>(tvm::attr::kTarget) and currently calls
TargetIsSm100(opt_target.value()), replace that call with
TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing
opt_target.defined() guard and the early return (return f) behavior.

---

Nitpick comments:
In `@src/op/builtin.h`:
- Around line 431-445: This file contains duplicate public declarations for the
two ops tcgen05_before_thread_sync() and tcgen05_after_thread_sync(); remove the
earlier duplicate block (the declarations inside the commented brief around the
first occurrence) so there is a single canonical declaration for each (keep the
later declarations at the other location), ensuring only the unique symbols
tcgen05_before_thread_sync and tcgen05_after_thread_sync remain declared once in
the header.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6fce64a2-16bb-432b-9c7c-09bd38ee5ca1

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7c214 and fec911c.

📒 Files selected for processing (6)

src/op/builtin.cc
src/op/builtin.h
src/target/codegen_cuda.cc
src/transform/inject_tcgen05_fence.cc
tilelang/engine/phase.py
tilelang/transform/__init__.py

Extend InjectTcgen05Fence to cover shared storage sync and linear wait/use, use/arrive handoffs on SM100+ kernels, update the surrounding docs, and add transform coverage. Also lower tcgen05 ld/st copies through dedicated intrinsics instead of call_extern strings, with matching CUDA and CuTeDSL codegen support.

…t-tcgen05-fence

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py (1)
95-173: Run the new tcgen05_ld/st checks through InjectTcgen05Fence() as well.

These tests stop right after LowerSharedTmem(), but tilelang/engine/phase.py inserts InjectTcgen05Fence() later, after TMEM copies have already been rewritten to direct tl.tcgen05_ld / tl.tcgen05_st ops. Adding one end-to-end transform test here would catch regressions where the fence pass still handles the legacy call_extern("tl::tcgen05_*") forms but misses the new lowered intrinsics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py`
around lines 95 - 173, The tests test_lower_tmem_copy_uses_tcgen05_ld_intrin and
test_lower_tmem_copy_uses_tcgen05_st_intrin stop after LowerSharedTmem() but do
not exercise InjectTcgen05Fence(), so regressions where the fence pass only
handles legacy "tl::tcgen05_*" extern calls can be missed; update both tests to
run the InjectTcgen05Fence() transform (the same transform applied in
tilelang/engine/phase.py) after LowerSharedTmem() — i.e., call
tl.transform.InjectTcgen05Fence()(mod) (or the pipeline wrapper used in
phase.py) on the IRModule before inspecting the body, so the new tl.tcgen05_ld /
tl.tcgen05_st intrinsics are validated end-to-end by the fence pass.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 101-118: IsTcgen05OrTmemCall currently treats tcgen05_mma_arrive()
as a "use", causing this pass to insert before/after-thread-sync fences around
arrive calls; remove the tcgen05_mma_arrive() check from the predicate so arrive
is not considered a TCGEN05/TMEM use (update both occurrences where
tcgen05_mma_arrive() is listed), leaving other checks (ptx_init_tensor_memory,
ptx_deallocate_tensor_memory, tcgen05_ld/st, ptx_tcgen05_mma_*, and extern-name
prefixes) intact so the sync-boundary logic can correctly detect arrive as a
sync boundary rather than a use.

---

Nitpick comments:
In `@testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py`:
- Around line 95-173: The tests test_lower_tmem_copy_uses_tcgen05_ld_intrin and
test_lower_tmem_copy_uses_tcgen05_st_intrin stop after LowerSharedTmem() but do
not exercise InjectTcgen05Fence(), so regressions where the fence pass only
handles legacy "tl::tcgen05_*" extern calls can be missed; update both tests to
run the InjectTcgen05Fence() transform (the same transform applied in
tilelang/engine/phase.py) after LowerSharedTmem() — i.e., call
tl.transform.InjectTcgen05Fence()(mod) (or the pipeline wrapper used in
phase.py) on the IRModule before inspecting the body, so the new tl.tcgen05_ld /
tl.tcgen05_st intrinsics are validated end-to-end by the fence pass.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 73b18376-200e-4839-862c-b0ddb947c31f

📥 Commits

Reviewing files that changed from the base of the PR and between fec911c and e9400ed.

📒 Files selected for processing (10)

examples/gemm_sm100/README.md
src/op/builtin.cc
src/op/builtin.h
src/op/copy.cc
src/target/codegen_cuda.cc
src/target/codegen_cutedsl.cc
src/transform/inject_tcgen05_fence.cc
testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py
tilelang/engine/phase.py
tilelang/transform/__init__.py

✅ Files skipped from review due to trivial changes (1)

examples/gemm_sm100/README.md

🚧 Files skipped from review as they are similar to previous changes (4)

tilelang/transform/init.py
src/op/builtin.h
src/target/codegen_cuda.cc
src/op/builtin.cc

coderabbitai · 2026-04-15T05:14:04Z

+bool IsTcgen05OrTmemCall(const CallNode *call) {
+  if (!call || IsBeforeFenceCall(call) || IsAfterFenceCall(call)) {
+    return false;
+  }
+
+  if (call->op.same_as(ptx_tcgen05_mma_ss()) ||
+      call->op.same_as(ptx_tcgen05_mma_ts()) ||
+      call->op.same_as(tcgen05_ld()) ||
+      call->op.same_as(tcgen05_st()) ||
+      call->op.same_as(tcgen05_mma_arrive()) ||
+      call->op.same_as(ptx_init_tensor_memory()) ||
+      call->op.same_as(ptx_deallocate_tensor_memory())) {
+    return true;
+  }
+
+  return IsExternNameWithPrefix(call, "tl::tcgen05_ld_") ||
+         IsExternNameWithPrefix(call, "tl::tcgen05_st_") ||
+         IsExternNameWithPrefix(call, "tl::tcgen05_cp");


⚠️ Potential issue | 🟠 Major

tcgen05_mma_arrive() currently triggers the very fences this pass says it must not add.

tcgen05_mma_arrive() is counted as a TCGEN05/TMEM use here, and both linear scans check “use” before they check “sync boundary”. That means mbarrier_wait_parity(); tcgen05_mma_arrive(); will get an extra tcgen05_after_thread_sync(), and tcgen05_mma_arrive(); ptx_arrive_barrier(...); will get an extra tcgen05_before_thread_sync(), which contradicts the file header’s stated non-goal.

🛠️ Minimal fix

bool HasUpcomingTcgen05Use(const Array<Stmt> &seq, int start_index) { for (int i = start_index + 1; i < static_cast<int>(seq.size()); ++i) { const Stmt &stmt = seq[i]; if (IsAfterFenceStmt(stmt)) { return false; } - if (StmtUsesTcgen05OrTmem(stmt)) { - return true; - } if (IsBeforeFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) { return false; } + if (StmtUsesTcgen05OrTmem(stmt)) { + return true; + } } return false; } bool HasPriorTcgen05Use(const Array<Stmt> &seq, int start_index) { for (int i = start_index - 1; i >= 0; --i) { const Stmt &stmt = seq[i]; if (IsBeforeFenceStmt(stmt)) { return false; } - if (StmtUsesTcgen05OrTmem(stmt)) { - return true; - } if (IsAfterFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) { return false; } + if (StmtUsesTcgen05OrTmem(stmt)) { + return true; + } } return false; }

Also applies to: 148-177

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/transform/inject_tcgen05_fence.cc` around lines 101 - 118, IsTcgen05OrTmemCall currently treats tcgen05_mma_arrive() as a "use", causing this pass to insert before/after-thread-sync fences around arrive calls; remove the tcgen05_mma_arrive() check from the predicate so arrive is not considered a TCGEN05/TMEM use (update both occurrences where tcgen05_mma_arrive() is listed), leaving other checks (ptx_init_tensor_memory, ptx_deallocate_tensor_memory, tcgen05_ld/st, ptx_tcgen05_mma_*, and extern-name prefixes) intact so the sync-boundary logic can correctly detect arrive as a sync boundary rather than a use.

Remove the hand-written tcgen05 before/after thread-sync fences from the ws and persistent GEMM examples now that InjectTcgen05Fence covers these linear wait/use and use/arrive handoffs.

coderabbitai

♻️ Duplicate comments (1)

src/transform/inject_tcgen05_fence.cc (1)
102-118: ⚠️ Potential issue | 🟠 Major

Stop classifying tcgen05_mma_arrive() as a TCGEN05/TMEM use.

It is already treated as a sync boundary below, and the linear scans at Lines 148-177 check “use” before “boundary”. With the current predicate, sequences like mbarrier_wait_parity(); tcgen05_mma_arrive(); still get an extra tcgen05_after_thread_sync(), and tcgen05_mma_arrive(); ptx_arrive_barrier(...); still get an extra tcgen05_before_thread_sync(), which contradicts the pass’s stated non-goal.
Minimal fix
 bool IsTcgen05OrTmemCall(const CallNode *call) {
   if (!call || IsBeforeFenceCall(call) || IsAfterFenceCall(call)) {
     return false;
   }
 
   if (call->op.same_as(ptx_tcgen05_mma_ss()) ||
       call->op.same_as(ptx_tcgen05_mma_ts()) ||
-      call->op.same_as(tcgen05_ld()) || call->op.same_as(tcgen05_st()) ||
-      call->op.same_as(tcgen05_mma_arrive()) ||
+      call->op.same_as(tcgen05_ld()) || call->op.same_as(tcgen05_st()) ||
       call->op.same_as(ptx_init_tensor_memory()) ||
       call->op.same_as(ptx_deallocate_tensor_memory())) {
     return true;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tcgen05_fence.cc` around lines 102 - 118, The predicate
IsTcgen05OrTmemCall(...) incorrectly treats tcgen05_mma_arrive() as a
TCGEN05/TMEM use; remove tcgen05_mma_arrive() from the set of ops checked in
IsTcgen05OrTmemCall (i.e., delete the call->op.same_as(tcgen05_mma_arrive())
branch) so that tcgen05_mma_arrive() remains only a sync boundary handled by the
existing boundary logic and the subsequent linear scans and does not trigger
extra before/after thread-sync insertions.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 102-118: The predicate IsTcgen05OrTmemCall(...) incorrectly treats
tcgen05_mma_arrive() as a TCGEN05/TMEM use; remove tcgen05_mma_arrive() from the
set of ops checked in IsTcgen05OrTmemCall (i.e., delete the
call->op.same_as(tcgen05_mma_arrive()) branch) so that tcgen05_mma_arrive()
remains only a sync boundary handled by the existing boundary logic and the
subsequent linear scans and does not trigger extra before/after thread-sync
insertions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3ea21838-cd9a-4bb4-bec1-a43c470203a0

📥 Commits

Reviewing files that changed from the base of the PR and between 0c9352b and 1f09386.

📒 Files selected for processing (2)

src/op/copy.cc
src/transform/inject_tcgen05_fence.cc

Simplify InjectTcgen05Fence to recognize only the lowered tcgen05 intrinsics seen after LowerTileOp, and update the transform tests to construct tcgen05_ld directly instead of relying on legacy call_extern strings.

coderabbitai

♻️ Duplicate comments (1)

src/transform/inject_tcgen05_fence.cc (1)

97-100: ⚠️ Potential issue | 🟠 Major

tcgen05_mma_arrive() can still trigger unintended fence insertion due to scan order.

tcgen05_mma_arrive() is classified as both a use (Line 100) and a sync boundary (Lines 126-130), but both scans check “use” before “boundary” (Line 138 before Line 141, and Line 154 before Line 157). That makes mbarrier_wait_parity(); tcgen05_mma_arrive(); look like an upcoming use and tcgen05_mma_arrive(); ptx_arrive_barrier(...); look like prior use, which contradicts the pass non-goal.

🔧 Minimal fix (prefer boundary check before use)

 bool HasUpcomingTcgen05Use(const Array<Stmt>& seq, int start_index) {
   for (int i = start_index + 1; i < static_cast<int>(seq.size()); ++i) {
     const Stmt& stmt = seq[i];
     if (IsAfterFenceStmt(stmt)) {
       return false;
     }
-    if (StmtUsesTcgen05OrTmem(stmt)) {
-      return true;
-    }
     if (IsBeforeFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) {
       return false;
     }
+    if (StmtUsesTcgen05OrTmem(stmt)) {
+      return true;
+    }
   }
   return false;
 }

 bool HasPriorTcgen05Use(const Array<Stmt>& seq, int start_index) {
   for (int i = start_index - 1; i >= 0; --i) {
     const Stmt& stmt = seq[i];
     if (IsBeforeFenceStmt(stmt)) {
       return false;
     }
-    if (StmtUsesTcgen05OrTmem(stmt)) {
-      return true;
-    }
     if (IsAfterFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) {
       return false;
     }
+    if (StmtUsesTcgen05OrTmem(stmt)) {
+      return true;
+    }
   }
   return false;
 }

Also applies to: 132-142, 148-158

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tcgen05_fence.cc` around lines 97 - 100, The scan logic
currently classifies tcgen05_mma_arrive() as a "use" before it checks for sync
boundaries, causing fence insertion mistakes; change the scan order so boundary
checks run before use checks (i.e., ensure the predicate that tests for sync
boundaries includes call->op.same_as(tcgen05_mma_arrive()) or that the
call->op.same_as(tcgen05_mma_arrive()) entry is removed from the early "use"
list and placed/checked in the boundary list) for both scan passes that
currently test "use" prior to "boundary" (the blocks that enumerate
ptx_tcgen05_mma_ss(), ptx_tcgen05_mma_ts(), tcgen05_ld(), tcgen05_st(),
tcgen05_mma_arrive(), etc. and the corresponding later boundary-check block);
apply the same reordering to the other similar scan blocks in the file to
prevent mbarrier_wait_parity(); tcgen05_mma_arrive(); and tcgen05_mma_arrive();
ptx_arrive_barrier(...) from being misclassified.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 97-100: The scan logic currently classifies tcgen05_mma_arrive()
as a "use" before it checks for sync boundaries, causing fence insertion
mistakes; change the scan order so boundary checks run before use checks (i.e.,
ensure the predicate that tests for sync boundaries includes
call->op.same_as(tcgen05_mma_arrive()) or that the
call->op.same_as(tcgen05_mma_arrive()) entry is removed from the early "use"
list and placed/checked in the boundary list) for both scan passes that
currently test "use" prior to "boundary" (the blocks that enumerate
ptx_tcgen05_mma_ss(), ptx_tcgen05_mma_ts(), tcgen05_ld(), tcgen05_st(),
tcgen05_mma_arrive(), etc. and the corresponding later boundary-check block);
apply the same reordering to the other similar scan blocks in the file to
prevent mbarrier_wait_parity(); tcgen05_mma_arrive(); and tcgen05_mma_arrive();
ptx_arrive_barrier(...) from being misclassified.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d50c5bce-87d6-4f80-8d23-790744d77f2c

📥 Commits

Reviewing files that changed from the base of the PR and between 1f09386 and dfbe9a1.

📒 Files selected for processing (2)

src/transform/inject_tcgen05_fence.cc
testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py

✅ Files skipped from review due to trivial changes (1)

testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py

LeiWang1999 changed the title ~~[Transform] Add InjectTcgen05Fence pass for Blackwell (SM100+)~~ [Transform] Add InjectTcgen05Fence pass Mar 31, 2026

LeiWang1999 added 2 commits March 31, 2026 17:27

Merge branch 'main' of https://github.com/tile-ai/tilelang into injec…

0617294

…t-tcgen05-fence

fix

fec911c

coderabbitai Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread src/target/codegen_cuda.cc Outdated

Comment thread src/transform/inject_tcgen05_fence.cc

kurisu6912 and others added 3 commits April 3, 2026 10:59

Merge remote-tracking branch 'origin/main' into inject-tcgen05-fence

3d4a8ae

Merge branch 'main' of https://github.com/tile-ai/tilelang into injec…

e9400ed

…t-tcgen05-fence

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

LeiWang1999 added 2 commits April 15, 2026 13:41

Use tcgen05 fence injection in SM100 examples

0c9352b

Remove the hand-written tcgen05 before/after thread-sync fences from the ws and persistent GEMM examples now that InjectTcgen05Fence covers these linear wait/use and use/arrive handoffs.

lint fix

1f09386

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

Drop tcgen05 call_extern fence compatibility

dfbe9a1

Simplify InjectTcgen05Fence to recognize only the lowered tcgen05 intrinsics seen after LowerTileOp, and update the transform tests to construct tcgen05_ld directly instead of relying on legacy call_extern strings.

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

LeiWang1999 merged commit 891109e into tile-ai:main Apr 16, 2026
5 of 6 checks passed

This was referenced May 26, 2026

[Backend] Refactor Transform Pipeline to support different backends #2189

Merged

[Backend][Transform] Move backend-specific transforms into separate namespaces #2297

Merged

coderabbitai Bot mentioned this pull request Jun 4, 2026

[Feat] Add Blackwell features and support FA4 #2332

Closed

Conversation

LeiWang1999 commented Mar 31, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Current insertion rules

Explicit non-goals / exclusions in this PR

Changes

Test plan

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Mar 31, 2026

Uh oh!

coderabbitai Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LeiWang1999 commented Mar 31, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 31, 2026 •

edited

Loading