Skip to content

[Transform] Add InjectTcgen05Fence pass#2003

Merged
LeiWang1999 merged 9 commits into
tile-ai:mainfrom
LeiWang1999:inject-tcgen05-fence
Apr 16, 2026
Merged

[Transform] Add InjectTcgen05Fence pass#2003
LeiWang1999 merged 9 commits into
tile-ai:mainfrom
LeiWang1999:inject-tcgen05-fence

Conversation

@LeiWang1999

@LeiWang1999 LeiWang1999 commented Mar 31, 2026

Copy link
Copy Markdown
Member

Summary

  • Add InjectTcgen05Fence, a conservative TIR pass for Blackwell (SM100+) kernels that use TCGEN05/TMEM operations.
  • On Blackwell, regular thread synchronization primitives do not automatically synchronize TMEM state across threads. This pass inserts tcgen05.fence::before_thread_sync / tcgen05.fence::after_thread_sync at conservative handoff boundaries.
  • The current implementation is intentionally narrow. It is a no-op on non-SM100 targets and on functions without TCGEN05/TMEM usage.

Current insertion rules

When the function targets SM100+ and contains TCGEN05/TMEM operations, the pass currently handles three patterns:

  1. Wrap every tvm_storage_sync("shared") / tvm_storage_sync("shared.dyn") as:
    tl::tcgen05_before_thread_sync();
    __syncthreads();
    tl::tcgen05_after_thread_sync();
  2. Insert tcgen05_after_thread_sync() after mbarrier_wait_parity(...) when a forward linear scan reaches TCGEN05/TMEM use before another synchronization boundary.
  3. Insert tcgen05_before_thread_sync() before ptx_arrive_barrier(...) / ptx_arrive_cluster_barrier(...) when a backward linear scan reaches TCGEN05/TMEM use after the previous synchronization boundary.

Explicit non-goals / exclusions in this PR

  • This PR does not claim to infer arbitrary mbarrier protocols.
  • This PR does not add an extra before_thread_sync around tcgen05_mma_arrive(), because the underlying tcgen05.commit.*.mbarrier already provides the producer-side ordering.
  • The SM100 examples still keep manual fences where that makes the handoff points explicit.

Changes

File Description
src/transform/inject_tcgen05_fence.cc Conservative fence injection logic for shared syncs and simple linear wait/use, use/arrive patterns
tilelang/engine/phase.py Run the pass after ThreadSync("shared") / ThreadSync("shared.dyn")
tilelang/transform/__init__.py Python pass registration and docstring
src/target/codegen_cuda.cc Keep the checked CUDA codegen path for TCGEN05 fence intrinsics
src/op/builtin.h Remove duplicate TCGEN05 fence declarations
testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py Regression coverage for insertion scope and non-duplication

Test plan

  • cmake -S . -B build
  • cmake --build build -j 8
  • python -m pytest -q testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py
  • python -m pytest -q testing/python/transform/test_tilelang_transform_lower_shared_barrier.py
  • Verified generated CUDA source still emits tl::tcgen05_before_thread_sync() / tl::tcgen05_after_thread_sync() through tilelang.lower(..., target="cuda")
  • End-to-end correctness/perf validation on Blackwell hardware

Summary by CodeRabbit

  • New Features

    • Automatic conservative TCGEN05 fence insertion for SM100+ targets and new TCGEN05 load/store intrinsics with end-to-end codegen support.
  • Documentation

    • README updated to describe automatic fence insertion behavior, targeted patterns, and limitations.
  • Tests

    • New tests validating fence injection, lowered TCGEN05 intrinsics, barrier-handling behavior, and target-specific no-ops.
  • Bug Fixes / Examples

    • SM100 example kernels cleaned up to remove redundant manual TCGEN05 thread-sync calls.

On Blackwell GPUs, the tcgen05 accumulator (TMEM) resides in a separate
address space that is not synchronized by regular thread barriers like
__syncthreads() or mbarrier. Two PTX fence instructions are required to
ensure cross-thread visibility of TMEM state:

  tcgen05.fence::before_thread_sync  -- flush TMEM before barrier
  tcgen05.fence::after_thread_sync   -- pull TMEM after barrier

This commit introduces the `InjectTcgen05Fence` TIR pass that
automatically wraps every `tvm_storage_sync("shared")` call with the
fence pair when the target is SM100+ and the function uses tcgen05/TMEM
operations.

Changes:
- Define two new TIR intrinsic Ops: `tcgen05_before_thread_sync` and
  `tcgen05_after_thread_sync` (builtin.h/cc)
- Add codegen support to emit `tl::tcgen05_before_thread_sync()` and
  `tl::tcgen05_after_thread_sync()` (codegen_cuda.cc)
- Implement the `InjectTcgen05Fence` pass (inject_tcgen05_fence.cc)
- Register the pass in the Python transform module (__init__.py)
- Insert the pass in OptimizeForTarget after ThreadSync (phase.py)
@github-actions

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@LeiWang1999 LeiWang1999 changed the title [Transform] Add InjectTcgen05Fence pass for Blackwell (SM100+) [Transform] Add InjectTcgen05Fence pass Mar 31, 2026
@coderabbitai

coderabbitai Bot commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds TileLang tcgen05 load/store intrinsics (tl::tcgen05_ld, tl::tcgen05_st), reorders builtin fence registrations, implements a new TIR pass InjectTcgen05Fence to insert tcgen05_before_thread_sync/tcgen05_after_thread_sync around TMEM handoffs on SM100+, updates CUDA/CuTeDSL codegen and TMEM copy lowering to use the new intrinsics, integrates the pass into the pipeline, and adds tests/docs.

Changes

Cohort / File(s) Summary
Intrinsic Declarations
src/op/builtin.h, src/op/builtin.cc
Added exported ops tl::tcgen05_ld() and tl::tcgen05_st() and reorganized registration of tcgen05_before_thread_sync / tcgen05_after_thread_sync.
TMEM Copy Lowering
src/op/copy.cc
LowerTmemCopy now emits tl::tcgen05_ld / tl::tcgen05_st calls with explicit IntImm/Bool args and reordered arguments, replacing stringified extern-name approach.
CUDA Codegen
src/target/codegen_cuda.cc
Handle tl::tcgen05_ld / tl::tcgen05_st in Call visitor: ICHECK 6 args, extract ints/bool, set need_tcgen05_common_h_, and emit templated TileLang extern calls.
CuTeDSL Codegen
src/target/codegen_cutedsl.cc
Added matching CuTeDSL emission for tl::tcgen05_ld / tl::tcgen05_st (arg checks, param extraction, emit tl.tcgen05_* calls).
Fence Injection Pass
src/transform/inject_tcgen05_fence.cc
New tl.transform.InjectTcgen05Fence() pass: scans PrimFunc (SM100+), detects tcgen05/TMEM uses and sync sites, and injects before_thread_sync/after_thread_sync around shared syncs and mbarrier handoffs with duplicate-avoidance.
Pipeline Integration
tilelang/engine/phase.py, tilelang/transform/__init__.py
Expose Python wrapper InjectTcgen05Fence() and invoke it in OptimizeForTarget after ThreadSync passes and before MergeIfStmt().
Tests
testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py
New unit tests covering fence injection behavior, lowering to tl.tcgen05_ld/st, duplicate-avoidance, sync-scan semantics, and non-SM100 no-op behavior.
Examples / Docs
examples/gemm_sm100/README.md, examples/gemm_sm100/*.py
Document the conservative InjectTcgen05Fence pass; removed redundant manual tcgen05 fence calls from GEMM examples.

Sequence Diagram

sequenceDiagram
    rect rgba(200,200,255,0.5)
    participant Pipeline as OptimizeForTarget
    end
    rect rgba(200,255,200,0.5)
    participant Pass as InjectTcgen05Fence
    participant Mutator as StmtExprMutator
    end
    rect rgba(255,200,200,0.5)
    participant TIR as PrimFunc Body / TIR Nodes
    participant Codegen as Lowering / Codegen
    end

    Pipeline->>Pass: run on PrimFunc (target SM100+)
    Pass->>TIR: scan for tcgen05/TMEM uses and sync sites
    TIR-->>Pass: report usage locations
    Pass->>Mutator: traverse SeqStmt/Evaluate nodes
    Mutator->>TIR: detect storage_sync / mbarrier / arrive/wait sites
    Mutator-->>Pass: insert before_thread_sync / after_thread_sync (avoid duplicates)
    Pass->>Pipeline: return transformed PrimFunc
    Pipeline->>Codegen: subsequent lowering emits tl::tcgen05_ld/st and codegen emits templates
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement

Poem

🐰
I hopped through code where fences now dwell,
Threads pass the carrot safe and well.
Loads and stores in tidy rows,
Syncs appear where TMEM flows.
A tiny hop — the kernel grows.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding a new InjectTcgen05Fence transformation pass to the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/op/builtin.h (1)

431-445: Drop the duplicate public declarations.

These two APIs are already declared again at Lines 729-735. Keeping both blocks means the next signature or doc change has two places to update.

✂️ Proposed cleanup
-/*!
- * \brief Emit tcgen05.fence::before_thread_sync on Blackwell (SM100+)
- *
- * tcgen05_before_thread_sync()
- *
- */
-TVM_DLL const Op &tcgen05_before_thread_sync();
-
-/*!
- * \brief Emit tcgen05.fence::after_thread_sync on Blackwell (SM100+)
- *
- * tcgen05_after_thread_sync()
- *
- */
-TVM_DLL const Op &tcgen05_after_thread_sync();
-
 /*!
  * \brief Indicate arrival of warp issuing TMA_STORE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/op/builtin.h` around lines 431 - 445, This file contains duplicate public
declarations for the two ops tcgen05_before_thread_sync() and
tcgen05_after_thread_sync(); remove the earlier duplicate block (the
declarations inside the commented brief around the first occurrence) so there is
a single canonical declaration for each (keep the later declarations at the
other location), ensuring only the unique symbols tcgen05_before_thread_sync and
tcgen05_after_thread_sync remain declared once in the header.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 2034-2039: Remove the duplicate tcgen05 fence handler branches
that check op->op.same_as(tl::tcgen05_before_thread_sync()) and
tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call
print_extern_call_stmt), because they shadow the intended handler that performs
ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later
handler (the one that validates op->args.size()) runs and malformed IR triggers
the ICHECK_EQ failure instead of falling through to bad C++ emission.

In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 109-112: The current gate uses TargetIsSm100 which restricts to
SM100–110; change the check to use TargetHasSMVersionGE(opt_target.value(), 100)
so the pass applies to all SM100+ (Blackwell) targets; specifically, in the
block that gets the target via Optional<Target> opt_target =
f->GetAttr<Target>(tvm::attr::kTarget) and currently calls
TargetIsSm100(opt_target.value()), replace that call with
TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing
opt_target.defined() guard and the early return (return f) behavior.

---

Nitpick comments:
In `@src/op/builtin.h`:
- Around line 431-445: This file contains duplicate public declarations for the
two ops tcgen05_before_thread_sync() and tcgen05_after_thread_sync(); remove the
earlier duplicate block (the declarations inside the commented brief around the
first occurrence) so there is a single canonical declaration for each (keep the
later declarations at the other location), ensuring only the unique symbols
tcgen05_before_thread_sync and tcgen05_after_thread_sync remain declared once in
the header.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6fce64a2-16bb-432b-9c7c-09bd38ee5ca1

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7c214 and fec911c.

📒 Files selected for processing (6)
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/target/codegen_cuda.cc
  • src/transform/inject_tcgen05_fence.cc
  • tilelang/engine/phase.py
  • tilelang/transform/__init__.py

Comment thread src/target/codegen_cuda.cc Outdated
Comment thread src/transform/inject_tcgen05_fence.cc
kurisu6912 and others added 3 commits April 3, 2026 10:59
Extend InjectTcgen05Fence to cover shared storage sync and linear wait/use, use/arrive handoffs on SM100+ kernels, update the surrounding docs, and add transform coverage. Also lower tcgen05 ld/st copies through dedicated intrinsics instead of call_extern strings, with matching CUDA and CuTeDSL codegen support.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py (1)

95-173: Run the new tcgen05_ld/st checks through InjectTcgen05Fence() as well.

These tests stop right after LowerSharedTmem(), but tilelang/engine/phase.py inserts InjectTcgen05Fence() later, after TMEM copies have already been rewritten to direct tl.tcgen05_ld / tl.tcgen05_st ops. Adding one end-to-end transform test here would catch regressions where the fence pass still handles the legacy call_extern("tl::tcgen05_*") forms but misses the new lowered intrinsics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py`
around lines 95 - 173, The tests test_lower_tmem_copy_uses_tcgen05_ld_intrin and
test_lower_tmem_copy_uses_tcgen05_st_intrin stop after LowerSharedTmem() but do
not exercise InjectTcgen05Fence(), so regressions where the fence pass only
handles legacy "tl::tcgen05_*" extern calls can be missed; update both tests to
run the InjectTcgen05Fence() transform (the same transform applied in
tilelang/engine/phase.py) after LowerSharedTmem() — i.e., call
tl.transform.InjectTcgen05Fence()(mod) (or the pipeline wrapper used in
phase.py) on the IRModule before inspecting the body, so the new tl.tcgen05_ld /
tl.tcgen05_st intrinsics are validated end-to-end by the fence pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 101-118: IsTcgen05OrTmemCall currently treats tcgen05_mma_arrive()
as a "use", causing this pass to insert before/after-thread-sync fences around
arrive calls; remove the tcgen05_mma_arrive() check from the predicate so arrive
is not considered a TCGEN05/TMEM use (update both occurrences where
tcgen05_mma_arrive() is listed), leaving other checks (ptx_init_tensor_memory,
ptx_deallocate_tensor_memory, tcgen05_ld/st, ptx_tcgen05_mma_*, and extern-name
prefixes) intact so the sync-boundary logic can correctly detect arrive as a
sync boundary rather than a use.

---

Nitpick comments:
In `@testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py`:
- Around line 95-173: The tests test_lower_tmem_copy_uses_tcgen05_ld_intrin and
test_lower_tmem_copy_uses_tcgen05_st_intrin stop after LowerSharedTmem() but do
not exercise InjectTcgen05Fence(), so regressions where the fence pass only
handles legacy "tl::tcgen05_*" extern calls can be missed; update both tests to
run the InjectTcgen05Fence() transform (the same transform applied in
tilelang/engine/phase.py) after LowerSharedTmem() — i.e., call
tl.transform.InjectTcgen05Fence()(mod) (or the pipeline wrapper used in
phase.py) on the IRModule before inspecting the body, so the new tl.tcgen05_ld /
tl.tcgen05_st intrinsics are validated end-to-end by the fence pass.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 73b18376-200e-4839-862c-b0ddb947c31f

📥 Commits

Reviewing files that changed from the base of the PR and between fec911c and e9400ed.

📒 Files selected for processing (10)
  • examples/gemm_sm100/README.md
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/op/copy.cc
  • src/target/codegen_cuda.cc
  • src/target/codegen_cutedsl.cc
  • src/transform/inject_tcgen05_fence.cc
  • testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py
  • tilelang/engine/phase.py
  • tilelang/transform/__init__.py
✅ Files skipped from review due to trivial changes (1)
  • examples/gemm_sm100/README.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • tilelang/transform/init.py
  • src/op/builtin.h
  • src/target/codegen_cuda.cc
  • src/op/builtin.cc

Comment thread src/transform/inject_tcgen05_fence.cc Outdated
Comment on lines +101 to +118
bool IsTcgen05OrTmemCall(const CallNode *call) {
if (!call || IsBeforeFenceCall(call) || IsAfterFenceCall(call)) {
return false;
}

if (call->op.same_as(ptx_tcgen05_mma_ss()) ||
call->op.same_as(ptx_tcgen05_mma_ts()) ||
call->op.same_as(tcgen05_ld()) ||
call->op.same_as(tcgen05_st()) ||
call->op.same_as(tcgen05_mma_arrive()) ||
call->op.same_as(ptx_init_tensor_memory()) ||
call->op.same_as(ptx_deallocate_tensor_memory())) {
return true;
}

return IsExternNameWithPrefix(call, "tl::tcgen05_ld_") ||
IsExternNameWithPrefix(call, "tl::tcgen05_st_") ||
IsExternNameWithPrefix(call, "tl::tcgen05_cp");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

tcgen05_mma_arrive() currently triggers the very fences this pass says it must not add.

tcgen05_mma_arrive() is counted as a TCGEN05/TMEM use here, and both linear scans check “use” before they check “sync boundary”. That means mbarrier_wait_parity(); tcgen05_mma_arrive(); will get an extra tcgen05_after_thread_sync(), and tcgen05_mma_arrive(); ptx_arrive_barrier(...); will get an extra tcgen05_before_thread_sync(), which contradicts the file header’s stated non-goal.

🛠️ Minimal fix
 bool HasUpcomingTcgen05Use(const Array<Stmt> &seq, int start_index) {
   for (int i = start_index + 1; i < static_cast<int>(seq.size()); ++i) {
     const Stmt &stmt = seq[i];
     if (IsAfterFenceStmt(stmt)) {
       return false;
     }
-    if (StmtUsesTcgen05OrTmem(stmt)) {
-      return true;
-    }
     if (IsBeforeFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) {
       return false;
     }
+    if (StmtUsesTcgen05OrTmem(stmt)) {
+      return true;
+    }
   }
   return false;
 }
 
 bool HasPriorTcgen05Use(const Array<Stmt> &seq, int start_index) {
   for (int i = start_index - 1; i >= 0; --i) {
     const Stmt &stmt = seq[i];
     if (IsBeforeFenceStmt(stmt)) {
       return false;
     }
-    if (StmtUsesTcgen05OrTmem(stmt)) {
-      return true;
-    }
     if (IsAfterFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) {
       return false;
     }
+    if (StmtUsesTcgen05OrTmem(stmt)) {
+      return true;
+    }
   }
   return false;
 }

Also applies to: 148-177

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tcgen05_fence.cc` around lines 101 - 118,
IsTcgen05OrTmemCall currently treats tcgen05_mma_arrive() as a "use", causing
this pass to insert before/after-thread-sync fences around arrive calls; remove
the tcgen05_mma_arrive() check from the predicate so arrive is not considered a
TCGEN05/TMEM use (update both occurrences where tcgen05_mma_arrive() is listed),
leaving other checks (ptx_init_tensor_memory, ptx_deallocate_tensor_memory,
tcgen05_ld/st, ptx_tcgen05_mma_*, and extern-name prefixes) intact so the
sync-boundary logic can correctly detect arrive as a sync boundary rather than a
use.

Remove the hand-written tcgen05 before/after thread-sync fences from the ws and persistent GEMM examples now that InjectTcgen05Fence covers these linear wait/use and use/arrive handoffs.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/transform/inject_tcgen05_fence.cc (1)

102-118: ⚠️ Potential issue | 🟠 Major

Stop classifying tcgen05_mma_arrive() as a TCGEN05/TMEM use.

It is already treated as a sync boundary below, and the linear scans at Lines 148-177 check “use” before “boundary”. With the current predicate, sequences like mbarrier_wait_parity(); tcgen05_mma_arrive(); still get an extra tcgen05_after_thread_sync(), and tcgen05_mma_arrive(); ptx_arrive_barrier(...); still get an extra tcgen05_before_thread_sync(), which contradicts the pass’s stated non-goal.

Minimal fix
 bool IsTcgen05OrTmemCall(const CallNode *call) {
   if (!call || IsBeforeFenceCall(call) || IsAfterFenceCall(call)) {
     return false;
   }
 
   if (call->op.same_as(ptx_tcgen05_mma_ss()) ||
       call->op.same_as(ptx_tcgen05_mma_ts()) ||
-      call->op.same_as(tcgen05_ld()) || call->op.same_as(tcgen05_st()) ||
-      call->op.same_as(tcgen05_mma_arrive()) ||
+      call->op.same_as(tcgen05_ld()) || call->op.same_as(tcgen05_st()) ||
       call->op.same_as(ptx_init_tensor_memory()) ||
       call->op.same_as(ptx_deallocate_tensor_memory())) {
     return true;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tcgen05_fence.cc` around lines 102 - 118, The predicate
IsTcgen05OrTmemCall(...) incorrectly treats tcgen05_mma_arrive() as a
TCGEN05/TMEM use; remove tcgen05_mma_arrive() from the set of ops checked in
IsTcgen05OrTmemCall (i.e., delete the call->op.same_as(tcgen05_mma_arrive())
branch) so that tcgen05_mma_arrive() remains only a sync boundary handled by the
existing boundary logic and the subsequent linear scans and does not trigger
extra before/after thread-sync insertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 102-118: The predicate IsTcgen05OrTmemCall(...) incorrectly treats
tcgen05_mma_arrive() as a TCGEN05/TMEM use; remove tcgen05_mma_arrive() from the
set of ops checked in IsTcgen05OrTmemCall (i.e., delete the
call->op.same_as(tcgen05_mma_arrive()) branch) so that tcgen05_mma_arrive()
remains only a sync boundary handled by the existing boundary logic and the
subsequent linear scans and does not trigger extra before/after thread-sync
insertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3ea21838-cd9a-4bb4-bec1-a43c470203a0

📥 Commits

Reviewing files that changed from the base of the PR and between 0c9352b and 1f09386.

📒 Files selected for processing (2)
  • src/op/copy.cc
  • src/transform/inject_tcgen05_fence.cc

Simplify InjectTcgen05Fence to recognize only the lowered tcgen05 intrinsics seen after LowerTileOp, and update the transform tests to construct tcgen05_ld directly instead of relying on legacy call_extern strings.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/transform/inject_tcgen05_fence.cc (1)

97-100: ⚠️ Potential issue | 🟠 Major

tcgen05_mma_arrive() can still trigger unintended fence insertion due to scan order.

tcgen05_mma_arrive() is classified as both a use (Line 100) and a sync boundary (Lines 126-130), but both scans check “use” before “boundary” (Line 138 before Line 141, and Line 154 before Line 157). That makes mbarrier_wait_parity(); tcgen05_mma_arrive(); look like an upcoming use and tcgen05_mma_arrive(); ptx_arrive_barrier(...); look like prior use, which contradicts the pass non-goal.

🔧 Minimal fix (prefer boundary check before use)
 bool HasUpcomingTcgen05Use(const Array<Stmt>& seq, int start_index) {
   for (int i = start_index + 1; i < static_cast<int>(seq.size()); ++i) {
     const Stmt& stmt = seq[i];
     if (IsAfterFenceStmt(stmt)) {
       return false;
     }
-    if (StmtUsesTcgen05OrTmem(stmt)) {
-      return true;
-    }
     if (IsBeforeFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) {
       return false;
     }
+    if (StmtUsesTcgen05OrTmem(stmt)) {
+      return true;
+    }
   }
   return false;
 }

 bool HasPriorTcgen05Use(const Array<Stmt>& seq, int start_index) {
   for (int i = start_index - 1; i >= 0; --i) {
     const Stmt& stmt = seq[i];
     if (IsBeforeFenceStmt(stmt)) {
       return false;
     }
-    if (StmtUsesTcgen05OrTmem(stmt)) {
-      return true;
-    }
     if (IsAfterFenceStmt(stmt) || IsFenceSyncBoundary(GetEvaluateCall(stmt))) {
       return false;
     }
+    if (StmtUsesTcgen05OrTmem(stmt)) {
+      return true;
+    }
   }
   return false;
 }

Also applies to: 132-142, 148-158

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tcgen05_fence.cc` around lines 97 - 100, The scan logic
currently classifies tcgen05_mma_arrive() as a "use" before it checks for sync
boundaries, causing fence insertion mistakes; change the scan order so boundary
checks run before use checks (i.e., ensure the predicate that tests for sync
boundaries includes call->op.same_as(tcgen05_mma_arrive()) or that the
call->op.same_as(tcgen05_mma_arrive()) entry is removed from the early "use"
list and placed/checked in the boundary list) for both scan passes that
currently test "use" prior to "boundary" (the blocks that enumerate
ptx_tcgen05_mma_ss(), ptx_tcgen05_mma_ts(), tcgen05_ld(), tcgen05_st(),
tcgen05_mma_arrive(), etc. and the corresponding later boundary-check block);
apply the same reordering to the other similar scan blocks in the file to
prevent mbarrier_wait_parity(); tcgen05_mma_arrive(); and tcgen05_mma_arrive();
ptx_arrive_barrier(...) from being misclassified.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 97-100: The scan logic currently classifies tcgen05_mma_arrive()
as a "use" before it checks for sync boundaries, causing fence insertion
mistakes; change the scan order so boundary checks run before use checks (i.e.,
ensure the predicate that tests for sync boundaries includes
call->op.same_as(tcgen05_mma_arrive()) or that the
call->op.same_as(tcgen05_mma_arrive()) entry is removed from the early "use"
list and placed/checked in the boundary list) for both scan passes that
currently test "use" prior to "boundary" (the blocks that enumerate
ptx_tcgen05_mma_ss(), ptx_tcgen05_mma_ts(), tcgen05_ld(), tcgen05_st(),
tcgen05_mma_arrive(), etc. and the corresponding later boundary-check block);
apply the same reordering to the other similar scan blocks in the file to
prevent mbarrier_wait_parity(); tcgen05_mma_arrive(); and tcgen05_mma_arrive();
ptx_arrive_barrier(...) from being misclassified.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d50c5bce-87d6-4f80-8d23-790744d77f2c

📥 Commits

Reviewing files that changed from the base of the PR and between 1f09386 and dfbe9a1.

📒 Files selected for processing (2)
  • src/transform/inject_tcgen05_fence.cc
  • testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py
✅ Files skipped from review due to trivial changes (1)
  • testing/python/transform/test_tilelang_transform_inject_tcgen05_fence.py

@LeiWang1999 LeiWang1999 merged commit 891109e into tile-ai:main Apr 16, 2026
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants