
[Pipeline] Refactor software pipeline transforms#2245

Merged. LeiWang1999 merged 6 commits into tile-ai:main from LeiWang1999:transform/pipeline-pass-cleanup on May 26, 2026.

LeiWang1999 (Member) commented May 22, 2026

Summary

  • Split shared software pipeline analysis, barrier, helper, and rewrite code into focused src/transform/pipeline/ modules.
  • Keep the public pipeline passes in pipeline_planning.cc and inject_pipeline.cc, while removing legacy raw PTX async planning paths.
  • Move IfStmtBinding before pipeline planning so the pipeline passes operate on canonical high-level TileOP bodies.

Changes

  • Removed explicit ptx_cp_async / commit / wait scheduling support from pipeline planning and access analysis; pipeline planning now reasons over TileOP copies and generated async annotations.
  • Consolidated InjectSoftwarePipeline implementation back into inject_pipeline.cc and deleted the temporary injector shim.
  • Added reusable pipeline helpers for access analysis, body analysis, barrier handling, stage analysis, and rewriting.
  • Tightened pipeline body invariants so inject expects the SeqStmt body produced by pipeline planning.
  • Filtered loop-local buffer declarations out of pipeline scheduling annotations while still registering them for injection.
  • Updated tests for high-level TMA / WGMMA / TCGEN05 planning cases and async producer annotations.

Validation

  • pre-commit run --all-files
  • git diff --check
  • cmake --build build -j$(nproc)
  • PYTHONPATH=$(pwd):$PYTHONPATH python -m pytest testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py testing/python/transform/test_tilelang_transform_pipeline_planning.py testing/python/language/test_tilelang_language_tma_copy.py -q

Notes

  • Local uncommitted changes in examples/quickstart.py and docs/compiler_internals/async_mbarrier_dependency_analysis.md were intentionally left out of this PR.

Summary by CodeRabbit

  • New Features

    • Full software-pipeline rewriter: buffer versioning, scalar-bind replay, and improved TMA copy synchronization.
  • Improvements

    • Multi-version barrier expansion with parity-aware waits and TMA-aware wiring.
    • Richer per-stage analysis (reads/writes/scalar deps), replayability checks, and body normalization.
    • Configurable inlining of replayable binds in conditional binding pass.
  • Tests

    • Expanded pipeline/TMA tests and new IfStmtBinding coverage; pipeline-planning tests now target-aware.

Review Change Stack

@github-actions


👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai

coderabbitai Bot commented May 22, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Splits software-pipeline planning/rewriting into analyzers (access, body, stage), barrier processing, helpers, and a buffer-versioning rewriter; refactors PipelinePlanner to delegate to these components; updates pass ordering, pass-config, tests, and build sources.

Changes

Software Pipeline Transform Refactor

| Layer | File(s) | Summary |
| --- | --- | --- |
| Access pattern analysis and buffer region collection | src/transform/pipeline/access_analysis.h, src/transform/pipeline/access_analysis.cc | MayConflict checks region dimensionality/intersection; BufferRegionCollector traverses IR to collect per-buffer read/write regions and detect global-copy/TMA-copy patterns and non-copy tile ops. |
| Statement scheduling and replayability analysis | src/transform/pipeline/body_analysis.h | PipelinePlanningBodyAnalyzer flattens SeqStmts, computes per-statement read/write regions, aggregates pipeline-write buffers, and computes replayable scalar-bind masks and scheduled/stage mappings. |
| Pipeline utilities for buffer management and IR annotation | src/transform/pipeline/helpers.h, src/transform/pipeline/helpers.cc | Defines PipelineAnnotation/PipelineInfo; helpers to collect buffer usage, detect replayability, expand layout maps for remapped allocs, annotate SIMT producers and mbar phases, and lower async commit/wait attributes. |
| Per-stage metadata and copy classification | src/transform/pipeline/stage_analysis.h | PipelineStageInfo and PipelineStageAnalyzer collect per-stage reads/writes and scalar defs/uses, classify copy/TMA stages, compute last-use, propagate buffer/scalar producers, validate scalar deps, and optionally emit implicit async annotations. |
| Pipeline barrier dependency and TMA rewriting | src/transform/pipeline/barrier.h, src/transform/pipeline/barrier.cc | BuildDependencyGraph builds block deps; ExpandPipelineBarriers multi-versions eligible barrier buffers and rewrites indices/parity; RewritePipelineTmaBarriers wires pipeline mbar, rewrites TMA copies, and inserts parity-based waits. |
| Buffer versioning and loop rewriting with async synchronization | src/transform/pipeline/rewriter.h, src/transform/pipeline/rewriter.cc | PipelineBodyRewriter remaps accesses with per-stage version indices; PipelineRewriter determines version counts, creates versioned allocs, replays scalar binds per access index, computes/relaxes async waits, and emits the final pipelined loop; RewritePipeline exported. |
| PipelinePlanner refactor to use analyzers | src/transform/pipeline_planning.cc | Delegates to PipelineStageAnalyzer and PipelinePlanningBodyAnalyzer; flattens SeqStmt bodies; reorders scalar-producer propagation; adjusts stage-shifting; conditionally emits replayable/TMA annotations; returns For over flattened SeqStmt. |
| IfStmtBinding: replayable-bind inlining and pass config | src/transform/if_stmt_binding.cc, src/op/builtin.h, src/op/builtin.cc, tilelang/transform/pass_config.py, testing/python/transform/test_tilelang_transform_if_stmt_binding.py | IfStmtBinding now optionally inlines replayable scalar Bind statements when configured via tl.if_stmt_binding_inline_replayable_binds; rewriter made Target-aware and uses buffer-region analysis to decide replayability. |
| Pipeline utils and body normalization helpers | src/transform/common/pipeline_utils.h, tilelang/engine/phase.py | Adds IsPipelineDeclarationStmt, NormalizePipelineBody, MakePipelineBody; moves IfStmtBinding() earlier in LowerAndLegalize and removes redundant call in OptimizeForTarget. |
| Tests and build updates | CMakeLists.txt, testing/python/... | Adds pipeline sources to build; TMA test now uses barrier_wait; InjectSoftwarePipeline tests move shared allocs outside loop and use named sblocks; pipeline-planning tests expanded and harness updated to run IfStmtBinding() before planning and be target-aware. |
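
To make the "buffer versioning" row above more concrete, here is a minimal, generic sketch of why a software-pipelined loop needs more than one version of a staging buffer. It is not the PipelineRewriter implementation: the function `pipelined_sum`, the `slots` list, and the `i % num_versions` indexing are hypothetical stand-ins chosen for illustration.

```python
def pipelined_sum(tiles, num_versions=2):
    """Sum `tiles` with a producer running one iteration ahead of the consumer.

    `slots` plays the role of a multi-versioned shared buffer (hypothetical);
    iteration i uses slot i % num_versions.
    """
    if not tiles:
        return 0
    slots = [None] * num_versions
    slots[0] = tiles[0]  # prologue: load the first tile before the loop
    total = 0
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            # Producer stage: prefetch the next tile into its own version.
            slots[(i + 1) % num_versions] = tiles[i + 1]
        # Consumer stage: read the version that belongs to iteration i.
        total += slots[i % num_versions]
    return total


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print("two versions:", pipelined_sum(data, num_versions=2))
    print("one version: ", pipelined_sum(data, num_versions=1))
```

With a single version, the prefetch for iteration i + 1 overwrites the data iteration i has not consumed yet, so the result is wrong; with two versions the producer and consumer touch different slots.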

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • chengyupku
  • Rachmanino

Poem

🐰 I hopped through stages, buffers in tow,

I split the planner so the analyses grow,
Barriers expanded, versions aligned,
Async waits counted, rewrites defined,
A nimble pipeline — now off I go!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.80% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: refactoring software pipeline transforms into focused modules, which matches the substantial codebase reorganization evident in the summary. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d5978427-5287-426b-aea2-00a13e9fc634

📥 Commits

Reviewing files that changed from the base of the PR and between d3adf00 and 6bde08f.

📒 Files selected for processing (17)
  • CMakeLists.txt
  • src/transform/inject_pipeline.cc
  • src/transform/pipeline/access_analysis.cc
  • src/transform/pipeline/access_analysis.h
  • src/transform/pipeline/barrier.cc
  • src/transform/pipeline/barrier.h
  • src/transform/pipeline/body_analysis.h
  • src/transform/pipeline/helpers.cc
  • src/transform/pipeline/helpers.h
  • src/transform/pipeline/rewriter.cc
  • src/transform/pipeline/rewriter.h
  • src/transform/pipeline/stage_analysis.h
  • src/transform/pipeline_planning.cc
  • testing/python/language/test_tilelang_language_tma_copy.py
  • testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py
  • testing/python/transform/test_tilelang_transform_pipeline_planning.py
  • tilelang/engine/phase.py

Comment on lines +174 to +179
} else if (op->op.same_as(builtin::if_then_else())) {
within_condition_expr_ = true;
this->VisitExpr(op->args[0]);
within_condition_expr_ = false;
for (auto i = 1; i < op->args.size(); i++) {
this->VisitExpr(op->args[i]);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve conditional-context state and don’t mark else as a condition expression.

within_condition_expr_ is hard-reset instead of restored (nested contexts), and else_case is incorrectly visited with condition-context enabled (Line 192). This can hide real global reads and misclassify copy stages.

💡 Suggested fix
 void BufferRegionCollector::VisitExpr_(const CallNode *op) {
@@
   } else if (op->op.same_as(builtin::if_then_else())) {
-    within_condition_expr_ = true;
+    bool prev_within_condition_expr = within_condition_expr_;
+    within_condition_expr_ = true;
     this->VisitExpr(op->args[0]);
-    within_condition_expr_ = false;
+    within_condition_expr_ = prev_within_condition_expr;
     for (auto i = 1; i < op->args.size(); i++) {
       this->VisitExpr(op->args[i]);
     }
@@
 void BufferRegionCollector::VisitStmt_(const IfThenElseNode *op) {
-  within_condition_expr_ = true;
+  bool prev_within_condition_expr = within_condition_expr_;
+  within_condition_expr_ = true;
   this->VisitExpr(op->condition);
-  within_condition_expr_ = false;
+  within_condition_expr_ = prev_within_condition_expr;
   this->VisitStmt(op->then_case);
   if (op->else_case.defined()) {
-    within_condition_expr_ = true;
     this->VisitStmt(op->else_case.value());
-    within_condition_expr_ = false;
   }
 }

Also applies to: 186-195
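
The save/restore pattern in the suggested fix can be sketched outside of TVM. The example below is a hedged illustration only: `Load`, `IfThenElse`, and `Collector` are hypothetical stand-ins for the real IR nodes and for BufferRegionCollector, and the flag `within_condition` mirrors `within_condition_expr_`.

```python
from dataclasses import dataclass, field


@dataclass
class Load:
    """Read of a named buffer (hypothetical stand-in for a buffer load)."""
    buffer: str


@dataclass
class IfThenElse:
    """Select expression with a condition and two value branches."""
    cond: object
    then_value: object
    else_value: object


@dataclass
class Collector:
    within_condition: bool = False
    condition_reads: list = field(default_factory=list)
    data_reads: list = field(default_factory=list)

    def visit(self, node):
        if isinstance(node, Load):
            bucket = self.condition_reads if self.within_condition else self.data_reads
            bucket.append(node.buffer)
        elif isinstance(node, IfThenElse):
            prev = self.within_condition      # save the enclosing context
            self.within_condition = True
            self.visit(node.cond)
            self.within_condition = prev      # restore instead of forcing False
            self.visit(node.then_value)
            self.visit(node.else_value)


# A select nested inside the condition of another select.
inner = IfThenElse(Load("flag"), Load("a"), Load("b"))
outer = IfThenElse(inner, Load("x"), Load("y"))
collector = Collector()
collector.visit(outer)
print(collector.condition_reads, collector.data_reads)
```

With a hard reset to `False` after the inner condition, the loads of `a` and `b` would be recorded as ordinary data reads even though they still sit inside the outer condition; restoring the saved value keeps them in the condition bucket.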

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/transform/pipeline/access_analysis.cc` around lines 174 - 179, The
if-then-else handling in access_analysis.cc incorrectly hard-resets
within_condition_expr_ and leaves the else branch visited with condition-context
set; modify the block handling op->op.same_as(builtin::if_then_else()) to save
the previous within_condition_expr_ value (e.g., auto prev =
within_condition_expr_), set within_condition_expr_ = true only while visiting
the condition via this->VisitExpr(op->args[0]), then restore
within_condition_expr_ = prev before visiting the then- and else- branches
(visit op->args[1] and subsequent else cases) so the else case is not treated as
a condition and nested condition contexts are preserved; apply the same
save/restore pattern to the analogous code range for 186-195.

Comment on lines +429 to +438
for (size_t i = 0; i < original_order.size(); i++) {
if (!is_zero(tma_copies[i]))
last_tma_idx = static_cast<int>(i);
}

// Phase 1: Rewrite TMA copy blocks - all share barrier slot 0.
// ExpandPipelineBarriers (called later) will rewrite indices to be
// stage-dependent. Only the last TMA copy emits arrive.
for (size_t i = 0; i < original_order.size(); i++) {
if (is_zero(tma_copies[i]))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Potential out-of-bounds access if tma_copies size mismatches original_order size.

The loop iterates over original_order.size() but accesses tma_copies[i] without verifying that tma_copies has the same length. If a caller provides a shorter tma_copies array, this causes undefined behavior.

🛡️ Proposed fix: Add assertion for size match
 Buffer RewritePipelineTmaBarriers(
     Array<SBlock> &original_order, PipelineInfo &pipeline_info,
     const Array<Integer> &tma_copies, Map<Var, Buffer> &buffer_data_to_buffer,
     BufferSet &allocated_buffers, Array<Buffer> &block_local_allocs,
     Var loop_var, PrimExpr loop_min, int num_stages) {
+  ICHECK_EQ(tma_copies.size(), original_order.size())
+      << "tma_copies must have the same size as original_order";
   if (!std::any_of(tma_copies.begin(), tma_copies.end(),
                    [](const Integer &tc) { return !is_zero(tc); })) {
     return Buffer();
   }
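
As a language-neutral illustration of the proposed guard, the sketch below validates that two parallel sequences have equal length before indexing both with the same counter. The helper name `last_tma_index` is hypothetical, and Python raises an exception where the C++ code would hit undefined behavior, so this only mirrors the shape of the check.

```python
def last_tma_index(original_order, tma_copies):
    """Return the index of the last non-zero TMA flag, or -1 if there is none.

    Both arguments are treated as parallel sequences (one flag per block).
    """
    if len(tma_copies) != len(original_order):
        raise ValueError(
            "tma_copies must have the same size as original_order: "
            f"{len(tma_copies)} != {len(original_order)}"
        )
    last = -1
    for i in range(len(original_order)):
        if tma_copies[i] != 0:
            last = i
    return last


if __name__ == "__main__":
    print(last_tma_index(["copy_a", "copy_b", "gemm"], [1, 1, 0]))
```

Failing fast at the function entry keeps both later loops free of per-iteration bounds checks.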
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/transform/pipeline/barrier.cc` around lines 429 - 438, The loops over
original_order access tma_copies[i] without validating sizes, which can cause
OOB if tma_copies and original_order differ; before the loops in barrier.cc, add
a check/assert that tma_copies.size() == original_order.size() (or handle the
mismatch explicitly) so accesses to tma_copies[i] in the first loop that sets
last_tma_idx and the second loop that tests is_zero(tma_copies[i]) are safe;
update any error paths to return or abort if the sizes differ to prevent
undefined behavior.

Comment on lines +55 to +56
T.barrier_wait(mbar_A, k % 2)
T.barrier_wait(mbar_B, k % 2)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sync test documentation with the new wait API usage.

This test now uses T.barrier_wait(...), but the module docstring still states T.mbarrier_wait_parity() for TMA load synchronization. Please update the docstring (or mention both forms) to avoid confusion.
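
For readers unfamiliar with parity-based waits, the toy model below shows why `k % 2` is a natural parity argument when a barrier completes one phase per loop iteration. It is a simplified sketch, not the TileLang or hardware semantics: `ToyBarrier` and its methods are hypothetical, and the model assumes the producer never runs more than one phase ahead of the waiter.

```python
class ToyBarrier:
    """Minimal phase-parity barrier model (hypothetical)."""

    def __init__(self):
        self.completed_phases = 0

    def complete_phase(self):
        # Stands in for the producer's final arrive on the current phase.
        self.completed_phases += 1

    def is_ready(self, parity):
        # The in-flight phase has parity completed_phases % 2. A wait on
        # `parity` succeeds once the in-flight phase has the other parity,
        # meaning the phase with the requested parity has already completed.
        return self.completed_phases % 2 != parity


def run(num_iters):
    bar = ToyBarrier()
    trace = []
    for k in range(num_iters):
        parity = k % 2
        ready_before = bar.is_ready(parity)  # load for iteration k not done yet
        bar.complete_phase()                 # load for iteration k finishes
        ready_after = bar.is_ready(parity)
        trace.append((k, parity, ready_before, ready_after))
    return trace


if __name__ == "__main__":
    for row in run(4):
        print(row)
```

In this model each iteration's wait is unsatisfied until that iteration's phase completes, and the alternating parity lets a single barrier object be reused across iterations.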

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/language/test_tilelang_language_tma_copy.py` around lines 55 -
56, Update the module docstring to reflect the new TMA wait API used in the test
by replacing or augmenting the mention of T.mbarrier_wait_parity() with the
current T.barrier_wait(...) call (or mention both forms), and briefly describe
the parameter usage (e.g., parity argument k % 2) so readers understand how the
test synchronizes TMA loads; look for the docstring at the top of the test
module and update the text to reference T.barrier_wait and its parity usage
alongside or instead of T.mbarrier_wait_parity.

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/transform/pipeline/stage_analysis.h (1)

114-115: 💤 Low value

Unused variable if_then_else may trigger compiler warning.

The variable is declared but never used. Consider using the cast result directly in the condition or suppressing the warning.

♻️ Suggested fix
-      if (const auto *if_then_else = node.as<IfThenElseNode>()) {
+      if (node.as<IfThenElseNode>()) {
         conditional = true;
         return;
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/transform/pipeline/stage_analysis.h` around lines 114 - 115, The local
pointer if_then_else is assigned from node.as<IfThenElseNode>() but never used,
which can trigger a compiler warning; replace the declaration with a direct
check by changing the condition to use node.as<IfThenElseNode>() directly (i.e.,
if (node.as<IfThenElseNode>()) { conditional = true; }) or, if you actually need
the casted pointer later, use the variable where needed or mark it
[[maybe_unused]]; update the code around the if (const auto *if_then_else =
node.as<IfThenElseNode>()) and the conditional assignment to remove the unused
variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: faacdcf2-4a31-40be-9467-bac5f84f7b79

📥 Commits

Reviewing files that changed from the base of the PR and between 6bde08f and fad1a35.

📒 Files selected for processing (6)
  • CMakeLists.txt
  • src/transform/pipeline/access_analysis.cc
  • src/transform/pipeline/barrier.cc
  • src/transform/pipeline/helpers.cc
  • src/transform/pipeline/stage_analysis.h
  • tilelang/engine/phase.py

@LeiWang1999

Member Author

@regression-perf

@github-actions


Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/26279131389

Results

| File | Original Latency | Current Latency | Speedup |
| --- | --- | --- | --- |
| example_tilelang_gemm_fp8_2xAcc | 0.0893022 | 0.0907862 | 0.983655 |
| example_mha_fwd_varlen | 0.032706 | 0.0332113 | 0.984784 |
| example_mha_sink_bwd_bhsd | 0.051944 | 0.0526673 | 0.986267 |
| example_convolution | 0.915273 | 0.924959 | 0.989528 |
| example_mla_decode | 0.313157 | 0.315924 | 0.991243 |
| example_tilelang_gemm_fp8 | 0.235976 | 0.237917 | 0.991842 |
| example_mha_bwd_bhsd | 0.0295809 | 0.0297656 | 0.993796 |
| example_mha_inference | 0.0621936 | 0.0624552 | 0.99581 |
| example_gemm_intrinsics | 0.0253497 | 0.0254398 | 0.996455 |
| example_convolution_autotune | 0.725262 | 0.726869 | 0.997789 |
| example_mha_sink_fwd_bhsd | 0.0127129 | 0.0127405 | 0.997832 |
| example_gemm | 0.0171506 | 0.0171828 | 0.998128 |
| example_mha_fwd_bshd | 0.0190085 | 0.0190436 | 0.998158 |
| example_group_per_split_token_cast_to_fp8 | 0.00760808 | 0.00762065 | 0.99835 |
| sparse_mla_bwd | 0.228128 | 0.228442 | 0.998626 |
| example_warp_specialize_gemm_barrierpipe_stage2 | 0.0296348 | 0.0296737 | 0.998687 |
| example_dequant_gemm_fp4_hopper | 0.725097 | 0.725743 | 0.99911 |
| example_gqa_decode | 0.0411461 | 0.0411806 | 0.999162 |
| example_warp_specialize_gemm_softpipe_stage2 | 0.0195324 | 0.019547 | 0.999254 |
| sparse_mla_fwd | 0.0826231 | 0.0826781 | 0.999335 |
| example_mha_sink_fwd_bhsd_sliding_window | 0.0126858 | 0.012694 | 0.999355 |
| example_tilelang_nsa_decode | 0.00550849 | 0.00550976 | 0.99977 |
| block_sparse_attn_tilelang | 0.00672228 | 0.00672286 | 0.999913 |
| example_tilelang_sparse_gqa_decode_varlen_indice | 0.0117852 | 0.0117847 | 1.00004 |
| example_topk | 30.5835 | 30.5743 | 1.0003 |
| example_gqa_bwd_tma_reduce_varlen | 0.0330355 | 0.0330253 | 1.00031 |
| example_linear_attn_fwd | 0.0284165 | 0.0284071 | 1.00033 |
| example_mhc_post | 0.106664 | 0.106612 | 1.00048 |
| example_tilelang_sparse_gqa_decode_varlen_mask | 0.0127865 | 0.0127802 | 1.00049 |
| example_dequant_gemv_fp16xint4 | 0.0270041 | 0.0269853 | 1.0007 |
| example_gqa_sink_bwd_bhsd | 0.0303126 | 0.0302879 | 1.00082 |
| fp8_lighting_indexer | 0.02265 | 0.0226293 | 1.00091 |
| example_elementwise_add | 0.112963 | 0.112855 | 1.00095 |
| example_tilelang_gemm_splitk | 0.767857 | 0.76709 | 1.001 |
| example_warp_specialize_gemm_copy_1_gemm_0 | 0.0195381 | 0.0195163 | 1.00112 |
| example_dynamic | 0.497373 | 0.49669 | 1.00138 |
| sparse_mla_fwd_pipelined | 0.0595879 | 0.0594951 | 1.00156 |
| example_gqa_sink_bwd_bhsd_sliding_window | 0.018133 | 0.0180966 | 1.00201 |
| example_per_token_cast_to_fp8 | 0.00651406 | 0.00650085 | 1.00203 |
| example_tilelang_block_sparse_attn | 0.00725893 | 0.00724329 | 1.00216 |
| example_mha_sink_bwd_bhsd_sliding_window | 0.0385473 | 0.0384596 | 1.00228 |
| example_fusedmoe_tilelang | 0.0956731 | 0.0954449 | 1.00239 |
| example_gemm_autotune | 0.0162815 | 0.0162328 | 1.003 |
| example_mha_bwd_bshd | 0.0290022 | 0.0289074 | 1.00328 |
| example_dequant_gemm_bf16_fp4_hopper | 0.397705 | 0.39631 | 1.00352 |
| example_mha_fwd_bhsd | 0.0090818 | 0.00904698 | 1.00385 |
| example_gemv | 0.202201 | 0.201339 | 1.00428 |
| example_warp_specialize_gemm_copy_0_gemm_1 | 0.0269645 | 0.0268489 | 1.00431 |
| example_mhc_pre | 0.14689 | 0.146145 | 1.0051 |
| topk_selector | 0.041552 | 0.0413131 | 1.00578 |
| example_blocksparse_gemm | 0.0138106 | 0.0137056 | 1.00766 |
| example_linear_attn_bwd | 0.117648 | 0.116745 | 1.00773 |
| example_gqa_fwd_bshd | 0.0509859 | 0.0505536 | 1.00855 |
| example_vertical_slash_sparse_attn | 0.167334 | 0.16591 | 1.00858 |
| example_gqa_bwd | 0.0331218 | 0.0327321 | 1.01191 |
| example_dequant_gemm_bf16_mxfp4_hopper | 0.359028 | 0.354768 | 1.01201 |
| example_tilelang_gemm_splitk_vectorize_atomicadd | 0.791744 | 0.779177 | 1.01613 |
| example_dequant_gemm_w4a8 | 3.9308 | 3.80986 | 1.03175 |
| example_tilelang_nsa_fwd | 0.00560543 | 0.00527248 | 1.06315 |
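
The report does not define the Speedup column, but the rows are consistent with original latency divided by current latency, so values below 1 indicate a slowdown. The sketch below recomputes a few rows under that assumption; the `speedup` helper and the `rows` selection are illustrative and not part of the regression tooling.

```python
def speedup(original_latency, current_latency):
    """Assumed definition: ratio of original to current latency."""
    return original_latency / current_latency


# (original, current, reported speedup) copied from the report above.
rows = {
    "example_tilelang_gemm_fp8_2xAcc": (0.0893022, 0.0907862, 0.983655),
    "example_gemm": (0.0171506, 0.0171828, 0.998128),
    "example_tilelang_nsa_fwd": (0.00560543, 0.00527248, 1.06315),
}

if __name__ == "__main__":
    for name, (orig, cur, reported) in rows.items():
        ratio = speedup(orig, cur)
        verdict = "slower" if ratio < 1 else "faster"
        print(f"{name}: computed {ratio:.6f}, reported {reported} ({verdict})")
```

Under this reading, the worst case in the report is a slowdown of roughly 1.6% and the best case a gain of roughly 6.3%.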

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999

Member Author

@regression-perf

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/transform/pipeline_planning.cc (1)

135-174: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the explicit-annotation body aligned with the flattened schedule.

This branch filters pipeline metadata against pipeline_stmts, but it still writes back pipeline_body_stmts. If NormalizePipelineBody leaves nested SeqStmts, the loop body and software_pipeline_* annotations can describe different statement lists, which is exactly what the num_stages path avoids a few lines below. Rebuild the explicit path from the flattened list too.

Suggested fix
-      n->body = MakePipelineBody(pipeline_body_stmts);
+      n->body = MakePipelineBody(pipeline_stmts);
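
The mismatch described above can be reproduced with plain lists. In this hedged sketch, nested Python lists stand in for nested SeqStmts, `flatten` is a hypothetical analogue of the flattening step, and the `stage` list plays the role of the per-statement pipeline annotations.

```python
def flatten(stmts):
    """Recursively flatten nested statement lists into one flat list."""
    flat = []
    for stmt in stmts:
        if isinstance(stmt, list):
            flat.extend(flatten(stmt))
        else:
            flat.append(stmt)
    return flat


# Body with one nested sequence left in place after normalization (assumed).
pipeline_body_stmts = ["copy_A", ["copy_B", "gemm"]]
pipeline_stmts = flatten(pipeline_body_stmts)

# Annotations are computed per flattened statement (illustrative values).
stage = [0, 0, 1]

if __name__ == "__main__":
    print("flattened body:", pipeline_stmts)
    print("annotation count matches flattened body:", len(stage) == len(pipeline_stmts))
    print("annotation count matches nested body:   ", len(stage) == len(pipeline_body_stmts))
```

Writing back the nested body would leave three annotations describing two top-level statements, which is why the suggestion rebuilds the loop body from the flattened list.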
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/transform/pipeline_planning.cc` around lines 135 - 174, The
explicit-annotation branch currently writes back the original
pipeline_body_stmts while metadata (filtered_order_array/filtered_stage_array
and replayable masks) are computed from the flattened pipeline_stmts, causing
mismatch when NormalizePipelineBody left nested SeqStmt nodes; update the code
so the loop body is rebuilt from the flattened list (pipeline_stmts) instead of
pipeline_body_stmts (i.e. replace the final n->body =
MakePipelineBody(pipeline_body_stmts) with n->body =
MakePipelineBody(pipeline_stmts)) so annotations
(s_tir::attr::software_pipeline_order/stage and kPipelineReplayableScalarBinds)
match the actual flattened statement sequence used to compute them.
🧹 Nitpick comments (2)
src/transform/if_stmt_binding.cc (1)

51-73: ⚡ Quick win

Reuse the shared replayability analyzer here.

This pass now reimplements the same access/write/replayability flow that src/transform/pipeline/body_analysis.h already centralizes, and the two paths are already drifting: PipelinePlanningBodyAnalyzer::CollectStmtAccessRegions() wraps each stmt in an SBlock, while this copy does not. Please route both through one helper so IfStmtBinding and pipeline planning cannot disagree on which binds are replayable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/transform/if_stmt_binding.cc` around lines 51 - 73, The code duplicates
access/write/replayability logic; replace the local
CollectStmtAccessRegions/CollectWriteBuffers/IsReplayableBindStmt
implementations in IfStmtBinding with calls into the shared analyzer in
PipelinePlanningBodyAnalyzer (or its exported helper) so both paths use the same
logic; specifically, stop constructing BufferRegionCollector here and instead
call PipelinePlanningBodyAnalyzer::CollectStmtAccessRegions (or the shared
helper) which wraps statements in an SBlock correctly, then use its returned
reads/writes to build write_buffers and to call IsReplayableScalarBind —
ensuring we preserve the SBlock wrapping behavior used by the pipeline analyzer.
testing/python/transform/test_tilelang_transform_pipeline_planning.py (1)

195-203: ⚡ Quick win

Assert the injected IR, not just that injection succeeds.

This regression currently passes as long as InjectSoftwarePipeline does not throw. A no-op or wrong rewrite would still slip through. Please add one structural postcondition on the injected module as well, so the guarded-bind case stays covered end-to-end. Based on learnings, focus assertions on structural patterns in the generated kernel source rather than specific numeric literals.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/transform/test_tilelang_transform_pipeline_planning.py` around
lines 195 - 203, After calling tl.transform.InjectSoftwarePipeline()(mod)
re-collect pipeline annotations (e.g., call
_collect_pipeline_loop_annotations(mod["main"]) again) and assert a structural
postcondition on the injected IR: ensure the annotation(s) now include the
guarded-bind information (check that "software_pipeline_replayable_scalar_binds"
is present and non-empty on the annotation for the pipeline) or, alternatively,
inspect the generated kernel body for a Bind-like pattern (e.g., a non-empty
list/string containing "bind" or equivalent), so the test verifies the rewrite
produced the expected structural change rather than merely not throwing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/transform/pipeline_planning.cc`:
- Around line 135-174: The explicit-annotation branch currently writes back the
original pipeline_body_stmts while metadata
(filtered_order_array/filtered_stage_array and replayable masks) are computed
from the flattened pipeline_stmts, causing mismatch when NormalizePipelineBody
left nested SeqStmt nodes; update the code so the loop body is rebuilt from the
flattened list (pipeline_stmts) instead of pipeline_body_stmts (i.e. replace the
final n->body = MakePipelineBody(pipeline_body_stmts) with n->body =
MakePipelineBody(pipeline_stmts)) so annotations
(s_tir::attr::software_pipeline_order/stage and kPipelineReplayableScalarBinds)
match the actual flattened statement sequence used to compute them.

---

Nitpick comments:
In `@src/transform/if_stmt_binding.cc`:
- Around line 51-73: The code duplicates access/write/replayability logic;
replace the local
CollectStmtAccessRegions/CollectWriteBuffers/IsReplayableBindStmt
implementations in IfStmtBinding with calls into the shared analyzer in
PipelinePlanningBodyAnalyzer (or its exported helper) so both paths use the same
logic; specifically, stop constructing BufferRegionCollector here and instead
call PipelinePlanningBodyAnalyzer::CollectStmtAccessRegions (or the shared
helper) which wraps statements in an SBlock correctly, then use its returned
reads/writes to build write_buffers and to call IsReplayableScalarBind —
ensuring we preserve the SBlock wrapping behavior used by the pipeline analyzer.

In `@testing/python/transform/test_tilelang_transform_pipeline_planning.py`:
- Around line 195-203: After calling tl.transform.InjectSoftwarePipeline()(mod)
re-collect pipeline annotations (e.g., call
_collect_pipeline_loop_annotations(mod["main"]) again) and assert a structural
postcondition on the injected IR: ensure the annotation(s) now include the
guarded-bind information (check that "software_pipeline_replayable_scalar_binds"
is present and non-empty on the annotation for the pipeline) or, alternatively,
inspect the generated kernel body for a Bind-like pattern (e.g., a non-empty
list/string containing "bind" or equivalent), so the test verifies the rewrite
produced the expected structural change rather than merely not throwing.
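
A structural postcondition of this kind might look like the sketch below. It assumes that the re-collected annotations can be treated as a list of dict-like mappings; the actual return type of `_collect_pipeline_loop_annotations` is not shown here, and `has_replayable_binds` is a hypothetical helper, not part of the test file.

```python
# Annotation key named in the review comment.
REPLAYABLE_KEY = "software_pipeline_replayable_scalar_binds"


def has_replayable_binds(annotations):
    """Return True if any loop annotation has a non-empty replayable-bind entry.

    Assumption: each annotation behaves like a mapping from str to a sized value.
    """
    for ann in annotations:
        value = ann.get(REPLAYABLE_KEY)
        if value is not None and len(value) > 0:
            return True
    return False


# Stand-in data; in the real test this would come from re-collecting
# annotations on the module after InjectSoftwarePipeline has run.
after_injection = [{REPLAYABLE_KEY: [1, 0, 0]}]
assert has_replayable_binds(after_injection)
```

An assertion like this fails when the key is missing or empty, so the test checks the rewrite's result rather than only the absence of an exception.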

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b864f799-6851-4e37-b8c8-89c196b5df88

📥 Commits

Reviewing files that changed from the base of the PR and between 0a8fbcf and 3f70125.

📒 Files selected for processing (13)
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/transform/common/pipeline_utils.h
  • src/transform/if_stmt_binding.cc
  • src/transform/inject_pipeline.cc
  • src/transform/pipeline/access_analysis.cc
  • src/transform/pipeline/access_analysis.h
  • src/transform/pipeline/body_analysis.h
  • src/transform/pipeline/helpers.cc
  • src/transform/pipeline_planning.cc
  • testing/python/transform/test_tilelang_transform_if_stmt_binding.py
  • testing/python/transform/test_tilelang_transform_pipeline_planning.py
  • tilelang/transform/pass_config.py
✅ Files skipped from review due to trivial changes (2)
  • src/op/builtin.cc
  • src/op/builtin.h

@github-actions


Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/26362643998

Results

| File | Original Latency | Current Latency | Speedup |
|---|---|---|---|
| example_gqa_bwd | 0.032789 | 0.0332675 | 0.985615 |
| example_mha_bwd_bhsd | 0.0294558 | 0.0298455 | 0.98694 |
| sparse_mla_bwd | 0.228311 | 0.231291 | 0.987118 |
| example_convolution | 0.91426 | 0.924983 | 0.988407 |
| example_linear_attn_bwd | 0.116711 | 0.117965 | 0.989368 |
| example_vertical_slash_sparse_attn | 0.165509 | 0.167273 | 0.989451 |
| example_mha_sink_bwd_bhsd | 0.0518593 | 0.0522934 | 0.9917 |
| example_mha_sink_bwd_bhsd_sliding_window | 0.0380328 | 0.0383449 | 0.991862 |
| example_mha_fwd_varlen | 0.0329453 | 0.0331801 | 0.992922 |
| example_warp_specialize_gemm_copy_0_gemm_1 | 0.0268628 | 0.0269552 | 0.996573 |
| example_fusedmoe_tilelang | 0.0953723 | 0.0956662 | 0.996927 |
| topk_selector | 0.0413188 | 0.0414298 | 0.997321 |
| example_linear_attn_fwd | 0.0284436 | 0.0285178 | 0.997396 |
| example_dynamic | 0.495865 | 0.49714 | 0.997434 |
| example_gqa_sink_bwd_bhsd_sliding_window | 0.0181038 | 0.0181474 | 0.997598 |
| example_blocksparse_gemm | 0.0137185 | 0.0137515 | 0.997605 |
| example_gqa_sink_bwd_bhsd | 0.0300173 | 0.0300575 | 0.998664 |
| example_mla_decode | 0.315499 | 0.315909 | 0.998702 |
| example_warp_specialize_gemm_copy_1_gemm_0 | 0.0195252 | 0.0195452 | 0.99898 |
| example_mhc_post | 0.106548 | 0.106641 | 0.999132 |
| example_tilelang_sparse_gqa_decode_varlen_mask | 0.0127822 | 0.0127916 | 0.999269 |
| example_tilelang_block_sparse_attn | 0.00724261 | 0.00724751 | 0.999324 |
| sparse_mla_fwd | 0.0825877 | 0.0826333 | 0.999449 |
| example_gemm_autotune | 0.0162425 | 0.0162496 | 0.999566 |
| example_gemm | 0.0171287 | 0.0171326 | 0.999771 |
| example_tilelang_sparse_gqa_decode_varlen_indice | 0.0117843 | 0.0117842 | 1 |
| block_sparse_attn_tilelang | 0.00672144 | 0.00671991 | 1.00023 |
| example_tilelang_gemm_splitk | 0.76788 | 0.767558 | 1.00042 |
| example_per_token_cast_to_fp8 | 0.00650984 | 0.00650701 | 1.00044 |
| example_dequant_gemv_fp16xint4 | 0.026978 | 0.0269649 | 1.00049 |
| fp8_lighting_indexer | 0.0226768 | 0.0226654 | 1.0005 |
| example_mha_inference | 0.0623591 | 0.0623198 | 1.00063 |
| example_gqa_decode | 0.041199 | 0.0411705 | 1.00069 |
| example_group_per_split_token_cast_to_fp8 | 0.00762566 | 0.00761978 | 1.00077 |
| example_mha_fwd_bshd | 0.0190399 | 0.0190207 | 1.00101 |
| example_tilelang_nsa_decode | 0.00551164 | 0.0055049 | 1.00122 |
| example_mha_sink_fwd_bhsd_sliding_window | 0.0127084 | 0.0126897 | 1.00147 |
| example_elementwise_add | 0.113108 | 0.112931 | 1.00156 |
| example_mha_fwd_bhsd | 0.00910121 | 0.00908463 | 1.00182 |
| example_dequant_gemm_bf16_fp4_hopper | 0.398469 | 0.39768 | 1.00198 |
| example_warp_specialize_gemm_barrierpipe_stage2 | 0.0297141 | 0.0296528 | 1.00207 |
| example_dequant_gemm_bf16_mxfp4_hopper | 0.355586 | 0.354716 | 1.00245 |
| example_mha_sink_fwd_bhsd | 0.0127635 | 0.0127277 | 1.00281 |
| example_tilelang_gemm_fp8_2xAcc | 0.0902369 | 0.0899306 | 1.00341 |
| example_gemv | 0.202291 | 0.201589 | 1.00348 |
| example_gemm_intrinsics | 0.0254536 | 0.0253507 | 1.00406 |
| example_dequant_gemm_w4a8 | 3.83163 | 3.81322 | 1.00483 |
| example_tilelang_gemm_fp8 | 0.23939 | 0.238224 | 1.00489 |
| sparse_mla_fwd_pipelined | 0.0594865 | 0.0591761 | 1.00525 |
| example_warp_specialize_gemm_softpipe_stage2 | 0.019559 | 0.0194196 | 1.00718 |
| example_gqa_bwd_tma_reduce_varlen | 0.0335673 | 0.0332913 | 1.00829 |
| example_dequant_gemm_fp4_hopper | 0.708931 | 0.702745 | 1.0088 |
| example_tilelang_gemm_splitk_vectorize_atomicadd | 0.785854 | 0.778315 | 1.00969 |
| example_mha_bwd_bshd | 0.0292262 | 0.0289129 | 1.01083 |
| example_gqa_fwd_bshd | 0.0514124 | 0.050857 | 1.01092 |
| example_topk | 30.8604 | 30.469 | 1.01284 |
| example_convolution_autotune | 0.730836 | 0.721439 | 1.01302 |
| example_mhc_pre | 0.147831 | 0.145453 | 1.01635 |
| example_tilelang_nsa_fwd | 0.00561653 | 0.0052824 | 1.06325 |
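
The report does not define the Speedup column, but the rows appear consistent with original latency divided by current latency, so values below 1 would mean the PR build is slower. The sketch below checks that reading against the first and last rows; the `speedup` helper is hypothetical and not part of the workflow.

```python
def speedup(original_latency, current_latency):
    """Ratio of original to current latency (assumed definition of Speedup)."""
    return original_latency / current_latency


# (original, current, reported speedup) copied from the first and last rows.
rows = {
    "example_gqa_bwd": (0.032789, 0.0332675, 0.985615),
    "example_tilelang_nsa_fwd": (0.00561653, 0.0052824, 1.06325),
}

for name, (original, current, reported) in rows.items():
    computed = speedup(original, current)
    # Latencies are printed with limited precision, so compare with a tolerance.
    assert abs(computed - reported) < 1e-4, name
    print(f"{name}: computed {computed:.4f}, reported {reported}")
```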

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

…s analysis

- Removed unused Target parameter from IfStmtBindingRewriter.
- Introduced IfStmtAccessCollector to handle buffer access collection.
- Updated access analysis methods to utilize the new collector.
- Cleaned up includes and removed obsolete functions in access_analysis.
- Enhanced pipeline utility headers with necessary includes for consistency.
@LeiWang1999 LeiWang1999 merged commit 3bf3d00 into tile-ai:main May 26, 2026
12 checks passed