[Feature] Emit named aliases for dynamic shared memory in CUDA codegen for better readability by Rachmanino · Pull Request #2061 · tile-ai/tilelang

Rachmanino · 2026-04-19T07:02:36Z

This PR adds an opt-in CUDA codegen path for improving the readability of merged shared.dyn buffers by preserving DSL-levelshared-memory names in generated source.

What it does:

Adds a new pass config: tl.emit_named_smem_pointers
Extends MergeSharedMemoryAllocations to emit alias metadata for merged dynamic shared-memory buffers
Updates CUDA codegen to emit named aliases such as:
- half_t* A_shared = reinterpret_cast<half_t*>(buf_dyn_shmem + ...);
For multiversioned / pipelined shared buffers, emits stage-aware aliases using a small helper:
- auto A_shared = tl::PatternVisitor([&](int i) { ... });
Rewrites many downstream shared-memory accesses to use those aliases instead of raw buf_dyn_shmem + offset expressions
Covers both non-WS and WS/TMA paths for common GEMM-style kernels, including stage-based descriptor initialization and many cp.async / tma_load sites

Current limitations / not fully solved yet:

The implementation is strongest when the logical buffer identity is still visible in IR (e.g. normal BufferLoad/BufferStore,
many staged producer accesses).
Some paths still lose logical buffer provenance after lowering to tvm_access_ptr(data, offset, extent, rw_mask), especially
when multiple logical buffers are merged into the same physical buf_dyn_shmem region.
Because of that, in some cases codegen can still choose a physically correct but semantically less desirable alias. For
example, an address logically corresponding to C_shared may still be rendered as A_shared[0] if both map to the same merged
address range and only physical address information remains.
In other words, this PR substantially improves readability, but it does not yet fully preserve logical buffer provenance end-
to-end through every pass that rewrites tvm_access_ptr.

Follow-up direction:

A more complete solution would preserve a stable logical-buffer identity across lowering (especially for tma_store-style
paths), instead of relying on address-based alias recovery after buffer merging.

Summary by CodeRabbit

New Features
- CUDA codegen can emit named, typed pointer aliases for merged dynamic shared memory; controlled by a new pass config (defaults to True).
- Multi-version dynamic shared-memory metadata is propagated so device code can emit stage-aware aliases for pipelined kernels.
Tests
- Added CUDA tests validating default/raw behavior, named alias emission, and pattern-visitor based multiversion indexing.

github-actions · 2026-04-19T07:02:44Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2026-04-19T07:02:49Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds support to emit named typed pointer aliases for dynamic shared-memory: new pass-config and builtin keys, metadata production in merge/multi-version passes, propagation through host/device splitting, CUDA codegen to consume alias metadata and remap shared.dyn accesses, a PatternVisitor helper, and tests.

Changes

Cohort / File(s)	Summary
Pass config & builtin keys `src/op/builtin.cc`, `src/op/builtin.h`, `tilelang/transform/pass_config.py`	Register new pass-config key `tl.emit_named_smem_pointers` / `kEmitNamedSmemPointers` and add the corresponding attribute constant used to control emitting named shared.dyn pointer aliases.
CUDA codegen `src/target/codegen_cuda.h`, `src/target/codegen_cuda.cc`	Add internal types/state and helpers to parse `tl.dyn_shared_memory_alias_metadata`, enable `emit_named_smem_pointers_`, emit per-alias typed pointer views at `shared.dyn` Allocate, and remap `BufferLoad`/`BufferStore`/`address_of` to use alias-based addresses with stage-aware handling and loop-range binding.
Shared-memory merging & metadata export `src/transform/merge_shared_memory_allocations.cc`, `src/transform/multi_version_buffer_rewriter.cc`	Merge pass now computes/returns dynamic alias metadata (offset, size, stage stride, dtype) and may attach `tl.dyn_shared_memory_alias_metadata`; multi-version rewriter exports `tl.dyn_shared_memory_multiversion_metadata`. Signature of internal merge API updated to return alias map.
Host/device attribute propagation `src/transform/split_host_device.cc`	HostDeviceSplitter reads/writes `tl.dyn_shared_memory_multiversion_metadata` on host/device PrimFunc attributes and propagates the metadata to generated device functions.
CUDA templates `src/tl_templates/cuda/common.h`	Add `tl::PatternVisitor<Fn>` device-callable helper (constructor, operator[], CTAD) used to emit indexed alias views for multiversion dyn-shared aliases.
Tests `testing/python/components/test_cuda_named_smem_alias_codegen.py`	Add CUDA tests verifying default raw dyn.smem access, emission of named typed alias pointers when enabled, and PatternVisitor-based indexed emission for multiversion dyn.smem aliases.

Sequence Diagram(s)

sequenceDiagram
    participant MergePass as MergePass\n(MergeSharedMemoryAllocations)
    participant MVBRewriter as MultiVersionRewriter
    participant HostSplitter as HostDeviceSplitter
    participant PrimFunc as PrimFunc (attrs)
    participant CodeGen as CodeGenTileLangCUDA
    participant CUDAOut as Generated CUDA

    MergePass->>MVBRewriter: consult/produce multiversion metadata
    MergePass->>MergePass: compute dyn_shared_memory_alias_metadata
    MergePass->>PrimFunc: attach alias metadata attribute
    MVBRewriter->>PrimFunc: attach multiversion metadata (if any)
    PrimFunc->>HostSplitter: split host/device
    HostSplitter->>PrimFunc: propagate multiversion metadata to device PrimFunc
    PrimFunc->>CodeGen: AddFunction() reads alias & multiversion attrs
    CodeGen->>CodeGen: build alias tables, set emit_named_smem_pointers_
    CodeGen->>CodeGen: emit alias pointer views at Allocate
    CodeGen->>CodeGen: remap loads/stores/address_of to alias addresses
    CodeGen->>CUDAOut: emit final CUDA source with alias declarations and remapped accesses

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

[Feature Request] Emit named pointer aliases for shared.dyn buffers in generated CUDA code to improve readability #2059: Implements named shared.dyn pointer alias emission and the pass-config/metadata/codegen plumbing described in that issue.

Possibly related PRs

[Refactor][Pipeline] Run pipeline rewriting before layout inference and stabilize tiled WS #2002: Related multi-version dynamic-shared metadata and rewriter changes that this codegen consumes.
[CUDA] Add read-only parameter annotation for CUDA codegen #1416: Related changes in CUDA codegen emission patterns and function-signature/code-emission adjustments.

Suggested labels

enhancement

Suggested reviewers

LeiWang1999

Poem

🐰 I nibble bytes in shared mem rows,
Names for pointers now proudly show.
Aliases hop from offset to lane,
Pattern visitors count the chain.
Hooray — kernels bound and ready to go!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 9.84% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding named aliases for dynamic shared memory in CUDA codegen to improve readability.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/transform/merge_shared_memory_allocations.cc`:
- Around line 578-580: ExportBufferAliasMetadata() is being called before
buffer_names_ is populated in rewriter(...), so metadata falls back to
buffer_var->name_hint instead of using high-level Buffer names; fix by moving
the ExportBufferAliasMetadata() invocation (or otherwise deferring export) to
run after rewriter(...) populates buffer_names_, or ensure buffer_names_ is
filled earlier (e.g., populate buffer_names_ when visiting Buffers or at the
start of rewriter) so that ExportBufferAliasMetadata() reads the populated
buffer_names_ map rather than the fallback name_hint. Reference symbols:
buffer_names_, rewriter(...), ExportBufferAliasMetadata(),
buffer_var->name_hint, node->buffer.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f1479d40-c532-4e65-a066-0297b65839a4

📥 Commits

Reviewing files that changed from the base of the PR and between 0924dab and 5cecdae.

📒 Files selected for processing (6)

src/op/builtin.cc
src/op/builtin.h
src/target/codegen_cuda.cc
src/target/codegen_cuda.h
src/transform/merge_shared_memory_allocations.cc
tilelang/transform/pass_config.py

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

src/transform/merge_shared_memory_allocations.cc (1)

1552-1554: ⚠️ Potential issue | 🟠 Major

Export alias metadata after rewriting populates buffer_names_.

buffer_names_ is filled in VisitBufferAccess(), so exporting before rewriter(...) still misses high-level Buffer names and can also miss multi_version_metadata_ keyed by those names.

Proposed fix

     SharedMemoryRewriter rewriter(collector.dyn_shmem_allocs_, true, verbose,
                                   align_bytes, multi_version_metadata);
     rewriter.PlanReuse(stmt, true, enable_aggressive_merge);
-    dyn_alias_metadata = rewriter.ExportBufferAliasMetadata();
     stmt = rewriter(std::move(stmt));
+    dyn_alias_metadata = rewriter.ExportBufferAliasMetadata();

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/transform/merge_shared_memory_allocations.cc` around lines 1552 - 1554,
The ExportBufferAliasMetadata call is happening before the rewriter populates
buffer_names_ (populated in VisitBufferAccess), so move the
rewriter.ExportBufferAliasMetadata() invocation to after you invoke the rewriter
(i.e., after stmt = rewriter(std::move(stmt))); ensure you still call
rewriter.PlanReuse(...) beforehand as needed and then call
ExportBufferAliasMetadata() so multi_version_metadata_ and buffer_names_ are
captured from the rewritten IR.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 266-271: The fallback alias-matching currently treats
info.total_byte_size==0 as "no range check" allowing same-dtype buffers to
incorrectly claim dynamic shared-memory accesses; update the fallback logic that
returns std::optional<DynSharedAliasAccess>() to first detect missing size
metadata (info.total_byte_size == 0) and refuse to match (i.e., skip/continue
fallback) instead of disabling range checks — change the condition around
require_range_check / analyzer.CanProve(...) to enforce a size-present check,
and apply the same change in the other identical fallback block (the one at the
later location referenced in the review) so aliases without size metadata cannot
be matched.

In `@src/transform/split_host_device.cc`:
- Around line 371-374: The dynamic shared-memory metadata stored in
dyn_shared_multi_version_metadata_ is still using host-side Var symbols and must
be remapped to the fresh device parameter Vars created by SplitDeviceFunc;
update the code that sets device_attrs (kDynSharedMemoryMultiVersionMetadata) to
traverse the metadata expression(s) and substitute host Var -> corresponding
device param Var (the ones created/returned by SplitDeviceFunc) before calling
device_attrs.Set, ensuring any symbolic sizes/strides reference the device
parameters rather than undeclared host Vars. Locate the substitution logic used
for the function body in SplitDeviceFunc and reuse its mapping routine (or
create a similar Var->Var substitution util) to transform
dyn_shared_multi_version_metadata_.value() prior to assignment.

In `@testing/python/components/test_cuda_named_smem_alias_codegen.py`:
- Around line 61-62: The test function
test_codegen_keeps_raw_dyn_smem_access_by_default is running CUDA code via
_compile_kernel() but lacks the CUDA guard; add the CUDA gate by decorating the
test with the existing requires_cuda marker (e.g., `@pytest.mark.requires_cuda` or
the project's requires_cuda decorator) so the test is skipped on CPU-only CI;
update the test definition for test_codegen_keeps_raw_dyn_smem_access_by_default
to include that decorator and ensure imports/reference to pytest/marker are
present.

---

Duplicate comments:
In `@src/transform/merge_shared_memory_allocations.cc`:
- Around line 1552-1554: The ExportBufferAliasMetadata call is happening before
the rewriter populates buffer_names_ (populated in VisitBufferAccess), so move
the rewriter.ExportBufferAliasMetadata() invocation to after you invoke the
rewriter (i.e., after stmt = rewriter(std::move(stmt))); ensure you still call
rewriter.PlanReuse(...) beforehand as needed and then call
ExportBufferAliasMetadata() so multi_version_metadata_ and buffer_names_ are
captured from the rewritten IR.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1424a87f-7fe0-415b-8eb0-e69425b3f8fe

📥 Commits

Reviewing files that changed from the base of the PR and between 5cecdae and 2d56736.

📒 Files selected for processing (7)

src/target/codegen_cuda.cc
src/target/codegen_cuda.h
src/tl_templates/cuda/common.h
src/transform/merge_shared_memory_allocations.cc
src/transform/multi_version_buffer_rewriter.cc
src/transform/split_host_device.cc
testing/python/components/test_cuda_named_smem_alias_codegen.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/target/codegen_cuda.h

coderabbitai · 2026-04-19T07:23:02Z

+    if (dyn_shared_multi_version_metadata_.defined()) {
+      device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
+                       dyn_shared_multi_version_metadata_.value());
+    }


⚠️ Potential issue | 🟠 Major

Remap metadata expressions to the device parameter Vars.

SplitDeviceFunc creates fresh device params and substitutes the body, but the propagated dynamic shared-memory metadata still contains host-side Var objects. Symbolic strides/sizes can later codegen as undeclared variables.

Proposed fix

if (dyn_shared_multi_version_metadata_.defined()) { - device_attrs.Set(kDynSharedMemoryMultiVersionMetadata, - dyn_shared_multi_version_metadata_.value()); + Map<String, Array<PrimExpr>> remapped_metadata; + for (const auto &kv : dyn_shared_multi_version_metadata_.value()) { + Array<PrimExpr> values; + values.reserve(kv.second.size()); + for (const PrimExpr &expr : kv.second) { + values.push_back(tir::Substitute(expr, var_remap)); + } + remapped_metadata.Set(kv.first, values); + } + device_attrs.Set(kDynSharedMemoryMultiVersionMetadata, + remapped_metadata); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (dyn_shared_multi_version_metadata_.defined()) {

device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,

dyn_shared_multi_version_metadata_.value());

}

if (dyn_shared_multi_version_metadata_.defined()) {

Map<String, Array<PrimExpr>> remapped_metadata;

for (const auto &kv : dyn_shared_multi_version_metadata_.value()) {

Array<PrimExpr> values;

values.reserve(kv.second.size());

for (const PrimExpr &expr : kv.second) {

values.push_back(tir::Substitute(expr, var_remap));

}

remapped_metadata.Set(kv.first, values);

}

device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,

remapped_metadata);

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/transform/split_host_device.cc` around lines 371 - 374, The dynamic shared-memory metadata stored in dyn_shared_multi_version_metadata_ is still using host-side Var symbols and must be remapped to the fresh device parameter Vars created by SplitDeviceFunc; update the code that sets device_attrs (kDynSharedMemoryMultiVersionMetadata) to traverse the metadata expression(s) and substitute host Var -> corresponding device param Var (the ones created/returned by SplitDeviceFunc) before calling device_attrs.Set, ensuring any symbolic sizes/strides reference the device parameters rather than undeclared host Vars. Locate the substitution logic used for the function body in SplitDeviceFunc and reuse its mapping routine (or create a similar Var->Var substitution util) to transform dyn_shared_multi_version_metadata_.value() prior to assignment.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/target/codegen_cuda.cc (1)

269-274: ⚠️ Potential issue | 🟠 Major

Keep fallback alias matching disabled when size metadata is missing.

This stale issue is still present: 4-field metadata sets total_byte_size to zero, and the fallback path treats that as “no range check”, so any same-dtype alias can claim an unrelated dynamic shared-memory access when exact-name matching misses. Either reject fallback matches without total_byte_size, or stop accepting 4-field entries for alias fallback.

Proposed fix

-    if (require_range_check && !is_zero(info.total_byte_size) &&
-        (!analyzer.CanProve(shifted >= zero) ||
-         !analyzer.CanProve(
-             shifted <
-             analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) {
-      return std::optional<DynSharedAliasAccess>();
+    if (require_range_check) {
+      if (is_zero(info.total_byte_size)) {
+        return std::optional<DynSharedAliasAccess>();
+      }
+      PrimExpr total_elements =
+          analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes));
+      if (!analyzer.CanProve(shifted >= zero) ||
+          !analyzer.CanProve(shifted < total_elements)) {
+        return std::optional<DynSharedAliasAccess>();
+      }
     }

Also applies to: 306-310, 4904-4918

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 269 - 274, The fallback alias
matching currently treats info.total_byte_size == 0 as “no range check” and
allows same-dtype fallback matches; change the logic to reject fallback matches
when size metadata is missing by requiring a non-zero total_byte_size before
taking the fallback path. Concretely, update the condition(s) surrounding
require_range_check and analyzer.CanProve(...) that return
std::optional<DynSharedAliasAccess() to also check
!is_zero(info.total_byte_size) (or equivalently require info.total_byte_size !=
0) so that if total_byte_size is absent/zero the function does not fall back to
same-dtype aliasing; apply the same change to the other analogous sites that use
require_range_check, info.total_byte_size, indexdiv(..., elem_bytes), and return
std::optional<DynSharedAliasAccess()).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4011-4024: The generated code adds byte offsets to the buffer
pointer (vid / buf_dyn_shmem) without accounting for element size; cast the base
pointer to unsigned char* before doing pointer arithmetic and then
reinterpret_cast the resulting address back to the element type. Concretely,
update the branches that emit aliasing code (the block that writes
alias.alias_name using PrintType/PrintExpr) to compute the address as
reinterpret_cast<decltype(nullptr)>(reinterpret_cast<unsigned char*> (vid) +
PrintExpr(alias.byte_offset) + ... ) — i.e., cast vid to unsigned char* before
adding alias.byte_offset and alias.stage_stride_bytes, then reinterpret_cast the
final computed byte address to the element pointer type printed by PrintType so
alias pointers point to correct byte addresses.

---

Duplicate comments:
In `@src/target/codegen_cuda.cc`:
- Around line 269-274: The fallback alias matching currently treats
info.total_byte_size == 0 as “no range check” and allows same-dtype fallback
matches; change the logic to reject fallback matches when size metadata is
missing by requiring a non-zero total_byte_size before taking the fallback path.
Concretely, update the condition(s) surrounding require_range_check and
analyzer.CanProve(...) that return std::optional<DynSharedAliasAccess() to also
check !is_zero(info.total_byte_size) (or equivalently require
info.total_byte_size != 0) so that if total_byte_size is absent/zero the
function does not fall back to same-dtype aliasing; apply the same change to the
other analogous sites that use require_range_check, info.total_byte_size,
indexdiv(..., elem_bytes), and return std::optional<DynSharedAliasAccess()).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5797ff70-4c78-40a3-a293-d25c948a2edf

📥 Commits

Reviewing files that changed from the base of the PR and between 2d56736 and 01219c0.

📒 Files selected for processing (2)

src/target/codegen_cuda.cc
src/target/codegen_cuda.h

🚧 Files skipped from review as they are similar to previous changes (1)

src/target/codegen_cuda.h

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

src/target/codegen_cuda.cc (2)

4015-4037: ⚠️ Potential issue | 🔴 Critical

Alias pointer arithmetic still uses element-sized stride on a typed base.

vid is declared as <op->dtype> vid[] (not byte-typed), so vid + alias.byte_offset advances by byte_offset * sizeof(op->dtype) rather than byte_offset bytes. For any non-byte element type this produces aliases that point outside the intended region. This was flagged on the previous revision and the emitted arithmetic is unchanged.

Suggested fix

         if (is_zero(alias.stage_stride_bytes)) {
           PrintType(alias.dtype, stream);
           stream << "* " << alias.alias_name << " = reinterpret_cast<";
           PrintType(alias.dtype, stream);
-          stream << "*>(" << vid << " + " << PrintExpr(alias.byte_offset)
-                 << ");\n";
+          stream << "*>(reinterpret_cast<unsigned char*>(" << vid << ") + "
+                 << PrintExpr(alias.byte_offset) << ");\n";
         } else {
           stream << "auto " << alias.alias_name
                  << " = tl::PatternVisitor([&](const int i) { return "
                     "reinterpret_cast<";
           PrintType(alias.dtype, stream);
-          stream << "*>(" << vid << " + " << PrintExpr(alias.byte_offset)
-                 << " + (i * " << PrintExpr(alias.stage_stride_bytes)
-                 << ")); });\n";
+          stream << "*>(reinterpret_cast<unsigned char*>(" << vid << ") + "
+                 << PrintExpr(alias.byte_offset) << " + (i * "
+                 << PrintExpr(alias.stage_stride_bytes) << ")); });\n";
         }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 4015 - 4037, The alias pointer
arithmetic is adding byte offsets onto vid which is declared as a typed array
(vid), so replace arithmetic like "vid + PrintExpr(alias.byte_offset) [+ ...]"
by first casting vid to a byte pointer (e.g. reinterpret_cast<std::byte*>(vid)
or uint8_t*) before adding byte offsets and then reinterpret_cast back to the
target element pointer in both the simple and PatternVisitor branches; update
usages around dyn_shared_aliases_, alias.byte_offset and
alias.stage_stride_bytes where PrintType and reinterpret_cast are emitted so the
addition is done in bytes rather than in units of op->dtype.

279-285: ⚠️ Potential issue | 🟠 Major

Fallback still matches aliases when total_byte_size is zero.

The 4-field metadata path (AddFunction at Line 4927) sets total_byte_size = 0, and this condition !is_zero(info.total_byte_size) && (...) then short-circuits the range check. The fallback loop at Lines 316-321 therefore happily claims any same-dtype access with a size-less alias. This was raised previously and the code is unchanged — please address it or confirm intent.

Suggested fix

-    if (require_range_check && !is_zero(info.total_byte_size) &&
-        (!analyzer.CanProve(shifted >= zero) ||
-         !analyzer.CanProve(
-             shifted <
-             analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) {
-      return std::optional<DynSharedAliasAccess>();
+    if (require_range_check) {
+      if (is_zero(info.total_byte_size)) {
+        return std::optional<DynSharedAliasAccess>();
+      }
+      PrimExpr total_elements =
+          analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes));
+      if (!analyzer.CanProve(shifted >= zero) ||
+          !analyzer.CanProve(shifted < total_elements)) {
+        return std::optional<DynSharedAliasAccess>();
+      }
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 279 - 285, The fallback
alias-matching loop wrongly treats entries with info.total_byte_size == 0 as
matching any same-dtype access; change the logic so zero-sized metadata never
matches: in the range-check branch and/or the fallback loop that constructs
DynSharedAliasAccess, add a guard on info.total_byte_size (e.g. skip/continue
when is_zero(info.total_byte_size) or otherwise require
!is_zero(info.total_byte_size)) so aliases with total_byte_size == 0 are not
accepted as matches; update uses of require_range_check, analyzer, and the
fallback loop that creates DynSharedAliasAccess to include this non-zero-size
check.

🧹 Nitpick comments (1)

src/target/codegen_cuda.cc (1)
4900-4951: Metadata parsing LGTM; one small robustness note.

Parsing and sorting logic are sound: dedicated handling of the size-4/size-6 schemas matches ExportBufferAliasMetadata, the sort makes alias-name generation deterministic across Map<String, Array<PrimExpr>> iteration orders, and the final emit_named_smem_pointers_ = !dyn_shared_aliases_.empty() correctly disables emission when no aliases survive parsing.

Minor: malformed entries (wrong size, non-IntImm dtype fields) are silently dropped at Lines 4914-4923. Given this feature is gated behind an opt-in pass config, a DLOG(WARNING) or similar would make schema drift easier to diagnose. Not a blocker.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 4900 - 4951, The parsing loop
silently drops malformed dynamic-shared-memory alias entries; add diagnostics so
issues are discoverable by users. In the block guarded by
emit_named_smem_pointers_ where you iterate opt.value() (the loop that checks
metadata.size() and validates code/bits/lanes), emit a DLOG(WARNING) or VLOG
with a concise message including the alias key (kv.first) and reason (bad size
or non-IntImm fields) whenever you continue; keep using the same variables
(metadata, kv.first) and avoid changing control flow—only log before the
continue—so dyn_shared_aliases_, dyn_shared_alias_order_,
emit_named_smem_pointers_, and name_supply_->FreshName behavior remains
identical.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 246-323: TryMatchDynSharedAliasAccess can divide by zero when
stage_stride (computed from info.stage_stride_bytes via indexdiv) collapses to
zero; after computing stage_stride (the variable set from
indexdiv(info.stage_stride_bytes, elem_bytes)) add a defensive check that if
is_zero(stage_stride) (or analyzer.CanProve(stage_stride == 0)) then return
std::nullopt to avoid subsequent indexdiv/indexmod with a zero divisor; update
the branch that handles non-zero stage_stride_bytes (and the following path that
computes access.stage_index/access.element_index) to rely on this check.
Optionally, to improve performance, avoid reconstructing arith::Analyzer on
every call to TryMatchDynSharedAliasAccess by reusing/caching a single Analyzer
instance that is invalidated when Bind/Unbind occurs (ensure
dyn_shared_alias_var_ranges_ rebinding remains correct).

---

Duplicate comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4015-4037: The alias pointer arithmetic is adding byte offsets
onto vid which is declared as a typed array (vid), so replace arithmetic like
"vid + PrintExpr(alias.byte_offset) [+ ...]" by first casting vid to a byte
pointer (e.g. reinterpret_cast<std::byte*>(vid) or uint8_t*) before adding byte
offsets and then reinterpret_cast back to the target element pointer in both the
simple and PatternVisitor branches; update usages around dyn_shared_aliases_,
alias.byte_offset and alias.stage_stride_bytes where PrintType and
reinterpret_cast are emitted so the addition is done in bytes rather than in
units of op->dtype.
- Around line 279-285: The fallback alias-matching loop wrongly treats entries
with info.total_byte_size == 0 as matching any same-dtype access; change the
logic so zero-sized metadata never matches: in the range-check branch and/or the
fallback loop that constructs DynSharedAliasAccess, add a guard on
info.total_byte_size (e.g. skip/continue when is_zero(info.total_byte_size) or
otherwise require !is_zero(info.total_byte_size)) so aliases with
total_byte_size == 0 are not accepted as matches; update uses of
require_range_check, analyzer, and the fallback loop that creates
DynSharedAliasAccess to include this non-zero-size check.

---

Nitpick comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4900-4951: The parsing loop silently drops malformed
dynamic-shared-memory alias entries; add diagnostics so issues are discoverable
by users. In the block guarded by emit_named_smem_pointers_ where you iterate
opt.value() (the loop that checks metadata.size() and validates
code/bits/lanes), emit a DLOG(WARNING) or VLOG with a concise message including
the alias key (kv.first) and reason (bad size or non-IntImm fields) whenever you
continue; keep using the same variables (metadata, kv.first) and avoid changing
control flow—only log before the continue—so dyn_shared_aliases_,
dyn_shared_alias_order_, emit_named_smem_pointers_, and name_supply_->FreshName
behavior remains identical.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba99e40f-1a99-4f61-8c38-e515c86993a3

📥 Commits

Reviewing files that changed from the base of the PR and between 01219c0 and 0a9ccbf.

📒 Files selected for processing (2)

src/target/codegen_cuda.cc
src/transform/merge_shared_memory_allocations.cc

✅ Files skipped from review due to trivial changes (1)

src/transform/merge_shared_memory_allocations.cc

coderabbitai · 2026-04-19T10:30:58Z

+std::optional<CodeGenTileLangCUDA::DynSharedAliasAccess>
+CodeGenTileLangCUDA::TryMatchDynSharedAliasAccess(const BufferNode *buffer,
+                                                  PrimExpr index) const {
+  if (!emit_named_smem_pointers_) {
+    return std::nullopt;
+  }
+  std::string scope;
+  if (alloc_storage_scope_.count(buffer->data.get())) {
+    scope = alloc_storage_scope_.at(buffer->data.get());
+  }
+  if (scope.empty()) {
+    scope = GetPtrStorageScope(buffer->data);
+  }
+  if (scope != "shared.dyn") {
+    return std::nullopt;
+  }
+  int elem_bytes = buffer->dtype.bytes() * buffer->dtype.lanes();
+  if (elem_bytes <= 0) {
+    return std::nullopt;
+  }
+  arith::Analyzer analyzer;
+  for (const auto &kv : dyn_shared_alias_var_ranges_) {
+    analyzer.Bind(GetRef<Var>(kv.first), kv.second);
+  }
+  auto try_match = [&](const DynSharedAliasInfo &info,
+                       bool require_range_check) {
+    if (info.dtype != buffer->dtype) {
+      return std::optional<DynSharedAliasAccess>();
+    }
+    PrimExpr base_offset =
+        analyzer.Simplify(indexdiv(info.byte_offset, elem_bytes));
+    PrimExpr shifted = analyzer.Simplify(index - base_offset);
+    PrimExpr zero = make_const(shifted.dtype(), 0);
+    if (require_range_check && !is_zero(info.total_byte_size) &&
+        (!analyzer.CanProve(shifted >= zero) ||
+         !analyzer.CanProve(
+             shifted <
+             analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) {
+      return std::optional<DynSharedAliasAccess>();
+    }
+    DynSharedAliasAccess access;
+    access.info = &info;
+    if (is_zero(info.stage_stride_bytes)) {
+      access.element_index = shifted;
+      return std::optional<DynSharedAliasAccess>(std::move(access));
+    }
+    PrimExpr stage_stride =
+        analyzer.Simplify(indexdiv(info.stage_stride_bytes, elem_bytes));
+    if (!dyn_shared_stage_expr_stack_.empty()) {
+      PrimExpr stage_index = dyn_shared_stage_expr_stack_.back();
+      PrimExpr stage_shifted =
+          analyzer.Simplify(shifted - stage_index * stage_stride);
+      if (analyzer.CanProve(stage_shifted >= zero) &&
+          analyzer.CanProve(stage_shifted < stage_stride)) {
+        access.stage_index = stage_index;
+        access.element_index = stage_shifted;
+        return std::optional<DynSharedAliasAccess>(std::move(access));
+      }
+    }
+    access.stage_index = analyzer.Simplify(indexdiv(shifted, stage_stride));
+    access.element_index = analyzer.Simplify(indexmod(shifted, stage_stride));
+    return std::optional<DynSharedAliasAccess>(std::move(access));
+  };
+
+  auto it = dyn_shared_aliases_.find(buffer->name);
+  if (it != dyn_shared_aliases_.end()) {
+    if (auto access = try_match(it->second, false)) {
+      return access;
+    }
+  }
+  for (const auto &buffer_name : dyn_shared_alias_order_) {
+    const auto &info = dyn_shared_aliases_.at(buffer_name);
+    if (auto access = try_match(info, true)) {
+      return access;
+    }
+  }
+  return std::nullopt;
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash rg -nP --type=cpp -C5 'stage_stride_bytes\s*=' src/transform/

Repository: tile-ai/tilelang

Length of output: 2465

🏁 Script executed:

#!/bin/bash # Get context around total_bytes computation and stage_stride_bytes assignment rg -nP --type=cpp -B20 -A10 'stage_stride_bytes\s*=' src/transform/merge_shared_memory_allocations.cc | head -120

Repository: tile-ai/tilelang

Length of output: 2298

🏁 Script executed:

#!/bin/bash # Check if there are any assertions or validity checks for stage_stride_bytes or elem_bytes in the codegen file rg -nP --type=cpp 'stage_stride|elem_bytes' src/target/codegen_cuda.cc | head -30

Repository: tile-ai/tilelang

Length of output: 982

🏁 Script executed:

#!/bin/bash # Look for DynSharedAliasInfo struct definition and any comments about invariants rg -nP --type=cpp -B5 -A15 'struct DynSharedAliasInfo' src/

Repository: tile-ai/tilelang

Length of output: 1313

🏁 Script executed:

#!/bin/bash # Find where buffer_num_stages_ is populated rg -nP --type=cpp 'buffer_num_stages_\s*\[' src/transform/merge_shared_memory_allocations.cc | head -20

Repository: tile-ai/tilelang

Length of output: 118

🏁 Script executed:

#!/bin/bash # Look for any validation/alignment logic in buffer allocation rg -nP --type=cpp -B3 -A3 'num_stages.*>.*1' src/transform/merge_shared_memory_allocations.cc | head -40

Repository: tile-ai/tilelang

Length of output: 821

🏁 Script executed:

#!/bin/bash # Check if there are any comments about the invariants in DynSharedAliasInfo or related metadata rg -nP --type=cpp -B10 'stage_stride_bytes' src/target/codegen_cuda.h

Repository: tile-ai/tilelang

Length of output: 556

🏁 Script executed:

#!/bin/bash # Check the AllocateNode structure and understand buffer allocation constraints rg -nP --type=cpp -B5 -A10 'class AllocateNode' src/

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

#!/bin/bash # Look for any padding/alignment logic in buffer allocation rg -nP --type=cpp 'AlignTo|align|padding' src/transform/merge_shared_memory_allocations.cc | head -30

Repository: tile-ai/tilelang

Length of output: 1723

🏁 Script executed:

#!/bin/bash # Check if there are any assertions or validation that buffers are properly sized rg -nP --type=cpp -B2 -A2 'total_bytes|stage_stride_bytes' src/transform/merge_shared_memory_allocations.cc | grep -A3 -B3 'CHECK\|ASSERT\|DCHECK'

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

#!/bin/bash # Look for where num_stages is derived from and what constraints it has rg -nP --type=cpp -B10 'kPipelineMVBContextNumStages' src/transform/merge_shared_memory_allocations.cc | head -50

Repository: tile-ai/tilelang

Length of output: 503

🏁 Script executed:

#!/bin/bash # Check if there's any documentation or comments about stage_stride_bytes guarantees rg -nP --type=cpp -B5 -A5 'stage_stride_bytes' src/target/codegen_cuda.cc | grep -A6 -B6 'is_zero'

Repository: tile-ai/tilelang

Length of output: 1208

🏁 Script executed:

#!/bin/bash # Search for any validation that ensures proper alignment of allocations rg -nP --type=cpp 'stage_stride_bytes.*make_const' src/transform/merge_shared_memory_allocations.cc -A5 -B5

Repository: tile-ai/tilelang

Length of output: 608

🏁 Script executed:

#!/bin/bash # Understand the complete picture: trace how extents are computed and if they're constrained rg -nP --type=cpp -B20 'for.*const PrimExpr &extent : alloc->extents' src/transform/merge_shared_memory_allocations.cc | head -60

Repository: tile-ai/tilelang

Length of output: 1938

🏁 Script executed:

#!/bin/bash # Check if there are any pipeline-related constraints on buffer sizing rg -nP --type=cpp 'kPipelineMVBContextNumStages' src/ -B10 -A10 | head -80

Repository: tile-ai/tilelang

Length of output: 6771

🏁 Script executed:

#!/bin/bash # Look at how the metadata is consumed - are there any safeguards in the codegen? rg -nP --type=cpp -B3 -A8 'stage_stride =.*indexdiv.*stage_stride_bytes' src/target/codegen_cuda.cc

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

#!/bin/bash # Check if there's a pattern of using is_zero guard after indexdiv elsewhere in the codebase rg -nP --type=cpp 'is_zero.*indexdiv\|indexdiv.*is_zero' src/ | head -20

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

#!/bin/bash # Look at the surrounding code context to see if stage_stride value 0 would have other consequences rg -nP --type=cpp -A15 'stage_stride = analyzer.Simplify.*indexdiv' src/target/codegen_cuda.cc

Repository: tile-ai/tilelang

Length of output: 42

Add defensive check to prevent divide-by-zero on stage_stride.

Line 293: stage_stride = indexdiv(stage_stride_bytes, elem_bytes) can collapse to zero if stage_stride_bytes < elem_bytes, yet the subsequent indexdiv(shifted, stage_stride) / indexmod(shifted, stage_stride) at lines 305–306 don't guard against it. The is_zero(info.stage_stride_bytes) check at line 288 prevents a zero denominator only in that specific branch, not after the indexdiv at line 293. In practice, merge_shared_memory_allocations.cc path 1 (lines 578–580) always produces a multiple of elem_bytes, but path 2 (lines 584–585) uses indexdiv(total_bytes, num_stages) with no guarantee of alignment. Add an early is_zero(stage_stride) bail-out after line 293 to be defensive.

Additionally, line 266–269: A fresh arith::Analyzer is constructed and re-bound on every access. For large kernels with many dyn-shared accesses this is O(N_accesses × N_ranges). Caching the analyzer (invalidated on Bind/Unbind) would improve codegen performance, though it's optional.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 246 - 323, TryMatchDynSharedAliasAccess can divide by zero when stage_stride (computed from info.stage_stride_bytes via indexdiv) collapses to zero; after computing stage_stride (the variable set from indexdiv(info.stage_stride_bytes, elem_bytes)) add a defensive check that if is_zero(stage_stride) (or analyzer.CanProve(stage_stride == 0)) then return std::nullopt to avoid subsequent indexdiv/indexmod with a zero divisor; update the branch that handles non-zero stage_stride_bytes (and the following path that computes access.stage_index/access.element_index) to rely on this check. Optionally, to improve performance, avoid reconstructing arith::Analyzer on every call to TryMatchDynSharedAliasAccess by reusing/caching a single Analyzer instance that is invalidated when Bind/Unbind occurs (ensure dyn_shared_alias_var_ranges_ rebinding remains correct).

kurisu6912

This pr add smem alias by record aliases in merge_shared_memory_allocations pass, and then match the alias in codegen passes.

At the allocation pass, use prim func attrs to record each buffer's allocation byte range.
At the codegen pass, use Analyzer to prove the buffer's index resides in that range, and find out its alias.

Generally this is a very good solution, thanks for contribution!

kurisu6912 · 2026-04-22T07:10:43Z

@Rachmanino Tilelang has FP4 buffers, the current pr uses stride to compute the alias, does it handle fp4 buffers correctly?

Sometime fp4 stride is 0.5x, and shape is 2x, which is often counterintuitive.

Rachmanino · 2026-04-22T11:28:34Z

@Rachmanino Tilelang has FP4 buffers, the current pr uses stride to compute the alias, does it handle fp4 buffers correctly?

Sometime fp4 stride is 0.5x, and shape is 2x, which is often counterintuitive.

Thanks for your insight! I'll check this when im available. Besides, CI has passed with this feature enabled, and I guess that there's already some fp4 examples in our tests?

Rachmanino · 2026-04-22T12:40:47Z

I guess you are right, thx! I'll fix this soon and rebase to eliminate conflict

… test

LeiWang1999 · 2026-05-20T06:46:26Z

pending and wait for #2216 then we can take a look for if there exist better solutions.

LeiWang1999 · 2026-05-21T03:30:58Z

Closed. tir has been rebased into tir, where alloc is now a single alloc_buffer statement instead of a declaration and allocation with a body. We also have handle_add_byte_offset now, which makes implementing this feature much more convenient. A separate issue will be opened for that.

coderabbitai Bot reviewed Apr 19, 2026

View reviewed changes

Comment thread src/transform/merge_shared_memory_allocations.cc

coderabbitai Bot reviewed Apr 19, 2026

View reviewed changes

Comment thread src/target/codegen_cuda.cc

coderabbitai Bot reviewed Apr 19, 2026

View reviewed changes

Rachmanino force-pushed the worktree-issue-2059 branch 2 times, most recently from 71ac4a8 to 2540b79 Compare April 20, 2026 15:11

Rachmanino requested a review from LeiWang1999 April 21, 2026 05:45

kurisu6912 previously approved these changes Apr 22, 2026

View reviewed changes

Rachmanino added 8 commits April 22, 2026 21:01

feat(cuda): emit named aliases for merged dynamic shared memory

17a3f86

feat(cuda): add stage-aware aliases for multiversioned shared memory

3d5196d

feat(cuda): rewrite cp.async accesses with shared aliases

a0063a7

fix(cuda): limit shared alias matching to dynamic smem

1d75fcc

lint

a0f208d

test(cuda): add CUDA requirement for raw dynamic shared memory access…

22f9553

… test

test(cuda): accept simplified stage alias output

92636ab

fix(cuda): handle fp4 dynamic shared aliases

93f87f3

Rachmanino dismissed kurisu6912’s stale review via 93f87f3 April 22, 2026 13:03

Rachmanino force-pushed the worktree-issue-2059 branch from 8758136 to 93f87f3 Compare April 22, 2026 13:03

lint

5f3c6bb

LeiWang1999 closed this May 21, 2026

-    if (dyn_shared_multi_version_metadata_.defined()) {
-      device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
-                       dyn_shared_multi_version_metadata_.value());
-    }
+    if (dyn_shared_multi_version_metadata_.defined()) {
+      Map<String, Array<PrimExpr>> remapped_metadata;
+      for (const auto &kv : dyn_shared_multi_version_metadata_.value()) {
+        Array<PrimExpr> values;
+        values.reserve(kv.second.size());
+        for (const PrimExpr &expr : kv.second) {
+          values.push_back(tir::Substitute(expr, var_remap));
+        }
+        remapped_metadata.Set(kv.first, values);
+      }
+      device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
+                       remapped_metadata);
+    }

Conversation

Rachmanino commented Apr 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

coderabbitai Bot commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

kurisu6912 left a comment

Choose a reason for hiding this comment

Uh oh!

kurisu6912 commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Rachmanino commented Apr 22, 2026

Uh oh!

Rachmanino commented Apr 22, 2026

Uh oh!

LeiWang1999 commented May 20, 2026

Uh oh!

LeiWang1999 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Rachmanino commented Apr 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 19, 2026 •

edited

Loading

kurisu6912 commented Apr 22, 2026 •

edited

Loading