Skip to content

[Feature] Emit named aliases for dynamic shared memory in CUDA codegen for better readability#2061

Closed
Rachmanino wants to merge 9 commits into
tile-ai:mainfrom
Rachmanino:worktree-issue-2059
Closed

[Feature] Emit named aliases for dynamic shared memory in CUDA codegen for better readability#2061
Rachmanino wants to merge 9 commits into
tile-ai:mainfrom
Rachmanino:worktree-issue-2059

Conversation

@Rachmanino

@Rachmanino Rachmanino commented Apr 19, 2026

Copy link
Copy Markdown
Collaborator

This PR adds an opt-in CUDA codegen path for improving the readability of merged shared.dyn buffers by preserving DSL-levelshared-memory names in generated source.

What it does:

  • Adds a new pass config: tl.emit_named_smem_pointers
  • Extends MergeSharedMemoryAllocations to emit alias metadata for merged dynamic shared-memory buffers
  • Updates CUDA codegen to emit named aliases such as:
    • half_t* A_shared = reinterpret_cast<half_t*>(buf_dyn_shmem + ...);
  • For multiversioned / pipelined shared buffers, emits stage-aware aliases using a small helper:
    • auto A_shared = tl::PatternVisitor([&](int i) { ... });
  • Rewrites many downstream shared-memory accesses to use those aliases instead of raw buf_dyn_shmem + offset expressions
  • Covers both non-WS and WS/TMA paths for common GEMM-style kernels, including stage-based descriptor initialization and many cp.async / tma_load sites

Current limitations / not fully solved yet:

  • The implementation is strongest when the logical buffer identity is still visible in IR (e.g. normal BufferLoad/BufferStore,
    many staged producer accesses).
  • Some paths still lose logical buffer provenance after lowering to tvm_access_ptr(data, offset, extent, rw_mask), especially
    when multiple logical buffers are merged into the same physical buf_dyn_shmem region.
  • Because of that, in some cases codegen can still choose a physically correct but semantically less desirable alias. For
    example, an address logically corresponding to C_shared may still be rendered as A_shared[0] if both map to the same merged
    address range and only physical address information remains.
  • In other words, this PR substantially improves readability, but it does not yet fully preserve logical buffer provenance end-
    to-end through every pass that rewrites tvm_access_ptr.

Follow-up direction:

  • A more complete solution would preserve a stable logical-buffer identity across lowering (especially for tma_store-style
    paths), instead of relying on address-based alias recovery after buffer merging.

Summary by CodeRabbit

  • New Features

    • CUDA codegen can emit named, typed pointer aliases for merged dynamic shared memory; controlled by a new pass config (defaults to True).
    • Multi-version dynamic shared-memory metadata is propagated so device code can emit stage-aware aliases for pipelined kernels.
  • Tests

    • Added CUDA tests validating default/raw behavior, named alias emission, and pattern-visitor based multiversion indexing.

@github-actions

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds support to emit named typed pointer aliases for dynamic shared-memory: new pass-config and builtin keys, metadata production in merge/multi-version passes, propagation through host/device splitting, CUDA codegen to consume alias metadata and remap shared.dyn accesses, a PatternVisitor helper, and tests.

Changes

Cohort / File(s) Summary
Pass config & builtin keys
src/op/builtin.cc, src/op/builtin.h, tilelang/transform/pass_config.py
Register new pass-config key tl.emit_named_smem_pointers / kEmitNamedSmemPointers and add the corresponding attribute constant used to control emitting named shared.dyn pointer aliases.
CUDA codegen
src/target/codegen_cuda.h, src/target/codegen_cuda.cc
Add internal types/state and helpers to parse tl.dyn_shared_memory_alias_metadata, enable emit_named_smem_pointers_, emit per-alias typed pointer views at shared.dyn Allocate, and remap BufferLoad/BufferStore/address_of to use alias-based addresses with stage-aware handling and loop-range binding.
Shared-memory merging & metadata export
src/transform/merge_shared_memory_allocations.cc, src/transform/multi_version_buffer_rewriter.cc
Merge pass now computes/returns dynamic alias metadata (offset, size, stage stride, dtype) and may attach tl.dyn_shared_memory_alias_metadata; multi-version rewriter exports tl.dyn_shared_memory_multiversion_metadata. Signature of internal merge API updated to return alias map.
Host/device attribute propagation
src/transform/split_host_device.cc
HostDeviceSplitter reads/writes tl.dyn_shared_memory_multiversion_metadata on host/device PrimFunc attributes and propagates the metadata to generated device functions.
CUDA templates
src/tl_templates/cuda/common.h
Add tl::PatternVisitor<Fn> device-callable helper (constructor, operator[], CTAD) used to emit indexed alias views for multiversion dyn-shared aliases.
Tests
testing/python/components/test_cuda_named_smem_alias_codegen.py
Add CUDA tests verifying default raw dyn.smem access, emission of named typed alias pointers when enabled, and PatternVisitor-based indexed emission for multiversion dyn.smem aliases.

Sequence Diagram(s)

sequenceDiagram
    participant MergePass as MergePass\n(MergeSharedMemoryAllocations)
    participant MVBRewriter as MultiVersionRewriter
    participant HostSplitter as HostDeviceSplitter
    participant PrimFunc as PrimFunc (attrs)
    participant CodeGen as CodeGenTileLangCUDA
    participant CUDAOut as Generated CUDA

    MergePass->>MVBRewriter: consult/produce multiversion metadata
    MergePass->>MergePass: compute dyn_shared_memory_alias_metadata
    MergePass->>PrimFunc: attach alias metadata attribute
    MVBRewriter->>PrimFunc: attach multiversion metadata (if any)
    PrimFunc->>HostSplitter: split host/device
    HostSplitter->>PrimFunc: propagate multiversion metadata to device PrimFunc
    PrimFunc->>CodeGen: AddFunction() reads alias & multiversion attrs
    CodeGen->>CodeGen: build alias tables, set emit_named_smem_pointers_
    CodeGen->>CodeGen: emit alias pointer views at Allocate
    CodeGen->>CodeGen: remap loads/stores/address_of to alias addresses
    CodeGen->>CUDAOut: emit final CUDA source with alias declarations and remapped accesses
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • LeiWang1999

Poem

🐰 I nibble bytes in shared mem rows,
Names for pointers now proudly show.
Aliases hop from offset to lane,
Pattern visitors count the chain.
Hooray — kernels bound and ready to go!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding named aliases for dynamic shared memory in CUDA codegen to improve readability.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/transform/merge_shared_memory_allocations.cc`:
- Around line 578-580: ExportBufferAliasMetadata() is being called before
buffer_names_ is populated in rewriter(...), so metadata falls back to
buffer_var->name_hint instead of using high-level Buffer names; fix by moving
the ExportBufferAliasMetadata() invocation (or otherwise deferring export) to
run after rewriter(...) populates buffer_names_, or ensure buffer_names_ is
filled earlier (e.g., populate buffer_names_ when visiting Buffers or at the
start of rewriter) so that ExportBufferAliasMetadata() reads the populated
buffer_names_ map rather than the fallback name_hint. Reference symbols:
buffer_names_, rewriter(...), ExportBufferAliasMetadata(),
buffer_var->name_hint, node->buffer.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f1479d40-c532-4e65-a066-0297b65839a4

📥 Commits

Reviewing files that changed from the base of the PR and between 0924dab and 5cecdae.

📒 Files selected for processing (6)
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/target/codegen_cuda.cc
  • src/target/codegen_cuda.h
  • src/transform/merge_shared_memory_allocations.cc
  • tilelang/transform/pass_config.py

Comment thread src/transform/merge_shared_memory_allocations.cc

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/transform/merge_shared_memory_allocations.cc (1)

1552-1554: ⚠️ Potential issue | 🟠 Major

Export alias metadata after rewriting populates buffer_names_.

buffer_names_ is filled in VisitBufferAccess(), so exporting before rewriter(...) still misses high-level Buffer names and can also miss multi_version_metadata_ keyed by those names.

Proposed fix
     SharedMemoryRewriter rewriter(collector.dyn_shmem_allocs_, true, verbose,
                                   align_bytes, multi_version_metadata);
     rewriter.PlanReuse(stmt, true, enable_aggressive_merge);
-    dyn_alias_metadata = rewriter.ExportBufferAliasMetadata();
     stmt = rewriter(std::move(stmt));
+    dyn_alias_metadata = rewriter.ExportBufferAliasMetadata();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/merge_shared_memory_allocations.cc` around lines 1552 - 1554,
The ExportBufferAliasMetadata call is happening before the rewriter populates
buffer_names_ (populated in VisitBufferAccess), so move the
rewriter.ExportBufferAliasMetadata() invocation to after you invoke the rewriter
(i.e., after stmt = rewriter(std::move(stmt))); ensure you still call
rewriter.PlanReuse(...) beforehand as needed and then call
ExportBufferAliasMetadata() so multi_version_metadata_ and buffer_names_ are
captured from the rewritten IR.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 266-271: The fallback alias-matching currently treats
info.total_byte_size==0 as "no range check" allowing same-dtype buffers to
incorrectly claim dynamic shared-memory accesses; update the fallback logic that
returns std::optional<DynSharedAliasAccess>() to first detect missing size
metadata (info.total_byte_size == 0) and refuse to match (i.e., skip/continue
fallback) instead of disabling range checks — change the condition around
require_range_check / analyzer.CanProve(...) to enforce a size-present check,
and apply the same change in the other identical fallback block (the one at the
later location referenced in the review) so aliases without size metadata cannot
be matched.

In `@src/transform/split_host_device.cc`:
- Around line 371-374: The dynamic shared-memory metadata stored in
dyn_shared_multi_version_metadata_ is still using host-side Var symbols and must
be remapped to the fresh device parameter Vars created by SplitDeviceFunc;
update the code that sets device_attrs (kDynSharedMemoryMultiVersionMetadata) to
traverse the metadata expression(s) and substitute host Var -> corresponding
device param Var (the ones created/returned by SplitDeviceFunc) before calling
device_attrs.Set, ensuring any symbolic sizes/strides reference the device
parameters rather than undeclared host Vars. Locate the substitution logic used
for the function body in SplitDeviceFunc and reuse its mapping routine (or
create a similar Var->Var substitution util) to transform
dyn_shared_multi_version_metadata_.value() prior to assignment.

In `@testing/python/components/test_cuda_named_smem_alias_codegen.py`:
- Around line 61-62: The test function
test_codegen_keeps_raw_dyn_smem_access_by_default is running CUDA code via
_compile_kernel() but lacks the CUDA guard; add the CUDA gate by decorating the
test with the existing requires_cuda marker (e.g., `@pytest.mark.requires_cuda` or
the project's requires_cuda decorator) so the test is skipped on CPU-only CI;
update the test definition for test_codegen_keeps_raw_dyn_smem_access_by_default
to include that decorator and ensure imports/reference to pytest/marker are
present.

---

Duplicate comments:
In `@src/transform/merge_shared_memory_allocations.cc`:
- Around line 1552-1554: The ExportBufferAliasMetadata call is happening before
the rewriter populates buffer_names_ (populated in VisitBufferAccess), so move
the rewriter.ExportBufferAliasMetadata() invocation to after you invoke the
rewriter (i.e., after stmt = rewriter(std::move(stmt))); ensure you still call
rewriter.PlanReuse(...) beforehand as needed and then call
ExportBufferAliasMetadata() so multi_version_metadata_ and buffer_names_ are
captured from the rewritten IR.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1424a87f-7fe0-415b-8eb0-e69425b3f8fe

📥 Commits

Reviewing files that changed from the base of the PR and between 5cecdae and 2d56736.

📒 Files selected for processing (7)
  • src/target/codegen_cuda.cc
  • src/target/codegen_cuda.h
  • src/tl_templates/cuda/common.h
  • src/transform/merge_shared_memory_allocations.cc
  • src/transform/multi_version_buffer_rewriter.cc
  • src/transform/split_host_device.cc
  • testing/python/components/test_cuda_named_smem_alias_codegen.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/target/codegen_cuda.h

Comment thread src/target/codegen_cuda.cc
Comment on lines +371 to +374
if (dyn_shared_multi_version_metadata_.defined()) {
device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
dyn_shared_multi_version_metadata_.value());
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remap metadata expressions to the device parameter Vars.

SplitDeviceFunc creates fresh device params and substitutes the body, but the propagated dynamic shared-memory metadata still contains host-side Var objects. Symbolic strides/sizes can later codegen as undeclared variables.

Proposed fix
     if (dyn_shared_multi_version_metadata_.defined()) {
-      device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
-                       dyn_shared_multi_version_metadata_.value());
+      Map<String, Array<PrimExpr>> remapped_metadata;
+      for (const auto &kv : dyn_shared_multi_version_metadata_.value()) {
+        Array<PrimExpr> values;
+        values.reserve(kv.second.size());
+        for (const PrimExpr &expr : kv.second) {
+          values.push_back(tir::Substitute(expr, var_remap));
+        }
+        remapped_metadata.Set(kv.first, values);
+      }
+      device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
+                       remapped_metadata);
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (dyn_shared_multi_version_metadata_.defined()) {
device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
dyn_shared_multi_version_metadata_.value());
}
if (dyn_shared_multi_version_metadata_.defined()) {
Map<String, Array<PrimExpr>> remapped_metadata;
for (const auto &kv : dyn_shared_multi_version_metadata_.value()) {
Array<PrimExpr> values;
values.reserve(kv.second.size());
for (const PrimExpr &expr : kv.second) {
values.push_back(tir::Substitute(expr, var_remap));
}
remapped_metadata.Set(kv.first, values);
}
device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
remapped_metadata);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/split_host_device.cc` around lines 371 - 374, The dynamic
shared-memory metadata stored in dyn_shared_multi_version_metadata_ is still
using host-side Var symbols and must be remapped to the fresh device parameter
Vars created by SplitDeviceFunc; update the code that sets device_attrs
(kDynSharedMemoryMultiVersionMetadata) to traverse the metadata expression(s)
and substitute host Var -> corresponding device param Var (the ones
created/returned by SplitDeviceFunc) before calling device_attrs.Set, ensuring
any symbolic sizes/strides reference the device parameters rather than
undeclared host Vars. Locate the substitution logic used for the function body
in SplitDeviceFunc and reuse its mapping routine (or create a similar Var->Var
substitution util) to transform dyn_shared_multi_version_metadata_.value() prior
to assignment.

Comment thread testing/python/components/test_cuda_named_smem_alias_codegen.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/target/codegen_cuda.cc (1)

269-274: ⚠️ Potential issue | 🟠 Major

Keep fallback alias matching disabled when size metadata is missing.

This stale issue is still present: 4-field metadata sets total_byte_size to zero, and the fallback path treats that as “no range check”, so any same-dtype alias can claim an unrelated dynamic shared-memory access when exact-name matching misses. Either reject fallback matches without total_byte_size, or stop accepting 4-field entries for alias fallback.

Proposed fix
-    if (require_range_check && !is_zero(info.total_byte_size) &&
-        (!analyzer.CanProve(shifted >= zero) ||
-         !analyzer.CanProve(
-             shifted <
-             analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) {
-      return std::optional<DynSharedAliasAccess>();
+    if (require_range_check) {
+      if (is_zero(info.total_byte_size)) {
+        return std::optional<DynSharedAliasAccess>();
+      }
+      PrimExpr total_elements =
+          analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes));
+      if (!analyzer.CanProve(shifted >= zero) ||
+          !analyzer.CanProve(shifted < total_elements)) {
+        return std::optional<DynSharedAliasAccess>();
+      }
     }

Also applies to: 306-310, 4904-4918

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 269 - 274, The fallback alias
matching currently treats info.total_byte_size == 0 as “no range check” and
allows same-dtype fallback matches; change the logic to reject fallback matches
when size metadata is missing by requiring a non-zero total_byte_size before
taking the fallback path. Concretely, update the condition(s) surrounding
require_range_check and analyzer.CanProve(...) that return
std::optional<DynSharedAliasAccess() to also check
!is_zero(info.total_byte_size) (or equivalently require info.total_byte_size !=
0) so that if total_byte_size is absent/zero the function does not fall back to
same-dtype aliasing; apply the same change to the other analogous sites that use
require_range_check, info.total_byte_size, indexdiv(..., elem_bytes), and return
std::optional<DynSharedAliasAccess()).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4011-4024: The generated code adds byte offsets to the buffer
pointer (vid / buf_dyn_shmem) without accounting for element size; cast the base
pointer to unsigned char* before doing pointer arithmetic and then
reinterpret_cast the resulting address back to the element type. Concretely,
update the branches that emit aliasing code (the block that writes
alias.alias_name using PrintType/PrintExpr) to compute the address as
reinterpret_cast<decltype(nullptr)>(reinterpret_cast<unsigned char*> (vid) +
PrintExpr(alias.byte_offset) + ... ) — i.e., cast vid to unsigned char* before
adding alias.byte_offset and alias.stage_stride_bytes, then reinterpret_cast the
final computed byte address to the element pointer type printed by PrintType so
alias pointers point to correct byte addresses.

---

Duplicate comments:
In `@src/target/codegen_cuda.cc`:
- Around line 269-274: The fallback alias matching currently treats
info.total_byte_size == 0 as “no range check” and allows same-dtype fallback
matches; change the logic to reject fallback matches when size metadata is
missing by requiring a non-zero total_byte_size before taking the fallback path.
Concretely, update the condition(s) surrounding require_range_check and
analyzer.CanProve(...) that return std::optional<DynSharedAliasAccess() to also
check !is_zero(info.total_byte_size) (or equivalently require
info.total_byte_size != 0) so that if total_byte_size is absent/zero the
function does not fall back to same-dtype aliasing; apply the same change to the
other analogous sites that use require_range_check, info.total_byte_size,
indexdiv(..., elem_bytes), and return std::optional<DynSharedAliasAccess()).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5797ff70-4c78-40a3-a293-d25c948a2edf

📥 Commits

Reviewing files that changed from the base of the PR and between 2d56736 and 01219c0.

📒 Files selected for processing (2)
  • src/target/codegen_cuda.cc
  • src/target/codegen_cuda.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/target/codegen_cuda.h

Comment thread src/target/codegen_cuda.cc

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/target/codegen_cuda.cc (2)

4015-4037: ⚠️ Potential issue | 🔴 Critical

Alias pointer arithmetic still uses element-sized stride on a typed base.

vid is declared as <op->dtype> vid[] (not byte-typed), so vid + alias.byte_offset advances by byte_offset * sizeof(op->dtype) rather than byte_offset bytes. For any non-byte element type this produces aliases that point outside the intended region. This was flagged on the previous revision and the emitted arithmetic is unchanged.

Suggested fix
         if (is_zero(alias.stage_stride_bytes)) {
           PrintType(alias.dtype, stream);
           stream << "* " << alias.alias_name << " = reinterpret_cast<";
           PrintType(alias.dtype, stream);
-          stream << "*>(" << vid << " + " << PrintExpr(alias.byte_offset)
-                 << ");\n";
+          stream << "*>(reinterpret_cast<unsigned char*>(" << vid << ") + "
+                 << PrintExpr(alias.byte_offset) << ");\n";
         } else {
           stream << "auto " << alias.alias_name
                  << " = tl::PatternVisitor([&](const int i) { return "
                     "reinterpret_cast<";
           PrintType(alias.dtype, stream);
-          stream << "*>(" << vid << " + " << PrintExpr(alias.byte_offset)
-                 << " + (i * " << PrintExpr(alias.stage_stride_bytes)
-                 << ")); });\n";
+          stream << "*>(reinterpret_cast<unsigned char*>(" << vid << ") + "
+                 << PrintExpr(alias.byte_offset) << " + (i * "
+                 << PrintExpr(alias.stage_stride_bytes) << ")); });\n";
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 4015 - 4037, The alias pointer
arithmetic is adding byte offsets onto vid which is declared as a typed array
(vid), so replace arithmetic like "vid + PrintExpr(alias.byte_offset) [+ ...]"
by first casting vid to a byte pointer (e.g. reinterpret_cast<std::byte*>(vid)
or uint8_t*) before adding byte offsets and then reinterpret_cast back to the
target element pointer in both the simple and PatternVisitor branches; update
usages around dyn_shared_aliases_, alias.byte_offset and
alias.stage_stride_bytes where PrintType and reinterpret_cast are emitted so the
addition is done in bytes rather than in units of op->dtype.

279-285: ⚠️ Potential issue | 🟠 Major

Fallback still matches aliases when total_byte_size is zero.

The 4-field metadata path (AddFunction at Line 4927) sets total_byte_size = 0, and this condition !is_zero(info.total_byte_size) && (...) then short-circuits the range check. The fallback loop at Lines 316-321 therefore happily claims any same-dtype access with a size-less alias. This was raised previously and the code is unchanged — please address it or confirm intent.

Suggested fix
-    if (require_range_check && !is_zero(info.total_byte_size) &&
-        (!analyzer.CanProve(shifted >= zero) ||
-         !analyzer.CanProve(
-             shifted <
-             analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) {
-      return std::optional<DynSharedAliasAccess>();
+    if (require_range_check) {
+      if (is_zero(info.total_byte_size)) {
+        return std::optional<DynSharedAliasAccess>();
+      }
+      PrimExpr total_elements =
+          analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes));
+      if (!analyzer.CanProve(shifted >= zero) ||
+          !analyzer.CanProve(shifted < total_elements)) {
+        return std::optional<DynSharedAliasAccess>();
+      }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 279 - 285, The fallback
alias-matching loop wrongly treats entries with info.total_byte_size == 0 as
matching any same-dtype access; change the logic so zero-sized metadata never
matches: in the range-check branch and/or the fallback loop that constructs
DynSharedAliasAccess, add a guard on info.total_byte_size (e.g. skip/continue
when is_zero(info.total_byte_size) or otherwise require
!is_zero(info.total_byte_size)) so aliases with total_byte_size == 0 are not
accepted as matches; update uses of require_range_check, analyzer, and the
fallback loop that creates DynSharedAliasAccess to include this non-zero-size
check.
🧹 Nitpick comments (1)
src/target/codegen_cuda.cc (1)

4900-4951: Metadata parsing LGTM; one small robustness note.

Parsing and sorting logic are sound: dedicated handling of the size-4/size-6 schemas matches ExportBufferAliasMetadata, the sort makes alias-name generation deterministic across Map<String, Array<PrimExpr>> iteration orders, and the final emit_named_smem_pointers_ = !dyn_shared_aliases_.empty() correctly disables emission when no aliases survive parsing.

Minor: malformed entries (wrong size, non-IntImm dtype fields) are silently dropped at Lines 4914-4923. Given this feature is gated behind an opt-in pass config, a DLOG(WARNING) or similar would make schema drift easier to diagnose. Not a blocker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 4900 - 4951, The parsing loop
silently drops malformed dynamic-shared-memory alias entries; add diagnostics so
issues are discoverable by users. In the block guarded by
emit_named_smem_pointers_ where you iterate opt.value() (the loop that checks
metadata.size() and validates code/bits/lanes), emit a DLOG(WARNING) or VLOG
with a concise message including the alias key (kv.first) and reason (bad size
or non-IntImm fields) whenever you continue; keep using the same variables
(metadata, kv.first) and avoid changing control flow—only log before the
continue—so dyn_shared_aliases_, dyn_shared_alias_order_,
emit_named_smem_pointers_, and name_supply_->FreshName behavior remains
identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 246-323: TryMatchDynSharedAliasAccess can divide by zero when
stage_stride (computed from info.stage_stride_bytes via indexdiv) collapses to
zero; after computing stage_stride (the variable set from
indexdiv(info.stage_stride_bytes, elem_bytes)) add a defensive check that if
is_zero(stage_stride) (or analyzer.CanProve(stage_stride == 0)) then return
std::nullopt to avoid subsequent indexdiv/indexmod with a zero divisor; update
the branch that handles non-zero stage_stride_bytes (and the following path that
computes access.stage_index/access.element_index) to rely on this check.
Optionally, to improve performance, avoid reconstructing arith::Analyzer on
every call to TryMatchDynSharedAliasAccess by reusing/caching a single Analyzer
instance that is invalidated when Bind/Unbind occurs (ensure
dyn_shared_alias_var_ranges_ rebinding remains correct).

---

Duplicate comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4015-4037: The alias pointer arithmetic is adding byte offsets
onto vid which is declared as a typed array (vid), so replace arithmetic like
"vid + PrintExpr(alias.byte_offset) [+ ...]" by first casting vid to a byte
pointer (e.g. reinterpret_cast<std::byte*>(vid) or uint8_t*) before adding byte
offsets and then reinterpret_cast back to the target element pointer in both the
simple and PatternVisitor branches; update usages around dyn_shared_aliases_,
alias.byte_offset and alias.stage_stride_bytes where PrintType and
reinterpret_cast are emitted so the addition is done in bytes rather than in
units of op->dtype.
- Around line 279-285: The fallback alias-matching loop wrongly treats entries
with info.total_byte_size == 0 as matching any same-dtype access; change the
logic so zero-sized metadata never matches: in the range-check branch and/or the
fallback loop that constructs DynSharedAliasAccess, add a guard on
info.total_byte_size (e.g. skip/continue when is_zero(info.total_byte_size) or
otherwise require !is_zero(info.total_byte_size)) so aliases with
total_byte_size == 0 are not accepted as matches; update uses of
require_range_check, analyzer, and the fallback loop that creates
DynSharedAliasAccess to include this non-zero-size check.

---

Nitpick comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4900-4951: The parsing loop silently drops malformed
dynamic-shared-memory alias entries; add diagnostics so issues are discoverable
by users. In the block guarded by emit_named_smem_pointers_ where you iterate
opt.value() (the loop that checks metadata.size() and validates
code/bits/lanes), emit a DLOG(WARNING) or VLOG with a concise message including
the alias key (kv.first) and reason (bad size or non-IntImm fields) whenever you
continue; keep using the same variables (metadata, kv.first) and avoid changing
control flow—only log before the continue—so dyn_shared_aliases_,
dyn_shared_alias_order_, emit_named_smem_pointers_, and name_supply_->FreshName
behavior remains identical.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba99e40f-1a99-4f61-8c38-e515c86993a3

📥 Commits

Reviewing files that changed from the base of the PR and between 01219c0 and 0a9ccbf.

📒 Files selected for processing (2)
  • src/target/codegen_cuda.cc
  • src/transform/merge_shared_memory_allocations.cc
✅ Files skipped from review due to trivial changes (1)
  • src/transform/merge_shared_memory_allocations.cc

Comment on lines +246 to +323
std::optional<CodeGenTileLangCUDA::DynSharedAliasAccess>
CodeGenTileLangCUDA::TryMatchDynSharedAliasAccess(const BufferNode *buffer,
PrimExpr index) const {
if (!emit_named_smem_pointers_) {
return std::nullopt;
}
std::string scope;
if (alloc_storage_scope_.count(buffer->data.get())) {
scope = alloc_storage_scope_.at(buffer->data.get());
}
if (scope.empty()) {
scope = GetPtrStorageScope(buffer->data);
}
if (scope != "shared.dyn") {
return std::nullopt;
}
int elem_bytes = buffer->dtype.bytes() * buffer->dtype.lanes();
if (elem_bytes <= 0) {
return std::nullopt;
}
arith::Analyzer analyzer;
for (const auto &kv : dyn_shared_alias_var_ranges_) {
analyzer.Bind(GetRef<Var>(kv.first), kv.second);
}
auto try_match = [&](const DynSharedAliasInfo &info,
bool require_range_check) {
if (info.dtype != buffer->dtype) {
return std::optional<DynSharedAliasAccess>();
}
PrimExpr base_offset =
analyzer.Simplify(indexdiv(info.byte_offset, elem_bytes));
PrimExpr shifted = analyzer.Simplify(index - base_offset);
PrimExpr zero = make_const(shifted.dtype(), 0);
if (require_range_check && !is_zero(info.total_byte_size) &&
(!analyzer.CanProve(shifted >= zero) ||
!analyzer.CanProve(
shifted <
analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) {
return std::optional<DynSharedAliasAccess>();
}
DynSharedAliasAccess access;
access.info = &info;
if (is_zero(info.stage_stride_bytes)) {
access.element_index = shifted;
return std::optional<DynSharedAliasAccess>(std::move(access));
}
PrimExpr stage_stride =
analyzer.Simplify(indexdiv(info.stage_stride_bytes, elem_bytes));
if (!dyn_shared_stage_expr_stack_.empty()) {
PrimExpr stage_index = dyn_shared_stage_expr_stack_.back();
PrimExpr stage_shifted =
analyzer.Simplify(shifted - stage_index * stage_stride);
if (analyzer.CanProve(stage_shifted >= zero) &&
analyzer.CanProve(stage_shifted < stage_stride)) {
access.stage_index = stage_index;
access.element_index = stage_shifted;
return std::optional<DynSharedAliasAccess>(std::move(access));
}
}
access.stage_index = analyzer.Simplify(indexdiv(shifted, stage_stride));
access.element_index = analyzer.Simplify(indexmod(shifted, stage_stride));
return std::optional<DynSharedAliasAccess>(std::move(access));
};

auto it = dyn_shared_aliases_.find(buffer->name);
if (it != dyn_shared_aliases_.end()) {
if (auto access = try_match(it->second, false)) {
return access;
}
}
for (const auto &buffer_name : dyn_shared_alias_order_) {
const auto &info = dyn_shared_aliases_.at(buffer_name);
if (auto access = try_match(info, true)) {
return access;
}
}
return std::nullopt;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -nP --type=cpp -C5 'stage_stride_bytes\s*=' src/transform/

Repository: tile-ai/tilelang

Length of output: 2465


🏁 Script executed:

#!/bin/bash
# Get context around total_bytes computation and stage_stride_bytes assignment
rg -nP --type=cpp -B20 -A10 'stage_stride_bytes\s*=' src/transform/merge_shared_memory_allocations.cc | head -120

Repository: tile-ai/tilelang

Length of output: 2298


🏁 Script executed:

#!/bin/bash
# Check if there are any assertions or validity checks for stage_stride_bytes or elem_bytes in the codegen file
rg -nP --type=cpp 'stage_stride|elem_bytes' src/target/codegen_cuda.cc | head -30

Repository: tile-ai/tilelang

Length of output: 982


🏁 Script executed:

#!/bin/bash
# Look for DynSharedAliasInfo struct definition and any comments about invariants
rg -nP --type=cpp -B5 -A15 'struct DynSharedAliasInfo' src/

Repository: tile-ai/tilelang

Length of output: 1313


🏁 Script executed:

#!/bin/bash
# Find where buffer_num_stages_ is populated
rg -nP --type=cpp 'buffer_num_stages_\s*\[' src/transform/merge_shared_memory_allocations.cc | head -20

Repository: tile-ai/tilelang

Length of output: 118


🏁 Script executed:

#!/bin/bash
# Look for any validation/alignment logic in buffer allocation
rg -nP --type=cpp -B3 -A3 'num_stages.*>.*1' src/transform/merge_shared_memory_allocations.cc | head -40

Repository: tile-ai/tilelang

Length of output: 821


🏁 Script executed:

#!/bin/bash
# Check if there are any comments about the invariants in DynSharedAliasInfo or related metadata
rg -nP --type=cpp -B10 'stage_stride_bytes' src/target/codegen_cuda.h

Repository: tile-ai/tilelang

Length of output: 556


🏁 Script executed:

#!/bin/bash
# Check the AllocateNode structure and understand buffer allocation constraints
rg -nP --type=cpp -B5 -A10 'class AllocateNode' src/

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Look for any padding/alignment logic in buffer allocation
rg -nP --type=cpp 'AlignTo|align|padding' src/transform/merge_shared_memory_allocations.cc | head -30

Repository: tile-ai/tilelang

Length of output: 1723


🏁 Script executed:

#!/bin/bash
# Check if there are any assertions or validation that buffers are properly sized
rg -nP --type=cpp -B2 -A2 'total_bytes|stage_stride_bytes' src/transform/merge_shared_memory_allocations.cc | grep -A3 -B3 'CHECK\|ASSERT\|DCHECK'

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Look for where num_stages is derived from and what constraints it has
rg -nP --type=cpp -B10 'kPipelineMVBContextNumStages' src/transform/merge_shared_memory_allocations.cc | head -50

Repository: tile-ai/tilelang

Length of output: 503


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation or comments about stage_stride_bytes guarantees
rg -nP --type=cpp -B5 -A5 'stage_stride_bytes' src/target/codegen_cuda.cc | grep -A6 -B6 'is_zero'

Repository: tile-ai/tilelang

Length of output: 1208


🏁 Script executed:

#!/bin/bash
# Search for any validation that ensures proper alignment of allocations
rg -nP --type=cpp 'stage_stride_bytes.*make_const' src/transform/merge_shared_memory_allocations.cc -A5 -B5

Repository: tile-ai/tilelang

Length of output: 608


🏁 Script executed:

#!/bin/bash
# Understand the complete picture: trace how extents are computed and if they're constrained
rg -nP --type=cpp -B20 'for.*const PrimExpr &extent : alloc->extents' src/transform/merge_shared_memory_allocations.cc | head -60

Repository: tile-ai/tilelang

Length of output: 1938


🏁 Script executed:

#!/bin/bash
# Check if there are any pipeline-related constraints on buffer sizing
rg -nP --type=cpp 'kPipelineMVBContextNumStages' src/ -B10 -A10 | head -80

Repository: tile-ai/tilelang

Length of output: 6771


🏁 Script executed:

#!/bin/bash
# Look at how the metadata is consumed - are there any safeguards in the codegen?
rg -nP --type=cpp -B3 -A8 'stage_stride =.*indexdiv.*stage_stride_bytes' src/target/codegen_cuda.cc

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check if there's a pattern of using is_zero guard after indexdiv elsewhere in the codebase
rg -nP --type=cpp 'is_zero.*indexdiv\|indexdiv.*is_zero' src/ | head -20

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Look at the surrounding code context to see if stage_stride value 0 would have other consequences
rg -nP --type=cpp -A15 'stage_stride = analyzer.Simplify.*indexdiv' src/target/codegen_cuda.cc

Repository: tile-ai/tilelang

Length of output: 42


Add defensive check to prevent divide-by-zero on stage_stride.

Line 293: stage_stride = indexdiv(stage_stride_bytes, elem_bytes) can collapse to zero if stage_stride_bytes < elem_bytes, yet the subsequent indexdiv(shifted, stage_stride) / indexmod(shifted, stage_stride) at lines 305–306 don't guard against it. The is_zero(info.stage_stride_bytes) check at line 288 prevents a zero denominator only in that specific branch, not after the indexdiv at line 293. In practice, merge_shared_memory_allocations.cc path 1 (lines 578–580) always produces a multiple of elem_bytes, but path 2 (lines 584–585) uses indexdiv(total_bytes, num_stages) with no guarantee of alignment. Add an early is_zero(stage_stride) bail-out after line 293 to be defensive.

Additionally, line 266–269: A fresh arith::Analyzer is constructed and re-bound on every access. For large kernels with many dyn-shared accesses this is O(N_accesses × N_ranges). Caching the analyzer (invalidated on Bind/Unbind) would improve codegen performance, though it's optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 246 - 323,
TryMatchDynSharedAliasAccess can divide by zero when stage_stride (computed from
info.stage_stride_bytes via indexdiv) collapses to zero; after computing
stage_stride (the variable set from indexdiv(info.stage_stride_bytes,
elem_bytes)) add a defensive check that if is_zero(stage_stride) (or
analyzer.CanProve(stage_stride == 0)) then return std::nullopt to avoid
subsequent indexdiv/indexmod with a zero divisor; update the branch that handles
non-zero stage_stride_bytes (and the following path that computes
access.stage_index/access.element_index) to rely on this check. Optionally, to
improve performance, avoid reconstructing arith::Analyzer on every call to
TryMatchDynSharedAliasAccess by reusing/caching a single Analyzer instance that
is invalidated when Bind/Unbind occurs (ensure dyn_shared_alias_var_ranges_
rebinding remains correct).

@Rachmanino Rachmanino force-pushed the worktree-issue-2059 branch 2 times, most recently from 71ac4a8 to 2540b79 Compare April 20, 2026 15:11
@Rachmanino Rachmanino requested a review from LeiWang1999 April 21, 2026 05:45
kurisu6912
kurisu6912 previously approved these changes Apr 22, 2026

@kurisu6912 kurisu6912 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pr add smem alias by record aliases in merge_shared_memory_allocations pass, and then match the alias in codegen passes.

  1. At the allocation pass, use prim func attrs to record each buffer's allocation byte range.
  2. At the codegen pass, use Analyzer to prove the buffer's index resides in that range, and find out its alias.

Generally this is a very good solution, thanks for contribution!

@kurisu6912

kurisu6912 commented Apr 22, 2026

Copy link
Copy Markdown
Collaborator

@Rachmanino Tilelang has FP4 buffers, the current pr uses stride to compute the alias, does it handle fp4 buffers correctly?

Sometime fp4 stride is 0.5x, and shape is 2x, which is often counterintuitive.

@Rachmanino

Copy link
Copy Markdown
Collaborator Author

@Rachmanino Tilelang has FP4 buffers, the current pr uses stride to compute the alias, does it handle fp4 buffers correctly?

Sometime fp4 stride is 0.5x, and shape is 2x, which is often counterintuitive.

Thanks for your insight! I'll check this when im available. Besides, CI has passed with this feature enabled, and I guess that there's already some fp4 examples in our tests?

@Rachmanino

Copy link
Copy Markdown
Collaborator Author

I guess you are right, thx! I'll fix this soon and rebase to eliminate conflict

@LeiWang1999

Copy link
Copy Markdown
Member

pending and wait for #2216 then we can take a look for if there exist better solutions.

@LeiWang1999

Copy link
Copy Markdown
Member

Closed. tir has been rebased into tir, where alloc is now a single alloc_buffer statement instead of a declaration and allocation with a body. We also have handle_add_byte_offset now, which makes implementing this feature much more convenient. A separate issue will be opened for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants