[Feature] Emit named aliases for dynamic shared memory in CUDA codegen for better readability#2061
[Feature] Emit named aliases for dynamic shared memory in CUDA codegen for better readability#2061Rachmanino wants to merge 9 commits into
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds support to emit named typed pointer aliases for dynamic shared-memory: new pass-config and builtin keys, metadata production in merge/multi-version passes, propagation through host/device splitting, CUDA codegen to consume alias metadata and remap shared.dyn accesses, a PatternVisitor helper, and tests. Changes
Sequence Diagram(s)sequenceDiagram
participant MergePass as MergePass\n(MergeSharedMemoryAllocations)
participant MVBRewriter as MultiVersionRewriter
participant HostSplitter as HostDeviceSplitter
participant PrimFunc as PrimFunc (attrs)
participant CodeGen as CodeGenTileLangCUDA
participant CUDAOut as Generated CUDA
MergePass->>MVBRewriter: consult/produce multiversion metadata
MergePass->>MergePass: compute dyn_shared_memory_alias_metadata
MergePass->>PrimFunc: attach alias metadata attribute
MVBRewriter->>PrimFunc: attach multiversion metadata (if any)
PrimFunc->>HostSplitter: split host/device
HostSplitter->>PrimFunc: propagate multiversion metadata to device PrimFunc
PrimFunc->>CodeGen: AddFunction() reads alias & multiversion attrs
CodeGen->>CodeGen: build alias tables, set emit_named_smem_pointers_
CodeGen->>CodeGen: emit alias pointer views at Allocate
CodeGen->>CodeGen: remap loads/stores/address_of to alias addresses
CodeGen->>CUDAOut: emit final CUDA source with alias declarations and remapped accesses
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/transform/merge_shared_memory_allocations.cc`:
- Around line 578-580: ExportBufferAliasMetadata() is being called before
buffer_names_ is populated in rewriter(...), so metadata falls back to
buffer_var->name_hint instead of using high-level Buffer names; fix by moving
the ExportBufferAliasMetadata() invocation (or otherwise deferring export) to
run after rewriter(...) populates buffer_names_, or ensure buffer_names_ is
filled earlier (e.g., populate buffer_names_ when visiting Buffers or at the
start of rewriter) so that ExportBufferAliasMetadata() reads the populated
buffer_names_ map rather than the fallback name_hint. Reference symbols:
buffer_names_, rewriter(...), ExportBufferAliasMetadata(),
buffer_var->name_hint, node->buffer.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f1479d40-c532-4e65-a066-0297b65839a4
📒 Files selected for processing (6)
src/op/builtin.ccsrc/op/builtin.hsrc/target/codegen_cuda.ccsrc/target/codegen_cuda.hsrc/transform/merge_shared_memory_allocations.cctilelang/transform/pass_config.py
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/transform/merge_shared_memory_allocations.cc (1)
1552-1554:⚠️ Potential issue | 🟠 MajorExport alias metadata after rewriting populates
buffer_names_.
buffer_names_is filled inVisitBufferAccess(), so exporting beforerewriter(...)still misses high-levelBuffernames and can also missmulti_version_metadata_keyed by those names.Proposed fix
SharedMemoryRewriter rewriter(collector.dyn_shmem_allocs_, true, verbose, align_bytes, multi_version_metadata); rewriter.PlanReuse(stmt, true, enable_aggressive_merge); - dyn_alias_metadata = rewriter.ExportBufferAliasMetadata(); stmt = rewriter(std::move(stmt)); + dyn_alias_metadata = rewriter.ExportBufferAliasMetadata();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/merge_shared_memory_allocations.cc` around lines 1552 - 1554, The ExportBufferAliasMetadata call is happening before the rewriter populates buffer_names_ (populated in VisitBufferAccess), so move the rewriter.ExportBufferAliasMetadata() invocation to after you invoke the rewriter (i.e., after stmt = rewriter(std::move(stmt))); ensure you still call rewriter.PlanReuse(...) beforehand as needed and then call ExportBufferAliasMetadata() so multi_version_metadata_ and buffer_names_ are captured from the rewritten IR.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 266-271: The fallback alias-matching currently treats
info.total_byte_size==0 as "no range check" allowing same-dtype buffers to
incorrectly claim dynamic shared-memory accesses; update the fallback logic that
returns std::optional<DynSharedAliasAccess>() to first detect missing size
metadata (info.total_byte_size == 0) and refuse to match (i.e., skip/continue
fallback) instead of disabling range checks — change the condition around
require_range_check / analyzer.CanProve(...) to enforce a size-present check,
and apply the same change in the other identical fallback block (the one at the
later location referenced in the review) so aliases without size metadata cannot
be matched.
In `@src/transform/split_host_device.cc`:
- Around line 371-374: The dynamic shared-memory metadata stored in
dyn_shared_multi_version_metadata_ is still using host-side Var symbols and must
be remapped to the fresh device parameter Vars created by SplitDeviceFunc;
update the code that sets device_attrs (kDynSharedMemoryMultiVersionMetadata) to
traverse the metadata expression(s) and substitute host Var -> corresponding
device param Var (the ones created/returned by SplitDeviceFunc) before calling
device_attrs.Set, ensuring any symbolic sizes/strides reference the device
parameters rather than undeclared host Vars. Locate the substitution logic used
for the function body in SplitDeviceFunc and reuse its mapping routine (or
create a similar Var->Var substitution util) to transform
dyn_shared_multi_version_metadata_.value() prior to assignment.
In `@testing/python/components/test_cuda_named_smem_alias_codegen.py`:
- Around line 61-62: The test function
test_codegen_keeps_raw_dyn_smem_access_by_default is running CUDA code via
_compile_kernel() but lacks the CUDA guard; add the CUDA gate by decorating the
test with the existing requires_cuda marker (e.g., `@pytest.mark.requires_cuda` or
the project's requires_cuda decorator) so the test is skipped on CPU-only CI;
update the test definition for test_codegen_keeps_raw_dyn_smem_access_by_default
to include that decorator and ensure imports/reference to pytest/marker are
present.
---
Duplicate comments:
In `@src/transform/merge_shared_memory_allocations.cc`:
- Around line 1552-1554: The ExportBufferAliasMetadata call is happening before
the rewriter populates buffer_names_ (populated in VisitBufferAccess), so move
the rewriter.ExportBufferAliasMetadata() invocation to after you invoke the
rewriter (i.e., after stmt = rewriter(std::move(stmt))); ensure you still call
rewriter.PlanReuse(...) beforehand as needed and then call
ExportBufferAliasMetadata() so multi_version_metadata_ and buffer_names_ are
captured from the rewritten IR.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1424a87f-7fe0-415b-8eb0-e69425b3f8fe
📒 Files selected for processing (7)
src/target/codegen_cuda.ccsrc/target/codegen_cuda.hsrc/tl_templates/cuda/common.hsrc/transform/merge_shared_memory_allocations.ccsrc/transform/multi_version_buffer_rewriter.ccsrc/transform/split_host_device.cctesting/python/components/test_cuda_named_smem_alias_codegen.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/target/codegen_cuda.h
| if (dyn_shared_multi_version_metadata_.defined()) { | ||
| device_attrs.Set(kDynSharedMemoryMultiVersionMetadata, | ||
| dyn_shared_multi_version_metadata_.value()); | ||
| } |
There was a problem hiding this comment.
Remap metadata expressions to the device parameter Vars.
SplitDeviceFunc creates fresh device params and substitutes the body, but the propagated dynamic shared-memory metadata still contains host-side Var objects. Symbolic strides/sizes can later codegen as undeclared variables.
Proposed fix
if (dyn_shared_multi_version_metadata_.defined()) {
- device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
- dyn_shared_multi_version_metadata_.value());
+ Map<String, Array<PrimExpr>> remapped_metadata;
+ for (const auto &kv : dyn_shared_multi_version_metadata_.value()) {
+ Array<PrimExpr> values;
+ values.reserve(kv.second.size());
+ for (const PrimExpr &expr : kv.second) {
+ values.push_back(tir::Substitute(expr, var_remap));
+ }
+ remapped_metadata.Set(kv.first, values);
+ }
+ device_attrs.Set(kDynSharedMemoryMultiVersionMetadata,
+ remapped_metadata);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (dyn_shared_multi_version_metadata_.defined()) { | |
| device_attrs.Set(kDynSharedMemoryMultiVersionMetadata, | |
| dyn_shared_multi_version_metadata_.value()); | |
| } | |
| if (dyn_shared_multi_version_metadata_.defined()) { | |
| Map<String, Array<PrimExpr>> remapped_metadata; | |
| for (const auto &kv : dyn_shared_multi_version_metadata_.value()) { | |
| Array<PrimExpr> values; | |
| values.reserve(kv.second.size()); | |
| for (const PrimExpr &expr : kv.second) { | |
| values.push_back(tir::Substitute(expr, var_remap)); | |
| } | |
| remapped_metadata.Set(kv.first, values); | |
| } | |
| device_attrs.Set(kDynSharedMemoryMultiVersionMetadata, | |
| remapped_metadata); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/transform/split_host_device.cc` around lines 371 - 374, The dynamic
shared-memory metadata stored in dyn_shared_multi_version_metadata_ is still
using host-side Var symbols and must be remapped to the fresh device parameter
Vars created by SplitDeviceFunc; update the code that sets device_attrs
(kDynSharedMemoryMultiVersionMetadata) to traverse the metadata expression(s)
and substitute host Var -> corresponding device param Var (the ones
created/returned by SplitDeviceFunc) before calling device_attrs.Set, ensuring
any symbolic sizes/strides reference the device parameters rather than
undeclared host Vars. Locate the substitution logic used for the function body
in SplitDeviceFunc and reuse its mapping routine (or create a similar Var->Var
substitution util) to transform dyn_shared_multi_version_metadata_.value() prior
to assignment.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/target/codegen_cuda.cc (1)
269-274:⚠️ Potential issue | 🟠 MajorKeep fallback alias matching disabled when size metadata is missing.
This stale issue is still present: 4-field metadata sets
total_byte_sizeto zero, and the fallback path treats that as “no range check”, so any same-dtype alias can claim an unrelated dynamic shared-memory access when exact-name matching misses. Either reject fallback matches withouttotal_byte_size, or stop accepting 4-field entries for alias fallback.Proposed fix
- if (require_range_check && !is_zero(info.total_byte_size) && - (!analyzer.CanProve(shifted >= zero) || - !analyzer.CanProve( - shifted < - analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) { - return std::optional<DynSharedAliasAccess>(); + if (require_range_check) { + if (is_zero(info.total_byte_size)) { + return std::optional<DynSharedAliasAccess>(); + } + PrimExpr total_elements = + analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes)); + if (!analyzer.CanProve(shifted >= zero) || + !analyzer.CanProve(shifted < total_elements)) { + return std::optional<DynSharedAliasAccess>(); + } }Also applies to: 306-310, 4904-4918
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 269 - 274, The fallback alias matching currently treats info.total_byte_size == 0 as “no range check” and allows same-dtype fallback matches; change the logic to reject fallback matches when size metadata is missing by requiring a non-zero total_byte_size before taking the fallback path. Concretely, update the condition(s) surrounding require_range_check and analyzer.CanProve(...) that return std::optional<DynSharedAliasAccess() to also check !is_zero(info.total_byte_size) (or equivalently require info.total_byte_size != 0) so that if total_byte_size is absent/zero the function does not fall back to same-dtype aliasing; apply the same change to the other analogous sites that use require_range_check, info.total_byte_size, indexdiv(..., elem_bytes), and return std::optional<DynSharedAliasAccess()).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4011-4024: The generated code adds byte offsets to the buffer
pointer (vid / buf_dyn_shmem) without accounting for element size; cast the base
pointer to unsigned char* before doing pointer arithmetic and then
reinterpret_cast the resulting address back to the element type. Concretely,
update the branches that emit aliasing code (the block that writes
alias.alias_name using PrintType/PrintExpr) to compute the address as
reinterpret_cast<decltype(nullptr)>(reinterpret_cast<unsigned char*> (vid) +
PrintExpr(alias.byte_offset) + ... ) — i.e., cast vid to unsigned char* before
adding alias.byte_offset and alias.stage_stride_bytes, then reinterpret_cast the
final computed byte address to the element pointer type printed by PrintType so
alias pointers point to correct byte addresses.
---
Duplicate comments:
In `@src/target/codegen_cuda.cc`:
- Around line 269-274: The fallback alias matching currently treats
info.total_byte_size == 0 as “no range check” and allows same-dtype fallback
matches; change the logic to reject fallback matches when size metadata is
missing by requiring a non-zero total_byte_size before taking the fallback path.
Concretely, update the condition(s) surrounding require_range_check and
analyzer.CanProve(...) that return std::optional<DynSharedAliasAccess() to also
check !is_zero(info.total_byte_size) (or equivalently require
info.total_byte_size != 0) so that if total_byte_size is absent/zero the
function does not fall back to same-dtype aliasing; apply the same change to the
other analogous sites that use require_range_check, info.total_byte_size,
indexdiv(..., elem_bytes), and return std::optional<DynSharedAliasAccess()).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5797ff70-4c78-40a3-a293-d25c948a2edf
📒 Files selected for processing (2)
src/target/codegen_cuda.ccsrc/target/codegen_cuda.h
🚧 Files skipped from review as they are similar to previous changes (1)
- src/target/codegen_cuda.h
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/target/codegen_cuda.cc (2)
4015-4037:⚠️ Potential issue | 🔴 CriticalAlias pointer arithmetic still uses element-sized stride on a typed base.
vidis declared as<op->dtype> vid[](not byte-typed), sovid + alias.byte_offsetadvances bybyte_offset * sizeof(op->dtype)rather thanbyte_offsetbytes. For any non-byte element type this produces aliases that point outside the intended region. This was flagged on the previous revision and the emitted arithmetic is unchanged.Suggested fix
if (is_zero(alias.stage_stride_bytes)) { PrintType(alias.dtype, stream); stream << "* " << alias.alias_name << " = reinterpret_cast<"; PrintType(alias.dtype, stream); - stream << "*>(" << vid << " + " << PrintExpr(alias.byte_offset) - << ");\n"; + stream << "*>(reinterpret_cast<unsigned char*>(" << vid << ") + " + << PrintExpr(alias.byte_offset) << ");\n"; } else { stream << "auto " << alias.alias_name << " = tl::PatternVisitor([&](const int i) { return " "reinterpret_cast<"; PrintType(alias.dtype, stream); - stream << "*>(" << vid << " + " << PrintExpr(alias.byte_offset) - << " + (i * " << PrintExpr(alias.stage_stride_bytes) - << ")); });\n"; + stream << "*>(reinterpret_cast<unsigned char*>(" << vid << ") + " + << PrintExpr(alias.byte_offset) << " + (i * " + << PrintExpr(alias.stage_stride_bytes) << ")); });\n"; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 4015 - 4037, The alias pointer arithmetic is adding byte offsets onto vid which is declared as a typed array (vid), so replace arithmetic like "vid + PrintExpr(alias.byte_offset) [+ ...]" by first casting vid to a byte pointer (e.g. reinterpret_cast<std::byte*>(vid) or uint8_t*) before adding byte offsets and then reinterpret_cast back to the target element pointer in both the simple and PatternVisitor branches; update usages around dyn_shared_aliases_, alias.byte_offset and alias.stage_stride_bytes where PrintType and reinterpret_cast are emitted so the addition is done in bytes rather than in units of op->dtype.
279-285:⚠️ Potential issue | 🟠 MajorFallback still matches aliases when
total_byte_sizeis zero.The 4-field metadata path (
AddFunctionat Line 4927) setstotal_byte_size = 0, and this condition!is_zero(info.total_byte_size) && (...)then short-circuits the range check. The fallback loop at Lines 316-321 therefore happily claims any same-dtype access with a size-less alias. This was raised previously and the code is unchanged — please address it or confirm intent.Suggested fix
- if (require_range_check && !is_zero(info.total_byte_size) && - (!analyzer.CanProve(shifted >= zero) || - !analyzer.CanProve( - shifted < - analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) { - return std::optional<DynSharedAliasAccess>(); + if (require_range_check) { + if (is_zero(info.total_byte_size)) { + return std::optional<DynSharedAliasAccess>(); + } + PrimExpr total_elements = + analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes)); + if (!analyzer.CanProve(shifted >= zero) || + !analyzer.CanProve(shifted < total_elements)) { + return std::optional<DynSharedAliasAccess>(); + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 279 - 285, The fallback alias-matching loop wrongly treats entries with info.total_byte_size == 0 as matching any same-dtype access; change the logic so zero-sized metadata never matches: in the range-check branch and/or the fallback loop that constructs DynSharedAliasAccess, add a guard on info.total_byte_size (e.g. skip/continue when is_zero(info.total_byte_size) or otherwise require !is_zero(info.total_byte_size)) so aliases with total_byte_size == 0 are not accepted as matches; update uses of require_range_check, analyzer, and the fallback loop that creates DynSharedAliasAccess to include this non-zero-size check.
🧹 Nitpick comments (1)
src/target/codegen_cuda.cc (1)
4900-4951: Metadata parsing LGTM; one small robustness note.Parsing and sorting logic are sound: dedicated handling of the size-4/size-6 schemas matches
ExportBufferAliasMetadata, the sort makes alias-name generation deterministic acrossMap<String, Array<PrimExpr>>iteration orders, and the finalemit_named_smem_pointers_ = !dyn_shared_aliases_.empty()correctly disables emission when no aliases survive parsing.Minor: malformed entries (wrong size, non-IntImm dtype fields) are silently dropped at Lines 4914-4923. Given this feature is gated behind an opt-in pass config, a
DLOG(WARNING)or similar would make schema drift easier to diagnose. Not a blocker.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 4900 - 4951, The parsing loop silently drops malformed dynamic-shared-memory alias entries; add diagnostics so issues are discoverable by users. In the block guarded by emit_named_smem_pointers_ where you iterate opt.value() (the loop that checks metadata.size() and validates code/bits/lanes), emit a DLOG(WARNING) or VLOG with a concise message including the alias key (kv.first) and reason (bad size or non-IntImm fields) whenever you continue; keep using the same variables (metadata, kv.first) and avoid changing control flow—only log before the continue—so dyn_shared_aliases_, dyn_shared_alias_order_, emit_named_smem_pointers_, and name_supply_->FreshName behavior remains identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 246-323: TryMatchDynSharedAliasAccess can divide by zero when
stage_stride (computed from info.stage_stride_bytes via indexdiv) collapses to
zero; after computing stage_stride (the variable set from
indexdiv(info.stage_stride_bytes, elem_bytes)) add a defensive check that if
is_zero(stage_stride) (or analyzer.CanProve(stage_stride == 0)) then return
std::nullopt to avoid subsequent indexdiv/indexmod with a zero divisor; update
the branch that handles non-zero stage_stride_bytes (and the following path that
computes access.stage_index/access.element_index) to rely on this check.
Optionally, to improve performance, avoid reconstructing arith::Analyzer on
every call to TryMatchDynSharedAliasAccess by reusing/caching a single Analyzer
instance that is invalidated when Bind/Unbind occurs (ensure
dyn_shared_alias_var_ranges_ rebinding remains correct).
---
Duplicate comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4015-4037: The alias pointer arithmetic is adding byte offsets
onto vid which is declared as a typed array (vid), so replace arithmetic like
"vid + PrintExpr(alias.byte_offset) [+ ...]" by first casting vid to a byte
pointer (e.g. reinterpret_cast<std::byte*>(vid) or uint8_t*) before adding byte
offsets and then reinterpret_cast back to the target element pointer in both the
simple and PatternVisitor branches; update usages around dyn_shared_aliases_,
alias.byte_offset and alias.stage_stride_bytes where PrintType and
reinterpret_cast are emitted so the addition is done in bytes rather than in
units of op->dtype.
- Around line 279-285: The fallback alias-matching loop wrongly treats entries
with info.total_byte_size == 0 as matching any same-dtype access; change the
logic so zero-sized metadata never matches: in the range-check branch and/or the
fallback loop that constructs DynSharedAliasAccess, add a guard on
info.total_byte_size (e.g. skip/continue when is_zero(info.total_byte_size) or
otherwise require !is_zero(info.total_byte_size)) so aliases with
total_byte_size == 0 are not accepted as matches; update uses of
require_range_check, analyzer, and the fallback loop that creates
DynSharedAliasAccess to include this non-zero-size check.
---
Nitpick comments:
In `@src/target/codegen_cuda.cc`:
- Around line 4900-4951: The parsing loop silently drops malformed
dynamic-shared-memory alias entries; add diagnostics so issues are discoverable
by users. In the block guarded by emit_named_smem_pointers_ where you iterate
opt.value() (the loop that checks metadata.size() and validates
code/bits/lanes), emit a DLOG(WARNING) or VLOG with a concise message including
the alias key (kv.first) and reason (bad size or non-IntImm fields) whenever you
continue; keep using the same variables (metadata, kv.first) and avoid changing
control flow—only log before the continue—so dyn_shared_aliases_,
dyn_shared_alias_order_, emit_named_smem_pointers_, and name_supply_->FreshName
behavior remains identical.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ba99e40f-1a99-4f61-8c38-e515c86993a3
📒 Files selected for processing (2)
src/target/codegen_cuda.ccsrc/transform/merge_shared_memory_allocations.cc
✅ Files skipped from review due to trivial changes (1)
- src/transform/merge_shared_memory_allocations.cc
| std::optional<CodeGenTileLangCUDA::DynSharedAliasAccess> | ||
| CodeGenTileLangCUDA::TryMatchDynSharedAliasAccess(const BufferNode *buffer, | ||
| PrimExpr index) const { | ||
| if (!emit_named_smem_pointers_) { | ||
| return std::nullopt; | ||
| } | ||
| std::string scope; | ||
| if (alloc_storage_scope_.count(buffer->data.get())) { | ||
| scope = alloc_storage_scope_.at(buffer->data.get()); | ||
| } | ||
| if (scope.empty()) { | ||
| scope = GetPtrStorageScope(buffer->data); | ||
| } | ||
| if (scope != "shared.dyn") { | ||
| return std::nullopt; | ||
| } | ||
| int elem_bytes = buffer->dtype.bytes() * buffer->dtype.lanes(); | ||
| if (elem_bytes <= 0) { | ||
| return std::nullopt; | ||
| } | ||
| arith::Analyzer analyzer; | ||
| for (const auto &kv : dyn_shared_alias_var_ranges_) { | ||
| analyzer.Bind(GetRef<Var>(kv.first), kv.second); | ||
| } | ||
| auto try_match = [&](const DynSharedAliasInfo &info, | ||
| bool require_range_check) { | ||
| if (info.dtype != buffer->dtype) { | ||
| return std::optional<DynSharedAliasAccess>(); | ||
| } | ||
| PrimExpr base_offset = | ||
| analyzer.Simplify(indexdiv(info.byte_offset, elem_bytes)); | ||
| PrimExpr shifted = analyzer.Simplify(index - base_offset); | ||
| PrimExpr zero = make_const(shifted.dtype(), 0); | ||
| if (require_range_check && !is_zero(info.total_byte_size) && | ||
| (!analyzer.CanProve(shifted >= zero) || | ||
| !analyzer.CanProve( | ||
| shifted < | ||
| analyzer.Simplify(indexdiv(info.total_byte_size, elem_bytes))))) { | ||
| return std::optional<DynSharedAliasAccess>(); | ||
| } | ||
| DynSharedAliasAccess access; | ||
| access.info = &info; | ||
| if (is_zero(info.stage_stride_bytes)) { | ||
| access.element_index = shifted; | ||
| return std::optional<DynSharedAliasAccess>(std::move(access)); | ||
| } | ||
| PrimExpr stage_stride = | ||
| analyzer.Simplify(indexdiv(info.stage_stride_bytes, elem_bytes)); | ||
| if (!dyn_shared_stage_expr_stack_.empty()) { | ||
| PrimExpr stage_index = dyn_shared_stage_expr_stack_.back(); | ||
| PrimExpr stage_shifted = | ||
| analyzer.Simplify(shifted - stage_index * stage_stride); | ||
| if (analyzer.CanProve(stage_shifted >= zero) && | ||
| analyzer.CanProve(stage_shifted < stage_stride)) { | ||
| access.stage_index = stage_index; | ||
| access.element_index = stage_shifted; | ||
| return std::optional<DynSharedAliasAccess>(std::move(access)); | ||
| } | ||
| } | ||
| access.stage_index = analyzer.Simplify(indexdiv(shifted, stage_stride)); | ||
| access.element_index = analyzer.Simplify(indexmod(shifted, stage_stride)); | ||
| return std::optional<DynSharedAliasAccess>(std::move(access)); | ||
| }; | ||
|
|
||
| auto it = dyn_shared_aliases_.find(buffer->name); | ||
| if (it != dyn_shared_aliases_.end()) { | ||
| if (auto access = try_match(it->second, false)) { | ||
| return access; | ||
| } | ||
| } | ||
| for (const auto &buffer_name : dyn_shared_alias_order_) { | ||
| const auto &info = dyn_shared_aliases_.at(buffer_name); | ||
| if (auto access = try_match(info, true)) { | ||
| return access; | ||
| } | ||
| } | ||
| return std::nullopt; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP --type=cpp -C5 'stage_stride_bytes\s*=' src/transform/Repository: tile-ai/tilelang
Length of output: 2465
🏁 Script executed:
#!/bin/bash
# Get context around total_bytes computation and stage_stride_bytes assignment
rg -nP --type=cpp -B20 -A10 'stage_stride_bytes\s*=' src/transform/merge_shared_memory_allocations.cc | head -120Repository: tile-ai/tilelang
Length of output: 2298
🏁 Script executed:
#!/bin/bash
# Check if there are any assertions or validity checks for stage_stride_bytes or elem_bytes in the codegen file
rg -nP --type=cpp 'stage_stride|elem_bytes' src/target/codegen_cuda.cc | head -30Repository: tile-ai/tilelang
Length of output: 982
🏁 Script executed:
#!/bin/bash
# Look for DynSharedAliasInfo struct definition and any comments about invariants
rg -nP --type=cpp -B5 -A15 'struct DynSharedAliasInfo' src/Repository: tile-ai/tilelang
Length of output: 1313
🏁 Script executed:
#!/bin/bash
# Find where buffer_num_stages_ is populated
rg -nP --type=cpp 'buffer_num_stages_\s*\[' src/transform/merge_shared_memory_allocations.cc | head -20Repository: tile-ai/tilelang
Length of output: 118
🏁 Script executed:
#!/bin/bash
# Look for any validation/alignment logic in buffer allocation
rg -nP --type=cpp -B3 -A3 'num_stages.*>.*1' src/transform/merge_shared_memory_allocations.cc | head -40Repository: tile-ai/tilelang
Length of output: 821
🏁 Script executed:
#!/bin/bash
# Check if there are any comments about the invariants in DynSharedAliasInfo or related metadata
rg -nP --type=cpp -B10 'stage_stride_bytes' src/target/codegen_cuda.hRepository: tile-ai/tilelang
Length of output: 556
🏁 Script executed:
#!/bin/bash
# Check the AllocateNode structure and understand buffer allocation constraints
rg -nP --type=cpp -B5 -A10 'class AllocateNode' src/Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Look for any padding/alignment logic in buffer allocation
rg -nP --type=cpp 'AlignTo|align|padding' src/transform/merge_shared_memory_allocations.cc | head -30Repository: tile-ai/tilelang
Length of output: 1723
🏁 Script executed:
#!/bin/bash
# Check if there are any assertions or validation that buffers are properly sized
rg -nP --type=cpp -B2 -A2 'total_bytes|stage_stride_bytes' src/transform/merge_shared_memory_allocations.cc | grep -A3 -B3 'CHECK\|ASSERT\|DCHECK'Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Look for where num_stages is derived from and what constraints it has
rg -nP --type=cpp -B10 'kPipelineMVBContextNumStages' src/transform/merge_shared_memory_allocations.cc | head -50Repository: tile-ai/tilelang
Length of output: 503
🏁 Script executed:
#!/bin/bash
# Check if there's any documentation or comments about stage_stride_bytes guarantees
rg -nP --type=cpp -B5 -A5 'stage_stride_bytes' src/target/codegen_cuda.cc | grep -A6 -B6 'is_zero'Repository: tile-ai/tilelang
Length of output: 1208
🏁 Script executed:
#!/bin/bash
# Search for any validation that ensures proper alignment of allocations
rg -nP --type=cpp 'stage_stride_bytes.*make_const' src/transform/merge_shared_memory_allocations.cc -A5 -B5Repository: tile-ai/tilelang
Length of output: 608
🏁 Script executed:
#!/bin/bash
# Understand the complete picture: trace how extents are computed and if they're constrained
rg -nP --type=cpp -B20 'for.*const PrimExpr &extent : alloc->extents' src/transform/merge_shared_memory_allocations.cc | head -60Repository: tile-ai/tilelang
Length of output: 1938
🏁 Script executed:
#!/bin/bash
# Check if there are any pipeline-related constraints on buffer sizing
rg -nP --type=cpp 'kPipelineMVBContextNumStages' src/ -B10 -A10 | head -80Repository: tile-ai/tilelang
Length of output: 6771
🏁 Script executed:
#!/bin/bash
# Look at how the metadata is consumed - are there any safeguards in the codegen?
rg -nP --type=cpp -B3 -A8 'stage_stride =.*indexdiv.*stage_stride_bytes' src/target/codegen_cuda.ccRepository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check if there's a pattern of using is_zero guard after indexdiv elsewhere in the codebase
rg -nP --type=cpp 'is_zero.*indexdiv\|indexdiv.*is_zero' src/ | head -20Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Look at the surrounding code context to see if stage_stride value 0 would have other consequences
rg -nP --type=cpp -A15 'stage_stride = analyzer.Simplify.*indexdiv' src/target/codegen_cuda.ccRepository: tile-ai/tilelang
Length of output: 42
Add defensive check to prevent divide-by-zero on stage_stride.
Line 293: stage_stride = indexdiv(stage_stride_bytes, elem_bytes) can collapse to zero if stage_stride_bytes < elem_bytes, yet the subsequent indexdiv(shifted, stage_stride) / indexmod(shifted, stage_stride) at lines 305–306 don't guard against it. The is_zero(info.stage_stride_bytes) check at line 288 prevents a zero denominator only in that specific branch, not after the indexdiv at line 293. In practice, merge_shared_memory_allocations.cc path 1 (lines 578–580) always produces a multiple of elem_bytes, but path 2 (lines 584–585) uses indexdiv(total_bytes, num_stages) with no guarantee of alignment. Add an early is_zero(stage_stride) bail-out after line 293 to be defensive.
Additionally, line 266–269: A fresh arith::Analyzer is constructed and re-bound on every access. For large kernels with many dyn-shared accesses this is O(N_accesses × N_ranges). Caching the analyzer (invalidated on Bind/Unbind) would improve codegen performance, though it's optional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/target/codegen_cuda.cc` around lines 246 - 323,
TryMatchDynSharedAliasAccess can divide by zero when stage_stride (computed from
info.stage_stride_bytes via indexdiv) collapses to zero; after computing
stage_stride (the variable set from indexdiv(info.stage_stride_bytes,
elem_bytes)) add a defensive check that if is_zero(stage_stride) (or
analyzer.CanProve(stage_stride == 0)) then return std::nullopt to avoid
subsequent indexdiv/indexmod with a zero divisor; update the branch that handles
non-zero stage_stride_bytes (and the following path that computes
access.stage_index/access.element_index) to rely on this check. Optionally, to
improve performance, avoid reconstructing arith::Analyzer on every call to
TryMatchDynSharedAliasAccess by reusing/caching a single Analyzer instance that
is invalidated when Bind/Unbind occurs (ensure dyn_shared_alias_var_ranges_
rebinding remains correct).
71ac4a8 to
2540b79
Compare
kurisu6912
left a comment
There was a problem hiding this comment.
This pr add smem alias by record aliases in merge_shared_memory_allocations pass, and then match the alias in codegen passes.
- At the allocation pass, use prim func attrs to record each buffer's allocation byte range.
- At the codegen pass, use Analyzer to prove the buffer's index resides in that range, and find out its alias.
Generally this is a very good solution, thanks for contribution!
|
@Rachmanino Tilelang has FP4 buffers, the current pr uses stride to compute the alias, does it handle fp4 buffers correctly? Sometime fp4 stride is 0.5x, and shape is 2x, which is often counterintuitive. |
Thanks for your insight! I'll check this when im available. Besides, CI has passed with this feature enabled, and I guess that there's already some fp4 examples in our tests? |
|
I guess you are right, thx! I'll fix this soon and rebase to eliminate conflict |
8758136 to
93f87f3
Compare
|
pending and wait for #2216 then we can take a look for if there exist better solutions. |
|
Closed. tir has been rebased into tir, where alloc is now a single alloc_buffer statement instead of a declaration and allocation with a body. We also have |
This PR adds an opt-in CUDA codegen path for improving the readability of merged shared.dyn buffers by preserving DSL-levelshared-memory names in generated source.
What it does:
half_t* A_shared = reinterpret_cast<half_t*>(buf_dyn_shmem + ...);auto A_shared = tl::PatternVisitor([&](int i) { ... });Current limitations / not fully solved yet:
many staged producer accesses).
when multiple logical buffers are merged into the same physical buf_dyn_shmem region.
example, an address logically corresponding to C_shared may still be rendered as A_shared[0] if both map to the same merged
address range and only physical address information remains.
to-end through every pass that rewrites tvm_access_ptr.
Follow-up direction:
paths), instead of relying on address-based alias recovery after buffer merging.
Summary by CodeRabbit
New Features
Tests