[TIRx] Bringup TIRx Infrastructure#19581
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces TIRX, an extension to Apache TVM providing hardware-aware layout abstractions (Axe Layout), hierarchical execution scopes, and an operator dispatch framework for NVIDIA and AWS Trainium targets. The changes include a vast library of intrinsics for MMA, TMA, and asynchronous pipelining, alongside specialized schedulers, benchmarking utilities, and the renaming of legacy s_tir layout classes for better namespace separation. Feedback focuses on enhancing robustness, such as adding fallbacks for missing type annotations in codegen_ptx_cp_async, improving template instantiation for packed types in codegen_cuda_warp_reduce, and using arith.Analyzer for symbolic dimension validation. Additionally, the review suggests adopting a more structured approach for operand mapping in inline assembly and ensuring strict scheduler bound enforcement in GroupMajor3D to avoid potential indexing errors.
c7d65f2 to
a4cedb7
Compare
|
@tvm-bot rerun |
|
Failed to re-run CI in https://github.com/apache/tvm/actions/runs/26000722685 Detailswith response |
809a717 to
7ad67ed
Compare
Follow-up work on top of the TIRx infrastructure (apache#19581): - op-dispatch: warp ldmatrix/stmatrix copy dispatch; split CUDA copy into reg + gmem_smem + ldgsts; tcgen05.ld/st .16x{64,128,256}b dispatch + factory + M=128 layout; element-wise broadcast at the layout level + copy vec-alignment fix - gemm: CUDA synchronous mma.sync tensor-core dispatch; accept Layout F C operand for M=64 MMAs - op: add permute_layout primitive (removes permute_dims) - tvmscript: @Tx.jit decorator, Tx.constexpr params, Tx.wg_reg_tile - lower-tirx: Tx.device_entry() marker replacing ScopeKind::kKernel; canonical thread filters (drop Tx.filter wrapper) - codegen: typed pointer byte-offset intrinsic; remove the entry_cluster_sync codegen attribute - arith: memoize IntervalSet variable relaxation; gate canonical-simplify LT Case 2 on extra scale == +1
Follow-up work on top of the TIRx infrastructure (apache#19581): - op-dispatch: warp ldmatrix/stmatrix copy dispatch; split CUDA copy into reg + gmem_smem + ldgsts; tcgen05.ld/st .16x{64,128,256}b dispatch + factory + M=128 layout; element-wise broadcast at the layout level + copy vec-alignment fix - gemm: CUDA synchronous mma.sync tensor-core dispatch; accept Layout F C operand for M=64 MMAs - op: add permute_layout primitive (removes permute_dims) - tvmscript: Tx.jit decorator, Tx.constexpr params, Tx.wg_reg_tile - lower-tirx: Tx.device_entry() marker replacing ScopeKind::kKernel; canonical thread filters (drop Tx.filter wrapper) - codegen: typed pointer byte-offset intrinsic; remove the entry_cluster_sync codegen attribute
) ## Summary Follow-up work on top of the TIRx infrastructure bring-up (#19581). It extends the TIRx operator-dispatch, codegen, and TVMScript surfaces with the next batch of low-level programming features for Blackwell-class GPUs, while keeping `s_tir` script support intact. ## Main Changes - **op-dispatch**: warp `ldmatrix`/`stmatrix` copy dispatch; split CUDA copy into register / gmem-smem / `ldgsts` paths; `tcgen05.ld/st` `.16x{64,128,256}b` dispatch with a factory and M=128 layout; element-wise broadcast at the layout level with a copy vec-alignment fix. - **gemm**: CUDA synchronous `mma.sync` tensor-core dispatch; accept a Layout F C operand for M=64 MMAs. - **op**: add the `permute_layout` primitive (replaces `permute_dims`). - **tvmscript**: add the `Tx.jit` decorator, `Tx.constexpr` compile-time params, and `Tx.wg_reg_tile`. - **lower-tirx**: introduce the `Tx.device_entry()` marker (replacing `ScopeKind::kKernel`); canonical thread filters that drop the `Tx.filter` wrapper. - **codegen**: add a typed-pointer byte-offset intrinsic; remove the `entry_cluster_sync` codegen attribute. ## Validation - `pre-commit run` (changed files) — clean - `ninja -C build -j$(nproc)` — builds - `pytest tests/python/tirx/ -n 16` - `1997 passed, 39 skipped, 3 xpassed` - `python -m pytest tests/python/all-platform-minimal-test` - `37 passed, 105 skipped` - `TVM_TEST_TARGETS=llvm pytest tests/python/tirx-analysis tests/python/tirx-base tests/python/tirx-transform -n 16` - `630 passed, 25 skipped, 8 xfailed, 1 xpassed` ## Local CI Notes Several full CI-equivalent jobs are not locally reproducible because this machine is missing parts of the Apache TVM CI environment (e.g., specific `llvm-config` versions, Vulkan, ROCm, ARM/QEMU cross-toolchain, and web/wasm components). The Blackwell/Trainium kernel tests are maintained downstream and are intentionally not part of this PR.
…a annotations Since the TIRx infrastructure bringup (apache#19581), LowerOpaqueBlock preserves buffer metadata as allocation annotations ("buffer_data_alignment", "buffer_allocated_addr"), which codegen consumes (codegen_cuda.cc, codegen_trn.cc). Update the tests' expected IR to carry these annotations.
Since apache#19581 the TIRx well-formedness verifier rejects SBlock/SBlockRealize in tirx=True mode, so the bare T.prim_func input of test_scalar_block_no_loops failed at parse time before the pass ever ran. SBlocks are s_tir-mode constructs; parse the Before/Expected modules with s_tir=True like every other test in this file, keeping the no-itervar block shape the regression test exists to cover.
… let binds to T.let (#19729) Fix the s_tir tests broken or left stale by two upstream changes. * test_meta_schedule_space_cuda.py (cap, dil, gmm, t2d, nrm, sfm, cbr, tbg) and test_meta_schedule_space_cuda_async.py (c2d): #18927 expanded DefaultCUDA unroll_max_steps from {0, 16, 64, 512, 1024} to {0, 16, 32, 64, 128, 256, 512, 1024} without updating the recorded SampleCategorical decisions. Remap the indices (2->3, 3->6, 4->7) so each test keeps sampling the same unroll value; every sketch was re-verified by replaying the trace and structurally comparing against the expected module. * T.let migration: since #19581 the TIRx parser treats `v: T.int32 = expr` as a mutable local-scalar buffer instead of an immutable bind, which is now spelled `v: T.let[T.int32] = expr` (a Bind node, the same form te.create_prim_func emits). Tests whose intent is a bind are migrated to the new spelling: reduction combiner temporaries (add_rfactor, lower_cross_thread_reduction) and let-dependent passes (compact_buffer_region, hoist_expression, remove_undef). * Also convert reduction temporaries in still-green tests (cross_thread_reduction rule, compute_inline, schedule utilities, parallel_vectorize_unroll postproc, dlight general reduction, relax cuda_graph) so the hand-written workloads match the canonical Bind form instead of feeding rules a mutable-scalar body.
This PR fixes 11 test failures in `tests/python/s_tir/transform/` introduced as side effects of the TIRx bringup (#19581 / 859498d), in three independent commits. ### 1. LowerOpaqueBlock: update expected IR for buffer metadata annotations `LowerOpaqueBlock` now emits `buffer_allocated_addr` and `buffer_data_alignment` annotations on lowered allocations (intentional in #19581: the annotations are consumed downstream by `codegen_cuda.cc` / `codegen_trn.cc`; the alignment value 64 comes from `kAllocAlignment`). The tests' expected IR predates this, so `assert_structural_equal` failed on the missing annotations. Fix: update the expected IR in `test_s_tir_transform_lower_opaque_block.py` to carry the annotations (`T.decl_buffer(...)` → `T.alloc_buffer(..., annotations={...})`). Fixes 6 tests. ### 2. DefaultGPUSchedule: parse scalar-block test in s_tir mode #19581 added a well-formedness rule rejecting `SBlockRealize` in `tirx=True` mode, which is correct — sblocks are s_tir-mode constructs. The hand-written `Before`/`Expected` modules in `test_scalar_block_no_loops` were the only ones in the file still using plain `T.prim_func`, so they failed at parse time before the pass under test even ran. Fix: parse both modules with `T.prim_func(s_tir=True)`, consistent with every other test in the file. Fixes 1 test. ### 3. InjectPermutedLayout: match legacy PTX intrinsics by canonical name #19581 registers device intrinsics under two Op identities: a flat builtin name (returned by `builtin::xxx()` in C++) and a canonical dotted name (e.g. `tirx.ptx.ldmatrix_legacy`, produced when TVMScript / tensor intrinsics are parsed). `InjectPermutedLayout` only compared with `same_as(builtin::...)`, so it silently skipped rewriting the swizzled shared-memory offsets of parsed legacy-form calls, leaving the expected swizzle index expressions unmatched. Fix: match `ptx_ldmatrix_legacy` / `mma_store_legacy` by both the builtin Op and the canonical name via an `IsOp` helper, following the existing pattern in `lower_warp_memory.cc` and `codegen_cuda.cc`. Only the legacy intrinsic forms fold shared-memory access into `tvm_access_ptr` + offset; non-legacy forms address shared memory through `BufferLoad` and are already handled by the BufferLoad visitor, so the unreachable `InternalError` throw is replaced by a pass-through. (`mma_store_legacy` has no dotted alias, hence the asymmetric name strings.) Fixes 4 tests.
… let binds to T.let (apache#19729) Fix the s_tir tests broken or left stale by two upstream changes. * test_meta_schedule_space_cuda.py (cap, dil, gmm, t2d, nrm, sfm, cbr, tbg) and test_meta_schedule_space_cuda_async.py (c2d): apache#18927 expanded DefaultCUDA unroll_max_steps from {0, 16, 64, 512, 1024} to {0, 16, 32, 64, 128, 256, 512, 1024} without updating the recorded SampleCategorical decisions. Remap the indices (2->3, 3->6, 4->7) so each test keeps sampling the same unroll value; every sketch was re-verified by replaying the trace and structurally comparing against the expected module. * T.let migration: since apache#19581 the TIRx parser treats `v: T.int32 = expr` as a mutable local-scalar buffer instead of an immutable bind, which is now spelled `v: T.let[T.int32] = expr` (a Bind node, the same form te.create_prim_func emits). Tests whose intent is a bind are migrated to the new spelling: reduction combiner temporaries (add_rfactor, lower_cross_thread_reduction) and let-dependent passes (compact_buffer_region, hoist_expression, remove_undef). * Also convert reduction temporaries in still-green tests (cross_thread_reduction rule, compute_inline, schedule utilities, parallel_vectorize_unroll postproc, dlight general reduction, relax cuda_graph) so the hand-written workloads match the canonical Bind form instead of feeding rules a mutable-scalar body. (cherry picked from commit 67b0c6c)
…#19735) This PR fixes 11 test failures in `tests/python/s_tir/transform/` introduced as side effects of the TIRx bringup (apache#19581 / 859498d), in three independent commits. ### 1. LowerOpaqueBlock: update expected IR for buffer metadata annotations `LowerOpaqueBlock` now emits `buffer_allocated_addr` and `buffer_data_alignment` annotations on lowered allocations (intentional in apache#19581: the annotations are consumed downstream by `codegen_cuda.cc` / `codegen_trn.cc`; the alignment value 64 comes from `kAllocAlignment`). The tests' expected IR predates this, so `assert_structural_equal` failed on the missing annotations. Fix: update the expected IR in `test_s_tir_transform_lower_opaque_block.py` to carry the annotations (`T.decl_buffer(...)` → `T.alloc_buffer(..., annotations={...})`). Fixes 6 tests. ### 2. DefaultGPUSchedule: parse scalar-block test in s_tir mode apache#19581 added a well-formedness rule rejecting `SBlockRealize` in `tirx=True` mode, which is correct — sblocks are s_tir-mode constructs. The hand-written `Before`/`Expected` modules in `test_scalar_block_no_loops` were the only ones in the file still using plain `T.prim_func`, so they failed at parse time before the pass under test even ran. Fix: parse both modules with `T.prim_func(s_tir=True)`, consistent with every other test in the file. Fixes 1 test. ### 3. InjectPermutedLayout: match legacy PTX intrinsics by canonical name apache#19581 registers device intrinsics under two Op identities: a flat builtin name (returned by `builtin::xxx()` in C++) and a canonical dotted name (e.g. `tirx.ptx.ldmatrix_legacy`, produced when TVMScript / tensor intrinsics are parsed). `InjectPermutedLayout` only compared with `same_as(builtin::...)`, so it silently skipped rewriting the swizzled shared-memory offsets of parsed legacy-form calls, leaving the expected swizzle index expressions unmatched. Fix: match `ptx_ldmatrix_legacy` / `mma_store_legacy` by both the builtin Op and the canonical name via an `IsOp` helper, following the existing pattern in `lower_warp_memory.cc` and `codegen_cuda.cc`. Only the legacy intrinsic forms fold shared-memory access into `tvm_access_ptr` + offset; non-legacy forms address shared memory through `BufferLoad` and are already handled by the BufferLoad visitor, so the unreachable `InternalError` throw is replaced by a pass-through. (`mma_store_legacy` has no dotted alias, hence the asymmetric name strings.) Fixes 4 tests. (cherry picked from commit c9a77d6)
… let binds to T.let (apache#19729) Fix the s_tir tests broken or left stale by two upstream changes. * test_meta_schedule_space_cuda.py (cap, dil, gmm, t2d, nrm, sfm, cbr, tbg) and test_meta_schedule_space_cuda_async.py (c2d): apache#18927 expanded DefaultCUDA unroll_max_steps from {0, 16, 64, 512, 1024} to {0, 16, 32, 64, 128, 256, 512, 1024} without updating the recorded SampleCategorical decisions. Remap the indices (2->3, 3->6, 4->7) so each test keeps sampling the same unroll value; every sketch was re-verified by replaying the trace and structurally comparing against the expected module. * T.let migration: since apache#19581 the TIRx parser treats `v: T.int32 = expr` as a mutable local-scalar buffer instead of an immutable bind, which is now spelled `v: T.let[T.int32] = expr` (a Bind node, the same form te.create_prim_func emits). Tests whose intent is a bind are migrated to the new spelling: reduction combiner temporaries (add_rfactor, lower_cross_thread_reduction) and let-dependent passes (compact_buffer_region, hoist_expression, remove_undef). * Also convert reduction temporaries in still-green tests (cross_thread_reduction rule, compute_inline, schedule utilities, parallel_vectorize_unroll postproc, dlight general reduction, relax cuda_graph) so the hand-written workloads match the canonical Bind form instead of feeding rules a mutable-scalar body. (cherry picked from commit 67b0c6c)
…#19735) This PR fixes 11 test failures in `tests/python/s_tir/transform/` introduced as side effects of the TIRx bringup (apache#19581 / 859498d), in three independent commits. ### 1. LowerOpaqueBlock: update expected IR for buffer metadata annotations `LowerOpaqueBlock` now emits `buffer_allocated_addr` and `buffer_data_alignment` annotations on lowered allocations (intentional in apache#19581: the annotations are consumed downstream by `codegen_cuda.cc` / `codegen_trn.cc`; the alignment value 64 comes from `kAllocAlignment`). The tests' expected IR predates this, so `assert_structural_equal` failed on the missing annotations. Fix: update the expected IR in `test_s_tir_transform_lower_opaque_block.py` to carry the annotations (`T.decl_buffer(...)` → `T.alloc_buffer(..., annotations={...})`). Fixes 6 tests. ### 2. DefaultGPUSchedule: parse scalar-block test in s_tir mode apache#19581 added a well-formedness rule rejecting `SBlockRealize` in `tirx=True` mode, which is correct — sblocks are s_tir-mode constructs. The hand-written `Before`/`Expected` modules in `test_scalar_block_no_loops` were the only ones in the file still using plain `T.prim_func`, so they failed at parse time before the pass under test even ran. Fix: parse both modules with `T.prim_func(s_tir=True)`, consistent with every other test in the file. Fixes 1 test. ### 3. InjectPermutedLayout: match legacy PTX intrinsics by canonical name apache#19581 registers device intrinsics under two Op identities: a flat builtin name (returned by `builtin::xxx()` in C++) and a canonical dotted name (e.g. `tirx.ptx.ldmatrix_legacy`, produced when TVMScript / tensor intrinsics are parsed). `InjectPermutedLayout` only compared with `same_as(builtin::...)`, so it silently skipped rewriting the swizzled shared-memory offsets of parsed legacy-form calls, leaving the expected swizzle index expressions unmatched. Fix: match `ptx_ldmatrix_legacy` / `mma_store_legacy` by both the builtin Op and the canonical name via an `IsOp` helper, following the existing pattern in `lower_warp_memory.cc` and `codegen_cuda.cc`. Only the legacy intrinsic forms fold shared-memory access into `tvm_access_ptr` + offset; non-legacy forms address shared memory through `BufferLoad` and are already handled by the BufferLoad visitor, so the unreachable `InternalError` throw is replaced by a pass-through. (`mma_store_legacy` has no dotted alias, hence the asymmetric name strings.) Fixes 4 tests. (cherry picked from commit c9a77d6)
Summary
This PR adds the initial TIRx support needed for low-level programming of Blackwell-class GPU architectures. As part of the ongoing TIRx refactor, it introduces TVMScript support for directly scripting advanced hardware features without relying on scheduling as the primary programming interface.
The change keeps existing
s_tirscript support intact while making direct scripting a first-class path for TIRx programs.Main Changes
Validation
pre-commit run --all-filesninja -C build -j32CUDA_VISIBLE_DEVICES=2 pytest tests/python/tirx/ -n 161723 passed, 47 skipped, 32 warningsCUDA_VISIBLE_DEVICES=2 python -m pytest -v tests/python/all-platform-minimal-test37 passed, 105 skippedTVM_TEST_TARGETS=llvm python -m pytest -v tests/python/tirx-analysis tests/python/tirx-base tests/python/tirx-transform -n 16664 passed, 25 skipped, 9 xfailed, 1 xpassedLocal CI Notes
Some full CI-equivalent jobs were not locally reproducible because this machine is missing parts of the Apache TVM CI environment, including
llvm-config-15/17, Vulkan, ROCm, Maven, Sphinx, Doxygen, Emscripten, and ARM/QEMU cross-toolchain components. Metal-specific tests were skipped locally because no Metal runtime is available.