[S-TIR][CUDA] Fix legacy predicated cp.async zero fill#19741
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces improvements to CUDA asynchronous copy operations and software pipelining. Specifically, it adds a predicate check in cp_async.py to zero out the destination and return early when the predicate is false, and refines the physical wait count calculation in inject_software_pipeline.cc when multiple producer heads exist per commit. Additionally, the tests are updated to dynamically assert the order of CUDA source fragments and verify runtime execution on Ampere or newer GPUs. The reviewer suggested deferring the generic-to-shared address translation (dst_addr) in cp_async.py until after the early-return predicate check to avoid unnecessary overhead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| " unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n" | ||
| " if (!predicate) {\n" | ||
| f" for (int i = 0; i < {cp_size_v}; ++i) {{\n" | ||
| " dst_p[i] = 0;\n" | ||
| " }\n" | ||
| " return;\n" | ||
| " }\n" |
There was a problem hiding this comment.
We can defer the computation of dst_addr (which performs generic-to-shared address translation) until after the if (!predicate) check. This avoids unnecessary address translation overhead when the predicate is false and the function returns early.
| " unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n" | |
| " if (!predicate) {\n" | |
| f" for (int i = 0; i < {cp_size_v}; ++i) {{\n" | |
| " dst_p[i] = 0;\n" | |
| " }\n" | |
| " return;\n" | |
| " }\n" | |
| " if (!predicate) {\n" | |
| f" for (int i = 0; i < {cp_size_v}; ++i) {{\n" | |
| " dst_p[i] = 0;\n" | |
| " }\n" | |
| " return;\n" | |
| " }\n" | |
| " unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n" |
0037982 to
5e5995d
Compare
5e5995d to
f10d38f
Compare
f10d38f to
f157dc8
Compare
f157dc8 to
44bb539
Compare
44bb539 to
e0447c6
Compare
test_cp_async_in_if_then_else compares the generated CUDA source byte-for-byte against a snapshot, and the cse_v* numbering in the generated source is currently nondeterministic across processes, so the snapshot comparison cannot be stable. Mark it xfail (non-strict) with a TODO so the s_tir/transform CI enrollment (apache#19737) is not blocked; the mark goes away once CSE ordering is made deterministic and the snapshot is regenerated (apache#19741).
test_cp_async_in_if_then_else compares the generated CUDA source byte-for-byte against a snapshot, and the cse_v* numbering in the generated source is currently nondeterministic across processes, so the snapshot comparison cannot be stable. Mark it xfail (non-strict) with a TODO so the s_tir/transform CI enrollment (apache#19737) is not blocked; the mark goes away once CSE ordering is made deterministic and the snapshot is regenerated (apache#19741).
The CUDA source snapshot comparison is nondeterministic (cse_v numbering varies between runs). Mark the test xfail with a TODO until CSE determinism is fixed. See apache#19741.
53a8abc to
2856245
Compare
|
@tvm-bot rerun |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
…able CommonSubexprElim numbered its cse_v variables nondeterministically: the planner's expression table was an unordered_map hashed with StructuralHash, which hashes free variables by object identity and so varies between processes (ASLR). Both the table iteration order and the StructuralHash sort tie-breaker leaked that randomness into the plan, making generated code differ run to run and breaking the byte-to-byte CUDA snapshot comparison in test_cp_async_in_if_then_else. Switch the table to support::OrderedMap so iteration follows discovery (program) order, and drop the hash tie-breaker: the stable sort by expr_depth now keeps equal-depth entries in discovery order, giving a fully deterministic plan and cse_v numbering. Regenerate the CUDA snapshot with the deterministic numbering (verified byte-identical across independent processes) and remove the xfail marker from test_cp_async_in_if_then_else added in apache#19751, since the test now passes deterministically.
2856245 to
93efb06
Compare
This fixes the legacy predicated `ptx.cp_async` codegen path used by `InjectPTXAsyncCopy` for `if_then_else(..., 0)` stores. The old inline CUDA emission zero-filled the shared-memory destination when the predicate was false. The TIRx helper-based legacy codegen only skipped the `cp.async`, leaving the destination slot unchanged. This restores the previous behavior by emitting an `@!p st.shared.*` zero store in the generated legacy predicated helper. The CUDA source snapshot in `test_s_tir_transform_inject_ptx_async_copy.py` is updated to reflect the restored false-predicate zero-fill instruction and the current generated helper-based CUDA source. (cherry picked from commit 126f1ba)
This fixes the legacy predicated `ptx.cp_async` codegen path used by `InjectPTXAsyncCopy` for `if_then_else(..., 0)` stores. The old inline CUDA emission zero-filled the shared-memory destination when the predicate was false. The TIRx helper-based legacy codegen only skipped the `cp.async`, leaving the destination slot unchanged. This restores the previous behavior by emitting an `@!p st.shared.*` zero store in the generated legacy predicated helper. The CUDA source snapshot in `test_s_tir_transform_inject_ptx_async_copy.py` is updated to reflect the restored false-predicate zero-fill instruction and the current generated helper-based CUDA source. (cherry picked from commit 126f1ba)
This fixes the legacy predicated
ptx.cp_asynccodegen path used byInjectPTXAsyncCopyforif_then_else(..., 0)stores.The old inline CUDA emission zero-filled the shared-memory destination when the predicate was false. The TIRx helper-based legacy codegen only skipped the
cp.async, leaving the destination slot unchanged. This restores the previous behavior by emitting an@!p st.shared.*zero store in the generated legacy predicated helper.The CUDA source snapshot in
test_s_tir_transform_inject_ptx_async_copy.pyis updated to reflect the restored false-predicate zero-fill instruction and the current generated helper-based CUDA source.