[S-TIR][CUDA] Fix legacy predicated cp.async zero fill by tlopex · Pull Request #19741 · apache/tvm

tlopex · 2026-06-12T00:32:29Z

This fixes the legacy predicated ptx.cp_async codegen path used by InjectPTXAsyncCopy for if_then_else(..., 0) stores.

The old inline CUDA emission zero-filled the shared-memory destination when the predicate was false. The TIRx helper-based legacy codegen only skipped the cp.async, leaving the destination slot unchanged. This restores the previous behavior by emitting an @!p st.shared.* zero store in the generated legacy predicated helper.
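The difference between the two behaviors can be modeled on the host. The following is a minimal sketch in plain Python, not TVM code: `copy_skip_only` and `copy_zero_fill` are hypothetical names, and a `bytearray` stands in for the shared-memory buffer.

```python
def copy_skip_only(dst, src, offset, nbytes, predicate):
    """Model of the regressed path: a false predicate only skips the copy."""
    if predicate:
        dst[offset:offset + nbytes] = src[:nbytes]


def copy_zero_fill(dst, src, offset, nbytes, predicate):
    """Model of the restored path: a false predicate zero-fills the slot."""
    if predicate:
        dst[offset:offset + nbytes] = src[:nbytes]
    else:
        dst[offset:offset + nbytes] = bytes(nbytes)


# Stand-in for shared memory that still holds stale data from an earlier use.
stale = bytearray(b"\xaa" * 8)
src = bytes(range(1, 5))

regressed = bytearray(stale)
copy_skip_only(regressed, src, 4, 4, predicate=False)

restored = bytearray(stale)
copy_zero_fill(restored, src, 4, 4, predicate=False)

print(regressed.hex())  # aaaaaaaaaaaaaaaa (stale bytes survive)
print(restored.hex())   # aaaaaaaa00000000 (slot matches the `0` branch)
```

With skip-only semantics a consumer reading the slot sees whatever was there before, which does not match the `0` branch of `if_then_else(..., 0)`.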

The CUDA source snapshot in test_s_tir_transform_inject_ptx_async_copy.py is updated to reflect the restored false-predicate zero-fill instruction and the current generated helper-based CUDA source.

gemini-code-assist

Code Review

This pull request introduces improvements to CUDA asynchronous copy operations and software pipelining. Specifically, it adds a predicate check in cp_async.py to zero out the destination and return early when the predicate is false, and refines the physical wait count calculation in inject_software_pipeline.cc when multiple producer heads exist per commit. Additionally, the tests are updated to dynamically assert the order of CUDA source fragments and verify runtime execution on Ampere or newer GPUs. The reviewer suggested deferring the generic-to-shared address translation (dst_addr) in cp_async.py until after the early-return predicate check to avoid unnecessary overhead.
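An order-of-fragments check of the kind mentioned in the summary could look like the sketch below. This is an illustration only: `assert_fragments_in_order` and the sample source string are hypothetical and are not taken from the TVM test suite.

```python
def assert_fragments_in_order(source, fragments):
    """Assert that each fragment occurs in `source` after the previous one."""
    pos = 0
    for fragment in fragments:
        found = source.find(fragment, pos)
        assert found != -1, f"missing or out of order: {fragment!r}"
        pos = found + len(fragment)
    return pos


# Hypothetical generated source, reduced to the parts being checked.
sample_source = (
    "  if (!predicate) {\n"
    "    return;\n"
    "  }\n"
    "  unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"
)

assert_fragments_in_order(
    sample_source,
    ["if (!predicate)", "return;", "__cvta_generic_to_shared"],
)
```

Compared with a byte-for-byte snapshot, a check like this tolerates renamed temporaries and whitespace changes, at the cost of not catching unrelated codegen differences.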


gemini-code-assist · 2026-06-12T00:33:44Z

                "  unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"
+                "  if (!predicate) {\n"
+                f"    for (int i = 0; i < {cp_size_v}; ++i) {{\n"
+                "      dst_p[i] = 0;\n"
+                "    }\n"
+                "    return;\n"
+                "  }\n"


We can defer the computation of dst_addr (which performs generic-to-shared address translation) until after the if (!predicate) check. This avoids unnecessary address translation overhead when the predicate is false and the function returns early.

Suggested change

-                "  unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"
-                "  if (!predicate) {\n"
-                f"    for (int i = 0; i < {cp_size_v}; ++i) {{\n"
-                "      dst_p[i] = 0;\n"
-                "    }\n"
-                "    return;\n"
-                "  }\n"
+                "  if (!predicate) {\n"
+                f"    for (int i = 0; i < {cp_size_v}; ++i) {{\n"
+                "      dst_p[i] = 0;\n"
+                "    }\n"
+                "    return;\n"
+                "  }\n"
+                "  unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"
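The reordering can be sketched as a small string builder. This is an illustration under stated assumptions: `build_predicated_helper_body` is a hypothetical name, only the lines quoted in the review are reproduced, and the actual `cp.async` issue that follows is elided rather than guessed.

```python
def build_predicated_helper_body(cp_size_v: int) -> str:
    """Emit the helper body with the early return placed before dst_addr."""
    return (
        "  if (!predicate) {\n"
        f"    for (int i = 0; i < {cp_size_v}; ++i) {{\n"
        "      dst_p[i] = 0;\n"
        "    }\n"
        "    return;\n"
        "  }\n"
        # Address translation is only reached when the copy will be issued.
        "  unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"
        "  // cp.async issue elided in this sketch\n"
    )


print(build_predicated_helper_body(16))
```

The emitted text is the same set of statements as before; only their order changes, so the false-predicate path never computes `dst_addr`.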

test_cp_async_in_if_then_else compares the generated CUDA source byte-for-byte against a snapshot, and the cse_v* numbering in the generated source is currently nondeterministic across processes, so the snapshot comparison cannot be stable. Mark it xfail (non-strict) with a TODO so the s_tir/transform CI enrollment (apache#19737) is not blocked; the mark goes away once CSE ordering is made deterministic and the snapshot is regenerated (apache#19741).

The CUDA source snapshot comparison is nondeterministic (cse_v numbering varies between runs). Mark the test xfail with a TODO until CSE determinism is fixed. See apache#19741.

tlopex · 2026-06-13T01:28:46Z

@tvm-bot rerun

gemini-code-assist · 2026-06-13T05:38:33Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

…able CommonSubexprElim numbered its cse_v variables nondeterministically: the planner's expression table was an unordered_map hashed with StructuralHash, which hashes free variables by object identity and so varies between processes (ASLR). Both the table iteration order and the StructuralHash sort tie-breaker leaked that randomness into the plan, making generated code differ from run to run and breaking the byte-for-byte CUDA snapshot comparison in test_cp_async_in_if_then_else. Switch the table to support::OrderedMap so iteration follows discovery (program) order, and drop the hash tie-breaker: the stable sort by expr_depth now keeps equal-depth entries in discovery order, giving a fully deterministic plan and cse_v numbering. Regenerate the CUDA snapshot with the deterministic numbering (verified byte-identical across independent processes) and remove the xfail marker from test_cp_async_in_if_then_else added in apache#19751, since the test now passes deterministically.
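The ordering argument in this commit message can be illustrated with a small model. This is a sketch, not the TVM planner: `number_cse_candidates` is a hypothetical name, expressions are plain strings, a Python `dict` (insertion-ordered) stands in for `support::OrderedMap`, and both the ascending sort direction and the `cse_v1` starting index are assumptions made for the example.

```python
def number_cse_candidates(discovered):
    """Assign cse_v names from discovery order plus a stable depth sort.

    `discovered` is a list of (expression, expr_depth) pairs in program order.
    """
    table = {}  # insertion-ordered, so iteration follows discovery order
    for expr, depth in discovered:
        table.setdefault(expr, depth)

    # sorted() is stable: entries with equal depth keep discovery order,
    # so no hash-based tie-breaker is needed.
    ordered = sorted(table.items(), key=lambda item: item[1])
    return {expr: f"cse_v{i + 1}" for i, (expr, _) in enumerate(ordered)}


names = number_cse_candidates(
    [("a + b", 1), ("x * y", 1), ("(a + b) * c", 2), ("a + b", 1)]
)
print(names)
# {'a + b': 'cse_v1', 'x * y': 'cse_v2', '(a + b) * c': 'cse_v3'}
```

Nothing in this scheme depends on object addresses, so two independent processes given the same program produce the same numbering.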

This fixes the legacy predicated `ptx.cp_async` codegen path used by `InjectPTXAsyncCopy` for `if_then_else(..., 0)` stores. The old inline CUDA emission zero-filled the shared-memory destination when the predicate was false. The TIRx helper-based legacy codegen only skipped the `cp.async`, leaving the destination slot unchanged. This restores the previous behavior by emitting an `@!p st.shared.*` zero store in the generated legacy predicated helper. The CUDA source snapshot in `test_s_tir_transform_inject_ptx_async_copy.py` is updated to reflect the restored false-predicate zero-fill instruction and the current generated helper-based CUDA source. (cherry picked from commit 126f1ba)

gemini-code-assist Bot reviewed Jun 12, 2026


tlopex changed the title ~~[Tests] Check cp.async CUDA source invariants~~ [S-TIR][CUDA] Fix predicated cp.async pipeline correctness Jun 12, 2026

tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch 3 times, most recently from 0037982 to 5e5995d Compare June 12, 2026 00:48

tlopex marked this pull request as draft June 12, 2026 00:51

tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from 5e5995d to f10d38f Compare June 12, 2026 01:00

tlopex changed the title ~~[S-TIR][CUDA] Fix predicated cp.async pipeline correctness~~ [S-TIR][Tests] Update cp.async expected CUDA source Jun 12, 2026

tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from f10d38f to f157dc8 Compare June 12, 2026 01:06

tlopex marked this pull request as ready for review June 12, 2026 01:07

tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from f157dc8 to 44bb539 Compare June 12, 2026 01:25

tlopex changed the title ~~[S-TIR][Tests] Update cp.async expected CUDA source~~ [S-TIR][CUDA] Fix legacy predicated cp.async zero fill Jun 12, 2026

tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from 44bb539 to e0447c6 Compare June 12, 2026 01:40

tlopex marked this pull request as draft June 12, 2026 01:50

gemini-code-assist Bot mentioned this pull request Jun 12, 2026

[S-TIR][Tests] Mark test_cp_async_in_if_then_else as xfail #19751

Merged

tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from 53a8abc to 2856245 Compare June 13, 2026 01:15

tlopex marked this pull request as ready for review June 13, 2026 05:38

tlopex added 2 commits June 13, 2026 02:42

Fix legacy predicated cp.async zero fill

8896e22

tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from 2856245 to 93efb06 Compare June 13, 2026 06:43

tqchen approved these changes Jun 13, 2026


tqchen merged commit 126f1ba into apache:main Jun 13, 2026
8 checks passed

tqchen merged 2 commits into apache:main from tlopex:fix-s-tir-cp-async-source-snapshot
