[S-TIR][CUDA] Fix legacy predicated cp.async zero fill#19741

Merged
tqchen merged 2 commits into apache:main from tlopex:fix-s-tir-cp-async-source-snapshot
Jun 13, 2026

@tlopex tlopex commented Jun 12, 2026

Member

This fixes the legacy predicated ptx.cp_async codegen path used by InjectPTXAsyncCopy for if_then_else(..., 0) stores.

The old inline CUDA emission zero-filled the shared-memory destination when the predicate was false. The TIRx helper-based legacy codegen only skipped the cp.async, leaving the destination slot unchanged. This restores the previous behavior by emitting an @!p st.shared.* zero store in the generated legacy predicated helper.

The CUDA source snapshot in test_s_tir_transform_inject_ptx_async_copy.py is updated to reflect the restored false-predicate zero-fill instruction and the current generated helper-based CUDA source.
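
To make the behavioural difference concrete, here is a minimal pure-Python model of one predicated copy. It is only a sketch: `predicated_copy` and its `zero_fill` flag are hypothetical names invented for illustration, plain lists stand in for shared and global memory, and nothing here is TVM or CUDA API.

```python
def predicated_copy(dst, dst_off, src, src_off, size, predicate, zero_fill=True):
    """Model one predicated copy of `size` elements on plain Python lists.

    Hypothetical helper for illustration only; not part of TVM.
    """
    if predicate:
        # Predicate true: the copy happens as usual.
        dst[dst_off:dst_off + size] = src[src_off:src_off + size]
    elif zero_fill:
        # Predicate false, restored behaviour: the destination slot is zeroed,
        # matching the `0` branch of if_then_else(..., 0).
        dst[dst_off:dst_off + size] = [0] * size
    # Predicate false without zero fill: the slot keeps whatever it held before.


stale = [7, 7, 7, 7]  # stand-in for leftover shared-memory contents
src = [1, 2, 3, 4]

fixed = list(stale)
predicated_copy(fixed, 0, src, 0, 4, predicate=False, zero_fill=True)

regressed = list(stale)
predicated_copy(regressed, 0, src, 0, 4, predicate=False, zero_fill=False)

copied = list(stale)
predicated_copy(copied, 0, src, 0, 4, predicate=True)

print(fixed)      # [0, 0, 0, 0]
print(regressed)  # [7, 7, 7, 7]
print(copied)     # [1, 2, 3, 4]
```

In this model the `regressed` case is the one the description calls "leaving the destination slot unchanged": a consumer that expects the `0` branch of `if_then_else` would read stale values instead.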

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces improvements to CUDA asynchronous copy operations and software pipelining. Specifically, it adds a predicate check in cp_async.py to zero out the destination and return early when the predicate is false, and refines the physical wait count calculation in inject_software_pipeline.cc when multiple producer heads exist per commit. Additionally, the tests are updated to dynamically assert the order of CUDA source fragments and verify runtime execution on Ampere or newer GPUs. The reviewer suggested deferring the generic-to-shared address translation (dst_addr) in cp_async.py until after the early-return predicate check to avoid unnecessary overhead.

Comment on lines +275 to +281
" unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"
" if (!predicate) {\n"
f" for (int i = 0; i < {cp_size_v}; ++i) {{\n"
" dst_p[i] = 0;\n"
" }\n"
" return;\n"
" }\n"

medium

We can defer the computation of dst_addr (which performs generic-to-shared address translation) until after the if (!predicate) check. This avoids unnecessary address translation overhead when the predicate is false and the function returns early.
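
The reordering described in this comment can be sketched as a small string-emitting function in the same f-string style as the reviewed lines. This is an illustration under assumptions: `emit_predicated_guard` is a hypothetical name, only the guard and the address translation are emitted, and the rest of the helper body is left out.

```python
def emit_predicated_guard(cp_size: int) -> str:
    """Return the opening lines of a predicated-copy helper body as CUDA text.

    Hypothetical function for illustration; it mirrors the reviewed snippet
    with the address translation moved after the early return.
    """
    return (
        "  if (!predicate) {\n"
        f"    for (int i = 0; i < {cp_size}; ++i) {{\n"
        "      dst_p[i] = 0;\n"
        "    }\n"
        "    return;\n"
        "  }\n"
        # Only reached when the predicate is true, so the false path
        # never pays for the generic-to-shared translation.
        "  unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"
    )


source = emit_predicated_guard(16)
print(source)
```

Whether the reordering has a measurable effect depends on how the compiler schedules the translation; the sketch only shows the source-level ordering.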

Suggested change
- " unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"
- " if (!predicate) {\n"
- f" for (int i = 0; i < {cp_size_v}; ++i) {{\n"
- " dst_p[i] = 0;\n"
- " }\n"
- " return;\n"
- " }\n"
+ " if (!predicate) {\n"
+ f" for (int i = 0; i < {cp_size_v}; ++i) {{\n"
+ " dst_p[i] = 0;\n"
+ " }\n"
+ " return;\n"
+ " }\n"
+ " unsigned int dst_addr = __cvta_generic_to_shared(dst_p);\n"

@tlopex tlopex changed the title [Tests] Check cp.async CUDA source invariants [S-TIR][CUDA] Fix predicated cp.async pipeline correctness Jun 12, 2026
@tlopex tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch 3 times, most recently from 0037982 to 5e5995d Compare June 12, 2026 00:48
@tlopex tlopex marked this pull request as draft June 12, 2026 00:51
@tlopex tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from 5e5995d to f10d38f Compare June 12, 2026 01:00
@tlopex tlopex changed the title [S-TIR][CUDA] Fix predicated cp.async pipeline correctness [S-TIR][Tests] Update cp.async expected CUDA source Jun 12, 2026
@tlopex tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from f10d38f to f157dc8 Compare June 12, 2026 01:06
@tlopex tlopex marked this pull request as ready for review June 12, 2026 01:07
@tlopex tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from f157dc8 to 44bb539 Compare June 12, 2026 01:25
@tlopex tlopex changed the title [S-TIR][Tests] Update cp.async expected CUDA source [S-TIR][CUDA] Fix legacy predicated cp.async zero fill Jun 12, 2026
@tlopex tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from 44bb539 to e0447c6 Compare June 12, 2026 01:40
@tlopex tlopex marked this pull request as draft June 12, 2026 01:50
tlopex added a commit to tlopex/tvm that referenced this pull request Jun 12, 2026
test_cp_async_in_if_then_else compares the generated CUDA source
byte-for-byte against a snapshot, and the cse_v* numbering in the
generated source is currently nondeterministic across processes, so the
snapshot comparison cannot be stable. Mark it xfail (non-strict) with a
TODO so the s_tir/transform CI enrollment (apache#19737) is not blocked; the
mark goes away once CSE ordering is made deterministic and the snapshot
is regenerated (apache#19741).
tlopex added a commit to tlopex/tvm that referenced this pull request Jun 12, 2026
The CUDA source snapshot comparison is nondeterministic (cse_v numbering
varies between runs). Mark the test xfail with a TODO until CSE
determinism is fixed. See apache#19741.
@tlopex tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from 53a8abc to 2856245 Compare June 13, 2026 01:15
tlopex commented Jun 13, 2026

Member Author

@tvm-bot rerun

@tlopex tlopex marked this pull request as ready for review June 13, 2026 05:38
tlopex added 2 commits June 13, 2026 02:42
…able

CommonSubexprElim numbered its cse_v variables nondeterministically:
the planner's expression table was an unordered_map hashed with
StructuralHash, which hashes free variables by object identity and so
varies between processes (ASLR). Both the table iteration order and the
StructuralHash sort tie-breaker leaked that randomness into the plan,
making generated code differ from run to run and breaking the byte-for-byte
CUDA snapshot comparison in test_cp_async_in_if_then_else.

Switch the table to support::OrderedMap so iteration follows discovery
(program) order, and drop the hash tie-breaker: the stable sort by
expr_depth now keeps equal-depth entries in discovery order, giving a
fully deterministic plan and cse_v numbering.

Regenerate the CUDA snapshot with the deterministic numbering (verified
byte-identical across independent processes) and remove the xfail marker
from test_cp_async_in_if_then_else added in apache#19751, since the test now
passes deterministically.
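
The idea in this commit message (an insertion-ordered table plus a stable sort with no hash tie-breaker) can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the TVM pass: expressions are nested tuples, `plan_cse` is a hypothetical name, candidates are subexpressions seen at least twice, and the sort is ascending by depth (the message does not state the direction).

```python
def depth(expr):
    """Nesting depth of a nested-tuple expression; leaves have depth 0."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max((depth(arg) for arg in expr[1:]), default=0)


def subexprs(expr):
    """Yield compound subexpressions in pre-order, i.e. discovery order."""
    if isinstance(expr, tuple):
        yield expr
        for arg in expr[1:]:
            yield from subexprs(arg)


def plan_cse(stmts):
    """Assign cse_v names to repeated subexpressions deterministically."""
    counts = {}  # a dict iterates in insertion order, like an ordered map
    for stmt in stmts:
        for e in subexprs(stmt):
            counts[e] = counts.get(e, 0) + 1
    repeated = [e for e, n in counts.items() if n >= 2]
    # list.sort is stable: equal-depth entries stay in discovery order,
    # so no hash value is needed as a tie-breaker.
    repeated.sort(key=depth)
    return {e: f"cse_v{i}" for i, e in enumerate(repeated, start=1)}


add = ("add", "x", "y")
sub = ("sub", "x", "y")
plan = plan_cse([("mul", add, sub), ("mul", sub, add)])
print(plan[add], plan[sub])  # cse_v1 cse_v2
```

Python's string hashes are randomized per process by default, yet the numbering above does not change between runs because it depends only on insertion order and the stable sort, which is the same property the commit relies on.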
@tlopex tlopex force-pushed the fix-s-tir-cp-async-source-snapshot branch from 2856245 to 93efb06 Compare June 13, 2026 06:43
@tqchen tqchen merged commit 126f1ba into apache:main Jun 13, 2026
8 checks passed
MasterJH5574 pushed a commit to MasterJH5574/tvm that referenced this pull request Jun 15, 2026
This fixes the legacy predicated `ptx.cp_async` codegen path used by
`InjectPTXAsyncCopy` for `if_then_else(..., 0)` stores.

The old inline CUDA emission zero-filled the shared-memory destination
when the predicate was false. The TIRx helper-based legacy codegen only
skipped the `cp.async`, leaving the destination slot unchanged. This
restores the previous behavior by emitting an `@!p st.shared.*` zero
store in the generated legacy predicated helper.

The CUDA source snapshot in
`test_s_tir_transform_inject_ptx_async_copy.py` is updated to reflect
the restored false-predicate zero-fill instruction and the current
generated helper-based CUDA source.

(cherry picked from commit 126f1ba)
2 participants