[v0.25.0] Backport recent main to prepare v0.25.0.rc1 by MasterJH5574 · Pull Request #19792 · apache/tvm

MasterJH5574 · 2026-06-16T13:16:17Z

Brings the v0.25.0 release branch up to recent main so we can cut v0.25.0.rc1. rc0 was cut before a number of changes we want in the release, so this backports the run of main commits that landed after the branch point rather than a handful of isolated picks.

This PR tries to fix apache#19696: ``nn.attention`` now supports a dynamic batch_size.

Co-authored-by: flashmouse <flashmosue2012@gmail.com>

…ent(numpy semantics) (apache#19755)

Hi Committers,

This PR addresses the `ReduceMax` / `ReduceMin` part of issue apache#19572. Any suggestions would be appreciated if you are available.

### Root cause:

The ONNX frontend ReduceMax / ReduceMin converters return relax.op.max / relax.op.min. After legalization these map to topi.max / topi.min, which fold with a commutative reducer whose combiner is Max(x, y) / Min(x, y). In codegen, Max(a, b) lowers to select(a > b, a, b) using an **ordered** float comparison (fcmp ogt), which is false whenever either operand is NaN. As a left-fold (acc = Max(acc, elem)), NaN propagation becomes **position-dependent**: a later non-NaN element silently overwrites an earlier NaN.

### Solution:

Adopt the well-defined, **order-independent numpy/IEEE convention** (matching numpy.max/min and torch.amax/amin): the reduction yields NaN whenever **any** reduced element is NaN. Minimal, ONNX-frontend-only change:

- Add a shared helper _reduce_min_max_preserve_nan(reduce_op, data, axes, keepdims).
- For floating-point inputs, detect NaN along the reduced axes via `sum(astype(isnan(data), dtype), axes, keepdims) > 0` and force those outputs to `NaN` with `where(has_nan, nan, reduce(data))`. The mask reduces over the **same axes/keepdims**, so it aligns in shape with the reduced result.
- Keep non-floating (integer) inputs unchanged.
- Route all reduce paths (`_impl_v11` and both reduce branches of `_impl_v18`) through the helper; the `noop_with_empty_axes` passthrough is left untouched since it performs no reduction.

### Note on scope (re: apache#19589):

The underlying NaN behavior of Max/Min is the same family of ops discussed in apache#19589. Per review comments there, enforcing NaN semantics at the IR / LLVM-IR level is undesirable (backward-compat with older LLVM, and portability to CUDA/OpenCL/Vulkan), and a dedicated portable nanmin/nanmax TIRx intrinsic (like `nearbyint`) would be the preferred long-term mechanism.

This PR deliberately:

- does not touch the IR-level Max/Min lowering, and
- does not rely on the bool reduction of the NaN mask; it uses `sum(isnan) > 0`, fully sidestepping Max/Min NaN behavior.

---------

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
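The root cause and the masking fix described above can be illustrated outside TVM with a small NumPy sketch. This is not the PR's helper: `ordered_max`, `left_fold_max`, and `reduce_max_preserve_nan` are hypothetical names, and `np.fmax.reduce` (which ignores NaN) is only a stand-in for a reduction that does not reliably propagate NaN.

```python
from functools import reduce

import numpy as np


def ordered_max(acc, elem):
    # Mirrors select(a > b, a, b): an ordered comparison is False
    # whenever either operand is NaN, so the second operand is returned.
    return acc if acc > elem else elem


def left_fold_max(values):
    # acc = Max(acc, elem), folded left to right.
    return reduce(ordered_max, values)


def reduce_max_preserve_nan(data, axes=None, keepdims=False):
    data = np.asarray(data)
    # Stand-in reduction that does NOT propagate NaN (assumption for the demo).
    reduced = np.fmax.reduce(data, axis=axes, keepdims=keepdims)
    if not np.issubdtype(data.dtype, np.floating):
        # Integer inputs cannot hold NaN, so they are returned unchanged.
        return reduced
    # Count NaNs over the same axes/keepdims so the mask matches `reduced`.
    nan_count = np.sum(
        np.isnan(data).astype(data.dtype), axis=axes, keepdims=keepdims
    )
    return np.where(nan_count > 0, np.nan, reduced)


nan = float("nan")
print(left_fold_max([nan, 1.0]), left_fold_max([1.0, nan]))
x = np.array([[nan, 1.0], [2.0, 3.0]], dtype=np.float32)
print(reduce_max_preserve_nan(x, axes=1))
```

With the ordered comparison, folding `[nan, 1.0]` returns `1.0` while folding `[1.0, nan]` returns NaN, which is the position dependence the PR describes. The masked version returns NaN for any row containing a NaN regardless of where it sits.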

…es (apache#19782)

IR module cleanup benefits from using a single unique-name primitive directly at module call sites. This PR renames NameSupply to UniqueNameSupply and removes redundant wrappers around global variable naming.

Main changes:

- Rename the public name supply API and header to UniqueNameSupply
- Replace GlobalVarSupply with direct iterator-seeded UniqueNameSupply usage
- Remove obsolete access-path repr registration now covered by tvm-ffi

…e#19784) TVM's shipped code only uses cuda.bindings — cuda.bindings.nvrtc for the NVRTC JIT path and cuda.bindings.driver for the NVSHMEM link path, both in python/tvm/support/nvcc.py; it never uses cuda.core. cuda-python is now a metapackage that pulls in cuda-bindings + cuda-core (and cuda-pathfinder), so depending on it drags in cuda-core that TVM does not need. Depend directly on cuda-bindings, which provides exactly the nvrtc and driver submodules TVM imports, and update the user-facing 'pip install cuda-python' hints to match. A plain cuda-bindings install pulls no nvidia-* toolkit wheels (those live behind the [all] extra); libnvrtc is loaded from the system / TVM's CUDA install as before.

…9783)

This PR migrates repository agent instructions away from Claude-specific paths and into a vendor-neutral layout.

Changes:

- Add root `AGENTS.md`
- Move existing command guidance from `.claude/commands` to `.agents/skills/*/SKILL.md` using git renames.
- Move the GPU monitor helper from `.claude/scripts` to `.agents/scripts`.
- Update the TIR test skill to reference `.agents/scripts/monitor_gpu.sh`.
- Replace the ASF header skip entry for `.claude/*` with `.agents/*`.

Validation:

- `bash -n .agents/scripts/monitor_gpu.sh`
- `.agents/scripts/monitor_gpu.sh --help`
- `pre-commit run --files AGENTS.md .agents/skills/tir-build/SKILL.md .agents/skills/tir-test/SKILL.md .agents/skills/tir-bench/SKILL.md .agents/scripts/monitor_gpu.sh tests/lint/check_asf_header.py`

This PR modernizes test gating. It replaces the heavy `tvm.testing.Feature` machinery with a thin `tvm.testing.env` module of `has_*()` capability probes, used via standard `pytest.mark` + `skipif`. Markers move to `pyproject.toml`.
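The gating style described above (a boolean capability probe fed to a plain `skipif` mark) might look roughly like the following. This is a sketch only: `has_example_tool` is a hypothetical probe defined locally for illustration, not one of the actual `tvm.testing.env` functions, whose names are not listed in this description.

```python
import shutil

import pytest


def has_example_tool():
    # Hypothetical probe (not part of TVM): True when an `nvcc`
    # executable can be found on PATH.
    return shutil.which("nvcc") is not None


# Standard pytest gating: the probe is evaluated at import time and
# the test is skipped when the capability is missing.
@pytest.mark.skipif(not has_example_tool(), reason="nvcc not found on PATH")
def test_needs_tool():
    assert shutil.which("nvcc") is not None
```

Because the probe is an ordinary function returning a bool, it can be combined with other conditions using normal Python expressions rather than a custom decorator hierarchy.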

…en05 GEMM (apache#19785)

This PR fixes apache#19609. TensorRT 10 removed a large set of APIs that the Relax TensorRT BYOC integration relied on, so it failed to compile against TRT >= 10. Port the runtime and codegen to the TRT10 API and require TensorRT >= 10:

- Lifetime: obj->destroy() -> delete (destroy() removed in TRT10).
- Builder: drop implicit-batch mode (networks are always explicit-batch via createNetworkV2(0); setMaxBatchSize removed); setMaxWorkspaceSize -> setMemoryPoolLimit(kWORKSPACE); buildEngineWithConfig -> buildSerializedNetwork + deserializeCudaEngine, keeping the IRuntime alive alongside the engine.
- Execution: the binding-index model (getNbBindings / getBindingIndex / setBindingDimensions / execute / executeV2) -> the named-tensor model (getNbIOTensors / setInputShape / setTensorAddress / enqueueV3); deserializeCudaEngine drops the trailing IPluginFactory* argument.
- Layers: addConvolution / addPooling / addDeconvolution / addPadding -> the *Nd variants; set{Stride,Dilation} -> *Nd; IFullyConnectedLayer / addFullyConnected removed -> dense rebuilt with addConstant + addMatrixMultiply.
- Add a build-time guard that emits a clear error on TensorRT < 10.

Also fix pre-existing issues that prevented this path from running end-to-end: the runtime had drifted from the current tvm-ffi API (TVMTensorCopyToBytes / TVMGetLastError, VectorToTrtDims over ffi::Array, a stale `override` on the destructor), and the conv converters read a Relay-era "channels" attribute that Relax does not emit (output channels are now derived from the kernel shape).

All tests were verified locally. This PR contains only API updates; no new functionality is added.

…pache#19786)

This PR fixes apache#19718. The test asserted target->attrs.size()==2, which is host-specific: LLVM target canonicalization legitimately adds host attrs (feature.has_sve / has_asimd / is_aarch64 / mtriple on AArch64), so the target ends up with 9 attrs there and the assertion fails, while it happens to be 2 on x86. The test only means to verify that duplicate keys are deduplicated, so it now asserts that the "keys" entry did not leak into the generic attrs map instead of pinning the host-specific attr count.

…ache#19787)

This PR is a follow-up to apache#19777. It removes the last `requires_*` decorators so test gating is plain pytest everywhere, with no custom indirection left.

This PR improves TIRx vectorization for RISC-V RVV targets. Fixed-width `T.vectorized` loops can be lowered to fixed LLVM vectors such as `<16 x float>`, which LLVM/RVV may scalarize into repeated scalar `flw/fsub.s/fsw` instructions. This PR rewrites fixed-width vectorized loops on RVV targets into scalable `T.vscale() * 4` chunks with lane masks, allowing LLVM to generate RVV load/store instructions instead. The change is limited to RISC-V RVV and does not enable the same automatic rewrite for Arm SVE.

Tested on a RISC-V K3 board:

- Before: flw/fsub.s/fsw = 16/16/16, vle32/vse32 = 0/0
- After: flw/fsub.s/fsw = 0/0/0, vle32/vse32 = 1/1

Also added a RISC-V LLVM codegen regression test.

gemini-code-assist

Code Review

This pull request migrates the codebase from NameSupply to UniqueNameSupply, updates the TensorRT integration to target the TensorRT 10 API, adds Blackwell Cluster Launch Control (CLC) tile scheduler support, and refactors test gating into a new tvm.testing.env module. Feedback on the changes highlights two potential issues in the TensorRT runtime: first, the new dense layer implementation using addMatrixMultiply will fail on valid 1D inputs due to a strict rank check, which should be handled by temporarily reshaping to 2D; second, accessing input_var_eid_[0] directly is unsafe for subgraphs with zero inputs and should fall back to outputs_[0] to prevent out-of-bounds crashes.


This PR updates the contributor guide and tvm.testing docstrings/comments to describe the current gating API.

---------

Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

## Why

In `tryCreateBuffer`, each of the three popped error scopes independently called `device.destroy()` and `console.error`, so a buffer that triggers more than one error type destroyed the device repeatedly and logged duplicate errors.

## How

- Collect all three `popErrorScope()` results via `Promise.all` and call `device.destroy()` at most once
- Log every captured error instead of relying on per-scope handlers

---------

Signed-off-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

) This PR removes the obsolete queue and rang license files and drops the leftover rang include-directory hook from the CMake setup.

…9796)

## Summary

The old Hexagon app and test wrappers depend on RPC helper artifacts that are no longer part of the supported app flow. This PR removes those wrappers and related helper references while keeping the core Hexagon target, codegen, and runtime implementation in place.

## Changes

- Remove the obsolete Hexagon app wrapper directories.
- Remove the Hexagon contrib test directory and its dedicated pytest/RPC launcher helpers.
- Drop stale CI/docs references to the removed app and test helper paths.

## Why

ASF INFRA enforces that external GitHub Actions must be pinned to a commit SHA on the approved allowlist, failing the workflow with "not allowed in apache/tvm". See the [policy](https://infra.apache.org/github-actions-policy.html) and the [approved allowlist](https://github.com/apache/infrastructure-actions/blob/main/approved_patterns.yml).

## How

- Pin `pre-commit/action` to `2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd` (v3.0.1)
- Pin `pypa/cibuildwheel` to `294735312765b09d24a2fbec22660ce817587d55` (v4.1.0)
- Pin `pypa/gh-action-pypi-publish` to `ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e` (v1.13.0)
- Leave GitHub-owned `actions/*` and the allowlisted `conda-incubator/setup-miniconda@*` pattern untouched

---------

Signed-off-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>

apache#19780) Switches from the deprecated plural `requestFileHandles([hash])` to the new singular `requestFileHandle(hash)` API, which returns a handle directly rather than a single-element array. Also updates the interface definition and removes the now-redundant `handles[0]` indexing. The rename was adopted in the COS spec after a survey of all known real-world implementations found that every call site passed a single-element array and immediately indexed `[0]` — no implementation ever used the plural form as a batch. FYI guan404ming CharlieFRuan — as reviewers of the original COS PR ([apache#18893](apache#18893)). See WICG/cross-origin-storage#61 for details.

CallingConv already participates in TVM FFI integral enum conversion, so keeping call sites on manual integer casts adds noise without changing behavior. This was not possible before TVM FFI gained Any support, but enum class int value conversion with Any is now supported natively, so the code path can be simplified.

Main changes:

- Read `tvm::attr::kCallingConv` as `CallingConv` directly
- Compare optional/defaulted values against `CallingConv` enum values
- Store CallingConv enum values directly where the cleanup touches attr writes

The Jenkins PR title/body linter is comparatively heavy and can report false positives before the normal CI signal is available. This removes the check_pr step from the Jenkins prepare flow and drops the now-unused script-level test coverage.

…, IMAG, COMPLEX_ABS (apache#19763)

Part of apache#19519

This PR adds support for the FFT and complex operator family in the Relax TFLite frontend.

**Key implementations:**

- Registered `REAL`, `IMAG`, and `COMPLEX_ABS` in the TFLite op map.
- Implemented `convert_real` and `convert_imag`, which extract the real and imaginary parts of a complex tensor via `strided_slice` + `squeeze` along the last axis.
- Implemented `convert_complex_abs`, which computes `sqrt(re^2 + im^2)` using elementwise Relax ops.
- All three ops adopt a unified representation convention: TFLite `complex64` tensors (which have no native Relax dtype equivalent) are represented as `float32[..., 2]`, where the last axis holds `(real, imaginary)` interleaved.

**Out of scope:**

- `RFFT2D` is not registered in this PR. An O(N²) matmul decomposition is feasible using existing Relax ops and will be contributed separately with benchmarks showing the performance gap versus a native FFT op. A native `relax.op.signal.rfft2d` is tracked in apache#19764.

**Testing:**

- Added structural equality tests for `REAL`, `IMAG`, and `COMPLEX_ABS` in `test_frontend_tflite.py` following the `verify(TestClass, Expected)` pattern.

```bash
python3 -m pytest tests/python/relax/test_frontend_tflite.py -k "test_real or test_imag or test_complex_abs"
```
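The `float32[..., 2]` convention described above can be mimicked in NumPy to see what the three converters compute. This is an illustrative sketch rather than the frontend code: `to_interleaved`, `real_part`, `imag_part`, and `complex_abs` are hypothetical names, and NumPy slicing plus `np.squeeze` stand in for the Relax `strided_slice` + `squeeze` ops.

```python
import numpy as np


def to_interleaved(z):
    # Pack a complex array as float32[..., 2] with (real, imaginary)
    # on the last axis.
    z = np.asarray(z, dtype=np.complex64)
    return np.stack([z.real, z.imag], axis=-1).astype(np.float32)


def real_part(x):
    # Slice index 0 of the last axis, then drop that axis.
    return np.squeeze(x[..., 0:1], axis=-1)


def imag_part(x):
    # Slice index 1 of the last axis, then drop that axis.
    return np.squeeze(x[..., 1:2], axis=-1)


def complex_abs(x):
    # Elementwise sqrt(re^2 + im^2).
    re, im = real_part(x), imag_part(x)
    return np.sqrt(re * re + im * im)


x = to_interleaved([3 + 4j, -2j])
print(real_part(x), imag_part(x), complex_abs(x))
```

Keeping the real and imaginary parts on a trailing axis of size 2 means each op reduces to a slice or an elementwise expression, which is why no complex dtype is needed.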

…st (apache#19797)

Common bool, int32, and int64 scalar constants show up throughout TIRX and related lowering code, and named constructors make these call sites easier to read than repeated DataType spelling. This PR establishes the scalar-constant construction policy and renames make_const to MakeConst to match the public helper naming style.

- Prefer IntImm::Bool, IntImm::Int32, and IntImm::Int64 for common known scalar bool, int32, and int64 constants.
- Prefer direct IntImm or FloatImm construction when dtype is known to be scalar integer or floating point. This makes the compiled code more compact and efficient. Keep MakeConst for generic overload cases where dtype can be integer, floating point, or vector-valued and the caller needs its scalar/vector dispatch.
- Phase out make_zero in favor of explicit scalar constructors, or ConstHandle(0) for null handles.

CompareBeforeAfter, skip_parameterizations, and xfail_parameterizations have no remaining users anywhere in the repo. CompareBeforeAfter (a base class for TIR before/after transform tests) has been superseded by the inline assert_structural_equal(transform(Before), Expected) pattern, and the {skip,xfail}_parameterizations helpers (which marked specific parametrizations at runtime) are unused -- native pytest.param(..., marks=...) covers that need. Also drop the private _mark_parameterizations helper they relied on and the now-unused 'import textwrap'.
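For reference, the native `pytest.param(..., marks=...)` form mentioned above attaches a mark to a single parametrization at collection time. The sketch below uses made-up cases (`CASES`, `test_square`) purely for illustration; they are not taken from the TVM test suite.

```python
import pytest

# Each entry is one parametrization; only the last one carries a mark.
CASES = [
    pytest.param(1, 1, id="identity"),
    pytest.param(2, 4, id="square"),
    pytest.param(
        3,
        10,  # deliberately wrong expectation to demonstrate xfail
        marks=pytest.mark.xfail(reason="illustrative known-bad case"),
        id="known-bad",
    ),
]


@pytest.mark.parametrize("x, expected", CASES)
def test_square(x, expected):
    assert x * x == expected
```

The same pattern works with `pytest.mark.skip` or `pytest.mark.skipif` in place of `xfail`, so per-case gating needs no runtime helper.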

flashmouse and others added 12 commits June 15, 2026 20:28

[Fix] nn.attention support dynamic batch_size (apache#19779)

b3c8849


[Docs][CI] Bump tlcpack-sphinx-addon to restore search result summari…

668d119

…es (apache#19782)

[Tests] Modernize test gating (apache#19777)

b684868


[TIRX][CUDA] Framework support for FA4, CLC intrinsics, and nvfp4 tcg…

694dacb

…en05 GEMM (apache#19785)

[Tests] Replace remaining requires_* helpers with standard pytest (ap…

00813d6

…ache#19787)

gemini-code-assist Bot reviewed Jun 16, 2026


Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_ops.cc

Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_runtime.cc

MasterJH5574 marked this pull request as draft June 16, 2026 14:18

guan404ming and others added 10 commits June 16, 2026 17:52

[REFACTOR] Phase out unused queue and rang license entries (apache#19794

a7864af

)

MasterJH5574 marked this pull request as ready for review June 16, 2026 21:54

tqchen approved these changes Jun 16, 2026


MasterJH5574 merged commit 0815e00 into apache:v0.25.0 Jun 16, 2026
2 of 8 checks passed

MasterJH5574 merged 23 commits into apache:v0.25.0 from MasterJH5574:tvm-dev/2026-06-15-v0.25.0.rc1


10 participants
