
[v0.25.0] Backport recent main to prepare v0.25.0.rc1#19792

Merged
MasterJH5574 merged 23 commits into
apache:v0.25.0 from
MasterJH5574:tvm-dev/2026-06-15-v0.25.0.rc1
Jun 16, 2026

Conversation

@MasterJH5574

Brings the v0.25.0 release branch up to recent main so we can cut v0.25.0.rc1. rc0 was cut before a number of changes we want in the release, so this backports the run of main commits that landed after the branch point rather than a handful of isolated picks.

flashmouse and others added 12 commits June 15, 2026 20:28
This PR tries to fix apache#19696 by making ``nn.attention`` support a dynamic batch_size.

Co-authored-by: flashmouse <flashmosue2012@gmail.com>
…ent(numpy semantics) (apache#19755)

Hi Committers,

This PR addresses the `ReduceMax` / `ReduceMin` part of issue
apache#19572. Any suggestions would be
appreciated if you are available.

### Root cause:
The ONNX frontend ReduceMax / ReduceMin converters return relax.op.max /
relax.op.min. After legalization these map to topi.max / topi.min, which
fold with a commutative reducer whose combiner is Max(x, y) / Min(x, y).
In codegen, Max(a, b) lowers to select(a > b, a, b) using an **ordered**
float comparison (fcmp ogt), which is false for NaN. As a left-fold (acc
= Max(acc, elem)), NaN propagation becomes **position-dependent** - a
later non-NaN element silently overwrites an earlier NaN.

### Solution:
Adopt the well-defined, **order-independent numpy/IEEE convention**
(matching numpy.max/min and torch.amax/amin): the reduction yields NaN
whenever **any** reduced element is NaN. Minimal, ONNX-frontend-only
change:

- Add a shared helper _reduce_min_max_preserve_nan(reduce_op, data,
axes, keepdims).
- For floating-point inputs, detect NaN along the reduced axes via
`sum(astype(isnan(data), dtype), axes, keepdims) > 0` and force those
outputs to `NaN` with `where(has_nan, nan, reduce(data))`. The mask
reduces over the **same axes/keepdims**, so it aligns in shape with the
reduced result.
- Keep non-floating (integer) inputs unchanged.
- Route all reduce paths (`_impl_v11` and both reduce branches of
`_impl_v18`) through the helper; the `noop_with_empty_axes` passthrough
is left untouched since it performs no reduction.

### Note on scope (re: apache#19589):
The underlying NaN behavior of Max/Min is the same family of ops
discussed in apache#19589. Per review comments there, enforcing NaN semantics
at the IR / LLVM-IR level is undesirable (backward-compat with older
LLVM, and portability to CUDA/OpenCL/Vulkan), and a dedicated portable
nanmin/nanmax TIRx intrinsic (like `nearbyint`) would be the preferred
long-term mechanism. This PR deliberately:

- does not touch the IR-level Max/Min lowering, and
- does not rely on the bool reduction of the NaN mask - it uses
`sum(isnan) > 0`, fully sidestepping Max/Min NaN behavior.
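
The position dependence and the mask-based fix described above can be reproduced in plain NumPy. This is a minimal sketch, not the frontend code: `fold_max` and `reduce_max_preserve_nan` are hypothetical names, NumPy stands in for the Relax ops, and the helper takes a single integer axis for brevity.

```python
import numpy as np


def fold_max(values):
    """Left-fold with select(a > b, a, b); the ordered compare is False for NaN."""
    acc = values[0]
    for elem in values[1:]:
        acc = acc if acc > elem else elem
    return acc


def reduce_max_preserve_nan(data, axis, keepdims=False):
    """Sketch of where(sum(isnan) > 0, nan, reduce(data)) for one integer axis."""
    # The mask reduces over the same axis/keepdims, so it matches the result shape.
    has_nan = np.sum(np.isnan(data).astype(data.dtype), axis=axis, keepdims=keepdims) > 0
    # fold_max plays the role of the position-dependent reducer.
    reduced = np.apply_along_axis(fold_max, axis, data)
    if keepdims:
        reduced = np.expand_dims(reduced, axis)
    return np.where(has_nan, np.nan, reduced)


data = np.array(
    [[np.nan, 1.0, 2.0], [3.0, np.nan, 0.5], [1.0, 2.0, 3.0]], dtype=np.float32
)
naive = np.apply_along_axis(fold_max, 1, data)      # NaN silently lost in rows 0 and 1
fixed = reduce_max_preserve_nan(data, axis=1)        # NaN preserved in rows 0 and 1
```

In the naive fold, row 0 loses its leading NaN and row 1 loses its NaN as soon as a later finite element arrives; the masked version returns NaN for both rows regardless of where the NaN sits.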

---------

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
IR module cleanup benefits from using a single unique-name primitive
directly at module call sites. This PR renames NameSupply to
UniqueNameSupply and removes redundant wrappers around global variable
naming.

Main changes:

- Rename the public name supply API and header to UniqueNameSupply
- Replace GlobalVarSupply with direct iterator-seeded UniqueNameSupply
usage
- Remove obsolete access-path repr registration now covered by tvm-ffi
…e#19784)

TVM's shipped code only uses cuda.bindings — cuda.bindings.nvrtc for the
NVRTC JIT path and cuda.bindings.driver for the NVSHMEM link path, both
in python/tvm/support/nvcc.py; it never uses cuda.core. cuda-python is
now a metapackage that pulls in cuda-bindings + cuda-core (and
cuda-pathfinder), so depending on it drags in cuda-core that TVM does
not need.
Depend directly on cuda-bindings, which provides exactly the nvrtc and
driver submodules TVM imports, and update the user-facing 'pip install
cuda-python' hints to match. A plain cuda-bindings install pulls no
nvidia-* toolkit wheels (those live behind the [all] extra); libnvrtc is
loaded from the system / TVM's CUDA install as before.
…9783)

This PR migrates repository agent instructions away from Claude-specific
paths and into a vendor-neutral layout.

Changes:
- Add root `AGENTS.md`
- Move existing command guidance from `.claude/commands` to
`.agents/skills/*/SKILL.md` using git renames.
- Move the GPU monitor helper from `.claude/scripts` to
`.agents/scripts`.
- Update the TIR test skill to reference
`.agents/scripts/monitor_gpu.sh`.
- Replace the ASF header skip entry for `.claude/*` with `.agents/*`.

Validation:
- `bash -n .agents/scripts/monitor_gpu.sh`
- `.agents/scripts/monitor_gpu.sh --help`
- `pre-commit run --files AGENTS.md .agents/skills/tir-build/SKILL.md
.agents/skills/tir-test/SKILL.md .agents/skills/tir-bench/SKILL.md
.agents/scripts/monitor_gpu.sh tests/lint/check_asf_header.py`
This PR modernizes test gating. It replaces the heavy
`tvm.testing.Feature` machinery with a thin `tvm.testing.env` module of
`has_*()` capability probes, used via standard pytest.mark + skipif.
Markers move to `pyproject.toml`.
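
The gating style described here, a boolean probe feeding a standard `skipif` marker, might look roughly like the sketch below. The probe `has_nvcc` is a hypothetical stand-in written for this example; the actual names and detection logic in `tvm.testing.env` are not shown in this message.

```python
import functools
import shutil

import pytest


@functools.lru_cache(maxsize=None)
def has_nvcc() -> bool:
    """Hypothetical capability probe: True when an nvcc binary is on PATH."""
    return shutil.which("nvcc") is not None


# Plain pytest gating: the probe result is passed straight to skipif,
# with no custom decorator layer in between.
@pytest.mark.skipif(not has_nvcc(), reason="nvcc not found on PATH")
def test_nvcc_is_discoverable():
    assert shutil.which("nvcc") is not None
```

Caching the probe keeps collection cheap when many tests share the same condition.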
This PR fixes apache#19609. TensorRT 10 removed a large set of APIs that the
Relax TensorRT BYOC integration relied on, so it failed to compile
against TRT >= 10. Port the runtime and codegen to the TRT10 API and
require TensorRT >= 10:

- Lifetime: obj->destroy() -> delete (destroy() removed in TRT10).
- Builder: drop implicit-batch mode (networks are always explicit-batch
via createNetworkV2(0); setMaxBatchSize removed); setMaxWorkspaceSize ->
setMemoryPoolLimit(kWORKSPACE); buildEngineWithConfig ->
buildSerializedNetwork + deserializeCudaEngine, keeping the IRuntime
alive alongside the engine.
- Execution: the binding-index model (getNbBindings / getBindingIndex /
setBindingDimensions / execute / executeV2) -> the named-tensor model
(getNbIOTensors / setInputShape / setTensorAddress / enqueueV3);
deserializeCudaEngine drops the trailing IPluginFactory* argument.
- Layers: addConvolution / addPooling / addDeconvolution / addPadding ->
the *Nd variants; set{Stride,Dilation} -> *Nd; IFullyConnectedLayer /
addFullyConnected removed -> dense rebuilt with addConstant +
addMatrixMultiply.
- Add a build-time guard that emits a clear error on TensorRT < 10.

Also fix pre-existing issues that prevented this path from running
end-to-end: the runtime had drifted from the current tvm-ffi API
(TVMTensorCopyToBytes / TVMGetLastError, VectorToTrtDims over
ffi::Array, a stale `override` on the destructor), and the conv
converters read a Relay-era "channels" attribute that Relax does not
emit (output channels are now derived from the kernel shape).

All tests were verified locally. This PR contains only API updates;
no new functionality is added.
…pache#19786)

This PR fixes apache#19718. The test asserted target->attrs.size()==2, which
is host-specific: LLVM target canonicalization legitimately adds host
attrs (feature.has_sve / has_asimd / is_aarch64 / mtriple on AArch64),
so the target ends up with 9 attrs there and the assertion fails, while
it happens to be 2 on x86. The test only means to verify that duplicate
keys are deduplicated, so assert that the "keys" entry did not leak into
the generic attrs map instead of pinning the host-specific attr count.
…ache#19787)

This PR is the follow-up to apache#19777. It removes the last
`requires_*` decorators so test gating is plain pytest everywhere, with
no custom indirection left.
This PR improves TIRx vectorization for RISC-V RVV targets.

Fixed-width `T.vectorized` loops can be lowered to fixed LLVM vectors
such as `<16 x float>`, which LLVM/RVV may scalarize into repeated
scalar `flw/fsub.s/fsw` instructions. This PR rewrites fixed-width
vectorized loops on RVV targets into scalable `T.vscale() * 4` chunks
with lane masks, allowing LLVM to generate RVV load/store instructions
instead.

The change is limited to RISC-V RVV and does not enable the same
automatic rewrite for Arm SVE.

Tested on a RISC-V K3 board:

Before: flw/fsub.s/fsw = 16/16/16, vle32/vse32 = 0/0
After:  flw/fsub.s/fsw = 0/0/0,    vle32/vse32 = 1/1

Also added a RISC-V LLVM codegen regression test.
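
The shape of the rewrite, a fixed 16-lane loop split into `vscale * 4` chunks guarded by a lane mask, can be simulated in NumPy. This is only an illustration of the chunking and masking arithmetic under assumed details; the function name is hypothetical and the real pass operates on TIRx, not arrays.

```python
import numpy as np

FIXED_LANES = 16  # extent of the original fixed-width vectorized loop


def masked_scalable_sub(a, b, vscale):
    """Compute a - b over FIXED_LANES elements in chunks of vscale * 4 lanes."""
    chunk = vscale * 4
    out = np.zeros(FIXED_LANES, dtype=np.float32)
    num_chunks = -(-FIXED_LANES // chunk)  # ceiling division
    for i in range(num_chunks):
        lanes = i * chunk + np.arange(chunk)
        mask = lanes < FIXED_LANES  # lane mask: disable lanes past the loop extent
        active = lanes[mask]
        out[active] = a[active] - b[active]  # models one masked load/sub/store
    return out, num_chunks


a = np.arange(FIXED_LANES, dtype=np.float32)
b = np.ones(FIXED_LANES, dtype=np.float32)
chunks_per_vscale = {v: masked_scalable_sub(a, b, v)[1] for v in (1, 2, 4, 8)}
```

With `vscale = 8` a single chunk of 32 lanes covers the loop and the mask disables the upper half, which is why the mask is needed even when the hardware vector is wider than the original extent.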

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request migrates the codebase from NameSupply to UniqueNameSupply, updates the TensorRT integration to target the TensorRT 10 API, adds Blackwell Cluster Launch Control (CLC) tile scheduler support, and refactors test gating into a new tvm.testing.env module. Feedback on the changes highlights two potential issues in the TensorRT runtime: first, the new dense layer implementation using addMatrixMultiply will fail on valid 1D inputs due to a strict rank check, which should be handled by temporarily reshaping to 2D; second, accessing input_var_eid_[0] directly is unsafe for subgraphs with zero inputs and should fall back to outputs_[0] to prevent out-of-bounds crashes.


Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_ops.cc
Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_runtime.cc
This PR updates the contributor guide and tvm.testing
docstrings/comments to describe the current gating API.

---------

Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@MasterJH5574 MasterJH5574 marked this pull request as draft June 16, 2026 14:18
guan404ming and others added 10 commits June 16, 2026 17:52
## Why

In `tryCreateBuffer`, each of the three popped error scopes
independently called `device.destroy()` and `console.error`, so a buffer
that triggers more than one error type destroyed the device repeatedly
and logged duplicate errors.

## How

- Collect all three `popErrorScope()` results via `Promise.all` and call
`device.destroy()` at most once
- Log every captured error instead of relying on per-scope handlers
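
The control flow in the two bullets above (gather all scope results first, then log everything and destroy once) is sketched below as a Python `asyncio` analogue. It is not the WebGPU code: `FakeDevice`, `pop_error_scope`, and `check_scopes` are hypothetical stand-ins, and `asyncio.gather` plays the role of `Promise.all`.

```python
import asyncio


class FakeDevice:
    """Hypothetical stand-in for a GPU device that counts destroy() calls."""

    def __init__(self):
        self.destroy_calls = 0

    def destroy(self):
        self.destroy_calls += 1


async def pop_error_scope(error):
    """Stand-in for popErrorScope(): resolves to an error message or None."""
    await asyncio.sleep(0)
    return error


async def check_scopes(device, scope_results):
    # Await all scopes together before reacting to any of them.
    results = await asyncio.gather(*(pop_error_scope(e) for e in scope_results))
    errors = [e for e in results if e is not None]
    for e in errors:
        print(f"buffer creation error: {e}")  # log every captured error
    if errors:
        device.destroy()  # at most once, however many scopes reported errors
    return errors


device = FakeDevice()
errors = asyncio.run(check_scopes(device, ["out-of-memory", None, "validation"]))
```

Two scopes report errors here, yet the device is destroyed a single time and both messages are logged.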

---------

Signed-off-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
)

This PR removes the obsolete queue and rang license files and drops the
leftover rang include-directory hook from the CMake setup.
…9796)

## Summary

The old Hexagon app and test wrappers depend on RPC helper artifacts
that are no longer part of the supported app flow. This PR removes those
wrappers and related helper references while keeping the core Hexagon
target, codegen, and runtime implementation in place.

## Changes

- Remove the obsolete Hexagon app wrapper directories.
- Remove the Hexagon contrib test directory and its dedicated pytest/RPC
launcher helpers.
- Drop stale CI/docs references to the removed app and test helper
paths.
## Why

ASF INFRA enforces that external GitHub Actions must be pinned to a
commit SHA on the approved allowlist, failing the workflow with "not
allowed in apache/tvm". See the
[policy](https://infra.apache.org/github-actions-policy.html) and the
[approved
allowlist](https://github.com/apache/infrastructure-actions/blob/main/approved_patterns.yml).

## How

- Pin `pre-commit/action` to `2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd`
(v3.0.1)
- Pin `pypa/cibuildwheel` to `294735312765b09d24a2fbec22660ce817587d55`
(v4.1.0)
- Pin `pypa/gh-action-pypi-publish` to
`ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e` (v1.13.0)
- Leave GitHub-owned `actions/*` and the allowlisted
`conda-incubator/setup-miniconda@*` pattern untouched

---------

Signed-off-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>
apache#19780)

Switches from the deprecated plural `requestFileHandles([hash])` to the
new singular `requestFileHandle(hash)` API, which returns a handle
directly rather than a single-element array. Also updates the interface
definition and removes the now-redundant `handles[0]` indexing.

The rename was adopted in the COS spec after a survey of all known
real-world implementations found that every call site passed a
single-element array and immediately indexed `[0]` — no implementation
ever used the plural form as a batch.

FYI guan404ming CharlieFRuan — as reviewers of the original COS PR
([apache#18893](apache#18893)).

See WICG/cross-origin-storage#61 for details.
CallingConv already participates in TVM FFI integral enum conversion, so
keeping call sites on manual integer casts adds noise without changing
behavior. This was not possible before TVM FFI gained Any support, but
enum class int value conversion through Any is now supported natively,
so the code path can be simplified.

Main changes:

- Read `tvm::attr::kCallingConv` as `CallingConv` directly
- Compare optional/defaulted values against `CallingConv` enum values
- Store CallingConv enum values directly where the cleanup touches attr
writes
The Jenkins PR title/body linter is comparatively heavy and can report
false positives before the normal CI signal is available.

This removes the check_pr step from the Jenkins prepare flow and drops
the now-unused script-level test coverage.
…, IMAG, COMPLEX_ABS (apache#19763)

Part of apache#19519

This PR adds support for the FFT and complex operator family in the
Relax TFLite frontend.

**Key implementations:**

- Registered `REAL`, `IMAG`, `COMPLEX_ABS` to the TFLite op map.
- Implemented `convert_real` and `convert_imag` which extract the real
and imaginary parts of a complex tensor via `strided_slice` + `squeeze`
along the last axis.
- Implemented `convert_complex_abs` which computes `sqrt(re^2 + im^2)`
using elementwise Relax ops.
- All three ops adopt a unified representation convention: TFLite
`complex64` tensors (which have no native Relax dtype equivalent) are
represented as `float32[..., 2]`, where the last axis holds `(real,
imaginary)` interleaved.
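
The `float32[..., 2]` convention and the three conversions above can be checked against NumPy's native complex type. This is a sketch with NumPy slicing standing in for `strided_slice` + `squeeze`; the `*_np` function names are hypothetical and are not the frontend converters themselves.

```python
import numpy as np


def to_pair(z):
    """Encode a complex array as float32[..., 2] with (real, imag) on the last axis."""
    return np.stack([z.real, z.imag], axis=-1).astype(np.float32)


def real_np(x):
    # Take index 0 of the last axis as a length-1 slice, then squeeze it away.
    return np.squeeze(x[..., 0:1], axis=-1)


def imag_np(x):
    # Same pattern for index 1 of the last axis.
    return np.squeeze(x[..., 1:2], axis=-1)


def complex_abs_np(x):
    re, im = real_np(x), imag_np(x)
    return np.sqrt(re * re + im * im)  # sqrt(re^2 + im^2), elementwise


z = np.array([[3 + 4j, 1 - 2j], [0 + 1j, -5 + 12j]], dtype=np.complex64)
x = to_pair(z)  # shape (2, 2, 2)
```

For example, the entry `3 + 4j` is stored as `[3.0, 4.0]` and its magnitude evaluates to `5.0`.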

**Out of scope:**

- `RFFT2D` is not registered in this PR. An O(N²) matmul decomposition
is feasible using existing Relax ops and will be contributed separately
with benchmarks showing the performance gap versus a native FFT op.
A native `relax.op.signal.rfft2d` is tracked in
apache#19764.

**Testing:**

- Added structural equality tests for `REAL`, `IMAG`, and `COMPLEX_ABS`
in `test_frontend_tflite.py` following the `verify(TestClass, Expected)`
pattern.

```bash
python3 -m pytest tests/python/relax/test_frontend_tflite.py -k "test_real or test_imag or test_complex_abs"
```
…st (apache#19797)

Common bool, int32, and int64 scalar constants show up throughout TIRX and related lowering code, and named constructors make these call sites easier to read than repeated DataType spelling.

This PR establishes the scalar-constant construction policy and renames make_const to MakeConst to match the public helper naming style.

- Prefer IntImm::Bool, IntImm::Int32, and IntImm::Int64 for common known scalar bool, int32, and int64 constants.

- Prefer direct IntImm or FloatImm construction when dtype is known to be scalar integer or floating point. This makes the compiled code more compact and efficient. Keep MakeConst for generic overload cases where dtype can be integer, floating point, or vector-valued and the caller needs its scalar/vector dispatch.

- Phase out make_zero in favor of explicit scalar constructors, or ConstHandle(0) for null handles.
CompareBeforeAfter, skip_parameterizations, and xfail_parameterizations
have no remaining users anywhere in the repo. CompareBeforeAfter (a base
class for TIR before/after transform tests) has been superseded by the
inline assert_structural_equal(transform(Before), Expected) pattern, and
the {skip,xfail}_parameterizations helpers (which marked specific
parametrizations at runtime) are unused -- native pytest.param(...,
marks=...) covers that need.

Also drop the private _mark_parameterizations helper they relied on and
the now-unused 'import textwrap'.
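
For reference, the native pytest replacement mentioned above attaches marks to individual parametrizations through `pytest.param`. The sketch below uses a hypothetical `transform` function and made-up cases purely for illustration.

```python
import pytest


def transform(x):
    """Hypothetical function under test: doubles its input."""
    return x * 2


@pytest.mark.parametrize(
    "value, expected",
    [
        (1, 2),
        # Marks apply only to the single case they are attached to.
        pytest.param(2, 5, marks=pytest.mark.xfail(reason="known mismatch")),
        pytest.param(3, 6, marks=pytest.mark.skip(reason="not supported here")),
    ],
)
def test_transform(value, expected):
    assert transform(value) == expected
```

Because the marks are declared at collection time, no runtime helper is needed to single out specific parametrizations.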
@MasterJH5574 MasterJH5574 marked this pull request as ready for review June 16, 2026 21:54
@MasterJH5574 MasterJH5574 merged commit 0815e00 into apache:v0.25.0 Jun 16, 2026
2 of 8 checks passed
10 participants