[Relax][TensorRT] Update TensorRT runtime to 10 #19789

Merged
tqchen merged 1 commit into apache:main from tlopex:fix-tensorrt10-byoc-19609 on Jun 16, 2026

@tlopex (Member) commented Jun 16, 2026

This PR fixes #19609. TensorRT 10 removed a large set of APIs that the Relax TensorRT BYOC integration relied on, so the integration failed to compile against TRT >= 10. This change ports the runtime and codegen to the TRT10 API and requires TensorRT >= 10:

  • Lifetime: obj->destroy() -> delete (destroy() removed in TRT10).
  • Builder: drop implicit-batch mode (networks are always explicit-batch via createNetworkV2(0); setMaxBatchSize removed); setMaxWorkspaceSize -> setMemoryPoolLimit(kWORKSPACE); buildEngineWithConfig -> buildSerializedNetwork + deserializeCudaEngine, keeping the IRuntime alive alongside the engine.
  • Execution: the binding-index model (getNbBindings / getBindingIndex / setBindingDimensions / execute / executeV2) -> the named-tensor model (getNbIOTensors / setInputShape / setTensorAddress / enqueueV3); deserializeCudaEngine drops the trailing IPluginFactory* argument.
  • Layers: addConvolution / addPooling / addDeconvolution / addPadding -> the *Nd variants; set{Stride,Dilation} -> *Nd; IFullyConnectedLayer / addFullyConnected removed -> dense rebuilt with addConstant + addMatrixMultiply.
  • Add a build-time guard that emits a clear error on TensorRT < 10.
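
The dense rewrite in the Layers item can be checked numerically. The following is a minimal pure-Python sketch, not the converter code: it assumes the common dense convention y = x · Wᵀ with a weight of shape (units, in_features), and every helper name is hypothetical. Whether the converter transposes the constant ahead of time or asks the matrix multiply to do it is not stated in this PR, so the sketch only shows that the two formulations agree.

```python
def transpose(m):
    """Transpose a row-major nested list."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions must agree")
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for row in a
    ]


def dense_reference(x, weight):
    """Fully connected layer: y[n][u] = sum_k x[n][k] * weight[u][k]."""
    return [
        [sum(xk * wk for xk, wk in zip(row, w_row)) for w_row in weight]
        for row in x
    ]


def dense_via_matmul(x, weight):
    """Same layer, expressed as a constant weight fed to a matrix multiply."""
    return matmul(x, transpose(weight))


# Two samples with three features each, projected onto two units.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
print(dense_via_matmul(x, weight))  # [[-2.0, 3.0], [-2.0, 7.5]]
```

Both functions return the same values for this input, which is the property the constant-plus-matmul replacement relies on.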

Also fix pre-existing issues that prevented this path from running end-to-end: the runtime had drifted from the current tvm-ffi API (TVMTensorCopyToBytes / TVMGetLastError, VectorToTrtDims over ffi::Array, a stale override on the destructor), and the conv converters read a Relay-era "channels" attribute that Relax does not emit (output channels are now derived from the kernel shape).
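
Deriving the output channel count from the kernel shape amounts to reading the extent of the output axis in the kernel layout. The sketch below illustrates that idea in plain Python. The function name is hypothetical, the layout is modelled as a string such as "OIHW", and grouped convolutions are ignored (groups == 1 is assumed), so this is not the converter's actual logic.

```python
def output_channels(kernel_shape, kernel_layout):
    """Return the extent of the 'O' (output channel) axis of a kernel.

    Assumes groups == 1 and one character per axis in the layout string.
    """
    if len(kernel_shape) != len(kernel_layout):
        raise ValueError("kernel shape and layout must have the same rank")
    if "O" not in kernel_layout:
        raise ValueError("kernel layout has no output-channel axis")
    return kernel_shape[kernel_layout.index("O")]


# Regular convolution: output channels are the first kernel dimension.
print(output_channels((64, 3, 7, 7), "OIHW"))  # 64
# Transposed convolution with an IOHW kernel: the second dimension.
print(output_channels((3, 16, 4, 4), "IOHW"))  # 16
```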

All tests pass locally. This PR contains only API updates; no new functionality is added.

Correctness fixes from an old-vs-new parity review, plus tests:

- Conv1D assumed an implicit batch dimension and dropped the spatial
  dimension under explicit batch; the reshape now derives from the full
  input rank.
- INT8 calibration: the per-input element count no longer includes the
  batch dimension (the calibrator multiplies by batch size itself), which
  previously over-read the input, and the calibrator's device buffers are
  now sized for a full batch instead of a single sample, which previously
  over-wrote memory. Both crashed INT8 calibration for batch > 1.
- Single-engine reuse now requires an exact batch match, since an
  explicit-batch engine's optimization profile pins the built batch size.
- TRT_HAS_IMPLICIT_BATCH is unconditionally false and no longer calls the
  deprecated hasImplicitBatchDimension().
- Run on TVM's current CUDA stream instead of the default stream.
- Warn instead of silently ignoring use_implicit_batch=True, and default
  it to False in the codegen config.
- Null-check the engine build/deserialize paths and free the runtime on
  failure.
- conv2d_transpose / conv3d_transpose now use the IOHW / IODHW kernel
  layout (Relax's default, which also matches TensorRT's deconvolution
  weight layout) instead of the Relay-era OIHW assumption, so the weight
  is passed through directly and the output channel count comes from the
  second kernel dimension.
- Remove dead pre-5.1.5 padding blocks and unused builder members.
- Add offload tests for conv1d, max_pool2d, avg_pool2d, softmax, sigmoid,
  tanh, conv2d_transpose, conv3d_transpose, and INT8 calibration.
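
The INT8 calibration item above comes down to two size computations. The sketch below restates them in plain Python under stated assumptions: the helper names are hypothetical, the batch dimension is assumed to be the leading one, and 4-byte (float32) elements are assumed. It is an illustration of the arithmetic, not the runtime code.

```python
from math import prod


def elements_per_sample(input_shape):
    """Element count of one sample: every dimension except the leading batch."""
    return prod(input_shape[1:])


def calibration_buffer_bytes(input_shape, batch_size, itemsize=4):
    """Device bytes needed to hold one full calibration batch of this input."""
    return batch_size * elements_per_sample(input_shape) * itemsize


shape = (8, 3, 224, 224)  # batch of 8 images
print(elements_per_sample(shape))          # 150528
print(calibration_buffer_bytes(shape, 8))  # 4816896
```

Including the batch dimension in the per-input count and then multiplying by the batch size again would read 8 times too many elements in this example, while sizing the buffer for a single sample would make it 8 times too small for the copy. These are the over-read and over-write described in the list.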

Verified: builds against TensorRT 10.16 with CUDA 12.8, and the added
tests pass on both an RTX 2070 (Turing) and an RTX 5090 (Blackwell).

Fixes apache#19609

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request updates the TVM TensorRT integration to target the TensorRT 10 API, which removes implicit-batch mode, binding indices, and several deprecated layer creation APIs (such as addFullyConnected). The changes transition the codebase to explicit-batch mode, update the operator converters to use the new Nd layer APIs, and manage the deserialization runtime lifetime alongside the engine. The review feedback highlights several critical safety improvements, specifically recommending null-pointer checks for createInferRuntime, addConstant, and addShuffle calls, guarding against integer overflow when handling dynamic dimensions during INT8 calibration, and preventing out-of-bounds access when resolving the device ID from input_var_eid_.

Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_builder.cc
Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_runtime.cc
Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_ops.cc
Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_ops.cc
Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_runtime.cc
Comment thread src/runtime/extra/contrib/tensorrt/tensorrt_runtime.cc
@tqchen merged commit d591cd4 into apache:main on Jun 16, 2026
10 checks passed
MasterJH5574 pushed a commit that referenced this pull request Jun 16, 2026
tqchen pushed a commit that referenced this pull request Jun 17, 2026
…#19810)

This PR is the follow-up to #19789. The current TensorRT BYOC converters
were ported from Relay and still read attribute names/shapes that no
longer match the Relax ops, so most ops crashed ("Key: <name> is not
found") or produced wrong results when offloaded.

This PR changes:
- Converters (tensorrt_ops.cc): port reduce, matmul, expand_dims,
layer_norm, clip, reshape, strided_slice, split and layout_transform to
read Relax's attributes/arguments. Notable shape changes: clip min/max
are PrimValue arguments (not a_min/a_max attrs), reshape's shape is a
Shape argument, matmul has no transpose flags, split is multi-output
with no "mode", and layout_transform is an IndexMap rather than
src/dst_layout strings. Unsupported cases (non-static reshape,
non-permutation layout_transform) now raise a clear error instead of
crashing.
- Codegen (codegen.cc): serialize an op's non-tensor arguments
(PrimValue / ShapeExpr / tuple) as "arg_"-prefixed node attributes,
materialize a reduce op's all-axes default, and translate a
pure-permutation layout_transform IndexMap into a transpose order.
- Runtime: disable the TF32 builder flag so offloaded FP32 subgraphs
match TVM's FP32 reference, and use a process-lifetime TensorRT logger
(a per-runtime logger was left dangling once its runtime was destroyed,
corrupting the heap during TensorRT teardown).
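
Translating a pure-permutation layout_transform into a transpose order, as the Codegen item describes, can be sketched as follows. This is a simplified model and not the codegen implementation: the IndexMap is represented as two tuples of axis names (an assumption made for illustration) and the function name is hypothetical.

```python
def permutation_to_transpose_order(input_axes, output_axes):
    """Translate a pure axis permutation into a transpose order.

    input_axes  -- names of the index variables, e.g. ("n", "c", "h", "w")
    output_axes -- the mapped indices; each must be exactly one input variable
    """
    if len(set(input_axes)) != len(input_axes):
        raise ValueError("input axes must be distinct")
    if sorted(input_axes) != sorted(output_axes):
        raise ValueError(
            "layout_transform is not a pure permutation: %r -> %r"
            % (input_axes, output_axes)
        )
    # Output position i reads from the input axis with the same name.
    return [input_axes.index(axis) for axis in output_axes]


# NCHW -> NHWC
print(permutation_to_transpose_order(("n", "c", "h", "w"), ("n", "h", "w", "c")))
# [0, 2, 3, 1]
```

Anything that is not a plain reordering of the input variables, such as a map that splits or fuses an axis, is rejected with an error in this sketch, mirroring the "clear error instead of crashing" behaviour described for unsupported cases.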

All tests are validated locally.

Development

Successfully merging this pull request may close these issues.

[Bug] TensorRT 10 compatibility issues in Relax TensorRT BYOC (TVM 0.24, CUDA 12.4+)
