docs(core): improve NIXL RDT tuning guide #64137

Open
nh-atuan wants to merge 1 commit into ray-project:master from nh-atuan:issue-62600

@nh-atuan

Description

This PR improves the Ray Direct Transport NIXL documentation with configuration and performance tuning guidance.

It adds:

  • UCX, GDRCopy, RDMA, and memory registration setup notes for NIXL users.
  • Guidance on reducing NIXL memory registration overhead with register_nixl_memory, deregister_nixl_memory, and register_nixl_memory_pool.
  • Usage examples for pre-registering reused tensor storage and using a NIXL memory pool.
  • Common NIXL RDT anti-patterns around tensor views, small transfers, long-lived refs, mutation semantics, mixed devices, and non-contiguous tensors.

Related issues

Fixes #62600

Additional information

This is a docs-only change.

Validation:

  • python -m py_compile doc\source\ray-core\doc_code\direct_transport_nixl.py
  • git diff --check

@nh-atuan requested a review from a team as a code owner on June 16, 2026, 12:49

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request updates the Ray Core direct transport documentation for NIXL, adding setup instructions for UCX and memory registration, as well as a new performance tuning section with APIs like register_nixl_memory and register_nixl_memory_pool. The review feedback recommends fixing a reStructuredText underline length mismatch for the 'Common NIXL anti-patterns' header and suggests clarifying that the memory pool allocates space based on the entire underlying storage size of tensors to help users avoid unexpected out-of-memory errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +351 to +352
Common NIXL anti-patterns
""""""""""""""""""""""""

medium

The underline for the section header 'Common NIXL anti-patterns' is 24 characters long, but the title itself is 25 characters long. In reStructuredText, the underline must be at least as long as the title text to avoid Sphinx compilation warnings or errors. Please increase the underline length to 25 characters.
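
The length rule described above can be checked mechanically before running a full Sphinx build. The following is a minimal sketch, not part of this PR: `short_underlines` is a hypothetical helper that compares plain character counts, which matches docutils only for ASCII titles, and it is a rough heuristic rather than a replacement for a docs build.

```python
def short_underlines(rst_text):
    """Return (line_number, title) for each title whose underline is too short.

    Heuristic only: treats any line made of one repeated punctuation
    character (at least 4 long) as an underline for the line above it.
    """
    adornment_chars = set("=-`:'\"~^_*+#<>")
    lines = rst_text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title = lines[i - 1].rstrip()
        underline = lines[i].rstrip()
        if not title or len(underline) < 4:
            continue
        # An underline is a single adornment character repeated.
        if len(set(underline)) != 1 or underline[0] not in adornment_chars:
            continue
        # Skip the case where the "title" is itself an adornment line (overline).
        if set(title) == set(underline):
            continue
        if len(underline) < len(title):
            problems.append((i, title))  # i is the 1-based line number of the title
    return problems


title = "Common NIXL anti-patterns"          # 25 characters
bad = title + "\n" + '"' * 24 + "\n"         # underline one character short
good = title + "\n" + '"' * 25 + "\n"        # underline matches the title

print(short_underlines(bad))   # [(1, 'Common NIXL anti-patterns')]
print(short_underlines(good))  # []
```

Running this over the changed `.rst` file would flag the header quoted in this comment and nothing else of the same kind, assuming the rest of the file uses ASCII titles.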

Suggested change
- Common NIXL anti-patterns
- """"""""""""""""""""""""
+ Common NIXL anti-patterns
+ """""""""""""""""""""""""

Comment on lines +346 to +349
Size the pool for the maximum live NIXL RDT data produced by the actor, not just
for one transfer. If the pool is too small, Ray raises ``NixlOutOfMemoryError``.
Increase the pool size or reduce the number of simultaneously live RDT
``ObjectRef`` instances.

medium

It is highly recommended to explicitly mention that the NIXL memory pool allocates space based on the entire underlying storage size of the tensors, rather than just the size of the view. If a user returns small views of very large tensors, the pool can be quickly exhausted and raise a NixlOutOfMemoryError unless those views are cloned first to reclaim the unused storage.

Suggested change
- Size the pool for the maximum live NIXL RDT data produced by the actor, not just
- for one transfer. If the pool is too small, Ray raises ``NixlOutOfMemoryError``.
- Increase the pool size or reduce the number of simultaneously live RDT
- ``ObjectRef`` instances.
+ Size the pool for the maximum live NIXL RDT data produced by the actor, not just
+ for one transfer. Note that the pool allocates space based on the *entire underlying storage size* of the tensors, so returning small views of large tensors can quickly exhaust the pool unless those views are cloned first to reclaim the unused storage. If the pool is too small, Ray raises ``NixlOutOfMemoryError``.
+ Increase the pool size or reduce the number of simultaneously live RDT
+ ``ObjectRef`` instances.
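
To make the sizing concern concrete, here is a toy accounting model. It assumes, as this comment states, that the pool charges each live tensor for its whole underlying storage. `LiveTensor` and `pool_bytes_needed` are hypothetical names for illustration and are not Ray APIs; the sizes are made up, and the model does not describe how Ray's allocator handles views that share one storage.

```python
from dataclasses import dataclass

MiB = 1024 * 1024


@dataclass(frozen=True)
class LiveTensor:
    view_nbytes: int     # bytes visible through the returned view
    storage_nbytes: int  # bytes of the whole underlying storage


def pool_bytes_needed(live, charge_full_storage=True):
    """Sum the pool space consumed by a set of simultaneously live tensors."""
    return sum(
        t.storage_nbytes if charge_full_storage else t.view_nbytes
        for t in live
    )


# Eight 1 MiB views, each sliced from its own 256 MiB tensor (illustrative).
views = [LiveTensor(view_nbytes=1 * MiB, storage_nbytes=256 * MiB) for _ in range(8)]

# Cloning a view first gives it a compact storage of its own.
clones = [LiveTensor(t.view_nbytes, t.view_nbytes) for t in views]

print(pool_bytes_needed(views) // MiB)   # 2048
print(pool_bytes_needed(clones) // MiB)  # 8
```

Under this assumption, a pool sized for the 8 MiB of visible data would be exhausted by the very first uncloned view, which is the failure mode the suggested wording warns about.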

Signed-off-by: nh-atuan <anhtuan21347@gmail.com>
@ray-gardener (Bot) added the docs, core, and community-contribution labels on Jun 16, 2026
Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • docs: An issue or change related to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[core][rdt] Improve NIXL rdt docs on configuration and performance tuning

1 participant