docs(core): improve NIXL RDT tuning guide #64137
Code Review
This pull request updates the Ray Core direct transport documentation for NIXL, adding setup instructions for UCX and memory registration, as well as a new performance tuning section with APIs like register_nixl_memory and register_nixl_memory_pool. The review feedback recommends fixing a reStructuredText underline length mismatch for the 'Common NIXL anti-patterns' header and suggests clarifying that the memory pool allocates space based on the entire underlying storage size of tensors to help users avoid unexpected out-of-memory errors.
Common NIXL anti-patterns
""""""""""""""""""""""""
The underline for the section header 'Common NIXL anti-patterns' is 24 characters long, but the title itself is 25 characters long. In reStructuredText, the underline must be at least as long as the title text to avoid Sphinx compilation warnings or errors. Please increase the underline length to 25 characters.
Suggested change:
- Common NIXL anti-patterns
- """"""""""""""""""""""""
+ Common NIXL anti-patterns
+ """""""""""""""""""""""""
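The underline rule in the comment above can be checked mechanically before pushing. The following is a minimal sketch, not part of this PR or of Sphinx: `find_short_underlines` is a hypothetical helper that only looks at simple title/underline pairs and ignores overlines and other reStructuredText constructs.

```python
# Hypothetical helper (not a Sphinx or docutils API): flags reStructuredText
# section titles whose underline is shorter than the title text.
ADORNMENT_CHARS = set("=-~^\"'`#*+_:.")


def find_short_underlines(rst_text: str) -> list[tuple[int, str]]:
    """Return (1-based title line number, title) for each too-short underline."""
    lines = rst_text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title = lines[i - 1].rstrip()
        underline = lines[i].rstrip()
        # An underline is a run of one repeated adornment character.
        is_underline = (
            len(underline) >= 2
            and underline[0] in ADORNMENT_CHARS
            and len(set(underline)) == 1
        )
        # Skip blank lines and lines that are themselves adornment runs.
        is_title = bool(title.strip()) and len(set(title)) > 1
        if is_underline and is_title and len(underline) < len(title):
            problems.append((i, title))
    return problems


before = "Common NIXL anti-patterns\n" + '"' * 24 + "\n"
after = "Common NIXL anti-patterns\n" + '"' * 25 + "\n"

print(find_short_underlines(before))  # [(1, 'Common NIXL anti-patterns')]
print(find_short_underlines(after))   # []
```

Running a check like this over the changed `.rst` files would catch the 24-versus-25 mismatch without a full Sphinx build.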
Size the pool for the maximum live NIXL RDT data produced by the actor, not just
for one transfer. If the pool is too small, Ray raises ``NixlOutOfMemoryError``.
Increase the pool size or reduce the number of simultaneously live RDT
``ObjectRef`` instances.
It is highly recommended to explicitly mention that the NIXL memory pool allocates space based on the entire underlying storage size of the tensors, rather than just the size of the view. If a user returns small views of very large tensors, the pool can be quickly exhausted and raise a NixlOutOfMemoryError unless those views are cloned first to reclaim the unused storage.
Suggested change:
- Size the pool for the maximum live NIXL RDT data produced by the actor, not just
- for one transfer. If the pool is too small, Ray raises ``NixlOutOfMemoryError``.
- Increase the pool size or reduce the number of simultaneously live RDT
- ``ObjectRef`` instances.
+ Size the pool for the maximum live NIXL RDT data produced by the actor, not just
+ for one transfer. Note that the pool allocates space based on the *entire underlying storage size* of the tensors, so returning small views of large tensors can quickly exhaust the pool unless those views are cloned first to reclaim the unused storage. If the pool is too small, Ray raises ``NixlOutOfMemoryError``.
+ Increase the pool size or reduce the number of simultaneously live RDT
+ ``ObjectRef`` instances.
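To make the sizing concern concrete, here is a back-of-the-envelope sketch of the accounting described in the comment above. It is an illustrative model only: `LiveTensor` and `pool_bytes_needed` are hypothetical names, not Ray or NIXL APIs, and the model assumes each live tensor is backed by its own separate storage.

```python
from dataclasses import dataclass


@dataclass
class LiveTensor:
    """Hypothetical record for one tensor held by a live RDT ObjectRef."""

    view_bytes: int     # bytes the returned view actually exposes
    storage_bytes: int  # bytes of the whole underlying storage


def pool_bytes_needed(tensors: list[LiveTensor], clone_views: bool = False) -> int:
    """Estimate pool usage under storage-based accounting.

    Assumption: without cloning, each tensor is charged for its full storage;
    after cloning, a tensor owns a fresh storage the size of the view.
    """
    if clone_views:
        return sum(t.view_bytes for t in tensors)
    return sum(t.storage_bytes for t in tensors)


KIB = 1024
MIB = 1024 * KIB

# Eight simultaneously live 4 KiB views, each into its own 256 MiB storage.
live = [LiveTensor(view_bytes=4 * KIB, storage_bytes=256 * MIB) for _ in range(8)]

print(pool_bytes_needed(live) // MIB)                    # 2048 (MiB)
print(pool_bytes_needed(live, clone_views=True) // KIB)  # 32 (KiB)
```

Under these assumptions, the same eight views need about 2 GiB of pool space when returned directly and 32 KiB when cloned first, which is the gap the suggested wording is meant to warn about.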
Signed-off-by: nh-atuan <anhtuan21347@gmail.com>
Description
This PR improves the Ray Direct Transport NIXL documentation with configuration and performance tuning guidance.
It adds:
register_nixl_memory, deregister_nixl_memory, and register_nixl_memory_pool.
Related issues
Fixes #62600
Additional information
This is a docs-only change.
Validation:
python -m py_compile doc\source\ray-core\doc_code\direct_transport_nixl.py
git diff --check