- Add pre-launch commands support to LeptonExecutor #312
- Remove breaking torchrun config for single-node runs #292
- Upgrade skypilot to v0.10.0, introduce network_tier #297
- Fixes for multi-node execution with torchrun + LocalExecutor #251
- Add option to specify --container-env for srun #293
- Fix skypilot archive mount bug #288
- Add example: fine-tune on DGX Cloud with NeMo-Run and deploy on Bedrock #286
- Add nsys patch in ray sub template #318
- Add logs dir to container mount for ray slurm #287
- Allow customizing folder for SlurmRayRequest #281
- Use thread pool for status, run methods inside experiment + other fixes #295
- Correctly append tar files for packaging #317
- Create CHANGELOG.md #314
- docs: Fix doc build issue #290
- Fix docs tutorial links and add intro to guides/index.md #285
- Update README #277
- Add changelog workflow #315
- Update release.yml #306
- ci(fix): Use GITHUB_TOKEN for community bot #302
- ci: Add community-bot #300
- [Bugfix] Add a check for name length #273
- Miscellaneous fixes #280
- Add fix for lowercase and name-length k8s requirements #274
- Specify nodes for gpu metrics collection and split data to each rank #320
- Apply '_enable_goodbye_message' check to both goodbye messages #319
- Update refs #278
- chore: Bump to version 0.6.0rc0.dev0 #272
- Fix docs warnings #271
- Fix docs build #269
- Support overlapped srun commands in Slurm Ray #263
- Refactor DGXC Lepton data mover: switch to BatchJob with auto cleanup and sleep after every run #265
- ci: Fix nemo fw template ref after migrating to new org #256
- Enable Nsys gpu device metrics #257
- Sync job code in local tunnel for Slurm Ray job #254
- Change the create dist job function to support creating a single node #240
- Make job names match Run:ai requirements and make errors more descriptive #255
- Support for %j in slurm log retrieval #252
- Add KubeRay tests for Ray APIs #249
- Upgrade skypilot executor to 0.9.2 #246
- Add user scoping for k8s backend and log level support for Ray APIs #247
- Update to latest Lepton SDK #248
- Add storage mount options to LeptonExecutor #237
- Import guard k8s import in Ray Cluster and Job #245
- Add RayJob and Slurm support for Ray APIs + integration with run.Experiment #236
- ci: Enforce coverage #238
- Fix bug with a CLI overwrite #235
- Add LeptonExecutor support #224
- Add cancel to docker executor #233
- Change default log wait timeout to 10s #232
- Add RayCluster API with Kuberay support #222
- Add sbatch network arg #230
- chore: Update package info #227
- Add support for job groups for local executor #220
- Roll back get_underlying_types change + introduce extract_constituent #223
- Fix some bugs for --lazy in CLI #179
- Add support for modern type hints #221
- Fix bug in CLI with calling a factory-fn inside a list #214
- Handle more edge cases in --help #219
- Add autogenerated API reference content to the documentation #190
- Handle Callable in --help to fix nemo llm export --help error #217
- Ensure job directory creation for various schedulers #216
- Add support for ForwardRef in CLI #176
- Add additional debug to DGXC data mover #215
- Handle ctx in entrypoint for experiment #213
- DGXC executor data mover #206
- Add support for YAML, TOML & JSON #182
- Add clean mode for experiment to avoid printing any NeMo-Run specific logs #208
- Fix seed for torchrun #209
- Support torchrun multi node on local executor #143
- Add nsys filename param #205
- Add DGXCloudExecutor docs and update execution guide #192
- Add --cuda-event-trace=false to nsys command #180