
fix: Use separate step_metric for GPU Monitoring#92

Merged
terrykong merged 6 commits into main from yifu/gpu_monitor on Mar 31, 2025

@yfw yfw commented Mar 26, 2025

What does this PR do?

Fixes #83 by using a separate step_metric when logging with the RayGpuMonitorLogger.
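To illustrate why a separate step axis helps, here is a minimal toy sketch. It does not use the real logger code from this PR: `ToyRun` is a hypothetical stand-in for a tracking backend whose shared step may never decrease, and the metric names `ray/gpu_util` and `ray/ray_step` are made up for the example. With wandb, the rough equivalent of the second scenario would be declaring the GPU metrics with `define_metric(..., step_metric=...)` so they plot against their own counter.

```python
class ToyRun:
    """Hypothetical stand-in for a tracking run (NOT the wandb API).

    Simplified rules assumed for this sketch:
    - an explicit step lower than the shared step is dropped;
    - step=None attaches the row to the current shared step without advancing it.
    """

    def __init__(self):
        self.step = 0      # shared, monotonically increasing step
        self.rows = []     # accepted (step, data) rows
        self.dropped = []  # rows rejected because their step went backwards

    def log(self, data, step=None):
        if step is None:
            step = self.step
        elif step < self.step:
            self.dropped.append((step, data))
            return
        else:
            self.step = step
        self.rows.append((step, data))


def shared_step_axis():
    """GPU monitor and train loop both write to the shared step."""
    run = ToyRun()
    run.log({"train/loss": 1.0}, step=0)
    for tick in range(1, 6):  # the monitor ticks faster than training
        run.log({"ray/gpu_util": 90}, step=tick)
    run.log({"train/loss": 0.8}, step=1)  # dropped: shared step is already 5
    return run


def separate_step_metric():
    """GPU monitor carries its own step value inside the payload."""
    run = ToyRun()
    run.log({"train/loss": 1.0}, step=0)
    for tick in range(1, 6):
        run.log({"ray/gpu_util": 90, "ray/ray_step": tick})
    run.log({"train/loss": 0.8}, step=1)  # accepted: shared step never moved
    return run


if __name__ == "__main__":
    print("shared axis dropped:", len(shared_step_axis().dropped))
    print("separate metric dropped:", len(separate_step_metric().dropped))
```

In the second scenario the monitor never touches the shared step, so the train loop's later `step=1` call is still accepted, and the GPU rows can be plotted against `ray/ray_step` instead.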

Issues

#83

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@yfw yfw requested a review from terrykong March 26, 2025 20:49
yfw added 2 commits March 26, 2025 14:00
#83
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw force-pushed the yifu/gpu_monitor branch from 4c6b5d2 to f7ad8a1 Compare March 26, 2025 21:02
@yfw yfw added the Run CICD label Mar 26, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw added Run CICD and removed Run CICD labels Mar 26, 2025
terrykong previously approved these changes Mar 28, 2025
@terrykong commented

@yfw can you confirm that the GPU metrics show up correctly in wandb?

yfw commented Mar 31, 2025

> @yfw can you confirm that the GPU metrics show up correctly in wandb?

Yes, here is a sample run: https://wandb.ai/nvidia/grpo-dev-yifu1/runs/gm82edgu

[Image: Screenshot 2025-03-26 at 1 45 41 PM]

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw added Run CICD and removed Run CICD labels Mar 31, 2025
@terrykong terrykong enabled auto-merge (squash) March 31, 2025 16:02
@terrykong terrykong merged commit c770e64 into main Mar 31, 2025
16 of 17 checks passed
@terrykong terrykong deleted the yifu/gpu_monitor branch March 31, 2025 16:15
yfw added a commit that referenced this pull request Apr 2, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
KiddoZhu pushed a commit that referenced this pull request May 6, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

Development

Successfully merging this pull request may close these issues.

GPU Monitoring bug when step advances faster than train loop steps
