Skip to content

[trainer,reward] fix: graceful shutdown on Ctrl+C to prevent zombie GPU processes#149

Merged
Jayce-Ping merged 8 commits into
mainfrom
feat/graceful-shutdown-ctrl-c
Apr 30, 2026
Merged

[trainer,reward] fix: graceful shutdown on Ctrl+C to prevent zombie GPU processes#149
Jayce-Ping merged 8 commits into
mainfrom
feat/graceful-shutdown-ctrl-c

Conversation

@Jayce-Ping

Copy link
Copy Markdown
Collaborator

Summary

  • Add RewardBuffer.shutdown() for non-blocking executor termination on interrupt
  • Add BaseTrainer.cleanup() that calls shutdown on reward buffers
  • Wrap trainer.start() with KeyboardInterrupt handler in train.py, using os._exit(0) to bypass NCCL heartbeat noise
  • Catch SIGINT exit codes in cli.py for clean subprocess teardown

Context

When users press Ctrl+C during multi-GPU training, worker processes could become orphaned holding GPU memory. Root cause: the RewardBuffer's ThreadPoolExecutor (holding GPU references via CUDA streams) was never shut down on interrupt, and NCCL's C++ heartbeat threads kept running during Python's normal shutdown sequence.

Test plan

  • Launch multi-GPU training with ff-train, press Ctrl+C during evaluation/sampling
  • Verify nvidia-smi shows no zombie processes after exit
  • Verify log shows "Training interrupted by user" then clean exit
  • Verify normal training completion is unaffected

🤖 Generated with Claude Code

When users press Ctrl+C during multi-GPU training, worker processes
could become orphaned while still holding GPU memory. The root cause:
the RewardBuffer's ThreadPoolExecutor (holding GPU references via CUDA
streams) was never shut down on interrupt, and NCCL's C++ heartbeat
threads kept running during Python's normal shutdown sequence.

Changes:
- Add RewardBuffer.shutdown() for non-blocking executor termination
- Add BaseTrainer.cleanup() that calls shutdown on reward buffers
- Wrap trainer.start() with KeyboardInterrupt handler in train.py
- Use os._exit(0) after cleanup to bypass NCCL heartbeat noise
- Catch SIGINT exit codes in cli.py for clean subprocess teardown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 13:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Ctrl+C handling during (multi-GPU) training to avoid orphaned worker processes that keep GPU memory allocated, by explicitly shutting down async reward executors and exiting cleanly when interrupted.

Changes:

  • Add RewardBuffer.shutdown() for fast, non-reinitializing teardown of async reward executors.
  • Add BaseTrainer.cleanup() to shut down training/eval reward buffers on interrupt.
  • Wrap trainer.start() in train.py with KeyboardInterrupt handling and force-exit to avoid noisy/hanging shutdown paths; handle SIGINT-like subprocess exit codes in cli.py.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/flow_factory/trainers/abc.py Adds cleanup() to shut down reward buffers during interrupt-driven teardown.
src/flow_factory/train.py Catches KeyboardInterrupt, calls trainer cleanup, and exits immediately to avoid lingering shutdown noise.
src/flow_factory/rewards/reward_processor.py Introduces RewardBuffer.shutdown() to terminate the underlying executor without resetting state.
src/flow_factory/cli.py Treats SIGINT-style subprocess exits as a clean interrupt and exits with code 0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/flow_factory/train.py Outdated
Comment on lines +60 to +61
trainer.cleanup()
os._exit(0)
Comment thread src/flow_factory/trainers/abc.py Outdated
Comment on lines +418 to +421
Called on KeyboardInterrupt to ensure ThreadPoolExecutor threads (which
may hold GPU references via reward model CUDA streams) are terminated
before the process exits. The OS reclaims all GPU memory on exit, so
explicit empty_cache() is unnecessary and would race with executor threads.
Address review feedback:
- Wrap trainer.cleanup() in try/finally so os._exit(0) is guaranteed
  to execute even if cleanup fails on partially-initialized state
- Correct cleanup() docstring to reflect non-blocking semantics
  (shutdown signals threads but os._exit does the actual termination)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds explicit interrupt/teardown hooks to ensure multi-GPU training exits cleanly on Ctrl+C without leaving orphaned processes holding GPU memory, primarily by shutting down async reward executors and force-exiting to avoid noisy/disruptive NCCL shutdown behavior.

Changes:

  • Add RewardBuffer.shutdown() for fast, non-blocking threadpool teardown on interrupt.
  • Add BaseTrainer.cleanup() to trigger reward-buffer shutdowns during interrupt handling.
  • Handle KeyboardInterrupt in train.py with cleanup + os._exit(0), and treat SIGINT-like subprocess exits as clean interrupts in cli.py.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/flow_factory/trainers/abc.py Introduces BaseTrainer.cleanup() that shuts down training/eval reward buffers without blocking.
src/flow_factory/train.py Wraps trainer.start() in a KeyboardInterrupt handler, calls trainer.cleanup(), then os._exit(0).
src/flow_factory/rewards/reward_processor.py Adds RewardBuffer.shutdown() to terminate the async ThreadPoolExecutor without reinitialization.
src/flow_factory/cli.py Treats SIGINT-style subprocess termination as a clean interrupt to avoid noisy tracebacks and ensure exit(0).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Before this change, os._exit(0) bypassed all atexit handlers, causing
wandb/swanlab to lose unsynced metrics. Now cleanup() calls
logger.finish() before exit, allowing data sync to complete.

Double Ctrl+C still force-exits via try/finally + os._exit(0).

- Add Logger.finish() base method (cleans up temp files)
- Override in WandbLogger: calls wandb.finish(quiet=True)
- Override in SwanlabLogger: calls swanlab.finish()
- Override in TensorboardLogger: calls self.platform.close()
- BaseTrainer.cleanup(): call self.logger.finish() after executor shutdown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Jayce-Ping Jayce-Ping force-pushed the feat/graceful-shutdown-ctrl-c branch from 4928bc2 to c10f9b1 Compare April 30, 2026 14:00
wandb.finish() did not reliably sync data during interrupt.
Revert to keep the scope focused on graceful GPU process exit only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Ctrl+C handling during training to avoid orphaned/zombie GPU processes by introducing explicit non-blocking shutdown hooks for async reward workers and handling SIGINT exit paths more deliberately in both the training entrypoint and CLI launcher.

Changes:

  • Add RewardBuffer.shutdown() to stop the reward async executor without waiting/reinitializing.
  • Add BaseTrainer.cleanup() to trigger reward-buffer shutdown on interruption.
  • Catch KeyboardInterrupt in train.py and force-exit, plus treat SIGINT returns as clean exits in cli.py.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/flow_factory/trainers/abc.py Adds BaseTrainer.cleanup() to shutdown reward buffers on interrupt.
src/flow_factory/train.py Wraps training start in KeyboardInterrupt handling and force-exits after cleanup.
src/flow_factory/rewards/reward_processor.py Adds RewardBuffer.shutdown() for non-blocking executor termination.
src/flow_factory/cli.py Treats SIGINT/KeyboardInterrupt from the training subprocess as a clean exit.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/flow_factory/train.py
try:
trainer.cleanup()
finally:
os._exit(0)
Comment thread src/flow_factory/train.py Outdated
Comment on lines +54 to +56
trainer = load_trainer(config)
trainer.start()

try:
trainer.start()
Jayce-Ping and others added 3 commits April 30, 2026 22:26
If Ctrl+C arrives during load_trainer() (e.g. while loading large
models), the ThreadPoolExecutor may already exist but cleanup() would
never run. Widen the try/except to include trainer creation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves Ctrl+C (SIGINT) handling during training to avoid orphaned processes holding GPU memory, primarily by proactively shutting down async reward worker executors and exiting quickly.

Changes:

  • Add a RewardBuffer.shutdown() method for fast, non-blocking executor teardown on interruption.
  • Introduce BaseTrainer.cleanup() to shut down reward buffers’ async executors.
  • Add SIGINT/KeyboardInterrupt handling in train.py (including forced exit) and in cli.py to treat Ctrl+C as a clean termination.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/flow_factory/trainers/abc.py Adds BaseTrainer.cleanup() to shut down reward buffers’ async executors.
src/flow_factory/train.py Wraps trainer.start() with KeyboardInterrupt handling and forces immediate process exit.
src/flow_factory/rewards/reward_processor.py Adds RewardBuffer.shutdown() for final teardown of the thread pool executor.
src/flow_factory/cli.py Treats SIGINT-like subprocess termination and KeyboardInterrupt as clean exits.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/flow_factory/train.py
Comment on lines +60 to +67
except KeyboardInterrupt:
if local_rank == 0:
logger.info("Training interrupted. Cleaning up...")
try:
if trainer is not None:
trainer.cleanup()
finally:
os._exit(0)
Comment thread src/flow_factory/cli.py Outdated
Comment on lines +213 to +217
sys.exit(0)
raise
except KeyboardInterrupt:
logger.info("Training interrupted.")
sys.exit(0)
Exit code 130 (128 + SIGINT) is the standard convention for processes
interrupted by Ctrl+C. Using 0 would mask the interrupt from shell
scripts and cluster schedulers (SLURM, K8s) that rely on exit codes
to distinguish successful completion from user cancellation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Jayce-Ping Jayce-Ping force-pushed the feat/graceful-shutdown-ctrl-c branch from e953d99 to 2a54b8f Compare April 30, 2026 14:52
@Jayce-Ping Jayce-Ping merged commit 5b66da0 into main Apr 30, 2026
Jayce-Ping added a commit to yuanzhi-zhu/Flow-Factory that referenced this pull request May 2, 2026
Resolve conflicts between CRD trainer (this PR) and DGPO trainer (X-GenGroup#133)
added to main since branching:

- training_args.py: Place DGPOTrainingArguments before CRDTrainingArguments,
  each class retains its own fields and defaults
- algorithms.md: Adopt main's reference numbering, add CRD as ref [13]
- crd.py: Add _maybe_offload_samples_to_cpu() call matching the pattern
  established in GRPO/GRPOGuard on main (PR X-GenGroup#149)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jayce-Ping added a commit to Jayce-Ping/Flow-Factory-Private that referenced this pull request Jul 2, 2026
…PU processes (X-GenGroup#149)

* fix: add graceful shutdown on Ctrl+C to prevent zombie GPU processes

When users press Ctrl+C during multi-GPU training, worker processes
could become orphaned while still holding GPU memory. The root cause:
the RewardBuffer's ThreadPoolExecutor (holding GPU references via CUDA
streams) was never shut down on interrupt, and NCCL's C++ heartbeat
threads kept running during Python's normal shutdown sequence.

Changes:
- Add RewardBuffer.shutdown() for non-blocking executor termination
- Add BaseTrainer.cleanup() that calls shutdown on reward buffers
- Wrap trainer.start() with KeyboardInterrupt handler in train.py
- Use os._exit(0) after cleanup to bypass NCCL heartbeat noise
- Catch SIGINT exit codes in cli.py for clean subprocess teardown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: ensure os._exit runs even if cleanup raises, fix docstring

Address review feedback:
- Wrap trainer.cleanup() in try/finally so os._exit(0) is guaranteed
  to execute even if cleanup fails on partially-initialized state
- Correct cleanup() docstring to reflect non-blocking semantics
  (shutdown signals threads but os._exit does the actual termination)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: finalize logger on Ctrl+C to preserve wandb/swanlab data

Before this change, os._exit(0) bypassed all atexit handlers, causing
wandb/swanlab to lose unsynced metrics. Now cleanup() calls
logger.finish() before exit, allowing data sync to complete.

Double Ctrl+C still force-exits via try/finally + os._exit(0).

- Add Logger.finish() base method (cleans up temp files)
- Override in WandbLogger: calls wandb.finish(quiet=True)
- Override in SwanlabLogger: calls swanlab.finish()
- Override in TensorboardLogger: calls self.platform.close()
- BaseTrainer.cleanup(): call self.logger.finish() after executor shutdown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* revert: remove logger finalization on Ctrl+C

wandb.finish() did not reliably sync data during interrupt.
Revert to keep the scope focused on graceful GPU process exit only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: cover trainer construction in KeyboardInterrupt handler

If Ctrl+C arrives during load_trainer() (e.g. while loading large
models), the ThreadPoolExecutor may already exist but cleanup() would
never run. Widen the try/except to include trainer creation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: move success log inside try block for cleaner control flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update

* fix: use exit code 130 for SIGINT in cli launcher

Exit code 130 (128 + SIGINT) is the standard convention for processes
interrupted by Ctrl+C. Using 0 would mask the interrupt from shell
scripts and cluster schedulers (SLURM, K8s) that rely on exit codes
to distinguish successful completion from user cancellation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants