fix(ModelStage): del self.model before gc.collect() in teardown()#1965

Draft
rlratzel wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
rlratzel:stageactor_cleanup

@rlratzel
Contributor

Summary

Fixes GPU memory not being released after xenna-based benchmarks complete, causing subsequent benchmarks to fail with out-of-memory errors.

ModelStage.teardown() called gc.collect() without first dropping the reference held in self.model. Because the stage still referenced the model, the garbage collector had nothing to reclaim, and the model weights stayed in GPU memory after the benchmark subprocess exited.

The fix explicitly deletes self.model before calling gc.collect(). This removes the stage's reference, so that, provided nothing else still references the model, the weights are actually freed before torch.cuda.empty_cache() runs. A hasattr guard is included in case teardown() is called before setup completes.
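The ordering described above can be sketched as follows. This is a minimal illustration, not the actual diff: the ModelStage class here is a simplified, hypothetical stand-in, DummyModel is a placeholder for real weights, and torch is treated as optional so the sketch also runs on a machine without it.

```python
import gc
import weakref

try:
    import torch  # optional in this sketch; only used for the CUDA cache step
except ImportError:
    torch = None


class DummyModel:
    """Hypothetical stand-in for a real model holding GPU weights."""


class ModelStage:
    """Simplified, hypothetical stage; not the class from the repository."""

    def setup(self) -> None:
        self.model = DummyModel()

    def teardown(self) -> None:
        # Guard: teardown() may run before setup() has assigned self.model.
        if hasattr(self, "model"):
            # Drop the stage's reference first so the object can be reclaimed.
            del self.model
        # Collect anything left over, including reference cycles.
        gc.collect()
        # Return cached, now-unused CUDA blocks to the driver (if torch is present).
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()


stage = ModelStage()
stage.setup()
model_ref = weakref.ref(stage.model)  # observe the model without keeping it alive
stage.teardown()
model_freed = model_ref() is None

# teardown() without a prior setup() should not raise.
early_stage = ModelStage()
early_stage.teardown()
```

The weak reference only shows that the Python object was reclaimed. If any other strong reference to the model exists elsewhere, del self.model alone will not free the weights.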

Status

Draft — needs to be verified against benchmark logs and a test run on the affected machine before merging.

Test plan

  • Run xenna-based benchmarks sequentially and confirm GPU memory is fully released between runs
  • Confirm the benchmark that previously failed with OOM now succeeds
  • Review logs from the affected machine to confirm this is the right fix
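One way to check the first item in the test plan is to sample used GPU memory between runs. The helper below is a sketch under assumptions: it expects nvidia-smi on PATH and a plain integer (MiB) per GPU in the output. The function names are hypothetical and not part of the repository.

```python
import subprocess


def parse_used_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB used) per GPU from nvidia-smi CSV output."""
    return [
        int(line.strip())
        for line in nvidia_smi_output.splitlines()
        if line.strip()
    ]


def gpu_memory_used_mib() -> list[int]:
    """Query used memory per GPU; requires nvidia-smi to be installed."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_used_mib(result.stdout)
```

Calling gpu_memory_used_mib() before the first benchmark and again after each run gives a baseline to compare against. A reading that stays well above the baseline after a run would suggest memory is still being held.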

🤖 Generated with Claude Code

Explicitly delete the model reference before calling gc.collect() so the
garbage collector can actually free the model weights. Without this, the
reference remains live and gc.collect() has nothing to reclaim, leaving
GPU memory occupied after the benchmark subprocess exits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rlratzel added 2 commits May 13, 2026 12:53