fix(ModelStage): del self.model before gc.collect() in teardown() by rlratzel · Pull Request #1965 · NVIDIA-NeMo/Curator

rlratzel · 2026-05-11T03:47:48Z

Summary

Fixes GPU memory not being released after xenna-based benchmarks complete, causing subsequent benchmarks to fail with out-of-memory errors.

ModelStage.teardown() was calling gc.collect() without first dropping the reference to self.model. Since the model was still referenced, the garbage collector had nothing to reclaim, leaving the model weights in GPU memory after the benchmark subprocess exited.

The fix explicitly deletes self.model before gc.collect() so the reference count drops to zero and the memory is actually freed before torch.cuda.empty_cache() runs. A hasattr guard is included for safety in case teardown is called before setup completes.
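The teardown ordering described above can be sketched as follows. This is a minimal illustration, not the actual Curator code: `FakeModel` and `SketchStage` are hypothetical stand-ins, and the `torch` import is treated as optional so the example also runs on a machine without PyTorch. A weak reference is used only to observe whether the object is still alive.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object (not a Curator class)."""


class SketchStage:
    """Hypothetical stage mirroring the setup/teardown shape described in this PR."""

    def setup(self) -> None:
        self.model = FakeModel()

    def teardown(self) -> None:
        # Drop the stage's own reference first. The hasattr guard covers the
        # case where teardown runs before setup has assigned self.model.
        if hasattr(self, "model"):
            del self.model
        # Collect anything that was only kept alive through reference cycles.
        gc.collect()
        # Assumption: torch may or may not be installed where this sketch runs.
        try:
            import torch
        except ImportError:
            return
        if torch.cuda.is_available():
            # Return cached, now-unused blocks to the CUDA driver.
            torch.cuda.empty_cache()


stage = SketchStage()
stage.setup()
probe = weakref.ref(stage.model)  # observes the model without keeping it alive

gc.collect()
alive_before = probe() is not None  # the stage still holds a reference

stage.teardown()
alive_after = probe() is not None  # no reference is left after teardown
```

Note that `del self.model` only removes the stage's reference. If other objects still point at the model, it stays alive until those references are dropped as well.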

Status

Draft — needs to be verified against benchmark logs and a test run on the affected machine before merging.

Test plan

Run xenna-based benchmarks sequentially and confirm GPU memory is fully released between runs
Confirm the benchmark that previously failed with OOM now succeeds
Review logs from the affected machine to confirm this is the right fix
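One way to check the first item of the test plan is to sample per-GPU memory usage before and after each benchmark run. The sketch below assumes `nvidia-smi` is on the PATH and reports plain integer MiB values. The helper names and the 64 MiB tolerance are hypothetical choices for illustration, not part of Curator or of this PR.

```python
import subprocess


def parse_memory_used(output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    # Assumption: every line is a plain integer; values such as "[N/A]"
    # would raise ValueError here.
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def query_gpu_memory_used_mib() -> list[int]:
    """Return the memory in use on each visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def memory_released(before: list[int], after: list[int], tolerance_mib: int = 64) -> bool:
    """True if no GPU uses more than tolerance_mib above its earlier reading."""
    # tolerance_mib is a hypothetical allowance for driver and context overhead.
    return all(a - b <= tolerance_mib for b, a in zip(before, after))
```

A possible use is to call `query_gpu_memory_used_mib()` before the first benchmark and again after each run, then pass both readings to `memory_released`.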

🤖 Generated with Claude Code

Explicitly delete the model reference before calling gc.collect() so the garbage collector can actually free the model weights. Without this, the reference remains live and gc.collect() has nothing to reclaim, leaving GPU memory occupied after the benchmark subprocess exits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

copy-pr-bot · 2026-05-11T03:47:52Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: rlratzel <rratzel@nvidia.com>

rlratzel added 2 commits May 13, 2026 12:53

Merge remote-tracking branch 'upstream/main' into stageactor_cleanup

52770bc

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge remote-tracking branch 'upstream/main' into stageactor_cleanup

1b07c0c

Signed-off-by: rlratzel <rratzel@nvidia.com>

rlratzel wants to merge 3 commits into NVIDIA-NeMo:main from rlratzel:stageactor_cleanup
