Synchronize BF16 weight loads #596

Open
devshahofficial wants to merge 1 commit into PufferAI:4.0 from devshahofficial:devshahofficial/fix-bf16-load-weights-sync
@devshahofficial

Summary

  • Check CUDA errors while loading primary weights.
  • Synchronize the BF16 cast on pufferl.default_stream before load_weights returns.
  • Add a regression that guards cast launch checking and stream synchronization ordering.
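
Because the patch environment has no nvcc, a regression of this kind plausibly inspects the source text rather than running a kernel. The sketch below is one way such a guard could look; it is not the test added in this PR. The source excerpt, the kernel name cast_fp32_to_bf16, and the CHECK_CUDA macro are hypothetical stand-ins. Only cudaMemcpyAsync, cudaGetLastError, and cudaStreamSynchronize are real CUDA runtime calls.

```python
import re

# Hypothetical excerpt standing in for the real load_weights source.
# Kernel name, macro name, and struct fields are illustrative assumptions.
SAMPLE_SOURCE = """
void load_weights(PufferL* pufferl, const float* src) {
    CHECK_CUDA(cudaMemcpyAsync(pufferl->master, src, nbytes,
                               cudaMemcpyHostToDevice, pufferl->default_stream));
    cast_fp32_to_bf16<<<grid, block, 0, pufferl->default_stream>>>(
        pufferl->param_puf, pufferl->master, n);
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaStreamSynchronize(pufferl->default_stream));
}
"""


def check_cast_then_sync(source: str) -> bool:
    """Return True if the cast launch is followed by an error check,
    and that error check is followed by a stream synchronization."""
    launch = re.search(r"cast_fp32_to_bf16\s*<<<", source)
    if launch is None:
        return False
    # The launch error check must appear after the kernel launch.
    err = source.find("cudaGetLastError", launch.end())
    if err == -1:
        return False
    # The synchronization must appear after the error check.
    return source.find("cudaStreamSynchronize", err) != -1


if __name__ == "__main__":
    print(check_cast_then_sync(SAMPLE_SOURCE))
```

A text-order check like this only guards against the calls being removed or reordered; it cannot confirm runtime behavior, which is why GPU validation is still listed as outstanding.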

Root cause

BF16 load_weights copied fp32 master weights, launched the fp32-to-BF16 cast on default_stream, and then returned without ordering that stream against subsequent inference streams. A first rollout could consume stale param_puf data.
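
The ordering problem can be illustrated with a toy pure-Python model. This is a deliberately simplified sketch, not the project's code: ToyStream, cast_into, and first_rollout_sees are hypothetical names, and the model defers all queued work until synchronize() is called, which exaggerates the worst case of a real asynchronous stream.

```python
from collections import deque


class ToyStream:
    """Toy stand-in for a CUDA stream: launched work is queued and only
    runs when synchronize() is called (a worst-case simplification)."""

    def __init__(self):
        self._pending = deque()

    def launch(self, fn, *args):
        self._pending.append((fn, args))

    def synchronize(self):
        while self._pending:
            fn, args = self._pending.popleft()
            fn(*args)


def cast_into(dst, src):
    # Stand-in for the fp32-to-BF16 cast kernel: copies values in place.
    dst[:] = src


def load_weights(master, param_puf, new_weights, stream, sync):
    master[:] = new_weights                      # copy fp32 master weights
    stream.launch(cast_into, param_puf, master)  # queue the cast
    if sync:
        stream.synchronize()                     # the ordering this PR adds


def first_rollout_sees(sync):
    master, param_puf = [0.0, 0.0], [0.0, 0.0]
    load_weights(master, param_puf, [1.5, -2.0], ToyStream(), sync)
    return list(param_puf)  # what an inference stream would read next


if __name__ == "__main__":
    print(first_rollout_sees(sync=False))  # [0.0, 0.0]  (stale)
    print(first_rollout_sees(sync=True))   # [1.5, -2.0] (fresh)
```

Without the synchronization step the reader observes the old contents of param_puf, which matches the stale-data symptom described above.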

Validation

  • python -m pytest tests/test_cuda_load_weights.py tests/test_import_performance.py -q

GPU runtime validation still needs CUDA hardware; the local environment used for this patch does not include nvcc.

cc @Infatoshi for the original BF16 checkpoint repro.

Fixes #534

Development

Successfully merging this pull request may close these issues.

BF16 load_weights doesn't reach inference stream — eval produces degenerate action distribution