Synchronize BF16 weight loads by devshahofficial · Pull Request #596 · PufferAI/PufferLib

devshahofficial · 2026-06-22T19:53:37Z

Summary

Check CUDA errors while loading primary weights.
Synchronize the BF16 cast on pufferl.default_stream before load_weights returns.
Add a regression test that guards both the cast launch error check and the ordering of the stream synchronization.

Root cause

In BF16 mode, load_weights copied the fp32 master weights, launched the fp32-to-BF16 cast on default_stream, and then returned without ordering that stream against the inference streams used afterwards. As a result, a first rollout could consume stale param_puf data.
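The ordering problem can be illustrated with a toy CPU model. This is only an analogy under stated assumptions, not PufferLib code: a single-worker executor stands in for a CUDA stream, `cast_kernel`, `params`, and the `synchronize` flag are hypothetical names, and an event is used to hold the "kernel" back so the stale read is reproducible.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Toy model: a single-worker executor behaves like a stream, running
# submitted work in order and asynchronously to the caller.
default_stream = ThreadPoolExecutor(max_workers=1)

master_fp32 = [1.5, 2.5, 3.5]   # freshly loaded master weights
params = [0.0, 0.0, 0.0]        # hypothetical stand-in for the BF16 buffer
gate = threading.Event()        # lets the demo delay the "kernel"


def cast_kernel():
    """Stand-in for the fp32-to-BF16 cast; waits until the gate opens."""
    gate.wait()
    params[:] = master_fp32


def load_weights(synchronize):
    """Enqueue the cast and optionally wait for it before returning."""
    pending = default_stream.submit(cast_kernel)
    if synchronize:
        pending.result()        # plays the role of a stream synchronize
    return pending


# Unsynchronized: the caller returns early and another consumer reads.
pending = load_weights(synchronize=False)
stale_read = list(params)       # the cast has not run yet
gate.set()                      # now let the delayed cast finish
pending.result()

# Synchronized: the cast has completed before load_weights returns.
params[:] = [0.0, 0.0, 0.0]
load_weights(synchronize=True)
fresh_read = list(params)

default_stream.shutdown()
print(stale_read, fresh_read)
```

In the model, the unsynchronized read observes the old buffer contents while the synchronized read observes the cast weights, which mirrors the failure mode described above.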

Validation

python -m pytest tests/test_cuda_load_weights.py tests/test_import_performance.py -q

GPU runtime validation still needs CUDA hardware; the local environment used for this patch does not include nvcc.
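One way a regression of this kind could run without nvcc is to inspect the CUDA source text rather than execute it. The sketch below is an assumption about the approach and is not copied from tests/test_cuda_load_weights.py; `SAMPLE_SOURCE`, the kernel name `cast_fp32_to_bf16`, and the `CHECK_CUDA` macro are hypothetical placeholders.

```python
import re

# Hypothetical excerpt standing in for the CUDA source under test; the real
# file contents, kernel name, and error-check macro may differ.
SAMPLE_SOURCE = """
void load_weights(...) {
    cudaMemcpy(master, host, bytes, cudaMemcpyHostToDevice);
    cast_fp32_to_bf16<<<grid, block, 0, default_stream>>>(param, master, n);
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaStreamSynchronize(default_stream));
}
"""

LAUNCH = re.compile(r"<<<[^>]*default_stream\s*>>>")
ERROR_CHECK = re.compile(r"cudaGetLastError\s*\(")
SYNC = re.compile(r"cudaStreamSynchronize\s*\(\s*default_stream\s*\)")


def cast_is_checked_and_synchronized(source):
    """Return True if a launch on default_stream is followed by both a
    launch error check and a synchronize of that same stream."""
    launch = LAUNCH.search(source)
    if launch is None:
        return False
    after = launch.end()
    has_check = ERROR_CHECK.search(source, after) is not None
    has_sync = SYNC.search(source, after) is not None
    return has_check and has_sync


def test_cast_launch_is_checked_and_synchronized():
    assert cast_is_checked_and_synchronized(SAMPLE_SOURCE)
    # Dropping the synchronize call should make the guard fail.
    broken = SAMPLE_SOURCE.replace("cudaStreamSynchronize", "cudaStreamQuery")
    assert not cast_is_checked_and_synchronized(broken)
```

A text-level guard like this only catches accidental removal or reordering of the calls; it does not replace running the load path on real CUDA hardware.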

cc @Infatoshi for the original BF16 checkpoint repro.

Fixes #534

Synchronize BF16 weight loads

bb3fea8

devshahofficial wants to merge 1 commit into PufferAI:4.0 from devshahofficial:devshahofficial/fix-bf16-load-weights-sync
