
Language model sample #378

Merged: jlamypoirier merged 16 commits into main from jlp/lm_sample on Nov 24, 2025
Conversation

@jlamypoirier (Collaborator) commented Oct 16, 2025

✨ Description

  • Replace GPTSample and GPTBatch with LanguageModelSample and LanguageModelBatch, encapsulating much of the related functionality and simplifying much of the code that uses them (e.g. SampledIndexedDataset, GPTBaseModel.preprocess_batch, gpt_data_collate_fn).
  • Make SampledIndexedDataset agnostic of the sample type.
  • Change spans to use a standard range format, i.e. (first, last + 1) instead of (first, last). (Spans are still stored in (first, last) format in the binary file, this will be changed for the new binary format.)
  • Redo much of the code for preference spans. SampledIndexedDataset was using an entirely separate code path for preference spans that avoided multi-document samples (for historical reasons, I think); I dropped it and made it use the common code path. This does mean a change in behavior (e.g. multi-document samples), but that's an improvement. I have some doubts about the DPO loss function, though: I suspect the log softmax needs to be calculated separately for each document.
  • Datasets now always provide sequence lengths. Move cross_document_attention from BatchConfig to AttentionConfig, so the attention layer itself decides whether to use varlen. Replace use_flash_attention with a more generic implementation enum (see discussion in Base model interface review #370). Add a separate LanguageModelEmbeddingsConfig.cross_document_position_embeddings, since absolute position embeddings may also use the sequence lengths.

@jlamypoirier jlamypoirier mentioned this pull request Oct 16, 2025
@jlamypoirier jlamypoirier marked this pull request as ready for review October 17, 2025 03:20
@tscholak (Collaborator) left a comment
This looks OK to me. I think we should merge.

For others looking at this, the most important changes are in:

A few things we should definitely make sure of, ideally via tests (if we haven't already):

  • span cropping, offsetting, and truncation arithmetic is correct
  • loss masking masks correctly
  • DPO works correctly
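As a sketch of the loss-masking check above: half-open spans mark token positions whose loss should be ignored, and a test can assert the mask directly. All names here are hypothetical, not the Fast-LLM test suite.

```python
def build_loss_mask(length: int, masked_spans: list[tuple[int, int]]) -> list[bool]:
    """Build a per-position mask; True means the position contributes to the loss.

    Spans use the half-open (begin, end) convention described in the PR.
    """
    mask = [True] * length
    for begin, end in masked_spans:
        for i in range(begin, min(end, length)):
            mask[i] = False
    return mask


# Mask the first two and positions 5-6 of an 8-token sample.
mask = build_loss_mask(8, [(0, 2), (5, 7)])
assert mask == [False, False, True, True, True, False, False, True]
assert sum(mask) == 4  # positions kept in the loss
```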

Base automatically changed from jlp/dataset_interface to main November 24, 2025 15:44
@jlamypoirier jlamypoirier merged commit 261717c into main Nov 24, 2025
2 checks passed
@jlamypoirier jlamypoirier deleted the jlp/lm_sample branch November 24, 2025 18:19
