Conversation
Great to carve this out! I'm wondering how to stage these changes. Switching to the new binary format is a breaking change, and we'd need to reprocess all currently used training data. This needs to be properly timed and announced. Do we have backwards compatibility for already processed data?
I'm hoping to have the new format ready this week and make a big announcement. I'm not currently planning backward compatibility (time issue), but could if that's a necessity.
Can we convert existing data to the new format? I could work on a simple converter tool: old binary in, new binary out.
I don't really think it's worth it; might as well just redo the preparation. To help with the transition, I'm noticing that the intermediate memmap dataset I'm making in #378 will essentially support the old binary format with the updated code, so I could just keep it around for a while as a backward-compatibility backup. (Except for vision datasets, of course.)
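The memmap-backed backup mentioned above could look something like the sketch below: a dataset that memory-maps a flat binary file of token ids and slices fixed-length samples out of it. The class name, the `uint16` dtype, and the flat layout are assumptions for illustration, not the actual old binary format.

```python
import numpy as np


class MemmapTokenDataset:
    """Minimal sketch of a memmap-backed dataset over a flat binary
    file of token ids. Dtype and layout are assumptions, not the
    real format."""

    def __init__(self, path, seq_len, dtype=np.uint16):
        # mode="r" maps the file read-only without loading it into RAM.
        self.data = np.memmap(path, dtype=dtype, mode="r")
        self.seq_len = seq_len

    def __len__(self):
        # Number of full, non-overlapping sequences in the file.
        return len(self.data) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        # Copy the slice out of the memmap into a regular array.
        return np.asarray(self.data[start:start + self.seq_len])
```

Since the memmap is read lazily, keeping this around as a compatibility shim costs essentially nothing until someone actually reads from it.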
✨ Description
Part of the data rework:
- `Sample` and `Batch` constructs (placeholders for now).
- Remove the tokenizer from `GPTData`, since it's rarely used and not by the data itself. Instead, use separate tokenizers where needed (Fim, Preparator (already present), lm eval).
- `Sample` and `Batch` both use torch tensors.

Note: since this is part of a bigger set of changes, it does contain changes that don't immediately make sense but will be useful later, as well as messy temporary solutions. (See #376 for more info on where this is going.)
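As a rough illustration of what the placeholder constructs might look like, here is a minimal sketch of `Sample` and `Batch` as dataclasses. The field names and the `from_samples` helper are hypothetical; the PR stores torch tensors, but numpy stands in here so the sketch stays dependency-free.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Sample:
    # Placeholder: the real construct holds torch tensors;
    # a numpy array stands in for illustration.
    token_ids: np.ndarray  # shape (seq_len,)


@dataclass
class Batch:
    token_ids: np.ndarray  # shape (batch_size, seq_len)

    @classmethod
    def from_samples(cls, samples):
        # Stack equal-length samples along a new batch dimension.
        return cls(token_ids=np.stack([s.token_ids for s in samples]))
```

Keeping these as thin placeholders lets later PRs attach fields (loss masks, spans, vision inputs) without touching every call site now.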