Conversation
Great to carve this out! I'm wondering how to stage these changes. Switching to the new binary format is a breaking change, and we'd need to reprocess all currently used training data. This needs to be properly timed and announced. Do we have backwards compatibility for already processed data?
I'm hoping to have the new format ready this week and make a big announcement. I'm not currently planning backward compatibility (time issue), but could if that's a necessity.
Can we convert existing data to the new format? I could work on a simple converter tool: old binary in, new binary out.
I don't really think it's worth it; might as well just redo the preparation. To help with the transition, I'm noticing that the intermediate memmap dataset I'm making in #378 will essentially support the old binary format with the updated code, so I could just keep it around for a while as a backward-compatibility backup. (Except for vision datasets, of course.)
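The memmap-backed backup mentioned above could look something like the sketch below: a dataset that memory-maps a flat binary file of token ids and slices fixed-length samples out of it. The class name, the `uint16` dtype, and the flat layout are assumptions for illustration, not the actual old binary format.

```python
import numpy as np


class MemmapTokenDataset:
    """Minimal sketch of a memmap-backed dataset over a flat binary
    file of token ids. Dtype and layout are assumptions, not the
    real format."""

    def __init__(self, path, seq_len, dtype=np.uint16):
        # mode="r" maps the file read-only without loading it into RAM.
        self.data = np.memmap(path, dtype=dtype, mode="r")
        self.seq_len = seq_len

    def __len__(self):
        # Number of full, non-overlapping sequences in the file.
        return len(self.data) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        # Copy the slice out of the memmap into a regular array.
        return np.asarray(self.data[start:start + self.seq_len])
```

Since the memmap is read lazily, keeping this around as a compatibility shim costs essentially nothing until someone actually reads from it.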
✨ Description
Part of the data rework:
- `Sample` and `Batch` constructs (placeholders for now).
- Remove the tokenizer from `GPTData`, since it's rarely used and not by the data itself. Instead, use separate tokenizers where needed (Fim, Preparator (already present), lm eval).
- `Sample` and `Batch` both use torch tensors.

Note: since this is part of a bigger set of changes, it does contain changes that don't immediately make sense but will be useful later, as well as messy temporary solutions. (See #376 for more info on where this is going.)
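As a rough illustration of what the placeholder constructs might look like, here is a minimal sketch of `Sample` and `Batch` as dataclasses. The field names and the `from_samples` helper are hypothetical; the PR stores torch tensors, but numpy stands in here so the sketch stays dependency-free.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Sample:
    # Placeholder: the real construct holds torch tensors;
    # a numpy array stands in for illustration.
    token_ids: np.ndarray  # shape (seq_len,)


@dataclass
class Batch:
    token_ids: np.ndarray  # shape (batch_size, seq_len)

    @classmethod
    def from_samples(cls, samples):
        # Stack equal-length samples along a new batch dimension.
        return cls(token_ids=np.stack([s.token_ids for s in samples]))
```

Keeping these as thin placeholders lets later PRs attach fields (loss masks, spans, vision inputs) without touching every call site now.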