Fix tokenize_and_concatenate splitting tokens across chunk boundaries#1201
Draft
brainsnog wants to merge 2 commits into TransformerLensOrg:main from
Conversation
Previously, text was split into 20 chunks by character count before tokenizing. This could cut words in half at chunk boundaries, producing token pairs that would never occur in naturally tokenized text. The fix splits chunks at whitespace boundaries instead, ensuring no word is ever divided between chunks. Chunk lengths become slightly uneven, but this has no effect on correctness — the tokenizer already handles variable-length inputs via padding. Adds a regression test that verifies all consecutive token pairs in the output also appear in a clean single-pass tokenization of the same text. Fixes TransformerLensOrg#1133
If a chunk contains no whitespace, the previous fix would advance the boundary search to the end of the string, consuming all remaining text in a single chunk. This fix bounds the lookahead to chunk_length // 10 characters. If no whitespace is found within that window, the cut falls back to the original character boundary — degrading gracefully rather than producing malformed chunks. Addresses feedback from issue TransformerLensOrg#1133.
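The combined behaviour of the two commits — cut at a whitespace boundary, with the search bounded to chunk_length // 10 characters and a fallback to the plain character boundary — can be sketched roughly like this. This is a minimal illustration, not the actual TransformerLens code; the function name and structure are assumptions:

```python
def split_on_whitespace(text: str, num_chunks: int = 20) -> list[str]:
    """Split text into roughly equal chunks, preferring whitespace cut points.

    Illustrative sketch of the approach described above, not the real
    implementation in transformer_lens/utils.py.
    """
    chunk_length = (len(text) - 1) // num_chunks + 1
    max_lookahead = chunk_length // 10  # bound the search for a space
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_length, len(text))
        if end < len(text):
            # Advance the cut until it lands on whitespace, but no further
            # than max_lookahead characters past the character boundary.
            limit = min(end + max_lookahead, len(text))
            cut = end
            while cut < limit and not text[cut].isspace():
                cut += 1
            if cut < limit:
                # Found whitespace within the window: cut there.
                end = cut
            # Otherwise fall back to the original character boundary,
            # degrading gracefully for whitespace-free text.
        chunks.append(text[start:end])
        start = end
    return chunks
```

The chunks remain contiguous (joining them reproduces the original text), so no characters are lost — only the cut positions move.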
Summary
Fixes #1133
tokenize_and_concatenate split text into 20 chunks by character count before tokenizing. This could cut words in half at chunk boundaries, producing token pairs that would never occur in naturally tokenized text.
For example, the word "Military" could be split as "...t on the M" / "ilitary Ne...", causing tokens [337, 346] (M + il) to appear consecutively — a pair that is an artefact of the chunking, not real text.
This silently corrupts datasets used in interpretability experiments.
Fix
Chunks are now split at whitespace boundaries instead of arbitrary
character positions. The loop advances the cut point forward until it
lands on a space, ensuring no word is ever divided between chunks.
Chunk lengths become slightly uneven but this has no effect on
correctness — the tokenizer already handles variable-length inputs
via padding.
Changes
transformer_lens/utils.py: replaced character-based chunk splitting with whitespace-boundary splitting in tokenize_and_concatenate
tests/unit/test_utils.py: added regression test that verifies all consecutive token pairs in the output also appear in a clean single-pass tokenization of the same text
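The idea behind the regression test can be illustrated with a toy whitespace tokenizer (the real test uses the model's tokenizer; toy_tokenize, consecutive_pairs, and the sample strings below are hypothetical stand-ins):

```python
def toy_tokenize(text: str) -> list[str]:
    # Stand-in for a real tokenizer: one token per whitespace-separated word.
    return text.split()

def consecutive_pairs(tokens: list[str]) -> set[tuple[str, str]]:
    # All adjacent token pairs in order of appearance.
    return set(zip(tokens, tokens[1:]))

text = "the military base sat on the hill near the military depot"
clean = toy_tokenize(text)

# Chunked at a whitespace boundary: every adjacent pair in the chunked
# output also occurs in the clean single-pass tokenization.
good_chunks = ["the military base sat on", " the hill near the military depot"]
good_tokens = [t for c in good_chunks for t in toy_tokenize(c)]
assert consecutive_pairs(good_tokens) <= consecutive_pairs(clean)

# Chunked mid-word: "military" splits into "m" + "ilitary", creating
# pairs like ("m", "ilitary") that never occur in the clean tokenization.
bad_chunks = ["the military base sat on the hill near the m", "ilitary depot"]
bad_tokens = [t for c in bad_chunks for t in toy_tokenize(c)]
assert not consecutive_pairs(bad_tokens) <= consecutive_pairs(clean)
```

The subset check is what makes this a regression test: any artificial pair introduced by a bad chunk boundary breaks the inclusion.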
Testing
Regression test confirms no artificial token pairs appear in the output
after the fix. All changes are backward compatible — behaviour is
identical for any text whose chunk boundaries already fall on whitespace.