
Fix tokenize_and_concatenate splitting tokens across chunk boundaries #1201

Draft

brainsnog wants to merge 2 commits into TransformerLensOrg:main from brainsnog:fix/tokenize-and-concatenate-word-boundaries

Conversation

@brainsnog

Summary

Fixes #1133

tokenize_and_concatenate split text into 20 chunks by character count
before tokenizing. This could cut words in half at chunk boundaries,
producing token pairs that would never occur in naturally tokenized text.

For example, the word "Military" could be split as "...t on the M" /
"ilitary Ne...", causing tokens [337, 346] (" M" + "il") to appear
consecutively: a pair that is an artefact of the chunking, not of the real text.

This silently corrupts datasets used in interpretability experiments.
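A minimal sketch of the failure mode (the sample text and chunk count are illustrative; the real splitting happens inside tokenize_and_concatenate):

```python
# Naive character-count chunking, as in the old implementation (sketch only).
text = "We fought on the Military Nerve front for weeks."
num_chunks = 4
chunk_length = (len(text) - 1) // num_chunks + 1  # ceiling division

chunks = [text[i * chunk_length : (i + 1) * chunk_length] for i in range(num_chunks)]
print(chunks)
# "Military" is cut in half at a chunk boundary, so tokenizing the chunks
# separately can produce token pairs that never occur in the original text.
```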

Fix

Chunks are now split at whitespace boundaries instead of arbitrary
character positions. The loop advances the cut point forward until it
lands on a space, ensuring no word is ever divided between chunks.
Chunk lengths become slightly uneven but this has no effect on
correctness — the tokenizer already handles variable-length inputs
via padding.
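The splitting logic can be sketched as follows (function name and sample text are illustrative; the actual patch is in transformer_lens/utils.py):

```python
def split_at_whitespace(text: str, num_chunks: int) -> list[str]:
    # Nominal chunk length by character count, as before.
    chunk_length = (len(text) - 1) // num_chunks + 1
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_length, len(text))
        # Advance the cut point until it lands on whitespace (or text end),
        # so no word is ever divided between chunks.
        while end < len(text) and not text[end].isspace():
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks

chunks = split_at_whitespace("We fought on the Military Nerve front for weeks.", 4)
print(chunks)  # "Military" stays whole; chunk lengths are slightly uneven
```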

Changes

  • transformer_lens/utils.py: replaced character-based chunk splitting
    with whitespace-boundary splitting in tokenize_and_concatenate
  • tests/unit/test_utils.py: added regression test that verifies all
    consecutive token pairs in the output also appear in a clean
    single-pass tokenization of the same text

Testing

Regression test confirms no artificial token pairs appear in the output
after the fix. The change is backward compatible: behaviour is identical
whenever a chunk boundary already falls on whitespace.
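The shape of the check can be sketched like this (a toy whitespace tokenizer stands in for the real Hugging Face tokenizer, and the chunker is a simplified stand-in; the actual test lives in tests/unit/test_utils.py):

```python
def pairs(tokens):
    """All consecutive token pairs in a token sequence."""
    return set(zip(tokens, tokens[1:]))

def tokenize(s):
    return s.split()  # toy stand-in for the real tokenizer

def chunk_at_whitespace(text, num_chunks):
    """Simplified version of the fixed chunking: cuts advance to whitespace."""
    chunk_length = (len(text) - 1) // num_chunks + 1
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_length, len(text))
        while end < len(text) and not text[end].isspace():
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks

text = "We fought on the Military Nerve front for weeks."
clean_tokens = tokenize(text)
chunked_tokens = [t for c in chunk_at_whitespace(text, 4) for t in tokenize(c)]

# Every consecutive pair produced by chunked tokenization must also occur
# in a clean single-pass tokenization of the same text.
assert pairs(chunked_tokens) <= pairs(clean_tokens)
```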

Previously, text was split into 20 chunks by character count before
tokenizing. This could cut words in half at chunk boundaries, producing
token pairs that would never occur in naturally tokenized text.

Fix splits chunks at whitespace boundaries instead, ensuring no word
is ever divided between chunks. Chunk lengths become slightly uneven
but this has no effect on correctness — the tokenizer already handles
variable-length inputs via padding.

Adds regression test that verifies all consecutive token pairs in the
output also appear in a clean single-pass tokenization of the same text.

Fixes TransformerLensOrg#1133

If a chunk contains no whitespace, the previous fix would advance
the boundary search to the end of the string, consuming all remaining
text in a single chunk.

Fix bounds the lookahead to chunk_length // 10 characters. If no
whitespace is found within that window, the cut falls back to the
original character boundary — degrading gracefully rather than
producing malformed chunks.
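The bounded lookahead can be sketched like so (function name is illustrative; the real change is in transformer_lens/utils.py):

```python
def find_cut(text: str, pos: int, chunk_length: int) -> int:
    """Return the cut index for a chunk ending nominally at pos.

    Scans forward at most chunk_length // 10 characters for whitespace;
    if none is found, falls back to the original character boundary.
    """
    limit = min(pos + chunk_length // 10, len(text))
    for i in range(pos, limit):
        if text[i].isspace():
            return i
    return pos  # no whitespace in the window: degrade to the character cut
```

With this bound, pathological whitespace-free text (e.g. long base64 blobs) no longer collapses into one giant chunk; the worst case is the original character-boundary behaviour.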

Addresses feedback from issue TransformerLensOrg#1133.
Successfully merging this pull request may close these issues.

[Bug Report] tokenize_and_concatenate doesn't tokenize correctly.