
Fix tokenize_and_concatenate splitting tokens across chunk boundaries #1201

Draft

brainsnog wants to merge 2 commits into TransformerLensOrg:main from brainsnog:fix/tokenize-and-concatenate-word-boundaries

Conversation

@brainsnog

Summary

Fixes #1133

tokenize_and_concatenate split text into 20 chunks by character count
before tokenizing. This could cut words in half at chunk boundaries,
producing token pairs that would never occur in naturally tokenized text.

For example, the word "Military" could be split as "...t on the M" /
"ilitary Ne...", causing tokens [337, 346] (" M" + "il") to appear
consecutively: a pair that is an artefact of the chunking, not of the real text.

This silently corrupts datasets used in interpretability experiments.
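A minimal sketch of the failure mode (the sample text and chunk count are illustrative; the real splitting happens inside tokenize_and_concatenate):

```python
# Naive character-count chunking, as in the old implementation (sketch only).
text = "We fought on the Military Nerve front for weeks."
num_chunks = 4
chunk_length = (len(text) - 1) // num_chunks + 1  # ceiling division

chunks = [text[i * chunk_length : (i + 1) * chunk_length] for i in range(num_chunks)]
print(chunks)
# "Military" is cut in half at a chunk boundary, so tokenizing the chunks
# separately can produce token pairs that never occur in the original text.
```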

Fix

Chunks are now split at whitespace boundaries instead of arbitrary
character positions. The loop advances the cut point forward until it
lands on a space, ensuring no word is ever divided between chunks.
Chunk lengths become slightly uneven but this has no effect on
correctness — the tokenizer already handles variable-length inputs
via padding.
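The splitting logic can be sketched as follows (function name and sample text are illustrative; the actual patch is in transformer_lens/utils.py):

```python
def split_at_whitespace(text: str, num_chunks: int) -> list[str]:
    # Nominal chunk length by character count, as before.
    chunk_length = (len(text) - 1) // num_chunks + 1
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_length, len(text))
        # Advance the cut point until it lands on whitespace (or text end),
        # so no word is ever divided between chunks.
        while end < len(text) and not text[end].isspace():
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks

chunks = split_at_whitespace("We fought on the Military Nerve front for weeks.", 4)
print(chunks)  # "Military" stays whole; chunk lengths are slightly uneven
```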

Changes

  • transformer_lens/utils.py: replaced character-based chunk splitting
    with whitespace-boundary splitting in tokenize_and_concatenate
  • tests/unit/test_utils.py: added regression test that verifies all
    consecutive token pairs in the output also appear in a clean
    single-pass tokenization of the same text

Testing

Regression test confirms no artificial token pairs appear in the output
after the fix. The change is backward compatible: behaviour is identical
whenever a chunk boundary already falls on whitespace.
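The shape of the check can be sketched like this (a toy whitespace tokenizer stands in for the real Hugging Face tokenizer, and the chunker is a simplified stand-in; the actual test lives in tests/unit/test_utils.py):

```python
def pairs(tokens):
    """All consecutive token pairs in a token sequence."""
    return set(zip(tokens, tokens[1:]))

def tokenize(s):
    return s.split()  # toy stand-in for the real tokenizer

def chunk_at_whitespace(text, num_chunks):
    """Simplified version of the fixed chunking: cuts advance to whitespace."""
    chunk_length = (len(text) - 1) // num_chunks + 1
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_length, len(text))
        while end < len(text) and not text[end].isspace():
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks

text = "We fought on the Military Nerve front for weeks."
clean_tokens = tokenize(text)
chunked_tokens = [t for c in chunk_at_whitespace(text, 4) for t in tokenize(c)]

# Every consecutive pair produced by chunked tokenization must also occur
# in a clean single-pass tokenization of the same text.
assert pairs(chunked_tokens) <= pairs(clean_tokens)
```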

Previously, text was split into 20 chunks by character count before
tokenizing. This could cut words in half at chunk boundaries, producing
token pairs that would never occur in naturally tokenized text.

Fix splits chunks at whitespace boundaries instead, ensuring no word
is ever divided between chunks. Chunk lengths become slightly uneven
but this has no effect on correctness — the tokenizer already handles
variable-length inputs via padding.

Adds regression test that verifies all consecutive token pairs in the
output also appear in a clean single-pass tokenization of the same text.

Fixes TransformerLensOrg#1133

If a chunk contains no whitespace, the previous fix would advance
the boundary search to the end of the string, consuming all remaining
text in a single chunk.

Fix bounds the lookahead to chunk_length // 10 characters. If no
whitespace is found within that window, the cut falls back to the
original character boundary — degrading gracefully rather than
producing malformed chunks.
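The bounded lookahead can be sketched like so (function name is illustrative; the real change is in transformer_lens/utils.py):

```python
def find_cut(text: str, pos: int, chunk_length: int) -> int:
    """Return the cut index for a chunk ending nominally at pos.

    Scans forward at most chunk_length // 10 characters for whitespace;
    if none is found, falls back to the original character boundary.
    """
    limit = min(pos + chunk_length // 10, len(text))
    for i in range(pos, limit):
        if text[i].isspace():
            return i
    return pos  # no whitespace in the window: degrade to the character cut
```

With this bound, pathological whitespace-free text (e.g. long base64 blobs) no longer collapses into one giant chunk; the worst case is the original character-boundary behaviour.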

Addresses feedback from issue TransformerLensOrg#1133.
Successfully merging this pull request may close these issues.

[Bug Report] tokenize_and_concatenate doesn't tokenize correctly.