
Make ColBERT prefix tokens optional for prefix-free models #196

Open
robro612 wants to merge 5 commits into lightonai:main from robro612:colbert-optional-prefix

Conversation

robro612 (Contributor) commented Feb 18, 2026

Summary

  • Wire the existing add_special_tokens constructor parameter to actually control prefix token insertion during tokenization
  • When add_special_tokens=False, prefix tokens ([Q] / [D]) are not prepended and max_seq_length is not reduced
  • Default behavior (add_special_tokens=True) is unchanged — [Q] / [D] prefixes are added as before
  • add_special_tokens is persisted in config_sentence_transformers.json for save/load round-trips
  • Legacy models without the key in config default to True

This enables loading models (e.g. XTR) that don't use ColBERT-style marker tokens without incorrect tokenization.
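The intended toggle can be sketched in plain Python. This is an illustrative restatement of the behavior described above, not PyLate's actual API (the function names `apply_prefix` and `effective_max_length` are hypothetical):

```python
def apply_prefix(
    text: str,
    is_query: bool,
    add_special_tokens: bool = True,
    query_prefix: str = "[Q] ",
    document_prefix: str = "[D] ",
) -> str:
    """Prepend the ColBERT marker token only when add_special_tokens is True."""
    if not add_special_tokens:
        return text
    return (query_prefix if is_query else document_prefix) + text


def effective_max_length(max_seq_length: int, add_special_tokens: bool = True) -> int:
    """Reserve one position for the marker token only when it is inserted."""
    return max_seq_length - 1 if add_special_tokens else max_seq_length
```

With `add_special_tokens=False`, the text and the sequence budget pass through untouched, which is what a prefix-free model like XTR expects.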

Test plan

  • Add test suite (tests/test_add_special_tokens.py) covering:
    • Init stores the flag correctly (default True, explicit True/False)
    • Prefix token insertion/skipping in tokenize() for queries and documents
    • max_seq_length adjustment (reduced by 1 only when prefixes are used)
    • Save/load round-trip preserves the setting
    • Legacy config without the key defaults to True

@robro612 robro612 force-pushed the colbert-optional-prefix branch 2 times, most recently from e7c540e to 0795278 Compare March 5, 2026 15:04
robro612 and others added 3 commits March 5, 2026 16:43
… default prefixes

- Restore default [Q]/[D] prefix behavior when query_prefix/document_prefix not set
- Make add_special_tokens the actual toggle for prefix token insertion in tokenize()
- Persist add_special_tokens in config for save/load round-trips
- Support loading from config (legacy models without the key default to True)
- Add comprehensive test suite for add_special_tokens behavior
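The save/load round-trip from the commit list above can be sketched as a JSON read/write with a legacy default. The key name comes from the PR description; the helper names and the flat-file layout are illustrative, not PyLate's actual persistence code:

```python
import json
from pathlib import Path


def save_flag(config_path: str, add_special_tokens: bool) -> None:
    """Persist the flag in config_sentence_transformers.json (sketch)."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config["add_special_tokens"] = add_special_tokens
    path.write_text(json.dumps(config))


def load_flag(config_path: str) -> bool:
    """Read the flag back; legacy configs without the key default to True."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    return config.get("add_special_tokens", True)
```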
@NohTow NohTow force-pushed the colbert-optional-prefix branch from fd87e46 to 27dad43 Compare March 5, 2026 16:44
```diff
-else self.query_prefix
-if self.query_prefix is not None
-else "[Q] "
+query_prefix if query_prefix is not None else self.query_prefix or "[Q] "
```
NohTow (Collaborator):

I'm not sure if you just kept the previous version or if it's intentional, but I tried rebasing and it did not fix it?
We now have a stronger check, because for example the empty-string check returns false.
We can discuss whether it was done on purpose, but since you are not using the "trick" of an empty string to disable prefixes, I'm not sure.
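The point about the empty string can be illustrated: `or` treats `""` as falsy, so `prefix or "[Q] "` collapses an empty string into the default, and the empty-string trick no longer disables prefixes (a minimal sketch, not the actual PyLate code):

```python
def resolve_prefix(prefix, default="[Q] "):
    # "or" returns the default for both None and "", so an empty
    # string can no longer be used to disable the prefix.
    return prefix or default
```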

```diff
-else self.document_prefix
-if self.document_prefix is not None
-else "[D] "
+else self.document_prefix or "[D] "
```
NohTow (Collaborator):

same here

```diff
@@ -85,7 +85,9 @@ class ColBERT(SentenceTransformer):
     document_prefix
         Prefix to add to the documents.
     add_special_tokens
```
NohTow (Collaborator):

I checked and it's indeed me who added this, even though it wasn't in ST and isn't used like that (we can forward the tokenizer param in tokenizer_kwargs).

So we can definitely use this param, but I think it would be best to rename it so people don't confuse it with the HF param. Probably "add_prefixes" would be a good name?
Also, I know we already discussed this back then, but what do we think about using an empty string to disable prefixes? No strong opinion, cc @raphaelsty

```diff
 add_special_tokens
-    Add the prefix to the inputs.
+    Whether to prepend query/document prefix tokens during tokenization. Set to False for
+    models that don't use ColBERT-style marker tokens (e.g. XTR). If None, uses the value
```
NohTow (Collaborator):

Thanks for making the description clearer!

```python
add_special_tokens
if add_special_tokens is not None
else self.add_special_tokens
if hasattr(self, "add_special_tokens")
```
NohTow (Collaborator):

Any reason not to follow the usual way?

```python
self.add_special_tokens = (
    add_special_tokens
    if add_special_tokens is not None
    else self.add_special_tokens
    if self.add_special_tokens is not None
    else True
)
```

(with possible rename ofc)
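The resolution order this chain encodes can be restated as a small standalone function (a sketch with illustrative names, mirroring the suggestion rather than PyLate's actual code): explicit constructor argument first, then the value already set (e.g. loaded from config), then the default of True.

```python
def resolve_add_special_tokens(argument, stored=None):
    # Explicit constructor argument wins, then the stored/config
    # value, then the default of True for legacy models.
    if argument is not None:
        return argument
    if stored is not None:
        return stored
    return True
```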

"""
# Set max sequence length based on whether the input is a query or document
max_length = self.query_length if is_query else self.document_length
use_prefix = self.add_special_tokens
NohTow (Collaborator):

Why define a new variable for this?
I think it won't be needed after renaming the variable correctly.

NohTow (Collaborator) commented Mar 5, 2026

Looks pretty good to me.
Main comment is renaming the variable; I am dumb for adding it this way back then (and not even plugging it in...), but it is definitely confusing (proof: I was confused o/).
Otherwise, I just wanted to raise the discussion about using a boolean rather than an empty string again, but I honestly do not have a strong feeling, so I'll let @raphaelsty decide what he prefers.

Finally, there is the question of why your tests/default definition differ from the main/usual definition; otherwise I believe we're good to go.
In my opinion, we should move more and more towards deprecating prefixes and using prompts. I added the prefixes logic to closely match the og ColBERT back then, but honestly it's a pain to maintain outside of PyLate (e.g., vLLM) and prompts would behave exactly the same.

raphaelsty (Collaborator):

I'll give it a look tomorrow; the init part of ColBERT is always tricky, I'll need dedicated focus time.

robro612 (Contributor, Author):

Bump @raphaelsty

raphaelsty (Collaborator) commented Mar 12, 2026

Stress-tested the MR on various models, LGTM, default behaviour stays the same :)

There are a few comments from Antoine that need to be resolved before merging, that's all.

