Make ColBERT prefix tokens optional for prefix-free models #196
robro612 wants to merge 5 commits into lightonai:main
Conversation
Force-pushed from e7c540e to 0795278
… default prefixes

- Restore default [Q]/[D] prefix behavior when query_prefix/document_prefix not set
- Make add_special_tokens the actual toggle for prefix token insertion in tokenize()
- Persist add_special_tokens in config for save/load round-trips
- Support loading from config (legacy models without the key default to True)
- Add comprehensive test suite for add_special_tokens behavior
Force-pushed from fd87e46 to 27dad43
pylate/models/colbert.py
Outdated
    - else self.query_prefix
    - if self.query_prefix is not None
    - else "[Q] "
    + query_prefix if query_prefix is not None else self.query_prefix or "[Q] "
I'm not sure if you just had the previous version or if it's intentional, but I tried rebasing and it did not fix it.
We now have a stronger check, because the empty string is falsy: with `self.query_prefix or "[Q] "`, an empty prefix now also falls back to the default.
We can discuss whether it was done on purpose, but since you are not using the "trick" of setting an empty string to disable prefixes, I'm not sure.
pylate/models/colbert.py
Outdated
    - else self.document_prefix
    - if self.document_prefix is not None
    - else "[D] "
    + else self.document_prefix or "[D] "
pylate/models/colbert.py
Outdated
    @@ -85,7 +85,9 @@ class ColBERT(SentenceTransformer):
        document_prefix
            Prefix to add to the documents.
        add_special_tokens
I checked and it's indeed me who added this, even though it wasn't in ST and isn't used like that there (we can forward the tokenizer param in `tokenizer_kwargs`).
So we can definitely use this param, but I think it would be best to rename it so people don't confuse it with the HF param. Probably "add_prefixes" would be a good name?
Also, I know we already discussed this back then, but what do we think about setting an empty string to disable the prefix? No strong opinion, cc @raphaelsty
pylate/models/colbert.py
Outdated
    add_special_tokens
    - Add the prefix to the inputs.
    + Whether to prepend query/document prefix tokens during tokenization. Set to False for
    + models that don't use ColBERT-style marker tokens (e.g. XTR). If None, uses the value
Thanks for making the description clearer!
pylate/models/colbert.py
Outdated
    add_special_tokens
    if add_special_tokens is not None
    else self.add_special_tokens
    if hasattr(self, "add_special_tokens")
Any reason not to follow the usual way?

    self.add_special_tokens = (
        add_special_tokens
        if add_special_tokens is not None
        else self.add_special_tokens
        if self.add_special_tokens is not None
        else True
    )

(with possible rename, of course)
pylate/models/colbert.py
Outdated
    """
    # Set max sequence length based on whether the input is a query or document
    max_length = self.query_length if is_query else self.document_length
    + use_prefix = self.add_special_tokens
Why define a new variable for this? I think it won't be needed after renaming the variable correctly.
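For context, a hedged sketch of the tokenize-time logic under discussion (simplified; `tokenize` here is a stand-in, not PyLate's actual implementation, and the 32/180 length defaults are illustrative):

```python
def tokenize(text: str, is_query: bool, add_prefixes: bool,
             query_length: int = 32, document_length: int = 180) -> tuple[str, int]:
    # Pick the length budget for the input type.
    max_length = query_length if is_query else document_length
    if add_prefixes:
        # Prepend the marker and reserve one position for it.
        text = ("[Q] " if is_query else "[D] ") + text
        max_length -= 1
    return text, max_length


print(tokenize("hello", is_query=True, add_prefixes=True))   # ('[Q] hello', 31)
print(tokenize("hello", is_query=True, add_prefixes=False))  # ('hello', 32)
```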
Looks pretty good to me. Finally, there is this question of why your tests/default definition differ from the main/usual definition; otherwise I believe we are good to go.
I'll give it a look tomorrow; the init part of ColBERT is always tricky, I'll need dedicated focus time.
Bump @raphaelsty
Stress-tested the MR on various models, LGTM, default behaviour stays the same :) There are a few comments from Antoine that need to be resolved before merging, that's all.
Summary
- `add_special_tokens` constructor parameter now actually controls prefix token insertion during tokenization
- With `add_special_tokens=False`, prefix tokens (`[Q]`/`[D]`) are not prepended and `max_seq_length` is not reduced
- Default behaviour (`add_special_tokens=True`) is unchanged: `[Q]`/`[D]` prefixes are added as before
- `add_special_tokens` is persisted in `config_sentence_transformers.json` for save/load round-trips; legacy configs without the key default to `True`

This enables loading models (e.g. XTR) that don't use ColBERT-style marker tokens without incorrect tokenization.
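The save/load round-trip can be sketched with plain JSON handling (a simplification: the real persistence goes through the `config_sentence_transformers.json` written by Sentence Transformers, and the helper names here are hypothetical):

```python
import json
import os
import tempfile


def save_config(path: str, add_special_tokens: bool) -> None:
    with open(path, "w") as f:
        json.dump({"add_special_tokens": add_special_tokens}, f)


def load_add_special_tokens(path: str) -> bool:
    with open(path) as f:
        config = json.load(f)
    # Legacy configs (models saved before this change) lack the key:
    # default to True so their behaviour is unchanged.
    return config.get("add_special_tokens", True)


with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "config_sentence_transformers.json")
    save_config(p, False)
    print(load_add_special_tokens(p))  # False survives the round-trip

    # Simulate a legacy config without the key.
    with open(p, "w") as f:
        json.dump({}, f)
    print(load_add_special_tokens(p))  # True
```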
Test plan
- New test suite (`tests/test_add_special_tokens.py`) covering:
  - `tokenize()` for queries and documents
  - `max_seq_length` adjustment (reduced by 1 only when prefixes are used)