Make ColBERT prefix tokens optional for prefix-free models #196
robro612 wants to merge 5 commits into lightonai:main
Conversation
Force-pushed from e7c540e to 0795278
… default prefixes

- Restore default [Q]/[D] prefix behavior when query_prefix/document_prefix not set
- Make add_special_tokens the actual toggle for prefix token insertion in tokenize()
- Persist add_special_tokens in config for save/load round-trips
- Support loading from config (legacy models without the key default to True)
- Add comprehensive test suite for add_special_tokens behavior
Force-pushed from fd87e46 to 27dad43
pylate/models/colbert.py
Outdated
    - else self.query_prefix
    - if self.query_prefix is not None
    - else "[Q] "
    + query_prefix if query_prefix is not None else self.query_prefix or "[Q] "
I'm not sure if you just had the previous version or if it's intentional, but I tried rebasing and it did not fix it.
We now have a stronger check, because the empty string is falsy: with `self.query_prefix or "[Q] "`, an empty prefix now also falls back to the default.
We can discuss whether it was done on purpose, but since you are not using the "trick" of setting an empty string to disable prefixes, I'm not sure.
pylate/models/colbert.py
Outdated
    - else self.document_prefix
    - if self.document_prefix is not None
    - else "[D] "
    + else self.document_prefix or "[D] "
pylate/models/colbert.py
Outdated
    @@ -85,7 +85,9 @@ class ColBERT(SentenceTransformer):
        document_prefix
            Prefix to add to the documents.
        add_special_tokens
I checked and it's indeed me who added this, even though it wasn't in ST and isn't used like that there (we can forward the tokenizer param in `tokenizer_kwargs`).
So we can definitely use this param, but I think it would be best to rename it so people don't confuse it with the HF param. Probably "add_prefixes" would be a good name?
Also, I know we already discussed this back then, but what do we think about setting an empty string to disable the prefix? No strong opinion, cc @raphaelsty
pylate/models/colbert.py
Outdated
    add_special_tokens
    - Add the prefix to the inputs.
    + Whether to prepend query/document prefix tokens during tokenization. Set to False for
    + models that don't use ColBERT-style marker tokens (e.g. XTR). If None, uses the value
Thanks for making the description clearer!
pylate/models/colbert.py
Outdated
    add_special_tokens
    if add_special_tokens is not None
    else self.add_special_tokens
    if hasattr(self, "add_special_tokens")
Any reason not to follow the usual way?

    self.add_special_tokens = (
        add_special_tokens
        if add_special_tokens is not None
        else self.add_special_tokens
        if self.add_special_tokens is not None
        else True
    )

(with possible rename, of course)
pylate/models/colbert.py
Outdated
    """
    # Set max sequence length based on whether the input is a query or document
    max_length = self.query_length if is_query else self.document_length
    + use_prefix = self.add_special_tokens
Why define a new variable for this? I think it won't be needed after renaming the variable correctly.
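For context, a hedged sketch of the tokenize-time logic under discussion (simplified; `tokenize` here is a stand-in, not PyLate's actual implementation, and the 32/180 length defaults are illustrative):

```python
def tokenize(text: str, is_query: bool, add_prefixes: bool,
             query_length: int = 32, document_length: int = 180) -> tuple[str, int]:
    # Pick the length budget for the input type.
    max_length = query_length if is_query else document_length
    if add_prefixes:
        # Prepend the marker and reserve one position for it.
        text = ("[Q] " if is_query else "[D] ") + text
        max_length -= 1
    return text, max_length


print(tokenize("hello", is_query=True, add_prefixes=True))   # ('[Q] hello', 31)
print(tokenize("hello", is_query=True, add_prefixes=False))  # ('hello', 32)
```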
Looks pretty good to me. Finally, there is this question of why your tests/default definition differ from the main/usual definition; otherwise I believe we are good to go.
I'll give it a look tomorrow; the init part of ColBERT is always tricky, I'll need dedicated focus time.
Bump @raphaelsty
Stress-tested the MR on various models, LGTM, default behaviour stays the same :) There are a few comments from Antoine that need to be resolved before merging, that's all.
Summary
- `add_special_tokens` constructor parameter now actually controls prefix token insertion during tokenization
- With `add_special_tokens=False`, prefix tokens (`[Q]`/`[D]`) are not prepended and `max_seq_length` is not reduced
- Default behaviour (`add_special_tokens=True`) is unchanged: `[Q]`/`[D]` prefixes are added as before
- `add_special_tokens` is persisted in `config_sentence_transformers.json` for save/load round-trips; legacy configs without the key default to `True`

This enables loading models (e.g. XTR) that don't use ColBERT-style marker tokens without incorrect tokenization.
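The save/load round-trip can be sketched with plain JSON handling (a simplification: the real persistence goes through the `config_sentence_transformers.json` written by Sentence Transformers, and the helper names here are hypothetical):

```python
import json
import os
import tempfile


def save_config(path: str, add_special_tokens: bool) -> None:
    with open(path, "w") as f:
        json.dump({"add_special_tokens": add_special_tokens}, f)


def load_add_special_tokens(path: str) -> bool:
    with open(path) as f:
        config = json.load(f)
    # Legacy configs (models saved before this change) lack the key:
    # default to True so their behaviour is unchanged.
    return config.get("add_special_tokens", True)


with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "config_sentence_transformers.json")
    save_config(p, False)
    print(load_add_special_tokens(p))  # False survives the round-trip

    # Simulate a legacy config without the key.
    with open(p, "w") as f:
        json.dump({}, f)
    print(load_add_special_tokens(p))  # True
```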
Test plan
- New test suite (`tests/test_add_special_tokens.py`) covering:
  - `tokenize()` for queries and documents
  - `max_seq_length` adjustment (reduced by 1 only when prefixes are used)