Commit 8d7fd48

Convert &amp; to & as a Characters token

This fixes a problem in LinkifyFilter when it is used with the Cleaner. The Cleaner sets up the tokenizer to not consume entities, so character entities end up in their own Entity tokens, and LinkifyFilter can't match links that cross token boundaries. If a URL contains an &amp;, LinkifyFilter won't match across it.

This fixes that by converting &amp; to & in the sanitizer when it's pulling out entities and putting them in separate Entity tokens. The & is emitted as a Characters token instead, which gets merged with the surrounding Characters tokens by BleachSanitizerFilter.__iter__, and the & gets converted back to &amp; in the serializer.

Fixes #422
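
The token-boundary problem described above can be illustrated with a small standalone sketch. It does not use bleach or html5lib: `split_entities`, `merge_characters`, `find_urls`, and the simplified `URL_RE` are hypothetical stand-ins invented for this illustration, and the real sanitizer and linkifier are more involved.

```python
import re

# Simplified stand-ins (assumptions): not bleach's real URL or entity matching.
URL_RE = re.compile(r'https?://[^\s<>"]+')
ENTITY_RE = re.compile(r'&(\w+);')


def split_entities(text, amp_as_characters):
    """Toy version of the sanitizer step that pulls entities into tokens."""
    tokens = []
    pos = 0
    for match in ENTITY_RE.finditer(text):
        if match.start() > pos:
            tokens.append({'type': 'Characters', 'data': text[pos:match.start()]})
        name = match.group(1)
        if amp_as_characters and name == 'amp':
            # Behavior after this commit: emit a literal & as Characters.
            tokens.append({'type': 'Characters', 'data': '&'})
        else:
            # Behavior before this commit: every entity is its own token.
            tokens.append({'type': 'Entity', 'name': name})
        pos = match.end()
    if pos < len(text):
        tokens.append({'type': 'Characters', 'data': text[pos:]})
    return tokens


def merge_characters(tokens):
    """Merge runs of adjacent Characters tokens into one token."""
    merged = []
    for token in tokens:
        if (token['type'] == 'Characters' and merged
                and merged[-1]['type'] == 'Characters'):
            merged[-1] = {'type': 'Characters',
                          'data': merged[-1]['data'] + token['data']}
        else:
            merged.append(dict(token))
    return merged


def find_urls(tokens):
    """Search for URLs inside each Characters token, never across tokens."""
    urls = []
    for token in tokens:
        if token['type'] == 'Characters':
            urls.extend(URL_RE.findall(token['data']))
    return urls


text = 'http://example.com?b=1&amp;c=2'
before = find_urls(merge_characters(split_entities(text, amp_as_characters=False)))
after = find_urls(merge_characters(split_entities(text, amp_as_characters=True)))
print(before)  # ['http://example.com?b=1']
print(after)   # ['http://example.com?b=1&c=2']
```

With the Entity token in the middle, the URL is cut off at the token boundary. Once &amp; is emitted as a Characters token, the three pieces merge into one token and the whole URL is visible to the matcher.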
1 parent 3097fd3 commit 8d7fd48

2 files changed

Lines changed: 16 additions & 1 deletion

bleach/sanitizer.py

Lines changed: 12 additions & 1 deletion
@@ -395,7 +395,18 @@ def sanitize_characters(self, token):
             if part.startswith('&'):
                 entity = html5lib_shim.match_entity(part)
                 if entity is not None:
-                    new_tokens.append({'type': 'Entity', 'name': entity})
+                    if entity == 'amp':
+                        # LinkifyFilter can't match urls across token boundaries
+                        # which is problematic with &amp; since that shows up in
+                        # querystrings all the time. This special-cases &amp;
+                        # and converts it to a & and sticks it in as a
+                        # Characters token. It'll get merged with surrounding
+                        # tokens in the BleachSanitizerfilter.__iter__ and
+                        # escaped in the serializer.
+                        new_tokens.append({'type': 'Characters', 'data': '&'})
+                    else:
+                        new_tokens.append({'type': 'Entity', 'name': entity})
+
                     # Length of the entity plus 2--one for & at the beginning
                     # and and one for ; at the end
                     remainder = part[len(entity) + 2:]
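
The code comment above relies on the serializer escaping the literal & again, so the output should be the same whether &amp; travels as an Entity token or as part of a Characters token. A minimal sketch of that round trip follows, assuming a toy `serialize` function (hypothetical, built on the standard library's `html.escape` rather than the html5lib serializer that bleach actually uses):

```python
from html import escape


def serialize(tokens):
    """Toy serializer: escape Characters data, write entities back as &name;."""
    out = []
    for token in tokens:
        if token['type'] == 'Characters':
            # quote=False escapes &, < and > but leaves quotes alone.
            out.append(escape(token['data'], quote=False))
        elif token['type'] == 'Entity':
            out.append('&' + token['name'] + ';')
    return ''.join(out)


# Token stream before this commit: &amp; sits in its own Entity token.
entity_style = [
    {'type': 'Characters', 'data': 'http://example.com?b=1'},
    {'type': 'Entity', 'name': 'amp'},
    {'type': 'Characters', 'data': 'c=2'},
]

# Token stream after this commit, once adjacent Characters tokens are merged.
characters_style = [
    {'type': 'Characters', 'data': 'http://example.com?b=1&c=2'},
]

print(serialize(entity_style))      # http://example.com?b=1&amp;c=2
print(serialize(characters_style))  # http://example.com?b=1&amp;c=2
```

Both token streams serialize to the same text, which is why the change affects only what LinkifyFilter can match, not the escaped output.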

tests/test_linkify.py

Lines changed: 4 additions & 0 deletions
@@ -694,6 +694,10 @@ def test_only_text_is_linkified(self):
         'http://example.com?b=1&c=2',
         '<a href="http://example.com?b=1&amp;c=2">http://example.com?b=1&amp;c=2</a>'
     ),
+    (
+        'http://example.com?b=1&amp;c=2',
+        '<a href="http://example.com?b=1&amp;c=2">http://example.com?b=1&amp;c=2</a>'
+    ),
     (
         'link: https://example.com/watch#anchor',
         'link: <a href="https://example.com/watch#anchor">https://example.com/watch#anchor</a>'
