Breaking change in beautifulsoup4 4.13 #79

beautifulsoup4 4.13 introduces a breaking change that affects the text processing module at `src/commoncode/text.py` ([link](https://github.com/aboutcode-org/commoncode/blob/395b971d6723294d0120a5be74963472c1375226/src/commoncode/text.py#L135-L146)); see [#4129](https://github.com/aboutcode-org/scancode-toolkit/issues/4129#issuecomment-2636503816).

Starting with beautifulsoup4 4.13, `as_unicode(s)` returns `bytes` instead of `str`, which in turn breaks `is_markup(location)` and `is_markup_text(text)` in scancode-toolkit ([markup.py](https://github.com/aboutcode-org/scancode-toolkit/blob/e795bc6e1e531a0657b7be2363ec746520a5ae64/src/textcode/markup.py#L56C1-L105C33)).

> From the [Changelog](https://git.launchpad.net/beautifulsoup/tree/CHANGELOG):
> 
> * UnicodeDammit.markup is now always a bytestring representing the
>   *original* markup (sans BOM), and UnicodeDammit.unicode_markup is
>   always the converted Unicode equivalent of the original
>   markup. Previously, UnicodeDammit.markup was treated inconsistently
>   and would often end up containing Unicode. UnicodeDammit.markup was
>   not a documented attribute, but if you were using it, you probably
>   want to switch to using .unicode_markup instead.
> 
> If `UnicodeDammit(s).unicode_markup` is used [here](https://github.com/aboutcode-org/commoncode/blob/395b971d6723294d0120a5be74963472c1375226/src/commoncode/text.py#L146) instead of `UnicodeDammit(s).markup`, a unicode string is returned.

 _Originally posted by @watschi in [#4129](https://github.com/aboutcode-org/scancode-toolkit/issues/4129#issuecomment-2636503816)_
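
The change suggested in the quoted comment could look roughly like the sketch below. It is not a copy of commoncode's code: the early return for `str` input and the lazy import are assumptions made for illustration. Only `UnicodeDammit`, `.markup`, and `.unicode_markup` come from the changelog and the comment above.

```python
def as_unicode(s):
    """
    Return `s` as a `str`, decoding `bytes` input with UnicodeDammit.
    Illustrative sketch only; the real function in commoncode may differ.
    """
    # Assumption: text that is already `str` is returned unchanged.
    if isinstance(s, str):
        return s

    # Imported lazily so the `str` path above does not depend on bs4.
    from bs4 import UnicodeDammit

    # `.unicode_markup` holds the converted Unicode text. Per the 4.13
    # changelog, `.markup` is now always the original bytestring, so it
    # should not be used where a `str` is expected.
    return UnicodeDammit(s).unicode_markup
```

Since the changelog describes `unicode_markup` as always being the converted Unicode equivalent, a change along these lines should return `str` both before and after 4.13, which is what `is_markup(location)` and `is_markup_text(text)` rely on.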
