Tokenize generate_tokens regression in CPython 3.12 #111224

@Erotemic

Bug report

Bug description:

I've noticed a regression when adding 3.12 support to xdoctest.

The following MWE has different behavior on 3.11 and 3.12.

import tokenize
lines = ['3, 4]', 'print(len(x))']
iterable = (line for line in lines if line)


def _readline():
    return next(iterable)

for t in tokenize.generate_tokens(_readline):
    print(t)

On 3.11 and earlier versions, this emits the tokens for both lines and then raises a tokenize.TokenError:

TokenInfo(type=2 (NUMBER), string='3', start=(1, 0), end=(1, 1), line='3, 4]')
TokenInfo(type=54 (OP), string=',', start=(1, 1), end=(1, 2), line='3, 4]')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 3), end=(1, 4), line='3, 4]')
TokenInfo(type=54 (OP), string=']', start=(1, 4), end=(1, 5), line='3, 4]')
TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print(len(x))')
TokenInfo(type=54 (OP), string='(', start=(2, 5), end=(2, 6), line='print(len(x))')
TokenInfo(type=1 (NAME), string='len', start=(2, 6), end=(2, 9), line='print(len(x))')
TokenInfo(type=54 (OP), string='(', start=(2, 9), end=(2, 10), line='print(len(x))')
TokenInfo(type=1 (NAME), string='x', start=(2, 10), end=(2, 11), line='print(len(x))')
TokenInfo(type=54 (OP), string=')', start=(2, 11), end=(2, 12), line='print(len(x))')
TokenInfo(type=54 (OP), string=')', start=(2, 12), end=(2, 13), line='print(len(x))')
Traceback (most recent call last):
  File "/home/joncrall/code/xdoctest/dev/tokenize_mwe.py", line 10, in <module>
    for t in tokenize.generate_tokens(_readline):
  File "/home/joncrall/.pyenv/versions/3.11.2/lib/python3.11/tokenize.py", line 525, in _tokenize
    raise TokenError("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (3, 0))

However, on 3.12, this no longer raises an error. Instead I get:

TokenInfo(type=2 (NUMBER), string='3', start=(1, 0), end=(1, 1), line='3, 4]')
TokenInfo(type=55 (OP), string=',', start=(1, 1), end=(1, 2), line='3, 4]')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 3), end=(1, 4), line='3, 4]')
TokenInfo(type=55 (OP), string=']', start=(1, 4), end=(1, 5), line='3, 4]')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 5), end=(1, 6), line='3, 4]')
TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print(len(x))')
TokenInfo(type=55 (OP), string='(', start=(2, 5), end=(2, 6), line='print(len(x))')
TokenInfo(type=1 (NAME), string='len', start=(2, 6), end=(2, 9), line='print(len(x))')
TokenInfo(type=55 (OP), string='(', start=(2, 9), end=(2, 10), line='print(len(x))')
TokenInfo(type=1 (NAME), string='x', start=(2, 10), end=(2, 11), line='print(len(x))')
TokenInfo(type=55 (OP), string=')', start=(2, 11), end=(2, 12), line='print(len(x))')
TokenInfo(type=55 (OP), string=')', start=(2, 12), end=(2, 13), line='print(len(x))')
TokenInfo(type=4 (NEWLINE), string='', start=(2, 13), end=(2, 14), line='print(len(x))')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')

This is a problem for xdoctest because it uses tokenize to determine if a statement is "balanced" (i.e. if it is part of a line continuation or not). This is the magic I use to autodetect PS1 vs PS2 lines and prevent users from needing to manually specify if a line is a continuation or not.

Looking through the release and migration notes, I don't see anything that would indicate that this new behavior was introduced intentionally, so I suspect it is a bug. I'm sorry I didn't catch this before the 3.12 release. I've been busy.

If this is not a bug but an intended change, then it should be documented (please link to the relevant section if I missed it). If there is a way to work around this so xdoctest works on 3.12.0, that would be helpful. (It's probably time some of the parsing code got a rewrite anyway.)
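
One possible direction for a workaround, sketched here under assumptions: rather than relying on tokenize.TokenError alone to signal an unbalanced statement, count bracket depth directly from the OP tokens, which both outputs above still contain. The helper name `is_balanced` is hypothetical and this is not xdoctest's actual implementation. It only tracks nesting depth (it does not check that bracket kinds match), and it treats any TokenError or SyntaxError from tokenize as "not balanced".

```python
import tokenize

_OPENERS = {"(", "[", "{"}
_CLOSERS = {")", "]", "}"}


def is_balanced(lines):
    """Hypothetical helper: True if `lines` look like complete, balanced source.

    A sketch only. It does not depend on tokenize raising TokenError for a
    stray closing bracket, which is the behavior that differs in the report.
    """
    # Make sure every line ends with a newline before feeding it to tokenize.
    source = iter(
        line if line.endswith("\n") else line + "\n" for line in lines
    )

    def _readline():
        # An empty string tells tokenize that the input is exhausted.
        return next(source, "")

    depth = 0
    try:
        for tok in tokenize.generate_tokens(_readline):
            if tok.type != tokenize.OP:
                continue
            if tok.string in _OPENERS:
                depth += 1
            elif tok.string in _CLOSERS:
                if depth == 0:
                    # Stray closer, e.g. the `]` in '3, 4]'.
                    return False
                depth -= 1
    except (tokenize.TokenError, SyntaxError):
        # Incomplete or otherwise untokenizable input counts as unbalanced.
        return False
    return depth == 0
```

With this sketch, `is_balanced(['3, 4]', 'print(len(x))'])` returns False whether or not tokenize raises, while `is_balanced(['print(len(x))'])` returns True.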

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux
