is_iso_date pattern is too liberal #14

@ferdnyc

I don't know if it's actually used anywhere, but the pattern used for is_iso_date:

https://github.com/nexB/saneyaml/blob/40e5fa7c0b6e0012452053839184e5cd29802063/src/saneyaml.py#L330

...is too liberal, because alternation (|) has lower precedence than almost all other RE syntax, so without parentheses it splits the entire pattern into two alternatives. As a result, the pattern will match either anything that starts with 19, or a date of the form 20[0-9][0-9]-[01][0-9]-[0123]?[1-9].

To alternate only 19 and 20, but not the rest of the pattern, they need to be enclosed in parentheses: (19|20)

>>> import re
>>> is_iso_date = re.compile(
... r'19|20[0-9]{2}-[0-1][0-9]-[0-3]?[1-9]').match
>>>
>>> is_iso_date('1994-01-01')
<re.Match object; span=(0, 2), match='19'>
>>> is_iso_date('2004-01-01')
<re.Match object; span=(0, 10), match='2004-01-01'>
>>> is_iso_date('2004-01')
>>> is_iso_date('193')
<re.Match object; span=(0, 2), match='19'>
>>> is_iso_date('1992-bibble')
<re.Match object; span=(0, 2), match='19'>
>>>
>>> fixed_iso_date = re.compile(
... r'(19|20)[0-9]{2}-[0-1][0-9]-[0-3]?[1-9]').match
>>>
>>> fixed_iso_date('1994-01-01')
<re.Match object; span=(0, 10), match='1994-01-01'>
>>> fixed_iso_date('2004-01-01')
<re.Match object; span=(0, 10), match='2004-01-01'>
>>> fixed_iso_date('2004-01')
>>> fixed_iso_date('1994-01')
>>> fixed_iso_date('193')
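
For anyone who wants to check the precedence claim outside the REPL, here is a minimal sketch. The names DATE_TAIL, original, explicit and grouped are only illustrative and do not exist in saneyaml. It spells out how the unparenthesized pattern is parsed, and uses a non-capturing group, (?:19|20), as one possible way to write the fix without adding a capture group.

```python
import re

# Shared tail of the pattern, copied from the current is_iso_date regex.
DATE_TAIL = r'[0-9]{2}-[0-1][0-9]-[0-3]?[1-9]'

# The pattern as written today.
original = re.compile(r'19|20' + DATE_TAIL)

# The same pattern with the implicit grouping made explicit:
# "19" alone, OR "20" followed by the rest of the date.
explicit = re.compile(r'(?:19)|(?:20' + DATE_TAIL + r')')

# One possible fix: confine the alternation to the century digits.
grouped = re.compile(r'(?:19|20)' + DATE_TAIL)

samples = ['1994-01-01', '2004-01-01', '2004-01', '193', '1992-bibble']
for text in samples:
    a, b, c = original.match(text), explicit.match(text), grouped.match(text)
    # original and explicit should always agree; grouped should differ
    # for the inputs that start with 19.
    print(text, a and a.group(), b and b.group(), c and c.group())
```

Whether to use (19|20) or (?:19|20) only matters if something reads the match groups; for a plain yes/no check they behave the same.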

...The tail end of the pattern is questionable as well. Since the last character accepts only [1-9], but the character preceding it is optional, the match is truncated for any day ending in 0, covering only the first 9 characters (out of 10). This means that, for example, 2004-01-30 matches, but only as far as 2004-01-3. Therefore, broken strings like 2004-01-3a will also match.

>>> # Continuing from above
>>> is_iso_date('2004-01-10')
<re.Match object; span=(0, 9), match='2004-01-1'>
>>> is_iso_date('2004-01-1')
<re.Match object; span=(0, 9), match='2004-01-1'>
>>> is_iso_date('2004-01-30')
<re.Match object; span=(0, 9), match='2004-01-3'>
>>> is_iso_date('2004-01-31')
<re.Match object; span=(0, 10), match='2004-01-31'>
>>> is_iso_date('2004-01-0a')
>>> is_iso_date('2004-01-1a')
<re.Match object; span=(0, 9), match='2004-01-1'>

The pattern should be:

r'(19|20)[0-9]{2}-[0-1][0-9]-[0-3][0-9]'

with a non-optional leading day digit and a final digit that accepts 0, since ISO 8601 doesn't recognize strings of the form 2004-1-1 or 2004-01-1 as valid dates; all 8 digits are required.
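
A quick sketch of the proposed pattern in use, in case it helps review. The helper name looks_like_iso_date is hypothetical, and using fullmatch instead of match is my own assumption: the current code uses .match, which would still accept trailing garbage after a well-formed date.

```python
import re

# Pattern proposed in this issue.
proposed = re.compile(r'(19|20)[0-9]{2}-[0-1][0-9]-[0-3][0-9]')


def looks_like_iso_date(text):
    """Hypothetical helper: True if the WHOLE string has the YYYY-MM-DD shape.

    fullmatch (an assumption, not what saneyaml does today) rejects
    trailing characters that .match would silently ignore.
    """
    return proposed.fullmatch(text) is not None


for text in ['1994-01-01', '2004-01-30', '2004-01-3a', '2004-01-1', '2004-1-1']:
    print(text, looks_like_iso_date(text))
```

Note this still only checks the shape of the string, so something like 2004-19-39 passes; actual calendar validation would need more than a regex.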
