GH-101362: Optimise pathlib by deferring path normalisation #101560
barneygale wants to merge 16 commits into python:main
`PurePath` now normalises and splits paths only when necessary, e.g. when
`.name` or `.parent` is accessed. The result is cached. This speeds up path
object construction by around 4x.
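
To illustrate the idea (this is a minimal sketch, not the code in this PR): a toy class that stores raw segments at construction and only splits them on first attribute access. `LazyPurePath`, `_raw_segments`, `_parts` and `parse_count` are hypothetical names, and the sketch handles relative POSIX-style paths only.

```python
from functools import cached_property


class LazyPurePath:
    """Toy path that defers normalisation (hypothetical, not pathlib's code)."""

    parse_count = 0  # counts how often the expensive split actually runs

    def __init__(self, *args):
        # Construction only stores the raw segments; nothing is parsed yet.
        self._raw_segments = args

    @cached_property
    def _parts(self):
        # Runs on first access only; cached_property stores the result
        # on the instance, so later accesses skip this method.
        type(self).parse_count += 1
        raw = "/".join(self._raw_segments)
        # Drop empty and "." components (relative paths only, no root handling).
        return tuple(p for p in raw.split("/") if p and p != ".")

    @property
    def name(self):
        return self._parts[-1] if self._parts else ""

    @property
    def parent(self):
        return type(self)(*self._parts[:-1])


p = LazyPurePath("foo//bar/./baz.txt")
print(LazyPurePath.parse_count)  # construction alone did no parsing
print(p.name)                    # first access triggers the split
print(p.name)                    # second access reuses the cached parts
print(LazyPurePath.parse_count)
```

The trade-off is the one the benchmarks in this PR explore: construction gets cheap, while the first operation that needs the parsed parts pays the parsing cost.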
`PurePath.__fspath__()` now returns an unnormalised path, which should be
transparent to filesystem APIs (else pathlib's normalisation is broken!).
This extends the earlier performance improvement to most impure `Path`
methods, and also speeds up pickling, `p.joinpath('bar')` and `p / 'bar'`.
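
A quick way to check the "transparent to filesystem APIs" claim, as a sketch: a hypothetical `RawPath` class (not part of pathlib) whose `__fspath__()` returns the string exactly as given, passed to `os.path.samefile()` and `open()` with a redundant separator and a `.` component.

```python
import os
import tempfile


class RawPath:
    """Hypothetical path-like object that hands the OS its raw string."""

    def __init__(self, raw):
        self._raw = raw

    def __fspath__(self):
        # No normalisation here: the filesystem API is left to resolve
        # doubled separators and "." components itself.
        return self._raw


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "README.rst")
    with open(target, "w") as f:
        f.write("hello")

    # Same file, spelled with a doubled separator and a "." component.
    messy = RawPath(tmp + os.sep + os.sep + "." + os.sep + "README.rst")
    same_file = os.path.samefile(messy, target)
    with open(messy) as f:
        content = f.read()

print(same_file, content)
```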
This also fixes GH-76846 and GH-85281 by unifying path constructors and
adding an `__init__()` method.
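
One way to read "unifying path constructors", sketched with hypothetical classes (`SketchPath` and `TaggedPath` are not pathlib names, and this does not reproduce the linked issues): if derived paths are built through the same public constructor as user code, a subclass's `__init__()` runs for them too.

```python
class SketchPath:
    """Hypothetical path class whose every instance goes through __init__()."""

    def __init__(self, *segments):
        self._segments = tuple(s for s in segments if s)

    def joinpath(self, *others):
        # Derived paths use the same constructor as user code,
        # so a subclass __init__() also runs for them.
        return type(self)(*self._segments, *others)

    def __truediv__(self, other):
        return self.joinpath(other)

    def __str__(self):
        return "/".join(self._segments)


class TaggedPath(SketchPath):
    """Subclass that sets up extra state in __init__()."""

    def __init__(self, *segments):
        super().__init__(*segments)
        self.tag = "custom"


child = TaggedPath("foo") / "bar"
print(type(child).__name__, str(child), child.tag)
```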
Constructing path objects is up to 4x faster with one argument:
$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'PurePath("foo/bar")'
1000000 loops, best of 5: 2.01 usec per loop # before
1000000 loops, best of 5: 495 nsec per loop # after

More than 2x faster with two arguments:
$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'PurePath("foo", "bar")'
1000000 loops, best of 5: 2.28 usec per loop # before
1000000 loops, best of 5: 1.02 usec per loop # after

~~And ~25% faster when joining arguments:~~ [edit: no longer true!]
$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath; p = PurePath("foo")' 'p.joinpath("bar")'
1000000 loops, best of 5: 1.66 usec per loop # before
1000000 loops, best of 5: 1.3 usec per loop # after

But it's 12% slower when the path needs normalization, as with:
$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'str(PurePath("foo/bar"))'
1000000 loops, best of 5: 2.96 usec per loop # before
1000000 loops, best of 5: 3.31 usec per loop # after

[edit: resolved! see comment]
$ ./python -m timeit -n 20 -s 'from pathlib import Path' 'list(Path().rglob("*"))'
20 loops, best of 5: 53.4 msec per loop # before
20 loops, best of 5: 66.5 msec per loop # after

[edit: no longer true! this can't be properly fixed until other stuff lands]
$ ./python -m timeit -n 100000 -s 'from pathlib import Path' 'Path("README.rst").read_text()'
100000 loops, best of 5: 26.1 usec per loop # before
100000 loops, best of 5: 21.2 usec per loop # after

$ ./python -m timeit -n 100000 -s 'from pathlib import Path' 'Path("README.rst").exists()'
100000 loops, best of 5: 5.45 usec per loop # before
100000 loops, best of 5: 2.97 usec per loop # after
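
For anyone wanting to re-run the pure-path measurements from a script rather than the shell, here is a rough equivalent using `timeit.repeat()`. The loop count is an assumption chosen to keep the run short, so absolute numbers will differ from the ones quoted above, and it measures whichever `pathlib` the running interpreter ships.

```python
import timeit
from pathlib import PurePath

NUMBER = 20_000  # loops per repeat; assumed smaller than the 1000000 used above

statements = {
    "one argument": 'PurePath("foo/bar")',
    "two arguments": 'PurePath("foo", "bar")',
    "str() of a path": 'str(PurePath("foo/bar"))',
}

results = {}
for label, stmt in statements.items():
    # Take the best of 5 repeats, like timeit's command-line summary.
    timings = timeit.repeat(
        stmt, globals={"PurePath": PurePath}, repeat=5, number=NUMBER
    )
    results[label] = min(timings) / NUMBER
    print(f"{label}: {results[label] * 1e9:.0f} nsec per loop")
```

Running it once on a build of `main` and once on a build of this branch gives the before/after pair.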

I've found a couple other small optimizations which are best tackled in other PRs, so I'm marking this PR as a 'draft' for now.

I've undone the change to Still a tiny bit slower than pre-PR. The rest of the speedups/slowdowns mentioned in my previous comment are still there.

The change to I think I need to solve that issue first, so I'm going to mark this PR as a draft (again!)

This PR has strayed too far from the original implementation. I'm going to abandon it. New PR here:

`PurePath` now normalises and splits paths only when necessary, e.g. when `.name` or `.parent` is accessed. The result is cached. This speeds up path object construction by around 4x. [edit: will fix separately.] `PurePath.__fspath__()` now returns an unnormalised path, which should be transparent to filesystem APIs (else pathlib's normalisation is broken!). This extends the earlier performance improvement to most impure `Path` methods, and also speeds up `p.joinpath('bar')` and `p / 'bar'`. This also fixes GH-76846 and GH-85281 by unifying path constructors and adding an `__init__()` method. [edit: will fix separately.]