gh-133968: Add PyUnicodeWriter_WriteASCII() function (#133973)
vstinner merged 8 commits into python:main
Conversation
Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().
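For illustration, here is a minimal sketch (not code from this PR) of what such a call-site change looks like; build_null_string() is a hypothetical helper, and the sketch assumes a CPython build that provides the public PyUnicodeWriter API plus the new function:

#include <Python.h>

/* Hypothetical helper, for illustration only: build the string "null"
   with the writer API. */
static PyObject *
build_null_string(void)
{
    PyUnicodeWriter *writer = PyUnicodeWriter_Create(0);
    if (writer == NULL) {
        return NULL;
    }
    /* Before this change, the literal went through the UTF-8 path:
           PyUnicodeWriter_WriteUTF8(writer, "null", 4);
       With the new function, the caller guarantees the bytes are ASCII,
       so the non-ASCII scan can be skipped. */
    if (PyUnicodeWriter_WriteASCII(writer, "null", 4) < 0) {
        PyUnicodeWriter_Discard(writer);
        return NULL;
    }
    return PyUnicodeWriter_Finish(writer);
}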
JSON benchmark: #133832 (comment)
Benchmark hidden because not significant (11): encode 100 floats, encode ascii string len=100, encode Unicode string len=100, encode 1000 integers, encode 1000 floats, encode 1000 "ascii" strings, encode ascii string len=1000, encode escaped string len=896, encode 10000 integers, encode 10000 floats, encode 10000 "ascii" strings.
The speedup of up to 1.20x when encoding booleans is interesting given that these strings are very short: "true" (4 characters) and "false" (5 characters).
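For context, the boolean case boils down to call sites like the following sketch (illustrative only, not the actual Modules/_json.c code); write_bool() is a hypothetical helper:

#include <Python.h>

/* Hypothetical helper: emit a JSON boolean literal with the new call. */
static int
write_bool(PyUnicodeWriter *writer, int value)
{
    if (value) {
        return PyUnicodeWriter_WriteASCII(writer, "true", 4);
    }
    return PyUnicodeWriter_WriteASCII(writer, "false", 5);
}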
@serhiy-storchaka: What do you think of this function?
Well, we had … But … We can add private …
Co-authored-by: Peter Bierma <[email protected]>
I don't think that it can become as fast or faster than a function which takes an ASCII string as argument. If we know that the input string is ASCII, there is no need to scan the string for non-ASCII characters, and we can take the fast path. You're right that the UTF-8 decoder is already highly optimized.
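Roughly speaking (a sketch of the idea, not CPython's actual implementation), WriteUTF8() must first establish that the input is pure ASCII before it can take the copy fast path, while WriteASCII() can trust the caller and skip this pass entirely; input_is_ascii() is a made-up name:

#include <Python.h>

/* Sketch of the validation pass that WriteUTF8() needs and WriteASCII()
   can skip (illustrative only). */
static int
input_is_ascii(const char *str, Py_ssize_t size)
{
    for (Py_ssize_t i = 0; i < size; i++) {
        if ((unsigned char)str[i] >= 0x80) {
            return 0;   /* must go through the full UTF-8 decoder */
        }
    }
    return 1;           /* a plain copy into the writer buffer is enough */
}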
In short: it's hard to beat …
Yes, although it was close, at least for moderately large strings. Could it be optimized even more? I don't know. But the decision about …
I created the capi-workgroup/decisions#65 issue.
Benchmark: On long strings (10,000 bytes), PyUnicodeWriter_WriteASCII() is up to 2x faster (1.36 us => 690 ns) than PyUnicodeWriter_WriteUTF8().

from _testcapi import PyUnicodeWriter
import pyperf
range_100 = range(100)

def bench_write_utf8(text, size):
    writer = PyUnicodeWriter(0)
    for _ in range_100:
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)

def bench_write_ascii(text, size):
    writer = PyUnicodeWriter(0)
    for _ in range_100:
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)

runner = pyperf.Runner()
sizes = (10, 100, 1_000, 10_000)

for size in sizes:
    text = b'x' * size
    runner.bench_func(f'write_utf8 size={size:,}', bench_write_utf8, text, size,
                      inner_loops=1_000)

for size in sizes:
    text = b'x' * size
    runner.bench_func(f'write_ascii size={size:,}', bench_write_ascii, text, size,
                      inner_loops=1_000)
Do we know where the bottleneck is for long strings?
WriteUTF8() has to check for non-ASCII characters: this check has a cost. That's the bottleneck.
Maybe, I don't know if it would be faster.
I tried but failed to modify the code to copy while reading (checking if the string is encoded to ASCII). The code is quite complicated. |
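The attempted optimization would, in rough terms, fuse the two passes into one: copy bytes while checking them, and fall back to the UTF-8 decoder as soon as a non-ASCII byte shows up. A minimal sketch of that idea (not the actual patch; copy_while_ascii() is a made-up name):

#include <Python.h>

/* Sketch: copy and validate in a single pass. Returns how many leading
   ASCII bytes were copied; the caller handles the remainder with the
   regular UTF-8 decoder (illustrative only). */
static Py_ssize_t
copy_while_ascii(char *dest, const char *src, Py_ssize_t size)
{
    Py_ssize_t i = 0;
    while (i < size && (unsigned char)src[i] < 0x80) {
        dest[i] = src[i];
        i++;
    }
    return i;
}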
Co-authored-by: Bénédikt Tran <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
picnixz left a comment:
I'm happy to have this function public. I always preferred using the faster versions of the writer API when I hardcoded strings, but they were private.
ZeroIntensity left a comment:
Sorry for the late review, LGTM as well.
The C API Working Group voted in favor of adding the function.
…3973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). Unrelated change to please the linter: remove an unused import in test_ctypes. Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]> (cherry picked from commit f49a07b)
GH-134974 is a backport of this pull request to the 3.14 branch.
…3973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]> (cherry picked from commit f49a07b)
…#134974) gh-133968: Add PyUnicodeWriter_WriteASCII() function (#133973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). (cherry picked from commit f49a07b) Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]>
…3973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). Unrelated change to please the linter: remove an unused import in test_ctypes. Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]>
Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().
📚 Documentation preview 📚: https://cpython-previews--133973.org.readthedocs.build/