GH-48961: [Docs][Python] Doctest fails on pandas 3.0 #48969

tadeja · 2026-01-23T16:31:47Z

Rationale for this change

See issue #48961.
pandas 3.0.0 changes the default string storage type; see https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

What changes are included in this PR?

Updating several doctest examples from string to large_string.

Are these changes tested?

Yes, locally.

Are there any user-facing changes?

No.

Closes #48961

GitHub Issue: [Docs][Python] Doctest fails on pandas 3.0 #48961

AlenkaF · 2026-01-26T10:54:40Z

Thank you @tadeja for looking into this!

One question regarding the bump of the Python version in the Sphinx & Numpydoc job: I think it would be good if the examples worked for users with either a new or an old pandas version. What if we use ... (ELLIPSIS) instead of changing the string type? Or, even better, we could avoid pandas where possible and instead create a pyarrow table directly, like so:

arrow/python/pyarrow/table.pxi

Lines 1812 to 1814 in 95a3ed4

    
    >>> table = pa.Table.from_arrays([[2, 4, 5, 100],
    ...                               ["Flamingo", "Horse", "Brittle stars", "Centipede"]],
    ...                               names=['n_legs', 'animals'])

rok · 2026-01-26T13:35:20Z

Agreed that it doesn't make sense for us to "test Pandas logic" especially in our docs. Agreed with @AlenkaF to instantiate the table in pyarrow. Using ellipsis in this case would hide the type and potentially increase user confusion :).
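
The point about ELLIPSIS hiding the type can be illustrated with the standard-library doctest module alone. This is a minimal sketch: the `doctest_passes` helper and the `animals: ...` schema lines are made up for illustration and are not the actual docstrings touched by this PR.

```python
import doctest


def doctest_passes(expected_line, actual_line, optionflags=0):
    """Return True if a one-line doctest expecting `expected_line` passes
    when the example actually prints `actual_line`."""
    source = f">>> print(schema_line)\n{expected_line}\n"
    test = doctest.DocTestParser().get_doctest(
        source, {"schema_line": actual_line}, "schema_example", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False, optionflags=optionflags)
    # Swallow the failure report; only the pass/fail count matters here.
    results = runner.run(test, out=lambda text: None)
    return results.failed == 0


# Exact expected output: passes for one type, fails for the other.
exact_string = doctest_passes("animals: string", "animals: string")
exact_large = doctest_passes("animals: string", "animals: large_string")

# With ELLIPSIS both pass, but the docstring no longer shows any type.
ellipsis_string = doctest_passes("animals: ...", "animals: string", doctest.ELLIPSIS)
ellipsis_large = doctest_passes("animals: ...", "animals: large_string", doctest.ELLIPSIS)
```

The ELLIPSIS variant is green in both cases, which is exactly why it would tell the reader nothing about the resulting type.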

AlenkaF · 2026-01-26T14:25:46Z

Note that some examples are demonstrating conversion from pandas to pyarrow so in that case we might remove the string column and only keep the integer ones?

rok

This looks good to me now. I think (hope) removing pandas from examples that don't require it streamlines things for readers.

python/pyarrow/table.pxi

rok · 2026-01-26T22:13:40Z

@github-actions crossbow submit preview-docs

github-actions · 2026-01-26T22:15:53Z

Revision: 186c0a9

Submitted crossbow builds: ursacomputing/crossbow @ actions-ca47b1b8be

Task	Status
preview-docs

tadeja · 2026-01-27T12:03:58Z

@AlenkaF this is ready for final review.

Generated doc pages: pyarrow.Table page and pyarrow.RecordBatch
Both Sphinx jobs ran and completed the doctests successfully:

AMD64 Conda Python 3.12 Sphinx Documentation
pandas 3.0.0 pypi_0 pypi
================== 385 passed, 2 skipped, 1 warning in 6.24s ===================

and

AMD64 Conda Python 3.10 Sphinx & Numpydoc
pandas 2.3.3 pypi_0 pypi
======================== 385 passed, 2 skipped in 5.63s ========================

There are two trivial cases where the pandas 2.3.3 output shows None but the pandas 3.0.0 output shows NaN:

1 4 None 2022.0
1 4 NaN 2022.0

These are best resolved by populating pa.array with a string instead: first case and second case.

Note that I additionally removed pandas and replaced it with a pyarrow table in these three examples: def itercolumns, def remove_column and def join (although these are not currently causing failures, as there is no string vs. large_string in their output).
There are more unnecessary pandas examples remaining that could be simplified in the future (num_columns, num_rows etc).

AlenkaF

Thank you for the changes, looks great!
I only added one minor suggestion. Then I am happy to merge!

docs/source/python/extending_types.rst

python/pyarrow/table.pxi

tadeja · 2026-01-28T09:35:11Z

@AlenkaF this is ready now!
(The unrelated change in the .rst file was there to get the CI test with pandas 3.0 to run. The optional CI job with pandas 3.0, Docs / AMD64 Conda Python 3.12 Sphinx Documentation, only runs when there are changes in docs/**.)

AlenkaF · 2026-01-28T14:38:29Z

Thanks! I have opened an issue for the unrelated CI failure: #49044

pitrou · 2026-01-29T16:11:56Z

So are we ok that Pandas strings convert to Arrow large strings? It's a bit less memory-efficient, WDYT @jorisvandenbossche ?

jorisvandenbossche · 2026-01-30T14:51:42Z

It is essentially pandas that decides this, although in theory the pyarrow from_pandas() functionality could override that (the new string dtype in pandas is actually using pyarrow large_string under the hood, so the conversion to pyarrow with large_string is zero copy while converting to string would involve some conversion).

The reason we went for large_string instead of string in pandas is that with string we would easily run into its size limits without more advanced automatic chunking logic in pandas or pyarrow. For example, at the time there were still issues with take() potentially failing with string ("offset overflow while concatenating arrays"); I haven't fully followed whether all those issues have been resolved now.
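
The size limit comes from the offsets: Arrow's string type stores 32-bit offsets into the character data, while large_string stores 64-bit offsets. The pure-Python sketch below models only that arithmetic. `build_offsets` is a hypothetical helper, not a pyarrow function, and it works on byte lengths so that no large buffers are actually allocated.

```python
INT32_MAX = 2**31 - 1  # largest offset representable by string
INT64_MAX = 2**63 - 1  # largest offset representable by large_string


def build_offsets(byte_lengths, max_offset):
    """Accumulate value lengths into an offsets list, failing once an
    offset no longer fits in the given offset width."""
    offsets = [0]
    for length in byte_lengths:
        next_offset = offsets[-1] + length
        if next_offset > max_offset:
            raise OverflowError("offset overflow")
        offsets.append(next_offset)
    return offsets


# Two values of 1 GiB each: the combined 2 GiB does not fit in int32 offsets.
one_gib = 2**30
try:
    build_offsets([one_gib, one_gib], INT32_MAX)
    string_fits = True
except OverflowError:
    string_fits = False

large_offsets = build_offsets([one_gib, one_gib], INT64_MAX)
```

This mirrors why combining string arrays whose data together exceeds 2 GiB needs chunking, whereas large_string can hold the result in a single array.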

See pandas-dev/pandas#56259 for context on the pandas side

jorisvandenbossche · 2026-01-31T20:52:49Z

python/pyarrow/table.pxi

        >>> df = pd.DataFrame({'year': [None, 2022, 2019, 2021],
        ...                   'n_legs': [2, 4, 5, 100],
        ...                   'animals': ["Flamingo", "Horse", None, "Centipede"]})
-        >>> table = pa.Table.from_pandas(df)
+        >>> table = pa.Table.from_arrays(
+        ...     [[None, 2022, 2019, 2021], [2, 4, 5, 100], ["Flamingo", "Horse", None, "Centipede"]],
+        ...     names=['year', 'n_legs', 'animals'])


Not too important, but for future changes: if we want to keep the docstring examples as simple as possible, I would say that keeping the dict-type creation (as it is with pd.DataFrame) is easier than the arrays. So this could also be something like:

    >>> table = pa.table({'year': [None, 2022, 2019, 2021],
    ...                   'n_legs': [2, 4, 5, 100],
    ...                   'animals': ["Flamingo", "Horse", None, "Centipede"]})

i.e. essentially just swapping pd.DataFrame(..) with pa.table(..)

(also, the df creation can now be removed from the example, because it is no longer used)

tadeja requested review from AlenkaF, raulcd and rok as code owners January 23, 2026 16:31

github-actions bot added the Component: Python and awaiting review labels Jan 23, 2026

tadeja requested review from assignUser, jonkeane and kou as code owners January 23, 2026 18:09

rok removed request for assignUser, jonkeane and kou January 26, 2026 18:06

rok approved these changes Jan 26, 2026

github-actions bot added the awaiting merge label and removed the awaiting review label Jan 26, 2026

tadeja added 5 commits January 26, 2026 19:28

Fix DocTestFailure

05918e6

Fix DocTestFailure further

cf0c175

Update job Python 3.10 Sphinx & Numpydoc to 3.11

a6731cb

Update job 3.10 Sphinx & Numpydoc to 3.11

5193837

Alternative fix w/o pandas and revert CI

f224b15

tadeja force-pushed the 48961-Doctest-fails-on-pandas-3.0 branch from 736837d to f224b15 Compare January 26, 2026 18:34

Minor docs/ update to force docs_light job

186c0a9

github-actions bot added the Component: Documentation label Jan 26, 2026

apache deleted a comment from github-actions bot Jan 26, 2026

apache deleted a comment from tadeja Jan 26, 2026

AlenkaF approved these changes Jan 28, 2026

Remove docs/ update to merge

8ef6bb1

github-actions bot removed the Component: Documentation label Jan 28, 2026

AlenkaF merged commit 811a273 into apache:main Jan 28, 2026
22 of 26 checks passed

AlenkaF removed the awaiting merge label Jan 28, 2026

pitrou added the backport-candidate label Jan 29, 2026

AlenkaF mentioned this pull request Jan 30, 2026

[Docs][Python] Doctest fails on pandas 3.0 #48961

Closed

jorisvandenbossche mentioned this pull request Jan 30, 2026

GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide #48619

Merged

jorisvandenbossche reviewed Jan 31, 2026

github-actions bot added the awaiting changes label Jan 31, 2026
