@tadeja
Contributor

@tadeja tadeja commented Jan 23, 2026

Rationale for this change

See issue #48961.
pandas 3.0.0 changed the default string storage type; see https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

What changes are included in this PR?

Updating several doctest examples from string to large_string.

Are these changes tested?

Yes, locally.

Are there any user-facing changes?

No.

Closes #48961

@AlenkaF
Member

AlenkaF commented Jan 26, 2026

Thank you @tadeja for looking into this!

One question regarding the bump of the Python version in the Sphinx & Numpydoc job. I think it would be good if the examples worked for users with either a new or an old pandas version. What if we use ... (ELLIPSIS) instead of changing the string type? Or, even better, we could avoid pandas where possible and create a pyarrow table directly, like so:

arrow/python/pyarrow/table.pxi

Lines 1812 to 1814 in 95a3ed4

>>> table = pa.Table.from_arrays([[2, 4, 5, 100],
...                               ["Flamingo", "Horse", "Brittle stars", "Centipede"]],
...                              names=['n_legs', 'animals'])

@rok
Member

rok commented Jan 26, 2026

Agreed that it doesn't make sense for us to "test Pandas logic" especially in our docs. Agreed with @AlenkaF to instantiate the table in pyarrow. Using ellipsis in this case would hide the type and potentially increase user confusion :).
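
To illustrate the trade-off described above, here is a minimal sketch that uses only the standard-library doctest module. The `schema_repr` helper is hypothetical: it merely stands in for a printed schema and is not a pyarrow API. With ELLIPSIS enabled, one expected output matches both type names, so the doctest passes on either pandas version, but a reader can no longer tell which type they will actually get.

```python
import doctest


def schema_repr(string_type):
    """Hypothetical stand-in for a printed schema; not a pyarrow API.

    >>> print(schema_repr("string"))  # doctest: +ELLIPSIS
    n_legs: int64
    animals: ...string
    >>> print(schema_repr("large_string"))  # doctest: +ELLIPSIS
    n_legs: int64
    animals: ...string
    """
    return f"n_legs: int64\nanimals: {string_type}"


# Parse the docstring into a DocTest and run it explicitly,
# so the result can be inspected instead of only printed.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    schema_repr.__doc__, {"schema_repr": schema_repr}, "schema_repr", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)  # TestResults(failed=..., attempted=...)
```

Both examples pass because `...` may match an empty string as well as `large_`, which is exactly why the expected output stops documenting the type.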

@AlenkaF
Member

AlenkaF commented Jan 26, 2026

Note that some examples are demonstrating conversion from pandas to pyarrow so in that case we might remove the string column and only keep the integer ones?

@rok rok removed request for assignUser, jonkeane and kou January 26, 2026 18:06
Member

@rok rok left a comment

This looks good to me now. I think (hope) removing pandas from examples that don't require it streamlines things for readers.

@github-actions github-actions bot added the awaiting merge label and removed the awaiting review label Jan 26, 2026
@tadeja tadeja force-pushed the 48961-Doctest-fails-on-pandas-3.0 branch from 736837d to f224b15 Compare January 26, 2026 18:34
@rok
Member

rok commented Jan 26, 2026

@github-actions crossbow submit preview-docs

@apache apache deleted a comment from github-actions bot Jan 26, 2026
@apache apache deleted a comment from tadeja Jan 26, 2026
@github-actions

Revision: 186c0a9

Submitted crossbow builds: ursacomputing/crossbow @ actions-ca47b1b8be

Task Status
preview-docs GitHub Actions

@tadeja
Contributor Author

tadeja commented Jan 27, 2026

@AlenkaF this is ready for final review.

Member

@AlenkaF AlenkaF left a comment

Thank you for the changes, looks great!
I only added one minor suggestion. Then I am happy to merge!

@tadeja
Contributor Author

tadeja commented Jan 28, 2026

@AlenkaF this is ready now!
(The unrelated change in the .rst file was needed to get the CI test with pandas 3.0 to run: the optional CI job that uses pandas 3.0, Docs / AMD64 Conda Python 3.12 Sphinx Documentation, only runs when there are changes under docs/**.)

@AlenkaF AlenkaF merged commit 811a273 into apache:main Jan 28, 2026
22 of 26 checks passed
@AlenkaF AlenkaF removed the awaiting merge label Jan 28, 2026
@AlenkaF
Member

AlenkaF commented Jan 28, 2026

Thanks! I have opened an issue for the unrelated CI failure: #49044

@pitrou
Member

pitrou commented Jan 29, 2026

So are we ok that Pandas strings convert to Arrow large strings? It's a bit less memory-efficient, WDYT @jorisvandenbossche ?

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 30, 2026

It is essentially pandas that decides this, although in theory the pyarrow from_pandas() functionality could override that (the new string dtype in pandas is actually using pyarrow large_string under the hood, so the conversion to pyarrow with large_string is zero copy while converting to string would involve some conversion).

The reason we went for large_string instead of string in pandas is that with string we would easily run into its size limits without more advanced automatic chunking logic in pandas or pyarrow. For example, at the time there were still issues with take() potentially failing with string ("offset overflow while concatenating arrays"); I haven't followed closely enough to know whether all those issues have been resolved now.

See pandas-dev/pandas#56259 for context on the pandas side
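
The memory and size-limit trade-off discussed in the last two comments can be sketched with plain arithmetic. This assumes the Arrow columnar layout, in which string stores 32-bit offsets and large_string stores 64-bit offsets, with one offset per value plus one. The helper names are hypothetical, and the sketch ignores the validity bitmap and buffer padding.

```python
INT32_MAX = 2**31 - 1  # largest offset a `string` array can store


def offsets_nbytes(n_values, offset_width):
    """Approximate size of the offsets buffer: (n_values + 1) offsets."""
    return (n_values + 1) * offset_width


def fits_in_string(total_utf8_bytes):
    """True if the character data of one array is addressable with 32-bit offsets."""
    return total_utf8_bytes <= INT32_MAX


n = 1_000_000
string_offsets = offsets_nbytes(n, 4)        # 32-bit offsets
large_string_offsets = offsets_nbytes(n, 8)  # 64-bit offsets

# Extra memory paid by large_string: 4 bytes per value (plus one offset).
extra_bytes = large_string_offsets - string_offsets
```

Under these assumptions large_string costs about 4 extra bytes per value, while a single string array is limited to just under 2 GiB of character data.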

Comment on lines 1877 to +1882
  >>> df = pd.DataFrame({'year': [None, 2022, 2019, 2021],
  ...                    'n_legs': [2, 4, 5, 100],
  ...                    'animals': ["Flamingo", "Horse", None, "Centipede"]})
- >>> table = pa.Table.from_pandas(df)
+ >>> table = pa.Table.from_arrays(
+ ...     [[None, 2022, 2019, 2021], [2, 4, 5, 100], ["Flamingo", "Horse", None, "Centipede"]],
+ ...     names=['year', 'n_legs', 'animals'])
Member

Not too important, but for future changes: if we want to keep the docstring examples as simple as possible, I would say that keeping the dict-type creation (as it is with pd.DataFrame) is easier than the arrays. So this could also be something like:

        >>> table = pa.table({'year': [None, 2022, 2019, 2021],
        ...                   'n_legs': [2, 4, 5, 100],
        ...                   'animals': ["Flamingo", "Horse", None, "Centipede"]})

i.e. essentially just swapping pd.DataFrame(..) with pa.table(..)

(also, the df creation can now be removed from the example, because it is no longer used)

@github-actions github-actions bot added the awaiting changes label Jan 31, 2026
Development

Successfully merging this pull request may close these issues.

[Docs][Python] Doctest fails on pandas 3.0

5 participants