GH-48961: [Docs][Python] Doctest fails on pandas 3.0 #48969
Thank you @tadeja for looking into this! One question regarding the bump of the Python version in Sphinx&Numpydoc job. I think it would be good if the examples worked for users with new or old pandas version. What if we use arrow/python/pyarrow/table.pxi Lines 1812 to 1814 in 95a3ed4
Agreed that it doesn't make sense for us to "test Pandas logic" especially in our docs. Agreed with @AlenkaF to instantiate the table in pyarrow. Using ellipsis in this case would hide the type and potentially increase user confusion :).
Note that some examples are demonstrating conversion from pandas to pyarrow so in that case we might remove the string column and only keep the integer ones?
rok
left a comment
This looks good to me now. I think (hope) removing pandas from examples that don't require it streamlines things for readers.
736837d to
f224b15
Compare
@github-actions crossbow submit preview-docs
Revision: 186c0a9 Submitted crossbow builds: ursacomputing/crossbow @ actions-ca47b1b8be
@AlenkaF this is ready for final review.
AlenkaF
left a comment
Thank you for the changes, looks great!
I only added one minor suggestion. Then I am happy to merge!
@AlenkaF this is ready now!
Thanks! I have opened an issue for the unrelated CI failure: #49044
So are we ok that Pandas strings convert to Arrow large strings? It's a bit less memory-efficient, WDYT @jorisvandenbossche?
It is essentially pandas that decides this, although in theory the pyarrow The reason we went for See pandas-dev/pandas#56259 for context on the pandas side
>>> df = pd.DataFrame({'year': [None, 2022, 2019, 2021],
...                    'n_legs': [2, 4, 5, 100],
...                    'animals': ["Flamingo", "Horse", None, "Centipede"]})
>>> table = pa.Table.from_pandas(df)
>>> table = pa.Table.from_arrays(
...     [[None, 2022, 2019, 2021], [2, 4, 5, 100], ["Flamingo", "Horse", None, "Centipede"]],
...     names=['year', 'n_legs', 'animals'])
Not too important, but for future changes: if we want to keep the docstring examples as simple as possible, I would say that keeping the dict-type creation (as it is with pd.DataFrame) is easier than the arrays. So this could also be something like:
>>> table = pa.table({'year': [None, 2022, 2019, 2021],
...                   'n_legs': [2, 4, 5, 100],
...                   'animals': ["Flamingo", "Horse", None, "Centipede"]})
i.e. essentially just swapping pd.DataFrame(..) with pa.table(..)
(also, the df creation can now be removed from the example, because it is no longer used)
Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default
What changes are included in this PR?
Updating several doctest examples from string to large_string.
Are these changes tested?
Yes, locally.
Are there any user-facing changes?
No.
Closes #48961