# Conflicts: # conda-environments/activitysim-dev.yml # conda-environments/github-actions-tests.yml
While I've made these updates and all the regular CI tests pass (i.e. the results look correct), I have discovered that the change to pandas 2.x incurs a significant runtime penalty when running without sharrow.

Non-sharrow test timings for pandas 1.x: (figure not preserved)

Non-sharrow test timings for pandas 2.x: (figure not preserved)

It will require some research to figure out why this is happening, and whether it can be solved relatively easily... or at all. Initial profiling suggests the problem is in
    # setting occup for access in spec expressions
    locals_dict.update({"occup": occup})
    if model_settings.sharrow_skip:
        locals_dict["disable_sharrow"] = True
My memory might be hazy here: why might we opt out of sharrow for vehicle allocation?
        t = pa.Table.from_pandas(df, preserve_index=True, columns=columns)
    except (pa.ArrowTypeError, pa.ArrowInvalid):
        # if there are object columns, try to convert them to categories
        df = df.copy()
I saw your latest comment about the significantly longer run time with this PR. I noticed you are calling copy() here, and copy() defaults to a deep copy (deep=True). I wonder if this contributed to the run time?
I don't think this is causing the problem. This code only executes in the write_tables step at the end of the model run.
Here's the problem (and solution), or at least a big chunk of it: pandas-dev/pandas#59573

Closed in favor of #932
Addresses #794.
The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:
- `Index` objects are all one class with different datatypes, instead of being different classes (e.g. there is no more `Int64Index` class).
- The `read_csv` function by default now interprets "None" as a missing value (i.e. NaN) instead of being the Python object `None`.
- The `groupby` operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in different order of rows in outputs for some operations).
- `df.join()` also potentially sorts the resulting rows differently unless an explicit `sort` argument is given.
- `Index` objects no longer can be checked as `is_monotonic` but instead need `is_monotonic_increasing`.
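
For reference, the changes listed above can be worked around roughly as in the sketch below. This is not the code from this PR, only a minimal illustration that should run on both pandas 1.x and 2.x. The table contents and the column names (`person_id`, `occup`, `mode`, `n`) are made up for the example.

```python
import io

import pandas as pd
from pandas.api.types import is_integer_dtype

# 1. Check the index dtype instead of the index class
#    (there is no Int64Index class to isinstance() against in pandas 2.x).
idx = pd.Index([1, 2, 3])
index_is_integer = is_integer_dtype(idx.dtype)

# 2. Keep the literal string "None" when reading CSV data.
#    keep_default_na=False turns off ALL default NA markers, so empty
#    cells also stay as empty strings; pass na_values=[...] to re-enable
#    the markers that are actually wanted.
csv_text = "person_id,occup\n1,None\n2,manager\n"  # hypothetical data
persons = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
occup_values = persons["occup"].tolist()

# 3. Be explicit about sorting (and about unobserved categories)
#    when grouping on a categorical column.
trips = pd.DataFrame(
    {
        "mode": pd.Categorical(
            ["walk", "drive", "walk"], categories=["drive", "walk", "bike"]
        ),
        "n": [1, 2, 3],
    }
)
totals = trips.groupby("mode", sort=False, observed=True)["n"].sum()

# 4. Be explicit about row order when joining.
left = pd.DataFrame({"a": [1, 2]}, index=[20, 10])
right = pd.DataFrame({"b": [3, 4]}, index=[10, 20])
joined = left.join(right, sort=True)  # rows ordered by the join key

# 5. Use is_monotonic_increasing (is_monotonic is gone in pandas 2.x).
index_is_sorted = idx.is_monotonic_increasing
```

Passing `sort` and `observed` explicitly means the row order of the output no longer depends on which pandas version supplies the defaults.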