parquet pushdown predicate dataset.field.isin() much slower than or '|' #36283

@daddywantssugar

Describe the bug, including details regarding any error messages, version, and platform.

Apologies if this isn't a bug, but I still find the behaviour surprising. When pushing predicates down to parquet reads, combining equality filters with the "or" | syntax returns quickly, as expected, in about 5s. The read with the equivalent isin() predicate takes over an order of magnitude longer, about 140s, which is close to the time for reading with no filter at all.


import pandas as pd
import pyarrow.dataset as ds
import s3fs
from contexttimer import Timer #pip install or use your own timer

fs = s3fs.S3FileSystem(anon=True) # doesn't actually require s3, network share will exhibit this as well
rawpath = 'nyc-taxi-test/weather_sorted.parquet' # sorted by longitude to accentuate the issue
filters = [
    (ds.field("longitude") == -10.8) | (ds.field("longitude") == -11.4), # 5s
    ds.field("longitude").isin([-11.4, -10.8]), # 143s
    
    (ds.field("longitude") == 10.2) | (ds.field("longitude") == 10.5), # 9s
    ds.field("longitude").isin([10.2, 10.5]), # 135s
    
    None, # no-filter baseline, 150s
]
for flt in filters:  # 'flt' avoids shadowing the built-in filter()
    with Timer() as t:
        with fs.open(rawpath, 'rb') as f:
            df = pd.read_parquet(f, filters=flt)
    print('time: ', t, 'size: ', len(df))

"""
time:  5.372 size:  26353
time:  143.137 size:  26353
time:  9.685 size:  59565
time:  135.809 size:  59565 
time:  153.935 size:  12736802  
"""
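
As a possible stopgap, an isin() list can be expanded into the chained "or" form that was fast in the timings above. This is only a sketch: `isin_as_or` is a hypothetical helper name, not part of pyarrow, and it assumes the `|` form keeps being pushed down the way these measurements suggest. It works on any object that supports `==` and `|`, such as the expression returned by `ds.field(...)`.

```python
from functools import reduce
import operator


def isin_as_or(field_expr, values):
    """Build (field == v1) | (field == v2) | ... from a collection of values.

    Hypothetical helper: field_expr is any object supporting == and |,
    e.g. the expression returned by pyarrow.dataset.field("longitude").
    """
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    # Combine the per-value equality expressions pairwise with |.
    return reduce(operator.or_, (field_expr == v for v in values))


# Usage with pyarrow (assumes pyarrow is installed):
# import pyarrow.dataset as ds
# flt = isin_as_or(ds.field("longitude"), [-11.4, -10.8])
# df = pd.read_parquet(f, filters=flt)
```

For long value lists this builds a deep expression, so it is only reasonable for a handful of values like the ones in this report.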

tested on Windows 10
python 3.10
pandas 2.0.2
pyarrow 12.0.1

Component(s)

Python
