Refine statistics extraction API and tests#118

Merged
NGA-TRAN merged 3 commits into NGA-TRAN:ntran/rg_stats_api from alamb:alamb/stats_api_refine
May 17, 2024

Conversation

@alamb commented May 17, 2024

Note: this PR targets another PR, apache#10537 from @NGA-TRAN, rather than main.

This PR proposes a different API than the one described in apache#10453, based on my working through the example in apache#10549. I am sorry; I should have done this first.

The major difference is that the min/max extraction is not done in a single call, but only on demand, which matches what the actual pruning predicate needs. I also think the new API has a natural way to extract column index statistics.
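To make the "on demand" shape concrete, here is a minimal sketch in plain Rust with no external crates. Every name in it (`ColumnStats`, `RowGroup`, `StatsConverter`, `row_group_mins`, `row_group_maxes`) is hypothetical and chosen for illustration; it is not the API in this PR, which works on Parquet metadata and Arrow arrays. It only shows the idea of binding a converter to one column and pulling mins or maxes separately, rather than returning both from one call.

```rust
// Hypothetical sketch: these types and names are assumptions for illustration,
// not taken from DataFusion or the parquet crate.

/// Min/max statistics recorded for one column within one row group.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
}

/// Per-row-group metadata: one optional `ColumnStats` entry per column.
struct RowGroup {
    columns: Vec<Option<ColumnStats>>,
}

/// Converter bound to a single column. Each statistic is extracted only
/// when the caller asks for it.
struct StatsConverter {
    column_index: usize,
}

impl StatsConverter {
    fn new(column_index: usize) -> Self {
        Self { column_index }
    }

    /// One entry per row group; `None` where the minimum is unknown.
    fn row_group_mins(&self, row_groups: &[RowGroup]) -> Vec<Option<i64>> {
        self.extract(row_groups, |stats| stats.min)
    }

    /// One entry per row group; `None` where the maximum is unknown.
    fn row_group_maxes(&self, row_groups: &[RowGroup]) -> Vec<Option<i64>> {
        self.extract(row_groups, |stats| stats.max)
    }

    /// Shared walk over the row groups; `pick` selects which statistic to read.
    fn extract(
        &self,
        row_groups: &[RowGroup],
        pick: impl Fn(&ColumnStats) -> Option<i64>,
    ) -> Vec<Option<i64>> {
        row_groups
            .iter()
            .map(|rg| {
                rg.columns
                    .get(self.column_index) // column may not exist in this row group
                    .and_then(|col| col.as_ref()) // statistics may be missing
                    .and_then(|stats| pick(stats))
            })
            .collect()
    }
}
```

A caller that only needs minimums (for example, to evaluate a predicate such as `x < 5`) would call `row_group_mins` alone and never pay for the maximums. Under the same assumptions, a page-level (column index) variant could plausibly follow the same per-statistic pattern.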

I actually found there is a version of this API and tests for it already here: https://github.com/apache/datafusion/blob/d2fb05ed5ba71fd0f1d440baca12897413c2a8af/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L214-L922

It turns out there was enough code to actually hook it up using the existing (production) statistics extraction code, so I did that as well. This is far from efficient, but it is a start.

If we like this API, perhaps we can complete the test coverage and then make it more efficient.

However, it is not currently exposed publicly, I don't think the tests are great (as they don't exercise a public API), and the performance is not great.

@github-actions (bot) added the core label on May 17, 2024
@NGA-TRAN merged commit 71ca4b1 into NGA-TRAN:ntran/rg_stats_api on May 17, 2024
@alamb deleted the alamb/stats_api_refine branch on May 17, 2024 16:14
