Refine statistics extraction API and tests #118
Merged
NGA-TRAN merged 3 commits into NGA-TRAN:ntran/rg_stats_api on May 17, 2024
Note this PR targets another PR apache#10537 from @NGA-TRAN rather than main
This PR proposes a different API than what is described in apache#10453, based on my working through the example in apache#10549. I am sorry; I should have done this first.
The major difference is that min/max extraction is not done in a single call but only on demand, which matches what the actual pruning predicate needs. I also think the new API has a natural way to extract column index statistics.
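To make the "on demand" idea concrete, here is a minimal, self-contained sketch. It is not the API from this PR or from DataFusion: `RowGroupStats`, `StatsConverter`, `mins`, `maxes`, and `may_contain_greater_than` are hypothetical names, and plain `i64` values stand in for Parquet statistics and Arrow arrays. It only illustrates the shape where the caller asks for mins or maxes separately, so a predicate that needs just one of them never pays for the other.

```rust
/// Hypothetical stand-in for the statistics of one column in one row group.
/// `None` means the statistic is missing for that row group.
#[derive(Debug, Clone)]
struct RowGroupStats {
    min: Option<i64>,
    max: Option<i64>,
}

/// Hypothetical converter bound to one column's row group statistics.
struct StatsConverter<'a> {
    row_groups: &'a [RowGroupStats],
}

impl<'a> StatsConverter<'a> {
    fn new(row_groups: &'a [RowGroupStats]) -> Self {
        Self { row_groups }
    }

    /// Extract only the minimum values, one entry per row group.
    fn mins(&self) -> Vec<Option<i64>> {
        self.row_groups.iter().map(|rg| rg.min).collect()
    }

    /// Extract only the maximum values, one entry per row group.
    fn maxes(&self) -> Vec<Option<i64>> {
        self.row_groups.iter().map(|rg| rg.max).collect()
    }
}

/// Pruning check for a predicate of the form `col > threshold`.
/// Only the maximums are needed, so only `maxes()` is called.
/// A row group with a missing maximum is conservatively kept.
fn may_contain_greater_than(conv: &StatsConverter<'_>, threshold: i64) -> Vec<bool> {
    conv.maxes()
        .into_iter()
        .map(|max| max.map_or(true, |m| m > threshold))
        .collect()
}
```

In this sketch, evaluating `col > 10` touches only `maxes()`; a predicate such as `col < 10` would call `mins()` instead, which is the on-demand behavior described above.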
I actually found there is a version of this API and tests for it already here: https://github.com/apache/datafusion/blob/d2fb05ed5ba71fd0f1d440baca12897413c2a8af/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L214-L922
It turns out there was enough code to actually hook it up using the existing (production) statistics extraction code, so I did that as well. This is far from efficient, but it is a start.
If we like this API, perhaps we can complete the test coverage and then make it more efficient.
However, that existing version is not currently exposed publicly, I don't think its tests are great (as it isn't a public API), and its performance is not great.