GH-36036: [C++][Python][Parquet] Implement Float16 logical type #36073
pitrou merged 37 commits into apache:main
Force-pushed from 47790ee to 2809fbf
pitrou left a comment
Thanks @benibus! Please note that the Thrift definitions will have to be synchronized from https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift once the Parquet format spec addition is accepted.
Thanks! Before moving forward on some of the suggestions, I just want to ensure that I interpreted the endianness requirement from the proposal correctly. As per the spec:
Is this in line with what I've done here (i.e. the
Yes!
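For readers following the endianness question: the spec excerpt quoted in the comment above did not survive on this page, but the Parquet format specification describes FLOAT16 as a 2-byte IEEE 754 half-precision value stored little-endian in a `FIXED_LEN_BYTE_ARRAY` of length 2. A minimal Python sketch of that byte layout follows. The helper names `encode_float16_le` and `decode_float16_le` are hypothetical and are not part of this PR or of any Arrow/Parquet API.

```python
import struct


def encode_float16_le(value: float) -> bytes:
    """Round a Python float to IEEE 754 binary16 and return its 2 little-endian bytes.

    Hypothetical helper for illustration only. struct raises OverflowError
    for finite values outside the half-precision range.
    """
    return struct.pack("<e", value)


def decode_float16_le(raw: bytes) -> float:
    """Decode 2 little-endian bytes holding an IEEE 754 binary16 value."""
    if len(raw) != 2:
        raise ValueError("a FLOAT16 value occupies exactly 2 bytes")
    return struct.unpack("<e", raw)[0]


# 1.0 has the bit pattern 0x3C00, so the low byte (0x00) is stored first.
print(encode_float16_le(1.0).hex())        # 003c
print(decode_float16_le(b"\x00\xc0"))      # -2.0
```

The low-order byte comes first, so a reader that treats the two bytes as big-endian would decode a different number.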
cpp/src/parquet/types.cc (outdated)
We may need to change thrift_internal.h for serialization of FLOAT16.
Force-pushed from 5c109d3 to 2f9ca2a
By the way:
Do you plan to tackle Arrow integration in a subsequent PR? This PR cannot be integrated before the proposal is approved, AFAICT.
Yes, my plan was to handle Arrow integration in a subsequent PR (in tandem with the Java implementation), and then move forward with the proposal. However, I think you're right that we'll need to roll those features into this PR. Just did a closer re-reading of @julienledem's comments on apache/parquet-format#184, and I believe we're going to want the full C++/Java PRs ready (but not merged) before approving the proposal - so we'll probably need to sit on this one in the meantime. That being said, I can maintain this PR until then. Plus we may need to make adjustments if new requirements come in from the Java side.
Well, if Arrow integration does not come in this PR, then at least some testing of reading/writing using non-Arrow Parquet APIs should still be added.
Just curious: is a full C++ implementation alone enough to get approval, or does a Java implementation need to be ready at the same time?
We probably want a Java implementation as well according to apache/parquet-format#184 (comment)
OK, if that is a must, I can spare some time to do this. It would be good to start after this PR is completed.
Reverted several prior changes that were accidentally pushed
Enabled construction from native floats
Removed `uint16_t` conversion operator since it doesn't behave consistently with standard floats. As a result, rolled back some of the prior changes to `random_real` used in the Parquet test utils
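The commit notes above do not spell out what the inconsistency was. One plausible reading (an assumption, not something stated in this PR) is that converting a native float to an integer yields its numeric value, whereas an implicit conversion returning the raw 16-bit pattern yields something unrelated to that value. The Python sketch below illustrates the distinction with a hypothetical `Half` class. It is not the C++ `Float16` type from this PR.

```python
import struct


class Half:
    """Hypothetical half-precision wrapper, for illustration only."""

    __slots__ = ("_bits",)

    def __init__(self, value: float):
        # Construct from a native float by rounding to IEEE 754 binary16.
        (self._bits,) = struct.unpack("<H", struct.pack("<e", value))

    @classmethod
    def from_bits(cls, bits: int) -> "Half":
        # Explicit, named constructor for a raw bit pattern.
        obj = cls.__new__(cls)
        obj._bits = bits & 0xFFFF
        return obj

    def bits(self) -> int:
        # Raw bits are only reachable through an explicit accessor.
        return self._bits

    def __float__(self) -> float:
        return struct.unpack("<e", struct.pack("<H", self._bits))[0]

    def __int__(self) -> int:
        # Numeric conversion truncates the value, as it does for float.
        return int(float(self))


h = Half(1.5)
print(int(h))          # 1, the truncated numeric value
print(hex(h.bits()))   # 0x3e00, the underlying bit pattern
```

Keeping the bit pattern behind a named accessor means a numeric conversion can never silently return `0x3e00` where `1` was expected.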
Force-pushed from a672c73 to 157e0d7
### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: #37582

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Now that the proposal is merged, would anyone be willing to take a final look and potentially merge this? I've gone ahead and rebased for good measure. I should also mention that the Go implementation was merged today, but it'll eventually need this PR's
mapleFU left a comment
LGTM
(It's a bit weird to me that fp16 can only use PLAIN, RLE_DICT and ... DELTA_BYTE_ARRAY, aha...)
Probably... I'll try to add something, assuming it wouldn't be redundant.
Yeah, I think we're going to specifically address the encodings as a follow-up since there's been some recent discussion in that area.
Update: I don't think it'd be very useful to add tests for different type lengths at the reader/writer level since it isn't even possible to construct a However, it turns out that I forgot to add the relevant tests to
LGTM, I'm waiting for a last CI check and will merge afterwards.
Also we're going to release this in parquet-2.10
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit b55d13c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.
…apache#36073)

### Rationale for this change

There is currently an active proposal to support half-float types in Parquet. For more details/discussion, see the links in this PR's accompanying issue.

### What changes are included in this PR?

This PR implements basic support for a `Float16LogicalType` in accordance with the proposed spec. More specifically, this includes:

- Changes to `parquet.thrift` and regenerated `parquet_types` files
- Basic `LogicalType` class definition, method impls, and enums
- Support for specialized comparisons and column statistics

In the interest of scope, this PR does not currently deal with Arrow integration and byte split encoding, although we will want both of these features resolved before the proposal is approved.

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

Yes

* Closes: apache#36036

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
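To illustrate why the "specialized comparisons and column statistics" item is needed: FLOAT16 values live in a `FIXED_LEN_BYTE_ARRAY`, and comparing the raw bytes does not order them numerically. The sketch below is a simplified Python model, not the C++ code from this PR. It follows the rules the Parquet format documents for writing floating-point statistics (NaNs are left out of min/max, a zero minimum is written as `-0.0`, a zero maximum as `+0.0`). The function name `float16_min_max` is hypothetical.

```python
import math
import struct


def float16_min_max(raw_values):
    """Compute (min_bytes, max_bytes) over 2-byte little-endian FLOAT16 values.

    Hypothetical helper for illustration. Returns None when no non-NaN
    value is present, in which case no min/max would be written.
    """
    lo = hi = None
    for raw in raw_values:
        (x,) = struct.unpack("<e", raw)
        if math.isnan(x):
            continue  # NaNs never enter min/max statistics
        lo = x if lo is None or x < lo else lo
        hi = x if hi is None or x > hi else hi
    if lo is None:
        return None
    if lo == 0.0:
        lo = -0.0  # a zero minimum is normalized to negative zero
    if hi == 0.0:
        hi = 0.0   # a zero maximum is normalized to positive zero
    return struct.pack("<e", lo), struct.pack("<e", hi)


values = [struct.pack("<e", v) for v in (0.0, float("nan"), 1.5, -2.0)]
lo, hi = float16_min_max(values)
print(lo.hex(), hi.hex())  # 00c0 003e

# A plain bytewise comparison would order -2.0 after 1.0:
print(struct.pack("<e", -2.0) > struct.pack("<e", 1.0))  # True
```

The last line shows the motivating problem: the encoded bytes of `-2.0` compare greater than those of `1.0`, so the default byte-array ordering cannot be reused for this logical type.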