The output files can contain uint8 and uint16 values that are illegal per the spec. For example in this file -
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 18, "_3": 10002, "_4": 10002, "_5": 10002, "_6": 10002.0, "_7": 10002.0, "_8": "100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002", "_9": -18, "_10": -10002, "_11": -10002, "_12": -10002, "_13": "10002", "_14": [50, 50, 50], "_15": 10002, "_16": 10002, "_17": [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50], "_18": 10002, "_19": 10002, "_20": 10002}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 20, "_3": 10004, "_4": 10004, "_5": 10004, "_6": 10004.0, "_7": 10004.0, "_8": "100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004", "_9": -20, "_10": -10004, "_11": -10004, "_12": -10004, "_13": "10004", "_14": [52, 52, 52], "_15": 10004, "_16": 10004, "_17": [52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52], "_18": 10004, "_19": 10004, "_20": 10004}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 24, "_3": 10008, "_4": 10008, "_5": 10008, "_6": 10008.0, "_7": 10008.0, "_8": "100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008", "_9": -24, "_10": -10008, "_11": -10008, "_12": -10008, "_13": "10008", "_14": [56, 56, 56], "_15": 10008, "_16": 10008, "_17": [56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56], "_18": 10008, "_19": 10008, "_20": 10008}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
Describe the bug, including details regarding any error messages, version, and platform.
The DataFusion Comet project's unit tests use the ExampleParquetWriter to create Parquet files - https://github.com/apache/datafusion-comet/blob/996362e78d497c02542f1e29dbb7cba3ec16f64c/spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala#L432
This was inspired by similar unit test code in Spark - https://github.com/apache/spark/blob/ece14704cc083f17689d2e0b9ab8e31cf71a7a2d/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala#L871
The output files can contain uint8 and uint16 values that are illegal per the spec. For example in this file -
alltypes_extended_plain.parquet.zip
The columns _8 and _9 are uint_8 and uint_16 values and contain illegal negative values.
Taking as an example the first value for column _8, the bit pattern written to the file is 0xffffffee, which is read back as a negative value. That is illegal for an unsigned int.
The value originates in this line - https://github.com/apache/datafusion-comet/blob/996362e78d497c02542f1e29dbb7cba3ec16f64c/spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala#L520
where a negative value is cast to a byte and then written to Parquet. The Parquet writer needs to cast correctly to a larger type before writing to the file.
The values written can be read by the Parquet-java reader but other implementations are free to return an error or null for such values which is not desirable.
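To illustrate the bit pattern described above, here is a minimal Java sketch of the difference between sign-extending and zero-extending a byte when it is widened to a 32-bit int. It assumes the offending source value is -18, which is inferred only from the reported pattern 0xffffffee. The class and method names are hypothetical and are not part of parquet-java, and zero-extension is only one possible reading of "cast correctly to a larger type".

```java
public class UnsignedWidening {
    // Plain widening sign-extends: the top 24 bits copy the byte's sign bit.
    static int signExtend(byte b) {
        return b;
    }

    // Zero-extending keeps only the low 8 bits, so the result is in [0, 255].
    static int zeroExtendByte(byte b) {
        return Byte.toUnsignedInt(b); // same as b & 0xFF
    }

    // Same idea for 16-bit values: the result is in [0, 65535].
    static int zeroExtendShort(short s) {
        return Short.toUnsignedInt(s); // same as s & 0xFFFF
    }

    public static void main(String[] args) {
        byte b = (byte) -18; // assumed input, inferred from 0xffffffee
        System.out.printf("sign-extended: 0x%08x (%d)%n", signExtend(b), signExtend(b));
        System.out.printf("zero-extended: 0x%08x (%d)%n", zeroExtendByte(b), zeroExtendByte(b));
    }
}
```

With the assumed input, the sign-extended form reproduces the reported 0xffffffee, while the zero-extended form is 0x000000ee (238), which fits in the uint_8 range.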
Component(s)
Core