Conversation
```python
self._writers[filename] = pq.ParquetWriter(
    file_handler, schema=pa.table({name: [val] for name, val in document.items()}).schema
)
```
The Document's attributes have fixed types, so I wonder if it would make more sense to pass pa.schema({"text": pa.string(), "id": pa.string(), "media": pa.struct({"type": pa.int32(), "url": pa.string(), "alt": pa.string(), "local_path": pa.string()}), "metadata": pa.string()}) as the schema.
Parquet still doesn't support unions (see apache/parquet-format#44), so we would have to work around this limitation by turning the metadata value into a string with json.dumps(metadata). Then, to make the ParquetReader compatible with this format, we would also have to attach metadata to the schema (pa.schema(fields, metadata=...)), which the reader would check and, if needed, use to deserialize (with json.loads) on the other side.
But the current solution is good enough, so this can also be addressed later.
PS: To be extra strict, the default nullability of the fields that should be non-nullable ("text", "id", etc.) in the above schema can be disabled with pa.field(name, pa_type, nullable=False).
They used to have fixed types, but now we support an adapter so that people can choose their output format (it's still a dictionary, but they can do whatever they want with the fields).
Regarding unions: does this mean that if we have different value types in metadata (let's say strings and floats), then this doesn't work?
Regarding nullability: the problem there would also be the custom user formats.
Maybe we could also have pa.RecordBatch.from_pylist([document]).schema here instead?
> They used to have fixed types but now we support an adapter so that people can choose their output format (still a dictionary, but they can do whatever they want with the fields)

We could only use the fixed schema if adapter is not specified.

> Regarding unions, does this mean if we have different value types in metadata (let's say strings and floats) then this doesn't work?

JSON supports these types, so it will work.

> maybe we could also have pa.RecordBatch.from_pylist([document]).schema here instead?

Yes, this would be cleaner indeed.
I see. I think for now we will keep the current format so that, even when people upload to the Hub directly and so on, there isn't a big JSON field.
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
* added parquet writer
* nit
* Update src/datatrove/pipeline/writers/parquet.py

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* updated test
* nit

---------

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>