ARROW-10386: [R] List column class attributes not preserved in roundtrip #9182
jonkeane wants to merge 13 commits into apache:master
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also:

@github-actions crossbow submit test-r-version-compatibility
Force-pushed from 039b59e to d18222d.

@github-actions crossbow submit test-r-version-compatibility

Revision: d18222dca3587269aa2abc27ee9033cd352351f4 Submitted crossbow builds: ursa-labs/crossbow @ actions-878

…st itself. ARROW-10386.
Force-pushed from 68d6c47 to 5649500.

r/R/schema.R
Outdated
#'
#' @section Metadata:
#'
#' Attributes from the `data.frame` are saved alongside tables so that the
"When converting a data.frame to an Arrow Table or RecordBatch, "
r/R/schema.R
Outdated
#' Modify or replace by assigning in (`sch$metadata <- new_metadata`).
#' All list elements are coerced to string.
#'
#' @section Metadata:
This section should either be called something like "R Metadata", or it should start by discussing the key-value metadata more generally.
r/R/schema.R
Outdated
#' them when pulled back into R. This metadata is separate from the schema
#' (e.g. types of the columns) which is compatible with other Arrow clients.
#' The R metadata is only read by R and is ignored by other clients (e.g.
#' pyarrow which has its own custom metadata for things like Pandas metadata).
I believe it is Pandas only that stores extra metadata, not pyarrow itself.
r/R/schema.R
Outdated
#' object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
#' This metadata can be both at the top-level of the `data.frame` (e.g.
#' `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
#' level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
According to the code, this is only true for list columns (which makes sense because regular vectors can't have attributes on elements)
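To make the point above concrete, here is a minimal base-R sketch (no arrow calls involved). The data frame and the `special_int` class are made-up names for illustration; the sketch only shows why element-level attributes can exist in a list column but not in an atomic one.

```r
# Illustrative data frame; column names and the "special_int" class are hypothetical.
df <- data.frame(col_a = 1:2)

# An atomic column stores bare values, so a single element carries no attributes.
atomic_attrs <- attributes(df$col_a[1])

# A list column holds full R objects, and each one can carry its own attributes.
df$col_b <- list(
  structure(1:3, class = "special_int"),
  structure(letters[1:2], note = "second element")
)
first_class <- class(df$col_b[[1]])
second_note <- attr(df$col_b[[2]], "note")
```

So element-level metadata is only meaningful for list columns, which is worth stating explicitly in the docs.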
r/R/schema.R
Outdated
#' level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
#' storing `haven` columns in a table and being able to faithfully re-create
#' them when pulled back into R. This metadata is separate from the schema
#' (e.g. types of the columns) which is compatible with other Arrow clients.
Suggested change:
- #' (e.g. types of the columns) which is compatible with other Arrow clients.
+ #' (column names and types), which is compatible with other Arrow clients.
r/R/schema.R
Outdated
#' (e.g. types of the columns) which is compatible with other Arrow clients.
#' The R metadata is only read by R and is ignored by other clients (e.g.
#' pyarrow which has its own custom metadata for things like Pandas metadata).
#' This metadata is stored (and can be accessed with) `table$metadata$r`.
Shouldn't say table here, we're in the Schema docs.
Suggested change:
- #' This metadata is stored (and can be accessed with) `table$metadata$r`.
+ #' This metadata is stored in `$metadata$r`.
r/R/schema.R
Outdated
#' include large amounts of metadata) you can set the option
#' `arrow.compress_metadata` to `FALSE`.
#'
#' One exception to storing all metadata: `readr`'s `problems` attribute if it
I don't think this paragraph is necessary.
r/R/schema.R
Outdated
#' pyarrow which has its own custom metadata for things like Pandas metadata).
#' This metadata is stored (and can be accessed with) `table$metadata$r`.
#'
#' This metadata is saved by serializing R's attribute list structure to a
"Since Schema metadata keys and values must be strings, ..."
r/R/schema.R
Outdated
#' serialized string. Because of this, large amounts of metadata can quickly
#' increase the size of tables (and therefore the size of tables written to
#' parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
#' size, it is first compressed before saving. To disable this compression
#' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
#' include large amounts of metadata) you can set the option
#' `arrow.compress_metadata` to `FALSE`.
Suggested change:
- #' serialized string. Because of this, large amounts of metadata can quickly
- #' increase the size of tables (and therefore the size of tables written to
- #' parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
- #' size, it is first compressed before saving. To disable this compression
- #' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
- #' include large amounts of metadata) you can set the option
- #' `arrow.compress_metadata` to `FALSE`.
+ #' string. If the serialized metadata exceeds 100Kbs in size, by default
+ #' it is compressed starting in version 3.0.0. To disable this compression
+ #' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
+ #' include large amounts of metadata), set the option
+ #' `arrow.compress_metadata` to `FALSE`. Files with compressed metadata
+ #' are readable by older versions of arrow, but the metadata is dropped.
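For readers following the thread, a rough base-R sketch of the idea in this paragraph: an attribute list has to become a single string to fit in string-only key-value metadata, and the `arrow.compress_metadata` option named above can be switched off for a session. The `attrs` list is hypothetical, and the `serialize()` call is an assumption about the general approach, not a claim about what the package does internally.

```r
# Hypothetical attribute list; the names are for illustration only.
attrs <- list(class = "special_int", note = "kept in metadata")

# Assumed approach: ASCII serialization gives a raw vector with no embedded
# NULs, so it can be converted to a single character string.
as_string <- rawToChar(serialize(attrs, NULL, ascii = TRUE))

# Byte size of the serialized string, the quantity a size threshold would look at.
n_bytes <- nchar(as_string, type = "bytes")

# The string round-trips back to the original list.
restored <- unserialize(charToRaw(as_string))

# Opt out of metadata compression for this session, then restore the old value.
old <- options(arrow.compress_metadata = FALSE)
during <- getOption("arrow.compress_metadata")
options(old)
```

Setting the option through `options()` and restoring the previous value keeps the change scoped to the code that needs files older versions can fully read.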

Do we have any backwards compat testing with this feature?

No, but I'll make a Jira + work on adding one/some

Also this deserves a NEWS bullet, including a special mention of

TBH I think we need something in this PR since we're up against the release deadline. Don't need the full spectrum of feather/parquet/compression, just pick one, and make sure that we can read a data.frame, likely with a warning about invalid metadata.

Oops, yeah I just realized I mis-read the notifications on this — I thought this had been merged already. I'll put them here and we can close the (now extraneous) ARROW-11241 |

@github-actions crossbow submit test-r-version-compatibility

Revision: a66818d Submitted crossbow builds: ursa-labs/crossbow @ actions-881

@github-actions crossbow submit test-r-version-compatibility

Revision: 306751f Submitted crossbow builds: ursa-labs/crossbow @ actions-883

@github-actions crossbow submit test-r-version-compatibility

Revision: fa0041b Submitted crossbow builds: ursa-labs/crossbow @ actions-884