Compressed CSV/JSON support #3642

Merged: alamb merged 5 commits into apache:master from Licht-T:compression-text-support on Oct 11, 2022

@Licht-T
Contributor

Licht-T commented Sep 28, 2022

Which issue does this PR close?

Closes #3641.

Rationale for this change

Explained in #3641.

What changes are included in this PR?

  • Add FileCompressionType as the text file compression type definition.
  • Add GZip/BZip2 converters for Read/Stream.
  • Add the COMPRESSION TYPE SQL token.
  • Modify the file type inference method so that it can determine a file type and its compression.
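The last bullet, inferring both a file type and its compression, can be pictured as peeling a compression suffix off the file name before matching the format extension. The following is only a minimal sketch of that idea using the standard library; the names `Compression`, `TextFormat`, and `infer_from_path` are hypothetical and are not the definitions added by this PR (which names its type `FileCompressionType`).

```rust
/// Hypothetical stand-in for a compression enum (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Gzip,
    Bzip2,
    Uncompressed,
}

/// Hypothetical stand-in for the text formats discussed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TextFormat {
    Csv,
    Json,
}

/// Infers the format and compression from a path such as
/// `partition-0.json.bz2`. Returns `None` for unrecognized extensions.
pub fn infer_from_path(path: &str) -> Option<(TextFormat, Compression)> {
    // Only the final path component matters; accept both separators.
    let name = path
        .rsplit(|c: char| c == '/' || c == '\\')
        .next()
        .unwrap_or(path);
    let lower = name.to_ascii_lowercase();

    // Step 1: strip an optional compression suffix.
    let (stem, compression) = if let Some(s) = lower.strip_suffix(".gz") {
        (s, Compression::Gzip)
    } else if let Some(s) = lower.strip_suffix(".bz2") {
        (s, Compression::Bzip2)
    } else {
        (lower.as_str(), Compression::Uncompressed)
    };

    // Step 2: match the remaining extension against the text formats.
    let format = if stem.ends_with(".csv") {
        TextFormat::Csv
    } else if stem.ends_with(".json") {
        TextFormat::Json
    } else {
        return None;
    };

    Some((format, compression))
}
```

Under these assumptions, `infer_from_path("partition-0.json.bz2")` yields `Some((TextFormat::Json, Compression::Bzip2))`, while a name such as `a.parquet` yields `None`.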

Are there any user-facing changes?

Yes.

I am not sure which of the changed items are public APIs, but probably yes.

If there are any breaking changes to public APIs, please add the api change label.

github-actions bot added the core (Core DataFusion crate), logical-expr (Logical plan and expressions), and sql (SQL Planner) labels on Sep 28, 2022
@alamb
Contributor

alamb commented Sep 28, 2022

This looks very cool @Licht-T -- thank you for the contribution. Hopefully we'll get a chance to review it in the next few days.

@Licht-T
Contributor Author

Licht-T commented Sep 29, 2022

Thanks, @alamb.

The test passes in my Windows environment but still fails on the Windows CI. I will push some debug code.

@Licht-T
Contributor Author

Licht-T commented Sep 29, 2022

I found that Url::join cannot handle a Windows drive letter. As the debug output shows, the "C:" prefix of the path is parsed as a URL scheme:

url: Url { scheme: "file", cannot_be_a_base: false, username: "", password: None, host: None, port: None, path: "/", query: None, fragment: None }, path: "C:/Users/runneradmin/AppData/Local/Temp/.tmpGAMumf/partition-0.json.bz2"
joined url: Url { scheme: "c", cannot_be_a_base: false, username: "", password: None, host: None, port: None, path: "/Users/runneradmin/AppData/Local/Temp/.tmpGAMumf/partition-0.json.bz2", query: None, fragment: None }
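The debug output above is consistent with how URL references are parsed in general: a leading run of letters followed by `:` is read as a scheme, so `C:/Users/...` becomes scheme `c` with path `/Users/...`. The following standard-library sketch models only that scheme rule (an ASCII letter, then letters, digits, `+`, `-`, or `.`, then `:`). The function name `leading_scheme` is hypothetical, and this is a simplified illustration, not the `url` crate's implementation and not the fix applied in this PR.

```rust
/// Returns the lowercased scheme if `input` starts with something that
/// matches the generic URL scheme grammar followed by ':'.
/// Simplified model for illustration; real URL parsers do much more.
pub fn leading_scheme(input: &str) -> Option<String> {
    // Everything before the first ':' is the scheme candidate.
    let (candidate, _rest) = input.split_once(':')?;

    let mut chars = candidate.chars();
    // A scheme must start with an ASCII letter.
    let first = chars.next()?;
    if !first.is_ascii_alphabetic() {
        return None;
    }
    // The remaining characters may be letters, digits, '+', '-', or '.'.
    if chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.')) {
        Some(candidate.to_ascii_lowercase())
    } else {
        None
    }
}
```

With this model, `leading_scheme("C:/Users/runneradmin/AppData/Local/Temp/.tmpGAMumf/partition-0.json.bz2")` returns `Some("c")`, matching the `scheme: "c"` in the joined URL, whereas a path with no colon returns `None`. For converting native paths, the `url` crate also offers `Url::from_file_path`, which avoids treating the drive letter as a scheme.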

@Licht-T
Contributor Author

Licht-T commented Sep 29, 2022

CI is now all green. Ready for review.

async-compression = { version = "0.3.14", features = ["bzip2", "gzip", "futures-io", "tokio"] }
async-trait = "0.1.41"
bytes = "1.1"
bzip2 = "0.4.3"
Contributor

Any ideas how many dependencies we are adding? I guess some of them are already transitive dependencies through parquet.

@alamb
Contributor

alamb commented Oct 3, 2022

I took a quick look through this PR and it looks great -- thank you @Licht-T

Can you please merge / rebase from master to resolve conflicts so that I can merge it?

Screen Shot 2022-10-03 at 9 50 46 AM

Thanks again!

@Licht-T
Contributor Author

Licht-T commented Oct 10, 2022

@alamb Merged from master.

I believe we need some documentation about this, right? I am willing to do that.

@alamb
Contributor

alamb commented Oct 11, 2022

> I believe we need some documentation about this, right? I am willing to do that.

Yes please @Licht-T -- that would be super helpful.

Perhaps https://arrow.apache.org/datafusion/user-guide/sql/ddl.html?highlight=external+table and https://arrow.apache.org/datafusion/user-guide/sql/index.html?highlight=external+table ?

If you are not able to do so, we should at least file a ticket to document the feature.

doc-comment = "0.3"
env_logger = "0.9"
fuzz-utils = { path = "fuzz-utils" }
rstest = "0.15.0"
Contributor

@Dandandan what do we think about adding this new testing library?

alamb merged commit b8a3a78 into apache:master on Oct 11, 2022
@alamb
Contributor

alamb commented Oct 11, 2022

Thanks again @Licht-T

@ursabot

ursabot commented Oct 11, 2022

Benchmark runs are scheduled for baseline = 58afdf7 and contender = b8a3a78. b8a3a78 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

janrito mentioned this pull request on Oct 11, 2022
@Bidek56

Bidek56 commented Jan 2, 2023

@Licht-T Were you able to test this change? I am trying these CSV options and it simply returns nothing on a valid .gz file.
Thanks

let csv_options = CsvReadOptions::default()
    .has_header(true)
    .file_compression_type(FileCompressionType::GZIP);
ctx.register_csv("attrib", file, csv_options).await?;

Development

Successfully merging this pull request may close these issues.

Compressed CSV/JSON Read

5 participants