Merged
29 changes: 15 additions & 14 deletions README.md
@@ -28,26 +28,26 @@ DataFusion's Python bindings can be used as an end-user tool as well as providin

## Features

- Execute queries using SQL or DataFrames against CSV, Parquet, and JSON data sources
- Queries are optimized using DataFusion's query optimizer
- Execute user-defined Python code from SQL
- Exchange data with Pandas and other DataFrame libraries that support PyArrow
- Serialize and deserialize query plans in Substrait format
- Experimental support for executing SQL queries against Polars, Pandas and cuDF
- Execute queries using SQL or DataFrames against CSV, Parquet, and JSON data sources.
- Queries are optimized using DataFusion's query optimizer.
- Execute user-defined Python code from SQL.
- Exchange data with Pandas and other DataFrame libraries that support PyArrow.
- Serialize and deserialize query plans in Substrait format.
- Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.

## Comparison with other projects

Here is a comparison with similar projects that may help understand when DataFusion might be suitable and unsuitable
Here is a comparison with similar projects that may help understand when DataFusion might be suitable and unsuitable
for your needs:

- [DuckDB](http://www.duckdb.org/) is an open source, in-process analytic database. Like DataFusion, it supports
very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is
written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than
as a library for building such database systems.
- [DuckDB](http://www.duckdb.org/) is an open source, in-process analytic database. Like DataFusion, it supports
very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is
written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than
as a library for building such database systems.

- [Polars](http://pola.rs/) is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it
is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL
support, nor as many extension points.
- [Polars](http://pola.rs/) is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it
is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL
support, nor as many extension points.

## Example Usage

@@ -110,6 +110,7 @@ See [examples](examples/README.md) for more information.

- [Executing SQL on Polars](./examples/sql-on-polars.py)
- [Executing SQL on Pandas](./examples/sql-on-pandas.py)
- [Executing SQL on cuDF](./examples/sql-on-cudf.py)

## How to install (from pip)

19 changes: 10 additions & 9 deletions examples/README.md
@@ -29,21 +29,22 @@ Here is a direct link to the file used in the examples:

### Executing Queries with DataFusion

- [Query a Parquet file using SQL](./examples/sql-parquet.py)
- [Query a Parquet file using the DataFrame API](./examples/dataframe-parquet.py)
- [Run a SQL query and store the results in a Pandas DataFrame](./examples/sql-to-pandas.py)
- [Query PyArrow Data](./examples/query-pyarrow-data.py)
- [Query a Parquet file using SQL](./sql-parquet.py)
- [Query a Parquet file using the DataFrame API](./dataframe-parquet.py)
- [Run a SQL query and store the results in a Pandas DataFrame](./sql-to-pandas.py)
- [Query PyArrow Data](./query-pyarrow-data.py)

### Running User-Defined Python Code

- [Register a Python UDF with DataFusion](./examples/python-udf.py)
- [Register a Python UDAF with DataFusion](./examples/python-udaf.py)
- [Register a Python UDF with DataFusion](./python-udf.py)
- [Register a Python UDAF with DataFusion](./python-udaf.py)

### Substrait Support

- [Serialize query plans using Substrait](./examples/substrait.py)
- [Serialize query plans using Substrait](./substrait.py)

### Executing SQL against DataFrame Libraries (Experimental)

- [Executing SQL on Polars](./examples/sql-on-polars.py)
- [Executing SQL on Pandas](./examples/sql-on-pandas.py)
- [Executing SQL on Polars](./sql-on-polars.py)
- [Executing SQL on Pandas](./sql-on-pandas.py)
- [Executing SQL on cuDF](./sql-on-cudf.py)
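
The link changes in this hunk follow from how relative Markdown links are resolved: a relative link is interpreted against the directory of the file that contains it, and `examples/README.md` already lives inside `examples/`. A minimal sketch of that resolution using Python's `posixpath` is shown here; the `resolve_link` helper is hypothetical and exists only for illustration.

```python
import posixpath


def resolve_link(containing_file: str, link: str) -> str:
    """Resolve a relative link against the directory of the file that contains it."""
    base_dir = posixpath.dirname(containing_file)
    return posixpath.normpath(posixpath.join(base_dir, link))


# Old link in examples/README.md: would point into a nested examples/examples/ directory.
old_target = resolve_link("examples/README.md", "./examples/sql-parquet.py")
# New link in examples/README.md: resolves to the script next to that README.
new_target = resolve_link("examples/README.md", "./sql-parquet.py")
# The top-level README.md sits at the repository root, so it keeps the ./examples/ prefix.
root_target = resolve_link("README.md", "./examples/sql-on-cudf.py")

print(old_target)
print(new_target)
print(root_target)
```

This is why the same script is linked as `./examples/sql-on-cudf.py` from the top-level README and as `./sql-on-cudf.py` from `examples/README.md`.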
4 changes: 1 addition & 3 deletions examples/sql-on-cudf.py
@@ -19,8 +19,6 @@


ctx = SessionContext()
ctx.register_parquet(
"taxi", "/home/jeremy/Downloads/yellow_tripdata_2021-01.parquet"
)
ctx.register_parquet("taxi", "yellow_tripdata_2021-01.parquet")
df = ctx.sql("select passenger_count from taxi")
print(df)
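
With the hard-coded home-directory path removed, the example expects the Parquet file to be found relative to the current working directory. One way to make that assumption explicit is to resolve the file against a known base directory and fail early if it is missing. The sketch below uses only `pathlib`; the `find_data_file` helper is hypothetical and is not part of this change or of DataFusion.

```python
from pathlib import Path


def find_data_file(name: str, base_dir: Path) -> Path:
    """Return the absolute path to `name` under `base_dir`, or raise if it is missing."""
    candidate = (base_dir / name).resolve()
    if not candidate.is_file():
        raise FileNotFoundError(f"{name} not found in {base_dir}; download it first")
    return candidate


# Illustrative usage: look for the taxi data file in the current working directory.
try:
    path = find_data_file("yellow_tripdata_2021-01.parquet", Path.cwd())
    print(f"would register: {path}")
except FileNotFoundError as err:
    print(err)
```

The resolved path could then be passed as `str(path)` to `ctx.register_parquet("taxi", ...)` in place of the bare file name.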