HDDS-8971. Example integration with Iceberg, Spark and Trino #5016
adoroszlai wants to merge 2 commits into apache:master
ayushtkn left a comment:
Thanks @adoroszlai, just started exploring this.
A quick question: we are using the tabulario image, which isn't official but comes from a vendor, so we won't have any control over it. Are we OK with using that?
Spark/Iceberg might have an official image; can we use that?
Hive does have an official Docker image, in case you want to explore: https://hub.docker.com/r/apache/hive
I tried Iceberg with that:
export HIVE_VERSION=4.0.0-alpha-2
docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION}
docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/'
create table ice01 (id int) stored by iceberg;
show create table ice01;
insert into ice01 values (1),(2),(3),(4);
select * from ice01;
The output of show create table ice01; shows iceberg, which confirms the table is an Iceberg table. I don't think I saw that mentioned anywhere, so maybe they configured some default.
I think you are good with a v1 table, which doesn't support deletes/updates, as in the current example in this PR (https://iceberg.apache.org/spec/#format-versioning).
Moving to v2 is pretty easy as well, it is just a table property:
create table ice02 (id int) stored by iceberg tblproperties ('format-version'='2');
So we can do it in the future as well. :-)
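As a rough sketch of what the v2 variant could look like, the snippet below only prints HiveQL for a format-version 2 table followed by a row-level delete. The table name ice02 and the hive4 container come from the commands above; the DELETE statement and the piping into beeline are illustrative assumptions that have not been run against this image.

```shell
#!/bin/sh
# Emit HiveQL for an Iceberg table created with format-version 2.
# The DELETE is an assumed example of a row-level operation, not part of this PR.
ice_v2_sql() {
  cat <<'EOF'
create table ice02 (id int) stored by iceberg tblproperties ('format-version'='2');
insert into ice02 values (1),(2),(3),(4);
delete from ice02 where id = 2;
select * from ice02;
EOF
}

# Print the statements. To try them, they could be piped into beeline
# in the hive4 container (untested assumption):
#   ice_v2_sql | docker exec -i hive4 beeline -u 'jdbc:hive2://localhost:10000/'
ice_v2_sql
```

Keeping the statements in a function makes it easy to inspect them before sending anything to HiveServer2.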
Thanks @ayushtkn for starting to review.
I found this image from Tabular at https://iceberg.apache.org/spark-quickstart/; if there were an official Apache Iceberg image, I guess they would have used that in the example. I'm open to using any other image. BTW, this is just a small experiment to help answer #4973. Spark does have official images; I will explore those.
@SaketaChalamchala please take a look, too
DESCRIBE iceberg.nyc.taxis;
INSERT INTO iceberg.nyc.taxis VALUES (2, 1000375, 7.2, 555, 'N');
SELECT * FROM iceberg.nyc.taxis;
EOF
Thanks for the patch @adoroszlai. If this is going to be an example of Trino + Iceberg, would it make sense to remove the dependency on Spark and create the table in Trino, like below?
CREATE TABLE IF NOT EXISTS iceberg.nyc.taxis
(
vendor_id bigint,
trip_id bigint,
trip_distance double,
fare_amount double,
store_and_fwd_flag varchar
)
WITH (
format = 'PARQUET',
location = 's3://warehouse/nyc/taxis');
INSERT INTO iceberg.nyc.taxis VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');
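A minimal sketch of how a Trino-only test script might wrap these statements, assuming the same heredoc style as the script under review. The function name taxis_sql and the trino service name in the comment are hypothetical. Note the comma separating the two table properties in the WITH clause. The block only prints the SQL and does not contact Trino.

```shell
#!/bin/sh
# Emit the Trino statements that create and populate the table without Spark.
# Catalog, schema and location are taken from the suggestion above.
taxis_sql() {
  cat <<'EOF'
CREATE TABLE IF NOT EXISTS iceberg.nyc.taxis
(
  vendor_id bigint,
  trip_id bigint,
  trip_distance double,
  fare_amount double,
  store_and_fwd_flag varchar
)
WITH (
  format = 'PARQUET',
  location = 's3://warehouse/nyc/taxis');
INSERT INTO iceberg.nyc.taxis VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');
SELECT * FROM iceberg.nyc.taxis;
EOF
}

# Print the statements. In the compose environment they could be piped to the
# Trino CLI (service name "trino" is an assumption, untested):
#   taxis_sql | docker compose exec -T trino trino
taxis_sql
```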
Thanks @SaketaChalamchala for the review.
Let's call it an Iceberg, Spark, Trino example instead. :) (I followed the "Spark and Iceberg Quickstart" guide for the Iceberg part.)
There's a code conflict against the latest again.
@jojochuang thanks for taking a look. Conflict has been resolved.
The PR looks good.
But I'm afraid of patent or license issues like hell. Looking at the source code for the tabulario/spark-iceberg docker image (https://github.com/tabular-io/docker-spark-iceberg/blob/main/docker-compose.yml), it includes MinIO, and MinIO is AGPL. I want to make sure this is okay.
@jojochuang we don't distribute MinIO in any way. Users running this example download the MinIO docker image from Docker Hub. But I'm fine abandoning this PR.
Hey @adoroszlai, maybe it's time to revive it given all the interest around Trino? I'll just create an epic Jira and put all Trino-related work together.
|
For Ozone S3, another integration scenario is Trino-Hive-Ozone S3. |


What changes were proposed in this pull request?
Create add-on for the ozone docker-compose environment to demonstrate integration with Iceberg and Trino.
https://issues.apache.org/jira/browse/HDDS-8971
How was this patch tested?
Added test script to verify the setup:
spark-shell (example taken from Spark and Iceberg Quickstart)
trino
CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5437385105
(Interesting part begins here.)