HDDS-8971. Example integration with Iceberg, Spark and Trino #5016

Closed
adoroszlai wants to merge 2 commits into apache:master from adoroszlai:HDDS-8971

Conversation

@adoroszlai
Contributor

What changes were proposed in this pull request?

Create add-on for ozone docker-compose environment to demonstrate integration with Iceberg and Trino.

https://issues.apache.org/jira/browse/HDDS-8971

How was this patch tested?

Added test script to verify the setup:

  • create S3 bucket in Ozone
  • create table and insert data in spark-shell (example taken from Spark and Iceberg Quickstart)
  • describe table and insert more data in trino
  • check that data/metadata is stored in Ozone
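The bucket-related steps above could be sketched roughly as follows. This is not the test script from this PR. The endpoint `http://localhost:9878` (the usual Ozone S3 Gateway port), the bucket name `warehouse`, and the `DRY_RUN` switch are assumptions made for illustration. By default the sketch only prints the AWS CLI commands instead of running them.

```shell
# Hypothetical sketch, not the PR's test script.
# Assumed values: S3 Gateway endpoint and bucket name.
S3_ENDPOINT="${S3_ENDPOINT:-http://localhost:9878}"
BUCKET="${BUCKET:-warehouse}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, anything else = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: create the S3 bucket in Ozone.
create_bucket() {
  run aws --endpoint-url "$S3_ENDPOINT" s3api create-bucket --bucket "$BUCKET"
}

# Last step: list what ended up in the bucket (data and metadata files).
list_objects() {
  run aws --endpoint-url "$S3_ENDPOINT" s3 ls --recursive "s3://$BUCKET/"
}

create_bucket
list_objects
```

Setting `DRY_RUN=0` would run the same commands against a live S3 Gateway, assuming the AWS CLI is installed and credentials are configured.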

CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5437385105
(Interesting part begins here.)

This commit does not contain secrets.
@adoroszlai adoroszlai self-assigned this Jul 2, 2023
@adoroszlai adoroszlai added the test label Jul 2, 2023
@adoroszlai adoroszlai requested a review from ayushtkn July 2, 2023 18:35
Member

@ayushtkn ayushtkn left a comment

Thanks @adoroszlai, I just started exploring this.
A quick question: we are using the tabulario image, which isn't official but comes from a vendor, so we won't have any control over it. Are we OK with using that?

Spark/Iceberg may have an official image; can we use that instead?

Hive does have official docker image, in case you want to explore: https://hub.docker.com/r/apache/hive

I tried iceberg with that

export HIVE_VERSION=4.0.0-alpha-2

docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION}

 docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/'

 create table ice01 (id int) stored by iceberg;

show create table ice01;

insert into ice01 values (1),(2),(3),(4);

select * from ice01;

The output of show create table ice01; shows iceberg, which confirms that the table is an Iceberg table. I don't think I saw this mentioned anywhere; maybe they configured some default.

Show create output:
image

select query:
image

I think you are good with v1 table which doesn't support deletes/updates as in the current example in this PR. (https://iceberg.apache.org/spec/#format-versioning)

It is pretty easy as well, just a tbl property and we are sorted for v2

create table ice02 (id int) stored by iceberg tblproperties ('format-version'='2');

so, we can do it in future as well. :-)
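For reference, the v2 variant could be run non-interactively along these lines. This is only a sketch: the container name `hive4` and the JDBC URL are taken from the commands above, while the helper names `iceberg_table_sql` and `hive_sql` and the `DRY_RUN` switch are made up for illustration. By default it only prints what it would run.

```shell
# Hypothetical helpers; container name and JDBC URL come from the commands above.
HIVE_CONTAINER="${HIVE_CONTAINER:-hive4}"
JDBC_URL="${JDBC_URL:-jdbc:hive2://localhost:10000/}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only

# Build a CREATE TABLE statement for the given table name and Iceberg format version.
iceberg_table_sql() {
  printf "create table %s (id int) stored by iceberg tblproperties ('format-version'='%s');\n" "$1" "$2"
}

# Pass a single statement to beeline inside the Hive container.
hive_sql() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "docker exec $HIVE_CONTAINER beeline -u $JDBC_URL -e $1"
  else
    docker exec "$HIVE_CONTAINER" beeline -u "$JDBC_URL" -e "$1"
  fi
}

hive_sql "$(iceberg_table_sql ice02 2)"
hive_sql "show create table ice02;"
```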

@adoroszlai
Contributor Author

Thanks @ayushtkn for starting to review.

A quick question: we are using the tabulario image, which isn't official but comes from a vendor, so we won't have any control over it. Are we OK with using that?
Spark/Iceberg may have an official image; can we use that instead?

I found this image from Tabular at https://iceberg.apache.org/spark-quickstart/ - if there was an official Apache Iceberg image, I guess they would have used that in the example. I'm open to using any other image. BTW, this is just a small experiment to help answer #4973.

Spark does have official images, will explore those.

@adoroszlai
Contributor Author

@SaketaChalamchala please take a look, too

DESCRIBE iceberg.nyc.taxis;
INSERT INTO iceberg.nyc.taxis VALUES (2, 1000375, 7.2, 555, 'N');
SELECT * FROM iceberg.nyc.taxis;
EOF
Contributor

@SaketaChalamchala SaketaChalamchala Jul 18, 2023

Thanks for the patch @adoroszlai. If this is going to be an example of Trino + Iceberg, would it make sense to remove the dependency on Spark and create the table in Trino, like below?

CREATE TABLE IF NOT EXISTS iceberg.nyc.taxis
(
    vendor_id bigint,
    trip_id bigint,
    trip_distance double,
    fare_amount double,
    store_and_fwd_flag varchar
)
WITH (
    format = 'PARQUET',
    location = 's3://warehouse/nyc/taxis'
);

INSERT INTO iceberg.nyc.taxis VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');
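A Trino-only flow like this would also need the `nyc` schema to exist before the table is created, since Spark would no longer create it. A rough sketch of how such statements could be sent from a script is below. The compose service name `trino`, the helper `trino_sql`, and the `DRY_RUN` switch are assumptions for illustration, not part of this PR. By default it only prints the commands.

```shell
# Hypothetical sketch; "trino" is an assumed docker compose service name.
TRINO_SERVICE="${TRINO_SERVICE:-trino}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only

# Run one SQL statement through the Trino CLI inside the container.
trino_sql() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "docker compose exec -T $TRINO_SERVICE trino --execute $1"
  else
    docker compose exec -T "$TRINO_SERVICE" trino --execute "$1"
  fi
}

# Make sure the schema exists, then check the table contents after the inserts.
trino_sql "CREATE SCHEMA IF NOT EXISTS iceberg.nyc"
trino_sql "SELECT count(*) FROM iceberg.nyc.taxis"
```

Depending on the catalog configuration, the schema may need an explicit location; that detail is left out here.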

Contributor Author

Thanks @SaketaChalamchala for the review.

Let's call it an Iceberg, Spark, Trino example instead. :) (I followed the "Spark and Iceberg Quickstart" guide for the Iceberg part.)

@jojochuang
Contributor

There's a code conflict again against the latest.

@adoroszlai adoroszlai changed the title HDDS-8971. Example integration with Iceberg and Trino HDDS-8971. Example integration with Iceberg, Spark and Trino Oct 11, 2023
@adoroszlai
Contributor Author

There's a code conflict again against the latest.

@jojochuang thanks for taking a look. Conflict has been resolved.

@adoroszlai adoroszlai requested a review from jojochuang October 15, 2023 16:54
Contributor

@jojochuang jojochuang left a comment

The PR looks good.

But I'm afraid of patent or license issues like hell. Looking at the source code for the tabulario/spark-iceberg docker image (https://github.com/tabular-io/docker-spark-iceberg/blob/main/docker-compose.yml)

It includes MinIO and MinIO is AGPL. I want to make sure this is okay.

@adoroszlai
Contributor Author

@jojochuang we don't distribute MinIO in any way. Users running this example download the MinIO docker image from Docker Hub.

But I'm fine abandoning this PR.

@adoroszlai adoroszlai closed this Oct 31, 2023
@jojochuang
Contributor

Hey @adoroszlai maybe it's time to revive it given all the interest around Trino? I'll just create an epic jira and put all Trino related work together.

@hli00

hli00 commented Feb 14, 2025

For Ozone S3, another integration scenario is Trino-Hive-Ozone S3.
