HDDS-8971. Example integration with Iceberg, Spark and Trino by adoroszlai · Pull Request #5016 · apache/ozone

adoroszlai · 2023-07-02T17:59:14Z

What changes were proposed in this pull request?

Create add-on for ozone docker-compose environment to demonstrate integration with Iceberg and Trino.

https://issues.apache.org/jira/browse/HDDS-8971

How was this patch tested?

Added test script to verify the setup:

- create S3 bucket in Ozone
- create table and insert data in spark-shell (example taken from Spark and Iceberg Quickstart)
- describe table and insert more data in Trino
- check that data/metadata is stored in Ozone
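
The first and last of these steps (the S3 side) could look roughly like the sketch below. It is not the script from this PR: the endpoint `http://localhost:9878`, the bucket name `warehouse`, and the `run` helper are assumptions for illustration. By default it only prints the commands; set `DRY_RUN=0` to execute them against a running environment with the AWS CLI installed.

```shell
# Assumptions (not taken from the PR): endpoint URL and bucket name.
ENDPOINT="${ENDPOINT:-http://localhost:9878}"
BUCKET="${BUCKET:-warehouse}"
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper: print the command in dry-run mode,
# otherwise execute it. Quoting is not preserved in the printed form.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: create the S3 bucket through the S3-compatible endpoint.
create_bucket() {
  run aws s3api --endpoint-url "$ENDPOINT" create-bucket --bucket "$BUCKET"
}

# Step 4: list everything under the bucket to see data and metadata files.
list_warehouse() {
  run aws s3 ls --endpoint-url "$ENDPOINT" --recursive "s3://$BUCKET/"
}

create_bucket
list_warehouse
```

The Spark and Trino steps in between are left out here because their container names and catalog settings depend on the compose files added by the PR.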

CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5437385105
(Interesting part begins here.)

This commit does not contain secrets.

ayushtkn

Thanks @adoroszlai, I just started exploring this.
A quick question: we are using the tabulario image, which isn't official but comes from a vendor, and over which we won't have any control. Are we OK using that?

Spark/Iceberg may have an official image; can we use that?

Hive does have an official Docker image, in case you want to explore: https://hub.docker.com/r/apache/hive

I tried Iceberg with that:

export HIVE_VERSION=4.0.0-alpha-2

docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION}

docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/'

create table ice01 (id int) stored by iceberg;

show create table ice01;

insert into ice01 values (1),(2),(3),(4);

select * from ice01;

The show create table ice01; statement shows iceberg, which confirms the table is an Iceberg table. I don't think I saw that mentioned anywhere; maybe those guys configured some default or so.

Show create output:

select query:

I think you are good with a v1 table, which doesn't support deletes/updates, as in the current example in this PR (https://iceberg.apache.org/spec/#format-versioning).

Moving to v2 is pretty easy as well: just a table property and we are sorted.

create table ice02 (id int) stored by iceberg tblproperties ('format-version'='2');

So we can do it in the future as well. :-)
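
A non-interactive variant of the session above, as a sketch: it assumes the `hive4` container from the `docker run` command is up, and `hive_sql` is a hypothetical helper that passes one statement to beeline with `-e`. By default it only prints the commands; set `DRY_RUN=0` to run them. The second statement is there to check whether the `format-version` property is reported for the new table.

```shell
# Assumption: container name and JDBC URL are the ones used in the commands above.
CONTAINER="${CONTAINER:-hive4}"
JDBC_URL="${JDBC_URL:-jdbc:hive2://localhost:10000/}"
DRY_RUN="${DRY_RUN:-1}"

# hive_sql (hypothetical helper): run a single statement through beeline -e,
# or just print the command line in dry-run mode.
hive_sql() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ docker exec %s beeline -u '\''%s'\'' -e "%s"\n' "$CONTAINER" "$JDBC_URL" "$1"
  else
    docker exec "$CONTAINER" beeline -u "$JDBC_URL" -e "$1"
  fi
}

hive_sql "create table ice02 (id int) stored by iceberg tblproperties ('format-version'='2');"
hive_sql "show create table ice02;"
```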

adoroszlai · 2023-07-03T18:24:52Z

Thanks @ayushtkn for starting to review.

> A quick question: we are using the tabulario image, which isn't official but comes from a vendor, and over which we won't have any control. Are we OK using that?
> Spark/Iceberg may have an official image; can we use that?

I found this image from Tabular at https://iceberg.apache.org/spark-quickstart/ - if there was an official Apache Iceberg image, I guess they would have used that in the example. I'm open to using any other image. BTW, this is just a small experiment to help answer #4973.

Spark does have official images, will explore those.

adoroszlai · 2023-07-13T18:39:46Z

@SaketaChalamchala please take a look, too

SaketaChalamchala · 2023-07-18T19:01:48Z

hadoop-ozone/dist/src/main/compose/ozone/test-iceberg.sh

+DESCRIBE iceberg.nyc.taxis;
+INSERT INTO iceberg.nyc.taxis VALUES (2, 1000375, 7.2, 555, 'N');
+SELECT * FROM iceberg.nyc.taxis;
+EOF


Thanks for the patch @adoroszlai. If this is going to be an example of Trino + Iceberg, would it make sense to remove the dependency on Spark and create the table in Trino like below?

CREATE TABLE IF NOT EXISTS iceberg.nyc.taxis (
    vendor_id bigint,
    trip_id bigint,
    trip_distance double,
    fare_amount double,
    store_and_fwd_flag varchar
) WITH (
    format = 'PARQUET',
    location = 's3://warehouse/nyc/taxis'
);

INSERT INTO iceberg.nyc.taxis VALUES
    (1, 1000371, 1.8, 15.32, 'N'),
    (2, 1000372, 2.5, 22.15, 'N'),
    (2, 1000373, 0.9, 9.01, 'N'),
    (1, 1000374, 8.4, 42.13, 'Y');

adoroszlai · 2023-10-11

Thanks @SaketaChalamchala for the review.

Let's call it an Iceberg, Spark, Trino example instead. :) (I followed the "Spark and Iceberg Quickstart" guide for the Iceberg part.)

jojochuang · 2023-09-25T17:20:14Z

There's a code conflict against the latest again.

adoroszlai · 2023-10-15T16:54:23Z

> There's a code conflict against the latest again.

@jojochuang thanks for taking a look. Conflict has been resolved.

jojochuang

The PR looks good.

But I'm afraid of patent or license issues like hell. Looking at the source for the tabulario/spark-iceberg Docker image (https://github.com/tabular-io/docker-spark-iceberg/blob/main/docker-compose.yml), it includes MinIO, and MinIO is AGPL. I want to make sure this is okay.

adoroszlai · 2023-10-31T07:54:49Z

@jojochuang we don't distribute MinIO in any way. Users running this example download the MinIO docker image from Docker Hub.

But I'm fine abandoning this PR.

jojochuang · 2025-02-08T00:20:17Z

Hey @adoroszlai maybe it's time to revive it given all the interest around Trino? I'll just create an epic jira and put all Trino related work together.

hli00 · 2025-02-14T01:31:06Z

For Ozone S3, another integration scenario is Trino-Hive-Ozone S3.

HDDS-8971. Example integration with Iceberg

41be4fb

adoroszlai self-assigned this Jul 2, 2023

adoroszlai added the test label Jul 2, 2023

adoroszlai requested a review from ayushtkn July 2, 2023 18:35

ayushtkn reviewed Jul 3, 2023

SaketaChalamchala reviewed Jul 18, 2023

Merge remote-tracking branch 'origin/master' into HDDS-8971

01aefda

adoroszlai changed the title ~~HDDS-8971. Example integration with Iceberg and Trino~~ HDDS-8971. Example integration with Iceberg, Spark and Trino Oct 11, 2023

adoroszlai requested a review from jojochuang October 15, 2023 16:54

jojochuang approved these changes Oct 30, 2023

jojochuang requested changes Oct 30, 2023

adoroszlai closed this Oct 31, 2023

adoroszlai wants to merge 2 commits into apache:master from adoroszlai:HDDS-8971
