Skip to content

refactor(amber): stop hardcoding S3 in REST catalog init#4988

Merged
mengw15 merged 15 commits into
apache:mainfrom
mengw15:refactor/remove-hardcoded-s3-from-rest-catalog
May 16, 2026
Merged

refactor(amber): stop hardcoding S3 in REST catalog init#4988
mengw15 merged 15 commits into
apache:mainfrom
mengw15:refactor/remove-hardcoded-s3-from-rest-catalog

Conversation

@mengw15
Copy link
Copy Markdown
Contributor

@mengw15 mengw15 commented May 8, 2026

What changes were proposed in this PR?

Stop hardcoding s3.endpoint, s3.region, s3.path-style-access, s3.access-key-id and s3.secret-access-key at REST-catalog init in both IcebergUtil.createRestCatalog (Scala) and iceberg_utils.create_rest_catalog (Python). Both helpers now pass only warehouse + catalog uri (and on the Scala side the FileIO impl hint).

Why: When a Lakekeeper warehouse is created, its S3 settings (endpoint, region, credentials, path-style) are registered against that warehouse on the server. At catalog init the client only needs warehouse + uri — Lakekeeper resolves the S3 config from the warehouse record and serves it back. The hardcoded StorageConfig.s3* values on the client were redundant, and forcing them everywhere also pinned every warehouse to the single system bucket. Removing them lets each warehouse own its own storage settings.

StorageConfig.s3* itself is kept — pytexera/storage/large_binary_manager.py still uses it for the non-Iceberg texera-large-binaries bucket (R UDF large-binary support), which is out of scope.

Any related issues, documentation, discussions?

Closes #4987

How was this PR tested?

  • sbt "WorkflowCore/compile" — passes; verifies no other Scala caller depends on the removed properties.
  • Python edits parse cleanly via ast.parse; the only caller (iceberg_catalog_instance.py) is updated to match the new create_rest_catalog signature.

End-to-end verification (warehouse with its own S3 settings → REST catalog opened with only warehouse + uri → table round-trip) requires a running Lakekeeper, which CI doesn't have today. #4276 (draft) wires Lakekeeper into CI; once that lands I'll add the integration test on top of it.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

@github-actions github-actions Bot added python refactor Refactor the code common labels May 8, 2026
@mengw15 mengw15 requested a review from bobbai00 May 8, 2026 20:50
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.19%. Comparing base (e27b98a) to head (53fed1a).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4988      +/-   ##
============================================
+ Coverage     43.17%   43.19%   +0.01%     
+ Complexity     2214     2211       -3     
============================================
  Files          1045     1045              
  Lines         40260    40161      -99     
  Branches       4250     4234      -16     
============================================
- Hits          17381    17346      -35     
+ Misses        21812    21743      -69     
- Partials       1067     1072       +5     
Flag Coverage Δ *Carryforward flag
access-control-service 39.53% <ø> (ø)
agent-service 33.72% <ø> (ø) Carriedforward from c9ecf14
amber 43.87% <100.00%> (+0.02%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 32.18% <ø> (ø)
frontend 34.05% <ø> (-0.01%) ⬇️ Carriedforward from c9ecf14
python 90.36% <ø> (-0.08%) ⬇️
workflow-compiling-service 56.81% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Yicong-Huang
Copy link
Copy Markdown
Contributor

Thanks @mengw15 can you make the description a bit concise about why this change is needed? Also please add tests to confirm this change works.

@mengw15
Copy link
Copy Markdown
Contributor Author

mengw15 commented May 9, 2026

Thanks @mengw15 can you make the description a bit concise about why this change is needed? Also please add tests to confirm this change works.

Thanks for the questions.

Why the change is needed. When a Lakekeeper warehouse is created, the S3 settings (endpoint, region, credentials, path-style, etc.) are already registered against that warehouse on the server side. At REST-catalog init the client only needs the warehouse identifier and uri — Lakekeeper resolves and serves the S3 config from the warehouse record. The previously hardcoded s3.* properties from StorageConfig were therefore redundant on the client; deleting them lets each warehouse own its own storage settings instead of all warehouses being forced onto the system bucket. I'll tighten the PR description to say just this.

About tests. End-to-end verification needs a running Lakekeeper, which CI doesn't have yet. #4276 (draft) adds Lakekeeper to CI; once that lands I'll layer an integration test on top of it that creates a warehouse with its own S3 settings, opens a REST catalog with only warehouse + uri, and round-trips a table.

mengw15 and others added 3 commits May 10, 2026 09:29
Align test_iceberg_rest_catalog_integration.py with create_rest_catalog's
new signature after S3 settings stopped being passed at catalog init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Yicong-Huang
Copy link
Copy Markdown
Contributor

now as #4276 is merged, can we hook up with new tests?

@mengw15
Copy link
Copy Markdown
Contributor Author

mengw15 commented May 11, 2026

now as #4276 is merged, can we hook up with new tests?

With #4276 merged, which brought over the Lakekeeper CI job and the two integration tests.

These two tests are testing the createRestCatalog in Scala and python. so this PR is covered end-to-end. With CI passed, I think we can confirm that this change works.

@Yicong-Huang
Copy link
Copy Markdown
Contributor

Yicong-Huang commented May 11, 2026

These two tests are testing the createRestCatalog in Scala and python. so this PR is covered end-to-end. With CI passed, I think we can confirm that this change works.

sg. thanks! let's also make sure the coverage is filled, this can make sure your changes in this PR are actually being tested in the CI.

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.75%. Comparing base (14f8be4) to head (b7b3f84).

Files with missing lines Patch % Lines
...ala/org/apache/texera/amber/util/IcebergUtil.scala 0.00% 3 Missing ⚠️

See more #4988 (comment)

mengw15 and others added 2 commits May 11, 2026 10:45
Amber integration job runs without jacoco, so IcebergRestCatalogIntegrationSpec
does not register on codecov. Add a unit test that drives createRestCatalog
far enough to construct the property Map; .initialize then throws because no
Lakekeeper is up in unit-test scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mengw15
Copy link
Copy Markdown
Contributor Author

mengw15 commented May 11, 2026

These two tests are testing the createRestCatalog in Scala and python. so this PR is covered end-to-end. With CI passed, I think we can confirm that this change works.

sg. thanks! let's also make sure the coverage is filled, this can make sure your changes in this PR are actually being tested in the CI.

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.✅ Project coverage is 42.75%. Comparing base (14f8be4) to head (b7b3f84).

Files with missing lines Patch % Lines
...ala/org/apache/texera/amber/util/IcebergUtil.scala 0.00% 3 Missing ⚠️
See more #4988 (comment)

It seems like the amber-integration job doesn't upload to Codecov. Added a mock unit test in IcebergUtilSpec to satisfy the patch number; it just drives createRestCatalog until .initialize throws (no Lakekeeper in unit-test scope). Real coverage still comes from the integration test.

@Yicong-Huang
Copy link
Copy Markdown
Contributor

Yes integration test is for now designed not to alter coverage report: we rely on unit tests.

mengw15 and others added 2 commits May 11, 2026 18:01
Tightens the previous coverage-only test: instead of intercepting any
Exception, assert RESTException specifically. The property Map is built
before .initialize, so a RESTException from either an unreachable
Lakekeeper or a missing warehouse confirms the Map composition is sound
and the failure is server-side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mengw15
Copy link
Copy Markdown
Contributor Author

mengw15 commented May 15, 2026

Yes integration test is for now designed not to alter coverage report: we rely on unit tests.

The unit test I added in IcebergUtilSpec is coverage-only — without a running Lakekeeper, RESTCatalog wraps the connection failure into RESTException

The PR's real semantics (client sends only warehouse + uri, Lakekeeper resolves S3 config from the warehouse record) can only be verified end-to-end against a running Lakekeeper. That's what IcebergRestCatalogIntegrationSpec covers in amber-integration.

For a real test in a coverage-uploading job, we'd need to add Lakekeeper (+ MinIO + Postgres) to the amber job — basically duplicating the service setup that amber-integration already does. Want me to do that, or is the current coverage-only test fine?

Copy link
Copy Markdown
Contributor

@Yicong-Huang Yicong-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor comment on test.

Addresses review feedback on PR apache#4988: the previous name
"build REST catalog properties without S3 settings" described the
new call shape but didn't reflect what the assertion actually
verifies (intercept RESTException on server-side connection
failure). Rename to make the assertion clear to readers who
don't have the PR context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mengw15 mengw15 enabled auto-merge (squash) May 16, 2026 17:38
@mengw15 mengw15 merged commit bfa79a7 into apache:main May 16, 2026
21 checks passed
@mengw15 mengw15 deleted the refactor/remove-hardcoded-s3-from-rest-catalog branch May 16, 2026 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Iceberg REST catalog should not hardcode S3 properties at init

3 participants