Skip to content

feat: update to zarr3#2

Merged
Faolain merged 17 commits into
mainfrom
py-hamt-update
Apr 14, 2025
Merged

feat: update to zarr3#2
Faolain merged 17 commits into
mainfrom
py-hamt-update

Conversation

@Faolain

@Faolain Faolain commented Mar 27, 2025

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • Documentation
    • Updated project description to clarify dataset retrieval functionalities and enhanced usage examples.
    • Revised instructions for creating and activating a virtual environment with standardized naming.
  • New Features
    • Introduced functions for improved IPFS dataset retrieval and enhanced error handling.
    • Added integration tests for comprehensive validation of dataset functionalities.
  • Refactor
    • Streamlined the method for rolling data aggregations to simplify processing.
    • Overhauled the EncryptionCodec class to support asynchronous operations and improved configuration handling.
  • Chores
    • Updated dependency management for greater flexibility and clarified external sources.
  • Data Updates
    • Modified dataset identifiers in the configuration file for improved referencing.
  • Tests
    • Enhanced testing framework for Zarr dataset encryption and updated test cases for dataset retrieval.
    • Introduced new functional tests and refined existing tests for clarity and accuracy.
    • Added tests for IPFS integration and dataset caching mechanisms.

@coderabbitai

coderabbitai Bot commented Mar 27, 2025

Copy link
Copy Markdown

Walkthrough

The update enhances the project documentation, specifically the README, to clarify the functionality related to retrieving dClimate GIS Zarr datasets via IPFS and STAC. It also modifies the rolling_aggregation method for streamlined dataset processing, introduces improved IPFS functionalities, and updates dependency specifications in the project configuration. Additionally, the test suite has been revised to ensure compatibility with the latest changes, including new fixtures and tests for IPFS interactions.

Changes

Files Change Summary
README.md Updated project description, clarified dataset retrieval methods, and revised virtual environment name from myenv to .venv.
dclimate_zarr_client/geotemporal_data.py Simplified rolling_aggregation method by directly invoking the aggregation method on the rolled dataset.
dclimate_zarr_client/ipfs_retrieval.py Enhanced IPFS functionalities, added logging, and introduced new functions for IPNS name resolution and dataset retrieval.
pyproject.toml Relaxed version constraints for xarray and zarr, and added custom pytest markers for integration tests.
tests/conftest.py Updated references to zarr.ZipStore, added IPFS connection checks, and defined new fixtures for IPFS testing.
tests/test_client.py Removed old IPNS tests and added new functional tests for loading datasets via STAC, updated assertions for error handling.
dclimate_zarr_client/encryption_codec.py Overhauled EncryptionCodec class to extend BytesBytesCodec, added asynchronous methods for encoding/decoding, and improved configuration handling.
tests/test_zarr_encryption_ipfs.py Updated encryption handling in tests to utilize IPFSZarr3 and validate codec registration in the Zarr store.
dclimate_zarr_client/cids.json Updated values for keys: "cpc-precip-conus", "cpc-precip-global", and "chirps-final-p05" with new identifiers.
tests/test_ipfs_retrieval.py Modified tests for get_ipns_name_hash and list_datasets, focusing on functional testing and improved assertions.
dclimate_zarr_client/init.py Removed load_ipns from public API and replaced it with load_ipfs_via_stac.
dclimate_zarr_client/client.py Renamed load_ipns to load_ipfs_via_stac, simplified parameters, and updated dataset loading logic.
dclimate_zarr_client/dclimate_zarr_errors.py Updated error classes to reflect new contexts for dataset retrieval errors and added new error classes for IPFS and STAC issues.
tests/test_integration_ipfs.py Introduced new integration tests for various IPFS functionalities, including dataset retrieval and caching mechanisms.

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant F as load_ipfs_via_stac
    participant H as get_dataset_hamt_cid_from_stac
    participant I as _get_dataset_by_ipfs_cid
    participant X as xr.open_zarr

    C->>F: Request dataset via STAC
    F->>H: Get HAMT root IPFS CID
    H-->>F: Return CID
    F->>I: Load dataset by IPFS CID
    I-->>F: Return dataset
    F->>X: Open Zarr dataset
    X-->>F: Return Zarr dataset
    F-->>C: Provide dataset
Loading

Poem

I'm a coding rabbit, hopping through code,
Virtual environments now shine on a neat new road.
Aggregations roll directly with a simple tap,
And IPFSZarr3 wraps our HAMT like a cap.
Dependencies are flexible, tests hop along,
In this magical repo, everything sings a joyful song!
🐰💻

Tip

⚡💬 Agentic Chat (Pro Plan, General Availability)
  • We're introducing multi-step agentic chat in review comments and issue comments, within and outside of PR's. This feature enhances review and issue discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments and add commits to existing pull requests.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e11604b and a243f18.

📒 Files selected for processing (1)
  • tests/test_client.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_client.py

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pyproject.toml (1)

18-18: Consider using minimum version constraints

While removing specific version constraints gives flexibility, it's generally a good practice to specify minimum versions to ensure compatibility. Consider using xarray>=X.Y.Z and zarr>=3.0.0 to ensure compatibility with the Zarr3 update.

Also applies to: 20-20

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b11d300 and d08c44e.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • README.md (1 hunks)
  • dclimate_zarr_client/geotemporal_data.py (1 hunks)
  • dclimate_zarr_client/ipfs_retrieval.py (2 hunks)
  • pyproject.toml (2 hunks)
  • tests/conftest.py (2 hunks)
  • tests/test_client.py (2 hunks)
🔇 Additional comments (9)
dclimate_zarr_client/ipfs_retrieval.py (2)

8-8: Added import for Zarr3 support

The import of IPFSZarr3 from the py_hamt module is correctly added to support the Zarr3 update.


203-205: Properly implemented Zarr3 storage adapter

The implementation correctly wraps the HAMT store with IPFSZarr3 before passing it to xr.open_zarr, which is necessary for compatibility with Zarr3. The read_only=True parameter is properly preserved.

dclimate_zarr_client/geotemporal_data.py (1)

500-500: Simplified rolling aggregation implementation

This change simplifies the code by directly calling the aggregation method on the rolled object instead of using getattr on the DatasetRolling class. This makes the code more readable and maintainable while maintaining the same functionality.

README.md (1)

80-82: Standardized virtual environment naming

The update to use .venv instead of myenv follows industry best practices. The .venv name is recognized by many Python tools and IDEs automatically, and keeping it hidden (with the dot prefix) helps reduce clutter in directory listings.

tests/test_client.py (2)

77-77: Updated ZipStore import path for Zarr3 compatibility

Correctly updated the import path from zarr.ZipStore to zarr.storage.ZipStore to match the structural changes in the Zarr3 library.


89-89: Updated ZipStore import path for Zarr3 compatibility

Consistently updated the second instance of ZipStore to use the correct import path from the storage submodule.

tests/conftest.py (3)

10-10: Appropriate import addition for Zarr3 compatibility

Good addition of the explicit import for zarr.storage which supports the updated path to access ZipStore.


19-19: Correctly updated ZipStore import path

The change from zarr.ZipStore to zarr.storage.ZipStore aligns with Zarr3's organization where storage classes are properly namespaced.


25-27: Improved formatting and correct ZipStore import path

The ZipStore path has been correctly updated to use the storage submodule, and the code formatting has been improved by splitting parameters across multiple lines for better readability.

Comment thread pyproject.toml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (7)
tests/test_zarr_encryption_ipfs.py (5)

11-11: Remove unused register_codec import.

According to the static analysis hints, numcodecs.register_codec is never used in this file. Removing it will help maintain a clean import list.

- from numcodecs import register_codec
🧰 Tools
🪛 Ruff (0.8.2)

11-11: numcodecs.register_codec imported but unused

Remove unused import: numcodecs.register_codec

(F401)


12-12: Remove commented-out import if not needed.

This commented-out line might be a leftover. Consider removing it to keep the codebase clean and consistent.

- # from zarr.codecs import BytesCodec

13-13: Remove unused BytesBytesCodec from the import.

BytesBytesCodec is imported but not referenced anywhere in this file. Keeping imports minimal ensures clarity and avoids confusion.

-from numcodecs.zarr3 import BytesBytesCodec, Blosc
+from numcodecs.zarr3 import Blosc
🧰 Tools
🪛 Ruff (0.8.2)

13-13: numcodecs.zarr3.BytesBytesCodec imported but unused

Remove unused import: numcodecs.zarr3.BytesBytesCodec

(F401)


71-71: Reassess repeated zarr.registry.register_codec calls with changing keys.

Registering the same codec multiple times for the same identifier but with different encryption keys may lead to confusion or unintended side effects, especially if different parts of the code rely on a single registered codec.

Also applies to: 107-107, 120-120, 162-162, 173-173


123-124: Remove or restore commented-out assertions if still relevant.

These commented-out checks might be important for ensuring encryption correctness. Uncomment them or remove them altogether to maintain clarity and avoid confusion.

- # assert test_ds["temp"].encoding["compressors"][1]._encryption_key is not None
- # Check length of compressor is just 1 for default ZstdCodec
dclimate_zarr_client/encryption_codec.py (2)

9-24: Consider storing the encryption key per instance if concurrency is expected.

_encryption_key is maintained at the class level. If multiple coroutines attempt encoding/decoding concurrently with different keys, it may lead to race conditions or incorrect decryptions. Moving the key to an instance attribute could mitigate concurrency issues.


36-37: Ensure consistent key validation for better user feedback.

The check for a 32-byte key is correct, but consider adding more context in the error message (e.g., expected length vs. actual length) to help users diagnose configuration issues faster.

- raise ValueError("Encryption key must be 32 bytes")
+ raise ValueError(f"Encryption key must be 32 bytes, but got {len(encryption_key)} bytes")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc4a5c0 and f64c626.

📒 Files selected for processing (2)
  • dclimate_zarr_client/encryption_codec.py (1 hunks)
  • tests/test_zarr_encryption_ipfs.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

11-11: numcodecs.register_codec imported but unused

Remove unused import: numcodecs.register_codec

(F401)


13-13: numcodecs.zarr3.BytesBytesCodec imported but unused

Remove unused import: numcodecs.zarr3.BytesBytesCodec

(F401)

🪛 GitHub Actions: pre-commit
tests/test_zarr_encryption_ipfs.py

[error] 1-1: pre-commit hook 'trailing-whitespace' failed. Files were modified by this hook.

dclimate_zarr_client/encryption_codec.py

[error] 1-1: pre-commit hook 'end-of-file-fixer' failed. Files were modified by this hook.

🪛 GitHub Actions: Python application
tests/test_zarr_encryption_ipfs.py

[error] 134-134: ValueError: MAC check failed

🔇 Additional comments (3)
dclimate_zarr_client/encryption_codec.py (3)

66-84: Confirm that asynchronous decryption meets performance and concurrency needs.

As _decode_single relies on asyncio.to_thread, ensure your environment is prepared for thread-based concurrency and that read operations won't block other critical tasks.


85-106: Review parallel encoding usage.

Like the decryption method, _encode_single offloads work to a separate thread. Verify that creating multiple threads for chunk-based encryptions best suits your performance and resource constraints.


107-117: Validate additional overhead for nonce and tag.

The calculated addition of 40 bytes per chunk is standard for XChaCha20-Poly1305. Confirm that this overhead is acceptable, especially if your dataset has many small chunks.

Comment thread tests/test_zarr_encryption_ipfs.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_zarr_encryption_ipfs.py (1)

10-14: Update imports for zarr3 compatibility

The import changes correctly add zarr and json modules needed for zarr3 compatibility. There are a couple of improvements to make:

  1. The import numcodecs.register_codec is no longer used since you're now using zarr.registry.register_codec.
  2. The static analysis correctly points out that BytesBytesCodec is imported but not directly used in this file.
import os
import shutil
import tempfile

from multiformats import CID
import numpy as np
import pandas as pd
import xarray as xr
import pytest
import zarr
import json
-from numcodecs import register_codec
-# from zarr.codecs import BytesCodec
-from numcodecs.zarr3 import BytesBytesCodec, Blosc
+from numcodecs.zarr3 import Blosc

from Crypto.Random import get_random_bytes

from py_hamt import HAMT, IPFSStore, IPFSZarr3
🧰 Tools
🪛 Ruff (0.8.2)

12-12: numcodecs.register_codec imported but unused

Remove unused import: numcodecs.register_codec

(F401)


14-14: numcodecs.zarr3.BytesBytesCodec imported but unused

Remove unused import: numcodecs.zarr3.BytesBytesCodec

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd0e022 and 1c30047.

📒 Files selected for processing (1)
  • tests/test_zarr_encryption_ipfs.py (5 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
tests/test_zarr_encryption_ipfs.py (1)
dclimate_zarr_client/encryption_codec.py (1)
  • EncryptionCodec (9-117)
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

12-12: numcodecs.register_codec imported but unused

Remove unused import: numcodecs.register_codec

(F401)


14-14: numcodecs.zarr3.BytesBytesCodec imported but unused

Remove unused import: numcodecs.zarr3.BytesBytesCodec

(F401)

🪛 GitHub Actions: pre-commit
tests/test_zarr_encryption_ipfs.py

[error] 1-1: pre-commit hook 'trailing-whitespace' failed. Files were modified by this hook.

🔇 Additional comments (5)
tests/test_zarr_encryption_ipfs.py (5)

71-71: Correct codec registration for zarr3

The update to use zarr.registry.register_codec with an explicit codec identifier is the correct approach for zarr3 compatibility.

Also applies to: 159-159


76-76: Update encoding to use compressors instead of filters

The change from using filters to compressors correctly aligns with zarr3 terminology, and includes both compression and encryption in a sequential pipeline.


82-95: Great addition of zarr.json validation

These validation tests are excellent for ensuring the codec is correctly registered and configured in the zarr3 metadata. Testing the codec configuration ensures that no sensitive data like encryption keys are serialized.


118-121: Updated assertions for zarr3 compressors

The assertions are correctly updated to check for compressors instead of filters to align with zarr3 terminology.


127-127: Properly using IPFSZarr3 wrapper

The introduction of IPFSZarr3 to wrap the HAMT store and the corresponding updates to read/write operations are consistent with the zarr3 migration. This provides a cleaner abstraction for working with IPFS storage in zarr3.

Also applies to: 131-131, 143-143, 146-146, 161-161

Comment thread tests/test_zarr_encryption_ipfs.py Outdated
Comment on lines +71 to +72
zarr.registry.register_codec("xchacha20poly1305", EncryptionCodec)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix trailing whitespace

The code has trailing whitespace on line 72 which is causing pipeline failures.

    # Register the codec
    zarr.registry.register_codec("xchacha20poly1305", EncryptionCodec)
-  
+
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
zarr.registry.register_codec("xchacha20poly1305", EncryptionCodec)
zarr.registry.register_codec("xchacha20poly1305", EncryptionCodec)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
tests/test_zarr_encryption_ipfs.py (3)

10-14: Clean up unused imports

There are two unused imports according to the static analysis:

  • numcodecs.register_codec (line 12)
  • numcodecs.zarr3.BytesBytesCodec (line 14)
import zarr
import json
-from numcodecs import register_codec
-# from zarr.codecs import BytesCodec
-from numcodecs.zarr3 import BytesBytesCodec, Blosc
+# from zarr.codecs import BytesCodec
+from numcodecs.zarr3 import Blosc
🧰 Tools
🪛 Ruff (0.8.2)

12-12: numcodecs.register_codec imported but unused

Remove unused import: numcodecs.register_codec

(F401)


14-14: numcodecs.zarr3.BytesBytesCodec imported but unused

Remove unused import: numcodecs.zarr3.BytesBytesCodec

(F401)


23-32: Use contextlib.suppress for cleaner error handling

The try-except-pass pattern can be replaced with contextlib.suppress for better readability.

# Register the codec type *once* globally or in a fixture
# This only tells Zarr about the *class*, not the key.
@pytest.fixture(scope="module", autouse=True)
def register_encryption_codec():
-    try:
-        # Attempt registration, ignore if already done
-        zarr.registry.register_codec("xchacha20poly1305", EncryptionCodec)
-    except ValueError: # Or whatever error your registry throws on duplicate
-        pass
+    from contextlib import suppress
+    # Attempt registration, ignore if already done
+    with suppress(ValueError):  # Or whatever error your registry throws on duplicate
+        zarr.registry.register_codec("xchacha20poly1305", EncryptionCodec)
🧰 Tools
🪛 Ruff (0.8.2)

27-31: Use contextlib.suppress(ValueError) instead of try-except-pass

(SIM105)


187-194: Use a more specific exception class in pytest.raises

Using pytest.raises(Exception) is too broad and might catch unexpected exceptions. Specify the exact exception class you're expecting (like ValueError or a crypto-specific exception).

    # Use the *wrong* key in the context
    with EncryptionCodec.key_context(wrong_encryption_key):
        loaded_failure = xr.open_zarr(store=hamt1_read_z3, consolidated=False)

        # Accessing data should raise an exception since we don't have the correct encryption key
-        with pytest.raises(Exception):
+        with pytest.raises((ValueError, RuntimeError)):  # or the specific exception class from the crypto library
            _ = loaded_failure["temp"].values
🧰 Tools
🪛 Ruff (0.8.2)

192-192: pytest.raises(Exception) should be considered evil

(B017)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c30047 and 9c0f4ef.

📒 Files selected for processing (2)
  • dclimate_zarr_client/encryption_codec.py (1 hunks)
  • tests/test_zarr_encryption_ipfs.py (3 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
tests/test_zarr_encryption_ipfs.py (1)
dclimate_zarr_client/encryption_codec.py (3)
  • EncryptionCodec (11-120)
  • key_context (33-41)
  • compute_encoded_size (110-120)
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

12-12: numcodecs.register_codec imported but unused

Remove unused import: numcodecs.register_codec

(F401)


14-14: numcodecs.zarr3.BytesBytesCodec imported but unused

Remove unused import: numcodecs.zarr3.BytesBytesCodec

(F401)


27-31: Use contextlib.suppress(ValueError) instead of try-except-pass

(SIM105)


118-119: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


181-181: Local variable test_read is assigned to but never used

Remove assignment to unused variable test_read

(F841)


192-192: pytest.raises(Exception) should be considered evil

(B017)

🪛 GitHub Actions: pre-commit
tests/test_zarr_encryption_ipfs.py

[error] 179-179: Ruff: Local variable 'test_read' is assigned to but never used (F841).


[error] 1-1: pre-commit hook 'trailing-whitespace' failed. Files were modified by this hook.


[error] 1-1: pre-commit hook 'ruff' failed. Files were modified by this hook.


[error] 1-1: pre-commit hook 'ruff-format' failed. Files were modified by this hook.

dclimate_zarr_client/encryption_codec.py

[error] 1-1: pre-commit hook 'end-of-file-fixer' failed. Files were modified by this hook.

🔇 Additional comments (8)
tests/test_zarr_encryption_ipfs.py (3)

33-41: Good use of session-scoped fixture for consistent encryption key

The fixture effectively ensures the same encryption key is used throughout the test session, which is essential for testing encrypt/decrypt operations reliably.


44-95: Comprehensive update for Zarr3 compatibility

The function now correctly uses Zarr3 patterns with compressors instead of filters, and properly manages the encryption key through the context manager. The validation of the zarr.json file is a good addition to ensure codec registration works as expected.


116-126: Good test coverage for key validation and new functionality

The use of the context manager in test_bad_encryption_keys is appropriate, and the new test_compute_encoded_size test verifies the additional functionality added to the codec.

🧰 Tools
🪛 Ruff (0.8.2)

118-119: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

dclimate_zarr_client/encryption_codec.py (5)

1-15: Excellent modernization for Zarr3 compatibility

The transition to BytesBytesCodec and the use of contextvars for thread-safe key management are excellent improvements. The imports are well organized and the class structure is now much more appropriate for Zarr3.

🧰 Tools
🪛 GitHub Actions: pre-commit

[error] 1-1: pre-commit hook 'end-of-file-fixer' failed. Files were modified by this hook.


16-24: Improved constructor with better default and documentation

Adding a default value for the header parameter and improving the docstring makes the class more user-friendly and self-documenting.


25-42: Greatly improved key management with context manager pattern

The context manager pattern for key management is a significant security improvement over the previous approach. It ensures keys are properly cleaned up and provides thread safety through contextvars.


43-67: Well-designed serialization methods for Zarr3 compatibility

The from_dict and to_dict methods are well-implemented with proper type hints and comprehensive docstrings. They correctly handle the Zarr3 serialization format.


68-109: Effective use of asynchronous processing for crypto operations

The async methods with asyncio.to_thread are a good approach for handling potentially blocking crypto operations, enabling better performance in IO-bound scenarios.

Comment thread tests/test_zarr_encryption_ipfs.py Outdated
Comment on lines +110 to +120
def compute_encoded_size(self, input_byte_length: int, chunk_spec) -> int:
"""Calculate the size of the encoded data.

Args:
input_byte_length (int): Size of the input bytes.
chunk_spec: Array specification (unused in this calculation).

Returns:
int: Encoded size (input size + 40 bytes for nonce and tag).
"""
return input_byte_length + 40 # 24 bytes nonce + 16 bytes tag No newline at end of file

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Missing newline at end of file

The static analysis indicates an end-of-file issue. Add a newline at the end of the file to fix this pipeline failure.

        return input_byte_length + 40  # 24 bytes nonce + 16 bytes tag
+
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def compute_encoded_size(self, input_byte_length: int, chunk_spec) -> int:
"""Calculate the size of the encoded data.
Args:
input_byte_length (int): Size of the input bytes.
chunk_spec: Array specification (unused in this calculation).
Returns:
int: Encoded size (input size + 40 bytes for nonce and tag).
"""
return input_byte_length + 40 # 24 bytes nonce + 16 bytes tag
def compute_encoded_size(self, input_byte_length: int, chunk_spec) -> int:
"""Calculate the size of the encoded data.
Args:
input_byte_length (int): Size of the input bytes.
chunk_spec: Array specification (unused in this calculation).
Returns:
int: Encoded size (input size + 40 bytes for nonce and tag).
"""
return input_byte_length + 40 # 24 bytes nonce + 16 bytes tag

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
dclimate_zarr_client/encryption_codec.py (1)

99-109: ⚠️ Potential issue

Add newline at end of file

The file is missing a newline at the end, which is causing pipeline failures.

        return input_byte_length + 40  # 24 bytes nonce + 16 bytes tag
+
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2af0a27 and 107a46d.

📒 Files selected for processing (2)
  • dclimate_zarr_client/encryption_codec.py (2 hunks)
  • tests/test_zarr_encryption_ipfs.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

12-12: numcodecs.register_codec imported but unused

Remove unused import: numcodecs.register_codec

(F401)


14-14: numcodecs.zarr3.BytesBytesCodec imported but unused

Remove unused import: numcodecs.zarr3.BytesBytesCodec

(F401)

🪛 GitHub Actions: pre-commit
tests/test_zarr_encryption_ipfs.py

[error] 1-1: pre-commit hook 'trailing-whitespace' failed. Files were modified by this hook.

dclimate_zarr_client/encryption_codec.py

[error] 1-1: pre-commit hook 'end-of-file-fixer' failed. Files were modified by this hook.

🔇 Additional comments (12)
tests/test_zarr_encryption_ipfs.py (6)

22-30: Good addition of a session-scoped fixture!

Creating a session-scoped fixture for the encryption key ensures consistency across tests and follows best practices.


84-85: Good implementation of the compressors pipeline

The implementation correctly uses the Zarr3 compressor chain with Blosc first for compression and EncryptionCodec second for encryption, following Zarr3 best practices.


90-102: Excellent validation of codec configuration

The assertions validate that the codec is correctly registered and configured in the zarr.json file, which helps ensure the serialization is working as expected.


118-122: Good test for compute_encoded_size

This test validates the new method that calculates the encoded size with nonce and tag overhead, which is important for estimating storage requirements.


138-144: Good implementation of IPFSZarr3 wrapper

Using the IPFSZarr3 wrapper around the HAMT store correctly implements the Zarr3 compatibility layer.


168-177: Good negative test case

Testing with an incorrect encryption key ensures that encrypted data is properly protected and inaccessible without the correct key.

dclimate_zarr_client/encryption_codec.py (6)

9-10: Good inheritance from BytesBytesCodec

The update to inherit from BytesBytesCodec correctly aligns with Zarr3's codec architecture.


14-24: Nice improvement with default header value

Adding a default value for the header parameter makes the codec more user-friendly while maintaining customization options.


33-44: Good implementation of from_dict

The from_dict method correctly implements the Zarr3 configuration loading pattern with proper type hinting and error handling.


46-55: Well-implemented to_dict method

The to_dict method correctly implements the Zarr3 configuration serialization pattern with the appropriate structure.


57-76: Excellent async implementation of decode

Using asyncio.to_thread for the blocking cryptographic operations is a great approach to prevent I/O blocking in asynchronous contexts.


77-97: Good async implementation of encode

The encryption logic is well-structured with clear separation between buffer handling and cryptographic operations.

Comment thread tests/test_zarr_encryption_ipfs.py Outdated
Comment thread tests/test_zarr_encryption_ipfs.py Outdated
Comment thread tests/test_zarr_encryption_ipfs.py Outdated
},
attrs={"description": "Test dataset with random weather data"},
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix trailing whitespace

The trailing whitespace on this line is causing pipeline failures.

-  
+
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@codecov-commenter

codecov-commenter commented Apr 4, 2025

Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 49.84894% with 166 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@b11d300). Learn more about missing BASE report.

Files with missing lines Patch % Lines
dclimate_zarr_client/ipfs_retrieval.py 42.80% 132 Missing and 31 partials ⚠️
dclimate_zarr_client/client.py 66.66% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main       #2   +/-   ##
=======================================
  Coverage        ?   65.42%           
=======================================
  Files           ?        8           
  Lines           ?      778           
  Branches        ?      111           
=======================================
  Hits            ?      509           
  Misses          ?      221           
  Partials        ?       48           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
dclimate_zarr_client/encryption_codec.py (1)

27-33:

Details

❓ Verification inconclusive

Consider replacing global state with a context manager.

The static _encryption_key class variable creates global state which can cause issues in concurrent environments. Consider implementing a context manager as suggested in the AI summary to manage key lifecycle more safely.

@classmethod
@contextmanager
def key_context(cls, key: bytes | None):
    """Temporarily set encryption key within a context."""
    if key is not None:
        if len(key) != 32:  # 32 bytes = 64 hex chars
            raise ValueError("Encryption key must be 32 bytes")
        
        old_key = cls._encryption_key
        cls._encryption_key = key
        try:
            yield
        finally:
            cls._encryption_key = old_key
    else:
        yield

🌐 Web query:

Is global state in Python classes thread-safe?

💡 Result:

Global state in Python classes is not inherently thread-safe. While Python's Global Interpreter Lock (GIL) prevents concurrent execution of bytecode across threads, thread safety depends on the specific operations performed:

  1. Shared State Risks
    Class attributes and global variables can become corrupted in multi-threaded environments when:

    • Multiple threads modify the same attribute/variable
    • Operations require multiple bytecode steps (e.g., x += 1 is not atomic[2][6])
    • Methods lack synchronization for state-modifying operations[3][6]
  2. Thread-Safe Implementation Strategies
    To ensure safety:

    import threading
    
    class SharedState:
        _lock = threading.Lock()
        data = {}
        
        @classmethod
        def update(cls, key, value):
            with cls._lock:  # Critical section protection
                cls.data[key] = value
    • Use threading.Lock() for mutual exclusion in critical sections[1][3]
    • Leverage atomic operations where possible (e.g., queue.Queue for thread-safe data exchange)[3]
    • Design classes to minimize shared mutable state[3][5]
  3. GIL Limitations
    While the GIL ensures bytecode-level atomicity for single operations like:

    L.append(x)  # Atomic under GIL[2]

    It doesn't protect multi-step operations like:

    if x in cache:  # Non-atomic check-then-act
        cache[x] += 1

    Race conditions can occur between the existence check and modification[2][6].

For singleton patterns requiring thread-safety, a locked metaclass approach is effective:

class SingletonMeta(type):
    _instances = {}
    _lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        with cls._lock:  # Thread-safe instantiation
            if cls not in cls._instances:
                cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

Key recommendations[1][3][5]:

  • Always use locks when modifying shared class/global state
  • Prefer thread-safe data structures (queue.Queue, collections.deque)
  • Avoid global state where possible through dependency injection
  • Test thread interactions explicitly using stress tests

Without explicit synchronization mechanisms, concurrent access to global class state risks data races, inconsistent states, and hard-to-debug issues[2][6].

Citations:


Action Required: Revisit Global Key Management to Address Thread Safety

The current class-level _encryption_key is set via a static method, which introduces global mutable state and is not inherently thread-safe. Transitioning to a context manager—as outlined below—can help by limiting the key's scope and lifetime. However, if the encryption key may be modified concurrently in a multi-threaded environment, consider integrating explicit thread synchronization (e.g., using a threading.Lock) within the context manager.

Recommendations:

  • Replace the current set_encryption_key method with a context manager to temporarily set and safely restore the encryption key.
  • Evaluate if thread-safety is a concern in your use case and, if so, add locking to protect state transitions.
@classmethod
@contextmanager
def key_context(cls, key: bytes | None):
    """Temporarily set encryption key within a context."""
    if key is not None:
        if len(key) != 32:  # 32 bytes = 64 hex chars
            raise ValueError("Encryption key must be 32 bytes")
        
        # Optionally, add thread-safety via a lock if concurrent access is expected.
        old_key = cls._encryption_key
        cls._encryption_key = key
        try:
            yield
        finally:
            cls._encryption_key = old_key
    else:
        yield

Please review these changes carefully and assess whether additional thread synchronization is needed for your environment.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 107a46d and 6738801.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • dclimate_zarr_client/encryption_codec.py (2 hunks)
  • tests/test_ipfs_retrieval.py (2 hunks)
  • tests/test_zarr_encryption_ipfs.py (5 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
tests/test_ipfs_retrieval.py (1)
dclimate_zarr_client/geotemporal_data.py (1)
  • time_range (345-366)
🪛 GitHub Actions: pre-commit
tests/test_ipfs_retrieval.py

[error] 303-303: ruff-format: 1 file reformatted. Please check the changes made by the pre-commit hook.

🔇 Additional comments (11)
tests/test_ipfs_retrieval.py (1)

33-33: Appropriate update to test mock.

The updated IPFS hash value is correctly matched to work with the updated dependencies for Zarr3.

tests/test_zarr_encryption_ipfs.py (5)

24-30: Good use of session-scoped fixture.

Using a session-scoped fixture for the encryption key improves test consistency and allows for reliable assertions between test functions.


34-94: Well-structured Zarr3 compatibility updates.

The changes to utilize compressors instead of filters align properly with Zarr3's approach. Setting consolidated=False is also appropriate for IPFS HAMT stores.


97-108: Good validation of codec configuration.

The added assertions properly verify that the codec is correctly registered and configured in the zarr.json file, which is essential for ensuring proper serialization in Zarr3.


125-128: Appropriate test for new functionality.

The test for compute_encoded_size verifies the new functionality added to the codec and confirms that it correctly accounts for the additional bytes needed for the nonce and tag.


131-186: Proper adaptation to IPFSZarr3.

The test has been correctly updated to use the new IPFSZarr3 wrapper class and verifies that encryption still works correctly with the new Zarr3 structure.

dclimate_zarr_client/encryption_codec.py (5)

1-12: Proper updates for Zarr3 compatibility.

The imports have been updated to use Zarr3 types, and the class now correctly inherits from BytesBytesCodec which is the appropriate base class for Zarr3 codecs that transform bytes to bytes.


16-26: Good default value and improved docstrings.

The constructor now includes a sensible default value for the header and has well-structured docstrings with type annotations.


35-55: Excellent Zarr3 serialization methods.

The from_dict and to_dict methods properly implement the Zarr3 serialization interface, allowing the codec to be correctly configured from metadata.


56-98: Good implementation of async encoding/decoding.

The async methods for encoding and decoding chunks allow for better performance with large datasets by moving CPU-bound operations to separate threads using asyncio.to_thread.


99-109: Accurate encoded size calculation.

The compute_encoded_size method correctly accounts for the additional 40 bytes needed for the nonce (24 bytes) and tag (16 bytes) when calculating the encoded size.

Comment thread tests/test_ipfs_retrieval.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (12)
dclimate_zarr_client/ipfs_retrieval.py (3)

23-26: Avoid calling logging.basicConfig in library code.
It’s typically best practice to let the application decide how to configure logging so as not to override user-defined settings. Consider removing or conditionally applying it only if no existing logging configuration is found.


98-249: Two-step IPNS fetch approach.
Retrying with and without nocache is a clever way to handle gateway behaviors. Consider refactoring the repeated request logic into a helper function to reduce duplication.


696-894: Refactor lengthy list_datasets for readability.
While functional, this routine is quite large. Splitting out child-collection or item-link handling into smaller helpers could improve maintainability and testability.

pyproject.toml (1)

18-20: Removed version pins for xarray and zarr.
While this gives more flexibility, it also introduces some uncertainty if backward-incompatible updates are released. Consider specifying version ranges to balance stability with upgrades.

README.md (2)

144-144: Fix missing comma in sentence

There's a missing comma after "IPFS gateway" in this sentence.

-To run your own IPFS gateway follow the instructions for [installing ipfs]
+To run your own IPFS gateway, follow the instructions for [installing ipfs]
🧰 Tools
🪛 LanguageTool

[uncategorized] ~144-~144: A comma might be missing here.
Context: ...e/main/notebooks). To run your own IPFS gateway follow the instructions for [installing...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)


146-146: Remove trailing punctuation from heading

According to Markdown linting standards (rule MD026), headings should not have trailing punctuation.

-## Create and activate a virtual environment:
+## Create and activate a virtual environment
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

146-146: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)

tests/test_ipfs_retrieval.py (3)

34-53: Well-structured test for hash retrieval success path

The test provides thorough coverage of the success path for getting an IPNS name hash, including verifying that the cache update function is called correctly.

However, the nested 'with' statements could be simplified for better readability:

    endpoint_data = {"cpc-precip-conus": "bafy-real-or-mock-hash", "other": "hash2"}
-    with patch("requests.get") as mock_get:
-        mock_get.return_value.raise_for_status.return_value = None
-        mock_get.return_value.json.return_value = endpoint_data
-        # Mock file operations to simulate cache update
-        with (
-            patch("builtins.open", mock_open()) as _,
-            patch("os.path.dirname", return_value="/fake/dir"),
-            patch(
-                "dclimate_zarr_client.ipfs_retrieval.update_cache_if_changed"
-            ) as mock_update,
-        ):
+    with patch("requests.get") as mock_get, \
+         patch("builtins.open", mock_open()) as _, \
+         patch("os.path.dirname", return_value="/fake/dir"), \
+         patch("dclimate_zarr_client.ipfs_retrieval.update_cache_if_changed") as mock_update:
+        mock_get.return_value.raise_for_status.return_value = None
+        mock_get.return_value.json.return_value = endpoint_data
+        # Mock file operations to simulate cache update

61-170: Comprehensive fallback testing with nested context managers

The tests provide excellent coverage of various fallback scenarios for the IPNS hash retrieval functionality. The test assertions are clear and verify the expected behavior in each case.

However, there are multiple instances of nested 'with' statements that could be simplified throughout this section for better readability and adherence to Ruff's SIM117 rule.

For example, in the test_get_ipns_name_hash_fallback_success function:

-    with patch(
-        "requests.get", side_effect=requests.RequestException("Simulated Error")
-    ):
-        with patch("os.path.exists", return_value=True):
-            fake_json = '{"cpc-precip-conus":"bafkreihashfromlocalfile"}'
-            # Mock open within the context of the function call
-            with patch(
-                "dclimate_zarr_client.ipfs_retrieval.open",
-                mock_open(read_data=fake_json),
-                create=True,
-            ):
+    with patch("requests.get", side_effect=requests.RequestException("Simulated Error")), \
+         patch("os.path.exists", return_value=True), \
+         patch("dclimate_zarr_client.ipfs_retrieval.open", 
+               mock_open(read_data='{"cpc-precip-conus":"bafkreihashfromlocalfile"}'),
+               create=True):

Similar simplifications could be applied to tests at lines 78-84, 88-99, 104-122, 126-140, 144-155, and 159-170.

🧰 Tools
🪛 Ruff (0.8.2)

61-64: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


78-81: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


88-91: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


93-98: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


110-118: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


144-147: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


159-162: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


198-261: Consider removing commented-out tests

There are several commented-out test functions that appear to be for functionality that has been changed or removed. While they might provide historical context, they also clutter the codebase and could lead to confusion.

Consider either:

  1. Removing these commented-out tests entirely if they're no longer needed
  2. Adding a clear comment explaining why they're kept for future reference, with an explanation of the functionality they were testing and why they're no longer applicable
tests/conftest.py (2)

116-149: Robust IPFS connection check with logging.
The function appropriately handles connection, timeout, and other requests exceptions.
Consider using a logging utility instead of print statements for production-like environments, though this is acceptable for tests.


165-178: Global constants for known dataset IDs and parameters.
Storing these in a test-specific constants section is a straightforward approach.
If reused beyond test files, consider centralizing them in a shared constants module.

tests/test_integration_ipfs.py (1)

43-56: Local is_ipfs_running helper mirrors conftest approach.
Duplicating or partially reusing logic is fine for test clarity here, but consider unifying if feasible.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ebf459 and e11604b.

📒 Files selected for processing (11)
  • README.md (3 hunks)
  • dclimate_zarr_client/__init__.py (1 hunks)
  • dclimate_zarr_client/cids.json (1 hunks)
  • dclimate_zarr_client/client.py (8 hunks)
  • dclimate_zarr_client/dclimate_zarr_errors.py (2 hunks)
  • dclimate_zarr_client/ipfs_retrieval.py (3 hunks)
  • pyproject.toml (2 hunks)
  • tests/conftest.py (3 hunks)
  • tests/test_client.py (3 hunks)
  • tests/test_integration_ipfs.py (1 hunks)
  • tests/test_ipfs_retrieval.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • dclimate_zarr_client/cids.json
🧰 Additional context used
🧬 Code Graph Analysis (6)
dclimate_zarr_client/__init__.py (1)
dclimate_zarr_client/client.py (3)
  • load_ipfs_via_stac (26-84)
  • load_s3 (87-103)
  • geo_temporal_query (106-263)
dclimate_zarr_client/ipfs_retrieval.py (1)
dclimate_zarr_client/dclimate_zarr_errors.py (4)
  • DatasetNotFoundError (25-26)
  • NoMetadataFoundError (17-18)
  • IpfsConnectionError (61-62)
  • StacCatalogError (65-66)
dclimate_zarr_client/client.py (2)
dclimate_zarr_client/ipfs_retrieval.py (2)
  • get_dataset_hamt_cid_from_stac (252-433)
  • _get_dataset_by_ipfs_cid (601-679)
dclimate_zarr_client/geotemporal_data.py (1)
  • GeotemporalData (22-648)
tests/test_ipfs_retrieval.py (2)
dclimate_zarr_client/ipfs_retrieval.py (3)
  • get_ipns_name_hash (514-557)
  • update_cache_if_changed (500-511)
  • list_datasets (733-893)
dclimate_zarr_client/dclimate_zarr_errors.py (3)
  • DatasetNotFoundError (25-26)
  • IpfsConnectionError (61-62)
  • StacCatalogError (65-66)
tests/test_integration_ipfs.py (3)
dclimate_zarr_client/client.py (2)
  • load_ipfs_via_stac (26-84)
  • geo_temporal_query (106-263)
dclimate_zarr_client/ipfs_retrieval.py (2)
  • fetch_json_from_ipns (99-248)
  • get_dataset_hamt_cid_from_stac (252-433)
dclimate_zarr_client/dclimate_zarr_errors.py (3)
  • IpfsConnectionError (61-62)
  • StacCatalogError (65-66)
  • DatasetNotFoundError (25-26)
tests/test_client.py (4)
dclimate_zarr_client/dclimate_zarr_errors.py (9)
  • DatasetNotFoundError (25-26)
  • IpfsConnectionError (61-62)
  • StacCatalogError (65-66)
  • NoDataFoundError (21-22)
  • ConflictingGeoRequestError (9-10)
  • InvalidForecastRequestError (29-30)
  • ConflictingAggregationRequestError (13-14)
  • InvalidExportFormatError (43-44)
  • SelectionTooLargeError (5-6)
dclimate_zarr_client/client.py (1)
  • geo_temporal_query (106-263)
tests/conftest.py (1)
  • forecast_ds (27-32)
dclimate_zarr_client/geotemporal_data.py (1)
  • time_range (345-366)
🪛 Ruff (0.8.2)
tests/test_ipfs_retrieval.py

61-64: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


78-81: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


88-91: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


93-98: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


110-118: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


144-147: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


159-162: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)

tests/test_client.py

105-107: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🪛 LanguageTool
README.md

[uncategorized] ~144-~144: A comma might be missing here.
Context: ...e/main/notebooks). To run your own IPFS gateway follow the instructions for [installing...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)

🪛 markdownlint-cli2 (0.17.2)
README.md

146-146: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)

🔇 Additional comments (53)
dclimate_zarr_client/ipfs_retrieval.py (7)

4-4: No issues with adding logging import.


9-17: Imports from py_hamt and custom errors look good.


27-32: Consider concurrency for in-memory cache.
_stac_hamt_cid_cache is a globally shared dictionary. If used in multi-threaded or multi-process scenarios, ensure proper synchronization or concurrency handling to avoid data races.

Would you like guidance on making this cache thread-safe?


33-53: Configurable IPFSStore creation.
This helper neatly centralizes IPFS gateway/RPC initialization logic, making it straightforward to extend or override in the future.


55-96: Robust JSON fetch with comprehensive error handling.
Catching a variety of network and parsing exceptions is excellent for ensuring your system doesn’t fail silently on IPFS retrieval issues.


251-433: Comprehensive STAC catalog traversal.
This function covers multiple edge cases for invalid links or STAC structures; the structured approach and use of warnings is solid.


599-680: Clean IPFS-based dataset loading.
Leveraging HAMT and IPFSZarr3 with strong exception handling ensures reliability when dealing with Zarr data on IPFS.

dclimate_zarr_client/__init__.py (2)

2-6: Renamed reference to load_ipfs_via_stac.
Using the new function name improves clarity and matches the updated implementation.


13-13: Export load_ipfs_via_stac in __all__.
Exposing it clearly in the public API is consistent with the rest of the module’s architecture.

pyproject.toml (1)

39-45: Added custom Pytest markers.
Defining ipfs and integration markers clarifies the test scope and requirements, which is good practice for organizing tests.

dclimate_zarr_client/dclimate_zarr_errors.py (3)

26-26: Documentation improvement for DatasetNotFoundError

The updated docstring now correctly reflects that this error is raised when datasets can't be found via any of the supported retrieval methods (IPNS, STAC, or S3).


52-52: Documentation improvement for PathNotFoundError

The updated docstring now correctly includes IPFS as a path source that can trigger this error, alongside the existing AWS S3 reference.


61-67: Appropriate new error classes for IPFS and STAC support

The newly added error classes provide specific error handling for IPFS connectivity issues and STAC catalog traversal problems, which aligns well with the PR's goal of enhancing IPFS/STAC support.

README.md (4)

3-9: Improved project description with clear technical context

The updated README introduction clearly explains the project's purpose and technical approach, highlighting the use of IPFS, STAC catalog, and py-hamt for accessing Zarr datasets. This provides users with a much better understanding of what the library does and how it works.


37-37: Updated module description reflects new capabilities

The description of ipfs_retrieval.py now properly explains its expanded functionality, including IPNS name resolution, STAC catalog traversal, and interaction with IPFS gateways and RPC endpoints.


47-141: Excellent comprehensive usage examples

The new usage section provides highly detailed examples covering different access methods (IPFS and S3), query approaches, and proper error handling. These examples will significantly help users understand how to use the library effectively, particularly the new IPFS/STAC functionality.


149-151: Updated virtual environment naming convention

The change from myenv to .venv follows modern Python conventions for virtual environment naming. The corresponding activation commands have been correctly updated for both Unix and Windows systems.

tests/test_client.py (6)

19-22: Appropriate imports for new error classes

The imports of the new error classes align with the error handling changes in the PR and are necessary for the new test cases.


25-33: Good use of centralized test constants

Importing test constants from conftest.py is a good practice that improves test maintainability and ensures consistent test data across test files.


111-125: Clarified fixture naming and comprehensive patching

The fixture has been appropriately renamed to patch_s3 to better reflect its purpose. Patching the function in both modules where it might be called ensures consistent behavior throughout the tests.


214-343: Well-structured functional tests for IPFS retrieval

These new functional tests provide comprehensive coverage of the IPFS dataset retrieval functionality, with robust error handling and thorough assertions. They test both basic functionality (loading datasets) and more complex operations (geo-temporal queries with various parameters).

The tests properly catch and report specific expected exceptions as well as unexpected errors, making test failures more informative.


344-474: Comprehensive error handling tests

The new error handling tests ensure that appropriate exceptions are raised for various invalid input scenarios. This improves the robustness of the client by verifying it properly validates inputs and provides meaningful error messages.

The tests use constants from conftest.py for consistency and include message checking to ensure detailed error information.


475-578: Improved organization and validation for S3 tests

Moving the S3 tests to a class structure improves organization. The extensive validation of the forecast dataset fixture before running the actual test helps prevent confusing test failures by ensuring the test data has the expected structure.

tests/test_ipfs_retrieval.py (3)

7-26: Updated imports and module-level fixture for IPFS connectivity

The imports have been appropriately updated to include the new error classes and constants needed for the tests. Adding the module-level pytest marker ensures IPFS connectivity is checked before running tests, which prevents confusing test failures when IPFS is unavailable.


177-195: Well-structured functional test for dataset listing

This test effectively verifies the list_datasets functionality by checking both the structure of the return value and that it contains expected datasets. The error handling is appropriate for a functional test, with specific catches for expected error types and a generic catch for unexpected errors.


308-406: Improved cache update tests and helper function

The cache update tests have been enhanced with better clarity and coverage of edge cases. The get_cache_path helper function now has improved documentation, and the tests properly verify cache update behavior in various scenarios, including file not found and JSON decode errors.

The removal of commented-out assertion lines is a good cleanup that improves code readability.

tests/conftest.py (5)

6-6: Imports look correct.
The newly added imports (os, requests, and zarr.storage) are necessary for environment configuration, IPFS checks, and direct ZipStore usage.

Also applies to: 10-10, 12-12


21-22: Fixture documentation is helpful.
Retaining local fixtures that do not require IPFS improves test independence and helps differentiate local and IPFS test coverage.


28-31: Similar approach for forecast fixture is consistent.
Having a separate fixture for forecast data reaffirms modularity and keeps IPFS-unrelated tests consistent with the real data process.


109-114: Session-scoped IPFS gateway URL fixture.
Providing a single session fixture to retrieve the IPFS gateway URL is a good practice and ensures consistency across tests.


151-163: Autouse fixture to skip tests if IPFS is unavailable.
This session-scoped approach effectively prevents failures from cascading.

dclimate_zarr_client/client.py (8)

15-18: Relevant IPFS retrieval imports.
Bringing in _get_dataset_by_ipfs_cid and get_dataset_hamt_cid_from_stac clarifies the direct usage.


20-23: Catalog IPNS constant offers quick reference.
Defining DCLIMATE_STAC_CATALOG_IPNS at the top keeps the root catalog path discoverable.


26-66: Docstring improvements for load_ipfs_via_stac.
Removing the as_of parameter simplifies usage and clarifies the IPFS-based approach.
Ensure the readme and references elsewhere are updated to avoid confusion if users relied on as_of.


67-82: HAMT CID resolution and dataset loading.
The flow is straightforward: fetch the CID via STAC, then load the dataset.
Error handling is sufficiently delegated to the imported functions.


106-126: geo_temporal_query parameter updates.
Replacing as_of with IPFS gateway/rpc stems aligns with the updated approach.
Documentation is consistent, although consider making it more explicit that historical snapshots are no longer an option.


141-148: Revised doc references for IPFS and S3 usage.
Good for clarifying which parameter is required given the data source.


222-235: Conditional branching for IPFS or S3 data retrieval.
The checks for missing bucket_name or invalid source are well-handled.


262-262: Returning data in dictionary format.
Falling back to data.as_dict() is consistent with the documented output_format=="array" path.

tests/test_integration_ipfs.py (14)

3-15: File-level configuration and library skipping.
Auto-skipping tests if required libraries are missing promotes stable CI.


16-31: Comprehensive client imports with caution on internal usage.
Importing internal functions (_get_dataset_by_ipfs_cid, _stac_hamt_cid_cache) is acceptable for integration tests but watch for potential maintenance overhead if signatures change.


34-41: Global known dataset references.
Clear naming ensures readability for test usage.


58-67: Module-scoped fixture to skip on IPFS unavailability.
Pairs well with the local is_ipfs_running, achieving the skip logic at integration test scope.


76-90: test_fetch_root_catalog_from_ipns: Basic IPNS fetch test.
Verifies the root catalog retrieval robustly; good assertion coverage on returned JSON.


92-111: test_get_hamt_cid_success: Normal dataset retrieval path.
Checks both function success and format of the resulting CID.


114-125: test_get_hamt_cid_not_found: Negative path coverage.
Raises DatasetNotFoundError as expected for unknown dataset IDs.


128-149: test_get_hamt_cid_caching: Ensures caching mechanism.
Timing checks highlight the performance benefits of caching.


152-172: test_load_dataset_internal: Validates _get_dataset_by_ipfs_cid.
Ensures that the direct, lower-level IPFS loading path remains functional.


175-192: test_load_ipfs_via_stac_success: Main load function test.
Demonstrates the top-level client usage and verifies presence of a known variable in the dataset.


195-203: test_load_ipfs_via_stac_not_found: Expects error for nonexistent dataset.
Ensures consistent error handling for unknown STAC dataset IDs.


206-234: test_geo_temporal_query_ipfs_point_success: Single-point retrieval.
Verifies the correct filtering and data extraction, including range checks on time coords.


237-268: test_geo_temporal_query_ipfs_netcdf_output: NetCDF serialization check.
Ensures user-chosen output format is properly returned and remains valid.


271-291: test_list_datasets_ipfs: Catalog listing validation.
Checks that the catalog fetch is successful, returning a non-empty list of known dataset IDs.

Comment thread tests/test_client.py Outdated
@Faolain Faolain merged commit 2da6713 into main Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants