Skip to content

Add cloud text I/O benchmark#2008

Open
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1521
Open

Add cloud text I/O benchmark#2008
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1521

Conversation

@nightcityblade
Copy link
Copy Markdown
Contributor

Description

Adds a cloud text I/O benchmark script for JSONL and Parquet datasets. The benchmark reads from a configurable local or cloud URI, writes to a configurable output URI, records CPU/GPU labels through the benchmark parameters, and reports document throughput plus local output throughput when available.

Fixes #1521

Usage

python benchmarking/scripts/cloud_text_io_benchmark.py \
  --benchmark-results-path=/tmp/cloud-io-results \
  --input-path=s3://bucket/text-jsonl \
  --output-path=s3://bucket/curator-cloud-io-output \
  --format=jsonl \
  --compression=gzip \
  --executor=ray_data \
  --device-label=gpu \
  --read-kwargs-json='{"storage_options": {}}'

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Testing performed:

  • ruff format benchmarking/scripts/cloud_text_io_benchmark.py
  • ruff check benchmarking/scripts/cloud_text_io_benchmark.py
  • python3 -m py_compile benchmarking/scripts/cloud_text_io_benchmark.py

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade nightcityblade requested a review from a team as a code owner May 21, 2026 15:13
@nightcityblade nightcityblade requested review from huvunvidia and removed request for a team May 21, 2026 15:13
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR adds a new cloud_text_io_benchmark.py script that measures read/write throughput for JSONL and Parquet datasets against local or cloud object-store URIs using configurable executors.

  • Implements a two-stage pipeline (Reader → Writer) with try/finally result persistence, matching the existing benchmark pattern in modifier_benchmark.py.
  • total_documents and throughput_docs_per_sec are computed from task.num_items on the output FileGroupTask objects; because FileGroupTask.num_items returns the number of output files rather than rows, these metrics will be orders of magnitude off for any non-trivial dataset.
  • Local-path output byte measurement via pathlib is intentionally skipped for cloud URIs, which is consistent with the PR description.

Confidence Score: 4/5

Safe to merge after fixing the document-count metric; all other structural patterns are consistent with the existing benchmark suite.

The benchmark's two headline metrics — total_documents and throughput_docs_per_sec — will silently report file counts instead of row counts because the pipeline's final stage returns FileGroupTask whose num_items counts files, not documents. The rest of the script (pipeline construction, compression handling, result persistence, exit-code propagation) is correct.

benchmarking/scripts/cloud_text_io_benchmark.py — specifically the total_documents calculation at line 134.

Important Files Changed

Filename Overview
benchmarking/scripts/cloud_text_io_benchmark.py New cloud I/O benchmark script for JSONL/Parquet with try/finally result persistence; total_documents incorrectly sums FileGroupTask.num_items (file count) instead of StagePerfStats.num_items_processed (document count), making throughput metrics wrong by orders of magnitude.

Sequence Diagram

sequenceDiagram
    participant CLI as main()
    participant RB as run_benchmark()
    participant BP as _build_pipeline()
    participant P as Pipeline
    participant R as JsonlReader / ParquetReader
    participant W as JsonlWriter / ParquetWriter
    participant WBR as write_benchmark_results()

    CLI->>RB: run_benchmark(args)
    RB->>BP: _build_pipeline(input_path, output_path, format, ...)
    BP->>R: instantiate reader
    BP->>W: instantiate writer (compression merged into write_kwargs)
    BP-->>RB: Pipeline
    RB->>P: pipeline.run(executor)
    P->>R: process() → DocumentBatch tasks
    R-->>P: DocumentBatch[]
    P->>W: process(DocumentBatch) → FileGroupTask
    W-->>P: "FileGroupTask[] (data = [file_path])"
    P-->>RB: tasks: FileGroupTask[]
    Note over RB: task.num_items = len(file_paths) ≠ document count
    RB-->>CLI: results dict
    CLI->>WBR: write params.json / metrics.json / tasks.pkl
Loading

Reviews (2): Last reviewed commit: "Ensure cloud text I/O benchmark writes f..." | Re-trigger Greptile

Comment on lines +193 to +196
args = parser.parse_args()
results = run_benchmark(args)
write_benchmark_results(results, args.benchmark_results_path)
return 0 if results["metrics"]["is_success"] else 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Benchmark results not written on setup failures

write_benchmark_results is only called when run_benchmark returns successfully. Errors that occur before pipeline.run() — such as json.JSONDecodeError from malformed --read-kwargs-json, TypeError from _json_arg, or ValueError from setup_executor — will propagate uncaught, leaving no result artefact behind. Other benchmarks in the same directory (e.g. modifier_benchmark.py) guard against this by initialising results with a failure default and wrapping the call in try/finally so write_benchmark_results is always invoked. Consider adopting the same pattern here.

read_kwargs=read_kwargs,
write_kwargs=write_kwargs,
)
executor = setup_executor(args.executor, config=executor_config or None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 executor_config or None silently coerces an explicit empty JSON object (--executor-config-json='{}') into None, while read_kwargs and write_kwargs keep their empty-dict values and are passed through as-is. A user passing '{}' for the executor config would not see it treated differently from omitting the flag entirely, which is inconsistent with how the other two kwargs are handled. Consider passing executor_config directly (and letting setup_executor handle an empty dict), or documenting the coercion explicitly.

Suggested change
executor = setup_executor(args.executor, config=executor_config or None)
executor = setup_executor(args.executor, config=executor_config if executor_config else None)

@nightcityblade
Copy link
Copy Markdown
Contributor Author

nightcityblade commented May 22, 2026

Addressed the Greptile feedback by updating cloud_text_io_benchmark.py to match the sibling benchmark pattern: main() now initializes a failed result artifact up front and persists benchmark results from a finally block, so setup-time failures still write params/metrics/tasks outputs.

Validation:

  • python3 -m py_compile benchmarking/scripts/cloud_text_io_benchmark.py
  • Targeted stubbed failure-path check confirming main() writes benchmark artifacts when run_benchmark() raises before setup completes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a benchmark for Cloud I/O for text

2 participants