Add cloud text I/O benchmark by nightcityblade · Pull Request #2008 · NVIDIA-NeMo/Curator

nightcityblade · 2026-05-21T15:13:38Z

Description

Adds a cloud text I/O benchmark script for JSONL and Parquet datasets. The benchmark reads from a configurable local or cloud URI, writes to a configurable output URI, records CPU/GPU labels through the benchmark parameters, and reports document throughput plus local output throughput when available.

Fixes #1521

Usage

python benchmarking/scripts/cloud_text_io_benchmark.py \
  --benchmark-results-path=/tmp/cloud-io-results \
  --input-path=s3://bucket/text-jsonl \
  --output-path=s3://bucket/curator-cloud-io-output \
  --format=jsonl \
  --compression=gzip \
  --executor=ray_data \
  --device-label=gpu \
  --read-kwargs-json='{"storage_options": {}}'

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Testing performed:

ruff format benchmarking/scripts/cloud_text_io_benchmark.py
ruff check benchmarking/scripts/cloud_text_io_benchmark.py
python3 -m py_compile benchmarking/scripts/cloud_text_io_benchmark.py

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

copy-pr-bot · 2026-05-21T15:13:44Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-21T15:16:29Z

Greptile Summary

This PR adds a new cloud_text_io_benchmark.py script that measures read/write throughput for JSONL and Parquet datasets against local or cloud object-store URIs using configurable executors.

Implements a two-stage pipeline (Reader → Writer) with try/finally result persistence, matching the existing benchmark pattern in modifier_benchmark.py.
total_documents and throughput_docs_per_sec are computed from task.num_items on the output FileGroupTask objects; because FileGroupTask.num_items returns the number of output files rather than rows, these metrics will be orders of magnitude off for any non-trivial dataset.
Local-path output byte measurement via pathlib is intentionally skipped for cloud URIs, which is consistent with the PR description.

Confidence Score: 4/5

Safe to merge after fixing the document-count metric; all other structural patterns are consistent with the existing benchmark suite.

The benchmark's two headline metrics — total_documents and throughput_docs_per_sec — will silently report file counts instead of row counts because the pipeline's final stage returns FileGroupTask whose num_items counts files, not documents. The rest of the script (pipeline construction, compression handling, result persistence, exit-code propagation) is correct.

benchmarking/scripts/cloud_text_io_benchmark.py — specifically the total_documents calculation at line 134.

Important Files Changed

Filename	Overview
benchmarking/scripts/cloud_text_io_benchmark.py	New cloud I/O benchmark script for JSONL/Parquet with try/finally result persistence; `total_documents` incorrectly sums `FileGroupTask.num_items` (file count) instead of `StagePerfStats.num_items_processed` (document count), making throughput metrics wrong by orders of magnitude.

Sequence Diagram

sequenceDiagram
    participant CLI as main()
    participant RB as run_benchmark()
    participant BP as _build_pipeline()
    participant P as Pipeline
    participant R as JsonlReader / ParquetReader
    participant W as JsonlWriter / ParquetWriter
    participant WBR as write_benchmark_results()

    CLI->>RB: run_benchmark(args)
    RB->>BP: _build_pipeline(input_path, output_path, format, ...)
    BP->>R: instantiate reader
    BP->>W: instantiate writer (compression merged into write_kwargs)
    BP-->>RB: Pipeline
    RB->>P: pipeline.run(executor)
    P->>R: process() → DocumentBatch tasks
    R-->>P: DocumentBatch[]
    P->>W: process(DocumentBatch) → FileGroupTask
    W-->>P: "FileGroupTask[] (data = [file_path])"
    P-->>RB: tasks: FileGroupTask[]
    Note over RB: task.num_items = len(file_paths) ≠ document count
    RB-->>CLI: results dict
    CLI->>WBR: write params.json / metrics.json / tasks.pkl

_{Reviews (2): Last reviewed commit: "Ensure cloud text I/O benchmark writes f..." | Re-trigger Greptile}

greptile-apps · 2026-05-21T15:16:33Z

+    args = parser.parse_args()
+    results = run_benchmark(args)
+    write_benchmark_results(results, args.benchmark_results_path)
+    return 0 if results["metrics"]["is_success"] else 1


Benchmark results not written on setup failures

write_benchmark_results is only called when run_benchmark returns successfully. Errors that occur before pipeline.run() — such as json.JSONDecodeError from malformed --read-kwargs-json, TypeError from _json_arg, or ValueError from setup_executor — will propagate uncaught, leaving no result artefact behind. Other benchmarks in the same directory (e.g. modifier_benchmark.py) guard against this by initialising results with a failure default and wrapping the call in try/finally so write_benchmark_results is always invoked. Consider adopting the same pattern here.

greptile-apps · 2026-05-21T15:16:34Z

+        read_kwargs=read_kwargs,
+        write_kwargs=write_kwargs,
+    )
+    executor = setup_executor(args.executor, config=executor_config or None)


executor_config or None silently coerces an explicit empty JSON object (--executor-config-json='{}') into None, while read_kwargs and write_kwargs keep their empty-dict values and are passed through as-is. A user passing '{}' for the executor config would not see it treated differently from omitting the flag entirely, which is inconsistent with how the other two kwargs are handled. Consider passing executor_config directly (and letting setup_executor handle an empty dict), or documenting the coercion explicitly.

Suggested change

executor = setup_executor(args.executor, config=executor_config or None)

executor = setup_executor(args.executor, config=executor_config if executor_config else None)

nightcityblade · 2026-05-22T03:04:06Z

Addressed the Greptile feedback by updating cloud_text_io_benchmark.py to match the sibling benchmark pattern: main() now initializes a failed result artifact up front and persists benchmark results from a finally block, so setup-time failures still write params/metrics/tasks outputs.

Validation:

python3 -m py_compile benchmarking/scripts/cloud_text_io_benchmark.py
Targeted stubbed failure-path check confirming main() writes benchmark artifacts when run_benchmark() raises before setup completes.

Add cloud text I/O benchmark

e3f8934

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

nightcityblade requested a review from a team as a code owner May 21, 2026 15:13

nightcityblade requested review from huvunvidia and removed request for a team May 21, 2026 15:13

github-actions Bot added the community-request label May 21, 2026

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

Ensure cloud text I/O benchmark writes failure results

dc231b9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add cloud text I/O benchmark#2008

Add cloud text I/O benchmark#2008
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1521

nightcityblade commented May 21, 2026

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

greptile-apps Bot commented May 21, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

nightcityblade commented May 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	executor = setup_executor(args.executor, config=executor_config or None)
	executor = setup_executor(args.executor, config=executor_config if executor_config else None)

Conversation

nightcityblade commented May 21, 2026

Description

Usage

Checklist

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

greptile-apps Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

nightcityblade commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 21, 2026 •

edited

Loading

nightcityblade commented May 22, 2026 •

edited

Loading