[deps] Upgrade HuggingFace stack: datasets 4.x by ArturNiederfahrenhorst · Pull Request #64054 · ray-project/ray

ArturNiederfahrenhorst · 2026-06-12T11:41:21Z

Coordinated bump of Ray's HuggingFace ML stack so it can run on datasets 4.x and huggingface-hub 1.x:

- datasets: ==3.6.0 -> >=4.0.0,<5.0.0. This reverts the pin from #62926 ("[deps] Pin datasets==3.6.0 in py313 train-requirements"). The dtype regression that pin worked around in from_huggingface -> iter_torch_batches does not recur with datasets 4.x: the CPU train transformers tests pass on this branch.
- huggingface-hub: >=0.24.0 -> >=1.0,<2.0.
- transformers: ==4.36.2 -> >=5.0,<6.0; accelerate: ==0.28.0 -> >=1.0. Every transformers 4.x release (through 4.57) hard-caps huggingface-hub <1.0, so hf-hub 1.x requires transformers 5.x.
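To sanity-check a local environment against the ranges above, something like the following sketch could be used. It only compares major versions, and `EXPECTED_MAJORS`, `major_of`, and `check_stack` are hypothetical names introduced here, not part of this PR.

```python
from importlib import metadata

# Expected major versions, taken from the ranges listed in the PR description.
# accelerate is left out because its range (>=1.0) has no upper bound.
EXPECTED_MAJORS = {"datasets": 4, "huggingface-hub": 1, "transformers": 5}


def major_of(version: str) -> int:
    """Return the leading integer of a version string such as '4.8.5'."""
    return int(version.split(".")[0])


def check_stack(expected=EXPECTED_MAJORS):
    """Map each package name to (installed version or None, major matches)."""
    report = {}
    for name, major in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = (None, False)
            continue
        report[name] = (installed, major_of(installed) == major)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_stack().items():
        status = "ok" if ok else "mismatch or missing"
        print(f"{name}: {installed} ({status})")
```

This is a coarse check only; the compiled lockfiles remain the source of truth for exact versions.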

For the Ray Serve docs, this means switching from model.translate(...) to model.generate() (https://huggingface.co/docs/transformers/en/tasks/translation). The switch also changes the prose around the affected code, which makes the PR bigger.
A small number of other places across the libraries also need attention.

Follow-ups

diffusers vs. transformers 5 in the ML release (BYOD) image: this PR brings transformers==5.11.0 alongside the pre-existing diffusers==0.12.1 baseline pin in release/ray_release/byod/requirements_ml_byod_3.{10,13}.in. diffusers 0.12.1 targets the transformers 4.x era and should be upgraded.

…ers 5.x

Coordinated bump of Ray's HuggingFace ML stack so it can run on datasets 4.x and huggingface-hub 1.x:

- datasets: ==3.6.0 -> >=4.0.0,<5.0.0 (reverts the ray-project#62926 pin; the dtype regression it worked around in from_huggingface -> iter_torch_batches must be fixed as part of this change).
- huggingface-hub: >=0.24.0 -> >=1.0,<2.0.
- transformers: ==4.36.2 -> >=5.0,<6.0, accelerate: ==0.28.0 -> >=1.0. Every transformers 4.x (through 4.57) hard-caps huggingface-hub <1.0, so hf-hub 1.x requires transformers 5.x.

Input-requirements only; the compiled lockfiles (requirements_compiled*.txt) and deplocks must be regenerated on x86 CI/forge (compile_pip_dependencies skips aarch64).

Resolution verified coherent: the stack resolves to transformers 5.11.0 / datasets 4.8.5 / huggingface-hub 1.19.0 / tokenizers 0.22.2 / accelerate 1.14.0, and s3fs==2023.12.1's pinned fsspec stays within datasets 4.x's accepted range.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

gemini-code-assist

Code Review

This pull request updates machine learning dependency requirements across several requirements files, upgrading transformers, accelerate, datasets, and huggingface-hub to newer major versions. The reviewer identified that several of these specified major versions (such as transformers 5.x, datasets 4.x, and huggingface-hub 1.x) have not yet been released on PyPI, which will cause dependency resolution to fail and break CI/CD pipelines. It is recommended to revert these pins to the latest available stable versions.


pip-compile reuses the existing requirements_compiled*.txt as constraints and won't auto-upgrade transitive deps, so the stale tokenizers==0.15.2 / regex==2024.5.15 / accelerate==0.28.0 / datasets==3.6.0 pins blocked the transformers 5.x resolution. Remove the HF cluster (transformers, tokenizers, regex, accelerate, huggingface-hub, hf-xet, datasets, safetensors, dill, multiprocess) so CI's pip-compile re-resolves just those to the modern stack (pip-tools partial upgrade). The recompiled files will be committed back. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
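The temporary line removal described in this commit can be sketched as a small script. This is an illustration under assumptions, not the tooling the PR used: it assumes each pin is a `name==version` line followed by indented `# via ...` continuation lines, and `drop_pins` and `normalize` are hypothetical helpers.

```python
import re

# Package names listed in the commit message as the "HF cluster".
HF_CLUSTER = {
    "transformers", "tokenizers", "regex", "accelerate", "huggingface-hub",
    "hf-xet", "datasets", "safetensors", "dill", "multiprocess",
}

PIN_RE = re.compile(r"([A-Za-z0-9][A-Za-z0-9._-]*)\s*==")


def normalize(name: str) -> str:
    """Normalize a package name so 'hf_xet' and 'hf-xet' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def drop_pins(text: str, names=HF_CLUSTER) -> str:
    """Remove the pins for `names`, plus their indented continuation lines."""
    wanted = {normalize(n) for n in names}
    kept = []
    skipping = False
    for line in text.splitlines():
        if line[:1].isspace() and line.strip():
            # Indented, non-blank line: belongs to the preceding pin.
            if not skipping:
                kept.append(line)
            continue
        match = PIN_RE.match(line)
        skipping = bool(match) and normalize(match.group(1)) in wanted
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + "\n"
```

With the stale pins gone, pip-compile no longer treats them as constraints and re-resolves only those packages; every other pin in the file stays where it was.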

Authoritative recompile from CI's pip-compile jobs (premerge #68150), pulled from their artifacts. Resolves the HF cluster to the modern stack: transformers 5.11.0, datasets 4.8.5 (4.0.0 on py3.11), huggingface-hub 1.16.1, tokenizers 0.22.2, accelerate 1.14.0, safetensors 0.8.0, regex 2026.5.9. Replaces the temporary line-removal from the previous commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Regenerated all affected deplocks against the updated requirements_compiled*.txt via `raydepsets build --all-configs` (bazel-pinned uv, --python-platform=linux, deterministic). 39 lock files updated to the modern HF stack: datasets 4.8.5, transformers 5.11.0, huggingface-hub 1.16.1, accelerate 1.14.0, tokenizers 0.22.2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Keep the requirement files to bare version specs; the rationale lives in the PR/commit history, not inline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Brings the HuggingFace stack upgrade (datasets 4.x, huggingface-hub 1.x, transformers 5.x; PR ray-project#64054) and current master into the lerobot datasource branch. datasets 4.x / hf-hub 1.x in the shared data lock is what lets lerobot[dataset] (which requires huggingface-hub>=1.0) resolve there, so the read_lerobot test can finally run in CI. Conflict resolved in python/requirements/ml/data-test-requirements.txt: kept both master's `obstore` and the py>=3.12-gated lerobot stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

transformers 5.x removed two long-deprecated TrainingArguments kwargs:

- no_cuda -> use_cpu
- evaluation_strategy -> eval_strategy

Update all call sites (train v1/v2 transformers tests, test_train_usage, test_local_mode, the transformers/pbt_transformers examples). Internal config dict keys are left as-is; only the kwargs passed to TrainingArguments change. Both replacements are valid on transformers 4.x as well.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
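The PR renames the kwargs directly at each call site. For code that still builds its kwargs from older config dicts, the same rename could be applied at the call boundary, roughly as sketched below; `RENAMED_TRAINING_ARGS` and `migrate_training_kwargs` are hypothetical names, not part of transformers or Ray.

```python
# Old kwarg name -> new kwarg name, as described in the commit message.
RENAMED_TRAINING_ARGS = {
    "no_cuda": "use_cpu",
    "evaluation_strategy": "eval_strategy",
}


def migrate_training_kwargs(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with the removed names replaced."""
    migrated = {}
    for key, value in kwargs.items():
        new_key = RENAMED_TRAINING_ARGS.get(key, key)
        if new_key in migrated:
            # Both the old and the new spelling were passed.
            raise ValueError(f"duplicate setting for {new_key!r}")
        migrated[new_key] = value
    return migrated


# Intended use (requires transformers to be installed):
#   from transformers import TrainingArguments
#   args = TrainingArguments(**migrate_training_kwargs(old_kwargs))
```

This keeps internal config keys untouched, which matches the commit's note that only the kwargs passed to TrainingArguments change.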

transformers 5 removed the translation_* pipeline tasks and changed BPE tokenization spacing, breaking the Serve doc examples:

- Translation examples (getting_started models/translator/model_deployment*, translator_example, production_guide/text_ml, develop_and_deploy): replace pipeline("translation_en_to_*", model="t5-small") with direct AutoModelForSeq2SeqLM + AutoTokenizer + generate() using a task prefix ("translate English to French/German/Romanian: ").
- streaming_tutorial: the hardcoded GPT-2 chunk asserts pinned exact tokenizer output; replace with "streaming happened" (chunk-count) checks.
- Replace exact-match output asserts with robust checks; model output isn't stable across transformers versions (which is what broke here).
- Update getting_started / develop-and-deploy prose to describe the generate() flow instead of the removed pipeline's [{"translation_text"}] dict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

…5 pipeline task

transformers 5 removed the summarization pipeline task (same as translation; the supported-task list has no translation/summarization/text2text-generation). Migrate the Summarizer examples (getting_started/models, getting_started/translator, production_guide/text_ml) from pipeline("summarization", ...) to AutoModelForSeq2SeqLM + AutoTokenizer + generate() with a "summarize: " prefix, preserving the Summarizer->Translator composition. Fix the dead SummarizationPipeline doc reference. Asserts use robust non-empty checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

evaluate 0.4.3 calls the removed huggingface_hub.hf_api.HfFolder.get_token(), which breaks GPU train tests (accelerate/glue metrics) under hf-hub 1.x with AttributeError. evaluate 0.4.6 uses build_hf_headers instead and has an identical dependency tree. Bump the requirement + compiled files and relock the affected deplocks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

…rmers-5 pipeline task doc/BUILD.bazel runs every serve doc_code/*.py (gradio-integration.py included), so the pipeline("summarization", model="t5-small") call here breaks the same way the getting_started Summarizers did under transformers 5. Switch it to AutoModelForSeq2SeqLM + generate() with a "summarize: " prefix. Also fix the remaining dead SummarizationPipeline prose reference in getting_started.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

The evaluate 0.4.3 -> 0.4.6 bump propagates to the ml_torchft release-test byod lock as well; regenerate it so raydepsets --check stays consistent and the release image doesn't pull the old evaluate that breaks under hf-hub 1.x. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

…l examples The transformers-5 migration swapped pipeline() for greedy generate(), which (a) weakened the doc-example asserts to isinstance(x, str), losing the regression coverage the old exact-string asserts provided, and (b) degraded output quality: greedy t5-small loops/repeats and truncates mid-phrase without the beam search the old pipeline applied by default. Restore num_beams=4, early_stopping=True (the pipeline's behavior) across the Translator/Summarizer examples, bump the summary to max_new_tokens=20 so it finishes a phrase, and re-pin the exact transformers-5 outputs (verified by running t5-small under transformers 5.12). Each assert carries an f"got {x!r}" message so any divergence self-reports. Update the matching expected-output text in getting_started.md and kubernetes.md. Outputs were captured on arm64; CI (x86) confirms them, and the got-message surfaces any beam-scoring divergence for a one-line re-pin. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

dstrodtman

One suggestion to simplify text, and confirming that many docs tests are intentionally changed from deterministic output to slightly broader checks (asserts on types, length, etc.)

It would be useful if you could codify as much of your flow as possible as an agentic skill and commit it somewhere (either as docs folded into general external dependency updates, or as specific instructions for auditing and updating dependencies for docs and tests). Happy to collaborate.

Per review, remove the f"got {x!r}" messages added alongside the exact-output pins; keep the plain equality assertions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst · 2026-06-15T15:30:51Z

 class Translator:
    def __init__(self):
        # Load model
-        self.model = pipeline("translation_en_to_fr", model="t5-small")


See https://github.com/huggingface/transformers/blob/main/MIGRATION_GUIDE_V5.md

These pipelines have been removed. They prepended a short task prefix to the input, so we can recover the original behavior by prepending "translate English to French: " ourselves.
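A minimal sketch of the replacement flow, assuming the t5-small checkpoint and the task prefixes quoted in the commit messages. `TASK_PREFIXES`, `build_prompt`, and `translate` are illustrative names, and the generation kwargs (in particular `max_new_tokens=40`) are examples rather than the exact values used in the Ray docs.

```python
# Task prefixes quoted in the commit messages.
TASK_PREFIXES = {
    "french": "translate English to French: ",
    "german": "translate English to German: ",
    "romanian": "translate English to Romanian: ",
}


def build_prompt(text: str, language: str = "french") -> str:
    """Prepend the task prefix that the removed pipeline used to add."""
    return TASK_PREFIXES[language.lower()] + text


def translate(text: str, language: str = "french", model_name: str = "t5-small") -> str:
    """Translate `text` with tokenizer + generate() + decode()."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(build_prompt(text, language), return_tensors="pt")
    output_ids = model.generate(
        **inputs, num_beams=4, early_stopping=True, max_new_tokens=40
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In a Serve deployment the tokenizer and model would be loaded once in `__init__` rather than on every call; they are loaded inline here only to keep the sketch short.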

…gs unchanged Reproduce the exact pre-migration pipeline() outputs with generate() by passing t5-small's task_specific_params (num_beams=4, early_stopping, length_penalty, no_repeat_ngram_size) and decoding with clean_up_tokenization_spaces=False (transformers 5 flipped that decode default to True). This restores the original output strings, so the assertions and doc console outputs revert to master verbatim -- no expected-output changes. Also drop the injected migration-explanation code comments and the LANGUAGE_TO_PREFIX dict (reconfigure now mirrors the original if/elif), keeping the diff to the necessary pipeline() -> tokenizer/generate/decode change plus prose that described the removed pipeline dict output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

…anslator.py The summarize generate() in the composition example carried six kwargs. Trim it to the same shape as the translate() call (num_beams=4, early_stopping=True, max_length=15) and update the composed example's expected French output to the string this produces. models.py / text_ml.py keep their faithful reproduction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst · 2026-06-15T16:05:10Z

 # __end_client__

-assert french_text == "c'était le meilleur des temps, c'était le pire des temps ."
+assert french_text == "C'était le meilleur des temps, c'était le pire des temps,"


I tried to make the model produce exactly the same outputs here, but that requires a bunch of extra kwargs to model.generate, so I think adjusting the expected output slightly is nicer.

Mirror the translator.py change in the production-guide example: drop the length_penalty / no_repeat_ngram_size tuning kwargs from the summarize generate() (keeping the reconfigure-driven min_length / max_length), and update the French/German expected outputs (asserts, inline comments, and the matching kubernetes.md console outputs) to the strings this produces. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

gradio-integration.py: trim the summarize generate() to the same 3-kwarg shape as the other examples (num_beams=4, early_stopping=True, max_length=200); output is not asserted (the test checks the HTTP 200), and it stays a non-empty summary. streaming_tutorial.py: the streaming checks were just len(chunks) > 1. Strengthen them to also require the streamed chunks reconstruct non-empty content (len(chunks) > 1 and "".join(chunks).strip()), for both the single-stream and the batched cases, and drop the now-redundant explanatory comment. Avoids pinning DialoGPT's exact words, which shifted under transformers 5 and are platform-fragile. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Restore exact-content assertions for the streaming examples (instead of just len(chunks) > 1), now pinning the actual chunk lists the clients receive under transformers 5. Empty keep-alive chunks are filtered ([c for c in chunks if c]) since the HTTP client drops them; the remaining list is the deterministic streamed token sequence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
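The two kinds of streaming checks discussed in these commits can be sketched as plain helpers. The names are hypothetical, and unlike the check quoted in the earlier commit, this version filters out empty chunks before counting them.

```python
def non_empty_chunks(chunks):
    """Drop empty keep-alive chunks, keeping the streamed tokens in order."""
    return [c for c in chunks if c]


def streaming_happened(chunks) -> bool:
    """Robust property: several chunks that join into non-blank text."""
    real = non_empty_chunks(chunks)
    return len(real) > 1 and bool("".join(real).strip())


# An exact pin compares the filtered list against a known token sequence:
#   assert non_empty_chunks(chunks) == expected_chunks
# The robust check only asserts that streaming produced content:
#   assert streaming_happened(chunks)
```

The exact pin gives stronger regression coverage but has to be re-captured whenever tokenizer output changes; the robust check survives such changes at the cost of not noticing them.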

…ade-hf-stack-datasets4 Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

@dstrodtman

Collapse the "two models" intro + bullets into a single sentence, as suggested by @dstrodtman. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

The batchbot self-test pinned the first response to "interstitial error", which was a bad capture artifact (it came from passing attention_mask, which the example does not). The example runs generate(input_ids) with right padding and no mask, which yields the coherent "I'm not a fan of the new look ." Pin that instead, addressing the reviewer note about asserting nonsensical output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

CI revealed two issues with the exact streamed-chunk pins:

- chatbot: the client appends a "\n" after each turn (print("\n")), so the pinned list was missing those entries. Add them, matching the exact tokens the deployment yields on x86 (read from the CI logs).
- batchbot: its per-token, sub-word, batched chunks are platform-fragile and we have no x86 ground truth (the assert never ran), so assert the robust property (multiple chunks reconstructing non-empty content) instead of an exact list.

textbot's exact pin already matches x86 and is kept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8176f94.

CI surfaced these once the cu13/pytorch-index flakes cleared. Two patterns:

- datasets 4.x requires a namespaced repo id: bare load_dataset("glue", ...) / "hf://datasets/glue/cola/" now raise HfUriError/RepositoryNotFoundError. Point them at nyu-mll/glue (matching the existing deepspeed_torch_trainer_no_raydata fix), and yelp_review_full -> Yelp/yelp_review_full.
- transformers 5 removed the TrainingArguments kwargs evaluation_strategy and no_cuda; rename to eval_strategy and use_cpu (same migration already applied to the train test files).

Files: accelerate/deepspeed torch trainer examples, transformers_torch_trainer_basic, checkpoints doc_code, and the huggingface_text_classification / pbt_transformers / intel_gaudi llama / lightning_cola_advanced notebooks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
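The repo-id change can be illustrated with a small lookup. The mapping holds only the two ids named in this commit, and `NAMESPACED_IDS`, `namespaced`, and `load` are hypothetical helpers, not part of the datasets library.

```python
# Bare dataset ids from this commit mapped to their namespaced repo ids.
NAMESPACED_IDS = {
    "glue": "nyu-mll/glue",
    "yelp_review_full": "Yelp/yelp_review_full",
}


def namespaced(repo_id: str) -> str:
    """Return a namespaced repo id, leaving already-namespaced ids alone."""
    if "/" in repo_id:
        return repo_id
    try:
        return NAMESPACED_IDS[repo_id]
    except KeyError:
        raise KeyError(f"no known namespaced id for {repo_id!r}") from None


def load(repo_id: str, *args, **kwargs):
    """Call datasets.load_dataset with the namespaced id."""
    # Imported lazily so namespaced() works without datasets installed.
    from datasets import load_dataset

    return load_dataset(namespaced(repo_id), *args, **kwargs)
```

For example, `load("glue", "mrpc")` would resolve to `load_dataset("nyu-mll/glue", "mrpc")`. The PR itself edits the ids in place at each call site, which is simpler when only a handful of files are affected.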

ArturNiederfahrenhorst · 2026-06-16T11:57:03Z

@@ -140,7 +140,7 @@ def collate_fn(batch):
    }

    # Prepare Ray Datasets
-    hf_datasets = load_dataset("glue", "mrpc")
+    hf_datasets = load_dataset("nyu-mll/glue", "mrpc")


This is the original repo, published by https://wp.nyu.edu/ml2/.

ArturNiederfahrenhorst · 2026-06-16T12:01:40Z

@@ -19,7 +19,7 @@
 # ====================================================================
 def train_func(config):
    # Datasets
-    dataset = load_dataset("yelp_review_full")
+    dataset = load_dataset("Yelp/yelp_review_full")


Same as for the NYU one, this one is canonical.

The Gaudi example isn't run in CI (no bazel target) and isn't locally testable (needs Habana HPU + optimum-habana, which may pin its own transformers and use GaudiTrainingArguments), so the evaluation_strategy->eval_strategy rename there was unverifiable. Restore the original to avoid shipping a blind change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Empty commit to re-run premerge so the newly-added continuous-build label is evaluated, running the py3.11/3.12/3.13 suites that validate the transformers-5 / datasets-4 / hf-hub-1 upgrade across Python versions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

…nto add-lerobot-datasource

Bring the dependency upgrade and example migrations from the upgrade-hf-stack-datasets4 branch (PR ray-project#64054 work) into the lerobot datasource branch so read_lerobot can build on datasets 4.x / huggingface-hub 1.x.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

transformers 5 removed the deprecated `tokenizer=` kwarg on `Trainer` (replaced by `processing_class=` since 4.46), which raised TypeError: Trainer.__init__() got an unexpected keyword argument 'tokenizer' in the train v1 gpu notebook test (build #68368). Update the two notebooks that still passed it:

- transformers/huggingface_text_classification.ipynb (the failing CI test)
- deepspeed/gptj_deepspeed_fine_tuning.ipynb (gated test; same latent bug)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
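The notebooks in this PR simply switch to `processing_class=`. Code that has to run against both older and newer transformers releases could pick the kwarg name by inspecting the constructor signature, as in the sketch below; `tokenizer_kwarg_name` and `trainer_tokenizer_kwargs` are hypothetical helpers.

```python
import inspect


def tokenizer_kwarg_name(trainer_cls) -> str:
    """Pick the kwarg under which `trainer_cls` accepts a tokenizer."""
    params = inspect.signature(trainer_cls.__init__).parameters
    # Prefer the new name whenever the constructor declares it.
    if "processing_class" in params:
        return "processing_class"
    return "tokenizer"


def trainer_tokenizer_kwargs(trainer_cls, tokenizer) -> dict:
    """Build the kwargs dict that passes `tokenizer` to `trainer_cls`."""
    return {tokenizer_kwarg_name(trainer_cls): tokenizer}


# Intended use (requires transformers to be installed):
#   from transformers import Trainer
#   trainer = Trainer(model=model, args=args,
#                     **trainer_tokenizer_kwargs(Trainer, tokenizer))
```

The helper prefers `processing_class` when both names are accepted, so it never passes the deprecated spelling to a release that already supports the new one.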

…datasource Brings the Trainer(tokenizer=) -> processing_class= fix (PR ray-project#64054) so the train notebook tests pass on transformers 5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst added the `go` label (add ONLY when ready to merge, run all tests) Jun 12, 2026

gemini-code-assist Bot reviewed Jun 12, 2026


Comment thread python/requirements/ml/core-requirements.txt

Comment thread python/requirements/ml/train-requirements.txt

Comment thread python/requirements/ml/py313/core-requirements.txt

Comment thread python/requirements/ml/py313/train-requirements.txt

cursor Bot reviewed Jun 12, 2026


Comment thread python/requirements_compiled.txt

ray-gardener Bot added the `data` label (Ray Data-related issues) Jun 12, 2026

ArturNiederfahrenhorst mentioned this pull request Jun 12, 2026

[Data] Add read_lerobot datasource for LeRobot v3 datasets #63821

Draft

ArturNiederfahrenhorst and others added 2 commits June 12, 2026 16:19

cursor Bot reviewed Jun 12, 2026


Comment thread python/deplocks/base_extra_testdeps/ray-ml-base_extra_testdeps_py3.10.lock

ArturNiederfahrenhorst and others added 2 commits June 13, 2026 09:03

ArturNiederfahrenhorst requested review from a team as code owners June 15, 2026 11:16