[deps] Upgrade HuggingFace stack: datasets 4.x #64054
ArturNiederfahrenhorst wants to merge 26 commits into
…ers 5.x
Coordinated bump of Ray's HuggingFace ML stack so it can run on datasets 4.x and huggingface-hub 1.x:
- datasets: ==3.6.0 -> >=4.0.0,<5.0.0 (reverts the ray-project#62926 pin; the dtype regression it worked around in from_huggingface -> iter_torch_batches must be fixed as part of this change).
- huggingface-hub: >=0.24.0 -> >=1.0,<2.0.
- transformers: ==4.36.2 -> >=5.0,<6.0, accelerate: ==0.28.0 -> >=1.0. Every transformers 4.x (through 4.57) hard-caps huggingface-hub <1.0, so hf-hub 1.x requires transformers 5.x.
Input-requirements only; the compiled lockfiles (requirements_compiled*.txt) and deplocks must be regenerated on x86 CI/forge (compile_pip_dependencies skips aarch64).
Resolution verified coherent: the stack resolves to transformers 5.11.0 / datasets 4.8.5 / huggingface-hub 1.19.0 / tokenizers 0.22.2 / accelerate 1.14.0, and s3fs==2023.12.1's pinned fsspec stays within datasets 4.x's accepted range.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
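As a rough illustration of the "resolution verified coherent" claim, the sketch below checks the resolved versions quoted in this commit message against the new input specs. The version numbers are copied from the message; the `parse` and `satisfies` helpers are hypothetical and only handle plain numeric releases (a real check would use a proper version library).

```python
import re

# New input specs from the commit message:
# (lower bound, inclusive), (upper bound, exclusive or None).
SPECS = {
    "datasets": ((4, 0, 0), (5, 0, 0)),
    "huggingface-hub": ((1, 0), (2, 0)),
    "transformers": ((5, 0), (6, 0)),
    "accelerate": ((1, 0), None),  # >=1.0, no upper cap
}

# Versions the commit message reports the resolver picked.
RESOLVED = {
    "datasets": "4.8.5",
    "huggingface-hub": "1.19.0",
    "transformers": "5.11.0",
    "accelerate": "1.14.0",
}


def parse(version: str) -> tuple:
    """Turn '4.8.5' into (4, 8, 5); pre-release tags are not handled."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def satisfies(version: str, lower: tuple, upper) -> bool:
    """Check lower <= version < upper using tuple comparison."""
    parsed = parse(version)
    return parsed >= lower and (upper is None or parsed < upper)


violations = [
    name
    for name, (lower, upper) in SPECS.items()
    if not satisfies(RESOLVED[name], lower, upper)
]
print(violations)
```

The old pins (for example datasets 3.6.0 or accelerate 0.28.0) fail the same check, which is the incompatibility the bump removes.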
Code Review
This pull request updates machine learning dependency requirements across several requirements files, upgrading transformers, accelerate, datasets, and huggingface-hub to newer major versions. The reviewer identified that several of these specified major versions (such as transformers 5.x, datasets 4.x, and huggingface-hub 1.x) have not yet been released on PyPI, which will cause dependency resolution to fail and break CI/CD pipelines. It is recommended to revert these pins to the latest available stable versions.
pip-compile reuses the existing requirements_compiled*.txt as constraints and won't auto-upgrade transitive deps, so the stale tokenizers==0.15.2 / regex==2024.5.15 / accelerate==0.28.0 / datasets==3.6.0 pins blocked the transformers 5.x resolution. Remove the HF cluster (transformers, tokenizers, regex, accelerate, huggingface-hub, hf-xet, datasets, safetensors, dill, multiprocess) so CI's pip-compile re-resolves just those to the modern stack (pip-tools partial upgrade). The recompiled files will be committed back. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
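The "remove the HF cluster" step can be pictured with a small script like the one below. This is a sketch only: it assumes the usual pip-compile output layout (a `name==version` line followed by indented `# via` or `--hash` continuation lines), and `drop_pins` is a hypothetical helper, not the tooling used in this PR. The `numpy` line in the sample is purely illustrative.

```python
import re

# Package names listed in the commit message.
HF_CLUSTER = {
    "transformers", "tokenizers", "regex", "accelerate", "huggingface-hub",
    "hf-xet", "datasets", "safetensors", "dill", "multiprocess",
}


def normalize(name: str) -> str:
    """PEP 503 normalization so 'huggingface_hub' matches 'huggingface-hub'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def drop_pins(lines, packages):
    """Return lines without the pins for `packages` and their continuation lines."""
    targets = {normalize(p) for p in packages}
    kept, skipping = [], False
    for line in lines:
        if line.startswith((" ", "\t")):
            # Indented line: belongs to the most recent pin.
            if not skipping:
                kept.append(line)
            continue
        match = re.match(r"([A-Za-z0-9][A-Za-z0-9._-]*)\s*==", line)
        skipping = bool(match) and normalize(match.group(1)) in targets
        if not skipping:
            kept.append(line)
    return kept


compiled = [
    "numpy==1.26.4",  # illustrative non-HF pin
    "    # via -r requirements.txt",
    "tokenizers==0.15.2",
    "    # via transformers",
    "datasets==3.6.0",
    "    # via -r requirements.txt",
]
print(drop_pins(compiled, HF_CLUSTER))
```

With the stale pins gone, a subsequent pip-compile run is free to re-resolve only those packages while every other pin keeps acting as a constraint.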
Authoritative recompile from CI's pip-compile jobs (premerge #68150), pulled from their artifacts. Resolves the HF cluster to the modern stack: transformers 5.11.0, datasets 4.8.5 (4.0.0 on py3.11), huggingface-hub 1.16.1, tokenizers 0.22.2, accelerate 1.14.0, safetensors 0.8.0, regex 2026.5.9. Replaces the temporary line-removal from the previous commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Regenerated all affected deplocks against the updated requirements_compiled*.txt via `raydepsets build --all-configs` (bazel-pinned uv, --python-platform=linux, deterministic). 39 lock files updated to the modern HF stack: datasets 4.8.5, transformers 5.11.0, huggingface-hub 1.16.1, accelerate 1.14.0, tokenizers 0.22.2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Keep the requirement files to bare version specs; the rationale lives in the PR/commit history, not inline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Brings the HuggingFace stack upgrade (datasets 4.x, huggingface-hub 1.x, transformers 5.x; PR ray-project#64054) and current master into the lerobot datasource branch. datasets 4.x / hf-hub 1.x in the shared data lock is what lets lerobot[dataset] (which requires huggingface-hub>=1.0) resolve there, so the read_lerobot test can finally run in CI. Conflict resolved in python/requirements/ml/data-test-requirements.txt: kept both master's `obstore` and the py>=3.12-gated lerobot stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
transformers 5.x removed two long-deprecated TrainingArguments kwargs:
- no_cuda -> use_cpu
- evaluation_strategy -> eval_strategy
Update all call sites (train v1/v2 transformers tests, test_train_usage, test_local_mode, the transformers/pbt_transformers examples). Internal config dict keys are left as-is; only the kwargs passed to TrainingArguments change. Both replacements are valid on transformers 4.x as well.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
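For code that builds its TrainingArguments kwargs from a dict, the rename could be applied mechanically along these lines. This is a minimal sketch: `migrate_training_kwargs` is a hypothetical helper, and the PR itself edits the call sites directly rather than adding a shim.

```python
# Old -> new TrainingArguments kwarg names, as listed in the commit message.
RENAMED_KWARGS = {
    "no_cuda": "use_cpu",
    "evaluation_strategy": "eval_strategy",
}


def migrate_training_kwargs(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with removed names replaced by their successors."""
    migrated = {}
    for key, value in kwargs.items():
        new_key = RENAMED_KWARGS.get(key, key)
        if new_key in migrated:
            # Caller passed both spellings; refuse to guess which one wins.
            raise ValueError(f"both old and new spelling given for {new_key!r}")
        migrated[new_key] = value
    return migrated


old_style = {"output_dir": "out", "no_cuda": True, "evaluation_strategy": "epoch"}
print(migrate_training_kwargs(old_style))
```

The values carry over unchanged, since only the keyword names differ.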
transformers 5 removed the translation_* pipeline tasks and changed BPE
tokenization spacing, breaking the Serve doc examples:
- Translation examples (getting_started models/translator/model_deployment*,
translator_example, production_guide/text_ml, develop_and_deploy): replace
pipeline("translation_en_to_*", model="t5-small") with direct
AutoModelForSeq2SeqLM + AutoTokenizer + generate() using a task prefix
("translate English to French/German/Romanian: ").
- streaming_tutorial: the hardcoded GPT-2 chunk asserts pinned exact tokenizer
output; replace with "streaming happened" (chunk-count) checks.
- Replace exact-match output asserts with robust checks; model output isn't
stable across transformers versions (which is what broke here).
- Update getting_started / develop-and-deploy prose to describe the generate()
flow instead of the removed pipeline's [{"translation_text"}] dict.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
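The pipeline-to-generate() migration described above can be sketched roughly as follows. The task prefixes are the ones quoted in the commit message; `build_prompt`, `translate`, the `TASK_PREFIXES` dict, and the `max_new_tokens=40` setting are illustrative assumptions, not the code that landed in the Serve examples. Calling `translate` needs transformers, torch, and access to the t5-small weights.

```python
# Task prefixes quoted in the commit message; the dict itself is illustrative.
TASK_PREFIXES = {
    "french": "translate English to French: ",
    "german": "translate English to German: ",
    "romanian": "translate English to Romanian: ",
}


def build_prompt(text: str, language: str = "french") -> str:
    """Prepend the T5 task prefix that the removed pipeline used to add."""
    return TASK_PREFIXES[language] + text


def translate(text: str, language: str = "french") -> str:
    """Translate with t5-small via tokenizer + generate() + decode."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    inputs = tokenizer(build_prompt(text, language), return_tensors="pt")
    # max_new_tokens is an assumed value chosen for this sketch.
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(build_prompt("Hello world!", "german"))
```

Loading the model inside `translate` keeps the sketch short; a Serve deployment would load it once in `__init__`.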
…5 pipeline task
transformers 5 removed the summarization pipeline task (same as translation;
the supported-task list has no translation/summarization/text2text-generation).
Migrate the Summarizer examples (getting_started/models, getting_started/
translator, production_guide/text_ml) from pipeline("summarization", ...) to
AutoModelForSeq2SeqLM + AutoTokenizer + generate() with a "summarize: " prefix,
preserving the Summarizer->Translator composition. Fix the dead
SummarizationPipeline doc reference. Asserts use robust non-empty checks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
evaluate 0.4.3 calls the removed huggingface_hub.hf_api.HfFolder.get_token(), which breaks GPU train tests (accelerate/glue metrics) under hf-hub 1.x with AttributeError. evaluate 0.4.6 uses build_hf_headers instead and has an identical dependency tree. Bump the requirement + compiled files and relock the affected deplocks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…rmers-5 pipeline task
doc/BUILD.bazel runs every serve doc_code/*.py (gradio-integration.py included),
so the pipeline("summarization", model="t5-small") call here breaks the same way
the getting_started Summarizers did under transformers 5. Switch it to
AutoModelForSeq2SeqLM + generate() with a "summarize: " prefix. Also fix the
remaining dead SummarizationPipeline prose reference in getting_started.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
The evaluate 0.4.3 -> 0.4.6 bump propagates to the ml_torchft release-test byod lock as well; regenerate it so raydepsets --check stays consistent and the release image doesn't pull the old evaluate that breaks under hf-hub 1.x. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…l examples
The transformers-5 migration swapped pipeline() for greedy generate(), which
(a) weakened the doc-example asserts to isinstance(x, str), losing the regression
coverage the old exact-string asserts provided, and (b) degraded output quality:
greedy t5-small loops/repeats and truncates mid-phrase without the beam search
the old pipeline applied by default.
Restore num_beams=4, early_stopping=True (the pipeline's behavior) across the
Translator/Summarizer examples, bump the summary to max_new_tokens=20 so it
finishes a phrase, and re-pin the exact transformers-5 outputs (verified by
running t5-small under transformers 5.12). Each assert carries an f"got {x!r}"
message so any divergence self-reports. Update the matching expected-output text
in getting_started.md and kubernetes.md.
Outputs were captured on arm64; CI (x86) confirms them, and the got-message
surfaces any beam-scoring divergence for a one-line re-pin.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
dstrodtman left a comment
One suggestion to simplify text, and confirming that many docs tests are intentionally changed from deterministic output to slightly broader checks (asserts on types, length, etc.)
Would be useful if you can canonize as much of your flow as possible as an agentic skill and commit this somewhere (either docs subsumed in general external dependency updates, or specific instructions for auditing and updating dependencies for docs and tests). Happy to collaborate.
Per review, remove the f"got {x!r}" messages added alongside the exact-output
pins; keep the plain equality assertions.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")
See https://github.com/huggingface/transformers/blob/main/MIGRATION_GUIDE_V5.md
Such pipelines have been removed. They prepended a small chunk of text. We can get the original behavior by prepending "translate English to French:".
…gs unchanged Reproduce the exact pre-migration pipeline() outputs with generate() by passing t5-small's task_specific_params (num_beams=4, early_stopping, length_penalty, no_repeat_ngram_size) and decoding with clean_up_tokenization_spaces=False (transformers 5 flipped that decode default to True). This restores the original output strings, so the assertions and doc console outputs revert to master verbatim -- no expected-output changes. Also drop the injected migration-explanation code comments and the LANGUAGE_TO_PREFIX dict (reconfigure now mirrors the original if/elif), keeping the diff to the necessary pipeline() -> tokenizer/generate/decode change plus prose that described the removed pipeline dict output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…anslator.py The summarize generate() in the composition example carried six kwargs. Trim it to the same shape as the translate() call (num_beams=4, early_stopping=True, max_length=15) and update the composed example's expected French output to the string this produces. models.py / text_ml.py keep their faithful reproduction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
 # __end_client__

-assert french_text == "c'était le meilleur des temps, c'était le pire des temps ."
+assert french_text == "C'était le meilleur des temps, c'était le pire des temps,"
I tried to make the model produce the same exact outputs here but that requires a bunch of kwargs to model.generate so I think adjusting the expected output slightly is nicer here.
Mirror the translator.py change in the production-guide example: drop the length_penalty / no_repeat_ngram_size tuning kwargs from the summarize generate() (keeping the reconfigure-driven min_length / max_length), and update the French/German expected outputs (asserts, inline comments, and the matching kubernetes.md console outputs) to the strings this produces. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
gradio-integration.py: trim the summarize generate() to the same 3-kwarg shape as the other examples (num_beams=4, early_stopping=True, max_length=200); output is not asserted (the test checks the HTTP 200), and it stays a non-empty summary. streaming_tutorial.py: the streaming checks were just len(chunks) > 1. Strengthen them to also require the streamed chunks reconstruct non-empty content (len(chunks) > 1 and "".join(chunks).strip()), for both the single-stream and the batched cases, and drop the now-redundant explanatory comment. Avoids pinning DialoGPT's exact words, which shifted under transformers 5 and are platform-fragile. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Restore exact-content assertions for the streaming examples (instead of just len(chunks) > 1), now pinning the actual chunk lists the clients receive under transformers 5. Empty keep-alive chunks are filtered ([c for c in chunks if c]) since the HTTP client drops them; the remaining list is the deterministic streamed token sequence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
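The two checks discussed in these streaming commits (filtering empty keep-alive chunks, and the looser "multiple chunks that reconstruct non-empty content" property) can be sketched as below. The helper names and the sample chunk list are hypothetical and are not real model output.

```python
def non_empty_chunks(chunks):
    """Drop empty keep-alive chunks, mirroring [c for c in chunks if c]."""
    return [c for c in chunks if c]


def streamed_content_ok(chunks) -> bool:
    """True if more than one chunk arrived and together they carry real text."""
    return len(chunks) > 1 and bool("".join(chunks).strip())


# Hypothetical stream with keep-alive gaps.
received = ["Hello", "", " world", ""]
filtered = non_empty_chunks(received)
print(filtered, streamed_content_ok(filtered))
```

An exact-list assertion on `filtered` gives stronger regression coverage when the output is deterministic; the looser check is the fallback when chunk boundaries vary by platform.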
…ade-hf-stack-datasets4 Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Collapse the "two models" intro + bullets into a single sentence, as suggested by @dstrodtman. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
The batchbot self-test pinned the first response to "interstitial error", which was a bad capture artifact (it came from passing attention_mask, which the example does not). The example runs generate(input_ids) with right padding and no mask, which yields the coherent "I'm not a fan of the new look ." Pin that instead, addressing the reviewer note about asserting nonsensical output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
CI revealed two issues with the exact streamed-chunk pins:
- chatbot: the client appends a "\n" after each turn (print("\n")), so the
pinned list was missing those entries. Add them, matching the exact tokens
the deployment yields on x86 (read from the CI logs).
- batchbot: its per-token, sub-word, batched chunks are platform-fragile and we
have no x86 ground truth (the assert never ran), so assert the robust property
(multiple chunks reconstructing non-empty content) instead of an exact list.
textbot's exact pin already matches x86 and is kept.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8176f94.
CI surfaced these once the cu13/pytorch-index flakes cleared. Two patterns:
- datasets 4.x requires a namespaced repo id: bare load_dataset("glue", ...) /
"hf://datasets/glue/cola/" now raise HfUriError/RepositoryNotFoundError. Point
them at nyu-mll/glue (matching the existing deepspeed_torch_trainer_no_raydata
fix), and yelp_review_full -> Yelp/yelp_review_full.
- transformers 5 removed the TrainingArguments kwargs evaluation_strategy and
no_cuda; rename to eval_strategy and use_cpu (same migration already applied to
the train test files).
Files: accelerate/deepspeed torch trainer examples, transformers_torch_trainer_basic,
checkpoints doc_code, and the huggingface_text_classification / pbt_transformers /
intel_gaudi llama / lightning_cola_advanced notebooks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
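The repo-id change amounts to a lookup from bare dataset names to their namespaced equivalents. The sketch below uses only the two mappings named in this commit; `namespaced_repo_id` is a hypothetical helper, and any other bare name would need its namespace looked up on the Hub.

```python
# Bare name -> namespaced repo id, as listed in the commit message.
NAMESPACED_IDS = {
    "glue": "nyu-mll/glue",
    "yelp_review_full": "Yelp/yelp_review_full",
}


def namespaced_repo_id(repo_id: str) -> str:
    """Return a namespaced repo id, leaving already-namespaced ids untouched."""
    if "/" in repo_id:
        return repo_id
    try:
        return NAMESPACED_IDS[repo_id]
    except KeyError:
        raise ValueError(f"no known namespace for dataset {repo_id!r}") from None


print(namespaced_repo_id("glue"))
```

A call such as `load_dataset(namespaced_repo_id("glue"), "mrpc")` would then target `nyu-mll/glue`; the examples in this PR hard-code the namespaced string instead.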
@@ -140,7 +140,7 @@ def collate_fn(batch):
 }

 # Prepare Ray Datasets
-hf_datasets = load_dataset("glue", "mrpc")
+hf_datasets = load_dataset("nyu-mll/glue", "mrpc")
This is the original by https://wp.nyu.edu/ml2/
@@ -19,7 +19,7 @@
 # ====================================================================
 def train_func(config):
     # Datasets
-    dataset = load_dataset("yelp_review_full")
+    dataset = load_dataset("Yelp/yelp_review_full")
Same as for the NYU one, this one is canonical.
The Gaudi example isn't run in CI (no bazel target) and isn't locally testable (needs Habana HPU + optimum-habana, which may pin its own transformers and use GaudiTrainingArguments), so the evaluation_strategy->eval_strategy rename there was unverifiable. Restore the original to avoid shipping a blind change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Empty commit to re-run premerge so the newly-added continuous-build label is evaluated, running the py3.11/3.12/3.13 suites that validate the transformers-5 / datasets-4 / hf-hub-1 upgrade across Python versions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…nto add-lerobot-datasource Bring the dependency upgrade and example migrations from the upgrade-hf-stack- datasets4 branch (PR ray-project#64054 work) into the lerobot datasource branch so read_lerobot can build on datasets 4.x / huggingface-hub 1.x. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
transformers 5 removed the deprecated `tokenizer=` kwarg on `Trainer` (replaced by `processing_class=` since 4.46), which raised TypeError: Trainer.__init__() got an unexpected keyword argument 'tokenizer' in the train v1 gpu notebook test (build #68368). Update the two notebooks that still passed it:
- transformers/huggingface_text_classification.ipynb (the failing CI test)
- deepspeed/gptj_deepspeed_fine_tuning.ipynb (gated test; same latent bug)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
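If a notebook had to run against both older and newer transformers releases, the kwarg name could be chosen by inspecting the constructor signature. The sketch below demonstrates the idea on two stand-in classes rather than the real `Trainer`; `processing_kwarg_name` and both stub classes are hypothetical, and this PR simply renames the kwarg instead.

```python
import inspect


def processing_kwarg_name(trainer_cls) -> str:
    """Pick 'processing_class' when the constructor accepts it, else 'tokenizer'."""
    params = inspect.signature(trainer_cls.__init__).parameters
    return "processing_class" if "processing_class" in params else "tokenizer"


class OldStyleTrainer:
    """Stand-in for a Trainer that only knows the `tokenizer=` kwarg."""

    def __init__(self, model=None, tokenizer=None):
        self.tokenizer = tokenizer


class NewStyleTrainer:
    """Stand-in for a Trainer that takes `processing_class=` instead."""

    def __init__(self, model=None, processing_class=None):
        self.processing_class = processing_class


for cls in (OldStyleTrainer, NewStyleTrainer):
    kwargs = {processing_kwarg_name(cls): "my-tokenizer"}
    cls(**kwargs)  # would raise TypeError if the wrong name were chosen
    print(cls.__name__, kwargs)
```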
…datasource Brings the Trainer(tokenizer=) -> processing_class= fix (PR ray-project#64054) so the train notebook tests pass on transformers 5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Coordinated bump of Ray's HuggingFace ML stack so it can run on datasets 4.x and huggingface-hub 1.x:
For the Ray Serve docs, this means switching from the removed translation and summarization pipelines (for example pipeline("translation_en_to_fr", model="t5-small")) to model.generate() ( https://huggingface.co/docs/transformers/en/tasks/translation ). That changes the prose around the affected code examples and makes the PR bigger.
Also, there is a small number of other places across the libraries that need attention.
Follow-ups
transformers==5.11.0 alongside the pre-existing diffusers==0.12.1 baseline pin in release/ray_release/byod/requirements_ml_byod_3.{10,13}.in. diffusers 0.12.1 targets the transformers 4.x era and should be upgraded