
[deps] Upgrade HuggingFace stack: datasets 4.x #64054

Open
ArturNiederfahrenhorst wants to merge 26 commits into ray-project:master from ArturNiederfahrenhorst:upgrade-hf-stack-datasets4

Conversation

@ArturNiederfahrenhorst ArturNiederfahrenhorst commented Jun 12, 2026

Contributor

Coordinated bump of Ray's HuggingFace ML stack so it can run on datasets 4.x and huggingface-hub 1.x:

  • datasets: ==3.6.0 -> >=4.0.0,<5.0.0. This reverts the pin from #62926 ("[deps] Pin datasets==3.6.0 in py313 train-requirements"). The dtype regression that pin worked around in from_huggingface -> iter_torch_batches does not recur with datasets 4.x: the CPU train transformers tests pass on this branch.
  • huggingface-hub: >=0.24.0 -> >=1.0,<2.0.
  • transformers: ==4.36.2 -> >=5.0,<6.0, accelerate: ==0.28.0 -> >=1.0. Every transformers 4.x (through 4.57) hard-caps huggingface-hub <1.0, so hf-hub 1.x requires transformers 5.x.

For the Ray Serve docs, this means switching from model.translate(...) to model.generate() ( https://huggingface.co/docs/transformers/en/tasks/translation ), which changes the prose around the affected code and makes the PR larger.
A small number of other places across the libraries also need attention.

Follow-ups

  • diffusers vs. transformers 5 in the ML release (BYOD) image: this PR brings transformers==5.11.0 alongside the pre-existing diffusers==0.12.1 baseline pin in release/ray_release/byod/requirements_ml_byod_3.{10,13}.in. diffusers 0.12.1 targets the transformers 4.x era and should be upgraded

…ers 5.x

Coordinated bump of Ray's HuggingFace ML stack so it can run on datasets 4.x
and huggingface-hub 1.x:

- datasets: ==3.6.0 -> >=4.0.0,<5.0.0 (reverts the ray-project#62926 pin; the dtype
  regression it worked around in from_huggingface -> iter_torch_batches must be
  fixed as part of this change).
- huggingface-hub: >=0.24.0 -> >=1.0,<2.0.
- transformers: ==4.36.2 -> >=5.0,<6.0, accelerate: ==0.28.0 -> >=1.0. Every
  transformers 4.x (through 4.57) hard-caps huggingface-hub <1.0, so hf-hub 1.x
  requires transformers 5.x.

Input-requirements only; the compiled lockfiles (requirements_compiled*.txt) and
deplocks must be regenerated on x86 CI/forge (compile_pip_dependencies skips
aarch64). Resolution verified coherent: the stack resolves to transformers
5.11.0 / datasets 4.8.5 / huggingface-hub 1.19.0 / tokenizers 0.22.2 /
accelerate 1.14.0, and s3fs==2023.12.1's pinned fsspec stays within datasets
4.x's accepted range.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
@ArturNiederfahrenhorst ArturNiederfahrenhorst added the "go" label (add ONLY when ready to merge, run all tests) Jun 12, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates machine learning dependency requirements across several requirements files, upgrading transformers, accelerate, datasets, and huggingface-hub to newer major versions. The reviewer identified that several of these specified major versions (such as transformers 5.x, datasets 4.x, and huggingface-hub 1.x) have not yet been released on PyPI, which will cause dependency resolution to fail and break CI/CD pipelines. It is recommended to revert these pins to the latest available stable versions.

Comment thread python/requirements/ml/core-requirements.txt
Comment thread python/requirements/ml/train-requirements.txt
Comment thread python/requirements/ml/py313/core-requirements.txt
Comment thread python/requirements/ml/py313/train-requirements.txt
pip-compile reuses the existing requirements_compiled*.txt as constraints and
won't auto-upgrade transitive deps, so the stale tokenizers==0.15.2 /
regex==2024.5.15 / accelerate==0.28.0 / datasets==3.6.0 pins blocked the
transformers 5.x resolution. Remove the HF cluster (transformers, tokenizers,
regex, accelerate, huggingface-hub, hf-xet, datasets, safetensors, dill,
multiprocess) so CI's pip-compile re-resolves just those to the modern stack
(pip-tools partial upgrade). The recompiled files will be committed back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
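The temporary line removal described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the script used in the PR: the helper name `drop_pins` is hypothetical, and it assumes the compiled file uses the usual pip-compile layout (one `name==version` line per package, followed by indented `# via` or `--hash` lines).

```python
import re

# Package names taken from the commit message above.
HF_CLUSTER = {
    "transformers", "tokenizers", "regex", "accelerate", "huggingface-hub",
    "hf-xet", "datasets", "safetensors", "dill", "multiprocess",
}

# Matches the package name at the start of an unindented "name==version" line.
_PIN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*==")


def _normalize(name: str) -> str:
    # PEP 503 style normalization so "huggingface_hub" matches "huggingface-hub".
    return re.sub(r"[-_.]+", "-", name).lower()


def drop_pins(compiled_text: str, packages: set[str]) -> str:
    """Hypothetical helper: remove the pins for `packages` plus their indented annotations."""
    targets = {_normalize(p) for p in packages}
    kept = []
    skipping = False
    for line in compiled_text.splitlines():
        match = _PIN.match(line)
        if match:
            # A new pin starts: skip it only if it belongs to the target set.
            skipping = _normalize(match.group(1)) in targets
        elif not line.startswith((" ", "\t")):
            # Any other unindented line (comment, blank, option) ends the entry.
            skipping = False
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + "\n"
```

With the stale pins gone, pip-compile is free to re-resolve only those packages while every other line keeps acting as a constraint.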
Comment thread python/requirements_compiled.txt
@ray-gardener ray-gardener Bot added the "data" label (Ray Data-related issues) Jun 12, 2026
Authoritative recompile from CI's pip-compile jobs (premerge #68150), pulled
from their artifacts. Resolves the HF cluster to the modern stack:
transformers 5.11.0, datasets 4.8.5 (4.0.0 on py3.11), huggingface-hub 1.16.1,
tokenizers 0.22.2, accelerate 1.14.0, safetensors 0.8.0, regex 2026.5.9.

Replaces the temporary line-removal from the previous commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
ArturNiederfahrenhorst and others added 2 commits June 12, 2026 16:19
Regenerated all affected deplocks against the updated requirements_compiled*.txt
via `raydepsets build --all-configs` (bazel-pinned uv, --python-platform=linux,
deterministic). 39 lock files updated to the modern HF stack: datasets 4.8.5,
transformers 5.11.0, huggingface-hub 1.16.1, accelerate 1.14.0, tokenizers 0.22.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Keep the requirement files to bare version specs; the rationale lives in the
PR/commit history, not inline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
ArturNiederfahrenhorst added a commit to shorbaji/ray that referenced this pull request Jun 12, 2026
Brings the HuggingFace stack upgrade (datasets 4.x, huggingface-hub 1.x,
transformers 5.x; PR ray-project#64054) and current master into the lerobot datasource
branch. datasets 4.x / hf-hub 1.x in the shared data lock is what lets
lerobot[dataset] (which requires huggingface-hub>=1.0) resolve there, so the
read_lerobot test can finally run in CI.

Conflict resolved in python/requirements/ml/data-test-requirements.txt: kept
both master's `obstore` and the py>=3.12-gated lerobot stack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
ArturNiederfahrenhorst and others added 2 commits June 13, 2026 09:03
transformers 5.x removed two long-deprecated TrainingArguments kwargs:
- no_cuda  -> use_cpu
- evaluation_strategy -> eval_strategy

Update all call sites (train v1/v2 transformers tests, test_train_usage,
test_local_mode, the transformers/pbt_transformers examples). Internal config
dict keys are left as-is; only the kwargs passed to TrainingArguments change.
Both replacements are valid on transformers 4.x as well.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
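For call sites that build their TrainingArguments kwargs from a dict, the two renames in the commit above could be applied with a small shim like the one below. The helper name `migrate_training_args_kwargs` is hypothetical and not part of the PR; the sketch only rewrites keys and does not import transformers.

```python
# Renames described in the commit message above (removed kwarg -> replacement).
RENAMED_TRAINING_ARGS = {
    "no_cuda": "use_cpu",
    "evaluation_strategy": "eval_strategy",
}


def migrate_training_args_kwargs(kwargs: dict) -> dict:
    """Hypothetical helper: return a copy of kwargs with removed names replaced."""
    migrated = {}
    for key, value in kwargs.items():
        new_key = RENAMED_TRAINING_ARGS.get(key, key)
        if new_key in migrated:
            # Refuse to guess when both the old and the new spelling are present.
            raise ValueError(f"both old and new spelling given for {new_key!r}")
        migrated[new_key] = value
    return migrated


# Intended use (not executed here, requires transformers):
#   TrainingArguments(**migrate_training_args_kwargs(old_kwargs))
```

The PR itself edits the call sites directly, which is the simpler choice when the kwargs are written inline.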
transformers 5 removed the translation_* pipeline tasks and changed BPE
tokenization spacing, breaking the Serve doc examples:

- Translation examples (getting_started models/translator/model_deployment*,
  translator_example, production_guide/text_ml, develop_and_deploy): replace
  pipeline("translation_en_to_*", model="t5-small") with direct
  AutoModelForSeq2SeqLM + AutoTokenizer + generate() using a task prefix
  ("translate English to French/German/Romanian: ").
- streaming_tutorial: the hardcoded GPT-2 chunk asserts pinned exact tokenizer
  output; replace with "streaming happened" (chunk-count) checks.
- Replace exact-match output asserts with robust checks; model output isn't
  stable across transformers versions (which is what broke here).
- Update getting_started / develop-and-deploy prose to describe the generate()
  flow instead of the removed pipeline's [{"translation_text"}] dict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
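A rough sketch of the pipeline() -> generate() replacement described in the commit above. The function names and the generation settings (`num_beams`, `max_new_tokens`) are illustrative assumptions, not the exact code in the Serve examples; running `translate` needs transformers, torch, and a download of t5-small.

```python
def build_t5_prompt(text: str, target_language: str) -> str:
    """Prepend the task prefix that the removed pipeline task used to add."""
    return f"translate English to {target_language}: {text}"


def translate(text: str, target_language: str = "French") -> str:
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # Tokenize the prefixed prompt into PyTorch tensors.
    inputs = tokenizer(build_t5_prompt(text, target_language), return_tensors="pt")
    # Illustrative generation settings; the PR's examples may use different values.
    output_ids = model.generate(
        **inputs, num_beams=4, early_stopping=True, max_new_tokens=40
    )
    # generate() returns a batch of token id sequences; decode the first one.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In a Serve deployment the tokenizer and model would be loaded once in `__init__` rather than on every call; they are loaded inside the function here only to keep the sketch short.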
@ArturNiederfahrenhorst ArturNiederfahrenhorst requested review from a team as code owners June 15, 2026 11:16
Comment thread doc/source/serve/doc_code/getting_started/translator.py Outdated
Comment thread doc/source/serve/doc_code/streaming_tutorial.py Outdated
ArturNiederfahrenhorst and others added 3 commits June 15, 2026 15:41
…5 pipeline task

transformers 5 removed the summarization pipeline task (same as translation;
the supported-task list has no translation/summarization/text2text-generation).
Migrate the Summarizer examples (getting_started/models, getting_started/
translator, production_guide/text_ml) from pipeline("summarization", ...) to
AutoModelForSeq2SeqLM + AutoTokenizer + generate() with a "summarize: " prefix,
preserving the Summarizer->Translator composition. Fix the dead
SummarizationPipeline doc reference. Asserts use robust non-empty checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
evaluate 0.4.3 calls the removed huggingface_hub.hf_api.HfFolder.get_token(),
which breaks GPU train tests (accelerate/glue metrics) under hf-hub 1.x with
AttributeError. evaluate 0.4.6 uses build_hf_headers instead and has an
identical dependency tree. Bump the requirement + compiled files and relock the
affected deplocks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…rmers-5 pipeline task

doc/BUILD.bazel runs every serve doc_code/*.py (gradio-integration.py included),
so the pipeline("summarization", model="t5-small") call here breaks the same way
the getting_started Summarizers did under transformers 5. Switch it to
AutoModelForSeq2SeqLM + generate() with a "summarize: " prefix. Also fix the
remaining dead SummarizationPipeline prose reference in getting_started.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Comment thread doc/source/serve/doc_code/production_guide/text_ml.py Outdated
ArturNiederfahrenhorst and others added 2 commits June 15, 2026 15:52
The evaluate 0.4.3 -> 0.4.6 bump propagates to the ml_torchft release-test byod
lock as well; regenerate it so raydepsets --check stays consistent and the
release image doesn't pull the old evaluate that breaks under hf-hub 1.x.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…l examples

The transformers-5 migration swapped pipeline() for greedy generate(), which
(a) weakened the doc-example asserts to isinstance(x, str), losing the regression
coverage the old exact-string asserts provided, and (b) degraded output quality:
greedy t5-small loops/repeats and truncates mid-phrase without the beam search
the old pipeline applied by default.

Restore num_beams=4, early_stopping=True (the pipeline's behavior) across the
Translator/Summarizer examples, bump the summary to max_new_tokens=20 so it
finishes a phrase, and re-pin the exact transformers-5 outputs (verified by
running t5-small under transformers 5.12). Each assert carries an f"got {x!r}"
message so any divergence self-reports. Update the matching expected-output text
in getting_started.md and kubernetes.md.

Outputs were captured on arm64; CI (x86) confirms them, and the got-message
surfaces any beam-scoring divergence for a one-line re-pin.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

@dstrodtman dstrodtman left a comment

Contributor

One suggestion to simplify text, and confirming that many docs tests are intentionally changed from deterministic output to slightly broader checks (asserts on types, length, etc.)

Would be useful if you can canonize as much of your flow as possible as an agentic skill and commit this somewhere (either docs subsumed in general external dependency updates, or specific instructions for auditing and updating dependencies for docs and tests). Happy to collaborate.

Comment thread doc/source/serve/getting_started.md Outdated
Comment thread doc/source/serve/doc_code/getting_started/model_deployment.py Outdated
Comment thread doc/source/serve/doc_code/getting_started/models.py
Per review, remove the f"got {x!r}" messages added alongside the exact-output
pins; keep the plain equality assertions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

@ArturNiederfahrenhorst ArturNiederfahrenhorst Jun 15, 2026

Contributor Author

See https://github.com/huggingface/transformers/blob/main/MIGRATION_GUIDE_V5.md

Such pipelines have been removed. They prepended a small chunk of text. We can get the original behavior by prepending "translate English to French:".

ArturNiederfahrenhorst and others added 2 commits June 15, 2026 17:54
…gs unchanged

Reproduce the exact pre-migration pipeline() outputs with generate() by passing
t5-small's task_specific_params (num_beams=4, early_stopping, length_penalty,
no_repeat_ngram_size) and decoding with clean_up_tokenization_spaces=False
(transformers 5 flipped that decode default to True). This restores the original
output strings, so the assertions and doc console outputs revert to master
verbatim -- no expected-output changes.

Also drop the injected migration-explanation code comments and the
LANGUAGE_TO_PREFIX dict (reconfigure now mirrors the original if/elif), keeping
the diff to the necessary pipeline() -> tokenizer/generate/decode change plus
prose that described the removed pipeline dict output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
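One way to picture the "pass the task_specific_params to generate()" approach from the commit above: split a task entry into its prompt prefix and the remaining generation kwargs. The helper name is hypothetical, and the dict below is an assumed, illustrative shape of such an entry rather than a verified copy of t5-small's config.

```python
def split_task_params(task_specific_params: dict, task: str) -> tuple[str, dict]:
    """Hypothetical helper: split a task entry into (prompt prefix, generate() kwargs)."""
    params = dict(task_specific_params[task])  # copy so the caller's dict is untouched
    prefix = params.pop("prefix", "")
    return prefix, params


# Assumed shape of a task_specific_params dict; values are illustrative.
EXAMPLE_PARAMS = {
    "translation_en_to_fr": {
        "early_stopping": True,
        "max_length": 300,
        "num_beams": 4,
        "prefix": "translate English to French: ",
    },
}

# Intended use (not executed here, requires transformers and a loaded model):
#   prefix, gen_kwargs = split_task_params(params, "translation_en_to_fr")
#   ids = model.generate(**tokenizer(prefix + text, return_tensors="pt"), **gen_kwargs)
#   out = tokenizer.decode(
#       ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False
#   )
```

Passing `clean_up_tokenization_spaces=False` explicitly at decode time is what the commit relies on to keep the original spacing in the output strings.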
…anslator.py

The summarize generate() in the composition example carried six kwargs. Trim it
to the same shape as the translate() call (num_beams=4, early_stopping=True,
max_length=15) and update the composed example's expected French output to the
string this produces. models.py / text_ml.py keep their faithful reproduction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
# __end_client__

-assert french_text == "c'était le meilleur des temps, c'était le pire des temps ."
+assert french_text == "C'était le meilleur des temps, c'était le pire des temps,"

Contributor Author

I tried to make the model produce exactly the same outputs here, but that requires a bunch of kwargs to model.generate, so I think adjusting the expected output slightly is nicer.

Mirror the translator.py change in the production-guide example: drop the
length_penalty / no_repeat_ngram_size tuning kwargs from the summarize
generate() (keeping the reconfigure-driven min_length / max_length), and update
the French/German expected outputs (asserts, inline comments, and the matching
kubernetes.md console outputs) to the strings this produces.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Comment thread doc/source/serve/doc_code/getting_started/translator.py
ArturNiederfahrenhorst and others added 3 commits June 15, 2026 18:33
gradio-integration.py: trim the summarize generate() to the same 3-kwarg shape
as the other examples (num_beams=4, early_stopping=True, max_length=200); output
is not asserted (the test checks the HTTP 200), and it stays a non-empty summary.

streaming_tutorial.py: the streaming checks were just len(chunks) > 1. Strengthen
them to also require the streamed chunks reconstruct non-empty content
(len(chunks) > 1 and "".join(chunks).strip()), for both the single-stream and the
batched cases, and drop the now-redundant explanatory comment. Avoids pinning
DialoGPT's exact words, which shifted under transformers 5 and are platform-fragile.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
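The strengthened streaming check from the commit above (more than one chunk, and the chunks reconstruct non-empty content) can be written as a small predicate. The function name is hypothetical; the condition itself is the one quoted in the commit message.

```python
def streamed_nonempty(chunks: list[str]) -> bool:
    """True if the response arrived in several chunks that together carry real text."""
    # len(chunks) > 1 shows that streaming happened at all;
    # the joined, stripped text rules out a stream of only whitespace or empty chunks.
    return len(chunks) > 1 and bool("".join(chunks).strip())
```

This deliberately says nothing about which words were generated, which is what keeps it stable across transformers versions and platforms.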
Restore exact-content assertions for the streaming examples (instead of just
len(chunks) > 1), now pinning the actual chunk lists the clients receive under
transformers 5. Empty keep-alive chunks are filtered ([c for c in chunks if c])
since the HTTP client drops them; the remaining list is the deterministic
streamed token sequence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…ade-hf-stack-datasets4

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Comment thread doc/source/serve/doc_code/streaming_tutorial.py Outdated
ArturNiederfahrenhorst and others added 3 commits June 15, 2026 23:38
Collapse the "two models" intro + bullets into a single sentence, as suggested
by @dstrodtman.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
The batchbot self-test pinned the first response to "interstitial error", which
was a bad capture artifact (it came from passing attention_mask, which the
example does not). The example runs generate(input_ids) with right padding and
no mask, which yields the coherent "I'm not a fan of the new look ." Pin that
instead, addressing the reviewer note about asserting nonsensical output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
CI revealed two issues with the exact streamed-chunk pins:
- chatbot: the client appends a "\n" after each turn (print("\n")), so the
  pinned list was missing those entries. Add them, matching the exact tokens
  the deployment yields on x86 (read from the CI logs).
- batchbot: its per-token, sub-word, batched chunks are platform-fragile and we
  have no x86 ground truth (the assert never ran), so assert the robust property
  (multiple chunks reconstructing non-empty content) instead of an exact list.

textbot's exact pin already matches x86 and is kept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8176f94.

Comment thread doc/source/serve/doc_code/getting_started/models.py
CI surfaced these once the cu13/pytorch-index flakes cleared. Two patterns:

- datasets 4.x requires a namespaced repo id: bare load_dataset("glue", ...) /
  "hf://datasets/glue/cola/" now raise HfUriError/RepositoryNotFoundError. Point
  them at nyu-mll/glue (matching the existing deepspeed_torch_trainer_no_raydata
  fix), and yelp_review_full -> Yelp/yelp_review_full.
- transformers 5 removed the TrainingArguments kwargs evaluation_strategy and
  no_cuda; rename to eval_strategy and use_cpu (same migration already applied to
  the train test files).

Files: accelerate/deepspeed torch trainer examples, transformers_torch_trainer_basic,
checkpoints doc_code, and the huggingface_text_classification / pbt_transformers /
intel_gaudi llama / lightning_cola_advanced notebooks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
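The repo-id migration in the commit above amounts to a lookup from bare dataset ids to their namespaced forms. A minimal sketch, assuming only the two mappings named in the commit; the helper name is hypothetical and the PR itself simply edits the string literals.

```python
# Mapping taken from the commit message above: bare id -> namespaced id.
NAMESPACED_DATASET_IDS = {
    "glue": "nyu-mll/glue",
    "yelp_review_full": "Yelp/yelp_review_full",
}


def namespaced_dataset_id(repo_id: str) -> str:
    """Hypothetical helper: map a known bare dataset id to its namespaced form."""
    if "/" in repo_id:
        return repo_id  # already namespaced, leave untouched
    try:
        return NAMESPACED_DATASET_IDS[repo_id]
    except KeyError:
        # Fail loudly instead of guessing a namespace for an unknown dataset.
        raise ValueError(f"no known namespace for dataset id {repo_id!r}") from None


# Intended use (not executed here, requires datasets and network access):
#   load_dataset(namespaced_dataset_id("glue"), "mrpc")
```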
@@ -140,7 +140,7 @@ def collate_fn(batch):
 }

 # Prepare Ray Datasets
-hf_datasets = load_dataset("glue", "mrpc")
+hf_datasets = load_dataset("nyu-mll/glue", "mrpc")

Contributor Author

This is the original dataset, published by https://wp.nyu.edu/ml2/

@@ -19,7 +19,7 @@
 # ====================================================================
 def train_func(config):
     # Datasets
-    dataset = load_dataset("yelp_review_full")
+    dataset = load_dataset("Yelp/yelp_review_full")

Contributor Author

Same as for the NYU one, this one is canonical.

The Gaudi example isn't run in CI (no bazel target) and isn't locally testable
(needs Habana HPU + optimum-habana, which may pin its own transformers and use
GaudiTrainingArguments), so the evaluation_strategy->eval_strategy rename there
was unverifiable. Restore the original to avoid shipping a blind change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Empty commit to re-run premerge so the newly-added continuous-build label is
evaluated, running the py3.11/3.12/3.13 suites that validate the transformers-5
/ datasets-4 / hf-hub-1 upgrade across Python versions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
ArturNiederfahrenhorst added a commit to shorbaji/ray that referenced this pull request Jun 16, 2026
…nto add-lerobot-datasource

Bring the dependency upgrade and example migrations from the upgrade-hf-stack-
datasets4 branch (PR ray-project#64054 work) into the lerobot datasource branch so
read_lerobot can build on datasets 4.x / huggingface-hub 1.x.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
transformers 5 removed the deprecated `tokenizer=` kwarg on `Trainer`
(replaced by `processing_class=` since 4.46), which raised
TypeError: Trainer.__init__() got an unexpected keyword argument 'tokenizer'
in the train v1 gpu notebook test (build #68368).

Update the two notebooks that still passed it:
- transformers/huggingface_text_classification.ipynb (the failing CI test)
- deepspeed/gptj_deepspeed_fine_tuning.ipynb (gated test; same latent bug)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
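For code that has to run against both older and newer transformers, the `tokenizer=` -> `processing_class=` change above could be handled by inspecting the constructor signature. This is a sketch under assumptions: the helper name is hypothetical, the two classes are stand-ins so the example runs without transformers, and the PR itself just renames the kwarg in the notebooks.

```python
import inspect


def tokenizer_kwarg_name(trainer_cls) -> str:
    """Hypothetical helper: pick the keyword a Trainer-like class accepts for the tokenizer."""
    params = inspect.signature(trainer_cls.__init__).parameters
    # Prefer the new name whenever it exists, even if the old one is still accepted.
    return "processing_class" if "processing_class" in params else "tokenizer"


# Stand-ins for the two API generations (not the real transformers.Trainer).
class OldStyleTrainer:
    def __init__(self, model=None, tokenizer=None):
        self.tokenizer = tokenizer


class NewStyleTrainer:
    def __init__(self, model=None, processing_class=None):
        self.processing_class = processing_class


# Intended use with the real class (not executed here):
#   Trainer(model=model, **{tokenizer_kwarg_name(Trainer): tokenizer})
```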
ArturNiederfahrenhorst added a commit to shorbaji/ray that referenced this pull request Jun 16, 2026
…datasource

Brings the Trainer(tokenizer=) -> processing_class= fix (PR ray-project#64054) so the
train notebook tests pass on transformers 5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Labels

continuous-build; data (Ray Data-related issues); go (add ONLY when ready to merge, run all tests)

2 participants