Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
Walkthrough

Adds Kubeflow Trainer v2 fine-tuning documentation and build artifacts: a new Jupyter notebook and MDX guide showing TrainingRuntime + TrainJob usage, a Containerfile and pyproject for the training image, and updates to an existing fine-tuning notebook.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant Kubectl as kubectl/API
    participant TrainerAPI as Trainer v2 API
    participant Runtime as TrainingRuntime (controller)
    participant PVC as Shared PVC
    participant Git as Git/Git LFS
    participant TrainerPod as Trainer container (LlamaFactory)
    rect rgba(200,220,255,0.5)
        User->>Kubectl: apply TrainJob (spec.runtimeRef + overrides)
        Kubectl->>TrainerAPI: create TrainJob
        TrainerAPI->>Runtime: instantiate pipeline steps
    end
    rect rgba(200,255,200,0.5)
        Runtime->>Git: dataset-initializer clones DATASET_URL -> PVC
        Git->>PVC: write dataset
        Runtime->>Git: model-initializer clones BASE_MODEL_URL (Git LFS) -> PVC
        Git->>PVC: write model files
    end
    rect rgba(255,230,200,0.5)
        Runtime->>TrainerPod: start trainer step with env (VC_WORKER_HOSTS, DO_MERGE, ...)
        TrainerPod->>TrainerPod: run llamafactory-cli train (torchrun/deepspeed if multi-node)
        TrainerPod->>Git: optionally push merged model to model repo
        TrainerPod->>TrainerAPI: emit status/metrics (MLflow)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
🧹 Nitpick comments (9)
docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb (8)
327-328: Consider documenting bf16 vs fp16 choice for different GPU architectures.

The configuration uses `fp16: true` and `bf16: false`. While fp16 works on most GPUs, bf16 (bfloat16) provides better numerical stability and is preferred on newer GPU architectures (Ampere and newer: A100, H100, A10, etc.).

Add a comment in the documentation explaining the choice:

```yaml
# Use fp16 for older GPUs (V100, T4, P100)
# For Ampere+ GPUs (A100, H100, A10), consider using bf16 instead:
#   bf16: true
#   fp16: false
bf16: false
fp16: true
```

Or make it configurable via an environment variable that can be overridden in the TrainJob.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 327 - 328, Add a short explanatory comment near the bf16/fp16 config fields explaining when to prefer bf16 vs fp16 (e.g., "Use fp16 for older GPUs like V100/T4/P100; prefer bf16 on Ampere+ GPUs such as A100/H100/A10") and make the setting configurable from the TrainJob (expose bf16 and fp16 as TrainJob parameters or read them from an environment variable so they can be overridden at runtime); reference the bf16 and fp16 keys in the config and the TrainJob parameter names when implementing the change.
131-137: Consider documenting resource limit tuning for large models.

The dataset and model initializers are configured with 4Gi memory limits. For very large models or datasets (tens of GB), Git LFS operations might approach or exceed this limit, causing OOM errors.

The model-initializer already uses a two-step approach (`GIT_LFS_SKIP_SMUDGE=1` then `git lfs pull` at lines 178-179), which helps control memory usage. Consider adding a note in the documentation that users with exceptionally large models should monitor initializer memory usage and adjust limits if needed.

Add a comment or documentation note:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 2
    # Adjust memory limit if cloning very large models (>10GB)
    memory: 4Gi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 131 - 137, Add a brief documentation note next to the Kubernetes pod resources block that mentions tuning memory limits for very large models; update the resources section (the requests/limits block shown) to include a comment advising users to increase the memory limit (currently 4Gi) when cloning or pulling very large models (>10GB), and reference the model-initializer’s two-step workflow (the GIT_LFS_SKIP_SMUDGE=1 and subsequent git lfs pull steps) so readers know why memory may spike and when to increase the limit.
99-114: Consider adding validation for required environment variables.

The script uses `DATASET_URL`, `GIT_USER`, and `GIT_TOKEN` without first validating they are set. If any are missing, the script will fail with unclear error messages. Additionally, the directory existence check at line 106 comes after `rm -rf ${DATASET_NAME}` at line 104, making the conditional logic ineffective.

🛡️ Suggested improvements

```diff
 set -ex
+# Validate required environment variables
+: "${DATASET_URL:?DATASET_URL environment variable is required}"
+: "${GIT_USER:?GIT_USER environment variable is required}"
+: "${GIT_TOKEN:?GIT_TOKEN environment variable is required}"
+
 cd /mnt/models
 DATASET_NAME=$(basename ${DATASET_URL})
 DATASET_URL_NO_HTTPS="${DATASET_URL//https:\/\/}"
 gitauth="${GIT_USER}:${GIT_TOKEN}"
-rm -rf ${DATASET_NAME}
-rm -rf data
 if [ -d ${DATASET_NAME} ]; then
   echo "dataset ${DATASET_NAME} already exists skipping download"
 else
+  rm -rf ${DATASET_NAME}
+  rm -rf data
   git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
 fi
```

Apply the same pattern to the model-initializer (lines 168-184) and trainer (lines 273-384) scripts.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 99 - 114, Validate that DATASET_URL, GIT_USER, and GIT_TOKEN are set at the top of the script (check with -z and exit 1 with a clear message) before using them; fix the ineffective directory check by moving or removing the premature rm -rf ${DATASET_NAME} so the sequence becomes: compute DATASET_NAME, validate env vars, if [ -d "${DATASET_NAME}" ] then echo skipping download else perform git clone using DATASET_URL_NO_HTTPS and gitauth; apply the same env-var validation and directory-check ordering to the model-initializer and trainer scripts (the blocks that reference DATASET_URL, GIT_USER, GIT_TOKEN and perform rm -rf / git clone).
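As a rough sketch of that validation pattern, runnable on its own; the `check` helper and the sample values are illustrative, not code from the notebook:

```shell
# Sample values standing in for the env vars the initializer receives.
DATASET_URL="https://git.example.com/team/demo-dataset.git"
GIT_USER="bot"
GIT_TOKEN="secret"

# check NAME VALUE: print a clear error and fail when VALUE is empty.
check() {
  if [ -z "$2" ]; then
    echo "ERROR: $1 environment variable is required" >&2
    return 1
  fi
}

check DATASET_URL "${DATASET_URL}" \
  && check GIT_USER "${GIT_USER}" \
  && check GIT_TOKEN "${GIT_TOKEN}" \
  && VALIDATED=yes || VALIDATED=no
```

In a real script, failing the check would `exit 1` instead of setting a flag, so the pod fails fast with an actionable message rather than a confusing `git clone` error.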
350-369: Clarify behavior when DO_MERGE is false.

When `DO_MERGE` is set to `"false"`, the script skips the merge step and simply renames `output_models` to `output_models_merged`. The output (LoRA adapters) is still pushed to the model repository. However, the documentation at line 749 states that inference with adapters is not currently supported.

This could confuse users who set `DO_MERGE=false` expecting usable output. Consider adding a warning in the documentation.

Add documentation near line 260 or in the Step 6 section:

> **Important:** If you set `DO_MERGE=false`, the output will contain only LoRA adapters, not a merged model. These adapters **cannot** be used directly with the inference service in the current version. Only set `DO_MERGE=false` if you plan to merge adapters manually or use them outside the Alauda AI platform.
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 350 - 369, Add a clear warning in the Step 6 / around line 260 explaining DO_MERGE behavior: state that when DO_MERGE=="false" the pipeline merely renames output_models to output_models_merged and only produces LoRA adapters (not a merged model), these adapters cannot be used by the current inference service, and users should only set DO_MERGE=false if they intend to merge adapters manually or use them off-platform; reference DO_MERGE, output_models, output_models_merged, and the lf-merge-config.yaml / llamafactory-cli export flow so readers can locate the relevant steps.
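The branching this comment describes can be sketched as follows; the scratch directory stands in for the trainer workspace, and the actual `llamafactory-cli export` merge step is elided:

```shell
# When merging is disabled, the trainer only renames the adapter output
# directory; no merged model is produced, only LoRA adapters.
DO_MERGE="${DO_MERGE:-false}"
WORKDIR=$(mktemp -d)
mkdir -p "${WORKDIR}/output_models"

if [ "${DO_MERGE}" = "true" ]; then
  # Real pipeline: llamafactory-cli export with lf-merge-config.yaml runs here.
  echo "would merge adapters into output_models_merged"
else
  # Rename only: output_models_merged still contains just LoRA adapters.
  mv "${WORKDIR}/output_models" "${WORKDIR}/output_models_merged"
fi
```

Either way the push step uploads `output_models_merged`, which is why the directory name alone does not guarantee a merged model.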
383-383: Consider documenting git commit identity customization.

The automated commit uses a hardcoded git user identity (`AMLSystemUser` / `aml_admin@cpaas.io`). While this is acceptable for system-generated commits, some organizations may want to customize the commit author for audit trails.

Add a note in the documentation or make the git identity configurable via environment variables:

```diff
+# Set GIT_COMMIT_NAME and GIT_COMMIT_EMAIL env vars to customize commit identity
+GIT_COMMIT_NAME="${GIT_COMMIT_NAME:-AMLSystemUser}"
+GIT_COMMIT_EMAIL="${GIT_COMMIT_EMAIL:-aml_admin@cpaas.io}"
 git init
 git checkout -b sft-${push_branch}
 git lfs track *.safetensors
 git add .
-git -c user.name='AMLSystemUser' -c user.email='aml_admin@cpaas.io' commit -am "fine tune push auto commit"
+git -c user.name="${GIT_COMMIT_NAME}" -c user.email="${GIT_COMMIT_EMAIL}" commit -am "fine tune push auto commit"
```
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` at line 383, The notebook currently hardcodes the git commit identity in the command string ("git -c user.name='AMLSystemUser' -c user.email='aml_admin@cpaas.io' commit -am \"fine tune push auto commit\""); update the docs and code to make these values configurable by reading environment variables (e.g., GIT_USER_NAME and GIT_USER_EMAIL) and falling back to the existing defaults, or at minimum add a documentation note explaining how to override the user.name/user.email for audit purposes; ensure you reference and replace the literal git -c user.name/user.email invocation where the commit is constructed so the command uses the env-derived values or documents how to set them.
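A minimal sketch of that env-var fallback; `GIT_COMMIT_NAME` and `GIT_COMMIT_EMAIL` are proposed names, not variables the notebook defines today:

```shell
# Start clean for the demonstration, then apply defaults that match the
# identity currently hardcoded in the notebook.
unset GIT_COMMIT_NAME GIT_COMMIT_EMAIL
GIT_COMMIT_NAME="${GIT_COMMIT_NAME:-AMLSystemUser}"
GIT_COMMIT_EMAIL="${GIT_COMMIT_EMAIL:-aml_admin@cpaas.io}"

# The commit would then run as:
#   git -c user.name="${GIT_COMMIT_NAME}" -c user.email="${GIT_COMMIT_EMAIL}" \
#       commit -am "fine tune push auto commit"
echo "committing as ${GIT_COMMIT_NAME} <${GIT_COMMIT_EMAIL}>"
```

Operators who need audit-friendly commits can then override both variables in the TrainJob spec without editing the script.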
275-291: Nitpick: redundant environment variable assignment.

Line 280 contains a redundant assignment `export RANK=$RANK`; `RANK` is already set from `VC_TASK_INDEX` at line 278. This line can be removed.

🧹 Minor cleanup

```diff
 if [ "${VC_WORKER_HOSTS}" != "" ]; then
   export N_RANKS=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print NF}')
   export RANK=$VC_TASK_INDEX
   export MASTER_HOST=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print $1}')
-  export RANK=$RANK
   export WORLD_SIZE=$N_RANKS
   export NNODES=$N_RANKS
   export NODE_RANK=$RANK
   export MASTER_ADDR=${MASTER_HOST}
   export MASTER_PORT="8888"
```
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 275 - 291, Remove the redundant environment assignment "export RANK=$RANK" in the startup script block where RANK is already set from VC_TASK_INDEX; keep the existing "export RANK=$VC_TASK_INDEX" (the one using VC_TASK_INDEX) and delete the duplicate "export RANK=$RANK" to avoid unnecessary/no-op exports and potential confusion when examining variables like RANK, VC_TASK_INDEX, VC_WORKER_HOSTS, and related WORLD_SIZE/NNODES exports.
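The host-list parsing that the cleaned-up block performs can be exercised standalone; the host names and task index below are sample values, not anything a real cluster injects:

```shell
# Simulate the env the controller injects into each trainer pod.
VC_WORKER_HOSTS="trainer-0.svc,trainer-1.svc,trainer-2.svc"
VC_TASK_INDEX=1

# Count comma-separated hosts and take the first one as the rendezvous master.
N_RANKS=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print NF}')
MASTER_HOST=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print $1}')
RANK=$VC_TASK_INDEX
WORLD_SIZE=$N_RANKS
NNODES=$N_RANKS
NODE_RANK=$RANK
MASTER_ADDR=${MASTER_HOST}
MASTER_PORT="8888"
```

These are exactly the variables torchrun/deepspeed consume for multi-node rendezvous, which is why the duplicate `export RANK=$RANK` is a no-op.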
557-569: Document HAMi GPU virtualization requirement.

The resource specification uses HAMi GPU virtualization resources (`nvidia.com/gpualloc`, `nvidia.com/gpucores`, `nvidia.com/gpumem`) to allocate fractional GPU resources (50% compute, 8GB memory). This requires HAMi to be installed in the cluster.

Users without HAMi should use standard NVIDIA GPU resources instead:

📝 Add documentation note and alternative configuration

Add a note to the prerequisites or this section:

> **Note:** This example uses HAMi GPU virtualization to allocate fractional GPU resources. If HAMi is not installed in your cluster, use standard NVIDIA GPU resources instead:
> ```yaml
> nvidia.com/gpu: 1  # Request 1 full GPU
> ```

Alternative configuration without HAMi:

```yaml
resourcesPerNode:
  limits:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: 1  # Standard NVIDIA GPU resource
  requests:
    cpu: "100m"
    memory: "2Gi"
    nvidia.com/gpu: 1
```
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 557 - 569, The resources example under resourcesPerNode uses HAMi-specific GPU virtualization fields (nvidia.com/gpualloc, nvidia.com/gpucores, nvidia.com/gpumem) which require HAMi; add a short note in the prerequisites or immediately above this resourcesPerNode example explaining that HAMi must be installed to use fractional GPU allocation and show an alternative using the standard NVIDIA resource (nvidia.com/gpu: 1) for both limits and requests; reference the resourcesPerNode block and the HAMi fields (nvidia.com/gpualloc, nvidia.com/gpucores, nvidia.com/gpumem) so reviewers can locate and replace or provide the alternative configuration.
65-66: Document the minimum Kubeflow Trainer v2 version requirement.

The YAML correctly uses `apiVersion: trainer.kubeflow.org/v1alpha1`, which is the current official API for Kubeflow Trainer v2. However, this is an alpha project, and the Kubeflow maintainers explicitly warn that APIs may change. Add a note documenting the minimum required Trainer v2 version for this notebook so readers understand potential compatibility concerns and can verify their cluster supports the expected API.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 65 - 66, Add a short note near the YAML block that declares "apiVersion: trainer.kubeflow.org/v1alpha1" / "kind: TrainingRuntime" documenting the minimum Kubeflow Trainer v2 release required for this notebook (e.g., minimum Trainer v2 version X.Y.Z), and instruct readers how to verify their cluster supports that API (check Trainer v2 release/version or available API resources via kubectl api-versions / kubectl get trainingruntimes). Ensure the note is placed immediately above or below the YAML so readers see the compatibility requirement before applying the manifest.

docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx (1)
16-26: Add critical constraint about repository naming to prerequisites.

The prerequisites table is missing an important constraint documented in the notebook: model and dataset Git repository names must be different from each other. Both are cloned into `/mnt/models/<repo-name>`, so identical names would cause one to overwrite the other, leading to initialization failures.

📝 Suggested addition to prerequisites table

Add a new row to the prerequisites table:

```diff
 | `kubectl` access | `kubectl` configured with permission to create `TrainingRuntime` and `TrainJob` resources in your target namespace |
+| Repository naming | Model and dataset Git repositories must have **different names** — both are cloned into `/mnt/models/<repo-name>`, so identical names will cause conflicts |
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx` around lines 16 - 26, Add a new prerequisite row to the table warning that model and dataset Git repository names must be different to avoid clobbering: state that both repos are cloned into /mnt/models/<repo-name> so using identical repository names will cause one clone to overwrite the other and break initialization; update the prerequisites table in the Kubeflow Trainer v2 doc (the table under "Prerequisites") to include this constraint and give a short example or recommended naming pattern (e.g., append -model and -data) to make the requirement clear.
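The naming constraint could also be enforced with a small pre-flight check along these lines; the URLs are invented for the example and the check itself is a suggestion, not part of the notebook:

```shell
# Both repos land in /mnt/models/<repo-name>, so the basenames must differ.
BASE_MODEL_URL="https://git.example.com/team/qwen-model.git"
DATASET_URL="https://git.example.com/team/qwen-data.git"

# basename NAME SUFFIX strips a trailing .git if present.
model_name=$(basename "${BASE_MODEL_URL}" .git)
dataset_name=$(basename "${DATASET_URL}" .git)

if [ "${model_name}" = "${dataset_name}" ]; then
  echo "ERROR: model and dataset repositories must have different names" >&2
  CONFLICT=yes
else
  CONFLICT=no
fi
```

Running this at the top of either initializer would turn a silent clone-over-clone into an immediate, explained failure.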
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb`:
- Around line 266-267: The MODEL_OUTPUT_DIR env var is defined but unused;
either remove it from the runtime and TrainJob manifests or change the script to
use it consistently: replace hardcoded "output_models" and
"output_models_merged" with a single configurable base path read from
MODEL_OUTPUT_DIR (e.g., use os.environ['MODEL_OUTPUT_DIR'] or equivalent) and
update the merge logic (where output_models_merged is computed/used) and the
push/save logic (where output_models is written/pushed) to reference that env
var-derived path; ensure all occurrences (creation, merge, and push steps) use
the same MODEL_OUTPUT_DIR variable name.
- Around line 99-114: The notebook disables SSL verification for Git operations
(see the git clone/push commands that include "-c http.sslVerify=false" and
references to GIT_USER/GIT_TOKEN and DATASET_URL/ DATASET_URL_NO_HTTPS) which is
a security risk; fix by removing the "-c http.sslVerify=false" flags and instead
configure Git to trust your CA (add the org CA to the container at
/etc/ssl/certs/ or set git config --global http.sslCAInfo /path/to/ca-cert.pem)
or, if temporary, leave the flag but add an explicit WARNING comment next to the
git commands documenting it as a temporary testing-only tradeoff and include
TODO to migrate to proper certs.
- Around line 496-525: The nodeSelector entries are incorrectly applied to the
initializer jobs; remove the nodeSelector: nvidia.com/gpu.product: Tesla-T4
blocks from the targetJobs named dataset-initializer and model-initializer so
those steps run on CPU nodes, and ensure the trainer job retains its
nodeSelector. Locate the YAML blocks under targetJobs -> - name:
dataset-initializer and - name: model-initializer and delete their nodeSelector
sections (leave volumes/containers untouched); confirm only the trainer job has
the GPU nodeSelector.
- Around line 819-821: The docs table incorrectly references
spec.initializer.model.env[DATASET_URL] for the base model URL; update the
documentation to use spec.initializer.model.env[BASE_MODEL_URL] (and ensure the
other table entry for spec.trainer.env[BASE_MODEL_URL] remains correct) so the
Base model URL row consistently points to
spec.initializer.model.env[BASE_MODEL_URL] and spec.trainer.env[BASE_MODEL_URL].
---
Nitpick comments:
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb`:
- Around line 327-328: Add a short explanatory comment near the bf16/fp16 config
fields explaining when to prefer bf16 vs fp16 (e.g., "Use fp16 for older GPUs
like V100/T4/P100; prefer bf16 on Ampere+ GPUs such as A100/H100/A10") and make
the setting configurable from the TrainJob (expose bf16 and fp16 as TrainJob
parameters or read them from an environment variable so they can be overridden
at runtime); reference the bf16 and fp16 keys in the config and the TrainJob
parameter names when implementing the change.
- Around line 131-137: Add a brief documentation note next to the Kubernetes pod
resources block that mentions tuning memory limits for very large models; update
the resources section (the requests/limits block shown) to include a comment
advising users to increase the memory limit (currently 4Gi) when cloning or
pulling very large models (>10GB), and reference the model-initializer’s
two-step workflow (the GIT_LFS_SKIP_SMUDGE=1 and subsequent git lfs pull steps)
so readers know why memory may spike and when to increase the limit.
- Around line 99-114: Validate that DATASET_URL, GIT_USER, and GIT_TOKEN are set
at the top of the script (check with -z and exit 1 with a clear message) before
using them; fix the ineffective directory check by moving or removing the
premature rm -rf ${DATASET_NAME} so the sequence becomes: compute DATASET_NAME,
validate env vars, if [ -d "${DATASET_NAME}" ] then echo skipping download else
perform git clone using DATASET_URL_NO_HTTPS and gitauth; apply the same env-var
validation and directory-check ordering to the model-initializer and trainer
scripts (the blocks that reference DATASET_URL, GIT_USER, GIT_TOKEN and perform
rm -rf / git clone).
- Around line 350-369: Add a clear warning in the Step 6 / around line 260
explaining DO_MERGE behavior: state that when DO_MERGE=="false" the pipeline
merely renames output_models to output_models_merged and only produces LoRA
adapters (not a merged model), these adapters cannot be used by the current
inference service, and users should only set DO_MERGE=false if they intend to
merge adapters manually or use them off-platform; reference DO_MERGE,
output_models, output_models_merged, and the lf-merge-config.yaml /
llamafactory-cli export flow so readers can locate the relevant steps.
- Line 383: The notebook currently hardcodes the git commit identity in the
command string ("git -c user.name='AMLSystemUser' -c
user.email='aml_admin@cpaas.io' commit -am \"fine tune push auto commit\"");
update the docs and code to make these values configurable by reading
environment variables (e.g., GIT_USER_NAME and GIT_USER_EMAIL) and falling back
to the existing defaults, or at minimum add a documentation note explaining how
to override the user.name/user.email for audit purposes; ensure you reference
and replace the literal git -c user.name/user.email invocation where the commit
is constructed so the command uses the env-derived values or documents how to
set them.
- Around line 275-291: Remove the redundant environment assignment "export
RANK=$RANK" in the startup script block where RANK is already set from
VC_TASK_INDEX; keep the existing "export RANK=$VC_TASK_INDEX" (the one using
VC_TASK_INDEX) and delete the duplicate "export RANK=$RANK" to avoid
unnecessary/no-op exports and potential confusion when examining variables like
RANK, VC_TASK_INDEX, VC_WORKER_HOSTS, and related WORLD_SIZE/NNODES exports.
- Around line 557-569: The resources example under resourcesPerNode uses
HAMi-specific GPU virtualization fields (nvidia.com/gpualloc,
nvidia.com/gpucores, nvidia.com/gpumem) which require HAMi; add a short note in
the prerequisites or immediately above this resourcesPerNode example explaining
that HAMi must be installed to use fractional GPU allocation and show an
alternative using the standard NVIDIA resource (nvidia.com/gpu: 1) for both
limits and requests; reference the resourcesPerNode block and the HAMi fields
(nvidia.com/gpualloc, nvidia.com/gpucores, nvidia.com/gpumem) so reviewers can
locate and replace or provide the alternative configuration.
- Around line 65-66: Add a short note near the YAML block that declares
"apiVersion: trainer.kubeflow.org/v1alpha1" / "kind: TrainingRuntime"
documenting the minimum Kubeflow Trainer v2 release required for this notebook
(e.g., minimum Trainer v2 version X.Y.Z), and instruct readers how to verify
their cluster supports that API (check Trainer v2 release/version or available
API resources via kubectl api-versions / kubectl get trainingruntimes). Ensure
the note is placed immediately above or below the YAML so readers see the
compatibility requirement before applying the manifest.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx`:
- Around line 16-26: Add a new prerequisite row to the table warning that model
and dataset Git repository names must be different to avoid clobbering: state
that both repos are cloned into /mnt/models/<repo-name> so using identical
repository names will cause one clone to overwrite the other and break
initialization; update the prerequisites table in the Kubeflow Trainer v2 doc
(the table under "Prerequisites") to include this constraint and give a short
example or recommended naming pattern (e.g., append -model and -data) to make
the requirement clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1b65f7d8-22ab-4bc8-b29e-bf1652222233
📒 Files selected for processing (2)
docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb
docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx
```shell
set -ex
cd /mnt/models
DATASET_NAME=$(basename ${DATASET_URL})
DATASET_URL_NO_HTTPS="${DATASET_URL//https:\/\/}"
gitauth="${GIT_USER}:${GIT_TOKEN}"
rm -rf ${DATASET_NAME}
rm -rf data
if [ -d ${DATASET_NAME} ]; then
  echo "dataset ${DATASET_NAME} already exists skipping download"
else
  git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi
echo "listing files under /mnt/models ..."
ls /mnt/models
echo "listing dataset files ..."
ls ${DATASET_NAME}
```
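The credential-splicing in the script above can be checked in isolation; the URL, user, and token values below are made up for the demonstration:

```shell
# Sample inputs mirroring the initializer's env vars.
DATASET_URL="https://git.example.com/team/demo-dataset.git"
GIT_USER="bot"
GIT_TOKEN="token123"

DATASET_NAME=$(basename "${DATASET_URL}")
# bash ${var//pattern} substitution deletes every "https://" occurrence.
DATASET_URL_NO_HTTPS="${DATASET_URL//https:\/\/}"
gitauth="${GIT_USER}:${GIT_TOKEN}"
CLONE_URL="https://${gitauth}@${DATASET_URL_NO_HTTPS}"
```

Note that the token ends up embedded in the clone URL, which is one more reason the SSL-verification concern below matters.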
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Verify all instances of sslVerify=false in the notebook
rg -n "sslVerify=false" docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb
```

Repository: alauda/aml-docs
Length of output: 654
Security: SSL verification is disabled for Git operations across multiple pipeline steps.

The notebook disables SSL certificate verification at 4 locations (lines 109, 178, 179, and 384) using `-c http.sslVerify=false`, which exposes Git cloning and pushing operations to man-in-the-middle attacks.

Recommended solutions:

- **Preferred:** Configure Git to trust your organization's CA certificate:
  - Add the CA cert to the container image at `/etc/ssl/certs/`
  - Or use `git config --global http.sslCAInfo /path/to/ca-cert.pem`
- **Alternative:** If using self-signed certificates temporarily, document this as a known security trade-off and plan to migrate to properly signed certificates.
- **Quick fix for testing only:** Keep the flag but add a comment warning that it should be removed in production.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 99 -
114, The notebook disables SSL verification for Git operations (see the git
clone/push commands that include "-c http.sslVerify=false" and references to
GIT_USER/GIT_TOKEN and DATASET_URL/ DATASET_URL_NO_HTTPS) which is a security
risk; fix by removing the "-c http.sslVerify=false" flags and instead configure
Git to trust your CA (add the org CA to the container at /etc/ssl/certs/ or set
git config --global http.sslCAInfo /path/to/ca-cert.pem) or, if temporary, leave
the flag but add an explicit WARNING comment next to the git commands
documenting it as a temporary testing-only tradeoff and include TODO to migrate
to proper certs.
```yaml
- name: MODEL_OUTPUT_DIR
  value: /mnt/workspace/output_model
```
Unused environment variable: `MODEL_OUTPUT_DIR` is not referenced in the script.

The `MODEL_OUTPUT_DIR` environment variable is defined here (and again in the TrainJob at lines 553-554) but is never used in the training script. The script uses the hardcoded paths `output_models` (line 316) and `output_models_merged` (lines 359, 368, 373) instead.
This creates confusion and inconsistency in the configuration. Either remove the unused variable or update the script to use it.
🧹 Suggested fix: remove unused variable or use it in the script
Option 1: Remove the unused variable
Remove MODEL_OUTPUT_DIR from both TrainingRuntime (lines 266-267) and TrainJob (lines 553-554).
Option 2: Use the variable in the script
If the intent is to make output paths configurable, update the script to use the environment variable:
```diff
-output_dir: output_models
+output_dir: ${MODEL_OUTPUT_DIR}
```

However, this would also require updating the merge logic at lines 354 and 367-368, and the push logic at line 373.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 266 -
267, The MODEL_OUTPUT_DIR env var is defined but unused; either remove it from
the runtime and TrainJob manifests or change the script to use it consistently:
replace hardcoded "output_models" and "output_models_merged" with a single
configurable base path read from MODEL_OUTPUT_DIR (e.g., use
os.environ['MODEL_OUTPUT_DIR'] or equivalent) and update the merge logic (where
output_models_merged is computed/used) and the push/save logic (where
output_models is written/pushed) to reference that env var-derived path; ensure
all occurrences (creation, merge, and push steps) use the same MODEL_OUTPUT_DIR
variable name.
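Option 2 could look like this in shell; `MERGED_OUTPUT_DIR` is an illustrative name, not one the notebook uses, and the fallback keeps today's hardcoded behavior when the variable is absent:

```shell
# Simulate the variable not being set, then derive every output path from it.
unset MODEL_OUTPUT_DIR
MODEL_OUTPUT_DIR="${MODEL_OUTPUT_DIR:-output_models}"
MERGED_OUTPUT_DIR="${MODEL_OUTPUT_DIR}_merged"
echo "training writes to ${MODEL_OUTPUT_DIR}, merge writes to ${MERGED_OUTPUT_DIR}"
```

With this shape, the TrainingRuntime default and any TrainJob override flow through to the train, merge, and push steps from a single variable.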
```yaml
- targetJobs:
  - name: dataset-initializer
  spec:
    # Step 3: Do the same as step 1.
    volumes:
    - name: models-cache
      persistentVolumeClaim:
        claimName: team-model-cache-pvc
    containers:
    - name: dataset-initializer
      volumeMounts:
      - name: models-cache
        mountPath: /mnt/models
    nodeSelector:
      nvidia.com/gpu.product: Tesla-T4
- targetJobs:
  - name: model-initializer
  spec:
    # Step 4: Do the same as step 1.
    volumes:
    - name: models-cache
      persistentVolumeClaim:
        claimName: team-model-cache-pvc
    containers:
    - name: model-initializer
      volumeMounts:
      - name: models-cache
        mountPath: /mnt/models
    nodeSelector:
      nvidia.com/gpu.product: Tesla-T4
```
Inefficient resource allocation: initializers don't need GPU nodes.
The nodeSelector for dataset-initializer (lines 509-510) and model-initializer (lines 524-525) targets Tesla-T4 GPU nodes, but these initialization steps only run git clone commands and don't require GPUs. Only the trainer job needs GPU resources.
Scheduling initializers on GPU nodes wastes valuable GPU capacity and may delay initialization if GPU nodes are busy.
⚡ Recommended fix: remove nodeSelector from initializers
```diff
 - targetJobs:
   - name: dataset-initializer
   spec:
     volumes:
     - name: models-cache
       persistentVolumeClaim:
         claimName: team-model-cache-pvc
     containers:
     - name: dataset-initializer
       volumeMounts:
       - name: models-cache
         mountPath: /mnt/models
-    nodeSelector:
-      nvidia.com/gpu.product: Tesla-T4
 - targetJobs:
   - name: model-initializer
   spec:
     volumes:
     - name: models-cache
       persistentVolumeClaim:
         claimName: team-model-cache-pvc
     containers:
     - name: model-initializer
       volumeMounts:
       - name: models-cache
         mountPath: /mnt/models
-    nodeSelector:
-      nvidia.com/gpu.product: Tesla-T4
```

Keep the nodeSelector only for the trainer job, which actually needs GPUs.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 - targetJobs:
   - name: dataset-initializer
   spec:
     # Step 3: Do the same as step 1.
     volumes:
     - name: models-cache
       persistentVolumeClaim:
         claimName: team-model-cache-pvc
     containers:
     - name: dataset-initializer
       volumeMounts:
       - name: models-cache
         mountPath: /mnt/models
-    nodeSelector:
-      nvidia.com/gpu.product: Tesla-T4
 - targetJobs:
   - name: model-initializer
   spec:
     # Step 4: Do the same as step 1.
     volumes:
     - name: models-cache
       persistentVolumeClaim:
         claimName: team-model-cache-pvc
     containers:
     - name: model-initializer
       volumeMounts:
       - name: models-cache
         mountPath: /mnt/models
-    nodeSelector:
-      nvidia.com/gpu.product: Tesla-T4
```
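By contrast, the trainer step keeps its GPU selector. A minimal sketch of that override, assuming the trainer's target job is named `node` (the job name is an assumption — match it to the trainer job name defined by your runtime):

```yaml
# Hypothetical override: only the trainer step pins to GPU nodes.
# "node" is an assumed job name; use the trainer job name from your TrainingRuntime.
- targetJobs:
  - name: node
  spec:
    nodeSelector:
      nvidia.com/gpu.product: Tesla-T4
```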
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 496 -
525, The nodeSelector entries are incorrectly applied to the initializer jobs;
remove the nodeSelector: nvidia.com/gpu.product: Tesla-T4 blocks from the
targetJobs named dataset-initializer and model-initializer so those steps run on
CPU nodes, and ensure the trainer job retains its nodeSelector. Locate the YAML
blocks under targetJobs -> - name: dataset-initializer and - name:
model-initializer and delete their nodeSelector sections (leave
volumes/containers untouched); confirm only the trainer job has the GPU
nodeSelector.
```markdown
|---|---|
| Base model URL | `spec.initializer.model.env[DATASET_URL]` and `spec.trainer.env[BASE_MODEL_URL]` in `TrainJob` |
| Dataset URL | `spec.initializer.dataset.env[DATASET_URL]` and `spec.trainer.env[DATASET_URL]` in `TrainJob` |
```
Documentation error: incorrect environment variable name for base model URL.
Line 820 incorrectly states that to change the base model URL, users should modify spec.initializer.model.env[DATASET_URL]. This should be spec.initializer.model.env[BASE_MODEL_URL].
📝 Correction needed

```diff
 | What to change | Where to change it |
 |---|---|
-| Base model URL | `spec.initializer.model.env[DATASET_URL]` and `spec.trainer.env[BASE_MODEL_URL]` in `TrainJob` |
+| Base model URL | `spec.initializer.model.env[BASE_MODEL_URL]` and `spec.trainer.env[BASE_MODEL_URL]` in `TrainJob` |
 | Dataset URL | `spec.initializer.dataset.env[DATASET_URL]` and `spec.trainer.env[DATASET_URL]` in `TrainJob` |
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 819 -
821, The docs table incorrectly references
spec.initializer.model.env[DATASET_URL] for the base model URL; update the
documentation to use spec.initializer.model.env[BASE_MODEL_URL] (and ensure the
other table entry for spec.trainer.env[BASE_MODEL_URL] remains correct) so the
Base model URL row consistently points to
spec.initializer.model.env[BASE_MODEL_URL] and spec.trainer.env[BASE_MODEL_URL].
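A minimal sketch of the corrected wiring, with the base-model URL consistently under `BASE_MODEL_URL` (the repository URL below is a placeholder):

```yaml
# Sketch only; the repository URL is a placeholder value.
spec:
  initializer:
    model:
      env:
      - name: BASE_MODEL_URL   # the doc table mistakenly labeled this DATASET_URL
        value: https://example.com/models/base-model.git
  trainer:
    env:
    - name: BASE_MODEL_URL
      value: https://example.com/models/base-model.git
```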
Actionable comments posted: 3
🧹 Nitpick comments (1)
docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile (1)
`1-24`: Consider documenting the build context requirements. This Containerfile expects `pyproject.toml` and `uv.lock` to be present in the build context (line 21). Users need to know these files must exist alongside the Containerfile. A comment at the top of the file indicating the required files would improve usability.

📝 Suggested documentation comment

```diff
+# Build context requirements:
+# - pyproject.toml (defines Python dependencies)
+# - uv.lock (locked dependency versions)
+#
+# Build: docker build -f fine_tune_with_llamafactory.Containerfile -t <image-name> .
+
 FROM nvidia/cuda:12.6.3-cudnn-devel-ubuntu24.04
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile` around lines 1 - 24, Add a short comment at the top of the Containerfile describing required build-context files: state that pyproject.toml and uv.lock must be present in the same directory (or included in the docker/kaniko build context) because the Dockerfile uses COPY pyproject.toml uv.lock ./; reference this requirement near the COPY instruction (and optionally note any proxy or network assumptions) so users know what to include before running docker build.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile`:
- Line 22: The Dockerfile contains a hardcoded proxy in the RUN line ("RUN
HTTPS_PROXY=http://192.168.144.12:7890 uv sync --frozen") which will break
external builds; change this to use a build-time argument or omit the proxy: add
an ARG HTTPS_PROXY (and optionally ENV HTTPS_PROXY) and replace the inline value
so the RUN uses the ARG (or remove the HTTPS_PROXY prefix entirely if not
required), and document that users can pass --build-arg HTTPS_PROXY=<value> when
building.
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Around line 124-130: The docs Containerfile lists package constraints
("transformers>=4.51.1,<=4.53.3" and "deepspeed~=0.18.8") that conflict with the
Kubeflow pyproject.toml pins ("transformers<=4.51.1" and "deepspeed~=0.16.9");
update the Containerfile entries to match the Kubeflow pins or update the
Kubeflow pyproject.toml to the Containerfile versions, and if you intentionally
keep them different, add a concise note in this document explaining the
divergence and the rationale (referencing the exact package strings
"transformers>=4.51.1,<=4.53.3", "transformers<=4.51.1", "deepspeed~=0.18.8",
and "deepspeed~=0.16.9").
- Around line 104-106: The Docker base image and proxy are hardcoded to internal
resources; replace the literal "FROM
152-231-registry.alauda.cn:60070/mlops/nvidia/pytorch:24.12-py3" with a
parameterized build ARG or a publicly available image (e.g., use an upstream
nvidia/pytorch tag) and replace the literal proxy "192.168.144.12:7890" with a
configurable placeholder (e.g., PROXY or HTTP_PROXY build arg / environment
variable) and clearly label it as a placeholder in the docs so external users
can supply their own values.
---
Nitpick comments:
In
`@docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile`:
- Around line 1-24: Add a short comment at the top of the Containerfile
describing required build-context files: state that pyproject.toml and uv.lock
must be present in the same directory (or included in the docker/kaniko build
context) because the Dockerfile uses COPY pyproject.toml uv.lock ./; reference
this requirement near the COPY instruction (and optionally note any proxy or
network assumptions) so users know what to include before running docker build.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 224c76af-5509-4e0a-9761-c21b3c5feea7
⛔ Files ignored due to path filters (1)
`docs/en/kubeflow/how_to/assets/build-train-image/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- `docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile`
- `docs/en/kubeflow/how_to/assets/build-train-image/pyproject.toml`
- `docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb`
- `docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx`
- `docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`
✅ Files skipped from review due to trivial changes (3)
- docs/en/kubeflow/how_to/assets/build-train-image/pyproject.toml
- docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx
- docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb
```dockerfile
WORKDIR /opt
COPY pyproject.toml uv.lock ./
RUN HTTPS_PROXY=http://192.168.144.12:7890 uv sync --frozen
```
Hardcoded internal proxy IP will cause build failures for external users.
The proxy 192.168.144.12:7890 is an internal resource that won't be accessible to users building this image. Consider parameterizing it or removing it entirely.
🔧 Suggested fix using build argument

```diff
+ARG HTTPS_PROXY=""
 WORKDIR /opt
 COPY pyproject.toml uv.lock ./
-RUN HTTPS_PROXY=http://192.168.144.12:7890 uv sync --frozen
+RUN uv sync --frozen
```

If a proxy is required in certain environments, users can pass `--build-arg HTTPS_PROXY=...` when building.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile`
at line 22, The Dockerfile contains a hardcoded proxy in the RUN line ("RUN
HTTPS_PROXY=http://192.168.144.12:7890 uv sync --frozen") which will break
external builds; change this to use a build-time argument or omit the proxy: add
an ARG HTTPS_PROXY (and optionally ENV HTTPS_PROXY) and replace the inline value
so the RUN uses the ARG (or remove the HTTPS_PROXY prefix entirely if not
required), and document that users can pass --build-arg HTTPS_PROXY=<value> when
building.
```dockerfile
# FROM docker-mirrors.alauda.cn/library/python:3.13-trixie
FROM 152-231-registry.alauda.cn:60070/mlops/nvidia/pytorch:24.12-py3
```
Hardcoded internal registry and proxy will break builds for external users.
Line 105 references an internal registry (152-231-registry.alauda.cn:60070) that external users cannot access. Line 116 hardcodes an internal proxy IP (192.168.144.12:7890). Both should be parameterized or replaced with publicly accessible alternatives.
🔧 Suggested fix using build arguments

```diff
 ARG LLAMA_FACTORY_VERSION="v0.9.4"
-# FROM docker-mirrors.alauda.cn/library/python:3.13-trixie
-FROM 152-231-registry.alauda.cn:60070/mlops/nvidia/pytorch:24.12-py3
+# Replace <your-registry> with your container registry or use a public image
+ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.12-py3
+FROM ${BASE_IMAGE}
```

For the proxy on line 116:

```diff
+ARG HTTPS_PROXY=""
 RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
     cd /opt && \
-    HTTPS_PROXY=http://192.168.144.12:7890 git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
+    git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
```

Based on learnings: placeholder credentials in documentation should be "clearly labeled as placeholders."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 104 -
106, The Docker base image and proxy are hardcoded to internal resources;
replace the literal "FROM
152-231-registry.alauda.cn:60070/mlops/nvidia/pytorch:24.12-py3" with a
parameterized build ARG or a publicly available image (e.g., use an upstream
nvidia/pytorch tag) and replace the literal proxy "192.168.144.12:7890" with a
configurable placeholder (e.g., PROXY or HTTP_PROXY build arg / environment
variable) and clearly label it as a placeholder in the docs so external users
can supply their own values.
```dockerfile
    "transformers>=4.51.1,<=4.53.3" \
    "tokenizers>=0.21.1" \
    "sqlalchemy~=2.0.30" \
    "pymysql~=1.1.1" \
    "loguru~=0.7.2" \
    "mysqlclient~=2.2.7" \
    "deepspeed~=0.18.8" \
```
Version inconsistencies with Kubeflow pyproject.toml will produce divergent environments.
This Containerfile specifies transformers>=4.51.1,<=4.53.3 and deepspeed~=0.18.8, while the Kubeflow pyproject.toml (referenced in context snippet 1) pins transformers<=4.51.1 and deepspeed~=0.16.9. Users following either path will get different package versions, which may cause confusion or incompatibility issues.
Consider aligning the version constraints across both documentation paths, or explicitly document the differences and their rationale.
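To make the divergence concrete, a small stand-alone check (pure Python, version strings taken verbatim from the two files) shows the transformers ranges only meet at a single version, while the deepspeed series never overlap:

```python
def parse(v: str) -> tuple[int, ...]:
    """Split a dotted version string into a comparable tuple, e.g. "0.18.8" -> (0, 18, 8)."""
    return tuple(int(part) for part in v.split("."))

# transformers: the Containerfile allows >=4.51.1,<=4.53.3; pyproject.toml allows <=4.51.1.
container_lo, container_hi = parse("4.51.1"), parse("4.53.3")
pyproject_cap = parse("4.51.1")
shared_hi = min(container_hi, pyproject_cap)
print(shared_hi >= container_lo)      # True -- the ranges do intersect...
print(shared_hi == parse("4.51.1"))   # True -- ...but only at the single version 4.51.1

# deepspeed: "~=0.18.8" pins the 0.18.x series, "~=0.16.9" pins 0.16.x -- no overlap at all.
print(parse("0.18.8")[:2] == parse("0.16.9")[:2])  # False
```

So a resolver following each path lands on different (and for deepspeed, mutually exclusive) versions, which is why aligning the pins or documenting the split matters.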
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 124 -
130, The docs Containerfile lists package constraints
("transformers>=4.51.1,<=4.53.3" and "deepspeed~=0.18.8") that conflict with the
Kubeflow pyproject.toml pins ("transformers<=4.51.1" and "deepspeed~=0.16.9");
update the Containerfile entries to match the Kubeflow pins or update the
Kubeflow pyproject.toml to the Containerfile versions, and if you intentionally
keep them different, add a concise note in this document explaining the
divergence and the rationale (referencing the exact package strings
"transformers>=4.51.1,<=4.53.3", "transformers<=4.51.1", "deepspeed~=0.18.8",
and "deepspeed~=0.16.9").
Deploying alauda-ai with

| | |
|---|---|
| Latest commit: | `ec4fa2b` |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://0f2c258a.alauda-ai.pages.dev |
| Branch Preview URL: | https://add-trainerv2-llm-fine-tunin.alauda-ai.pages.dev |
Summary by CodeRabbit