Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
Walkthrough

Adds Kubeflow Trainer v2 fine-tuning documentation and build artifacts: a new Jupyter notebook and MDX guide showing TrainingRuntime + TrainJob usage, a Containerfile and pyproject for the training image, and updates to an existing fine-tuning notebook.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant Kubectl as kubectl/API
    participant TrainerAPI as Trainer v2 API
    participant Runtime as TrainingRuntime (controller)
    participant PVC as Shared PVC
    participant Git as Git/Git LFS
    participant TrainerPod as Trainer container (LlamaFactory)
    rect rgba(200,220,255,0.5)
        User->>Kubectl: apply TrainJob (spec.runtimeRef + overrides)
        Kubectl->>TrainerAPI: create TrainJob
        TrainerAPI->>Runtime: instantiate pipeline steps
    end
    rect rgba(200,255,200,0.5)
        Runtime->>Git: dataset-initializer clones DATASET_URL -> PVC
        Git->>PVC: write dataset
        Runtime->>Git: model-initializer clones BASE_MODEL_URL (Git LFS) -> PVC
        Git->>PVC: write model files
    end
    rect rgba(255,230,200,0.5)
        Runtime->>TrainerPod: start trainer step with env (VC_WORKER_HOSTS, DO_MERGE, ...)
        TrainerPod->>TrainerPod: run llamafactory-cli train (torchrun/deepspeed if multi-node)
        TrainerPod->>Git: optionally push merged model to model repo
        TrainerPod->>TrainerAPI: emit status/metrics (MLflow)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
🧹 Nitpick comments (9)
docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb (8)
327-328: Consider documenting bf16 vs fp16 choice for different GPU architectures.

The configuration uses `fp16: true` and `bf16: false`. While fp16 works on most GPUs, bf16 (bfloat16) provides better numerical stability and is preferred on newer GPU architectures (Ampere and newer: A100, H100, A10, etc.).

Add a comment in the documentation explaining the choice:

```yaml
# Use fp16 for older GPUs (V100, T4, P100)
# For Ampere+ GPUs (A100, H100, A10), consider using bf16 instead:
#   bf16: true
#   fp16: false
bf16: false
fp16: true
```

Or make it configurable via an environment variable that can be overridden in the TrainJob.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 327 - 328, Add a short explanatory comment near the bf16/fp16 config fields explaining when to prefer bf16 vs fp16 (e.g., "Use fp16 for older GPUs like V100/T4/P100; prefer bf16 on Ampere+ GPUs such as A100/H100/A10") and make the setting configurable from the TrainJob (expose bf16 and fp16 as TrainJob parameters or read them from an environment variable so they can be overridden at runtime); reference the bf16 and fp16 keys in the config and the TrainJob parameter names when implementing the change.
131-137: Consider documenting resource limit tuning for large models.

The dataset and model initializers are configured with 4Gi memory limits. For very large models or datasets (tens of GB), Git LFS operations might approach or exceed this limit, causing OOM errors.

The model-initializer already uses a two-step approach (`GIT_LFS_SKIP_SMUDGE=1` then `git lfs pull` at lines 178-179), which helps control memory usage. Consider adding a note in the documentation that users with exceptionally large models should monitor initializer memory usage and adjust limits if needed.

Add a comment or documentation note:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 2
    # Adjust memory limit if cloning very large models (>10GB)
    memory: 4Gi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 131 - 137, Add a brief documentation note next to the Kubernetes pod resources block that mentions tuning memory limits for very large models; update the resources section (the requests/limits block shown) to include a comment advising users to increase the memory limit (currently 4Gi) when cloning or pulling very large models (>10GB), and reference the model-initializer’s two-step workflow (the GIT_LFS_SKIP_SMUDGE=1 and subsequent git lfs pull steps) so readers know why memory may spike and when to increase the limit.
99-114: Consider adding validation for required environment variables.

The script uses `DATASET_URL`, `GIT_USER`, and `GIT_TOKEN` without first validating they are set. If any are missing, the script will fail with unclear error messages. Additionally, the directory existence check at line 106 comes after `rm -rf ${DATASET_NAME}` at line 104, making the conditional logic ineffective.

🛡️ Suggested improvements

```diff
 set -ex
+# Validate required environment variables
+: "${DATASET_URL:?DATASET_URL environment variable is required}"
+: "${GIT_USER:?GIT_USER environment variable is required}"
+: "${GIT_TOKEN:?GIT_TOKEN environment variable is required}"
+
 cd /mnt/models
 DATASET_NAME=$(basename ${DATASET_URL})
 DATASET_URL_NO_HTTPS="${DATASET_URL//https:\/\/}"
 gitauth="${GIT_USER}:${GIT_TOKEN}"
-rm -rf ${DATASET_NAME}
-rm -rf data
 if [ -d ${DATASET_NAME} ]; then
   echo "dataset ${DATASET_NAME} already exists skipping download"
 else
+  rm -rf ${DATASET_NAME}
+  rm -rf data
   git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
 fi
```

Apply the same pattern to the model-initializer (lines 168-184) and trainer (lines 273-384) scripts.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 99 - 114, Validate that DATASET_URL, GIT_USER, and GIT_TOKEN are set at the top of the script (check with -z and exit 1 with a clear message) before using them; fix the ineffective directory check by moving or removing the premature rm -rf ${DATASET_NAME} so the sequence becomes: compute DATASET_NAME, validate env vars, if [ -d "${DATASET_NAME}" ] then echo skipping download else perform git clone using DATASET_URL_NO_HTTPS and gitauth; apply the same env-var validation and directory-check ordering to the model-initializer and trainer scripts (the blocks that reference DATASET_URL, GIT_USER, GIT_TOKEN and perform rm -rf / git clone).
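As a rough sketch of that validation pattern, runnable on its own; the `check` helper and the sample values are illustrative, not code from the notebook:

```shell
# Sample values standing in for the env vars the initializer receives.
DATASET_URL="https://git.example.com/team/demo-dataset.git"
GIT_USER="bot"
GIT_TOKEN="secret"

# check NAME VALUE: print a clear error and fail when VALUE is empty.
check() {
  if [ -z "$2" ]; then
    echo "ERROR: $1 environment variable is required" >&2
    return 1
  fi
}

check DATASET_URL "${DATASET_URL}" \
  && check GIT_USER "${GIT_USER}" \
  && check GIT_TOKEN "${GIT_TOKEN}" \
  && VALIDATED=yes || VALIDATED=no
```

In a real script, failing the check would `exit 1` instead of setting a flag, so the pod fails fast with an actionable message rather than a confusing `git clone` error.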
350-369: Clarify behavior when DO_MERGE is false.

When `DO_MERGE` is set to `"false"`, the script skips the merge step and simply renames `output_models` to `output_models_merged`. The output (LoRA adapters) is still pushed to the model repository. However, the documentation at line 749 states that inference with adapters is not currently supported.

This could confuse users who set `DO_MERGE=false` expecting usable output. Consider adding a warning in the documentation.

Add documentation near line 260 or in the Step 6 section:

> **Important:** If you set `DO_MERGE=false`, the output will contain only LoRA adapters, not a merged model. These adapters **cannot** be used directly with the inference service in the current version. Only set `DO_MERGE=false` if you plan to merge adapters manually or use them outside the Alauda AI platform.
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 350 - 369, Add a clear warning in the Step 6 / around line 260 explaining DO_MERGE behavior: state that when DO_MERGE=="false" the pipeline merely renames output_models to output_models_merged and only produces LoRA adapters (not a merged model), these adapters cannot be used by the current inference service, and users should only set DO_MERGE=false if they intend to merge adapters manually or use them off-platform; reference DO_MERGE, output_models, output_models_merged, and the lf-merge-config.yaml / llamafactory-cli export flow so readers can locate the relevant steps.
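The branching this comment describes can be sketched as follows; the scratch directory stands in for the trainer workspace, and the actual `llamafactory-cli export` merge step is elided:

```shell
# When merging is disabled, the trainer only renames the adapter output
# directory; no merged model is produced, only LoRA adapters.
DO_MERGE="${DO_MERGE:-false}"
WORKDIR=$(mktemp -d)
mkdir -p "${WORKDIR}/output_models"

if [ "${DO_MERGE}" = "true" ]; then
  # Real pipeline: llamafactory-cli export with lf-merge-config.yaml runs here.
  echo "would merge adapters into output_models_merged"
else
  # Rename only: output_models_merged still contains just LoRA adapters.
  mv "${WORKDIR}/output_models" "${WORKDIR}/output_models_merged"
fi
```

Either way the push step uploads `output_models_merged`, which is why the directory name alone does not guarantee a merged model.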
383-383: Consider documenting git commit identity customization.

The automated commit uses a hardcoded git user identity (`AMLSystemUser` / `aml_admin@cpaas.io`). While this is acceptable for system-generated commits, some organizations may want to customize the commit author for audit trails.

Add a note in the documentation or make the git identity configurable via environment variables:

```diff
+# Set GIT_COMMIT_NAME and GIT_COMMIT_EMAIL env vars to customize commit identity
+GIT_COMMIT_NAME="${GIT_COMMIT_NAME:-AMLSystemUser}"
+GIT_COMMIT_EMAIL="${GIT_COMMIT_EMAIL:-aml_admin@cpaas.io}"
 git init
 git checkout -b sft-${push_branch}
 git lfs track *.safetensors
 git add .
-git -c user.name='AMLSystemUser' -c user.email='aml_admin@cpaas.io' commit -am "fine tune push auto commit"
+git -c user.name="${GIT_COMMIT_NAME}" -c user.email="${GIT_COMMIT_EMAIL}" commit -am "fine tune push auto commit"
```
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` at line 383, The notebook currently hardcodes the git commit identity in the command string ("git -c user.name='AMLSystemUser' -c user.email='aml_admin@cpaas.io' commit -am \"fine tune push auto commit\""); update the docs and code to make these values configurable by reading environment variables (e.g., GIT_USER_NAME and GIT_USER_EMAIL) and falling back to the existing defaults, or at minimum add a documentation note explaining how to override the user.name/user.email for audit purposes; ensure you reference and replace the literal git -c user.name/user.email invocation where the commit is constructed so the command uses the env-derived values or documents how to set them.
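A minimal sketch of that env-var fallback; `GIT_COMMIT_NAME` and `GIT_COMMIT_EMAIL` are proposed names, not variables the notebook defines today:

```shell
# Start clean for the demonstration, then apply defaults that match the
# identity currently hardcoded in the notebook.
unset GIT_COMMIT_NAME GIT_COMMIT_EMAIL
GIT_COMMIT_NAME="${GIT_COMMIT_NAME:-AMLSystemUser}"
GIT_COMMIT_EMAIL="${GIT_COMMIT_EMAIL:-aml_admin@cpaas.io}"

# The commit would then run as:
#   git -c user.name="${GIT_COMMIT_NAME}" -c user.email="${GIT_COMMIT_EMAIL}" \
#       commit -am "fine tune push auto commit"
echo "committing as ${GIT_COMMIT_NAME} <${GIT_COMMIT_EMAIL}>"
```

Operators who need audit-friendly commits can then override both variables in the TrainJob spec without editing the script.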
275-291: Nitpick: redundant environment variable assignment.

Line 280 contains a redundant assignment `export RANK=$RANK`; `RANK` is already set from `VC_TASK_INDEX` at line 278. This line can be removed.

🧹 Minor cleanup

```diff
 if [ "${VC_WORKER_HOSTS}" != "" ]; then
   export N_RANKS=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print NF}')
   export RANK=$VC_TASK_INDEX
   export MASTER_HOST=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print $1}')
-  export RANK=$RANK
   export WORLD_SIZE=$N_RANKS
   export NNODES=$N_RANKS
   export NODE_RANK=$RANK
   export MASTER_ADDR=${MASTER_HOST}
   export MASTER_PORT="8888"
```
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 275 - 291, Remove the redundant environment assignment "export RANK=$RANK" in the startup script block where RANK is already set from VC_TASK_INDEX; keep the existing "export RANK=$VC_TASK_INDEX" (the one using VC_TASK_INDEX) and delete the duplicate "export RANK=$RANK" to avoid unnecessary/no-op exports and potential confusion when examining variables like RANK, VC_TASK_INDEX, VC_WORKER_HOSTS, and related WORLD_SIZE/NNODES exports.
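The host-list parsing that the cleaned-up block performs can be exercised standalone; the host names and task index below are sample values, not anything a real cluster injects:

```shell
# Simulate the env the controller injects into each trainer pod.
VC_WORKER_HOSTS="trainer-0.svc,trainer-1.svc,trainer-2.svc"
VC_TASK_INDEX=1

# Count comma-separated hosts and take the first one as the rendezvous master.
N_RANKS=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print NF}')
MASTER_HOST=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print $1}')
RANK=$VC_TASK_INDEX
WORLD_SIZE=$N_RANKS
NNODES=$N_RANKS
NODE_RANK=$RANK
MASTER_ADDR=${MASTER_HOST}
MASTER_PORT="8888"
```

These are exactly the variables torchrun/deepspeed consume for multi-node rendezvous, which is why the duplicate `export RANK=$RANK` is a no-op.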
557-569: Document HAMi GPU virtualization requirement.

The resource specification uses HAMi GPU virtualization resources (`nvidia.com/gpualloc`, `nvidia.com/gpucores`, `nvidia.com/gpumem`) to allocate fractional GPU resources (50% compute, 8GB memory). This requires HAMi to be installed in the cluster.

Users without HAMi should use standard NVIDIA GPU resources instead:

📝 Add documentation note and alternative configuration

Add a note to the prerequisites or this section:

> **Note:** This example uses HAMi GPU virtualization to allocate fractional GPU resources. If HAMi is not installed in your cluster, use standard NVIDIA GPU resources instead:
> ```yaml
> nvidia.com/gpu: 1  # Request 1 full GPU
> ```

Alternative configuration without HAMi:

```yaml
resourcesPerNode:
  limits:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: 1  # Standard NVIDIA GPU resource
  requests:
    cpu: "100m"
    memory: "2Gi"
    nvidia.com/gpu: 1
```
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 557 - 569, The resources example under resourcesPerNode uses HAMi-specific GPU virtualization fields (nvidia.com/gpualloc, nvidia.com/gpucores, nvidia.com/gpumem) which require HAMi; add a short note in the prerequisites or immediately above this resourcesPerNode example explaining that HAMi must be installed to use fractional GPU allocation and show an alternative using the standard NVIDIA resource (nvidia.com/gpu: 1) for both limits and requests; reference the resourcesPerNode block and the HAMi fields (nvidia.com/gpualloc, nvidia.com/gpucores, nvidia.com/gpumem) so reviewers can locate and replace or provide the alternative configuration.
65-66: Document the minimum Kubeflow Trainer v2 version requirement.

The YAML correctly uses `apiVersion: trainer.kubeflow.org/v1alpha1`, which is the current official API for Kubeflow Trainer v2. However, this is an alpha project, and the Kubeflow maintainers explicitly warn that APIs may change. Add a note documenting the minimum required Trainer v2 version for this notebook so readers understand potential compatibility concerns and can verify their cluster supports the expected API.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 65 - 66, Add a short note near the YAML block that declares "apiVersion: trainer.kubeflow.org/v1alpha1" / "kind: TrainingRuntime" documenting the minimum Kubeflow Trainer v2 release required for this notebook (e.g., minimum Trainer v2 version X.Y.Z), and instruct readers how to verify their cluster supports that API (check Trainer v2 release/version or available API resources via kubectl api-versions / kubectl get trainingruntimes). Ensure the note is placed immediately above or below the YAML so readers see the compatibility requirement before applying the manifest.

docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx (1)
16-26: Add critical constraint about repository naming to prerequisites.

The prerequisites table is missing an important constraint documented in the notebook: model and dataset Git repository names must be different from each other. Both are cloned into `/mnt/models/<repo-name>`, so identical names would cause one to overwrite the other, leading to initialization failures.

📝 Suggested addition to prerequisites table

Add a new row to the prerequisites table:

```diff
 | `kubectl` access | `kubectl` configured with permission to create `TrainingRuntime` and `TrainJob` resources in your target namespace |
+| Repository naming | Model and dataset Git repositories must have **different names** — both are cloned into `/mnt/models/<repo-name>`, so identical names will cause conflicts |
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx` around lines 16 - 26, Add a new prerequisite row to the table warning that model and dataset Git repository names must be different to avoid clobbering: state that both repos are cloned into /mnt/models/<repo-name> so using identical repository names will cause one clone to overwrite the other and break initialization; update the prerequisites table in the Kubeflow Trainer v2 doc (the table under "Prerequisites") to include this constraint and give a short example or recommended naming pattern (e.g., append -model and -data) to make the requirement clear.
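The naming constraint could also be enforced with a small pre-flight check along these lines; the URLs are invented for the example and the check itself is a suggestion, not part of the notebook:

```shell
# Both repos land in /mnt/models/<repo-name>, so the basenames must differ.
BASE_MODEL_URL="https://git.example.com/team/qwen-model.git"
DATASET_URL="https://git.example.com/team/qwen-data.git"

# basename NAME SUFFIX strips a trailing .git if present.
model_name=$(basename "${BASE_MODEL_URL}" .git)
dataset_name=$(basename "${DATASET_URL}" .git)

if [ "${model_name}" = "${dataset_name}" ]; then
  echo "ERROR: model and dataset repositories must have different names" >&2
  CONFLICT=yes
else
  CONFLICT=no
fi
```

Running this at the top of either initializer would turn a silent clone-over-clone into an immediate, explained failure.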
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb`:
- Around line 266-267: The MODEL_OUTPUT_DIR env var is defined but unused;
either remove it from the runtime and TrainJob manifests or change the script to
use it consistently: replace hardcoded "output_models" and
"output_models_merged" with a single configurable base path read from
MODEL_OUTPUT_DIR (e.g., use os.environ['MODEL_OUTPUT_DIR'] or equivalent) and
update the merge logic (where output_models_merged is computed/used) and the
push/save logic (where output_models is written/pushed) to reference that env
var-derived path; ensure all occurrences (creation, merge, and push steps) use
the same MODEL_OUTPUT_DIR variable name.
- Around line 99-114: The notebook disables SSL verification for Git operations
(see the git clone/push commands that include "-c http.sslVerify=false" and
references to GIT_USER/GIT_TOKEN and DATASET_URL/ DATASET_URL_NO_HTTPS) which is
a security risk; fix by removing the "-c http.sslVerify=false" flags and instead
configure Git to trust your CA (add the org CA to the container at
/etc/ssl/certs/ or set git config --global http.sslCAInfo /path/to/ca-cert.pem)
or, if temporary, leave the flag but add an explicit WARNING comment next to the
git commands documenting it as a temporary testing-only tradeoff and include
TODO to migrate to proper certs.
- Around line 496-525: The nodeSelector entries are incorrectly applied to the
initializer jobs; remove the nodeSelector: nvidia.com/gpu.product: Tesla-T4
blocks from the targetJobs named dataset-initializer and model-initializer so
those steps run on CPU nodes, and ensure the trainer job retains its
nodeSelector. Locate the YAML blocks under targetJobs -> - name:
dataset-initializer and - name: model-initializer and delete their nodeSelector
sections (leave volumes/containers untouched); confirm only the trainer job has
the GPU nodeSelector.
- Around line 819-821: The docs table incorrectly references
spec.initializer.model.env[DATASET_URL] for the base model URL; update the
documentation to use spec.initializer.model.env[BASE_MODEL_URL] (and ensure the
other table entry for spec.trainer.env[BASE_MODEL_URL] remains correct) so the
Base model URL row consistently points to
spec.initializer.model.env[BASE_MODEL_URL] and spec.trainer.env[BASE_MODEL_URL].
---
Nitpick comments:
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb`:
- Around line 327-328: Add a short explanatory comment near the bf16/fp16 config
fields explaining when to prefer bf16 vs fp16 (e.g., "Use fp16 for older GPUs
like V100/T4/P100; prefer bf16 on Ampere+ GPUs such as A100/H100/A10") and make
the setting configurable from the TrainJob (expose bf16 and fp16 as TrainJob
parameters or read them from an environment variable so they can be overridden
at runtime); reference the bf16 and fp16 keys in the config and the TrainJob
parameter names when implementing the change.
- Around line 131-137: Add a brief documentation note next to the Kubernetes pod
resources block that mentions tuning memory limits for very large models; update
the resources section (the requests/limits block shown) to include a comment
advising users to increase the memory limit (currently 4Gi) when cloning or
pulling very large models (>10GB), and reference the model-initializer’s
two-step workflow (the GIT_LFS_SKIP_SMUDGE=1 and subsequent git lfs pull steps)
so readers know why memory may spike and when to increase the limit.
- Around line 99-114: Validate that DATASET_URL, GIT_USER, and GIT_TOKEN are set
at the top of the script (check with -z and exit 1 with a clear message) before
using them; fix the ineffective directory check by moving or removing the
premature rm -rf ${DATASET_NAME} so the sequence becomes: compute DATASET_NAME,
validate env vars, if [ -d "${DATASET_NAME}" ] then echo skipping download else
perform git clone using DATASET_URL_NO_HTTPS and gitauth; apply the same env-var
validation and directory-check ordering to the model-initializer and trainer
scripts (the blocks that reference DATASET_URL, GIT_USER, GIT_TOKEN and perform
rm -rf / git clone).
- Around line 350-369: Add a clear warning in the Step 6 / around line 260
explaining DO_MERGE behavior: state that when DO_MERGE=="false" the pipeline
merely renames output_models to output_models_merged and only produces LoRA
adapters (not a merged model), these adapters cannot be used by the current
inference service, and users should only set DO_MERGE=false if they intend to
merge adapters manually or use them off-platform; reference DO_MERGE,
output_models, output_models_merged, and the lf-merge-config.yaml /
llamafactory-cli export flow so readers can locate the relevant steps.
- Line 383: The notebook currently hardcodes the git commit identity in the
command string ("git -c user.name='AMLSystemUser' -c
user.email='aml_admin@cpaas.io' commit -am \"fine tune push auto commit\"");
update the docs and code to make these values configurable by reading
environment variables (e.g., GIT_USER_NAME and GIT_USER_EMAIL) and falling back
to the existing defaults, or at minimum add a documentation note explaining how
to override the user.name/user.email for audit purposes; ensure you reference
and replace the literal git -c user.name/user.email invocation where the commit
is constructed so the command uses the env-derived values or documents how to
set them.
- Around line 275-291: Remove the redundant environment assignment "export
RANK=$RANK" in the startup script block where RANK is already set from
VC_TASK_INDEX; keep the existing "export RANK=$VC_TASK_INDEX" (the one using
VC_TASK_INDEX) and delete the duplicate "export RANK=$RANK" to avoid
unnecessary/no-op exports and potential confusion when examining variables like
RANK, VC_TASK_INDEX, VC_WORKER_HOSTS, and related WORLD_SIZE/NNODES exports.
- Around line 557-569: The resources example under resourcesPerNode uses
HAMi-specific GPU virtualization fields (nvidia.com/gpualloc,
nvidia.com/gpucores, nvidia.com/gpumem) which require HAMi; add a short note in
the prerequisites or immediately above this resourcesPerNode example explaining
that HAMi must be installed to use fractional GPU allocation and show an
alternative using the standard NVIDIA resource (nvidia.com/gpu: 1) for both
limits and requests; reference the resourcesPerNode block and the HAMi fields
(nvidia.com/gpualloc, nvidia.com/gpucores, nvidia.com/gpumem) so reviewers can
locate and replace or provide the alternative configuration.
- Around line 65-66: Add a short note near the YAML block that declares
"apiVersion: trainer.kubeflow.org/v1alpha1" / "kind: TrainingRuntime"
documenting the minimum Kubeflow Trainer v2 release required for this notebook
(e.g., minimum Trainer v2 version X.Y.Z), and instruct readers how to verify
their cluster supports that API (check Trainer v2 release/version or available
API resources via kubectl api-versions / kubectl get trainingruntimes). Ensure
the note is placed immediately above or below the YAML so readers see the
compatibility requirement before applying the manifest.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx`:
- Around line 16-26: Add a new prerequisite row to the table warning that model
and dataset Git repository names must be different to avoid clobbering: state
that both repos are cloned into /mnt/models/<repo-name> so using identical
repository names will cause one clone to overwrite the other and break
initialization; update the prerequisites table in the Kubeflow Trainer v2 doc
(the table under "Prerequisites") to include this constraint and give a short
example or recommended naming pattern (e.g., append -model and -data) to make
the requirement clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1b65f7d8-22ab-4bc8-b29e-bf1652222233
📒 Files selected for processing (2)
docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb
docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx
```shell
set -ex
cd /mnt/models
DATASET_NAME=$(basename ${DATASET_URL})
DATASET_URL_NO_HTTPS="${DATASET_URL//https:\/\/}"
gitauth="${GIT_USER}:${GIT_TOKEN}"
rm -rf ${DATASET_NAME}
rm -rf data
if [ -d ${DATASET_NAME} ]; then
  echo "dataset ${DATASET_NAME} already exists skipping download"
else
  git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi
echo "listing files under /mnt/models ..."
ls /mnt/models
echo "listing dataset files ..."
ls ${DATASET_NAME}
```
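The credential-splicing in the script above can be checked in isolation; the URL, user, and token values below are made up for the demonstration:

```shell
# Sample inputs mirroring the initializer's env vars.
DATASET_URL="https://git.example.com/team/demo-dataset.git"
GIT_USER="bot"
GIT_TOKEN="token123"

DATASET_NAME=$(basename "${DATASET_URL}")
# bash ${var//pattern} substitution deletes every "https://" occurrence.
DATASET_URL_NO_HTTPS="${DATASET_URL//https:\/\/}"
gitauth="${GIT_USER}:${GIT_TOKEN}"
CLONE_URL="https://${gitauth}@${DATASET_URL_NO_HTTPS}"
```

Note that the token ends up embedded in the clone URL, which is one more reason the SSL-verification concern below matters.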
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Verify all instances of sslVerify=false in the notebook
rg -n "sslVerify=false" docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb
```

Repository: alauda/aml-docs
Length of output: 654
Security: SSL verification is disabled for Git operations across multiple pipeline steps.

The notebook disables SSL certificate verification at 4 locations (lines 109, 178, 179, and 384) using `-c http.sslVerify=false`, which exposes Git cloning and pushing operations to man-in-the-middle attacks.

Recommended solutions:

- **Preferred:** Configure Git to trust your organization's CA certificate:
  - Add the CA cert to the container image at `/etc/ssl/certs/`
  - Or use `git config --global http.sslCAInfo /path/to/ca-cert.pem`
- **Alternative:** If using self-signed certificates temporarily, document this as a known security trade-off and plan to migrate to properly signed certificates.
- **Quick fix for testing only:** Keep the flag but add a comment warning that it should be removed in production.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 99 -
114, The notebook disables SSL verification for Git operations (see the git
clone/push commands that include "-c http.sslVerify=false" and references to
GIT_USER/GIT_TOKEN and DATASET_URL/ DATASET_URL_NO_HTTPS) which is a security
risk; fix by removing the "-c http.sslVerify=false" flags and instead configure
Git to trust your CA (add the org CA to the container at /etc/ssl/certs/ or set
git config --global http.sslCAInfo /path/to/ca-cert.pem) or, if temporary, leave
the flag but add an explicit WARNING comment next to the git commands
documenting it as a temporary testing-only tradeoff and include TODO to migrate
to proper certs.
```yaml
- name: MODEL_OUTPUT_DIR
  value: /mnt/workspace/output_model
```
Unused environment variable: `MODEL_OUTPUT_DIR` is not referenced in the script.

The `MODEL_OUTPUT_DIR` environment variable is defined here (and again in the TrainJob at lines 553-554) but is never used in the training script. The script uses the hardcoded paths `output_models` (line 316) and `output_models_merged` (lines 359, 368, 373) instead.
This creates confusion and inconsistency in the configuration. Either remove the unused variable or update the script to use it.
🧹 Suggested fix: remove unused variable or use it in the script
Option 1: Remove the unused variable
Remove MODEL_OUTPUT_DIR from both TrainingRuntime (lines 266-267) and TrainJob (lines 553-554).
Option 2: Use the variable in the script
If the intent is to make output paths configurable, update the script to use the environment variable:
```diff
-output_dir: output_models
+output_dir: ${MODEL_OUTPUT_DIR}
```

However, this would also require updating the merge logic at lines 354 and 367-368, and the push logic at line 373.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 266 -
267, The MODEL_OUTPUT_DIR env var is defined but unused; either remove it from
the runtime and TrainJob manifests or change the script to use it consistently:
replace hardcoded "output_models" and "output_models_merged" with a single
configurable base path read from MODEL_OUTPUT_DIR (e.g., use
os.environ['MODEL_OUTPUT_DIR'] or equivalent) and update the merge logic (where
output_models_merged is computed/used) and the push/save logic (where
output_models is written/pushed) to reference that env var-derived path; ensure
all occurrences (creation, merge, and push steps) use the same MODEL_OUTPUT_DIR
variable name.
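Option 2 could look like this in shell; `MERGED_OUTPUT_DIR` is an illustrative name, not one the notebook uses, and the fallback keeps today's hardcoded behavior when the variable is absent:

```shell
# Simulate the variable not being set, then derive every output path from it.
unset MODEL_OUTPUT_DIR
MODEL_OUTPUT_DIR="${MODEL_OUTPUT_DIR:-output_models}"
MERGED_OUTPUT_DIR="${MODEL_OUTPUT_DIR}_merged"
echo "training writes to ${MODEL_OUTPUT_DIR}, merge writes to ${MERGED_OUTPUT_DIR}"
```

With this shape, the TrainingRuntime default and any TrainJob override flow through to the train, merge, and push steps from a single variable.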
```yaml
- targetJobs:
  - name: dataset-initializer
  spec:
    # Step 3: Do the same as step 1.
    volumes:
    - name: models-cache
      persistentVolumeClaim:
        claimName: team-model-cache-pvc
    containers:
    - name: dataset-initializer
      volumeMounts:
      - name: models-cache
        mountPath: /mnt/models
    nodeSelector:
      nvidia.com/gpu.product: Tesla-T4
- targetJobs:
  - name: model-initializer
  spec:
    # Step 4: Do the same as step 1.
    volumes:
    - name: models-cache
      persistentVolumeClaim:
        claimName: team-model-cache-pvc
    containers:
    - name: model-initializer
      volumeMounts:
      - name: models-cache
        mountPath: /mnt/models
    nodeSelector:
      nvidia.com/gpu.product: Tesla-T4
```
Inefficient resource allocation: initializers don't need GPU nodes.
The nodeSelector for dataset-initializer (lines 509-510) and model-initializer (lines 524-525) targets Tesla-T4 GPU nodes, but these initialization steps only run git clone commands and don't require GPUs. Only the trainer job needs GPU resources.
Scheduling initializers on GPU nodes wastes valuable GPU capacity and may delay initialization if GPU nodes are busy.
⚡ Recommended fix: remove nodeSelector from initializers
```diff
 - targetJobs:
   - name: dataset-initializer
   spec:
     volumes:
     - name: models-cache
       persistentVolumeClaim:
         claimName: team-model-cache-pvc
     containers:
     - name: dataset-initializer
       volumeMounts:
       - name: models-cache
         mountPath: /mnt/models
-    nodeSelector:
-      nvidia.com/gpu.product: Tesla-T4
 - targetJobs:
   - name: model-initializer
   spec:
     volumes:
     - name: models-cache
       persistentVolumeClaim:
         claimName: team-model-cache-pvc
     containers:
     - name: model-initializer
       volumeMounts:
       - name: models-cache
         mountPath: /mnt/models
-    nodeSelector:
-      nvidia.com/gpu.product: Tesla-T4
```

Keep the nodeSelector only for the trainer job, which actually needs GPUs.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 - targetJobs:
   - name: dataset-initializer
   spec:
     # Step 3: Do the same as step 1.
     volumes:
     - name: models-cache
       persistentVolumeClaim:
         claimName: team-model-cache-pvc
     containers:
     - name: dataset-initializer
       volumeMounts:
       - name: models-cache
         mountPath: /mnt/models
-    nodeSelector:
-      nvidia.com/gpu.product: Tesla-T4
 - targetJobs:
   - name: model-initializer
   spec:
     # Step 4: Do the same as step 1.
     volumes:
     - name: models-cache
       persistentVolumeClaim:
         claimName: team-model-cache-pvc
     containers:
     - name: model-initializer
       volumeMounts:
       - name: models-cache
         mountPath: /mnt/models
-    nodeSelector:
-      nvidia.com/gpu.product: Tesla-T4
```
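By contrast, the trainer step keeps its GPU selector. A minimal sketch of that override, assuming the trainer's target job is named `node` (the job name is an assumption — match it to the trainer job name defined by your runtime):

```yaml
# Hypothetical override: only the trainer step pins to GPU nodes.
# "node" is an assumed job name; use the trainer job name from your TrainingRuntime.
- targetJobs:
  - name: node
  spec:
    nodeSelector:
      nvidia.com/gpu.product: Tesla-T4
```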
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 496 -
525, The nodeSelector entries are incorrectly applied to the initializer jobs;
remove the nodeSelector: nvidia.com/gpu.product: Tesla-T4 blocks from the
targetJobs named dataset-initializer and model-initializer so those steps run on
CPU nodes, and ensure the trainer job retains its nodeSelector. Locate the YAML
blocks under targetJobs -> - name: dataset-initializer and - name:
model-initializer and delete their nodeSelector sections (leave
volumes/containers untouched); confirm only the trainer job has the GPU
nodeSelector.
```markdown
|---|---|
| Base model URL | `spec.initializer.model.env[DATASET_URL]` and `spec.trainer.env[BASE_MODEL_URL]` in `TrainJob` |
| Dataset URL | `spec.initializer.dataset.env[DATASET_URL]` and `spec.trainer.env[DATASET_URL]` in `TrainJob` |
```
Documentation error: incorrect environment variable name for base model URL.
Line 820 incorrectly states that to change the base model URL, users should modify spec.initializer.model.env[DATASET_URL]. This should be spec.initializer.model.env[BASE_MODEL_URL].
📝 Correction needed

```diff
 | What to change | Where to change it |
 |---|---|
-| Base model URL | `spec.initializer.model.env[DATASET_URL]` and `spec.trainer.env[BASE_MODEL_URL]` in `TrainJob` |
+| Base model URL | `spec.initializer.model.env[BASE_MODEL_URL]` and `spec.trainer.env[BASE_MODEL_URL]` in `TrainJob` |
 | Dataset URL | `spec.initializer.dataset.env[DATASET_URL]` and `spec.trainer.env[DATASET_URL]` in `TrainJob` |
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb` around lines 819 -
821, The docs table incorrectly references
spec.initializer.model.env[DATASET_URL] for the base model URL; update the
documentation to use spec.initializer.model.env[BASE_MODEL_URL] (and ensure the
other table entry for spec.trainer.env[BASE_MODEL_URL] remains correct) so the
Base model URL row consistently points to
spec.initializer.model.env[BASE_MODEL_URL] and spec.trainer.env[BASE_MODEL_URL].
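A minimal sketch of the corrected wiring, with the base-model URL consistently under `BASE_MODEL_URL` (the repository URL below is a placeholder):

```yaml
# Sketch only; the repository URL is a placeholder value.
spec:
  initializer:
    model:
      env:
      - name: BASE_MODEL_URL   # the doc table mistakenly labeled this DATASET_URL
        value: https://example.com/models/base-model.git
  trainer:
    env:
    - name: BASE_MODEL_URL
      value: https://example.com/models/base-model.git
```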
Actionable comments posted: 3
🧹 Nitpick comments (1)
docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile (1)
`1-24`: Consider documenting the build context requirements. This Containerfile expects `pyproject.toml` and `uv.lock` to be present in the build context (line 21). Users need to know these files must exist alongside the Containerfile. A comment at the top of the file indicating the required files would improve usability.

📝 Suggested documentation comment

```diff
+# Build context requirements:
+# - pyproject.toml (defines Python dependencies)
+# - uv.lock (locked dependency versions)
+#
+# Build: docker build -f fine_tune_with_llamafactory.Containerfile -t <image-name> .
+
 FROM nvidia/cuda:12.6.3-cudnn-devel-ubuntu24.04
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile` around lines 1 - 24, Add a short comment at the top of the Containerfile describing required build-context files: state that pyproject.toml and uv.lock must be present in the same directory (or included in the docker/kaniko build context) because the Dockerfile uses COPY pyproject.toml uv.lock ./; reference this requirement near the COPY instruction (and optionally note any proxy or network assumptions) so users know what to include before running docker build.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile`:
- Line 22: The Dockerfile contains a hardcoded proxy in the RUN line ("RUN
HTTPS_PROXY=http://192.168.144.12:7890 uv sync --frozen") which will break
external builds; change this to use a build-time argument or omit the proxy: add
an ARG HTTPS_PROXY (and optionally ENV HTTPS_PROXY) and replace the inline value
so the RUN uses the ARG (or remove the HTTPS_PROXY prefix entirely if not
required), and document that users can pass --build-arg HTTPS_PROXY=<value> when
building.
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Around line 124-130: The docs Containerfile lists package constraints
("transformers>=4.51.1,<=4.53.3" and "deepspeed~=0.18.8") that conflict with the
Kubeflow pyproject.toml pins ("transformers<=4.51.1" and "deepspeed~=0.16.9");
update the Containerfile entries to match the Kubeflow pins or update the
Kubeflow pyproject.toml to the Containerfile versions, and if you intentionally
keep them different, add a concise note in this document explaining the
divergence and the rationale (referencing the exact package strings
"transformers>=4.51.1,<=4.53.3", "transformers<=4.51.1", "deepspeed~=0.18.8",
and "deepspeed~=0.16.9").
- Around line 104-106: The Docker base image and proxy are hardcoded to internal
resources; replace the literal "FROM
152-231-registry.alauda.cn:60070/mlops/nvidia/pytorch:24.12-py3" with a
parameterized build ARG or a publicly available image (e.g., use an upstream
nvidia/pytorch tag) and replace the literal proxy "192.168.144.12:7890" with a
configurable placeholder (e.g., PROXY or HTTP_PROXY build arg / environment
variable) and clearly label it as a placeholder in the docs so external users
can supply their own values.
---
Nitpick comments:
In
`@docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile`:
- Around line 1-24: Add a short comment at the top of the Containerfile
describing required build-context files: state that pyproject.toml and uv.lock
must be present in the same directory (or included in the docker/kaniko build
context) because the Dockerfile uses COPY pyproject.toml uv.lock ./; reference
this requirement near the COPY instruction (and optionally note any proxy or
network assumptions) so users know what to include before running docker build.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 224c76af-5509-4e0a-9761-c21b3c5feea7
⛔ Files ignored due to path filters (1)
`docs/en/kubeflow/how_to/assets/build-train-image/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- `docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile`
- `docs/en/kubeflow/how_to/assets/build-train-image/pyproject.toml`
- `docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb`
- `docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx`
- `docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`
✅ Files skipped from review due to trivial changes (3)
- docs/en/kubeflow/how_to/assets/build-train-image/pyproject.toml
- docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx
- docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb
```dockerfile
WORKDIR /opt
COPY pyproject.toml uv.lock ./
RUN HTTPS_PROXY=http://192.168.144.12:7890 uv sync --frozen
```
Hardcoded internal proxy IP will cause build failures for external users.
The proxy 192.168.144.12:7890 is an internal resource that won't be accessible to users building this image. Consider parameterizing it or removing it entirely.
🔧 Suggested fix using build argument

```diff
+ARG HTTPS_PROXY=""
 WORKDIR /opt
 COPY pyproject.toml uv.lock ./
-RUN HTTPS_PROXY=http://192.168.144.12:7890 uv sync --frozen
+RUN uv sync --frozen
```

If a proxy is required in certain environments, users can pass `--build-arg HTTPS_PROXY=...` when building.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@docs/en/kubeflow/how_to/assets/build-train-image/fine_tune_with_llamafactory.Containerfile`
at line 22, The Dockerfile contains a hardcoded proxy in the RUN line ("RUN
HTTPS_PROXY=http://192.168.144.12:7890 uv sync --frozen") which will break
external builds; change this to use a build-time argument or omit the proxy: add
an ARG HTTPS_PROXY (and optionally ENV HTTPS_PROXY) and replace the inline value
so the RUN uses the ARG (or remove the HTTPS_PROXY prefix entirely if not
required), and document that users can pass --build-arg HTTPS_PROXY=<value> when
building.
```dockerfile
# FROM docker-mirrors.alauda.cn/library/python:3.13-trixie
FROM 152-231-registry.alauda.cn:60070/mlops/nvidia/pytorch:24.12-py3
```
Hardcoded internal registry and proxy will break builds for external users.
Line 105 references an internal registry (152-231-registry.alauda.cn:60070) that external users cannot access. Line 116 hardcodes an internal proxy IP (192.168.144.12:7890). Both should be parameterized or replaced with publicly accessible alternatives.
🔧 Suggested fix using build arguments

```diff
 ARG LLAMA_FACTORY_VERSION="v0.9.4"
-# FROM docker-mirrors.alauda.cn/library/python:3.13-trixie
-FROM 152-231-registry.alauda.cn:60070/mlops/nvidia/pytorch:24.12-py3
+# Replace <your-registry> with your container registry or use a public image
+ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.12-py3
+FROM ${BASE_IMAGE}
```

For the proxy on line 116:

```diff
+ARG HTTPS_PROXY=""
 RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
     cd /opt && \
-    HTTPS_PROXY=http://192.168.144.12:7890 git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
+    git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
```

Based on learnings: placeholder credentials in documentation should be "clearly labeled as placeholders."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 104 -
106, The Docker base image and proxy are hardcoded to internal resources;
replace the literal "FROM
152-231-registry.alauda.cn:60070/mlops/nvidia/pytorch:24.12-py3" with a
parameterized build ARG or a publicly available image (e.g., use an upstream
nvidia/pytorch tag) and replace the literal proxy "192.168.144.12:7890" with a
configurable placeholder (e.g., PROXY or HTTP_PROXY build arg / environment
variable) and clearly label it as a placeholder in the docs so external users
can supply their own values.
```dockerfile
    "transformers>=4.51.1,<=4.53.3" \
    "tokenizers>=0.21.1" \
    "sqlalchemy~=2.0.30" \
    "pymysql~=1.1.1" \
    "loguru~=0.7.2" \
    "mysqlclient~=2.2.7" \
    "deepspeed~=0.18.8" \
```
Version inconsistencies with Kubeflow pyproject.toml will produce divergent environments.
This Containerfile specifies transformers>=4.51.1,<=4.53.3 and deepspeed~=0.18.8, while the Kubeflow pyproject.toml (referenced in context snippet 1) pins transformers<=4.51.1 and deepspeed~=0.16.9. Users following either path will get different package versions, which may cause confusion or incompatibility issues.
Consider aligning the version constraints across both documentation paths, or explicitly document the differences and their rationale.
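To make the divergence concrete, a small stand-alone check (pure Python, version strings taken verbatim from the two files) shows the transformers ranges only meet at a single version, while the deepspeed series never overlap:

```python
def parse(v: str) -> tuple[int, ...]:
    """Split a dotted version string into a comparable tuple, e.g. "0.18.8" -> (0, 18, 8)."""
    return tuple(int(part) for part in v.split("."))

# transformers: the Containerfile allows >=4.51.1,<=4.53.3; pyproject.toml allows <=4.51.1.
container_lo, container_hi = parse("4.51.1"), parse("4.53.3")
pyproject_cap = parse("4.51.1")
shared_hi = min(container_hi, pyproject_cap)
print(shared_hi >= container_lo)      # True -- the ranges do intersect...
print(shared_hi == parse("4.51.1"))   # True -- ...but only at the single version 4.51.1

# deepspeed: "~=0.18.8" pins the 0.18.x series, "~=0.16.9" pins 0.16.x -- no overlap at all.
print(parse("0.18.8")[:2] == parse("0.16.9")[:2])  # False
```

So a resolver following each path lands on different (and for deepspeed, mutually exclusive) versions, which is why aligning the pins or documenting the split matters.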
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 124 -
130, The docs Containerfile lists package constraints
("transformers>=4.51.1,<=4.53.3" and "deepspeed~=0.18.8") that conflict with the
Kubeflow pyproject.toml pins ("transformers<=4.51.1" and "deepspeed~=0.16.9");
update the Containerfile entries to match the Kubeflow pins or update the
Kubeflow pyproject.toml to the Containerfile versions, and if you intentionally
keep them different, add a concise note in this document explaining the
divergence and the rationale (referencing the exact package strings
"transformers>=4.51.1,<=4.53.3", "transformers<=4.51.1", "deepspeed~=0.18.8",
and "deepspeed~=0.16.9").
Deploying alauda-ai with

| | |
|---|---|
| Latest commit: | `ec4fa2b` |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://0f2c258a.alauda-ai.pages.dev |
| Branch Preview URL: | https://add-trainerv2-llm-fine-tunin.alauda-ai.pages.dev |
Summary by CodeRabbit