[WIP] agentic: add Kimi Mooncake LMCacheMP disagg recipe#1924
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. 感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致 如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢
PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow 一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。 如需更多帮助,PR 作者可通过 Slack 联系核心维护者。 |
1 similar comment
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. 感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致 如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢
PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow 一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。 如需更多帮助,PR 作者可通过 Slack 联系核心维护者。 |
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| global global_args |
| setup_vllm_env | ||
| start_lmcache_mp_if_needed |
There was a problem hiding this comment.
🔴 start_lmcache_mp_if_needed returns 1 when the LMCache MP server fails its 240s health check, but server_vllm.sh has no set -e and both call sites (rank-0 prefill at line 423 and additional prefill nodes at line 609) invoke it bare. After a failed LMCache start, execution falls through and vllm serve is launched with --kv-transfer-config pointing at the unreachable LMCacheMPConnector, surfacing as a much later/less informative connector-init or first-request error that hides the clear "LMCache MP server failed to become healthy" message the function already logged. Fix: start_lmcache_mp_if_needed || exit 1 at both call sites.
Extended reasoning...
What the bug is
server_vllm.sh adds a new helper start_lmcache_mp_if_needed() (lines 261-400). When ROLE_KV_CONNECTOR == "mooncake-lmcachemp", the function backgrounds lmcache server …, polls http://$LMCACHE_HOST:$LMCACHE_HTTP_PORT/healthcheck for up to 240s (120 attempts × 2s), and on timeout logs ERROR: LMCache MP server failed to become healthy, tails the LMCache log, and return 1s.
The function is called bare at two sites:
- line 423 (rank-0 prefill / proxy node)
- line 609 (additional prefill nodes
xP > 1)
Both call sites look like:
setup_vllm_env
start_lmcache_mp_if_needed
…
PREFILL_CMD="vllm serve … --kv-transfer-config '${KVT_PREFILL}' …"
eval "$PREFILL_CMD" > "$PREFILL_LOG_FILE" 2>&1 &No || exit 1, no if !, no $? check.
Why existing code doesn't prevent it
I grepped for set -[eE] in server_vllm.sh, setup_deps.sh, and env.sh — no matches. The script runs with default bash semantics: a failed command does not abort. So when start_lmcache_mp_if_needed returns 1, control flows straight to the vllm serve launch.
Why this matters in the failure mode
KVT_PREFILL for this recipe is the MultiConnector with two children:
{"kv_connector":"MultiConnector","kv_role":"kv_producer",
"kv_connector_extra_config":{"connectors":[
{"kv_connector":"MooncakeConnector",…},
{"kv_connector":"LMCacheMPConnector",
"kv_connector_extra_config":{"lmcache.mp.host":"tcp://${LMCACHE_HOST}","lmcache.mp.port":${LMCACHE_PORT}}}]}}When vLLM initializes this MultiConnector with the LMCache server down, the LMCacheMPConnector init (or its first request) will fail trying to dial tcp://127.0.0.1:5555. That error will be deep inside the vLLM scheduler/worker stack and is markedly less informative than the clear stderr message + 80-line LMCache log tail that start_lmcache_mp_if_needed had already produced.
Step-by-step proof
PREFILL_KV_CONNECTOR=mooncake-lmcachempis set in the agentic recipe.- Rank 0 starts:
setup_vllm_env; start_lmcache_mp_if_needed. - The LMCache binary either OOMs at init (note the PR's own bring-up note: "Decode-side LMCacheMP on MI300X full Kimi repeatedly OOMed during KV registration") or simply takes >240s to bind on a cold node.
- The function logs
ERROR: LMCache MP server failed to become healthyto stderr, tails the log, andreturn 1s. - With no
set -eand no|| exit 1, the next line runs:PREFILL_CMD="vllm serve …"followed byeval "$PREFILL_CMD" > "$PREFILL_LOG_FILE" 2>&1 &. - vLLM boot proceeds, the MultiConnector tries to connect to the dead LMCache server, and the engine crashes with a connector-init traceback in
prefill_<host>.log— not the LMCache log that has the actual root cause. - The proxy /health barrier on node 0 then waits up to 30 minutes for the prefill server to come up, the operator scrolls through prefill logs first, and only eventually finds the earlier LMCache stderr buried above the vLLM traceback.
How to fix
One-line change at each call site (line 423 and line 609):
start_lmcache_mp_if_needed || exit 1Alternatively, add set -e (or at minimum set -e scoped via subshell) — but per-call || exit 1 is the smaller surgical fix and avoids unintended aborts elsewhere in the script.
|
|
||
| - config-keys: | ||
| - kimik2.5-fp4-mi355x-vllm-disagg-agentic | ||
| description: | ||
| - "Add Kimi-K2.5 MI355X multinode agentic recipe using Mooncake(tcp) for PD KV transfer and LMCacheMP only on prefill for L2 prefix reuse." | ||
| - "Decode uses MooncakeConnector only to avoid decode-side LMCache lookup/retrieve racing the remote-prefill KV load." | ||
| pr-link: XXX |
There was a problem hiding this comment.
🔴 The new kimik2.5-fp4-mi355x-vllm-disagg-agentic entry in perf-changelog.yaml has pr-link: XXX instead of a real URL — every other entry in the file uses a real https://github.com/SemiAnalysisAI/InferenceX/pull/<n> link. As written, this will land in main as a literal "XXX" string and break any downstream tooling that parses pr-link as a URL. Replace with https://github.com/SemiAnalysisAI/InferenceX/pull/1924 before merging.
Extended reasoning...
Bug
The newly added perf-changelog.yaml entry for kimik2.5-fp4-mi355x-vllm-disagg-agentic (lines 4156-4162 of the PR diff) sets:
pr-link: XXXThis is an unfinished placeholder. Every other entry in this file — including the four immediately preceding entries at lines 4131, 4140, 4148, and 4155 — uses a real GitHub PR URL of the form https://github.com/SemiAnalysisAI/InferenceX/pull/<n>.
Why it's a problem
pr-link is structured metadata in the changelog. The schema convention (established by all existing entries) is that the value is a fully-qualified GitHub PR URL. A literal XXX string violates that schema and will:
- Render as the literal text "XXX" anywhere the changelog is presented to humans (changelog pages, release notes, dashboards).
- Break any downstream tooling that consumes the file and treats
pr-linkas a URL — e.g., link validators, anything that doesrequests.get(entry["pr-link"]), anything that parses/pull/(\d+)$to extract the PR number for cross-references. - Make it impossible to navigate from the changelog entry back to the originating PR — exactly the cross-reference
pr-linkexists to provide.
Step-by-step proof
- Open
perf-changelog.yamland look at any existing entry (e.g. the one ending at line 4155):pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1891
- Look at the new entry the PR adds at the end of the file:
- config-keys: - kimik2.5-fp4-mi355x-vllm-disagg-agentic description: - "Add Kimi-K2.5 MI355X multinode agentic recipe..." - "Decode uses MooncakeConnector only..." pr-link: XXX
- The
XXXis a placeholder the author left for the PR number, which is now known to be [WIP] agentic: add Kimi Mooncake LMCacheMP disagg recipe #1924.
Fix
Replace line 4162 with:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1924Trivial one-line change; no other parts of the PR are affected.
| ( python3 "$WS_PATH/mooncake_lmcache_proxy.py" --host 0.0.0.0 --port "${ROUTER_PORT}" \ | ||
| --prefiller-hosts "${NODE0_ADDR}" --prefiller-ports "${SERVER_PORT}" \ | ||
| --decoder-hosts "${MC_PROXY_DECODER_IP}" --decoder-ports "${SERVER_PORT}" \ | ||
| > "$MC_PROXY_LOG" 2>&1 & ) || echo "[Mooncake] WARN: proxy failed to start (see $MC_PROXY_LOG)" | ||
| sleep 6 | ||
| fi |
There was a problem hiding this comment.
🟡 The error handler on ( python3 $WS_PATH/mooncake_lmcache_proxy.py ... & ) || echo "[Mooncake] WARN: proxy failed to start" is unreachable: backgrounding inside the subshell makes the subshell return rc=0 immediately, so any real proxy launch failure (import error, port already bound, missing dep, script crash) silently produces no WARN. The only failure signal is the subsequent /health barrier on line 481 timing out after 1800s. Capture $! after backgrounding and kill -0 $pid after the sleep 6 to actually detect launch failure, or just drop the dead || echo since the health barrier ultimately catches it.
Extended reasoning...
What the bug is
At server_vllm.sh:468-473, the in-container Mooncake/LMCache PD proxy is launched as:
( python3 "$WS_PATH/mooncake_lmcache_proxy.py" --host 0.0.0.0 --port "${ROUTER_PORT}" \
--prefiller-hosts "${NODE0_ADDR}" --prefiller-ports "${SERVER_PORT}" \
--decoder-hosts "${MC_PROXY_DECODER_IP}" --decoder-ports "${SERVER_PORT}" \
> "$MC_PROXY_LOG" 2>&1 & ) || echo "[Mooncake] WARN: proxy failed to start (see $MC_PROXY_LOG)"The author's intent is clearly "emit a warning if the proxy fails to start." In practice the || echo WARN branch is unreachable dead code in any realistic failure mode.
Why || never fires
Bash semantics for ( cmd & ):
- The parent shell forks a subshell.
- Inside the subshell,
&backgroundspython3 …— the shell callsfork()again, the child will eventuallyexec(python3), and the subshell's$?is set to 0 (background commands always succeed at the point of being backgrounded; fork() doesn't validate the executable). - The subshell has no further commands, so it exits with rc=0.
||sees rc=0 and skipsecho.
The exec(python3) either succeeds or fails asynchronously in the detached grandchild, after the parent shell has already moved past the || clause. No exec failure, ImportError, port-already-bound, syntax error, or runtime crash in mooncake_lmcache_proxy.py can ever propagate to the || branch. The only way echo WARN would trigger is if the OS could not even fork the subshell (e.g. resource exhaustion) — an extremely rare condition unrelated to the proxy itself.
Step-by-step proof
Suppose mooncake_lmcache_proxy.py has a syntax error and the launch is at t=0:
- t=0: parent shell at line 468 sees
( ... & ) || .... It forks a subshell (pid=A). - t=0+ε: subshell A reads
python3 … &. It forks pid=B for the python process and immediately marks B as backgrounded. - t=0+ε: subshell A has no further commands.
$?from the&is 0. Subshell A exits with rc=0. - t=0+ε: parent shell sees subshell A returned 0.
||short-circuits.echo WARNis skipped. - t=0+δ (concurrently): pid=B calls
exec(python3). Python loadsmooncake_lmcache_proxy.py, hits the syntax error, prints a traceback to$MC_PROXY_LOG, exits non-zero. Nothing observes this. - t=6:
sleep 6completes; script proceeds to the/healthbarrier (HEALTH_BARRIER_CMD). - t=1806:
sync.py barrier --timeout 1800times out. The script exits with an opaque "port ${ROUTER_PORT} never opened" error, 30 minutes after the actual proxy crash.
The bug-finder's described symptom is exactly what bash semantics predict.
Impact
Practical impact is bounded by the existing /health barrier on line 481 (--timeout 1800), which will detect a dead proxy — eventually. So the script still fails closed, just with a 30-minute blind spot and an opaque error pointing at the network layer rather than the python crash. For multi-node SLURM jobs this wastes ~30 minutes of compute per failed bring-up and obscures the root cause. For the agentic-coding scenario which already runs 1800s per concurrency point, this can effectively double the wasted SLURM time on first-occurrence failures.
How to fix
The intent-preserving fix is to capture the background PID and probe it after the existing sleep:
python3 "$WS_PATH/mooncake_lmcache_proxy.py" --host 0.0.0.0 --port "${ROUTER_PORT}" \
--prefiller-hosts "${NODE0_ADDR}" --prefiller-ports "${SERVER_PORT}" \
--decoder-hosts "${MC_PROXY_DECODER_IP}" --decoder-ports "${SERVER_PORT}" \
> "$MC_PROXY_LOG" 2>&1 &
MC_PROXY_PID=$!
sleep 6
if ! kill -0 "$MC_PROXY_PID" 2>/dev/null; then
echo "[Mooncake] ERROR: proxy died on startup; tail of $MC_PROXY_LOG:" >&2
tail -n 80 "$MC_PROXY_LOG" >&2 || true
exit 1
fiThis drops the unnecessary subshell wrapper, captures the PID via $!, and uses kill -0 after the existing 6-second wait to verify the process is still alive — turning a 30-minute timeout into a 6-second fast-fail with a useful log tail. (The sibling start_lmcache_mp_if_needed already uses this exact tail-on-failure pattern at server_vllm.sh:393-396.) Alternatively, just delete the || echo … to remove the misleading dead code, since the health barrier eventually catches the failure anyway.
| min_tokens = req_data.pop("min_tokens", None) | ||
| min_completion_tokens = req_data.pop("min_completion_tokens", None) | ||
| headers = { | ||
| "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}", | ||
| "X-Request-Id": request_id, | ||
| } | ||
|
|
||
| response = await client_info["client"].post( | ||
| endpoint, json=req_data, headers=headers | ||
| ) | ||
| response.raise_for_status() | ||
|
|
||
| # read/consume the response body to release the connection | ||
| # otherwise, it would http.ReadError | ||
| await response.aread() | ||
|
|
||
| # Add back the min_tokens and min_completion_tokens so D can use them | ||
| req_data["min_tokens"] = min_tokens | ||
| req_data["min_completion_tokens"] = min_completion_tokens | ||
|
|
||
| return response |
There was a problem hiding this comment.
🟡 Dead code: send_request_to_service() does req_data = req_data.copy() (line 161), then pops min_tokens/min_completion_tokens from the local copy and 'adds them back' at lines 194-196 — but the function returns response, not req_data, so the add-back has no effect. Behavior is correct by accident (decode reads the caller's original dict, which was never mutated). Drop the three add-back lines plus the misleading "so D can use them" comment, or rewrite to take/return a sanitized dict explicitly.
Extended reasoning...
What the code does today. At line 161, send_request_to_service does req_data = req_data.copy(), rebinding the local name to a shallow copy of the caller's dict. At lines 178-179 it pops min_tokens and min_completion_tokens off that local copy before POSTing the prefill request (these args are not supported on the P engine). After the prefill response is read, lines 194-196 then write the popped values back into the local copy with a comment that says "so D can use them."\n\nWhy the add-back is dead. The function returns response (line 198), not req_data. The local copy goes out of scope and is garbage-collected. Nothing downstream observes those re-added keys. _handle_completions does receive the response and merges kv_transfer_params back onto its own req_data, but that req_data is the caller's outer dict, which the function-local .copy() shadowed — the pop never touched it.\n\nStep-by-step proof.\n1. Caller _handle_completions reads the request JSON into outer req_data (line ~228 of _handle_completions). Say it contains {"min_tokens": 5, ...}.\n2. It calls await send_request_to_service(prefill_client_info, api, req_data, request_id). Python passes the dict reference; inside the function, the parameter req_data initially points at the same outer dict.\n3. Line 161: req_data = req_data.copy() rebinds the local name to a new dict. The outer dict is untouched.\n4. Lines 178-179: req_data.pop("min_tokens", None) removes the key from the local copy only. Outer dict still has min_tokens: 5.\n5. POST goes out without min_tokens (correct — P doesn't support it).\n6. Lines 194-196: req_data["min_tokens"] = min_tokens writes back into the local copy.\n7. Line 198: return response. The local copy is discarded.\n8. Back in _handle_completions, the outer req_data is forwarded to the decode engine. It still contains min_tokens: 5 — not because of the add-back, but because step 3's shallow copy ensured the pop never reached it.\n\nWhy existing code doesn't catch it. There is no test or assertion that observes the post-call state of the local req_data inside send_request_to_service, and the decode call works correctly via the outer dict for the reason above. The bug is latent until someone refactors: e.g. if a future change moves the .copy() to the caller, or removes it, or changes the function to mutate the parameter in-place, the add-back lines suddenly start mattering — and the misleading comment will lead the refactorer to believe the add-back already handles that case.\n\nImpact. Nit. Zero runtime effect today. The risk is purely future maintenance: the comment asserts an intent ("so D can use them") that the function does not fulfill, which is a small landmine for the next reader.\n\nSuggested fix. Either delete lines 192-196 and the comment entirely (simplest), or, if the intent is to make the sanitization explicit, change the function to take a pre-sanitized dict from the caller and drop the internal .copy() + pop + restore dance. Note that this file's SPDX header credits the vLLM project — the same dead code may exist upstream and the fix could be sent there too.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28122926398 |
| @@ -0,0 +1,297 @@ | |||
| # SPDX-License-Identifier: Apache-2.0 | |||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | |||
There was a problem hiding this comment.
what's the reason why vLLM-router or an non-toy proxy can't be used here?
8c6bdb4 to
d6908c1
Compare
billishyahao
left a comment
There was a problem hiding this comment.
Thanks for the contribution. Please address the comments.
| # The 0.55 vLLM GPU memory target leaves HBM headroom for LMCacheMP KV-cache | ||
| # registration; 0.75+ can fill HBM before the server can attach its tensors. | ||
| kimik2.5-fp4-mi355x-vllm-disagg-agentic: | ||
| image: yukiozzz/kimi-lmc-mc-rocm:dmabuf |
There was a problem hiding this comment.
Can we eliminate the private image ?
| - "ENABLE_PREFIX_CACHING=1" | ||
| - "MAX_MODEL_LEN=262144" | ||
| - "WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_256k" | ||
| - "LMCACHE_L1_SIZE_GB=1200" |
There was a problem hiding this comment.
how about moving to env.sh
| - "Add Kimi-K2.5 MI355X multinode agentic recipe using Mooncake(tcp) for PD KV transfer and LMCacheMP only on prefill for L2 prefix reuse." | ||
| - "Decode uses MooncakeConnector only to avoid decode-side LMCache lookup/retrieve racing the remote-prefill KV load." | ||
| - "Use fp8 KV cache, 0.55 vLLM GPU memory utilization, and 1.2TB LMCache host-DRAM cache to leave HBM headroom for LMCacheMP KV registration." | ||
| pr-link: XXX |
d6908c1 to
aa0bf28
Compare
Summary