Qwen3Next MTP for vLLM plugin mode #772

Merged: zejunchen-zejun merged 15 commits into main from ganyi/qwen3next_mtp on May 15, 2026

ganyi1996ppo (Contributor) commented May 13, 2026

Motivation

server script:

export VLLM_TORCH_PROFILER_DIR=./vllm_profile
export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export HIP_VISIBLE_DEVICES=0,1,2,3
export ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0
export ATOM_USE_CUSTOM_ALL_GATHER=0
export ATOM_DISABLE_VLLM_PLUGIN=0
MODEL=/mnt/data/pretrained_model/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8


vllm serve "$MODEL" \
  --port 8200 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 1 \
  --gpu_memory_utilization 0.8 \
  --max-num-batched-tokens 32768 \
  --kv-cache-dtype fp8 \
  --compilation-config '{ "cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_with_stack": "True"}' \
  --speculative-config '{"num_speculative_tokens":1, "method": "mtp"}'
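
Hand-quoting the JSON flag values is easy to get wrong in a shell. As a minimal sketch, assuming one prefers to assemble the command from Python, `json.dumps` can produce the flag values and `shlex.join` can quote them. The flag names and values are copied from the script above (only a subset is shown); the `build_serve_command` helper is hypothetical and not part of this PR.

```python
import json
import shlex


def build_serve_command(model: str, port: int = 8200) -> list[str]:
    """Hypothetical helper: assemble a subset of the `vllm serve` argv above."""
    speculative = {"num_speculative_tokens": 1, "method": "mtp"}
    compilation = {"cudagraph_mode": "FULL_AND_PIECEWISE"}
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--no-enable-prefix-caching",
        "--tensor-parallel-size", "1",
        "--kv-cache-dtype", "fp8",
        # json.dumps guarantees valid JSON, so no manual quote balancing.
        "--compilation-config", json.dumps(compilation),
        "--speculative-config", json.dumps(speculative),
    ]


argv = build_serve_command(
    "/mnt/data/pretrained_model/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"
)
# Shell-safe rendering for logs; pass `argv` itself to subprocess.run.
print(shlex.join(argv))
```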

Verification script:

MODEL_ID=/mnt/data/pretrained_model/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
lm_eval \
  --model local-completions \
  --model_args model=$MODEL_ID,base_url=http://localhost:8200/v1/completions,num_concurrent=256,max_retries=10,timeout=3000,seed=1234,max_gen_toks=2048,temperature=0,tokenized_requests=False,trust_remote_code=True \
  --batch_size auto \
  --tasks gsm8k \
  --num_fewshot 5

Result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9212|±  |0.0074|
|     |       |strict-match    |     5|exact_match|↑  |0.9143|±  |0.0077|

SpecDecoding metrics: Mean acceptance length: 1.91, Accepted throughput: 371.69 tokens/s, Drafted throughput: 410.38 tokens/s, Accepted: 3717 tokens, Drafted: 4104 tokens, Per-position acceptance rate: 0.906, Avg Draft acceptance rate: 90.6%
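
The figures in that log line are mutually consistent, which the following sketch checks. It assumes that the mean acceptance length is one verified token plus the accepted draft tokens per draft step, a definition that reproduces the reported values; the `spec_decode_summary` helper is hypothetical.

```python
def spec_decode_summary(accepted: int, drafted: int, num_speculative_tokens: int) -> dict:
    """Derive summary ratios from raw accepted/drafted token counts."""
    if drafted <= 0 or num_speculative_tokens <= 0:
        raise ValueError("drafted and num_speculative_tokens must be positive")
    # Each draft step proposes `num_speculative_tokens` tokens.
    num_drafts = drafted / num_speculative_tokens
    return {
        "acceptance_rate": accepted / drafted,
        # Assumed definition: 1 target token + accepted draft tokens per step.
        "mean_acceptance_length": 1 + accepted / num_drafts,
    }


summary = spec_decode_summary(accepted=3717, drafted=4104, num_speculative_tokens=1)
print(f"acceptance rate: {summary['acceptance_rate']:.3f}")
print(f"mean acceptance length: {summary['mean_acceptance_length']:.2f}")
```

With one speculative token per step, the acceptance rate and the per-position acceptance rate coincide, which is why the log reports 0.906 for both.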

Copilot AI review requested due to automatic review settings May 13, 2026 08:26

Copilot AI left a comment

Pull request overview

This PR adds support for running Qwen3Next MTP (multi-token prediction / EAGLE-style speculative decoding) under vLLM plugin mode, including draft-model construction, KV-cache indexing fixes, and attention/metadata handling for multi-token verification.

Changes:

  • Register Qwen3NextMTP for vLLM plugin mode and add model-class routing to the ATOM vLLM wrapper.
  • Teach the vLLM wrapper to detect draft-model construction, load draft weights correctly (spec_decode=True), and swap the global atom_config during forward() to keep layer lookups consistent across target/draft alternation.
  • Update plugin attention metadata + paged attention implementations to correctly handle multi-token decode layouts used by MTP/EAGLE.
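
The config swap described in the second bullet can be pictured as a context manager that installs one model's config for the duration of a forward pass and then restores the previous one. This is only a sketch of the pattern under that assumption; `active_config`, `get_current_config`, and `WrappedModel` are hypothetical names, not the wrapper's actual API.

```python
from contextlib import contextmanager

# Stand-in for a process-wide "current config" that layers consult.
_current_config = None


def get_current_config():
    return _current_config


@contextmanager
def active_config(config):
    """Install `config` as the global for the duration of one forward pass."""
    global _current_config
    previous = _current_config
    _current_config = config
    try:
        yield
    finally:
        _current_config = previous  # restore even if forward() raises


class WrappedModel:
    def __init__(self, name, config):
        self.name = name
        self.config = config

    def forward(self):
        with active_config(self.config):
            # Layer lookups inside the pass see this model's own config.
            return get_current_config()["layers"]


target = WrappedModel("target", {"layers": ["attn.0", "attn.1"]})
draft = WrappedModel("draft", {"layers": ["mtp.0"]})
# Alternating target/draft passes never observe each other's config.
print(target.forward(), draft.forward(), target.forward())
```

Restoring in a `finally` block matters here: if a draft forward pass raised, the target model would otherwise keep resolving layers against the draft config.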

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| atom/plugin/vllm/register.py | Registers Qwen3NextMTP architecture override for vLLM plugin mode. |
| atom/plugin/vllm/model_wrapper.py | Detects draft vs target, routes draft architecture, swaps/restores global atom_config for forwards, and passes spec_decode into weight loading. |
| atom/plugin/vllm/attention_backend/attention_gdn.py | Fixes GDN attention output writeback for speculative decode and adjusts imports/code paths. |
| atom/plugin/attention.py | Adjusts attention metadata builder thresholds/logic for MTP/EAGLE multi-token verification and async spec-decode metadata. |
| atom/plugin/attention_mha.py | Updates paged-attention decode kernels and buffer sizing to support MTP multi-token decode layout; fixes extend block-table slicing. |
| atom/models/qwen3_next.py | Adds explicit layer_num for attention KV slot isolation in MTP, fixes speculative_config fallback for vLLM, and exposes embed_tokens for sharing. |
| atom/models/qwen3_next_mtp.py | Implements Qwen3Next MTP draft model with correct layer indexing, quant prefixing, and expert mapping for shared-expert fusion. |
| atom/model_loader/loader.py | Plumbs spec_decode through plugin-mode loading so draft models can load mtp.* weights and apply MTP remapping. |

Comment thread atom/plugin/vllm/attention_backend/attention_gdn.py Outdated
Comment thread atom/plugin/vllm/attention_backend/attention_gdn.py Outdated
Comment thread atom/plugin/vllm/attention_backend/attention_gdn.py Outdated
Comment thread atom/plugin/vllm/model_wrapper.py Outdated
Comment thread atom/models/qwen3_next.py Outdated
Comment thread atom/models/qwen3_next.py Outdated
Copilot AI review requested due to automatic review settings May 13, 2026 09:25

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread atom/models/qwen3_next.py Outdated
Signed-off-by: ganyi <ygan@amd.com>
@ganyi1996ppo ganyi1996ppo force-pushed the ganyi/qwen3next_mtp branch from f38481f to 3af7ccb Compare May 14, 2026 08:01
Signed-off-by: ganyi <ygan@amd.com>
Copilot AI review requested due to automatic review settings May 14, 2026 08:05
@zejunchen-zejun zejunchen-zejun requested a review from whx-sjtu May 14, 2026 08:05
Signed-off-by: ganyi <ygan@amd.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

atom/plugin/vllm/model_wrapper.py:421

  • The draft-model detection in load_weights checks for "Qwen3NextMTP", but other parts of the repo still use the architecture key "Qwen3NextMTPModel" for Qwen3-Next MTP. If the draft model’s HF config reports "Qwen3NextMTPModel", spec_decode-specific loading (hf_config_override / weight filtering) won’t activate. Please align this set with the actual HF architecture string used for the draft model.
        is_mtp_draft_model = self.model_arch in {
            "DeepSeekMTPModel",
            "Qwen3NextMTP",
        }

Comment thread atom/plugin/vllm/register.py
Comment thread atom/plugin/vllm/model_wrapper.py
Comment thread atom/plugin/vllm/model_wrapper.py
Signed-off-by: ganyi <ygan@amd.com>
zejunchen-zejun (Collaborator) commented

Hi @ganyi1996ppo,
Could you add Qwen3-Next MTP to the atom-vllm nightly and benchmark workflows, so that accuracy and performance can be tracked once this is merged?

Signed-off-by: ganyi <ygan@amd.com>
Copilot AI review requested due to automatic review settings May 14, 2026 13:55

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

atom/plugin/vllm/model_wrapper.py:196

  • _expose_spec_decode_attrs() is now only executed when model_arch in _MTP_MASK_INPUT_ARCH (currently only DeepSeekMTPModel). The new Qwen3NextMTP model has the same extra .model nesting and does not expose embed_tokens/layers on the outer module, so vLLM speculative decoding weight/embedding sharing is likely to fail. Suggest calling _expose_spec_decode_attrs() for all MTP draft models that wrap an inner .model (and keep _adapt_mtp_layers_for_vllm() gated separately if it’s DeepSeek-specific), or add Qwen3NextMTP to the relevant allowlist.
        logger.info(f"Construct ATOM model {model_arch} for vLLM plugin mode")
        self.model = model_cls(self.atom_config)

        if model_arch in _MTP_MASK_INPUT_ARCH:
            self._adapt_mtp_layers_for_vllm()
            # Mirror nested attributes required by vLLM speculative decoding.
            self._expose_spec_decode_attrs()
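
What "mirroring nested attributes" amounts to can be shown with plain Python objects standing in for the torch modules. This is a sketch of the idea only: `InnerModel`, `DraftModel`, and `expose_spec_decode_attrs` are hypothetical, and the attribute names `embed_tokens` and `layers` are taken from the comment above.

```python
class InnerModel:
    """Stand-in for the nested `.model` that owns the real submodules."""

    def __init__(self):
        self.embed_tokens = object()
        self.layers = ["mtp-layer-0"]


class DraftModel:
    """Stand-in for an MTP draft model with the extra `.model` nesting."""

    def __init__(self):
        self.model = InnerModel()


def expose_spec_decode_attrs(outer, names=("embed_tokens", "layers")):
    """Alias selected inner attributes on the outer object; return what was added."""
    inner = getattr(outer, "model", None)
    if inner is None:
        return []
    exposed = []
    for name in names:
        # Never overwrite an attribute the outer module already defines.
        if not hasattr(outer, name) and hasattr(inner, name):
            setattr(outer, name, getattr(inner, name))
            exposed.append(name)
    return exposed


draft = DraftModel()
print(expose_spec_decode_attrs(draft))
```

The aliases point at the same objects as the inner attributes, so sharing an embedding through the outer name also affects the inner model.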

atom/plugin/vllm/model_wrapper.py:422

  • Draft-model detection only checks self.model_arch against { "DeepSeekMTPModel", "Qwen3NextMTP" }. If the HF draft config still reports Qwen3NextMTPModel (as referenced elsewhere in the repo), this branch won’t treat it as spec-decode, and hf_config_override won’t be applied. Consider accepting both Qwen3NextMTP and Qwen3NextMTPModel here (and in _ATOM_MODEL_CLASSES) so both draft-arch spellings work.
        is_mtp_draft_model = self.model_arch in {
            "DeepSeekMTPModel",
            "Qwen3NextMTP",
        }
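
The reviewer's suggestion of accepting both spellings could look like the sketch below. It illustrates the suggestion only and is not necessarily what was merged; `MTP_DRAFT_ARCHS` and `is_mtp_draft_arch` are hypothetical names.

```python
# Hypothetical allowlist covering both spellings mentioned in the review.
MTP_DRAFT_ARCHS = frozenset(
    {
        "DeepSeekMTPModel",
        "Qwen3NextMTP",       # vLLM registry name
        "Qwen3NextMTPModel",  # spelling referenced elsewhere in the repo
    }
)


def is_mtp_draft_arch(model_arch: str) -> bool:
    """Return True when the architecture string names an MTP draft model."""
    return model_arch in MTP_DRAFT_ARCHS


for arch in ("Qwen3NextMTP", "Qwen3NextMTPModel", "Qwen3NextForCausalLM"):
    print(arch, is_mtp_draft_arch(arch))
```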

Comment thread atom/plugin/vllm/model_wrapper.py
Comment thread atom/plugin/vllm/register.py
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Copilot AI review requested due to automatic review settings May 14, 2026 14:57

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

recipes/atom_vllm/Qwen3.5.md:137

  • The "Key Environment Variables" list no longer includes ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=1, but the earlier text still refers to three required variables. Please ensure this section stays consistent with the intended required/optional env var set for Qwen3.5.

## Key Environment Variables

- `ATOM_USE_CUSTOM_ALL_GATHER=0`: **Required** - disables custom all-gather for compatibility with Qwen3.5 model architecture
- `AITER_QUICK_REDUCE_QUANTIZATION=INT4`: **Performance optimization** - enables INT4 quantization for quick reduce operations
  - **Benefit**: Significantly improves TTFT (Time To First Token) performance by reducing communication overhead during tensor parallelism all-reduce operations

Comment thread atom/plugin/vllm/model_wrapper.py
Comment thread recipes/atom_vllm/Qwen3.5.md Outdated
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Copilot AI review requested due to automatic review settings May 14, 2026 15:22

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

atom/plugin/vllm/model_wrapper.py:432

  • atom.config.SpeculativeConfig does not expose draft_model_config, so draft_model_config = getattr(self.atom_config.speculative_config, "draft_model_config", None) will always be None and hf_config_override will not be applied for MTP draft-model weight loading. This can cause the draft model to load with the target model's HF config. Use self.atom_config.speculative_config.draft_model_hf_config (or fall back to self.vllm_config.speculative_config.draft_model_config.hf_config) when building draft_hf_config.
        is_mtp_draft_model = self.model_arch in {
            "DeepSeekMTPModel",
            "Qwen3NextMTP",
        }
        draft_hf_config = None
        if is_mtp_draft_model:
            draft_model_config = getattr(
                getattr(self.atom_config, "speculative_config", None),
                "draft_model_config",
                None,
            )
            if draft_model_config is not None:
                draft_hf_config = getattr(
                    draft_model_config, "hf_config", draft_model_config
                )
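
The lookup order the review proposes (ATOM's `draft_model_hf_config` first, then vLLM's `draft_model_config.hf_config`) can be sketched as follows. The attribute names come from the comment above; `resolve_draft_hf_config` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real config classes.

```python
from types import SimpleNamespace


def resolve_draft_hf_config(atom_config, vllm_config):
    """Return the draft HF config, preferring ATOM's copy over vLLM's."""
    atom_spec = getattr(atom_config, "speculative_config", None)
    draft_hf = getattr(atom_spec, "draft_model_hf_config", None)
    if draft_hf is not None:
        return draft_hf
    # Fallback: read the config vLLM already loaded for the draft model.
    vllm_spec = getattr(vllm_config, "speculative_config", None)
    draft_model_config = getattr(vllm_spec, "draft_model_config", None)
    return getattr(draft_model_config, "hf_config", None)


atom_cfg = SimpleNamespace(
    speculative_config=SimpleNamespace(draft_model_hf_config=None)
)
vllm_cfg = SimpleNamespace(
    speculative_config=SimpleNamespace(
        draft_model_config=SimpleNamespace(hf_config={"source": "vllm"})
    )
)
print(resolve_draft_hf_config(atom_cfg, vllm_cfg))
```

Because `getattr(None, name, None)` simply returns `None`, the chain degrades gracefully when speculative decoding is not configured at all.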

Comment thread atom/plugin/vllm/model_wrapper.py
Comment thread recipes/atom_vllm/Qwen3.5.md
Comment thread atom/plugin/config.py
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Copilot AI review requested due to automatic review settings May 15, 2026 02:00

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

atom/plugin/vllm/model_wrapper.py:433

  • atom.config.SpeculativeConfig does not have a draft_model_config attribute (it exposes draft_model_hf_config). As written, draft_hf_config will always stay None, so load_model_in_plugin_mode(..., hf_config_override=...) won’t apply the draft HF-config overrides needed for MTP (e.g., architecture rewrite / expert backfill). Please fetch the draft config from self.atom_config.speculative_config.draft_model_hf_config (or keep reading it from self.vllm_config.speculative_config.draft_model_config.hf_config).
            draft_model_config = getattr(
                getattr(self.atom_config, "speculative_config", None),
                "draft_model_config",
                None,
            )
            if draft_model_config is not None:
                draft_hf_config = getattr(
                    draft_model_config, "hf_config", draft_model_config
                )

Comment thread atom/plugin/vllm/model_wrapper.py
Comment thread atom/plugin/config.py
Signed-off-by: ganyi <ygan@amd.com>
valarLip previously approved these changes May 15, 2026
Add ATOM_FP8_BLOCKSCALE_WEIGHT_PRESHUFFLE=0 to all Qwen3.5 and
Qwen3-Next model configs across benchmark, nightly accuracy, and
recipe files.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 06:01
@zejunchen-zejun zejunchen-zejun dismissed stale reviews from valarLip and themself via e262d7e May 15, 2026 06:01

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

atom/plugin/vllm/model_wrapper.py:433

  • In load_weights, draft MTP hf_config override is fetched via self.atom_config.speculative_config.draft_model_config, but ATOM's SpeculativeConfig stores the draft config as draft_model_hf_config (see atom/config.py). As written, draft_hf_config will stay None, so the draft model will be loaded using the target hf_config, which can break MTP weight name/architecture overrides.

Update this to read the draft config from self.atom_config.speculative_config.draft_model_hf_config (or fall back to vLLM's vllm_config.speculative_config.draft_model_config.hf_config), and pass that object as hf_config_override.

        is_mtp_draft_model = self.model_arch in {
            "DeepSeekMTPModel",
            "Qwen3NextMTP",
        }
        draft_hf_config = None
        if is_mtp_draft_model:
            draft_model_config = getattr(
                getattr(self.atom_config, "speculative_config", None),
                "draft_model_config",
                None,
            )
            if draft_model_config is not None:
                draft_hf_config = getattr(
                    draft_model_config, "hf_config", draft_model_config
                )

Comment thread recipes/atom_vllm/Qwen3.5.md Outdated
Comment on lines 75 to 79
**Important**: The following three environment variables are required for Qwen3.5:

- `ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=1`: Disables ATOM attention plugin to use vLLM's implementation for full attention layers (required because Qwen3.5 uses a hybrid architecture with both linear attention (GatedDeltaNet) and full attention layers)
- `ATOM_USE_CUSTOM_ALL_GATHER=0`: Disables custom all-gather for compatibility with Qwen3.5 model architecture
- `AITER_QUICK_REDUCE_QUANTIZATION=INT4`: **Performance optimization** - enables INT4 quantization for quick reduce operations, which can significantly improve TTFT (Time To First Token) performance. **Note**: This optimization may introduce a risk of accuracy degradation. For accuracy-critical workloads, consider validating with your specific use case.

zejunchen-zejun and others added 2 commits May 15, 2026 14:42
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Remove stale "three" count (now variable list), add
ATOM_FP8_BLOCKSCALE_WEIGHT_PRESHUFFLE=0 to both the Important section
and Key Environment Variables section.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 06:47

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment on lines 418 to 428
         is_mtp_draft_model = self.model_arch in {
             "DeepSeekMTPModel",
-            "Qwen3NextMTPModel",
+            "Qwen3NextMTP",
         }
         draft_hf_config = None
         if is_mtp_draft_model:
             draft_model_config = getattr(
-                getattr(self.vllm_config, "speculative_config", None),
+                getattr(self.atom_config, "speculative_config", None),
                 "draft_model_config",
                 None,
             )
Comment thread atom/plugin/config.py
Comment on lines +75 to +112
def _build_atom_speculative_config_from_vllm(vllm_spec_config: Any):
    """Translate vLLM's SpeculativeConfig into ATOM's SpeculativeConfig.

    Reuses vLLM's already-loaded draft hf_config (skips a second disk fetch
    in ATOM SpeculativeConfig.__post_init__) but still runs ATOM's
    hf_config_override on it — so MTP model_type remap, n_routed_experts
    backfill (Qwen families), and architecture rewrite all land on the
    draft config in one place. Mirrors how standalone ATOM MTP exposes
    the draft hf_config via atom_config.speculative_config.

    The draft hf_config is deepcopied first because hf_config_override
    mutates `architectures` to ATOM's standalone naming (e.g.
    "Qwen3NextMTPModel"), which differs from vLLM's registry name
    ("Qwen3NextMTP"). Mutating in place would make vLLM's later draft
    architecture lookup fail.
    """
    if vllm_spec_config is None:
        return None

    from atom.config import SpeculativeConfig

    draft_model_config = getattr(vllm_spec_config, "draft_model_config", None)
    draft_hf_config = getattr(draft_model_config, "hf_config", None)
    if draft_hf_config is not None:
        draft_hf_config = copy.deepcopy(draft_hf_config)
    model_path = getattr(draft_model_config, "model", None) or getattr(
        vllm_spec_config, "model", None
    )

    return SpeculativeConfig(
        method=getattr(vllm_spec_config, "method", "") or "",
        model=model_path,
        num_speculative_tokens=getattr(
            vllm_spec_config, "num_speculative_tokens", None
        ),
        draft_model_hf_config=draft_hf_config,
    )
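
The docstring's point about copying before mutating can be reproduced in isolation. In this sketch, `apply_override` is a hypothetical stand-in for ATOM's `hf_config_override` (only the `architectures` rewrite is modeled), and `SimpleNamespace` replaces the real HF config object.

```python
import copy
from types import SimpleNamespace


def apply_override(hf_config):
    """Hypothetical stand-in: rewrite `architectures` to ATOM's naming in place."""
    hf_config.architectures = ["Qwen3NextMTPModel"]
    return hf_config


# The config object that vLLM keeps using for its own registry lookup.
vllm_draft_hf_config = SimpleNamespace(architectures=["Qwen3NextMTP"])

# Copy first, then mutate: vLLM's object stays untouched.
atom_draft_hf_config = apply_override(copy.deepcopy(vllm_draft_hf_config))

print(vllm_draft_hf_config.architectures)
print(atom_draft_hf_config.architectures)
```

Without the `copy.deepcopy` call, both names would refer to one object, and the rewritten architecture string would also be what vLLM sees.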

@zejunchen-zejun zejunchen-zejun merged commit cc3539e into main May 15, 2026
27 of 33 checks passed
@zejunchen-zejun zejunchen-zejun deleted the ganyi/qwen3next_mtp branch May 15, 2026 07:32
sijyang pushed a commit that referenced this pull request May 24, 2026
* mtp 1 acc right

Signed-off-by: ganyi <ygan@amd.com>

* add recipe for qwen3-next-mtp

Signed-off-by: ganyi <ygan@amd.com>

* modify some qwen3.5 recipe

Signed-off-by: ganyi <ygan@amd.com>

* black

Signed-off-by: ganyi <ygan@amd.com>

* remove redundant code

Signed-off-by: ganyi <ygan@amd.com>

* remove redundant code

Signed-off-by: ganyi <ygan@amd.com>

* add spec decode convert for vllm plugin

Signed-off-by: ganyi <ygan@amd.com>

* remove vllm related branch

Signed-off-by: ganyi <ygan@amd.com>

* use atom spec decode config for plugin loading

Signed-off-by: ganyi <ygan@amd.com>

* remove unnecessary changes in modeling

Signed-off-by: ganyi <ygan@amd.com>

* format

Signed-off-by: ganyi <ygan@amd.com>

* add qwen3next mtp into benchmark

Signed-off-by: ganyi <ygan@amd.com>

* [ci] disable FP8 blockscale weight preshuffle for Qwen3.5/Qwen3-Next

Add ATOM_FP8_BLOCKSCALE_WEIGHT_PRESHUFFLE=0 to all Qwen3.5 and
Qwen3-Next model configs across benchmark, nightly accuracy, and
recipe files.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* [ci] fix Qwen3-Next MTP benchmark label from MET to AW

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* [docs] fix Qwen3.5 recipe: update env var count and add preshuffle doc

Remove stale "three" count (now variable list), add
ATOM_FP8_BLOCKSCALE_WEIGHT_PRESHUFFLE=0 to both the Important section
and Key Environment Variables section.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

---------

Signed-off-by: ganyi <ygan@amd.com>
Co-authored-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
zejunchen-zejun added a commit that referenced this pull request Jun 4, 2026
…nt SGLang V3.2

Follow-up to the citation audit, two more verified corrections in the
plugin sections:

- vLLM-ATOM / Qwen3.5: the prior pass dropped Qwen3.5 along with the
  misattributed #532, but Qwen3.5 does have real vLLM-plugin support.
  Restore it with the correct PRs: #448 (fp8 functionality/accuracy,
  touches atom/plugin/vllm/model_wrapper.py) and #593 (Qwen3.5 FP4
  nightly + benchmark, recipes/atom_vllm/Qwen3.5.md), keeping #772
  (Qwen3-Next MTP).
- SGLang-ATOM: dropped "V3.2" from the DeepSeek model list. No SGLang
  DeepSeek V3.2 PR landed in v0.1.2..v0.1.3 (V3 MTP=#643, R1 FP4=#614,
  FP4 MTP=#834/#846); the cited PRs only cover V3 and R1 FP4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zejunchen-zejun added a commit that referenced this pull request Jun 4, 2026
…itations in v0.1.3 (#1061)

* docs(release-notes): fix misattributed plugin PR citations in v0.1.3

Four citations in the vLLM-ATOM sections referenced PRs that actually
belong to SGLang-ATOM or the native ATOM path. Verified each PR title
against GitHub before correcting.

- Model Support / DeepSeek V4 / R1 FP4: dropped the bullet. #650 is the
  native DeepSeek V4 triton-MoE path (already cited under ATOM Server)
  and #614 is a SGLang-ATOM R1 FP4 PR (already cited under SGLang-ATOM);
  neither supports a vLLM-ATOM V4 / R1 FP4 claim.
- Model Support / Qwen3.5 / Qwen3-Next: dropped #532 (it adds Qwen3.5 /
  Qwen3-Next to SGLang, not vLLM); keep #772 (Qwen3-Next MTP for vLLM).
- H&P / vLLM-ATOM: dropped #528 + the "Q/K norm-quant fusion" claim;
  #528 is the SGLang+ATOM qk-norm fusion PR (already cited under SGLang).
- H&P / vLLM-ATOM: dropped #614 from the DeepSeek FP4 validation bullet
  (SGLang-ATOM PR), leaving the genuine #639 TP8/EP8 case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(release-notes): correct Qwen3.5 vLLM citation and drop nonexistent SGLang V3.2

Follow-up to the citation audit, two more verified corrections in the
plugin sections:

- vLLM-ATOM / Qwen3.5: the prior pass dropped Qwen3.5 along with the
  misattributed #532, but Qwen3.5 does have real vLLM-plugin support.
  Restore it with the correct PRs: #448 (fp8 functionality/accuracy,
  touches atom/plugin/vllm/model_wrapper.py) and #593 (Qwen3.5 FP4
  nightly + benchmark, recipes/atom_vllm/Qwen3.5.md), keeping #772
  (Qwen3-Next MTP).
- SGLang-ATOM: dropped "V3.2" from the DeepSeek model list. No SGLang
  DeepSeek V3.2 PR landed in v0.1.2..v0.1.3 (V3 MTP=#643, R1 FP4=#614,
  FP4 MTP=#834/#846); the cited PRs only cover V3 and R1 FP4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(release-notes): fix 3 more cross-section PR misattributions

Verified each PR's changed files to confirm which engine path it belongs
to:

- vLLM-ATOM Engine Core: drop #793 + "handles scalar KV scales". #793
  only touches atom/model_ops/{attention_mha,base_attention}.py (native,
  no plugin files) and is already cited correctly in the native section.
- vLLM-ATOM H&P: drop the DeepSeek FP4 TP8/EP8 bullet and move #639 to
  SGLang-ATOM H&P. #639 only touches sglang_benchmark_models.json and
  atom-sglang-benchmark.yaml -> it is a SGLang benchmark PR, not vLLM.
- vLLM-ATOM H&P: drop "V4 DP benchmark coverage (#949)". #949 touches the
  native benchmark (.github/benchmark/models.json, atom-benchmark.yaml),
  already cited under ATOM Server; it is not a plugin PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5 participants