[AMD][MI300X] Expand GPT-OSS FP4 TP=1 concurrency from 64 to 256 #1053
Expand the search space for GPT-OSS 120B FP4 on MI300X TP=1 from conc=64 to conc=256 for the 1k1k configuration. With 128 experts (top-4 routing), larger batch sizes significantly improve MoE weight amortization across HBM.

Measured results on a single MI300X:
conc=64: 4,016 total TPS (baseline)
conc=96: 5,105 total TPS (+27%)
conc=128: 5,981 total TPS (+49%)
conc=256: 8,552 total TPS (+113%)

The existing benchmark script requires no changes; only the search space upper bound is adjusted.
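The percentage gains quoted above can be re-derived from the raw TPS figures. The following is a minimal sketch, not part of the PR: the function names are hypothetical, and "TPS per request" is simply total TPS divided by concurrency, a rough figure rather than a measured per-stream rate.

```python
# Total TPS figures quoted in the PR description (single MI300X, 1k1k).
measured_tps = {64: 4016, 96: 5105, 128: 5981, 256: 8552}


def gain_over_baseline(tps_by_conc, baseline_conc=64):
    """Return {conc: percent gain in total TPS over the baseline concurrency}."""
    base = tps_by_conc[baseline_conc]
    return {c: round(100 * (t / base - 1)) for c, t in tps_by_conc.items()}


def tps_per_request(tps_by_conc):
    """Return {conc: total TPS / concurrency}, a rough per-request rate."""
    return {c: t / c for c, t in tps_by_conc.items()}


if __name__ == "__main__":
    gains = gain_over_baseline(measured_tps)
    per_req = tps_per_request(measured_tps)
    for conc in sorted(measured_tps):
        print(
            f"conc={conc:>3}: {measured_tps[conc]:>5} total TPS, "
            f"+{gains[conc]}% vs conc=64, "
            f"{per_req[conc]:.1f} TPS per request"
        )
```

Total throughput keeps rising through conc=256 while the per-request figure falls, which is the usual throughput-versus-interactivity trade-off a wider search space is meant to expose.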
c3013cd to c1fdc98
hi @ramineroane thanks for the PR! @seungrokj and @chunfangamd, etc is the
Sure. working on it.
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi300x-vllm
@seungrokj Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24554407956
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi300x-vllm
@seungrokj Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24557535782
hi @functionstackx @cquil11
functionstackx left a comment
@seungrokj approved. feel free to merge

Summary
Expand the search space for GPT-OSS 120B FP4 on MI300X TP=1 from conc=64 to conc=256 for the 1k1k configuration.
Motivation
With 128 experts and top-4 routing, larger batch sizes significantly improve amortization of MoE weight loads from HBM. At batch=64, about 111 of the 128 experts are already loaded per decode step, so increasing concurrency amortizes this weight-loading cost across more tokens.
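The 111/128 figure is consistent with a simple occupancy estimate. The sketch below is an illustration, not the author's stated method: it assumes each token independently picks its top-k experts uniformly at random, which real routers do not guarantee, and the function names are hypothetical.

```python
def expected_unique_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts hit in one decode step.

    Assumption (not from the PR): every token selects top_k distinct experts
    uniformly at random, independently of the other tokens in the batch.
    """
    # Probability that a given expert is NOT among one token's top_k choices.
    p_missed_by_one_token = 1.0 - top_k / num_experts
    # Probability that no token in the batch selects that expert.
    p_never_hit = p_missed_by_one_token ** batch_tokens
    return num_experts * (1.0 - p_never_hit)


def tokens_per_loaded_expert(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Average number of token-to-expert assignments per expert that is loaded."""
    loaded = expected_unique_experts(num_experts, top_k, batch_tokens)
    return batch_tokens * top_k / loaded


if __name__ == "__main__":
    for batch in (64, 96, 128, 256):
        loaded = expected_unique_experts(128, 4, batch)
        reuse = tokens_per_loaded_expert(128, 4, batch)
        print(f"batch={batch:>3}: ~{loaded:.1f} experts loaded, ~{reuse:.2f} tokens per loaded expert")
```

Under this assumption, batch=64 already touches roughly 111 experts, so moving to batch=256 adds few extra weight loads while each loaded expert serves several times as many tokens.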
Results (single MI300X, vllm v0.17.0, ISL/OSL=1024)
Changes
.github/configs/amd-master.yaml: TP=1 conc-end from 64 → 256
perf-changelog.yaml: Added changelog entry
No benchmark script changes required.