Add new FW (TRT) and precision support #5

Merged: kimbochen merged 13 commits into main from kepotdar-trt-init on Sep 5, 2025

@kedarpotdar-nv

Collaborator

Overview

This PR adds TensorRT-LLM (TRT-LLM) as a new inference framework for LLaMA 70B benchmarking on NVIDIA H200 and B200 GPUs, alongside the existing vLLM framework. This enables direct performance comparison between vLLM and TRT-LLM on the same hardware.

Key Features

  • Multi-framework support: vLLM and TRT-LLM for LLaMA 70B
  • Precision support: FP8 (default) and FP4 (future-ready)
  • Unified plotting: All frameworks and hardware on single performance plots
  • Framework-specific Docker images: Prevents image conflicts between frameworks
  • Clean job naming: Clear identification in GitHub Actions UI

🔧 Core Workflow Updates

.github/workflows/benchmark-tmpl.yml

  • Added framework and precision as required inputs
  • Updated RESULT_FILENAME to include framework and precision
  • Modified result processing to pass framework and precision to process_result.py
  • Fixed job naming to prevent duplication
  • Added hardware extraction logic for proper result processing
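
As a rough illustration of the RESULT_FILENAME change, the sketch below builds a file stem that carries framework and precision, so results from different frameworks on the same hardware do not overwrite each other. The `result_filename` helper, the field order, the underscore separator, and applying the FP8 default at this stage are all assumptions made for illustration, not the template's actual format.

```python
def result_filename(model: str, hardware: str, framework: str, precision: str = "") -> str:
    """Build a result file stem that is unique per framework and precision.

    Hypothetical naming scheme; the real workflow template may differ.
    """
    # Assumption: an empty precision stands for the default (FP8),
    # mirroring the default described for process_result.py.
    precision = precision or "fp8"
    parts = [model, hardware, framework, precision]
    return "_".join(part.lower() for part in parts)


if __name__ == "__main__":
    print(result_filename("70b", "h200", "trt"))
    print(result_filename("70b", "b200", "vllm", "fp4"))
```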

.github/workflows/70b-tmpl.yml

  • Added bmk-h200-trt and bmk-b200-trt jobs for TRT-LLM
  • Configured TRT-LLM jobs with nvidia/tensorrt-llm Docker image
  • Set precision to empty string for FP8 (default)
  • Updated all jobs to include framework and precision inputs

🚀 New Benchmark Scripts

benchmarks/70b_h200_trt_slurm.sh and benchmarks/70b_b200_trt_slurm.sh

  • TRT-LLM server setup on the target GPU (H200 or B200) using mpirun trtllm-serve
  • Inline llama-config.yml configuration
  • Client benchmarking with benchmark_serving.py

🔄 Launcher Script Updates

Updated SLURM Launchers

  • runners/launch_h200-nv.sh
  • runners/launch_h200-cw.sh
  • runners/launch_h200-nb.sh
  • runners/launch_b200-nv.sh

Key improvements:

  • Framework-specific SQSH file naming to prevent Docker image conflicts
  • Dynamic script selection based on framework (vLLM/SGLang use base scripts, TRT uses _trt scripts)
  • Proper MODEL_CODE environment variable passing to containers
  • Framework reset logic for vLLM and SGLang to use default script names
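
The launchers themselves are shell scripts; the Python sketch below only illustrates the selection logic described above. The TRT script path matches the names listed in this PR. The base script name (the same path without the `_trt` suffix), the `.sqsh` naming scheme, and both helper names are assumptions.

```python
from pathlib import PurePosixPath

# Frameworks that, per the description above, fall back to the base scripts.
BASE_SCRIPT_FRAMEWORKS = {"vllm", "sglang"}


def select_benchmark_script(model_code: str, hardware: str, framework: str) -> PurePosixPath:
    """Pick the benchmark script for a framework.

    vLLM and SGLang get the unsuffixed script; other frameworks get a
    framework suffix, e.g. benchmarks/70b_h200_trt_slurm.sh for TRT.
    The unsuffixed name is an assumption inferred from the TRT names.
    """
    fw = framework.lower()
    suffix = "" if fw in BASE_SCRIPT_FRAMEWORKS else f"_{fw}"
    return PurePosixPath("benchmarks") / f"{model_code}_{hardware}{suffix}_slurm.sh"


def sqsh_name(image: str, framework: str) -> str:
    """Derive a per-framework squash file name (hypothetical scheme).

    Including the framework keeps one framework's image from being
    reused or overwritten by another's.
    """
    safe_image = image.replace("/", "_").replace(":", "_")
    return f"{safe_image}_{framework.lower()}.sqsh"


if __name__ == "__main__":
    print(select_benchmark_script("70b", "h200", "TRT"))
    print(select_benchmark_script("70b", "h200", "VLLM"))
    print(sqsh_name("nvidia/tensorrt-llm", "TRT"))
```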

📊 Result Processing & Visualization

utils/process_result.py

  • Added framework and precision command-line arguments
  • Updated output data structure to include framework and precision
  • Default precision handling (empty string → 'fp8')
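
A minimal sketch of how the new arguments and the empty-string default could fit together. The flag names (`--framework`, `--precision`), the helper names, and the record layout are assumptions; the actual script may take positional arguments or use different keys.

```python
import argparse


def normalize_precision(precision: str) -> str:
    """Map an empty precision to the default 'fp8'; lowercase anything else."""
    return precision.strip().lower() or "fp8"


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Process one benchmark result (sketch).")
    # Hypothetical flag names; the real CLI may differ.
    parser.add_argument("--framework", required=True, help="e.g. vllm or trt")
    parser.add_argument("--precision", default="", help="empty string means FP8")
    return parser.parse_args(argv)


def build_record(args: argparse.Namespace, metrics: dict) -> dict:
    """Attach framework and precision to the measured metrics."""
    return {
        **metrics,
        "framework": args.framework,
        "precision": normalize_precision(args.precision),
    }


if __name__ == "__main__":
    example_args = parse_args(["--framework", "trt", "--precision", ""])
    print(build_record(example_args, {"example_metric": 1.0}))
```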

utils/plot_perf.py

  • Added distinct colors for TRT-LLM results:
    h200-trt: dark green
    b200-trt: gray
  • Unified plotting: all frameworks and hardware on single plots
  • Updated plot titles and legend handling
  • Model-specific plot generation
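
The unified plotting can be pictured as grouping results into one series per hardware and framework pair, then drawing every series on the same axes. The sketch below covers only the grouping and color lookup. The two TRT colors come from the list above, written as the Matplotlib color names `darkgreen` and `gray`; the record fields (`hw`, `framework`, `x`, `y`), the fallback color, and the `group_series` helper are assumptions.

```python
from collections import defaultdict

# Colors from the PR description, as Matplotlib color names.
SERIES_COLORS = {
    "h200-trt": "darkgreen",
    "b200-trt": "gray",
}
DEFAULT_COLOR = "black"  # assumption: fallback for series not listed above


def group_series(results):
    """Group result records into one plottable series per hardware-framework key.

    Each record is assumed to be a dict with 'hw', 'framework', 'x', 'y'.
    """
    grouped = defaultdict(list)
    for record in results:
        key = f"{record['hw'].lower()}-{record['framework'].lower()}"
        grouped[key].append((record["x"], record["y"]))
    return {
        key: {"color": SERIES_COLORS.get(key, DEFAULT_COLOR), "points": sorted(points)}
        for key, points in grouped.items()
    }


if __name__ == "__main__":
    demo = [
        {"hw": "h200", "framework": "trt", "x": 2, "y": 20.0},
        {"hw": "h200", "framework": "trt", "x": 1, "y": 10.0},
        {"hw": "h200", "framework": "vllm", "x": 1, "y": 5.0},
    ]
    for name, series in group_series(demo).items():
        print(name, series["color"], series["points"])
```

Each returned series can then be passed to a single plotting call per key, which keeps every framework and GPU on one figure.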

🧪 Testing Configuration

.github/workflows/workflow-scheduler.yml

  • Commented out concurrency and schedule blocks for manual testing
  • Disabled DSR1 jobs as requested

@kimbochen left a comment

Collaborator

Thank you for the PR. lgtm

@kimbochen merged commit 0ef8128 into main on Sep 5, 2025
@kimbochen deleted the kepotdar-trt-init branch on September 5, 2025 04:51
Oseltamivir added a commit that referenced this pull request Jun 23, 2026
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
… rate, run links

Addresses review #3 frontend critiques (backward-compatible with v2 docs):
- Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles.
- Suite selector backend-default vs resource-constrained — kept distinct, never read as
  one fair contest (#5). dtype/mode/resource/contract are all in the per-line label +
  hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10).
- Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine
  bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7).
- Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that
  produced the point (#1). Provenance text no longer claims a single dtype (the
  "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count,
  logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).

2 participants