# MoE Inference Optimization

| Title | Link |
| --- | --- |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | [paper] |
| Fast Inference of Mixture-of-Experts Language Models with Offloading | [paper] |
| MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving | [paper] |
| Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models | [paper] |
| Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference | [paper] |
| SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | [paper] |
| SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models via Dynamic Expert Pruning and Swapping | [paper] |
| Accelerating Distributed MoE Training and Inference with Lina | [paper] |
| Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference | [paper] |
| EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models | [paper] |
| AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference | [paper] |
| ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference | [paper] |
| ProMoE: Fast MoE-based LLM Serving using Proactive Caching | [paper] |
| HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference | [paper] |
| Toward Efficient Inference for Mixture of Experts | [paper] |
| A Survey on Inference Optimization Techniques for Mixture of Experts Models | [paper] |
| MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services | [paper] |
| EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference | [paper] |
| fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving | [paper] |
| MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing | [paper] |
| Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline | [paper] |
| Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing | [paper] |
| DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference | [paper] |
| Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | [paper] |
| Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion | [paper] |
| CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory | [paper] |
| eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference | [paper] |
| Accelerating MoE Model Inference with Expert Sharding | [paper] |
| Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | [paper] |
| MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching | [paper] |
| MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | [paper] |
| D$^2$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving | [paper] |
| Faster MoE LLM Inference for Extremely Large Models | [paper] |
| Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony | [paper] |