| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | [paper] |
| Fast Inference of Mixture-of-Experts Language Models with Offloading | [paper] |
| MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving | [paper] |
| Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models | [paper] |
| Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference | [paper] |
| SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | [paper] |
| SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models via Dynamic Expert Pruning and Swapping | [paper] |
| Accelerating Distributed MoE Training and Inference with Lina | [paper] |
| Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference | [paper] |
| EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models | [paper] |
| AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference | [paper] |
| ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference | [paper] |
| ProMoE: Fast MoE-based LLM Serving using Proactive Caching | [paper] |
| HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference | [paper] |
| Toward Efficient Inference for Mixture of Experts | [paper] |
| A Survey on Inference Optimization Techniques for Mixture of Experts Models | [paper] |
| MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services | [paper] |
| EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference | [paper] |
| fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving | [paper] |
| MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing | [paper] |
| Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline | [paper] |
| Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing | [paper] |
| DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference | [paper] |
| Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | [paper] |
| Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion | [paper] |
| CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory | [paper] |
| eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference | [paper] |
| Accelerating MoE Model Inference with Expert Sharding | [paper] |
| Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | [paper] |
| MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching | [paper] |
| MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | [paper] |
| D$^2$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving | [paper] |
| Faster MoE LLM Inference for Extremely Large Models | [paper] |
| Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony | [paper] |