| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | [paper] |
| Fast Inference of Mixture-of-Experts Language Models with Offloading | [paper] |
| MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving | [paper] |
| Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models | [paper] |
| Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference | [paper] |
| SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | [paper] |
| SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models via Dynamic Expert Pruning and Swapping | [paper] |
| Accelerating Distributed MoE Training and Inference with Lina | [paper] |
| Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference | [paper] |
| EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models | [paper] |
| AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference | [paper] |
| ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference | [paper] |
| ProMoE: Fast MoE-based LLM Serving using Proactive Caching | [paper] |
| HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference | [paper] |
| Toward Efficient Inference for Mixture of Experts | [paper] |
| A Survey on Inference Optimization Techniques for Mixture of Experts Models | [paper] |
| MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services | [paper] |
| EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference | [paper] |
| fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving | [paper] |
| MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing | [paper] |
| Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline | [paper] |
| Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing | [paper] |
| DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference | [paper] |
| Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | [paper] |
| Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion | [paper] |
| CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory | [paper] |
| eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference | [paper] |
| Accelerating MoE Model Inference with Expert Sharding | [paper] |
| Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | [paper] |
| MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching | [paper] |
| MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | [paper] |
| D$^2$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving | [paper] |
| Faster MoE LLM Inference for Extremely Large Models | [paper] |
| Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony | [paper] |