Fork, Nous Research, published Feb 12, 2026

NousResearch/LeastLoadedEP

forked from SalesforceAIResearch/LeastLoadedEP

License: Apache-2.0

Stars: 3

Forks: 3

Open issues: 0

Created: 2026-02-12T22:54:52Z

Pushed: 2026-02-12T23:02:53Z

Default branch: main

Fork: yes

Parent repository: SalesforceAIResearch/LeastLoadedEP

Archived: no

README:

Least-Loaded Expert Parallelism (LLEP)

> Disclaimer: This repository is provided for research purposes only.

LLEP is an expert parallelism (EP) algorithm that dynamically reroutes excess tokens, along with their associated expert parameters, from overloaded devices to underutilized ones. It ensures that all devices complete their workloads within the minimum collective latency while respecting memory constraints. LLEP shines when pre-trained MoE models exhibit unpredictable, imbalanced routing, which is often the case even for the most powerful MoE LLMs.
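
The rerouting idea can be illustrated with a toy model that tracks only per-device token counts. This is a minimal sketch under stated assumptions, not the repository's implementation: the function name `spill_excess_tokens` and the `capacity_ratio` parameter are hypothetical, the capacity rule (a multiple of the mean load) is assumed for illustration, and the transfer of expert weights that accompanies spilled tokens is not modeled.

```python
import heapq
import math


def spill_excess_tokens(device_loads, capacity_ratio=1.0):
    """Greedily move excess tokens from overloaded devices to the least-loaded ones.

    device_loads: token count per device under standard EP.
    capacity_ratio: hypothetical knob (an assumption of this sketch); a device
        may hold up to capacity_ratio * mean load before it spills.
    Returns (final_loads, transfers), where transfers is a list of
    (src_device, dst_device, num_tokens) tuples.
    """
    n = len(device_loads)
    capacity = math.ceil(capacity_ratio * sum(device_loads) / n)
    loads = list(device_loads)

    # Min-heap of (load, device) for devices that still have spare room.
    heap = [(load, dev) for dev, load in enumerate(loads) if load < capacity]
    heapq.heapify(heap)

    transfers = []
    for src in range(n):
        excess = loads[src] - capacity
        while excess > 0 and heap:
            load, dst = heapq.heappop(heap)  # currently least-loaded device
            moved = min(excess, capacity - load)
            loads[src] -= moved
            loads[dst] += moved
            excess -= moved
            transfers.append((src, dst, moved))
            if loads[dst] < capacity:
                heapq.heappush(heap, (loads[dst], dst))
    return loads, transfers


if __name__ == "__main__":
    final_loads, transfers = spill_excess_tokens([900, 50, 30, 20])
    print(final_loads)
    print(transfers)
```

With the example loads above, device 0 starts with 900 of 1000 tokens and every device ends with 250. In a real system each transfer would also require the receiving device to hold the relevant expert's weights, which is why the source describes tokens moving together with their expert parameters.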

LLEP computes exactly the same mathematical result as a standard mixture-of-experts layer; only the placement of the work across GPUs is flexible. It does NOT alter the model's logical routing behavior for the sake of load balancing. LLEP is suitable for pre-training, post-training, and inference, and it supports gradients and the backward pass.
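
Why moving work does not change the result can be seen from the fact that an expert processes each token independently. The sketch below is an illustration only, assuming a single expert modeled as a plain linear map with integer weights; `expert_forward` and the variable names are hypothetical and do not come from the repository. It splits one expert's tokens between a "native" device and a "helper" device that holds a copy of the weights, then checks that the concatenated outputs match the single-device result.

```python
def expert_forward(tokens, weight):
    """Apply one expert (here just a linear map) to each token row independently."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


tokens = [[1, 2], [3, 4], [5, 6], [7, 8]]  # four tokens routed to the same expert
weight = [[1, 0, 2], [0, 1, -1]]           # expert weights, shape 2 x 3

# Standard EP: the expert's home device processes every token.
reference = expert_forward(tokens, weight)

# Spilled execution: the home device keeps one token, and a helper device
# receives the rest together with a copy of the expert weights.
helper_weight = [row[:] for row in weight]
native_part = expert_forward(tokens[:1], weight)
spilled_part = expert_forward(tokens[1:], helper_weight)
combined = native_part + spilled_part

assert combined == reference
```

Integer weights keep the comparison exact here. With floating-point GEMMs, different batch shapes can select different kernels, so this sketch does not by itself demonstrate bitwise equality on real hardware.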

LLEP achieves up to 6× speedup and 4× reduction in peak memory usage compared to standard EP, enabling faster and higher-throughput post-training and inference.

Standard EP - Imbalanced expert load may cause inefficiency and GPU out-of-memory (OOM) errors

![assets](assets/standard_ep.png)

LLEP - Excess loads and weights are spilled from overloaded GPUs to underloaded GPUs

![assets](assets/llep.png)

Speedup compared to EP across different model configurations

![assets](assets/speedup.png)

Memory usage: LLEP keeps peak memory usage constant, while EP risks OOM as memory consumption explodes under imbalanced load.

![assets](assets/memory_usage.png)

Usage

Simple test run

export MOE_ADAPTIVE_LPT_ROUTING_RATIO=1.3
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.run --nproc_per_node=8 test_llep.py \
    --num_tokens 32768 \
    --hidden_size 2880 \
    --intermediate_size 2880 \
    --num_experts 128 \
    --top_k 4 \
    --max_tokens_factor 1.0 \
    --min_tokens_per_gemm 1024 \
    --imbalance_configs "30:4,50:4,80:4,95:4"

Citation

@misc{llep,
  title={Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts},
  author={Xuan-Phi Nguyen and Shrey Pandit and Austin Xu and Caiming Xiong and Shafiq Joty},
  year={2026},
  eprint={2601.17111},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.17111},
}

Notability

notability 1.0/10

Low-star fork, trivial