inclusionAI/MoBE
Description: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
Language: Python
Stars: 34
Forks: 5
Open issues: 2
Created: 2025-08-13T08:19:35Z
Pushed: 2025-12-24T15:41:10Z
Default branch: main
Fork: no
Archived: no
README:
MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
---
✅ Feature List
- [x] Support for multiple MoE models (BF16)
  - [x] Ling Family
  - [x] Qwen3MoE Family
  - [x] DeepSeek-V3
  - [x] Kimi-K2-Instruct
- [ ] Support for SGLang inference (with fused-MoE kernel)
- [ ] Support for the MoBE mega-kernel (high-performance fused kernel for MoBE)
> 💡 *Coming soon: Optimized inference kernels for MoBE models to maximize throughput and memory efficiency.*

---
📘 Introduction
MoBE (Mixture-of-Basis-Experts) is a model compression technique for MoE LLMs, developed by the AGI Center, Ant Group Research. It reduces the parameter count by factorizing each expert's weight matrix as:
$$ \mathbf{W} = \mathbf{A}\mathbf{B}, \quad \text{where} \quad \mathbf{B} = \sum_{i=1}^m \alpha_i B_i $$
- $\mathbf{A}$: Expert-specific matrix
- $\mathbf{B}$: Linear combination of the basis matrices $B_i$, which are shared across all experts, weighted by the coefficients $\alpha_i$
The factorization is learned by minimizing the reconstruction error between the original and compressed weight matrices.
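As a minimal sketch of the factorization above, the following NumPy snippet builds one expert's weight from its factors and measures the reconstruction error. The toy dimensions, the helper names `mobe_weight` and `relative_error`, and the use of the relative Frobenius norm as the error measure are all illustrative assumptions, not taken from the repository. The sketch follows the formula as written and does not model the `--activation` option of the training script.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; real models use far larger dimensions).
p, r, d = 8, 4, 16   # W is p x d, A is p x r, each basis B_i is r x d
m = 3                # number of shared basis matrices

A = rng.standard_normal((p, r))         # expert-specific matrix
alpha = rng.standard_normal(m)          # mixing coefficients alpha_1..alpha_m
bases = rng.standard_normal((m, r, d))  # basis matrices B_1..B_m, shared by all experts


def mobe_weight(A, alpha, bases):
    """Return W = A @ B with B = sum_i alpha_i * B_i."""
    B = np.tensordot(alpha, bases, axes=1)  # (m,) with (m, r, d) -> (r, d)
    return A @ B


def relative_error(W, W_hat):
    """Relative Frobenius-norm reconstruction error (an assumed metric)."""
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)


# A weight that is exactly representable, and a slightly perturbed one.
W_exact = mobe_weight(A, alpha, bases)
W_noisy = W_exact + 0.01 * rng.standard_normal((p, d))

err_exact = relative_error(W_exact, mobe_weight(A, alpha, bases))
err_noisy = relative_error(W_noisy, mobe_weight(A, alpha, bases))
print(f"exact: {err_exact:.3e}, noisy: {err_noisy:.3e}")
```

Training would adjust `A`, `alpha`, and `bases` to drive this error down for every expert at once; the optimizer and loss actually used are defined by the repository's training scripts.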
🔍 Key Results
MoBE significantly outperforms prior compression methods with minimal accuracy degradation:
- Reduces parameter count by 24%–30% in leading open-source models
- Incurs only 1%–2% absolute accuracy drop (≈2% relative)
- Demonstrated on Qwen3-235B, DeepSeek-V3 (671B), and Kimi-K2-Instruct (1T)
📊 Evaluation Results
 ---
🚀 Quickstart
🔧 Installation
pip install -r requirements.txt
---
🛠️ Step-by-Step Instructions
Converting an MoE model to MoBE involves two stages:
1. Train the MoBE decomposition.
2. Either generate a native MoBE model or reconstruct a standard MoE model for compatibility.

---
1. Train MoBE Matrices
python train.py --index_path /root/DeepSeek-V3-0324/model.safetensors.index.json \
    --base_dir /root/DeepSeek-V3-0324 \
    --save_path /root/MoBE/DeepSeek-V3-0324 \
    --num_hidden_layers 61 \
    --num_matrices 256 \
    --rows_per_matrix 2048 \
    --cols 7168 \
    --num_epochs 10000 \
    --batch_size 32 \
    --num_batches 8 \
    --learning_rate 0.07 \
    --num_B 64 \
    --truncation 2048 \
    --start_layer 3 \
    --end_layer 61 \
    --matrix_type "gate_proj" \
    --activation 'tanh'
| Argument | Description |
|--------|-------------|
| index_path | Path to .safetensors.index.json mapping tensor names to shards |
| base_dir | Root directory containing model shards |
| save_path | Output directory for trained MoBE matrices |
| num_hidden_layers | Total number of transformer layers |
| num_matrices | Number of experts in the original MoE model |
| rows_per_matrix | Row dimension of the weight matrices (e.g., up_proj, gate_proj) |
| cols | Column dimension of the weight matrices |
| num_epochs | Number of optimization steps for reconstruction |
| batch_size | Batch size (number of experts sampled per step) |
| num_batches | Number of batches processed per epoch. Total experts in one layer = batch_size × num_batches |
| learning_rate | Learning rate for the optimizer (e.g., Adam) |
| num_B | Number of basis matrices used in the MoBE |
| truncation | Maximum number of rows retained in each basis matrix |
| start_layer | First transformer layer (inclusive) to apply MoBE compression |
| end_layer | Last transformer layer (exclusive) to apply compression |
| matrix_type | Type of weight matrix to compress (e.g., "gate_proj", "up_proj") |
| activation | Activation function used in MoBE (e.g., "silu", "tanh") |
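To give a feel for how these arguments translate into parameter counts, here is a rough back-of-the-envelope calculation using the values from the DeepSeek-V3 example command above. The tensor shapes are an assumption made for illustration: $\mathbf{A}$ is taken to be rows_per_matrix × truncation, each basis matrix truncation × cols, with num_B coefficients per expert. The repository's actual storage layout may differ.

```python
# Values taken from the DeepSeek-V3 example command.
num_experts = 256         # --num_matrices
rows, cols = 2048, 7168   # --rows_per_matrix, --cols
num_basis = 64            # --num_B
rank = 2048               # --truncation (rows kept in each basis matrix)

# The argument table states: experts per layer = batch_size x num_batches.
batch_size, num_batches = 32, 8
assert batch_size * num_batches == num_experts

# ASSUMPTION: A is (rows x rank), each B_i is (rank x cols),
# and every expert stores num_basis mixing coefficients.
original = num_experts * rows * cols
mobe = num_experts * (rows * rank + num_basis) + num_basis * rank * cols

ratio = mobe / original
print(f"original: {original:,}  MoBE: {mobe:,}  ratio: {ratio:.3f}")
```

Under these assumptions the count covers a single projection type (here gate_proj) in a single layer, so it is not directly comparable to whole-model reduction figures, which also include weights that are left uncompressed.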
> 💡 Tip: Run this step separately for each matrix_type (e.g., gate_proj, up_proj) within the same layer range.
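One way to follow this tip is to generate one training command per matrix type from a shared set of arguments. The sketch below only builds and prints the commands; the flags and values are copied from the DeepSeek-V3 example, while the helper `build_command` is a hypothetical name. Reusing identical hyperparameters for both matrix types is an assumption, not a documented recommendation.

```python
import shlex

# Flags and values copied from the DeepSeek-V3 example; only --matrix_type varies.
common_args = {
    "index_path": "/root/DeepSeek-V3-0324/model.safetensors.index.json",
    "base_dir": "/root/DeepSeek-V3-0324",
    "save_path": "/root/MoBE/DeepSeek-V3-0324",
    "num_hidden_layers": 61,
    "num_matrices": 256,
    "rows_per_matrix": 2048,
    "cols": 7168,
    "num_epochs": 10000,
    "batch_size": 32,
    "num_batches": 8,
    "learning_rate": 0.07,
    "num_B": 64,
    "truncation": 2048,
    "start_layer": 3,
    "end_layer": 61,
    "activation": "tanh",
}


def build_command(matrix_type):
    """Return the argv list for one train.py run (hypothetical helper)."""
    cmd = ["python", "train.py"]
    for name, value in common_args.items():
        cmd += [f"--{name}", str(value)]
    cmd += ["--matrix_type", matrix_type]
    return cmd


commands = [build_command(t) for t in ("gate_proj", "up_proj")]
for cmd in commands:
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```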
For Kimi-K2-Instruct, we recommend dividing the experts within each transformer layer into two groups and applying MoBE compression separately to each group.
python train_group.py --index_path /root/Kimi-K2-Instruct/model.safetensors.index.json \
    --base_dir /root/Kimi-K2-Instruct \
    --save_path /root/MoBE/Kimi-K2-Instruct \
    --num_hidden_layers 61 \
    --num_matrices 384 \
    --rows_per_matrix 2048 \
    --cols 7168 \
    --num_epochs 15000 \
    --batch_size 32 \
    --num_batches 12 \
    --learning_rate 0.07 \
    --num_B 128 \
    --truncation 2048 \
    --start_layer 1 \
    --end_layer 61 \
    --matrix_type "gate_proj" \
    --num_groups 2 \
    --activation 'silu'
| Argument | Description |
|--------|-------------|
| index_path | Path to .safetensors.index.json mapping tensor names to shards |
| base_dir | Root directory containing model shards |
| save_path | Output directory for trained MoBE matrices |
| num_hidden_layers | Total number of transformer layers |
| num_matrices | Number of experts in the original MoE model |
| rows_per_matrix | Row dimension of the weight matrices (e.g., up_proj, gate_proj) |
| cols | Column dimension of the weight matrices |
| num_epochs | Number of optimization steps for reconstruction |
| batch_size | Batch size (number of experts sampled per step) |
| num_batches | Number of batches processed per epoch. Total experts in one layer = batch_size × num_batches |
| learning_rate | Learning rate for the optimizer (e.g., Adam) |
| num_B | Number of basis matrices used in the MoBE |
| truncation | Maximum number of rows retained in each basis matrix |
| start_layer | First transformer layer (inclusive) to apply MoBE compression |
| end_layer | Last transformer layer (exclusive) to apply compression |
| matrix_type | Type of weight matrix to compress (e.g., "gate_proj", "up_proj") |
| activation | Activation function used in MoBE (e.g., "silu", "tanh") |
| num_groups | Number of expert groups to split the original MoE experts into before applying MoBE compression separately to each group |
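To illustrate what splitting experts into groups means for the Kimi-K2-Instruct settings, the snippet below partitions the expert indices of one layer. It assumes contiguous, equal-sized groups; how train_group.py actually assigns experts to groups is not specified here and may differ.

```python
import numpy as np

num_experts = 384   # --num_matrices in the Kimi-K2-Instruct example
num_groups = 2      # --num_groups

# ASSUMPTION: contiguous, equal-sized groups; the real script may group differently.
groups = np.array_split(np.arange(num_experts), num_groups)

for g, idx in enumerate(groups):
    print(f"group {g}: experts {idx[0]}..{idx[-1]} ({len(idx)} experts)")
```

Each group would then be compressed independently, with its own set of basis matrices.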
---
2. Generate MoBE or Reconstructed MoE Model
After training, you can:
- Deploy the native MoBE model (high compression)
- Reconstruct a standard MoE model for compatibility with vLLM or SGLang
##### 🔹 Option A: Save Native MoBE Model
python get_mobe.py --base_model /root/DeepSeek-V3-0324 \
    --mobe_dir /root/MoBE/DeepSeek-V3-0324 \
    --save_dir /root/DeepSeek-V3-0324-MoBE \
    --num_B 64 \
    --num_experts 256 \
    --start_layer 3 \
    --end_layer 61 \
    …