Repo · InclusionAI (Ant Group) · published Aug 13, 2025

inclusionAI/MoBE

Python


inclusionAI/MoBE

Description: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Language: Python

Stars: 34

Forks: 5

Open issues: 2

Created: 2025-08-13T08:19:35Z

Pushed: 2025-12-24T15:41:10Z

Default branch: main

Fork: no

Archived: no

README:

MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

---

✅ Feature List

  • [x] Support for multiple MoE models (BF16)
      • [x] Ling Family
      • [x] Qwen3MoE Family
      • [x] DeepSeek-V3
      • [x] Kimi-K2-Instruct
  • [ ] SGLang inference support (with fused-MoE kernel)
  • [ ] MoBE mega-kernel support (high-performance fused kernel for MoBE)

> 💡 *Coming soon: Optimized inference kernels for MoBE models to maximize throughput and memory efficiency.*

---

📘 Introduction

MoBE (Mixture-of-Basis-Experts) is a novel model compression technique for MoE-based LLMs, developed by the AGI Center, Ant Group Research. It reduces the parameter count by factorizing each expert's weight matrix as:

$$ \mathbf{W} = \mathbf{A}\mathbf{B}, \quad \text{where} \quad \mathbf{B} = \sum_{i=1}^m \alpha_i \mathbf{B}_i $$

  • $\mathbf{A}$: Expert-specific matrix
  • $\mathbf{B}$: Linear combination of $m$ basis matrices $\mathbf{B}_i$ shared across all experts, weighted by coefficients $\alpha_i$

The factorization is learned by minimizing the reconstruction error between the original and compressed weight matrices.
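The factorization and its reconstruction error can be illustrated with a small NumPy sketch. It is not the repository's training code: the toy sizes, the shapes chosen for $\mathbf{A}$ and the basis matrices, the per-expert coefficient layout, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; real experts are far larger).
n_experts, m = 4, 2      # number of experts and of shared basis matrices
p, r, d = 6, 5, 8        # assumed shapes: W is (p x d), A is (p x r), B_i is (r x d)

bases = rng.standard_normal((m, r, d))       # B_1..B_m, shared by all experts
alpha = rng.standard_normal((n_experts, m))  # assumed: one coefficient vector per expert
A = rng.standard_normal((n_experts, p, r))   # expert-specific matrices


def combine_bases(alpha, bases):
    """B_e = sum_i alpha[e, i] * bases[i] for every expert e."""
    return np.einsum("em,mrd->erd", alpha, bases)


def reconstruct(A, alpha, bases):
    """W_e = A_e @ B_e for every expert e (batched matrix product)."""
    return A @ combine_bases(alpha, bases)


def reconstruction_error(W, A, alpha, bases):
    """Mean squared error between original and reconstructed weights."""
    return float(np.mean((W - reconstruct(A, alpha, bases)) ** 2))


# Weights that factorize exactly give zero error; perturbing A raises it.
W = reconstruct(A, alpha, bases)
exact = reconstruction_error(W, A, alpha, bases)
perturbed = reconstruction_error(W, A + 0.1, alpha, bases)
```

In a real run the weights `W` come from the pretrained model, and `A`, `alpha`, and `bases` are the quantities optimized to drive this error down.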

🔍 Key Results

MoBE significantly outperforms prior compression methods with minimal accuracy degradation:

  • Reduces parameter count by 24%–30% in leading open-source models
  • Incurs only 1%–2% absolute accuracy drop (≈2% relative)
  • Demonstrated on Qwen3-235B, DeepSeek-V3 (671B), and Kimi-K2-Instruct (1T)
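To see where the savings come from, the sketch below counts parameters for one projection type in one layer under the factorization above. It assumes $\mathbf{A}$ has shape rows × rank and each basis matrix has shape rank × cols; the function name and these shapes are illustrative and not taken from the repository. The ratio covers a single compressed matrix type only, so it is not meant to reproduce the whole-model figures listed above.

```python
def mobe_param_ratio(num_experts, rows, cols, num_basis, rank):
    """Compressed / original parameter count for one projection type in one layer.

    Assumed shapes: W is (rows x cols), A is (rows x rank), each basis is (rank x cols).
    """
    original = num_experts * rows * cols
    expert_specific = num_experts * rows * rank   # one A per expert
    shared_bases = num_basis * rank * cols        # bases shared by all experts
    coefficients = num_experts * num_basis        # mixing weights alpha
    return (expert_specific + shared_bases + coefficients) / original


# Values borrowed from the DeepSeek-V3 training command in the Quickstart:
# 256 experts, 2048 x 7168 matrices, 64 bases, truncation 2048 (used here as the rank).
ratio = mobe_param_ratio(num_experts=256, rows=2048, cols=7168, num_basis=64, rank=2048)
print(f"compressed / original: {ratio:.3f}")
```

The cost of the shared bases is amortized over all experts, so the ratio is dominated by rank / cols plus num_basis / num_experts.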

📊 Evaluation Results

![results](results.jpg)

---

🚀 Quickstart

🔧 Installation

pip install -r requirements.txt

---

🛠️ Step-by-Step Instructions

Converting an MoE model to MoBE involves two stages:

1. Train the MoBE decomposition.
2. Either generate a native MoBE model or reconstruct a standard MoE model for compatibility.

---

1. Train MoBE Matrices

python train.py --index_path /root/DeepSeek-V3-0324/model.safetensors.index.json \
--base_dir /root/DeepSeek-V3-0324 \
--save_path /root/MoBE/DeepSeek-V3-0324 \
--num_hidden_layers 61 \
--num_matrices 256 \
--rows_per_matrix 2048 \
--cols 7168 \
--num_epochs 10000 \
--batch_size 32 \
--num_batches 8 \
--learning_rate 0.07 \
--num_B 64 \
--truncation 2048 \
--start_layer 3 \
--end_layer 61 \
--matrix_type "gate_proj" \
--activation 'tanh'

| Argument | Description |
|--------|-------------|
| index_path | Path to .safetensors.index.json mapping tensor names to shards |
| base_dir | Root directory containing model shards |
| save_path | Output directory for trained MoBE matrices |
| num_hidden_layers | Total number of transformer layers |
| num_matrices | Number of experts in the original MoE model |
| rows_per_matrix | Row dimension of the weight matrices (e.g., up_proj, gate_proj) |
| cols | Column dimension of the weight matrices |
| num_epochs | Number of optimization steps for reconstruction |
| batch_size | Batch size (number of experts sampled per step) |
| num_batches | Number of batches processed per epoch. Total experts in one layer = batch_size × num_batches |
| learning_rate | Learning rate for the optimizer (e.g., Adam) |
| num_B | Number of basis matrices used in the MoBE |
| truncation | Maximum number of rows retained in each basis matrix |
| start_layer | First transformer layer (inclusive) to apply MoBE compression |
| end_layer | Last transformer layer (exclusive) to apply compression |
| matrix_type | Type of weight matrix to compress (e.g., "gate_proj", "up_proj") |
| activation | Activation function used in MoBE (e.g., "silu", "tanh") |

> 💡 Tip: Run this step separately for each matrix_type (e.g., gate_proj, up_proj) within the same layer range.
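One way to follow the tip is to generate one command per matrix_type from a shared set of arguments. The helper below is a hypothetical convenience script, not part of the repository: it only uses the flags documented in the table above, checks the two constraints the table states (batch_size × num_batches equals the expert count, and the layer range is inclusive/exclusive), and prints the commands instead of running them. Reusing the same activation for both matrix types is an assumption.

```python
import shlex

# Shared arguments copied from the DeepSeek-V3 example above.
base_args = {
    "index_path": "/root/DeepSeek-V3-0324/model.safetensors.index.json",
    "base_dir": "/root/DeepSeek-V3-0324",
    "save_path": "/root/MoBE/DeepSeek-V3-0324",
    "num_hidden_layers": 61,
    "num_matrices": 256,
    "rows_per_matrix": 2048,
    "cols": 7168,
    "num_epochs": 10000,
    "batch_size": 32,
    "num_batches": 8,
    "learning_rate": 0.07,
    "num_B": 64,
    "truncation": 2048,
    "start_layer": 3,
    "end_layer": 61,
    "activation": "tanh",
}


def check_args(args):
    """Validate the constraints stated in the argument table."""
    if args["batch_size"] * args["num_batches"] != args["num_matrices"]:
        raise ValueError("batch_size * num_batches must equal num_matrices")
    # start_layer is inclusive, end_layer is exclusive.
    if not 0 <= args["start_layer"] < args["end_layer"] <= args["num_hidden_layers"]:
        raise ValueError("layer range must satisfy 0 <= start < end <= num_hidden_layers")


def build_command(args, matrix_type):
    """Return the argv list for one train.py run on the given matrix type."""
    check_args(args)
    argv = ["python", "train.py"]
    for name, value in args.items():
        argv += [f"--{name}", str(value)]
    argv += ["--matrix_type", matrix_type]
    return argv


commands = [build_command(base_args, mt) for mt in ("gate_proj", "up_proj")]
for argv in commands:
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to `subprocess.run` once the paths have been adjusted.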

For Kimi-K2-Instruct, we recommend dividing the experts within each transformer layer into two groups and applying MoBE compression separately to each group.

python train_group.py --index_path /root/Kimi-K2-Instruct/model.safetensors.index.json \
--base_dir /root/Kimi-K2-Instruct \
--save_path /root/MoBE/Kimi-K2-Instruct \
--num_hidden_layers 61 \
--num_matrices 384 \
--rows_per_matrix 2048 \
--cols 7168 \
--num_epochs 15000 \
--batch_size 32 \
--num_batches 12 \
--learning_rate 0.07 \
--num_B 128 \
--truncation 2048 \
--start_layer 1 \
--end_layer 61 \
--matrix_type "gate_proj" \
--num_groups 2 \
--activation 'silu'

| Argument | Description |
|--------|-------------|
| index_path | Path to .safetensors.index.json mapping tensor names to shards |
| base_dir | Root directory containing model shards |
| save_path | Output directory for trained MoBE matrices |
| num_hidden_layers | Total number of transformer layers |
| num_matrices | Number of experts in the original MoE model |
| rows_per_matrix | Row dimension of the weight matrices (e.g., up_proj, gate_proj) |
| cols | Column dimension of the weight matrices |
| num_epochs | Number of optimization steps for reconstruction |
| batch_size | Batch size (number of experts sampled per step) |
| num_batches | Number of batches processed per epoch. Total experts in one layer = batch_size × num_batches |
| learning_rate | Learning rate for the optimizer (e.g., Adam) |
| num_B | Number of basis matrices used in the MoBE |
| truncation | Maximum number of rows retained in each basis matrix |
| start_layer | First transformer layer (inclusive) to apply MoBE compression |
| end_layer | Last transformer layer (exclusive) to apply compression |
| matrix_type | Type of weight matrix to compress (e.g., "gate_proj", "up_proj") |
| activation | Activation function used in MoBE (e.g., "silu", "tanh") |
| num_groups | Number of expert groups to split the original MoE experts into before applying MoBE compression separately to each group |
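The grouping idea can be sketched as a partition of expert indices, with each group then compressed on its own. How train_group.py actually assigns experts to groups is not stated here; the contiguous, equal-sized split and the function name below are assumptions for illustration.

```python
def split_into_groups(num_experts, num_groups):
    """Partition expert indices 0..num_experts-1 into contiguous, equal-sized groups."""
    if num_experts % num_groups != 0:
        raise ValueError("num_experts must be divisible by num_groups")
    size = num_experts // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


# Kimi-K2-Instruct example values: 384 experts per layer, 2 groups.
groups = split_into_groups(num_experts=384, num_groups=2)
for g, members in enumerate(groups):
    print(f"group {g}: experts {members[0]}..{members[-1]} ({len(members)} experts)")
```

Under this assumption each group of 192 experts would get its own factorization, so a basis matrix is shared only among the experts of one group.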

---

2. Generate MoBE or Reconstructed MoE Model

After training, you can:

  • Deploy the native MoBE model (high compression)
  • Reconstruct a standard MoE model for compatibility with vLLM or SGLang
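Reconstructing a standard MoE model amounts to multiplying the factors back into dense per-expert weights, which gives up the storage savings in exchange for compatibility. The NumPy sketch below illustrates the idea and is not the repository's script. The shapes, the tensor-name pattern, and the function names are hypothetical, and the optional `activation` argument reflects an assumption that the trained activation is applied elementwise to the combined basis; this README does not say where the activation enters.

```python
import numpy as np


def reconstruct_expert(A_e, alpha_e, bases, activation=None):
    """Dense weight for one expert: A_e @ f(sum_i alpha_e[i] * bases[i])."""
    B_e = np.tensordot(alpha_e, bases, axes=1)   # (m,) x (m, r, d) -> (r, d)
    if activation is not None:
        B_e = activation(B_e)                    # assumed placement of the activation
    return A_e @ B_e                             # (p, r) @ (r, d) -> (p, d)


def reconstruct_layer(A, alpha, bases, layer, matrix_type, activation=None):
    """Map hypothetical tensor names to dense weights for every expert in a layer."""
    return {
        f"model.layers.{layer}.mlp.experts.{e}.{matrix_type}.weight":
            reconstruct_expert(A[e], alpha[e], bases, activation)
        for e in range(A.shape[0])
    }


# Toy example with made-up sizes.
rng = np.random.default_rng(0)
n_experts, m, p, r, d = 4, 2, 6, 5, 8
A = rng.standard_normal((n_experts, p, r))
alpha = rng.standard_normal((n_experts, m))
bases = rng.standard_normal((m, r, d))

dense = reconstruct_layer(A, alpha, bases, layer=3, matrix_type="gate_proj",
                          activation=np.tanh)
```

The resulting dictionary has one dense (p × d) matrix per expert, which is the shape a standard MoE checkpoint expects for that projection.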

##### 🔹 Option A: Save Native MoBE Model

python get_mobe.py --base_model /root/DeepSeek-V3-0324 \
--mobe_dir /root/MoBE/DeepSeek-V3-0324 \
--save_dir /root/DeepSeek-V3-0324-MoBE \
--num_B 64 \
--num_experts 256 \
--start_layer 3 \
--end_layer 61 \…
