Repo: Xiaomi (MiMo), published Oct 16, 2025

XiaomiMiMo/MiMo-Audio-Training

Python

Language: Python

Stars: 109

Forks: 13

Open issues: 5

Created: 2025-10-16T13:52:54Z

Pushed: 2025-10-16T13:55:07Z

Default branch: main

Fork: no

Archived: no

README:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

MiMo-Audio-Training Toolkit

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Introduction

Welcome to the MiMo-Audio-Training toolkit! This toolkit is designed to fine-tune the XiaomiMiMo/MiMo-Audio-7B-Instruct model. It serves as a reference implementation for researchers and developers who want to adapt MiMo-Audio to their own custom tasks.

Supported Tasks

The MiMo-Audio-Training toolkit supports supervised fine-tuning (SFT) for the following tasks:

  • ASR
  • TTS / InstructTTS
  • Audio Understanding and Reasoning
  • Spoken Dialogue

Getting Started

To get started with the MiMo-Audio-Training toolkit, follow the instructions below to set up the environment and install the required dependencies.

Prerequisites (Linux)

  • Python 3.12
  • CUDA >= 12.0
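
The two prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of the toolkit: the helper names `parse_major_minor`, `meets_prerequisites`, and `detect_cuda_version` are hypothetical, and the CUDA check assumes PyTorch is already installed and reads `torch.version.cuda`, which reports the CUDA version PyTorch was built against rather than the system toolkit version.

```python
import sys


def parse_major_minor(text):
    """Turn a version string such as '12.4' into (12, 4); '12' becomes (12, 0)."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def meets_prerequisites(python_version, cuda_version):
    """Return (python_ok, cuda_ok) for the README's stated requirements.

    python_version: a sequence like sys.version_info
    cuda_version:   a string like '12.4', or None if CUDA was not detected
    """
    python_ok = tuple(python_version[:2]) == (3, 12)
    cuda_ok = cuda_version is not None and parse_major_minor(cuda_version) >= (12, 0)
    return python_ok, cuda_ok


def detect_cuda_version():
    """Best-effort lookup of the CUDA version PyTorch was built with.

    Assumption: PyTorch is installed. Returns None if it is missing
    or if the installed build has no CUDA support.
    """
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda
```

Calling `meets_prerequisites(sys.version_info, detect_cuda_version())` yields a pair of booleans, one per requirement, so a setup script can report exactly which prerequisite is missing.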

Installation:

git clone --recurse-submodules https://github.com/XiaomiMiMo/MiMo-Audio-Training
cd MiMo-Audio-Training
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1
pip install -e .

Note: If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:

  • Download Precompiled Wheel

pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
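
After installation, it can be useful to confirm that the pinned flash-attn version is the one that was actually installed. The sketch below uses only the standard-library `importlib.metadata`. The helper names are hypothetical, and stripping the `+...` local version label is an assumption made so that a precompiled wheel compares equal to the pin if its metadata carries that suffix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def strip_local_label(version_text):
    """Drop a PEP 440 local label, e.g. '2.7.4.post1+cu12...' -> '2.7.4.post1'."""
    return version_text.split("+", 1)[0]


def check_pin(dist_name, expected):
    """Describe whether dist_name is installed at the expected version."""
    found = installed_version(dist_name)
    if found is None:
        return f"{dist_name}: not installed"
    if strip_local_label(found) != expected:
        return f"{dist_name}: found {found}, expected {expected}"
    return f"{dist_name}: OK ({found})"
```

For the pin used in the installation commands, the call would be `print(check_pin("flash-attn", "2.7.4.post1"))`.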

Training Process:

Download the fine-tuning dataset and pre-process the data into the format described in instruct_template.md.

Training

We provide multiple training scripts under the scripts directory, supporting both single-GPU and multi-GPU training setups.

cd MiMo-Audio-Training
bash scripts/train_multiGPU_torchrun.sh
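
The contents of the training scripts are not shown here, so the following is only a generic sketch of how a Python process launched by `torchrun` can discover its place in a multi-GPU job. `torchrun` sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each worker. The function name `distributed_context` and the returned dictionary layout are hypothetical and are not taken from the toolkit.

```python
import os


def distributed_context(env=None):
    """Read torchrun-style environment variables, with single-process defaults.

    env: a mapping to read from; defaults to os.environ. Passing a plain
    dict makes the function easy to exercise without launching torchrun.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))              # global index of this process
    local_rank = int(env.get("LOCAL_RANK", 0))  # index on this machine, usually the GPU id
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    return {
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
        "is_main": rank == 0,             # e.g. only this process logs or saves
        "distributed": world_size > 1,    # False for a single-GPU run
    }
```

With no variables set, the function falls back to the single-GPU case, so one entry point could in principle serve both setups mentioned above.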

Generation and Evaluation

Run inference with generate.py.

Evaluate the SFT model with 🌐MiMo-Audio-Eval.

Citation

@misc{coreteam2025mimoaudio,
title={MiMo-Audio: Audio Language Models are Few-Shot Learners},
author={LLM-Core-Team Xiaomi},
year={2025},
url={https://github.com/XiaomiMiMo/MiMo-Audio},
}

Contact

Please contact us at [mimo@xiaomi.com](mailto:mimo@xiaomi.com) or open an issue if you have any questions.
