Ming-Lite-Omni-Preview: A MoE Model Designed to Perceive a Wide Range of Modalities
Source: INCLUSION AI
Introduction
Ming-Lite-Omni-Preview is a MoE model built upon Ling-Lite. It is designed to perceive a wide range of modalities, including text, images, audio, and video, while generating text and natural speech in a streaming manner. To handle the diverse modalities naturally, we have enhanced Ling-Lite by incorporating a modality-specific router for each modality. As a result, Ming-Omni excels at handling information from diverse modalities and is highly scalable.
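The idea of modality-specific routers can be illustrated with a toy layer. The sketch below is not the Ming-Lite-Omni-Preview implementation: the class name `ModalityRoutedMoE`, the top-k value, the random weights, and the use of simple per-dimension scaling as a stand-in for expert networks are all assumptions made for illustration. It only shows the general pattern of a shared pool of experts with one routing matrix per modality.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class ModalityRoutedMoE:
    """Toy MoE layer: shared experts, one router per modality (illustrative only)."""

    def __init__(self, dim, num_experts, modalities, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Each "expert" is a hypothetical per-dimension scaling standing in for an FFN.
        self.experts = [
            [rng.uniform(0.5, 1.5) for _ in range(dim)] for _ in range(num_experts)
        ]
        # One routing matrix (num_experts x dim) per modality.
        self.routers = {
            m: [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
            for m in modalities
        }

    def route(self, token, modality):
        """Return [(expert_index, weight)] for the top-k experts picked by this modality's router."""
        router = self.routers[modality]
        logits = [sum(w * x for w, x in zip(row, token)) for row in router]
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def forward(self, token, modality):
        """Weighted sum of the selected experts' outputs for one token."""
        out = [0.0] * len(token)
        for idx, w in self.route(token, modality):
            for d, x in enumerate(token):
                out[d] += w * self.experts[idx][d] * x
        return out


moe = ModalityRoutedMoE(dim=4, num_experts=4, modalities=["text", "image", "audio", "video"])
token = [0.1, -0.3, 0.7, 0.2]
text_route = moe.route(token, "text")    # routing decided by the text router
image_route = moe.route(token, "image")  # same token, routed by the image router
```

Because each modality owns its router, the same hidden vector can be sent to different experts depending on whether it came from text, image, audio, or video input.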
Key Features
Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves competitive performance across multiple modality benchmarks.
Video understanding: Supports KV-Cache dynamic compression of visual tokens. The model can understand videos that are hours long while also providing more detailed understanding of short videos lasting a few seconds.
Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, enables one-shot voice cloning, and enhances prosody through audio tokenizer compression.
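The video-understanding feature above mentions dynamic compression of visual tokens but does not describe the algorithm. As a rough intuition only, the sketch below shrinks a sequence of visual token vectors to a budget by mean-pooling adjacent pairs. The function name `compress_visual_tokens`, the budget parameter, and the pooling rule are assumptions for illustration; the model's actual KV-Cache compression may work quite differently.

```python
def compress_visual_tokens(tokens, budget):
    """Mean-pool adjacent token pairs until at most `budget` tokens remain.

    Hypothetical illustration only: it operates on plain token vectors,
    not on the keys and values of a real attention cache.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    tokens = [list(t) for t in tokens]
    while len(tokens) > budget:
        merged = []
        for i in range(0, len(tokens) - 1, 2):
            a, b = tokens[i], tokens[i + 1]
            merged.append([(x + y) / 2 for x, y in zip(a, b)])
        if len(tokens) % 2 == 1:
            merged.append(tokens[-1])  # an odd leftover token is carried over unchanged
        tokens = merged
    return tokens


# A long input is compressed; a short one that fits the budget is left untouched.
long_clip = [[float(i)] for i in range(10)]
short_clip = [[0.0], [1.0], [2.0]]
compressed_long = compress_visual_tokens(long_clip, budget=4)
compressed_short = compress_visual_tokens(short_clip, budget=4)
```

Under this scheme a short clip keeps every token, which matches the stated goal of detailed understanding for short videos, while a long video is reduced to fit the budget.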
Evaluation
Image benchmark
| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.84 | 83.9 | 84.5 |
| HallusionBench | 54.68 | 51.9 | 51.7 |
| MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
| MMMU | 57.0 | 58.6 | 54.8 |
| MMStar | 62.0 | 63.9 | 65.2 |
| MMVet | 73.6 | 67.1 | 68.1 |
| MathVista | 69.0 | 68.2 | 67.9 |
| OCRBench | 87.9 | 86.4 | 88.2 |
| Average | 70.96 | 70.5 | 70.3 |
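The Average row of the image benchmark can be reproduced as the unweighted mean of the eight benchmark scores per model. The short check below uses only the numbers reported above; the variable names are illustrative.

```python
# Per-model scores in the order: AI2D, HallusionBench, MMBench_TEST_V11, MMMU,
# MMStar, MMVet, MathVista, OCRBench (values copied from the image benchmark).
scores = {
    "Ming-Lite-Omni-Preview": [83.84, 54.68, 79.63, 57.0, 62.0, 73.6, 69.0, 87.9],
    "Qwen2.5-VL-7B-Instruct": [83.9, 51.9, 84.3, 58.6, 63.9, 67.1, 68.2, 86.4],
    "InternVL2.5-8B-MPO": [84.5, 51.7, 82.0, 54.8, 65.2, 68.1, 67.9, 88.2],
}

# Unweighted mean over the eight benchmarks.
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

The computed means agree with the reported 70.96, 70.5, and 70.3 after rounding.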
Object Recognition
| Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
|---|---|---|---|
| Plants | 52.1 | 55.3 | 32.8 |
| Animals | 52.6 | 54.8 | 36.5 |
| Home appliances & furniture | 93.5 | 97.4 | 90.9 |
| Personal Electronics | 96.1 | 95.1 | 93.2 |
| Food & Ingredients | 57.5 | 60.0 | 48.7 |
| Tableware | 96.6 | 94.9 | 88.1 |
| Vehicles | 31.9 | 40.9 | 31.9 |
| Average | 68.6 | 71.2 | 60.3 |
Video benchmark
| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5VL-7B |
|---|---|---|
| VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
| MVBench | 67.0 | 72.0 |
| Video-MMMU | 45.4 | 47.44 |
| LongVideoBench | 53.7 | 60.0 |
Audio benchmark
SpeechQA
| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |
ASR
| Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenet test-meeting | Librispeech test-clean | Librispeech test-other |
|---|---|---|---|---|---|---|
| Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
| Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
| GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
| Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
| Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
| Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |
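The ASR results do not name their metric. Assuming they are the error rates usually reported for these benchmarks (character error rate on the Mandarin sets, word error rate on Librispeech, in percent, lower is better), the sketch below shows how such a rate is typically computed from an edit distance. The function names and the sample sentences are illustrative and are not taken from the evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return the error rate in percent, over words (WER) or characters (CER)."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substitution and one deletion against a six-word reference.
wer = error_rate("the cat sat on the mat", "the cat sit on mat")
# One substituted character against a six-character reference.
cer = error_rate("今天天气很好", "今天天汽很好", unit="char")
```

Real evaluations normally apply text normalization (casing, punctuation, numerals) before scoring, which this sketch omits.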
Knowledge
| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |
OCR&GUI
| Benchmark | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.2 | 87.3 |
| DocVQA_TEST | 93.2 | 95.7 |
| OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
| OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
| TextVQA_VAL | 82.36 | 84.9 |
| ScreenSpot | 79.3 | 84.7 |
Model Downloads
You can download the model from both Hugging Face and ModelScope.
| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace, 🤖 ModelScope |
If you're in mainland China, we strongly recommend downloading the model from 🤖 ModelScope.
Use Cases
Video-Audio-QA
MultiModal Input QA
Q: (audio content, spoken in Chinese: "Please describe the video content.")
A: The video features a woman performing a series of yoga poses on a rooftop with a scenic view of mountains and a clear blue sky.
Q: Is there any food in front of me?
A: Yes, there's candy on the table.
Speech2Speech (supports dialect)
Quickstart
Please download the model as described in Model Downloads; you can then refer to the following code to run the Ming-Lite-Omni-Preview model.
import os
import torch  # needed for torch.bfloat16 below
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).to("cuda")

# replace YOUR_ASSETS_PATH with the directory that holds your test images
assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)

# qa
# The prompt is in Chinese: "Please describe the living habits of parrots in detail."
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]
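Output (translated from Chinese):
Parrots are highly intelligent and social birds, and their living habits are rich and interesting. Below is a detailed introduction to the living habits of parrots:
### 1. Habitat
Parrots are mainly distributed in tropical and subtropical regions, including Africa, Asia, Australia, and South America. They usually live in forests, grasslands, deserts, and urban environments. Different parrot species have different habitat requirements, but most parrots prefer places with abundant vegetation and water sources.
### 2. Diet
Parrots are omnivores with a highly varied diet. Their food includes seeds, nuts, fruits, vegetables, nectar, and insects. Parrots have very strong beaks that can easily crack open hard shells and nuts. Some parrots also eat soil or sand to aid digestion and supplement minerals.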
......
# image qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"},
        ],
    },
]
Output:
The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.
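The text-only and image examples above share one message layout: a single `HUMAN` turn whose `content` is a list of typed items. A small helper can build that structure. The function name `build_messages`, its parameters, and the `assets` directory are hypothetical conveniences, not part of the model's API; the sketch only reproduces the dictionary shape shown in the examples.

```python
import os


def build_messages(text, image_path=None, prefix=""):
    """Build a single-turn message list in the layout used by the examples.

    `prefix` is prepended to the question text, for instance a system-style prompt.
    """
    content = []
    if image_path is not None:
        # The image item comes before the text item, as in the image QA example.
        content.append({"type": "image", "image": image_path})
    content.append({"type": "text", "text": prefix + text})
    return [{"role": "HUMAN", "content": content}]


# Image question; "assets" is a placeholder directory name.
image_messages = build_messages(
    "What kind of flower is this?",
    image_path=os.path.join("assets", "flowers.jpg"),
)

# Text-only question.
text_messages = build_messages("Describe the living habits of parrots in detail.")
```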
To enable thinking before the response, add the following system prompt before your question:
cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in ... tags, then the final answer enclosed in ... tags. The critical answer or key result should be placed within \\boxed{}.\n"
And your input message should be like this:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
            {
                "type": "text",
                # "\\frac" is escaped so that Python does not read "\f" as a form feed
                "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\\frac{7}{16}$\n(B) $\\frac{3}{16}$\n(C) $\\frac{7}{32}$\n(D) $\\frac{9}{32}$\n(E) $\\frac{1}{5}$",
            },
        ],
    },
]
Output:
\\nOkay, so I have this problem about a rectangle ABCD…
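Since the system prompt asks the model to place the key result inside `\boxed{}`, the final answer can be pulled out of the generated text with simple brace matching. The helper below is a minimal sketch; the name `extract_boxed` and the sample string are made up for illustration, and the sample is not the model's actual answer to the problem above.

```python
def extract_boxed(text):
    """Return the content of the last \\boxed{...} in `text`, or None if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


# Hypothetical model output, used only to exercise the helper.
sample = "Putting it together, the result is \\boxed{\\frac{1}{2}}."
answer = extract_boxed(sample)
```

Counting brace depth rather than using a simple regular expression keeps nested LaTeX such as `\frac{...}{...}` intact.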