Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
INCLUSION AI
The Introduction Video of Ming-UniAudio
Audio Edit Demo
Editing Tasks Video demos
🚀 Technical Highlights
First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio is a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, which makes it suitable for both understanding and generation tasks.
First Speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio is an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
First universal free-form speech editing model for semantic and acoustic tasks without specified edit regions: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
First benchmark for free-form speech editing: We propose Audio-Edit-Benchmark, the first open-source free-form evaluation set, comprising four semantic and five acoustic editing task types, to evaluate the model's editing performance.
Instruction-Guided Free-Form Speech Editing
Semantic Editing - Insert
Instruction | Transcription | Target Transcription | Before Edit Speech | Edit Result
insert '简直' after the character or word at index 8. | 真是个浪漫的邂逅可以说是英雄救美了 | 真是个浪漫的邂逅简直可以说是英雄救美了 | (audio) | (audio)
insert '真正' before the character or word '好'. | 就有道而正焉可谓好学也已 | 就有道而正焉可谓真正好学也已 | (audio) | (audio)
insert 'clearly' before the character or word at index 8. | Its legal status in Trinidad was insufficient to preserve its ecological status. | Its legal status in Trinidad was insufficient clearly to preserve its ecological status. | (audio) | (audio)
insert 'successfully' after the character or word 'profession'. | Previously an attorney Korona left the profession to pursue a career in music. | Previously an attorney Korona left the profession successfully to pursue a career in music. | (audio) | (audio)
Semantic Editing - Substitute
Instruction | Transcription | Target Transcription | Before Edit Speech | Edit Result
substitute '妈妈' with '爸爸'. | 我想对于妈妈来说会比任何礼物都要温暖 | 我想对于爸爸来说会比任何礼物都要温暖 | (audio) | (audio)
substitute the characters or words from index 8 to index 10 with '五万元'. | 当时我想等筹齐两万元聘礼就送她妈回家 | 当时我想等筹齐五万元聘礼就送她妈回家 | (audio) | (audio)
substitute 'get pictures off' with 'transfer photos from'. | I'm trying to explain to my mother how to get pictures off her phone. | I'm trying to explain to my mother how to transfer photos from her phone. | (audio) | (audio)
substitute the words from index 8 to index 9 with 'could become'. | Considering the growth of human population insects might be the food of the future. | Considering the growth of human population insects could become the food of the future. | (audio) | (audio)
Semantic Editing - Delete
Instruction | Transcription | Target Transcription | Before Edit Speech | Edit Result
delete '比普通的茶叶要'. | 花草茶的口味一般比普通的茶叶要苦一些 | 花草茶的口味一般苦一些 | (audio) | (audio)
delete the characters or words from index 11 to index 15. | 我吃了点燕麦片煎鸡蛋还喝了点橙汁 | 我吃了点燕麦片煎鸡蛋汁 | (audio) | (audio)
delete 'times'. | The classification of this gibbon has changed several times in the past few years. | The classification of this gibbon has changed several in the past few years. | (audio) | (audio)
delete the characters or words from index 2 to index 6. | On the second day the boy climbed to the top of a cliff near the camp | On climbed to the top of a cliff near the camp | (audio) | (audio)
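The insert, substitute, and delete instructions above can be mirrored at the text level to check a target transcription. The sketch below is an illustration only and is not part of Ming-UniAudio: the helper names (`split_units`, `insert_at`, `substitute_range`, `delete_range`) are hypothetical. It assumes, based on the examples shown, that indices are 1-based and inclusive, that text without spaces is indexed per character, and that text with spaces is indexed per whitespace-separated word.

```python
from typing import List, Tuple


def split_units(text: str) -> Tuple[List[str], str]:
    """Split text into editable units and return the separator used to rejoin them.

    Assumption: spaced text is indexed by word, unspaced text by character.
    """
    if " " in text:
        return text.split(), " "
    return list(text), ""


def _check(units: List[str], start: int, end: int) -> None:
    # Indices are assumed to be 1-based and inclusive.
    if not (1 <= start <= end <= len(units)):
        raise IndexError(f"range {start}..{end} outside 1..{len(units)}")


def insert_at(text: str, new: str, index: int, after: bool = True) -> str:
    """Insert `new` after (or before) the unit at the 1-based `index`."""
    units, sep = split_units(text)
    _check(units, index, index)
    pos = index if after else index - 1
    units[pos:pos] = [new]
    return sep.join(units)


def substitute_range(text: str, start: int, end: int, new: str) -> str:
    """Replace units start..end (1-based, inclusive) with `new`."""
    units, sep = split_units(text)
    _check(units, start, end)
    units[start - 1:end] = [new]
    return sep.join(units)


def delete_range(text: str, start: int, end: int) -> str:
    """Remove units start..end (1-based, inclusive)."""
    units, sep = split_units(text)
    _check(units, start, end)
    del units[start - 1:end]
    return sep.join(units)


if __name__ == "__main__":
    print(insert_at("真是个浪漫的邂逅可以说是英雄救美了", "简直", 8))
    print(substitute_range("当时我想等筹齐两万元聘礼就送她妈回家", 8, 10, "五万元"))
    print(delete_range("我吃了点燕麦片煎鸡蛋还喝了点橙汁", 11, 15))
```

Under these assumptions the helpers reproduce the index-based target transcriptions listed in the tables. They operate on text only; the model itself edits the speech signal, and how it locates the edit region internally is not described in this excerpt.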
Acoustic Editing - Dialect Conversion
Instruction | Transcription | Before Edit Speech | Edit Result
Change the accent of the speech to Dongbei. | 之后,他考取导游证,成为拱北口岸中旅的导游。 | (audio) | (audio)
Change the accent of the speech to Chengdu. | 只有当科技为本地社群创造价值的时候,才能真正有意义。 | (audio) | (audio)
Change the accent of the speech to Chengdu. | 我得用回想与幻想补充我所缺少的饮食,安慰我所得到的痛苦。 | (audio) | (audio)
Change the accent of the speech to Guangxi. | 全国恶性肿瘤发病,及死亡第一位的是肺癌。 | (audio) | (audio)
Acoustic Editing - Speed
Instruction | Transcription | Before Edit Speech | Edit Result
adjusts the speed to 0.5. | 我用胸抵住车把,掌握方向,速度一点也不比别人慢。 | (audio) | (audio)
adjusts the speed to 0.7. | There is a growing body of case law on Bayh-Dole. | (audio) | (audio)
adjusts the speed to 1.3. | Cribb was born near Bristol but moved to London before starting professional fighting. | (audio) | (audio)
adjusts the speed to 2. | 切实帮助困难群众解决生产生活中,遇到的困难和问题。 | (audio) | (audio)
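For intuition about what a speed value means, the sketch below applies a conventional signal-processing baseline rather than the model. It assumes the speed value is a playback-rate multiplier, so 0.5 roughly doubles the duration and 2 halves it; the excerpt does not define the scale. The function name `naive_speed_change` is hypothetical, and this naive linear-interpolation resampling also shifts pitch, which a learned edit may or may not do.

```python
from typing import List


def naive_speed_change(samples: List[float], speed: float) -> List[float]:
    """Resample by linear interpolation so playback is `speed` times faster.

    Assumption: `speed` is a rate multiplier (output length = input length / speed).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) / speed))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * speed                # fractional read position in the input
        lo = min(int(pos), last)       # sample at or before the position
        hi = min(lo + 1, last)         # next sample, clamped at the end
        frac = pos - int(pos)
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    ramp = [0.0, 1.0, 2.0, 3.0]
    print(len(naive_speed_change(ramp, 0.5)))  # longer output for speed < 1
    print(naive_speed_change(ramp, 2.0))       # shorter output for speed > 1
```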
Acoustic Editing - Pitch
Instruction | Transcription | Before Edit Speech | Edit Result
shifts the pitch by 3 steps. | 因为外面有战争,家里又有战争带来的悲伤和匮乏。 | (audio) | (audio)
shifts the pitch by 5 steps. | 自动驾驶将大幅提升出行安全,效率。 | (audio) | (audio)
shifts the pitch by -1 steps. | The heart of the campus has a number of historic buildings. | (audio) | (audio)
shifts the pitch by -1 steps. | Stevenson is also the director of music ministries at Angeles Mesa Presbyterian Church. | (audio) | (audio)
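If the "steps" in these instructions are semitones in twelve-tone equal temperament (an assumption; the excerpt does not define the unit), a shift of n steps corresponds to multiplying every frequency by 2^(n/12). The hypothetical helper below computes that ratio.

```python
def pitch_ratio(steps: float) -> float:
    """Frequency ratio for a shift of `steps` semitones (assumed unit)."""
    return 2.0 ** (steps / 12.0)


if __name__ == "__main__":
    for steps in (3, 5, -1):
        print(steps, round(pitch_ratio(steps), 4))
```

Under this reading, positive steps raise the pitch and negative steps lower it, and 12 steps would be one octave.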
Acoustic Editing - Volume
Instruction | Transcription | Before Edit Speech | Edit Result
adjusts the volume to 1.4. | A woman sits as she shows the designs she has made in the floor. | (audio) | (audio)
adjusts the volume to 1.6. | For example, they both consist of predominately older, hence redder, stars. | (audio) | (audio)
adjusts the volume to 0.9. | 伏羲的儿孙们看见伏羲捉来了鱼,也都欢欢喜喜跑来问长问短。 | (audio) | (audio)
adjusts the volume to 0.3. | 他们还告诉巨人,那座城市里群英荟萃。 | (audio) | (audio)
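One plausible reading of the volume values is a linear amplitude gain, where 1.0 leaves the signal unchanged. That is an assumption, since the excerpt does not state the scale. Under it, the hypothetical helpers below convert a factor to decibels and apply it to samples normalized to [-1, 1], clipping any overshoot.

```python
import math
from typing import List


def gain_to_db(factor: float) -> float:
    """Convert a linear amplitude factor to decibels."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 20.0 * math.log10(factor)


def apply_gain(samples: List[float], factor: float) -> List[float]:
    """Scale samples by `factor`, clipping to the normalized range [-1, 1]."""
    return [max(-1.0, min(1.0, s * factor)) for s in samples]


if __name__ == "__main__":
    for factor in (1.4, 1.6, 0.9, 0.3):
        print(factor, round(gain_to_db(factor), 2), "dB")
```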
Acoustic Editing - Denoise
Instruction | Transcription | Before Edit Speech | Edit Result
denoise the audio. | Be shape of example,before deriving this formula we explained what we mean by problems of this kind we now generalize these ideas for general binomial experiments. | (audio) | (audio)
denoise the audio. | Summoned to himself with firmness no surrender his superiors had also preached this saying it was the way of eternal honor his comrades were old. | (audio) | (audio)
denoise the audio. | There are people who travel long distances to assure my continued existence we have also seen the power of faith at work among us it was muscular but it wasn't symmetrical. | (audio) | (audio)
denoise the audio. | Theory eventually proved inexact the heavens refused to give up their weeping but what has been happening recently might be described as creeping mannerism clever. | (audio) | (audio)
Acoustic Editing - Background Music
Instruction | Before Edit Speech | Edit Result
add rain to audio. | (audio) | (audio)
add car sound to audio. | (audio) | (audio)
add carefree music to audio. | (audio) | (audio)
add groovy music to audio. | (audio) | (audio)
Acoustic Editing - Emotion Conversion
Instruction…
Excerpt shown — open the source for the full document.