Repo: InclusionAI (Ant Group), published Sep 29, 2025

inclusionAI/Ming-Freeform-Audio-Edit

Python

Captured source

inclusionAI/Ming-Freeform-Audio-Edit

Language: Python

Stars: 16

Forks: 2

Open issues: 0

Created: 2025-09-29T03:19:43Z

Pushed: 2025-10-27T07:46:37Z

Default branch: main

Fork: no

Archived: no

README:

Introduction

This repository hosts Ming-Freeform-Audio-Edit, the benchmark test set for evaluating the downstream editing tasks of the Ming-UniAudio model.

This test set covers 8 distinct editing tasks, categorized as follows:

+ Semantic Editing (3 tasks):
    + Free-form Deletion
    + Free-form Insertion
    + Free-form Substitution
+ Acoustic Editing (5 tasks):
    + Time-stretching
    + Pitch Shifting
    + Dialect Conversion
    + Emotion Conversion
    + Volume Conversion

The audio samples are sourced from well-known open-source datasets, including seed-tts-eval, LibriTTS, and GigaSpeech.

Dataset statistics

Semantic Editing

full version

| Task type (# samples) | Zh deletion | Zh insertion | Zh substitution | En deletion | En insertion | En substitution |
| --------------------- | ----------: | -----------: | --------------: | ----------: | -----------: | --------------: |
| Index-based           |         186 |          180 |              36 |         138 |          100 |              67 |
| Content-based         |          95 |          110 |             289 |          62 |           99 |             189 |
| Total                 |         281 |          290 |             325 |         200 |          199 |             256 |

basic version

| Task type (# samples) | Zh deletion | Zh insertion | Zh substitution | En deletion | En insertion | En substitution |
| --------------------- | ----------: | -----------: | --------------: | ----------: | -----------: | --------------: |
| Index-based           |          92 |           65 |              29 |          47 |           79 |              29 |
| Content-based         |          78 |          105 |             130 |         133 |           81 |             150 |
| Total                 |         170 |          170 |             159 |         180 |          160 |             179 |

*Index-based*: specifies an operation on the content at positions *i* to *j* (e.g., delete the characters or words from index 3 to 12).

*Content-based*: targets specific characters or words for editing (e.g., insert 'hello' before 'world').
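To make the two instruction styles concrete, here is a minimal sketch on a list of text tokens. The function names are hypothetical, and the index convention (1-based, inclusive) is an assumption; the benchmark's own convention may differ.

```python
def delete_by_index(tokens, start, end):
    """Index-based deletion: drop the tokens at positions start..end.

    Assumption: indices are 1-based and inclusive.
    """
    return tokens[:start - 1] + tokens[end:]


def insert_before(tokens, new_tokens, anchor):
    """Content-based insertion: place new_tokens before the first `anchor`."""
    pos = tokens.index(anchor)  # raises ValueError if the anchor is absent
    return tokens[:pos] + new_tokens + tokens[pos:]


words = "the quick brown fox jumps over the lazy dog".split()
print(delete_by_index(words, 3, 5))
print(insert_before(["world"], ["hello"], "world"))
```

For Chinese text the tokens would be characters rather than whitespace-separated words.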

Acoustic Editing

| Task Types\ # samples \ Language | Zh | En | | -------------------------------- | ---: | ---: | | Time-stretching | 50 | 50 | | Pitch Shifting | 50 | 50 | | Dialect Conversion | 250 | --- | | Emotion Conversion | 84 | 72 | | Volume Conversion | 50 | 50 |

Evaluation Metrics

Environment Preparation

git clone https://github.com/inclusionAI/Ming-Freeform-Audio-Edit.git
cd Ming-Freeform-Audio-Edit
pip install -r requirements.txt

Note: Download the audio and meta files from HuggingFace or ModelScope, then place the wavs and meta directories directly under Ming-Freeform-Audio-Edit.
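A quick way to confirm the data landed in the right place is to check for the two directories before running any evaluation. This is only a sketch: the helper name `missing_data_dirs` is hypothetical and not part of the repository.

```python
from pathlib import Path


def missing_data_dirs(repo_root, required=("wavs", "meta")):
    """Return the required directory names that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in required if not (root / name).is_dir()]


# An empty list means both directories are in place.
print(missing_data_dirs("Ming-Freeform-Audio-Edit"))
```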

Semantic Editing

For the deletion, insertion, and substitution tasks, we evaluate the performance using four key metrics:

+ Word Error Rate (WER) of the Edited Region (wer)
+ Word Error Rate (WER) of the Non-edited Region (wer.noedit)
+ Edit Operation Accuracy (acc)
+ Speaker Similarity (sim)
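For readers unfamiliar with WER, the sketch below computes it as the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. It is a generic illustration, not the repository's implementation: the evaluation scripts may normalize text differently, and how they split the edited and non-edited regions is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref
    # and hyp[:j]; only one previous row is kept at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(word_error_rate("insert hello before world", "insert hallo before world"))
```

Because the function splits on whitespace, Chinese transcripts would need to be separated into characters first.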

1. If the directories containing the edited waveforms are organized as shown below:

eval_path
|
├── del
│   └── edit_del_basic
│       └── tts/ # The directory that actually contains the edited wavs
├── ins
│   └── edit_ins_basic
│       └── tts/ # The directory that actually contains the edited wavs
└── sub
    └── edit_sub_basic
        └── tts/ # The directory that actually contains the edited wavs

Then you can run the following command to get those metrics:

cd Ming-Freeform-Audio-Edit/eval_scripts
bash run_eval_semantic.sh eval_path \
whisper_path \
paraformer_path \
wavlm_path \
eval_mode \
lang

Here is a brief description of the parameters for the script above:

+ eval_path: The top-level directory containing subdirectories for each editing task.
+ whisper_path: Path to the Whisper model, which is used to calculate WER for English audio. You can download it from here.
+ paraformer_path: Path to the Paraformer model, which is used to calculate WER for Chinese audio. You can download it from here.
+ wavlm_path: Path to the WavLM model, which is used to calculate speaker similarity. You can download it from here.
+ eval_mode: Which version of the evaluation set to use. Choose between basic and open.
+ lang: The language to evaluate. Choose between zh and en.
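Since the script takes six positional arguments, it is easy to pass them in the wrong order. The sketch below builds the command from named arguments and checks the two enumerated values. The helper name and the `/path/to/...` values are hypothetical placeholders; only the script name and the argument order come from the command above.

```python
import shlex

VALID_MODES = {"basic", "open"}
VALID_LANGS = {"zh", "en"}


def build_semantic_eval_cmd(eval_path, whisper_path, paraformer_path,
                            wavlm_path, eval_mode, lang):
    """Assemble the run_eval_semantic.sh call in the documented order."""
    if eval_mode not in VALID_MODES:
        raise ValueError(f"eval_mode must be one of {sorted(VALID_MODES)}")
    if lang not in VALID_LANGS:
        raise ValueError(f"lang must be one of {sorted(VALID_LANGS)}")
    return ["bash", "run_eval_semantic.sh", eval_path, whisper_path,
            paraformer_path, wavlm_path, eval_mode, lang]


cmd = build_semantic_eval_cmd("/path/to/eval_path", "/path/to/whisper",
                              "/path/to/paraformer", "/path/to/wavlm",
                              "basic", "en")
print(shlex.join(cmd))
# To execute, run it from the eval_scripts directory, for example:
#   subprocess.run(cmd, cwd="Ming-Freeform-Audio-Edit/eval_scripts", check=True)
```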

2. If your directory for the edited audio is not organized in the format described above, you can run the following commands.

cd eval_scripts
# get wer, wer.noedit
bash cal_wer_edit.sh meta_file \
wav_dir \
lang \
num_jobs \
res_dir \
task_type \
eval_mode \
whisper_path \
paraformer_path \
edit_cat # use `semantic` here
# get sim
bash cal_sim_edit.sh meta_file \
wav_dir \
wavlm_path \
num_jobs \
res_dir \
lang

Here is a brief description of the parameters for the scripts above:

+ meta_file: The absolute path to the meta file for the corresponding task (e.g., meta_en_deletion_basic.csv or meta_en_deletion.csv).
+ wav_dir: The directory containing the edited audio files (the WAV files should be located directly in this directory).
+ lang: zh or en.
+ num_jobs: The number of parallel processes.
+ res_dir: The directory in which to save the metric results.
+ task_type: del, ins, or sub.
+ eval_mode: Same as above.
+ whisper_path: Same as above.
+ paraformer_path: Same as above.
+ edit_cat: semantic or acoustic.
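Two of these requirements are easy to get wrong: meta_file must be an absolute path, and the WAV files must sit directly in wav_dir rather than in a subdirectory. The following sketch checks both before a run. The helper name is hypothetical, and it assumes the files use a lowercase `.wav` extension.

```python
from pathlib import Path


def check_flat_inputs(meta_file, wav_dir):
    """Return a list of problems with the inputs (empty if none are found)."""
    problems = []
    if not Path(meta_file).is_absolute():
        problems.append("meta_file should be an absolute path")
    wav_root = Path(wav_dir)
    if not wav_root.is_dir():
        problems.append("wav_dir does not exist or is not a directory")
    elif not any(wav_root.glob("*.wav")):  # non-recursive: top level only
        problems.append("no .wav files found directly under wav_dir")
    return problems


for problem in check_flat_inputs("meta_en_deletion_basic.csv", "edited_wavs"):
    print("warning:", problem)
```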

Acoustic Editing

For the acoustic editing tasks, we use WER and SPK-SIM as the primary evaluation metrics.
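Speaker similarity is commonly reported as the cosine similarity between speaker embeddings of two utterances. The sketch below shows that computation on plain Python lists. It assumes the embeddings have already been extracted (the README states that WavLM is used for speaker similarity), and the repository's scoring may differ in detail.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings.
print(cosine_similarity([0.2, 0.5, 0.1], [0.25, 0.45, 0.05]))
```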

1. If the directory for the edited audio follows the structure expected by the script, you can run the following command.

cd Ming-Freeform-Audio-Edit/eval_scripts
bash run_eval_acoustic.sh eval_path \
whisper_path \
paraformer_path \
wavlm_path \
eval_mode \
lang

2. Otherwise, you can run commands similar to the one for the semantic tasks, with the edit_cat parameter set to acoustic.

Additionally, for the dialect and emotion conversion tasks, we assess the conversion accuracy by leveraging a large language model (LLM) through API calls. Refer to eval_scripts/run_eval_acoustic.sh for more details.
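Conceptually, conversion accuracy is the fraction of edited samples whose judged label matches the requested target. The sketch below illustrates that with a pluggable `judge` callable. Everything here is hypothetical: the actual prompt, API, and label set are defined in eval_scripts/run_eval_acoustic.sh and are not reproduced.

```python
def conversion_accuracy(samples, judge):
    """Fraction of samples whose judged label equals the target label.

    samples: iterable of (audio_path, target_label) pairs.
    judge:   callable mapping an audio path to a predicted label; in practice
             this would wrap the LLM API call (hypothetical here).
    """
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for path, target in samples if judge(path) == target)
    return hits / len(samples)


# Stub judge with fixed answers, standing in for a real LLM call.
fake_labels = {"a.wav": "happy", "b.wav": "sad", "c.wav": "happy"}
targets = [("a.wav", "happy"), ("b.wav", "happy"), ("c.wav", "happy")]
print(conversion_accuracy(targets, fake_labels.get))
```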

Notability

notability 3.0/10

Low stars new repo, minor interest