Summary
MuChoMusic is a benchmark for evaluating music understanding in audio-language models; the paper devotes a benchmarking section to its evaluation protocol and metrics. This page reports benchmark results for multiple-choice music question answering on the MuChoMusic benchmark. The reproduction guidance focuses on implementation viability and concrete risk controls.
Key Contributions
- MuChoMusic is introduced as a benchmark for evaluating music understanding in audio-language models, with a dedicated benchmarking section describing its evaluation protocol and metrics.
- The MuChoMusic benchmark uses a multiple-choice, output-based evaluation protocol where a model is given a music clip and options, and its generated text answer is compared to the correct option to score accuracy.
- The study benchmarks multiple audio-language models, including MusiLingo, MU-LLaMA, and M2UGen, each of which pairs the MERT audio encoder with a large language model such as Vicuna 7B or LLaMA-2 7B.
- The official MuChoMusic repository provides an evaluation script that reads model outputs from JSON files containing question prompts and predicted answers, and computes benchmark scores into a results directory.
- The MuChoMusic paper reports detailed benchmarking results including overall accuracy, separate knowledge and reasoning accuracies, and an instruction following rate for each evaluated model.
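The output-based protocol and the two headline metrics described above can be illustrated with a minimal sketch. It assumes that the instruction following rate (IFR) is the share of answers that can be mapped to exactly one of the given options, and that accuracy is the share of answers mapped to the correct option. The record fields (`options`, `answer_index`, `model_output`) and the matching heuristic are hypothetical choices for illustration; they are not taken from the official evaluation script.

```python
import re
import string


def match_option(output, options):
    """Return the index of the single option the output selects, or None."""
    text = output.strip().lower()
    letters = string.ascii_lowercase[: len(options)]
    # Case 1: the output starts with an option letter followed by ")", ".",
    # ":" or the end of the string, e.g. "B. Guitar", "(d)" or "c".
    m = re.match(rf"^\(?([{letters}])(?:[\).:]|$)", text)
    if m:
        return letters.index(m.group(1))
    # Case 2: exactly one option's text appears verbatim in the output.
    hits = [i for i, opt in enumerate(options) if opt.strip().lower() in text]
    return hits[0] if len(hits) == 1 else None


def score(records):
    """Compute accuracy and IFR over a list of question records."""
    if not records:
        raise ValueError("no records to score")
    followed = 0
    correct = 0
    for rec in records:
        idx = match_option(rec["model_output"], rec["options"])
        if idx is not None:
            followed += 1  # the answer maps to one of the options
            correct += int(idx == rec["answer_index"])
    n = len(records)
    return {"accuracy": correct / n, "ifr": followed / n}


if __name__ == "__main__":
    opts = ["Piano", "Guitar", "Violin", "Drums"]
    demo = [
        {"options": opts, "answer_index": 1, "model_output": "B. Guitar"},
        {"options": opts, "answer_index": 3, "model_output": "I cannot tell."},
    ]
    print(score(demo))
```

Under this definition an unmatched answer counts against accuracy as well as IFR, which is why the two numbers are best read together.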
Implementation Guidance
Start with mulab-mir/muchomusic: both this page's ranking and the evidence extracted from the paper point to it as the most viable implementation. Follow the repository's setup instructions first, then confirm that you can reproduce the benchmark results before adapting the code.
Reproducibility Notes
- The MuChoMusic repository has no CI workflows, so dependency or code changes can break evaluation without automated detection.
- Incorrect JSON formatting of model outputs for the benchmark may cause evaluation script failures or mis-scored results.
- Environment differences from the tested setup, such as Python or library versions, may affect reproducibility of reported metrics.
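One way to reduce the JSON formatting risk noted above is to check a model-output file before handing it to the evaluation script. The sketch below is only an illustration: the required keys (`prompt`, `model_output`) and the assumption that the file holds a top-level list of records are hypothetical and should be replaced with the schema documented in the repository.

```python
import json

# Hypothetical schema: adjust to the keys the evaluation script actually reads.
REQUIRED_KEYS = frozenset({"prompt", "model_output"})


def validate_outputs(json_text, required=REQUIRED_KEYS):
    """Return a list of problems found; an empty list means the text passed."""
    try:
        data = json.loads(json_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value must be a list of records"]
    problems = []
    for i, rec in enumerate(data):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        missing = sorted(set(required) - set(rec))
        if missing:
            problems.append(f"record {i}: missing keys {missing}")
        # Keys that are present must hold non-empty strings.
        for key in sorted(set(required) & set(rec)):
            if not isinstance(rec[key], str) or not rec[key].strip():
                problems.append(f"record {i}: '{key}' must be a non-empty string")
    return problems


if __name__ == "__main__":
    sample = json.dumps([{"prompt": "Which instrument plays?", "model_output": ""}])
    for problem in validate_outputs(sample):
        print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to fix a whole file in a single pass.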
Results & Benchmarks
| Task | Dataset | Metric | Value (%) |
|---|---|---|---|
| Music QA multiple-choice understanding | MuChoMusic benchmark | Accuracy | 21.1 |
| Music QA multiple-choice understanding | MuChoMusic benchmark | Accuracy | 32.4 |
| Music QA instruction following | MuChoMusic benchmark | IFR | 71.6 |
| Music QA instruction following | MuChoMusic benchmark | IFR | 79.4 |
Best Implementation
MuChoMusic is a benchmark for evaluating music understanding in multimodal audio-language models.
- Selected mulab-mir/muchomusic as the strongest maintained implementation for new work.
- The repository includes a dependency or environment manifest.
- Repository activity is within the last 24 months.
Reproduction Path
1. Start with mulab-mir/muchomusic and validate setup instructions in README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and runtime environment for reproducibility.
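
The logging step can be sketched with the Python standard library alone. The package names passed in the usage example (`torch`, `transformers`, `numpy`) are assumptions about what a typical setup installs, not a list taken from the repository's manifest; substitute the packages your environment actually uses.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Collect the Python version, platform string, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Hypothetical package list; store the output next to the benchmark results.
    snapshot = environment_snapshot(["torch", "transformers", "numpy"])
    print(json.dumps(snapshot, indent=2))
```

Saving this snapshot alongside each results file makes it possible to tell later whether a metric difference coincides with an environment change.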
Additional Implementations
No additional verified repositories beyond the primary recommendation.
Hugging Face Artifacts
No direct paper-linked artifacts were found. Showing strongest curated related artifacts.
- mulab-mir/muchomusic