MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

August 1, 2024 | arXiv: 2408.01337

1 repo | 44 stars | ~a few hours to reproduce

Summary

MuChoMusic is a benchmark for evaluating music understanding in multimodal audio-language models, with a defined evaluation protocol and metrics. This page lists reported results for multiple-choice music question answering on the MuChoMusic benchmark. The reproduction guidance covers implementation viability and the specific risks to control when re-running the evaluation.

Key Contributions

MuChoMusic is introduced as a benchmark for evaluating music understanding in audio-language models, with a dedicated benchmarking section describing its evaluation protocol and metrics.
The MuChoMusic benchmark uses a multiple-choice, output-based evaluation protocol where a model is given a music clip and options, and its generated text answer is compared to the correct option to score accuracy.
The study benchmarks multiple audio-language models, including MusiLingo, MuLLaMa, and M2UGen, each combining the MERT audio encoder with different large language models such as Vicuna 7B and LLaMA-2 7B.
The official MuChoMusic repository provides an evaluation script that reads model outputs from JSON files containing question prompts and predicted answers, and computes benchmark scores into a results directory.
The MuChoMusic paper reports detailed benchmarking results including overall accuracy, separate knowledge and reasoning accuracies, and an instruction following rate for each evaluated model.
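
The output-based protocol described above can be illustrated with a minimal sketch. It is not the official evaluation script: the record fields (`output`, `options`, `answer_index`), the matching rules, and the reading of instruction following rate (IFR) as the share of outputs that can be mapped to exactly one option are all assumptions made for illustration.

```python
import re
from typing import Optional

# Matches "(B)" anywhere at the start, or a bare letter followed by
# ".", ":", ")" or end of string. A bare "A guitar..." is NOT treated as option A.
LETTER = re.compile(r"^\s*(?:\(([A-D])\)|([A-D])\s*(?:[.:)]|$))")


def match_option(generated: str, options: list) -> Optional[int]:
    """Map free-form model text to an option index, or None if unclear."""
    m = LETTER.match(generated)
    if m:
        letter = m.group(1) or m.group(2)
        idx = ord(letter) - ord("A")
        return idx if idx < len(options) else None
    # Fallback: accept the answer only if exactly one option text appears in it.
    lowered = generated.lower()
    hits = [i for i, opt in enumerate(options) if opt.lower() in lowered]
    return hits[0] if len(hits) == 1 else None


def score(records: list) -> dict:
    """Compute accuracy and IFR (both in percent) over hypothetical records.

    Assumption: outputs that match no option count as incorrect.
    """
    if not records:
        raise ValueError("no records to score")
    matched = [match_option(r["output"], r["options"]) for r in records]
    followed = sum(m is not None for m in matched)
    correct = sum(m == r["answer_index"] for m, r in zip(matched, records))
    n = len(records)
    return {"accuracy": 100 * correct / n, "ifr": 100 * followed / n}
```

Keeping the matching step separate from the scoring step makes it easy to inspect which outputs failed to map to an option, which is the main source of ambiguity in output-based evaluation.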

Implementation Guidance

Start with mulab-mir/muchomusic, the only repository identified for this paper; it carries an MIT license and a dependency manifest. Follow the repository's setup instructions first, then confirm that you can reproduce the benchmark results before adapting the code.

Reproducibility Notes

Absence of CI workflows in the MuChoMusic repository can allow dependency or code changes to break evaluation without automated detection.
Incorrect JSON formatting of model outputs for the benchmark may cause evaluation script failures or mis-scored results.
Environment differences from the tested setup, such as Python or library versions, may affect reproducibility of reported metrics.
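
One way to reduce the JSON formatting risk noted above is to check an output file before passing it to the evaluation script. The sketch below assumes a top-level list of records with string fields named `prompt` and `model_output`; these field names are hypothetical, so confirm the expected schema in the repository README before relying on them.

```python
import json
from pathlib import Path

# Hypothetical field names; replace with the ones the evaluation script expects.
REQUIRED_FIELDS = ("prompt", "model_output")


def find_problems(path) -> list:
    """Return a list of human-readable problems found in a model-output file."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value is not a list of records"]

    problems = []
    for i, rec in enumerate(data):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            # Each required field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{field}'")
    return problems
```

An empty list means the file passed these basic checks; it does not guarantee that the official script will accept it.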

Results & Benchmarks

Task	Dataset	Metric	Value
Music QA multiple-choice understanding	MuChoMusic benchmark	Accuracy	21.1
Music QA multiple-choice understanding	MuChoMusic benchmark	Accuracy	32.4
Music QA instruction following	MuChoMusic benchmark	IFR	71.6
Music QA instruction following	MuChoMusic benchmark	IFR	79.4

Best Implementation

mulab-mir/muchomusic

MuChoMusic is a benchmark for evaluating music understanding in multimodal audio-language models.

44 2 Dec 2024 MIT

License ✓

CI –

Deps ✓

Docker –

Selected mulab-mir/muchomusic as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Reproduction Path

1. Start with mulab-mir/muchomusic and validate setup instructions in README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and runtime environment for reproducibility.
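
The logging step can be sketched with the Python standard library alone. This is an illustrative helper, not part of the MuChoMusic repository; the function name `environment_snapshot` is hypothetical.

```python
import json
import platform
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect the Python version, platform string, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        # Sort case-insensitively so snapshots from different runs diff cleanly.
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }
```

Saving the result of `json.dumps(environment_snapshot(), indent=2)` next to each results file gives a record that can be compared across machines when reported metrics differ.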

Time to first repro: a few hours | No CI workflows detected

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.

Curated Related

mulab-mir/muchomusic
18 7

Research Context

Methods

Transformer

Domains

Natural Language Processing