OpenTrain AI
Maintained implementation available

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

August 1, 2024
arXiv: 2408.01337
1 repo, 44 stars, a few hours to reproduce

Summary

MuChoMusic is a benchmark for evaluating music understanding in multimodal audio-language models. The paper devotes a section to the benchmark's evaluation protocol and metrics. This page lists reported results for multiple-choice music question answering on MuChoMusic. The reproduction guidance covers whether the available implementation is usable and which risks to control.

Key Contributions

  • The paper introduces MuChoMusic, a benchmark for evaluating music understanding in audio-language models, and describes its evaluation protocol and metrics in a dedicated benchmarking section.
  • The MuChoMusic benchmark uses a multiple-choice, output-based evaluation protocol where a model is given a music clip and options, and its generated text answer is compared to the correct option to score accuracy.
  • The study benchmarks multiple audio-language models, including MusiLingo, MuLLaMa, and M2UGen, each combining the MERT audio encoder with different large language models such as Vicuna 7B and LLaMA-2 7B.
  • The official MuChoMusic repository provides an evaluation script that reads model outputs from JSON files containing question prompts and predicted answers, and computes benchmark scores into a results directory.
  • The MuChoMusic paper reports detailed benchmarking results including overall accuracy, separate knowledge and reasoning accuracies, and an instruction following rate for each evaluated model.
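
The output-based protocol and the two headline metrics described above can be sketched roughly as follows. This is a minimal illustration, not the repository's evaluation script: the record fields (`options`, `answer_index`, `model_output`) and the matching heuristic in `parse_choice` are assumptions made for the example, and the instruction following rate (IFR) is taken here to mean the share of responses that can be mapped to one of the offered options.

```python
import re
from typing import Optional

LETTERS = "ABCD"


def parse_choice(output: str, options: list[str]) -> Optional[int]:
    """Map free-form model text to an option index, or None if no option is recognisable.

    Hypothetical heuristic for illustration only.
    """
    text = output.strip()
    # Case 1: the answer starts with an option letter such as "B", "(B)", "B." or "B)".
    # A bare letter followed by a space is NOT accepted, so "A piano plays" is not read as option A.
    match = re.match(r"^\(?([A-D])(?:[.:)]|$)", text)
    if match:
        idx = LETTERS.index(match.group(1))
        return idx if idx < len(options) else None
    # Case 2: exactly one option's text appears in the answer.
    hits = [i for i, opt in enumerate(options) if opt.lower() in text.lower()]
    return hits[0] if len(hits) == 1 else None


def score(records: list[dict]) -> dict:
    """Return accuracy and IFR as percentages over all records."""
    if not records:
        raise ValueError("no records to score")
    parsed = [parse_choice(r["model_output"], r["options"]) for r in records]
    followed = sum(p is not None for p in parsed)
    correct = sum(p == r["answer_index"] for p, r in zip(parsed, records))
    n = len(records)
    return {"accuracy": 100 * correct / n, "ifr": 100 * followed / n}
```

Responses that cannot be mapped to any option count against both metrics in this sketch, which is one reason a model with a low IFR also tends to show low accuracy.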

Implementation Guidance

Start with mulab-mir/muchomusic: it is the only verified repository for this paper and the one selected as the strongest maintained implementation. Follow the repository's setup instructions first, then confirm that you can reproduce the benchmark results before adapting the code.

Reproducibility Notes

  • Absence of CI workflows in the MuChoMusic repository can allow dependency or code changes to break evaluation without automated detection.
  • Incorrect JSON formatting of model outputs for the benchmark may cause evaluation script failures or mis-scored results.
  • Environment differences from the tested setup, such as Python or library versions, may affect reproducibility of reported metrics.
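
One way to reduce the JSON formatting risk noted above is to check the model output file before handing it to the evaluation script. The sketch below is an assumption-laden example: the field names `prompt` and `model_output` are hypothetical placeholders, so adjust `REQUIRED_FIELDS` to whatever schema the repository's README actually specifies.

```python
import json
from pathlib import Path

# Hypothetical schema: replace with the field names the evaluation script expects.
REQUIRED_FIELDS = {"prompt": str, "model_output": str}


def find_problems(data) -> list[str]:
    """Return human-readable descriptions of schema problems in parsed JSON data."""
    if not isinstance(data, list):
        return ["top-level JSON value must be a list of records"]
    problems = []
    for i, rec in enumerate(data):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected an object, got {type(rec).__name__}")
            continue
        for field, expected in REQUIRED_FIELDS.items():
            if field not in rec:
                problems.append(f"record {i}: missing field '{field}'")
            elif not isinstance(rec[field], expected):
                problems.append(f"record {i}: field '{field}' should be {expected.__name__}")
    return problems


def check_file(path: str) -> list[str]:
    """Parse a JSON file and report syntax or schema problems instead of raising."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    return find_problems(data)
```

An empty list from `check_file` means the file passed these checks only; it does not guarantee that the evaluation script will accept it.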

Results & Benchmarks

Task                                      Dataset                 Metric      Value
Music QA multiple-choice understanding    MuChoMusic benchmark    Accuracy    21.1
Music QA multiple-choice understanding    MuChoMusic benchmark    Accuracy    32.4
Music QA instruction following            MuChoMusic benchmark    IFR         71.6
Music QA instruction following            MuChoMusic benchmark    IFR         79.4

Best Implementation

MuChoMusic is a benchmark for evaluating music understanding in multimodal audio-language models.

44 stars, 2 Dec 2024, MIT license
  • Selected mulab-mir/muchomusic as the strongest maintained implementation for new work.
  • The repository includes a dependency or environment manifest.
  • Repository activity is within the last 24 months.

Reproduction Path

  1. Start with mulab-mir/muchomusic and validate the setup instructions in its README.

  2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.

  3. Log exact dependency versions and the runtime environment for reproducibility.
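
For the logging step, a small standard-library script can record the interpreter, platform, and installed package versions alongside each run. This is a generic sketch rather than anything provided by the repository; the function names and the output filename `environment_snapshot.json` are arbitrary choices for the example.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect Python version, platform string, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def write_snapshot(path: str = "environment_snapshot.json") -> None:
    """Write the snapshot next to the benchmark results for later comparison."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot_environment(), fh, indent=2)
```

Saving one snapshot per evaluation run makes it easier to tell whether a change in scores comes from the model or from a changed dependency.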

Time to first repro: a few hours
No CI workflows detected

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No direct paper-linked artifacts were found.

Research Context