PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Q: How reproducible is "PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark"?

Estimated time to first reproduction: a few days. Risk flags: (1) no repository-level reproducibility signals are currently available; (2) the estimate assumes artifact-level reproduction, and full training reproduction may require additional paper details. Use the paper-linked Hugging Face release as the starting artifact, then reconstruct training and evaluation settings from the paper.

Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery

Published: Mar 15, 2026

No direct implementation yet

Evidence: Inferred

Domain fit: AI-core

Verified repos: 0

Core AI workload signals detected from paper context and implementation/artifact evidence.

Time to first repro: a few days

2 risk flags

Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching, none of which are captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench
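The abstract's "near random chance" claim can be checked on any reproduced run with an exact binomial test. The following is a minimal sketch, not the authors' procedure: the sample count, number of classes, and significance threshold are hypothetical placeholders, and the helper names are illustrative.

```python
from math import exp, lgamma, log


def binomial_tail(n: int, correct: int, p_chance: float) -> float:
    """Return P(X >= correct) for X ~ Binomial(n, p_chance), with 0 < p_chance < 1.

    Terms are computed in log space so large n does not overflow.
    """
    total = 0.0
    for i in range(correct, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p_chance) + (n - i) * log(1.0 - p_chance)
        )
        total += exp(log_pmf)
    return min(total, 1.0)


def is_near_chance(n: int, correct: int, num_classes: int, alpha: float = 0.05) -> bool:
    """True if the observed accuracy is not significantly above uniform guessing."""
    return binomial_tail(n, correct, 1.0 / num_classes) > alpha


# Hypothetical numbers: 200 samples, 4 candidate classes (chance = 25%).
print(is_near_chance(200, 52, 4))   # 26% accuracy
print(is_near_chance(200, 120, 4))  # 60% accuracy
```

The one-sided test only asks whether a model beats uniform guessing; if the label distribution is imbalanced, a majority-class baseline would be the more appropriate reference.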

Technical details

Canonical key: arxiv-2603.14456

Cache status: Fresh

Generated at: May 2, 2026, 5:20 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: No

LLM status: ready

LLM model: openai/gpt-5.1-20251113

LLM generated: Apr 27, 2026, 5:25 AM

LLM content type: researcher_benchmark_brief

HF policy: hf-relevance-v27

LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_table_1], evidencePack.paperSections[id=paper_caption_2], evidencePack.paperSections[id=paper_table_4], evidencePack.paperSections[id=paper_caption_7], guidance.riskFlags[0], researcherSummary.reproductionRisks[2], guidance.riskFlags[1], researcherSummary.reproductionRisks[1], evidencePack.paperSections[id=paper_table_6], evidencePack.paperSections[id=paper_caption_8], researcherSummary.benchmarkSnapshot[0], researcherSummary.benchmarkSnapshot[1], paper.title, summary.hasReliableImplementation

implementation starting point

Benchmarks: thin evidence

Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

Qwen3-Omni-30B: WER Mean 0.358 (task: Natural language processing; source: paper fulltext)

Qwen2.5-Omni-7B: WER Mean 2.317 (task: Natural language processing; source: paper fulltext)

Qwen2.5-Omni-3B: WER Mean 4.189 (task: Natural language processing; source: paper fulltext)

Benchmark evidence drill-down

3 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Model	Metric	Value	Source	Evidence refs
Natural language processing	Qwen3-Omni-30B	WER Mean	0.358	paper-derived	No explicit refs
Natural language processing	Qwen2.5-Omni-7B	WER Mean	2.317	paper-derived	No explicit refs
Natural language processing	Qwen2.5-Omni-3B	WER Mean	4.189	paper-derived	No explicit refs
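To audit the WER figures above, it helps to recompute the metric independently. The sketch below implements the standard word-level definition (edit distance divided by reference length) and is not the paper's scoring script; any text normalization the authors apply and the way "WER Mean" is aggregated are not stated on this page. Because insertions count as errors, WER is not bounded by 1, so values such as 2.317 are arithmetically possible if the figures are fractions rather than percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# A repetitive hypothesis against a one-word reference: three insertions.
print(word_error_rate("salam", "salam salam salam salam"))
```

Whitespace tokenization is an assumption here; Persian text often needs normalization (for example of the zero-width non-joiner and Arabic versus Persian letter variants) before scores are comparable.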

Implementation Evidence Summary

Confidence: low

No direct maintained repository implementation was found, but paper-linked Hugging Face artifacts are available.
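A first step with the Hugging Face artifact is to inspect what the release actually contains before assuming any layout. The sketch below derives the repository id from the URL given in the abstract and wraps two real library calls, `huggingface_hub.list_repo_files` and `datasets.load_dataset`. The configuration and split names are not known from this page, so they are left as parameters; the helper names are illustrative.

```python
from urllib.parse import urlparse

# URL as given in the paper abstract.
DATASET_URL = "https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench"


def repo_id_from_url(url: str) -> str:
    """Extract 'owner/name' from a huggingface.co dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def list_release_files(repo_id: str) -> list:
    """List files in the dataset repo (requires network and huggingface_hub)."""
    from huggingface_hub import list_repo_files

    return list_repo_files(repo_id, repo_type="dataset")


def load_task(repo_id: str, config=None, split=None):
    """Load one configuration/split (requires network and the datasets library).

    `config` and `split` are assumptions to be filled in after inspecting
    the file listing or the dataset card.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, name=config, split=split)


# Typical use, once online:
#   repo_id = repo_id_from_url(DATASET_URL)
#   print(list_release_files(repo_id)[:20])
```

Listing files first avoids guessing task names; the 16 tasks may be exposed as separate configurations, folders, or a single table with a task column.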

Reproduction Risks

Estimate assumes artifact-level reproduction; full training reproduction may require additional paper details.

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Evidence disclosure

Evidence graph: 2 refs, 1 link.

Utility signals: depth 95/100, grounding 68/100, status medium.

Implementation Status

No verified maintained repo

There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.

Use the paper-linked Hugging Face release as the starting artifact, then reconstruct training and evaluation settings from the paper.
Use the paper PDF and citation graph to design a baseline reproduction.
Track assumptions and missing details in an experiment log before coding.
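The experiment log suggested in the plan above can be as simple as an append-only JSON Lines file. This is one possible shape, assuming nothing about the paper's tooling: the field names, the two entry kinds, and the example entry are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path

ALLOWED_KINDS = {"assumption", "missing_detail"}


@dataclass
class LogEntry:
    kind: str    # "assumption" or "missing_detail"
    topic: str   # e.g. "prompt template", "audio sampling rate"
    note: str    # what was assumed or what could not be found
    source: str = "unspecified"  # paper section, table, or dataset card
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"kind must be one of {sorted(ALLOWED_KINDS)}")


def to_jsonl_line(entry: LogEntry) -> str:
    """Serialize one entry; ensure_ascii=False keeps Persian text readable."""
    return json.dumps(asdict(entry), ensure_ascii=False)


def append_entry(path: str, entry: LogEntry) -> None:
    """Append one entry to the log file, creating it if needed."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(to_jsonl_line(entry) + "\n")


def open_questions(lines) -> list:
    """Return the topics of all 'missing_detail' entries from JSONL lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    return [r["topic"] for r in records if r["kind"] == "missing_detail"]


# Hypothetical example entry.
example = LogEntry("missing_detail", "vazn label set", "number of classes not confirmed")
print(to_jsonl_line(example))
```

Keeping assumptions and missing details in the same file makes it straightforward to list everything still unresolved before comparing reproduced numbers against the paper.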
