SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
Changhao Jiang, Jiajun Sun, Yifei Cao, Jiabao Zhuang, Xinmeng Che, +13 more
Abstract
Speech is essential for realistic role-playing, yet existing work on role-playing agents largely centers on text, leaving Speech Role-Playing Agents (SRPAs) underexplored and without systematic evaluation. We introduce SpeechRole, a unified framework for developing and assessing SRPAs. SpeechRole-Data contains 98 roles and 111k speech-to-speech conversations with rich timbre and prosodic variation, providing large-scale resources for training SRPAs. SpeechRole-Eval offers a multidimensional benchmark that directly evaluates generated speech, preserving paralinguistic cues and measuring interaction ability, speech expressiveness, and role-playing fidelity. Experiments show that end-to-end SRPAs such as GPT-4o Audio achieve strong fluency and naturalness, but remain limited in prosody consistency and emotion appropriateness. In contrast, current open-source end-to-end models exhibit substantial performance gaps across multiple evaluation dimensions. Cascaded and end-to-end systems achieve comparable results in interaction ability and role-playing fidelity, suggesting that these aspects are still largely influenced by the underlying text-based language models. We release all data, code, and evaluation tools at https://github.com/yuhui1038/SpeechRole.