Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026
Citations: 0
Pairwise Preference · Automatic Metrics · General
- Large language models (LLMs) are increasingly used as automatic evaluators for natural language generation, often through pairwise comparative judgements.
- Existing approaches typically rely on a single judge, or aggregate multiple judges under the assumption that all are equally reliable.
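
To make the equal-reliability assumption concrete, here is a minimal sketch of a baseline that averages pairwise judgements with every judge weighted equally. It is an illustration only, not the method proposed in the paper. The function names, the `(judge, a, b, p)` tuple layout, and the win-ratio ranking step are all assumptions introduced for this example.

```python
from collections import defaultdict


def aggregate_equal_weight(judgements):
    """Average P(a beats b) over judges, giving every judge the same weight.

    `judgements` is an iterable of (judge, a, b, p) tuples, where p is that
    judge's probability that candidate a is better than candidate b.
    This tuple layout is a hypothetical format chosen for the sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _judge, a, b, p in judgements:
        sums[(a, b)] += p
        counts[(a, b)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def rank_by_win_ratio(pair_probs):
    """Rank candidates by their mean aggregated win probability."""
    wins = defaultdict(float)
    comparisons = defaultdict(int)
    for (a, b), p in pair_probs.items():
        wins[a] += p
        wins[b] += 1.0 - p
        comparisons[a] += 1
        comparisons[b] += 1
    scores = {c: wins[c] / comparisons[c] for c in wins}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: three hypothetical judges comparing candidates x, y and z.
judgements = [
    ("j1", "x", "y", 0.9), ("j2", "x", "y", 0.6), ("j3", "x", "y", 0.3),
    ("j1", "y", "z", 0.8), ("j2", "y", "z", 0.7), ("j3", "y", "z", 0.6),
    ("j1", "x", "z", 0.9), ("j2", "x", "z", 0.8), ("j3", "x", "z", 0.7),
]
pair_probs = aggregate_equal_weight(judgements)
ranking = rank_by_win_ratio(pair_probs)
print(ranking)
```

In this baseline an unreliable judge moves the averaged probability exactly as much as a reliable one, which is the limitation the second bullet points to.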