- Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0
Pairwise Preference · LLM As Judge · Automatic Metrics
A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
- Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · LLM As Judge · Simulation Env · Long Horizon
Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · LLM As Judge
We present the first study of self-preference bias (SPB) in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
Bosi Wen, Yilin Niu, Cunxiang Wang, Xiaoying Ling, Ying Zhang · Mar 5, 2026 · Citations: 0
Pairwise Preference · LLM As Judge
Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models.
- HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0
Pairwise Preference · LLM As Judge · Automatic Metrics
Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
- HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · LLM As Judge
Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
- Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik · Mar 6, 2026 · Citations: 0
Pairwise Preference · Expert Verification · LLM As Judge
This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods.
- WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie · Jan 5, 2026 · Citations: 0
Pairwise Preference · LLM As Judge
However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable…
- Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric
Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao · Feb 15, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · LLM As Judge
To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which…
- Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali · Apr 7, 2026 · Citations: 0
Pairwise Preference · LLM As Judge
Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge).
- Text-to-Stage: Spatial Layouts from Long-form Narratives
Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock · Mar 18, 2026 · Citations: 0
Pairwise Preference · LLM As Judge
In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications.
- Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan · Mar 12, 2026 · Citations: 0
Pairwise Preference · LLM As Judge
Multimodal Large Language Models (MLLMs) have been widely adopted as judges (MLLM-as-a-Judge) due to their strong alignment with human judgment across various visual tasks.
- VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization
Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons · Mar 11, 2026 · Citations: 0
Pairwise Preference · LLM As Judge
We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO).
- AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
Karen Zhou, Chenhao Tan · Mar 7, 2026 · Citations: 0
Pairwise Preference · LLM As Judge
Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge.