- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · Automatic Metrics
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
- Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026 · Citations: 0
Human Eval · Automatic Metrics · Long Horizon
We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
- Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0
Human Eval · Automatic Metrics
However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
- CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models
Anqi Li, Chenxiao Wang, Yu Lu, Renjun Xu, Lizhi Ma · Feb 24, 2026 · Citations: 0
Human Eval · Automatic Metrics
Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings.
- When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech
Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin · Mar 26, 2026 · Citations: 0
Human Eval · Automatic Metrics
We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data.
- Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media
Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl · Mar 19, 2026 · Citations: 0
Human Eval · Automatic Metrics
Experiments are conducted on the CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales.
- Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0
Human Eval · Automatic Metrics
Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
- Claim Automation using Large Language Model
Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong · Feb 18, 2026 · Citations: 0
Human Eval · Automatic Metrics
We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy.
- DETECT: Determining Ease and Textual Clarity of German Text Simplifications
Maria Korobeynikova, Alessia Battisti, Lukas Fischer, Yingqiang Gao · Oct 25, 2025 · Citations: 0
Human Eval · Automatic Metrics
Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and…
- Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025 · Citations: 0
Human Eval · Automatic Metrics
Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
- Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese
Jenny Kunz, Iben Nyholm Debess, Annika Simonsen · Oct 1, 2025 · Citations: 0
Human Eval · Automatic Metrics
To address the lack of existing Faroese evaluation resources, we construct two new minimal-pair probing benchmarks, one for linguistic acceptability and one for text comprehension, and complement them with human evaluations conducted by…