- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations · Human Eval · LLM as Judge · Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and scores below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
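The core move here is hindsight experience replay: a failed trajectory is relabeled with the goal it actually reached, turning it into a valid demonstration. A minimal Python sketch of that relabeling idea, assuming hypothetical data shapes and a `describe_outcome` summarizer that are not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    instruction: str   # the goal the agent was asked to achieve
    actions: list[str] # action sequence the agent actually took
    final_state: str   # observed end state (e.g., final page / tool output)
    success: bool      # did the agent satisfy the original instruction?

def describe_outcome(final_state: str) -> str:
    """Hypothetical summarizer: turn the achieved end state into an
    instruction that the trajectory *does* satisfy (e.g., via an LLM)."""
    return f"Reach a state where: {final_state}"

def hindsight_relabel(traj: Trajectory) -> Trajectory:
    """HER-style relabeling: a failed trajectory becomes a successful
    demonstration for the goal it actually reached."""
    if traj.success:
        return traj  # already a valid demonstration for its own goal
    return Trajectory(
        instruction=describe_outcome(traj.final_state),
        actions=traj.actions,
        final_state=traj.final_state,
        success=True,  # true by construction for the relabeled goal
    )

failed = Trajectory("Book a flight to Paris", ["search", "open_results"],
                    "search results page for Paris flights", success=False)
demo = hindsight_relabel(failed)
print(demo.instruction)  # relabeled goal the trajectory provably achieves
```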
- Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas
Tim Schopf, Michael Färber · Mar 11, 2026 · Citations: 0
Rubric Rating · Human Eval
To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments.
- LMUnit: Fine-grained Evaluation with Natural Language Unit Tests
Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin · Dec 17, 2024 · Citations: 0
Pairwise Preference · Human Eval
We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings,…
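The paradigm reads as: decompose response quality into explicit, individually testable criteria, score each one, and aggregate. A minimal sketch of that decomposition, where `score_criterion` is a trivial stand-in assumption for LMUnit's trained scoring model:

```python
def score_criterion(response: str, criterion: str) -> float:
    """Stub for a unit-test scorer (LMUnit trains a dedicated model for
    this); a trivial keyword check stands in for illustration."""
    return 1.0 if criterion.split()[-1].lower() in response.lower() else 0.0

def evaluate(response: str, unit_tests: list[str]) -> dict:
    """Score each natural-language unit test separately, then aggregate."""
    scores = {t: score_criterion(response, t) for t in unit_tests}
    return {"per_test": scores,
            "aggregate": sum(scores.values()) / len(scores)}

tests = [
    "The answer cites a source",
    "The answer states the limitation",
]
print(evaluate("Per the source, accuracy is 90%, with one limitation.", tests))
```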
- PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis · Oct 21, 2025 · Citations: 0
Rubric Rating · Human Eval · LLM as Judge
In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g.
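The rubric idea can be sketched as: each scene-graph fact becomes one check against the candidate description, and per-fact results aggregate into a score with an explicit error list. The flat graph encoding and the `verify` stub below are illustrative assumptions, not PoSh's actual components:

```python
def verify(description: str, fact: str) -> bool:
    """Stub for an LLM-as-a-Judge call checking one scene-graph fact;
    a substring test stands in for illustration."""
    return fact.lower() in description.lower()

def posh_style_score(description: str, scene_graph: list[str]) -> dict:
    """Score a detailed image description against a scene-graph rubric."""
    results = {fact: verify(description, fact) for fact in scene_graph}
    return {"errors": [f for f, ok in results.items() if not ok],
            "score": sum(results.values()) / len(results)}

graph = ["red barn", "barn left of silo", "two horses"]
print(posh_style_score("A red barn stands left of silo.", graph))
```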
- Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026 · Citations: 0
Pairwise Preference · Human Eval
Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
- LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0
Rubric Rating · Human Eval
We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
- Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
Jonas Kubesch, Lena Huber, Clemens Havas · Mar 6, 2026 · Citations: 0
Rubric Rating · Human Eval
This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation.
- A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu · Mar 26, 2026 · Citations: 0
Expert Verification · Human Eval
To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations.
- IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR
Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun · Jan 23, 2026 · Citations: 0
Pairwise Preference · Expert Verification · Human Eval
Peer review relies on substantive, evidence-based questions, yet current LLMs generate surface-level queries that perform worse than human reviewer questions in expert evaluation.
- PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei · Mar 6, 2026 · Citations: 0
Pairwise Preference · Human Eval
To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives.
- HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · LLM as Judge
Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
- Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups
Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi · Oct 23, 2025 · Citations: 0
Rubric Rating · Human Eval · Automatic Metrics
Prior research has established that ChatGPT can be instructed directly with coding rubrics to code communication data, achieving accuracy comparable to that of human raters.
- Think²: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0
Pairwise Preference · Human Eval
We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…
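One plausible reading of the framework is three chained prompts mirroring Planning, Monitoring, and Evaluation. The prompt wording and the `llm` callable below are illustrative assumptions, not the paper's templates:

```python
from typing import Callable

def metacognitive_answer(task: str, llm: Callable[[str], str]) -> str:
    """Three chained calls mirroring Brown's regulatory cycle."""
    plan = llm(f"Task: {task}\nPlan the solution step by step before solving.")
    draft = llm(f"Task: {task}\nPlan:\n{plan}\n"
                "Follow the plan; after each step, note whether it is on track.")
    final = llm(f"Task: {task}\nDraft answer:\n{draft}\n"
                "Evaluate the draft against the plan and output a corrected final answer.")
    return final

# Usage with any completion function, e.g. a local model or API wrapper:
echo = lambda prompt: prompt.splitlines()[-1]  # trivial stand-in LLM
print(metacognitive_answer("Sum the first 10 squares", echo))
```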
- RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026 · Citations: 0
Pairwise Preference · Critique Edit · Human Eval
In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion…
- TaoSR1: The Thinking Model for E-commerce Relevance Search
Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin · Aug 17, 2025 · Citations: 0
Pairwise Preference · Human Eval
Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3)…
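Stage (2) can be sketched as: sample N chain-of-thought generations offline, keep passing outputs as "chosen" and failing ones as "rejected", and pair them for DPO. The `generate` and `passes` stubs below are assumptions standing in for the model and the relevance verifier:

```python
import itertools, random

def generate(model, query: str) -> str:
    """Stub: one sampled chain-of-thought relevance judgment."""
    return random.choice(["relevant because ...", "not relevant because ..."])

def passes(query: str, output: str) -> bool:
    """Stub verifier: does the sampled output match the gold label?"""
    return output.startswith("relevant")

def dpo_pairs(model, query: str, n: int = 8) -> list[tuple[str, str]]:
    """pass@N sampling: draw N candidates, then pair each passing output
    (chosen) with each failing one (rejected) for DPO training."""
    samples = [generate(model, query) for _ in range(n)]
    good = [s for s in samples if passes(query, s)]
    bad = [s for s in samples if not passes(query, s)]
    return list(itertools.product(good, bad))

random.seed(0)
print(len(dpo_pairs(None, "red running shoes")))  # number of DPO pairs
```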
- Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan · Oct 27, 2025 · Citations: 0
Pairwise Preference · Human Eval
Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation.
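The pattern can be sketched as: the judge writes a small verification program, an executor runs it, and the execution result feeds the final verdict. Everything below (prompts, the bare-subprocess "sandbox", the toy model) is an illustrative assumption, not TIR-Judge's implementation:

```python
import os, subprocess, sys, tempfile

def run_sandboxed(code: str, timeout: float = 5.0) -> str:
    """Execute judge-written verification code in a subprocess.
    (A real system would use a proper sandbox, not a bare subprocess.)"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        out = subprocess.run([sys.executable, path], capture_output=True,
                             text=True, timeout=timeout)
        return out.stdout.strip()
    finally:
        os.unlink(path)

def judge(response_a: str, response_b: str, llm) -> str:
    """Pairwise judge that grounds its verdict in executed code."""
    check = llm("Write Python that prints which response is numerically "
                f"correct.\nA: {response_a}\nB: {response_b}")
    evidence = run_sandboxed(check)
    return llm(f"Execution result: {evidence}\nFinal verdict (A or B):")

# Toy stand-in LLM that always proposes a direct arithmetic check:
toy = lambda p: ("print('A' if 17*23 == 391 else 'B')" if "Write Python" in p
                 else p.split("result: ")[1].split()[0])
print(judge("17*23 = 391", "17*23 = 401", toy))  # -> A
```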
- Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback
Chenyang Zhao, Vinny Cahill, Ivana Dusparic · Feb 24, 2026 · Citations: 0
Pairwise Preference · RLAIF or Synthetic Feedback · Human Eval
Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes.
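The standard machinery behind this paradigm is Bradley-Terry reward learning: fit a reward model so preferred outcomes score higher, then optimize policy behaviour against it. A dependency-free sketch under that assumption (the paper's exact objective may differ):

```python
import math

def fit_reward(pairs, features, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w . phi(x) from preference pairs by SGD
    on the Bradley-Terry loss -log sigma(r_winner - r_loser)."""
    dim = len(features[pairs[0][0]])
    w = [0.0] * dim
    for _ in range(epochs):
        for winner, loser in pairs:
            fw, fl = features[winner], features[loser]
            rw = sum(a * b for a, b in zip(w, fw))
            rl = sum(a * b for a, b in zip(w, fl))
            g = -1.0 / (1.0 + math.exp(rw - rl))  # d(loss)/d(rw - rl)
            w = [wi - lr * g * (a - b) for wi, a, b in zip(w, fw, fl)]
    return w

# Two traffic outcomes described by (mean delay, queue length) features;
# the shorter-queue outcome is the human-preferred one.
features = {"short_queues": [-1.0, -0.5], "long_queues": [2.0, 1.5]}
w = fit_reward([("short_queues", "long_queues")], features)
print(w)  # learned weights score the preferred outcome higher
```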
- Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0
Human Eval · Automatic Metrics
However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
- FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin · Oct 8, 2025 · Citations: 0
Human Eval · Multi Agent
As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios.
- Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky · Feb 27, 2026 · Citations: 0
Human Eval · Automatic Metrics
This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose.
- Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0
Human Eval · Automatic Metrics
Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
- Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
Joshua Ashkinaze, Ruijia Guan, Laura Kurek, Eytan Adar, Ceren Budak · Jul 4, 2024 · Citations: 0
Human Eval · Automatic Metrics
We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy.
- LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models
Huimin Ren, Yan Liang, Baiqiao Su, Chaobo Sun, Hengtong Lu · Nov 13, 2025 · Citations: 0
Human Eval · LLM as Judge
Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability.
- Voxtral TTS
Mistral-AI, Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg · Mar 26, 2026 · Citations: 0
Human Eval · Automatic Metrics
In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5.
- Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0
Human Eval · Automatic Metrics
Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
- Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation
Lakshan Cooray, Deshan Sumanathilaka, Pattigadapa Venkatesh Raju · Jan 31, 2026 · Citations: 0
Human Eval · LLM as Judge
Nine instruction-tuned low-parameter SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods.
- Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation
Scott Merrill, Shashank Srivastava · Nov 21, 2025 · Citations: 0
Human Eval · Simulation Env
Transcripts produced via automatic speech recognition (ASR) assign anonymous speaker labels (e.g., Speaker_1), preventing models from capturing consistent human behavior.
- AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization
Fahmida Liza Piya, Rahmatollah Beheshti · Feb 23, 2026 · Citations: 0
Human Eval · LLM as Judge
We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content.
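The separation of concerns can be sketched as four independent LLM calls: select context, generate, verify, and correct only when verification flags issues. The prompts and stand-in model below are assumptions, not AgenticSum's components:

```python
def agentic_summarize(note: str, llm) -> str:
    """Inference-time pipeline: context selection, generation,
    verification, and targeted correction as separate LLM calls."""
    context = llm(f"Select only the clinically relevant sentences:\n{note}")
    summary = llm(f"Summarize faithfully:\n{context}")
    issues = llm("List claims in the summary unsupported by the source, "
                 f"or 'none'.\nSource:\n{context}\nSummary:\n{summary}")
    if issues.strip().lower() != "none":
        summary = llm(f"Rewrite the summary fixing only these issues:\n"
                      f"{issues}\nSummary:\n{summary}")
    return summary

# Trivial stand-in LLM so the sketch runs end to end:
passthrough = lambda p: "none" if p.startswith("List claims") else p.split("\n", 1)[1]
print(agentic_summarize("BP 150/90. Patient reports headache.", passthrough))
```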
- An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
Gabriel Stefan, Adrian-Marius Dumitran · Apr 9, 2026 · Citations: 0
Human Eval
We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation.
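A minimal sketch of the jury-plus-meta-agent pattern: evaluator agents vote independently, a synthesis step tallies the votes, and low agreement escalates to a human. The labels and the 0.8 escalation threshold are assumptions for illustration:

```python
from collections import Counter

def jury_verdict(passage: str, jurors: list, escalate_below: float = 0.8):
    """Heterogeneous agents vote independently; the meta-step synthesizes
    a verdict and escalates to a human when agreement is too low."""
    votes = [j(passage) for j in jurors]          # e.g. "biased"/"neutral"
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < escalate_below:
        return {"verdict": "ESCALATE_TO_HUMAN", "votes": votes}
    return {"verdict": label, "agreement": agreement}

jurors = [lambda t: "biased"] * 3 + [lambda t: "neutral"] * 2
print(jury_verdict("Some textbook passage ...", jurors))  # 3/5 -> escalates
```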
- STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu · Apr 8, 2026 · Citations: 0
Human Eval
To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with…
- Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
HyunJoon Jung, William Na · Apr 1, 2026 · Citations: 0
Human Eval
LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed?
- Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors
Cole Walsh, Rodica Ivan · Mar 26, 2026 · Citations: 0
Human Eval
These systems often perform comparably to or better than trained human raters, but have repeatedly been shown to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that…
- LLMs Do Not Grade Essays Like Humans
Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa · Mar 24, 2026 · Citations: 0
Human Eval
Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear.
- Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
Hanwen Shen, Ting Ying, Jiajie Lu, Shanshan Wang · Mar 14, 2026 · Citations: 0
Human Eval
Across toxic-prompt settings and benchmarks, CAP-TTA reduces bias (confirmed by human evaluation) while achieving much lower update latency than AdamW/SGD; it also mitigates catastrophic forgetting by significantly improving narrative…
- Enhancing Debunking Effectiveness through LLM-based Personality Adaptation
Pietro Dell'Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro · Mar 10, 2026 · Citations: 0
Human Eval
To assess the effectiveness of these transformations, we employ a separate LLM as an automated evaluator simulating corresponding personality traits, thereby eliminating the need for costly human evaluation panels.
- Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard · Mar 9, 2026 · Citations: 0
Human Eval
As AI-assisted grant proposals outpace manual review capacity in a kind of "Malthusian trap" for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation.
- TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jeļinska · Mar 9, 2026 · Citations: 0
Human Eval
Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages.
- TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Christian Greisinger, Steffen Eger · Mar 3, 2026 · Citations: 0
Human Eval
Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at…
- When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
Thibault Prouteau, Francis Lareau, Nicolas Dugué, Jean-Charles Lamirel, Christophe Malaterre · Mar 2, 2026 · Citations: 0
Human Eval
Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment.
- Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
Nora Petrova, John Burden · Feb 24, 2026 · Citations: 0
Human Eval
While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking.
- BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique · Feb 16, 2026 · Citations: 0
Human Eval
Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity.
- From Passive to Persuasive: Localized Activation Injection for Empathy and Negotiation
Niranjan Chebrolu, Kokil Jaidka, Gerard Christopher Yeo · Nov 16, 2025 · Citations: 0
Human Eval
Evaluated on emotional dialogue and negotiation in both single- and multi-turn settings, localized injection consistently outperforms global steering and instruction priming; human evaluation confirms that gains reflect genuine improvements…
- Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse
Liesbeth Allein, Nataly Pineda-Castañeda, Andrea Rocci, Marie-Francine Moens · Oct 15, 2025 · Citations: 0
Human Eval
In a diagnostic evaluation framework, we instruct nine LLMs to generate all possible intermediate causal steps linking given cause-effect pairs in causal chain structures.
- ReCellTy: Domain-Specific Knowledge Graph Retrieval-Augmented LLMs Reasoning Workflow for Single-Cell Annotation
Dezheng Han, Yibin Jia, Ruxiao Chen, Wenjie Han, Shuaishuai Guo · Apr 24, 2025 · Citations: 0
Human Eval
Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across multiple tissue types, while more closely aligning with the cognitive logic of manual annotation.
- Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue · Dec 19, 2024 · Citations: 0
Human Eval
However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning.
- Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models
Bei Yan, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen · Jun 24, 2024 · Citations: 0
Human Eval
To address this, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages specific indicators to assess both reliability and validity.