- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations · Human Eval · LLM As Judge · Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and achieves below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
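The title points to hindsight experience replay applied to agent trajectories. As a rough illustration of the general HER idea (relabel a failed trajectory with the goal it actually achieved so it becomes usable supervision), here is a minimal Python sketch; the Trajectory fields and the describe_goal hook are hypothetical stand-ins, not AgentHER's actual interface.

```python
# Minimal sketch of hindsight relabeling applied to an agent trajectory.
# Illustrates the general HER idea, not AgentHER's exact procedure;
# the Trajectory fields and describe_goal hook are hypothetical.
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Trajectory:
    goal: str            # instruction the agent was originally given
    steps: List[str]     # actions the agent actually took
    final_state: str     # description of where the agent ended up
    success: bool        # whether the original goal was met

def relabel_in_hindsight(traj: Trajectory, describe_goal) -> Trajectory:
    """Turn a failed trajectory into a success for the goal it actually
    achieved, so it can be reused as training data."""
    if traj.success:
        return traj  # nothing to relabel
    achieved_goal = describe_goal(traj.final_state)  # e.g. an LLM summarizer
    return replace(traj, goal=achieved_goal, success=True)

# Toy usage: the goal describer would normally be an LLM call.
failed = Trajectory(
    goal="Find the cheapest flight to Tokyo",
    steps=["open_site", "search('Tokyo')", "open_results_page"],
    final_state="results page listing Tokyo flights sorted by departure time",
    success=False,
)
relabeled = relabel_in_hindsight(failed, lambda s: f"Reach a page showing: {s}")
print(relabeled.goal, relabeled.success)
```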
- CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0
Rubric Rating · Expert Verification · Human Eval · Web Browsing
The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
- Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas
Tim Schopf, Michael Färber · Mar 11, 2026 · Citations: 0
Rubric Rating · Human Eval
To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments.
- LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0
Rubric Rating · Human Eval
We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · Automatic Metrics
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
- Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner · Mar 29, 2026 · Citations: 0
Expert Verification · Human Eval · Automatic Metrics · Multi Agent
In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
- Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
Jonas Kubesch, Lena Huber, Clemens Havas · Mar 6, 2026 · Citations: 0
Rubric Rating · Human Eval
This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation.
- A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu · Mar 26, 2026 · Citations: 0
Expert Verification · Human Eval
To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations.
- PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei · Mar 6, 2026 · Citations: 0
Pairwise Preference · Human Eval
To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives.
- VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026 · Citations: 0
Pairwise Preference · Human Eval
Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often suffer from reward hacking, as existing approaches predominantly rely on…
- Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026 · Citations: 0
Human Eval · Automatic Metrics · Long Horizon
We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
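As a rough sketch of the iterative tool-calling loop the abstract contrasts with single-pass retrieval (decide a tool, execute it, feed the observation back, repeat until done), the Python below is illustrative only; the tool names and decide() policy are stand-ins, not BRTR's API.

```python
# Generic iterative tool-calling loop: decide -> execute -> observe -> repeat.
# Tool names and the scripted policy are hypothetical, not BRTR's interface.
from typing import Callable, Dict, List

def run_agent_loop(task: str,
                   decide: Callable[[str, List[str]], dict],
                   tools: Dict[str, Callable[[str], str]],
                   max_steps: int = 8) -> List[str]:
    """Repeatedly ask the policy which tool to call next, execute it,
    and feed the observation back until it signals completion."""
    observations: List[str] = []
    for _ in range(max_steps):
        action = decide(task, observations)       # normally an LLM call
        if action["tool"] == "finish":
            break
        result = tools[action["tool"]](action["arg"])
        observations.append(f"{action['tool']}({action['arg']}) -> {result}")
    return observations

# Toy tools standing in for spreadsheet operations.
tools = {
    "read_cell": lambda ref: "42" if ref == "B2" else "",
    "write_cell": lambda ref: f"wrote {ref}",
}

def scripted_policy(task, obs):
    # Stand-in policy: read one cell, then stop.
    return {"tool": "read_cell", "arg": "B2"} if not obs else {"tool": "finish", "arg": ""}

print(run_agent_loop("double the value in B2", scripted_policy, tools))
```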
- DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena · Apr 7, 2026 · Citations: 0
Human Eval · Long Horizon
Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
- EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo · Mar 9, 2026 · Citations: 0
Human Eval · Multi Agent
To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution.
- Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0
Human Eval · Automatic Metrics
However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
- Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Somaya Eltanbouly, Samer Rashwani · Mar 25, 2026 · Citations: 0
Human Eval · LLM As Judge
Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments.
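For readers unfamiliar with the LLM-as-a-judge pattern mentioned here, the sketch below shows the generic shape (rubric prompt in, structured scores out); the prompt, score fields, and call_model stub are assumptions for illustration, not the authors' actual Gemini judging setup.

```python
# Generic LLM-as-a-judge pattern: a rubric prompt, a model call, parsed scores.
# Prompt wording, score fields, and call_model are illustrative assumptions.
import json

JUDGE_PROMPT = """You are grading an answer about {topic}.
Question: {question}
Reference passage: {reference}
Candidate answer: {answer}
Return JSON: {{"faithfulness": 1-5, "fluency": 1-5, "justification": "..."}}"""

def judge(question: str, reference: str, answer: str, call_model) -> dict:
    """Ask a judge model to score one answer against retrieved reference text."""
    prompt = JUDGE_PROMPT.format(topic="Quran and Hadith interpretation",
                                 question=question, reference=reference,
                                 answer=answer)
    raw = call_model(prompt)          # a Gemini API call in the paper's setup
    return json.loads(raw)

# Toy usage with a stubbed model response.
fake_model = lambda _: '{"faithfulness": 4, "fluency": 5, "justification": "grounded"}'
print(judge("Q", "reference text", "candidate answer", fake_model))
```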
- Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon · Mar 31, 2026 · Citations: 0
Human Eval · Automatic Metrics
Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
- Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu · Mar 28, 2026 · Citations: 0
Human Eval · Automatic Metrics
Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
- How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi · Apr 8, 2026 · Citations: 0
Human Eval · Automatic Metrics
Large language models (LLMs) have been widely adopted as scalable surrogates for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
- Voxtral TTS
Mistral-AI, Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg · Mar 26, 2026 · Citations: 0
Human Eval · Automatic Metrics
In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5.
- Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties
Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi · Mar 26, 2026 · Citations: 0
Human Eval · Automatic Metrics
A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
- When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech
Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin · Mar 26, 2026 · Citations: 0
Human Eval · Automatic Metrics
We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data.
- Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media
Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl · Mar 19, 2026 · Citations: 0
Human Eval · Automatic Metrics
Experiments on the CrisisMMD benchmark dataset show that our proposed method boosts classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales.
- Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan · Apr 8, 2026 · Citations: 0
Human Eval · Simulation Env
We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with…
- An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
Gabriel Stefan, Adrian-Marius Dumitran · Apr 9, 2026 · Citations: 0
Human Eval
We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation.
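A minimal sketch of the screen-then-jury-then-meta-agent flow described above, assuming simple majority aggregation and an agreement threshold for human escalation; neither detail is claimed from the paper.

```python
# Schematic screen -> jury -> meta-agent pipeline. The vote aggregation and
# escalation rule are simplified assumptions, not the authors' architecture.
from collections import Counter
from typing import Callable, List

def evaluate_passage(passage: str,
                     screener: Callable[[str], bool],
                     jurors: List[Callable[[str], str]],
                     escalate_below: float = 0.8) -> dict:
    """Screen a textbook passage, collect verdicts from several evaluative
    agents, and escalate to a human when the jury does not clearly agree."""
    if not screener(passage):
        return {"verdict": "not_flagged", "escalate": False}
    votes = Counter(juror(passage) for juror in jurors)   # e.g. "biased" / "neutral"
    verdict, count = votes.most_common(1)[0]
    agreement = count / len(jurors)
    return {"verdict": verdict,
            "agreement": agreement,
            "escalate": agreement < escalate_below}       # hand off to a human

# Toy jury of five scripted agents standing in for LLM evaluators.
jury = [lambda p: "biased"] * 3 + [lambda p: "neutral"] * 2
print(evaluate_passage("sample passage", lambda p: True, jury))
```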
- STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu · Apr 8, 2026 · Citations: 0
Human Eval
To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with…
- PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation
Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han · Apr 2, 2026 · Citations: 0
Human Eval
Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations.
- Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
HyunJoon Jung, William Na · Apr 1, 2026 · Citations: 0
Human Eval
LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed?
- ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection
Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga · Mar 31, 2026 · Citations: 0
Human Eval
Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
- Open Machine Translation for Esperanto
Ona de Gibert, Lluís de Gibert · Mar 31, 2026 · Citations: 0
Human Eval
In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes.
- Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors
Cole Walsh, Rodica Ivan · Mar 26, 2026 · Citations: 0
Human Eval
These systems commonly achieve performance levels comparable to or better than those of trained human raters, but have frequently been shown to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that…
- LLMs Do Not Grade Essays Like Humans
Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa · Mar 24, 2026 · Citations: 0
Human Eval
Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear.
- Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
Hanwen Shen, Ting Ying, Jiajie Lu, Shanshan Wang · Mar 14, 2026 · Citations: 0
Human Eval
Across multiple benchmarks and human evaluations, CAP-TTA effectively reduces toxicity/bias scores with significantly lower latency than standard optimization methods (e.g., AdamW or SGD).
- Enhancing Debunking Effectiveness through LLM-based Personality Adaptation
Pietro Dell'Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro · Mar 10, 2026 · Citations: 0
Human Eval
To assess the effectiveness of these transformations, we employ a separate LLM as an automated evaluator simulating corresponding personality traits, thereby eliminating the need for costly human evaluation panels.
- Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard · Mar 9, 2026 · Citations: 0
Human Eval
As AI-assisted grant proposals outpace manual review capacity in a kind of "Malthusian trap" for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation.
- TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jeļinska · Mar 9, 2026 · Citations: 0
Human Eval
Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages.
- Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
Thanathai Lertpetchpun, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd · Mar 8, 2026 · Citations: 0
Human Eval
Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
- The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks
Anca Dinu, Andreiana Mihail, Andra-Maria Florescu, Claudiu Creanga · Mar 6, 2026 · Citations: 0
Human Eval
The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions.
- TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Christian Greisinger, Steffen Eger · Mar 3, 2026 · Citations: 0
Human Eval
Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at…