Tag: Human Eval

Includes explicit human evaluation in the reported methodology.

Papers in tag: 36

Research Utility Snapshot

Evaluation Modes

  • Human Eval (20)
  • Automatic Metrics (8)
  • Simulation Env (3)

Human Feedback Types

  • Pairwise Preference (9)
  • Rubric Rating (4)
  • Critique Edit (1)

Required Expertise

  • General (12)
  • Coding (5)
  • Law (2)
Distill and Align Decomposition for Enhanced Claim Verification

Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero, Arturo Oncevay · Feb 25, 2026 · Citations: 0

Human Eval · Automatic Metrics · General
  • Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
  • Human evaluation confirms the high quality of the generated subclaims.
A Benchmark for Deep Information Synthesis

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger · Feb 24, 2026 · Citations: 0

Human Eval · Automatic Metrics · Coding
  • Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
  • However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval.
Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback

Chenyang Zhao, Vinny Cahill, Ivana Dusparic · Feb 24, 2026 · Citations: 0

Pairwise Preference · RLAIF or Synthetic Feedback · Human Eval · General
  • Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes.
  • More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators.
CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models

Anqi Li, Chenxiao Wang, Yu Lu, Renjun Xu, Lizhi Ma, Zhenzhong Lan · Feb 24, 2026 · Citations: 0

Human Eval · Automatic Metrics · General
  • Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings.
  • CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations.
PreScience: A Benchmark for Forecasting Scientific Contributions

Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord · Feb 24, 2026 · Citations: 0

Human Eval · Simulation Env · General
  • We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.
  • We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement.
Think²: Grounded Metacognitive Reasoning in Large Language Models

Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0

Pairwise Preference · Human Eval · Math · Medicine
  • We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
  • Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold incr
Validating Political Position Predictions of Arguments

Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026 · Citations: 0

Pairwise Preference · Human Eval · General
  • Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
  • We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation.
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System

Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0

Human Eval · Automatic Metrics · Law
  • Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
  • Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
Claim Automation using Large Language Model

Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong · Feb 18, 2026 · Citations: 0

Human Eval · Automatic Metrics · General
  • We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy.
Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0

Rubric Rating · Human Eval · General
  • To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
  • Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness.
FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health

Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0

Human Eval · Simulation Env · Coding
  • Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
  • Human evaluation further confirms that FrameRef's generated framings measurably affect human judgment.
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise Preference · Rubric Rating · Human Eval · General
  • Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
  • We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retriev
The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis · Feb 10, 2026 · Citations: 0

Pairwise Preference · Rubric Rating · Human Eval · Law
  • To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-c
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind

Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026 · Citations: 0

Pairwise Preference · Critique Edit · Human Eval · General
  • In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion str
  • To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach.
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026 · Citations: 0

Pairwise Preference · Rubric Rating · Human Eval · LLM as Judge · General
  • Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
  • We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.
Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025 · Citations: 0

Human Eval · Automatic Metrics · Coding
  • We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural n
  • Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English.
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang · Oct 27, 2025 · Citations: 0

Pairwise Preference · Human Eval · Coding
  • Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation.
  • However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation.