Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 22 Search mode: keyword RSS

Filter by tag

All Automatic Metrics (527) General (186) Long Horizon (106) Pairwise Preference (91) Coding (69) Simulation Env (67) Multi Agent (46) Medicine (35) Expert Verification (33) Llm As Judge (28) Human Eval (25) Web Browsing (25) Rubric Rating (24) Red Team (23) Critique Edit (22) Multilingual (21)

The logic of KM belief update is contained in the logic of AGM belief revision

Giacomo Bonanno · Feb 26, 2026

Citations: 0

Critique Edit Math

Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026

Citations: 0

Critique Edit Coding

NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.

Towards Better RL Training Data Utilization via Second-Order Rollout

Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui · Feb 26, 2026

Citations: 0

Critique Edit General

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu · Feb 25, 2026

Citations: 0

Critique Edit Automatic Metrics General

To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer.
Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%.

CAMEL: Confidence-Gated Reflection for Reward Modeling

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu · Feb 24, 2026

Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics General

Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.
Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters,…

Can Large Language Models Replace Human Coders? Introducing ContentBench

Michael Haman · Feb 23, 2026

Citations: 0

Critique Edit Automatic Metrics Coding

This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
The suite uses versioned tracks that invite researchers to contribute new benchmark datasets.

Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026

Citations: 0

Critique Edit General

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools…
Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data…

Unlocking Reasoning Capability on Machine Translation in Large Language Models

Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi · Feb 16, 2026

Citations: 0

Critique Edit Long Horizon MathCoding

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xander Xu · Feb 15, 2026

Citations: 0

Expert VerificationCritique Edit Automatic Metrics Law

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons.

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design

Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026

Citations: 0

Critique Edit Simulation Env Coding

Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI

Ziyan Wang, Longlong Ma · Feb 9, 2026

Citations: 0

Critique Edit Automatic Metrics General

In Chomsky's provocative critique "The False Promise of CHATGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, there

RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind

Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026

Citations: 0

Pairwise PreferenceCritique Edit Human Eval General

In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion…
Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations.

VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding

Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Critique Edit General

We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.
Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluate higher-order cultural interpretation.

Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Rubric RatingCritique Edit Coding

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).

Reward Modeling from Natural Language Human Feedback

Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang, Shaoning Sun · Jan 12, 2026

Citations: 0

Pairwise PreferenceCritique Edit General

To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the problem of limited solution space inherent…
Additionally, considering that human critiques are difficult to scale up, we introduce Meta Reward Model (MetaRM) which learns to predict process reward from datasets with human critiques and then generalizes to data without human…

Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang, Xinyuan Luo · Nov 14, 2025

Citations: 0

Critique Edit Simulation Env General

Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly…

Error-Aware Knowledge Distillation via Targeted Revision for Customer-Service Summarization

Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi · Nov 4, 2025

Citations: 0

Critique Edit Automatic Metrics General

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin · Oct 6, 2025

Citations: 0

Critique Edit General

Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control.
Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and…

ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference

Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun, Juho Kim · Sep 18, 2025

Citations: 0

Critique Edit Automatic Metrics General

Furthermore, exploratory applications demonstrate that captured steps can enhance generative AI agents in Figma, yielding predictions better aligned with professionals and producing coherent outcomes.

TASER: Table Agents for Schema-guided Extraction and Recommendation

Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso · Aug 18, 2025

Citations: 0

Critique Edit General

To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized,…
Our Recommender Agent reviews unmatched outputs and proposes schema revisions, enabling TASER to outperform vision-based table detection models such as Table Transformer by 10.1%.

Protocol Hubs

Expert Verification Papers (32) CS.CL + Expert Verification Papers (24) Pairwise Preference Papers (89) CS.CL + Pairwise Preference Papers (74) Coding Papers (69) CS.CL Human Feedback And Eval Papers (1,020) CS.AI + Expert Verification Papers (20) CS.AI Human Feedback And Eval Papers (794) Expert Verification Or Pairwise Preference Papers (118) Pairwise Preference Papers (Last 120 Days) (59) Pairwise Preference Papers (Last 90 Days) (58) Pairwise Preference Papers (Last 60 Days) (57) Long Horizon Papers (101) CS.AI + Pairwise Preference Papers (52) Expert Verification Or Rubric Rating Papers (50) CS.CL + Coding Papers (51)

Benchmark Hubs

WebArena Ecosystem Benchmark Papers (13)

Metric Hubs

Daily Archives

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote