Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 42 Search mode: keyword RSS

Filter by tag

All Automatic Metrics (527) General (186) Long Horizon (106) Pairwise Preference (91) Coding (69) Simulation Env (67) Multi Agent (46) Medicine (35) Expert Verification (33) Llm As Judge (28) Human Eval (25) Web Browsing (25) Rubric Rating (24) Red Team (23) Critique Edit (22) Multilingual (21)

The logic of KM belief update is contained in the logic of AGM belief revision

Giacomo Bonanno · Feb 26, 2026

Citations: 0

Critique Edit Math

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng · Feb 26, 2026

Citations: 0

Red Team Automatic Metrics Multilingual

Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module.

Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026

Citations: 0

Critique Edit Coding

NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.

Towards Better RL Training Data Utilization via Second-Order Rollout

Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui · Feb 26, 2026

Citations: 0

Critique Edit General

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam · Feb 25, 2026

Citations: 0

Expert Verification Automatic Metrics MedicineCoding

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case.

ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li, Shujian Huang · Feb 25, 2026

Citations: 0

Pairwise Preference Multilingual

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu · Feb 25, 2026

Citations: 0

Critique Edit Automatic Metrics General

To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer.
Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%.

CAMEL: Confidence-Gated Reflection for Reward Modeling

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu · Feb 24, 2026

Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics General

Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.
Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters,…

SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026

Citations: 0

Automatic Metrics Multi Agent Multilingual

To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.
Extensive experiments on translation benchmarks show that SAMAS achieves competitive semantic accuracy against strong baselines, primarily by leveraging its statistically significant advantage in style fidelity.

Can Large Language Models Replace Human Coders? Introducing ContentBench

Michael Haman · Feb 23, 2026

Citations: 0

Critique Edit Automatic Metrics Coding

This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
The suite uses versioned tracks that invite researchers to contribute new benchmark datasets.

IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026

Citations: 0

Red Team CodingMultilingual

Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied.
We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 Billion speakers), covering 45216 prompts in JSON (contract-bound) and Free (naturalistic) tracks.

Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026

Citations: 0

Pairwise Preference Multilingual

The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
In this work, we propose a resource-efficient method for improving multilingual safety alignment.

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026

Citations: 0

Red Team LawMultilingual

LLM-based agents execute real-world workflows via tools and memory.
We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive…

Rethinking Metrics for Lexical Semantic Change Detection

Roksana Goworek, Haim Dubossarsky · Feb 17, 2026

Citations: 0

Pairwise Preference Automatic Metrics CodingMultilingual

Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026

Citations: 0

Critique Edit General

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools…
Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data…

Unlocking Reasoning Capability on Machine Translation in Large Language Models

Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi · Feb 16, 2026

Citations: 0

Critique Edit Long Horizon MathCoding

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.

The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective

Ali Zahedzadeh, Behnam Bahrak · Feb 15, 2026

Citations: 0

Automatic Metrics Long Horizon Multilingual

Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers.To operationalize this view, we introduce an evaluation…

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xander Xu · Feb 15, 2026

Citations: 0

Expert VerificationCritique Edit Automatic Metrics Law

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons.

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design

Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026

Citations: 0

Critique Edit Simulation Env Coding

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026

Citations: 0

Pairwise Preference Automatic Metrics CodingMultilingual

Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages.
We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii)…

Protocol Hubs

Expert Verification Papers (32) CS.CL + Expert Verification Papers (24) Pairwise Preference Papers (89) CS.CL + Pairwise Preference Papers (74) Coding Papers (69) CS.CL Human Feedback And Eval Papers (1,020) CS.AI + Expert Verification Papers (20) CS.AI Human Feedback And Eval Papers (794) Expert Verification Or Pairwise Preference Papers (118) Pairwise Preference Papers (Last 120 Days) (59) Pairwise Preference Papers (Last 90 Days) (58) Pairwise Preference Papers (Last 60 Days) (57) Long Horizon Papers (101) CS.AI + Pairwise Preference Papers (52) Expert Verification Or Rubric Rating Papers (50) CS.CL + Coding Papers (51)

Benchmark Hubs

WebArena Ecosystem Benchmark Papers (13)

Metric Hubs

Daily Archives

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote