A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026
Citations: 0
Pairwise Preference · Human Eval · General
One annotator pair achieved almost perfect agreement ($\kappa = 0.8743$; $93.8\%$ raw agreement), exceeding several reported benchmarks in English sarcasm research.
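For readers triaging agreement figures like these, here is a minimal sketch of Cohen's kappa for two annotators; the binary labels below are illustrative placeholders, not the paper's data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed raw agreement: fraction of items where the annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative binary sarcasm labels (1 = sarcastic), not from the paper.
a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 0, 1, 0, 0, 0, 1, 0]
print(cohens_kappa(a, b))  # raw agreement is 7/8, kappa discounts chance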
This paper introduces ContentBench, a public benchmark suite that helps answer the question of whether low-cost LLMs can replace human coders by tracking how much agreement the models achieve, and at what cost, on the same interpretive coding tasks.
The suite is organized into versioned tracks and invites researchers to contribute new benchmark datasets.
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026
Citations: 0
Automatic Metrics · Long Horizon · Math
Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
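As a reminder of what maj@k measures, here is a minimal sketch of maj@8 scoring for a single problem; the sampled answers are placeholders, and ties fall to whichever answer `Counter` returns first.

```python
from collections import Counter

def maj_at_k(sampled_answers, reference):
    """maj@k: correct iff the most common of k sampled answers equals the reference."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return majority == reference

# Illustrative: 8 sampled final answers for one math problem.
samples = ["42", "42", "17", "42", "42", "7", "42", "42"]
print(maj_at_k(samples, "42"))  # True: "42" wins the majority vote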
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025
Citations: 0
Human Eval · Automatic Metrics · Coding
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuances.
Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English.
While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization.
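For context on the optimization step, below is a sketch of the standard direct preference optimization loss; the log-probability tensors are placeholders, and the assumption that the paper's multi-objective variant combines several such losses is ours, not the authors' stated method.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer chosen over rejected
    responses relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.0, -10.5]))
print(loss.item())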
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026
Citations: 0
Pairwise Preference · Human Eval · General
Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation.
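One plausible reading of dual-scale validation is checking whether pointwise scores imply the same orderings that pairwise annotators report; the sketch below makes that assumption, and the stance scores and judgments are illustrative only.

```python
def pointwise_pairwise_consistency(scores, pairwise_prefs):
    """Fraction of pairwise judgments that agree with the ordering
    implied by pointwise scores (ties count as disagreement)."""
    hits = 0
    for a, b, preferred in pairwise_prefs:  # preferred is "a" or "b"
        implied = "a" if scores[a] > scores[b] else "b" if scores[b] > scores[a] else None
        hits += implied == preferred
    return hits / len(pairwise_prefs)

# Illustrative: pointwise stance scores (-1 = left, +1 = right) and
# direct pairwise "which argument is more right-leaning" judgments.
scores = {"arg1": 0.8, "arg2": -0.3, "arg3": 0.1}
prefs = [("arg1", "arg2", "a"), ("arg3", "arg2", "a"), ("arg1", "arg3", "b")]
print(pointwise_pairwise_consistency(scores, prefs))  # 2/3 agreement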
Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord · Feb 24, 2026
Citations: 0
Human Eval · Simulation Env · General
We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.
We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement.
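The excerpt gives no implementation details for LACERScore, so the following shows only a generic shape for an LLM-based similarity measure; the prompt, the 1-5 scale, and the `call_llm` wrapper are hypothetical and not the authors' method.

```python
def llm_similarity_score(call_llm, contribution_a, contribution_b):
    """Hypothetical LLM-judge similarity: ask a model to rate how closely
    two research-contribution statements match on a 1-5 scale."""
    prompt = (
        "Rate the similarity of these two research contributions on a 1-5 scale.\n"
        f"A: {contribution_a}\nB: {contribution_b}\nAnswer with a single digit."
    )
    reply = call_llm(prompt)  # call_llm: any text-in/text-out LLM wrapper
    score = int(reply.strip()[0])
    return (score - 1) / 4  # normalize to [0, 1]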
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026
Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · LLM-as-Judge · General
Despite rapid progress in language models, we still lack a clear way to compare their abilities in interpersonal domains such as emotional support against those of humans.
We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026
Citations: 0
Automatic Metrics · Simulation Env · General
When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
By prioritizing deterministic rules, clear decision tracking, and retaining unresolved cases for human review, the framework provides a practical foundation for downstream manufacturing automation in real-world industrial environments.
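A minimal sketch of the tiered escalation pattern described above, with the rule set, the constrained LLM judge, and the review queue all standing in as hypothetical placeholders.

```python
def score_case(case, rules, llm_judge, review_queue):
    """Tiered scoring: deterministic rules first, then constrained LLM
    reasoning, then a single human-in-the-loop (HITL) review step."""
    for rule in rules:                      # 1. deterministic rules
        verdict = rule(case)
        if verdict is not None:
            return {"verdict": verdict, "source": "rule"}
    verdict = llm_judge(case)               # 2. constrained LLM reasoning
    if verdict is not None:
        return {"verdict": verdict, "source": "llm"}
    review_queue.append(case)               # 3. retain for human review
    return {"verdict": None, "source": "hitl_pending"}
```

Resolving most cases at the rule tier keeps decisions cheap and traceable; only genuine ambiguities pay the cost of an LLM call or a human reviewer.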
As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
Current evaluation practices for open-ended text responses heavily rely on human experts.