A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei · Feb 24, 2026
Citations: 0
Automatic Metrics · Long Horizon · General
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design.
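The abstract doesn't spell out SELAUR's reward formulation, but the core idea of folding uncertainty into the reward can be sketched in a few lines. Below is a minimal illustration, assuming the policy exposes a per-step action distribution; the entropy penalty and the `lam` weight are placeholders, not the paper's actual design.

```python
# Illustrative uncertainty-aware reward shaping (not SELAUR's exact scheme):
# penalize steps where the policy is uncertain, i.e. high-entropy.
import numpy as np

def shaped_reward(task_reward: float, action_probs: np.ndarray, lam: float = 0.1) -> float:
    """Subtract an entropy penalty so low-confidence steps earn less reward."""
    probs = np.clip(action_probs, 1e-12, 1.0)
    entropy = -np.sum(probs * np.log(probs))   # step-level policy uncertainty
    return task_reward - lam * entropy

# A confident correct step keeps most of its reward:
print(shaped_reward(1.0, np.array([0.9, 0.05, 0.05])))  # ~0.96
```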
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users.
Multi-agent deep reinforcement learning (RL) has made significant progress in developing intelligent game-playing agents in recent years.
However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions.
Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis · Feb 24, 2026
Citations: 0
Pairwise Preference · Automatic Metrics · General
However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs.
Our findings on synthetic graphs and molecular benchmarks reveal that MAs do not preferentially concentrate on curvature extremes, despite their theoretical link to information flow.
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026
Citations: 0
Expert Verification · Automatic Metrics · General
Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice.
We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues.
The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. Current research, however, primarily focuses on scaling context windows or optimizing prompt engineering.
Shruti Srivastava, Kiranmayee Janardhan, Shaurya Jauhari · Feb 24, 2026
Citations: 0
Red Team · Automatic Metrics · General
These limitations have driven the evolution toward automated red teaming, which leverages artificial intelligence and automation to deliver efficient and adaptive security evaluations.
Yu Fu, Seongho Son, Ilija Bogunovic · Feb 24, 2026
Citations: 0
LLM as Judge · Automatic Metrics · General
Existing alignment paradigms remain limited in capturing the pluralistic nature of human values.
First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses.
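For readers unfamiliar with this setup, the similarity-estimator step amounts to fine-tuning a bi-encoder so that cosine similarity tracks value coverage. A minimal sketch with the sentence-transformers library follows; the base model, training pairs, and loss are assumptions, since the abstract doesn't specify them.

```python
# Minimal coverage-estimator fine-tuning sketch (assumed data and loss).
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative base encoder
train_examples = [
    InputExample(texts=["response covering value A", "statement of value A"], label=0.9),
    InputExample(texts=["response ignoring value B", "statement of value B"], label=0.1),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.CosineSimilarityLoss(model)  # regress cosine sim toward the coverage label
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
```

At evaluation time, the cosine similarity between a generated response and each value statement would serve as the coverage score.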
Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang · Feb 24, 2026
Citations: 0
Automatic Metrics · Tool Use · General
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior.
Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
Chenyang Zhao, Vinny Cahill, Ivana Dusparic · Feb 24, 2026
Citations: 0
Pairwise Preference · RLAIF or Synthetic Feedback · Human Eval · General
Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes.
More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators.
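Whether the labels come from humans or an LLM judge, the reward model is typically trained with the same Bradley-Terry objective over preference pairs. A minimal PyTorch sketch, with `reward_model` assumed to be any scalar-output network:

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Bradley-Terry loss: -log sigmoid(r(preferred) - r(rejected))."""
    r_pos = reward_model(preferred)   # (batch, 1) scalar scores
    r_neg = reward_model(rejected)
    return -F.logsigmoid(r_pos - r_neg).mean()
```

Under RLAIF, only the provenance of the `preferred`/`rejected` labels changes; the objective itself is unchanged.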
Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang · Feb 24, 2026
Citations: 0
Automatic Metrics · Long Horizon · General
Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.
Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows.
Reward models play a fundamental role in aligning large language models with human preferences.
Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead.
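The trade-off is easy to see in code: a scalar discriminative model is just a linear head over pooled hidden states, while a generative judge spends decoding compute to produce an inspectable rationale. Both snippets below are illustrative, not the paper's architecture.

```python
import torch.nn as nn

class ScalarRewardHead(nn.Module):
    """Scalar paradigm: one linear layer, one score, no rationale."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_hidden):
        return self.score(pooled_hidden)

# Generative paradigm: prompt an LLM to reason before judging
# (costlier, but the rationale can be read and audited).
JUDGE_PROMPT = (
    "Compare responses A and B to the prompt below. Explain your reasoning "
    "step by step, then answer with 'A' or 'B'."
)
```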
Anqi Li, Chenxiao Wang, Yu Lu, Renjun Xu, Lizhi Ma, Zhenzhong Lan · Feb 24, 2026
Citations: 0
Human Eval · Automatic Metrics · General
Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings.
CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations.
Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy, Kilian Q. Weinberger · Feb 24, 2026
Citations: 0
LLM as Judge · Automatic Metrics · General
Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves over 70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
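Win rates in LLM-as-judge comparisons are usually computed over pairwise verdicts, with each pair shown in both orders to control for the judge's position bias. A minimal sketch, where `judge` is an assumed callable returning "A" or "B":

```python
def win_rate(judge, ours, baselines):
    """Fraction of pairwise comparisons our model wins, order-debiased."""
    wins = total = 0
    for a, b in zip(ours, baselines):
        # Present the pair in both orders so position bias cancels out.
        for first, second, our_slot in [(a, b, "A"), (b, a, "B")]:
            wins += judge(first, second) == our_slot
            total += 1
    return wins / total
```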
Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord · Feb 24, 2026
Citations: 0
Human Eval · Simulation Env · General
We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.
We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement.
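The abstract doesn't define LACERScore, but the generic shape of an LLM-based similarity measure is: prompt a judge to rate overlap on a fixed scale, then normalize. The `complete` helper below is hypothetical.

```python
def llm_similarity(complete, generated: str, reference: str) -> float:
    """Generic LLM-judged similarity in [0, 1] (not LACERScore's actual recipe)."""
    prompt = (
        "On a 1-5 scale, rate how similar these two research-contribution "
        "statements are. Answer with the number only.\n"
        f"A: {generated}\nB: {reference}"
    )
    return (float(complete(prompt).strip()) - 1.0) / 4.0  # map 1-5 onto [0, 1]
```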
Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar, George J. Pappas · Feb 23, 2026
Citations: 0
Simulation Env · Web Browsing · General
Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment.
We address this gap via CORE, a safety framework that enables online contextual reasoning, grounding, and enforcement without prior knowledge of the environment (e.g., maps or safety specifications).