A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026
Citations: 0
Red Team · Automatic Metrics · General
A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026
Citations: 0
Automatic Metrics · Web Browsing · General
Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.
These findings highlight the importance of user- and task-sensitive conversational agents and support that communication style personalization can meaningfully enhance interaction quality and performance.
Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld · Feb 19, 2026
Citations: 0
Pairwise Preference · Automatic Metrics · General
Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order.
Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments.
Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold.
However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation.
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026
Citations: 0
Automatic Metrics · Multi Agent · General
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate Chain-of-Thought (CoT) reasoning with one another.
Current CoT evaluation, however, focuses narrowly on target-task accuracy.
Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026
Citations: 0
Rubric Rating · Automatic Metrics · General
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
We propose a psychometric framework to quantify and mitigate socially desirable responding (SDR) in questionnaire-based evaluation of LLMs.
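One way to make "quantifying SDR" concrete is to correlate a model's item responses with each item's rated social desirability; the sketch below uses a plain Pearson correlation as a hypothetical operationalization (the paper's actual framework is not specified in this snippet, and the function name is illustrative):

```python
import math

def sdr_index(responses, desirability):
    """Pearson correlation between an LLM's questionnaire responses and
    each item's rated social desirability. A high positive value would
    suggest the model drifts toward socially desirable answers.
    Hypothetical proxy measure, not the paper's exact framework."""
    n = len(responses)
    mr, md = sum(responses) / n, sum(desirability) / n
    cov = sum((r - mr) * (d - md) for r, d in zip(responses, desirability))
    sr = math.sqrt(sum((r - mr) ** 2 for r in responses))
    sd = math.sqrt(sum((d - md) ** 2 for d in desirability))
    return cov / (sr * sd) if sr and sd else 0.0
```

A model whose response pattern exactly tracks item desirability would score 1.0; an uncorrelated pattern would score near 0.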
As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes increasingly important.
Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies: the "prevailing mindsets" embedded during training and alignment that outlive individual model versions.
Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber, Cameron R. Jones · Feb 19, 2026
Citations: 0
Automatic Metrics · Long Horizon · General
A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks.
We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information.
Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong · Feb 18, 2026
Citations: 0
Human Eval · Automatic Metrics · General
We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy.
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026
Citations: 0
Pairwise Preference · Automatic Metrics · General
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements.
Existing approaches typically rely on a single judge, or aggregate multiple judges under the assumption that all judges are equally reliable.
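The "equal reliability" assumption can be made concrete with a small sketch of weighted pairwise-vote aggregation, where per-judge weights default to 1.0 (the weighting scheme and all names are illustrative, not the paper's method):

```python
from collections import defaultdict

def aggregate_pairwise(judgments, reliabilities=None):
    """Aggregate pairwise preference judgments from multiple judges.

    judgments: list of (judge_id, winner, loser) tuples.
    reliabilities: optional dict judge_id -> weight; when omitted, every
        judge gets weight 1.0, i.e. the equal-reliability assumption.
    Returns a dict mapping each item to its weighted win score.
    """
    scores = defaultdict(float)
    for judge, winner, loser in judgments:
        w = 1.0 if reliabilities is None else reliabilities[judge]
        scores[winner] += w
        scores[loser] += 0.0  # ensure losers also appear in the output
    return dict(scores)
```

With equal weights, one unreliable dissenting judge counts as much as two reliable ones; down-weighting it changes the ranking margin, which is exactly the failure mode reliability-aware aggregation targets.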
Vered Tohar, Tsahi Hayat, Amir Leshem · Feb 18, 2026
Citations: 0
Automatic Metrics · Long Horizon · General
In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence intervals including chance.
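The chance-level claim can be checked with simple binomial arithmetic; below is a normal-approximation (Wald) interval with hypothetical counts (150 judgments at 54% is illustrative — 50 raters times three human poems — since the snippet does not state the exact n):

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a binomial
    proportion, e.g. the share of human poems labeled 'human'."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

# Hypothetical counts: 81 "human" labels out of 150 judgments (54%).
lo, hi = wald_ci(81, 150)
```

If the resulting interval contains 0.5, the 54% labeling rate is statistically indistinguishable from coin-flipping, which is what "judgments were at chance" asserts.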
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif, Segev Shlomov · Feb 18, 2026
Citations: 0
Automatic Metrics · Long Horizon · General
Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, and gating.
We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces.
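The idea of swapping a generative decision step for a compact classifier trained on execution traces can be sketched with a trivial bag-of-words centroid classifier (a self-contained stand-in for illustration, not TabAgent's actual model):

```python
from collections import Counter

def train_centroids(traces):
    """Build one bag-of-words centroid per decision label from execution
    traces, given as (input_text, chosen_label) pairs — e.g. the inputs
    an LLM router saw and the routes it chose."""
    centroids = {}
    for text, label in traces:
        centroids.setdefault(label, Counter()).update(text.lower().split())
    return centroids

def classify(centroids, text):
    """Pick the label whose centroid shares the most tokens with the
    input, replacing a generative LLM call for the closed-set decision."""
    tokens = Counter(text.lower().split())
    def overlap(centroid):
        return sum(min(tokens[t], centroid[t]) for t in tokens)
    return max(centroids, key=lambda label: overlap(centroids[label]))
```

Once trained, each routing decision is a cheap lookup rather than an LLM call — the cost argument the abstract makes, in miniature.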
Existing evaluations of agents with memory typically assess memorization and action in isolation.
One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions.
Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi, Yuanshun Yao · Feb 18, 2026
Citations: 0
Pairwise Preference · Automatic Metrics · General
Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users.
Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory.
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks.
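The notion of a "triggering cue" can be illustrated with a toy surface-pattern check — the cue list and function below are entirely hypothetical, meant only to show how shallow such lexical triggers are compared to real-world attacks:

```python
# Hypothetical cue list; real safety datasets' cues are broader.
TRIGGER_CUES = {"bomb", "weapon", "steal"}

def has_trigger_cue(prompt):
    """Flag prompts containing overt negative/sensitive cue words — the
    surface pattern the abstract argues safety datasets over-rely on.
    Realistic attacks often avoid any such word entirely."""
    return any(tok.strip(".,!?").lower() in TRIGGER_CUES
               for tok in prompt.split())
```

A paraphrased attack with no cue word sails past this check, which is precisely the unrealism being criticized.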
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities.
Web agents based on large language models have demonstrated promising capability in automating web tasks.
However, current web agents struggle to reason out sensible actions due to limited ability to predict environment changes, and may lack comprehensive awareness of execution risks, prematurely performing risky actions that cause irreversible consequences.
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang, Li Qing · Feb 17, 2026
Citations: 0
Demonstrations · Automatic Metrics · General
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues.