A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026
Citations: 0
Red TeamAutomatic MetricsGeneral
A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026
Citations: 0
Automatic MetricsMath
Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026
Citations: 0
Expert VerificationAutomatic MetricsMedicine
The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
Results Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision.
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026
Citations: 0
Automatic MetricsWeb BrowsingGeneral
Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.
These findings highlight the importance of user- and task-sensitive conversational agents and support that communication style personalization can meaningfully enhance interaction quality and performance.
Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld · Feb 19, 2026
Citations: 0
Pairwise PreferenceAutomatic MetricsGeneral
Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order.
Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments.
Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold.
However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation.
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026
Citations: 0
Automatic MetricsMulti AgentGeneral
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
Current CoT evaluation narrowly focuses on target task accuracy.
Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026
Citations: 0
Rubric RatingAutomatic MetricsGeneral
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs.
Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.
Positioned as a "methodological experiment," this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research.
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi · Feb 19, 2026
Citations: 0
Expert VerificationAutomatic MetricsMedicine
As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy.
As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatur
Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the ``prevailing mindsets'' embedded during training and alignment that outlive individual model versions.
Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber, Cameron R. Jones · Feb 19, 2026
Citations: 0
Automatic MetricsLong HorizonGeneral
A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks.
We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information.
Large language models have advanced web agents, yet current agents lack personalization capabilities.
Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts.
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj · Feb 18, 2026
Citations: 0
Simulation EnvMulti AgentCoding
MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation.
Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning.
Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong · Feb 18, 2026
Citations: 0
Human EvalAutomatic MetricsGeneral
We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy.
Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026
Citations: 0
Red TeamAutomatic MetricsCodingMultilingual
Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied.
We introduce \textbf{Indic Jailbreak Robustness (IJR)}, a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 Billion speakers), covering 45216 prompts in JSON (contract-bound) and Free (naturalistic)