
OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 304
Who can we trust? LLM-as-a-jury for Comparative Assessment

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026

Citations: 0
Pairwise Preference Automatic Metrics General
  • Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation, often via pairwise comparative judgements.
  • Existing approaches typically rely on a single judge, or aggregate multiple judges under the assumption that all are equally reliable.
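The jury idea in this abstract, that judges should contribute in proportion to estimated reliability rather than equally, can be sketched as a weighted majority vote over pairwise verdicts. The judge names and weights below are illustrative assumptions, not from the paper:

```python
# Sketch of reliability-weighted aggregation of pairwise judgements.
# Judge names and reliability weights are hypothetical.
def jury_verdict(votes, reliability):
    """votes: dict judge -> 'A' or 'B'; reliability: dict judge -> weight."""
    score = {"A": 0.0, "B": 0.0}
    for judge, choice in votes.items():
        # Unknown judges default to weight 1.0 (equal-reliability fallback).
        score[choice] += reliability.get(judge, 1.0)
    return max(score, key=score.get)

votes = {"judge_small": "A", "judge_medium": "B", "judge_large": "B"}
reliability = {"judge_small": 0.4, "judge_medium": 0.8, "judge_large": 0.9}
print(jury_verdict(votes, reliability))  # prints "B"
```

Setting all weights equal recovers the plain multi-judge baseline the abstract contrasts against.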
Creating a digital poet

Vered Tohar, Tsahi Hayat, Amir Leshem · Feb 18, 2026

Citations: 0
Automatic Metrics Long Horizon General
  • In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence …
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026

Citations: 0
Expert Verification Automatic Metrics Multi Agent Coding
  • Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
  • To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm.
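The orchestrator-tool paradigm described above can be sketched as a router that exposes heterogeneous agents as callable tools and dispatches by declared specialty. The agent names, specialties, and routing rule are illustrative assumptions, not the paper's architecture:

```python
# Minimal sketch of an orchestrator-tool pattern for heterogeneous agents.
from typing import Callable, Dict

def make_orchestrator(tools: Dict[str, Callable[[str], str]],
                      specialties: Dict[str, str]) -> Callable[[str, str], str]:
    def orchestrate(task_type: str, prompt: str) -> str:
        # Route to the first tool whose declared specialty matches the task;
        # fall back to a generalist tool for everything else.
        for name, spec in specialties.items():
            if spec == task_type:
                return tools[name](prompt)
        return tools["generalist"](prompt)
    return orchestrate

tools = {
    "coder": lambda p: f"[code-tuned model] {p}",
    "reasoner": lambda p: f"[reasoning-tuned model] {p}",
    "generalist": lambda p: f"[base model] {p}",
}
specialties = {"coder": "coding", "reasoner": "math"}
run = make_orchestrator(tools, specialties)
print(run("coding", "write a parser"))  # routed to the code-tuned agent
```

The point of the pattern is that the orchestrator, not a static configuration, decides which post-trained model handles each sub-task.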
TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif, Segev Shlomov · Feb 18, 2026

Citations: 0
Automatic Metrics Long Horizon General
  • Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, …
  • We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces.
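One minimal reading of "a classifier trained on execution traces" is a lookup of the majority decision observed for each feature combination, deferring to the LLM on unseen inputs. The trace fields, labels, and fallback behavior here are hypothetical stand-ins, not TabAgent's actual model:

```python
# Sketch: replace a generative router with a classifier fit on traces.
from collections import Counter, defaultdict

class TraceClassifier:
    """Predicts a closed-set decision from categorical trace features by
    majority label seen during training."""
    def __init__(self):
        self.table = defaultdict(Counter)

    def fit(self, traces):
        for features, decision in traces:
            self.table[tuple(sorted(features.items()))][decision] += 1
        return self

    def predict(self, features):
        counts = self.table.get(tuple(sorted(features.items())))
        if not counts:
            return "fallback_to_llm"  # unseen input: defer to the LLM
        return counts.most_common(1)[0][0]

traces = [({"intent": "billing"}, "route_billing"),
          ({"intent": "billing"}, "route_billing"),
          ({"intent": "tech"}, "route_support")]
clf = TraceClassifier().fit(traces)
print(clf.predict({"intent": "billing"}))  # prints "route_billing"
```

The appeal is cost: a table lookup replaces an LLM call for decisions the system has already made many times.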
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin · Feb 18, 2026

Citations: 0
Pairwise Preference Simulation Env Web Browsing General
  • Existing evaluations of agents with memory typically assess memorization and action in isolation.
  • One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions.
Learning Personalized Agents from Human Feedback

Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi, Yuanshun Yao · Feb 18, 2026

Citations: 0
Pairwise Preference Automatic Metrics General
  • Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users.
  • Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory.
Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli, Majid Sarrafzadeh · Feb 17, 2026

Citations: 0
Pairwise Preference Expert Verification Automatic Metrics Medicine
  • While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
  • We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization.
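A scalarized multi-objective preference loss, with one DPO term per therapeutic dimension combined under fixed weights, might look like the sketch below. The objective names, weights, and margin values are illustrative; only the per-objective loss form, -log sigmoid(beta * margin), follows standard DPO:

```python
# Sketch of scalarizing per-objective DPO losses into one objective.
import math

def dpo_loss(margin: float, beta: float = 0.1) -> float:
    # margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def multi_objective_loss(margins: dict, weights: dict,
                         beta: float = 0.1) -> float:
    # Weighted sum over objectives (e.g. preference vs. clinical safety).
    return sum(weights[k] * dpo_loss(margins[k], beta) for k in margins)

margins = {"patient_preference": 2.0, "clinical_safety": -1.0}
weights = {"patient_preference": 0.5, "clinical_safety": 0.5}
print(round(multi_objective_loss(margins, weights), 4))
```

The negative safety margin above raises the combined loss, so the weighting directly trades off patient preference against clinical safety rather than optimizing either alone.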
Intent Laundering: AI Safety Datasets Are Not What They Seem

Shahriar Golchin, Marc Wetter · Feb 17, 2026

Citations: 0
Red Team Automatic Metrics General
  • We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
  • We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks.
GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du · Feb 17, 2026

Citations: 0
Automatic Metrics Long Horizon Coding
  • We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering.
  • Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity.
ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026

Citations: 0
Pairwise Preference Automatic Metrics Coding
  • In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.
  • We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset.
Demonstrations Automatic Metrics General
  • This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
  • Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities.
In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations

Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu, Krishna P. Gummadi · Feb 17, 2026

Citations: 0
Pairwise Preference Automatic Metrics General
  • Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms.
  • These agents filter, prioritize, and synthesize information retrieved from the platforms' back-end databases or via web search.
World-Model-Augmented Web Agents with Action Correction

Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang · Feb 17, 2026

Citations: 0
Llm As Judge Simulation Env Multi Agent General
  • Web agents based on large language models have demonstrated promising capability in automating web tasks.
  • However, current web agents struggle to reason out sensible actions due to limitations in predicting environment changes, and may lack comprehensive awareness of execution risks, prematurely performing risky actions that cause …
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu · Feb 17, 2026

Citations: 0
Pairwise Preference Automatic Metrics Multi Agent Coding
  • Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information …
  • By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for …
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework

Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang, Li Qing · Feb 17, 2026

Citations: 0
Demonstrations Automatic Metrics General
  • Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
  • We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues.
Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026

Citations: 0
Rubric Rating Human Eval General
  • We introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural-language objectives.
  • Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness.
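Decomposing a scalar reward into a sparse weighted combination of interpretable objective scores can be sketched as L1-regularized regression. The objective names, toy data, and the plain iterative soft-thresholding (ISTA) solver below are illustrative assumptions, not Obj-Disco itself:

```python
# Sketch: recover sparse objective weights from reward observations.
def ista(X, y, lam=0.01, step=0.5, iters=2000):
    """L1-regularized least squares via iterative soft-thresholding."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        # Gradient of (1 / 2n) * ||Xw - y||^2
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i]
                 for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n
                for j in range(d)]
        for j in range(d):
            # Gradient step followed by soft-thresholding (L1 proximal map).
            z = w[j] - step * grad[j]
            w[j] = max(abs(z) - step * lam, 0.0) * (1.0 if z > 0 else -1.0)
    return w

# Toy data: reward depends on "helpfulness" only; "verbosity" is a decoy.
X = [[1.0, 0.2], [0.5, 0.9], [0.9, 0.1], [0.2, 0.8]]  # objective scores
y = [2.0, 1.0, 1.8, 0.4]                              # reward = 2 * helpfulness
w = ista(X, y)
# Expect a large weight on helpfulness and a near-zero weight on verbosity.
```

Sparsity is what makes the decomposition human-readable: most candidate objectives receive weight zero, leaving a short list that explains the reward.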
Citations: 0
Human Eval Simulation Env Long Horizon Coding
  • Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
  • Human evaluation further confirms that FrameRef's generated framings measurably affect human judgment.