OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 156

Filter by tag

All Automatic Metrics (876) General (528) Coding (281) Simulation Env (109) Multilingual (92) Math (90) Long Horizon (74) Medicine (69) Pairwise Preference (64) Law (43) Multi Agent (38) Human Eval (36) Expert Verification (23) Red Team (21) Web Browsing (21) Critique Edit (19)

Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton, Roger Vilardaga · Feb 23, 2026

Citations: 0

Rubric Rating Automatic Metrics General

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon General

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026

Citations: 0

Pairwise Preference Automatic Metrics Long Horizon General

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model.

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon General

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

Wilson Y. Lee · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon General

Why do language agents fail on tasks they are capable of solving?
Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating en

Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026

Citations: 0

Pairwise Preference Human Eval General

One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works.

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li · Feb 21, 2026

Citations: 0

Red Team Automatic Metrics General

Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

Lichang Song, Ting Long, Yi Chang · Feb 21, 2026

Citations: 0

Automatic Metrics Multi Agent General

To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma

Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift

Stephen Russell · Feb 21, 2026

Citations: 0

Automatic Metrics Long Horizon General

Validating Political Position Predictions of Arguments

Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026

Citations: 0

Pairwise Preference Human Eval General

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation.

Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Citations: 0

Automatic Metrics Simulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
By prioritizing deterministic rules, clear decision tracking, and retaining unresolved cases for human review, the framework provides a practical foundation for downstream manufacturing automation in real-world industrial environments.

Simplifying Outcomes of Language Model Component Analyses with ELIA

Aaron Louis Eidt, Nils Feldhus · Feb 20, 2026

Citations: 0

Pairwise Preference Automatic Metrics General

The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations.

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026

Citations: 0

Red Team Automatic Metrics General

A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.

Mind the Style: Impact of Communication Style on Human-Chatbot Interaction

Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026

Citations: 0

Automatic Metrics Web Browsing General

Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.
These findings highlight the importance of user- and task-sensitive conversational agents and support that communication style personalization can meaningfully enhance interaction quality and performance.

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld · Feb 19, 2026

Citations: 0

Pairwise Preference Automatic Metrics General

Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order.
Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments.

Modeling Distinct Human Interaction in Web Agents

Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu · Feb 19, 2026

Citations: 0

Pairwise Preference Automatic Metrics Web Browsing General

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold.
However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation.

KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi · Feb 19, 2026

Citations: 0

Rubric Rating Automatic Metrics Long Horizon General

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks.
Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe.

Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026

Citations: 0

Automatic Metrics Multi Agent General

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
Current CoT evaluation narrowly focuses on target task accuracy.

Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026

Citations: 0

Rubric Rating Automatic Metrics General

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs.

The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

Dusan Bosnjakovic · Feb 19, 2026

Citations: 0

Automatic Metrics Multi Agent General

As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatur
Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the "prevailing mindsets" embedded during training and alignment that outlive individual model versions.

Protocol Hubs

Expert Verification Papers (23) CS.CL + Pairwise Preference Papers (56) Pairwise Preference Papers (64) CS.AI + Pairwise Preference Papers (39) General + Pairwise Preference Papers (38) CS.CL + Expert Verification Papers (18) Automatic Metrics + Pairwise Preference Papers (51) Expert Verification Or Rubric Rating Papers (36) CS.CL + Medicine Papers (52) Automatic Metrics + Expert Verification Papers (19) Human Eval Papers (36) CS.CL + Math Papers (71) CS.CL + Human Eval Papers (33) Long Horizon Papers (74) Critique Edit Or Expert Verification Papers (41) Automatic Metrics + General + Pairwise Preference Papers (29)

Benchmark Hubs

MATH Benchmark Papers (30)
