OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 304 Search mode: keyword RSS

Filter by tag

All Automatic Metrics (978) General (590) Coding (314) Simulation Env (115) Math (103) Multilingual (99) Long Horizon (82) Medicine (78) Pairwise Preference (70) Law (45) Multi Agent (41) Human Eval (38) Expert Verification (25) Web Browsing (22) Critique Edit (21) Red Team (21)

Towards Efficient Agents: A Co-Design of Inference Architecture and System

Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu, Hanting Chen · Dec 20, 2025

Citations: 0

Automatic Metrics Long Horizon General

The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design.

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025

Citations: 0

Red Team Automatic Metrics General

We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction.
On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks.

Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

Jonathan Kamp, Roos Bakker, Dominique Blok · Dec 11, 2025

Citations: 0

Pairwise Preference Automatic Metrics General

In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics.

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025

Citations: 0

Simulation Env Long Horizon General

Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.

Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors

Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025

Citations: 0

Automatic Metrics Long Horizon Law

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.

AITutor-EvalKit: Exploring the Capabilities of AI Tutors

Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025

Citations: 0

Demonstrations Automatic Metrics General

We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.

Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization

Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Siliang Zeng, Alfredo Garcia · Nov 25, 2025

Citations: 0

Automatic Metrics Long Horizon Medicine

Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks.
To address these challenges, we propose SORL, \underline{S}tabilizing \underline{O}ff-Policy \underline{R}einforcement \underline{L}earning for Long-Horizon Agent Training.

Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer

Myung Ho Kim · Nov 21, 2025

Citations: 0

Automatic Metrics Long Horizon General

Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences.
We introduce Structured Cognitive Loop (SCL), a modular architecture that explicitly separates agent cognition into five phases: Retrieval, Cognition, Control, Action, and Memory (R-CCAM).

From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems

Brendan Gho, Suman Muppavarapu, Afnan Shaik, Tyson Tsay, Atharva Mohan, James Begin · Nov 18, 2025

Citations: 0

Automatic Metrics Multi Agent General

As foundation models are increasingly deployed as interacting agents in multi-agent systems, their collective behavior raises new challenges for trustworthiness, transparency, and accountability.
We introduce a market-making framework for multi-agent large language model (LLM) coordination that organizes agent interactions as structured economic exchanges.

Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang, Xinyuan Luo · Nov 14, 2025

Citations: 0

Critique Edit Simulation Env General

Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assi

Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025

Citations: 0

Automatic Metrics Long Horizon Coding

On the Episodic Memory Benchmark (EpBench) \cite{huet_episodic_2025} comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG based baselines by up to \textbf{20\%}.
More broadly, GSW offers a concrete blueprint for endowing LLMs with human-like episodic memory, paving the way for more capable agents that can reason over long horizons.

Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu · Nov 7, 2025

Citations: 0

Pairwise Preference Automatic Metrics Math

We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts su
Remarkably, finetuning Qwen2.5-VL-7B on our data outperforms existing open-data baselines across evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on Vstar Benc

Error-Aware Knowledge Distillation via Targeted Revision for Customer-Service Summarization

Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi · Nov 4, 2025

Citations: 0

Critique Edit Automatic Metrics General

BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning

Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui · Oct 31, 2025

Citations: 0

Pairwise Preference Automatic MetricsSimulation Env Long Horizon General

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
However, such vision-driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy

Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025

Citations: 0

Red Team Automatic Metrics General

Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.
This reasoning ability also generalizes to safety-critical settings beyond the training distribution.

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang · Oct 29, 2025

Citations: 0

Demonstrations Automatic Metrics Long Horizon Coding

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang · Oct 29, 2025

Citations: 0

Simulation Env Long Horizon General

Real-world language agents must handle complex, multi-step workflows across diverse Apps.
For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual.

From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen, Haiyang Geng · Oct 29, 2025

Citations: 0

Automatic Metrics Multi Agent Medicine

To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards.

World Simulation with Video Foundation Models for Physical AI

NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji · Oct 28, 2025

Citations: 0

Simulation Env Long Horizon CodingMultilingual

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nv

Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025

Citations: 0

Human EvalAutomatic Metrics Coding

We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural n
Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English.

Protocol Hubs

Expert Verification Papers (25) CS.CL + Expert Verification Papers (20) Pairwise Preference Papers (70) CS.CL + Pairwise Preference Papers (62) CS.AI + Expert Verification Papers (15) CS.AI + Pairwise Preference Papers (42) Rubric Rating Papers (17) CS.CL + Rubric Rating Papers (16) General + Pairwise Preference Papers (43) Expert Verification Or Rubric Rating Papers (39) CS.CL + Math Papers (84) Long Horizon Papers (82) CS.CL + Human Eval Papers (35) CS.CL + Long Horizon Papers (58) Expert Verification + Medicine Papers (11) Human Eval Papers (38)

Human Feedback and Eval Paper Explorer

Filter by tag

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives