OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 304 Search mode: keyword RSS

Filter by tag

All Automatic Metrics (978) General (590) Coding (314) Simulation Env (115) Math (103) Multilingual (99) Long Horizon (82) Medicine (78) Pairwise Preference (70) Law (45) Multi Agent (41) Human Eval (38) Expert Verification (25) Web Browsing (22) Critique Edit (21) Red Team (21)

KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026

Citations: 0

Automatic MetricsSimulation Env Coding

Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026

Citations: 0

Simulation Env Long Horizon Coding

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment.

INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya · Jan 28, 2026

Citations: 0

Automatic Metrics Web Browsing Medicine

We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification.
All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and

Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026

Citations: 0

Pairwise Preference Automatic Metrics Long Horizon Coding

Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success.
To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.

Between Search and Platform: ChatGPT Under the DSA

Toni Lorente, Kathrin Gardhouse · Jan 22, 2026

Citations: 0

Automatic Metrics Web Browsing Law

RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind

Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026

Citations: 0

Pairwise PreferenceCritique Edit Human Eval General

In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion str
To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach.

APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026

Citations: 0

Rubric RatingExpert Verification Simulation Env Long Horizon Law

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
APEX-Agents requires agents to navigate realistic work environments with files and tools.

Multimodal Multi-Agent Empowered Legal Judgment Prediction

Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu · Jan 19, 2026

Citations: 0

Simulation Env Multi Agent Law

Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness.

Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space

Wang Zixian · Jan 18, 2026

Citations: 0

Automatic Metrics Long Horizon MathLaw

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.

Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye · Jan 15, 2026

Citations: 0

Simulation Env Long Horizon General

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery.

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang · Jan 14, 2026

Citations: 0

Pairwise Preference Simulation Env Long Horizon General

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintainin

A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding

Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit, Thomas Pickard · Jan 13, 2026

Citations: 0

Pairwise Preference Automatic Metrics Multilingual

The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.

VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding

Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Critique Edit Automatic Metrics General

We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.
Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluate higher-order cultural interpretation.

Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Rubric RatingCritique Edit Automatic Metrics Coding

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).

Reward Modeling from Natural Language Human Feedback

Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang, Shaoning Sun · Jan 12, 2026

Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics General

Reinforcement Learning with Verifiable reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs).
Typically in pairwise rewarding tasks, GRMs generate reasoning chains ending with critiques and preference labels, and RLVR then relies on the correctness of the preference labels as the training reward.

HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026

Citations: 0

Pairwise PreferenceRubric Rating Human EvalLlm As Judge General

Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.

What Matters For Safety Alignment?

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan · Jan 7, 2026

Citations: 0

Red Team Automatic Metrics Tool Use General

This paper presents a comprehensive empirical study on the safety alignment capabilities.
We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems.

ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System

Anantha Sharma · Jan 3, 2026

Citations: 0

Pairwise Preference Automatic Metrics General

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri, Trizal Garg · Dec 26, 2025

Citations: 0

Expert Verification Automatic Metrics CodingMultilingual

To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages.

DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation

Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn · Dec 23, 2025

Citations: 0

Automatic MetricsSimulation Env General

Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
An effective simulator must expose the failure modes of the systems under evaluation.

Protocol Hubs

Expert Verification Papers (25) CS.CL + Expert Verification Papers (20) Pairwise Preference Papers (70) CS.CL + Pairwise Preference Papers (62) CS.AI + Expert Verification Papers (15) CS.AI + Pairwise Preference Papers (42) Rubric Rating Papers (17) CS.CL + Rubric Rating Papers (16) General + Pairwise Preference Papers (43) Expert Verification Or Rubric Rating Papers (39) CS.CL + Math Papers (84) Long Horizon Papers (82) CS.CL + Human Eval Papers (35) CS.CL + Long Horizon Papers (58) Expert Verification + Medicine Papers (11) Human Eval Papers (38)

Human Feedback and Eval Paper Explorer

Filter by tag

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives