A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu, Yi Cheng · Feb 13, 2026
Citations: 0
Automatic Metrics · Long Horizon · General
Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameters.
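A minimal sketch of the general idea, not the paper's actual PMG: a gait reference defined entirely by a few scalar parameters (step frequency, step length, lift height are assumptions here), from which a tracking controller could sample target trajectories in real time.

```python
import numpy as np

def reference_trajectory(t, step_freq=1.5, step_length=0.4, lift_height=0.08):
    """Toy parameterized gait reference: foot height and forward offset as
    periodic functions of time. The parameter set is illustrative only."""
    phase = 2 * np.pi * step_freq * t
    foot_height = lift_height * np.maximum(0.0, np.sin(phase))  # swing-phase lift
    foot_forward = 0.5 * step_length * np.sin(phase)            # forward swing
    return foot_height, foot_forward

# Sample a 2-second reference at ~50 Hz for a motion-tracking controller.
ts = np.linspace(0.0, 2.0, 101)
heights, offsets = reference_trajectory(ts)
```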
Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026
Citations: 0
Automatic Metrics · Long Horizon · General
We introduce KeplerAgent, an agentic framework that explicitly follows the scientific reasoning process.
The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints.
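A rough sketch of what "configuring a symbolic regression engine" can look like with PySINDy's public API; the particular library composition and threshold below are placeholders standing in for choices the agent's physics tools would make, and the toy data replaces tool-extracted state variables.

```python
import numpy as np
import pysindy as ps

# Toy trajectory; in the agent setting, physics-based tools would supply
# cleaned state variables and suggest candidate terms.
t = np.linspace(0, 10, 1000)
x = np.stack([np.sin(t), np.cos(t)], axis=1)

# Hypothetical agent-chosen configuration: the function library and
# sparsity threshold are the knobs being set programmatically.
library = ps.PolynomialLibrary(degree=2) + ps.FourierLibrary(n_frequencies=1)
model = ps.SINDy(feature_library=library, optimizer=ps.STLSQ(threshold=0.1))
model.fit(x, t=t)
model.print()  # prints the discovered sparse dynamical equations
```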
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li · Feb 12, 2026
Citations: 0
Automatic Metrics · Tool Use · Coding
To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA examples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional "zoom…"
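A minimal sketch of one way region-to-image distillation could be set up, assuming a standard temperature-scaled KL objective (the excerpt does not give the paper's exact loss): the teacher pass sees the zoomed region crop, the student sees only the full image, and the student is trained to match the teacher's answer distribution.

```python
import torch
import torch.nn.functional as F

def r2i_distill_loss(student_logits, teacher_logits, tau=2.0):
    """KL distillation from a 'zoomed' teacher forward pass (model shown
    the cropped region) to a single full-image student pass. Generic
    distillation form, not the paper's exact objective."""
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau**2

# Shapes: (batch, vocab) answer logits from the two forward passes.
loss = r2i_distill_loss(torch.randn(4, 32000), torch.randn(4, 32000))
```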
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
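A simplified sketch of rollout-stage search, assuming a best-of-N selection rule and hypothetical `env`/`policy` interfaces (the excerpt does not specify TSR's actual search procedure): several candidate trajectories are sampled per task during data collection, and only the best one is kept for training.

```python
def collect_rollout_with_search(env, policy, n_candidates=4):
    """Sample several candidate multi-turn trajectories at rollout time
    and keep the highest-reward one for the training buffer. The env and
    policy interfaces below are illustrative placeholders."""
    candidates = []
    for _ in range(n_candidates):
        obs, traj, total_reward, done = env.reset(), [], 0.0, False
        while not done:
            action = policy.sample(obs)
            obs, reward, done = env.step(action)
            traj.append((obs, action, reward))
            total_reward += reward
        candidates.append((total_reward, traj))
    # Search happens here, during training rollouts, not at inference.
    return max(candidates, key=lambda c: c[0])[1]
```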
Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026
Citations: 0
Automatic Metrics · Web Browsing · General
The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.
This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.
Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis · Feb 10, 2026
Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · Law
To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-c…
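A minimal sketch of rubric-driven preference construction in general, not this paper's specific framework: responses to the same prompt are scored against a rubric, and sufficiently separated pairs become (chosen, rejected) preference data. The `rubric_score` callable and margin filter are assumptions.

```python
def build_preference_pairs(responses, rubric_score, margin=0.5):
    """Turn rubric-scored responses to one prompt into preference pairs.
    `rubric_score` is any callable returning a scalar quality score; the
    margin filter drops near-ties. Illustrative only."""
    scored = [(rubric_score(r), r) for r in responses]
    pairs = []
    for s_hi, r_hi in scored:
        for s_lo, r_lo in scored:
            if s_hi - s_lo >= margin:
                pairs.append({"chosen": r_hi, "rejected": r_lo})
    return pairs
```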
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.
In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications.
In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, there…
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim · Feb 9, 2026
Citations: 0
Rubric Rating · Automatic Metrics · Coding
However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming to obtain.
In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision.
Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le, Daixuan Cheng · Feb 3, 2026
Citations: 0
Simulation Env · Long Horizon · Coding
In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.
SWE-Master systematically explores the complete agent development pipeline, including teacher-trajectory synthesis and data curation, long-horizon SFT, RL with real execution feedback, and inference framework design.
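To make the pipeline's shape concrete, here is a hypothetical stage layout mirroring the four components named in the abstract; every key and value below is illustrative, not SWE-Master's actual configuration.

```python
# Illustrative stage layout only; keys/values are assumptions, not
# SWE-Master's real config schema.
pipeline = {
    "teacher_trajectory_synthesis": {"teacher": "<strong-model>", "n_tasks": 10_000},
    "data_curation": {"filter": "execution-verified", "dedup": True},
    "long_horizon_sft": {"max_turns": 64, "context_len": 128_000},
    "rl_with_execution_feedback": {"reward": "test-pass-rate", "algo": "<policy-grad>"},
}
for stage, cfg in pipeline.items():
    print(stage, cfg)
```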
Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu · Feb 3, 2026
Citations: 0
Automatic Metrics · Tool Use · General
The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating the transfer of their capabilities into smaller models.
Extensive experiments on challenging, widely used benchmarks demonstrate the effectiveness of our method.
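A common baseline for this kind of capability transfer, sketched here as an assumption rather than the paper's method: teacher-generated tool-call traces are reformatted into supervised targets for the smaller student model. The message schema below is a placeholder; adapt it to the student's actual chat/tool-call format.

```python
import json

def to_sft_example(user_msg, teacher_tool_call):
    """Format one teacher-generated function call as a supervised target
    for a smaller student. Schema is illustrative."""
    return {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": json.dumps(teacher_tool_call)},
        ]
    }

example = to_sft_example(
    "What's the weather in Paris?",
    {"name": "get_weather", "arguments": {"city": "Paris"}},
)
```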
Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan, Ayesha Mohsin · Feb 1, 2026
Citations: 0
Automatic Metrics · Long Horizon · Math
For each problem, the agent runs multiple inference iterations.
Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench.
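A minimal sketch of PRM-guided iteration, assuming hypothetical `generate` and `prm_score` callables for the LLM and the process reward model (the stopping rule below is illustrative, not the paper's): keep sampling solutions until the PRM is confident or the budget runs out, rather than spending a fixed compute budget per problem.

```python
def prm_guided_solve(problem, generate, prm_score, max_iters=8, stop_at=0.9):
    """Dynamic test-time scaling sketch: iterate until the process reward
    model's score clears a threshold or the iteration budget is spent."""
    best, best_score = None, float("-inf")
    for _ in range(max_iters):
        solution = generate(problem)
        score = prm_score(problem, solution)  # aggregated step-level score
        if score > best_score:
            best, best_score = solution, score
        if best_score >= stop_at:
            break  # confident enough; stop spending compute on this problem
    return best
```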
Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li, Ruixiang Luo · Jan 31, 2026
Citations: 0
Automatic Metrics · Multi Agent · Math
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence.
To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning.