OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 172 Search mode: keyword RSS

Filter by tag

All Automatic Metrics (978) General (590) Coding (314) Simulation Env (115) Math (103) Multilingual (99) Long Horizon (82) Medicine (78) Pairwise Preference (70) Law (45) Multi Agent (41) Human Eval (38) Expert Verification (25) Web Browsing (22) Critique Edit (21) Red Team (21)

BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning

Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui · Oct 31, 2025

Citations: 0

Pairwise Preference Automatic MetricsSimulation Env Long Horizon General

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
However, such vision-driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy

Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025

Citations: 0

Red Team Automatic Metrics General

Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.
This reasoning ability also generalizes to safety-critical settings beyond the training distribution.

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang · Oct 29, 2025

Citations: 0

Simulation Env Long Horizon General

Real-world language agents must handle complex, multi-step workflows across diverse Apps.
For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual.

Designing and Evaluating Chain-of-Hints for Scientific Question Answering

Anubhav Jangra, Smaranda Muresan · Oct 24, 2025

Citations: 0

Pairwise Preference Automatic Metrics General

Using the best performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies, and identify the limitations of automatic evaluation metrics to capture them.

RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025

Citations: 0

Automatic Metrics Long Horizon General

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can
Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency.

Robust Preference Alignment via Directional Neighborhood Consensus

Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei · Oct 23, 2025

Citations: 0

Pairwise Preference Automatic Metrics General

Aligning large language models with human preferences is critical for creating reliable and controllable AI systems.
A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs.

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut · Oct 21, 2025

Citations: 0

Rubric Rating Human EvalLlm As Judge General

While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge.
In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g.

MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang · Oct 21, 2025

Citations: 0

Demonstrations Simulation Env Long Horizon General

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
This challenge intensifies for multi-step bimanual mobile manipulation, where humans must teleoperate both the mobile base and two high-DoF arms.

SPACeR: Self-Play Anchoring with Centralized Reference Models

Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu · Oct 20, 2025

Citations: 0

Demonstrations Simulation Env Multi Agent General

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings.

Toward LLM-Supported Automated Assessment of Critical Thinking Subskills

Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg, Caitlin Mills · Oct 14, 2025

Citations: 0

Rubric Rating Automatic Metrics General

As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is becom
We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays.

PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Wenjie Zhang · Oct 14, 2025

Citations: 0

Automatic Metrics Long Horizon General

Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining s

EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science

Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim · Oct 8, 2025

Citations: 0

Automatic MetricsSimulation Env General

To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
Our evaluation reveals critical limitations in current LLMs' context-dependent reasoning.

Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025

Citations: 0

Human EvalAutomatic Metrics General

Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).

Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025

Citations: 0

Pairwise Preference Automatic Metrics General

Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
It typically involves a language model to generate on-policy responses for prompts and a reward model (RM) to guide the selection of chosen and rejected responses, which can be further trained with direct preference optimization (DPO).

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin · Oct 6, 2025

Citations: 0

Critique Edit Automatic Metrics General

Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control.
Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and

General Exploratory Bonus for Optimistic Exploration in RLHF

Wendi Li, Changdae Oh, Sharon Li · Sep 27, 2025

Citations: 0

Automatic Metrics General

Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism.

ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation

Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon, Yohan Jo · Sep 26, 2025

Citations: 0

Pairwise Preference Simulation Env General

In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
Built on ProPerSim, we propose ProPerAssistant, a retrieval-augmented, preference-aligned assistant that continually learns and adapts through user feedback.

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch · Sep 25, 2025

Citations: 0

Rubric Rating Automatic Metrics General

EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis

Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025

Citations: 0

Expert Verification Llm As JudgeSimulation Env Multi Agent General

We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
We introduce two types of agents: a scientist agent for planning, coordination, reflection, and generation of final results, and a task-expert agent to focus exclusively on one specific duty serving as a tool to the scientist agent.

ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference

Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun, Juho Kim · Sep 18, 2025

Citations: 0

Critique Edit Automatic Metrics General

Furthermore, exploratory applications demonstrate that captured steps can enhance generative AI agents in Figma, yielding predictions better aligned with professionals and producing coherent outcomes.

Protocol Hubs

Expert Verification Papers (25) CS.CL + Expert Verification Papers (20) Pairwise Preference Papers (70) CS.CL + Pairwise Preference Papers (62) CS.AI + Expert Verification Papers (15) CS.AI + Pairwise Preference Papers (42) Rubric Rating Papers (17) CS.CL + Rubric Rating Papers (16) General + Pairwise Preference Papers (43) Expert Verification Or Rubric Rating Papers (39) CS.CL + Math Papers (84) Long Horizon Papers (82) CS.CL + Human Eval Papers (35) CS.CL + Long Horizon Papers (58) Expert Verification + Medicine Papers (11) Human Eval Papers (38)

Human Feedback and Eval Paper Explorer

Filter by tag

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives