
OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 172
Collaborative Document Editing with Multiple Users and AI Agents

Florian Lehmann, Krystsina Shauchenka, Daniel Buschek · Sep 15, 2025

Citations: 0
Simulation Env · Multi Agent · General
  • We propose integrating AI agents directly into collaborative writing environments.
  • Our prototype makes AI use visible to all users through two new shared objects: user-defined agent profiles and tasks.
EO-1: An Open Unified Embodied Foundation Model for General Robot Control

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang · Aug 28, 2025

Citations: 0
Automatic Metrics · Long Horizon · General
  • The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
  • However, existing models still fail to achieve human-level flexibility in interleaved reasoning and interaction.
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo · Aug 26, 2025

Citations: 0
Automatic Metrics · Long Horizon · General
  • Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
  • We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning.
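The parallel-then-sequential pattern this entry describes can be sketched as follows. This is an illustrative stand-in: the retriever, corpus, and function names are assumptions made for the example, not HybridDeepSearcher's actual interface.

```python
# Sketch of hybrid search: a parallel round of sub-queries, explicit
# evidence aggregation, then one deeper sequential step. The stub
# retriever and its toy corpus are invented for illustration.

def retrieve(query: str) -> list[str]:
    """Stub retriever mapping queries to evidence snippets."""
    corpus = {
        "capital of France": ["Paris is the capital of France."],
        "population of Paris": ["Paris has about 2.1 million residents."],
    }
    return corpus.get(query, [])

def hybrid_search(subqueries: list[str], follow_up: str) -> list[str]:
    # Stage 1: parallel query expansion -- all sub-queries issued in one round.
    parallel_evidence = [s for q in subqueries for s in retrieve(q)]
    # Stage 2: aggregate evidence explicitly (dedupe, keep order) before going deeper.
    aggregated = list(dict.fromkeys(parallel_evidence))
    # Stage 3: one sequential step conditioned on what was found so far.
    if aggregated:
        aggregated += retrieve(follow_up)
    return aggregated

evidence = hybrid_search(
    ["capital of France", "population of Paris"],
    follow_up="population of Paris",
)
```

The design point is that stage 1 fans out in parallel (cheap, breadth-first), and only the aggregated result drives the slower sequential step.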
TASER: Table Agents for Schema-guided Extraction and Recommendation

Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso · Aug 18, 2025

Citations: 0
Critique Edit · Automatic Metrics · General
  • To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized, …
  • Our Recommender Agent reviews unmatched outputs and proposes schema revisions, enabling TASER to outperform vision-based table detection models such as Table Transformer by 10.1%.
1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning

Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap · Aug 11, 2025

Citations: 0
Automatic Metrics · Multi Agent · General
  • We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence …
  • Experiments on the ConfAIde and PrivacyLens benchmarks with several open-source and closed-source LLMs demonstrate that our best multi-agent configuration substantially reduces private information leakage (18% on ConfAIde and …).
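The decomposition described in this entry — an extraction agent, a classification agent, and an iterative validation pass — can be sketched as below. All rules and names here are illustrative assumptions, not the paper's implementation.

```python
# Toy version of a decomposed privacy-checking pipeline: one "agent"
# extracts candidate private spans, a second labels each, and a
# validation pass re-checks the labels. The regexes are stand-in rules.
import re

def extract_candidates(text: str) -> list[str]:
    """Extraction agent: pull spans that might carry private info."""
    # Toy rule: email addresses and short phone-like digit runs.
    return re.findall(r"\b[\w.]+@[\w.]+\b|\b\d{3}-\d{4}\b", text)

def classify(span: str) -> str:
    """Classification agent: label each candidate span."""
    return "email" if "@" in span else "phone"

def validate(labeled: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Validation pass: keep only items a second check agrees on."""
    return [(s, l) for s, l in labeled if classify(s) == l]

def check_privacy(text: str) -> list[tuple[str, str]]:
    candidates = extract_candidates(text)
    labeled = [(s, classify(s)) for s in candidates]
    return validate(labeled)

leaks = check_privacy("Reach Bob at bob@example.com or 555-0142.")
```

Splitting the task this way means each step sees only the information it needs, which is the load-reduction idea the entry's first bullet points at.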
CoAct-1: Computer-using Multi-Agent System with Coding Actions

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li · Aug 5, 2025

Citations: 0
Automatic Metrics · Long Horizon · General
  • Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks.
  • While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency.
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025

Citations: 0
Pairwise Preference · Automatic Metrics · General
  • There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
  • The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available.
TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Yuqi Ren · Jun 30, 2025

Citations: 0
Pairwise Preference · Automatic Metrics · General
  • Conducting supervised and preference fine-tuning of large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values.
  • To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework for automated, scalable preference dataset construction across languages.
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025

Citations: 0
Red Team · Automatic Metrics · General
  • In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
  • We first define ASR inflation as the increase in attack success rate (ASR) due to style patterns in existing jailbreak benchmark queries.
Synthesis of discrete-continuous quantum circuits with multimodal diffusion models

Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025

Citations: 0
Automatic Metrics · Simulation Env · General
  • We benchmark the model across different experiments, analyzing its accuracy over varying qubit counts and circuit depths, and showcasing its ability to outperform existing approaches in gate counts and under noisy conditions.
Pairwise Preference · Automatic Metrics · General
  • Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
  • Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the disti…
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su · May 28, 2025

Citations: 0
Red Team · Simulation Env · Web Browsing · General
  • Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection.
  • Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces.
Citations: 0
Pairwise Preference · Simulation Env · General
  • Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
  • We evaluated 34 distinct models across 45 configurations from major closed-source and open-source providers, and compared outputs to responses from 106 human participants.
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu · May 21, 2025

Citations: 0
Pairwise Preference · Automatic Metrics · General
  • However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground-truth references, leaving a critical gap in our ability to evaluate verification systems used in reasoning.
  • In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems.
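The contrast the VerifyBench bullets draw — scoring one answer against a ground-truth reference versus comparing two answers by preference — can be shown in a few lines. The normalization rule and the toy judge are assumptions for illustration, not the benchmark's protocol.

```python
# Reference-based vs. preference-based reward, in miniature.

def normalize(answer: str) -> str:
    """Toy normalization: case, whitespace, trailing period."""
    return answer.strip().lower().rstrip(".")

def reference_verify(answer: str, reference: str) -> bool:
    """Reference-based reward: correct iff the answer matches ground truth."""
    return normalize(answer) == normalize(reference)

def preference_pick(a: str, b: str, judge) -> str:
    """Preference-based reward: a judge compares two answers, no reference."""
    return a if judge(a, b) else b

# Reference-based: a ground-truth answer exists for the instance.
verified = reference_verify("Paris.", "paris")

# Preference-based: only a relative judgment (toy longer-is-better judge).
best = preference_pick("Paris", "Paris, the capital",
                       judge=lambda a, b: len(a) > len(b))
```

The gap the entry points at is exactly this: a preference judge can rank two wrong answers, while a reference-based verifier needs (and checks against) the right one.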

Protocol Hubs