Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch · Sep 25, 2025
OpenTrain Research Tools
Human Feedback and Eval Paper Explorer
A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.
Filter by tag
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025
- We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
- We introduce two types of agents: a scientist agent for planning, coordination, reflection, and generation of final results, and a task-expert agent to focus exclusively on one specific duty serving as a tool to the scientist agent.
Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun, Juho Kim · Sep 18, 2025
- Furthermore, exploratory applications demonstrate that captured steps can enhance generative AI agents in Figma, yielding predictions better aligned with professionals and producing coherent outcomes.
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu · Sep 17, 2025
- This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- In addition, the assessment of defenses on the constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in the defense methods.
Florian Lehmann, Krystsina Shauchenka, Daniel Buschek · Sep 15, 2025
- We propose integrating AI agents directly into collaborative writing environments.
- Our prototype makes AI use visible to all users through two new shared objects: user-defined agent profiles and tasks.
Hasin Jawad Ali, Ilhamul Azam, Ajwad Abrar, Md. Kamrul Hasan, Hasan Mahmud · Sep 14, 2025
- The challenge of aligning artificial intelligence (AI) with human values persists due to the abstract and often conflicting nature of moral principles and the opacity of existing approaches.
- This paper introduces CogniAlign, a multi-agent deliberation framework based on naturalistic moral realism, that grounds moral reasoning in survivability, defined across individual and collective dimensions, and operationalizes it through s
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025
- We additionally contribute a CAD dataset with human preference annotations.
- Experiments with proprietary models (GPT-4o, Gemini, etc) show large gains, with GPT-4o (Omni) achieving up to +23.4 absolute accuracy points on the human-preference benchmark.
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang · Aug 28, 2025
- The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
- However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction.
Jingyu Guo, Yingying Xu · Aug 27, 2025
- While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases.
- Previous studies have focused on biases inherited from training data, but whether stereotypes can emerge spontaneously in AI agent interactions merits further exploration.
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo · Aug 26, 2025
- Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
- We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning.
Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso · Aug 18, 2025
- To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized,
- Our Recommender Agent reviews unmatched outputs and proposes schema revisions, enabling TASER to outperform vision-based table detection models such as Table Transformer by 10.1%.
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025
- Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
- In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions.
Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap · Aug 11, 2025
- We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence
- Experiments on the ConfAIde and PrivacyLens benchmark with several open-source and closed-sourced LLMs demonstrate that our best multi-agent configuration substantially reduces private information leakage (\textbf{18\%} on ConfAIde and \tex
Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li · Aug 5, 2025
- Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks.
- While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency.
Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen, Yaojie Lu · Aug 3, 2025
- Unfortunately, there is still a large gap between real-world MCP usage and current evaluation: they typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale retrieva
- The benchmark includes a ready-to-deploy tool suite of 70 servers with 527 tools, ensuring reproducibility without scattered API configuration.
Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Loza Vera, Muhammad Dehan Al Kautsar · Jul 31, 2025
- Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints.
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025
- There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
- The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available.
Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu, Chuntao Hong · Jul 4, 2025
- Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation.
- To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets.
Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei, Leye Wang · Jul 3, 2025
Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Yuqi Ren · Jun 30, 2025
- Conducting supervised and preference fine-tuning of large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values.
- To address these challenges, we propose the \underline{\textbf{Ta}}xonomy-Guided \underline{\textbf{P}}reference Data Generation (TaP) framework for automated, scalable preference dataset construction across languages.
Protocol Hubs
Benchmark Hubs
- Retrieval Benchmark Papers (115)
- Retrieval Benchmark Papers (Last 365 Days) (111)
- Retrieval Or GSM8K Benchmark Papers (128)
- Retrieval Or MMLU Benchmark Papers (128)
- Retrieval Or MATH Benchmark Papers (135)
- Retrieval Or DROP Benchmark Papers (129)
- Retrieval Or MMLU Or GSM8K Benchmark Papers (139)
- Retrieval Or MATH Or GSM8K Benchmark Papers (148)
- Retrieval Or MATH Or MMLU Benchmark Papers (146)
- Retrieval Or DROP Or GSM8K Benchmark Papers (141)