OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 26

Filter by tag

All Automatic Metrics (876) General (528) Coding (281) Simulation Env (109) Multilingual (92) Math (90) Long Horizon (74) Medicine (69) Pairwise Preference (64) Law (43) Multi Agent (38) Human Eval (36) Expert Verification (23) Red Team (21) Web Browsing (21) Critique Edit (19)

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026

Citations: 0

Pairwise Preference Automatic Metrics Long Horizon General

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model.

Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026

Citations: 0

Automatic Metrics Long Horizon Coding

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning.

Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

Lichang Song, Ting Long, Yi Chang · Feb 21, 2026

Citations: 0

Automatic Metrics Multi Agent General

To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma

PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Wenjie Zhang · Oct 14, 2025

Citations: 0

Automatic Metrics Long Horizon General

Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining s

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026

Citations: 0

Expert Verification Automatic Metrics Medicine

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
Results: Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision.

Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025

Citations: 0

Automatic Metrics Long Horizon Coding

On the Episodic Memory Benchmark (EpBench) [huet_episodic_2025], comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%.
More broadly, GSW offers a concrete blueprint for endowing LLMs with human-like episodic memory, paving the way for more capable agents that can reason over long horizons.

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon General

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models

Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025

Citations: 0

Pairwise Preference Automatic Metrics Long Horizon General

We additionally contribute a CAD dataset with human preference annotations.
Experiments with proprietary models (GPT-4o, Gemini, etc.) show large gains, with GPT-4o (Omni) achieving up to +23.4 absolute accuracy points on the human-preference benchmark.

OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong, Haoran Luo · Feb 3, 2026

Citations: 0

Automatic Metrics Tool Use General

To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.
Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries.

FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026

Citations: 0

Demonstrations Automatic Metrics General

In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.

Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Subrit Dikshit · Feb 18, 2026

Citations: 0

Automatic Metrics Simulation Env Law Coding

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo · Aug 26, 2025

Citations: 0

Automatic Metrics Long Horizon General

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning.

Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen, Jan-Jakob Sonke · Feb 24, 2025

Citations: 0

Pairwise Preference Automatic Metrics General

Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang, Xinyuan Luo · Nov 14, 2025

Citations: 0

Critique Edit Simulation Env General

Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assi

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang, Frank Ong · Apr 26, 2025

Citations: 0

Pairwise Preference Automatic Metrics Multi Agent General

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
In contrast, games present distinct challenges: fast-evolving catalogs, interaction-driven preferences (e.g., skill level, mechanics, hardware), and increased risk of unsafe responses in open-ended conversation.

A Benchmark for Deep Information Synthesis

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger · Feb 24, 2026

Citations: 0

Human Eval Automatic Metrics Tool Use Coding

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval.

ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation

Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon, Yohan Jo · Sep 26, 2025

Citations: 0

Pairwise Preference Simulation Env General

In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
Built on ProPerSim, we propose ProPerAssistant, a retrieval-augmented, preference-aligned assistant that continually learns and adapts through user feedback.

Validating Political Position Predictions of Arguments

Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026

Citations: 0

Pairwise Preference Human Eval General

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation.

Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan · Apr 7, 2025

Citations: 0

Red Team Automatic Metrics Math

We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-cont

MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation

Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding · Mar 23, 2025

Citations: 0

Expert Verification Automatic Metrics Medicine

Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.

