OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 277

Filter by tag

All Automatic Metrics (876) General (528) Coding (281) Simulation Env (109) Multilingual (92) Math (90) Long Horizon (74) Medicine (69) Pairwise Preference (64) Law (43) Multi Agent (38) Human Eval (36) Expert Verification (23) Red Team (21) Web Browsing (21) Critique Edit (19)

SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026

Citations: 0

Automatic Metrics Multi Agent Multilingual

This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations.
To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.

Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li, Gang Yang · Feb 23, 2026

Citations: 0

Automatic Metrics Long Horizon Medicine Coding

The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.

Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong, Chuan Shi · Feb 23, 2026

Citations: 0

Expert Verification Automatic Metrics General

Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li · Feb 23, 2026

Citations: 0

Automatic Metrics Long Horizon Coding

We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.

Can Large Language Models Replace Human Coders? Introducing ContentBench

Michael Haman · Feb 23, 2026

Citations: 0

Critique Edit Automatic Metrics Coding

This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
The suite uses versioned tracks that invite researchers to contribute new benchmark datasets.

Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton, Roger Vilardaga · Feb 23, 2026

Citations: 0

Rubric Rating Automatic Metrics General

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon General

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026

Citations: 0

Pairwise Preference Automatic Metrics Long Horizon General

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model.

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon General

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon Medicine Coding

With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query.

Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng, Zhenkai Liang · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon Math Coding

Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.

Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

Wilson Y. Lee · Feb 22, 2026

Citations: 0

Automatic Metrics Long Horizon General

Why do language agents fail on tasks they are capable of solving?
Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating en

Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026

Citations: 0

Automatic Metrics Multi Agent Law Coding

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder.

Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026

Citations: 0

Pairwise Preference Human Eval General

One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works.
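
For readers triaging agreement figures like the one above, here is a minimal sketch of how raw agreement and Cohen's kappa relate for a single annotator pair. It assumes the reported κ is Cohen's kappa, which the excerpt does not state. The function names and the toy labels are hypothetical and are not taken from the Yor-Sarc dataset.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = raw_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labelled independently according to their own label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy labels (1 = sarcastic, 0 = not sarcastic).
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 0, 0]

print(raw_agreement(annotator_a, annotator_b))  # 0.875
print(cohen_kappa(annotator_a, annotator_b))    # 0.75
```

Kappa is lower than raw agreement here because half of the matches in the toy data would be expected by chance alone. This is why the excerpt reports both numbers.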

Think²: Grounded Metacognitive Reasoning in Large Language Models

Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026

Citations: 0

Pairwise Preference Human Eval Math Medicine

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold incr

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li · Feb 21, 2026

Citations: 0

Red Team Automatic Metrics General

Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

Lichang Song, Ting Long, Yi Chang · Feb 21, 2026

Citations: 0

Automatic Metrics Multi Agent General

To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma

Watermarking LLM Agent Trajectories

Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li, Zheng Liu · Feb 21, 2026

Citations: 0

Automatic Metrics Long Horizon Math Coding

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories.

Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift

Stephen Russell · Feb 21, 2026

Citations: 0

Automatic Metrics Long Horizon General

Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026

Citations: 0

Pairwise Preference Automatic Metrics Long Horizon Coding

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment.

Protocol Hubs

Simulation Env Papers (109) Multilingual Papers (92) Math Papers (90) Automatic Metrics Papers (876) General Papers (528) Coding Papers (281) Long Horizon Papers (74) Medicine Papers (69) Automatic Metrics + Long Horizon Papers (55) Pairwise Preference Papers (64) Automatic Metrics + Pairwise Preference Papers (51) Law Papers (43) Multi Agent Papers (38) Human Eval Papers (36) Automatic Metrics + Multi Agent Papers (25) Simulation Env + Long Horizon Papers (20)
