Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 257

HyperMem: Hypergraph Memory for Long-Term Conversations

Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang, Tingwen Liu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference LLM As Judge Automatic Metrics General
  • Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
  • Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.
Open paper
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Pairwise Preference Rubric Rating Human Eval Automatic Metrics General
  • Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
  • To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences.
Open paper
How Much LLM Does a Self-Revising Agent Actually Need?

Sungwoo Jung, Seonil Son · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Critique Edit Automatic Metrics General
  • Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
  • We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure.
Open paper
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics Long Horizon General
  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.
Open paper
State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi, Amarendra Chaudhary · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Demonstrations Automatic Metrics General
  • Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
  • Our results indicate that the combination of sparse MoE architecture, culturally-informed CoT distillation with explicit Arabic linguistic checks, and strategic bilingual data curation enables an open-source adapted model to systematically…
Open paper
Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao, Ritankar Das · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Expert Verification Automatic Metrics General
  • These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly…
Open paper
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
Open paper
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang, Jong C. Park · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Simulation Env Multi Agent General
  • Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
  • Our experiments demonstrate that the representative agent's accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation.
Open paper
ActionParty: Multi-Subject Action Binding in Generative Video Games

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Simulation Env Multi Agent General
  • However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
  • We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments.
Open paper
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Tool Use General
  • The advent of agentic multimodal models has empowered systems to actively interact with external environments.
  • Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
Open paper
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

Wenhao Yuan, Chenchen Lin, Jian Chen, Jinfeng Xu, Xuehe Wang, Edith Cheuk Han Ngai · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory.
  • In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose Self-Audited Verified Reasoning (SAVeR), a novel framework that enforces verification over internal belief states within the agent…
Open paper
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai, Zihang Liu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints.
  • We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agent.
Open paper
Learning to Interrupt in Language-based Multi-agent Communication

Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz, Ansong Ni · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Multi Agent General
  • Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker.
  • We evaluate our framework across various multi-agent scenarios, including 2-agent text pictionary games, 3-agent meeting scheduling, and 3-agent debate.
Open paper
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
LLM As Judge Automatic Metrics General
  • We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
  • Transformer-based probes achieve the strongest discrimination, with M2 performing best on 5-fold average AUC/F1, and M3 performing best on both single-fold validation and held-out test evaluation.
Open paper
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong, Steve Scargall · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
  • Across benchmarks, MemMachine achieves strong accuracy-efficiency tradeoffs: on LoCoMo it reaches 0.9169 using gpt4.1-mini; on LongMemEvalS (ICLR 2025), a six-dimension ablation yields 93.0 percent accuracy, with retrieval-stage…
Open paper
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Human Eval Automatic Metrics General
  • However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
  • Human evaluation with strong inter-rater agreement (Cohen's κ > 0.80) confirms robustness.
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Tool Use General
  • We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use.
  • Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains.
Open paper

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation.
  • Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention.
Open paper
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, General).

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback
LLM As Judge Automatic Metrics General
  • We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
Open paper
