Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 23

StoryAlign: Evaluating and Training Reward Models for Story Generation

Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou · May 6, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Coding
  • Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works regarding complex narrative structure and human-aligned preferences.
  • We find existing reward models struggle to select human-preferred stories, with the best model achieving only 66.3% accuracy.
Open paper
Do Phone-Use Agents Respect Your Privacy?

Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • We study whether phone-use agents respect privacy while completing benign mobile tasks.
  • To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents.
Open paper
IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Ali Abdelaal, Mohammed Nader Al Haffar, Mahmoud Fawzi, Walid Magdy · Mar 24, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions).
  • The benchmark is used to create the IslamicMMLU public leaderboard for evaluating LLMs, and we initially evaluate 26 LLMs, where their averaged accuracy across the three tracks varied between 39.8% to 93.8% (by Gemini 3 Flash).
Open paper
CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu, Xiaoxi Li · Mar 19, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly…
  • Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on…
Open paper
Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan, Xueyi Pu · Mar 16, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps.
  • Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs.
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Math Coding
  • In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus.
  • Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy).
Open paper
Sabiá-4 Technical Report

Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás, Marcos Piau · Mar 10, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Tool Use Law Coding
  • The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal…
  • We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities…
Open paper
From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen, Quang Nhut Huynh · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning…
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Math Coding
  • While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
Open paper
ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.
  • We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset.
Open paper
Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding Multilingual
  • Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.
  • We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii)…
Open paper
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Multi Agent Coding
  • We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.
  • Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero-shot…
Open paper
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Long Horizon Coding
  • To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.
  • We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data.
Open paper
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Automatic Metrics Web Browsing Coding
  • We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous…
  • We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level…
Open paper
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Automatic Metrics Math Coding
  • On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
Open paper
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie, Ido Hakimi · Feb 27, 2026

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Automatic Metrics Coding
  • Reward models are central to aligning large language models (LLMs) with human preferences.
  • Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback.
Open paper
ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild

Bufang Yang, Lilin Xu, Liekang Zeng, Yunqi Guo, Siyang Jiang, Wenrui Lu · Dec 7, 2025

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 53% High protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Tool Use Coding
  • In this work, we propose ProAgent, an end-to-end proactive agent system that harnesses on-demand sensory contexts to provide in-the-wild assistance.
  • Results demonstrate that ProAgent achieves up to 27.7% higher proactive prediction accuracy and 20.5% lower false detection than state-of-the-art baselines.
Open paper
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong · Oct 14, 2025

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Coding
  • Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
Open paper
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira · Jul 15, 2025

Citations: 0

Match reason: Matches selected tags (Automatic Metrics, Coding).

Score: 53% High protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Simulation Env Long Horizon Math Coding
  • We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents.
  • Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp.
Open paper
