
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 52 · Search mode: keyword

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain places pre-vetted domain experts directly into your annotation pipeline.

Do Phone-Use Agents Respect Your Privacy?

Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Coding
  • We study whether phone-use agents respect privacy while completing benign mobile tasks.
  • To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents.
Open paper
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents

Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu, Luxi Lin · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Pairwise Preference Simulation Env Tool Use Coding
  • With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions.
  • To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment.
Open paper
IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Ali Abdelaal, Mohammed Nader Al Haffar, Mahmoud Fawzi, Walid Magdy · Mar 24, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Coding
  • We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions).
  • The benchmark is used to create the IslamicMMLU public leaderboard for evaluating LLMs; we initially evaluate 26 LLMs, whose average accuracy across the three tracks ranges from 39.8% to 93.8% (the highest achieved by Gemini 3 Flash).
Open paper
CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu, Xiaoxi Li · Mar 19, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Coding
  • Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly…
  • Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on…
Open paper
From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Xiaoyong Guo, Nanjie Li, Zijie Zeng, Kai Wang, Hao Huang, Haihua Xu · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Coding
  • We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history,…
Open paper
Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan, Xueyi Pu · Mar 16, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps.
  • Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs.
Open paper
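Pairwise preference accuracy, the headline metric above, is commonly computed as the fraction of preference pairs on which a reward model scores the chosen response above the rejected one. A minimal sketch of that metric (the `toy_reward` scorer and the example pairs are placeholders for illustration, not SDiaReward's model or data):

```python
def pairwise_preference_accuracy(pairs, reward_fn):
    """Fraction of (chosen, rejected) pairs where the reward model
    ranks the chosen response strictly higher than the rejected one."""
    correct = sum(1 for chosen, rejected in pairs
                  if reward_fn(chosen) > reward_fn(rejected))
    return correct / len(pairs)

# Toy stand-in scorer: longer responses score higher (illustration only).
def toy_reward(response):
    return len(response)

pairs = [("a detailed, helpful answer", "ok"),
         ("step-by-step reasoning", "no"),
         ("x", "a much longer but rejected reply")]
print(pairwise_preference_accuracy(pairs, toy_reward))  # 2 of 3 pairs ranked correctly
```

A real evaluation swaps `toy_reward` for the trained reward model's scalar score over a held-out preference set.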
Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Math Coding
  • In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy on balanced data and 67.0% even when correct rules appear in only 10% of the corpus.
  • Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy).
Open paper
Sabiá-4 Technical Report

Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás, Marcos Piau · Mar 10, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Tool Use Law Coding
  • The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal…
  • We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities…
Open paper
From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen, Quang Nhut Huynh · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding
  • On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning…
Open paper
Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Math Coding
  • While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
Open paper
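The truncated sentence appears to refer to Direct Preference Optimization; for orientation, the per-pair DPO loss can be sketched as below. The function name and inputs are illustrative, not the paper's code; the reference-model terms are the source of the implicit KL-style regularization the entry alludes to.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Inputs are log-probabilities of the chosen/rejected responses under
    the policy (pi_*) and the frozen reference model (ref_*); anchoring
    each policy term to the reference is what implicitly regularizes
    the update. (Illustrative sketch, not the paper's implementation.)"""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy exactly matches the reference, the margin is 0
# and the loss is -log(0.5) ≈ 0.693:
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```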
Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Rubric Rating Coding
  • We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh…
  • Among 13 different models, the best judges underperform human annotators by 12-23%.
Open paper
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Automatic Metrics Web Browsing Coding
  • We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous…
  • We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level…
Open paper
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Automatic Metrics Math Coding
  • On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
Open paper
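Pass@1, cited above, is usually reported via the standard unbiased pass@k estimator from the code-generation literature. A minimal sketch of that generic metric (not necessarily V_1's exact evaluation code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 correct out of 10 generations, pass@1 ≈ 0.3 (the raw success rate):
print(pass_at_k(10, 3, 1))
```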
Tucano 2 Cool: Better Open Source LLMs for Portuguese

Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf, Julia Kastner · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Tool Use Coding
  • Following our previous works, we now extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two…
  • Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling…
Open paper
From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation

Pujun Zheng, Jiacheng Yao, Jinquan Zheng, Chenyang Gu, Guoxiu He, Jiawei Liu · Mar 18, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Coding
  • Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently.
  • To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking.
Open paper
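The shift from isolated scoring to collaborative ranking can be illustrated with the simplest comparison-native aggregation: order papers by win rate over pairwise judgments. This Copeland-style sketch is illustrative only; the paper's framework is not specified beyond the summary above, and stronger aggregators (e.g. Bradley-Terry) are common.

```python
from collections import defaultdict

def rank_from_pairwise(comparisons):
    """Turn pairwise judgments (winner, loser) into a ranking by win rate.
    Papers are ordered relative to each other rather than assigned an
    absolute score in isolation."""
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return sorted(games, key=lambda p: wins[p] / games[p], reverse=True)

judgments = [("paper_A", "paper_B"), ("paper_A", "paper_C"), ("paper_B", "paper_C")]
print(rank_from_pairwise(judgments))  # ['paper_A', 'paper_B', 'paper_C']
```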
Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Math Coding
  • We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher's preference can block it.
  • We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student's preference for that animal by up to 19 percentage points.
Open paper

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Coding
  • A sample of 5 frontier and 5 open-weight models was measured using 50 curated Bioalignment prompts with a Kelly criterion-inspired evaluation framework.
  • We next examined whether fine-tuning could increase the preferences of two open-weight models, Llama 3.2-3B-Instruct and Qwen2.5-3B-Instruct, for biological-based approaches.
Open paper
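For orientation, the classic Kelly fraction for a binary bet with win probability p and net odds b is f* = p - (1 - p)/b; how the paper adapts this to measure model preferences is not specified in the summary, so the sketch below shows only the standard formula:

```python
def kelly_fraction(p, b):
    """Classic Kelly criterion: the bankroll fraction to wager on a bet
    won with probability p at net odds b (you win b per unit staked).
    A negative value means the bet has negative edge; stake nothing."""
    return p - (1 - p) / b

print(kelly_fraction(0.75, 1.0))  # 0.5: stake half the bankroll at even odds, 75% win rate
```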

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Expert Verification Medicine Coding
  • To avoid costly clinician labeling, we introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations.
  • We evaluate PrivMedChat across medical dialogue tasks and assess utility, safety, and privacy under consistent privacy accounting, thereby providing a practical pathway to align medical chatbots while offering formal privacy guarantees.
Open paper
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Math Coding
  • We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
  • Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned…
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.