Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 590 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,634) General (532) Long Horizon (320) Pairwise Preference (289) Coding (221) Simulation Env (190) Multi Agent (184) Medicine (117) Llm As Judge (109) Expert Verification (98) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (78) Demonstrations (67) Critique Edit (63)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models

Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise PreferenceRed Team Automatic Metrics General

Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
Moreover, overly conservative safety mechanisms can reduce model usefulness by rejecting sensitive but legitimate queries.

Open paper

Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready

Human EvalAutomatic Metrics Long Horizon General

We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on…

Open paper

From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen, Quang Nhut Huynh · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics Coding

On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning…

Open paper

When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

Amirabbas Afzali, Myeongho Jeon, Maria Brbic · Mar 5, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics General

Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization…
Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO.

Open paper

Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models

Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish · Mar 5, 2026

Citations: 0

Match reason: Title directly matches "cost".

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MathCoding

We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model.

Open paper

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Natchanon Pollertlam, Witchayut Kornsuwannawit · Mar 5, 2026

Citations: 0

Match reason: Title directly matches "cost".

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API…

Open paper

Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster · Mar 9, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Given a base language model and a *process reward model* estimating expected terminal rewards, we ask: *how accurately can we sample from a target distribution given some number of process reward evaluations?* Theoretically, we identify (1)…

Open paper

CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang · Mar 7, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Medicine

Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy.
We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution.

Open paper

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin, Jialu Wang · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.

Open paper

Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models

Luis de-Marcos, Manuel Goyanes, Adrián Domínguez-Díaz · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time,…

Open paper

Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions

Hussein Ghaly · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement.

Open paper

Deterministic Differentiable Structured Pruning for Large Language Models

Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen · Mar 9, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu · Mar 5, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions

Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali, Shiva Gaire · Mar 7, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Long Horizon General

Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies.
Despite rapid industrial adoption, current research lacks a systematic understanding of Agentic RAG as a sequential decision-making system, leading to highly fragmented architectures, inconsistent evaluation methodologies, and unresolved…

Open paper

ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat · Mar 5, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback

Llm As JudgeAutomatic Metrics General

Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators.
Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.

Open paper

MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan, Zhaohan Chen · Mar 6, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Ready

Llm As JudgeSimulation Env Long Horizon General

We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns.
Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and…

Open paper

An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data

Trinh Pham, Thanh Tam Nguyen, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen · Mar 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Without timely evaluation, organizations cannot approve releases or detect failures early.

Open paper

RoboPocket: Improve Robot Policies Instantly with Your Phone

Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le, Yi Wang · Mar 5, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Demonstrations Long Horizon Law

Open paper

The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling

J. Clayton Kerce, Alexis Fox · Mar 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary · Mar 6, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Fallback

Critique Edit MathCoding

We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii)…

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent