Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 20 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,634) General (532) Long Horizon (320) Pairwise Preference (289) Coding (221) Simulation Env (190) Multi Agent (184) Medicine (117) Llm As Judge (109) Expert Verification (98) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (78) Demonstrations (67) Critique Edit (63)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching, Jia Li · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Rubric Rating Automatic Metrics MathCoding

To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.

Open paper

Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Rubric Rating Automatic Metrics Coding

We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal…
For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025.

Open paper

CounselReflect: A Toolkit for Auditing Mental-Health Dialogues

Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng, Angel Hsing-Chi Hwang · Mar 31, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 65% High protocol signal Freshness: Hot Status: Ready

Rubric RatingExpert Verification Human Eval Web Browsing Coding

The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing.

Open paper

PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Rubric RatingExpert Verification Automatic MetricsSimulation Env Coding

We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution.

Open paper

Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas

Tim Schopf, Michael Färber · Mar 11, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Rubric Rating Human Eval Coding

To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments.
Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas.

Open paper

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics Multi Agent Coding

To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…
Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over the multi-agent baseline and 2.73x than the RL model baselines.

Open paper

Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

Harry Stuart, Masahiro Kaneko, Timothy Baldwin · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics Coding

Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g.\ interviews conducted by a technical manager) are expensive to deploy at scale.

Open paper

KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang · Feb 19, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Long Horizon Coding

Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

Open paper

Document Reconstruction Unlocks Scalable Long-Context RLVR

Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim · Feb 9, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics Coding

However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision.

Open paper

Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Pairwise PreferenceRubric Rating Coding

We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh…
Among 13 different models, the best judges underperform human annotators by 12-23%.

Open paper

When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Abhinaba Basu, Pavan Chakraborty · Mar 19, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Rubric Rating Coding

Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias.

Open paper

Small Reward Models via Backward Inference

Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov · Feb 14, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Llm As Judge Coding

However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility.
Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%.

Open paper

Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback

Rubric RatingCritique Edit Coding

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).

Open paper

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi · Oct 23, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready

Rubric Rating Human EvalAutomatic Metrics Coding

Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters.
Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

Open paper

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 53% High protocol signal Freshness: Cold Status: Ready

Rubric Rating Automatic MetricsSimulation Env Coding

We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
Together, these results recommend replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit.

Open paper

Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

Andrew Katz · Mar 15, 2026

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Rubric Rating Coding

Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty.
We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains.

Open paper

ScholarEval: Research Idea Evaluation Grounded in Literature

Hanane Nour Moussa, Patrick Queiroz Da Silva, Daniel Adu-Ampratwum, Alyson East, Zitong Lu, Nikki Puccetti · Oct 17, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 50% Moderate protocol signal Freshness: Cold Status: Fallback

Rubric Rating Coding

As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas.
We introduce ScholarEval, a retrieval augmented evaluation framework that assesses research ideas based on two fundamental criteria: soundness - the empirical validity of proposed methods based on existing literature, and contribution - the…

Open paper

Toward LLM-Supported Automated Assessment of Critical Thinking Subskills

Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg, Caitlin Mills · Oct 14, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback

Rubric Rating Coding

As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is…
We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays.

Open paper

Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian, Ganghua Wang · May 19, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback

Rubric Rating Coding

Current benchmarks usually involve simplified tasks.
To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls.

Open paper

RM-R1: Reward Modeling as Reasoning

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang · May 5, 2025

Citations: 0

Match reason: Matches selected tags (Coding, Rubric Rating).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback

Pairwise PreferenceRubric Rating MathCoding

Reward modeling is essential for aligning large language models with human preferences through reinforcement learning.
Empirically, our models achieve superior performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent