A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.
Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime.
We show that benchmark accuracy does not predict governability, that correction capacity varies independently of detection, and that identical governance scaffolds produce opposite effects across models.
This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) on real-world financial problems solved through tool invocation over financial Model Context Protocol (MCP) services.
Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities.
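As an illustration of what a tool-invocation accuracy metric can look like in this setting, here is a minimal sketch; the `ToolCall` schema, the strict argument matching, and the financial tool names are assumptions made for the example, not FinMCP-Bench's actual definition.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    """A single tool invocation extracted from a model response."""
    tool_name: str
    arguments: dict

def tool_invocation_accuracy(predicted: list[ToolCall],
                             reference: list[ToolCall]) -> float:
    """Fraction of reference calls matched by a predicted call with the
    same tool name and identical arguments (a strict-match reading of
    'tool invocation accuracy'; the benchmark's own metric may differ)."""
    if not reference:
        return 1.0
    matched = 0
    remaining = list(predicted)
    for ref in reference:
        for i, pred in enumerate(remaining):
            if pred.tool_name == ref.tool_name and pred.arguments == ref.arguments:
                matched += 1
                del remaining[i]
                break
    return matched / len(reference)

# Example: one of two reference calls is reproduced exactly.
ref = [ToolCall("bond_pricing", {"face_value": 1000, "coupon": 0.05}),
       ToolCall("fx_rate", {"pair": "EURUSD"})]
pred = [ToolCall("bond_pricing", {"face_value": 1000, "coupon": 0.05})]
print(tool_invocation_accuracy(pred, ref))  # 0.5
```

A softer matcher (e.g., type-normalised or partially-credited arguments) would change the numbers but not the shape of the metric.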
With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions.
To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment.
We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior.
To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning.
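On one plausible reading of that definition, AEP is simply the revised-label fraction over aligned before/after annotations; the helper below is illustrative rather than the paper's implementation, and agreement shifts (e.g., Cohen's kappa before vs. after exposure) would be computed separately.

```python
def annotator_effort_proxy(before: list[str], after: list[str]) -> float:
    """Proportion of labels that changed after annotators saw the
    model's reasoning -- one plausible formalisation of AEP."""
    if len(before) != len(after):
        raise ValueError("label lists must be aligned per item")
    if not before:
        return 0.0
    revised = sum(b != a for b, a in zip(before, after))
    return revised / len(before)

# Example: 2 of 5 sentiment labels were revised after exposure to reasoning.
print(annotator_effort_proxy(
    ["pos", "neg", "neu", "pos", "neg"],
    ["pos", "pos", "neu", "pos", "neu"],
))  # 0.4
```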
We conduct a comprehensive evaluation of the proposed framework on a large multi-center dataset consisting of 2851 T1-weighted brain MRI scans aggregated from 13 public datasets.
Experimental results demonstrate that the proposed framework achieves strong and balanced performance across multiple quantitative evaluation metrics while maintaining stable and physically plausible deformation fields.
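One common way to quantify "physically plausible" in deformable registration, shown here purely as an illustration rather than as the authors' protocol, is the fraction of voxels where the Jacobian determinant of the deformation is non-positive, i.e., where the field folds:

```python
import numpy as np

def folding_fraction(disp: np.ndarray) -> float:
    """Fraction of voxels with a non-positive Jacobian determinant.

    `disp` is a dense displacement field of shape (3, D, H, W) in voxel
    units; the deformation is phi(x) = x + disp(x).  Values near 0 mean
    the field is locally invertible and hence physically plausible.
    """
    # Spatial gradients of each displacement component: shape (3, 3, D, H, W).
    grads = np.stack(
        [np.stack(np.gradient(disp[c]), axis=0) for c in range(3)], axis=0
    )
    # Jacobian of phi: identity plus displacement gradients, per voxel.
    jac = grads + np.eye(3).reshape(3, 3, 1, 1, 1)
    # Move the 3x3 matrix axes to the end so np.linalg.det broadcasts per voxel.
    det = np.linalg.det(np.moveaxis(jac, (0, 1), (-2, -1)))
    return float(np.mean(det <= 0))

# Example: a zero displacement field (the identity transform) never folds.
print(folding_fraction(np.zeros((3, 16, 16, 16))))  # 0.0
```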
We use our framework to assess current methods in the field with state-of-the-art evaluation tools and manual analysis, explore the utility of potentially expensive resources in improving ICLL, and test ICLL strategies on rich downstream tasks beyond…
To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork.
The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification.
Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise.
Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness.
We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh…
Among the 13 models evaluated, even the best judges underperform human annotators by 12-23%.
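For intuition about how a judge-versus-human gap like this can be scored, the sketch below compares accuracy against a gold preference label on toy data; the pairwise A/B framing and all numbers are hypothetical, not TRACE's actual interface or results.

```python
def preference_accuracy(preferred: list[str], gold: list[str]) -> float:
    """Share of items where a rater's preferred response ('A' or 'B')
    matches the gold preference label."""
    if len(preferred) != len(gold):
        raise ValueError("prediction and gold lists must be aligned")
    return sum(p == g for p, g in zip(preferred, gold)) / len(gold)

# Toy data only: here the judge trails the human rater by 20 accuracy points.
gold  = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "B"]
human = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "A"]  # 9/10 correct
judge = ["A", "B", "B", "A", "B", "A", "A", "B", "A", "A"]  # 7/10 correct
print(round(preference_accuracy(human, gold) - preference_accuracy(judge, gold), 2))  # 0.2
```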
To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security…
The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety.