A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.
Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than GQA, low-rank baselines, and recent Repeat-all-over/Sequential sharing schemes at comparable parameter budgets.
Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to a 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with caching.
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency.
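For context, here is a minimal sketch of the ReAct-style reason-act loop that such distillation targets; the `llm()` stub, `TOOLS` registry, and prompt format are illustrative assumptions, not the paper's implementation:

```python
# Minimal ReAct-style loop (illustrative sketch; every name here is an
# assumption, not the paper's code). A scripted llm() stub stands in for a model.

SCRIPT = iter([
    "I should compute the sum. Action: calculate[2+3]",
    "The observation says 5. Final Answer: 5",
])

def llm(prompt: str) -> str:
    """Stand-in for any completion API; returns the next scripted step."""
    return next(SCRIPT)

TOOLS = {"calculate": lambda expr: str(eval(expr))}  # toy only; never eval untrusted input

def react_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        thought = llm(transcript + "Thought:")        # reasoning step
        transcript += f"Thought: {thought}\n"
        if "Final Answer:" in thought:                # terminate once answered
            return thought.split("Final Answer:", 1)[1].strip()
        if "Action:" in thought:                      # action step
            name, _, arg = thought.split("Action:", 1)[1].strip().partition("[")
            observation = TOOLS.get(name, lambda a: "unknown tool")(arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"  # result fed back into context
    return "(step budget exhausted)"

print(react_agent("What is 2 + 3?"))  # -> 5
```

A distilled student must reproduce both the "Thought" segments (reasoning fidelity) and the "Action" segments (action consistency) of this interleaved trace.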
However, adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they yield contradictory evaluation outcomes, with different attack strategies favoring different models, and more…
We evaluate 96 popular LLMs, ranging from 0.5B to 685B parameters, on EVALOOOP equipped with the MBPP Plus benchmark, and find that EVALOOOP typically induces a 2.65%-47.62% absolute drop in pass@1 accuracy within ten loops.
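One way to picture such a looped protocol is the generic harness below; the interfaces (`transform`, `passes_tests`) are assumptions for illustration, not EVALOOOP's actual code:

```python
# Generic looped-evaluation sketch (assumed interfaces; not EVALOOOP's actual
# harness): feed a model's output back as its next input and track pass@1.
from typing import Callable, List

def looped_pass_at_1(
    transform: Callable[[str], str],      # e.g., a code -> summary -> code round trip
    passes_tests: Callable[[str], bool],  # unit-test oracle for one task
    seed_program: str,
    loops: int = 10,
) -> List[float]:
    """Per-loop pass@1 for a single task; average over a suite in practice."""
    program, scores = seed_program, []
    for _ in range(loops):
        program = transform(program)              # model rewrites its own output
        scores.append(1.0 if passes_tests(program) else 0.0)
    return scores

# Toy usage: a transform that corrupts the program from the fourth loop on.
state = {"n": 0}
def flaky_transform(p: str) -> str:
    state["n"] += 1
    return p if state["n"] < 4 else p.replace("return", "ret urn")  # inject a bug
print(looped_pass_at_1(
    flaky_transform,
    passes_tests=lambda p: "ret urn" not in p,
    seed_program="def add(a, b):\n    return a + b",
    loops=10,
))  # [1.0, 1.0, 1.0, 0.0, ...]
```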
Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM.
However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings.
We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems.
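The obfuscation idea can be pictured with a toy grapheme permutation (a hypothetical sketch; the benchmark's expert-designed rules are far more linguistically careful): consistently remapping symbols preserves the internal logic a solver must recover while defeating memorized surface forms.

```python
# Hypothetical flavor of rule-based obfuscation (illustrative only; not the
# benchmark's expert-designed rules).
import random

def obfuscate(text: str, seed: int = 0) -> str:
    """Consistently remap letters so puzzle structure survives but
    memorized surface forms do not."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)          # one fixed permutation per seed
    mapping = dict(zip(alphabet, shuffled))
    return "".join(mapping.get(ch, ch) for ch in text.lower())

print(obfuscate("nakupenda"))  # same mapping applied to every item in a problem
```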
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power.
In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step.
In contrast to models, which often solve each step in isolation yet fail on their composition, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy.
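An illustrative composite item in the benchmark's spirit (made up here, not drawn from AgentCoMa):

```python
# Made-up item in AgentCoMa's spirit (not drawn from the benchmark): the
# composite answer needs a commonsense step and then a math step.
question = (
    "A glass jar dropped onto concrete is no longer usable. Each usable jar "
    "holds 40 marbles. If I had 3 jars and dropped one onto concrete, how "
    "many marbles can I store?"
)
commonsense_step = "the dropped jar breaks, leaving 2 usable jars"  # world knowledge
math_step = (3 - 1) * 40                                            # arithmetic on that result
print(math_step)  # 80
```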
Robot task planning decomposes human instructions into executable action sequences that enable robots to complete a series of complex tasks.
To this end, we propose REI-Bench, the first robot task planning benchmark that systematically models vague referring expressions (REs) grounded in pragmatic theory, and discover that the vagueness of REs can severely degrade robot planning performance,…
The framework is evaluated on four medical imaging benchmark datasets: chest X-rays for COVID-19, tuberculosis, and pneumonia, and retinal Optical Coherence Tomography (OCT) images.
Cross-model evaluation confirms that learned CoTs generalize across architectures, suggesting they encode transferable reasoning steps rather than model-specific artifacts.
Our method, AbstRaL, which promotes abstract reasoning in LLMs using RL on granular abstraction data, significantly mitigates performance degradation on recent GSM perturbation benchmarks.
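To picture what "granular abstraction" might look like, here is a hypothetical sketch that symbolizes the numbers in a GSM-style problem; the function name and the paper's actual data pipeline are assumptions and may differ:

```python
# Hypothetical sketch of symbolizing a GSM-style problem (illustrative only;
# AbstRaL's actual abstraction pipeline may differ).
import re

def abstract_numbers(problem: str):
    """Replace numeric literals with symbols; return (template, bindings)."""
    bindings = {}
    def sub(match: re.Match) -> str:
        sym = f"x{len(bindings) + 1}"
        bindings[sym] = match.group(0)   # remember the concrete value
        return sym
    template = re.sub(r"\d+(?:\.\d+)?", sub, problem)
    return template, bindings

template, bindings = abstract_numbers(
    "Tom has 3 boxes with 12 apples each. How many apples does he have?"
)
print(template)   # Tom has x1 boxes with x2 apples each. ...
print(bindings)   # {'x1': '3', 'x2': '12'}
```

Reasoning over the symbolic template rather than the surface numbers is what makes the learned solution robust to perturbed variants of the same problem.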
Yet most evaluations stop at explicit belief attribution in classical toy stories or stylized tasks, leaving open the question of whether LLMs can implicitly apply such knowledge to predict human behavior, or to judge an observed behavior,…
We introduce SimpleToM, a benchmark that advances ToM evaluation along two novel axes.