Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 129 Search mode: keyword Ranking: eval-signal prioritized Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,649) General (539) Long Horizon (322) Pairwise Preference (293) Coding (221) Simulation Env (191) Multi Agent (185) Medicine (117) Llm As Judge (110) Expert Verification (98) Human Eval (89) Rubric Rating (83) Math (79) Web Browsing (79) Demonstrations (67) Critique Edit (64)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

MemRerank: Preference Memory for Personalized Product Reranking

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Automatic Metrics General

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking.

Open paper

Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

Peiju Liu, Jinming Liu, Xipeng Qiu, Xuanjing Huang · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6\% while reducing token usage by 78\%, and demonstrate strong generalization across diverse decoders and benchmarks.

Open paper

Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Natapong Nitarach · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Math

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions.
Our best Hybrid-KDA model retains 86--90\% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75\% and improving time-to-first-token by 2--4\times at 128K-token contexts.

Open paper

Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics LawCoding

Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance.

Open paper

LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems

Yuhang Zhou, Zhuokai Zhao, Ke Li, Spilios Evmorfos, Gökalp Demirci, Mingyi Wang · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information.

Open paper

Select, Label, Evaluate: Active Testing in NLP

Antonio Purificato, Maria Sofia Bucarelli, Andrea Bacciu, Amin Mantrach, Fabrizio Silvestri · Mar 23, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for…
Given a labeling budget, it aims to choose the subset that best estimates model performance while minimizing cost and human effort.

Open paper

LLM Router: Rethinking Routing with Prefill Activations

Tanay Varshney, Annie Surla, Michelle Xu, Gomathy Venkata Krishnan, Maximilian Jeblick, David Austin · Mar 21, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

LLMs often achieve similar average benchmark accuracies while exhibiting complementary strengths on different subsets of queries, suggesting that a router with query-specific model selection can outperform any single model.

Open paper

Impact of automatic speech recognition quality on Alzheimer's disease detection from spontaneous speech: a reproducible benchmark study with lexical modeling and statistical validation

Himadri Samanta · Mar 18, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Medicine

The study provides a reproducible benchmark pipeline and highlights ASR selection as a critical modeling decision in clinical speech-based artificial intelligence systems.

Open paper

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

Jon-Paul Cacioli · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Coding

Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive…
We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio.

Open paper

MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning

Andrea Manzoni · Mar 25, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou · Mar 25, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics MathCoding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Hyoseok Park, Yeonsang Park · Mar 23, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 77% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1).
Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context.

Open paper

How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation

Manoj Balaji Jagadeeshan, Atul Singh, Nallani Chakravartula Sahith, Amrith Krishna, Pawan Goyal · Mar 25, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

We also introduce a new approach for reference-free evaluation using cross-encoders which achieved better alignment with true poetry instances.

Open paper

Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

Maria Milkova, Maksim Rudnev · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

This study presents a multi-stage classification framework for detecting human values in noisy Russian language social media, validated on a random sample of 7.5 million public text posts.
By treating value detection as a multi perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with…

Open paper

Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering

Minchan Kwon, Hyounguk Shon, Junmo Kim · Mar 16, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 66% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain

Corentin Royer, Debarun Bhattacharjya, Gaetano Rossiello, Andrea Giovannini, Mennatallah El-Assady · Mar 18, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 67% Sparse protocol signal Freshness: Warm Status: Ready

Long Horizon MathCoding

Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling.
We demonstrate that these labels enable effective chain-of-thought selection in best-of-K evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent