A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
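Where a sketch helps: the loop below illustrates the hierarchical-planning-plus-iterative-self-reflection pattern the snippet names. All names here (plan_fn, render_fn, critique_fn) are hypothetical stand-ins for the planner, AIGC element generators, and critic, not MM-WebAgent's actual API.

```python
def generate_page(spec, plan_fn, render_fn, critique_fn, max_rounds=3):
    plan = plan_fn(spec)                                # hierarchical decomposition
    page = {sec: render_fn(sec, task) for sec, task in plan.items()}
    for _ in range(max_rounds):                         # iterative self-reflection
        issues = critique_fn(spec, page)                # empty dict means "accept"
        if not issues:
            break
        for sec, hint in issues.items():                # regenerate flagged elements
            page[sec] = render_fn(sec, plan[sec], hint)
    return page

# Toy stubs so the sketch runs end to end:
def plan_fn(spec):
    return {"header": f"title for {spec}", "gallery": f"images for {spec}"}

def render_fn(sec, task, hint=None):
    note = f" <!-- revised: {hint} -->" if hint else ""
    return f"<{sec}>{task}{note}</{sec}>"

critique_fn = lambda spec, page: {}                     # always satisfied in the toy
print(generate_page("travel landing page", plan_fn, render_fn, critique_fn))
```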
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or to cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry containing 30,664 poems in total: 10,276 human-written and 20,388 generated by four popular LLMs.
To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression–ranking objective to provide fine-grained evaluation of reasoning paths.
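The objective named here is standard enough to sketch: a minimal hybrid loss combining MSE regression toward scalar labels with a pairwise margin ranking term that preserves their ordering. The weighting alpha and the margin are assumed hyperparameters, not the paper's values.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(scores, labels, alpha=0.5, margin=0.1):
    reg = F.mse_loss(scores, labels)                    # regression: match labels
    diff_s = scores.unsqueeze(1) - scores.unsqueeze(0)  # [i, j] = s_i - s_j
    diff_y = labels.unsqueeze(1) - labels.unsqueeze(0)
    mask = (diff_y > 0).float()                         # pairs where y_i > y_j
    rank = (F.relu(margin - diff_s) * mask).sum() / mask.sum().clamp(min=1)
    return alpha * reg + (1 - alpha) * rank             # ranking: keep the order

scores = torch.tensor([0.2, 0.9, 0.5], requires_grad=True)
labels = torch.tensor([0.0, 1.0, 0.5])
print(hybrid_loss(scores, labels))
```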
Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions…
We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment.
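A minimal sketch of the two reflection modes just described; propose, score, and update are hypothetical interfaces standing in for the agent's model calls, not the paper's implementation.

```python
import random

def reflect_in_action(state, propose, score, k=4):
    """Test-time scaling: sample k candidate actions, keep the best-scored."""
    candidates = [propose(state) for _ in range(k)]
    return max(candidates, key=lambda a: score(state, a))

def retrospective_reflection(trajectory, returns, update):
    """Hindsight pass: revisit each (state, action) with the realized
    long-horizon return and push a model update for credit assignment."""
    for (state, action), ret in zip(trajectory, returns):
        update(state, action, ret)

# Toy usage of the in-action mode:
propose = lambda s: random.choice(["left", "right", "wait"])
score = lambda s, a: {"left": 0.2, "right": 0.9, "wait": 0.1}[a]
print(reflect_in_action("s0", propose, score))
```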
Vision-based UAV frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance.
A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities.
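One plausible reading of that fusion step, sketched under assumptions: combine a pinhole estimate from the target's bounding-box height with a depth-map reading by inverse-variance weighting. All constants and the fusion rule are illustrative, not the paper's method.

```python
def distance_from_bbox(focal_px, true_height_m, bbox_height_px):
    return focal_px * true_height_m / bbox_height_px    # pinhole camera model

def fuse(estimates):
    """estimates: list of (distance_m, variance) pairs."""
    weights = [1.0 / var for _, var in estimates]
    return sum(w * d for (d, _), w in zip(estimates, weights)) / sum(weights)

d_bbox = distance_from_bbox(focal_px=800, true_height_m=1.7, bbox_height_px=120)
d_depth = 11.8                                          # e.g. from a depth map
print(fuse([(d_bbox, 1.5), (d_depth, 0.5)]))            # depth cue trusted more
```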
Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
The code, datasets, and evaluation protocols for SparkMe are available as open-source at https://github.com/SALT-NLP/SparkMe.
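A toy sketch of rollout-based question selection as described above: for each candidate question, simulate a few interviewee replies and ask the question with the highest average utility. simulate_reply and utility are hypothetical stubs, not SparkMe's components.

```python
import random

def select_question(candidates, simulate_reply, utility, n_rollouts=3):
    def expected_utility(q):
        replies = [simulate_reply(q) for _ in range(n_rollouts)]
        return sum(utility(q, r) for r in replies) / n_rollouts
    return max(candidates, key=expected_utility)

simulate_reply = lambda q: random.choice(["short answer", "rich story"])
utility = lambda q, r: len(r)            # toy proxy for information gained
print(select_question(["Tell me about X?", "Why X?"], simulate_reply, utility))
```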
These results enable the integration of DMHSA-based estimators into scheduling procedures, allowing the evaluation of multiple candidate user groups and the selection of those offering the highest average SINR and capacity.
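A sketch of that scheduling step, treating the estimator as a black box: score each candidate user group by its average per-user capacity log2(1 + SINR) and keep the best group. The estimate_sinr stub stands in for the DMHSA-based model.

```python
import math

def group_capacity(group, estimate_sinr):
    sinrs = [estimate_sinr(group, user) for user in group]
    return sum(math.log2(1 + s) for s in sinrs) / len(group)

def schedule(candidate_groups, estimate_sinr):
    return max(candidate_groups, key=lambda g: group_capacity(g, estimate_sinr))

estimate_sinr = lambda group, user: 10.0 / len(group)   # toy: interference grows
print(schedule([["u1", "u2"], ["u1", "u3", "u4"]], estimate_sinr))
```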
These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as…
For each policy episode, the system outputs a structured record containing the DAG, indicator mappings, and three evaluation measures: an expected-indicator coverage score, a discovery rate for overlooked but relevant indicators, and a…
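A plausible shape for that per-episode record, with field names assumed; the third evaluation measure is truncated in the snippet above and is deliberately left out here.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeRecord:
    dag_edges: list[tuple[str, str]]     # causal DAG as (parent, child) pairs
    indicator_map: dict[str, str]        # DAG node -> measured indicator
    expected_coverage: float             # expected-indicator coverage score
    discovery_rate: float                # overlooked-but-relevant discovery rate
    extras: dict = field(default_factory=dict)   # third measure elided in source

rec = EpisodeRecord([("policy", "outcome")], {"outcome": "unemployment_rate"},
                    expected_coverage=0.82, discovery_rate=0.40)
print(rec)
```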
Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware,…
We extensively evaluate our method with state-of-the-art baselines using two backbones across nine medical and natural-domain generalization image classification datasets, showing consistent gains across standard evaluation and challenging…
Drawing on theories from information behaviour and human-computer interaction, including ASK, exploratory search, information foraging, and cognitive load theory, we develop a conceptual framing of the IEP characterised by three…
MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional workflow by implementing a single method, build_network, while retaining…
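A hypothetical illustration of that single-method pattern; MIPCandy's actual base class and signatures may differ.

```python
class Pipeline:                          # stand-in for the framework's base class
    def build_network(self):
        raise NotImplementedError        # the single method users must supply

    def run(self, data):
        net = self.build_network()
        return [net(x) for x in data]    # load -> train/infer -> evaluate

class MySegmenter(Pipeline):
    def build_network(self):
        return lambda x: x * 2           # toy "network"

print(MySegmenter().run([1, 2, 3]))
```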
We present Airavat, the first agentic framework for Internet measurement workflow generation with systematic verification and validation.
Airavat coordinates a set of agents mirroring expert reasoning: three agents handle problem decomposition, solution design, and code implementation, with assistance from a registry of existing tools.
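A sketch of that three-stage handoff under assumptions; the function names and toy tool registry are illustrative, not Airavat's actual components.

```python
TOOL_REGISTRY = {"ping": "icmp probe", "traceroute": "path discovery"}

def run_pipeline(question, decompose, design, implement):
    subtasks = decompose(question)                   # agent 1: problem decomposition
    plan = design(subtasks, TOOL_REGISTRY)           # agent 2: tool-aware solution design
    return implement(plan)                           # agent 3: code implementation

decompose = lambda q: [f"measure: {q}"]
design = lambda ts, reg: [(t, "ping") for t in ts]
implement = lambda plan: "\n".join(f"run {tool}  # {t}" for t, tool in plan)
print(run_pipeline("latency to 8.8.8.8", decompose, design, implement))
```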
Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.
While summarizing history via interest centers offers a practical alternative, existing methods struggle to (1) identify user-specific centers at appropriate granularity and (2) accurately assign behaviors, leading to quantization errors…
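A toy 1-D illustration of the quantization-error failure mode: too few (or misplaced) interest centers force distant behaviors onto the same center, inflating the summarization error.

```python
def assign(behaviors, centers):
    out = {}
    for b in behaviors:
        c = min(centers, key=lambda c: abs(c - b))   # nearest-center assignment
        out.setdefault(c, []).append(b)
    return out

def quantization_error(assignment):
    return sum(abs(c - b) for c, bs in assignment.items() for b in bs)

behaviors = [0.1, 0.2, 0.9, 1.0, 5.0]                # 1-D stand-in for embeddings
print(quantization_error(assign(behaviors, [0.5])))              # coarse: large error
print(quantization_error(assign(behaviors, [0.15, 0.95, 5.0])))  # finer: small error
```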
The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engineering…
To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it.
…passages), (ii) pipeline configurations (different retrievers, re-rankers, and re-ranking depths), and (iii) query characteristics (the mismatch between agent-issued queries and the training queries of text rankers).
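For context, a minimal sketch of the iterative query-retrieve-reason loop such agents run; the search API is an opaque stub, as in the setting described, and the llm interface is a hypothetical stand-in.

```python
def research(question, search, llm, max_steps=4):
    evidence = []
    query = question
    for _ in range(max_steps):
        evidence += search(query)              # opaque web search API
        step = llm(question, evidence)         # returns next query or final answer
        if step["done"]:
            return step["answer"]
        query = step["query"]                  # reformulate and search again
    return llm(question, evidence)["answer"]

search = lambda q: [f"passage about {q}"]
llm = lambda q, ev: {"done": len(ev) >= 2, "query": q + " details", "answer": ev[-1]}
print(research("who built X?", search, llm))
```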
Conclusion: Spatial-domain adversarial perturbations in ultrasound segmentation showed partial mitigation with input preprocessing, whereas frequency-domain perturbations were not mitigated by the defenses, highlighting modality-specific…
To obtain sufficient training data, we train an LLM to simulate synthetic humans.
To address systematic model bias, we enforce a mathematically rigorous notion of equitability, ensuring that the covariance between self-presentation manner and skill evaluation error is small.
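A sketch of one way to operationalize that constraint: penalize the squared sample covariance between a presentation score and the skill-evaluation error, driving them toward decorrelation. The penalty form and the weight lam are assumptions, not the paper's formulation.

```python
def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def equitability_penalty(presentation, pred, true, lam=10.0):
    errors = [p - t for p, t in zip(pred, true)]     # skill evaluation errors
    return lam * covariance(presentation, errors) ** 2   # drive Cov(m, e) -> 0

# Toy usage: presentation scores vs. predicted and true skill levels.
print(equitability_penalty([0.1, 0.9, 0.5], [3.0, 4.5, 4.0], [3.2, 4.0, 4.1]))
```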