Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 880 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,620) General (530) Long Horizon (320) Pairwise Preference (288) Coding (218) Simulation Env (187) Multi Agent (182) Medicine (116) Llm As Judge (107) Expert Verification (97) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (77) Demonstrations (67) Critique Edit (63)

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

Zehao Jin, Yanan Sui · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents

Harshee Jignesh Shah · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Our architecture introduces three components: (1) a Behavioral Access Control (BAC) system that restricts context layer access based on real-time sycophancy risk scores, (2) a Trait Classifier that identifies persuasion tactics across…
Cross-model evaluation on Gemini 2.5 Flash reveals a 46.0% baseline reduced to 14.2% (p < 10^-10, OR = 5.15).

Open paper

Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

Zaifu Zhan, Mengyuan Cui, Rui Zhang · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Critique Edit Automatic Metrics Medicine

Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track…

Open paper

Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries

Tanay Gondil · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries.
Critically, confidence scores provide actionable signal: restricting to high-confidence predictions yields 98.3 percent accuracy for well-calibrated models, enabling practical confidence-based routing for safety-critical deployments.

Open paper

When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Coding

This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review.

Open paper

LLM Probe: Evaluating LLMs for Low-Resource Languages

Hailay Kidu Teklehaymanot, Gebrearegawi Gebremariam, Wolfgang Nejdl · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Multilingual

Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized…
To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study.

Open paper

M-MiniGPT4: Multilingual VLLM Alignment via Translated Data

Seung Hun Han, Youssef Mohamed, Mohamed Elhoseiny · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics CodingMultilingual

M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed.

Open paper

MemRerank: Preference Memory for Personalized Product Reranking

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Automatic Metrics General

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking.

Open paper

PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Coding

We observe that this cost is largely wasteful -- across document and GUI benchmarks, only 22--71\% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image.
Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2\times inference speedup and 1.9\times training acceleration.

Open paper

AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages

Israel Abebe Azime, Jesujoba Oluwadara Alabi, Crystina Zhang, Iffat Maab, Atnafu Lambebo Tonja, Tadesse Destaw Belay · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Multilingual

Our evaluation results show that even the best embedding models lack cross-lingual retrieval capabilities, and that cultural and news documents are easier to retrieve than healthcare-domain documents, both in large corpora and in single…

Open paper

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

Thanh Luong Tuan · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning.
We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), finding that ontology-coupled agents significantly outperform…

Open paper

MF-QAT: Multi-Format Quantization-Aware Training for Elastic Inference

Zifei Xu, Sayeh Sharify, Hesham Mostafa · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Lixin Xiu, Xufang Luo, Hideki Nakayama · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics Coding

Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs.

Open paper

Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics

Alain Vázquez, Maria Inés Torres · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss.

Open paper

Authorship Impersonation via LLM Prompting does not Evade Authorship Verification Methods

Baoyi Zeng, Andrea Nini · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models

Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Long Horizon MathCoding

Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive…

Open paper

Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% High protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Multi Agent MathLaw

In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem.
Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure.

Open paper

Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency

Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Long Horizon MathCoding

Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9%…

Open paper

Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents

Davide Di Gioia · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use General

Autonomous tool-using agents in networked environments must decide which information source to query and when to stop querying and act.
Without principled bounds on information-acquisition costs, unconstrained agents exhibit systematic failure modes: excessive tool use under congestion, prolonged deliberation under time decay, and brittle behavior under ambiguous evidence.

Open paper

Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie Zhu, Yicheng Gao · Mar 31, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready

Automatic Metrics MedicineCoding

In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading.
Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent