Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 27 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Tracing Uncertainty in Language Model "Reasoning"

Nils Grünefeld, Bertram Højer, Philipp Mondorf, Barbara Plank, Anna Rogers, Christian Hardmeier · May 8, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Math
  • Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood.
Open paper
Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text

Tom Kempton, Viktor Drobnyi, Maeve Madigan, Stuart Burrell · May 7, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Medicine
  • The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance.
  • However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally…
Open paper
Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

Dominik Dahlem, Diego Maniloff, Mac Misiura · May 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • The resulting two-axis diagnostic (φ for capacity, G for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity.
  • Under length-controlled evaluation, transport features retain interpretable signal (LC-AUROC from 0.62 to 0.84) on tested models up to 8B parameters, with polarity reversing as predicted between HaluEval and MedHallu.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency.
Open paper
Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

Chaoran Chen, Dayu Yuan, Peter Kairouz · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Law
  • In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training.
  • The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training.
Open paper
RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile, Zach Reavis · Apr 2, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Llm As JudgeAutomatic Metrics Math
  • This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
  • We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection.
Open paper
An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms

Nils Grünefeld, Jes Frellsen, Christian Hardmeier · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • We then use the estimates to investigate when each uncertainty type carries useful signal for predicting answer correctness in question answering with large language models, revealing a benchmark-dependent divergence: the combined estimate…
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive…
  • We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics MathMedicine
  • We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC).
Open paper
A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment

Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar · Mar 23, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics Multi Agent Medicine
  • We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics Medicine
  • Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics.
  • All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models showing more extreme asymmetry (0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80).
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Automatic MetricsSimulation Env Long Horizon General
  • We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace),…
  • We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal, behavioral, and structural deception.
Open paper
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van de Schaar · Mar 3, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics Long Horizon Medicine
  • As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment.
  • We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, surpassing the best baseline by 12% in AUROC and 50% in Brier score reduction, which confirms the effectiveness in both…
Open paper
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad, Nick Rahimi · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC.
Open paper
FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu · Apr 24, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics MedicineCoding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Multilingual
  • The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual…
  • Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths.
Open paper
How Uncertainty Estimation Scales with Sampling in Reasoning Models

Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya, Meelis Kull · Mar 19, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Math
  • Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale.
  • These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.