Featured Papers
Popular high-signal papers with direct links to full protocol pages.
- Implicit Representations of Grammaticality in Language Models
May 6, 2026 · Citations: 0
Grammaticality and likelihood are distinct notions in human language.
- MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge
May 6, 2026 · Citations: 0
Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination.
- The First Token Knows: Single-Decode Confidence for Hallucination Detection
May 6, 2026 · Citations: 0
Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency.
- PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation
May 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction
May 6, 2026 · Citations: 0
Benchmarks on three real-world datasets (Dreaddit, SDCNL, and DepSeverity) report accuracies of 0.835, 0.731, and 0.751, respectively, demonstrating the framework's potential for reliable mental health prediction.
- Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement
May 6, 2026 · Citations: 0
Concept Fields provide a fast, lightweight, and interpretable signal for groundedness and novelty, complementary to LLM-as-judge and white-box detectors.
- Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics
May 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models
May 6, 2026 · Citations: 0
We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models.
- The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
May 6, 2026 · Citations: 0
To test this hypothesis at the item level, we introduce the Pinocchio score (π_i), the ratio of inter-model response variance under neutral prompting to that under a human-simulation prompt, as an annotation-free measure of each item's…
- The Impossibility Triangle of Long-Context Modeling
May 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
May 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals
May 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.