Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 11 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (2,019) General (628) Long Horizon (401) Pairwise Preference (339) Coding (265) Simulation Env (234) Multi Agent (216) Medicine (134) Llm As Judge (124) Expert Verification (112) Human Eval (103) Math (101) Rubric Rating (96) Web Browsing (93) Demonstrations (82) Tool Use (81)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

LLMSurgeon: Diagnosing Data Mixture of Large Language Models
May 28, 2026 · Citations: 0

To evaluate, we introduce LLMScan, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures.
SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations
May 28, 2026 · Citations: 0

We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation.
Unlocking the Working Memory of Large Language Models for Latent Reasoning
May 28, 2026 · Citations: 0

In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts.
Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
May 28, 2026 · Citations: 0

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent.
Demystifying Data Organization for Enhanced LLM Training
May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
COMPOSE: Composing Future Theorems from Citations and Formal Structure
May 28, 2026 · Citations: 0

To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024--2025.
Reasoning with Sampling: Cutting at Decision Points
May 28, 2026 · Citations: 0

Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
On Language Generation in the Limit with Bounded Memory
May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Resolution Diagnostics for Paired LLM Evaluation
May 28, 2026 · Citations: 0

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
May 28, 2026 · Citations: 0

We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems.
Self-Trained Verification for Training- and Test-Time Self-Improvement
May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
May 28, 2026 · Citations: 0

Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data,…

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks

Simen Bihaug-Frøyland, Henrik Brådland · May 8, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready

Demonstrations Automatic Metrics General

We evaluate GRaSp on financial named entity recognition (FiNER-139), comparing synthetic and human-annotated candidate pools across pool sizes of 500 and 5000.

Open paper

Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering

Elyas Irankhah, Samah Fodeh · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Expert Verification Automatic Metrics Medicine

Open paper

Learning Diagnostic Reasoning for Decision Support in Toxicology

Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Expert Verification Automatic Metrics Medicine

Open paper

M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency

Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Medicine

However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically.
Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions.

Open paper

Structural Stress and Learned Helplessness in Afghanistan: A Multi-Layer Analysis of the AFSTRESS Dari Corpus

Jawid Ahmad Baktash, Mursal Dawodi, Nadira Ahmadi · Mar 28, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

We introduce AFSTRESS, the first multi-label corpus of self-reported stress narratives in Dari (Eastern Persian), comprising 737 responses collected from Afghan individuals during an ongoing humanitarian crisis.

Open paper

Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs

Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges, Helena Kociolek · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MedicineMultilingual

Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce.

Open paper

MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation.
We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting.

Open paper

Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes

Abdullah Al Monsur, Nitesh Vamshi Bommisetty, Gene Louis Kim · Jan 17, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

FinTruthQA: A Benchmark for AI-Driven Financial Disclosure Quality Assessment in Investor -- Firm Interactions

Peilin Zhou, Ziyue Xu, Xinyu Shi, Jiageng Wu, Yikang Jiang, Dading Chong · Jun 17, 2024

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

To address this challenge, we introduce FinTruthQA, to our knowledge the first benchmark for AI-driven assessment of financial disclosure quality in investor-firm interactions.
FinTruthQA comprises 6,000 real-world financial Q&A entries, each manually annotated based on four key evaluation criteria: question identification, question relevance, answer readability, and answer relevance.

Open paper

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

Yao-Shun Chuang, Tushti Mody, Uday Pratap Singh, Shirindokht Shiraz, Chun-Teh Lee, Ryan Brandon · May 5, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Automatic Metrics Medicine

Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization.
Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks.

Open paper

Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Darvan Shvan Khairaldeen, Hossein Hassani · Feb 24, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8% .

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Join the #1 Platform for AI Training Talent