Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 1 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (2,019) General (628) Long Horizon (401) Pairwise Preference (339) Coding (265) Simulation Env (234) Multi Agent (216) Medicine (134) Llm As Judge (124) Expert Verification (112) Human Eval (103) Math (101) Rubric Rating (96) Web Browsing (93) Demonstrations (82) Tool Use (81)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

LLMSurgeon: Diagnosing Data Mixture of Large Language Models
May 28, 2026 · Citations: 0

To evaluate, we introduce LLMScan, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures.
SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations
May 28, 2026 · Citations: 0

We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation.
Unlocking the Working Memory of Large Language Models for Latent Reasoning
May 28, 2026 · Citations: 0

In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts.
Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
May 28, 2026 · Citations: 0

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent.
Demystifying Data Organization for Enhanced LLM Training
May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
COMPOSE: Composing Future Theorems from Citations and Formal Structure
May 28, 2026 · Citations: 0

To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024--2025.
Reasoning with Sampling: Cutting at Decision Points
May 28, 2026 · Citations: 0

Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
On Language Generation in the Limit with Bounded Memory
May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Resolution Diagnostics for Paired LLM Evaluation
May 28, 2026 · Citations: 0

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
May 28, 2026 · Citations: 0

We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems.
Self-Trained Verification for Training- and Test-Time Self-Improvement
May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
May 28, 2026 · Citations: 0

Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data,…

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning

Yuxuan Gu, Weimin Bai, Yifei Wang, Weijian Luo, He Sun · Nov 19, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 26% Sparse protocol signal Freshness: Cold Status: Fallback

Pairwise Preference General

Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models.
In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignments.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now