Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 2 · Search mode: keyword

Filter by tag

All · Automatic Metrics (1,911) · General (597) · Long Horizon (375) · Pairwise Preference (321) · Coding (249) · Simulation Env (218) · Multi Agent (209) · Medicine (126) · LLM As Judge (120) · Expert Verification (105) · Human Eval (97) · Math (93) · Rubric Rating (93) · Web Browsing (86) · Demonstrations (79) · Red Team (72)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
May 14, 2026 · Citations: 0

On several benchmarks, FEST outperforms baselines with orders of magnitude less SFT data, even matching their performance with the full dataset.

The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale
May 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Quantifying and Mitigating Premature Closure in Frontier LLMs
May 14, 2026 · Citations: 0

In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries.

Explainable Detection of Depression Status Shifts from User Digital Traces
May 14, 2026 · Citations: 0

To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions.

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
May 14, 2026 · Citations: 0

PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36× across multiple model families and benchmarks under a unified decoding protocol.

Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
May 14, 2026 · Citations: 0

To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning.

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
May 14, 2026 · Citations: 0

We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks.

Holistic Evaluation and Failure Diagnosis of AI Agents
May 14, 2026 · Citations: 0

We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments.

Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling
May 13, 2026 · Citations: 0

In our work, we propose Speculative Interaction Agents to enable real-time interaction even for agents with complex multi-turn tool calling.

Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models
May 13, 2026 · Citations: 0

Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise…

Leveraging Speech to Identify Signatures of Insight and Transfer in Problem Solving
May 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Learning, Fast and Slow: Towards LLMs That Adapt Continually
May 12, 2026 · Citations: 0

Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2).

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

No exact ID match for "2212.09748". Showing results for "Scalable Diffusion Models with Transformers" instead.

RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models

Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% · Sparse protocol signal · Freshness: Warm · Status: Ready

General

Transformer-based diffusion and vision-language models have achieved remarkable success; yet, efficiently removing undesirable or sensitive information without retraining remains a central challenge for model safety and compliance.
We evaluate RAZOR on CLIP, Stable Diffusion, and vision-language models (VLMs) using widely adopted unlearning benchmarks covering identity, style, and object erasure tasks.

Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation

Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan, Hao Li · Feb 4, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% · Moderate protocol signal · Freshness: Cold · Status: Ready

Pairwise Preference · Automatic Metrics · General

However, they usually fail to produce satisfactory outputs that are aligned to users' authentic demands and preferences.
In this work, we introduce Dual-Iterative Optimization (Dual-IPO), an iterative paradigm that sequentially optimizes both the reward model and the video generation model for improved synthesis quality and human preference alignment.
