Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 1 · Search mode: keyword

Filter by tag

All · Automatic Metrics (1,931) · General (603) · Long Horizon (380) · Pairwise Preference (326) · Coding (253) · Simulation Env (221) · Multi Agent (211) · Medicine (128) · Llm As Judge (120) · Expert Verification (107) · Human Eval (98) · Math (93) · Rubric Rating (93) · Web Browsing (89) · Demonstrations (79) · Tool Use (74)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech
May 17, 2026 · Citations: 0

There are not enough established benchmarks for the task of speech summarization.
Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents
May 17, 2026 · Citations: 0

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly…
Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations
May 17, 2026 · Citations: 0

We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court…
AI Agents May Always Fall for Prompt Injections
May 17, 2026 · Citations: 0

Prompt injection is the most critical vulnerability in deployed AI agents.
SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening
May 17, 2026 · Citations: 0

The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment.
Mixture of Experts for Low-Resource LLMs
May 17, 2026 · Citations: 0

Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.
How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
May 17, 2026 · Citations: 0

Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved…
Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models
May 17, 2026 · Citations: 0

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable…
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
May 17, 2026 · Citations: 0

Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for…
No Free Swap: Protocol-Dependent Layer Redundancy in Transformers
May 15, 2026 · Citations: 0

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests.
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
May 15, 2026 · Citations: 0

Large language model (LLM) agents require long-term memory to leverage information from past interactions.
STS: Efficient Sparse Attention with Speculative Token Sparsity
May 15, 2026 · Citations: 0

This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages instead of running a generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Distilling Feedback into Memory-as-a-Tool

Víctor Gallego · Jan 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% (Moderate protocol signal) · Freshness: Cold · Status: Ready

Rubric Rating · Critique Edit · Automatic Metrics · General

We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls.
