Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 590 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Document Reconstruction Unlocks Scalable Long-Context RLVR

Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim · Feb 9, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics Coding
  • However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
  • In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision.
Open paper
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?

Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Simulation Env General
  • A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
  • We interrogate this assumption using SP-ABCBench, a new benchmark of 30 tests derived from validated S&P human-subject studies, which measures alignment between simulations and human-subjects studies on a 0-100 ascending scale, where higher…
Open paper
On Randomness in Agentic Evals

Bjarni Haukur Bjarnason, André Silva, Martin Monperrus · Feb 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks.
  • To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to…
Open paper
Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory

Sanyam Singh, Naga Ganesh, Vineet Singh, Lakshmi Pedapudi, Ritesh Kumar, SSP Jyothi · Feb 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall,…
  • Our evaluation framework, DG-EVAL, performs atomic fact verification (measuring recall, precision, and contradiction detection) against expert-curated ground truth rather than Wikipedia or retrieved documents.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Human value detection from single sentences is a sparse, imbalanced multi-label task.
  • Under this benchmark, the HO structure is more useful as an inductive bias than as a rigid routing rule.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Rubric Rating Simulation Env Tool Use Math
  • Small LLMs often struggle to match the agentic capabilities of large, costly models.
  • While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for…
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications.
  • After trialling the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks, we show that LLMs can serve as automatic annotators and…
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • We argue reliable human-AI partnership requires a shift from answer generation to collaborative premise governance over a knowledge substrate, negotiating only what is decision-critical.
  • We illustrate with tutoring and propose falsifiable evaluation criteria.
Open paper
AXE: Low-Cost Cross-Domain Web Structured Information Extraction

Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban · Feb 2, 2026

Citations: 0

Match reason: Title directly matches "cost".

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Human Eval Multilingual
  • Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided.
  • Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation.
Open paper
$V_0$: A Generalist Value Model for Any Policy at State Zero

Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu, Hui Su · Feb 3, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Simulation Env Long Horizon General
  • We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
  • Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents' planning capabilities through the grounded action projection.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Fallback
Human EvalLlm As Judge General
  • Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods.
Open paper
OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong, Haoran Luo · Feb 3, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Tool Use General
  • To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.
  • Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries.
Open paper
POP: Prefill-Only Pruning for Efficient Large Model Inference

Junhui He, Zhihui Fu, Jun Wang, Qingan Li · Feb 3, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Self-Improving World Modelling with Latent Actions

Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen · Feb 5, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.