Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 742 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

What Matters For Safety Alignment?

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan · Jan 7, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics Tool Use General
  • This paper presents a comprehensive empirical study on the safety alignment capabilities.
  • We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems.
Open paper
CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Galactic Archaeology and Scientific Discovery

Lorenzo Monti, Tatiana Muraveva, Brian Sheridan, Davide Massari, Alessia Garofalo, Gisella Clementini · Jan 14, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset

Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler, Djamé Seddah · Jan 12, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready
Multilingual
  • Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and…
Open paper
PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark

Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan, Yudong Zhou · Jan 13, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Medicine
  • To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios.
  • Extensive experiments on 10 state-of-the-art embedding-based retrieval models reveal that: (1) retrieval performance on PosIR with documents exceeding 1536 tokens correlates poorly with the MMTEB benchmark, exposing limitations of current…
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 38% High protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics General
  • Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness.
  • We present LREAD, a Korean-specific instantiation of a rubric-based expert-calibration framework for human attribution of LLM-generated text.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Llm As Judge Coding
  • However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable…
  • To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation.
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Multilingual
  • To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation.
  • But, for low-resource languages, human translation to generate sufficient data is prohibitively expensive.
Open paper
Task Arithmetic with Support Languages for Low-Resource ASR

Emma Rafkin, Dan DeGenaro, Xiulin Yang · Jan 11, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
The Invisible Hand of AI Libraries Shaping Open Source Projects and Communities

Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi · Jan 5, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Vision-language models lag human performance on physical dynamics and intent reasoning

Tianjun Gu, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan · Jan 4, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • To evaluate TSI, we present EscherVerse, a large-scale open-world resource built from 11,328 real-world videos, including an 8,000-example benchmark and a 35,963-example instruction-tuning set.
  • Across 27 state-of-the-art vision-language models and an independent analysis of first-pass human responses from 11 annotators, we identify a persistent teleo-spatial reasoning gap: the strongest proprietary model achieves 57.26\% overall…
Open paper
EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi · Jan 10, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Coding
  • Existing evaluations often overlook execution accuracy and safety.
  • We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains.
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Freire's theories provide a radically different lens for exploring IA's sociotechnical concerns relative to the current dominating frames of fairness, accountability, confidentiality, transparency, and safety.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors

Laurits Lyngbaek, Pascale Feldkamp, Yuri Bizzoni, Kristoffer L. Nielbo, Kenneth Enevoldsen · Jan 12, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
CodingMultilingual
  • Use cases of sentiment analysis in the humanities often require contextualized, continuous scores.
  • Concept Vector Projections (CVP) offer a recent solution: by modeling sentiment as a direction in embedding space, they produce continuous, multilingual scores that align closely with human judgments.
Open paper
A Mind Cannot Be Smeared Across Time

Michael Timothy Bennett · Jan 11, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
Law
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Beyond the Black Box: A Survey on the Theory and Mechanism of Large Language Models

Zeyu Gan, Ruifeng Ren, Wei Yao, Xiaolin Hu, Gengze Xu, Chen Qian · Jan 6, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
Math
  • To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and…
  • Moving beyond current best practices, we identify critical frontier challenges, including the theoretical limits of synthetic data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent…
Open paper
Collusive Pricing Under LLM

Shengyu Cao, Ming Hu · Jan 3, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference General
  • Above it, the system is bistable, with competitive and collusive pricing both locally stable and the realized outcome determined by the model's initial preference.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.