Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 742 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Cost-of-Pass: An Economic Framework for Evaluating Language Models

Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou · Apr 17, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • We then define the frontier cost-of-pass: the minimum cost-of-pass achievable across available models or the human-expert(s), using the approx.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready
Simulation Env General
  • A Proximal Policy Optimization (PPO) agent is trained using a risk-sensitive reward that penalizes reserve shortfall, capital inefficiency, and breaches of a volatility-adjusted solvency floor, with tail risk explicitly controlled through…
  • To reflect regulatory stress-testing practice, the agent is trained under a regime-aware curriculum and evaluated using both regime-stratified simulations and fixed-shock stress scenarios.
Open paper
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents

Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski · Mar 24, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.
  • First, we develop benchmarks derived from key problems in economics -- procurement, scheduling, and pricing -- that test an LLM's ability to learn from the environment in context.
Open paper
ReCellTy: Domain-Specific Knowledge Graph Retrieval-Augmented LLMs Reasoning Workflow for Single-Cell Annotation

Dezheng Han, Yibin Jia, Ruxiao Chen, Wenjie Han, Shuaishuai Guo, Jianbo Wang · Apr 24, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 71% Sparse protocol signal Freshness: Cold Status: Ready
Human Eval General
  • Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across multiple tissue types, while more closely aligning with the cognitive logic of manual annotation.
Open paper
Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading

Cfir Avraham Hadar, Omer Shubi, Yoav Meiri, Amit Heshes, Yevgeni Berzak · May 4, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% Sparse protocol signal Freshness: Cold Status: Ready
General
  • To address this question, we introduce goal decoding tasks and evaluation frameworks using large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% Sparse protocol signal Freshness: Cold Status: Ready
General
  • As large language models (LLMs) and LLM-based agents increasingly interact with humans in decision-making contexts, understanding the trust dynamics between humans and AI agents becomes a central concern.
  • Using established behavioral theories, we develop an approach that studies whether LLMs trust depends on the three major trustworthiness dimensions: competence, benevolence and integrity of the human subject.
Open paper
Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages

Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Iqra Ameer, Grigori Sidorov · Mar 24, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% Sparse protocol signal Freshness: Cold Status: Ready
Multilingual
  • Developing and integrating emotion-understanding models are essential for a wide range of human-computer interaction tasks, including customer feedback analysis, marketing research, and social media monitoring.
  • Furthermore, we benchmark state-of-the-art encoder-only Pretrained Language Models (PLMs) and Large Language Models (LLMs) on this enriched dataset.
Open paper
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Julian Minder, Clément Dumas, Caden Juang, Bilal Chugtai, Neel Nanda · Apr 3, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 71% Sparse protocol signal Freshness: Cold Status: Fallback
Pairwise Preference General
  • Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as false information and personal question, along with multiple…
Open paper
arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation

Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi · Apr 14, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Coding
  • Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via…
  • To support reproducible evaluation, we introduce arXiv2Table, a benchmark of 1,957 tables referencing 7,158 papers, with human-verified distractors and rewritten, schema-agnostic user demands.
Open paper
Scalable LLM Reasoning Acceleration with Low-rank Distillation

Harry Dong, Bilge Acun, Beidi Chen, Yuejie Chi · May 8, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 26% Sparse protocol signal Freshness: Cold Status: Ready
Math
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Adaptive Social Learning via Mode Policy Optimization for Language Agents

Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu · May 4, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 26% Sparse protocol signal Freshness: Cold Status: Ready
Simulation Env General
  • To address this, we propose an Adaptive Social Learning (ASL) framework in this paper, aiming to improve the adaptive reasoning ability of language agents in dynamic social interactions.
  • Extensive experiments on the benchmark social intelligence environment verify that ASL achieves 15.6% higher task performance than GPT-4o.
Open paper
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

Sizhe Wang, Zhengren Wang, Dongsheng Ma, Yongan Yu, Rui Ling, Zhiyu Li · Apr 30, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 26% Sparse protocol signal Freshness: Cold Status: Ready
Coding
  • We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow - implementing new functionality by reusing existing…
  • Furthermore, a novel evaluation framework featured dual assessment protocol and structural metrics derived from dependency trees is introduced.
Open paper
TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng · Apr 22, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 26% Sparse protocol signal Freshness: Cold Status: Ready
Coding
  • Furthermore, we propose "connection thinking" to bridge the semantic gap between rigid formal logic and fluent human-like reasoning, ensuring coherent logical transitions.
  • Extensive experiments demonstrate that training models on our synthesized dataset, GeoTrust, substantially enhances deep geometric reasoning capabilities and yields significant performance gains across out-of-distribution (OOD) benchmarks,…
Open paper
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang · Apr 28, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
MedicineCoding
  • However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited.
  • To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages.
Open paper
GeneMamba: An Efficient and Effective Foundation Model on Single Cell Data

Cong Qi, Hanzhang Fang, Siqi Jiang, Xun Song, Tianxing Hu, Wei Zhi · Apr 22, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Benchmarking NLP-supported Language Sample Analysis for Swiss Children's Speech

Anja Ryser, Yingqiang Gao, Sarah Ebling · Apr 1, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
General
  • This preliminary study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently with active involvement of human specialists.
Open paper
Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning

Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng, Jinyi Niu · Mar 20, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% Sparse protocol signal Freshness: Cold Status: Ready
Coding
  • First, we construct Fin-R1-Data, a high-quality financial dataset consisting of 60,091 chain-of-thought (CoT) samples, distilled and filtered from multiple authoritative benchmarks to ensure consistency and reliability.
  • Despite its relatively small parameter scale, Fin-R1 achieves competitive empirical performance across established financial benchmarks and demonstrates practical utility in compliance checking and robo-advisory.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.