Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 169

Filter by tag

All · Automatic Metrics (1,634) · General (532) · Long Horizon (320) · Pairwise Preference (289) · Coding (221) · Simulation Env (190) · Multi Agent (184) · Medicine (117) · Llm As Judge (109) · Expert Verification (98) · Human Eval (89) · Rubric Rating (82) · Web Browsing (79) · Math (78) · Demonstrations (67) · Critique Edit (63)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.

Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.

METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.

Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…

Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that contains 30,664 poems in total: 10,276 human-written poems and 20,388 poems generated by four popular LLMs.

Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Apr 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Apr 11, 2026 · Citations: 0

To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression-ranking objective to provide fine-grained evaluation of reasoning paths.

MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Apr 10, 2026 · Citations: 0

On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

WISE: Web Information Satire and Fakeness Evaluation

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as…
Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%).

Open paper

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi · Jan 15, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Prompting Underestimates LLM Capability for Time Series Classification

Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios · Jan 6, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure.
Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.

Open paper

CAST: Character-and-Scene Episodic Memory for Agents

Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Fallback

Llm As Judge Automatic Metrics General

Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where.
Experiments demonstrate that CAST improves F1 by 8.11% and J (LLM-as-a-Judge) by 10.21% on average over baselines on various datasets, especially on open and time-sensitive conversational questions.

Open paper

Reason2Decide: Rationale-Driven Multi-Task Learning

H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel · Dec 23, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Fallback

Llm As Judge Automatic Metrics Medicine

Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge).
This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations.

Open paper

Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search

Dong Liu, Yanxuan Yu · Nov 12, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Coding

We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks.

Open paper

Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection

Xuwei Tan, Yao Ma, Xueru Zhang · Dec 15, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions.
Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements.

Open paper

LabelFusion: Fusing Large Language Models with Transformer Encoders for Robust Financial News Classification

Michael Schlee, Christoph Weisser, Timo Kivimäki, Melchizedek Mashiku, Benjamin Saefken · Dec 11, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media

Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken · Nov 25, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Medicine

We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers.
Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool.

Open paper

SVBRD-LLM: Self-Verifying Behavioral Rule Discovery for Autonomous Vehicle Identification

Xiangyu Li, Tianyi Wang, Junfeng Jiao, Christian Claudel, Zhaomiao Guo · Nov 18, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

As autonomous vehicles (AVs) are increasingly deployed on public roads, understanding their real-world behaviors is critical for traffic safety analysis and regulatory oversight.
It then employs GPT-5 zero-shot prompting to perform comparative behavioral analysis between AVs and human-driven vehicles (HDVs) across lane-changing and normal driving behaviors, generating 26 structured rule hypotheses that comprise…

Open paper

Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement

Zijin Su, Huanzhu Lyu, Yuren Niu, Yiming Liu · Nov 18, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

Fernanda Bufon Färber, Iago Alves Brito, Julia Soares Dollis, Pedro Schindler Freire Brasil Ribeiro, Rafael Teixeira Sousa, Arlindo Rodrigues Galvão Filho · Nov 14, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Medicine Multilingual

To validate MedPT's utility, we benchmark it in a medical specialty classification task: fine-tuning a 1.7B parameter model achieves an outstanding 94% F1-score on a 20-class setup.

Open paper

IDALC: A Semi-Supervised Framework for Intent Detection and Active Learning based Correction

Ankan Mullick, Sukannya Purkayastha, Saransh Sharma, Pawan Goyal, Niloy Ganguly · Nov 8, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

In this paper, we introduce IDALC (Intent Detection and Active Learning based Correction), a semi-supervised framework designed to detect user intents and rectify system-rejected utterances while minimizing the need for human annotation.
Empirical findings on various benchmark datasets demonstrate that our system surpasses baseline methods, achieving 5-10% higher accuracy and a 4-8% improvement in macro-F1.

Open paper

PETra: A Multilingual Corpus of Pragmatic Explicitation in Translation

Doreen Osmelak, Koel Dutta Chowdhury, Uliana Sentsova, Cristina España-Bonet, Josef van Genabith · Nov 4, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Multilingual

We identify candidate explicitation cases through null alignments and refine them using active learning with human annotation.

Open paper

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum · Oct 23, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics Medicine

Drawing on insights from human cognition, we develop methods to evaluate and enhance agentic information-seeking.
For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling).

Open paper

CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning

Masato Kikuchi, Masatsugu Ono, Toshioki Soga, Tetsu Tanabe, Tadachika Ozono · Oct 21, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% Moderate protocol signal Freshness: Cold Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization

Inha Kang, Eunki Kim, Wonjeong Ryu, Jaeyo Shin, Seungjun Yu, Yoon-Hee Kang · Nov 27, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% Moderate protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Long Horizon Coding

Open paper

LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot · Oct 28, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% Moderate protocol signal Freshness: Cold Status: Fallback

Llm As Judge Automatic Metrics General

Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs.
On NLP downstream tasks, 9 of 14 models improve in macro-averaged F1, though gains on the two benchmarks do not systematically correlate.

Open paper

GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian · Oct 23, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% Moderate protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Long Horizon General

Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving…

Open paper

RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% High protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Long Horizon General

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes…
Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency.

Open paper
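
Each card in the listing above carries a one-line metadata summary such as "Score: 38% Moderate protocol signal Freshness: Warm Status: Ready". For readers who want to triage the cards programmatically, the following is a minimal Python sketch that parses such a line into fields. The names `CardMeta` and `parse_card_meta` are hypothetical, and the field layout is inferred only from the lines shown on this page; it is not a documented format of the explorer.

```python
import re
from typing import NamedTuple, Optional


class CardMeta(NamedTuple):
    """Hypothetical container for one card's metadata line."""
    score: int       # percentage shown after "Score:", e.g. 38
    signal: str      # label before "protocol signal", e.g. "Moderate"
    freshness: str   # e.g. "Warm" or "Cold"
    status: str      # e.g. "Ready" or "Fallback"


# Pattern inferred from the lines on this page (an assumption, not a spec).
_META_RE = re.compile(
    r"Score:\s*(?P<score>\d+)%\s+"
    r"(?P<signal>\w+) protocol signal\s+"
    r"Freshness:\s*(?P<freshness>\w+)\s+"
    r"Status:\s*(?P<status>\w+)"
)


def parse_card_meta(line: str) -> Optional[CardMeta]:
    """Return the parsed fields, or None if the line is not a metadata line."""
    m = _META_RE.search(line)
    if m is None:
        return None
    return CardMeta(int(m["score"]), m["signal"], m["freshness"], m["status"])


if __name__ == "__main__":
    line = "Score: 38% Moderate protocol signal Freshness: Warm Status: Ready"
    print(parse_card_meta(line))
```

Parsed records could then be sorted by `score` or filtered on `status` to separate "Ready" cards from "Fallback" ones; lines that do not match the pattern return `None` and are skipped.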
