Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 39 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

Jessica Huynh, Alfredo Gomez, Athiya Deviyani, Renee Shelby, Jeffrey P. Bigham, Fernando Diaz · May 7, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Rubric Rating Llm As JudgeAutomatic Metrics General
  • Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation.
  • While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment.
Open paper
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation

Aimin Zhang, Jiajing Guo, Fuwei Jia, Chen Lv, Boyu Wang, Fangzheng Li · Apr 22, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Llm As JudgeAutomatic Metrics Multi Agent General
  • Experimental results based on real-world foreign trade scenarios demonstrate that, after integrating EvoAgent, GPT5.2 achieves significant improvements in professionalism, accuracy, and practical utility.
  • Under a five-dimensional LLM-as-Judge evaluation protocol, the overall average score increases by approximately 28%.
Open paper

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 65% Moderate protocol signal Freshness: Hot Status: Fallback
Llm As JudgeAutomatic Metrics General
  • We evaluate on HotpotQA-RAG v3, a controlled multi-hop benchmark, under an artifact-aware protocol (shortcut baselines, counterfactual swaps, no-oracle checks, GPT-4o audits).
  • Calibrated SURE-RAG reaches 0.9075 Macro-F1 (0.8951 +/- 0.0069), substantially above DeBERTa mean-pooling (0.6516) and a GPT-4o judge (0.7284), while matching a strong but opaque concat cross-encoder (0.8888 +/- 0.0109) with full…
Open paper
HyperMem: Hypergraph Memory for Long-Term Conversations

Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang, Tingwen Liu · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Llm As JudgeAutomatic Metrics General
  • Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
  • Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.
Open paper
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Llm As JudgeAutomatic Metrics MedicineMultilingual
  • A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
  • Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).
Open paper
RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile, Zach Reavis · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Llm As JudgeAutomatic Metrics Math
  • This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
  • We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection.
Open paper
Multi-Agent Dialectical Refinement for Enhanced Argument Classification

Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Llm As JudgeAutomatic Metrics Multi Agent General
  • We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
  • Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training.
Open paper
Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Expert Verification Llm As JudgeAutomatic Metrics Medicine
  • In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
  • PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge.
Open paper

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Llm As JudgeAutomatic Metrics Long Horizon General
  • Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish--dovish classification.
Open paper
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Llm As JudgeAutomatic Metrics Long Horizon CodingMultilingual
  • To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic,…
  • We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy.
Open paper
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As JudgeAutomatic Metrics General
  • We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
  • Transformer-based probes achieve the strongest discrimination, with M2 performing best on 5-fold average AUC/F1, and M3 performing best on both single-fold validation and held-out test evaluation.
Open paper
LLM-as-a-Judge for Time Series Explanations

Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As JudgeAutomatic Metrics General
  • Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
  • To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations.
Open paper

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Human EvalLlm As Judge Coding
  • Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments.
  • The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87).
Open paper
Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As JudgeAutomatic Metrics Coding
  • We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations…
  • Across n=55 scripts (n=275 essays), blind AI marking is harsher and more variable than human marking, with discriminative validity already poor (ρ\approx 0.1).
Open paper
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat · Mar 5, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As JudgeAutomatic Metrics General
  • Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators.
  • Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.
Open paper
PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 58% High protocol signal Freshness: Warm Status: Fallback
Rubric RatingExpert Verification Llm As JudgeAutomatic Metrics Medicine
  • Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
  • We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
Open paper
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As JudgeAutomatic Metrics General
  • We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
Open paper

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As JudgeAutomatic Metrics Medicine
  • Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves…
  • Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate.
Open paper
DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

Md Hasebul Hasan, Krity Haque Charu, Eshwara Prasad Sridhar, Shuchisnigdha Deb, Mohammad A. Islam · Mar 20, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Human EvalLlm As Judge LawCoding
  • To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from publicly available video repositories.
  • Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluation metrics.
Open paper
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation

Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang · Feb 27, 2026

Citations: 0

Match reason: Matches selected tags (Llm As Judge, Automatic Metrics).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As JudgeAutomatic Metrics Medicine
  • Due to the resource-intensive nature of large-scale human validation, the model's performance was evaluated through a dual-track framework: Track A utilized traditional lexical similarity metrics (e.g., BLEU, ROUGE), while Track B employed…
  • Consequently, we propose that while automated metrics and LLM judges serve as valuable developmental proxies, rigorous validation by human medical experts remains an indispensable requirement for the safe deployment of LLMs in healthcare…
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.