HFEPX Hub

Llm As Judge Papers (Last 30 Days)

Updated from current HFEPX corpus (Apr 27, 2026). 19 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 27, 2026). 19 papers are grouped in this hub page. Common evaluation modes: Llm As Judge, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 28, 2026.

Papers: 19 Last published: Mar 28, 2026 Global RSS Tag RSS

Llm As JudgeLast 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (19) Replication-Ready Only (3)

High-Signal Coverage

100.0%

19 / 19 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

3 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

46.2% of papers report explicit human-feedback signals, led by pairwise preferences.
LLM-as-judge appears in 68.4% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

DROP appears in 7.7% of hub papers (1/19); use this cohort for benchmark-matched comparisons.
Healthbench appears in 7.7% of hub papers (1/19); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 53.8% of hub papers (7/19); compare with a secondary metric before ranking methods.
f1 is reported in 15.4% of hub papers (2/19); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (46.2% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (30.8% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (84.6% vs 35% target).
Moderate: Papers with known rater population

Coverage is usable but incomplete (23.1% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (30.8% vs 35% target).

Strengths

Strong human-feedback signal (46.2% of papers).

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (23.1% coverage).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (DROP vs Healthbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and f1.

Recommended Queries (Expanded)

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: DROP Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Groun…

Highest protocol score with explicit human/eval signal plus MMLU.

Strongest benchmark reference

Self-Preference Bias in Rubric-Based Evaluation of Large Language Mod…

IFEval gives a fast comparison anchor.

Strongest recent paper

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanes…

Useful for current practice scanning; published Apr 2, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Mar 28, 2026 · Citations: 0 · Score: 7.5

HF: Expert Verification · Eval: Llm As Judge, Automatic Metrics · Benchmark: MMLU · Metric: Accuracy
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Apr 8, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference, Rubric Rating · Eval: Llm As Judge · Benchmark: IFEval · Metric: Not Reported
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Apr 2, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Llm As Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
HyperMem: Hypergraph Memory for Long-Term Conversations
Apr 9, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Llm As Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Apr 2, 2026 · Citations: 0 · Score: 5.5

HF: Expert Verification · Eval: Llm As Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Auroc
LLM-as-a-Judge for Time Series Explanations
Apr 2, 2026 · Citations: 0 · Score: 5.5

HF: Not reported · Eval: Llm As Judge, Automatic Metrics · Benchmark: DROP · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering Mar 28, 2026	Yes Expert Verification	Llm As Judge , Automatic Metrics	MMLU	Accuracy , Relevance	Not Reported
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models Apr 8, 2026	Yes Pairwise Preference , Rubric Rating	Llm As Judge	IFEval , Healthbench	Not Reported	Not Reported
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study Apr 2, 2026	Yes Pairwise Preference	Llm As Judge , Automatic Metrics	Not Reported	Accuracy	Not Reported
HyperMem: Hypergraph Memory for Long-Term Conversations Apr 9, 2026	Yes Pairwise Preference	Llm As Judge , Automatic Metrics	Not Reported	Accuracy , Coherence	Not Reported
RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale Apr 2, 2026	Yes Expert Verification	Llm As Judge , Automatic Metrics	Not Reported	Auroc	Not Reported
LLM-as-a-Judge for Time Series Explanations Apr 2, 2026	No Not Reported	Llm As Judge , Automatic Metrics	DROP	Accuracy , Faithfulness	Not Reported
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations Apr 7, 2026	No Not Reported	Llm As Judge , Automatic Metrics	SQuAD	F1	Not Reported
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge Apr 7, 2026	Yes Pairwise Preference	Llm As Judge	Not Reported	Not Reported	Not Reported
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation Apr 22, 2026	No Not Reported	Llm As Judge , Automatic Metrics	Not Reported	Accuracy	Not Reported
Multi-Agent Dialectical Refinement for Enhanced Argument Classification Mar 29, 2026	No Not Reported	Llm As Judge , Automatic Metrics	Not Reported	F1 , F1 macro	Not Reported
Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems Apr 24, 2026	No Not Reported	Llm As Judge	Not Reported	Precision , Recall	Not Reported
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images Apr 8, 2026	No Not Reported	Llm As Judge , Automatic Metrics	Not Reported	Accuracy , Exact match	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	PubMed Reasoner: Dynamic Reasoning-based Retrieval…	Self-Preference Bias in Rubric-Based Evaluation of…	Blinded Radiologist and LLM-Based Evaluation of LLM…
Human Feedback	Expert Verification	Pairwise Preference, Rubric Rating	Pairwise Preference
Evaluation Modes	Llm As Judge, Automatic Metrics	Llm As Judge	Llm As Judge, Automatic Metrics
Benchmarks	MMLU	IFEval, Healthbench	Not reported
Metrics	Accuracy, Relevance	Not reported	Accuracy
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Domain Experts
Annotation Unit	Unknown	Multi Dim Rubric	Pairwise

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (4)
Expert Verification (2)
Rubric Rating (1)

Evaluation Modes

Llm As Judge (13)
Automatic Metrics (10)

Top Benchmarks

DROP (1)
Healthbench (1)
IFEval (1)
MMLU (1)

Top Metrics

Accuracy (7)
F1 (2)
Agreement (1)
Auroc (1)

Rater Population Mix

Domain Experts (3)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 31.6% · benchmarks 26.3% · metrics 63.2% · quality controls 0.0%.

Top Papers

PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0

Expert Verification Llm As JudgeAutomatic Metrics

In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0

Pairwise Preference Llm As JudgeAutomatic Metrics

A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Llm As Judge

We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0

Pairwise Preference Llm As JudgeAutomatic Metrics

Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile · Apr 2, 2026 · Citations: 0

Expert Verification Llm As JudgeAutomatic Metrics

This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics

Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali · Apr 7, 2026 · Citations: 0

Pairwise Preference Llm As Judge

Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge).
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
Aimin Zhang, Jiajing Guo, Fuwei Jia, Chen Lv, Boyu Wang · Apr 22, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics Multi Agent

Experimental results based on real-world foreign trade scenarios demonstrate that, after integrating EvoAgent, GPT5.2 achieves significant improvements in professionalism, accuracy, and practical utility.
Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics Multi Agent

We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
Meghana Karnam, Ananya Joshi · Apr 24, 2026 · Citations: 0

Llm As Judge Long Horizon

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression.
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics

We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou · Apr 8, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics

We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation
Andrew Bouras, OMS-II Research Fellow · Mar 30, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics

Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves…
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach · Apr 24, 2026 · Citations: 0

Llm As Judge

We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats.
Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM
Samay U. Shetty, Tharindu Cyril Weerasooriya, Deepak Pandita, Christopher M. Homan · Apr 9, 2026 · Citations: 0

Llm As Judge

When humans label subjective content, they disagree, and that disagreement is not noise.
To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs
Zohaib Khan, Mustafa Dogan, Ifeoma Okoh, Pouya Sadeghi, Siddhartha Shrestha · Apr 8, 2026 · Citations: 0

Llm As Judge

Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being…
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long · Apr 7, 2026 · Citations: 0

Llm As Judge

As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge.
De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar · Apr 2, 2026 · Citations: 0

Llm As Judge

We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data.
The Necessity of Setting Temperature in LLM-as-a-Judge
Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia · Mar 30, 2026 · Citations: 0

Llm As Judge

LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now