HFEPX Hub

Expert Verification Papers (Last 45 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 21 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Adjudication. Frequently cited benchmark: Ad-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 15, 2026.

Papers: 21 · Last published: Feb 15, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

21 / 21 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

2 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
4 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.
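
The replication-ready rule above (benchmark + metric + explicit evaluation mode) can be sketched as a simple filter. This is a minimal illustration, not the hub's actual implementation: the `PaperRecord` class and its field names are hypothetical, and the four sample rows are abbreviated from the Protocol Matrix on this page.

```python
from dataclasses import dataclass, field


@dataclass
class PaperRecord:
    """Hypothetical abstract-level record; field names are illustrative only."""
    title: str
    benchmarks: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)
    eval_modes: list[str] = field(default_factory=list)


def is_replication_ready(paper: PaperRecord) -> bool:
    # All three protocol ingredients must be explicitly reported.
    return bool(paper.benchmarks) and bool(paper.metrics) and bool(paper.eval_modes)


# Rows abbreviated from the Protocol Matrix; "Not Reported" becomes an empty list.
papers = [
    PaperRecord("HLE-Verified", ["HLE"], ["Accuracy"], ["Automatic Metrics"]),
    PaperRecord("AD-Bench", ["Ad Bench"], ["Pass@1", "Pass@3"], ["Simulation Env"]),
    PaperRecord("Team of Thoughts", ["LiveCodeBench"], [], []),
    PaperRecord("SparkMe", [], ["Cost"], ["Automatic Metrics"]),
]

ready = [p.title for p in papers if is_replication_ready(p)]
```

On these four rows only the two papers that report all three ingredients pass the filter, which matches the replication-ready count stated above.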

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 61.9% of papers in this hub.
Ad-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is adjudication (9.5% of papers).
Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.
Stratify by benchmark (Ad-Bench vs HLE) before comparing methods.

Benchmark Interpretation

Ad-Bench appears in 4.8% of hub papers (1/21); use this cohort for benchmark-matched comparisons.
HLE appears in 4.8% of hub papers (1/21); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 19% of hub papers (4/21); compare with a secondary metric before ranking methods.
Precision is reported in 14.3% of hub papers (3/21); compare with a secondary metric before ranking methods.
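
A small worked example of why a secondary metric matters before ranking methods: two methods can swap places depending on whether accuracy or precision is used. The confusion counts below are invented purely for illustration and do not come from any paper in this hub.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    # Fraction of all items labeled correctly.
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp: int, fp: int) -> float:
    # Fraction of positive predictions that are correct; 0.0 if none were made.
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical confusion counts for two methods on the same 100 items.
method_a = {"tp": 40, "fp": 10, "tn": 45, "fn": 5}
method_b = {"tp": 30, "fp": 2, "tn": 53, "fn": 15}

acc_a, acc_b = accuracy(**method_a), accuracy(**method_b)
prec_a = precision(method_a["tp"], method_a["fp"])
prec_b = precision(method_b["tp"], method_b["fp"])

# Method A leads on accuracy while method B leads on precision.
ranking_flips = (acc_a > acc_b) and (prec_b > prec_a)
```

Here method A has the higher accuracy (0.85 vs 0.83) but method B has the higher precision (0.9375 vs 0.80), so a single-metric ranking would hide the trade-off.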

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (100% vs 45% target).

Moderate: Papers reporting quality controls

Coverage is usable but incomplete (19% vs 30% target).

Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (14.3% vs 35% target).

Strong: Papers naming evaluation metrics

Coverage is strong (61.9% vs 35% target).

Strong: Papers with known rater population

Coverage is strong (100% vs 35% target).

Moderate: Papers with known annotation unit

Coverage is usable but incomplete (28.6% vs 35% target).
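
The Strong / Moderate / Gap labels in the checklist compare coverage against a target. The page does not state the exact banding rule, so the sketch below is an assumption: "Strong" when coverage meets the target, "Gap" when it falls below half the target, and "Moderate" otherwise. The `gap_ratio` cutoff of 0.5 is hypothetical and was chosen only because it reproduces the six labels shown above.

```python
def coverage_band(coverage_pct: float, target_pct: float, gap_ratio: float = 0.5) -> str:
    # gap_ratio is an assumed cutoff, not a documented HFEPX parameter.
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= gap_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs copied from the checklist.
checklist = {
    "explicit human feedback": (100.0, 45.0),
    "quality controls": (19.0, 30.0),
    "benchmarks/datasets": (14.3, 35.0),
    "evaluation metrics": (61.9, 35.0),
    "known rater population": (100.0, 35.0),
    "known annotation unit": (28.6, 35.0),
}

bands = {name: coverage_band(cov, tgt) for name, (cov, tgt) in checklist.items()}
```

Other cutoffs could reproduce the same labels, so treat the threshold as a placeholder when applying the rule to other hubs.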

Strengths

Strong human-feedback signal (100% of papers).
Agentic evaluation appears in 33.3% of papers.

Known Gaps

Only 19% of papers report quality controls; prioritize calibration/adjudication evidence.
Benchmark coverage is thin (14.3% of papers mention benchmarks/datasets).

Suggested Next Analyses

Stratify by benchmark (Ad-Bench vs HLE) before comparing methods.
Track metric sensitivity by reporting both accuracy and precision.
Add inter-annotator agreement checks when reproducing these protocols.
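
For the inter-annotator agreement check suggested above, one common choice for two raters is Cohen's kappa, which corrects observed agreement for agreement expected by chance. The sketch below is a plain-Python illustration; the rater labels are invented, and a reproduction might use a different agreement statistic depending on the number of raters and the annotation unit.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical expert verdicts on eight items.
rater_1 = ["ok", "ok", "bad", "ok", "bad", "bad", "ok", "ok"]
rater_2 = ["ok", "bad", "bad", "ok", "bad", "ok", "ok", "ok"]

kappa = cohens_kappa(rater_1, rater_2)
```

The two raters agree on 6 of 8 items (0.75 raw agreement), but kappa is noticeably lower because both raters favor the same majority label, which inflates chance agreement.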

Recommended Queries

Benchmark Slice: Ad-Bench
Metric Slice: accuracy
Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Feb 15, 2026 · Citations: 0 · Score: 9.5

HF: Expert Verification, Critique Edit · Eval: Automatic Metrics · Benchmark: HLE · Metric: Accuracy
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Feb 15, 2026 · Citations: 0 · Score: 8.0

HF: Expert Verification · Eval: Simulation Env · Benchmark: Ad Bench · Metric: Pass@1
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Feb 18, 2026 · Citations: 0 · Score: 8.0

HF: Expert Verification · Eval: Not reported · Benchmark: LiveCodeBench · Metric: Not Reported
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Feb 23, 2026 · Citations: 0 · Score: 7.5

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: F1
Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots
Feb 26, 2026 · Citations: 0 · Score: 7.5

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Agreement
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Feb 24, 2026 · Citations: 0 · Score: 6.0

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam · Feb 15, 2026	Yes: Expert Verification, Critique Edit	Automatic Metrics	HLE	Accuracy	Adjudication
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents · Feb 15, 2026	Yes: Expert Verification	Simulation Env	Ad Bench	Pass@1, Pass@3	Not Reported
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling · Feb 18, 2026	Yes: Expert Verification	Not Reported	LiveCodeBench	Not Reported	Calibration
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models · Feb 23, 2026	Yes: Expert Verification	Automatic Metrics	Not Reported	F1, Precision	Gold Questions
Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots · Feb 26, 2026	Yes: Expert Verification	Automatic Metrics	Not Reported	Agreement	Adjudication
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery · Feb 24, 2026	Yes: Expert Verification	Automatic Metrics	Not Reported	Cost	Not Reported
Multi-Objective Alignment of Language Models for Personalized Psychotherapy · Feb 17, 2026	Yes: Pairwise Preference, Expert Verification	Automatic Metrics	Not Reported	Agreement, Cost	Not Reported
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications · Feb 20, 2026	Yes: Expert Verification	Automatic Metrics	Not Reported	Precision, Recall	Not Reported
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models · Feb 25, 2026	Yes: Expert Verification	Automatic Metrics	Not Reported	Accuracy	Not Reported
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video · Feb 25, 2026	Yes: Expert Verification	Automatic Metrics	Not Reported	Accuracy	Not Reported
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems · Feb 24, 2026	Yes: Expert Verification	Automatic Metrics	Not Reported	Precision	Not Reported
What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform · Feb 19, 2026	Yes: Expert Verification	Automatic Metrics	Not Reported	Accuracy	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	HLE-Verified: A Systematic Verification and Structu…	AD-Bench: A Real-World, Trajectory-Aware Advertisin…	Team of Thoughts: Efficient Test-time Scaling of Ag…
Human Feedback	Expert Verification, Critique Edit	Expert Verification	Expert Verification
Evaluation Modes	Automatic Metrics	Simulation Env	Not reported
Benchmarks	HLE	Ad Bench	LiveCodeBench
Metrics	Accuracy	Pass@1, Pass@3	Not reported
Quality Controls	Adjudication	Not reported	Calibration
Rater Population	Domain Experts	Domain Experts	Domain Experts
Annotation Unit	Unknown	Trajectory	Unknown

Research Utility Snapshot

Human Feedback Mix

Expert Verification (21)
Pairwise Preference (2)
Rubric Rating (2)
Critique Edit (1)

Evaluation Modes

Automatic Metrics (13)
Simulation Env (2)

Top Benchmarks

Ad Bench (1)
HLE (1)
LiveCodeBench (1)

Top Metrics

Accuracy (4)
Precision (3)
Agreement (2)
Cost (2)

Rater Population Mix

Domain Experts (21)

Quality Controls

Adjudication (2)
Calibration (1)
Gold Questions (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 14.3% · metrics 61.9% · quality controls 19.0%.
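
The coverage percentages above follow from simple counts over the 21 sampled papers. The sketch below reproduces them. The quality-control count (4) and the benchmark count (3, one paper per listed benchmark) come from this page; the metrics count of 13 is an assumption back-computed from the reported 61.9%.

```python
def coverage_pct(count: int, total: int) -> float:
    # Share of sampled papers carrying the signal, rounded to one decimal place.
    return round(100 * count / total, 1)


TOTAL_PAPERS = 21

# Paper counts per signal; the "metrics" count is inferred, not stated on the page.
signal_counts = {
    "human-feedback": 21,
    "benchmarks": 3,
    "metrics": 13,
    "quality controls": 4,
}

coverage = {name: coverage_pct(n, TOTAL_PAPERS) for name, n in signal_counts.items()}
```

With so few papers per signal, a single added or removed paper shifts any of these percentages by almost five points, which is worth keeping in mind when comparing hubs.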

Top Papers

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0

Expert Verification Simulation Env Long Horizon

While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0

Expert Verification Critique Edit Automatic Metrics

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0

Expert Verification Multi Agent

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype…
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Rubric Rating Expert Verification Automatic Metrics Long Horizon

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026 · Citations: 0

Expert Verification Simulation Env Multi Agent

As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, the quality of interaction patterns that unfold across conversations rather than the correctness…
Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots
Dimitrios P. Panagoulias, Evangelia-Aikaterini Tsichrintzi, Georgios Savvidis, Evridiki Tsoureli-Nikita · Feb 26, 2026 · Citations: 0

Expert Verification Automatic Metrics

Human-in-the-loop validation is essential in safety-critical clinical AI, yet the transition between initial model inference and expert correction is rarely analyzed as a structured signal.
Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0

Pairwise Preference Expert Verification Automatic Metrics

While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise Preference Rubric Rating Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0

Expert Verification Automatic Metrics

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics

Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in…
What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi · Feb 19, 2026 · Citations: 0

Expert Verification Automatic Metrics

As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy.
LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts
Yang Liu, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li · Feb 15, 2026 · Citations: 0

Expert Verification Automatic Metrics

By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art…
OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum
Yangyang Zhang, Zilong Wang, Jianbo Xu, Yongqi Chen, Chu Han · Feb 14, 2026 · Citations: 0

Expert Verification Multi Agent

Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework where domain-specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT-style…
"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics

Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation
Shentong Mo, Xufang Luo, Dongsheng Li · Feb 26, 2026 · Citations: 0

Expert Verification

In this work, we propose a novel Mixture-of-Experts prompt tuning method called pMoE, which leverages the strengths of multiple expert domains through expert-specialized prompt tokens and the learnable dispatcher, effectively combining…
Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation
Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong · Feb 23, 2026 · Citations: 0

Expert Verification

Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0

Expert Verification

Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term…
