HFEPX Hub

CS.AI + Expert Verification Papers

Updated from current HFEPX corpus (Feb 27, 2026). 15 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Gold Questions. Frequently cited benchmark: BIRD. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 15 · Last published: Feb 26, 2026

CS.AI · Expert Verification

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 15 papers for CS.AI + Expert Verification Papers. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with benchmark focus on BIRD and CricBench and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by expert verification.

Evidence: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

Automatic metrics appear in 73.3% of papers in this hub.

Evidence: SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems; An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

BIRD is the benchmark anchor named for cross-paper comparisons on this page, although it appears in only 1 of 15 papers (6.7%).

Evidence: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

Protocol Takeaways

The most common quality-control signal is gold-question checks (13.3% of papers).

Evidence: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics; Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

Raters are mostly domain experts, and annotation most commonly uses multi-dimensional rubrics; use this to scope replication staffing.

Evidence: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift. No paper in this hub currently reports both (both=0).

Evidence: Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation; TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
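
One common way to quantify judge-human agreement is Cohen's kappa over paired labels for the same items. The sketch below is a minimal stdlib version; the function name `cohens_kappa` and the pass/fail labels are hypothetical illustrations, not data or code from any paper in this hub.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[c] * judge_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical paired labels for six items (illustration only).
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.3f}")
```

Tracking this value per benchmark or per time window, rather than once overall, is one way to make "drift" measurable.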

Benchmark Interpretation

BIRD appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.
CricBench appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 20% of hub papers (3/15); compare with a secondary metric before ranking methods.
cost is reported in 20% of hub papers (3/15); compare with a secondary metric before ranking methods.
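
The coverage percentages quoted on this page are consistent with a simple share of the 15 hub papers, rounded to one decimal place. A minimal sketch of that arithmetic, assuming this rounding rule (the helper name `coverage_pct` is hypothetical):

```python
def coverage_pct(count, total=15):
    """Share of hub papers as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Counts taken from this page: BIRD (1), accuracy (3), precision (2).
for label, count in [("BIRD", 1), ("accuracy", 3), ("precision", 2)]:
    print(f"{label}: {coverage_pct(count)}% ({count}/15)")
```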

Researcher Checklist

Maintain strength on Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Tighten coverage on Papers reporting quality controls. Coverage is usable but incomplete (20% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (33.3% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (66.7% vs 35% target).
Maintain strength on Papers with known rater population. Coverage is strong (100% vs 35% target).
Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (33.3% vs 35% target).
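
The checklist labels above are consistent with a single threshold rule: coverage at or above target reads as "strong", anything below as "usable but incomplete". The sketch below assumes that rule; the page does not state the actual rule, and it may have further tiers. The names `CHECKLIST` and `status` are hypothetical.

```python
# (checklist item, coverage %, target %) as listed on this page.
CHECKLIST = [
    ("Papers with explicit human feedback", 100.0, 45.0),
    ("Papers reporting quality controls", 20.0, 30.0),
    ("Papers naming benchmarks/datasets", 33.3, 35.0),
    ("Papers naming evaluation metrics", 66.7, 35.0),
    ("Papers with known rater population", 100.0, 35.0),
    ("Papers with known annotation unit", 33.3, 35.0),
]


def status(coverage, target):
    """Assumed rule: meeting the target counts as strong coverage."""
    return "strong" if coverage >= target else "usable but incomplete"


statuses = [status(cov, tgt) for _, cov, tgt in CHECKLIST]
for (item, cov, tgt), label in zip(CHECKLIST, statuses):
    print(f"{item}: {label} ({cov}% vs {tgt}% target)")
```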

Known Limitations

LLM-as-judge appears without enough inter-annotator agreement reporting.
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: BIRD - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=0, left_only=1, right_only=1

0 papers use both Human Eval and LLM-as-Judge.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=11

0 papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=11

0 papers use both LLM-as-Judge and Automatic Metrics.
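
The three pairwise counts above can be reproduced from the evaluation-mode tags on the paper cards under Top Papers. In the sketch below, "left" is the first mode named in a pair and "right" is the second. The shortened paper keys, the `PAPER_MODES` mapping, and the `overlap` helper are hypothetical names introduced for illustration; the mode assignments are read from the tag lines on this page.

```python
# Evaluation-mode tags per paper, read from the tag lines under Top Papers.
PAPER_MODES = {
    "TherapyProbe": {"simulation_env"},
    "SurGo-R1": {"automatic_metrics"},
    "SparkMe": {"automatic_metrics"},
    "Are You Sure?": {"automatic_metrics"},
    "Rare disease phenotyping": {"automatic_metrics"},
    "CUICurate": {"automatic_metrics"},
    "Team of Thoughts": {"automatic_metrics"},
    "Multi-Agent Comedy Club": {"human_eval"},
    "Learning beyond Teacher": {"automatic_metrics"},
    "APEX-Agents": {"simulation_env"},
    "CricBench": {"automatic_metrics"},
    "EpidemIQs": {"llm_as_judge", "simulation_env"},
    "HSSBench": {"automatic_metrics"},
    "Scalable Framework": {"automatic_metrics"},
    "Long Software Tasks": {"automatic_metrics"},
}


def overlap(left, right, paper_modes=PAPER_MODES):
    """Count papers tagged with both modes, only the left, or only the right."""
    counts = {"both": 0, "left_only": 0, "right_only": 0}
    for modes in paper_modes.values():
        has_left, has_right = left in modes, right in modes
        if has_left and has_right:
            counts["both"] += 1
        elif has_left:
            counts["left_only"] += 1
        elif has_right:
            counts["right_only"] += 1
    return counts


for pair in [("human_eval", "llm_as_judge"),
             ("human_eval", "automatic_metrics"),
             ("llm_as_judge", "automatic_metrics")]:
    print(pair, overlap(*pair))
```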

Benchmark Brief

BIRD

Coverage: 1 paper (6.7%)

Examples: CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Benchmark Brief

CricBench

Coverage: 1 paper (6.7%)

Examples: CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Benchmark Brief

DROP

Coverage: 1 paper (6.7%)

Examples: CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Metric Brief

accuracy

Coverage: 3 papers (20%)

Examples: SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics; A Scalable Framework for Evaluating Health Language Models

Metric Brief

cost

Coverage: 3 papers (20%)

Examples: SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis; A Scalable Framework for Evaluating Health Language Models

Metric Brief

precision

Coverage: 2 papers (13.3%)

Examples: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Top Papers

TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026 · Citations: 0

Expert Verification Simulation Env Multi Agent

As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, the quality of interaction patterns that unfold across conversations rather than the correctness
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics

Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontolo
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0

Expert Verification Automatic Metrics

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise Preference Rubric Rating Human Eval Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0

Expert Verification Automatic Metrics

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distil
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Rubric Rating Expert Verification Simulation Env Long Horizon

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0

Expert Verification Automatic Metrics

To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025 · Citations: 0

Expert Verification Llm As Judge Simulation Env Multi Agent

We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025 · Citations: 0

Expert Verification Automatic Metrics

However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences
A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0

Rubric Rating Expert Verification Automatic Metrics

As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0

Expert Verification Automatic Metrics Tool Use

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
