
HFEPX Hub

Expert Verification + Medicine Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub groups 11 papers. Common evaluation modes: automatic metrics, LLM-as-judge. Most common rater population: domain experts. Common annotation unit: ranking. Frequent quality control: gold questions. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 11 · Last published: Feb 26, 2026

Tags: Expert Verification, Medicine

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 11 papers on Expert Verification + Medicine. Dominant protocol signals include automatic metrics, LLM-as-judge, and simulation environments, with frequent benchmark focus on Retrieval and metric focus on accuracy and agreement. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 18.2% of hub papers (2/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 54.5% of hub papers (6/11); compare with a secondary metric before ranking methods.
  • agreement is reported in 18.2% of hub papers (2/11); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Maintain strength on Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (9.1% vs 30% target).
  • Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (18.2% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (90.9% vs 35% target).
  • Maintain strength on Papers with known rater population. Coverage is strong (100% vs 35% target).
  • Maintain strength on Papers with known annotation unit. Coverage narrowly meets the target (36.4% vs 35% target).
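The checklist above is a straightforward coverage-vs-target comparison. A minimal sketch of that check, using the counts and targets reported on this page (the attribute names are illustrative, not the HFEPX schema):

```python
# Coverage-vs-target check for the researcher checklist.
# Counts and targets mirror this hub page; field names are illustrative.
TOTAL_PAPERS = 11

coverage_checks = {
    # attribute: (papers reporting it, target fraction)
    "human_feedback": (11, 0.45),
    "quality_controls": (1, 0.30),
    "benchmarks_named": (2, 0.35),
    "metrics_named": (10, 0.35),
    "rater_population": (11, 0.35),
    "annotation_unit": (4, 0.35),
}

def assess(count: int, target: float, total: int = TOTAL_PAPERS) -> str:
    """Return 'strong' when coverage meets the target, else 'replication risk'."""
    return "strong" if count / total >= target else "replication risk"

for name, (count, target) in coverage_checks.items():
    pct = 100 * count / TOTAL_PAPERS
    print(f"{name}: {pct:.1f}% vs {target:.0%} target -> {assess(count, target)}")
```

Note that a cohort can clear a threshold by a hair (36.4% vs 35% for annotation unit) and still be labeled "strong", so the raw percentages are worth reading alongside the labels.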


Suggested Reading Order

  1. An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

    High citation traction makes this a useful baseline for method and protocol context.

  3. MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

    High citation traction makes this a useful baseline for method and protocol context.

  4. SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

    High citation traction makes this a useful baseline for method and protocol context.

  5. DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries

    Include an LLM-as-judge paper to assess judge design and agreement assumptions.

  6. CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

    Adds automatic metrics with expert verification for broader coverage within this hub.

  7. What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform

    Adds automatic metrics with expert verification for broader coverage within this hub.

  8. Multi-Objective Alignment of Language Models for Personalized Psychotherapy

    Adds automatic metrics with pairwise preferences for broader coverage within this hub.

Known Limitations

  • Only 9.1% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Benchmark coverage is thin (18.2% of papers mention benchmarks/datasets).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=9

No papers use both LLM-as-judge and automatic metrics.

automatic_metrics vs simulation_env

both=0, left_only=9, right_only=1

No papers use both automatic metrics and a simulation environment.

llm_as_judge vs simulation_env

both=0, left_only=1, right_only=1

No papers use both LLM-as-judge and a simulation environment.
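The both/left_only/right_only counts above are plain set-overlap statistics over the papers tagged with each protocol. A minimal sketch, with placeholder paper IDs sized to match the reported "llm_as_judge vs automatic_metrics" numbers:

```python
# Pairwise protocol-overlap counts (both / left_only / right_only).
# Paper IDs are placeholders; only the set sizes mirror this hub's numbers.
def overlap(left: set, right: set) -> dict:
    """Count papers using both protocols, only the left one, or only the right one."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Illustrative sets for llm_as_judge vs automatic_metrics: both=0, left_only=1, right_only=9.
llm_as_judge = {"p5"}
automatic_metrics = {f"p{i}" for i in range(6, 15)}

print(overlap(llm_as_judge, automatic_metrics))
# -> {'both': 0, 'left_only': 1, 'right_only': 9}
```

Since both=0 for every pair here, the protocol cohorts are disjoint, which limits within-hub comparisons of judge scores against automatic metrics on the same papers.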
