
HFEPX Hub

CS.LG + Medicine Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 23 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Vbvr-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 23 · Last published: Feb 26, 2026

Tags: cs.LG · Medicine

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 23 papers for CS.LG + Medicine. Dominant protocol signals include automatic metrics and simulation environments, with frequent benchmark focus on Vbvr-Bench and metric focus on accuracy and F1. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Vbvr-Bench appears in 4.3% of hub papers (1/23); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 17.4% of hub papers (4/23); compare with a secondary metric before ranking methods.
  • f1 is reported in 13.0% of hub papers (3/23); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (17.4% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (8.7% vs 30% target).
  • Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (4.3% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (39.1% vs 35% target).
  • Tighten coverage on Papers with known rater population. Coverage is usable but incomplete (30.4% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (17.4% vs 35% target).
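The checklist above reduces to comparing each dimension's observed coverage share against a target share. A minimal sketch of that check, using the counts and targets from this page; the 0.85 "usable but incomplete" threshold and the function names are assumptions for illustration, not the hub's actual logic:

```python
# Coverage-vs-target check mirroring the researcher checklist above.
# Counts and targets are taken from this hub page (23 papers total);
# the 0.85 near-target threshold is an illustrative assumption.
HUB_SIZE = 23

DIMENSIONS = {
    # dimension: (papers covered, target coverage share)
    "explicit human feedback": (4, 0.45),
    "quality controls": (2, 0.30),
    "benchmarks/datasets named": (1, 0.35),
    "evaluation metrics named": (9, 0.35),
    "known rater population": (7, 0.35),
    "known annotation unit": (4, 0.35),
}

def verdict(covered: int, target: float, total: int = HUB_SIZE) -> str:
    """Classify a coverage share against its target."""
    share = covered / total
    if share >= target:
        return "strong"
    if share >= 0.85 * target:  # within ~15% of target (assumed cutoff)
        return "usable but incomplete"
    return "replication risk"

for dim, (covered, target) in DIMENSIONS.items():
    share = covered / HUB_SIZE
    print(f"{dim}: {share:.1%} vs {target:.0%} target -> "
          f"{verdict(covered, target)}")
```

With these inputs the classifications reproduce the checklist: metrics coverage (39.1% vs 35%) comes out strong, rater population (30.4% vs 35%) lands just under target, and the rest flag as replication risks.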


Suggested Reading Order

  1. An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion

    High citation traction makes this a useful baseline for method and protocol context.

  3. Dynamic Personality Adaptation in Large Language Models via State Machines

    High citation traction makes this a useful baseline for method and protocol context.

  4. FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning

    High citation traction makes this a useful baseline for method and protocol context.

  5. MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

    Adds automatic metrics for broader coverage within this hub.

  6. Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages

    Adds automatic metrics for broader coverage within this hub.

  7. The Mean is the Mirage: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging

    Adds automatic metrics for broader coverage within this hub.

  8. MIP Candy: A Modular PyTorch Framework for Medical Image Processing

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 8.7% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Annotation unit is under-specified (17.4% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both = 0 · left_only = 21 (Automatic Metrics only) · right_only = 2 (Simulation Env only)

No papers in this hub use both Automatic Metrics and Simulation Env.
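The both/left_only/right_only breakdown above is a plain set comparison over the two evaluation-mode cohorts. A minimal sketch, assuming each paper has a stable ID; the placeholder IDs below are invented for illustration and only the cohort sizes (21 and 2) come from this page:

```python
# Overlap between two evaluation-mode cohorts, reported with the same
# both / left_only / right_only labels as the hub's comparison view.
# Paper IDs are placeholders; only the cohort sizes match this page.
automatic_metrics = {f"paper-{i:02d}" for i in range(1, 22)}  # 21 papers
simulation_env = {"paper-22", "paper-23"}                     # 2 papers

overlap = {
    "both": len(automatic_metrics & simulation_env),
    "left_only": len(automatic_metrics - simulation_env),
    "right_only": len(simulation_env - automatic_metrics),
}
print(overlap)  # {'both': 0, 'left_only': 21, 'right_only': 2}
```

A `both` count of zero means no paper in this hub can serve as a within-paper comparison of the two evaluation modes; any contrast between them is across different papers.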
