HFEPX Hub

CS.AI + Medicine Papers

Updated from current HFEPX corpus (Feb 27, 2026). 47 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 47 Last published: Feb 26, 2026 Global RSS Tag RSS

Cs.AIMedicine

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 47 papers for CS.AI + Medicine Papers. Dominant protocol signals include automatic metrics, simulation environments, human evaluation, with frequent benchmark focus on Retrieval, AIME and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

17% of papers report explicit human-feedback signals, led by expert verification.

Evidence: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
automatic metrics appears in 87.2% of papers in this hub.

Evidence: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought , TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion , Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought , Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

Protocol Takeaways

Most common quality-control signal is rater calibration (4.3% of papers).

Evidence: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Rater context is mostly domain experts, and annotation is commonly Freeform; use this to scope replication staffing.

Evidence: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video , Virtual Biopsy for Intracranial Tumors Diagnosis on MRI , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion

Benchmark Interpretation

Retrieval appears in 14.9% of hub papers (7/47); use this cohort for benchmark-matched comparisons.
AIME appears in 2.1% of hub papers (1/47); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 34% of hub papers (16/47); compare with a secondary metric before ranking methods.
cost is reported in 12.8% of hub papers (6/47); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (17% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (6.4% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (21.3% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (55.3% vs 35% target).
Tighten coverage on Papers with known rater population. Coverage is usable but incomplete (25.5% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (8.5% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (17% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (6.4% vs 30% target).

Papers naming benchmarks/datasets

Coverage is usable but incomplete (21.3% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (55.3% vs 35% target).

Papers with known rater population

Coverage is usable but incomplete (25.5% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (8.5% vs 35% target).

Known Limitations

Only 6.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Annotation unit is under-specified (8.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=0, left_only=2, right_only=41

0 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=1, left_only=40, right_only=4

1 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=5, right_only=2

0 papers use both Simulation Env and Human Eval.

Benchmark Brief

Retrieval

Coverage: 7 papers (14.9%)

7 papers (14.9%) mention Retrieval.

Examples: TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought , Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance , CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Benchmark Brief

AIME

Coverage: 1 papers (2.1%)

1 papers (2.1%) mention AIME.

Examples: Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Benchmark Brief

Correctbench

Coverage: 1 papers (2.1%)

1 papers (2.1%) mention Correctbench.

Examples: Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Metric Brief

accuracy

Coverage: 16 papers (34%)

16 papers (34%) mention accuracy.

Examples: Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance , Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Metric Brief

cost

Coverage: 6 papers (12.8%)

6 papers (12.8%) mention cost.

Examples: Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages , To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering , Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

Metric Brief

Coverage: 4 papers (8.5%)

4 papers (8.5%) mention f1.

Examples: Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages , PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data , An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? , TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Top Papers

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
Jianmin Li, Ying Chang, Su-Kit Tang, Yujia Liu, Yanwen Wang · Feb 26, 2026 · Citations: 0

Automatic Metrics

Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods.
TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026 · Citations: 0

Expert Verification Simulation Env Multi Agent

As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety the quality of interaction patterns that unfold across conversations rather than the correctness
TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen · Feb 26, 2026 · Citations: 0

Automatic Metrics

Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0

Automatic Metrics

Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026 · Citations: 0

Automatic Metrics

We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical d
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists.
Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai · Feb 25, 2026 · Citations: 0

Automatic Metrics

To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories.
Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound
Nicholas Dietrich, David McShannon · Feb 25, 2026 · Citations: 0

Automatic Metrics

Conclusion: Spatial-domain adversarial perturbations in ultrasound segmentation showed partial mitigation with input preprocessing, whereas frequency-domain perturbations were not mitigated by the defenses, highlighting modality-specific ch
FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning
Alina Devkota, Jacob Thrasher, Donald Adjeroh, Binod Bhattarai, Prashnna K. Gyawali · Feb 24, 2026 · Citations: 0

Automatic Metrics

Extensive experiments on both natural and medical image benchmarking datasets, across diverse model architectures, demonstrate that FedVG consistently improves performance, particularly in highly heterogeneous settings.
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco · Feb 24, 2026 · Citations: 0

Automatic Metrics

We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code.
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei · Feb 24, 2026 · Citations: 0

Automatic Metrics

Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP).
The Mean is the Mirage: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging
Sameer Ambekar, Reza Nasirigerdeh, Peter J. Schuffler, Lina Felsner, Daniel M. Lang · Feb 24, 2026 · Citations: 0

Automatic Metrics

We extensively evaluate our method with state-of-the-art baselines using two backbones across nine medical and natural-domain generalization image classification datasets, showing consistent gains across standard evaluation and challenging
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser · Feb 24, 2026 · Citations: 0

Automatic Metrics

Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints.
PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju · Feb 24, 2026 · Citations: 0

Automatic Metrics

Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH).
CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning
Ziwei Niu, Hao Sun, Shujun Bian, Xihong Yang, Lanfen Lin · Feb 24, 2026 · Citations: 0

Automatic Metrics

Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases.
MIP Candy: A Modular PyTorch Framework for Medical Image Processing
Tianhao Fu, Yucheng Chen · Feb 24, 2026 · Citations: 0

Automatic Metrics

MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, $\texttt{build_network}$, while r
OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu · Feb 24, 2026 · Citations: 0

Automatic Metrics

Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities.
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontolo
A Very Big Video Reasoning Suite
Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer · Feb 23, 2026 · Citations: 0

Simulation Env

We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities.
To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen · Feb 23, 2026 · Citations: 0

Automatic Metrics

Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks-HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA.
AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization
Fahmida Liza Piya, Rahmatollah Beheshti · Feb 23, 2026 · Citations: 0

Human Eval

We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content.
Exploring Anti-Aging Literature via ConvexTopics and Large Language Models
Lana E. Yeganova, Won G. Kim, Shubo Tian, Natalie Xie, Donald C. Comeau · Feb 23, 2026 · Citations: 0

Automatic Metrics

Common clustering and topic modeling approaches such as K-means or LDA remain sensitive to initialization and prone to local optima, limiting reproducibility and evaluation.
Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026 · Citations: 0

Red Team Simulation Env

Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.
MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Wall Kim, Chaeyoung Song, Hanul Kim · Feb 23, 2026 · Citations: 0

Automatic Metrics

Recently, TabPFN has gained attention as a foundation model for tabular data.
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026 · Citations: 0

Automatic Metrics Long Horizon

The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents
Ruicheng Ao, David Simchi-Levi, Xinshang Wang · Feb 23, 2026 · Citations: 0

Automatic Metrics

Whether AI agents can perform this task remains untested.
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0

Pairwise Preference Human Eval

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0

Expert Verification Automatic Metrics

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering
Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman · Feb 20, 2026 · Citations: 0

Automatic Metrics

Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context.
ReIn: Conversational Error Recovery with Reasoning Inception
Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma · Feb 19, 2026 · Citations: 0

Automatic Metrics

Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors.
Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin · Feb 17, 2026 · Citations: 0

Automatic Metrics

Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%).
Cold-Start Personalization via Training-Free Priors from Structured World Models
Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du · Feb 16, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech
Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang gu · Feb 16, 2026 · Citations: 0

Automatic Metrics

Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression.
Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Ming Li, Xirui Li, Tianyi Zhou · Feb 15, 2026 · Citations: 0

Simulation Env Multi Agent

As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems?
Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
Ha Na Cho, Sairam Sutari, Alexander Lopez, Hansen Bow, Kai Zheng · Jan 24, 2026 · Citations: 0

Automatic Metrics

Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety.
KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh · Dec 9, 2025 · Citations: 0

Automatic Metrics

Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and managemen
Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization
Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Siliang Zeng · Nov 25, 2025 · Citations: 0

Automatic Metrics Long Horizon

Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks.
From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025 · Citations: 0

Automatic Metrics Multi Agent

To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Sohyeon Jeon, Hyung-Chul Lee · Oct 22, 2025 · Citations: 0

Automatic Metrics

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen · Aug 3, 2025 · Citations: 0

Automatic Metrics Tool Use

Unfortunately, there is still a large gap between real-world MCP usage and current evaluation: they typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale retrieva
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou · Apr 1, 2025 · Citations: 0

Automatic Metrics

Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32
A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0

Rubric RatingExpert Verification Automatic Metrics

As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes
Zhanliang Wang, Da Wu, Quan Nguyen, Kai Wang · Mar 15, 2025 · Citations: 0

Automatic Metrics

These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes.
Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes
Saman Khamesian, Asiful Arefeen, Maria Adela Grando, Bithika M. Thompson, Hassan Ghasemzadeh · Feb 20, 2025 · Citations: 0

Automatic Metrics

Managing Type 1 Diabetes (T1D) demands constant vigilance as individuals strive to regulate their blood glucose levels and avoid dysglycemia, including hyperglycemia and hypoglycemia.
Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations
Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen · Jan 29, 2025 · Citations: 0

Automatic MetricsSimulation Env

These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information.

CS.AI + Medicine Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs