Bingxuan Li, Simo Du, Yue Guo · Apr 8, 2026 · Citations: 0
Automatic MetricsMedicine
We propose SEA, a self-learning diagnostic agent with cognitively inspired dual-memory module.
On standard evaluation with MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory.
José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026 · Citations: 0
Pairwise PreferenceRubric RatingLlm As JudgeMedicine
We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50\% more likely to incorrectly…
Chang Liu, Changsheng Ma, Yongfeng Tao, Bin Hu, Minqiang Yang · Apr 8, 2026 · Citations: 0
Simulation EnvMedicine
However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy.
We introduce CCD-CBT, a multi-agent framework that shifts CBT simulation along two axes: 1) from a static to a dynamically reconstructed Cognitive Conceptualization Diagram (CCD), updated by a dedicated Control Agent, and 2) from omniscient…
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria.
The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions.
To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose…
Pairwise PreferenceLlm As JudgeAutomatic MetricsMedicineMultilingual
A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).
Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang · Apr 2, 2026 · Citations: 0
Expert VerificationAutomatic MetricsMedicine
Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic…
In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC)…
Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain…
Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track…
Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan · Mar 31, 2026 · Citations: 0
Human EvalAutomatic MetricsMedicine
Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores.
Andrew Bouras, OMS-II Research Fellow · Mar 30, 2026 · Citations: 0
Llm As JudgeAutomatic MetricsMedicine
Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves…
Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate.
Hua Li, Yingying Li, Xiaobin Feng, Xinyi Fu, Lifeng Dong, Qingfeng Yang · Mar 30, 2026 · Citations: 0
Pairwise PreferenceMedicine
While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable…
tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies.
In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases.
Expert VerificationLlm As JudgeAutomatic MetricsMedicine
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge.
Get Started
Join the #1 Platform for AI Training Talent
Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.