- MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li · Feb 25, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
- SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
- "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
- An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics
Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practic
- SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes · Feb 24, 2026 · Citations: 0
Rubric RatingRed Team Automatic Metrics
Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training.
- An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026 · Citations: 0
Expert Verification Automatic Metrics
Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontolo
- Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation
Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong · Feb 23, 2026 · Citations: 0
Expert Verification Automatic Metrics
Additionally, we present \textbf{HyperDocRED}, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.
- Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins
Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton · Feb 23, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery.
- CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0
Expert Verification Automatic Metrics
The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
- KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi · Feb 19, 2026 · Citations: 0
Rubric Rating Automatic Metrics Long Horizon
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks.
- Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
- What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi · Feb 19, 2026 · Citations: 0
Expert Verification Automatic Metrics
As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy.
- Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
- Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0
Pairwise PreferenceExpert Verification Automatic Metrics
While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
- Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0
Rubric Rating Human Eval
To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
- Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Human Eval Multi Agent
Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0
Expert VerificationCritique Edit Automatic Metrics
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- Small Reward Models via Backward Inference
Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi · Feb 14, 2026 · Citations: 0
Rubric Rating Automatic Metrics
However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility.
- The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh · Feb 10, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Human Eval
To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-c
- Document Reconstruction Unlocks Scalable Long-Context RLVR
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin · Feb 9, 2026 · Citations: 0
Rubric Rating Automatic Metrics
However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
- APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0
Rubric RatingExpert Verification Simulation Env Long Horizon
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
- Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0
Rubric RatingCritique Edit Automatic Metrics
Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
- HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Human EvalLlm As Judge
Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification Automatic Metrics
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Toward LLM-Supported Automated Assessment of Critical Thinking Subskills
Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg · Oct 14, 2025 · Citations: 0
Rubric Rating Automatic Metrics
As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is becom
- Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong · Sep 25, 2025 · Citations: 0
Rubric Rating Automatic Metrics
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs.
- EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025 · Citations: 0
Expert Verification Llm As JudgeSimulation Env Multi Agent
We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
- DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025 · Citations: 0
Expert Verification Llm As Judge
This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much predictio
- From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025 · Citations: 0
Expert Verification Automatic Metrics
Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education.
- HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025 · Citations: 0
Expert Verification Automatic Metrics
However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences
- A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0
Rubric RatingExpert Verification Automatic Metrics
As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
- MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025 · Citations: 0
Expert Verification Automatic Metrics
Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
- Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0
Expert Verification Automatic Metrics Tool Use
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
- Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman · Feb 22, 2025 · Citations: 0
Pairwise PreferenceExpert Verification Automatic Metrics
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions.