- TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026 · Citations: 0
Expert Verification Simulation Env Multi Agent
As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety the quality of interaction patterns that unfold across conversations rather than the correctness
- SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
- "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
- An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026 · Citations: 0
Expert Verification Automatic Metrics
Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontolo
- CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0
Expert Verification Automatic Metrics
The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
- Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
- Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Human Eval Multi Agent
Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0
Expert Verification Automatic Metrics
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distil
- APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0
Rubric RatingExpert Verification Simulation Env Long Horizon
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification Automatic Metrics
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025 · Citations: 0
Expert Verification Llm As JudgeSimulation Env Multi Agent
We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
- HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025 · Citations: 0
Expert Verification Automatic Metrics
However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences
- A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0
Rubric RatingExpert Verification Automatic Metrics
As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
- Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0
Expert Verification Automatic Metrics Tool Use
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.