- A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025
Rubric RatingExpert Verification
As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
- Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral
Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda · Mar 25, 2025
Understanding and certifying the generalization performance of machine learning algorithms -- i.e.
- EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski · Mar 24, 2025
We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.
- MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025
Expert Verification
Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
- Imitating AI agents increase diversity in homogeneous information environments but can reduce it in heterogeneous ones
Emil Bakkensen Johansen, Oliver Baumann · Mar 20, 2025
Recent developments in large language models (LLMs) have facilitated autonomous AI agents capable of imitating human-generated content, raising fundamental questions about how AI may reshape democratic information environments such as news.
- EmoGRACE: Aspect-based emotion analysis for social media data
Christina Zorenböhmer, Sebastian Schmidt, Bernd Resch · Mar 19, 2025
While sentiment analysis has advanced from sentence to aspect-level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) is faced with dataset bottlenecks and the i
- Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025
Expert Verification Tool Use
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
- A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang · Mar 16, 2025
Long Horizon
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.
- Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes
Zhanliang Wang, Da Wu, Quan Nguyen, Kai Wang · Mar 15, 2025
These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes.
- InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang · Mar 9, 2025
Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-11% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
- VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization
Mohammad Mahdi Samiei Paqaleh, Mehdi Jamalkhah, Mahdieh Soleymani Baghshah · Mar 6, 2025
Emergent Language (EL) focuses on the emergence of communication among artificial agents.
- Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng · Mar 6, 2025
Prevailing LLM serving engines employ expert parallelism (EP) to implement multi-device inference of massive MoE models.
- HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen · Mar 3, 2025
A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on.