- A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025
Rubric Rating Expert Verification
As LLM-driven health applications are increasingly adopted, rigorous and efficient open-ended evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, and safety.
- Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral
Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda · Mar 25, 2025
Understanding and certifying the generalization performance of machine learning algorithms…
- EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski · Mar 24, 2025
We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.
- MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025
Expert Verification
Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
- Imitating AI agents increase diversity in homogeneous information environments but can reduce it in heterogeneous ones
Emil Bakkensen Johansen, Oliver Baumann · Mar 20, 2025
Recent developments in large language models (LLMs) have facilitated autonomous AI agents capable of imitating human-generated content, raising fundamental questions about how AI may reshape democratic information environments such as news.
- EmoGRACE: Aspect-based emotion analysis for social media data
Christina Zorenböhmer, Sebastian Schmidt, Bernd Resch · Mar 19, 2025
While sentiment analysis has advanced from sentence to aspect-level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) is faced with dataset bottlenecks…
- Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025
Expert Verification Tool Use
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
- A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang · Mar 16, 2025
Long Horizon
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.
- Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes
Zhanliang Wang, Da Wu, Quan Nguyen, Kai Wang · Mar 15, 2025
These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes.
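The HPO-based prompting described above can be sketched as follows. The term list, prompt wording, and function name are illustrative assumptions, not the pipeline used in the paper:

```python
# Sketch: formatting Human Phenotype Ontology (HPO) terms into a
# candidate-gene prediction prompt for a foundation model. Illustrative only.

def build_hpo_prompt(hpo_terms):
    """Format (HPO ID, name) pairs into a candidate-gene prompt."""
    term_block = "\n".join(f"- {tid}: {name}" for tid, name in hpo_terms)
    return (
        "A patient presents with the following phenotypes "
        "(Human Phenotype Ontology terms):\n"
        f"{term_block}\n"
        "List the most likely candidate genes, ranked."
    )

prompt = build_hpo_prompt([
    ("HP:0001250", "Seizure"),
    ("HP:0001263", "Global developmental delay"),
])
print(prompt)
```

The returned string would then be sent to a model such as GPT or LLaMA; the model call itself is omitted here.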
- InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang · Mar 9, 2025
Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-11% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
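The general idea of breaking context-length limits, reasoning in bounded segments and compressing each into a summary before continuing, can be sketched with a stub model. `fake_model` stands in for an LLM call; nothing here is the paper's exact algorithm:

```python
# Toy sketch of iterative reasoning with periodic summarization: reason in
# bounded segments, compress each into a short summary, continue from it.

def fake_model(context, budget):
    # Stub "LLM": emits a fixed-size reasoning segment.
    return f"[segment after {len(context)} chars of context]"[:budget]

def iterative_reason(question, rounds=3, budget=60):
    summary = question
    for _ in range(rounds):
        segment = fake_model(summary, budget)
        # Compress: carry only a bounded summary into the next round,
        # so the context never grows past 2 * budget characters.
        summary = (question + " | " + segment)[:2 * budget]
    return summary

out = iterative_reason("What is 2+2?")
```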
- VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization
Mohammad Mahdi Samiei Paqaleh, Mehdi Jamalkhah, Mahdieh Soleymani Baghshah · Mar 6, 2025
Emergent Language (EL) focuses on the emergence of communication among artificial agents.
- Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng · Mar 6, 2025
Prevailing LLM serving engines employ expert parallelism (EP) to implement multi-device inference of massive MoE models.
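Expert parallelism can be illustrated with a minimal sketch: experts are partitioned across devices, so a token's routed experts determine which devices must participate. The expert counts and contiguous sharding layout are illustrative assumptions:

```python
# Minimal sketch of top-k MoE routing plus expert-parallel (EP) sharding.
import math

def top_k_experts(logits, k=2):
    """Indices of the k highest-scoring experts for one token."""
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]

def expert_to_device(expert_id, n_experts=8, n_devices=4):
    """Contiguous EP sharding: experts 0-1 on device 0, 2-3 on device 1, ..."""
    per_device = math.ceil(n_experts / n_devices)
    return expert_id // per_device

routed = top_k_experts([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 0.2, 0.3])
devices = {expert_to_device(e) for e in routed}
```

In a real serving engine the device set derived this way drives the all-to-all communication pattern; model-data co-scheduling changes which experts end up co-located.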
- HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen · Mar 3, 2025
A response that mixes factual and non-factual statements is difficult for humans to verify and to base their decisions on accurately.
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu · Feb 28, 2025
Red Team
To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.
- The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz · Feb 26, 2025
To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks.
- Compressing Language Models for Specialized Domains
Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras · Feb 25, 2025
Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks.
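The two techniques named here can be sketched in plain Python: magnitude pruning zeroes the smallest weights, and symmetric int8 quantization maps weights onto integer levels. Real pipelines operate on tensors; this is a toy illustration:

```python
# Hedged sketch: magnitude pruning and symmetric int8 quantization.

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

def quantize_int8(weights):
    """Symmetric quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

pruned = prune_by_magnitude([0.9, -0.05, 0.4, -0.01], sparsity=0.5)
q, scale = quantize_int8([0.9, -0.45])
```

Dequantizing `q[i] * scale` recovers each weight up to the quantization step, which is the error budget such methods trade for memory savings.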
- PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025
To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
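A PII masking step of the kind such systems apply before a prompt leaves the user's machine can be sketched with regexes. The patterns below are deliberately simplistic and illustrative, not PII-Bench's strategy:

```python
# Illustrative PII masking: replace emails and phone numbers with tokens.
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
]

def mask_pii(text):
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

masked = mask_pii("Contact ada@example.com or 555-123-4567.")
```

A query-aware system would additionally decide which of these spans actually need masking given the user's query, which is the gap the benchmark targets.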
- Can Multimodal LLMs Perform Time Series Anomaly Detection?
Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao · Feb 25, 2025
Multi Agent
One natural way for humans to detect time series anomalies is through visualization and textual description.
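As a non-multimodal reference point to contrast with visual detection, the classic programmatic baseline is z-score thresholding; this sketch is generic background, not the paper's method:

```python
# Baseline anomaly detector: flag points more than `threshold` standard
# deviations from the series mean.
def zscore_anomalies(series, threshold=3.0):
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    std = var ** 0.5 or 1.0
    return [i for i, x in enumerate(series) if abs(x - mean) > threshold * std]

idx = zscore_anomalies([1, 1, 1, 1, 50, 1, 1, 1], threshold=2.0)
```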
- Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 24, 2025
Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need for machines to understand human language becomes indispensable.
- Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025
Pairwise Preference
Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.
- Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman · Feb 22, 2025
Pairwise Preference Expert Verification
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions.
- Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes
Saman Khamesian, Asiful Arefeen, Maria Adela Grando, Bithika M. Thompson, Hassan Ghasemzadeh · Feb 20, 2025
Managing Type 1 Diabetes (T1D) demands constant vigilance as individuals strive to regulate their blood glucose levels and avoid dysglycemia, including hyperglycemia and hypoglycemia.
- SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents
Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay · Feb 18, 2025
Critique Edit
Through comprehensive evaluations with three LLM judges and three human experts, across a subset of 900 outputs, we demonstrate that SEFL-tuned models outperform both their untuned counterparts and an existing baseline in terms of feedback…
- Using the Path of Least Resistance to Explain Deep Networks
Sina Salek, Joseph Enguehard · Feb 17, 2025
Through experiments on both synthetic and real-world image classification data, we provide empirical evidence supporting our theoretical analysis and showing that GIG produces more faithful attributions than existing methods, including IG…
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu · Feb 17, 2025
Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct and MetaMathQA, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on…
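The fill-in-the-middle idea applied to reasoning chains can be made concrete: split a solution into steps and hold one out as the "middle" to predict from the surrounding prefix and suffix. The construction below is a generic schematic, not the paper's exact recipe:

```python
# Schematic FIM example construction over a list of reasoning steps.

def fim_examples(steps):
    """One (prefix, middle, suffix) triple per interior step."""
    out = []
    for i in range(1, len(steps) - 1):
        out.append((steps[:i], steps[i], steps[i + 1:]))
    return out

triples = fim_examples(["parse the problem", "set up the equation",
                        "solve for x", "check the answer"])
```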
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
Bettina Messmer, Vinko Sabolčec, Martin Jaggi · Feb 14, 2025
Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of multilinguality.
- MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition
Mehran Shabanpour, Kasra Rad, Sadaf Khademi, Arash Mohammadi · Feb 9, 2025
High-Density surface Electromyography (HDsEMG) has emerged as a pivotal resource for Human-Computer Interaction (HCI), offering direct insights into muscle activities and motion intentions.
- Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Jonathan Laurent, André Platzer · Feb 7, 2025
Demonstrations Web Browsing
Large Language Models can solve a wide range of tasks from just a few examples, but they remain difficult to steer and lack a capability essential for building reliable software at scale: the modular composition of computations…
- vCache: Verified Semantic Prompt Caching
Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu · Feb 6, 2025
We release the vCache implementation and four benchmarks to support future research.
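Semantic prompt caching in general works by embedding each prompt and returning a cached answer when the nearest cached prompt is similar enough. The toy bag-of-words embedding below stands in for a real embedding model, and vCache's verification mechanism is not shown:

```python
# Minimal semantic cache sketch: embed, nearest-neighbor, threshold.
import math

def embed(text):
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.entries = []  # (embedding, answer) pairs
        self.threshold = threshold

    def get(self, prompt):
        q = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt, answer):
        self.entries.append((embed(prompt), answer))

cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
hit = cache.get("what is the capital of france")
miss = cache.get("how do I bake bread")
```

The fixed threshold is the weak point such caches share: "verified" caching replaces it with a learned per-query decision, which this sketch does not attempt.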
- Should You Use Your Large Language Model to Explore or Exploit?
Keegan Harris, Aleksandrs Slivkins · Jan 31, 2025
Tool Use
We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff.
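The exploration-exploitation tradeoff the paper probes is made concrete by the textbook epsilon-greedy rule: exploit the best arm so far, but explore a random arm with probability epsilon. This is the classic baseline, not the paper's setup:

```python
# Epsilon-greedy arm selection for a multi-armed bandit.
import random

def epsilon_greedy(avg_rewards, epsilon, rng):
    """Pick an arm index: explore w.p. epsilon, else the best known arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_rewards))
    return max(range(len(avg_rewards)), key=avg_rewards.__getitem__)

rng = random.Random(0)
# With epsilon=0 the choice is purely greedy: the highest-average arm wins.
arm = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0, rng=rng)
```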
- Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations
Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen · Jan 29, 2025
Existing tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information.
- Safe Reinforcement Learning for Real-World Engine Control
Julian Bedei, Lucas Koch, Kevin Badalian, Alexander Winkler, Patrick Schaber · Jan 28, 2025
This work introduces a toolchain for applying Reinforcement Learning (RL), specifically the Deep Deterministic Policy Gradient (DDPG) algorithm, in safety-critical real-world environments.
- Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning
Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey, Gang Wang · Jan 27, 2025
Empirical results demonstrate that OC-STORM significantly outperforms the STORM baseline on the Atari 100k benchmark and achieves state-of-the-art sample efficiency on challenging boss fights in the visually complex game Hollow Knight.
- Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu · Jan 24, 2025
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities.
- Causal Claims in Economics
Prashant Garg, Thiemo Fetzer · Jan 12, 2025
As economics scales, a key bottleneck is representing what papers claim in a comparable, aggregable form.