- A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0
Rubric Rating · Expert Verification
As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, and safety.
- EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems
Zhengyi Zhao, Shubo Zhang, Yiming Du, Bin Liang, Baojun Wang · Mar 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu · Mar 28, 2025 · Citations: 0
Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and…
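The entropy-anchored step splitting described in this snippet can be sketched minimally. This is an illustration, not the paper's implementation; `token_entropy`, `entropy_step_boundaries`, and the `threshold` cutoff are hypothetical names and values.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_step_boundaries(token_dists, threshold=1.0):
    """Mark a step boundary at every position whose predictive
    entropy exceeds `threshold` (a hypothetical cutoff)."""
    return [i for i, dist in enumerate(token_dists)
            if token_entropy(dist) > threshold]

# Toy distributions: two confident tokens, one uncertain token.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # low entropy: mid-step
    [0.25, 0.25, 0.25, 0.25],  # high entropy: likely step boundary
    [0.90, 0.05, 0.03, 0.02],  # low entropy: mid-step
]
print(entropy_step_boundaries(dists))  # -> [1]
```

The intuition is that a near-uniform next-token distribution signals a logical transition point, which is where a step boundary is anchored instead of relying on static partitioning or human labels.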
- Boosting Large Language Models with Mask Fine-Tuning
Mingyuan Zhang, Yue Bai, Huan Wang, Yizhou Wang, Qihua Dong · Mar 27, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Arsham Gholamzadeh Khoee, Shuai Wang, Robert Feldt, Dhasarathy Parthasarathy, Yinan Yu · Mar 27, 2025 · Citations: 0
Multi Agent
Ensuring reliable data-driven decisions is crucial in domains where analytical accuracy directly impacts safety, compliance, or operational outcomes.
- Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral
Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda · Mar 25, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries
Lovedeep Gondara, Jonathan Simkin, Shebnum Devji, Gregory Arbour, Raymond Ng · Mar 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Minimum Volume Conformal Sets for Multivariate Regression
Sacha Braun, Liviu Aolaritei, Michael I. Jordan, Francis Bach · Mar 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski · Mar 24, 2025 · Citations: 0
We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.
- Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages
Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Iqra Ameer · Mar 24, 2025 · Citations: 0
Developing and integrating emotion-understanding models are essential for a wide range of human-computer interaction tasks, including customer feedback analysis, marketing research, and social media monitoring.
- Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu · Mar 23, 2025 · Citations: 0
Pairwise Preference · Demonstrations
Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs).
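The reward-based branch mentioned here typically fits a reward model on preference pairs with a Bradley-Terry objective. A minimal sketch of that per-pair loss, with hypothetical scalar scores (this is the standard formulation, not this paper's specific method):

```python
import math

def bt_pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(r_chosen - r_rejected). Lower when the reward model
    scores the chosen response above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for one (chosen, rejected) pair.
print(round(bt_pairwise_loss(2.0, 0.5), 4))  # -> 0.2014
```

Reward-free methods such as DPO optimize a closely related objective directly on the policy's log-probabilities of ranked outputs, skipping the explicit reward model.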
- FedSKD: Aggregation-free Model-heterogeneous Federated Learning via Multi-dimensional Similarity Knowledge Distillation for Medical Image Classification
Ziqiao Weng, Weidong Cai, Bo Zhou · Mar 23, 2025 · Citations: 0
Extensive evaluations on fMRI-based autism spectrum disorder diagnosis and skin lesion classification demonstrate that FedSKD outperforms state-of-the-art heterogeneous and homogeneous FL baselines, achieving superior personalization…
- MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025 · Citations: 0
Expert Verification
Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
- Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng · Mar 20, 2025 · Citations: 0
First, we construct Fin-R1-Data, a high-quality financial dataset consisting of 60,091 chain-of-thought (CoT) samples, distilled and filtered from multiple authoritative benchmarks to ensure consistency and reliability.
- Imitating AI agents increase diversity in homogeneous information environments but can reduce it in heterogeneous ones
Emil Bakkensen Johansen, Oliver Baumann · Mar 20, 2025 · Citations: 0
Recent developments in large language models (LLMs) have facilitated autonomous AI agents capable of imitating human-generated content, raising fundamental questions about how AI may reshape democratic information environments such as news.
- More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models
Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen · Mar 20, 2025 · Citations: 0
This study introduces a novel evaluation framework to uncover gender biases in LLMs: using free-form storytelling to surface biases embedded within the models.
- What Makes a Reward Model a Good Teacher? An Optimization Perspective
Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee · Mar 19, 2025 · Citations: 0
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model.
- EmoGRACE: Aspect-based emotion analysis for social media data
Christina Zorenböhmer, Sebastian Schmidt, Bernd Resch · Mar 19, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- KINESIS: Motion Imitation for Human Musculoskeletal Locomotion
Merkourios Simos, Alberto Silvio Chiappa, Alexander Mathis · Mar 18, 2025 · Citations: 0
- Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0
Expert Verification · Tool Use
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
- Learning Over Dirty Data with Minimal Repairs
Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer · Mar 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- OSCAR: Online Soft Compression And Reranking
Maxime Louis, Thibault Formal, Hervé Dejean, Stéphane Clinchant · Mar 17, 2025 · Citations: 0
- ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs
Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong · Mar 17, 2025 · Citations: 0
Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most structured thinking patterns, while larger models (32B) with structured…
- KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo · Mar 17, 2025 · Citations: 0
To facilitate systematic evaluation, we introduce KVG-Bench, a benchmark spanning 10 domains with 1.3K curated test cases covering 531 images and 882 entities.
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang · Mar 16, 2025 · Citations: 0
Long Horizon
Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
- HyConEx: Hypernetwork classifier with counterfactual explanations for tabular data
Patryk Marszałek, Kamil Książek, Oleksii Furman, Ulvi Movsum-zada, Przemysław Spurek · Mar 16, 2025 · Citations: 0
- A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang · Mar 16, 2025 · Citations: 0
Long Horizon
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.
- Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
Zhi Chen, Wei Ma, Lingxiao Jiang · Mar 16, 2025 · Citations: 0
- Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes
Zhanliang Wang, Da Wu, Quan Nguyen, Kai Wang · Mar 15, 2025 · Citations: 0
These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes.
- Interpretable Deep Learning Framework for Improved Disease Classification in Medical Imaging
Jutika Borah, Hidam Kumarjit Singh · Mar 14, 2025 · Citations: 0
The framework is evaluated on four medical imaging benchmark datasets: chest X-rays of COVID-19, Tuberculosis, Pneumonia, and retinal Optical Coherence Tomography (OCT) images.
- Implicit Bias-Like Patterns in Reasoning Models
Messi H. J. Lee, Calvin K. Lai · Mar 14, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control
Yifeng Zhang, Yilin Liu, Ping Gong, Peizhuo Li, Mingfeng Fan · Mar 14, 2025 · Citations: 0
- Reasoning-Grounded Natural Language Explanations for Language Models
Vojtech Cahlik, Rodrigo Alves, Pavel Kordik · Mar 14, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu, Dong Gong, Yichao Cai, Erdun Gao, Zhen Zhang · Mar 12, 2025 · Citations: 0
- PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization
Zhiwen You, Yue Guo · Mar 11, 2025 · Citations: 0
Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA)-based approaches, struggle with plain language summarization (PLS) due to the elaborative explanation phenomenon, which introduces external content…
- Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges
Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng · Mar 11, 2025 · Citations: 0
However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios.
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye · Mar 9, 2025 · Citations: 0
- Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference
Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan · Mar 9, 2025 · Citations: 0
Web Browsing
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang · Mar 9, 2025 · Citations: 0
Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-11% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
- Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke · Mar 7, 2025 · Citations: 0
When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and…
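Activation steering of the kind this entry evaluates amounts to adding a learned direction to a hidden state at inference time. A minimal sketch under that assumption; `apply_steering`, the vectors, and the `alpha` strength knob are hypothetical illustrations, not the paper's tuned vectors:

```python
def apply_steering(hidden, steering, alpha=1.0):
    """Add a scaled steering vector to a hidden activation,
    element-wise; alpha controls the intervention strength."""
    return [h + alpha * s for h, s in zip(hidden, steering)]

# Toy 3-dim hidden state and a hypothetical bias-mitigation direction.
h = [0.2, -0.1, 0.5]
v = [0.1, 0.3, -0.2]
print(apply_steering(h, v, alpha=0.5))
```

In practice the steering vector is applied to the residual stream at one or more chosen layers, and `alpha` is tuned per task, which is why the paper reports individually tuned vectors per benchmark.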
- Frequency Autoregressive Image Generation with Continuous Tokens
Hu Yu, Hao Luo, Hangjie Yuan, Yu Rong, Jie Huang · Mar 7, 2025 · Citations: 0
However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction.
- No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner · Mar 7, 2025 · Citations: 0
Pairwise Preference
To address this gap, we introduce the Business and Finance Fundamentals Benchmark (BFF-Bench), a dataset of 160 challenging questions and long-form responses authored by financial professionals.
- Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
Jooyoung Lee, Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos · Mar 6, 2025 · Citations: 0
The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content.
- VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization
Mohammad Mahdi Samiei Paqaleh, Mehdi Jamalkhah, Mahdieh Soleymani Baghshah · Mar 6, 2025 · Citations: 0
Emergent Language (EL) focuses on the emergence of communication among artificial agents.
- Training-free Adjustable Polynomial Graph Filtering for Ultra-fast Multimodal Recommendation
Yu-Seung Roh, Joo-Young Kim, Jin-Duk Park, Won-Yong Shin · Mar 6, 2025 · Citations: 0
- Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng · Mar 6, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes · Mar 5, 2025 · Citations: 0
- LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
Jude Khouja, Lingyi Yang, Karolina Korgul, Simeon Hellsten, Vlad A. Neacsu · Mar 4, 2025 · Citations: 0
We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems.
- Wikipedia in the Era of LLMs: Evolution and Risks
Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen · Mar 4, 2025 · Citations: 0
If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models could shift.
- Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models
David Bani-Harouni, Chantal Pellegrini, Paul Stangel, Ege Özsoy, Kamilia Zaripova · Mar 4, 2025 · Citations: 0
- HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen · Mar 3, 2025 · Citations: 0
A response that mixes factual and non-factual statements is challenging for humans to verify and to accurately base their decisions on.
- $\texttt{SEM-CTRL}$: Semantically Controlled Decoding
Mohammad Albinhassan, Pranava Madhyastha, Alessandra Russo · Mar 3, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LLM-Advisor: An LLM Benchmark for Cost-efficient Path Planning across Multiple Terrains
Ling Xiao, Toshihiko Yamasaki · Mar 3, 2025 · Citations: 0
Web Browsing
We further introduce two datasets, MultiTerraPath and RUGD_v2, for systematic evaluation of cost-efficient path planning.
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu · Feb 28, 2025 · Citations: 0
Red Team
To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.
- Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models
Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf · Feb 28, 2025 · Citations: 0
- Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository
Radhika Kapoor, Sang T. Truong, Nick Haber, Maria Araceli Ruiz-Primo, Benjamin W. Domingue · Feb 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture
Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng · Feb 27, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Stay Focused: Problem Drift in Multi-Agent Debate
Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas · Feb 26, 2025 · Citations: 0
Multi Agent
Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks.
- The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz · Feb 26, 2025 · Citations: 0
To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks.
- Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Hongyi Cai, Jie Li, Mohammad Mahdinur Rahman, Wenzhen Dong · Feb 26, 2025 · Citations: 0
Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive…