- Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman · Feb 22, 2025 · Citations: 0
Pairwise Preference · Expert Verification
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions.
- Less is More: Improving LLM Alignment via Preference Data Selection
Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang · Feb 20, 2025 · Citations: 0
Pairwise Preference
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences.
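As background for this entry, the standard DPO objective (from the original DPO formulation, not this paper's data-selection contribution) can be sketched numerically. The function name and arguments below are illustrative, not from the paper:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example Direct Preference Optimization loss.

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # Implicit reward margin: beta times the difference of the
    # policy-to-reference log-ratios for chosen vs. rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin (Bradley-Terry preference likelihood).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy favors the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2) ~ 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Preference-data selection methods like the one in this paper operate on the (chosen, rejected) pairs fed into such a loss, filtering or reweighting them before training.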
- Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes
Saman Khamesian, Asiful Arefeen, Maria Adela Grando, Bithika M. Thompson, Hassan Ghasemzadeh · Feb 20, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents
Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay · Feb 18, 2025 · Citations: 0
Critique Edit
Through comprehensive evaluations with three LLM judges and three human experts, across a subset of 900 outputs, we demonstrate that SEFL-tuned models outperform both their untuned counterparts and an existing baseline in terms of feedback…
- Using the Path of Least Resistance to Explain Deep Networks
Sina Salek, Joseph Enguehard · Feb 17, 2025 · Citations: 0
Through experiments on both synthetic and real-world image classification data, we provide empirical evidence supporting our theoretical analysis and showing that GIG produces more faithful attributions than existing methods, including IG,…
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu · Feb 17, 2025 · Citations: 0
Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct and MetaMathQA, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on…
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
Bettina Messmer, Vinko Sabolčec, Martin Jaggi · Feb 14, 2025 · Citations: 0
Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of…
- Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations
Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar · Feb 14, 2025 · Citations: 0
- Hallucination, Monofacts, and Miscalibration: An Empirical Investigation
Miranda Muqing Miao, Michael Kearns · Feb 11, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.