- Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection
Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin · Feb 11, 2026 · Citations: 0
Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral…
- When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Zachary Pedram Dadfar · Feb 11, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Voxtral Realtime
Mistral-AI, Alexander H. Liu, Andy Ehrenberg, Andy Lo · Feb 11, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen · Feb 11, 2026 · Citations: 0
Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and…
- GraphSeek: Next-Generation Graph Analytics with LLMs
Maciej Besta, Łukasz Jarmocik, Orest Hrycyna, Shachar Klaiman, Konrad Mączka · Feb 11, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Embedding Inversion via Conditional Masked Diffusion Language Models
Han Xiao · Feb 11, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Learning Page Order in Shuffled WOO Releases
Efe Kahraman, Giulio Tosato · Feb 11, 2026 · Citations: 0
Pairwise Preference
We observe two unexpected failures: seq2seq transformers fail to generalize on long documents (Kendall's tau drops from 0.918 on 2-5 pages to 0.014 on 21-25 pages), and curriculum learning underperforms direct training by 39% on long…
- When Fusion Helps and When It Breaks: View-Aligned Robustness in Same-Source Financial Imaging
Rui Ma · Feb 11, 2026 · Citations: 0
To control label ambiguity from near-zero moves, we use an ex-post minimum-movement threshold min_move (tau) based on realized absolute next-day return, defining an offline benchmark on the subset where the absolute next-day return is at…
- LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
Ivan Vulić, Adam Grycner, Quentin de Laroussilhe, Jonas Pfeiffer · Feb 11, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Mingyu Cao, Alvaro H. C. Correia, Christos Louizos, Shiwei Liu, Lu Yin · Feb 11, 2026 · Citations: 0
Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and…
- Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson · Feb 11, 2026 · Citations: 0
Tool Use
We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution.
- Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion
Pengcheng Zhou, Haochen Li, Zhiqiang Nie, JiaLe Chen, Qing Gong · Feb 11, 2026 · Citations: 0
- The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026 · Citations: 0
Web Browsing
The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.
- To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
Nanxu Gong, Haotian Li, Sixun Dong, Jianxun Lian, Yanjie Fu · Feb 11, 2026 · Citations: 0
- Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0
Pairwise Preference · Tool Use
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
- Learning Adaptive Distribution Alignment with Neural Characteristic Function for Graph Domain Adaptation
Wei Chen, Xingyu Guo, Shuang Li, Zhao Zhang, Yan Zhong · Feb 11, 2026 · Citations: 0
- Neuro-Symbolic Synergy for Interactive World Modeling
Hongyu Zhao, Siyu Zhou, Haolin Yang, Zengyi Qin, Tianyi Zhou · Feb 11, 2026 · Citations: 0
- TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation
Steven Liu, Jane Luo, Xin Zhang, Aofan Liu, Hao Liu · Feb 11, 2026 · Citations: 0
To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments.
- When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents
Virginie Mouilleron, Théo Lasnier, Anna Mosolova, Djamé Seddah · Feb 11, 2026 · Citations: 0