- AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0
Simulation Env Coding
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Automatic Metrics CodingMultilingual
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0
Automatic Metrics MedicineCoding
These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
- MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Automatic Metrics MedicineCoding
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Automatic Metrics MedicineCoding
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics Coding
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics Coding
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu · Oct 14, 2025 · Citations: 0
Automatic Metrics Coding
Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026 · Citations: 0
Automatic MetricsSimulation Env Coding
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics MathCoding
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026 · Citations: 0
Automatic Metrics MedicineCoding
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
- RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025 · Citations: 0
Automatic Metrics Coding
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
- AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0
Automatic Metrics MathCoding
We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.
- AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026 · Citations: 0
Automatic Metrics MedicineCoding
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
- Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026 · Citations: 0
Automatic MetricsSimulation Env Coding
In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual…
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0
Automatic Metrics Coding
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026 · Citations: 0
Automatic Metrics Coding
We introduce (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.