- AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026
Automatic Metrics MathCoding
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
- Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
Yuqi Shi, Hao Yang, Xiyao Lu, Jinsong Zhang · Feb 26, 2026
Automatic Metrics General
While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge.
- Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan · Feb 26, 2026
Automatic Metrics General
Ultimately, this work outlines a highly optimized dual pipeline achieving a $\sim$0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
- MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026
Automatic Metrics Coding
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026
Automatic Metrics MathCoding
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu · Feb 26, 2026
Automatic Metrics General
Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency.
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026
Automatic Metrics MathMedicine
Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
- Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian · Feb 26, 2026
Automatic Metrics General
Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries.
- Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026
Automatic Metrics MedicineCoding
We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical d
- A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026
Automatic Metrics General
Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC.
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026
Automatic Metrics Coding
Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026
Automatic Metrics Coding
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026
Automatic Metrics Coding
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image.
- Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026
Automatic Metrics General
Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.
- DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026
Automatic Metrics General
This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
- Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026
Human EvalAutomatic Metrics General
Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)).
- Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026
Automatic Metrics General
Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
- SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026
Automatic Metrics MedicineCoding
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026
Automatic Metrics General
We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
- Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai · Feb 25, 2026
Automatic Metrics Medicine
To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories.
- Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026
Automatic Metrics MathCoding
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
- From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators
Zhihao Li, Yu Feng, Zhilu Lai, Wei Wang · Feb 25, 2026
Automatic Metrics General
On standard PDE benchmarks and real datasets, our method attains state-of-the-art competitive accuracy while providing intrinsic interpretability.
- One Brain, Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models
Changli Tang, Shurui Li, Junliang Wang, Qinfan Xiao, Zhonghao Zhai · Feb 25, 2026
Automatic Metrics Coding
Extensive evaluations demonstrate that NOBEL serves as a robust generalist across standard single-modal tasks.
- GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang · Feb 25, 2026
Automatic Metrics General
Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems.
- Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026
Automatic Metrics Math
Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
- Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026
Automatic Metrics Math
We validate across five benchmarks, five models from three families, and both synthetic and real data.
- XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser · Feb 24, 2026
Automatic Metrics MedicineCoding
Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints.
- Attention-Based SINR Estimation in User-Centric Non-Terrestrial Networks
Bruno De Filippo, Alessandro Guidotti, Alessandro Vanelli-Coralli · Feb 24, 2026
Automatic Metrics General
These results enable the integration of DMHSA-based estimators into scheduling procedures, allowing the evaluation of multiple candidate user groups and the selection of those offering the highest average SINR and capacity.
- HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG
Yuqi Huang, Ning Liao, Kai Yang, Anning Hu, Shengchao Hu · Feb 24, 2026
Automatic Metrics General
Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8$\times$ speedup over leading Graph-based RAG baselines.
- Predicting Sentence Acceptability Judgments in Multimodal Contexts
Hyewon Jang, Nikolai Ilinykh, Sharid Loáiciga, Jey Han Lau, Shalom Lappin · Feb 24, 2026
Automatic Metrics General
Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts.
- Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding · Feb 24, 2026
Automatic Metrics Coding
Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning.
- OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu · Feb 24, 2026
Automatic Metrics Medicine
Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities.
- Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams
Darvan Shvan Khairaldeen, Hossein Hassani · Feb 24, 2026
Automatic Metrics General
On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8% .
- Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026
Automatic MetricsSimulation Env Coding
Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output.
- CAMEL: Confidence-Gated Reflection for Reward Modeling
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026
Automatic Metrics General
Reward models play a fundamental role in aligning large language models with human preferences.
- Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches
Saurabh Mishra, Shivani Thakur, Radhika Mamidi · Feb 24, 2026
Automatic Metrics General
The proliferation of hate speech on social media platforms has necessitated the development of effective detection and moderation tools.
- Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference
Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji · Feb 24, 2026
Automatic Metrics General
Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties.
- To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen · Feb 23, 2026
Automatic Metrics Medicine
Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks-HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA.
- NanoKnow: How to Know What Your Language Model Knows
Lingwei Gu, Nour Jedidi, Jimmy Lin · Feb 23, 2026
Automatic Metrics Coding
Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre
- Multilingual Large Language Models do not comprehend all natural languages to equal degrees
Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi · Feb 23, 2026
Automatic Metrics Multilingual
Large Language Models (LLMs) play a critical role in how humans access information.
- When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue · Feb 23, 2026
Automatic Metrics General
Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following.
- Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026
Automatic Metrics MedicineCoding
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
- Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026
Automatic Metrics Coding
We introduce \CFE{} (\textbf{C}lassroom \textbf{F}inal \textbf{E}xam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
- Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026
Automatic Metrics Math
In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
- Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026
Automatic Metrics General
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
- Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026
Automatic Metrics Multilingual
Personal AI agents incur substantial cost via repeated LLM calls.
- ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan · Feb 21, 2026
Automatic Metrics General
We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9).
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026
Automatic Metrics Coding
Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
- Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · Feb 20, 2026
Automatic MetricsSimulation Env Coding
Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation.
- Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026
Automatic MetricsSimulation Env General
When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
- Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin · Feb 20, 2026
Automatic Metrics MathCoding
Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.
- FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026
Automatic Metrics General
A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
- Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026
Automatic Metrics Law
Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
- Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026
Automatic Metrics Math
Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
- CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026
Automatic Metrics Multilingual
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
- The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?
Jayadev Billa · Feb 19, 2026
Automatic Metrics General
Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades.
- Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026
Automatic Metrics General
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
- What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data
Dimitri Staufer, Kirsten Morehouse · Feb 19, 2026
Automatic Metrics General
Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions.
- Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
Bianca Raimondi, Maurizio Gabbrielli · Feb 19, 2026
Automatic Metrics Coding
The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics.
- ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Hussein S. Al-Olimat, Ahmad Alshareef · Feb 19, 2026
Automatic Metrics Multilingual
While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification.