- LiCQA : A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026 · Citations: 0
Automatic Metrics
The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
- When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large Language Models (LLMs) are increasingly used to ``professionalize'' workplace communication, often at the cost of linguistic identity.
- Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0
Automatic Metrics Tool Use
Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%.
- Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026 · Citations: 0
Automatic Metrics
Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.
- A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary · Feb 25, 2026 · Citations: 0
Automatic Metrics
Diversity has been gaining interest in the NLP community in recent years.
- CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models
Miyu Oba, Saku Sugawara · Feb 25, 2026 · Citations: 0
Automatic Metrics
Most existing benchmarks focus on judging grammatical acceptability, whereas the ability to interpret meanings conveyed by grammatical forms has received much less attention.
- RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning
Bo Xue, Yuan Jin, Luoyi Fu, Jiaxin Ding, Xinbing Wang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more r
- Large Language Models are Algorithmically Blind
Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote · Feb 25, 2026 · Citations: 0
Automatic Metrics
Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best m
- DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
- Personalized Graph-Empowered Large Language Model for Proactive Information Access
Chia Cheng Chang, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen · Feb 25, 2026 · Citations: 0
Automatic Metrics
Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential.
- Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0
Human EvalAutomatic Metrics
Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)).
- FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0
Demonstrations Automatic Metrics
In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
- JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning
Ruichen Xu, Ying-Jun Angela Zhang, Jianwei Huang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.
- Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption.
- Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLMs
Heng Wang, Changxing Wu · Feb 25, 2026 · Citations: 0
Human Eval
Experimental results on PDTB demonstrate that our approach significantly improves IDRR performance, while human evaluation further confirms that the generated explanations enhance model interpretability.
- fEDM+: A Risk-Based Fuzzy Ethical Decision Making Framework with Principle-Level Explainability and Pluralistic Validation
Abeer Dyoub, Francesca A. Lisi · Feb 25, 2026 · Citations: 0
Automatic Metrics
In a previous work, we introduced the fuzzy Ethical Decision-Making framework (fEDM), a risk-based ethical reasoning architecture grounded in fuzzy logic.
- The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems
Hyo Jin Kim · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
- Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0
Demonstrations Automatic Metrics
Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
- Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson · Feb 25, 2026 · Citations: 0
Automatic Metrics
Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular.
- Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems.
- Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
- DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
Duc Trung Vu, Pham Khanh Chi, Dat Phi Van, Linh Ngo Van, Sang Dinh · Feb 25, 2026 · Citations: 0
Automatic Metrics
Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final h
- CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
- PPCR-IM: A System for Multi-layer DAG-based Public Policy Consequence Reasoning and Social Indicator Mapping
Zichen Song, Weijia Li · Feb 25, 2026 · Citations: 0
Automatic Metrics
For each policy episode, the system outputs a structured record containing the DAG, indicator mappings, and three evaluation measures: an expected-indicator coverage score, a discovery rate for overlooked but relevant indicators, and a rela
- RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li · Feb 25, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
- When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
Muku Akasaka, Soyeon Caren Han · Feb 25, 2026 · Citations: 0
Automatic Metrics
In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks.
- Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Touseef Hasan, Laila Cure, Souvika Sarkar · Feb 25, 2026 · Citations: 0
Simulation Env
We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios.
- Exploring Human-Machine Coexistence in Symmetrical Reality
Zhenliang Zhang · Feb 25, 2026 · Citations: 0
Automatic Metrics
In the context of the evolution of artificial intelligence (AI), the interaction between humans and AI entities has become increasingly salient, challenging the conventional human-centric paradigms of human-machine interaction.
- Power and Limitations of Aggregation in Compound AI Systems
Nivasini Ananthakrishnan, Meena Jagadeesan · Feb 25, 2026 · Citations: 0
Automatic Metrics
In this work, we investigate the power and limitations of aggregation within a stylized principal-agent framework.
- Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026 · Citations: 0
Automatic Metrics
Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
- From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators
Zhihao Li, Yu Feng, Zhilu Lai, Wei Wang · Feb 25, 2026 · Citations: 0
Automatic Metrics
On standard PDE benchmarks and real datasets, our method attains state-of-the-art competitive accuracy while providing intrinsic interpretability.
- ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
- Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0
Automatic Metrics Multi Agent
Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.
- Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu · Feb 25, 2026 · Citations: 0
Critique Edit Automatic Metrics
To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer.
- GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems.
- Evaluating the Usage of African-American Vernacular English in Large Language Models
Deja Dunlap, R. Thomas McCoy · Feb 25, 2026 · Citations: 0
Automatic Metrics
In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE).
- A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation
Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li, Chin-Hui Lee · Feb 25, 2026 · Citations: 0
Simulation Env
Evaluation on simulation data shows that score-guided learning achieves very good music segmentation and separation results.
- iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis
Sofoklis Kakouros, Fang Kang, Haoyu Chen · Feb 25, 2026 · Citations: 0
Automatic Metrics
To demonstrate the utility of the dataset and establish initial benchmarks, we introduce two evaluation tasks for comparative assessment: speech emotion recognition and transcript-based sentiment analysis.
- VecGlypher: Unified Vector Glyph Generation with Language Models
Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with ma
- Revisiting Text Ranking in Deep Research
Chuan Meng, Litu Ou, Sean MacAvaney, Jeff Dalton · Feb 25, 2026 · Citations: 0
Automatic Metrics
To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it.
- Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026 · Citations: 0
Automatic Metrics
Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.
- Causal Decoding for Hallucination-Resistant Multimodal Large Language Models
Shiwei Tan, Hengyi Wang, Weiyi Qin, Qi Xu, Zhigang Hua · Feb 24, 2026 · Citations: 0
Automatic Metrics
Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.
- Provably Safe Generative Sampling with Constricting Barrier Functions
Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints.
- On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation
Alexander Galozy · Feb 24, 2026 · Citations: 0
Automatic Metrics
Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context.
- ECHOSAT: Estimating Canopy Height Over Space And Time
Jan Pauls, Karsten Schrödter, Sven Ligensa, Martin Schwartz, Berkant Turan · Feb 24, 2026 · Citations: 0
Automatic Metrics
Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions.
- The Headless Firm: How AI Reshapes Enterprise Boundaries
Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026 · Citations: 0
Automatic Metrics Multi Agent
We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, inte
- Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages
Felix Schneider, Maria Gogolev, Sven Sickert, Joachim Denzler · Feb 24, 2026 · Citations: 0
Automatic Metrics
Tokenization and sub-tokenization based models like word2vec, BERT and the GPTs are the state-of-the-art in natural language processing.
- Towards Controllable Video Synthesis of Routine and Rare OR Events
Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova · Feb 24, 2026 · Citations: 0
Automatic Metrics
Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging.
- Towards single-shot coherent imaging via overlap-free ptychography
Oliver Hoidn, Aashwin Mishra, Steven Henke, Albert Vong, Matthew Seaberg · Feb 24, 2026 · Citations: 0
Automatic Metrics
On synthetic benchmarks, reconstructions remain accurate at low counts ($\sim\!10^4$ photons/frame), and overlap-free single-shot reconstruction with an experimental probe reaches amplitude structural similarity (SSIM) 0.904, compared with
- Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li · Feb 24, 2026 · Citations: 0
Pairwise PreferenceRed Team Automatic Metrics
Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs).
- Test-Time Training with KV Binding Is Secretly Linear Attention
Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li · Feb 24, 2026 · Citations: 0
Automatic Metrics
Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time.
- Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidat
- On Data Engineering for Scaling LLM Terminal Capabilities
Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro · Feb 24, 2026 · Citations: 0
Automatic Metrics
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed.
- Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.
- NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan · Feb 24, 2026 · Citations: 0
Automatic Metrics
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures.
- SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
- "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
- Cooperative-Competitive Team Play of Real-World Craft Robots
Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang · Feb 24, 2026 · Citations: 0
Simulation Env Multi Agent
Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years.