- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao · Feb 22, 2026
Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI.
- PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026
Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
- Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026
Long Horizon
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
- Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026
Pairwise Preference Long Horizon
Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
- Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection
Raihan Tanvir, Md. Golam Rabiul Alam · Feb 22, 2026
Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives.
- Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger · Feb 22, 2026
The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift.
- TurkicNLP: An NLP Toolkit for Turkic Languages
Sherzod Hakimov · Feb 22, 2026
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources.
- Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk · Feb 22, 2026
The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps).
- Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Francesca Bianco, Derek Shiller · Feb 22, 2026
This work supports a more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.
- Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs
Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide · Feb 22, 2026
Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-s
- VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026
Long Horizon
Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90\% accuracy on plan-aware VQA.
- A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions
Stefanie Schneider, Miriam Göldl, Julian Stalter, Ricarda Vollmer · Feb 22, 2026
The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Lan
- AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026
Long Horizon
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
- How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders
Michael McCoubrey, Angelo Salatino, Francesco Osborne, Enrico Motta · Feb 22, 2026
In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work.
- Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models
Kainan Liu, Yong Zhang, Ning Cheng, Yun Zhu, Yanmeng Wang · Feb 22, 2026
Extensive experiments across natural language understanding (NLU) and natural language generation (NLG) tasks demonstrate that Astra consistently outperforms existing PEFT baselines across 16 benchmarks and even surpasses full fine-tuning (
- Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models
Seong Hah Cho, Junyi Li, Anna Leshinskaya · Feb 22, 2026
Among the characteristics of value representation in humans is that they distinguish among values of different kinds.
- TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes
Roman Egger · Feb 22, 2026
In benchmarks across 20 Newsgroups, BBC News, AG News, and Arxiv, TriTopic achieves the highest NMI on every dataset (mean NMI 0.575 vs.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026
Long Horizon
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
- IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su · Feb 22, 2026
Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training.
- Uncovering Context Reliance in Unstructured Knowledge Editing
Zisheng Zhou, Mengqi Zhang, Shiguang Wu, Xiaotian Ye, Chi Zhang · Feb 22, 2026
Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.
- Learning to Detect Language Model Training Data via Active Reconstruction
Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi · Feb 22, 2026
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem.
- Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
Wilson Y. Lee · Feb 22, 2026
Long Horizon
Why do language agents fail on tasks they are capable of solving?
- Benchmark Test-Time Scaling of General LLM Agents
Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang · Feb 22, 2026
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests.
- Whisper: Courtside Edition: Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026
Multi Agent
We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
- Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026
Pairwise Preference
One annotator pair achieved almost perfect agreement ($\kappa = 0.8743$; $93.8\%$ raw agreement), exceeding several benchmarks reported in English sarcasm research.
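The agreement figure above is Cohen's kappa, which discounts raw agreement by the agreement expected from chance. A minimal sketch of the standard formula, on a hypothetical two-annotator confusion matrix (the counts below are illustrative, not from the dataset):

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square agreement matrix
    (rows = annotator A's labels, cols = annotator B's labels)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of items on the diagonal.
    p_o = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(n)
    )
    return (p_o - p_e) / (1 - p_e)

# Toy example: 100 items labelled sarcastic / not sarcastic by two annotators.
m = [[40, 5], [5, 50]]
print(round(cohens_kappa(m), 3))  # -> 0.798
```

With 90% raw agreement and fairly balanced labels, kappa lands near 0.8, which is why a kappa of 0.87 alongside 93.8% raw agreement is described as almost perfect.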
- MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs
Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata · Feb 21, 2026
Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources.
- Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026
Personal AI agents incur substantial cost via repeated LLM calls.
- DeepInnovator: Triggering the Innovative Capabilities of LLMs
Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu · Feb 21, 2026
The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously g
- AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting
Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari · Feb 21, 2026
Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency.
- TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
Yujiao Yang · Feb 21, 2026
Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representa
- [b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic
Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen · Feb 21, 2026
Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored.
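The title's equation, [b] = [d] - [t] + [p], can be illustrated with idealised binary phonological feature vectors. The vectors below are a hypothetical stand-in for learned S3M representations, not the paper's actual embeddings: subtracting [t] from [d] isolates voicing, and adding [p] contributes bilabial place, recovering [b]:

```python
import numpy as np

# Hypothetical feature vectors (voiced, bilabial, alveolar, stop) --
# an idealised proxy for self-supervised speech representations.
phones = {
    "b": np.array([1, 1, 0, 1]),  # voiced bilabial stop
    "d": np.array([1, 0, 1, 1]),  # voiced alveolar stop
    "t": np.array([0, 0, 1, 1]),  # voiceless alveolar stop
    "p": np.array([0, 1, 0, 1]),  # voiceless bilabial stop
}

# [d] - [t] + [p]: swap alveolar place for bilabial while keeping voicing.
result = phones["d"] - phones["t"] + phones["p"]
nearest = min(phones, key=lambda k: float(np.linalg.norm(phones[k] - result)))
print(nearest)  # -> b
```

In this toy feature space the arithmetic is exact; the paper's finding is that a soft version of the same regularity emerges in learned speech embeddings.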
- Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026
Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
- EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation
Adam Dejl, Jonathan Pearson · Feb 21, 2026
Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains.
- Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026
Pairwise Preference
We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
- BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat · Feb 21, 2026
We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG).
- MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026
Red Team
Defending LLMs against adversarial jailbreak attacks remains an open challenge.
- ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan · Feb 21, 2026
We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9).
- The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol
Andreas Schlapbach · Feb 21, 2026
This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction.
- Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
Lichang Song, Ting Long, Yi Chang · Feb 21, 2026
Multi Agent
To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma
- ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models
Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang · Feb 21, 2026
Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
- Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026
Long Horizon
LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
- Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift
Stephen Russell · Feb 21, 2026
Long Horizon
Most semantic drift studies report multiple signals (e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability) without a shared explanatory theory that relates them.
- Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM
Md Badsha Biswas, Ozlem Uzuner · Feb 21, 2026
Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning.
- From Trial by Fire To Sleep Like a Baby: A Lexicon of Anxiety Associations for 20k English Multiword Expressions
Saif M. Mohammad · Feb 21, 2026
Anxiety is the unease about a possible future negative outcome.
- Spilled Energy in Large Language Models
Adrian Robert Minut, Hazem Dewidar, Iacopo Masi · Feb 21, 2026
Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalizati
- PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani · Feb 20, 2026
Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings.
- DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning
Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham · Feb 20, 2026
Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples.
- Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026
Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026
Real-time guardrails require evaluation that is accurate, cheap, and fast -- yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
- Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026
Pairwise Preference Long Horizon
When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
- VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026
Existing cultural benchmarks are (i) manually crafted, (ii) limited to single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
- RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026
Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
- SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026
Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
- Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures
Joshua Nunley · Feb 20, 2026
This paper presents a direct framework for sequence models with hidden states on closed subgroups of U(d).
- Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026
Pairwise Preference
Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
- Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026
Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
- On the "Induction Bias" in Sequence Models
M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic · Feb 20, 2026
Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
- Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning
Tao Wu, Adam Kapelner · Feb 20, 2026
In summary, we demonstrate that a modern embedding model built on a neural network architecture, when guided by human supervision, yields a large, low-cost supply of near-perfect contexts for teaching vocabulary across a variety of target words.
- PsihoRo: Depression and Anxiety Romanian Text Corpus
Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu · Feb 20, 2026
Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health.
- VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig · Feb 20, 2026
However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.