- Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities
Shankar Padmanabhan, Mustafa Omer Gul, Tanya Goyal · Feb 17, 2026
Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following and reasoning.
- Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff
Patrick Pynadath, Ruqi Zhang · Feb 17, 2026
Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of…
- Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs
Sean Trott, Samuel Taylor, Cameron Jones, James A. Michaelov, Pamela D. Rivière · Feb 17, 2026
Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition--such as the theory that mental state reasoning emerges in part from language exposure--and our understanding of LMs.
- Surgical Activation Steering via Generative Causal Mediation
Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell · Feb 17, 2026
Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response?
- CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
Bradley McDanel, Steven Li, Harshit Khaitan · Feb 17, 2026
This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks.
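The cross-layer aggregation the title suggests can be sketched as averaging per-layer token-importance scores, so that a single mis-ranking layer cannot dominate the overall ranking. This is a toy illustration under that assumption, not the paper's actual method:

```python
def aggregate_importance(per_layer_scores: list[list[float]]) -> list[float]:
    """Mean token-importance across layers; damps per-layer ranking noise."""
    n_layers = len(per_layer_scores)
    n_tokens = len(per_layer_scores[0])
    return [sum(layer[i] for layer in per_layer_scores) / n_layers
            for i in range(n_tokens)]

# Three layers score four prompt tokens; the third layer mis-ranks token 0
scores = [
    [0.9, 0.1, 0.5, 0.2],
    [0.8, 0.2, 0.4, 0.1],
    [0.1, 0.3, 0.6, 0.2],
]
agg = aggregate_importance(scores)
print(agg)  # token 0 still ranks highest after aggregation
```

The point of the sketch is the variance claim in the snippet: a ranking heuristic that is good in most layers but degrades sharply in one layer recovers under cross-layer averaging.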
- Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026
While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
- Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin · Feb 17, 2026
Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%).
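The reported interval is consistent with a Wilson score interval on 105/120. A minimal pure-Python check (assuming z = 1.96 for 95% confidence) reproduces the quoted 80.4-92.3% bounds:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k successes out of n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_ci(105, 120)
print(f"{lo:.1%} - {hi:.1%}")  # 80.4% - 92.3%
```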
- A Curious Class of Adpositional Multiword Expressions in Korean
Junghyun Min, Na-Rae Han, Jena D. Hwang, Nathan Schneider · Feb 17, 2026
Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME.
- MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha · Feb 17, 2026
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.
- Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks
Jayadev Billa · Feb 17, 2026
Capability emergence during neural network training remains mechanistically opaque.
- DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain · Feb 17, 2026
We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models.
- Avey-B
Devang Acharya, Mohammad Hammoud · Feb 17, 2026
Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more…
- Intent Laundering: AI Safety Datasets Are Not What They Seem
Shahriar Golchin, Marc Wetter · Feb 17, 2026
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
- Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
Suhyung Jang, Ghang Lee, Jaekun Lee, Hyunjun Lee · Feb 17, 2026
Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry.
- *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu · Feb 17, 2026
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods.
- ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution
Yahia Alqurnawi, Preetom Biswas, Anmol Rao, Tejas Anvekar, Chitta Baral · Feb 17, 2026
Multimodal Large Language Models (mLLMs) are often used to answer questions over structured data such as tables represented in Markdown, JSON, or images.
- GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou · Feb 17, 2026
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering.
- ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026
In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.
- Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos
Laura De Grazia, Danae Sánchez Villegas, Desmond Elliott, Mireia Farrús, Mariona Taulé · Feb 17, 2026
Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.
- Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac
Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero · Feb 17, 2026
Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline.
- Causal Effect Estimation with Latent Textual Treatments
Omri Feldman, Amar Venugopal, Jann Spiess, Amir Feder · Feb 17, 2026
Understanding the causal effects of text on downstream outcomes is a central task in many applications.
- Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026
Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
- Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and…
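APD, as typically defined in LSCD work, is the mean cosine distance over all cross-period pairs of contextual usage embeddings for a target word. A minimal pure-Python sketch with toy 2-D vectors (not this paper's setup):

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

def apd(embs_t1, embs_t2):
    """Average Pairwise Distance between usage embeddings from two periods."""
    pairs = [(u, v) for u in embs_t1 for v in embs_t2]
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Orthogonal usage clusters across periods -> maximal change signal
print(apd([(1.0, 0.0)], [(0.0, 1.0)]))  # 1.0
# Identical usages -> no change signal
print(apd([(1.0, 0.0)], [(1.0, 0.0)]))  # 0.0
```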
- Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU
Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili · Feb 17, 2026
Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy.
- A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
Noa Linder, Meirav Segal, Omer Antverg, Gil Gekker, Tomer Fichman · Feb 17, 2026
Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use.
- Revisiting Northrop Frye's Four Myths Theory with Large Language Models
Edirlei Soares de Lima, Marco A. Casanova, Antonio L. Furtado · Feb 17, 2026
Northrop Frye's theory of four fundamental narrative genres (comedy, romance, tragedy, satire) has profoundly influenced literary criticism, yet computational approaches to his framework have focused primarily on narrative patterns rather than…
- LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Ahmed Khaled Khamis, Hesham Ali · Feb 17, 2026
Despite the advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic -- the most widely…
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng · Feb 17, 2026
Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and…
- Clinically Inspired Symptom-Guided Depression Detection from Emotion-Aware Speech Representations
Chaithra Nerella, Chiranjeevi Yarra · Feb 17, 2026
Depression manifests through a diverse set of symptoms such as sleep disturbance, loss of interest, and concentration difficulties.
- Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL
Yihan Wang, Peiyu Liu, Runyu Chen, Wei Xu · Feb 17, 2026
Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries.
- RUVA: Personalized Transparent On-Device Graph Reasoning
Gabriele Conte, Alessio Mattiace, Gianni Carmosino, Potito Aghilar, Giovanni Servedio · Feb 17, 2026
We propose Ruva, the first "Glass Box" architecture designed for Human-in-the-Loop Memory Curation.
- jina-embeddings-v5-text: Task-Targeted Embedding Distillation
Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther · Feb 17, 2026
Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size.
- Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
- ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper · Feb 17, 2026
ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks.
- ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang · Feb 17, 2026
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation.
- DependencyAI: Detecting AI Generated Text through Dependency Parsing
Sara Ahmed, Tracy Hammond · Feb 17, 2026
To increase interpretability, we analyze feature importance to reveal syntactic structures that distinguish AI-generated from human-written text.
- Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination
Xiangyan Chen, Yujian Gan, Matthew Purver · Feb 17, 2026
The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems.
- LuxMT Technical Report
Nils Rehlinger · Feb 17, 2026
To assess translation performance, we construct a novel benchmark covering the LB-FR and LB-EN directions using human-translated data from Luci, a tourist magazine about Luxembourg.
- Towards Expectation Detection in Language: A Case Study on Treatment Expectations in Reddit
Aswathy Velutharambath, Amelie Wührl · Feb 17, 2026
Patients' expectations towards their treatment have a substantial effect on the treatments' success.
- In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations
Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu · Feb 17, 2026
Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms.
- TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen · Feb 17, 2026
TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation.
- Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs
Joonatan Laato, Veera Schroderus, Jenna Kanerva, Jenni Kauppi, Virpi Lummaa · Feb 17, 2026
We annotate a gold-standard set to allow for a reliable evaluation, and then test whether large language models can apply the same schema at scale.
- World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026
Web agents based on large language models have demonstrated promising capability in automating web tasks.
- The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information…
- Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language
Prathamesh Devadiga, Paras Chopra · Feb 17, 2026
Can large language models converse in languages virtually absent from their training data?
- Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
- Far Out: Evaluating Language Models on Slang in Australian and Indian English
Deniz Kaya Dilsiz, Dipankar Srirag, Aditya Joshi · Feb 17, 2026
We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models.
- NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering
Rong Fu, Yang Li, Zeyu Zhang, Jiekai Wu, Yaohua Liu · Feb 17, 2026
Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.
- Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026
To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026
Using large-scale observational evaluations, with 5k observational and 2k newly sampled data points on model performance, we estimate capability boundaries (high conditional quantiles of benchmark scores as a function of log pre-training FLOPs) via…
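A crude version of such a capability boundary is an empirical high quantile of benchmark scores within bins of log-FLOPs. The sketch below uses synthetic points and a naive per-bin quantile; the paper's estimator is presumably more sophisticated:

```python
from collections import defaultdict

def boundary_by_bin(points, q=0.9, bin_width=1.0):
    """points: (log10_flops, score) pairs; returns per-bin empirical q-quantile."""
    bins = defaultdict(list)
    for logf, score in points:
        bins[int(logf // bin_width)].append(score)
    out = {}
    for b, scores in sorted(bins.items()):
        scores.sort()
        idx = min(len(scores) - 1, int(q * len(scores)))
        out[b * bin_width] = scores[idx]
    return out

# Synthetic: scores rise with compute; the boundary tracks the top of each bin
pts = [(22.1, 0.30), (22.6, 0.45), (23.2, 0.55), (23.8, 0.70), (23.9, 0.62)]
print(boundary_by_bin(pts))
```

Tracking a high conditional quantile rather than the mean captures what the *best* models at a given compute budget achieve, which is what "capability boundary" suggests.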
- Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory
Zihao Tang, Xin Yu, Ziyu Xiao, Zengxuan Wen, Zelin Li · Feb 17, 2026
Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.
- Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin · Feb 17, 2026
Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice.
- The Information Geometry of Softmax: Probing and Steering
Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch · Feb 17, 2026
This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces.
- FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026
Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.