- Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities
Shankar Padmanabhan, Mustafa Omer Gul, Tanya Goyal · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff
Patrick Pynadath, Ruqi Zhang · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs
Sean Trott, Samuel Taylor, Cameron Jones, James A. Michaelov, Pamela D. Rivière · Feb 17, 2026 · Citations: 0
Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition--such as the theory that mental state reasoning emerges in part from language exposure--and our understanding of LMs…
- Activation Steering via Generative Causal Mediation
Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
Bradley McDanel, Steven Li, Harshit Khaitan · Feb 17, 2026 · Citations: 0
This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks.
- Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0
Pairwise Preference Expert Verification
While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
- Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin · Feb 17, 2026 · Citations: 0
Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%).
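The reported 95% CI for 105/120 correct is consistent with a Wilson score interval; a quick independent check (standard textbook formula, not taken from the paper):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(105, 120)
print(f"{105/120:.1%} accuracy, 95% CI: {lo:.1%}-{hi:.1%}")
# → 87.5% accuracy, 95% CI: 80.4%-92.3%
```

The endpoints reproduce the paper's reported 80.4-92.3% interval.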
- A Curious Class of Adpositional Multiword Expressions in Korean
Junghyun Min, Na-Rae Han, Jena D. Hwang, Nathan Schneider · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha · Feb 17, 2026 · Citations: 0
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.
- The Geometric Anatomy of Capability Acquisition in Transformers
Jayadev Billa · Feb 17, 2026 · Citations: 0
On Pythia-2.8B, a logical deduction task that is genuinely hard for the model shows a precursor gap of ~49K training steps, while easy benchmarks show none.
- DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain · Feb 17, 2026 · Citations: 0
We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models.
- Avey-B
Devang Acharya, Mohammad Hammoud · Feb 17, 2026 · Citations: 0
Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more…
- Intent Laundering: AI Safety Datasets Are Not What They Seem
Shahriar Golchin, Marc Wetter · Feb 17, 2026 · Citations: 0
Red Team
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
- Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
Suhyung Jang, Ghang Lee, Jaekun Lee, Hyunjun Lee · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu · Feb 17, 2026 · Citations: 0
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods.
- ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution
Yahia Alqurnawi, Preetom Biswas, Anmol Rao, Tejas Anvekar, Chitta Baral · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou · Feb 17, 2026 · Citations: 0
Long Horizon
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering.
- ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026 · Citations: 0
Pairwise Preference
In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.
- Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos
Laura De Grazia, Danae Sánchez Villegas, Desmond Elliott, Mireia Farrús, Mariona Taulé · Feb 17, 2026 · Citations: 0
Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.
- Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac
Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero · Feb 17, 2026 · Citations: 0
Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline.
- Causal Effect Estimation with Latent Textual Treatments
Omri Feldman, Amar Venugopal, Jann Spiess, Amir Feder · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026 · Citations: 0
Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
- Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0
Pairwise Preference
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and…
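APD, as commonly defined in the LSCD literature (the snippet names the metric but not its formula), averages cosine distances over all cross-period pairs of a target word's contextual embeddings; a minimal sketch:

```python
import numpy as np

def apd(embs_t1: np.ndarray, embs_t2: np.ndarray) -> float:
    """Average Pairwise Distance: mean cosine distance between contextual
    embeddings of a target word sampled from two time periods.
    embs_t1: (n1, d) array, embs_t2: (n2, d) array."""
    a = embs_t1 / np.linalg.norm(embs_t1, axis=1, keepdims=True)
    b = embs_t2 / np.linalg.norm(embs_t2, axis=1, keepdims=True)
    # cosine distance = 1 - cosine similarity, averaged over all cross-period pairs
    return float(np.mean(1.0 - a @ b.T))
```

Higher APD indicates greater semantic change between the two periods; identical embedding sets yield 0.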
- Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU
Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
Noa Linder, Meirav Segal, Omer Antverg, Gil Gekker, Tomer Fichman · Feb 17, 2026 · Citations: 0
Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use.
- Revisiting Northrop Frye's Four Myths Theory with Large Language Models
Edirlei Soares de Lima, Marco A. Casanova, Antonio L. Furtado · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Ahmed Khaled Khamis, Hesham Ali · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng · Feb 17, 2026 · Citations: 0
Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% (ρ_T=1.0, top-p=1.0) and 3.69%…
- Clinically Inspired Symptom-Guided Depression Detection from Emotion-Aware Speech Representations
Chaithra Nerella, Chiranjeevi Yarra · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL
Yihan Wang, Peiyu Liu, Runyu Chen, Wei Xu · Feb 17, 2026 · Citations: 0
Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries.
- RUVA: Personalized Transparent On-Device Graph Reasoning
Gabriele Conte, Alessio Mattiace, Gianni Carmosino, Potito Aghilar, Giovanni Servedio · Feb 17, 2026 · Citations: 0
We propose Ruva, the first "Glass Box" architecture designed for Human-in-the-Loop Memory Curation.
- jina-embeddings-v5-text: Task-Targeted Embedding Distillation
Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther · Feb 17, 2026 · Citations: 0
Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size.
- Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0
Demonstrations
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
- ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper · Feb 17, 2026 · Citations: 0
ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks.
- ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DependencyAI: Detecting AI Generated Text through Dependency Parsing
Sara Ahmed, Tracy Hammond · Feb 17, 2026 · Citations: 0
To increase interpretability, we analyze feature importance to reveal syntactic structures that distinguish AI-generated from human-written text.
- Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination
Xiangyan Chen, Yujian Gan, Matthew Purver · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LuxMT Technical Report
Nils Rehlinger · Feb 17, 2026 · Citations: 0
To assess translation performance, we construct a novel benchmark covering LB-FR, LB-EN, and LB-FR using human-translated data from Luci, a tourist magazine about Luxembourg.
- Towards Expectation Detection in Language: A Case Study on Treatment Expectations in Reddit
Aswathy Velutharambath, Amelie Wührl · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations
Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu · Feb 17, 2026 · Citations: 0
Pairwise Preference
Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms.
- TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen · Feb 17, 2026 · Citations: 0
TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation.
- Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs
Joonatan Laato, Veera Schroderus, Jenna Kanerva, Jenni Kauppi, Virpi Lummaa · Feb 17, 2026 · Citations: 0
We annotate a gold-standard set to allow for a reliable evaluation, and then test whether large language models can apply the same schema at scale.
- SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition
Youness Dkhissi, Valentin Vielzeuf, Elys Allesiardo, Anthony Larcher · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0
Multi Agent
To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement.
- The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0
Pairwise Preference Multi Agent
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…
- Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language
Prathamesh Devadiga, Paras Chopra · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0
Demonstrations
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
- Far Out: Evaluating Language Models on Slang in Australian and Indian English
Deniz Kaya Dilsiz, Dipankar Srirag, Aditya Joshi · Feb 17, 2026 · Citations: 0
We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models.
- Alignment as Iatrogenesis: Pastoral Power, Collective Pathology, and the Structural Limits of Monolingual Safety Evaluation
Hiroki Fukui · Feb 17, 2026 · Citations: 0
- NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering
Rong Fu, Yang Li, Zeyu Zhang, Jiekai Wu, Yaohua Liu · Feb 17, 2026 · Citations: 0
Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.
- Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0
Rubric Rating
To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026 · Citations: 0
Using large-scale observational evaluations with 5k observational and 2k newly sampled data points on model performance, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via…
- Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory
Zihao Tang, Xin Yu, Ziyu Xiao, Zengxuan Wen, Zelin Li · Feb 17, 2026 · Citations: 0
Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.
- Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin · Feb 17, 2026 · Citations: 0
Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice.
- The Information Geometry of Softmax: Probing and Steering
Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch · Feb 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0
Long Horizon
Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.