- Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations
Joschka Braun · Feb 19, 2026
Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time.
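The mechanism described above — adding a fixed direction to activations at inference time — can be sketched in a few lines. Everything here (the toy hidden size, the vector values, the strength `alpha`) is hypothetical and purely illustrative, not taken from the paper:

```python
# Minimal sketch of steering: add alpha * steering_vec to the activations
# at one layer during inference. All values below are made up.

steering_vec = [0.5, -1.0, 0.0, 2.0]   # a "learned" direction (assumed given)
alpha = 2.0                            # steering strength

def steer(activations):
    """Shift every token position's activation by alpha * steering_vec."""
    return [[a + alpha * s for a, s in zip(row, steering_vec)]
            for row in activations]

acts = [[0.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0]]          # shape (seq_len=2, d_model=4)
steered = steer(acts)

# Every position receives the same shift, regardless of its content:
assert steered[0] == [1.0, -2.0, 0.0, 4.0]
assert steered[1] == [2.0, -1.0, 1.0, 5.0]
```

The key property — the same bias applied uniformly at every position — is exactly what makes the method lightweight, and plausibly also what limits it when the desired behavior depends on context.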
- ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization
João N. Cardoso, Arlindo L. Oliveira, Bruno Martins · Feb 19, 2026
Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them.
- Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026
Web Browsing
Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.
- On the scaling relationship between cloze probabilities and language model next-token prediction
Cassandra L. Jacobs, Morgan Grobol · Feb 19, 2026
While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence.
- TFL: Targeted Bit-Flip Attack on Large Language Model
Jingkai Guo, Chaitali Chakrabarti, Deliang Fan · Feb 19, 2026
Large language models (LLMs) are increasingly deployed in safety and security critical applications, raising concerns about their robustness to model parameter fault injection attacks.
- Neural Synchrony Between Socially Interacting Language Models
Zhining Zhang, Wentao Zhu, Chi Han, Yizhou Wang, Heng Ji · Feb 19, 2026
Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction.
- QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration
Meng Ye, Xiao Lin, Georgina Lukoczki, Graham W. Lederer, Yi Yao · Feb 19, 2026
Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types.
- Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026
Long Horizon
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
- CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
- What Language is This? Ask Your Tokenizer
Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel · Feb 19, 2026
Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models.
- Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld · Feb 19, 2026
Pairwise Preference
Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order.
- Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen · Feb 19, 2026
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries.
- Unmasking the Factual-Conceptual Gap in Persian Language Models
Alireza Sakhaeirad, Ali Ma'manpoosh, Arshia Hemmat · Feb 19, 2026
While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms.
- The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?
Jayadev Billa · Feb 19, 2026
Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades.
- Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026
Pairwise Preference Web Browsing
Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold.
- KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi · Feb 19, 2026
Rubric Rating Long Horizon
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks.
- Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Jyotin Goel, Souvik Maji, Pratik Mazumder · Feb 19, 2026
Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates.
- Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026
Multi Agent
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
- Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan · Feb 19, 2026
Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
- The Anxiety of Influence: Bloom Filters in Transformer Attention Heads
Peter Balogh · Feb 19, 2026
Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT-2 small, medium, …).
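The membership-testing behavior the title alludes to is the job of a classic Bloom filter: a compact bit array that answers "have I seen this item?" with no false negatives and a small false-positive rate. A toy implementation (the parameters `m` and `k` are arbitrary choices for illustration, not values from the paper):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions setting bits in an m-bit array."""

    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _indices(self, item):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits |= 1 << idx

    def __contains__(self, item):
        # All k bits set => "probably seen"; any bit clear => definitely not.
        return all(self.bits >> idx & 1 for idx in self._indices(item))

bf = BloomFilter()
for tok in ["the", "cat", "sat"]:
    bf.add(tok)

assert "cat" in bf    # added tokens always test positive
assert "the" in bf
```

Unseen tokens usually test negative, but false positives are possible — the same asymmetry one would expect from an attention head doing approximate membership testing over the context.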
- Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics
Baris Karacan, Barbara Di Eugenio, Patrick Thornton · Feb 19, 2026
Clinical free-text notes contain vital patient information.
- Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
Yichen Lu, Siwei Nie, Minlong Lu, Xudong Yang, Xiaobo Zhang · Feb 19, 2026
Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning.
- What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data
Dimitri Staufer, Kirsten Morehouse · Feb 19, 2026
Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions.
- Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini · Feb 19, 2026
Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings.
- Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers
Nusrat Jahan Lia, Shubhashis Roy Dipta · Feb 19, 2026
The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior.
- PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions
Greta Damo, Stéphane Petiot, Elena Cabrio, Serena Villata · Feb 19, 2026
The increasing volume of hate speech on online platforms poses significant societal challenges.
- Entropy-Based Data Selection for Language Models
Hongming Li, Yang Liu, Chao Huang · Feb 19, 2026
Modern language models (LMs) increasingly require two critical resources: computation and data.
- ABCD: All Biases Come Disguised
Mateusz Nowak, Xavier Cadet, Peter Chin · Feb 19, 2026
Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions.
- AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue
Adib Sakhawat, Fardeen Sadab, Rakin Shahriar · Feb 19, 2026
Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions.
- Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study
Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik · Feb 19, 2026
Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation.
- Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics
Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya · Feb 19, 2026
This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings.
- Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Norman Meuschke, Bela Gipp · Feb 19, 2026
Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants.
- DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing
René Brinkhege, Prahlad Menon · Feb 19, 2026
We therefore outline an evaluation methodology to assess security, utility, and performance trade-offs under benign and adversarial querying as a basis for future empirical work on systematically governed LLM access to multi-party data spaces.
- The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour
Leonidas Zotos, Hedderik van Rijn, Malvina Nissim · Feb 19, 2026
When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice.
- RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
Yiming Zhang, Siyue Zhang, Junbo Zhao, Chen Zhao · Feb 19, 2026
We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestions, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories.
- WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović · Feb 19, 2026
We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages.
- Bayesian Optimality of In-Context Learning with Selective State Spaces
Di Zhang, Jiaqi Xing · Feb 19, 2026
Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm selective SSMs converge faster to Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent states.
- Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser · Feb 19, 2026
The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison.
- ArXiv-to-Model: A Practical Study of Scientific LM Training
Anuj Gupta · Feb 19, 2026
While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented.
- Representation Collapse in Machine Translation Through the Lens of Angular Dispersion
Evgeniia Tokarchuk, Maya K. Nachesa, Sergey Troshin, Vlad Niculae · Feb 19, 2026
Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets.
- Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective
Yukun Chen, Xinyu Zhang, Jialong Tang, Yu Wan, Baosong Yang · Feb 19, 2026
While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content.
- Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026
Rubric Rating
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
- Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
Bianca Raimondi, Maurizio Gabbrielli · Feb 19, 2026
The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics.
- From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
Yi-Chih Huang · Feb 19, 2026
Demonstrations
Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.
- What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi · Feb 19, 2026
Expert Verification
As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy.
- The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Dusan Bosnjakovic · Feb 19, 2026
Multi Agent
As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures…
- Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests
Anton Dzega, Aviad Elyashar, Ortal Slobodin, Odeya Cohen, Rami Puzis · Feb 19, 2026
Their interpretations are highly consistent with those of human experts.
- BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026
Long Horizon
However, such errors have rarely been captured by existing benchmarks.
- Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Akira Sakai, Yuma Ichikawa · Feb 19, 2026
Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck.
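The bottleneck described above is simple arithmetic: if magnitudes compress arbitrarily well but each sign still costs one raw bit, total storage can never drop below one bit per weight. A sketch with assumed numbers (none taken from the paper):

```python
# Illustrative storage accounting for sub-bit compression. The sign bit is
# a fixed additive cost unless it is itself compressed.

def bits_per_weight(b_mag: float, b_sign: float = 1.0) -> float:
    """Total storage per weight: compressed magnitude plus sign cost."""
    return b_mag + b_sign

# Even with magnitudes at 0.25 bits/weight, raw signs keep storage > 1 bit:
assert bits_per_weight(0.25) == 1.25

# Only by compressing the signs themselves (e.g. hypothetically to 0.5 bits
# via entropy coding over sign statistics) does sub-bit storage become possible:
assert bits_per_weight(0.25, b_sign=0.5) == 0.75
```

This is why the persistence of randomly initialized signs matters: if signs carry near-maximal entropy, they resist compression and pin the total near that one-bit floor.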
- ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Hussein S. Al-Olimat, Ahmad Alshareef · Feb 19, 2026
While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification.
- RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Yunseok Han, Yejoon Lee, Jaeyoung Do · Feb 19, 2026
To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions.
- Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
Deepak Uniyal, Md Abul Bashar, Richi Nayak · Feb 19, 2026
Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages.
- Large Language Models Persuade Without Planning Theory of Mind
Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber · Feb 19, 2026
Long Horizon
A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks.
- ReIn: Conversational Error Recovery with Reasoning Inception
Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma · Feb 19, 2026
Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors.
- Arcee Trinity Large Technical Report
Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard · Feb 19, 2026
We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token.
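The two figures quoted above imply the model's sparsity directly: only a small fraction of the 400B parameters is active for any given token. A quick check of the arithmetic:

```python
# Activation ratio of a sparse MoE from the reported parameter counts.
total_params = 400e9    # 400B total parameters (from the abstract)
active_params = 13e9    # 13B activated per token (from the abstract)

active_fraction = active_params / total_params
assert active_fraction == 0.0325   # 3.25% of parameters active per token
```

So per-token compute scales roughly like a 13B dense model, while capacity scales with the full 400B.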
- Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
Serin Kim, Sangam Lee, Dongha Lee · Feb 19, 2026
Pairwise Preference
Large language models have advanced web agents, yet current agents lack personalization capabilities.
- Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang · Feb 19, 2026
To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories.
- Exploring LLMs for User Story Extraction from Mockups
Diego Firmenich, Leandro Antonelli, Bruno Pazos, Fabricio Lozada, Leonardo Morales · Feb 19, 2026
User stories are one of the most widely used artifacts in the software industry to define functional requirements.
- Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling
Divyam Madaan, Sumit Chopra, Kyunghyun Cho · Feb 19, 2026
Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference.
- HQFS: Hybrid Quantum Classical Financial Security with VQC Forecasting, QUBO Annealing, and Audit-Ready Post-Quantum Signing
Srikumar Nayak · Feb 19, 2026
Financial risk systems usually follow a two-step routine: a model predicts return or risk, and then an optimizer makes a decision such as a portfolio rebalance.