- How to Train Your Long-Context Visual Document Model
Austin Veselka · Feb 16, 2026
Pairwise Preference
We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context (LC) evaluations and ablations to bridge this gap, and achieve state-of-the-art performance.
- Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud · Feb 16, 2026
Multi Agent
Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.
- OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape · Feb 16, 2026
Tool Use
Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
- Weight-Space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li · Feb 16, 2026
We evaluate the method on 500 LoRA adapters (400 clean, 100 poisoned) for Llama-3.2-3B, trained on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE.
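The abstract describes detecting backdoors from adapter weights alone; a minimal sketch of what weight-space features for a LoRA adapter might look like (the function name, feature choice, and shapes are illustrative assumptions, not the paper's method):

```python
import numpy as np

def lora_weight_features(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Illustrative weight-space features for one LoRA adapter.

    A: (r, d_in) down-projection, B: (d_out, r) up-projection.
    The effective update is B @ A; its singular-value spectrum is a
    natural weight-space summary a backdoor detector could consume.
    """
    delta = B @ A                              # effective weight update
    s = np.linalg.svd(delta, compute_uv=False)
    s = s / (s.sum() + 1e-12)                  # normalise the spectrum
    top_mass = s[0]                            # mass on the leading direction
    entropy = -(s * np.log(s + 1e-12)).sum()   # spread of the spectrum
    return np.array([top_mass, entropy])
```

A classifier trained on such features across many clean and poisoned adapters is one plausible shape for a weight-space detector.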
- AIC CTU@AVerImaTeC: dual-retriever RAG for image-text fact checking
Herbert Ullrich, Jan Drchal · Feb 16, 2026
In this paper, we present our 3rd place system in the AVerImaTeC shared task, which combines our last year's retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module.
- ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan · Feb 16, 2026
ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel · Feb 16, 2026
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks.
- Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik · Feb 16, 2026
Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models.
- CGRA-DeBERTa: Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding
Tahir Hussain, Saddam Hussain Khan · Feb 16, 2026
The qualitative evaluation noted better extraction, discrimination, and theological precision.
- Symmetry in language statistics shapes the geometry of model representations
Dhruva Karkada, Daniel J. Korchinski, Andres Nava, Matthieu Wyart, Yasaman Bahri · Feb 16, 2026
Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood.
- Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026
Among discrete diffusion approaches, masked diffusion currently dominates, largely driven by strong perplexity on language-modeling benchmarks.
- Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
Ruoxi Liu, Philipp Koehn · Feb 16, 2026
This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs).
- Cold-Start Personalization via Training-Free Priors from Structured World Models
Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du · Feb 16, 2026
Pairwise Preference
Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
- Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao · Feb 16, 2026
News recommendation plays a critical role in online news platforms by helping users discover relevant content.
- Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System
Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar · Feb 16, 2026
Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback.
- Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026
Critique Edit
We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights (our target use case) requires decomposing it into executable steps over structured tools.
- BFS-PO: Best-First Search for Large Reasoning Models
Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara · Feb 16, 2026
Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
- Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
Matteo Rinaldi, Rossella Varvara, Viviana Patti · Feb 16, 2026
We present "Testimole-Conversational", a massive collection of discussion-board messages in the Italian language.
- Learning State-Tracking from Code Using Linear RNNs
Julien Siems, Riccardo Grazzi, Kirill Kalinin, Hitesh Ballani, Babak Rahmani · Feb 16, 2026
In recent years, state-tracking tasks, particularly permutation composition, have become a testbed for understanding the limits of sequence-model architectures such as Transformers and RNNs (linear and non-linear).
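As a concrete instance of the permutation-composition testbed the abstract mentions, here is a hypothetical generator for one state-tracking example (the task framing is an assumption; the paper's exact setup may differ):

```python
import random

def permutation_tracking_example(n: int = 3, length: int = 5, seed: int = 0):
    """Generate a sequence of random permutations of {0..n-1} together with
    the running composition a sequence model would have to track."""
    rng = random.Random(seed)
    state = tuple(range(n))                  # start from the identity
    seq, targets = [], []
    for _ in range(length):
        p = list(range(n))
        rng.shuffle(p)
        p = tuple(p)
        state = tuple(state[i] for i in p)   # compose: apply p to the state
        seq.append(p)
        targets.append(state)
    return seq, targets
```

A model reads the permutations in `seq` and must predict the running state in `targets`; exact composition is what separates architectures that can track state from those that cannot.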
- Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri · Feb 16, 2026
Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces.
- Overthinking Loops in Agents: A Structural Risk via MCP Tools
Yohan Lee, Jisoo Jang, Seoyeon Choi, Sangyeop Kim, Seungtaek Choi · Feb 16, 2026
Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages.
- A Geometric Analysis of Small-sized Language Model Hallucinations
Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro · Feb 16, 2026
Long Horizon
Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings.
- Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff · Feb 16, 2026
Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
- Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026
Pairwise Preference · Rubric Rating · Multi Agent
Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
- Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026
Critique Edit · Long Horizon
We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
- Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Lukas Mauch, Fabien Cardinaux · Feb 16, 2026
Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
- Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins
Francesco Gariboldi, Emma Franchino, Edith Haim, Gianluca Lattanzi, Alessandro Grecucci · Feb 16, 2026
Human networks show greater overlap between mathematics and anxiety than GPT-oss.
- Rethinking the Role of LLMs in Time Series Forecasting
Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Wei Zhang · Feb 16, 2026
We show that such conclusions stem from limited evaluation settings and do not hold at scale.
- LLMStructBench: Benchmarking Large Language Model Structured Data Extraction
Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner · Feb 16, 2026
We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text.
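A sketch of how such a benchmark could score one model output, assuming a simple valid-JSON-plus-required-keys metric (the actual LLMStructBench scoring is not specified here):

```python
import json

def score_json_output(raw: str, required_keys: set) -> dict:
    """Score one model output: is it valid JSON, and what fraction of the
    required fields does it contain? (Illustrative metric only.)"""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "keys_covered": 0.0}
    if not isinstance(obj, dict):
        return {"valid_json": True, "keys_covered": 0.0}
    covered = len(required_keys & obj.keys()) / max(len(required_keys), 1)
    return {"valid_json": True, "keys_covered": covered}
```

Separating syntactic validity from field coverage makes it possible to tell whether a model fails at producing JSON at all or merely at extracting the right content.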
- Evolutionary System Prompt Learning for Reinforcement Learning in LLMs
Lunjun Zhang, Ryan Chen, Bradly C. Stadie · Feb 16, 2026
Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI.
- Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks
Lukas Struppek, Adam Gleave, Kellin Pelrine · Feb 16, 2026
Red Team
As the capabilities of large language models continue to advance, so does their potential for misuse.
- Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
Gianluca Vico, Jindřich Libovický · Feb 16, 2026
We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation.
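Tokenization parity can be read as a token-count ratio between parallel texts; a minimal sketch under that assumed definition (the paper may normalise differently), shown here with a plain whitespace tokenizer:

```python
def tokenization_parity(tokenize, nonstandard: str, standard: str) -> float:
    """Ratio of token counts for parallel texts; values far above 1.0 mean
    the non-standard orthography is tokenized less efficiently."""
    return len(tokenize(nonstandard)) / max(len(tokenize(standard)), 1)
```

With a subword tokenizer in place of `str.split`, the same ratio quantifies how much a model over-segments non-standard Piedmontese spellings relative to a reference text.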
- Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech
Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang Gu · Feb 16, 2026
Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression.
- Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?
Matteo Gay, Coleman Haley, Mario Giulianelli, Edoardo Ponti · Feb 16, 2026
The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance.
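The UID statistic the abstract refers to can be sketched as the variance of per-token surprisal; this minimal version assumes token probabilities from a language model are already available:

```python
import math

def surprisal_variance(token_probs):
    """Variance of token surprisals -log2 p(w_t | context); perfectly
    uniform information density gives zero variance."""
    s = [-math.log2(p) for p in token_probs]
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)
```

Comparing this variance across utterances grounded in perception versus discourse is one way to operationalise the question the title poses.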
- Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao · Feb 16, 2026
Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment.
- BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique · Feb 16, 2026
Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity.