- Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- How to Train Your Long-Context Visual Document Model
Austin Veselka · Feb 16, 2026 · Citations: 0
Pairwise Preference
We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art…
- Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud · Feb 16, 2026 · Citations: 0
Multi Agent
Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.
- OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape · Feb 16, 2026 · Citations: 0
Tool Use
Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
- Weight space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Kevin Zhu, Javier Ferrando, Maheep Chaudhary · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AIC CTU@AVerImaTeC: dual-retriever RAG for image-text fact checking
Herbert Ullrich, Jan Drchal · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan · Feb 16, 2026 · Citations: 0
ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding
Tahir Hussain, Saddam Hussain Khan · Feb 16, 2026 · Citations: 0
The qualitative evaluation noted better extraction, discrimination, and theological precision.
- Symmetry in language statistics shapes the geometry of model representations
Dhruva Karkada, Daniel J. Korchinski, Andres Nava, Matthieu Wyart, Yasaman Bahri · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026 · Citations: 0
Among discrete diffusion approaches, masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
- Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
Ruoxi Liu, Philipp Koehn · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Cold-Start Personalization via Training-Free Priors from Structured World Models
Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du · Feb 16, 2026 · Citations: 0
Pairwise Preference
Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
- Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System
Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar · Feb 16, 2026 · Citations: 0
Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback.
- Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026 · Citations: 0
Critique Edit
We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools…
- BFS-PO: Best-First Search for Large Reasoning Models
Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara · Feb 16, 2026 · Citations: 0
Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
- Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
Matteo Rinaldi, Rossella Varvara, Viviana Patti · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Learning State-Tracking from Code Using Linear RNNs
Julien Siems, Riccardo Grazzi, Kirill Kalinin, Hitesh Ballani, Babak Rahmani · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri · Feb 16, 2026 · Citations: 0
Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces.
- Overthinking Loops in Agents: A Structural Risk via MCP Tools
Yohan Lee, Jisoo Jang, Seoyeon Choi, Sangyeop Kim, Seungtaek Choi · Feb 16, 2026 · Citations: 0
Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages.
- A Geometric Analysis of Small-sized Language Model Hallucinations
Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro · Feb 16, 2026 · Citations: 0
Long Horizon
Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings.
- Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff · Feb 16, 2026 · Citations: 0
Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
- GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture
Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin · Feb 16, 2026 · Citations: 0
- Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0
Pairwise Preference Rubric Rating Multi Agent
Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
- Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America
Yannis Karmim, Renato Pino, Hernan Contreras, Hernan Lira, Sebastian Cifuentes · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0
Critique Edit Long Horizon
We systematically evaluate several open- and closed-weight RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
- Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch · Feb 16, 2026 · Citations: 0
Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
- Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins
Francesco Gariboldi, Emma Franchino, Edith Haim, Gianluca Lattanzi, Alessandro Grecucci · Feb 16, 2026 · Citations: 0
Human networks show greater overlap between mathematics and anxiety than GPT-oss.
- Rethinking the Role of LLMs in Time Series Forecasting
Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Wei Zhang · Feb 16, 2026 · Citations: 0
We show that such conclusions stem from limited evaluation settings and do not hold at scale.
- LLMStructBench: Benchmarking Large Language Model Structured Data Extraction
Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner · Feb 16, 2026 · Citations: 0
We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text.
- Evolutionary System Prompt Learning for Reinforcement Learning in LLMs
Lunjun Zhang, Ryan Chen, Bradly C. Stadie · Feb 16, 2026 · Citations: 0
Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI.
- Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks
Lukas Struppek, Adam Gleave, Kellin Pelrine · Feb 16, 2026 · Citations: 0
Red Team
We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models.
- Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
Gianluca Vico, Jindřich Libovický · Feb 16, 2026 · Citations: 0
We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation.
- Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech
Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang Gu · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?
Matteo Gay, Coleman Haley, Mario Giulianelli, Edoardo Ponti · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation
Hao Liu, Guangyan Li, Wensheng Zhang, Yongqiang Tang · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought
Yuling Jiao, Yanming Lai, Huazhen Lin, Wensen Ma, Houduo Qi · Feb 16, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Alignment Adapter to Improve the Performance of Compressed Deep Learning Models
Rohit Raj Rai, Abhishek Dhaka, Amit Awekar · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Wikidata Query Logs Dataset
Sebastian Walter, Hannah Bast · Feb 16, 2026 · Citations: 0
To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions.
- MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs
Gabriel Roccabruna, Olha Khomyn, Giuseppe Riccardi · Feb 16, 2026 · Citations: 0
AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution.
- Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury · Feb 16, 2026 · Citations: 0
We use a zero-shot evaluation methodology with BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning.
- Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets
Yuchen Yang, Wenze Lin, Enhao Huang, Zhixuan Chu, Hongbin Zhou · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil
Sukumar Kishanthan, Kumar Thushalika, Buddhi Jayasekara, Asela Hevapathige · Feb 16, 2026 · Citations: 0
These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.
- Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao · Feb 16, 2026 · Citations: 0
Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment.
- Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts
Buze Zhang, Jinkai Tao, Zilang Zeng, Neil He, Ali Maatouk · Feb 16, 2026 · Citations: 0
Our experiments across diverse benchmarks demonstrate that MoSLoRA consistently outperforms strong baselines, achieving up to 5.6% improvement on MATH500 and 15.9% on MAWPS.
- BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique · Feb 16, 2026 · Citations: 0
Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity.
- TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning
Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu · Feb 16, 2026 · Citations: 0
- HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation
Wen-Sheng Lien, Yu-Kai Chan, Hao-Lung Hsiao, Bo-Kai Ruan, Meng-Fen Chiang · Feb 16, 2026 · Citations: 0
Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness.
- Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation
Guangyue Peng, Zongchao Chen, Wen Luo, Yuntao Wen, Wei Li · Feb 16, 2026 · Citations: 0
Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization.
- Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models
Lance Calvin Lim Gamboa, Yue Feng, Mark Lee · Feb 16, 2026 · Citations: 0
With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by…
- Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5
Dongrui Liu, Yi Yu, Jie Zhang, Guanxu Chen, Qihao Lin · Feb 16, 2026 · Citations: 0
As the general capabilities of Large Language Models (LLMs) rapidly evolve and agentic AI proliferates, this version of the risk analysis technical report presents an updated and granular assessment of five critical dimensions: cyber…
- Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning
Qianyue Wang, Jinwu Hu, Huanxiang Lin, Bolin Chen, Zhiquan Wen · Feb 16, 2026 · Citations: 0
Inspired by human reasoning patterns, where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent-Informed Reasoning (PIR), transforming LRMs' reasoning paradigm…
- Selective Synchronization Attention
Hasi Hays · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Synthetic Reader Panels: Tournament-Based Ideation with LLM Personas for Autonomous Publishing
Fred Zimmerman · Feb 16, 2026 · Citations: 0
Pairwise Preference
We present a system for autonomous book ideation that replaces human focus groups with synthetic reader panels -- diverse collections of LLM-instantiated reader personas that evaluate book concepts through structured tournament…
- LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning
Wang Xing, Wei Song, Siyu Lin, Chen Wu, Man Wang · Feb 16, 2026 · Citations: 0
Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures demonstrate that the proposed approach consistently improves link prediction performance over strong distillation baselines, while maintaining a…
- WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)
Kiyotaka Kasubuchi, Kazuo Fukiya · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Feature Recalibration Based Olfactory-Visual Multimodal Model for Enhanced Rice Deterioration Detection
Rongqiang Zhao, Hengrui Hu, Yijing Wang, Mingchun Sun, Jie Liu · Feb 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TruthStance: An Annotated Dataset of Conversations on Truth Social
Fathima Ameen, Danielle Brown, Manusha Malgareddy, Amanul Haque · Feb 16, 2026 · Citations: 0
We provide a human-annotated benchmark of 1,500 instances across argument mining and claim-based stance detection, including inter-annotator agreement, and use it to evaluate large language model (LLM) prompting strategies.