- Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Ming Li, Xirui Li, Tianyi Zhou · Feb 15, 2026
Multi Agent
As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems?
- MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir · Feb 15, 2026
Tool Use
The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enables third-party servers.
- Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima · Feb 15, 2026
Pairwise Preference
The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark–Bright").
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026
16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs.
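The ImgR@3 figure above is a standard retrieval recall-at-k metric. A minimal, generic sketch (the function and variable names here are illustrative, not the paper's code):

```python
def recall_at_k(ranked_ids, gold_ids, k=3):
    """Fraction of queries whose gold item appears in the top-k retrieved list."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

# Two queries: the gold image is retrieved in the top 3 for the first only.
score = recall_at_k([["a", "b", "c"], ["x", "y", "z"]], ["b", "q"], k=3)
```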
- Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek · Feb 15, 2026
Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along w
- Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness
Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini · Feb 15, 2026
Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent.
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026
Expert Verification Critique Edit
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- A Comparative Analysis of Social Network Topology in Reddit and Moltbook
Yiming Zhu, Gareth Tyson, Pan Hui · Feb 14, 2026
Recent advances in agent-mediated systems have enabled a new paradigm of social network simulation, where AI agents interact with human-like autonomy.
- From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026
Critique Edit
We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.
- ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics
Hend Al-Khalifa, Nadia Ghezaiel, Maria Bounnit, Hend Hamed Alhazmi, Noof Abdullah Alfear · Feb 14, 2026
We benchmark 40 model configurations, including traditional machine learning, transformer-based models, and large language models.
- OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery
Qi Liu, Ruochen Hao, Can Li, Wanjing Ma · Feb 14, 2026
Multi Agent
We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.
- Small Reward Models via Backward Inference
Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi · Feb 14, 2026
Rubric Rating
However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility.
- Semantic Chunking and the Entropy of Natural Language
Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks · Feb 13, 2026
The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached.
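Shannon's one-bit-per-character figure can be made concrete with a naive unigram estimate, which ignores context and therefore overestimates the true entropy rate. A minimal sketch (not the paper's method):

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Per-character entropy in bits under a context-free unigram model.
    This upper-bounds the true entropy rate, which conditions on context."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# English letter frequencies give roughly 4 bits/char under this model;
# conditioning on longer contexts drives estimates toward ~1 bit/char.
```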
- OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report
Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen · Feb 13, 2026
We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks.
- SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026
Pairwise Preference
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
- Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts
Kais Allkivi · Feb 13, 2026
Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets.
- Buy versus Build an LLM: A Decision Framework for Governments
Jiahao Lu, Ziwei Xu, William Tjhi, Junnan Li, Antoine Bosselut · Feb 13, 2026
This paper provides a strategic framework for making this decision by evaluating these options across dimensions including sovereignty, safety, cost, resource capability, cultural fit, and sustainability.
- BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026
Web Browsing
Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
- PMG: Parameterized Motion Generator for Human-like Locomotion Control
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu · Feb 13, 2026
Long Horizon
Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
- Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin · Feb 13, 2026
As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency.
- propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries · Feb 12, 2026
We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose,
- Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026
Long Horizon
We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.
- "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou · Feb 12, 2026
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments.
- GPT-4o Lacks Core Features of Theory of Mind
John Muchovej, Amanda Royka, Shane Lee, Julian Jara-Ettinger · Feb 12, 2026
Research into this question has focused on evaluating LLMs against benchmarks and found success across a range of social tasks.
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026
Expert Verification
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distil
- Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models
Yuzhe Shang, Pengzhi Gao, Wei Liu, Jian Luan, Jinsong Su · Feb 12, 2026
Open large language models (LLMs) have demonstrated steadily improving multilingual capabilities in recent years.
- Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance
Yunchong Huang, Gianni Barlacchi, Sandro Pezzelle · Feb 12, 2026
Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved.
- Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026
Tool Use
To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
- TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · Feb 12, 2026
Long Horizon
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
- Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov · Feb 12, 2026
Although Arabic Dialect Identification (ADI) was long modeled as a single-label classification task, recent work has argued that it should be framed as a multi-label classification task.
- OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li · Feb 12, 2026
Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset.
- Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis · Feb 12, 2026
Red Team
Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems.
- When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration
Jayadev Billa · Feb 12, 2026
Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliabili
- When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Zachary Pedram Dadfar · Feb 11, 2026
Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear.
- Embedding Inversion via Conditional Masked Diffusion Language Models
Han Xiao · Feb 11, 2026
We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation.
- When Fusion Helps and When It Breaks: View-Aligned Robustness in Same-Source Financial Imaging
Rui Ma · Feb 11, 2026
To control label ambiguity from near-zero moves, we use an ex-post minimum-movement threshold min_move (tau) based on realized absolute next-day return, defining an offline benchmark on the subset where the absolute next-day return is at le
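The minimum-movement filter described above amounts to masking out samples whose realized absolute next-day return falls below the threshold. A hedged sketch (min_move/tau follows the abstract's naming; everything else is illustrative):

```python
def apply_min_move_filter(returns, tau):
    """Keep only samples whose absolute realized next-day return is at least tau,
    removing near-zero moves whose direction label would be ambiguous."""
    return [r for r in returns if abs(r) >= tau]

sample = [0.002, -0.0001, 0.015, -0.03, 0.0004]
filtered = apply_min_move_filter(sample, tau=0.001)
```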
- LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
Ivan Vulić, Adam Grycner, Quentin de Laroussilhe, Jonas Pfeiffer · Feb 11, 2026
Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT).
- Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Mingyu Cao, Alvaro H. C. Correia, Christos Louizos, Shiwei Liu, Lu Yin · Feb 11, 2026
Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and
- The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026
Web Browsing
The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.
- Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026
Pairwise Preference Tool Use
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
- TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation
Steven Liu, Jane Luo, Xin Zhang, Aofan Liu, Hao Liu · Feb 11, 2026
Current evaluations systematically overlook the third goal.
- The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh · Feb 10, 2026
Pairwise Preference Rubric Rating
To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-c
- UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026
Long Horizon
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.
- Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis
Haoshen Wang, Xueli Zhong, Bingbing Lin, Jia Huang, Xingduo Pan · Feb 9, 2026
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies.
- Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI
Ziyan Wang, Longlong Ma · Feb 9, 2026
Critique Edit
In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, there
- Language Modeling and Understanding Through Paraphrase Generation and Detection
Jan Philip Wahle · Feb 9, 2026
Language enables humans to share knowledge, reason about the world, and pass on strategies for survival and innovation across generations.
- Document Reconstruction Unlocks Scalable Long-Context RLVR
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin · Feb 9, 2026
Rubric Rating
However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.