- Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Ming Li, Xirui Li, Tianyi Zhou · Feb 15, 2026 · Citations: 0
We present the first large-scale systemic diagnosis of this AI agent society.
- FMMD: A multimodal open peer review dataset based on F1000Research
Zhenzhen Zhuang, Yuqing Fu, Jing Zhu, Zhangping Zhou, Jialiang Lin · Feb 15, 2026 · Citations: 0
Automated scholarly paper review (ASPR) has entered the coexistence phase with traditional peer review, where artificial intelligence (AI) systems are increasingly incorporated into real-world manuscript evaluation.
- MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir · Feb 15, 2026 · Citations: 0
Tool Use
The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enables third-party servers.
- Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions
Ruomeng Ding, Tianwei Gao, Thomas P. Zollo, Eitan Bachmat, Richard Zemel · Feb 15, 2026 · Citations: 0
To address this gap, we study adaptive group elicitation, a multi-round setting where an agent adaptively selects both questions and respondents under explicit query and participation budgets.
- STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts
Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart · Feb 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures
Matic Korun · Feb 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0
Expert Verification Long Horizon
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
- We can still parse using syntactic rules
Ghaly Hussein · Feb 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang · Feb 15, 2026 · Citations: 0
Tool Use
To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization.
- The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents
Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li · Feb 15, 2026 · Citations: 0
Rubric Rating
Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions.
- Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports
Dragan Stoll, Brian E. Perron, Zia Qi, Selina Steinmann, Nicole F. Eicher · Feb 15, 2026 · Citations: 0
The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human validated data.
- MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM
Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park · Feb 15, 2026 · Citations: 0
Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse…
- Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
Samir Abdaljalil, Erchin Serpedin, Hasan Kurban · Feb 15, 2026 · Citations: 0
We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings.
- GPT-5 vs Other LLMs in Long Short-Context Performance
Nima Esmi, Maryam Nezhad-Moghaddam, Fatemeh Borhani, Asadollah Shahbahrami, Amin Daemdoost · Feb 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- UniWeTok: An Unified Binary Tokenizer with Codebook Size $2^{128}$ for Unified Multimodal Large Language Model
Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li · Feb 15, 2026 · Citations: 0
- Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0
Pairwise Preference
The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark-Bright").
- Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling
Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye · Feb 15, 2026 · Citations: 0
Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines.
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026 · Citations: 0
16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs.
- A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman · Feb 15, 2026 · Citations: 0
Multi Agent
We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability.
- ROAST: Rollout-based On-distribution Activation Steering Technique
Xuanbo Su, Hao Luo, Yingfang Zhang, Lijun Zhang · Feb 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Algebraic Quantum Intelligence: A New Framework for Reproducible Machine Creativity
Kazuo Yano, Jonghyeok Lee, Tae Ishitomi, Hironobu Kawaguchi, Akira Koyama · Feb 15, 2026 · Citations: 0
We evaluate the resulting system on creative reasoning benchmarks spanning ten domains under an LLM-as-a-judge protocol.
- Character-aware Transformers Learn an Irregular Morphological Pattern Yet None Generalize Like Humans
Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney · Feb 15, 2026 · Citations: 0
Recent work has shown that encoder-decoder models can acquire irregular patterns, but evidence that they generalize these patterns like humans is mixed.
- CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese Ci Poetry
Shangqing Zhao, Yupei Ren, Yuhao Zhou, Xiaopeng Bai, Man Lan · Feb 15, 2026 · Citations: 0
To systematically evaluate and advance this capability, we introduce Chinese Cipai Variants (CCiV), a benchmark designed to assess LLM-generated Ci poetry across these three dimensions: structure, rhythm, and quality.
- Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona · Feb 15, 2026 · Citations: 0
To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search.
- GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler
Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari · Feb 15, 2026 · Citations: 0
Experiments across multiple benchmarks and two latent reasoning architectures show that GTS yields more reliable inference-time scaling than heuristic baselines, suggesting that effective latent ITS requires better-controlled and…
- Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek · Feb 15, 2026 · Citations: 0
Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along…
- Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric
Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao · Feb 15, 2026 · Citations: 0
Pairwise Preference Rubric Rating
To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which…
- From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset
Jandad Jahani, Mursal Dawodi, Jawid Ahmad Baktash · Feb 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts
Yang Liu, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li · Feb 15, 2026 · Citations: 0
Expert Verification
By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art…
- LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation
Jizheng Chen, Weiming Zhang, Xinyi Dai, Weiwen Liu, Kounianhua Du · Feb 15, 2026 · Citations: 0
Pairwise Preference
LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank…
- Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness
Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini · Feb 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- BitDance: Scaling Autoregressive Generative Models with Binary Tokens
Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu · Feb 15, 2026 · Citations: 0
- Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models
Sajjad Kachuee, Mohammad Sharifkhani · Feb 15, 2026 · Citations: 0
Experiments on selected tasks from the Massive Text Embedding Benchmark (MTEB), including semantic similarity, clustering, and duplicate question detection, demonstrate consistent performance improvements with identical training cost and…
- GRRM: Group Relative Reward Modeling for Machine Translation
Sen Yang, Shanbo Cheng, Lu Xu, Jianbing Zhang, Shujian Huang · Feb 15, 2026 · Citations: 0
Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines.
- Named Entity Recognition for Payment Data Using NLP
Srikumar Nayak · Feb 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Ali Zahedzadeh, Behnam Bahrak · Feb 15, 2026 · Citations: 0
Long Horizon
Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers. To operationalize this view, we introduce an evaluation…
- Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis
Tongze Zhang, Jun-En Ding, Melik Ozolcer, Fang-Ming Hung, Albert Chih-Chieh Yang · Feb 15, 2026 · Citations: 0
Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources.
- Neuromem: A Granular Decomposition of the Streaming Lifecycle in External Memory for LLMs
Ruicheng Zhang, Xinyi Li, Tianyi Xu, Shuhao Zhang, Xiaofei Liao · Feb 15, 2026 · Citations: 0
We present Neuromem, a scalable testbed that benchmarks External Memory Modules under an interleaved insertion-and-retrieval protocol and decomposes its lifecycle into five dimensions including memory data structure, normalization strategy,…
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0
Expert Verification Critique Edit
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars
Shuoyuan Wang, Yiran Wang, Hongxin Wei · Feb 15, 2026 · Citations: 0
We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery.
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0
Long Horizon
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
- Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning
Zhimin Zhao · Feb 15, 2026 · Citations: 0
Pairwise Preference
We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all.
- A Comparative Analysis of Social Network Topology in Reddit and Moltbook
Yiming Zhu, Gareth Tyson, Pan Hui · Feb 14, 2026 · Citations: 0
Recent advances in agent-mediated systems have enabled a new paradigm of social network simulation, where AI agents interact with human-like autonomy.
- DeepXiv-SDK: An Agentic Data Interface for Scientific Literature
Hongjin Qian, Ziyi Xia, Ze Liu, Jianlyu Chen, Kun Luo · Feb 14, 2026 · Citations: 0
LLM-agents are increasingly used to accelerate the progress of scientific research.
- Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives
Ruchira Dhar, Qiwei Peng, Anders Søgaard · Feb 14, 2026 · Citations: 0
Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.
- From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026 · Citations: 0
Critique Edit
We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.
- Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin
Thibault Clérice, Rachel Bawden, Anthony Glaise, Ariane Pinche, David Smith · Feb 14, 2026 · Citations: 0
We also produce a manually corrected gold-standard evaluation set.
- Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach
Amir Hossein Mohammadi, Ali Moeinian, Zahra Razavizade, Afsaneh Fatemi, Reza Ramezani · Feb 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics
Hend Al-Khalifa, Nadia Ghezaiel, Maria Bounnit, Hend Hamed Alhazmi, Noof Abdullah Alfear · Feb 14, 2026 · Citations: 0
It contains 10,000 samples with linguistic feature annotations across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703).
- Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0
Pairwise Preference
Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.
- Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe
Somnath Banerjee · Feb 14, 2026 · Citations: 0
Pairwise Preference Long Horizon
The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
- Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?
Grace Chang Yuan, Xiaoman Zhang, Sung Eun Kim, Pranav Rajpurkar · Feb 14, 2026 · Citations: 0
Multi Agent
Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning.
- PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0
Pairwise Preference Multi Agent
We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.
- Speculative Decoding with a Speculative Vocabulary
Miles Williams, Young D. Kwon, Rui Li, Alexandros Kouris, Stylianos I. Venieris · Feb 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind
Minyuan Ruan, Ziyue Wang, Kaiming Liu, Yunghwei Lai, Peng Li · Feb 14, 2026 · Citations: 0
Long Horizon
Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users.
- The acquisition of English irregular inflections by Yemeni L1 Arabic learners: A Universal Grammar approach
Muneef Y. Alsawsh, Mohammed Q. Shormani · Feb 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum
Yangyang Zhang, Zilong Wang, Jianbo Xu, Yongqi Chen, Chu Han · Feb 14, 2026 · Citations: 0
Expert Verification Multi Agent
Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework where domain-specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT-style…
- StackingNet: Collective Inference Across Independent AI Foundation Models
Siyang Li, Chenhao Liu, Dongrui Wu, Zhigang Zeng, Lieyun Ding · Feb 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- How Do Lexical Senses Correspond Between Spoken German and German Sign Language?
Melis Çelikkol, Wei Zhao · Feb 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery
Qi Liu, Ruochen Hao, Can Li, Wanjing Ma · Feb 14, 2026 · Citations: 0
Multi Agent
We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.