- Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility
Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas · Mar 3, 2026 · Citations: 0
As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor.
- ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout · Mar 3, 2026 · Citations: 0
- Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0
- Tucano 2 Cool: Better Open Source LLMs for Portuguese
Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf · Mar 3, 2026 · Citations: 0
- Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan · Mar 3, 2026 · Citations: 0
- Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems
Raad Khraishi, Iman Zafar, Katie Myles, Greig A Cowan · Mar 3, 2026 · Citations: 0
We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence…
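The paired episode-level bootstrap mentioned in this abstract is a standard resampling technique; a minimal sketch of it, assuming per-episode scores for a switched and a no-switch system (the function name and inputs here are illustrative, not from the paper):

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Paired episode-level bootstrap CI for the mean score difference.

    Episodes are resampled with replacement as index pairs, so the
    per-episode pairing between the two systems is preserved.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    boot_means = []
    for _ in range(n_boot):
        # Resample whole episodes, keeping each (a, b) pair intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, (lo, hi)
```

If the interval excludes zero, the switch-induced drift is unlikely to be resampling noise; how the paper aggregates across the full switch matrix is not specified in this snippet.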
- Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features
Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection
Sofiane Elguendouze, Erwan Hain, Elena Cabrio, Serena Villata · Mar 3, 2026 · Citations: 0
Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems.
- TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu · Mar 3, 2026 · Citations: 0
Red Team
Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses.
- TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Christian Greisinger, Steffen Eger · Mar 3, 2026 · Citations: 0
Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at…
- Incremental Graph Construction Enables Robust Spectral Clustering of Texts
Marko Pranjić, Boshko Koloski, Nada Lavrač, Senja Pollak, Marko Robnik-Šikonja · Mar 3, 2026 · Citations: 0
We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark.
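Spectral clustering of text embeddings starts from a similarity graph over the documents; a minimal pure-Python sketch of the k-nearest-neighbour graph construction step (the incremental variant and the subsequent Laplacian eigendecomposition from the paper are not reproduced here; function names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_graph(embeddings, k=2):
    """Symmetric k-NN similarity graph over embeddings.

    Returns a dict mapping frozenset({i, j}) edges to cosine similarity;
    using frozensets makes the graph undirected by construction.
    """
    n = len(embeddings)
    edges = {}
    for i in range(n):
        sims = sorted(((cosine(embeddings[i], embeddings[j]), j)
                       for j in range(n) if j != i), reverse=True)
        for s, j in sims[:k]:
            edges[frozenset((i, j))] = s
    return edges
```

Spectral clustering then embeds the nodes via the eigenvectors of this graph's Laplacian (the "Laplacian eigenmaps" step) before running a standard clustering algorithm such as k-means.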
- PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
Sudip Bhujel · Mar 3, 2026 · Citations: 0
Pairwise Preference Expert Verification
Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content.
- TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
Zixin Xiong, Ziteng Wang, Haotian Fan, Xinjie Zhang, Wenxuan Wang · Mar 3, 2026 · Citations: 0
While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive…
- MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling
Jinwoong Kim, Sangjin Park · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Contextualized Privacy Defense for LLM Agents
Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie · Mar 3, 2026 · Citations: 0
Long Horizon
LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability.
- ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu · Mar 3, 2026 · Citations: 0
Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods.
- Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction
Guangjun Zhang, Hu Zhang, Yazhou Han, Yue Fan, Yuhang Shao · Mar 3, 2026 · Citations: 0
Multi Agent
Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms.
- Eval4Sim: An Evaluation Framework for Persona Simulation
Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar · Mar 3, 2026 · Citations: 0
Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural…
- LaTeX Compilation: Challenges in the Era of LLMs
Tianyou Liu, Ziqiang Li, Xurui Liu, Yansong Li · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito · Mar 3, 2026 · Citations: 0
Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges…
- The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models
Fermín Moscoso del Prado Martín, Suchir Salhan · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Browser-based Open Source Assistant for Multimodal Content Verification
Rosanna Milner, Michael Foster, Olesya Razuvayevskaya, Ian Roberts, Valentin Porcellini · Mar 3, 2026 · Citations: 0
Web Browsing
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs
Prarthana Bhattacharyya, Joshua Mitton, Ralph Abboud, Simon Woodhead · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu · Mar 3, 2026 · Citations: 0
Expert Verification Long Horizon
As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment.
- OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier · Mar 3, 2026 · Citations: 0
In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction.
- From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
Weikang Shi, Houxing Ren, Junting Pan, Aojun Zhou, Ke Wang · Mar 3, 2026 · Citations: 0
Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching…
- Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration
Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing · Mar 3, 2026 · Citations: 0
However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation.
- Sensory-Aware Sequential Recommendation via Review-Distilled Representations
Yeo Chan Yoon · Mar 3, 2026 · Citations: 0
Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior.
- Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization
Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu · Mar 3, 2026 · Citations: 0
Multi Agent
Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS).
- HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse
Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar · Mar 3, 2026 · Citations: 0
Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives.
- ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs
Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tack Hwa Wong, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya · Mar 3, 2026 · Citations: 0
Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or…
- Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory
Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima · Mar 3, 2026 · Citations: 0
Benchmarks for MLLMs should measure their ability for cross-modal integration.
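As background on the IRT machinery the title refers to: classical two-parameter-logistic (2PL) IRT models the probability that an examinee answers an item correctly from a latent ability and two item parameters. A minimal sketch of the standard 2PL formula (the paper's multimodal extension is not reproduced here):

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that an examinee with latent ability theta
    answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When theta equals the item difficulty b, the probability is exactly 0.5; fitting theta, a, and b to model-vs-benchmark response matrices is what lets IRT separate model ability from problem characteristics.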
- Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi · Mar 3, 2026 · Citations: 0
Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone.
- Credibility Governance: A Social Mechanism for Collective Self-Correction under Weak Truth Signals
Wanying He, Yanxi Lin, Ziheng Zhou, Xue Feng, Min Peng · Mar 3, 2026 · Citations: 0
We propose Credibility Governance (CG), a mechanism that reallocates influence by learning which agents and viewpoints consistently track evolving public evidence.
- StitchCUDA: An Automated Multi-Agents End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning
Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong · Mar 3, 2026 · Citations: 0
Rubric Rating Multi Agent
To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…
- Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
Shubhangi Upasani, Ravi Shanker Raju, Bo Li, Mengmeing Ji, John Long · Mar 3, 2026 · Citations: 0
Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost.
- Think, But Don't Overthink: Reproducing Recursive Language Models
Daren Wang · Mar 3, 2026 · Citations: 0
Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks.
- GPUTOK: GPU Accelerated Byte Level BPE Tokenization
Venu Gopal Kadamba, Kanishkha Jaisankar · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ExpGuard: LLM Content Moderation in Specialized Domains
Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak · Mar 3, 2026 · Citations: 0
Expert Verification
With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies.
- How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang · Mar 3, 2026 · Citations: 0
We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality.
- FlashEvaluator: Expanding Search Space with Parallel Evaluation
Chao Feng, Yuanhao Pu, Chenghao Zhang, Shanqi Liu, Shuchang Liu · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think
Junzhe Shen, Jieru Zhao, Ziwei He, Zhouhan Lin · Mar 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0
Red Team Web Browsing
Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.