- RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0
Demonstrations · Long Horizon
To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.
- POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu · Mar 5, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks
Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu · Mar 5, 2026 · Citations: 0
- Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks · Mar 5, 2026 · Citations: 0
- Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow · Mar 5, 2026 · Citations: 0
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval
Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov · Mar 5, 2026 · Citations: 0
- NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance
Abrar Eyasir, Tahsin Ahmed, Muhammad Ibrahim · Mar 5, 2026 · Citations: 0
- DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates
Klaywert Danillo Ferreira de Souza, David Eduardo Pereira, Cláudio E. C. Campelo, Larissa Lucena Vasconcelos · Mar 5, 2026 · Citations: 0
- FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar · Mar 5, 2026 · Citations: 0
- Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry
Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha · Mar 5, 2026 · Citations: 0
- Ensembling Language Models with Sequential Monte Carlo
Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly · Mar 5, 2026 · Citations: 0
- Dissociating Direct Access from Inference in AI Introspection
Harvey Lederman, Kyle Mahowald · Mar 5, 2026 · Citations: 0
- An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs
Deshan Sumanathilaka, Nicholas Micallef, Julian Hough · Mar 5, 2026 · Citations: 0
- Progressive Residual Warmup for Language Model Pretraining
Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang · Mar 5, 2026 · Citations: 0
- DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning
Mohammad Mahdi Moradi, Sudhir Mudur · Mar 5, 2026 · Citations: 0
- Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR
Carlos Carvalho, Francisco Teixeira, Thomas Rolland, Alberto Abad · Mar 5, 2026 · Citations: 0
- A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes
Stefan Bott, Verena Riegler, Horacio Saggion, Almudena Rascón Alcaina, Nouran Khallaf · Mar 5, 2026 · Citations: 0
- PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery · Mar 5, 2026 · Citations: 0
- Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong · Mar 5, 2026 · Citations: 0
- WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
Luca Della Libera, Cem Subakan, Mirco Ravanelli · Mar 5, 2026 · Citations: 0
- Knowledge Divergence and the Value of Debate for Scalable Oversight
Robin Young · Mar 5, 2026 · Citations: 0
RLAIF or Synthetic Feedback
AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage.
- SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning
Zhu Li, Yongjian Chen, Huiyuan Lai, Xiyuan Gao, Shekhar Nayak · Mar 5, 2026 · Citations: 0
- Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh
Mohammad Mamun Or Rashid · Mar 5, 2026 · Citations: 0
- VietJobs: A Vietnamese Job Advertisement Dataset
Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj · Mar 5, 2026 · Citations: 0
- Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
Ofir Ben Shoham · Mar 5, 2026 · Citations: 0
- Core-based Hierarchies for Efficient GraphRAG
Jakir Hossain, Ahmet Erdem Sarıyüce · Mar 5, 2026 · Citations: 0
- Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic
Sara Candussio, Gabriele Sarti, Gaia Saveri, Luca Bortolussi · Mar 5, 2026 · Citations: 0
- Diffusion LLMs can think EoS-by-EoS
Sarah Breckner, Sebastian Schuster · Mar 5, 2026 · Citations: 0
- Transducing Language Models
Vésteinn Snæbjarnarson, Samuel Kiegeland, Tianyu Liu, Reda Boumasmoud, Ryan Cotterell · Mar 5, 2026 · Citations: 0
- Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions
Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li · Mar 5, 2026 · Citations: 0
- Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
Di Zhang, Xun Wu, Shaohan Huang, Yudong Wang, Hanyong Shao · Mar 5, 2026 · Citations: 0
- C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
Avni Mittal, Rauno Arike · Mar 5, 2026 · Citations: 0
- Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers
Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang · Mar 5, 2026 · Citations: 0
- Representation Fidelity: Auditing Algorithmic Decisions About Humans Using Self-Descriptions
Theresa Elstner, Martin Potthast · Mar 5, 2026 · Citations: 0
- LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting
Yewen Li, Zhiyi Lyu, Peng Jiang, Qingpeng Cai, Fei Pan · Mar 5, 2026 · Citations: 0
- Measuring the Redundancy of Decoder Layers in SpeechLLMs
Adel Moumen, Guangzhi Sun, Philip C Woodland · Mar 5, 2026 · Citations: 0
- ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI
Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A Zarin Nishat, Dhananjay Bhandiwad · Mar 5, 2026 · Citations: 0
- Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series
Jiafeng Lin, Mengren Zheng, Simeng Ye, Yuxuan Wang, Huan Zhang · Mar 5, 2026 · Citations: 0
- MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection
Inayat Arshad, Fajar Saleem, Ijaz Hussain · Mar 5, 2026 · Citations: 0
- NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension
Rongzhi Li, Hitomi Yanaka · Mar 5, 2026 · Citations: 0
- Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure
Yida Lu, Jianwei Fang, Xuyang Shao, Zixuan Chen, Shiyao Cui · Mar 5, 2026 · Citations: 0
- HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation
Yifan Zhu, Guanting Chen, Bing Wei, Haoran Luo · Mar 5, 2026 · Citations: 0
- ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat · Mar 5, 2026 · Citations: 0
Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators.
- VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026 · Citations: 0
Pairwise Preference
Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on…
- Functionality-Oriented LLM Merging on the Fisher-Rao Manifold
Jiayu Wang, Zuojun Ye, Wenpeng Yin · Mar 5, 2026 · Citations: 0
Across various benchmarks and collapse diagnostics, our method remains stable as the number and heterogeneity of merged models increase, consistently outperforming prior baselines.
- Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng · Mar 5, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MPCEval: A Benchmark for Multi-Party Conversation Generation
Minxing Zhang, Yi Yang, Zhuofan Jia, Xuan Yang, Jian Pei · Mar 5, 2026 · Citations: 0
Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck.
- When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
Amirabbas Afzali, Myeongho Jeon, Maria Brbic · Mar 5, 2026 · Citations: 0
Pairwise Preference
Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization…
- Replaying pre-training data improves fine-tuning
Suhas Kotha, Percy Liang · Mar 5, 2026 · Citations: 0
Web Browsing
We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
- VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters
Jiaxin Fan, Wenpo Song · Mar 5, 2026 · Citations: 0
By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling.
- TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0
Demonstrations · Web Browsing
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
- LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services
Jinwen Chen, Shuai Gong, Shiwen Zhang, Zheng Zhang, Yachao Zhao · Mar 5, 2026 · Citations: 0
Pairwise Preference
While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency.
- Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition
Mengze Hong, Yi Gu, Di Jiang, Hanlin Gu, Chen Jason Zhang · Mar 5, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis
Stavros Gazetas, Giorgos Filandrianos, Maria Lymperaiou, Paraskevi Tzouveli, Athanasios Voulodimos · Mar 5, 2026 · Citations: 0
Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
- AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Panagiotis Alexios Spanakis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou · Mar 5, 2026 · Citations: 0
This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement.
- Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Hiroki Fukui · Mar 5, 2026 · Citations: 0
Multi Agent
We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface…
- Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research
Arina Kostina, Marios Dikaiakos, Alejandro Porcel, Tassos Stassopoulos · Mar 5, 2026 · Citations: 0
In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework.
- Free Lunch for Pass@k? Low Cost Diverse Sampling for Diffusion Language Models
Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish · Mar 5, 2026 · Citations: 0
We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model.
- FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications
Yunfan Zhang, Yijie Bei, Jetashree Ravi, Pawel Garbacki · Mar 5, 2026 · Citations: 0
However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users.
- HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents
Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou · Mar 5, 2026 · Citations: 0
Multi Agent
We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas.