- Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Mar 27, 2026 · Citations: 0
Long Horizon
Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
- Introducing MELI: the Mandarin-English Language Interview Corpus
Suyuan Liu, Molly Babel · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TAPS: Task Aware Proposal Distributions for Speculative Sampling
Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem · Mar 27, 2026 · Citations: 0
Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench.
- Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
Hanif Rahman, Shafeeq ur Rehman · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
Rahul Soni · Mar 27, 2026 · Citations: 0
Critique Edit Long Horizon
Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
- The Last Fingerprint: How Markdown Training Shapes LLM Prose
E. M. Freeburg · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- PHONOS: PHOnetic Neutralization for Online Streaming Applications
Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah, Ricardo Gutierrez-Osuna · Mar 27, 2026 · Citations: 0
Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift, and reduced speaker linkability as accent-neutralized utterances move away from the original speaker in embedding…
- FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu · Mar 27, 2026 · Citations: 0
We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level.
- A large corpus of lucid and non-lucid dream reports
Remington Mallett · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Multilingual Stutter Event Detection for English, German, and Mandarin Speech
Felix Haas, Sebastian P. Bayerl · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- In your own words: computationally identifying interpretable themes in free-text survey data
Jenny S Wang, Aliya Saperstein, Emma Pierson · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation
Lorcan McLaren, James Cross, Zuzanna Krakowska, Robin Rauner, Martijn Schoonvelde · Mar 27, 2026 · Citations: 0
Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored.
- Learning to Commit: Generating Organic Pull Requests via Online Repository Memory
Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu · Mar 27, 2026 · Citations: 0
Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject.
- Weight Tying Biases Token Embeddings Towards the Output Space
Antonio Lopardo, Avyukth Harish, Catherine Arnett, Akshat Gupta · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian · Mar 27, 2026 · Citations: 0
Long Horizon
We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning.
- EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching
Paul Bontempo · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference
Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models
Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo · Mar 27, 2026 · Citations: 0
We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions.
- Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model
Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa · Mar 27, 2026 · Citations: 0
Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance.
- How Open Must Language Models be to Enable Reliable Scientific Inference?
James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng · Mar 27, 2026 · Citations: 0
Rubric Rating
We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward…
- ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs
Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira · Mar 27, 2026 · Citations: 0
European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR).
- JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems
Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai · Mar 27, 2026 · Citations: 0
Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments.
- AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese
Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira · Mar 27, 2026 · Citations: 0
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and…
- Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs
Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges · Mar 27, 2026 · Citations: 0
Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce.
- Entanglement as Memory: Mechanistic Interpretability of Quantum Language Models
Nathan Roll · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims
Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm · Mar 27, 2026 · Citations: 0
Pairwise Preference
In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems.
- Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models
Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026 · Citations: 0
Expert Verification
Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
- Analysing Calls to Order in German Parliamentary Debates
Nina Smirnova, Daniel Dan, Philipp Mayr · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models
Richard J. Young · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Word Alignment-Based Evaluation of Uniform Meaning Representations
Daniel Zeman, Federica Gamba · Mar 27, 2026 · Citations: 0
Comparison and evaluation of graph-based representations of sentence meaning is a challenge because competing representations of the same sentence may have different numbers of nodes, and it is not obvious which nodes should be compared to…
- Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers
Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang · Mar 27, 2026 · Citations: 0
Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.
- A Formal Framework for Uncertainty Analysis of Text Generation with Large Language Models
Steffen Herbold, Florian Lemmerich · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law
JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho · Mar 27, 2026 · Citations: 0
However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact.
- From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs
Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An · Mar 27, 2026 · Citations: 0
As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models' (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial…
- Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang · Mar 27, 2026 · Citations: 0
Rubric Rating Expert Verification
To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
- findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
Héctor Javier Vázquez Martínez · Mar 27, 2026 · Citations: 0
Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets,…
- Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models
Antoine Edy, Max Conti, Quentin Macé · Mar 27, 2026 · Citations: 0
We analyze these behaviors for state-of-the-art models on the NanoBEIR benchmark.
- SocialX: A Modular Platform for Multi-Source Big Data Research in Indonesia
Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan
Chihiro Taguchi, Yukinori Takubo, David Chiang · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs
Uri Z. Kialy, Avi Shtarkberg, Ayal Klein · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation
Beatrice Alex, Claire Grover, Arlene Casey, Richard Tobin, Heather Whalley · Mar 27, 2026 · Citations: 0
Benchmark evaluation using EdIE-R, an existing rule-based NLP system developed in conjunction with the annotation schema, revealed some performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups…
- Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents
Nicholas Edwards, Sebastian Schuster · Mar 27, 2026 · Citations: 0
Multi Agent
We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution.
- Sparse Auto-Encoders and Holism about Large Language Models
Jumbly Grindrod · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- An Object Web Seminar: A Retrospective on a Technical Dialogue Still Reverberating
James J. Cusick · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory
Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang · Mar 27, 2026 · Citations: 0
Expert Verification Multi Agent
To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
- DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Clash of the models: Comparing performance of BERT-based variants for generic news frame detection
Vihang Jumle · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.