- Large Language Models in the Abuse Detection Pipeline
Suraj Kath, Sanket Badhe, Preet Shah, Ashwin Sampathkumar, Shivani Gupta · Mar 31, 2026 · Citations: 0
Large Language Models introduce new capabilities for contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, enabling them to support multiple stages of modern safety systems.
- Asymmetric Actor-Critic for Multi-turn LLM Agents
Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia · Mar 31, 2026 · Citations: 0
Long Horizon
In many real-world applications, agents must succeed in one-shot settings where retries are impossible.
- Frege in the Flesh: Biolinguistics and the Neural Enforcement of Syntactic Structures
Elliot Murphy · Mar 31, 2026 · Citations: 0
Biolinguistics is the interdisciplinary scientific study of the biological foundations, evolution, and genetic basis of human language.
- Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics
Simone Betteti, Luca Laurenti · Mar 31, 2026 · Citations: 0
- Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
Zaifu Zhan, Mengyuan Cui, Rui Zhang · Mar 31, 2026 · Citations: 0
Critique Edit
Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
- LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0
Rubric Rating
We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
- REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context
Pawin Taechoyotin, Daniel E. Acuna · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval
Antonín Jarolím, Martin Fajčík · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Taxonomy of Programming Languages for Code Generation
Nishat Raihan, Christian Newman, Marcos Zampieri · Mar 31, 2026 · Citations: 0
Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.
- Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries
Tanay Gondil · Mar 31, 2026 · Citations: 0
Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries.
- Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations
Haoran Wang, Li Xiong, Kai Shu · Mar 31, 2026 · Citations: 0
Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion.
- Polish phonology and morphology through the lens of distributional semantics
Paula Orzechowska, R. Harald Baayen · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Annette Taberner-Miller · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation
Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence · Mar 31, 2026 · Citations: 0
Long Horizon
Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues.
- Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency
Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi · Mar 31, 2026 · Citations: 0
Long Horizon
Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9%…
- Hierarchical Pre-Training of Vision Encoders with Large Language Models
Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee · Mar 31, 2026 · Citations: 0
Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and…
- One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction
Yuxing Lu, Yushuhong Lin, Jason Zhang · Mar 31, 2026 · Citations: 0
Multi Agent
Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement.
- Reward-Based Online LLM Routing via NeuralUCB
Ming-Hua Tsai, Phat Tran · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Covertly improving intelligibility with data-driven adaptations of speech timing
Paige Tuttösí, Angelica Lim, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier · Mar 31, 2026 · Citations: 0
Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech.
- Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents
Davide Di Gioia · Mar 31, 2026 · Citations: 0
Tool Use
Autonomous tool-using agents in networked environments must decide which information source to query and when to stop querying and act.
- ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection
Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga · Mar 31, 2026 · Citations: 0
Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
- Tracking Equivalent Mechanistic Interpretations Across Neural Networks
Alan Sun, Mariya Toneva · Mar 31, 2026 · Citations: 0
Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.
- Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives
Mohammadhossein Khojasteh, Yifan Jiang, Stefano De Giorgis, Frank van Harmelen, Filip Ilievski · Mar 31, 2026 · Citations: 0
Analogical reasoning is a key driver of human generalization in problem-solving and argumentation.
- Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior
Junwei Yu, Mufeng Yang, Yepeng Ding, Hiroyuki Sato · Mar 31, 2026 · Citations: 0
Web Browsing
Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed…
- Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System
Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie · Mar 31, 2026 · Citations: 0
This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.
- Four Generations of Quantum Biomedical Sensors
Xin Jin, Priyam Srivastava, Ronghe Wang, Yuqing Li, Jonathan Beaumariage · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Rewrite the News: Tracing Editorial Reuse Across News Agencies
Soveatin Kuntur, Nina Smirnova, Anna Wroblewska, Philipp Mayr, Sebastijan Razboršek Maček · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Learning to Play Blackjack: A Curriculum Learning Perspective
Amirreza Alasti, Efe Erdal, Yücel Celik, Theresa Eimer · Mar 31, 2026 · Citations: 0
We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually.
- Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon · Mar 31, 2026 · Citations: 0
Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
- FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish
Daban Q. Jaff, Mohammad Mohammadamini · Mar 31, 2026 · Citations: 0
FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language.
- Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports
Benjamin Josef Schüßler, Jakob Prange · Mar 31, 2026 · Citations: 0
We apply various readability scoring methods and evaluate them regarding their prediction error and correlation with human rankings.
- SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models
Adar Avsian, Larry Heck · Mar 31, 2026 · Citations: 0
Multi Agent
We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models.
- Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis
Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang · Mar 31, 2026 · Citations: 0
We present Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm to operate instruments through the same interfaces as human experts.
- ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian
Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi · Mar 31, 2026 · Citations: 0
The dataset's diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation.
- Reasoning-Driven Synthetic Data Generation and Evaluation
Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous · Mar 31, 2026 · Citations: 0
Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative.
- Terminal Agents Suffice for Enterprise Automation
Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam · Mar 31, 2026 · Citations: 0
There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously.
- Training-Free Dynamic Upcycling of Expert Language Models
Eros Fanì, Oğuzhan Ersoy · Mar 31, 2026 · Citations: 0
Expert Verification
To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model.
- A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models
Lixin Xiu, Xufang Luo, Hideki Nakayama · Mar 31, 2026 · Citations: 0
Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs.
- Near-Miss: Latent Policy Failure Detection in Agentic Workflows
Ella Rabinovich, David Boaz, Naama Zwerdling, Ateret Anaby-Tavor · Mar 31, 2026 · Citations: 0
In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces.
- Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models
Brian Felipe Keith-Norambuena, Carolina Inés Rojas-Córdova, Claudio Juvenal Meneses-Villegas, Elizabeth Johanna Lam-Esquenazi, Angélica María Flores-Bustos · Mar 31, 2026 · Citations: 0
We evaluated our approach on a news article corpus using LLM judges with Claude Opus 4.5 and GPT 5.1, measuring both coherence and agenda alignment across 64 endpoint pairs and 6 agendas.
- Semantic Interaction for Narrative Map Sensemaking: An Insight-based Evaluation
Brian Felipe Keith-Norambuena, Fausto German, Eric Krokos, Sarah Joseph, Chris North · Mar 31, 2026 · Citations: 0
While SI frameworks for narrative extraction have been proposed, empirical evaluations of their effectiveness remain limited.
- Convergent Representations of Linguistic Constructions in Human and Artificial Neural Systems
Pegah Ramezani, Thomas Kinfe, Andreas Maier, Achim Schilling, Patrick Krauss · Mar 31, 2026 · Citations: 0
Pairwise Preference
The present study tests these predictions in human neural activity using electroencephalography (EEG).
- Learning Diagnostic Reasoning for Decision Support in Toxicology
Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer · Mar 31, 2026 · Citations: 0
Expert Verification
To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology.
- When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment
Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar · Mar 31, 2026 · Citations: 0
This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review.
- FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration
Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen · Mar 31, 2026 · Citations: 0
Extensive evaluations demonstrate that FlowPIE consistently produces ideas with higher novelty, feasibility and diversity compared to strong LLM-based and agent-based frameworks, while enabling reward scaling during test time.
- Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models
Linda Zeng, Steven Y. Feng, Michael C. Frank · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Can LLM Agents Identify Spoken Dialects like a Linguist?
Tobias Bystrich, Lukas Hamm, Maria Hassan, Lea Fischbach, Lucie Flek · Mar 31, 2026 · Citations: 0
In this work, we explore the ability of large language models (LLMs), acting as agents, to understand dialects and whether they can show comparable performance to models such as HuBERT in dialect classification.
- Baby Scale: Investigating Models Trained on Individual Children's Language Input
Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank · Mar 31, 2026 · Citations: 0
Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior.
- Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics
Alain Vázquez, Maria Inés Torres · Mar 31, 2026 · Citations: 0
In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss.
- LLM Probe: Evaluating LLMs for Low-Resource Languages
Hailay Kidu Teklehaymanot, Gebrearegawi Gebremariam, Wolfgang Nejdl · Mar 31, 2026 · Citations: 0
Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized…
- Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models
Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi · Mar 31, 2026 · Citations: 0
Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing.
- Metriplector: From Field Theory to Neural Architecture
Dan Oprisa, Peter Toth · Mar 31, 2026 · Citations: 0
- MemFactory: Unified Inference & Training Framework for Agent Memory
Ziliang Guo, Ziheng Li, Bo Tang, Feiyu Xiong, Zhiyu Li · Mar 31, 2026 · Citations: 0
To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents.
- Calibrated Confidence Expression for Radiology Report Generation
David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann · Mar 31, 2026 · Citations: 0
Expert Verification
In a clinical evaluation, we show that ConRad's report-level scores are well aligned with clinicians' judgment.
- M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
Seung Hun Han, Youssef Mohamed, Mohamed Elhoseiny · Mar 31, 2026 · Citations: 0
M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed.
- An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms
Nils Grünefeld, Jes Frellsen, Christian Hardmeier · Mar 31, 2026 · Citations: 0
We then use the estimates to investigate when each uncertainty type carries useful signal for predicting answer correctness in question answering with large language models, revealing a benchmark-dependent divergence: the combined estimate…
- Authorship Impersonation via LLM Prompting does not Evade Authorship Verification Methods
Baoyi Zeng, Andrea Nini · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0
Rubric Rating · Expert Verification · Web Browsing
The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
- PRISM: PRIor from corpus Statistics for topic Modeling
Tal Ishon, Yoav Goldberg, Uri Shaham · Mar 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Security in LLM-as-a-Judge: A Comprehensive SoK
Aiman Al Masoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu, Vignesh Kumar Kembu · Mar 31, 2026 · Citations: 0