- Computational Analysis of Semantic Connections Between Herman Melville Reading and Writing
Nudrat Habib, Elisa Barney Smith, Steven Olsen Smith · Mar 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Seamless Deception: Larger Language Models Are Better Knowledge Concealers
Dhananjay Ashok, Ruth-Ann Armstrong, Jonathan May · Mar 15, 2026 · Citations: 0
Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods.
- Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI
Mark Baciak, Thomas A. Cellucci, Deanna M. Falkowski · Mar 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Argumentation for Explainable and Globally Contestable Decision Support with LLMs
Adam Dejl, Matthew Williams, Francesca Toni · Mar 15, 2026 · Citations: 0
In this paper, we introduce ArgEval, a framework that shifts from instance-specific reasoning to structured evaluation of general decision options.
- Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models
Lok-Lam Ieong, Chia-Chien Chen, Chih-Kai Yang, Yu-Han Huang, An-Yu Cheng · Mar 15, 2026 · Citations: 0
We introduce three strategies using diverse information sources and evaluate them across four LALMs and four benchmarks.
- Anterior's Approach to Fairness Evaluation of Automated Prior Authorization System
Sai P. Selvaraj, Khadija Mahmoud, Anuj Iravane · Mar 15, 2026 · Citations: 0
We propose a fairness evaluation framework for prior authorization models based on model error rates rather than approval outcomes.
- $PA^3$: Policy-Aware Agent Alignment through Chain-of-Thought
Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao · Mar 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Parameter-Efficient Quality Estimation via Frozen Recursive Models
Umar Abubacar, Roman Bauer, Diptesh Kanojia · Mar 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad
Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han · Mar 15, 2026 · Citations: 0
Evolve-based agents such as AlphaEvolve are among the notable successes in using Large Language Models (LLMs) to build AI Scientists.
- Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0
Long Horizon
Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
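The abstract does not spell out the exact decoding rule, but a "relative probability band" suggests a filter of roughly the following shape. This is a hypothetical sketch (the cutoff rule and parameter `b` are assumptions, not the paper's definition): keep only tokens whose probability is within a factor `b` of the most likely token, then renormalize.

```python
import math

def top_b_filter(logits, b=0.5):
    # Hypothetical relative-probability-band filter (an assumption, not
    # the paper's exact rule): keep tokens whose softmax probability is
    # at least b times that of the most likely token, then renormalize.
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    cutoff = b * max(probs)
    banded = [p if p >= cutoff else 0.0 for p in probs]
    z2 = sum(banded)
    return [p / z2 for p in banded]

# two near-tied tokens survive; the low-probability tail is zeroed out
p = top_b_filter([3.0, 2.9, 0.0, -5.0], b=0.5)
```

Unlike a fixed top-k, the surviving set adapts to how peaked the distribution is, which is one plausible way such a rule could reduce inter-decoding variance.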
- Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models
Deepon Halder, Angira Mukherjee · Mar 15, 2026 · Citations: 0
Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts.
- MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection
Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna · Mar 15, 2026 · Citations: 0
This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent.
- CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language
Junhang Cheng, Fang Liu, Jia Li, Chengru Wu, Nanxiang Jiang · Mar 15, 2026 · Citations: 0
To address this gap, we introduce CangjieBench, a contamination-free benchmark for Cangjie, a representative low-resource general-purpose language.
- Fine-tuning MLLMs Without Forgetting Is Easier Than You Think
He Li, Yuhui Zhang, Xiaohan Wang, Kaifeng Lyu, Serena Yeung-Levy · Mar 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows
Aditya Sharan, Sriram Hebbale, Dhruv Kumar · Mar 15, 2026 · Citations: 0
In domains like physics, standard text augmentation often introduces hallucinations, while static benchmarks lack the reasoning traces required for fine-tuning.
- AI Can Learn Scientific Taste
Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou · Mar 15, 2026 · Citations: 0
Pairwise Preference
Great scientists have strong judgement and foresight, closely tied to what we call scientific taste.
- An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs
Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu · Mar 15, 2026 · Citations: 0
Expert Verification · RLAIF or Synthetic Feedback
Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples).
- Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs
Auksarapak Kietkajornrit, Jad Tarifi, Nima Asgharbeygi · Mar 15, 2026 · Citations: 0
We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs.
- PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark
Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery · Mar 15, 2026 · Citations: 0
Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching, none of which are captured by existing benchmarks.
- Echoes Across Centuries: Phonetic Signatures of Persian Poets
Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar · Mar 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature
Yuanchi Ma, Kaize Shi, Hui He, Zhihua Zhang, Zhongxiang Lei · Mar 15, 2026 · Citations: 0
We further construct a human-annotated corpus to support the analysis of narrative structures within LLM-generated text.
- Questionnaire Responses Do not Capture the Safety of AI Agents
Max Hellrigel-Holderbaum, Edward James Young · Mar 15, 2026 · Citations: 0
As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount.
- BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation
Zhaoyi Li, Xu Zhang, Xiaojun Wan · Mar 15, 2026 · Citations: 0
We construct a Chinese theme corpus for evaluation and conduct extensive experiments across three contemporary LLM backbones.
- Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains
Andrew Katz · Mar 15, 2026 · Citations: 0
Rubric Rating
Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty.
- Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty · Mar 15, 2026 · Citations: 0
Red Team
While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed…
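The output-space idea in the snippet can be sketched at toy scale: hold the prompt fixed, draw many diverse completions, and keep the ones a safety filter flags. Everything below is a stand-in (the generator, the `is_unsafe` check, and the temperature schedule are placeholders, not the paper's components):

```python
import random

def sample_responses(generate, prompt, n=30, temperatures=(0.7, 1.0, 1.3)):
    # Output-space exploration: many diverse responses to ONE fixed prompt,
    # instead of searching over adversarial prompts (input-space).
    per_temp = n // len(temperatures)
    return [generate(prompt, t) for t in temperatures for _ in range(per_temp)]

def is_unsafe(text):
    return "UNSAFE" in text  # toy stand-in for a real safety classifier

def toy_generate(prompt, temperature):
    # stand-in generator: higher temperature -> occasionally unsafe output
    return "UNSAFE completion" if random.random() < 0.05 * temperature else "safe completion"

random.seed(0)
responses = sample_responses(toy_generate, "fixed red-team prompt", n=30)
failures = [r for r in responses if is_unsafe(r)]
```

The point of the sketch is the loop structure: long-tail failures are surfaced by widening the response sample for a prompt, not by mutating the prompt itself.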
- Motivation in Large Language Models
Omer Nahum, Asael Sklar, Ariel Goldstein, Roi Reichart · Mar 15, 2026 · Citations: 0
Pairwise Preference
Motivation is a central driver of human behavior, shaping decisions, goals, and task performance.
- ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon · Mar 15, 2026 · Citations: 0
Long Horizon
To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples to systematically assess step-by-step reasoning across 17 core ECG diagnoses.
- Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models
Yixuan Tang, Yi Yang · Mar 15, 2026 · Citations: 0
Long Horizon
Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish-dovish classification.
- SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustered Merging
Shunlong Wu, Hai Lin, Shaoshen Chen, Tingwei Lu, Yongqin Zeng · Mar 15, 2026 · Citations: 0
Extensive experiments across diverse benchmarks and models demonstrate that SemantiCache accelerates the decoding stage of inference by up to 2.61 times and substantially reduces memory footprint, while maintaining performance comparable to…
- Seeking Physics in Diffusion Noise
Chujun Tang, Lei Zhong, Fangqiang Ding · Mar 15, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering
Shaowei Guan, Yu Zhai, Hin Chi Kwok, Jiawei Du, Xinyu Feng · Mar 15, 2026 · Citations: 0
Multi Agent
To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering.
- Automatic Inter-document Multi-hop Scientific QA Generation
Seungmin Lee, Dongha Kim, Yuni Jeon, Junyoung Koh, Min Song · Mar 15, 2026 · Citations: 0
Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable…
- Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring
Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu · Mar 15, 2026 · Citations: 0
We conduct experiments across multiple benchmarks using LRLMs of different types and scales; the results indicate that our method delivers a larger improvement over vanilla CoT than existing early-exit methods.
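The title suggests an early-exit rule driven by how much successive reasoning steps deviate. The paper's monitor is not described in the snippet, so the following is only a hypothetical illustration: exit once consecutive steps stay highly similar (i.e., the path has stopped deviating) for a few steps in a row.

```python
def early_exit(step_similarities, threshold=0.95, patience=2):
    # Hypothetical early-exit rule inspired by the title (NOT the paper's
    # actual monitor): step_similarities[i] is the similarity between
    # reasoning step i and step i-1; stop once `patience` consecutive
    # steps exceed `threshold`, signalling the chain has converged.
    run = 0
    for i, sim in enumerate(step_similarities):
        run = run + 1 if sim >= threshold else 0
        if run >= patience:
            return i  # index of the step at which to exit
    return len(step_similarities) - 1  # never converged: use the full chain

exit_at = early_exit([0.20, 0.50, 0.96, 0.97, 0.99])
```

Any such rule trades a small accuracy risk for shorter chains, which is the overthinking mitigation the entry describes.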
- Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective
Mohamed Aghzal, Gregory J. Stein, Ziyu Yao · Mar 15, 2026 · Citations: 0
Long Horizon
Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks.
- QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis
Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang · Mar 15, 2026 · Citations: 0
Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.
- "I'm Not Reading All of That": Understanding Software Engineers' Level of Cognitive Engagement with Agentic Coding Assistants
Carlos Rafael Catalan, Lheane Marie Dizon, Patricia Nicole Monderin, Emily Kuang · Mar 15, 2026 · Citations: 0
- Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective
Tianyi Zhang, David Traum · Mar 15, 2026 · Citations: 0
We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology.
- Vavanagi: a Community-run Platform for Documentation of the Hula Language in Papua New Guinea
Bri Olewale, Raphael Merx, Ekaterina Vylomova · Mar 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Relationship-Aware Safety Unlearning for Multimodal LLMs
Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo · Mar 15, 2026 · Citations: 0
- Selective Fine-Tuning of GPT Architectures for Parameter-Efficient Clinical Text Classification
Fariba Afrin Irany, Sampson Akwafuo · Mar 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos
Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar · Mar 14, 2026 · Citations: 0
We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions.
- The GELATO Dataset for Legislative NER
Matthew Flynn, Timothy Obiso, Sam Newman · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- OasisSimp: An Open-source Asian-English Sentence Simplification Dataset
Hannah Liu, Muxin Tian, Iqra Ali, Haonan Gao, Qiaoyiwen Wu · Mar 14, 2026 · Citations: 0
Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness.
- Not All Latent Spaces Are Flat: Hyperbolic Concept Control
Maria Rosaria Briglia, Simone Facchiano, Paolo Cursi, Alessio Sampieri, Emanuele Rodolà · Mar 14, 2026 · Citations: 0
- Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors
Mark Rofin, Jalal Naghiyev, Michael Hahn · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification
Menna Elgabry, Ali Hamdi, Khaled Shaban · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments
Rupak Raj Ghimire, Bipesh Subedi, Balaram Prasain, Prakash Poudyal, Praveen Acharya · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA
Yasaman Zarrinkia, Venkatesh Srinivasan, Alex Thomo · Mar 14, 2026 · Citations: 0
Evaluating KET-RAG, a leading Graph-RAG system, on three multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA), we find that 77% to 91% of questions have the gold answer in the retrieved context, yet accuracy is only 35% to 78%, and…
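The diagnostic behind those numbers — retrieval succeeding while the reader still fails — is easy to compute. A minimal sketch (the record format is an assumption for illustration):

```python
def reasoning_gap(records):
    # Each record: (gold_in_context: bool, answered_correctly: bool).
    # Mirrors the abstract's diagnostic: the gold answer can be present in
    # the retrieved context while the reader still answers incorrectly.
    n = len(records)
    hit_rate = sum(hit for hit, _ in records) / n
    accuracy = sum(ok for _, ok in records) / n
    missed_despite_hit = sum(hit and not ok for hit, ok in records) / n
    return hit_rate, accuracy, missed_despite_hit

# toy data: 8 of 10 questions have the gold answer retrieved, 5 answered right
records = [(True, True)] * 5 + [(True, False)] * 3 + [(False, False)] * 2
hit_rate, accuracy, gap = reasoning_gap(records)
```

A large `missed_despite_hit` fraction is exactly the "reasoning bottleneck" the entry points at: the failures are downstream of retrieval.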
- Probing neural audio codecs for distinctions among English nuclear tunes
Juan Pablo Vigneaux, Jennifer Cole · Mar 14, 2026 · Citations: 0
Results: Linear probes trained on the unquantized latents or some of the associated codewords yield above-chance accuracy in distinguishing eight phonologically specified nuclear tunes with monotonal pitch accents (top average test accuracy…
- SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions
Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou · Mar 14, 2026 · Citations: 0
Red Team
The benchmark is constructed from U.S.
- Beyond Explicit Edges: Robust Reasoning over Noisy and Sparse Knowledge Graphs
Hang Gao, Dimitris N. Metaxas · Mar 14, 2026 · Citations: 0
Web Browsing
INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks.
- Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models
Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FLUX: Data Worth Training On
Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook
Ibrahim Ebrar Yurt, Fabian Karl, Tejaswi Choppa, Florian Matthes · Mar 14, 2026 · Citations: 0
Expert Verification
Clinical question answering over electronic health records (EHRs) can help clinicians and patients access relevant medical information more efficiently.
- ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering
Hussein Jawad, Nicolas J-B Brunel · Mar 14, 2026 · Citations: 0
Tool Use
Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning.
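The retrieval step the snippet describes — rank registered tools by embedding similarity and keep the top-k — can be sketched as below. The tool names, vectors, and the decoy setup are all hypothetical; the sketch only illustrates why flooding the index with semantically covering decoys can push valid tools out of the top-k.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_tools(query_vec, tool_vecs, k=2):
    # Embedding-based tool retrieval as described in the abstract: rank all
    # registered tools by cosine similarity to the query, keep the top-k.
    ranked = sorted(tool_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

tools = {
    "get_weather": [1.0, 0.1],   # the valid tool
    "decoy_a": [1.0, 0.2],       # hypothetical attacker-registered decoys
    "decoy_b": [0.99, 0.21],     # sitting closer to likely query embeddings
}
selected = top_k_tools([1.0, 0.2], tools, k=2)
```

Here the two decoys crowd the valid tool out of the top-2, so the agent never even sees `get_weather` at reasoning time — the "hiding" the title refers to.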
- OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset
Wenbin Hu, Huihao Jing, Haochen Shi, Changxuan Fan, Haoran Li · Mar 14, 2026 · Citations: 0
Ensuring the safety and compliance of large language models (LLMs) is of paramount importance.
- The Phenomenology of Hallucinations
Valeria Ruscio, Keiran Thompson · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation
Petter Törnberg · Mar 14, 2026 · Citations: 0
Arab names elicit cognitive elevation alongside interpersonal devaluation, and all four minority groups are consistently rated as less self-disciplined.
- Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering
Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian · Mar 14, 2026 · Citations: 0
Expert Verification · Long Horizon
Benchmark: github.com/hahaha111111/Step-CoT.
- GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent
Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev · Mar 14, 2026 · Citations: 0
We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
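"Writing context into memory with test-time gradient descent" can be illustrated at toy scale. The following is an analogy, not GradMem itself: memory is a single vector, and gradient steps on a squared-error loss over context vectors "write" the context into it by optimization rather than by attending over a long prompt.

```python
def write_to_memory(context_vecs, steps=200, lr=0.1):
    # Toy illustration of test-time gradient descent (an analogy, NOT the
    # paper's method): minimize sum_i ||memory - c_i||^2 by gradient steps,
    # so the context vectors are compressed into the memory parameter.
    dim = len(context_vecs[0])
    memory = [0.0] * dim
    for _ in range(steps):
        # gradient of sum_i ||memory - c_i||^2 with respect to memory
        grad = [sum(2 * (memory[d] - c[d]) for c in context_vecs)
                for d in range(dim)]
        memory = [m - lr * g / len(context_vecs) for m, g in zip(memory, grad)]
    return memory

# the optimum of this toy loss is the mean of the context vectors
mem = write_to_memory([[1.0, 0.0], [3.0, 2.0]])
```

The real method optimizes a learned memory against a language-modeling objective at inference time; the toy version only shows the control flow of writing by descent.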