- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos
Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar · Mar 14, 2026 · Citations: 0
We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under challenging, real-world conditions.
- The GELATO Dataset for Legislative NER
Matthew Flynn, Timothy Obiso, Sam Newman · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- OasisSimp: An Open-source Asian-English Sentence Simplification Dataset
Hannah Liu, Muxin Tian, Iqra Ali, Haonan Gao, Qiaoyiwen Wu · Mar 14, 2026 · Citations: 0
Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness.
- Not All Latent Spaces Are Flat: Hyperbolic Concept Control
Maria Rosaria Briglia, Simone Facchiano, Paolo Cursi, Alessio Sampieri, Emanuele Rodolà · Mar 14, 2026 · Citations: 0
- Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors
Mark Rofin, Jalal Naghiyev, Michael Hahn · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification
Menna Elgabry, Ali Hamdi, Khaled Shaban · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments
Rupak Raj Ghimire, Bipesh Subedi, Balaram Prasain, Prakash Poudyal, Praveen Acharya · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA
Yasaman Zarrinkia, Venkatesh Srinivasan, Alex Thomo · Mar 14, 2026 · Citations: 0
Evaluating KET-RAG, a leading Graph-RAG system, on three multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA), we find that 77% to 91% of questions have the gold answer in the retrieved context, yet accuracy is only 35% to 78%, and…
- Probing neural audio codecs for distinctions among English nuclear tunes
Juan Pablo Vigneaux, Jennifer Cole · Mar 14, 2026 · Citations: 0
Results: Linear probes trained on the unquantized latents or some of the associated codewords yield above-chance accuracy in distinguishing eight phonologically specified nuclear tunes with monotonal pitch accents (top average test accuracy…
- SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions
Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou · Mar 14, 2026 · Citations: 0
Red Team
The benchmark is constructed from U.S. …
- Beyond Explicit Edges: Robust Reasoning over Noisy and Sparse Knowledge Graphs
Hang Gao, Dimitris N. Metaxas · Mar 14, 2026 · Citations: 0
Web Browsing
INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks.
- Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models
Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FLUX: Data Worth Training On
Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook
Ibrahim Ebrar Yurt, Fabian Karl, Tejaswi Choppa, Florian Matthes · Mar 14, 2026 · Citations: 0
Expert Verification
Clinical question answering over electronic health records (EHRs) can help clinicians and patients access relevant medical information more efficiently.
- ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering
Hussein Jawad, Nicolas J-B Brunel · Mar 14, 2026 · Citations: 0
Tool Use
Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning.
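The embedding-based top-k tool retrieval that ToolFlood targets can be sketched as follows; this is a minimal illustration of the setting, using a toy character-count embedding in place of a real encoder (the tool names, descriptions, and embedding function are all hypothetical, not from the paper):

```python
import math

def embed(text):
    # Toy bag-of-characters embedding, L2-normalized.
    # A stand-in for a real sentence encoder; purely illustrative.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k_tools(query, tool_descriptions, k=2):
    """Return the k tool names whose descriptions are most similar to the query."""
    q = embed(query)
    scored = []
    for name, desc in tool_descriptions.items():
        d = embed(desc)
        # Dot product of unit vectors = cosine similarity.
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

The attack surface the paper points at is exactly this selection step: if adversarial tool descriptions semantically "cover" the query space, valid tools can be pushed out of the top-k subset before the LLM ever reasons over them.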
- OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset
Wenbin Hu, Huihao Jing, Haochen Shi, Changxuan Fan, Haoran Li · Mar 14, 2026 · Citations: 0
Ensuring the safety and compliance of large language models (LLMs) is of paramount importance.
- The Phenomenology of Hallucinations
Valeria Ruscio, Keiran Thompson · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation
Petter Törnberg · Mar 14, 2026 · Citations: 0
Arab names elicit cognitive elevation alongside interpersonal devaluation, and all four minority groups are consistently rated as less self-disciplined.
- Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering
Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian · Mar 14, 2026 · Citations: 0
Expert Verification · Long Horizon
Benchmark: github.com/hahaha111111/Step-CoT.
- GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent
Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev · Mar 14, 2026 · Citations: 0
We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
- APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution
Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao · Mar 14, 2026 · Citations: 0
To address these issues, this paper proposes APEX-Searcher, a novel Agentic Planning and Execution framework that augments LLM search capabilities.
- PMIScore: An Unsupervised Approach to Quantify Dialogue Engagement
Yongkang Guo, Zhihuan Huang, Yuqing Kong · Mar 14, 2026 · Citations: 0
A reliable measure of engagement could help benchmark large language models, enhance the effectiveness of human-computer interactions, or improve personal communication skills.
- GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages
Lawrence Adu Gyamfi, Paul Azunre, Stephen Edward Moore, Joel Budu, Akwasi Asare · Mar 14, 2026 · Citations: 0
The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability.
- DeceptGuard: A Constitutional Oversight Framework for Detecting Deception in LLM Agents
Snehasis Mukhopadhyay · Mar 14, 2026 · Citations: 0
Long Horizon
We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace),…
- Greedy Information Projection for LLM Data Selection
Victor Ye Dong, Kuan-Yun Lee, Jiamei Shuai, Shengfei Liu, Yi Liu · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Projection-Free Evolution Strategies for Continuous Prompt Search
Yu Cai, Canxi Huang, Xiaoyu He · Mar 14, 2026 · Citations: 0
Experimental results on seven natural language understanding tasks from the GLUE benchmark demonstrate that our proposed approach significantly outperforms existing baselines.
- Generate Then Correct: Single Shot Global Correction for Aspect Sentiment Quad Prediction
Shidong He, Haoyu Wang, Wenjie Luo · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LiveWeb-IE: A Benchmark For Online Web Information Extraction
Seungbin Yang, Jihwan Kim, Jaemin Choi, Dongjin Kim, Soyoung Yang · Mar 14, 2026 · Citations: 0
To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites.
- Causal Tracing of Audio-Text Fusion in Large Audio Language Models
Wei-Chih Chen, Chien-yu Huang, Hung-yi Lee · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Knowledge Distillation for Large Language Models
Alejandro Paredes La Torre, Barbara Flores, Diego Rodriguez · Mar 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality
Taiqiang Wu, Yuxin Cheng, Chenchen Ding, Runming Yang, Xincheng Feng · Mar 14, 2026 · Citations: 0
Empirical results indicate that reasoning capability decreases significantly, though the degree of degradation varies across benchmarks.
- SAATT Nav: a Socially Aware Autonomous Transparent Transportation Navigation Framework for Wheelchairs
Yutong Zhang, Shaiv Y. Mehra, Bradley S. Duerstock, Juan P. Wachs · Mar 14, 2026 · Citations: 0
Web Browsing
Current autonomous systems lack social awareness in navigation and transparency in decision-making, leading to decreased perceived safety and trust from the user and others in context.
- Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models
Jon-Paul Cacioli · Mar 14, 2026 · Citations: 0
We present the first systematic evaluation of mutual exclusivity (ME) -- the bias to map novel words to novel referents -- in text-only language models trained on child-directed speech.
- QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models
Yao Wu, Kangping Yin, Liang Dong, Zhenxin Ma, Shuting Xu · Mar 14, 2026 · Citations: 0
Rubric Rating
To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment.
- Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
Hanwen Shen, Ting Ying, Jiajie Lu, Shanshan Wang · Mar 14, 2026 · Citations: 0
Across toxic-prompt settings and benchmarks, CAP-TTA reduces bias (confirmed by human evaluation) while achieving much lower update latency than AdamW/SGD; it also mitigates catastrophic forgetting by significantly improving narrative…
- SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment
Mahdi Naseri, Zhou Wang · Mar 14, 2026 · Citations: 0