- An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data
Trinh Pham, Thanh Tam Nguyen, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen · Mar 8, 2026 · Citations: 0
Without timely evaluation, organizations cannot approve releases or detect failures early.
- AI Steerability 360: A Toolkit for Steering Large Language Models
Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin · Mar 8, 2026 · Citations: 0
Use-case classes (for defining tasks) and a benchmark class (for comparing performance on a given task) facilitate comprehensive evaluation and comparison of steering methods and pipelines.
- DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation
Bo Jiang · Mar 8, 2026 · Citations: 0
We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and…
- Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation
David Beauchemin, Richard Khoury · Mar 8, 2026 · Citations: 0
In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks.
- Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context
Ashish Pandey, Tek Raj Chhetri · Mar 8, 2026 · Citations: 0
Using a Croissant-compliant dataset of 2,400+ stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), combining two metrics: (1)…
- Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems
Zongqian Li, Tengchao Lv, Shaohan Huang, Yixuan Su, Qinzheng Sun · Mar 8, 2026 · Citations: 0
Evaluations on strictly unseen LiveCodeBench demonstrate that MicroCoder achieves 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size, with consistent advantages under both GRPO…
- Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models
Zongqian Li, Shaohan Huang, Zewen Chi, Yixuan Su, Lexin Zhou · Mar 8, 2026 · Citations: 0
MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended context evaluation.
- ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
Yuzhuang Xu, Xu Han, Yuxuan Li, Wanxiang Che · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis
A. J. W. de Vink, Filippos Karolos Ventirozos, Natalia Amat-Lefort, Lifeng Han · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Whitening Reveals Cluster Commitment as the Geometric Separator of Hallucination Types
Matic Korun · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li · Mar 8, 2026 · Citations: 0
Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning.
- Large Language Model for Discrete Optimization Problems: Evaluation and Step-by-step Reasoning
Tianhao Qian, Guilin Qi, Z. Y. Wu, Ran Gu, Xuanyi Liu · Mar 8, 2026 · Citations: 0
It aims to (1) provide an overview of LLMs' abilities on large-scale problems, (2) offer suggestions to those who want to solve discrete optimization problems automatically, and (3) establish the results as a benchmark for future research.
- Scalable Training of Mixture-of-Experts Models with Megatron Core
Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Ref-DGS: Reflective Dual Gaussian Splatting
Ningjing Fan, Yiqun Wang, Dongming Yan, Peter Wonka · Mar 8, 2026 · Citations: 0
- KohakuRAG: A simple RAG framework with hierarchical document indexing
Shih-Ying Yeh, Yueh-Feng Ku, Ko-Wei Huang, Buu-Khang Tu · Mar 8, 2026 · Citations: 0
We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with ±0.1% numeric tolerance and exact source attribution.
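The ±0.1% numeric tolerance above can be illustrated with a minimal scoring check. This is a hypothetical sketch, not the challenge's official scorer; interpreting the tolerance as relative to the gold value's magnitude is an assumption.

```python
def within_tolerance(pred: float, gold: float, rel_tol: float = 0.001) -> bool:
    """Accept a predicted numeric answer if it lies within +/-0.1% of the
    gold value's magnitude; require an exact match when the gold is zero."""
    if gold == 0:
        return pred == 0
    return abs(pred - gold) <= rel_tol * abs(gold)
```

Under this reading, a prediction of 100.05 against a gold answer of 100.0 would be accepted, while 100.2 would not.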
- StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
Haishu Zhao, Aokai Hao, Yuan Ge, Zhenqiang Hong, Tong Xiao · Mar 8, 2026 · Citations: 0
However, there remains a lack of systematic benchmarks that quantify and evaluate style-intensity control in conversations.
- KCoEvo: A Knowledge Graph Augmented Framework for Evolutionary Code Generation
Jiazhen Kang, Yuchen Lu, Chen Jiang, Jinrui Liu, Tianhao Zhang · Mar 8, 2026 · Citations: 0
Both modules are trained with synthetic supervision automatically derived from real-world API diffs, ensuring scalability and minimal human effort.
- A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification
Furkan Genç, Onat Özdemir, Emre Akbaş · Mar 8, 2026 · Citations: 0
- Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR
Rishikesh Kumar Sharma, Safal Narshing Shrestha, Jenny Poudel, Rupak Tiwari, Arju Shrestha · Mar 8, 2026 · Citations: 0
In this work, we introduce Nwāchā Munā, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling.
- Learning-free L2-Accented Speech Generation using Phonological Rules
Thanathai Lertpetchpun, Yoonjeong Lee, Jihwan Lee, Tiantian Feng, Dani Byrd · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs
Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib · Mar 8, 2026 · Citations: 0
To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline.
- Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
Thanathai Lertpetchpun, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd · Mar 8, 2026 · Citations: 0
Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
- TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning
Mingyue Cheng, Shuo Yu, Chuang Jiang, Xiaoyu Tao, Qingyang Mao · Mar 8, 2026 · Citations: 0
To address these limitations, we previously proposed TableMind as a tuning-based autonomous programmatic agent that simulates human-like interaction within a lightweight large language model (LLM).
- Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech
Tajamul Ashraf, Burhaan Rasheed Zargar, Saeed Abdul Muizz, Ifrah Mushtaq, Nazima Mehdi · Mar 8, 2026 · Citations: 0
The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers.
- SeDa: A Unified System for Dataset Discovery and Multi-Entity Augmented Semantic Exploration
Kan Ling, Zhen Qin, Yichi Zhu, Hengrun Zhang, Huiqun Yu · Mar 8, 2026 · Citations: 0
- A Joint Neural Baseline for Concept, Assertion, and Relation Extraction from Clinical Text
Fei Cheng, Ribeka Tanaka, Sadao Kurohashi · Mar 8, 2026 · Citations: 0
We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings.
- Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs
Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee · Mar 8, 2026 · Citations: 0
Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping.
- Cross-Modal Taxonomic Generalization in (Vision-) Language Models
Tianyang Xu, Marcelo Sandoval-Castaneda, Karen Livescu, Greg Shakhnarovich, Kanishka Misra · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling
J. Clayton Kerce, Alexis Fox · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Image Generation Models: A Technical History
Rouzbeh Shirvani · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System
Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou · Mar 8, 2026 · Citations: 0
We construct DS-NL2SQL, a benchmark covering six major database systems with 2,218 dialect-specific test cases.
- Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
Guoli Wang, Haonan Shi, Tu Ouyang, An Wang · Mar 8, 2026 · Citations: 0
Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data.
- Generalization in Online Reinforcement Learning for Mobile Agents
Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang · Mar 8, 2026 · Citations: 0
Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen.
- Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka · Mar 8, 2026 · Citations: 0
- AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Jihyoung Jang, Hyounghun Kim · Mar 8, 2026 · Citations: 0
Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies.
- Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams
Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon · Mar 8, 2026 · Citations: 0
We introduce Online Adaptation to Continual Knowledge Streams (OAKS) to evaluate this capability, establishing a benchmark for online adaptation over streaming, continually updating knowledge.
- SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali · Mar 7, 2026 · Citations: 0
Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies.
- Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness
Ravi Ranjan, Utkarsh Grover, Agorista Polyzou · Mar 7, 2026 · Citations: 0
Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography.
- RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts
Darya Kharlamova, Irina Proskurina · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Latent Generative Models with Tunable Complexity for Compressed Sensing and other Inverse Problems
Sean Gunn, Jorio Cocola, Oliver De Candido, Vaggos Chatziafratis, Paul Hand · Mar 7, 2026 · Citations: 0
- How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Nouran Khallaf, Serge Sharoff · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
Nouran Khallaf, Serge Sharoff · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Third Ambition: Artificial Intelligence and the Science of Human Behavior
W. Russell Neuman, Chad Coleman · Mar 7, 2026 · Citations: 0
Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that…
- Adversarial Latent-State Training for Robust Policies in Partially Observable Domains
Angad Singh Ahuja · Mar 7, 2026 · Citations: 0
- Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
Po-Chun Hsu, Meng-Hsi Chen, Tsu Ling Chao, Chia Tien Han, Da-shan Shiu · Mar 7, 2026 · Citations: 0
To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin.
- Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster
Minu Kim, Hoirin Kim, David R. Mortensen · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah · Mar 7, 2026 · Citations: 0
As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception, defined behaviorally as the systematic provision of false information to satisfy external incentives, poses a significant challenge to AI safety.
- Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice
Yuxu Ge · Mar 7, 2026 · Citations: 0
- The DIME Architecture: A Unified Operational Algorithm for Neural Representation, Dynamics, Control and Integration
Ionel Cristian Vladu, Nicu Bizdoaca, Ionica Pirici, Tudor-Adrian Balseanu, Eduard Nicusor Bondoc · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Fine-Grained Table Retrieval Through the Lens of Complex Queries
Wojciech Kosiuk, Xingyu Ji, Yeounoh Chung, Fatma Özcan, Madelon Hulsebos · Mar 7, 2026 · Citations: 0
Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.
- Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
Yoshiki Tanaka, Ryuichi Uehara, Koji Inoue, Michimasa Inaba · Mar 7, 2026 · Citations: 0
Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions.
- Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu · Mar 7, 2026 · Citations: 0
- Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information
Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara · Mar 7, 2026 · Citations: 0
In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG.
- Grounding Machine Creativity in Game Design Knowledge Representations: Empirical Probing of LLM-Based Executable Synthesis of Goal Playable Patterns under Structural Constraints
Hugh Xuechen Liu, Kıvanç Tatar · Mar 7, 2026 · Citations: 0
Using 26 goal pattern instantiations, we compare a direct generation baseline (natural language -> C# -> Unity) with pipelines conditioned on a human-authored Unity-specific intermediate representation (IR), across three IR configurations…
- Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Entropy-Aware On-Policy Distillation of Language Models
Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou · Mar 7, 2026 · Citations: 0
Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods.
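The Pass@8 figures above follow the standard pass@k protocol: sample n generations per problem, count the c that pass, and estimate the chance that at least one of k samples is correct. A minimal sketch of the widely used unbiased estimator (the paper's exact sampling setup, e.g. its value of n, is not stated in the summary):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c of the n
    generations passed. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must hit a success.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with n=16 generations of which c=1 is correct, pass@8 evaluates to 0.5.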
- CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li · Mar 7, 2026 · Citations: 0
Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy.
- Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision
Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng · Mar 7, 2026 · Citations: 0
We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions.
- Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang · Mar 7, 2026 · Citations: 0
In this paper, we propose Hit-RAG, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline.