- SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali · Mar 7, 2026 · Citations: 0
Long Horizon
Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies.
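The agentic loop the abstract describes (iterative retrieval steered by the model's own reasoning) can be sketched minimally as below. This is an illustrative toy, not the paper's system: `retrieve` is a trivial word-overlap matcher and `llm` is a hard-coded stub standing in for a real language model.

```python
def retrieve(query, corpus):
    """Toy lexical retriever: return documents sharing a word with the query."""
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.lower().split())]

def llm(question, context):
    """Stub reasoner: answers once the context mentions the key term,
    otherwise requests a reformulated follow-up retrieval."""
    if any("paris" in c.lower() for c in context):
        return {"answer": "Paris", "followup": None}
    return {"answer": None, "followup": "capital France"}

def agentic_rag(question, corpus, max_steps=3):
    context, query = [], question
    for _ in range(max_steps):                # bounded multi-step reasoning
        context += retrieve(query, corpus)    # iterative retrieval into memory
        step = llm(question, context)         # reason over accumulated context
        if step["answer"] is not None:
            return step["answer"]
        query = step["followup"]              # agent reformulates the query
    return None

corpus = ["France is in Europe", "The capital of France is Paris"]
print(agentic_rag("French seat?", corpus))
```

The first retrieval finds nothing (no word overlap), so the stub "agent" issues a follow-up query that does match, illustrating the retrieve-reason-retrieve cycle surveyed by the paper.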
- Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness
Ravi Ranjan, Utkarsh Grover, Agorista Polyzou · Mar 7, 2026 · Citations: 0
Critique Edit
Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography.
- RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts
Darya Kharlamova, Irina Proskurina · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Nouran Khallaf, Serge Sharoff · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
Nouran Khallaf, Serge Sharoff · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Third Ambition: Artificial Intelligence and the Science of Human Behavior
W. Russell Neuman, Chad Coleman · Mar 7, 2026 · Citations: 0
Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that…
- Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
Po-Chun Hsu, Meng-Hsi Chen, Tsu Ling Chao, Chia Tien Han, Da-shan Shiu · Mar 7, 2026 · Citations: 0
To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin.
- Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster
Minu Kim, Hoirin Kim, David R. Mortensen · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah · Mar 7, 2026 · Citations: 0
As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception (defined behaviorally as the systematic provision of false information to satisfy external incentives) poses a significant challenge to AI safety.
- Fine-Grained Table Retrieval Through the Lens of Complex Queries
Wojciech Kosiuk, Xingyu Ji, Yeounoh Chung, Fatma Özcan, Madelon Hulsebos · Mar 7, 2026 · Citations: 0
Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.
- Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
Yoshiki Tanaka, Ryuichi Uehara, Koji Inoue, Michimasa Inaba · Mar 7, 2026 · Citations: 0
Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions.
- Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information
Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara · Mar 7, 2026 · Citations: 0
In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG.
- Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Entropy-Aware On-Policy Distillation of Language Models
Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou · Mar 7, 2026 · Citations: 0
Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods.
- CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li · Mar 7, 2026 · Citations: 0
Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy.
- Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision
Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng · Mar 7, 2026 · Citations: 0
We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions.
- Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang · Mar 7, 2026 · Citations: 0
Pairwise Preference
In this paper, we propose Hit-RAG, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline.
- AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
Karen Zhou, Chenhao Tan · Mar 7, 2026 · Citations: 0
Pairwise Preference
Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge.
- Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026 · Citations: 0
Pairwise Preference · Red Team
Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
- A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity
Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn · Mar 7, 2026 · Citations: 0
We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems.
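One baseline such evaluations typically include is fixed-size chunking with overlap; a minimal sketch follows. The window and overlap sizes here are illustrative placeholders, not values from the paper.

```python
def chunk(text, size=5, overlap=2):
    """Split text into word windows of `size` tokens,
    each overlapping the previous window by `overlap` tokens."""
    words = text.split()
    step = size - overlap  # how far each window advances
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

doc = "one two three four five six seven eight nine"
for c in chunk(doc):
    print(c)
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of index size; the paper's evaluation compares trade-offs like this across domains and embedding models.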
- Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues
Bradley P. Allen · Mar 7, 2026 · Citations: 0
Expert Verification
A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent.
- Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards
Xin Zhang, Xingyu Li, Rongguang Wang, Ruizhong Miao, Zheng Wang · Mar 7, 2026 · Citations: 0
Our experiments demonstrate that Chart-RL consistently outperforms supervised fine-tuning (SFT) across different chart understanding benchmarks, achieving relative improvements of 16.7% on MultiChartQA and 11.5% on ChartInsights.