- AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026
Automatic Metrics Multi Agent
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
- MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026
Automatic Metrics
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
- Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun · Feb 26, 2026
Automatic Metrics
We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026
Automatic Metrics
Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026
Automatic Metrics Long Horizon
Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
- DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026
Automatic Metrics
Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modes
- Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026
Automatic Metrics
Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
- HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG
Yuqi Huang, Ning Liao, Kai Yang, Anning Hu, Shengchao Hu · Feb 24, 2026
Automatic Metrics
Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8× speedup over leading Graph-based RAG baselines.
- Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026
Automatic Metrics
We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.
- VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026
Automatic Metrics Long Horizon
Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
- AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026
Automatic Metrics Long Horizon
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
- Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026
Automatic Metrics Simulation Env
The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter
- Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin · Feb 17, 2026
Automatic Metrics
Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%).
- NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering
Rong Fu, Yang Li, Zeyu Zhang, Jiekai Wu, Yaohua Liu · Feb 17, 2026
Automatic Metrics
Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel · Feb 16, 2026
Automatic Metrics
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particular
- Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
Ruoxi Liu, Philipp Koehn · Feb 16, 2026
Automatic Metrics
This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs).
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026
Automatic Metrics
16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs.
- Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness
Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini · Feb 15, 2026
Automatic Metrics
Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent.
- CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation
Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang · Nov 14, 2025
Automatic Metrics
Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist in reducing perceived quality: accent bias, where models default t
- RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025
Automatic Metrics Long Horizon
A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes can
- Embedding-Based Context-Aware Reranker
Ye Yuan, Mohammad Amin Shabani, Siqi Liu · Oct 15, 2025
Automatic Metrics
We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
- Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025
Automatic Metrics Long Horizon
We additionally contribute a CAD dataset with human preference annotations.
- PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025
Automatic Metrics
We evaluate our system on three benchmarks (TriviaQA, HotpotQA, and DiaASQ) and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
- Structure-Augmented Reasoning Generation
Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025
Automatic Metrics
Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning
- Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025
Automatic Metrics
Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025
Automatic Metrics
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
- Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability
Gaurav Kumar, Ayush Garg, Debajyoti Mazumder, Aditya Kishore, Babu Kumar · May 21, 2025
Automatic Metrics
Automated fact-checking has been a challenging task for the research community.
- Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning
Yuehan Qin, Shawn Li, Yi Nian, Xinyan Velocity Yu, Yue Zhao · Apr 8, 2025
Automatic Metrics
Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses.
- MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025
Automatic Metrics
Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
- Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes
Zhanliang Wang, Da Wu, Quan Nguyen, Kai Wang · Mar 15, 2025
Automatic Metrics
These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes.
- Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu · Jan 24, 2025
Automatic Metrics
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities.
- Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli · Jun 7, 2024
Automatic Metrics
MRAG integrates seamlessly with existing RAG frameworks and benchmarks.