- SODIUM: From Open Web Data to Queryable Databases
Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
- Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du · Apr 7, 2026 · Citations: 0
Rubric Rating Automatic Metrics
To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching.
- A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
- Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0
Pairwise Preference Automatic Metrics Multi Agent
These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
- Role-Augmented Intent-Driven Generative Search Engine Optimization
Xiaolu Chen, Haojie Wu, Jie Bao, Zhen Chen, Yong Liao · Aug 15, 2025 · Citations: 0
Rubric Rating Automatic Metrics Web Browsing
To better evaluate the method under realistic settings, we address the benchmarking limitations of prior work by: (1) extending the GEO dataset with diversified query variations reflecting real-world search scenarios and (2) introducing…
- Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE
Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan · Mar 31, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems.
- Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu · Apr 2, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process.
- PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval
Chun Chet Ng, Jia Yu Lim, Wei Zeng Low · Nov 18, 2025 · Citations: 0
Automatic Metrics Multi Agent
We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and lightweight multi-agent coordination for document and chunk ranking tasks.
- Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Somaya Eltanbouly, Samer Rashwani · Mar 25, 2026 · Citations: 0
Human EvalLlm As Judge
Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments.
- OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang · Mar 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement.
- PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark
Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan · Jan 13, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios.
- SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali · Mar 7, 2026 · Citations: 0
Automatic Metrics Long Horizon
Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies.
- SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang · Apr 6, 2026 · Citations: 0
Automatic Metrics Long Horizon
Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
- Diffusion Generative Recommendation with Continuous Tokens
Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, Wenqi Fan · Apr 16, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference.
- LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval
Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie · Mar 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines.
- Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0
Automatic Metrics Multi Agent
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through…
- A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0
Automatic Metrics Tool Use
To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
- Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025 · Citations: 0
Human EvalAutomatic Metrics
Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
- Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval
Andrea Volpini, Elie Raad, Beatrice Gamba, David Riccitelli · Mar 11, 2026 · Citations: 0
Automatic Metrics Web Browsing
In this paper, we investigate whether structured linked data, specifically Schema.org markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic…
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen · Mar 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive…
- A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
- LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources
Joshua Castillo, Ravi Mukkamala · Apr 8, 2026 · Citations: 0
Automatic Metrics
Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles.
- JUÁ -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections
Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio · Apr 7, 2026 · Citations: 0
Automatic Metrics
We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections.
- WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan · Apr 7, 2026 · Citations: 0
Automatic Metrics
Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector.
- SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT
Guan-Yan Yang, Wei-Ling Wen, Shu-Yuan Ku, Farn Wang, Kuo-Hui Yeh · Apr 7, 2026 · Citations: 0
Automatic Metrics
Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources.
- Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Klaudia Thellmann, Bernhard Stadler, Michael Färber · Apr 2, 2026 · Citations: 0
Automatic Metrics
Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence.
- From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents
Meftun Akarsu, Recep Kaan Karaman, Christopher Mierbach · Apr 2, 2026 · Citations: 0
Automatic Metrics
We benchmark ten retrieval strategies spanning sparse, dense, hybrid fusion, cross-encoder reranking, query expansion, index augmentation, and adaptive retrieval on a challenging financial QA benchmark of 23,088 queries over 7,318 documents…
- FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval
Antonín Jarolím, Martin Fajčík · Mar 31, 2026 · Citations: 0
Automatic Metrics
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models
Brian Felipe Keith-Norambuena, Carolina Inés Rojas-Córdova, Claudio Juvenal Meneses-Villegas, Elizabeth Johanna Lam-Esquenazi, Angélica María Flores-Bustos · Mar 31, 2026 · Citations: 0
Automatic Metrics
We evaluated our approach on a news article corpus using LLM judges with Claude Opus 4.5 and GPT 5.1, measuring both coherence and agenda alignment across 64 endpoint pairs and 6 agendas.
- APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay
Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha · Mar 31, 2026 · Citations: 0
Automatic Metrics
LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before.
- Analysing Calls to Order in German Parliamentary Debates
Nina Smirnova, Daniel Dan, Philipp Mayr · Mar 27, 2026 · Citations: 0
Automatic Metrics
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Adaptive Chunking: Optimizing Chunking-Method Selection for RAG
Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero · Mar 26, 2026 · Citations: 0
Automatic Metrics
Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance.
- GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation
Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng · Mar 26, 2026 · Citations: 0
Automatic Metrics
Prior approaches to this problem include agentic retrieval strategies, which expand the semantic search space by generating additional queries.