- SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026 · Citations: 0
Automatic Metrics
Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in n
- CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li · Feb 26, 2026 · Citations: 0
Simulation Env
In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements.
- MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026 · Citations: 0
Automatic Metrics
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
- Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026 · Citations: 0
Automatic Metrics
In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
- LiCQA : A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026 · Citations: 0
Automatic Metrics
The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
- Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Touseef Hasan, Laila Cure, Souvika Sarkar · Feb 25, 2026 · Citations: 0
Simulation Env
We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios.
- Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026 · Citations: 0
Automatic Metrics
Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
- Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Barah Fazili, Koustava Goswami · Feb 25, 2026 · Citations: 0
Automatic Metrics
This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models.
- Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas · Feb 25, 2026 · Citations: 0
Automatic Metrics
Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly.
- Revisiting Text Ranking in Deep Research
Chuan Meng, Litu Ou, Sean MacAvaney, Jeff Dalton · Feb 25, 2026 · Citations: 0
Automatic Metrics
To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it.
- A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
- Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz · Feb 24, 2026 · Citations: 0
Automatic Metrics
We study efficient multi-vector retrieval for late interaction in any modality.
- A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0
Human EvalAutomatic Metrics Tool Use
Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
- Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Sanket Badhe, Deep Shah · Feb 24, 2026 · Citations: 0
Automatic Metrics
These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-v
- Position-Aware Sequential Attention for Accurate Next Item Recommendations
Timur Nabiev, Evgeny Frolov · Feb 24, 2026 · Citations: 0
Automatic Metrics
Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.
- HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao · Feb 24, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.
- Generative Pseudo-Labeling for Pre-Ranking with LLMs
Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics
Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking.
- E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
Jiwoo Kang, Yeon-Chang Lee · Feb 24, 2026 · Citations: 0
Automatic Metrics
Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalizati
- RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition
Kun Ran, Marwah Alaofi, Danula Hettiachchi, Chenglong Ma, Khoi Nguyen Dinh Anh · Feb 24, 2026 · Citations: 0
Automatic Metrics
R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.
- KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026 · Citations: 0
Automatic Metrics
Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
- NanoKnow: How to Know What Your Language Model Knows
Lingwei Gu, Nour Jedidi, Jimmy Lin · Feb 23, 2026 · Citations: 0
Automatic Metrics
Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre
- Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval
Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao · Feb 23, 2026 · Citations: 0
Automatic Metrics
We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retr
- Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo · Feb 23, 2026 · Citations: 0
Automatic Metrics
Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications.
- Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation
Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong · Feb 23, 2026 · Citations: 0
Expert Verification Automatic Metrics
Additionally, we present \textbf{HyperDocRED}, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.
- PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026 · Citations: 0
Automatic Metrics
Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
- Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
- Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026 · Citations: 0
Automatic Metrics
Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
- VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026 · Citations: 0
Automatic Metrics
Existing Cultural benchmarks are (i) Manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
- RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026 · Citations: 0
Automatic Metrics
Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
- Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026 · Citations: 0
Automatic Metrics
Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
- CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026 · Citations: 0
Automatic Metrics
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
- Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0
Automatic Metrics Multi Agent
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
- WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović · Feb 19, 2026 · Citations: 0
Automatic Metrics
We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages.
- ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
Antoine Chaffin, Luca Arnaboldi, Amélie Chatelain, Florent Krzakala · Feb 18, 2026 · Citations: 0
Automatic Metrics
Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models.
- Variable-Length Semantic IDs for Recommender Systems
Kirill Khrylchenko · Feb 18, 2026 · Citations: 0
Automatic Metrics
In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions.
- ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan · Feb 16, 2026 · Citations: 0
Automatic Metrics
ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
- Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao · Feb 16, 2026 · Citations: 0
Automatic Metrics
News recommendation plays a critical role in online news platforms by helping users discover relevant content.
- Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao · Feb 16, 2026 · Citations: 0
Automatic Metrics
Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment.
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026 · Citations: 0
Automatic Metrics
16.1\% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2\% vs.