Benchmark Hub

Retrieval In CS.IR Papers

Updated from current HFEPX corpus (Feb 27, 2026). 31 papers are grouped on this benchmark page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 31 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 31 papers for Retrieval In CS.IR Papers. Dominant protocol signals include automatic metrics, human evaluation, and simulation environments, with frequent benchmark focus on Retrieval and BrowseComp and metric focus on accuracy and recall. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

9.7% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; MoDora: Tree-Based Semi-Structured Document Analysis System; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Automatic metrics appear in 96.8% of papers in this hub.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Revisiting RAG Retrievers: An Information Theoretic Benchmark

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access

Protocol Takeaways

The most common quality-control signal is rater calibration (3.2% of papers).

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; MoDora: Tree-Based Semi-Structured Document Analysis System; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: A Benchmark for Deep Information Synthesis; MoDora: Tree-Based Semi-Structured Document Analysis System; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Benchmark Interpretation

Retrieval appears in 100% of hub papers (31/31); use this cohort for benchmark-matched comparisons.
BrowseComp appears in 3.2% of hub papers (1/31); with a single paper, this cohort is too small for benchmark-matched comparisons within the hub.

Metric Interpretation

Accuracy is reported in 19.4% of hub papers (6/31); compare with a secondary metric before ranking methods.
Recall is reported in 16.1% of hub papers (5/31); compare with a secondary metric before ranking methods.
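The coverage percentages quoted in the Benchmark Interpretation and Metric Interpretation lines can be reproduced from the raw counts. The sketch below assumes each percentage is simply count / 31 rounded to one decimal place; the helper name `coverage_pct` is hypothetical and not part of any HFEPX tooling.

```python
def coverage_pct(count: int, total: int) -> float:
    """Share of hub papers as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


HUB_SIZE = 31  # number of papers in this hub, as stated on the page

# Counts quoted in the interpretation lines above.
reported = {"Retrieval": 31, "BrowseComp": 1, "accuracy": 6, "recall": 5}

for name, count in reported.items():
    print(f"{name}: {coverage_pct(count, HUB_SIZE)}% ({count}/{HUB_SIZE})")
```

Under that assumption the printed values match the page: 100.0%, 3.2%, 19.4%, and 16.1%.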

Researcher Checklist

Close the gap on papers with explicit human feedback. Coverage is a replication risk (9.7% vs 45% target).
Close the gap on papers reporting quality controls. Coverage is a replication risk (3.2% vs 30% target).
Maintain strength on papers naming benchmarks/datasets. Coverage is strong (100% vs 35% target).
Maintain strength on papers naming evaluation metrics. Coverage is strong (45.2% vs 35% target).
Close the gap on papers with known rater population. Coverage is a replication risk (6.5% vs 35% target).
Tighten coverage on papers with known annotation unit. Coverage is usable but incomplete (22.6% vs 35% target).
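The three status labels in the checklist can be read as bands of coverage relative to target. The page does not state its cut-offs, so the sketch below is only one rule that happens to reproduce all six rows: "strong" at or above target, "usable but incomplete" at or above half the target, and "replication risk" below that. The 0.5 threshold and the function name `coverage_status` are assumptions.

```python
def coverage_status(coverage: float, target: float) -> str:
    """Classify coverage against a target (both in percent).

    The 0.5 cut-off is an assumption chosen to fit the six checklist rows;
    the page itself does not document its thresholds.
    """
    ratio = coverage / target
    if ratio >= 1.0:
        return "strong"
    if ratio >= 0.5:
        return "usable but incomplete"
    return "replication risk"


# (label, coverage %, target %) as listed in the checklist above.
checklist = [
    ("explicit human feedback", 9.7, 45),
    ("quality controls", 3.2, 30),
    ("benchmarks/datasets named", 100, 35),
    ("evaluation metrics named", 45.2, 35),
    ("known rater population", 6.5, 35),
    ("known annotation unit", 22.6, 35),
]

for label, coverage, target in checklist:
    print(f"{label}: {coverage_status(coverage, target)}")
```

Any cut-off between roughly 0.22 and 0.64 would fit the same six rows, so treat the exact value as illustrative.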

Known Limitations

Only 3.2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=29

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=30, right_only=1

0 papers use both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=1, right_only=1

0 papers use both Human Eval and Simulation Env.
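The both / left_only / right_only counts in the three comparisons above are plain set arithmetic over per-paper evaluation-mode tags. The sketch below uses hypothetical integer paper IDs (0 to 30) arranged to match the page's totals: 30 papers with automatic metrics, one of which also has human eval, and one paper with only a simulation environment. The function name `overlap` is likewise an assumption.

```python
def overlap(left: set, right: set) -> dict:
    """Count papers tagged with both modes, only the left one, or only the right one."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical paper IDs; only the set sizes and intersections reflect the page.
modes = {
    "human_eval": {0},
    "automatic_metrics": set(range(30)),  # papers 0..29
    "simulation_env": {30},
}

pairs = [
    ("human_eval", "automatic_metrics"),
    ("automatic_metrics", "simulation_env"),
    ("human_eval", "simulation_env"),
]

for left, right in pairs:
    print(f"{left} vs {right}: {overlap(modes[left], modes[right])}")
```

In each comparison, "left" is the mode named first in the heading and "right" is the mode named second.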

Benchmark Brief

Retrieval

Coverage: 31 papers (100%)

31 papers (100%) mention Retrieval.

Examples: MoDora: Tree-Based Semi-Structured Document Analysis System; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Benchmark Brief

BrowseComp

Coverage: 1 paper (3.2%)

1 paper (3.2%) mentions BrowseComp.

Examples: Revisiting Text Ranking in Deep Research

Benchmark Brief

Financebench

Coverage: 1 paper (3.2%)

1 paper (3.2%) mentions Financebench.

Examples: Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Metric Brief

accuracy

Coverage: 6 papers (19.4%)

6 papers (19.4%) mention accuracy.

Examples: MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Metric Brief

recall

Coverage: 5 papers (16.1%)

5 papers (16.1%) mention recall.

Examples: E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications; RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering; Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Metric Brief

relevance

Coverage: 3 papers (9.7%)

3 papers (9.7%) mention relevance.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering; OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: MoDora: Tree-Based Semi-Structured Document Analysis System; Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers On This Benchmark

MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026

Automatic Metrics

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026

Automatic Metrics

In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026

Automatic Metrics · Long Horizon

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Touseef Hasan, Laila Cure, Souvika Sarkar · Feb 25, 2026

Simulation Env

We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios.
Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026

Automatic Metrics

Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
Revisiting Text Ranking in Deep Research
Chuan Meng, Litu Ou, Sean MacAvaney, Jeff Dalton · Feb 25, 2026

Automatic Metrics

To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it.
Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz · Feb 24, 2026

Automatic Metrics

We study efficient multi-vector retrieval for late interaction in any modality.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026

Human Eval · Automatic Metrics · Tool Use

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
Jiwoo Kang, Yeon-Chang Lee · Feb 24, 2026

Automatic Metrics

Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalizati
RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition
Kun Ran, Marwah Alaofi, Danula Hettiachchi, Chenglong Ma, Khoi Nguyen Dinh Anh · Feb 24, 2026

Automatic Metrics

R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026

Automatic Metrics

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval
Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao · Feb 23, 2026

Automatic Metrics

We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retr
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo · Feb 23, 2026

Automatic Metrics

Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications.
Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026

Automatic Metrics · Long Horizon

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026

Automatic Metrics

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026

Automatic Metrics

Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026

Automatic Metrics

Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović · Feb 19, 2026

Automatic Metrics

We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages.
ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan · Feb 16, 2026

Automatic Metrics

ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026

Automatic Metrics

16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs.
AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik, Ashish Anand · Jan 15, 2026

Automatic Metrics

We introduce AWED-FiNER, an open-source collection of agentic tool, web application, and 53 state-of-the-art expert models that provide Fine-grained Named Entity Recognition (FgNER) solutions across 36 languages spoken by more than
Neurosymbolic Retrievers for Retrieval-augmented Generation
Yash Saxena, Manas Gaur · Jan 8, 2026

Automatic Metrics

Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency.
OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler · Nov 30, 2025

Automatic Metrics

The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs.
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025

Automatic Metrics

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao · Oct 10, 2025

Automatic Metrics

We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings.
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025

Automatic Metrics

We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025

Automatic Metrics

We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.
Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025

Automatic Metrics

Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025

Automatic Metrics · Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
Diffusion Generative Recommendation with Continuous Tokens
Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, Wenqi Fan · Apr 16, 2025

Automatic Metrics

Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference.
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli · Jun 7, 2024

Automatic Metrics

MRAG integrates seamlessly with existing RAG frameworks and benchmarks.
