Benchmark Hub

Retrieval Benchmark Papers With Accuracy

Updated from current HFEPX corpus (Feb 27, 2026). 32 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 32 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 32 papers for Retrieval Benchmark Papers With Accuracy. Dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are Retrieval and HotpotQA, and the most frequent metrics are accuracy and context length. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

6.3% of papers report explicit human-feedback signals, led by expert verification.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MoDora: Tree-Based Semi-Structured Document Analysis System; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Automatic metrics appear in 100% of papers in this hub.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MoDora: Tree-Based Semi-Structured Document Analysis System; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MoDora: Tree-Based Semi-Structured Document Analysis System; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Protocol Takeaways

Most common quality-control signal is rater calibration (3.1% of papers).

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MoDora: Tree-Based Semi-Structured Document Analysis System; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MoDora: Tree-Based Semi-Structured Document Analysis System; Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Stratify by benchmark (Retrieval vs HotpotQA) before comparing methods.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MoDora: Tree-Based Semi-Structured Document Analysis System; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
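
The stratification takeaway can be sketched in a few lines. This is a minimal illustration, assuming each paper is available as a record with a title and a list of benchmark tags; the record layout, the shortened titles, and the `stratify_by_benchmark` helper are hypothetical and do not describe the HFEPX schema.

```python
from collections import defaultdict

# Hypothetical records: the "benchmarks" field is an assumed layout,
# and titles are shortened from papers listed in this hub.
papers = [
    {"title": "RELOOP", "benchmarks": ["Retrieval", "HotpotQA"]},
    {"title": "PersonalAI", "benchmarks": ["Retrieval", "HotpotQA"]},
    {"title": "MoDora", "benchmarks": ["Retrieval"]},
]

def stratify_by_benchmark(records):
    """Group paper titles by benchmark so methods are compared within one stratum."""
    strata = defaultdict(list)
    for record in records:
        for benchmark in record["benchmarks"]:
            strata[benchmark].append(record["title"])
    return dict(strata)

strata = stratify_by_benchmark(papers)
for benchmark, titles in strata.items():
    print(f"{benchmark}: {len(titles)} paper(s)")
```

Comparing methods only within one stratum at a time avoids ranking a HotpotQA result against a result from an unrelated retrieval setup.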

Benchmark Interpretation

Retrieval appears in 100% of hub papers (32/32); use this cohort for benchmark-matched comparisons.
HotpotQA appears in 6.3% of hub papers (2/32); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 100% of hub papers (32/32); compare with a secondary metric before ranking methods.
context length is reported in 9.4% of hub papers (3/32); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (6.3% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (3.1% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (100% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (12.5% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (6.3% vs 35% target).
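
The checklist percentages are consistent with simple counts over 32 papers (for example 2/32 for 6.3% and 4/32 for 12.5%). The sketch below reproduces them under two assumptions that the page does not state: percentages are rounded half-up to one decimal, and an item is "strong" when coverage meets or exceeds its target and a "replication risk" otherwise. The function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_PAPERS = 32  # hub size reported on this page

def coverage_pct(count, total=TOTAL_PAPERS):
    """Coverage as a percentage, rounded half-up to one decimal place."""
    pct = Decimal(100) * Decimal(count) / Decimal(total)
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

def checklist_status(count, target_pct, total=TOTAL_PAPERS):
    """Assumed rule: at or above target is 'strong', below is 'replication risk'."""
    if coverage_pct(count, total) >= Decimal(target_pct):
        return "strong"
    return "replication risk"

# Counts back-computed from the percentages above (an inference, not source data).
human_feedback = (coverage_pct(2), checklist_status(2, 45))
named_benchmarks = (coverage_pct(32), checklist_status(32, 35))
print(human_feedback, named_benchmarks)
```

Half-up rounding matters here: Python's built-in `round(6.25, 1)` returns 6.2, while the page reports 6.3%.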

Known Limitations

Only 3.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=31, right_only=0

1 paper uses both Automatic Metrics and Simulation Env.
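
The both/left_only/right_only counts are plain set arithmetic over the papers tagged with each evaluation mode. The sketch below is illustrative only: the paper IDs are placeholders and `overlap_counts` is a hypothetical helper, not part of HFEPX.

```python
def overlap_counts(left, right):
    """Count items tagged with both modes, only the left mode, or only the right mode."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Placeholder IDs (assumption): 32 papers tagged automatic_metrics,
# one of which is also tagged simulation_env.
automatic_metrics = {f"paper_{i:02d}" for i in range(32)}
simulation_env = {"paper_11"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)
```

The three counts sum to the number of papers carrying at least one of the two tags, which gives a quick consistency check against the hub size.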

Benchmark Brief

Retrieval

Coverage: 32 papers (100%)

32 papers (100%) mention Retrieval.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MoDora: Tree-Based Semi-Structured Document Analysis System; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

Benchmark Brief

HotpotQA

Coverage: 2 papers (6.3%)

2 papers (6.3%) mention HotpotQA.

Examples: RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA; PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents

Benchmark Brief

MMLU

Coverage: 1 paper (3.1%)

1 paper (3.1%) mentions MMLU.

Examples: Humanity's Last Exam

Metric Brief

accuracy

Coverage: 32 papers (100%)

32 papers (100%) mention accuracy.

Metric Brief

context length

Coverage: 3 papers (9.4%)

3 papers (9.4%) mention context length.

Examples: DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs; Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness; Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Metric Brief

cost

Coverage: 2 papers (6.3%)

2 papers (6.3%) mention cost.

Examples: Cross-lingual Matryoshka Representation Learning across Speech and Text; Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MoDora: Tree-Based Semi-Structured Document Analysis System; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers On This Benchmark

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026

Automatic Metrics Multi Agent

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026

Automatic Metrics

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun · Feb 26, 2026

Automatic Metrics

We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026

Automatic Metrics

Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026

Automatic Metrics Long Horizon

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026

Automatic Metrics

Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modes
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026

Automatic Metrics

Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG
Yuqi Huang, Ning Liao, Kai Yang, Anning Hu, Shengchao Hu · Feb 24, 2026

Automatic Metrics

Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8× speedup over leading Graph-based RAG baselines.
Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026

Automatic Metrics

We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026

Automatic Metrics Long Horizon

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026

Automatic Metrics Long Horizon

With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026

Automatic Metrics Simulation Env

The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter
Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin · Feb 17, 2026

Automatic Metrics

Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%).
NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering
Rong Fu, Yang Li, Zeyu Zhang, Jiekai Wu, Yaohua Liu · Feb 17, 2026

Automatic Metrics

Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.
Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel · Feb 16, 2026

Automatic Metrics

Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particular
Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
Ruoxi Liu, Philipp Koehn · Feb 16, 2026

Automatic Metrics

This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs).
Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026

Automatic Metrics

16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs.
Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness
Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini · Feb 15, 2026

Automatic Metrics

Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent.
CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation
Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang · Nov 14, 2025

Automatic Metrics

Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist in reducing perceived quality: accent bias, where models default t
RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025

Automatic Metrics Long Horizon

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can
Embedding-Based Context-Aware Reranker
Ye Yuan, Mohammad Amin Shabani, Siqi Liu · Oct 15, 2025

Automatic Metrics

We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025

Automatic Metrics Long Horizon

We additionally contribute a CAD dataset with human preference annotations.
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025

Automatic Metrics

We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
Structure-Augmented Reasoning Generation
Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025

Automatic Metrics

Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning
Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025

Automatic Metrics

Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025

Automatic Metrics

However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability
Gaurav Kumar, Ayush Garg, Debajyoti Mazumder, Aditya Kishore, Babu kumar · May 21, 2025

Automatic Metrics

Automated fact-checking has been a challenging task for the research community.
Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning
Yuehan Qin, Shawn Li, Yi Nian, Xinyan Velocity Yu, Yue Zhao · Apr 8, 2025

Automatic Metrics

Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses.
MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025

Automatic Metrics

Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes
Zhanliang Wang, Da Wu, Quan Nguyen, Kai Wang · Mar 15, 2025

Automatic Metrics

These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes.
Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu · Jan 24, 2025

Automatic Metrics

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities.
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli · Jun 7, 2024

Automatic Metrics

MRAG integrates seamlessly with existing RAG frameworks and benchmarks.
