Benchmark Hub

Retrieval + Long Horizon Benchmark Papers

Updated from current HFEPX corpus (Feb 27, 2026). 14 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 14 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 14 papers on retrieval and long-horizon benchmarks. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are Retrieval and ALFWorld, and the most frequent metrics are accuracy and f1. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

14.3% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Automatic metrics appear in 85.7% of papers in this hub.

Evidence: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Structurally Aligned Subtask-Level Memory for Software Engineering Agents; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Structurally Aligned Subtask-Level Memory for Software Engineering Agents; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

Protocol Takeaways

Quality-control reporting is absent in this slice (0% of papers); prioritize papers with explicit calibration or adjudication steps when they appear.

Evidence: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Structurally Aligned Subtask-Level Memory for Software Engineering Agents; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Where the rater population is reported (7.1% of papers), raters are domain experts; annotation is most commonly trajectory-level. Use this to scope replication staffing.

Evidence: Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Structurally Aligned Subtask-Level Memory for Software Engineering Agents; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Stratify by benchmark (Retrieval vs ALFWorld) before comparing methods.

Evidence: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Structurally Aligned Subtask-Level Memory for Software Engineering Agents; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
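
The stratification advice can be sketched in a few lines of Python. This is a minimal illustration, not the hub's own tooling: the record layout (title, benchmarks) and the function name stratify_by_benchmark are hypothetical, titles are shortened, and only the benchmark tags come from this page.

```python
from collections import defaultdict

# Benchmark tags as listed on this page (every hub paper mentions Retrieval).
# The record schema and the shortened titles are assumptions for illustration.
PAPERS = [
    {"title": "Search-P1", "benchmarks": ["Retrieval"]},
    {"title": "Embodied Task Planning", "benchmarks": ["Retrieval", "ALFWorld"]},
    {"title": "Hybrid Deep Searcher", "benchmarks": ["Retrieval", "BrowseComp"]},
]


def stratify_by_benchmark(papers):
    """Group paper titles by benchmark; a paper may land in several strata."""
    strata = defaultdict(list)
    for paper in papers:
        for benchmark in paper["benchmarks"]:
            strata[benchmark].append(paper["title"])
    return dict(strata)


strata = stratify_by_benchmark(PAPERS)
for benchmark, titles in sorted(strata.items()):
    print(f"{benchmark}: {len(titles)} paper(s)")
```

Because a paper can carry more than one benchmark tag, the strata overlap; compare methods only within a single stratum.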

Benchmark Interpretation

Retrieval appears in 100% of hub papers (14/14); use this cohort for benchmark-matched comparisons.
ALFWorld appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 35.7% of hub papers (5/14); compare with a secondary metric before ranking methods.
f1 is reported in 21.4% of hub papers (3/14); compare with a secondary metric before ranking methods.
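
The percentages in these sections follow from paper counts over the 14-paper hub. A small sketch, assuming one-decimal rounding (the page does not state its rounding rule) and a hypothetical helper name coverage_pct:

```python
def coverage_pct(count, total):
    """Share of hub papers as a percentage, rounded to one decimal place.

    One-decimal rounding is an assumption that happens to reproduce the
    figures shown on this page.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


HUB_SIZE = 14  # papers in this hub
METRIC_COUNTS = {"accuracy": 5, "f1": 3}  # counts reported on this page

for metric, count in METRIC_COUNTS.items():
    print(f"{metric}: {coverage_pct(count, HUB_SIZE)}% ({count}/{HUB_SIZE})")
```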

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (14.3% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (100% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (71.4% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (7.1% vs 35% target).
Maintain strength on Papers with known annotation unit. Coverage is strong (35.7% vs 35% target).
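
The checklist labels are consistent with a simple threshold rule: coverage at or above the target is "strong", anything below is a "replication risk". The sketch below encodes that rule. The rule itself is inferred from the six rows above rather than documented by the page, and the names CHECKLIST and coverage_status are hypothetical.

```python
# (label, coverage %, target %) as listed in the checklist above.
CHECKLIST = [
    ("Papers with explicit human feedback", 14.3, 45.0),
    ("Papers reporting quality controls", 0.0, 30.0),
    ("Papers naming benchmarks/datasets", 100.0, 35.0),
    ("Papers naming evaluation metrics", 71.4, 35.0),
    ("Papers with known rater population", 7.1, 35.0),
    ("Papers with known annotation unit", 35.7, 35.0),
]


def coverage_status(coverage, target):
    """Assumed rule: meeting or exceeding the target counts as strong."""
    return "strong" if coverage >= target else "replication risk"


statuses = {label: coverage_status(cov, tgt) for label, cov, tgt in CHECKLIST}
for label, cov, tgt in CHECKLIST:
    print(f"{label}: {statuses[label]} ({cov}% vs {tgt}% target)")
```

Under this rule the annotation-unit row passes by only 0.7 percentage points, so a single reclassified paper would flip it to a replication risk.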

Known Limitations

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Rater population is under-specified (7.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 0 papers. Automatic Metrics only: 12 papers. Simulation Env only: 2 papers.
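
An overlap breakdown of this kind is a three-way set comparison. A minimal sketch, using placeholder paper identifiers (not real corpus IDs) sized to match the 12 and 2 counts reported for this hub; the function name overlap is hypothetical:

```python
# Placeholder identifiers; only the set sizes (12 and 2) come from this page.
automatic_metrics = {f"paper_{i:02d}" for i in range(1, 13)}  # 12 papers
simulation_env = {f"paper_{i:02d}" for i in range(13, 15)}    # 2 papers


def overlap(left, right):
    """Count items in both sets, only in the left set, and only in the right."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


counts = overlap(automatic_metrics, simulation_env)
print(counts)
```

The three counts sum to the hub size (14), which is a quick consistency check when every paper carries exactly one of the two evaluation modes.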

Benchmark Brief

Retrieval

Coverage: 14 papers (100%)

14 papers (100%) mention Retrieval.

Examples: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Structurally Aligned Subtask-Level Memory for Software Engineering Agents; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Benchmark Brief

ALFWorld

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions ALFWorld.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

BrowseComp

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions BrowseComp.

Examples: Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Metric Brief

accuracy

Coverage: 5 papers (35.7%)

5 papers (35.7%) mention accuracy.

Examples: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval; AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Metric Brief

f1

Coverage: 3 papers (21.4%)

3 papers (21.4%) mention f1.

Examples: RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA; PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation; Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Metric Brief

coherence

Coverage: 2 papers (14.3%)

2 papers (14.3%) mention coherence.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model; Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Structurally Aligned Subtask-Level Memory for Software Engineering Agents; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers On This Benchmark

Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026

Automatic Metrics · Long Horizon

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s

Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026

Automatic Metrics · Long Horizon

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026

Automatic Metrics · Long Horizon

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026

Automatic Metrics · Long Horizon

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026

Automatic Metrics · Long Horizon

With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.

OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong · Feb 3, 2026

Automatic Metrics · Tool Use

To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.

Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026

Simulation Env · Long Horizon

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.

Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer
Myung Ho Kim · Nov 21, 2025

Automatic Metrics · Long Horizon

Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences.

Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025

Automatic Metrics · Long Horizon

On the Episodic Memory Benchmark (EpBench) [huet_episodic_2025] comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG based baselines by up to 20%.

RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025

Automatic Metrics · Long Horizon

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can

PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation
Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu · Oct 14, 2025

Automatic Metrics · Long Horizon

Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining s

Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025

Automatic Metrics · Long Horizon

We additionally contribute a CAD dataset with human preference annotations.

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee · Aug 26, 2025

Automatic Metrics · Long Horizon

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.

A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang · Mar 16, 2025

Simulation Env · Long Horizon

With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.
