Benchmark Hub

Retrieval In CS.LG Papers

Updated from current HFEPX corpus (Feb 27, 2026). 13 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 13 Last published: Feb 26, 2026 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 13 papers for Retrieval In CS.LG Papers. Dominant protocol signals include automatic metrics, human evaluation, with frequent benchmark focus on Retrieval, MMLU and metric focus on accuracy, calibration. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

7.7% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence , MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
automatic metrics appears in 100% of papers in this hub.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training , Revisiting RAG Retrievers: An Information Theoretic Benchmark
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training , Revisiting RAG Retrievers: An Information Theoretic Benchmark

Protocol Takeaways

Most common quality-control signal is rater calibration (7.7% of papers).

Evidence: Humanity's Last Exam , MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.

Evidence: OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models , Humanity's Last Exam , MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: A Benchmark for Deep Information Synthesis , MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Benchmark Interpretation

Retrieval appears in 100% of hub papers (13/13); use this cohort for benchmark-matched comparisons.
MMLU appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 38.5% of hub papers (5/13); compare with a secondary metric before ranking methods.
calibration is reported in 7.7% of hub papers (1/13); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (7.7% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (7.7% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (100% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (69.2% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (15.4% vs 35% target).
Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (23.1% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (7.7% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (7.7% vs 30% target).

Papers naming benchmarks/datasets

Coverage is strong (100% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (69.2% vs 35% target).

Papers with known rater population

Coverage is a replication risk (15.4% vs 35% target).

Papers with known annotation unit

Coverage is usable but incomplete (23.1% vs 35% target).

Known Limitations

Only 7.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=12

1 papers use both Human Eval and Automatic Metrics.

Benchmark Brief

Retrieval

Coverage: 13 papers (100%)

13 papers (100%) mention Retrieval.

Examples: MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Benchmark Brief

MMLU

Coverage: 1 papers (7.7%)

1 papers (7.7%) mention MMLU.

Examples: Humanity's Last Exam

Metric Brief

accuracy

Coverage: 5 papers (38.5%)

5 papers (38.5%) mention accuracy.

Examples: MoDora: Tree-Based Semi-Structured Document Analysis System , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training , Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

Metric Brief

calibration

Coverage: 1 papers (7.7%)

1 papers (7.7%) mention calibration.

Examples: Humanity's Last Exam

Metric Brief

context length

Coverage: 1 papers (7.7%)

1 papers (7.7%) mention context length.

Examples: Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers On This Benchmark

MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026

Automatic Metrics

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026

Automatic Metrics

In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026

Automatic Metrics Long Horizon

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026

Automatic Metrics

Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026

Automatic Metrics

Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026

Human EvalAutomatic Metrics Tool Use

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026

Automatic Metrics

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel · Feb 16, 2026

Automatic Metrics

Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particular
Neurosymbolic Retrievers for Retrieval-augmented Generation
Yash Saxena, Manas Gaur · Jan 8, 2026

Automatic Metrics

Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency.
OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler · Nov 30, 2025

Automatic Metrics

The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs.
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025

Automatic Metrics

However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025

Automatic Metrics

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.
Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu · Jan 24, 2025

Automatic Metrics

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities.

Other Benchmark Hubs

Retrieval In CS.LG Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers On This Benchmark

Other Benchmark Hubs