Benchmark Hub

Retrieval Benchmark Papers With Cost

Updated from the current HFEPX corpus (Feb 27, 2026). This benchmark page groups 10 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Freeform. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 24, 2026.

Papers: 10 · Last published: Feb 24, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for Retrieval Benchmark Papers With Cost. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are Retrieval and ALFWorld, and the most frequent metrics are cost and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Automatic metrics appear in 80% of papers in this hub.

Evidence: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, Structured Prompt Language: Declarative Context Management for LLMs, Cross-lingual Matryoshka Representation Learning across Speech and Text

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, Structured Prompt Language: Declarative Context Management for LLMs, Cross-lingual Matryoshka Representation Learning across Speech and Text

Long-horizon tasks appear in 10% of papers, indicating some demand for agentic evaluation.

Evidence: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model, Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, Structured Prompt Language: Declarative Context Management for LLMs

Protocol Takeaways

The most common quality-control signal is rater calibration (10% of papers).

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, Structured Prompt Language: Declarative Context Management for LLMs, Cross-lingual Matryoshka Representation Learning across Speech and Text

Rater pools are mostly unspecified, and annotation is commonly Freeform; use this to scope replication staffing.

Evidence: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, Structured Prompt Language: Declarative Context Management for LLMs, Cross-lingual Matryoshka Representation Learning across Speech and Text

Stratify by benchmark (Retrieval vs ALFWorld) before comparing methods.

Evidence: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, Structured Prompt Language: Declarative Context Management for LLMs, Cross-lingual Matryoshka Representation Learning across Speech and Text
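
The stratification step in the last takeaway can be sketched in a few lines of Python. This is a minimal illustration, not the hub's own tooling: the record layout (`title`, `benchmarks`) and the function name `stratify_by_benchmark` are assumptions, the benchmark tags are the ones listed on this page, and only three of the ten papers are included for brevity.

```python
from collections import defaultdict

# Hypothetical record layout; the benchmark tags are taken from this page.
PAPERS = [
    {"title": "Embodied Task Planning via Graph-Informed Action Generation with Large Language Model",
     "benchmarks": ["Retrieval", "ALFWorld"]},
    {"title": "KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration",
     "benchmarks": ["Retrieval", "MMLU"]},
    {"title": "Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents",
     "benchmarks": ["Retrieval"]},
]


def stratify_by_benchmark(records):
    """Group paper titles by benchmark so comparisons stay within one stratum."""
    strata = defaultdict(list)
    for record in records:
        for benchmark in record["benchmarks"]:
            strata[benchmark].append(record["title"])
    return dict(strata)


strata = stratify_by_benchmark(PAPERS)
for benchmark, titles in sorted(strata.items()):
    print(f"{benchmark}: {len(titles)} paper(s)")
```

A paper tagged with several benchmarks lands in several strata, which is intentional here: methods are only compared against others that report on the same benchmark.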

Benchmark Interpretation

Retrieval appears in 100% of hub papers (10/10); use this cohort for benchmark-matched comparisons.
ALFWorld appears in 10% of hub papers (1/10); with a single paper, treat it as a reference point rather than a cohort for benchmark-matched comparisons.

Metric Interpretation

Cost is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
Accuracy is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.
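
One way to act on the "secondary metric" advice is to keep only methods that are not dominated on both cost and accuracy, instead of ranking on cost alone. The sketch below is an assumption about how a reader might do this, not a procedure described by this hub; the method names and numbers are hypothetical placeholders and do not come from any paper listed here.

```python
# Hypothetical (name, cost, accuracy) triples; lower cost and higher accuracy are better.
METHODS = [
    ("method_a", 1.0, 0.70),
    ("method_b", 2.0, 0.85),
    ("method_c", 2.5, 0.80),
]


def pareto_front(methods):
    """Return names of methods that no other method beats on both cost and accuracy."""
    front = []
    for name, cost, acc in methods:
        dominated = any(
            other_cost <= cost and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in methods
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(METHODS)
print(front)
```

With these placeholder values, `method_c` is dropped because `method_b` is both cheaper and more accurate, while `method_a` and `method_b` remain as a genuine cost/accuracy trade-off.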

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (0% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (10% vs 30% target).
Maintain strength on papers naming benchmarks/datasets: coverage is strong (100% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with known rater population: coverage is a replication risk (0% vs 35% target).
Close the gap on papers with known annotation unit: coverage is a replication risk (20% vs 35% target).
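
The checklist labels can be reproduced with a simple threshold comparison. The rule used below (coverage at or above target counts as "strong", otherwise "replication risk") is an assumption that happens to match the six items on this page; the hub does not state its exact rule. The function name `coverage_status` is hypothetical, and the percentages are the ones listed above.

```python
# (observed coverage %, target %) pairs as listed in the checklist on this page.
CHECKLIST = {
    "explicit human feedback": (0, 45),
    "quality controls": (10, 30),
    "benchmarks/datasets named": (100, 35),
    "evaluation metrics named": (100, 35),
    "known rater population": (0, 35),
    "known annotation unit": (20, 35),
}


def coverage_status(observed_pct, target_pct):
    """Assumed rule: meeting or exceeding the target counts as strong coverage."""
    return "strong" if observed_pct >= target_pct else "replication risk"


statuses = {
    item: coverage_status(observed, target)
    for item, (observed, target) in CHECKLIST.items()
}
for item, status in statuses.items():
    observed, target = CHECKLIST[item]
    print(f"{item}: {status} ({observed}% vs {target}% target)")
```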

Known Limitations

Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: cost - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 0 · Automatic Metrics only: 8 · Simulation Env only: 2

No papers use both Automatic Metrics and Simulation Env.
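
The overlap counts above follow from plain set arithmetic over the evaluation-mode tags shown in the paper list on this page. The sketch below uses abbreviated paper titles as set members; the abbreviations and the function name `overlap_counts` are assumptions made for readability.

```python
# Abbreviated titles, grouped by the evaluation-mode tag shown in the paper list.
AUTOMATIC_METRICS = {
    "Adversarial Intent is a Latent Variable",
    "KNIGHT",
    "Structured Prompt Language",
    "Cross-lingual Matryoshka Representation Learning",
    "Index Light, Reason Deep",
    "Fast-weight Product Key Memory",
    "Look Back to Reason Forward",
    "LLM2CLIP",
}
SIMULATION_ENV = {
    "Calibrate-Then-Act",
    "Embodied Task Planning",
}


def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


counts = overlap_counts(AUTOMATIC_METRICS, SIMULATION_ENV)
print(counts)
```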

Benchmark Brief

Retrieval

Coverage: 10 papers (100%) mention Retrieval.

Examples: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, Structured Prompt Language: Declarative Context Management for LLMs

Benchmark Brief

ALFWorld

Coverage: 1 paper (10%) mentions ALFWorld.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

MMLU

Coverage: 1 paper (10%) mentions MMLU.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Metric Brief

cost

Coverage: 10 papers (100%) mention cost.

Metric Brief

accuracy

Coverage: 2 papers (20%) mention accuracy.

Examples: Cross-lingual Matryoshka Representation Learning across Speech and Text, Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Metric Brief

jailbreak success rate

Coverage: 2 papers (20%) mention jailbreak success rate.

Examples: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, Cross-lingual Matryoshka Representation Learning across Speech and Text

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG, KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, Structured Prompt Language: Declarative Context Management for LLMs

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers On This Benchmark

Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026

Automatic Metrics

Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026

Automatic Metrics

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp

Structured Prompt Language: Declarative Context Management for LLMs
Wen G. Gong · Feb 23, 2026

Automatic Metrics

SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script.

Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026

Automatic Metrics

We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Wenxuan Ding, Nicholas Tomlin, Greg Durrett · Feb 18, 2026

Simulation Env

Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent.

Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026

Automatic Metrics

16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs.

Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026

Simulation Env · Long Horizon

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.

Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones · Jan 2, 2026

Automatic Metrics

Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.

Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai · Sep 27, 2025

Automatic Metrics

To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively callback historical memories for non-linear reasoning.

LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang · Nov 7, 2024

Automatic Metrics

The LLM-enhanced CLIP delivers consistent improvements across a wide range of downstream tasks, including linear-probe classification, zero-shot image-text retrieval with both short and long captions (in English and other languages), zero-s
