Metric Hub

Accuracy In CS.IR Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 11 papers. Common evaluation modes: Automatic Metrics, Human Eval. Common annotation unit: Ranking. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 11 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 11 papers for Accuracy In CS.IR Papers. The dominant protocol signals are automatic metrics and human evaluation, with frequent benchmark focus on Retrieval and HotpotQA and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Automatic metrics appear in 100% of papers in this hub.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; NanoKnow: How to Know What Your Language Model Knows; CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering; PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents

Long-horizon tasks appear in 9.1% of papers (1/11), indicating some demand for agentic evaluation.

Evidence: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; MoDora: Tree-Based Semi-Structured Document Analysis System; NanoKnow: How to Know What Your Language Model Knows; CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; NanoKnow: How to Know What Your Language Model Knows; CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

Rater pools are mostly unspecified, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; NanoKnow: How to Know What Your Language Model Knows; CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction; MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; NanoKnow: How to Know What Your Language Model Knows

Benchmark Interpretation

Retrieval appears in 54.5% of hub papers (6/11); use this cohort for benchmark-matched comparisons.
HotpotQA appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 100% of hub papers (11/11); compare with a secondary metric before ranking methods.
cost is reported in 9.1% of hub papers (1/11); compare with a secondary metric before ranking methods.
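
The shares in the two interpretation lists above follow directly from paper counts over the 11 hub papers. A minimal Python sketch, assuming one-decimal rounding as displayed on this page; the `share` helper and the `counts` mapping are hypothetical names for illustration, not part of HFEPX:

```python
TOTAL_PAPERS = 11  # hub size stated on this page


def share(count: int, total: int = TOTAL_PAPERS) -> str:
    """Format count/total as a percentage rounded to one decimal place."""
    pct = round(100 * count / total, 1)
    # The 'g' format drops a trailing ".0", so 100.0 is shown as "100".
    return f"{pct:g}%"


# Counts copied from the Benchmark and Metric Interpretation lists.
counts = {"Retrieval": 6, "HotpotQA": 1, "accuracy": 11, "cost": 1}
shares = {name: share(n) for name, n in counts.items()}

for name, value in shares.items():
    print(f"{name}: {value} ({counts[name]}/{TOTAL_PAPERS})")
```

With only 11 papers, each paper moves a share by about 9.1 percentage points, so small differences between cohorts should be read with care.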

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (72.7% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (18.2% vs 35% target).
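
The checklist above compares each coverage figure against a target. A small Python sketch of that comparison follows. The page does not state its exact rule, so the threshold used here (coverage at or above target counts as "strong", anything below as "replication risk") is an assumption, and `CoverageCheck` is a hypothetical name:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageCheck:
    label: str
    coverage_pct: float
    target_pct: float

    @property
    def status(self) -> str:
        # Assumed rule: meeting or exceeding the target is "strong".
        if self.coverage_pct >= self.target_pct:
            return "strong"
        return "replication risk"


# Figures copied from the checklist on this page.
checks = [
    CoverageCheck("explicit human feedback", 0.0, 45.0),
    CoverageCheck("quality controls", 0.0, 30.0),
    CoverageCheck("benchmarks/datasets named", 72.7, 35.0),
    CoverageCheck("evaluation metrics named", 100.0, 35.0),
    CoverageCheck("known rater population", 0.0, 35.0),
    CoverageCheck("known annotation unit", 18.2, 35.0),
]

# Items to close before relying on this slice for replication.
gaps = [c.label for c in checks if c.status == "replication risk"]
print(gaps)
```

Under this assumed rule the sketch reproduces the page's labels: four items fall below target and two exceed it.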

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0 (human_eval only), right_only=10 (automatic_metrics only)

1 paper uses both Human Eval and Automatic Metrics.
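
The both/left_only/right_only counts above are a plain set comparison between the papers tagged with each evaluation mode. A minimal Python sketch, using made-up paper IDs (`paper-01` to `paper-11` are placeholders, not HFEPX identifiers) and a hypothetical `overlap_counts` helper:

```python
def overlap_counts(left: set[str], right: set[str]) -> dict[str, int]:
    """Count items in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder IDs: all 11 hub papers use automatic metrics,
# and one of them also reports a human evaluation.
automatic_metrics = {f"paper-{i:02d}" for i in range(1, 12)}
human_eval = {"paper-07"}

counts = overlap_counts(human_eval, automatic_metrics)
print(counts)
```

Here `left` corresponds to human_eval and `right` to automatic_metrics, matching the order in the comparison heading.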

Benchmark Brief

Retrieval

Coverage: 6 papers (54.5%)

6 papers (54.5%) mention Retrieval.

Examples: MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Benchmark Brief

HotpotQA

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions HotpotQA.

Examples: PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents

Benchmark Brief

NQ

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions NQ.

Examples: NanoKnow: How to Know What Your Language Model Knows

Metric Brief

accuracy

Coverage: 11 papers (100%)

11 papers (100%) mention accuracy.

Examples: MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; NanoKnow: How to Know What Your Language Model Knows

Metric Brief

cost

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions cost.

Examples: Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Metric Brief

f1

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions f1.

Examples: Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: MoDora: Tree-Based Semi-Structured Document Analysis System; Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training; NanoKnow: How to Know What Your Language Model Knows

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026

Automatic Metrics Coding

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.

Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026

Automatic Metrics General

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s

NanoKnow: How to Know What Your Language Model Knows
Lingwei Gu, Nour Jedidi, Jimmy Lin · Feb 23, 2026

Automatic Metrics Coding

Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre

CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026

Automatic Metrics Multilingual

HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.

Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026

Automatic Metrics General

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.

Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026

Automatic Metrics Coding

16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs.

Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025

Human Eval Automatic Metrics General

Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).

PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025

Automatic Metrics General

We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.

Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025

Automatic Metrics Coding

Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.

Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli · Jun 7, 2024

Automatic Metrics Coding

MRAG integrates seamlessly with existing RAG frameworks and benchmarks.

Augmenting Lateral Thinking in Language Models with Humor and Riddle Data for the BRAINTEASER Task
Mina Ghashami, Soumya Smruti Mishra · May 16, 2024

Automatic Metrics General

The SemEval 2024 BRAINTEASER task challenges language models to perform lateral thinking -- a form of creative, non-linear reasoning that remains underexplored in NLP.
