
Metric Hub

Relevance + General Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 10 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Ranking. Frequently cited benchmark: Pii-Bench. Common metric signal: relevance. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 10 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for Relevance + General Metric Papers. Dominant protocol signals include automatic metrics and simulation environments, with frequent benchmark focus on Pii-Bench and Retrieval, and metric focus on relevance and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Pii-Bench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
  • Retrieval appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • relevance is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
  • accuracy is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Human-eval abstract signal: Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation...

Human-eval abstract signal: Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures.

relevance metric signal: Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise.

Protocol abstract signal: The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution.

Protocol abstract signal: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial.

Protocol abstract signal: The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking tasks.

Protocol abstract signal: We present "Testimole-conversational", a massive collection of discussion-board messages in the Italian language.

Protocol abstract signal: Search relevance plays a central role in web e-commerce.

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (20% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
  • Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (30% vs 35% target).
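The gap/strength labels above follow a simple coverage-vs-target rule. A minimal sketch of that classification, using the figures from the checklist; the dict structure and the 80%-of-target cutoff for "usable" are assumptions, not the hub's documented logic:

```python
# Coverage figures (coverage, target) taken from the checklist above.
CHECKLIST = {
    "explicit human feedback": (0.00, 0.45),
    "quality controls": (0.00, 0.30),
    "named benchmarks/datasets": (0.20, 0.35),
    "named evaluation metrics": (1.00, 0.35),
    "known rater population": (0.00, 0.35),
    "known annotation unit": (0.30, 0.35),
}

def classify(coverage: float, target: float) -> str:
    """Label a coverage figure relative to its target.

    The 0.8 * target cutoff for "usable but incomplete" is an
    assumed threshold chosen to reproduce the checklist labels.
    """
    if coverage >= target:
        return "strong"
    if coverage >= 0.8 * target:
        return "usable but incomplete"
    return "replication risk"

for item, (cov, tgt) in CHECKLIST.items():
    print(f"{item}: {cov:.0%} vs {tgt:.0%} target -> {classify(cov, tgt)}")
```

With these thresholds, 20% vs a 35% target lands in "replication risk" while 30% vs 35% is "usable but incomplete", matching the checklist.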


Suggested Reading Order

  1. CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. How Retrieved Context Shapes Internal Representations in RAG

    Adds automatic metrics for broader coverage within this hub.

  5. ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting

    Adds automatic metrics for broader coverage within this hub.

  6. Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

    Adds automatic metrics for broader coverage within this hub.

  7. RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

    Adds automatic metrics for broader coverage within this hub.

  8. On the Existence and Behavior of Secondary Attention Sinks

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is unspecified across the hub (0% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both=0, left_only=8, right_only=2

No papers use both Automatic Metrics and Simulation Env.
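The both/left_only/right_only breakdown above is a standard set comparison. A minimal sketch of the computation; the paper IDs are hypothetical placeholders, with the 8/2 split mirroring the hub's reported counts:

```python
# Hypothetical paper IDs; only the set sizes match the hub stats.
automatic_metrics = {f"paper_{i}" for i in range(1, 9)}   # 8 papers
simulation_env = {"paper_9", "paper_10"}                  # 2 papers

both = automatic_metrics & simulation_env        # intersection
left_only = automatic_metrics - simulation_env   # automatic metrics only
right_only = simulation_env - automatic_metrics  # simulation env only

print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
# -> both=0, left_only=8, right_only=2
```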

Benchmark Brief

Pii-Bench

Coverage: 1 paper (10%)

1 paper (10%) mentions Pii-Bench.

Examples: PII-Bench: Evaluating Query-Aware Privacy Protection Systems

Benchmark Brief

Retrieval

Coverage: 1 paper (10%)

1 paper (10%) mentions Retrieval.

Examples: How Retrieved Context Shapes Internal Representations in RAG

Metric Brief

accuracy

Coverage: 1 paper (10%)

1 paper (10%) mentions accuracy.

Examples: When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Metric Brief

jailbreak success rate

Coverage: 1 paper (10%)

1 paper (10%) mentions jailbreak success rate.

Examples: AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
