
Metric Hub

Relevance + General Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 10 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Ranking. Frequently cited benchmark: Pii-Bench. Common metric signal: relevance. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 10 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for Relevance + General Metric Papers. Dominant protocol signals include automatic metrics and simulation environments, with frequent benchmark focus on Pii-Bench and Retrieval, and metric focus on relevance and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Pii-Bench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
  • Retrieval appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • relevance is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
  • accuracy is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Human-eval abstract signal: Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation...

Human-eval abstract signal: Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures.

relevance metric signal: Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise.

Protocol abstract signal: The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution.

Protocol abstract signal: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial.

Protocol abstract signal: The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking tasks.

Protocol abstract signal: We present "Testimole-conversational", a massive collection of discussion-board messages in the Italian language.

Protocol abstract signal: Search relevance plays a central role in web e-commerce.

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (20% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
  • Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (30% vs 35% target).
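The gap/strength labels above follow a simple coverage-vs-target rule. A minimal sketch of that classification, using the figures from the checklist; the dict structure and the 80%-of-target cutoff for "usable" are assumptions, not the hub's documented logic:

```python
# Coverage figures (coverage, target) taken from the checklist above.
CHECKLIST = {
    "explicit human feedback": (0.00, 0.45),
    "quality controls": (0.00, 0.30),
    "named benchmarks/datasets": (0.20, 0.35),
    "named evaluation metrics": (1.00, 0.35),
    "known rater population": (0.00, 0.35),
    "known annotation unit": (0.30, 0.35),
}

def classify(coverage: float, target: float) -> str:
    """Label a coverage figure relative to its target.

    The 0.8 * target cutoff for "usable but incomplete" is an
    assumed threshold chosen to reproduce the checklist labels.
    """
    if coverage >= target:
        return "strong"
    if coverage >= 0.8 * target:
        return "usable but incomplete"
    return "replication risk"

for item, (cov, tgt) in CHECKLIST.items():
    print(f"{item}: {cov:.0%} vs {tgt:.0%} target -> {classify(cov, tgt)}")
```

With these thresholds, 20% vs a 35% target lands in "replication risk" while 30% vs 35% is "usable but incomplete", matching the checklist.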


Suggested Reading Order

  1. CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. How Retrieved Context Shapes Internal Representations in RAG

    Adds automatic metrics for broader coverage within this hub.

  5. ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting

    Adds automatic metrics for broader coverage within this hub.

  6. Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

    Adds automatic metrics for broader coverage within this hub.

  7. RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

    Adds automatic metrics for broader coverage within this hub.

  8. On the Existence and Behavior of Secondary Attention Sinks

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is unspecified across the hub (0% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both=0, left_only=8, right_only=2

No papers use both Automatic Metrics and Simulation Env.
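The both/left_only/right_only breakdown above is a standard set comparison. A minimal sketch of the computation; the paper IDs are hypothetical placeholders, with the 8/2 split mirroring the hub's reported counts:

```python
# Hypothetical paper IDs; only the set sizes match the hub stats.
automatic_metrics = {f"paper_{i}" for i in range(1, 9)}   # 8 papers
simulation_env = {"paper_9", "paper_10"}                  # 2 papers

both = automatic_metrics & simulation_env        # intersection
left_only = automatic_metrics - simulation_env   # automatic metrics only
right_only = simulation_env - automatic_metrics  # simulation env only

print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
# -> both=0, left_only=8, right_only=2
```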

Benchmark Brief

Pii-Bench

Coverage: 1 paper (10%)

1 paper (10%) mentions Pii-Bench.

Examples: PII-Bench: Evaluating Query-Aware Privacy Protection Systems

Benchmark Brief

Retrieval

Coverage: 1 paper (10%)

1 paper (10%) mentions Retrieval.

Examples: How Retrieved Context Shapes Internal Representations in RAG

Metric Brief

accuracy

Coverage: 1 paper (10%)

1 paper (10%) mentions accuracy.

Examples: When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Metric Brief

jailbreak success rate

Coverage: 1 paper (10%)

1 paper (10%) mentions jailbreak success rate.

Examples: AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
