Metric Hub

Relevance In CS.AI Papers

Updated from current HFEPX corpus (Feb 27, 2026). 14 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: relevance. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 14 · Last published: Feb 25, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 14 papers for Relevance In CS.AI Papers. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are Retrieval and MMLU, and the most frequent metrics are relevance and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Automatic metrics appear in 92.9% of papers in this hub.

Evidence: VeRO: An Evaluation Harness for Agents to Optimize Agents, Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs, CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference, KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models, VeRO: An Evaluation Harness for Agents to Optimize Agents, Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

Multi-agent setups appear in 7.1% of papers, a small signal of demand for agentic evaluation.

Evidence: From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity, VeRO: An Evaluation Harness for Agents to Optimize Agents, Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs, CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Protocol Takeaways

The most common quality-control signal is rater calibration (7.1% of papers).

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, VeRO: An Evaluation Harness for Agents to Optimize Agents, Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs, CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models, VeRO: An Evaluation Harness for Agents to Optimize Agents, Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs, CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Stratify by benchmark (Retrieval vs MMLU) before comparing methods.

Evidence: VeRO: An Evaluation Harness for Agents to Optimize Agents, Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs, CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference, AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
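One way to act on the stratification advice is to group papers by the benchmark they name before comparing any methods. The sketch below is illustrative only: the record layout, the shortened titles, and the `stratify_by_benchmark` helper are assumptions, and the benchmark tags simply mirror the Benchmark Brief entries on this page.

```python
from collections import defaultdict

def stratify_by_benchmark(papers):
    """Group paper titles by named benchmark (hypothetical helper).

    A paper that names several benchmarks lands in several strata;
    a paper that names none goes into a "(none named)" stratum.
    """
    strata = defaultdict(list)
    for paper in papers:
        for benchmark in paper["benchmarks"] or ["(none named)"]:
            strata[benchmark].append(paper["title"])
    return dict(strata)

# Assumed records: titles shortened, tags taken from this page's briefs.
PAPERS = [
    {"title": "KNIGHT", "benchmarks": ["Retrieval", "MMLU"]},
    {"title": "OGD4All", "benchmarks": ["Retrieval"]},
    {"title": "PII-Bench", "benchmarks": ["PII-Bench"]},
    {"title": "ReAttn", "benchmarks": []},
]

strata = stratify_by_benchmark(PAPERS)
for benchmark, titles in strata.items():
    print(f"{benchmark}: {', '.join(titles)}")
```

Comparisons would then be made only within a stratum, so a Retrieval result is never ranked against an MMLU result.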

Benchmark Interpretation

Retrieval appears in 14.3% of hub papers (2/14); use this cohort for benchmark-matched comparisons.
MMLU appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Relevance is reported in 100% of hub papers (14/14); pair it with a secondary metric before ranking methods.
Accuracy is reported in 14.3% of hub papers (2/14); pair it with a secondary metric before ranking methods.
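The coverage figures on this page are plain count-over-total percentages rounded to one decimal place. A minimal sketch of that arithmetic follows; the `coverage` helper and its output format are assumptions for illustration, not the hub's actual code.

```python
def coverage(count: int, total: int) -> str:
    """Format a paper count as a one-decimal percentage of the hub."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = round(100 * count / total, 1)
    # The "g" format drops a trailing ".0", so 14/14 renders as "100".
    return f"{pct:g}% ({count}/{total})"

print(coverage(14, 14))  # relevance
print(coverage(2, 14))   # accuracy, Retrieval
print(coverage(1, 14))   # MMLU
```

With 14 papers in the hub, a single paper moves any figure by about 7 percentage points, which is worth keeping in mind when reading the smaller cohorts.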

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (0% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (7.1% vs 30% target).
Tighten coverage of papers naming benchmarks/datasets: coverage is usable but incomplete (21.4% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with a known rater population: coverage is a replication risk (7.1% vs 35% target).
Tighten coverage of papers with a known annotation unit: coverage is usable but incomplete (21.4% vs 35% target).
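The checklist labels each item by comparing observed coverage with a target. The page does not state the cutoffs it uses, so the sketch below assumes a simple ratio rule (at or above target is strong, at least half the target is usable, anything lower is a replication risk). These thresholds are a guess that happens to reproduce the six labels above; they are not a documented rule.

```python
def coverage_status(observed_pct: float, target_pct: float) -> str:
    """Label coverage against a target (cutoffs are ASSUMED, not documented)."""
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = observed_pct / target_pct
    if ratio >= 1.0:
        return "strong"
    if ratio >= 0.5:  # assumed cutoff for "usable but incomplete"
        return "usable but incomplete"
    return "replication risk"

# (item, observed %, target %) as listed in the checklist.
CHECKLIST = [
    ("explicit human feedback", 0.0, 45.0),
    ("quality controls", 7.1, 30.0),
    ("named benchmarks/datasets", 21.4, 35.0),
    ("named evaluation metrics", 100.0, 35.0),
    ("known rater population", 7.1, 35.0),
    ("known annotation unit", 21.4, 35.0),
]

statuses = {name: coverage_status(obs, tgt) for name, obs, tgt in CHECKLIST}
for name, status in statuses.items():
    print(f"{name}: {status}")
```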

Known Limitations

Only 7.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: relevance - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 0 · Automatic Metrics only: 13 · Simulation Env only: 1

No paper uses both Automatic Metrics and Simulation Env.
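The overlap figures for the two evaluation modes correspond to ordinary set operations over the papers tagged with each mode. The sketch below uses hypothetical paper ids sized to match the page (13 and 1); the `overlap_counts` helper is an assumption for illustration.

```python
def overlap_counts(left: set[str], right: set[str]) -> dict[str, int]:
    """Count items in both sets, only in the left set, and only in the right."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Hypothetical ids: 13 papers tagged Automatic Metrics, 1 tagged Simulation Env.
automatic_metrics = {f"paper-{i:02d}" for i in range(1, 14)}
simulation_env = {"paper-14"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)
```

Because the intersection is empty, no paper in this hub allows a within-paper comparison of the two modes.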

Benchmark Brief

Retrieval

Coverage: 2 papers (14.3%) mention Retrieval.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration, OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models

Benchmark Brief

MMLU

Coverage: 1 paper (7.1%) mentions MMLU.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Benchmark Brief

PII-Bench

Coverage: 1 paper (7.1%) mentions PII-Bench.

Examples: PII-Bench: Evaluating Query-Aware Privacy Protection Systems

Metric Brief

relevance

Coverage: 14 papers (100%) mention relevance.

Examples: VeRO: An Evaluation Harness for Agents to Optimize Agents, Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs, CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Metric Brief

accuracy

Coverage: 2 papers (14.3%) mention accuracy.

Examples: Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs, From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Metric Brief

calibration

Coverage: 1 paper (7.1%) mentions calibration.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: VeRO: An Evaluation Harness for Agents to Optimize Agents, Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs, CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Top Papers Reporting This Metric

VeRO: An Evaluation Harness for Agents to Optimize Agents
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan, Xue · Feb 25, 2026

Automatic Metrics · Coding

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles.

Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding · Feb 24, 2026

Automatic Metrics · Coding

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning.

CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis · Feb 24, 2026

Automatic Metrics · Coding

Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency stable inference with up to 4.56× higher throughput, and consistently outperforms other str...

AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang · Feb 24, 2026

Simulation Env · General

The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution.

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026

Automatic Metrics · Math

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp...

ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting
Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie · Feb 23, 2026

Automatic Metrics · General

The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking task.

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
Melkamu Abay Mersha, Jugal Kalita · Feb 18, 2026

Automatic Metrics · Coding

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret.

The Invisible Hand of AI Libraries Shaping Open Source Projects and Communities
Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi · Jan 5, 2026

Automatic Metrics · Coding

In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software.

RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025

Automatic Metrics · General

While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics acr...

On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025

Automatic Metrics · General

Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance.

OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler · Nov 30, 2025

Automatic Metrics · Coding

The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs.

From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025

Automatic Metrics · Medicine

To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.

AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu · Oct 7, 2025

Automatic Metrics · General

Recent agent-based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they struggle with hallucinating non-existent items and full-catalog ranking.

PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025

Automatic Metrics · General

To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
