Metric Hub

Relevance In CS.CL Papers

Updated from current HFEPX corpus (Feb 27, 2026). 17 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Human Eval. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: relevance. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 17 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 17 papers for Relevance In CS.CL Papers. The dominant protocol signals are automatic metrics, human evaluation, and simulation environments. The most frequent benchmarks are Retrieval and Financebench, and the most frequent metrics are relevance and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Automatic metrics appear in 88.2% of papers in this hub (15/17).

Evidence: VeRO: An Evaluation Harness for Agents to Optimize Agents; When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning; KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; How Retrieved Context Shapes Internal Representations in RAG

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; How Retrieved Context Shapes Internal Representations in RAG; Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering; SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

Multi-agent setups appear in 5.9% of papers, a small signal of demand for agentic evaluation.

Evidence: From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity; CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; VeRO: An Evaluation Harness for Agents to Optimize Agents; When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Protocol Takeaways

The most common quality-control signal is rater calibration (5.9% of papers).

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; VeRO: An Evaluation Harness for Agents to Optimize Agents; When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Rater pools are mostly unspecified, and ranking is the most common annotation unit; factor this in when scoping replication staffing.

Evidence: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; VeRO: An Evaluation Harness for Agents to Optimize Agents; When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning; KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale; CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; VeRO: An Evaluation Harness for Agents to Optimize Agents; When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Benchmark Interpretation

Retrieval appears in 23.5% of hub papers (4/17); use this cohort for benchmark-matched comparisons.
Financebench appears in 5.9% of hub papers (1/17); use this cohort for benchmark-matched comparisons.

Metric Interpretation

relevance is reported in 100% of hub papers (17/17); compare with a secondary metric before ranking methods.
accuracy is reported in 11.8% of hub papers (2/17); compare with a secondary metric before ranking methods.
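
The percentages in the Benchmark and Metric Interpretation lines are shares of the 17 hub papers. A minimal sketch of that arithmetic follows, assuming one-decimal rounding; the helper name `share_pct` is illustrative and not part of HFEPX.

```python
HUB_SIZE = 17  # papers in this hub, per the page header

def share_pct(count: int, total: int = HUB_SIZE) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)

# Paper counts copied from the interpretation lines above.
counts = {"Retrieval": 4, "Financebench": 1, "relevance": 17, "accuracy": 2}
shares = {name: share_pct(n) for name, n in counts.items()}
print(shares)
```

Recomputing shares from raw counts is a quick check that a reported percentage and its (n/17) fraction agree before a cohort is used for comparison.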

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (0% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (5.9% vs 30% target).
Tighten coverage of papers naming benchmarks/datasets: coverage is usable but incomplete (29.4% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with a known rater population: coverage is a replication risk (0% vs 35% target).
Tighten coverage of papers with a known annotation unit: coverage is usable but incomplete (29.4% vs 35% target).
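
One way to prioritize the checklist is to rank items by their shortfall against target. The sketch below uses only the coverage and target figures listed above; the function name `gaps` and the ranking by raw percentage-point gap are assumptions for illustration, not the rule HFEPX uses to assign its "replication risk" or "usable but incomplete" labels.

```python
# (label, coverage %, target %) copied from the checklist above.
checklist = [
    ("explicit human feedback", 0.0, 45.0),
    ("quality controls", 5.9, 30.0),
    ("named benchmarks/datasets", 29.4, 35.0),
    ("named evaluation metrics", 100.0, 35.0),
    ("known rater population", 0.0, 35.0),
    ("known annotation unit", 29.4, 35.0),
]

def gaps(rows):
    """Return (label, gap) pairs, largest shortfall first.

    gap = target - coverage in percentage points; a negative gap
    means coverage already exceeds the target.
    """
    scored = [(label, round(target - coverage, 1)) for label, coverage, target in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = gaps(checklist)
for label, gap in ranked:
    print(f"{label}: {gap:+.1f} points vs target")
```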

Known Limitations

Only 5.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: relevance - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=15

0 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=15, right_only=1

0 papers use both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=1, right_only=1

0 papers use both Human Eval and Simulation Env.
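
The both/left_only/right_only counts above are plain set overlaps between the papers tagged with each evaluation mode. A minimal sketch follows, assuming hypothetical paper IDs (`p01` to `p17`) sized to match the reported counts: 1 human-eval paper, 1 simulation-env paper, and 15 automatic-metrics papers. The `overlap` helper is illustrative only.

```python
def overlap(left: set, right: set) -> dict:
    """Count items shared by both sets and items unique to each side."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Hypothetical IDs; only the set sizes come from the page.
human_eval = {"p01"}
simulation_env = {"p02"}
automatic_metrics = {f"p{i:02d}" for i in range(3, 18)}  # 15 papers

pairs = {
    "human_eval vs automatic_metrics": overlap(human_eval, automatic_metrics),
    "automatic_metrics vs simulation_env": overlap(automatic_metrics, simulation_env),
    "human_eval vs simulation_env": overlap(human_eval, simulation_env),
}
for name, result in pairs.items():
    print(name, result)
```

With disjoint tag sets, every pair reports both=0, which matches the three comparisons listed above.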

Benchmark Brief

Retrieval

Coverage: 4 papers (23.5%)

4 papers (23.5%) mention Retrieval.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration , How Retrieved Context Shapes Internal Representations in RAG , Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Benchmark Brief

Financebench

Coverage: 1 paper (5.9%)

1 paper (5.9%) mentions Financebench.

Examples: Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Benchmark Brief

MMLU

Coverage: 1 paper (5.9%)

1 paper (5.9%) mentions MMLU.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Metric Brief

relevance

Coverage: 17 papers (100%)

17 papers (100%) mention relevance.

Examples: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , VeRO: An Evaluation Harness for Agents to Optimize Agents , When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Metric Brief

accuracy

Coverage: 2 papers (11.8%)

2 papers (11.8%) mention accuracy.

Examples: When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning , From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Metric Brief

agreement

Coverage: 1 paper (5.9%)

1 paper (5.9%) mentions agreement.

Examples: propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , VeRO: An Evaluation Harness for Agents to Optimize Agents , When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Top Papers Reporting This Metric

CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li · Feb 26, 2026

Simulation Env General

In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements.
VeRO: An Evaluation Harness for Agents to Optimize Agents
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue · Feb 25, 2026

Automatic Metrics Coding

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles.
When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
Muku Akasaka, Soyeon Caren Han · Feb 25, 2026

Automatic Metrics General

In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks.
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026

Automatic Metrics Math

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
How Retrieved Context Shapes Internal Representations in RAG
Samuel Yeh, Sharon Li · Feb 23, 2026

Automatic Metrics General

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial.
ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting
Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie · Feb 23, 2026

Automatic Metrics General

The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking task.
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026

Automatic Metrics Coding

Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
Melkamu Abay Mersha, Jugal Kalita · Feb 18, 2026

Automatic Metrics Coding

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret.
Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
Matteo Rinaldi, Rossella Varvara, Viviana Patti · Feb 16, 2026

Automatic Metrics General

We present "Testimole-conversational" a massive collection of discussion boards messages in the Italian language.
propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries · Feb 12, 2026

Human Eval Multilingual

We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose,
SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You · Jan 6, 2026

Automatic Metrics Coding

While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory.
The Invisible Hand of AI Libraries Shaping Open Source Projects and Communities
Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi · Jan 5, 2026

Automatic Metrics Coding

In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software.
RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025

Automatic Metrics General

While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics acr
On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025

Automatic Metrics General

Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance.
From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025

Automatic Metrics Medicine

To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He · Jun 19, 2025

Automatic Metrics Medicine

Evaluations were heterogeneous: intrinsic metrics (27.1%), human-in-the-loop assessments (44.1%), and LLM-based evaluations (13.6%).
PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025

Automatic Metrics General

To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
