Metric Hub

Recall + Automatic Metrics Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 34 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: recall. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 34 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 34 papers that report recall under automatic-metric evaluation. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are Retrieval and ARC, and the most frequent metrics are recall and F1. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

5.9% of papers report explicit human-feedback signals, led by expert verification.

Evidence: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications; Probing for Knowledge Attribution in Large Language Models; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Automatic metrics appear in 100% of papers in this hub.

Evidence: Probing for Knowledge Attribution in Large Language Models; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Improving Parametric Knowledge Access in Reasoning Language Models; Personalized Graph-Empowered Large Language Model for Proactive Information Access

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Personalized Graph-Empowered Large Language Model for Proactive Information Access; E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications; RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering; Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Protocol Takeaways

The most common quality-control signal is rater calibration (2.9% of papers).

Evidence: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; Probing for Knowledge Attribution in Large Language Models; Improving Parametric Knowledge Access in Reasoning Language Models

Where rater context is reported, raters are mostly domain experts and the annotation unit is usually ranking; use this to scope replication staffing.

Evidence: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning; Agentic Adversarial QA for Improving Domain-Specific LLMs; CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Stratify by benchmark (Retrieval vs ARC) before comparing methods.

Evidence: Probing for Knowledge Attribution in Large Language Models; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Improving Parametric Knowledge Access in Reasoning Language Models; Personalized Graph-Empowered Large Language Model for Proactive Information Access

Benchmark Interpretation

Retrieval appears in 29.4% of hub papers (10/34); use this cohort for benchmark-matched comparisons.
ARC appears in 2.9% of hub papers (1/34); use this cohort for benchmark-matched comparisons.
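
Building benchmark-matched cohorts can be sketched as a simple grouping step. This is a minimal illustration, assuming each paper is a record with hypothetical `title` and `benchmarks` fields (this page does not expose a schema). Papers that name no benchmark go into an `unspecified` bucket so they are not silently mixed into a matched cohort.

```python
from collections import defaultdict


def stratify_by_benchmark(papers):
    """Group paper titles by every benchmark each paper names."""
    cohorts = defaultdict(list)
    for paper in papers:
        # A paper may name several benchmarks, or none at all.
        names = paper.get("benchmarks") or ["unspecified"]
        for name in names:
            cohorts[name].append(paper["title"])
    return dict(cohorts)


# Illustrative records. The field names are assumptions; the third
# record is a made-up placeholder, not a paper from this hub.
papers = [
    {"title": "RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering",
     "benchmarks": ["Retrieval"]},
    {"title": "Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise",
     "benchmarks": ["ARC"]},
    {"title": "Hypothetical paper with no named benchmark", "benchmarks": []},
]

cohorts = stratify_by_benchmark(papers)
for name, titles in cohorts.items():
    print(f"{name}: {len(titles)} paper(s)")
```

Comparisons would then be made only within a single cohort, for example within `cohorts["Retrieval"]`.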

Metric Interpretation

recall is reported in 100% of hub papers (34/34); compare with a secondary metric before ranking methods.
f1 is reported in 32.4% of hub papers (11/34); compare with a secondary metric before ranking methods.
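
Recall alone can rank a method first even when it produces many false positives, which is why a secondary metric matters. A minimal sketch with hypothetical confusion counts (not taken from any paper in this hub):

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical systems evaluated on the same 100 relevant items.
# System A returns almost everything; system B is more selective.
system_a = precision_recall_f1(tp=90, fp=210, fn=10)
system_b = precision_recall_f1(tp=80, fp=20, fn=20)

for name, (p, r, f1) in {"A": system_a, "B": system_b}.items():
    print(f"system {name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With these counts, system A wins on recall (0.90 vs 0.80) while system B wins on F1 (0.80 vs 0.45), so the ranking flips depending on which metric is used.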

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (5.9% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (8.8% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (50% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (20.6% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (5.9% vs 35% target).
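
The checklist labels can be reproduced from a count, the hub size, and a target. This is a sketch under stated assumptions: the per-row counts are inferred from the listed percentages and the 34-paper hub size, and the rule "at or above target is strong, otherwise a replication risk" is a guess that happens to match every row above, not a documented HFEPX rule.

```python
HUB_SIZE = 34  # papers in this hub, as stated on the page


def coverage_status(count, total, target_pct):
    """Return (coverage percent rounded to 0.1, status label)."""
    pct = round(100 * count / total, 1)
    # Assumed rule: meeting the target counts as strong coverage.
    status = "strong" if pct >= target_pct else "replication risk"
    return pct, status


# (label, inferred paper count, target percent)
checklist = [
    ("explicit human feedback", 2, 45),
    ("quality controls", 3, 30),
    ("benchmarks/datasets named", 17, 35),
    ("evaluation metrics named", 34, 35),
    ("known rater population", 7, 35),
    ("known annotation unit", 2, 35),
]

results = {label: coverage_status(count, HUB_SIZE, target)
           for label, count, target in checklist}
for label, (pct, status) in results.items():
    print(f"{label}: {pct}% ({status})")
```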

Known Limitations

Only 8.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (20.6% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: recall - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 2 · Automatic Metrics only: 32 · Simulation Env only: 0

2 papers use both Automatic Metrics and Simulation Env.
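
The overlap counts above follow from plain set arithmetic. A minimal sketch, assuming each evaluation mode is stored as a set of paper IDs; the IDs here are placeholders, not real identifiers. `left` stands for Automatic Metrics and `right` for Simulation Env.

```python
def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder IDs: all 34 hub papers use automatic metrics,
# and two of them also use a simulation environment.
automatic_metrics = {f"paper_{i}" for i in range(34)}
simulation_env = {"paper_0", "paper_1"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)
```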

Benchmark Brief

Retrieval

Coverage: 10 papers (29.4%)

10 papers (29.4%) mention Retrieval.

Examples: Personalized Graph-Empowered Large Language Model for Proactive Information Access , E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications , RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering

Benchmark Brief

ARC

Coverage: 1 paper (2.9%)

1 paper (2.9%) mentions ARC.

Examples: Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise

Benchmark Brief

BanglaSummEval

Coverage: 1 paper (2.9%)

1 paper (2.9%) mentions BanglaSummEval.

Examples: BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization

Metric Brief

recall

Coverage: 34 papers (100%)

34 papers (100%) mention recall.

Examples: Probing for Knowledge Attribution in Large Language Models , A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection , Improving Parametric Knowledge Access in Reasoning Language Models

Metric Brief

f1

Coverage: 11 papers (32.4%)

11 papers (32.4%) mention f1.

Examples: Probing for Knowledge Attribution in Large Language Models , A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection , Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Metric Brief

accuracy

Coverage: 10 papers (29.4%)

10 papers (29.4%) mention accuracy.

Examples: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection , Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams , To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Probing for Knowledge Attribution in Large Language Models , A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection , Improving Parametric Knowledge Access in Reasoning Language Models

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Top Papers Reporting This Metric

Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026

Automatic Metrics General

Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retr
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026

Automatic Metrics General

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC.
Improving Parametric Knowledge Access in Reasoning Language Models
Melody Ma, John Hewitt · Feb 25, 2026

Automatic Metrics Math

We study reasoning for accessing world knowledge stored in a language model's parameters.
Personalized Graph-Empowered Large Language Model for Proactive Information Access
Chia Cheng Chang, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen · Feb 25, 2026

Automatic Metrics General

Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential.
Towards Controllable Video Synthesis of Routine and Rare OR Events
Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova · Feb 24, 2026

Automatic Metrics General

Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging.
E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
Jiwoo Kang, Yeon-Chang Lee · Feb 24, 2026

Automatic Metrics General

Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalizati
Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams
Darvan Shvan Khairaldeen, Hossein Hassani · Feb 24, 2026

Automatic Metrics General

On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%.
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026

Automatic Metrics Medicine

Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontolo
To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen · Feb 23, 2026

Automatic Metrics Medicine

Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA.
PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026

Automatic Metrics General

Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
Uncovering Context Reliance in Unstructured Knowledge Editing
Zisheng Zhou, Mengqi Zhang, Shiguang Wu, Xiaotian Ye, Chi Zhang · Feb 22, 2026

Automatic Metrics General

Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.
VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026

Automatic Metrics Math

Existing cultural benchmarks are (i) manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026

Automatic Metrics General

Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic Metrics Simulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026

Automatic Metrics Law

Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026

Automatic Metrics Coding

Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026

Automatic Metrics Medicine

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration
Meng Ye, Xiao Lin, Georgina Lukoczki, Graham W. Lederer, Yi Yao · Feb 19, 2026

Automatic Metrics Coding

Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types.
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Jyotin Goel, Souvik Maji, Pratik Mazumder · Feb 19, 2026

Automatic Metrics General

Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates.
RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
Yiming Zhang, Siyue Zhang, Junbo Zhao, Chen Zhao · Feb 19, 2026

Automatic Metrics General

We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories.
Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
Bianca Raimondi, Maurizio Gabbrielli · Feb 19, 2026

Automatic Metrics Coding

The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics.
BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam · Feb 18, 2026

Automatic Metrics Medicine Multilingual

However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries.
Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum
Víctor Yeste, Paolo Rosso · Jan 20, 2026

Automatic Metrics Coding

We study sentence-level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus).
CAST: Character-and-Scene Episodic Memory for Agents
Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026

Automatic Metrics General

Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where.
Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Stephen Gadd · Jan 11, 2026

Automatic Metrics General

Linking names across historical sources, languages, and writing systems remains a fundamental challenge in digital humanities and geographic information retrieval.
WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025

Automatic Metrics General

This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as eith
OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler · Nov 30, 2025

Automatic Metrics Coding

The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs.
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025

Automatic Metrics General

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh
PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma · Oct 8, 2025

Automatic Metrics General

Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference.
Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov · May 20, 2025

Automatic Metrics General

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality?
Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes
Saman Khamesian, Asiful Arefeen, Maria Adela Grando, Bithika M. Thompson, Hassan Ghasemzadeh · Feb 20, 2025

Automatic Metrics Medicine

Managing Type 1 Diabetes (T1D) demands constant vigilance as individuals strive to regulate their blood glucose levels and avoid dysglycemia, including hyperglycemia and hypoglycemia.
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes
Rahul Garg, Trilok Padhi, Hemang Jain, Ugur Kursuncu, Ponnurangam Kumaraguru · Nov 19, 2024

Automatic Metrics Simulation Env General

Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines across AU-ROC, F1, and Recall with improvements of 1.1%, 7%, and 35%, respectively.
Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE
Christian Møller Dahl, Torben Johansen, Christian Vedel · Feb 21, 2024

Automatic Metrics Coding

This paper introduces OccCANINE, an open-source tool that maps occupational descriptions to HISCO codes.
Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise
Zhenkai Zhang, Krista A. Ehinger, Tom Drummond · Oct 26, 2023

Automatic Metrics Math

This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes.
