Metric Hub

Recall + General Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 20 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: recall. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 20 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 20 papers for Recall + General Metric Papers. The dominant protocol signals are automatic metrics and simulation environments, with benchmark focus on Retrieval and Medieval and metric focus on recall and f1. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

5% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks, Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Personalized Graph-Empowered Large Language Model for Proactive Information Access

Automatic metrics appear in 95% of papers in this hub.

Evidence: Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Personalized Graph-Empowered Large Language Model for Proactive Information Access, Towards Controllable Video Synthesis of Routine and Rare OR Events

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Personalized Graph-Empowered Large Language Model for Proactive Information Access, E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications, RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering, RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering

Protocol Takeaways

The most common quality-control signal is rater calibration (5% of papers).

Evidence: WISE: Web Information Satire and Fakeness Evaluation, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Probing for Knowledge Attribution in Large Language Models, Personalized Graph-Empowered Large Language Model for Proactive Information Access

Stratify by benchmark (Retrieval vs Medieval) before comparing methods.

Evidence: Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Personalized Graph-Empowered Large Language Model for Proactive Information Access, Towards Controllable Video Synthesis of Routine and Rare OR Events

Track metric sensitivity by reporting both recall and f1.

Evidence: Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Personalized Graph-Empowered Large Language Model for Proactive Information Access, Towards Controllable Video Synthesis of Routine and Rare OR Events

Benchmark Interpretation

Retrieval appears in 30% of hub papers (6/20); use this cohort for benchmark-matched comparisons.
Medieval appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.
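
A benchmark-matched comparison can be sketched as grouping results by benchmark first and ranking only within each group. This is a minimal illustration, not part of the hub's tooling: the method names and recall values below are hypothetical placeholders, and only the benchmark labels (Retrieval, Medieval) come from this page.

```python
from collections import defaultdict

# Hypothetical records: (method, benchmark, recall).
# The scores are placeholders for illustration, not results from any paper.
results = [
    ("method_a", "Retrieval", 0.71),
    ("method_b", "Retrieval", 0.64),
    ("method_c", "Medieval", 0.88),
]

def rank_within_benchmark(rows):
    """Group rows by benchmark, then sort each group by recall (highest first)."""
    groups = defaultdict(list)
    for method, benchmark, recall in rows:
        groups[benchmark].append((method, recall))
    return {
        benchmark: sorted(group, key=lambda row: row[1], reverse=True)
        for benchmark, group in groups.items()
    }

ranked = rank_within_benchmark(results)
for benchmark, rows in ranked.items():
    print(benchmark, rows)
```

Ranking inside each group avoids placing a score from one benchmark above a score from another, which is the comparison the stratification advice is meant to prevent.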

Metric Interpretation

recall is reported in 100% of hub papers (20/20); compare with a secondary metric before ranking methods.
f1 is reported in 40% of hub papers (8/20); compare with a secondary metric before ranking methods.
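
To illustrate why a secondary metric matters, the sketch below computes F1 as the harmonic mean of precision and recall. The inputs are the precision (25.8%) and recall (39.4%) quoted on this page for the Kurdish Maqams paper. The resulting value is only what those two numbers imply; it is not a figure reported by that paper, and the function name is an assumption made for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as quoted in this hub (50-song evaluation, 0.750 threshold).
implied_f1 = f1_score(precision=0.258, recall=0.394)
print(round(implied_f1, 3))  # prints 0.312
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, so a method can look better on recall alone than it does once precision is taken into account.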

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (5% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (10% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (40% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
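
The checklist labels can be reproduced with a simple threshold check. The sketch below assumes the rule is "coverage at or above target counts as strong, otherwise replication risk"; this rule is inferred from the six rows above and is not a documented HFEPX definition. The coverage and target percentages are copied from the checklist.

```python
# (label, coverage %, target %) copied from the checklist above.
checklist = [
    ("explicit human feedback", 5, 45),
    ("quality controls", 10, 30),
    ("benchmarks/datasets named", 40, 35),
    ("evaluation metrics named", 100, 35),
    ("known rater population", 0, 35),
    ("known annotation unit", 0, 35),
]

def status(coverage: int, target: int) -> str:
    # Assumed rule: meeting or exceeding the target counts as "strong".
    return "strong" if coverage >= target else "replication risk"

report = [(label, status(cov, tgt), max(tgt - cov, 0)) for label, cov, tgt in checklist]
for label, verdict, gap in report:
    print(f"{label}: {verdict} (gap to target: {gap} points)")
```

Sorting the report by gap would put the largest shortfalls first (human feedback at 40 points, then rater population and annotation unit at 35 each).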

Known Limitations

Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: recall - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 2 · Automatic Metrics only: 17 · Simulation Env only: 1

2 papers use both Automatic Metrics and Simulation Env.
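
The overlap counts can be checked against the percentages stated elsewhere on this page. The sketch below uses only the three counts above (2, 17, 1); the variable names are chosen for this example, and the Jaccard overlap is an extra summary that the page itself does not report.

```python
# Counts from the comparison above.
both, auto_only, sim_only = 2, 17, 1

total = both + auto_only + sim_only      # every paper uses at least one mode
auto_share = (both + auto_only) / total  # share of papers using Automatic Metrics
sim_share = (both + sim_only) / total    # share of papers using Simulation Env
jaccard = both / total                   # overlap divided by the union of the two sets

print(total, auto_share, sim_share, jaccard)
```

The total of 20 matches the hub size, and the automatic-metrics share of 0.95 matches the 95% figure given under "Why This Matters For Eval Research".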

Benchmark Brief

Retrieval

Coverage: 6 papers (30%)

6 papers (30%) mention Retrieval.

Examples: Personalized Graph-Empowered Large Language Model for Proactive Information Access, E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications, RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering

Benchmark Brief

Medieval

Coverage: 1 paper (5%)

1 paper (5%) mentions Medieval.

Examples: Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

Benchmark Brief

MemoryArena

Coverage: 1 paper (5%)

1 paper (5%) mentions MemoryArena.

Examples: MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Metric Brief

recall

Coverage: 20 papers (100%)

20 papers (100%) mention recall.

Examples: Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Personalized Graph-Empowered Large Language Model for Proactive Information Access

Metric Brief

f1

Coverage: 8 papers (40%)

8 papers (40%) mention f1.

Examples: Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Metric Brief

precision

Coverage: 5 papers (25%)

5 papers (25%) mention precision.

Examples: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams, PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Personalized Graph-Empowered Large Language Model for Proactive Information Access

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026

Automatic Metrics General

Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retr
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026

Automatic Metrics General

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC.
Personalized Graph-Empowered Large Language Model for Proactive Information Access
Chia Cheng Chang, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen · Feb 25, 2026

Automatic Metrics General

Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential.
Towards Controllable Video Synthesis of Routine and Rare OR Events
Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova · Feb 24, 2026

Automatic Metrics General

Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging.
E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
Jiwoo Kang, Yeon-Chang Lee · Feb 24, 2026

Automatic Metrics General

Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalizati
Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams
Darvan Shvan Khairaldeen, Hossein Hassani · Feb 24, 2026

Automatic Metrics General

On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%.
PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026

Automatic Metrics General

Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
Uncovering Context Reliance in Unstructured Knowledge Editing
Zisheng Zhou, Mengqi Zhang, Shiguang Wu, Xiaotian Ye, Chi Zhang · Feb 22, 2026

Automatic Metrics General

Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.
RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026

Automatic Metrics General

Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic Metrics Simulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Jyotin Goel, Souvik Maji, Pratik Mazumder · Feb 19, 2026

Automatic Metrics General

Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates.
RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
Yiming Zhang, Siyue Zhang, Junbo Zhao, Chen Zhao · Feb 19, 2026

Automatic Metrics General

We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories.
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026

Simulation Env General

Existing evaluations of agents with memory typically assess memorization and action in isolation.
CAST: Character-and-Scene Episodic Memory for Agents
Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026

Automatic Metrics General

Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where.
Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Stephen Gadd · Jan 11, 2026

Automatic Metrics General

Linking names across historical sources, languages, and writing systems remains a fundamental challenge in digital humanities and geographic information retrieval.
WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025

Automatic Metrics General

This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as eith
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025

Automatic Metrics General

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh
PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma · Oct 8, 2025

Automatic Metrics General

Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference.
Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov · May 20, 2025

Automatic Metrics General

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality?
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes
Rahul Garg, Trilok Padhi, Hemang Jain, Ugur Kursuncu, Ponnurangam Kumaraguru · Nov 19, 2024

Automatic Metrics Simulation Env General

Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines across AU-ROC, F1, and Recall with improvements of 1.1%, 7%, and 35%, respectively.
