
Metric Hub

Recall in cs.LG Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 10 papers. Common evaluation modes: Automatic Metrics and Simulation Environments. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Gold Questions. Frequently cited benchmark: ARC. Common metric signal: recall. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 25, 2026.

Papers: 10 · Last published: Feb 25, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 10 papers for Recall in cs.LG Papers. Dominant protocol signals include automatic metrics and simulation environments, with frequent benchmark focus on ARC and LegalBench, and metric focus on recall and F1. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • ARC appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
  • LegalBench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • recall is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
  • f1 is reported in 40% of hub papers (4/10); compare with a secondary metric before ranking methods.
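
Both headline metrics derive from the same confusion-matrix counts, which is why the secondary-metric advice matters: recall ignores false positives entirely. A minimal sketch (the counts below are illustrative, not taken from any hub paper):

```python
def recall(tp: int, fn: int) -> float:
    # Recall: fraction of true positives among all actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp: int, fp: int) -> float:
    # Precision: fraction of true positives among all predicted positives.
    return tp / (tp + fp) if (tp + fp) else 0.0

def f1(tp: int, fp: int, fn: int) -> float:
    # F1: harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Illustrative counts: high recall can coexist with mediocre F1
# when false positives pile up.
print(recall(8, 2))   # 0.8
print(f1(8, 12, 2))   # ~0.533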

Researcher Checklist

  • Close the gap on "Papers with explicit human feedback": coverage is a replication risk (10% vs 45% target).
  • Tighten coverage on "Papers reporting quality controls": usable but incomplete (20% vs 30% target).
  • Tighten coverage on "Papers naming benchmarks/datasets": usable but incomplete (30% vs 35% target).
  • Maintain strength on "Papers naming evaluation metrics": coverage is strong (100% vs 35% target).
  • Tighten coverage on "Papers with known rater population": usable but incomplete (30% vs 35% target).
  • Close the gap on "Papers with known annotation unit": coverage is a replication risk (10% vs 35% target).

Coverage Scorecard

  Dimension                              Coverage   Target   Status
  Papers with explicit human feedback    10%        45%      Replication risk
  Papers reporting quality controls      20%        30%      Usable but incomplete
  Papers naming benchmarks/datasets      30%        35%      Usable but incomplete
  Papers naming evaluation metrics       100%       35%      Strong
  Papers with known rater population     30%        35%      Usable but incomplete
  Papers with known annotation unit      10%        35%      Replication risk
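
The status labels in the scorecard above appear to follow a simple coverage-to-target ratio. The thresholds below are an assumption, chosen only because they reproduce every label on this page:

```python
def coverage_status(coverage_pct: float, target_pct: float) -> str:
    # ASSUMED thresholds, inferred from the labels on this page:
    # under half the target -> replication risk; under the target ->
    # usable but incomplete; at or above the target -> strong.
    ratio = coverage_pct / target_pct
    if ratio < 0.5:
        return "replication risk"
    if ratio < 1.0:
        return "usable but incomplete"
    return "strong"

print(coverage_status(10, 45))   # replication risk
print(coverage_status(30, 35))   # usable but incomplete
print(coverage_status(100, 35))  # strong
```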

Suggested Reading Order

  1. An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

     Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

     High citation traction makes this a useful baseline for method and protocol context.

  3. Towards Controllable Video Synthesis of Routine and Rare OR Events

     High citation traction makes this a useful baseline for method and protocol context.

  4. Agentic Adversarial QA for Improving Domain-Specific LLMs

     Adds automatic metrics for broader coverage within this hub.

  5. Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

     Adds automatic metrics for broader coverage within this hub.

  6. OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models

     Adds automatic metrics for broader coverage within this hub.

  7. Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes

     Adds automatic metrics for broader coverage within this hub.

  8. Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes

     Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Annotation unit is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
  • Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding; a matching sketch follows below.
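
A minimal sketch of that matching step, assuming a hypothetical list of paper records shaped like the briefs below (IDs and field names are illustrative):

```python
# Hypothetical paper records; only benchmark/metric tags matter here.
papers = [
    {"id": "p4", "benchmarks": {"LegalBench"}, "metrics": {"recall", "f1"}},
    {"id": "p6", "benchmarks": {"Retrieval"}, "metrics": {"recall"}},
    {"id": "p9", "benchmarks": {"ARC"}, "metrics": {"recall"}},
]

def matched_cohort(papers, benchmark, metric):
    # Keep only papers reporting both the same benchmark and the same
    # metric, so cross-page comparisons are not protocol-confounded.
    return [p for p in papers
            if benchmark in p["benchmarks"] and metric in p["metrics"]]

print([p["id"] for p in matched_cohort(papers, "ARC", "recall")])  # ['p9']
```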

Research Utility Links

automatic_metrics vs simulation_env

both=1, left_only=8, right_only=1

1 paper uses both Automatic Metrics and Simulation Environments.
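
These overlap counts are plain set arithmetic over per-paper protocol tags; a minimal sketch with hypothetical paper IDs that reproduce the numbers above:

```python
# Hypothetical paper IDs; only the set sizes match this page.
automatic_metrics = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9"}
simulation_env = {"p9", "p10"}

both = automatic_metrics & simulation_env          # intersection
left_only = automatic_metrics - simulation_env     # automatic metrics only
right_only = simulation_env - automatic_metrics    # simulation env only

print(len(both), len(left_only), len(right_only))  # 1 8 1
```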

Benchmark Brief

ARC

Coverage: 1 paper (10%) mentions ARC.

Example: Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise

Benchmark Brief

LegalBench

Coverage: 1 paper (10%) mentions LegalBench.

Example: Agentic Adversarial QA for Improving Domain-Specific LLMs

Benchmark Brief

Retrieval

Coverage: 1 paper (10%) mentions Retrieval.

Example: OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
