Metric Hub

F1 In CS.AI Papers

Updated from current HFEPX corpus (Feb 27, 2026). 20 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Gold Questions. Frequently cited benchmark: Retrieval. Common metric signal: f1. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 20 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 20 papers for F1 In CS.AI Papers. Dominant protocol signals include automatic metrics, human evaluation, and simulation environments, with frequent benchmark focus on Retrieval and BrowseComp and metric focus on f1 and recall. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

5% of papers report explicit human-feedback signals, led by expert verification.

Evidence: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Automatic metrics appear in 100% of papers in this hub.

Evidence: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Distill and Align Decomposition for Enhanced Claim Verification

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: A Benchmark for Deep Information Synthesis, RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA, Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Protocol Takeaways

Most common quality-control signal is gold-question checks (5% of papers).

Evidence: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, Probing for Knowledge Attribution in Large Language Models

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models, Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, Probing for Knowledge Attribution in Large Language Models

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Distill and Align Decomposition for Enhanced Claim Verification, A Benchmark for Deep Information Synthesis, Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Benchmark Interpretation

Retrieval appears in 15% of hub papers (3/20); use this cohort for benchmark-matched comparisons.
BrowseComp appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.
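
A benchmark-matched cohort can be assembled by grouping papers on the benchmark they mention and dividing by the hub size. The sketch below is illustrative only: the function names (`build_cohorts`, `coverage_pct`) are hypothetical, the titles are shortened, and the mentions are copied from the example lists on this page, which may not be exhaustive.

```python
from collections import defaultdict

HUB_SIZE = 20  # total papers in this hub, per the page header

# (paper, benchmark) mentions taken from the example lists on this page.
# Titles are shortened; the lists may not be exhaustive.
mentions = [
    ("A Benchmark for Deep Information Synthesis", "Retrieval"),
    ("RELOOP", "Retrieval"),
    ("Hybrid Deep Searcher", "Retrieval"),
    ("Hybrid Deep Searcher", "BrowseComp"),
    ("SPARTA", "DROP"),
]


def build_cohorts(pairs):
    """Group paper titles by the benchmark they mention."""
    cohorts = defaultdict(set)
    for paper, benchmark in pairs:
        cohorts[benchmark].add(paper)
    return dict(cohorts)


def coverage_pct(cohort, hub_size=HUB_SIZE):
    """Share of hub papers in a cohort, as a whole-number percentage."""
    return round(100 * len(cohort) / hub_size)


cohorts = build_cohorts(mentions)
for name, papers in sorted(cohorts.items()):
    print(f"{name}: {len(papers)}/{HUB_SIZE} papers ({coverage_pct(papers)}%)")
```

Only papers inside the same cohort share a benchmark, so cross-cohort F1 comparisons should be treated as indicative at best.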

Metric Interpretation

f1 is reported in 100% of hub papers (20/20); compare with a secondary metric before ranking methods.
recall is reported in 40% of hub papers (8/20); compare with a secondary metric before ranking methods.
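
To illustrate why a secondary metric helps before ranking methods, the sketch below computes F1 as the harmonic mean of precision and recall and shows two hypothetical systems with identical F1 but different precision/recall balance. The helper names and the two example systems are assumptions for illustration; individual papers in this hub may aggregate F1 differently (for example macro vs. micro averaging).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf_from_counts(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1 over (tp, fp, fn) tuples."""
    scores = [prf_from_counts(*counts)[2] for counts in per_class_counts]
    return sum(scores) / len(scores)


# Two hypothetical systems: same F1, different precision/recall trade-off.
balanced = f1_score(0.5, 0.5)
precision_heavy = f1_score(0.75, 0.375)
print(f"balanced F1={balanced:.3f}, precision-heavy F1={precision_heavy:.3f}")

# Harmonic mean implied by the recall (39.4%) and precision (25.8%) quoted
# further down this page for the Kurdish maqam paper. This is arithmetic on
# those two figures, not the F1 the paper itself reports.
print(f"implied F1={f1_score(0.258, 0.394):.3f}")
```

Because the first two systems tie on F1, only precision or recall reveals which one suits a precision-sensitive or recall-sensitive task.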

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (5% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (10% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (30% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (10% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (5% vs 35% target).
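
The checklist labels can be reproduced with a simple coverage-vs-target rule. This is a minimal sketch, not the page's actual logic: the function name `coverage_status` and the `usable_ratio` cutoff of 0.8 are assumptions chosen only because they are consistent with the six rows above.

```python
def coverage_status(coverage_pct, target_pct, usable_ratio=0.8):
    """Label coverage relative to a target.

    usable_ratio is a hypothetical cutoff (not stated on this page):
    coverage at or above usable_ratio * target counts as usable.
    """
    if coverage_pct >= target_pct:
        return "strong"
    if coverage_pct >= usable_ratio * target_pct:
        return "usable but incomplete"
    return "replication risk"


# (checklist item, coverage %, target %) as listed in the checklist above.
checklist = [
    ("explicit human feedback", 5, 45),
    ("quality controls", 10, 30),
    ("benchmarks/datasets named", 30, 35),
    ("evaluation metrics named", 100, 35),
    ("known rater population", 10, 35),
    ("known annotation unit", 5, 35),
]

for item, coverage, target in checklist:
    print(f"{item}: {coverage_status(coverage, target)} ({coverage}% vs {target}% target)")
```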

Known Limitations

Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: f1 - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=3, left_only=0, right_only=17

3 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=3, left_only=17, right_only=0

3 papers use both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=3, right_only=3

0 papers use both Human Eval and Simulation Env.
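
The `both`, `left_only` and `right_only` counts above are set intersections and differences over the papers tagged with each evaluation mode. The sketch below recomputes them from the mode tags shown in the paper list on this page; the shortened titles and the `overlap` helper are illustrative assumptions.

```python
# Shortened titles for the 20 hub papers, in page order.
all_papers = [
    "SPARTA", "Knowledge Attribution", "BanglaBERT Cyberbullying",
    "Distill and Align", "Small LMs Clinical IE", "PVminer",
    "Deep Information Synthesis", "Voices of the Mountains",
    "Rare Disease Phenotyping", "Vichara", "2D-to-3D CAD Mapping",
    "Click it or Leave it", "Building Semantics", "Geometric Hallucinations",
    "Human Values", "RELOOP", "Hybrid Deep Searcher",
    "Persona-driven Voting", "Glycemic-Aware", "Just KIDDIN",
]

# Mode tags as they appear next to each paper on this page.
automatic_metrics = set(all_papers)
human_eval = {"Distill and Align", "Deep Information Synthesis", "Vichara"}
simulation_env = {"2D-to-3D CAD Mapping", "Persona-driven Voting", "Just KIDDIN"}


def overlap(left, right):
    """Count papers tagged with both modes, only the left one, or only the right one."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


pairs = [
    ("human_eval", human_eval, "automatic_metrics", automatic_metrics),
    ("automatic_metrics", automatic_metrics, "simulation_env", simulation_env),
    ("human_eval", human_eval, "simulation_env", simulation_env),
]
for left_name, left, right_name, right in pairs:
    counts = overlap(left, right)
    summary = ", ".join(f"{key}={value}" for key, value in counts.items())
    print(f"{left_name} vs {right_name}: {summary}")
```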

Benchmark Brief

Retrieval

Coverage: 3 papers (15%)

3 papers (15%) mention Retrieval.

Examples: A Benchmark for Deep Information Synthesis, RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA, Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Benchmark Brief

BrowseComp

Coverage: 1 paper (5%)

1 paper (5%) mentions BrowseComp.

Examples: Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Benchmark Brief

DROP

Coverage: 1 paper (5%)

1 paper (5%) mentions DROP.

Examples: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Metric Brief

f1

Coverage: 20 papers (100%)

20 papers (100%) mention f1.

Examples: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Metric Brief

recall

Coverage: 8 papers (40%)

8 papers (40%) mention recall.

Examples: Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Metric Brief

accuracy

Coverage: 6 papers (30%)

6 papers (30%) mention accuracy.

Examples: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection, Distill and Align Decomposition for Enhanced Claim Verification, Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, Probing for Knowledge Attribution in Large Language Models, A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Top Papers Reporting This Metric

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026

Automatic Metrics Coding

Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in n
Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026

Automatic Metrics General

Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retr
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026

Automatic Metrics General

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC.
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026

Human Eval Automatic Metrics General

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei · Feb 24, 2026

Automatic Metrics Medicine Multilingual

Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP).
PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju · Feb 24, 2026

Automatic Metrics Medicine Coding

Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH).
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026

Human Eval Automatic Metrics Coding

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams
Darvan Shvan Khairaldeen, Hossein Hassani · Feb 24, 2026

Automatic Metrics General

On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%.
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026

Automatic Metrics Medicine

Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontolo
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026

Human Eval Automatic Metrics Law

Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic Metrics Simulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
Wojciech Michaluk, Tymoteusz Urban, Mateusz Kubita, Soveatin Kuntur, Anna Wroblewska · Feb 20, 2026

Automatic Metrics Coding

Clickbait headlines degrade the quality of online information and undermine user trust.
Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
Suhyung Jang, Ghang Lee, Jaekun Lee, Hyunjun Lee · Feb 17, 2026

Automatic Metrics General

Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry.
A Geometric Analysis of Small-sized Language Model Hallucinations
Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro · Feb 16, 2026

Automatic Metrics General

Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings.
Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum
Víctor Yeste, Paolo Rosso · Jan 20, 2026

Automatic Metrics Coding

We study sentence-level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus).
RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025

Automatic Metrics General

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee · Aug 26, 2025

Automatic Metrics General

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025

Automatic Metrics Simulation Env Coding

Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse but have been found to consistently exhibit a progressive left-leaning bias.
Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes
Saman Khamesian, Asiful Arefeen, Maria Adela Grando, Bithika M. Thompson, Hassan Ghasemzadeh · Feb 20, 2025

Automatic Metrics Medicine

Managing Type 1 Diabetes (T1D) demands constant vigilance as individuals strive to regulate their blood glucose levels and avoid dysglycemia, including hyperglycemia and hypoglycemia.
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes
Rahul Garg, Trilok Padhi, Hemang Jain, Ugur Kursuncu, Ponnurangam Kumaraguru · Nov 19, 2024

Automatic Metrics Simulation Env General

Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines across AU-ROC, F1, and Recall with improvements of 1.1%, 7%, and 35%, respectively.
