Metric Hub

F1 + Coding Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 11 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Human Eval. Frequently cited benchmark: Retrieval. Common metric signal: f1. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 11 · Last published: Feb 26, 2026

Research Narrative

This page tracks 11 papers in the F1 + Coding metric hub. Dominant protocol signals are automatic metrics, human evaluation, and simulation environments; the most frequent benchmarks are Retrieval and DROP, and the most frequent metrics are f1 and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Automatic metrics appear in 100% of papers in this hub.

Evidence: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data, A Benchmark for Deep Information Synthesis

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: A Benchmark for Deep Information Synthesis, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Tool-use evaluation appears in 9.1% of papers, indicating demand for agentic evaluation.

Evidence: A Benchmark for Deep Information Synthesis, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data, A Benchmark for Deep Information Synthesis

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, A Benchmark for Deep Information Synthesis, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Stratify by benchmark (Retrieval vs DROP) before comparing methods.

Evidence: MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, A Benchmark for Deep Information Synthesis, SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Benchmark Interpretation

Retrieval appears in 18.2% of hub papers (2/11); use this cohort for benchmark-matched comparisons.
DROP appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
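
The coverage figures above are cohort fractions of the 11 hub papers. A minimal Python sketch of that arithmetic, assuming the paper-to-benchmark pairs given in the Benchmark Brief entries on this page; `MENTIONS`, `build_cohorts`, and `coverage_pct` are illustrative names, not part of any HFEPX tooling:

```python
from collections import defaultdict

HUB_SIZE = 11  # total papers in this hub, per the page header

# (paper, benchmark) pairs as listed in the Benchmark Brief entries on this page.
MENTIONS = [
    ("A Benchmark for Deep Information Synthesis", "Retrieval"),
    ("Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware "
     "Multimodal Bengali Hateful Meme Detection", "Retrieval"),
    ("SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA "
     "over Text and Tables", "DROP"),
]


def build_cohorts(mentions):
    """Group paper titles by the benchmark they mention."""
    cohorts = defaultdict(list)
    for paper, benchmark in mentions:
        cohorts[benchmark].append(paper)
    return dict(cohorts)


def coverage_pct(cohort, hub_size=HUB_SIZE):
    """Share of hub papers in a cohort, as a percentage rounded to one decimal."""
    return round(100 * len(cohort) / hub_size, 1)


cohorts = build_cohorts(MENTIONS)
for benchmark, papers in cohorts.items():
    print(f"{benchmark}: {len(papers)}/{HUB_SIZE} = {coverage_pct(papers)}%")
# Retrieval: 2/11 = 18.2%
# DROP: 1/11 = 9.1%
```

Comparing methods only within one cohort keeps the benchmark fixed, which is the point of the benchmark-matched comparison recommended above.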

Metric Interpretation

f1 is reported in 100% of hub papers (11/11); compare with a secondary metric before ranking methods.
accuracy is reported in 18.2% of hub papers (2/11); compare with a secondary metric before ranking methods.
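
One reason to check a secondary metric before ranking is that f1 and accuracy can order the same two systems differently on imbalanced data. The sketch below uses a made-up binary toy set (the labels and both "systems" are hypothetical, not drawn from any paper in this hub) and plain-Python metric helpers:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Fraction of examples whose prediction matches the gold label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical toy data: 2 positives out of 10 examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
system_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # always predicts the majority class
system_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # finds both positives, 3 false alarms

for name, pred in [("A", system_a), ("B", system_b)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.2f} f1={f1_score(y_true, pred):.2f}")
# A: accuracy=0.80 f1=0.00
# B: accuracy=0.70 f1=0.57
```

Accuracy prefers system A while f1 prefers system B, so a ranking based on either metric alone would hide the disagreement.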

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Protocol MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

Human-eval abstract signal: Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce.

Protocol A Benchmark for Deep Information Synthesis

Human-eval abstract signal: When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark.

Protocol SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

LLM-judge abstract signal: We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA.

Benchmark A Benchmark for Deep Information Synthesis

Retrieval benchmark signal: However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval.

Benchmark SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Retrieval benchmark signal: Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in...

Metric SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

f1 metric signal: On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning.

Metric MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

f1 metric signal: Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy.

Protocol PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Protocol abstract signal: Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH).
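
The MixSarc snippet above contrasts competitive micro-F1 with low exact match accuracy. The following sketch shows how that gap can arise, assuming a multi-label setup (the abstract snippet does not spell out the task format) and entirely hypothetical label sets named `a`, `b`, and `c`:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: pool TP/FP/FN over all examples and labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and correct
        fp += len(p - g)  # labels predicted but not in gold
        fn += len(g - p)  # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match(gold, pred):
    """Fraction of examples whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical gold and predicted label sets for four examples.
GOLD = [{"a", "b"}, {"c"}, {"b"}, {"a", "c"}]
PRED = [{"a"}, {"c", "b"}, {"b"}, {"a"}]

print(f"micro-F1    = {micro_f1(GOLD, PRED):.3f}")   # 0.727
print(f"exact match = {exact_match(GOLD, PRED):.3f}")  # 0.250
```

Partial credit for individual labels lifts micro-F1, while exact match counts an example only when the whole label set is right.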

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (0% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (0% vs 30% target).
Maintain strength on papers naming benchmarks/datasets: coverage is strong (36.4% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with known rater population: coverage is a replication risk (0% vs 35% target).
Close the gap on papers with known annotation unit: coverage is a replication risk (0% vs 35% target).
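
The checklist above compares each coverage figure with a target. A small sketch of that comparison, assuming the rule is simply "strong when coverage meets or exceeds the target, replication risk otherwise"; the page does not state its rule, so the threshold logic and the names `CHECKLIST` and `classify` are assumptions:

```python
# (criterion, coverage %, target %) copied from the checklist above.
CHECKLIST = [
    ("Papers with explicit human feedback", 0.0, 45.0),
    ("Papers reporting quality controls", 0.0, 30.0),
    ("Papers naming benchmarks/datasets", 36.4, 35.0),
    ("Papers naming evaluation metrics", 100.0, 35.0),
    ("Papers with known rater population", 0.0, 35.0),
    ("Papers with known annotation unit", 0.0, 35.0),
]


def classify(coverage, target):
    """Assumed rule: coverage at or above target is 'strong', else a risk."""
    return "strong" if coverage >= target else "replication risk"


# Criteria whose coverage falls short of the target under the assumed rule.
gaps = [name for name, cov, tgt in CHECKLIST if classify(cov, tgt) != "strong"]

for name, cov, tgt in CHECKLIST:
    print(f"{name}: {cov}% vs {tgt}% target -> {classify(cov, tgt)}")
print(f"{len(gaps)} of {len(CHECKLIST)} criteria need attention")
```

Under this assumed rule the labels match the checklist: two criteria are strong and four are replication risks.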

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: f1 - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

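human_eval vs automatic_metrics

both=2, left_only=0, right_only=9

2 papers use both Human Eval and Automatic Metrics; none use Human Eval alone, and 9 use Automatic Metrics alone.

automatic_metrics vs simulation_env

both=2, left_only=9, right_only=0

2 papers use both Automatic Metrics and Simulation Env; 9 use Automatic Metrics alone, and none use Simulation Env alone.

human_eval vs simulation_env

both=1, left_only=1, right_only=1

1 paper uses both Human Eval and Simulation Env; 1 uses Human Eval alone, and 1 uses Simulation Env alone.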
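
The both/left_only/right_only counts above are set intersections and differences over the papers tagged with each evaluation mode. A minimal sketch, assuming placeholder paper IDs `p01` to `p11`; the page gives only the counts, so the assignment of IDs to modes is hypothetical and chosen to reproduce them:

```python
def overlap(left, right):
    """Count papers in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical tagging that is consistent with the counts reported above.
automatic_metrics = {f"p{i:02d}" for i in range(1, 12)}  # all 11 papers
human_eval = {"p01", "p02"}
simulation_env = {"p02", "p03"}

print(overlap(human_eval, automatic_metrics))
# {'both': 2, 'left_only': 0, 'right_only': 9}
print(overlap(automatic_metrics, simulation_env))
# {'both': 2, 'left_only': 9, 'right_only': 0}
print(overlap(human_eval, simulation_env))
# {'both': 1, 'left_only': 1, 'right_only': 1}
```

The "left" and "right" sides follow the order of the names in each comparison heading, so swapping the arguments swaps `left_only` and `right_only`.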

Benchmark Brief

Retrieval

Coverage: 2 of 11 papers (18.2%) mention Retrieval.

Examples: A Benchmark for Deep Information Synthesis, Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection

Benchmark Brief

DROP

Coverage: 1 of 11 papers (9.1%) mentions DROP.

Examples: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Benchmark Brief

ValueEval

Coverage: 1 of 11 papers (9.1%) mentions ValueEval.

Examples: Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum

Metric Brief

f1

Coverage: 11 of 11 papers (100%) mention f1.

Examples: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Metric Brief

accuracy

Coverage: 2 of 11 papers (18.2%) mention accuracy.

Examples: MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement

Metric Brief

cost

Coverage: 1 of 11 papers (9.1%) mentions cost.

Examples: Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables, MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification, PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Top Papers Reporting This Metric

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026

Automatic Metrics · Coding

Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in...

MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · Feb 25, 2026

Human Eval · Automatic Metrics · Coding

We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting.

PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju · Feb 24, 2026

Automatic Metrics · Medicine · Coding

Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH).

A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026

Human Eval · Automatic Metrics · Coding

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.

Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection
Raihan Tanvir, Md. Golam Rabiul Alam · Feb 22, 2026

Automatic Metrics · Coding · Multilingual

Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives.

Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
Wojciech Michaluk, Tymoteusz Urban, Mateusz Kubita, Soveatin Kuntur, Anna Wroblewska · Feb 20, 2026

Automatic Metrics · Coding

Clickbait headlines degrade the quality of online information and undermine user trust.

Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin · Feb 17, 2026

Automatic Metrics · Coding

Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice.

Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov · Feb 12, 2026

Automatic Metrics · Coding

Being modeled as a single-label classification task for a long time, recent work has argued that Arabic Dialect Identification (ADI) should be framed as a multi-label classification task.

Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum
Víctor Yeste, Paolo Rosso · Jan 20, 2026

Automatic Metrics · Coding

We study sentence-level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus).

Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes
Abdullah Al Monsur, Nitesh Vamshi Bommisetty, Gene Louis Kim · Jan 17, 2026

Automatic Metrics · Coding

The current state of event detection research has two notable re-occurring limitations that we investigate in this study.

Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025

Automatic Metrics · Simulation Env · Coding

Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse but have been found to consistently exhibit a progressive left-leaning bias.
