Metric Hub

Precision + General Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 15 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequent quality control: Calibration. Common metric signal: precision. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 15 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 15 papers for Precision + General Metric Papers. The dominant protocol signals are automatic metrics and simulation environments; benchmark focus is spread across multiple benchmark families, and the metric focus is on precision and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

6.7% of papers report explicit human-feedback signals, led by expert verification.

Evidence: An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems; pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Automatic metrics appear in 100% of papers in this hub.

Evidence: pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Precision is a repeated reporting metric here, which makes cross-paper score interpretation more consistent.

Evidence: pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Protocol Takeaways

Most common quality-control signal is rater calibration (13.3% of papers).

Evidence: MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs; WISE: Web Information Satire and Fakeness Evaluation; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training

Raters are mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.

Evidence: pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Track metric sensitivity by reporting both precision and accuracy.

Evidence: pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Metric Interpretation

Precision is reported in 100% of hub papers (15/15); compare with a secondary metric before ranking methods.
Accuracy is reported in 40% of hub papers (6/15); compare with a secondary metric before ranking methods.
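To illustrate why a secondary metric matters before ranking, here is a minimal sketch that computes precision, recall, and accuracy from confusion-matrix counts. The two systems and all of their counts are hypothetical and chosen only for illustration; they do not come from any paper in this hub.

```python
def scores(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return precision, recall, and accuracy from confusion-matrix counts."""
    predicted_pos = tp + fp
    actual_pos = tp + fn
    total = tp + fp + fn + tn
    return {
        "precision": tp / predicted_pos if predicted_pos else 0.0,
        "recall": tp / actual_pos if actual_pos else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }

# Hypothetical imbalanced test set: 10 positives, 990 negatives.
# "cautious" flags a single item; "balanced" flags twelve.
cautious = scores(tp=1, fp=0, fn=9, tn=990)
balanced = scores(tp=8, fp=4, fn=2, tn=986)

for name, s in [("cautious", cautious), ("balanced", balanced)]:
    print(name, {k: round(v, 3) for k, v in s.items()})
```

Under these assumed counts, the cautious system has perfect precision but finds only one of ten positives, while accuracy barely separates the two systems because negatives dominate. Ranking on precision alone would favor the system that misses most positives.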

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Protocol pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training

Human-eval abstract signal: Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment.

Protocol A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Human-eval abstract signal: Cyberbullying has become a serious and growing concern in today's virtual world.

Metric pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training

precision metric signal: However, existing methods still fail to achieve satisfactory accuracy and scalability.

Metric A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

precision metric signal: Developing a generalized model with moderate accuracy remains challenging.

Quality Control MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs

rater calibration quality-control signal: However, it has been observed that the calibration parameters for quantization are typically linked to specific precisions, which presents challenges during elastic-precision calibration and precision switching at runtime.

Quality Control WISE: Web Information Satire and Fakeness Evaluation

rater calibration quality-control signal: Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error.

Protocol An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Protocol abstract signal: Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain.

Protocol Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Protocol abstract signal: Maqam, a singing type, is a significant component of Kurdish music.
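The WISE snippet in the highlights above lists Brier score and Expected Calibration Error among its metrics. Both measure how well predicted probabilities match observed outcomes. A minimal pure-Python sketch for the binary case follows; the equal-width binning over the positive-class probability and the bin count of 10 are assumptions of this sketch, not details taken from the paper.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted gap between mean predicted probability and
    observed positive rate, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(pos_rate - mean_prob)
    return ece

# Hypothetical predictions: both items scored 0.9, only one is positive.
probs, labels = [0.9, 0.9], [1, 0]
print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```

Other ECE variants bin on the confidence of the predicted class or use equal-mass bins, so values are only comparable across papers that state the same convention.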

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (6.7% vs 45% target).
Tighten coverage on Papers reporting quality controls. Coverage is usable but incomplete (20% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (0% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (13.3% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
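The checklist labels above can be reproduced with a small coverage calculation. In the sketch below, the per-field paper counts are inferred from the reported percentages over 15 papers, and the labeling rule (at or above target is strong, at least half of target is usable, otherwise a risk) is an assumption that happens to match the labels on this page; the hub's actual rule is not stated.

```python
TOTAL_PAPERS = 15

def coverage_pct(n_with_field: int, n_total: int = TOTAL_PAPERS) -> float:
    """Share of papers that report a field, as a percentage with one decimal."""
    return round(100 * n_with_field / n_total, 1)

def status(coverage: float, target: float) -> str:
    """Assumed labeling rule; not confirmed by the source page."""
    if coverage >= target:
        return "strong"
    if coverage >= 0.5 * target:
        return "usable but incomplete"
    return "replication risk"

# (field, inferred paper count, target %) as listed in the checklist.
fields = [
    ("explicit human feedback", 1, 45),
    ("quality controls", 3, 30),
    ("benchmarks/datasets", 0, 35),
    ("evaluation metrics", 15, 35),
    ("known rater population", 2, 35),
    ("known annotation unit", 0, 35),
]

for name, count, target in fields:
    cov = coverage_pct(count)
    print(f"{name}: {cov}% vs {target}% target -> {status(cov, target)}")
```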

Known Limitations

Rater population is under-specified (13.3% coverage).
Annotation unit is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Metric Slice: precision - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=14, right_only=0

1 paper uses both Automatic Metrics and Simulation Env; the other 14 use Automatic Metrics only.
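The both / left_only / right_only counts above are plain set arithmetic. A minimal sketch, assuming hypothetical paper identifiers (`paper_01` to `paper_15`) and an arbitrary choice of which paper carries the Simulation Env tag:

```python
# Hypothetical IDs; only the set sizes mirror the counts on this page.
automatic_metrics = {f"paper_{i:02d}" for i in range(1, 16)}  # all 15 papers
simulation_env = {"paper_09"}                                 # 1 paper

both = automatic_metrics & simulation_env        # tagged with both modes
left_only = automatic_metrics - simulation_env   # Automatic Metrics only
right_only = simulation_env - automatic_metrics  # Simulation Env only

print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
```

Because `right_only` is empty, every Simulation Env paper in this hub also reports automatic metrics, so the two modes cannot be compared as disjoint groups here.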

Metric Brief

precision

Coverage: 15 papers (100%)

15 papers (100%) mention precision.

Examples: pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Metric Brief

accuracy

Coverage: 6 papers (40%)

6 papers (40%) mention accuracy.

Examples: pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Metric Brief

f1

Coverage: 5 papers (33.3%)

5 papers (33.3%) mention f1.

Examples: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams; PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
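F1 is the harmonic mean of precision and recall, so it can be derived when a paper reports only those two. As a hedged worked example, the sketch below plugs in the precision (25.8%) and recall (39.4%) quoted on this page for the Kurdish Maqams paper at its 0.750 threshold. The resulting F1 is computed here, not taken from the paper, and may differ from any F1 the paper reports under a different averaging scheme.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values quoted on this page for the 50-song evaluation at a 0.750 threshold.
f1 = f1_score(precision=0.258, recall=0.394)
print(f"F1 = {f1:.3f}")
```

The derived value is roughly 0.31, which sits closer to the lower of the two inputs, as expected for a harmonic mean.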

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang · Feb 26, 2026

Automatic Metrics General

Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment.
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026

Automatic Metrics General

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC.
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026

Automatic Metrics General

Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practic
Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams
Darvan Shvan Khairaldeen, Hossein Hassani · Feb 24, 2026

Automatic Metrics General

On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%.
PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026

Automatic Metrics General

Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes
Roman Egger · Feb 22, 2026

Automatic Metrics General

In benchmarks across 20 Newsgroups, BBC News, AG News, and Arxiv, TriTopic achieves the highest NMI on every dataset (mean NMI 0.575 vs.
MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs
Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata · Feb 21, 2026

Automatic Metrics General

Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources.
AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting
Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari · Feb 21, 2026

Automatic Metrics General

Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency.
Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic Metrics Simulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
The Anxiety of Influence: Bloom Filters in Transformer Attention Heads
Peter Balogh · Feb 19, 2026

Automatic Metrics General

Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT-2 small, me
PREFER: An Ontology for the PREcision FERmentation Community
Txell Amigó, Shawn Zheng Kai Tan, Angel Luu Phanthanourak, Sebastian Schulz, Pasquale D. Colaianni · Feb 18, 2026

Automatic Metrics General

Precision fermentation relies on microbial cell factories to produce sustainable food, pharmaceuticals, chemicals, and biofuels.
CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding
Tahir Hussain, Saddam Hussain Khan · Feb 16, 2026

Automatic Metrics General

The qualitative evaluation noted better extraction and discrimination and theological precision.
Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin · Feb 13, 2026

Automatic Metrics General

As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency.
WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025

Automatic Metrics General

This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as eith
Pretraining Language Models for Diachronic Linguistic Change Discovery
Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner · Apr 7, 2025

Automatic Metrics General

This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies.
