Metric Hub

Precision In CS.AI Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 20 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: precision. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 25, 2026.

Papers: 20 · Last published: Feb 25, 2026

Research Narrative

This page tracks 20 papers for Precision In CS.AI Papers, drawn from the HFEPX corpus as updated on Feb 27, 2026. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are Retrieval and ARC, and the most frequent metrics are precision and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

10% of papers report explicit human-feedback signals, led by expert verification.

Evidence: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages

Automatic metrics appear in 100% of papers in this hub.

Evidence: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages; OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems; CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages

Protocol Takeaways

The most common quality-control signal is rater calibration (10% of papers).

Evidence: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation; An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models; CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications; A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Stratify by benchmark (Retrieval vs ARC) before comparing methods.

Evidence: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages; OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Benchmark Interpretation

Retrieval appears in 10% of hub papers (2/20); use this cohort for benchmark-matched comparisons.
ARC appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.
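
The benchmark-matched cohorts described above can be sketched as a simple grouping step. This is a minimal illustration, not the hub's actual pipeline: the record layout and the function names `stratify_by_benchmark` and `coverage_percent` are hypothetical, and the paper-to-benchmark assignments are copied from the Benchmark Brief entries on this page.

```python
from collections import defaultdict

# Benchmark mentions copied from the Benchmark Brief entries on this page.
# The (title, benchmarks) record layout is a hypothetical schema for illustration.
PAPERS = [
    ("Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems", ["Retrieval"]),
    ("CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications", ["Retrieval"]),
    ("Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise", ["ARC"]),
    ("Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference", ["GSM8K"]),
]
HUB_SIZE = 20  # total papers in this hub


def stratify_by_benchmark(papers):
    """Group paper titles into one cohort per named benchmark."""
    cohorts = defaultdict(list)
    for title, benchmarks in papers:
        for benchmark in benchmarks:
            cohorts[benchmark].append(title)
    return dict(cohorts)


def coverage_percent(cohort_size, hub_size):
    """Share of hub papers in a cohort, as a percentage."""
    return 100.0 * cohort_size / hub_size


cohorts = stratify_by_benchmark(PAPERS)
for name, titles in cohorts.items():
    pct = coverage_percent(len(titles), HUB_SIZE)
    print(f"{name}: {len(titles)}/{HUB_SIZE} papers ({pct:.0f}%)")
```

Comparing methods only within a cohort keeps the comparison benchmark-matched; with cohorts this small, any ranking inside one should be read as indicative at best.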

Metric Interpretation

Precision is reported in 100% of hub papers (20/20); compare with a secondary metric before ranking methods.
Accuracy is reported in 40% of hub papers (8/20); compare with a secondary metric before ranking methods.
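
One common way to pair precision with a secondary metric is the F1 score, the harmonic mean of precision and recall. The sketch below is illustrative only: the helper name `f1_score` is an assumption, and the input values are the recall (39.4%) and precision (25.8%) quoted on this page for the Kurdish maqam vocal error detection paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted on this page (full 50-song evaluation, 0.750 threshold).
precision, recall = 0.258, 0.394
print(f"F1 = {f1_score(precision, recall):.3f}")  # prints "F1 = 0.312"
```

Because the harmonic mean is pulled toward the lower of the two values, a method cannot score well on F1 by trading away recall for precision, which is why it is a useful check before ranking on precision alone.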

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (10% vs 45% target).
Tighten coverage on Papers reporting quality controls. Coverage is usable but incomplete (20% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (25% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (15% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (5% vs 35% target).
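
The checklist labels can be reproduced with a simple coverage-versus-target rule. The page does not state how the labels are assigned, so the rule below is an assumption: coverage at or above target is "strong", coverage at or above half the target is "usable but incomplete", and anything lower is a "replication risk". The function name `coverage_status` and the `risk_ratio` parameter are hypothetical; the figures are the ones listed above.

```python
def coverage_status(coverage, target, risk_ratio=0.5):
    """Label a coverage percentage against its target.

    ASSUMPTION: the 0.5 cutoff is a guess that happens to match the six
    checklist rows on this page; the hub's real rule is not documented here.
    """
    if coverage >= target:
        return "strong"
    if coverage / target >= risk_ratio:
        return "usable but incomplete"
    return "replication risk"


# (checklist item, coverage %, target %) as listed on this page.
CHECKLIST = [
    ("explicit human feedback", 10, 45),
    ("quality controls", 20, 30),
    ("benchmarks/datasets", 25, 35),
    ("evaluation metrics", 100, 35),
    ("known rater population", 15, 35),
    ("known annotation unit", 5, 35),
]

for item, coverage, target in CHECKLIST:
    print(f"{item}: {coverage}% vs {target}% target -> {coverage_status(coverage, target)}")
```

Other rules, such as a fixed gap of 10 percentage points, fit the same six rows equally well, so treat the cutoff as a placeholder.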

Known Limitations

Rater population is under-specified (15% coverage).
Annotation unit is under-specified (5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: precision - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=19, right_only=0

1 paper uses both Automatic Metrics and Simulation Env; 19 use Automatic Metrics only, and none use Simulation Env only.
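
The both/left_only/right_only counts above are plain set arithmetic over the papers tagged with each evaluation mode. A minimal sketch follows; the function name `overlap_counts` and the integer paper IDs are hypothetical stand-ins, chosen only so that the set sizes match the counts reported above.

```python
def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical paper IDs 0..19: all 20 papers use automatic metrics,
# and one of them (ID 11, chosen arbitrarily) also uses a simulation environment.
automatic_metrics = set(range(20))
simulation_env = {11}

print(overlap_counts(automatic_metrics, simulation_env))
# prints {'both': 1, 'left_only': 19, 'right_only': 0}
```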

Benchmark Brief

Retrieval

Coverage: 2 papers (10%)

Examples: Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems; CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Benchmark Brief

ARC

Coverage: 1 paper (5%)

Examples: Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise

Benchmark Brief

GSM8K

Coverage: 1 paper (5%)

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Metric Brief

precision

Coverage: 20 papers (100%)

Examples: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages; OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation

Metric Brief

accuracy

Coverage: 8 papers (40%)

Examples: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Metric Brief

recall

Coverage: 6 papers (30%)

Examples: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams; An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection; Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages; OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Top Papers Reporting This Metric

A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026

Automatic Metrics General

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC.
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei · Feb 24, 2026

Automatic Metrics Medicine, Multilingual

Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP).
OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu · Feb 24, 2026

Automatic Metrics Medicine

Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities.
Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams
Darvan Shvan Khairaldeen, Hossein Hassani · Feb 24, 2026

Automatic Metrics General

On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%.
Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
Mukul Chhabra, Luigi Medrano, Arush Verma · Feb 23, 2026

Automatic Metrics Coding

Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error c
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026

Automatic Metrics Medicine

Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontolo
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics Math

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs
Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata · Feb 21, 2026

Automatic Metrics General

Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources.
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026

Automatic Metrics Multilingual

Personal AI agents incur substantial cost via repeated LLM calls.
AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting
Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari · Feb 21, 2026

Automatic Metrics General

Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency.
The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol
Andreas Schlapbach · Feb 21, 2026

Automatic Metrics Coding

This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction.
Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic Metrics, Simulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026

Automatic Metrics Medicine

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
The Anxiety of Influence: Bloom Filters in Transformer Attention Heads
Peter Balogh · Feb 19, 2026

Automatic Metrics General

Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT-2 small, me
PREFER: An Ontology for the PREcision FERmentation Community
Txell Amigó, Shawn Zheng Kai Tan, Angel Luu Phanthanourak, Sebastian Schulz, Pasquale D. Colaianni · Feb 18, 2026

Automatic Metrics General

Precision fermentation relies on microbial cell factories to produce sustainable food, pharmaceuticals, chemicals, and biofuels.
CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding
Tahir Hussain, Saddam Hussain Khan · Feb 16, 2026

Automatic Metrics General

The qualitative evaluation noted better extraction and discrimination and theological precision.
Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin · Feb 13, 2026

Automatic Metrics General

As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency.
SciTS: Scientific Time Series Understanding and Generation with LLMs
Wen Wu, Ziyang Zhang, Liwei Liu, Xuenan Xu, Jimin Zhuang · Sep 26, 2025

Automatic Metrics Coding

To address these gaps, we introduce SciTS, a benchmark spanning 12 scientific domains and 43 tasks, with over 50k+ instances, both univariate and multivariate signals ranging from $10^0$ to $10^7$ in length and up to 10 MHz in frequency.
Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu · Apr 26, 2025

Automatic Metrics Coding

Accurately identifying the synthesis conditions of metal-organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and di
Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise
Zhenkai Zhang, Krista A. Ehinger, Tom Drummond · Oct 26, 2023

Automatic Metrics Math

This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes.