Metric Hub

Accuracy In CS.CV Papers

Updated from current HFEPX corpus (Feb 27, 2026). 18 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 18 · Last published: Feb 25, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 18 papers for Accuracy In CS.CV Papers. Dominant protocol signals are automatic metrics and simulation environments, with frequent benchmark focus on Retrieval and DocVQA and metric focus on accuracy and auroc. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

16.7% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning , NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Automatic metrics appear in 100% of papers in this hub.

Evidence: NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video , Virtual Biopsy for Intracranial Tumors Diagnosis on MRI

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering , NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Protocol Takeaways

Most common quality-control signal is rater calibration (5.6% of papers).

Evidence: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition , NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Raters are mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.

Evidence: SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video , Virtual Biopsy for Intracranial Tumors Diagnosis on MRI , OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation , MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

Stratify by benchmark (Retrieval vs DocVQA) before comparing methods.

Evidence: NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video , Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
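The stratification advice above can be sketched in a few lines. This is a minimal illustration, not the hub's own tooling: the record layout, the method names, and every accuracy value are hypothetical placeholders, used only to show grouping by benchmark before ranking.

```python
from collections import defaultdict

# Hypothetical records: (method, benchmark, accuracy).
# All names and numbers are placeholders, not results from any paper.
results = [
    ("method_a", "Retrieval", 0.62),
    ("method_b", "Retrieval", 0.58),
    ("method_c", "DocVQA", 0.81),
    ("method_a", "DocVQA", 0.77),
]


def rank_within_benchmark(rows):
    """Group rows by benchmark, then sort each group by accuracy (descending)."""
    groups = defaultdict(list)
    for method, benchmark, acc in rows:
        groups[benchmark].append((method, acc))
    return {
        bench: sorted(entries, key=lambda e: e[1], reverse=True)
        for bench, entries in groups.items()
    }


ranked = rank_within_benchmark(results)
for bench, entries in ranked.items():
    print(bench, entries)
```

Ranking inside each benchmark avoids comparing an accuracy measured on Retrieval against one measured on DocVQA, which are not on a common scale.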

Benchmark Interpretation

Retrieval appears in 11.1% of hub papers (2/18); use this cohort for benchmark-matched comparisons.
DocVQA appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.
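The percentages above follow directly from the paper counts. A small sketch of the arithmetic, assuming one-decimal rounding (the function name `coverage_pct` is illustrative, not part of any published tooling):

```python
HUB_SIZE = 18  # total papers in this hub, as stated on the page


def coverage_pct(count: int, total: int) -> float:
    """Share of hub papers as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


# Counts taken from the page: Retrieval 2/18, DocVQA 1/18.
benchmark_counts = {"Retrieval": 2, "DocVQA": 1}
coverage = {name: coverage_pct(n, HUB_SIZE) for name, n in benchmark_counts.items()}
print(coverage)
```

With cohorts of one or two papers, a single added paper shifts coverage by about 5.6 percentage points, so these shares are best read as counts.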

Metric Interpretation

accuracy is reported in 100% of hub papers (18/18); compare with a secondary metric before ranking methods.
auroc is reported in 5.6% of hub papers (1/18); compare with a secondary metric before ranking methods.
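One reason to pair accuracy with a secondary metric is class imbalance. The sketch below uses a hypothetical toy label set (8 negatives, 2 positives) and two made-up scorers; it computes accuracy at a 0.5 threshold and AUROC via pairwise comparison. None of the numbers come from the papers in this hub.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 negatives followed by 2 positives.
labels = [0] * 8 + [1] * 2
scores_const = [0.1] * 10  # scorer that gives every example the same low score
scores_ranked = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.48]  # correct order, all below 0.5

for name, s in [("const", scores_const), ("ranked", scores_ranked)]:
    print(name, accuracy(labels, s), auroc(labels, s))
```

Both scorers reach the same accuracy because neither crosses the threshold, yet only the second orders positives above negatives; AUROC separates them where accuracy cannot.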

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (16.7% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (5.6% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (16.7% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Tighten coverage on Papers with known rater population. Coverage is usable but incomplete (22.2% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
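The page does not state how the three status labels are assigned. The sketch below is a guess: it assumes the label depends on the ratio of observed coverage to target, with a cutoff of 0.5 chosen only because it reproduces the six labels in the checklist above. The function name and the cutoff are assumptions, not the hub's documented rule.

```python
def coverage_status(observed_pct: float, target_pct: float) -> str:
    """Label coverage against a target using an ASSUMED ratio rule."""
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = observed_pct / target_pct
    if ratio >= 1.0:
        return "strong"
    if ratio >= 0.5:  # assumed cutoff; not stated on the page
        return "usable but incomplete"
    return "replication risk"


# (observed %, target %) pairs copied from the checklist.
checklist = {
    "explicit human feedback": (16.7, 45),
    "quality controls": (5.6, 30),
    "benchmarks/datasets": (16.7, 35),
    "evaluation metrics": (100, 35),
    "known rater population": (22.2, 35),
    "known annotation unit": (0, 35),
}
statuses = {name: coverage_status(obs, tgt) for name, (obs, tgt) in checklist.items()}
for name, status in statuses.items():
    print(f"{name}: {status}")
```

Other rules (for example, a fixed percentage-point gap) could fit the same six rows, so treat the cutoff as illustrative.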

Known Limitations

Only 5.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (22.2% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

Both modes: 2 papers; Automatic Metrics only: 16; Simulation Env only: 0.

2 papers use both Automatic Metrics and Simulation Env.
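The overlap counts above are plain set arithmetic. In this sketch the two Simulation Env papers are identified by the short names BEAT and HoloLLM, which carry both tags in the paper list; the remaining 16 identifiers are hypothetical placeholders, and `mode_overlap` is an illustrative helper name.

```python
def mode_overlap(left: set, right: set) -> dict:
    """Count papers tagged with both modes, only the left mode, or only the right mode."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder ids stand in for the 16 papers tagged with Automatic Metrics only.
automatic_metrics = {f"paper_{i}" for i in range(1, 17)} | {"BEAT", "HoloLLM"}
simulation_env = {"BEAT", "HoloLLM"}

overlap = mode_overlap(automatic_metrics, simulation_env)
print(overlap)
```

A right-only count of zero means every simulation-environment paper in this hub also reports automatic metrics.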

Benchmark Brief

Retrieval

Coverage: 2 papers (11.1%)

2 papers (11.1%) mention Retrieval.

Examples: VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval , Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Benchmark Brief

DocVQA

Coverage: 1 paper (5.6%)

1 paper (5.6%) mentions DocVQA.

Examples: Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

Metric Brief

accuracy

Coverage: 18 papers (100%)

18 papers (100%) mention accuracy.

Examples: NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Metric Brief

auroc

Coverage: 1 paper (5.6%)

1 paper (5.6%) mentions auroc.

Examples: MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

Metric Brief

calibration

Coverage: 1 paper (5.6%)

1 paper (5.6%) mentions calibration.

Examples: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026

Automatic Metrics Coding

Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image.
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026

Automatic Metrics General

This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026

Automatic Metrics Medicine Coding

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai · Feb 25, 2026

Automatic Metrics Medicine

To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories.
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser · Feb 24, 2026

Automatic Metrics Medicine Coding

Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints.
OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu · Feb 24, 2026

Automatic Metrics Medicine

Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities.
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao · Feb 23, 2026

Automatic Metrics Medicine

Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts.
When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue · Feb 23, 2026

Automatic Metrics General

Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following.
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026

Automatic Metrics Coding

We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao · Feb 22, 2026

Automatic Metrics General

Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI.
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026

Automatic Metrics General

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026

Automatic Metrics Coding

16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs.
Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang · Jan 20, 2026

Automatic Metrics General

While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints.
KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh · Dec 9, 2025

Automatic Metrics Medicine Coding

Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management…
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025

Automatic Metrics Simulation Env General

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian · Sep 28, 2025

Automatic Metrics General

Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding.
MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification
Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn · Sep 9, 2025

Automatic Metrics Medicine Coding

Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance.
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025

Automatic Metrics Simulation Env Coding

Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.
