- Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang · Feb 26, 2026 · Citations: 0
Automatic Metrics
With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and …
- OmniGAIA: Towards Native Omni-Modal AI Agents
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong · Feb 26, 2026 · Citations: 0
Automatic Metrics Tool Use
Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world.
- Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie · Feb 26, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
- NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image.
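The NoLan snippet states the problem rather than the method, so purely as orientation, here is a minimal sketch of the contrastive-decoding family that "suppress language priors" work generally builds on; the (1 + alpha)/alpha weighting and the text-only branch are generic conventions for illustration, not NoLan's actual rule.

```python
import numpy as np

def prior_suppressed_logits(logits_with_image, logits_text_only, alpha=1.0):
    # Down-weight tokens the model would emit even without seeing the image:
    # amplify the image-conditioned logits and subtract the text-only prior.
    return (1 + alpha) * logits_with_image - alpha * logits_text_only

with_img = np.array([2.0, 1.0, 0.5])   # index 0 ("dog") favored given the image
text_only = np.array([0.5, 2.5, 0.5])  # index 1 ("table") favored by the prior
print(prior_suppressed_logits(with_img, text_only).argmax())  # 0, not 1
```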
- DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or overlong responses to graph-related queries.
- SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems.
- Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists.
- CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
- Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai · Feb 25, 2026 · Citations: 0
Automatic Metrics
To address these challenges, we construct the ICT-MRI dataset, the first public biopsy-verified benchmark with 249 cases across four categories.
- Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound
Nicholas Dietrich, David McShannon · Feb 25, 2026 · Citations: 0
Automatic Metrics
Conclusion: Spatial-domain adversarial perturbations in ultrasound segmentation showed partial mitigation with input preprocessing, whereas frequency-domain perturbations were not mitigated by the defenses, highlighting modality-specific challenges.
- Causal Decoding for Hallucination-Resistant Multimodal Large Language Models
Shiwei Tan, Hengyi Wang, Weiyi Qin, Qi Xu, Zhigang Hua · Feb 24, 2026 · Citations: 0
Automatic Metrics
Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.
- ECHOSAT: Estimating Canopy Height Over Space And Time
Jan Pauls, Karsten Schrödter, Sven Ligensa, Martin Schwartz, Berkant Turan · Feb 24, 2026 · Citations: 0
Automatic Metrics
Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions.
- FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning
Alina Devkota, Jacob Thrasher, Donald Adjeroh, Binod Bhattarai, Prashnna K. Gyawali · Feb 24, 2026 · Citations: 0
Automatic Metrics
Extensive experiments on both natural and medical image benchmarking datasets, across diverse model architectures, demonstrate that FedVG consistently improves performance, particularly in highly heterogeneous settings.
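The FedVG entry names gradient-guided aggregation without spelling out the rule; the sketch below is one plausible reading, weighting each client update by its cosine agreement with the mean update. The weighting scheme is an assumption for illustration, not the paper's formula.

```python
import numpy as np

def gradient_guided_aggregate(deltas):
    # Weight each client's update by its (clipped) cosine agreement with
    # the mean update, then renormalize -- an assumed rule, for illustration.
    mean = np.mean(deltas, axis=0)
    sims = np.array([max(0.0, float(d @ mean) /
                         (np.linalg.norm(d) * np.linalg.norm(mean) + 1e-12))
                     for d in deltas])
    weights = sims / (sims.sum() + 1e-12)
    return sum(w * d for w, d in zip(weights, deltas))

# Two aligned clients and one outlier: the outlier is down-weighted to ~0.
deltas = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([-1.0, 0.0])]
print(gradient_guided_aggregate(deltas))
```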
- Towards Controllable Video Synthesis of Routine and Rare OR Events
Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova · Feb 24, 2026 · Citations: 0
Automatic Metrics
Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging.
- Towards single-shot coherent imaging via overlap-free ptychography
Oliver Hoidn, Aashwin Mishra, Steven Henke, Albert Vong, Matthew Seaberg · Feb 24, 2026 · Citations: 0
Automatic Metrics
On synthetic benchmarks, reconstructions remain accurate at low counts (~10^4 photons/frame), and overlap-free single-shot reconstruction with an experimental probe reaches an amplitude structural similarity (SSIM) of 0.904, compared with …
- Scaling View Synthesis Transformers
Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann · Feb 24, 2026 · Citations: 0
Automatic Metrics
Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and …
- Test-Time Training with KV Binding Is Secretly Linear Attention
Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li · Feb 24, 2026 · Citations: 0
Automatic Metrics
Test-time training (TTT) with KV binding as a sequence-modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time.
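The title's equivalence can be checked in a few lines: if the "KV binding" objective is taken to be the inner-product loss −⟨Wk, v⟩ (an assumption; the paper's exact loss may differ), one SGD step per token reproduces the linear-attention recurrence exactly.

```python
import numpy as np

d, T = 8, 16
rng = np.random.default_rng(0)
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))  # keys and values
q = rng.normal(size=d)                                   # query that reads memory

# Linear attention: recurrent state S accumulates outer products v k^T.
S = np.zeros((d, d))
for k, v in zip(K, V):
    S += np.outer(v, k)
out_linear = S @ q

# TTT reading: fast weights W trained online on the binding loss -<W k, v>.
# Its gradient is -v k^T, so one SGD step per token (lr = 1) is W += v k^T,
# i.e. exactly the linear-attention recurrence above.
W = np.zeros((d, d))
for k, v in zip(K, V):
    W -= -np.outer(v, k)  # SGD step on the inner-product binding loss
out_ttt = W @ q

assert np.allclose(out_linear, out_ttt)
```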
- Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz · Feb 24, 2026 · Citations: 0
Automatic Metrics
We study efficient multi-vector retrieval for late interaction in any modality.
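For context, "late interaction" conventionally means ColBERT-style MaxSim scoring, sketched below; whether this paper compresses exactly this scorer is not stated in the snippet.

```python
import numpy as np

def late_interaction_score(Q, D):
    # MaxSim: each query vector takes its best-matching document vector;
    # the relevance score is the sum of those maxima.
    sim = Q @ D.T                     # (n_query_vecs, n_doc_vecs)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 32));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D = rng.normal(size=(50, 32)); D /= np.linalg.norm(D, axis=1, keepdims=True)
print(late_interaction_score(Q, D))
```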
- Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidates…
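Stripped of the agent specifics, reflection-in-action as described reduces to a generate-and-score loop; the sketch below is a toy version where `propose` and `score` stand in for the planner and its critic (both names, and the subgoal example, are illustrative).

```python
import random

def reflect_in_action(propose, score, n_candidates=8):
    # Test-time scaling: sample several candidates, keep the best-scored one.
    candidates = [propose() for _ in range(n_candidates)]
    return max(candidates, key=score)

# Toy stand-ins: plans are orderings of subgoals; the critic prefers
# fetching the tool before using it.
subgoals = ["find knife", "pick up bread", "slice bread"]
propose = lambda: random.sample(subgoals, k=len(subgoals))
score = lambda plan: -plan.index("find knife")
print(reflect_in_action(propose, score))
```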
- XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser · Feb 24, 2026 · Citations: 0
Automatic Metrics
Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints.
- NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan · Feb 24, 2026 · Citations: 0
Automatic Metrics
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures.
- Motivation is Something You Need
Mehdi Acheli, Walid Gaaloul · Feb 24, 2026 · Citations: 0
Automatic Metrics
Inspired by the interplay of emotions and cognition in the human brain, and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model…
- VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li · Feb 24, 2026 · Citations: 0
Automatic Metrics
Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating…
- MIP Candy: A Modular PyTorch Framework for Medical Image Processing
Tianhao Fu, Yucheng Chen · Feb 24, 2026 · Citations: 0
Automatic Metrics
MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional workflow by implementing a single method, build_network, while…
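A hedged sketch of the one-method-override pattern the abstract describes; `TrainerBase`, `MyTrainer`, and the `run` flow are hypothetical stand-ins rather than MIPCandy's real API, and only `build_network` comes from the snippet.

```python
import torch
import torch.nn as nn

class TrainerBase:
    def build_network(self) -> nn.Module:
        raise NotImplementedError   # the single method users implement

    def run(self):
        net = self.build_network()
        x = torch.randn(1, 1, 64, 64)         # stand-in for a loaded scan
        print("forward pass:", net(x).shape)  # loading/training/eval elided

class MyTrainer(TrainerBase):
    def build_network(self) -> nn.Module:
        return nn.Conv2d(1, 2, kernel_size=3, padding=1)

MyTrainer().run()
```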
- Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures
Yubin Ge, Yongsong Huang, Xiaofeng Liu · Feb 24, 2026 · Citations: 0
Automatic Metrics
Report-supervised (RSuper) learning seeks to alleviate the need for dense tumor voxel labels with constraints derived from radiology reports (e.g., volumes, counts, sizes, locations).
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa · Feb 24, 2026 · Citations: 0
Automatic Metrics
Our experiments show that our proposed method achieves remarkable results on long-video-to-audio benchmarks, beating prior work on video-to-audio tasks.
- CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie · Feb 24, 2026 · Citations: 0
Automatic Metrics
Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.
- See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee · Feb 24, 2026 · Citations: 0
Automatic Metrics
Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets.
- OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu · Feb 24, 2026 · Citations: 0
Automatic Metrics
Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities.
- PyVision-RL: Forging Open Agentic Vision Models via RL
Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng · Feb 24, 2026 · Citations: 0
Automatic Metrics Tool Use
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior.
- Communication-Inspired Tokenization for Structured Image Representations
Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva · Feb 24, 2026 · Citations: 0
Automatic Metrics
Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences.
- Onboard-Targeted Segmentation of Straylight in Space Camera Sensors
Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill · Feb 24, 2026 · Citations: 0
Automatic Metrics Web Browsing
This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults.
- MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao · Feb 23, 2026 · Citations: 0
Automatic Metrics
Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts.
- When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue · Feb 23, 2026 · Citations: 0
Automatic Metrics
Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following.
- Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo · Feb 23, 2026 · Citations: 0
Automatic Metrics
Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications.
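The title's prune-then-merge idea admits a simple reading, sketched below: drop low-norm patch vectors, then average near-duplicates among the survivors. The norm and cosine criteria here are assumptions for illustration, not the paper's.

```python
import numpy as np

def prune_then_merge(vecs, keep_ratio=0.5, merge_thresh=0.95):
    # Prune: keep the highest-norm half of the patch vectors.
    norms = np.linalg.norm(vecs, axis=1)
    keep = vecs[np.argsort(-norms)[: max(1, int(len(vecs) * keep_ratio))]]
    # Merge: average survivors whose cosine similarity exceeds the threshold.
    merged = []
    for v in keep:
        u = v / np.linalg.norm(v)
        for i, m in enumerate(merged):
            if float(u @ (m / np.linalg.norm(m))) > merge_thresh:
                merged[i] = (m + v) / 2
                break
        else:
            merged.append(v)
    return np.array(merged)

vecs = np.random.default_rng(0).normal(size=(16, 8))
print(prune_then_merge(vecs).shape)   # fewer vectors than we started with
```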
- Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao · Feb 22, 2026 · Citations: 0
Automatic Metrics
Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI.
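The bandit machinery in the title is presumably of the UCB family; below is a plain UCB1 loop over augmentations with a simulated reward, where the arm set and reward proxy are invented for illustration and the paper's actual reward design is not shown in the snippet.

```python
import math, random

augs = ["flip", "crop", "jitter"]          # illustrative arm set
n = {a: 0 for a in augs}                   # pulls per augmentation
mean = {a: 0.0 for a in augs}              # running mean reward

def choose(t):
    for a in augs:                         # play every arm once first
        if n[a] == 0:
            return a
    return max(augs, key=lambda a: mean[a] + math.sqrt(2 * math.log(t) / n[a]))

def update(a, r):
    n[a] += 1
    mean[a] += (r - mean[a]) / n[a]

random.seed(0)
quality = {"flip": 0.6, "crop": 0.4, "jitter": 0.5}   # hidden arm qualities
for t in range(1, 501):
    a = choose(t)
    update(a, float(random.random() < quality[a]))    # simulated reward
print(max(augs, key=lambda a: n[a]))                  # usually "flip"
```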
- VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan-guidance setting, reaching over 90% accuracy on plan-aware VQA.
- Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026 · Citations: 0
Automatic Metrics
Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
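For readers new to the construction: the Busemann function such networks build on has a closed form on the Poincaré ball, b_p(x) = log(‖p − x‖² / (1 − ‖x‖²)) for an ideal point p on the boundary, sketched below (the standard formula, not this paper's layer design).

```python
import numpy as np

def busemann_poincare(x, p):
    # Busemann function of the Poincare ball w.r.t. an ideal point p
    # (||p|| = 1, ||x|| < 1): b_p(x) = log(||p - x||^2 / (1 - ||x||^2)).
    return np.log(np.sum((p - x) ** 2) / (1.0 - np.sum(x ** 2)))

p = np.array([1.0, 0.0])                 # ideal point on the boundary
for r in [0.0, 0.5, 0.9]:
    x = np.array([r, 0.0])               # moving toward p along a geodesic
    print(r, busemann_poincare(x, p))    # decreases toward -inf as x -> p
```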
- Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen · Feb 19, 2026 · Citations: 0
Automatic Metrics
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries.
- Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
Yichen Lu, Siwei Nie, Minlong Lu, Xudong Yang, Xiaobo Zhang · Feb 19, 2026 · Citations: 0
Automatic Metrics
Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning.
- Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Akira Sakai, Yuma Ichikawa · Feb 19, 2026 · Citations: 0
Automatic Metrics
Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck.
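The bottleneck claim is simple arithmetic: per-weight storage is one raw sign bit plus the magnitude budget, so magnitude compression alone can never reach a sub-bit total.

```python
# Per-weight storage = sign bits + magnitude bits. With raw signs fixed at
# 1 bit, no amount of magnitude compression reaches a sub-bit total.
for mag_bits in [4.0, 1.0, 0.25, 0.05]:
    print(f"magnitudes at {mag_bits:>4} b/w -> total {1.0 + mag_bits:.2f} b/w")
```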
- Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling
Divyam Madaan, Sumit Chopra, Kyunghyun Cho · Feb 19, 2026 · Citations: 0
Automatic Metrics
Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference.
- Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
Melkamu Abay Mersha, Jugal Kalita · Feb 18, 2026 · Citations: 0
Automatic Metrics
Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret.
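As background, plain integrated gradients (the base method this paper extends) averages gradients along a straight path from a baseline to the input; a minimal autograd version, with a toy function standing in for the model head:

```python
import torch

def integrated_gradients(f, x, baseline, steps=64):
    # IG_i = (x_i - x'_i) * mean_k dF/dx_i evaluated at x' + (k/steps)(x - x').
    alphas = torch.linspace(1.0 / steps, 1.0, steps)
    grads = []
    for a in alphas:
        z = (baseline + a * (x - baseline)).requires_grad_(True)
        f(z).backward()
        grads.append(z.grad)
    return (x - baseline) * torch.stack(grads).mean(dim=0)

f = lambda z: (z ** 2).sum()                 # toy stand-in for a model head
x, baseline = torch.tensor([1.0, 2.0]), torch.zeros(2)
print(integrated_gradients(f, x, baseline))  # approx [1., 4.]; sums to f(x)-f(x')
```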
- DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain · Feb 17, 2026 · Citations: 0
Automatic Metrics
We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models.
- The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Multi Agent
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information…
- How to Train Your Long-Context Visual Document Model
Austin Veselka · Feb 16, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art performance…
- CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding
Tahir Hussain, Saddam Hussain Khan · Feb 16, 2026 · Citations: 0
Automatic Metrics
The qualitative evaluation noted better extraction, discrimination, and theological precision.
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026 · Citations: 0
Automatic Metrics
…16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2% vs. …
- Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026 · Citations: 0
Automatic Metrics Tool Use
To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
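One natural reading of the stated idea, with the loss form assumed rather than taken from the paper: distill the model's own zoomed-crop predictions (teacher) into a single full-image pass (student).

```python
import torch
import torch.nn.functional as F

def region_to_image_distill_loss(student_logits, teacher_logits, T=2.0):
    # Student: one forward pass on the full image. Teacher: the same model
    # run on the zoomed region crop. Match their softened distributions.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

student = torch.randn(4, 10)   # logits from the single full-image pass
teacher = torch.randn(4, 10)   # logits after agentic zooming (detached)
print(region_to_image_distill_loss(student, teacher))
```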
- Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang · Jan 20, 2026 · Citations: 0
Automatic Metrics
While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency.
- Generating metamers of human scene understanding
Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras · Jan 16, 2026 · Citations: 0
Automatic Metrics
Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene.
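The gist-plus-fixation account lends itself to a toy rendering, assuming Gaussian blur as a stand-in for peripheral pooling (the paper's pooling model is surely richer):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(img, fy, fx, radius=32, sigma=4.0):
    # Low-resolution "gist" everywhere (Gaussian blur), full resolution in
    # a window around the fixation point.
    out = gaussian_filter(img, sigma=sigma)
    ys, xs = slice(max(0, fy - radius), fy + radius), slice(max(0, fx - radius), fx + radius)
    out[ys, xs] = img[ys, xs]   # keep the fixated region sharp
    return out

img = np.random.default_rng(0).random((128, 128))
print(foveate(img, 64, 64).shape)
```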
- FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao · Jan 12, 2026 · Citations: 0
Automatic Metrics
To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry.
- VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0
Critique Edit Automatic Metrics
We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.
- CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng · Jan 5, 2026 · Citations: 0
Automatic Metrics
Motivated by this, we present CogFlow, a novel cognition-inspired three-stage framework that incorporates a knowledge-internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception ⇒ internalization ⇒ …
- KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh · Dec 9, 2025 · Citations: 0
Automatic Metrics
Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management.
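Since the KD-OCT snippet stops before the method, here is the standard Hinton-style distillation loss that clinical KD work typically builds on; the temperature, blend weight, and 4-class toy setup below are arbitrary illustrations, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target KL at temperature T, blended with hard-label cross-entropy.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 4), torch.randn(8, 4)   # e.g. 4 OCT classes
y = torch.randint(0, 4, (8,))
print(kd_loss(s, t, y))
```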
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong · Nov 19, 2025 · Citations: 0
Automatic Metrics
Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches.
- Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems, including reasoning traces, preference data, and instruction prompts…
- BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0
Pairwise Preference Automatic Metrics Simulation Env Long Horizon
Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.