RealCQA-V2: A Diagnostic Benchmark for Structured Visual Entailment over Scientific Charts
Saleem Ahmed, Srirangaraj Setlur, Venu Govindaraju · Oct 29, 2024 · Citations: 0
Abstract
Multimodal reasoning models often produce fluent answers supported by seemingly coherent rationales. Existing benchmarks evaluate only final-answer correctness. They do not support atomic visual entailment verification of intermediate steps, especially visual compositional logic. This limitation is especially acute in scientific chart understanding, where answers depend on deterministically grounded visual semantics such as axes, legends, and quantitative relations. We introduce RealCQA-V2, a large-scale benchmark that reformulates chart question answering as Visual Premise Proving (VPP): a structured logical entailment task over chart-grounded visual predicates. Each question is deconstructed into manually curated, atomic premises grounded in chart elements (axes, legends, marks, and quantitative relations), yielding executable reasoning chains rather than free-form textual rationales. These premises form compositional reasoning chains, enabling verification at the level of individual visual statements and complete reasoning sequences. We introduce chain-level metrics that measure both full logical validity (AccVPP) and partial reasoning progress within failed chains (DCP), extending beyond traditional VQA accuracy. Baseline evaluations across representative LVLMs reveal a consistent local-global reasoning gap: models often verify many individual premises correctly while failing to preserve coherence across the full chain. RealCQA-V2 establishes a reproducible benchmark for structured visual entailment over real scientific charts and enables rigorous diagnosis of multimodal reasoning beyond answer-only evaluation.
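The chain-level metrics and the local-global gap described above can be illustrated with a small sketch. The abstract does not give formulas for AccVPP or DCP, so the definitions below are assumptions made for illustration only: AccVPP is taken as the fraction of chains in which every premise is verified correctly, and DCP as the mean fraction of correctly verified premises within chains that failed. The function names and the boolean-list representation of a chain are hypothetical and are not taken from the paper.

```python
from typing import List

# A chain is a list of booleans: True means the model verified that premise
# correctly. This representation is an assumption made for illustration.
Chain = List[bool]


def premise_accuracy(chains: List[Chain]) -> float:
    """Local view: fraction of individual premises verified correctly."""
    total = sum(len(chain) for chain in chains)
    if total == 0:
        return 0.0
    return sum(sum(chain) for chain in chains) / total


def acc_vpp(chains: List[Chain]) -> float:
    """Global view (assumed definition): fraction of fully correct chains."""
    if not chains:
        return 0.0
    return sum(all(chain) for chain in chains) / len(chains)


def dcp(chains: List[Chain]) -> float:
    """Partial progress (assumed definition): mean fraction of correct
    premises, computed over failed chains only."""
    failed = [chain for chain in chains if not all(chain)]
    if not failed:
        return 0.0
    return sum(sum(chain) / len(chain) for chain in failed) / len(failed)


# Toy data: most premises are right, yet only one chain is fully valid.
example_chains = [
    [True, True, True],
    [True, True, False],
    [True, False, True, True],
]

print(f"premise accuracy: {premise_accuracy(example_chains):.3f}")
print(f"AccVPP (assumed): {acc_vpp(example_chains):.3f}")
print(f"DCP (assumed):    {dcp(example_chains):.3f}")
```

On this toy data, 8 of 10 premises are verified correctly while only 1 of 3 chains is fully valid. That is the shape of the local-global gap the abstract reports, although the numbers here are invented and say nothing about the paper's actual results.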