
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu, Jinlin Wang, Sirui Hong, Bang Liu, Chenglin Wu, Yuyu Luo · Oct 25, 2025 · Citations: 0

Abstract

Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.553 and a correlation with human ratings of only 0.428. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.421 (a 23.9% reduction) and increasing the consistency with human experts to 0.687 (a 60.5% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.
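
The abstract's headline metrics are straightforward to reproduce. Below is a minimal sketch, assuming model and expert scores are compared with Mean Absolute Error and a rank correlation (the abstract does not name the correlation coefficient, so Spearman is an assumption here); it also sanity-checks the quoted relative changes.

```python
import numpy as np
from scipy.stats import spearmanr

def mae(model_scores, expert_scores):
    """Mean Absolute Error between model scores and expert ratings."""
    return float(np.mean(np.abs(np.asarray(model_scores) - np.asarray(expert_scores))))

def rank_correlation(model_scores, expert_scores):
    """Consistency with human experts; Spearman is assumed, not confirmed by the paper."""
    rho, _ = spearmanr(model_scores, expert_scores)
    return float(rho)

# Sanity-check the relative changes quoted in the abstract:
gpt5_mae, visjudge_mae = 0.553, 0.421
gpt5_corr, visjudge_corr = 0.428, 0.687
print(f"MAE reduction:        {(gpt5_mae - visjudge_mae) / gpt5_mae:.1%}")     # 23.9%
print(f"Correlation increase: {(visjudge_corr - gpt5_corr) / gpt5_corr:.1%}")  # 60.5%
```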

HFEPX Relevance Assessment

This paper appears adjacent to HFEPX scope (human feedback / evaluation), but its metadata and abstract do not show strong direct protocol evidence.

Eval-Fit Score

5/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

Adjacent candidate

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Domain Experts
  • Unit of annotation: Unknown
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.45
  • Flags: low_signal, possible_false_positive

Protocol And Measurement Signals

Benchmarks / Datasets

VisJudge-Bench, VisJudgeBench

Reported Metrics

accuracy, MAE

Research Brief

Deterministic synthesis

To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. HFEPX signals include Automatic Metrics with confidence 0.45. Updated from the current HFEPX corpus.

Generated Mar 5, 2026, 4:01 AM · Grounded in abstract + metadata only

Key Takeaways

  • To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality.
  • To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment.

Researcher Actions

  • Treat this as method context, then pivot to protocol-specific HFEPX hubs.
  • Cross-check benchmark overlap: the extracted anchors VisJudge-Bench and VisJudgeBench are likely name variants of the same benchmark (see the normalization sketch below).
  • Validate metric comparability (accuracy, MAE).
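
A minimal sketch of the overlap check above, assuming nothing about the HFEPX tooling itself: the two extracted anchors differ only in casing and punctuation, so a simple normalization pass reveals they refer to a single benchmark. The function name and approach are illustrative, not part of any existing pipeline.

```python
import re

def normalize_anchor(name: str) -> str:
    """Lowercase and drop non-alphanumerics so name variants collide."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# The two anchors extracted for this paper:
anchors = ["VisJudge-Bench", "VisJudgeBench"]

canonical = {normalize_anchor(a) for a in anchors}
print(canonical)  # {'visjudgebench'} -> one benchmark, not two
```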

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Low-signal flag detected: protocol relevance may be indirect.

Research Summary

Contribution Summary

  • To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality.
  • To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment.
  • Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.421 (a 23.9% reduction) and increasing the consistency with human experts to 0.687 (a 60.5% improvement) compared to GPT-5.

Why It Matters For Eval

  • VisJudge-Bench quantifies the gap between MLLM judgments and expert ratings (MAE, correlation with human scores), which is the core signal for deciding whether a model can stand in for human raters.
  • VisJudge shows that a purpose-built judge model can substantially close that gap, a useful reference point for evaluator-model work.

Researcher Checklist

  • Gap: Human feedback protocol is explicit

    No explicit human feedback protocol detected.

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting is present

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: VisJudge-Bench, VisJudgeBench

  • Pass: Metric reporting is present

    Detected: accuracy, MAE
