Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0

Abstract

Vision-Language Models (VLMs) excel at visual description yet remain under-validated for cultural interpretation. Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks. We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2). The framework operationalises cultural understanding through five levels (L1–L5) and 165 culture-specific dimensions across six traditions: Tier I computes automated quality indicators, Tier II applies rubric-based single-judge scoring, and Tier III calibrates the aggregate score to human expert ratings via sigmoid calibration. Applied to 15 VLMs across 294 evaluation pairs, the validated instrument reveals that (i) automated metrics and judge scoring measure different constructs, establishing single-judge calibration as the more reliable alternative; (ii) cultural understanding degrades from visual description (L1–L2) to cultural interpretation (L3–L5); and (iii) Western art samples consistently receive higher scores than non-Western ones. To our knowledge, this is the first cross-cultural evaluation instrument for generative art critique, providing a reproducible methodology for auditing VLM cultural competence. Framework code is available at https://github.com/yha9806/VULCA-Framework.
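
The abstract names sigmoid calibration as the Tier III step but gives neither its functional form nor the fitting procedure. The sketch below is therefore only an illustration under stated assumptions, not the paper's implementation: it assumes aggregate scores and expert ratings are both normalised to [0, 1], and it fits a two-parameter mapping sigmoid(a * score + b) by plain gradient descent on mean squared error. The function names, hyperparameters, and sample values are hypothetical.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def mean_squared_error(scores, ratings, a, b):
    """MSE between calibrated scores and expert ratings."""
    return sum((sigmoid(a * s + b) - r) ** 2
               for s, r in zip(scores, ratings)) / len(scores)


def fit_sigmoid_calibration(scores, ratings, lr=0.5, steps=5000):
    """Fit (a, b) so that sigmoid(a * score + b) approximates the ratings.

    Assumption: scores and ratings both lie in [0, 1]. The optimiser is
    simple batch gradient descent on MSE, chosen for clarity only.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, r in zip(scores, ratings):
            p = sigmoid(a * s + b)
            # d/dz of (p - r)^2 is 2 * (p - r) * p * (1 - p)
            common = 2.0 * (p - r) * p * (1.0 - p) / n
            grad_a += common * s
            grad_b += common
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def calibrate(score, a, b):
    """Map a raw aggregate score to the calibrated scale."""
    return sigmoid(a * score + b)


# Made-up illustration data: aggregate judge scores and expert ratings.
example_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
example_ratings = [0.20, 0.35, 0.50, 0.70, 0.85]

a_fit, b_fit = fit_sigmoid_calibration(example_scores, example_ratings)
calibrated = [calibrate(s, a_fit, b_fit) for s in example_scores]
```

A monotone sigmoid keeps the ranking produced by the judge while rescaling it toward the expert ratings, which is one plausible reason to prefer it over an unconstrained regressor when few human-rated pairs are available.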

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Rubric Rating, Critique Edit
  • Rater population: Domain Experts
  • Unit of annotation: Multi Dim Rubric
  • Expertise required: Coding

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Calibration
  • Confidence: 0.80
  • Flags: None

Research Summary

Contribution Summary

  • Vision-Language Models (VLMs) excel at visual description yet remain under-validated for cultural interpretation.
  • Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
  • We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).
