PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown · Oct 21, 2025 · Citations: 0

How to use this paper page

Coverage: Stale

Use this page to decide whether the paper is strong enough to influence an eval design. It summarizes the abstract plus available structured metadata. If the signal is thin, use it as background context and compare it against stronger hub pages before making protocol choices.

Best use

Primary protocol reference for eval design

Metadata: Stale

Trust level

High

Signals: Stale

What still needs checking

No major weakness surfaced.

Signal confidence: 0.80

Abstract

While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
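
To make the scoring mechanism concrete, the sketch below shows one way a scene graph could serve as a structured rubric for an LLM judge: each (subject, relation, object) fact becomes a yes/no check, and the aggregate score is the fraction of checks the judge passes. This is a hypothetical illustration based on the abstract, not the paper's implementation; the data shapes, prompt wording, and the `ask_llm` callable are assumptions.

```python
# Hypothetical sketch of scene-graph-guided LLM-as-a-Judge scoring.
# Not the paper's implementation; structures and prompts are illustrative.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SceneGraphFact:
    subject: str    # e.g. "woman"
    relation: str   # e.g. "holding"
    obj: str        # e.g. "parasol"


def build_rubric(scene_graph: List[SceneGraphFact]) -> List[str]:
    """Turn each (subject, relation, object) triple into a yes/no rubric item."""
    return [
        f"Does the description correctly convey that the {f.subject} "
        f"is {f.relation} the {f.obj}?"
        for f in scene_graph
    ]


def judge_description(
    description: str,
    scene_graph: List[SceneGraphFact],
    ask_llm: Callable[[str], str],
) -> float:
    """Score a description as the fraction of rubric items the judge accepts.

    `ask_llm` is any callable that takes a prompt and returns "yes" or "no".
    Keeping per-item verdicts also localizes errors to specific facts.
    """
    rubric = build_rubric(scene_graph)
    verdicts = [
        ask_llm(
            f"Description:\n{description}\n\n"
            f"Question: {question}\nAnswer yes or no."
        ).strip().lower().startswith("y")
        for question in rubric
    ]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```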

HFEPX Relevance Assessment

This paper has strong direct human-feedback and evaluation protocol signal and is suitable as a primary eval pipeline reference.

Best use

Primary protocol reference for eval design

Use if you need

A benchmark-and-metrics comparison anchor.

Main weakness

No major weakness surfaced.

Trust level

High

Eval-Fit Score

79/100 • High

Use this as a primary source when designing or comparing eval protocols.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Extraction confidence: High

What This Page Found In The Paper

Each field below shows whether the signal looked explicit, partial, or missing in the available metadata. Use this to judge what is safe to trust directly and what still needs full-paper validation.

Human Feedback Types

strong

Rubric Rating

Confidence: High · Direct evidence

Directly usable for protocol triage.

Evidence snippet: In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding).

Evaluation Modes

strong

Human Eval, LLM-as-a-Judge

Confidence: High · Direct evidence

Includes extracted eval setup.

Evidence snippet: PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge).

Quality Controls

missing

Not reported

Confidence: Low · Not found

No explicit QC controls found.

Evidence snippet: none; the abstract does not describe calibration, adjudication, or agreement controls.

Benchmarks / Datasets

strong

CapArena

Confidence: High · Direct evidence

Useful for quick benchmark comparison.

Evidence snippet: We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning.

Reported Metrics

strong

Spearman

Confidence: High · Direct evidence

Useful for evaluation criteria comparison.

Evidence snippet: We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning.
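
For context, the reported Spearman statistic is the rank correlation between a metric's scores and human ratings over the same set of descriptions; a +0.05 gain means the metric orders descriptions more like human raters do. Below is a minimal sketch of that meta-evaluation with made-up numbers, using `scipy.stats.spearmanr`.

```python
# Minimal sketch of metric-vs-human meta-evaluation via Spearman correlation.
# The scores below are illustrative, not values from the paper.
from scipy.stats import spearmanr

metric_scores = [0.82, 0.45, 0.67, 0.91, 0.30, 0.74]   # one score per description
human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]          # matching human judgments

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```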

Rater Population

strong

Domain Experts

Confidence: High · Direct evidence

Helpful for staffing comparability.

Evidence snippet: This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students.

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Rubric Rating
  • Rater population: Domain Experts
  • Unit of annotation: Multi-dimensional rubric
  • Expertise required: General
  • Signal basis: Structured extraction plus abstract evidence.

Evaluation Lens

  • Evaluation modes: Human Eval, LLM-as-a-Judge
  • Agentic eval: None
  • Quality controls: Not reported
  • Signal confidence: 0.80
  • Known cautions: None surfaced in extraction.

Protocol And Measurement Signals

Benchmarks / Datasets

CapArena

Reported Metrics

Spearman

Research Brief

Metadata summary

While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge.

Based on abstract + metadata only. Check the source paper before making high-confidence protocol decisions.

Key Takeaways

  • While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge.
  • Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification.
  • In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans.

Researcher Actions

  • Compare this paper against nearby papers in the same arXiv category before using it for protocol decisions.
  • Validate inferred eval signals (Automatic metrics) against the full paper.
  • Use related-paper links to find stronger protocol-specific references.

Caveats

  • Generated from abstract + metadata only; no PDF parsing.
  • Signals below are heuristic and may miss details reported outside the abstract.

Research Summary

Contribution Summary

  • In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding).
  • To validate PoSh, we introduce a challenging new dataset, DOCENT.
  • We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning.

Why It Matters For Eval

  • In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding).
  • We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning.
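
The abstract also reports that PoSh works as a reward function that outperforms standard supervised fine-tuning. One simple, hypothetical way a description metric can act as a reward is best-of-N (rejection) sampling, sketched below; this is an illustrative pattern, not the paper's training recipe, and `generate` and `score` are assumed stand-ins for a VLM call and a PoSh-style scorer.

```python
# Hypothetical best-of-N selection using a description metric as the reward.
# `generate` and `score` are assumed wrappers, not APIs from the paper.
from typing import Callable


def best_of_n(
    image: object,
    generate: Callable[[object], str],
    score: Callable[[object, str], float],
    n: int = 8,
) -> str:
    """Sample n candidate descriptions and keep the highest-reward one.

    The selected outputs could then serve as fine-tuning targets instead of
    (or alongside) standard supervised fine-tuning data.
    """
    candidates = [generate(image) for _ in range(n)]
    return max(candidates, key=lambda d: score(image, d))
```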

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Rubric Rating

  • Pass: Evaluation mode is explicit

    Detected: Human Eval, LLM-as-a-Judge

  • Gap: Quality control reporting appears to be missing

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: CapArena

  • Pass: Metric reporting is present

    Detected: Spearman

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
