Skip to content
← Back to explorer

The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task

Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026 · Citations: 0

Abstract

The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: General

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Web Browsing
  • Quality controls: Not reported
  • Confidence: 0.45
  • Flags: ambiguous

Research Summary

Contribution Summary

  • The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims.
  • Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers.
  • System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold.

Why It Matters For Eval

  • The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.
  • This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.

Related Papers