OpenTrain AI
Maintained implementation available

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

February 1, 2024 | arXiv: 2402.04615
4 repos | 381 stars | ~a few hours to reproduce
arXiv PDF

Abstract

Results & Benchmarks

Task             Dataset   Metric       Value
Computer vision  Baseline  App Unseen.  67.6
Computer vision  ScreenAI  App Unseen.  87.8

Best Implementation

Implementation of the ScreenAI model from the paper: "A Vision-Language Model for UI and Infographics Understanding"

Stars: 381 | Forks: 37 | Last push: Mar 2026 | License: MIT
Signals: License, CI, Deps, Docker
  • Selected kyegomez/ScreenAI as the strongest maintained implementation for new work.
  • Includes CI workflow signals.
  • Includes dependency/environment manifest signals.
  • Repository activity is within the last 24 months.

Reproduction Path

  1. Start with kyegomez/ScreenAI and validate setup instructions in README.

  2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.

  3. Log exact dependency versions and runtime environment for reproducibility.
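Step 3 can be approached with a small script that snapshots the interpreter, platform, and installed package versions. The following is a minimal sketch using only the Python standard library; the function name `capture_environment` and the package names in the usage note are hypothetical and are not taken from the kyegomez/ScreenAI repository.

```python
import importlib.metadata as md
import json
import platform


def capture_environment(packages):
    """Return a JSON-serializable snapshot of the runtime environment.

    `packages` is an iterable of distribution names to record; the caller
    chooses them (this helper is illustrative, not part of any repository).
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(versions.items())),
    }


def save_environment(path, packages):
    """Write the snapshot to `path` as indented JSON and return it."""
    snapshot = capture_environment(packages)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2)
    return snapshot
```

For example, `save_environment("env_snapshot.json", ["torch", "numpy"])` would store the snapshot next to the run outputs, assuming those are the packages the chosen implementation depends on; a package that is not installed is recorded as `null`.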

Time to first repro: a few hours

No repository-level red flags were detected, but paper-specific preprocessing and hyperparameter details may still be under-specified.

Additional Implementations

Official

  • The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format and describe the UI elements present on the screen: their type, location, OCR text, and a short description. It was introduced in the paper `ScreenAI: A Vision-Language Model for UI and Infographics Understanding`.

    Stars: 88 | Forks: 10 | Last push: Mar 2024
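    The four fields named in the dataset description (type, location, OCR text, short description) could be held in memory with a record like the one below. This is a hypothetical sketch only: the class name, field names, and the `(x_min, y_min, x_max, y_max)` bounding-box convention are assumptions, and the dataset's actual text serialization is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    """One annotated UI element (hypothetical in-memory representation)."""

    element_type: str                    # e.g. "BUTTON"; label set is assumed
    bbox: Tuple[int, int, int, int]      # assumed (x_min, y_min, x_max, y_max)
    ocr_text: Optional[str] = None       # text read from the element, if any
    description: Optional[str] = None    # short description, if any

    def __post_init__(self):
        x_min, y_min, x_max, y_max = self.bbox
        if x_max < x_min or y_max < y_min:
            raise ValueError(f"invalid bounding box: {self.bbox}")

    @property
    def width(self) -> int:
        return self.bbox[2] - self.bbox[0]

    @property
    def height(self) -> int:
        return self.bbox[3] - self.bbox[1]
```

    A record such as `UIElement("BUTTON", (10, 20, 110, 60), ocr_text="OK")` then exposes `width` and `height` for simple sanity checks on parsed annotations.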

Community

  • ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (IJCAI-24)

    Stars: 585 | Forks: 64 | Last push: Nov 2024 | License: NOASSERTION

Hugging Face Artifacts

No trustworthy Hugging Face artifacts, either directly linked or curated as related, have been found yet.