OpenTrain AI
Maintained implementation available | PyTorch | Pretrained models available

LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation

Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang +9 more

November 7, 2024 | arXiv: 2411.04997
1 repo | 644 stars | ~a few days to reproduce

Abstract

CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the superior linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly in handling long and complex captions. We introduce an efficie...
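To make the contrastive objective mentioned in the abstract concrete, here is a minimal sketch of the generic CLIP-style symmetric loss over a batch of paired image and text embeddings. It is written in plain Python for readability and is not LLM2CLIP's training code. The function names and the temperature value of 0.07 are illustrative assumptions, and a real run would use PyTorch tensors on a GPU.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    temperature=0.07 is an assumed illustrative value, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: pair i of images matches pair i of captions.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
swapped = clip_contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
print(f"matched pairs loss: {matched:.4f}, swapped pairs loss: {swapped:.4f}")
```

The loss is low when each image is most similar to its own caption and high when the captions are swapped, which is the signal that pulls matching pairs together in the shared space.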

Results & Benchmarks

Task                  Dataset     Metric  Value
Retrieval / indexing  COCO        AP      12.9
Retrieval / indexing  PASCAL VOC  AP      11.5

Hardware Requirements

  • Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Best Implementation

LLM2CLIP significantly improves already state-of-the-art CLIP models.

644 29 Feb 2026 MIT
  • Selected microsoft/LLM2CLIP as the strongest maintained implementation for new work.
  • Includes CI workflow signals.
  • Repository activity is within the last 24 months.

Reproduction Path

  1. Start with microsoft/LLM2CLIP and validate setup instructions in README.

  2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.

  3. Log exact dependency versions and runtime environment for reproducibility.
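The logging step can be sketched with the Python standard library alone. The snippet below records the interpreter version, platform, and installed versions of a few packages. The package list (torch, transformers, numpy) is an assumption chosen for illustration, since this page does not list the repository's actual dependencies. The helper name capture_environment is likewise hypothetical.

```python
import json
import platform
import sys
from importlib import metadata

def capture_environment(packages):
    """Return a JSON-serializable snapshot of the runtime environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }

# Assumed package list; replace it with whatever the repository actually imports.
snapshot = capture_environment(["torch", "transformers", "numpy"])
print(json.dumps(snapshot, indent=2))
```

Saving this output next to each run's results makes it easier to tell later whether a metric difference comes from a code change or from a different environment.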

Time to first repro: a few days
Dependency manifest is missing

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.