Renaissance: Investigating the Pretraining of Vision-Language Encoders

Clayton Fields, Casey Kennington · Nov 11, 2024 · Citations: 0

Abstract

In the past several years there has been an explosion of available models for vision-language (VL) tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. Additionally, the limited programming tools available for modeling make conducting VL research more difficult than necessary. In this paper, we seek to answer several questions related to the pretraining of VL encoders through meta-analysis. To conduct these experiments, we introduce a VL evaluation framework called Renaissance. In our first set of experiments, we show that freezing large parts of VL models during pretraining saves significant compute at little to no cost to downstream performance. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Renaissance offers a great deal of flexibility in creating, training, and evaluating transformer encoders for VL modeling. Its source code is available at https://github.com/bsu-slim/renaissance.
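
The freezing idea from the first set of experiments can be illustrated with a minimal PyTorch sketch. This is not the Renaissance API or the authors' training code: the TinyVLEncoder class, its layer sizes, the choice of which submodules to freeze, and the placeholder loss are all assumptions made for illustration. It only shows the general mechanism of excluding parts of a model from gradient updates.

```python
import torch
from torch import nn


class TinyVLEncoder(nn.Module):
    """Hypothetical toy encoder (an assumption, not the Renaissance architecture)."""

    def __init__(self, dim: int = 32):
        super().__init__()
        # Stand-in "vision" module: projects flattened image patches to dim.
        self.vision = nn.Sequential(nn.Linear(48, dim), nn.GELU(), nn.Linear(dim, dim))
        # Stand-in "text" module: a token embedding table.
        self.text = nn.Embedding(100, dim)
        # Stand-in fusion module: one transformer encoder layer over both modalities.
        self.fusion = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=64, batch_first=True
        )

    def forward(self, patches: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat([self.vision(patches), self.text(token_ids)], dim=1)
        return self.fusion(tokens)


def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


model = TinyVLEncoder()

# Freeze the vision and text modules: their parameters receive no gradients.
model.vision.requires_grad_(False)
model.text.requires_grad_(False)

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

patches = torch.randn(2, 5, 48)             # batch of 2, 5 patches each
token_ids = torch.randint(0, 100, (2, 7))   # batch of 2, 7 tokens each

output = model(patches, token_ids)
loss = output.pow(2).mean()  # placeholder loss, not a real pretraining objective
loss.backward()
optimizer.step()

trainable, total = count_params(model)
print(f"trainable parameters: {trainable} of {total}")
```

In this sketch the frozen modules still run in the forward pass, but no gradients are stored for their parameters and the optimizer keeps no state for them, which is where savings of this kind would come from. Which parts can be frozen without hurting downstream performance is the empirical question the paper investigates.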

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Coding

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.35
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • Introduces Renaissance, a framework for creating, training, and evaluating transformer encoders for vision-language (VL) modeling.
  • Shows that freezing large parts of a VL model during pretraining saves significant compute at little to no cost to downstream performance.
  • Examines the effect of basing a VL transformer on a vision model versus a text model.

Why It Matters For Eval

  • The paper introduces Renaissance, a VL evaluation framework, and uses it to run its pretraining experiments.
