
Vision-Language Models for Vision Tasks: A Survey

Published: 04/03/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This survey reviews Vision-Language Models (VLMs) for visual tasks, addressing challenges of crowd-label dependency in DNNs and training inefficiency. It analyzes network architectures, pre-training objectives, and existing methods, offering insights for future research.

Abstract

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently, which learns rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enables zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of visual language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in the future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Vision-Language Models for Vision Tasks: A Survey

1.2. Authors

  • Jingyi Zhang† (Research Associate and Ph.D. student, School of Computer Science and Engineering, Nanyang Technological University (NTU))

  • Jiaxing Huang† (Research Associate and Ph.D. student, School of Computer Science and Engineering, NTU)

  • Sheng Jin (Research Fellow, Nanyang Technological University (NTU))

  • Shijian Lu* (Associate Professor, School of Computer Science and Engineering, Nanyang Technological University, Singapore)

    († denotes equal contribution, * denotes corresponding author)

1.3. Journal/Conference

This paper was published on arXiv, a preprint server, with a publication date of 2023-04-03. While arXiv hosts preprints before formal peer review, it is a highly influential platform in computer science and related fields, where much cutting-edge research first appears. Its widespread availability allows for broad dissemination and feedback within the research community.

1.4. Publication Year

2023

1.5. Abstract

This survey addresses two major challenges in visual recognition: the heavy reliance on crowd-labelled data for training deep neural networks (DNNs) and the common practice of training a separate DNN for each visual recognition task, which is laborious and time-consuming. To overcome these, Vision-Language Models (VLMs) have emerged, learning rich vision-language correlations from web-scale image-text pairs. A single VLM can then perform zero-shot predictions across various visual recognition tasks without task-specific fine-tuning. The paper systematically reviews VLMs for visual recognition by covering: (1) the historical development of visual recognition paradigms; (2) the foundational elements of VLMs, including network architectures, pre-training objectives, and downstream tasks; (3) the datasets used for VLM pre-training and evaluation; (4) a categorization of existing VLM pre-training, transfer learning, and knowledge distillation methods; (5) a benchmarking, analysis, and discussion of these methods; and (6) an outline of future research challenges and directions.

https://arxiv.org/abs/2304.00685 (Preprint) PDF Link: https://arxiv.org/pdf/2304.00685v2.pdf

2. Executive Summary

2.1. Background & Motivation

The paper addresses fundamental inefficiencies and limitations in traditional visual recognition research.

  1. Data Scarcity and Laborious Annotation: Most deep neural network (DNN) training for visual recognition tasks (like image classification, object detection, and semantic segmentation) heavily relies on large volumes of meticulously crowd-labelled data. Collecting such datasets is a laborious and time-consuming process.

  2. Task-Specific Model Training: Typically, a separate DNN is trained for each distinct visual recognition task. This single-task, single-model paradigm leads to redundancy and inefficiency, especially as the number of visual tasks grows.

    The core problem, therefore, is the laborious and time-consuming visual recognition paradigm that hinders scalability and efficiency. This problem is crucial because visual recognition is a cornerstone of numerous applications, from autonomous driving to robotics. Existing solutions like Deep Learning from Scratch and Supervised/Unsupervised Pre-training, Fine-tuning and Prediction paradigms have improved convergence and data efficiency but still require a fine-tuning stage with task-specific labelled data.

The paper's innovative idea and entry point is the Vision-Language Model Pre-training and Zero-shot Prediction paradigm. Inspired by advances in natural language processing (NLP), this paradigm proposes pre-training a single VLM on web-scale image-text pairs (which are abundantly available) to learn rich vision-language correlation. This pre-trained VLM can then perform zero-shot predictions on various downstream visual tasks, eliminating the need for task-specific fine-tuning. The paradigm is simple to implement yet performs remarkably well, as demonstrated by models like CLIP.

2.2. Main Contributions / Findings

The paper makes three primary contributions:

  1. Systematic Review of VLMs for Visual Recognition: It provides the first comprehensive survey focusing specifically on VLMs for visual recognition tasks (image classification, object detection, semantic segmentation). It offers a holistic view of this emerging field, summarizing and categorizing existing studies.

  2. Up-to-date Progress Benchmarking and Discussion: The survey includes a comprehensive benchmarking of current VLM-based visual recognition methods across multiple public datasets. This provides a clear picture of their performance and facilitates comparison.

  3. Identification of Research Challenges and Future Directions: It highlights several open research challenges and suggests promising future research directions for VLMs in visual recognition, guiding future studies in the field.

    Key conclusions and findings reached by the paper include:

  • VLMs, especially with contrastive objectives like those in CLIP, can learn highly transferable visual models from natural language supervision, achieving superior zero-shot performance on diverse visual recognition tasks.
  • Scaling up both pre-training data and model size generally leads to consistent improvements in VLM performance, though this trend eventually saturates.
  • VLM transfer learning (e.g., prompt tuning, feature adaptation) effectively mitigates domain gaps between pre-trained VLMs and downstream tasks, consistently improving performance. Unsupervised VLM transfer shows particular promise due to its annotation efficiency.
  • VLM knowledge distillation is effective for transferring general VLM knowledge to task-specific models, especially for dense prediction tasks like object detection and semantic segmentation, allowing the benefits of VLM generalization while leveraging specialized architectures.
  • The field still faces challenges such as the need for fine-grained vision-language correlation modelling, unification of vision and language learning, multilingual VLM pre-training, data-efficient VLMs, and more advanced integration with Large Language Models (LLMs).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following foundational concepts:

  • Deep Neural Networks (DNNs): These are a class of artificial neural networks with multiple layers (hence "deep") between the input and output layers. They learn representations of data with multiple levels of abstraction. When fed with data, they learn to identify patterns and make predictions or classifications.
  • Visual Recognition: This is a broad field in computer vision concerned with enabling computers to identify and understand objects, scenes, and activities in images and videos. Key tasks include:
    • Image Classification: Assigning a single category label to an entire image (e.g., "cat," "dog," "car").
    • Object Detection: Identifying and localizing multiple objects within an image by drawing bounding boxes around them and assigning a category label to each (e.g., "dog" at coordinates X, Y, W, H).
    • Semantic Segmentation: Assigning a category label to each pixel in an image, effectively outlining objects and regions with pixel-level precision (e.g., marking all pixels belonging to the "sky," "road," or "car").
  • Pre-training, Fine-tuning, and Zero-shot Prediction: These are paradigms for training and deploying machine learning models:
    • Pre-training: An initial training phase where a model is trained on a very large dataset (often general-purpose and unlabelled or weakly labelled) to learn broad, transferable features.
    • Fine-tuning: After pre-training, the model's weights are further adjusted (fine-tuned) on a smaller, task-specific dataset with labelled data. This adapts the pre-trained general knowledge to a specific downstream task.
    • Zero-shot Prediction: The ability of a model to make predictions for categories or tasks it has never explicitly seen during training. In the context of VLMs, this means classifying an image into a category just by reading its textual description, without any labelled examples of that category in the training data.
  • Vision-Language Models (VLMs): These are neural networks designed to process and understand information from both visual (images/videos) and textual (natural language) modalities. They learn to find correlations and alignments between images and text, enabling them to perform tasks that involve both vision and language.
  • Transformers: An attention-based neural network architecture introduced in 2017, initially for natural language processing. Unlike recurrent neural networks, Transformers process entire sequences at once, making them highly efficient for parallel computation. Their core mechanism, self-attention, allows them to weigh the importance of different parts of the input sequence relative to each other, capturing long-range dependencies effectively. Transformers have since been adapted for computer vision (Vision Transformers, ViT) and multi-modal tasks.
  • Contrastive Learning: A self-supervised learning paradigm where a model learns by contrasting different samples. It aims to pull "positive pairs" (e.g., different augmented views of the same image, or an image and its matching text) closer together in the embedding space while pushing "negative pairs" (e.g., an image and a non-matching text, or different images) farther apart. This encourages the model to learn discriminative features.
  • Generative Models: A class of models that learn the underlying distribution of data and can then generate new data samples that resemble the training data. In VLMs, this might involve generating text descriptions from images or even images from text.
  • Transfer Learning: A machine learning technique where a model trained on one task (the source task) is reused as the starting point for a model on a second task (the target task). It is particularly useful when the target task has limited labelled data, as the pre-trained model provides a strong set of initial features.
  • Knowledge Distillation: A technique where a smaller, simpler model (the "student") is trained to mimic the behavior of a larger, more complex model (the "teacher"). The teacher's "knowledge" (e.g., its predictions, intermediate features) is transferred to the student, allowing the student to achieve performance comparable to the teacher with lower computational cost.
  • Large Language Models (LLMs): Very large neural networks trained on massive amounts of text data, capable of understanding, generating, and processing human language with remarkable fluency and coherence (e.g., GPT-3, GPT-4). They are increasingly used to enhance VLMs by providing rich linguistic knowledge.

3.2. Previous Works

The paper contextualizes VLMs within the broader evolution of visual recognition paradigms:

  1. Traditional Machine Learning and Prediction: Before deep learning, visual recognition relied on feature engineering by domain experts (e.g., SIFT [38], HoG features) and lightweight learning models like Support Vector Machines (SVM) [8] or k-Nearest Neighbors (k-NN) [7]. This approach lacked scalability and required specialized feature design for each task.
  2. Deep Learning from Scratch and Prediction: The advent of deep learning, exemplified by AlexNet [4] (a Convolutional Neural Network (CNN)), VGG [5], and ResNet [6], enabled end-to-end learning of features directly from data. This paradigm achieved unprecedented performance on benchmarks like ImageNet [40]. However, it introduced challenges: slow convergence and the laborious collection of large-scale, task-specific, and crowd-labelled data.
  3. Supervised Pre-training, Fine-tuning and Prediction: To address slow convergence and data needs, researchers discovered that features learned from large-scale labelled datasets (like ImageNet) could be transferred to downstream tasks [11]. A DNN is pre-trained on a large labelled dataset (e.g., ImageNet) with a supervised loss, and then fine-tuned on smaller, task-specific labelled data. This accelerates training and improves performance with limited downstream data.
  4. Unsupervised Pre-training, Fine-tuning and Prediction: To further mitigate the reliance on labelled data for pre-training, self-supervised learning emerged [12], [13]. This paradigm learns useful and transferable representations from unlabelled data. Examples include Masked Image Modelling (MIM) [41] (reconstructing masked parts of an image) and contrastive learning [12] (learning discriminative features by contrasting data samples). Models pre-trained this way are then fine-tuned on labelled task-specific data.
  5. Vision-language Model Pre-training and Zero-shot Prediction: This is the latest paradigm, inspired by Large Language Models (LLMs) in NLP [14], [15], [16]. It involves:
    • VLM Pre-training: A model is trained on large-scale image-text pairs (abundant on the internet) using vision-language objectives [10], [18], [19] to learn image-text correspondences.
    • Zero-shot Prediction: The pre-trained VLM is directly applied to downstream visual recognition tasks without fine-tuning, by matching image embeddings with text embeddings (e.g., category names or descriptions). The pioneer in this paradigm is CLIP [10], which uses an image-text contrastive objective to pull paired images and texts closer in an embedding space.

3.3. Technological Evolution

The evolution of visual recognition paradigms has moved from:

  • Hand-crafted features and simple classifiers (Traditional ML): Focused on human expertise in feature design.

  • End-to-end learning with deep CNNs (Deep Learning from Scratch): Shifted focus to network architecture, but required massive labelled data for each task.

  • Transfer of supervised pre-trained features (Supervised Pre-training): Recognized the value of general features, but still needed labelled pre-training data.

  • Learning from unlabelled data (Unsupervised Pre-training): Reduced annotation burden for pre-training, but still required fine-tuning.

  • Learning from web-scale image-text data for zero-shot generalization (VLM Pre-training): Leverages freely available multimodal data, enabling direct application to new tasks without fine-tuning, dramatically increasing efficiency and scalability.

    This paper's work fits at the cutting edge of this evolution, surveying the rapidly growing field of VLMs, particularly their application to visual recognition tasks, and extending to how these powerful pre-trained models are adapted (transfer learning) or have their knowledge extracted (knowledge distillation) for specific downstream challenges.

3.4. Differentiation Analysis

The paper distinguishes itself from existing VLM surveys by focusing specifically on VLMs for visual recognition tasks rather than broader vision-language tasks (e.g., visual question answering, natural language for visual reasoning, phrase grounding). Other related surveys (e.g., [50], [51], [52], [53], [54]) have covered:

  • Advances on vision-language tasks generally, including VLM pre-training for various task-specific methods [50].

  • VLM pre-training for vision-language tasks [51], [52].

  • Recent progress of multi-modal learning on multi-modal tasks [53], [54].

    In contrast, this survey provides a unique contribution by focusing on three major aspects for visual recognition tasks:

  1. Recent progress of VLM pre-training for visual recognition tasks: Specifically how VLMs are trained to understand images for classification, detection, and segmentation.

  2. Two typical transfer approaches from VLMs to visual recognition tasks: How pre-trained VLMs are adapted to new visual recognition scenarios with minimal effort.

  3. Benchmarking of VLM pre-training methods on visual recognition tasks: A comparative analysis of their performance.

    This specialized focus provides a comprehensive and up-to-date resource for researchers interested in leveraging VLMs for computer vision's core recognition challenges.

4. Methodology

4.1. Principles

The core principle behind Vision-Language Models (VLMs) for visual recognition is to learn a rich correlation between visual information (images) and linguistic information (text) from vast quantities of readily available web-scale image-text pairs. By encoding images and texts into a shared embedding space, a VLM can understand the semantic relationship between them. This allows the model to perform zero-shot predictions on various visual recognition tasks: for a given image, it can classify, detect, or segment by comparing the image's embedding with the embeddings of candidate text descriptions (e.g., class names, object labels, segment descriptions). The theoretical intuition is that natural language provides a flexible and scalable form of supervision, enabling VLMs to learn generalizable visual concepts that transcend predefined categories.

4.2. Core Methodology In-depth (Layer by Layer)

VLM pre-training involves a deep neural network that extracts features from images and texts, and then learns vision-language correlation using specific pre-training objectives.

4.2.1. Network Architectures (Section 3.1)

VLMs employ separate encoders for images and text, or sometimes a unified encoder.

4.2.1.1. Architectures for Learning Image Features

Two main types are used:

  • CNN-based Architectures:
    • ConvNets like VGG [5], ResNet [6], and EfficientNet [59] are commonly used.
    • ResNet [6] is popular due to its skip connections which mitigate gradient vanishing and explosion, allowing for very deep networks.
    • Modifications for VLM pre-training (e.g., CLIP [10]): include ResNet-D [60], anti-aliased rect-2 blur pooling [61], and replacing global average pooling with attention pooling from Transformer multi-head attention [58].
  • Transformer-based Architectures:
    • Vision Transformer (ViT) [57] is a standard. It splits input images into fixed-size patches, linearly projects them, adds position embeddings, and feeds them into a Transformer encoder.
    • The Transformer encoder consists of a stack of Transformer blocks, each with a multi-head self-attention layer and a feed-forward network.
    • Modifications (e.g., CLIP [10], FILIP [18], SLIP [64]): often add a normalization layer before the Transformer encoder.

4.2.1.2. Architectures for Learning Language Features

  • Transformer and its variants (e.g., BERT [14], GPT-2 [16], original Transformer [58]) are widely adopted.
  • The standard Transformer [58] has an encoder-decoder structure. Both encoder and decoder have multiple blocks, each containing multi-head (self-)attention layers and Multi-Layer Perceptrons (MLPs). The decoder also includes a masked multi-head layer.
  • Most VLM studies (e.g., CLIP [10]) adopt the standard Transformer with minor modifications, often training from scratch without initializing with weights from pre-trained language models like GPT-2.

4.2.2. VLM Pre-training Objectives (Section 3.2)

The core of VLM is designing objectives to learn rich vision-language correlation. They fall into three categories:

4.2.2.1. Contrastive Objectives

These objectives train VLMs to learn discriminative representations by pulling paired samples (e.g., an image and its matching text) close together and pushing unpaired samples (e.g., an image and a non-matching text) far apart in the feature space.

  • Image Contrastive Learning: Aims to learn discriminative image features, often as an auxiliary objective. It forces a query image to be close to its positive keys (e.g., data augmentations of itself) and far from negative keys (other images). Given a batch of $B$ images, the InfoNCE loss [68] (and its variants) is typically used:

    $\mathcal{L}_I^{\mathrm{InfoNCE}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(z_i^I \cdot z_+^I/\tau)}{\sum_{j=1, j\neq i}^{B+1} \exp(z_i^I \cdot z_j^I/\tau)}$

    Where:

    • $B$: The batch size.
    • $z_i^I$: The query embedding of the $i$-th image.
    • $z_+^I$: The positive key embedding corresponding to $z_i^I$ (e.g., an augmented version of the same image).
    • $\{z_j^I\}_{j=1, j\neq i}^{B+1}$: The set of key embeddings in the batch, comprising the positive key and $B$ negative keys (other images in the batch). In some formulations, the positive key $z_+^I$ may be part of the sum or handled separately.
    • $\tau$: A temperature hyper-parameter that controls the sharpness of the learned representation; a smaller $\tau$ makes the model more sensitive to differences in similarity.
  • Image-Text Contrastive Learning: Aims to learn discriminative image-text representations by pulling embeddings of paired images and texts close while pushing unpaired ones apart. It is often achieved by minimizing a symmetrical InfoNCE loss [10] that considers both image-to-text and text-to-image contrasts, $\mathcal{L}_{\mathrm{infoNCE}}^{I \leftrightarrow T} = \mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I}$ (a minimal code sketch of this symmetric loss is given after this list). For the image-to-text direction ($\mathcal{L}_{I \rightarrow T}$), a query image is contrasted with the text keys:

    $\mathcal{L}_{I \rightarrow T} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(z_i^I \cdot z_i^T/\tau)}{\sum_{j=1}^{B} \exp(z_i^I \cdot z_j^T/\tau)}$

    For the text-to-image direction ($\mathcal{L}_{T \rightarrow I}$), a query text is contrasted with the image keys:

    $\mathcal{L}_{T \rightarrow I} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(z_i^T \cdot z_i^I/\tau)}{\sum_{j=1}^{B} \exp(z_i^T \cdot z_j^I/\tau)}$

    Where:

    • $B$: The batch size.
    • $z_i^I$: The embedding of the $i$-th image.
    • $z_i^T$: The embedding of the $i$-th text, which is paired with $z_i^I$.
    • $z_j^T$ / $z_j^I$: The embeddings of the $j$-th text / image in the batch.
    • $\tau$: The temperature hyper-parameter.
    • The dot product $z_i^I \cdot z_j^T$ (or $z_i^T \cdot z_j^I$) measures the similarity between embeddings.
  • Image-Text-Label Contrastive Learning: This objective introduces supervised contrastive learning [69] by incorporating image classification labels into the image-text contrastive framework, which allows learning discriminative and task-specific features simultaneously. The image-to-text component is reformulated as:

    $\mathcal{L}_{I \rightarrow T}^{ITL} = -\sum_{i=1}^{B} \frac{1}{|\mathcal{P}(i)|} \sum_{k\in\mathcal{P}(i)} \log \frac{\exp(z_i^I \cdot z_k^T/\tau)}{\sum_{j=1}^{B} \exp(z_i^I \cdot z_j^T/\tau)}$

    And the text-to-image component as:

    $\mathcal{L}_{T \rightarrow I}^{ITL} = -\sum_{i=1}^{B} \frac{1}{|\mathcal{P}(i)|} \sum_{k\in\mathcal{P}(i)} \log \frac{\exp(z_i^T \cdot z_k^I/\tau)}{\sum_{j=1}^{B} \exp(z_i^T \cdot z_j^I/\tau)}$

    Where:

    • $B$: The batch size.
    • $z_i^I$, $z_k^T$, $z_j^T$, $z_i^T$, $z_k^I$, $z_j^I$: Image and text embeddings as before.
    • $\mathcal{P}(i) = \{k \mid k \in B, y_k = y_i\}$: The set of indices $k$ in the current batch such that the classification label $y_k$ of the $k$-th sample equals the label $y_i$ of the $i$-th sample; all samples in $\mathcal{P}(i)$ are treated as positive matches for $z_i^I$ (or $z_i^T$).
    • $|\mathcal{P}(i)|$: The cardinality (number of elements) of $\mathcal{P}(i)$.
    • $\tau$: The temperature hyper-parameter. The total loss is $\mathcal{L}_{\mathrm{infoNCE}}^{ITL} = \mathcal{L}_{I \rightarrow T}^{ITL} + \mathcal{L}_{T \rightarrow I}^{ITL}$.

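To make the symmetric image-text InfoNCE objective above concrete, here is a minimal PyTorch sketch (not the authors' code); the function name, the normalization step, and the default temperature of 0.07 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(image_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric image-text InfoNCE loss: L = L_{I->T} + L_{T->I}.

    image_emb, text_emb: (B, D) tensors where row i of each tensor forms a paired sample.
    """
    # L2-normalize so the dot product is a cosine similarity, as in CLIP-style training.
    z_i = F.normalize(image_emb, dim=-1)
    z_t = F.normalize(text_emb, dim=-1)

    logits = z_i @ z_t.t() / tau                                   # (B, B) image-text similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)   # matched pairs lie on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)       # image-to-text direction (rows)
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text-to-image direction (columns)
    return loss_i2t + loss_t2i
```

Each row of the similarity matrix plays the role of the image-to-text term and each column the text-to-image term, so the two cross-entropy calls implement the two directions of the symmetric loss.
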
4.2.2.2. Generative Objectives

These objectives learn semantic features by training networks to generate image or text data, encouraging VLMs to learn rich contextual information within and across modalities.

  • Masked Image Modelling (MIM): This objective learns cross-patch correlation by masking parts of an image and training the encoder to reconstruct them. The loss function is:

    $\mathcal{L}_{MIM} = -\frac{1}{B}\sum_{i=1}^{B} \log f_\theta\big(\overline{x}_i^I \mid \hat{x}_i^I\big)$

    Where:

    • $B$: The batch size.
    • $f_\theta$: The image encoder (parameterized by $\theta$).
    • $\overline{x}_i^I$: The masked patches of the $i$-th input image $x_i^I$.
    • $\hat{x}_i^I$: The unmasked patches of the $i$-th input image $x_i^I$.
    • The objective is to maximize the likelihood of reconstructing the masked patches given the unmasked ones.
  • Masked Language Modelling (MLM): A widely adopted NLP pre-training objective that is effective for text feature learning. It randomly masks a percentage of input text tokens and trains the model to reconstruct them from the unmasked tokens. The loss function is:

    $\mathcal{L}_{MLM} = -\frac{1}{B}\sum_{i=1}^{B} \log f_\phi\big(\overline{x}_i^T \mid \hat{x}_i^T\big)$

    Where:

    • $B$: The batch size.
    • $f_\phi$: The text encoder (parameterized by $\phi$).
    • $\overline{x}_i^T$: The masked tokens of the $i$-th input text $x_i^T$.
    • $\hat{x}_i^T$: The unmasked tokens of the $i$-th input text $x_i^T$.
    • The objective is to maximize the likelihood of predicting the masked tokens given the unmasked ones. The following figure (Figure 9 from the original paper) illustrates the process of masked language modelling:

      Fig. 9: Illustration of masked language modelling \[14\]. The figure is a schematic of masked language modelling: the input text is fed to a Transformer encoder, token embeddings are combined with position embeddings, and the model predicts the token at the masked position $e_{[mask]}$ to produce the final text output.

  • Masked Cross-Modal Modelling (MCM): Integrates masked image modelling and masked language modelling by jointly masking and reconstructing image patches and text tokens from an image-text pair, which captures rich vision-language correspondence. The loss function is:

    $\mathcal{L}_{MCM} = -\frac{1}{B}\sum_{i=1}^{B} \big[ \log f_\theta\big(\overline{x}_i^I \mid \hat{x}_i^I, \hat{x}_i^T\big) + \log f_\phi\big(\overline{x}_i^T \mid \hat{x}_i^I, \hat{x}_i^T\big) \big]$

    Where:

    • $B$: The batch size.
    • $f_\theta$ / $f_\phi$: The image / text encoder.
    • $\overline{x}_i^I$ / $\hat{x}_i^I$: The masked / unmasked patches in the $i$-th image $x_i^I$.
    • $\overline{x}_i^T$ / $\hat{x}_i^T$: The masked / unmasked text tokens in the $i$-th text $x_i^T$.
    • The objective is to reconstruct the masked image patches given both the unmasked image patches and the unmasked text tokens, and similarly for the masked text tokens.
  • Image-to-Text Generation: Aims to generate descriptive texts for a given image, capturing fine-grained vision-language correlation by training VLMs to predict the tokenized text autoregressively (a minimal code sketch of this objective is given after this list). The loss function is:

    $\mathcal{L}_{ITG} = -\sum_{l=1}^{L} \log f_\theta\big(x^T \mid x_{<l}^T, z^I\big)$

    Where:

    • $L$: The number of tokens to be predicted for the text $x^T$.
    • $f_\theta$: The image encoder (often combined with a decoder for text generation).
    • $x^T$: The target text to be generated.
    • $x_{<l}^T$: The tokens of $x^T$ up to position $l-1$, used as context for predicting the $l$-th token.
    • $z^I$: The embedding of the image paired with $x^T$.
    • The objective is to maximize the likelihood of generating the correct text sequence given the image embedding and the previous text tokens. The following figure (Figure 10 from the original paper) provides a simplified illustration of image-to-caption generation:

      Fig. 10: A simplified illustration of image-to-caption generation in COCA \[19\]. Figure is reproduced based on \[19\]. The figure is a schematic of the interaction between the image encoder and the multimodal text decoder: the input image and text are combined via cross-attention to generate a descriptive caption. For an example input image of a dog playing with a ball, the output text is "A dog is playing with a ball".

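As a minimal sketch of the autoregressive image-to-text generation objective above (teacher forcing with a cross-entropy loss), assuming a hypothetical `decoder` module that maps prefix token ids plus an image embedding to vocabulary logits:

```python
import torch
import torch.nn.functional as F

def image_to_text_generation_loss(decoder, image_emb: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressive captioning loss, a teacher-forcing sketch.

    decoder:     hypothetical module mapping (prefix token ids, image embedding) -> logits of shape (B, L-1, V).
    image_emb:   (B, D) image embeddings z^I.
    text_tokens: (B, L) integer token ids of the paired captions x^T.
    """
    inputs, targets = text_tokens[:, :-1], text_tokens[:, 1:]   # predict the l-th token from tokens < l
    logits = decoder(inputs, image_emb)                         # condition every step on the image embedding
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```
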
4.2.2.3. Alignment Objectives

These objectives enforce VLMs to align paired images and texts by learning to predict whether a given text correctly describes a given image.

  • Image-Text Matching (ITM): Models global correlation between images and texts by aligning paired samples. It can be formulated with a score function $\mathcal{S}(\cdot)$ that measures the alignment probability, together with a binary classification loss (a minimal code sketch follows this list):

    $\mathcal{L}_{IT} = p \log \mathcal{S}(z^I, z^T) + (1-p) \log\big(1 - \mathcal{S}(z^I, z^T)\big)$

    Where:

    • $\mathcal{S}(z^I, z^T)$: A score function that outputs the probability of alignment between the image embedding $z^I$ and the text embedding $z^T$.
    • $p$: A binary label, equal to 1 if the image and text are a true (matched) pair and 0 otherwise.
    • This is a standard binary cross-entropy loss, where the model learns to predict $p$ correctly.
  • Region-Word Matching (RWM): Aims to model local, fine-grained cross-modal correlation between image regions and word tokens, which is particularly beneficial for dense visual recognition tasks such as object detection and semantic segmentation. The loss function is:

    $\mathcal{L}_{RW} = p \log \mathcal{S}^r(r^I, w^T) + (1-p) \log\big(1 - \mathcal{S}^r(r^I, w^T)\big)$

    Where:

    • $(r^I, w^T)$: A region-word pair, where $r^I$ is the feature representation of an image region and $w^T$ is the feature representation of a word token.
    • $\mathcal{S}^r(\cdot)$: A local score function that measures the similarity between an image region and a word.
    • $p$: A binary label, equal to 1 if the region and word are semantically paired (e.g., the word "dog" matches a region containing a dog) and 0 otherwise. The following figure (Figure 11 from the original paper) illustrates the use of word-region alignment for detection in GLIP:

      Fig. 11: Illustration of GLIP \[67\] that uses word-region alignment for detection. Figure is reproduced from \[67\]. The figure is a schematic of how GLIP uses a text encoder and an image encoder for word-region alignment in detection; the formulas shown compute the localization loss and the alignment loss.

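A minimal sketch of the image-text matching loss above, assuming a hypothetical `score_head` that outputs the alignment probability $\mathcal{S}(z^I, z^T)$; the standard (negated) binary cross-entropy form is used:

```python
import torch
import torch.nn.functional as F

def image_text_matching_loss(score_head, z_img: torch.Tensor, z_txt: torch.Tensor,
                             is_match: torch.Tensor) -> torch.Tensor:
    """Binary image-text matching loss, a minimal sketch.

    score_head: hypothetical module mapping (z^I, z^T) -> alignment probability in (0, 1).
    is_match:   (B,) binary labels p, 1 for matched pairs and 0 for mismatched pairs.
    """
    prob = score_head(z_img, z_txt)                        # S(z^I, z^T)
    return F.binary_cross_entropy(prob, is_match.float())  # standard binary cross-entropy
```
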
4.2.3. VLM Pre-training Frameworks (Section 3.3)

VLMs employ different architectures to process and fuse visual and linguistic information. The following figure (Figure 5 from the original paper) illustrates typical VLM pre-training frameworks:

Fig. 5: Illustration of typical VLM pre-training frameworks. The figure is a schematic of three typical VLM pre-training frameworks, the two-tower, two-leg, and one-tower VLMs, showing the image and text embedding inputs and the use of multi-modal fusion layers in support of the pre-training objectives.

  • Two-Tower Framework (Fig. 5a): Widely adopted (e.g., CLIP [10], ALIGN [17]). Input images and texts are encoded with two separate, independent encoders (one for vision, one for language). Features are then projected into a common embedding space where objectives like contrastive learning are applied. This design keeps the modalities largely separate until the final embedding comparison (a minimal code sketch of this design follows this list).
  • Two-Leg Framework (Fig. 5b): (e.g., COCA [19], FLAVA [42]). This extends the two-tower design by introducing additional multi-modal fusion layers after the initial separate encoding. These fusion layers enable explicit feature interaction and communication between the image and text modalities, allowing for deeper cross-modal understanding.
  • One-Tower Framework (Fig. 5c): (e.g., CLIPPO [43], OneFormer [44]). Aims to unify vision and language learning within a single encoder. Images and texts are often tokenized into a common format (e.g., visual tokens from image patches and textual tokens from words) and then processed by a single, shared Transformer-based network. This design promotes efficient communications and interactions across data modalities from an earlier stage.

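A minimal sketch of the two-tower design under assumed encoder interfaces (the encoders, projection layers, and embedding dimension are placeholders, not the architecture of any specific VLM):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerVLM(nn.Module):
    """Two-tower VLM: independent image and text encoders projected into a shared space."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 image_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder            # e.g., a ResNet or ViT backbone
        self.text_encoder = text_encoder              # e.g., a Transformer text encoder
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor):
        # The two modalities only interact in the shared embedding space.
        z_img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        z_txt = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)
        return z_img, z_txt                           # fed to a contrastive objective
```

The two-leg framework would add fusion layers on top of these two outputs, while the one-tower framework would replace the two encoders with a single shared network over tokenized images and texts.
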
4.2.4. Evaluation Setups and Downstream Tasks (Section 3.4)

VLMs are evaluated using different setups and on various downstream tasks to assess their generalization capabilities.

4.2.4.1. Zero-shot Prediction

This is the most common way to evaluate VLM generalization: directly applying the pre-trained VLM to downstream tasks without any task-specific fine-tuning.

  • Image Classification: Classifies images into categories. VLMs perform this by comparing image embeddings with text embeddings of class names (often enhanced by prompt engineering, e.g., "a photo of a [label]"; a minimal code sketch follows this list).
  • Semantic Segmentation: Assigns category labels to each pixel. Zero-shot segmentation involves comparing embeddings of image pixels (or segments) with text descriptions.
  • Object Detection: Localizes and classifies objects. VLMs achieve this by comparing object proposals' embeddings with text embeddings, often leveraging object locating ability learned from auxiliary datasets.
  • Image-Text Retrieval: Retrieves relevant samples across modalities: text-to-image retrieval (find images given text) and image-to-text retrieval (find text given images).

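A minimal sketch of prompt-based zero-shot classification, assuming CLIP-style `image_encoder`, `text_encoder`, and `tokenizer` callables (the names and the temperature value are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer,
                       image: torch.Tensor, class_names: list, tau: float = 0.01) -> str:
    """Classify one image by matching its embedding against prompted class-name embeddings."""
    prompts = [f"a photo of a {name}" for name in class_names]          # prompt engineering
    text_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)    # (C, D) class text embeddings
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)    # (1, D) image embedding
    probs = (img_emb @ text_emb.t() / tau).softmax(dim=-1)              # similarity over C candidate classes
    return class_names[probs.argmax(dim=-1).item()]
```
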
4.2.4.2. Linear Probing

This setup freezes the pre-trained VLM's feature extractor and trains a simple linear classifier on top of the VLM-encoded embeddings. It assesses the quality of the learned representations (a minimal code sketch follows the list below).

  • Image Classification and Action Recognition are commonly used tasks. For action recognition, video clips are often sub-sampled.

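A minimal sketch of linear probing, assuming a frozen `image_encoder` and standard (image, label) data loaders; scikit-learn's logistic regression stands in for the linear classifier:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(image_encoder, loader):
    """Encode a labelled dataset with the frozen VLM image encoder."""
    feats, labels = [], []
    for images, ys in loader:
        feats.append(image_encoder(images).cpu().numpy())
        labels.append(ys.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe(image_encoder, train_loader, test_loader) -> float:
    """Train a linear classifier on frozen features and report test accuracy."""
    x_tr, y_tr = extract_features(image_encoder, train_loader)
    x_te, y_te = extract_features(image_encoder, test_loader)
    clf = LogisticRegression(max_iter=1000)    # the only trained component
    clf.fit(x_tr, y_tr)
    return clf.score(x_te, y_te)
```
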
4.2.5. Vision-Language Model Pre-training (Section 5)

This section reviews specific VLM pre-training studies based on the objectives described above. The following are the results from Table 3 of the original paper:

(In the Objective column, Con, Gen, and Align denote contrastive, generative, and alignment pre-training objectives, respectively.)

| Method | Dataset | Objective | Contribution |
| --- | --- | --- | --- |
| CLIP† [10] [code] | CLIP* | Con | Propose image-text contrastive learning for VLM pre-training. |
| ALIGN† [17] | ALIGN* | Con | Leverage large-scale noisy data to scale up VLM pre-training data. |
| OTTER† [112] [code] | CC3M, YFCC15M, WIT | Con | Employ optimal transport for data-efficient VLM pre-training. |
| DeCLIP† [113] [code] | CC3M, CC12M, YFCC100M, WIT* | Con, Gen | Employ image/text self-supervision for data-efficient VLM pre-training. |
| ZeroVL† [114] [code] | SBU, VG, CC3M, CC12M | Con | Introduce data augmentation for data-efficient VLM pre-training. |
| FILIP† [18] | FILIP*, CC3M, CC12M, YFCC100M | Con, Align | Leverage region-word similarity for fine-grained VLM pre-training. |
| UniCL† [65] [code] | CC3M, CC12M, YFCC100M | Con | Propose image-text-label contrastive learning for VLM pre-training. |
| Florence† [115] | FLD-900M* | Con | Scale up pre-training data and include depth and temporal information. |
| SLIP† [64] [code] | YFCC100M | Con | Introduce image self-supervision learning into VLM pre-training. |
| PyramidCLIP† [116] | SBU, CC3M, CC12M, YFCC100M, LAION400M | Con | Perform peer-level/cross-level contrastive learning within/across multiple semantic levels. |
| ChineseCLIP† [117] [code] | LAION5B, WuKong, VG, COCO | Con | Collect large-scale Chinese image-text data and introduce a Chinese VLM. |
| LiT† [118] [project] | CC12M, YFCC100M, WIT* | Con | Propose contrastive tuning with the locked image encoder. |
| AltCLIP‡ [119] [code] | WuDao, LAION2B, LAION5B | Con | Leverage the multilingual text encoder to achieve multilingual VLM. |
| FLAVA‡ [42] [code] | COCO, SBU, LN, CC3M, VG, WIT, CC12M, RC, YFCC100M | Gen, Con, Align | Propose a universal and foundational VLM that tackles the single-modal (i.e., image or text) and the multi-modal cases at the same time. |
| KELIP† [120] [code] | CUB200, WIT, YFCC15M, CC3M, CC12M, LAION400M, K-WIT* | Con, Gen | Collect large-scale Korean image-text pair data and develop bilingual VLMs with Korean and English. |
| COCA‡ [19] [code] | ALIGN* | Con, Gen | Combine contrastive learning and image captioning for pre-training. |
| nCLIP† [121] | COCO, VG, SBU, CC3M, CC12M, YFCC14M | Con, Align | Propose a non-contrastive pre-training objective (i.e., a cross-entropy loss for global image-text matching) for VLM pre-training. |
| K-lite† [122] [code] | CC3M, CC12M, YFCC100M | Con | Leverage auxiliary datasets for training transferable VLMs. |
| NLIP‡ [123] | YFCC100M, COCO | Con, Gen | Train noise-robust VLM via noise harmonization and completion. |
| UniCLIP† [84] | CC3M, CC12M, YFCC100M | Con | Propose unified image-text and image-image contrastive learning. |
| PaLI‡ [83] [project] | WebLI* | Gen | Scale up the data, model and language in VLM pre-training. |
| HiCLIP† [124] [code] | YFCC100M, CC3M, CC12M | Con | Propose to incorporate hierarchy-aware attention into VLM pre-training. |
| CLIPPO§ [43] [code] | WebLI* | Con | Learn image and text data with a single network for VLM pre-training. |
| OneFormer§ [44] | CC3M, SBU, VG, COCO | Con, Gen | Unify image and text learning in a single-tower transformer. |
| RA-CLIP† [125] | YFCC100M | Con | Propose retrieval-augmented image-text contrastive learning. |
| LA-CLIP† [126] [code] | CC3M, CC12M, RC, LAION400M | Con | Propose LLM-augmented image-text contrastive learning. |
| ALIP† [127] [code] | YFCC100M | Con | Introduce synthetic caption supervision into VLM pre-training. |
| GrowCLIP‡ [128] | CC12M | Con | Propose online-learning image-text contrastive learning. |
| GroupViT† [129] [code] | CC12M, YFCC100M | Con | Propose hierarchical visual concepts grouping for VLM pre-training. |
| SegCLIP† [46] [code] | CC3M, COCO | Con, Gen | Propose a plug-in semantic group module for VLM pre-training. |
| CLIPpy [130] [code] | CC12M | Con | Propose spatial representation aggregation for VLM pre-training. |
| RegionCLIP† [131] [code] | CC3M, COCO | Con, Align | Learn region-level visual representations for VLM pre-training. |
| GLIP [67] [code] | CC3M, CC12M, SBU | Align | Unify detection and phrase grounding for grounded VLM pre-training. |
| FIBER‡ [71] [code] | COCO, CC3M, SBU, VG | Con, Gen, Align | Propose deep multi-modal fusion for coarse-to-fine VLM pre-training. |
| DetCLIP‡ [45] | YFCC100M | Align | Present a paralleled visual-concept VLM pre-training method. |

  • VLM Pre-training with Contrastive Objectives (Section 5.1):

    • CLIP [10]: The pioneering work, uses symmetrical image-text InfoNCE loss (Eqs. 2 & 3). The following figure (Figure 6 from the original paper) illustrates the image-text contrastive learning in CLIP:

      Fig. 6: Illustration of the image-text contrastive learning in CLIP \[10\]. Figure is reproduced from \[10\]. The figure is a schematic of image-text contrastive learning in CLIP: the text encoder maps the input texts to feature vectors $z_1, z_2, \ldots, z_B$, the image encoder maps the input images to corresponding feature vectors, and the two sets of features are related through a similarity matrix in the shared feature space for contrastive learning.

    • ALIGN [17]: Scales up CLIP with 1.8 billion noisy image-text pairs and noise-robust contrastive learning.

    • Data-efficient pre-training: OTTER [112] (optimal transport), DeCLIP [113] (nearest-neighbor supervision), ZeroVL [114] (debiased data sampling, mixup).

    • Fine-grained correlation: FILIP [18] (region-word alignment), PyramidCLIP [116] (cross-level and peer-level contrastive learning).

    • UniCL [65]: Introduces image classification labels into image-text contrast (Eqs. 4 & 5). The following figure (Figure 7 from the original paper) illustrates the image-text-label space proposed in UniCL:

      Fig. 7: Illustration of the image-text-label space proposed in UniCL \[65\]. Figure is reproduced from \[65\]. The figure is a schematic of the image-text-label space, showing the correspondence among images, texts, and labels; examples of different animals (e.g., dogs and cats) and their descriptions illustrate how image-label pairs and image-text pairs are connected.

    • Recent improvements also involve augmenting image-text pairs (e.g., LA-CLIP [126], ALIP [127] use LLMs for synthetic captions), and unifying vision and language in a single encoder (OneFormer [44], CLIPPO [43]).

  • VLM Pre-training with Generative Objectives (Section 5.2):

    • Masked Image Modelling (MIM): As defined in Eq. 6. Examples include MAE [41], BeiT [70]. FLAVA [42] uses rectangular block masking, while KELIP [120] and SegCLIP [46] mask a large portion of patches. The following figure (Figure 8 from the original paper) illustrates masked image modelling:

      Fig. 8: Illustration of masked image modelling \[66\]. Figure is reproduced from \[66\]. The figure is a schematic of the encoder-decoder pipeline for image reconstruction: the input image is processed by the image encoder into an intermediate representation, which the image decoder then uses to reconstruct the output image.

    • Masked Language Modelling (MLM): As defined in Eq. 7. FLAVA [42] and FIBER [71] use this for text feature learning.

    • Masked Cross-Modal Modelling (MCM): As defined in Eq. 8. FLAVA [42] masks both image patches and text tokens for joint reconstruction.

    • Image-to-Text Generation: As defined in Eq. 9. COCA [19], NLIP [123], PaLI [83] train VLMs with encoder-decoder architectures and image captioning objectives.

  • VLM Pre-training with Alignment Objectives (Section 5.3):

    • Image-Text Matching: As defined in Eq. 10. FLAVA [42] uses a classifier, FIBER [71] mines hard negatives.
    • Region-Word Matching: As defined in Eq. 11. GLIP [67], FIBER [71], DetCLIP [45] use dot-product similarity between regional visual features and token-wise features for dense prediction tasks.

4.2.6. VLM Transfer Learning (Section 6)

Transfer learning adapts pre-trained VLMs to fit downstream tasks. The following are the results from Table 4 of the original paper:

(Category abbreviations: TPT = text prompt tuning, VPT = visual prompt tuning, FA = feature adapter, CA = cross-attention, LLM = large language model-based prompting/captioning, FT = fine-tuning, AM = architecture modification; "Few-shot Sup." = few-shot supervised.)

| Method | Category | Setup | Contribution |
| --- | --- | --- | --- |
| CoOp [31] [code] | TPT | Few-shot Sup. | Introduce context optimization with learnable text prompts for VLM transfer learning. |
| CoCoOp [32] [code] | TPT | Few-shot Sup. | Propose conditional text prompting to mitigate overfitting in VLM transfer learning. |
| SubPT [132] [code] | TPT | Few-shot Sup. | Propose subspace text prompt tuning to mitigate overfitting in VLM transfer learning. |
| LASP [133] | TPT | Few-shot Sup. | Propose to regularize the learnable text prompts with the hand-engineered prompts. |
| ProDA [134] | TPT | Few-shot Sup. | Propose prompt distribution learning that captures the distribution of diverse text prompts. |
| VPT [135] | TPT | Few-shot Sup. | Propose to model the text prompt learning with instance-specific distribution. |
| ProGrad [136] [code] | TPT | Few-shot Sup. | Present a prompt-aligned gradient technique for preventing knowledge forgetting. |
| CPL [137] [code] | TPT | Few-shot Sup. | Employ counterfactual generation and contrastive learning for text prompt tuning. |
| PLOT [138] [code] | TPT | Few-shot Sup. | Introduce optimal transport to learn multiple comprehensive text prompts. |
| DualCoOp [139] [code] | TPT | Few-shot Sup. | Introduce positive and negative text prompt learning for multi-label classification. |
| TaI-DPT [140] [code] | TPT | Few-shot Sup. | Introduce a double-grained prompt tuning technique for multi-label classification. |
| SoftCPT [141] [code] | TPT | Few-shot Sup. | Propose to fine-tune VLMs on multiple downstream tasks simultaneously. |
| DenseCLIP [142] [code] | TPT | Supervised | Propose a language-guided fine-tuning technique for dense visual recognition tasks. |
| UPL [143] [code] | TPT | Unsupervised | Propose unsupervised prompt learning with self-training for VLM transfer learning. |
| TPT [144] [code] | TPT | Unsupervised | Propose test-time prompt tuning that learns adaptive prompts on the fly. |
| KgCoOp [145] [code] | TPT | Few-shot Sup. | Introduce knowledge-guided prompt tuning to improve the generalization ability. |
| ProTeCt [146] | TPT, VPT | Few-shot Sup. | Propose a prompt tuning technique to improve consistency of model predictions. |
| VP [147] [code] | VPT | Supervised | Investigate the efficacy of visual prompt tuning for VLM transfer learning. |
| RePrompt [148] | VPT | Few-shot Sup. | Introduce retrieval mechanisms to leverage knowledge from downstream tasks. |
| UPT [149] [code] | TPT, VPT | Few-shot Sup. | Propose a unified prompt tuning that jointly optimizes text and image prompts. |
| MVLPT [150] [code] | TPT, VPT | Few-shot Sup. | Incorporate multi-task knowledge into text and image prompt tuning. |
| MaPLE [151] [code] | TPT, VPT | Few-shot Sup. | Propose multi-modal prompt tuning with a mutual promotion strategy. |
| CAVPT [152] [code] | TPT, VPT | Few-shot Sup. | Introduce class-aware visual prompt for concentrating more on visual concepts. |
| Clip-Adapter [33] [code] | FA | Few-shot Sup. | Introduce an adapter with residual feature blending for efficient VLM transfer learning. |
| Tip-Adapter [34] [code] | FA | Few-shot Sup. | Propose to build a training-free adapter with the embeddings of few labelled images. |
| SVL-Adapter [153] [code] | FA | Few-shot Sup. | Introduce a self-supervised adapter by performing self-supervised learning on images. |
| SuS-X [154] [code] | FA | Unsupervised | Propose a training-free name-only transfer learning paradigm with curated support sets. |
| CLIPPR [155] [code] | FA | Unsupervised | Leverage the label distribution priors for adapting pre-trained VLMs. |
| SgVA-CLIP [156] | TPT, FA | Few-shot Sup. | Propose a semantic-guided visual adapter to generate discriminative adapted features. |
| VT-Clip [157] | CA | Few-shot Sup. | Introduce visual-guided attention that semantically aligns text and image features. |
| CALIP [158] [code] | CA | Unsupervised | Propose parameter-free attention for the communication between visual and textual features. |
| TaskRes [159] [code] | CA | Few-shot Sup. | Propose a technique for better learning old VLM knowledge and new task knowledge. |
| CuPL [160] | LLM | Unsupervised | Employ large language models to generate customized prompts for VLMs. |
| VCD [161] | LLM | Unsupervised | Employ large language models to generate captions for VLMs. |
| Wise-FT [162] [code] | FT | Supervised | Propose ensemble-based fine-tuning by combining the fine-tuned and original VLMs. |
| MaskClip [163] [code] | AM | Unsupervised | Propose to extract dense features by modifying the image encoder architecture. |
| MUST [164] [code] | Self-training | Unsupervised | Propose masked unsupervised self-training for unsupervised VLM transfer learning. |

  • Motivation (Section 6.1): VLMs face domain gaps (image/text distributions) and training objective gaps (general vs. task-specific).
  • Common Setups (Section 6.2):
    • Supervised transfer: Uses all labelled downstream data.
    • Few-shot supervised transfer: Uses a small amount of labelled samples.
    • Unsupervised transfer: Uses unlabelled downstream data.
  • Common Transfer Learning Methods (Section 6.3):
    • Transfer via Prompt Tuning (TPT): Modifies input text/image with learnable prompts without fine-tuning the entire VLM.

      • Text Prompt Tuning: Learns context words for class names; e.g., CoOp [31] replaces a hand-engineered prompt context such as "a photo of a" with learnable context vectors, giving prompts of the form "[V]_1 [V]_2 ... [V]_M [label]", where the [V] tokens are learnable vectors. CoCoOp [32] generates image-specific prompts. (A minimal code sketch of this idea appears at the end of this subsection.)

      • Visual Prompt Tuning (VPT): Modulates the input of the image encoder (e.g., VP [147] adds learnable image perturbations $v$ to the input image $x^I$, i.e., $x^I + v$).

      • Text-Visual Prompt Tuning: Jointly optimizes both text and image prompts (e.g., UPT [149], MaPLE [151]). The following figure (Figure 12 from the original paper) illustrates text prompt learning and visual prompt learning:

        Fig. 12: Illustration of text prompt learning \[31\] in (a) and visual prompt learning \[147\] in (b). The figure is a schematic of text prompt learning (a) and visual prompt learning (b): in (a), the text encoder and image encoder compute a classification loss over learnable text prompts, while (b) introduces learnable visual prompts that augment the image input before the classification loss is computed with the text prompts.

    • Transfer via Feature Adaptation (FA): Introduces additional light-weight feature adapters to fine-tune VLMs.

      • Clip-Adapter [33]: Inserts trainable linear layers after CLIP's encoders, optimizing them while freezing the main CLIP model.

      • Tip-Adapter [34]: A training-free adapter that directly uses embeddings of few-shot labelled images as weights. The following figure (Figure 13 from the original paper) illustrates the feature adapter:

        Fig. 13: Illustration of feature adapter \[33\]. The figure is a schematic of the feature adapter workflow: text prompts and the image encoder produce textual and visual features, the feature adapter generates adapted features, and these are used to compute the classification loss.

    • Other Transfer Methods: Direct fine-tuning (e.g., Wise-FT [162]), architecture modification (e.g., MaskCLIP [163]), cross-attention mechanisms (e.g., VT-CLIP [157]), and using LLMs to generate prompts (e.g., CuPL [160]).

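A minimal sketch of CoOp-style text prompt tuning under assumed interfaces: only the M learnable context vectors are optimized, while the VLM's encoders stay frozen (the class-name embedding tensor and its shape are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    """Prepend M learnable context vectors [V]_1 ... [V]_M to each frozen class-name embedding."""

    def __init__(self, num_context: int, embed_dim: int, class_name_emb: torch.Tensor):
        super().__init__()
        # Learnable context shared by all classes; small random initialization.
        self.context = nn.Parameter(0.02 * torch.randn(num_context, embed_dim))
        # Frozen token embeddings of the class names, shape (C, L_name, D).
        self.register_buffer("class_name_emb", class_name_emb)

    def forward(self) -> torch.Tensor:
        num_classes = self.class_name_emb.size(0)
        ctx = self.context.unsqueeze(0).expand(num_classes, -1, -1)   # (C, M, D)
        # The prompted sequences are fed to the frozen text encoder to build class weights.
        return torch.cat([ctx, self.class_name_emb], dim=1)           # (C, M + L_name, D)
```

During transfer, the prompted sequences pass through the frozen text encoder, are compared with image embeddings as in zero-shot prediction, and only `self.context` receives gradients from the downstream classification loss.
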
4.2.7. VLM Knowledge Distillation (Section 7)

Distills general VLM knowledge to task-specific models, especially for dense prediction tasks. The following are the results from Table 5 of the original paper:

| Task | Method | Contribution |
| --- | --- | --- |
| Semantic Segmentation | CLIPSeg [175] [code] | Extend CLIP by introducing a lightweight transformer-based decoder. |
| | ZegFormer [35] [code] | Group the pixels into segments and perform zero-shot classification on the segments. |
| | LSeg [176] [code] | Propose language-driven semantic segmentation by matching pixel and text embeddings. |
| | SSIW [177] | Introduce a test-time augmentation technique to refine the pseudo labels generated by CLIP. |
| | MaskClip+ [163] [code] | Perform self-training with the pseudo labels generated by MaskClip (modified from CLIP). |
| | ZegClip [174] [code] | Propose deep prompt tuning, non-mutually exclusive loss and relationship descriptor. |
| | Fusioner [178] | Introduce cross-modality fusion that aligns the visual representation with language concept. |
| | OVSeg [179] [code] | Adapt CLIP with the region-word pairs generated by the modified MaskFormer. |
| | ZSSeg [180] [code] | Propose to first generate mask proposals and then classify the generated mask proposals. |
| | OpenSeg [181] [code] | Propose to align each word in the caption with the generated segmentation masks. |
| | ReCo [182] [code] | Propose language-guided co-segmentation with the CLIP-retrieved images. |
| | CLIMS [183] [code] | Use CLIP to generate high-quality class activation maps w/o involving irrelevant background. |
| | CLIP-ES [184] [code] | Employ CLIP to refine the class activation map for weakly-supervised segmentation. |
| | FreeSeg [185] [code] | Propose a unified, universal and open-vocabulary image segmentation network. |
| Object Detection | ViLD [36] [code] | Propose to distill knowledge from a pre-trained VLM into a two-stage object detector. |
| | DetPro [37] [code] | Propose to learn continuous prompt representations for open-vocabulary object detection. |
| | HierKD [186] [code] | Propose hierarchical knowledge distillation for global-level and instance-level distillation. |
| | RKD [187] [code] | Propose region-based knowledge distillation for aligning region- and image-level embeddings. |
| | PromptDet [188] [code] | Introduce regional prompting for aligning text embeddings with regional image embeddings. |
| | PB-OVD [189] [code] | Propose to train object detectors with the pseudo bounding-box labels generated by VLMs. |
| | CondHead [190] | Propose semantic-visual alignment for better box regression and mask segmentation. |
| | VLDet [191] [code] | Achieve open-vocabulary object detection by the bipartite matching between regions and words. |
| | F-VLM [192] | Propose to simply build a detection head upon the pre-trained VLM for object localization. |
| | OV-DETR [173] [code] | Achieve open-vocabulary detection transformer with a binary matching strategy. |
| | Detic [193] [code] | Enlarge detection vocabulary using image-level supervision and pre-trained CLIP text encoder. |
| | XPM [194] [code] | Design cross-modal pseudo-labeling to let VLMs generate caption-driven pseudo masks. |
| | OWL-ViT [195] [code] | Propose ViT-based open-vocabulary detector by adding object classification/localization head. |
| | VL-PLM [196] [code] | Leverage VLMs for assigning category labels to the generated pseudo bounding boxes. |
| | P3OVD [197] | Propose prompt-driven self-training that refines the pseudo labels generated by VLMs. |
| | ZSD-YOLO [198] [code] | Leverage CLIP for object detection with a self-labeling based data augmentation technique. |
| | RO-ViT [199] | Bridge the gap of VLM pre-training and downstream open-vocabulary detection. |
| | BARON [200] [code] | Propose neighborhood sampling strategy to align the embedding of bag of regions. |
| | OADP [201] [code] | Propose object-aware distillation network to preserve and transfer contextual knowledge. |

  • Motivation (Section 7.1): Allows transferring general VLM knowledge to task-specific models (e.g., object detectors, semantic segmenters) that benefit from specialized architectures, without VLM architecture restrictions.
  • Knowledge Distillation for Object Detection (Section 7.2.1):
    • Open-vocabulary object detection: Aims to detect objects from arbitrary text descriptions.
    • ViLD [36]: Distills VLM knowledge into a two-stage detector by enforcing consistency between the detector's region embeddings and the VLM embedding space (a minimal code sketch follows this list).
    • HierKD [186]: Hierarchical global-local knowledge distillation.
    • DetPro [37]: Uses detection prompt techniques for continuous prompt representations.
    • Pseudo-labelling: PB-OVD [189] trains detectors with VLM-predicted pseudo bounding boxes, XPM [194] uses VLM-generated pseudo masks.
  • Knowledge Distillation for Semantic Segmentation (Section 7.2.2):
    • Open-vocabulary semantic segmentation: Segments pixels based on arbitrary texts.
    • CLIPSeg [175]: Extends CLIP for segmentation with a lightweight transformer decoder.
    • LSeg [176]: Maximizes correlation between CLIP text embeddings and pixel-wise image embeddings.
    • ZegCLIP [174]: Uses CLIP to generate semantic masks and a relationship descriptor.
    • Weakly-supervised semantic segmentation: CLIP-ES [184], CLIMS [183] leverage VLMs and weak supervision.

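A minimal sketch of ViLD-style knowledge distillation for detection, assuming the detector exposes region embeddings and the frozen VLM image encoder provides teacher embeddings for the cropped region proposals (an L1 alignment loss is used here as an illustrative choice):

```python
import torch
import torch.nn.functional as F

def vild_style_distillation_loss(student_region_emb: torch.Tensor,
                                 teacher_crop_emb: torch.Tensor) -> torch.Tensor:
    """Align the detector's region embeddings with frozen VLM embeddings of the same proposals.

    student_region_emb: (N, D) embeddings predicted by the detector head for N proposals.
    teacher_crop_emb:   (N, D) embeddings from the frozen VLM image encoder applied to the
                        cropped proposals (no gradients flow into the teacher).
    """
    student = F.normalize(student_region_emb, dim=-1)
    teacher = F.normalize(teacher_crop_emb.detach(), dim=-1)
    return F.l1_loss(student, teacher)   # pull the two embedding spaces together
```

At inference, the aligned region embeddings can be classified against text embeddings of arbitrary category names, which is what enables open-vocabulary detection.
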
4.3. Typology of Vision-Language Models for Visual Recognition

The following figure (Figure 4 from the original paper) presents a typology of vision-language models for visual recognition, classifying methods into three main categories: VLM Pre-training, Transfer Learning, and Knowledge Distillation, each with subcategories corresponding to the detailed methodologies discussed.

Fig. 4: Typology of vision-language models for visual recognition. The figure is a schematic typology organized into three main parts, VLM pre-training, transfer learning, and knowledge distillation, listing the individual methods and their corresponding sections.

5. Experimental Setup

5.1. Datasets

The paper extensively covers datasets for both VLM pre-training and evaluation, highlighting the shift towards large-scale, web-sourced image-text pairs.

5.1.1. Datasets for Pre-training VLMs (Section 4.1 & Appendix B)

VLM pre-training primarily relies on massive image-text datasets, often collected from the internet. These are significantly larger and cheaper to acquire than traditional crowd-labelled datasets.

  • Image-Text Datasets:

    • SBU Caption [73]: 1M image-text pairs from Flickr.
    • COCO Caption [74]: 1.5M pairs (330k images, 5 captions each for c5, 40 captions for 5k images for c40) from MS COCO [106].
    • YFCC100M [75]: 100M multimedia items (99.2M images, 0.8M videos) with associated text.
    • Visual Genome (VG) [76]: 5.4M pairs (108k images, 50 descriptions each) with rich annotations (object-level, scene graphs, VQA).
    • Conceptual Captions (CC3M) [77]: 3.3M noisy image-text pairs from the web.
    • Localized Narratives (LN) [78]: 0.87M image-text pairs with localized multi-modal annotations (mouse traces per word).
    • Conceptual 12M (CC12M) [79]: 12M image-text pairs, a larger version of CC3M with less precise data.
    • Wikipedia-based Image Text (WIT) [80]: 37.6M image-text pairs across 108 languages from Wikipedia.
    • Red Caps (RC) [81]: 12M image-text pairs from Reddit, covering diverse objects and scenes.
    • LAION400M [21]: 400M image-text pairs filtered by CLIP, providing embeddings and kNN indices.
    • LAION5B [20]: Over 5.8B image-text pairs (2.32B English, 2.26B multilingual, 1.27B unspecified language).
    • WuKong [82]: 100M Chinese image-text pairs.
    • CLIP [10]: 400M image-text pairs from publicly available internet sources (non-public dataset).
    • ALIGN [17]: 1.8B noisy image-text pairs (non-public dataset).
    • FILIP [18]: 300M image-text pairs (non-public dataset).
    • WebLI [83]: 10B images with 12B corresponding texts across 109 languages (non-public dataset).
  • Auxiliary Datasets: Some studies use additional datasets to provide more information.

    • JFT3B [204]: ~3B images with 30k noisy class hierarchy labels.
    • C4 [205]: ~750GB English text from Common Crawl.
    • Object365 [85]: Object detection dataset with 365 categories, 638K images, and ~10M bounding boxes.
    • Gold-G [86]: 0.8M human-annotated visual grounding data for object phrases.

5.1.2. Datasets for VLM Evaluation (Section 4.2 & Appendix C)

VLMs are evaluated on a wide array of standard visual recognition benchmarks to test their generalization and transfer capabilities. The following table (Table 2 from the original paper) summarizes the widely adopted datasets for VLM evaluations:

| Task | Dataset | Year | Classes | Training | Testing | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- |
| Image Classification | MNIST [88] [link] | 1998 | 10 | 60,000 | 10,000 | Accuracy |
| | Caltech-101 [89] [link] | 2004 | 101 | 3,060 | 6,085 | Mean Per Class Accuracy |
| | PASCAL VOC 2007 Classification [90] [link] | 2007 | 20 | 5,011 | 4,952 | 11-point mAP |
| | Oxford 102 Flowers [91] [link] | 2008 | 102 | 2,040 | 6,149 | Mean Per Class Accuracy |
| | CIFAR-10 [23] [link] | 2009 | 10 | 50,000 | 10,000 | Accuracy |
| | CIFAR-100 [23] [link] | 2009 | 100 | 50,000 | 10,000 | Accuracy |
| | ImageNet-1k [40] [link] | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy |
| | SUN397 [24] [link] | 2010 | 397 | 19,850 | 19,850 | Accuracy |
| | SVHN [92] [link] | 2011 | 10 | 73,257 | 26,032 | Accuracy |
| | STL-10 [93] [link] | 2011 | 10 | 1,000 | 8,000 | Accuracy |
| | GTSRB [94] [link] | 2011 | 43 | 26,640 | 12,630 | Accuracy |
| | KITTI Distance [1] [link] | 2012 | 3 | 6,770 | 713 | Accuracy |
| | IIIT5k [95] [link] | 2012 | 66 | 2,000 | 3,000 | Accuracy |
| | Oxford-IIIT PETS [26] [link] | 2012 | 37 | 3,680 | 3,669 | Mean Per Class Accuracy |
| | Stanford Cars [25] [link] | 2013 | 196 | 8,144 | 8,041 | Accuracy |
| | FGVC Aircraft [96] [link] | 2013 | 100 | 6,673 | 3,333 | Mean Per Class Accuracy |
| | Facial Emotion Recognition 2013 [97] [link] | 2013 | 8 | 2,140 | 3,574 | Accuracy |
| | Rendered SST2 [98] [link] | 2013 | 2 | 7,792 | 1,821 | Accuracy |
| | Describable Textures (DTD) [99] [link] | 2014 | 47 | 3,760 | 1,880 | Accuracy |
| | Food-101 [22] [link] | 2014 | 101 | 75,750 | 25,250 | Accuracy |
| | Birdsnap [100] [link] | 2014 | 500 | 42,283 | 4,149 | Accuracy |
| | RESISC45 [101] [link] | 2017 | 45 | 3,150 | 25,200 | Accuracy |
| | CLEVR Counts [102] [link] | 2017 | 8 | 2,000 | 500 | Accuracy |
| | PatchCamelyon [103] [link] | 2018 | 2 | 294,912 | 32,768 | Accuracy |
| | EuroSAT [104] [link] | 2019 | 10 | 10,000 | 5,000 | Accuracy |
| | Hateful Memes [27] [link] | 2020 | 2 | 8,500 | 500 | ROC AUC |
| | Country211 [10] [link] | 2021 | 211 | 43,200 | 21,100 | Accuracy |
| Image-Text Retrieval | Flickr30k [105] [link] | 2014 | -- | 31,783 | 5,000 | Recall |
| | COCO Caption [74] [link] | 2015 | -- | 82,783 | 5,000 | Recall |
| Action Recognition | UCF101 [29] [link] | 2012 | 101 | 9,537 | 1,794 | Accuracy |
| | Kinetics700 [30] [link] | 2019 | 700 | 494,801 | 31,669 | Mean(top1, top5) |
| | RareAct [28] [link] | 2020 | 122 | 7,607 | - | mWAP, mSAP |
| Object Detection | COCO 2014 Detection [106] [link] | 2014 | 80 | 83,000 | 41,000 | box mAP |
| | COCO 2017 Detection [106] [link] | 2017 | 80 | 118,000 | 5,000 | box mAP |
| | LVIS [107] [link] | 2019 | 1203 | 118,000 | 5,000 | box mAP |
| | ODinW [108] [link] | 2022 | 314 | 132,413 | 20,070 | box mAP |
| Semantic Segmentation | PASCAL VOC 2012 Segmentation [90] [link] | 2012 | 20 | 1,464 | 1,449 | mIoU |
| | PASCAL Context [109] [link] | 2014 | 459 | 4,998 | 1,449 | mIoU |
| | Cityscapes [110] [link] | 2016 | 19 | 2,975 | 500 | mIoU |
| | ADE20k [111] [link] | 2017 | 150 | 25,574 | 2,000 | mIoU |
  • Image Classification: 27 datasets covering fine-grained (e.g., Oxford-IIIT PETS [26], Stanford Cars [25]), general (e.g., ImageNet [40]), and diverse tasks (e.g., Hateful Memes [27], Country211 [10]).

  • Object Detection: 4 datasets, including COCO 2014/2017 Detection [106], LVIS [107] (for large vocabulary instance segmentation), and ODinW [108] (for task-level transfer evaluation).

  • Semantic Segmentation: 4 datasets, including PASCAL VOC 2012 Segmentation [90], PASCAL Context [109], Cityscapes [110], and ADE20k [111].

  • Image-Text Retrieval: 2 datasets, Flickr30k [105] and COCO Caption [74].

  • Action Recognition: 3 datasets, UCF101 [29], Kinetics700 [30], and RareAct [28].

    These datasets are chosen because they are standard benchmarks in computer vision, widely recognized for evaluating performance across various visual recognition tasks, ranging from simple classification to complex dense prediction and multi-modal understanding.

5.2. Evaluation Metrics

The paper uses several standard evaluation metrics, depending on the specific visual recognition task; a compact NumPy sketch of the most common ones is given after the list below.

  • Accuracy:

    • Conceptual Definition: Accuracy measures the proportion of correctly classified instances out of the total number of instances. It provides a general indication of the model's overall correctness.
    • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\mathrm{Number~of~Correct~Predictions}}{\mathrm{Total~Number~of~Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of instances where the model's predicted class matches the true class.
      • Total Number of Predictions: The total count of all instances evaluated by the model.
    • Used for: Image Classification, Action Recognition.
  • Mean Per Class Accuracy:

    • Conceptual Definition: Instead of averaging across all samples, this metric calculates the accuracy for each class individually and then averages these per-class accuracies. This is particularly useful for imbalanced datasets, as it prevents classes with many samples from dominating the overall accuracy score.
    • Mathematical Formula: $ \mathrm{Mean~Per~Class~Accuracy} = \frac{1}{N_c} \sum_{c=1}^{N_c} \mathrm{Accuracy}_c $
    • Symbol Explanation:
      • $N_c$: The total number of classes.
      • $\mathrm{Accuracy}_c$: The accuracy calculated specifically for class $c$.
    • Used for: Image Classification.
  • Mean Average Precision (mAP) / box mAP:

    • Conceptual Definition: mAP is a common metric for object detection that assesses both the precision (how many detected objects are correct) and recall (how many actual objects are detected) of a model across all object classes. It is the mean of the Average Precision (AP) scores for each class. AP itself is the area under the precision-recall curve. box mAP specifically refers to mAP calculated based on the bounding box predictions.
    • Mathematical Formula: $ \mathrm{mAP} = \frac{1}{N_c} \sum_{c=1}^{N_c} \mathrm{AP}_c $ where $\mathrm{AP}_c$ is the area under the precision-recall curve for class $c$: $ \mathrm{AP}_c = \int_0^1 P_c(R_c) \, dR_c $
    • Symbol Explanation:
      • $N_c$: The total number of object classes.
      • $\mathrm{AP}_c$: The Average Precision for class $c$.
      • $P_c(R_c)$: The precision for class $c$ at a given recall $R_c$. Precision is $\frac{\mathrm{True~Positives}}{\mathrm{True~Positives} + \mathrm{False~Positives}}$ and recall is $\frac{\mathrm{True~Positives}}{\mathrm{True~Positives} + \mathrm{False~Negatives}}$, both computed at a chosen Intersection over Union (IoU) threshold (e.g., IoU=0.5, or the range IoU=0.5:0.05:0.95 for COCO mAP).
    • Used for: Object Detection.
  • Mean Intersection over Union (mIoU):

    • Conceptual Definition: mIoU is the standard metric for semantic segmentation. It calculates the Intersection over Union (IoU) for each class and then averages these IoU values across all classes. IoU measures the overlap between the predicted segmentation mask and the ground-truth mask for a given class.
    • Mathematical Formula: $ \mathrm{IoU}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c} $ and $ \mathrm{mIoU} = \frac{1}{N_c} \sum_{c=1}^{N_c} \mathrm{IoU}_c $
    • Symbol Explanation:
      • $N_c$: The total number of semantic classes.
      • $\mathrm{IoU}_c$: The Intersection over Union for class $c$.
      • $\mathrm{TP}_c$ (True Positives for class $c$): The number of pixels correctly predicted as belonging to class $c$.
      • $\mathrm{FP}_c$ (False Positives for class $c$): The number of pixels incorrectly predicted as belonging to class $c$ (they actually belong to another class or the background).
      • $\mathrm{FN}_c$ (False Negatives for class $c$): The number of pixels that actually belong to class $c$ but were predicted as something else.
    • Used for: Semantic Segmentation.
  • Recall:

    • Conceptual Definition: Recall (also known as sensitivity or true positive rate) measures the proportion of actual positive instances that were correctly identified by the model. In retrieval tasks, it measures how many of the relevant items were successfully retrieved.
    • Mathematical Formula: $ \mathrm{Recall} = \frac{\mathrm{True~Positives}}{\mathrm{True~Positives} + \mathrm{False~Negatives}} $
    • Symbol Explanation:
      • True Positives: The number of positive instances correctly identified.
      • False Negatives: The number of positive instances incorrectly identified as negative.
    • Used for: Image-Text Retrieval.
  • ROC AUC (Receiver Operating Characteristic Area Under the Curve):

    • Conceptual Definition: ROC AUC is a performance metric for binary classification problems. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. AUC represents the area under this curve. A higher AUC value (closer to 1) indicates better discriminative power of the model, meaning it can distinguish between positive and negative classes more effectively. An AUC of 0.5 suggests performance no better than random guessing.
    • Mathematical Formula: The ROC curve plots TPR against FPR as the decision threshold varies, and AUC is the area under this curve: $ \mathrm{AUC} = \int_0^1 \mathrm{TPR} \, d\mathrm{FPR} $
    • Symbol Explanation:
      • TPR (True Positive Rate) = Recall.
      • FPR (False Positive Rate) = $\frac{\mathrm{False~Positives}}{\mathrm{False~Positives} + \mathrm{True~Negatives}}$.
      • The integral sweeps FPR from 0 to 1 as the classification threshold is varied.
    • Used for: Binary classification tasks like Hateful Memes.
  • Mean(top1, top5):

    • Conceptual Definition: For multi-class classification, top-1 accuracy is the conventional accuracy (correct prediction is the highest probability). Top-5 accuracy means the correct class is among the top 5 predicted classes. Mean(top1, top5) indicates an average of these two metrics, often used to give a broader sense of performance, particularly in tasks with many classes.
    • Mathematical Formula: $ \mathrm{Mean(top1, top5)} = \frac{\mathrm{Top1~Accuracy} + \mathrm{Top5~Accuracy}}{2} $
    • Symbol Explanation:
      • Top1 Accuracy: The percentage of samples where the model's highest-probability prediction is correct.
      • Top5 Accuracy: The percentage of samples where the true label is among the top 5 highest-probability predictions.
    • Used for: Action Recognition (Kinetics700).
  • mWAP, mSAP (Mean Weighted Average Precision, Mean Semantic Average Precision):

    • Conceptual Definition: These are specialized metrics for fine-grained action recognition, particularly relevant for datasets like RareAct [28]. They likely aim to capture the model's ability to recognize rare or complex actions, possibly by weighting based on rarity or semantic similarity. Specific formulas are usually defined by the dataset's creators.
    • Used for: Action Recognition (RareAct).
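
As a concrete reference for the formulas above, below is a small NumPy sketch of the most common metrics (accuracy, mean per-class accuracy, top-k accuracy, recall@K for retrieval, and mIoU). It is a simplified illustration rather than the official evaluation code of any benchmark, and all function names are illustrative.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def mean_per_class_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    # Average the accuracy of each class so frequent classes do not dominate.
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])

def topk_accuracy(y_true, scores, k=5):
    # scores: (N, C) class scores; a sample is correct if its true label is in the top-k.
    topk = np.argsort(scores, axis=1)[:, -k:]
    return np.mean([y in row for y, row in zip(y_true, topk)])

def recall_at_k(ranked_indices, relevant_index, k=1):
    # Image-text retrieval: fraction of queries whose ground-truth match
    # appears among the top-k retrieved items.
    return np.mean([rel in ranks[:k] for ranks, rel in zip(ranked_indices, relevant_index)])

def mean_iou(pred_mask, gt_mask, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred_mask == c, gt_mask == c).sum()
        union = np.logical_or(pred_mask == c, gt_mask == c).sum()
        if union > 0:                      # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```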

5.3. Baselines

The paper compares the proposed VLM methods against various baselines, depending on the task and context:

  • Zero-shot prediction baseline (w/o Transfer): For transfer learning evaluations, the direct zero-shot performance of a pre-trained VLM (e.g., CLIP [10]) without any fine-tuning serves as a crucial baseline, showing the inherent generalization capability of the VLM before adaptation (a minimal zero-shot classification sketch is given after this list).
  • Other VLM pre-training methods: When evaluating new VLM pre-training approaches, they are benchmarked against existing leading VLMs (e.g., ALIGN [17], FILIP [18], COCA [19]).
  • Different transfer learning strategies: For VLM transfer, various prompt tuning methods (e.g., CoOp [31], CoCoOp [32]), feature adaptation methods (e.g., Clip-Adapter [33], Tip-Adapter [34]), and other adaptation techniques are compared.
  • Knowledge distillation methods for specific tasks: In object detection and semantic segmentation, the paper compares different knowledge distillation strategies (e.g., ViLD [36], DetPro [37] for detection; CLIPSeg [175], LSeg [176] for segmentation). Often, a baseline "detector/segmenter from scratch" or "detector/segmenter with only image-level pre-training" is implied or explicitly used to show the benefit of VLM knowledge.
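
For reference, the w/o Transfer baseline amounts to prompt-based zero-shot classification. The snippet below is a minimal sketch using the Hugging Face transformers implementation of CLIP; the checkpoint name, label set, prompt template, and image path are illustrative choices rather than the exact evaluation protocol of the surveyed papers.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "car"]                     # illustrative label set
prompts = [f"a photo of a {c}" for c in class_names]    # standard prompt template
image = Image.open("example.jpg")                       # hypothetical image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: (num_images, num_prompts) image-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax(dim=-1).item()])
```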

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Performance of VLM Pre-training (Section 8.1)

The paper analyzes the performance of VLM pre-training methods primarily through zero-shot prediction on image classification, object detection, and semantic segmentation tasks.

The following figure (Figure 14 from the original paper) illustrates the performance versus data size and model size for VLM pre-training:

Fig. 14: Performance versus data size and model size. It shows that scaling up either the pre-training data [113] or the pre-training model [10] benefits VLMs consistently. The left panel plots ResNet-50 top-1 accuracy on ImageNet as the pre-training data grows; the right panel plots the top-1 accuracy of several model architectures trained on a fixed 400M pre-training set.

This figure demonstrates two key scaling laws:

  1. Data Size: As shown in the left graph (ResNet-50 backbone), increasing the pre-training data size leads to consistent improvements in Top-1 accuracy on ImageNet. For example, accuracy rises from ~40% with 10M data to ~75% with 100M data. This highlights the importance of web-scale image-text pairs.

  2. Model Size: The right graph (with 400M pre-training data) shows that scaling up model sizes (from ResNet-50 to ViT-L/14) also consistently improves Top-1 accuracy on ImageNet. For instance, ResNet-50 achieves ~60%, while ViT-L/14 reaches over 75%. This indicates that larger models have greater capacity to learn from massive datasets.

    The following are the results from Table 6 of the original paper:

    MethodsImage encoderText encoderData SizeAverage Accuracy (%) on 11 Image Classification Datasets
    ImageNetCaltech-101DTDEuroSATFood-101MNISTSUN397CIFAR-100FGVC AircraftOxford-IIIT PETSAverage
    CLIP [10]ViT-L/14Transformer400M76.295.777.593.868.478.837.255.793.592.878.3
    ALIGN [17]EfficientNetBERT1.8B76.4----------
    OTTER [112]FBNetV3-CDeCLUTR-Sci3M-----------
    DeCLIP [113]RENET-YBERT88M73.7----------
    ZeroVL [114]ViT-B/16BERT100M-----------
    FILIP [18]ViT-L/14Transformer340M77.195.775.392.273.170.860.260.792.093.090.1
    UniCL [65]Swin-tinyTransformer16.3M71.3----------
    Florence [115]CoSwinRoBERT900M83.794.677.695.177.093.255.566.495.994.786.2
    SLIP [64]ViT-LTransformer15M47.987.554.269.256.09.09.529.941.680.960.2
    PyramidCLIP [116]ResNet50T5143M47.881.553.767.865.865.012.647.283.781.765.8
    ChineseCLIP [117]ViT-L/14CNRoberta4B-96.079.7---26.251.2---
    LiT [118]ViT-L/14Transformer2M85.2----------
    AltCLIP [119]ViT-L/1470M-----------
    FLAVA [42]ViT-B/16Transformer1.1B62.691.568.679.5--75.4----
    KELIP [120]ViT-B/324.8B86.3------51.2---
    COCA [19]ViT-G/14Transformer35M48.883.454.565.859.918.05.857.133.273.950.0
    nCLIP [121]CoSwinRoBERT5813M85.8----------
    K-lite [122]ViT-B/16BART26M47.481.947.559.258.77.87.532.939.279.554.0
    NLIP [123]Transformer30M54.287.856.564.661.119.54.736.6-84.08.0
    UniCLIP [84]ViT-B/3212B85.4------69.2---
    PaLI [83]ViT-emT512B70.5----------
    CLIPPO [43]ViT-L/16ViT-L/164M27.3-31.4--------
    OneFormer [44]ViT-L/16ViT-L/16----------76.9-
    RA-CLIP [125]ViT-B/32BERT15M53.589.462.343.846.5--25.6--70.3
    LA-CLIP [126]ViT-B/32Transformer400M64.492.473.079.764.981.920.855.487.291.8-
    ALIP [127]ViT-B/32Transformer15M40.383.851.945.447.83.42.723.230.774.154.8
    GrowCLIP [128]ViT-B/16Transformer12M36.160.728.342.545.5--17.3-71.923.3

From Table 6, several key observations can be made:

  • Superior Zero-shot Performance: VLMs achieve impressive zero-shot performance across various image classification tasks, indicating their strong generalization capabilities (e.g., CLIP achieves 76.2% on ImageNet and 95.7% on Caltech-101).

  • Impact of Data Size: Models trained on larger datasets often show better performance. For instance, ALIGN (1.8B data) and Florence (900M data) show very strong results, with Florence achieving 83.7% on ImageNet. UniCLIP and PaLI with 12B data also report high performance.

  • Impact of Model Size/Architecture: While not directly compared in every row, the general trend from Figure 14 suggests that larger, more capable architectures (e.g., ViT-L/14 vs. ResNet-50) contribute to better performance. FILIP (ViT-L/14, 340M) shows consistent high performance across many tasks (Average 90.1%), possibly due to its fine-grained vision-language pre-training. Florence and nCLIP also achieve very high ImageNet scores with powerful backbones.

  • Diversity of VLMs: Different VLMs excel on different subsets of tasks, indicating the impact of their specific pre-training objectives and data. For example, FILIP has high average accuracy, while Florence and nCLIP show very high ImageNet scores.

  • Challenges for Small Datasets/Specific Tasks: Some VLMs struggle on certain datasets, indicating that generalization is not uniform. For instance, SLIP, COCA, K-lite, NLIP, ALIP, GrowCLIP show lower performance on datasets like MNIST, SUN397, which might require different inductive biases or more specific fine-tuning.

    The superior generalization of VLMs is attributed to:

  1. Big Data: Training on millions or billions of image-text pairs from the internet allows VLMs to cover vast visual and language concepts.

  2. Big Model: Large models (e.g., ViT-G in COCA with 2B parameters) provide the capacity to learn from such extensive data.

  3. Task-agnostic Learning: Language supervision from image-text pairs is diverse and general-purpose, leading to models that generalize well across various downstream tasks, unlike task-specific labels.

    The following are the results from Table 7 of the original paper:

| Method | Image encoder | Text encoder | Data size | VOC [90] | PASCAL Context [109] | COCO [106] |
| --- | --- | --- | --- | --- | --- | --- |
| GroupViT [129] | ViT | Transformer | 26M | 52.3 | 22.4 | - |
| SegCLIP [46] | ViT | Transformer | 3.4M | 52.6 | 24.7 | 26.5 |

For semantic segmentation, GroupViT and SegCLIP demonstrate the capability of VLMs for dense prediction, achieving mIoU scores in the 50s on PASCAL VOC. The performance on PASCAL Context and COCO is lower, reflecting the increased complexity of these datasets and the nascent stage of VLMs for dense tasks.

The following are the results from Table 8 of the original paper:

| Method | Image encoder | Text encoder | Data size | COCO [106] | LVIS [107] | LVIS Mini. [107] |
| --- | --- | --- | --- | --- | --- | --- |
| RegionCLIP [131] | ResNet50x4 | Transformer | 118k | 29.6 | 11.3 | - |
| GLIP [67] | Swin-L | BERT | 27.43M | 49.8 | 26.9 | 34.3 |
| FIBER [71] | Swin-B | RoBERTa | 4M | 49.3 | - | 32.2 |
| DetCLIP [45] | Swin-L | BERT | 2.43M | - | 35.9 | - |

For object detection, GLIP and FIBER show strong mAP scores on COCO and LVIS, with GLIP achieving 49.8% on COCO and 26.9% on LVIS. These results indicate that VLMs can enable effective zero-shot prediction for dense tasks, especially when equipped with local VLM pre-training objectives like region-word matching. The field for dense visual tasks is relatively underexplored, hence the limited number of VLMs presented.

Limitations of VLMs (from Section 8.1):

  • Performance Saturation: Beyond a certain point, further scaling up data or model size yields diminishing returns.
  • Computational Expense: Large-scale VLM pre-training requires immense computational resources (e.g., CLIP ViT-L needed 256 V100 GPUs for 288 hours).
  • Overhead in Inference: Large models introduce substantial computation and memory overheads during both training and inference.

6.1.2. Performance of VLM Transfer Learning (Section 8.2)

Transfer learning methods adapt pre-trained VLMs to downstream tasks using various fine-tuning strategies. The evaluations are primarily on image classification datasets, often using a 16-shot setup for few-shot supervised methods.

The following are the results from Table 9 of the original paper:

MethodsImage encoderSetupAverage Accuracy (%) on 11 Image Classification Datasets
ImageNetCaltech-101DTDEuroSATFood-101MNISTSUN397CIFAR-100FGVC AircraftOxford-IIIT PETSAverage
Baseline [143]ResNet-50w/o Transfer59.290.960.386.185.855.666.177.316.960.241.662.7
Baseline [10]ViT-B/16w/o Transfer71.795.470.292.894.168.674.890.631.172.256.473.5
Baseline [10]ViT-L/14w/o Transfer73.796.476.293.794.578.878.393.837.268.455.776.9
CoOp [31]ViT-B/16Few-shot Sup.71.695.871.995.896.468.174.185.228.772.554.267.5
CoCoOp [32]ViT-B/16Few-shot Sup.75.895.873.196.497.772.081.791.027.778.364.877.6
SubPT [132]ResNet50Few-shot Sup.66.491.763.491.791.860.773.881.020.370.254.768.1
LASP [133]ViT-B/16Few-shot Sup.76.195.873.095.895.772.281.690.531.677.862.876.8
ProDA [134]ResNet50Few-shot Sup.-91.365.396.490.075.595.582.436.6-70.184.3
VPT [135]ViT-B/16Few-shot Sup.77.496.873.4-96.873.181.191.634.778.567.3-
ProGrad [136]ViT-B/16Few-shot Sup.67.991.562.191.593.462.7-----79.0
CPL [137]ResNet-50Few-shot Sup.-96.3----78.791.021.970.357.868.5
PLOT [138]ViT-B/16Few-shot Sup.73.992.263.087.294.872.877.184.534.570.065.677.3
CuPL [160]ViT-L/14Unsupervised68.491.461.189.571.076.693.336.1----
UPL [143]ResNet-50Unsupervised64.894.269.087.866.969.077.984.721.766.455.170.2
TPT [144]ViT-B/16Unsupervised71.094.2-87.866.969.077.984.724.866.447.871.0
VP [147]ViT-B/32Few-shot Sup.----85.0-70.378.9-65.557.160.8
UPT [149]ViT-B/16Few-shot Sup.76.296.173.296.196.371.881.091.334.578.765.677.2
MaPLE [151]ViT-B/16Few-shot Sup.78.696.073.596.096.673.582.691.436.579.768.280.8
CAVPT [152]ViT-B/16Few-shot Sup.83.296.172.593.588.297.685.057.974.372.692.185.3
Tip-Adapter [34]ViT-B/16Few-shot Sup.--70.8---------
SuS-X [154]ResNet-50Unsupervised--66.8--------50.6
SgVA-CLIP [156]ViT-B/16Few-shot Sup.--73.3------76.4--
VT-Clip [157]ResNet-50Few-shot Sup.----93.1-----65.7-
CALIP [158]ResNet-50Unsupervised59.487.760.658.677.466.456.317.786.242.438.961.7
Wise-FT [162]ViT-L/14Supervised--87.1---------
KgCoOp [145]ViT-B/16Few-shot Sup.74.494.670.193.271.990.686.532.471.758.371.078.4
ProTeCt [146]ViT-B/16Few-shot Sup.83.296.574.693.7-----74.5--
RePrompt [148]ViT-B/16Few-shot Sup.75.793.4-87.885.097.187.450.377.573.767.186.4
TaskRes [159]ResNet-50Few-shot Sup.--65.0---------
VCD [161]ViT-B/16Unsupervised68.086.9-76.896.0-77.688.536.370.6--

From Table 9, the following conclusions can be drawn:

  • Consistent Improvement from Transfer Learning: Transfer learning setups consistently improve performance on downstream tasks compared to the w/o Transfer baselines. For example, on ImageNet, Wise-FT (supervised) boosts accuracy by 10.9% (vs. ViT-L/14 baseline), CoCoOp (few-shot supervised) by 1.7%, and UPL (unsupervised) by 0.8%. This validates the effectiveness of adaptation mechanisms in mitigating domain gaps.
  • Gap between Few-shot and Supervised Transfer: Few-shot supervised transfer methods (e.g., CoCoOp at 75.8% on ImageNet) generally lag behind fully supervised transfer (e.g., Wise-FT at 87.1%). This is largely due to the risk of overfitting to the limited labelled samples in few-shot settings.
  • Promising Unsupervised Transfer: Unsupervised transfer methods (e.g., UPL at 64.8% on ImageNet, TPT at 71.0% on ImageNet) can achieve performance comparable to or even better than few-shot supervised transfer methods. For instance, UPL (ResNet-50) outperforms CoOp (ViT-B/16) in average accuracy (70.2% vs. 67.5%). This is attributed to access to massive unlabelled data, which reduces overfitting risks, and it highlights unsupervised transfer as a promising direction despite challenges such as noisy pseudo-labels.
  • Effectiveness of Prompt Tuning and Adapters: Methods like CoCoOp, LASP, UPT, MaPLE, CAVPT (all prompt tuning) and Wise-FT (fine-tuning) show strong results, indicating the efficacy of parameter-efficient adaptation strategies (a minimal CoOp-style prompt-tuning sketch is given after this list).
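
To illustrate what prompt tuning actually optimizes, below is a minimal CoOp-style sketch, not the original implementation: a small set of learnable context vectors is prepended to the frozen class-name token embeddings, the frozen text encoder maps each prompted sequence to a class embedding, and only the context vectors receive gradients. The encoder interface, dimensions, and temperature are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompts(nn.Module):
    """CoOp-style continuous prompts: [ctx_1 ... ctx_M, class-name tokens]."""

    def __init__(self, class_token_embeds, n_ctx=16, dim=512):
        super().__init__()
        # class_token_embeds: (C, L, D) frozen token embeddings of the class names.
        self.register_buffer("class_tokens", class_token_embeds)
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # the only trainable part

    def forward(self):
        num_classes = self.class_tokens.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(num_classes, -1, -1)  # (C, M, D)
        return torch.cat([ctx, self.class_tokens], dim=1)        # (C, M+L, D)

def prompt_tuning_loss(image_feat, prompts, text_encoder, labels, tau=0.01):
    # image_feat: (B, D) features from the frozen image encoder.
    # text_encoder (assumed frozen) maps prompted token sequences to one embedding per class.
    text_feat = text_encoder(prompts())                          # (C, D); gradients reach ctx only
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = image_feat @ text_feat.t() / tau                    # (B, C)
    return F.cross_entropy(logits, labels)
```

Only the few thousand parameters in `ctx` are updated, which is why such methods adapt well from a handful of labelled samples.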

6.1.3. Performance of VLM Knowledge Distillation (Section 8.3)

Knowledge distillation transfers general VLM knowledge to task-specific models, particularly for dense prediction tasks like object detection and semantic segmentation.

The following are the results from Table 10 of the original paper:

| Method | Vision-Language Model | APbase (COCO [106]) | APnovel (COCO) | AP (COCO) | APr (LVIS [107]) | APc (LVIS) | APf (LVIS) | AP (LVIS) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline [36] | - | 28.3 | 26.3 | 27.8 | 19.5 | 19.7 | 17.0 | 18.6 |
| ViLD [36] | CLIP ViT-B/32 | 59.5 | 27.6 | 51.3 | 16.7 | 26.5 | 34.2 | 27.8 |
| DetPro [37] | CLIP ViT-B/32 | - | - | 34.9 | 20.8 | 27.8 | 32.4 | 28.4 |
| HierKD [186] | CLIP ViT-B/32 | 53.5 | 27.3 | - | - | - | - | - |
| RKD [187] | CLIP ViT-B/32 | 56.6 | 36.9 | 51.0 | 21.1 | 25.0 | 29.1 | 25.9 |
| PromptDet [188] | CLIP Transformer | - | 26.6 | 50.6 | 21.4 | 23.3 | 29.3 | 25.3 |
| PB-OVD [189] | CLIP Transformer | 46.1 | 30.8 | 42.1 | - | - | - | - |
| CondHead [190] | CLIP ViT-B/32 | 60.8 | 29.8 | 49.0 | 18.8 | 28.3 | 33.7 | 28.8 |
| VLDet [191] | CLIP Transformer | 50.6 | 32.0 | 45.8 | 26.3 | 39.4 | 41.9 | 38.1 |
| F-VLM [192] | CLIP ResNet-50 | - | 28.0 | 39.6 | 32.8 | - | - | 34.9 |
| OV-DETR [173] | CLIP ViT-B/32 | 52.7 | 29.4 | 61.0 | 17.4 | 25.0 | 32.5 | 26.6 |
| Detic [193] | CLIP Transformer | 45.0 | 27.8 | 47.1 | 17.8 | 26.3 | 31.6 | 26.8 |
| OWL-ViT [195] | CLIP ViT-B/32 | - | - | 28.1 | 18.9 | - | - | 22.1 |
| VL-PLM [196] | CLIP ViT-B/32 | 60.2 | 34.4 | 53.5 | - | - | - | 22.2 |
| P3OVD [197] | CLIP ResNet-50 | 51.9 | 31.5 | 46.6 | - | - | - | 10.6 |
| RO-ViT [199] | CLIP ViT-L/16 | - | 33.0 | 47.7 | 32.1 | - | - | 34.0 |
| BARON [200] | CLIP ResNet-50 | 54.9 | 42.7 | 51.7 | 23.2 | 29.3 | 32.5 | 29.5 |
| OADP [201] | CLIP ViT-B/32 | 53.3 | 30.0 | 47.2 | 21.9 | 28.4 | 32.0 | 28.7 |

For object detection, VLM knowledge distillation brings clear performance improvements:

  • Significant Gains over Baseline: Methods like ViLD (CLIP ViT-B/32) dramatically improve AP on COCO from a baseline of 27.8% to 51.3%, and on LVIS from 18.6% to 27.8%. This highlights the power of distilling VLM's broad object understanding.

  • Handling Novel Categories: Many methods show strong APnovel performance, indicating their ability to detect objects from categories not seen during the detector's base training, leveraging VLM's open-vocabulary capabilities. For example, BARON achieves 42.7% APnovel on COCO.

  • Diverse Approaches: The table shows a variety of distillation techniques, from feature-space distillation (ViLD, RKD) to pseudo-labelling (PB-OVD, P3OVD), each contributing to advancing open-vocabulary object detection (a minimal feature-space distillation sketch is given after this list).

  • Performance on LVIS: LVIS is a challenging dataset with a long-tail distribution and many rare categories. Methods like VLDet achieve higher APr (rare categories) and APc (common categories) compared to ViLD, demonstrating improved performance on fine-grained and long-tail distributions.
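
Below is a minimal sketch of the feature-space distillation idea used by methods such as ViLD, under simplifying assumptions (precomputed CLIP embeddings for cropped region proposals and for base-class prompts; all names and loss weights are illustrative). The detector's region embeddings are pulled toward CLIP's image embeddings, while classification is done by similarity to text embeddings, which is what enables novel-category detection at test time.

```python
import torch
import torch.nn.functional as F

def vild_style_losses(region_embed, clip_region_embed, text_embed, labels, tau=0.01):
    """region_embed:      (R, D) embeddings predicted by the detector head.
    clip_region_embed: (R, D) CLIP image embeddings of the cropped proposals (teacher).
    text_embed:        (C, D) CLIP text embeddings of base-class prompts.
    labels:            (R,) ground-truth base-class index per proposal.
    """
    region_embed = F.normalize(region_embed, dim=-1)
    clip_region_embed = F.normalize(clip_region_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)

    # 1) Distillation: align the detector's region embeddings with CLIP's.
    distill_loss = F.l1_loss(region_embed, clip_region_embed)

    # 2) Classification: score regions against text embeddings of the base classes.
    logits = region_embed @ text_embed.t() / tau
    cls_loss = F.cross_entropy(logits, labels)
    return distill_loss + cls_loss

# At test time, novel categories are handled by swapping in text embeddings of
# the novel class names and taking the argmax over the resulting logits.
```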

    The following are the results from Table 11 of the original paper:

| Method | Vision-Language Model | A-847 [111] | PC-459 [109] | A-150 [111] | PC-59 [109] | PAS-20 [90] | C-19 [110] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline [203] | - | - | - | - | 24.3 | 18.3 | - |
| LSeg [35] | CLIP ResNet-101 | - | - | - | - | 47.4 | - |
| ZegFormer [176] | CLIP ResNet-50 | - | - | 16.4 | - | 80.7 | - |
| OVSeg [179] | CLIP Swin-B | 9.0 | 12.4 | 29.6 | 55.7 | 94.5 | - |
| ZSSeg [180] | CLIP ResNet-101 | 7.0 | - | 20.5 | 47.7 | - | 34.5 |
| OpenSeg [181] | CLIP Eff-B7 | 6.3 | 9.0 | 21.1 | 42.1 | - | - |
| ReCo [182] | CLIP ResNet-101 | - | - | - | - | - | 24.2 |
| FreeSeg [185] | CLIP ViT-B/16 | - | - | 39.8 | - | 86.9 | - |

For semantic segmentation, VLM knowledge distillation also shows clear benefits:

  • Substantial mIoU Improvement: Compared to the baseline (e.g., 18.3% on PAS-20), methods like ZegFormer (80.7%) and FreeSeg (86.9%) achieve very high mIoU scores on PASCAL VOC 2012 Segmentation. This indicates successful transfer of VLM's semantic understanding to pixel-level tasks.
  • Generalization to Diverse Datasets: While performance varies (e.g., lower on ADE20k and PASCAL Content with many classes), these methods demonstrate the ability to segment pixels for categories beyond what the base segmentation model was trained on.
  • Task-Specific Designs: The strong performance is largely due to the ability of knowledge distillation to combine VLM's broad knowledge with the strengths of specialized segmentation architectures.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Dataset | Year | Num. of Image-Text Pairs | Language | Public |
| --- | --- | --- | --- | --- |
| SBU Caption [73] [link] | 2011 | 1M | English | ✓ |
| COCO Caption [74] [link] | 2016 | 1.5M | English | ✓ |
| Yahoo Flickr Creative Commons 100 Million (YFCC100M) [75] [link] | 2016 | 100M | English | ✓ |
| Visual Genome (VG) [76] [link] | 2017 | 5.4M | English | ✓ |
| Conceptual Captions (CC3M) [77] [link] | 2018 | 3.3M | English | ✓ |
| Localized Narratives (LN) [78] [link] | 2020 | 0.87M | English | ✓ |
| Conceptual 12M (CC12M) [79] [link] | 2021 | 12M | English | ✓ |
| Wikipedia-based Image Text (WIT) [80] [link] | 2021 | 37.6M | 108 Languages | ✓ |
| Red Caps (RC) [81] [link] | 2021 | 12M | English | ✓ |
| LAION400M [21] [link] | 2021 | 400M | English | ✓ |
| LAION5B [20] [link] | 2022 | 5B | Over 100 Languages | ✓ |
| WuKong [82] [link] | 2022 | 100M | Chinese | ✓ |
| CLIP [10] | 2021 | 400M | English | ✗ |
| ALIGN [17] | 2021 | 1.8B | English | ✗ |
| FILIP [18] | 2021 | 300M | English | ✗ |
| WebLI [83] | 2022 | 12B | 109 Languages | ✗ |

6.3. Ablation Studies / Parameter Analysis

The paper primarily focuses on surveying and benchmarking existing work rather than conducting new ablation studies. However, the discussions in Section 8 provide indirect insights into parameter analysis and the effectiveness of different components:

  • Scaling Laws (Section 8.1, Figure 14): The analysis of how performance scales with data size and model size implicitly acts as a form of ablation study, demonstrating that these are critical factors for VLM effectiveness. It shows that both larger datasets and larger models consistently improve zero-shot accuracy, but also notes that performance eventually saturates.

  • Comparison of Objectives (Section 5, Table 3): The categorization of VLM pre-training methods by contrastive, generative, and alignment objectives, and the observation that often hybrid objectives (e.g., FLAVA [42] uses Con, Gen, Align) lead to more robust VLMs, suggests that combining different learning signals is beneficial. The discussion highlights that contrastive objectives learn discriminative features, generative objectives learn contextual information, and alignment objectives are easy-to-optimize for global/local matching, but often work best when auxiliary to others.

  • Frameworks (Section 3.3, Figure 5): The discussion on two-tower, two-leg, and one-tower frameworks points to architectural design choices that affect efficiency and cross-modal communication. While no direct comparative ablation is provided, the existence of different frameworks suggests ongoing research into optimal VLM architectures.

  • Transfer Learning Setups (Section 6.2, 8.2): The comparison between supervised, few-shot supervised, and unsupervised transfer setups (Table 9) serves as an implicit ablation. It shows that even unsupervised transfer can be competitive with few-shot supervised transfer, highlighting the potential for annotation-efficient methods and the importance of exploiting unlabelled data. The observation that few-shot methods can overfit to limited samples indicates a sensitivity to training data size in the adaptation phase.

  • Knowledge Distillation Approaches (Section 7.2, 8.3): The differentiation between feature-space distillation and pseudo-labelling distillation for object detection and semantic segmentation implicitly evaluates the effectiveness of different knowledge transfer mechanisms. The results show that both are viable for improving dense prediction tasks, indicating that the choice of distillation strategy can be adapted to task-specific needs.

    In essence, while formal ablation studies with controlled experiments and distinct tables are not present, the survey's comparative analysis across numerous papers provides a high-level overview of how different methodological choices (objectives, scale, transfer mechanisms) impact VLM performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey provides a comprehensive review of Vision-Language Models (VLMs) for visual recognition tasks, ranging from image classification to object detection and semantic segmentation. It elucidates how VLMs address the limitations of traditional visual recognition paradigms, primarily the heavy reliance on labor-intensive labelled data and task-specific model training. By leveraging web-scale image-text pairs and learning rich vision-language correlations, VLMs enable impressive zero-shot predictions and open-vocabulary visual recognition.

The paper systematically dissects VLMs through their foundational elements (network architectures, pre-training objectives, frameworks), datasets, and three main technical approaches: VLM pre-training, VLM transfer learning, and VLM knowledge distillation. Key findings highlight the consistent performance gains from scaling up pre-training data and model size, the effectiveness of transfer learning in bridging domain gaps for downstream tasks (especially unsupervised transfer), and the utility of knowledge distillation in empowering dense prediction models with VLM's generalized knowledge. The survey provides extensive benchmarking and discussion, offering a clear picture of the current state and future potential of VLMs in visual recognition.

7.2. Limitations & Future Work

The authors identify several research challenges and potential future directions for VLMs:

7.2.1. For VLM Pre-training

  • Fine-grained Vision-Language Correlation Modelling: Current VLMs often excel at image-level understanding but lack detailed local vision-language correspondence needed for dense prediction tasks (object detection, semantic segmentation). Future work should focus on methods like region-word matching to enhance zero-shot dense prediction.
  • Unification of Vision and Language Learning: Most VLMs use separate encoders (two-tower frameworks). Moving towards one-tower architectures (e.g., using a single Transformer for both modalities) could improve cross-modal communication and training efficiency, leading to more sustainable VLMs.
  • Pre-training VLMs with Multiple Languages: Most existing VLMs are English-centric, which can introduce cultural and regional biases and limit global applicability. Developing multilingual VLMs can enable learning diverse visual characteristics associated with different languages, making VLMs effective across various linguistic scenarios.
  • Data-efficient VLMs: The enormous computational resources required for training large-scale VLMs are a significant concern for sustainability. Research into data-efficient VLMs that extract more useful information from limited image-text data (e.g., via supervision among image-text pairs) is crucial.
  • Pre-training VLMs with LLMs: Leveraging Large Language Models (LLMs) to augment text captions (e.g., generating richer synthetic descriptions) can enhance VLM pre-training by providing more diverse and informative language supervision, improving vision-language correlation.

7.2.2. For VLM Transfer Learning

  • Unsupervised VLM Transfer: While most transfer methods require labelled data (supervised or few-shot supervised), unsupervised VLM transfer (using unlabelled data) is more annotation-efficient and has lower overfitting risks. More research is needed in this promising but challenging area.
  • VLM Transfer with Visual Prompt/Adapter: The focus has largely been on text prompt learning. Exploring visual prompt learning or visual adapters is essential as they are complementary to text prompting and can facilitate pixel-level adaptation for dense prediction tasks.
  • Test-time VLM Transfer: Current transfer methods often involve repetitive fine-tuning for each downstream task. Test-time VLM transfer, which adapts prompts on the fly during inference, could circumvent this inefficiency.
  • VLM Transfer with LLMs: Exploiting LLMs to automatically generate customized text prompts for downstream tasks is an automatic and data-efficient approach compared to manual prompt engineering or learnable prompts.

7.2.3. For VLM Knowledge Distillation

  • Knowledge Distillation from Multiple VLMs: Exploring how to coordinate knowledge distillation from several different VLMs could harness their synergistic effects for improved performance.
  • Knowledge Distillation for Other Visual Recognition Tasks: Beyond object detection and semantic segmentation, VLM knowledge distillation could be extended to other complex tasks like instance segmentation, panoptic segmentation, person re-identification, etc.

7.3. Personal Insights & Critique

This survey offers an exceptionally clear and comprehensive overview of the rapidly evolving field of Vision-Language Models for visual recognition. Its structured approach, detailed explanations of foundations and methodologies, and thorough benchmarking are invaluable for both beginners and experienced researchers. The emphasis on the transition from traditional, data-hungry paradigms to web-scale, zero-shot learning with VLMs effectively highlights the paradigm shift.

Inspirations and Applications:

  • Efficiency in AI Development: The zero-shot capability of VLMs is a game-changer for applications where labelled data is scarce or new categories constantly emerge (e.g., specialized medical imaging, niche product classification in e-commerce, real-time event monitoring).
  • Foundation Models Beyond Vision: The success of VLMs hints at the broader potential of multimodal foundation models that can integrate information from even more modalities (e.g., audio, 3D data) for even more holistic AI understanding.
  • Customization and Personalization: The prompt tuning and adapter mechanisms offer powerful ways to customize general VLMs for specific user needs or domains with minimal effort and data, opening doors for highly personalized AI experiences.
  • Cross-Cultural AI: The call for multilingual VLMs is particularly inspiring, emphasizing the need for AI to be inclusive and context-aware across diverse cultures and languages, moving beyond English-centric biases.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Interpretability and Bias: While VLMs perform incredibly well, their black-box nature remains a concern. How exactly do they learn vision-language correlations? What biases from the vast, unstructured web data are embedded in these models, and how do these biases affect predictions, especially in sensitive applications? The paper acknowledges bias in multilingual VLMs but could expand on inherent dataset biases more broadly.

  • Computational Cost and Environmental Impact: The immense computational resources required for pre-training VLMs are a significant barrier for many researchers and raise environmental sustainability concerns. While data-efficient VLMs are mentioned as a future direction, the immediate practical challenges of training these big models are substantial.

  • Robustness to Adversarial Attacks and Out-of-Distribution Data: VLMs trained on diverse web data might be more robust in some ways, but their performance on out-of-distribution data or against adversarial attacks is an important area for deeper investigation. How do their zero-shot capabilities fare under such challenging conditions?

  • Fine-grained Grounding in Complex Scenes: While region-word matching is a promising direction, fully understanding and localizing every semantic entity in a complex scene, especially with overlapping objects or abstract concepts, remains a significant challenge. The current dense prediction results (Tables 7 & 8) show lower performance than classification, indicating this is indeed a hard problem.

  • Dynamic Language and Visual Concepts: Both language and visual concepts can evolve. How well can static VLMs adapt to new jargon, emerging visual styles, or real-world changes without continuous re-training? Test-time VLM transfer offers a partial solution, but the long-term adaptability is a complex issue.

    Overall, this survey serves as an excellent resource for anyone seeking to understand the landscape of VLMs for visual recognition. It not only summarizes current achievements but also critically points towards the necessary advancements for building more robust, efficient, and universally applicable visual intelligence systems.
