Vision-Language Models for Vision Tasks: A Survey
TL;DR Summary
This survey reviews Vision-Language Models (VLMs) for visual tasks, addressing challenges of crowd-label dependency in DNNs and training inefficiency. It analyzes network architectures, pre-training objectives, and existing methods, offering insights for future research.
Abstract
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently, which learns rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enables zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of visual language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in the future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Vision-Language Models for Vision Tasks: A Survey
1.2. Authors
-
Jingyi Zhang† (Research Associate and Ph.D. student, School of Computer Science and Engineering, Nanyang Technological University (NTU))
-
Jiaxing Huang† (Research Associate and Ph.D. student, School of Computer Science and Engineering, NTU)
-
Sheng Jin (Research Fellow, Nanyang Technological University (NTU))
-
Shijian Lu* (Associate Professor, School of Computer Science and Engineering, Nanyang Technological University, Singapore)
(† denotes equal contribution, * denotes corresponding author)
1.3. Journal/Conference
This paper was published on arXiv, a preprint server, with a publication date of 2023-04-03. While arXiv hosts preprints before formal peer review, it is a highly influential platform in computer science and related fields, where much cutting-edge research first appears. Its widespread availability allows for broad dissemination and feedback within the research community.
1.4. Publication Year
2023
1.5. Abstract
This survey addresses two major challenges in visual recognition: the heavy reliance on crowd-labelled data for training deep neural networks (DNNs) and the common practice of training a separate DNN for each visual recognition task, which is laborious and time-consuming. To overcome these, Vision-Language Models (VLMs) have emerged, learning rich vision-language correlations from web-scale image-text pairs. A single VLM can then perform zero-shot predictions across various visual recognition tasks without task-specific fine-tuning. The paper systematically reviews VLMs for visual recognition by covering: (1) the historical development of visual recognition paradigms; (2) the foundational elements of VLMs, including network architectures, pre-training objectives, and downstream tasks; (3) the datasets used for VLM pre-training and evaluation; (4) a categorization of existing VLM pre-training, transfer learning, and knowledge distillation methods; (5) a benchmarking, analysis, and discussion of these methods; and (6) an outline of future research challenges and directions.
1.6. Original Source Link
https://arxiv.org/abs/2304.00685 (Preprint) PDF Link: https://arxiv.org/pdf/2304.00685v2.pdf
2. Executive Summary
2.1. Background & Motivation
The paper addresses fundamental inefficiencies and limitations in traditional visual recognition research.
-
Data Scarcity and Laborious Annotation: Most deep neural network (DNN) training for visual recognition tasks (like image classification, object detection, and semantic segmentation) heavily relies on large volumes of meticulously crowd-labelled data. Collecting such datasets is a laborious and time-consuming process.
-
Task-Specific Model Training: Typically, a separate DNN is trained for each distinct visual recognition task. This single-task, single-model paradigm leads to redundancy and inefficiency, especially as the number of visual tasks grows.
The core problem, therefore, is a laborious and time-consuming visual recognition paradigm that hinders scalability and efficiency. This problem is crucial because visual recognition is a cornerstone of numerous applications, from autonomous driving to robotics. Existing paradigms such as Deep Learning from Scratch and Supervised/Unsupervised Pre-training, Fine-tuning and Prediction have improved convergence and data efficiency, but they still require a fine-tuning stage with task-specific labelled data.
The paper's innovative idea and entry point is the Vision-Language Model Pre-training and Zero-shot Prediction paradigm. Inspired by advances in natural language processing (NLP), this paradigm proposes pre-training a single VLM on web-scale image-text pairs (which are abundantly available) to learn rich vision-language correlation. This pre-trained VLM can then perform zero-shot predictions on various downstream visual tasks, eliminating the need for task-specific fine-tuning. This promises to be simple to implement yet performs incredibly well, as demonstrated by models like CLIP.
2.2. Main Contributions / Findings
The paper makes three primary contributions:
-
Systematic Review of VLMs for Visual Recognition: It provides the first comprehensive survey focusing specifically on VLMs for visual recognition tasks (image classification, object detection, semantic segmentation). It offers a holistic view of this emerging field, summarizing and categorizing existing studies.
-
Up-to-date Progress Benchmarking and Discussion: The survey includes a comprehensive benchmarking of current VLM-based visual recognition methods across multiple public datasets. This provides a clear picture of their performance and facilitates comparison.
-
Identification of Research Challenges and Future Directions: It highlights several open research challenges and suggests promising future research directions for VLMs in visual recognition, guiding future studies in the field.
Key conclusions and findings reached by the paper include:
- VLMs, especially those pre-trained with contrastive objectives like CLIP, can learn highly transferable visual models from natural language supervision, achieving superior zero-shot performance on diverse visual recognition tasks.
- Scaling up both pre-training data and model size generally leads to consistent improvements in VLM performance, though this trend eventually saturates.
- VLM transfer learning (e.g., prompt tuning, feature adaptation) effectively mitigates domain gaps between pre-trained VLMs and downstream tasks, consistently improving performance. Unsupervised VLM transfer shows particular promise due to its annotation efficiency.
- VLM knowledge distillation is effective for transferring general VLM knowledge to task-specific models, especially for dense prediction tasks like object detection and semantic segmentation, retaining the benefits of VLM generalization while leveraging specialized architectures.
- The field still faces challenges such as the need for fine-grained vision-language correlation modelling, unification of vision and language learning, multilingual VLM pre-training, data-efficient VLMs, and deeper integration with Large Language Models (LLMs).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following foundational concepts:
- Deep Neural Networks (DNNs): These are a class of artificial neural networks with multiple layers (hence "deep") between the input and output layers. They learn representations of data with multiple levels of abstraction. When fed with data, they learn to identify patterns and make predictions or classifications.
- Visual Recognition: This is a broad field in computer vision concerned with enabling computers to identify and understand objects, scenes, and activities in images and videos. Key tasks include:
- Image Classification: Assigning a single category label to an entire image (e.g., "cat," "dog," "car").
- Object Detection: Identifying and localizing multiple objects within an image by drawing bounding boxes around them and assigning a category label to each (e.g., "dog" at coordinates X, Y, W, H).
- Semantic Segmentation: Assigning a category label to each pixel in an image, effectively outlining objects and regions with pixel-level precision (e.g., marking all pixels belonging to the "sky," "road," or "car").
- Pre-training, Fine-tuning, and Zero-shot Prediction: These are paradigms for training and deploying machine learning models:
- Pre-training: An initial training phase where a model is trained on a very large dataset (often general-purpose and unlabelled or weakly labelled) to learn broad, transferable features.
- Fine-tuning: After pre-training, the model's weights are further adjusted (fine-tuned) on a smaller, task-specific dataset with labelled data. This adapts the pre-trained general knowledge to a specific downstream task.
- Zero-shot Prediction: The ability of a model to make predictions for categories or tasks it has never explicitly seen during training. In the context of VLMs, this means classifying an image into a category just by reading its textual description, without any labelled examples of that category in the training data.
- Vision-Language Models (VLMs): These are neural networks designed to process and understand information from both visual (images/videos) and textual (natural language) modalities. They learn to find correlations and alignments between images and text, enabling them to perform tasks that involve both vision and language.
- Transformers: An attention-based neural network architecture introduced in 2017, initially for natural language processing. Unlike recurrent neural networks, Transformers process entire sequences at once, making them highly efficient for parallel computation. Their core mechanism, self-attention, allows them to weigh the importance of different parts of the input sequence relative to each other, capturing long-range dependencies effectively. Transformers have since been adapted for computer vision (Vision Transformers, ViT) and multi-modal tasks.
- Contrastive Learning: A self-supervised learning paradigm where a model learns by contrasting different samples. It aims to pull "positive pairs" (e.g., different augmented views of the same image, or an image and its matching text) closer together in the embedding space while pushing "negative pairs" (e.g., an image and a non-matching text, or different images) farther apart. This encourages the model to learn discriminative features.
- Generative Models: A class of models that learn the underlying distribution of data and can then generate new data samples that resemble the training data. In VLMs, this might involve generating text descriptions from images or even images from text.
- Transfer Learning: A machine learning technique where a model trained on one task (the source task) is reused as the starting point for a model on a second task (the target task). It is particularly useful when the target task has limited labelled data, as the pre-trained model provides a strong set of initial features.
- Knowledge Distillation: A technique where a smaller, simpler model (the "student") is trained to mimic the behavior of a larger, more complex model (the "teacher"). The teacher's "knowledge" (e.g., its predictions, intermediate features) is transferred to the student, allowing the student to achieve performance comparable to the teacher with lower computational cost.
- Large Language Models (LLMs): Very large neural networks trained on massive amounts of text data, capable of understanding, generating, and processing human language with remarkable fluency and coherence (e.g., GPT-3, GPT-4). They are increasingly used to enhance VLMs by providing rich linguistic knowledge.
3.2. Previous Works
The paper contextualizes VLMs within the broader evolution of visual recognition paradigms:
- Traditional Machine Learning and Prediction: Before deep learning, visual recognition relied on feature engineering by domain experts (e.g., SIFT [38], HoG features) and lightweight learning models such as Support Vector Machines (SVM) [8] or k-Nearest Neighbors (k-NN) [7]. This approach lacked scalability and required specialized feature design for each task.
- Deep Learning from Scratch and Prediction: The advent of deep learning, exemplified by AlexNet [4] (a Convolutional Neural Network (CNN)), VGG [5], and ResNet [6], enabled end-to-end learning of features directly from data. This paradigm achieved unprecedented performance on benchmarks like ImageNet [40]. However, it introduced two challenges: slow convergence and the laborious collection of large-scale, task-specific, crowd-labelled data.
- Supervised Pre-training, Fine-tuning and Prediction: To address slow convergence and data needs, researchers discovered that features learned from large-scale labelled datasets (like ImageNet) can be transferred to downstream tasks [11]. A DNN is pre-trained on a large labelled dataset (e.g., ImageNet) with a supervised loss and then fine-tuned on smaller, task-specific labelled data. This accelerates training and improves performance when downstream data is limited.
- Unsupervised Pre-training, Fine-tuning and Prediction: To further mitigate the reliance on labelled data for pre-training, self-supervised learning emerged [12], [13]. This paradigm learns useful and transferable representations from unlabelled data. Examples include Masked Image Modelling (MIM) [41] (reconstructing masked parts of an image) and contrastive learning [12] (learning discriminative features by contrasting data samples). Models pre-trained this way are then fine-tuned on labelled task-specific data.
- Vision-Language Model Pre-training and Zero-shot Prediction: This is the latest paradigm, inspired by Large Language Models (LLMs) in NLP [14], [15], [16]. It involves:
- VLM Pre-training: A model is trained on large-scale image-text pairs (abundant on the internet) using vision-language objectives [10], [18], [19] to learn image-text correspondences.
- Zero-shot Prediction: The pre-trained VLM is directly applied to downstream visual recognition tasks without fine-tuning, by matching image embeddings with text embeddings (e.g., category names or descriptions).
The pioneer of this paradigm is CLIP [10], which uses an image-text contrastive objective to pull paired images and texts closer in a shared embedding space.
3.3. Technological Evolution
The evolution of visual recognition paradigms has moved from:
-
Hand-crafted features and simple classifiers (Traditional ML): Focused on human expertise in feature design.
-
End-to-end learning with deep CNNs (Deep Learning from Scratch): Shifted focus to network architecture, but required massive labelled data for each task.
-
Transfer of supervised pre-trained features (Supervised Pre-training): Recognized the value of general features, but still needed labelled pre-training data.
-
Learning from unlabelled data (Unsupervised Pre-training): Reduced annotation burden for pre-training, but still required fine-tuning.
-
Learning from web-scale image-text data for zero-shot generalization (VLM Pre-training): Leverages freely available multimodal data, enabling direct application to new tasks without fine-tuning, dramatically increasing efficiency and scalability.
This paper's work fits at the cutting edge of this evolution, surveying the rapidly growing field of VLMs, particularly their application to visual recognition tasks, and extending to how these powerful pre-trained models are adapted (transfer learning) or have their knowledge extracted (knowledge distillation) for specific downstream challenges.
3.4. Differentiation Analysis
The paper distinguishes itself from existing VLM surveys by focusing specifically on VLMs for visual recognition tasks rather than broader vision-language tasks (e.g., visual question answering, natural language for visual reasoning, phrase grounding).
Other related surveys (e.g., [50], [51], [52], [53], [54]) have covered:
-
Advances on vision-language tasks generally, including VLM pre-training for various task-specific methods [50].
-
VLM pre-training for vision-language tasks [51], [52].
-
Recent progress of multi-modal learning on multi-modal tasks [53], [54].
In contrast, this survey provides a unique contribution by focusing on three major aspects for visual recognition tasks:
-
Recent progress of VLM pre-training for visual recognition tasks: Specifically how VLMs are trained to understand images for classification, detection, and segmentation.
-
Two typical transfer approaches from VLMs to visual recognition tasks: How pre-trained VLMs are adapted to new visual recognition scenarios with minimal effort.
-
Benchmarking of VLM pre-training methods on visual recognition tasks: A comparative analysis of their performance.
This specialized focus provides a comprehensive and up-to-date resource for researchers interested in leveraging VLMs for computer vision's core recognition challenges.
4. Methodology
4.1. Principles
The core principle behind Vision-Language Models (VLMs) for visual recognition is to learn a rich correlation between visual information (images) and linguistic information (text) from vast quantities of readily available web-scale image-text pairs. By encoding images and texts into a shared embedding space, a VLM can understand the semantic relationship between them. This allows the model to perform zero-shot predictions on various visual recognition tasks: for a given image, it can classify, detect, or segment by comparing the image's embedding with the embeddings of candidate text descriptions (e.g., class names, object labels, segment descriptions). The theoretical intuition is that natural language provides a flexible and scalable form of supervision, enabling VLMs to learn generalizable visual concepts that transcend predefined categories.
4.2. Core Methodology In-depth (Layer by Layer)
VLM pre-training involves a deep neural network that extracts features from images and texts, and then learns vision-language correlation using specific pre-training objectives.
4.2.1. Network Architectures (Section 3.1)
VLMs employ separate encoders for images and text, or sometimes a unified encoder.
4.2.1.1. Architectures for Learning Image Features
Two main types are used:
- CNN-based Architectures: ConvNets like VGG [5], ResNet [6], and EfficientNet [59] are commonly used. ResNet [6] is popular due to its skip connections, which mitigate gradient vanishing and explosion and allow very deep networks.
- Modifications for VLM pre-training (e.g., in CLIP [10]) include ResNet-D [60], anti-aliased rect-2 blur pooling [61], and replacing global average pooling with an attention pooling layer borrowed from Transformer multi-head attention [58].
- Transformer-based Architectures: Vision Transformer (ViT) [57] is a standard choice. It splits input images into fixed-size patches, linearly projects them, adds position embeddings, and feeds the resulting sequence into a Transformer encoder.
- The Transformer encoder consists of a stack of Transformer blocks, each with a multi-head self-attention layer and a feed-forward network.
- Modifications (e.g., in CLIP [10], FILIP [18], SLIP [64]) often add a normalization layer before the Transformer encoder.
4.2.1.2. Architectures for Learning Language Features
- Transformer and its variants (e.g., BERT [14], GPT-2 [16], the original Transformer [58]) are widely adopted.
- The standard Transformer [58] has an encoder-decoder structure. Both the encoder and decoder stack multiple blocks, each containing multi-head (self-)attention layers and Multi-Layer Perceptrons (MLPs); the decoder additionally includes a masked multi-head attention layer.
- Most VLM studies (e.g., CLIP [10]) adopt the standard Transformer with minor modifications, often training the text encoder from scratch rather than initializing it with weights from a pre-trained language model such as GPT-2.
4.2.2. VLM Pre-training Objectives (Section 3.2)
The core of VLM pre-training is the design of objectives that learn rich vision-language correlation. These objectives fall into three categories:
4.2.2.1. Contrastive Objectives
These objectives train VLMs to learn discriminative representations by pulling paired samples (e.g., an image and its matching text) close together and pushing unpaired samples (e.g., an image and a non-matching text) far apart in the feature space. (A minimal sketch of the symmetric image-text contrastive loss is given after this list.)
-
Image Contrastive Learning: Aims to learn discriminative image features, often as an auxiliary objective. It forces a query image to be close to its positive keys (e.g., data augmentations of itself) and far from negative keys (other images). Given a batch of $B$ images, the InfoNCE loss [68] (and its variants) is typically used:
$ \mathcal{L}_{I}^{\mathrm{InfoNCE}} = - \frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_{i}^{I} \cdot z_{+}^{I} / \tau)}{\sum_{j=1, j \ne i}^{B+1} \exp(z_{i}^{I} \cdot z_{j}^{I} / \tau)} $
Where:
- $B$: The batch size.
- $z_{i}^{I}$: The query embedding of the $i$-th image.
- $z_{+}^{I}$: The positive key embedding corresponding to $z_{i}^{I}$ (e.g., an augmented version of the same image).
- $z_{j}^{I}$: The key embeddings from the batch, including the positive key and negative keys (other images in the batch). In some formulations, the positive key is part of the sum or handled separately.
- $\tau$: A temperature hyper-parameter that controls the sharpness of the learned representation. A smaller $\tau$ makes the model more sensitive to differences in similarity.
-
Image-Text Contrastive Learning: Aims to learn discriminative image-text representations by pulling embeddings of paired images and texts close while pushing unpaired ones away. It is often achieved by minimizing a symmetrical InfoNCE loss [10] that considers both the image-to-text and text-to-image contrasts, i.e., $\mathcal{L}^{IT} = \mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I}$. For the image-to-text direction ($I \rightarrow T$), it contrasts a query image with text keys:
$ \mathcal{L}_{I \rightarrow T} = - \frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_{i}^{I} \cdot z_{i}^{T} / \tau)}{\sum_{j=1}^{B} \exp(z_{i}^{I} \cdot z_{j}^{T} / \tau)} $
For the text-to-image direction ($T \rightarrow I$), it contrasts a query text with image keys:
$ \mathcal{L}_{T \rightarrow I} = - \frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_{i}^{T} \cdot z_{i}^{I} / \tau)}{\sum_{j=1}^{B} \exp(z_{i}^{T} \cdot z_{j}^{I} / \tau)} $
Where:
- $B$: The batch size.
- $z_{i}^{I}$: The embedding of the $i$-th image.
- $z_{i}^{T}$: The embedding of the $i$-th text, which is paired with $z_{i}^{I}$.
- $z_{j}^{T}$: The embedding of the $j$-th text in the batch.
- $z_{j}^{I}$: The embedding of the $j$-th image in the batch.
- $\tau$: The temperature hyper-parameter.
- The dot product $z^{I} \cdot z^{T}$ (on normalized embeddings, i.e., cosine similarity) measures the similarity between embeddings.
-
Image-Text-Label Contrastive Learning: This objective introduces Supervised Contrastive Learning [69] by incorporating image classification labels into the image-text contrastive framework, allowing discriminative and task-specific features to be learned simultaneously. The image-to-text component is reformulated as:
$ \mathcal{L}_{I \rightarrow T}^{ITL} = - \sum_{i=1}^{B} \frac{1}{|\mathcal{P}(i)|} \sum_{k \in \mathcal{P}(i)} \log \frac{\exp(z_{i}^{I} \cdot z_{k}^{T} / \tau)}{\sum_{j=1}^{B} \exp(z_{i}^{I} \cdot z_{j}^{T} / \tau)} $
And the text-to-image component as:
$ \mathcal{L}_{T \rightarrow I}^{ITL} = - \sum_{i=1}^{B} \frac{1}{|\mathcal{P}(i)|} \sum_{k \in \mathcal{P}(i)} \log \frac{\exp(z_{i}^{T} \cdot z_{k}^{I} / \tau)}{\sum_{j=1}^{B} \exp(z_{i}^{T} \cdot z_{j}^{I} / \tau)} $
Where:
- $B$: The batch size.
- $z_{i}^{I}$, $z_{k}^{T}$, $z_{j}^{T}$, $z_{i}^{T}$, $z_{k}^{I}$, $z_{j}^{I}$: Image and text embeddings as before.
- $\mathcal{P}(i)$: The set of indices in the current batch whose classification label is the same as that of the $i$-th sample; all samples in $\mathcal{P}(i)$ are treated as positive matches for $z_{i}^{I}$ (or $z_{i}^{T}$).
- $|\mathcal{P}(i)|$: The cardinality (number of elements) of the set $\mathcal{P}(i)$.
- $\tau$: The temperature hyper-parameter.
The total loss is $\mathcal{L}^{ITL} = \mathcal{L}_{I \rightarrow T}^{ITL} + \mathcal{L}_{T \rightarrow I}^{ITL}$.
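To make the contrastive objective concrete, the following is a minimal PyTorch sketch of the symmetric image-text InfoNCE loss (Eqs. 2 and 3); it is an illustrative simplification (e.g., the two directions are averaged here, and the encoders are omitted), not the implementation of any surveyed model:

```python
# A minimal sketch of the symmetric image-text InfoNCE loss used in CLIP-style pre-training.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, tau=0.07):
    """image_emb, text_emb: (B, D) embeddings of B paired images and texts."""
    image_emb = F.normalize(image_emb, dim=-1)            # cosine similarity via dot product
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / tau               # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))                # the i-th image matches the i-th text
    loss_i2t = F.cross_entropy(logits, targets)           # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text-to-image direction
    return (loss_i2t + loss_t2i) / 2                      # averaged here; some papers sum the terms

loss = image_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```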
4.2.2.2. Generative Objectives
These objectives learn semantic features by training networks to generate image or text data, encouraging VLMs to learn rich contextual information within and across modalities. (A minimal sketch of the image-to-text generation objective is given after this list.)
-
Masked Image Modelling (MIM): This objective learns cross-patch correlation by masking parts of an image and training the encoder to reconstruct them. The loss function is:
$ \mathcal{L}_{MIM} = - \frac{1}{B} \sum_{i=1}^{B} \log f_{\theta} \big( \overline{x}_{i}^{I} \mid \hat{x}_{i}^{I} \big) $
Where:
- $B$: The batch size.
- $f_{\theta}$: The image encoder (parameterized by $\theta$).
- $\overline{x}_{i}^{I}$: The masked patches of the $i$-th input image $x_{i}^{I}$.
- $\hat{x}_{i}^{I}$: The unmasked patches of the $i$-th input image $x_{i}^{I}$.
- The objective is to maximize the likelihood of reconstructing the masked patches given the unmasked ones.
-
Masked Language Modelling (MLM): A widely adopted NLP pre-training objective that is effective for text feature learning. It randomly masks a percentage of input text tokens and trains the model to reconstruct them based on the unmasked tokens. The loss function is:
$ \mathcal{L}_{MLM} = - \frac{1}{B} \sum_{i=1}^{B} \log f_{\phi} ( \overline{x}_{i}^{T} \mid \hat{x}_{i}^{T} ) $
Where:
- $B$: The batch size.
- $f_{\phi}$: The text encoder (parameterized by $\phi$).
- $\overline{x}_{i}^{T}$: The masked tokens of the $i$-th input text $x_{i}^{T}$.
- $\hat{x}_{i}^{T}$: The unmasked tokens of the $i$-th input text $x_{i}^{T}$.
- The objective is to maximize the likelihood of predicting the masked tokens given the unmasked ones. The following figure (Figure 9 from the original paper) illustrates the process of masked language modelling: position embeddings and token embeddings of the input text are combined and fed into a Transformer encoder, which predicts the tokens at the masked positions to produce the final text output.
-
Masked Cross-Modal Modelling (MCM): Integrates masked image modelling and masked language modelling by jointly masking and reconstructing image patches and text tokens from an image-text pair, capturing rich vision-language correspondence. The loss function is:
$ \mathcal{L}_{MCM} = - \frac{1}{B} \sum_{i=1}^{B} \big[ \log f_{\theta} ( \overline{x}_{i}^{I} \mid \hat{x}_{i}^{I}, \hat{x}_{i}^{T} ) + \log f_{\phi} ( \overline{x}_{i}^{T} \mid \hat{x}_{i}^{I}, \hat{x}_{i}^{T} ) \big] $
Where:
- $B$: The batch size.
- $f_{\theta}$: The image encoder.
- $f_{\phi}$: The text encoder.
- $\overline{x}_{i}^{I}$ / $\hat{x}_{i}^{I}$: The masked/unmasked patches in the $i$-th image $x_{i}^{I}$.
- $\overline{x}_{i}^{T}$ / $\hat{x}_{i}^{T}$: The masked/unmasked text tokens in the $i$-th text $x_{i}^{T}$.
- The objective is to reconstruct the masked image patches given both the unmasked image patches and the unmasked text tokens, and similarly for the masked text tokens.
-
Image-to-Text Generation: Aims to generate descriptive texts for a given image, capturing fine-grained vision-language correlation by training VLMs to predict tokenized texts autoregressively. The loss function is:
$ \mathcal{L}_{ITG} = - \sum_{l=1}^{L} \log f_{\theta} ( x_{l}^{T} \mid x_{<l}^{T}, z^{I} ) $
Where:
- $L$: The number of tokens to be predicted for the text $x^{T}$.
- $f_{\theta}$: The image encoder (often combined with a decoder for text generation).
- $x^{T}$: The target text to be generated.
- $x_{<l}^{T}$: The tokens of the text up to position $l-1$, used as context for predicting the $l$-th token.
- $z^{I}$: The embedding of the image paired with $x^{T}$.
- The objective is to maximize the likelihood of generating the correct text sequence given the image embedding and the previous text tokens. The following figure (Figure 10 from the original paper) provides a simplified illustration of image-to-caption generation: an image encoder interacts with a multi-modal text decoder through cross-attention to combine the input image with the text; for an input image of a dog playing with a ball, the output text is "A dog is playing with a ball".
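To make the generative objective concrete, the following is a minimal PyTorch sketch of the autoregressive image-to-text generation loss (Eq. 9); the tiny GRU decoder and random inputs are stand-ins for a real multi-modal decoder and tokenizer, so this is an illustrative assumption rather than any surveyed model's code:

```python
# Minimal sketch of the autoregressive image-to-text generation objective L_ITG:
# a text decoder predicts each caption token from previous tokens and the image embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCaptionDecoder(nn.Module):
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(512, dim)            # project image embedding z^I into decoder space
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # stand-in for a Transformer decoder
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, image_emb):
        h0 = self.img_proj(image_emb).unsqueeze(0)     # condition generation on the image
        out, _ = self.rnn(self.token_emb(tokens), h0)
        return self.head(out)                          # (B, L, vocab) next-token logits

def itg_loss(decoder, tokens, image_emb):
    logits = decoder(tokens[:, :-1], image_emb)        # predict token l from tokens < l and z^I
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

decoder = TinyCaptionDecoder()
loss = itg_loss(decoder, torch.randint(0, 1000, (4, 12)), torch.randn(4, 512))
print(loss.item())
```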
4.2.2.3. Alignment Objectives
These objectives enforce VLMs to align paired images and texts by learning to predict whether a given text correctly describes a given image. (A minimal sketch of an image-text matching head is given after this list.)
-
Image-Text Matching (ITM): Models global correlation between images and texts by aligning paired samples. It can be formulated with a score function $\mathcal{S}$ that measures the alignment probability and a binary classification loss:
$ \mathcal{L}_{IT} = p \log \mathcal{S}(z^{I}, z^{T}) + (1 - p) \log (1 - \mathcal{S}(z^{I}, z^{T})) $
Where:
- $\mathcal{S}(z^{I}, z^{T})$: A score function that outputs the probability of alignment between the image embedding $z^{I}$ and the text embedding $z^{T}$.
- $p$: A binary label, equal to 1 if the image and text are a true pair (matched) and 0 otherwise (unmatched).
- This is a standard binary cross-entropy formulation, where the model learns to predict $p$ correctly.
-
Region-Word Matching (RWM): Aims to model local, fine-grained cross-modal correlation between image regions and word tokens, which is particularly beneficial for dense visual recognition tasks like object detection and semantic segmentation. The loss function is:
$ \mathcal{L}_{RW} = p \log \mathcal{S}^{r}(r^{I}, w^{T}) + (1 - p) \log (1 - \mathcal{S}^{r}(r^{I}, w^{T})) $
Where:
- $(r^{I}, w^{T})$: Denotes a region-word pair, where $r^{I}$ is a feature representation of an image region and $w^{T}$ is a feature representation of a word token.
- $\mathcal{S}^{r}(r^{I}, w^{T})$: A local score function that measures the similarity between an image region and a word.
- $p$: A binary label, equal to 1 if the region and word are semantically paired (e.g., the word "dog" matches a region containing a dog) and 0 otherwise.
The following figure (Figure 11 from the original paper) illustrates the use of word-region alignment for detection in GLIP: the figure is a schematic showing how GLIP uses a text encoder and an image encoder for word-region alignment, with the depicted formulas computing the localization loss and the alignment loss.
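To make the alignment objective concrete, the following is a minimal PyTorch sketch of an image-text matching (ITM) head trained with binary cross-entropy (Eq. 10); the concatenation-based fusion is a simplifying assumption rather than any surveyed model's design:

```python
# Minimal sketch of an image-text matching (ITM) head: a fused image-text representation
# is classified as matched/unmatched with a binary cross-entropy loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Concatenating image and text embeddings is a stand-in for a multi-modal fusion module.
        self.classifier = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, image_emb, text_emb):
        return self.classifier(torch.cat([image_emb, text_emb], dim=-1)).squeeze(-1)  # match logits

head = ITMHead()
img, txt = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 2, (8,)).float()    # 1 = paired, 0 = unpaired (e.g., shuffled texts)
loss = F.binary_cross_entropy_with_logits(head(img, txt), labels)
print(loss.item())
```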
4.2.3. VLM Pre-training Frameworks (Section 3.3)
VLMs employ different architectures to process and fuse visual and linguistic information. The following figure (Figure 5 from the original paper) illustrates typical VLM pre-training frameworks:
The figure is a schematic of three typical VLM pre-training frameworks, namely the two-tower VLM, the two-leg VLM, and the one-tower VLM, showing the image and text embedding inputs and the use of multi-modal fusion layers that support effective pre-training objectives.
- Two-Tower Framework (Fig. 5a): Widely adopted (e.g., CLIP [10], ALIGN [17]). Input images and texts are encoded with two separate, independent encoders (one for vision, one for language). Features are then projected into a common embedding space where objectives like contrastive learning are applied. This design keeps the modalities largely separate until the final embedding comparison.
- Two-Leg Framework (Fig. 5b): (e.g., COCA [19], FLAVA [42]). This extends the two-tower design by introducing additional multi-modal fusion layers after the initial separate encoding. These fusion layers enable explicit feature interaction and communication between the image and text modalities, allowing deeper cross-modal understanding.
- One-Tower Framework (Fig. 5c): (e.g., CLIPPO [43], OneFormer [44]). Aims to unify vision and language learning within a single encoder. Images and texts are often tokenized into a common format (e.g., visual tokens from image patches and textual tokens from words) and then processed by a single, shared Transformer-based network. This design promotes efficient communication and interaction across data modalities from an earlier stage.
4.2.4. Evaluation Setups and Downstream Tasks (Section 3.4)
VLMs are evaluated using different setups and on various downstream tasks to assess their generalization capabilities.
4.2.4.1. Zero-shot Prediction
This is the most common way to evaluate VLM generalization: directly applying the pre-trained VLM to downstream tasks without any task-specific fine-tuning.
- Image Classification: Classifies images into categories. VLMs perform this by comparing image embeddings with text embeddings of class names (often enhanced by prompt engineering, e.g., "a photo of a [label]").
- Semantic Segmentation: Assigns category labels to each pixel. Zero-shot segmentation involves comparing embeddings of image pixels (or segments) with text descriptions.
- Object Detection: Localizes and classifies objects. VLMs achieve this by comparing the embeddings of object proposals with text embeddings, often leveraging object-locating ability learned from auxiliary datasets.
- Image-Text Retrieval: Retrieves relevant samples across modalities: text-to-image retrieval (finding images given text) and image-to-text retrieval (finding texts given images).
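The mechanics of zero-shot classification can be sketched as follows, assuming the open-source `clip` package (https://github.com/openai/CLIP) and Pillow are installed; the image path and class list are placeholders:

```python
# Minimal sketch of zero-shot image classification with a CLIP-style VLM.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car"]
# Prompt engineering: wrap each class name in a natural-language template.
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(prompts)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)   # similarity -> class probabilities

print(dict(zip(class_names, probs.squeeze(0).tolist())))
```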
4.2.4.2. Linear Probing
This setup freezes the pre-trained VLM's feature extractor and trains a simple linear classifier on top of the VLM-encoded embeddings. It assesses the quality of the learned representations.
- Image Classification and Action Recognition are commonly used tasks. For action recognition, video clips are often sub-sampled.
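A minimal sketch of the linear-probing protocol follows, with a random stand-in for the frozen VLM feature extractor and synthetic labels (illustrative only, not the evaluation protocol's exact details):

```python
# Minimal sketch of linear probing: features from a frozen pre-trained encoder
# are used to train a simple linear classifier, which measures representation quality.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(images):
    # Stand-in for a frozen VLM image encoder, e.g., CLIP's encode_image() with no gradients.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(images), 512))

train_images, train_labels = list(range(200)), np.random.randint(0, 10, 200)
test_images, test_labels = list(range(50)), np.random.randint(0, 10, 50)

probe = LogisticRegression(max_iter=1000)          # the only trainable component
probe.fit(extract_features(train_images), train_labels)
accuracy = probe.score(extract_features(test_images), test_labels)
print(f"linear-probe accuracy: {accuracy:.3f}")
```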
4.2.5. Vision-Language Model Pre-training (Section 5)
This section reviews specific VLM pre-training studies organized by the objectives described above. The following table (Table 3 from the original paper) summarizes these methods, their pre-training datasets, objectives, and contributions:
| Method | Dataset | Objective | Contribution |
| CLIP† [10] [code] | CLIP* | Con | Propose image-text contrastive learning for VLM pre-training. |
| ALIGN† [17] | ALIGN* | Con | Leverage large-scale noisy data to scale-up VLM pre-training data. |
| OTTER† [112] [code] | CC3M, YFCC15M, WIT | Con | Employ optimal transport for data efficient VLM pre-training. |
| DeCLIP† [113] [code] | CC3M, CC12M, YFCC100M, WIT* | Con,Gen | Employ image/ text self-supervision for data efficient VLM pre-training. |
| ZeroVL† [114] [code] | SBU, VG, CC3M, CC12M | Con | Introduce data augmentation for data-efficient VLM pre-training. |
| FILIP† [18] | FILIP*, CC3M, CC12M, YFCC100M | Con,Align | Leverage region-word similarity for fine-grained VLM pre-training. |
| UniCL† [65] [code] | CC3M, CC12M, YFCC100M | Con | Propose image-text-label contrastive learning for VLM pre-training. |
| Florence† [115] | FLD-900M* | Con | Scale up pre-training data and include depth and temporal information. |
| SLIP† [64] [code] | YFCC100M | Con | Introduce image self-supervision learning into VLM pre-training. |
| PyramidCLIP† [116] | SBU, CC3M, CC12M, YFCC100M, LAION400M | Con | Perform peer-level/cross-level contrastive learning within/across multiple semantic levels. |
| ChineseCLIP† [117] [code] | LAION5B, WuKong, VG, COCO | Con | Collect large-scale Chinese image-text data and Introduce Chinese VLM. |
| LiT† [118] [project] | CC12M, YFCC100M, WIT* | Con | Propose contrastive tuning with the locked image encoder. |
| AltCLIP‡ [119] [code] | WuDao, LAION2B, LAION5B | Con | Leverage the multilingual text encoder to achieve multilingual VLM. |
| FLAVA‡ [42] [code] | COCO, SBU, LN, CC3M, VG, WIT, CC12M, RC, YFCC100M | Gen,Con,Align | Propose a universal and foundational VLM that tackles the single-modal (i.e., image or text) and the multi-model cases at the same time. |
| KELIP† [120] [code] | CUB200, WIT, YFCC15M, CC3M, CC12M, LAION400M, K-WIT* | Con,Gen | Collect large-scale Korean image-text pair data and develop bilingual VLMs with Korean and English. |
| COCA‡ [19] [code] | ALIGN* | Con,Gen | Combine contrastive learning and image captioning for pre-training. |
| nCLIP† [121] | COCO, VG, SBU, CC3M, CC12M, YFCC14M | Con,Align | Propose a non-contrastive pre-training objective (i.e., a cross-entropy loss for global image-text matching) for VLM pre-training. |
| K-lite† [122] [code] | CC3M, CC12M, YFCC100M | Con | Leverage auxiliary datasets for training transferable VLMs. |
| NLIP‡ [123] | YFCC100M, COCO | Con,Gen | Train noise-robust VLM via noise harmonization and completion. |
| UniCLIP† [84] | CC3M, CC12M, YFCC100M | Con | Propose unified image-text and image-image contrastive learning. |
| PaLI‡ [83] [project] | WebLI* | Gen | Scale up the data, model and language in VLM pre-training. |
| HiCLIP† [124] [code] | YFCC100M, CC3M, CC12M | Con | Propose to incorporate hierarchy-aware attention into VLM pre-training. |
| CLIPPO§ [43] [code] | WebLI* | Con | Learn image and text data with a single network for VLM pre-training. |
| OneFormer§ [44] | CC3M, SBU, VG, COCO | Con,Gen | Unify image and text learning in a single tower transformer. |
| RA-CLIP† [125] | YFCC100M | Con | Propose retrieval-augmented image-text contrastive learning. |
| LA-CLIP† [126] [code] | CC3M, CC12M, RC, LAION400M | Con | Propose LLMs-augmented image-text contrastive learning. |
| ALIP† [127] [code] | YFCC100M | Con | Introduce synthetic caption supervision into VLM pre-training. |
| GrowCLIP‡ [128] | CC12M | Con | Propose online-learning image-text contrastive learning. |
| GroupViT† [129] [code] | CC12M, YFCC100M | Con | Propose hierarchical visual concepts grouping for VLM pre-training. |
| SegCLIP† [46] [code] | CC3M, COCO | Con,Gen | Propose a plug-in semantic group module for VLM pre-training. |
| CLIPpy [130] [code] | CC12M | Con | Propose spatial representation aggregation for VLM pre-training. |
| RegionCLIP† [131] [code] | CC3M, COCO | Con,Align | Learn region-level visual representations for VLM pre-training. |
| GLIP [67] [code] | CC3M, CC12M, SBU | Align | Unify detection and phrase grounding for grounded VLM pre-training. |
| FIBER‡ [71] [code] | COCO, CC3M, SBU, VG | Con,Gen,Align | Propose deep multi-modal fusion for coarse-to-fine VLM pre-training. |
| DetCLIP‡ [45] | YFCC100M | Align | Present a paralleled visual-concept VLM pre-training method. |
-
VLM Pre-training with Contrastive Objectives (Section 5.1):
-
CLIP [10]: The pioneering work; uses the symmetrical image-text InfoNCE loss (Eqs. 2 & 3). The following figure (Figure 6 from the original paper) illustrates the image-text contrastive learning in CLIP:
The figure is a schematic of the image-text contrastive learning process in CLIP: the upper part shows the text encoder processing the input texts into text feature vectors, and the lower part shows the input images passing through the image encoder to produce the corresponding image feature vectors; the two parts are related through a feature-space similarity matrix, and contrastive learning over this matrix strengthens the model.
ALIGN[17]: Scales upCLIPwith 1.8 billion noisy image-text pairs and noise-robust contrastive learning. -
Data-efficient pre-training:
OTTER[112] (optimal transport),DeCLIP[113] (nearest-neighbor supervision),ZeroVL[114] (debiased data sampling, mixup). -
Fine-grained correlation:
FILIP[18] (region-word alignment),PyramidCLIP[116] (cross-level and peer-level contrastive learning). -
UniCL[65]: Introduces image classification labels into image-text contrast (Eqs. 4 & 5). The following figure (Figure 7 from the original paper) illustrates the image-text-label space proposed inUniCL:
The figure is a schematic of the image-text-label structure, showing the correspondence among images, texts, and labels; instances of different animals (e.g., dogs and cats) and their descriptions illustrate the connection between image-label pairs and image-text pairs.
Recent improvements also involve augmenting image-text pairs (e.g.,
LA-CLIP[126],ALIP[127] use LLMs for synthetic captions), and unifying vision and language in a single encoder (OneFormer[44],CLIPPO[43]).
-
-
VLM Pre-training with Generative Objectives (Section 5.2):
-
Masked Image Modelling (MIM): As defined in Eq. 6. Examples include MAE [41] and BeiT [70]. FLAVA [42] uses rectangular block masking, while KELIP [120] and SegCLIP [46] mask a large portion of patches. The following figure (Figure 8 from the original paper) illustrates masked image modelling:
The figure is a schematic of the image encoder-decoder pipeline: the input image on the left is processed by the image encoder into an intermediate representation, and the image decoder then generates the output image, illustrating the image reconstruction process.
Masked Language Modelling (MLM): As defined in Eq. 7. FLAVA [42] and FIBER [71] use this for text feature learning.
Masked Cross-Modal Modelling (MCM): As defined in Eq. 8. FLAVA [42] masks both image patches and text tokens for joint reconstruction.
Image-to-Text Generation: As defined in Eq. 9. COCA [19], NLIP [123], and PaLI [83] train VLMs with encoder-decoder architectures and image captioning objectives.
-
-
VLM Pre-training with Alignment Objectives (Section 5.3):
Image-Text Matching: As defined in Eq. 10. FLAVA [42] uses a classifier, and FIBER [71] mines hard negatives. Region-Word Matching: As defined in Eq. 11. GLIP [67], FIBER [71], and DetCLIP [45] use dot-product similarity between regional visual features and token-wise text features for dense prediction tasks.
4.2.6. VLM Transfer Learning (Section 6)
Transfer learning adapts pre-trained VLMs to fit downstream tasks. The following table (Table 4 from the original paper) summarizes representative transfer learning methods, their categories, setups, and contributions:
| Method | Category | Setup | Contribution |
| CoOp [31] [code] | TPT | Few-shot Sup. | Introduce context optimization with learnable text prompts for VLM transfer learning. |
| CoCoOp [32] [code] | TPT | Few-shot Sup. | Propose conditional text prompting to mitigate overfitting in VLM transfer learning. |
| SubPT [132] [code] | TPT | Few-shot Sup. | Propose subspace text prompt tuning to mitigate overfitting in VLM transfer learning. |
| LASP [133] | TPT | Few-shot Sup. | Propose to regularize the learnable text prompts with the hand-engineered prompts. |
| ProDA [134] | TPT | Few-shot Sup. | Propose prompt distribution learning that captures the distribution of diverse text prompts. |
| VPT [135] | TPT | Few-shot Sup. | Propose to model the text prompt learning with instance-specific distribution. |
| ProGrad [136] [code] | TPT | Few-shot Sup. | Present a prompt-aligned gradient technique for preventing knowledge forgetting. |
| CPL [137] [code] | TPT | Few-shot Sup. | Employ counterfactual generation and contrastive learning for text prompt tuning. |
| PLOT [138] [code] | TPT | Few-shot Sup. | Introduce optimal transport to learn multiple comprehensive text prompts. |
| DualCoOp [139] [code] | TPT | Few-shot Sup. | Introduce positive and negative text prompt learning for multi-label classification. |
| TaI-DPT [140] [code] | TPT | Few-shot Sup. | Introduce a double-grained prompt tuning technique for multi-label classification. |
| SoftCPT [141] [code] | TPT | Few-shot Sup. | Propose to fine-tune VLMs on multiple downstream tasks simultaneously. |
| DenseCLIP [142] [code] | TPT | Supervised | Propose a language-guided fine-tuning technique for dense visual recognition tasks. |
| UPL [143] [code] | TPT | Unsupervised | Propose unsupervised prompt learning with self-training for VLM transfer learning. |
| TPT [144] [code] | TPT | Unsupervised | Propose test-time prompt tuning that learns adaptive prompts on the fly. |
| KgCoOp [145] [code] | TPT | Few-shot Sup. | Introduce knowledge-guided prompt tuning to improve the generalization ability. |
| ProTeCt [146] | TPT, VPT | Few-shot Sup. | Propose a prompt tuning technique to improve consistency of model predictions. |
| VP [147] [code] | VPT | Supervised | Investigate the efficacy of visual prompt tuning for VLM transfer learning. |
| RePrompt [148] | VPT | Few-shot Sup. | Introduce retrieval mechanisms to leverage knowledge from downstream tasks. |
| UPT [149] [code] | TPT, VPT | Few-shot Sup. | Propose a unified prompt tuning that jointly optimizes text and image prompts. |
| MVLPT [150] [code] | TPT, VPT | Few-shot Sup. | Incorporate multi-task knowledge into text and image prompt tuning. |
| MaPLE [151] [code] | TPT, VPT | Few-shot Sup. | Propose multi-modal prompt tuning with a mutual promotion strategy. |
| CAVPT [152] [code] | TPT, VPT | Few-shot Sup. | Introduce class-aware visual prompt for concentrating more on visual concepts. |
| Clip-Adapter [33] [code] | FA | Few-shot Sup. | Introduce an adapter with residual feature blending for efficient VLM transfer learning. |
| Tip-Adapter [34] [code] | FA | Few-shot Sup. | Propose to build a training-free adapter with the embeddings of few labelled images. |
| SVL-Adapter [153] [code] | FA | Few-shot Sup. | Introduce a self-supervised adapter by performing self-supervised learning on images. |
| SuS-X [154] [code] | FA | Unsupervised | Propose a training-free name-only transfer learning paradigm with curated support sets. |
| CLIPPR [155] [code] | FA | Unsupervised | Leverage the label distribution priors for adapting pre-trained VLMs. |
| SgVA-CLIP [156] | TPT, FA | Few-shot Sup. | Propose a semantic-guided visual adapter to generate discriminative adapted features. |
| VT-Clip [157] | CA | Few-shot Sup. | Introduce visual-guided attention that semantically aligns text and image features. |
| CALIP [158] [code] | CA | Unsupervised | Propose parameter-free attention for the communication between visual and textual features. |
| TaskRes [159] [code] | CA | Few-shot Sup. | Propose a technique for better learning old VLM knowledge and new task knowledge. |
| CuPL [160] | LLM | Unsupervised | Employ large language models to generate customized prompts for VLMs. |
| VCD [161] | LLM | Unsupervised | Employ large language models to generate captions for VLMs. |
| Wise-FT [162] [code] | FT | Supervised | Propose ensemble-based fine-tuning by combining the fine-tuned and original VLMs. |
| MaskClip [163] [code] | AM | Unsupervised | Propose to extract dense features by modifying the image encoder architecture. |
| MUST [164] [code] | Self-training | Unsupervised | Propose masked unsupervised self-training for unsupervised VLM transfer learning. |
- Motivation (Section 6.1): VLMs face domain gaps (differences in image/text distributions) and training objective gaps (general pre-training objectives vs. task-specific objectives).
- Common Setups (Section 6.2):
Supervised transfer: Uses all labelled downstream data. Few-shot supervised transfer: Uses a small number of labelled samples. Unsupervised transfer: Uses only unlabelled downstream data.
- Common Transfer Learning Methods (Section 6.3):
-
Transfer via Prompt Tuning (TPT): Modifies the input text and/or image with learnable prompts without fine-tuning the entire VLM. A minimal sketch of CoOp-style text prompt tuning follows this block.
- Text Prompt Tuning: Learns continuous context words for class names (e.g., CoOp [31] uses prompts of the form "[V]_1 [V]_2 ... [V]_m [label]", where the [V] tokens are learnable vectors). CoCoOp [32] generates image-conditioned prompts to mitigate overfitting.
- Visual Prompt Tuning (VPT): Modulates the input of the image encoder (e.g., VP [147] adds a learnable perturbation $v$ to the input image $x^{I}$ as $x^{I} + v$).
- Text-Visual Prompt Tuning: Jointly optimizes both text and image prompts (e.g., UPT [149], MaPLE [151]). The following figure (Figure 12 from the original paper) illustrates text prompt learning and visual prompt learning:
The figure is a schematic of (a) text prompt learning and (b) visual prompt learning: in (a), the text encoder and image encoder are used to compute the classification loss; in (b), learnable visual prompts are introduced to enrich the image input, and the classification loss is computed together with the text prompts.
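The following is a minimal sketch of CoOp-style text prompt tuning under the few-shot supervised setup; the stand-in text encoder, random features, and hyper-parameters are illustrative assumptions, not CoOp's released implementation:

```python
# Minimal sketch of text prompt tuning: frozen text/image encoders, with only a few learnable
# context vectors prepended to each class-name embedding and optimized with a classification loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, class_name_embs, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # learnable [V]_1..[V]_m
        self.register_buffer("cls", class_name_embs)              # frozen [label] embeddings (C, dim)

    def forward(self, frozen_text_encoder):
        # Prepend the shared learnable context to every class-name embedding.
        prompts = torch.cat([self.ctx.unsqueeze(0).expand(self.cls.size(0), -1, -1),
                             self.cls.unsqueeze(1)], dim=1)       # (C, n_ctx + 1, dim)
        return frozen_text_encoder(prompts)                       # (C, dim) class text features

frozen_text_encoder = lambda p: F.normalize(p.mean(dim=1), dim=-1)   # stand-in for the VLM text encoder
learner = PromptLearner(torch.randn(10, 512))

image_features = F.normalize(torch.randn(32, 512), dim=-1)           # from the frozen image encoder
labels = torch.randint(0, 10, (32,))
logits = image_features @ learner(frozen_text_encoder).t() / 0.07
loss = F.cross_entropy(logits, labels)                               # only learner.ctx receives gradients
loss.backward()
print(loss.item(), learner.ctx.grad is not None)
```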
-
-
Transfer via Feature Adaptation (FA): Introduces additional lightweight feature adapters to adapt VLMs to downstream tasks. A minimal sketch of a residual feature adapter follows this block.
-
Clip-Adapter [33]: Inserts trainable linear layers after CLIP's encoders and optimizes them with residual feature blending while keeping the main CLIP model frozen.
Tip-Adapter [34]: A training-free adapter that directly uses the embeddings of the few-shot labelled images as the adapter weights. The following figure (Figure 13 from the original paper) illustrates the feature adapter:
The figure is a schematic of the feature adapter workflow: visual features are obtained from the image encoder together with text prompts, passed through the feature adapter to produce adapted features, which are finally used to compute the classification loss.
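The following is a minimal sketch of a CLIP-Adapter-style residual feature adapter; the bottleneck sizes and residual ratio are illustrative assumptions rather than the method's exact configuration:

```python
# Minimal sketch of a residual feature adapter: a small bottleneck MLP refines frozen VLM
# image features and blends them with the original features via a residual ratio alpha.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )

    def forward(self, feats):
        adapted = self.mlp(feats)
        # Residual blending keeps most of the pre-trained knowledge while adding task-specific cues.
        return F.normalize(self.alpha * adapted + (1 - self.alpha) * feats, dim=-1)

adapter = FeatureAdapter()
image_feats = torch.randn(8, 512)                         # frozen VLM image embeddings
text_feats = F.normalize(torch.randn(10, 512), dim=-1)    # frozen class text embeddings
logits = adapter(image_feats) @ text_feats.t() / 0.07     # classify with the adapted features
print(logits.shape)  # torch.Size([8, 10])
```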
-
-
Other Transfer Methods: Direct fine-tuning (e.g., Wise-FT [162]), architecture modification (e.g., MaskCLIP [163]), cross-attention mechanisms (e.g., VT-CLIP [157]), and using LLMs to generate prompts (e.g., CuPL [160]).
-
4.2.7. VLM Knowledge Distillation (Section 7)
VLM knowledge distillation transfers general VLM knowledge to task-specific models, which is especially useful for dense prediction tasks. The following table (Table 5 from the original paper) summarizes representative methods:
| Task | Method | Contribution |
| Semantic Segmentation | CLIPSeg [175] [code] | Extend CLIP by introducing a lightweight transformer-based decoder. |
| | ZegFormer [35] [code] | Group the pixels into segments and perform zero-shot classification on the segments. |
| | LSeg [176] [code] | Propose language-driven semantic segmentation by matching pixel and text embeddings. |
| | SSIW [177] | Introduce a test-time augmentation technique to refine the pseudo labels generated by CLIP. |
| | MaskClip+ [163] [code] | Perform self-training with the pseudo labels generated by MaskClip (modified from CLIP). |
| | ZegClip [174] [code] | Propose deep prompt tuning, non-mutually exclusive loss and relationship descriptor. |
| | Fusioner [178] | Introduce cross-modality fusion that aligns the visual representation with language concept. |
| | OVSeg [179] [code] | Adapt CLIP with the region-word pairs generated by the modified MaskFormer. |
| | ZSSeg [180] [code] | Propose to first generate mask proposals and then classify the generated mask proposals. |
| | OpenSeg [181] [code] | Propose to align each word in the caption with the generated segmentation masks. |
| | ReCo [182] [code] | Propose language-guided co-segmentation with the CLIP-retrieved images. |
| | CLIMS [183] [code] | Use CLIP to generate high-quality class activation maps w/o involving irrelevant background. |
| | CLIP-ES [184] [code] | Employ CLIP to refine the class activation map for weakly-supervised segmentation. |
| | FreeSeg [185] [code] | Propose a unified, universal and open-vocabulary image segmentation network. |
| Object Detection | ViLD [36] [code] | Propose to distill knowledge from a pre-trained VLM into a two-stage object detector. |
| | DetPro [37] [code] | Propose to learn continuous prompt representations for open-vocabulary object detection. |
| | HierKD [186] [code] | Propose hierarchical knowledge distillation for global-level and instance-level distillation. |
| | RKD [187] [code] | Propose region-based knowledge distillation for aligning region- and image-level embeddings. |
| | PromptDet [188] [code] | Introduce regional prompting for aligning text embeddings with regional image embeddings. |
| | PB-OVD [189] [code] | Propose to train object detectors with the pseudo bounding-box labels generated by VLMs. |
| | CondHead [190] | Propose semantic-visual alignment for better box regression and mask segmentation. |
| | VLDet [191] [code] | Achieve open-vocabulary object detection by the bipartite matching between regions and words. |
| | F-VLM [192] | Propose to simply build a detection head upon the pre-trained VLM for object localization. |
| | OV-DETR [173] [code] | Achieve open-vocabulary detection transformer with a binary matching strategy. |
| | Detic [193] [code] | Enlarge detection vocabulary using image-level supervision and pre-trained CLIP text encoder. |
| | XPM [194] [code] | Design cross-modal pseudo-labeling to let VLMs generate caption-driven pseudo masks. |
| | OWL-ViT [195] [code] | Propose ViT-based open-vocabulary detector by adding object classification/localization head. |
| | VL-PLM [196] [code] | Leverage VLMs for assigning category labels to the generated pseudo bounding boxes. |
| | P3OVD [197] | Propose prompt-driven self-training that refines the pseudo labels generated by VLMs. |
| | ZSD-YOLO [198] [code] | Leverage CLIP for object detection with a self-labeling based data augmentation technique. |
| | RO-ViT [199] | Bridge the gap of VLM pre-training and downstream open-vocabulary detection. |
| | BARON [200] [code] | Propose neighborhood sampling strategy to align the embedding of bag of regions. |
| | OADP [201] [code] | Propose object-aware distillation network to preserve and transfer contextual knowledge. |
- Motivation (Section 7.1): Allows transferring general VLM knowledge to task-specific models (e.g., object detectors, semantic segmenters) that benefit from specialized architectures, without VLM architecture restrictions.
- Knowledge Distillation for Object Detection (Section 7.2.1):
Open-vocabulary object detection aims to detect objects described by arbitrary texts. ViLD [36] distills VLM knowledge into a two-stage detector by enforcing consistency between the detector's region embeddings and the VLM's embedding space. HierKD [186] performs hierarchical global-local knowledge distillation. DetPro [37] learns continuous prompt representations with detection prompt techniques. Pseudo-labelling approaches include PB-OVD [189], which trains detectors with VLM-predicted pseudo bounding boxes, and XPM [194], which uses VLM-generated pseudo masks. (A minimal sketch of ViLD-style embedding distillation is given after this list.)
- Knowledge Distillation for Semantic Segmentation (Section 7.2.2):
Open-vocabulary semantic segmentation segments pixels based on arbitrary texts. CLIPSeg [175] extends CLIP for segmentation with a lightweight transformer decoder. LSeg [176] maximizes the correlation between CLIP text embeddings and pixel-wise image embeddings. ZegCLIP [174] uses CLIP to generate semantic masks and introduces a relationship descriptor. For weakly-supervised semantic segmentation, CLIP-ES [184] and CLIMS [183] leverage VLMs together with weak supervision.
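The following is a minimal sketch of ViLD-style knowledge distillation for open-vocabulary detection, where a detector's region embeddings are regressed towards frozen VLM image embeddings of the cropped proposals; the shapes and plain L1 loss are simplifying assumptions, not ViLD's full implementation:

```python
# Minimal sketch of VLM knowledge distillation for detection: region embeddings produced by a
# detector head are pulled towards frozen VLM image embeddings of the corresponding proposals,
# so the detector can later be classified against arbitrary text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionHead(nn.Module):
    """Student: maps detector RoI features into the VLM embedding space."""
    def __init__(self, roi_dim=1024, emb_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(roi_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))

    def forward(self, roi_feats):
        return F.normalize(self.proj(roi_feats), dim=-1)

def distillation_loss(student_emb, teacher_emb):
    # teacher_emb: frozen VLM image-encoder embeddings of the cropped region proposals.
    return F.l1_loss(student_emb, F.normalize(teacher_emb, dim=-1))

head = RegionHead()
roi_feats = torch.randn(16, 1024)          # RoI-pooled features for 16 proposals
teacher_emb = torch.randn(16, 512)         # e.g., CLIP image embeddings of the cropped proposals
loss = distillation_loss(head(roi_feats), teacher_emb)
loss.backward()
print(loss.item())
```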
4.3. Typology of Vision-Language Models for Visual Recognition
The following figure (Figure 4 from the original paper) presents a typology of vision-language models for visual recognition, classifying methods into three main categories: VLM Pre-training, Transfer Learning, and Knowledge Distillation, each with subcategories corresponding to the detailed methodologies discussed.
The figure is a schematic of the typology and relationships of vision-language models for visual recognition. It contains three main parts, namely VLM pre-training, transfer learning, and knowledge distillation, and lists the individual methods with their corresponding sections.
5. Experimental Setup
5.1. Datasets
The paper extensively covers datasets for both VLM pre-training and evaluation, highlighting the shift towards large-scale, web-sourced image-text pairs.
5.1.1. Datasets for Pre-training VLMs (Section 4.1 & Appendix B)
VLM pre-training primarily relies on massive image-text datasets, often collected from the internet. These are significantly larger and cheaper to acquire than traditional crowd-labelled datasets.
-
Image-Text Datasets:
- SBU Caption [73]: 1M image-text pairs from Flickr.
- COCO Caption [74]: 1.5M captions for 330k images from MS COCO [106] (the c5 split provides 5 captions per image; the c40 split provides 40 captions for a 5k-image subset).
- YFCC100M [75]: 100M multimedia items (99.2M images, 0.8M videos) with associated text.
- Visual Genome (VG) [76]: 5.4M pairs (108k images, 50 descriptions each) with rich annotations (object-level, scene graphs, VQA).
- Conceptual Captions (CC3M) [77]: 3.3M noisy image-text pairs from the web.
- Localized Narratives (LN) [78]: 0.87M image-text pairs with localized multi-modal annotations (mouse traces per word).
- Conceptual 12M (CC12M) [79]: 12M image-text pairs, a larger version of CC3M with less precise data.
- Wikipedia-based Image Text (WIT) [80]: 37.6M image-text pairs across 108 languages from Wikipedia.
- Red Caps (RC) [81]: 12M image-text pairs from Reddit, covering diverse objects and scenes.
- LAION400M [21]: 400M image-text pairs filtered by CLIP, providing embeddings and kNN indices.
- LAION5B [20]: Over 5.8B image-text pairs (2.32B English, 2.26B multilingual, 1.27B unspecified language).
- WuKong [82]: 100M Chinese image-text pairs.
- CLIP [10]: 400M image-text pairs from publicly available internet sources (non-public dataset).
- ALIGN [17]: 1.8B noisy image-text pairs (non-public dataset).
- FILIP [18]: 300M image-text pairs (non-public dataset).
- WebLI [83]: 10B images with 12B corresponding texts across 109 languages (non-public dataset).
-
Auxiliary Datasets: Some studies use additional datasets to provide more information.
- JFT3B [204]: ~3B images with 30k noisy class hierarchy labels.
- C4 [205]: ~750GB English text from Common Crawl.
- Object365 [85]: Object detection dataset with 365 categories, 638K images, and ~10M bounding boxes.
- Gold-G [86]: 0.8M human-annotated visual grounding data for object phrases.
5.1.2. Datasets for VLM Evaluation (Section 4.2 & Appendix C)
VLMs are evaluated on a wide array of standard visual recognition benchmarks to test their generalization and transfer capabilities. The following table (Table 2 from the original paper) summarizes the widely adopted datasets for VLM evaluations:
| Task | Dataset | Year | Classes | Training | Testing | Evaluation Metric |
| Image Classification | MNIST [88] [link] | 1998 | 10 | 60,000 | 10,000 | Accuracy |
| | Caltech-101 [89] [link] | 2004 | 101 | 3,060 | 6,085 | Mean Per Class Accuracy |
| | PASCAL VOC 2007 Classification [90] [link] | 2007 | 20 | 5,011 | 4,952 | 11-point mAP |
| | Oxford 102 Flowers [91] [link] | 2008 | 102 | 2,040 | 6,149 | Mean Per Class Accuracy |
| | CIFAR-10 [23] [link] | 2009 | 10 | 50,000 | 10,000 | Accuracy |
| | CIFAR-100 [23] [link] | 2009 | 100 | 50,000 | 10,000 | Accuracy |
| | ImageNet-1k [40] [link] | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy |
| | SUN397 [24] [link] | 2010 | 397 | 19,850 | 19,850 | Accuracy |
| | SVHN [92] [link] | 2011 | 10 | 73,257 | 26,032 | Accuracy |
| | STL-10 [93] [link] | 2011 | 10 | 1,000 | 8,000 | Accuracy |
| | GTSRB [94] [link] | 2011 | 43 | 26,640 | 12,630 | Accuracy |
| | KITTI Distance [1] [link] | 2012 | 3 | 6,770 | 713 | Accuracy |
| | IIIT5k [95] [link] | 2012 | 66 | 2,000 | 3,000 | Accuracy |
| | Oxford-IIIT PETS [26] [link] | 2012 | 37 | 3,680 | 3,669 | Mean Per Class Accuracy |
| | Stanford Cars [25] [link] | 2013 | 196 | 8,144 | 8,041 | Accuracy |
| | FGVC Aircraft [96] [link] | 2013 | 100 | 6,673 | 3,333 | Mean Per Class Accuracy |
| | Facial Emotion Recognition 2013 [97] [link] | 2013 | 8 | 2,140 | 3,574 | Accuracy |
| | Rendered SST2 [98] [link] | 2013 | 2 | 7,792 | 1,821 | Accuracy |
| | Describable Textures (DTD) [99] [link] | 2014 | 47 | 3,760 | 1,880 | Accuracy |
| | Food-101 [22] [link] | 2014 | 101 | 75,750 | 25,250 | Accuracy |
| | Birdsnap [100] [link] | 2014 | 500 | 42,283 | 4,149 | Accuracy |
| | RESISC45 [101] [link] | 2017 | 45 | 3,150 | 25,200 | Accuracy |
| | CLEVR Counts [102] [link] | 2017 | 8 | 2,000 | 500 | Accuracy |
| | PatchCamelyon [103] [link] | 2018 | 2 | 294,912 | 32,768 | Accuracy |
| | EuroSAT [104] [link] | 2019 | 10 | 10,000 | 5,000 | Accuracy |
| | Hateful Memes [27] [link] | 2020 | 2 | 8,500 | 500 | ROC AUC |
| | Country211 [10] [link] | 2021 | 211 | 43,200 | 21,100 | Accuracy |
| Image-Text Retrieval | Flickr30k [105] [link] | 2014 | -- | 31,783 | 5,000 | Recall |
| | COCO Caption [74] [link] | 2015 | -- | 82,783 | 5,000 | Recall |
| Action Recognition | UCF101 [29] [link] | 2012 | 101 | 9,537 | 1,794 | Accuracy |
| | Kinetics700 [30] [link] | 2019 | 700 | 494,801 | 31,669 | Mean(top1, top5) |
| | RareAct [28] [link] | 2020 | 122 | 7,607 | - | mWAP, mSAP |
| Object Detection | COCO 2014 Detection [106] [link] | 2014 | 80 | 83,000 | 41,000 | box mAP |
| | COCO 2017 Detection [106] [link] | 2017 | 80 | 118,000 | 5,000 | box mAP |
| | LVIS [107] [link] | 2019 | 1203 | 118,000 | 5,000 | box mAP |
| | ODinW [108] [link] | 2022 | 314 | 132,413 | 20,070 | box mAP |
| Semantic Segmentation | PASCAL VOC 2012 Segmentation [90] [link] | 2012 | 20 | 1,464 | 1,449 | mIoU |
| | PASCAL Context [109] [link] | 2014 | 459 | 4,998 | 1,449 | mIoU |
| | Cityscapes [110] [link] | 2016 | 19 | 2,975 | 500 | mIoU |
| | ADE20k [111] [link] | 2017 | 150 | 25,574 | 2,000 | mIoU |
- Image Classification: 27 datasets covering fine-grained (e.g., Oxford-IIIT PETS [26], Stanford Cars [25]), general (e.g., ImageNet [40]), and diverse tasks (e.g., Hateful Memes [27], Country211 [10]).
- Object Detection: 4 datasets, including COCO 2014/2017 Detection [106], LVIS [107] (for large-vocabulary instance segmentation), and ODinW [108] (for task-level transfer evaluation).
- Semantic Segmentation: 4 datasets, including PASCAL VOC 2012 Segmentation [90], PASCAL Context [109], Cityscapes [110], and ADE20k [111].
- Image-Text Retrieval: 2 datasets, Flickr30k [105] and COCO Caption [74].
- Action Recognition: 3 datasets, UCF101 [29], Kinetics700 [30], and RareAct [28].

These datasets are chosen because they are standard benchmarks in computer vision, widely recognized for evaluating performance across various visual recognition tasks, ranging from simple classification to complex dense prediction and multi-modal understanding.
5.2. Evaluation Metrics
The paper uses several standard evaluation metrics, depending on the specific visual recognition task.
-
Accuracy:
- Conceptual Definition: Accuracy measures the proportion of correctly classified instances out of the total number of instances. It provides a general indication of the model's overall correctness.
- Mathematical Formula: $ \mathrm{Accuracy} = \frac{\mathrm{Number~of~Correct~Predictions}}{\mathrm{Total~Number~of~Predictions}} $
- Symbol Explanation:
- Number of Correct Predictions: the count of instances where the model's predicted class matches the true class.
- Total Number of Predictions: the total count of all instances evaluated by the model.
- Used for: Image Classification, Action Recognition.
-
Mean Per Class Accuracy:
- Conceptual Definition: Instead of averaging across all samples, this metric calculates the accuracy for each class individually and then averages these per-class accuracies. This is particularly useful for imbalanced datasets, as it prevents classes with many samples from dominating the overall accuracy score.
- Mathematical Formula: $ \mathrm{Mean~Per~Class~Accuracy} = \frac{1}{N_c} \sum_{c=1}^{N_c} \mathrm{Accuracy}_c $
- Symbol Explanation:
- $N_c$: the total number of classes.
- $\mathrm{Accuracy}_c$: the accuracy calculated specifically for class $c$.
- Used for: Image Classification.
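To make the difference between these two classification metrics concrete, here is a minimal NumPy sketch (not from the survey; function names and the toy labels are illustrative) that computes plain accuracy and mean per-class accuracy on an imbalanced label set:

```python
import numpy as np

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of samples whose predicted class matches the true class."""
    return float(np.mean(y_true == y_pred))

def mean_per_class_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average of per-class accuracies, so frequent classes do not dominate."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

# Toy example: class 0 is four times more frequent than class 1.
y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])
print(accuracy(y_true, y_pred))                 # 0.8
print(mean_per_class_accuracy(y_true, y_pred))  # 0.5
```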
-
Mean Average Precision (mAP) / box mAP:
- Conceptual Definition: mAP is a common metric for object detection that assesses both the precision (how many detected objects are correct) and the recall (how many actual objects are detected) of a model across all object classes. It is the mean of the Average Precision (AP) scores over classes, where AP is the area under the precision-recall curve. box mAP refers specifically to mAP computed on the predicted bounding boxes.
- Mathematical Formula: $ \mathrm{mAP} = \frac{1}{N_c} \sum_{c=1}^{N_c} \mathrm{AP}_c $, where $\mathrm{AP}_c$ is the area under the precision-recall curve for class $c$: $ \mathrm{AP}_c = \int_0^1 P_c(R_c) \, dR_c $
- Symbol Explanation:
- $N_c$: the total number of object classes.
- $\mathrm{AP}_c$: the Average Precision for class $c$.
- $P_c(R_c)$: the precision for class $c$ at recall level $R_c$. Precision is $\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$ and recall is $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, computed at one or more Intersection over Union (IoU) thresholds (e.g., IoU=0.5, or the range IoU=0.5:0.05:0.95 for COCO mAP).
- Used for: Object Detection.
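As an illustration of AP as the area under the precision-recall curve, the following sketch computes AP for a single class, assuming detections have already been matched to ground-truth boxes at a chosen IoU threshold (the box-matching step itself is omitted); all names are illustrative, not the survey's code:

```python
import numpy as np

def average_precision(scores: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    """AP for one class: area under the precision-recall curve.

    scores: confidence of each detection; is_tp: 1 if the detection was matched
    to a ground-truth box at the chosen IoU threshold, else 0; num_gt: number of
    ground-truth boxes for this class.
    """
    order = np.argsort(-scores)              # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    # Integrate precision over recall (all-point interpolation).
    return float(np.trapz(precision, recall))

# mAP is then the mean of AP over classes (and, for COCO, over IoU thresholds).
```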
-
Mean Intersection over Union (mIoU):
- Conceptual Definition: mIoU is the standard metric for semantic segmentation. It computes the Intersection over Union (IoU) between the predicted and ground-truth masks for each class and then averages these IoU values across all classes.
- Mathematical Formula: $ \mathrm{IoU}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c} $, $ \mathrm{mIoU} = \frac{1}{N_c} \sum_{c=1}^{N_c} \mathrm{IoU}_c $
- Symbol Explanation:
- $N_c$: the total number of semantic classes.
- $\mathrm{IoU}_c$: the Intersection over Union for class $c$.
- $\mathrm{TP}_c$ (True Positives for class $c$): the number of pixels correctly predicted as belonging to class $c$.
- $\mathrm{FP}_c$ (False Positives for class $c$): the number of pixels incorrectly predicted as belonging to class $c$ (they actually belong to another class or the background).
- $\mathrm{FN}_c$ (False Negatives for class $c$): the number of pixels that actually belong to class $c$ but were predicted as something else.
- Used for: Semantic Segmentation.
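A minimal sketch of mIoU computed directly from the per-class pixel counts defined above (illustrative, not the survey's code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """mIoU from flattened per-pixel class predictions and ground truth."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:                 # skip classes absent from both masks
            ious.append(tp / denom)
    return float(np.mean(ious))
```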
-
Recall:
- Conceptual Definition: Recall (also known as sensitivity or true positive rate) measures the proportion of actual positive instances that were correctly identified by the model. In retrieval tasks, it measures how many of the relevant items were successfully retrieved.
- Mathematical Formula: $ \mathrm{Recall} = \frac{\mathrm{True~Positives}}{\mathrm{True~Positives} + \mathrm{False~Negatives}} $
- Symbol Explanation:
- $\mathrm{TP}$ (True Positives): the number of positive instances correctly identified.
- $\mathrm{FN}$ (False Negatives): the number of positive instances incorrectly identified as negative.
- Used for: Image-Text Retrieval.
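In image-text retrieval, results are usually reported as Recall@K over an image-text similarity matrix. A small illustrative sketch, assuming the matching caption of image i is stored at index i (a simplification of the usual multi-caption setup):

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int = 5) -> float:
    """Recall@K for retrieval from an N x N image-text similarity matrix."""
    n = similarity.shape[0]
    topk = np.argsort(-similarity, axis=1)[:, :k]        # top-K text indices per image
    hits = np.any(topk == np.arange(n)[:, None], axis=1)  # is the true caption retrieved?
    return float(np.mean(hits))
```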
-
ROC AUC (Receiver Operating Characteristic Area Under the Curve):
- Conceptual Definition: ROC AUC is a performance metric for binary classification. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various decision thresholds, and AUC is the area under this curve. A higher AUC (closer to 1) indicates better discriminative power, while an AUC of 0.5 corresponds to random guessing.
- Mathematical Formula: $ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\left(\mathrm{FPR}^{-1}(t)\right) \, dt $
- Symbol Explanation:
- $\mathrm{TPR}$ (True Positive Rate): equal to recall, $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$.
- $\mathrm{FPR}$ (False Positive Rate): $\mathrm{FP}/(\mathrm{FP}+\mathrm{TN})$.
- $t$: varies from 0 to 1, sweeping over classification thresholds.
- Used for: binary classification tasks such as Hateful Memes.
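For completeness, ROC AUC is readily available in scikit-learn; a tiny illustrative example (assuming scikit-learn is installed, with made-up labels and scores):

```python
from sklearn.metrics import roc_auc_score

# y_true: binary labels (e.g., hateful vs. not hateful); y_score: model scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75
```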
-
Mean(top1, top5):
- Conceptual Definition: For multi-class classification, top-1 accuracy is the conventional accuracy (the highest-probability prediction must match the true class), while top-5 accuracy counts a prediction as correct if the true class is among the five highest-probability classes. Mean(top1, top5) averages these two numbers, giving a broader sense of performance in tasks with many classes.
- Mathematical Formula: $ \mathrm{Mean(top1, top5)} = \frac{\mathrm{Top1~Accuracy} + \mathrm{Top5~Accuracy}}{2} $
- Symbol Explanation:
- $\mathrm{Top1~Accuracy}$: the percentage of samples where the model's highest-probability prediction is correct.
- $\mathrm{Top5~Accuracy}$: the percentage of samples where the true label is among the top 5 highest-probability predictions.
- Used for: Action Recognition (Kinetics700).
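A short illustrative sketch of top-k accuracy and the Mean(top1, top5) combination, computed from a matrix of class probabilities (names are assumptions, not the survey's code):

```python
import numpy as np

def topk_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest-probability classes."""
    topk = np.argsort(-probs, axis=1)[:, :k]
    return float(np.mean(np.any(topk == labels[:, None], axis=1)))

def mean_top1_top5(probs: np.ndarray, labels: np.ndarray) -> float:
    # Assumes at least 5 classes, as in Kinetics700.
    return 0.5 * (topk_accuracy(probs, labels, 1) + topk_accuracy(probs, labels, 5))
```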
-
mWAP, mSAP (Mean Weighted Average Precision, Mean Semantic Average Precision):
- Conceptual Definition: These are specialized metrics for fine-grained action recognition, particularly relevant for datasets like RareAct [28]. They likely aim to capture the model's ability to recognize rare or complex actions, possibly by weighting based on rarity or semantic similarity; the specific formulas are defined by the dataset's creators.
- Used for: Action Recognition (RareAct).
5.3. Baselines
The paper compares the proposed VLM methods against various baselines, depending on the task and context:
- Zero-shot prediction baseline (w/o Transfer): For transfer learning evaluations, the direct zero-shot performance of a pre-trained VLM (e.g., CLIP [10]) without any fine-tuning serves as a crucial baseline. This shows the inherent generalization capability of the VLM before adaptation.
- Other VLM pre-training methods: When evaluating new VLM pre-training approaches, they are benchmarked against existing leading VLMs (e.g., ALIGN [17], FILIP [18], COCA [19]).
- Different transfer learning strategies: For VLM transfer, various prompt tuning methods (e.g., CoOp [31], CoCoOp [32]), feature adaptation methods (e.g., Clip-Adapter [33], Tip-Adapter [34]), and other adaptation techniques are compared.
- Knowledge distillation methods for specific tasks: In object detection and semantic segmentation, the paper compares different knowledge distillation strategies (e.g., ViLD [36], DetPro [37] for detection; CLIPSeg [175], LSeg [176] for segmentation). Often, a baseline "detector/segmenter from scratch" or "detector/segmenter with only image-level pre-training" is implied or explicitly used to show the benefit of VLM knowledge.
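To make the w/o Transfer zero-shot baseline concrete, here is a minimal sketch of CLIP-style zero-shot classification using the openai/CLIP package; the class names, prompt template, and image path are illustrative assumptions, not taken from the survey:

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["cat", "dog", "car"]                    # illustrative label set
prompts = [f"a photo of a {c}." for c in class_names]  # common prompt template
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text_tokens)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)  # cosine similarity
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T            # scaled similarities
    pred = logits.softmax(dim=-1).argmax(dim=-1).item()

print("Predicted class:", class_names[pred])
```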
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Performance of VLM Pre-training (Section 8.1)
The paper analyzes the performance of VLM pre-training methods primarily through zero-shot prediction on image classification, object detection, and semantic segmentation tasks.
The following figure (Figure 14 from the original paper) illustrates the performance versus data size and model size for VLM pre-training:
This chart shows the effect of pre-training data size and model choice on ImageNet Top-1 accuracy: the left panel plots ResNet-50 performance as the pre-training data grows, and the right panel plots the accuracy of several model architectures with the pre-training data fixed at 400M.
This figure demonstrates two key scaling laws:
- Data Size: As shown in the left graph (ResNet-50 backbone), increasing the pre-training data size leads to consistent improvements in Top-1 accuracy on ImageNet. For example, accuracy rises from ~40% with 10M data to ~75% with 100M data. This highlights the importance of web-scale image-text pairs.
- Model Size: The right graph (with 400M pre-training data) shows that scaling up model size (from ResNet-50 to ViT-L/14) also consistently improves Top-1 accuracy on ImageNet. For instance, ResNet-50 achieves ~60%, while ViT-L/14 reaches over 75%. This indicates that larger models have greater capacity to learn from massive datasets.

The following are the results from Table 6 of the original paper:
Average Accuracy (%) on 11 Image Classification Datasets:

| Methods | Image encoder | Text encoder | Data Size | ImageNet | Caltech-101 | DTD | EuroSAT | Food-101 | MNIST | SUN397 | CIFAR-100 | FGVC Aircraft | Oxford-IIIT PETS | Average |
| CLIP [10] | ViT-L/14 | Transformer | 400M | 76.2 | 95.7 | 77.5 | 93.8 | 68.4 | 78.8 | 37.2 | 55.7 | 93.5 | 92.8 | 78.3 |
| ALIGN [17] | EfficientNet | BERT | 1.8B | 76.4 | - | - | - | - | - | - | - | - | - | - |
| OTTER [112] | FBNetV3-C | DeCLUTR-Sci | 3M | - | - | - | - | - | - | - | - | - | - | - |
| DeCLIP [113] | RENET-Y | BERT | 88M | 73.7 | - | - | - | - | - | - | - | - | - | - |
| ZeroVL [114] | ViT-B/16 | BERT | 100M | - | - | - | - | - | - | - | - | - | - | - |
| FILIP [18] | ViT-L/14 | Transformer | 340M | 77.1 | 95.7 | 75.3 | 92.2 | 73.1 | 70.8 | 60.2 | 60.7 | 92.0 | 93.0 | 90.1 |
| UniCL [65] | Swin-tiny | Transformer | 16.3M | 71.3 | - | - | - | - | - | - | - | - | - | - |
| Florence [115] | CoSwin | RoBERT | 900M | 83.7 | 94.6 | 77.6 | 95.1 | 77.0 | 93.2 | 55.5 | 66.4 | 95.9 | 94.7 | 86.2 |
| SLIP [64] | ViT-L | Transformer | 15M | 47.9 | 87.5 | 54.2 | 69.2 | 56.0 | 9.0 | 9.5 | 29.9 | 41.6 | 80.9 | 60.2 |
| PyramidCLIP [116] | ResNet50 | T5 | 143M | 47.8 | 81.5 | 53.7 | 67.8 | 65.8 | 65.0 | 12.6 | 47.2 | 83.7 | 81.7 | 65.8 |
| ChineseCLIP [117] | ViT-L/14 | CNRoberta | 4B | - | 96.0 | 79.7 | - | - | - | 26.2 | 51.2 | - | - | - |
| LiT [118] | ViT-L/14 | Transformer | 2M | 85.2 | - | - | - | - | - | - | - | - | - | - |
| AltCLIP [119] | ViT-L/14 | - | 70M | - | - | - | - | - | - | - | - | - | - | - |
| FLAVA [42] | ViT-B/16 | Transformer | 1.1B | 62.6 | 91.5 | 68.6 | 79.5 | - | - | 75.4 | - | - | - | - |
| KELIP [120] | ViT-B/32 | - | 4.8B | 86.3 | - | - | - | - | - | - | 51.2 | - | - | - |
| COCA [19] | ViT-G/14 | Transformer | 35M | 48.8 | 83.4 | 54.5 | 65.8 | 59.9 | 18.0 | 5.8 | 57.1 | 33.2 | 73.9 | 50.0 |
| nCLIP [121] | CoSwin | RoBERT5 | 813M | 85.8 | - | - | - | - | - | - | - | - | - | - |
| K-lite [122] | ViT-B/16 | BART | 26M | 47.4 | 81.9 | 47.5 | 59.2 | 58.7 | 7.8 | 7.5 | 32.9 | 39.2 | 79.5 | 54.0 |
| NLIP [123] | Transformer | - | 30M | 54.2 | 87.8 | 56.5 | 64.6 | 61.1 | 19.5 | 4.7 | 36.6 | - | 84.0 | 8.0 |
| UniCLIP [84] | ViT-B/32 | - | 12B | 85.4 | - | - | - | - | - | - | 69.2 | - | - | - |
| PaLI [83] | ViT-e | mT5 | 12B | 70.5 | - | - | - | - | - | - | - | - | - | - |
| CLIPPO [43] | ViT-L/16 | ViT-L/16 | 4M | 27.3 | - | 31.4 | - | - | - | - | - | - | - | - |
| OneFormer [44] | ViT-L/16 | ViT-L/16 | - | - | - | - | - | - | - | - | - | - | 76.9 | - |
| RA-CLIP [125] | ViT-B/32 | BERT | 15M | 53.5 | 89.4 | 62.3 | 43.8 | 46.5 | - | - | 25.6 | - | - | 70.3 |
| LA-CLIP [126] | ViT-B/32 | Transformer | 400M | 64.4 | 92.4 | 73.0 | 79.7 | 64.9 | 81.9 | 20.8 | 55.4 | 87.2 | 91.8 | - |
| ALIP [127] | ViT-B/32 | Transformer | 15M | 40.3 | 83.8 | 51.9 | 45.4 | 47.8 | 3.4 | 2.7 | 23.2 | 30.7 | 74.1 | 54.8 |
| GrowCLIP [128] | ViT-B/16 | Transformer | 12M | 36.1 | 60.7 | 28.3 | 42.5 | 45.5 | - | - | 17.3 | - | 71.9 | 23.3 |
From Table 6, several key observations can be made:
- Superior Zero-shot Performance: VLMs achieve impressive zero-shot performance across various image classification tasks, indicating their strong generalization capabilities (e.g., CLIP achieves 76.2% on ImageNet and 95.7% on Caltech-101).
- Impact of Data Size: Models trained on larger datasets often show better performance. For instance, ALIGN (1.8B data) and Florence (900M data) show very strong results, with Florence achieving 83.7% on ImageNet. UniCLIP and PaLI with 12B data also report high performance.
- Impact of Model Size/Architecture: While not directly compared in every row, the general trend from Figure 14 suggests that larger, more capable architectures (e.g., ViT-L/14 vs. ResNet-50) contribute to better performance. FILIP (ViT-L/14, 340M) shows consistently high performance across many tasks (average 90.1%), possibly due to its fine-grained vision-language pre-training. Florence and nCLIP also achieve very high ImageNet scores with powerful backbones.
- Diversity of VLMs: Different VLMs excel on different subsets of tasks, indicating the impact of their specific pre-training objectives and data. For example, FILIP has high average accuracy, while Florence and nCLIP show very high ImageNet scores.
- Challenges for Small Datasets/Specific Tasks: Some VLMs struggle on certain datasets, indicating that generalization is not uniform. For instance, SLIP, COCA, K-lite, NLIP, ALIP, and GrowCLIP show lower performance on datasets like MNIST and SUN397, which might require different inductive biases or more specific fine-tuning.

The superior generalization of VLMs is attributed to:
- Big Data: Training on millions or billions of image-text pairs from the internet allows VLMs to cover vast visual and language concepts.
- Big Model: Large models (e.g., ViT-G in COCA with 2B parameters) provide the capacity to learn from such extensive data.
- Task-agnostic Learning: Language supervision from image-text pairs is diverse and general-purpose, leading to models that generalize well across various downstream tasks, unlike task-specific labels.
The following are the results from Table 7 of the original paper:
| Method | Image encoder | Text encoder | Data size | VOC [90] | PASCAL C. [109] | COCO [106] |
| GroupViT [129] | ViT | Transformer | 26M | 52.3 | 22.4 | - |
| SegCLIP [46] | ViT | Transformer | 3.4M | 52.6 | 24.7 | 26.5 |
For semantic segmentation, GroupViT and SegCLIP demonstrate the capability of VLMs for dense prediction, achieving mIoU scores in the 50s on PASCAL VOC. The performance on PASCAL Context and COCO is lower, reflecting the increased complexity of these datasets and the nascent stage of VLMs for dense tasks.
The following are the results from Table 8 of the original paper:
| Method | Image encoder | Text encoder | Data size | COCO [106] | LVIS [107] | LVIS Mini. [107] |
| RegionCLIP [131] | ResNet50x4 | Transformer | 118k | 29.6 | 11.3 | - |
| GLIP [67] | Swin-L | BERT | 27.43M | 49.8 | 26.9 | 34.3 |
| FIBER [71] | Swin-B | RoBERTa | 4M | 49.3 | - | 32.2 |
| DetCLIP [45] | Swin-L | BERT | 2.43M | - | 35.9 | - |
For object detection, GLIP and FIBER show strong mAP scores on COCO and LVIS, with GLIP achieving 49.8% on COCO and 26.9% on LVIS. These results indicate that VLMs can enable effective zero-shot prediction for dense tasks, especially when equipped with local VLM pre-training objectives like region-word matching. The field for dense visual tasks is relatively underexplored, hence the limited number of VLMs presented.
Limitations of VLMs (from Section 8.1):
- Performance Saturation: Beyond a certain point, further scaling up data or model size yields diminishing returns.
- Computational Expense: Large-scale VLM pre-training requires immense computational resources (e.g., CLIP ViT-L needed 256 V100 GPUs for 288 hours).
- Overhead in Inference: Large models introduce substantial computation and memory overheads during both training and inference.
6.1.2. Performance of VLM Transfer Learning (Section 8.2)
Transfer learning methods adapt pre-trained VLMs to downstream tasks using various fine-tuning strategies. The evaluations are primarily on image classification datasets, often using a 16-shot setup for few-shot supervised methods.
The following are the results from Table 9 of the original paper:
| Methods | Image encoder | Setup | Average Accuracy (%) on 11 Image Classification Datasets | |||||||||||
| ImageNet | Caltech-101 | DTD | EuroSAT | Food-101 | MNIST | SUN397 | CIFAR-100 | FGVC Aircraft | Oxford-IIIT PETS | Average | ||||
| Baseline [143] | ResNet-50 | w/o Transfer | 59.2 | 90.9 | 60.3 | 86.1 | 85.8 | 55.6 | 66.1 | 77.3 | 16.9 | 60.2 | 41.6 | 62.7 |
| Baseline [10] | ViT-B/16 | w/o Transfer | 71.7 | 95.4 | 70.2 | 92.8 | 94.1 | 68.6 | 74.8 | 90.6 | 31.1 | 72.2 | 56.4 | 73.5 |
| Baseline [10] | ViT-L/14 | w/o Transfer | 73.7 | 96.4 | 76.2 | 93.7 | 94.5 | 78.8 | 78.3 | 93.8 | 37.2 | 68.4 | 55.7 | 76.9 |
| CoOp [31] | ViT-B/16 | Few-shot Sup. | 71.6 | 95.8 | 71.9 | 95.8 | 96.4 | 68.1 | 74.1 | 85.2 | 28.7 | 72.5 | 54.2 | 67.5 |
| CoCoOp [32] | ViT-B/16 | Few-shot Sup. | 75.8 | 95.8 | 73.1 | 96.4 | 97.7 | 72.0 | 81.7 | 91.0 | 27.7 | 78.3 | 64.8 | 77.6 |
| SubPT [132] | ResNet50 | Few-shot Sup. | 66.4 | 91.7 | 63.4 | 91.7 | 91.8 | 60.7 | 73.8 | 81.0 | 20.3 | 70.2 | 54.7 | 68.1 |
| LASP [133] | ViT-B/16 | Few-shot Sup. | 76.1 | 95.8 | 73.0 | 95.8 | 95.7 | 72.2 | 81.6 | 90.5 | 31.6 | 77.8 | 62.8 | 76.8 |
| ProDA [134] | ResNet50 | Few-shot Sup. | - | 91.3 | 65.3 | 96.4 | 90.0 | 75.5 | 95.5 | 82.4 | 36.6 | - | 70.1 | 84.3 |
| VPT [135] | ViT-B/16 | Few-shot Sup. | 77.4 | 96.8 | 73.4 | - | 96.8 | 73.1 | 81.1 | 91.6 | 34.7 | 78.5 | 67.3 | - |
| ProGrad [136] | ViT-B/16 | Few-shot Sup. | 67.9 | 91.5 | 62.1 | 91.5 | 93.4 | 62.7 | - | - | - | - | - | 79.0 |
| CPL [137] | ResNet-50 | Few-shot Sup. | - | 96.3 | - | - | - | - | 78.7 | 91.0 | 21.9 | 70.3 | 57.8 | 68.5 |
| PLOT [138] | ViT-B/16 | Few-shot Sup. | 73.9 | 92.2 | 63.0 | 87.2 | 94.8 | 72.8 | 77.1 | 84.5 | 34.5 | 70.0 | 65.6 | 77.3 |
| CuPL [160] | ViT-L/14 | Unsupervised | 68.4 | 91.4 | 61.1 | 89.5 | 71.0 | 76.6 | 93.3 | 36.1 | - | - | - | - |
| UPL [143] | ResNet-50 | Unsupervised | 64.8 | 94.2 | 69.0 | 87.8 | 66.9 | 69.0 | 77.9 | 84.7 | 21.7 | 66.4 | 55.1 | 70.2 |
| TPT [144] | ViT-B/16 | Unsupervised | 71.0 | 94.2 | - | 87.8 | 66.9 | 69.0 | 77.9 | 84.7 | 24.8 | 66.4 | 47.8 | 71.0 |
| VP [147] | ViT-B/32 | Few-shot Sup. | - | - | - | - | 85.0 | - | 70.3 | 78.9 | - | 65.5 | 57.1 | 60.8 |
| UPT [149] | ViT-B/16 | Few-shot Sup. | 76.2 | 96.1 | 73.2 | 96.1 | 96.3 | 71.8 | 81.0 | 91.3 | 34.5 | 78.7 | 65.6 | 77.2 |
| MaPLE [151] | ViT-B/16 | Few-shot Sup. | 78.6 | 96.0 | 73.5 | 96.0 | 96.6 | 73.5 | 82.6 | 91.4 | 36.5 | 79.7 | 68.2 | 80.8 |
| CAVPT [152] | ViT-B/16 | Few-shot Sup. | 83.2 | 96.1 | 72.5 | 93.5 | 88.2 | 97.6 | 85.0 | 57.9 | 74.3 | 72.6 | 92.1 | 85.3 |
| Tip-Adapter [34] | ViT-B/16 | Few-shot Sup. | - | - | 70.8 | - | - | - | - | - | - | - | - | - |
| SuS-X [154] | ResNet-50 | Unsupervised | - | - | 66.8 | - | - | - | - | - | - | - | - | 50.6 |
| SgVA-CLIP [156] | ViT-B/16 | Few-shot Sup. | - | - | 73.3 | - | - | - | - | - | - | 76.4 | - | - |
| VT-Clip [157] | ResNet-50 | Few-shot Sup. | - | - | - | - | 93.1 | - | - | - | - | - | 65.7 | - |
| CALIP [158] | ResNet-50 | Unsupervised | 59.4 | 87.7 | 60.6 | 58.6 | 77.4 | 66.4 | 56.3 | 17.7 | 86.2 | 42.4 | 38.9 | 61.7 |
| Wise-FT [162] | ViT-L/14 | Supervised | - | - | 87.1 | - | - | - | - | - | - | - | - | - |
| KgCoOp [145] | ViT-B/16 | Few-shot Sup. | 74.4 | 94.6 | 70.1 | 93.2 | 71.9 | 90.6 | 86.5 | 32.4 | 71.7 | 58.3 | 71.0 | 78.4 |
| ProTeCt [146] | ViT-B/16 | Few-shot Sup. | 83.2 | 96.5 | 74.6 | 93.7 | - | - | - | - | - | 74.5 | - | - |
| RePrompt [148] | ViT-B/16 | Few-shot Sup. | 75.7 | 93.4 | - | 87.8 | 85.0 | 97.1 | 87.4 | 50.3 | 77.5 | 73.7 | 67.1 | 86.4 |
| TaskRes [159] | ResNet-50 | Few-shot Sup. | - | - | 65.0 | - | - | - | - | - | - | - | - | - |
| VCD [161] | ViT-B/16 | Unsupervised | 68.0 | 86.9 | - | 76.8 | 96.0 | - | 77.6 | 88.5 | 36.3 | 70.6 | - | - |
From Table 9, the following conclusions can be drawn:
- Consistent Improvement from Transfer Learning: Transfer learning setups consistently improve performance on downstream tasks compared to the w/o Transfer baselines. For example, on ImageNet, Wise-FT (supervised) boosts accuracy by 10.9% (vs. the ViT-L/14 baseline), CoCoOp (few-shot supervised) by 1.7%, and UPL (unsupervised) by 0.8%. This validates the effectiveness of adaptation mechanisms in mitigating domain gaps.
- Gap between Few-shot and Supervised Transfer: Few-shot supervised transfer methods (e.g., CoCoOp at 75.8% on ImageNet) generally lag behind fully supervised transfer (e.g., Wise-FT at 87.1%). This is largely due to the risk of overfitting to the limited labelled samples in few-shot settings.
- Promising Unsupervised Transfer: Unsupervised transfer methods (e.g., UPL at 64.8%, TPT at 71.0% on ImageNet) can achieve performance comparable to or even better than few-shot supervised transfer methods. For instance, UPL (ResNet-50) outperforms CoOp (ViT-B/16) in average accuracy (70.2% vs. 67.5%). This is attributed to access to massive unlabelled data, reducing overfitting risks, and highlights unsupervised transfer as a promising direction despite challenges like noisy pseudo-labels.
- Effectiveness of Prompt Tuning and Adapters: Methods like CoCoOp, LASP, UPT, MaPLE, and CAVPT (all prompt tuning) and Wise-FT (fine-tuning) show strong results, indicating the efficacy of parameter-efficient adaptation strategies.
6.1.3. Performance of VLM Knowledge Distillation (Section 8.3)
Knowledge distillation transfers general VLM knowledge to task-specific models, particularly for dense prediction tasks like object detection and semantic segmentation.
The following are the results from Table 10 of the original paper:
| Method | Vision-Language Model | COCO APbase | COCO APnovel | COCO AP | LVIS APr | LVIS APc | LVIS APf | LVIS AP |
| Baseline [36] | - | 28.3 | 26.3 | 27.8 | 19.5 | 19.7 | 17.0 | 18.6 |
| ViLD [36] | CLIP ViT-B/32 | 59.5 | 27.6 | 51.3 | 16.7 | 26.5 | 34.2 | 27.8 |
| DetPro [37] | CLIP ViT-B/32 | - | - | 34.9 | 20.8 | 27.8 | 32.4 | 28.4 |
| HierKD [186] | CLIP ViT-B/32 | 53.5 | 27.3 | - | - | - | - | - |
| RKD [187] | CLIP ViT-B/32 | 56.6 | 36.9 | 51.0 | 21.1 | 25.0 | 29.1 | 25.9 |
| PromptDet [188] | CLIP Transformer | - | 26.6 | 50.6 | 21.4 | 23.3 | 29.3 | 25.3 |
| PB-OVD [189] | CLIP Transformer | 46.1 | 30.8 | 42.1 | - | - | - | - |
| CondHead [190] | CLIP ViT-B/32 | 60.8 | 29.8 | 49.0 | 18.8 | 28.3 | 33.7 | 28.8 |
| VLDet [191] | CLIP Transformer | 50.6 | 32.0 | 45.8 | 26.3 | 39.4 | 41.9 | 38.1 |
| F-VLM [192] | CLIP ResNet-50 | - | 28.0 | 39.6 | 32.8 | - | - | 34.9 |
| OV-DETR [173] | CLIP ViT-B/32 | 52.7 | 29.4 | 61.0 | 17.4 | 25.0 | 32.5 | 26.6 |
| Detic [193] | CLIP Transformer | 45.0 | 27.8 | 47.1 | 17.8 | 26.3 | 31.6 | 26.8 |
| OWL-ViT [195] | CLIP ViT-B/32 | - | - | 28.1 | 18.9 | - | - | 22.1 |
| VL-PLM [196] | CLIP ViT-B/32 | 60.2 | 34.4 | 53.5 | - | - | - | 22.2 |
| P3OVD [197] | CLIP ResNet-50 | 51.9 | 31.5 | 46.6 | - | - | - | 10.6 |
| RO-ViT [199] | CLIP ViT-L/16 | - | 33.0 | 47.7 | 32.1 | - | - | 34.0 |
| BARON [200] | CLIP ResNet-50 | 54.9 | 42.7 | 51.7 | 23.2 | 29.3 | 32.5 | 29.5 |
| OADP [201] | CLIP ViT-B/32 | 53.3 | 30.0 | 47.2 | 21.9 | 28.4 | 32.0 | 28.7 |
For object detection, VLM knowledge distillation brings clear performance improvements:
- Significant Gains over Baseline: Methods like ViLD (CLIP ViT-B/32) dramatically improve AP on COCO from a baseline of 27.8% to 51.3%, and on LVIS from 18.6% to 27.8%. This highlights the power of distilling the VLM's broad object understanding.
- Handling Novel Categories: Many methods show strong APnovel performance, indicating their ability to detect objects from categories not seen during the detector's base training, leveraging the VLM's open-vocabulary capabilities. For example, BARON achieves 42.7% APnovel on COCO.
- Diverse Approaches: The table shows a variety of distillation techniques, from feature-space distillation (ViLD, RKD) to pseudo-labelling (PB-OVD, P3OVD), each contributing to advancing open-vocabulary object detection.
- Performance on LVIS: LVIS is a challenging dataset with a long-tail distribution and many rare categories. Methods like VLDet achieve higher APr (rare categories) and APc (common categories) compared to ViLD, demonstrating improved performance on fine-grained and long-tail distributions.

The following are the results from Table 11 of the original paper:
| Method | Vision-Language Model | A-847 [111] | PC-459 [109] | A-150 [111] | PC-59 [109] | PAS-20 [90] | C-19 [110] |
| Baseline [203] | - | - | - | - | 24.3 | 18.3 | - |
| LSeg [35] | CLIP ResNet-101 | - | - | - | - | 47.4 | - |
| ZegFormer [176] | CLIP ResNet-50 | - | - | 16.4 | - | 80.7 | - |
| OVSeg [179] | CLIP Swin-B | 9.0 | 12.4 | 29.6 | 55.7 | 94.5 | - |
| ZSSeg [180] | CLIP ResNet-101 | 7.0 | - | 20.5 | 47.7 | - | 34.5 |
| OpenSeg [181] | CLIP Eff-B7 | 6.3 | 9.0 | 21.1 | 42.1 | - | - |
| ReCo [182] | CLIP ResNet-101 | - | - | - | - | - | 24.2 |
| FreeSeg [185] | CLIP ViT-B/16 | - | - | 39.8 | - | 86.9 | - |
For semantic segmentation, VLM knowledge distillation also shows clear benefits:
- Substantial mIoU Improvement: Compared to the baseline (e.g., 18.3% on PAS-20), methods like ZegFormer (80.7%) and FreeSeg (86.9%) achieve very high mIoU scores on PASCAL VOC 2012 Segmentation. This indicates successful transfer of the VLM's semantic understanding to pixel-level tasks.
- Generalization to Diverse Datasets: While performance varies (e.g., it is lower on ADE20k and PASCAL Context, which have many classes), these methods demonstrate the ability to segment pixels for categories beyond what the base segmentation model was trained on.
- Task-Specific Designs: The strong performance is largely due to the ability of knowledge distillation to combine the VLM's broad knowledge with the strengths of specialized segmentation architectures.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Dataset | Year | Num. of Image-Text Pairs | Language | Public |
| SBU Caption [73] [link] | 2011 | 1M | English | ✓ |
| COCO Caption [74] [link] | 2016 | 1.5M | English | ✓ |
| Yahoo Flickr Creative Commons 100 Million (YFCC100M) [75] [link] | 2016 | 100M | English | ✓ |
| Visual Genome (VG) [76] [link] | 2017 | 5.4M | English | ✓ |
| Conceptual Captions (CC3M) [77] [link] | 2018 | 3.3M | English | ✓ |
| Localized Narratives (LN) [78] [link] | 2020 | 0.87M | English | ✓ |
| Conceptual 12M (CC12M) [79] [link] | 2021 | 12M | English | ✓ |
| Wikipedia-based Image Text (WIT) [80] [link] | 2021 | 37.6M | 108 Languages | ✓ |
| Red Caps (RC) [81] [link] | 2021 | 12M | English | ✓ |
| LAION400M [21] [link] | 2021 | 400M | English | ✓ |
| LAION5B [20] [link] | 2022 | 5B | Over 100 Languages | ✓ |
| WuKong [82] [link] | 2022 | 100M | Chinese | ✓ |
| CLIP [10] | 2021 | 400M | English | X |
| ALIGN [17] | 2021 | 1.8B | English | X |
| FILIP [18] | 2021 | 300M | English | X |
| WebLI [83] | 2022 | 12B | 109 Languages | X |
6.3. Ablation Studies / Parameter Analysis
The paper primarily focuses on surveying and benchmarking existing work rather than conducting new ablation studies. However, the discussions in Section 8 provide indirect insights into parameter analysis and the effectiveness of different components:
- Scaling Laws (Section 8.1, Figure 14): The analysis of how performance scales with data size and model size implicitly acts as a form of ablation study, demonstrating that these are critical factors for VLM effectiveness. It shows that both larger datasets and larger models consistently improve zero-shot accuracy, but also notes that performance eventually saturates.
- Comparison of Objectives (Section 5, Table 3): The categorization of VLM pre-training methods by contrastive, generative, and alignment objectives, and the observation that hybrid objectives (e.g., FLAVA [42] uses Con, Gen, Align) often lead to more robust VLMs, suggests that combining different learning signals is beneficial. The discussion highlights that contrastive objectives learn discriminative features, generative objectives learn contextual information, and alignment objectives are easy to optimize for global/local matching, but often work best when auxiliary to others.
- Frameworks (Section 3.3, Figure 5): The discussion on two-tower, two-leg, and one-tower frameworks points to architectural design choices that affect efficiency and cross-modal communication. While no direct comparative ablation is provided, the existence of different frameworks suggests ongoing research into optimal VLM architectures.
- Transfer Learning Setups (Section 6.2, 8.2): The comparison between supervised, few-shot supervised, and unsupervised transfer setups (Table 9) serves as an implicit ablation. It shows that even unsupervised transfer can be competitive with few-shot supervised transfer, highlighting the potential for annotation-efficient methods and the importance of exploiting unlabelled data. The observation that few-shot methods can overfit to limited samples indicates a sensitivity to training data size in the adaptation phase.
- Knowledge Distillation Approaches (Section 7.2, 8.3): The differentiation between feature-space distillation and pseudo-labelling distillation for object detection and semantic segmentation implicitly evaluates the effectiveness of different knowledge transfer mechanisms. The results show that both are viable for improving dense prediction tasks, indicating that the choice of distillation strategy can be adapted to task-specific needs.

In essence, while formal ablation studies with controlled experiments and distinct tables are not present, the survey's comparative analysis across numerous papers provides a high-level overview of how different methodological choices (objectives, scale, transfer mechanisms) impact VLM performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey provides a comprehensive review of Vision-Language Models (VLMs) for visual recognition tasks, ranging from image classification to object detection and semantic segmentation. It elucidates how VLMs address the limitations of traditional visual recognition paradigms, primarily the heavy reliance on labor-intensive labelled data and task-specific model training. By leveraging web-scale image-text pairs and learning rich vision-language correlations, VLMs enable impressive zero-shot predictions and open-vocabulary visual recognition.
The paper systematically dissects VLMs through their foundational elements (network architectures, pre-training objectives, frameworks), datasets, and three main technical approaches: VLM pre-training, VLM transfer learning, and VLM knowledge distillation. Key findings highlight the consistent performance gains from scaling up pre-training data and model size, the effectiveness of transfer learning in bridging domain gaps for downstream tasks (especially unsupervised transfer), and the utility of knowledge distillation in empowering dense prediction models with VLM's generalized knowledge. The survey provides extensive benchmarking and discussion, offering a clear picture of the current state and future potential of VLMs in visual recognition.
7.2. Limitations & Future Work
The authors identify several research challenges and potential future directions for VLMs:
7.2.1. For VLM Pre-training
- Fine-grained Vision-Language Correlation Modelling: Current VLMs often excel at image-level understanding but lack the detailed local vision-language correspondence needed for dense prediction tasks (object detection, semantic segmentation). Future work should focus on methods like region-word matching to enhance zero-shot dense prediction.
- Unification of Vision and Language Learning: Most VLMs use separate encoders (two-tower frameworks). Moving towards one-tower architectures (e.g., using a single Transformer for both modalities) could improve cross-modal communication and training efficiency, leading to more sustainable VLMs.
- Pre-training VLMs with Multiple Languages: Most existing VLMs are English-centric, which can introduce cultural and regional biases and limit global applicability. Developing multilingual VLMs can enable learning diverse visual characteristics associated with different languages, making VLMs effective across various linguistic scenarios.
- Data-efficient VLMs: The enormous computational resources required for training large-scale VLMs are a significant concern for sustainability. Research into data-efficient VLMs that extract more useful information from limited image-text data (e.g., via supervision among image-text pairs) is crucial.
- Pre-training VLMs with LLMs: Leveraging Large Language Models (LLMs) to augment text captions (e.g., generating richer synthetic descriptions) can enhance VLM pre-training by providing more diverse and informative language supervision, improving vision-language correlation.
7.2.2. For VLM Transfer Learning
- Unsupervised VLM Transfer: While most transfer methods require labelled data (supervised or few-shot supervised), unsupervised VLM transfer (using unlabelled data) is more annotation-efficient and has lower overfitting risks. More research is needed in this promising but challenging area.
- VLM Transfer with Visual Prompt/Adapter: The focus has largely been on text prompt learning. Exploring visual prompt learning or visual adapters is essential, as they are complementary to text prompting and can facilitate pixel-level adaptation for dense prediction tasks.
- Test-time VLM Transfer: Current transfer methods often involve repetitive fine-tuning for each downstream task. Test-time VLM transfer, which adapts prompts on the fly during inference, could circumvent this inefficiency.
- VLM Transfer with LLMs: Exploiting LLMs to automatically generate customized text prompts for downstream tasks is an automatic and data-efficient approach compared to manual prompt engineering or learnable prompts.
7.2.3. For VLM Knowledge Distillation
- Knowledge Distillation from Multiple VLMs: Exploring how to coordinate knowledge distillation from several different VLMs could harness their synergistic effects for improved performance.
- Knowledge Distillation for Other Visual Recognition Tasks: Beyond object detection and semantic segmentation, VLM knowledge distillation could be extended to other complex tasks like instance segmentation, panoptic segmentation, person re-identification, etc.
7.3. Personal Insights & Critique
This survey offers an exceptionally clear and comprehensive overview of the rapidly evolving field of Vision-Language Models for visual recognition. Its structured approach, detailed explanations of foundations and methodologies, and thorough benchmarking are invaluable for both beginners and experienced researchers. The emphasis on the transition from traditional, data-hungry paradigms to web-scale, zero-shot learning with VLMs effectively highlights the paradigm shift.
Inspirations and Applications:
- Efficiency in AI Development: The
zero-shotcapability of VLMs is a game-changer for applications where labelled data is scarce or new categories constantly emerge (e.g., specialized medical imaging, niche product classification in e-commerce, real-time event monitoring). - Foundation Models Beyond Vision: The success of VLMs hints at the broader potential of multimodal
foundation modelsthat can integrate information from even more modalities (e.g., audio, 3D data) for even more holistic AI understanding. - Customization and Personalization: The
prompt tuningandadaptermechanisms offer powerful ways to customize general VLMs for specific user needs or domains with minimal effort and data, opening doors for highly personalized AI experiences. - Cross-Cultural AI: The call for
multilingual VLMsis particularly inspiring, emphasizing the need for AI to be inclusive and context-aware across diverse cultures and languages, moving beyond English-centric biases.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
-
Interpretability and Bias: While VLMs perform incredibly well, their
black-boxnature remains a concern. How exactly do they learnvision-language correlations? Whatbiasesfrom the vast, unstructured web data are embedded in these models, and how do these biases affect predictions, especially in sensitive applications? The paper acknowledges bias in multilingual VLMs but could expand on inherent dataset biases more broadly. -
Computational Cost and Environmental Impact: The immense
computational resourcesrequired for pre-training VLMs are a significant barrier for many researchers and raiseenvironmental sustainabilityconcerns. Whiledata-efficient VLMsare mentioned as a future direction, the immediate practical challenges of training thesebig modelsare substantial. -
Robustness to Adversarial Attacks and Out-of-Distribution Data: VLMs trained on diverse web data might be more robust in some ways, but their performance on
out-of-distributiondata or againstadversarial attacksis an important area for deeper investigation. How do theirzero-shotcapabilities fare under such challenging conditions? -
Fine-grained Grounding in Complex Scenes: While
region-word matchingis a promising direction, fully understanding and localizing every semantic entity in a complex scene, especially with overlapping objects or abstract concepts, remains a significant challenge. The currentdense predictionresults (Tables 7 & 8) show lower performance than classification, indicating this is indeed a hard problem. -
Dynamic Language and Visual Concepts: Both language and visual concepts can evolve. How well can static VLMs adapt to new jargon, emerging visual styles, or real-world changes without continuous re-training?
Test-time VLM transferoffers a partial solution, but the long-term adaptability is a complex issue.Overall, this survey serves as an excellent resource for anyone seeking to understand the landscape of VLMs for visual recognition. It not only summarizes current achievements but also critically points towards the necessary advancements for building more robust, efficient, and universally applicable visual intelligence systems.