Paper status: completed

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Published: 08/25/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Qwen-VL is a series of large-scale vision-language models designed for understanding and localization of text and images, achieving new benchmarks in visual tasks and outperforming existing vision-language chatbots after instruction tuning.

Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

1.2. Authors

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou from Alibaba Group.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, on 2023-08-24. arXiv is a widely recognized platform for disseminating research in fields like AI, often preceding formal peer-reviewed publication. Its influence is significant in rapidly evolving fields, allowing researchers to share findings promptly.

1.4. Publication Year

2023

1.5. Abstract

This work introduces the Qwen-VL series, a collection of large-scale vision-language models (LVLMs) capable of perceiving and comprehending both text and images. Built upon the Qwen-LM as a foundational language model, Qwen-VL integrates visual capabilities through a meticulously designed visual receptor, an input-output interface supporting fine-grained understanding, a robust 3-stage training pipeline, and a multilingual multimodal cleaned corpus. Beyond traditional image description and question-answering, Qwen-VL models achieve advanced grounding and text-reading abilities by aligning image-caption-box tuples. The resulting models, Qwen-VL and Qwen-VL-Chat, establish new performance benchmarks for generalist models of similar scale across a wide array of visual-centric tasks, including image captioning, question answering, and visual grounding, in both zero-shot and few-shot settings. Furthermore, the instruction-tuned Qwen-VL-Chat demonstrates superior performance on real-world dialogue benchmarks compared to existing vision-language chatbots. The models, code, and a demo are publicly available to foster further research.

Official Source Link: https://arxiv.org/abs/2308.12966 PDF Link: https://arxiv.org/pdf/2308.12966v3.pdf Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The rapid advancements in Large Language Models (LLMs) have significantly enhanced text generation and comprehension, offering powerful tools for various applications. However, a fundamental limitation of native LLMs is their confinement to the "pure-text world," meaning they cannot process or understand other common modalities like images, speech, or video. This restriction severely limits their applicability in real-world scenarios, which are inherently multimodal.

Motivated by this, a new wave of Large Vision Language Models (LVLMs) has emerged, aiming to imbue LLMs with the ability to perceive and understand visual signals. While these LVLMs show promising potential, the authors identify several critical gaps and challenges in the current landscape:

  1. Inadequate Training and Optimization: Many existing open-source LVLMs suffer from insufficient training or sub-optimal strategies, causing them to lag significantly behind proprietary models (e.g., those from OpenAI). This disparity hinders the broader exploration and application of LVLMs within the open-source community.

  2. Lack of Fine-grained Visual Understanding: Real-world visual scenarios are complex and often require precise, fine-grained perception. Most open-source LVLMs, however, primarily offer coarse-grained image understanding and lack capabilities for tasks like object grounding (locating specific objects in an image based on textual descriptions) or text reading (optical character recognition, OCR). Only a few attempts have been made in this direction, indicating a significant unmet need.

    The paper's entry point is to address these limitations by developing a highly performant and versatile open-source LVLM that excels in fine-grained visual understanding and offers state-of-the-art performance across a broad range of tasks, thereby democratizing access to advanced multimodal AI.

2.2. Main Contributions / Findings

The Qwen-VL series makes several significant contributions to the field of large-scale vision-language models:

  • Introduction of Qwen-VL Series: The paper introduces Qwen-VL and its instruction-tuned variant, Qwen-VL-Chat, as highly performant and versatile vision-language foundation models. These models are built upon the robust Qwen-7B Large Language Model, enhancing it with extensive visual capabilities.

  • Meticulously Designed Architecture and Training: The authors detail key design choices that enable Qwen-VL's capabilities:

    • A novel visual receptor (language-aligned visual encoder and position-aware adapter) for efficient and effective image processing.
    • A concise input-output interface specifically designed to handle and generate fine-grained visual information, including bounding boxes for grounding and text reading.
    • A robust 3-stage training pipeline that progressively builds multimodal understanding from general image-text pairs to fine-grained annotations and instruction-following.
    • Leveraging a multilingual multimodal cleaned corpus that supports both English and Chinese, enabling multilingual capabilities.
  • Enhanced Fine-grained Visual Understanding: Qwen-VL significantly advances fine-grained perception. It implements strong grounding (aligning image regions with textual descriptions) and text-reading abilities by incorporating image-caption-box tuples during training. This allows Qwen-VL to excel in tasks like text-oriented question answering and fine-grained dialogue.

  • Leading Performance Across Benchmarks: The Qwen-VL models achieve state-of-the-art results among generalist models of similar scales across a broad spectrum of visual-centric benchmarks. This includes conventional tasks like image captioning, general visual question answering (VQA), and visual grounding, as well as specialized tasks such as text-oriented VQA. They demonstrate strong performance in both zero-shot and few-shot settings.

  • Superior Dialogue Capabilities: The instruction-tuned Qwen-VL-Chat demonstrates superiority on real-world dialogue benchmarks compared to existing vision-language chatbots. It supports multi-image inputs, multi-round conversations, multilingual interaction, and fine-grained recognition within dialogues.

  • Open-Source Release: All models, code, and a demo are made public, fostering transparency, reproducibility, and further research and development within the open-source community.

    The key finding is that through careful architectural design and a progressive multi-stage training strategy on a diverse, cleaned multimodal corpus, it is possible to create an open-source LVLM that not only achieves leading performance but also possesses crucial fine-grained visual understanding capabilities, rivaling proprietary models.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Qwen-VL, a reader should be familiar with the following fundamental concepts:

  • Large Language Models (LLMs): These are neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. They excel at understanding and generating human-like text, performing tasks such as translation, summarization, and question answering. Examples include GPT-3, PaLM 2, and the foundational Qwen-LM used in this paper. The core idea is to predict the next word in a sequence, allowing them to learn complex linguistic patterns.
  • Vision-Language Models (VLMs) / Large Vision Language Models (LVLMs): These models extend LLMs by integrating visual understanding capabilities. They can process both images and text, enabling tasks like image captioning (describing an image in text), visual question answering (answering questions about an image), and visual grounding (linking textual descriptions to specific regions in an image). LVLMs are generally larger-scale versions of VLMs, often built by connecting LLMs with powerful visual encoders.
  • Transformer Architecture: The foundational architecture for most modern LLMs and many VLMs. It relies heavily on the self-attention mechanism to weigh the importance of different parts of the input sequence (text tokens or image patches) when processing each element: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where:
    • Q (Query), K (Key), V (Value) are matrices derived from the input embeddings.
    • d_k is the dimension of the key vectors, used to scale the dot products so the softmax does not saturate and gradients remain stable.
    • The softmax function normalizes the scores into attention weights that sum to one. The attention mechanism allows the model to focus on relevant parts of the input, regardless of their position (a minimal NumPy sketch of this computation follows this list).
  • Vision Transformer (ViT): An adaptation of the Transformer architecture for image processing. Instead of convolutional layers, ViT treats an image as a sequence of fixed-size patches. Each patch is linearly embedded, positional embeddings are added, and the resulting sequence is fed into a standard Transformer encoder. This allows ViT to capture global relationships between image patches, similar to how Transformers handle words in text.
  • Cross-Attention: A variant of the attention mechanism used in multimodal models to allow information flow between different modalities. For example, in a VLM, a language query might attend to visual features, or visual features might attend to text features. The query comes from one modality (e.g., text tokens), and the keys and values come from another modality (e.g., image features). This is crucial for integrating visual and linguistic information.
  • Instruction Tuning / Fine-tuning: A process where a pre-trained model (like an LLM or VLM) is further trained on a dataset of instructions (prompts) and desired responses. This helps the model align its behavior with user intentions, improving its ability to follow complex commands, engage in dialogue, and generate contextually appropriate outputs. Qwen-VL-Chat is an instruction-tuned version of Qwen-VL.
  • Bounding Boxes and Visual Grounding: A bounding box is a rectangle specified by two corner coordinates, (X_topleft, Y_topleft) and (X_bottomright, Y_bottomright), used to define the location of an object within an image. Visual grounding is the task of linking specific words or phrases in a natural language description to their corresponding regions or objects in an image, often represented by bounding boxes. This is a key aspect of fine-grained visual understanding.
  • Optical Character Recognition (OCR): The electronic conversion of images of typed, handwritten, or printed text into machine-encoded text. In the context of VLMs, this means the model can "read" text present within images.
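To make the attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and variable names are illustrative only and not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head scaled dot-product attention.

    Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v).
    Returns a (num_queries, d_v) matrix of attended values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                         # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V                                      # weighted sum of values

# Toy usage: 4 query tokens attending over 6 key/value vectors.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 32)), rng.normal(size=(6, 32)), rng.normal(size=(6, 64))
print(scaled_dot_product_attention(Q, K, V).shape)          # (4, 64)
```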

3.2. Previous Works

The paper contextualizes Qwen-VL against the backdrop of significant developments in both LLMs and VLMs.

  • Large Language Models (LLMs): The recent surge in LLM capabilities, exemplified by models like GPT-3 (Brown et al., 2020), OpenAI's GPT-4 (OpenAI, 2023), PaLM 2 (Anil et al., 2023), and Qwen-LM (Qwen, 2023), has showcased their immense power in text generation and comprehension. These models form the foundation upon which LVLMs are built.
  • Multimodal Generalist Models: Earlier efforts in vision-language learning focused on multi-task generalist models that unified various vision-language tasks.
    • CoCa (Yu et al., 2022): Proposed an encoder-decoder structure for simultaneous image-text retrieval and vision-language generation.
    • OFA (Wang et al., 2022a): Transformed specific vision-language tasks into sequence-to-sequence problems using customized task instructions.
    • Unified I/O (Lu et al., 2022a): Further expanded this by incorporating tasks like segmentation and depth estimation into a unified framework.
  • Vision-Language Representation Models: Another line of research focused on learning strong multimodal representations.
    • CLIP (Radford et al., 2021): Leveraged contrastive learning on massive image-text pairs to align modalities in a semantic space, achieving strong zero-shot generalization.
    • BEIT-3 (Wang et al., 2022b): Employed a mixture-of-experts (MoE) structure and a unified masked token prediction objective for SOTA results on various tasks.
    • More recent works like ImageBind (Girdhar et al., 2023) and ONE-PEACE (Wang et al., 2023) extended this to even more modalities (e.g., speech).
  • LLM-based LVLMs: With the advent of powerful LLMs, researchers began building more capable LVLMs by leveraging LLMs as a language backbone.
    • Flamingo (Alayrac et al., 2022): One of the early prominent LVLMs that demonstrated strong few-shot learning by connecting a frozen vision encoder to an LLM via cross-attention.
    • BLIP-2 (Li et al., 2023c): Introduced a Q-Former to efficiently align frozen vision foundation models and LLMs, achieving impressive performance.
    • InstructBLIP (Dai et al., 2023): Further enhanced BLIP-2 with instruction tuning for better alignment with user commands.
    • LLaVA (Liu et al., 2023) and MiniGPT-4 (Zhu et al., 2023): Introduced visual instruction tuning to improve instruction-following in LVLMs.
    • mPLUG-DocOwl (Ye et al., 2023a): Incorporated document understanding by including digital document data.
    • Kosmos-2 (Peng et al., 2023), Shikra (Chen et al., 2023a), BuboGPT (Zhao et al., 2023): Advanced LVLMs by adding visual grounding abilities, enabling region description and localization.

3.3. Technological Evolution

The evolution of vision-language models can be broadly summarized:

  1. Early Multimodal Models: Focused on fusing visual and textual features, often using separate encoders and then combining them (e.g., CNNs for vision, RNNs for text).

  2. Attention-based VLMs: The introduction of attention mechanisms allowed for more dynamic interaction between modalities, with models like ViLBERT, UNITER, and LXMERT.

  3. Contrastive Pre-training: Models like CLIP revolutionized representation learning by aligning image and text embeddings in a shared space using contrastive loss on massive datasets.

  4. LLM-powered LVLMs: The current wave, leveraging the strong language understanding and generation capabilities of pre-trained LLMs, connecting them with powerful vision encoders. This paradigm benefits from the LLM's emergent abilities.

  5. Instruction-tuned and Grounding-capable LVLMs: The latest trend, where LVLMs are fine-tuned on instruction datasets to improve dialogue and fine-grained understanding, including the ability to localize objects and read text within images.

    Qwen-VL fits into this latest stage, building on the success of LLMs and VLMs, specifically focusing on overcoming the limitations of existing open-source LVLMs by incorporating robust training strategies and fine-grained visual capabilities, including grounding and OCR.

3.4. Differentiation Analysis

Compared to the main methods in related work, Qwen-VL distinguishes itself with several core innovations:

  • Foundation LLM: While many LVLMs use public LLMs like LLaMA or Vicuna, Qwen-VL is specifically built upon the high-performing Qwen-LM (Qwen-7B) developed by the same group, potentially leading to better integration and performance.
  • Comprehensive Fine-grained Understanding: Unlike many open-source LVLMs that offer only coarse-grained image perception, Qwen-VL is explicitly designed for and excels at fine-grained visual tasks such as object grounding, text reading (OCR), and text-oriented question answering. This is achieved through the unique alignment of image-caption-box tuples during training and a higher input resolution.
  • Meticulous System Design: The paper emphasizes a "meticulously designed" visual receptor, input-output interface, and training pipeline. The position-aware adapter is a key component to efficiently compress visual features while retaining crucial spatial information, addressing a common challenge in LVLMs. The special tokens for bounding boxes (<box>, </box>) and references (<ref>, </ref>) provide a structured way to handle grounding.
  • Multi-stage Training Pipeline: Qwen-VL employs a sophisticated 3-stage training pipeline (Pre-training, Multi-task Pre-training, Supervised Fine-tuning). This progressive approach, starting from general image-text pairs and moving to fine-grained annotations and instruction-following, allows for robust and versatile learning, including unlocking the LLM in the second stage.
  • Multilingual and Multi-image Support: Qwen-VL is explicitly trained on a multilingual corpus (English and Chinese), making it natively multilingual. Furthermore, its training data and interface allow for arbitrary interleaved image-text inputs, enabling it to compare, understand, and analyze contexts with multiple images—a feature not universally supported by all LVLMs.
  • State-of-the-Art Generalist Performance: The paper claims and demonstrates that Qwen-VL sets new records for generalist models under similar model scales across a broad range of benchmarks, surpassing many existing LVLMs, including some with significantly more parameters (e.g., Flamingo-80B on certain tasks). This suggests a highly efficient and effective design.
  • No Catastrophic Forgetting: Through the inclusion of pure-text data during multi-task training and SFT, Qwen-VL maintains and even improves its pure-text capabilities, mitigating catastrophic forgetting, which can be an issue when adapting LLMs to multimodal tasks.

4. Methodology

4.1. Principles

The core principle behind Qwen-VL is to augment a powerful, pre-trained Large Language Model (Qwen-LM) with comprehensive visual understanding capabilities, transforming it into a versatile multimodal assistant. This is achieved by:

  1. Modular Integration: Connecting a specialized visual encoder to the LLM via an efficient adapter, allowing the LLM to "see" images.

  2. Fine-grained Perception: Training the model with explicit visual grounding and text-reading data, enabling it to understand specific regions and text within images, rather than just holistic scene understanding.

  3. Progressive Learning: Employing a multi-stage training strategy that gradually introduces complexity, starting with general image-text alignment and progressing to multi-task learning with fine-grained annotations and instruction-following.

  4. Multilingual and Multi-image Support: Designing the system to natively handle multiple languages and process multiple images in interleaved sequences, reflecting real-world interactive scenarios.

    The intuition is that a strong LLM backbone provides robust reasoning and generation abilities. By carefully designing the visual input pathway and training with diverse, high-quality multimodal data, these linguistic capabilities can be extended to the visual domain, resulting in a model that can not only describe images but also answer complex questions, locate objects, read text, and engage in nuanced dialogue about visual content.

4.2. Core Methodology In-depth (Layer by Layer)

The Qwen-VL architecture integrates three main components: a Large Language Model, a Visual Encoder, and a Position-aware Vision-Language Adapter. The entire system is trained using a 3-stage pipeline.

4.2.1. Model Architecture

The overall network architecture consists of the following components, with detailed parameters provided in Table 1.

The following are the results from Table 1 of the original paper:

Vision Encoder VL Adapter LLM Total
1.9B 0.08B 7.7B 9.6B
  • Large Language Model (LLM):

    • Role: Serves as the foundational component for language understanding, generation, and reasoning.
    • Initialization: The LLM is initialized with pre-trained weights from Qwen-7B (Qwen, 2023). Qwen-7B is a 7.7 billion parameter LLM, providing strong linguistic capabilities as a base.
  • Visual Encoder:

    • Role: Processes input images to extract visual features.
    • Architecture: Uses a Vision Transformer (ViT) (Dosovitskiy et al., 2021). The ViT takes an image as input and divides it into a sequence of non-overlapping patches. Each patch is then linearly embedded, and the resulting patch embeddings, combined with positional information, are fed through a standard Transformer encoder.
    • Processing: During both training and inference, input images are resized to a fixed resolution (e.g., 224×224 or 448×448). The visual encoder processes these images by splitting them into patches with a stride of 14. For a 224×224 image, this yields (224/14) × (224/14) = 16 × 16 = 256 patches. Each patch is then transformed into a feature vector, generating a set of image features.
  • Position-aware Vision-Language Adapter (VL Adapter):

    • Role: This component acts as a bridge between the visual encoder and the LLM. Its primary function is to compress the potentially long sequence of image features from the visual encoder into a fixed-length sequence that is more efficient for the LLM to process, while crucially preserving positional information for fine-grained understanding.
    • Mechanism: It comprises a single-layer cross-attention module.
      • Query Vectors: A group of trainable vectors (query embeddings) is used as the queries (Q). These are randomly initialized.
      • Keys: The image features (patch embeddings) extracted from the visual encoder serve as the keys (K) for the cross-attention operation.
      • Cross-Attention Operation: The cross-attention mechanism computes attention scores between the trainable query vectors and the image features. The output of this cross-attention is a compressed representation of the visual features.
      • Compression: This mechanism compresses the visual feature sequence (256 or 1024 patch features, depending on input resolution) to a fixed length of 256. An ablation study (Appendix E.2) confirmed that 256 queries offer a good balance.
      • Positional Encoding: To mitigate the potential loss of positional details during compression, 2D absolute positional encodings are incorporated into the query-key pairs of the cross-attention mechanism. This ensures that the compressed visual features retain the spatial information needed for fine-grained tasks such as grounding (a simplified sketch of the adapter appears after this architecture list).
    • Output: The compressed image feature sequence of length 256 is then fed into the large language model.
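The adapter's compression step can be pictured with the short PyTorch sketch below. It is a simplified illustration under stated assumptions: single-head cross-attention, a flattened grid of learnable positional embeddings added to the keys only (the paper incorporates 2D absolute positional encodings into the query-key pairs), ViT width 1664 as in Open-CLIP-bigG, and LLM width 4096 as in Qwen-7B. The class name VLAdapter and every hyperparameter here are ours, not the released implementation.

```python
import torch
import torch.nn as nn

class VLAdapter(nn.Module):
    """Illustrative position-aware adapter: compresses ViT patch features
    into a fixed number of query tokens via single-layer cross-attention."""

    def __init__(self, vit_dim=1664, llm_dim=4096, num_queries=256, grid=32):
        super().__init__()
        # Trainable query embeddings used as Q (randomly initialized).
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Learnable positional embeddings for a grid x grid patch layout (keys).
        self.key_pos = nn.Parameter(torch.randn(grid * grid, vit_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim=llm_dim, num_heads=1,
                                          kdim=vit_dim, vdim=vit_dim,
                                          batch_first=True)

    def forward(self, patch_feats):                     # (B, num_patches, vit_dim)
        k = patch_feats + self.key_pos[: patch_feats.size(1)]
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        compressed, _ = self.attn(q, k, patch_feats)    # (B, num_queries, llm_dim)
        return compressed                               # fed into the LLM

# A 448x448 image with patch stride 14 yields (448 / 14) ** 2 = 1024 patches.
feats = torch.randn(2, 1024, 1664)
print(VLAdapter()(feats).shape)                         # torch.Size([2, 256, 4096])
```

The printed shape mirrors the fixed 256-token visual sequence handed to the language model regardless of the input resolution.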

4.2.2. Inputs and Outputs

The design of Qwen-VL's input and output interface is crucial for its versatility, especially for fine-grained tasks.

  • Image Input:

    • Images are processed through the visual encoder and VL adapter, resulting in fixed-length sequences of image features.
    • To clearly delineate visual content within the language model's input stream, two special tokens, <img> and </img>, are appended to the beginning and end of the image feature sequence, respectively. These tokens explicitly signal the start and end of image content to the LLM.
  • Bounding Box Input and Output:

    • For fine-grained visual understanding and grounding (tasks involving specific regions within an image), Qwen-VL is trained on data containing region descriptions, questions, and detections. This requires the model to precisely understand and generate region information.
    • Normalization: Any given bounding box coordinates are first normalized within the range [0, 1000). This standardizes the coordinates for consistent processing across image sizes.
    • String Format: The normalized coordinates are then transformed into a specific string format: (X_topleft, Y_topleft), (X_bottomright, Y_bottomright); for example, (10,20),(30,40).
    • Tokenization: This bounding box string is tokenized as regular text by the LLM and does not require an additional positional vocabulary. This simplifies the integration into the language model's existing tokenization scheme.
    • Special Tokens for Detections: To distinguish bounding box strings from ordinary text, two special tokens, <box> and </box>, are added at the beginning and end of the bounding box string.
    • Special Tokens for References: To associate bounding boxes with their corresponding descriptive words or sentences, another set of special tokens, <ref> and </ref>, is introduced. These tokens mark the textual content that refers to the bounding box, for example <ref>the red car</ref><box>(x1,y1),(x2,y2)</box>. This structure lets the model express both what a region contains and where a description points (see the formatting sketch after this list).
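To illustrate this interface, the sketch below formats a pixel-space box and its referring phrase into the string convention described above and parses it back out of generated text. The helper names are hypothetical; the [0, 1000) normalization and the <ref>/<box> wrapping follow the paper's description.

```python
import re

def box_to_string(phrase, box, width, height):
    """Format a referring phrase plus pixel box as Qwen-VL-style text.

    box is (x0, y0, x1, y1) in pixels; coordinates are normalized to
    [0, 1000) and written as plain text wrapped in <ref>/<box> tokens.
    """
    x0, y0, x1, y1 = box
    nx0, ny0 = int(x0 / width * 1000), int(y0 / height * 1000)
    nx1, ny1 = int(x1 / width * 1000), int(y1 / height * 1000)
    return f"<ref>{phrase}</ref><box>({nx0},{ny0}),({nx1},{ny1})</box>"

def parse_boxes(text):
    """Recover (phrase, (x0, y0, x1, y1)) pairs from generated text."""
    pattern = r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
    return [(m[0], tuple(int(v) for v in m[1:])) for m in re.findall(pattern, text)]

s = box_to_string("the red car", (120, 200, 360, 420), width=640, height=480)
print(s)               # <ref>the red car</ref><box>(187,416),(562,875)</box>
print(parse_boxes(s))  # [('the red car', (187, 416, 562, 875))]
```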

4.2.3. Training

The training process of the Qwen-VL model is a sophisticated 3-stage pipeline designed to progressively build its multimodal capabilities, as illustrated in Figure 3.

The following figure (Figure 3 from the original paper) shows the training pipeline of the Qwen-VL series:

Figure 3: The training pipeline of the Qwen-VL series. The figure is a schematic of the three-stage flow built on Qwen-LM and a Vision Transformer (ViT): low-resolution pre-training on image-text pairs, followed by higher-resolution multi-task training and chat-oriented training on interleaved vision-language data.

4.2.3.1. Stage 1: Pre-training

  • Objective: To learn initial visual-language alignment from a large corpus of image-text pairs. The LLM is kept frozen to preserve its linguistic knowledge, while the visual components (visual encoder and VL adapter) are trained to effectively extract and represent visual information for the LLM.

  • Data:

    • The pre-training dataset combines several publicly accessible sources with in-house data.

    • Initial raw data: 5 billion image-text pairs.

    • Cleaning Process: A rigorous cleaning process is applied to filter noisy data. This includes removing pairs with extreme aspect ratios, too-small images, low CLIP scores, non-English/non-Chinese characters, emoji, or text that is too short or too long, and cleaning HTML tags or irregular text patterns (a toy filter along these lines is sketched after this stage's details).

    • Cleaned data: 1.4 billion image-text pairs remain, with 77.3% English text data and 22.7% Chinese text data.

      The following are the results from Table 2 of the original paper:

      Language Dataset Original Cleaned Remaining%
      English LAION-en 2B 280M 14%
      English LAION-COCO 600M 300M 50%
      English DataComp 1.4B 300M 21%
      English Coyo 700M 200M 28%
      English CC12M 12M 8M 66%
      English CC3M 3M 3M 100%
      English SBU 1M 0.8M 80%
      English COCO Caption 0.6M 0.6M 100%
      Chinese LAION-zh 108M 105M 97%
      Chinese In-house Data 220M 220M 100%
      Total 5B 1.4B 28%
  • Trainable Components: Only the visual encoder and VL adapter are optimized. The large language model (Qwen-7B) is kept frozen.

  • Input Resolution: Images are resized to 224×224 pixels.

  • Training Objective: Minimize the cross-entropy loss of the text tokens, effectively learning to predict the text given the image features.

  • Hyperparameters: The maximum learning rate is 2e-4 and the batch size is 30720 image-text pairs. The stage lasts for 50,000 steps, consuming approximately 1.5 billion image-text samples.
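The cleaning criteria listed under Stage 1 can be pictured as a simple per-record filter over (image size, caption, CLIP score) tuples. The sketch below is only illustrative: the paper states the criteria but not the thresholds, so every numeric value here is an invented placeholder.

```python
import re

def keep_pair(width, height, caption, clip_score,
              min_side=64, max_aspect=3.0, min_clip=0.2,
              min_chars=5, max_chars=300):
    """Illustrative Stage-1 style filter; all thresholds are placeholders."""
    if min(width, height) < min_side:                          # too-small image
        return False
    if max(width, height) / min(width, height) > max_aspect:   # extreme aspect ratio
        return False
    if clip_score < min_clip:                                  # weak image-text match
        return False
    text = re.sub(r"<[^>]+>", "", caption).strip()             # strip HTML tags
    if not (min_chars <= len(text) <= max_chars):              # too short / too long
        return False
    return True

print(keep_pair(512, 384, "A dog playing in the park.", clip_score=0.31))  # True
print(keep_pair(512, 40, "A dog playing in the park.", clip_score=0.31))   # False
```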

4.2.3.2. Stage 2: Multi-task Pre-training

  • Objective: To enhance fine-grained multimodal understanding and expand the model's capabilities across various vision-language tasks by training on diverse, high-quality, and fine-grained annotated data. In this stage, the LLM is unfrozen to allow for joint optimization of all components.

  • Data:

    • This stage introduces high-quality and fine-grained vision-language annotation data, including interleaved image-text data.

    • Simultaneous Training: Qwen-VL is trained on 7 distinct tasks simultaneously, leveraging a mixture of publicly available and in-house datasets.

    • Text Generation: In-house collected corpus is used to maintain the LLM's language generation ability.

    • Captioning: A subset of data from Stage 1 (LAION-en & zh, DataComp, Coyo, CC12M & 3M, SBU, COCO, In-house Data), excluding LAION-COCO.

    • VQA: Mixture of GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D.

    • Grounding: GRIT dataset (Peng et al., 2023) with minor modifications.

    • Reference Grounding & Grounded Captioning: Constructed from GRIT, Visual Genome, RefCOCO, RefCOCO+, and RefCOCOg. These tasks are crucial for fine-grained understanding of regions.

    • OCR: Synthetically generated OCR data (SynthDoG-en & zh) with natural scenery backgrounds and real-world data from Common Crawl (PDF and HTML formats). This boosts text-reading capabilities.

    • Interleaved Data: Data from the same task are packed into sequences of length 2048 to create interleaved image-text data.

      The following are the results from Table 3 of the original paper:

      Task # Samples Dataset
      Captioning 19.7M LAION-en & zh, DataComp, Coyo, CC12M & 3M, SBU, COCO, In-house Data
      VQA 3.6M GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D
      Grounding 3.5M GRIT
      Ref Grounding 8.7M GRIT, Visual Genome, RefCOCO, RefCOCO+, RefCOCOg
      Grounded Cap. 8.7M GRIT, Visual Genome, RefCOCO, RefCOCO+, RefCOCOg
      OCR 24.8M SynthDoG-en & zh, Common Crawl pdf & HTML
      Pure-text Autoregression 7.8M In-house Data
  • Trainable Components: The entire model (visual encoder, VL adapter, and LLM) is trained, meaning the large language model is "unlocked."

  • Input Resolution: The input resolution for the visual encoder is increased from 224×224 to 448×448 to reduce information loss from down-sampling and facilitate fine-grained understanding.

  • Training Objective: Same as Stage 1, minimizing cross-entropy of text tokens.
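In both pre-training stages the loss is next-token cross-entropy computed on text tokens only, with image-feature positions excluded from supervision (and, during SFT, typically the prompt side as well). A minimal PyTorch sketch of that masked loss, under our own naming, is shown below.

```python
import torch
import torch.nn.functional as F

def text_only_cross_entropy(logits, targets, loss_mask):
    """Next-token cross-entropy averaged over supervised positions only.

    logits:    (B, T, vocab) model outputs
    targets:   (B, T)        next-token ids
    loss_mask: (B, T)        1 for text tokens to supervise, 0 for image-feature
                             positions (or unsupervised prompt tokens)
    """
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                targets.reshape(-1), reduction="none")
    per_token = per_token * loss_mask.reshape(-1).float()
    return per_token.sum() / loss_mask.float().sum().clamp(min=1)

# Toy example: batch of 2 sequences of length 8 over a 100-token vocabulary.
logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1],   # first 3 positions hold image features
                     [0, 0, 1, 1, 1, 1, 1, 1]])
print(text_only_cross_entropy(logits, targets, mask))
```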

4.2.3.3. Stage 3: Supervised Fine-tuning (SFT)

  • Objective: To enhance the model's instruction-following capabilities and dialogue proficiency, transforming the pre-trained Qwen-VL into the interactive Qwen-VL-Chat model. This stage focuses on aligning the model with user intent and real-world conversational patterns.

  • Data:

    • The multimodal instruction tuning data primarily comes from caption data and dialogue data generated through LLM self-instruction. However, these often focus on single-image dialogue and limited content comprehension.
    • Enrichment: An additional set of dialogue data is constructed through manual annotation, model generation, and strategy concatenation. This explicitly incorporates localization (grounding) and multi-image comprehension abilities. This is critical for the model to "transfer these capabilities to a wider range of languages and question types."
    • Universality: To ensure the model's universality in dialogue capabilities, multimodal dialogue data is mixed with pure text dialogue data.
    • Scale: The instruction tuning amounts to approximately 350,000 instructions.
  • Trainable Components: The visual encoder is frozen, and only the language model (LLM) and the VL adapter module are optimized. This helps preserve the visual feature extraction stability while adapting the language generation and multimodal reasoning for dialogue.

  • Data Format: The instruction tuning dataset uses the ChatML (OpenAI) format, in which each statement of an interaction is delimited by the special tokens <im_start> and <im_end>, which also signal where a dialogue turn ends. Multiple images are handled by prefixing each with a "Picture X:" label.

    The following is the ChatML data format example provided in Box B.2 of the original paper:
    <im_start>user
    Picture 1: <img>vg/VG_100K_2/649.jpg</img>What is the sign in the picture?<im_end>
    <im_start>assistant
    The sign is a road closure with an orange rhombus.<im_end>
    <im_start>user
    How is the weather in the picture?<im_end>
    <im_start>assistant
    The shape of the road closure sign is an orange rhombus.<im_end>
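A small helper that assembles dialogue turns into this ChatML-style layout might look as follows. It is a sketch of the format shown above, not the authors' preprocessing code; the <im_start>/<im_end> spelling and the "Picture X:" prefix follow the excerpt.

```python
def to_chatml(turns, images=None):
    """Render (role, text) turns into ChatML-style training text.

    images: optional list of image paths, announced once as
    "Picture X: <img>path</img>" at the start of the first user turn,
    mirroring the example above.
    """
    prefix = "".join(f"Picture {i}: <img>{p}</img>"
                     for i, p in enumerate(images or [], start=1))
    lines = []
    for idx, (role, text) in enumerate(turns):
        content = (prefix + text) if (idx == 0 and role == "user") else text
        lines.append(f"<im_start>{role}\n{content}<im_end>")
    return "\n".join(lines)

print(to_chatml(
    [("user", "What is the sign in the picture?"),
     ("assistant", "The sign is a road closure with an orange rhombus.")],
    images=["vg/VG_100K_2/649.jpg"],
))
```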

The model hyperparameters for each stage are detailed in Table 8 in the Experimental Setup section.

5. Experimental Setup

Qwen-VL's evaluation is comprehensive, covering various multimodal tasks to assess its visual understanding abilities. Qwen-VL refers to the model after multi-task training (Stage 2), and Qwen-VL-Chat denotes the model after the supervised fine-tuning (SFT) stage (Stage 3).

5.1. Datasets

The following are the results from Table 9 of the original paper:

Task | Dataset | Description | Split | Metric
Image Caption | Nocaps | Captioning of natural images | val | CIDEr (↑)
Image Caption | Flickr30K | Captioning of natural images | karpathy-test | CIDEr (↑)
General VQA | VQAv2 | VQA on natural images | test-dev | VQA Score (↑)
General VQA | OKVQA | VQA on natural images requiring outside knowledge | val | VQA Score (↑)
General VQA | GQA | VQA on scene understanding and reasoning | test-balanced | EM (↑)
General VQA | ScienceQA-Img | Multi-choice VQA on a diverse set of science topics | test | Accuracy (↑)
General VQA | VizWiz | VQA on photos taken by people who are blind | test-dev | VQA Score (↑)
Text-oriented VQA | TextVQA | VQA on natural images containing text | val | VQA Score (↑)
Text-oriented VQA | DocVQA | VQA on images of scanned documents | test | ANLS (↑)
Text-oriented VQA | ChartQA | VQA on images of charts | test | Relaxed EM (↑)
Text-oriented VQA | OCR-VQA | VQA on images of book covers | test | EM (↑)
Text-oriented VQA | AI2Diagram | VQA on images of scientific diagrams | test | EM (↑)
Refer Expression Comprehension | RefCOCO | Refer grounding on natural images | val & testA & testB | Accuracy (↑)
Refer Expression Comprehension | RefCOCO+ | Refer grounding on natural images | val & testA & testB | Accuracy (↑)
Refer Expression Comprehension | RefCOCOg | Refer grounding on natural images | val & test | Accuracy (↑)
Refer Expression Comprehension | GRiT | Refer grounding on natural images | test | Accuracy (↑)
Instruction Following | TouchStone | Open-ended VL instruction-following benchmark | English & Chinese | GPT-4 Score (↑)
Instruction Following | MME | Open-ended VL benchmark with yes/no questions | Perception & Cognition | Accuracy (↑)
Instruction Following | SEED-Bench | Open-ended VL benchmark with multi-choice VQA | Image & Video | Accuracy (↑)

5.1.1. Image-text pairs (for Pre-training and Captioning)

  • LAION-en / LAION-zh (Schuhmann et al., 2022a): Large-scale datasets derived from the web, containing billions of image-text pairs. LAION-en is the English subset, and LAION-zh is the Chinese subset. These are known for their vast scale but also their noisiness.
  • LAION-COCO (Schuhmann et al., 2022b): A synthetic dataset generated from LAION-en, specifically for image captioning.
  • DataComp (Gadre et al., 2023): A collection of image-text pairs designed for dataset research.
  • Coyo (Byeon et al., 2022): A large-scale image-text pair dataset.
  • CC12M (Changpinyo et al., 2021), CC3M (Sharma et al., 2018), SBU (Ordonez et al., 2011), COCO Caption (Chen et al., 2015): Academic caption datasets, generally higher quality but smaller scale than web-crawled datasets.
  • Cleaning: The paper details a multi-step cleaning process for these noisy web-crawled datasets to ensure data quality.

5.1.2. Visual Question Answering (VQA)

  • VQAv2 (Goyal et al., 2017): A large-scale VQA dataset focusing on natural images. The authors select answer annotations based on maximum confidence.
  • OKVQA (Marino et al., 2019): VQA on natural images that often requires external common-sense knowledge beyond what's visible in the image.
  • GQA (Hudson and Manning, 2019): A VQA dataset for real-world visual reasoning and compositional question answering, emphasizing scene graph representations.
  • ScienceQA (Lu et al., 2022b): A multi-choice VQA dataset involving scientific topics, often requiring multimodal reasoning (text, images, diagrams).
  • VizWiz (Gurari et al., 2018): VQA on photos taken by blind people, presenting unique challenges due to image quality and content.
  • TextVQA (Sidorov et al., 2020): VQA on natural images where answering requires reading text within the image.
  • DocVQA (Mathew et al., 2021): VQA on scanned document images, testing document understanding.
  • ChartQA (Masry et al., 2022): VQA on images of charts and plots, requiring visual and logical reasoning over data visualizations.
  • AI2Diagram (Kembhavi et al., 2016): VQA on images of scientific diagrams.
  • OCR-VQA (Mishra et al., 2019): VQA on images of book covers, often requiring OCR.

5.1.3. Grounding

  • GRIT: Two resources share this name. The grounded image-text pairs introduced with Kosmos-2 (Peng et al., 2023) are used, with minor modifications, as grounding training data, while the General Robust Image Task benchmark (Gupta et al., 2022) provides the GRIT refexp split used for evaluation.
  • Visual Genome (Krishna et al., 2017): A dataset connecting language and vision through dense image annotations, providing region descriptions.
  • RefCOCO (Kazemzadeh et al., 2014), RefCOCO+ (Mao et al., 2016), RefCOCOg (Mao et al., 2016): Datasets for referring expression comprehension, requiring models to localize objects based on natural language descriptions.

5.1.4. OCR (for training, not explicit evaluation benchmarks in Table 9 for this task)

  • SynthDoG (Kim et al., 2022): A synthetic OCR dataset generator. Qwen-VL uses COCO train2017 and unlabeled2017 datasets as natural scenery backgrounds, generating English and Chinese text with associated quadrilateral coordinates as labels. The following figure (Figure 5 from the original paper) shows the visualization of the Grounding and OCR data used for training Qwen-VL:

    Figure 5: Visualization of the Grounding and OCR data used for training Qwen-VL. The figure shows example images with multiple boxed regions annotated with text, illustrating the localization and text-reading supervision.

  • Common Crawl PDF and HTML data: Real-world data collected from Common Crawl. PDF data extraction involves parsing pages with PyMuPDF (Artifex Software, 2015) to get rendering results and text annotations with bounding boxes. HTML data is pre-processed similarly using Puppeteer (Google, 2023) for rendering. Both processes involve cleaning steps like removing small images, those with too many/few characters, or specific Unicode blocks.
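The PDF branch of this pipeline (render a page and keep its text together with bounding boxes) can be approximated with PyMuPDF as sketched below. This is not the authors' pipeline; the function name and the word-count thresholds are invented for illustration.

```python
import fitz  # PyMuPDF

def pdf_page_to_ocr_sample(pdf_path, page_index=0, min_words=10, max_words=1024):
    """Render one PDF page and collect its words with bounding boxes.

    Returns (png_bytes, [(word, (x0, y0, x1, y1)), ...]) or None when the page
    carries too little or too much text (illustrative thresholds).
    """
    doc = fitz.open(pdf_path)
    page = doc[page_index]
    words = page.get_text("words")   # tuples: (x0, y0, x1, y1, word, block, line, word_no)
    if not (min_words <= len(words) <= max_words):
        return None
    pix = page.get_pixmap()          # rendered page image
    annotations = [(w[4], tuple(round(v, 1) for v in w[:4])) for w in words]
    return pix.tobytes("png"), annotations

# sample = pdf_page_to_ocr_sample("example.pdf")  # hypothetical input file
```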

5.1.5. Instruction Following

  • TouchStone (Bai et al., 2023): An open-ended vision-language instruction-following benchmark, evaluated in both English and Chinese.
  • SEED-Bench (Li et al., 2023b): Consists of 19K multiple-choice questions for evaluating Multimodal LLMs, covering 12 dimensions including spatial and temporal understanding.
  • MME (Fu et al., 2023): Measures both perception and cognition abilities across 14 subtasks using yes/no questions.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, a complete explanation is provided:

  • CIDEr (Consensus-based Image Description Evaluation):

    1. Conceptual Definition: CIDEr measures the similarity of a generated image caption to a set of human-written reference captions. It places higher weight on n-grams (sequences of n words) that are frequently found in the reference captions but are rare in the overall corpus, thus penalizing common, generic words while rewarding specific, descriptive ones. It is designed to evaluate both grammaticality and saliency.
    2. Mathematical Formula: Each caption is represented by a TF-IDF weighted vector of its n-grams, and the per-length score is the average cosine similarity between the candidate and the references: $ \mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{|S_i|} \sum_{s_{ij} \in S_i} \frac{\mathbf{g}^n(c_i) \cdot \mathbf{g}^n(s_{ij})}{\lVert \mathbf{g}^n(c_i) \rVert \, \lVert \mathbf{g}^n(s_{ij}) \rVert} $ The final CIDEr score is a weighted sum over n-gram lengths: $ \mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^N w_n \cdot \mathrm{CIDEr}_n(c_i, S_i) $
    3. Symbol Explanation:
      • c_i: The candidate caption generated by the model for image i.
      • S_i: The set of reference (human-written) captions for image i, with members s_{ij}.
      • N: Maximum n-gram length considered (typically 4, i.e., unigrams to 4-grams).
      • \mathbf{g}^n(c_i): TF-IDF weighted vector of n-grams for the candidate caption c_i.
      • \mathbf{g}^n(s_{ij}): TF-IDF weighted vector of n-grams for the reference caption s_{ij}.
      • \mathrm{tf}(w): Term frequency of n-gram w in a caption.
      • \mathrm{idf}(w): Inverse document frequency of n-gram w across the corpus (the two combine into the TF-IDF weights).
      • w_n: Weight for the length-n score (usually uniform, w_n = 1/N).
  • VQA Score (Accuracy):

    1. Conceptual Definition: For VQA tasks, the "VQA Score" (often simply called accuracy) is a variant of accuracy that accounts for the 10 human answers collected per question. If the model's answer matches at least 3 of the 10 annotators, it receives full credit; matching 1 or 2 annotators earns partial credit; otherwise the score is 0. This handles the subjective nature of VQA answers (a minimal implementation sketch appears after this metric list).
    2. Mathematical Formula: For a given question with human answers A_1, ..., A_{10} and model prediction A: $ \mathrm{VQA\ Score}(A) = \min\left(\frac{\sum_{k=1}^{10} \mathbb{I}(A = A_k)}{3}, 1\right) $ The official metric additionally averages this quantity over all subsets of 9 annotators, which slightly smooths the score.
    3. Symbol Explanation:
      • A: The answer predicted by the model.
      • A_k: The k-th human annotator's answer (10 answers per question).
      • \mathbb{I}(\cdot): Indicator function, which is 1 if the condition is true and 0 otherwise.
  • EM (Exact Match):

    1. Conceptual Definition: Exact Match (EM) is a stringent metric that measures whether the model's generated answer perfectly matches one of the ground truth answers. It is commonly used in question answering and text reading (OCR) tasks where precise answers are expected.
    2. Mathematical Formula: $ \mathrm{EM} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(P_i = T_i) $ where P_i is the predicted answer for the i-th instance and T_i is the ground-truth answer. If multiple ground-truth answers exist, a match to any one of them counts as an exact match.
    3. Symbol Explanation:
      • N: Total number of instances (e.g., question-answer pairs).
      • P_i: The model's predicted answer for the i-th instance.
      • T_i: The ground-truth answer(s) for the i-th instance.
      • \mathbb{I}(\cdot): Indicator function, which is 1 if the prediction exactly matches a ground truth and 0 otherwise.
  • Accuracy (Top-1 Accuracy):

    1. Conceptual Definition: Standard accuracy, particularly "Top-1 Accuracy" in multi-choice settings, measures the percentage of predictions that exactly match the correct label or option. For tasks like ScienceQA where options are constrained, it is the proportion of questions for which the model selects the correct option. For referring expression comprehension, it is the proportion of expressions for which the model correctly localizes the described object (a predicted box is typically counted as correct when its IoU with the ground-truth box exceeds 0.5).
    2. Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(y_i = \hat{y}_i) $
    3. Symbol Explanation:
      • N: Total number of samples or predictions.
      • y_i: The true label/answer for the i-th sample.
      • \hat{y}_i: The model's predicted label/answer for the i-th sample.
      • \mathbb{I}(\cdot): Indicator function, which is 1 if y_i = \hat{y}_i and 0 otherwise.
  • ANLS (Average Normalized Levenshtein Similarity):

    1. Conceptual Definition: ANLS is a metric used for document VQA tasks (like DocVQA) that measures the similarity between a predicted answer and a ground truth answer, while being robust to minor variations. It is based on the Levenshtein distance (edit distance), which quantifies the number of single-character edits (insertions, deletions, substitutions) required to change one word into another. The score is normalized and ranges from 0 to 1, where 1 means a perfect match. It is designed to be more forgiving than exact match for variations in text transcription while still penalizing errors.
    2. Mathematical Formula: For a prediction P_i and ground truth T_{i,j}, define the normalized Levenshtein distance $ \mathrm{NL}(P_i, T_{i,j}) = \frac{\mathrm{Levenshtein}(P_i, T_{i,j})}{\max(|P_i|, |T_{i,j}|)} $. The DocVQA ANLS takes the best similarity over the available ground truths and zeroes out scores whose distance exceeds a threshold \tau = 0.5, so clearly wrong answers earn no partial credit: $ \mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^N \max_j s(P_i, T_{i,j}), \quad s(P_i, T_{i,j}) = \begin{cases} 1 - \mathrm{NL}(P_i, T_{i,j}) & \text{if } \mathrm{NL}(P_i, T_{i,j}) < \tau \\ 0 & \text{otherwise} \end{cases} $
    3. Symbol Explanation:
      • N: Total number of instances.
      • P_i: The model's predicted answer for the i-th instance.
      • T_{i,j}: The j-th ground-truth answer for the i-th instance.
      • \mathrm{Levenshtein}(P_i, T_{i,j}): The edit distance (insertions, deletions, substitutions) between the predicted and ground-truth answers.
      • |P_i|, |T_{i,j}|: Lengths of the predicted and ground-truth answer strings.
      • \mathrm{NL}(P_i, T_{i,j}): The Levenshtein distance normalized by the longer of the two strings.
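For concreteness, minimal reference implementations of the simpler metrics above might look like the sketch below. These are illustrative, not the official evaluation scripts: the official VQA accuracy additionally averages over annotator subsets, answers are usually normalized (lower-cased, punctuation-stripped) before matching, and the referring-expression hit criterion assumes the standard IoU > 0.5 threshold.

```python
def vqa_score(prediction, human_answers):
    """min(#matching annotators / 3, 1), as in the VQA Score above."""
    matches = sum(a == prediction for a in human_answers)
    return min(matches / 3.0, 1.0)

def exact_match(prediction, ground_truths):
    """1 if the prediction equals any ground-truth answer, else 0."""
    return float(prediction in ground_truths)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def anls(prediction, ground_truths, tau=0.5):
    """Best normalized Levenshtein similarity; scores below tau count as 0."""
    best = 0.0
    for gt in ground_truths:
        nl = levenshtein(prediction, gt) / max(len(prediction), len(gt), 1)
        sim = 1.0 - nl
        best = max(best, sim if sim >= tau else 0.0)
    return best

def box_accuracy(pred_box, gt_box, iou_threshold=0.5):
    """Referring-expression hit: predicted box counts if IoU exceeds 0.5."""
    ax0, ay0, ax1, ay1 = pred_box
    bx0, by0, bx1, by1 = gt_box
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return float(inter / union > iou_threshold) if union > 0 else 0.0

print(vqa_score("red", ["red", "red", "red", "crimson"] + ["red"] * 6))  # 1.0
print(anls("qwen-vl", ["qwen-vl"]))                                      # 1.0
```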

5.3. Baselines

Qwen-VL's performance is benchmarked against a wide array of existing vision-language models, categorized into "Generalist Models" and "Specialist SOTAs."

  • Generalist Models: These are typically large-scale, versatile models designed to handle a broad range of V-L tasks.

    • Flamingo (Alayrac et al., 2022): 9B and 80B parameter versions, known for strong few-shot learning.
    • Unified-IO-XL (Lu et al., 2022a): A large unified model for vision, language, and multimodal tasks.
    • Kosmos-1 / Kosmos-2 (Peng et al., 2023): Multimodal LLMs with grounding capabilities.
    • BLIP-2 (Li et al., 2023c): Uses a Q-Former to align frozen vision encoders and LLMs (Vicuna-13B).
    • InstructBLIP (Dai et al., 2023): Instruction-tuned version of BLIP-2 (Vicuna-13B).
    • Shikra (Chen et al., 2023a): 7B and 13B parameter versions, known for referential dialogue.
    • mPLUG-DocOwl (Ye et al., 2023a): Focuses on document understanding (LLaMA-7B).
    • Pix2Struct-Large: A 1.3B parameter model, relevant for text-oriented tasks.
    • VisualGLM, PandaGPT, MiniGPT4, LLaMA-AdapterV2, LLaVA, mPLUG-Owl: Other instruction-tuned LVLMs used for comparison on instruction-following benchmarks.
  • Specialist SOTAs: These are models that often achieve state-of-the-art results on specific tasks or domains, sometimes with dedicated architectures or more extensive training data for that particular task.

    • PALI-X-55B (Chen et al., 2023b): A very large multilingual language-image model.

    • G-DINO-L: Grounding DINO (Large), a specialist open-set detection and grounding model built on DINO.

    • UNINEXT-H, ONE-PEACE (Wang et al., 2023): Other high-performing multimodal models.

      These baselines represent a comprehensive set of contemporary and high-performing multimodal models, providing a strong basis for evaluating Qwen-VL's generalist capabilities and fine-grained understanding.

5.4. Hyperparameters

The following are the results from Table 8 of the original paper:

Configuration Pre-training Multi-task Pre-training Supervised Fine-tuning
ViT init. Open-CLIP-bigG Qwen-VL 1st-stage Qwen-VL 2nd-stage
LLM init. Qwen-7B Qwen-7B Qwen-VL 2nd-stage
VL Adapter init. random Qwen-VL 1st-stage Qwen-VL 2nd-stage
Image resolution 224² 448² 448²
ViT sequence length 256 1024 1024
LLM sequence length 512 2048 2048
Learnable query numbers 256 256 256
Optimizer AdamW
Optimizer hyperparameter β1 = 0.9, β2 = 0.98, eps = 1e−6
Peak learning rate 2e-4 5e-5 1e-5
Minimum learning rate 1e-6 1e-5 1e-6
ViT learning rate decay 0.95 0.95 0
ViT Drop path rate 0
Learning rate schedule cosine decay
Weight decay 0.05
Gradient clip 1.0
Training steps 50k 19k 8k
Warm-up steps 500 400 3k
Global batch size 30720 4096 128
Gradient Acc. 6 8 8
Numerical precision bfloat16
Optimizer sharding
Activation checkpointing X
Model parallelism X 2 2
Pipeline parallelism X

Detailed training hyperparameter settings are provided for each of the three stages:

  • Pre-training (Stage 1):

    • ViT Initialization: Open-CLIP-bigG.
    • LLM Initialization: Qwen-7B.
    • VL Adapter Initialization: Random.
    • Image Resolution: 224×224.
    • ViT Sequence Length: 256.
    • LLM Sequence Length: 512.
    • Learnable Query Numbers: 256.
    • Optimizer: AdamW with β1 = 0.9, β2 = 0.98, ε = 1e-6.
    • Learning Rate Schedule: Cosine decay with a linear warm-up of 500 steps (a minimal schedule sketch follows this hyperparameter list).
    • Peak Learning Rate: 2e-4.
    • Minimum Learning Rate: 1e-6.
    • ViT Learning Rate Decay: 0.95 (layer-wise).
    • Weight Decay: 5e-2.
    • Gradient Clip: 1.0.
    • Training Steps: 50,000 steps.
    • Global Batch Size: 30720.
    • Gradient Accumulation: 6.
    • Numerical Precision: bfloat16.
  • Multi-task Pre-training (Stage 2):

    • ViT Initialization: Weights from Qwen-VL 1st-stage.
    • LLM Initialization: Qwen-7B.
    • VL Adapter Initialization: Weights from Qwen-VL 1st-stage.
    • Image Resolution: Increased to 448×448.
    • ViT Sequence Length: 1024.
    • LLM Sequence Length: 2048.
    • Learnable Query Numbers: 256.
    • Optimizer: AdamW with β1 = 0.9, β2 = 0.98, ε = 1e-6.
    • Learning Rate Schedule: Cosine decay with 400 warm-up steps.
    • Peak Learning Rate: 5e-5.
    • Minimum Learning Rate: 1e-5.
    • ViT Learning Rate Decay: 0.95.
    • ViT Drop Path Rate: 0.
    • Weight Decay: 0.05.
    • Gradient Clip: 1.0.
    • Training Steps: 19,000 steps.
    • Global Batch Size: 4096.
    • Gradient Accumulation: 8.
    • Numerical Precision: bfloat16.
    • Optimizer Sharding: Enabled.
    • Model Parallelism: 2 (indicating distributed training across 2 devices/partitions).
  • Supervised Fine-tuning (Stage 3):

    • ViT Initialization: Weights from Qwen-VL 2nd-stage.
    • LLM Initialization: Weights from Qwen-VL 2nd-stage.
    • VL Adapter Initialization: Weights from Qwen-VL 2nd-stage.
    • Image Resolution: 448×448.
    • ViT Sequence Length: 1024.
    • LLM Sequence Length: 2048.
    • Learnable Query Numbers: 256.
    • Peak Learning Rate: 1e-5.
    • Minimum Learning Rate: 1e-6.
    • ViT Learning Rate Decay: 0 (ViT frozen).
    • Training Steps: 8,000 steps.
    • Warm-up Steps: 3,000.
    • Global Batch Size: 128.
    • Gradient Accumulation: 8.
    • Model Parallelism: 2.
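All three stages use cosine decay with a linear warm-up (Table 8). A self-contained sketch of such a schedule, plugging in Stage 1's values for illustration, is given below; the exact scheduler implementation is not specified in the paper.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Stage-1 settings from Table 8: 500 warm-up steps, 50k total, peak 2e-4, min 1e-6.
for s in (0, 499, 500, 25_000, 50_000):
    print(s, f"{lr_at_step(s, 50_000, 500, 2e-4, 1e-6):.2e}")
```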

6. Results & Analysis

The evaluation of Qwen-VL models demonstrates their strong performance across a diverse set of vision-language tasks, positioning them as state-of-the-art generalist models at their scale.

6.1. Core Results Analysis

6.1.1. Image Caption and General Visual Question Answering

The following are the results from Table 4 of the original paper:

Model Type | Model | Nocaps (0-shot) | Flickr30K (0-shot) | VQAv2 | OKVQA | GQA | SciQA-Img (0-shot) | VizWiz (0-shot)
Generalist Models | Flamingo-9B | - | 61.5 | 51.8 | 44.7 | - | - | 28.8
Generalist Models | Flamingo-80B | - | 67.2 | 56.3 | 50.6 | - | - | 31.6
Generalist Models | Unified-IO-XL | 100.0 | - | 77.9 | 54.0 | - | - | -
Generalist Models | Kosmos-1 | - | 67.1 | 51.0 | - | - | - | 29.2
Generalist Models | Kosmos-2 | - | 80.5 | 51.1 | - | - | - | -
Generalist Models | BLIP-2 (Vicuna-13B) | 103.9 | 71.6 | 65.0 | 45.9 | 32.3 | 61.0 | 19.6
Generalist Models | InstructBLIP (Vicuna-13B) | 121.9 | 82.8 | - | - | 49.5 | 63.1 | 33.4
Generalist Models | Shikra (Vicuna-13B) | - | 73.9 | 77.36 | 47.16 | - | - | -
Generalist Models | Qwen-VL (Qwen-7B) | 121.4 | 85.8 | 79.5 | 58.6 | 59.3 | 67.1 | 35.2
Generalist Models | Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | 68.2 | 38.9
Specialist SOTAs | (best per task) | 127.0 (PALI-17B) | 84.5 (InstructBLIP-FlanT5-XL) | 86.1 (PALI-X) | 66.1 (PALI-X) | 72.1 (CFR) | 92.53 (LLaVA+GPT-4) | 70.9 (PALI-X)

Qwen-VL and Qwen-VL-Chat demonstrate consistently superior results compared to previous generalist models.

  • Image Captioning: On Flickr30K (0-shot), Qwen-VL achieves a CIDEr score of 85.8, outperforming Flamingo-80B (67.2) and InstructBLIP (82.8), even though Flamingo-80B is significantly larger. On Nocaps, Qwen-VL scores 121.4, comparable to InstructBLIP (121.9).
  • General VQA: Qwen-VL achieves 79.5 on VQAv2, 58.6 on OKVQA, and 59.3 on GQA, showing distinct advantages over other LVLMs. For instance, on VQAv2, its score is notably higher than BLIP-2 (65.0) and Shikra (77.36). The models also show strong zero-shot performance on SciQA-Img (67.1) and VizWiz (35.2).

6.1.2. Text-oriented Visual Question Answering

The following are the results from Table 5 of the original paper:

Model Type | Model | TextVQA | DocVQA | ChartQA | AI2D | OCR-VQA
Generalist Models | BLIP-2 (Vicuna-13B) | 42.4 | - | - | - | -
Generalist Models | InstructBLIP (Vicuna-13B) | 50.7 | - | - | - | -
Generalist Models | mPLUG-DocOwl (LLaMA-7B) | 52.6 | 62.2 | 57.4 | - | -
Generalist Models | Pix2Struct-Large (1.3B) | - | 76.6 | 58.6 | 42.1 | 71.3
Generalist Models | Qwen-VL (Qwen-7B) | 63.8 | 65.1 | 65.7 | 62.3 | 75.7
Generalist Models | Qwen-VL-Chat | 61.5 | 62.6 | 66.3 | 57.7 | 70.5
Specialist SOTAs | PALI-X-55B (single-task fine-tuning) | 71.44 | | | 81.2 |

Qwen-VL models show better performance on most text-oriented VQA benchmarks, frequently by a large margin. For instance, Qwen-VL achieves 63.8 on TextVQA (61.5 for Qwen-VL-Chat), surpassing InstructBLIP (50.7). On DocVQA, Qwen-VL (65.1) outperforms mPLUG-DocOwl (62.2), a model specifically designed for document understanding. This highlights Qwen-VL's strong fine-grained text reading and understanding capabilities.

6.1.3. Refer Expression Comprehension

The following are the results from Table 6 of the original paper:

Model type Model RefCOCO RefCOCO+ RefCOCOg GRIT refexp
val test-A test-B val test-A test-B val test
Generalist Models GPV-2 - - - - - - - - 51.50
OFA-L* 79.96 83.67 76.39 68.29 76.00 61.75 67.57 67.58 61.70
Unified-IO - - - - - - - - 78.61
VisionLLM-H 86.70 - - - - - - -
Shikra-7B 87.01 90.61 80.24 81.60 87.36 72.12 82.27 82.19 69.34
Shikra-13B 87.83 91.11 81.81 82.89 87.79 74.41 82.64 83.16 69.03
Qwen-VL-7B 89.36 92.26 85.34 83.12 88.25 77.21 85.58 85.48 78.22
Qwen-VL-7B-Chat 88.55 92.27 84.51 82.82 88.59 76.79 85.96 86.32 -
Specialist SOTAs G-DINO-L 90.56 93.19 88.24 82.75 88.95 75.92 86.13 87.02
UNINEXT-H 92.64 94.33 91.46 85.24 89.63 79.79 88.73 89.37
ONE-PEACE 92.58 94.18 89.26 88.77 92.21 83.23 89.22 89.27

Qwen-VL-7B (89.36 on RefCOCO val, 92.26 on test-A, 85.34 on test-B) and Qwen-VL-7B-Chat achieve top-tier results on all referring expression comprehension benchmarks, often surpassing other generalist models like Shikra-13B. This indicates its strong ability to localize objects based on textual descriptions, a crucial aspect of fine-grained understanding.

6.1.4. Few-shot Learning

The following figure (Figure 4 from the original paper) shows the few-shot learning results of Qwen-VL in comparison with other models:

Figure 4: Few-shot learning results of Qwen-VL in comparison with other models. The chart shows performance curves for Qwen-VL and competing models across several benchmarks, illustrating the model's effectiveness and advantages in the few-shot setting.

Qwen-VL exhibits strong in-context few-shot learning capabilities. As shown in Figure 4, it achieves better performance through in-context few-shot learning on OKVQA, Vizwiz, TextVQA, and Flickr30k, compared to models of similar parameter counts like Flamingo-9B, OpenFlamingo-9B, and IDEFICS-9B. Notably, Qwen-VL's performance is even comparable to much larger models like Flamingo-80B and IDEFICS-80B, without using sophisticated exemplar construction methods.

6.1.5. Instruction Following in Real-world User Behavior

The following are the results from Table 7 of the original paper:

Model TouchStone (En / Cn) SEED-Bench (All / Img / Video) MME (Perception / Cognition)
VisualGLM - 247.1 - - - 705.31 181.79
PandaGPT 488.5 - - - - 642.59 228.57
MiniGPT4 531.7 - 42.8 47.4 29.9 581.67 144.29
InstructBLIP 552.4 - 53.4 58.8 38.1 1212.82 291.79
LLaMA-AdapterV2 590.1 - 32.7 35.2 25.8 972.67 248.93
LLaVA 602.7 - 33.5 37.0 23.8 502.82 214.64
mPLUG-Owl 605.4 - 34.0 37.9 23.0 967.34 276.07
Qwen-VL - - 56.3 62.3 39.1 - -
Qwen-VL-Chat 645.2 401.2 58.2 65.4 37.8 1487.58 360.71

Qwen-VL-Chat achieves notable advantages over other LVLMs on instruction-following benchmarks.

  • TouchStone: Qwen-VL-Chat scores 645.2 in English and 401.2 in Chinese, significantly outperforming other models, particularly in Chinese capabilities (e.g., VisualGLM at 247.1). This highlights its strong multilingual instruction-following.
  • SEED-Bench: Qwen-VL-Chat achieves 58.2 (All), 65.4 (Img), and 37.8 (Video), surpassing InstructBLIP on the overall (53.4) and image (58.8) splits while remaining comparable on video (38.1 vs 37.8). The model's visual capabilities transfer to video tasks by simply sampling frames; a minimal frame-sampling sketch follows this list.
  • MME: Qwen-VL-Chat achieves 1487.58 (Perception) and 360.71 (Cognition), setting new records and demonstrating superior understanding and recognition, especially in areas like text recognition and chart analysis.
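As noted in the SEED-Bench bullet above, video questions are handled by sampling a few frames and feeding them to the model as ordinary images. The sketch below shows one way to do uniform frame sampling; the use of OpenCV and the choice of four frames are illustrative assumptions, not details from the paper.

```python
import cv2  # OpenCV, used here purely for illustration

def sample_frames(video_path: str, num_frames: int = 4):
    """Uniformly sample frames from a video and return them as BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # each frame can then be passed to the model as an ordinary image
```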

6.2. Ablation Studies / Parameter Analysis

6.2.1. Convergence of the Pre-training Stage

The following figure (Figure 6 from the original paper) shows the visualization of the convergence of the pre-training stage:

Figure 6: Visualization of the convergence of the pre-training stage. The figure plots the pre-training loss, Flickr captioning performance (CIDEr), and zero-shot VQA accuracy against the number of training images: the loss decreases steadily, while the captioning and VQA metrics fluctuate but trend upward.

Figure 6 illustrates the convergence of the first pre-training stage. The pre-training loss decreases steadily as the number of trained images increases. Interestingly, even in this stage, where no VQA data is explicitly added, the Zero-shot VQA score shows a fluctuating but overall increasing trend, indicating that the model implicitly learns some VQA capabilities from general image-text alignment. The caption generation performance on Flickr (CIDEr) also improves.

6.2.2. Number of Learnable Queries in the Vision-Language Adapter

The following figure (Figure 7 from the original paper) shows the visualization of the training loss when using different compressed feature lengths of the vision adapter:

Figure 7: Visualization of the training loss when using different compressed feature lengths of the vision-language adapter. The left panel depicts the initial loss (first 50 steps) and the right panel the later convergence (roughly 1k-4.5k steps). In the legend, L64 denotes that the adapter uses 64 queries to compress the visual feature sequence to a fixed length of 64, and so on for L144, L256, and L400. The loss curves have been smoothed to avoid shading owing to fluctuations.

Figure 7 presents an ablation study on the number of learnable queries in the VL adapter. The visual encoder's output for a 224 × 224 image is 256 patches. The study shows that:

  • Initial Loss: Fewer queries (e.g., L64) lead to a lower initial loss.
  • Convergence: Too few or too many queries can slow down convergence.
  • Optimal Choice: Considering that the second stage uses 448 × 448 resolution (leading to 1024 ViT output patches), where too few queries could cause significant information loss, 256 queries were chosen for the VL adapter in Qwen-VL. This balance allows efficient compression while retaining sufficient visual information (a minimal sketch of such a query-based adapter follows this list).
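To make the ablation concrete, the following is a minimal sketch of a query-based adapter: a set of learnable queries cross-attends over the ViT patch features and always returns a fixed-length sequence, regardless of input resolution. The class name, single attention layer, and hyperparameters are illustrative assumptions; the actual Qwen-VL adapter also injects 2D positional information, which is omitted here.

```python
import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    """Compress a variable-length sequence of visual features to a fixed number of tokens."""

    def __init__(self, num_queries: int = 256, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        # Learnable queries: the fixed-length "slots" the visual features are compressed into.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, dim), e.g. 1024 patches for a 448x448 input.
        batch = visual_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(queries, visual_feats, visual_feats)
        return compressed  # (batch, num_queries, dim), fixed length regardless of resolution

# Example: compress 1024 ViT patch features down to 256 adapter tokens.
adapter = QueryResampler(num_queries=256, dim=1024)
out = adapter(torch.randn(2, 1024, 1024))
print(out.shape)  # torch.Size([2, 256, 1024])
```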

6.2.3. Window Attention vs Global Attention for Vision Transformer

The following figure (Figure 8 from the original paper) shows the visualization of the Loss when using Window Attention vs Global Attention:

Figure 8: Visualization of the loss when using Window Attention vs Global Attention. The chart plots training loss against training steps for different configurations: window and global attention at 896 × 896 (blue and green curves) and at 448 × 448 (red and light-blue curves).

The following are the results from Table 10 of the original paper:

Model input resolution & Attention type Training speed
448 × 448, Global Attention 10s / iter
448 × 448, Window Attention 9s / iter
896 × 896, Global Attention 60s / iter
896 × 896, Window Attention 25s / iter

This ablation investigates the trade-off between computational cost and performance for high-resolution ViT inputs.

  • Performance: As shown in Figure 8, models using Window Attention consistently exhibit significantly higher loss compared to those using Global Attention, indicating a performance degradation.
  • Training Speed: For 448 × 448 resolution (Table 10), the training speed difference between Global (10s/iter) and Window (9s/iter) Attention is minimal.
  • Conclusion: Given the minimal speed improvement at 448 × 448 and the significant performance drop, Global Attention was chosen for Qwen-VL at this resolution. While Window Attention offers a clear speedup at very high resolutions such as 896 × 896 (25s/iter vs 60s/iter for Global), the associated performance degradation and still-slow training made it unsuitable for Qwen-VL's goals. A rough cost comparison across the two resolutions is sketched below.
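A back-of-the-envelope calculation helps explain this trade-off: global attention scales quadratically with the number of patch tokens, while window attention only pays the quadratic cost inside each window. The sketch below counts query-key pairs under an assumed patch size of 14 and a 16 × 16-token window; both values are illustrative, not reported settings.

```python
def attention_pairs(resolution: int, patch: int = 14, window: int = 0) -> int:
    """Count query-key pairs for a square image, with optional non-overlapping windows."""
    tokens_per_side = resolution // patch           # e.g. 448 // 14 = 32
    if window == 0:                                 # global attention: every token attends to all
        n = tokens_per_side ** 2
        return n * n
    windows = (tokens_per_side // window) ** 2      # number of non-overlapping windows
    tokens_per_window = window * window
    return windows * tokens_per_window ** 2

for res in (448, 896):
    global_pairs = attention_pairs(res)
    window_pairs = attention_pairs(res, window=16)
    print(f"{res}x{res}: global {global_pairs:,} vs window {window_pairs:,} "
          f"({global_pairs // window_pairs}x fewer pairs)")
```

Under these assumed values, windowing removes about 4x of the attention pairs at 448 × 448 but 16x at 896 × 896, which is consistent with Table 10's pattern of a negligible wall-clock gain at the lower resolution and a large one at the higher resolution.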

6.2.4. Performance on Pure-text Tasks

The following are the results from Table 11 of the original paper:

Model MMLU CMMLU C-Eval
LLaMA-7B 35.1 26.8 -
LLaMA2-7B 46.8 31.8 32.5
Baichuan-7B 42.3 44.4 42.8
Baichuan2-7B 54.2 57.1 54.0
ChatGLM2-6B 47.9 48.8 51.7
InternLM-7B 51.0 51.8 52.8
Qwen-7B (final released) 58.2 62.2 63.5
Qwen-7B (intermediate checkpoint, used as Qwen-VL's LLM initialization) 49.9 - 48.5
Qwen-VL 50.7 49.5 51.1

To prevent catastrophic forgetting of the LLM's linguistic abilities during multimodal training, pure-text data is included in the multi-task training (Stage 2) and SFT (Stage 3) stages. Table 11 demonstrates the effectiveness of this strategy:

  • Qwen-VL, initialized from an intermediate checkpoint of Qwen-7B (which scores 49.9 on MMLU and 48.5 on C-Eval), shows improved performance on pure-text benchmarks after multimodal training.
  • Qwen-VL achieves 50.7 on MMLU, 49.5 on CMMLU, and 51.1 on C-Eval. These scores are comparable to or better than other 7B-class open-source LLMs like LLaMA2-7B, Baichuan-7B, ChatGLM2-6B, and InternLM-7B.
  • This indicates that Qwen-VL successfully integrates multimodal capabilities without compromising its pure-text understanding and generation, showcasing robust knowledge retention.
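As a concrete illustration of the mixing strategy described above, the sketch below interleaves pure-text and multimodal samples at a fixed ratio when forming training batches. The 0.2 text ratio, batch size, and record format are illustrative assumptions; the paper does not specify the exact mixture.

```python
import random

def mixed_batch_iterator(multimodal_data, text_data, text_ratio=0.2, batch_size=8, seed=0):
    """Yield training batches drawn from both corpora, with roughly `text_ratio` pure-text samples."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            source = text_data if rng.random() < text_ratio else multimodal_data
            batch.append(rng.choice(source))
        yield batch

# Toy records standing in for the two corpora.
multimodal = [{"image": f"img_{i}.jpg", "text": f"caption {i}"} for i in range(100)]
pure_text = [{"text": f"pure-text sample {i}"} for i in range(100)]
print(next(mixed_batch_iterator(multimodal, pure_text))[0])
```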

6.3. Qualitative Examples

Figure 2 (in the Executive Summary) provides qualitative examples from Qwen-VL-Chat, showcasing its ability to handle multiple image inputs, multi-round dialogue, multilingual conversation, text-reading, localization, and fine-grained recognition. These examples visually support the quantitative results, illustrating the model's practical versatility.

7. Conclusion & Reflections

7.1. Conclusion Summary

The Qwen-VL series represents a significant advancement in open-source large-scale vision-language models. By meticulously designing its visual receptor, input-output interface, and employing a sophisticated 3-stage training pipeline on a diverse, multilingual, and cleaned multimodal corpus, Qwen-VL successfully endows the Qwen-LM with robust visual understanding capabilities. The models, Qwen-VL and Qwen-VL-Chat, demonstrate state-of-the-art performance across a broad range of visual-centric benchmarks, including image captioning, general VQA, text-oriented VQA, and referring expression comprehension. Crucially, Qwen-VL excels in fine-grained tasks like object grounding and text reading, a capability often lacking in other generalist LVLMs. Its instruction-tuned variant, Qwen-VL-Chat, further exhibits superior performance in real-world dialogue scenarios, supporting multilingual and multi-image interactions. The public release of these models, code, and demo aims to significantly facilitate multimodal research.

7.2. Limitations & Future Work

The authors acknowledge several areas for future improvement and expansion:

  • Integration with More Modalities: The current Qwen-VL focuses on vision and language. Future work aims to integrate additional modalities such as speech and video, moving towards a truly universal multimodal model.
  • Scaling Up: Enhancing Qwen-VL's capabilities by scaling up the model size, expanding the training data volume, and utilizing higher input resolutions are identified as crucial steps. This would allow the model to handle more complex and intricate relationships within multimodal data, pushing the boundaries of its understanding.
  • Expanding Multimodal Generation: Beyond understanding, a key future direction is to expand Qwen-VL's prowess in multimodal generation, specifically generating high-fidelity images and fluent speech, enabling it to create multimodal content rather than just interpreting it.

7.3. Personal Insights & Critique

Qwen-VL represents a highly commendable effort to provide a powerful, open-source, and versatile LVLM. Several aspects stand out:

  • Rigorous Engineering: The paper highlights "meticulously designed" components and a "3-stage training pipeline." This attention to detail in engineering, particularly the position-aware adapter and the structured input/output for bounding boxes, is key to its fine-grained capabilities. The comprehensive data cleaning and multi-task pre-training are also crucial for its robust performance.
  • Fine-grained Focus: The emphasis on grounding, text reading, and text-oriented VQA is a critical differentiator. Many generalist VLMs tend to offer good holistic understanding but struggle with precise localization or OCR. Qwen-VL's success in these areas addresses a significant practical need for real-world applications where detailed visual understanding is paramount.
  • Multilingualism: The native support for English and Chinese, achieved through multilingual data training, makes Qwen-VL highly relevant for a global user base, distinguishing it from many English-centric models. The strong performance on Chinese instruction-following benchmarks (e.g., TouchStone Cn) is particularly noteworthy.
  • Open-Source Contribution: The decision to open-source the models, code, and demo is invaluable for the research community, enabling broader adoption, benchmarking, and further innovation in multimodal AI.

Potential Issues or Areas for Improvement:

  • Reliance on In-house Data: While the paper mentions using publicly available datasets, a significant portion of the training data (especially for text generation and fine-grained tasks) includes "in-house collected corpus" and "in-house data." The specifics of this data, its composition, and collection methodology are not fully transparent. This can make exact replication challenging and raises questions about potential biases if the data characteristics are not thoroughly documented.
  • "Similar Model Scales" Nuance: While Qwen-VL consistently outperforms "generalist models under similar model scales," the definition of "similar scale" can sometimes be ambiguous in the VLM landscape. Different models might have varying architectural complexities, parameter distributions (e.g., in visual vs. language components), or efficiency-boosting mechanisms that are not purely reflected in total parameter count. A deeper architectural comparison beyond just parameter numbers could provide more context.
  • Computational Cost of High Resolution: The ablation study on Window Attention versus Global Attention reveals that higher resolutions like 896 × 896 are computationally very expensive with Global Attention, and Window Attention degrades performance. This suggests that achieving even higher resolution fine-grained understanding might require entirely new architectural innovations or significantly more computational resources, which remains a general challenge for LVLMs.

Transferability and Future Impact: The architectural choices (e.g., the position-aware adapter) and the 3-stage training methodology are highly transferable and could serve as a blueprint for developing other specialized or general-purpose LVLMs. The explicit handling of bounding box tokens and reference tokens is a robust approach for visual grounding that could be adopted by other multimodal systems. The focus on preventing catastrophic forgetting of pure-text abilities is also a crucial lesson for integrating new modalities into powerful LLMs. Qwen-VL's release is likely to accelerate research in fine-grained multimodal understanding, multilingual LVLMs, and practical multimodal applications.
