
UniTok: A Unified Tokenizer for Visual Generation and Understanding

Published: 02/28/2025

TL;DR Summary

UniTok is a unified tokenizer for visual generation and understanding, addressing loss conflicts in combining VQVAE and CLIP. It achieves state-of-the-art performance on ImageNet with 0.38 rFID and 78.6% zero-shot accuracy, and integrates seamlessly into MLLMs, enhancing both generation and understanding.

Abstract

Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from limited representational capacity of discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256×256 benchmark. GitHub: https://github.com/FoundationVision/UniTok.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

UniTok: A Unified Tokenizer for Visual Generation and Understanding

1.2. Authors

The paper lists several authors, primarily from The University of Hong Kong and ByteDance Inc., with one author also affiliated with Huazhong University of Science and Technology.

  • Chuofan Ma
  • Yi Jiang
  • Junfeng Wu
  • Jihan Yang
  • Xin Yu
  • Zehuan Yuan
  • Bingyue Peng
  • Xiaojuan Qi

1.3. Journal/Conference

This paper is published as a preprint on arXiv; a final peer-reviewed venue is not specified in the provided information. arXiv is a well-respected open-access repository for preprints of scientific papers in various fields, including computer science and artificial intelligence.

1.4. Publication Year

The paper was submitted to arXiv on 2025-02-27 (UTC), indicating a publication year of 2025.

1.5. Abstract

The abstract introduces UniTok, a unified tokenizer designed for both visual generation and understanding tasks. It highlights a common problem in the field: visual models typically use distinct tokenizers, such as VQVAE (for autoregressive generation) and CLIP (for understanding), making unification challenging. While previous attempts to combine VQVAE and CLIP training objectives faced loss conflicts, this paper argues that the root cause is not inherent conflict but rather the limited representational capacity of discrete token space.

UniTok addresses this by introducing a novel multi-codebook quantization mechanism, which effectively scales up the vocabulary size and bottleneck dimension. This innovation leads to state-of-the-art performance: 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Furthermore, UniTok can be seamlessly integrated into Multimodal Large Language Models (MLLMs), enabling native visual generation without compromising understanding capabilities. The paper also demonstrates that UniTok facilitates cfg-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256x256 benchmark.

https://arxiv.org/abs/2502.20321 (Publication status: Preprint on arXiv)

https://arxiv.org/pdf/2502.20321v3.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the reliance of visual generative and understanding models on distinct tokenizers, posing a significant challenge for unifying them within a single framework. For instance, CLIP (Contrastive Language-Image Pre-training) models excel in multimodal understanding due to their high-dimensional, continuous feature spaces, but their continuous nature complicates discrete, autoregressive generative modeling. Conversely, VQVAE (Vector Quantized Variational AutoEncoder) tokenizers generate discrete tokens well-suited for autoregressive generation but often struggle to capture essential semantics for understanding tasks.

This problem is important because the advent of powerful Multimodal Large Language Models (MLLMs) like GPT-4o highlights the immense potential of models with native visual generation capabilities. Such unified models promise precise control in multimodal interactions, facilitating advanced tasks like multi-turn image editing and visual in-context learning. However, the fundamental dilemma of choosing an appropriate visual tokenizer that can equally serve both generation and understanding remains.

Previous attempts to unify tokenizers by combining VQVAE (reconstruction objective) and CLIP (semantic objective) training often resulted in severe loss conflicts and convergence issues, leading to suboptimal performance in both domains. The paper questions this common assumption, proposing that the perceived loss conflict might not be inherent but rather a symptom of another underlying issue.

The paper's entry point is an empirical investigation into why combining VQVAE and CLIP objectives leads to performance degradation. Their innovative idea is to re-evaluate the true bottleneck, which they identify as the limited representational capacity of the discrete token space, rather than an inherent conflict between the training objectives themselves.

2.2. Main Contributions / Findings

The paper makes several primary contributions and reaches key conclusions:

  1. Identification of the True Bottleneck: The paper empirically demonstrates that the loss conflicts observed in prior unified tokenizer training are not inherent to combining reconstruction and semantic supervision. Instead, the primary bottleneck stems from the limited representational capacity of discrete token space when undergoing token factorization and discretization.

  2. Introduction of UniTok: The paper proposes UniTok, a unified tokenizer featuring a novel multi-codebook quantization (MCQ) mechanism and attention projection.

    • Multi-codebook quantization effectively scales up the vocabulary size and bottleneck dimension exponentially without the optimization problems of large monolithic codebooks or diminished returns.
    • Attention projection replaces traditional linear layers for token factorization, improving representational power and training stability.
  3. State-of-the-Art Performance: UniTok achieves new record performance on ImageNet, with 0.38 rFID (reconstruction quality) and 78.6% zero-shot accuracy (understanding capability). This demonstrates its superiority over both domain-specific and other unified tokenizers.

  4. Seamless MLLM Integration: UniTok can be seamlessly integrated into MLLMs, enabling native visual generation capabilities without compromising understanding performance. It significantly outperforms existing MLLM baselines using VQGAN tokenizers across various VQA benchmarks (e.g., 5.5% higher on VQAv2, 9.2% on TextVQA, 339 points on MME compared to Liquid).

  5. Improved Latent Space for Generation: The semantic supervision incorporated in UniTok leads to a better-structured latent space, significantly improving cfg-free generation performance. It reduces gFID from 14.6 to 2.5 on ImageNet 256x256 under the LlamaGen framework, indicating more coherent and higher-quality image generation without classifier-free guidance.

  6. Ablation Study Insights: Ablation studies confirm that reconstruction and contrastive learning objectives do not inherently conflict when the discrete feature space is appropriately enhanced. MCQ consistently benefits both reconstruction FID and classification accuracy, independent of training objectives.

    These findings solve the problem of developing a single visual tokenizer that effectively supports both high-fidelity visual generation and robust multimodal understanding, paving the way for more powerful and unified MLLMs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the paper, a reader should be familiar with several core concepts in deep learning for vision and language.

3.1.1. Tokenization

Tokenization in the context of machine learning refers to the process of breaking down raw input data (like text or images) into smaller, discrete units called tokens. These tokens are then fed into a model for processing.

  • For text, tokens are typically words, subwords, or characters.
  • For images, visual tokenization involves converting raw pixel data into a sequence of discrete representations. This abstraction simplifies the image, making it more manageable for models, especially those designed for sequential data like Transformers.

3.1.2. Variational AutoEncoders (VAEs) and Vector Quantized VAEs (VQVAEs)

Variational AutoEncoders (VAEs) are a type of generative model that learn a compressed, latent representation of input data. A VAE consists of two main parts:

  1. Encoder: Maps the input data (e.g., an image) to a latent space (a lower-dimensional continuous representation).
  2. Decoder: Reconstructs the input data from a sample drawn from the latent space. VAEs are trained to minimize a reconstruction loss (how well the output matches the input) and a Kullback-Leibler (KL) divergence term that forces the latent space to approximate a simple distribution (e.g., a standard Gaussian).

Vector Quantized VAEs (VQVAEs) build upon VAEs by introducing a discrete latent space. Instead of mapping inputs to a continuous vector, the encoder's output is quantized by mapping it to the closest vector in a learnable "codebook" or "dictionary" of discrete embedding vectors. This quantization process makes the latent space discrete, which is crucial for autoregressive generation (where the model predicts the next token in a sequence).

  • Codebook: A finite set of embedding vectors (codes) that represent different visual patterns.
  • Vector Quantization (VQ): The process of finding the closest codebook entry to an encoded continuous vector and replacing the continuous vector with that discrete code.
  • VQ Loss ($\mathcal{L}_{\mathrm{VQ}}$): A component of the VQVAE loss function that encourages the encoder outputs to commit to certain codebook entries and ensures the codebook entries are updated to be close to the encoder outputs.
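
As a concrete illustration of the code lookup described above, the following is a minimal PyTorch sketch of nearest-neighbor vector quantization with a straight-through estimator. The function name, tensor shapes, and the 0.25 commitment weight are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    """Map each continuous vector in z to its nearest codebook entry.

    z:        (N, d) encoder outputs
    codebook: (K, d) learnable code embeddings
    Returns quantized vectors, chosen indices, and the VQ loss
    (codebook + commitment terms), using the straight-through estimator.
    """
    # Pairwise squared distances between encoder outputs and codes.
    dists = torch.cdist(z, codebook) ** 2          # (N, K)
    indices = dists.argmin(dim=-1)                 # (N,)
    z_q = codebook[indices]                        # (N, d)

    # Codebook loss pulls codes toward encoder outputs;
    # commitment loss keeps encoder outputs close to their codes.
    codebook_loss = F.mse_loss(z_q, z.detach())
    commit_loss = F.mse_loss(z, z_q.detach())
    vq_loss = codebook_loss + 0.25 * commit_loss

    # Straight-through: gradients flow to z as if quantization were identity.
    z_q = z + (z_q - z).detach()
    return z_q, indices, vq_loss
```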

3.1.3. CLIP (Contrastive Language-Image Pre-training)

CLIP is a multimodal model designed to learn visual representations from natural language supervision. It consists of two separate encoders:

  1. Image Encoder: Processes images into image embeddings.
  2. Text Encoder: Processes text into text embeddings. During training, CLIP learns to align these image and text embeddings in a shared multimodal embedding space. It does this using a contrastive learning objective: given a batch of image-text pairs, the model is trained to maximize the similarity between correctly matched image-text pairs and minimize the similarity between incorrectly matched pairs.
  • Image-text contrastive loss ($\mathcal{L}_{\mathrm{contra}}$): The loss function used in CLIP to pull positive (matching) image-text pairs closer and push negative (non-matching) pairs apart in the embedding space. This allows CLIP to learn robust, semantically rich visual representations that are highly aligned with natural language descriptions, making it excellent for multimodal understanding tasks like zero-shot classification.
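
For concreteness, below is a minimal sketch of a symmetric CLIP-style contrastive loss over a batch of paired embeddings; the temperature value and function name are illustrative assumptions rather than CLIP's exact configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors; row i of each is a matching pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs lie on the diagonal; both directions are averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```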

3.1.4. Multimodal Large Language Models (MLLMs)

Multimodal Large Language Models (MLLMs) are Large Language Models (LLMs) that have been extended to process and understand multiple modalities of data, typically text and images (but sometimes also audio or video). They aim to unify language understanding and generation with visual understanding and generation. A key challenge in building MLLMs is how to represent visual information in a way that is compatible with the discrete token space usually employed by LLMs.

3.1.5. Autoregressive Generation

Autoregressive generation is a method of generating sequences (like text or image tokens) one step at a time, where each new element is conditioned on the previously generated elements. In the context of VQVAE models for image generation, after an image is tokenized into a sequence of discrete visual tokens, an autoregressive model (e.g., a Transformer) can be trained to predict the next visual token in the sequence given the preceding ones. This allows for controlled and sequential image synthesis.

3.1.6. Transformer Architecture and Multi-Head Attention

The Transformer architecture is a neural network model primarily known for its use in natural language processing (NLP) but now widely adopted in vision. Its core component is the self-attention mechanism, specifically Multi-Head Attention (MHA). The Attention mechanism allows the model to weigh the importance of different parts of the input sequence when processing each element. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.

  • $d_k$ is the dimension of the keys (and queries), used for scaling to prevent very large dot products that push the softmax into regions with tiny gradients.

  • $\mathrm{softmax}$ normalizes the scores to create a probability distribution.

    Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions. It does this by running several attention mechanisms in parallel and then concatenating their outputs, which are then linearly transformed. This provides the model with a richer understanding of the relationships within the data.
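
The attention formula above can be expressed as a short, generic sketch; this is plain scaled dot-product attention, not any specific model's implementation.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., n_q, n_k)
    weights = scores.softmax(dim=-1)                   # each row sums to 1
    return weights @ V                                 # (..., n_q, d_v)
```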

3.1.7. FID (Fréchet Inception Distance) and rFID / gFID

Fréchet Inception Distance (FID) is a metric used to assess the quality of images generated by generative models. It measures the "distance" between the feature distributions of real images and generated images. $ \mathrm{FID} = ||\mu_1 - \mu_2||^2_2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $ Where:

  • $\mu_1$ and $\Sigma_1$ are the mean and covariance matrix of the feature vectors for real images, typically extracted from an Inception-v3 model.
  • $\mu_2$ and $\Sigma_2$ are the mean and covariance matrix of the feature vectors for generated images.
  • $||\cdot||^2_2$ denotes the squared Euclidean distance.
  • $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. A lower FID score indicates higher quality and diversity of the generated images, suggesting they are more similar to real images.
  • rFID (reconstruction FID) specifically measures the quality of reconstructed images (output of VQVAE decoder) compared to original images.
  • gFID (generation FID) measures the quality of images generated by an autoregressive model using the VQVAE tokens, compared to real images.

3.1.8. Zero-shot Accuracy

Zero-shot accuracy refers to a model's ability to classify images into categories it has not explicitly seen during training, based solely on textual descriptions of those categories. In CLIP, for example, zero-shot classification works by computing the similarity between an image's embedding and the embeddings of candidate class names (e.g., "a photo of a cat", "a photo of a dog"). The class with the highest similarity score is chosen as the prediction. High zero-shot accuracy implies that the model has learned robust and generalizable semantic representations.
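
A minimal sketch of CLIP-style zero-shot classification, assuming the image and class-prompt embeddings have already been computed; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding is most similar to the image.

    image_emb:        (D,) embedding of one image
    class_text_embs:  (C, D) embeddings of prompts like "a photo of a {class}"
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    similarities = class_text_embs @ image_emb   # (C,) cosine similarities
    return similarities.argmax().item()          # index of the predicted class
```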

3.1.9. Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG) is a technique used in generative models (especially diffusion models and sometimes autoregressive models) to improve the quality and coherence of generated samples by increasing the influence of the conditioning signal (e.g., a text prompt). It involves training a single model with and without conditioning. During inference, samples are generated by interpolating between the unconditioned (or weakly conditioned) and conditioned score estimates.

  • cfg-free generation refers to generation without using CFG, relying solely on the intrinsic conditioning signal. If a model performs well cfg-free, it implies its latent space is inherently well-structured and aligns strongly with the conditioning.
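
The following is a minimal sketch of how classifier-free guidance is commonly applied to the conditional and unconditional predictions at inference time; the function name and guidance scale are illustrative. Setting the scale to 1.0 (or sampling from the conditional branch alone) corresponds to the cfg-free setting discussed above.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale=3.0):
    """Classifier-free guidance: extrapolate toward the conditional prediction.

    guidance_scale = 1.0 recovers the plain conditional model (cfg-free);
    larger values strengthen the influence of the conditioning signal.
    Works on any array-like predictions (e.g., torch tensors or numpy arrays).
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```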

3.2. Previous Works

The paper contextualizes its work by discussing prior research in image tokenization for generation, understanding, and unified vision-language models.

3.2.1. Image Tokenization for Generation

  • VQVAE [53]: Pioneering work introducing discrete latent spaces by mapping continuous tokens to nearest neighbors in a learnable codebook. This was foundational for autoregressive generation.
  • VQGAN [8]: Improved VQVAE by incorporating perceptual loss (based on LPIPS metric [72]) and discriminator loss (from GANs [16]) to enhance reconstruction quality and generate more realistic images.
  • ViT-VQGAN [65]: Applied the Transformer architecture to the VQGAN framework, leveraging Transformer's scalability.
  • Recent Quantization Methods: Efforts like residual quantization (RQ) [18] and lookup-free quantization (LFQ) [67] aim to develop better ways to discretize latent features, often by expanding effective codebook size or improving codebook utilization. RQ uses multiple quantization steps to progressively refine the representation, while LFQ seeks to avoid explicit codebook lookups.

3.2.2. Image Tokenization for Understanding

  • CLIP [38]: A widely adopted vision tokenizer for MLLMs, pretrained to align images and text using contrastive learning. While excellent for understanding, its continuous feature space poses challenges for autoregressive generation.
  • DINOv2 [34]: A self-supervised learning model that produces robust visual features advantageous for region-level tasks. These models also typically output continuous features.
  • Discretizing CLIP tokens [10] or VQVAE encoders [27, 62]: Some works have explored methods to make CLIP tokens discrete or use VQVAE encoders for understanding. However, these often substantially impair understanding performance, highlighting the difficulty of maintaining semantic richness in discrete tokens.

3.2.3. Unified Vision-Language Models (MLLMs)

  • Continuous Visual Tokenizers + Diffusion Models [7, 11, 44]: One line of MLLMs uses continuous visual tokenizers (like CLIP features) for image encoding and then relies on pretrained diffusion models for image synthesis. This approach increases model complexity and disconnects the visual sampling process from the MLLM itself.
  • VQVAE-based Discrete Tokens [46, 56, 62, 61, 59]: Another stream, to which UniTok belongs, adopts VQVAE models to encode images into discrete tokens. These discrete tokens can then be modeled using the same cross-entropy loss applied to text tokens, facilitating a unified approach within LLMs. However, these models typically suffer from degraded visual comprehension capabilities because reconstruction-oriented VQVAE does not naturally align with the LLM token space for understanding. Notable examples include Liquid [59] and VILA-U [61].

3.3. Technological Evolution

The evolution of visual tokenization has largely followed two parallel paths:

  1. Generative Path: Started with VAEs and progressed to VQVAEs (VQVAE, VQGAN, ViT-VQGAN) to enable discrete representations suitable for sequential generation models. These focused heavily on reconstruction fidelity.

  2. Understanding Path: Developed contrastive learning methods like CLIP or self-supervised learning methods like DINOv2 to learn rich semantic representations aligned with language, primarily for classification, retrieval, and VQA. These focused on semantic richness.

    The current frontier, motivated by MLLMs, is to merge these paths. Early attempts to combine VQVAE and CLIP often led to loss conflicts or suboptimal performance, suggesting a fundamental trade-off. This paper fits into the latest stage of this evolution, where researchers are moving beyond the assumption of inherent conflict, instead probing the underlying limitations of the discrete token space itself.

3.4. Differentiation Analysis

Compared to the main methods in related work, UniTok introduces several core innovations:

  • Addressing the "Loss Conflict" Misconception: Unlike prior works that attribute performance degradation in unified training to loss conflicts between reconstruction and semantic objectives, UniTok's analysis re-frames this as a problem of limited representational capacity within the discrete token space. This shift in perspective is crucial, as it suggests the solution lies in enhancing the tokenizer's expressive power rather than finding a delicate balance between conflicting losses.

  • Novel Multi-Codebook Quantization (MCQ): While residual quantization (RQ) also uses multiple codes, MCQ fundamentally differs by employing a divide-and-conquer strategy (splitting the latent vector into chunks for independent sub-codebooks) rather than RQ's coarse-to-fine approach. This allows MCQ to exponentially scale vocabulary size and bottleneck dimension more effectively, especially in high-dimensional latent spaces, overcoming codebook utilization issues and diminishing gains seen with large monolithic codebooks.

  • Attention Projection for Factorization: UniTok replaces conventional linear or convolutional projection layers with adapted attention modules for token factorization. This design choice specifically addresses the loss of rich semantics during channel compression, a key issue identified in their quantization bottleneck analysis, and improves training stability and understanding performance.

  • Superior Unified Performance: UniTok achieves state-of-the-art results on both reconstruction quality (rFID) and zero-shot classification accuracy simultaneously, outperforming specialized VQVAE models in reconstruction and rivaling CLIP models in understanding. It also significantly surpasses other unified MLLMs that use discrete tokenizers in VQA and generation benchmarks.

  • Enhanced Latent Space Structure: UniTok demonstrates that semantic supervision (from CLIP) can lead to a better-structured latent space for autoregressive generation, enabling cfg-free generation with much higher quality (gFID reduction from 14.6 to 2.5), a property previously observed in diffusion models but now extended to autoregressive tokenizers.

    In essence, UniTok innovates by enhancing the discrete visual tokenizer's core design to handle increased representational demands, enabling it to bridge the gap between generation and understanding effectively, a feat that previous unified approaches struggled with due to perceived fundamental conflicts or suboptimal quantization strategies.

4. Methodology

4.1. Principles

The core idea behind UniTok is to construct a unified visual tokenizer that can effectively serve both high-fidelity visual generation and robust multimodal understanding. The theoretical basis and intuition are built upon two key insights:

  1. Reconstruction and semantic supervision do not inherently conflict. The paper challenges the common assumption that combining VQVAE's pixel-level reconstruction objective with CLIP's high-level semantic alignment objective leads to an unavoidable trade-off or loss conflict. Instead, it posits that the observed performance degradation in unified training stems from a more fundamental issue.

  2. The primary bottleneck is the limited representational capacity of discrete token space. Traditional discrete VQ-tokenizers use relatively small codebooks and latent dimensions. When continuous visual features are mapped into such a constrained discrete space, especially after token factorization (reducing dimensionality for code lookup), significant information loss occurs. This quantization bottleneck prevents the discrete tokens from simultaneously capturing fine-grained details for reconstruction and rich semantics for understanding.

    To address this, UniTok employs two main principles:

  • Exponentially Scale Representational Capacity: By introducing multi-codebook quantization (MCQ), UniTok aims to dramatically increase the effective vocabulary size and bottleneck dimension of the discrete latent space without the optimization challenges associated with single, massive codebooks. This allows the discrete tokens to retain more information.

  • Preserve Rich Semantics During Compression: Through attention projection, UniTok seeks to create a more effective token factorization mechanism that better preserves critical semantic information when compressing feature dimensions, thereby enhancing the quality of the discrete tokens for downstream understanding tasks.

    By adhering to these principles, UniTok aims to create discrete visual tokens that are sufficiently expressive to satisfy the distinct demands of both generation (precise detail) and understanding (high-level semantics) within a single unified framework.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Unified Supervision

UniTok is trained with a combined objective that integrates both reconstruction and semantic supervisions. This is crucial for enabling the tokenizer to capture both low-level visual details and high-level semantics.

The total loss term, $\mathcal{L}$, is a weighted sum of a VQVAE-based reconstruction loss and an image-text contrastive loss: $ \mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \lambda_{\mathrm{contra}} \mathcal{L}_{\mathrm{contra}} $ Where:

  • $\mathcal{L}$ is the total loss function that the model minimizes during training.

  • $\mathcal{L}_{\mathrm{recon}}$ is the VQVAE-based reconstruction loss component, designed to preserve low-level visual information and ensure high-fidelity image reconstruction.

  • $\lambda_{\mathrm{contra}}$ is a weighting factor for the image-text contrastive loss. In this paper, it is simply set to 1, meaning both loss components contribute equally.

  • $\mathcal{L}_{\mathrm{contra}}$ is the image-text contrastive loss component, designed to enhance the high-level semantic understanding capabilities of the visual features by aligning them with text.

    The VQVAE-based reconstruction loss, $\mathcal{L}_{\mathrm{recon}}$, itself comprises several terms: $ \mathcal{L}_{\mathrm{recon}} = \mathcal{L}_{\mathrm{R}} + \lambda_{\mathrm{VQ}} \mathcal{L}_{\mathrm{VQ}} + \lambda_{\mathrm{P}} \mathcal{L}_{\mathrm{P}} + \lambda_{\mathrm{G}} \mathcal{L}_{\mathrm{G}} $ Where:

  • $\mathcal{L}_{\mathrm{R}}$ is the pixel-level reconstruction loss (e.g., L1 or L2 loss between the original and reconstructed images), which directly measures how well the decoder reproduces the input image.

  • $\lambda_{\mathrm{VQ}}$ is the weight factor for the vector quantization loss.

  • $\mathcal{L}_{\mathrm{VQ}}$ is the vector quantization loss, which encourages the encoder's output to "commit" to specific codebook entries and ensures that the codebook entries are updated to accurately represent the encoder's outputs. It typically includes terms like the codebook loss and commitment loss.

  • $\lambda_{\mathrm{P}}$ is the weight factor for the perceptual loss.

  • $\mathcal{L}_{\mathrm{P}}$ is the perceptual loss, based on metrics like LPIPS [72]. This loss measures the similarity between intermediate feature representations of the original and reconstructed images from a pretrained network (e.g., VGG), focusing on perceptual quality rather than just pixel-wise accuracy.

  • $\lambda_{\mathrm{G}}$ is the weight factor for the discriminator loss.

  • $\mathcal{L}_{\mathrm{G}}$ is the discriminator loss, typically used in Generative Adversarial Networks (GANs) [16]. A discriminator network tries to distinguish between real images and reconstructed images, while the generator (decoder) tries to fool the discriminator. This adversarial training helps produce more realistic-looking reconstructions.

    The image-text contrastive loss term, $\mathcal{L}_{\mathrm{contra}}$, is essentially the same as used in CLIP [38]. It pulls together the embeddings of matching image-text pairs and pushes apart non-matching pairs in a shared multimodal embedding space, thereby fostering strong semantic alignment between visual and textual representations.
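
Putting the two equations together, the sketch below shows how the total objective could be assembled from loss terms computed elsewhere. Apart from the contrastive weight of 1 stated in the paper, the default weight values are placeholders, not the paper's settings.

```python
def unified_loss(l_rec, l_vq, l_perc, l_gan, l_contra,
                 w_vq=1.0, w_perc=1.0, w_gan=0.1, w_contra=1.0):
    """Unified objective in the paper's notation:

    L_recon = L_R + w_vq * L_VQ + w_perc * L_P + w_gan * L_G
    L       = L_recon + w_contra * L_contra   (the paper sets w_contra = 1)
    """
    l_recon = l_rec + w_vq * l_vq + w_perc * l_perc + w_gan * l_gan
    return l_recon + w_contra * l_contra
```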

4.2.2. Quantization Bottleneck

The paper identifies the root cause of performance issues in unified tokenizers not as loss conflicts, but as the limited representational capacity of discrete token space. This is a critical insight, derived from a comprehensive ablation study illustrated in Figure 3.

The following figure (Figure 3 from the original paper) shows the roadmap from CLIP to UniTok, highlighting the impact of different components on understanding performance:

Figure 3: Roadmap from CLIP to UniTok. Major degradation in understanding performance comes from token factorization and discretization, rather than reconstruction supervision. The proposed multi-codebook quantization and attention projection effectively address this by scaling up the vocabulary size and bottleneck dimension. The VQA score is the average across the VQAv2, GQA, TextVQA, and POPE benchmarks. All tokenizers are trained from scratch on 512M image-text pairs from DataComp.

The analysis breaks down the degradation in visual understanding tasks (measured by VQA scores) as follows:

  • Factorization: Modern VQ-tokenizers often project continuous visual features (e.g., 768-dimensional CLIP embeddings) to a much lower-dimensional space (e.g., 8-dimensional) before code index lookup. This process, termed token factorization, is a severe form of channel compression. The paper empirically found that simply adding two linear projection layers to a CLIP vision encoder (reducing dimensionality from 768-d to 16-d and then back to 768-d), without vector quantization or reconstruction supervision, already caused a significant performance degradation in VQA tasks. This indicates that the act of compressing the feature space itself compromises the expressiveness of tokens.
  • Discretization: Building on factorization, the introduction of vector quantization further degrades performance. Mapping these factorized tokens to a small codebook (typically 4k to 16k entries, much smaller than LLM vocabularies of 200k+ entries) results in considerable information loss. Experiments showed an average VQA accuracy drop of 2.1 points due to this discretization step.
  • Reconstruction Supervision: Finally, integrating reconstruction losses into the training process to form a unified tokenizer is analyzed. While previous literature (and initial observations) suggested loss conflict between VQVAE and CLIP objectives, the paper surprisingly finds that this degradation has negligible impacts on downstream understanding performance after accounting for the effects of factorization and discretization. Moreover, this degradation diminishes with improved quantization methods. This leads to the conclusion that the perceived loss conflict is merely a superficial issue, and the real problem is the limited representational capacity of the discrete token space.

4.2.3. UniTok Enhancements

To overcome the quantization bottleneck and enhance the representational capacity of discrete tokens, UniTok introduces two key architectural innovations: multi-codebook quantization and attention projection.

4.2.3.1. Multi-Codebook Quantization (MCQ)

Multi-codebook quantization (MCQ) is proposed to significantly scale up the vocabulary size and latent code dimension without incurring the optimization problems (like low codebook utilization or 'dead codes') associated with very large monolithic codebooks.

Instead of quantizing a single high-dimensional latent vector directly, MCQ works by partitioning the latent vector into several smaller chunks, each of which is then quantized independently using its own dedicated sub-codebook.

Specifically, a latent vector $f \in \mathbb{R}^d$ (where $d$ is the total dimension) is first evenly split into $n$ smaller chunks $\{f_1, f_2, ..., f_n\}$, with each chunk $f_i$ lying in $\mathbb{R}^{\frac{d}{n}}$. The subsequent quantization process for these chunks is described as: $ \hat{f} = \mathrm{Concat}\left( \mathcal{Q}(Z_1, f_1), \mathcal{Q}(Z_2, f_2), ..., \mathcal{Q}(Z_n, f_n) \right) $ Where:

  • $\hat{f}$ is the final discretized latent vector after multi-codebook quantization. It is formed by concatenating the independently quantized chunks.
  • $\mathrm{Concat}$ denotes the concatenation operation, which joins the quantized chunks back together to form the full quantized vector.
  • $\mathcal{Q}(Z_i, f_i)$ represents the code index lookup operation for the $i$-th chunk. It takes the $i$-th continuous latent chunk $f_i$ and finds its closest codebook entry within the $i$-th sub-codebook $Z_i$. The output of $\mathcal{Q}$ is the selected codebook entry (a vector).
  • $Z_i$ is the $i$-th sub-codebook, which is a distinct, learnable collection of embedding vectors specifically for quantizing the $i$-th chunk $f_i$.
  • $f_i$ is the $i$-th continuous chunk of the original latent vector $f$.

Key advantages of MCQ:

  • Exponentially Scaled Vocabulary: If there are $n$ sub-codebooks, each with $K$ entries, the theoretical effective vocabulary size (number of possible code combinations for a single visual token) becomes $K^n$. For example, with 4 sub-codebooks of 16k entries each, the effective vocabulary size is $(2^{14})^4 = 2^{56}$, vastly larger than any single monolithic codebook.
  • Avoids Optimization Problems: By keeping each individual sub-codebook small (e.g., 4,096 entries), MCQ sidesteps the low codebook utilization and dead codes that plague large monolithic codebooks, which are harder to optimize effectively.
  • Increased Latent Code Dimension: The dimensionality of the discretized latent code $\hat{f}$ also scales with the number of sub-codebooks ($n \times$ the dimension of each $f_i$), further enhancing representational capacity. For instance, with $d/n = 8$ and $n = 8$, the global latent dimension becomes 64-d.

Comparison with Residual Quantization (RQ): MCQ shares the concept of using multiple codes for a token with Residual Quantization (RQ) [18], but their design philosophies differ significantly. RQ follows a coarse-to-fine approach, where subsequent quantization steps quantize the residual error from the previous step. MCQ, conversely, uses a divide-and-conquer strategy, partitioning the original latent space and quantizing each partition independently. The paper claims this distinction gives MCQ unique advantages in high-dimensional latent spaces, where RQ can suffer from increasingly sparse codes, leading to significantly lower quantization loss for MCQ.
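
A minimal PyTorch sketch of this divide-and-conquer quantization is shown below. The class name and the straight-through handling are illustrative assumptions, while the default configuration (eight sub-codebooks of 4,096 entries each, 8-d chunks) mirrors the setup reported in the paper.

```python
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    """Illustrative multi-codebook quantizer: split the latent vector into n
    chunks and quantize each chunk with its own small sub-codebook."""

    def __init__(self, dim=64, num_codebooks=8, codebook_size=4096):
        super().__init__()
        assert dim % num_codebooks == 0
        self.chunk_dim = dim // num_codebooks
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.chunk_dim)
            for _ in range(num_codebooks)
        )

    def forward(self, f):
        """f: (N, dim) continuous latents -> (quantized latents, code indices)."""
        chunks = f.split(self.chunk_dim, dim=-1)         # n tensors of (N, d/n)
        quantized, indices = [], []
        for chunk, codebook in zip(chunks, self.codebooks):
            dists = torch.cdist(chunk, codebook.weight)  # (N, K)
            idx = dists.argmin(dim=-1)
            q = codebook(idx)
            # Straight-through estimator so gradients reach the encoder.
            quantized.append(chunk + (q - chunk).detach())
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```

With these defaults, each spatial position is represented by 8 code indices, and the effective vocabulary is $4096^8$ combinations even though every individual lookup stays small.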

4.2.3.2. Attention Projection

To mitigate the semantic information loss during token factorization (the compression of feature dimensions), UniTok replaces traditional linear projection layers with adapted attention modules.

Existing VQ methods commonly use simple linear or convolutional layers for projection, which were shown to fail to preserve rich semantics during compression. The paper proposes adapting multi-head attention (MHA) modules for this purpose.

The following figure (Figure 6 from the original paper) illustrates the adapted attention blocks for factorization:

Figure 6: Modified attention blocks for factorization. Modules in yellow indicate a change in the number of channels. $C$ and $c$ stand for the channel dimension, $h$ is the number of heads in the multi-head attention module, with $C = h \times c$.

In a standard Multi-Head Attention module, the outputs from different attention heads are concatenated before a final linear projection. In attention projection, this concatenation operation is replaced with average pooling to achieve channel compression.

  • Let $C$ be the input channel dimension and $c$ be the desired output channel dimension after compression.

  • In Multi-Head Attention, if there are $h$ heads, each head typically processes features of dimension $C/h$. After attention is computed, the outputs of the $h$ heads are usually concatenated, resulting in a feature vector of dimension $C$.

  • For attention projection, instead of concatenating the outputs of the $h$ heads (each having dimension $c = C/h$), the head outputs are average pooled across the head dimension, yielding a single vector of dimension $c$. This effectively performs channel compression while allowing the model to leverage the attentional processing of MHA to selectively retain important information during dimension reduction.

    This simple adaptation effectively strengthens the representational power of factorized tokens and stabilizes training, addressing the issue of semantic degradation during dimension reduction.
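
Below is a minimal sketch of a compression block in this spirit: a multi-head self-attention layer whose head outputs are average-pooled rather than concatenated, so the channel dimension shrinks from $C$ to $c$. The exact layer layout is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionProjection(nn.Module):
    """Illustrative compression block: multi-head self-attention whose head
    outputs are average-pooled (instead of concatenated) to shrink C -> c."""

    def __init__(self, in_dim=768, out_dim=16, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = out_dim               # each head outputs the target dim
        self.qkv = nn.Linear(in_dim, 3 * num_heads * out_dim)

    def forward(self, x):
        """x: (B, N, in_dim) token features -> (B, N, out_dim)."""
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each (B, h, N, out_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                # (B, h, N, out_dim)
        return out.mean(dim=1)                        # average pool over heads
```

The channel-expansion block described in the paper would mirror this design in the opposite direction, widening the channel dimension back up on the decoder side.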

4.2.4. Unified MLLM Integration

UniTok is designed to be seamlessly integrated into Multimodal Large Language Models (MLLMs), specifically leveraging the unified framework similar to Liquid [59]. The goal is to allow the MLLM to model discrete vision and language sequences with a universal next-token prediction loss.

Key aspects of integration:

  • Reusing Code Embeddings: Instead of training the visual codebook from scratch within the MLLM, UniTok's pretrained code embeddings are reused. These embeddings are projected into the MLLM token space using a simple Multi-Layer Perceptron (MLP) projector. This transfer learning approach allows the MLLM to immediately benefit from UniTok's rich visual representations.

  • Simplifying MLLM Input: UniTok encodes an image into $H \times W \times K$ codes, where $H \times W$ is the spatial resolution of the visual tokens and $K$ is the number of sub-codebooks (e.g., 8 in UniTok). For MLLM input, this complex representation is simplified by merging every $K$ consecutive codes into a single visual token. This reduction in the number of effective visual tokens makes the input more manageable for the MLLM.

  • Efficient Visual Token Prediction for Generation: When the MLLM needs to perform visual generation (i.e., predicting visual tokens), it maintains efficiency even with multiple codebooks. Each visual token produced by the MLLM autoregressively predicts the next $K$ codes (corresponding to the $K$ sub-codebook entries that form that single visual token). This is achieved using a depth transformer head, a technique implemented in models like RQ-Transformer [18] and VILA-U [61]. This strategy allows the MLLM to generate the detailed multi-codebook representation for each visual token effectively.

    By following this integration strategy, UniTok enables MLLMs to possess native visual generation capabilities while maintaining strong understanding performance, facilitating truly unified multimodal learning.
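
As an illustration of the first two points, the sketch below merges the $K$ pretrained code embeddings of each spatial position and projects them into the LLM embedding space with a small MLP. The module name and MLP depth are assumptions, while the default dimensions echo UniTok's 8-d codes, eight sub-codebooks, and Llama-2-7B's hidden size.

```python
import torch
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    """Illustrative MLP projector: merge the K sub-codebook embeddings of each
    spatial position into a single visual token in the LLM embedding space."""

    def __init__(self, code_dim=8, num_codebooks=8, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(code_dim * num_codebooks, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, code_embs):
        """code_embs: (B, H*W, K, code_dim) pretrained code embeddings."""
        B, N, K, D = code_embs.shape
        merged = code_embs.reshape(B, N, K * D)   # concatenate the K codes
        return self.mlp(merged)                   # (B, H*W, llm_dim) visual tokens
```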

5. Experimental Setup

5.1. Datasets

5.1.1. Tokenizer Training

  • DataComp-1B [9]: A large-scale public dataset consisting of 1.28 billion image-text pairs. It is used for training UniTok from scratch. All images are resized to 256x256 resolution.
  • OpenImages [17]: A dataset specifically used for training a version of UniTok solely with reconstruction supervision, primarily for rFID evaluation and comparison with tokenizers trained on smaller datasets.

5.1.2. MLLM Training

  • Language Data from DCLM [22]: 10 million language-only data samples are used for MLLM pretraining.
  • Internal MidJourney-style Synthetic Data: 30 million synthetic image-text pairs, mimicking the style of MidJourney prompts, are used for MLLM pretraining.
  • Re-captioned Image-Text Pairs from COYO [32] and Laion [41]: 30 million re-captioned image-text pairs are included in the MLLM pretraining mix. COYO and LAION are large-scale datasets often used for training vision-language models.
  • Text-to-Image Data: 1.5 million text-to-image data samples are used for finetuning the MLLM for visual generation.
  • Multimodal Instruction Tuning Data from Mini-Gemini [23]: 1.5 million multimodal instruction tuning data samples are used for finetuning the MLLM to align with instructions across modalities.

5.1.3. Evaluation Datasets

  • ImageNet [38]: Used for evaluating tokenizer performance, specifically reconstruction FID (rFID) and top-1 zero-shot classification accuracy.
  • VQA Benchmarks (for MLLM Understanding):
    • VQAv2 [13]: A large-scale dataset for Visual Question Answering, requiring models to answer natural language questions about images.
    • GQA [14]: A VQA dataset focused on real-world visual reasoning and compositional question answering.
    • TextVQA [42]: A VQA dataset where questions require reading text in images (e.g., signs, labels).
    • POPE [24]: A benchmark for evaluating object hallucination in Large Vision-Language Models.
    • MME [64]: A comprehensive multimodal evaluation benchmark.
    • MM-Vet [70]: A benchmark for evaluating integrated capabilities of Large Multimodal Models.
  • Visual Generation Benchmarks (for MLLM Generation):
    • GenAI-Bench [26]: A benchmark for evaluating text-to-image generation quality using image-to-text generation to assess alignment with prompts.
    • MJHQ-30K [20]: A dataset designed for evaluating the quality of generated images, particularly in terms of FID scores for high-resolution photorealistic images.

5.2. Evaluation Metrics

For every evaluation metric, the conceptual definition, mathematical formula, and symbol explanations are provided below.

5.2.1. Fréchet Inception Distance (FID)

  • Conceptual Definition: FID measures the "distance" between the feature distributions of real images and generated/reconstructed images. It is commonly used to assess the quality of images produced by generative models, capturing both realism and diversity. A lower FID indicates better quality, suggesting that the generated images are more similar to real images in terms of their feature representations.
  • Mathematical Formula: $ \mathrm{FID} = ||\mu_1 - \mu_2||^2_2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
  • Symbol Explanation:
    • $\mu_1$: The mean feature vector of the real images.
    • $\mu_2$: The mean feature vector of the generated/reconstructed images.
    • $\Sigma_1$: The covariance matrix of the feature vectors for real images.
    • $\Sigma_2$: The covariance matrix of the feature vectors for generated/reconstructed images.
    • $||\cdot||^2_2$: The squared Euclidean distance (L2 norm).
    • $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of its diagonal elements).
    • The feature vectors are typically extracted from an intermediate layer of a pre-trained Inception-v3 network.
  • rFID (Reconstruction FID): FID applied to reconstructed images (output of the decoder) compared to the original input images.
  • gFID (Generation FID): FID applied to images generated from scratch by a generative model compared to a dataset of real images.
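
A minimal NumPy/SciPy sketch of the FID computation from precomputed Inception features; the feature extraction step itself is omitted.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)

    # Matrix square root of the product of covariances.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```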

5.2.2. Top-1 Zero-Shot Accuracy

  • Conceptual Definition: Top-1 zero-shot accuracy measures the percentage of images for which the model correctly predicts the class label, where the class labels were not seen during training. For CLIP-like models, this involves computing the similarity between an image embedding and a set of text embeddings representing candidate class labels (e.g., "a photo of a [class name]"). The class with the highest similarity is chosen as the prediction.
  • Mathematical Formula: $ \text{Top-1 Accuracy} = \frac{\text{Number of correct top-1 predictions}}{\text{Total number of samples}} \times 100\% $
  • Symbol Explanation:
    • Number of correct top-1 predictions: The count of instances where the model's highest-scoring predicted class (the "top-1" prediction) matches the true label.
    • Total number of samples: The total number of instances in the evaluation dataset.

5.2.3. ImageNet Zero-Shot Classification Accuracy

This is a specific application of Top-1 Zero-Shot Accuracy where the evaluation is performed on the ImageNet dataset.

5.2.4. VQA Benchmarks (VQAv2, GQA, TextVQA, POPE, MME, MM-Vet)

These benchmarks evaluate a model's multimodal understanding capabilities, specifically Visual Question Answering. While each has specific nuances, they generally measure the accuracy of answers generated by the MLLM in response to questions about images. The specific accuracy calculation can vary (e.g., standard accuracy, soft accuracy, F1 score) depending on the dataset guidelines.

  • VQAv2 [13]: Evaluates general-purpose VQA. Metrics usually include accuracy where answers are compared against human annotations, often using a "soft accuracy" to account for synonymy.
  • GQA [14]: Focuses on compositional visual reasoning and diverse question types. Metrics typically involve accuracy for various question categories.
  • TextVQA [42]: Specifically tests the ability to read and understand text present in images to answer questions. Accuracy is a common metric.
  • POPE [24]: POPE (Polling-based Object Probing Evaluation) is a benchmark designed to evaluate object hallucination in large vision-language models. It measures how frequently a model asserts the presence of objects that are not actually in the image. Metrics often involve accuracy, precision, recall, and F1-score on these object-presence questions.
  • MME [64]: MME (Multi-Modality Evaluation) is a comprehensive benchmark assessing various multimodal perception and cognition abilities. It provides scores across multiple dimensions (e.g., Perception scores mentioned in the paper).
  • MM-Vet [70]: MM-Vet evaluates the integrated capabilities of large multimodal models across 16 capability combinations, covering skills such as recognition, knowledge, and reasoning. Its scores reflect a model's overall multimodal proficiency.

5.2.5. Inception Score (IS)

  • Conceptual Definition: Inception Score is a metric used to evaluate the quality of images generated by Generative Adversarial Networks (GANs) or other generative models. It quantifies two aspects: the fidelity (quality) of generated images and their diversity. High IS indicates both high quality (images are clearly recognizable as objects) and high diversity (the model generates a wide variety of images).
  • Mathematical Formula: $ \mathrm{IS}(G) = \exp \left( E_{x \sim p_g} [ D_{KL}(p(y|x) || p(y)) ] \right) $
  • Symbol Explanation:
    • $G$: The generative model.
    • $x$: A generated image sample from the model $G$.
    • $p_g$: The distribution of generated images.
    • $p(y|x)$: The conditional class probability distribution for a generated image $x$, typically obtained from a pre-trained Inception-v3 network. It indicates how well the image is classified into one of the ImageNet classes.
    • $p(y)$: The marginal class probability distribution, which is the average of $p(y|x)$ over all generated images. It reflects the diversity of the generated samples.
    • $D_{KL}(P || Q)$: The Kullback-Leibler (KL) divergence between two probability distributions $P$ and $Q$.
    • $E_{x \sim p_g}[\cdot]$: The expectation over generated images. A large IS means that the generated images are highly discriminative (low entropy for $p(y|x)$) and diverse (high entropy for $p(y)$, leading to a large KL divergence between $p(y|x)$ and $p(y)$).
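
A minimal NumPy sketch of the Inception Score computed from class probabilities $p(y|x)$ of generated images; in practice the score is usually averaged over several splits, which is omitted here.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from class probabilities p(y|x) of generated images, shape (N, C).

    IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ), where p(y) is the marginal.
    """
    p_y = probs.mean(axis=0, keepdims=True)                          # (1, C)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```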

5.2.6. Precision and Recall (in Image Generation)

  • Conceptual Definition: In image generation, Precision and Recall (often calculated in a way similar to their use in information retrieval or object detection) can be used to evaluate the quality and coverage of generated images relative to a real dataset.
    • Precision measures how many of the generated images are "real-like" or fall within the distribution of real images. High precision indicates high quality (fewer "bad" or out-of-distribution generations).
    • Recall measures how well the generated distribution covers the real data distribution. High recall indicates high diversity (the generator can produce all kinds of "real" images).
  • Mathematical Formula: While there isn't a single universal formula for Precision and Recall in image generation that parallels classification (True Positives, False Positives, etc.), they are often derived from similarities in feature space, similar to FID. For instance, some methods calculate precision and recall based on the fraction of generated samples that have a nearest neighbor among real samples (precision) and vice-versa (recall) in a feature embedding space. The exact formula is often implementation-specific but aims to capture these concepts.
  • Symbol Explanation: Specific symbols are not provided in the paper for these metrics, but generally, Precision signifies how realistic the generated samples are, and Recall signifies how much of the target data distribution the generated samples cover.

5.2.7. GenEval Benchmarks (Single Obj., Two Obj., Counting, Colors, Position, Color Attri., Overall)

  • Conceptual Definition: GenEval evaluates text-to-image generation models on their ability to correctly interpret and render various compositional attributes and relationships specified in text prompts. Each sub-metric measures the accuracy of rendering a specific aspect.
  • Symbol Explanation:
    • Single Obj.: Accuracy in generating a single specified object.
    • Two Obj.: Accuracy in generating two specified objects.
    • Counting: Accuracy in generating the correct number of objects.
    • Colors: Accuracy in rendering specified colors for objects.
    • Position: Accuracy in placing objects in specified spatial positions.
    • Color Attri.: Accuracy in assigning specific color attributes to objects.
    • Overall: Average accuracy across all GenEval categories.

5.2.8. GenAI-Bench Benchmarks (Attribute, Scene, Spatial, Action, Part, Count, Differ, Compare, Negate, Universal, Overall)

  • Conceptual Definition: GenAI-Bench provides a fine-grained evaluation of text-to-image generation by using image-to-text generation to verify whether generated images align with various aspects of complex text prompts. It tests the model's understanding of attributes, scenes, spatial relationships, actions, object parts, counts, differences, comparisons, negations, and universal quantifiers.
  • Symbol Explanation:
    • Attribute: Assesses if objects have the correct attributes (e.g., "red car").
    • Scene: Evaluates the correctness of the overall scene description (e.g., "a beach at sunset").
    • Spatial: Checks if objects are in the correct relative positions (e.g., "a cat on the mat").
    • Action: Verifies if subjects are performing the specified actions (e.g., "a dog jumping").
    • Part: Ensures that objects include specified parts (e.g., "a bird with spread wings").
    • Count: Assesses the correct number of objects generated.
    • Differ: Evaluates the ability to generate images where elements are distinct or different as specified.
    • Compare: Checks if comparative relationships are correctly rendered (e.g., "taller than").
    • Negate: Tests understanding of negative conditions (e.g., "no tree").
    • Universal: Evaluates adherence to universal quantifiers (e.g., "all birds are blue").
    • Overall: The average performance across all categories.

5.3. Baselines

The paper compares UniTok against a wide array of existing models across different categories:

5.3.1. VQVAE Models (for Reconstruction & Generation)

These models are primarily designed for efficient image representation and generation.

  • VQ-GAN [8]: A foundational VQVAE variant using perceptual and adversarial losses.
  • RQ-VAE [18]: Utilizes residual quantization for improved discrete representation.
  • VAR [48]: Visual Autoregressive Modeling with next-scale prediction.
  • UniTok*: A version of UniTok trained only with reconstruction supervision on OpenImages, serving as a controlled baseline.

5.3.2. CLIP Models (for Understanding)

These models are specialized for image-text alignment and understanding tasks.

  • CLIP [38]: The seminal Contrastive Language-Image Pre-training model.
  • SigLIP [71]: An improved CLIP variant using sigmoid loss.
  • ViTamin [5]: A scalable vision model designed in the vision-language era.

5.3.3. Unified Models (Early Attempts at Unification)

These models try to combine aspects of both generation and understanding.

  • TokenFlow† [37]: A unified image tokenizer for multimodal understanding and generation, with † indicating CLIP weight initialization.
  • VILA-U† [61]: A unified foundation model integrating visual understanding and generation, again with † indicating CLIP weight initialization.
  • UniTok (random init): UniTok trained from scratch without CLIP weights.
  • UniTok† (CLIP init): UniTok initialized with pretrained CLIP weights.

5.3.4. Diffusion Models (for Class-Conditional Image Generation)

These are powerful generative models that typically operate in continuous latent spaces.

  • SD-VAE [40]: The VAE component of Stable Diffusion.
  • DiT [35]: Diffusion Transformer for image generation.
  • LightningDiT [63]: A faster version of DiT.
  • VAVAE [63]: VAE with semantic regularization for diffusion models.

5.3.5. Masked Generative Models (for Image Generation)

These models generate images by masking and predicting visual tokens.

  • LFQ [67]: Lookup-Free Quantization for visual generation.
  • MAGVIT-v2 [67]: Masked Generative Video Transformer.
  • TiTok-L [69]: Transformer-based Image Tokenizer for reconstruction and generation.
  • MaskGIT [3]: Masked Generative Image Transformer.

5.3.6. Autoregressive Models (for Class-Conditional Image Generation)

These models generate visual tokens sequentially.

  • VQGAN†: The VQGAN tokenizer used within the LlamaGen framework.
  • LlamaGen‡ [43]: A powerful autoregressive model for scalable image generation.
  • UniTok (Ours): UniTok integrated into the LlamaGen framework.

5.3.7. Unified MLLMs (for Understanding and Generation)

These are large models that aim to perform both VQA and image generation.

  • Emu [45]: Generative Pretraining in Multimodality.
  • LaVIT [15]: Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization.
  • DreamLLM [7]: Synergistic Multimodal Comprehension and Creation.
  • Unified-IO 2 [29]: Scaling Autoregressive Multimodal Models.
  • Janus [58]: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation.
  • CM3Leon [68]: An autoregressive multimodal model.
  • LWM [27]: World Model on Million-Length Video and Language.
  • Show-o [62]: One Single Transformer to Unify Multimodal Understanding and Generation.
  • Chameleon [46]: Mixed-modal early-fusion foundation models.
  • Liquid [59]: Language models are scalable and unified multi-modal generators.
  • VILA-U [61]: A unified foundation model integrating visual understanding and generation.

5.4. Implementation Details

5.4.1. Tokenizer Setup

  • Architecture: UniTok uses a hybrid architecture, ViTamin-L/16 [5], to combine the strengths of CNNs (common in VQVAE) and ViTs (preferred in CLIP).
  • Quantization Configuration: UniTok is configured with eight sub-codebooks, each containing 4,096 code entries. The latent dimension for each sub-codebook is set to 8-d, resulting in a global latent dimension of 64-d (8 × 8 = 64); a minimal sketch of this chunk-and-quantize scheme follows this list.
  • Discriminator: The discriminator used for the adversarial (GAN) loss is initialized from a pre-trained DINOv2-S [34] model.
  • Training: The tokenizer is trained for one epoch on DataComp-1B [9] (1.28 billion image-text pairs). Images are resized to 256x256.
  • Batch Size & Learning Rate: A global batch size of 16k is used, with a learning rate of 1e-3 for the tokenizer and 2e-4 for the discriminator.
  • Initialization Settings: Two settings are evaluated:
    1. Random Initialization (default): UniTok trained from scratch.
    2. Pretrained CLIP Weight Initialization: UniTok initialized with weights from a pre-trained CLIP model.
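
The following is a minimal, hypothetical PyTorch sketch of the multi-codebook quantization layout described above: each 64-d latent per spatial token is split into eight 8-d chunks, and each chunk is matched against its own 4,096-entry sub-codebook. Class and variable names are illustrative, not the released implementation, and commitment losses are omitted.

```python
import torch
import torch.nn as nn


class MultiCodebookQuantizer(nn.Module):
    """Chunk-wise ("divide-and-conquer") quantization with independent sub-codebooks."""

    def __init__(self, num_codebooks=8, codebook_size=4096, chunk_dim=8):
        super().__init__()
        self.chunk_dim = chunk_dim
        # One independent embedding table (sub-codebook) per 8-d chunk.
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, chunk_dim) for _ in range(num_codebooks)]
        )

    def forward(self, z):
        # z: (batch, num_tokens, num_codebooks * chunk_dim), e.g. (B, 256, 64).
        chunks = z.split(self.chunk_dim, dim=-1)
        quantized, indices = [], []
        for chunk, codebook in zip(chunks, self.codebooks):
            # Nearest-neighbour lookup inside this sub-codebook only.
            dists = torch.cdist(chunk, codebook.weight[None].expand(chunk.size(0), -1, -1))
            idx = dists.argmin(dim=-1)                     # (B, 256)
            q = codebook(idx)                              # (B, 256, 8)
            # Straight-through estimator so gradients reach the encoder.
            quantized.append(chunk + (q - chunk).detach())
            indices.append(idx)
        # Re-assemble the chunk codes into the full 64-d quantized latent.
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)


quantizer = MultiCodebookQuantizer()
z = torch.randn(2, 256, 64)                                # 16x16 spatial tokens, 64-d latents
z_q, codes = quantizer(z)
print(z_q.shape, codes.shape)                              # (2, 256, 64) and (2, 256, 8)
```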

5.4.2. MLLM Setup

  • Base Model: The unified MLLM is instantiated using the Llama-2-7B base model [52].
  • Framework: It follows the Liquid [59] framework, which applies a single next-token prediction loss to both vision and language tokens (a minimal sketch of this unified objective follows this list).
  • Pretraining: The model is first pretrained on a mix of multimodal data:
    • 10M language data from DCLM [22].
    • 30M internal MidJourney-style synthetic data.
    • 30M re-captioned image-text pairs from COYO [32] and Laion [41].
    • Learning rate for pretraining: 5e-5.
  • Finetuning: Subsequently, the model is finetuned on:
    • 1.5M text-to-image data.
    • 1.5M multimodal instruction tuning data introduced in Mini-Gemini [23].
    • Learning rate for finetuning: 2e-5.
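
As a reference point for this setup, below is a minimal, hypothetical sketch of a single next-token objective shared by both modalities: discrete visual codes are shifted past the text vocabulary and trained with the same causal cross-entropy as text. The vocabulary sizes, the toy backbone, and the caption-then-image serialization are illustrative assumptions, not the Liquid or UniTok implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32000                      # e.g. the Llama-2 text vocabulary size
IMAGE_VOCAB = 4096                      # illustrative number of visual code entries
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

# Toy stand-in for the LLM backbone; the real setup uses a causal transformer
# (Llama-2-7B in the paper) so position t only attends to tokens <= t.
toy_lm = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))

text_ids = torch.randint(0, TEXT_VOCAB, (16,))                    # caption tokens
image_ids = torch.randint(0, IMAGE_VOCAB, (256,)) + TEXT_VOCAB    # visual codes, shifted
seq = torch.cat([text_ids, image_ids])                            # caption first, image after

logits = toy_lm(seq[:-1])                      # predict token t+1 from the prefix
loss = F.cross_entropy(logits, seq[1:])        # one loss for text and image positions alike
print(loss.item())
```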

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Tokenizer Comparison

The paper evaluates UniTok on ImageNet using reconstruction FID (rFID) for generation quality and top-1 zero-shot accuracy for understanding.
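Top-1 zero-shot accuracy follows the standard CLIP-style protocol: class names are turned into text prompts, embedded by the text tower, and each image is assigned the class whose prompt embedding is most similar. The sketch below is a hypothetical stand-in (the encoder handles, prompt template, and feature pooling are assumptions), not UniTok's released evaluation code.

```python
import torch


@torch.no_grad()
def zero_shot_top1(image_encoder, text_encoder, tokenize, images, class_names):
    # Embed one prompt per class with the text tower and L2-normalize.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = text_encoder(tokenize(prompts))                  # (num_classes, d)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Embed images with the visual tower (for UniTok, presumably features pooled
    # after quantization) and L2-normalize.
    img_emb = image_encoder(images)                             # (batch, d)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Top-1 prediction is the class whose prompt has the highest cosine similarity.
    return (img_emb @ text_emb.T).argmax(dim=-1)


# Toy stand-ins with random features, only to show the expected shapes.
d = 64
preds = zero_shot_top1(
    image_encoder=lambda x: torch.randn(x.shape[0], d),
    text_encoder=lambda t: torch.randn(t.shape[0], d),
    tokenize=lambda ps: torch.zeros(len(ps), 77, dtype=torch.long),
    images=torch.randn(4, 3, 256, 256),
    class_names=["tabby cat", "golden retriever"],
)
print(preds)
```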

The following are the results from Table 1 of the original paper:

Method #Tokens rFID ↓ Accuracy
VQVAE Model
VQ-GAN* [8] 256 4.98 -
RQ-VAE [18] 256 1.30 -
VAR* [48] 680 0.90 -
UniTok* 256 0.33 -
CLIP Model
CLIP [38] 256 - 76.2
SigLIP [71] 256 - 80.5
ViTamin [5] 256 - 81.2
Unified Model
TokenFlow† [37] 680 1.37 -
VILA-U† [61] 256 1.80 73.3
UniTok 256 0.41 70.8
UniTok† 256 0.38 78.6

Analysis:

  • Reconstruction Quality (rFID): UniTok demonstrates exceptional reconstruction quality. The UniTok* model (trained only with reconstruction supervision on OpenImages) achieves an rFID of 0.33, significantly outperforming other VQVAE models such as VQ-GAN (4.98), RQ-VAE (1.30), and VAR (0.90). Even the full UniTok† model (with CLIP initialization) achieves an rFID of 0.38, lower (better) than the continuous SD-VAE tokenizer used by Stable Diffusion (0.61, from Table 2). This highlights the strength of multi-codebook quantization in capturing fine-grained details.

  • Understanding Performance (Accuracy): For zero-shot classification accuracy, the randomly initialized UniTok achieves 70.8%, which is suboptimal compared to specialized CLIP models (CLIP 76.2%, SigLIP 80.5%, ViTamin 81.2%). This is attributed to the relatively short training schedule (one epoch on 1.28B samples) being insufficient for full CLIP convergence. However, UniTok† (initialized with pretrained CLIP weights) boosts accuracy to 78.6%, narrowing the gap significantly with CLIP baselines and outperforming other unified tokenizers like VILA-U† (73.3%).

  • Unified Performance: The UniTok† model effectively balances both objectives, achieving excellent reconstruction (0.38 rFID) together with strong understanding (78.6% zero-shot accuracy), a combination not matched by other unified tokenizers: TokenFlow† reports only rFID (1.37), while VILA-U† is worse on both axes, with a much higher rFID (1.80) and lower accuracy (73.3%).

    The following figure (Figure 4 from the original paper) shows qualitative results on image reconstruction:


Figure 4: Qualitative results on image reconstruction at a resolution of 256×256.

Qualitative Analysis of Reconstruction: Figure 4 visually confirms UniTok's high reconstruction fidelity. The reconstructed images (right column) are remarkably close to the original images (left column), demonstrating the model's ability to preserve fine details and overall image quality at 256x256 resolution.

6.1.2. Class-Conditional Image Generation

The paper investigates whether UniTok's semantic supervision leads to a better-structured latent space for autoregressive generation, similar to findings in diffusion models. This is evaluated using the LlamaGen framework for class-conditional image generation.

The following are the results from Table 2 of the original paper:

Tokenizer rFID Generator #Params. [w/o CFG] gFID↓ IS↑ Pre. Rec. [w/ CFG] gFID↓ IS↑ Pre. Rec.
Diffusion Models
SD-VAE [40] 0.61 DiT [35] 675M 9.62 121.5 0.67 0.67 2.27 278.2 0.83 0.57
VAVAE [63] 0.28 LightningDiT [63] 675M 2.17 205.6 0.77 0.65 1.35 295.3 0.79 0.65
Masked Generative Models
LFQ [67] 0.9 MAGVIT-v2 [67] 307M 3.07 213.1 - - 1.91 324.3 - -
TiTok-L [69] 2.21 MaskGIT [3] 177M 3.15 173.0 - - 2.77 199.8 - -
Autoregressive Models
VQGAN† 2.19 LlamaGen‡ [43] 1.4B 14.65 86.3 0.63 0.68 2.34 253.9 0.81 0.60
UniTok (Ours) 0.41 LlamaGen [43] 1.4B 2.51 216.7 0.82 0.57 2.77 227.5 0.81 0.57

Analysis:

  • CFG-free Generation: UniTok significantly reduces gFID without classifier-free guidance (CFG) from 14.65 (for VQGAN† with LlamaGen) to a remarkable 2.51. This indicates that UniTok learns a much better-structured and semantically coherent latent space than VQGAN. The Inception Score (IS) also jumps dramatically from 86.3 to 216.7, confirming higher quality and diversity in cfg-free generation.
  • CFG-enabled Generation: With CFG, UniTok's gFID is 2.77, slightly higher than VQGAN†'s 2.34, and its IS (227.5) is also below VQGAN†'s (253.9). The key observation is how little UniTok changes between the two settings (2.51 vs. 2.77 gFID) compared with VQGAN† (14.65 vs. 2.34), suggesting that UniTok gains little from CFG because its latent space is already well-organized; a sketch of how CFG mixes conditional and unconditional predictions follows this list.
  • Comparison to Diffusion Models: UniTok's cfg-free gFID of 2.51 is competitive even with strong diffusion pipelines such as DiT with SD-VAE (9.62) and approaches LightningDiT with VAVAE (2.17), which is especially impressive given that LlamaGen is a plain autoregressive model. This strongly supports the idea that semantic supervision (from CLIP) gives UniTok a highly structured latent space suitable for high-quality generation.
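
To clarify what "cfg-free" means here, below is a minimal, hypothetical sketch of classifier-free guidance for autoregressive token sampling (not the LlamaGen code): the guided logits mix a class-conditional pass and an unconditional ("null class") pass, and cfg-free generation simply skips the unconditional pass (equivalently, scale = 1.0). The model signature and ids are stand-ins.

```python
import torch


def guided_logits(logits_cond, logits_uncond, scale):
    # scale > 1 sharpens class adherence; scale == 1 recovers the plain conditional
    # distribution, i.e. the cfg-free setting reported in Table 2.
    return logits_uncond + scale * (logits_cond - logits_uncond)


def sample_next_token(model, tokens, class_id, null_id, scale=1.0):
    logits_cond = model(tokens, class_id)          # conditioned on the target class
    if scale == 1.0:                               # cfg-free: no extra forward pass needed
        return torch.multinomial(logits_cond.softmax(-1), 1)
    logits_uncond = model(tokens, null_id)         # conditioned on a learned "null" class
    mixed = guided_logits(logits_cond, logits_uncond, scale)
    return torch.multinomial(mixed.softmax(-1), 1)


# Toy stand-in model returning random logits over a 4,096-entry visual vocabulary.
toy_model = lambda tokens, cls: torch.randn(4096)
prefix = torch.zeros(0, dtype=torch.long)
print(sample_next_token(toy_model, prefix, class_id=7, null_id=1000, scale=2.0))
```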

6.1.3. Unified Understanding and Generation (MLLM Integration)

6.1.3.1. Understanding Performance

The paper evaluates UniTok's multimodal understanding performance by integrating it into an MLLM and testing on various VQA benchmarks.

The following are the results from Table 3 of the original paper:

Method LLM Token Type Res. VQAv2 GQA TextVQA POPE MME MM-Vet
Emu [45] Llama-13B Continuous 224 52.0 - - - - -
LaVIT [15] Llama-7B Continuous 224 66.0 46.8 - - - -
DreamLLM [7] Vicuna-7B Continuous 224 72.9 - 41.8 - - 26.6
Unified-IO 2 [29] 6.8B from scratch Continuous 384 79.4 - - 87.7 - -
Janus [58] DeepSeek-1.3B Continuous 384 77.3 59.1 - 87.0 1338 34.3
CM3Leon [68] 7B from scratch Discrete 256 47.6 - - - - -
LWM [27] Llama-2-7B Discrete 256 55.8 44.8 18.8 75.2 - -
Show-o [62] Phi-1.5-1.3B Discrete 256 59.3 48.7 - 73.8 948 -
Chameleon [46] 34B from scratch Discrete 512 69.6 - - - - -
Liquid [59] Gemma-7B Discrete 512 71.3 58.4 42.4 81.1 1119 -
VILA-U [61] Llama-2-7B Discrete 256 75.3 58.3 48.3 83.9 1336 27.7
UniTok Llama-2-7B Discrete 256 76.8 61.1 51.6 83.2 1448 33.9

Analysis:

  • Outperforming Discrete Tokenizers: UniTok (using Llama-2-7B) significantly outperforms other MLLMs that use discrete visual tokenizers. For example, it surpasses Chameleon (34B from scratch) by 7.2% on VQAv2 (76.8 vs 69.6) and Liquid (Gemma-7B) by 5.5% on VQAv2 (76.8 vs 71.3), 2.7% on GQA (61.1 vs 58.4), and 9.2% on TextVQA (51.6 vs 42.4). It also achieves the highest MME score (1448) among all models in the table, and a strong MM-Vet score of 33.9, only slightly below Janus (34.3), which relies on continuous visual tokens.
  • Narrowing the Gap with Continuous Tokenizers: UniTok largely narrows the performance gap with MLLMs that use continuous visual tokenizers (e.g., Unified-IO 2 at 79.4 VQAv2, Janus at 77.3 VQAv2). This is a crucial achievement, demonstrating that UniTok enables discrete tokens to achieve understanding performance competitive with continuous features, without compromising generation.
  • Strong Candidate for Unified MLLMs: These results confirm that UniTok is a highly effective unified visual tokenizer for multimodal models, capable of robust visual comprehension.

6.1.3.2. Generation Performance

The paper further demonstrates UniTok's visual generation capabilities when integrated into the unified MLLM.

The following are the results from Table 4 of the original paper:

Method Type #Data Single Obj. Two Obj. Counting Colors Position Color Attri. Overall↑
SD v2.1 [39] Diffusion 2000M 0.98 0.51 0.44 0.85 0.07 0.17 0.50
SD-XL [36] Diffusion 2000M 0.98 0.74 0.39 0.85 0.15 0.23 0.55
DALL-E 3 [2] Diffusion 0.96 0.87 0.47 0.83 0.43 0.45 0.67
Show-o [62] Discrete Diff. 36M 0.95 0.52 0.49 0.82 0.11 0.28 0.53
2.0B 0.98 0.80 0.66 0.84 0.31 0.50 0.68
LWM [27] Autoregressive 0.93 0.41 0.46 0.79 0.09 0.15 0.47
Janus [58] Autoregressive 0.97 0.68 0.30 0.84 0.46 0.42 0.61
Liquid [59] Autoregressive 30M 0.98 0.73 0.32 0.76 0.17 0.37 0.55
UniTok Autoregressive 30M 0.99 0.71 0.36 0.79 0.26 0.45 0.59

Analysis on GenEval:

  • UniTok achieves an Overall score of 0.59 on the GenEval benchmark, outperforming Liquid (0.55) while using the same amount of text-to-image training data (30M). This highlights the superior tokenizer design of UniTok.

  • It performs very well on Single Obj. (0.99), Two Obj. (0.71), and Color Attri. (0.45) tasks, indicating strong compositional generation capabilities.

  • While DALL-E 3 (0.67) and Show-o (2.0B, 0.68) still hold higher Overall scores, these models are either diffusion-based or trained on significantly more data (billions vs 30M). UniTok shows competitive performance against these domain experts given its autoregressive nature and data scale.

    The following are the results from Table 5 of the original paper:

    Method Type Res. FID↓
    SD-XL [36] Diffusion 1024 9.55
    PixArt [6] Diffusion 1024 6.14
    Playground [20] Diffusion 1024 4.48
    Liquid [59] Autoregressive 512 5.47
    Janus [58] Autoregressive 384 10.10
    LWM [27] Autoregressive 256 17.77
    Show-o [62] Discrete Diff. 256 15.18
    VILA-U [61] Autoregressive 256 12.81
    UniTok Autoregressive 256 7.46

Analysis on MJHQ-30K:

  • UniTok achieves an FID of 7.46 on MJHQ-30K at 256x256 resolution. This is a substantial improvement over other autoregressive models at the same resolution, such as LWM (17.77), Show-o (15.18), and VILA-U (12.81).

  • Even though MJHQ-30K favors higher resolutions for lower FID (e.g., Playground at 4.48 FID for 1024 resolution), UniTok's score at 256×256 is highly competitive: it surpasses SD-XL (9.55 at 1024 resolution) and Janus (10.10 at 384 resolution) despite their higher output resolutions, and comes reasonably close to Liquid (5.47 at 512 resolution). This demonstrates UniTok's ability to generate high-quality, detail-rich images even at a smaller resolution.

    The following figure (Figure 5 from the original paper) shows examples of images generated by the unified MLLM:


Figure 5: Images generated at a resolution of 256×256 with our unified MLLM.

Qualitative Analysis of Generation: Figure 5 showcases UniTok's ability to synthesize photorealistic and visually appealing images from text prompts. The examples demonstrate that the model can understand a wide range of concepts (e.g., "Vincent van Gogh painting style", "bitcoin") and combine them creatively, producing diverse and high-quality outputs.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Impact of Supervision Types

This ablation study investigates the impact of different combinations of contrastive and reconstruction losses on downstream generation and understanding performance. All tokenizers are vector-quantized.

The following are the results from Table 6 of the original paper:

Supervision rFID↓ gFID↓ VQAv2 GQA SciQA TextVQA POPE MME
Contrastive - - 68.95 56.89 65.64 49.89 82.34 1373
Reconstruction 0.82 3.59 56.33 47.53 63.26 43.65 77.09 902
Recon. + Contra. 0.72 3.26 69.14 56.06 65.25 49.22 81.42 1333

Analysis:

  • Reconstruction-oriented tokenizer limitations: The Reconstruction only tokenizer shows good rFID (0.82) and reasonable gFID (3.59), but significantly lags in understanding performance (e.g., VQAv2 56.33, MME 902) compared to models with contrastive supervision. This confirms the limitations of traditional VQVAE models for multimodal understanding.
  • Contrastive-only tokenizer: A Contrastive only tokenizer performs well in understanding (e.g., VQAv2 68.95, MME 1373) but is not designed for generation, hence no rFID or gFID reported.
  • Combined Supervision (Recon. + Contra.): The jointly trained tokenizer (Recon. + Contra.) achieves an rFID of 0.72 and gFID of 3.26, which are both better than the Reconstruction only tokenizer. Crucially, its understanding performance (e.g., VQAv2 69.14, MME 1333) is on par with, or even slightly better than, the Contrastive only tokenizer. This strongly supports the paper's central argument: reconstruction and contrastive training objectives do not inherently conflict when the discrete feature space is enhanced. Instead, combining them leads to a tokenizer that performs well in both domains.
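
To make the "Recon. + Contra." setting concrete, here is a schematic, hypothetical sketch in which a CLIP-style symmetric contrastive term is simply added to the reconstruction objective. The loss weight and temperature are placeholders, and the paper's full objective (including GAN and VQ terms) is defined in its method section, not here.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired image/text embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature     # (B, B); matched pairs on the diagonal
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


def joint_supervision(recon_loss, img_emb, txt_emb, w_contra=1.0):
    # "Recon. + Contra." in Table 6: the two signals are summed rather than traded off.
    return recon_loss + w_contra * clip_contrastive_loss(img_emb, txt_emb)


recon = torch.tensor(0.1)
print(joint_supervision(recon, torch.randn(8, 64), torch.randn(8, 64)))
```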

6.2.2. MCQ vs. RQ

This study provides a direct comparison between Multi-Codebook Quantization (MCQ) and Residual Quantization (RQ), ensuring fair comparisons by implementing RQ on the same codebase and training settings as UniTok.

The following are the results from Table 7 of the original paper:

Method Code Shape Code Dim. rFID↓ Accuracy
RQ 16×16×8 64 3.46 58.8
MCQ 16×16×8 64 0.55 63.7

Analysis:

  • MCQ demonstrates clear superiority over RQ. With the same code shape (16x16 spatial tokens, 8 sub-codebooks) and global latent code dimension (64-d), MCQ achieves a much lower rFID (0.55 vs 3.46 for RQ) and higher classification accuracy (63.7% vs 58.8% for RQ).
  • This confirms the discussion in Section 3.3, which posited that MCQ's divide-and-conquer strategy is more effective than RQ's coarse-to-fine approach, especially in high-dimensional latent spaces, leading to better quantization and overall performance; the sketch below illustrates the coarse-to-fine scheme for contrast.
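
For contrast with the chunk-wise MCQ sketch given after the tokenizer setup, here is a minimal, hypothetical residual quantization loop in the same style: the full 64-d latent is approximated coarse-to-fine, with each stage quantizing the residual left by the previous one. Straight-through gradients and losses are omitted for brevity, and names are illustrative.

```python
import torch
import torch.nn as nn


def residual_quantize(z, codebooks):
    # z: (B, N, 64); codebooks: list of nn.Embedding(codebook_size, 64), one per stage.
    residual = z
    quantized = torch.zeros_like(z)
    indices = []
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook.weight[None].expand(z.size(0), -1, -1))
        idx = dists.argmin(dim=-1)
        q = codebook(idx)
        quantized = quantized + q          # accumulate a coarse-to-fine approximation
        residual = residual - q            # the next stage only sees what is left over
        indices.append(idx)
    return quantized, torch.stack(indices, dim=-1)


codebooks = [nn.Embedding(4096, 64) for _ in range(8)]       # 8 stages over the full 64-d latent
z_q, codes = residual_quantize(torch.randn(2, 256, 64), codebooks)
print(z_q.shape, codes.shape)                                # (2, 256, 64) and (2, 256, 8)
```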

6.2.3. Number of Sub-Codebooks

This ablation explores how the number of sub-codebooks impacts tokenizer performance.

The following are the results from Table 8 of the original paper:

Codebook Configuration 1×16384 2×8192 4×4096 8×2048
Effective Vocabulary 2^14 2^26 2^48 2^88
rFID ↓ 1.50 0.98 0.54 0.33
Accuracy 41.0% 43.9% 44.7% 46.1%

Analysis:

  • The table shows that as the number of sub-codebooks increases, with the total number of stored code entries (number of sub-codebooks × sub-codebook size) held constant at 16,384, the effective vocabulary size grows exponentially, and both rFID (reconstruction quality) and Accuracy (understanding performance) consistently improve; the scaling relation is summarized after this list.
  • For example, moving from 1 sub-codebook (1×16384, effective vocabulary size 2^14) to 8 sub-codebooks (8×2048, effective vocabulary size 2^88) reduces rFID from 1.50 to 0.33 and increases Accuracy from 41.0% to 46.1%.
  • This provides strong evidence that MCQ generally benefits vector-quantized models, regardless of whether the primary objective is reconstruction or contrastive learning. It confirms that increasing the representational capacity of the discrete token space is key to improving performance.
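
To make the scaling explicit: with \(m\) sub-codebooks of size \(K\) each, chunk codes combine independently, so the effective vocabulary is \(K^{m}\) while only \(m \times K\) code embeddings are actually stored. All four configurations in Table 8 store \(m \times K = 16{,}384\) entries, yet their effective vocabularies are \(16384^{1} = 2^{14}\), \(8192^{2} = 2^{26}\), \(4096^{4} = 2^{48}\), and \(2048^{8} = 2^{88}\), respectively.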

6.2.4. CLIP Weight Initialization

This study investigates the impact of CLIP weight initialization on visual understanding performance within the LLaVA framework.

The following are the results from Table 9 of the original paper:

Tokenizer VQAv2 GQA TextVQA POPE MME
UniTok† 69.9 56.2 49.3 81.2 1331
UniTok 72.4 58.2 51.6 82.4 1392

Analysis:

  • Surprisingly, UniTok trained from scratch (randomly initialized) outperforms UniTok† (initialized with pretrained CLIP weights) across all VQA benchmarks (VQAv2, GQA, TextVQA, POPE, MME). For instance, VQAv2 accuracy is 72.4% for UniTok vs 69.9% for UniTok†, and MME score is 1392 vs 1331.
  • This is counter-intuitive because UniTok† achieved better zero-shot classification accuracy on ImageNet (78.6% vs 70.8% for UniTok, as shown in Table 1).
  • The finding suggests that downstream VQA performance may not be strongly correlated with ImageNet classification accuracy. More importantly, CLIP weight initialization might act as a negative prior for unified tokenizers. This is because the unified visual feature space (designed for both generation and understanding) could be fundamentally different from the CLIP feature space (designed primarily for understanding). Starting from CLIP weights might bias the learning towards CLIP's original feature space, hindering the tokenizer from learning a more optimal, truly unified representation. This implies that training UniTok from scratch allows it to learn a representation space that is intrinsically better suited for unified multimodal tasks.

6.2.5. GenAI-Bench Results (Appendix)

The paper provides additional text-to-image generation results on the GenAI-Bench in the appendix.

The following are the results from Table 10 of the original paper:

Method Type #Training Images Attribute↑ Scene↑ Relation (Spatial)↑ Relation (Action)↑ Relation (Part)↑ Overall↑
SD v2.1 [39] Diffusion 2000M 0.80 0.79 0.76 0.77 0.80 0.78
SD-XL [36] Diffusion 2000M 0.84 0.84 0.82 0.83 0.89 0.83
Midjourney v6 Diffusion - 0.88 0.87 0.87 0.87 0.91 0.87
DALL-E 3 [2] Diffusion - 0.91 0.90 0.92 0.89 0.91 0.90
Show-o [62] Discrete Diff. 36M 0.72 0.72 0.70 0.70 0.75 0.70
LWM [27] Autoregressive - 0.63 0.62 0.65 0.63 0.70 0.63
VILA-U [61] Autoregressive 15M 0.78 0.78 0.77 0.78 0.79 0.76
Liquid [59] Autoregressive 30M 0.84 0.86 0.81 0.83 0.91 0.83
UniTok Autoregressive 30M 0.85 0.87 0.86 0.86 0.89 0.85

The following are the results from Table 11 of the original paper:

Method Type #Training Images Count↑ Differ↑ Compare↑ Logical (Negate)↑ Logical (Universal)↑ Overall↑
SD v2.1 [39] Diffusion 2000M 0.68 0.70 0.68 0.54 0.64 0.62
SD-XL [36] Diffusion 2000M 0.71 0.73 0.69 0.50 0.66 0.63
Midjourney v6 Diffusion - 0.78 0.78 0.79 0.50 0.76 0.69
DALL-E 3 [2] Diffusion - 0.82 0.78 0.82 0.48 0.80 0.70
Show-o [62] Discrete Diff. 36M 0.70 0.62 0.71 0.51 0.65 0.60
LWM [27] Autoregressive - 0.59 0.58 0.54 0.49 0.52 0.53
VILA-U [61] Autoregressive 15M 0.70 0.71 0.74 0.53 0.66 0.64
Liquid [59] Autoregressive 30M 0.76 0.73 0.74 0.46 0.74 0.65
UniTok Autoregressive 30M 0.76 0.76 0.79 0.46 0.73 0.67

Analysis on GenAI-Bench (Basic and Advanced Prompts):

  • Basic Prompts (Table 10): UniTok achieves an Overall score of 0.85, surpassing Liquid (0.83) using the same 30M training images. It performs exceptionally well on Attribute (0.85), Scene (0.87), Spatial (0.86), Action (0.86), and Part (0.89), demonstrating strong capabilities in generating basic object properties and relationships. It even competes closely with SD-XL (0.83) and Midjourney v6 (0.87) which are diffusion models trained on vast amounts of data.

  • Advanced Prompts (Table 11): For more complex prompts, UniTok achieves an Overall score of 0.67, again outperforming Liquid (0.65) and VILA-U (0.64). It shows strong performance on Count (0.76), Differ (0.76), and Compare (0.79), indicating a good grasp of numerical and comparative constraints. The Logical (Negate) category remains difficult (0.46), as it does for every model in the table, while Universal reaches 0.73. Notably, UniTok's Overall score approaches those of DALL-E 3 (0.70) and Midjourney v6 (0.69), state-of-the-art diffusion models, on these advanced prompts.

    These results consistently show that UniTok delivers superior generation performance across various metrics and prompt complexities, especially when compared to other autoregressive MLLMs, further solidifying its role as an effective unified tokenizer.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces UniTok, a novel unified tokenizer that addresses the long-standing challenge of creating a single visual representation suitable for both generation and understanding tasks within Multimodal Large Language Models (MLLMs). Through rigorous empirical analysis, the authors demonstrate that the perceived loss conflicts between reconstruction and semantic supervision are not inherent, but rather a symptom of the limited representational capacity of discrete token space.

UniTok overcomes this quantization bottleneck by employing two key innovations: multi-codebook quantization (MCQ) and attention projection. MCQ exponentially scales the effective vocabulary size and bottleneck dimension by partitioning latent vectors into independently quantized chunks, avoiding the optimization pitfalls of large monolithic codebooks. Attention projection enhances token factorization by using adapted multi-head attention modules, which preserve richer semantics during dimension compression.

The experimental results showcase UniTok's impressive performance:

  • It achieves a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet, demonstrating high fidelity reconstruction and strong understanding capabilities simultaneously.

  • When integrated into MLLMs, UniTok enables native visual generation without compromising understanding performance, significantly outperforming VQGAN-based baselines on VQA benchmarks (e.g., 5.5% on VQAv2, 9.2% on TextVQA).

  • It fosters a better-structured latent space, leading to a drastic reduction in gFID for cfg-free generation (from 14.6 to 2.5 on ImageNet), indicating improved inherent generation quality.

  • Ablation studies confirm that reconstruction and contrastive learning objectives are compatible when the discrete feature space is sufficiently rich, and MCQ consistently improves both generation and understanding metrics.

    Overall, UniTok provides a powerful and efficient solution for unified visual tokenization, paving the way for more capable and cohesive multimodal models.

7.2. Limitations & Future Work

The authors acknowledge one primary limitation:

  • Limited Training Schedule: Due to computational resource constraints, UniTok was trained for only one epoch. The authors believe that extending the training schedule could further benefit the tokenizer, especially in understanding performance, as CLIP-based semantic representation learning typically requires extensive training to converge fully.

    Based on this, a clear direction for future work is to:

  • Extend Training Duration: Investigate the performance gains and convergence patterns of UniTok when trained for longer durations and with more computational resources. This could potentially push understanding performance even higher, further closing the gap with dedicated CLIP models.

7.3. Personal Insights & Critique

This paper offers several profound insights that could inspire future research in multimodal AI:

  1. Redefining the "Loss Conflict": The most impactful insight is debunking the perceived loss conflict between reconstruction and semantic objectives. By attributing performance issues to limited representational capacity, the paper shifts the research focus from finding delicate weighting schemes for conflicting losses to fundamentally improving the expressiveness of discrete tokens. This is a powerful paradigm shift, suggesting that robust multimodal representation might be achievable without inherent trade-offs. This finding could be generalized to other domains where seemingly conflicting objectives (e.g., efficiency vs. accuracy, local vs. global features) might be resolved by increasing the underlying model's capacity or representation power.

  2. Scalability of Discrete Token Space: The multi-codebook quantization (MCQ) mechanism is a clever solution to a long-standing problem in VQVAEs – how to scale vocabulary size without codebook utilization issues. The exponential increase in effective vocabulary size is a significant advantage. This approach could be applied to other discrete tokenization schemes, even in NLP if context-dependent sub-vocabularies were beneficial. The comparison to Residual Quantization (RQ) also highlights the importance of design philosophy (divide-and-conquer vs. coarse-to-fine) in quantization.

  3. Semantic Regularization for Generative Latent Spaces: The observation that semantic supervision leads to a better-structured latent space for autoregressive generation (evidenced by massive gFID reduction in cfg-free generation) is crucial. This reinforces findings in diffusion models and suggests that semantic guidance during tokenizer training is not just for understanding but fundamentally improves the generative properties of the latent space itself, making it more disentangled and coherent. This principle could inform the design of future generative models, emphasizing the integration of strong semantic priors from the earliest stages of representation learning.

  4. Beyond ImageNet Accuracy for MLLMs: The ablation on CLIP weight initialization provides a critical insight: higher ImageNet zero-shot classification accuracy does not automatically translate to superior downstream VQA performance for unified MLLMs. This suggests that the optimal feature space for a truly unified multimodal model might differ significantly from that learned by a vision-only model like CLIP. This is a valuable warning against overly relying on single-task benchmarks when designing complex multimodal systems. It implies that MLLM training should prioritize learning a unified representation rather than simply trying to integrate pre-trained unimodal components.

Potential Issues/Areas for Improvement:

  • Computational Cost: While MCQ addresses some optimization issues of large monolithic codebooks, the training of UniTok and its integrated MLLM is still highly computationally intensive, especially considering the 1.28B image-text pairs and the Llama-2-7B base model. The mentioned limitation of "one epoch" due to computational resources highlights this. Future work could explore more computationally efficient training strategies or distillation methods.

  • Generalizability of "No Conflict" Claim: While the paper successfully demonstrates that reconstruction and semantic losses do not inherently conflict with UniTok's specific enhancements, it might be too strong to claim this universally. Without these enhancements, the conflict might still manifest. Further theoretical work to formally characterize the conditions under which these objectives align or conflict would be valuable.

  • Complexity of Attention Projection: While attention projection improves performance, multi-head attention modules are inherently more complex and computationally expensive than linear layers. A detailed analysis of the computational overhead and latency introduced by attention projection compared to its performance gains would be beneficial for practical deployment.

  • Explicit Explanation of MLP Projector: The paper states that code embeddings are projected to the MLLM token space with an MLP projector. While standard, a brief discussion on its architecture (e.g., number of layers, activation functions) and how it handles the potential modality gap between vision and language embeddings could add clarity.

    Overall, UniTok represents a significant step forward in unified visual tokenization, offering a robust framework and valuable insights for the continued development of sophisticated multimodal AI systems.
