UniTok: A Unified Tokenizer for Visual Generation and Understanding
TL;DR Summary
UniTok is a unified tokenizer for visual generation and understanding, addressing loss conflicts in combining VQVAE and CLIP. It achieves state-of-the-art performance on ImageNet with 0.38 rFID and 78.6% zero-shot accuracy, and integrates seamlessly into MLLMs, enhancing both generation and understanding.
Abstract
Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from limited representational capacity of discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on ImageNet 256×256 benchmark. GitHub: https://github.com/FoundationVision/UniTok.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
UniTok: A Unified Tokenizer for Visual Generation and Understanding
1.2. Authors
The paper lists several authors, primarily from The University of Hong Kong and ByteDance Inc., with one author also affiliated with Huazhong University of Science and Technology.
- Chuofan Ma
- Yi Jiang
- Junfeng Wu
- Jihan Yang
- Xin Yu
- Zehuan Yuan
- Bingyue Peng
- Xiaojuan Qi
1.3. Journal/Conference
This paper is published as a preprint on arXiv. The exact journal or conference for its final publication is not specified in the provided abstract, but arXiv is a well-respected open-access repository for preprints of scientific papers in various fields, including computer science and artificial intelligence.
1.4. Publication Year
The paper was posted to arXiv on 2025-02-27, indicating a publication year of 2025.
1.5. Abstract
The abstract introduces UniTok, a unified tokenizer designed for both visual generation and understanding tasks. It highlights a common problem in the field: visual models typically use distinct tokenizers, such as VQVAE (for autoregressive generation) and CLIP (for understanding), making unification challenging. While previous attempts to combine VQVAE and CLIP training objectives faced loss conflicts, this paper argues that the root cause is not inherent conflict but rather the limited representational capacity of discrete token space.
UniTok addresses this by introducing a novel multi-codebook quantization mechanism, which effectively scales up the vocabulary size and bottleneck dimension. This innovation leads to state-of-the-art performance: 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Furthermore, UniTok can be seamlessly integrated into Multimodal Large Language Models (MLLMs), enabling native visual generation without compromising understanding capabilities. The paper also demonstrates that UniTok facilitates cfg-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256x256 benchmark.
1.6. Original Source Link
https://arxiv.org/abs/2502.20321 (Publication status: Preprint on arXiv)
1.7. PDF Link
https://arxiv.org/pdf/2502.20321v3.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the reliance of visual generative and understanding models on distinct tokenizers, posing a significant challenge for unifying them within a single framework. For instance, CLIP (Contrastive Language-Image Pre-training) models excel in multimodal understanding due to their high-dimensional, continuous feature spaces, but their continuous nature complicates discrete, autoregressive generative modeling. Conversely, VQVAE (Vector Quantized Variational AutoEncoder) tokenizers generate discrete tokens well-suited for autoregressive generation but often struggle to capture essential semantics for understanding tasks.
This problem is important because the advent of powerful Multimodal Large Language Models (MLLMs) like GPT-4o highlights the immense potential of models with native visual generation capabilities. Such unified models promise precise control in multimodal interactions, facilitating advanced tasks like multi-turn image editing and visual in-context learning. However, the fundamental dilemma of choosing an appropriate visual tokenizer that can equally serve both generation and understanding remains.
Previous attempts to unify tokenizers by combining VQVAE (reconstruction objective) and CLIP (semantic objective) training often resulted in severe loss conflicts and convergence issues, leading to suboptimal performance in both domains. The paper questions this common assumption, proposing that the perceived loss conflict might not be inherent but rather a symptom of another underlying issue.
The paper's entry point is an empirical investigation into why combining VQVAE and CLIP objectives leads to performance degradation. Their innovative idea is to re-evaluate the true bottleneck, which they identify as the limited representational capacity of the discrete token space, rather than an inherent conflict between the training objectives themselves.
2.2. Main Contributions / Findings
The paper makes several primary contributions and reaches key conclusions:
- Identification of the True Bottleneck: The paper empirically demonstrates that the loss conflicts observed in prior unified tokenizer training are not inherent to combining reconstruction and semantic supervision. Instead, the primary bottleneck stems from the limited representational capacity of the discrete token space under token factorization and discretization.
- Introduction of UniTok: The paper proposes UniTok, a unified tokenizer featuring a novel multi-codebook quantization (MCQ) mechanism and attention projection. Multi-codebook quantization effectively scales up the vocabulary size and bottleneck dimension exponentially without the optimization problems of large monolithic codebooks or diminished returns. Attention projection replaces traditional linear layers for token factorization, improving representational power and training stability.
- State-of-the-Art Performance: UniTok achieves new record performance on ImageNet, with 0.38 rFID (reconstruction quality) and 78.6% zero-shot accuracy (understanding capability). This demonstrates its superiority over both domain-specific and other unified tokenizers.
- Seamless MLLM Integration: UniTok can be seamlessly integrated into MLLMs, enabling native visual generation capabilities without compromising understanding performance. It significantly outperforms existing MLLM baselines using VQGAN tokenizers across various VQA benchmarks (e.g., 5.5% higher on VQAv2, 9.2% on TextVQA, and 339 points on MME compared to Liquid).
- Improved Latent Space for Generation: The semantic supervision incorporated in UniTok leads to a better-structured latent space, significantly improving cfg-free generation performance. It reduces gFID from 14.6 to 2.5 on ImageNet 256x256 under the LlamaGen framework, indicating more coherent and higher-quality image generation without classifier-free guidance.
- Ablation Study Insights: Ablation studies confirm that reconstruction and contrastive learning objectives do not inherently conflict when the discrete feature space is appropriately enhanced. MCQ consistently benefits both reconstruction FID and classification accuracy, independent of training objectives.

These findings solve the problem of developing a single visual tokenizer that effectively supports both high-fidelity visual generation and robust multimodal understanding, paving the way for more powerful and unified MLLMs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the paper, a reader should be familiar with several core concepts in deep learning for vision and language.
3.1.1. Tokenization
Tokenization in the context of machine learning refers to the process of breaking down raw input data (like text or images) into smaller, discrete units called tokens. These tokens are then fed into a model for processing.
- For text, tokens are typically words, subwords, or characters.
- For images, visual tokenization involves converting raw pixel data into a sequence of discrete representations. This abstraction simplifies the image, making it more manageable for models, especially those designed for sequential data like Transformers.
3.1.2. Variational AutoEncoders (VAEs) and Vector Quantized VAEs (VQVAEs)
Variational AutoEncoders (VAEs) are a type of generative model that learn a compressed, latent representation of input data. A VAE consists of two main parts:
- Encoder: Maps the input data (e.g., an image) to a latent space (a lower-dimensional continuous representation).
- Decoder: Reconstructs the input data from a sample drawn from the latent space.

VAEs are trained to minimize a reconstruction loss (how well the output matches the input) and a Kullback-Leibler (KL) divergence term that forces the latent space to approximate a simple distribution (e.g., a standard Gaussian).
Vector Quantized VAEs (VQVAEs) build upon VAEs by introducing a discrete latent space. Instead of mapping inputs to a continuous vector, the encoder's output is quantized by mapping it to the closest vector in a learnable "codebook" or "dictionary" of discrete embedding vectors. This quantization process makes the latent space discrete, which is crucial for autoregressive generation (where the model predicts the next token in a sequence).
- Codebook: A finite set of embedding vectors (codes) that represent different visual patterns.
- Vector Quantization (VQ): The process of finding the closest codebook entry to an encoded continuous vector and replacing the continuous vector with that discrete code (a minimal sketch follows this list).
- VQ Loss ($\mathcal{L}_{\mathrm{VQ}}$): A component of the VQVAE loss function that encourages the encoder outputs to commit to certain codebook entries and ensures the codebook entries are updated to be close to the encoder outputs.
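The following is a minimal sketch of the nearest-neighbor code lookup at the heart of VQ, with a straight-through gradient estimator; the codebook size and dimensions are illustrative, not those of any specific model.

```python
# Vector quantization sketch: replace each continuous vector with its nearest codebook entry.
import torch

def vector_quantize(z, codebook):
    """z: (N, D) encoder outputs; codebook: (K, D) learnable code embeddings."""
    dist = torch.cdist(z, codebook)        # (N, K) pairwise distances to all codes
    indices = dist.argmin(dim=1)           # nearest code index per vector
    z_q = codebook[indices]                # (N, D) quantized vectors
    # Straight-through estimator: gradients bypass the non-differentiable argmin.
    z_q = z + (z_q - z).detach()
    return z_q, indices

codebook = torch.randn(4096, 8)            # K = 4096 codes of dimension 8 (illustrative)
z = torch.randn(256, 8)                    # 256 continuous tokens from an encoder
z_q, idx = vector_quantize(z, codebook)
```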
3.1.3. CLIP (Contrastive Language-Image Pre-training)
CLIP is a multimodal model designed to learn visual representations from natural language supervision. It consists of two separate encoders:
- Image Encoder: Processes images into image embeddings.
- Text Encoder: Processes text into text embeddings.

During training, CLIP learns to align these image and text embeddings in a shared multimodal embedding space. It does this using a contrastive learning objective: given a batch of image-text pairs, the model is trained to maximize the similarity between correctly matched image-text pairs and minimize the similarity between incorrectly matched pairs.
- Image-text contrastive loss ($\mathcal{L}_{\mathrm{contra}}$): The loss function used in CLIP to pull positive (matching) image-text pairs closer and push negative (non-matching) pairs apart in the embedding space. This allows CLIP to learn robust, semantically rich visual representations that are highly aligned with natural language descriptions, making it excellent for multimodal understanding tasks like zero-shot classification.
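A minimal sketch of the symmetric image-text contrastive (InfoNCE-style) loss used in CLIP-style training follows; the embedding dimension and temperature are illustrative assumptions.

```python
# Symmetric image-text contrastive loss sketch (CLIP-style).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of B matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))           # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```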
3.1.4. Multimodal Large Language Models (MLLMs)
Multimodal Large Language Models (MLLMs) are Large Language Models (LLMs) that have been extended to process and understand multiple modalities of data, typically text and images (but sometimes also audio or video). They aim to unify language understanding and generation with visual understanding and generation. A key challenge in building MLLMs is how to represent visual information in a way that is compatible with the discrete token space usually employed by LLMs.
3.1.5. Autoregressive Generation
Autoregressive generation is a method of generating sequences (like text or image tokens) one step at a time, where each new element is conditioned on the previously generated elements. In the context of VQVAE models for image generation, after an image is tokenized into a sequence of discrete visual tokens, an autoregressive model (e.g., a Transformer) can be trained to predict the next visual token in the sequence given the preceding ones. This allows for controlled and sequential image synthesis.
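To make the sampling loop concrete, here is a minimal sketch of autoregressive visual-token generation; `model` is a hypothetical stand-in for any Transformer returning next-token logits of shape (batch, sequence, vocabulary).

```python
# Autoregressive sampling sketch: generate tokens one at a time,
# each conditioned on everything generated so far.
import torch

@torch.no_grad()
def sample_tokens(model, prefix, num_tokens, temperature=1.0):
    tokens = list(prefix)                                # e.g. a class or text condition
    for _ in range(num_tokens):
        logits = model(torch.tensor([tokens]))[0, -1]    # logits for the next position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, 1).item()  # sample the next visual token
        tokens.append(next_token)
    return tokens[len(prefix):]                          # the generated visual tokens
```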
3.1.6. Transformer Architecture and Multi-Head Attention
The Transformer architecture is a neural network model primarily known for its use in natural language processing (NLP) but now widely adopted in vision. Its core component is the self-attention mechanism, specifically Multi-Head Attention (MHA).
The Attention mechanism allows the model to weigh the importance of different parts of the input sequence when processing each element.
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
- $d_k$ is the dimension of the keys (and queries), used for scaling to prevent very large dot products that push the softmax into regions with tiny gradients.
- $\mathrm{softmax}$ normalizes the scores to create a probability distribution.

Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions. It does this by running several attention mechanisms in parallel and then concatenating their outputs, which are then linearly transformed. This provides the model with a richer understanding of the relationships within the data.
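A minimal NumPy sketch of the scaled dot-product attention defined above (single head, no learned projections; shapes are illustrative):

```python
# Scaled dot-product attention sketch.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K: (T, d_k); V: (T, d_v). Returns (T, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot products
    weights = softmax(scores, axis=-1)        # attention distribution per query
    return weights @ V

Q = K = V = np.random.randn(16, 64)
out = attention(Q, K, V)                      # (16, 64)
```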
3.1.7. FID (Fréchet Inception Distance) and rFID / gFID
Fréchet Inception Distance (FID) is a metric used to assess the quality of images generated by generative models. It measures the "distance" between the feature distributions of real images and generated images.
$
\mathrm{FID} = ||\mu_1 - \mu_2||^2_2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2})
$
Where:
- $\mu_1$ and $\Sigma_1$ are the mean and covariance matrix of the feature vectors for real images, typically extracted from an Inception-v3 model.
- $\mu_2$ and $\Sigma_2$ are the mean and covariance matrix of the feature vectors for generated images.
- $||\cdot||^2_2$ denotes the squared Euclidean distance.
- $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.

A lower FID score indicates higher quality and diversity of the generated images, suggesting they are more similar to real images. rFID (reconstruction FID) specifically measures the quality of reconstructed images (the output of the VQVAE decoder) compared to the original images. gFID (generation FID) measures the quality of images generated by an autoregressive model using the VQVAE tokens, compared to real images.
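A minimal sketch of computing the FID formula above from precomputed Inception features (the feature extraction itself is omitted):

```python
# FID sketch from precomputed feature arrays.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """real_feats, gen_feats: (N, D) arrays of Inception features."""
    mu1, mu2 = real_feats.mean(0), gen_feats.mean(0)
    sigma1 = np.cov(real_feats, rowvar=False)
    sigma2 = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):               # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2.0 * covmean))

score = fid(np.random.randn(1000, 2048), np.random.randn(1000, 2048))
```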
3.1.8. Zero-shot Accuracy
Zero-shot accuracy refers to a model's ability to classify images into categories it has not explicitly seen during training, based solely on textual descriptions of those categories. In CLIP, for example, zero-shot classification works by computing the similarity between an image's embedding and the embeddings of candidate class names (e.g., "a photo of a cat", "a photo of a dog"). The class with the highest similarity score is chosen as the prediction. High zero-shot accuracy implies that the model has learned robust and generalizable semantic representations.
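A minimal sketch of the zero-shot protocol described above; `image_encoder` and `text_encoder` are hypothetical stand-ins for a CLIP-style model's two towers.

```python
# Zero-shot classification sketch: pick the class whose text prompt is most
# similar to the image embedding.
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    prompts = [f"a photo of a {name}" for name in class_names]
    img = F.normalize(image_encoder(image), dim=-1)       # (1, D) image embedding
    txt = F.normalize(text_encoder(prompts), dim=-1)      # (C, D) prompt embeddings
    sims = img @ txt.t()                                   # cosine similarities
    return class_names[sims.argmax().item()]
```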
3.1.9. Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) is a technique used in generative models (especially diffusion models and sometimes autoregressive models) to improve the quality and coherence of generated samples by increasing the influence of the conditioning signal (e.g., a text prompt). It involves training a single model with and without conditioning. During inference, samples are generated by interpolating between the unconditioned (or weakly conditioned) and conditioned score estimates.
cfg-free generation refers to generation without using CFG, relying solely on the intrinsic conditioning signal. If a model performs well cfg-free, it implies its latent space is inherently well-structured and aligns strongly with the conditioning.
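A minimal sketch of CFG applied to next-token logits in an autoregressive generator, assuming conditional and unconditional logits are both available; cfg-free sampling corresponds to using the conditional logits directly.

```python
# Classifier-free guidance sketch: mix conditional and unconditional predictions.
import torch

def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Both inputs: (V,) logits over the visual vocabulary."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

cond, uncond = torch.randn(16384), torch.randn(16384)
guided = cfg_logits(cond, uncond, guidance_scale=2.0)
probs = torch.softmax(guided, dim=-1)   # sample the next token from these probabilities
```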
3.2. Previous Works
The paper contextualizes its work by discussing prior research in image tokenization for generation, understanding, and unified vision-language models.
3.2.1. Image Tokenization for Generation
- VQVAE [53]: Pioneering work introducing discrete latent spaces by mapping continuous tokens to nearest neighbors in a learnable codebook. This was foundational for autoregressive generation.
- VQGAN [8]: Improved VQVAE by incorporating perceptual loss (based on the LPIPS metric [72]) and discriminator loss (from GANs [16]) to enhance reconstruction quality and generate more realistic images.
- ViT-VQGAN [65]: Applied the Transformer architecture to the VQGAN framework, leveraging the Transformer's scalability.
- Recent Quantization Methods: Efforts like residual quantization (RQ) [18] and lookup-free quantization (LFQ) [67] aim to develop better ways to discretize latent features, often by expanding the effective codebook size or improving codebook utilization. RQ uses multiple quantization steps to progressively refine the representation, while LFQ seeks to avoid explicit codebook lookups.
3.2.2. Image Tokenization for Understanding
- CLIP [38]: A widely adopted vision tokenizer for MLLMs, pretrained to align images and text using contrastive learning. While excellent for understanding, its continuous feature space poses challenges for autoregressive generation.
- DINOv2 [34]: A self-supervised learning model that produces robust visual features advantageous for region-level tasks. These models also typically output continuous features.
- Discretizing CLIP tokens [10] or VQVAE encoders [27, 62]: Some works have explored methods to make CLIP tokens discrete or to use VQVAE encoders for understanding. However, these often substantially impair understanding performance, highlighting the difficulty of maintaining semantic richness in discrete tokens.
3.2.3. Unified Vision-Language Models (MLLMs)
- Continuous Visual Tokenizers + Diffusion Models [7, 11, 44]: One line of MLLMs uses continuous visual tokenizers (like CLIP features) for image encoding and then relies on pretrained diffusion models for image synthesis. This approach increases model complexity and disconnects the visual sampling process from the MLLM itself.
- VQVAE-based Discrete Tokens [46, 56, 62, 61, 59]: Another stream, to which UniTok belongs, adopts VQVAE models to encode images into discrete tokens. These discrete tokens can then be modeled using the same cross-entropy loss applied to text tokens, facilitating a unified approach within LLMs. However, these models typically suffer from degraded visual comprehension capabilities because a reconstruction-oriented VQVAE does not naturally align with the LLM token space for understanding. Notable examples include Liquid [59] and VILA-U [61].
3.3. Technological Evolution
The evolution of visual tokenization has largely followed two parallel paths:
- Generative Path: Started with VAEs and progressed to VQVAEs (VQVAE, VQGAN, ViT-VQGAN) to enable discrete representations suitable for sequential generation models. These focused heavily on reconstruction fidelity.
- Understanding Path: Developed contrastive learning methods like CLIP or self-supervised learning methods like DINOv2 to learn rich semantic representations aligned with language, primarily for classification, retrieval, and VQA. These focused on semantic richness.

The current frontier, motivated by MLLMs, is to merge these paths. Early attempts to combine VQVAE and CLIP often led to loss conflicts or suboptimal performance, suggesting a fundamental trade-off. This paper fits into the latest stage of this evolution, where researchers are moving beyond the assumption of inherent conflict, instead probing the underlying limitations of the discrete token space itself.
3.4. Differentiation Analysis
Compared to the main methods in related work, UniTok introduces several core innovations:
- Addressing the "Loss Conflict" Misconception: Unlike prior works that attribute performance degradation in unified training to loss conflicts between reconstruction and semantic objectives, UniTok's analysis re-frames this as a problem of limited representational capacity within the discrete token space. This shift in perspective is crucial, as it suggests the solution lies in enhancing the tokenizer's expressive power rather than finding a delicate balance between conflicting losses.
- Novel Multi-Codebook Quantization (MCQ): While residual quantization (RQ) also uses multiple codes, MCQ fundamentally differs by employing a divide-and-conquer strategy (splitting the latent vector into chunks for independent sub-codebooks) rather than RQ's coarse-to-fine approach. This allows MCQ to exponentially scale vocabulary size and bottleneck dimension more effectively, especially in high-dimensional latent spaces, overcoming codebook utilization issues and diminishing gains seen with large monolithic codebooks.
- Attention Projection for Factorization: UniTok replaces conventional linear or convolutional projection layers with adapted attention modules for token factorization. This design choice specifically addresses the loss of rich semantics during channel compression, a key issue identified in the quantization bottleneck analysis, and improves training stability and understanding performance.
- Superior Unified Performance: UniTok achieves state-of-the-art results on both reconstruction quality (rFID) and zero-shot classification accuracy simultaneously, outperforming specialized VQVAE models in reconstruction and rivaling CLIP models in understanding. It also significantly surpasses other unified MLLMs that use discrete tokenizers on VQA and generation benchmarks.
- Enhanced Latent Space Structure: UniTok demonstrates that semantic supervision (from CLIP) can lead to a better-structured latent space for autoregressive generation, enabling cfg-free generation with much higher quality (gFID reduction from 14.6 to 2.5), a property previously observed in diffusion models but now extended to autoregressive tokenizers.

In essence, UniTok innovates by enhancing the discrete visual tokenizer's core design to handle increased representational demands, enabling it to bridge the gap between generation and understanding effectively, a feat that previous unified approaches struggled with due to perceived fundamental conflicts or suboptimal quantization strategies.
4. Methodology
4.1. Principles
The core idea behind UniTok is to construct a unified visual tokenizer that can effectively serve both high-fidelity visual generation and robust multimodal understanding. The theoretical basis and intuition are built upon two key insights:
- Reconstruction and semantic supervision do not inherently conflict. The paper challenges the common assumption that combining VQVAE's pixel-level reconstruction objective with CLIP's high-level semantic alignment objective leads to an unavoidable trade-off or loss conflict. Instead, it posits that the observed performance degradation in unified training stems from a more fundamental issue.
- The primary bottleneck is the limited representational capacity of the discrete token space. Traditional discrete VQ-tokenizers use relatively small codebooks and latent dimensions. When continuous visual features are mapped into such a constrained discrete space, especially after token factorization (reducing dimensionality for code lookup), significant information loss occurs. This quantization bottleneck prevents the discrete tokens from simultaneously capturing fine-grained details for reconstruction and rich semantics for understanding.

To address this, UniTok employs two main principles:

- Exponentially Scale Representational Capacity: By introducing multi-codebook quantization (MCQ), UniTok aims to dramatically increase the effective vocabulary size and bottleneck dimension of the discrete latent space without the optimization challenges associated with single, massive codebooks. This allows the discrete tokens to retain more information.
- Preserve Rich Semantics During Compression: Through attention projection, UniTok seeks to create a more effective token factorization mechanism that better preserves critical semantic information when compressing feature dimensions, thereby enhancing the quality of the discrete tokens for downstream understanding tasks.

By adhering to these principles, UniTok aims to create discrete visual tokens that are sufficiently expressive to satisfy the distinct demands of both generation (precise detail) and understanding (high-level semantics) within a single unified framework.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Unified Supervision
UniTok is trained with a combined objective that integrates both reconstruction and semantic supervisions. This is crucial for enabling the tokenizer to capture both low-level visual details and high-level semantics.
The total loss, $\mathcal{L}$, is a weighted sum of a VQVAE-based reconstruction loss and an image-text contrastive loss:
$
\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \lambda_{\mathrm{contra}} \mathcal{L}_{\mathrm{contra}} .
$
Where:
- $\mathcal{L}$ is the total loss function that the model minimizes during training.
- $\mathcal{L}_{\mathrm{recon}}$ is the VQVAE-based reconstruction loss component, designed to preserve low-level visual information and ensure high-fidelity image reconstruction.
- $\lambda_{\mathrm{contra}}$ is a weighting factor for the image-text contrastive loss. In this paper, it is simply set to 1, meaning both loss components contribute equally.
- $\mathcal{L}_{\mathrm{contra}}$ is the image-text contrastive loss component, designed to enhance the high-level semantic understanding capabilities of the visual features by aligning them with text.

The VQVAE-based reconstruction loss, $\mathcal{L}_{\mathrm{recon}}$, itself comprises several terms:
$
\mathcal{L}_{\mathrm{recon}} = \mathcal{L}_{\mathrm{R}} + \lambda_{\mathrm{VQ}} \mathcal{L}_{\mathrm{VQ}} + \lambda_{\mathrm{P}} \mathcal{L}_{\mathrm{P}} + \lambda_{\mathrm{G}} \mathcal{L}_{\mathrm{G}} ,
$
Where:
- $\mathcal{L}_{\mathrm{R}}$ is the pixel-level reconstruction loss (e.g., L1 or L2 loss between the original and reconstructed images), which directly measures how well the decoder reproduces the input image.
- $\lambda_{\mathrm{VQ}}$ is the weight factor for the vector quantization loss.
- $\mathcal{L}_{\mathrm{VQ}}$ is the vector quantization loss, which encourages the encoder's output to "commit" to specific codebook entries and ensures that the codebook entries are updated to accurately represent the encoder's outputs. It typically includes terms like the codebook loss and the commitment loss.
- $\lambda_{\mathrm{P}}$ is the weight factor for the perceptual loss.
- $\mathcal{L}_{\mathrm{P}}$ is the perceptual loss, based on metrics like LPIPS [72]. This loss measures the similarity between intermediate feature representations of the original and reconstructed images from a pretrained network (e.g., VGG), focusing on perceptual quality rather than just pixel-wise accuracy.
- $\lambda_{\mathrm{G}}$ is the weight factor for the discriminator loss.
- $\mathcal{L}_{\mathrm{G}}$ is the discriminator loss, typically used in Generative Adversarial Networks (GANs) [16]. A discriminator network tries to distinguish between real images and reconstructed images, while the generator (decoder) tries to fool the discriminator. This adversarial training helps produce more realistic-looking reconstructions.

The image-text contrastive loss term, $\mathcal{L}_{\mathrm{contra}}$, is essentially the same as used in CLIP [38]. It pulls together the embeddings of matching image-text pairs and pushes apart non-matching pairs in a shared multimodal embedding space, thereby fostering strong semantic alignment between visual and textual representations.
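The following is a minimal sketch of how these terms could be combined into the unified objective; the sub-losses are passed in precomputed, and all weights other than $\lambda_{\mathrm{contra}} = 1$ are illustrative placeholders rather than UniTok's actual settings.

```python
# Unified tokenizer objective sketch: L = L_recon + lambda_contra * L_contra,
# with L_recon = L_R + lam_vq * L_VQ + lam_p * L_P + lam_g * L_G.
import torch
import torch.nn.functional as F

def unified_loss(recon, images, vq_loss, perceptual_loss, gan_loss, contra_loss,
                 lam_vq=1.0, lam_p=1.0, lam_g=0.1, lam_contra=1.0):
    l_r = F.l1_loss(recon, images)              # pixel-level reconstruction term L_R
    l_recon = l_r + lam_vq * vq_loss + lam_p * perceptual_loss + lam_g * gan_loss
    return l_recon + lam_contra * contra_loss   # add the CLIP-style contrastive term

# Dummy tensors stand in for real model outputs.
loss = unified_loss(torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256),
                    vq_loss=torch.tensor(0.10), perceptual_loss=torch.tensor(0.20),
                    gan_loss=torch.tensor(0.05), contra_loss=torch.tensor(1.30))
```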
4.2.2. Quantization Bottleneck
The paper identifies the root cause of performance issues in unified tokenizers not as loss conflicts, but as the limited representational capacity of discrete token space. This is a critical insight, derived from a comprehensive ablation study illustrated in Figure 3.
The following figure (Figure 3 from the original paper) shows the roadmap from CLIP to UniTok, highlighting the impact of different components on understanding performance:
The figure is a schematic illustrating the evolution from CLIP to UniTok. It shows how different factors affect the VQA score: token factorization and discretization cause drops in understanding performance, which are recovered by scaling up the vocabulary size and bottleneck dimension via multi-codebook quantization and attention projection.
Figure 3: Roadmap from CLIP to UniTok. It is observed that major degradation in understanding performance comes from token factorization and discretization, rather than reconstruction supervision. The proposed multi-codebook quantization and attention projection effectively address this by scaling up the vocabulary size and bottleneck dimension. The VQA score is measured using the average score across the VQAv2, GQA, TextVQA, and POPE benchmarks. All tokenizers are trained from scratch on image-text pairs from DataComp.
The analysis breaks down the degradation in visual understanding tasks (measured by VQA scores) as follows:
- Factorization: Modern VQ-tokenizers often project continuous visual features (e.g., 768-dimensional CLIP embeddings) to a much lower-dimensional space (e.g., 8-dimensional) before code index lookup. This process, termed token factorization, is a severe form of channel compression. The paper empirically found that simply adding two linear projection layers to a CLIP vision encoder (reducing dimensionality from 768-d to 16-d and back to 768-d), without vector quantization or reconstruction supervision, already caused a significant performance degradation in VQA tasks (a minimal illustration is sketched after this list). This indicates that the act of compressing the feature space itself compromises the expressiveness of the tokens.
- Discretization: Building on factorization, the introduction of vector quantization further degrades performance. Mapping these factorized tokens to a small codebook (typically 4k to 16k entries, much smaller than LLM vocabularies of 200k+ entries) results in considerable information loss. Experiments showed an average VQA accuracy drop of 2.1 points due to this discretization step.
- Reconstruction Supervision: Finally, integrating reconstruction losses into the training process to form a unified tokenizer is analyzed. While previous literature (and initial observations) suggested a loss conflict between VQVAE and CLIP objectives, the paper surprisingly finds that adding reconstruction supervision has negligible impact on downstream understanding performance once the effects of factorization and discretization are accounted for. Moreover, the residual degradation diminishes with improved quantization methods. This leads to the conclusion that the perceived loss conflict is merely a superficial issue, and the real problem is the limited representational capacity of the discrete token space.
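Below is a minimal sketch of the factorization probe described above: two linear layers compress a frozen vision encoder's 768-d features to a low-dimensional bottleneck and project them back, with no quantization or reconstruction loss involved. The encoder itself is a hypothetical stand-in and is not shown.

```python
# Token factorization probe sketch (768-d -> 16-d -> 768-d).
import torch
import torch.nn as nn

class FactorizationProbe(nn.Module):
    def __init__(self, dim=768, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # channel compression
        self.up = nn.Linear(bottleneck, dim)     # channel expansion

    def forward(self, features):                 # features: (B, N, 768) patch tokens
        return self.up(self.down(features))

probe = FactorizationProbe()
tokens = torch.randn(1, 256, 768)                # e.g. CLIP patch embeddings
restored = probe(tokens)                         # bottlenecked then re-expanded features
```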
4.2.3. UniTok Enhancements
To overcome the quantization bottleneck and enhance the representational capacity of discrete tokens, UniTok introduces two key architectural innovations: multi-codebook quantization and attention projection.
4.2.3.1. Multi-Codebook Quantization (MCQ)
Multi-codebook quantization (MCQ) is proposed to significantly scale up the vocabulary size and latent code dimension without incurring the optimization problems (like low codebook utilization or 'dead codes') associated with very large monolithic codebooks.
Instead of quantizing a single high-dimensional latent vector directly, MCQ works by partitioning the latent vector into several smaller chunks, each of which is then quantized independently using its own dedicated sub-codebook.
Specifically, a latent vector $f \in \mathbb{R}^{C}$ (where $C$ is the total dimension) is first evenly split into $n$ smaller chunks $\{f_1, f_2, \ldots, f_n\}$, each of dimension $C/n$. The subsequent quantization process for these chunks is described as:
$
\hat{f} = \operatorname{Concat}\left( \mathcal{Q}(Z_1, f_1), \mathcal{Q}(Z_2, f_2), \ldots, \mathcal{Q}(Z_n, f_n) \right)
$
Where:
- $\hat{f}$ is the final discretized latent vector after multi-codebook quantization. It is formed by concatenating the independently quantized chunks.
- $\operatorname{Concat}$ denotes the concatenation operation, which joins the quantized chunks back together to form the full quantized vector.
- $\mathcal{Q}(Z_i, f_i)$ represents the code index lookup operation for the $i$-th chunk. It takes the $i$-th continuous latent chunk $f_i$ and finds its closest codebook entry within the $i$-th sub-codebook $Z_i$. The output of $\mathcal{Q}$ is the selected codebook entry (a vector).
- $Z_i$ is the $i$-th sub-codebook, a distinct, learnable collection of embedding vectors dedicated to quantizing the $i$-th chunk $f_i$.
- $f_i$ is the $i$-th continuous chunk of the original latent vector.
Key advantages of MCQ:
- Exponentially Scaled Vocabulary: If there are $n$ sub-codebooks and each sub-codebook has $K$ entries, the theoretical effective vocabulary size (the number of possible code combinations for a single visual token) becomes $K^n$. For example, with 4 sub-codebooks of 16k entries each, the effective vocabulary size is $16384^4 \approx 7.2 \times 10^{16}$, vastly larger than any single monolithic codebook.
- Avoids Optimization Problems: By keeping each individual sub-codebook small (e.g., 4,096 entries), MCQ sidesteps the issues of low codebook utilization and dead codes that plague large monolithic codebooks, which are harder to optimize effectively.
- Increased Latent Code Dimension: The dimensionality of the discretized latent codes also scales proportionally with the number of sub-codebooks $n$, further enhancing representational capacity. For instance, with $n = 8$ sub-codebooks of 8-d codes each, the global latent dimension becomes 64-d.
Comparison with Residual Quantization (RQ):
MCQ shares the concept of using multiple codes for a token with Residual Quantization (RQ) [18], but their design philosophies differ significantly. RQ follows a coarse-to-fine approach, where subsequent quantization steps quantize the residual error from the previous step. MCQ, conversely, uses a divide-and-conquer strategy, partitioning the original latent space and quantizing each partition independently. The paper claims this distinction gives MCQ unique advantages in high-dimensional latent spaces, where RQ can suffer from increasingly sparse codes, leading to significantly lower quantization loss for MCQ.
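The following is a minimal sketch of the divide-and-conquer quantization described above: the latent vector is split into $n$ chunks and each chunk is quantized against its own sub-codebook. The sizes follow the paper's reported configuration (8 sub-codebooks, 4,096 entries, 8-d chunks), but the module itself is an illustration, not UniTok's implementation.

```python
# Multi-codebook quantization sketch.
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    def __init__(self, num_codebooks=8, codebook_size=4096, chunk_dim=8):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, chunk_dim)) for _ in range(num_codebooks)]
        )
        self.chunk_dim = chunk_dim

    def forward(self, f):                              # f: (N, n * chunk_dim) latent vectors
        chunks = f.split(self.chunk_dim, dim=-1)       # divide-and-conquer: n independent chunks
        quantized, indices = [], []
        for chunk, codebook in zip(chunks, self.codebooks):
            dist = torch.cdist(chunk, codebook)        # (N, codebook_size) distances
            idx = dist.argmin(dim=-1)
            q = codebook[idx]
            quantized.append(chunk + (q - chunk).detach())   # straight-through estimator
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)

quantizer = MultiCodebookQuantizer()
f_hat, codes = quantizer(torch.randn(256, 64))         # 256 tokens, 64-d latents -> 8 codes each
```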
4.2.3.2. Attention Projection
To mitigate the semantic information loss during token factorization (the compression of feature dimensions), UniTok replaces traditional linear projection layers with adapted attention modules.
Existing VQ methods commonly use simple linear or convolutional layers for projection, which were shown to fail to preserve rich semantics during compression. The paper proposes adapting multi-head attention (MHA) modules for this purpose.
The following figure (Figure 6 from the original paper) illustrates the adapted attention blocks for factorization:
The figure is a schematic of the modified attention modules used for the channel-compression and channel-expansion blocks; the modules shown in yellow indicate a change in the number of channels.
Figure 6: Modified attention blocks for factorization. Modules in yellow indicate a change in the number of channels. $C$ and $c$ stand for the channel dimensions, and $h$ is the number of heads in the multi-head attention module.
In a standard Multi-Head Attention module, the outputs from different attention heads are concatenated before a final linear projection. In attention projection, this concatenation operation is replaced with average pooling to achieve channel compression.
- Let $C$ be the input channel dimension and $c$ be the desired output channel dimension after compression.
- In Multi-Head Attention, if there are $h$ heads, each head typically processes features of dimension $C/h$. After attention calculation, the outputs of the heads are usually concatenated, resulting in a feature vector of dimension $C$.
- For attention projection, instead of concatenating the outputs of the heads, they are average pooled across the head dimension. If the desired output dimension is $c$, the $h$ attention heads each produce an output vector of dimension $c$, and these vectors are then average pooled to yield a single vector of dimension $c$. This operation effectively performs channel compression while allowing the model to leverage the attentional processing of MHA to selectively focus on important information during dimension reduction.

This simple adaptation effectively strengthens the representational power of factorized tokens and stabilizes training, addressing the issue of semantic degradation during dimension reduction.
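A minimal sketch of the idea behind attention projection follows: run multi-head self-attention and average-pool across heads instead of concatenating them, so the channel dimension shrinks. This illustrates the described mechanism under assumed dimensions; it is not the exact block used in UniTok.

```python
# Attention-based channel compression sketch: average pooling over heads
# replaces the usual concatenation, reducing C to C / h.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads           # output channels after pooling
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x):                          # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))             # (B, heads, N, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.mean(dim=1)                     # average pool over heads -> (B, N, C / h)

proj = AttentionPooling()
compressed = proj(torch.randn(2, 256, 768))        # (2, 256, 64)
```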
4.2.4. Unified MLLM Integration
UniTok is designed to be seamlessly integrated into Multimodal Large Language Models (MLLMs), specifically leveraging the unified framework similar to Liquid [59]. The goal is to allow the MLLM to model discrete vision and language sequences with a universal next-token prediction loss.
Key aspects of integration:
- Reusing Code Embeddings: Instead of training the visual codebook from scratch within the MLLM, UniTok's pretrained code embeddings are reused. These embeddings are projected into the MLLM token space using a simple Multi-Layer Perceptron (MLP) projector. This transfer-learning approach allows the MLLM to immediately benefit from UniTok's rich visual representations.
- Simplifying MLLM Input: UniTok encodes an image into $h \times w \times n$ codes, where $h \times w$ is the spatial resolution of the visual tokens and $n$ is the number of sub-codebooks (8 in UniTok). For MLLM input, this representation is simplified by merging every $n$ consecutive codes into a single visual token. This reduction in the number of effective visual tokens makes the input more manageable for the MLLM.
- Efficient Visual Token Prediction for Generation: When the MLLM needs to perform visual generation (i.e., predicting visual tokens), it remains efficient even with multi-codebooks. Each visual token produced by the MLLM autoregressively predicts its $n$ constituent codes (the sub-codebook entries that form that single visual token). This is achieved using a depth transformer head, a technique implemented in models like RQ-Transformer [18] and VILA-U [61]. This strategy allows the MLLM to generate the detailed multi-codebook representation for each visual token effectively.

By following this integration strategy, UniTok enables MLLMs to possess native visual generation capabilities while maintaining strong understanding performance, facilitating truly unified multimodal learning.
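The following is a minimal sketch of the input-side integration described above: the $n$ code embeddings produced for each spatial position are merged and mapped into the LLM's hidden space with a small MLP projector. The dimensions and the exact projector design are illustrative assumptions.

```python
# Visual token projector sketch: merge n sub-codebook embeddings per position
# and project them into the LLM embedding space.
import torch
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    def __init__(self, num_codebooks=8, chunk_dim=8, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(                     # simple MLP projector
            nn.Linear(num_codebooks * chunk_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, code_embeddings):                # (B, N, n, chunk_dim) from the tokenizer
        B, N, n, d = code_embeddings.shape
        merged = code_embeddings.reshape(B, N, n * d)  # merge n codes into one visual token
        return self.proj(merged)                       # (B, N, llm_dim) tokens for the MLLM

projector = VisualTokenProjector()
llm_tokens = projector(torch.randn(1, 256, 8, 8))      # 256 visual tokens for the LLM
```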
5. Experimental Setup
5.1. Datasets
5.1.1. Tokenizer Training
- DataComp-1B [9]: A large-scale public dataset consisting of 1.28 billion image-text pairs. It is used for training UniTok from scratch. All images are resized to 256x256 resolution.
- OpenImages [17]: A dataset specifically used for training a version of UniTok solely with reconstruction supervision, primarily for rFID evaluation and comparison with tokenizers trained on smaller datasets.
5.1.2. MLLM Training
- Language Data from DCLM [22]: 10 million language-only data samples are used for MLLM pretraining.
- Internal MidJourney-style Synthetic Data: 30 million synthetic image-text pairs, mimicking the style of MidJourney prompts, are used for MLLM pretraining.
- Re-captioned Image-Text Pairs from COYO [32] and Laion [41]: 30 million re-captioned image-text pairs are included in the MLLM pretraining mix. COYO and LAION are large-scale datasets often used for training vision-language models.
- Text-to-Image Data: 1.5 million text-to-image data samples are used for finetuning the MLLM for visual generation.
- Multimodal Instruction Tuning Data from Mini-Gemini [23]: 1.5 million multimodal instruction tuning data samples are used for finetuning the MLLM to align with instructions across modalities.
5.1.3. Evaluation Datasets
- ImageNet [38]: Used for evaluating tokenizer performance, specifically reconstruction FID (rFID) and top-1 zero-shot classification accuracy.
- VQA Benchmarks (for MLLM Understanding):
  - VQAv2 [13]: A large-scale dataset for Visual Question Answering, requiring models to answer natural language questions about images.
  - GQA [14]: A VQA dataset focused on real-world visual reasoning and compositional question answering.
  - TextVQA [42]: A VQA dataset where questions require reading text in images (e.g., signs, labels).
  - POPE [24]: A benchmark for evaluating object hallucination in large vision-language models.
  - MME [64]: A comprehensive multimodal evaluation benchmark.
  - MM-Vet [70]: A benchmark for evaluating integrated capabilities of large multimodal models.
- Visual Generation Benchmarks (for MLLM Generation):
  - GenAI-Bench [26]: A benchmark for evaluating text-to-image generation quality, using image-to-text generation to assess alignment with prompts.
  - MJHQ-30K [20]: A dataset designed for evaluating the quality of generated images, particularly in terms of FID scores for high-resolution photorealistic images.
5.2. Evaluation Metrics
For every evaluation metric, the conceptual definition, mathematical formula, and symbol explanations are provided below.
5.2.1. Fréchet Inception Distance (FID)
- Conceptual Definition: FID measures the "distance" between the feature distributions of real images and generated/reconstructed images. It is commonly used to assess the quality of images produced by generative models, capturing both realism and diversity. A lower FID indicates better quality, suggesting that the generated images are more similar to real images in terms of their feature representations.
- Mathematical Formula: $ \mathrm{FID} = ||\mu_1 - \mu_2||^2_2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
- Symbol Explanation:
  - $\mu_1$: The mean feature vector of the real images.
  - $\mu_2$: The mean feature vector of the generated/reconstructed images.
  - $\Sigma_1$: The covariance matrix of the feature vectors for real images.
  - $\Sigma_2$: The covariance matrix of the feature vectors for generated/reconstructed images.
  - $||\cdot||^2_2$: The squared Euclidean distance (L2 norm).
  - $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of its diagonal elements).
  - The feature vectors are typically extracted from an intermediate layer of a pre-trained Inception-v3 network.

rFID (Reconstruction FID): FID applied to reconstructed images (the output of the decoder) compared to the original input images. gFID (Generation FID): FID applied to images generated from scratch by a generative model, compared to a dataset of real images.
5.2.2. Top-1 Zero-Shot Accuracy
- Conceptual Definition: Top-1 zero-shot accuracy measures the percentage of images for which the model correctly predicts the class label, where the class labels were not seen during training. For CLIP-like models, this involves computing the similarity between an image embedding and a set of text embeddings representing candidate class labels (e.g., "a photo of a [class name]"). The class with the highest similarity is chosen as the prediction.
- Mathematical Formula: $ \text{Top-1 Accuracy} = \frac{\text{Number of correct top-1 predictions}}{\text{Total number of samples}} \times 100\% $
- Symbol Explanation:
  - Number of correct top-1 predictions: The count of instances where the model's highest-scoring predicted class (the "top-1" prediction) matches the true label.
  - Total number of samples: The total number of instances in the evaluation dataset.
5.2.3. ImageNet Zero-Shot Classification Accuracy
This is a specific application of Top-1 Zero-Shot Accuracy where the evaluation is performed on the ImageNet dataset.
5.2.4. VQA Benchmarks (VQAv2, GQA, TextVQA, POPE, MME, MM-Vet)
These benchmarks evaluate a model's multimodal understanding capabilities, specifically Visual Question Answering. While each has specific nuances, they generally measure the accuracy of answers generated by the MLLM in response to questions about images. The specific accuracy calculation can vary (e.g., standard accuracy, soft accuracy, F1 score) depending on the dataset guidelines.
- VQAv2 [13]: Evaluates general-purpose VQA. Metrics usually include accuracy where answers are compared against human annotations, often using a "soft accuracy" to account for synonymy.
- GQA [14]: Focuses on compositional visual reasoning and diverse question types. Metrics typically involve accuracy for various question categories.
- TextVQA [42]: Specifically tests the ability to read and understand text present in images to answer questions. Accuracy is a common metric.
- POPE [24]: POPE (Polling-based Object Probing Evaluation) is a benchmark designed to evaluate object hallucination in large vision-language models. It measures how frequently a model generates responses indicating the presence of objects not actually in the image. Metrics often involve precision, recall, and F1-score for hallucinated objects.
- MME [64]: MME (Multimodal Evaluation) is a comprehensive benchmark assessing various multimodal perception and cognition abilities. It provides scores across multiple dimensions (e.g., the Perception scores mentioned in the paper).
- MM-Vet [70]: MM-Vet evaluates a broad range of integrated capabilities of large multimodal models across 16 categories, such as perception, reasoning, and knowledge. Its scores reflect a model's overall multimodal proficiency.
5.2.5. Inception Score (IS)
- Conceptual Definition: Inception Score is a metric used to evaluate the quality of images generated by Generative Adversarial Networks (GANs) or other generative models. It quantifies two aspects: the fidelity (quality) of generated images and their diversity. A high IS indicates both high quality (images are clearly recognizable as objects) and high diversity (the model generates a wide variety of images).
- Mathematical Formula: $ \mathrm{IS}(G) = \exp \left( E_{x \sim p_g} [ D_{KL}(p(y|x) \,||\, p(y)) ] \right) $
- Symbol Explanation:
  - $G$: The generative model.
  - $x$: A generated image sample from the model $G$.
  - $p_g$: The distribution of generated images.
  - $p(y|x)$: The conditional class probability distribution for a generated image $x$, typically obtained from a pre-trained Inception-v3 network. It indicates how confidently the image is classified into one of the ImageNet classes.
  - $p(y)$: The marginal class probability distribution, which is the average of $p(y|x)$ over all generated images. It reflects the diversity of the generated samples.
  - $D_{KL}(\cdot||\cdot)$: The Kullback-Leibler (KL) divergence between two probability distributions.
  - $E$: The expectation over generated images.

A large IS means that the generated images are highly discriminative (low entropy for $p(y|x)$) and diverse (high entropy for $p(y)$), leading to a large KL divergence between $p(y|x)$ and $p(y)$.
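A minimal sketch of the IS formula above, starting from class probabilities $p(y|x)$ already predicted by an Inception network for each generated image (feature extraction omitted):

```python
# Inception Score sketch from precomputed class probabilities.
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) array of p(y|x) for N generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                        # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                                # exp of the mean KL divergence

probs = np.random.dirichlet(np.ones(1000), size=5000)              # dummy predictions
score = inception_score(probs)
```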
5.2.6. Precision and Recall (in Image Generation)
- Conceptual Definition: In image generation, Precision and Recall (often calculated in a way similar to their use in information retrieval or object detection) can be used to evaluate the quality and coverage of generated images relative to a real dataset.
  - Precision measures how many of the generated images are "real-like" or fall within the distribution of real images. High precision indicates high quality (fewer "bad" or out-of-distribution generations).
  - Recall measures how well the generated distribution covers the real data distribution. High recall indicates high diversity (the generator can produce all kinds of "real" images).
- Mathematical Formula: There isn't a single universal formula for Precision and Recall in image generation that parallels classification (true positives, false positives, etc.); they are often derived from similarities in feature space, similar to FID. For instance, some methods calculate precision and recall based on the fraction of generated samples that have a nearest neighbor among real samples (precision), and vice versa (recall), in a feature embedding space. The exact formula is often implementation-specific but aims to capture these concepts.
- Symbol Explanation: Specific symbols are not provided in the paper for these metrics, but generally, Precision signifies how realistic the generated samples are, and Recall signifies how much of the target data distribution the generated samples cover.
5.2.7. GenEval Benchmarks (Single Obj., Two Obj., Counting, Colors, Position, Color Attri., Overall)
- Conceptual Definition: GenEval evaluates text-to-image generation models on their ability to correctly interpret and render various compositional attributes and relationships specified in text prompts. Each sub-metric measures the accuracy of rendering a specific aspect.
- Symbol Explanation:
  - Single Obj.: Accuracy in generating a single specified object.
  - Two Obj.: Accuracy in generating two specified objects.
  - Counting: Accuracy in generating the correct number of objects.
  - Colors: Accuracy in rendering specified colors for objects.
  - Position: Accuracy in placing objects in specified spatial positions.
  - Color Attri.: Accuracy in assigning specific color attributes to objects.
  - Overall: Average accuracy across all GenEval categories.
5.2.8. GenAI-Bench Benchmarks (Attribute, Scene, Spatial, Action, Part, Count, Differ, Compare, Negate, Universal, Overall)
- Conceptual Definition: GenAI-Bench provides a fine-grained evaluation of text-to-image generation by using image-to-text generation to verify whether generated images align with various aspects of complex text prompts. It tests the model's understanding of attributes, scenes, spatial relationships, actions, object parts, counts, differences, comparisons, negations, and universal quantifiers.
- Symbol Explanation:
  - Attribute: Assesses if objects have the correct attributes (e.g., "red car").
  - Scene: Evaluates the correctness of the overall scene description (e.g., "a beach at sunset").
  - Spatial: Checks if objects are in the correct relative positions (e.g., "a cat on the mat").
  - Action: Verifies if subjects are performing the specified actions (e.g., "a dog jumping").
  - Part: Ensures that objects include specified parts (e.g., "a bird with spread wings").
  - Count: Assesses the correct number of objects generated.
  - Differ: Evaluates the ability to generate images where elements are distinct or different as specified.
  - Compare: Checks if comparative relationships are correctly rendered (e.g., "taller than").
  - Negate: Tests understanding of negative conditions (e.g., "no tree").
  - Universal: Evaluates adherence to universal quantifiers (e.g., "all birds are blue").
  - Overall: The average performance across all categories.
5.3. Baselines
The paper compares UniTok against a wide array of existing models across different categories:
5.3.1. VQVAE Models (for Reconstruction & Generation)
These models are primarily designed for efficient image representation and generation.
- VQ-GAN [8]: A foundational VQVAE variant using perceptual and adversarial losses.
- RQ-VAE [18]: Utilizes residual quantization for improved discrete representation.
- VAR [48]: Visual Autoregressive Modeling with next-scale prediction.
- UniTok*: A version of UniTok trained only with reconstruction supervision on OpenImages, serving as a controlled baseline.
5.3.2. CLIP Models (for Understanding)
These models are specialized for image-text alignment and understanding tasks.
- CLIP [38]: The seminal Contrastive Language-Image Pre-training model.
- SigLIP [71]: An improved CLIP variant using a sigmoid loss.
- ViTamin [5]: A scalable vision model designed for the vision-language era.
5.3.3. Unified Models (Early Attempts at Unification)
These models try to combine aspects of both generation and understanding.
- TokenFlow† [37]: A unified image tokenizer for multimodal understanding and generation, with † indicating CLIP weight initialization.
- VILA-U† [61]: A unified foundation model integrating visual understanding and generation, also with † indicating CLIP weight initialization.
- UniTok (random init): UniTok trained from scratch without CLIP weights.
- UniTok† (CLIP init): UniTok initialized with pretrained CLIP weights.
5.3.4. Diffusion Models (for Class-Conditional Image Generation)
These are powerful generative models that typically operate in continuous latent spaces.
- SD-VAE [40]: The VAE component of Stable Diffusion.
- DiT [35]: Diffusion Transformer for image generation.
- LightningDiT [63]: A faster version of DiT.
- VAVAE [63]: A VAE with semantic regularization for diffusion models.
5.3.5. Masked Generative Models (for Image Generation)
These models generate images by masking and predicting visual tokens.
- LFQ [67]: Lookup-Free Quantization for visual generation.
- MAGVIT-v2 [67]: Masked Generative Video Transformer.
- TiTok-L [69]: Transformer-based Image Tokenizer for reconstruction and generation.
- MaskGIT [3]: Masked Generative Image Transformer.
5.3.6. Autoregressive Models (for Class-Conditional Image Generation)
These models generate visual tokens sequentially.
- VQGAN†: The VQGAN tokenizer used within the LlamaGen framework.
- LlamaGen‡ [43]: A powerful autoregressive model for scalable image generation.
- UniTok (Ours): UniTok integrated into the LlamaGen framework.
5.3.7. Unified MLLMs (for Understanding and Generation)
These are large models that aim to perform both VQA and image generation.
- Emu [45]: Generative Pretraining in Multimodality.
- LaVIT [15]: Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization.
- DreamLLM [7]: Synergistic Multimodal Comprehension and Creation.
- Unified-IO 2 [29]: Scaling Autoregressive Multimodal Models.
- Janus [58]: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation.
- CM3Leon [68]: An autoregressive multimodal model.
- LWM [27]: World Model on Million-Length Video and Language.
- Show-o [62]: One Single Transformer to Unify Multimodal Understanding and Generation.
- Chameleon [46]: Mixed-modal early-fusion foundation models.
- Liquid [59]: Language models are scalable and unified multi-modal generators.
- VILA-U [61]: A unified foundation model integrating visual understanding and generation.
5.4. Implementation Details
5.4.1. Tokenizer Setup
- Architecture: UniTok uses a hybrid architecture, ViTamin-L/16 [5], to combine the strengths of CNNs (common in VQVAE) and ViTs (preferred in CLIP).
- Quantization Configuration: UniTok is configured with eight sub-codebooks, each containing 4,096 code entries. The latent dimension for each sub-codebook is set to 8-d, resulting in a global latent dimension of 64-d ($8 \times 8 = 64$).
- Discriminator: The discriminator component of the VQVAE loss is initialized with a pre-trained DINOv2-S [34] model.
- Training: The tokenizer is trained for one epoch on DataComp-1B [9] (1.28 billion image-text pairs). Images are resized to 256x256.
- Batch Size & Learning Rate: A global batch size of 16k is used. The learning rate is 1e-3 for the tokenizer and 2e-4 for the discriminator.
- Initialization Settings: Two settings are evaluated:
  - Random Initialization (default): UniTok trained from scratch.
  - Pretrained CLIP Weight Initialization: UniTok initialized with weights from a pre-trained CLIP model.
5.4.2. MLLM Setup
- Base Model: The unified MLLM is instantiated using the Llama-2-7B base model [52].
- Framework: It follows the Liquid [59] framework, which uses a next-token prediction loss for both vision and language.
- Pretraining: The model is first pretrained on a mix of multimodal data with a learning rate of 5e-5:
  - 10M language-only samples from DCLM [22].
  - 30M internal MidJourney-style synthetic image-text pairs.
  - 30M re-captioned image-text pairs from COYO [32] and Laion [41].
- Finetuning: Subsequently, the model is finetuned with a learning rate of 2e-5 on:
  - 1.5M text-to-image data samples.
  - 1.5M multimodal instruction tuning data samples introduced in Mini-Gemini [23].
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Tokenizer Comparison
The paper evaluates UniTok on ImageNet using reconstruction FID (rFID) for generation quality and top-1 zero-shot accuracy for understanding.
The following are the results from Table 1 of the original paper:
| Method | #Tokens | rFID ↓ | Accuracy ↑ |
|---|---|---|---|
| VQVAE Model | | | |
| VQ-GAN* [8] | 256 | 4.98 | - |
| RQ-VAE [18] | 256 | 1.30 | - |
| VAR* [48] | 680 | 0.90 | - |
| UniTok* | 256 | 0.33 | - |
| CLIP Model | | | |
| CLIP [38] | 256 | - | 76.2 |
| SigLIP [71] | 256 | - | 80.5 |
| ViTamin [5] | 256 | - | 81.2 |
| Unified Model | | | |
| TokenFlow† [37] | 680 | 1.37 | - |
| VILA-U† [61] | 256 | 1.80 | 73.3 |
| UniTok | 256 | 0.41 | 70.8 |
| UniTok† | 256 | 0.38 | 78.6 |
Analysis:
- Reconstruction Quality (rFID): UniTok demonstrates exceptional reconstruction quality. The UniTok* variant (trained only with reconstruction supervision on OpenImages) achieves an rFID of 0.33, significantly outperforming other VQVAE models like VQ-GAN (4.98), RQ-VAE (1.30), and VAR (0.90). Even the full UniTok† model (with CLIP initialization) achieves an rFID of 0.38, which is lower (better) than continuous VAE tokenizers such as the SD-VAE used by Stable Diffusion (0.61, from Table 2). This highlights the superiority of multi-codebook quantization in capturing fine-grained details.
- Understanding Performance (Accuracy): For zero-shot classification accuracy, the randomly initialized UniTok achieves 70.8%, which is suboptimal compared to specialized CLIP models (CLIP 76.2%, SigLIP 80.5%, ViTamin 81.2%). This is attributed to the relatively short training schedule (one epoch on 1.28B samples) being insufficient for full CLIP convergence. However, UniTok† (initialized with pretrained CLIP weights) boosts accuracy to 78.6%, narrowing the gap with CLIP baselines significantly and outperforming other unified tokenizers like VILA-U† (73.3%).
- Unified Performance: The UniTok† model effectively balances both objectives, achieving both excellent reconstruction (0.38 rFID) and strong understanding (78.6% zero-shot accuracy), a feat not matched by other unified tokenizers. TokenFlow† only reports rFID (1.37), and VILA-U† has a much higher rFID (1.80) alongside lower accuracy.

The following figure (Figure 4 from the original paper) shows qualitative results on image reconstruction:
The figure presents qualitative reconstruction results, with original images on the left and reconstructions on the right, illustrating UniTok's ability to restore image content.
Figure 4: Qualitative results on image reconstruction at a resolution of 256x256.
Qualitative Analysis of Reconstruction: Figure 4 visually confirms UniTok's high reconstruction fidelity. The reconstructed images (right column) are remarkably close to the original images (left column), demonstrating the model's ability to preserve fine details and overall image quality at 256x256 resolution.
6.1.2. Class-Conditional Image Generation
The paper investigates whether UniTok's semantic supervision leads to a better-structured latent space for autoregressive generation, similar to findings in diffusion models. This is evaluated using the LlamaGen framework for class-conditional image generation.
The following are the results from Table 2 of the original paper:
| Tokenizer | rFID | Generator | #Params. | gFID↓ (w/o CFG) | IS↑ (w/o CFG) | Pre. (w/o CFG) | Rec. (w/o CFG) | gFID↓ (w/ CFG) | IS↑ (w/ CFG) | Pre. (w/ CFG) | Rec. (w/ CFG) |
| Diffusion Models | |||||||||||
| SD-VAE [40] | 0.61 | DiT [35] | 675M | 9.62 | 121.5 | 0.67 | 0.67 | 2.27 | 278.2 | 0.83 | 0.57 |
| VAVAE [63] | 0.28 | LightningDiT [63] | 675M | 2.17 | 205.6 | 0.77 | 0.65 | 1.35 | 295.3 | 0.79 | 0.65 |
| Masked Generative Models | |||||||||||
| LFQ [67] | 0.9 | MAGVIT-v2 [67] | 307M | 3.07 | 213.1 | — | — | 1.91 | 324.3 | − | − |
| TiTok-L [69] | 2.21 | MaskGIT [3] | 177M | 3.15 | 173.0 | − | − | 2.77 | 199.8 | − | − |
| Autoregressive Models | |||||||||||
| VQGAN† | 2.19 | LlamaGen‡ [43] | 1.4B | 14.65 | 86.3 | 0.63 | 0.68 | 2.34 | 253.9 | 0.81 | 0.60 |
| UniTok (Ours) | 0.41 | LlamaGen [43] | 1.4B | 2.51 | 216.7 | 0.82 | 0.57 | 2.77 | 227.5 | 0.81 | 0.57 |
Analysis:
- CFG-free Generation: UniTok dramatically reduces gFID without classifier-free guidance (CFG), from 14.65 (VQGAN† with LlamaGen) to 2.51. This indicates that UniTok learns a much better-structured and semantically coherent latent space than VQGAN. The Inception Score (IS) also jumps from 86.3 to 216.7, further confirming higher quality and diversity in cfg-free generation (see the CFG sketch after this list).
- CFG-enabled Generation: With CFG, UniTok's gFID is 2.77, slightly higher than VQGAN†'s 2.34, and its IS (227.5) is somewhat lower than VQGAN†'s (253.9). The key observation is that CFG barely changes UniTok's results (2.51 to 2.77 gFID), whereas it transforms VQGAN†'s (14.65 to 2.34), suggesting that UniTok benefits little from CFG because its latent space is already well-organized.
- Comparison to Diffusion Models: UniTok's cfg-free gFID of 2.51 is far better than SD-VAE + DiT (9.62) and approaches VAVAE + LightningDiT (2.17), which is especially impressive considering that LlamaGen is an autoregressive model. This strongly supports the idea that semantic supervision (from CLIP) gives UniTok a highly structured latent space suitable for high-quality generation.
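For reference, the sketch below shows how classifier-free guidance is commonly applied to autoregressive logits at sampling time, which is precisely the step a cfg-free sampler skips. The guidance formula is the standard logit-space interpolation; the exact LlamaGen implementation may differ in detail.

```python
import torch

def guided_logits(cond_logits, uncond_logits, cfg_scale=1.5):
    # cfg_scale = 1.0 recovers cfg-free sampling (conditional logits only).
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)

cond = torch.randn(1, 4096)      # logits given the class / text condition
uncond = torch.randn(1, 4096)    # logits with the condition dropped
probs = torch.softmax(guided_logits(cond, uncond, cfg_scale=2.0), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)  # sample the next visual token
```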
6.1.3. Unified Understanding and Generation (MLLM Integration)
6.1.3.1. Understanding Performance
The paper evaluates UniTok's multimodal understanding performance by integrating it into an MLLM and testing on various VQA benchmarks.
The following are the results from Table 3 of the original paper:
| Method | LLM | Token Type | Res. | VQAv2 | GQA | TextVQA | POPE | MME | MM-Vet |
| Emu [45] | Llama-13B | Continuous | 224 | 52.0 | - | - | - | - | - |
| LaVIT [15] | Llama-7B | Continuous | 224 | 66.0 | 46.8 | - | - | - | - |
| DreamLLM [7] | Vicuna-7B | Continuous | 224 | 72.9 | - | 41.8 | - | - | 26.6 |
| Unified-IO 2 [29] | 6.8B from scratch | Continuous | 384 | 79.4 | - | - | 87.7 | - | - |
| Janus [58] | DeepSeek-1.3B | Continuous | 384 | 77.3 | 59.1 | - | 87.0 | 1338 | 34.3 |
| CM3Leon [68] | 7B from scratch | Discrete | 256 | 47.6 | - | - | - | - | - |
| LWM [27] | Llama-2-7B | Discrete | 256 | 55.8 | 44.8 | 18.8 | 75.2 | - | - |
| Show-o [62] | Phi-1.5-1.3B | Discrete | 256 | 59.3 | 48.7 | - | 73.8 | 948 | - |
| Chameleon [46] | 34B from scratch | Discrete | 512 | 69.6 | - | - | - | - | - |
| Liquid [59] | Gemma-7B | Discrete | 512 | 71.3 | 58.4 | 42.4 | 81.1 | 1119 | - |
| VILA-U [61] | Llama-2-7B | Discrete | 256 | 75.3 | 58.3 | 48.3 | 83.9 | 1336 | 27.7 |
| UniTok | Llama-2-7B | Discrete | 256 | 76.8 | 61.1 | 51.6 | 83.2 | 1448 | 33.9 |
Analysis:
- Outperforming Discrete Tokenizers: UniTok (using Llama-2-7B) significantly outperforms other MLLMs that use discrete visual tokenizers. For example, it surpasses Chameleon (34B, trained from scratch) by 7.2 points on VQAv2 (76.8 vs 69.6) and Liquid (Gemma-7B) by 5.5 points on VQAv2 (76.8 vs 71.3), 2.7 points on GQA (61.1 vs 58.4), and 9.2 points on TextVQA (51.6 vs 42.4). It also achieves the highest MME score (1448) among all models in the table, and a strong MM-Vet score of 33.9, only slightly below Janus (34.3), which uses continuous tokens.
- Narrowing the Gap with Continuous Tokenizers: UniTok largely closes the performance gap with MLLMs that use continuous visual tokenizers (e.g., Unified-IO 2 at 79.4 VQAv2, Janus at 77.3 VQAv2). This is a crucial result, demonstrating that discrete tokens can reach understanding performance competitive with continuous features without compromising generation.
- Strong Candidate for Unified MLLMs: These results confirm that UniTok is a highly effective unified visual tokenizer for multimodal models, capable of robust visual comprehension.
6.1.3.2. Generation Performance
The paper further demonstrates UniTok's visual generation capabilities when integrated into the unified MLLM.
The following are the results from Table 4 of the original paper:
| Method | Type | #Data | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall↑ |
| SD v2.1 [39] | Diffusion | 2000M | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
| SD-XL [36] | Diffusion | 2000M | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| DALL-E 3 [2] | Diffusion | − | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| Show-o [62] | Discrete Diff. | 36M | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| 2.0B | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 | ||
| LWM [27] | Autoregressive | − | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 | 0.47 |
| Janus [58] | Autoregressive | − | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 |
| Liquid [59] | Autoregressive | 30M | 0.98 | 0.73 | 0.32 | 0.76 | 0.17 | 0.37 | 0.55 |
| UniTok | Autoregressive | 30M | 0.99 | 0.71 | 0.36 | 0.79 | 0.26 | 0.45 | 0.59 |
Analysis on GenEval:
- UniTok achieves an Overall score of 0.59 on the GenEval benchmark, outperforming Liquid (0.55) while using the same amount of text-to-image training data (30M). This highlights the superior tokenizer design of UniTok.
- It performs well on Single Obj. (0.99), Two Obj. (0.71), and Color Attri. (0.45), indicating strong compositional generation capabilities.
- While DALL-E 3 (0.67) and Show-o (2.0B data, 0.68) still hold higher Overall scores, these models are either diffusion-based or trained on significantly more data (billions vs 30M). UniTok remains competitive with these domain experts given its autoregressive nature and data scale.

The following are the results from Table 5 of the original paper:
| Method | Type | Res. | FID↓ |
| SD-XL [36] | Diffusion | 1024 | 9.55 |
| PixArt [6] | Diffusion | 1024 | 6.14 |
| Playground [20] | Diffusion | 1024 | 4.48 |
| Liquid [59] | Autoregressive | 512 | 5.47 |
| Janus [58] | Autoregressive | 384 | 10.10 |
| LWM [27] | Autoregressive | 256 | 17.77 |
| Show-o [62] | Discrete Diff. | 256 | 15.18 |
| VILA-U [61] | Autoregressive | 256 | 12.81 |
| UniTok | Autoregressive | 256 | 7.46 |
Analysis on MJHQ-30K:
- UniTok achieves an FID of 7.46 on MJHQ-30K at 256x256 resolution. This is a substantial improvement over other autoregressive models at the same resolution, such as LWM (17.77), Show-o (15.18), and VILA-U (12.81).
- Even though MJHQ-30K favors higher resolutions for lower FID (e.g., Playground reaches 4.48 FID at 1024 resolution), UniTok's score at 256x256 remains highly competitive, surpassing some models that operate at higher resolutions, such as SD-XL (9.55 at 1024) and Janus (10.10 at 384), though still behind Liquid (5.47 at 512). This demonstrates UniTok's ability to generate high-quality, detail-rich images even at a smaller resolution.

The following figure (Figure 5 from the original paper) shows examples of images generated by the unified MLLM:
The figure shows a diverse set of 256x256 generated samples spanning animals, plants, landscapes, and abstract art.
Figure 5: Images generated at a resolution of 256x256 with our unified MLLM.
Qualitative Analysis of Generation: Figure 5 showcases UniTok's ability to synthesize photorealistic and visually appealing images from text prompts. The examples demonstrate that the model can understand a wide range of concepts (e.g., "Vincent van Gogh painting style", "bitcoin") and combine them creatively, producing diverse and high-quality outputs.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Impact of Supervision Types
This ablation study investigates the impact of different combinations of contrastive and reconstruction losses on downstream generation and understanding performance. All tokenizers are vector-quantized.
The following are the results from Table 6 of the original paper:
| Supervision | rFID ↓ | gFID ↓ | VQAv2 | GQA | SciQA | TextVQA | POPE | MME |
| Contrastive | − | − | 68.95 | 56.89 | 65.64 | 49.89 | 82.34 | 1373 |
| Reconstruction | 0.82 | 3.59 | 56.33 | 47.53 | 63.26 | 43.65 | 77.09 | 902 |
| Recon. + Contra. | 0.72 | 3.26 | 69.14 | 56.06 | 65.25 | 49.22 | 81.42 | 1333 |
Analysis:
- Reconstruction-only tokenizer limitations: The reconstruction-only tokenizer shows good rFID (0.82) and reasonable gFID (3.59), but significantly lags in understanding performance (e.g., VQAv2 56.33, MME 902) compared to models with contrastive supervision. This confirms the limitations of traditional VQVAE models for multimodal understanding.
- Contrastive-only tokenizer: A contrastive-only tokenizer performs well in understanding (e.g., VQAv2 68.95, MME 1373) but is not designed for generation, hence no rFID or gFID is reported.
- Combined Supervision (Recon. + Contra.): The jointly trained tokenizer achieves an rFID of 0.72 and a gFID of 3.26, both better than the reconstruction-only tokenizer. Crucially, its understanding performance (e.g., VQAv2 69.14, MME 1333) is on par with the contrastive-only tokenizer. This strongly supports the paper's central argument: reconstruction and contrastive objectives do not inherently conflict once the discrete feature space is sufficiently expressive, and combining them yields a tokenizer that performs well in both domains (a minimal sketch of the combined loss follows this list).
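A rough sketch of the joint supervision studied in this ablation: a pixel reconstruction term plus a CLIP-style symmetric contrastive term. The loss weight, the plain MSE reconstruction, and the omission of GAN/perceptual/VQ commitment terms are simplifications, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat, txt_feat, temperature=0.07):
    # Symmetric InfoNCE over image->text and text->image directions.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_feat.size(0))               # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def joint_loss(recon, image, img_feat, txt_feat, w_contra=1.0):
    recon_loss = F.mse_loss(recon, image)                   # pixel reconstruction term
    contra_loss = clip_contrastive_loss(img_feat, txt_feat) # semantic (CLIP-style) term
    return recon_loss + w_contra * contra_loss

# Usage with dummy tensors (batch of 4 image-text pairs).
loss = joint_loss(torch.randn(4, 3, 256, 256), torch.randn(4, 3, 256, 256),
                  torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```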
6.2.2. MCQ vs. RQ
This study provides a direct comparison between Multi-Codebook Quantization (MCQ) and Residual Quantization (RQ), ensuring fair comparisons by implementing RQ on the same codebase and training settings as UniTok.
The following are the results from Table 7 of the original paper:
| Method | Code Shape | Code Dim. | rFID↓ | Accuracy | |
| RQ | 16×16×8 | 64 | 3.46 | 58.8 | |
| MCQ | 16×16×8 | 64 | 0.55 | 63.7 |
Analysis:
- MCQ demonstrates clear superiority over RQ. With the same code shape (16x16 spatial tokens, 8 sub-codebooks) and global latent code dimension (64-d), MCQ achieves a much lower rFID (0.55 vs 3.46 for RQ) and higher classification accuracy (63.7% vs 58.8% for RQ).
- This confirms the discussion in Section 3.3, which posited that MCQ's divide-and-conquer strategy is more effective than RQ's coarse-to-fine approach, especially in high-dimensional latent spaces, leading to better quantization and overall performance (see the RQ sketch after this list for the contrast).
6.2.3. Number of Sub-Codebooks
This ablation explores how the number of sub-codebooks impacts tokenizer performance.
The following are the results from Table 8 of the original paper:
| Codebook config (effective vocabulary) | 1×16384 (2^14) | 2×8192 (2^26) | 4×4096 (2^48) | 8×2048 (2^88) |
| rFID ↓ | 1.50 | 0.98 | 0.54 | 0.33 |
| Accuracy | 41.0% | 43.9% | 44.7% | 46.1% |
Analysis:
- The table shows that as the number of sub-codebooks increases (keeping the total number of code entries, i.e., number of sub-codebooks × size of each sub-codebook, fixed at 16,384 while the effective vocabulary size grows exponentially), both rFID (reconstruction quality) and Accuracy (understanding performance) consistently improve.
- For example, moving from 1 sub-codebook (1×16384, effective vocabulary 2^14) to 8 sub-codebooks (8×2048, effective vocabulary 2^88) reduces rFID from 1.50 to 0.33 and increases Accuracy from 41.0% to 46.1%.
- This provides strong evidence that MCQ generally benefits vector-quantized models, regardless of whether the primary objective is reconstruction or contrastive learning. It confirms that increasing the representational capacity of the discrete token space is key to improving performance (the short calculation after this list makes the vocabulary scaling explicit).
6.2.4. CLIP Weight Initialization
This study investigates the impact of CLIP weight initialization on visual understanding performance within the LLaVA framework.
The following are the results from Table 9 of the original paper:
| Tokenizer | VQAv2 | GQA | TextVQA | POPE | MME |
| UniTok† | 69.9 | 56.2 | 49.3 | 81.2 | 1331 |
| UniTok | 72.4 | 58.2 | 51.6 | 82.4 | 1392 |
Analysis:
- Surprisingly, UniTok trained from scratch (randomly initialized) outperforms UniTok† (initialized with pretrained CLIP weights) across all VQA benchmarks (VQAv2, GQA, TextVQA, POPE, MME). For instance, VQAv2 accuracy is 72.4% for UniTok vs 69.9% for UniTok†, and the MME score is 1392 vs 1331.
- This is counter-intuitive because UniTok† achieved better zero-shot classification accuracy on ImageNet (78.6% vs 70.8% for UniTok, as shown in Table 1).
- The finding suggests that downstream VQA performance may not be strongly correlated with ImageNet classification accuracy. More importantly, CLIP weight initialization might act as a negative prior for unified tokenizers: the unified visual feature space (designed for both generation and understanding) can differ fundamentally from the CLIP feature space (designed primarily for understanding). Starting from CLIP weights may bias learning towards CLIP's original feature space and hinder the tokenizer from discovering a more optimal, truly unified representation. Training UniTok from scratch thus allows it to learn a representation space intrinsically better suited for unified multimodal tasks.
6.2.5. GenAI-Bench Results (Appendix)
The paper provides additional text-to-image generation results on the GenAI-Bench in the appendix.
The following are the results from Table 10 of the original paper:
| Method | Type | #Training Images | Attribute↑ | Scene↑ | Relation: Spatial↑ | Relation: Action↑ | Relation: Part↑ | Overall↑ |
| SD v2.1 [39] | Diffusion | 2000M | 0.80 | 0.79 | 0.76 | 0.77 | 0.80 | 0.78 |
| SD-XL [36] | Diffusion | 2000M | 0.84 | 0.84 | 0.82 | 0.83 | 0.89 | 0.83 |
| Midjourney v6 | Diffusion | − | 0.88 | 0.87 | 0.87 | 0.87 | 0.91 | 0.87 |
| DALL-E 3 [2] | Diffusion | − | 0.91 | 0.90 | 0.92 | 0.89 | 0.91 | 0.90 |
| Show-o [62] | Discrete Diff. | 36M | 0.72 | 0.72 | 0.70 | 0.70 | 0.75 | 0.70 |
| LWM [27] | Autoregressive | − | 0.63 | 0.62 | 0.65 | 0.63 | 0.70 | 0.63 |
| VILA-U [61] | Autoregressive | 15M | 0.78 | 0.78 | 0.77 | 0.78 | 0.79 | 0.76 |
| Liquid [59] | Autoregressive | 30M | 0.84 | 0.86 | 0.81 | 0.83 | 0.91 | 0.83 |
| UniTok | Autoregressive | 30M | 0.85 | 0.87 | 0.86 | 0.86 | 0.89 | 0.85 |
The following are the results from Table 11 of the original paper:
| Method | Type | #Training Images | Count↑ | Differ↑ | Compare↑ | Logical: Negate↑ | Logical: Universal↑ | Overall↑ |
| SD v2.1 [39] | Diffusion | 2000M | 0.68 | 0.70 | 0.68 | 0.54 | 0.64 | 0.62 |
| SD-XL [36] | Diffusion | 2000M | 0.71 | 0.73 | 0.69 | 0.50 | 0.66 | 0.63 |
| Midjourney v6 | Diffusion | − | 0.78 | 0.78 | 0.79 | 0.50 | 0.76 | 0.69 |
| DALL-E 3 [2] | Diffusion | − | 0.82 | 0.78 | 0.82 | 0.48 | 0.80 | 0.70 |
| Show-o [62] | Discrete Diff. | 36M | 0.70 | 0.62 | 0.71 | 0.51 | 0.65 | 0.60 |
| LWM [27] | Autoregressive | − | 0.59 | 0.58 | 0.54 | 0.49 | 0.52 | 0.53 |
| VILA-U [61] | Autoregressive | 15M | 0.70 | 0.71 | 0.74 | 0.53 | 0.66 | 0.64 |
| Liquid [59] | Autoregressive | 30M | 0.76 | 0.73 | 0.74 | 0.46 | 0.74 | 0.65 |
| UniTok | Autoregressive | 30M | 0.76 | 0.76 | 0.79 | 0.46 | 0.73 | 0.67 |
Analysis on GenAI-Bench (Basic and Advanced Prompts):
- Basic Prompts (Table 10): UniTok achieves an Overall score of 0.85, surpassing Liquid (0.83) with the same 30M training images. It performs strongly on Attribute (0.85), Scene (0.87), Spatial (0.86), Action (0.86), and Part (0.89), demonstrating reliable generation of basic object properties and relationships. It even competes closely with SD-XL (0.83) and Midjourney v6 (0.87), diffusion models trained on vastly more data.
- Advanced Prompts (Table 11): For more complex prompts, UniTok achieves an Overall score of 0.67, again outperforming Liquid (0.65) and VILA-U (0.64). It shows strong performance on Count (0.76), Differ (0.76), and Compare (0.79), indicating a good grasp of numerical and comparative aspects. Performance on the Logical categories (Negate 0.46, Universal 0.73) remains more challenging, consistent with other models. Notably, its Overall score approaches those of DALL-E 3 (0.70) and Midjourney v6 (0.69), which are state-of-the-art diffusion models.

These results consistently show that UniTok delivers strong generation performance across metrics and prompt complexities, especially compared to other autoregressive MLLMs, further solidifying its role as an effective unified tokenizer.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces UniTok, a novel unified tokenizer that addresses the long-standing challenge of creating a single visual representation suitable for both generation and understanding tasks within Multimodal Large Language Models (MLLMs). Through rigorous empirical analysis, the authors demonstrate that the perceived loss conflicts between reconstruction and semantic supervision are not inherent, but rather a symptom of the limited representational capacity of discrete token space.
UniTok overcomes this quantization bottleneck by employing two key innovations: multi-codebook quantization (MCQ) and attention projection. MCQ exponentially scales the effective vocabulary size and bottleneck dimension by partitioning latent vectors into independently quantized chunks, avoiding the optimization pitfalls of large monolithic codebooks. Attention projection enhances token factorization by using adapted multi-head attention modules, which preserve richer semantics during dimension compression.
The experimental results showcase UniTok's impressive performance:
- It achieves a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet, demonstrating high-fidelity reconstruction and strong understanding capabilities simultaneously.
- When integrated into MLLMs, UniTok enables native visual generation without compromising understanding performance, significantly outperforming VQGAN-based baselines on VQA benchmarks (e.g., 5.5% on VQAv2 and 9.2% on TextVQA).
- It fosters a better-structured latent space, leading to a drastic reduction in gFID for cfg-free generation (from 14.6 to 2.5 on ImageNet), indicating improved inherent generation quality.
- Ablation studies confirm that the reconstruction and contrastive learning objectives are compatible when the discrete feature space is sufficiently rich, and that MCQ consistently improves both generation and understanding metrics.

Overall, UniTok provides a powerful and efficient solution for unified visual tokenization, paving the way for more capable and cohesive multimodal models.
7.2. Limitations & Future Work
The authors acknowledge one primary limitation:
- Limited Training Schedule: Due to computational resource constraints, UniTok was trained for only one epoch. The authors believe that extending the training schedule could further benefit the tokenizer, especially its understanding performance, as CLIP-based semantic representation learning typically requires extensive training to converge fully.

Based on this, a clear direction for future work is to:

- Extend Training Duration: Investigate the performance gains and convergence patterns of UniTok when trained for longer durations with more computational resources. This could push understanding performance even higher, further closing the gap with dedicated CLIP models.
7.3. Personal Insights & Critique
This paper offers several profound insights that could inspire future research in multimodal AI:
- Redefining the "Loss Conflict": The most impactful insight is debunking the perceived loss conflict between reconstruction and semantic objectives. By attributing performance issues to limited representational capacity, the paper shifts the research focus from finding delicate weighting schemes for conflicting losses to fundamentally improving the expressiveness of discrete tokens. This is a powerful paradigm shift, suggesting that robust multimodal representation might be achievable without inherent trade-offs. The finding could generalize to other domains where seemingly conflicting objectives (e.g., efficiency vs. accuracy, local vs. global features) might be resolved by increasing the underlying model's capacity or representational power.
- Scalability of Discrete Token Space: The multi-codebook quantization (MCQ) mechanism is a clever solution to a long-standing problem in VQVAEs: how to scale vocabulary size without codebook utilization issues. The exponential increase in effective vocabulary size is a significant advantage. This approach could be applied to other discrete tokenization schemes, even in NLP if context-dependent sub-vocabularies proved beneficial. The comparison to Residual Quantization (RQ) also highlights the importance of design philosophy (divide-and-conquer vs. coarse-to-fine) in quantization.
- Semantic Regularization for Generative Latent Spaces: The observation that semantic supervision leads to a better-structured latent space for autoregressive generation (evidenced by the massive gFID reduction in cfg-free generation) is crucial. It reinforces similar findings in diffusion models and suggests that semantic guidance during tokenizer training is not just for understanding but fundamentally improves the generative properties of the latent space itself, making it more disentangled and coherent. This principle could inform the design of future generative models, emphasizing the integration of strong semantic priors from the earliest stages of representation learning.
- Beyond ImageNet Accuracy for MLLMs: The ablation on CLIP weight initialization provides a critical insight: higher ImageNet zero-shot classification accuracy does not automatically translate to superior downstream VQA performance for unified MLLMs. This suggests that the optimal feature space for a truly unified multimodal model may differ significantly from that learned by a vision-only model like CLIP. It is a valuable warning against over-relying on single-task benchmarks when designing complex multimodal systems, and it implies that MLLM training should prioritize learning a unified representation rather than simply integrating pre-trained unimodal components.
Potential Issues/Areas for Improvement:
- Computational Cost: While MCQ addresses some optimization issues of large monolithic codebooks, training UniTok and its integrated MLLM remains highly computationally intensive, especially considering the 1.28B image-text pairs and the Llama-2-7B base model. The stated limitation of training for only one epoch due to resource constraints underscores this. Future work could explore more computationally efficient training strategies or distillation methods.
- Generalizability of the "No Conflict" Claim: While the paper demonstrates that reconstruction and semantic losses do not inherently conflict given UniTok's specific enhancements, it may be too strong to claim this universally; without these enhancements, the conflict could still manifest. Further theoretical work to formally characterize the conditions under which these objectives align or conflict would be valuable.
- Complexity of Attention Projection: While attention projection improves performance, multi-head attention modules are inherently more complex and computationally expensive than linear layers. A detailed analysis of the computational overhead and latency introduced by attention projection, relative to its performance gains, would be beneficial for practical deployment.
- Explicit Explanation of the MLP Projector: The paper states that code embeddings are projected into the MLLM token space with an MLP projector. While standard, a brief discussion of its architecture (e.g., number of layers, activation functions) and how it handles the potential modality gap between vision and language embeddings would add clarity.

Overall, UniTok represents a significant step forward in unified visual tokenization, offering a robust framework and valuable insights for the continued development of sophisticated multimodal AI systems.