
Finite Scalar Quantization: VQ-VAE Made Simple

Published: 09/27/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces Finite Scalar Quantization (FSQ) as a simpler alternative to VQ in VQ-VAEs, enabling implicit codebook creation. FSQ achieves competitive performance across tasks while avoiding codebook collapse and reducing complexity.

Abstract

We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations. For example, autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Finite Scalar Quantization: VQ-VAE Made Simple

1.2. Authors

Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen. Affiliations are Google Research and Google DeepMind.

1.3. Journal/Conference

The paper is published as a preprint on arXiv. The official publication venue (journal or conference) is not specified; it is common for such research to appear on arXiv before peer review and publication in a reputable venue. Given the authors' affiliations and the nature of the research, it is likely intended for a top-tier machine learning or computer vision conference (e.g., NeurIPS, ICML, CVPR, ICLR).

1.4. Publication Year

2023

1.5. Abstract

This paper introduces Finite Scalar Quantization (FSQ) as a simpler alternative to Vector Quantization (VQ) in VQ-VAEs. FSQ projects the VAE representation into a few dimensions (typically less than 10), where each dimension is quantized to a small, fixed set of values. This design implicitly creates a codebook as the product of these sets, matching VQ's codebook size. FSQ can serve as a drop-in replacement for VQ-VAE representations in models like autoregressive and masked transformers for various tasks, including image generation (with MaskGIT) and dense prediction computer vision tasks (with UViM). Despite its simplicity, FSQ achieves competitive performance across these tasks. A key advantage of FSQ is its inherent resistance to codebook collapse and the elimination of complex VQ-specific optimization techniques (e.g., commitment losses, codebook reseeding, entropy penalties).

Original Link: https://arxiv.org/abs/2309.15505
PDF Link: https://arxiv.org/pdf/2309.15505v2.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses lies within the domain of learning discrete representations using neural networks, particularly with Vector Quantized Variational Autoencoders (VQ-VAEs). VQ-VAEs are powerful models for learning compressed, semantic representations of data, crucial for tasks like image and video generation, and multimodal language models.

However, the existing Vector Quantization (VQ) scheme within VQ-VAEs suffers from several significant challenges:

  1. Optimization Difficulty: The quantization operation (finding the nearest codeword) is non-differentiable, requiring workarounds like the Straight-Through Estimator (STE) for gradient propagation.

  2. Complex Auxiliary Losses: To effectively train VQ-VAEs, auxiliary losses (e.g., commitment loss, codebook loss) are necessary to pull codewords towards encoder outputs and vice-versa.

  3. Codebook Underutilization (Codebook Collapse): As the codebook size increases, many codewords often remain unused, leading to inefficient learning and wasted capacity. This necessitates further complex "tricks" like reinitialization, stochastic formulations, or entropy penalties to mitigate.

    These challenges make VQ-VAE training complex, fragile, and sensitive to hyperparameters. The paper's entry point is to simplify this process by drawing inspiration from neural compression literature, specifically scalar quantization, to achieve a more robust and straightforward method for discrete representation learning.

2.2. Main Contributions / Findings

The paper makes several key contributions:

  1. Introduction of Finite Scalar Quantization (FSQ) as a Drop-in Replacement: FSQ is proposed as a simple, effective alternative to VQ. It projects VAE representations to a few dimensions, each independently quantized to a fixed set of values. This implicitly forms a codebook and allows FSQ to be used directly in existing architectures designed for VQ, such as MaskGIT for image generation and UViM for dense prediction tasks.

  2. Competitive Performance with Much Simpler Design: Despite FSQ's simpler architecture and lack of auxiliary losses or complex optimization tricks, it achieves performance competitive with VQ-based methods across various tasks and datasets. The paper reports a minimal reduction (0.5-3%) in metrics compared to VQ, with highly similar visual results.

  3. Inherent Robustness against Codebook Collapse: FSQ inherently achieves high codebook utilization (approaching 100% for most models) by design, without needing any of the common VQ-specific strategies to prevent codebook collapse. This significantly simplifies training and improves efficiency.

  4. Analysis of Trade-offs and Scaling Behaviors: The paper rigorously analyzes the trade-offs between FSQ and VQ, demonstrating that FSQ effectively leverages larger codebook sizes for better reconstruction and sample quality, while VQ's performance often deteriorates with very large codebooks due to optimization difficulties. FSQ also has fewer parameters than VQ.

  5. Evidence of VQ's Limited Benefits for Generality: The study suggests that the "full generality" of VQ's learnable, high-dimensional codebook offers little benefit over FSQ's fixed, low-dimensional, grid-based approach. The complexity of VQ's codebook optimization often outweighs its theoretical flexibility.

    In essence, the paper demonstrates that a significantly simpler quantization scheme can achieve comparable, and in some cases superior, performance to more complex VQ methods, especially when scaling codebook sizes, by shifting the burden of representation learning to the VAE's encoder and decoder.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of several machine learning concepts is essential, particularly in generative models and representation learning.

3.1.1. Autoencoders (AEs) and Variational Autoencoders (VAEs)

An Autoencoder (AE) is a type of neural network used for unsupervised learning of efficient data codings (representations). The goal of an AE is to learn a compressed representation (encoding) for a set of data, typically for dimensionality reduction or feature learning. It consists of two parts: an encoder that maps the input data to a latent-space representation, and a decoder that maps this latent-space representation back to the original input data. The network is trained to minimize the reconstruction error between the input and its decoded output.

A Variational Autoencoder (VAE) is a generative model that extends the autoencoder concept by introducing a probabilistic approach to the encoder and decoder. Instead of directly outputting a latent vector, the encoder of a VAE outputs parameters (mean $\mu$ and variance $\sigma^2$) of a probability distribution (typically a Gaussian distribution) in the latent space. The decoder then samples from this distribution to reconstruct the input. This probabilistic formulation allows VAEs to generate new, diverse data samples by sampling from the learned latent distribution. VAEs are trained with a loss function that combines a reconstruction loss (to ensure the decoded output is similar to the input) and a Kullback-Leibler (KL) divergence term (to ensure the learned latent distribution is close to a prior distribution, usually a standard normal distribution, which encourages a well-structured and continuous latent space).

3.1.2. Vector Quantization (VQ)

Vector Quantization (VQ) is a classical signal processing technique, originating from Gray (1984), used for data compression. It works by mapping a high-dimensional vector from a continuous input space to a discrete vector (codeword) from a finite set of codebook vectors. Given an input vector $z \in \mathbb{R}^D$ and a codebook $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$ where each $c_k \in \mathbb{R}^D$ is a codeword, the quantization operation finds the codeword $c_j$ in $\mathcal{C}$ that is closest to $z$ according to some distance metric (typically Euclidean distance). The input vector $z$ is then replaced by this closest codeword $c_j$. Mathematically, the quantization operation is: $ \hat{z} = \arg\min_{c_k \in \mathcal{C}} \|z - c_k\|_2 $ where $\hat{z}$ is the quantized vector.
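
As a concrete illustration (not code from the paper), a minimal NumPy sketch of this nearest-neighbor lookup with a random toy codebook:

import numpy as np

rng = np.random.default_rng(0)
K, D = 16, 4                          # toy codebook size and codeword dimension
codebook = rng.normal(size=(K, D))    # C = {c_1, ..., c_K}
z = rng.normal(size=(D,))             # encoder output to be quantized

# Nearest-neighbor lookup: j = argmin_k ||z - c_k||_2
distances = np.linalg.norm(codebook - z, axis=-1)
j = int(np.argmin(distances))
z_hat = codebook[j]                   # z is replaced by its closest codeword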

3.1.3. VQ-VAEs (Vector Quantized Variational Autoencoders)

VQ-VAEs (Van Den Oord et al., 2017) integrate Vector Quantization into the latent space of a VAE. Instead of learning a continuous latent distribution, the encoder of a VQ-VAE maps the input to a continuous representation, which is then quantized to a discrete codeword from a learned codebook. This discrete representation is then passed to the decoder. The key innovation was to enable gradient flow through the non-differentiable quantization step using the Straight-Through Estimator (STE) and introducing auxiliary losses to train the codebook itself. The VQ-VAE aims to combine the generative capabilities of VAEs with the discrete, symbolic representations offered by VQ, which are particularly useful for tasks involving discrete sequences, like text or tokenized images.
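
For reference, the training objective of the original VQ-VAE is typically written as a reconstruction term plus the two auxiliary codebook terms, where $z_e(x)$ is the encoder output, $e$ the selected codeword, $z_q(x)$ the quantized latent, $\text{sg}$ the stop-gradient operator, and $\beta$ the commitment weight: $ \mathcal{L} = \log p(x \mid z_q(x)) + \| \text{sg}[z_e(x)] - e \|_2^2 + \beta \| z_e(x) - \text{sg}[e] \|_2^2 $ The second term pulls codewords towards encoder outputs (often replaced by an EMA update), and the third (commitment) term keeps encoder outputs close to their assigned codewords.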

3.1.4. Straight-Through Estimator (STE)

The Straight-Through Estimator (STE) is a technique used to estimate gradients for non-differentiable operations in neural networks, such as quantization or binarization. During the forward pass, the non-differentiable operation is performed as usual. During the backward pass, the gradient is simply passed "straight through" the non-differentiable operation as if it were an identity function (i.e., its derivative is approximated as 1). This allows gradients to flow to preceding layers, enabling end-to-end training. For a non-differentiable quantization function Q(x), STE defines the gradient with respect to $x$ as: $ \frac{\partial L}{\partial x} \approx \frac{\partial L}{\partial Q(x)} \cdot 1 $ where $L$ is the loss function. A common implementation in frameworks like JAX involves: $ \text{round\_ste}(x) = x + \text{sg}(\text{round}(x) - x) $ Here, sg (stop_gradient) ensures that the gradient of round(x) - x is zero, effectively making the gradient of round_ste(x) equal to that of $x$ (i.e., 1) during the backward pass, while the forward pass computes round(x).
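
The following minimal JAX sketch (mirroring the round_ste reference implementation reproduced later in Section 4.2.4) illustrates the behaviour: the forward pass rounds, while the gradient is that of the identity:

import jax
import jax.numpy as jnp

def round_ste(x):
  # Forward pass: round(x); backward pass: gradient of the identity (straight-through).
  return x + jax.lax.stop_gradient(jnp.round(x) - x)

x = 1.7
print(round_ste(x))                         # 2.0  (rounded in the forward pass)
print(jax.grad(round_ste)(x))               # 1.0  (gradient passes straight through)
print(jax.grad(lambda v: jnp.round(v))(x))  # 0.0  (plain rounding blocks gradients)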

3.1.5. Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) consist of two neural networks, a generator and a discriminator, that compete against each other in a zero-sum game. The generator learns to create new data instances that resemble the training data, while the discriminator learns to distinguish between real data samples and fake data samples generated by the generator. This adversarial process drives both networks to improve, resulting in a generator capable of producing highly realistic data. Some VQ-VAE variants, like VQ-GAN (Esser et al., 2020), incorporate a GAN loss to improve the quality of generated images.

3.1.6. Transformers and Masked Language Models

Transformers are a type of neural network architecture introduced by Vaswani et al. (2017), primarily known for their success in natural language processing (NLP) but now widely used in computer vision. They rely heavily on the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element. Masked Language Models (MLMs), like BERT (Devlin et al., 2018), are a specific application of transformers. They are trained by masking out a certain percentage of input tokens and then predicting the original masked tokens based on their context. This pre-training objective allows MLMs to learn rich, bidirectional representations of sequences, which can then be fine-tuned for various downstream tasks. MaskGIT (Chang et al., 2022), mentioned in this paper, adapts the masked transformer concept to image generation using discrete image tokens.

3.2. Previous Works

The paper discusses several lines of prior research that contextualize FSQ.

3.2.1. VQ-VAE and Improvements

The original VQ-VAE (Van Den Oord et al., 2017) laid the groundwork by introducing discrete representation learning with the Straight-Through Estimator and auxiliary losses (commitment loss, EMA for codebook learning). Subsequent works aimed to address its optimization challenges, particularly codebook underutilization:

  • Codebook Reinitialization/Reseeding: Dhariwal et al. (2020) and Łańcucki et al. (2020) proposed resetting unused codewords to encoder outputs or reinitializing the entire codebook using offline clustering to ensure better utilization.
  • Stochastic Quantization: Takida et al. (2022) and Williams et al. (2020) explored adding noise to the encoder output during quantization to encourage broader codebook usage, often annealing the noise during training.
  • Multi-scale VQ: Dhariwal et al. (2020) also introduced hierarchical VQ for audio, learning codes at different resolutions.
  • Improved Optimization: Yu et al. (2021) used a Vision Transformer (ViT)-based VQ-GAN, normalizing vectors and mapping codes to a lower-dimensional space for lookup. Huh et al. (2023) examined vanilla VQ's challenges and proposed re-parameterization, alternating optimization, and an improved commitment loss.
  • Soft EM: Roy et al. (2018) used soft expectation maximization for training.

3.2.2. VQ Alternatives

Beyond direct improvements to VQ-VAE, other discrete quantization schemes exist:

  • Residual Quantization (RVQ): Used in image (Lee et al., 2022) and audio (Zeghidour et al., 2021) generation. It refines quantized codes by additionally storing (and quantizing) the residuals, effectively building up a representation layer by layer.
  • Product Quantization (PQ): (Chen et al., 2020; El-Nouby et al., 2022) factors a large codebook into a product of smaller codebooks, reducing memory footprint and search complexity.
  • Reduced Token Inference: Huang et al. (2023) focused on reducing the number of tokens output by VQ-VAEs for more efficient inference.
  • Scalar Quantization for Audio: Donahue et al. (2019) and Dieleman et al. (2021) applied scalar quantization to audio tasks, using "margin loss" to bound representations. Hsu et al. (2023) used per-channel codebooks leading to a learned grid, but still optimized with vanilla VQ losses. This is somewhat related to FSQ but still uses VQ's complex optimization.

3.2.3. Neural Compression

The neural compression literature heavily relies on scalar quantization:

  • Unbounded Scalar Quantization: Works like Ballé et al. (2016), Minnen et al. (2018), Lu et al. (2019), Mentzer et al. (2020), and Cheng et al. (2020) use scalar quantization without explicit bounds, relying on entropy regularization to prevent codes from spreading too widely.
  • Bounded Scalar Quantization (FSQ's Precursor): This is where FSQ draws direct inspiration. Mentzer et al. (2018) used $d=16, L=5$ for high-fidelity image compression, and Tschannen et al. (2018); Agustsson et al. (2019) used $d=5, L=5$ for "extreme compression." These works used a form of FSQ but within the context of image compression, where the goal is usually low bitrate and minimal reconstruction error, rather than representation learning for generative models, where high entropy and maximal codebook usage are often desired.

3.3. Technological Evolution

The evolution of discrete representation learning can be traced from traditional signal processing techniques like Vector Quantization (VQ) to its integration into deep learning models. Autoencoders provided a way to learn latent representations, and Variational Autoencoders (VAEs) introduced a probabilistic framework for generative modeling. The seminal VQ-VAE (Van Den Oord et al., 2017) combined these, enabling deep neural networks to learn discrete latent codes, which were then leveraged by powerful Transformer architectures (like Masked Language Models for images, e.g., MaskGIT) for tasks like image generation.

However, the VQ component in VQ-VAEs proved challenging to optimize, leading to a cascade of auxiliary losses and complex tricks to manage issues like codebook collapse. The current paper, "Finite Scalar Quantization: VQ-VAE Made Simple," enters this lineage by proposing a fundamental simplification of the quantization step itself. Instead of a complex, learnable, high-dimensional VQ codebook, it reintroduces a form of scalar quantization (inspired by neural compression) that is finite and grid-based. This positions FSQ as a robust, simpler alternative that aims to democratize the use of discrete representations by removing much of the optimization overhead associated with VQ, while maintaining competitive performance.

3.4. Differentiation Analysis

The core differentiation of FSQ from existing methods, particularly VQ, lies in its fundamental design philosophy and resulting practical advantages:

  1. Fixed vs. Learnable Codebook:

    • VQ: Employs a learnable codebook $\mathcal{C}$ where each codeword $c_k$ is a vector that is optimized during training. This creates a complex, adaptive Voronoi partition of the latent space.
    • FSQ: Uses a fixed, implicit codebook defined by a grid. The encoder output $z$ is bounded and then rounded to integers along each of its few dimensions. The codebook is simply the Cartesian product of these fixed scalar values. It does not contain learnable parameters.
  2. Implicit Grid vs. Explicit Vectors:

    • VQ: Requires explicit storage of $K$ codewords, each of dimension $D$, leading to $K \times D$ parameters for the codebook.
    • FSQ: The codebook is implicitly defined by the chosen number of dimensions ($d$) and levels per dimension ($L_i$). The actual codewords are just the integer coordinates in this low-dimensional grid. This means FSQ has zero parameters for its "codebook" itself, and also significantly fewer parameters in the final encoder layer due to the much smaller $d$.
  3. Optimization Complexity:

    • VQ: Inherently difficult to optimize due to the non-differentiable nearest neighbor lookup. It requires the Straight-Through Estimator (STE) plus a suite of auxiliary losses (commitment loss, codebook loss, entropy loss) and heuristics (EMA, codebook reseeding, code splitting, stochastic quantization) to prevent codebook collapse.
    • FSQ: Also uses the Straight-Through Estimator for the rounding operation. However, it does not require any auxiliary losses or complex tricks. Codebook utilization is inherently high (near 100%) by design because the encoder is forced to spread information across the fixed, bounded bins to minimize reconstruction loss.
  4. Dimensionality of Latent Space:

    • VQ: Typically uses high-dimensional latent vectors (e.g., $d \geq 512$).
    • FSQ: Operates in a much lower-dimensional space (e.g., $d < 10$). The paper argues that the non-linearity and capacity of the VAE's encoder and decoder can "absorb" the complexity that VQ tries to capture in its higher-dimensional, learnable codebook.
  5. Robustness to Codebook Collapse:

    • VQ: Prone to codebook collapse, where many codewords remain unused, especially with large codebook sizes. This is a major research problem requiring specific solutions.

    • FSQ: Fundamentally does not suffer from codebook collapse because the fixed, bounded nature of the quantization levels forces the encoder to utilize all available bins to minimize reconstruction error.

      In summary, FSQ differentiates itself by offering a robust, simpler, and more parameter-efficient method for discrete representation learning that side-steps the complex optimization challenges inherent to VQ, while still achieving competitive performance.

4. Methodology

4.1. Principles

The core principle behind Finite Scalar Quantization (FSQ) is to replace the complex, learnable Voronoi partitioning of the latent space used in Vector Quantization (VQ) with a simple, fixed, grid-based partition. This simplification is justified by the observation that modern Variational Autoencoders (VAEs) often have substantial model capacity. This capacity allows the VAE's encoder and decoder to "absorb" the non-linearity that VQ typically tries to learn within its codebook. By shifting this complexity to the encoder and decoder, FSQ can achieve similar expressiveness in partitioning the input space (e.g., images) but with a much simpler and more stable quantization mechanism. The key intuition is that a highly expressive encoder can project relevant information into a low-dimensional, discretizable space, and a powerful decoder can reconstruct complex outputs from these simple discrete codes.

4.2. Core Methodology In-depth (Layer by Layer)

The FSQ method involves projecting a continuous latent representation to a few dimensions, bounding each dimension, and then rounding to fixed integer values. This process creates an implicit, grid-like codebook.

4.2.1. Finite Scalar Quantization Process

Given a $d$-dimensional continuous representation $z \in \mathbb{R}^d$ from the VAE encoder's output, the goal is to obtain a quantized discrete representation $\hat{z}$.

  1. Bounding Function Application: First, a bounding function $f$ is applied to each channel (dimension) of the input vector $z$. The purpose of $f$ is to map the potentially unbounded values of $z$ into a specific range such that, after rounding, each channel takes on one of $L$ unique values. The paper provides a general form for this function: $ f: z \mapsto \lfloor L/2 \rfloor \tanh(z) $ This specific form uses the hyperbolic tangent function ($\tanh$), which maps real numbers to the range $(-1, 1)$. By multiplying with $\lfloor L/2 \rfloor$, the output is scaled to a range of approximately $(-\lfloor L/2 \rfloor, \lfloor L/2 \rfloor)$. This ensures that subsequent rounding leads to the desired number of discrete levels. In the general case where the $i$-th channel can be mapped to $L_i$ values, the bounding function is applied per channel.

  2. Rounding to Integers: After applying the bounding function, the result is rounded to the nearest integer. This operation discretizes each channel independently: $ \hat{z} = \text{round}(f(z)) $ The resulting $\hat{z}$ is a vector where each component is an integer within a specific, bounded range. For example, if $L=3$, the possible values are $\{-1, 0, 1\}$.

  3. Implicit Codebook Formation: The collection of all possible $\hat{z}$ vectors forms the implicit codebook $\mathcal{C}$. Since each of the $d$ dimensions can take on $L_i$ unique values, the total size of this implicit codebook is the product of the number of levels for each dimension: $ |\mathcal{C}| = \prod_{i=1}^d L_i $ For instance, if $d=3$ and $L=3$ for all dimensions (as shown in Figure 1), then $|\mathcal{C}| = 3 \times 3 \times 3 = 27$. The codewords are simply the Cartesian product of these integer sets. This construction means the codebook is fixed and does not need to be learned.

  4. Gradient Propagation (Straight-Through Estimator): The round operation is non-differentiable. To allow gradients to flow from the decoder back to the encoder during training, FSQ uses the Straight-Through Estimator (STE). This is implemented using a function round_ste which behaves like round in the forward pass but passes gradients directly through (as if it were an identity function) in the backward pass. The implementation described uses JAX's stop_gradient operation: $ \text{round\_ste}(x) = x + \text{sg}(\text{round}(x) - x) $ Here, sg(...) denotes the stop_gradient operation. In the forward pass, round_ste(x) evaluates to x + (round(x) - x), which simplifies to round(x). In the backward pass, the gradient of sg(...) is zero, so the gradient of round_ste(x) with respect to $x$ is effectively 1.
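
Putting steps 1-4 together, a small NumPy sketch of the forward pass for a toy configuration ($d=3$, $L=3$), using the simple bounding function $f(z) = \lfloor L/2 \rfloor \tanh(z)$ from above rather than the exact bound() implementation given later:

import numpy as np

levels = np.array([3, 3, 3])            # L_i = 3 per channel, so |C| = 27
z = np.array([0.8, -1.5, 0.2])          # toy encoder output, d = 3

# Steps 1-2: bound each channel with f(z) = floor(L/2) * tanh(z), then round.
f_z = (levels // 2) * np.tanh(z)        # approx [0.66, -0.91, 0.20]
z_hat = np.round(f_z)                   # [ 1., -1.,  0.], one of the 27 grid points

# Step 3: map the code to a single token index (mixed-radix, as in codes_to_indexes below).
basis = np.concatenate(([1], np.cumprod(levels[:-1])))  # [1, 3, 9]
index = int(((z_hat + levels // 2) * basis).sum())      # shift channels to {0, 1, 2} first
print(z_hat, index)                     # [ 1. -1.  0.] 11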

Figure 1 visually distinguishes FSQ from VQ:


Figure 1: FSQ (left): the final encoder layer projects to $d$ dimensions ($d = 3$ shown). We bound each dimension of the encoder output $z$ to $L$ values ($L = 3$ shown), and then round to integers, resulting in the quantized $\hat{z}$, the nearest point in this hypercube. VQ (right): The final encoder layer projects to $d$ dimensions ($d = 7$ shown, as $d$ is typically much larger for VQ). The resulting vector $z$ is replaced with the closest vector from the codebook, $\hat{z}$, by nearest neighbor lookup.

Figure 2 further illustrates the bounding and rounding process for a single channel:


Figure 2: Left: VQ made simple: comparing implementation and optimization of VQ vs. FSQ. Right: Bounding $z$ with $f$, and rounding the output, shown for a single channel with $L = 5$.

4.2.2. Hyperparameters

FSQ has two main hyperparameters:

  • Number of channels ($d$): This defines the dimensionality of the latent representation that is subject to scalar quantization. For FSQ, $d$ is typically much smaller (e.g., < 10) compared to VQ (where $d$ can be hundreds).

  • Number of levels per channel ($\mathcal{L} = [L_1, \ldots, L_d]$): This specifies how many discrete values each of the $d$ channels can take. To obtain a fair comparison with VQ, these levels are chosen such that their product $\prod_{i=1}^d L_i$ approximates the target VQ codebook size. The paper found, as a heuristic, that $L_i \geq 5$ generally leads to good performance.

    The following are the results from Table 1 of the original paper:

| Target Size $\lvert\mathcal{C}\rvert$ | Proposed $\mathcal{L}$ |
| --- | --- |
| $2^8$ | [8, 6, 5] |
| $2^{10}$ | [8, 5, 5, 5] |
| $2^{12}$ | [7, 5, 5, 5, 5] |
| $2^{14}$ | [8, 8, 8, 6, 5] |
| $2^{16}$ | [8, 8, 8, 5, 5, 5] |

Table 1: Recommended sets of FSQ levels $\mathcal{L}$ to approximately match a given codebook size $|\mathcal{C}|$.
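
The recommended level sets can be checked directly; each product lands close to, though not exactly at, the targeted power of two:

import numpy as np

proposed_levels = {
    2**8:  [8, 6, 5],
    2**10: [8, 5, 5, 5],
    2**12: [7, 5, 5, 5, 5],
    2**14: [8, 8, 8, 6, 5],
    2**16: [8, 8, 8, 5, 5, 5],
}
for target, levels in proposed_levels.items():
  print(target, levels, int(np.prod(levels)))
# 256 -> 240, 1024 -> 1000, 4096 -> 4375, 16384 -> 15360, 65536 -> 64000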

4.2.3. Parameter Count

A notable aspect of FSQ is its reduced parameter count compared to VQ. In VQ, the codebook itself constitutes a significant number of parameters, given by $|\mathcal{C}| \cdot d$, where $|\mathcal{C}|$ is the codebook size and $d$ is the dimension of each codeword. For a typical VQ setup with $|\mathcal{C}| = 2^{12} = 4096$ and $d = 512$, this results in $4096 \times 512 \approx 2$ million parameters dedicated solely to the codebook. FSQ, by contrast, has no learnable codebook parameters, as its codebook is implicit and fixed. Furthermore, because FSQ typically uses a much smaller $d$ (e.g., $d=5$ for $|\mathcal{C}| = 2^{12}$), the final encoder layer leading to the latent representation also has fewer parameters. The authors explored adding more dense layers to compensate for this, but found no further gains, indicating that the existing encoder and decoder capacity is sufficient.
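
The arithmetic behind this comparison, using the numbers quoted above, is straightforward:

# VQ: the codebook alone contributes |C| * d learnable parameters.
vq_codebook_params = 2**12 * 512        # 4096 * 512 = 2,097,152 (~2M parameters)
# FSQ: the grid codebook is fixed, so it contributes no learnable parameters,
# and the final projection targets only d ~ 5 channels instead of 512.
fsq_codebook_params = 0
print(vq_codebook_params, fsq_codebook_params)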

4.2.4. FSQ Implementation Details

The paper provides reference code for the FSQ class and round_ste function, implemented in JAX.

# Imports needed to run the reference code below.
import jax
import jax.numpy as jnp
import numpy as np


def round_ste(z):
  """Round with straight through gradients."""
  zhat = jnp.round(z)
  return z + jax.lax.stop_gradient(zhat - z)

class FSQ:
  def __init__(self, levels: list[int]):
    self._levels = levels
    self._levels_np = np.asarray(levels)
    # _basis is used to convert multi-dimensional codes to a single index
    self._basis = np.concatenate(
        ([1], np.cumprod(self._levels_np[:-1]))
    ).astype(np.uint32)
    codebook_size = np.prod(levels)
    self.implicit_codebook = self.indexes_to_codes(
        np.arange(codebook_size)
    )

  def bound(self, z):
    """Bound z', an array of shape (..., d)."""
    eps = 1e-3 # Small epsilon to prevent tanh from reaching exactly 1 or -1
    # Calculate half_l, which determines the scaling factor for tanh output
    # It ensures that after rounding, we get L discrete levels.
    half_l = (self._levels_np - 1) * (1 - eps) / 2
    # Offset handles asymmetric levels for even L to center quantization points
    offset = jnp.where(self._levels_np % 2 == 1, 0.0, 0.5)
    # The shift factor is applied to z before tanh to ensure the desired range
    shift = jnp.tan(offset / half_l)
    # The bounding operation: scale tanh(z + shift) and then apply offset
    return jnp.tanh(z + shift) * half_l - offset

  def quantize(self, z):
    """Quantizes z, returns quantized zhat, same shape as z."""
    # Apply the bounding function and then the straight-through rounding
    quantized = round_ste(self.bound(z))
    # Calculate half_width for renormalization to [-1, 1] range
    half_width = self._levels_np // 2
    # Renormalize the quantized values to be within [-1, 1]
    # This is useful for the decoder if it expects inputs in this range
    return quantized / half_width

  def _scale_and_shift(self, zhat_normalized):
    """Scales and shifts normalized quantized values back to integer range."""
    half_width = self._levels_np // 2
    return (zhat_normalized * half_width) + half_width

  def _scale_and_shift_inverse(self, zhat):
    """Inverse of _scale_and_shift, for converting integer codes to normalized values."""
    half_width = self._levels_np // 2
    return (zhat - half_width) / half_width

  def codes_to_indexes(self, zhat):
    """Converts a 'code' (quantized vector) to an index in the codebook."""
    assert zhat.shape[-1] == len(self._levels)
    # Scale and shift zhat to its original integer range (e.g., 0 to L-1)
    zhat = self._scale_and_shift(zhat)
    # Calculate the linear index from the multi-dimensional integer code
    # This treats the code as a mixed-radix number system
    return (zhat * self._basis).sum(axis=-1).astype(jnp.uint32)

  def indexes_to_codes(self, indices):
    """Inverse of 'codes_to_indexes', converts an index back to a quantized code vector."""
    indices = indices[..., jnp.newaxis]
    # Convert linear index back to multi-dimensional integer code (0 to L-1)
    codes_non_centered = np.mod(
        np.floor_divide(indices, self._basis), self._levels_np
    )
    # Convert integer codes to normalized values [-1, 1] for the decoder
    return self._scale_and_shift_inverse(codes_non_centered)

Explanation of the FSQ class methods:

  • __init__(self, levels: list[int]):

    • Initializes the FSQ quantizer with a list of levels, where levels[i] is $L_i$, the number of discrete values for the $i$-th dimension.
    • self._basis: This is crucial for converting a multi-dimensional code (a vector $\hat{z}$) into a single integer index and vice versa. It works like a mixed-radix number system. For example, if levels is [L1, L2, L3], then _basis is [1, L1, L1*L2], and the index for a code (c1, c2, c3) is c1 * 1 + c2 * L1 + c3 * L1*L2.
    • self.implicit_codebook: Pre-computes all possible normalized code vectors by converting all possible indices to codes.
  • bound(self, z):

    • This is the bounding function ff described earlier.
    • eps = 1e-3: A small epsilon is used to prevent tanh(z + shift) from reaching exactly $\pm 1$, which can sometimes lead to numerical issues or out-of-bounds rounding.
    • half_l = (self._levels_np - 1) * (1 - eps) / 2: This calculates the effective half-range for the tanh output before rounding. If the number of levels is $L$, then (L-1)/2 gives the maximum absolute integer value.
    • offset = jnp.where(self._levels_np % 2 == 1, 0.0, 0.5): This adjusts the centering for even vs. odd numbers of levels. If $L$ is odd (e.g., 3 levels: -1, 0, 1), the center is 0. If $L$ is even (e.g., 4 levels: -1.5, -0.5, 0.5, 1.5, which round to -2, -1, 0, 1 once the offset is subtracted), the offset of 0.5 shifts the range appropriately before rounding.
    • shift = jnp.tan(offset / half_l): This calculates a small input shift for the tanh function. tanh(x) is symmetric around 0. By adding shift to z, the tanh output can be slightly asymmetric if needed, which matters for even $L$ values to get precise integer boundaries after scaling.
    • jnp.tanh(z + shift) * half_l - offset: This is the full bounding operation. It scales the tanh output to the desired range and then subtracts the offset.
  • quantize(self, z):

    • Calls self.bound(z) to get the bounded continuous values.
    • Applies round_ste() to these bounded values to get the discrete integer codes.
    • half_width = self._levels_np // 2: Calculates the half-width of the integer range (e.g., for $L=5$, values $\{-2, -1, 0, 1, 2\}$, half_width is 2).
    • quantized / half_width: Renormalizes the quantized integer values to the range $[-1, 1]$. This is a common practice to provide a standardized input range for the decoder, regardless of the number of levels. For example, for $L=5$, $\{-2, -1, 0, 1, 2\}$ becomes $\{-1, -0.5, 0, 0.5, 1\}$.
  • _scale_and_shift(self, zhat_normalized) and _scale_and_shift_inverse(self, zhat):

    • These are utility methods for converting between the normalized values (typically output by quantize and expected by the decoder) and the raw integer codes (e.g., $0, \ldots, L_i - 1$ or $-L_i/2, \ldots, L_i/2 - 1$). They are essential for codes_to_indexes and indexes_to_codes.
  • codes_to_indexes(self, zhat):

    • Converts a given quantized vector zhat (in its normalized form, i.e., values in $[-1, 1]$) into a single integer index representing its position in the implicit codebook.
    • It first denormalizes zhat using _scale_and_shift to get the raw integer coordinates.
    • Then, it uses self._basis to calculate the unique linear index. This allows any discrete vector to be represented by a single integer token, which is crucial for training transformer models.
  • indexes_to_codes(self, indices):

    • Performs the inverse operation: takes an integer index and converts it back to its corresponding normalized quantized vector (code). This is how the decoder receives its input from the discrete tokens.
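
A hypothetical end-to-end usage sketch of the FSQ class above (batch shape and level choice are illustrative, not taken from the paper's configurations):

import numpy as np
import jax.numpy as jnp

fsq = FSQ(levels=[8, 6, 5])             # implicit codebook of size 8*6*5 = 240, d = 3
z = jnp.asarray(np.random.default_rng(0).normal(size=(4, 16, 16, 3)))  # (batch, H, W, d) encoder output

zhat = fsq.quantize(z)                  # same shape as z, values normalized to [-1, 1]
tokens = fsq.codes_to_indexes(zhat)     # (4, 16, 16) integer tokens for the Stage II transformer
zhat_back = fsq.indexes_to_codes(np.asarray(tokens))  # recover the codes the decoder consumes

assert jnp.allclose(zhat, zhat_back)    # token indexes round-trip back to the quantized codes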

5. Experimental Setup

5.1. Datasets

The experiments are conducted across various tasks and utilize several standard datasets:

  1. ImageNet:

    • Source: Large-scale dataset of images organized according to the WordNet hierarchy.
    • Scale: Millions of images across thousands of categories.
    • Characteristics: High diversity in image content, object categories, and visual complexity.
    • Domain: General-purpose object recognition and image generation.
    • Usage:
      • MaskGIT (128x128 ImageNet): Used for the initial trade-off study to sweep codebook sizes and hyperparameters. Lower resolution and shorter training times compared to full MaskGIT.
      • MaskGIT (256x256 ImageNet): Used for the main image generation task, comparing FSQ and VQ performance against established baselines.
    • Example Data Sample: ImageNet images typically depict a single main object or scene, such as "cat", "dog", "car", "building", etc.
  2. NYU Depth v2:

    • Source: Dataset for depth estimation, consisting of video sequences from a Microsoft Kinect camera.
    • Scale: Over 1400 dense depth maps from various indoor scenes.
    • Characteristics: Provides aligned RGB images and dense depth maps, offering varied indoor environments.
    • Domain: Dense prediction, specifically depth estimation.
    • Usage: Used with UViM for the depth estimation task.
  3. COCO Panoptic:

    • Source: Common Objects in Context (COCO) dataset with panoptic annotations.

    • Scale: Tens of thousands of images with rich annotations for object detection, segmentation, and panoptic segmentation.

    • Characteristics: Panoptic segmentation requires segmenting every pixel in an image into a semantic category and an instance ID.

    • Domain: Dense prediction, specifically panoptic segmentation.

    • Usage: Used with UViM for the panoptic segmentation task.

    • Example Data Sample: An image of a street scene might have pixels labeled as "road" (stuff), "car_1", "car_2" (things), "person_1", "person_2" (things), etc.

      The datasets were chosen because they are standard benchmarks for the respective computer vision tasks (image generation, depth estimation, panoptic segmentation) and allow for direct comparison with existing VQ-based methods like MaskGIT and UViM. They are effective for validating the method's performance across different types of visual data and prediction goals.

5.2. Evaluation Metrics

The paper employs several standard evaluation metrics tailored to the specific tasks.

5.2.1. Frechet Inception Distance (FID)

  • Conceptual Definition: FID is a metric used to assess the quality of images generated by generative models, often comparing them to real images. It measures the "distance" between the distribution of generated images and the distribution of real images in a feature space. A lower FID score indicates better quality and diversity of generated images, implying that the generated images are more similar to real images in terms of their perceptual features.
  • Mathematical Formula: $ \text{FID} = \|\mu_1 - \mu_2\|^2_2 + \text{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
  • Symbol Explanation:
    • $\mu_1$: The mean feature vector of real images (computed from a pre-trained Inception-v3 network).
    • $\mu_2$: The mean feature vector of generated images (computed from a pre-trained Inception-v3 network).
    • $\Sigma_1$: The covariance matrix of feature vectors for real images.
    • $\Sigma_2$: The covariance matrix of feature vectors for generated images.
    • $\|\cdot\|^2_2$: Squared Euclidean distance.
    • $\text{Tr}(\cdot)$: Trace of a matrix.
    • $(\cdot)^{1/2}$: Matrix square root. The FID is calculated by extracting features from a large set of real and generated images using a pre-trained Inception-v3 model, then fitting a multivariate Gaussian to each set of features, and finally computing the Frechet distance between the two Gaussians, as sketched below.
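
A sketch of the distance computation between the two fitted Gaussians, assuming feature matrices have already been extracted by an Inception-v3 network (random placeholders are used here):

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
  """Frechet distance between Gaussians fitted to two sets of feature vectors."""
  mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
  sigma1 = np.cov(feats_real, rowvar=False)
  sigma2 = np.cov(feats_gen, rowvar=False)
  covmean = sqrtm(sigma1 @ sigma2).real   # drop tiny imaginary parts from numerical error
  return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))

rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))))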

5.2.2. Precision and Recall (for Generative Models)

  • Conceptual Definition: These metrics, adapted from classification tasks, evaluate the quality and diversity of generated samples against the real data distribution.
    • Precision: Measures the "quality" of generated samples. A high precision indicates that the generated samples are mostly realistic and avoid producing "out-of-distribution" or clearly fake images. It assesses how many generated samples fall within the support of the real data distribution.
    • Recall: Measures the "diversity" or "coverage" of the generated samples. A high recall indicates that the generative model is able to produce a wide variety of samples, covering most of the modes present in the real data distribution. It assesses how much of the real data distribution's support is covered by the generated samples.
  • Mathematical Formula: The paper refers to Sajjadi et al. (2018) for these metrics. The calculation often involves fitting K-nearest neighbor classifiers in a feature space (e.g., Inception features) to distinguish between real and generated samples and then calculating the proportion of samples that fall into the "correct" category. Explicit simple formulas are not always given, but generally involve: $ \text{Precision} = \frac{\sum_{x_g \in G} \mathbb{I}(\text{NN}(x_g, R) \le T)}{\sum_{x_g \in G} 1} $ $ \text{Recall} = \frac{\sum_{x_r \in R} \mathbb{I}(\text{NN}(x_r, G) \le T)}{\sum_{x_r \in R} 1} $
  • Symbol Explanation:
    • $G$: Set of generated samples.
    • $R$: Set of real samples.
    • $\text{NN}(x, Y)$: Distance of sample $x$ to its nearest neighbor in set $Y$.
    • $T$: A threshold distance.
    • $\mathbb{I}(\cdot)$: Indicator function, which is 1 if the condition is true, 0 otherwise. (Note: These are simplified representations of the core idea; the actual implementation by Sajjadi et al. (2018) involves more complex density estimation using k-NN graphs.) A toy version of this thresholded formulation is sketched below.
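
A toy NumPy sketch of the thresholded nearest-neighbor formulation above (a simplification; the full metric of Sajjadi et al. (2018) estimates the two supports with k-NN-based manifolds in Inception feature space):

import numpy as np

def simple_precision_recall(real, gen, threshold):
  """Toy precision/recall: fraction of samples within `threshold` of the other set."""
  def covered(queries, support):
    # distance from each query point to its nearest neighbor in the support set
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1).min(axis=1)
    return float((d <= threshold).mean())
  precision = covered(gen, real)   # generated samples that land near real data (quality)
  recall = covered(real, gen)      # real samples that are covered by generations (diversity)
  return precision, recall

rng = np.random.default_rng(0)
print(simple_precision_recall(rng.normal(size=(200, 8)), rng.normal(size=(200, 8)), 3.0))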

5.2.3. Codebook Usage

  • Conceptual Definition: This metric quantifies the efficiency of the learned discrete representation. It is the fraction of unique codewords (from the total possible codebook size) that are actually utilized at least once when encoding a given dataset (e.g., the validation set). A higher codebook usage indicates that the model is effectively leveraging its discrete capacity, avoiding "codebook collapse" where many codewords remain unused.
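
In practice this reduces to counting distinct token indices over an encoded dataset; a minimal sketch (the token values here are random placeholders):

import numpy as np

def codebook_usage(token_indices, codebook_size):
  """Fraction of codewords used at least once over a set of encoded tokens."""
  return np.unique(np.asarray(token_indices)).size / codebook_size

tokens = np.random.default_rng(0).integers(0, 1024, size=50_000)  # e.g. all tokens from a validation set
print(codebook_usage(tokens, codebook_size=1024))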

5.2.4. Compression Cost

  • Conceptual Definition: Introduced in the paper as a proxy for the modeling complexity of the discrete distribution underlying the representations. It estimates how "hard" it is for a transformer (specifically, a masked transformer) to model and predict the discrete tokens. A higher compression cost suggests that the discrete representations are more complex or unpredictable, requiring more bits to encode optimally. This is estimated by compressing the representations using entropy coding with a deterministic masking schedule.

5.2.5. Panoptic Quality (PQ)

  • Conceptual Definition: PQ is a comprehensive metric for panoptic segmentation, which unifies semantic segmentation (labeling every pixel with a category) and instance segmentation (detecting and segmenting individual objects). PQ evaluates both the segmentation quality (how well pixels are grouped) and the recognition quality (how accurately instances are identified). It combines a segmentation quality (SQ) term and a recognition quality (RQ) term.
  • Mathematical Formula: $ \text{PQ} = \frac{\sum_{(p, g) \in \text{TP}} \text{IoU}(p, g)}{|\text{TP}| + \frac{1}{2}|\text{FP}| + \frac{1}{2}|\text{FN}|} = \text{SQ} \times \text{RQ} $ where $ \text{SQ} = \frac{\sum_{(p, g) \in \text{TP}} \text{IoU}(p, g)}{|\text{TP}|} $ $ \text{RQ} = \frac{|\text{TP}|}{|\text{TP}| + \frac{1}{2}|\text{FP}| + \frac{1}{2}|\text{FN}|} $
  • Symbol Explanation:
    • $\text{TP}$: True Positives, pairs of predicted segments ($p$) and ground truth segments ($g$) that match (i.e., have an Intersection over Union (IoU) greater than 0.5).
    • $\text{FP}$: False Positives, predicted segments that do not match any ground truth segment.
    • $\text{FN}$: False Negatives, ground truth segments that are not matched by any predicted segment.
    • $\text{IoU}(p, g)$: Intersection over Union between a predicted segment $p$ and a ground truth segment $g$.
    • $\text{SQ}$: Segmentation Quality, the average IoU of all matched segments.
    • $\text{RQ}$: Recognition Quality, similar to an F1-score for detection/recognition.

5.2.6. Root Mean Squared Error (RMSE)

  • Conceptual Definition: RMSE is a common metric to measure the average magnitude of the errors between predicted values and true values. It is the square root of the average of the squared differences between prediction and actual observation. It gives a relatively high weight to large errors. A lower RMSE indicates better accuracy.
  • Mathematical Formula: $ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2} $
  • Symbol Explanation:
    • $N$: The total number of observations or data points.
    • $\hat{y}_i$: The predicted value for the $i$-th observation.
    • $y_i$: The true (actual) value for the $i$-th observation.

5.3. Baselines

The paper compares FSQ against a set of strong baselines, primarily focusing on its integration with two prominent architectures that utilize VQ:

  1. MaskGIT (Chang et al., 2022):

    • Description: A powerful image generation model that combines a convolutional VQ-GAN autoencoder (Stage I) with a masked transformer (BERT-style) for predicting discrete image tokens (Stage II). It performs iterative masked prediction for image generation.
    • Representation: Relies on VQ to discretize image representations into tokens.
    • Comparison: FSQ is evaluated as a drop-in replacement for VQ within the MaskGIT framework. The VQ baseline used is the original MaskGIT implementation, which includes an entropy loss for codebook regularization.
    • Representative Status: State-of-the-art for masked generative image transformers.
  2. UViM (Unified Vision Model) (Kolesnikov et al., 2022):

    • Description: A general architecture designed for various dense prediction computer vision tasks (e.g., depth estimation, colorization, panoptic segmentation). It uses a transformer-based VQ-VAE to model the label space of the target task, optionally conditioned on an RGB image as "context." An encoder-decoder transformer then predicts the dense label as quantized tokens.
    • Representation: Relies on VQ to quantize dense label representations.
    • Comparison: FSQ replaces VQ in UViM. The VQ baseline uses codebook splitting (Linde et al., 1980) to manage codebook usage, a common trick.
    • Representative Status: A unified approach for diverse dense vision tasks using discrete representations.
  3. ADM (Dhariwal & Nichol, 2021):

    • Description: A leading diffusion-based generative model (specifically, "guided diffusion") known for state-of-the-art image synthesis. It is used as an external reference point for image generation quality, showcasing the performance of MaskGIT (both VQ and FSQ versions) relative to another high-performing generative paradigm.
  4. DETR-R101 (Carion et al., 2020) and ColTran (Kumar et al., 2021):

    • These are task-specific baselines mentioned in the UViM results table: DETR-R101 for panoptic segmentation and ColTran for colorization. They represent strong, established methods for these dense prediction tasks, providing context for UViM's (and thus FSQ's) performance.

      These baselines are representative because they cover a range of applications (image generation, dense prediction) where discrete representations (specifically VQ) are crucial. By comparing FSQ directly within these established frameworks, the paper effectively demonstrates its viability as a simpler, competitive alternative.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results consistently demonstrate that Finite Scalar Quantization (FSQ) is a highly competitive, and often superior, alternative to Vector Quantization (VQ) across various tasks, especially given its simpler design and optimization benefits.

6.1.1. Trade-off Study (ImageNet 128x128 with MaskGIT)

The trade-off study (Figure 3) provides crucial insights into the scaling behavior and efficiency of FSQ versus VQ when varying the codebook size.


Figure 3: Characteristics and trade-offs for VQ and FSQ for $128 \times 128$ ImageNet. We see that Reconstruction FID correlates with codebook size for FSQ, and improves as we scale the codebook size. FSQ gets better Sampling FID and higher codebook usage for codebook sizes exceeding $2^{10}$, while the metrics start deteriorating for VQ.

Key observations from Figure 3:

  • Reconstruction FID (a):

    • FSQ: Shows a clear inverse correlation between codebook size and Reconstruction FID. As the codebook size increases, FSQ's ability to reconstruct the input images improves, indicated by a decreasing FID. This aligns with expectations: more discrete capacity allows for more faithful representations.
    • VQ: Struggles with larger codebooks. Its Reconstruction FID initially improves but then starts to deteriorate for codebook sizes exceeding $2^{11}$. This suggests VQ's optimization difficulties prevent it from effectively utilizing larger capacities, even with entropy regularization.
    • Marginal Outperformance at Low Codebook Sizes: For very small codebook sizes, VQ marginally outperforms FSQ, which the authors attribute to VQ's theoretically more expressive (learnable) nature when capacity is highly constrained. However, this advantage quickly diminishes.
  • Sampling FID (b):

    • FSQ: Demonstrates better Sampling FID (lower is better) for codebook sizes exceeding $2^{10}$. This indicates that the higher quality representations learned by FSQ in Stage I translate to better generative performance with the Stage II transformer. The Sampling FID for FSQ saturates around $2^{12}$ codewords, suggesting a point of diminishing returns for this specific task/resolution.
    • VQ: Consistent with its Reconstruction FID, VQ's Sampling FID starts deteriorating for codebook sizes larger than $2^{10}$, further highlighting its difficulty in scaling.
  • Codebook Usage (c):

    • FSQ: Exhibits remarkably high codebook usage, approaching 100% even for very large codebook sizes like $2^{14}$ (16k codes), without any auxiliary losses or tricks. This is a significant advantage, showcasing FSQ's inherent robustness against codebook collapse.
    • VQ: Suffers from severe codebook underutilization. Its usage drops below 50% for codebooks larger than $2^{11}$, and it effectively uses no more than $2^{10}$ codewords, irrespective of how large the codebook is defined. This confirms the long-standing problem of codebook collapse in VQ.
  • Compression Cost (d):

    • FSQ: The compression cost of FSQ representations generally increases with codebook size, indicating that the discrete distributions become richer and harder for the transformer to model. However, within the FSQ group, higher compression cost generally correlates with better (lower) Sampling FID.
    • VQ: The compression cost for VQ correlates with codebook usage: when usage drops, the code becomes easier to model (lower compression cost) because the transformer effectively has fewer distinct tokens to predict.

6.1.2. MaskGIT for Image Generation (ImageNet 256x256)

The following are the results from the table in Figure 4 of the original paper:

| Model | Source | CFG | Sampling FID†↓ | Precision†↑ | Recall↑ | Usage↑ |
| --- | --- | --- | --- | --- | --- | --- |
| MaskGIT (VQ) | Ours | 0.1 | 4.509 | 0.860 | 0.465 | 81% |
| MaskGIT (FSQ) | Ours | 0.2 | 4.534 | 0.864 | 0.453 | 100% |
| MaskGIT (VQ) | GitHub | - | 4.916 | 0.836 | 0.489 | |
| ADM (Dhariwal & Nichol, 2021) | | 1.5 | 4.59 | 0.83 | 0.52 | |


Figure 4: MaskGIT results on ImageNet 256. Top: We show the best classifier-free guidance (CFG) setting for each MaskGIT model. As a reference, we show the well established diffusion based ADM model (Dhariwal & Nichol, 2021). Bottom Left: Precision vs. Recall for various CFG weights. Bottom Right: Sampling FID for various CFG weights. We show ADM as a horizontal line, because the CFG weight 1.5 used for ADM is not comparable with our $\alpha$ in absolute terms. †We use the ADM TensorFlow Suite to evaluate all shown models, see text.

  • Competitive Metrics: Both FSQ and VQ MaskGIT models achieve very comparable Sampling FID, Precision, and Recall scores. FSQ (4.534 FID) is very close to VQ (4.509 FID), and both are competitive with the strong diffusion model ADM (4.59 FID).

  • 100% Codebook Usage for FSQ: FSQ maintains 100% codebook usage, while the VQ baseline (ours) achieves 81%. This confirms FSQ's inherent efficiency.

  • Precision-Recall Trade-offs: When sweeping the Classifier-Free Guidance (CFG) weight $\alpha$, both FSQ and VQ models cover a very similar space in the Precision & Recall plot (Figure 4, bottom left) and achieve similar minimal FID (Figure 4, bottom right). This indicates that FSQ can achieve similar quality-diversity trade-offs in generation.

  • Visual Quality: Non-cherry-picked samples (Figure 5) show that both FSQ and VQ MaskGIT models produce qualitatively similar and high-quality images across different ImageNet classes.


Figure 5: Non-cherry-picked samples from our FSQ (top) and VQ (bottom) MaskGIT models for 4 imagenet classes (330, 320, 510, 454). We show two samples per model per category. Both models get very comparable sample quality, as reflected by the metrics in Fig. 4.

6.1.3. UViM for Dense Prediction Tasks

The following are the results from Table 2 of the original paper:

NYU Depth v2

| Model | Source | RMSE† ↓ | Codebook Usage |
| --- | --- | --- | --- |
| UViM (VQ) | Ours | 0.468 ± 0.012 | 99% |
| UViM (FSQ) | Ours | 0.473 ± 0.012 | 99% |
| UViM (VQ without splitting) | Ours | 0.490 ± 0.0037 | 0.78% |
| UViM (VQ) | GitHub | 0.463 | |
| DenseDepth (Alhashim & Wonka, 2018) | | 0.465 | |

COCO Panoptic

| Model | Source | PQ† ↑ | Codebook Usage |
| --- | --- | --- | --- |
| UViM (VQ) | Ours | 43.4 ± 0.0008 | 100% |
| UViM (FSQ) | Ours | 43.2 ± 0.0014 | 100% |
| UViM (VQ without context) | Ours | 39.0 ± 0.0023 | 99% |
| UViM (FSQ without context) | Ours | 40.2 ± 0.0019 | 99% |
| UViM (VQ) | GitHub | 43.1 | |
| DETR-R101 (Carion et al., 2020) | | 45.1 | |

ImageNet Colorization

| Model | Source | FID-5k† ↓ | Codebook Usage |
| --- | --- | --- | --- |
| UViM (VQ) | Ours | 16.90 ± 0.056 | 100% |
| UViM (FSQ) | Ours | 17.55 ± 0.057 | 100% |
| UViM (VQ) | GitHub | 16.99 ± 0.057 | |
| ColTran (Kumar et al., 2021) | | 19.37 | |

Table 2: UViM results for the three tasks. For each, we show results in the corresponding metric averaged over three runs with std. dev. (as in UViM). We show the numbers reported by the reference GitHub repository, as well as one well established baseline per task. For our models, we show Codebook usage. For Depth Estimation, we train an ablation where we do not employ the codebook splitting in VQ. Overall, FSQ obtains competitive but marginally worse results on all tasks. We use the UViM GitHub evaluation suite.

  • Overall Competitiveness: Across all three dense prediction tasks (NYU Depth v2, COCO Panoptic, ImageNet Colorization), FSQ achieves competitive metrics compared to VQ. While FSQ is marginally worse in some cases (e.g., RMSE for Depth, FID for Colorization), the differences are small (e.g., 0.473 vs 0.468 RMSE for depth, 17.55 vs 16.90 FID for colorization).

  • Visual Results: Figure 6 for depth estimation and Figure 7 (Appendix A.2) for panoptic segmentation and colorization show that FSQ and VQ lead to comparable visual outputs.

Figure 6: Samples from UViM for the depth estimation task. Other tasks in Appendix A.2. We observe that VQ and FSQ lead to comparable samples. VQ without splitting leads to jagged edges.

Figure 7: Visualization for panoptic segmentation (first two rows) and colorization (last two rows).

6.1.4. Ablation on VAE Context Input and Codebook Splitting

  • Impact of VAE Context (Panoptic Segmentation): When the VAE context (RGB image input to the VAE encoder/decoder) is removed in UViM for panoptic segmentation, both FSQ and VQ models perform worse. However, the FSQ-based model degrades less (PQ of 40.2 for FSQ without context vs. 39.0 for VQ without context), suggesting FSQ might be more robust in scenarios with less side information.
  • FSQ's Independence from Codebook Splitting: A crucial finding comes from the depth estimation task: disabling codebook splitting (a common trick used to keep VQ codebooks from collapsing) leads to significantly worse RMSE (0.490 vs. 0.468) and a drastic drop in codebook usage to 0.78%. This manifests visually as "jagged edges" in the predictions (Figure 6). In stark contrast, FSQ reaches 99% codebook usage without requiring any auxiliary algorithms or tricks such as codebook splitting, highlighting its inherent stability and simpler optimization (see the usage-measurement sketch below).
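To make the codebook-usage numbers above concrete: usage is typically measured as the fraction of codebook entries hit at least once when encoding a held-out set. The following is a minimal sketch of such a measurement; the function name and the synthetic data are illustrative, not taken from the paper.

```python
import numpy as np

def codebook_usage(code_indices, codebook_size):
    """Fraction of codebook entries used at least once (illustrative helper)."""
    used = np.unique(np.asarray(code_indices).ravel())
    return len(used) / codebook_size

# Synthetic example: a collapsed quantizer touches only a few entries,
# while a healthy one spreads its indices over the whole codebook.
collapsed = np.random.randint(0, 8, size=10_000)    # only 8 of 1024 codes appear
healthy = np.random.randint(0, 1024, size=10_000)   # indices spread over all codes
print(codebook_usage(collapsed, 1024))  # ~0.008, i.e. well under 1% usage
print(codebook_usage(healthy, 1024))    # close to 1.0
```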

6.1.5. Visualizing Representations

  • "Average" Representations (Figure 9): Sampling codes according to marginal histograms for FSQ-GAN and VQ-GAN both produce "soup of patches," implying individual codes might not represent very abstract concepts.

  • Stitching Representations (Figure 8): When the top half of one image's representation is stitched to the bottom half of another's, both the FSQ and VQ decoders map the resulting sharp transition in latent space to a smooth transition in pixel space when decoding to RGB.

  • Conclusion on Semantics: The study found no evidence that a particular code represents a fixed visual concept in either quantizer, and both behaved very similarly. This contradicts the common argument that VQ-VAEs learn "semantically meaningful codes" in a fixed sense, suggesting instead that meaning arises from combinations of codes and the decoder weights (a sketch of the marginal-histogram sampling used in this analysis follows).
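As a concrete illustration of the marginal-histogram sampling behind Figure 9, the following is a hypothetical helper (names, shapes, and data are ours, not the paper's): it estimates the marginal code distribution from indices collected on a validation set and samples each spatial position independently.

```python
import numpy as np

def sample_from_marginal(code_indices, codebook_size, grid_hw, seed=0):
    """Build a 'fake' token grid by sampling every spatial position i.i.d.
    from the marginal code histogram (illustrative helper, cf. Figure 9)."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(np.asarray(code_indices).ravel(), minlength=codebook_size)
    probs = counts / counts.sum()
    h, w = grid_hw
    return rng.choice(codebook_size, size=(h, w), p=probs)

# Synthetic example: indices gathered over a validation set, a 16x16 token grid.
indices = np.random.randint(0, 1024, size=100_000)
fake_grid = sample_from_marginal(indices, 1024, (16, 16))
# Decoding such a grid with the GAN decoder gives the "soup of patches" in Figure 9.
```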

Figure 8: Analyzing representations: we take two random images A, B from the validation set (first two columns). We compare stitching the top half of A to the bottom half of B in pixel space (center) to stitching the corresponding representations obtained by the FSQ-GAN and VQ-GAN (last two columns) in latent space. Note how the GAN decoder maps the sharp transitions in representation space to smooth transitions in pixel-space.

Figure 9: Analysing "fake" representations: Left 3 columns: randomly sampling codes according to the marginal histogram, for FSQ (top) and VQ (bottom). Right 3 columns: Creating a representation sharing code across all spatial location, where we pick the 3 most common codes according to the marginal histogram (left-to-right). 该图像是图表,展示了FSQ(上方)和VQ(下方)在随机抽样代码和共享代码表示下的表现差异,左侧三列为随机采样的代码,右侧三列展示了在所有空间位置共享代码的表现,反映出FSQ在表示上相较于VQ的优势。

Figure 9: Analysing "fake" representations: Left 3 columns: randomly sampling codes according to the marginal histogram, for FSQ (top) and VQ (bottom). Right 3 columns: Creating a representation sharing code across all spatial location, where we pick the 3 most common codes according to the marginal histogram (left-to-right).

6.2. Ablation Studies / Parameter Analysis

6.2.1. Effect of Quantization Levels $\mathcal{L}$ on FSQ

The paper conducted an ablation study on the choice of quantization levels ($\mathcal{L}$) for FSQ models within the trade-off study on 128×128 ImageNet.

Figure 10: Exploring different configurations of quantization levels per channel $\mathcal{L}$. The color and marker indicate the smallest $L_i$ used for a given model (see legend).

  • Observation: Figure 10 shows that having a minimal number of levels per dimension ($L_i$) is important. The study found that choosing $L_i < 5$ (e.g., using 3 levels for some dimensions, indicated by purple markers) consistently leads to subpar Sampling FID.
  • Heuristic: Based on this, the authors recommend using $L_i \geq 5$ for all channels. This gives the encoder sufficient granularity in each dimension to spread information effectively. A minimal sketch of how a set of levels $\mathcal{L}$ defines the implicit codebook follows this list.
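The sketch below illustrates how a choice of levels $\mathcal{L}$ defines the implicit codebook; it is not the authors' implementation. Each channel is bounded and rounded to one of $L_i$ values, and the per-channel codes are combined into a single token index of a codebook of size $\prod_i L_i$. For simplicity, codes are mapped to $\{0, \dots, L_i - 1\}$ and the straight-through estimator used for training is omitted.

```python
import numpy as np

def fsq_codes(z, levels):
    """Bound each latent channel and round it to one of L_i values.

    Simplified illustration: codes live in {0, ..., L_i - 1}; the actual method
    uses a symmetric range and a straight-through estimator for gradients.
    """
    L = np.asarray(levels, dtype=np.float64)
    bounded = (np.tanh(z) + 1.0) / 2.0 * (L - 1.0)  # each channel in (0, L_i - 1)
    return np.round(bounded).astype(int)

def fsq_token_index(codes, levels):
    """Combine per-channel codes into one index of the implicit product codebook."""
    L = np.asarray(levels)
    basis = np.cumprod(np.concatenate(([1], L[:-1])))  # mixed-radix place values
    return (codes * basis).sum(axis=-1)

levels = [8, 5, 5, 5]                    # implicit codebook size: 8*5*5*5 = 1000
z = np.random.randn(4, len(levels))      # four example latent vectors
codes = fsq_codes(z, levels)
tokens = fsq_token_index(codes, levels)  # integer ids usable by a Stage II transformer
print(codes, tokens)
```

Note how the heuristic $L_i \geq 5$ is respected by the example configuration: every channel has at least 5 levels, while the product still reaches the desired codebook size.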

6.2.2. Classifier-Free Guidance (CFG) for MaskGIT

  • Method: Inspired by diffusion models, the authors applied Classifier-Free Guidance (CFG) to MaskGIT. During training, 10% of class labels are masked so the model also learns an unconditional distribution. During inference, the logits are interpolated as $l' = l_c + \alpha(l_c - l_\varnothing)$, where $l_c$ are the conditional logits, $l_\varnothing$ the unconditional logits, and $\alpha$ the CFG inference weight (see the sketch after this list).
  • Effect: CFG allows for a trade-off between precision and recall, enabling fine-tuning of the generative output's quality and diversity. Sweeping α\alpha showed that both FSQ and VQ achieve similar Precision-Recall curves and minimal FID values, indicating that FSQ is just as amenable to advanced generative guidance techniques as VQ.
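The interpolation above is straightforward to apply at sampling time; the following is a minimal sketch with illustrative names and shapes.

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, alpha):
    """Classifier-free guidance on token logits: l' = l_c + alpha * (l_c - l_0).

    alpha = 0 recovers purely conditional sampling; larger alpha pushes samples
    toward the class-conditional mode (higher precision, lower recall).
    """
    return logits_cond + alpha * (logits_cond - logits_uncond)

# Toy usage with hypothetical shapes [batch, sequence, vocabulary]:
logits_c = np.random.randn(2, 256, 1024)   # logits with the class token provided
logits_u = np.random.randn(2, 256, 1024)   # logits with the class token masked
guided = cfg_logits(logits_c, logits_u, alpha=0.3)
```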

6.2.3. Masking Ratio for MaskGIT

  • Problem: The original MaskGIT's cosine schedule for masking ratios could result in very few tokens being masked ($N_M = 1$) during some training steps, leading to instability in the Stage II transformer loss.
  • Solution: The authors introduced a lower bound on the minimal masking ratio, setting $r_{\min} = 1 - \frac{2}{\pi}\arccos(0.45)$, which ensures that $N_M > 0.45\,S$ tokens are masked (where $S$ is the sequence length) at every training step. They found that any value above 0.2 already helps with stabilization, but used 0.45 for their experiments (see the sketch below).
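The following is a minimal sketch of one way to enforce such a floor on the number of masked tokens, assuming the MaskGIT-style cosine schedule mask_ratio = cos(pi/2 * u) with u drawn uniformly; clamping u is an illustrative choice, not necessarily the authors' exact parameterization.

```python
import math
import random

def sample_num_masked(seq_len, min_frac=0.45):
    """Sample how many of `seq_len` tokens to mask under a cosine schedule,
    while guaranteeing at least `min_frac * seq_len` tokens are masked.

    Cosine schedule (as in MaskGIT): mask_ratio = cos(pi/2 * u), u ~ U[0, 1].
    Clamping u to [0, u_max] keeps the ratio above `min_frac` (illustrative
    parameterization; the text expresses the same bound via r_min).
    """
    u_max = 2.0 / math.pi * math.acos(min_frac)  # cos(pi/2 * u_max) == min_frac
    u = random.uniform(0.0, u_max)
    mask_ratio = math.cos(math.pi / 2.0 * u)
    return max(1, math.ceil(mask_ratio * seq_len))

# Example: with a 16x16 = 256-token grid, at least ~116 tokens are always masked.
print(sample_num_masked(256))
```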

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces Finite Scalar Quantization (FSQ) as a powerful and significantly simpler alternative to Vector Quantization (VQ) for learning discrete representations in VQ-VAEs. By projecting VAE representations into a few bounded dimensions and rounding them to fixed scalar values, FSQ creates an implicit, grid-based codebook. This simple design fundamentally avoids the complex optimization challenges, auxiliary losses, and codebook collapse issues that plague VQ. The extensive experimental evaluation, integrating FSQ into state-of-the-art architectures like MaskGIT (for image generation) and UViM (for dense prediction tasks), demonstrates that FSQ achieves competitive, and often superior, performance across various benchmarks. Its ability to leverage large codebook sizes effectively and achieve near 100% codebook utilization without any special tricks represents a notable simplification in the field of discrete representation learning.

7.2. Limitations & Future Work

The paper implicitly acknowledges some limitations:

  • Marginal Performance Differences: While FSQ is competitive, it sometimes exhibits marginally worse performance (e.g., 0.5-3% in metrics) compared to highly optimized VQ baselines, especially for very low codebook sizes where VQ's learnable codebook might offer more expressiveness.

  • Theoretical Expressiveness: VQ, in its purest form, has the theoretical advantage of learning arbitrary Voronoi partitions, which could be more flexible. FSQ's fixed grid is inherently less flexible. However, the paper effectively argues that the VAE's encoder and decoder can compensate for this, making the theoretical advantage of VQ less pronounced in practice due to optimization difficulties.

    The authors explicitly suggest:

  • Exploring FSQ in More Applications: The success of FSQ across diverse tasks in this paper encourages further exploration of its applicability in even broader domains where VQ is currently used. This could include audio, video, text, or multimodal representations, potentially simplifying many existing models.

7.3. Personal Insights & Critique

This paper offers a refreshing perspective by challenging the assumption that complex, learnable components are always superior. The core insight that a powerful encoder/decoder pair can "absorb" the non-linearity typically delegated to a complex, learnable VQ codebook is quite profound. It suggests a paradigm shift: instead of investing heavy optimization efforts into making the quantization step adaptive, simplify the quantization and let the surrounding neural network learn to effectively map to and from this simpler discrete space.

A key strength of FSQ is its inherent robustness. Codebook collapse in VQ is a persistent headache, requiring a continuous stream of patches and heuristics. FSQ eliminates this problem by design, making it much easier to train and scale. This robustness translates directly to improved research efficiency and broader applicability, as practitioners no longer need to spend significant time fine-tuning quantization-specific hyperparameters and auxiliary losses.

Critically, the paper's emphasis on empirical results across diverse, complex tasks (image generation, depth estimation, panoptic segmentation) strongly validates its claims. The detailed trade-off study is particularly insightful, clearly illustrating where VQ breaks down and where FSQ shines.

One area for potential future exploration, building on the paper's insights, could be:

  • Adaptive FSQ Levels: While FSQ's fixed grid is its strength, could there be benefits to allowing the levels ($\mathcal{L}$) or the bounding function ($f$) to be slightly adaptive or learned during training, perhaps with a minimal auxiliary loss, to fine-tune the grid to the data distribution without reintroducing full VQ complexity? This might bridge the small gap where VQ marginally outperforms FSQ.

  • Hardware Efficiency: The simplified nature and lower dimensionality of FSQ could translate to better hardware efficiency, especially for deployment on edge devices or in high-throughput scenarios, as there are fewer parameters and simpler operations compared to complex nearest-neighbor lookups in high-dimensional spaces. This aspect could be explicitly evaluated.

  • Impact on Latent Space Semantics: While the paper suggests individual codes don't hold fixed semantic meaning for both FSQ and VQ, a deeper dive into the emergent semantic properties of the combined FSQ codes compared to VQ could be valuable.

    Overall, FSQ represents a significant step towards simplifying discrete representation learning, offering a more stable, robust, and elegant solution that could accelerate progress in various generative and predictive AI applications.
