
HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs

Published: 08/07/2025

TL;DR Summary

HiD-VAE is a framework that enhances generative recommendation by learning hierarchically disentangled item representations, addressing the semantic flatness and representation entanglement of existing methods and thereby improving recommendation accuracy and diversity.

Abstract

Recommender systems are indispensable for helping users navigate the immense item catalogs of modern online platforms. Recently, generative recommendation has emerged as a promising paradigm, unifying the conventional retrieve-and-rank pipeline into an end-to-end model capable of dynamic generation. However, existing generative methods are fundamentally constrained by their unsupervised tokenization, which generates semantic IDs suffering from two critical flaws: (1) they are semantically flat and uninterpretable, lacking a coherent hierarchy, and (2) they are prone to representation entanglement (i.e., ``ID collisions''), which harms recommendation accuracy and diversity. To overcome these limitations, we propose HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations. First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags, yielding more uniform and disentangled IDs. Crucially, the trained codebooks can predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation. Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap. This mechanism not only resolves the critical ID collision problem but also promotes recommendation diversity by ensuring a more comprehensive utilization of the item representation space. These high-quality, disentangled IDs provide a powerful foundation for downstream generative models. Extensive experiments on three public benchmarks validate HiD-VAE's superior performance against state-of-the-art methods. The code is available at https://anonymous.4open.science/r/HiD-VAE-84B2.

In-depth Reading

1. Bibliographic Information

1.1. Title

The full title is "HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs". This title clearly indicates the paper's focus on improving the interpretability and quality of item representations in generative recommender systems by introducing hierarchical structures and disentanglement.

1.2. Authors

The authors are:

  • Dengzhao Fang (Jilin University, Changchun, China)

  • Jingtong Gao (City University of Hong Kong, Hong Kong, China)

  • Chengcheng Zhu (Nanjing University, Nanjing, China)

  • Yu Li (Jilin University, Changchun, China)

  • Xiangyu Zhao (City University of Hong Kong, Hong Kong, China)

  • Yi Chang (Jilin University, Changchun, China)

    The affiliations suggest a collaborative effort between researchers from Jilin University, City University of Hong Kong, and Nanjing University, indicating expertise in machine learning, recommender systems, and possibly natural language processing due to the use of LLMs.

1.3. Journal/Conference

The paper is listed as a preprint on arXiv. Its reference list cites venues such as Proc. of KDD and Proc. of AAAI, suggesting the authors are targeting top-tier conferences in artificial intelligence and data mining. As a preprint, it is currently undergoing peer review or awaiting formal publication.

1.4. Publication Year

The paper was posted to arXiv at 2025-08-06T16:45:05Z, i.e., on August 6, 2025.

1.5. Abstract

The abstract introduces the growing field of generative recommendation, which aims to unify retrieval and ranking into an end-to-end model. It identifies two major limitations of existing generative methods, stemming from their unsupervised tokenization for generating semantic IDs:

  1. Semantic flatness and uninterpretability: The semantic IDs lack a coherent hierarchy, making them hard to understand.

  2. Representation entanglement (ID collisions): Distinct items are mapped to the same semantic ID, negatively impacting recommendation accuracy and diversity.

    To address these issues, the paper proposes HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations:

  1. Hierarchically-supervised quantization: This process aligns discrete codes with multi-level item tags, leading to more uniform and disentangled IDs. The trained codebooks can predict hierarchical tags, providing an interpretable semantic path for each recommendation.

  2. Uniqueness loss: This novel loss function directly penalizes latent space overlap between distinct items, resolving the ID collision problem and promoting recommendation diversity by better utilizing the item representation space.

    The abstract claims that these high-quality, disentangled IDs provide a strong foundation for downstream generative models. Extensive experiments on three public benchmarks are reported to validate HiD-VAE's superior performance against state-of-the-art methods.

The official source link for the paper is https://arxiv.org/abs/2508.04618. The PDF link is https://arxiv.org/pdf/2508.04618v2.pdf. It is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve lies within the rapidly evolving field of generative recommendation. Traditional recommender systems often employ a retrieve-and-rank pipeline, where candidate items are first retrieved and then ranked. Generative recommendation emerged as a promising paradigm to unify this pipeline into an end-to-end model capable of dynamically generating item identifiers (semantic IDs). This approach has the potential to offer more flexible and dynamic recommendations than traditional methods.

However, existing generative methods are fundamentally constrained by their unsupervised tokenization process, which generates these semantic IDs. This process suffers from two critical flaws that limit the efficacy of the entire pipeline:

  1. Semantic Flatness and Uninterpretability: Current semantic IDs are often "flat," meaning they do not inherently possess a coherent, human-understandable hierarchical structure. This lack of hierarchy makes the models uninterpretable or "black boxes," where the semantic meaning behind a generated ID is opaque. This hinders understanding why a particular item was recommended and makes it difficult to control the generation process.

  2. Representation Entanglement and "ID Collisions": The unsupervised vector quantization (VQ) techniques used often lead to representation entanglement, where distinct items are erroneously mapped to the same semantic ID. This phenomenon, termed "ID collisions," is a critical flaw because it means the system cannot differentiate between unique items. This not only harms recommendation accuracy (as the system might recommend the wrong item under a shared ID) but also severely limits recommendation diversity (as many items might be effectively collapsed into a few identical representations). While some existing methods attempt post-hoc fixes, these are often ineffective as they don't address the root cause of the entanglement in the latent space.

    The importance of this problem stems from the need for more interpretable, accurate, and diverse recommender systems. As item catalogs grow immense, users need systems that not only suggest relevant items but also explain why they are relevant, fostering trust and a better user experience. Overcoming ID collisions is crucial for the fundamental correctness and utility of generative recommenders.

The paper's entry point is to directly address these two fundamental challenges at the representation learning stage, arguing that high-quality, structured semantic IDs are foundational for effective generative recommendation.

2.2. Main Contributions / Findings

The paper's primary contributions are encapsulated in its proposed framework, HiD-VAE, and its core technical innovations:

  1. Introduction of HiD-VAE for Hierarchically Structured and Disentangled Representations: The paper proposes HiD-VAE as a novel framework specifically designed for generative recommendation. It learns interpretable and disentangled item representations by integrating hierarchical structure and preventing ID collisions. This directly tackles the limitations of semantically flat and entangled IDs in previous generative models.

  2. Hierarchically-Supervised Quantization Process: HiD-VAE introduces a novel hierarchically-supervised process for vector quantization. This is achieved through two components:

    • Tag Alignment Loss ($\mathcal{L}_{\mathrm{align}}$): This loss explicitly aligns the discrete codes from each RQ-VAE layer with multi-level item tags using a contrastive learning approach. This ensures that each layer captures a specific level of category semantics.
    • Tag Prediction Loss ($\mathcal{L}_{\mathrm{pred}}$): This loss guides each VAE layer to predict the corresponding hierarchical tag, further ensuring that the learned semantic IDs are structured and mapped to an interpretable hierarchy. This allows the trained codebooks to predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation.
  3. Novel Uniqueness Loss for Disentanglement: To combat representation entanglement and prevent ID collisions, HiD-VAE incorporates a novel uniqueness loss ($\mathcal{L}_{\mathrm{unique}}$). This loss directly penalizes latent space overlap between distinct items that might otherwise be assigned identical semantic ID sequences. This mechanism ensures a more comprehensive utilization of the item representation space, resolves the ID collision problem, and promotes recommendation diversity.

  4. LLM-based Hierarchical Tag Generation: To enhance applicability to real-world datasets often lacking ground-truth labels, the paper introduces an effective LLM-based approach for automatically generating high-quality hierarchical tags via a "retrieval-then-classification" pipeline. This makes the hierarchically-supervised approach feasible even for datasets without pre-defined category structures.

  5. State-of-the-Art Performance: Extensive experiments on three public benchmarks (Beauty, Sports, KuaiRand) validate HiD-VAE's superior performance over state-of-the-art traditional sequential, Transformer-based, and existing generative recommendation methods in terms of Recall@K and NDCG@K. The results demonstrate substantial improvements, confirming the efficacy of the proposed innovations.

    In summary, HiD-VAE provides a powerful foundation for downstream generative models by producing semantic IDs that are both interpretable (due to explicit hierarchical structure) and disentangled (due to the uniqueness loss), leading to more accurate, diverse, and trustworthy recommendations.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the HiD-VAE framework, it's essential to grasp several core concepts in recommender systems, deep learning, and generative modeling.

  • Recommender Systems (RS): Systems designed to predict user preferences for items and suggest relevant ones. They are crucial for navigating information overload on digital platforms.
    • Sequential Recommendation: A sub-field of RS that focuses on modeling the dynamic nature of user preferences based on their chronological interaction history (e.g., a sequence of items a user has interacted with). The goal is to predict the next item in the sequence.
  • Deep Learning Architectures:
    • Recurrent Neural Networks (RNNs), specifically Gated Recurrent Units (GRUs): Neural networks designed to process sequential data. GRUs are a type of RNN that can capture dependencies over long sequences by using gating mechanisms to control the flow of information.
    • Convolutional Neural Networks (CNNs): Neural networks primarily used for processing grid-like data such as images, but can also be adapted for sequential data by treating sequences as 1D "images."
    • Transformers: A powerful deep learning architecture that relies entirely on self-attention mechanisms to weigh the importance of different parts of the input sequence. They have become the state-of-the-art in many sequence modeling tasks, including natural language processing (NLP) and sequential recommendation, due to their ability to capture long-range dependencies efficiently.
      • Self-Attention: A mechanism that allows a model to weigh the importance of different elements in an input sequence when processing a specific element. It calculates a "score" for each element's relevance to the current element, forming a weighted sum of values.
  • Generative Models: Machine learning models designed to generate new data instances that resemble the training data.
    • Generative Recommendation: A paradigm where the recommender system directly generates the identifiers (or semantic IDs) of recommended items, rather than just scoring existing candidates. This unifies the retrieve-and-rank pipeline into an end-to-end process.
    • Variational Autoencoder (VAE): A type of generative model that learns a compressed, continuous latent space representation of input data. It consists of an encoder that maps input data to a probability distribution in the latent space, and a decoder that reconstructs the input from samples drawn from this latent space. VAEs are optimized using a reconstruction loss (to ensure fidelity) and a KL divergence term (to regularize the latent space and ensure its structure).
      • Encoder ($E(\cdot)$): Maps an input feature vector $x$ to a continuous latent representation $z = E(x)$.
      • Decoder ($D(\cdot)$): Reconstructs the input as $\hat{x} = D(z)$ from the latent representation.
      • Latent Space: A lower-dimensional, abstract representation of the input data learned by the VAE.
      • Reconstruction Loss: Measures how well the decoder can reconstruct the original input from the latent representation. Often Mean Squared Error (MSE).
      • KL Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution diverges from a second, expected probability distribution. In VAEs, it regularizes the latent space to follow a prior distribution (e.g., a standard normal distribution).
  • Vector Quantization (VQ): A technique used to introduce discreteness into a continuous latent space. It maps a continuous latent vector to its nearest neighbor in a finite, learnable codebook.
    • Codebook ($C$): A finite set of discrete codewords (also called prototypes or embeddings).
    • Codeword ($c_k$): An individual vector within the codebook.
    • VQ-VAE: A VAE extended with Vector Quantization in its latent space. It replaces the continuous latent vector with a discrete codeword from a codebook.
    • Quantization Function ($q(z)$): $ q(z) = c_{k^{*}} \quad \mathrm{where} \quad k^{*} = \arg\min_j \|z - c_j\|_2 $ This function finds the codeword $c_{k^{*}}$ in the codebook $C$ that is closest to the continuous latent vector $z$ in terms of Euclidean distance (see the sketch after this list).
  • Residual-Quantized VAE (RQ-VAE): An enhancement of VQ-VAE that uses a cascade of multiple quantizers. Instead of quantizing the entire latent vector at once, each subsequent quantizer operates on the residual error from the previous stage. This allows for a more efficient and fine-grained discrete representation by progressively refining the quantization.
    • Residual Error: The difference between the original latent vector (or previous residual) and the codeword chosen by the current quantizer.
    • Cumulative Quantized Embedding ($z_q^{(l)}$): The sum of selected codewords from all preceding quantizer layers up to layer $l$.
  • Semantic IDs: Discrete identifiers (often sequences of codewords from a codebook) that are learned to represent items, ideally distilling their rich semantic features into numerical codes.
  • ID Collisions: A critical problem in vector quantization where multiple distinct items are mapped to the same discrete ID sequence. This means the system cannot differentiate between these items, leading to problems in accuracy and diversity.
  • Interpretability: The degree to which a human can understand the decisions or internal workings of a machine learning model. In recommender systems, it often refers to understanding why an item was recommended.
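To make the quantization step in the formula above concrete, here is a minimal PyTorch sketch of the nearest-codeword lookup; the batch size, codebook size, and dimensionality are illustrative choices, not values from the paper.

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Nearest-codeword lookup: k* = argmin_j ||z - c_j||_2.

    z:        (batch, d) continuous latents from the encoder.
    codebook: (K, d) learnable codewords.
    """
    dists = torch.cdist(z, codebook)   # (batch, K) pairwise Euclidean distances
    indices = dists.argmin(dim=-1)     # index k* of the closest codeword
    return codebook[indices], indices  # quantized vectors q(z) and their IDs

# Toy usage with illustrative sizes: K = 256 codewords of dimension 32.
codebook = torch.randn(256, 32)
z = torch.randn(8, 32)
q_z, ids = quantize(z, codebook)       # q_z: (8, 32), ids: (8,)
```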

3.2. Previous Works

The paper contextualizes its contributions by discussing the evolution of sequential recommendation and the recent shift towards generative models.

  • Traditional Sequential Models (Discriminative): These models typically learn item embeddings and then use them to score candidate items.

    • GRU4Rec [13]: A pioneering work that applied Gated Recurrent Units (GRUs) to model user sessions for next-item prediction.
    • Caser [35]: Utilized Convolutional Neural Networks (CNNs) to capture local sequential patterns, treating sequences as "images" of item embeddings.
    • HGN [26]: Proposed a Hierarchical Gating Network to integrate both short-term and long-term user interests in sequences.
    • NextItNet [45]: Employed dilated convolutional layers to effectively capture longer-range dependencies in user sequences.
    • Common Limitation: These models are discriminative; they learn to distinguish between relevant and irrelevant items. They often rely on an external Approximate Nearest Neighbor (ANN) search index (like Faiss [5]) to retrieve candidates, which separates representation learning from the final retrieval/ranking process.
  • Transformer-based Models: These models leverage the self-attention mechanism for more sophisticated sequence modeling.

    • SASRec [19]: One of the first models to adapt the Transformer's self-attention mechanism for sequential recommendation, using a decoder-only structure for next-item prediction.
    • BERT4Rec [34]: Adapted the BERT model's bidirectional pre-training paradigm (using a masked language modeling objective) to learn user behavior representations.
    • $\mathsf{S}^3$-Rec [49]: Further refined Transformer-based models by incorporating self-supervised pre-training tasks to improve representation quality.
    • Common Limitation: While powerful, they are still fundamentally discriminative and often inherit the retrieve-and-rank pipeline issues.
  • Generative Recommendation Models: These models aim to directly generate item identifiers autoregressively.

    • TIGER [31]: A pioneering work that introduced Semantic IDs—discrete item representations learned via Residual-Quantized VAEs (RQ-VAE)—and used a Transformer-based sequence model to autoregressively predict these IDs.
    • LC-Rec [48]: Also leverages RQ-VAE-based identifiers but focuses on aligning language and collaborative semantics during the tokenization process.
    • LETTER [39]: Another work that enhances the quantization stage by injecting collaborative signals into the item tokenization.
    • VQ-Rec [14]: Learns vector-quantized item representations for sequential recommenders, often using simpler Product Quantization (PQ) or VQ-VAE. The paper adapts this model for a generative framework by training a Transformer on its semantic IDs.
    • Common Limitations of existing Generative Models:
      1. Flat, Uninterpretable Semantic Space: The learned semantic IDs lack an explicit hierarchical structure, making them opaque "black boxes" [32, 50]. Any hierarchy is an implicit byproduct, not explicitly supervised.
      2. Representation Entanglement ("ID Collisions"): Prone to ID collisions where distinct items are mapped to the same identifier [31, 44]. Existing solutions are often post-hoc (e.g., appending an integer to colliding IDs), which doesn't address the root cause of entanglement in the latent space and can corrupt the semantic integrity of the IDs.

3.3. Technological Evolution

The field of sequential recommendation has evolved significantly:

  1. Early Deep Learning (RNNs/CNNs): Models like GRU4Rec and Caser demonstrated the power of deep neural networks to capture sequential patterns. These were foundational but often struggled with long-range dependencies and complex interactions.

  2. Attention-based/Transformer Models: SASRec and BERT4Rec marked a paradigm shift, leveraging self-attention to effectively model complex dependencies, leading to significant performance gains and becoming the new standard. These models still primarily function in a discriminative manner, scoring items based on learned embeddings.

  3. Generative Recommendation: More recently, models like TIGER, LC-Rec, and VQ-Rec pushed the boundary by reframing recommendation as an autoregressive generation task. This paradigm aims to directly generate item identifiers, unifying the retrieve-and-rank pipeline. This approach directly models the probability distribution over items, potentially leading to more flexible and dynamic recommendations.

    HiD-VAE fits into this evolution by addressing critical shortcomings of the current generative recommendation paradigm. While generative models offer a powerful way to model user preferences and the entire item universe, their reliance on unsupervised tokenization for semantic IDs has created new challenges: lack of interpretability and ID collisions. HiD-VAE represents the next step, aiming to refine the ID generation process itself to produce semantic IDs that are both hierarchically structured (for interpretability) and disentangled (for accuracy and diversity).

3.4. Differentiation Analysis

Compared to the main methods in related work, HiD-VAE introduces core innovations:

  • Vs. Traditional & Transformer-based Discriminative Models: HiD-VAE operates within the generative recommendation paradigm, which is fundamentally different from discriminative models. Instead of learning item embeddings and then scoring candidates, HiD-VAE generates structured semantic IDs directly. This avoids the disconnect between representation learning and retrieval/ranking inherent in discriminative approaches and allows for direct modeling of the item distribution. The paper's results show generative models, and especially HiD-VAE, generally outperform discriminative methods.

  • Vs. Existing Generative Models (TIGER, LC-Rec, VQ-Rec): This is where HiD-VAE makes its most direct advancements.

    • Interpretability: Previous generative models (like TIGER and LC-Rec which use RQ-VAE-based IDs, or VQ-Rec which uses PQ) learn semantic IDs through unsupervised vector quantization. This means any hierarchical structure is an implicit byproduct at best, leading to semantically flat and uninterpretable IDs. HiD-VAE explicitly addresses this with its hierarchically-supervised quantization process (via Tag Alignment Loss and Tag Prediction Loss), which aligns discrete codes with multi-level item tags, yielding a transparent, human-readable semantic path.

    • Disentanglement and ID Collisions: Existing generative models suffer significantly from representation entanglement and ID collisions. While some (like TIGER) might use post-hoc fixes (e.g., appending integers), these are "ineffective" as they don't solve the root problem in the latent space and can corrupt semantic integrity. HiD-VAE directly tackles this at the source with its novel uniqueness loss, which explicitly penalizes latent space overlap for distinct items that would otherwise collide. This leads to a nearly injective mapping from items to IDs, drastically reducing collision rates and improving diversity.

    • Applicability to Diverse Datasets: HiD-VAE introduces an LLM-based approach for generating hierarchical tags for datasets that lack them (like KuaiRand). This broadens the applicability of its hierarchically-supervised method, which would otherwise be limited to datasets with pre-existing rich categorical structures.

      In essence, HiD-VAE builds upon the generative paradigm but fundamentally re-engineers the semantic ID generation process to overcome its inherent limitations of uninterpretability and entanglement, thereby providing a more robust, accurate, and transparent foundation for generative recommendation.

4. Methodology

4.1. Principles

The core idea behind HiD-VAE is to learn item representations that are both interpretable and disentangled, specifically designed for generative recommendation. The theoretical basis and intuition stem from addressing the two critical flaws identified in existing generative methods: semantic flatness and representation entanglement (ID collisions).

  1. Interpretability through Hierarchy: The intuition is that human understanding often relies on hierarchical categorization. By explicitly supervising the vector quantization process with multi-level item tags, HiD-VAE forces the learned discrete codes to align with a meaningful, coarse-to-fine semantic hierarchy. This makes the semantic IDs traceable and human-readable, transforming opaque numerical codes into clear semantic paths (e.g., "Clothing → Topwear → Dress").

  2. Disentanglement through Uniqueness: The intuition here is that each unique item should ideally have a unique and distinct representation in the latent space to avoid ambiguity and maximize diversity. ID collisions occur when distinct items are represented too closely or identically. HiD-VAE addresses this by introducing a uniqueness loss that actively pushes apart the latent representations of items that would otherwise collapse into the same semantic ID. This ensures that the representation space is comprehensively utilized and that each item maintains its distinct identity.

    By integrating these two principles, HiD-VAE aims to create high-quality, disentangled IDs that not only improve recommendation accuracy and diversity but also provide traceable and interpretable semantic paths, fostering user trust and better control over the recommendation process.

4.2. Core Methodology In-depth (Layer by Layer)

The HiD-VAE framework operates in two distinct stages: Offline Hierarchical ID Learning (Stage 1) and Online Interpretable Recommendation (Stage 2).

4.2.1. Preliminary Concepts for HiD-VAE

Before delving into the HiD-VAE specific innovations, the paper first reviews the foundational concepts upon which it builds:

  • Variational Autoencoder (VAE): A generative model consisting of an encoder $E(\cdot)$ and a decoder $D(\cdot)$. The encoder maps an input feature vector $x \in \mathbb{R}^d$ to a continuous latent representation $z = E(x) \in \mathbb{R}^d$, from which the decoder reconstructs the input as $\hat{x} = D(z)$. The model is optimized by minimizing a reconstruction loss (e.g., Mean Squared Error) and a KL divergence term that regularizes the latent space.

  • Vector Quantization (VQ): To introduce discreteness, VQ maps a continuous latent vector $z$ to its nearest neighbor in a finite, learnable codebook $C = \{c_k\}_{k=1}^{K}$, where each codeword $c_k \in \mathbb{R}^d$. This is represented by the quantization function: $ q(z) = c_{k^{*}} \quad \mathrm{where} \quad k^{*} = \arg\min_j \|z - c_j\|_2 \quad (1) $ Here, $z$ is the continuous latent vector from the encoder, $C$ is the codebook, and $c_j$ are the codewords in the codebook. The function $q(z)$ selects the codeword $c_{k^{*}}$ that has the minimum Euclidean distance ($\|\cdot\|_2$) to $z$.

  • Residual-Quantized VAE (RQ-VAE): RQ-VAE extends VQ by using a cascade of $L$ quantizers. Instead of a single quantization step, each subsequent quantizer operates on the residual error from the preceding stage, allowing for a more efficient and fine-grained discrete representation. The cumulative quantized embedding up to layer $l$ is defined as $z_q^{(l)} = \sum_{j=1}^{l} e^{(j)}$, where $e^{(j)}$ is the selected codeword from the $j$-th quantizer. HiD-VAE adapts this cascaded architecture from RQ-VAE to implement its hierarchical item tokenizer, adding explicit hierarchical supervision and a novel disentanglement mechanism.
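As a companion to the description above, here is a minimal sketch of the residual quantization cascade, reusing the same nearest-codeword lookup as in the earlier VQ example; the layer count and sizes are illustrative.

```python
import torch

def residual_quantize(z0: torch.Tensor, codebooks: list[torch.Tensor]):
    """Cascade of L quantizers: layer l quantizes the residual r_{l-1}.

    Returns per-layer indices y^(l) and cumulative embeddings
    z_q^(l) = sum_{j<=l} e^(j), as used by the hierarchical tokenizer.
    """
    residual, cumulative = z0, torch.zeros_like(z0)
    indices, cumulatives = [], []
    for codebook in codebooks:
        idx = torch.cdist(residual, codebook).argmin(dim=-1)
        e = codebook[idx]             # selected codeword e^(l)
        residual = residual - e       # r_l = r_{l-1} - e^(l)
        cumulative = cumulative + e   # z_q^(l)
        indices.append(idx)
        cumulatives.append(cumulative)
    return indices, cumulatives

# Toy usage: L = 3 layers, each with its own 256-entry codebook of dim 32.
codebooks = [torch.randn(256, 32) for _ in range(3)]
ids, z_q = residual_quantize(torch.randn(8, 32), codebooks)
```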

4.2.2. Problem Formulation

Let $\mathcal{U}$ be the set of users and $\mathcal{I}$ the set of items. Each user $u \in \mathcal{U}$ has a chronological interaction sequence $S_u = (i_1, i_2, \ldots, i_T)$, where $T = |S_u|$ is the sequence length and $i_r \in \mathcal{I}$. The goal of sequential recommendation is to predict the next item $i_{T+1}$ that user $u$ is most likely to interact with.

In the generative recommendation approach, this is reframed as generating a unique ID for the target item. Each item $i$ is represented by a structured, hierarchical semantic ID $y_i = (y_i^{(1)}, y_i^{(2)}, \ldots, y_i^{(L)})$, where $L$ is the number of levels (ID length), and each $y_i^{(l)}$ is an index in a level-specific codebook $C^{(l)}$ of size $K_l$. These IDs are learned to match human-understandable item categories. The task then becomes an autoregressive prediction problem: $ p(y_{t+1} \mid y_1, \ldots, y_t) = \prod_{l=1}^{L} p(y_{t+1}^{(l)} \mid y_1, \ldots, y_t, y_{t+1}^{(1)}, \ldots, y_{t+1}^{(l-1)}) \quad (2) $ Here, $p(y_{t+1} \mid y_1, \ldots, y_t)$ is the probability of the next item's full semantic ID given the history of previous item semantic IDs. This is factorized into a product of conditional probabilities over the $L$ levels of the hierarchical ID, meaning the prediction of $y_{t+1}^{(l)}$ depends on the user history $(y_1, \ldots, y_t)$ and the already predicted codes for the current item at coarser levels $(y_{t+1}^{(1)}, \ldots, y_{t+1}^{(l-1)})$.

4.2.3. Framework Overview

The HiD-VAE framework cleanly separates representation learning from sequential modeling across two stages, as illustrated in Figure 2.


Figure 2: The HiD-VAE framework. HiD-VAE first learns hierarchical semantic IDs and disentangled tag IDs via a supervised VAE (a), which a Transformer then uses for interpretable sequential recommendation (b).

  • Stage 1: Offline Hierarchical ID Learning (Figure 2a)

    • Goal: Train the HiD-VAE to learn hierarchical and disentangled semantic IDs.
    • Process: For each item defined by content features $\mathbf{x}$ (e.g., text embeddings), HiD-VAE learns a hierarchical semantic ID $\mathbf{y}$. This is achieved by optimizing a composite objective function that combines standard VAE losses with the paper's novel hierarchically-supervised and uniqueness losses.
    • Output: A frozen, high-quality item tokenizer capable of converting any item into its unique, interpretable, and disentangled semantic ID.
  • Stage 2: Online Interpretable Recommendation (Figure 2b)

    • Goal: Use the learned semantic IDs for interpretable sequential recommendation.

    • Process: The pre-trained and frozen HiD-VAE acts as the item tokenizer. For each user's interaction history, every item is mapped to its corresponding semantic ID sequence. A Transformer-based sequential model, equipped with hierarchy-aware semantic embeddings (to preserve the structured semantics from Stage 1), is then trained to autoregressively predict the semantic ID of the next item. During inference, constrained decoding ensures that generated IDs correspond to valid items.

    • Output: Recommended items with interpretable semantic paths.

      This two-stage design allows each component to be specialized for its specific task.

4.2.4. Hierarchical Tag Generation

Many real-world datasets lack pre-defined hierarchical tags, which are crucial for HiD-VAE's hierarchically-supervised approach. To overcome this, the paper introduces a two-stage LLM-based approach for automatic hierarchical tag generation. This reframes the task as a "retrieval-then-classification" pipeline.

  • Candidate Tag Retrieval:

    • Goal: Create a small, relevant set of candidate tags for each hierarchical level.
    • Process: For each level $l$, a tag pool $\mathcal{T}^{(l)}$ is constructed (from existing labels or manual annotations). For an item with content text $x_{\mathrm{text}}$, its semantic embedding $\mathbf{o} = \mathcal{E}_{\mathrm{sent}}(x_{\mathrm{text}})$ is computed using a pre-trained sentence encoder. Then, for each level $l$, the top-$K$ most semantically similar tags are retrieved from $\mathcal{T}^{(l)}$ by performing a nearest-neighbor search against pre-computed tag embeddings.
    • Formula: The candidate set $C_{\mathrm{cand}}^{(l)}$ is defined as: $ C_{\mathrm{cand}}^{(l)} = \operatorname{Top\text{-}K}_{t \in \mathcal{T}^{(l)}}\left(\mathrm{sim}(\mathbf{o}, \mathcal{E}_{\mathrm{sent}}(t))\right) \quad (3) $ Here, $\mathbf{o}$ is the semantic embedding of the item's text, $\mathcal{E}_{\mathrm{sent}}(t)$ is the semantic embedding of a tag $t$, and $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity function (e.g., cosine similarity). $\operatorname{Top\text{-}K}$ selects the $K$ tags with the highest similarity (see the sketch after this list).
  • LLM-based Tag Classification:

    • Goal: Select the most suitable tag from the candidate set using an LLM.
    • Process: With a small candidate set $C_{\mathrm{cand}}^{(l)}$, an LLM (e.g., Qwen3-235B-A22B-Instruct-2507) is prompted to act as a classifier. The prompt includes the item's content $x_{\mathrm{text}}$, any previously determined higher-level tags $\{t^{(j)}\}_{j=1}^{l-1}$ (to enforce hierarchical consistency), and the candidate set $C_{\mathrm{cand}}^{(l)}$. The LLM predicts the most probable tag $t^{(l)*}$ for the current level $l$.
    • Formula: $ t^{(l)*} = \underset{t \in C_{\mathrm{cand}}^{(l)}}{\arg\max}\, P_{\mathrm{LLM}}\left(t \mid x_{\mathrm{text}}, \{t^{(j)}\}_{j=1}^{l-1}, C_{\mathrm{cand}}^{(l)}\right) \quad (4) $ Here, $P_{\mathrm{LLM}}$ is the probability assigned by the LLM to a tag $t$ from the candidate set, given the item text and its higher-level tags. The arg max selects the tag with the highest probability.
    • Implementation: This "retrieval-then-classification" approach ensures generated tags are valid and consistent, leveraging the LLM's contextual understanding without risking hallucination from open-ended generation. (Details on prompt design in Appendix B).
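To make the retrieval-then-classification pipeline concrete, here is a minimal Python sketch; the sentence-encoder embeddings are assumed to be precomputed, and `llm_pick_tag` is a hypothetical stand-in for the LLM call in Eq. (4), not the paper's actual prompt or model interface.

```python
import numpy as np

def top_k_candidates(item_emb, tag_embs, tags, k=5):
    """Eq. (3): retrieve the K tags most similar to the item embedding o."""
    sims = tag_embs @ item_emb / (
        np.linalg.norm(tag_embs, axis=1) * np.linalg.norm(item_emb) + 1e-8)
    return [tags[i] for i in np.argsort(-sims)[:k]]

def llm_pick_tag(item_text, parent_tags, candidates):
    """Hypothetical stand-in for Eq. (4): the paper prompts an LLM with the
    item text, the already-fixed higher-level tags, and the candidate set,
    and takes the most probable candidate."""
    return candidates[0]  # placeholder; a real system would query the LLM

def generate_hierarchy(item_text, item_emb, tag_pools, tag_pool_embs, L=3):
    """Walk the hierarchy top-down, fixing one tag per level."""
    chosen = []
    for l in range(L):
        cands = top_k_candidates(item_emb, tag_pool_embs[l], tag_pools[l])
        chosen.append(llm_pick_tag(item_text, chosen, cands))
    return chosen
```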

4.2.5. Hierarchical Representation Learning

This section details how HiD-VAE addresses codebook collapse and ID collisions by employing hierarchically-supervised quantization.

Given an item with feature vector $x$ (its text embedding) and ground-truth hierarchical category tags with indices $\{c^{(l)}\}_{l=1}^{L}$ and embeddings $\{t^{(l)}\}_{l=1}^{L}$:

  1. The item features $x$ are passed through an encoder $E(\cdot)$ to produce an initial continuous latent representation $z_0 = E(x)$.

  2. An RQ-VAE initiates a layered quantization process. At each layer $l \in \{1, \dots, L\}$, a quantizer $q_l$ takes the residual from the previous layer $r_{l-1}$ (with $r_0 = z_0$) and identifies the closest codeword $e^{(l)} = q_l(r_{l-1})$ from codebook $C^{(l)}$. The residual for the next layer is computed as $r_l = r_{l-1} - e^{(l)}$. The cumulative quantized embedding up to layer $l$ is $z_q^{(l)} = \sum_{j=1}^{l} e^{(j)}$.

    To supervise this process, two main losses are introduced:

  • Tag Alignment Loss ($\mathcal{L}_{\mathrm{align}}^{(l)}$):

    • Goal: Ensure that the codebook at layer $l$ captures the semantics of the $l$-th level of the category hierarchy by making the cumulative quantized embedding $z_q^{(l)}$ similar to its corresponding tag embedding and dissimilar to others.
    • Process: The ground-truth tag embedding $t^{(l)}$ is projected into the item's latent space using a layer-specific projector $P_l(\cdot)$. This loss pulls $z_q^{(l)}$ towards its corresponding projected tag embedding $P_l(t^{(l)})$ while pushing it away from other tag embeddings in the same mini-batch. This is a contrastive loss.
    • Formula: $ \mathcal{L}_{\mathrm{align}}^{(l)} = -\log \frac{\exp(\operatorname{sim}(z_q^{(l)}, P_l(t^{(l)})) / \tau)}{\sum_{j=1}^{B}\exp(\operatorname{sim}(z_q^{(l)}, P_l(t^{(j)})) / \tau)} \quad (5) $ Here, $\operatorname{sim}(\cdot, \cdot)$ denotes cosine similarity, $\tau$ is a temperature hyperparameter (controlling the sharpness of the distribution), $B$ is the batch size, $z_q^{(l)}$ is the cumulative quantized embedding for the item at layer $l$, and $P_l(t^{(l)})$ is the projected embedding of the ground-truth tag for that item at layer $l$. The denominator sums over all projected tag embeddings $t^{(j)}$ in the batch, including the positive pair.
  • Tag Prediction Loss ($\mathcal{L}_{\mathrm{pred}}^{(l)}$):

    • Goal: Directly classify the cumulative quantized embedding to its corresponding hierarchical tag, especially to handle varying semantic depths and numbers of categories across layers.
    • Process: Each layer employs a tailored classifier $C_l$. Deeper layers use larger hidden dimensions and progressively higher dropout rates to handle the progressively richer cumulative quantized embedding $z_q^{(l)}$ taken as input. This ensures classifiers for deeper layers have greater capacity for nuanced semantics.
    • Formula: $ \mathcal{L}_{\mathrm{pred}}^{(l)} = \mathrm{CrossEntropy}(C_l(z_q^{(l)}), c^{(l)}) \quad (6) $ Here, $\mathrm{CrossEntropy}$ is the standard cross-entropy loss, $C_l(\cdot)$ is the layer-specific classifier for layer $l$, $z_q^{(l)}$ is the cumulative quantized embedding for the item at layer $l$, and $c^{(l)}$ is the ground-truth tag index for the item at layer $l$. Optional extensions like focal loss (with $\gamma = 2.0$) can be applied for imbalanced classes.

This dual supervision (alignment and prediction) ensures that the learned semantic IDs are structured and mapped to an interpretable hierarchy.
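The following is a minimal PyTorch sketch of the two supervision losses for one layer $l$, assuming the projected tag embeddings $P_l(t^{(l)})$ are already computed for the batch; sizes are illustrative. As in Eq. (5), every other in-batch tag embedding serves as a negative.

```python
import torch
import torch.nn.functional as F

def tag_alignment_loss(z_q, projected_tags, tau=0.07):
    """Eq. (5): in-batch contrastive loss. Row i of projected_tags is the
    projected ground-truth tag embedding P_l(t^(l)) for item i, so the
    positives sit on the diagonal of the similarity matrix."""
    z = F.normalize(z_q, dim=-1)
    t = F.normalize(projected_tags, dim=-1)
    logits = z @ t.T / tau                  # (B, B) cosine similarities / tau
    labels = torch.arange(z.size(0))        # item i matches tag i
    return F.cross_entropy(logits, labels)

def tag_prediction_loss(classifier, z_q, tag_indices):
    """Eq. (6): classify z_q^(l) into the level-l tag vocabulary."""
    return F.cross_entropy(classifier(z_q), tag_indices)

# Toy usage for one layer (B = 16 items, d = 32, 50 tags at this level).
B, d, num_tags = 16, 32, 50
classifier = torch.nn.Linear(d, num_tags)
z_q, proj_tags = torch.randn(B, d), torch.randn(B, d)
tags = torch.randint(0, num_tags, (B,))
layer_loss = tag_alignment_loss(z_q, proj_tags) + tag_prediction_loss(classifier, z_q, tags)
```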

4.2.6. Disentanglement via Uniqueness Loss

To combat ID collision, HiD-VAE introduces a uniqueness loss that operates on the continuous, pre-quantization latent vectors.

  • Goal: Penalize representation overlap between pairs of distinct items that are assigned identical semantic ID sequences within a training batch. This directly encourages an injective mapping from items to IDs.
  • Process: For two different items $x_i$ and $x_j$ in a batch, with initial latent representations $z_{0,i}$ and $z_{0,j}$ respectively, if their full semantic ID sequences collide (i.e., $y_i = y_j$), a margin-based penalty is applied.
  • Formula: $ \mathcal{L}_{\mathrm{unique}} = \frac{1}{|\mathcal{P}|}\sum_{(i,j)\in \mathcal{P}}\max\left(0, \frac{z_{0,i}\cdot z_{0,j}}{\|z_{0,i}\|_2\|z_{0,j}\|_2} - m\right) \quad (7) $ Here, $\mathcal{P} = \{(i,j) \mid i \neq j, y_i = y_j\}$ is the set of all distinct item pairs $(i,j)$ within the batch that have colliding semantic IDs. The term $\frac{z_{0,i}\cdot z_{0,j}}{\|z_{0,i}\|_2\|z_{0,j}\|_2}$ is the cosine similarity between the initial latent representations of the two colliding items, and $m$ is a margin hyperparameter. The $\max(0, \cdot)$ ensures that the loss is only incurred when the cosine similarity exceeds the margin $m$, pushing these latent representations apart. The loss is averaged over all such colliding pairs in the batch.
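A minimal PyTorch sketch of Eq. (7), assuming each item's full semantic ID sequence is available as a row of integer codes; the margin follows the $m = 0.9$ reported in the appendix, while the other sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def uniqueness_loss(z0: torch.Tensor, ids: torch.Tensor, margin: float = 0.9):
    """Eq. (7): penalize cosine similarity above the margin between the
    pre-quantization latents of distinct items whose full ID sequences collide.

    z0:  (B, d) initial latents z_0.
    ids: (B, L) integer semantic ID sequences for the same batch.
    """
    z = F.normalize(z0, dim=-1)
    cos = z @ z.T                                        # pairwise cosine sims
    # P: pairs (i, j), i != j, whose full L-level ID sequences are identical.
    same_id = (ids[:, None, :] == ids[None, :, :]).all(dim=-1)
    pairs = same_id & ~torch.eye(ids.size(0), dtype=torch.bool)
    if not pairs.any():
        return z0.new_zeros(())                          # no collisions: loss 0
    return F.relu(cos[pairs] - margin).mean()

# Toy usage: items 0 and 1 collide on the same 3-level ID (1, 5, 7).
z0 = torch.randn(4, 32)
ids = torch.tensor([[1, 5, 7], [1, 5, 7], [2, 0, 3], [4, 4, 4]])
loss = uniqueness_loss(z0, ids)
```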

4.2.7. Interpretable Generative Recommendation (Stage 2)

With high-quality IDs generated in Stage 1, Stage 2 uses these semantic IDs for sequential recommendation. The challenge is modeling these structured, multi-level IDs for autoregressive generation without losing their semantics or generating invalid combinations.

HiD-VAE proposes two innovations for the Transformer-based autoregressive model:

  • Hierarchy-Aware Semantic Embeddings:

    • Goal: Preserve the structured semantics of the IDs and enhance interpretability.
    • Process: A custom embedding layer is designed. Each token (level) in an item's semantic ID is first mapped to its corresponding tag text. This tag text is then encoded into a semantic vector using a pre-trained embedding model. These semantic vectors are then concatenated with learnable ID embeddings (corresponding to the codebook index) and type embeddings specific to each hierarchical level ($l \in \{1, \ldots, L\}$).
    • Benefit: This enriches the feature representation by integrating explicit semantic information, allowing the model to capture the coarse-to-fine semantic path encoded in the ID, thereby enhancing interpretability and modeling capability (a minimal sketch follows after this list).
  • Constrained Decoding for Validity:

    • Goal: Ensure that generated IDs always correspond to real, existing items.
    • Process: During inference, a constrained decoding strategy is implemented. All valid semantic ID prefixes are pre-computed and stored in an efficient data structure (e.g., a trie). During token-by-token generation (autoregressive prediction of each level $y_t^{(l)}$), the model's output vocabulary is dynamically masked. This masking only allows tokens that form valid, existing prefixes based on the pre-computed structure.
    • Benefit: This pruning mechanism guarantees that the generated output always corresponds to an actual item in the inventory, preventing hallucination of non-existent ID combinations and improving practical applicability.
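Below is a minimal sketch of the hierarchy-aware embedding layer described in the first bullet above; for simplicity it assumes a single shared code space across levels and precomputed tag-text vectors, whereas the paper uses level-specific codebooks. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

class HierarchyAwareEmbedding(nn.Module):
    """Concatenates, per ID token: (i) a frozen semantic vector of the token's
    tag text, (ii) a learnable ID embedding for its codebook index, and
    (iii) a learnable type embedding for its hierarchical level."""

    def __init__(self, num_codes, num_levels, tag_text_embs, d_id=64, d_lvl=16):
        super().__init__()
        # tag_text_embs: (num_codes, d_sem), precomputed by a pre-trained
        # sentence encoder from each code's tag text; kept frozen.
        self.register_buffer("tag_sem", tag_text_embs)
        self.id_emb = nn.Embedding(num_codes, d_id)
        self.level_emb = nn.Embedding(num_levels, d_lvl)

    def forward(self, code_ids, levels):
        # code_ids, levels: (batch, seq_len) integer tensors.
        return torch.cat(
            [self.tag_sem[code_ids], self.id_emb(code_ids), self.level_emb(levels)],
            dim=-1)

# Toy usage: two users, each history flattened into 2 items x 3 levels.
emb = HierarchyAwareEmbedding(num_codes=256, num_levels=3,
                              tag_text_embs=torch.randn(256, 128))
tokens = torch.randint(0, 256, (2, 6))
levels = torch.tensor([[0, 1, 2, 0, 1, 2]] * 2)
x = emb(tokens, levels)  # (2, 6, 128 + 64 + 16)
```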

4.2.8. Optimization

The framework is trained in two distinct stages, each with its own objective function.

  • Stage 1: HiD-VAE Training:

    • Goal: Train the HiD-VAE encoder and decoder, codebooks, projectors, and classifiers to learn hierarchical and disentangled item representations.
    • Objective Function: The HiD-VAE is trained end-to-end by minimizing a composite loss function: $ \mathcal{L}_{\mathrm{HiD\text{-}VAE}} = \mathcal{L}_{\mathrm{recon}} + \beta_{\mathrm{commit}}\mathcal{L}_{\mathrm{commit}} + \beta_{\mathrm{sup}}\sum_{l=1}^{L}(\mathcal{L}_{\mathrm{align}}^{(l)} + \mathcal{L}_{\mathrm{pred}}^{(l)}) + \beta_{\mathrm{unique}}\mathcal{L}_{\mathrm{unique}} \quad (8) $ Here:
      • $\mathcal{L}_{\mathrm{recon}}$: The reconstruction loss between the input $x$ and the decoder output $\hat{x}$ (see Appendix C.1 for details): $ \mathcal{L}_{\mathrm{recon}} = \|x - \hat{x}\|_2^2 \quad (10) $ This is the Mean Squared Error (MSE) between the original input $x$ and its reconstruction $\hat{x}$ from the decoder.
      • $\mathcal{L}_{\mathrm{commit}}$: The vector quantization commitment loss, which regularizes the encoder's output space by encouraging the encoder's continuous output $z_e(x)$ to "commit" to its chosen codeword $z_q(x)$. It uses a stop-gradient (sg) operator to isolate the gradient flow (see Appendix C.1 for details): $ \mathcal{L}_{\mathrm{commit}} = \|z_e(x) - \mathrm{sg}[z_q(x)]\|_2^2 \quad (11) $ This loss pushes the encoder's output $z_e(x)$ to be close to the quantized code $z_q(x)$ chosen by the VQ module. The stop-gradient operator sg prevents gradients from flowing from $z_q(x)$ back to $z_e(x)$ through this term, ensuring that $z_e(x)$ learns to predict codes effectively without directly influencing the codebook updates via this path.
      • $\sum_{l=1}^{L}(\mathcal{L}_{\mathrm{align}}^{(l)} + \mathcal{L}_{\mathrm{pred}}^{(l)})$: The sum of the Tag Alignment Loss (Formula 5) and Tag Prediction Loss (Formula 6) across all $L$ quantization layers. These are the hierarchical supervision losses.
      • $\mathcal{L}_{\mathrm{unique}}$: The Uniqueness Loss (Formula 7), which promotes disentanglement.
      • $\beta_{\mathrm{commit}}$, $\beta_{\mathrm{sup}}$, $\beta_{\mathrm{unique}}$: Hyperparameters that balance the contribution of each loss component to the overall objective.
  • Stage 2: Recommender Training:

    • Goal: Train the Transformer-based recommender using the frozen semantic IDs from Stage 1.

    • Objective Function: The Transformer-based recommender is trained using next-token prediction with cross-entropy loss. For a user history $S_u$ with ID sequences $(y_1, \ldots, y_T)$, the objective maximizes the likelihood of the next item's ID sequence: $ \mathcal{L}_{\mathrm{rec}} = -\sum_{u\in \mathcal{U}}\sum_{t=1}^{|S_u| - 1}\log p(y_{t+1} \mid y_1, \ldots, y_t) \quad (9) $ Here, $\mathcal{L}_{\mathrm{rec}}$ is the cross-entropy loss used to train the Transformer. It sums the negative log-likelihoods of predicting the true next item's semantic ID $y_{t+1}$ given the user's previous sequence of semantic IDs $(y_1, \ldots, y_t)$, for all users $u$ and all prediction steps $t$ in their sequences (a small sketch follows at the end of this subsection).

    • Enhancement: As mentioned previously, each token in the semantic ID is mapped to its corresponding tag text and encoded into a semantic vector using a pre-trained embedding model, enriching the representation with explicit semantic information.

      This two-stage approach first establishes an interpretable and disentangled representation space, then leverages it to model sequential user behavior effectively. (Details on implementation in Appendix C.2).
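As a small illustration of Eq. (9), the sketch below computes the next-token cross-entropy when the $L$ codes of the target item's semantic ID are flattened into successive prediction steps; the shapes and a shared vocabulary across levels are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def rec_loss(logits: torch.Tensor, target_ids: torch.Tensor):
    """Eq. (9) for one prediction step t: sum of per-level negative
    log-likelihoods of the true next item's codes y_{t+1}^{(1..L)}.

    logits:     (B, L, V) decoder outputs for the L levels of the next item.
    target_ids: (B, L) ground-truth code indices.
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids.reshape(-1), reduction="sum")

# Toy usage (batch B = 8, L = 3 levels, vocabulary V = 256 per level).
B, L, V = 8, 3, 256
loss = rec_loss(torch.randn(B, L, V), torch.randint(0, V, (B, L)))
```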

4.2.9. Algorithmic Overview (Algorithm 1)

The pseudocode for the HiD-VAE framework provides a unified view of the entire two-stage process.

Algorithm 1: HiD-VAE Framework

Require: Item features $\{x_i\}_{i\in \mathcal{I}}$, optional tags $\{\{c_i^{(l)}\}_{l=1}^{L}\}_{i\in \mathcal{I}}$, user sequences $\{S_u\}_{u\in \mathcal{U}}$
Ensure: Predicted next items for each user

1: // Stage 0: Hierarchical Tag Generation (if tags not available)
2: for each item $i \in \mathcal{I}$ do
3:   $v \leftarrow \mathcal{E}_{\mathrm{sent}}(x_{i,\mathrm{text}})$  ▷ Semantic embedding of item text
4:   for $l = 1$ to $L$ do
5:     $C_{\mathrm{cand}}^{(l)} \leftarrow$ Top-K retrieval from $\mathcal{T}^{(l)}$ via $\mathrm{sim}(v, \mathcal{E}_{\mathrm{sent}}(t))$  ▷ Candidate tag retrieval (Eq. 3)
6:     $c_i^{(l)} \leftarrow \arg\max_{t\in C_{\mathrm{cand}}^{(l)}} P_{\mathrm{LLM}}(t \mid x_{i,\mathrm{text}}, \{c_i^{(j)}\}_{j=1}^{l-1}, C_{\mathrm{cand}}^{(l)})$  ▷ LLM-based tag classification (Eq. 4)
7:   end for
8: end for

9: // Stage 1: HiD-VAE Training
10: Initialize encoder $E$, decoder $D$, codebooks $\{C^{(l)}\}_{l=1}^{L}$, projectors $\{P_l\}_{l=1}^{L}$, classifiers $\{C_l\}_{l=1}^{L}$
11: while not converged do
12:   Sample batch $\{x_b, \{t_b^{(l)}\}_{l=1}^{L}\}_{b=1}^{B}$
13:   $z_0 \leftarrow E(x_b)$  ▷ Initial latent representation
14:   $r_0 \leftarrow z_0$  ▷ Initialize residual
15:   $z_q \leftarrow$ empty list  ▷ To store cumulative quantized embeddings
16:   for $l = 1$ to $L$ do
17:     $e^{(l)}, y^{(l)} \leftarrow q_l(r_{l-1})$  ▷ Quantize residual, get codeword $e^{(l)}$ and index $y^{(l)}$
18:     $r_l \leftarrow r_{l-1} - e^{(l)}$  ▷ Compute next residual
19:     $z_q^{(l)} \leftarrow \sum_{j=1}^{l} e^{(j)}$  ▷ Cumulative quantized embedding
20:     Compute $\mathcal{L}_{\mathrm{align}}^{(l)}$ (Eq. 5) and $\mathcal{L}_{\mathrm{pred}}^{(l)}$ (Eq. 6)
21:   end for
22:   $\hat{x} \leftarrow D(z_q^{(L)})$  ▷ Reconstruct input from final cumulative quantized embedding
23:   Compute $\mathcal{L}_{\mathrm{recon}}$ (Eq. 10), $\mathcal{L}_{\mathrm{commit}}$ (Eq. 11), and $\mathcal{L}_{\mathrm{unique}}$ (Eq. 7, on colliding pairs)
24:   Update parameters via $\mathcal{L}_{\mathrm{HiD\text{-}VAE}}$ (Eq. 8)
25: end while
26: Freeze HiD-VAE; map all items to IDs $\{y_i\}_{i\in \mathcal{I}}$

27: // Stage 2: Recommender Training
28: Initialize Transformer with hierarchy-aware embeddings
29: while not converged do
30:   Sample user batch $\{S_u\}$
31:   Map to ID sequences $\{(y_1, \ldots, y_T)\}$
32:   Enrich embeddings with tag semantics and level types
33:   Compute $\mathcal{L}_{\mathrm{rec}}$ via autoregressive cross-entropy (Eq. 9)
34:   Update Transformer parameters
35: end while

36: // Inference
37: for each test sequence $(y_1, \ldots, y_T)$ do
38:   Autoregressively generate $y_{T+1}$ with constrained decoding (mask invalid prefixes)
39:   Map to item via pre-computed ID-to-item index
40: end for

Explanation of Algorithm 1:

  • Lines 1-8 (Stage 0: Hierarchical Tag Generation): This optional preliminary stage is executed once if ground-truth hierarchical tags are not available. For each item, it first computes a semantic embedding of its text. Then, for each hierarchical level $l$, it retrieves a set of candidate tags from a predefined tag pool. Finally, an LLM is used to classify the most appropriate tag for that level, taking into account the item text and any higher-level tags already determined. This process populates the ground-truth tags $\{c_i^{(l)}\}$ needed for Stage 1.
  • Lines 9-26 (Stage 1: HiD-VAE Training): This is the core representation learning stage.
    • The encoder $E$, decoder $D$, codebooks $C^{(l)}$, projectors $P_l$, and classifiers $C_l$ are initialized.
    • In a training loop, batches of items and their (ground-truth or generated) hierarchical tags are sampled.
    • The item's initial latent representation $z_0$ is obtained from the encoder.
    • The RQ-VAE process iteratively quantizes the residual $r_{l-1}$ at each layer $l$, producing a codeword $e^{(l)}$ (and its index $y^{(l)}$) and an updated residual $r_l$. The cumulative quantized embedding $z_q^{(l)}$ is also formed.
    • At each layer, the Tag Alignment Loss ($\mathcal{L}_{\mathrm{align}}^{(l)}$) and Tag Prediction Loss ($\mathcal{L}_{\mathrm{pred}}^{(l)}$) are computed to provide hierarchical supervision.
    • After processing all layers, the final cumulative quantized embedding $z_q^{(L)}$ is passed to the decoder $D$ to reconstruct the input $\hat{x}$.
    • The standard reconstruction loss ($\mathcal{L}_{\mathrm{recon}}$), commitment loss ($\mathcal{L}_{\mathrm{commit}}$), and the uniqueness loss ($\mathcal{L}_{\mathrm{unique}}$) are computed. The uniqueness loss is specifically applied to pairs of distinct items that generate identical semantic ID sequences.
    • All model parameters are updated using the combined HiD-VAE loss.
    • Once training converges, the HiD-VAE is frozen, and all items are mapped to their final, unique hierarchical semantic IDs $\{y_i\}$.
  • Lines 27-35 (Stage 2: Recommender Training): This stage trains the Transformer-based sequential recommender.
    • A Transformer model is initialized with hierarchy-aware embeddings.
    • In a training loop, batches of user sequences are sampled.
    • Each item in the user sequence is mapped to its pre-computed semantic ID sequence.
    • The semantic embeddings are enriched with tag semantics and level types.
    • The Transformer is trained to predict the next item's semantic ID using autoregressive cross-entropy loss ($\mathcal{L}_{\mathrm{rec}}$).
    • Transformer parameters are updated.
  • Lines 36-40 (Inference): This describes how recommendations are generated.
    • For a given test sequence of semantic IDs, the Transformer autoregressively generates the semantic ID of the next item, $y_{T+1}$.
    • Constrained decoding is used during generation to ensure that only valid semantic ID prefixes (corresponding to real items) are generated.
    • Finally, the generated semantic ID is mapped back to an actual item using a pre-computed index.

Implementation Details (from Appendix C.2):

  • Hyperparameters: Typical values include $\tau = 0.07$, $m = 0.9$, and batch sizes of 512 for Stage 1. Tuning is performed via grid search on validation sets.
  • Inference Efficiency: Constrained decoding in Stage 2 uses a trie-based prefix tree for efficient vocabulary masking. This reduces invalid generations to zero and maintains $\mathcal{O}(L \log K)$ time per token, where $K = \max_l K_l$ is the maximum codebook size across layers. The trie is built offline from all valid item IDs to ensure scalability for large catalogs.
  • Partial Tags: For datasets with only partial tags, a hybrid approach can combine LLM-generated and ground-truth labels, weighted by LLM confidence scores. This boosts alignment loss convergence in early epochs.
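To illustrate the trie-based masking, here is a minimal sketch of constrained decoding over valid ID prefixes; the greedy per-level argmax and the flat 256-way vocabulary are simplifying assumptions.

```python
import torch

def build_prefix_trie(valid_ids):
    """Offline step: index every valid semantic ID sequence by prefix."""
    trie = {}
    for seq in valid_ids:
        node = trie
        for code in seq:
            node = node.setdefault(code, {})
    return trie

def masked_step(logits, trie, prefix):
    """One decoding step: only codes extending `prefix` to a valid prefix of
    a real item keep their logits; all others are masked to -inf."""
    node = trie
    for code in prefix:
        node = node[code]
    mask = torch.full_like(logits, float("-inf"))
    mask[list(node.keys())] = 0.0
    return logits + mask

# Toy usage: two valid 3-level IDs; generate greedily level by level.
trie = build_prefix_trie([(1, 5, 7), (1, 2, 3)])
prefix = []
for _ in range(3):
    logits = torch.randn(256)               # model output for this level
    prefix.append(int(masked_step(logits, trie, prefix).argmax()))
# `prefix` is now guaranteed to be one of the two valid ID sequences.
```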

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three widely-used public benchmarks to ensure a comprehensive assessment across different domains and data characteristics. Following the standard practice [19, 34], a 5-core setting is adopted, where all users and items with fewer than five interactions are filtered out.

The following are the results from Table 4 of the original paper:

Dataset    # Users   # Items   # Interactions   Avg. Seq. Length
Beauty     22,363    12,101    198,360          8.87
Sports     35,598    18,557    296,175          8.32
KuaiRand   983       29,983    953,166          19.83

Description of Datasets:

  • Beauty: This dataset is derived from the Amazon Review Data project [12]. It is described as a popular and relatively dense benchmark, suitable for evaluating general recommendation performance.

  • Sports and Outdoors (Sports): Also from the Amazon Review Data project [12]. This dataset is larger and significantly sparser compared to Beauty, which allows for evaluating the model's robustness under more challenging data distributions.

  • KuaiRand-1K [9]: This is a large-scale public dataset from the Kuaishou short video platform. It contains user interactions with rich side information and represents a distinct domain (short video recommendations). It serves to test the generalizability of the HiD-VAE approach. Notably, for datasets like KuaiRand where structured hierarchies are not natively provided, the paper uses its LLM-based pre-processing pipeline (as described in Section 4.2.4). This involves prompting a Large Language Model with the item's title and raw category string to generate a clean, consistent $L$-level category hierarchy for each item, ensuring a complete category path for supervision.

    These datasets were chosen because they represent different domains and data characteristics (density, size, type of items), allowing for a comprehensive and robust assessment of HiD-VAE's capabilities.

5.2. Evaluation Metrics

The paper adopts the standard leave-one-out evaluation protocol [7]. For each user's interaction history, the last item is held out for testing, the second-to-last item is used for validation, and the remaining items are used for training. Performance is reported using two top-K ranking metrics with $K$ set to 5 and 10: Recall@K and Normalized Discounted Cumulative Gain (NDCG@K).

  1. Recall@K

    • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. It focuses on the completeness of the retrieved relevant items. A higher Recall@K means more relevant items are found among the top $K$ recommendations.
    • Mathematical Formula: $ \text{Recall@K} = \frac{|\text{Relevant Items in Top K}|}{|\text{All Relevant Items}|} $
    • Symbol Explanation:
      • $|\text{Relevant Items in Top K}|$: The number of relevant items that appear in the top $K$ recommended list.
      • $|\text{All Relevant Items}|$: The total number of relevant items for a given user (in the leave-one-out setting, this is typically 1, the single next item the user interacted with).
  2. Normalized Discounted Cumulative Gain (NDCG)@K

    • Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items in the ranked list. It assigns higher scores to relevant items that appear earlier (at higher ranks) in the recommendation list. It is "normalized" by dividing by the Ideal Discounted Cumulative Gain (IDCG), which is the DCG of a perfectly sorted list where all relevant items are at the top. This normalization ensures the metric is between 0 and 1.
    • Mathematical Formula: $ \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}} $ where $ \text{DCG@K} = \sum_{j=1}^K \frac{2^{\text{rel}_j} - 1}{\log_2(j+1)} $ and $\text{IDCG@K}$ is the DCG@K of the ideal ranking (i.e., relevant items sorted by highest relevance).
    • Symbol Explanation:
      • $\text{DCG@K}$: Discounted Cumulative Gain at rank $K$. It sums the relevance scores of items in the ranked list, discounting them logarithmically by their position.
      • $\text{IDCG@K}$: Ideal Discounted Cumulative Gain at rank $K$. This is the maximum possible DCG for a given set of relevant items.
      • $j$: The rank (position) of an item in the recommendation list, ranging from 1 to $K$.
      • $\text{rel}_j$: The relevance score of the item at rank $j$. For binary relevance, $\text{rel}_j$ is 1 if the item is relevant and 0 otherwise; in the leave-one-out setting, the held-out test item has relevance 1 and all other items 0.
      • $\log_2(j+1)$: The logarithmic discount factor, which reduces the contribution of relevant items appearing lower in the list.
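Because each user has exactly one held-out relevant item under this protocol, both metrics simplify considerably. The following minimal sketch (ours, not the paper's evaluation code) illustrates the computation:

```python
import math


def recall_at_k(ranked_items, target, k):
    """1.0 if the single held-out item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """With one relevant item, IDCG@K = 1, so NDCG reduces to the discount
    at the target's rank: 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Usage: the target item is ranked 3rd in the top-5 list.
ranked = ["item_a", "item_b", "item_c", "item_d", "item_e"]
print(recall_at_k(ranked, "item_c", 5))  # 1.0
print(ndcg_at_k(ranked, "item_c", 5))    # 1 / log2(4) = 0.5
```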

5.3. Baselines

The paper conducts a comprehensive comparison against a wide spectrum of state-of-the-art models, grouped into three distinct categories:

  1. Traditional Sequential Models: These represent classic and influential non-Transformer approaches.

    • GRU4Rec [13]: Employs Gated Recurrent Units (GRUs) to capture sequential patterns.
    • Caser [35]: Uses Convolutional Neural Networks (CNNs) for sequential recommendation.
    • HGN [26]: Utilizes a Hierarchical Gating Network to model user interests.
    • NextItNet [45]: Leverages dilated convolutional layers for long-range dependencies.
  2. Transformer-based Models: These models represent the current standard for discriminative sequential recommendation, utilizing self-attention.

    • SASRec [19]: Applies the Transformer's self-attention mechanism to sequential recommendation.
    • BERT4Rec [34]: Adapts BERT's bidirectional encoder representations for user behavior.
  3. Generative Recommendation Models: This category includes models that directly generate item identifiers.

    • TIGER [31]: A prominent generative model that uses hierarchical RQ-VAE to learn semantic item identifiers, which are then predicted by a Transformer-based sequence model.
    • LC-Rec [48]: Also leverages RQ-VAE-based identifiers but focuses on aligning language and collaborative semantics. For a fair comparison, the T5-base variant is used as its backbone language model.
    • VQ-Rec [14]: This model is adapted to a generative framework by training a Transformer on its semantic IDs generated via Product Quantization. It serves as a non-hierarchical contrast to HiD-VAE's hierarchical IDs.

Implementation Details (from paper):

  • The HiD-VAE framework is implemented in PyTorch with Hugging Face transformers and accelerate for mixed-precision (FP16) training on NVIDIA 4060 GPUs.
  • Stage 1 (HiD-VAE):
    • Uses a 3-layer MLP encoder and decoder with GELU activations.
    • Takes 768-dimensional SentenceTransformer embeddings as input.
    • Employs $L = 3$ quantization layers, each with codebook size $K = 256$, initialized via K-Means on the first batch (a minimal sketch of this residual quantization step follows this list).
    • Training uses the AdamW optimizer with a learning rate of $3 \times 10^{-4}$ and batch size 128.
    • Key loss hyperparameters are $\beta_{\mathrm{commit}} = 0.25$, $\beta_{\mathrm{sup}} = 1.0$, and $\beta_{\mathrm{unique}} = 2.0$.
    • Focal Loss is applied for tag prediction, with alignment temperature $\tau = 0.07$ and uniqueness margin $m = 0.9$.
  • Stage 2 (Recommender):
    • The frozen HiD-VAE serves as the item tokenizer with pre-computed semantic IDs.
    • The sequential model is a 6-layer Transformer encoder-decoder with 8 attention heads and hidden dimension 512.
    • Trained using the AdamW optimizer with a learning rate of $1 \times 10^{-4}$, batch size 256, and a warmup schedule.
    • Inference uses autoregressive generation with constrained decoding against a corpus cache.
  • For all traditional baseline models, the RecBole [47] framework is leveraged for implementation.
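To ground the Stage 1 configuration above, here is a minimal sketch of plain residual quantization with $L = 3$ layers and codebook size $K = 256$; it deliberately omits HiD-VAE's hierarchical supervision and uniqueness loss, and all module names are illustrative.

```python
import torch
import torch.nn as nn


class ResidualQuantizer(nn.Module):
    """Plain L-layer residual quantizer: each codebook encodes the residual
    left by the previous layer, yielding a coarse-to-fine code sequence."""

    def __init__(self, num_layers=3, codebook_size=256, dim=128):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_layers)
        )

    def forward(self, z):
        residual = z
        codes, quantized = [], torch.zeros_like(z)
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook.weight)  # (B, K) distances
            idx = dists.argmin(dim=-1)                      # nearest entry per item
            selected = codebook(idx)                        # (B, dim)
            codes.append(idx)
            quantized = quantized + selected
            residual = residual - selected
        return torch.stack(codes, dim=-1), quantized        # IDs: (B, L)


# Usage: encode a batch of 4 latent vectors into 3-level semantic IDs.
rq = ResidualQuantizer()
ids, z_hat = rq(torch.randn(4, 128))
print(ids.shape)  # torch.Size([4, 3]) -- one discrete code per layer
```

In HiD-VAE, each layer's code is additionally supervised against the corresponding tag level, which is what turns these otherwise arbitrary indices into an interpretable coarse-to-fine path.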

6. Results & Analysis

6.1. Overall Performance Comparison (RQ1)

The paper presents a comprehensive comparison of HiD-VAE against a suite of strong baselines across three distinct datasets in Table 1.

The following are the results from Table 1 of the original paper:

| Dataset | Metric | GRU4Rec | Caser | HGN | NextItNet | SASRec | BERT4Rec | VQ-Rec | TIGER | LC-Rec | HiD-VAE | Improv. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Beauty | R@5 | 0.0216 | 0.0093 | 0.0312 | 0.0143 | 0.0363 | 0.0116 | 0.0285 | 0.0312 | 0.0402 | 0.0543 | +35.07% |
| Beauty | R@10 | 0.0293 | 0.0146 | 0.0358 | 0.0221 | 0.0498 | 0.0174 | 0.0431 | 0.0457 | 0.0563 | 0.0698 | +23.98% |
| Beauty | N@5 | 0.0154 | 0.0058 | 0.0217 | 0.0090 | 0.0269 | 0.0082 | 0.0182 | 0.0209 | 0.0257 | 0.0358 | +33.08% |
| Beauty | N@10 | 0.0180 | 0.0075 | 0.0256 | 0.0115 | 0.0301 | 0.0100 | 0.0225 | 0.0253 | 0.0366 | 0.0421 | +15.03% |
| Sports | R@5 | 0.0097 | 0.0047 | 0.0162 | 0.0081 | 0.0202 | 0.0057 | 0.0291 | 0.0325 | 0.0385 | 0.0435 | +12.99% |
| Sports | R@10 | 0.0150 | 0.0080 | 0.0235 | 0.0130 | 0.0290 | 0.0089 | 0.0415 | 0.0474 | 0.0493 | 0.0632 | +28.19% |
| Sports | N@5 | 0.0065 | 0.0030 | 0.0111 | 0.0052 | 0.0118 | 0.0037 | 0.0199 | 0.0222 | 0.0251 | 0.0332 | +32.27% |
| Sports | N@10 | 0.0082 | 0.0040 | 0.0134 | 0.0067 | 0.0146 | 0.0047 | 0.0238 | 0.0270 | 0.0284 | 0.0397 | +39.79% |
| KuaiRand | R@5 | 0.0298 | 0.0074 | 0.0297 | 0.0276 | 0.0332 | 0.0185 | 0.0513 | 0.0557 | 0.0622 | 0.0668 | +7.40% |
| KuaiRand | R@10 | 0.0383 | 0.0118 | 0.0354 | 0.0327 | 0.0405 | 0.0217 | 0.0589 | 0.0624 | 0.0684 | 0.0785 | +14.77% |
| KuaiRand | N@5 | 0.0217 | 0.0068 | 0.0169 | 0.0216 | 0.0338 | 0.0196 | 0.0354 | 0.0383 | 0.0403 | 0.0479 | +18.86% |
| KuaiRand | N@10 | 0.0245 | 0.0095 | 0.0219 | 0.0278 | 0.0372 | 0.0236 | 0.0412 | 0.0445 | 0.0497 | 0.0586 | +17.91% |

Key Insights:

  • HiD-VAE Achieves Substantial Improvements Across Datasets: HiD-VAE consistently outperforms all baselines on every dataset (Beauty, Sports, KuaiRand) and across all metrics (Recall@5, Recall@10, NDCG@5, NDCG@10). The performance gains are significant:

    • On Beauty, HiD-VAE shows a remarkable +35.07% relative improvement in Recall@5 and +33.08% in NDCG@5 over the strongest baseline, LC-Rec.
    • On Sports, improvements are also strong, with +32.27% in NDCG@5.
    • On KuaiRand, it achieves +18.86% in NDCG@5. These results underscore the effectiveness of HiD-VAE's approach in generating semantically rich, interpretable, and uniquely distinct item representations, which provide a stronger foundation for the downstream generative recommender.
  • Structured Representations Enhance Generative Model Efficacy:

    • Within the generative model family, a clear performance hierarchy emerges. VQ-Rec, which uses non-hierarchical identifiers (Product Quantization), consistently underperforms TIGER and LC-Rec. This indicates that simpler, flat identifiers are less suitable for autoregressive decoding, as the Transformer must predict a sequence of independent codes without the benefit of a coarse-to-fine semantic structure, likely leading to greater error propagation.
    • TIGER and LC-Rec improve upon VQ-Rec by using hierarchical codes based on RQ-VAE. However, they are still limited by their unsupervised nature, making them prone to semantic drift and ID collisions.
    • HiD-VAE excels here because its Hierarchically-Supervised (HS) process (Tag Alignment and Tag Prediction losses) ensures each ID level aligns with a meaningful category. Concurrently, the Disentanglement via Uniqueness Loss (DUL) actively minimizes ID collisions. This results in a more robust and disentangled representation space that is easier for the subsequent Transformer to model, leading to more precise and relevant recommendations.
  • Generative Paradigms Outperform Discriminative Methods: A broader observation is the general superiority of generative models (HiD-VAE, LC-Rec, TIGER) over traditional sequential models (GRU4Rec, Caser, HGN, NextItNet) and Transformer-based discriminative models (SASRec, BERT4Rec). This trend suggests that the paradigm of directly generating item identifiers (rather than scoring a pre-selected candidate set) is a more powerful approach. It sidesteps the potential disconnect between representation learning and the search/ranking process inherent in methods relying on Approximate Nearest Neighbor (ANN) search. By directly modeling the probability distribution over the entire item universe (as represented by the discrete semantic IDs), HiD-VAE can capture more complex and nuanced user preference patterns, confirming the advantages of the proposed "learn-then-generate" two-stage approach.

6.2. Ablation Study (RQ2)

To quantify the contribution of each core component, an ablation study was conducted. The variants are:

  • HiD-VAE (Full Model): Complete model with Hierarchical Supervision (HS) (Tag Alignment Loss and Tag Prediction Loss) and Disentanglement Uniqueness Loss (DUL).

  • w/o Tag Align: Removes the contrastive tag alignment loss ($\mathcal{L}_{\mathrm{align}}$), retaining tag prediction and uniqueness constraints.

  • w/o Tag Pred: Removes the tag prediction loss ($\mathcal{L}_{\mathrm{pred}}$), relying solely on contrastive alignment for hierarchical structuring.

  • w/o DUL: Removes the uniqueness loss ($\mathcal{L}_{\mathrm{unique}}$). To handle ID collisions, it adopts TIGER's [31] resolution strategy: appending a sequentially incrementing integer to colliding semantic IDs.

    The following are the results from Table 2 of the original paper:

    | Methods | Beauty R@10 | Beauty N@10 | KuaiRand R@10 | KuaiRand N@10 |
    | --- | --- | --- | --- | --- |
    | HiD-VAE | 0.0698 | 0.0421 | 0.0785 | 0.0586 |
    | w/o Tag Align | 0.0651 | 0.0392 | 0.0742 | 0.0541 |
    | w/o Tag Pred | 0.0633 | 0.0378 | 0.0725 | 0.0529 |
    | w/o DUL | 0.0524 | 0.0301 | 0.0668 | 0.0483 |

Key Findings from Ablation Study:

  • Hierarchical Supervision is Crucial for Semantic Grounding: Removing either component of the Hierarchical Supervision (HS) mechanism (i.e., w/o Tag Align or w/o Tag Pred) leads to a noticeable performance drop.

    • Removing the Tag Prediction loss (w/o Tag Pred) results in a more significant decline than removing the Tag Alignment loss (w/o Tag Align). This suggests that the direct classification signal from $\mathcal{L}_{\mathrm{pred}}$ acts as a stronger semantic anchor, forcing each code to map to a concrete category.
    • The contrastive $\mathcal{L}_{\mathrm{align}}$, while less impactful in isolation, is essential for refining the geometric structure of the latent space, ensuring semantically similar categories are represented closely. The synergy of both losses is vital for learning a robust and meaningful hierarchy.
  • Disentanglement is Paramount; Avoiding ID Collisions is Critical: The most substantial performance degradation occurs in the w/o DUL variant. This finding is critical:

    • While the TIGER-style post-hoc fix (appending an integer) technically "resolves" ID collisions and prevents evaluation errors, it comes at a great semantic cost. This strategy injects non-semantic, arbitrary information into the final layer of the item identifier. For example, an ID that should be [beauty, skincare, mask] might become [beauty, skincare, mask, 1] or [beauty, skincare, mask, 2] for two distinct "face mask" items.
    • This appended integer disrupts the learned semantic sequence, effectively introducing noise that confuses the downstream generative model and hinders its ability to reason over the item's true attributes. The significant performance drop for w/o DUL strongly validates the paper's core motivation: achieving intrinsic disentanglement during the representation learning stage is far superior to relying on superficial, post-processing fixes that corrupt the semantic integrity of the learned identifiers.
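To make the contrast with the post-hoc fix concrete, below is one plausible sketch of a margin-based uniqueness penalty over in-batch latents; this is an illustrative assumption, not the paper's exact $\mathcal{L}_{\mathrm{unique}}$ formulation.

```python
import torch


def uniqueness_loss(z, margin=0.9):
    """z: (B, dim) latent representations of B distinct items.
    Pairs closer than the margin are penalized (pushed apart); pairs
    already separated by at least the margin incur no loss."""
    dists = torch.cdist(z, z)                        # (B, B) pairwise distances
    hinge = torch.clamp(margin - dists, min=0.0)     # penalize overlap only
    off_diag = ~torch.eye(len(z), dtype=torch.bool)  # exclude self-pairs
    return hinge[off_diag].mean()


# Usage: nearly-colliding latents yield a large penalty; spread-out ones do not.
z_collided = torch.zeros(4, 16) + 0.01 * torch.randn(4, 16)
z_spread = 5.0 * torch.randn(4, 16)
print(uniqueness_loss(z_collided))  # close to the margin (~0.9)
print(uniqueness_loss(z_spread))    # near zero
```

The key property is that the penalty acts during representation learning, so distinct items are separated before quantization, rather than being disambiguated afterwards with an arbitrary suffix.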

6.3. Qualitative Insights and Analysis (RQ3)

6.3.1. Effectiveness of Disentanglement

To evaluate the mitigation of ID collision, the paper calculates the ID collision rate: the percentage of items whose semantic ID is shared with at least one other item. A lower rate indicates a more robust one-to-one mapping between items and their learned identifiers.
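As a concrete reading of this metric, the short sketch below (ours, for illustration) counts items whose full semantic ID is shared with at least one other item:

```python
from collections import Counter


def collision_rate(item_ids):
    """item_ids: list of semantic-ID tuples, one per item."""
    counts = Counter(item_ids)
    collided = sum(1 for ids in item_ids if counts[ids] > 1)
    return collided / len(item_ids)


# Usage: two of four items share the ID (5, 12, 47) -> 50% collision rate.
print(collision_rate([(5, 12, 47), (5, 12, 47), (7, 3, 101), (2, 9, 30)]))  # 0.5
```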

The following are the results from Table 3 of the original paper:

| Methods | Beauty | Sports | KuaiRand |
| --- | --- | --- | --- |
| VQ-Rec | 21.2% | 22.5% | 20.3% |
| RQ-VAE | 18.7% | 19.5% | 17.8% |
| HiD-VAE (Full) | 2.1% | 2.8% | 1.9% |
| w/o DUL | 17.5% | 18.2% | 16.9% |
| w/o HS | 5.8% | 6.5% | 5.2% |

Analysis:

  • High Collision Rates in Unsupervised Methods: Both VQ-Rec and RQ-VAE (unsupervised baselines) exhibit substantial ID collisions (up to 22.5% for VQ-Rec). VQ-Rec's flat quantization scheme performs the worst, highlighting a fundamental weakness where lack of explicit constraints leads to significant representation entanglement.

  • HiD-VAE Drastically Reduces Collisions: The full HiD-VAE model reduces the collision rate to a negligible level (e.g., 2.8% on Sports), an 87.6% relative reduction compared to VQ-Rec (22.5% on Sports). This provides direct quantitative proof of a nearly injective mapping.

  • DUL is the Primary Driver: The w/o DUL variant, lacking the uniqueness loss, performs only marginally better than RQ-VAE. This unequivocally demonstrates that an explicit disentanglement mechanism is essential to prevent the representation collapse that plagues standard VQ-based tokenization schemes.

  • HS Indirectly Contributes: The w/o HS variant, while better than baselines due to DUL, still has a higher collision rate than the full HiD-VAE. This suggests that hierarchical supervision, by imposing a strong semantically meaningful structure on the latent space, naturally encourages better separation of item representations, allowing DUL to operate more effectively.

    In summary, HiD-VAE effectively resolves the ID collision problem through the synergistic combination of disentanglement (DUL) and structured semantic supervision (HS).

6.3.2. Visualization of the Disentanglement Effect

To provide direct, qualitative evidence of the uniqueness loss's efficacy, the paper visualizes the initial latent space $\mathbf{z}_0$ of items prone to collision using t-SNE [28]. The visualization focuses on items from fine-grained categories with high ID collision rates in the w/o DUL ablation model. For these selected items, their initial latent representations $\mathbf{z}_0$ are extracted from both the w/o DUL model and the full HiD-VAE model and projected into a 2D space.

(Figure 3, image: two t-SNE panels. Left, without DUL, the point groups are entangled and collapse into overlapping clusters; right, with DUL, the points are uniformly spread and clearly separated, with a more visible hierarchical structure.)

Figure 3: The t-SNE visualization of the disentanglement effect of our Uniqueness Loss (DUL). Each color represents a group of distinct items from one of the top-11 high-collision categories. The gray dots represent items from other random categories, serving as a background to illustrate the overall structure of the latent space.

Analysis:

  • Entangled Latent Space (Figure 3a, w/o DUL): Items from the same group (represented by a single color) significantly collapse into overlapping representations. This tight clustering makes distinct items indistinguishable, visually demonstrating the root cause of ID collisions.

  • Disentangled Latent Space (Figure 3b, Full HiD-VAE): In contrast, the latent space from the full HiD-VAE model shows a remarkably well-separated structure. The Uniqueness Loss (DUL) has effectively pushed apart the latent representations of distinct items within each group, ensuring they acquire unique representations. While items from the same category still occupy a similar semantic neighborhood, they are now clearly separable.

    This visualization visually confirms that the uniqueness loss is highly effective at resolving representation entanglement at its source, creating a robust foundation for learning unique and meaningful identifiers.
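For reference, this kind of visualization can be reproduced along the following lines with scikit-learn and matplotlib; the synthetic latents and plotting choices below are ours, standing in for the extracted $\mathbf{z}_0$ vectors and group labels.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
z0 = rng.normal(size=(300, 128))        # stand-in for extracted z_0 latents
groups = rng.integers(0, 11, size=300)  # stand-in high-collision group labels

# Project the latents to 2D and color points by group membership.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(z0)
plt.scatter(emb[:, 0], emb[:, 1], c=groups, cmap="tab20", s=8)
plt.title("t-SNE of item latents (colored by high-collision group)")
plt.savefig("tsne_disentanglement.png", dpi=150)
```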

6.3.3. Case Study on Recommendation Interpretability

A case study compares HiD-VAE with a standard RQ-VAE baseline to demonstrate practical advantages beyond aggregate metrics.

(Figure 4, image: for the same user history, RQ-VAE and HiD-VAE each generate semantic IDs that a Transformer consumes to predict the next item ID; HiD-VAE additionally leverages hierarchical tags and finer-grained embeddings, improving semantic embedding quality and recommendation diversity.)

Figure 4: Case study comparing a standard RQ-VAE with our HiD-VAE. For the same user, HiD-VAE learns a transparent semantic path (e.g., 'Skincare -> Treatments -> Serums'), enabling traceable reasoning. In contrast, the baseline's opaque codes (e.g., '[17, 83, 152]') result in black-box logic and risk generating invalid recommendations.

Analysis:

  • Opaque IDs in RQ-VAE: A standard RQ-VAE maps an item (e.g., a serum) to an opaque identifier like [17, 83, 152]. The semantic meaning of these codes is unknown, making the generation process a black box that relies solely on statistical pattern matching. Such models cannot explain their recommendations. Furthermore, their unconstrained generation can hallucinate invalid ID combinations that do not correspond to any real item.

  • Interpretable IDs in HiD-VAE: In contrast, HiD-VAE generates a transparent, self-explanatory ID like [5, 12, 47] that directly decodes into a human-readable path: Skincare -> Treatments -> Serums. This allows the generation process to follow a traceable line of reasoning, providing an explanation rooted in an explicit semantic path, which enhances user trust.

  • Hierarchy-Aware Semantic Embeddings: The Transformer-based recommender in HiD-VAE utilizes hierarchy-aware semantic embeddings. Each token in the semantic ID is mapped to a semantic vector from an embedding model, concatenated with a learnable ID embedding and a level-specific type embedding (a minimal sketch of this construction follows this subsection). These jointly learned embeddings enrich the representation with explicit semantic information, enabling the model to better capture the coarse-to-fine semantic path and improve overall performance.

  • Robustness against Invalid IDs: Because HiD-VAE's generation is constrained by the learned tag hierarchy and constrained decoding, it is mechanically robust against producing invalid IDs.

    This fundamental difference in representation directly impacts the quality and reliability of recommendations, with HiD-VAE offering superior controllability and trustworthiness crucial for real-world deployments.
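Below is a minimal sketch of the hierarchy-aware token embedding described above: each semantic-ID token concatenates a frozen semantic vector, a learnable ID embedding, and a level-type embedding. All dimensions, the projection layer, and the random initialization of the semantic vectors are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HierarchyAwareEmbedding(nn.Module):
    def __init__(self, num_levels=3, codebook_size=256, sem_dim=768,
                 id_dim=128, type_dim=32, out_dim=512):
        super().__init__()
        self.codebook_size = codebook_size
        # Stand-in for frozen per-code semantic vectors (e.g., derived from
        # SentenceTransformer space); randomly initialized here.
        self.register_buffer(
            "sem_vecs", torch.randn(num_levels, codebook_size, sem_dim)
        )
        self.id_emb = nn.Embedding(num_levels * codebook_size, id_dim)
        self.type_emb = nn.Embedding(num_levels, type_dim)
        self.proj = nn.Linear(sem_dim + id_dim + type_dim, out_dim)

    def forward(self, codes):
        """codes: (B, L) semantic-ID tokens, one per hierarchy level."""
        B, L = codes.shape
        levels = torch.arange(L).expand(B, L)                    # level index per token
        sem = self.sem_vecs[levels, codes]                       # frozen semantic vectors
        ids = self.id_emb(levels * self.codebook_size + codes)   # learnable ID embeddings
        types = self.type_emb(levels)                            # level-type embeddings
        return self.proj(torch.cat([sem, ids, types], dim=-1))


# Usage: embed two 3-level semantic IDs for the downstream Transformer.
emb = HierarchyAwareEmbedding()
tokens = emb(torch.tensor([[5, 12, 47], [7, 3, 101]]))
print(tokens.shape)  # torch.Size([2, 3, 512])
```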

6.3.4. Tag Predictor Architecture and Per-Layer Accuracy (Appendix E.1)

The paper provides details on the tag predictor architecture for each layer and its per-layer classification accuracy. The tag predictor is a multi-layer perceptron (MLP) with self-attention, feature extraction layers, residual blocks, and a final classification head; its structure (hidden dimensions, dropout rates) scales with layer depth to handle increasing complexity (a minimal sketch appears at the end of this subsection).

The following are the results from Table 5 of the original paper:

| Dataset | Layer 1 (Coarse) | Layer 2 (Medium) | Layer 3 (Fine) |
| --- | --- | --- | --- |
| Beauty | 96.38 (7) | 92.73 (30) | 85.61 (97) |
| Sports | 93.68 (23) | 87.54 (70) | 83.27 (119) |
| KuaiRand | 87.84 (38) | 83.58 (97) | 77.49 (146) |

Analysis: Values are classification accuracies in percent; the numbers in parentheses denote the effective number of tag categories at each level (after filtering out rare categories with fewer than 30 samples).

  • Consistently Strong Performance: The accuracies range from 96.38% (Layer 1, Beauty) to 77.49% (Layer 3, KuaiRand). This demonstrates consistently strong performance across layers, even with progressive increases in category granularity and count.
  • Efficacy of Scaled Classifier Architecture: Even for deeper layers with a large number of categories (e.g., 97-146), where random chance accuracy would be very low, the achieved accuracies remain robust and substantially outperform naive baselines. This validates the scaled classifier architecture's ability to capture nuanced semantic distinctions. This robust performance confirms that the hierarchical supervision signal is effectively learned and utilized by the HiD-VAE's quantization layers.
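For concreteness, here is a minimal sketch of a per-layer tag predictor in the spirit described above (feature extraction, a residual block, and a classification head sized to that level's tag count). The self-attention component and the depth-scaled widths and dropout rates are omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn


class TagPredictor(nn.Module):
    def __init__(self, in_dim=128, hidden=256, num_tags=97, dropout=0.2):
        super().__init__()
        self.extract = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Dropout(dropout)
        )
        self.residual = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        self.head = nn.Linear(hidden, num_tags)

    def forward(self, code_vec):
        h = self.extract(code_vec)
        h = h + self.residual(h)  # residual block
        return self.head(h)       # logits over this level's tag categories


# Usage: predict among the 97 fine-grained Layer-3 tags on Beauty.
logits = TagPredictor(num_tags=97)(torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 97])
```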

6.3.5. Hyperparameter Sensitivity (Appendix E.2)

The paper analyzes the sensitivity of HiD-VAE to key hyperparameters: the number of quantization layers $L$, the uniqueness loss weight $\beta_{\mathrm{unique}}$, and the margin $m$.

Sensitivity to Number of Quantization Layers $L$

The paper evaluates performance for $L \in \{3, 4, 5\}$ (excluding $L = 2$ due to its severely limited representational capacity).

Figure 6: Sensitivity analysis of the number of quantization layers $L$ on Recall@10 and NDCG@10. Performance peaks at $L = 4$ but diminishes at $L = 5$, indicating diminishing returns with increased depth.

Analysis:

  • Marginal Improvement from $L = 3$ to $L = 4$: Increasing $L$ from 3 to 4 yields marginal improvements (e.g., +4.88% in Recall@10 on Beauty, +4.97% on KuaiRand), as an additional layer enables finer semantic decomposition.
  • Performance Degradation at $L = 5$: Further increasing to $L = 5$ degrades performance (e.g., -5.86% in Recall@10 on Beauty relative to $L = 4$), sometimes even falling below the $L = 3$ baseline.
  • Diminishing Returns and Optimization Challenges: This diminishing-returns pattern suggests a low benefit-to-cost ratio: deeper hierarchies increase training costs (more quantization overhead, longer sequences for the Transformer) without proportional gains, likely due to optimization challenges such as propagating quantization errors and harder autoregressive modeling.
  • Optimal Trade-off: The default choice of $L = 3$ is validated as an optimal trade-off, achieving strong performance without excessive overhead.

Analyzing Uniqueness Loss Parameters

The sensitivity to the uniqueness loss weight $\beta_{\mathrm{unique}}$ and uniqueness margin $m$ is also analyzed.

Figure 7: Sensitivity analysis of the uniqueness loss weight $\beta_{\mathrm{unique}}$ and uniqueness margin $m$ on Recall@10 and NDCG@10 for the Beauty dataset. Performance peaks at $\beta_{\mathrm{unique}} = 2.0$ but diminishes at higher values, while metrics improve with increasing $m$ up to 0.9, indicating optimal disentanglement with balanced penalties.

Analysis of $\beta_{\mathrm{unique}}$ (Figure 7a):

  • Performance peaks at $\beta_{\mathrm{unique}} = 2.0$ (Recall@10 = 0.0698, NDCG@10 = 0.0421). This indicates an optimal balance where disentanglement is sufficiently enforced without overpowering other objectives.
  • Lower values lead to slightly degraded metrics due to increased ID collisions (insufficient disentanglement).
  • Higher values may over-penalize and disrupt the latent space structure, leading to sub-optimal performance.

Analysis of $m$ (Figure 7b):

  • Metrics generally improve with an increasing margin $m$, achieving the best results at $m = 0.9$ (Recall@10 = 0.0698, NDCG@10 = 0.0421).

  • A stricter margin better separates colliding representations, enhancing disentanglement efficacy.

  • Lower margins allow more overlap, reducing disentanglement effectiveness; the optimal value of $m = 0.9$ suggests that strong separation is beneficial.

    These sensitivity analyses confirm that the chosen hyperparameters for the HiD-VAE model and its uniqueness loss are well-tuned, striking a balance between promoting disentanglement and maintaining overall model performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

In this work, the authors introduced HiD-VAE, a novel framework for generative recommendation that effectively addresses two critical limitations of existing semantic ID tokenization methods: semantic flatness and representation entanglement. HiD-VAE achieves this through two core innovations:

  1. A hierarchically-supervised quantization process that explicitly aligns discrete codes with multi-level item tags, yielding interpretable and structured semantic IDs with traceable semantic paths.

  2. A novel uniqueness loss that directly penalizes latent space overlap between distinct items, thereby resolving the pervasive ID collision problem and promoting recommendation diversity.

    Extensive experiments on three public benchmarks (Beauty, Sports, and KuaiRand) demonstrated that HiD-VAE achieves state-of-the-art performance, significantly outperforming prior discriminative and generative models across various evaluation metrics. The ablation studies confirmed the crucial contributions of both hierarchical supervision and the uniqueness loss. Qualitative analyses, including t-SNE visualizations and a case study, further showcased HiD-VAE's ability to learn disentangled representations and provide interpretable recommendations. By unifying the recommendation pipeline with interpretable and disentangled IDs, HiD-VAE enhances traceability and robustness, paving the way for more trustworthy real-world deployments.

7.2. Limitations & Future Work

The authors acknowledge several directions for future research:

  • Multi-modal Semantic IDs: Extend HiD-VAE's ID generation to incorporate multi-modal data (images, videos, audio) alongside text. This would involve integrating encoders like CLIP to align discrete codes with cross-modal hierarchies, capturing richer nuances (e.g., visual styles, auditory patterns) for heterogeneous domains like e-Commerce and social networks.
  • LLMs for Advanced Sequential Modeling: Enhance the generative stage by integrating Large Language Models (LLMs) for more advanced sequential modeling. This could involve fine-tuning or prompting LLMs on hierarchical IDs to leverage their reasoning capabilities for complex user intents and dependencies, potentially using hybrid architectures with rationales or counterfactuals to boost diversity and personalization.
  • Broader Efforts: Explore diffusion models for ID synthesis, which could improve long-tail coverage and fairness in recommendations. Additionally, investigate privacy-preserving mechanisms like federated learning for ID generation and sequential modeling.

7.3. Personal Insights & Critique

This paper presents a significant advancement in generative recommendation by systematically tackling two of its most pressing issues: interpretability and ID collisions. The core innovation of hierarchically-supervised quantization is particularly insightful, bridging the gap between abstract latent codes and human-understandable categories. This explicit supervision, rather than implicit learning, is a strong step towards making generative models more transparent and trustworthy. The uniqueness loss is also a well-conceived mechanism, directly addressing the representation collapse problem that often plagues VQ-based models.

Strengths:

  • Directly Addresses Core Problems: Unlike many prior works that use post-hoc fixes, HiD-VAE integrates solutions for interpretability and disentanglement directly into the representation learning phase, which is a more principled approach.
  • LLM Integration for Data Richness: The LLM-based hierarchical tag generation is a practical and ingenious solution for applying this framework to datasets lacking inherent categorical structures, significantly broadening its applicability.
  • Clear Interpretability: The ability to decode semantic IDs into human-readable paths (e.g., "Skincare -> Treatments -> Serums") is a major step forward for user understanding and system debugging.
  • Robustness against Hallucination: Constrained decoding ensures that generated IDs always correspond to real items, which is critical for practical deployment.

Potential Issues/Critique:

  • Complexity: The HiD-VAE model involves multiple components and loss functions, including a VAE, RQ-VAE, projectors, classifiers, and a Transformer, with several hyperparameters ($\beta_{\mathrm{commit}}$, $\beta_{\mathrm{sup}}$, $\beta_{\mathrm{unique}}$, $\tau$, $m$). This complexity might lead to higher computational costs and require careful tuning. The sensitivity analysis in Appendix E.2 provides some reassurance, but optimal tuning for diverse datasets could still be challenging.
  • Reliance on LLM for Tags: While innovative, the LLM-based tag generation depends on the quality and capabilities of the chosen LLM. Potential biases in the LLM or its training data could be propagated into the generated hierarchical tags, affecting the learned semantic IDs. The prompt design and LLM selection are crucial, and while described, ensuring complete objectivity and generalizability across all possible domains is a continuous challenge.
  • Scalability of Constrained Decoding: While a trie-based prefix tree is efficient for constrained decoding, for extremely large and dynamic item catalogs with frequent additions/deletions, maintaining and querying this structure could still pose engineering challenges.
  • Transferability of Learned IDs: The paper implicitly suggests that the learned HiD-VAE tokenizer (Stage 1) can be frozen and used by various downstream generative models. While validated with a Transformer, exploring its direct plug-and-play compatibility with other generative architectures could be a valuable extension.

Applications and Inspirations:

  • Beyond Recommendation: The concept of learning hierarchically-disentangled discrete representations could be highly valuable in other domains requiring interpretable generative models, such as generative design (e.g., generating product designs with interpretable feature breakdowns), drug discovery (generating molecules with specific, disentangled chemical properties), or data synthesis for privacy-preserving applications.

  • User Control and Explainability: The interpretable semantic paths could be exposed to users, allowing them to refine recommendations by modifying tag paths or understanding why certain items were suggested. This opens avenues for more interactive and transparent recommender systems.

  • Multimodal Generative Models: The proposed future direction of multi-modal semantic IDs is very promising. Integrating different modalities could lead to richer, more comprehensive item representations and enable cross-modal recommendation (e.g., recommending music based on video content).

    Overall, HiD-VAE represents a robust and well-reasoned step forward for generative recommendation, laying a strong foundation for more interpretable, accurate, and diverse recommendation systems.
