HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs
TL;DR Summary
HiD-VAE is a proposed framework that enhances generative recommendation by learning hierarchically disentangled item representations, addressing traditional methods' flatness and entanglement issues, thereby improving recommendation accuracy and diversity.
Abstract
Recommender systems are indispensable for helping users navigate the immense item catalogs of modern online platforms. Recently, generative recommendation has emerged as a promising paradigm, unifying the conventional retrieve-and-rank pipeline into an end-to-end model capable of dynamic generation. However, existing generative methods are fundamentally constrained by their unsupervised tokenization, which generates semantic IDs suffering from two critical flaws: (1) they are semantically flat and uninterpretable, lacking a coherent hierarchy, and (2) they are prone to representation entanglement (i.e., ``ID collisions''), which harms recommendation accuracy and diversity. To overcome these limitations, we propose HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations. First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags, yielding more uniform and disentangled IDs. Crucially, the trained codebooks can predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation. Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap. This mechanism not only resolves the critical ID collision problem but also promotes recommendation diversity by ensuring a more comprehensive utilization of the item representation space. These high-quality, disentangled IDs provide a powerful foundation for downstream generative models. Extensive experiments on three public benchmarks validate HiD-VAE's superior performance against state-of-the-art methods. The code is available at https://anonymous.4open.science/r/HiD-VAE-84B2.
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs". This title clearly indicates the paper's focus on improving the interpretability and quality of item representations in generative recommender systems by introducing hierarchical structures and disentanglement.
1.2. Authors
The authors are:
- Dengzhao Fang (Jilin University, Changchun, China)
- Jingtong Gao (City University of Hong Kong, Hong Kong, China)
- Chengcheng Zhu (Nanjing University, Nanjing, China)
- Yu Li (Jilin University, Changchun, China)
- Xiangyu Zhao (City University of Hong Kong, Hong Kong, China)
- Yi Chang (Jilin University, Changchun, China)
The affiliations suggest a collaborative effort between researchers from Jilin University, City University of Hong Kong, and Nanjing University, indicating expertise in machine learning, recommender systems, and possibly natural language processing due to the use of LLMs.
1.3. Journal/Conference
The paper is listed as a preprint on arXiv. Its reference list cites venues such as "Proc. of KDD" and "Proc. of AAAI", which suggests the authors are targeting top-tier conferences in artificial intelligence and data mining. As a preprint, it may still be under peer review or awaiting formal publication.
1.4. Publication Year
The paper was published on August 6, 2025 (timestamp: 2025-08-06T16:45:05.000Z).
1.5. Abstract
The abstract introduces the growing field of generative recommendation, which aims to unify retrieval and ranking into an end-to-end model. It identifies two major limitations of existing generative methods, stemming from their unsupervised tokenization for generating semantic IDs:
- Semantic flatness and uninterpretability: The semantic IDs lack a coherent hierarchy, making them hard to understand.
- Representation entanglement (ID collisions): Distinct items are mapped to the same semantic ID, negatively impacting recommendation accuracy and diversity.

To address these issues, the paper proposes HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations:
- Hierarchically-supervised quantization: This process aligns discrete codes with multi-level item tags, leading to more uniform and disentangled IDs. The trained codebooks can predict hierarchical tags, providing an interpretable semantic path for each recommendation.
- Uniqueness loss: This novel loss function directly penalizes latent space overlap between distinct items, resolving the ID collision problem and promoting recommendation diversity by better utilizing the item representation space.

The abstract claims that these high-quality, disentangled IDs provide a strong foundation for downstream generative models. Extensive experiments on three public benchmarks are reported to validate HiD-VAE's superior performance against state-of-the-art methods.
1.6. Original Source Link
The official source link for the paper is https://arxiv.org/abs/2508.04618. The PDF link is https://arxiv.org/pdf/2508.04618v2.pdf. It is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve lies within the rapidly evolving field of generative recommendation. Traditional recommender systems often employ a retrieve-and-rank pipeline, where candidate items are first retrieved and then ranked. Generative recommendation emerged as a promising paradigm to unify this pipeline into an end-to-end model capable of dynamically generating item identifiers (semantic IDs). This approach has the potential to offer more flexible and dynamic recommendations than traditional methods.
However, existing generative methods are fundamentally constrained by their unsupervised tokenization process, which generates these semantic IDs. This process suffers from two critical flaws that limit the efficacy of the entire pipeline:
- Semantic Flatness and Uninterpretability: Current semantic IDs are often "flat," meaning they do not inherently possess a coherent, human-understandable hierarchical structure. This lack of hierarchy makes the models uninterpretable "black boxes," where the semantic meaning behind a generated ID is opaque. This hinders understanding why a particular item was recommended and makes it difficult to control the generation process.
- Representation Entanglement and "ID Collisions": The unsupervised vector quantization (VQ) techniques used often lead to representation entanglement, where distinct items are erroneously mapped to the same semantic ID. This phenomenon, termed "ID collisions," is a critical flaw because it means the system cannot differentiate between unique items. This not only harms recommendation accuracy (as the system might recommend the wrong item under a shared ID) but also severely limits recommendation diversity (as many items might be effectively collapsed into a few identical representations). While some existing methods attempt post-hoc fixes, these are often ineffective as they don't address the root cause of the entanglement in the latent space.

The importance of this problem stems from the need for more interpretable, accurate, and diverse recommender systems. As item catalogs grow immense, users need systems that not only suggest relevant items but also explain why they are relevant, fostering trust and a better user experience. Overcoming ID collisions is crucial for the fundamental correctness and utility of generative recommenders.
The paper's entry point is to directly address these two fundamental challenges at the representation learning stage, arguing that high-quality, structured semantic IDs are foundational for effective generative recommendation.
2.2. Main Contributions / Findings
The paper's primary contributions are encapsulated in its proposed framework, HiD-VAE, and its core technical innovations:
- Introduction of HiD-VAE for Hierarchically Structured and Disentangled Representations: The paper proposes HiD-VAE as a novel framework specifically designed for generative recommendation. It learns interpretable and disentangled item representations by integrating hierarchical structure and preventing ID collisions. This directly tackles the limitations of semantically flat and entangled IDs in previous generative models.
- Hierarchically-Supervised Quantization Process: HiD-VAE introduces a novel hierarchically-supervised process for vector quantization. This is achieved through two components:
  - Tag Alignment Loss ($\mathcal{L}_{\mathrm{align}}$): This loss explicitly aligns the discrete codes from each RQ-VAE layer with multi-level item tags using a contrastive learning approach. This ensures that each layer captures specific levels of category semantics.
  - Tag Prediction Loss: This loss guides each VAE layer to predict the corresponding hierarchical tag, further ensuring that the learned semantic IDs are structured and mapped to an interpretable hierarchy. This allows the trained codebooks to predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation.
- Novel Uniqueness Loss for Disentanglement: To combat representation entanglement and prevent ID collisions, HiD-VAE incorporates a novel uniqueness loss. This loss directly penalizes latent space overlap between distinct items that might otherwise be assigned identical semantic ID sequences. This mechanism ensures a more comprehensive utilization of the item representation space, resolves the ID collision problem, and promotes recommendation diversity.
- LLM-based Hierarchical Tag Generation: To enhance applicability to real-world datasets often lacking ground-truth labels, the paper introduces an effective LLM-based approach for automatically generating high-quality hierarchical tags via a "retrieval-then-classification" pipeline. This makes the hierarchically-supervised approach feasible even for datasets without pre-defined category structures.
- State-of-the-Art Performance: Extensive experiments on three public benchmarks (Beauty, Sports, KuaiRand) validate HiD-VAE's superior performance over state-of-the-art traditional sequential, Transformer-based, and existing generative recommendation methods in terms of Recall@K and NDCG@K. The results demonstrate substantial improvements, confirming the efficacy of the proposed innovations.

In summary, HiD-VAE provides a powerful foundation for downstream generative models by producing semantic IDs that are both interpretable (due to explicit hierarchical structure) and disentangled (due to the uniqueness loss), leading to more accurate, diverse, and trustworthy recommendations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the HiD-VAE framework, it's essential to grasp several core concepts in recommender systems, deep learning, and generative modeling.
- Recommender Systems (RS): Systems designed to predict user preferences for items and suggest relevant ones. They are crucial for navigating information overload on digital platforms.
- Sequential Recommendation: A sub-field of RS that focuses on modeling the dynamic nature of user preferences based on their chronological interaction history (e.g., a sequence of items a user has interacted with). The goal is to predict the next item in the sequence.
- Deep Learning Architectures:
  - Recurrent Neural Networks (RNNs), specifically Gated Recurrent Units (GRUs): Neural networks designed to process sequential data. GRUs are a type of RNN that can capture dependencies over long sequences by using gating mechanisms to control the flow of information.
  - Convolutional Neural Networks (CNNs): Neural networks primarily used for processing grid-like data such as images, but they can also be adapted for sequential data by treating sequences as 1D "images."
  - Transformers: A powerful deep learning architecture that relies entirely on self-attention mechanisms to weigh the importance of different parts of the input sequence. They have become the state-of-the-art in many sequence modeling tasks, including natural language processing (NLP) and sequential recommendation, due to their ability to capture long-range dependencies efficiently.
    - Self-Attention: A mechanism that allows a model to weigh the importance of different elements in an input sequence when processing a specific element. It calculates a "score" for each element's relevance to the current element, forming a weighted sum of values.
- Generative Models: Machine learning models designed to generate new data instances that resemble the training data.
  - Generative Recommendation: A paradigm where the recommender system directly generates the identifiers (or semantic IDs) of recommended items, rather than just scoring existing candidates. This unifies the retrieve-and-rank pipeline into an end-to-end process.
  - Variational Autoencoder (VAE): A type of generative model that learns a compressed, continuous latent space representation of input data. It consists of an encoder that maps input data to a probability distribution in the latent space, and a decoder that reconstructs the input from samples drawn from this latent space. VAEs are optimized using a reconstruction loss (to ensure fidelity) and a KL divergence term (to regularize the latent space and ensure its structure).
    - Encoder ($E$): Maps an input feature vector $x$ to a continuous latent representation $z = E(x)$.
    - Decoder: Reconstructs the input from the latent representation.
    - Latent Space: A lower-dimensional, abstract representation of the input data learned by the VAE.
    - Reconstruction Loss: Measures how well the decoder can reconstruct the original input from the latent representation. Often Mean Squared Error (MSE).
    - KL Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution diverges from a second, expected probability distribution. In VAEs, it regularizes the latent space to follow a prior distribution (e.g., a standard normal distribution).
- Vector Quantization (VQ): A technique used to introduce discreteness into a continuous latent space. It maps a continuous latent vector to its nearest neighbor in a finite, learnable codebook.
  - Codebook: A finite set of discrete codewords (also called prototypes or embeddings).
  - Codeword ($c_j$): An individual vector within the codebook.
  - VQ-VAE: A VAE extended with Vector Quantization in its latent space. It replaces the continuous latent vector with a discrete codeword from a codebook.
  - Quantization Function ($q(z)$): $ q(z) = c_{k^{*}} \quad \mathrm{where} \quad k^{*} = \arg\min_j \| z - c_j \|_2 $ This function finds the codeword in the codebook that is closest to the continuous latent vector $z$ in terms of Euclidean distance.
- Residual-Quantized VAE (RQ-VAE): An enhancement of VQ-VAE that uses a cascade of multiple quantizers. Instead of quantizing the entire latent vector at once, each subsequent quantizer operates on the residual error from the previous stage. This allows for a more efficient and fine-grained discrete representation by progressively refining the quantization.
  - Residual Error: The difference between the original latent vector (or previous residual) and the codeword chosen by the current quantizer.
  - Cumulative Quantized Embedding ($z_q^{(l)}$): The sum of the selected codewords from all quantizer layers up to and including layer $l$.
- Semantic IDs: Discrete identifiers (often sequences of codewords from a codebook) that are learned to represent items, ideally distilling their rich semantic features into numerical codes.
- ID Collisions: A critical problem in vector quantization where multiple distinct items are mapped to the same discrete ID sequence. This means the system cannot differentiate between these items, leading to problems in accuracy and diversity.
- Interpretability: The degree to which a human can understand the decisions or internal workings of a machine learning model. In recommender systems, it often refers to understanding why an item was recommended.
3.2. Previous Works
The paper contextualizes its contributions by discussing the evolution of sequential recommendation and the recent shift towards generative models.
- Traditional Sequential Models (Discriminative): These models typically learn item embeddings and then use them to score candidate items.
  - GRU4Rec [13]: A pioneering work that applied Gated Recurrent Units (GRUs) to model user sessions for next-item prediction.
  - Caser [35]: Utilized Convolutional Neural Networks (CNNs) to capture local sequential patterns, treating sequences as "images" of item embeddings.
  - HGN [26]: Proposed a Hierarchical Gating Network to integrate both short-term and long-term user interests in sequences.
  - NextItNet [45]: Employed dilated convolutional layers to effectively capture longer-range dependencies in user sequences.
  - Common Limitation: These models are discriminative; they learn to distinguish between relevant and irrelevant items. They often rely on an external Approximate Nearest Neighbor (ANN) search index (like Faiss [5]) to retrieve candidates, which separates representation learning from the final retrieval/ranking process.
- Transformer-based Models: These models leverage the self-attention mechanism for more sophisticated sequence modeling.
  - SASRec [19]: One of the first models to adapt the Transformer's self-attention mechanism for sequential recommendation, using a decoder-only structure for next-item prediction.
  - BERT4Rec [34]: Adapted the BERT model's bidirectional pre-training paradigm (using a masked language modeling objective) to learn user behavior representations.
  - $S^3$-Rec [49]: Further refined Transformer-based models by incorporating self-supervised pre-training tasks to improve representation quality.
  - Common Limitation: While powerful, they are still fundamentally discriminative and often inherit the retrieve-and-rank pipeline issues.
- Generative Recommendation Models: These models aim to directly generate item identifiers autoregressively.
  - TIGER [31]: A pioneering work that introduced Semantic IDs, discrete item representations learned via Residual-Quantized VAEs (RQ-VAE), and used a Transformer-based sequence model to autoregressively predict these IDs.
  - LC-Rec [48]: Also leverages RQ-VAE-based identifiers but focuses on aligning language and collaborative semantics during the tokenization process.
  - LETTER [39]: Another work that enhances the quantization stage by injecting collaborative signals into the item tokenization.
  - VQ-Rec [14]: Learns vector-quantized item representations for sequential recommenders, often using simpler Product Quantization (PQ) or VQ-VAE. The paper adapts this model for a generative framework by training a Transformer on its semantic IDs.
  - Common Limitations of existing Generative Models:
    - Flat, Uninterpretable Semantic Space: The learned semantic IDs lack an explicit hierarchical structure, making them opaque "black boxes" [32, 50]. Any hierarchy is an implicit byproduct, not explicitly supervised.
    - Representation Entanglement ("ID Collisions"): Prone to ID collisions where distinct items are mapped to the same identifier [31, 44]. Existing solutions are often post-hoc (e.g., appending an integer to colliding IDs), which doesn't address the root cause of entanglement in the latent space and can corrupt the semantic integrity of the IDs.
3.3. Technological Evolution
The field of sequential recommendation has evolved significantly:
- Early Deep Learning (RNNs/CNNs): Models like GRU4Rec and Caser demonstrated the power of deep neural networks to capture sequential patterns. These were foundational but often struggled with long-range dependencies and complex interactions.
- Attention-based/Transformer Models: SASRec and BERT4Rec marked a paradigm shift, leveraging self-attention to effectively model complex dependencies, leading to significant performance gains and becoming the new standard. These models still primarily function in a discriminative manner, scoring items based on learned embeddings.
- Generative Recommendation: More recently, models like TIGER, LC-Rec, and VQ-Rec pushed the boundary by reframing recommendation as an autoregressive generation task. This paradigm aims to directly generate item identifiers, unifying the retrieve-and-rank pipeline. This approach directly models the probability distribution over items, potentially leading to more flexible and dynamic recommendations.

HiD-VAE fits into this evolution by addressing critical shortcomings of the current generative recommendation paradigm. While generative models offer a powerful way to model user preferences and the entire item universe, their reliance on unsupervised tokenization for semantic IDs has created new challenges: lack of interpretability and ID collisions. HiD-VAE represents the next step, aiming to refine the ID generation process itself to produce semantic IDs that are both hierarchically structured (for interpretability) and disentangled (for accuracy and diversity).
3.4. Differentiation Analysis
Compared to the main methods in related work, HiD-VAE introduces core innovations:
- Vs. Traditional & Transformer-based Discriminative Models: HiD-VAE operates within the generative recommendation paradigm, which is fundamentally different from discriminative models. Instead of learning item embeddings and then scoring candidates, HiD-VAE generates structured semantic IDs directly. This avoids the disconnect between representation learning and retrieval/ranking inherent in discriminative approaches and allows for direct modeling of the item distribution. The paper's results show that generative models, and especially HiD-VAE, generally outperform discriminative methods.
- Vs. Existing Generative Models (TIGER, LC-Rec, VQ-Rec): This is where HiD-VAE makes its most direct advancements.
  - Interpretability: Previous generative models (like TIGER and LC-Rec, which use RQ-VAE-based IDs, or VQ-Rec, which uses PQ) learn semantic IDs through unsupervised vector quantization. This means any hierarchical structure is an implicit byproduct at best, leading to semantically flat and uninterpretable IDs. HiD-VAE explicitly addresses this with its hierarchically-supervised quantization process (via Tag Alignment Loss and Tag Prediction Loss), which aligns discrete codes with multi-level item tags, yielding a transparent, human-readable semantic path.
  - Disentanglement and ID Collisions: Existing generative models suffer significantly from representation entanglement and ID collisions. While some (like TIGER) might use post-hoc fixes (e.g., appending integers), these are "ineffective" as they don't solve the root problem in the latent space and can corrupt semantic integrity. HiD-VAE directly tackles this at the source with its novel uniqueness loss, which explicitly penalizes latent space overlap for distinct items that would otherwise collide. This leads to a nearly injective mapping from items to IDs, drastically reducing collision rates and improving diversity.
  - Applicability to Diverse Datasets: HiD-VAE introduces an LLM-based approach for generating hierarchical tags for datasets that lack them (like KuaiRand). This broadens the applicability of its hierarchically-supervised method, which would otherwise be limited to datasets with pre-existing rich categorical structures.

In essence, HiD-VAE builds upon the generative paradigm but fundamentally re-engineers the semantic ID generation process to overcome its inherent limitations of uninterpretability and entanglement, thereby providing a more robust, accurate, and transparent foundation for generative recommendation.
4. Methodology
4.1. Principles
The core idea behind HiD-VAE is to learn item representations that are both interpretable and disentangled, specifically designed for generative recommendation. The theoretical basis and intuition stem from addressing the two critical flaws identified in existing generative methods: semantic flatness and representation entanglement (ID collisions).
- Interpretability through Hierarchy: The intuition is that human understanding often relies on hierarchical categorization. By explicitly supervising the vector quantization process with multi-level item tags, HiD-VAE forces the learned discrete codes to align with a meaningful, coarse-to-fine semantic hierarchy. This makes the semantic IDs traceable and human-readable, transforming opaque numerical codes into clear semantic paths (e.g., "Clothing → Topwear → Dress").
- Disentanglement through Uniqueness: The intuition here is that each unique item should ideally have a unique and distinct representation in the latent space to avoid ambiguity and maximize diversity. ID collisions occur when distinct items are represented too closely or identically. HiD-VAE addresses this by introducing a uniqueness loss that actively pushes apart the latent representations of items that would otherwise collapse into the same semantic ID. This ensures that the representation space is comprehensively utilized and that each item maintains its distinct identity.

By integrating these two principles, HiD-VAE aims to create high-quality, disentangled IDs that not only improve recommendation accuracy and diversity but also provide traceable and interpretable semantic paths, fostering user trust and better control over the recommendation process.
4.2. Core Methodology In-depth (Layer by Layer)
The HiD-VAE framework operates in two distinct stages: Offline Hierarchical ID Learning (Stage 1) and Online Interpretable Recommendation (Stage 2).
4.2.1. Preliminary Concepts for HiD-VAE
Before delving into the HiD-VAE specific innovations, the paper first reviews the foundational concepts upon which it builds:
- Variational Autoencoder (VAE): A generative model consisting of an encoder and a decoder. The encoder $E$ maps an input feature vector $x$ to a continuous latent representation $z = E(x) \in \mathbb{R}^d$, from which the decoder reconstructs the input. The model is optimized by minimizing a reconstruction loss (e.g., Mean Squared Error) and a KL divergence term that regularizes the latent space.
- Vector Quantization (VQ): To introduce discreteness, VQ maps a continuous latent vector $z$ to its nearest neighbor in a finite, learnable codebook of codewords $c_j \in \mathbb{R}^d$. This is represented by the quantization function: $ q(z) = c_{k^{*}} \quad \mathrm{where} \quad k^{*} = \arg\min_j \| z - c_j \|_2 \quad (1) $ Here, $z$ is the continuous latent vector from the encoder and the $c_j$ are the codewords in the codebook. The function $q(z)$ selects the codeword that has the minimum Euclidean distance ($\| \cdot \|_2$) to $z$.
- Residual-Quantized VAE (RQ-VAE): RQ-VAE extends VQ by using a cascade of quantizers. Instead of a single quantization step, each subsequent quantizer operates on the residual error from the preceding stage, allowing for a more efficient and fine-grained discrete representation. The cumulative quantized embedding up to layer $l$, written $z_q^{(l)}$, is the sum of the codewords selected by the first $l$ quantizers.

HiD-VAE adapts this cascaded architecture from RQ-VAE to implement its hierarchical item tokenizer, adding explicit hierarchical supervision and a novel disentanglement mechanism.
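The nearest-codeword rule of Equation (1) and the residual cascade can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the codebooks are fixed toy values rather than learned parameters, and the function names `nearest_codeword` and `residual_quantize` are hypothetical, not taken from the paper's code.

```python
import math

def nearest_codeword(vec, codebook):
    """Return (index, codeword) of the codeword closest to vec (Euclidean distance)."""
    best = min(range(len(codebook)), key=lambda j: math.dist(vec, codebook[j]))
    return best, codebook[best]

def residual_quantize(z, codebooks):
    """Quantize z with a cascade of codebooks, one per layer.

    Returns the selected indices (the discrete ID) and the cumulative
    quantized embedding (the sum of the selected codewords).
    """
    residual = list(z)
    cumulative = [0.0] * len(z)
    indices = []
    for codebook in codebooks:
        k, c = nearest_codeword(residual, codebook)
        indices.append(k)
        cumulative = [a + b for a, b in zip(cumulative, c)]
        residual = [a - b for a, b in zip(residual, c)]  # passed to the next layer
    return indices, cumulative

# Toy 2-D example with two layers; all values are made up for illustration.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],    # layer 1: coarse codewords
    [[0.25, 0.0], [0.0, 0.25]],  # layer 2: finer corrections
]
z = [1.2, 0.1]
ids, z_q = residual_quantize(z, codebooks)
print(ids, z_q)
```

Each layer only has to encode what the previous layers missed, so the cumulative embedding moves closer to `z` as layers are added while the ID stays a short tuple of indices.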
4.2.2. Problem Formulation
Consider a set of users and a set of items. Each user has a chronological interaction sequence of items from this set, and the goal of sequential recommendation is to predict the next item that the user is most likely to interact with.
In the generative recommendation approach, this is reframed as generating a unique ID for the target item. Each item is represented by a structured, hierarchical semantic ID $y = (y^{(1)}, \ldots, y^{(L)})$, where $L$ is the number of levels (ID length) and each $y^{(l)}$ is an index into a level-specific codebook of finite size. These IDs are learned to match human-understandable item categories. The task then becomes an autoregressive prediction problem:
$
p(y_{t + 1} \mid y_1, \ldots, y_t) = \prod_{l = 1}^{L} p(y_{t + 1}^{(l)} \mid y_1, \ldots, y_t, y_{t + 1}^{(1)}, \ldots, y_{t + 1}^{(l - 1)}) \quad (2)
$
Here, $p(y_{t + 1} \mid y_1, \ldots, y_t)$ is the probability of the next item's full semantic ID given the history of previous item semantic IDs. This is factorized into a product of conditional probabilities for each level of the hierarchical ID, meaning the prediction of $y_{t + 1}^{(l)}$ depends on the user history ($y_1, \ldots, y_t$) and the already predicted codes for the current item at coarser levels ($y_{t + 1}^{(1)}, \ldots, y_{t + 1}^{(l - 1)}$).
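As a small worked example of this factorization, the probability of a full three-level ID is the product of its per-level conditionals. The numbers below are made up for illustration, and `semantic_id_log_prob` is a hypothetical helper; in practice the conditionals would come from the sequence model.

```python
import math

def semantic_id_log_prob(level_probs):
    """Log-probability of a full semantic ID from its per-level conditionals.

    level_probs[i] is the (assumed) model probability of the chosen code at
    level i + 1, conditioned on the user history and the coarser codes.
    """
    return sum(math.log(p) for p in level_probs)

# Made-up conditionals for a 3-level ID: coarse, mid, fine.
level_probs = [0.5, 0.4, 0.25]
log_p = semantic_id_log_prob(level_probs)
p = math.exp(log_p)
print(round(p, 6))
```

Summing log-probabilities instead of multiplying raw probabilities is a common choice to avoid numerical underflow when IDs have many levels.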
4.2.3. Framework Overview
The HiD-VAE framework cleanly separates representation learning from sequential modeling across two stages, as illustrated in Figure 2.
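The image is a schematic comparing existing ID generators (left) with the HiD-VAE solution (right). The conventional generator on the left suffers from ID collisions, whereas HiD-VAE on the right uses hierarchical and disentangled semantic IDs, which avoids this problem and improves recommendation accuracy and diversity.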
Figure 2: The HiD-VAE framework. HiD-VAE first learns hierarchical semantic IDs and disentangled tag IDs via a supervised VAE (a), which a Transformer then uses for interpretable sequential recommendation (b).
- Stage 1: Offline Hierarchical ID Learning (Figure 2a)
  - Goal: Train the HiD-VAE to learn hierarchical and disentangled semantic IDs.
  - Process: For each item defined by content features (e.g., text embeddings), HiD-VAE learns a hierarchical semantic ID. This is achieved by optimizing a composite objective function that combines standard VAE losses with the paper's novel hierarchically-supervised and uniqueness losses.
  - Output: A frozen, high-quality item tokenizer capable of converting any item into its unique, interpretable, and disentangled semantic ID.
- Stage 2: Online Interpretable Recommendation (Figure 2b)
  - Goal: Use the learned semantic IDs for interpretable sequential recommendation.
  - Process: The pre-trained and frozen HiD-VAE acts as the item tokenizer. For each user's interaction history, every item is mapped to its corresponding semantic ID sequence. A Transformer-based sequential model, equipped with hierarchy-aware semantic embeddings (to preserve the structured semantics from Stage 1), is then trained to autoregressively predict the semantic ID of the next item. During inference, constrained decoding ensures that generated IDs correspond to valid items.
  - Output: Recommended items with interpretable semantic paths.

This two-stage design allows each component to be specialized for its specific task.
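The constrained decoding step mentioned for Stage 2 can be illustrated with a prefix trie over the valid item IDs. This is a sketch under assumptions: the source does not specify the decoding algorithm, so greedy selection is used here for brevity (beam search is a common alternative), and `score_fn` is a hypothetical stand-in for the Transformer's per-code score.

```python
def build_prefix_trie(valid_ids):
    """Nested-dict trie over valid semantic IDs (tuples of code indices)."""
    root = {}
    for sid in valid_ids:
        node = root
        for code in sid:
            node = node.setdefault(code, {})
    return root

def constrained_greedy_decode(score_fn, trie, num_levels):
    """At each level, pick the best-scoring code that keeps the prefix valid.

    score_fn(prefix, code) is a placeholder for the sequence model's score.
    """
    prefix, node = [], trie
    for _ in range(num_levels):
        allowed = list(node.keys())  # only codes that extend a real item ID
        best = max(allowed, key=lambda c: score_fn(tuple(prefix), c))
        prefix.append(best)
        node = node[best]
    return tuple(prefix)

valid_ids = [(0, 1, 2), (0, 3, 1), (4, 0, 0)]
trie = build_prefix_trie(valid_ids)

# Toy scorer that always prefers larger code indices. With codes 0..4 at every
# level, unconstrained greedy decoding would emit (4, 4, 4), which is no item.
def toy_score(prefix, code):
    return code

decoded = constrained_greedy_decode(toy_score, trie, num_levels=3)
print(decoded)
```

Because every step is restricted to children of the current trie node, the decoded tuple is always one of the IDs in `valid_ids`.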
4.2.4. Hierarchical Tag Generation
Many real-world datasets lack pre-defined hierarchical tags, which are crucial for HiD-VAE's hierarchically-supervised approach. To overcome this, the paper introduces a two-stage LLM-based approach for automatic hierarchical tag generation. This reframes the task as a "retrieval-then-classification" pipeline.
- Candidate Tag Retrieval:
  - Goal: Create a small, relevant set of candidate tags for each hierarchical level.
  - Process: For each level $l$, a tag pool $\mathcal{T}^{(l)}$ is constructed (from existing labels or manual annotations). For an item with content text $x_{\mathrm{text}}$, its semantic embedding $\mathbf{o} = \mathcal{E}_{\mathrm{sent}}(x_{\mathrm{text}})$ is computed using a pre-trained sentence encoder. Then, for each level $l$, the top-$K$ most semantically similar tags are retrieved from $\mathcal{T}^{(l)}$ by performing a nearest-neighbor search against pre-computed tag embeddings.
  - Formula: The candidate set is defined as:
    $
    C_{\mathrm{cand}}^{(l)} = \operatorname{Top-K}_{t\in \mathcal{T}^{(l)}}\left(\mathrm{sim}\left(\mathbf{o},\mathcal{E}_{\mathrm{sent}}(t)\right)\right) \quad (3)
    $
    Here, $\mathbf{o}$ is the semantic embedding of the item's text, $\mathcal{E}_{\mathrm{sent}}(t)$ is the semantic embedding of a tag $t$, and $\mathrm{sim}(\cdot,\cdot)$ denotes a similarity function (e.g., cosine similarity). $\operatorname{Top-K}$ selects the $K$ tags with the highest similarity.
- LLM-based Tag Classification:
  - Goal: Select the most suitable tag from the candidate set using an LLM.
  - Process: With a small candidate set $C_{\mathrm{cand}}^{(l)}$, an LLM (e.g., Qwen3-235B-A22B-Instruct-2507) is prompted to act as a classifier. The prompt includes the item's content $x_{\mathrm{text}}$, any previously determined higher-level tags $\{t^{(j)}\}_{j=1}^{l-1}$ (to enforce hierarchical consistency), and the candidate set $C_{\mathrm{cand}}^{(l)}$. The LLM predicts the most probable tag for the current level $l$.
  - Formula:
    $
    t^{(l)*} = \underset{t\in C_{\mathrm{cand}}^{(l)}}{\arg\max}\; P_{\mathrm{LLM}}\left(t \mid x_{\mathrm{text}}, \{t^{(j)}\}_{j=1}^{l-1}, C_{\mathrm{cand}}^{(l)}\right) \quad (4)
    $
    Here, $P_{\mathrm{LLM}}(\cdot)$ is the probability assigned by the LLM to a tag $t$ from the candidate set, given the item text and its higher-level tags. The $\arg\max$ selects the tag with the highest probability.
  - Implementation: Because the LLM only chooses among retrieved candidates, every predicted tag exists in the tag pool and is conditioned on its parent tags. This "retrieval-then-classification" approach leverages the LLM's contextual understanding without risking hallucination from open-ended generation. (Details on prompt design in Appendix B.)
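The retrieval step of Eq. 3 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `top_k_candidates`, the toy tag pool, and the hand-made 3-dimensional vectors are all assumptions standing in for real sentence-encoder outputs, and cosine similarity is used as the similarity function.

```python
import numpy as np

def top_k_candidates(item_emb, tag_embs, tag_names, k):
    """Return the k tag names whose embeddings are most cosine-similar to item_emb."""
    item = item_emb / np.linalg.norm(item_emb)
    tags = tag_embs / np.linalg.norm(tag_embs, axis=1, keepdims=True)
    sims = tags @ item                 # cosine similarity of each tag to the item
    order = np.argsort(-sims)[:k]      # indices of the k highest similarities
    return [tag_names[i] for i in order]

# Hypothetical tag pool for one hierarchy level (toy vectors, not real embeddings).
tag_names = ["skincare", "makeup", "haircare", "fragrance"]
tag_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.0, 0.7],
    [0.0, 0.0, 1.0],
])
item_emb = np.array([0.9, 0.1, 0.2])

candidates = top_k_candidates(item_emb, tag_embs, tag_names, k=2)
print(candidates)
```

In a full pipeline the returned candidates would be placed in the LLM prompt for the classification step (Eq. 4); for large tag pools an approximate nearest-neighbor index would replace the dense matrix product.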
4.2.5. Hierarchical Representation Learning
This section details how HiD-VAE addresses codebook collapse and ID collisions by employing hierarchically-supervised quantization.
Given an item with feature vector $x$ (text embeddings) and ground-truth hierarchical category tags with indices $c^{(l)}$ and embeddings $\pmb{t}^{(l)}$ for levels $l = 1, \ldots, L$:
- The item features $x$ are passed through an encoder $E$ to produce an initial continuous latent representation $z_0 = E(x)$.
- An RQ-VAE initiates a layered quantization process. At each layer $l$, a quantizer takes the residual from the previous layer (initialized to $z_0$ at the first layer) and identifies the closest codeword in that layer's codebook. The residual for the next layer is the current residual minus the selected codeword. The cumulative quantized embedding up to layer $l$ is denoted $z_q^{(l)}$.

To supervise this process, two main losses are introduced:
- Tag Alignment Loss ($\mathcal{L}_{\mathrm{align}}^{(l)}$):
  - Goal: Ensure that the codebook at layer $l$ captures the semantics of the $l$-th level of the category hierarchy by making the cumulative quantized embedding $z_q^{(l)}$ similar to its corresponding tag embedding and dissimilar to others.
  - Process: The ground-truth tag embedding $\pmb{t}^{(l)}$ is projected into the item's latent space using a layer-specific projector $P_l$. This contrastive loss pulls $z_q^{(l)}$ towards its corresponding projected tag embedding while pushing it away from the other tag embeddings in the same mini-batch.
  - Formula:
    $
    \mathcal{L}_{\mathrm{align}}^{(l)} = -\log \frac{\exp\left(\operatorname{sim}(z_q^{(l)},P_l(\pmb{t}^{(l)})) / \tau\right)}{\sum_{j = 1}^{B}\exp\left(\operatorname{sim}(z_q^{(l)},P_l(\pmb{t}^{(j)})) / \tau\right)}, \quad (5)
    $
    Here, $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is a temperature hyperparameter (controlling the sharpness of the distribution), $B$ is the batch size, $z_q^{(l)}$ is the cumulative quantized embedding for the item at layer $l$, and $P_l(\pmb{t}^{(l)})$ is the projected embedding of the ground-truth tag for that item at layer $l$. The denominator sums over all projected tag embeddings in the batch, including the positive pair.
- Tag Prediction Loss ($\mathcal{L}_{\mathrm{pred}}^{(l)}$):
  - Goal: Directly classify the cumulative quantized embedding to its corresponding hierarchical tag, especially to handle varying semantic depths and numbers of categories across layers.
  - Process: Each layer employs a tailored classifier $C_l$. Deeper layers use larger hidden dimensions and progressively higher dropout rates to handle the richer (and higher-dimensional) cumulative quantized embedding they receive as input. This gives classifiers for deeper layers greater capacity for nuanced semantics.
  - Formula:
    $
    \mathcal{L}_{\mathrm{pred}}^{(l)} = \mathrm{CrossEntropy}\left(C_l(z_q^{(l)}), c^{(l)}\right), \quad (6)
    $
    Here, $\mathrm{CrossEntropy}$ is the standard cross-entropy loss, $C_l$ is the layer-specific classifier for layer $l$, $z_q^{(l)}$ is the cumulative quantized embedding for the item at layer $l$, and $c^{(l)}$ is the ground-truth tag index for the item at layer $l$. Optional extensions such as focal loss (with a focusing parameter $\gamma$) can be applied for imbalanced classes.

This dual supervision (alignment and prediction) ensures that the learned semantic IDs are structured and mapped to an interpretable hierarchy.
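The quantization loop and the contrastive alignment term can be sketched as follows. This is a rough NumPy illustration under stated assumptions, not the authors' code: the names `residual_quantize` and `alignment_loss`, the toy codebooks, and the temperature value are hypothetical, and the cumulative embedding is formed by summing codewords, which is only one possible reading (the remark about growing dimensionality suggests the paper may concatenate instead).

```python
import numpy as np

def residual_quantize(z0, codebooks):
    """Greedy residual quantization: pick one codeword per layer."""
    residual = z0.copy()
    indices, codewords = [], []
    for codebook in codebooks:                      # each codebook has shape (K, d)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                 # closest codeword to the residual
        indices.append(idx)
        codewords.append(codebook[idx])
        residual = residual - codebook[idx]         # remainder goes to the next layer
    return indices, codewords, residual

def alignment_loss(z_q, projected_tags, positive, tau):
    """Eq. 5 for one item: -log softmax of cosine similarities at the positive tag."""
    z = z_q / np.linalg.norm(z_q)
    t = projected_tags / np.linalg.norm(projected_tags, axis=1, keepdims=True)
    logits = (t @ z) / tau
    m = logits.max()
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))   # numerically stable
    return float(log_sum_exp - logits[positive])

# Toy two-layer example with 2-d latents (illustrative values only).
codebooks = [np.array([[1.0, 0.0], [0.0, 1.0]]),
             np.array([[0.2, 0.0], [0.0, 0.2]])]
z0 = np.array([0.9, 0.25])
indices, codewords, residual = residual_quantize(z0, codebooks)

# Assumption: cumulative embedding as a sum of codewords (concatenation is equally plausible).
z_q = np.sum(codewords, axis=0)

# Already-projected tag embeddings for a batch of three; row 0 is the item's own tag.
projected_tags = np.array([[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
loss_correct = alignment_loss(z_q, projected_tags, positive=0, tau=0.1)
loss_wrong = alignment_loss(z_q, projected_tags, positive=1, tau=0.1)
print(indices, loss_correct, loss_wrong)
```

The loss is small when the quantized embedding already points towards its own tag and large when the designated positive is a dissimilar tag, which is the pull-push behavior described above.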
4.2.6. Disentanglement via Uniqueness Loss
To combat ID collision, HiD-VAE introduces a uniqueness loss that operates on the continuous, pre-quantization latent vectors.
- Goal: Penalize representation overlap between pairs of distinct items that are assigned identical semantic ID sequences within a training batch. This directly encourages an injective mapping from items to IDs.
- Process: For two different items $i$ and $j$ in a batch, with initial latent representations $z_{0,i}$ and $z_{0,j}$ respectively, if their full semantic ID sequences collide (i.e., the two items receive the same code index at every layer), a margin-based penalty is applied.
- Formula:
  $
  \mathcal{L}_{\mathrm{unique}} = \frac{1}{|P|}\sum_{(i,j)\in P}\max \left(0,\frac{z_{0,i}\cdot z_{0,j}}{\|z_{0,i}\|_2\,\|z_{0,j}\|_2} - m\right), \quad (7)
  $
  Here, $P$ is the set of all distinct item pairs $(i,j)$ within the batch that have colliding semantic IDs. The fraction is the cosine similarity between the initial latent representations of the two colliding items, and $m$ is a margin hyperparameter. The $\max(0, \cdot)$ ensures that the loss is only incurred when the cosine similarity exceeds the margin, pushing these latent representations apart. The loss is averaged over all such colliding pairs in the batch.
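Eq. 7 translates almost directly into code. The sketch below is a plain NumPy illustration with hypothetical names (`uniqueness_loss`) and toy values; a training implementation would use a differentiable framework and vectorized pair selection, and returning zero when no pairs collide is an assumption made here to avoid dividing by an empty set.

```python
import numpy as np
from itertools import combinations

def uniqueness_loss(z0, ids, margin):
    """Mean hinge penalty on cosine similarity over item pairs with identical IDs."""
    pairs = [(i, j) for i, j in combinations(range(len(ids)), 2)
             if tuple(ids[i]) == tuple(ids[j])]
    if not pairs:
        return 0.0                                  # assumption: no collisions, no penalty
    unit = z0 / np.linalg.norm(z0, axis=1, keepdims=True)
    penalties = [max(0.0, float(unit[i] @ unit[j]) - margin) for i, j in pairs]
    return sum(penalties) / len(pairs)

# Three items; the first two collide on their full semantic ID.
z0 = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
ids = [(3, 7, 1), (3, 7, 1), (3, 7, 2)]

loss = uniqueness_loss(z0, ids, margin=0.5)
no_collision = uniqueness_loss(z0, [(0,), (1,), (2,)], margin=0.5)
print(loss, no_collision)
```

Only the colliding pair contributes: its cosine similarity is close to 1, so the penalty is roughly that similarity minus the margin, while the third item is ignored because its ID differs in the last position.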
4.2.7. Interpretable Generative Recommendation (Stage 2)
With high-quality IDs generated in Stage 1, Stage 2 uses these semantic IDs for sequential recommendation. The challenge is modeling these structured, multi-level IDs for autoregressive generation without losing their semantics or generating invalid combinations.
HiD-VAE proposes two innovations for the Transformer-based autoregressive model:
- Hierarchy-Aware Semantic Embeddings:
  - Goal: Preserve the structured semantics of the IDs and enhance interpretability.
  - Process: A custom embedding layer is designed. Each token (level) in an item's semantic ID is first mapped to its corresponding tag text. This tag text is then encoded into a semantic vector using a pre-trained embedding model. These semantic vectors are then concatenated with learnable ID embeddings (corresponding to the codebook index) and type embeddings specific to each hierarchical level.
  - Benefit: This enriches the feature representation by integrating explicit semantic information, allowing the model to capture the coarse-to-fine semantic path encoded in the ID, thereby enhancing interpretability and modeling capability.
- Constrained Decoding for Validity:
  - Goal: Ensure that generated IDs always correspond to real, existing items.
  - Process: During inference, a constrained decoding strategy is implemented. All valid semantic ID prefixes are pre-computed and stored in an efficient data structure (e.g., a trie). During token-by-token generation (autoregressive prediction of each level), the model's output vocabulary is dynamically masked so that only tokens forming valid, existing prefixes are allowed.
  - Benefit: This pruning mechanism guarantees that the generated output always corresponds to an actual item in the inventory, preventing hallucination of non-existent ID combinations and improving practical applicability.
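The masking idea behind constrained decoding can be illustrated with a nested-dict trie. This is a simplified sketch, not the paper's implementation: the helper names, the greedy (rather than beam) search, and the stand-in `score_fn` that plays the role of the Transformer's logits are all assumptions, and every valid ID is assumed to have the same length.

```python
def build_trie(valid_ids):
    """Nested-dict prefix tree over all valid semantic ID tuples."""
    root = {}
    for sid in valid_ids:
        node = root
        for token in sid:
            node = node.setdefault(token, {})
    return root

def allowed_next(trie, prefix):
    """Tokens that extend `prefix` to a valid prefix (empty set if prefix is invalid)."""
    node = trie
    for token in prefix:
        if token not in node:
            return set()
        node = node[token]
    return set(node.keys())

def constrained_greedy_decode(trie, score_fn, length):
    """At each level, keep only allowed tokens and pick the best-scoring one."""
    prefix = []
    for _ in range(length):
        options = allowed_next(trie, prefix)        # the vocabulary mask at this step
        prefix.append(max(options, key=lambda tok: score_fn(prefix, tok)))
    return tuple(prefix)

# Hypothetical catalog of three items with 3-level IDs.
valid_ids = [(0, 1, 2), (0, 1, 3), (4, 0, 0)]
trie = build_trie(valid_ids)

# Stand-in for model scores: always prefers tokens close to 5 (token 5 itself is never valid).
score_fn = lambda prefix, tok: -abs(tok - 5)

result = constrained_greedy_decode(trie, score_fn, length=3)
print(result)
```

Without the mask, this scorer would emit token 5 at every level and produce an ID that matches no item; with the mask, each step is restricted to children of the current trie node, so the output is always one of the catalog IDs.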
4.2.8. Optimization
The framework is trained in two distinct stages, each with its own objective function.
- Stage 1: HiD-VAE Training:
  - Goal: Train the HiD-VAE encoder and decoder, codebooks, projectors, and classifiers to learn hierarchical and disentangled item representations.
  - Objective Function: The HiD-VAE is trained end-to-end by minimizing a composite loss function:
    $
    \mathcal{L}_{\mathrm{HiD\text{-}VAE}} = \mathcal{L}_{\mathrm{recon}} + \beta_{\mathrm{commit}}\mathcal{L}_{\mathrm{commit}} + \beta_{\mathrm{sup}}\sum_{l = 1}^{L}\left(\mathcal{L}_{\mathrm{align}}^{(l)} + \mathcal{L}_{\mathrm{pred}}^{(l)}\right) + \beta_{\mathrm{unique}}\mathcal{L}_{\mathrm{unique}} \quad (8)
    $
    Here:
    - $\mathcal{L}_{\mathrm{recon}}$: The reconstruction loss between the input $x$ and the decoder output $\hat{x}$ (see Appendix C.1 for details):
      $
      \mathcal{L}_{\mathrm{recon}} = \|x - \hat{x}\|_2^2 \quad (10)
      $
      This is the squared L2 error between the original input and its reconstruction from the decoder (the Mean Squared Error up to a constant factor).
    - $\mathcal{L}_{\mathrm{commit}}$: The vector quantization commitment loss, which regularizes the encoder's output space by encouraging the encoder's continuous output to "commit" to its chosen codeword. It uses a stop-gradient (sg) operator to isolate the gradient flow (see Appendix C.1 for details):
      $
      \mathcal{L}_{\mathrm{commit}} = \|z_e(x) - \mathrm{sg}[z_q(x)]\|_2^2 \quad (11)
      $
      This loss pulls the encoder output $z_e(x)$ towards the quantized code $z_q(x)$ chosen by the VQ module. The stop-gradient operator treats $z_q(x)$ as a constant in this term, so its gradient flows only into the encoder and does not update the codebook through this path.
    - $\sum_{l=1}^{L}(\mathcal{L}_{\mathrm{align}}^{(l)} + \mathcal{L}_{\mathrm{pred}}^{(l)})$: The sum of the Tag Alignment Loss (Formula 5) and Tag Prediction Loss (Formula 6) across all quantization layers. These are the hierarchical supervision losses.
    - $\mathcal{L}_{\mathrm{unique}}$: The Uniqueness Loss (Formula 7), which promotes disentanglement.
    - $\beta_{\mathrm{commit}}$, $\beta_{\mathrm{sup}}$, $\beta_{\mathrm{unique}}$: Hyperparameters that balance the contribution of each loss component to the overall objective.
- Stage 2: Recommender Training:
  - Goal: Train the Transformer-based recommender using the frozen semantic IDs from Stage 1.
  - Objective Function: The Transformer-based recommender is trained using next-token prediction with cross-entropy loss. For a user history $S_u$ with ID sequences $(\pmb{y}_1, \ldots, \pmb{y}_{|S_u|})$, the objective maximizes the likelihood of the next item's ID sequence:
    $
    \mathcal{L}_{\mathrm{rec}} = -\sum_{u\in \mathcal{U}}\sum_{t = 1}^{|S_{u}| - 1}\log p(\pmb{y}_{t + 1} \mid \pmb{y}_1,\ldots ,\pmb{y}_t) \quad (9)
    $
    Here, $\mathcal{L}_{\mathrm{rec}}$ is the cross-entropy loss used to train the Transformer. It sums the negative log-likelihoods of predicting the true next item's semantic ID given the user's previous sequence of semantic IDs, over all users and all prediction steps in their sequences.
  - Enhancement: As mentioned previously, each token in the semantic ID is mapped to its corresponding tag text and encoded into a semantic vector using a pre-trained embedding model, enriching the representation with explicit semantic information.

This two-stage approach first establishes an interpretable and disentangled representation space, then leverages it to model sequential user behavior effectively. (Details on implementation in Appendix C.2.)
4.2.9. Algorithmic Overview (Algorithm 1)
The pseudocode for the HiD-VAE framework provides a unified view of the entire two-stage process.

Algorithm 1: HiD-VAE Framework

Require: Item features, optional tags, user sequences

Ensure: Predicted next items for each user

     1: // Stage 0: Hierarchical Tag Generation (if tags not available)
     2: for each item do
     3:     compute the semantic embedding of the item text
     4:     for l = 1 to L do
     5:         retrieve Top-K candidates from the level-l tag pool    // Candidate tag retrieval (Eq. 3)
     6:         select the level-l tag from the candidates with the LLM    // LLM-based tag classification (Eq. 4)
     7:     end for
     8: end for
     9: // Stage 1: HiD-VAE Training
    10: Initialize encoder, decoder, codebooks, projectors, classifiers
    11: while not converged do
    12:     Sample batch
    13:     z_0 = E(x)    // Initial latent representation
    14:     initialize the residual to z_0
    15:     initialize storage for cumulative quantized embeddings
    16:     for l = 1 to L do
    17:         quantize the residual; get codeword and index
    18:         compute the next residual
    19:         form the cumulative quantized embedding z_q^(l)
    20:         compute L_align^(l) (Eq. 5) and L_pred^(l) (Eq. 6)
    21:     end for
    22:     reconstruct the input from the final cumulative quantized embedding
    23:     compute L_recon (Eq. 10), L_commit (Eq. 11), L_unique (Eq. 7, on colliding pairs)
    24:     update parameters via L_HiD-VAE (Eq. 8)
    25: end while
    26: Freeze HiD-VAE; map all items to IDs
    27: // Stage 2: Recommender Training
    28: Initialize Transformer with hierarchy-aware embeddings
    29: while not converged do
    30:     Sample user batch
    31:     Map to ID sequences
    32:     Enrich embeddings with tag semantics and level types
    33:     compute L_rec via autoregressive cross-entropy (Eq. 9)
    34:     Update Transformer parameters
    35: end while
    36: // Inference
    37: for each test sequence do
    38:     Autoregressively generate the next ID with constrained decoding (mask invalid prefixes)
    39:     Map the generated ID to an item via the pre-computed ID-to-item index
    40: end for
Explanation of Algorithm 1:
- Lines 1-8 (Stage 0: Hierarchical Tag Generation): This optional preliminary stage is executed once if ground-truth hierarchical tags are not available. For each item, it first computes a semantic embedding of its text. Then, for each hierarchical level ($l = 1, \ldots, L$), it retrieves a set of candidate tags from a predefined tag pool. Finally, an LLM is used to classify the most appropriate tag for that level, taking into account the item text and any higher-level tags already determined. This process populates the ground-truth tags needed for Stage 1.
- Lines 9-26 (Stage 1: HiD-VAE Training): This is the core representation learning stage.
  - The encoder, decoder, codebooks, projectors, and classifiers are initialized.
  - In a training loop, batches of items and their (ground-truth or generated) hierarchical tags are sampled.
  - The item's initial latent representation $z_0$ is obtained from the encoder.
  - The RQ-VAE process iteratively quantizes the residual at each layer $l$, producing a codeword (and its index) and an updated residual. The cumulative quantized embedding $z_q^{(l)}$ is also formed.
  - At each layer, the Tag Alignment Loss ($\mathcal{L}_{\mathrm{align}}^{(l)}$) and Tag Prediction Loss ($\mathcal{L}_{\mathrm{pred}}^{(l)}$) are computed to provide hierarchical supervision.
  - After processing all layers, the final cumulative quantized embedding is passed to the decoder to reconstruct the input $x$.
  - The standard reconstruction loss ($\mathcal{L}_{\mathrm{recon}}$), commitment loss ($\mathcal{L}_{\mathrm{commit}}$), and the uniqueness loss ($\mathcal{L}_{\mathrm{unique}}$) are computed. The uniqueness loss is applied only to pairs of distinct items that receive identical semantic ID sequences.
  - All model parameters are updated using the combined HiD-VAE loss.
  - Once training converges, the HiD-VAE is frozen, and all items are mapped to their final hierarchical semantic IDs.
- Lines 27-35 (Stage 2: Recommender Training): This stage trains the Transformer-based sequential recommender.
  - A Transformer model is initialized with hierarchy-aware embeddings.
  - In a training loop, batches of user sequences are sampled.
  - Each item in the user sequence is mapped to its pre-computed semantic ID sequence.
  - The semantic embeddings are enriched with tag semantics and level types.
  - The Transformer is trained to predict the next item's semantic ID using autoregressive cross-entropy loss ($\mathcal{L}_{\mathrm{rec}}$).
  - Transformer parameters are updated.
- Lines 36-40 (Inference): This describes how recommendations are generated.
  - For a given test sequence of semantic IDs, the Transformer autoregressively generates the semantic ID of the next item.
  - Constrained decoding is used during generation to ensure that only valid semantic ID prefixes (corresponding to real items) are generated.
  - Finally, the generated semantic ID is mapped back to an actual item using a pre-computed index.
Implementation Details (from Appendix C.2):
- Hyperparameters: Typical values include , , and batch sizes of 512 for Stage 1. Tuning is performed via grid search on validation sets.
- Inference Efficiency: Constrained decoding in Stage 2 uses a trie-based prefix tree for efficient vocabulary masking. This reduces invalid generations to zero while keeping the per-token masking cost bounded in terms of the maximum codebook size across layers. The trie is built offline from all valid item IDs to ensure scalability for large catalogs.
- Partial Tags: For datasets with partial tags, a hybrid approach can combine LLM-generated and ground-truth labels, weighted by LLM confidence scores. This boosts alignment loss convergence in early epochs.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three widely-used public benchmarks to ensure a comprehensive assessment across different domains and data characteristics. Following the standard practice [19, 34], a 5-core setting is adopted, where all users and items with fewer than five interactions are filtered out.
The following are the results from Table 4 of the original paper:
| Dataset | # Users | # Items | # Interactions | # Seq.Length |
|---|---|---|---|---|
| Beauty | 22,363 | 12,101 | 198,360 | 8.87 |
| Sports | 35,598 | 18,557 | 296,175 | 8.32 |
| KuaiRand | 983 | 29,983 | 953,166 | 19.83 |
Description of Datasets:
- Beauty: This dataset is derived from the Amazon Review Data project [12]. It is described as a popular and relatively dense benchmark, suitable for evaluating general recommendation performance.
- Sports and Outdoors (Sports): Also from the Amazon Review Data project [12]. This dataset is larger and significantly sparser than Beauty, which allows for evaluating the model's robustness under more challenging data distributions.
- KuaiRand-1K [9]: This is a large-scale public dataset from the Kuaishou short video platform. It contains user interactions with rich side information and represents a distinct domain (short video recommendations). It serves to test the generalizability of the HiD-VAE approach. Notably, for datasets like KuaiRand where structured hierarchies are not natively provided, the paper uses its LLM-based pre-processing pipeline (as described in Section 4.2.4). This involves prompting a Large Language Model with the item's title and raw category string to generate a clean, consistent multi-level category hierarchy for each item, ensuring a complete category path for supervision.

These datasets were chosen because they represent different domains and data characteristics (density, size, type of items), allowing for a comprehensive and robust assessment of HiD-VAE's capabilities.
5.2. Evaluation Metrics
The paper adopts the standard leave-one-out evaluation protocol [7]. For each user's interaction history, the last item is held out for testing, the second-to-last item is used for validation, and the remaining items are used for training. Performance is reported using two top-K ranking metrics with $K$ set to 5 and 10: Recall@K and Normalized Discounted Cumulative Gain (NDCG)@K.
- Recall@K
  - Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. It focuses on the completeness of the retrieved relevant items. A higher Recall@K means more relevant items are found among the top $K$ recommendations.
  - Mathematical Formula:
    $
    \text{Recall@K} = \frac{|\text{Relevant Items in Top K}|}{|\text{All Relevant Items}|}
    $
  - Symbol Explanation:
    - $|\text{Relevant Items in Top K}|$: The number of relevant items that appear in the top $K$ recommended list.
    - $|\text{All Relevant Items}|$: The total number of relevant items for a given user (in the leave-one-out setting, this is 1, the single held-out next item).
- Normalized Discounted Cumulative Gain (NDCG)@K
  - Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items in the ranked list. It assigns higher scores to relevant items that appear earlier (at higher ranks) in the recommendation list. It is "normalized" by dividing by the Ideal Discounted Cumulative Gain (IDCG), which is the DCG of a perfectly sorted list where all relevant items are at the top. This normalization ensures the metric is between 0 and 1.
  - Mathematical Formula:
    $
    \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}
    $
    where
    $
    \text{DCG@K} = \sum_{j=1}^K \frac{2^{\text{rel}_j} - 1}{\log_2(j+1)}
    $
    and $\text{IDCG@K}$ is the DCG@K for the ideal ranking (i.e., relevant items sorted by highest relevance).
  - Symbol Explanation:
    - $\text{DCG@K}$: Discounted Cumulative Gain at rank $K$. It sums the relevance scores of items in the ranked list, discounting them logarithmically by their position.
    - $\text{IDCG@K}$: Ideal Discounted Cumulative Gain at rank $K$. This is the maximum possible DCG for a given set of relevant items.
    - $j$: The rank (position) of an item in the recommendation list, ranging from 1 to $K$.
    - $\text{rel}_j$: The relevance score of the item at rank $j$. For binary relevance (relevant/not relevant), $\text{rel}_j$ is 1 if the item is relevant and 0 otherwise. In the leave-one-out setting, the test item has relevance 1 and all others 0.
    - $\log_2(j+1)$: The logarithmic discount factor, which reduces the contribution of relevant items as they appear lower in the list.
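In the leave-one-out setting both metrics simplify considerably, because there is exactly one relevant item per user and IDCG@K equals 1. The sketch below shows the per-user computation under that assumption; the function names and the toy ranking are illustrative, and reported scores would be averages of these values over all test users.

```python
import math

def recall_at_k(ranked_items, target, k):
    """1.0 if the single held-out item appears in the top k, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """Binary-relevance NDCG with one relevant item, so IDCG@K = 1/log2(2) = 1."""
    for position, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)   # (2^1 - 1) / log2(position + 1)
    return 0.0

ranked = ["a", "b", "c", "d", "e"]   # model's ranking for one user
target = "c"                         # held-out test item, found at rank 3

r5 = recall_at_k(ranked, target, 5)
r2 = recall_at_k(ranked, target, 2)
n5 = ndcg_at_k(ranked, target, 5)
print(r5, r2, n5)
```

A hit at rank 3 yields Recall@5 of 1 but NDCG@5 of only 1/log2(4) = 0.5, which illustrates why the two metrics can move differently: Recall@K only registers whether the item was retrieved, while NDCG@K also rewards placing it near the top.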
5.3. Baselines
The paper conducts a comprehensive comparison against a wide spectrum of state-of-the-art models, grouped into three distinct categories:
- Traditional Sequential Models: These represent classic and influential non-Transformer approaches.
  - GRU4Rec [13]: Employs Gated Recurrent Units (GRUs) to capture sequential patterns.
  - Caser [35]: Uses Convolutional Neural Networks (CNNs) for sequential recommendation.
  - HGN [26]: Utilizes a Hierarchical Gating Network to model user interests.
  - NextItNet [45]: Leverages dilated convolutional layers for long-range dependencies.
- Transformer-based Models: These models represent the current standard for discriminative sequential recommendation, utilizing self-attention.
  - SASRec [19]: Applies the Transformer's self-attention mechanism to sequential recommendation.
  - BERT4Rec [34]: Adapts BERT's bidirectional encoder representations for user behavior.
- Generative Recommendation Models: This category includes models that directly generate item identifiers.
  - TIGER [31]: A prominent generative model that uses hierarchical RQ-VAE to learn semantic item identifiers, which are then predicted by a Transformer-based sequence model.
  - LC-Rec [48]: Also leverages RQ-VAE-based identifiers but focuses on aligning language and collaborative semantics. For a fair comparison, the T5-base variant is used as its backbone language model.
  - VQ-Rec [14]: This model is adapted to a generative framework by training a Transformer on its semantic IDs generated via Product Quantization. It serves as a non-hierarchical contrast to HiD-VAE's hierarchical IDs.
Implementation Details (from paper):
- The HiD-VAE framework is implemented in PyTorch with Hugging Face transformers and accelerate for mixed-precision (FP16) training on NVIDIA 4060 GPUs.
- Stage 1 (HiD-VAE):
  - Uses a 3-layer MLP encoder and decoder with GELU activations.
  - Takes 768-dimensional SentenceTransformer embeddings as input.
  - Employs multiple quantization layers, each with its own codebook, initialized via K-Means on the first batch.
  - Training uses the AdamW optimizer with batch size 128.
  - Key loss hyperparameters are the weights $\beta_{\mathrm{commit}}$, $\beta_{\mathrm{sup}}$, and $\beta_{\mathrm{unique}}$.
  - Focal loss is applied for tag prediction, together with an alignment temperature $\tau$ and a uniqueness margin $m$.
- Stage 2 (Recommender):
  - The frozen HiD-VAE serves as the item tokenizer with pre-computed semantic IDs.
  - The sequential model is a 6-layer Transformer encoder-decoder with 8 attention heads and hidden dimension 512.
  - Trained using the AdamW optimizer with batch size 256 and a warmup schedule.
  - Inference uses autoregressive generation with constrained decoding against a corpus cache.
- For all traditional baseline models, the RecBole [47] framework is leveraged for implementation.
6. Results & Analysis
6.1. Overall Performance Comparison (RQ1)
The paper presents a comprehensive comparison of HiD-VAE against a suite of strong baselines across three distinct datasets in Table 1.
The following are the results from Table 1 of the original paper:
Column groups: Traditional (GRU4Rec, Caser, HGN, NextItNet); Transformer-based (SASRec, BERT4Rec); Generative (VQ-Rec, TIGER, LC-Rec, HiD-VAE).

| Dataset | Metric | GRU4Rec | Caser | HGN | NextItNet | SASRec | BERT4Rec | VQ-Rec | TIGER | LC-Rec | HiD-VAE | Improv. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | R@5 | 0.0216 | 0.0093 | 0.0312 | 0.0143 | 0.0363 | 0.0116 | 0.0285 | 0.0312 | 0.0402 | 0.0543 | +35.07% |
| Beauty | R@10 | 0.0293 | 0.0146 | 0.0358 | 0.0221 | 0.0498 | 0.0174 | 0.0431 | 0.0457 | 0.0563 | 0.0698 | +23.98% |
| Beauty | N@5 | 0.0154 | 0.0058 | 0.0217 | 0.0090 | 0.0269 | 0.0082 | 0.0182 | 0.0209 | 0.0257 | 0.0358 | +33.08% |
| Beauty | N@10 | 0.0180 | 0.0075 | 0.0256 | 0.0115 | 0.0301 | 0.0100 | 0.0225 | 0.0253 | 0.0366 | 0.0421 | +15.03% |
| Sports | R@5 | 0.0097 | 0.0047 | 0.0162 | 0.0081 | 0.0202 | 0.0057 | 0.0291 | 0.0325 | 0.0385 | 0.0435 | +12.99% |
| Sports | R@10 | 0.0150 | 0.0080 | 0.0235 | 0.0130 | 0.0290 | 0.0089 | 0.0415 | 0.0474 | 0.0493 | 0.0632 | +28.19% |
| Sports | N@5 | 0.0065 | 0.0030 | 0.0111 | 0.0052 | 0.0118 | 0.0037 | 0.0199 | 0.0222 | 0.0251 | 0.0332 | +32.27% |
| Sports | N@10 | 0.0082 | 0.0040 | 0.0134 | 0.0067 | 0.0146 | 0.0047 | 0.0238 | 0.0270 | 0.0284 | 0.0397 | +39.79% |
| KuaiRand | R@5 | 0.0298 | 0.0074 | 0.0297 | 0.0276 | 0.0332 | 0.0185 | 0.0513 | 0.0557 | 0.0622 | 0.0668 | +7.40% |
| KuaiRand | R@10 | 0.0383 | 0.0118 | 0.0354 | 0.0327 | 0.0405 | 0.0217 | 0.0589 | 0.0624 | 0.0684 | 0.0785 | +14.77% |
| KuaiRand | N@5 | 0.0217 | 0.0068 | 0.0169 | 0.0216 | 0.0338 | 0.0196 | 0.0354 | 0.0383 | 0.0403 | 0.0479 | +18.86% |
| KuaiRand | N@10 | 0.0245 | 0.0095 | 0.0219 | 0.0278 | 0.0372 | 0.0236 | 0.0412 | 0.0445 | 0.0497 | 0.0586 | +17.91% |
Key Insights:
- HiD-VAE Achieves Substantial Improvements Across Datasets: HiD-VAE outperforms all baselines on every dataset (Beauty, Sports, KuaiRand) and across all metrics (Recall@5, Recall@10, NDCG@5, NDCG@10). Gains are reported relative to the strongest baseline for each metric:
  - On Beauty, HiD-VAE shows a +35.07% relative improvement in Recall@5 (over LC-Rec) and +33.08% in NDCG@5 (over SASRec, which scores slightly higher than LC-Rec on this metric).
  - On Sports, improvements are also strong, with +32.27% in NDCG@5.
  - On KuaiRand, it achieves +18.86% in NDCG@5.

  These results underscore the effectiveness of HiD-VAE's approach in generating semantically rich, interpretable, and distinct item representations, which provide a stronger foundation for the downstream generative recommender.
- Structured Representations Enhance Generative Model Efficacy:
  - Within the generative model family, a clear performance hierarchy emerges. VQ-Rec, which uses non-hierarchical identifiers (Product Quantization), consistently underperforms TIGER and LC-Rec. This indicates that simpler, flat identifiers are less suitable for autoregressive decoding, as the Transformer must predict a sequence of independent codes without the benefit of a coarse-to-fine semantic structure, likely leading to greater error propagation.
  - TIGER and LC-Rec improve upon VQ-Rec by using hierarchical codes based on RQ-VAE. However, they are still limited by their unsupervised nature, making them prone to semantic drift and ID collisions.
  - HiD-VAE excels here because its Hierarchically-Supervised (HS) process (Tag Alignment and Tag Prediction losses) ensures each ID level aligns with a meaningful category. Concurrently, the Disentanglement via Uniqueness Loss (DUL) actively minimizes ID collisions. This results in a more robust and disentangled representation space that is easier for the subsequent Transformer to model, leading to more precise and relevant recommendations.
- Generative Paradigms Outperform Discriminative Methods: A broader observation is the general superiority of generative models (HiD-VAE, LC-Rec, TIGER) over traditional sequential models (GRU4Rec, Caser, HGN, NextItNet) and Transformer-based discriminative models (SASRec, BERT4Rec). The margin is clearest on Sports and KuaiRand; on Beauty, SASRec still outperforms VQ-Rec and TIGER. This trend suggests that the paradigm of directly generating item identifiers (rather than scoring a pre-selected candidate set) is a more powerful approach. It sidesteps the potential disconnect between representation learning and the search/ranking process inherent in methods relying on Approximate Nearest Neighbor (ANN) search. By directly modeling the probability distribution over the entire item universe (as represented by the discrete semantic IDs), HiD-VAE can capture more complex and nuanced user preference patterns, supporting the advantages of the proposed "learn-then-generate" two-stage approach.
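The Improv. column of Table 1 can be reproduced as the relative gain over the best-scoring baseline in each row. The short check below does this for two Beauty rows, using values copied from the table; the helper name `relative_improvement` is an illustrative assumption, and only three of the baselines are listed since the others score lower on these rows.

```python
def relative_improvement(ours, baselines):
    """Return the best baseline's name and our relative gain over it, in percent."""
    best_name = max(baselines, key=baselines.get)
    best = baselines[best_name]
    return best_name, 100.0 * (ours - best) / best

# Beauty rows from Table 1 (three highest-scoring baselines only).
beauty_r5 = {"SASRec": 0.0363, "TIGER": 0.0312, "LC-Rec": 0.0402}
beauty_n5 = {"SASRec": 0.0269, "TIGER": 0.0209, "LC-Rec": 0.0257}

name_r5, gain_r5 = relative_improvement(0.0543, beauty_r5)
name_n5, gain_n5 = relative_improvement(0.0358, beauty_n5)
print(name_r5, round(gain_r5, 2))
print(name_n5, round(gain_n5, 2))
```

The best baseline differs by metric: LC-Rec leads on Recall@5, while SASRec leads on NDCG@5, so the reported gains are measured against different models within the same dataset.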
6.2. Ablation Study (RQ2)
To quantify the contribution of each core component, an ablation study was conducted. The variants are:
- HiD-VAE (Full Model): Complete model with Hierarchical Supervision (HS) (Tag Alignment Loss and Tag Prediction Loss) and Disentanglement Uniqueness Loss (DUL).
- w/o Tag Align: Removes the contrastive tag alignment loss ($\mathcal{L}_{\mathrm{align}}$), retaining tag prediction and uniqueness constraints.
- w/o Tag Pred: Removes the tag prediction loss ($\mathcal{L}_{\mathrm{pred}}$), relying solely on contrastive alignment for hierarchical structuring.
- w/o DUL: Removes the uniqueness loss ($\mathcal{L}_{\mathrm{unique}}$). To handle ID collisions, it adopts TIGER's [31] resolution strategy: appending a sequentially incrementing integer to colliding semantic IDs.

The following are the results from Table 2 of the original paper:

| Methods | Beauty R@10 | Beauty N@10 | KuaiRand R@10 | KuaiRand N@10 |
|---|---|---|---|---|
| HiD-VAE | 0.0698 | 0.0421 | 0.0785 | 0.0586 |
| w/o Tag Align | 0.0651 | 0.0392 | 0.0742 | 0.0541 |
| w/o Tag Pred | 0.0633 | 0.0378 | 0.0725 | 0.0529 |
| w/o DUL | 0.0524 | 0.0301 | 0.0668 | 0.0483 |
Key Findings from Ablation Study:
- Hierarchical Supervision is Crucial for Semantic Grounding: Removing either component of the Hierarchical Supervision (HS) mechanism (i.e., w/o Tag Align or w/o Tag Pred) leads to a noticeable performance drop.
  - Removing the Tag Prediction loss (w/o Tag Pred) results in a larger decline than removing the Tag Alignment loss (w/o Tag Align). This suggests that the direct classification signal from $\mathcal{L}_{\mathrm{pred}}$ acts as a stronger semantic anchor, forcing each code to map to a concrete category.
  - The contrastive $\mathcal{L}_{\mathrm{align}}$, while less impactful in isolation, is essential for refining the geometric structure of the latent space, ensuring semantically similar categories are represented closely. The synergy of both losses is vital for learning a robust and meaningful hierarchy.
- Disentanglement is Paramount; Avoiding ID Collisions is Critical: The most substantial performance degradation occurs in the w/o DUL variant. This finding is critical:
  - While the TIGER-style post-hoc fix (appending an integer) technically "resolves" ID collisions and prevents evaluation errors, it comes at a great semantic cost. This strategy injects non-semantic, arbitrary information into the final layer of the item identifier. For example, an ID that should be [beauty, skincare, mask] might become [beauty, skincare, mask, 1] or [beauty, skincare, mask, 2] for two distinct "face mask" items.
  - This appended integer disrupts the learned semantic sequence, effectively introducing noise that confuses the downstream generative model and hinders its ability to reason over the item's true attributes. The significant performance drop for w/o DUL strongly supports the paper's core motivation: achieving intrinsic disentanglement during the representation learning stage is far superior to relying on superficial, post-processing fixes that corrupt the semantic integrity of the learned identifiers.
6.3. Qualitative Insights and Analysis (RQ3)
6.3.1. Effectiveness of Disentanglement
To evaluate the mitigation of ID collision, the paper calculates the ID collision rate: the percentage of items that share non-unique IDs out of the total number of items. A lower rate indicates a more robust one-to-one mapping between items and their learned identifiers.
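The collision rate defined above is straightforward to compute from an item-to-ID mapping. The following is a small sketch with made-up item names and IDs; the function name `collision_rate` is an assumption, and the definition follows the sentence above (items whose ID is shared with at least one other item, divided by all items).

```python
from collections import Counter

def collision_rate(item_to_id):
    """Percentage of items whose semantic ID is shared with at least one other item."""
    counts = Counter(item_to_id.values())
    colliding = sum(1 for sid in item_to_id.values() if counts[sid] > 1)
    return 100.0 * colliding / len(item_to_id)

# Hypothetical catalog: two of five items share the same 3-level ID.
item_to_id = {
    "item_a": (3, 7, 1),
    "item_b": (3, 7, 1),
    "item_c": (3, 7, 2),
    "item_d": (0, 4, 9),
    "item_e": (5, 5, 5),
}

rate = collision_rate(item_to_id)
print(rate)
```

Note that every member of a colliding group counts towards the numerator, so a single shared ID between two items in a five-item catalog already gives a 40% rate.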
The following are the results from Table 3 of the original paper:
| Methods | Beauty | Sports | KuaiRand |
|---|---|---|---|
| VQ-Rec | 21.2% | 22.5% | 20.3% |
| RQ-VAE | 18.7% | 19.5% | 17.8% |
| HiD-VAE (Full) | 2.1% | 2.8% | 1.9% |
| w/o DUL | 17.5% | 18.2% | 16.9% |
| w/o HS | 5.8% | 6.5% | 5.2% |
Analysis:
- High Collision Rates in Unsupervised Methods: Both VQ-Rec and RQ-VAE (unsupervised baselines) exhibit substantial ID collisions (up to 22.5% for VQ-Rec). VQ-Rec's flat quantization scheme performs the worst, highlighting a fundamental weakness where lack of explicit constraints leads to significant representation entanglement.
- HiD-VAE Drastically Reduces Collisions: The full HiD-VAE model reduces the collision rate to a low level (e.g., 2.8% on Sports), an 87.6% relative reduction compared to VQ-Rec, the baseline with the highest collision rate (22.5% on Sports). This is quantitative evidence of a nearly injective mapping.
- DUL is the Primary Driver: The w/o DUL variant, lacking the uniqueness loss, performs only marginally better than RQ-VAE. This indicates that an explicit disentanglement mechanism is essential to prevent the representation collapse that plagues standard VQ-based tokenization schemes.
- HS Indirectly Contributes: The w/o HS variant, while better than the baselines due to DUL, still has a higher collision rate than the full HiD-VAE. This suggests that hierarchical supervision, by imposing a strong, semantically meaningful structure on the latent space, naturally encourages better separation of item representations, allowing DUL to operate more effectively.

In summary, HiD-VAE effectively resolves the ID collision problem through the synergistic combination of disentanglement (DUL) and structured semantic supervision (HS).
6.3.2. Visualization of the Disentanglement Effect
To provide direct, qualitative evidence of the uniqueness loss's efficacy, the paper visualizes the initial latent space of items prone to collision using t-SNE [28].
The visualization focuses on items from fine-grained categories with high ID collision rates in the w/o DUL ablation model. For these selected items, their initial latent representations are extracted from both the w/o DUL model and the full HiD-VAE model and projected into a 2D space.
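This image is a schematic showing how the data distribution differs before and after applying DUL. On the left, without DUL (entangled state), the point clusters are sparsely distributed and entangled; on the right, with DUL (disentangled state), the clusters are more uniform and show a clearer hierarchical structure.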
Figure 3: The t-SNE visualization of the disentanglement effect of our Uniqueness Loss (DUL). Each color represents a group of distinct items from one of the top-11 high-collision categories. The gray dots represent items from other random categories, serving as a background to illustrate the overall structure of the latent space.
Analysis:
- Entangled Latent Space (Figure 3a, w/o DUL): Items from the same group (represented by a single color) largely collapse into overlapping representations. This tight clustering makes distinct items indistinguishable, illustrating the root cause of ID collisions.
- Disentangled Latent Space (Figure 3b, Full HiD-VAE): In contrast, the latent space from the full HiD-VAE model shows a well-separated structure. The Uniqueness Loss (DUL) has pushed apart the latent representations of distinct items within each group, so that each acquires a unique representation. Items from the same category still occupy a similar semantic neighborhood, but they are now clearly separable.

The visualization supports the claim that the uniqueness loss resolves representation entanglement at its source, creating a robust foundation for learning unique and meaningful identifiers.
6.3.3. Case Study on Recommendation Interpretability
A case study compares HiD-VAE with a standard RQ-VAE baseline to demonstrate practical advantages beyond aggregate metrics.
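This image is a schematic showing how HiD-VAE and RQ-VAE operate within a recommender system. The user history appears on the left; each model (RQ-VAE and HiD-VAE) then generates semantic IDs, and a Transformer model predicts the next item ID. HiD-VAE additionally uses hierarchical tags and finer-grained embeddings to improve semantic embedding quality and recommendation diversity.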
Figure 4: Case study comparing a standard RQ-VAE with our HiD-VAE. For the same user, HiD-VAE learns a transparent semantic path (e.g., 'Skincare -> Treatments -> Serums'), enabling traceable reasoning. In contrast, the baseline's opaque codes (e.g., '[17, 83, 152]') result in black-box logic and risk generating invalid recommendations.
Analysis:
- Opaque IDs in RQ-VAE: A standard RQ-VAE maps an item (e.g., a serum) to an opaque identifier like [17, 83, 152]. The semantic meaning of these codes is unknown, making the generation process a black box that relies solely on statistical pattern matching. Such models cannot explain their recommendations. Furthermore, their unconstrained generation can hallucinate invalid ID combinations that do not correspond to any real item.
- Interpretable IDs in HiD-VAE: In contrast, HiD-VAE generates a transparent, self-explanatory ID like [5, 12, 47] that directly decodes into a human-readable path such as the one shown in Figure 4: Skincare -> Treatments -> Serums. The generation process can therefore follow traceable reasoning and offer an explanation rooted in its explicit semantic path, which enhances user trust.
- Hierarchy-Aware Semantic Embeddings: The Transformer-based recommender in HiD-VAE uses hierarchy-aware semantic embeddings. Each token in the semantic ID is mapped to semantic vectors from an embedding model, which are concatenated with learnable ID embeddings and level-specific type embeddings. These jointly learned embeddings enrich the representation with explicit semantic information, helping the model capture the coarse-to-fine semantic path and improving overall performance.
- Robustness against Invalid IDs: Because HiD-VAE's generation is constrained by the learned tag hierarchy and by constrained decoding, it is mechanically robust against producing invalid IDs.

This fundamental difference in representation directly affects the quality and reliability of recommendations, with HiD-VAE offering the controllability and trustworthiness that real-world deployments require.
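The robustness argument rests on constrained decoding over the set of IDs that belong to real items. Below is a minimal sketch of that idea using a nested-dictionary prefix tree and greedy decoding. The catalog, the `scores` table standing in for model logits, the per-level tag lookup, and all function names are hypothetical; only the ID [5, 12, 47] and its path come from the case study, and the paper's actual decoding procedure may differ (for example, it may use beam search).

```python
def build_trie(valid_ids):
    """Nested-dict prefix tree over the semantic IDs of real items."""
    root = {}
    for codes in valid_ids:
        node = root
        for code in codes:
            node = node.setdefault(code, {})
    return root


def allowed_next(trie, prefix):
    """Codes that may follow `prefix`; empty set if the prefix matches no item."""
    node = trie
    for code in prefix:
        if code not in node:
            return set()
        node = node[code]
    return set(node)


def constrained_greedy(trie, score_fn, depth):
    """Greedy decoding that only ever selects codes present in the trie."""
    prefix = []
    for _ in range(depth):
        candidates = allowed_next(trie, prefix)
        if not candidates:
            break
        prefix.append(max(candidates, key=lambda c: score_fn(prefix, c)))
    return prefix


# Hypothetical catalog of valid 3-level IDs.
catalog = [[5, 12, 47], [5, 12, 48], [3, 7, 19]]
trie = build_trie(catalog)

# Stand-in for model scores (assumed values, one score per code).
scores = {5: 0.9, 3: 0.1, 12: 0.8, 7: 0.2, 47: 0.7, 48: 0.3, 19: 0.5}
decoded = constrained_greedy(trie, lambda prefix, c: scores[c], depth=3)

# Hypothetical per-level tag names for the decoded codes.
level_tags = [{5: "Skincare"}, {12: "Treatments"}, {47: "Serums"}]
path = " -> ".join(level_tags[level][code] for level, code in enumerate(decoded))
print(decoded, path)
```

Because every step is restricted to children of the current trie node, a combination such as [5, 7, ...] can never be emitted: `allowed_next(trie, [5, 7])` is empty.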
6.3.4. Tag Predictor Architecture and Per-Layer Accuracy (Appendix E.1)
The paper provides details on the tag predictor architecture for each layer and its per-layer classification accuracy. The tag predictor is a multi-layer perceptron (MLP) with self-attention, feature extraction layers, residual blocks, and a final classification head. Its structure (hidden dimensions, dropout rates) scales with layer depth to handle increasing complexity.
The following are the results from Table 5 of the original paper:
| Dataset | Layer 1 (Coarse) | Layer 2 (Medium) | Layer 3 (Fine) |
| --- | --- | --- | --- |
| Beauty | 96.38 (7) | 92.73 (30) | 85.61 (97) |
| Sports | 93.68 (23) | 87.54 (70) | 83.27 (119) |
| KuaiRand | 87.84 (38) | 83.58 (97) | 77.49 (146) |
Analysis: Each cell reports classification accuracy (%), and the number in parentheses is the effective number of tag categories at that level (after filtering out rare categories with too few samples).
- Consistently Strong Performance: The accuracies range from 96.38% (Layer 1, Beauty) to 77.49% (Layer 3, KuaiRand). Accuracy falls as the categories become finer and more numerous, but it stays above 77% at every level.
- Efficacy of Scaled Classifier Architecture: Even for deeper layers with many categories (e.g., 97-146), where uniform random guessing would score about 1% or less, the achieved accuracies remain far above naive baselines. This supports the scaled classifier architecture's ability to capture nuanced semantic distinctions, and it indicates that the hierarchical supervision signal is effectively learned and used by HiD-VAE's quantization layers.
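To put the per-layer accuracies in context, the sketch below compares each entry of Table 5 with the accuracy of uniform random guessing over the same number of categories. The uniform-chance baseline is an assumption made here for illustration; a majority-class baseline could be higher on imbalanced tags, and the section does not report one.

```python
# (accuracy in %, number of tag categories) copied from Table 5.
table5 = {
    "Beauty": [(96.38, 7), (92.73, 30), (85.61, 97)],
    "Sports": [(93.68, 23), (87.54, 70), (83.27, 119)],
    "KuaiRand": [(87.84, 38), (83.58, 97), (77.49, 146)],
}


def chance_accuracy(num_classes):
    """Expected accuracy (%) of guessing uniformly among `num_classes` tags."""
    return 100.0 / num_classes


for dataset, layers in table5.items():
    for level, (accuracy, num_classes) in enumerate(layers, start=1):
        chance = chance_accuracy(num_classes)
        print(
            f"{dataset:9s} layer {level}: {accuracy:.2f}% "
            f"vs. {chance:.2f}% uniform chance ({num_classes} classes)"
        )
```

At the finest level the uniform-chance figure is roughly 0.7% to 1.0%, so the reported accuracies exceed it by about two orders of magnitude.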
6.3.5. Hyperparameter Sensitivity (Appendix E.2)
The paper analyzes the sensitivity of HiD-VAE to key hyperparameters: the number of quantization layers, the uniqueness loss weight, and the uniqueness margin.
Sensitivity to Number of Quantization Layers
The paper evaluates performance over several settings of the number of quantization layers, excluding one setting that it considers too severely limited.

Figure 6: Sensitivity analysis of the number of quantization layers on Recall@10 and NDCG@10. Performance peaks at 4 layers but diminishes beyond that, indicating diminishing returns with increased depth.
Analysis:
- Modest Improvement from 3 to 4 Layers: Increasing the number of quantization layers from 3 to 4 yields modest improvements (e.g., +4.88% in Recall@10 on Beauty, +4.97% on KuaiRand). The additional layer enables finer semantic decomposition.
- Performance Degradation Beyond 4 Layers: Increasing the depth further degrades performance (e.g., -5.86% in Recall@10 on Beauty relative to a shallower setting), sometimes even falling below the baseline.
- Diminishing Returns and Optimization Challenges: These diminishing returns suggest a low benefit-to-cost ratio. Deeper hierarchies increase training costs (more quantization overhead, longer sequences for the Transformer) without proportional gains, likely due to optimization challenges such as propagating quantization errors and harder autoregressive modeling.
- Trade-off: The paper's default number of layers is presented as a good trade-off, achieving strong performance without excessive overhead.
Analyzing Uniqueness Loss Parameters
The sensitivity to the uniqueness loss weight and the uniqueness margin is also analyzed.

Figure 7: Sensitivity analysis of the uniqueness loss weight and uniqueness margin on Recall@10 and NDCG@10 for the Beauty dataset. Performance peaks at an intermediate weight but diminishes at higher values, while metrics improve as the margin increases up to 0.9, indicating optimal disentanglement with balanced penalties.
Analysis of the uniqueness loss weight (Figure 7a):
- Performance peaks at an intermediate weight. This indicates a balance where disentanglement is sufficiently enforced without overpowering the other objectives.
- Lower values lead to slightly degraded metrics due to increased ID collisions (insufficient disentanglement).
- Higher values may over-penalize and disrupt the latent space structure, leading to sub-optimal performance.
Analysis of the uniqueness margin (Figure 7b):
- Metrics generally improve as the margin increases, with the best results at a margin of 0.9.
- A stricter margin better separates colliding representations, enhancing disentanglement efficacy.
- Lower margins allow more overlap, reducing disentanglement effectiveness. The best-performing value of 0.9 suggests that strong separation is beneficial.

These sensitivity analyses support the hyperparameters chosen for the HiD-VAE model and its uniqueness loss, which strike a balance between promoting disentanglement and maintaining overall model performance.
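The roles of the weight and the margin can be illustrated with a small sketch. The exact form of the uniqueness loss is not given in this section, so everything below is an assumption: a hinge penalty on the cosine distance between latent vectors of distinct items that share an ID, chosen so that a larger margin is stricter, as the analysis above describes. The function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def uniqueness_penalty(latents, ids, margin):
    """ASSUMED form: mean of max(0, margin - cosine_distance) over colliding pairs."""
    pairs = [(i, j) for i, j in combinations(range(len(ids)), 2) if ids[i] == ids[j]]
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        distance = 1.0 - cosine_similarity(latents[i], latents[j])
        total += max(0.0, margin - distance)  # zero once the pair is far enough apart
    return total / len(pairs)


def total_loss(base_loss, latents, ids, weight, margin):
    """The weight scales the penalty relative to the other training objectives."""
    return base_loss + weight * uniqueness_penalty(latents, ids, margin)


# Two distinct items with identical latents and the same ID, plus one unrelated item.
latents = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ids = [(5, 12, 47), (5, 12, 47), (3, 7, 19)]

strict = uniqueness_penalty(latents, ids, margin=0.9)
loose = uniqueness_penalty(latents, ids, margin=0.1)
print(strict, loose)
```

Under this assumed form, a margin of 0.9 keeps penalizing a colliding pair until its cosine similarity drops to 0.1 or below, whereas a margin of 0.1 stops as soon as the similarity falls to 0.9, which leaves much more overlap. The weight then controls how strongly that pressure competes with the reconstruction and supervision terms.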
7. Conclusion & Reflections
7.1. Conclusion Summary
In this work, the authors introduced HiD-VAE, a novel framework for generative recommendation that addresses two critical limitations of existing semantic ID tokenization methods: semantic flatness and representation entanglement. HiD-VAE achieves this through two core innovations:
- A hierarchically-supervised quantization process that explicitly aligns discrete codes with multi-level item tags, yielding interpretable and structured semantic IDs with traceable semantic paths.
- A novel uniqueness loss that directly penalizes latent space overlap between distinct items, thereby resolving the pervasive ID collision problem and promoting recommendation diversity.

Extensive experiments on three public benchmarks (Beauty, Sports, and KuaiRand) demonstrated that HiD-VAE achieves state-of-the-art performance, significantly outperforming prior discriminative and generative models across various evaluation metrics. The ablation studies confirmed the crucial contributions of both hierarchical supervision and the uniqueness loss. Qualitative analyses, including t-SNE visualizations and a case study, further showcased HiD-VAE's ability to learn disentangled representations and provide interpretable recommendations. By unifying the recommendation pipeline with interpretable and disentangled IDs, HiD-VAE enhances traceability and robustness, paving the way for more trustworthy real-world deployments.
7.2. Limitations & Future Work
The authors acknowledge several directions for future research:
- Multi-modal Semantic IDs: Extend HiD-VAE's ID generation to incorporate multi-modal data (images, videos, audio) alongside text. This would involve integrating encoders like CLIP to align discrete codes with cross-modal hierarchies, capturing richer nuances (e.g., visual styles, auditory patterns) for heterogeneous domains like e-commerce and social networks.
- LLMs for Advanced Sequential Modeling: Enhance the generative stage by integrating Large Language Models (LLMs) for more advanced sequential modeling. This could involve fine-tuning or prompting LLMs on hierarchical IDs to leverage their reasoning capabilities for complex user intents and dependencies, potentially using hybrid architectures with rationales or counterfactuals to boost diversity and personalization.
- Broader Efforts: Explore diffusion models for ID synthesis, which could improve long-tail coverage and fairness in recommendations. Additionally, investigate privacy-preserving mechanisms like federated learning for ID generation and sequential modeling.
7.3. Personal Insights & Critique
This paper presents a significant advancement in generative recommendation by systematically tackling two of its most pressing issues: interpretability and ID collisions. The core innovation of hierarchically-supervised quantization is particularly insightful, bridging the gap between abstract latent codes and human-understandable categories. This explicit supervision, rather than implicit learning, is a strong step towards making generative models more transparent and trustworthy. The uniqueness loss is also a well-conceived mechanism, directly addressing the representation collapse problem that often plagues VQ-based models.
Strengths:
- Directly Addresses Core Problems: Unlike many prior works that use post-hoc fixes, HiD-VAE integrates solutions for interpretability and disentanglement directly into the representation learning phase, which is a more principled approach.
- LLM Integration for Data Richness: The LLM-based hierarchical tag generation is a practical and ingenious solution for applying this framework to datasets lacking inherent categorical structures, significantly broadening its applicability.
- Clear Interpretability: The ability to decode semantic IDs into human-readable paths (e.g., "Skincare -> Treatments -> Serums") is a major step forward for user understanding and system debugging.
- Robustness against Hallucination: Constrained decoding ensures that generated IDs always correspond to real items, which is critical for practical deployment.
Potential Issues/Critique:
- Complexity: The HiD-VAE model involves multiple components and loss functions, including a VAE, RQ-VAE, projectors, classifiers, and a Transformer, with several hyperparameters. This complexity might lead to higher computational costs and require careful tuning. The sensitivity analysis in Appendix E.2 provides some reassurance, but optimal tuning for diverse datasets could still be challenging.
- Reliance on LLM for Tags: While innovative, the LLM-based tag generation depends on the quality and capabilities of the chosen LLM. Potential biases in the LLM or its training data could be propagated into the generated hierarchical tags, affecting the learned semantic IDs. The prompt design and LLM selection are crucial, and while described, ensuring complete objectivity and generalizability across all possible domains is a continuous challenge.
- Scalability of Constrained Decoding: While a trie-based prefix tree is efficient for constrained decoding, for extremely large and dynamic item catalogs with frequent additions and deletions, maintaining and querying this structure could still pose engineering challenges.
- Transferability of Learned IDs: The paper implicitly suggests that the learned HiD-VAE tokenizer (Stage 1) can be frozen and used by various downstream generative models. While validated with a Transformer, exploring its direct plug-and-play compatibility with other generative architectures could be a valuable extension.
Applications and Inspirations:
- Beyond Recommendation: The concept of learning hierarchically-disentangled discrete representations could be highly valuable in other domains requiring interpretable generative models, such as generative design (e.g., generating product designs with interpretable feature breakdowns), drug discovery (generating molecules with specific, disentangled chemical properties), or data synthesis for privacy-preserving applications.
- User Control and Explainability: The interpretable semantic paths could be exposed to users, allowing them to refine recommendations by modifying tag paths or to understand why certain items were suggested. This opens avenues for more interactive and transparent recommender systems.
- Multimodal Generative Models: The proposed future direction of multi-modal semantic IDs is very promising. Integrating different modalities could lead to richer, more comprehensive item representations and enable cross-modal recommendation (e.g., recommending music based on video content).

Overall, HiD-VAE represents a robust and well-reasoned step forward for generative recommendation, laying a strong foundation for more interpretable, accurate, and diverse recommendation systems.