Recommender Systems with Generative Retrieval
TL;DR Summary
This paper introduces a novel generative retrieval method using autoregressive decoding of Semantic IDs to enhance recommender system performance. A Transformer-based model effectively predicts the next item a user will interact with. Experiments show substantial improvements over state-of-the-art baselines, along with better generalization to items with no prior interaction history.
Abstract
Modern recommender systems perform large-scale retrieval by first embedding queries and item candidates in the same unified space, followed by approximate nearest neighbor search to select top candidates given a query embedding. In this paper, we propose a novel generative retrieval approach, where the retrieval model autoregressively decodes the identifiers of the target candidates. To that end, we create a semantically meaningful tuple of codewords to serve as a Semantic ID for each item. Given Semantic IDs for items in a user session, a Transformer-based sequence-to-sequence model is trained to predict the Semantic ID of the next item that the user will interact with. To the best of our knowledge, this is the first Semantic ID-based generative model for recommendation tasks. We show that recommender systems trained with the proposed paradigm significantly outperform the current SOTA models on various datasets. In addition, we show that incorporating Semantic IDs into the sequence-to-sequence model enhances its ability to generalize, as evidenced by the improved retrieval performance observed for items with no prior interaction history.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Recommender Systems with Generative Retrieval
1.2. Authors
The authors of this paper are:
- Shashank Rajput (University of Wisconsin-Madison)
- Nikhil Mehta (Google DeepMind)
- Anima Singh (Google DeepMind)
- Raghunandan Keshavan (Google)
- Trung Vu (Google)
- Lukasz Heldt (Google)
- Lichan Hong (Google DeepMind)
- Yi Tay (Google DeepMind)
- Vinh Q. Tran (Google)
- Jonah Samost (Google)
- Maciej Kula (Google DeepMind)
- Ed H. Chi (Google DeepMind)
- Maheswaran Sathiamoorthy (Google DeepMind)
The affiliations indicate a strong presence from Google DeepMind and Google, with one author from the University of Wisconsin-Madison, suggesting a collaborative effort between academia and industry. This often brings a blend of theoretical rigor and practical application to the research.
1.3. Journal/Conference
This paper was published as a preprint on arXiv. While arXiv is a reputable open-access archive for scientific preprints, it is not a peer-reviewed journal or conference. This means the paper has not yet undergone formal peer review, which is a process where experts in the field scrutinize the research for validity, significance, and originality before formal publication. However, many significant works are first shared on arXiv.
1.4. Publication Year
2023
1.5. Abstract
Modern recommender systems typically employ a two-stage approach: first, they embed queries and item candidates into a unified space and then use approximate nearest neighbor (ANN) search for retrieval. This paper introduces a novel generative retrieval method where the model autoregressively decodes the identifiers of target candidates. To achieve this, the authors propose creating "Semantic IDs" for each item, which are semantically meaningful tuples of codewords. A Transformer-based sequence-to-sequence model is then trained to predict the Semantic ID of the next item a user will interact with, given the Semantic IDs of items in their session. The authors claim this is the first Semantic ID-based generative model for recommendation tasks. Experimental results demonstrate that this new paradigm significantly outperforms current state-of-the-art models across various datasets. Furthermore, incorporating Semantic IDs into the sequence-to-sequence model improves its generalization ability, particularly for items with no prior interaction history (cold-start items).
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2305.05065 PDF Link: https://arxiv.org/pdf/2305.05065v3.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve lies within the retrieval stage of modern recommender systems. Traditionally, these systems embed user queries and item candidates into a shared high-dimensional vector space. Then, they rely on approximate nearest neighbor (ANN) search algorithms to find the most relevant items. While effective, this approach has several limitations:
- Large Embedding Tables: Learning unique embeddings for potentially billions of items can be memory-intensive and computationally expensive.
- Difficulty with Cold-Start Items: Items with no or very few past interactions (cold-start items) are challenging to recommend because their embeddings are poorly learned or non-existent, making it hard for ANN search to find them.
- Lack of Semantic Interpretability: Traditional item IDs are often atomic and random, lacking inherent semantic meaning. This limits knowledge sharing across similar items and makes it difficult for models to generalize.
- Feedback Loops: Traditional systems can be prone to inherent feedback loops, where popular items are recommended more, leading to a biased cycle.
This problem is crucial because recommender systems are ubiquitous in various applications (e.g., e-commerce, media streaming), and their effectiveness directly impacts user engagement and business success. Improving retrieval efficiency, generalization, and cold-start performance are significant challenges that can unlock better user experiences and more equitable recommendation outcomes.
The paper's entry point or innovative idea is to shift from a traditional "match-and-retrieve" paradigm to a "generative retrieval" approach for recommendations. Instead of searching for pre-computed embeddings, the system directly generates the identifiers of the next relevant items. This is achieved through a novel representation of items called "Semantic IDs" and a Transformer-based sequence-to-sequence model that learns to decode these Semantic IDs autoregressively.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- TIGER Framework: Proposal of TIGER (Transformer Index for GEnerative Recommenders), a novel generative retrieval-based recommendation framework. This framework assigns Semantic IDs to each item, which are learned from the item's content features, and trains a retrieval model to predict the Semantic ID of an item a user may engage with. This fundamentally changes the retrieval process from a search task to a generation task.
- State-of-the-Art Performance: Demonstrating that TIGER significantly outperforms existing state-of-the-art recommender systems on multiple datasets (Amazon's Beauty, Sports and Outdoors, and Toys and Games categories) across standard metrics like Recall and NDCG.
- Enhanced Generalization and Cold-Start Recommendations: Showing that this new generative retrieval paradigm, particularly with Semantic IDs, provides two new capabilities for sequential recommender systems:
  - Ability to recommend new and infrequent items: TIGER improves cold-start recommendations by leveraging item semantics, allowing it to generalize to items without prior interaction history. This is a critical advantage over systems that rely on interaction data for item embeddings.
  - Ability to generate diverse recommendations: The framework allows for controlling the diversity of recommendations through tunable parameters (e.g., temperature-based sampling during decoding), a capability often difficult to achieve with traditional methods.
- Memory Efficiency: TIGER uses much smaller embedding tables compared to traditional recommender systems. Instead of storing an embedding for each item, it stores embeddings for each semantic codeword. This makes the model more scalable for very large item corpora.

These findings address the challenges of cold-start items and generalization, offer better performance, and introduce new controls for recommendation diversity, solving critical problems in the field.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with several key concepts in recommender systems, deep learning, and natural language processing.
-
Recommender Systems (RS): Systems that predict user preferences for items and suggest relevant ones. They are broadly categorized into:
- Collaborative Filtering: Recommends items based on the behavior of similar users or items.
- Content-Based Filtering: Recommends items similar to those a user has liked in the past based on item attributes.
- Hybrid Approaches: Combine collaborative and content-based methods.
- Retrieve-and-Rank Strategy: A common architecture where a retrieval stage quickly filters a large corpus to a smaller set of relevant candidates, which are then precisely ordered by a ranking stage. This paper focuses on innovating the retrieval stage.
-
Embeddings: Low-dimensional, dense vector representations of items, users, words, etc., that capture their semantic meaning. Items with similar properties or users with similar tastes will have embeddings close to each other in the vector space. For example, a word embedding for "king" might be close to "queen" but far from "apple".
-
Approximate Nearest Neighbor (ANN) Search: A family of algorithms used to find items whose embeddings are "closest" to a query embedding in a high-dimensional space, but without the computational cost of finding the exact nearest neighbors. This is crucial for large-scale retrieval where an exhaustive search is infeasible. Common ANN algorithms include Locality-Sensitive Hashing (LSH) and Hierarchical Navigable Small World (HNSW) graphs.
-
Dual-Encoder Architectures: A common structure in deep learning for retrieval, especially in recommender systems. It involves two separate neural networks (towers), one for encoding the query (e.g., user history) and another for encoding the candidate item. The outputs are dense embeddings, and their similarity (e.g., dot product or cosine similarity) is used to determine relevance.
-
Sequential Recommendation: A subfield of recommender systems that explicitly models the order of user-item interactions. The goal is to predict the next item a user will interact with, given their past sequence of interactions. This often involves using sequence modeling techniques from NLP.
-
Transformer Models: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which revolutionized sequence modeling. Key components:
- Self-Attention Mechanism: Allows the model to weigh the importance of different parts of the input sequence when processing each element. For example, in a sentence, when processing the word "bank", the model might pay more attention to "river" or "money" depending on the context.
- Encoder-Decoder Architecture: A typical Transformer setup consists of an encoder that processes the input sequence and a decoder that generates the output sequence. Both parts use self-attention and feed-forward networks.
- Autoregressive Decoding: The process where a model generates an output sequence one token at a time, with each new token being conditioned on the previously generated tokens. This is common in language generation and, in this paper, for generating
Semantic IDs.
-
Vector Quantization (VQ): A technique to map high-dimensional continuous vectors to discrete, low-dimensional representations (codes). It's analogous to clustering, where each vector is assigned to its closest "codebook" entry.
- Codebook: A collection of representative vectors (codewords) in the quantized space.
- Residual Quantization (RQ): An enhancement where, after an initial quantization step, the "residual error" (the difference between the original vector and its quantized approximation) is further quantized. This process can be repeated multiple times, leading to a sequence of codewords that progressively refine the representation.
-
Variational Autoencoders (VAEs): Generative models that learn a compressed, latent representation of data. They consist of an encoder that maps input data to a latent space (often a distribution) and a decoder that reconstructs the input from this latent representation.
VQ-VAE and RQ-VAE incorporate vector quantization into the VAE framework.
3.2. Previous Works
The paper contextualizes its contribution by discussing prior work in sequential recommenders, generative retrieval, and semantic ID generation.
- Sequential Recommenders:
  - GRU4Rec [11]: One of the first to use Gated Recurrent Units (GRUs), a type of Recurrent Neural Network (RNN), for session-based recommendations. RNNs are designed to process sequences by maintaining a hidden state that captures information from past elements.
  - NARM [24]: Neural Attentive Session-based Recommendation uses an attention mechanism with a GRU to model both short-term and long-term user intent. Attention mechanisms allow the model to focus on the most relevant parts of the input sequence.
  - AttRec [41]: Proposed a self-attention mechanism to model user intent and personalized recommendations using metric learning.
  - SASRec [17]: Self-Attentive Sequential Recommendation, which applies a Transformer encoder (specifically, a causal mask Transformer) to model sequential interactions. This was a significant step towards using Transformer models in recommendations.
  - BERT4Rec [32] and Transformers4Rec [6]: Utilized Transformer models with masking strategies, inspired by masked language modeling in NLP (like BERT), for sequential recommendation tasks. They learn a rich representation by predicting masked items in a sequence.
  - S³-Rec [44]: Extends Transformer models by pre-training on four self-supervised tasks to improve data representation, going beyond just masking.
  - P5 [8]: A recent method that fine-tunes large language models (LLMs) for multi-task recommender systems. It leverages the LLM's tokenizer (SentencePiece) to generate tokens from item IDs, which can be random.

  Differentiation: Most of these models learn high-dimensional embeddings for each item and use Approximate Nearest Neighbor (ANN) search (specifically, Maximum Inner Product Search, MIPS) to find the next item. In contrast, TIGER uses Generative Retrieval to directly predict Semantic IDs. P5 is closer in spirit due to its generative nature, but TIGER distinguishes itself by using semantically meaningful Semantic IDs derived from content information, as opposed to P5's reliance on arbitrary item IDs tokenized by an LLM tokenizer.
- Semantic IDs:
  - VQ-Rec [12]: Proposed generating "codes" (similar to Semantic IDs) using content information for item representation. Their focus, however, was on building transferable recommender systems, not on using codes generatively for retrieval. They used product quantization, while TIGER uses RQ-VAE for hierarchical representations.
  - Singh et al. [31]: A concurrent work showing that hierarchical Semantic IDs can replace item IDs in ranking models, improving generalization.

  Differentiation: TIGER is the first to use Semantic IDs in a generative retrieval context for recommendations. RQ-VAE provides a hierarchical structure to the Semantic IDs, which VQ-Rec and other methods might not explicitly achieve.
- Generative Retrieval:
  - GENRE [5]: Proposed for entity retrieval, using a Transformer-based architecture to autoregressively return the name of an entity.
  - DSI [34]: Differentiable Search Index, which uses Transformer memory as a differentiable search index. It was the first system to assign structured semantic DocIDs to documents and then autoregressively generate them given a query. This work is a direct inspiration for TIGER's generative approach.
  - NCI [37]: Neural Corpus Indexer, which supplements DSI's hierarchical k-means clustering based semantic DocIDs with a new decoder architecture that accounts for prefixes in DocIDs.
  - CGR [22]: Contextualized Generative Retrieval, which combines bi-encoder techniques with generative retrieval by allowing the decoder to learn contextualized embeddings.

  Differentiation: While generative retrieval exists for document retrieval, TIGER is the first to apply this paradigm to recommender systems using Semantic ID representations derived from an autoencoder (specifically RQ-VAE). This is a key novelty.
3.3. Technological Evolution
The field of recommender systems has evolved from simpler matrix factorization models to complex deep learning architectures. Early systems focused on learning latent factors for users and items (e.g., Matrix Factorization [19]). With the rise of deep learning, dual-encoder architectures became popular, allowing for non-linear relationships and rich feature integration (Deep Neural Networks for YouTube Recommendations [4]).
The sequential nature of user interactions led to the adoption of RNNs (GRU4Rec [11]), then attention mechanisms (NARM [24]), and eventually Transformer models (SASRec [17], BERT4Rec [32], S³-Rec [44]), borrowing heavily from advancements in Natural Language Processing (NLP). These models typically generate item embeddings and rely on ANN search.
More recently, the concept of "generative retrieval" emerged in NLP for tasks like document retrieval (GENRE [5], DSI [34]), where the model generates the document identifier directly. P5 [8] introduced Large Language Models (LLMs) into recommendation by formulating recommendation as a language processing task.
TIGER represents a significant step in this evolution by adapting the generative retrieval paradigm to recommender systems. It moves beyond embedding-based retrieval and directly generates Semantic IDs, leveraging the powerful sequence generation capabilities of Transformers. This bridges the gap between sophisticated generative models from NLP and the specific challenges of recommendation, especially regarding cold-start items and semantic generalization.
3.4. Differentiation Analysis
Compared to the main methods in related work, TIGER offers several core differences and innovations:
- Generative Retrieval vs. Embedding Search:
  - Traditional (e.g., GRU4Rec, SASRec, BERT4Rec): These models learn high-dimensional embeddings for items and use Approximate Nearest Neighbor (ANN) search (e.g., MIPS) to find relevant items. The retrieval is a search operation in a pre-defined embedding space.
  - TIGER: Directly generates the Semantic ID of the next item autoregressively using a sequence-to-sequence Transformer. This shifts the retrieval task from a search problem to a generation problem, where the model's memory (parameters) implicitly acts as the item index.
- Semantic IDs vs. Atomic/Random IDs:
  - Traditional: Most models (except P5 to some extent) use arbitrary, atomic item IDs (e.g., integers 1, 2, 3, ...) as inputs or targets, often learning a unique embedding for each. These IDs lack inherent semantic meaning.
  - P5: Uses LLM tokenizers to tokenize random item IDs, which might capture some sub-word patterns but does not inherently embed strong item semantics in the ID itself.
  - TIGER: Introduces Semantic IDs derived from item content features using RQ-VAE. These Semantic IDs are tuples of codewords that inherently carry semantic meaning and have a hierarchical structure (e.g., (category, sub-category, item-type)). This allows for knowledge sharing across semantically similar items.
- Enhanced Generalization and Cold-Start Handling:
  - Traditional: Struggle with cold-start items because they lack sufficient interaction data to learn robust embeddings.
  - TIGER: By using Semantic IDs derived from item content, it can generalize to new or infrequent items. If an item shares semantic components with known items, the model can still generate its Semantic ID, even if it has never been seen in training. This is a significant advantage.
- Memory Efficiency:
  - Traditional: Require large embedding tables where each item has its own high-dimensional embedding. The memory cost scales linearly with the number of items, i.e., $O(N \cdot d)$, where $N$ is the number of items and $d$ is the embedding dimension.
  - TIGER: Stores embeddings only for the semantic codewords, not for each individual item. With $m$ codebooks of size $K$ each, the memory cost is $O(m \cdot K \cdot d)$. This is much more efficient when $N$ is very large, since $m \cdot K$ can be significantly smaller than $N$ (an illustrative calculation appears after this list).
- Controllable Diversity:
  - Traditional: Diversity is often an afterthought, sometimes introduced by post-processing or specific training objectives, but not natively integrated into the retrieval mechanism.
  - TIGER: The generative nature and hierarchical Semantic IDs allow for explicit control over recommendation diversity, for instance by adjusting the temperature during decoding or by sampling at different levels of the Semantic ID hierarchy.

In summary, TIGER moves beyond the traditional embedding-and-search paradigm by directly generating semantically meaningful item identifiers, leading to superior performance, better cold-start capabilities, and improved memory efficiency, distinguishing it from prior work by integrating advances in generative NLP with the specific needs of recommender systems.
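To make the memory comparison concrete, here is an illustrative back-of-the-envelope calculation; the numbers (one million items, 64-dimensional item embeddings, three codebooks of 256 codewords with 32-dimensional entries) are assumptions for illustration, not figures from the paper:

$ N \cdot d = 10^6 \times 64 = 6.4 \times 10^7 \ \text{parameters (traditional item embedding table)} $

$ m \cdot K \cdot d_{\mathrm{code}} = 3 \times 256 \times 32 \approx 2.5 \times 10^4 \ \text{parameters (TIGER codebooks)} $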
4. Methodology
The proposed framework, TIGER (Transformer Index for GEnerative Recommenders), consists of two main stages:
- Semantic ID Generation: Creating a semantically meaningful representation for each item in the recommendation corpus.
- Generative Recommender System Training: Training a Transformer-based sequence-to-sequence model to predict the Semantic ID of the next item a user will interact with.
4.1. Principles
The core idea behind TIGER is to transform the item retrieval problem from an Approximate Nearest Neighbor (ANN) search in an embedding space to a sequence generation task. Instead of learning an embedding for each item and then searching for the closest ones, TIGER learns to generate a unique, semantically rich identifier for the next item a user is likely to interact with. This is based on the intuition that if item identifiers themselves convey semantic meaning, a generative model can learn to predict these meaningful identifiers based on a user's past interactions. The Transformer's ability to model complex sequences and generate output autoregressively makes it suitable for this task, effectively using its "memory" (parameters) as an implicit index of items.
The theoretical basis of using Semantic IDs is that semantically similar items should have similar or overlapping identifiers. This allows for knowledge sharing, better generalization, and improved handling of cold-start items compared to atomic, arbitrary item IDs. The hierarchical nature of Semantic IDs further enables coarse-to-fine granularity in item representation and prediction.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Semantic ID Generation
The first crucial step is to represent each item with a Semantic ID, which is a tuple of codewords. This process involves two sub-steps:
- Content Feature Encoding: Each item's rich content features (e.g., title, description, category, brand) are first converted into a dense semantic embedding.
- Embedding Quantization: The continuous semantic embedding is then quantized into a discrete Semantic ID (a tuple of codewords).

The following figure (Figure 2a from the original paper) shows the overall process for generating Semantic IDs:

The figure is a schematic of the modeling approach in TIGER. The left side shows how item content information (ItemID, title, description, category, and brand) is passed through a content encoder and quantization to produce the Semantic ID embedding representation; the right side shows a bidirectional Transformer encoder encoding the user's interaction history and a Transformer decoder predicting the Semantic ID of the next item the user is likely to interact with.
As can be seen from the overall modeling approach in TIGER (Figure 2), the left side specifically details the Semantic ID generation. Item content features (ItemID, title, description, category, brand) are passed through a content encoder. The resulting embedding is then quantized to produce a Semantic ID.
Step 1: Content Feature Encoding
The paper assumes access to a pre-trained content encoder that can convert an item's textual or other content features into a fixed-dimensional semantic embedding $\pmb{x}$. For example, for text features, general-purpose pre-trained text encoders like Sentence-T5 [27] or BERT [7] can be used. The authors specifically use Sentence-T5 to process concatenated item features (title, price, brand, category) into a 768-dimensional embedding. This embedding captures the semantic essence of the item.
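A minimal sketch of this step, assuming the `sentence-transformers` library and an illustrative feature-string format (the exact concatenation format is not specified in the paper):

```python
# Sketch of Step 1: encode item content into a dense semantic embedding
# with a pre-trained Sentence-T5 model (768-dimensional output).
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/sentence-t5-base")

item = {
    "title": "Hydrating Argan Oil Shampoo",   # hypothetical item
    "price": "$12.99",
    "brand": "ExampleBrand",
    "category": "Beauty > Hair Care > Shampoos",
}
# Concatenate the content features into a single string, then embed it.
feature_text = " ".join(f"{k}: {v}" for k, v in item.items())
x = encoder.encode(feature_text)   # numpy array of shape (768,)
print(x.shape)
```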
Step 2: Embedding Quantization using Residual-Quantized Variational AutoEncoder (RQ-VAE)
The continuous semantic embedding $\pmb{x}$ is then quantized to generate a Semantic ID. A Semantic ID is defined as a tuple of $m$ codewords, where each codeword comes from a different codebook. The primary method used here is RQ-VAE [40].
The following figure (Figure 3 from the original paper) illustrates the RQ-VAE process:
The figure is a schematic of the residual quantization process: the vector output by the DNN encoder is matched against the first codebook to find its nearest entry, producing the first residual; that residual is then quantized in turn by the second and third codebooks. After three quantization steps, the resulting codewords form the Semantic ID, which is used to decode the quantized representation.
As depicted in Figure 3, the RQ-VAE works as follows:
- Encoder and Latent Representation: The RQ-VAE first uses a deep neural network (DNN) encoder, denoted $\mathcal{E}$, to encode the input semantic embedding $\pmb{x}$ into a latent representation $z$: $z := \mathcal{E}(\pmb{x})$. In their implementation, the encoder has three intermediate layers of size 512, 256, and 128 with ReLU activation, and the final latent representation has a dimension of 32.
- Residual Quantization (Iterative Process): The latent representation $z$ is then subjected to a multi-level residual quantization process to generate the tuple of codewords. This process is iterative and works on residuals.
  - Initialization of Residual: At the zero-th level ($d = 0$), the initial residual is simply the latent representation: $\pmb{r}_0 := z$.
  - Quantization at Each Level: For each level $d$ (from 0 to $m-1$), there is a dedicated codebook $\mathcal{C}_d := \{e_k\}_{k=1}^{K}$, where $K$ is the size of the codebook.
    - The codeword for the current level, $c_d$, is found by identifying the index of the codebook embedding closest to the current residual $\pmb{r}_d$, typically under Euclidean distance: $c_d = \arg\min_k \|\pmb{r}_d - e_k\|$.
    - The selected codeword vector is $e_{c_d}$.
    - The next residual is then computed by subtracting the selected codeword vector from the current residual: $\pmb{r}_{d+1} := \pmb{r}_d - e_{c_d}$. This process is repeated $m$ times, generating a tuple of codewords $(c_0, c_1, \ldots, c_{m-1})$. The authors use $m = 3$ levels, each with a codebook of size 256. This hierarchical approach approximates the input with coarse-to-fine granularity.
- Quantized Representation and Decoder: Once the Semantic ID $(c_0, \ldots, c_{m-1})$ is obtained, a quantized representation of $z$, denoted $\widehat{z}$, is computed by summing the selected codeword vectors from each level: $\widehat{z} := \sum_{d=0}^{m-1} e_{c_d}$. This is then passed to a DNN decoder, denoted $\mathcal{D}$, to reconstruct the original semantic embedding as $\widehat{\pmb{x}} = \mathcal{D}(\widehat{z})$.
- RQ-VAE Loss Function: The RQ-VAE model is jointly trained by updating the codebook entries and the encoder-decoder parameters. The total loss is a combination of a reconstruction loss and an RQ-VAE-specific loss term: $\mathcal{L}(\pmb{x}) := \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{rqvae}}$, where:
  - The reconstruction loss measures how well the original semantic embedding can be reconstructed from its quantized representation: $\mathcal{L}_{\mathrm{recon}} := \|\pmb{x} - \widehat{\pmb{x}}\|^2$. This is a mean squared error.
  - The RQ-VAE loss ensures that the codebook vectors are updated effectively and that the encoder's output commits to a codebook entry. It includes a codebook loss and a commitment loss, implemented with a stop-gradient operator $\mathbf{sg}[\cdot]$ to prevent gradients from flowing through the wrong path: $\mathcal{L}_{\mathrm{rqvae}} := \sum_{d=0}^{m-1} \|\mathbf{sg}[\pmb{r}_d] - e_{c_d}\|^2 + \beta \|\pmb{r}_d - \mathbf{sg}[e_{c_d}]\|^2$. Here:
    - $\|\mathbf{sg}[\pmb{r}_d] - e_{c_d}\|^2$: This term encourages the codebook vector $e_{c_d}$ to move towards the residual $\pmb{r}_d$. The stop-gradient on $\pmb{r}_d$ means gradients from this term only update the codebook entries.
    - $\beta \|\pmb{r}_d - \mathbf{sg}[e_{c_d}]\|^2$: This term, weighted by $\beta$ (a hyperparameter, set to 0.25 in the paper), encourages the encoder's output to move towards the chosen codebook vector $e_{c_d}$. The stop-gradient on $e_{c_d}$ ensures gradients from this term only update the encoder parameters. This helps prevent "codebook collapse", where only a few codebook vectors are utilized.
- Codebook Initialization: To further prevent codebook collapse, where most input vectors map to only a few codebook entries, the codebooks are initialized using k-means clustering on the first training batch.
- Collision Handling: Depending on the distribution of semantic embeddings, codebook sizes, and Semantic ID length, multiple items might map to the same Semantic ID (a collision). To resolve this, an extra token is appended to the Semantic ID to make it unique: for example, if two items share $(c_0, c_1, c_2)$, they are assigned $(c_0, c_1, c_2, 0)$ and $(c_0, c_1, c_2, 1)$. This results in a unique Semantic ID of length 4 for each item. A lookup table from Semantic IDs to items is maintained for this.
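A compact PyTorch sketch of the residual quantization forward pass and loss described above; the module layout, decoder shape, and training details are assumptions for illustration, not the authors' implementation:

```python
# Minimal RQ-VAE sketch: encoder -> residual quantization over m codebooks -> decoder,
# with reconstruction loss plus codebook/commitment losses (stop-gradient as in VQ-VAE).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RQVAE(nn.Module):
    def __init__(self, input_dim=768, latent_dim=32, num_levels=3, codebook_size=256, beta=0.25):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )
        # One codebook of `codebook_size` vectors per quantization level.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, latent_dim) for _ in range(num_levels)
        )
        self.beta = beta

    def quantize(self, z):
        """Residual quantization: returns codeword tuple, quantized z, and RQ loss."""
        residual, z_hat, codewords, rq_loss = z, torch.zeros_like(z), [], 0.0
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook.weight)       # (B, K) distances
            c = dists.argmin(dim=-1)                             # nearest codeword index c_d
            e_c = codebook(c)                                    # selected codeword vectors
            # Codebook loss + beta-weighted commitment loss.
            rq_loss = rq_loss + F.mse_loss(e_c, residual.detach()) \
                              + self.beta * F.mse_loss(residual, e_c.detach())
            codewords.append(c)
            z_hat = z_hat + e_c
            residual = residual - e_c.detach()                   # next-level residual
        return torch.stack(codewords, dim=-1), z_hat, rq_loss

    def forward(self, x):
        z = self.encoder(x)
        semantic_ids, z_hat, rq_loss = self.quantize(z)
        # Straight-through estimator so reconstruction gradients reach the encoder.
        x_hat = self.decoder(z + (z_hat - z).detach())
        loss = F.mse_loss(x_hat, x) + rq_loss
        return semantic_ids, loss

model = RQVAE()
x = torch.randn(8, 768)            # a batch of item content embeddings
semantic_ids, loss = model(x)
print(semantic_ids.shape)          # (8, 3): one (c_0, c_1, c_2) tuple per item
```

Collision handling (appending a fourth disambiguation token) and k-means codebook initialization would be added on top of this sketch.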
4.2.2. Generative Retrieval with Semantic IDs
Once each item has a unique Semantic ID, the task of sequential recommendation is formulated as a generative problem.
The following figure (Figure 1 from the original paper) provides a high-level overview of the TIGER framework:
The figure is a schematic of the Transformer Index for Generative Recommenders (TIGER) framework: a Semantic ID generator converts the items in the user's history into corresponding codeword tuples, recasting the recommendation task as a generative retrieval task that is then used to find the next likely item.
As illustrated in Figure 1, the user's sequential interaction history, represented by Semantic IDs, serves as input to a Transformer model. The model then autoregressively generates the Semantic ID of the next item.
The following figure (Figure 2b from the original paper) shows the Transformer-based encoder-decoder setup:
The figure is a schematic of the modeling approach in TIGER. The left side shows how item content information (ItemID, title, description, category, and brand) is passed through a content encoder and quantization to produce the Semantic ID embedding representation; the right side shows a bidirectional Transformer encoder encoding the user's interaction history and a Transformer decoder predicting the next item the user is likely to interact with and its Semantic ID.
Figure 2b specifically depicts the Transformer-based encoder-decoder setup. The user's interaction history (a sequence of Semantic IDs) and, optionally, a user ID token are input to the Transformer encoder. The Transformer decoder then autoregressively generates the tokens of the next item's Semantic ID.
- Sequence Construction: For each user, their item interactions are sorted chronologically to form a sequence $(\mathrm{item}_1, \ldots, \mathrm{item}_n)$. Each item $i$ is replaced by its Semantic ID, a tuple of codewords $(c_{i,0}, \ldots, c_{i,m-1})$. The entire user session is then flattened into a single sequence of codewords: $(c_{1,0}, \ldots, c_{1,m-1}, c_{2,0}, \ldots, c_{2,m-1}, \ldots, c_{n,0}, \ldots, c_{n,m-1})$. Additionally, a user-specific token (user ID) is prepended to the input sequence to personalize recommendations; the user ID is mapped to one of a limited set of user ID tokens using the Hashing Trick [38]. (A minimal sketch of this token construction follows this list.)
- Transformer Architecture: A Transformer-based encoder-decoder model is employed.
  - Encoder: Processes the input sequence (the user ID token followed by the sequence of Semantic ID tokens from past interactions).
  - Decoder: Autoregressively generates the Semantic ID of the next item, $(c_{n+1,0}, \ldots, c_{n+1,m-1})$. The decoder predicts one codeword at a time, conditioning each prediction on the previously generated codewords and the encoder's output. The model uses 4 layers for both the encoder and decoder, with 6 self-attention heads (dimension 64) per layer. ReLU activation is used, and the MLP and input dimensions are set to 1024 and 128, respectively. A dropout of 0.1 is applied. The vocabulary includes a token for each semantic codeword (1024 tokens for a length-4 Semantic ID with a codebook of size 256 per level) and user-specific tokens (2000 tokens for user IDs).
- Training Objective: The model is trained to minimize the negative log-likelihood of the correct next Semantic ID given the preceding sequence, i.e., a standard cross-entropy loss over the vocabulary of Semantic ID tokens.
- Inference/Retrieval: During inference, given a user's session, the Transformer decoder generates the Semantic ID of the predicted next item token by token.
  - The generated Semantic ID is then matched against the lookup table (created during Semantic ID generation) to retrieve the actual item.
  - Because decoding is generative, the model could predict an Invalid ID (a Semantic ID that does not correspond to any item in the corpus). However, the paper notes this probability is low (see the Invalid IDs analysis in Section 6.7 and Figure 6). In cases of invalid IDs, strategies like beam search with filtering or prefix matching can be employed.
- New Capabilities:
  - Cold-Start Recommendation: Since Semantic IDs are derived from content, the model can predict Semantic IDs for new items (unseen in training) that share semantic characteristics with items it has learned about. The framework allows for the inclusion of unseen items whose first few semantic tokens match the prediction.
  - Recommendation Diversity: Temperature-based sampling during the autoregressive decoding process can be used to control the diversity of generated Semantic IDs. Sampling at different hierarchical levels of the Semantic ID allows for controlling diversity from coarse categories to finer-grained item types.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three public real-world benchmark datasets derived from the Amazon Product Reviews dataset [10]. This dataset contains user reviews and item metadata spanning from May 1996 to July 2014. The three categories used for the sequential recommendation task are:
- Beauty
- Sports and Outdoors
- Toys and Games
These datasets are suitable for sequential recommendation tasks because they contain user interaction histories (reviews) with timestamps, allowing for the chronological ordering of items. The diverse categories also help evaluate the model's performance across different domains.
Data Pre-processing:
- User review histories are used to create item sequences, sorted by timestamp.
- Users with fewer than 5 reviews are filtered out to ensure sufficient interaction data.
- A leave-one-out strategy is employed for evaluation: for each item sequence, the very last item is used for testing, the item just before the last is used for validation, and all preceding items are used for training.
- During training, the maximum number of items in a user's history is limited to 20.
The following are the statistics for the three datasets (Table 6 from the original paper):
| Dataset | # Users | # Items | Mean Sequence Length | Median Sequence Length |
| --- | --- | --- | --- | --- |
| Beauty | 22,363 | 12,101 | 8.87 | 6 |
| Sports and Outdoors | 35,598 | 18,357 | 8.32 | 6 |
| Toys and Games | 19,412 | 11,924 | 8.63 | 6 |
The datasets have varying numbers of users and items, providing a good testbed for different scales. The sequence lengths indicate that user sessions are typically short, which is common in many recommendation scenarios.
5.2. Evaluation Metrics
The recommendation performance is evaluated using two standard top-k metrics: Recall@K and Normalized Discounted Cumulative Gain (NDCG@K), with K set to 5 and 10.
5.2.1. Recall@K
-
Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved among the top K recommendations. It indicates how many of the actual items the user interacted with (in the test set) were present in the model's top K predictions. A higher Recall@K means the model is better at finding all relevant items.
-
Mathematical Formula: $\mathrm{Recall@K} = \frac{\text{Number of relevant items in top K recommendations}}{\text{Total number of relevant items}}$. For sequential recommendation, where typically there is only one "next item" to predict (the ground truth), the formula simplifies to $\mathrm{Recall@K} = |\text{Recommended top K items} \cap \{\text{Ground Truth Item}\}|$, since there is a single relevant item. This means if the ground truth item is among the top K recommendations, Recall@K is 1; otherwise, it is 0. The reported Recall@K is then the average over all users/sessions.
-
Symbol Explanation:
- Recommended top K items: the set of K items predicted by the recommender system.
- Ground Truth Item: the single item the user actually interacted with next in the test set.
- $|\cdot|$: denotes the cardinality (number of elements) of a set.
- $\cap$: denotes set intersection.
5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)
-
Conceptual Definition: NDCG@K is a metric that evaluates the ranking quality of a recommendation list. It assigns higher scores to relevant items that appear higher in the list and penalizes relevant items that appear lower. It is "normalized" by dividing by the "ideal" DCG (IDCG), which is the DCG of a perfectly sorted list where all relevant items are at the top. This makes NDCG comparable across different queries or recommendation lists.
-
Mathematical Formula: First, the Discounted Cumulative Gain (DCG@K) is calculated: $\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}$. Then, the Ideal Discounted Cumulative Gain (IDCG@K) is calculated; this is the DCG@K of a perfectly ordered list (most relevant items ranked highest). For sequential recommendation with a single ground-truth item, IDCG@K assumes the ground-truth item is at rank 1: $\mathrm{IDCG@K} = \frac{2^{\mathrm{rel}_{\mathrm{gt}}} - 1}{\log_2(1+1)}$. Since $\mathrm{rel}_{\mathrm{gt}}$ is typically 1 for the single relevant item, this simplifies to $\mathrm{IDCG@K} = 1$. Finally, the Normalized Discounted Cumulative Gain is $\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$.
-
Symbol Explanation:
- $K$: the number of top recommendations considered.
- $i$: the rank of an item in the recommendation list (from 1 to K).
- $\mathrm{rel}_i$: the relevance score of the item at rank $i$. For binary relevance (the item either is or is not the next item), $\mathrm{rel}_i$ is 1 if the item at rank $i$ is the ground-truth item, and 0 otherwise.
- $\log_2(i+1)$: the discount factor, which reduces the contribution of relevant items as their rank increases. The base-2 logarithm is common, but other bases can be used.
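For concreteness, a small helper computing both metrics in the single-ground-truth setting described above might look like this (illustrative only, not from the paper):

```python
# Recall@K and NDCG@K when each session has exactly one ground-truth next item.
import math

def recall_at_k(ranked_items: list[str], ground_truth: str, k: int) -> float:
    return 1.0 if ground_truth in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items: list[str], ground_truth: str, k: int) -> float:
    # With one relevant item, IDCG@K = 1, so NDCG reduces to the discounted gain
    # at the rank of the ground-truth item (0 if it is not in the top K).
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == ground_truth:
            return 1.0 / math.log2(rank + 1)
    return 0.0

preds = ["item_93", "item_7", "item_15"]
print(recall_at_k(preds, "item_7", k=3))   # 1.0
print(ndcg_at_k(preds, "item_7", k=3))     # 1 / log2(3) ≈ 0.631
```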
5.3. Baselines
The proposed TIGER framework is compared against a comprehensive set of existing sequential recommendation methods:
-
GRU4Rec[11]: An early RNN-based approach using GRUs for session-based recommendations. -
Caser[33]: Utilizes a Convolutional Neural Network (CNN) architecture to capture high-order Markov Chains in sequences using horizontal and vertical convolutional operations. -
HGN[25]: Hierarchical Gating Network, which aims to capture both long-term and short-term user interests through a novel gating architecture. -
SASRec[17]: Self-Attentive Sequential Recommendation, which was one of the first to employ a causal mask Transformer to model user sequential interactions. -
BERT4Rec[32]: Addresses the uni-directional limitation of some Transformer models by using a bidirectional self-attention Transformer for sequential recommendation, inspired byBERTfrom NLP. -
FDSA[42]: Feature-level Deeper Self-Attention Network, which incorporates additional item features (beyond just item embeddings) into the input sequence of Transformers. -
S³-Rec [44]: Self-Supervised Learning for Sequential Recommendation, which improves data representation by pre-training a bidirectional Transformer on four self-supervision tasks.
-
P5[8]: A recent method that leverages a pre-trained Large Language Model (LLM) to unify various recommendation tasks within a single model, treating recommendation as a language processing task.Most baselines (except
P5) learn high-dimensional item embeddings using a dual-encoder architecture and rely on Maximum Inner Product Search (MIPS) for retrieval.P5uses an LLM tokenizer to handle item IDs generatively but does not use semantically learned IDs in the same way asTIGER. These baselines are representative of the state-of-the-art in sequential recommendation, covering RNN, CNN, and various Transformer-based approaches.
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results indicate that the proposed TIGER framework significantly outperforms existing state-of-the-art models across all three Amazon datasets (Beauty, Sports and Outdoors, and Toys and Games) for both Recall@K and NDCG@K metrics. This strongly validates the effectiveness of the generative retrieval approach with Semantic IDs.
The following are the results from Table 1 of the original paper:
| Methods | Sports and Outdoors | Beauty | Toys and Games | |||||||||
| Recall @5 | NDCG @5 | Recall @10 | NDCG @10 | Recall @5 | NDCG @5 | Recall @10 | NDCG @10 | Recall @5 | NDCG @5 | Recall @10 | NDCG @10 | |
| P5 [8] | 0.0061 | 0.0041 | 0.0095 | 0.0052 | 0.0163 | 0.0107 | 0.0254 | 0.0136 | 0.0070 | 0.0050 | 0.0121 | 0.0066 |
| Caser [33] | 0.0116 | 0.0072 | 0.0194 | 0.0097 | 0.0205 | 0.0131 | 0.0347 | 0.0176 | 0.0166 | 0.0107 | 0.0270 | 0.0141 |
| HGN [25] | 0.0189 | 0.0120 | 0.0313 | 0.0159 | 0.0325 | 0.0206 | 0.0512 | 0.0266 | 0.0321 | 0.0221 | 0.0497 | 0.0277 |
| GRU4Rec [11] | 0.0129 | 0.0086 | 0.0204 | 0.0110 | 0.0164 | 0.0099 | 0.0283 | 0.0137 | 0.0097 | 0.0059 | 0.0176 | 0.0084 |
| BERT4Rec [32] | 0.0115 | 0.0075 | 0.0191 | 0.0099 | 0.0203 | 0.0124 | 0.0347 | 0.0170 | 0.0116 | 0.0071 | 0.0203 | 0.0099 |
| FDSA [42] | 0.0182 | 0.0122 | 0.0288 | 0.0156 | 0.0267 | 0.0163 | 0.0407 | 0.0208 | 0.0228 | 0.0140 | 0.0381 | 0.0189 |
| SASRec [17] | 0.0233 | 0.0154 | 0.0350 | 0.0192 | 0.0387 | 0.0249 | 0.0605 | 0.0318 | 0.0463 | 0.0306 | 0.0675 | 0.0374 |
| S3-Rec [44] | 0.0251 | 0.0161 | 0.0385 | 0.0204 | 0.0387 | 0.0244 | 0.0647 | 0.0327 | 0.0443 | 0.0294 | 0.0700 | 0.0376 |
| TIGER [Ours] | 0.0264 | 0.0181 | 0.0400 | 0.0225 | 0.0454 | 0.0321 | 0.0648 | 0.0384 | 0.0521 | 0.0371 | 0.0712 | 0.0432 |
| +5.22% | +12.55% | +3.90% | +10.29% | +17.31% | +29.04% | +0.15% | +17.43% | +12.53% | +21.24% | +1.71% | +14.97% | |
Key Observations:
- Consistent Outperformance: TIGER consistently achieves the best performance across all datasets and metrics (Recall@5, NDCG@5, Recall@10, NDCG@10). The improvements are substantial, particularly in NDCG, which accounts for ranking order.
- Significant Gains on the Beauty Dataset: The most notable improvements are observed on the "Beauty" dataset, where TIGER shows a 29.04% improvement in NDCG@5 over the second-best method (SASRec) and a 17.31% improvement in Recall@5 over the second-best (S³-Rec). This suggests that TIGER's ability to leverage item semantics may be particularly effective in domains with rich product descriptions or a strong hierarchical structure in item categories.
- Strong Performance on Toys and Games: TIGER also performs significantly better on the "Toys and Games" dataset, with 21.24% better NDCG@5 and 14.97% better NDCG@10.
- Advantages over P5: TIGER vastly outperforms P5, suggesting that while both are generative, the Semantic ID approach derived from content provides a much stronger signal than P5's reliance on LLM tokenization of arbitrary item IDs for recommendation tasks.
- Overall Superiority: The results indicate that changing the retrieval paradigm to generative prediction with Semantic IDs is a highly effective strategy for sequential recommendation. The Transformer's capacity to learn complex sequential patterns and the RQ-VAE's ability to create meaningful, hierarchical item representations both contribute to this success.
6.2. Item Representation Ablation Studies
The paper conducts ablation studies to analyze the importance of RQ-VAE Semantic IDs and their underlying principles.
6.2.1. Qualitative Analysis of RQ-VAE Semantic IDs
The authors perform a qualitative analysis on the Amazon Beauty dataset to understand the hierarchical nature of the learned Semantic IDs. They configure RQ-VAE with 3 levels, using codebook sizes of 4, 16, and 256 for $c_0$, $c_1$, and $c_2$ respectively, so a Semantic ID is a tuple $(c_0, c_1, c_2)$ with $c_0 \in \{0, \ldots, 3\}$, $c_1 \in \{0, \ldots, 15\}$, and $c_2 \in \{0, \ldots, 255\}$. An additional fourth codeword is appended for collision handling.
The following figure (Figure 4 from the original paper) shows this qualitative analysis:
The figure is a chart presenting a qualitative study of RQ-VAE Semantic IDs on the Amazon Beauty dataset. The left part shows, for the first codeword, the distribution over categories (e.g., shampoos, skin care); the right part shows how the second codeword further splits each of these categories. The bars illustrate the coarse-to-fine hierarchy between Semantic IDs and item categories.
As can be seen from Figure 4, the analysis reveals that:
- Coarse-grained semantics from $c_0$: Figure 4a demonstrates that the first codeword $c_0$ captures high-level categories. For example, one value of $c_0$ primarily corresponds to "Hair" products, while another is associated with "Makeup" and "Skin" products (for face, lips, eyes).
- Fine-grained semantics from $c_1$: Figure 4b further visualizes the hierarchical nature. By fixing $c_0$ and observing the distribution over different values of $c_1$, it is evident that $c_1$ refines the broad category captured by $c_0$ into more specific sub-categories. For instance, within the "Hair" value of $c_0$, different values of $c_1$ might delineate shampoos, conditioners, styling products, etc.

This qualitative study confirms that RQ-VAE effectively learns hierarchical and semantically meaningful representations, where deeper levels of the Semantic ID capture finer-grained details. This intrinsic semantic structure is key to TIGER's ability to share knowledge across similar items and generalize.
6.2.2. Hashing vs. RQ-VAE Semantic IDs
This ablation compares RQ-VAE with Locality Sensitive Hashing (LSH) for Semantic ID generation.
- LSH Semantic IDs: Generated by applying $h$ random hyperplanes $w_1, \ldots, w_h$ to the semantic embedding $\pmb{x}$. The $k$-th bit of the hash is 1 if $\langle w_k, \pmb{x} \rangle > 0$, and 0 otherwise. The resulting $h$-bit binary vector is converted to an integer codeword, and the process is repeated $m$ times with independent hyperplanes to obtain a tuple of $m$ codewords. For the experiment, $h = 8$ hyperplanes and $m = 4$ levels were used, giving $2^8 = 256$ possible codewords per level and hence a cardinality comparable to the RQ-VAE Semantic IDs. (A short sketch of this hashing step appears after the key observation below.)

The following are the results from Table 2 of the original paper:
| Methods | Sports and Outdoors | | | | Beauty | | | | Toys and Games | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Recall@5 | NDCG@5 | Recall@10 | NDCG@10 | Recall@5 | NDCG@5 | Recall@10 | NDCG@10 | Recall@5 | NDCG@5 | Recall@10 | NDCG@10 |
| Random ID | 0.007 | 0.005 | 0.0116 | 0.0063 | 0.0296 | 0.0205 | 0.0434 | 0.0250 | 0.0362 | 0.0270 | 0.0448 | 0.0298 |
| LSH SID | 0.0215 | 0.0146 | 0.0321 | 0.0180 | 0.0379 | 0.0259 | 0.0533 | 0.0309 | 0.0412 | 0.0299 | 0.0566 | 0.0349 |
| RQ-VAE SID | 0.0264 | 0.0181 | 0.0400 | 0.0225 | 0.0454 | 0.0321 | 0.0648 | 0.0384 | 0.0521 | 0.0371 | 0.0712 | 0.0432 |
Key Observation: RQ-VAE Semantic IDs consistently and significantly outperform LSH Semantic IDs across all datasets and metrics. This highlights that learning Semantic IDs through a non-linear DNN architecture (like RQ-VAE) yields better quantization and thus more effective semantic representations than simpler random projection methods like LSH, even when starting from the same content-based semantic embedding. RQ-VAE's ability to learn optimized codebooks and residual refinement is crucial.
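As a point of comparison with the learned RQ-VAE codes, the LSH codeword generation described above can be sketched in a few lines; the hyperplane count and level count are the assumed h = 8 and m = 4, and this is a generic illustration rather than the paper's code:

```python
# Generic LSH codeword generation: each level uses h random hyperplanes,
# and the sign pattern of the projections forms an h-bit integer codeword.
import numpy as np

def lsh_semantic_id(x: np.ndarray, hyperplanes: list[np.ndarray]) -> tuple[int, ...]:
    codewords = []
    for w in hyperplanes:                      # w has shape (h, dim), one row per hyperplane
        bits = (w @ x > 0).astype(int)         # h-bit sign pattern
        codewords.append(int("".join(map(str, bits)), 2))
    return tuple(codewords)

rng = np.random.default_rng(0)
dim, h, m = 768, 8, 4
planes = [rng.standard_normal((h, dim)) for _ in range(m)]
x = rng.standard_normal(dim)                   # stand-in for a Sentence-T5 item embedding
print(lsh_semantic_id(x, planes))              # a tuple of m integer codewords in [0, 255]
```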
6.2.3. Random ID vs. Semantic ID
This ablation studies the importance of using semantic information in the IDs.
- Random ID Baseline: Each item is assigned an ID consisting of $m$ random codewords, where each codeword is sampled uniformly from the codebook range. The ID length and per-level cardinality were chosen to match those of the RQ-VAE Semantic IDs (four codewords with 256 possible values each).

The results are included in the same table above (Table 2).
Key Observation: Semantic IDs (both RQ-VAE and LSH) consistently and substantially outperform the Random ID baseline. This clearly demonstrates the critical importance of leveraging content-based semantic information when generating item representations for generative retrieval. Random IDs lack any meaningful structure, preventing knowledge transfer and generalization, which are key benefits of Semantic IDs.
6.3. New Capabilities
The TIGER framework introduces two new capabilities: cold-start recommendations and recommendation diversity.
6.3.1. Cold-Start Recommendation
TIGER can handle cold-start recommendations because Semantic IDs are derived from item content, not just user interactions. This means even new items without interaction history can have meaningful Semantic IDs.
- Setup: The "Beauty" dataset is used. To simulate cold-start, 5% of test items are removed from the training data, becoming "unseen items."
- Process: The RQ-VAE and sequence-to-sequence model are trained on the reduced training split. RQ-VAE then generates Semantic IDs for all items, including unseen ones.
- Retrieval: When TIGER predicts a Semantic ID, it retrieves seen items with that exact ID. Additionally, to account for unseen items, any unseen items sharing the first three semantic tokens are also included as candidates. A hyperparameter ε controls the maximum proportion of unseen items allowed in the top-K retrieved candidates. (A minimal sketch of this retrieval rule follows this list.)
- Baseline: Semantic_KNN, which uses the semantic embedding space for nearest-neighbor search.

The following figure (Figure 5 from the original paper) presents the results for cold-start retrieval:

Key Observations (from Figure 5):
- Superiority over Semantic_KNN: Figure 5a shows that TIGER (with a fixed value of ε) consistently outperforms Semantic_KNN across all Recall@K metrics for cold-start items. This indicates that the generative approach, combined with Semantic IDs, is more effective at identifying new items than a simple nearest-neighbor search in the semantic embedding space.
- Impact of ε: Figure 5b shows TIGER's performance across different values of ε. For the nonzero settings of ε considered, TIGER outperforms the baseline, demonstrating that allowing a controlled proportion of unseen items from the semantic neighborhood enhances cold-start performance without sacrificing overall recall.
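As referenced in the Retrieval bullet above, a minimal sketch of the seen/unseen retrieval rule could look like the following; all helper names and data structures are assumptions for illustration:

```python
# Cold-start retrieval rule: exact-ID matches for seen items, plus unseen items
# sharing the first three semantic tokens, capped at a fraction epsilon of top-K.
def retrieve(predicted_ids: list[tuple[int, ...]],
             seen_index: dict[tuple[int, ...], str],
             unseen_by_prefix: dict[tuple[int, int, int], list[str]],
             k: int = 10, epsilon: float = 0.1) -> list[str]:
    max_unseen = int(epsilon * k)
    results, num_unseen = [], 0
    for sem_id in predicted_ids:               # Semantic IDs in beam-score order
        if sem_id in seen_index:
            results.append(seen_index[sem_id])
        elif num_unseen < max_unseen:
            # Fall back to unseen items sharing the first three semantic tokens.
            for item in unseen_by_prefix.get(sem_id[:3], []):
                if num_unseen >= max_unseen:
                    break
                results.append(item)
                num_unseen += 1
        if len(results) >= k:
            break
    return results[:k]
```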
6.3.2. Recommendation Diversity
TIGER allows for controlling diversity, primarily through temperature-based sampling during the decoding process. The hierarchical nature of RQ-VAE Semantic IDs enables sampling at different levels (e.g., sampling the first token for coarse categories, or subsequent tokens for fine-grained items within a category).
- Metric: Diversity is measured using Entropy@K of the ground-truth category distribution for the top-K predicted items. Higher entropy indicates greater diversity.

The following are the results from Table 3 of the original paper:
| Temperature | Entropy@10 | Entropy@20 | Entropy@50 |
| --- | --- | --- | --- |
| T = 1.0 | 0.76 | 1.14 | 1.70 |
| T = 1.5 | 1.14 | 1.52 | 2.06 |
| T = 2.0 | 1.38 | 1.76 | 2.28 |
Key Observation: As the temperature (T) in temperature-based sampling increases, the Entropy@K values also consistently increase across K=10, 20, and 50. This confirms that temperature-based sampling during decoding effectively increases the diversity of predicted items' ground-truth categories. A higher temperature makes the softmax distribution over next tokens "softer," allowing the model to explore more diverse (less probable) Semantic ID paths.
The following are the results from Table 4 of the original paper:
| Target Category | Most-common Categories for top-10 predicted items | |
| T = 1.0 | T = 2.0 | |
| Hair Styling Products | Hair Styling Products | Hair Styling Products, Hair Styling Tools, Skin Face |
| Tools Nail | Tools Nail | Tools Nail, Makeup Nails |
| Makeup Nails | Makeup Nails | Makeup Nails, Skin Hands & Nails, Tools Nail |
| Skin Eyes | Skin Eyes | Hair Relaxers, Skin Face, Hair Styling Products, Skin Eyes |
| Makeup Face | Tools Makeup Brushes,Makeup Face | Tools Makeup Brushes, Makeup Face,Skin Face, Makeup Sets, Hair Styling Tools |
| Hair Loss Products | Hair Loss Products,Skin Face, Skin Body | Skin Face, Hair Loss Products, Hair Shampoos,Hair & Scalp Treatments, Hair Conditioners |
Key Observation: The qualitative analysis in Table 4 further supports the claim. At T = 1.0, the model tends to recommend items from a very narrow set of categories, often just the target category itself. At T = 2.0, however, the recommendations become much more diverse, including items from related but different categories. For instance, for Hair Styling Products, T = 1.0 only recommends Hair Styling Products, while T = 2.0 also includes Hair Styling Tools and Skin Face. This demonstrates the practical tunability of diversity.
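To illustrate how temperature reshapes the decoder's next-token distribution, here is a small, self-contained example; it uses made-up logits rather than the paper's model:

```python
# Temperature-based sampling over next-codeword logits: higher T flattens the
# softmax, making less probable codewords more likely to be sampled.
import numpy as np

def sample_next_codeword(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([3.0, 1.5, 0.5, 0.0])        # decoder scores for 4 candidate codewords
for t in (1.0, 2.0):
    draws = [sample_next_codeword(logits, t, rng) for _ in range(1000)]
    print(t, np.bincount(draws, minlength=4) / 1000)   # distribution flattens as T grows
```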
6.4. Ablation Study: Effect of Number of Layers
The authors investigate the impact of varying the number of layers in the sequence-to-sequence Transformer model.
The following are the results from Table 5 of the original paper:
| Number of Layers | Recall@5 | NDCG@5 | Recall@10 | NDCG@10 |
| 3 | 0.04499 | 0.03062 | 0.06699 | 0.03768 |
| 4 | 0.0454 | 0.0321 | 0.0648 | 0.0384 |
| 5 | 0.04633 | 0.03206 | 0.06596 | 0.03834 |
Key Observation: Increasing the number of Transformer layers from 3 to 4 shows a slight improvement in Recall@5, NDCG@5, and NDCG@10, while Recall@10 decreases slightly. Going from 4 to 5 layers shows a small increase in Recall@5 and Recall@10, but a slight decrease in NDCG@5 and NDCG@10. This suggests that while more layers can generally improve model capacity, the gains are marginal beyond 4 layers in this specific setup, and there might be a trade-off between model complexity and performance, or a saturation point for these datasets. The authors chose 4 layers for the main results.
6.5. Ablation Study: Effect of User Information
The authors also measure the effect of providing user information to the recommender system.
The following are the results from Table 8 of the original paper:
| Recall@5 | NDCG@5 | Recall@10 | NDCG@10 | ||
| No user information | 0.04458 | 0.0302 | 0.06479 | 0.0367 | |
| With user id (reported in the paper) | 0.0454 | 0.0321 | 0.0648 | 0.0384 |
Key Observation: Including user ID information in the input sequence leads to slight improvements across all metrics (Recall@5, NDCG@5, Recall@10, NDCG@10). This confirms that personalizing the recommendations with user-specific signals, even a simple hashed user ID, is beneficial and helps the model better tailor its predictions.
6.6. Scalability Analysis
The paper briefly discusses the scalability of Semantic IDs by training RQ-VAE on a combined dataset (all three Amazon categories) and then evaluating performance on the "Beauty" dataset.
The following are the results from Table 10 of the original paper:
| Recall@5 | NDCG@5 | Recall@10 | NDCG@10 | |
| Semantic ID [Combined datasets] | 0.04355 | 0.03047 | 0.06314 | 0.03676 |
| Semantic ID [Amazon Beauty] | 0.0454 | 0.0321 | 0.0648 | 0.0384 |
Key Observation: There is only a small decrease in performance when Semantic IDs are generated from a larger, combined dataset compared to generating them exclusively from the "Beauty" dataset. This suggests that Semantic IDs are robust and scale well across different domains, indicating potential for large-scale applications where a single RQ-VAE model could generate Semantic IDs for a vast and diverse item corpus.
6.7. Invalid IDs
Since the model autoregressively decodes codewords, it's possible to generate Semantic IDs that do not map to any existing item in the dataset (Invalid IDs). The authors investigate the frequency of such events.
The following figure (Figure 6 from the original paper) shows the percentage of invalid IDs:
The figure is a chart showing the percentage of invalid Semantic IDs produced by beam search for different values of K. Across the "Sports and Outdoors", "Beauty", and "Toys and Games" categories, the fraction of invalid IDs grows as K increases from 1 to 20 but remains small.
Key Observation (from Figure 6): For top-10 predictions, the fraction of Invalid IDs is low, ranging from approximately 0.1% to 1.6% across the three datasets. Even when retrieving up to 20 items, the percentage of Invalid IDs remains relatively small (around 0.3% to 6%). This indicates that the model largely learns to generate valid Semantic IDs. The authors suggest that increasing beam size during inference and filtering out invalid IDs can ensure a full list of valid recommendations. They also propose prefix matching as a future extension, where partially valid Semantic IDs could still retrieve relevant items based on their semantic categories.
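A simple way to realize the "filter invalid IDs and over-generate beams" strategy mentioned above is sketched below; the lookup-table structure and names are assumptions for illustration:

```python
# Keep beam-search candidates whose Semantic ID exists in the item lookup table,
# skipping invalid IDs; beams are over-generated so the final list stays full.
def filter_valid(beam_candidates: list[tuple[int, ...]],
                 id_to_item: dict[tuple[int, ...], str],
                 k: int) -> list[str]:
    valid = [id_to_item[sid] for sid in beam_candidates if sid in id_to_item]
    return valid[:k]

id_to_item = {(7, 1, 4, 0): "shampoo_a", (7, 3, 200, 0): "conditioner_b"}
beams = [(7, 1, 4, 0), (9, 9, 9, 9), (7, 3, 200, 0)]   # middle ID is invalid
print(filter_valid(beams, id_to_item, k=10))            # ['shampoo_a', 'conditioner_b']
```

Prefix matching, mentioned by the authors as a future extension, would replace the skip with a fallback to items sharing the longest valid Semantic ID prefix.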
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces TIGER (Transformer Index for GEnerative Recommenders), a novel framework that reimagines the retrieval stage of recommender systems as a generative task. At its core, TIGER employs Semantic IDs, which are semantically meaningful, hierarchically structured tuples of codewords generated from item content features using RQ-VAE. A Transformer-based sequence-to-sequence model is then trained to autoregressively predict the Semantic ID of the next item a user will interact with.
The key contributions and findings include:
-
TIGERsignificantly outperforms current state-of-the-art sequential recommender systems across multiple real-world datasets (Amazon Beauty, Sports and Outdoors, Toys and Games) in terms of Recall and NDCG. -
The framework inherently supports cold-start recommendations and generalization to new/unseen items by leveraging content-based
Semantic IDs, overcoming a major limitation of traditional interaction-based methods. -
TIGERoffers a native mechanism for controlling recommendation diversity through temperature-based sampling during decoding, a capability often achieved through complex post-processing in other systems. -
The
Semantic IDrepresentation leads to memory-efficient embedding tables, scaling better than traditional item-specific embedding approaches for large item corpuses. -
Qualitative and ablation studies confirm the semantic and hierarchical nature of
RQ-VAEgenerated IDs, their superiority over hashing-based or random IDs, and the benefits of including user information.Overall,
TIGERsuccessfully demonstrates a paradigm shift in recommendation retrieval, utilizing generative models and structured semantic representations to achieve superior performance and new capabilities.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future work:
- Inference Cost: Despite
TIGER's strong performance, the autoregressive decoding with beam search can be computationally more expensive during inference compared to Approximate Nearest Neighbor (ANN)-based models. The authors state that optimizing computational efficiency was not the main objective of this work but will be a focus for future research. This includes exploring ways to make the model smaller or improving inference efficiency. - Invalid IDs Handling: While the model mostly predicts valid
Semantic IDs, a small fraction can be invalid. The current solution is to filter invalid IDs and increase beam size. Future work could explore more sophisticated handling, such as prefix matching, where partially generatedSemantic IDscould still retrieve relevant items by matching their semantic categories. This could potentially further improve metrics. - Effects of Semantic ID Length and Codebook Size: The paper notes that recommendation metrics were robust to variations in
Semantic IDlength and codebook size. However, longerSemantic IDsincrease input sequence length for the Transformer, leading to higher computational costs. This implies a trade-off that might need further investigation for optimal configuration.
7.3. Personal Insights & Critique
This paper presents a highly innovative and promising approach to recommender systems. The shift from "search" to "generation" in retrieval, coupled with semantically meaningful item identifiers, is a compelling direction.
Strengths and Inspirations:
- Elegant Solution to Cold-Start: The
Semantic IDconcept is an elegant solution to the perennial cold-start problem. By encoding content semantics directly into the item's identifier, the model can generalize to unseen items based on their intrinsic properties, rather than relying solely on interaction history. This has broad applicability beyond just recommendations, for any system dealing with new entities. - Improved Generalization: The knowledge sharing across semantically similar items, enabled by overlapping
Semantic IDs, is a powerful mechanism for improving generalization and robustness. - Memory Efficiency: The reduction in embedding table size is a critical practical advantage for deploying large-scale recommender systems, especially as item corpuses continue to grow.
- Controllable Diversity: Integrating diversity control directly into the generative process is a valuable feature. Many existing systems struggle to balance relevance and diversity, often adding diversity as a post-hoc optimization.
TIGER's approach makes it a tunable aspect of the core model. - Leveraging NLP Advances: The successful adaptation of generative retrieval concepts from NLP to recommender systems demonstrates the continuous cross-pollination of ideas between these fields and opens up new avenues for research. The Transformer acting as a "differentiable search index" is a profound concept.
Potential Issues, Unverified Assumptions, and Areas for Improvement:
-
Dependence on High-Quality Content Encoders: The effectiveness of
Semantic IDsheavily relies on the quality of the pre-trained content encoder (e.g.,Sentence-T5). If item content features are sparse, noisy, or poorly captured by the encoder, theSemantic IDsmight not be genuinely semantic, undermining the core premise. This is an implicit dependency that should be considered. -
Computational Cost of RQ-VAE Training: While the
RQ-VAEprocess offers memory efficiency for the finalSemantic IDs, the training ofRQ-VAEitself can be complex and computationally intensive, especially for very large corpuses and high-dimensional initial embeddings. -
Interpretability of Semantic IDs: While the qualitative analysis shows some hierarchical semantic meaning, further quantitative measures of the "semanticness" and "hierarchical quality" of
Semantic IDscould be beneficial. How well do these codes truly represent human-understandable categories? -
Robustness to Content Changes: If item content changes frequently, the
Semantic IDswould need to be regenerated and the lookup tables updated. How this impacts system stability and maintenance is a practical consideration. -
Hybrid Approaches: While
TIGERfocuses on generative retrieval, it could potentially be combined with traditional embedding-based methods in a hybrid fashion, perhaps usingSemantic IDsfor cold-start and niche items, and traditional embeddings for well-established ones. -
Real-time Inference Speed: The stated limitation regarding inference cost is significant for real-world, high-traffic recommender systems. Future work on optimizing beam search or exploring non-autoregressive decoding strategies would be crucial for practical adoption.
-
Cold-Start Definition: The cold-start evaluation removes items from the training data. This is a good proxy, but real-world cold-start often means truly new items for which the model has no prior exposure in any form. While
TIGERrelies on content, the semantic encoder itself might have been trained on similar content. TheSemantic_KNNbaseline is a good comparison, but deeper analysis into the novelty aspect could be interesting.In conclusion,
TIGERis a thought-provoking paper that offers a fresh perspective on a long-standing problem in recommender systems. Its innovations inSemantic IDgeneration and generative retrieval pave the way for more intelligent, generalizable, and efficient recommendation models, although practical deployment will require addressing the identified inference cost challenges.
Similar papers
Recommended via semantic vector search.