
Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation

Published: 08/23/2025

TL;DR Summary

The DECOR framework addresses limitations in generative recommenders, enhancing token adaptability while preserving pretrained semantics. It employs contextualized token composition and decomposed embedding fusion, and demonstrates superior performance over state-of-the-art baselines on three real-world datasets.

Abstract

Recent advances in generative recommenders adopt a two-stage paradigm: items are first tokenized into semantic IDs using a pretrained tokenizer, and then large language models (LLMs) are trained to generate the next item via sequence-to-sequence modeling. However, these two stages are optimized for different objectives: semantic reconstruction during tokenizer pretraining versus user interaction modeling during recommender training. This objective misalignment leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to reflect diverse usage contexts; and (ii) discarded pretrained semantics, where pretrained knowledge - typically from language model embeddings - is overwritten during recommender training on user interactions. To address these limitations, we propose to learn DEcomposed COntextual Token Representations (DECOR), a unified framework that preserves pretrained semantics while enhancing the adaptability of token embeddings. DECOR introduces contextualized token composition to refine token embeddings based on user interaction context, and decomposed embedding fusion that integrates pretrained codebook embeddings with newly learned collaborative embeddings. Experiments on three real-world datasets demonstrate that DECOR consistently outperforms state-of-the-art baselines in recommendation performance. Our code will be made available upon publication.

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation." This indicates a focus on improving how items are represented as tokens in generative recommender systems by combining existing knowledge and dynamic context.

1.2. Authors

  • Yifan Liu, Yaokun Liu, Zelin Li, Zhenrui Yue, Gyusek Lee, Ruichen Yao, Dong Wang (University of Illinois Urbana-Champaign, Champaign, Illinois, USA)

  • Yang Zhang (Miami University, Oxford, Ohio, USA)

    The authors are primarily affiliated with academic institutions, suggesting a research-oriented background in areas such as recommender systems, natural language processing, and machine learning.

1.3. Journal/Conference

The paper is published as a preprint, indicated by the provided arXiv link. While the abstract states "Our code will be made available upon publication," and the publication date is given as "2025-08-22T18:50:38.000Z," it is currently in a preprint stage and not yet formally published in a specific journal or conference proceedings. arXiv is a reputable open-access archive for preprints of scientific papers, particularly in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Its influence is significant in disseminating cutting-edge research before formal peer review.

1.4. Publication Year

The declared publication date (UTC) is 2025-08-22T18:50:38.000Z.

1.5. Abstract

Recent advancements in generative recommenders utilize a two-stage approach: first, items are converted into semantic IDs using a pretrained tokenizer, and then Large Language Models (LLMs) predict the next item. However, the paper identifies a fundamental misalignment: the tokenizer is optimized for semantic reconstruction, while the recommender is optimized for user interaction modeling. This leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to capture diverse usage contexts, and (ii) discarded pretrained semantics, where valuable pretrained knowledge is lost during recommender training.

To address these issues, the authors propose DECOR (DEcomposed COntextual Token Representations), a unified framework designed to preserve pretrained semantic knowledge while enhancing the adaptability of token embeddings. DECOR introduces two main components: contextualized token composition, which refines token embeddings based on user interaction context, and decomposed embedding fusion, which integrates frozen pretrained codebook embeddings with newly learned collaborative embeddings. Experimental results on three real-world datasets demonstrate that DECOR consistently surpasses state-of-the-art baselines in recommendation performance.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve lies within the two-stage paradigm of recent generative recommenders. In this paradigm, items are first transformed into discrete semantic IDs using a pretrained tokenizer, and then Large Language Models (LLMs) are trained to predict the next item in a sequence. The authors identify a crucial objective misalignment between these two stages:

  1. The tokenizer is optimized for semantic reconstruction, meaning its goal is to accurately represent item metadata (like descriptions) in a compressed, tokenized form.

  2. The recommender (the LLM) is optimized for user interaction modeling, aiming to predict user preferences and sequential behavior.

    This misalignment leads to two significant limitations:

  • (i) Suboptimal Static Tokenization (C1): The item tokens, once generated by the tokenizer, are static and fixed. This static assignment fails to reflect the diverse and dynamic usage contexts in which an item might appear. For example, a "noise-canceling headphone" might be used for office work, workouts, or sleep, each implying different contextual semantics. Static tokens cannot adapt to these nuances, leading to ambiguous representations and reduced representation efficiency.

  • (ii) Discarded Pretrained Semantics (C2): The rich pretrained knowledge (e.g., world knowledge embedded in LLM embeddings used to create semantic representations) is effectively discarded after tokenization. The token embeddings used by the recommender are typically randomly initialized and solely trained on user interaction data. This process overwrites or ignores the valuable semantic information captured during the initial pretraining, limiting the recommender's ability to ground item representations in external knowledge, especially for sparse items (items with few interactions).

    This problem is important because generative recommenders are a promising new paradigm, leveraging the power of LLMs for sequence generation. However, these inherent limitations hinder their full potential, leading to less accurate and less context-aware recommendations. The paper's entry point is to bridge this objective misalignment by proposing a unified framework that simultaneously preserves pretrained semantics and introduces contextual adaptability to token representations.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of generative recommendation:

  1. Identification and Characterization of Limitations: The authors are the first to explicitly identify and analyze the two key limitations in existing generative recommenders: suboptimal static tokenization (C1) and discarded pretrained semantics (C2). They clearly characterize how static tokenization restricts representational flexibility.

  2. Introduction of DECOR Framework: They propose DECOR (DEcomposed COntextual Token Representations), a novel unified framework. DECOR's design directly addresses the identified limitations through two main components:

    • Contextualized Token Composition: This mechanism dynamically refines token embeddings based on the user's interaction context, allowing for context-dependent interpretation of tokens without retraining the tokenizer. This tackles the suboptimal static tokenization problem.
    • Decomposed Embedding Fusion: This integrates frozen pretrained codebook embeddings (preserving initial semantics) with newly learned collaborative embeddings (capturing interaction patterns). This prevents the discarded pretrained semantics problem by maintaining a balance between semantic knowledge and collaborative signals.
  3. Demonstrated Superior Performance: Extensive experiments conducted on three real-world Amazon Review datasets (Scientific, Instrument, Game) consistently show that DECOR outperforms both classical sequential recommenders and state-of-the-art generative recommender baselines. This includes dynamic baselines like ETEGRec, highlighting DECOR's robust and effective approach.

  4. Enhanced Embedding Utilization: The paper demonstrates that DECOR significantly improves the utilization of active embeddings, especially at the first quantization layer, by enabling previously under-utilized codebook entries to contribute through contextual composition.

    The key findings are that by explicitly addressing the objective misalignment through these two novel components, DECOR achieves more robust, context-aware, and semantically rich item representations, leading to improved recommendation accuracy.

3. Prerequisite Knowledge & Related Work

This section provides the foundational concepts and summarizes relevant prior work necessary to understand the DECOR framework.

3.1. Foundational Concepts

3.1.1. Sequential Recommendation

Sequential recommendation is a sub-field of recommender systems where the goal is to predict the next item a user will interact with, given their past sequence of interactions. Unlike traditional collaborative filtering which might recommend items based on overall preferences, sequential recommenders focus on the order and temporal dependencies of user behavior. For example, if a user watched "Star Wars Episode IV," a sequential recommender might suggest "Star Wars Episode V" next.

3.1.2. Generative Recommendation

Generative recommendation is a new paradigm that formulates the sequential recommendation task as a sequence-to-sequence (seq2seq) generation problem, often leveraging Large Language Models (LLMs). Instead of ranking a predefined list of items, these models generate the identifier or a sequence of tokens representing the next item. The typical two-stage pipeline in generative recommenders involves:

  1. Item Tokenization: Converting items (e.g., products, movies) into discrete tokens or semantic IDs. This often involves using item metadata (like title, description) and a specialized tokenizer.
  2. Recommender Training: Training an LLM (e.g., T5) to autoregressively generate the sequence of tokens corresponding to the next recommended item, based on the user's past tokenized interaction sequence.

3.1.3. Large Language Models (LLMs)

Large Language Models (LLMs) are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. Their strength lies in sequence-to-sequence modeling and capturing long-range dependencies. In generative recommendation, LLMs are adapted to treat item tokens as "words" and user interaction sequences as "sentences," generating new "words" (item tokens) to complete the "sentence" (recommend the next item).

3.1.4. Sequence-to-Sequence (Seq2Seq) Modeling

Sequence-to-sequence (Seq2Seq) models are a class of neural networks capable of transforming input sequences into output sequences. They typically consist of an encoder that processes the input sequence and a decoder that generates the output sequence. In generative recommendation, the input sequence is a user's past interaction history (represented by item tokens), and the output sequence is the token representation of the next item to be recommended.

3.1.5. Embeddings and Semantic Embeddings

An embedding is a dense vector representation of a discrete variable (like a word, an item, or a token) in a continuous vector space. The goal is to map similar items closer together in this space. Semantic embeddings specifically refer to embeddings that capture the meaning or semantic properties of an item, often derived from item metadata (e.g., descriptions, titles) using pretrained language models or text encoders (e.g., Sentence-T5). These embeddings are rich in external knowledge.

3.1.6. Vector Quantization (VQ) and Residual Quantization VAE (RQ-VAE)

Vector Quantization (VQ) is a method used to map high-dimensional continuous vectors (like semantic embeddings) to discrete codes (tokens). It involves a codebook or code vector, which is a set of learned prototype vectors. During quantization, an input vector is replaced by the nearest vector in the codebook. Residual Quantization VAE (RQ-VAE) is an advanced form of vector quantization that quantizes vectors in a hierarchical or iterative manner. Instead of quantizing the entire vector at once, it quantizes the residual (the part not captured by the previous quantization step) in subsequent stages. This allows for a more fine-grained and accurate reconstruction of the original vector using a sequence of discrete codes. Each stage has its own codebook. The output of an RQ-VAE for an item is a sequence of indices, where each index refers to a vector in a specific codebook at a particular quantization level. These indices form the semantic ID of an item.

3.1.7. Attention Mechanism

The attention mechanism is a key component of Transformer models. It allows a neural network to focus on specific parts of its input sequence when processing or generating output. Specifically, it computes a weighted sum of input elements, where the weights are determined by a query (what we're looking for), keys (what's available), and values (the actual information). The weights indicate the "relevance" of each input element to the current processing step. This mechanism is used in DECOR for Context Vector Computation to aggregate historical context.

The general formula for Scaled Dot-Product Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices representing the input sequences.
  • $d_k$ is the dimension of the keys, used for scaling to prevent very large dot products that push the softmax function into regions with tiny gradients.
  • $QK^T$ computes the dot-product similarity between queries and keys.
  • $\mathrm{softmax}$ normalizes the scores to obtain attention weights.
  • The result is a weighted sum of the values.
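
To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention in PyTorch; the single-head setting and the toy tensor shapes are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of scaled dot-product attention (Section 3.1.7), using PyTorch.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)           # attention weights sum to 1 over keys
    return weights @ V                            # weighted sum of the values

# Toy usage: 3 queries attend over 5 key/value pairs of dimension 8.
Q, K, V = torch.randn(3, 8), torch.randn(5, 8), torch.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([3, 8])
```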

3.1.8. Multi-Layer Perceptron (MLP)

A Multi-Layer Perceptron (MLP) is a type of artificial neural network composed of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node (except for the input nodes) is a neuron that uses a nonlinear activation function. MLPs are commonly used for tasks like classification and regression, and in DECOR, they are used for transforming vectors and in attention mechanisms.

3.2. Previous Works

The paper discusses several categories of related work:

3.2.1. Generative Recommenders

  • P5 (Geng et al. 2022): An early work that fine-tunes a pretrained LLM (e.g., T5) to handle multiple recommendation tasks within a unified generative framework, using natural language prompts.
  • TIGER (Rajput et al. 2023): This is a key baseline and the foundation for DECOR. TIGER introduces a discrete semantic tokenization scheme where item metadata is quantized into semantic IDs using an RQ-VAE. A pretrained T5 model then autoregressively generates these item token sequences.
  • EAGER (Wang et al. 2024c): Builds on TIGER, proposing a two-stream generation framework with a shared encoder and separate decoders to jointly capture user behavior and item content semantics.
  • OneRec (Deng et al. 2025): Extends the generative approach by unifying retrieval and ranking into a single iterative generation process.
  • DECOR's relation: DECOR investigates the limitations of the tokenization pretraining - recommender training process, specifically suboptimal static tokenization and loss of pretrained semantics, which are not fully addressed by these previous generative models.

3.2.2. Item Token Representation

This section highlights the evolution of how items are represented as discrete tokens for language models.

  • Pseudo ID-based methods: Assign a unique token identifier to each item without incorporating semantic structure (e.g., Chu et al. 2023, Geng et al. 2022, Hua et al. 2023, Wang et al. 2024b). P5-SID (Hua et al. 2023) is an example, decomposing numeric item IDs into ordered subtokens to improve locality.

  • Text-based methods: Utilize item metadata (titles, descriptions) to construct natural language prompts (e.g., Bao et al. 2025, Dai et al. 2023, Li et al. 2023, Liao et al. 2023, Zhang et al. 2025, 2021). While expressive, they can incur high inference costs and potentially lead to hallucinated content.

  • Semantic Indexing Schemes: These are recent approaches that balance semantic fidelity and token efficiency by extracting compact token sequences through quantizing pretrained text embeddings of item descriptions (e.g., Rajput et al. 2023 (TIGER), Wang et al. 2024a (LETTER), Zheng et al. 2024, Zhu et al. 2024 (CoST), Hou et al. 2023). These methods, including TIGER, typically use static tokenization, where item representations are fixed.

  • Dynamic Tokenization Approaches:

    • ETEGRec (Liu et al. 2025): Jointly trains the item tokenizer and recommender model in an end-to-end fashion. This aims to overcome static tokenization but can introduce training instability as token assignments evolve.
    • ED$^2$ (Yin et al. 2025): Proposes a duo-index framework with a multi-grained token regulator and instruction tuning with user-level metadata.
  • DECOR's relation: DECOR specifically targets the static tokenization limitation of semantic indexers. Unlike ETEGRec, DECOR avoids re-tokenization and the associated instability by adapting token embeddings contextually during generation, rather than dynamically updating the semantic ID tokens themselves. It also does not rely on user-specific information, unlike ED$^2$.

3.3. Technological Evolution

The field of sequential recommendation has evolved from traditional ID-based models (like GRU4Rec, SASRec, BERT4Rec, Caser, FDSA, S3Rec) that learn representations for item IDs, to generative recommenders that leverage the power of LLMs. This shift brings the benefit of LLMs' strong sequence modeling capabilities and their ability to incorporate rich semantic information from item metadata. However, it also introduces new challenges:

  1. Bridging ID-based and semantic-rich representations: LLMs are text-based, so items need to be converted into a "language" they can understand. This led to item tokenization schemes, moving from simple pseudo IDs to semantic IDs derived from item descriptions.

  2. Maintaining semantic fidelity during tokenization: Early tokenization efforts could lose valuable semantic information. RQ-VAE-based semantic indexers (as in TIGER) aimed to preserve it, but their static nature still posed a problem.

  3. Aligning objectives: The core problem DECOR addresses is the objective misalignment. The tokenizer is optimized for reconstruction, while the recommender is optimized for user interaction modeling. This creates a gap where static tokenization may not be optimal for downstream recommendation, and the LLM may discard pretrained semantic knowledge by overwriting token embeddings during training on user interactions alone.

DECOR fits into this timeline by offering a solution to this objective misalignment, building on semantic indexing but enhancing it with contextual adaptability and semantic preservation during the recommender training phase, without the instability of joint tokenizer-recommender optimization.

3.4. Differentiation Analysis

Compared to the main methods in related work, DECOR's core differences and innovations are:

  • Compared to static semantic tokenization methods (e.g., TIGER, LETTER, CoST):

    • DECOR introduces Contextualized Token Composition. While these baselines use semantic IDs from RQ-VAE, their token embeddings are static. DECOR dynamically refines these embeddings based on the user interaction context, overcoming the suboptimal static tokenization problem that limits the expressiveness of fixed tokens.
    • DECOR employs Decomposed Embedding Fusion to explicitly integrate frozen pretrained codebook embeddings with learnable collaborative embeddings. This directly addresses the discarded pretrained semantics issue, which static methods suffer from by typically re-initializing and training token embeddings solely on interaction data.
  • Compared to dynamic tokenization methods (e.g., ETEGRec, ED$^2$):

    • DECOR achieves contextual adaptability without iteratively re-tokenizing items or jointly optimizing the tokenizer and recommender. ETEGRec's joint training can introduce instability and computational overhead. DECOR instead adapts the interpretation of static tokens via composition, which is more stable and efficient.
    • DECOR's contextualized token composition does not require user-specific metadata during inference, unlike some aspects of ED$^2$, making it more generally applicable.

  • Unified Framework: DECOR provides a unified solution that addresses both suboptimal static tokenization and discarded pretrained semantics simultaneously, which is a more comprehensive approach than methods that tackle only one aspect or introduce instability.

    In essence, DECOR distinguishes itself by providing a stable yet highly adaptive token representation learning framework that intelligently leverages both pretrained semantic knowledge and collaborative interaction patterns within the generative recommendation paradigm.

4. Methodology

4.1. Principles

The core idea behind DECOR is to resolve the objective misalignment between item tokenization and recommender training in generative recommendation. The theoretical basis is that item representations should be rich in pretrained semantic knowledge (from metadata) and also adapt dynamically to diverse user interaction contexts (collaborative signals). DECOR achieves this by:

  1. Preserving Pretrained Semantics: It prevents the loss of valuable semantic information by freezing the original codebook embeddings from the RQ-VAE tokenizer and fusing them with newly learned collaborative embeddings.
  2. Enhancing Contextual Adaptability: It overcomes suboptimal static tokenization by dynamically refining token embeddings based on the immediate user interaction context through a soft composition mechanism, allowing the recommender to interpret tokens in a context-dependent manner without changing the underlying static token IDs.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

The paper frames the task as sequential recommendation within a generative setting. Given a set of items $\tau$ and a user's interaction sequence $\mathbf{S}^u = [i_1, i_2, \ldots, i_{t-1}]$ with each interacted item drawn from $\tau$, the objective is to predict the next item $i_t \in \tau$. This task involves two main steps:

  1. Item Tokenization: Each item $i \in \tau$ is mapped to a discrete token sequence $\mathbf{c}_i = [c_{i,1}, c_{i,2}, \ldots, c_{i,L}]$, where $L$ is the sequence length (number of quantization levels) and each token belongs to the predefined token set $\mathcal{C}$ (the codebook indices). The user's interaction sequence $\mathbf{S}^u$ is thus transformed into a tokenized sequence $\mathbf{X}^u = [\mathbf{c}_{i_1}, \mathbf{c}_{i_2}, \ldots, \mathbf{c}_{i_{t-1}}]$.
  2. Autoregressive Generation: Given $\mathbf{X}^u$, the model autoregressively generates the token sequence $\mathbf{c}_i$ for the next target item $i_t$ by factorizing the conditional probability: $ p(\mathbf{c}_i | \mathbf{X}^u) = \prod_{l=1}^{L} p(c_{i,l} | \mathbf{X}^u, c_{i,1}, \ldots, c_{i,l-1}) \quad (1) $ Where:
  • $p(\mathbf{c}_i | \mathbf{X}^u)$ is the probability of generating the token sequence $\mathbf{c}_i$ for the target item given the user's historical tokenized sequence $\mathbf{X}^u$.
  • $L$ is the length of the token sequence for an item (number of quantization levels).
  • $c_{i,l}$ is the $l$-th token in the sequence for item $i$.
  • The product indicates that each token $c_{i,l}$ is generated conditioned on the historical sequence $\mathbf{X}^u$ and all previously generated tokens for the current item, $c_{i,1}, \ldots, c_{i,l-1}$.
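
As a concrete illustration of Equation (1), the sketch below computes the log-probability of a target item's semantic ID by summing per-step token log-probabilities. The `decoder_logits` input is a hypothetical stand-in for the output of the seq2seq recommender's decoder, not an interface from the paper.

```python
# Illustrative sketch of the autoregressive factorization in Eq. (1).
import torch
import torch.nn.functional as F

def item_log_prob(decoder_logits, target_tokens):
    """
    decoder_logits: (L, vocab_size) logits for each of the L generation steps,
                    assumed already conditioned on the user history X^u and prior tokens.
    target_tokens:  (L,) semantic ID [c_{i,1}, ..., c_{i,L}] of the target item.
    Returns log p(c_i | X^u) = sum_l log p(c_{i,l} | X^u, c_{i,<l}).
    """
    log_probs = F.log_softmax(decoder_logits, dim=-1)                      # (L, vocab_size)
    step_log_probs = log_probs.gather(1, target_tokens.unsqueeze(1)).squeeze(1)
    return step_log_probs.sum()

# Toy usage with L = 4 tokens and a 256-entry token vocabulary.
logits = torch.randn(4, 256)
target = torch.tensor([12, 201, 7, 55])
print(item_log_prob(logits, target))
```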

4.2.2. Semantic Indexer Pretraining

Following prior work (Rajput et al. 2023), DECOR utilizes an RQ-VAE (Residual Quantization Variational Autoencoder) as its semantic indexer. The RQ-VAE consists of:

  • A pair of MLP encoder-decoder networks.

  • A sequence of codebooks $\{\mathcal{C}_1, \ldots, \mathcal{C}_L\}$, where $L$ is the number of quantization levels. Each $\mathcal{C}_l$ contains a set of prototype vectors (code vectors).

    The process of hierarchical quantization for a pretrained text embedding of an item's metadata $\mathbf{x} \in \mathbb{R}^d$ operates through residual steps: $ \mathbf{z}_0 = \operatorname{Encoder}(\mathbf{x}), \qquad \mathbf{z}_l = \mathbf{z}_{l-1} - q_l(\mathbf{z}_{l-1}), \quad l = 1, \ldots, L, \qquad q_l(\mathbf{z}_{l-1}) = \arg\min_{\mathbf{e} \in \mathcal{C}_l} \left\| \mathbf{z}_{l-1} - \mathbf{e} \right\|_2 \quad (4) $ Where:

  • $\mathbf{x} \in \mathbb{R}^d$ is the pretrained text embedding of an item's metadata.

  • $\operatorname{Encoder}(\cdot)$ is the MLP encoder that maps the input embedding to a latent space.

  • $\mathbf{z}_0$ is the initial latent representation from the encoder.

  • $\mathbf{z}_l$ is the residual vector at quantization level $l$.

  • $q_l(\cdot)$ is the quantization function for level $l$, which performs a nearest-neighbor lookup in the codebook $\mathcal{C}_l$ using Euclidean distance to find the closest code vector $\mathbf{e}$.

  • $\mathcal{C}_l$ is the codebook for the $l$-th quantization level.

  • The semantic ID of an item is the concatenation of the indices of the selected code vectors from each level, $\{i_1, \ldots, i_L\}$, plus a final token for item collision handling (not explicitly shown in the formula but mentioned in the text).

    To reconstruct the original semantic embedding, the quantized representation $\tilde{\mathbf{r}}$ is formed by summing the selected codebook vectors: $ \tilde{\mathbf{r}} = \sum_{l=1}^{L} \mathbf{e}_{c_l}^{l} $ where $\mathbf{e}_{c_l}^{l}$ is the codebook vector in $\mathcal{C}_l$ selected by the semantic ID $c_l$. This $\tilde{\mathbf{r}}$ is then decoded back to the semantic space: $ \tilde{\mathbf{x}} = \operatorname{Decoder}(\tilde{\mathbf{r}}) \quad (5) $ where $\operatorname{Decoder}(\cdot)$ is the MLP decoder.
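
The quantization loop of Equations (4)-(5) can be sketched as follows; the codebook shapes, the use of `torch.cdist` for the nearest-neighbor lookup, and the omission of the encoder/decoder MLPs are simplifying assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of hierarchical residual quantization (Eqs. (4)-(5)),
# starting from an already-encoded latent z0 and L codebooks of shape (K, d).
import torch

def residual_quantize(z0, codebooks):
    """z0: (d,), codebooks: list of L tensors of shape (K, d).
    Returns the semantic ID (list of indices) and the quantized reconstruction r~."""
    residual = z0
    indices, selected = [], []
    for C in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), C).squeeze(0)  # ||z_{l-1} - e||_2 for all e
        idx = int(torch.argmin(dists))                            # nearest code vector
        e = C[idx]
        indices.append(idx)
        selected.append(e)
        residual = residual - e                                   # z_l = z_{l-1} - q_l(z_{l-1})
    r_tilde = torch.stack(selected).sum(dim=0)                    # sum of the chosen code vectors
    return indices, r_tilde

# Toy usage: 3 quantization levels, codebook size 256, latent dimension 32.
torch.manual_seed(0)
z0 = torch.randn(32)
codebooks = [torch.randn(256, 32) for _ in range(3)]
semantic_id, r_tilde = residual_quantize(z0, codebooks)
print(semantic_id, r_tilde.shape)
```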

The entire RQ-VAE semantic indexer is optimized with an overall loss $\mathcal{L}_{\mathrm{SQ}}$, which combines a reconstruction term and a quantization (codebook) term: $ \mathcal{L}_{\mathrm{SQ}} = \mathcal{L}_{\mathrm{RECON}} + \mathcal{L}_{\mathrm{RQ}}, \qquad \mathcal{L}_{\mathrm{RECON}} = \| \mathbf{x} - \tilde{\mathbf{x}} \|_2^2, \qquad \mathcal{L}_{\mathrm{RQ}} = \sum_{l=1}^{L} \left\| \mathrm{sg}[\mathbf{z}_{l-1}] - \mathbf{e}_{c_l}^{l} \right\|_2^2 + \beta \left\| \mathbf{z}_{l-1} - \mathrm{sg}[\mathbf{e}_{c_l}^{l}] \right\|_2^2 \quad (7) $ Where:

  • $\mathcal{L}_{\mathrm{SQ}}$ is the overall loss for the semantic quantizer (RQ-VAE).
  • $\mathcal{L}_{\mathrm{RECON}}$ is the reconstruction loss, the squared L2 (Euclidean) distance between the original input embedding $\mathbf{x}$ and the reconstructed embedding $\tilde{\mathbf{x}}$.
  • $\mathcal{L}_{\mathrm{RQ}}$ is the quantization loss, which has two parts:
    • The first term, $\left\| \mathrm{sg}[\mathbf{z}_{l-1}] - \mathbf{e}_{c_l}^{l} \right\|_2^2$, is the codebook update term: it pulls the selected codebook vector $\mathbf{e}_{c_l}^{l}$ toward the encoder output $\mathbf{z}_{l-1}$. The stop-gradient $\mathrm{sg}[\cdot]$ prevents gradients from flowing back through the encoder, so only the codebook vectors are updated by this term.
    • The second term, $\beta \left\| \mathbf{z}_{l-1} - \mathrm{sg}[\mathbf{e}_{c_l}^{l}] \right\|_2^2$, is the commitment term: it pulls the encoder output $\mathbf{z}_{l-1}$ toward the selected codebook vector. Here the stop-gradient on $\mathbf{e}_{c_l}^{l}$ blocks gradients into the codebook, so only the encoder is updated by this term.
  • $\beta = 0.25$ is a hyperparameter that balances the two parts of the quantization loss.
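
A minimal sketch of the quantization loss in Equation (7) follows, using `detach` as the stop-gradient operator; the variable names and the per-level loop are illustrative, with $\beta = 0.25$ as stated above, and the reconstruction term $\mathcal{L}_{\mathrm{RECON}}$ would be added separately.

```python
# Hedged sketch of the quantization loss in Eq. (7). The stop-gradient (`detach`) on the
# residual lets the first term update only the codebook, and the one on the code vector
# lets the second term update only the encoder.
import torch

def rq_loss(residuals, selected_codes, beta=0.25):
    """residuals: list of z_{l-1} tensors; selected_codes: list of the chosen e^l_{c_l} tensors."""
    loss = 0.0
    for z, e in zip(residuals, selected_codes):
        codebook_term = ((z.detach() - e) ** 2).sum()   # moves e toward the (frozen) residual
        commit_term = ((z - e.detach()) ** 2).sum()     # moves the encoder output toward e
        loss = loss + codebook_term + beta * commit_term
    return loss

# Toy usage with L = 2 levels and latent dimension 8.
residuals = [torch.randn(8, requires_grad=True) for _ in range(2)]
codes = [torch.randn(8, requires_grad=True) for _ in range(2)]
print(rq_loss(residuals, codes))
```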

4.2.3. Decomposed Embedding Fusion

To address C2: discarded pretrained semantics, DECOR introduces a Decomposed Embedding Fusion module. This module dynamically integrates pretrained semantic and newly learned collaborative embedding representations, treating them as complementary information channels.

  1. Pretrained Semantic Embedding Space: This space leverages the frozen codebooks from the pretrained RQ-VAE tokenizer. Each codebook provides token-level embeddings learned to reconstruct pretrained embeddings and preserves the hierarchical structure from multi-stage tokenization. $ E_{\mathrm{pre}} = \{\mathcal{C}_i\}_{i=1}^{M} \in \mathbb{R}^{K \cdot M \times d} \quad (9) $ Where:

    • $E_{\mathrm{pre}}$ represents the overall pretrained semantic embedding space.
    • $M$ is the number of quantization layers (codebooks).
    • $\mathcal{C}_i \in \mathbb{R}^{K \times d}$ is the $i$-th frozen codebook, containing $K$ embeddings (code vectors), each of dimension $d$. $K$ is the size of each codebook.
    • $d$ is the embedding dimension, chosen to be compatible with the downstream LLM recommender.
    • For a given token index $c_i$, its pretrained semantic embedding $e_c$ is retrieved by a direct lookup: $e_c = \mathcal{C}_i[c_i]$.
  2. Collaborative Embedding Space: This is a separate, learnable embedding matrix $\mathcal{E}_{\mathrm{collab}} \in \mathbb{R}^{K \cdot M \times d}$. These embeddings are initialized randomly and trained from scratch solely through the autoregressive generation objective of the recommender. This allows them to encode sequential patterns, co-occurrence, and user preference dynamics based on user interaction data. Notably, both $E_{\mathrm{pre}}$ and $\mathcal{E}_{\mathrm{collab}}$ share the same dimensionality ($K \times d$ per layer), ensuring seamless alignment for fusion.

  3. Modality-Aware Fusion: First, both types of embeddings are projected into a shared latent space and layer normalized: $ \hat{e}_{\mathrm{pre}} = \mathrm{LN}(W_{\mathrm{pre}} e_{\mathrm{pre}}), \quad \hat{e}_{\mathrm{collab}} = \mathrm{LN}(W_{\mathrm{collab}} e_{\mathrm{collab}}) \quad (10) $ Where:

    • $e_{\mathrm{pre}}, e_{\mathrm{collab}} \in \mathbb{R}^d$ are the token embeddings from the pretrained and collaborative modalities, respectively.

    • $W_{\mathrm{pre}}, W_{\mathrm{collab}} \in \mathbb{R}^{d' \times d}$ are learnable projection matrices that map the embeddings to a potentially different dimension $d'$.

    • $\mathrm{LN}(\cdot)$ denotes Layer Normalization, which helps stabilize training.

    • $\hat{e}_{\mathrm{pre}}, \hat{e}_{\mathrm{collab}} \in \mathbb{R}^{d'}$ are the projected and normalized embeddings.

      Next, these normalized embeddings are concatenated and passed through a fusion layer to map them back to the original latent space dimension $d$: $ e_{\mathrm{fused}} = W_{\mathrm{fuse}} [\hat{e}_{\mathrm{pre}} \parallel \hat{e}_{\mathrm{collab}}] \in \mathbb{R}^d \quad (11) $ Where:

    • $W_{\mathrm{fuse}} \in \mathbb{R}^{d \times 2d'}$ is a learnable fusion matrix.

    • $[\hat{e}_{\mathrm{pre}} \parallel \hat{e}_{\mathrm{collab}}]$ denotes vector concatenation, resulting in a vector of dimension $2d'$.

    • $e_{\mathrm{fused}} \in \mathbb{R}^d$ is the final fused embedding, integrating both pretrained semantic and collaborative signals. This fused embedding is computed dynamically for each input token sequence $\mathbf{c}_i$ during the forward pass.

      fig 2: This schematic illustrates the two stages of the generative recommendation pipeline: item tokenizer pretraining (stage 1) and recommender training (stage 2). It shows how encoding and weight sharing work in the tokenizer pretraining stage, and how contextualized representations are constructed during recommender training. The key flows, Decomposed Embedding Fusion and Contextualized Token Composition, depict how pretrained semantics are combined with newly learned collaborative embeddings.

    The figure above (Figure 2 from the original paper) visually represents the Decomposed Embedding Fusion (top left) and Contextualized Token Composition (top right) components. It shows how pretrained embeddings from the RQ-VAE are frozen, how collaborative embeddings are learned, and how they are fused to create fused embeddings. These fused embeddings then serve as input for the contextualized token composition to dynamically refine representations.
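
A minimal PyTorch sketch of the Decomposed Embedding Fusion module (Equations (9)-(11)) is given below; the module layout, the stacked $(K \cdot M) \times d$ embedding table, and the dimension names are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of Decomposed Embedding Fusion: frozen pretrained codebook embeddings
# are projected, layer-normalized, concatenated with learnable collaborative embeddings,
# and mapped back to dimension d (Eqs. (10)-(11)).
import torch
import torch.nn as nn

class DecomposedEmbeddingFusion(nn.Module):
    def __init__(self, pretrained_codebooks, d_proj):
        super().__init__()
        num_tokens, d = pretrained_codebooks.shape
        # Frozen pretrained semantic embeddings E_pre (no gradient updates).
        self.pre = nn.Embedding.from_pretrained(pretrained_codebooks, freeze=True)
        # Learnable collaborative embeddings E_collab, trained from scratch.
        self.collab = nn.Embedding(num_tokens, d)
        self.w_pre = nn.Linear(d, d_proj, bias=False)
        self.w_collab = nn.Linear(d, d_proj, bias=False)
        self.ln_pre, self.ln_collab = nn.LayerNorm(d_proj), nn.LayerNorm(d_proj)
        self.w_fuse = nn.Linear(2 * d_proj, d, bias=False)

    def forward(self, token_ids):
        e_pre = self.ln_pre(self.w_pre(self.pre(token_ids)))               # Eq. (10)
        e_collab = self.ln_collab(self.w_collab(self.collab(token_ids)))
        return self.w_fuse(torch.cat([e_pre, e_collab], dim=-1))           # Eq. (11)

# Toy usage: M = 3 codebooks of K = 256 entries each, embedding dimension d = 128.
codebooks = torch.randn(3 * 256, 128)
fusion = DecomposedEmbeddingFusion(codebooks, d_proj=128)
fused = fusion(torch.tensor([[5, 300, 700]]))  # one tokenized item, indices into the stacked table
print(fused.shape)  # torch.Size([1, 3, 128])
```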

4.2.4. Contextualized Token Composition

To address C1: suboptimal static tokenization, DECOR introduces a Contextualized Token Composition mechanism. Instead of updating the semantic ID tokens themselves (which can be unstable), this component refines the token embeddings dynamically based on the usage context.

  1. Overall Idea: Given a target item $i$ with its cached semantic ID sequence $\mathbf{c}_i = \{c_i^{(j)}\}_{j=1}^{M}$ and a historical context sequence $\mathbf{h}_i$, a context-aware embedding $\tilde{e}_c$ for each token $c \in \mathbf{c}_i$ is computed using a function $\Phi$: $ \tilde{e}_c = \Phi\left(c, u_c, \{e_{c^{\prime}}\}_{c^{\prime} \in \mathcal{N}(c)}\right) \quad (12) $ Where:

    • $\tilde{e}_c$ is the refined context-aware embedding for token $c$.
    • $c$ is the current token for which the embedding is being refined.
    • $u_c$ is a context vector derived from the user's interaction history.
    • $\{e_{c^{\prime}}\}_{c^{\prime} \in \mathcal{N}(c)}$ is the set of candidate composition tokens for $c$. Here, $\mathcal{N}(c)$ is chosen to be all tokens from the same RQ-VAE codebook layer as token $c$. This allows the model to leverage the entire codebook's capacity, including under-utilized entries.
  2. Context Vector Computation: To obtain the context vector $u_c \in \mathbb{R}^d$ used for generating a specific target token $c$, the fused embeddings of the historical context sequence $\mathbf{h}_c = \{h_1, h_2, \ldots, h_L\}$ are aggregated. Each $h_\ell = f_{\mathrm{fuse}}(c_\ell) \in \mathbb{R}^d$ is the fused embedding of token $c_\ell$ obtained from the Decomposed Embedding Fusion module. An attention-based pooling mechanism is applied: $ u_c = \mathit{AttnPool}(\mathbf{h}_c) = \mathrm{MLP_{ctx}}\left(\sum_{\ell=1}^{L} \alpha_\ell \cdot h_\ell\right) \quad (13) $ The attention weights $\{\alpha_\ell\}_{\ell=1}^{L}$ are computed as: $ s_\ell = \mathbf{w}^{\top} \tanh\left(\mathbf{W} h_\ell + \mathbf{b}\right), \qquad \alpha_\ell = \frac{\exp(s_\ell)}{\sum_{m=1}^{L} \exp(s_m)} \quad (14) $ Where:

    • $h_\ell$ is the fused embedding of a historical token.
    • $\mathbf{W} \in \mathbb{R}^{d' \times d}$, $\mathbf{b} \in \mathbb{R}^{d'}$, and $\mathbf{w} \in \mathbb{R}^{d'}$ are learnable parameters of the attention network.
    • $s_\ell$ is the attention score for $h_\ell$.
    • $\tanh(\cdot)$ is the hyperbolic tangent activation function.
    • $\mathrm{MLP_{ctx}}$ is a multi-layer perceptron that transforms the weighted sum of historical embeddings into the final context vector $u_c$. This attention pooling allows the model to selectively focus on relevant parts of the history.
  3. Token Composition: To implement the function $\Phi$ from Equation (12), a soft composition is performed over the candidate token embeddings $\{e_{c^{\prime}}\}_{c^{\prime} \in \mathcal{N}(c)}$ (all tokens in the same RQ-VAE codebook layer). This composition is guided by the context vector $u_c$: $ \alpha_{c^{\prime}} = \frac{\exp(\langle W_q u, W_k e_{c^{\prime}} \rangle)}{\sum_{c^{\prime\prime} \in \mathcal{N}(c)} \exp(\langle W_q u, W_k e_{c^{\prime\prime}} \rangle)} \quad (16) $ $ \tilde{e}_c = \Phi_{\mathrm{soft}}(c, u_c, \mathcal{N}(c)) = \sum_{c^{\prime} \in \mathcal{N}(c)} \alpha_{c^{\prime}} \cdot e_{c^{\prime}} \quad (17) $ Where:

    • $e_{c^{\prime}}$ refers to a candidate token embedding from the same codebook layer.

    • $u$ is the context vector $u_c$.

    • $W_q, W_k \in \mathbb{R}^{d \times d}$ are learnable projection matrices (query and key transformations, similar to self-attention).

    • $\langle \cdot, \cdot \rangle$ denotes the dot-product similarity.

    • $\alpha_{c^{\prime}}$ are the composition attention weights, indicating how much each candidate token $e_{c^{\prime}}$ contributes to the refined embedding of $c$, based on the context $u_c$.

    • $\tilde{e}_c$ is the context-aware composed token embedding, formed as a weighted sum of candidate embeddings.

      Finally, this context-aware composed token embedding $\tilde{e}_c$ is fused with the original static embedding $e_c^{\mathrm{static}}$ (the fused embedding $e_{\mathrm{fused}}$ from the Decomposed Embedding Fusion for token $c$) using a residual link: $ e_{\mathrm{final}} = \alpha \cdot \tilde{e}_c + (1 - \alpha) \cdot e_c^{\mathrm{static}}, \quad \alpha \in [0, 1] \quad (18) $ Where:

    • $e_{\mathrm{final}}$ is the final embedding used by the LLM recommender.

    • $\alpha$ is a tunable hyperparameter that controls the strength of context adaptation. A larger $\alpha$ means more reliance on the composed (context-adapted) embedding, while a smaller $\alpha$ prioritizes the static embedding. This allows for flexible integration of contextual signals.

  4. Learnable BOS Embedding Composition: To handle the generation of the very first token (Beginning-of-Sequence, BOS), where no historical tokens have been generated yet, DECOR introduces $N$ learnable BOS query vectors $\mathcal{Q}_{\mathrm{BOS}} \in \mathbb{R}^{N \times d}$. These queries serve as latent representations of candidate BOS tokens. For the generation of the target token $c$, the BOS token's embedding $e_{\mathrm{BOS}}$ is composed with these BOS queries using the contextual composition function $\Phi$: $ \hat{e}_{\mathrm{BOS}} = \Phi(e_{\mathrm{BOS}}, u_c, \mathcal{Q}_{\mathrm{BOS}}) \quad (19) $ Where:

    • $e_{\mathrm{BOS}}$ is the original fused embedding for the BOS token.

    • $u_c$ is the context vector.

    • $\mathcal{Q}_{\mathrm{BOS}}$ serves as the set of candidate composition tokens for the BOS embedding. The resulting composed BOS embedding $\hat{e}_{\mathrm{BOS}}$ acts as the initial prefix for autoregressive generation, ensuring that even the first token's interpretation is dynamically adapted.

      fig 1: This figure compares DECOR's active embedding usage and recommendation performance. The left side shows embedding utilization at different quantization layers, while the right side shows NDCG@10 and Recall@10 improvements over TIGER. The bottom shows tokenization examples for different products, highlighting the impact of contextual differences and ambiguous prefixes.

    The figure above (Figure 1 from the original paper) illustrates the problem of suboptimal static tokenization, where noise-canceling headphones for different contexts (office, workout, sleep) receive identical prefix tokens (e.g., (1, 276)). DECOR addresses this by contextually adapting token representations through compositions, enabling different usage contexts to refine the token embedding, leading to increased embedding utilization and improved recommendation accuracy.
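
The following sketch ties together Equations (13)-(18): attention pooling over fused history embeddings yields the context vector $u_c$, which softly composes all candidate embeddings in the token's codebook layer, and a residual link with weight $\alpha$ mixes the composed and static embeddings. The layer sizes, module structure, and single-vector (unbatched) interface are illustrative assumptions; BOS composition (Equation (19)) would reuse the same composition step with the learnable BOS queries as candidates.

```python
# Hedged sketch of Contextualized Token Composition (Eqs. (13)-(18)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualTokenComposition(nn.Module):
    def __init__(self, d, d_proj, alpha=0.55):
        super().__init__()
        self.alpha = alpha
        # Attention pooling over the history (Eqs. (13)-(14)).
        self.attn_proj = nn.Linear(d, d_proj)
        self.attn_score = nn.Linear(d_proj, 1, bias=False)
        self.mlp_ctx = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        # Query/key projections for the soft composition (Eq. (16)).
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)

    def forward(self, history, e_static, layer_codebook):
        """history: (L, d) fused history embeddings; e_static: (d,) fused embedding of token c;
        layer_codebook: (K, d) fused embeddings of all tokens in c's codebook layer."""
        scores = self.attn_score(torch.tanh(self.attn_proj(history))).squeeze(-1)  # s_l
        weights = F.softmax(scores, dim=0)                                          # alpha_l
        u_c = self.mlp_ctx((weights.unsqueeze(-1) * history).sum(dim=0))            # Eq. (13)
        comp = F.softmax(self.w_q(u_c) @ self.w_k(layer_codebook).T, dim=0)         # Eq. (16)
        e_comp = comp @ layer_codebook                                              # Eq. (17)
        return self.alpha * e_comp + (1 - self.alpha) * e_static                    # Eq. (18)

# Toy usage: a history of 6 tokens, a codebook layer with 256 entries, d = 128.
mod = ContextualTokenComposition(d=128, d_proj=64)
out = mod(torch.randn(6, 128), torch.randn(128), torch.randn(256, 128))
print(out.shape)  # torch.Size([128])
```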

4.2.5. Complexity Analysis

The paper briefly analyzes the additional computational cost introduced by DECOR:

  • Decomposed Embedding Fusion: This adds a constant cost of $\mathcal{O}(d^2)$ per item due to the projection matrices ($W_{\mathrm{pre}}, W_{\mathrm{collab}}$) and the fusion matrix ($W_{\mathrm{fuse}}$), which are $d' \times d$ and $d \times 2d'$, respectively.
  • Contextualized Token Composition:
    • Context Vector Computation: Attention pooling over $L$ historical tokens has a complexity of $\mathcal{O}(L \cdot d^2)$.
    • Token Composition: For each token, attending to a fixed candidate set of size $K$ (all codebook entries in a layer) adds $\mathcal{O}(K \cdot d^2)$. Since $d$ (embedding dimension) and $K$ (codebook size) are constants, the additional cost of the DECOR components scales linearly with the context length $L$. This is negligible compared to the $\mathcal{O}(L^2 \cdot d)$ complexity of the Transformer self-attention layers in the backbone recommender model.

4.2.6. DECOR Training

DECOR is integrated into every forward pass of the generative recommender model.

  1. Encoder Input: Decomposed Embedding Fusion is first applied to compute the encoder input embeddings. This combines the frozen pretrained semantic embeddings with the learnable collaborative representations.

  2. Autoregressive Generation: During the autoregressive generation process in the decoder, the static embedding lookup is replaced by Contextualized Token Composition. Each token embedding is dynamically adapted based on the generated context (user history and previously generated tokens for the current item). This allows token embedding representations to evolve during training.

    By doing so, DECOR simultaneously preserves pretrained semantics (via Decomposed Embedding Fusion) and adapts to recommendation signals (via Contextualized Token Composition), effectively addressing both suboptimal static tokenization (C1) and discarded pretrained text semantics (C2) within a unified framework without requiring complex joint optimization of the tokenizer.

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three subsets of the updated Amazon Review dataset (Hou et al. 2024).

  • Preprocessing: The datasets undergo 5-core filtering, meaning users and items with fewer than five interaction records are excluded. User interaction sequences are constructed by chronologically ordering items, with a maximum sequence length of 20.

  • Characteristics:

    • Source: Amazon Review dataset.
    • Domain: E-commerce (specific categories: Scientific, Instrument, Game).
    • Sparsity: All datasets are highly sparse, indicating a challenging recommendation task.
  • Suitability: These datasets are widely used benchmarks in sequential recommendation and are effective for validating methods in sparse, real-world e-commerce scenarios.

    The following are the statistics of the datasets from Table 4 of the original paper:

    Dataset # Users # Items # Interactions Sparsity
    Scientific 50,985 25,848 412,947 99.969%
    Instrument 57,439 24,587 511,836 99.964%
    Game 94,762 25,612 814,586 99.966%
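
As a rough illustration of the preprocessing described in this subsection, the sketch below applies an iterative 5-core filter and builds chronologically ordered sequences truncated to the 20 most recent items; the DataFrame column names and the iterative filtering loop are assumptions, not the authors' pipeline.

```python
# Hedged sketch of 5-core filtering and sequence construction over (user, item, timestamp) rows.
import pandas as pd

def five_core_filter(df):
    """Iteratively drop users and items with fewer than 5 interactions until stable."""
    while True:
        before = len(df)
        df = df[df.groupby("user_id")["item_id"].transform("count") >= 5]
        df = df[df.groupby("item_id")["user_id"].transform("count") >= 5]
        if len(df) == before:
            return df

def build_sequences(df, max_len=20):
    """Chronologically ordered interaction sequences, truncated to the most recent max_len items."""
    df = df.sort_values("timestamp")
    return df.groupby("user_id")["item_id"].apply(lambda s: list(s)[-max_len:])
```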

5.2. Evaluation Metrics

All models are evaluated using Recall@K and NDCG@K (Normalized Discounted Cumulative Gain) with $K \in \{5, 10\}$. The leave-one-out strategy is employed: for each user, the last interaction is used for testing, the second-to-last for validation, and the rest for training. A full-ranking evaluation is performed over the entire candidate item set.

5.2.1. Recall@K

  • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved among the top K recommendations. In the context of recommendation, it answers the question: "Out of all the items the user actually interacted with (or the next item they interacted with), how many did our system correctly recommend within the top K?" A higher Recall@K indicates that the recommender system is better at finding and presenting items that users are likely to engage with.
  • Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|}{|\mathrm{Rel}_u|} $
  • Symbol Explanation:
    • $U$: The set of all users.
    • $|U|$: The total number of users.
    • $\mathrm{Rel}_u$: The set of actual relevant items for user $u$ (in leave-one-out, this is typically the single next item the user interacted with).
    • $\mathrm{Rec}_u(K)$: The set of top-$K$ items recommended by the system for user $u$.
    • $|\cdot|$: The cardinality of a set.
    • $\cap$: The intersection of two sets.

5.2.2. NDCG@K

  • Conceptual Definition: Normalized Discounted Cumulative Gain at K (NDCG@K) is a measure of ranking quality. It considers not only whether relevant items are in the top K recommendations, but also their positions in the ranked list. Highly relevant items appearing at the top of the list contribute more to the score than those appearing lower down. It is "normalized" to values between 0 and 1 by dividing by the ideal DCG (IDCG), which is the DCG achieved if all relevant items were perfectly ranked at the top. A higher NDCG@K indicates a better-ranked list, where relevant items are prioritized.
  • Mathematical Formula: $ \mathrm{NDCG@K} = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{DCG@K}_u}{\mathrm{IDCG@K}_u} $ where $ \mathrm{DCG@K}_u = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)} $ and $\mathrm{IDCG@K}_u$ is the maximum possible DCG@K, achieved by ranking all relevant items at the top positions.
  • Symbol Explanation:
    • $U$: The set of all users.
    • $|U|$: The total number of users.
    • $\mathrm{DCG@K}_u$: Discounted Cumulative Gain at K for user $u$.
    • $\mathrm{IDCG@K}_u$: Ideal Discounted Cumulative Gain at K for user $u$.
    • $K$: The number of top recommendations considered.
    • $j$: The position of an item in the ranked list.
    • $\mathrm{rel}(j)$: The relevance score of the item at position $j$. In a binary relevance setting (an item is either relevant or not), $\mathrm{rel}(j)$ is 1 if the item at position $j$ is relevant, and 0 otherwise.
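
Under the leave-one-out protocol each user has exactly one held-out relevant item, which simplifies both metrics; the sketch below reflects that special case (per-user values would then be averaged over all users), and its inputs are purely illustrative.

```python
# Minimal sketch of Recall@K and NDCG@K with a single relevant item per user.
import math

def recall_at_k(ranked_items, target, k):
    """1 if the single held-out target appears in the top-k list, else 0."""
    return int(target in ranked_items[:k])

def ndcg_at_k(ranked_items, target, k):
    """With one relevant item, IDCG@K = 1 and DCG@K = 1 / log2(rank + 1)."""
    if target in ranked_items[:k]:
        rank = ranked_items.index(target) + 1   # 1-based position in the ranked list
        return 1.0 / math.log2(rank + 1)
    return 0.0

# Toy usage: the target item 42 is ranked 3rd among the top-10 recommendations.
ranked = [7, 13, 42, 9, 5, 88, 21, 64, 2, 30]
print(recall_at_k(ranked, 42, 10), ndcg_at_k(ranked, 42, 10))  # 1 0.5
```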

5.3. Baselines

The paper compares DECOR against a comprehensive set of baselines, categorized into traditional ID-based sequential recommenders and LLM-based generative recommenders.

5.3.1. Traditional Baselines

These models typically learn embeddings for item IDs and model sequential patterns without explicit semantic tokenization.

  • Caser (Tang and Wang 2018): Applies convolutional neural networks (CNNs) to capture both short-range sequential patterns (vertical filters) and co-occurring latent features (horizontal filters).
  • GRU4Rec (Jannach and Ludewig 2017): Utilizes a Gated Recurrent Unit (GRU) network to model session-based user behavior, using the final hidden state to predict the next item.
  • SASRec (Kang and McAuley 2018): Employs a stack of Transformer encoder layers with multi-head self-attention to capture long-range dependencies in user interaction sequences.
  • BERT4Rec (Sun et al. 2019): Adapts the BERT architecture (a bidirectional Transformer encoder) for sequential recommendation, learning rich, context-aware item representations by predicting masked items.
  • FDSA (Feature-level Deeper Self-Attention): A Transformer-based model that applies separate self-attention blocks to the item sequence and the item feature (attribute) sequence, combining both to model sequential transitions.
  • S3Rec (Zhou et al. 2020): Enhances pre-training with four auxiliary self-supervised tasks (masking attributes, predicting masked items, distinguishing subsequences, and contrasting full sequences) to maximize mutual information at multiple item granularities.

5.3.2. Generative Baselines

These models leverage LLMs and often use semantic tokenization.

  • P5-SID (Hua et al. 2023): Decomposes numeric item IDs into ordered subtokens (e.g., prefixes) so that frequently co-occurring or sequentially adjacent items share subtoken patterns, improving locality in autoregressive generation.
  • P5-CID (Hua et al. 2023): Performs spectral clustering on collaborative co-occurrence graphs to group items, then uses the resulting cluster IDs as discrete tokens for generative recommendation.
  • TIGER (Rajput et al. 2023): A core baseline. It frames sequential recommendation as a generative retrieval task by quantizing item text embeddings via RQ-VAE into a fixed vocabulary of semantic IDs and trains an LLM (T5) to autoregressively generate the next item's ID.
  • LETTER (Wang et al. 2024a): Optimizes an RQ-VAE tokenizer by enforcing contrastive alignment and diversity regularization to learn hierarchical, collaborative, and diverse item tokens.
  • CoST (Zhu et al. 2024): Trains a quantization codebook with an InfoNCE-style contrastive loss to map item embeddings into discrete semantic tokens that preserve both semantic similarity and neighborhood structure.
  • ETEGRec (Liu et al. 2025): A dynamic generative baseline. It jointly optimizes the tokenizer and the recommender model with a set of alignment losses to improve tokenizer-recommender consistency.

5.4. Implementation Details

  • Tokenizer: Sentence-T5 is used as the text encoder for pretrained semantics (to generate the input $\mathbf{x}$ for the RQ-VAE).

  • Generative Recommender: T5 (Raffel et al. 2020) serves as the backbone LLM for autoregressive generation.

  • Hardware: Experiments are carried out on a single NVIDIA Tesla A40 GPU.

  • Reproducibility: A random seed is fixed to 2025.

  • Evaluation Protocol: The test set performance of the best-performing model (based on NDCG@10 on the validation set) is reported.

    The following are the hyperparameters used for each dataset from Table 5 of the original paper:

    Hyperparameter Scientific Instrument Game
    learning_rate 0.003 0.003 0.003
    warmup_steps 10,000 10,000 10,001
    dropout_rate 0.1 0.1 0.1
    max_grad_norm 1.0 1.0 1.0
    weight_decay 0.05 0.05 0.05
    beam_size 50 50 50
    d_model 128 128 128
    α 0.25 0.55 0.55
    BOS_Query_Number 32 64 64
    optimizer adamw adamw adamw
    lr_scheduler cosine cosine cosine
    train_batch_size 256 256 256
    max_epochs 200 200 200
    early_stop_patience 20 20 10

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that DECOR consistently outperforms all baselines across different metrics and datasets. This strongly validates the effectiveness of the proposed method in integrating pretrained semantics and collaborative signals.

The following are the results from Table 1 of the original paper:

Group Method Scientific Instrument Game
R@5 R@10 N@5 N@10 R@5 R@10 N@5 N@10 R@5 R@10 N@5 N@10
Traditional Caser 0.0172 0.0281 0.0107 0.0142 0.0242 0.0392 0.0154 0.0202 0.0346 0.0567 0.0221 0.0291
GRU4Rec 0.0221 0.0353 0.0144 0.0186 0.0345 0.0537 0.0220 0.0281 0.0522 0.0831 0.0337 0.0436
SASRec 0.0256 0.0406 0.0147 0.0195 0.0341 0.0530 0.0217 0.0277 0.0517 0.0821 0.0329 0.0426
BERT4Rec 0.0180 0.0300 0.0113 0.0151 0.0305 0.0483 0.0196 0.0253 0.0453 0.0716 0.0294 0.0378
FDSA 0.0261 0.0391 0.0174 0.0216 0.0364 0.0557 0.0233 0.0295 0.0548 0.0857 0.0353 0.0453
S3Rec 0.0253 0.0410 0.0172 0.0218 0.0340 0.0538 0.0218 0.0282 0.0533 0.0823 0.0351 0.0444
Generative (Static) P5-SID 0.0155 0.0234 0.0103 0.0129 0.0319 0.0437 0.0237 0.0275 0.0480 0.0693 0.0333 0.0401
P5-CID 0.0192 0.0300 0.0123 0.0158 0.0352 0.0507 0.0234 0.0285 0.0497 0.0748 0.0343 0.0424
TIGER 0.0275 0.0431 0.0181 0.0231 0.0368 0.0574 0.0242 0.0308 0.0570 0.0895 0.0370 0.0471
LETTER 0.0276 0.0433 0.0179 0.0230 0.0372 0.0581 0.0243 0.0310 0.0576 0.0901 0.0373 0.0475
CoST 0.0270 0.0426 0.0180 0.0229 0.0366 0.0570 0.0242 0.0306 0.0569 0.0897 0.0379 0.0472
Generative (Dynamic) ETEGRec 0.0272 0.0433 0.0173 0.0225 0.0387 0.0609 0.0251 0.0323 0.0591 0.0925 0.0385 0.0492
DECOR 0.0301* 0.0469* 0.0201* 0.0256* 0.0399* 0.0610 0.0266* 0.0333* 0.0610* 0.0944* 0.0400* 0.0507*

Comparison with Traditional Models: DECOR achieves substantial gains over traditional models like SASRec and FDSA, which highlights the benefits of integrating pretrained semantics and collaborative signals through a generative framework. For example, on the Scientific dataset, DECOR's NDCG@10 (0.0256) significantly surpasses FDSA (0.0216) and SASRec (0.0195).

Comparison with Generative (Static) Models: DECOR consistently outperforms generative models that rely on static tokenization (e.g., TIGER, LETTER, CoST). For instance, on the Scientific dataset, DECOR improves NDCG@10 from TIGER's 0.0231 to 0.0256 (+10.8%). This indicates the efficacy of contextualized token composition in mitigating the suboptimal static tokenization problem.

Comparison with Generative (Dynamic) Models: Even against ETEGRec, a dynamic baseline that jointly optimizes the tokenizer and recommender, DECOR shows superior performance. On Scientific, DECOR yields an NDCG@10 of 0.0256 compared to ETEGRec's 0.0225, a +13.8% improvement. This is a critical finding: it demonstrates that contextually adapting token embeddings without iterative re-tokenization provides a more robust and stable framework, avoiding the potential instability of joint training.

In summary, DECOR's advantages stem from its ability to:

  1. Preserve rich semantic knowledge from pretraining (via Decomposed Embedding Fusion).
  2. Adapt token representations to diverse usage contexts dynamically (via Contextualized Token Composition). These combined benefits lead to state-of-the-art recommendation performance across various datasets.

6.2. Ablation Studies / Parameter Analysis

The following are the results from Table 2 of the original paper:

Model Variant Scientific Instrument Game
R@5 R@10 N@5 N@10 R@5 R@10 N@5 N@10 R@5 R@10 N@5 N@10
Base (TIGER) 0.0275 0.0431 0.0181 0.0231 0.0368 0.0574 0.0242 0.0308 0.0570 0.0895 0.0370 0.0471
w/ Token Comp. 0.0292 0.0459 0.0193 0.0247 0.0385 0.0583 0.0261 0.0321 0.0599 0.0931 0.0394 0.0502
w/ Token Comp. and BOS Queries 0.0300 0.0462 0.0198 0.0248 0.0374 0.0585 0.0246 0.0313 0.0600 0.0932 0.0395 0.0500
w/ Pretrained Only 0.0294 0.0457 0.0192 0.0246 0.0382 0.0595 0.0254 0.0323 0.0602 0.0934 0.0390 0.0501
w/ Pretrained + Token Comp. 0.0298 0.0465 0.0198 0.0250 0.0388 0.0598 0.0257 0.0324 0.0603 0.0932 0.0396 0.0500
DECOR (Full) 0.0301 0.0469 0.0201 0.0256 0.0399 0.0610 0.0266 0.0333 0.0610 0.0944 0.0400 0.0507

The ablation study in Table 2 systematically evaluates the contribution of each component of DECOR, starting from the TIGER model as the baseline (Base (TIGER)).

  • w/ Token Comp.: Adding contextualized token composition alone yields the most significant improvements across all datasets. For example, on Instrument, NDCG@10 improves from 0.0308 (Base) to 0.0321, an increase of approximately 4.2%. This confirms the effectiveness of dynamically refining static token semantics based on context.
  • w/ Token Comp. and BOS Queries: Further incorporating learnable BOS queries (which aid the initial generation step) provides additional gains, most notably on the Scientific dataset (e.g., Recall@5 improves from 0.0292 to 0.0300 and NDCG@5 from 0.0193 to 0.0198). This suggests that allowing context-aware composition for the Beginning-of-Sequence token helps bootstrap the generation process before any item tokens have been produced.
  • w/ Pretrained Only: This variant only incorporates frozen pretrained embeddings (via Decomposed Embedding Fusion without Contextualized Token Composition). It also shows significant improvements over the Base (TIGER) model (e.g., NDCG@10 on Scientific improves from 0.0231 to 0.0246). This highlights the intrinsic value of preserving pretrained semantic knowledge and preventing its erasure during collaborative training.
  • w/ Pretrained + Token Comp.: Combining Decomposed Embedding Fusion (with frozen pretrained embeddings) and Contextualized Token Composition results in even better performance (e.g., NDCG@10 on Scientific improves to 0.0250). This suggests a synergistic effect between preserving semantics and adapting context.
  • DECOR (Full): The full DECOR model, incorporating all components (Decomposed Embedding Fusion, Contextualized Token Composition, and learnable BOS queries), consistently achieves the best performance across all metrics and datasets. For example, NDCG@10 on Scientific reaches 0.0256, demonstrating that each component contributes positively to overall recommendation quality. The slight drop on Instrument when BOS queries are added to token composition alone (e.g., NDCG@10 falls from 0.0321 to 0.0313) suggests that the components can interact subtly in isolation, but the full model finds the best balance.

6.2.1. Hyperparameter Sensitivity

fig 3: Results comparison showing Recall@10 and NDCG@10 on the Scientific, Instrument, and Game datasets as the composition weight (Alpha) and the number of BOS queries are varied.

The figure above (Figure 3 from the original paper) displays the sensitivity of DECOR's performance to two key hyperparameters: the composition weight $\alpha$ and the number of BOS queries. Performance is reported with Recall@10 and NDCG@10.

Effect of Composition Weight α\alpha:

  • Robustness: DECOR exhibits general robustness to a range of α\alpha values.
  • Optimal Range: Moderate values of α\alpha (e.g., 0.4 to 0.55) consistently yield the strongest performance. This indicates a balanced contribution from both the residual link (original static embedding) and the context-aware composition.
  • Performance Degradation: Very high α\alpha values (e.g., 0.7) lead to a sharp performance degradation. This is likely due to undertraining of individual token embeddings. If α\alpha is too high, the model relies excessively on compositional signals from other tokens, potentially neglecting to learn robust individual token embeddings.
  • Dataset Specificity: The performance collapse at high α\alpha is more pronounced on the Instrument and Game datasets. These datasets often have larger interaction spaces, requiring more stable and robust individual token embeddings to generalize across diverse usage contexts. Excessive reliance on composition in such scenarios can delay embedding convergence and lead to training instability.
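
For reference, the trade-off discussed above can be written as a convex blend of the residual (static) token embedding and the attention-weighted composition over the tokens $\mathcal{N}(c)$ in the same codebook layer. This is a hedged reconstruction from the description here, not necessarily the paper's exact equation:

$$
\tilde{\mathbf{e}}_c = (1-\alpha)\,\mathbf{e}_c + \alpha \sum_{c' \in \mathcal{N}(c)} a_{c'}\,\mathbf{e}_{c'}, \qquad \sum_{c' \in \mathcal{N}(c)} a_{c'} = 1,\ \ \alpha \in [0, 1]
$$

Setting $\alpha = 0$ recovers the purely static embedding, while $\alpha \to 1$ makes the representation depend almost entirely on other tokens' embeddings, which is consistent with the undertraining and collapse observed at high $\alpha$.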

Effect of BOS Query Number:

  • Consistent Enhancement: Increasing the number of BOS queries generally enhances performance, especially when moving from 0 (or a very small number) up to 32. For example, on the Scientific dataset, NDCG@10 improves significantly by introducing BOS queries. This suggests that BOS queries facilitate meaningful convergence by enabling the model to better capture diverse user preferences before generating the first item token.
  • Plateau Effect: Beyond a certain point (e.g., 32 or 64 on the Game dataset), performance gains plateau. Larger values (e.g., 128) yield no additional improvement. This suggests that a moderate number of BOS queries is sufficient for the model to capture the necessary contextual information, and adding more becomes redundant.
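
As a concrete illustration, below is a minimal sketch, assuming a PyTorch-style encoder-decoder, of how learnable BOS queries could be prepended to the decoder input; the module and argument names (e.g., `BOSQueries`, `num_bos_queries`) are our own and do not come from the paper's code.

```python
import torch
import torch.nn as nn

class BOSQueries(nn.Module):
    """Learnable query vectors prepended before the first semantic-ID token."""

    def __init__(self, num_bos_queries: int, dim: int):
        super().__init__()
        # One learnable query vector per BOS slot, shared across all users.
        self.queries = nn.Parameter(torch.randn(num_bos_queries, dim) * 0.02)

    def forward(self, decoder_inputs: torch.Tensor) -> torch.Tensor:
        """decoder_inputs: (B, T, dim) -> (B, num_bos_queries + T, dim)."""
        bos = self.queries.unsqueeze(0).expand(decoder_inputs.size(0), -1, -1)
        return torch.cat([bos, decoder_inputs], dim=1)
```

The sensitivity results above suggest that values up to around 32 queries are worthwhile, with larger counts (e.g., 128) adding little.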

6.2.2. Addressing Suboptimal Static Tokenization

fig 4: Schematic t-SNE visualization; the left panel shows the item distribution under static prefix representations, and the right panel shows how the distribution changes with contextually composed prefix representations. Markers distinguish the original prefix (star), other color-coded composed prefixes, and candidate next SID3 tokens.

The figure above (Figure 4 from the original paper) presents t-SNE visualizations to illustrate how DECOR addresses suboptimal static tokenization on the Scientific dataset, focusing on prefix ambiguity.

  • Static Tokenization (Left): Shows that a static tokenization process assigns a single, fixed embedding for a specific prefix (e.g., (1, 276)). This fixed embedding often appears distant or semantically incoherent with valid next-token candidates (represented by red triangles). This visualizes prefix ambiguity, where the same prefix has to represent different item meanings, making it hard for the recommender to disambiguate.

  • DECOR (Right): In contrast, DECOR's contextually composed prefix embeddings (scattered colored points) are adaptively enhanced based on context. These composed embeddings align more coherently and are closer to their respective valid next-token candidates. The scattered nature of the composed prefix embeddings (e.g., multiple distinct points for the prefix (1, 276)) demonstrates that DECOR successfully reinterprets and refines these representations according to the user interaction context, effectively mitigating the ambiguity inherent in static tokenization.
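
For readers who want to reproduce this style of analysis on their own embeddings, a rough sketch using scikit-learn's t-SNE is given below. It mirrors the layout of Figure 4 but is not the authors' plotting code; the function and variable names (e.g., `plot_prefix_ambiguity`) are illustrative, and the input arrays are assumed to be extracted beforehand.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_prefix_ambiguity(static_prefix, composed_prefixes, next_token_cands):
    """static_prefix: (d,); composed_prefixes: (n, d); next_token_cands: (m, d)."""
    points = np.vstack([static_prefix[None, :], composed_prefixes, next_token_cands])
    # Perplexity must be smaller than the number of points; tune as needed.
    coords = TSNE(n_components=2, perplexity=5, init="pca",
                  random_state=0).fit_transform(points)
    n = composed_prefixes.shape[0]
    plt.scatter(*coords[0], marker="*", s=200, label="static prefix")
    plt.scatter(*coords[1:1 + n].T, c="tab:blue", label="contextually composed prefix")
    plt.scatter(*coords[1 + n:].T, c="tab:red", marker="^", label="valid next tokens")
    plt.legend()
    plt.show()
```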

The following are the results from Table 3 of the original paper (active embedding utilization per quantization layer):

| Dataset | Method | Layer-1 | Layer-2 | Layer-3 |
| --- | --- | --- | --- | --- |
| Scientific | TIGER | 26.6% | 99.07% | 99.87% |
| Scientific | Ours | 51.06% | 99.97% | 100.00% |
| Instrument | TIGER | 27.97% | 96.77% | 100.00% |
| Instrument | Ours | 100.00% | 100.00% | 100.00% |
| Game | TIGER | 25.67% | 99.61% | 100.00% |
| Game | Ours | 100.00% | 100.00% | 100.00% |

Table 3 compares the active embedding usage at each quantization layer between TIGER (a static tokenization baseline) and DECOR.

  • Low Utilization in TIGER: TIGER shows low utilization of Layer-1 embeddings, with only 25-28% of embeddings being actively used during tokenizer pretraining across datasets. This indicates that many potential codebook entries are under-utilized or rarely selected, suggesting an inefficient use of the codebook's capacity.
  • DECOR's Improved Utilization: By filtering out tokens with below-uniform composition attention weights, DECOR substantially improves the actively trained embedding coverage. For Instrument and Game datasets, DECOR achieves 100% utilization at Layer-1, meaning all available codebook entries are actively engaged through the contextualized token composition mechanism. For Scientific, it reaches 51.06%.
  • Adaptive Activation: On the Scientific dataset, which has fewer interactions and thus potentially less diverse contextual modeling, DECOR adaptively activates a smaller (though still significantly higher than TIGER) proportion of additional embeddings. This demonstrates DECOR's ability to activate embeddings based on the diversity of contexts, mitigating suboptimal static tokenization with efficient representation usage. This increase in active embedding usage, particularly at the crucial Layer-1 (which captures coarse-grained semantics), directly supports the claim that DECOR's contextualized token composition enhances expressiveness by interpolating beyond original static assignments and leveraging previously unused embedding capacity.
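
As a rough guide to how such utilization numbers could be computed, here is a hedged sketch; the paper's exact measurement protocol may differ, and the function names are our own. Static utilization counts codebook entries ever assigned by the tokenizer, while the composition-based variant counts entries that receive above-uniform attention in at least one context.

```python
import numpy as np

def static_utilization(assigned_codes: np.ndarray, codebook_size: int) -> float:
    """Fraction of codebook entries ever assigned by the static tokenizer."""
    return len(np.unique(assigned_codes)) / codebook_size

def composition_utilization(attn: np.ndarray) -> float:
    """Fraction of entries receiving above-uniform composition attention.

    attn: (num_contexts, codebook_size) row-stochastic attention weights.
    """
    uniform = 1.0 / attn.shape[1]
    return (attn >= uniform).any(axis=0).mean()
```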

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully identifies and addresses two critical limitations in current generative recommender systems: suboptimal static tokenization and the discarding of pretrained semantic knowledge. The proposed DECOR framework offers an elegant solution by integrating decomposed embedding fusion to preserve valuable pretrained semantics and contextualized token composition to dynamically adapt token representations based on user interaction context. Through extensive experiments on three real-world Amazon datasets, DECOR consistently demonstrates superior performance compared to both traditional and state-of-the-art generative baselines, including those attempting dynamic tokenization. The ablation studies confirm the individual and synergistic contributions of DECOR's components, while hyperparameter analysis reveals its robustness within a practical range and the importance of balanced contextual adaptation. Furthermore, visual analysis and embedding utilization metrics validate DECOR's ability to mitigate prefix ambiguity and enhance the active use of codebook embeddings.

7.2. Limitations & Future Work

The paper's abstract and conclusion do not explicitly state specific limitations or direct avenues for future work. However, some implicit points can be inferred:

  • Computational Cost: While a complexity analysis is provided, it primarily focuses on the scaling of additional components. A more detailed empirical analysis of the actual computational overhead during training and inference (e.g., in terms of wall-clock time or GPU memory usage compared to baselines) could offer further insights.
  • Generalizability of Hyperparameters: While the paper shows robustness to alpha and BOS query number, the optimal values still vary across datasets. Future work might explore adaptive mechanisms for these hyperparameters.
  • Complexity of Context Modeling: The current context vector computation uses attention pooling over historical tokens. More sophisticated context modeling techniques (e.g., incorporating temporal dynamics, user long-term preferences, or richer side information) could further enhance token composition.
  • Beyond Two-Stage: Although DECOR effectively mitigates the issues within the two-stage paradigm, the fundamental objective misalignment between tokenizer pretraining and recommender training still exists at a high level. Exploring truly end-to-end solutions that inherently learn item representations optimized for recommendation, without a separate tokenization pretraining stage, could be a long-term future direction.

7.3. Personal Insights & Critique

This paper presents a highly insightful and practical solution to fundamental challenges in generative recommendation. The explicit identification of suboptimal static tokenization and discarded pretrained semantics is a crucial contribution, as these issues are often overlooked or only partially addressed in the literature.

Inspirations:

  • Synergy of Pretrained and Collaborative Signals: DECOR provides a compelling blueprint for how to effectively combine rich, general semantic knowledge (from pretrained models) with specific, dynamic collaborative signals (from user interactions). This decomposed fusion approach could be transferable to other domains where foundational models provide strong initial representations, but task-specific nuances require adaptation.
  • Stable Contextual Adaptation: The contextualized token composition mechanism is particularly inspiring. Instead of complex and potentially unstable joint training of tokenizers, DECOR achieves dynamic adaptation by modifying the interpretation of static tokens through a soft composition. This embedding-level adaptation is a more robust and computationally efficient way to introduce contextuality without altering the underlying discrete vocabulary. This idea of contextual re-interpretation could be applied to other areas where discrete representations need to be adapted to specific situations without retraining the entire representation generation pipeline.
  • Efficient Embedding Utilization: The observation that many codebook entries are under-utilized in static tokenization and that DECOR can significantly improve this utilization is a valuable insight. It suggests that even with fixed discrete vocabularies, intelligent contextualization can unlock greater representational capacity.

Critique:

  • Depth of Decomposed Embedding Fusion: While the concept of frozen pretrained embeddings and learnable collaborative embeddings is clear, the interaction within the fusion layer (Equation 11) is relatively simple (a linear projection after concatenation). Exploring more sophisticated fusion mechanisms, such as gating or attention-based fusion between the two modalities, could potentially yield further improvements and provide finer-grained control over how semantic and collaborative signals are blended (a minimal sketch of such a gated fusion follows this list).

  • Interpretability of Composed Embeddings: While the t-SNE plots show the visual effect of contextual composition, a deeper analysis into what specific semantic or collaborative aspects are being highlighted or suppressed in the composed embeddings for different contexts could enhance interpretability. For instance, can we identify distinct contextual "dimensions" that are activated for different usage scenarios of an item?

  • Scalability to Extremely Large Vocabularies: The Contextualized Token Composition involves attention over all tokens in a codebook layer (N(c)\mathcal{N}(c)). While this is stated as constant complexity (O(Kd2)\mathcal{O}(K \cdot d^2)), if KK (codebook size) becomes very large in future applications, this constant factor could become a bottleneck. Investigating methods to sample or dynamically select a subset of candidate tokens for composition could be a future consideration.

  • Practical Deployment Considerations: The paper provides a complexity analysis, but a direct comparison of training time, inference latency, and memory footprint between DECOR and its strong baselines (ETEGRec, TIGER) would be beneficial for practical deployment decisions, especially given the "negligible" claim.
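
To make the first critique point concrete, here is a minimal sketch of a gated fusion layer that could replace concatenation followed by a linear projection; this is a speculative alternative, not something proposed or evaluated in the paper, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-dimension gate between frozen pretrained and learnable collaborative embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, pretrained: torch.Tensor, collab: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([pretrained, collab], dim=-1))  # gate values in [0, 1]
        return g * pretrained + (1 - g) * collab
```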

Overall, DECOR offers a significant advancement in generative recommendation by effectively addressing critical representational challenges. Its principled approach to leveraging and adapting token semantics opens up promising avenues for more accurate and context-aware recommendation systems.
