Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation
TL;DR Summary
The DECOR framework addresses limitations in generative recommenders, enhancing token adaptability while preserving pretrained semantics. It employs contextualized token composition and decomposed embedding fusion, demonstrating superior performance on real datasets compared to state-of-the-art baselines.
Abstract
Recent advances in generative recommenders adopt a two-stage paradigm: items are first tokenized into semantic IDs using a pretrained tokenizer, and then large language models (LLMs) are trained to generate the next item via sequence-to-sequence modeling. However, these two stages are optimized for different objectives: semantic reconstruction during tokenizer pretraining versus user interaction modeling during recommender training. This objective misalignment leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to reflect diverse usage contexts; and (ii) discarded pretrained semantics, where pretrained knowledge - typically from language model embeddings - is overwritten during recommender training on user interactions. To address these limitations, we propose to learn DEcomposed COntextual Token Representations (DECOR), a unified framework that preserves pretrained semantics while enhancing the adaptability of token embeddings. DECOR introduces contextualized token composition to refine token embeddings based on user interaction context, and decomposed embedding fusion that integrates pretrained codebook embeddings with newly learned collaborative embeddings. Experiments on three real-world datasets demonstrate that DECOR consistently outperforms state-of-the-art baselines in recommendation performance. Our code will be made available upon publication.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation." This indicates a focus on improving how items are represented as tokens in generative recommender systems by combining existing knowledge and dynamic context.
1.2. Authors
- Yifan Liu, Yaokun Liu, Zelin Li, Zhenrui Yue, Gyusek Lee, Ruichen Yao, Dong Wang (University of Illinois Urbana-Champaign, Champaign, Illinois, USA)
- Yang Zhang (Miami University, Oxford, Ohio, USA)
The authors are primarily affiliated with academic institutions, suggesting a research-oriented background in areas such as recommender systems, natural language processing, and machine learning.
1.3. Journal/Conference
The paper is published as a preprint, indicated by the provided arXiv link. While the abstract states "Our code will be made available upon publication," and the publication date is given as "2025-08-22T18:50:38.000Z," it is currently in a preprint stage and not yet formally published in a specific journal or conference proceedings. arXiv is a reputable open-access archive for preprints of scientific papers, particularly in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Its influence is significant in disseminating cutting-edge research before formal peer review.
1.4. Publication Year
The declared publication date (UTC) is 2025-08-22T18:50:38.000Z.
1.5. Abstract
Recent advancements in generative recommenders utilize a two-stage approach: first, items are converted into semantic IDs using a pretrained tokenizer, and then Large Language Models (LLMs) predict the next item. However, the paper identifies a fundamental misalignment: the tokenizer is optimized for semantic reconstruction, while the recommender is optimized for user interaction modeling. This leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to capture diverse usage contexts, and (ii) discarded pretrained semantics, where valuable pretrained knowledge is lost during recommender training.
To address these issues, the authors propose DECOR (DEcomposed COntextual Token Representations), a unified framework designed to preserve pretrained semantic knowledge while enhancing the adaptability of token embeddings. DECOR introduces two main components: contextualized token composition, which refines token embeddings based on user interaction context, and decomposed embedding fusion, which integrates frozen pretrained codebook embeddings with newly learned collaborative embeddings. Experimental results on three real-world datasets demonstrate that DECOR consistently surpasses state-of-the-art baselines in recommendation performance.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2509.10468
- PDF Link: https://arxiv.org/pdf/2509.10468v1.pdf The paper is currently available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve lies within the two-stage paradigm of recent generative recommenders. In this paradigm, items are first transformed into discrete semantic IDs using a pretrained tokenizer, and then Large Language Models (LLMs) are trained to predict the next item in a sequence. The authors identify a crucial objective misalignment between these two stages:
- The tokenizer is optimized for semantic reconstruction, meaning its goal is to accurately represent item metadata (like descriptions) in a compressed, tokenized form.
- The recommender (the LLM) is optimized for user interaction modeling, aiming to predict user preferences and sequential behavior.

This misalignment leads to two significant limitations:

- (i) Suboptimal Static Tokenization (C1): The item tokens, once generated by the tokenizer, are static and fixed. This static assignment fails to reflect the diverse and dynamic usage contexts in which an item might appear. For example, a "noise-canceling headphone" might be used for office work, workouts, or sleep, each implying different contextual semantics. Static tokens cannot adapt to these nuances, leading to ambiguous representations and reduced representation efficiency.
- (ii) Discarded Pretrained Semantics (C2): The rich pretrained knowledge (e.g., world knowledge embedded in LLM embeddings used to create semantic representations) is effectively discarded after tokenization. The token embeddings used by the recommender are typically randomly initialized and trained solely on user interaction data. This process overwrites or ignores the valuable semantic information captured during the initial pretraining, limiting the recommender's ability to ground item representations in external knowledge, especially for sparse items (items with few interactions).

This problem is important because generative recommenders are a promising new paradigm, leveraging the power of LLMs for sequence generation. However, these inherent limitations hinder their full potential, leading to less accurate and less context-aware recommendations. The paper's entry point is to bridge this objective misalignment by proposing a unified framework that simultaneously preserves pretrained semantics and introduces contextual adaptability to token representations.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of generative recommendation:
- Identification and Characterization of Limitations: The authors are the first to explicitly identify and analyze the two key limitations in existing generative recommenders: suboptimal static tokenization (C1) and discarded pretrained semantics (C2). They clearly characterize how static tokenization restricts representational flexibility.
- Introduction of DECOR Framework: They propose DECOR (DEcomposed COntextual Token Representations), a novel unified framework. DECOR's design directly addresses the identified limitations through two main components:
  - Contextualized Token Composition: This mechanism dynamically refines token embeddings based on the user's interaction context, allowing for context-dependent interpretation of tokens without retraining the tokenizer. This tackles the suboptimal static tokenization problem.
  - Decomposed Embedding Fusion: This integrates frozen pretrained codebook embeddings (preserving initial semantics) with newly learned collaborative embeddings (capturing interaction patterns). This prevents the discarded pretrained semantics problem by maintaining a balance between semantic knowledge and collaborative signals.
- Demonstrated Superior Performance: Extensive experiments conducted on three real-world Amazon Review datasets (Scientific, Instrument, Game) consistently show that DECOR outperforms both classical sequential recommenders and state-of-the-art generative recommender baselines, including dynamic baselines like ETEGRec, highlighting DECOR's robust and effective approach.
- Enhanced Embedding Utilization: The paper demonstrates that DECOR significantly improves the utilization of active embeddings, especially at the first quantization layer, by enabling previously under-utilized codebook entries to contribute through contextual composition.
The key findings are that by explicitly addressing the objective misalignment through these two novel components, DECOR achieves more robust, context-aware, and semantically rich item representations, leading to improved recommendation accuracy.
3. Prerequisite Knowledge & Related Work
This section provides the foundational concepts and summarizes relevant prior work necessary to understand the DECOR framework.
3.1. Foundational Concepts
3.1.1. Sequential Recommendation
Sequential recommendation is a sub-field of recommender systems where the goal is to predict the next item a user will interact with, given their past sequence of interactions. Unlike traditional collaborative filtering which might recommend items based on overall preferences, sequential recommenders focus on the order and temporal dependencies of user behavior. For example, if a user watched "Star Wars Episode IV," a sequential recommender might suggest "Star Wars Episode V" next.
3.1.2. Generative Recommendation
Generative recommendation is a new paradigm that formulates the sequential recommendation task as a sequence-to-sequence (seq2seq) generation problem, often leveraging Large Language Models (LLMs). Instead of ranking a predefined list of items, these models generate the identifier or a sequence of tokens representing the next item.
The typical two-stage pipeline in generative recommenders involves:
- Item Tokenization: Converting items (e.g., products, movies) into discrete tokens or semantic IDs. This often involves using item metadata (like title, description) and a specialized tokenizer.
- Recommender Training: Training an LLM (e.g., T5) to autoregressively generate the sequence of tokens corresponding to the next recommended item, based on the user's past tokenized interaction sequence.
3.1.3. Large Language Models (LLMs)
Large Language Models (LLMs) are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. Their strength lies in sequence-to-sequence modeling and capturing long-range dependencies. In generative recommendation, LLMs are adapted to treat item tokens as "words" and user interaction sequences as "sentences," generating new "words" (item tokens) to complete the "sentence" (recommend the next item).
3.1.4. Sequence-to-Sequence (Seq2Seq) Modeling
Sequence-to-sequence (Seq2Seq) models are a class of neural networks capable of transforming input sequences into output sequences. They typically consist of an encoder that processes the input sequence and a decoder that generates the output sequence. In generative recommendation, the input sequence is a user's past interaction history (represented by item tokens), and the output sequence is the token representation of the next item to be recommended.
3.1.5. Embeddings and Semantic Embeddings
An embedding is a dense vector representation of a discrete variable (like a word, an item, or a token) in a continuous vector space. The goal is to map similar items closer together in this space. Semantic embeddings specifically refer to embeddings that capture the meaning or semantic properties of an item, often derived from item metadata (e.g., descriptions, titles) using pretrained language models or text encoders (e.g., Sentence-T5). These embeddings are rich in external knowledge.
3.1.6. Vector Quantization (VQ) and Residual Quantization VAE (RQ-VAE)
Vector Quantization (VQ) is a method used to map high-dimensional continuous vectors (like semantic embeddings) to discrete codes (tokens). It involves a codebook or code vector, which is a set of learned prototype vectors. During quantization, an input vector is replaced by the nearest vector in the codebook.
Residual Quantization VAE (RQ-VAE) is an advanced form of vector quantization that quantizes vectors in a hierarchical or iterative manner. Instead of quantizing the entire vector at once, it quantizes the residual (the part not captured by the previous quantization step) in subsequent stages. This allows for a more fine-grained and accurate reconstruction of the original vector using a sequence of discrete codes. Each stage has its own codebook. The output of an RQ-VAE for an item is a sequence of indices, where each index refers to a vector in a specific codebook at a particular quantization level. These indices form the semantic ID of an item.
3.1.7. Attention Mechanism
The attention mechanism is a key component of Transformer models. It allows a neural network to focus on specific parts of its input sequence when processing or generating output. Specifically, it computes a weighted sum of input elements, where the weights are determined by a query (what we're looking for), keys (what's available), and values (the actual information). The weights indicate the "relevance" of each input element to the current processing step. This mechanism is used in DECOR for Context Vector Computation to aggregate historical context.
The general formula for Scaled Dot-Product Attention is:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices representing the input sequences.
- $d_k$ is the dimension of the keys, used for scaling to prevent very large dot products that push the softmax function into regions with tiny gradients.
- $QK^T$ computes the dot-product similarity between queries and keys.
- $\mathrm{softmax}(\cdot)$ normalizes the scores to obtain attention weights.
- The result is a weighted sum of the values.
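To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and variable names (n_q, n_k, d_k) are illustrative and not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # dot-product similarity, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # softmax over the keys
    return weights @ V                            # weighted sum of the values

# toy usage: 2 queries attending over 5 key/value pairs
Q = np.random.randn(2, 8); K = np.random.randn(5, 8); V = np.random.randn(5, 16)
out = scaled_dot_product_attention(Q, K, V)       # shape (2, 16)
```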
3.1.8. Multi-Layer Perceptron (MLP)
A Multi-Layer Perceptron (MLP) is a type of artificial neural network composed of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node (except for the input nodes) is a neuron that uses a nonlinear activation function. MLPs are commonly used for tasks like classification and regression, and in DECOR, they are used for transforming vectors and in attention mechanisms.
3.2. Previous Works
The paper discusses several categories of related work:
3.2.1. Generative Recommenders
- P5 (Geng et al. 2022): An early work that fine-tunes a pretrained LLM (e.g., T5) to handle multiple recommendation tasks within a unified generative framework, using natural language prompts.
- TIGER (Rajput et al. 2023): This is a key baseline and the foundation for DECOR. TIGER introduces a discrete semantic tokenization scheme where item metadata is quantized into semantic IDs using an RQ-VAE. A pretrained T5 model then autoregressively generates these item token sequences.
- EAGER (Wang et al. 2024c): Builds on TIGER, proposing a two-stream generation framework with a shared encoder and separate decoders to jointly capture user behavior and item content semantics.
- OneRec (Deng et al. 2025): Extends the generative approach by unifying retrieval and ranking into a single iterative generation process.
- DECOR's relation: DECOR investigates the limitations of the tokenizer-pretraining-then-recommender-training process, specifically suboptimal static tokenization and loss of pretrained semantics, which are not fully addressed by these previous generative models.
3.2.2. Item Token Representation
This section highlights the evolution of how items are represented as discrete tokens for language models.
- Pseudo ID-based methods: Assign a unique token identifier to each item without incorporating semantic structure (e.g., Chu et al. 2023, Geng et al. 2022, Hua et al. 2023, Wang et al. 2024b). P5-SID (Hua et al. 2023) is an example, decomposing numeric item IDs into ordered subtokens to improve locality.
- Text-based methods: Utilize item metadata (titles, descriptions) to construct natural language prompts (e.g., Bao et al. 2025, Dai et al. 2023, Li et al. 2023, Liao et al. 2023, Zhang et al. 2025, 2021). While expressive, they can incur high inference costs and potentially lead to hallucinated content.
- Semantic Indexing Schemes: These are recent approaches that balance semantic fidelity and token efficiency by extracting compact token sequences through quantizing pretrained text embeddings of item descriptions (e.g., Rajput et al. 2023 (TIGER), Wang et al. 2024a (LETTER), Zheng et al. 2024, Zhu et al. 2024 (CoST), Hou et al. 2023). These methods, including TIGER, typically use static tokenization, where item representations are fixed.
- Dynamic Tokenization Approaches:
  - ETEGRec (Liu et al. 2025): Jointly trains the item tokenizer and recommender model in an end-to-end fashion. This aims to overcome static tokenization but can introduce training instability as token assignments evolve.
  - ED^2 (Yin et al. 2025): Proposes a duo-index framework with a multi-grained token regulator and instruction tuning with user-level metadata.
- DECOR's relation: DECOR specifically targets the static tokenization limitation of semantic indexers. Unlike ETEGRec, DECOR avoids re-tokenization and the associated instability by adapting token embeddings contextually during generation, rather than dynamically updating the semantic ID tokens themselves. It also does not rely on user-specific information like ED^2.

3.3. Technological Evolution
The field of sequential recommendation has evolved from traditional ID-based models (like GRU4Rec, SASRec, BERT4Rec, Caser, FDSA, S3Rec) that learn representations for item IDs, to generative recommenders that leverage the power of LLMs. This shift brings the benefit of LLMs' strong sequence modeling capabilities and their ability to incorporate rich semantic information from item metadata. However, it also introduces new challenges:

1. Bridging ID-based and semantic-rich representations: LLMs are text-based, so items need to be converted into a "language" they can understand. This led to item tokenization schemes, moving from simple pseudo IDs to semantic IDs derived from item descriptions.
2. Maintaining semantic fidelity during tokenization: Early tokenization efforts might have lost valuable semantic information. The introduction of RQ-VAE based semantic indexers (as in TIGER) aimed to preserve it, but their static nature still posed a problem.
3. Aligning objectives: The core problem DECOR addresses is the objective misalignment. The tokenizer is optimized for reconstruction, while the recommender is optimized for user interaction. This creates a gap where the static tokenization may not be optimal for downstream recommendation, and the LLM may discard pretrained semantic knowledge by overwriting token embeddings during training on user interactions alone.

DECOR's work fits into this timeline by offering a solution to this objective misalignment, building on semantic indexing but enhancing it with contextual adaptability and semantic preservation during the recommender training phase, without the instability of joint tokenizer-recommender optimization.

3.4. Differentiation Analysis
Compared to the main methods in related work, DECOR's core differences and innovations are:

- Compared to static semantic tokenization methods (e.g., TIGER, LETTER, CoST):
  - Innovation: DECOR introduces Contextualized Token Composition. While these baselines use semantic IDs from RQ-VAE, their token embeddings are static. DECOR dynamically refines these embeddings based on the user interaction context, overcoming the suboptimal static tokenization problem that limits the expressiveness of fixed tokens.
  - Innovation: DECOR employs Decomposed Embedding Fusion to explicitly integrate frozen pretrained codebook embeddings with learnable collaborative embeddings. This directly addresses the discarded pretrained semantics issue, which these static methods suffer from by typically re-initializing and training token embeddings solely on interaction data.
- Compared to dynamic tokenization methods (e.g., ETEGRec, ED^2):
  - Innovation: DECOR achieves contextual adaptability without iteratively re-tokenizing items or jointly optimizing the tokenizer and recommender. ETEGRec's joint training can introduce instability and computational overhead. DECOR adapts the interpretation of static tokens via composition, which is more stable and efficient.
  - Innovation: DECOR's contextualized token composition does not require user-specific metadata during inference, unlike some aspects of ED^2, making it more generally applicable.
- Unified Framework: DECOR provides a unified solution that addresses both suboptimal static tokenization and discarded pretrained semantics simultaneously, which is a more comprehensive approach than methods that tackle only one aspect or introduce instability.

In essence, DECOR distinguishes itself by providing a stable yet highly adaptive token representation learning framework that intelligently leverages both pretrained semantic knowledge and collaborative interaction patterns within the generative recommendation paradigm.
4. Methodology
4.1. Principles
The core idea behind DECOR is to resolve the objective misalignment between item tokenization and recommender training in generative recommendation. The theoretical basis is that item representations should be rich in pretrained semantic knowledge (from metadata) and also adapt dynamically to diverse user interaction contexts (collaborative signals). DECOR achieves this by:
- Preserving Pretrained Semantics: It prevents the loss of valuable semantic information by freezing the original codebook embeddings from the RQ-VAE tokenizer and fusing them with newly learned collaborative embeddings.
- Enhancing Contextual Adaptability: It overcomes suboptimal static tokenization by dynamically refining token embeddings based on the immediate user interaction context through a soft composition mechanism, allowing the recommender to interpret tokens in a context-dependent manner without changing the underlying static token IDs.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation
The paper frames the task as sequential recommendation within a generative setting.
Given a set of items and a user's chronological interaction sequence, the objective is to predict the next item the user will interact with.
This task involves two main steps:
- Item Tokenization: Each item is mapped to a discrete token sequence $\mathbf{c}_i = (c_{i,1}, \ldots, c_{i,L})$, where $L$ is the sequence length (number of quantization levels) and each token is drawn from a predefined token set (the codebook indices). The user's interaction sequence is thus transformed into a tokenized sequence $\mathbf{X}^u$.
- Autoregressive Generation: Given $\mathbf{X}^u$, the model autoregressively generates the token sequence for the next target item by factorizing the conditional probability (a minimal sketch of this factorization follows this list):
$
p(\mathbf{c}_i \mid \mathbf{X}^u) = \prod_{l = 1}^{L} p(c_{i,l} \mid \mathbf{X}^u, c_{i,1}, \ldots, c_{i,l-1}) \quad (1)
$
Where:
  - $p(\mathbf{c}_i \mid \mathbf{X}^u)$ is the probability of generating the token sequence for the target item $i$ given the user's historical tokenized sequence $\mathbf{X}^u$.
  - $L$ is the length of the token sequence for an item (number of quantization levels).
  - $c_{i,l}$ is the $l$-th token in the sequence for item $i$.
  - The product indicates that each token is generated conditioned on the historical sequence and all previously generated tokens for the current item.
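As an illustration of Equation (1), the following hedged Python sketch sums per-token conditional log-probabilities over a semantic ID. The `next_token_logits` callable is a hypothetical stand-in for the decoder; the toy uniform model only makes the snippet runnable.

```python
import torch

def sequence_log_prob(next_token_logits, history, target_tokens):
    """Sum log p(c_{i,l} | X^u, c_{i,<l}) over the L tokens of the item's semantic ID."""
    log_p, prefix = 0.0, []
    for t in target_tokens:
        logits = next_token_logits(history, prefix)            # decoder output for this step
        log_p = log_p + torch.log_softmax(logits, dim=-1)[t]   # add the conditional log-prob
        prefix.append(t)                                       # condition on generated tokens
    return log_p

# toy usage: a uniform dummy "decoder" over a 256-token vocabulary
dummy = lambda history, prefix: torch.zeros(256)
print(sequence_log_prob(dummy, history=[1, 2, 3], target_tokens=[5, 17, 42]))
```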
4.2.2. Semantic Indexer Pretraining
Following prior work (Rajput et al. 2023), DECOR utilizes an RQ-VAE (Residual Quantization Variational Autoencoder) as its semantic indexer. The RQ-VAE consists of:
- A pair of MLP encoder-decoder networks.
- A sequence of codebooks $\{\mathcal{C}_l\}_{l=1}^{L}$, where $L$ is the number of quantization levels. Each $\mathcal{C}_l$ contains a set of prototype vectors (code vectors).

The process of hierarchical quantization for a pretrained text embedding $\mathbf{x}$ of an item's metadata operates through residual steps:
$
\begin{aligned}
\mathbf{z}_0 &= \operatorname{Encoder}(\mathbf{x}) \\
\mathbf{z}_l &= \mathbf{z}_{l-1} - q_l(\mathbf{z}_{l-1}), \quad l = 1, \ldots, L \\
q_l(\mathbf{z}_{l-1}) &= \arg\min_{\mathbf{e} \in \mathcal{C}_l} \left\| \mathbf{z}_{l-1} - \mathbf{e} \right\|_2
\end{aligned} \quad (4)
$
Where:
- $\mathbf{x}$ is the pretrained text embedding of an item's metadata.
- $\operatorname{Encoder}(\cdot)$ is the MLP encoder that maps the input embedding to a latent space.
- $\mathbf{z}_0$ is the initial latent representation from the encoder.
- $\mathbf{z}_l$ is the residual vector at quantization level $l$.
- $q_l(\cdot)$ is the quantization function for level $l$, which performs a nearest-neighbor lookup in the codebook $\mathcal{C}_l$ using Euclidean distance to find the closest code vector.
- $\mathcal{C}_l$ is the codebook for the $l$-th quantization level.

The semantic ID of an item is the concatenation of the indices of the selected code vectors from each level, plus a final token for item collision handling (not explicitly shown in the formula but mentioned in the text).

To reconstruct the original semantic embedding, the quantized representation is formed by summing the selected codebook vectors:
$
\tilde{\mathbf{r}} = \sum_{l = 1}^{L} \mathbf{e}_{c_l}^{l}
$
Where $\mathbf{e}_{c_l}^{l}$ is the codebook vector selected by the semantic ID index $c_l$ in codebook $\mathcal{C}_l$. This is then decoded back to the semantic space:
$
\tilde{\mathbf{z}} = \operatorname{Decoder}(\tilde{\mathbf{r}}) \quad (5)
$
Where $\operatorname{Decoder}(\cdot)$ is the MLP decoder.
The entire RQ-VAE semantic indexer is optimized using a loss function $\mathcal{L}_{\mathrm{SQ}}$, which includes a reconstruction term and a codebook update term (see the code sketch below):
$
\begin{aligned}
\mathcal{L}_{\mathrm{SQ}} &= \mathcal{L}_{\mathrm{RECON}} + \mathcal{L}_{\mathrm{RQ}} \\
\mathcal{L}_{\mathrm{RECON}} &= \left\| \mathbf{x} - \tilde{\mathbf{x}} \right\|_{2}^{2} \quad \text{(reconstruction loss)} \\
\mathcal{L}_{\mathrm{RQ}} &= \sum_{l = 1}^{L}\left\| \mathrm{sg}[\mathbf{z}_{l-1}] - \mathbf{e}_{c_i}^{l}\right\|_{2}^{2} + \beta\left\| \mathbf{z}_{l-1} - \mathrm{sg}[\mathbf{e}_{c_i}^{l}]\right\|_{2}^{2}
\end{aligned} \quad (7)
$
Where:
- $\mathcal{L}_{\mathrm{SQ}}$ is the overall loss for the semantic quantizer (RQ-VAE).
- $\mathcal{L}_{\mathrm{RECON}}$ is the reconstruction loss, measuring the squared L2 norm (Euclidean distance) between the original input embedding $\mathbf{x}$ and the reconstructed embedding $\tilde{\mathbf{x}}$.
- $\mathcal{L}_{\mathrm{RQ}}$ is the codebook loss, which has two parts:
  - The first term is the codebook update term: it encourages the selected codebook vector $\mathbf{e}_{c_i}^{l}$ to move towards the encoder output $\mathbf{z}_{l-1}$. The stop-gradient $\mathrm{sg}[\cdot]$ prevents gradients from flowing back through the encoder, so only the codebook vectors are updated by this term.
  - The second term is the commitment loss: it encourages the encoder output $\mathbf{z}_{l-1}$ to move towards the selected codebook vector. The stop-gradient on the codebook vector prevents gradients from flowing through the codebook, allowing the encoder to be updated based on the selected code.
- $\beta$ is a hyperparameter that balances the two parts of the codebook loss.
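The following is a minimal PyTorch sketch of the residual quantization in Equation (4) together with the loss terms of Equation (7). The codebook sizes, dimensions, and the omission of the encoder/decoder MLPs and the straight-through estimator are simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def residual_quantize(z0, codebooks, beta=0.25):
    """z0: (B, d) encoder outputs; codebooks: list of (K, d) tensors, one per level."""
    residual, indices, quantized, loss_rq = z0, [], torch.zeros_like(z0), 0.0
    for C in codebooks:
        dists = torch.cdist(residual, C)               # nearest-neighbor lookup (Euclidean)
        idx = dists.argmin(dim=-1)                     # selected code indices at this level
        e = C[idx]                                     # selected code vectors
        # codebook update term + beta-weighted commitment term, via stop-gradients (Eq. 7)
        loss_rq = loss_rq + F.mse_loss(residual.detach(), e) \
                          + beta * F.mse_loss(residual, e.detach())
        quantized = quantized + e                      # accumulate \tilde{r} = sum of codes
        residual = residual - e                        # pass the residual to the next level
        indices.append(idx)
    return torch.stack(indices, dim=-1), quantized, loss_rq

# toy usage: 3 levels, codebook size 256, latent dim 32
codebooks = [torch.randn(256, 32) for _ in range(3)]
ids, r_tilde, loss = residual_quantize(torch.randn(4, 32), codebooks)
```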
4.2.3. Decomposed Embedding Fusion
To address C2: discarded pretrained semantics, DECOR introduces a Decomposed Embedding Fusion module. This module dynamically integrates pretrained semantic and newly learned collaborative embedding representations, treating them as complementary information channels.
- Pretrained Semantic Embedding Space: This space leverages the frozen codebooks from the pretrained RQ-VAE tokenizer. Each codebook provides token-level embeddings learned to reconstruct pretrained embeddings and preserves the hierarchical structure from multi-stage tokenization.
$
E_{\mathrm{pre}} = \{\mathcal{C}_i\}_{i = 1}^{M} \in \mathbb{R}^{K \cdot M \times d} \quad (9)
$
Where:
  - $E_{\mathrm{pre}}$ represents the overall pretrained semantic embedding space.
  - $M$ is the number of quantization layers (codebooks).
  - $\mathcal{C}_i$ is the $i$-th frozen codebook, containing $K$ embeddings (code vectors) each of dimension $d$. $K$ is the size of each codebook.
  - $d$ is the embedding dimension, chosen to be compatible with the downstream LLM recommender.
  - For a given token index, its pretrained semantic embedding $e_{\mathrm{pre}}$ is retrieved by a direct lookup into the corresponding frozen codebook.
- Collaborative Embedding Space: This is a separate, learnable embedding matrix. These embeddings are initialized randomly and trained from scratch solely through the autoregressive generation objective of the recommender. This allows them to encode sequential patterns, co-occurrence, and user preference dynamics based on user interaction data. Notably, both embedding spaces share the same dimensionality for each layer, ensuring seamless alignment for fusion.
- Modality-Aware Fusion: First, both types of embeddings are projected into a shared latent space and layer normalized:
$
\hat{e}_{\mathrm{pre}} = \mathrm{LN}(W_{\mathrm{pre}} e_{\mathrm{pre}}), \quad \hat{e}_{\mathrm{collab}} = \mathrm{LN}(W_{\mathrm{collab}} e_{\mathrm{collab}}) \quad (10)
$
Where:
  - $e_{\mathrm{pre}}, e_{\mathrm{collab}}$ are the token embeddings from the pretrained and collaborative modalities, respectively.
  - $W_{\mathrm{pre}}, W_{\mathrm{collab}}$ are learnable projection matrices that map the embeddings to a potentially different dimension $d'$.
  - $\mathrm{LN}(\cdot)$ denotes Layer Normalization, which helps stabilize training.
  - $\hat{e}_{\mathrm{pre}}, \hat{e}_{\mathrm{collab}}$ are the projected and normalized embeddings.

Next, these normalized embeddings are concatenated and passed through a fusion layer to map them back to the original latent space dimension $d$:
$
e_{\mathrm{fused}} = W_{\mathrm{fuse}}[\hat{e}_{\mathrm{pre}} \parallel \hat{e}_{\mathrm{collab}}] \in \mathbb{R}^d \quad (11)
$
Where:
  - $W_{\mathrm{fuse}}$ is a learnable fusion matrix.
  - $\parallel$ denotes vector concatenation, resulting in a vector of dimension $2d'$.
  - $e_{\mathrm{fused}}$ is the final fused embedding, integrating both pretrained semantic and collaborative signals. This fused embedding is computed dynamically for each input token sequence during the forward pass (a code sketch appears after the figure description below).
The image is a schematic of the two stages of the generative recommendation pipeline: stage one is item tokenizer pretraining and stage two is recommender training. It illustrates encoding and weight sharing during tokenizer pretraining, and how contextualized representations are built during recommender training. Key flows such as "Decomposed Embedding Fusion" and "Contextualized Token Composition" show how pretrained semantics are combined with newly learned collaborative embeddings.
The figure above (Figure 2 from the original paper) visually represents the Decomposed Embedding Fusion (top left) and Contextualized Token Composition (top right) components. It shows how pretrained embeddings from the RQ-VAE are frozen, how collaborative embeddings are learned, and how they are fused to create fused embeddings. These fused embeddings then serve as input for the contextualized token composition to dynamically refine representations.
4.2.4. Contextualized Token Composition
To address C1: suboptimal static tokenization, DECOR introduces a Contextualized Token Composition mechanism. Instead of updating the semantic ID tokens themselves (which can be unstable), this component refines the token embeddings dynamically based on the usage context.
- Overall Idea: Given a target item with its cached semantic ID sequence and a historical context sequence, a context-aware embedding for each token $c$ is computed using a composition function $\Phi$:
$
\tilde{e}_c = \Phi\big(c, u_{c}, \{e_{c^{\prime}}\}_{c^{\prime} \in \mathcal{N}(c)}\big) \quad (12)
$
Where:
  - $\tilde{e}_c$ is the refined context-aware embedding for token $c$.
  - $c$ is the current token for which the embedding is being refined.
  - $u_c$ is a context vector derived from the user's interaction history.
  - $\mathcal{N}(c)$ is a set of candidate composition tokens for $c$. Here, $\mathcal{N}(c)$ is chosen to be all tokens from the same RQ-VAE codebook layer as token $c$. This allows the model to leverage the entire codebook's capacity, including under-utilized entries.
- Context Vector Computation: To obtain the context vector $u_c$ used for generating a specific target token $c$, the fused embeddings of the historical context sequence are aggregated. Each $h_{\ell}$ is the fused embedding of a historical token obtained from the Decomposed Embedding Fusion module. An attention-based pooling mechanism is applied:
$
u_c = \mathit{AttnPool}(\mathbf{h}_c) = \mathrm{MLP}_{\mathrm{ctx}}\left(\sum_{\ell = 1}^{L} \alpha_{\ell} \cdot h_{\ell}\right) \quad (13)
$
The attention weights are computed as:
$
s_{\ell} = \mathbf{w}^{\top}\tanh\left(\mathbf{W} h_{\ell} + \mathbf{b}\right), \qquad \alpha_{\ell} = \frac{\exp(s_{\ell})}{\sum_{m = 1}^{L}\exp(s_{m})} \quad (14)
$
Where:
  - $h_{\ell}$ is the fused embedding of a historical token.
  - $\mathbf{W}$, $\mathbf{w}$, and $\mathbf{b}$ are learnable parameters of the attention network.
  - $s_{\ell}$ is the attention score for $h_{\ell}$.
  - $\tanh(\cdot)$ is the hyperbolic tangent activation function.
  - $\mathrm{MLP}_{\mathrm{ctx}}$ is a multi-layer perceptron that transforms the weighted sum of historical embeddings into the final context vector. This attention pooling allows the model to selectively focus on relevant parts of the history.
- Token Composition: To implement the function $\Phi$ from Equation (12), a soft composition is performed over the candidate token embeddings (all tokens in the same RQ-VAE codebook layer), guided by the context vector:
$
\alpha_{c^{\prime}} = \frac{\exp(\langle W_{q}u, W_{k}e_{c^{\prime}}\rangle)}{\sum_{c^{\prime\prime}\in\mathcal{N}(c)}\exp(\langle W_{q}u, W_{k}e_{c^{\prime\prime}}\rangle)} \quad (16)
$
$
\tilde{e}_{c} = \Phi_{\mathrm{soft}}(c, u_{c}, \mathcal{N}(c)) = \sum_{c^{\prime}\in \mathcal{N}(c)}\alpha_{c^{\prime}}\cdot e_{c^{\prime}} \quad (17)
$
Where:
  - $e_{c^{\prime}}$ refers to a candidate token embedding from the same codebook layer.
  - $u$ is the context vector.
  - $W_q$, $W_k$ are learnable projection matrices (query and key transformations, similar to self-attention).
  - $\langle \cdot, \cdot \rangle$ denotes the dot-product similarity.
  - $\alpha_{c^{\prime}}$ are the composition attention weights, indicating how much each candidate token contributes to the refined embedding of $c$, based on the context.
  - $\tilde{e}_c$ is the context-aware composed token embedding, formed as a weighted sum of candidate embeddings.

Finally, this context-aware composed token embedding is fused with the original static embedding (the fused embedding from the Decomposed Embedding Fusion for token $c$) using a residual link:
$
e_{\mathrm{final}} = \alpha \cdot \tilde{e}_{c} + (1 - \alpha)\cdot e_{c}^{\mathrm{static}}, \quad \alpha \in [0,1] \quad (18)
$
Where:
  - $e_{\mathrm{final}}$ is the final embedding used by the LLM recommender.
  - $\alpha$ is a tunable hyperparameter that controls the strength of context adaptation. A larger $\alpha$ means more reliance on the composed (context-adapted) embedding, while a smaller $\alpha$ prioritizes the static embedding. This allows for flexible integration of contextual signals (see the sketch below).
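A hedged PyTorch sketch of Contextualized Token Composition, combining the attention pooling of Equations (13)-(14), the soft composition of Equations (16)-(17), and the residual mix of Equation (18). Module names, dimensions, and the MLP shape are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ContextualTokenComposition(nn.Module):
    def __init__(self, d=128, alpha=0.5):
        super().__init__()
        self.attn_w = nn.Linear(d, d)                  # W and b of Eq. (14)
        self.attn_v = nn.Linear(d, 1, bias=False)      # w^T of Eq. (14)
        self.mlp_ctx = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.W_q = nn.Linear(d, d, bias=False)         # query projection of Eq. (16)
        self.W_k = nn.Linear(d, d, bias=False)         # key projection of Eq. (16)
        self.alpha = alpha                             # composition weight of Eq. (18)

    def forward(self, history, e_static, layer_codebook):
        # history: (L, d) fused embeddings of the context; layer_codebook: (K, d) candidates
        scores = self.attn_v(torch.tanh(self.attn_w(history)))        # (L, 1) attention scores
        weights = torch.softmax(scores, dim=0)
        u = self.mlp_ctx((weights * history).sum(dim=0))               # context vector u_c
        logits = self.W_k(layer_codebook) @ self.W_q(u)                # (K,) composition logits
        comp = torch.softmax(logits, dim=0) @ layer_codebook           # soft-composed embedding
        return self.alpha * comp + (1 - self.alpha) * e_static         # residual mix (Eq. 18)

ctc = ContextualTokenComposition()
e_final = ctc(torch.randn(20, 128), torch.randn(128), torch.randn(256, 128))  # (128,)
```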
- Learnable BOS Embedding Composition: To handle the generation of the very first token (Beginning-of-Sequence, BOS), where no historical tokens have been generated yet, DECOR introduces N learnable BOS query vectors. These queries serve as latent representations of candidate BOS tokens. For the generation of the target token, the BOS token's embedding is composed with these BOS queries using the contextual composition function:
$
\hat{e}_{\mathrm{BOS}} = \Phi\big(e_{\mathrm{BOS}}, u_{c}, \mathcal{Q}_{\mathrm{BOS}}\big) \quad (19)
$
Where:
  - $e_{\mathrm{BOS}}$ is the original fused embedding for the BOS token.
  - $u_c$ is the context vector.
  - $\mathcal{Q}_{\mathrm{BOS}}$ serves as the set of candidate composition tokens for the BOS embedding.

The resulting composed BOS embedding acts as the initial prefix for autoregressive generation, ensuring that even the first token's interpretation is dynamically adapted.
The image compares DECOR against the baseline in terms of active embedding usage and recommendation performance. The left side shows embedding utilization at different quantization layers, while the right side shows the NDCG@10 and Recall@10 improvements over the conventional method TIGER. The bottom contains tokenization examples for different products, highlighting the impact of context differences and ambiguous prefixes.
The figure above (Figure 1 from the original paper) illustrates the problem of suboptimal static tokenization, where noise-canceling headphones for different contexts (office, workout, sleep) receive identical prefix tokens (e.g., (1, 276)). DECOR addresses this by contextually adapting token representations through composition, enabling different usage contexts to refine the token embedding, leading to increased embedding utilization and improved recommendation accuracy.
4.2.5. Complexity Analysis
The paper briefly analyzes the additional computational cost introduced by DECOR:
- Decomposed Embedding Fusion: This adds a constant cost per item from the two projection matrices ($W_{\mathrm{pre}}$, $W_{\mathrm{collab}}$) and the fusion matrix ($W_{\mathrm{fuse}}$), whose sizes are $d' \times d$ and $d \times 2d'$ respectively.
- Contextualized Token Composition:
  - Context Vector Computation: Attention pooling over $L$ historical tokens has a complexity of $O(L \cdot d)$.
  - Token Composition: For each token, attending to a fixed candidate set of size $K$ (all codebook entries in a layer) adds $O(K \cdot d)$. Since $d$ (embedding dimension) and $K$ (codebook size) are constants, the additional cost of DECOR components scales linearly with the context length $L$. This is considered negligible compared to the quadratic complexity of the Transformer self-attention layers in the backbone recommender model.
4.2.6. DECOR Training
DECOR is integrated into every forward pass of the generative recommender model.
- Encoder Input: Decomposed Embedding Fusion is first applied to compute the encoder input embeddings. This combines the frozen pretrained semantic embeddings with the learnable collaborative representations.
- Autoregressive Generation: During the autoregressive generation process in the decoder, the static embedding lookup is replaced by Contextualized Token Composition. Each token embedding is dynamically adapted based on the generated context (user history and previously generated tokens for the current item). This allows token embedding representations to evolve during training.

By doing so, DECOR simultaneously preserves pretrained semantics (via Decomposed Embedding Fusion) and adapts to recommendation signals (via Contextualized Token Composition), effectively addressing both suboptimal static tokenization (C1) and discarded pretrained text semantics (C2) within a unified framework without requiring complex joint optimization of the tokenizer.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three subsets of the updated Amazon Review dataset (Hou et al. 2024).
- Preprocessing: The datasets undergo a 5-core filter, meaning items and users with fewer than five interaction records are excluded. User interaction sequences are constructed by chronologically ordering items, with a maximum sequence length of 20.
- Characteristics:
  - Source: Amazon Review dataset.
  - Domain: E-commerce (specific categories: Scientific, Instrument, Game).
  - Sparsity: All datasets are highly sparse, indicating a challenging recommendation task.
- Suitability: These datasets are widely used benchmarks in sequential recommendation and are effective for validating methods in sparse, real-world e-commerce scenarios.
The following are the statistics of the datasets from Table 4 of the original paper:
| Dataset | # Users | # Items | # Interactions | Sparsity |
| --- | --- | --- | --- | --- |
| Scientific | 50,985 | 25,848 | 412,947 | 99.969% |
| Instrument | 57,439 | 24,587 | 511,836 | 99.964% |
| Game | 94,762 | 25,612 | 814,586 | 99.966% |
5.2. Evaluation Metrics
All models are evaluated using top-K Recall (Recall@K) and NDCG@K (Normalized Discounted Cumulative Gain) with K = 5 and K = 10. The leave-one-out strategy is employed: for each user, the last interaction is for testing, the second-last for validation, and the rest for training. A full-ranking evaluation is performed over the entire candidate item set.
5.2.1. Recall@K
- Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved among the top K recommendations. In the context of recommendation, it answers the question: "Out of all the items the user actually interacted with (or the next item they interacted with), how many did our system correctly recommend within the top K?" A higher Recall@K indicates that the recommender system is better at finding and presenting items that users are likely to engage with.
- Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|}{|\mathrm{Rel}_u|} $
- Symbol Explanation:
  - $U$: The set of all users.
  - $|U|$: The total number of users.
  - $\mathrm{Rel}_u$: The set of actual relevant items for user $u$ (in leave-one-out, this is typically the single next item the user interacted with).
  - $\mathrm{Rec}_u(K)$: The set of top K items recommended by the system for user $u$.
  - $|\cdot|$: Denotes the cardinality of a set.
  - $\cap$: Represents the intersection of two sets.
5.2.2. NDCG@K
- Conceptual Definition: Normalized Discounted Cumulative Gain at K (NDCG@K) is a measure of ranking quality. It considers not only whether relevant items are in the top K recommendations, but also their positions in the ranked list. Highly relevant items appearing at the top of the list contribute more to the score than those appearing lower down. It is "normalized" to values between 0 and 1 by dividing by the ideal DCG (IDCG), which is the DCG achieved if all relevant items were perfectly ranked at the top. A higher NDCG@K indicates a better-ranked list, where relevant items are prioritized.
- Mathematical Formula:
$
\mathrm{NDCG@K} = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{DCG@K}_u}{\mathrm{IDCG@K}_u}
$
where
$
\mathrm{DCG@K}_u = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)}
$
and $\mathrm{IDCG@K}_u$ is the maximum possible DCG@K, achieved by sorting all relevant items by their relevance.
- Symbol Explanation:
  - $U$: The set of all users.
  - $|U|$: The total number of users.
  - $\mathrm{DCG@K}_u$: Discounted Cumulative Gain at K for user $u$.
  - $\mathrm{IDCG@K}_u$: Ideal Discounted Cumulative Gain at K for user $u$.
  - $K$: The number of top recommendations considered.
  - $j$: The position of an item in the ranked list.
  - $\mathrm{rel}(j)$: The relevance score of the item at position $j$. In a binary relevance setting (item is either relevant or not), $\mathrm{rel}(j)$ is 1 if the item at position $j$ is relevant, and 0 otherwise.
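For concreteness, here is a minimal Python sketch of both metrics under the paper's leave-one-out protocol, where each user has exactly one held-out relevant item; function names and the toy data are illustrative.

```python
import math

def recall_at_k(ranked_lists, targets, k=10):
    # fraction of users whose held-out item appears in the top-k recommendations
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

def ndcg_at_k(ranked_lists, targets, k=10):
    # with a single relevant item, IDCG@K = 1, so NDCG reduces to 1 / log2(rank + 1)
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked[:k]:
            rank = ranked[:k].index(t) + 1
            total += 1.0 / math.log2(rank + 1)
    return total / len(targets)

# toy usage: 2 users, top-3 recommendations each
print(recall_at_k([[3, 7, 9], [5, 2, 8]], targets=[7, 1], k=3))  # 0.5
print(ndcg_at_k([[3, 7, 9], [5, 2, 8]], targets=[7, 1], k=3))    # ~0.315
```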
5.3. Baselines
The paper compares DECOR against a comprehensive set of baselines, categorized into traditional ID-based sequential recommenders and LLM-based generative recommenders.
5.3.1. Traditional Baselines
These models typically learn embeddings for item IDs and model sequential patterns without explicit semantic tokenization.
- Caser (Tang and Wang 2018): Applies convolutional neural networks (CNNs) to capture both short-range sequential patterns (vertical filters) and co-occurring latent features (horizontal filters).
- GRU4Rec (Jannach and Ludewig 2017): Utilizes a Gated Recurrent Unit (GRU) network to model session-based user behavior, using the final hidden state to predict the next item.
- SASRec (Kang and McAuley 2018): Employs a stack of Transformer encoder layers with multi-head self-attention to capture long-range dependencies in user interaction sequences.
- BERT4Rec (Sun et al. 2019): Adapts the BERT architecture (a bidirectional Transformer encoder) for sequential recommendation, learning rich, context-aware item representations by predicting masked items.
- FDSA: Not explicitly cited in this analysis; FDSA (Feature-level Deeper Self-Attention network) models item-level and feature-level transition patterns with separate self-attention blocks and fuses them for next-item prediction.
- S3Rec (Zhou et al. 2020): Enhances pre-training with four auxiliary self-supervised tasks (masking attributes, predicting masked items, distinguishing subsequences, and contrasting full sequences) to maximize mutual information at multiple item granularities.
5.3.2. Generative Baselines
These models leverage LLMs and often use semantic tokenization.
- P5-SID (Hua et al. 2023): Decomposes numeric item IDs into ordered subtokens (e.g., prefixes) so that frequently co-occurring or sequentially adjacent items share subtoken patterns, improving locality in autoregressive generation.
- P5-CID (Hua et al. 2023): Performs spectral clustering on collaborative co-occurrence graphs to group items, then uses the resulting cluster IDs as discrete tokens for generative recommendation.
- TIGER (Rajput et al. 2023): A core baseline. It frames sequential recommendation as a generative retrieval task by quantizing item text embeddings via RQ-VAE into a fixed vocabulary of semantic IDs and trains an LLM (T5) to autoregressively generate the next item's ID.
- LETTER (Wang et al. 2024a): Optimizes an RQ-VAE tokenizer by enforcing contrastive alignment and diversity regularization to learn hierarchical, collaborative, and diverse item tokens.
- CoST (Zhu et al. 2024): Trains a quantization codebook with an InfoNCE-style contrastive loss to map item embeddings into discrete semantic tokens that preserve both semantic similarity and neighborhood structure.
- ETEGRec (Liu et al. 2025): A dynamic generative baseline. It jointly optimizes the tokenizer and the recommender model with a set of alignment losses to improve tokenizer-recommender consistency.
5.4. Implementation Details
- Tokenizer: Sentence-T5 is used as the text encoder for pretrained semantics (to generate input for the RQ-VAE).
- Generative Recommender: T5 (Raffel et al. 2020) serves as the backbone LLM for autoregressive generation.
- Hardware: Experiments are carried out on a single NVIDIA Tesla A40 GPU.
- Reproducibility: A random seed is fixed to 2025.
- Evaluation Protocol: The test set performance of the best-performing model (based on NDCG@10 on the validation set) is reported.

The following are the hyperparameters used for each dataset from Table 5 of the original paper:
| Hyperparameter | Scientific | Instrument | Game |
| --- | --- | --- | --- |
| learning_rate | 0.003 | 0.003 | 0.003 |
| warmup_steps | 10,000 | 10,000 | 10,001 |
| dropout_rate | 0.1 | 0.1 | 0.1 |
| max_grad_norm | 1.0 | 1.0 | 1.0 |
| weight_decay | 0.05 | 0.05 | 0.05 |
| beam_size | 50 | 50 | 50 |
| d_model | 128 | 128 | 128 |
| α | 0.25 | 0.55 | 0.55 |
| BOS_Query_Number | 32 | 64 | 64 |
| optimizer | adamw | adamw | adamw |
| lr_scheduler | cosine | cosine | cosine |
| train_batch_size | 256 | 256 | 256 |
| max_epochs | 200 | 200 | 200 |
| early_stop_patience | 20 | 20 | 10 |
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that DECOR consistently outperforms all baselines across different metrics and datasets. This strongly validates the effectiveness of the proposed method in integrating pretrained semantics and collaborative signals.
The following are the results from Table 1 of the original paper:
| Group | Method | Scientific R@5 | Scientific R@10 | Scientific N@5 | Scientific N@10 | Instrument R@5 | Instrument R@10 | Instrument N@5 | Instrument N@10 | Game R@5 | Game R@10 | Game N@5 | Game N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Traditional | Caser | 0.0172 | 0.0281 | 0.0107 | 0.0142 | 0.0242 | 0.0392 | 0.0154 | 0.0202 | 0.0346 | 0.0567 | 0.0221 | 0.0291 |
| Traditional | GRU4Rec | 0.0221 | 0.0353 | 0.0144 | 0.0186 | 0.0345 | 0.0537 | 0.0220 | 0.0281 | 0.0522 | 0.0831 | 0.0337 | 0.0436 |
| Traditional | SASRec | 0.0256 | 0.0406 | 0.0147 | 0.0195 | 0.0341 | 0.0530 | 0.0217 | 0.0277 | 0.0517 | 0.0821 | 0.0329 | 0.0426 |
| Traditional | BERT4Rec | 0.0180 | 0.0300 | 0.0113 | 0.0151 | 0.0305 | 0.0483 | 0.0196 | 0.0253 | 0.0453 | 0.0716 | 0.0294 | 0.0378 |
| Traditional | FDSA | 0.0261 | 0.0391 | 0.0174 | 0.0216 | 0.0364 | 0.0557 | 0.0233 | 0.0295 | 0.0548 | 0.0857 | 0.0353 | 0.0453 |
| Traditional | S3Rec | 0.0253 | 0.0410 | 0.0172 | 0.0218 | 0.0340 | 0.0538 | 0.0218 | 0.0282 | 0.0533 | 0.0823 | 0.0351 | 0.0444 |
| Generative (Static) | P5-SID | 0.0155 | 0.0234 | 0.0103 | 0.0129 | 0.0319 | 0.0437 | 0.0237 | 0.0275 | 0.0480 | 0.0693 | 0.0333 | 0.0401 |
| Generative (Static) | P5-CID | 0.0192 | 0.0300 | 0.0123 | 0.0158 | 0.0352 | 0.0507 | 0.0234 | 0.0285 | 0.0497 | 0.0748 | 0.0343 | 0.0424 |
| Generative (Static) | TIGER | 0.0275 | 0.0431 | 0.0181 | 0.0231 | 0.0368 | 0.0574 | 0.0242 | 0.0308 | 0.0570 | 0.0895 | 0.0370 | 0.0471 |
| Generative (Static) | LETTER | 0.0276 | 0.0433 | 0.0179 | 0.0230 | 0.0372 | 0.0581 | 0.0243 | 0.0310 | 0.0576 | 0.0901 | 0.0373 | 0.0475 |
| Generative (Static) | CoST | 0.0270 | 0.0426 | 0.0180 | 0.0229 | 0.0366 | 0.0570 | 0.0242 | 0.0306 | 0.0569 | 0.0897 | 0.0379 | 0.0472 |
| Generative (Dynamic) | ETEGRec | 0.0272 | 0.0433 | 0.0173 | 0.0225 | 0.0387 | 0.0609 | 0.0251 | 0.0323 | 0.0591 | 0.0925 | 0.0385 | 0.0492 |
| | DECOR | 0.0301* | 0.0469* | 0.0201* | 0.0256* | 0.0399* | 0.0610 | 0.0266* | 0.0333* | 0.0610* | 0.0944* | 0.0400* | 0.0507* |
Comparison with Traditional Models: DECOR achieves substantial gains over traditional models like SASRec and FDSA, which highlights the benefits of integrating pretrained semantics and collaborative signals through a generative framework. For example, on the Scientific dataset, DECOR's NDCG@10 (0.0256) significantly surpasses FDSA (0.0216) and SASRec (0.0195).
Comparison with Generative (Static) Models: DECOR consistently outperforms generative models that rely on static tokenization (e.g., TIGER, LETTER, CoST). For instance, on the Scientific dataset, DECOR improves NDCG@10 from TIGER's 0.0231 to 0.0256 (+10.8%). This indicates the efficacy of contextualized token composition in mitigating the suboptimal static tokenization problem.
Comparison with Generative (Dynamic) Models: Even against ETEGRec, a dynamic baseline that jointly optimizes the tokenizer and recommender, DECOR shows superior performance. On Scientific, DECOR yields a NDCG@10 of 0.0256 compared to ETEGRec's 0.0225, representing a +13.8% improvement. This is a critical finding, demonstrating that DECOR's approach of contextually adapting token *embeddings* without iterative re-tokenization provides a more robust and stable framework, avoiding the potential instability of joint training.
In summary, DECOR's advantages stem from its ability to:
- Preserve rich semantic knowledge from pretraining (via Decomposed Embedding Fusion).
- Adapt token representations to diverse usage contexts dynamically (via Contextualized Token Composition).

These combined benefits lead to state-of-the-art recommendation performance across various datasets.
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 2 of the original paper:
| Model Variant | Scientific R@5 | Scientific R@10 | Scientific N@5 | Scientific N@10 | Instrument R@5 | Instrument R@10 | Instrument N@5 | Instrument N@10 | Game R@5 | Game R@10 | Game N@5 | Game N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base (TIGER) | 0.0275 | 0.0431 | 0.0181 | 0.0231 | 0.0368 | 0.0574 | 0.0242 | 0.0308 | 0.0570 | 0.0895 | 0.0370 | 0.0471 |
| w/ Token Comp. | 0.0292 | 0.0459 | 0.0193 | 0.0247 | 0.0385 | 0.0583 | 0.0261 | 0.0321 | 0.0599 | 0.0931 | 0.0394 | 0.0502 |
| w/ Token Comp. and BOS Queries | 0.0300 | 0.0462 | 0.0198 | 0.0248 | 0.0374 | 0.0585 | 0.0246 | 0.0313 | 0.0600 | 0.0932 | 0.0395 | 0.0500 |
| w/ Pretrained Only | 0.0294 | 0.0457 | 0.0192 | 0.0246 | 0.0382 | 0.0595 | 0.0254 | 0.0323 | 0.0602 | 0.0934 | 0.0390 | 0.0501 |
| w/ Pretrained + Token Comp. | 0.0298 | 0.0465 | 0.0198 | 0.0250 | 0.0388 | 0.0598 | 0.0257 | 0.0324 | 0.0603 | 0.0932 | 0.0396 | 0.0500 |
| DECOR (Full) | 0.0301 | 0.0469 | 0.0201 | 0.0256 | 0.0399 | 0.0610 | 0.0266 | 0.0333 | 0.0610 | 0.0944 | 0.0400 | 0.0507 |
The ablation study in Table 2 systematically evaluates the contribution of each component of DECOR, starting from the TIGER model as the baseline (Base (TIGER)).
- w/ Token Comp.: Adding contextualized token composition alone yields the most significant improvements across all datasets. For example, on Instrument, NDCG@10 improves from 0.0308 (Base) to 0.0321, an increase of approximately 4.2%. This confirms the effectiveness of dynamically refining static token semantics based on context.
- w/ Token Comp. and BOS Queries: Further incorporating learnable BOS queries (which aid in the initial generation phase) provides additional gains, particularly notable on the Scientific dataset, where NDCG@10 slightly improves from 0.0247 to 0.0248. This suggests that allowing context-aware composition for the Beginning-of-Sequence token helps bootstrap the generation process, especially in datasets with more complex initial item semantics.
- w/ Pretrained Only: This variant only incorporates frozen pretrained embeddings (via Decomposed Embedding Fusion without Contextualized Token Composition). It also shows significant improvements over the Base (TIGER) model (e.g., NDCG@10 on Scientific improves from 0.0231 to 0.0246). This highlights the intrinsic value of preserving pretrained semantic knowledge and preventing its erasure during collaborative training.
- w/ Pretrained + Token Comp.: Combining Decomposed Embedding Fusion (with frozen pretrained embeddings) and Contextualized Token Composition results in even better performance (e.g., NDCG@10 on Scientific improves to 0.0250). This suggests a synergistic effect between preserving semantics and adapting to context.
- DECOR (Full): The full DECOR model, incorporating all components (Decomposed Embedding Fusion, Contextualized Token Composition, and Learnable BOS Queries), consistently achieves the best performance across all metrics and datasets. For example, NDCG@10 on Scientific reaches 0.0256, demonstrating that each component contributes positively to the overall recommendation quality. The slight reductions in R@5, N@5, and N@10 on Instrument among the intermediate variants suggest that the interactions between components can be subtle, but the full model finds the optimal balance.
6.2.1. Hyperparameter Sensitivity
The image is a results comparison showing Recall@10 and NDCG@10 on the different datasets (Scientific, Instrument, Game) under varying values of the parameters Alpha and BOS Query Number, illustrating how model performance changes with different Alpha values and BOS query counts.
The figure above (Figure 3 from the original paper) displays the sensitivity of DECOR's performance to two key hyperparameters: the composition weight and the BOS query number. Performance is measured by NDCG@10.
Effect of Composition Weight α:

- Robustness: DECOR exhibits general robustness to a range of α values.
- Optimal Range: Moderate values of α (e.g., 0.4 to 0.55) consistently yield the strongest performance. This indicates a balanced contribution from both the residual link (original static embedding) and the context-aware composition.
- Performance Degradation: Very high α values (e.g., 0.7) lead to a sharp performance degradation. This is likely due to undertraining of individual token embeddings. If α is too high, the model relies excessively on compositional signals from other tokens, potentially neglecting to learn robust individual token embeddings.
- Dataset Specificity: The performance collapse at high α is more pronounced on the Instrument and Game datasets. These datasets often have larger interaction spaces, requiring more stable and robust individual token embeddings to generalize across diverse usage contexts. Excessive reliance on composition in such scenarios can delay embedding convergence and lead to training instability.

Effect of BOS Query Number:

- Consistent Enhancement: Increasing the number of BOS queries generally enhances performance, especially when moving from 0 (or a very small number) up to 32. For example, on the Scientific dataset, NDCG@10 improves significantly once BOS queries are introduced. This suggests that BOS queries facilitate meaningful convergence by enabling the model to better capture diverse user preferences before generating the first item token.
- Plateau Effect: Beyond a certain point (e.g., 32 or 64 on the Game dataset), performance gains plateau. Larger values (e.g., 128) yield no additional improvement. This suggests that a moderate number of BOS queries is sufficient for the model to capture the necessary contextual information, and adding more becomes redundant.
6.2.2. Addressing Suboptimal Static Tokenization
The image is a schematic: the left side shows the distribution of items under static prefix representations, while the right side shows how the distribution changes once contextually composed prefix representations are used. Different prefix types are annotated, including the original prefix (marked with a star) as well as color-coded composed prefixes and possible next SID-3 tokens.
The figure above (Figure 4 from the original paper) presents t-SNE visualizations to illustrate how DECOR addresses suboptimal static tokenization on the Scientific dataset, focusing on prefix ambiguity.
- Static Tokenization (Left): Shows that a static tokenization process assigns a single, fixed embedding for a specific prefix (e.g., (1, 276)). This fixed embedding often appears distant or semantically incoherent with valid next-token candidates (represented by red triangles). This visualizes prefix ambiguity, where the same prefix has to represent different item meanings, making it hard for the recommender to disambiguate.
- DECOR (Right): In contrast, DECOR's contextually composed prefix embeddings (scattered colored points) are adaptively enhanced based on context. These composed embeddings align more coherently and are closer to their respective valid next-token candidates. The scattered nature of the composed prefix embeddings (e.g., multiple distinct points for the prefix (1, 276)) demonstrates that DECOR successfully reinterprets and refines these representations according to the user interaction context, effectively mitigating the ambiguity inherent in static tokenization.

The following are the results from Table 3 of the original paper:
| Dataset | Method | Embedding Utilization: Layer-1 | Layer-2 | Layer-3 |
| --- | --- | --- | --- | --- |
| Scientific | TIGER | 26.6% | 99.07% | 99.87% |
| Scientific | Ours | 51.06% | 99.97% | 100.00% |
| Instrument | TIGER | 27.97% | 96.77% | 100.00% |
| Instrument | Ours | 100.00% | 100.00% | 100.00% |
| Game | TIGER | 25.67% | 99.61% | 100.00% |
| Game | Ours | 100.00% | 100.00% | 100.00% |
Table 3 compares the active embedding usage at each quantization layer between TIGER (a static tokenization baseline) and DECOR.
- Low Utilization in TIGER: TIGER shows low utilization of Layer-1 embeddings, with only 25-28% of embeddings actively used during tokenizer pretraining across datasets. Many potential codebook entries are under-utilized or rarely selected, indicating an inefficient use of the codebook's capacity (a counting sketch follows this list).
- DECOR's Improved Utilization: By filtering out tokens with below-uniform composition attention weights, DECOR substantially improves the actively trained embedding coverage. On the Instrument and Game datasets, DECOR achieves 100% utilization at Layer-1, meaning all available codebook entries are actively engaged through the contextualized token composition mechanism; on Scientific, it reaches 51.06%.
- Adaptive Activation: On the Scientific dataset, which has fewer interactions and thus less diverse contexts to model, DECOR adaptively activates a smaller (though still much higher than TIGER) proportion of additional embeddings. This demonstrates DECOR's ability to activate embeddings in proportion to the diversity of contexts, mitigating suboptimal static tokenization with efficient representation usage. The increase in active embedding usage, particularly at the crucial Layer-1 (which captures coarse-grained semantics), directly supports the claim that DECOR's contextualized token composition enhances expressiveness by interpolating beyond the original static assignments and leveraging previously unused embedding capacity.
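As a rough illustration of how such a utilization number can be computed, the sketch below counts how many codebook entries appear at least once among the assigned semantic IDs at each layer. The item-to-SID mapping here is synthetic, and the paper's exact criterion (filtering by composition attention weights) is not reproduced.

```python
import numpy as np

def embedding_utilization(semantic_ids: np.ndarray, codebook_sizes: list[int]) -> list[float]:
    """Fraction of codebook entries used at least once, per quantization layer.

    semantic_ids:   (num_items, num_layers) integer token assignments
    codebook_sizes: number of entries in each layer's codebook
    """
    utilization = []
    for layer, size in enumerate(codebook_sizes):
        used = np.unique(semantic_ids[:, layer]).size
        utilization.append(used / size)
    return utilization

# Toy example: 5,000 items, three layers with 256-entry codebooks each.
rng = np.random.default_rng(0)
sids = rng.integers(0, 256, size=(5000, 3))
print([f"{u:.2%}" for u in embedding_utilization(sids, [256, 256, 256])])
```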
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully identifies and addresses two critical limitations in current generative recommender systems: suboptimal static tokenization and the discarding of pretrained semantic knowledge. The proposed DECOR framework offers an elegant solution by integrating decomposed embedding fusion to preserve valuable pretrained semantics and contextualized token composition to dynamically adapt token representations based on user interaction context. Through extensive experiments on three real-world Amazon datasets, DECOR consistently demonstrates superior performance compared to both traditional and state-of-the-art generative baselines, including those attempting dynamic tokenization. The ablation studies confirm the individual and synergistic contributions of DECOR's components, while hyperparameter analysis reveals its robustness within a practical range and the importance of balanced contextual adaptation. Furthermore, visual analysis and embedding utilization metrics validate DECOR's ability to mitigate prefix ambiguity and enhance the active use of codebook embeddings.
7.2. Limitations & Future Work
The paper's abstract and conclusion do not explicitly state specific limitations or direct avenues for future work. However, some implicit points can be inferred:
- Computational Cost: While a complexity analysis is provided, it primarily focuses on the scaling of additional components. A more detailed empirical analysis of the actual computational overhead during training and inference (e.g., in terms of wall-clock time or GPU memory usage compared to baselines) could offer further insights.
- Generalizability of Hyperparameters: While the paper shows robustness to α and the BOS query number, the optimal values still vary across datasets. Future work might explore adaptive mechanisms for these hyperparameters.
- Complexity of Context Modeling: The current context vector computation uses attention pooling over historical tokens (a minimal sketch of such pooling appears after this list). More sophisticated context modeling techniques (e.g., incorporating temporal dynamics, long-term user preferences, or richer side information) could further enhance token composition.
- Beyond Two-Stage: Although DECOR effectively mitigates the issues within the two-stage paradigm, the fundamental objective misalignment between tokenizer pretraining and recommender training still exists at a high level. Exploring truly end-to-end solutions that inherently learn item representations optimized for recommendation, without a separate tokenization pretraining stage, could be a long-term future direction.
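For context, here is a minimal sketch of attention pooling over historical token embeddings to produce a single context vector, which is the kind of computation the item above refers to. The learnable query, the shapes, and the single-head form are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Pool a sequence of historical token embeddings into one context vector."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        # A single learnable query scores each historical token.
        self.query = nn.Parameter(torch.randn(d_model) * 0.02)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq_len, d_model)
        scores = history @ self.query                         # (batch, seq_len)
        weights = F.softmax(scores, dim=-1)                   # attention over history
        return torch.einsum('bs,bsd->bd', weights, history)   # (batch, d_model)

# Toy usage: 4 users, each with 10 historical token embeddings.
pool = AttentionPooling(d_model=128)
context = pool(torch.randn(4, 10, 128))  # -> (4, 128)
```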
7.3. Personal Insights & Critique
This paper presents a highly insightful and practical solution to fundamental challenges in generative recommendation. The explicit identification of suboptimal static tokenization and discarded pretrained semantics is a crucial contribution, as these issues are often overlooked or only partially addressed in the literature.
Inspirations:
- Synergy of Pretrained and Collaborative Signals: DECOR provides a compelling blueprint for how to effectively combine rich, general semantic knowledge (from pretrained models) with specific, dynamic collaborative signals (from user interactions). This decomposed fusion approach could be transferable to other domains where foundation models provide strong initial representations but task-specific nuances require adaptation.
- Stable Contextual Adaptation: The contextualized token composition mechanism is particularly inspiring. Instead of complex and potentially unstable joint training of tokenizers, DECOR achieves dynamic adaptation by modifying the interpretation of static tokens through a soft composition. This embedding-level adaptation is a more robust and computationally efficient way to introduce contextuality without altering the underlying discrete vocabulary. The idea of contextual re-interpretation could be applied to other areas where discrete representations need to be adapted to specific situations without retraining the entire representation generation pipeline.
- Efficient Embedding Utilization: The observation that many codebook entries are under-utilized in static tokenization, and that DECOR can significantly improve this utilization, is a valuable insight. It suggests that even with fixed discrete vocabularies, intelligent contextualization can unlock greater representational capacity.
Critique:
- Depth of Decomposed Embedding Fusion: While the concept of frozen pretrained embeddings and learnable collaborative embeddings is clear, the interaction within the fusion layer (Equation 11) is relatively simple (a linear projection after concatenation). Exploring more sophisticated fusion mechanisms, such as gating or attention-based fusion between the two modalities, could potentially yield further improvements and provide finer-grained control over how semantic and collaborative signals are blended (a minimal sketch of one such gated variant is given after this critique).
- Interpretability of Composed Embeddings: While the t-SNE plots show the visual effect of contextual composition, a deeper analysis of which semantic or collaborative aspects are highlighted or suppressed in the composed embeddings for different contexts could enhance interpretability. For instance, can we identify distinct contextual "dimensions" that are activated for different usage scenarios of an item?
- Scalability to Extremely Large Vocabularies: The contextualized token composition involves attention over all tokens in a codebook layer. Although this cost is constant with respect to the sequence length, the constant factor scales with the codebook size and could become a bottleneck if codebooks grow very large in future applications. Investigating methods to sample or dynamically select a subset of candidate tokens for composition could be a future consideration.
- Practical Deployment Considerations: The paper provides a complexity analysis, but a direct comparison of training time, inference latency, and memory footprint between DECOR and its strong baselines (ETEGRec, TIGER) would be beneficial for practical deployment decisions, especially given the "negligible" overhead claim.

Overall, DECOR offers a significant advancement in generative recommendation by effectively addressing critical representational challenges. Its principled approach to leveraging and adapting token semantics opens up promising avenues for more accurate and context-aware recommendation systems.
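To make the gating suggestion from the first critique point concrete, below is a minimal sketch of a gated fusion of a frozen pretrained codebook embedding with a learnable collaborative embedding. This is the alternative the critique proposes, not DECOR's fusion layer; all module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate between a frozen pretrained embedding and a collaborative embedding.

    Proposed alternative to concatenation + linear projection: a learned,
    per-dimension gate decides how much of each signal to keep.
    """

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, pretrained: torch.Tensor, collaborative: torch.Tensor) -> torch.Tensor:
        # pretrained, collaborative: (..., d_model); the pretrained embedding
        # is assumed to stay frozen upstream of this module.
        g = self.gate(torch.cat([pretrained, collaborative], dim=-1))
        return g * pretrained + (1.0 - g) * collaborative

# Toy usage with a batch of 8 token embeddings.
fusion = GatedFusion(d_model=128)
fused = fusion(torch.randn(8, 128), torch.randn(8, 128))  # -> (8, 128)
```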