Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation
TL;DR Summary
This paper introduces FlexCode, a dual codebook framework that balances collaborative filtering and semantic understanding in generative recommendation systems, enhancing performance with adaptive token allocation and a lightweight mixture-of-experts model, and outperforming strong baselines on public and industrial-scale datasets.
Abstract
Generative recommendation has recently emerged as a powerful paradigm that unifies retrieval and generation, representing items as discrete semantic tokens and enabling flexible sequence modeling with autoregressive models. Despite its success, existing approaches rely on a single, uniform codebook to encode all items, overlooking the inherent imbalance between popular items rich in collaborative signals and long-tail items that depend on semantic understanding. We argue that this uniform treatment limits representational efficiency and hinders generalization. To address this, we introduce FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between a collaborative filtering (CF) codebook and a semantic codebook. A lightweight MoE dynamically balances CF-specific precision and semantic generalization, while an alignment and smoothness objective maintains coherence across the popularity spectrum. We perform experiments on both public and industrial-scale datasets, showing that FlexCode consistently outperforms strong baselines. FlexCode provides a new mechanism for token representation in generative recommenders, achieving stronger accuracy and tail robustness, and offering a new perspective on balancing memorization and generalization in token-based recommendation models.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation". This title highlights the paper's core idea: combining semantic understanding (semantics) with collaborative interaction patterns (signals) through a dual codebook approach to improve generative recommendation systems.
1.2. Authors
The authors and their affiliations are:
- Zheng Hui: University of Cambridge (zh2483@columbia.edu)
- Xiaokai Wei, Chen Wang, Alejandro Peláez, Michelle Gong: Roblox Corporation ({xwei, cwang, apelaez, mgong}@roblox.com)
- Reza Shirkavand: University of Maryland (rezashkv@cs.umd.edu)
- Weizhi Zhang: University of Illinois Chicago (wzhan42@uic.edu)
The author list indicates a collaboration between academic institutions (University of Cambridge, University of Maryland, University of Illinois Chicago) and an industrial research lab (Roblox Corporation), suggesting a blend of theoretical rigor and practical application focus.
1.3. Journal/Conference
The paper is listed as an arXiv preprint arXiv:2511.20673v1.
This means it has not yet been formally peer-reviewed and published in a journal or conference proceedings. arXiv is a widely respected open-access repository for preprints in fields like computer science, physics, mathematics, and more, allowing researchers to share their work rapidly. While it's a common practice for disseminating new research, it implies the work has not yet undergone formal academic scrutiny by a program committee or journal editors.
1.4. Publication Year
The publication timestamp provided is 2025-11-15T05:51:38.000Z. The publication year is 2025.
1.5. Abstract
Generative recommendation, a new paradigm, unifies item retrieval and generation by representing items as discrete semantic tokens for autoregressive modeling. However, current methods use a single, uniform codebook for all items, which fails to account for the disparity between popular items (rich in collaborative signals) and long-tail items (reliant on semantic understanding). This uniform treatment leads to inefficient representation and poor generalization.
To address this, the paper introduces FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between two specialized codebooks: a collaborative filtering (CF) codebook and a semantic codebook. It uses a lightweight Mixture-of-Experts (MoE) router to dynamically balance the precision of CF for popular items and the generalization of semantics for long-tail items. An alignment and smoothness objective ensures coherence across the entire popularity spectrum.
Experiments on public and industrial-scale datasets show that FlexCode consistently outperforms strong baselines. The paper concludes that FlexCode offers a novel mechanism for token representation in generative recommenders, achieving superior accuracy and robustness for long-tail items, and providing a fresh perspective on balancing memorization and generalization in token-based recommendation models.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2511.20673v1
- PDF Link: https://arxiv.org/pdf/2511.20673v1.pdf
- Publication Status: This paper is a preprint available on arXiv, meaning it has not yet been formally peer-reviewed or published in a conference or journal.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve lies within the emerging field of generative recommendation. Traditional recommender systems, like collaborative filtering (CF), excel at identifying head items (popular items with abundant interaction data) due to their ability to memorize fine-grained co-occurrence patterns. However, they systematically struggle with long-tail and cold-start items (items with sparse interaction data) because they lack mechanisms for semantic generalization. This results in a fundamental memorization bias.
Generative recommenders, which map items to discrete semantic tokens and use autoregressive models (like Transformers) to predict future items, offer advantages such as compatibility with large language models and better generalization through multimodal supervision (e.g., text, visual, metadata semantics).
Despite these benefits, current generative recommenders suffer from two critical limitations:
- Representation Entanglement: Most existing approaches use a single, shared codebook for all items. This forces the model to compress both semantic content information and collaborative interaction information into the same quantized space. This coupling leads to interference, where semantic signals can dilute collaborative patterns for head items, and unreliable collaborative signals can dominate for tail items, harming cold-start behavior. The result is representational collapse, where neither factor is optimally modeled.
- Static Capacity Allocation: Current methods allocate a fixed representational budget (the same number and type of codebook tokens) to every item, regardless of its popularity or data sparsity. This is inefficient given the long-tailed nature of recommender data. Head items, rich in interaction data, could benefit from high-capacity collaborative representations, while tail items, lacking interaction support, need more semantic evidence. Treating them uniformly wastes capacity for one group and starves the other, leading to overfitting on the head and under-representation of the tail.

The paper frames these limitations as a more general adaptive capacity allocation problem: how to judiciously allocate a fixed token budget between collaborative specificity and semantic generalization for each item, while avoiding destructive interference.
2.2. Main Contributions / Findings
The paper introduces FlexCode to address the identified challenges. Its primary contributions and findings are:
- Identification of Structural Limitations: The authors formally identify and define two key structural limitations in existing generative recommenders: representation entanglement in a single shared codebook, and static capacity allocation that disregards item popularity and data sparsity. They frame these as an adaptive capacity allocation problem.
- Introduction of FlexCode Framework: They propose FlexCode, a novel framework that factorizes item representations into a collaborative codebook and a semantic codebook. This disentanglement allows each codebook to specialize.
- Popularity-Aware Dynamic Token Allocation: FlexCode incorporates a popularity-aware mixture-of-experts (MoE) gate. This gate dynamically allocates a fixed token budget between the collaborative and semantic codebooks on a per-item basis, so popular items receive more collaborative tokens while long-tail items receive more semantic tokens, optimizing representation for different data regimes.
- Cross-Codebook Alignment and Smoothness Objectives: The framework includes an alignment objective to prevent the two codebooks from drifting apart and a smoothness objective to ensure coherent allocation across the popularity spectrum, promoting stability.
- Consistent Performance Improvements: Through extensive experiments on both public benchmarks (Amazon-Beauty, Amazon-Sports, KuaiRand-1K) and a large-scale industrial dataset, FlexCode consistently outperforms strong baselines, including ItemID-based, Semantic-ID, and unified representation models, across standard ranking metrics such as Recall@K and NDCG@K.
- Enhanced Tail Robustness and Accuracy: FlexCode demonstrates significantly better performance on long-tail and cold-start items, providing a substantial NDCG@10 improvement for tail items on the industrial dataset, while also improving head-item performance.
- New Perspective on Memorization and Generalization: The work offers a new perspective on how to balance memorization (for popular items, leveraging collaborative signals) and generalization (for rare items, leveraging semantic information) in token-based recommendation models through adaptive capacity allocation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand FlexCode, a reader should be familiar with the following fundamental concepts in recommender systems, machine learning, and natural language processing:
- Recommender Systems: At their core, recommender systems aim to predict user preferences for items. They help users discover new content (products, movies, news, etc.) and address information overload. The goal is to provide personalized suggestions.
- Collaborative Filtering (CF): This is a widely used technique in recommender systems. It works on the principle that if two users shared similar tastes in the past, they are likely to have similar tastes in the future.
  - Mechanism: CF models analyze historical user-item interactions (e.g., purchases, ratings, clicks) to find patterns.
  - Strengths: Highly effective for head items (popular items) where there is abundant interaction data, as it can memorize complex co-occurrence relationships.
  - Weaknesses: Suffers from the cold-start problem (difficulty recommending new users or new items due to lack of historical data) and the long-tail problem (poor performance on tail items with very few interactions). It lacks semantic understanding of items.
- Generative Recommendation: This is a newer paradigm that rephrases the recommendation task.
  - Traditional View: Recommenders predict a score for each item, then rank items by score.
  - Generative View: Items are treated as discrete "words" or "tokens." The task becomes akin to language modeling, where an autoregressive model learns to predict the next item (or sequence of item tokens) given a user's past interaction sequence. This allows integration with powerful large language models (LLMs).
- Item ID vs. Semantic ID:
  - Item ID: A unique identifier assigned to each item (e.g., product ID 12345). Traditional CF models often learn embeddings directly for these IDs. These are opaque and carry no inherent meaning.
  - Semantic ID: An identifier that carries inherent meaning, often derived from item features like text descriptions, images, or categories. In generative recommendation, semantic tokens are typically created by quantizing item embeddings (e.g., converting a continuous vector representing an item's description into a discrete code from a dictionary). These are interpretable to some extent.
- Vector Quantization (VQ) / Codebooks:
  - Vector Quantization (VQ): A process of mapping continuous-valued vectors into discrete "code vectors" from a finite set (a codebook). It's like finding the closest "template" vector in a dictionary for a given continuous input vector.
  - Codebook: A learnable dictionary of discrete "code vectors" or "embeddings." Each entry in the codebook is a vector. When an item is quantized, it's assigned one or more indices corresponding to these code vectors. This allows large continuous spaces to be represented in a discrete, compact way. VQ-VAE (Vector Quantized Variational Autoencoder) is a common architecture for learning such codebooks.
- Autoregressive Models / Transformer Architecture:
  - Autoregressive Model: A model that predicts the next element in a sequence based on all preceding elements. In generative recommendation, this means predicting the next item token(s) given the history of previous item tokens.
  - Transformer: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). It revolutionized sequence modeling, especially in natural language processing. Its core mechanism is self-attention, which allows the model to weigh the importance of different parts of the input sequence when processing each element. Transformers are highly effective at capturing long-range dependencies in sequences and are the backbone of most large language models. In this context, a Transformer decoder is typically used for sequence generation.
- Mixture-of-Experts (MoE):
- Concept: An MoE model consists of multiple "expert" neural networks and a "gating network." For a given input, the gating network learns to select (or combine outputs from) the most appropriate expert(s).
- Purpose: MoE allows a model to specialize different parts of its network on different types of inputs, potentially improving efficiency and performance by having experts focus on specific sub-problems. It's useful for handling heterogeneous data or tasks.
- InfoNCE Loss (Info Noise Contrastive Estimation):
  - Concept: A type of contrastive loss function commonly used in self-supervised learning. It aims to pull together representations of "positive pairs" (e.g., different augmentations of the same data point) while pushing apart "negative pairs" (e.g., different data points).
  - Purpose: In FlexCode, it's used for Cross-Codebook Alignment to ensure that the semantic and collaborative representations of the same item are similar, while those of different items are dissimilar. This helps maintain coherence between the two specialized codebooks.
  - Mathematical Formula (simplified form for understanding): Let $q$ be the query embedding, $k^+$ be a positive key embedding, and $k_1, \dots, k_N$ be negative key embeddings. $ \mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(q, k^+) / \tau)}{\sum_{i=0}^{N} \exp(\mathrm{sim}(q, k_i) / \tau)} $ Here $\mathrm{sim}(\cdot, \cdot)$ is a similarity function (e.g., cosine similarity), $\tau$ is a temperature parameter that controls the sharpness of the distribution, and $k_0$ is usually the positive key. The denominator sums over the positive key and all negative keys.
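To make the contrastive objective concrete, here is a minimal NumPy sketch of the InfoNCE loss described above; the `info_nce` function, the shapes, and the temperature value are illustrative assumptions, not the paper's implementation.

```python
# Minimal InfoNCE sketch (NumPy). Hypothetical shapes and names for illustration only.
import numpy as np

def info_nce(query, pos_key, neg_keys, tau=0.07):
    """query: (d,), pos_key: (d,), neg_keys: (N, d). Returns scalar loss."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Similarity of the positive pair and of each negative pair, scaled by temperature.
    logits = np.array([cos(query, pos_key)] +
                      [cos(query, k) for k in neg_keys]) / tau
    # Cross-entropy with the positive pair at index 0.
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
q, kp = rng.normal(size=64), rng.normal(size=64)
negs = rng.normal(size=(16, 64))
print(info_nce(q, kp, negs))
```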
3.2. Previous Works
The paper discusses various types of previous works in recommender systems, broadly categorized into Item ID-based and Semantic ID-based methods.
Item ID-based Methods (Traditional Sequence Models): These models learn embeddings for discrete item IDs and capture sequential patterns in user interactions. They primarily rely on collaborative signals.
- Caser (Tang and Wang, 2018): Uses convolutional sequence embedding to capture sequential patterns.
- GRU4Rec (Hidasi et al., 2015): Applies Gated Recurrent Units (GRUs) to model session-based recommendations. Recurrent Neural Networks (RNNs) are good at processing sequences.
- HGN (Ma et al., 2019): Hierarchical Gating Networks for sequential recommendation.
- BERT4Rec (Sun et al., 2019): Adapts the Bidirectional Encoder Representations from Transformers (BERT) architecture for sequential recommendation, treating item sequences like sentences and predicting masked items.
- SASRec (Kang and McAuley, 2018): Self-Attentive Sequential Recommendation. This model uses a Transformer-like self-attention mechanism to capture long-range dependencies in user interaction sequences, focusing on relevant past items.
  - Brief concept of Self-Attention (as used in SASRec and Transformers): Given a sequence of input embeddings $\mathbf{x}_1, \dots, \mathbf{x}_n$, self-attention calculates an output where each $\mathbf{y}_i$ is a weighted sum of all $\mathbf{x}_j$ in the sequence. The weights are learned dynamically based on the relevance between $\mathbf{x}_i$ and $\mathbf{x}_j$. This involves three matrices: Query ($Q$), Key ($K$), and Value ($V$), derived from the input embeddings as $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $W^Q, W^K, W^V$ are learnable weight matrices. The attention mechanism is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, $Q$, $K$, $V$ are matrices containing query, key, and value vectors for each position in the sequence, and $d_k$ is the dimension of the key vectors, used for scaling. The softmax ensures weights sum to 1. (A minimal code sketch of this computation appears after this list.)
- S3-Rec (Zhou et al., 2020): Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization.
- Recformer (Li et al., 2023): Leverages language representations for sequential recommendation.
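As referenced in the SASRec entry above, the following is a minimal NumPy sketch of scaled dot-product self-attention; the function name, shapes, and random weights are illustrative assumptions, and the causal masking used in SASRec is omitted for brevity.

```python
# Minimal scaled dot-product self-attention sketch (NumPy), as used in SASRec-style
# encoders. Shapes and weight names are illustrative, not the paper's implementation.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d_model); W*: (d_model, d_k). Returns (n, d_k) context vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 items in a user sequence
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8)
```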
Semantic ID-based Methods (Generative Recommenders): These approaches map items to discrete tokens, often derived from item content (semantics), and use generative models.
- VQRec (Hou et al., 2023): Learns vector-quantized item representations for transferable sequential recommenders, using a single codebook.
- TIGER (Rajput et al., 2023): Recommender Systems with Generative Retrieval. A pioneering work in generative recommendation, representing items as discrete tokens.
- LC-Rec (Zheng et al., 2024): Adapts Large Language Models by integrating collaborative semantics for recommendation.
- COBRA (Yang et al., 2025): Sparse meets dense: Unified generative recommendations with cascaded sparse-dense representations.
- URL (Unified Representation Learning) (Lin et al., 2025): A method aiming for unified semantic and ID representation learning for deep recommenders. This is a strong baseline because it also tries to combine semantic and collaborative information, but still within a single representation framework, making it a good comparison point for FlexCode's disentangled approach.
3.3. Technological Evolution
The field of recommender systems has evolved significantly:
- Early Era (Pre-Deep Learning): Dominated by traditional Collaborative Filtering (e.g., Matrix Factorization such as SVD or ALS, exemplified by Koren et al., 2009) and Content-Based Filtering (recommending items similar to those a user liked in the past based on item features). These methods were often effective but struggled with sparsity and lacked sophisticated sequential modeling capabilities.
- Deep Learning Era (Post-2015): The introduction of deep learning architectures led to Deep Recommendation Systems. GRU4Rec brought RNNs to session-based recommendation, and SASRec and BERT4Rec leveraged the power of Transformer networks to capture complex sequential patterns in user interaction histories. These models improved performance significantly by learning richer, non-linear representations from interaction data. However, they largely remained ItemID-based, still facing cold-start and long-tail challenges due to their reliance on interaction history.
- Generative Recommendation Era (Recent): Inspired by the success of Large Language Models (LLMs), the paradigm shifted from score prediction to sequence generation. Items are tokenized into discrete semantic identifiers, allowing recommender systems to leverage the power of autoregressive generative models. Works like VQRec and TIGER spearheaded this shift, offering better generalization by incorporating item semantics. The key challenge in this era, as highlighted by FlexCode, is how to effectively merge powerful semantic generalization with precise collaborative memorization when representing items as tokens.

FlexCode positions itself at the forefront of this generative recommendation era, specifically addressing the limitations of existing Semantic ID-based methods that struggle with representation entanglement and static capacity allocation due to their single codebook design.
3.4. Differentiation Analysis
Compared to the main methods in related work, FlexCode introduces core differences and innovations:
- vs. Item ID-based methods (e.g., SASRec, BERT4Rec):
  - Differentiation: FlexCode moves beyond opaque ItemID embeddings to discrete semantic tokens. While ItemID-based methods excel at memorization for popular items, they fundamentally lack semantic generalization for long-tail or cold-start items. FlexCode explicitly incorporates semantic codebooks to address this.
  - Innovation: It provides a mechanism for semantic understanding directly within the item representation, which ItemID-based methods cannot.
- vs. Semantic ID-based methods with a single codebook (e.g., VQRec, TIGER, URL):
  - Differentiation: This is the most direct comparison. Existing Semantic ID-based methods use a single, uniform codebook to encode all items. FlexCode proposes a dual codebook architecture: one for collaborative signals (CF codebook) and one for semantic information (semantic codebook).
  - Innovation:
    - Disentanglement: FlexCode explicitly disentangles collaborative and semantic factors, preventing the representation entanglement and representational collapse that occur when a single codebook tries to compress heterogeneous information. This allows each codebook to optimize for its specific type of signal.
    - Adaptive Capacity Allocation: Crucially, FlexCode introduces a popularity-aware Mixture-of-Experts (MoE) router. This dynamically allocates a fixed token budget between the two codebooks on a per-item basis. Existing Semantic ID-based methods use static capacity allocation, assigning the same number and type of tokens to all items, irrespective of their popularity. FlexCode's adaptive allocation is a major innovation, allowing more CF tokens for head items (for memorization) and more semantic tokens for tail items (for generalization).
    - Cross-Codebook Coherence: FlexCode includes an alignment objective to ensure that despite being disentangled, the two codebooks maintain a coherent relationship, preventing them from drifting into unrelated spaces.

In essence, FlexCode innovates by recognizing that memorization and generalization are not mutually exclusive but should be adaptively emphasized based on item characteristics. It achieves this through a novel architecture that explicitly separates and then intelligently combines these two types of signals, a clear departure from the uniform and static approaches of prior generative recommenders.
4. Methodology
4.1. Principles
The core idea behind FlexCode is to address the limitations of representation entanglement and static capacity allocation in existing generative recommenders by explicitly disentangling collaborative and semantic signals and dynamically allocating representational capacity between them based on item popularity. The theoretical basis is that different types of items (popular vs. long-tail) benefit from different types of representations: popular items thrive on precise collaborative patterns (memorization), while long-tail items require robust semantic understanding (generalization). By introducing two specialized codebooks and an adaptive routing mechanism, FlexCode aims to optimize the representation for each item across the entire popularity spectrum.
4.2. Core Methodology In-depth (Layer by Layer)
The FlexCode framework is composed of several interconnected modules: Dual Codebook Construction, Cross-Codebook Alignment, Popularity-Aware Token Allocation, and Autoregressive Generation, all trained jointly. The overall framework is illustrated in Figure 1.
4.2.1. Generative Recommendation Framework
First, let's understand the general generative recommendation setup.
Let $\mathcal{U}$ be the set of users and $\mathcal{I}$ be the set of items. For each user $u \in \mathcal{U}$, we observe an interaction sequence:
$
\mathbf{s}_u = [i_1, i_2, \dots, i_T]
$
Here, $i_t \in \mathcal{I}$ represents the $t$-th item interacted with by user $u$. The goal of recommendation is to predict the next item $i_{T+1}$ given the preceding sequence $\mathbf{s}_u$.
In the generative recommendation paradigm, this problem is reformulated as a conditional sequence generation task. Each item $i_t$ is first mapped into one or more discrete tokens using a vector-quantized codebook:
$
\mathbf{z}_{i_t} = [z_{i_t}^1, z_{i_t}^2, \dots, z_{i_t}^L]
$
Each $z_{i_t}^l$ is an index to an entry in a learnable codebook $\mathbf{C}$. Here, $L$ is the number of tokens assigned to each item, and $d$ is the embedding dimension of each codebook entry.
An autoregressive model (e.g., a Transformer decoder) is then trained to maximize the likelihood of the item sequence in this tokenized form:
$
\mathcal{L}_{\mathrm{gen}} = - \sum_{u \in \mathcal{U}} \sum_{t=1}^{T} \log p_{\theta} \left( \mathbf{z}_{i_t} \mid \mathbf{z}_{i_{<t}} \right)
$
This formulation treats recommendation like a language modeling problem, leveraging powerful generative models.
4.2.2. Collaborative and Semantic Representations
To highlight the distinction FlexCode makes, it's important to understand how traditional approaches handle representations:
- Traditional Collaborative Filtering (CF): Models learn an embedding matrix directly from user-item interactions. The relevance score between user $u$ (with embedding $\mathbf{e}_u$) and item $i$ (with embedding $\mathbf{e}_i$) is typically computed as:
$
r(u, i) = \mathbf{e}_u^{\top} \mathbf{e}_i
$
This is often optimized with implicit feedback losses like Bayesian Personalized Ranking (BPR). These embeddings are good at capturing relational specificity but lack semantic generalization.
- Semantic Encoders: These models map content features $\mathbf{x}_i$ (e.g., text descriptions, visual attributes) into a representation:
$
\mathbf{e}_{\mathrm{sem}}(i) = f_{\mathrm{sem}}(\mathbf{x}_i)
$
This captures modality-derived meaning independent of user interactions, offering better generalization to cold-start items, but often lacks collaborative precision.
- Existing Semantic-ID Generative Recommenders: These methods (like Rajput et al., 2023) quantize semantic embeddings into discrete tokens using vector quantization: $ \hat{z}_i = \arg \min_{k} \|\mathbf{e}_{\mathrm{sem}}(i) - \mathbf{C}_k\|_2^2 $ The resulting token indices are used for sequence generation. The critical point here is that existing methods employ a single codebook $\mathbf{C}$ for all items, implicitly mixing collaborative and semantic signals. This uniform codebook assumption is a key limitation FlexCode addresses; a minimal sketch of this nearest-neighbor quantization step follows.
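The sketch below illustrates the nearest-neighbor quantization step just described, assuming a single shared codebook; the `quantize` helper and the K=512, d=64 shapes are illustrative only.

```python
# Minimal sketch of nearest-neighbor vector quantization with one shared codebook.
import numpy as np

def quantize(e_sem, codebook):
    """e_sem: (d,) item embedding; codebook: (K, d). Returns (index, code vector)."""
    dists = ((codebook - e_sem) ** 2).sum(axis=1)  # squared L2 distance to each entry
    k = int(dists.argmin())                        # argmin_k ||e_sem - C_k||^2
    return k, codebook[k]

rng = np.random.default_rng(0)
C = rng.normal(size=(512, 64))      # codebook with K=512 entries of dimension d=64
e = rng.normal(size=64)
idx, code = quantize(e, C)
print(idx, code.shape)
```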
4.2.3. Problem Formulation: Capacity Allocation
FlexCode introduces two separate codebooks: a collaborative codebook and a semantic codebook. Given a fixed total token budget $L$, the problem is formulated as learning a function to allocate tokens between these two codebooks for each item $i$:
$
L_{\mathrm{CF}}(i) + L_{\mathrm{SEM}}(i) = L
$
Here, $L_{\mathrm{CF}}(i)$ and $L_{\mathrm{SEM}}(i)$ are the numbers of tokens for item $i$ drawn from the collaborative codebook and the semantic codebook, respectively.
The objective of capacity allocation is to learn an adaptive mapping, g(i), which determines the proportion of CF versus semantic tokens. This mapping is defined as:
$
g(i) = \sigma (w \cdot \log ( \mathrm{pop}(i) ) + b)
$
Where:
- $\mathrm{pop}(i)$ is the empirical popularity (interaction frequency) of item $i$. The logarithm is used to handle the skewed nature of popularity distributions.
- $w$ and $b$ are learnable parameters.
- $\sigma(\cdot)$ is the sigmoid function, which squashes the output to the range [0, 1].

The value $g(i)$ determines the proportion of CF tokens. Accordingly, the number of tokens allocated from each codebook is: $ L_{\mathrm{CF}}(i) = \lfloor g(i) \cdot L \rfloor , \quad L_{\mathrm{SEM}}(i) = L - L_{\mathrm{CF}}(i) $ This means for head items (high $\mathrm{pop}(i)$), g(i) approaches 1, allocating more CF tokens. For tail items (low $\mathrm{pop}(i)$), g(i) decreases, emphasizing semantic tokens. This creates an adaptive dual-codebook encoding.
The resulting generation objective then combines tokens from both sources:
$
\mathcal{L}_{\mathrm{gen}}^{\mathrm{dual}} = - \sum_{u \in \mathcal{U}} \sum_{t=1}^{T} \log p_{\theta} \big( [\mathbf{z}_{i_t}^{\mathrm{CF}}, \mathbf{z}_{i_t}^{\mathrm{SEM}}] \mid \mathbf{z}_{i_{<t}} \big)
$
If $g(i) = 1$, this reduces to pure CF-based tokenization; if $g(i) = 0$, it's pure semantic tokenization. This lays the foundation for the Dynamic Dual-Codebook Learning framework.
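To illustrate the popularity-based gate and token split just defined, here is a minimal Python sketch; the parameter values (w, b, total budget L) are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of the popularity-based allocation g(i) and the resulting token split.
import math

def allocate(pop, L=4, w=0.5, b=-2.0):
    """pop: interaction count of an item; L: total token budget.
    Returns (L_cf, L_sem) with L_cf + L_sem == L."""
    g = 1.0 / (1.0 + math.exp(-(w * math.log(pop + 1e-8) + b)))  # sigmoid gate
    L_cf = math.floor(g * L)          # collaborative tokens (more for head items)
    return L_cf, L - L_cf             # remaining budget goes to semantic tokens

for pop in (1, 20, 500, 10000):       # from long-tail to head items
    print(pop, allocate(pop))
```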
4.2.4. Dual Codebook Construction
To tackle the representation entanglement issue, FlexCode uses two specialized codebooks.
4.2.4.1. Semantic Codebook Learning (SCL)
The Semantic Codebook aims to capture the intrinsic meaning of an item from its associated textual and categorical metadata.
- Semantic Embedding: For each item $i$, its descriptive attributes (brand, category, price, title) are concatenated into a unified text string. This string is then processed by a powerful pretrained text embedding model (e.g., from an LLM) to obtain a dense semantic embedding $\mathbf{e}_i^{\mathrm{sem}}$.
- Quantization: These continuous semantic embeddings are then converted into discrete codes using a Residual Quantization Variational Autoencoder (RQ-VAE) (Lee et al., 2022). For each $\mathbf{e}_i^{\mathrm{sem}}$, the RQ-VAE encodes it into a sequence of discrete code vectors, denoted $\mathbf{c}_i^{\mathrm{sem}}$. Each element is an index to an entry in a shared semantic codebook.
- Loss Function: The total reconstruction loss for the semantic codebook is defined as:
$
\mathcal{L}_{\mathrm{SCL}} = \mathbb{E}_{\mathbf{e}_i^{\mathrm{sem}} \sim D_{\mathrm{sem}}} \big[ \| \mathbf{e}_i^{\mathrm{sem}} - \mathrm{Decode}(\mathbf{c}_i^{\mathrm{sem}}) \|_2^2 \big] + \sum_{j=1}^{L_{\mathrm{sem}}} \mathcal{L}_{\mathrm{vq}} \big( \mathbf{z}_{i,j-1}^{\mathrm{sem}}, \mathbf{q}_{i,j}^{\mathrm{sem}} \big) \qquad (1)
$
Where:
- $\mathbf{e}_i^{\mathrm{sem}}$ is the continuous semantic embedding of item $i$.
- $D_{\mathrm{sem}}$ is the distribution of semantic embeddings.
- $\mathrm{Decode}(\cdot)$ is the decoder component of the RQ-VAE that reconstructs the embedding from its quantized representation. The squared-error term is the reconstruction loss, ensuring the quantized representation can faithfully recover the original embedding.
- $\mathcal{L}_{\mathrm{vq}}$ is the standard VQ-VAE loss (typically composed of a codebook loss, a commitment loss, and a reconstruction loss component for the quantization process).
- $\mathbf{z}_{i,j-1}^{\mathrm{sem}}$ and $\mathbf{q}_{i,j}^{\mathrm{sem}}$ refer to intermediate representations in the Residual Quantization process, where the current residual is quantized based on previous quantization steps.

The output of this stage for item $i$ is its semantic codebook representation $\mathbf{c}_i^{\mathrm{sem}}$; a minimal residual-quantization sketch follows.
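The following is a minimal sketch of greedy residual quantization, the core idea behind the RQ-VAE tokenizer described above (without the learned encoder/decoder or VQ losses); the codebook sizes and the `residual_quantize` helper are illustrative assumptions.

```python
# Minimal residual-quantization sketch for one embedding: greedy nearest-neighbor
# lookup at each level, producing L code indices and a reconstruction.
import numpy as np

def residual_quantize(e, codebooks):
    """e: (d,) embedding; codebooks: list of (K, d) arrays, one per level.
    Returns (indices, reconstruction)."""
    residual = e.copy()
    recon = np.zeros_like(e)
    indices = []
    for C in codebooks:
        dists = ((C - residual) ** 2).sum(axis=1)
        k = int(dists.argmin())        # quantize the current residual
        indices.append(k)
        recon += C[k]                  # accumulate the reconstruction
        residual = e - recon           # next level models what is still missing
    return indices, recon

rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 32)) for _ in range(3)]   # 3 levels, K=256, d=32
emb = rng.normal(size=32)
codes, rec = residual_quantize(emb, books)
print(codes, float(((emb - rec) ** 2).sum()))
```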
4.2.4.2. Collaborative Codebook Learning (CCL)
The Collaborative Codebook is designed to capture high-order co-purchase or co-view patterns.
- Collaborative Embedding: A SASRec-like architecture (a Transformer-based sequential model) is used to derive a context-aware collaborative embedding $\mathbf{e}_i^{\mathrm{col}}$ for each item $i$. This embedding implicitly encodes an item's position within interaction sequences.
- Quantization: Similar to the semantic codebook, an RQ-VAE is applied to discretize these collaborative embeddings into a sequence of discrete code vectors, $\mathbf{c}_i^{\mathrm{col}}$, drawn from a collaborative codebook.
- Loss Function: The loss function is defined analogously to $\mathcal{L}_{\mathrm{SCL}}$:
$
\mathcal{L}_{\mathrm{CCL}} = \mathbb{E}_{\mathbf{e}_i^{\mathrm{col}} \sim D_{\mathrm{col}}} \big[ \| \mathbf{e}_i^{\mathrm{col}} - \mathrm{Decode}(\mathbf{c}_i^{\mathrm{col}}) \|_2^2 \big] + \sum_{j=1}^{L_{\mathrm{col}}} \mathcal{L}_{\mathrm{vq}} \big( \mathbf{z}_{i,j-1}^{\mathrm{col}}, \mathbf{q}_{i,j}^{\mathrm{col}} \big) \qquad (2)
$
Where:
- $\mathbf{e}_i^{\mathrm{col}}$ is the continuous collaborative embedding of item $i$.
- $D_{\mathrm{col}}$ is the distribution of collaborative embeddings.
- The remaining terms are analogous to those in $\mathcal{L}_{\mathrm{SCL}}$, ensuring reconstruction and effective quantization of collaborative signals.

The output for item $i$ is its collaborative codebook representation $\mathbf{c}_i^{\mathrm{col}}$.
4.2.5. Cross-Codebook Alignment (CCA)
While the two codebooks are specialized, they need to maintain some coherence to represent the same item consistently. The Cross-Codebook Alignment (CCA) objective ensures this.
- Reconstructed Embeddings: The CCA objective operates on the reconstructed embeddings from the codebooks: $\tilde{\mathbf{e}}_i^{\mathrm{sem}}$ and $\tilde{\mathbf{e}}_i^{\mathrm{col}}$. This is crucial because it forces the alignment loss to act on the information actually preserved and decodable by the quantized representations.
- Projection Heads: Let $P_{\mathrm{sem}}$ and $P_{\mathrm{col}}$ be projection heads (e.g., simple multi-layer perceptrons) that map these reconstructed embeddings into a shared latent space.
- Alignment Loss: The alignment loss is defined using an InfoNCE objective: $ \mathcal{L}_{\mathrm{CCA}} = - \log \frac{ \exp \left( \mathrm{sim} ( P_{\mathrm{sem}} ( \tilde{\mathbf{e}}_i^{\mathrm{sem}} ) , P_{\mathrm{col}} ( \tilde{\mathbf{e}}_i^{\mathrm{col}} ) ) / \tau \right) }{ \sum_{j \in \mathcal{T}} \exp \left( \mathrm{sim} ( P_{\mathrm{sem}} ( \tilde{\mathbf{e}}_i^{\mathrm{sem}} ) , P_{\mathrm{col}} ( \tilde{\mathbf{e}}_j^{\mathrm{col}} ) ) / \tau \right) } $ Where:
  - $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity function, measuring the similarity between two vectors.
  - $\tau$ is a temperature parameter, which scales the logits before the exponential, controlling the sharpness of the distribution and the difficulty of distinguishing positive from negative pairs. A smaller $\tau$ makes the model focus more on hard negatives.
  - The numerator calculates the similarity between the projected semantic and collaborative reconstructions of the same item $i$ (positive pair).
  - The denominator sums similarities between the projected semantic reconstruction of item $i$ and projected collaborative reconstructions of all items $j \in \mathcal{T}$ (including item $i$ itself and other items as negatives).

This loss function effectively pulls the representations of the same item (from its semantic and collaborative perspectives) closer in the shared latent space, while pushing representations of different items apart. This contrastive pressure also helps to regularize the codebooks, encouraging more uniform usage of codebook entries and mitigating codebook collapse (where only a few codebook entries are frequently used). A minimal sketch of this alignment loss follows.
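Below is a batched sketch of the cross-codebook alignment objective, using in-batch negatives and simple linear maps as stand-ins for the projection heads; all names, shapes, and the temperature are illustrative assumptions rather than the authors' implementation.

```python
# In-batch InfoNCE between projected semantic and collaborative reconstructions
# of the same items; positives lie on the diagonal of the similarity matrix.
import numpy as np

def cca_loss(e_sem, e_col, W_sem, W_col, tau=0.1):
    """e_sem, e_col: (B, d) reconstructed embeddings; W_*: (d, p) projections."""
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    zs, zc = l2norm(e_sem @ W_sem), l2norm(e_col @ W_col)
    logits = zs @ zc.T / tau                         # (B, B) cosine similarities / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # positives are the matching items

rng = np.random.default_rng(0)
B, d, p = 8, 64, 32
print(cca_loss(rng.normal(size=(B, d)), rng.normal(size=(B, d)),
               rng.normal(size=(d, p)), rng.normal(size=(d, p))))
```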
4.2.6. Popularity-Aware Token Allocation (PATA)
To adaptively allocate a fixed total token budget between the codebooks, FlexCode uses a lightweight Mixture-of-Experts (MoE) router. The goal is to allocate more collaborative capacity to popular (head) items and more semantic capacity to sparse (tail) items.
- Feature Vector: For each item $i$, a feature vector $\mathbf{x}_i$ is constructed. This vector includes statistics relevant to item popularity and characteristics:
$
\begin{array}{r c l} { \mathbf{x}_i } & { = } & { [ \log(1 + f_i), \mathrm{age}_i, \mathrm{sparsity}_i, \mathrm{uncertainty}_i ] } \end{array}
$
Where:
- $f_i$ is the normalized interaction frequency of item $i$. The $\log(1 + f_i)$ transformation helps handle skewed popularity distributions.
- $\mathrm{age}_i$ is the time since the item was introduced.
- $\mathrm{sparsity}_i$ is the inverse of its interaction density.
- $\mathrm{uncertainty}_i$ is the variance of its embedding during training, which can indicate how well-defined its representation is.

This rich feature vector allows the router to make nuanced allocation decisions beyond simple popularity.
- MoE Router: A shallow Multi-Layer Perceptron (MLP) processes $\mathbf{x}_i$ to produce logits. A temperature-scaled softmax then converts these logits into routing probabilities. The collaborative allocation ratio is defined as $\alpha_i$, representing the proportion of tokens to be drawn from the collaborative codebook.
- Soft Token Allocation: This ratio determines the soft number of tokens for each codebook during training: $ \bar{L}_i^{\mathrm{col}} = \alpha_i L_{\mathrm{total}}, \quad \bar{L}_i^{\mathrm{sem}} = (1 - \alpha_i) L_{\mathrm{total}} $
- Sigmoid-based Masks: To maintain differentiability during training while effectively selecting tokens, sigmoid-based masks are used to softly select tokens: $ m_k^{\mathrm{col}} (\alpha_i) = \sigma \left( \frac{ \bar{L}_i^{\mathrm{col}} - (k - \frac{1}{2}) }{ \tau_m } \right), \quad m_k^{\mathrm{sem}} (\alpha_i) = \sigma \left( \frac{ \bar{L}_i^{\mathrm{sem}} - (k - \frac{1}{2}) }{ \tau_m } \right) $ Where:
  - $m_k^{\mathrm{col}}$ and $m_k^{\mathrm{sem}}$ are the masking values for the $k$-th token position of the collaborative and semantic codebooks, respectively.
  - $\sigma(\cdot)$ is the sigmoid function.
  - $\tau_m$ is a hyperparameter (set to 0.1) that controls the smoothness of the soft-to-hard transition. A smaller $\tau_m$ makes the transition sharper, closer to a hard cutoff.

  These masks are applied to the codebook entries, and the final combined representation concatenates the masked codebooks. At inference time, the allocation is discretized via rounding. A minimal sketch of the router and masking step appears after this list.
- Router Stability: To ensure the router's stability and effective learning, two regularization objectives are added:
  - Stratified load-balancing ($\mathcal{L}_{\mathrm{lb}}$): This prevents any single expert (or codebook pathway) from being overloaded or underutilized, distributing the load evenly within popularity bands.
  - Local smoothness regularizer ($\mathcal{L}_{\mathrm{smooth}}$): This is applied to the allocation ratios $\alpha_i$, encouraging similar items to have similar allocation patterns, leading to more stable routing.
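The sketch below illustrates the popularity-aware routing and soft masking steps described in this subsection; the single-layer stand-in for the MLP, the feature values, and the hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the popularity-aware router and sigmoid-based soft token masks.
import numpy as np

def route_and_mask(x_i, W, b, L_total=4, tau=1.0, tau_m=0.1):
    """x_i: (4,) item features [log-freq, age, sparsity, uncertainty]."""
    logits = (W @ x_i + b) / tau                      # temperature-scaled logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # routing probabilities
    alpha = probs[0]                                  # collaborative allocation ratio
    L_col, L_sem = alpha * L_total, (1 - alpha) * L_total
    k = np.arange(1, L_total + 1)                     # token positions 1..L_total
    m_col = 1 / (1 + np.exp(-(L_col - (k - 0.5)) / tau_m))   # soft masks per position
    m_sem = 1 / (1 + np.exp(-(L_sem - (k - 0.5)) / tau_m))
    return alpha, m_col, m_sem

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 4)), rng.normal(size=2)    # two pathways: CF vs. semantic
alpha, m_col, m_sem = route_and_mask(np.array([3.0, 0.2, 0.1, 0.05]), W, b)
print(round(float(alpha), 3), m_col.round(2), m_sem.round(2))
```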
4.2.7. Autoregressive Generation (ARG)
After the item representations are formed by concatenating the dynamically allocated and masked tokens from the dual codebooks, an autoregressive Transformer model is used for the final recommendation task.
- Sequence Formation: Given a user's historical sequence of items $S_u$, their corresponding combined codebooks $\mathbf{C}_{S_u}$ are retrieved.
- Likelihood Maximization: The autoregressive Transformer model is trained to maximize the likelihood of the next item's codebook. The loss is the standard cross-entropy over the sequence of predicted tokens: $ \mathcal{L}_{\mathrm{ARG}} = - \sum_{t=1}^{N_{\mathrm{seq}} - 1} \sum_{k=1}^{L_{\mathrm{total}}} \log P (\mathbf{c}_{i_{t+1}, k} \mid \mathbf{C}_{S_u^{<t+1}}, \mathbf{c}_{i_{t+1}, <k}) $ Where:
  - $N_{\mathrm{seq}}$ is the length of the sequence.
  - $L_{\mathrm{total}}$ is the total number of tokens per item.
  - $P (\mathbf{c}_{i_{t+1}, k} \mid \mathbf{C}_{S_u^{<t+1}}, \mathbf{c}_{i_{t+1}, <k})$ is the probability of the $k$-th token of item $i_{t+1}$ given the combined codebooks of previous items in the sequence ($\mathbf{C}_{S_u^{<t+1}}$) and the previously generated tokens of the current target item ($\mathbf{c}_{i_{t+1}, <k}$). This structure allows for token-level autoregression within an item's representation.
4.2.8. Overall Objective and Training
The entire FlexCode framework is trained jointly in an end-to-end manner. The final objective function is a weighted sum of all components:
$
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{SCL}} + \mathcal{L}_{\mathrm{CCL}} + \lambda_{\mathrm{CCA}} \mathcal{L}_{\mathrm{CCA}} + \lambda_{\mathrm{ARG}} \mathcal{L}_{\mathrm{ARG}} + \lambda_{\mathrm{lb}} \mathcal{L}_{\mathrm{lb}} + \lambda_{\mathrm{smooth}} \mathcal{L}_{\mathrm{smooth}}
$
Where:
- $\mathcal{L}_{\mathrm{SCL}}$: Semantic Codebook Learning loss (Equation 1).
- $\mathcal{L}_{\mathrm{CCL}}$: Collaborative Codebook Learning loss (Equation 2).
- $\mathcal{L}_{\mathrm{CCA}}$: Cross-Codebook Alignment loss.
- $\mathcal{L}_{\mathrm{ARG}}$: Autoregressive Generation loss.
- $\mathcal{L}_{\mathrm{lb}}$: Load-balancing regularization loss for the MoE router.
- $\mathcal{L}_{\mathrm{smooth}}$: Smoothness regularization loss for the MoE router's allocation ratios.
- $\lambda_{\mathrm{CCA}}, \lambda_{\mathrm{ARG}}, \lambda_{\mathrm{lb}}, \lambda_{\mathrm{smooth}}$ are hyperparameters that balance the contribution of each term.

This end-to-end training allows the semantic codebook, collaborative codebook, popularity-aware router, and autoregressive Transformer to co-adapt, leading to a cohesive and effective system that balances memorization and generalization dynamically.
The following figure (Figure 1 from the original paper) shows the overall architecture of FlexCode:
Figure 1: Overview of the FlexCode framework for generative recommendation. Each item is encoded by a dual codebook with collaborative and semantic codebooks, aligned via a cross-codebook contrastive objective. A popularity-aware Mixture-of-Experts (MoE) router adaptively allocates the budget between codebooks, and an autoregressive Transformer is trained on the resulting sequences to generate items.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three public benchmarks and one large-scale proprietary industrial dataset to comprehensively evaluate FlexCode across diverse domains and data characteristics.
The standard leave-last-out evaluation protocol is adopted, where the last item in a user's chronological interaction history is used for testing, and the second-to-last for validation. A 5-core filtering strategy is applied, meaning only users and items with at least 5 interactions are kept.
The following are the statistics from Table 2 of the original paper:
| Dataset | #Users | #Items | #Interactions | Avg. Seq. Len. |
| :--- | :--- | :--- | :--- | :--- |
| Beauty | 22,363 | 12,101 | 198,360 | 8.87 |
| Sports | 35,598 | 18,357 | 296,175 | 8.32 |
| KuaiRand | 1,000 | 3.6M | 11M | 11.71 |
| Proprietary | 1.5M+ | 1M+ | 45M+ | 28.0 |
- Public Benchmarks:
- Amazon-Beauty (McAuley et al., 2015): A relatively dense, medium-scale e-commerce dataset focused on beauty products. Its density implies more interaction data per user/item, making it suitable for evaluating memorization capabilities.
- Amazon-Sports and Outdoors (McAuley et al., 2015): A larger and sparser e-commerce dataset compared to Beauty. Its sparsity presents a greater challenge for models relying solely on collaborative signals and highlights the need for generalization.
- KuaiRand-1K (Gao et al., 2022): A large-scale dataset from a short-video recommendation platform, featuring rich side information. The "1K" refers to 1,000 users, but it's important to note the scale in terms of items and interactions (3.6M items, 11M interactions), making it a significant test for models handling large item spaces and diverse content.
- Choice Rationale: These datasets cover different scales, densities, and domains (e-commerce, short-video), providing a robust testbed for FlexCode's generalizability and ability to handle varying data characteristics.
- Proprietary Industrial Dataset:
  - Characteristics: This dataset comes from a commercial platform containing millions of users and items (1.5M+ users, 1M+ items, and 45M+ interactions; see Table 2). Interactions are logged over a year, with the last day used for evaluation to simulate an online inference scenario.
  - Choice Rationale: Evaluating on this dataset is crucial for demonstrating FlexCode's effectiveness and scalability under real-world, production-level conditions, which often involve massive scale, noise, and complex user behaviors.
5.2. Evaluation Metrics
The paper assesses model performance using Recall@K and NDCG@K, with $K \in \{5, 10\}$. These are standard ranking metrics in recommender systems, following common practice (Rajput et al., 2023).
5.2.1. Recall@K
- Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top-$K$ recommendations. It focuses on the model's ability to find as many relevant items as possible from the user's ground-truth set within the fixed recommendation list length $K$. A higher Recall@K indicates that the model is effectively identifying relevant items.
- Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|}{|\mathrm{Rel}_u|} $
- Symbol Explanation:
  - $|\mathcal{U}|$: The total number of users in the evaluation set.
  - $\mathrm{Rel}_u$: The set of actual relevant items for user $u$ (i.e., the items the user interacted with in the test set).
  - $\mathrm{Rec}_u(K)$: The set of top-$K$ items recommended by the model for user $u$.
  - $|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|$: The number of relevant items found within the top-$K$ recommendations for user $u$.
  - $|\mathrm{Rel}_u|$: The total number of relevant items for user $u$.
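A minimal sketch of Recall@K matching the formula above, assuming binary relevance; the example users and item ids are hypothetical.

```python
# Recall@K averaged over users: fraction of each user's relevant items in the top-K.
import numpy as np

def recall_at_k(recommended, relevant, k=10):
    """recommended: dict user -> ranked list of item ids;
    relevant: dict user -> set of ground-truth item ids."""
    scores = []
    for u, rel in relevant.items():
        if not rel:
            continue
        hits = len(set(recommended[u][:k]) & rel)  # relevant items inside the top-K
        scores.append(hits / len(rel))
    return float(np.mean(scores))

recs = {0: [3, 7, 1, 9], 1: [2, 4, 8, 5]}
truth = {0: {7}, 1: {5, 6}}
print(recall_at_k(recs, truth, k=4))   # (1/1 + 1/2) / 2 = 0.75
```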
5.2.2. NDCG@K (Normalized Discounted Cumulative Gain)
- Conceptual Definition: NDCG@K is a metric that evaluates the quality of ranking, giving more weight to highly relevant items that appear at higher (earlier) positions in the recommendation list. It's "normalized" because the calculated DCG is divided by the Ideal DCG (the DCG of a perfect ranking) to ensure scores are comparable across different query sets. A higher NDCG@K means not only are relevant items found, but they are also ranked appropriately.
- Mathematical Formula: $ \mathrm{NDCG@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\mathrm{DCG}_u(K)}{\mathrm{IDCG}_u(K)} $ Where: $ \mathrm{DCG}_u(K) = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)} $ and $\mathrm{IDCG}_u(K)$ is the maximum possible $\mathrm{DCG}_u(K)$ (i.e., the DCG value if the recommendations were perfectly ordered by relevance).
- Symbol Explanation:
  - $|\mathcal{U}|$: The total number of users in the evaluation set.
  - $\mathrm{rel}(j)$: The relevance score of the item at position $j$ in the recommended list. For implicit feedback, relevance is usually binary (1 if relevant, 0 otherwise).
  - $\log_2(j+1)$: A logarithmic discount factor, which reduces the contribution of relevant items as their rank decreases. Items at higher ranks (smaller $j$) contribute more to the DCG score.
  - $\mathrm{DCG}_u(K)$: Discounted Cumulative Gain for user $u$ up to rank $K$.
  - $\mathrm{IDCG}_u(K)$: Ideal Discounted Cumulative Gain for user $u$ up to rank $K$. This is calculated by sorting all relevant items for user $u$ by their actual relevance and then applying the DCG formula. It serves as a normalization factor.
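A minimal sketch of NDCG@K with binary relevance, mirroring the DCG/IDCG formulas above; the example ranking is hypothetical.

```python
# NDCG@K for a single user with binary relevance; average over users as needed.
import numpy as np

def ndcg_at_k(recommended, relevant, k=10):
    """recommended: ranked list of item ids; relevant: set of ground-truth ids."""
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    dcg = sum((2**g - 1) / np.log2(j + 2) for j, g in enumerate(gains))
    ideal = sum((2**1 - 1) / np.log2(j + 2) for j in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 7, 1, 9], {7, 9}, k=4))  # relevant items at ranks 2 and 4
```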
5.3. Baselines
The paper compares FlexCode against a comprehensive set of baselines, categorized into Item ID-based and Semantic ID-based methods.
5.3.1. Item ID-based methods
These are traditional sequential recommendation models that learn item representations directly from historical user-item interactions, typically using discrete item IDs. They focus on collaborative signals.
- Caser (Tang and Wang, 2018): Uses convolutional filters to capture local sequential patterns.
- GRU4Rec (Hidasi et al., 2015): An RNN-based model (GRU) for session-based recommendations, capturing temporal dependencies.
- HGN (Ma et al., 2019): Hierarchical Gating Networks, designed to model complex user behaviors and item transitions.
- BERT4Rec (Sun et al., 2019): Adapts the BERT model (a Transformer encoder) for sequential recommendation by using a masked item prediction task.
- SASRec (Kang and McAuley, 2018): A Transformer-based model that employs a self-attention mechanism to capture relevant item transitions in long sequences, often considered a strong baseline for sequential recommendation.
- S3-Rec (Zhou et al., 2020): Self-Supervised Learning for Sequential Recommendation, using mutual information maximization.
- Recformer (Li et al., 2023): Leverages text-based language representations within a Transformer framework for sequential recommendation.
5.3.2. Semantic ID-based methods
These are more recent generative recommendation approaches that tokenize or quantize item representations into discrete semantic identifiers, often incorporating item content information.
- VQRec (Hou et al., 2023): Learns vector-quantized item representations (using a single codebook) for sequential recommenders, aiming for transferability.
- TIGER (Rajput et al., 2023): A prominent generative retrieval model that represents items as discrete tokens, enabling autoregressive generation.
- LC-Rec (Zheng et al., 2024): Adapts Large Language Models by integrating collaborative semantics for recommendation.
- COBRA (Yang et al., 2025): A unified generative recommendation approach that combines sparse and dense representations.
- URL (Unified Representation Learning) (Lin et al., 2025): A unified semantic and ItemID representation learning approach for deep recommenders. This baseline is particularly relevant as it also attempts to combine different types of signals, but (as FlexCode argues) likely within a single, entangled codebook.

These baselines are representative as they cover both traditional collaborative filtering approaches that focus on interaction patterns and the latest generative recommendation models that leverage semantic information. The inclusion of URL is important as it represents state-of-the-art efforts in unifying item representations, making FlexCode's dual-codebook and adaptive allocation strategy a direct comparison point.
6. Results & Analysis
6.1. Core Results Analysis
The experiments demonstrate that FlexCode consistently outperforms all baseline models across various metrics and datasets. This robust superiority is attributed to its unique dual-codebook architecture, which disentangles collaborative and semantic signals, and its popularity-aware token allocation mechanism.
The following are the results from Table 1 of the original paper:
| Model | Beauty R@5 | Beauty N@5 | Beauty R@10 | Beauty N@10 | Sports R@5 | Sports N@5 | Sports R@10 | Sports N@10 | KuaiRand R@5 | KuaiRand N@5 | KuaiRand R@10 | KuaiRand N@10 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Item ID-based methods | ||||||||||||
| Caser | 0.0205 | 0.0131 | 0.0347 | 0.0176 | 0.0116 | 0.0072 | 0.0194 | 0.0097 | 0.0074 | 0.0068 | 0.0118 | 0.0095 |
| GRU4Rec | 0.0164 | 0.0099 | 0.0283 | 0.0137 | 0.0129 | 0.0086 | 0.0204 | 0.0110 | 0.0298 | 0.0217 | 0.0383 | 0.0245 |
| HGN | 0.0325 | 0.0206 | 0.0512 | 0.0266 | 0.0189 | 0.0120 | 0.0313 | 0.0159 | 0.0297 | 0.0169 | 0.0354 | 0.0219 |
| BERT4Rec | 0.0203 | 0.0124 | 0.0347 | 0.0170 | 0.0115 | 0.0075 | 0.0191 | 0.0099 | 0.0185 | 0.0196 | 0.0217 | 0.0236 |
| SASRec | 0.0387 | 0.0249 | 0.0605 | 0.0318 | 0.0233 | 0.0154 | 0.0350 | 0.0192 | 0.0332 | 0.0338 | 0.0405 | 0.0372 |
| S3-Rec | 0.0387 | 0.0244 | 0.0647 | 0.0327 | 0.0251 | 0.0161 | 0.0385 | 0.0204 | — | — | — | — |
| Recformer | 0.0379 | 0.0257 | 0.0589 | 0.0321 | 0.0249 | 0.0154 | 0.0370 | 0.0201 | — | — | — | — |
| Semantic ID-based methods | ||||||||||||
| VQ-Rec | 0.0457 | 0.0317 | 0.0664 | 0.0383 | 0.0208 | 0.0144 | 0.0300 | 0.0173 | 0.0513 | 0.0354 | 0.0589 | 0.0412 |
| TIGER | 0.0454 | 0.0321 | 0.0648 | 0.0384 | 0.0264 | 0.0181 | 0.0400 | 0.0225 | 0.0557 | 0.0383 | 0.0624 | 0.0445 |
| LC-Rec | 0.0478 | 0.0329 | 0.0679 | 0.0389 | 0.0268 | 0.0177 | 0.0412 | 0.0221 | 0.0622 | 0.0403 | 0.0684 | 0.0497 |
| COBRA | 0.0537 | 0.0395 | 0.0725 | 0.0456 | 0.0306 | 0.0215 | 0.0434 | 0.0257 | — | — | — | — |
| URL | 0.0553 | 0.0410 | 0.0736 | 0.0471 | 0.0305 | 0.0218 | 0.0449 | 0.0273 | 0.0654 | 0.0481 | 0.0778 | 0.0585 |
| FlexCode-SID only | 0.0510 | 0.0375 | 0.0689 | 0.0433 | 0.0291 | 0.0204 | 0.0412 | 0.0244 | 0.0523 | 0.0461 | 0.0759 | 0.0517 |
| FlexCode-CF only | 0.0360 | 0.0232 | 0.0563 | 0.0296 | 0.0217 | 0.0143 | 0.0326 | 0.0179 | 0.0309 | 0.0314 | 0.0377 | 0.0346 |
| FlexCode-Fix | 0.0531 | 0.0394 | 0.0707 | 0.0452 | 0.0293 | 0.0209 | 0.0431 | 0.0262 | 0.0628 | 0.0462 | 0.0747 | 0.0562 |
| FlexCode (ours) | 0.0578 | 0.0415 | 0.0769 | 0.0483 | 0.0329 | 0.0232 | 0.0471 | 0.0275 | 0.0709 | 0.0524 | 0.0825 | 0.0632 |
Key observations from Table 1:
- Overall Superiority: FlexCode (ours) consistently achieves the best performance across all datasets (Beauty, Sports and Outdoors, KuaiRand) and all metrics (Recall@K, NDCG@K). For example, on Amazon-Sports and Outdoors, FlexCode achieves a Recall@10 of 0.0471, representing a 5.3% relative improvement over URL (0.0449), the strongest semantic ID-based baseline. On KuaiRand, its NDCG@10 is 0.0632, an 8.0% improvement over URL (0.0585).
- Versus Item ID-based Models (e.g., SASRec): These models perform significantly worse, especially on sparser datasets like Sports and KuaiRand. This highlights their lack of
semantic generalizationand inability to handlesparse dataeffectively.FlexCode's dedicatedsemantic codebookdirectly addresses this. - Versus Semantic ID-based Models (e.g., URL): While these models are generally better than
Item ID-basedones,FlexCodesurpasses them. This indicates thatFlexCodesuccessfully mitigates therepresentation entanglementproblem prevalent in single-codebook generative models. By explicitly disentangling collaborative and semantic signals,FlexCodeallows for more pure and effective representations.
- Versus Item ID-based Models (e.g., SASRec): These models perform significantly worse, especially on sparser datasets like Sports and KuaiRand. This highlights their lack of
- Validation of Dual-Codebook Structure: FlexCode-SID only (using only semantic IDs) and FlexCode-CF only (using only collaborative IDs) show substantially worse performance than the full FlexCode or even FlexCode-Fix. This confirms that neither type of signal alone is sufficient and that the combination of both is critical. FlexCode-SID only often performs better than FlexCode-CF only on sparser datasets (e.g., KuaiRand NDCG@10 of 0.0517 vs. 0.0346), underscoring the importance of semantics for generalization.
- Validation of Popularity-Aware Token Allocation (PATA):
  - FlexCode-Fix (which uses a static 50/50 token split) already outperforms most baselines, including URL, confirming the inherent advantage of the disentangled dual-codebook structure.
  - However, the full FlexCode model consistently outperforms FlexCode-Fix. For instance, on KuaiRand, the dynamic allocation of PATA improves NDCG@10 from 0.0562 (FlexCode-Fix) to 0.0632 (FlexCode), a relative gain of 12.5%. This isolates the critical contribution of the adaptive allocation mechanism, proving that dynamically balancing capacity based on item popularity is essential for achieving state-of-the-art performance.
6.2. Further Analysis on Industrial Dataset
To further validate FlexCode's robustness and scalability, detailed analyses were performed on a large-scale proprietary industrial dataset.
The following figure (Figure 2 from the original paper) shows the performance evaluation on the large-scale industrial dataset:
Figure 2: Performance evaluation on the large-scale industrial dataset.
6.2.1. Overall Industrial Performance
Figure 2a compares the performance improvement of generative models relative to the SASRec baseline on the industrial dataset.
- A pure collaborative generative model (CF only) actually degrades performance by over 5% compared to SASRec. This highlights the limitations of solely relying on collaborative signals, especially in complex, real-world industrial settings that often have a strong long-tail.
- Advanced unified models like URL provide a moderate 4.9% lift in NDCG@10. This shows that attempts to combine signals in a single representation offer some benefit, but are not optimal.
- FlexCode delivers a substantial 13.2% improvement in NDCG@10 and a 16.5% improvement in HR@10 (Hit Ratio, equivalent to Recall@K but often used to denote a single relevant item in the top-K list for implicit feedback). This significant margin of victory on a massive, noisy, production-level dataset strongly confirms FlexCode's scalability and effectiveness in real-world industrial scenarios.
6.2.2. Cold-Start and Long-Tail Item Performance
Figure 2b directly addresses FlexCode's ability to balance memorization for popular (head) items and generalization for rare (tail) items on the industrial dataset, which exhibits a highly skewed long-tail distribution.
- Baselines' Trade-off:
  - The CF-only model shows an improvement for head items but a significant 5.5% degradation for tail items. This is expected, as CF excels at memorizing popular patterns but fails on sparse data.
  - Conversely, the SID-only (Semantic ID only) model improves tail performance by 5.7% (due to semantic generalization) but at the cost of weaker head performance (as semantics might dilute fine-grained collaborative signals for popular items). This exemplifies the memorization-generalization trade-off in single-faceted models.
- FlexCode's Resolution: FlexCode effectively resolves this conflict:
  - For head items, its MoE router correctly allocates more tokens to the collaborative codebook, enabling fine-grained memorization. This yields a 3.0% NDCG@10 improvement for head items.
  - For tail items, the router shifts capacity to the semantic codebook, facilitating generalization from content features. This results in an impressive 11.3% NDCG@10 improvement for tail items. This is the largest gain on tail items by a wide margin, highlighting FlexCode's practical importance for long-tail discovery.

These results collectively demonstrate that FlexCode's dynamic allocation of representational budget allows it to achieve superior performance across the entire item popularity spectrum, effectively balancing the needs for memorization and generalization.
6.3. Ablation Study
An ablation study was conducted on the KuaiRand dataset to understand the contribution of each component within FlexCode.
The following are the results from Table 3 of the original paper:
| Model Variant | Recall@10 | NDCG@10 |
| :--- | :--- | :--- |
| FlexCode (Full) | 0.0825 | 0.0632 |
| FlexCode (CID Only) | 0.0405 | 0.0372 |
| FlexCode (SID Only) | 0.0511 | 0.0401 |
| w/o MoE Gating (Fixed Split) | 0.0791 | 0.0598 |
| w/o Alignment Loss | 0.0809 | 0.0615 |
Analysis of Ablation Study:
- Dual-Codebook Structure:
  - FlexCode (CID Only): Using only collaborative IDs drops NDCG@10 from 0.0632 to 0.0372.
  - FlexCode (SID Only): Using only semantic IDs also drops it substantially, from 0.0632 to 0.0401.
  - Conclusion: Separating and then combining collaborative and semantic representations through a dual-codebook structure is critical. Neither signal type alone approaches the performance of their combined, disentangled use, validating the core architectural decision.
- Popularity-Aware MoE Gating:
  - w/o MoE Gating (Fixed Split): Disabling the MoE gating network and defaulting to a fixed 50/50 token split drops NDCG@10 from 0.0632 to 0.0598.
  - Conclusion: The adaptive allocation mechanism is crucial. While a fixed dual codebook (FlexCode-Fix in Table 1) is already beneficial, dynamically adapting the allocation to item popularity further improves performance by optimizing capacity allocation for different item regimes.
- Cross-Codebook Alignment Loss:
  - w/o Alignment Loss: Removing the alignment loss, which keeps the two codebooks coherent, drops NDCG@10 from 0.0632 to 0.0615.
  - Conclusion: This confirms the importance of the Cross-Codebook Alignment (CCA) objective. Without it, the specialized codebooks can drift into unrelated spaces, diminishing the effectiveness of the combined representation. The InfoNCE loss acts as a regularizer that keeps the shared space coherent (a generic sketch of such a loss follows this list).
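As an illustration of what such an alignment term can look like, the following is a generic InfoNCE sketch that pulls together the CF and semantic embeddings of the same item while pushing apart other items in the batch. It is not the paper's exact CCA objective; the tensor shapes and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_alignment(cf_emb: torch.Tensor, sem_emb: torch.Tensor, tau: float = 0.07):
    """Generic InfoNCE-style alignment between two embedding views of the same items.

    cf_emb, sem_emb: (batch, dim) collaborative and semantic embeddings of the same items.
    The (CF, semantic) pair of an item is the positive; other in-batch items are negatives.
    """
    cf = F.normalize(cf_emb, dim=-1)
    sem = F.normalize(sem_emb, dim=-1)
    logits = cf @ sem.t() / tau                               # (batch, batch) similarities
    targets = torch.arange(cf.size(0), device=cf.device)
    # Symmetric: align CF -> semantic and semantic -> CF.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example usage with random embeddings
print(infonce_alignment(torch.randn(32, 64), torch.randn(32, 64)).item())
```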
6.3.1. Token Budget Sensitivity
The paper also analyzed FlexCode's robustness under varying token budgets on the KuaiRand dataset.
The following are the results from Table 4 of the original paper:
| Model | L = 3 | L = 4 | L = 5 | L = 6 |
|---|---|---|---|---|
| FlexCode (SID Only) | 0.0401 | 0.0415 | 0.0418 | 0.0420 |
| FlexCode (CID Only) | 0.0372 | 0.0389 | 0.0395 | 0.0397 |
| FlexCode-Fix (50/50 Split) | 0.0598 | 0.0615 | 0.0619 | 0.0621 |
| FlexCode (ours) | 0.0632 | 0.0685 | 0.0691 | 0.0693 |
Analysis:
- FlexCode (ours) consistently achieves the strongest performance across all token budgets, significantly outperforming SID Only, CID Only, and even FlexCode-Fix.
- As the token budget increases, performance generally improves for all models, indicating that more tokens allow for richer representations.
- Importantly, FlexCode remains strong even with a reduced token budget (L = 3), where it still achieves an NDCG@10 of 0.0632. Its adaptive capacity reallocation therefore makes more efficient use of limited representational resources, a crucial advantage for real-world deployment where computational and memory budgets are constrained (a small numeric illustration follows this list).
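As a small numeric illustration of why an adaptive split can make a tight budget go further, the snippet below rounds a gate value g into integer token counts under L = 3. The gate values are made up and the rounding rule is an assumption, not the paper's mechanism.

```python
def split_budget(g: float, budget: int) -> tuple[int, int]:
    """Split a fixed token budget between CF and semantic codebooks for a gate value g in [0, 1]."""
    n_cf = round(g * budget)
    return n_cf, budget - n_cf

# Under L = 3, a head-like item (g = 0.8) gets 2 CF + 1 semantic token,
# while a tail-like item (g = 0.2) gets 1 CF + 2 semantic tokens.
print(split_budget(0.8, 3), split_budget(0.2, 3))
```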
6.3.2. Hyper-Parameter Analysis
The paper also examined the effect of key hyperparameters on FlexCode's performance (specifically NDCG@10 on KuaiRand).
The following are the results from Table 5 of the original paper:
| Parameter Setting | K | d | λalign | λsmooth | NDCG@10 |
|---|---|---|---|---|---|
| Default (Base) | 512 | 64 | 0.1 | 0.01 | 0.0632 |
| K Variation | 256 | 64 | 0.1 | 0.01 | 0.0603 |
| K Variation | 1024 | 64 | 0.1 | 0.01 | 0.0635 |
| d Variation | 512 | 32 | 0.1 | 0.01 | 0.0611 |
| d Variation | 512 | 128 | 0.1 | 0.01 | 0.0639 |
| λalign Variation | 512 | 64 | 0.01 | 0.01 | 0.0618 |
| λalign Variation | 512 | 64 | 0.5 | 0.01 | 0.0625 |
| λalign Variation | 512 | 64 | 1.0 | 0.01 | 0.0609 |
| λsmooth Variation | 512 | 64 | 0.1 | 0.001 | 0.0621 |
| λsmooth Variation | 512 | 64 | 0.1 | 0.05 | 0.0628 |
| λsmooth Variation | 512 | 64 | 0.1 | 0.1 | 0.0615 |
Analysis:
- Codebook Size (K): Increasing K from 256 to 1024 generally improves performance, but with diminishing returns (0.0603 to 0.0635), suggesting that a larger codebook adds expressive power until it saturates.
- Embedding Dimension (d): Similarly to K, increasing d from 32 to 128 yields slight improvements (0.0611 to 0.0639), indicating that higher-dimensional embeddings capture more nuanced information.
- Alignment Loss Weight (λalign): Performance is relatively robust to moderate changes in λalign. The default value of 0.1 performs best, while very low (0.01) or very high (1.0) values cause slight degradation, underscoring the need for balanced alignment: too little alignment lets the codebooks drift, too much forces entanglement.
- Smoothness Loss Weight (λsmooth): The model is also stable across a range of values. The default 0.01 performs well, with slightly worse results at 0.001 and 0.1, suggesting that moderate smoothness regularization stabilizes the router without overly constraining its adaptability. (A sketch of the overall objective in which these weights appear follows this list.)
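For reference, a plausible form of the combined training objective in which these weights appear, assuming simple additive weighting of the recommendation, alignment, and router-smoothness losses (the paper's exact formulation may differ):

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{align}}\,\mathcal{L}_{\text{align}} + \lambda_{\text{smooth}}\,\mathcal{L}_{\text{smooth}}$$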
In summary, FlexCode is generally stable across a broad range of hyperparameter settings, implying that it is not overly sensitive to precise tuning, which is a desirable characteristic for practical deployment.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces FlexCode, a novel dual-codebook framework for generative recommendation. It addresses two fundamental limitations of existing generative recommenders: representation entanglement within a single codebook and static capacity allocation that ignores item popularity. FlexCode successfully disentangles collaborative and semantic item representations into two dedicated codebooks. A popularity-aware Mixture-of-Experts (MoE) router dynamically allocates a fixed token budget between these two codebooks on a per-item basis, emphasizing collaborative signals for popular items (memorization) and semantic signals for long-tail items (generalization). An alignment and smoothness objective ensures coherence and stability. Extensive experiments on public and industrial datasets demonstrate that FlexCode consistently outperforms strong baselines in terms of overall accuracy and, crucially, exhibits superior tail robustness. The paper concludes that FlexCode offers a new and effective mechanism for token representation in generative recommenders, providing a valuable perspective on balancing memorization and generalization in token-based recommendation models.
7.2. Limitations & Future Work
The authors acknowledge several promising directions for future work:
- Extension to Richer Multi-modal Descriptors: The current semantic codebook uses textual and categorical metadata. Future work could extend adaptive capacity allocation to richer multi-modal item descriptors (e.g., image, audio, and video features) to further enhance semantic understanding.
- User-Side Modeling: The framework could be extended to user-side modeling, enabling joint reasoning over user and item codebooks and more nuanced personalization by also adapting user representations.
- Integration with LLMs and Online Learning: Future research could integrate FlexCode with large language models (LLMs) and online learning pipelines, studying how popularity-aware tokenization interacts with exploration, temporal drift (changes in item popularity or user preferences over time), and calibration in real-world, dynamic deployments.
- Fairness and Exposure: It is important to examine the fairness and exposure of long-tail content under different routing strategies. While FlexCode aims to improve long-tail performance, the dynamic allocation mechanism could inadvertently introduce new biases or affect item exposure if not carefully designed.
- Theoretical Understanding: Developing theoretical tools to understand when and why dual-codebook architectures outperform unified tokenizations would further solidify the foundations of this approach.
7.3. Personal Insights & Critique
This paper presents a highly intuitive and impactful solution to a well-recognized problem in recommendation: balancing the needs of popular vs. long-tail items. The core insight that memorization (via CF signals) and generalization (via semantic signals) should be disentangled and adaptively balanced is very powerful.
Inspirations and Transferability:
- Adaptive Resource Allocation: The idea of dynamically allocating computational or representational resources based on the inherent characteristics of data points (e.g., popularity, uncertainty, sparsity) is broadly applicable. This could extend beyond recommendation to other machine learning tasks with skewed data distributions, such as natural language processing (e.g., allocating more complex models to rare words) or computer vision (e.g., dedicating more processing to unusual objects).
- Disentangled Representations: The success of FlexCode in disentangling collaborative and semantic signals highlights the general value of learning specialized representations for different aspects of the data. The same principle could be applied in multi-modal learning, where different modalities are processed by specialized subnetworks before being adaptively combined.
- Hybrid Modeling: FlexCode demonstrates a sophisticated form of hybrid recommendation, not by simply averaging scores, but by deeply integrating and adaptively weighting different information sources at the representation-learning stage. This could inspire more advanced hybrid architectures.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Defining "Popularity": The paper uses empirical popularity (interaction frequency) and other features such as age and sparsity in the MoE router. However, "popularity" itself is dynamic: an item may be popular for a short burst (e.g., trending news) and then fade, so static popularity metrics may not fully capture this temporal drift. Future work could explore dynamic popularity signals or context-aware routing that considers the current session or user intent.
- Overhead of Dual Codebooks and MoE: Although the paper calls the MoE router "lightweight," maintaining two separate RQ-VAEs and codebooks, plus the router and alignment objective, adds architectural complexity and potential computational overhead compared to a single-codebook system. A deeper analysis of the training and inference compute and memory footprint, especially for the RQ-VAE components, would be valuable for industrial applications.
- Interpretability of Allocation: The MoE router allocates tokens based on popularity features, but exactly why a particular allocation works best deserves further investigation. The sigmoid-based masks provide differentiability but can obscure clear thresholds; more interpretable routing mechanisms or post-hoc analysis of router decisions could offer deeper insights.
- Generalizability of the SASRec-like Architecture for CF Embeddings: The collaborative codebook relies on a SASRec-like architecture to derive collaborative embeddings. While effective, this specific choice may have its own limitations in extremely sparse or highly dynamic environments; exploring other state-of-the-art CF embedding methods could further improve collaborative signal quality.
- Cold-Start Users: The paper focuses heavily on cold-start items. While improving long-tail item discovery is crucial, the framework's performance on cold-start users (who themselves have very few interactions) is not explicitly detailed. An adaptive allocation for user representations, mirroring the item-side FlexCode, would be a logical next step.

Overall, FlexCode is an elegant and effective solution that pushes the boundaries of generative recommendation. Its emphasis on adaptive capacity allocation and disentangled representations provides a strong foundation for future research on more robust and fair recommender systems.