
Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation

Published: 11/15/2025



This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces FlexCode, a dual codebook framework that balances collaborative filtering and semantic understanding in generative recommendation systems. It combines adaptive token allocation with a lightweight mixture-of-experts model and outperforms strong baselines.

Abstract

Generative recommendation has recently emerged as a powerful paradigm that unifies retrieval and generation, representing items as discrete semantic tokens and enabling flexible sequence modeling with autoregressive models. Despite its success, existing approaches rely on a single, uniform codebook to encode all items, overlooking the inherent imbalance between popular items rich in collaborative signals and long-tail items that depend on semantic understanding. We argue that this uniform treatment limits representational efficiency and hinders generalization. To address this, we introduce FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between a collaborative filtering (CF) codebook and a semantic codebook. A lightweight MoE dynamically balances CF-specific precision and semantic generalization, while an alignment and smoothness objective maintains coherence across the popularity spectrum. We perform experiments on both public and industrial-scale datasets, showing that FlexCode consistently outperforms strong baselines. FlexCode provides a new mechanism for token representation in generative recommenders, achieving stronger accuracy and tail robustness, and offering a new perspective on balancing memorization and generalization in token-based recommendation models.


1. Bibliographic Information

1.1. Title

The central topic of this paper is "Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation". This title highlights the paper's core idea: combining semantic understanding (semantics) with collaborative interaction patterns (signals) through a dual codebook approach to improve generative recommendation systems.

1.2. Authors

The authors and their affiliations are:

Zheng Hui: University of Cambridge (zh2483@columbia.edu)
Xiaokai Wei, Chen Wang, Alejandro Peláez, Michelle Gong: Roblox Corporation ({xwei, cwang, apelaez, mgong}@roblox.com)
Reza Shirkavand: University of Maryland (rezashkv@cs.umd.edu)
Weizhi Zhang: University of Illinois Chicago (wzhan42@uic.edu)

The author list indicates a collaboration between academic institutions (University of Cambridge, University of Maryland, University of Illinois Chicago) and an industrial research lab (Roblox Corporation), suggesting a blend of theoretical rigor and practical application focus.

1.3. Journal/Conference

The paper is listed as an arXiv preprint arXiv:2511.20673v1. This means it has not yet been formally peer-reviewed and published in a journal or conference proceedings. arXiv is a widely respected open-access repository for preprints in fields like computer science, physics, mathematics, and more, allowing researchers to share their work rapidly. While it's a common practice for disseminating new research, it implies the work has not yet undergone formal academic scrutiny by a program committee or journal editors.

1.4. Publication Year

The publication timestamp provided is 2025-11-15T05:51:38.000Z. The publication year is 2025.

1.5. Abstract

Generative recommendation, a new paradigm, unifies item retrieval and generation by representing items as discrete semantic tokens for autoregressive modeling. However, current methods use a single, uniform codebook for all items, which fails to account for the disparity between popular items (rich in collaborative signals) and long-tail items (reliant on semantic understanding). This uniform treatment leads to inefficient representation and poor generalization.

To address this, the paper introduces FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between two specialized codebooks: a collaborative filtering (CF) codebook and a semantic codebook. It uses a lightweight Mixture-of-Experts (MoE) router to dynamically balance the precision of CF for popular items and the generalization of semantics for long-tail items. An alignment and smoothness objective ensures coherence across the entire popularity spectrum.

Experiments on public and industrial-scale datasets show that FlexCode consistently outperforms strong baselines. The paper concludes that FlexCode offers a novel mechanism for token representation in generative recommenders, achieving superior accuracy and robustness for long-tail items, and providing a fresh perspective on balancing memorization and generalization in token-based recommendation models.

1.6. Original Source Link

Official Source: https://arxiv.org/abs/2511.20673v1
PDF Link: https://arxiv.org/pdf/2511.20673v1.pdf
Publication Status: This paper is a preprint available on arXiv, meaning it has not yet been formally peer-reviewed or published in a conference or journal.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve lies within the emerging field of generative recommendation. Traditional recommender systems, like collaborative filtering (CF), excel at identifying head items (popular items with abundant interaction data) due to their ability to memorize fine-grained co-occurrence patterns. However, they systematically struggle with long-tail and cold-start items (items with sparse interaction data) because they lack mechanisms for semantic generalization. This results in a fundamental memorization bias.

Generative recommenders, which map items to discrete semantic tokens and use autoregressive models (like Transformers) to predict future items, offer advantages such as compatibility with large language models and better generalization through multimodal supervision (e.g., text, visual, metadata semantics).

Despite these benefits, current generative recommenders suffer from two critical limitations:

Representation Entanglement: Most existing approaches use a single, shared codebook for all items. This forces the model to compress both semantic content information and collaborative interaction information into the same quantized space. This coupling leads to interference, where semantic signals can dilute collaborative patterns for head items, and unreliable collaborative signals can dominate for tail items, harming cold-start behavior. The result is representational collapse, where neither factor is optimally modeled.
Static Capacity Allocation: Current methods allocate a fixed representational budget (same number and type of codebook tokens) to every item, regardless of its popularity or data sparsity. This is inefficient given the long-tailed nature of recommender data. Head items, rich in interaction data, could benefit from high-capacity collaborative representations, while tail items, lacking interaction support, need more semantic evidence. Treating them uniformly wastes capacity for one group and starves the other, leading to overfitting on the head and under-representation of the tail.

The paper frames these limitations as a more general adaptive capacity allocation problem: how to judiciously allocate a fixed token budget between collaborative specificity and semantic generalization for each item, while avoiding destructive interference.

2.2. Main Contributions / Findings

The paper introduces FlexCode to address the identified challenges. Its primary contributions and findings are:

Identification of Structural Limitations: The authors formally identify and define two key structural limitations in existing generative recommenders: representation entanglement in a single shared codebook and static capacity allocation that disregards item popularity and data sparsity. They frame these as an adaptive capacity allocation problem.
Introduction of FlexCode Framework: They propose FlexCode, a novel framework that factorizes item representations into a collaborative codebook and a semantic codebook. This disentanglement allows each codebook to specialize.
Popularity-Aware Dynamic Token Allocation: FlexCode incorporates a popularity-aware mixture-of-experts (MoE) gate. This gate dynamically allocates a fixed token budget between the collaborative and semantic codebooks on a per-item basis. This means popular items receive more collaborative tokens, while long-tail items receive more semantic tokens, optimizing representation for different data regimes.
Cross-Codebook Alignment and Smoothness Objectives: The framework includes an alignment objective to prevent the two codebooks from drifting apart and a smoothness objective to ensure coherent allocation across the popularity spectrum, promoting stability.
Consistent Performance Improvements: Through extensive experiments on both public benchmarks (Amazon-Beauty, Amazon-Sports, KuaiRand-1K) and a large-scale industrial dataset, FlexCode consistently outperforms strong baselines, including ItemID-based, Semantic-ID, and unified representation models, across standard ranking metrics like Recall@K and NDCG@K.
Enhanced Tail Robustness and Accuracy: FlexCode demonstrates significantly better performance on long-tail and cold-start items, providing a substantial NDCG@10 improvement for tail items on the industrial dataset, while also improving head item performance.
New Perspective on Memorization and Generalization: The work offers a new perspective on how to balance memorization (for popular items, leveraging collaborative signals) and generalization (for rare items, leveraging semantic information) in token-based recommendation models through adaptive capacity allocation.

3.1. Foundational Concepts

To fully understand FlexCode, a reader should be familiar with the following fundamental concepts in recommender systems, machine learning, and natural language processing:

Recommender Systems: At their core, recommender systems aim to predict user preferences for items. They help users discover new content (products, movies, news, etc.) and address information overload. The goal is to provide personalized suggestions.
Collaborative Filtering (CF): This is a widely used technique in recommender systems. It works on the principle that if two users shared similar tastes in the past, they are likely to have similar tastes in the future.
- Mechanism: CF models analyze historical user-item interactions (e.g., purchases, ratings, clicks) to find patterns.
- Strengths: Highly effective for head items (popular items) where there's abundant interaction data, as it can memorize complex co-occurrence relationships.
- Weaknesses: Suffers from the cold-start problem (difficulty making recommendations for new users or new items due to lack of historical data) and the long-tail problem (poor performance on tail items with very few interactions). It lacks semantic understanding of items.
Generative Recommendation: This is a newer paradigm that rephrases the recommendation task.
- Traditional View: Recommenders predict a score for each item, then rank items by score.
- Generative View: Items are treated as discrete "words" or "tokens." The task becomes akin to language modeling, where an autoregressive model learns to predict the next item (or sequence of item tokens) given a user's past interaction sequence. This allows integration with powerful large language models (LLMs).
Item ID vs. Semantic ID:
- Item ID: A unique identifier assigned to each item (e.g., product ID 12345). Traditional CF models often learn embeddings directly for these IDs. These are opaque and carry no inherent meaning.
- Semantic ID: An identifier that carries inherent meaning, often derived from item features like text descriptions, images, or categories. In generative recommendation, semantic tokens are typically created by quantizing item embeddings (e.g., converting a continuous vector representing an item's description into a discrete code from a dictionary). These are interpretable to some extent.
Vector Quantization (VQ) / Codebooks:
- Vector Quantization (VQ): A process of mapping continuous-valued vectors into discrete "code vectors" from a finite set (a codebook). It's like finding the closest "template" vector in a dictionary for a given continuous input vector.
- Codebook: A learnable dictionary of discrete "code vectors" or "embeddings." Each entry in the codebook is a vector. When an item is quantized, it's assigned one or more indices corresponding to these code vectors. This allows large continuous spaces to be represented in a discrete, compact way. VQ-VAE (Vector Quantized Variational Autoencoder) is a common architecture for learning such codebooks.
Autoregressive Models / Transformer Architecture:
- Autoregressive Model: A model that predicts the next element in a sequence based on all preceding elements. In generative recommendation, this means predicting the next item token(s) given the history of previous item tokens.
- Transformer: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). It revolutionized sequence modeling, especially in natural language processing. Its core mechanism is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element. Transformers are highly effective for capturing long-range dependencies in sequences and are the backbone of most large language models. In this context, a Transformer decoder is typically used for sequence generation.
Mixture-of-Experts (MoE):
- Concept: An MoE model consists of multiple "expert" neural networks and a "gating network." For a given input, the gating network learns to select (or combine outputs from) the most appropriate expert(s).
- Purpose: MoE allows a model to specialize different parts of its network on different types of inputs, potentially improving efficiency and performance by having experts focus on specific sub-problems. It's useful for handling heterogeneous data or tasks.
InfoNCE Loss (Info Noise Contrastive Estimation):
- Concept: A type of contrastive loss function commonly used in self-supervised learning. It aims to pull together representations of "positive pairs" (e.g., different augmentations of the same data point) while pushing apart "negative pairs" (e.g., different data points).
- Purpose: In FlexCode, it's used for Cross-Codebook Alignment to ensure that the semantic and collaborative representations of the same item are similar, while those of different items are dissimilar. This helps maintain coherence between the two specialized codebooks.
- Mathematical Formula (simplified form for understanding): Let $q$ be the query embedding, $k_+$ be a positive key embedding, and $k_1, \dots, k_N$ be negative key embeddings. $ \mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(q, k_+) / \tau)}{\sum_{i=0}^{N} \exp(\mathrm{sim}(q, k_i) / \tau)} $ Where $\mathrm{sim}(\cdot, \cdot)$ is a similarity function (e.g., cosine similarity), $\tau$ is a temperature parameter that controls the sharpness of the distribution, and $k_0$ is usually $k_+$ . The denominator sums over the positive key and all negative keys.
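
The simplified InfoNCE formula above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: cosine similarity as $\mathrm{sim}$, a temperature of 0.1, and toy 2-dimensional vectors. The function names `cosine` and `info_nce` are hypothetical and do not come from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(query, positive, negatives, tau=0.1):
    """InfoNCE loss for one query; k_0 is the positive key (assumed convention)."""
    keys = [positive] + list(negatives)
    logits = [cosine(query, k) / tau for k in keys]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denominator)

q = [1.0, 0.0]
# Positive key points the same way as the query: loss should be near zero.
aligned = info_nce(q, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive key is orthogonal while a negative is close: loss should be large.
misaligned = info_nce(q, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

In this toy setup `aligned` is much smaller than `misaligned`, which is the behavior the alignment objective relies on: the loss shrinks when the two views of the same item agree and grows when a different item is the closer match.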

3.2. Previous Works

The paper discusses various types of previous works in recommender systems, broadly categorized into Item ID-based and Semantic ID-based methods.

Item ID-based Methods (Traditional Sequence Models): These models learn embeddings for discrete item IDs and capture sequential patterns in user interactions. They primarily rely on collaborative signals.

Caser (Tang and Wang, 2018): Uses convolutional sequence embedding to capture sequential patterns.
GRU4Rec (Hidasi et al., 2015): Applies Gated Recurrent Units (GRUs) to model session-based recommendations. Recurrent Neural Networks (RNNs) are good at processing sequences.
HGN (Ma et al., 2019): Hierarchical Gating Networks for sequential recommendation.
BERT4Rec (Sun et al., 2019): Adapts the Bidirectional Encoder Representations from Transformers (BERT) architecture for sequential recommendation, treating item sequences like sentences and predicting masked items.
SASRec (Kang and McAuley, 2018): Self-Attentive Sequential Recommendation. This model uses a Transformer-like self-attention mechanism to capture long-range dependencies in user interaction sequences, focusing on relevant past items.
- Brief concept of Self-Attention (as used in SASRec and Transformers): Given a sequence of input embeddings $X = [x_1, x_2, \dots, x_T]$ , self-attention calculates an output $Z = [z_1, z_2, \dots, z_T]$ where each $z_i$ is a weighted sum of all $x_j$ in the sequence. The weights are learned dynamically based on the relevance between $x_i$ and $x_j$ . This involves three matrices: Query ( $Q$ ), Key ( $K$ ), and Value ( $V$ ), derived from the input embeddings. $Q = XW_Q$ , $K = XW_K$ , $V = XW_V$ where $W_Q, W_K, W_V$ are learnable weight matrices. The Attention mechanism is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, Q, K, V are matrices containing query, key, and value vectors for each position in the sequence. $d_k$ is the dimension of the key vectors, used for scaling. The softmax ensures weights sum to 1.
S3-Rec (Zhou et al., 2020): Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization.
Recformer (Li et al., 2023): Leverages language representations for sequential recommendation.
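
The scaled dot-product attention formula quoted for SASRec can be illustrated with a small pure-Python sketch. It assumes identity projections ($W_Q = W_K = W_V = I$) and omits the causal mask that sequential recommenders normally apply, so it shows the formula itself rather than any specific model. The helper names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, returning outputs and attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Three positions with 2-dimensional embeddings (toy values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
```

Each row of `weights` sums to 1, and the first position attends more strongly to positions whose keys overlap with its query than to the orthogonal one.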

Semantic ID-based Methods (Generative Recommenders): These approaches map items to discrete tokens, often derived from item content (semantics), and use generative models.

VQRec (Hou et al., 2023): Learns vector-quantized item representations for transferable sequential recommenders, using a single codebook.
TIGER (Rajput et al., 2023): Recommender Systems with Generative Retrieval. A pioneering work in generative recommendation, representing items as discrete tokens.
LC-Rec (Zheng et al., 2024): Adapts Large Language Models by integrating collaborative semantics for recommendation.
COBRA (Yang et al., 2025): Sparse meets dense: Unified generative recommendations with cascaded sparse-dense representations.
URL (Unified Representation Learning) (Lin et al., 2025): A method aiming for unified semantic and ID representation learning for deep recommenders. This is a strong baseline because it also tries to combine semantic and collaborative information, but still within a single representation framework, making it a good comparison point for FlexCode's disentangled approach.

3.3. Technological Evolution

The field of recommender systems has evolved significantly:

Early Era (Pre-Deep Learning): Dominated by traditional Collaborative Filtering (e.g., Matrix Factorization like SVD or ALS, exemplified by Koren et al., 2009) and Content-Based Filtering (recommending items similar to those a user liked in the past based on item features). These methods were often effective but struggled with sparsity and lacked sophisticated sequential modeling capabilities.
Deep Learning Era (Post-2015): The introduction of deep learning architectures led to Deep Recommendation Systems. GRU4Rec brought RNNs to session-based recommendation, and SASRec and BERT4Rec leveraged the power of Transformer networks to capture complex sequential patterns in user interaction histories. These models improved performance significantly by learning richer, non-linear representations from interaction data. However, they largely remained ItemID-based, still facing cold-start and long-tail challenges due to their reliance on interaction history.
Generative Recommendation Era (Recent): Inspired by the success of Large Language Models (LLMs), the paradigm shifted from score prediction to sequence generation. Items are tokenized into discrete semantic identifiers, allowing recommender systems to leverage the power of autoregressive generative models. Works like VQRec and TIGER spearheaded this shift, offering better generalization by incorporating item semantics. The key challenge in this era, as highlighted by FlexCode, is how to effectively merge powerful semantic generalization with precise collaborative memorization when representing items as tokens.

FlexCode positions itself at the forefront of this generative recommendation era, specifically addressing the limitations of existing Semantic ID-based methods that struggle with representation entanglement and static capacity allocation due to their single codebook design.

3.4. Differentiation Analysis

Compared to the main methods in related work, FlexCode introduces core differences and innovations:

vs. Item ID-based methods (e.g., SASRec, BERT4Rec):
- Differentiation: FlexCode moves beyond opaque ItemID embeddings to discrete semantic tokens. While ItemID-based methods excel at memorization for popular items, they fundamentally lack semantic generalization for long-tail or cold-start items. FlexCode explicitly incorporates semantic codebooks to address this.
- Innovation: It provides a mechanism for semantic understanding directly within the item representation, which ItemID-based methods cannot.
vs. Semantic ID-based methods with a single codebook (e.g., VQRec, TIGER, URL):
- Differentiation: This is the most direct comparison. Existing Semantic ID-based methods use a single, uniform codebook to encode all items. FlexCode proposes a dual codebook architecture—one for collaborative signals (CF codebook) and one for semantic information (semantic codebook).
- Innovation:
  - Disentanglement: FlexCode explicitly disentangles collaborative and semantic factors, preventing representation entanglement and representational collapse that occur when a single codebook tries to compress heterogeneous information. This allows each codebook to optimize for its specific type of signal.
  - Adaptive Capacity Allocation: Crucially, FlexCode introduces a popularity-aware Mixture-of-Experts (MoE) router. This dynamically allocates a fixed token budget between the two codebooks on a per-item basis. Existing Semantic ID-based methods use static capacity allocation, assigning the same number and type of tokens to all items, irrespective of their popularity. FlexCode's adaptive allocation is a major innovation, allowing more CF tokens for head items (for memorization) and more semantic tokens for tail items (for generalization).
  - Cross-Codebook Coherence: FlexCode includes an alignment objective to ensure that despite being disentangled, the two codebooks maintain a coherent relationship, preventing them from drifting into unrelated spaces.
    
In essence, FlexCode innovates by recognizing that memorization and generalization are not mutually exclusive but should be adaptively emphasized based on item characteristics. It achieves this through a novel architecture that explicitly separates and then intelligently combines these two types of signals, a clear departure from the uniform and static approaches of prior generative recommenders.

4. Methodology

4.1. Principles

The core idea behind FlexCode is to address the limitations of representation entanglement and static capacity allocation in existing generative recommenders by explicitly disentangling collaborative and semantic signals and dynamically allocating representational capacity between them based on item popularity. The theoretical basis is that different types of items (popular vs. long-tail) benefit from different types of representations: popular items thrive on precise collaborative patterns (memorization), while long-tail items require robust semantic understanding (generalization). By introducing two specialized codebooks and an adaptive routing mechanism, FlexCode aims to optimize the representation for each item across the entire popularity spectrum.

4.2. Core Methodology In-depth (Layer by Layer)

The FlexCode framework is composed of several interconnected modules: Dual Codebook Construction, Cross-Codebook Alignment, Popularity-Aware Token Allocation, and Autoregressive Generation, all trained jointly. The overall framework is illustrated in Figure 1.

4.2.1. Generative Recommendation Framework

First, let's understand the general generative recommendation setup. Let $\mathcal{U}$ be the set of users and $\mathcal{T}$ be the set of items. For each user $u \in \mathcal{U}$ , we observe an interaction sequence: $ \mathbf{s}_u = [i_1, i_2, \dots, i_T] $ Here, $i_t \in \mathcal{T}$ represents the $t$ -th item interacted with by user $u$ . The goal of recommendation is to predict the next item $i_{T+1}$ given the preceding sequence $\mathbf{s}_u$ .

In the generative recommendation paradigm, this problem is reformulated as a conditional sequence generation task. Each item $i_t$ is first mapped into one or more discrete tokens using a vector-quantized codebook: $ \mathbf{z}_{i_t} = [z_{i_t}^1, z_{i_t}^2, \dots, z_{i_t}^L] $ Each $z_{i_t}^l \in \{1, \dots, K\}$ is an index to an entry in a learnable codebook $\mathbf{C} \in \mathbb{R}^{K \times d}$ . Here, $L$ is the number of tokens assigned to each item, and $d$ is the embedding dimension of each codebook entry. An autoregressive model $p_\theta$ (e.g., a Transformer decoder) is then trained to maximize the likelihood of the item sequence in this tokenized form: $ \mathcal{L}_{\mathrm{gen}} = - \sum_{u \in \mathcal{U}} \sum_{t=1}^{T} \log p_{\theta} (\mathbf{z}_{i_t} \mid \mathbf{z}_{i_{<t}}) $ This formulation treats recommendation like a language modeling problem, leveraging powerful generative models.
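
To make the tokenized objective concrete, here is a minimal sketch that flattens a user's item sequence into tokens and sums the negative log-likelihood token by token. Everything in it is assumed for illustration: the item-to-token table, the codebook size, and the uniform stand-in model (a real system would use a trained Transformer decoder). It also applies the chain rule at the token level, which is one common way to realize the per-item term in the loss.

```python
import math

K = 4  # toy codebook size (assumption)
L = 2  # tokens per item (assumption)

# Hypothetical item -> token-tuple table, as a quantizer would produce it.
item_tokens = {"item_a": (0, 3), "item_b": (1, 1), "item_c": (2, 0)}

def tokenize(sequence):
    """Flatten an item sequence into one token sequence."""
    return [tok for item in sequence for tok in item_tokens[item]]

def uniform_model(prefix):
    """Stand-in for p_theta: ignores the prefix and predicts uniformly."""
    return [1.0 / K] * K

def sequence_nll(tokens, model):
    """Sum of -log p(token | all previous tokens)."""
    nll = 0.0
    for pos, tok in enumerate(tokens):
        probs = model(tokens[:pos])
        nll -= math.log(probs[tok])
    return nll

tokens = tokenize(["item_a", "item_b", "item_c"])
loss = sequence_nll(tokens, uniform_model)
```

With the uniform stand-in, the loss equals the number of tokens times $\log K$, which is the baseline any trained model should beat.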

4.2.2. Collaborative and Semantic Representations

To highlight the distinction FlexCode makes, it's important to understand how traditional approaches handle representations:

Traditional Collaborative Filtering (CF): Models learn an embedding matrix $\mathbf{E}_{\mathrm{CF}} \in \mathbb{R}^{|\mathcal{T}| \times d}$ directly from user-item interactions. The relevance score between user $u$ (with embedding $\mathbf{e}_u$ ) and item $i$ (with embedding $\mathbf{e}_i$ ) is typically computed as: $ r(u, i) = \mathbf{e}_u^{\top} \mathbf{e}_i $ This is often optimized with implicit feedback losses like Bayesian Personalized Ranking (BPR). These are good at capturing relational specificity but lack semantic generalization.
Semantic Encoders: These models map content features $\mathbf{x}_i$ (e.g., text descriptions, visual attributes) into a representation: $ \mathbf{e}_{\mathrm{sem}}(i) = f_{\mathrm{sem}}(\mathbf{x}_i) $ This captures modality-derived meaning independent of user interactions, offering better generalization to cold-start items, but often lacks collaborative precision.
Existing Semantic-ID Generative Recommenders: These methods (like Rajput et al., 2023) quantize semantic embeddings into discrete tokens using vector quantization: $ \hat{z}_i = \arg\min_{k} \|\mathbf{e}_{\mathrm{sem}}(i) - \mathbf{C}_k\|_2^2 $ The resulting token indices are used for sequence generation. The critical point here is that existing methods employ a single codebook $\mathbf{C}$ for all items, implicitly mixing collaborative and semantic signals. This uniform codebook assumption is a key limitation FlexCode addresses.
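
The nearest-code lookup in the vector quantization formula can be sketched as follows. The codebook values and the embedding are toy numbers chosen for illustration, and `quantize` is a hypothetical helper name, not an API from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(embedding, codebook):
    """Return the index k of the codebook entry closest to the embedding."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(embedding, codebook[k]))

# Toy codebook with K = 4 entries of dimension d = 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
index = quantize([0.9, 0.2], codebook)  # closest entry is [1.0, 0.0]
```

Here the embedding `[0.9, 0.2]` is mapped to index 1. Every item passes through the same table, which is the single-codebook assumption discussed above.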

4.2.3. Problem Formulation: Capacity Allocation

FlexCode introduces two separate codebooks: a collaborative codebook $\mathbf{C}_{\mathrm{CF}}$ and a semantic codebook $\mathbf{C}_{\mathrm{SEM}}$ . Given a fixed total token budget $L$ , the problem is formulated as learning a function to allocate tokens between these two codebooks for each item $i$ : $ L_{\mathrm{CF}}(i) + L_{\mathrm{SEM}}(i) = L $ Here, $L_{\mathrm{CF}}(i)$ and $L_{\mathrm{SEM}}(i)$ are the number of tokens for item $i$ drawn from the collaborative codebook and semantic codebook, respectively. The objective of capacity allocation is to learn an adaptive mapping, $g(i)$ , which determines the proportion of CF versus semantic tokens. This mapping is defined as: $ g(i) = \sigma (w \cdot \log ( \mathrm{pop}(i) ) + b) $ Where:

$\mathrm{pop}(i)$ is the empirical popularity (interaction frequency) of item $i$ . The logarithm $\log(\mathrm{pop}(i))$ is used to handle the skewed nature of popularity distributions.
$w$ and $b$ are learnable parameters.
$\sigma$ is the sigmoid function, which squashes the output to a range [0, 1]. The value $g(i) \in [0, 1]$ determines the proportion of CF tokens. Accordingly, the number of tokens allocated from each codebook is: $ L_{\mathrm{CF}}(i) = \lfloor g(i) \cdot L \rfloor , \quad L_{\mathrm{SEM}}(i) = L - L_{\mathrm{CF}}(i) . $ This means for head items (high $\mathrm{pop}(i)$ ), $g(i)$ approaches 1, allocating more CF tokens. For tail items (low $\mathrm{pop}(i)$ ), $g(i)$ decreases, emphasizing semantic tokens. This creates an adaptive dual-codebook encoding.
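
The allocation rule can be tried out numerically with a short sketch. The values of $w$, $b$, the budget $L$, and the two popularity counts are made up for illustration (in FlexCode $w$ and $b$ are learned), and `allocate_tokens` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def allocate_tokens(popularity, budget, w, b):
    """Split a fixed token budget into (CF tokens, semantic tokens).

    Implements g(i) = sigmoid(w * log(pop(i)) + b) and
    L_CF = floor(g * L), L_SEM = L - L_CF. Requires popularity > 0.
    """
    g = sigmoid(w * math.log(popularity) + b)
    n_cf = math.floor(g * budget)
    return n_cf, budget - n_cf

# Hypothetical parameter values; the paper learns w and b.
W, B, BUDGET = 1.0, -4.0, 4

head = allocate_tokens(100_000, BUDGET, W, B)  # very popular item
tail = allocate_tokens(5, BUDGET, W, B)        # rarely seen item
```

With these toy parameters the head item gets 3 CF tokens and 1 semantic token, while the tail item gets 0 CF tokens and 4 semantic tokens. One detail worth noting from the formula as written: because the sigmoid stays below 1 and the count is floored, an item receives the entire budget as CF tokens only when $g(i)$ saturates to exactly 1.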

The resulting generation objective then combines tokens from both sources: $ \mathcal{L}_{\mathrm{gen}}^{\mathrm{dual}} = - \sum_{u \in \mathcal{U}} \sum_{t=1}^{T} \log p_{\theta} \big( [\mathbf{z}_{i_t}^{\mathrm{CF}}, \mathbf{z}_{i_t}^{\mathrm{SEM}}] \mid \mathbf{z}_{i_{<t}} \big) $ If $g(i)=1$ , this reduces to pure CF-based tokenization; if $g(i)=0$ , it is pure semantic tokenization. This lays the foundation for the Dynamic Dual-Codebook Learning framework.
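
One way to picture the concatenated token list $[\mathbf{z}^{\mathrm{CF}}, \mathbf{z}^{\mathrm{SEM}}]$ is sketched below. The source does not specify how the two token sets are merged into a single vocabulary, so the details here are assumptions made purely for illustration: the first codes of each sequence are kept, and semantic indices are shifted by the CF codebook size so that the two codebooks do not collide. The sizes and the helper name `compose_tokens` are hypothetical.

```python
K_CF = 256   # hypothetical CF codebook size
K_SEM = 256  # hypothetical semantic codebook size

def compose_tokens(cf_codes, sem_codes, n_cf, budget):
    """Build one item's token list from two code sequences.

    Assumption: keep the first n_cf CF codes and the first
    (budget - n_cf) semantic codes, offsetting semantic indices
    by K_CF so both codebooks share one vocabulary.
    """
    n_sem = budget - n_cf
    cf_part = list(cf_codes[:n_cf])
    sem_part = [K_CF + code for code in sem_codes[:n_sem]]
    return cf_part + sem_part

cf_codes = [5, 9, 2, 7]
sem_codes = [1, 3, 8, 4]
head_tokens = compose_tokens(cf_codes, sem_codes, n_cf=3, budget=4)
tail_tokens = compose_tokens(cf_codes, sem_codes, n_cf=0, budget=4)
```

Under these assumptions the head item is encoded mostly by CF codes, and the tail item entirely by offset semantic codes, matching the two extremes described for $g(i)$.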

4.2.4. Dual Codebook Construction

To tackle the representation entanglement issue, FlexCode uses two specialized codebooks.

4.2.4.1. Semantic Codebook Learning (SCL)

The Semantic Codebook aims to capture the intrinsic meaning of an item from its associated textual and categorical metadata.

Semantic Embedding: For each item $i$ , its descriptive attributes (brand, category, price, title) are concatenated into a unified text string. This string is then processed by a powerful pretrained text embedding model (e.g., from an LLM) to obtain a dense semantic embedding $\mathbf{e}_i^{\mathrm{sem}} \in \mathbb{R}^{d_{\mathrm{sem}}}$ .
Quantization: These continuous semantic embeddings are then converted into a discrete codebook using a Residual Quantization Variational Autoencoder (RQ-VAE) (Lee et al., 2022). For each $\mathbf{e}_i^{\mathrm{sem}}$ , the RQ-VAE encodes it into a sequence of $L_{\mathrm{sem}}$ discrete code vectors, denoted as $\mathbf{c}_i^{\mathrm{sem}} = [\mathbf{c}_{i,1}^{\mathrm{sem}}, \ldots, \mathbf{c}_{i,L_{\mathrm{sem}}}^{\mathrm{sem}}]$ . Each $\mathbf{c}_{i,j}^{\mathrm{sem}}$ is an index to an entry in a shared semantic codebook $\mathcal{V}_{\mathrm{sem}}$ .
Loss Function: The total reconstruction loss for the semantic codebook is defined as: $ \mathcal{L}_{\mathrm{SCL}} = \mathbb{E}_{\mathbf{e}_i^{\mathrm{sem}} \sim D_{\mathrm{sem}}} \big[ \|\mathbf{e}_i^{\mathrm{sem}} - \mathrm{Decode}(\mathbf{c}_i^{\mathrm{sem}}) \|_2^2 \big] + \sum_{j=1}^{L_{\mathrm{sem}}} \mathcal{L}_{\mathrm{vq}} \big( \mathbf{z}_{i,j-1}^{\mathrm{sem}}, \mathbf{q}_{i,j}^{\mathrm{sem}} \big) \qquad (1) $ Where:
- $\mathbf{e}_i^{\mathrm{sem}}$ is the continuous semantic embedding of item $i$ .
- $D_{\mathrm{sem}}$ is the distribution of semantic embeddings.
- $\mathrm{Decode}(\cdot)$ is the decoder component of the RQ-VAE that reconstructs the embedding from its quantized representation. The term $\|\mathbf{e}_i^{\mathrm{sem}} - \mathrm{Decode}(\mathbf{c}_i^{\mathrm{sem}}) \|_2^2$ is the reconstruction loss, ensuring the quantized representation can faithfully recover the original embedding.
- $\mathcal{L}_{\mathrm{vq}}$ is the standard vector-quantization loss from VQ-VAE (typically a codebook loss plus a commitment loss), applied at each residual quantization step.
- $\mathbf{z}_{i,j-1}^{\mathrm{sem}}$ and $\mathbf{q}_{i,j}^{\mathrm{sem}}$ refer to intermediate representations in the Residual Quantization process, where the current residual is quantized based on previous quantization steps. The output of this stage for item $i$ is its semantic codebook representation $\mathbf{C}_i^{\mathrm{SEM}}$ .
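The residual quantization step can be illustrated with a minimal sketch. It assumes a single shared codebook stored as a plain Python list and greedy nearest-neighbour assignment at each level. The function names and the toy codebook are hypothetical, and the learned encoder, decoder, and VQ losses are deliberately left out.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    dists = [sum((c - v) ** 2 for c, v in zip(entry, vec)) for entry in codebook]
    return dists.index(min(dists))


def residual_quantize(embedding, codebook, num_levels):
    """Greedy residual quantization: quantize, subtract, repeat."""
    residual = list(embedding)
    recon = [0.0] * len(embedding)
    codes = []
    for _ in range(num_levels):
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Accumulate the selected entry and pass the leftover to the next level.
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, recon


# Toy 2-D codebook (hypothetical values, not from the paper).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
codes, recon = residual_quantize([1.4, 0.6], codebook, num_levels=2)
print(codes, recon)
```

In this toy case the second level refines the first: the reconstruction after two levels is closer to the input than the first selected entry alone.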

4.2.4.2. Collaborative Codebook Learning (CCL)

The Collaborative Codebook is designed to capture high-order co-purchase or co-view patterns.

Collaborative Embedding: A SASRec-like architecture (a Transformer-based sequential model) is used to derive a context-aware collaborative embedding $\mathbf{e}_i^{\mathrm{col}} \in \mathbb{R}^{d_{\mathrm{col}}}$ for each item $i$ . This embedding implicitly encodes an item's position within interaction sequences.
Quantization: Similar to the semantic codebook, an RQ-VAE is applied to discretize these collaborative embeddings into a sequence of $L_{\mathrm{col}}$ discrete code vectors, $\mathbf{c}_i^{\mathrm{col}}$ , drawn from a collaborative codebook $\mathcal{V}_{\mathrm{col}}$ .
Loss Function: The loss function $\mathcal{L}_{\mathrm{CCL}}$ is defined analogously to $\mathcal{L}_{\mathrm{SCL}}$ : $ \mathcal{L}_{\mathrm{CCL}} = \mathbb{E}_{\mathbf{e}_i^{\mathrm{col}} \sim D_{\mathrm{col}}} \left[ \| \mathbf{e}_i^{\mathrm{col}} - \mathrm{Decode}(\mathbf{c}_i^{\mathrm{col}}) \|_2^2 \right] + \sum_{j=1}^{L_{\mathrm{col}}} \mathcal{L}_{\mathrm{vq}} \big( \mathbf{z}_{i,j-1}^{\mathrm{col}}, \mathbf{q}_{i,j}^{\mathrm{col}} \big) \qquad (2) $ Where:
- $\mathbf{e}_i^{\mathrm{col}}$ is the continuous collaborative embedding of item $i$ .
- $D_{\mathrm{col}}$ is the distribution of collaborative embeddings.
- The terms are analogous to those in $\mathcal{L}_{\mathrm{SCL}}$ , ensuring reconstruction and effective quantization of collaborative signals.

The output for item $i$ is its collaborative codebook representation $\mathbf{C}_i^{\mathrm{COL}}$ .

4.2.5. Cross-Codebook Alignment (CCA)

While the two codebooks are specialized, they need to maintain some coherence to represent the same item consistently. The Cross-Codebook Alignment (CCA) objective ensures this.

Reconstructed Embeddings: The CCA objective operates on the reconstructed embeddings from the codebooks: $\tilde{\mathbf{e}}_i^{\mathrm{sem}} = \mathrm{Decode}(\mathbf{C}_i^{\mathrm{SEM}})$ and $\tilde{\mathbf{e}}_i^{\mathrm{col}} = \mathrm{Decode}(\mathbf{C}_i^{\mathrm{COL}})$ . This is crucial because it forces the alignment loss to act on the information actually preserved and decodable by the quantized representations.
Projection Heads: Let $P_{\mathrm{sem}}(\cdot)$ and $P_{\mathrm{col}}(\cdot)$ be projection heads (e.g., simple multi-layer perceptrons) that map these reconstructed embeddings into a shared latent space.
Alignment Loss: The alignment loss $\mathcal{L}_{\mathrm{CCA}}$ is defined using an InfoNCE objective: $ \mathcal{L}_{\mathrm{CCA}} = - \log \frac{ \exp \left( \mathrm{sim} ( P_{\mathrm{sem}} ( \tilde{\mathbf{e}}_i^{\mathrm{sem}} ) , P_{\mathrm{col}} ( \tilde{\mathbf{e}}_i^{\mathrm{col}} ) ) / \tau \right) }{ \sum_{j \in \mathcal{T}} \exp \left( \mathrm{sim} ( P_{\mathrm{sem}} ( \tilde{\mathbf{e}}_i^{\mathrm{sem}} ) , P_{\mathrm{col}} ( \tilde{\mathbf{e}}_j^{\mathrm{col}} ) ) / \tau \right) } $ Where:
- $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity function, measuring the similarity between two vectors.
- $\tau$ is a temperature parameter, which scales the logits before the exponential, controlling the sharpness of the distribution and the difficulty of distinguishing positive from negative pairs. A smaller $\tau$ makes the model focus more on hard negatives.
- The numerator calculates the similarity between the projected semantic and collaborative reconstructions of the same item $i$ (positive pair).
- The denominator sums similarities between the projected semantic reconstruction of item $i$ and projected collaborative reconstructions of all items $j \in \mathcal{T}$ (including $i$ itself and other items as negatives).

This loss function effectively pulls the representations of the same item (from its semantic and collaborative perspectives) closer in the shared latent space, while pushing representations of different items apart. This contrastive pressure also helps to regularize the codebooks, encouraging more uniform usage of codebook entries and mitigating codebook collapse (where only a few codebook entries are frequently used).
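The InfoNCE objective above can be sketched in a few lines. This is an illustrative version only: it takes already-projected vectors as plain lists, averages the per-item loss over a small batch, and uses $\tau = 0.1$ as an assumed value (the text does not state the temperature used for alignment). The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(sem, col, tau=0.1):
    """Mean InfoNCE loss; sem[i] and col[i] are projections of the same item."""
    losses = []
    for i, s in enumerate(sem):
        logits = [cosine(s, c) / tau for c in col]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


sem = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(sem, [[0.9, 0.1], [0.1, 0.9]])      # matching views agree
misaligned = info_nce(sem, [[0.1, 0.9], [0.9, 0.1]])   # matching views disagree
print(aligned, misaligned)
```

The loss is small when each item's semantic and collaborative views point the same way, and large when an item's semantic view is closer to another item's collaborative view.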

4.2.6. Popularity-Aware Token Allocation (PATA)

To adaptively allocate a fixed total token budget $L_{\mathrm{total}}$ between the codebooks, FlexCode uses a lightweight Mixture-of-Experts (MoE) router. The goal is to allocate more collaborative capacity to popular (head) items and more semantic capacity to sparse (tail) items.

Feature Vector: For each item $i$ , a feature vector $\mathbf{x}_i$ is constructed. This vector includes statistics relevant to item popularity and characteristics: $ \mathbf{x}_i = [ \log(1 + f_i), \mathrm{age}_i, \mathrm{sparsity}_i, \mathrm{uncertainty}_i ] $ Where:
- $f_i$ is the normalized interaction frequency of item $i$ . The $\log(1+f_i)$ transformation helps to handle skewed popularity distributions.
- $\mathrm{age}_i$ is the time since the item was introduced.
- $\mathrm{sparsity}_i$ is the inverse of its interaction density.
- $\mathrm{uncertainty}_i$ is the variance of its embedding during training, which can indicate how well-defined its representation is.

This rich feature vector allows the router to make nuanced allocation decisions beyond simple popularity.
MoE Router: A shallow Multi-Layer Perceptron (MLP) denoted as $\phi(\cdot)$ processes $\mathbf{x}_i$ to produce logits $[z_i^{\mathrm{col}}, z_i^{\mathrm{sem}}]$ . A temperature-scaled softmax then converts these logits into routing probabilities $\pi_i$ . The collaborative allocation ratio is defined as $\alpha_i = \pi_i^{\mathrm{col}}$ , representing the proportion of tokens to be drawn from the collaborative codebook.
Soft Token Allocation: This ratio determines the soft number of tokens for each codebook during training: $ \bar{L}_i^{\mathrm{col}} = \alpha_i L_{\mathrm{total}}, \quad \bar{L}_i^{\mathrm{sem}} = (1 - \alpha_i) L_{\mathrm{total}} . $
Sigmoid-based Masks: To maintain differentiability during training while effectively selecting tokens, sigmoid-based masks are used to softly select tokens: $ m_k^{\mathrm{col}} (\alpha_i) = \sigma \left( \frac{ \bar{L}_i^{\mathrm{col}} - (k - \frac{1}{2}) }{ \tau_m } \right) , \qquad m_k^{\mathrm{sem}} (\alpha_i) = \sigma \left( \frac{ \bar{L}_i^{\mathrm{sem}} - (k - \frac{1}{2}) }{ \tau_m } \right) $ Where:
- $m_k^{\mathrm{col}}(\alpha_i)$ and $m_k^{\mathrm{sem}}(\alpha_i)$ are the masking values for the $k$ -th token position for the collaborative and semantic codebooks, respectively.
- $\sigma(\cdot)$ is the sigmoid function.
- $\tau_m$ is a hyperparameter (set to 0.1) that controls the smoothness of the soft-to-hard transition. A smaller $\tau_m$ makes the transition sharper, closer to a hard cutoff.

These masks are applied to the codebook entries. The final combined representation $\mathbf{C}_i$ concatenates the masked codebooks. At inference time, the soft allocation is discretized to an integer token count (e.g., $\lfloor \alpha_i L_{\mathrm{total}} \rfloor$ collaborative tokens).
Router Stability: To ensure the router's stability and effective learning, two regularization objectives are added:
- Stratified load-balancing ( $\mathcal{L}_{\mathrm{lb}}$ ): This prevents any single expert (or codebook pathway) from being overloaded or underutilized, distributing the load evenly within popularity bands.
- Local smoothness regularizer ( $\mathcal{L}_{\mathrm{smooth}}$ ): This is applied to the allocation ratios $\alpha_i$ , encouraging similar items to have similar allocation patterns, leading to more stable routing.
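The routing and masking steps of this section can be sketched as follows. This is a minimal illustration under stated assumptions: the router MLP is replaced by two given logits, the temperature default of 1.0 is assumed, and assigning the remaining tokens to the semantic codebook at inference is an assumption that follows from the fixed budget rather than a detail confirmed by the text. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def collaborative_ratio(z_col, z_sem, temperature=1.0):
    """alpha_i = pi_i^col from a temperature-scaled softmax over two logits.

    For two logits the softmax reduces to a sigmoid of their scaled difference.
    """
    return sigmoid((z_col - z_sem) / temperature)


def soft_masks(alpha, total_tokens, tau_m=0.1):
    """Sigmoid masks m_k^col and m_k^sem for positions k = 1..L_total."""
    soft_col = alpha * total_tokens
    soft_sem = (1.0 - alpha) * total_tokens
    positions = range(1, total_tokens + 1)
    col = [sigmoid((soft_col - (k - 0.5)) / tau_m) for k in positions]
    sem = [sigmoid((soft_sem - (k - 0.5)) / tau_m) for k in positions]
    return col, sem


def hard_allocation(alpha, total_tokens):
    """Inference-time split: floor for collaborative, remainder for semantic (assumed)."""
    n_col = math.floor(alpha * total_tokens)
    return n_col, total_tokens - n_col


# Toy head item: the router logits favour the collaborative pathway.
alpha_head = collaborative_ratio(z_col=2.0, z_sem=0.0)
col_mask, sem_mask = soft_masks(0.75, total_tokens=4)
print(alpha_head, col_mask, sem_mask, hard_allocation(0.7, 4))
```

With $\alpha_i = 0.75$ and $L_{\mathrm{total}} = 4$ , the collaborative mask is close to 1 for the first three positions and close to 0 for the fourth, while the semantic mask keeps only its first position, so the soft masks approximate a 3/1 split while remaining differentiable in $\alpha_i$ .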

4.2.7. Autoregressive Generation (ARG)

After the item representations are formed by concatenating the dynamically allocated and masked tokens from the dual codebooks, an autoregressive Transformer model is used for the final recommendation task.

Sequence Formation: Given a user's historical sequence of items $S_u = (i_1, \dots, i_{T-1})$ , their corresponding combined codebooks $(\mathbf{C}_{i_1}, \dots, \mathbf{C}_{i_{T-1}})$ are retrieved.
Likelihood Maximization: The autoregressive Transformer model is trained to maximize the likelihood of the next item's codebook $\mathbf{C}_{i_T}$ . The loss is the standard cross-entropy over the sequence of predicted tokens: $ \mathcal{L}_{\mathrm{ARG}} = - \sum_{t=1}^{N_{\mathrm{seq}} - 1} \sum_{k=1}^{L_{\mathrm{total}}} \log P ( \mathbf{c}_{i_{t+1}, k} \mid \mathbf{C}_{S_u^{<t+1}}, \mathbf{c}_{i_{t+1}, <k} ) $ Where:
- $N_{\mathrm{seq}}$ is the length of the sequence.
- $L_{\mathrm{total}}$ is the total number of tokens per item.
- $P (\mathbf{c}_{i_{t+1}, k} \mid \mathbf{C}_{S_u^{<t+1}}, \mathbf{c}_{i_{t+1}, <k})$ is the probability of the $k$ -th token of item $i_{t+1}$ given the combined codebooks of previous items in the sequence ( $S_u^{<t+1}$ ) and the previously generated tokens of the current target item ( $\mathbf{c}_{i_{t+1}, <k}$ ).

This structure allows for token-level autoregression within an item's representation.
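The double sum in the autoregressive loss can be made concrete with a small sketch. It assumes the model's per-token distributions are already available as lists of probabilities. The Transformer that would produce them is out of scope here, and `arg_loss` and the toy inputs are hypothetical.

```python
import math


def arg_loss(token_probs, targets):
    """Negative log-likelihood summed over target items and token positions.

    token_probs[t][k] is the predicted distribution (probabilities over the
    token vocabulary) for token k of the t-th target item, conditioned on the
    history and on that item's earlier tokens; targets[t][k] is the true id.
    """
    loss = 0.0
    for item_probs, item_targets in zip(token_probs, targets):
        for dist, token in zip(item_probs, item_targets):
            loss -= math.log(dist[token])
    return loss


# Two target items, three tokens each, uniform predictions over 4 token ids.
uniform = [0.25, 0.25, 0.25, 0.25]
toy_probs = [[uniform] * 3, [uniform] * 3]
toy_targets = [[0, 2, 1], [3, 3, 0]]
toy_loss = arg_loss(toy_probs, toy_targets)
print(toy_loss)
```

With uniform predictions over four token ids, each of the six tokens contributes $\log 4$ to the loss. A model that assigns probability 1 to every correct token would reach a loss of 0.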

4.2.8. Overall Objective and Training

The entire FlexCode framework is trained jointly in an end-to-end manner. The final objective function is a weighted sum of all components: $ \mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{SCL}} + \mathcal{L}_{\mathrm{CCL}} + \lambda_{\mathrm{CCA}} \mathcal{L}_{\mathrm{CCA}} + \lambda_{\mathrm{ARG}} \mathcal{L}_{\mathrm{ARG}} + \lambda_{\mathrm{lb}} \mathcal{L}_{\mathrm{lb}} + \lambda_{\mathrm{smooth}} \mathcal{L}_{\mathrm{smooth}} $ Where:

$\mathcal{L}_{\mathrm{SCL}}$ : Semantic Codebook Learning loss (Equation 1).
$\mathcal{L}_{\mathrm{CCL}}$ : Collaborative Codebook Learning loss (Equation 2).
$\mathcal{L}_{\mathrm{CCA}}$ : Cross-Codebook Alignment loss.
$\mathcal{L}_{\mathrm{ARG}}$ : Autoregressive Generation loss.
$\mathcal{L}_{\mathrm{lb}}$ : Load-balancing regularization loss for the MoE router.
$\mathcal{L}_{\mathrm{smooth}}$ : Smoothness regularization loss for the MoE router's allocation ratios.
$\lambda_{(\cdot)}$ are hyperparameters that balance the contribution of each term.

This end-to-end training allows the semantic codebook, collaborative codebook, popularity-aware router, and autoregressive Transformer to co-adapt, leading to a cohesive and effective system that balances memorization and generalization dynamically.

The following figure (Figure 1 from the original paper) shows the overall architecture of FlexCode:

Figure 1: Overview of the FlexCode framework for generative recommendation. Each item is encoded by a dual codebook with collaborative and semantic codebooks, aligned via a cross-codebook contrastive objective. A popularity-aware Mixture-of-Experts (MoE) router adaptively allocates the budget between codebooks, and an autoregressive Transformer is trained on the resulting sequences to generate items.

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three public benchmarks and one large-scale proprietary industrial dataset to comprehensively evaluate FlexCode across diverse domains and data characteristics. The standard leave-last-out evaluation protocol is adopted, where the last item in a user's chronological interaction history is used for testing, and the second-to-last for validation. A 5-core filtering strategy is applied, meaning only users and items with at least 5 interactions are kept.
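The preprocessing protocol described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes interactions are `(user, item, timestamp)` tuples, applies the 5-core filter iteratively until it is stable (one common reading of "5-core"), and skips users with fewer than three interactions when splitting. The function names and toy data are hypothetical.

```python
from collections import Counter


def k_core_filter(interactions, k=5):
    """Iteratively drop users and items with fewer than k interactions."""
    data = list(interactions)
    while True:
        user_counts = Counter(u for u, _, _ in data)
        item_counts = Counter(i for _, i, _ in data)
        kept = [(u, i, t) for u, i, t in data
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(data):
            return kept
        data = kept


def leave_last_out(interactions):
    """Per user: last item -> test, second-to-last -> validation, rest -> train."""
    by_user = {}
    for u, i, _ in sorted(interactions, key=lambda x: x[2]):
        by_user.setdefault(u, []).append(i)
    splits = {}
    for u, items in by_user.items():
        if len(items) >= 3:
            splits[u] = {"train": items[:-2], "valid": items[-2], "test": items[-1]}
    return splits


# Toy data: 5 users x 5 items, plus one user (99) with a single interaction.
toy = [(u, i, 10 * u + i) for u in range(5) for i in range(5)] + [(99, 0, 1000)]
filtered = k_core_filter(toy, k=5)
splits = leave_last_out(filtered)
print(len(filtered), splits[0])
```

In the toy data, user 99 is removed by the filter, and each remaining user's chronologically last item becomes the test target.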

The following are the statistics from Table 2 of the original paper:

Dataset	#Users	#Items	#Interactions	Avg. Seq.
Beauty	22,363	12,101	198,360	8.87
Sports	35,598	18,357	296,175	8.32
KuaiRand	1,000	3.6M	11M	11.71
Proprietary	1.5M+	1M+	45M+	28.0

Public Benchmarks:
- Amazon-Beauty (McAuley et al., 2015): A relatively dense, medium-scale e-commerce dataset focused on beauty products. Its density implies more interaction data per user/item, making it suitable for evaluating memorization capabilities.
- Amazon-Sports and Outdoors (McAuley et al., 2015): A larger and sparser e-commerce dataset compared to Beauty. Its sparsity presents a greater challenge for models relying solely on collaborative signals and highlights the need for generalization.
- KuaiRand-1K (Gao et al., 2022): A large-scale dataset from a short-video recommendation platform, featuring rich side information. The "1K" refers to 1,000 users, but it's important to note the scale in terms of items and interactions (3.6M items, 11M interactions), making it a significant test for models handling large item spaces and diverse content.
- Choice Rationale: These datasets cover different scales, densities, and domains (e-commerce, short-video), providing a robust testbed for FlexCode's generalizability and ability to handle varying data characteristics.
Proprietary Industrial Dataset:
- Characteristics: This dataset comes from a commercial platform. It is described as containing tens of millions of users and tens of thousands of items, whereas Table 2 reports 1.5M+ users, 1M+ items, and 45M+ interactions. Interactions are logged over a year, with the last day used for evaluation to simulate an online inference scenario.
- Choice Rationale: Evaluating on this dataset is crucial for demonstrating FlexCode's effectiveness and scalability under real-world, production-level conditions, which often involve massive scale, noise, and complex user behaviors.

5.2. Evaluation Metrics

The paper assesses model performance using Recall@K and NDCG@K, with $K \in \{5, 10\}$ . These are standard ranking metrics in recommender systems, following common practice (Rajput et al., 2023).

5.2.1. Recall@K

Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. It focuses on the model's ability to find as many relevant items as possible from the user's ground truth set within the fixed recommendation list length $K$ . A higher Recall@K indicates that the model is effectively identifying relevant items.
Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|}{|\mathrm{Rel}_u|} $
Symbol Explanation:
- $|\mathcal{U}|$ : The total number of users in the evaluation set.
- $\mathrm{Rel}_u$ : The set of actual relevant items for user $u$ (i.e., the items the user interacted with in the test set).
- $\mathrm{Rec}_u(K)$ : The set of top $K$ items recommended by the model for user $u$ .
- $|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|$ : The number of relevant items found within the top $K$ recommendations for user $u$ .
- $|\mathrm{Rel}_u|$ : The total number of relevant items for user $u$ .
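The Recall@K formula above translates directly into code. The sketch below assumes recommendations are ranked lists and ground truth is a set per user. `recall_at_k` and the toy inputs are illustrative names, not taken from the paper.

```python
def recall_at_k(recommended, relevant, k):
    """Mean over users of |Rel_u ∩ Rec_u(K)| / |Rel_u|."""
    scores = []
    for user, rel in relevant.items():
        top_k = recommended[user][:k]
        hits = len(set(top_k) & set(rel))
        scores.append(hits / len(rel))
    return sum(scores) / len(scores)


recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": {"b"}, "u2": {"z"}}
print(recall_at_k(recommended, relevant, k=2))
```

Under the leave-last-out protocol each user has a single held-out item, so $|\mathrm{Rel}_u| = 1$ and the per-user score is either 0 or 1.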

5.2.2. NDCG@K (Normalized Discounted Cumulative Gain)

Conceptual Definition: NDCG@K is a metric that evaluates the quality of ranking, giving more weight to highly relevant items that appear at higher (earlier) positions in the recommendation list. It's "normalized" because the calculated DCG is divided by the Ideal DCG (the DCG of a perfect ranking) to ensure scores are comparable across different query sets. A higher NDCG@K means not only are relevant items found, but they are also ranked appropriately.
Mathematical Formula: $ \mathrm{NDCG@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\mathrm{DCG}_u(K)}{\mathrm{IDCG}_u(K)} $ Where: $ \mathrm{DCG}_u(K) = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)} $ And $\mathrm{IDCG}_u(K)$ is the maximum possible $\mathrm{DCG}_u(K)$ (i.e., the DCG value if the recommendations were perfectly ordered by relevance).
Symbol Explanation:
- $|\mathcal{U}|$ : The total number of users in the evaluation set.
- $\mathrm{rel}(j)$ : The relevance score of the item at position $j$ in the recommended list. For implicit feedback, relevance is usually binary (1 if relevant, 0 otherwise).
- $\log_2(j+1)$ : A logarithmic discount factor, which reduces the contribution of a relevant item as its position $j$ in the list increases. Items near the top of the list (smaller $j$ ) contribute more to the DCG score.
- $\mathrm{DCG}_u(K)$ : Discounted Cumulative Gain for user $u$ up to rank $K$ .
- $\mathrm{IDCG}_u(K)$ : Ideal Discounted Cumulative Gain for user $u$ up to rank $K$ . This is calculated by sorting all relevant items for user $u$ by their actual relevance and then applying the DCG formula. It serves as a normalization factor.
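For binary relevance, the NDCG@K definition above reduces to a short function. The sketch below is illustrative and handles a single user. `ndcg_at_k` is a hypothetical name, and the ideal DCG assumes all relevant items could be placed at the top of the list.

```python
import math


def ndcg_at_k(recommended, relevant, k):
    """NDCG@K for one user with binary relevance."""
    gains = [1 if item in relevant else 0 for item in recommended[:k]]
    dcg = sum((2 ** g - 1) / math.log2(j + 1)
              for j, g in enumerate(gains, start=1))
    # Ideal ranking: every relevant item (up to k) sits at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(j + 1) for j in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_at_k(["a", "b", "c"], {"a"}, k=3))  # relevant item at rank 1
print(ndcg_at_k(["b", "a", "c"], {"a"}, k=3))  # relevant item at rank 2
```

With one held-out item per user, the ideal DCG is 1, so a hit at rank $j$ simply scores $1 / \log_2(j+1)$ : a hit at rank 2 scores about 0.63 instead of 1.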

5.3. Baselines

The paper compares FlexCode against a comprehensive set of baselines, categorized into Item ID-based and Semantic ID-based methods.

5.3.1. Item ID-based methods

These are traditional sequential recommendation models that learn item representations directly from historical user-item interactions, typically using discrete item IDs. They focus on collaborative signals.

Caser (Tang and Wang, 2018): Uses convolutional filters to capture local sequential patterns.
GRU4Rec (Hidasi et al., 2015): An RNN-based model (GRU) for session-based recommendations, capturing temporal dependencies.
HGN (Ma et al., 2019): Hierarchical Gating Networks, designed to model complex user behaviors and item transitions.
BERT4Rec (Sun et al., 2019): Adapts the BERT model (a Transformer encoder) for sequential recommendation by using a masked item prediction task.
SASRec (Kang and McAuley, 2018): A Transformer-based model that employs a self-attention mechanism to capture relevant item transitions in long sequences, often considered a strong baseline for sequential recommendation.
S3-Rec (Zhou et al., 2020): Self-Supervised Learning for Sequential Recommendation, using mutual information maximization.
Recformer (Li et al., 2023): Leverages text-based language representations within a Transformer framework for sequential recommendation.

5.3.2. Semantic ID-based methods

These are more recent generative recommendation approaches that tokenize or quantize item representations into discrete semantic identifiers, often incorporating item content information.

VQ-Rec (Hou et al., 2023): Learns vector-quantized item representations (using a single codebook) for sequential recommenders, aiming for transferability.
TIGER (Rajput et al., 2023): A prominent generative retrieval model that represents items as discrete tokens, enabling autoregressive generation.
LC-Rec (Zheng et al., 2024): Adapts Large Language Models by integrating collaborative semantics for recommendation.
COBRA (Yang et al., 2025): A unified generative recommendation approach that combines sparse and dense representations.
URL (Unified Representation Learning) (Lin et al., 2025): A unified semantic and ItemID representation learning approach for deep recommenders. This baseline is particularly relevant as it also attempts to combine different types of signals, but (as FlexCode argues) likely within a single, entangled codebook.

These baselines are representative as they cover both traditional collaborative filtering approaches that focus on interaction patterns and the latest generative recommendation models that leverage semantic information. The inclusion of URL is important as it represents state-of-the-art efforts in unifying item representations, making FlexCode's dual-codebook and adaptive allocation strategy a direct comparison point.

6. Results & Analysis

6.1. Core Results Analysis

The experiments demonstrate that FlexCode consistently outperforms all baseline models across various metrics and datasets. This robust superiority is attributed to its unique dual-codebook architecture, which disentangles collaborative and semantic signals, and its popularity-aware token allocation mechanism.

The following are the results from Table 1 of the original paper:

Model	Beauty				Sports and Outdoors				KuaiRand
Model	R@5	N@5	R@10	N@10	R@5	N@5	R@10	N@10	R@5	N@5	R@10	N@10
Item ID-based methods
Caser	0.0205	0.0131	0.0347	0.0176	0.0116	0.0072	0.0194	0.0097	0.0074	0.0068	0.0118	0.0095
GRU4Rec	0.0164	0.0099	0.0283	0.0137	0.0129	0.0086	0.0204	0.0110	0.0298	0.0217	0.0383	0.0245
HGN	0.0325	0.0206	0.0512	0.0266	0.0189	0.0120	0.0313	0.0159	0.0297	0.0169	0.0354	0.0219
BERT4Rec	0.0203	0.0124	0.0347	0.0170	0.0115	0.0075	0.0191	0.0099	0.0185	0.0196	0.0217	0.0236
SASRec	0.0387	0.0249	0.0605	0.0318	0.0233	0.0154	0.0350	0.0192	0.0332	0.0338	0.0405	0.0372
S3-Rec	0.0387	0.0244	0.0647	0.0327	0.0251	0.0161	0.0385	0.0204	—	—	—	—
Recformer	0.0379	0.0257	0.0589	0.0321	0.0249	0.0154	0.0370	0.0201	—	—	—	—
Semantic ID-based methods
VQ-Rec	0.0457	0.0317	0.0664	0.0383	0.0208	0.0144	0.0300	0.0173	0.0513	0.0354	0.0589	0.0412
TIGER	0.0454	0.0321	0.0648	0.0384	0.0264	0.0181	0.0400	0.0225	0.0557	0.0383	0.0624	0.0445
LC-Rec	0.0478	0.0329	0.0679	0.0389	0.0268	0.0177	0.0412	0.0221	0.0622	0.0403	0.0684	0.0497
COBRA	0.0537	0.0395	0.0725	0.0456	0.0306	0.0215	0.0434	0.0257	—	—	—	—
URL	0.0553	0.0410	0.0736	0.0471	0.0305	0.0218	0.0449	0.0273	0.0654	0.0481	0.0778	0.0585
FlexCode-SID only	0.0510	0.0375	0.0689	0.0433	0.0291	0.0204	0.0412	0.0244	0.0523	0.0461	0.0759	0.0517
FlexCode-CF only	0.0360	0.0232	0.0563	0.0296	0.0217	0.0143	0.0326	0.0179	0.0309	0.0314	0.0377	0.0346
FlexCode-Fix	0.0531	0.0394	0.0707	0.0452	0.0293	0.0209	0.0431	0.0262	0.0628	0.0462	0.0747	0.0562
FlexCode (ours)	0.0578	0.0415	0.0769	0.0483	0.0329	0.0232	0.0471	0.0275	0.0709	0.0524	0.0825	0.0632

Key observations from Table 1:

Overall Superiority: FlexCode (ours) achieves the best performance on all three datasets (Beauty, Sports and Outdoors, KuaiRand) and all four metrics (Recall@5/10 and NDCG@5/10). For example, on Amazon-Sports and Outdoors, FlexCode achieves a Recall@10 of 0.0471, a relative improvement of about 4.9% over URL (0.0449), the strongest semantic ID-based baseline. On KuaiRand, its NDCG@10 is 0.0632, an 8.0% improvement over URL (0.0585).
Addressing Limitations:
- Versus Item ID-based Models (e.g., SASRec): These models perform significantly worse, especially on sparser datasets like Sports and KuaiRand. This highlights their lack of semantic generalization and inability to handle sparse data effectively. FlexCode's dedicated semantic codebook directly addresses this.
- Versus Semantic ID-based Models (e.g., URL): While these models are generally better than Item ID-based ones, FlexCode surpasses them. This indicates that FlexCode successfully mitigates the representation entanglement problem prevalent in single-codebook generative models. By explicitly disentangling collaborative and semantic signals, FlexCode allows for more pure and effective representations.
Validation of Dual-Codebook Structure:
- FlexCode-SID only (using only semantic IDs) and FlexCode-CF only (using only collaborative IDs) both fall short of the full FlexCode on every metric. FlexCode-CF only also trails FlexCode-Fix throughout, and FlexCode-SID only trails it on all metrics except KuaiRand R@10 (0.0759 vs 0.0747). This indicates that neither type of signal alone is sufficient and that the combination of both is critical.
- FlexCode-SID only outperforms FlexCode-CF only on every dataset in Table 1 (e.g., KuaiRand NDCG@10 0.0517 vs 0.0346), underscoring the importance of semantics for generalization.
Validation of Popularity-Aware Token Allocation (PATA):
- FlexCode-Fix (which uses a static 50/50 token split) already outperforms most baselines in Table 1, including TIGER and LC-Rec on every reported metric, although it remains below URL. This suggests that the disentangled dual-codebook structure is competitive even without adaptive allocation.
- The full FlexCode model outperforms FlexCode-Fix on every dataset and metric. For instance, on KuaiRand, the dynamic allocation of PATA improves NDCG@10 from 0.0562 (FlexCode-Fix) to 0.0632 (FlexCode), a relative gain of 12.5%. This isolates the contribution of the adaptive allocation mechanism: dynamically balancing capacity based on item popularity is what lifts the dual-codebook model above the strongest baseline.
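The relative gains quoted for KuaiRand can be recomputed from the Table 1 entries. The snippet below is only a worked check of that arithmetic. `relative_gain` is a hypothetical helper.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# KuaiRand NDCG@10 values taken from Table 1.
gain_vs_url = relative_gain(0.0632, 0.0585)   # FlexCode vs URL
gain_vs_fix = relative_gain(0.0632, 0.0562)   # FlexCode vs FlexCode-Fix
print(round(gain_vs_url, 1), round(gain_vs_fix, 1))  # prints 8.0 12.5
```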

6.2. Further Analysis on Industrial Dataset

To further validate FlexCode's robustness and scalability, detailed analyses were performed on a large-scale proprietary industrial dataset.

The following figure (Figure 2 from the original paper) shows the performance evaluation on the large-scale industrial dataset:

Figure 2: Performance evaluation on the large-scale industrial dataset.

6.2.1. Overall Industrial Performance

Figure 2a compares the performance improvement of generative models relative to the SASRec baseline on the industrial dataset.

A pure collaborative generative model (CF only) actually degrades performance by over 5% compared to SASRec. This highlights the limitations of solely relying on collaborative signals, especially in complex, real-world industrial settings that often have a strong long-tail.
Advanced unified models like URL provide a moderate 4.9% lift in NDCG@10. This shows that attempts to combine signals in a single representation offer some benefit, but are not optimal.
FlexCode delivers a substantial 13.2% improvement in NDCG@10 and a 16.5% improvement in HR@10 (Hit Ratio, which coincides with Recall@K when each user has exactly one held-out relevant item). This margin on a massive, noisy, production-level dataset supports FlexCode's scalability and effectiveness in real-world industrial scenarios.

6.2.2. Cold-Start and Long-Tail Item Performance

Figure 2b directly addresses FlexCode's ability to balance memorization for popular (head) items and generalization for rare (tail) items on the industrial dataset, which exhibits a highly skewed long-tail distribution.

Baselines' Trade-off:
- The CF-only model shows an improvement for head items but a significant 5.5% degradation for tail items. This is expected as CF excels at memorizing popular patterns but fails on sparse data.
- Conversely, the SID-only (Semantic ID only) model improves tail performance by 5.7% (due to semantic generalization) but at the cost of weaker head performance (as semantics might dilute fine-grained collaborative signals for popular items). This exemplifies the memorization-generalization trade-off in single-faceted models.
FlexCode's Resolution: FlexCode effectively resolves this conflict:
- For head items, its MoE router correctly allocates more tokens to the collaborative codebook, enabling fine-grained memorization. This yields a 3.0% NDCG@10 improvement for head items.
- For tail items, the router shifts capacity to the semantic codebook, facilitating generalization from content features. This results in an impressive 11.3% NDCG@10 improvement for tail items. This is the largest gain on tail items by a wide margin, highlighting FlexCode's practical importance for long-tail discovery.
  
  These results collectively demonstrate that FlexCode's dynamic allocation of representational budget allows it to achieve superior performance across the entire item popularity spectrum, effectively balancing the needs for memorization and generalization.

6.3. Ablation Study

An ablation study was conducted on the KuaiRand dataset to understand the contribution of each component within FlexCode.

The following are the results from Table 3 of the original paper:

Model Variant	Recall@10	NDCG@10
FlexCode (Full)	0.0825	0.0632
FlexCode (CID Only)	0.0405	0.0372
FlexCode (SID Only)	0.0511	0.0401
w/o MoE Gating (Fixed Split)	0.0791	0.0598
w/o Alignment Loss	0.0809	0.0615

Analysis of Ablation Study:

Dual-Codebook Structure:
- FlexCode (CID Only): Using only collaborative IDs results in a significant drop in NDCG@10 from 0.0632 to 0.0372.
- FlexCode (SID Only): Using only semantic IDs also leads to a substantial drop, from 0.0632 to 0.0401.
- Conclusion: This confirms that separating and combining collaborative and semantic representations through a dual-codebook structure is absolutely critical. Neither signal type alone can achieve the performance of their combined, disentangled use, validating the core architectural decision.
Popularity-Aware MoE Gating:
- w/o MoE Gating (Fixed Split): Disabling the MoE gating network and defaulting to a fixed 50/50 token split results in a drop in NDCG@10 from 0.0632 to 0.0598.
- Conclusion: This demonstrates that the adaptive allocation mechanism is crucial. While a fixed dual-codebook (FlexCode-Fix in Table 1) is already beneficial, dynamically adapting the allocation based on item popularity significantly enhances performance by optimizing capacity allocation for different item regimes.
Cross-Codebook Alignment Loss:
- w/o Alignment Loss: Excluding the alignment loss (which ensures coherence between the two codebooks) leads to a drop in NDCG@10 from 0.0632 to 0.0615.
- Conclusion: This confirms the importance of the Cross-Codebook Alignment (CCA) objective. Without it, the specialized codebooks might drift into unrelated spaces, diminishing the overall effectiveness of the combined representation. The InfoNCE loss serves as a powerful regularizer, ensuring a coherent shared space.

6.3.1. Token Budget Sensitivity

The paper also analyzed FlexCode's robustness under varying token budgets $L \in \{3, 4, 5, 6\}$ on the KuaiRand dataset.

The following are the results from Table 4 of the original paper:

Model	L=3	L=4	L=5	L=6
FlexCode (SID Only)	0.0401	0.0415	0.0418	0.0420
FlexCode (CID Only)	0.0372	0.0389	0.0395	0.0397
FlexCode-Fix (50/50 Split)	0.0598	0.0615	0.0619	0.0621
FlexCode (ours)	0.0632	0.0685	0.0691	0.0693

Analysis:

FlexCode (ours) consistently achieves the strongest performance across all token budgets, significantly outperforming SID Only, CID Only, and even FlexCode-Fix.
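As the token budget $L$ increases, performance improves for all variants but saturates quickly: most of the gain comes from moving from $L=3$ to $L=4$ (e.g., 0.0632 to 0.0685 for FlexCode), and each further step adds at most 0.0006. More tokens therefore allow richer representations, with diminishing returns beyond $L=4$ .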
Importantly, FlexCode demonstrates strong performance even with a reduced token capacity ( $L=3$ ), where it still achieves an NDCG@10 of 0.0632. This shows that its adaptive capacity reallocation allows for more efficient use of limited representational resources. This is a crucial advantage for real-world deployment where computational and memory budgets are often constrained.

6.3.2. Hyper-Parameter Analysis

The paper also examined the effect of key hyperparameters on FlexCode's performance (specifically NDCG@10 on KuaiRand).

The following are the results from Table 5 of the original paper:

Parameter Setting	K	d	λalign	λsmooth	NDCG@10
Default (Base)	512	64	0.1	0.01	0.0632
K Variation	256	64	0.1	0.01	0.0603
	1024	64	0.1	0.01	0.0635
d Variation	512	32	0.1	0.01	0.0611
	512	128	0.1	0.01	0.0639
λalign Variation	512	64	0.01	0.01	0.0618
	512	64	0.5	0.01	0.0625
	512	64	1.0	0.01	0.0609
λsmooth Variation	512	64	0.1	0.001	0.0621
	512	64	0.1	0.05	0.0628
	512	64	0.1	0.1	0.0615

Analysis:

Codebook Size ( $K$ ): Increasing $K$ improves performance with diminishing returns: going from 256 to 512 raises NDCG@10 from 0.0603 to 0.0632, while doubling again to 1024 adds only 0.0003 (0.0635). This suggests that a larger codebook provides more expressive power until saturation.
Embedding Dimension ( $d$ ): The pattern is similar for $d$ : NDCG@10 rises from 0.0611 at $d=32$ to 0.0632 at $d=64$ and 0.0639 at $d=128$ , indicating that higher-dimensional embeddings capture somewhat more nuanced information.
Alignment Loss Weight ( $\lambda_{\mathrm{align}}$ ): Performance is relatively robust to moderate changes in $\lambda_{\mathrm{align}}$ . The default value of 0.1 performs best (0.0632), while a very low value (0.01, giving 0.0618) or a very high value (1.0, giving 0.0609) leads to slight degradation, indicating the importance of a balanced alignment. Too little alignment might cause drift, while too much might force entanglement.
Smoothness Loss Weight ( $\lambda_{\mathrm{smooth}}$ ): The model is also stable across a range of $\lambda_{\mathrm{smooth}}$ values. The default 0.01 performs well, with slightly worse results at 0.001 and 0.1. This suggests that a moderate smoothness regularization is beneficial for router stability without overly constraining its adaptability.

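In summary, FlexCode is stable across the tested hyperparameter range: every variant in Table 5 stays within roughly 5% of the default NDCG@10, and the largest drop comes from halving the codebook size. This implies that the method is not overly sensitive to precise tuning, a desirable characteristic for practical deployment.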
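The stability claim can be checked directly against the numbers in Table 5. The snippet below is a small sketch that copies the reported NDCG@10 values and computes each setting's relative change from the default; the names `TABLE5`, `relative_change`, and `sensitivity_range` are illustrative helpers and do not come from the paper.

```python
# NDCG@10 on KuaiRand, copied from Table 5.
# Keys are (K, d, lambda_align, lambda_smooth).
DEFAULT = (512, 64, 0.1, 0.01)
TABLE5 = {
    (512, 64, 0.1, 0.01): 0.0632,   # default
    (256, 64, 0.1, 0.01): 0.0603,   # K variation
    (1024, 64, 0.1, 0.01): 0.0635,
    (512, 32, 0.1, 0.01): 0.0611,   # d variation
    (512, 128, 0.1, 0.01): 0.0639,
    (512, 64, 0.01, 0.01): 0.0618,  # lambda_align variation
    (512, 64, 0.5, 0.01): 0.0625,
    (512, 64, 1.0, 0.01): 0.0609,
    (512, 64, 0.1, 0.001): 0.0621,  # lambda_smooth variation
    (512, 64, 0.1, 0.05): 0.0628,
    (512, 64, 0.1, 0.1): 0.0615,
}


def relative_change(setting):
    """Relative NDCG@10 change of a setting versus the default row."""
    base = TABLE5[DEFAULT]
    return (TABLE5[setting] - base) / base


def sensitivity_range():
    """Smallest and largest relative change over all non-default rows."""
    changes = [relative_change(s) for s in TABLE5 if s != DEFAULT]
    return min(changes), max(changes)


if __name__ == "__main__":
    for setting, score in TABLE5.items():
        print(setting, f"{score:.4f}", f"{100 * relative_change(setting):+.2f}%")
    lo, hi = sensitivity_range()
    print(f"range: {100 * lo:+.2f}% to {100 * hi:+.2f}%")
```

On these values the worst case is $K=256$ at about 4.6% below the default, and the best is $d=128$ at about 1.1% above it.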

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces FlexCode, a novel dual-codebook framework for generative recommendation. It addresses two fundamental limitations of existing generative recommenders: representation entanglement within a single codebook and static capacity allocation that ignores item popularity. FlexCode successfully disentangles collaborative and semantic item representations into two dedicated codebooks. A popularity-aware Mixture-of-Experts (MoE) router dynamically allocates a fixed token budget between these two codebooks on a per-item basis, emphasizing collaborative signals for popular items (memorization) and semantic signals for long-tail items (generalization). An alignment and smoothness objective ensures coherence and stability. Extensive experiments on public and industrial datasets demonstrate that FlexCode consistently outperforms strong baselines in terms of overall accuracy and, crucially, exhibits superior tail robustness. The paper concludes that FlexCode offers a new and effective mechanism for token representation in generative recommenders, providing a valuable perspective on balancing memorization and generalization in token-based recommendation models.
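To make the per-item budget split concrete, here is a toy sketch of popularity-dependent allocation using sigmoid masks. It is not the paper's router: the real router is learned and consumes several item features, whereas this sketch assumes a single popularity score in [0, 1], a linear map from that score to the expected number of CF slots, and a hand-picked temperature. The function names `soft_cf_masks` and `hard_allocation` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_cf_masks(popularity_score, budget=3, temperature=0.25):
    """Toy stand-in for a router: one soft mask per token slot.

    A mask near 1 means the slot is drawn from the CF codebook,
    a mask near 0 means it is drawn from the semantic codebook.
    The linear popularity-to-slots map is an assumption of this sketch.
    """
    cf_slots = popularity_score * budget
    return [sigmoid((cf_slots - (slot + 0.5)) / temperature) for slot in range(budget)]


def hard_allocation(popularity_score, budget=3):
    """Threshold the soft masks at 0.5 to get (n_cf, n_semantic)."""
    masks = soft_cf_masks(popularity_score, budget)
    n_cf = sum(1 for m in masks if m >= 0.5)
    return n_cf, budget - n_cf


if __name__ == "__main__":
    for score in (0.1, 0.6, 0.9):
        print(score, hard_allocation(score))
```

The two counts always sum to the fixed budget, so a popular item spends its tokens on collaborative codes while a long-tail item spends them on semantic codes. The soft masks keep the split differentiable during training, and the threshold is applied only when a discrete allocation is needed.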

7.2. Limitations & Future Work

The authors acknowledge several promising directions for future work:

Extension to Richer Multi-modal Descriptors: The current semantic codebook uses textual and categorical metadata. Future work could extend adaptive capacity allocation to incorporate richer multi-modal item descriptors (e.g., images, audio, video features) to further enhance semantic understanding.
User-Side Modeling: The framework could be extended to user-side modeling, enabling joint reasoning over user and item codebooks. This could lead to more nuanced personalization by also adapting user representations.
Integration with LLMs and Online Learning: Future research could integrate FlexCode with large language models (LLMs) and online learning pipelines. This would involve studying how popularity-aware tokenization interacts with challenges like exploration, temporal drift (changes in item popularity or user preferences over time), and calibration in real-world, dynamic deployments.
Fairness and Exposure: It is important to examine the fairness and exposure of long-tail content under different routing strategies. While FlexCode aims to improve long-tail performance, the dynamic allocation mechanism could inadvertently introduce new biases or affect item exposure if not carefully designed.
Theoretical Understanding: Developing theoretical tools to understand when and why dual-codebook architectures provide benefits over unified tokenizations would further solidify the foundations of this approach.

7.3. Personal Insights & Critique

This paper presents a highly intuitive and impactful solution to a well-recognized problem in recommendation: balancing the needs of popular vs. long-tail items. The core insight that memorization (via CF signals) and generalization (via semantic signals) should be disentangled and adaptively balanced is very powerful.

Inspirations and Transferability:

Adaptive Resource Allocation: The idea of dynamically allocating computational or representational resources based on the inherent characteristics of data points (e.g., popularity, uncertainty, sparsity) is broadly applicable. This could extend beyond recommendation to other machine learning tasks with skewed data distributions, such as natural language processing (e.g., allocating more complex models to rare words) or computer vision (e.g., dedicating more processing to unusual objects).
Disentangled Representations: The success of FlexCode in disentangling collaborative and semantic signals highlights the general value of learning specialized representations for different aspects of data. This principle could be applied in multi-modal learning, where different modalities are processed by specialized subnetworks before being adaptively combined.
Hybrid Modeling: FlexCode effectively demonstrates a sophisticated form of hybrid recommendation, not by simply averaging scores, but by deeply integrating and adaptively weighting different information sources at the representation learning stage. This could inspire more advanced hybrid architectures.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

Defining "Popularity": The paper uses empirical popularity (interaction frequency) and other features like age and sparsity in the MoE router. However, "popularity" itself can be a dynamic concept. An item might be popular for a short burst (e.g., trending news), then fade. Static popularity metrics might not fully capture this temporal drift. Future work could explore dynamic popularity signals or context-aware routing that considers the current session or user intent.
Overhead of Dual Codebooks and MoE: Although the paper describes the MoE router as "lightweight," maintaining two separate RQ-VAEs and codebooks, plus the router and the alignment objective, adds architectural complexity and potentially computational overhead relative to a single-codebook system. A deeper analysis of the compute and memory footprint during training and inference, especially for the RQ-VAE components, would be valuable for industrial applications.
Interpretability of Allocation: While the MoE router allocates based on popularity features, the exact mechanism for why a certain allocation works best could be further investigated. The sigmoid-based masks provide differentiability but might obscure clear thresholds. More interpretable routing mechanisms or post-hoc analysis of router decisions could offer deeper insights.
Generalizability of SASRec-like Architecture for CF Embeddings: The collaborative codebook relies on a SASRec-like architecture to derive collaborative embeddings. While effective, the choice of this specific architecture for generating these initial embeddings might have its own limitations in extremely sparse or highly dynamic environments. Exploring other state-of-the-art CF embedding generation methods could further enhance the collaborative signal quality.
Cold-Start Users: The paper focuses heavily on cold-start items. While improving long-tail item discovery is crucial, the framework's performance on cold-start users (who have very few interactions themselves) is not explicitly detailed. An adaptive allocation for user representations, perhaps mirroring the item-side FlexCode, could be a logical next step.
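The concern about static popularity in the first point above can be illustrated with a time-decayed interaction count, a common way to make popularity track recency. This is a sketch of one possible dynamic signal, not something the paper proposes or evaluates; `decayed_popularity` and its `half_life` parameter are hypothetical.

```python
import math


def decayed_popularity(interaction_times, now, half_life):
    """Sum of past interactions, each weighted by 0.5 ** (age / half_life).

    Times and half_life share one unit (days in the example below).
    Interactions stamped after `now` are ignored.
    """
    decay = math.log(2.0) / half_life
    return sum(math.exp(-decay * (now - t)) for t in interaction_times if t <= now)


if __name__ == "__main__":
    now, half_life = 30.0, 7.0
    faded_burst = [0.0, 1.0, 2.0, 3.0]       # trended a month ago
    recent_item = [27.0, 28.0, 29.0, 30.0]   # same raw count, but recent
    print(decayed_popularity(faded_burst, now, half_life))
    print(decayed_popularity(recent_item, now, half_life))
```

Both items have four interactions, so a raw frequency count treats them identically, while the decayed score ranks the recent item well above the faded burst. A router fed with such a signal would shift a fading item back toward semantic tokens over time.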

Overall, FlexCode is an elegant and effective solution that pushes the boundaries of generative recommendation. Its emphasis on adaptive capacity allocation and disentangled representations provides a strong foundation for future research in building more robust and fair recommender systems.
