
Content-Based Collaborative Generation for Recommender Systems

Published: 03/27/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces ColaRec, a content-based collaborative generation model for recommenders, which uses a sequence-to-sequence framework to generate item identifiers directly, integrating user interaction data and collaborative signals effectively.

Abstract

Generative models have emerged as a promising utility to enhance recommender systems. It is essential to model both item content and user-item collaborative interactions in a unified generative framework for better recommendation. Although some existing large language model (LLM)-based methods contribute to fusing content information and collaborative signals, they fundamentally rely on textual language generation, which is not fully aligned with the recommendation task. How to integrate content knowledge and collaborative interaction signals in a generative framework tailored for item recommendation is still an open research challenge. In this paper, we propose content-based collaborative generation for recommender systems, namely ColaRec. ColaRec is a sequence-to-sequence framework which is tailored for directly generating the recommended item identifier. Precisely, the input sequence comprises data pertaining to the user's interacted items, and the output sequence represents the generative identifier (GID) for the suggested item. To model collaborative signals, the GIDs are constructed from a pretrained collaborative filtering model, and the user is represented as the content aggregation of interacted items. To this end, ColaRec captures both collaborative signals and content information in a unified framework. Then an item indexing task is proposed to conduct the alignment between the content-based semantic space and the interaction-based collaborative space. Besides, a contrastive loss is further introduced to ensure that items with similar collaborative GIDs have similar content representations. To verify the effectiveness of ColaRec, we conduct experiments on four benchmark datasets. Empirical results demonstrate the superior performance of ColaRec.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is Content-Based Collaborative Generation for Recommender Systems. It clearly indicates that the paper focuses on combining item content information and user-item collaborative interactions within a generative framework for recommender systems.

1.2. Authors

The paper lists twelve authors: Yidan Wang, Zhaochun Ren, Zhixiang Liang, Xin Chen, Xu Zhang, Pengjie Ren, Jiyuan Yang, Weiwei Sun, Ruobing Xie, Su Yan, Zhumin Chen, and Xin Xin. Most authors are affiliated with Shandong University, with others from Leiden University, Zhejiang University, and Tencent (WeChat). This indicates a collaborative effort between academic institutions and an industry research lab.

1.3. Journal/Conference

The paper is published at the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24). CIKM is a highly reputable and influential conference in the fields of information retrieval, knowledge management, and database systems, suggesting that the work has undergone rigorous peer review and is considered significant within these communities.

1.4. Publication Year

The paper was first posted to arXiv as a preprint on 2024-03-27 (UTC). The ACM reference format indicates it is scheduled for CIKM '24 in October 2024.

1.5. Abstract

Generative models are a promising approach for enhancing recommender systems. The paper highlights the importance of integrating both item content and user-item collaborative interactions within a unified generative framework for improved recommendations. While existing large language model (LLM)-based methods combine content and collaborative signals, they primarily rely on textual language generation, which is not fully aligned with the direct item recommendation task.

To address this, the authors propose ColaRec, a sequence-to-sequence framework specifically designed for generating recommended item identifiers (GID). The input sequence consists of data related to a user's interacted items, and the output sequence is the GID of the suggested item. ColaRec models collaborative signals by constructing GIDs from a pretrained collaborative filtering model (e.g., LightGCN) and representing users as content aggregations of their interacted items. This unified approach captures both collaborative signals and content information.

To further align these two types of information, an item indexing task is introduced, which maps item side information (content and interacting users) into the item's GID. Additionally, a contrastive loss is incorporated to ensure that items with similar collaborative GIDs have similar content representations. Experiments on four benchmark datasets demonstrate ColaRec's superior performance compared to existing methods.

The original source link is https://arxiv.org/abs/2403.18480v2. This is a preprint on arXiv. The PDF link is https://arxiv.org/pdf/2403.18480v2.pdf.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the ineffective integration and alignment of item content information and user-item collaborative signals within generative recommender systems.

This problem is important because recommender systems are widely deployed to personalize information services, and their effectiveness heavily relies on accurately understanding user preferences and item characteristics. Traditional collaborative filtering (CF) methods primarily leverage user-item interaction data, while content-based methods focus on item attributes. Recent advancements in generative models, particularly large language models (LLMs), have shown promise in recommendation by transforming the task into language generation. However, these LLM-based approaches face inherent misalignments with the direct item recommendation task. Specifically, they often require a complex grounding stage to map generated text back to concrete items and struggle with directly generating target item IDs from a large candidate pool.

The paper identifies a gap in prior research where existing generative recommendation methods (which directly generate item identifiers, or GIDs) either prioritize item content (e.g., TIGER) or collaborative signals (e.g., Si et al.), but fail to effectively model both in a unified framework or to properly align them. Simple concatenation of content and collaborative IDs has been shown to be suboptimal, indicating a need for an explicit learning process for alignment.

The paper's innovative idea or entry point is to propose a sequence-to-sequence generative framework, ColaRec, tailored for directly generating item identifiers. This framework aims to unify collaborative signals (derived from user-item interactions) and item content information (textual descriptions) in an end-to-end manner, overcoming the limitations of LLM-based methods and existing generative recommendation approaches that only consider one aspect or lack proper alignment.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Proposed ColaRec Framework: Introduction of a novel generative recommendation framework, ColaRec, which utilizes an encoder-decoder model to jointly capture item content information and user-item collaborative signals for recommendation. This represents a unified approach that is specifically tailored for the recommendation task, addressing the misalignment issues of LLM-based methods.

  2. Auxiliary Tasks for Alignment: Development of an auxiliary item indexing task and a contrastive loss to facilitate better alignment between item content information and user-item collaborative signals. This explicit learning process for alignment is crucial for enhancing the performance of generative recommendation.

  3. Empirical Validation: Extensive experiments conducted on four benchmark datasets demonstrate the superior recommendation performance of ColaRec compared to state-of-the-art baselines. The results show ColaRec's effectiveness and generalization across different domains, particularly its significant improvement for long-tail users with sparse interactions.

    The key conclusions or findings reached by the paper are:

  • Unifying content information and collaborative signals within a tailored generative framework significantly improves recommendation performance.
  • Explicit alignment mechanisms, such as the item indexing task and contrastive loss, are essential for effectively integrating these two types of signals.
  • The proposed GID construction strategy, which leverages a pretrained collaborative filtering model, is highly effective, outperforming GIDs based solely on random strings or item content.
  • ColaRec is particularly beneficial for long-tail users, indicating its robustness in sparse data scenarios. These findings collectively address the challenge of integrating diverse information sources in generative recommender systems, leading to more accurate and robust recommendations.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Recommender Systems: A class of information filtering systems that seek to predict the "rating" or "preference" a user would give to an item. They are used to suggest items (e.g., movies, products, news articles) to users that they might like. The goal is to enhance user experience and engagement by providing personalized suggestions.

  • Collaborative Filtering (CF): A widely used technique in recommender systems that makes recommendations based on the preferences of similar users or the characteristics of similar items.

    • User-based CF: Recommends items to a user that similar users have liked.
    • Item-based CF: Recommends items that are similar to items the user has liked in the past.
    • The core idea is that if user A and user B have similar tastes, and user A liked item X, then user B is likely to like item X too. Collaborative signals refer to the patterns and knowledge derived from these user-item interactions.
  • Content Information: Refers to the descriptive attributes of items, such as textual descriptions (e.g., titles, genres, brands, tags), images, videos, or other metadata. Content-based recommendation systems recommend items similar to those a user has liked in the past based on their attributes, rather than relying solely on interactions from other users.

  • Generative Models: A type of artificial intelligence model that can learn the patterns and structures of input data and then generate new, similar data. Examples include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Diffusion Models, and Large Language Models (LLMs). In the context of recommendation, generative models can be used to generate item attributes, explanations, or even the item identifiers themselves.

  • Sequence-to-Sequence (Seq2Seq) Models: A neural network architecture that transforms an input sequence into an output sequence. It typically consists of an encoder that processes the input sequence and a decoder that generates the output sequence. Widely used in machine translation, text summarization, and, in this paper, for generating item identifiers.

  • Transformer Architecture: A neural network architecture introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element. Transformers have become the backbone of many LLMs and Seq2Seq models due to their ability to handle long-range dependencies and parallelize computation efficiently.

  • Graph Neural Networks (GNNs): Neural networks designed to operate on graph-structured data. They learn representations (embeddings) of nodes and edges by aggregating information from their local neighborhoods. In recommender systems, user-item interactions can be naturally represented as a bipartite graph, making GNNs suitable for learning user and item embeddings.

  • K-means Clustering: An unsupervised machine learning algorithm used to partition $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean (centroid). Hierarchical K-means applies K-means iteratively to create a tree-like structure of clusters. In this paper, it is used to construct Generative Identifiers (GIDs).

  • Contrastive Learning: A machine learning paradigm where the model learns representations by pushing "similar" (positive) samples closer together in the embedding space and "dissimilar" (negative) samples further apart. It's used to learn robust and discriminative representations without explicit labels.

  • Bayesian Personalized Ranking (BPR) Loss: A pairwise ranking loss function commonly used in recommender systems. It optimizes the model to rank observed (positive) items higher than unobserved (negative) items for a given user: $ \mathcal{L}_{\mathrm{BPR}} = -\sum_{(u, i, j) \in D_S} \log \sigma(\hat{x}_{ui} - \hat{x}_{uj}) $ where $D_S$ is the training set of triplets (u, i, j) in which user $u$ prefers item $i$ over item $j$, $\hat{x}_{ui}$ and $\hat{x}_{uj}$ are the predicted scores of items $i$ and $j$ for user $u$, and $\sigma(\cdot)$ is the sigmoid function. The goal is to maximize the difference between the positive and negative item scores (a small sketch follows this list).

  • Generative Identifier (GID): A unique sequence of tokens assigned to each item, designed to be generated by a generative recommendation model. Unlike a single itemID, a GID is a structured sequence that can encode more information and correlations.
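
To make the BPR objective above concrete, here is a minimal NumPy sketch (illustrative only; the names and shapes are assumptions, not code from the paper). It scores items with dot products of user and item embeddings and averages the pairwise loss over a batch of sampled triplets.

```python
import numpy as np

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """BPR loss for a batch of (u, i, j) triplets, one triplet per row.

    user_emb, pos_item_emb, neg_item_emb: arrays of shape (batch, dim).
    Returns the mean negative log-sigmoid of the score difference x_ui - x_uj.
    """
    pos_scores = np.sum(user_emb * pos_item_emb, axis=1)   # \hat{x}_{ui}
    neg_scores = np.sum(user_emb * neg_item_emb, axis=1)   # \hat{x}_{uj}
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-(pos_scores - neg_scores)))))

# Toy usage: 4 triplets with 8-dimensional embeddings.
rng = np.random.default_rng(0)
u, i, j = (rng.normal(size=(4, 8)) for _ in range(3))
print(bpr_loss(u, i, j))
```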

3.2. Previous Works

  • Matrix Factorization (MF) [26, 42]: Early CF approaches that decompose the sparse user-item interaction matrix into lower-dimensional latent user and item factor matrices. NeuMF [18] is a deep learning extension.
  • Graph Convolutional Matrix Completion [2]: Early work using GNNs for CF.
  • Neural Graph Collaborative Filtering (NGCF) [51]: A prominent GNN-based CF model that explicitly encodes high-order connectivity in the user-item interaction graph.
  • LightGCN [16]: A simplified GNN-based CF model that removes non-linear activation functions and feature transformations from NGCF, focusing solely on neighborhood aggregation for learning user and item embeddings. It's known for its effectiveness and simplicity.
  • SimpleX [36]: A simple yet strong CF baseline using a cosine-based contrastive loss and negative sampling.
  • NCL [34]: Improves LightGCN by incorporating contrastive learning to further enhance item and user representations.
  • Variational Autoencoders (VAEs) for CF (MultiVAE) [31, 43]: Apply VAEs to model user-item interactions, treating user preferences as latent variables and reconstructing observed interactions.
  • Diffusion Models for Recommendation (DiffRec) [30, 50]: New approaches that leverage diffusion models to learn user-item interaction knowledge through reconstruction and denoising processes.
  • LLM-based Recommendation [10, 29, 32, 33, 35, 60, 61, 63, 65]: Reformulate recommendation as a language generation task, using LLMs to generate natural language responses (e.g., item descriptions, explanations) based on user prompts and historical interactions.
    • P5 [12]: A unified Pretrain, Personalized Prompt & Predict Paradigm that treats recommendation tasks as language processing.
    • LC-Rec [63]: Performs recommendation through various language generation tasks using LLMs.
    • Core Challenge: These methods often suffer from task misalignment, requiring grounding stages to map generated text to actual items and struggling to generate specific itemIDs from large pools.
  • Generative Retrieval (DSI) [3, 45, 47, 52]: A paradigm where generative models directly generate identifiers (e.g., document IDs) for retrieval.
    • DSI [47]: Differentiable Search Index, a pioneering work in generative retrieval that uses a Seq2Seq model to directly generate document identifiers.
  • Generative Recommendation (tailored for item IDs) [22, 28, 37, 40, 44, 46, 49, 53, 58]: A new paradigm inspired by generative retrieval but tailored for recommendation. Items are assigned Generative Identifiers (GIDs), which are sequences of tokens, and a Seq2Seq model directly generates the GID of the recommended item.
    • TIGER [40]: Uses an RQ-VAE to construct GIDs from item textual content embeddings and then a Transformer to generate sequential recommendations.
    • Si et al. [44]: Constructs GIDs from item embeddings of a pretrained SASRec model (a sequential recommender).
    • Hua et al. [22]: Investigated item identifier construction and combined content-based semantic strings with collaborative IDs (from co-occurrence matrix) via naive concatenation.

3.3. Technological Evolution

Recommender systems have evolved from traditional collaborative filtering (e.g., Matrix Factorization) to deep learning-based approaches (e.g., NeuMF), then to Graph Neural Networks (GNNs) that better capture complex interaction patterns (NGCF, LightGCN, NCL). Concurrently, the rise of generative models has spurred new directions. Initial attempts involved VAEs and GANs for implicit feedback. More recently, Large Language Models (LLMs) have been adapted, treating recommendation as a language generation task. This LLM-based approach, while powerful, often faces task misalignment.

A new wave, generative recommendation, inspired by generative retrieval, aims to overcome this misalignment by directly generating item identifiers (GIDs) rather than natural language. Early generative recommendation methods, like TIGER and Si et al., focused on either content or collaborative signals in GID construction. However, a key challenge remained: how to jointly model and effectively align both item content information and user-item collaborative signals within this end-to-end generative framework. This paper (ColaRec) fits into this evolution by proposing a comprehensive solution to this specific alignment challenge within the generative recommendation paradigm.

3.4. Differentiation Analysis

Compared to the main methods in related work, ColaRec offers several core differences and innovations:

  • LLM-based Recommendation:

    • Differentiation: ColaRec does not rely on textual language generation for recommendation. Instead, it directly generates Generative Identifiers (GIDs) that map to specific items. This fundamentally avoids the task misalignment and grounding stage issues inherent in LLM-based methods (e.g., P5, LC-Rec), which struggle with generating concrete itemIDs from a large candidate pool.
    • Innovation: ColaRec is specifically tailored for item recommendation, offering a more direct and efficient approach than repurposing LLMs for a potentially ill-fitting task.
  • Existing Generative Recommendation Methods (e.g., TIGER, Si et al., Hua et al.):

    • Differentiation: ColaRec explicitly and jointly models both item content information and user-item collaborative signals in a unified sequence-to-sequence framework.
      • TIGER primarily constructs GIDs from item textual content.
      • Si et al. constructs GIDs from collaborative filtering embeddings (e.g., SASRec) but the framework doesn't explicitly model content.
      • Hua et al. naively concatenates content and collaborative IDs without a proper learning process for alignment.
    • Innovation: ColaRec introduces a novel GID construction strategy based on pretrained collaborative filtering (LightGCN) to embed collaborative signals directly into the GIDs. Crucially, it then proposes an auxiliary item indexing task and a contrastive loss to explicitly align the content-based semantic space and the interaction-based collaborative space. This dual focus on unified modeling and explicit alignment is a key innovation.
  • Conventional CF-based Methods (e.g., LightGCN, NCL):

    • Differentiation: ColaRec is a generative model that predicts items by generating their GIDs, whereas CF-based methods typically learn item embeddings and perform ranking by calculating similarity scores. While CF models can incorporate content (e.g., NCL), they are not inherently generative in the sense of producing structured identifiers.

    • Innovation: By using a generative approach, ColaRec offers an end-to-end paradigm that can potentially capture more complex relationships and facilitate better interpretability if GIDs are designed meaningfully, while still leveraging the strengths of CF.

      In summary, ColaRec innovates by creating a generative recommendation framework that effectively unifies content and collaborative signals through a tailored GID construction, and critically, introduces explicit alignment mechanisms via multi-task learning (indexing task and contrastive loss), which is largely missing or insufficient in previous generative approaches.

4. Methodology

This section details the ColaRec framework, which aims to integrate item content information and user-item collaborative signals into a sequence-to-sequence generation model for direct item recommendation.

4.1. Notations

  • $u$: A specific user.
  • $i$: A specific item.
  • $\mathcal{I}_u^+$: The set of items that user $u$ has interacted with.
  • $\mathcal{U}_i^+$: The set of users who have interacted with item $i$.
  • $c_i$: The content description of item $i$.
  • $\mathrm{uad}_u$: The user's atomic identifier, a randomly assigned single token for user $u$.
  • $\mathrm{iad}_i$: The item's atomic identifier, a randomly assigned single token for item $i$.
  • $\mathrm{GID}_i = [z_i^1, z_i^2, \cdots, z_i^l]$: The generative identifier for item $i$, a sequence of $l$ tokens.
  • $l$: The length of the GID.

4.2. Generative Recommendation Task Formulation

The goal of generative recommendation is to predict a list of items for a user $u$ by generating their Generative Identifiers (GIDs), given information about the user's previously interacted items $\mathcal{I}_u^+$. This generation process is auto-regressive, meaning each token in the GID is generated based on the previously generated tokens and the input information.

The probability of recommending item $i$ for user $u$ is estimated as the product of the probabilities of generating each token in its GID, sequentially: $ p(u, i) = \prod_{t=1}^{l} p(z_i^t \mid \mathcal{I}_u^+, z_i^1, z_i^2, \cdots, z_i^{t-1}) $ where $z_i^t$ is the $t$-th token of $\mathrm{GID}_i$, and $z_i^1, z_i^2, \cdots, z_i^{t-1}$ are the tokens generated before step $t$. The recommender then selects the items with the top-$N$ highest $p(u, i)$ scores as the recommendation list for user $u$.
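
The following minimal sketch shows how this per-token factorization turns into an item score and a top-N list. Here `token_prob` is a hypothetical stand-in for the model's decoder; in practice ColaRec uses constrained beam search rather than exhaustively scoring every item.

```python
import math

def score_item(token_prob, user_history, gid):
    """log p(u, i) = sum_t log p(z_i^t | I_u^+, z_i^{<t})."""
    log_p, prefix = 0.0, []
    for token in gid:
        log_p += math.log(token_prob(user_history, prefix, token))
        prefix.append(token)
    return log_p

def recommend_top_n(token_prob, user_history, all_gids, n=10):
    """Brute-force ranking over the whole item set; real decoding uses beam search."""
    scored = [(score_item(token_prob, user_history, gid), item)
              for item, gid in all_gids.items()]
    return [item for _, item in sorted(scored, key=lambda s: s[0], reverse=True)[:n]]
```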

4.3. Overview of ColaRec

The overall architecture of ColaRec is illustrated in Figure 3. It's built around a sequence-to-sequence encoder-decoder Transformer model, typically a pretrained T5 model.

Figure 3: Overview of the ColaRec framework. The left side depicts the user-item recommendation process, which maps a user's interacted items and their textual content to the GID of the recommended item; the right side depicts item indexing, with a GNN-based collaborative filtering model providing the GIDs. The figure marks the recommendation loss $\mathcal{L}_{\mathrm{rec}}$, the indexing loss $\mathcal{L}_{\mathrm{index}}$, the ranking loss $\mathcal{L}_{\mathrm{bpr}}$, and the contrastive loss $\mathcal{L}_{\mathrm{c}}$.

ColaRec consists of two main training tasks:

  1. User-Item Recommendation Task: This is the primary task, where the model takes the content information of a user's historically interacted items as input and generates the GID of a recommended item. This task is optimized using a recommendation loss $\mathcal{L}_{\mathrm{rec}}$.

  2. Item-Item Indexing Task: An auxiliary task designed to align content information and collaborative signals. It maps item side information (including its textual content and interacting users) into the item's GID. This task is optimized using an indexing loss $\mathcal{L}_{\mathrm{index}}$.

    Additionally, ColaRec incorporates two other loss functions:

  • A ranking loss $\mathcal{L}_{\mathrm{bpr}}$ (Bayesian Personalized Ranking) to enhance the model's ability to discriminate between positive and negative items.

  • A contrastive loss $\mathcal{L}_{\mathrm{c}}$ to further ensure that items with similar collaborative GIDs have similar content representations, strengthening the alignment.

    All these tasks are learned jointly, sharing the same encoder-decoder model. The GIDs themselves are constructed using a graph-based collaborative filtering (CF) model (LightGCN) to embed collaborative signals.

4.4. Generative Identifier (GID) Construction

The construction of GIDs is critical. Ideal GIDs should:

  1. Contain both collaborative signals and content information.

  2. Reflect correlations: similar items (content-wise or user-wise) should have correlated GIDs.

  3. Be unique for each item and unambiguously map back to that item.

    ColaRec constructs GIDs using a hierarchical clustering approach based on item representations from a pretrained LightGCN model.

  1. Item Representation Extraction: Item embeddings are first obtained from a LightGCN model, which has been pretrained on the user-item interaction graph. Since LightGCN learns representations by aggregating information from the interaction graph, these embeddings naturally encode user-item collaborative signals.

  2. Hierarchical Clustering: A constrained K-means algorithm is applied hierarchically to these LightGCN item representations.

    • This process forms a $K$-ary tree structure where each item corresponds to a leaf node.
    • The path from the root node to an item's leaf node constitutes the item's GID.
    • For a GID of length $l$, the clustering is performed for $l-1$ levels. At the $t$-th level ($t \in [1, l-1]$), the number of items in each cluster is constrained to be no more than $K^{l-t}$.
    • At the final (leaf) level, a random value from 1 to $K$ is allocated to each item within its respective leaf cluster.
  3. Codebook Embeddings: For each position in the GID (i.e., each level of the hierarchy), there is a corresponding codebook embedding matrix. These codebook embeddings are learned during the training of ColaRec and help incorporate content information alongside the collaborative signals already present in the GID structure.

    This design ensures that GIDs inherently capture collaborative signals through the LightGCN embeddings and structure, while content information is integrated during the ColaRec training process via the codebook embeddings and the item indexing task.
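
As a rough illustration of this construction, the sketch below clusters pretrained item embeddings hierarchically and appends a disambiguating leaf token. It uses plain scikit-learn K-means for brevity, whereas the paper uses a constrained variant that caps cluster sizes; the function names and defaults (k=32, levels=2, matching l=3) are assumptions, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_gids(item_emb, k=32, levels=2, seed=0):
    """Assign each item a GID of length levels + 1 by hierarchical clustering.

    item_emb: (num_items, dim) embeddings from a pretrained CF model such as LightGCN.
    Returns {item_index: tuple of GID tokens}.
    """
    rng = np.random.default_rng(seed)
    gids = {idx: [] for idx in range(len(item_emb))}
    buckets = [np.arange(len(item_emb))]              # start with one bucket holding all items
    for _ in range(levels):                           # the l - 1 clustering levels
        next_buckets = []
        for bucket in buckets:
            labels = KMeans(n_clusters=min(k, len(bucket)), n_init=10,
                            random_state=seed).fit_predict(item_emb[bucket])
            for c in range(labels.max() + 1):
                members = bucket[labels == c]
                for idx in members:
                    gids[idx].append(c)
                next_buckets.append(members)
        buckets = next_buckets
    # Leaf level: a disambiguating token per item. The constrained variant guarantees
    # at most k items per leaf cluster, so each item would receive a unique final token.
    for bucket in buckets:
        for slot, idx in enumerate(rng.permutation(bucket)):
            gids[idx].append(slot % k)
    return {idx: tuple(toks) for idx, toks in gids.items()}
```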

4.5. User-Item Recommendation

This task models user preferences by generating the GID of a recommended item based on the user's historical interactions.

4.5.1. Model Inputs

The input sequence $X_u$ for user $u$ is an unordered aggregation of content tuples from the items $u$ has interacted with. This design directly reflects the CF principle that user preferences can be inferred from their interacted items.

  • Item Content Tuple $c_i$: For each item $i$, its textual description is flattened into a sequence of key-value attribute pairs (e.g., [k1:v1, k2:v2, ...]). To further improve fidelity, the item's atomic identifier $\mathrm{iad}_i$ is also included in this tuple: $ c_i = [\mathrm{iad}_i, k_1{:}v_1, k_2{:}v_2, \ldots] $
  • User Input Sequence $X_u$: The aggregation of these content tuples for all items in $\mathcal{I}_u^+$. A special task token $\mathrm{task}_u$ is prepended to inform the model that it is performing a recommendation task: $ X_u = [\mathrm{task}_u, \{c_i \mid i \in \mathcal{I}_u^+\}] $
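
A small sketch of how these inputs might be serialized into text for the T5 encoder; the separators and special-token spellings (`<task_u>`, `<iad_17>`) are assumptions for illustration, not the paper's tokenization.

```python
def item_tuple(iad, attrs):
    """Flatten an item's attributes into its content tuple c_i, prefixed by its atomic id."""
    pairs = ", ".join(f"{k}: {v}" for k, v in attrs.items())
    return f"<iad_{iad}> {pairs}"

def user_input_sequence(interacted):
    """Aggregate content tuples of interacted items, prepending the recommendation task token."""
    return "<task_u> " + " ; ".join(item_tuple(iad, attrs) for iad, attrs in interacted)

# Toy usage with two interacted items and their key-value attributes.
history = [(17, {"title": "Vitamin C Serum", "brand": "Acme", "categories": "Beauty"}),
           (42, {"title": "Clay Face Mask", "brand": "Glow", "categories": "Beauty"})]
print(user_input_sequence(history))
```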

4.5.2. Item Generation

A Transformer encoder-decoder model (a pretrained T5 model) is used for generation.

  1. Encoding: The encoder processes the input sequence $X_u$ to capture its semantic information, producing a hidden state $\mathrm{Encoder}(X_u)$.
  2. Decoding: The decoder then auto-regressively generates the GID tokens. At each step $t$, given the encoder's output and the previously generated tokens $z^{<t}$ (i.e., $z_i^1, z_i^2, \cdots, z_i^{t-1}$), the decoder produces a latent representation $\mathbf{d}_t$ for the current token: $ \mathbf{d}_t = \mathrm{Decoder}(\mathrm{Encoder}(X_u), z^{<t}) $ Here, $\mathbf{d}_t \in \mathbb{R}^m$, where $m$ is the dimension of the latent representation.
  3. Token Probability: The probability of generating the $t$-th GID token $z^t$ is calculated by comparing $\mathbf{d}_t$ with the codebook embedding matrix $\mathbf{E}_t$ specific to that GID position (or level in the hierarchy): $ p(z^t \mid z^{<t}, X_u) = \mathrm{softmax}(\mathbf{d}_t \cdot \mathbf{E}_t^{\top}) $ $\mathbf{E}_t$ is the codebook embedding matrix for the $t$-th GID position, containing embeddings for all possible tokens at that position.
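
A minimal NumPy sketch of this position-specific token distribution: the decoder state $\mathbf{d}_t$ is matched against that level's codebook matrix via a softmax. The shapes follow the paper's setup (512-dimensional states, K = 32 tokens per level), but the code itself is illustrative.

```python
import numpy as np

def token_probabilities(d_t, codebook_t):
    """p(z^t | z^{<t}, X_u) = softmax(d_t · E_t^T).

    d_t: (m,) decoder hidden state at step t.
    codebook_t: (K, m) codebook embedding matrix for the t-th GID position.
    """
    logits = codebook_t @ d_t            # one logit per candidate token at this level
    logits -= logits.max()               # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

d_t = np.random.randn(512)
E_t = np.random.randn(32, 512)
print(token_probabilities(d_t, E_t).shape)   # -> (32,)
```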

4.5.3. Recommendation Loss

The model is optimized using a cross-entropy loss to minimize the negative log-likelihood of generating the correct GID tokens for observed positive user-item pairs (u, i): $ \mathcal{L}_{\mathrm{rec}} = -\sum_{t=1}^{l} \log \hat{p}(z_i^t \mid X_u, z_i^1, z_i^2, \cdots, z_i^{t-1}) $ where $\hat{p}(\cdot)$ represents the predicted probability from the model. The parameters of the pretrained T5 model are fine-tuned during this process.

4.6. Item-Item Indexing

This auxiliary task is crucial for aligning collaborative signals (embedded in the GIDs) with item content information (from textual descriptions). It trains the model to map an item's comprehensive side information to its GID.

4.6.1. Model Inputs

The input sequence $X_i$ for item $i$ combines its textual content and the atomic identifiers of users who have interacted with it.

  • Item Input Sequence $X_i$: A special task token $\mathrm{task}_i$ is prepended to indicate the indexing task, followed by the item's content tuple $c_i$ and the atomic identifiers of users in $\mathcal{U}_i^+$: $ X_i = [\mathrm{task}_i, c_i, \{\mathrm{uad}_u \mid u \in \mathcal{U}_i^+\}] $ This input explicitly includes both item content ($c_i$) and collaborative signals ($\mathrm{uad}_u$ for interacting users).

4.6.2. Item Indexing Loss

The indexing task uses the same encoder-decoder model and codebook embeddings as the recommendation task. The generation probabilities are calculated identically to Eq. (4) and Eq. (5), but with $X_i$ as the input instead of $X_u$. The loss for item indexing is also a cross-entropy loss: $ \mathcal{L}_{\mathrm{index}} = -\sum_{t=1}^{l} \log p(z_i^t \mid X_i, z_i^1, z_i^2, \cdots, z_i^{t-1}) $ This loss encourages the model to generate the correct GID for an item when provided with its content and interacting user information, thereby aligning these diverse signals with the collaborative GID.

4.7. Multi-Task Training

ColaRec is trained with a combined objective that includes the recommendation loss, indexing loss, and two additional losses for ranking and contrastive learning.

4.7.1. Item Ranking

To improve the model's ranking capabilities, a Bayesian Personalized Ranking (BPR) loss [42] is applied. For each positive user-item pair (u, i) in the training set, a negative item $i_-$ (an item user $u$ has not interacted with) is randomly sampled. The BPR loss aims to maximize the score of the positive item over the negative item: $ \mathcal{L}_{\mathrm{bpr}} = -\ln \sigma(\mathbf{h}(X_u) \cdot (\mathbf{h}(X_i) - \mathbf{h}(X_{i_-}))) $ where $\mathbf{h}(\cdot)$ denotes the last hidden state of the encoder when processing its input. Specifically, $\mathbf{h}(X_u)$ is the user's aggregated content representation, $\mathbf{h}(X_i)$ is the content-based representation of item $i$, and $\mathbf{h}(X_{i_-})$ is the content-based representation of the negative item $i_-$; $\sigma$ is the sigmoid function. This loss pushes the representation of the positive item closer to the user's representation and further from the negative item's representation in the latent space.

4.7.2. Contrastive Learning

A contrastive loss is introduced to ensure better alignment between collaborative signals (embedded in GIDs) and content-based semantic representations. The core idea is that items with similar GIDs should also have similar content representations. For an item $i$, a positive sample $i_+$ is randomly chosen such that it shares an overlapping prefix of GID tokens with item $i$. The negative sample $i_-$ from the BPR loss is reused here, ensuring that $i_-$ has no overlapping GID tokens with item $i$. The contrastive loss is defined as: $ \mathcal{L}_{\mathrm{c}} = -\ln \sigma(\mathbf{h}(X_i) \cdot (\mathbf{h}(X_{i_+}) - \mathbf{h}(X_{i_-}))) $ Here, $\mathbf{h}(X_i)$ is the content-based representation of item $i$, and $\mathbf{h}(X_{i_+})$ and $\mathbf{h}(X_{i_-})$ are those of the positive and negative samples. This loss encourages the model to pull the content representation of item $i$ closer to that of $i_+$ (which has a similar GID) and push it away from $i_-$ (which has a dissimilar GID). This helps the model learn more discriminative item input representations that reflect both content and collaborative similarity.

4.7.3. Joint Optimization

All the described losses are combined for the final joint optimization of ColaRec: $ \mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{index}} + \mathcal{L}_{\mathrm{bpr}} + \alpha \mathcal{L}_{\mathrm{c}} $ where $\alpha$ is a hyperparameter controlling the weight of the contrastive loss.
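
A compact sketch of how the four terms could be combined per training batch, given the two generation losses and the encoder's last hidden states (PyTorch-style; the inputs and default weight are illustrative stand-ins, not the paper's code).

```python
import torch

def joint_loss(l_rec, l_index, h_u, h_i, h_pos, h_neg, alpha=0.05):
    """L = L_rec + L_index + L_bpr + alpha * L_c.

    l_rec, l_index: scalar cross-entropy losses from the two generation tasks.
    h_u, h_i, h_pos, h_neg: last encoder hidden states for X_u, X_i, a GID-prefix-sharing
    item X_{i+}, and a sampled negative X_{i-}, each of shape (batch, dim).
    """
    l_bpr = -torch.log(torch.sigmoid((h_u * (h_i - h_neg)).sum(-1))).mean()
    l_c = -torch.log(torch.sigmoid((h_i * (h_pos - h_neg)).sum(-1))).mean()
    return l_rec + l_index + l_bpr + alpha * l_c

# Toy usage with random hidden states (batch of 4, dimension 512).
h = lambda: torch.randn(4, 512)
print(joint_loss(torch.tensor(2.3), torch.tensor(1.9), h(), h(), h(), h()).item())
```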

4.7.4. Inference

During inference, to prevent the generation of invalid GIDs (sequences that do not correspond to any actual item), constrained beam search [6] is employed. This technique limits the possible tokens that can be generated at each step based on the previously generated prefix tokens, ensuring that only valid GID paths are explored.
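
One common way to realize this constraint is a prefix trie over all valid GIDs that restricts which tokens may follow a given prefix during beam search; the sketch below is an assumed implementation of that idea, not code from the paper.

```python
def build_gid_trie(all_gids):
    """Map every GID prefix to the set of tokens that may legally follow it."""
    allowed = {}
    for gid in all_gids:
        for t in range(len(gid)):
            allowed.setdefault(tuple(gid[:t]), set()).add(gid[t])
    return allowed

def allowed_next_tokens(trie, prefix):
    """Candidate tokens for the next decoding step, given the generated prefix."""
    return sorted(trie.get(tuple(prefix), set()))

trie = build_gid_trie([(3, 7, 1), (3, 7, 2), (5, 0, 4)])
print(allowed_next_tokens(trie, [3, 7]))   # -> [1, 2]
print(allowed_next_tokens(trie, [5]))      # -> [0]
```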

5. Experimental Setup

5.1. Datasets

The experiments were conducted on four real-world public datasets:

  • Amazon Product Reviews: Three subcategories were used:
    • Beauty
    • Sports and Outdoors
    • Cell Phones and Accessories (Phone)
    • For these Amazon datasets, item content information includes "title", "brand", and "categories" from the item metadata.
  • Food.com:
    • Recipe
    • For the Recipe dataset, item content information includes "name", "description", and "tag".

Preprocessing:

  • Users and items with fewer than five interactions were filtered out from all datasets.

    The following are the results from Table 1 of the original paper:

    Datasets #Users #Items #Interactions
    Beauty 22,363 12,101 198,502
    Sports 35,598 18,357 296,337
    Phone 27,879 10,429 194,439
    Recipe 17,813 41,240 555,618

These datasets were chosen because they represent diverse domains (e-commerce, recipes) and are commonly used benchmarks in recommender systems research, making them effective for validating the model's performance and generalization. The inclusion of textual content alongside interaction data aligns with the paper's focus on fusing both information types.

5.2. Evaluation Metrics

Two widely used metrics for evaluating recommendation performance are employed: Recall@n and Normalized Discount Cumulative Gain (NDCG@n). The candidate item set for evaluation is the entire item set, not a small subset. Each experiment was run three times, and the average score is reported.

  1. Recall@n

    • Conceptual Definition: Recall@n measures the proportion of relevant (ground-truth) items that are successfully retrieved and included in the top nn recommended items. It indicates the model's ability to find as many relevant items as possible within a given recommendation list length.
    • Mathematical Formula: $ \mathrm{Recall@n} = \frac{\sum_{u \in U} |\{\text{recommended items for } u\}_{@n} \cap \{\text{ground-truth items for } u\}|}{\sum_{u \in U} |\{\text{ground-truth items for } u\}|} $
    • Symbol Explanation:
      • $U$: The set of all users in the test set.
      • $\{\text{recommended items for } u\}_{@n}$: The set of top $n$ items recommended to user $u$.
      • $\{\text{ground-truth items for } u\}$: The set of actual items user $u$ interacted with in the test set.
      • $|\cdot|$: Denotes the cardinality (number of elements) of a set.
  2. Normalized Discount Cumulative Gain (NDCG@n)

    • Conceptual Definition: NDCG@n is a measure of ranking quality that takes into account the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear earlier in the list. It is "normalized" by comparing the calculated DCG with the ideal DCG (where all relevant items are perfectly ranked at the top).
    • Mathematical Formula: $ \mathrm{NDCG@n} = \frac{\mathrm{DCG@n}}{\mathrm{IDCG@n}} $ where DCG@n (Discounted Cumulative Gain at rank $n$) is calculated as $ \mathrm{DCG@n} = \sum_{k=1}^{n} \frac{\mathrm{rel}_k}{\log_2(k+1)} $ and IDCG@n (Ideal Discounted Cumulative Gain at rank $n$) is $ \mathrm{IDCG@n} = \sum_{k=1}^{|\mathrm{REL}|} \frac{\mathrm{rel}_k}{\log_2(k+1)} $ (a short sketch computing both metrics follows this list).
    • Symbol Explanation:
      • $\mathrm{rel}_k$: The relevance score of the item at position $k$ in the recommendation list. In typical recommendation scenarios (implicit feedback), $\mathrm{rel}_k$ is 1 if the item at position $k$ is a ground-truth item, and 0 otherwise.
      • $n$: The number of items in the recommendation list being considered (e.g., 5, 10, 20).
      • $|\mathrm{REL}|$: The total number of relevant items in the ground truth for the current user, up to rank $n$. IDCG@n ranks all relevant items (each with relevance 1 under implicit feedback) at the top positions to obtain the maximum possible DCG score.
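
A minimal sketch of both metrics for a single user under implicit feedback (relevance is 1 for ground-truth items and 0 otherwise); averaging over users follows the definitions above.

```python
import math

def recall_at_n(recommended, ground_truth, n):
    """Fraction of this user's ground-truth items that appear in the top-n list."""
    return len(set(recommended[:n]) & set(ground_truth)) / len(ground_truth)

def ndcg_at_n(recommended, ground_truth, n):
    """DCG of the top-n list divided by the ideal DCG for this user."""
    dcg = sum(1.0 / math.log2(rank + 2)                  # rank is 0-based, position = rank + 1
              for rank, item in enumerate(recommended[:n]) if item in ground_truth)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(ground_truth), n)))
    return dcg / idcg if idcg > 0 else 0.0

recs = ["a", "b", "c", "d", "e"]
truth = {"b", "e", "x"}
print(recall_at_n(recs, truth, 5), ndcg_at_n(recs, truth, 5))
```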

5.3. Baselines

ColaRec is compared against several representative baselines, categorized into CF-based methods and Generative models for recommendation:

CF-based Methods:

  • NeuMF [18]: Neural Matrix Factorization, enhances traditional Matrix Factorization with deep neural networks to learn non-linear patterns from user-item interactions.
  • LightGCN [16]: A simplified Graph Neural Network model for Collaborative Filtering, which learns user and item representations through linear neighborhood aggregation on the user-item interaction graph.
  • SimpleX [36]: A straightforward CF model that employs a cosine-based contrastive loss and negative sampling to learn user and item embeddings.
  • NCL [34]: Neural Collaborative Learning, an improvement over LightGCN that integrates contrastive learning to further enhance the quality of graph-based CF representations.

Generative Models for Recommendation:

  • MultiVAE [31]: An autoencoder-based method that uses Variational Autoencoders (VAEs) to model user-item interaction signals through a reconstruction objective.

  • DiffRec [50]: A recent recommendation model based on diffusion models, which learns user-item interaction knowledge by a reconstruction and denoising process.

  • DSI [47]: Differentiable Search Index, a generative document retrieval method. Two versions are adapted for recommendation:

    • DSI-R: A DSI model where GIDs are random strings.
    • DSI-S: A DSI model where GIDs are constructed using hierarchical K-means based on item textual content embeddings from a pretrained BERT model.
  • TIGER [40]: A generative recommendation method that uses a pretrained Sentence-T5 encoder for item textual content embeddings, quantizes them with an RQ-VAE to build GIDs, and then uses an encoder-decoder Transformer for sequential recommendation. Adapted for general recommendation by removing sequential orders.

  • LC-Rec [63]: Adapts Large Language Models by integrating collaborative semantics for recommendation. It follows a similar approach to TIGER for item index learning and uses various language generation tasks under different prompts.

    P5-based baselines [12, 22] were not included because they require candidate items in the input prompt, limiting their ability to perform ranking across the whole item set in general recommendation tasks.

5.4. Implementation Details

  • GID Length ($l$): Set to 3 for all datasets.
  • K-means Cluster Number ($K$): Set to 32 for the Beauty, Sports, and Phone datasets; set to 48 for the Recipe dataset.
  • User Representation: Each user's input is formed by aggregating randomly sampled interacted item tuples.
  • Item Representation (Indexing Task): Each item's input for the indexing task includes one randomly sampled user who interacted with it.
  • Negative Sampling: A uniform distribution is used to sample negative instances for both $\mathcal{L}_{\mathrm{bpr}}$ and $\mathcal{L}_{\mathrm{c}}$ to ensure fair comparison and avoid biases from different negative sampling strategies.
  • Embedding Dimensions: uad, iad, and codebook embeddings are set to 512 to match the word embeddings of the pretrained T5-small model used as the Transformer.
  • Contrastive Loss Coefficient ($\alpha$): Values are tuned per dataset: 0.02 (Beauty), 0.08 (Sports), 0.1 (Phone), and 0.05 (Recipe).
  • Optimizer: AdamW is used with a learning rate of 5e-4.
  • Batch Size: 128.
  • Baselines Hyper-parameters: Hyper-parameters for baselines were carefully tuned, with user and item embedding sizes set to 512 for fair comparison with ColaRec.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate ColaRec's superior performance across various metrics and datasets, particularly highlighting its effectiveness for long-tail users.

6.1.1. Comparison on Whole Users (RQ1)

The following are the results from Table 2 of the original paper:

Datasets Metric CF-based Methods Generative Models for Recommendation Ours
NeuMF LightGCN SimpleX NCL MultiVAE DiffRec DSI-R DSI-S TIGER LC-Rec
Beauty R@5 0.0447 0.0649 0.0551 0.0650 0.0530 0.0524 0.0128 0.0451 0.0519 0.0492 0.0667*
R@10 0.0653 0.0952 0.0831 0.0940 0.0776 0.0741 0.0228 0.0705 0.0799 0.0770 0.0993*
R@20 0.0889 0.1314 0.1193 0.1320 0.1093 0.1016 0.0360 0.1018 0.1154 0.1104 0.1371*
N@5 0.0315 0.0450 0.0377 0.0452 0.0362 0.0378 0.0084 0.0305 0.0350 0.0326 0.0449
N@10 0.0383 0.0549 0.0469 0.0547 0.0443 0.0450 0.0117 0.0385 0.0443 0.0415 0.0556*
N@20 0.0445 0.0643 0.0563 0.0646 0.0526 0.0521 0.0151 0.0470 0.0534 0.0499 0.0654*
Sports R@5 0.0206 0.0418 0.0355 0.0427 0.0314 0.0273 0.0117 0.0320 0.0374 0.0397 0.0442*
R@10 0.0321 0.0623 0.0557 0.0631 0.0476 0.0403 0.0178 0.0497 0.0572 0.0617 0.0660*
R@20 0.0471 0.0901 0.0836 0.0908 0.0713 0.0569 0.0284 0.0766 0.0881 0.0931 0.0964*
N@5 0.0140 0.0288 0.0240 0.0294 0.0208 0.0193 0.0079 0.0225 0.0249 0.0264 0.0294
N@10 0.0177 0.0355 0.0306 0.0359 0.0261 0.0235 0.0099 0.0284 0.0313 0.0335 0.0364*
N@20 0.0215 0.0426 0.0377 0.0431 0.0321 0.0278 0.0126 0.0350 0.0392 0.0413 0.0442*
Phone R@5 0.0410 0.0713 0.0643 0.0717 0.0569 0.0470 0.0187 0.0412 0.0601 0.0615 0.0745*
R@10 0.0603 0.1052 0.0976 0.1043 0.0855 0.0668 0.0341 0.0625 0.0895 0.0919 0.1121*
R@20 0.0871 0.1487 0.1420 0.1481 0.1233 0.0928 0.0564 0.0966 0.1299 0.1354 0.1587*
N@5 0.0282 0.0481 0.0423 0.0486 0.0378 0.0315 0.0121 0.0282 0.0403 0.0408 0.0490*
N@10 0.0344 0.0590 0.0530 0.0593 0.0470 0.0379 0.0170 0.0347 0.0498 0.0506 0.0611*
N@20 0.0412 0.0700 0.0643 0.0704 0.0566 0.0445 0.0225 0.0431 0.0600 0.0615 0.0729*
Recipe R@5 0.0118 0.0188 0.0114 0.0192 0.0167 0.0142 0.0142 0.0157 0.0168 0.0174 0.0198*
R@10 0.0210 0.0296 0.0202 0.0298 0.0285 0.0235 0.0248 0.0270 0.0292 0.0289 0.0306*
R@20 0.0339 0.0454 0.0328 0.0459 0.0462 0.0343 0.0403 0.0436 0.0464 0.0454 0.0482*
N@5 0.0088 0.0149 0.0093 0.0149 0.0128 0.0105 0.0107 0.0122 0.0137 0.0138 0.0151*
N@10 0.0119 0.0182 0.0122 0.0182 0.0167 0.0135 0.0141 0.0158 0.0176 0.0175 0.0185*
N@20 0.0154 0.0223 0.0156 0.0224 0.0214 0.0165 0.0182 0.0202 0.0221 0.0218 0.0232*

(Note: * denotes a paired t-test with significance p-value < 0.1)

Observations from Table 2:

  • Overall Superiority: ColaRec achieves the best recommendation performance across almost all metrics (Recall@n and NDCG@n) on all four datasets. The only exception is NDCG@5 in Beauty and Sports, where it achieves comparable scores with the NCL baseline (0.0449 vs 0.0452 in Beauty, 0.0294 vs 0.0294 in Sports).
  • Significant Improvement in Recall@20: ColaRec consistently outperforms previous CF-based and generative models in Recall@20, showing relative improvements of 3.87% (Beauty), 3.54% (Sports), 6.72% (Phone), and 3.88% (Recipe). This indicates its strong ability to retrieve relevant items within a longer recommendation list.
  • Outperforming Generative Retrieval Adaptations: ColaRec significantly outperforms DSI-R (random GIDs) and DSI-S (content-based GIDs). For example, compared to DSI-S it shows relative improvements in Recall@5 of 47.89%, 38.13%, 80.82%, and 26.11% on the four datasets, respectively. This suggests that simply adapting generative retrieval methods for recommendation is insufficient, and that ColaRec's tailored approach to GID construction and multi-task learning is effective.
  • Superiority over Other Generative Recommenders: ColaRec consistently outperforms TIGER and LC-Rec. This is attributed to TIGER overlooking collaborative signals and LC-Rec's reliance on language generation for recommendation, which ColaRec argues is not fully aligned with the task.
  • Competitiveness with Strong CF Baselines: While existing generative methods like DiffRec and TIGER often underperform strong CF-based methods (NCL), ColaRec achieves competitive or superior results against NCL and other CF models on all datasets. This indicates that ColaRec successfully infuses content information into collaborative generation without sacrificing the core strengths of CF.

6.1.2. Comparison on Long-Tail Users (RQ1)

The paper also evaluated ColaRec's performance on long-tail users (users with sparse interactions), with the ratio between head users and long-tail users set to 20%:80%.

The following are the results from Table 3 of the original paper:

Datasets Metric CF-based Methods Generative Models for Recommendation Ours
NeuMF LightGCN SimpleX NCL MultiVAE DiffRec DSI-R DSI-S TIGER LC-Rec
Beauty R@5 0.0416 0.0636 0.0555 0.0639 0.0510 0.0464 0.0131 0.0415 0.0487 0.0492 0.0660**
R@10 0.0604 0.0922 0.0825 0.0907 0.0742 0.0662 0.0228 0.0653 0.0745 0.0772 0.0975**
R@20 0.0817 0.1253 0.1160 0.1264 0.1039 0.0917 0.0354 0.0940 0.1084 0.1107 0.1327**
Sports R@5 0.0209 0.0433 0.0355 0.0440 0.0329 0.0267 0.0116 0.0307 0.0380 0.0397 0.0456**
R@10 0.0317 0.0639 0.0562 0.0645 0.0495 0.0394 0.0170 0.0472 0.0581 0.0617 0.0674**
R@20 0.0468 0.0904 0.0836 0.0908 0.0725 0.0553 0.0273 0.0728 0.0882 0.0929 0.0976**
Phone R@5 0.0405 0.0723 0.0660 0.0727 0.0571 0.0451 0.0206 0.0404 0.0602 0.0612 0.0756**
R@10 0.0590 0.1054 0.0986 0.1043 0.0861 0.0641 0.0371 0.0623 0.0898 0.0918 0.1126**
R@20 0.0855 0.1482 0.1418 0.1473 0.1228 0.0899 0.0600 0.0939 0.1293 0.1353 0.1590**
Recipe R@5 0.0128 0.0204 0.0121 0.0210 0.0182 0.0172 0.0157 0.0171 0.0181 0.0189 0.0219**
R@10 0.0229 0.0320 0.0212 0.0322 0.0309 0.0269 0.0274 0.0295 0.0316 0.0313 0.0334**
R@20 0.0371 0.0487 0.0343 0.0490 0.0499 0.0412 0.0443 0.0475 0.0504 0.0493 0.0528**

(Note: ** denotes improvements are significant with p-value < 0.05. Only Recall metrics are shown, as NDCG shows similar trends.)

Observations from Table 3:

  • Superiority for Long-Tail Users: ColaRec significantly outperforms all baselines for long-tail users across all Recall@n metrics on all datasets. This is a crucial advantage, as long-tail users often suffer from poor recommendations due to sparse interaction data, a common problem in real-world systems.

  • Reason for Improvement: The authors attribute this improvement to ColaRec's ability to model both user-item interactions (collaborative signals) and item content information. For long-tail users with limited interaction history, the rich content information provides crucial supplementary knowledge that CF-only methods might lack, enabling better recommendations.

    In summary, ColaRec is effective in yielding superior recommendation performance across a wide range of users and is particularly beneficial for long-tail users, demonstrating the value of its unified modeling and alignment strategy.

6.2. Ablation Study (RQ2)

To understand the contribution of each component of ColaRec, an ablation study was conducted. Four variants were created by removing specific components:

  1. (1) w/o textual content: Removes all textual content information from the model input.

  2. (2) w/o indexing: Removes the item-item indexing task.

  3. (3) w/o Lbpr: Removes the Bayesian Personalized Ranking (BPR) loss (Lbpr\mathcal{L}_{\mathrm{bpr}}).

  4. (4) w/o Lc: Removes the contrastive loss (Lc\mathcal{L}_{\mathrm{c}}).

    The following are the results from the upper part of Table 4 of the original paper:

    Beauty Sports Phone Recipe
    Recall@5 NDCG@5 Recall@5 NDCG@5 Recall@5 NDCG@5 Recall@5 NDCG@5
    ColaRec 0.0667 0.0449 0.0442 0.0294 0.0745 0.0490 0.0198 0.0151
    (1) w/o textual content 0.0527 0.0346 0.0364 0.0239 0.0636 0.0426 0.0181 0.0141
    (2) w/o indexing 0.0637 0.0428 0.0422 0.0278 0.0728 0.0487 0.0179 0.0142
    (3) w/o Lbpr 0.0612 0.0412 0.0424 0.0282 0.0719 0.0486 0.0184 0.0140
    (4) w/o Lc 0.0657 0.0434 0.0422 0.0279 0.0731 0.0485 0.0188 0.0145

(Note: The best scores are marked in bold.)

Observations from Table 4 (upper part):

  • Importance of Textual Content: Removing textual content ((1) w/o textual content) leads to a significant performance drop across all datasets (e.g., 20.99% in Beauty, 17.65% in Sports, 14.63% in Phone, and 8.59% in Recipe for Recall@5). This confirms that item content information is crucial for the model's understanding of items and for improving recommendation quality.

  • Effectiveness of BPR Loss: The absence of the BPR loss ((3) w/o Lbpr) also results in a notable performance decrease. This highlights the importance of the pairwise ranking objective in teaching the model to prioritize relevant items within the generative recommendation framework.

  • Value of Alignment Components: Both the item-item indexing task ((2) w/o indexing) and the contrastive loss ((4) w/o Lc) contribute positively to the overall performance. Removing either results in a consistent reduction in performance across all datasets. This validates the effectiveness of these explicit alignment mechanisms in facilitating mutual reinforcement between item content information and user-item collaborative signals, leading to more comprehensive and effective item representations.

    In conclusion, each component of ColaRec (textual content, item indexing task, BPR loss, and contrastive loss) is essential and contributes to the model's superior recommendation performance.

6.3. GID Investigation (RQ3)

This section investigates the impact of GID design choices on recommendation performance.

6.3.1. Effect of Different GID Types

To evaluate the proposed GID construction strategy (based on collaborative signals), it was compared against three alternative GID types:

  1. iad-based GID: Each item is represented by a single, unique atomic identifier ($\mathrm{iad}_i$).

  2. Random GID: Each item is assigned a random string as its GID, without any underlying knowledge.

  3. Content GID: GIDs are constructed using hierarchical K-means clustering based on item textual content embeddings derived from a pretrained BERT model.

    For a fair comparison, the length and codebook size for Random GIDs and Content GIDs were kept identical to those in ColaRec.

The following are the results from the bottom part of Table 4 of the original paper:

Beauty Sports Phone Recipe
Recall@5 NDCG@5 Recall@5 NDCG@5 Recall@5 NDCG@5 Recall@5 NDCG@5
ColaRec 0.0667 0.0449 0.0442 0.0294 0.0745 0.0490 0.0198 0.0151
iad 0.0658 0.0437 0.0428 0.0285 0.0719 0.0474 0.0189 0.0145
Random 0.0600 0.0401 0.0411 0.0272 0.0667 0.0443 0.0190 0.0149
Content 0.0662 0.0440 0.0423 0.0278 0.0716 0.0477 0.0183 0.0141

(Note: The best scores are marked in bold.)

Observations from Table 4 (bottom part):

  • Superiority of Collaborative GIDs: ColaRec's GID construction method, which is based on collaborative signals from LightGCN embeddings, achieves the best performance across all metrics and datasets. This underscores the effectiveness and importance of encoding collaborative signals directly into the GIDs.

  • Collaborative vs. Content GIDs: ColaRec outperforms Content GID, indicating that while content is important, collaborative signals are more effective for the fundamental structure of GIDs in generative recommendation.

  • GIDs vs. Single ItemIDs: ColaRec also outperforms the iad-based GID (single token). This demonstrates the benefit of using sequential GIDs that explicitly introduce item correlations and structured information, compared to arbitrary single itemIDs.

  • Random GIDs: The Random GID method yields the lowest performance, as expected. Random strings introduce noise and lack meaningful correlations, making the learning process more difficult and less effective.

    These results strongly emphasize the importance of constructing effective GIDs that embed meaningful information, particularly collaborative signals, for robust generative recommender systems.

6.3.2. Impact of GID Hyper-parameters

The paper investigated the impact of two key GID hyper-parameters: GID length (l) and the number of clusters (K) in the hierarchical clustering.

Impact of GID Length ($l$)

The following figure (Figure 4 from the original paper) shows the impact of GID length on performance:

Figure 4: Impact of the length of GIDs (Recall@10 and NDCG@10 on the Beauty and Sports datasets as the GID length varies).

Observations from Figure 4:

  • The Recall@10 and NDCG@10 performance of ColaRec fluctuates as the GID length $l$ varies from 1 to 4.
  • From $l=1$ to $l=2$: A performance drop is observed in Beauty and Sports. This is attributed to GIDs with $l=2$ increasing the number of decoding steps while the search space at each step remains large, making generation more difficult than with a single-token GID ($l=1$).
  • Optimal Length ($l=3$): ColaRec achieves the best performance in most cases when $l=3$. This suggests that $l=3$ strikes a good balance between the number of decoding steps and the size of the search space for each GID position.
  • Longer GIDs ($l=4$): When $l=4$, performance generally decreases. A longer GID implies more auto-regressive decoding steps, which increases generation difficulty and inference latency.
  • Conclusion: $l=3$ was chosen as the default setting, indicating a practical trade-off for effective and efficient GID generation.

Impact of Number of Clusters ($K$)

The following figure (Figure 5 from the original paper) shows the impact of the number of clusters on performance:

Figure 5: Impact of the number of clusters (Recall@10 and NDCG@10 on the Beauty and Sports datasets as the number of clusters $K$ varies).

Observations from Figure 5:

  • With the GID length fixed at $l=3$, the number of clusters $K$ was varied (32, 64, 96, 128).

  • Higher $K$ and Performance: Generally, a higher value of $K$ tends to result in a slight decrease in overall performance, with the drop being more noticeable in Beauty.

  • Reason for Decrease: A higher $K$ means a larger search space for the decoder at each GID position, which increases generation difficulty and computational complexity.

  • Principle for Choosing $K$: It is important to select a suitable $K$ based on the total number of items. $K$ needs to be large enough to uniquely encode the entire item set (given the GID length $l$), but beyond that it should be kept small to limit the search space and maintain generation efficiency.

    These investigations highlight that careful design and tuning of GID construction parameters are essential for maximizing the performance of generative recommendation models.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduced ColaRec, a novel content-based collaborative generation framework for recommender systems. ColaRec stands out by effectively integrating both item content information and user-item collaborative signals within a unified end-to-end sequence-to-sequence generative model tailored for direct item identifier generation. The core innovation lies in its GID construction (derived from a pretrained LightGCN model to embed collaborative signals) and the explicit alignment mechanisms: an auxiliary item indexing task and a contrastive loss. Extensive experiments on four real-world datasets demonstrated that ColaRec significantly outperforms state-of-the-art baselines, especially for long-tail users who typically suffer from data sparsity. This work validates the potential of a generative paradigm that carefully fuses multi-faceted information for robust and accurate recommendations.

7.2. Limitations & Future Work

The authors acknowledge several areas for future research:

  • GID Construction: Investigating more advanced methods for constructing Generative Identifiers (GIDs) that might better encode item properties and relationships.
  • Alignment Approaches: Adopting more effective strategies for aligning content information and collaborative signals beyond the current indexing task and contrastive loss. This could involve more sophisticated cross-modal learning techniques.
  • Negative Sampling: Exploring improved negative sampling techniques for generative recommendation. This could involve sampling more informative negative samples (e.g., hard negatives) using GIDs or even generating synthetic negative instances with generative models.
  • Model Efficiency: Addressing how to improve the model efficiency of generative recommendation during both training and inference, as auto-regressive generation can be computationally intensive, especially for long GIDs or large item sets.

7.3. Personal Insights & Critique

Personal Insights

This paper presents a strong step forward in generative recommendation. The most significant insight is the explicit focus on alignment between content and collaborative signals, not just their co-existence. The use of a multi-task learning approach with an item indexing task and a contrastive loss to bridge the semantic gaps between different information modalities is a well-reasoned and effective design. This approach of generating GIDs directly, rather than relying on LLM-based text generation and subsequent grounding, seems more aligned with the core task of item recommendation and avoids many of the complexities and inefficiencies associated with LLM integration. The superior performance on long-tail users is particularly noteworthy, highlighting the practical value of integrating content information to alleviate cold-start or sparsity issues.

The method of constructing GIDs from LightGCN embeddings is clever, embedding a strong collaborative signal directly into the identifier structure. This provides a solid foundation upon which the content-based aspects can be built and aligned. The idea of generative retrieval is clearly influential here, and ColaRec successfully adapts it to the unique challenges of recommendation by incorporating richer user-item interactions and content.

Potential Issues, Unverified Assumptions, or Areas for Improvement

  1. Interpretability of GIDs: While GIDs are structured sequences, their direct interpretability for humans isn't explicitly discussed. If the GID tokens could be designed to correspond to human-understandable attributes or clusters, it could further enhance the model's explainability, which is a growing demand in recommender systems.

  2. Scalability of GID Construction: The hierarchical K-means approach for GID construction, especially for very large item sets, might become computationally expensive. While LightGCN is efficient, the clustering itself could be a bottleneck. Exploring GID construction methods that are inherently more scalable or dynamic could be beneficial.

  3. Dependency on Pretrained CF Model: The quality of ColaRec's GIDs heavily relies on the performance of the pretrained LightGCN model. If the LightGCN model is suboptimal or biased, it could impact the entire ColaRec system. An end-to-end learning approach for GID construction that doesn't rely on a separate pretrained model could be an interesting future direction, though potentially more complex.

  4. Implicit Feedback Only: The paper focuses on implicit feedback. Extending ColaRec to handle explicit feedback (e.g., star ratings) or multimodal content (e.g., images, videos) could further broaden its applicability and performance.

  5. Efficiency and Latency: As acknowledged in future work, auto-regressive generation can be slower than direct embedding similarity search. While constrained beam search helps, further optimizations for inference latency, especially in real-time recommendation scenarios, would be critical for practical deployment.

  6. Negative Sampling Strategies: The use of uniform negative sampling for $\mathcal{L}_{\mathrm{bpr}}$ and $\mathcal{L}_{\mathrm{c}}$ is a simplification. Investigating more sophisticated hard negative mining strategies or in-batch negative sampling could potentially yield further performance gains.

  7. Task Token Generalization: The use of special task tokens ($\mathrm{task}_u$, $\mathrm{task}_i$) is a common Transformer practice. However, the robustness of this mechanism, and its potential impact on performance if more tasks are introduced or if the task distribution is imbalanced, could be an area for deeper analysis.

    Overall, ColaRec provides a robust and well-validated framework for generative recommendation, effectively addressing key challenges in integrating content and collaborative signals. Its approach is likely transferable to other domains where structured identifiers and rich content information are available, serving as a strong foundation for future research in generative recommender systems.
