On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

Analysis

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

1. Bibliographic Information

1.1. Title

The central topic of the paper is the memorization behavior of Large Language Models (LLMs) when applied to Generative Recommendation (GR) tasks. Specifically, it investigates how LLMs rely on "one-hop memorization" (memorizing direct item transitions from training data) rather than generalizing from their pre-trained knowledge, and proposes a training strategy called IIRG to mitigate this.

1.2. Authors

The authors are Sunwoo Kim (KAIST), Sunkyung Lee (Sungkyunkwan University), Clark Mingxuan Ju, Donald Loveland, Bhuvesh Kumar, Neil Shah, and Liam Collins (Snap Inc.). The research backgrounds span academic institutions in South Korea (KAIST, SKKU) and industry research at Snap Inc., indicating a collaboration between academia and industrial application.

1.3. Journal/Conference

The provided text does not explicitly state the publication venue (e.g., "Published at NeurIPS"). However, the references cite works from 2025 and 2026 (e.g., WWW, NAACL, RecSys), suggesting this is a recent research paper, likely a preprint or submitted to a top-tier conference in the field of Recommender Systems or Natural Language Processing.

1.4. Publication Year

Based on the citations of 2025 and 2026 papers, the publication year is likely 2025 or 2026.

1.5. Abstract

The paper investigates the tendency of LLMs to memorize training data when fine-tuned for Generative Recommendation (GR). The authors find that LLMs rely heavily on "one-hop memorization"—predicting items that are direct successors in training sequences—more so than non-LLM GR models. This reliance limits their ability to generalize. To address this, the authors propose IIRG (Item-Item Relation Generation), a training strategy that teaches LLMs to capture richer item-item relations, specifically collaborative relations (multi-hop co-occurrences) and semantic relations (similar themes). Experiments show that IIRG significantly improves performance, particularly for users whose target items are not covered by one-hop transitions.

The source is provided as an uploaded file (uploaded://314b6393-3e87-479d-a53d-0494162baaa9) with a PDF link (/files/papers/6a367d2c239a205ca2a3a63b/paper.pdf). The status is likely a preprint or unpublished manuscript given the future-dated citations.

2. Executive Summary

2.1. Background & Motivation

Generative Recommendation (GR) is a promising direction where models generate item identifiers directly, rather than ranking a fixed set of candidates. LLMs are increasingly used for GR because of their rich pre-trained world knowledge. However, a known issue with LLMs is "memorization"—the tendency to reproduce training data rather than generalize.

The core problem is that existing LLM-based GR works largely ignore this memorization tendency. If LLMs fine-tuned for GR simply memorize training patterns (specifically, direct transitions between items in a user's history), they fail to utilize their pre-trained knowledge. This restricts their utility to cases where the target item is a direct successor of an item in the input sequence, limiting performance on other users.

The paper's entry point is to rigorously analyze this "one-hop memorization" behavior and quantify its impact on recommendation performance.

2.2. Main Contributions / Findings

The paper makes three primary contributions:

  1. New Finding on Memorization: The authors demonstrate that LLMs exhibit stronger "one-hop memorization" than non-LLM-based GR models (like TIGER). Furthermore, the performance gains of LLMs over baselines are largely concentrated on users who benefit from this specific type of memorization (i.e., users whose target item is a direct successor in the training data).
  2. New Method (IIRG): The authors propose IIRG (Item-Item Relation Generation), a novel training strategy. IIRG introduces two auxiliary training tasks:
    • Collaborative Neighbor Generation: Teaching the model to generate items that frequently co-occur within a multi-hop window in user sequences.
    • Semantic Neighbor Generation: Teaching the model to generate items with semantically similar textual descriptions.

    These tasks encourage the LLM to learn richer item-item relations beyond simple one-hop transitions.
  3. Strong Performance Validation: Experimental results show that IIRG consistently outperforms LLMs trained solely with next-item prediction. The gains are particularly significant (up to 50% in some cases) for users whose target items are not covered by one-hop memorization, validating the method's effectiveness in forcing generalization.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, several foundational concepts are required:

  • Generative Recommendation (GR): Unlike traditional recommendation systems that score a fixed set of candidate items (discriminative approach), GR methods treat recommendation as a sequence generation task. Given a user's history, the model generates the identifier (ID) of the next item token by token, similar to how a language model generates text.
  • Large Language Models (LLMs): Deep learning models (like GPT, Qwen) trained on vast amounts of text data. They possess "world knowledge" and reasoning capabilities. In this paper, they are used as the backbone for the recommender system.
  • Item Identifiers: To recommend specific items, LLMs need a way to represent them. The paper discusses two main types:
    • Semantic IDs (SIDs): Discrete code sequences (e.g., [6, 102, 47]) derived from item metadata using techniques like RQ-VAE. These capture hierarchical semantic meaning.
    • Term IDs (TIDs): Keywords representing item characteristics (e.g., [cycling, tire, wire-bead]). These use existing vocabulary tokens, making them easier for LLMs to interpret without special alignment.
  • One-hop Memorization: A specific behavior where a model learns to predict item $B$ simply because it frequently appears immediately after item $A$ in the training data (i.e., the transition $A \to B$). This is a form of overfitting to local patterns rather than learning complex user preferences or item relationships.
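To make the one-hop notion concrete, the following minimal sketch collects every direct transition $A \to B$ seen in training and checks whether a target item is "covered" by such a transition from the user's history. The function names and the toy sequences are hypothetical and chosen only for illustration; the paper's exact bookkeeping may differ.

```python
from typing import Dict, List, Set


def build_one_hop_successors(train_sequences: List[List[str]]) -> Dict[str, Set[str]]:
    """Map each item A to the set of items B seen immediately after A in training."""
    successors: Dict[str, Set[str]] = {}
    for seq in train_sequences:
        for a, b in zip(seq, seq[1:]):
            successors.setdefault(a, set()).add(b)
    return successors


def is_one_hop_covered(history: List[str], target: str,
                       successors: Dict[str, Set[str]]) -> bool:
    """True if the target directly follows some history item in the training data."""
    return any(target in successors.get(item, set()) for item in history)


# Toy training sequences (hypothetical data).
train = [["A", "B", "C"], ["A", "D"], ["B", "E"]]
succ = build_one_hop_successors(train)

covered = is_one_hop_covered(["A"], "B", succ)      # A -> B was seen in training
not_covered = is_one_hop_covered(["A"], "C", succ)  # C is two hops away from A
```

Under this reading, a user whose target is covered belongs to the group that can benefit from one-hop memorization, while a user like the second example cannot.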

3.2. Previous Works

The paper situates itself within several lines of research:

  • LLM Memorization: Prior work in NLP (e.g., Carlini et al., Mireshghallah et al.) has shown that LLMs can memorize training data, leading to privacy risks. This paper extends that investigation to the domain of Generative Recommendation.
  • Generative Recommendation with LLMs: Methods like TIGER (non-LLM GR) and P5, ReAT, LC-Rec (LLM-based GR) have explored using generative models for recommendation. Many use SIDs or TIDs to represent items.
  • Auxiliary Training Tasks: Some works use auxiliary tasks like distillation (learning from a stronger teacher) or alignment (learning to understand IDs). However, the authors note that none explicitly train LLMs to learn item-item relations beyond one-hop transitions, which is the gap IIRG fills.

3.3. Technological Evolution

The field has evolved from traditional Collaborative Filtering (CF) and Sequential Recommendation (e.g., SASRec) to Generative Retrieval (e.g., TIGER). The latest trend involves integrating LLMs into GR to leverage their reasoning and world knowledge. This paper represents a critical step in this evolution by identifying a fundamental flaw (excessive memorization) in the current LLM-GR paradigm and proposing a correction (IIRG).

3.4. Differentiation Analysis

Compared to previous works, this paper is distinct because:

  1. It focuses on diagnosing the behavior (memorization) rather than just proposing a new architecture.
  2. Its proposed solution (IIRG) targets the relational reasoning capability of the LLM by explicitly teaching it collaborative and semantic links, whereas previous auxiliary tasks often focused on alignment or distillation of ranking scores.

4. Methodology

4.1. Principles

The core principle of IIRG is to mitigate the LLM's reliance on simple one-hop transitions by explicitly training it to recognize and generate more complex item relationships.

The intuition is that if a user's target item is not a direct successor of any item in their history (i.e., not covered by one-hop memorization), the model needs a different signal to make a correct prediction. IIRG provides two such signals:

  1. Collaborative Signal: "Items that appear together in complex user behaviors (multi-hop) are related."

  2. Semantic Signal: "Items with similar descriptions/themes are related."

    By training the model to generate these neighbors, the LLM internalizes these relationships, which can then be leveraged during the final next-item prediction task.

4.2. Core Methodology In-depth (Layer by Layer)

The IIRG methodology consists of three main phases: constructing neighbor sets, defining the training objectives, and performing joint optimization.

Step 1: Obtaining Collaborative Neighbors

First, the method defines "collaborative neighbors" for each item. These are items that frequently appear within a fixed-size window of positions around the target item in user sequences, not just immediately after it.

For a given item $i_k$, we need to find other items $i_q$ that are strongly associated with it. The paper defines an item-item pair weight $w_{kq}$ as the number of users whose sequences contain both items within a window of size $W$.

The formula for this weight is:

$$ w_{kq} = \sum_{u \in \mathcal{U}} \mathbb{I} [ i_k \sim_W^{(u)} i_q ], \quad i_k \sim_W^{(u)} i_q \iff \exists t, j \in [ | s_u | ]: i_t^{(u)} = i_k, \; i_j^{(u)} = i_q, \; 0 < | t - j | \leq W $$

Where:

  • $w_{kq}$: The weight or strength of the relationship between item $i_k$ and item $i_q$.
  • $\mathcal{U}$: The set of all users.
  • $\mathbb{I}[\cdot]$: The indicator function (returns 1 if the condition is true, 0 otherwise).
  • $i_k \sim_W^{(u)} i_q$: A notation indicating that items $i_k$ and $i_q$ co-occur within a window of size $W$ in user $u$'s sequence $s_u$.
  • $s_u$: The interaction sequence of user $u$.
  • $t, j$: Positions (indices) in the sequence.
  • $W$: The window length, measured in sequence positions (a hyperparameter).

Once weights are calculated for all pairs, the collaborative neighbor sequence $n_k$ for item $i_k$ is formed by selecting the top $N_1$ items with the highest weights $w_{kq}$.
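The weight definition and top-$N_1$ selection can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name is hypothetical, self-pairs (a repeated item paired with itself) are skipped, and ties are broken by item identifier, neither of which the text specifies.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def collaborative_neighbors(sequences: List[List[str]], window: int,
                            top_n: int) -> Dict[str, List[str]]:
    """Return, for each item i_k, its top_n neighbors ranked by weight w_kq."""
    weights: Dict[str, Counter] = defaultdict(Counter)  # weights[k][q] = w_kq
    for seq in sequences:
        # The indicator counts each (k, q) pair at most once per user.
        pairs = set()
        for t, item_k in enumerate(seq):
            lo, hi = max(0, t - window), min(len(seq), t + window + 1)
            for j in range(lo, hi):
                # 0 < |t - j| <= window; skipping self-pairs is an assumption.
                if j != t and seq[j] != item_k:
                    pairs.add((item_k, seq[j]))
        for item_k, item_q in pairs:
            weights[item_k][item_q] += 1

    neighbors: Dict[str, List[str]] = {}
    for item_k, counts in weights.items():
        # Highest weight first; ties broken by item id (an assumption).
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        neighbors[item_k] = [item_q for item_q, _ in ranked[:top_n]]
    return neighbors


# Toy user sequences (hypothetical data).
sequences = [["a", "b", "c"], ["a", "c", "d"], ["b", "a"]]
neighbors = collaborative_neighbors(sequences, window=1, top_n=2)
```

With a window of 1, item "a" co-occurs with "b" for two users and with "c" for one, so "b" ranks first among its neighbors.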

Step 2: Obtaining Semantic Neighbors

Second, the method defines "semantic neighbors" based on textual similarity.

  1. Each item $i_k$ is represented by its textual description $\mathbf{t}(i_k)$.

  2. A pre-trained embedding model $f$ encodes this text into a vector $\mathbf{z}_k = f(\mathbf{t}(i_k))$.

  3. The semantic neighbor sequence $m_k$ for item $i_k$ is defined as the top $N_2$ items with the highest cosine similarity to $\mathbf{z}_k$.

The similarity metric used is the cosine similarity:

$$ \text{Similarity} = \frac{\mathbf{z}_k^T \mathbf{z}_q}{\| \mathbf{z}_k \| \, \| \mathbf{z}_q \|} $$

Where:

  • $\mathbf{z}_k, \mathbf{z}_q$: The embedding vectors of item $i_k$ and candidate item $i_q$.
  • $\| \cdot \|$: The Euclidean norm (length) of the vector.
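A minimal sketch of the semantic neighbor selection is shown below. It assumes the embeddings $\mathbf{z}_k$ have already been produced by some text encoder $f$ (not modeled here), uses toy 2-dimensional vectors, and excludes an item from its own neighbor list; the function names and the tie-breaking rule are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity z_k^T z_q / (||z_k|| ||z_q||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_neighbors(embeddings: List[Sequence[float]], top_n: int) -> List[List[int]]:
    """For each item index k, return the indices of its top_n most similar items."""
    result = []
    for k, z_k in enumerate(embeddings):
        # Skip q == k so an item is not its own neighbor (an assumption).
        scored = [(cosine(z_k, z_q), q) for q, z_q in enumerate(embeddings) if q != k]
        scored.sort(key=lambda s: (-s[0], s[1]))  # ties broken by index (assumption)
        result.append([q for _, q in scored[:top_n]])
    return result


# Toy embeddings standing in for f(t(i_k)); items 0/1 and items 2/3 point the same way.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
m = semantic_neighbors(z, top_n=1)
```

For large catalogs an exhaustive pairwise comparison like this would likely be replaced by an approximate nearest-neighbor index; that choice is outside what the text describes.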

Step 3: Training Objectives

IIRG jointly trains the LLM on three tasks using a combined loss function.

Task A: Next-Item Prediction

This is the standard GR objective. Given a user sequence $s_u$, the model predicts the next item. The loss for user $u$ is:

$$ \mathcal{L}_u = - \sum_{t = T + 1}^{\left| \mathcal{T}^{(s_u)} \right|} \frac{\log p_{\theta} \left(\mathcal{T}_t^{(s_u)} \mid \mathcal{T}_{< t}^{(s_u)} , \mathcal{C}\right)}{\left| \mathcal{T}^{(s_u)} \right| - T} $$

Where:

  • $\mathcal{L}_u$: The loss for user $u$.
  • $p_{\theta}$: The probability distribution generated by the LLM with parameters $\theta$.
  • $\mathcal{T}^{(s_u)}$: The tokenized representation of the user's sequence $s_u$.
  • $\mathcal{T}_t^{(s_u)}$: The $t$-th token in the sequence.
  • $\mathcal{T}_{< t}^{(s_u)}$: The context (all tokens before $t$).
  • $\mathcal{C}$: The instruction tokens for the next-item prediction task.
  • $T$: The index of the final token for the anchor item (the starting point of prediction).
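The loss above is a negative log-likelihood averaged over the tokens after position $T$. The sketch below computes it from a list of per-token log-probabilities, which are assumed to have been produced by the model already conditioned on the instruction tokens $\mathcal{C}$; the function name and the toy probabilities are hypothetical.

```python
import math
from typing import Sequence


def normalized_nll(token_log_probs: Sequence[float], num_context_tokens: int) -> float:
    """Average negative log-likelihood over tokens T+1 .. |sequence|.

    token_log_probs[t-1] holds log p(token_t | tokens before t, instruction).
    num_context_tokens plays the role of T in the formula.
    """
    target = token_log_probs[num_context_tokens:]
    if not target:
        raise ValueError("no target tokens after the context")
    return -sum(target) / len(target)


# Toy example: 2 context tokens followed by 2 target tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
loss = normalized_nll(log_probs, num_context_tokens=2)
```

Only the last two log-probabilities contribute here, so the result is $-(\ln 0.5 + \ln 0.25)/2$.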

Task B: Collaborative Neighbor Generation

The model is trained to generate the collaborative neighbor sequence $n_k$ for a given item $i_k$. The loss for item $i_k$ is:

$$ \mathcal{L}_k^{(C)} = \sum_{t = 1}^{\left| \mathcal{T}^{(n_k)} \right|} \frac{\log p_{\theta} \left(\mathcal{T}_t^{(n_k)} \mid \mathcal{T}_{< t}^{(n_k)} , \mathcal{P}\right)}{- \left| \mathcal{T}^{(n_k)} \right|} $$

Where:

  • $\mathcal{L}_k^{(C)}$: The collaborative generation loss for item $i_k$.
  • $\mathcal{T}^{(n_k)}$: The tokenized representation of the neighbor sequence $n_k$.
  • $\mathcal{P}$: The instruction tokens for the collaborative generation task.
  • Note: The negative sign in the denominator combined with the sum of log probabilities effectively computes the Negative Log Likelihood (NLL), averaged over the sequence length.
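As a quick check on the note above, the division by a negative length can be pulled out of the sum, which recovers the usual length-averaged negative log-likelihood. This is a short worked identity using only the symbols already defined for Task B.

```latex
\mathcal{L}_k^{(C)}
= \sum_{t=1}^{|\mathcal{T}^{(n_k)}|}
  \frac{\log p_\theta\!\left(\mathcal{T}_t^{(n_k)} \mid \mathcal{T}_{<t}^{(n_k)}, \mathcal{P}\right)}
       {-\,|\mathcal{T}^{(n_k)}|}
= -\frac{1}{|\mathcal{T}^{(n_k)}|}
  \sum_{t=1}^{|\mathcal{T}^{(n_k)}|}
  \log p_\theta\!\left(\mathcal{T}_t^{(n_k)} \mid \mathcal{T}_{<t}^{(n_k)}, \mathcal{P}\right)
\;\ge\; 0
```

The inequality holds because every token probability is at most 1, so each log-probability is non-positive.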

Task C: Semantic Neighbor Generation

Similarly, the model is trained to generate the semantic neighbor sequence $m_k$ for a given item $i_k$. The loss for item $i_k$ is:

$$ \mathcal{L}_k^{(S)} = \sum_{t = 1}^{\left| \mathcal{T}^{(m_k)} \right|} \frac{\log p_{\theta} \left(\mathcal{T}_t^{(m_k)} \mid \mathcal{T}_{< t}^{(m_k)} , \mathcal{S}\right)}{- \left| \mathcal{T}^{(m_k)} \right|} $$

Where:

  • $\mathcal{L}_k^{(S)}$: The semantic generation loss for item $i_k$.
  • $\mathcal{T}^{(m_k)}$: The tokenized representation of the neighbor sequence $m_k$.
  • $\mathcal{S}$: The instruction tokens for the semantic generation task.

Step 4: Joint Optimization

Finally, IIRG combines these losses into a single objective function to be minimized. For a batch $b$ containing users $\mathcal{U}_{(b)}$ and items $\mathcal{I}_{(b)}^{(C)}, \mathcal{I}_{(b)}^{(S)}$, the total loss $\mathcal{L}$ is:

$$ \mathcal{L} = \sum_{u \in \mathcal{U}_{(b)}} \frac{\mathcal{L}_u}{B} + \sum_{i_k \in \mathcal{I}_{(b)}^{(C)}} \frac{\lambda_1 \mathcal{L}_k^{(C)}}{B} + \sum_{i_t \in \mathcal{I}_{(b)}^{(S)}} \frac{\lambda_2 \mathcal{L}_t^{(S)}}{B} $$

Where:

  • $B$: The batch size.
  • $\lambda_1, \lambda_2$: Hyperparameters controlling the weight of the collaborative and semantic auxiliary losses, respectively.
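The combined objective is a weighted sum of per-example losses divided by the batch size. The sketch below mirrors that arithmetic on plain floats; the function name, the loss values, and the choices of $\lambda_1$, $\lambda_2$, and $B$ are hypothetical and used only to show how the terms combine.

```python
from typing import Sequence


def joint_loss(user_losses: Sequence[float],
               collab_losses: Sequence[float],
               semantic_losses: Sequence[float],
               lambda_1: float, lambda_2: float, batch_size: int) -> float:
    """Total loss: (sum L_u + lambda_1 * sum L^(C) + lambda_2 * sum L^(S)) / B."""
    total = (sum(user_losses)
             + lambda_1 * sum(collab_losses)
             + lambda_2 * sum(semantic_losses))
    return total / batch_size


# Toy batch: 2 next-item examples, 1 collaborative example, 2 semantic examples.
total = joint_loss(
    user_losses=[1.0, 2.0],
    collab_losses=[0.5],
    semantic_losses=[0.25, 0.75],
    lambda_1=0.5,   # hypothetical weight
    lambda_2=0.25,  # hypothetical weight
    batch_size=5,   # assumed here to count all examples in the batch
)
```

Setting both weights to zero reduces this to plain next-item prediction averaged over the batch, which corresponds to the "Naive" training setup.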

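The following figure (Figure 4 from the original paper) provides a visual overview of the IIRG framework, showing how the three tasks are integrated:

img-3.jpeg The image is a chart comparing different models in terms of one-hop memorization reliance and recommendation performance. Panel (a) shows that LLMs rely on one-hop memorization more than the non-LLM model TIGER. Panel (b) compares recommendation performance for users who benefit from one-hop memorization and users who do not; LLMs significantly outperform the non-LLM model on the benefiting users.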

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three benchmark sequential recommendation datasets from the Amazon Reviews collection:

  1. Sports: Sports and Outdoors products.

  2. Toys: Toys and Games products.

  3. Beauty: Beauty and Personal Care products.

    Additional experiments are run on the Yelp dataset (restaurant reviews) to test generalization. These datasets are chosen because they are standard benchmarks in the field, containing rich item metadata (titles, descriptions) necessary for semantic ID generation and semantic neighbor analysis.

The following is an example of a data sample used for training with Term IDs (TIDs), formatted as a prompt for the LLM:

img-6.jpeg The image is a bar chart showing the NDCG@10 scores of different recommender systems across three domains (Sports, Toys, Beauty). It compares three methods, Naive-SID, LC-Rec-SID, and IIRG-SID (ours), in each domain.

5.2. Evaluation Metrics

The paper uses standard ranking metrics for recommendation:

  1. Recall@K

    • Conceptual Definition: Recall@K measures the ability of the recommender system to retrieve relevant items within the top-K recommendations. It answers the question: "Out of all the relevant items (in this case, the single ground-truth next item), how many were found in the top K list?"
    • Mathematical Formula: $\text{Recall@K} = \frac{| \text{Recommended}_K \cap \text{Relevant} |}{| \text{Relevant} |}$
    • Symbol Explanation:
      • $\text{Recommended}_K$: The set of top-K items generated by the model.
      • $\text{Relevant}$: The set of ground-truth relevant items (usually just the next item in the sequence).
      • $| \cdot |$: The cardinality (size) of the set.
  2. NDCG@K (Normalized Discounted Cumulative Gain)

    • Conceptual Definition: NDCG@K measures the ranking quality of the recommendation list. It assigns higher scores to relevant items appearing at higher ranks (positions closer to the top). It is normalized so that a perfect ranking gets a score of 1.
    • Mathematical Formula: $\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$, where $\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)}$
    • Symbol Explanation:
      • $rel_i$: The relevance score of the item at position $i$ (usually 1 if relevant, 0 otherwise).
      • $i$: The rank position (1, 2, ..., K).
      • $\text{IDCG@K}$: The Ideal DCG, calculated assuming the items are sorted perfectly by relevance.
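For the common next-item setting with a single ground-truth item, both metrics simplify considerably: Recall@K becomes a hit indicator, and since the ideal ranking places the one relevant item first, IDCG@K equals 1. The sketch below assumes exactly that single-target setting; the function names and the toy ranking are hypothetical.

```python
import math
from typing import List


def recall_at_k(ranked: List[str], target: str, k: int) -> float:
    """With one relevant item, Recall@K is 1 if it appears in the top k, else 0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """With one relevant item, DCG = 1 / log2(rank + 1) and IDCG = 1."""
    for position, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)
    return 0.0


# Toy ranked list for one user (hypothetical data); the target sits at rank 2.
ranked = ["x", "y", "z"]
r_at_1 = recall_at_k(ranked, "y", k=1)
r_at_2 = recall_at_k(ranked, "y", k=2)
n_at_2 = ndcg_at_k(ranked, "y", k=2)
```

Reported scores are then the average of these per-user values over all test users.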

5.3. Baselines

The paper compares IIRG against 17 baselines categorized into four groups:

  1. Traditional Models: LightGCN, SimGCL, SASRec, FDSA, S$^3$-Rec.

  2. Non-LLM-based GR Models: TIGER, LETTER.

  3. GR with Auxiliary Tasks: P5, ReAT, LC-Rec, EAGER-LLM.

  4. LLM-based GR Models: PLUM, OneRec-Think, GRAM, GRLM, AgenticTagger, and a "Naive" LLM trained solely with next-item prediction.

    These baselines are representative because they cover the spectrum from classical collaborative filtering to state-of-the-art LLM-based generative methods.

6. Results & Analysis

6.1. Core Results Analysis

The main results demonstrate that IIRG consistently outperforms all baselines across all datasets (Sports, Toys, Beauty) and metrics (Recall@5, NDCG@5, Recall@10, NDCG@10).

The following are the results from Table 1 of the original paper:

img-5.jpeg

Analysis:

  • Comparison with Naive LLM: IIRG significantly improves upon the "Naive" LLM (trained only with next-item prediction). For instance, in the Sports dataset, Recall@5 improves from 0.0337 to 0.0406.
  • Comparison with Best Baselines: IIRG outperforms strong baselines like GRAM and GRLM. For example, in the Toys dataset, IIRG achieves a Recall@5 of 0.0780, compared to GRAM's 0.0718.
  • Consistency: The performance gains are consistent across different domains, indicating the robustness of the method.

6.2. Memorization Analysis

The paper investigates the reliance on one-hop memorization.

The following figure (Figure 3 from the original paper) illustrates the reliance on one-hop memorization and the performance gains for different user groups:

img-3.jpeg

Analysis of Figure 3:

  • Panel (a): Shows that LLMs (Naive-SID and Naive-TID) have a higher "One-hop Memorization Ratio" in their top-5 recommendations compared to TIGER (a non-LLM model). This confirms that LLMs rely more heavily on memorizing direct transitions.
  • Panel (b): Compares the relative improvement of LLMs over TIGER. The improvement is significantly larger for "One-hop Benefiting Users" (users whose target item is a direct successor) than for "Non-One-hop Benefiting Users". This suggests that the LLM's advantage comes largely from memorization, which offers little help for the second group.
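One plausible way to compute a ratio like the one in Panel (a) is sketched below: the fraction of top-K recommended items that directly follow some item of the user's history in the training data. This definition is an assumption made for illustration, and the paper's exact formula may differ; the function name and toy data are hypothetical.

```python
from typing import Dict, List, Set


def one_hop_ratio(histories: List[List[str]],
                  top_k_lists: List[List[str]],
                  successors: Dict[str, Set[str]]) -> float:
    """Share of recommended items reachable by one training transition from the history."""
    hits = 0
    total = 0
    for history, recs in zip(histories, top_k_lists):
        # All items that directly followed any history item in training.
        reachable: Set[str] = set()
        for item in history:
            reachable |= successors.get(item, set())
        hits += sum(1 for rec in recs if rec in reachable)
        total += len(recs)
    return hits / total if total else 0.0


# Toy one-hop transitions and recommendations (hypothetical data).
successors = {"A": {"B"}, "B": {"C"}}
histories = [["A"], ["B"]]
top_k_lists = [["B", "C"], ["C", "D"]]
ratio = one_hop_ratio(histories, top_k_lists, successors)
```

In this toy case one of the two recommendations for each user is a direct training successor, giving a ratio of 0.5.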

6.3. Design Goal Achievement

The authors analyze whether IIRG successfully reduces this reliance on memorization and improves performance for non-benefiting users.

The following figure (Figure 6 from the original paper) shows the impact of IIRG on memorization behavior:

img-6.jpeg

Analysis of Figure 6:

  • Panel (a): IIRG reduces the one-hop memorization ratio compared to the Naive LLM, suggesting it learns more complex relations.
  • Panel (b): IIRG yields much larger percentage gains for "Non-Memorization Benefiting Users" (NMB) compared to "Memorization Benefiting Users" (MB). For example, in the Sports dataset, the gain for NMB users is substantial, validating the core hypothesis that learning richer relations helps users not covered by simple transitions.

6.4. Ablation Studies / Parameter Analysis

The authors perform an ablation study to determine the necessity of the two auxiliary tasks (Collaborative and Semantic neighbor generation).

The following are the results from Table 2 of the original paper:

img-7.jpeg

Analysis:

  • Naive (X X): Only next-item prediction.
  • Collab Only (✓ X): Adding collaborative neighbors improves performance (e.g., Sports R@5: 0.0337 -> 0.0401).
  • Semantic Only (X ✓): Adding semantic neighbors also improves performance (e.g., Sports R@5: 0.0337 -> 0.0391).
  • IIRG (✓ ✓): Using both tasks yields the best performance (e.g., Sports R@5: 0.0406). This indicates that collaborative and semantic relations provide complementary signals.

6.5. Generalization across Item IDs

The paper also tests IIRG using Semantic IDs (SIDs) to ensure the method is not limited to Term IDs (TIDs).

The following figure (Figure 5 from the original paper) shows the results under Semantic IDs:

img-4.jpeg

Analysis:

  • IIRG-SID consistently outperforms Naive-SID and the strong baseline LC-Rec-SID across all datasets. This confirms that the effectiveness of IIRG generalizes across different item identifier strategies.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper concludes that LLMs used in Generative Recommendation exhibit a strong tendency to rely on "one-hop memorization," limiting their ability to generalize to users whose target items are not direct successors in the training data. To address this, the authors proposed IIRG, a training strategy that incorporates collaborative and semantic neighbor generation tasks. Experiments on Amazon datasets show that IIRG reduces reliance on memorization and significantly improves recommendation performance, especially for users not covered by one-hop transitions.

7.2. Limitations & Future Work

The authors identify several limitations:

  1. Unknown Underlying Factor: While the paper identifies memorization behavior, it does not pinpoint the exact underlying cause (e.g., is it model architecture, training data, or the optimization process?). The authors note that model size alone does not explain it.
  2. Limited LLM Types: The experiments are primarily conducted using the Qwen family of models. Future work should verify if these findings hold for other LLM families like Llama or Mistral.

7.3. Personal Insights & Critique

  • Innovation: The paper provides a valuable diagnostic lens for LLM-based recommenders. Instead of just chasing higher metrics, it asks why the model performs well. The finding that LLMs might just be "overfitting" to transitions is crucial for the field.
  • Methodology: IIRG is an elegant solution. By framing the learning of item relations as a generative task (generating neighbors), it fits naturally into the LLM paradigm without requiring complex architectural changes.
  • Potential Issues: The method relies on constructing neighbor sets (collaborative and semantic) beforehand. For extremely large-scale catalogs (millions of items), computing and storing these neighbor lists for every item could be computationally expensive or require approximation techniques (though the paper mentions efficient retrieval using SIDs).
  • Formula Observation: In the provided text, the loss formulas for collaborative and semantic generation (Eq 2 and 3) have a negative sign in the denominator (e.g., $- \left| \mathcal{T}^{(n_k)} \right|$). Mathematically, this results in a positive loss (Negative Log Likelihood), but it is a slightly unconventional notation compared to the standard $-\frac{1}{N} \sum \log p$. This suggests a potential typo in the manuscript's LaTeX source, though the mathematical intent (averaging the negative log likelihood) is clear.