Paper status: completed

Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation

Published: 03/29/2025

TL;DR Summary

ReaRec is introduced as an innovative inference-time computing framework that enhances user representation in sequential recommendation systems. Employing implicit multi-step reasoning, it overcomes limitations in understanding evolving user preferences and long-tail items, demonstrating consistent performance gains across multiple SeqRec backbones and datasets at a modest additional inference cost.

Abstract

Sequential Recommendation (SeqRec) aims to predict the next item by capturing sequential patterns from users' historical interactions, playing a crucial role in many real-world recommender systems. However, existing approaches predominantly adopt a direct forward computation paradigm, where the final hidden state of the sequence encoder serves as the user representation. We argue that this inference paradigm, due to its limited computational depth, struggles to model the complex evolving nature of user preferences and lacks a nuanced understanding of long-tail items, leading to suboptimal performance. To address this issue, we propose ReaRec, the first inference-time computing framework for recommender systems, which enhances user representations through implicit multi-step reasoning. Specifically, ReaRec autoregressively feeds the sequence's last hidden state into the sequential recommender while incorporating special reasoning position embeddings to decouple the original item encoding space from the multi-step reasoning space. Moreover, we introduce two lightweight reasoning-based learning methods, Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL), to further effectively exploit ReaRec's reasoning potential. Extensive experiments on five public real-world datasets and different SeqRec architectures demonstrate the generality and effectiveness of our proposed ReaRec. Remarkably, post-hoc analyses reveal that ReaRec significantly elevates the performance ceiling of multiple sequential recommendation backbones by approximately 30%-50%. Thus, we believe this work can open a new and promising avenue for future research in inference-time computing for sequential recommendation.

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation." This title highlights a novel approach to sequential recommendation systems by introducing multi-step reasoning during the inference phase, akin to a "think before action" paradigm.

1.2. Authors

Jiakai Tang, Sunhao Dai, Teng Shi, Jun Xu, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang. The authors are affiliated with Alibaba Group, Beijing, China, and Renmin University of China. Their research backgrounds appear to be in recommender systems, natural language processing, and potentially large language models, given the paper's motivation from the NLP community.

1.3. Journal/Conference

The paper is published as a pre-print on arXiv and does not specify a particular journal or conference in the provided text. The ACM reference format suggests it might be intended for an ACM publication. ACM conferences and journals are highly reputable in computer science, particularly in areas like information retrieval, data mining, and recommender systems.

1.4. Publication Year

The paper was posted to arXiv on March 28, 2025.

1.5. Abstract

Sequential Recommendation (SeqRec) aims to predict the next item a user will interact with by learning sequential patterns from their historical interactions. Current SeqRec models typically use a direct forward computation paradigm, where the final hidden state of a sequence encoder forms the user representation. The authors argue that this approach has limited computational depth, making it insufficient for modeling complex, evolving user preferences and understanding long-tail items, leading to suboptimal performance.

To address this, the paper proposes ReaRec, an inference-time computing framework that enhances user representations through implicit multi-step reasoning. ReaRec autoregressively feeds the sequence's last hidden state back into the sequential recommender, incorporating special Reasoning Position Embeddings (RPE) to separate the item encoding space from the multi-step reasoning space.

Furthermore, the paper introduces two lightweight reasoning-based learning methods: Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL). ERL constructs multi-order user representations by ensembling reasoning steps and uses KL divergence regularization to encourage diversity. PRL employs a progressive temperature annealing mechanism and Reasoning-aware Contrastive Learning (RCL) to guide the model towards better generalization.

Extensive experiments on five real-world datasets and various SeqRec architectures demonstrate ReaRec's generality and effectiveness. Notably, post-hoc analyses show that ReaRec can significantly improve the performance ceiling of multiple sequential recommendation backbones by approximately 30%-50%. The authors believe this work opens a new research direction in inference-time computing for sequential recommendation.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the limitation of current sequential recommendation (SeqRec) models in accurately capturing complex user preferences and understanding long-tail items due to their direct forward computation paradigm. Existing models typically use a single, final hidden state from a sequence encoder as the user representation. This approach, while efficient, offers limited computational depth, which the authors argue is insufficient for nuanced comprehension of dynamic user preferences and evolving interest patterns, especially for long-tail users (users with few interactions) and unpopular items. These scenarios inherently demand deeper reasoning and richer representation learning.

The importance of this problem stems from the ubiquity of recommender systems in modern daily life (e-commerce, music, video streaming). Improving their accuracy, especially for less common items and users, can significantly enhance user experience and discoverability.

The paper's innovative idea and entry point are motivated by recent advancements in the natural language processing (NLP) community, specifically the success of Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs). CoT allows LLMs to perform multi-step deliberation before generating an output, which has been shown to significantly improve performance on complex tasks by increasing computational depth. The authors explore whether a similar "think-before-action" paradigm can benefit sequential recommendation, leading to the proposal of ReaRec, a reasoning-enhanced framework that enables implicit multi-step reasoning during inference.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Proposal of ReaRec Framework: ReaRec is introduced as the first inference-time computing framework for recommender systems. It empowers SeqRec models to perform implicit multi-step reasoning during inference, thereby enhancing user representations and deepening feature crossing. This is a novel exploration of inference-time computational power within recommender systems.
  • Introduction of Two Reasoning Learning Strategies:
    • Ensemble Reasoning Learning (ERL): Leverages ensemble learning by aggregating diverse reasoning results from different steps and uses multi-step supervised optimization with a representation diversity regularizer (KL divergence) to prevent reasoning degradation.
    • Progressive Reasoning Learning (PRL): Inspired by curriculum learning, it uses a progressive temperature annealing (PTA) mechanism to guide the model's learning and incorporates reasoning-aware contrastive learning (RCL) to enhance robustness by simulating error self-correction.
  • Extensive Experimental Validation: Through comprehensive experiments on five real-world datasets and various representative SeqRec models (both ID-based and text-based), ReaRec's generality and effectiveness are validated.
  • Significant Performance Improvement and Ceiling Breakthrough: ReaRec achieves an average performance gain of 7.49% across all metrics with only 3.51% additional inference latency. Remarkably, post-hoc analysis reveals that ReaRec significantly elevates the performance ceiling of multiple sequential recommendation backbones by approximately 30%-50%.
  • Identification of Future Research Avenues: The paper identifies challenges and opportunities in reasoning-enhanced recommendation methods, stimulating a new research direction at the intersection of inference-time computing and sequential recommendation. Key insights include the differential impact of reasoning on user/item subgroups (long-tail benefiting more, active users potentially "overthinking"), the need for adaptive inference depth selection, and parameter disentanglement between encoding and reasoning. The authors also question the existence of an "inference-time scaling law" for recommendation systems and suggest theoretical analysis and efficient inference mechanisms as future work.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following fundamental concepts:

  • Recommender Systems (RS): Software systems that provide suggestions for items to users. These systems aim to predict user preferences and recommend relevant items from a large pool. They are widely used in e-commerce, streaming services, and social media.
  • Sequential Recommendation (SeqRec): A sub-field of recommender systems that focuses on predicting a user's next interaction based on their historical sequence of interactions. Unlike traditional RS that might treat interactions as independent, SeqRec models consider the order and temporal dependencies of user behavior. For example, if a user watches action movie A, then action movie B, the system might recommend action movie C next.
  • User-Item Interaction Sequence: A chronological list of items a user has interacted with (e.g., purchased, viewed, liked). For a user $u$, this is denoted as $S^u = [v_1^u, v_2^u, \ldots, v_{n_u}^u]$, where $v_i^u$ is the $i$-th item in the sequence.
  • Item Embedding: A low-dimensional vector representation of an item, designed to capture its semantic and collaborative properties. Items with similar characteristics or that are frequently interacted with by similar users will have embeddings that are close in the vector space.
  • User Representation: A vector that encapsulates a user's preferences, interests, or state within the recommendation system. In many SeqRec models, this is derived from the user's interaction sequence.
  • Transformer Architecture: A neural network architecture introduced in 2017, known for its self-attention mechanism. It has become a cornerstone in Natural Language Processing (NLP) and is increasingly adopted in other domains, including recommender systems.
    • Self-Attention: A mechanism that allows a model to weigh the importance of different parts of an input sequence relative to each other when processing each part. For example, when encoding a word in a sentence, self-attention helps the model decide which other words in the sentence are most relevant. In SeqRec, it helps determine the relevance of past interacted items to the prediction of the next item.
    • Multi-Head Attention: An extension of self-attention where the attention mechanism is run multiple times in parallel. This allows the model to jointly attend to information from different representation subspaces at different positions.
    • Positional Encoding: Since Transformers do not inherently process sequences in order, positional encodings are added to item embeddings to inject information about the relative or absolute position of items in the sequence.
  • Encoder-Decoder Architecture: Transformers can be built with an encoder-decoder structure. In SeqRec, typically only the encoder part is used to process the input sequence and generate a user representation.
  • Cross-Entropy Loss: A common loss function used in classification tasks, including next-item prediction in SeqRec. It measures the difference between the true probability distribution (one-hot for the ground-truth item) and the predicted probability distribution over all possible items: $\mathcal{L}_{\mathrm{CE}} = - \sum_{i=1}^{C} y_i \log(\hat{y}_i)$, where $C$ is the number of classes (items), $y_i$ is the true probability (1 for the correct item, 0 otherwise), and $\hat{y}_i$ is the predicted probability for class $i$.
  • Kullback-Leibler (KL) Divergence: A measure of how one probability distribution $P$ diverges from a second, expected probability distribution $Q$. It quantifies the information lost when $Q$ is used to approximate $P$: $\mathrm{KL}(P || Q) = \sum_i P(i) \log \left(\frac{P(i)}{Q(i)}\right)$. In this paper, it is used as a regularization term to encourage diversity between the prediction distributions of different reasoning steps.
  • Contrastive Learning: A self-supervised learning approach that aims to learn useful representations by pulling semantically similar samples closer together in an embedding space while pushing dissimilar samples apart.
    • InfoNCE Loss: A popular loss function used in contrastive learning, derived from Noise-Contrastive Estimation (NCE). It encourages the embedding of an anchor to be closer to its positive samples and further from negative samples: $\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}^+) / \tau)}{\exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}^+) / \tau) + \sum_{k=1}^N \exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}_k^-) / \tau)}$, where $\mathbf{z}$ is the anchor, $\mathbf{z}^+$ is a positive sample, $\mathbf{z}_k^-$ are negative samples, $\mathrm{sim}(\cdot)$ is a similarity function (e.g., dot product or cosine similarity), and $\tau$ is a temperature parameter.
  • Curriculum Learning: A training strategy where a model is first trained on "easy" examples and gradually exposed to more complex examples. This can help stabilize training and improve final performance.
  • Temperature Parameter ($\tau$): In softmax functions, the temperature $\tau$ controls the sharpness of the probability distribution. A high $\tau$ (softening) leads to a flatter distribution, while a low $\tau$ (sharpening) leads to a more peaked distribution where one class dominates (see the sketch below).
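
To make the temperature effect concrete, here is a minimal PyTorch snippet (our illustration, not from the paper) showing how $\tau$ flattens or sharpens a softmax distribution:

```python
import torch

# The same logits, normalized under three temperatures.
logits = torch.tensor([2.0, 1.0, 0.5])

for tau in (5.0, 1.0, 0.1):
    probs = torch.softmax(logits / tau, dim=-1)
    print(f"tau={tau}: {[round(p, 3) for p in probs.tolist()]}")

# High tau -> nearly uniform; low tau -> mass concentrates on the top logit.
```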

3.2. Previous Works

The paper builds upon and differentiates itself from established SeqRec methods and recent inference-time reasoning techniques from NLP.

3.2.1. Sequential Recommendation Models

The paper categorizes mainstream SeqRec methods into ID-based Encoding and Text-based Encoding.

  • ID-based Encoding: These methods represent items using unique IDs, which are then mapped to embeddings.

    • SASRec [31]: A foundational Transformer-based model for SeqRec. It employs a causal multi-head attention mechanism to capture sequential patterns: each item in the sequence attends only to previous items when predicting the next one. $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the dimension of the keys. In causal attention, future items are masked out (see the sketch after this list). The output at the last position of the final layer typically serves as the user's representation.
    • BERT4Rec [49]: Leverages bidirectional self-attention layers (similar to BERT in NLP) to capture deeper contextual information by allowing attention to both preceding and succeeding items in the sequence, but with a masked language model objective. This often leads to more robust representations.
      • Note: In BERT4Rec for SeqRec, a mask token is inserted into the sequence, and the model is trained to predict the masked item. For inference, the last item is typically masked to predict the next. The bidirectional nature allows for richer context but requires careful handling during inference for next-item prediction.
  • Text-based Encoding: These methods leverage textual attributes of items, often using pre-trained Large Language Models (LLMs) to generate item representations.

    • UniSRec [26]: Utilizes parameter whitening and a Mixture-of-Experts (MoE) adaptor to learn universal item and sequence representations from textual features. This approach is effective in addressing cold-start and data sparsity issues by transferring knowledge from textual descriptions.

    • MoRec [76]: Replaces traditional ID features with representations from advanced text and visual encoders (e.g., RoBERTa [38] and ViT [12]) to model multimodal item representations. This captures richer item semantics beyond just textual information.

      Differentiation: Existing SeqRec methods, regardless of ID-based or text-based, primarily adopt a direct forward computation paradigm. This means they process the input sequence once to produce a user representation. The ReaRec framework departs from this by introducing multi-step implicit reasoning during inference, effectively deepening the computational process to refine user representations.
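
As a concrete illustration of the causal masking described for SASRec above, the following minimal sketch (our code, not the official SASRec implementation) computes scaled dot-product attention in which each position attends only to itself and earlier positions:

```python
import torch

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    Q, K, V: (batch, n, d_k) query/key/value matrices for one head.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5               # (batch, n, n)
    n = scores.size(-1)
    future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))          # hide future items
    return torch.softmax(scores, dim=-1) @ V
```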

3.2.2. Inference-time Reasoning in NLP

The paper draws significant motivation from inference-time scaling in Large Language Models (LLMs).

  • Chain-of-Thought (CoT) Reasoning [19, 45, 53, 62]: A technique where LLMs are prompted to generate intermediate reasoning steps (a "chain of thought") before providing a final answer. This significantly improves performance on complex tasks (e.g., mathematics, coding) by explicitly increasing computational depth and allowing for deliberation.

  • Emergent Thinking Capabilities from Computational Depth [14]: Theoretical work has shown that CoT-based reasoning enhances models' capacity to handle complex problems by introducing increased computational depth, overcoming expressivity limitations of direct answers.

  • Implicit vs. Explicit CoT: While CoT in NLP often involves generating explicit intermediate tokens, there's also research into implicit chain of thought reasoning in latent spaces [3, 15, 20, 67]. This means the model performs reasoning internally without necessarily outputting the intermediate steps as tokens, aiming for efficiency and performance gains. Examples include Coconut [20] for continuous thinking in LLM latent spaces and Heima [46] which compresses multimodal CoT processes into single high-level thinking tokens.

    Differentiation: ReaRec is the first work to systematically explore inference-time computational power within recommender systems, translating the think-before-action paradigm from NLP (CoT) to SeqRec. Unlike NLP tasks where explicit reasoning chains can provide process supervision, ReaRec focuses on implicit reasoning in the latent space of SeqRec models, requiring novel learning strategies (ERL, PRL) to provide effective supervision signals for these intermediate, unobservable reasoning steps.

3.3. Technological Evolution

Sequential recommendation has evolved from simpler Markov Chain-based models (e.g., matrix factorization [22, 42]) that capture item-to-item transitions, to increasingly sophisticated deep learning architectures.

  1. Early Deep Learning: Recurrent Neural Networks (RNNs) like GRU4Rec [23] were introduced for session-based recommendations, capturing temporal dependencies.

  2. Convolutional Networks: Convolutional Neural Networks (CNNs) such as Caser [52] applied convolutional operations to item sequence embeddings, treating them like "images" to extract multi-level features.

  3. Attention Mechanisms: The advent of the Transformer architecture [57] revolutionized SeqRec with self-attention mechanisms. SASRec [31] became a classic baseline by using self-attention to weight historical items. BERT4Rec [49] further enhanced this with bidirectional encoding.

  4. Leveraging Side Information: To combat data sparsity and cold-start issues, recent models have integrated item attributes (text, images) using pre-trained language models (e.g., UniSRec [26], MoRec [76]) to learn richer representations.

    This paper's work (ReaRec) fits into this timeline by pushing the boundaries of Transformer-based SeqRec models. While previous works focused on designing better forward computation architectures or richer input representations, ReaRec introduces a novel dimension: enhancing the inference process itself by adding multi-step reasoning. This is a significant shift from the reasoning-free forward computation that characterized prior SeqRec models, bringing SeqRec closer to the advanced reasoning capabilities seen in LLMs.

3.4. Differentiation Analysis

The core differences and innovations of ReaRec compared to main methods in related work are:

  • Inference-Time Reasoning vs. Direct Forward Computation: The most significant difference is ReaRec's introduction of implicit multi-step reasoning during inference. Traditional SeqRec models (like SASRec, BERT4Rec, UniSRec, MoRec) perform a single forward pass to generate a user representation. ReaRec iteratively refines this representation through additional computational steps, explicitly aiming to deepen the model's understanding of user preferences.
  • Enhanced Computational Depth: By autoregressively feeding the last hidden state back into the encoder, ReaRec effectively increases the computational depth at inference time. This is a direct parallel to the concept of Chain-of-Thought reasoning in LLMs, which has been shown to improve performance on complex tasks. Prior SeqRec models primarily focus on architectural depth (e.g., number of Transformer layers) during training but not on iterative refinement during inference.
  • Addressing Task Gap with Reasoning Position Embeddings (RPE): A key innovation is the use of RPE to explicitly distinguish between the sequence encoding phase and the reasoning phase. This prevents the model from confusing the two distinct computational modes, a challenge not present in traditional SeqRec or easily transferable from NLP CoT (where intermediate steps are often explicit).
  • Novel Learning Strategies for Implicit Reasoning: Since implicit reasoning lacks explicit intermediate supervision signals, ReaRec proposes ERL and PRL.
    • ERL uses ensemble learning and KL divergence regularization to ensure diversity and effective supervision for multi-step reasoning, addressing the pattern collapse or reasoning degradation issue.
    • PRL uses progressive temperature annealing (inspired by curriculum learning) to guide distribution sharpening and Reasoning-aware Contrastive Learning (RCL) to enhance robustness against reasoning bias and error accumulation. These strategies are specifically designed for the challenges of implicit reasoning in SeqRec, which differ from explicit CoT in LLMs that might rely on process supervision.
  • Model Agnostic Nature: ReaRec is designed as a model-agnostic framework, meaning it can be integrated with various existing SeqRec backbones (both ID-based like SASRec/BERT4Rec and text-based like UniSRec/MoRec), enhancing their performance without requiring fundamental changes to their core architecture. This demonstrates its broad applicability.
  • Focus on Performance Ceiling and Long-Tail Items: The paper explicitly highlights ReaRec's ability to significantly elevate the performance ceiling of backbones and enhance modeling capability for underrepresented groups (long-tail users and items), areas where traditional SeqRec models often struggle due to limited data and less nuanced representations.

4. Methodology

The ReaRec framework aims to unleash the latent sequential reasoning capability of SeqRec models by introducing multi-step implicit reasoning during the inference phase. This section details its backbone and the two proposed learning strategies.

4.1. ReaRec Backbone

The proposed ReaRec framework is designed to be model-agnostic, meaning it can be integrated into various sequential recommenders. The paper uses the Transformer architecture as an example to illustrate its workings.

4.1.1. Self-attention Sequence Encoding

Given a user's historical interaction sequence $S_u = [v_1^u, v_2^u, \ldots, v_n^u]$, where $n$ is the length of the sequence, each item $v_i^u$ is first converted into an item embedding $\mathbf{e}_{v_i^u}$ by looking up the embedding matrix $\mathbf{E}$. To incorporate the sequential order, Absolute Position Embeddings are added to these item embeddings. For an item $v$ at position $i$, its initial input representation $\mathbf{h}_i^0$ is constructed by summing its item embedding $\mathbf{e}_v$ and the corresponding positional embedding $\mathbf{p}_i^I$:

$$\mathbf{h}_i^0 = \mathbf{e}_v + \mathbf{p}_i^I \tag{1}$$

Here, $\mathbf{p}_i^I$ is obtained from a learnable positional embedding matrix $\mathbf{P}^I \in \mathbb{R}^{n \times d}$, where $d$ is the embedding dimension. Next, these input representations are fed into a sequence encoder $f(\cdot)$, which typically consists of multiple layers of Multi-Head Self-Attention (MHSA) modules and Point-wise Feed-Forward Networks (FFN):

$$\mathbf{H}^l = f(\mathbf{H}^{l-1}) = \mathrm{FFN}(\mathrm{MHSA}(\mathbf{H}^{l-1})) \tag{2}$$

where $\mathbf{H}^l = [\mathbf{h}_1^l, \mathbf{h}_2^l, \ldots, \mathbf{h}_n^l]$ denotes the concatenated hidden states at the $l$-th layer, and $L$ denotes the total number of layers. In the conventional direct inference paradigm, the user representation $\mathbf{h}_u$ is simply the output hidden state at the last position of the final layer: $\mathbf{h}_u = \mathbf{H}^L[-1]$.

4.1.2. Extended Inference-Time Reasoning

Instead of directly using $\mathbf{H}^L[-1]$ as the final user representation, ReaRec introduces an implicit reasoning mechanism to augment computational capacity during inference. This is achieved by autoregressively feeding the hidden state of the last position back into the encoder for $K$ additional forward computations, effectively creating $K$ reasoning steps.

To bridge the task gap between the initial sequence encoding and the subsequent reasoning phases, ReaRec introduces Reasoning Position Embeddings (RPE), denoted as $\mathbf{P}^R \in \mathbb{R}^{K \times d}$. These embeddings explicitly distinguish item representations from reasoning inputs.

At the $k$-th reasoning step (where $k$ ranges from 1 to $K$), the input to the Transformer is conceptualized as an extended sequence. The first $n$ positions correspond to the original item sequence and remain unchanged from their initial encoding (as per Equation (1)). For position $n+i$ (where $i$ indexes the $i$-th reasoning step, ranging from 1 to $K$), the latent representation $\mathbf{h}_{n+i}^0$ is the sum of the last output hidden state $\mathbf{h}_{n+i-1}^L$ from the previous computation step (either the last item of the original sequence or the output of the previous reasoning step) and the $i$-th reasoning position embedding $\mathbf{p}_i^R$:

$$\mathbf{h}_{n+i}^0 = \mathbf{h}_{n+i-1}^L + \mathbf{p}_i^R \tag{3}$$

Here, $\mathbf{h}_{n+i-1}^L$ is the output of the Transformer's final layer from the previous step. For the first reasoning step ($i=1$), $\mathbf{h}_n^L$ corresponds to the final-layer output for the last item of the original sequence. The $\mathbf{p}_i^R$ are looked up from the learnable reasoning positional embedding matrix $\mathbf{P}^R$.

The hidden states of the model's final layer from position $n$ to $n+K$ (i.e., the output for the original sequence's last item followed by the $K$ reasoning steps) are denoted as $\mathbf{R} = [\mathbf{r}_0, \mathbf{r}_1, \ldots, \mathbf{r}_K]$, where $\mathbf{r}_i \in \mathbb{R}^d$ is the reasoning hidden state at the $i$-th step. Specifically, $\mathbf{r}_0$ equals $\mathbf{H}^L[-1]$ (the original user representation), and $\mathbf{r}_i$ for $i > 0$ is the output of the $i$-th reasoning step.

A straightforward approach to obtaining the user representation $\mathbf{h}_u$ would be to use the last reasoning output: $\mathbf{h}_u = \mathbf{r}_K$. The predicted probability $\hat{y}$ for user $u$ is then calculated using a softmax over the dot-product similarity between the user representation and all item embeddings in $\mathbf{E}$:

$$\hat{y} = \mathrm{softmax}(\mathbf{h}_u \cdot \mathbf{E}^\top) \tag{4}$$

The recommendation objective is to minimize the cross-entropy loss:

$$\mathcal{L}_{\mathrm{Rec}} = - \log \hat{y}_{v^+} \tag{5}$$

where $\hat{y}_{v^+}$ denotes the predicted probability of the ground-truth next item $v^+$ for user $u$. However, this naive objective suffers from a lack of supervision signals for intermediate reasoning states, making it vulnerable to reasoning pattern degradation. To address this, the paper proposes Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL).
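
To make the extended inference procedure concrete, the following PyTorch sketch implements Equations (1), (3), and (4) under simplifying assumptions: a generic `nn.TransformerEncoder` stands in for the sequence encoder, padding is ignored, and all names and hyperparameter values are illustrative rather than the authors' released code.

```python
import torch
import torch.nn as nn

class ReaRecSketch(nn.Module):
    """Illustrative sketch of the ReaRec backbone, not the official implementation."""

    def __init__(self, n_items, d=64, max_len=50, k_steps=2, n_layers=2):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d)              # E
        self.pos_emb = nn.Embedding(max_len, d)               # P^I (item positions)
        self.rpe = nn.Embedding(k_steps, d)                   # P^R (reasoning positions)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.k_steps = k_steps

    def forward(self, seq):                                   # seq: (B, n) item ids
        B, n = seq.shape
        pos = torch.arange(n, device=seq.device)
        h = self.item_emb(seq) + self.pos_emb(pos)            # Equation (1)
        states = []                                           # will hold r_0 .. r_K
        for k in range(self.k_steps + 1):
            L = h.size(1)                                     # causal mask over the
            mask = torch.triu(torch.ones(L, L, dtype=torch.bool,   # extended sequence
                                         device=seq.device), diagonal=1)
            out = self.encoder(h, mask=mask)
            states.append(out[:, -1])                         # last-position hidden state
            if k < self.k_steps:                              # Equation (3): feed back
                nxt = (states[-1] + self.rpe.weight[k]).unsqueeze(1)
                h = torch.cat([h, nxt], dim=1)
        return states

    def scores(self, h_u):                                    # Equation (4), pre-softmax
        return h_u @ self.item_emb.weight.T
```

With ERL, the final user representation averages all returned states; with PRL, it is the last state $\mathbf{r}_K$.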

4.2. Ensemble Reasoning Learning (ERL)

ERL provides effective supervision signals for the implicit reasoning process by treating the hidden states from different reasoning steps as multi-view representations of the user's evolving interests. It leverages ensemble learning to aggregate diverse reasoning results.

4.2.1. Multi-Step Reasoning Supervision

Instead of relying solely on the last reasoning state $\mathbf{r}_K$, ERL applies an average pooling layer to aggregate all reasoning hidden states, including the initial one $\mathbf{r}_0$, to obtain the final user representation:

$$\mathbf{h}_u = \frac{1}{K+1} \sum_{i=0}^K \mathbf{r}_i$$

The output distribution $\hat{y}$ is then computed from this aggregated $\mathbf{h}_u$ according to Equation (4). This aggregation captures a more comprehensive picture of user interests by combining insights from different depths of reasoning. The cross-entropy loss (Equation (5)) is applied to this ensembled representation.

4.2.2. KL Divergence Regularization

To prevent pattern collapse, where the model takes shortcuts by simply copying previous reasoning outputs, homogenizing the states and undermining the benefits of computational scaling, ERL introduces a Kullback-Leibler (KL) divergence constraint. This regularization term encourages diversity across the predictive probability distributions of different reasoning states, so that the multi-step reasoning process gathers multi-view insights into the user's complex interest distribution. The regularization term to be minimized is:

$$\mathcal{L}_{\mathrm{KL}} = - \sum_{i=0}^{K-1} \sum_{j=i+1}^K \mathrm{KL}(\hat{y}^{(i)} || \hat{y}^{(j)})$$

where $\hat{y}^{(i)}$ is the predicted probability distribution from the $i$-th reasoning step. The KL divergence is defined as $\mathrm{KL}(P || Q) = \sum_x P(x) \log \left(\frac{P(x)}{Q(x)}\right)$. Minimizing $-\mathrm{KL}(P || Q)$ is equivalent to maximizing $\mathrm{KL}(P || Q)$, thus encouraging the distributions to diverge.

The overall learning objective for ERL is to minimize the following loss function:

$$\mathcal{L}_{\mathrm{ERL}} = \mathcal{L}_{\mathrm{Rec}} + \lambda \mathcal{L}_{\mathrm{KL}}$$

where $\lambda$ is a hyperparameter that controls the strength of the KL divergence regularization.
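
As a minimal sketch of how the ERL objective can be assembled from the pieces above (assuming the list of reasoning states produced by a backbone such as the sketch in Section 4.1.2; `lam` is a placeholder value from the paper's search grid):

```python
import torch
import torch.nn.functional as F

def erl_loss(reasoning_states, item_emb, target, lam=0.01):
    """Sketch of the ERL objective; variable names are ours, not the authors'.

    reasoning_states: list of K+1 tensors of shape (B, d), i.e., r_0 .. r_K
    item_emb:         (n_items, d) item embedding matrix E
    target:           (B,) ground-truth next-item ids
    """
    # Average-pool the reasoning states into the ensembled user representation.
    h_u = torch.stack(reasoning_states, dim=0).mean(dim=0)
    rec = F.cross_entropy(h_u @ item_emb.T, target)          # L_Rec

    # Pairwise KL term: maximize KL between per-step distributions
    # (i.e., minimize its negation) to keep the reasoning steps diverse.
    log_probs = [F.log_softmax(r @ item_emb.T, dim=-1) for r in reasoning_states]
    kl = 0.0
    for i in range(len(log_probs) - 1):
        for j in range(i + 1, len(log_probs)):
            # F.kl_div(input, target) computes KL(target || input),
            # so this term is KL(y_hat^(i) || y_hat^(j)).
            kl = kl - F.kl_div(log_probs[j], log_probs[i],
                               log_target=True, reduction="batchmean")
    return rec + lam * kl                                    # L_ERL
```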

4.2.3. Inference Phase

During inference for ERL, the user representation $\mathbf{h}_u$ is obtained by average pooling the reasoning hidden states from all $K+1$ steps: $\mathbf{h}_u = \frac{1}{K+1} \sum_{i=0}^K \mathbf{r}_i$. This ensembled user representation is then used to compute similarity scores with candidate item representations to generate the final recommendation list.

4.3. Progressive Reasoning Learning (PRL)

PRL takes a different approach to guiding the intermediate reasoning chain: it aims to progressively approximate the user's true preference distribution through a distribution sharpening strategy.

4.3.1. Progressive Temperature Annealing (PTA)

Inspired by the human cognitive process, PRL assumes that as reasoning depth increases, the model should clarify user interest patterns, resulting in sharper predicted distributions. This is achieved using Progressive Temperature Annealing (PTA). A temperature coefficient $\tau_k$ is introduced for the $k$-th reasoning step to adjust the sharpness of the predicted distribution. It is formulated as:

$$\tau_k = \tau \cdot \alpha^{K-k}$$

where $\tau$ is the base temperature and $\alpha$ is a hyperparameter controlling the temperature decay rate. As $k$ approaches $K$, $\alpha^{K-k}$ approaches $\alpha^0 = 1$, so $\tau_k$ approaches $\tau$; with $\alpha > 1$, $\tau_k$ decreases as $k$ increases, yielding progressively sharper distributions as reasoning deepens.

The predicted distribution $\hat{\mathbf{y}}^{(k)}$ for the $k$-th reasoning step is then computed using this annealed temperature:

$$\hat{\mathbf{y}}^{(k)} = \mathrm{softmax}(\mathbf{r}_k \cdot \mathbf{E}^\top / \tau_k)$$

Unlike ERL, PRL applies separate recommendation losses to each reasoning hidden state to inject process supervision:

$$\mathcal{L}_{\mathrm{Rec}} = - \sum_{k=0}^K \log \hat{y}_{v^+}^{(k)}$$

Here, $\hat{y}_{v^+}^{(k)}$ is the predicted probability of the ground-truth item $v^+$ at the $k$-th reasoning step. This annealing strategy lets the model explore a broader solution space in early reasoning (higher $\tau_k$) and then gradually narrow the search space (lower $\tau_k$) toward the optimal solution.
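
A minimal sketch of this annealed, per-step recommendation loss, reusing the reasoning-state interface from the earlier sketches (the default `tau` and `alpha` are placeholders drawn from the paper's search grids):

```python
import torch
import torch.nn.functional as F

def prl_rec_loss(reasoning_states, item_emb, target, tau=1.0, alpha=1.5):
    """Sketch of PRL's progressive temperature annealing; names are ours."""
    K = len(reasoning_states) - 1                        # states r_0 .. r_K
    loss = 0.0
    for k, r_k in enumerate(reasoning_states):
        tau_k = tau * (alpha ** (K - k))                 # annealed temperature
        logits = (r_k @ item_emb.T) / tau_k              # sharper as k -> K
        loss = loss + F.cross_entropy(logits, target)    # per-step supervision
    return loss
```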

4.3.2. Reasoning-aware Contrastive Learning (RCL)

To enhance the generalization ability and robustness of PRL against reasoning bias and error accumulation, a Reasoning-aware Contrastive Learning (RCL) method is designed. RCL simulates accumulated reasoning error by injecting noise into the reasoning states. For each reasoning step $i \in \{1, 2, \ldots, K\}$, noise is added to the input to generate a noised reasoning input:

$$\tilde{\mathbf{h}}_{n+i}^0 = \mathbf{h}_{n+i}^0 + \epsilon$$

where $\mathbf{h}_{n+i}^0$ is the input for the $i$-th reasoning step (as defined in Equation (3)), and $\epsilon$ is a noise embedding sampled from a normal distribution $\epsilon \sim \mathcal{N}(0, \gamma \mathbf{I})$. Here, $\mathbf{I} \in \mathbb{R}^{d \times d}$ is the identity matrix, and $\gamma$ controls the noise intensity. Feeding this noised input into the Transformer encoder yields a new set of hidden states $\tilde{\mathbf{R}} = [\tilde{\mathbf{r}}_1, \tilde{\mathbf{r}}_2, \ldots, \tilde{\mathbf{r}}_K]$.

To learn robust representations, RCL uses a self-supervised task based on Mutual Information Maximization (MIM). MIM aims to maximize the mutual information $I(X, Y) = H(X) - H(X|Y)$ between the original hidden states $\mathbf{R}$ and the noised hidden states $\tilde{\mathbf{R}}$. This forces the model to capture essential sequential information and perform self-reflection in the implicit thought space.

Since directly maximizing mutual information is intractable, an InfoNCE-based reasoning contrastive learning method is used to optimize its lower bound:

$$\mathcal{L}_{\mathrm{RCL}} = - \sum_{k=1}^K \log \frac{\exp(\mathrm{sim}(\tilde{\mathbf{r}}_k, \mathbf{r}_k^+) / \tau_{\mathrm{CL}})}{\exp(\mathrm{sim}(\tilde{\mathbf{r}}_k, \mathbf{r}_k^+) / \tau_{\mathrm{CL}}) + \sum_{\mathbf{r}_k^- \in \mathbf{R}_k^-} \exp(\mathrm{sim}(\tilde{\mathbf{r}}_k, \mathbf{r}_k^-) / \tau_{\mathrm{CL}})}$$

Here, $\mathrm{sim}(\cdot)$ denotes the dot-product similarity function, $\mathbf{r}_k^+$ is the positive contrastive hidden state (the original $k$-th reasoning state $\mathbf{r}_k$), $\mathbf{R}_k^-$ is the set of negative contrastive hidden states (the $k$-th reasoning states of other item sequences within the same batch), and $\tau_{\mathrm{CL}}$ is a temperature parameter for contrastive learning.

The overall objective function for the PRL method combines the recommendation loss and the reasoning contrastive loss:

$$\mathcal{L}_{\mathrm{PRL}} = \mathcal{L}_{\mathrm{Rec}} + \mathcal{L}_{\mathrm{RCL}}$$
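
The in-batch InfoNCE objective above can be computed as a cross-entropy over pairwise similarities; the sketch below assumes the clean and noise-perturbed reasoning states are stacked as (K, B, d) tensors (our interface, not the authors' code):

```python
import torch
import torch.nn.functional as F

def rcl_loss(r, r_noised, tau_cl=0.1):
    """Sketch of Reasoning-aware Contrastive Learning with in-batch negatives.

    r, r_noised: (K, B, d) clean and noise-perturbed reasoning states, where
                 r_noised comes from re-encoding inputs perturbed with
                 Gaussian noise eps ~ N(0, gamma * I).
    """
    K, B, _ = r.shape
    loss = 0.0
    for k in range(K):
        logits = (r_noised[k] @ r[k].T) / tau_cl       # (B, B) dot-product sims
        labels = torch.arange(B, device=r.device)      # positives on the diagonal,
        loss = loss + F.cross_entropy(logits, labels)  # other rows act as negatives
    return loss
```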

4.3.3. Inference Phase

During inference for PRL, the user representation hu\mathbf{h}_u is taken directly from the output of the final reasoning step: hu=rK\mathbf{h}_u = \mathbf{r}_K. This final representation is then used to compute similarity scores with candidate item embeddings to generate the recommendation list.

4.4. Discussion

4.4.1. Principle Analysis

The ReaRec framework's core principle is to extend the model's capacity by strategically increasing the amount of inference-time computation. By autoregressively feeding reasoning hidden states back into the sequence encoder, the model continuously increases its feature-crossing depth, capturing finer-grained sequence characteristics and improving recommendation performance.

  • ERL integrates multi-level deep crossing features into the final user representation, effectively ensembling diverse insights from various reasoning depths.
  • PRL, leveraging curriculum learning, gradually uncovers more complex intent evolution patterns as reasoning progresses, aiming to approximate the true user interest distribution more accurately.

4.4.2. Time and Space Complexity

Time Complexity: Let $C$ be the user sequence length and $L$ be the number of Transformer layers.

  • Base Backbone (without reasoning): The sequence passes through $L$ layers, each consisting of MHSA (Multi-Head Self-Attention) and an FFN (Feed-Forward Network). The MHSA cost for a sequence of length $C$ with embedding dimension $d$ is $O(C^2 d + C d^2)$, and the FFN cost is $O(C d^2)$. The total time complexity of the base backbone is therefore $O(L(C^2 d + C d^2))$.

  • Reasoning-Enhanced Phase (with $K$ reasoning steps): ReaRec employs a KV Caching technique to store historical key-value pairs, which significantly reduces redundant computation. At the $k$-th reasoning step, the effective sequence length grows to $C+k-1$; thanks to KV Caching, only the new token (the output of the previous reasoning step) attends to the cached history and itself. The MHSA cost for this single token at step $k$ is thus $O((C+k-1)d)$, and the FFN cost is $O(d^2)$. With $L$ Transformer blocks and $K$ reasoning steps, the total additional overhead of the reasoning phase is approximately $O(L \sum_{k=1}^K ((C+k-1)d + d^2)) = O(L(K(C+K)d + Kd^2))$. Since $K$ (typically 2 or 3) is much smaller than $C$ (e.g., 50), the $K^2$ term is negligible, and the overhead can be further approximated as $O(L(KCd + Kd^2))$. This is acceptable because KV Caching prevents quadratic scaling with the number of reasoning steps; the cost scales linearly with the cached context length (see the sketch below).
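
The following sketch illustrates the single-token attention update that KV Caching enables at each reasoning step (illustrative shapes and names; real implementations cache keys and values per layer and per attention head):

```python
import torch

def cached_attention_step(q_new, k_cache, v_cache, k_new, v_new):
    """One KV-cached reasoning step: only the new position attends over the
    cached history, so the MHSA cost is O((C + k) * d) per layer instead of
    re-encoding the whole sequence. A sketch, not production code."""
    k_cache = torch.cat([k_cache, k_new], dim=1)            # (B, C+k, d)
    v_cache = torch.cat([v_cache, v_new], dim=1)
    d = q_new.size(-1)
    scores = q_new @ k_cache.transpose(1, 2) / d ** 0.5     # (B, 1, C+k)
    out = torch.softmax(scores, dim=-1) @ v_cache           # (B, 1, d)
    return out, k_cache, v_cache
```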

Space Complexity: The method only adds $K$ $d$-dimensional Reasoning Position Embeddings $\mathbf{P}^R$, whose space complexity is $O(Kd)$, negligible compared to the original model parameters (item embeddings, Transformer weights). Thus, the framework is lightweight and flexible in terms of space.

The following are the results from Figure 3 of the original paper:

Figure 3: Overview of the proposed ReaRec framework and two reasoning-enhanced learning strategies: Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL). The schematic shows the structure of the reasoning-enhanced sequential recommender, including key components such as item embeddings, position embeddings, and reasoning hidden states, as well as the optimization process using average pooling and KL regularization.

5. Experimental Setup

5.1. Datasets

To evaluate the effectiveness of the proposed ReaRec methods, experiments were conducted on five real-world recommendation datasets from Yelp and Amazon platforms.

  • Yelp:

    • Source: A well-known business review website.
    • Characteristics: Provides rich multidimensional data on user behaviors and business attributes.
    • Preprocessing: Interactions with ratings greater than 3 were considered positive. 20-core filtering was applied, meaning users and items with fewer than 20 interactions were removed.
    • Textual Encoding: Item information included name, location (city and state), and business categories.
    • Splitting: Chronologically split into training, validation, and test sets based on two timestamp thresholds: September 4, 2018, and May 12, 2020.
  • Amazon 2023: Derived from Amazon, a global e-commerce platform. Four domain-specific datasets were selected:

    • Video & Games

    • Software

    • CDs & Vinyl

    • Baby & Products

    • Textual Features: Retained product attributes like title, description, and price.

    • Preprocessing: User-item interactions with ratings greater than 3 were positive. Filtering applied: users with fewer than 5 interactions for Video & Games, Software, Baby & Products, and fewer than 10 interactions for CDs & Vinyl were removed.

    • Splitting: Followed official absolute timestamps to partition item sequences, aligning with real-world scenarios.

      The detailed statistics of the datasets are summarized in Table 1.

The following are the results from Table 1 of the original paper:

Dataset Yelp Video & Games Software CDs & Vinyl Baby & Products
#Users 13,083 89,021 30,049 35,238 140,292
#Items 10,697 22,933 16,705 87,969 30,689
#Avg. Inter. / User 33.92 5.96 5.59 14.59 5.57
#Avg. Inter. / Item 41.49 23.15 10.06 5.84 25.44
#Inter. 443,807 530,989 168,029 513,991 780,809
Sparsity 99.68% 99.97% 99.97% 99.98% 99.98%

The datasets were chosen to represent diverse domains (e-commerce, reviews) and varying levels of data sparsity and scale, making them effective for validating the method's generality and performance across different recommendation scenarios. They are standard benchmarks in sequential recommendation research.

5.2. Evaluation Metrics

The paper adopts two widely used top-k evaluation metrics in sequential recommendation research:

  1. Normalized Discounted Cumulative Gain (NDCG):

    • Conceptual Definition: NDCG measures the ranking quality of a recommendation list by considering both the relevance of recommended items and their position in the list. More relevant items appearing higher in the list contribute more to the NDCG score. It is particularly useful for scenarios where item relevance can be graded (e.g., ratings) or where positional accuracy is important.
    • Mathematical Formula: First, the Discounted Cumulative Gain (DCG) at position $k$ is calculated as $\mathrm{DCG}_k = \sum_{i=1}^k \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}$, where $\mathrm{rel}_i$ is the relevance score of the item at position $i$. NDCG then normalizes DCG by the Ideal DCG (IDCG), the DCG of an ideal ranking in which all relevant items are perfectly ordered: $\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}$. The paper reports NDCG@10 and NDCG@20.
    • Symbol Explanation:
      • $k$: The cut-off position (e.g., 10 or 20 for NDCG@10, NDCG@20).
      • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the ranked list. In SeqRec, this is often binary (1 if the true next item is at position $i$, 0 otherwise).
      • $\mathrm{DCG}_k$: Discounted Cumulative Gain at position $k$.
      • $\mathrm{IDCG}_k$: Ideal Discounted Cumulative Gain at position $k$, representing the maximum possible DCG for the list.
  2. Recall:

    • Conceptual Definition: Recall (also known as True Positive Rate or Sensitivity) measures the proportion of actual relevant items that are successfully identified and recommended within a top-k list. It focuses on the completeness of the recommendation, i.e., how many of the truly relevant items were retrieved.
    • Mathematical Formula: $\mathrm{Recall}_k = \frac{|\{\text{relevant items in top-}k\text{ list}\}|}{|\{\text{all relevant items}\}|}$. In next-item prediction there is typically only one true next item, so Recall is 1 if that item appears in the top-$k$ list and 0 otherwise (see the sketch after this list). The paper reports Recall@10 and Recall@20.
    • Symbol Explanation:
      • $k$: The cut-off position (e.g., 10 or 20 for Recall@10, Recall@20).
      • $|\{\text{relevant items in top-}k\text{ list}\}|$: The number of relevant items found within the top-$k$ recommendations.
      • $|\{\text{all relevant items}\}|$: The total number of relevant items (typically 1, the true next item).
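
Under the single-relevant-item convention above, both metrics reduce to simple functions of the true item's rank; a minimal sketch (our helper, not the paper's evaluation code):

```python
import math

def ndcg_recall_at_k(ranked_items, true_item, k=10):
    """NDCG@k / Recall@k for next-item prediction, where exactly one item
    is relevant (so IDCG = 1 and Recall is 0 or 1)."""
    if true_item in ranked_items[:k]:
        rank = ranked_items.index(true_item)       # 0-based position in the list
        return 1.0 / math.log2(rank + 2), 1.0      # (NDCG@k, Recall@k)
    return 0.0, 0.0

# Example: the true next item is ranked 3rd in the top-10 list.
print(ndcg_recall_at_k([5, 9, 42, 7], 42, k=10))   # (0.5, 1.0)
```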

5.3. Baselines

The generality of the ReaRec framework was evaluated by integrating it with different types of sequential recommendation models, including ID-based and text-based encoding methods.

  • ID-based Encoding Methods:

    • SASRec [31]: A representative and strong baseline for sequential recommendation that uses a causal multi-head attention mechanism to capture sequential patterns.
    • BERT4Rec [49]: Leverages bidirectional self-attention layers for deeper contextual information infusion across user behavior sequences, allowing it to capture context from both past and "future" (masked) items.
  • Text-based Encoding Methods:

    • UniSRec [26]: Utilizes parameter whitening and a Mixture-of-Experts (MoE) adaptor to learn universal item and sequence representations from textual features, addressing cold-start and data sparsity.

    • MoRec [76]: Incorporates advanced text and visual encoders (e.g., RoBERTa [38] and ViT [12]) to model multimodal representations of items, replacing traditional ID features.

      The chosen baselines are representative of state-of-the-art and widely adopted sequential recommendation models, covering both discrete ID representations and richer textual/multimodal item features, thus providing a comprehensive evaluation of ReaRec's applicability.

5.4. Implementation Details

  • Hardware: All experiments were conducted on 8 NVIDIA A100 GPUs.
  • Hyperparameters (General):
    • Embedding size: 256 for all methods.
    • Batch size: 2048 for all methods.
    • Optimizer: Adam [32].
    • Learning rate: 0.001.
    • Activation function: GeLU.
    • Sequence length: User sequences truncated to a maximum length of 50 across all datasets.
  • BERT4Rec Specifics: For BERT4Rec's bidirectional Transformer, a Prefix Masking strategy was employed. The item sequence part uses bidirectional attention, while the reasoning phase adopts unidirectional attention.
  • Item-based Methods (Textual Features):
    • LLaMA-3.1-8B [17] was used to encode item textual features.
    • Principal Component Analysis (PCA) was applied to the averaged last-layer hidden states, distilling the LLM outputs into 768-dimensional item representations while preserving their core features.
  • ERL Specifics:
    • KL regularization hyperparameter $\lambda$: searched within $\{0.001, 0.005, 0.01, 0.05, 0.1\}$.
  • PRL Specifics:
    • Noise strength $\gamma$: set to 0.01.
    • Base temperature $\tau$: tuned over $\{0.05, 0.1, 0.5, 1.0, 2.0, 5.0\}$.
    • Temperature decay rate $\alpha$: tuned over $\{1.0, 1.2, 1.5, 2.0, 5.0, 10.0\}$.
  • Training Protocol: Early stopping was triggered if metrics on the validation set did not improve over 10 consecutive epochs.
  • Code Availability: The code will be available at https://github.com/TangJiakai/ReaRec.
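
For reference, the settings above can be summarized in a single configuration sketch (our own condensation of this section, not the authors' released config file):

```python
# Hyperparameters collected from Section 5.4; list values denote search grids.
config = {
    "embedding_size": 256,
    "batch_size": 2048,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "activation": "GeLU",
    "max_seq_len": 50,
    "erl": {"kl_lambda": [0.001, 0.005, 0.01, 0.05, 0.1]},
    "prl": {
        "noise_gamma": 0.01,
        "base_tau": [0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
        "alpha": [1.0, 1.2, 1.5, 2.0, 5.0, 10.0],
    },
    "early_stopping_patience": 10,  # epochs without validation improvement
}
```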

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the superiority of the proposed ReaRec framework with its ERL and PRL methods across various SeqRec backbones and datasets.

  • Comparison of ID-based Models (Table 2):

    • BERT4Rec generally performs slightly better than SASRec, indicating the benefit of bidirectional contextual information in capturing sequential patterns.
    • Both ERL and PRL significantly improve the performance of SASRec and BERT4Rec. For instance, PRL on SASRec achieves an average improvement of 11.81% on Yelp and 7.00% on Video & Games. ERL on BERT4Rec shows a 21.49% average improvement on Baby & Products. This highlights ReaRec's ability to unlock latent reasoning power even for established ID-based models.
  • Comparison of Text-based Models (Table 3):

    • Text-based methods (UniSRec, MoRec) consistently outperform ID-based models across all datasets. This is attributed to their use of powerful language models to encode item information, which effectively mitigates data sparsity and cold-start issues by learning domain-invariant representations.
    • ERL and PRL further enhance the performance of text-based models. ERL on UniSRec achieves an average improvement of 31.54% on CDs & Vinyl, and PRL on UniSRec gains 25.66% on the same dataset. MoRec also sees notable improvements, e.g., 7.76% for ERL on Video & Games. This indicates that ReaRec is effective even when starting with richer, text-aware item representations.
  • Overall Effectiveness of ReaRec:

    • ReaRec (with ERL and PRL) consistently and significantly surpasses baseline models in most cases.

    • For ID-based methods, ERL and PRL on SASRec achieve average improvements of 6.76% and 8.21% across all metrics on five datasets.

    • For text-based methods, ERL and PRL on UniSRec outperform the base model by 12.29% and 10.43% on average.

    • The paper concludes that ReaRec introduces a novel approach of using latent-space computations during inference to deepen feature crossing depth, effectively unlocking latent reasoning power and demonstrating that increasing inference-time computation is a promising avenue for improving recommendation performance.

      The following are the results from Table 2 of the original paper:

      Dataset Method SASRec BERT4Rec
      N@10 N@20 R@10 R@20 Avg. N@10 N@20 R@10 R@20 Avg.
      Yelp Base 0.0347 0.0452 0.0626 0.1047 - 0.0364 0.0460 0.0653 0.1038 -
      +ERL 0.0383 0.0474 0.0691 0.1056 ↑6.62% 0.0371 0.0476 0.0661 0.1077 ↑2.60%
      (Improv.) (↑10.37%) (↑4.87%) (↑10.38%) (↑0.86%) (↑1.92%) (↑3.48%) (↑1.23%) (↑3.76%)
      +PRL 0.0388 0.0493 0.073 0.1149 ↑11.81% 0.0377 0.0487 0.0708 0.1149 ↑7.14%
      (Improv.) (↑11.82%) (↑9.07%) (↑16.61%) (↑9.74%) (↑3.57%) (↑5.87%) (↑8.42%) (↑10.69%)
      Video & Games Base 0.0284 0.0353 0.0542 0.0816 - 0.0289 0.0355 0.0548 0.0810 -
      +ERL 0.0301 0.0385 0.0581 0.0915 ↑8.59% 0.0311 0.0375 0.0578 0.0832 ↑5.36%
      (Improv.) (↑5.99%) (↑9.07%) (↑7.20%) (↑12.13%) (↑7.61%) (↑5.63%) (↑5.47%) (↑2.72%)
      +PRL 0.0299 0.0379 0.0572 0.0890 ↑6.81% 0.0306 0.0380 0.0584 0.0879 ↑7.00%
      (Improv.) (↑5.28%) (↑7.37%) (↑5.54%) (↑9.07%) (↑5.88%) (↑7.04%) (↑6.57%) (↑8.52%)
      Software Base 0.0696 0.0895 0.1468 0.2264 - 0.0710 0.0893 0.1530 0.2258 -
      +ERL 0.0743 0.0935 0.1456 0.2224 ↑2.16% 0.0769 0.0964 0.1554 0.2328 ↑5.23%
      (Improv.) (↑6.75%) (↑4.47%) (↓0.82%) (↓1.77%) (↑8.31%) (↑7.95%) (↑1.57%) (↑3.10%)
      +PRL 0.0739 0.0949 0.1488 0.2324 ↑4.06% 0.0762 0.0976 0.1500 0.2350 ↑4.68%
      (Improv.) (↑6.18%) (↑6.03%) (↑1.36%) (↑2.65%) (↑7.32%) (↑9.29%) (↓1.96%) (↑4.07%)
      CDs & Vinyl Base 0.0148 0.0174 0.0317 0.0419 - 0.0149 0.0185 0.0326 0.0468 -
      +ERL 0.0182 0.0212 0.0363 0.0482 ↑18.59% 0.0165 0.0208 0.0354 0.0524 ↑10.93%
      (Improv.) (↑22.97%) (↑21.84%) (↑14.51%) (↑15.04%) (↑10.74%) (↑12.43%) (↑8.59%) (↑11.97%)
      +PRL 0.0155 0.0195 0.0315 0.0470 ↑7.08% 0.0162 0.0202 0.0334 0.0496 ↑6.59%
      (Improv.) (↑4.73%) (↑12.07%) (↓0.63%) (↑12.17%) (↑8.72%) (↑9.19%) (↑2.45%) (↑5.98%)
      Baby & Products Base 0.0112 0.0157 0.0260 0.0437 - 0.0109 0.0154 0.0257 0.0439 -
      +ERL 0.0116 0.0164 0.0228 0.0418 ↓2.16% 0.0148 0.0195 0.0293 0.0481 ↑21.49%
      (Improv.) (↑3.57%) (↑4.46%) (↓12.31%) (↓4.35%) (↑35.78%) (↑26.62%) (↑9.57%) (↑14.01%)
      +PRL 0.0135 0.0178 0.0281 0.0451 ↑11.30% 0.0140 0.0185 0.0291 0.0466 ↑16.99%
      (Improv.) (↑20.54%) (↑13.38%) (↑8.08%) (↑3.20%) (↑28.44%) (↑20.13%) (↑6.15%) (↑13.23%)

The following are the results from Table 3 of the original paper:

Dataset Method UniSRec MoRec
N@10 N@20 R@10 R@20 Avg. N@10 N@20 R@10 R@20 Avg.
Yelp Base 0.0380 0.0495 0.0737 0.1195 - 0.0391 0.0516 0.0757 0.1258 -
+ERL 0.0406 0.0521 0.0770 0.1227 ↑4.81% 0.0417 0.0531 0.0832 0.1283 ↑5.36%
(Improv.) (↑6.84%) (↑5.25%) (↑4.48%) (↑2.68%) (↑6.65%) (↑2.91%) (↑9.91%) (↑1.99%)
+PRL 0.0413 0.0529 0.0788 0.1253 ↑6.83% 0.0410 0.0532 0.0804 0.1289 ↑4.16%
(Improv.) (↑8.68%) (↑6.87%) (↑6.92%) (↑4.85%) (↑4.86%) (↑3.10%) (↑6.21%) (↑2.46%)
Video & Games Base 0.0328 0.0421 0.0683 0.1054 - 0.0350 0.0438 0.0716 0.1065 -
+ERL 0.0364 0.0440 0.0711 0.1015 ↑3.97% 0.0392 0.0485 0.0744 0.1112 ↑7.76%
(Improv.) (↑10.98%) (↑4.51%) (↑4.10%) (↓3.70%) (↑12.00%) (↑10.73%) (↑3.91%) (↑4.41%)
+PRL 0.0352 0.0433 0.0658 0.0982 ↓0.08% 0.0371 0.0462 0.0708 0.1067 ↑2.64%
(Improv.) (↑7.32%) (↑2.85%) (↓3.66%) (↓6.83%) (↑6.00%) (↑5.48%) (↓1.12%) (↑0.19%)
Software Base 0.0820 0.1041 0.1643 0.2522 - 0.0846 0.1050 0.1697 0.2510 -
+ERL 0.0851 0.1075 0.1669 0.2556 ↑2.49% 0.0881 0.1071 0.1711 0.2466 ↑1.30%
(Improv.) (↑3.78%) (↑3.27%) (↑1.58%) (↑1.35%) (↑4.14%) (↑2.00%) (↑0.82%) (↓1.75%)
+PRL 0.0869 0.1076 0.1687 0.2518 ↑2.96% 0.0917 0.1120 0.1723 0.2532 ↑4.37%
(Improv.) (↑5.98%) (↑3.36%) (↑2.68%) (↓0.16%) (↑8.39%) (↑6.67%) (↑1.53%) (↑0.88%)
CDs & Vinyl Base 0.0150 0.0208 0.0298 0.0527 - 0.0186 0.0235 0.0405 0.0604 -
+ERL 0.0208 0.0259 0.0428 0.0629 ↑31.54% 0.0199 0.0248 0.0417 0.0609 ↑4.08%
(Improv.) (↑38.67%) (↑24.52%) (↑43.62%) (↑19.35%) (↑6.99%) (↑5.53%) (↑2.96%) (↑0.83%)
+PRL 0.0191 0.0253 0.0394 0.0640 ↑25.66% 0.0198 0.0249 0.0417 0.0618 ↑4.42%
(Improv.) (↑27.33%) (↑21.63%) (↑32.21%) (↑21.44%) (↑6.45%) (↑5.96%) (↑2.96%) (↑2.32%)
Baby & Products Base 0.0152 0.0199 0.0315 0.0501 - 0.0176 0.0231 0.0371 0.0588 -
+ERL 0.0183 0.0239 0.0367 0.0589 ↑18.64% 0.0184 0.0242 0.0373 0.0602 ↑3.06%
(Improv.) (↑20.39%) (↑20.10%) (↑16.51%) (↑17.56%) (↑4.55%) (↑4.76%) (↑0.54%) (↑2.38%)
+PRL 0.0182 0.0236 0.0359 0.0575 ↑16.77% 0.0189 0.0247 0.0376 0.0611 ↑4.89%
(Improv.) (↑19.74%) (↑18.59%) (↑13.97%) (↑14.77%) (↑7.39%) (↑6.93%) (↑1.35%) (↑3.91%)

6.2. Ablation Studies / Parameter Analysis

6.2.1. Robustness Analysis Across User and Item Subgroups

The paper conducts a robustness analysis by splitting users and items into four equal-sized subgroups:

  • Users: UG-0 (shortest sequences) to UG-3 (longest sequences).
  • Items: IG-0 (least popular) to IG-3 (most popular). The PRL method (with SASRec backbone) was trained with three reasoning steps, and inference performance (NDCG@20) was analyzed as reasoning steps increased.

The following are the results from Figure 4 of the original paper:

Figure 4: Robustness study w.r.t. different user and item subgroups on the Yelp dataset. 'Step x' denotes the recommendation performance at the x-th reasoning step. 'UG' and 'IG' denote User and Item Group, respectively; higher group numbers indicate longer sequences and more popular items.

Observations (Figure 4):

  • Long-tail users (UG-0, UG-1) and unpopular items (IG-0, IG-1): Recommendation quality steadily improves as reasoning steps increase. For instance, in IG-1, performance gains of 12.08%, 16.35%, and 18.69% are observed with more reasoning steps. This suggests that multi-step reasoning is particularly beneficial for sparse interaction signals.
  • Active users (UG-2, UG-3) and popular items (IG-2, IG-3): Performance tends to decline as reasoning steps increase.
    • Explanation: Longer user sequences provide richer contextual information, making interest evolution patterns easier to mine. For popular items, their well-trained representations allow the recommender to easily capture collaborative signals. In these cases, additional inference computation might lead to overthinking, providing negligible benefits and even causing performance degradation.
  • Conclusion: Long-tail users and items require more thinking space to reason about sparse signals, while highly active users and items may not need redundant computational expansion. This implies a need for an adaptive inference depth selection mechanism in future work.

6.2.2. Impact of Reasoning Steps on Recommendation Performance

This analysis compares the NDCG@20 performance under different inference steps, using SASRec as the backbone.

  • Base: Original SASRec (no reasoning).

  • Naive: Base method extended to multi-step reasoning by autoregressively feeding the last hidden state, but only using the final position's output.

  • RPE: Builds on Naive but integrates Reasoning Positional Embeddings to distinguish sequence encoding from reasoning (see the sketch after this list).

  • ERL & PRL: The proposed methods.
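To make the Naive vs. RPE distinction concrete, below is a minimal PyTorch sketch of the shared multi-step reasoning loop: the encoder's last hidden state is fed back autoregressively, tagged with a learned reasoning position embedding. The class name `LatentReasoner` and the exact wiring are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    """Multi-step latent reasoning over a sequence encoder (illustrative sketch).
    `encoder` is any module mapping (B, L, D) -> (B, L, D)."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_steps: int = 3):
        super().__init__()
        self.encoder = encoder
        self.num_steps = num_steps
        # Reasoning Position Embeddings (RPE): one learned vector per reasoning
        # step, separating the item-encoding space from the reasoning space.
        self.rpe = nn.Embedding(num_steps, hidden_dim)

    def forward(self, item_emb: torch.Tensor):
        seq, step_states = item_emb, []
        for k in range(self.num_steps):
            h = self.encoder(seq)[:, -1, :]   # last hidden state, (B, D)
            step_states.append(h)
            # Autoregressively append the last state, tagged with its step's
            # RPE, as a pseudo-token for the next reasoning pass.
            seq = torch.cat([seq, (h + self.rpe.weight[k]).unsqueeze(1)], dim=1)
        return step_states                    # one user vector per step

# Usage with a toy Transformer encoder (8 users, 20 interactions, dim 64):
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
reasoner = LatentReasoner(nn.TransformerEncoder(layer, num_layers=2), 64, num_steps=3)
states = reasoner(torch.randn(8, 20, 64))
```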

    The following are the results from Figure 5 of the original paper:

    Figure 5: NDCG@20 under different numbers of reasoning steps on the Yelp and Video & Games datasets, comparing the Base, Naive, RPE, ERL, and PRL methods; ERL and PRL consistently perform best across steps.

    Observations (Figure 5):

  • Naive Method: Fails to yield performance improvements and even underperforms the Base model. This is attributed to the model's inability to distinguish between sequence encoding and reasoning phases.

  • RPE Method: Significantly mitigates this task gap, leading to obvious performance gains compared to Naive. However, it still suffers from reasoning pattern degradation and error accumulation as it only optimizes cross-entropy loss on the final-step output, lacking supervision for intermediate states.

  • ERL & PRL Methods: Significantly alleviate these issues by explicitly injecting stepwise supervision signals, reducing optimization difficulty.

  • Performance Decline with Excessive Reasoning: Across all methods, a consistent performance decline is observed as the number of inference steps increases beyond a certain point. This suggests overthinking for simpler user interaction patterns. This further supports the need for adaptive inference depth selection.

6.2.3. Impact of Reasoning Steps on Inference Latency

The paper evaluates the additional overhead introduced by ReaRec's expanded computational demands during inference. Using PRL as an example, the inference time cost on the test set was measured as reasoning steps increased.

The following are the results from Table 4 of the original paper:

| Backbone | Base | Step-1 | Step-2 | Step-3 | Step-4 | Step-5 |
| --- | --- | --- | --- | --- | --- | --- |
| SASRec | 5.6761 | 5.7985 | 5.8752 | 5.9305 | 6.0310 | 6.2786 |
| (Cost Inc.) | - | 2.16% | 3.51% | 4.48% | 6.25% | 10.61% |
| BERT4Rec | 5.6535 | 5.7685 | 5.9174 | 5.9621 | 6.0862 | 6.1224 |
| (Cost Inc.) | - | 2.03% | 4.67% | 5.46% | 7.65% | 8.29% |
| UniSRec | 5.6061 | 5.6312 | 5.7596 | 5.8732 | 6.0303 | 6.0502 |
| (Cost Inc.) | - | 0.45% | 2.74% | 4.76% | 7.57% | 7.92% |
| MoRec | 5.6638 | 5.7143 | 5.8391 | 5.9565 | 5.9659 | 5.9812 |
| (Cost Inc.) | - | 0.89% | 3.10% | 5.17% | 5.33% | 5.60% |

Note: All times are in seconds (s).

Observations (Table 4):

  • The extra latency for ReaRec remains manageable despite the recurrent autoregressive inference.
  • This efficiency is due to the KV Caching technique, which reduces per-step attention computation from $O(N^2)$ to $O(N)$ by reusing key and value vectors from past steps (see the sketch after this list).
  • Optimal performance is typically achieved at two reasoning steps (Step-2). At this point, the method increases performance by an average of 7.49% across all metrics with only a modest latency overhead of 3.51%, which is deemed acceptable and practical for real-world industrial deployment.
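To illustrate why the overhead stays small, the following sketch caches keys and values across reasoning steps so each new step attends once over the existing sequence instead of re-encoding it. Single-head, unmasked attention and the `attend_with_cache` helper are simplifying assumptions.

```python
import torch

def attend_with_cache(q, new_k, new_v, cache):
    """One reasoning step of single-head attention with a KV cache (sketch).
    Past keys/values are reused, so each step attends in O(N) rather than
    re-encoding the whole sequence in O(N^2)."""
    cache["k"] = torch.cat([cache["k"], new_k], dim=1)   # (B, N, D)
    cache["v"] = torch.cat([cache["v"], new_v], dim=1)   # (B, N, D)
    scale = cache["k"].shape[-1] ** 0.5
    scores = q @ cache["k"].transpose(1, 2) / scale      # (B, 1, N)
    return torch.softmax(scores, dim=-1) @ cache["v"]    # (B, 1, D)

B, D = 4, 64
cache = {"k": torch.empty(B, 0, D), "v": torch.empty(B, 0, D)}
for step in range(3):  # three reasoning steps
    q = k = v = torch.randn(B, 1, D)  # projections of the new reasoning token
    out = attend_with_cache(q, k, v, cache)
```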

6.2.4. Ablation Study

The ablation study focuses on the contributions of the KL regularization term in ERL and Reasoning-aware Contrastive Learning (RCL) in PRL. Performance on NDCG@20 was evaluated by removing these auxiliary loss terms.

The following are the results from Figure 7 of the original paper:

Figure 7: Ablation study for key components in ERL and PRL.

Observations (Figure 7):

  • ERL without KL regularization: Performs worse than the full ERL model, indicating that without KL regularization the model suffers pattern degradation in its reasoning states, producing highly homogeneous outputs that fail to capture diverse insights (an illustrative loss sketch follows this list).
  • PRL without RCL: Also yields suboptimal recommendation performance. While progressive temperature scheduling helps, the absence of robust inference mechanisms (provided by RCL) prevents the recommender from self-correcting deviations in intermediate reasoning states, thus struggling to approximate the true user preference distribution.
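As a rough illustration of the ablated component, here is a sketch of an ERL-style objective: per-step cross-entropy plus a pairwise KL term that rewards divergent step-wise item distributions. The exact form of the paper's KL regularizer may differ; `lam` stands in for the strength $\lambda$.

```python
import torch
import torch.nn.functional as F

def erl_loss(step_logits, target, lam=0.01):
    """ERL-style objective (sketch): cross-entropy at every reasoning step plus
    a pairwise KL term that rewards divergent step-wise item distributions, so
    the ensemble of steps does not collapse into identical outputs."""
    ce = sum(F.cross_entropy(logits, target) for logits in step_logits)
    kl = 0.0
    for i, li in enumerate(step_logits):
        for j, lj in enumerate(step_logits):
            if i != j:
                log_p = F.log_softmax(li, dim=-1)
                q = F.softmax(lj, dim=-1)
                # Subtracting KL -> minimizing the loss maximizes divergence.
                kl = kl - F.kl_div(log_p, q, reduction="batchmean")
    return ce + lam * kl

# Toy usage: 3 reasoning steps, batch of 8, catalog of 1000 items.
step_logits = [torch.randn(8, 1000, requires_grad=True) for _ in range(3)]
erl_loss(step_logits, torch.randint(0, 1000, (8,))).backward()
```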

6.2.5. Sensitivity Analysis

The sensitivity analysis examines the effects of three key hyperparameters: base temperature $\tau$, temperature decay rate $\alpha$ (for PRL), and KL regularization strength $\lambda$ (for ERL). A sketch of the temperature schedule follows the observations below.

The following are the results from Figure 6 of the original paper:

Figure 6: Performance comparison w.r.t. different hyperparameters, including base temperature $\tau$, temperature decay rate $\alpha$, and KL regularization strength $\lambda$. The green and orange lines represent the PRL and ERL methods, respectively.

Observations (Figure 6):

  • Sensitivity to Base Temperature τ (in PRL):

    • As $\tau$ increases within $\{0.05, 0.1, 0.5, 1.0, 2.0, 5.0\}$, model performance generally improves at first.
    • This suggests that overly sharp probability distributions (low $\tau$) may not align with latent user preferences: forcing the model to learn extreme preferences from noisy interaction data hinders generalization.
    • However, too large a base temperature (e.g., 5.0) degrades performance. A large $\tau$ can blur ranking differences among candidate items, making it harder to learn meaningful sequential patterns.
    • Conclusion: Setting a moderate $\tau$ is crucial for optimal performance.
  • Sensitivity to Temperature Decay Rate α (in PRL):

    • A moderate $\alpha$ usually achieves the best performance.
    • Too small an $\alpha$ (e.g., $\alpha = 1.0$): The score distributions learned at different reasoning steps remain largely the same, leading to pattern collapse or mere replication of prior reasoning states, which prevents any reasoning enhancement.
    • An overly high $\alpha$ (e.g., $\alpha = 10.0$): Causes performance degradation. An aggressive temperature change triggers a rapid sharpness transition in the distribution (from smooth to sharp), disrupting the model's curriculum-style reasoning process.
    • Conclusion: An appropriate temperature decay rate is critical for reducing optimization difficulty.
  • Sensitivity to KL Regularization Strength λ (in ERL):

    • The model is generally not sensitive to $\lambda$ within a certain range.
    • However, recommendation performance drops significantly when $\lambda$ exceeds a certain threshold (e.g., 0.05).
    • Explanation: While KL regularization encourages diverse reasoning paths, too strong regularization can dominate gradient optimization, enforcing excessively divergent sequential patterns that might disrupt sequential modeling capability and increase optimization challenges, leading to performance degradation.
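To tie $\tau$ and $\alpha$ together, the sketch below implements a temperature-annealed cross-entropy under the assumed geometric schedule $\tau_k = \tau / \alpha^k$, which reproduces the qualitative behavior above ($\alpha = 1.0$ keeps every step at the same temperature; large $\alpha$ sharpens distributions quickly). The paper's exact annealing function may differ, and RCL is omitted.

```python
import torch
import torch.nn.functional as F

def prl_loss(step_logits, target, tau=1.0, alpha=2.0):
    """PRL-style progressive temperature scheduling (sketch): early reasoning
    steps use a smoother (higher-temperature) softmax, later steps a sharper
    one, under the assumed schedule tau_k = tau / alpha**k. RCL is omitted."""
    loss = 0.0
    for k, logits in enumerate(step_logits):
        tau_k = tau / (alpha ** k)   # temperature decays with step index k
        loss = loss + F.cross_entropy(logits / tau_k, target)
    return loss / len(step_logits)

step_logits = [torch.randn(8, 1000) for _ in range(3)]  # 3 steps, 1000 items
print(prl_loss(step_logits, torch.randint(0, 1000, (8,))))
```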

6.2.6. Embedding Visualization Analysis

The similarity heatmaps of multi-step reasoning outputs are visualized to analyze hidden-state dynamics; a sketch of the underlying computation follows at the end of this subsection.

The following are the results from Figure 9 of the original paper:

Figure 9: Visualization of similarity in multi-step reasoning hidden states for different methods.

Observations (Figure 9):

  • RPE Variant (Figure 9a): Exhibits high homogeneity in reasoning states. The similarity scores between the final output and previous steps are almost identical (e.g., 1.00 and 0.98), confirming the reasoning pattern degradation issue without proper regularization.

  • PRL Method (Figure 9b): Effectively leverages reasoning-enhanced computation for performance improvement, showing more distinct patterns across steps.

  • ERL w/o KL (Figure 9c): Shows more overlapping patterns across different reasoning steps, similar to the RPE variant and suggesting pattern collapse.

  • Full ERL Method (Figure 9d): Demonstrates diverse sequential patterns, where KL regularization encourages distinct representations across steps.

    The following are the results from Figure 10 of the original paper:

    Figure 10: The embedding visualization of the full ERL method vs. its ablated version without KL regularization. Dashed boxes highlight high similarity between different reasoning steps (Step 0 ~ Step 3) in the ablated version.

    Observations (Figure 10):

  • The visualization in Figure 10 further confirms that ERL without the KL constraint exhibits more overlapping patterns across reasoning steps than the full ERL method. This validates that KL regularization helps address the output homogenization issue and encourages diverse reasoning.
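Heatmaps like those in Figures 9 and 10 can be recomputed from the step-wise hidden states in a few lines; this sketch assumes cosine similarity averaged over the batch, which may differ from the paper's exact similarity measure.

```python
import torch
import torch.nn.functional as F

def step_similarity_matrix(step_states):
    """Pairwise cosine similarity between reasoning-step hidden states,
    averaged over the batch (sketch of the Figure 9/10-style heatmaps).
    step_states: list of (B, D) tensors, one per reasoning step."""
    H = F.normalize(torch.stack(step_states, dim=1), dim=-1)  # (B, S, D)
    sim = H @ H.transpose(1, 2)                               # (B, S, S)
    return sim.mean(dim=0)                                    # (S, S) heatmap

states = [torch.randn(32, 64) for _ in range(4)]  # Step 0 ~ Step 3
print(step_similarity_matrix(states))
```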

6.3. Case Studies

6.3.1. Rank Change Analysis of Target Items

This analysis examines how the rank of target items changes during multi-step inference using PRL methods on the Yelp dataset.

The following are the results from Figure 8 of the original paper:

Figure 8: Prediction scores versus item rank at each reasoning step (Step 0, Step 1, Step 2). The four subplots correspond to different parameter settings and show how the score distribution and the target item's rank evolve across steps.

Observations (Figure 8):

  • Full PRL Method: Progressively improves the target item ranking within the candidate pool as reasoning depth increases, aligning with expectations.
  • Temperature Decay Coefficient (α):
    • Smaller $\alpha$: Leads to smoother transitions in the score distribution across inference steps.
    • Larger $\alpha$: Induces more aggressive distribution changes, consistent with the sensitivity analysis (Section 6.2.5).
  • Ablated Version without RCL: Leads to reasoning errors. For example, in Figure 8(d), the target item's rank drops from #12 at step 1 to #22 at step 2, indicating that additional reasoning steps can incorrectly push the target item down without RCL's robustness mechanism (a rank-tracking sketch follows this list).
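The rank trajectories in this analysis follow directly from the per-step score vectors; the sketch below (with hypothetical variable names) computes the target item's 1-based rank at each reasoning step.

```python
import torch

def target_rank_per_step(step_scores, target_idx):
    """Rank-change analysis (sketch): for each reasoning step, the 1-based rank
    of the target item in the candidate pool, where a higher score means a
    better (smaller) rank. step_scores: list of (num_items,) score vectors."""
    return [int((scores > scores[target_idx]).sum().item()) + 1
            for scores in step_scores]

step_scores = [torch.randn(1000) for _ in range(3)]      # Step 0, 1, 2
print(target_rank_per_step(step_scores, target_idx=42))  # rank at each step
```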

6.3.2. Case Study in Real-world Recommendation Scenario

A specific example from the Video & Games dataset illustrates the stepwise preference refinement effect of the PRL method.

The following are the results from Figure 11 of the original paper:

Figure 11: Case study of multi-step inference on the Video & Games dataset. '$H_x$' (H1 ~ H5) represents historical items, with smaller $x$ indicating more recent interactions. '$R_x$' (R0 ~ R2) represents the top-1 recommended item at the $x$-th reasoning step, with larger $x$ indicating later reasoning steps.

Scenario: A user purchased Halo and Halo 5 (FPS games for XBox-One), then accessories (memory card, dust cover, stand). The goal is to predict the next item.

Observations (Figure 11):

  • Step R0 (Initial Inference): The model correctly captures the user's preference for FPS games on XBox. It recommends Conflict Desert Storm. However, this recommendation is suboptimal as it lacks timeliness (older game) and may not align with a gaming enthusiast's preference for newer releases.
  • Step R1 (First Reasoning Step): The model adjusts, recommending a game controller. This reflects the user's recent purchase habits (gaming accessories). However, it's still suboptimal because it primarily reflects collaborative relevance rather than sequential characteristics (users typically buy controllers before accessories like stands) and lacks diversity.
  • Step R2 (Final Reasoning Step): The model recommends Resident Evil 2, a newly released shooter game. This matches the actual target item and aligns well with the user's true preference.
  • Conclusion: This case study validates how recurrent reasoning (in PRL) resolves ambiguity by integrating temporal context, collaborative relevance, and output diversity, leading to a more accurate and refined recommendation.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces ReaRec, a pioneering inference-time computing framework for sequential recommendation, inspired by the think-before-action paradigm. Unlike traditional direct inference models, ReaRec enhances computational depth through multi-step implicit reasoning, enabling SeqRec models to "think" before making recommendations. To address challenges in optimizing multi-step reasoning, two lightweight learning strategies were proposed: Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL). ERL leverages ensemble techniques and KL divergence to foster diverse reasoning, while PRL employs progressive temperature annealing and reasoning-aware contrastive learning for robust and effective optimization.

Extensive experiments on five real-world datasets and various SeqRec architectures confirm ReaRec's effectiveness and generalizability. Notably, ReaRec not only improves recommendations for long-tail users and items but also significantly elevates the performance ceiling of existing SeqRec backbones by up to 50% with post-hoc optimal step selection. This highlights the substantial, previously untapped potential of inference-time computing for sequential recommendation. The authors are optimistic that this research opens a promising new direction at the intersection of reasoning and recommendation.

7.2. Limitations & Future Work

The authors acknowledge that ReaRec is an initial exploratory effort and identify several challenges and opportunities for future research:

  • Adaptive Inference Depth Selection:

    • Limitation: ReaRec paradoxically induces performance degradation for high-activity users and popular items due to overthinking. Additional computation provides negligible benefits for well-learned patterns and can lead to suboptimal results.
    • Opportunity: Develop an adaptive inference depth selection policy to balance computational depth with sequence complexity and user/item characteristics. This would allow for shallower reasoning for easily predictable preferences and deeper reasoning for complex or sparse scenarios, bridging the gap between current performance and the theoretical upper bound (as shown in Figure 2).
  • Parameter Disentanglement Between Encoding and Reasoning:

    • Limitation: The current ReaRec framework shares parameters between the item sequence encoding phase and the reasoning computations. While parameter-efficient, this design creates task ambiguity as the same neural modules must handle two distinct objectives. Although Reasoning Position Embeddings (RPE) help, the suboptimal performance trajectories (initial improvement followed by decline, Figure 5) suggest this solution is not optimal.
    • Opportunity: Explore parameter decoupling at the model level, creating specialized modules for item encoding and deep sequential reasoning. This could reduce task interference, allow for more specialized representation learning, and better adaptation to multi-step inference, ultimately improving recommendation quality.
  • The Missing Inference-time Scaling Law:

    • Observation: In Large Reasoning Models (LRMs), inference-time scaling laws suggest that longer reasoning chains lead to better reasoning capabilities and downstream performance. However, ReaRec experiments (Figure 5) do not perfectly exhibit this behavior, showing performance decline with excessive steps.
    • Opportunity: Investigate whether a true scaling law exists for inference-time computation in recommendation systems. If so, design more effective reasoning-enhanced sequential recommenders that can truly realize such a scaling law. This requires deeper research into the inherent nature of reasoning in recommendation contexts.
  • Theoretical Analysis:

    • Intuition: Increasing inference-time computational depth should enable sequential recommenders to capture higher-order sequential feature crossing and improve user preference predictions.
    • Opportunity: Develop theoretical analyses to formalize how multi-step reasoning contributes to improved recommendation performance. Establishing a strong theoretical foundation would guide more principled model design and optimization strategies.
  • Efficient Inference Mechanism:

    • Concern: While current ReaRec has marginal latency overhead, future advancements or a true inference-time scaling law could lead to efficiency concerns with the autoregressive generation paradigm.
    • Opportunity: Explore optimization strategies like linear attention mechanisms [60], model quantization [73], and long-to-short reasoning distillation [53] to achieve lighter and faster inference for industrial-scale deployment.

7.3. Personal Insights & Critique

This paper presents a highly insightful and timely approach by drawing inspiration from the success of Chain-of-Thought (CoT) reasoning in LLMs and applying it to sequential recommendation. The core idea of "thinking before recommending" through implicit multi-step reasoning during inference is genuinely innovative in the SeqRec domain.

Key Strengths:

  • Novelty: The explicit application of inference-time reasoning to SeqRec is a significant conceptual leap. Most SeqRec research focuses on architectural improvements or richer embeddings; ReaRec introduces a new dimension of computational depth during prediction.
  • Model Agnostic: The framework's ability to enhance diverse SeqRec backbones (ID-based, text-based) is a strong testament to its generalizability and potential impact across the field.
  • Targeted Problem Solving: The specific design choices like Reasoning Position Embeddings to bridge the task gap and the ERL/PRL strategies to handle the lack of explicit CoT supervision demonstrate a deep understanding of the unique challenges of implicit reasoning in SeqRec.
  • Empirical Robustness: The comprehensive experiments across multiple datasets and detailed ablation studies provide strong empirical evidence for ReaRec's effectiveness. The significant performance gains, especially the 30-50% ceiling elevation, are remarkable.
  • Detailed Analysis: The subgroup analysis (long-tail vs. active users/items) and the embedding visualization provide valuable insights into why the method works and where its limitations lie, pushing beyond mere performance numbers.

Potential Issues/Critique:

  • "Overthinking" Phenomenon: While identified as future work, the "overthinking" phenomenon for active users and popular items is a practical concern. If not adaptively managed, it could lead to unnecessary computational cost or even degraded user experience for the majority of interactions. An adaptive inference depth mechanism is not just an opportunity but a crucial requirement for real-world deployment.
  • True Scaling Law for SeqRec: The observation that ReaRec does not perfectly follow the inference-time scaling law observed in LLMs is a critical point. This suggests that the nature of "reasoning" in SeqRec might be fundamentally different from LLM's symbolic reasoning. Perhaps SeqRec reasoning involves more "contextual refinement" than "logical deduction." Future theoretical work needs to clarify this distinction.
  • Interpretability of Implicit Reasoning: While CoT in LLMs offers some interpretability through explicit reasoning steps, ReaRec's reasoning is implicit within the latent space. While embedding visualizations provide some clues, truly understanding how the model "thinks" or what intermediate insights are generated remains challenging. This limits debugging and trust in critical applications.
  • Complexity vs. Simplicity: While the paper claims ERL and PRL are "lightweight," integrating multi-step reasoning, RPE, KL regularization, temperature annealing, and contrastive learning adds a fair degree of complexity compared to a direct forward pass model. The trade-off between this added complexity and the performance gains needs to be continually evaluated, especially as the "base" models themselves become more powerful.

Applicability and Transferability:

The core concept of inference-time computational depth is highly transferable.

  • Other Recommendation Tasks: Could be applied to other RS tasks beyond next-item prediction, such as session-based recommendation, cold-start recommendation, or even multi-task recommendation, where refining user/item representations iteratively could be beneficial.

  • Graph Neural Networks (GNNs): GNNs are used in recommendation. Could ReaRec's multi-step reasoning concept be applied to iteratively propagate and refine information on a graph structure during inference?

  • Generative Models: The idea of progressive refinement (as in PRL) could inspire generative recommendation models that generate items in multiple steps, refining the generated item characteristics with each step.

    Overall, ReaRec is a highly inspiring paper that pushes the boundaries of sequential recommendation by embracing advanced inference-time computing paradigms. It opens up a rich vein of research questions and practical applications for building more intelligent and nuanced recommender systems.
