
FICLRec: Frequency enhanced intent contrastive learning for sequential recommendation

Published: 2025-06-11
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

FICLRec, a proposed model, uses frequency-enhanced intent contrastive learning to address the limitations of capturing high-frequency intents in sequential recommendation. It significantly improves performance across five real-world datasets.

Abstract

User purchasing behavior is mainly driven by their intentions. However, existing methods typically favor low-frequency intents, leading to insufficient capability in capturing more expressive high-frequency intents. Moreover, like typical sequence recommendations, data sparsity remains a primary factor influencing recommendation performance. To address this issue, we propose a Frequency Enhanced Intent Contrastive Learning Recommendation model (FICLRec), which innovatively utilizes frequency information from users’ latent intentions to improve the recognition of high-frequency intents. Additionally, we introduce frequency contrastive learning to reduce the negative impact of data sparsity on model performance. To validate the effectiveness of the proposed method, extensive experiments were conducted on five real-world datasets: Beauty (0.19M interactions), Sports (0.29M interactions), Toys (0.16M interactions), Yelp (0.31M interactions), and LastFM (0.05M interactions). The experimental results indicate that, in comparison with baseline models, our method improves by 2.03%, 4.87%, 2.50%, 13.85%, and 16.93% on five datasets, proving the effectiveness of our method. Our implemented model is available via https://github.com/syf1844803351/FICLRec .


1. Bibliographic Information

1.1. Title

FICLRec: Frequency enhanced intent contrastive learning for sequential recommendation

1.2. Authors

  • Yifeng Su
  • Xiaodong Cai
  • Ting Li

1.3. Journal/Conference

The provided text does not name the journal or conference. The page metadata gives a publication timestamp of 2025-06-11T00:00:00.000Z, but this alone does not identify the venue.

1.4. Publication Year

2025 (as indicated by the publication UTC timestamp 2025-06-11T00:00:00.000Z)

1.5. Abstract

User purchasing behavior is primarily driven by their intentions. Existing sequential recommendation methods often struggle to effectively capture high-frequency intents (e.g., immediate, short-term interests) and tend to favor low-frequency intents (e.g., stable, long-term preferences). Additionally, data sparsity, a common challenge in sequential recommendation, significantly impacts performance. To address these issues, the authors propose the Frequency Enhanced Intent Contrastive Learning Recommendation model (FICLRec). This model innovatively utilizes frequency information extracted from users' latent intentions to improve the recognition of high-frequency intents. Furthermore, it introduces frequency contrastive learning to mitigate the negative effects of data sparsity. Extensive experiments were conducted on five real-world datasets (Beauty, Sports, Toys, Yelp, and LastFM), demonstrating that FICLRec consistently outperforms baseline models, achieving improvements of 2.03%, 4.87%, 2.50%, 13.85%, and 16.93% on these datasets. The implemented model is publicly available.

/files/papers/692bf9cf4114e99a4cde8763/paper.pdf This is a local file link, so it does not by itself indicate the venue or whether the paper is a preprint.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper lies in the limitations of existing sequential recommendation (SR) methods, which aim to predict the next item a user will interact with based on their historical sequence of interactions. User purchasing behavior is fundamentally driven by their intentions, which can be short-term (high-frequency, rapidly changing) or long-term (low-frequency, stable). The paper identifies two critical challenges:

  1. Bias towards low-frequency intents: Current SR models often struggle to effectively capture and represent high-frequency intents from user interaction sequences. This leads to insufficient capability in recognizing dynamic, momentary user interests, which are crucial for timely and relevant recommendations.

  2. Data sparsity: Like many recommendation systems, sequential recommendation suffers from data sparsity, meaning that most users interact with only a small fraction of available items. This scarcity of interaction data makes it difficult for models to learn robust and generalizable user preferences and item representations, thereby hindering recommendation performance.

The importance of solving these problems stems from the need for more accurate and personalized recommendations. If a system can better understand both the fleeting and enduring intentions of users, it can provide more relevant suggestions, improving user satisfaction and engagement. The paper's innovative idea is to leverage frequency information to disentangle and enhance the learning of different types of user intentions and to apply contrastive learning in the frequency domain to combat data sparsity.

2.2. Main Contributions / Findings

The primary contributions and key findings of the FICLRec paper are:

  • Novel Model for Frequency-Enhanced Intent Learning: The paper proposes FICLRec, a novel model that innovatively uses frequency information derived from users' latent intentions. This approach specifically aims to improve the recognition of high-frequency intents, addressing the bias towards low-frequency intents in existing models.
  • Frequency Contrastive Learning for Data Sparsity: FICLRec introduces frequency contrastive learning, a mechanism designed to reduce the negative impact of data sparsity. By contrasting representations in the frequency domain, the model can learn more robust and informative user and item embeddings, even with limited interaction data.
  • Comprehensive Intent Modeling: The model incorporates a Frequency Redistribution Encoder (FRE) to decompose user behaviors into low-frequency and high-frequency components, capturing both stable long-term preferences and dynamic short-term interests. It then utilizes distinct contrastive learning objectives for each frequency domain: high-frequency intent contrastive learning and low-frequency intent contrastive learning.
  • State-of-the-Art Performance: Extensive experiments on five real-world datasets (Beauty, Sports, Toys, Yelp, LastFM) demonstrate that FICLRec significantly outperforms various state-of-the-art (SOTA) sequential recommendation models, including traditional methods, frequency-domain methods, self-supervised methods, and intent learning methods. The model achieves average improvements ranging from 2.03% to 16.93% across datasets.
  • Robustness and Efficiency: The ablation studies confirm the effectiveness of each proposed component. FICLRec also exhibits robustness to noisy data and sparse data, and its training efficiency is comparable to other self-attention based models, making the performance gains worthwhile despite a slight increase in computational complexity.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand FICLRec, a reader should be familiar with several fundamental concepts in recommender systems and deep learning.

  • Sequential Recommendation (SR):

    • Conceptual Definition: Sequential recommendation is a subfield of recommender systems that focuses on predicting the next item a user will interact with, given their historical sequence of interactions. Unlike traditional collaborative filtering that might recommend items based on overall preferences, SR emphasizes the temporal order and dependencies between items in a user's behavior history. For example, if a user watched a specific movie, SR might recommend the next logical movie in a series or a related genre, considering the immediate context.
    • Importance: SR is crucial in dynamic environments like e-commerce, content streaming, or news feeds, where user interests evolve rapidly, and the order of interactions carries significant information about their current intent.
  • Intent Learning:

    • Conceptual Definition: Intent learning in recommender systems aims to identify and model the underlying purposes or goals that drive a user's interactions. These intentions can be explicit (e.g., a user searching for "running shoes") or implicit (e.g., a sequence of interactions that suggests a user is preparing for a trip).
    • Role in FICLRec: FICLRec specifically distinguishes between high-frequency intents (short-term, dynamic, like buying a specific accessory after a main purchase) and low-frequency intents (long-term, stable, like a consistent preference for a certain brand or genre). The paper argues that understanding both types of intents leads to better recommendations.
  • Contrastive Learning (CL):

    • Conceptual Definition: Contrastive learning is a self-supervised learning paradigm where a model learns representations by pushing "similar" (positive) samples closer together in an embedding space while pushing "dissimilar" (negative) samples farther apart. It doesn't require explicit human labels; instead, it generates pseudo-labels from the data itself.
    • How it Works: Typically, given an anchor data point, a positive sample is created (e.g., through data augmentation or by identifying semantically related data points), and multiple negative samples are chosen (e.g., random samples from the batch). A contrastive loss function (like InfoNCE loss) then aims to minimize the distance between the anchor and positive sample and maximize the distance between the anchor and negative samples.
    • Role in FICLRec: FICLRec uses contrastive learning to strengthen the learned intent representations and mitigate data sparsity. It applies distinct contrastive learning objectives for high-frequency and low-frequency intents.
  • Frequency Domain Analysis (Fourier Transform):

    • Conceptual Definition: The Fourier Transform (FT) is a mathematical operation that decomposes a function or signal into its constituent frequencies. It transforms a signal from its original domain (often time domain or spatial domain) into the frequency domain. In simple terms, it tells us which frequencies are present in the signal and their magnitudes.
    • Discrete Fourier Transform (DFT) and Inverse DFT (IDFT): For discrete sequences (like user interaction sequences), the Discrete Fourier Transform (DFT) is used, often implemented efficiently as the Fast Fourier Transform (FFT). The Inverse Discrete Fourier Transform (IDFT) or Inverse Fast Fourier Transform (IFFT) converts the signal back from the frequency domain to the time domain.
      • DFT Formula: Given a sequence $\{x_n\}_{n=0}^{N-1}$, its DFT is defined as
        $$X_k = \mathcal{F}[x_n] = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} k n}, \quad 0 \leq k \leq N-1$$
        where $X_k$ are the frequency components, $x_n$ are the time-domain samples, $N$ is the total number of samples, $k$ is the frequency index, and $i$ is the imaginary unit.
      • IDFT Formula: The IDFT sums over the frequency index $k$:
        $$x_n = \mathcal{F}^{-1}[X_k] = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{\frac{2\pi i}{N} k n}, \quad 0 \leq n \leq N-1$$
        where $x_n$ are the reconstructed time-domain samples.
    • Role in FICLRec: FICLRec applies Fourier Transform to user interaction sequences to analyze their frequency components. Low frequencies typically capture global, stable, or long-term patterns (e.g., general preference for a genre), while high frequencies capture local, rapid, or short-term changes (e.g., a sudden interest in specific items). By separating and analyzing these components, the model can better capture different types of user intentions.
  • Self-Attention Mechanism (from Transformers):

    • Conceptual Definition: Self-attention is a mechanism that allows a model to weigh the importance of different parts of an input sequence when processing a specific element. It helps capture dependencies between items in a sequence regardless of their distance.
    • How it Works (Simplified): For each item in a sequence, self-attention computes three vectors: a Query (Q), a Key (K), and a Value (V). The Query of an item is compared against the Keys of all other items (including itself) to compute attention scores. These scores are then scaled, passed through a softmax function to get attention weights, and finally used to weigh the Values of all items. The sum of these weighted Values forms the output representation for the current item, reflecting its contextualized meaning within the sequence.
    • Scaled Dot-Product Attention Formula:
      $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
      where $Q$ is the matrix of queries, $K$ is the matrix of keys, $V$ is the matrix of values, and $d_k$ is the dimension of the keys. Dividing by $\sqrt{d_k}$ keeps large dot products from saturating the softmax, which would otherwise produce very small gradients.
    • Multi-Head Self-Attention: This extends self-attention by running the attention mechanism multiple times in parallel (multiple heads), allowing the model to focus on different aspects of the input sequence simultaneously. The outputs from these heads are then concatenated and linearly transformed.
    • Role in FICLRec: FICLRec uses multi-head self-attention within its Frequency Redistribution Encoder to capture relationships between items in the sequence, similar to its use in SASRec and other Transformer-based models.
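
The contrastive learning bullets above describe an InfoNCE-style objective with an anchor, one positive, and in-batch negatives. The following is a minimal NumPy sketch of that generic loss, not the FICLRec implementation: the function name `info_nce`, the cosine similarity, and the temperature of 0.1 are assumptions chosen for illustration.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE with in-batch negatives (illustrative, not from the paper).

    anchors, positives: arrays of shape (B, D). Row i of `positives` is the
    positive for row i of `anchors`; every other row acts as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds each anchor's log-probability of picking its own positive.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.01 * rng.normal(size=(8, 16))  # near-duplicates as positives
shuffled = np.roll(anchors, 1, axis=0)               # deliberately wrong pairing

loss_aligned = info_nce(anchors, aligned)
loss_shuffled = info_nce(anchors, shuffled)
```

With correctly paired positives the loss is close to zero, while the mismatched pairing yields a much larger value, which is the behavior the "pull positives together, push negatives apart" description implies.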

3.2. Previous Works

The paper compares FICLRec against a range of existing sequential recommendation models, which can be broadly categorized.

  • Traditional Sequential Recommendation Methods:

    • GRU4Rec (Hidasi et al., 2015): One of the pioneering deep learning methods for sequential recommendation, using Gated Recurrent Units (GRUs) to model user sequences. GRUs are a type of recurrent neural network (RNN) that can capture temporal dependencies.
    • Caser (Tang & Wang, 2018): Employs convolutional neural networks (CNNs) with horizontal and vertical convolution techniques to capture patterns in sequential data.
    • SASRec (Kang & McAuley, 2018): A highly influential Transformer-based model that uses self-attention to capture long-range dependencies in user sequences, significantly improving SR performance compared to RNN-based methods. It serves as a strong backbone for many subsequent SR models.
  • Frequency Domain Methods:

    • FMLPRec (Zhou et al., 2022): A sequential recommendation model that processes sequences in the frequency domain using Fast Fourier Transform (FFT) and Inverse FFT (IFFT) to capture global and local patterns.
    • BSARec (Li et al., 2024b): Utilizes both time-domain and frequency-domain analysis. It integrates Fourier transforms to identify and manage periodic patterns and adaptive filtering to reduce noise.
    • FEARec (Ni et al., 2023): Combines frequency domain analysis with attention mechanisms to enhance sequential recommendation, focusing on capturing both global and local patterns effectively.
  • Self-supervised Learning Methods:

    • DuoRec (Chen et al., 2022): A self-supervised learning model that uses dual contrastive learning tasks to learn robust representations, addressing data sparsity and item embedding degradation.
    • ICLRec (Li et al., 2024d): Integrates self-supervised learning by introducing contrastive learning objectives to enhance intent learning in sequential recommendation.
    • ELCRec (Li et al., 2024c): Focuses on enhancing latent sequence representation through self-supervised contrastive learning, particularly in sparse data contexts.
  • Intent Learning Methods:

    • ICSRec (Li et al., 2024e): An intent-contrastive sequential recommendation model that employs contrastive self-supervised learning objectives to learn user intentions more effectively. It aims to make item embeddings more discriminative by contrasting different intention aspects.
    • IDCLRec (Chen et al., 2025): Another intent-driven contrastive learning model for sequential recommendation.

3.3. Technological Evolution

The field of sequential recommendation has evolved significantly:

  1. Early models (e.g., Markov Chains, matrix factorization): Focused on simple sequence transitions or static preferences.

  2. Recurrent Neural Networks (RNNs) (e.g., GRU4Rec): Introduced the ability to model longer-term dependencies in sequences, but suffered from issues like vanishing gradients and difficulty capturing very long-range connections.

  3. Convolutional Neural Networks (CNNs) (e.g., Caser): Applied CNNs to sequences, offering efficiency and local pattern capture.

  4. Attention Mechanisms and Transformers (e.g., SASRec): Revolutionized SR by allowing models to weigh the importance of all previous items for predicting the next one, effectively capturing long-range dependencies and making models highly parallelizable.

  5. Self-supervised Learning (SSL) (e.g., DuoRec, ICLRec): Addressed data sparsity by generating supervisory signals from the data itself, often using contrastive learning to learn more robust item and user representations.

  6. Intent-guided Models (e.g., ICSRec, ELCRec): Began to explicitly model user intentions, recognizing that different interactions might stem from different underlying goals, leading to more nuanced recommendations.

  7. Frequency Domain Models (e.g., FMLPRec, BSARec, FEARec): Introduced Fourier Transform to analyze sequences in the frequency domain, aiming to disentangle different types of patterns (e.g., global trends vs. local fluctuations) that might correspond to long-term vs. short-term interests.

FICLRec fits into this timeline by combining the advancements in intent-guided models, self-supervised learning (specifically contrastive learning), and frequency domain analysis. It attempts to bridge the gap by using frequency domain techniques to enhance the learning of diverse user intentions within a self-supervised contrastive learning framework.

3.4. Differentiation Analysis

Compared to the main methods in related work, FICLRec offers several core differences and innovations:

  • Novel Combination of Frequency Domain and Intent Contrastive Learning: While models like FMLPRec, BSARec, and FEARec utilize the frequency domain for sequential recommendation, and models like ICLRec, ELCRec, ICSRec, and IDCLRec employ intent learning with contrastive learning, FICLRec innovatively integrates both. It specifically uses frequency information to enhance the recognition of high-frequency intents and then applies frequency contrastive learning to mitigate data sparsity.

  • Explicit Disentanglement of High- and Low-Frequency Intents: FICLRec's Frequency Redistribution Encoder (FRE) explicitly decomposes user interaction sequences into low-frequency and high-frequency components. This allows for dedicated processing and contrastive learning objectives for each type of intent, whereas many prior intent-guided models might treat intents more holistically without such a distinct frequency-based separation.

  • Targeted Contrastive Learning Objectives: The model introduces two specific contrastive learning objectives:

    • High-frequency intent contrastive learning: This specifically aims to align the high-frequency features of positive pairs while pushing away negative ones, coupled with a high-frequency alignment loss ($\mathcal{L}_{hfal}$) to explicitly bring high-frequency components closer. This is crucial for capturing dynamic, short-term interests.
    • Low-frequency intent contrastive learning: This leverages intent prototypes (cluster centers) as low-frequency representations and includes a cluster-level center alignment loss ($\mathcal{L}_{ccal}$) to ensure that general intent representations are close to their corresponding prototypes. This helps in capturing stable, long-term preferences.
  • Addressing Data Sparsity through Frequency-Aware CL: By introducing frequency contrastive learning, FICLRec attempts to learn more robust embeddings even with limited data. The frequency domain can sometimes highlight underlying patterns that are less apparent in raw, sparse time-domain data, making contrastive learning more effective.

  • Improved High-Frequency Intent Capture: The paper claims that existing methods typically favor low-frequency intents. FICLRec directly confronts this by enhancing the processing of high-frequency components, leading to a better balance in capturing diverse user intentions.

In essence, FICLRec stands out by providing a structured way to separate and learn short-term and long-term user intentions using frequency domain techniques, and then solidifying these learnings through tailored contrastive learning objectives, particularly addressing the persistent problem of data sparsity.

4. Methodology

4.1. Principles

The core idea behind FICLRec is to enhance sequential recommendation by explicitly modeling both high-frequency (short-term, dynamic) and low-frequency (long-term, stable) user intentions. The theoretical basis is that user behavior sequences, when analyzed in the frequency domain using Fourier Transform, reveal distinct patterns corresponding to these different types of intentions. Low-frequency components represent stable, overarching preferences, while high-frequency components represent transient, specific interests. By disentangling and processing these components, FICLRec aims to overcome the limitation of existing models that often favor low-frequency intents. Furthermore, to address data sparsity and learn more robust representations for these intents, the model employs frequency-enhanced contrastive learning which pulls positive samples closer and pushes negative samples apart in the embedding space, adapted for both high-frequency and low-frequency intent representations.

4.2. Overall Framework

The overall architecture of FICLRec is designed to process user interaction sequences, extract frequency-enhanced intent representations, and use these representations for next-item prediction augmented by multi-task learning with intent contrastive learning. The model consists of four main components:

  1. Embedding Layer: Converts discrete item IDs into dense vector representations.

  2. Frequency Redistribution Encoder (FRE): Processes the embedded sequence, disentangling and re-weighting low-frequency and high-frequency components.

  3. Intent Contrastive Learning: A self-supervised task that comprises high-frequency intent contrastive learning and low-frequency intent contrastive learning, designed to capture short-term and long-term user preferences respectively.

  4. Prediction Layer: Uses the learned user representation to predict the next item.

The following figure (Figure 2 from the original paper) illustrates the overall framework:

The figure illustrates the data flow, starting with the input user sequence $S^u$. This sequence first passes through the Embedding Layer. The embedded sequence then enters the Frequency Redistribution Encoder, which includes a Frequency Redistribution Structure, Multi-Head Self-Attention, and Point-wise Feed-Forward Network. The output of this encoder, representing the user's comprehensive behavior, is then used for two tasks: Next-item Prediction and Intent Contrastive Learning. The Intent Contrastive Learning branch consists of High-Frequency Intent Contrastive Learning and Low-Frequency Intent Contrastive Learning modules, which are optimized to refine the intent representations. Finally, the overall loss combines the prediction loss and the intent contrastive loss.

4.3. Embedding layer

In the embedding layer, each unique item in the global item set $V$ is mapped to a continuous vector space. This creates an item embedding matrix $M \in \mathbb{R}^{|V| \times D}$, where $|V|$ is the total number of items and $D$ is the embedding dimension.

For a given user $u$, their historical interaction sequence is $s^u = [v_1^u, v_2^u, \ldots, v_N^u]$, where $v_i^u$ is the item interacted with at position $i$, and $N$ is the maximum sequence length. Each item $v_i^u$ is converted into its embedding $e_i^u$. The sequence of item embeddings for user $u$ is thus represented as:

$$E^u = [e_1^u, e_2^u, \ldots, e_N^u] \in \mathbb{R}^{N \times D}$$

Since mechanisms like self-attention (used later in the encoder) do not inherently preserve the order of items, positional information must be explicitly added to the embeddings. This is done by adding a positional embedding $P \in \mathbb{R}^{N \times D}$ to the item embeddings. The combined embeddings are then processed by Layer Normalization and Dropout for stability and regularization:

$$E^u = \mathrm{Dropout}(\mathrm{LayerNorm}(E^u + P))$$

Here, $P$ represents the positional embedding matrix, LayerNorm normalizes the activations across features for each individual sample, and Dropout randomly sets a fraction of input units to zero during training to prevent overfitting. The output $E^u$ serves as the initial input to the Frequency Redistribution Encoder.
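
As a rough illustration of the embedding step described above, the sketch below looks up item embeddings, adds positional embeddings, and applies LayerNorm and Dropout. The sizes, the random initialization, the dropout rate, and the omission of LayerNorm's learnable scale and shift are all assumptions made for brevity; none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, N, D = 100, 5, 8      # hypothetical |V|, max sequence length, embedding dim

M = rng.normal(scale=0.02, size=(num_items, D))   # item embedding matrix M
P = rng.normal(scale=0.02, size=(N, D))           # positional embedding matrix P

def layer_norm(x, eps=1e-12):
    # Normalise each position's feature vector to zero mean and unit variance
    # (learnable scale and shift are left out of this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def dropout(x, rate, rng, training=True):
    # Inverted dropout: zero a fraction `rate` of the units and rescale the rest.
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def embed(seq, rate=0.5, training=True):
    E = M[seq]                                    # item embeddings, shape (N, D)
    return dropout(layer_norm(E + P), rate, rng, training)

seq = np.array([3, 17, 42, 7, 99])                # toy item-ID sequence s^u
E_u = embed(seq, training=False)                  # dropout disabled for inference
```

The result `E_u` has shape `(N, D)` and plays the role of $H^0$, the input to the first encoder block.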

4.4. Frequency Redistribution Encoder

The Frequency Redistribution Encoder (FRE) is the core component responsible for processing the sequence embeddings by leveraging frequency domain information. It aims to capture both low-frequency and high-frequency patterns from user behaviors. The FRE consists of multiple stacked blocks.

Let the input sequence representation for the $l$-th encoder block be $H^l \in \mathbb{R}^{N \times D}$. The operation of the $l$-th FRE block can be described as:

$$H^{l+1} = \mathrm{FRE}(H^l)$$

For the first block, the input is the output of the embedding layer, i.e., $H^0 = E^u$. Each FRE block itself contains three sub-components: a Frequency Redistribution Structure, a Multi-head Self-Attention mechanism, and a Point-wise Feed-Forward Network.

4.4.1. Frequency Redistribution Structure

This structure is where the frequency domain analysis takes place. It applies the Fast Fourier Transform (FFT) to the input sequence, separates it into low-frequency and high-frequency components, and then recombines them adaptively.

Given the input $H^l$, the FFT converts the time-domain sequence into a frequency-domain representation. The paper then defines the low-frequency component by keeping the frequency components at indices $0$ through $\omega_p$ (the lowest frequencies) and converting them back to the time domain using the Inverse Fast Fourier Transform (IFFT):

$$X_L^l = \mathcal{F}^{-1}\left( \mathcal{F}\left(H^l\right) \big|_{0:\omega_p} \right) \in \mathbb{R}^{N}$$

Here, $\mathcal{F}$ denotes the FFT operation, $\mathcal{F}^{-1}$ denotes the IFFT operation, and $H^l$ is the input sequence. The notation $\mathcal{F}\left(H^l\right) \big|_{0:\omega_p}$ signifies taking the frequency components from index $0$ up to $\omega_p$, where $\omega_p = p/2 + 1$. The value $p$ is a hyperparameter that sets the cutoff frequency through $\omega_p$. This effectively filters out higher frequencies, leaving only the low-frequency content.

Conversely, the high-frequency component is defined by taking the frequency components from $\omega_p + 1$ to $N - 1$ (the remaining higher frequencies) and converting them back to the time domain using IFFT:

$$X_H^l = \mathcal{F}^{-1}\left( \mathcal{F}\left(H^l\right) \big|_{\omega_p + 1 : N - 1} \right) \in \mathbb{R}^{N}$$

The paper notes that reconstructing the full spectrum from just these two components is generally sufficient and reduces computational complexity.
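
The low/high split can be sketched with NumPy's real FFT. This is an illustration under stated assumptions rather than the authors' code: it uses `np.fft.rfft` along the sequence axis (applied to each of the $D$ feature columns), a hypothetical `cutoff` argument standing in for $\omega_p$, and complementary masks so that the two parts sum back to the input. The paper's exact index convention may differ.

```python
import numpy as np

def split_frequencies(H, cutoff):
    """Split a real sequence H of shape (N, D) into low- and high-frequency parts.

    `cutoff` is the number of lowest rFFT bins kept in the low-frequency part
    (an assumed stand-in for the paper's omega_p).
    """
    N = H.shape[0]
    spectrum = np.fft.rfft(H, axis=0)          # (N // 2 + 1, D), complex
    low = np.zeros_like(spectrum)
    low[:cutoff] = spectrum[:cutoff]           # keep bins 0 .. cutoff - 1
    high = spectrum - low                      # all remaining bins
    X_L = np.fft.irfft(low, n=N, axis=0)       # back to the time domain
    X_H = np.fft.irfft(high, n=N, axis=0)
    return X_L, X_H

N, D = 16, 4
t = np.arange(N)[:, None]
slow = np.cos(2 * np.pi * 1 * t / N)           # 1 cycle over the sequence
fast = 0.5 * np.cos(2 * np.pi * 6 * t / N)     # 6 cycles over the sequence
H = slow + fast + np.zeros((N, D))             # broadcast to shape (N, D)

X_L, X_H = split_frequencies(H, cutoff=3)
```

In this toy signal the slow cosine lands entirely in `X_L` and the fast one in `X_H`, which mirrors the reading of low frequencies as stable trends and high frequencies as rapid changes.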

After obtaining both the low-frequency ($X_L^l$) and high-frequency ($X_H^l$) representations, the model combines them using a gating mechanism. This mechanism adaptively re-weights the contribution of each component based on its importance:

$$Z = \mathrm{sigmoid}(X_L^l W_1 + X_H^l W_2)$$
$$\tilde{H}^l = Z \odot X_L^l + (1 - Z) \odot X_H^l$$

In this formulation:

  • $W_1$ and $W_2$ are learnable dimensionality reduction parameter matrices, transforming $X_L^l$ and $X_H^l$ before combination.
  • sigmoid is the sigmoid activation function, which squashes the values between 0 and 1, acting as a gate.
  • $Z$ is the gate vector, determining the weight given to the low-frequency component.
  • $(1 - Z)$ then determines the weight given to the high-frequency component.
  • $\odot$ represents element-wise multiplication.
  • $\tilde{H}^l$ is the adaptively recombined representation, which is then passed to the multi-head self-attention layer. This process allows the model to selectively emphasize or de-emphasize low-frequency (stable patterns) or high-frequency (dynamic changes) information as needed.
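
The gate can be sketched in a few lines of NumPy. The shapes of $W_1$ and $W_2$ are not given in the text above, so this example assumes both are $D \times D$ so that $Z$ matches the shape of the two components; the function name and random inputs are likewise illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_recombine(X_L, X_H, W1, W2):
    # Gate values lie strictly between 0 and 1; shape (N, D) under the D x D assumption.
    Z = sigmoid(X_L @ W1 + X_H @ W2)
    # Element-wise convex combination of the two frequency components.
    return Z * X_L + (1.0 - Z) * X_H, Z

rng = np.random.default_rng(0)
N, D = 6, 4
X_L = rng.normal(size=(N, D))
X_H = rng.normal(size=(N, D))
W1 = rng.normal(scale=0.1, size=(D, D))
W2 = rng.normal(scale=0.1, size=(D, D))

H_tilde, Z = gated_recombine(X_L, X_H, W1, W2)

# With all-zero weights the gate is 0.5 everywhere, i.e. a plain average.
H_avg, _ = gated_recombine(X_L, X_H, np.zeros((D, D)), np.zeros((D, D)))
```

Because each output element is a convex combination, it always lies between the corresponding low- and high-frequency values; the learned weights only decide where in that interval it falls.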

4.4.2. Multi-head self-attention

Following the Frequency Redistribution Structure, the recombined representation $\tilde{H}^l$ is passed through a multi-head self-attention layer. This layer, inspired by SASRec, allows the model to capture dependencies between items in the sequence, considering their contextual relevance.

$$H_{Att}^l = \mathrm{Softmax}\left( \frac{\tilde{H}^l W_Q^l (\tilde{H}^l W_K^l)^T}{\sqrt{D}} \right) \tilde{H}^l W_V^l$$

Here:

  • $\tilde{H}^l$ is the input to the self-attention layer.
  • $W_Q^l$, $W_K^l$, $W_V^l \in \mathbb{R}^{D \times D}$ are learnable projection matrices for queries, keys, and values, respectively, for the $l$-th block.
  • $D$ is the embedding dimension, and $\sqrt{D}$ is a scaling factor to prevent large dot product values from dominating the softmax function.
  • The softmax function normalizes the attention scores into attention weights.
  • $H_{Att}^l$ is the output of the multi-head self-attention layer, representing the context-aware sequence.
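
The attention formula above can be written out directly. The sketch below follows the formula as printed, so it is single-head and applies no attention mask; whether the actual model adds a causal mask or splits $D$ across heads is not stated here, and the random weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    D = H.shape[-1]
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V        # each of shape (N, D)
    weights = softmax(Q @ K.T / np.sqrt(D))    # (N, N); every row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
N, D = 5, 8
H_tilde = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))

H_att, weights = self_attention(H_tilde, W_Q, W_K, W_V)
```

Row `i` of `weights` shows how much position `i` draws on every position in the sequence when forming its output.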

4.4.3. Point-wise feed-forward network

After the self-attention layer, a point-wise feed-forward network (PFFN) is applied. This network processes each position independently and identically, capturing non-linear features within the sequence.

$$H^{l+1} = \left( \mathrm{GELU}(H_{Att}^l W_1^l + b_1^l) \right) W_2^l + b_2^l \in \mathbb{R}^{N \times D}$$

In this formula:

  • HAttl H _ { A t t } ^ { l } is the output from the self-attention layer.
  • W1l,W2lRD×D W _ { 1 } ^ { l } , W _ { 2 } ^ { l } \in \mathbb { R } ^ { D \times D } are learnable weight matrices, and b1l,b2lR1×D b _ { 1 } ^ { l } , b _ { 2 } ^ { l } \in \mathbb { R } ^ { 1 \times D } are learnable bias vectors for the ll-th block.
  • GELU (Gaussian Error Linear Unit) is an activation function that introduces non-linearity.
  • Hl+1 H ^ { l + 1 } is the final output of the ll-th FRE block.
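A small sketch of the point-wise feed-forward step follows. It assumes the exact (erf-based) form of GELU; the source does not say whether the exact or the tanh-approximated variant is used, and the function names are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def pffn(h_att, w1, b1, w2, b2):
    """Apply GELU(h W1 + b1) W2 + b2 to each position (row) independently."""
    out = []
    for row in h_att:
        hidden = [gelu(sum(x * w for x, w in zip(row, col)) + b)
                  for col, b in zip(zip(*w1), b1)]
        out.append([sum(x * w for x, w in zip(hidden, col)) + b
                    for col, b in zip(zip(*w2), b2)])
    return out
```

Because every row is transformed with the same weights and without reference to other rows, applying `pffn` to a whole sequence gives the same result as applying it to each position separately.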

4.4.4. Stacking blocks

The entire Frequency Redistribution Encoder consists of multiple such blocks stacked together. Each block incorporates residual connections, Layer Normalization, and Dropout for stable training and improved performance. The structure for stacking these components is as follows: $ \tilde{H}^{l} = \mathrm{LayerNorm}(H^{l} + \mathrm{Dropout}(\mathrm{FR}(H^{l}))) $, $ H_{Att}^{l} = \mathrm{LayerNorm}(\tilde{H}^{l} + \mathrm{Dropout}(\mathrm{MHA}(\tilde{H}^{l}))) $, $ H^{l+1} = \mathrm{LayerNorm}(H_{Att}^{l} + \mathrm{Dropout}(\mathrm{PFFN}(H_{Att}^{l}))) $ Here:

  • $\mathrm{FR}(H^{l})$ represents the Frequency Redistribution Structure described in Section 4.4.1.
  • $\mathrm{MHA}(\tilde{H}^{l})$ represents the Multi-Head Self-Attention mechanism described in Section 4.4.2.
  • $\mathrm{PFFN}(H_{Att}^{l})$ represents the Point-wise Feed-Forward Network described in Section 4.4.3.

Each LayerNorm is applied to the sum of the residual input and the Dropout output, ensuring that the inputs to subsequent layers are normalized. The final output of the last FRE block, usually the representation of the last item in the sequence, $h^u$, is used as the user's intent representation for next-item prediction and contrastive learning.

4.5. Prediction layer

The prediction layer is responsible for taking the final user intent representation (denoted as $h^{u} \in \mathbb{R}^{1 \times D}$, typically the embedding of the last item from the final encoder block) and computing a probability distribution over all possible items in $V$. This probability distribution indicates the likelihood of the user interacting with each item next.

The prediction is calculated by multiplying the user's intent representation $h^{u}$ by the transpose of the item embedding matrix, $M^{T}$. Each resulting dot product measures the similarity between the user's intent and one item. The result is then passed through a softmax function to convert these similarities into a probability distribution $\hat{y} \in \mathbb{R}^{|V|}$: $ \hat{y} = \mathrm{softmax}(h^{u} M^{T}) $ Here, $M^{T}$ is the transpose of the item embedding matrix, where each column corresponds to an item embedding.
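The scoring step can be illustrated with a toy catalogue. This is a sketch under stated assumptions: the three item embeddings and the intent vector are made-up values, and `predict_next_item` is a hypothetical helper name.

```python
import math

def predict_next_item(h_u, item_embeddings):
    """Return softmax(h_u M^T): one probability per item in the catalogue."""
    scores = [sum(a * b for a, b in zip(h_u, item)) for item in item_embeddings]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Rows of M are item embeddings (illustrative values, D = 2, |V| = 3).
M = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h_u = [2.0, 0.5]
y_hat = predict_next_item(h_u, M)
top_item = max(range(len(y_hat)), key=y_hat.__getitem__)
```

Here the dot products are 2.0, 0.5, and -2.0, so the first item receives the highest probability.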

The model is trained to minimize a recommendation loss (cross-entropy loss) between the predicted probability distribution $\hat{y}$ and the true next item $y$. The cross-entropy loss (specifically, binary cross-entropy per item in the sequence prediction context) is formulated as: $ \mathcal{L}_{Rec} = -\sum_{i=1}^{n} y_{i} \log(\widehat{y_{i}}) + (1 - y_{i}) \log(1 - \widehat{y_{i}}) $ In this formula:

  • $n$ represents the number of items in the sequence being processed for prediction (often $N$, the sequence length, or batch size).
  • $y_{i}$ is a one-hot encoded vector representing the ground truth next item for the $i$-th position in the sequence (or for the $i$-th sample in a batch). It is 1 for the actual next item and 0 for others.
  • $\widehat{y_{i}}$ is the predicted probability distribution for the next item. The loss penalizes incorrect predictions and encourages the model to assign high probabilities to the actual next items.
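A short sketch of the loss follows. It assumes the standard binary cross-entropy reading, in which the leading minus sign applies to both terms of the sum, and it clamps probabilities with a hypothetical `eps` parameter to avoid `log(0)`.

```python
import math

def recommendation_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy summed over items: -sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total

# One positive (the true next item) and one negative item.
good = recommendation_loss([1, 0], [0.9, 0.1])
bad = recommendation_loss([1, 0], [0.1, 0.9])
```

A confident correct prediction (`good`) yields a smaller loss than a confident wrong one (`bad`).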

4.6. Multi-task learning

The multi-task learning component in FICLRec is implemented through Intent Contrastive Learning (ICL). This self-supervised learning task is crucial for addressing data sparsity and for learning more robust representations of both high-frequency (short-term) and low-frequency (long-term) user intentions. The goal is to optimize these objectives alongside the primary recommendation loss.

The following figure (Figure 3 from the original paper) details the intent contrastive learning mechanism:

Fig. 3. The details of intent contrastive learning.

The figure illustrates how intent contrastive learning works. From a user's historical sequence, two positive samples, $s_i$ and $s_{i,s}$, are derived. Their intent representations, $h_i$ and $h_{i,s}$, are extracted. For high-frequency intent contrastive learning, these representations are directly contrasted and their FFT (frequency domain) representations are aligned. For low-frequency intent contrastive learning, intent prototypes ($c_i, c_{i,s}$) are queried using $h_i, h_{i,s}$, and these prototypes are then contrasted with the original intent representations. The overall intent contrastive loss is a weighted sum of these high and low-frequency components.

4.6.1. High-frequency intent contrastive learning

This part of the ICL focuses on capturing and distinguishing high-frequency intents, which correspond to users' immediate and dynamic short-term interests. It operates on two positive samples derived from the user's interaction sequence.

  • $s_{i}$: Represents a sequence containing the target item $i$.

  • $s_{i,s}$: Represents another sequence from the user's historical interactions ($S^u$) that shares the same target item $i$. These two sequences are considered positive pairs.

    Their comprehensive feature representations $H_i$ and $H_{i,s}$ are obtained from the Frequency Redistribution Encoder. Then, their final intent representations, denoted as $h_{i}$ and $h_{i,s}$, are extracted (typically the last hidden state of the sequences).

The primary contrastive loss used is a variant of InfoNCE loss, denoted as $\mathcal{L}_{InfoNCE}^{FNM}$. This loss pulls the representations of positive pairs closer while pushing negative samples farther apart. $ \mathcal{L}_{InfoNCE}^{FNM}(h_{i}, h_{i,s}) = -\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(h_{i}, h_{i,s})/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(h_{i}, h_{k})/\tau)\,\mathrm{mask}(i,k)} $ Here:

  • $N$ represents the batch size.

  • $h_{i}$ is the intent representation of an anchor sequence.

  • $h_{i,s}$ is the intent representation of a positive sample for $h_i$.

  • $h_{k}$ represents the intent representations of other sequences in the batch, which serve as negative samples.

  • $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity function (e.g., cosine similarity).

  • $\tau$ is a temperature parameter that controls the sharpness of the softmax distribution. A smaller $\tau$ makes the distribution sharper, emphasizing larger similarities more.

  • $\mathrm{mask}(i,k)$ is a masking function that ensures positive samples are excluded from the negative sample set (i.e., $\mathrm{mask}(i,k) = 0$ if $h_{k}$ is a positive sample for $h_{i}$, and 1 otherwise), preventing them from being mistakenly treated as negatives.
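The masked InfoNCE loss can be sketched as below. Several details are assumptions rather than facts from the source: cosine similarity is used for $\mathrm{sim}$, the candidate set for each anchor is the batch of positive views, the anchor's own positive stays in the denominator, and other in-batch samples that share the anchor's target item (a hypothetical `targets` list) are the ones masked out as false negatives.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def info_nce_fnm(anchors, positives, targets, tau=1.0):
    """InfoNCE in which other samples with the anchor's target item are masked out."""
    loss = 0.0
    n = len(anchors)
    for i in range(n):
        pos = math.exp(cosine(anchors[i], positives[i]) / tau)
        denom = 0.0
        for k in range(n):
            # mask(i, k) = 0: skip false negatives (same target item, different sample)
            if k != i and targets[k] == targets[i]:
                continue
            denom += math.exp(cosine(anchors[i], positives[k]) / tau)
        loss -= math.log(pos / denom)
    return loss

anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
positives = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]]
masked = info_nce_fnm(anchors, positives, targets=[7, 8, 7])
unmasked = info_nce_fnm(anchors, positives, targets=[7, 8, 9])
```

In this toy batch the first and third samples share target item 7, so masking removes them from each other's denominators and the masked loss is lower than the unmasked one.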

    To explicitly enhance high-frequency features, an additional high-frequency alignment loss ($\mathcal{L}_{hfal}$) is introduced. This loss directly minimizes the distance between the high-frequency components of the positive sample pairs in the frequency domain. $ \mathcal{L}_{hfal} = \frac{1}{N} \sum_{i=1}^{N} \| X_{i}[\omega_{p}, :] - X_{i,s}[\omega_{p}, :] \| $ In this formula:

  • $N$ represents the batch size.

  • $X_{i}$ and $X_{i,s}$ are the frequency-domain representations obtained by applying the FFT operation to $h_{i}$ and $h_{i,s}$, respectively.

  • $X_{i}[\omega_{p}, :]$ refers to the high-frequency part of the FFT result (from the boundary frequency $\omega_p$ onwards) for sequence $i$.

  • $\| \cdot \|$ denotes the L2 norm, measuring the Euclidean distance between the high-frequency features of the positive pair. Minimizing this loss encourages their high-frequency characteristics to be similar.

    Finally, the total high-frequency intent contrastive loss ($\mathcal{L}_{hicl}$) combines the InfoNCE loss (applied bidirectionally) and the high-frequency alignment loss: $ \mathcal{L}_{hicl} = \mathcal{L}_{InfoNCE}^{FNM}(h_{i}, h_{i,s}) + \mathcal{L}_{InfoNCE}^{FNM}(h_{i,s}, h_{i}) + \mathcal{L}_{hfal} $ The bidirectional InfoNCE terms ensure that $h_i$ serves as an anchor for $h_{i,s}$ and vice versa, making the contrastive learning more robust.
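The high-frequency alignment term can be illustrated with a naive discrete Fourier transform. This is a sketch under assumptions: each representation is treated as a 1-D real signal, only the non-redundant bins are kept, and "high frequency" means every bin with index at or above a boundary `p`. The source does not specify the axis along which the FFT is taken, and a real implementation would use an FFT routine rather than this O(n²) loop.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform; returns the non-redundant bins 0..n//2."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(signal))
            for f in range(n // 2 + 1)]

def high_freq_alignment(batch_a, batch_b, p):
    """Mean L2 distance between the frequency bins with index >= p of each pair."""
    total = 0.0
    for a, b in zip(batch_a, batch_b):
        xa, xb = dft(a)[p:], dft(b)[p:]
        total += math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(xa, xb)))
    return total / len(batch_a)

# Two signals that differ only by a constant offset, i.e. only in the zero-frequency bin.
a = [1.0, 2.0, 3.0, 4.0]
b = [11.0, 12.0, 13.0, 14.0]
aligned = high_freq_alignment([a], [b], p=1)
with_dc = high_freq_alignment([a], [b], p=0)
```

Because the two signals differ only in their mean, the loss is numerically zero once the lowest bin is excluded (`p=1`), while including it (`p=0`) exposes the full difference.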

4.6.2. Low-frequency intent contrastive learning

This component focuses on capturing users' long-term preferences and stable interests, which are considered low-frequency representations. Instead of directly contrasting individual sequence embeddings, it introduces intent prototypes as representations of these stable interests.

The intent prototypes are represented by a set of cluster centers $C = \{c_{1}, c_{2}, \dots, c_{K}\}$, where $c_{k}$ is the $k$-th cluster center. These prototypes are obtained by applying K-Means clustering to all learned intent representations $h^u$ across the entire dataset. For the two positive samples ($h_i$ and $h_{i,s}$), their corresponding intent prototypes ($c_i$ and $c_{i,s}$) are retrieved by querying the nearest cluster center: $ c_{i}, c_{i,s} = \operatorname{query}(h_{i}), \operatorname{query}(h_{i,s}) $ The query function assigns an intent representation to its closest cluster center. This effectively treats the cluster centers as low-frequency or general intent representations.

A cluster-level center alignment loss ($\mathcal{L}_{ccal}$) is introduced to ensure that the intent representations of the positive samples ($h_i, h_{i,s}$) are aligned with their respective intent prototypes ($c_i, c_{i,s}$): $ \mathcal{L}_{ccal} = \frac{1}{N} \sum_{i=1}^{N} (h_{i} - \mathbf{c}_{i})^{2} + \frac{1}{N} \sum_{i=1}^{N} (h_{i,s} - \mathbf{c}_{i,s})^{2} $ This loss minimizes the squared Euclidean distance between each intent representation and its assigned cluster center, thus encouraging the intent representations to conform to the learned intent prototypes.
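The prototype query and the center alignment loss can be sketched together. The sketch assumes that "closest" means smallest squared Euclidean distance and that the squared difference in the loss is summed over embedding dimensions. The prototype values are illustrative, not K-Means output.

```python
def query(h, prototypes):
    """Return the prototype (cluster center) closest to h in squared Euclidean distance."""
    return min(prototypes, key=lambda c: sum((x - y) ** 2 for x, y in zip(h, c)))

def center_alignment_loss(reps_a, reps_b, prototypes):
    """Mean squared distance of each representation to its nearest prototype, for both views."""
    n = len(reps_a)

    def mean_sq(reps):
        return sum(sum((x - y) ** 2 for x, y in zip(h, query(h, prototypes)))
                   for h in reps) / n

    return mean_sq(reps_a) + mean_sq(reps_b)

prototypes = [[0.0, 0.0], [10.0, 10.0]]
loss = center_alignment_loss([[1.0, 1.0]], [[9.0, 8.0]], prototypes)
```

Here `[1.0, 1.0]` is assigned to the first prototype (squared distance 2) and `[9.0, 8.0]` to the second (squared distance 5), giving a loss of 7.0.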

Similar to the high-frequency part, an InfoNCE loss is applied to contrast the intent representations with their intent prototypes. This pulls intent representations closer to their correct low-frequency prototypes while pushing them away from incorrect ones. $ \mathcal{L}_{licl} = \mathcal{L}_{InfoNCE}^{FNM}(h_{i}, c_{i}) + \mathcal{L}_{InfoNCE}^{FNM}(c_{i}, h_{i}) + \mathcal{L}_{ccal} $ This combined loss ($\mathcal{L}_{licl}$) ensures that intent representations are not only aligned with their prototypes but also discriminative against other prototypes, thereby better capturing users' long-term preferences.

Finally, the complete intent contrastive learning loss ($\mathcal{L}_{icl}$) is a weighted sum of the high-frequency and low-frequency intent contrastive losses: $ \mathcal{L}_{icl} = \alpha \cdot \mathcal{L}_{hicl} + \beta \cdot \mathcal{L}_{licl} $ Here, $\alpha$ and $\beta$ are hyperparameters that control the relative importance of high-frequency (short-term) and low-frequency (long-term) intent contrastive learning tasks in the overall self-supervised objective.

4.6.3. Train and inference

The overall loss function for FICLRec combines the primary recommendation loss ($\mathcal{L}_{\mathrm{Rec}}$) and the intent contrastive learning loss ($\mathcal{L}_{icl}$). The model is trained to minimize this combined objective: $ \mathcal{L} = \mathcal{L}_{\mathrm{Rec}} + \mathcal{L}_{icl} $ This multi-task learning approach allows the model to simultaneously learn to predict the next item accurately and to form robust, frequency-aware intent representations, thereby benefiting from the self-supervised signals. During inference, only the prediction layer is used with the learned encoder to generate recommendations.

The training algorithm, Algorithm 1 from the paper, is as follows:

Algorithm 1: FICLRec Training Algorithm

Input: A sequence recommendation dataset $\{S^{u}\}_{u=1}^{|U|}$, sequence encoder $f_{\theta}(\cdot)$, hyperparameters (e.g., $\alpha, \beta, \tau, p, K$), epochs $B$, batch size $N$.

Output: Trained encoder $f_{\theta}(\cdot)$

Partition the dataset into subsets for training, validation, and testing;
Initialize encoder $f_{\theta}(\cdot)$;
for epoch $< B$ do
    // Obtain intent prototypes $C$
    $C = \mathrm{KMeans}(f_{\theta}(\{S^{u}\}_{u=1}^{|U|}), K)$;
    for a mini-batch $\{S^{u}\}_{u=1}^{N}$ do
        for $u \in \{1, 2, \ldots, N\}$ do
            // Sample a subsequence $s_{i,s}^{u}$ from $S^{u}$ with the same target item as $s_{i}^{u}$
            $s_{i,s}^{u} = \mathrm{sample}(s_{i}^{u})$;
            // Encode positive samples
            $h_{s_{i}^{u}}^{u} = f_{\theta}(s_{i}^{u})$, $h_{s_{i,s}^{u}}^{u} = f_{\theta}(s_{i,s}^{u})$;
        // Query intent prototype representations
        $c_{i} = \operatorname{query}(h_{i})$, $c_{i,s} = \operatorname{query}(h_{i,s})$;
        // Multi-task optimization
        $\mathcal{L}_{icl} = \alpha \cdot \mathcal{L}_{hicl} + \beta \cdot \mathcal{L}_{licl}$;
        $\mathcal{L} = \mathcal{L}_{\mathrm{Rec}} + \mathcal{L}_{icl}$;
        Update encoder $f_{\theta}(\cdot)$ to minimize $\mathcal{L}$;
return $f_{\theta}(\cdot)$

The algorithm outlines the iterative training process. In each epoch, K-Means clustering is performed on the current intent representations from all users to update the intent prototypes $C$. Then, for each mini-batch, positive sequence pairs are sampled, their intent representations and corresponding prototypes are obtained, and the multi-task loss (combining recommendation loss and the weighted intent contrastive loss) is computed and used to update the encoder parameters.

4.7. Complexity analysis

The computational complexity of FICLRec is analyzed by considering its main components.

  • Frequency Redistribution Structure: This involves a Fast Fourier Transform (FFT) and an Inverse Fast Fourier Transform (IFFT), which have a complexity of $\mathcal{O}(N \log(N))$, where $N$ is the sequence length. Additionally, there is a linear layer for the gating mechanism, with complexity $\mathcal{O}(ND)$ (where $D$ is the embedding dimension).

  • Self-Attention Mechanism: This component has a complexity of $\mathcal{O}(N^{2} D)$ due to the dot product computations between queries and keys for all pairs of items in the sequence.

  • Point-wise Feed-Forward Layer: This layer has a complexity of $\mathcal{O}(N D^{2})$ as it applies a feed-forward network independently to each position, involving matrix multiplications of dimension $D \times D$.

    Combining these, the total complexity of FICLRec per layer is: $ \mathcal{O}(N^{2} D + N \log(N) + N D^{2} + N D) $

The following are the results from Table 3 of the original paper, comparing the time complexity of FICLRec with several baseline models:

| Model | Backbone network | Time complexity |
|---|---|---|
| SASRec | Self-Attention + point-wise feed-forward layer | $\mathcal{O}(n^2 + n d^2 + n d)$ |
| DuoRec | Self-Attention + point-wise feed-forward layer | $\mathcal{O}(n^2 + n d^2 + n d)$ |
| FEARec | Hybrid Attention + point-wise feed-forward layer | $\mathcal{O}(2d + n \log(n) + n^2 + nd)$ |
| ICLRec | Self-Attention + point-wise feed-forward layer | $\mathcal{O}(n^2 + n d^2 + n d)$ |
| ICSRec | Self-Attention + point-wise feed-forward layer | $\mathcal{O}(n^2 + n d^2 + n d)$ |
| FICLRec (ours) | FRE + Self-Attention + point-wise feed-forward layer | $\mathcal{O}(n^2 + n \log(n) + n d^2 + n d)$ |

As shown in Table 3, FICLRec differs from pure self-attention models like SASRec or ICSRec only by an additional $\mathcal{O}(N \log(N))$ term, introduced by the FFT/IFFT operations in its Frequency Redistribution Encoder. Because this term grows more slowly than the quadratic self-attention term, it does not change the asymptotic order, and the extra cost is small for typical sequence lengths. FEARec also includes an $\mathcal{O}(n \log(n))$ term due to its use of Fourier Transforms.
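To give a sense of scale, the per-layer terms can be evaluated for the settings reported in the implementation details ($N = 50$, $D = 64$). This is a rough illustration only: big-O notation hides constant factors, and the base-2 logarithm is an assumption.

```python
import math

N, D = 50, 64  # maximum sequence length and embedding dimension used in the experiments

terms = {
    "self-attention N^2*D": N * N * D,
    "FFT/IFFT N*log2(N)": N * math.log2(N),
    "feed-forward N*D^2": N * D * D,
    "gating N*D": N * D,
}
total = sum(terms.values())
fft_share = terms["FFT/IFFT N*log2(N)"] / total
```

Under these assumptions the FFT/IFFT term is roughly 282 out of a total of about 368,000, i.e. well under 0.1% of the operation count.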

5. Experimental Setup

5.1. Datasets

The authors conducted extensive experiments on five publicly available real-world datasets to validate the effectiveness of FICLRec. These datasets are commonly used in sequential recommendation research and originate from various domains, providing a comprehensive evaluation.

The following are the results from Table 4 of the original paper:

| Dataset | Beauty | Sports | Toys | Yelp | LastFM |
|---|---|---|---|---|---|
| #Users | 22,363 | 35,598 | 19,412 | 30,431 | 1,090 |
| #Items | 12,101 | 18,357 | 11,924 | 20,033 | 3,646 |
| #Actions | 198,502 | 296,337 | 167,597 | 316,354 | 52,551 |
| Avg. Actions/User | 8.8 | 8.3 | 8.6 | 10.4 | 48.2 |
| Avg. Actions/Item | 16.4 | 16.1 | 14 | 15.8 | 14.4 |
| Sparsity | 99.93% | 99.95% | 99.93% | 99.95% | 98.68% |

A summary of each dataset:

  • Beauty: This dataset contains user reviews for beauty products from Amazon. It has 22,363 users, 12,101 items, and 198,502 interactions. The average number of actions per user is 8.8, and the sparsity is high at 99.93%. This represents a domain with many diverse products and relatively short user interaction histories.

  • Sports: Also from Amazon, this dataset focuses on sports and outdoor items. It includes 35,598 users, 18,357 items, and 296,337 interactions. With an average of 8.3 actions per user and 99.95% sparsity, it's another sparse dataset with typical user behavior patterns.

  • Toys: This Amazon dataset covers toys, board games, and outdoor toys, with 19,412 users, 11,924 items, and 167,597 interactions. It has an average of 8.6 actions per user and 99.93% sparsity, similar in characteristics to Beauty and Sports.

  • Yelp: This dataset is derived from Yelp reviews, covering various businesses. It has 30,431 users, 20,033 items, and 316,354 interactions. The average actions per user is 10.4, and sparsity is 99.95%. This dataset offers a different interaction context (services/places rather than products).

  • LastFM: This dataset contains user listening habits from the Last.fm music platform. It is a smaller dataset with 1,090 users, 3,646 items, and 52,551 interactions. Notably, it has a much higher average number of actions per user (48.2) and lower sparsity (98.68%) compared to the Amazon and Yelp datasets. This dataset is valuable for evaluating performance on denser, longer user sequences, potentially rich in both short-term music tastes and long-term artist preferences.

    These datasets were chosen because they represent diverse domains (e-commerce, reviews, music), vary in scale, and importantly, exhibit significant data sparsity (ranging from 98.68% to 99.95%). This makes them highly suitable for validating FICLRec's ability to handle sparse data and capture complex user intentions in real-world scenarios.
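The sparsity figures in Table 4 follow from the user, item, and action counts, taking sparsity as the fraction of empty cells in the user-item interaction matrix. A quick check (the dictionary layout and function name are illustrative):

```python
# (users, items, actions) as listed in Table 4
datasets = {
    "Beauty": (22363, 12101, 198502),
    "Sports": (35598, 18357, 296337),
    "Toys": (19412, 11924, 167597),
    "Yelp": (30431, 20033, 316354),
    "LastFM": (1090, 3646, 52551),
}

def sparsity(users, items, actions):
    """Fraction of user-item pairs with no recorded interaction."""
    return 1.0 - actions / (users * items)

report = {name: round(100 * sparsity(*stats), 2) for name, stats in datasets.items()}
```

The computed percentages (99.93, 99.95, 99.93, 99.95, 98.68) match the Sparsity row of the table.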

5.2. Evaluation Metrics

The paper uses two widely adopted ranking metrics to evaluate the performance of the sequential recommendation models: HR@K (Hit Rate) and NDCG@K (Normalized Discounted Cumulative Gain). $K$ denotes the number of top recommended items considered for evaluation.

  1. Hit Rate (HR@K):

    • Conceptual Definition: Hit Rate @ K measures the proportion of users for whom the ground truth next item appears in the top $K$ recommended items. It's a binary metric: if the target item is in the top $K$ list, it's a "hit" (1); otherwise, it's a "miss" (0). It focuses on recall, indicating how often the system successfully recommends the relevant item within a given ranking cutoff.
    • Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of hits @ K}}{\text{Total number of interactions}} $ Where a "hit" means the true next item $v_{next}$ is in the top $K$ predicted items $\hat{V}_K$.
    • Symbol Explanation:
      • Number of hits @ K: The count of all instances where the actual next item was found among the top $K$ recommendations.
      • Total number of interactions: The total number of prediction tasks (i.e., the total number of users' next items to be predicted).
  2. Normalized Discounted Cumulative Gain (NDCG@K):

    • Conceptual Definition: NDCG@K evaluates the relevance and ranking quality of the recommended items. It gives higher scores to more relevant items that appear at higher (earlier) positions in the recommendation list. This metric is particularly useful when the relevance of items is graded (though in implicit feedback sequential recommendation, relevance is often binary, i.e., the next item is relevant, others are not). It reflects both the quality (relevance) and the position of the hits.
    • Mathematical Formula: First, Discounted Cumulative Gain (DCG@K) is calculated: $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ Then, NDCG@K normalizes DCG@K by the Ideal DCG (IDCG@K), which is the DCG of a perfect ranking: $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    • Symbol Explanation:
      • $K$: The number of top recommended items considered.

      • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the recommended list. In binary relevance (common for implicit feedback), $\mathrm{rel}_i = 1$ if the item at position $i$ is the ground truth next item, and 0 otherwise.

      • $\log_2(i+1)$: The discount factor, which reduces the contribution of items at lower ranks.

      • $\mathrm{IDCG@K}$: The maximum possible DCG value for a given user's query, obtained by ranking all relevant items first in descending order of relevance. For sequential recommendation with a single ground truth next item, the ideal ranking places that item at rank 1, so $\mathrm{IDCG@K} = \frac{2^1-1}{\log_2(1+1)} = 1$ for every $K \geq 1$. Consequently, if the actual item appears at rank $r \leq K$, $\mathrm{NDCG@K} = \mathrm{DCG@K} = \frac{1}{\log_2(r+1)}$; if it is not in the top $K$, $\mathrm{NDCG@K} = 0$.

        Both HR@K and NDCG@K are commonly used to provide a comprehensive evaluation of sequential recommendation systems, with HR@K focusing on whether the item is found, and NDCG@K also considering its position. The paper reports results for $K = 5, 10, 20$.
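Both metrics reduce to simple functions of the rank of the ground-truth item when there is exactly one relevant item per test case (so the ideal DCG is 1). A minimal sketch, with hypothetical function names and made-up ranks:

```python
import math

def hit_and_ndcg_at_k(rank, k):
    """rank is the 1-based position of the ground-truth item in the ranked list."""
    if rank > k:
        return 0.0, 0.0                       # miss: contributes nothing to either metric
    return 1.0, 1.0 / math.log2(rank + 1)     # hit: DCG of a single relevant item

def evaluate(ranks, k):
    """Average HR@k and NDCG@k over all test cases."""
    hits, ndcgs = zip(*(hit_and_ndcg_at_k(r, k) for r in ranks))
    return sum(hits) / len(ranks), sum(ndcgs) / len(ranks)

ranks = [1, 3, 12, 7]          # illustrative ranks for four test users
hr5, ndcg5 = evaluate(ranks, 5)
hr10, ndcg10 = evaluate(ranks, 10)
```

With these ranks, HR@5 is 0.5 (two of four items are in the top 5) and NDCG@5 is 0.375; raising the cutoff to 10 adds the item at rank 7, giving HR@10 of 0.75.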

5.3. Baselines

The authors compared FICLRec against a wide range of state-of-the-art sequential recommendation models, representing different methodological categories:

Traditional recommendation methods:

  • GRU4Rec (Hidasi et al., 2015): A pioneering RNN-based model for sequential recommendation using Gated Recurrent Units. It models user sessions to predict the next item.
  • Caser (Tang & Wang, 2018): Employs horizontal and vertical convolutional filters to capture sequential patterns in user interactions, allowing for session parallelization.
  • SASRec (Kang & McAuley, 2018): A Transformer-based model that applies self-attention to user sequences, effectively capturing long-term dependencies and significantly improving sequential recommendation performance.

Frequency domain methods:

  • FMLPRec (Zhou et al., 2022): Leverages the frequency domain via Fourier Transforms to model sequential data, aiming to filter out noise and capture both global and local patterns.
  • BSARec (Li et al., 2024b): Combines time-domain and frequency-domain analysis. It uses adaptive filtering to handle noise in the embedding matrix and Fourier transforms to identify periodic patterns.
  • FEARec (Ni et al., 2023): Integrates hybrid attention with Fourier transforms to process user sequences, enhancing the model's capacity to identify and manage periodic patterns by separating concerns between time and frequency domains.

Self-supervised methods:

  • ContrastVAE (Zou et al., 2022): A self-supervised learning approach based on Variational Autoencoders with contrastive learning objectives, aiming to learn better item representations.
  • DuoRec (Chen et al., 2022): Employs dual contrastive learning tasks to generate robust item representations, addressing data sparsity and item embedding degradation.
  • RTRRec (Li et al., 2024f): A self-supervised learning model for sequential recommendation that uses contrastive learning to learn latent sequence representations, alleviating data sparsity concerns.
  • ICLRec (Li et al., 2024d): Introduces intent contrastive learning within a self-supervised framework, enhancing item embedding quality and addressing the challenge of item embedding degradation.
  • ELCRec (Li et al., 2024c): Another self-supervised model that leverages contrastive learning to improve latent sequence representation, particularly effective in sparse data contexts and in increasing noise robustness.

Intent learning methods:

  • ICSRec (Li et al., 2024e): An intent-contrastive sequential recommendation model that utilizes a novel contrastive self-supervised learning objective to learn and represent user intentions effectively.

  • IDCLRec (Chen et al., 2025): An intent-driven contrastive learning model that further refines intent learning by enhancing the extraction of user intentions in sequential recommendation.

    These baselines were selected because they represent the current state-of-the-art in various aspects of sequential recommendation, including traditional sequence modeling, frequency domain processing, self-supervised learning for robustness, and explicit intent learning. This diverse set allows for a comprehensive evaluation of FICLRec's performance and its distinct contributions.

5.4. Implementation details

The experiments used an embedding dimension $D$ fixed at 64 and a maximum sequence length $N$ limited to 50, which are common settings in sequential recommendation. The temperature parameter $\tau$ for InfoNCE loss was set to 1.0. The hyperparameters $\alpha$ and $\beta$ (weights for high-frequency and low-frequency intent contrastive losses) were searched from $\{0.1, 0.2, 0.3, 0.4, 0.5\}$. The boundary frequency $p$ (determining the split between low and high frequencies) was selected from $\{1, 3, 5, 7, 9, 11, 13\}$. The Adam optimizer (Kingma, 2017) was used with a learning rate of $10^{-3}$, and a dropout rate of 0.5 was applied. All experiments were conducted on an NVIDIA GeForce RTX 2080 Ti GPU (11GB). The baselines (Caser, GRU4Rec, SASRec, FMLPRec, BSARec, RTRRec, ContrastVAE, DuoRec, ICLRec, ELCRec, ICSRec, IDCLRec) were either implemented by the authors or their results were taken from published papers, ensuring fair comparison.

6. Results & Analysis

6.1. Core Results Analysis

The paper conducted extensive experiments to compare FICLRec with several state-of-the-art baseline models. The primary goal was to demonstrate FICLRec's superior performance in sequential recommendation, particularly its enhanced ability to capture high-frequency intents and its robustness to data sparsity.

The following are the results from Table 5 of the original paper, showing the performance comparison across all five datasets using HR@K and NDCG@K metrics:

Dataset Metric Caser GRU4Rec SASRec FMLPRec BSARec RTRRec ContrastVAE DuoRec ICLRec ELCRec ICSRec IDCLRec FICLRec (ours) Imp. vs. SOTA
Beauty HR@5 0.0159 0.0175 0.0338 0.0355 0.0597 0.0705 0.0436 0.0322 0.0460 0.0548 0.0440 0.0498 0.0677 0.0724 2.70%
HR@10 0.0226 0.0287 0.0525 0.0561 0.0881 0.0987 0.0673 0.0548 0.0728 0.0844 0.0650 0.0742 0.0930 0.1007 2.03%
HR@20 0.0427 0.0461 0.0800 0.0847 0.1248 0.1345 0.1005 0.0832 0.1090 0.1208 0.0935 0.1048 0.1271 0.1390 3.35%
NDCG@5 0.0097 0.0105 0.0217 0.0222 0.0369 0.0504 0.0287 0.0195 0.0306 0.0344 0.0288 0.0341 0.0481 0.0516 2.38%
NDCG@10 0.0132 0.0141 0.0277 0.0288 0.0460 0.0595 0.0364 0.0267 0.0392 0.0439 0.0355 0.0420 0.0562 0.0606 1.85%
NDCG@20 0.0172 0.0185 0.0346 0.0361 0.0553 0.0685 0.0447 0.0339 0.0483 0.0531 0.0426 0.0497 0.0648 0.0703 2.63%
Sports HR@5 0.0074 0.0103 0.0185 0.0205 0.0346 0.0408 0.0245 0.0222 0.0231 0.0299 0.0262 0.0265 0.0384 0.0438 7.35%
HR@10 0.0131 0.0182 0.0303 0.0314 0.0525 0.0589 0.0407 0.0360 0.0370 0.0456 0.0400 0.0410 0.0548 0.0623 5.77%
HR@20 0.0224 0.0304 0.0453 0.0483 0.0758 0.0839 0.0628 0.0557 0.0562 0.0669 0.0593 0.0634 0.0770 0.0885 5.48%
NDCG@5 0.0048 0.0064 0.0119 0.0136 0.0198 0.0284 0.0156 0.0144 0.0153 0.0189 0.0178 0.0177 0.0256 0.0300 5.63%
NDCG@10 0.0066 0.0090 0.0156 0.0171 0.0256 0.0342 0.0209 0.0189 0.0198 0.0240 0.0222 0.0224 0.0318 0.0359 4.97%
NDCG@20 0.0089 0.0120 0.0194 0.0213 0.0314 0.0405 0.0264 0.0238 0.0246 0.0293 0.0271 0.0280 0.0374 0.0425 4.94%
Toys HR@5 0.0076 0.0112 0.0194 0.0213 0.0434 0.0472 0.0264 0.0238 0.0246 0.0293 0.0271 0.0280 0.0374 0.0425 4.94%
HR@10 0.0144 0.0202 0.0324 0.0352 0.0614 0.0685 0.0481 0.0413 0.0409 0.0489 0.0465 0.0486 0.0646 0.0679 3.14%
HR@20 0.0249 0.0324 0.0434 0.0472 0.0853 0.0978 0.0726 0.0603 0.0595 0.0728 0.0774 0.0828 0.1038 0.1092 2.82%
NDCG@5 0.0044 0.0068 0.0105 0.0113 0.0294 0.0319 0.0330 0.0241 0.0409 0.0387 0.0395 0.0406 0.0557 0.0591 2.43%
NDCG@10 0.0066 0.0096 0.0145 0.0156 0.0352 0.0412 0.0409 0.0303 0.0489 0.0476 0.0465 0.0486 0.0646 0.0679 2.41%
NDCG@20 0.0092 0.0127 0.0194 0.0213 0.0434 0.0472 0.0490 0.0444 0.0563 0.0538 0.0538 0.0565 0.0728 0.0774 2.50%
Yelp HR@5 0.0108 0.0140 0.0234 0.0261 0.0461 0.0567 0.0234 0.0227 0.0182 0.0270 0.0148 0.0143 0.0710 0.0826 13.85%
HR@10 0.0186 0.0240 0.0419 0.0426 0.0702 0.0707 0.0398 0.0331 0.0268 0.0445 0.0256 0.0261 0.0826 0.0934 13.11%
HR@20 0.0321 0.0424 0.0702 0.0707 0.1069 0.1064 0.0669 0.0565 0.0380 0.0729 0.0391 0.0411 0.1089 0.1235 13.41%
NDCG@5 0.0091 0.0118 0.0210 0.0225 0.0376 0.0460 0.0195 0.0171 0.0112 0.0227 0.0093 0.0195 0.0564 0.0645 14.36%
NDCG@10 0.0125 0.0165 0.0290 0.0287 0.0487 0.0558 0.0268 0.0220 0.0170 0.0299 0.0130 0.0256 0.0673 0.0766 13.82%
NDCG@20 0.0172 0.0226 0.0391 0.0387 0.0622 0.0694 0.0371 0.0319 0.0226 0.0391 0.0195 0.0319 0.0826 0.0932 12.83%
LastFM HR@5 0.0193 0.0239 0.0367 0.0376 0.0404 0.0450 0.0376 0.0312 0.0468 0.0394 0.0303 0.0266 0.0459 0.0569 23.97%
HR@10 0.0367 0.0358 0.0560 0.0642 0.0615 0.0615 0.0615 0.0468 0.0679 0.0587 0.0339 0.0400 0.0766 0.0881 14.90%
HR@20 0.0550 0.0495 0.0917 0.1064 0.0862 0.1064 0.0862 0.0798 0.1009 0.0881 0.0514 0.0688 0.1171 0.1292 10.33%
NDCG@5 0.0146 0.0155 0.0260 0.0263 0.0261 0.0338 0.0263 0.0213 0.0346 0.0287 0.0193 0.0185 0.0346 0.0413 19.36%
NDCG@10 0.0200 0.0194 0.0320 0.0338 0.0339 0.0392 0.0338 0.0253 0.0413 0.0348 0.0246 0.0208 0.0415 0.0503 21.20%
NDCG@20 0.0245 0.0228 0.0410 0.0402 0.0400 0.0504 0.0400 0.0343 0.0496 0.0422 0.0300 0.0252 0.0501 0.0594 18.56%

Key observations and analysis from Table 5:

  1. Overall Superiority of FICLRec: Across all five datasets and all evaluation metrics (HR@K and NDCG@K for K=5, 10, 20), FICLRec consistently outperforms all baseline models. The "Imp. vs. SOTA" column indicates significant percentage improvements over the best baseline on each specific metric and dataset. This strong performance validates the overall effectiveness of the proposed frequency-enhanced intent contrastive learning framework.

  2. Significant Gains on Sparse and Dense Datasets:

    • On highly sparse datasets like Yelp (99.95% sparsity), FICLRec achieves particularly large improvements, for example, 13.85% on HR@5 and 14.36% on NDCG@5. This suggests that frequency contrastive learning is highly effective in mitigating the negative impacts of data sparsity, a core claim of the paper.
    • Even on the relatively denser LastFM dataset (98.68% sparsity, higher average actions/user), FICLRec demonstrates remarkable gains, such as 23.97% on HR@5 and 19.36% on NDCG@5. This indicates its ability to capture complex user dynamics across different data densities.
  3. Advantages over other Transformer-based and Self-supervised Models:

    • FICLRec consistently outperforms SASRec (a strong Transformer baseline), highlighting the benefit of incorporating frequency domain analysis and intent contrastive learning.
    • Compared to other self-supervised and intent-learning models such as DuoRec, ICLRec, ELCRec, ICSRec, and IDCLRec, FICLRec still shows clear gains. This suggests that disentangling high-frequency and low-frequency intents in the frequency domain provides an advantage over contrastive learning strategies that do not explicitly use frequency information. For instance, on the Beauty dataset, FICLRec improves HR@20 by 3.35% over IDCLRec, another intent-driven contrastive learning model (0.1390 vs 0.1345).
  4. Benefits of Frequency Domain Analysis: Models incorporating the frequency domain like FMLPRec and BSARec generally perform better than traditional RNN or CNN models (GRU4Rec, Caser), and sometimes even SASRec. FICLRec further builds upon this, demonstrating that its Frequency Redistribution Encoder and frequency contrastive learning effectively leverage this domain for superior results.

  5. Robustness to Different $K$ Values: The improvements hold across all cutoffs ($K$ = 5, 10, 20). Because both HR@K and the rank-sensitive NDCG@K improve, FICLRec not only raises the chance of a hit but also places relevant items higher in the recommendation list.

    In summary, the experimental results strongly validate FICLRec's effectiveness. Its architecture, which combines frequency domain analysis with intent contrastive learning, successfully addresses the challenges of capturing diverse user intentions and mitigating data sparsity, leading to state-of-the-art performance across various real-world datasets.
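As a rough illustration of how the numbers in Table 5 are read, the sketch below computes HR@K and NDCG@K for a single held-out target item and reproduces one "Imp." entry as the relative gain over the best baseline. It assumes the common leave-one-out protocol with exactly one relevant item per user; the function names and the example ranks are illustrative and are not taken from the paper's code.

```python
import math

def hr_at_k(rank, k):
    """Hit Ratio for one user: 1 if the target item (1-based rank) is in the top k."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    """NDCG for one user with a single relevant item: 1 / log2(rank + 1) inside the top k."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def relative_improvement(ours, best_baseline):
    """Relative gain in percent, as assumed here for the 'Imp.' column."""
    return (ours - best_baseline) / best_baseline * 100.0

# Hypothetical ranks of the held-out item for four users.
ranks = [1, 3, 12, 40]
hr20 = sum(hr_at_k(r, 20) for r in ranks) / len(ranks)
ndcg20 = sum(ndcg_at_k(r, 20) for r in ranks) / len(ranks)

# Sports HR@20 from Table 5: FICLRec 0.0885, best baseline 0.0839.
imp = relative_improvement(0.0885, 0.0839)
print(f"HR@20={hr20:.4f} NDCG@20={ndcg20:.4f} Imp={imp:.2f}%")
```

HR@K only records whether the target appears in the top K, while NDCG@K also discounts by position, which is why the two metrics are reported side by side.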

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation Study (RQ2)

The authors conducted an ablation study to evaluate the contribution of each key component of FICLRec. This helps to understand which parts are most critical for the model's overall performance. The study used HR@20 and NDCG@20 metrics on the five datasets.

The following are the results from Table 6 of the original paper, showing the ablation study of FICLRec:

| Model | Beauty HR@20 | Beauty NDCG@20 | Sports HR@20 | Sports NDCG@20 | Toys HR@20 | Toys NDCG@20 | Yelp HR@20 | Yelp NDCG@20 | LastFM HR@20 | LastFM NDCG@20 |
|---|---|---|---|---|---|---|---|---|---|---|
| (A) FICLRec | 0.1390 | 0.0703 | 0.0885 | 0.0425 | 0.1471 | 0.0774 | 0.0826 | 0.0342 | 0.1211 | 0.0594 |
| (B) w/o FR | 0.1315 | 0.0669 | 0.0774 | 0.0380 | 0.1345 | 0.0727 | 0.0732 | 0.0302 | 0.1083 | 0.0504 |
| (C) w/o HFAL | 0.1369 | 0.0694 | 0.0846 | 0.0413 | 0.1458 | 0.0772 | 0.0760 | 0.0316 | 0.1257 | 0.0546 |
| (D) w/o CCAL | 0.1367 | 0.0689 | 0.0836 | 0.0406 | 0.1444 | 0.0761 | 0.0761 | 0.0315 | 0.1138 | 0.0528 |
| (E) ICSRec | 0.1271 | 0.0648 | 0.0770 | 0.0374 | 0.1364 | 0.0728 | 0.0710 | 0.0293 | 0.1018 | 0.0501 |

Analysis of Ablation Study Results:

  • (A) FICLRec (Full Model): Represents the complete proposed model, serving as the benchmark for comparison.

  • (B) w/o FR (without Frequency Redistribution Structure): When the Frequency Redistribution Structure is removed, the performance drops significantly across all datasets. For example, on Beauty, HR@20 drops from 0.1390 to 0.1315, and on Sports, it drops from 0.0885 to 0.0774. This indicates that the Frequency Redistribution Structure is highly effective in capturing both low-frequency and high-frequency features, which are crucial for the model's overall performance. Its absence severely impacts the model's ability to differentiate and re-weight different intent components.

  • (C) w/o HFAL (without High-Frequency Alignment Loss): Removing the high-frequency alignment loss ($\mathcal{L}_{hfal}$) leads to a noticeable drop in performance (e.g., HR@20 on Beauty drops from 0.1390 to 0.1369). While the drop is less dramatic than removing FR, it still confirms the importance of explicitly aligning high-frequency components of positive samples. This term helps the model learn more distinct high-frequency intents and highlights that dynamic short-term interests need specific contrastive alignment. The paper notes that this specific alignment loss reduces the impact of noisy features in high-frequency components.

  • (D) w/o CCAL (without Cluster-Level Center Alignment Loss): Removing the cluster-level center alignment loss ($\mathcal{L}_{ccal}$) also results in a performance decrease (e.g., HR@20 on Beauty drops from 0.1390 to 0.1367). This loss ensures that intent representations are closely aligned with their intent prototypes (cluster centers), which represent low-frequency or general preferences. Its removal indicates that explicitly enforcing this alignment is important for capturing stable, long-term user interests and strengthening the model's overall intent learning capability.

  • (E) ICSRec: This is a strong baseline that is an intent-contrastive sequential recommendation model but does not utilize frequency domain information in the same manner. FICLRec (A) consistently outperforms ICSRec (E), demonstrating that the frequency-enhanced approach, especially the Frequency Redistribution Structure and specialized frequency contrastive learning objectives, provides significant advantages over other intent-contrastive methods.

    In summary, the ablation study confirms that all proposed components – the Frequency Redistribution Structure (FR), the High-Frequency Alignment Loss (HFAL), and the Cluster-Level Center Alignment Loss (CCAL) – are essential for FICLRec's superior performance. Each component contributes uniquely to either effectively capturing frequency-aware intentions or robustly learning from sparse data, collectively boosting the model's ability to provide accurate sequential recommendations.
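To make the relative size of each component's contribution concrete, the following sketch computes the percentage of HR@20 lost on Beauty when each component is removed, using the values from Table 6. The helper name `relative_drop` is illustrative only.

```python
# HR@20 on Beauty, copied from Table 6.
full = 0.1390
variants = {"w/o FR": 0.1315, "w/o HFAL": 0.1369, "w/o CCAL": 0.1367}

def relative_drop(full_score, ablated_score):
    """Percentage of the full model's score lost when a component is removed."""
    return (full_score - ablated_score) / full_score * 100.0

drops = {name: relative_drop(full, score) for name, score in variants.items()}
largest = max(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name}: -{drop:.2f}%")
print("largest drop:", largest)
```

On this dataset the Frequency Redistribution Structure accounts for a drop of about 5.4%, roughly three times the drop from removing either alignment loss.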

6.2.2. Hyperparameter Study (RQ3)

Impact of ICL loss weights $\alpha$ and $\beta$

The hyperparameters $\alpha$ and $\beta$ control the weights of the high-frequency intent contrastive loss ($\mathcal{L}_{hicl}$) and the low-frequency intent contrastive loss ($\mathcal{L}_{licl}$), respectively, in the overall intent contrastive loss ($\mathcal{L}_{icl}$). The authors investigated their impact on model performance.
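Read literally, this description corresponds to a weighted sum of the two terms. The form below is a sketch inferred from that description, not an equation quoted from the paper; any further terms of the full training objective are omitted.

```latex
\mathcal{L}_{icl} = \alpha \, \mathcal{L}_{hicl} + \beta \, \mathcal{L}_{licl}
```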

The following figure (Figure 4 from the original paper) shows the parameter sensitivity of the ICL loss weight:

Fig. 4. Parameter sensitivity of the ICL loss weight.

The figure shows that FICLRec is robust to the choice of $\alpha$ and $\beta$. The model generally performs well when $\alpha = 0.1$ and $\beta = 0.1$, indicating that even a small contribution from both high-frequency and low-frequency contrastive learning components is beneficial. The performance does not drastically change with variations in these weights, suggesting that the model's core architecture and the inherent value of frequency information are strong. The optimal values might vary slightly across datasets, but a configuration with relatively low weights for both contrastive losses already yields satisfactory performance, simplifying hyperparameter tuning.

Impact of the intent number $K$

The intent number $K$ refers to the number of clusters used in K-Means to derive the intent prototypes for low-frequency intent contrastive learning. This parameter directly influences the granularity of low-frequency intent representation.

The following figure (Figure 5 from the original paper) shows the parameter sensitivity of the intent number:

Fig. 5. Parameter sensitivity of the intent number.

The graph indicates that model performance is sensitive to the choice of $K$. Different datasets have different optimal $K$ values. For instance, $K = 1024$ is typically best for Beauty, $K = 512$ for Sports, $K = 2048$ for Toys, $K = 256$ for Yelp, and $K = 2048$ for LastFM. This suggests that the optimal number of low-frequency intent prototypes varies depending on the diversity and complexity of user behaviors within each dataset. A carefully chosen $K$ allows the model to strike a balance between capturing distinct long-term preferences and avoiding overfitting to too many fine-grained prototypes.
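The role of the intent number can be illustrated with a plain K-Means pass over sequence representations, where each resulting centroid plays the part of one intent prototype. This is a minimal pure-Python sketch of Lloyd's algorithm on toy 2-D vectors; the function names and data are hypothetical, and the paper's implementation may use a different clustering library and initialization.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster index of each point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[c] = [sum(m[d] for m in members) / len(members) for d in range(dim)]
    return centroids, assignment

# Toy "sequence representations": two well-separated groups in 2-D.
reps = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
prototypes, intent_ids = kmeans(reps, k=2)
print(prototypes, intent_ids)
```

A larger `k` yields finer-grained prototypes, which matches the trade-off described above between distinguishing preferences and fitting too many small clusters.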

Impact of the boundary frequency $p$

The boundary frequency $p$ is a critical hyperparameter within the Frequency Redistribution Structure. It determines the cutoff point for separating low-frequency and high-frequency components of the signal after Fourier Transform.

The following figure (Figure 6 from the original paper) shows the parameter sensitivity of the boundary frequency:

The results show that the model's performance is affected by the boundary frequency $p$. Similar to $K$, the optimal $p$ varies by dataset: $p = 13$ for Beauty, $p = 7$ for Sports, $p = 7$ for Toys, $p = 13$ for Yelp, and $p = 5$ for LastFM. This indicates that the ideal division between short-term (high-frequency) and long-term (low-frequency) intents is dataset-dependent. Fine-tuning $p$ is crucial for effectively disentangling and processing different temporal patterns in user behavior, thereby maximizing the benefits of the Frequency Redistribution Structure.
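The effect of a boundary frequency can be sketched on a one-dimensional signal: transform it, keep the bins below the boundary as the low-frequency part, and treat the rest as the high-frequency part. The code below uses a naive DFT from the standard library so that it is self-contained. It only illustrates the split; the learnable re-weighting that the paper's Frequency Redistribution Structure applies is not modeled, and the function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) for k in range(n)]

def idft(spectrum):
    """Inverse of dft(); returns the real part, assuming a real-valued signal."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

def split_by_boundary(x, p):
    """Split x into low- and high-frequency parts using boundary index p.

    Bin k counts as "low" when its frequency index min(k, n - k) is below p.
    """
    n = len(x)
    spectrum = dft(x)
    low_spec = [c if min(k, n - k) < p else 0 for k, c in enumerate(spectrum)]
    high_spec = [c - l for c, l in zip(spectrum, low_spec)]
    return idft(low_spec), idft(high_spec)

# Toy signal: a slow trend (1 cycle) plus a fast oscillation (3 cycles) over 8 steps.
n = 8
signal = [math.cos(2 * math.pi * t / n) + 0.5 * math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
low, high = split_by_boundary(signal, p=2)
print([round(v, 3) for v in low])
print([round(v, 3) for v in high])
```

With `p=2` the slow trend lands entirely in `low` and the fast oscillation in `high`; raising `p` moves more of the spectrum into the low-frequency part, which is the knob the sensitivity study varies.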

6.3. Effectiveness on noisy or sparse data (RQ4)

6.3.1. Robustness to noisy data

To evaluate FICLRec's ability to handle noisy interactions, the authors introduced negative samples (simulated noise) at different rates: 5%, 10%, 15%, and 20% into the training data.

The following figure (Figure 7 from the original paper) illustrates the performance under noisy data:

The figure displays the NDCG@20 (left axis, bar graph) and HR@20 (right axis, line graph) performance of FICLRec and baselines as the noise rate increases. While all models experience a decline in performance with increasing noise, FICLRec consistently outperforms the baselines. Notably, even when the noise rate reaches 15%, FICLRec maintains a higher performance level compared to other models. This demonstrates FICLRec's superior robustness to noisy data, which is attributed to its frequency redistribution structure and contrastive learning objectives. By discerning high-frequency noise from meaningful short-term signals and leveraging robust intent prototypes, FICLRec can learn more accurate representations despite corrupted inputs.
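One plausible way to simulate such noise is to insert randomly chosen items the user never interacted with into each training sequence. The sketch below does this at a given rate. The exact injection procedure is not described in the text above, so the insertion strategy, the function name `inject_noise`, and the toy data are all assumptions.

```python
import random

def inject_noise(sequence, item_pool, rate, seed=0):
    """Return a copy of `sequence` with about rate * len(sequence) random items inserted.

    Inserted items are drawn from items absent from the sequence, which is an
    assumption about what "negative samples" means here.
    """
    rng = random.Random(seed)
    seen = set(sequence)
    candidates = [item for item in item_pool if item not in seen]
    n_noise = round(rate * len(sequence))
    noisy = list(sequence)
    for _ in range(n_noise):
        position = rng.randint(0, len(noisy))  # any slot, including both ends
        noisy.insert(position, rng.choice(candidates))
    return noisy

clean = list(range(1, 21))   # a hypothetical 20-item interaction sequence
pool = range(1, 101)         # hypothetical item ids 1..100
noisy_10 = inject_noise(clean, pool, rate=0.10)
noisy_20 = inject_noise(clean, pool, rate=0.20)
print(len(noisy_10), len(noisy_20))
```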

6.3.2. Robustness to sparse data

To assess FICLRec's performance under data sparsity, the authors simulated different levels of sparsity by randomly sampling 75%, 50%, and 25% of the original training data for model training, while keeping the test data unchanged.

The following figure (Figure 8 from the original paper) illustrates the performance under sparse data:

The graph shows NDCG@20 (left axis, bar graph) and HR@20 (right axis, line graph) for different training data percentages. As the percentage of training data decreases (indicating higher sparsity), all models experience a drop in performance. However, FICLRec consistently exhibits more stable performance and maintains a larger margin over the baselines, especially under severe sparsity (e.g., 25% training data). This indicates that the frequency contrastive learning approach effectively mitigates the negative impact of data sparsity. By forcing similar samples closer and dissimilar ones apart, even with limited interactions, the model can learn more discriminative and robust item and intent embeddings. This finding directly supports one of the paper's core claims: that FICLRec reduces the negative impact of data sparsity on model performance.
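The sparsity setup can be sketched as random subsampling of the training set at each ratio while leaving the test set untouched. Whether the paper samples whole sequences or individual interactions is not stated above, so sampling whole sequences is an assumption here, as are the function name and toy data.

```python
import random

def subsample_training(train_sequences, fraction, seed=0):
    """Keep a random `fraction` of the training sequences, preserving their order."""
    rng = random.Random(seed)
    n_keep = int(len(train_sequences) * fraction)
    kept = sorted(rng.sample(range(len(train_sequences)), n_keep))
    return [train_sequences[i] for i in kept]

# Hypothetical training set of 100 short sequences.
train = [[u, u + 1, u + 2] for u in range(100)]
subsets = {f: subsample_training(train, f) for f in (0.75, 0.50, 0.25)}
for fraction, subset in subsets.items():
    print(fraction, len(subset))
```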

6.4. Analysis of item embedding quality (RQ5)

The paper evaluates item embedding quality both qualitatively and quantitatively to understand how FICLRec creates richer and more discriminative item representations.

6.4.1. Qualitative analysis

The authors used t-SNE to visualize the item embeddings learned by FICLRec and compared them to those from ICSRec, specifically on the Beauty and Yelp datasets.

The following figure (Figure 9 from the original paper) shows visualizations of item embeddings:

The figure displays t-SNE visualizations of item embeddings. The text indicates that FICLRec produces a more uniform item distribution and clusters long-tail items (items with few interactions) more effectively than ICSRec. This suggests that FICLRec's frequency-enhanced intent contrastive learning helps to enrich the representations of less frequently interacted items. By leveraging both high-frequency (dynamic) and low-frequency (stable, often captured by prototypes) signals, FICLRec can learn more robust embeddings for long-tail items, which are challenging for many recommendation systems. The improved embedding quality for these items can lead to better personalized recommendations, as long-tail items often represent niche interests.

The following figure (Figure 10 from the original paper) shows further visualization of item embeddings:

Figure 10 further reinforces the observation that FICLRec effectively improves the quality of item representation and strengthens the recommendation of long-tail items. This implies that the model's ability to capture distinct high-frequency and low-frequency intents, combined with its contrastive learning objectives, leads to a more organized and meaningful feature space where similar items (including long-tail ones) are grouped together.

6.4.2. Quantitative analysis

The paper states that FICLRec performs well in regions where user interactions are frequent and diverse. While the paper mentions quantitative analysis, it does not provide a specific table or figure for this aspect in the provided text. However, the overall performance improvements in Table 5 across various metrics indirectly serve as a quantitative validation of better item embedding quality, as higher quality embeddings generally lead to better recommendation performance. The implicit argument is that improved HR and NDCG values, particularly on diverse datasets, are a direct consequence of the model's ability to learn more effective item and intent embeddings.

6.5. Case study (RQ6)

To further illustrate how FICLRec focuses on different types of features, the authors conducted a case study by visualizing the average attention weights on the Yelp dataset and comparing it with ICSRec. This addresses RQ6: "Does FICLRec focus more effectively on high-frequency features than ICSRec?"

The following figure (Figure 11 from the original paper) shows visualizations of average attention weights (dataset: Yelp):

Fig. 11. Visualizations of average attention weights (dataset:Yelp).

The figure displays a heatmap of average attention weights from the self-attention layer. The colors indicate the strength of attention. Comparing FICLRec with ICSRec, the visualization shows that FICLRec exhibits stronger attention weights on the more recent items in the sequence (i.e., towards the right end of the sequence). This suggests that FICLRec, due to its Frequency Redistribution Structure and high-frequency intent contrastive learning, effectively learns to focus on local high-frequency dependencies in the current sequence. This enhanced focus on recent interactions (which often drive short-term intents) allows FICLRec to better capture dynamic user behaviors and make more accurate next-item predictions. ICSRec, while also using attention, might not have the same specialized mechanism to prioritize these high-frequency signals.

6.6. Long-tail and short sequence analysis

The paper further analyzes FICLRec's performance in specific challenging scenarios: long-tail items and short sequences.

The following are the results from Table 7 of the original paper, showing user distributions across different interaction length intervals:

Dataset [ < 10](prop.) [10, 20](prop.) [20, 30](prop.) [30, 40](prop.) [ ≥ 40](prop.)
Beauty 17353(77.60%) 3152(14.10%) 1065(4.76%) 367(1.64%) 426(1.90%)
Sports 28478(79.99%) 4555(12.80%) 1480(4.16%) 395(1.11%) 690(1.94%)
Toys 16345(84.20%) 2320(11.95%) 476(2.45%) 130(0.67%) 141(0.73%)
Yelp 26550(87.25%) 2948(9.69%) 692(2.27%) 135(0.44%) 106(0.35%)
LastFM 1090(100%) 0(0%) 0(0%) 0(0%) 0(0%)

Table 7 shows the distribution of users across different sequence length categories. For most datasets (Beauty, Sports, Toys, Yelp), a very large proportion of users (77-87%) have short sequences (length < 10). LastFM is an outlier where all users have sequences of length < 10 (though its average actions per user is higher in Table 4, implying many users have multiple short sequences, or this table's definition of sequence length cutoff is different). This highlights the prevalence of short sequences in real-world data, a challenging scenario for sequential recommendation.
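The grouping behind Table 7 can be sketched as a simple bucketing of sequence lengths. The table writes the intervals as [10, 20], [20, 30], and so on, which leaves the boundaries ambiguous; the sketch assumes half-open intervals such as [10, 20). The example lengths are hypothetical, while the Beauty counts are copied from Table 7.

```python
from bisect import bisect_right
from collections import Counter

EDGES = [10, 20, 30, 40]
LABELS = ["<10", "[10, 20)", "[20, 30)", "[30, 40)", ">=40"]

def bucket(length):
    """Map a sequence length to its interval label (half-open intervals assumed)."""
    return LABELS[bisect_right(EDGES, length)]

# Hypothetical sequence lengths for eight users.
lengths = [3, 9, 10, 19, 20, 35, 40, 120]
histogram = Counter(bucket(n) for n in lengths)

# Beauty user counts per interval, copied from Table 7.
beauty_counts = [17353, 3152, 1065, 367, 426]
beauty_short_share = beauty_counts[0] / sum(beauty_counts) * 100.0
print(dict(histogram), f"{beauty_short_share:.2f}%")
```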

The following are the results from Table 8 of the original paper, showing item popularity distributions:

Dataset [ < 10](prop.) [10, 20](prop.) [20, 30](prop.) [30, 40](prop.) [ ≥ 40](prop.)
Beauty 17240(77.10%) 3875(17.33%) 729(3.25%) 254(1.14%) 265(1.18%)
Sports 27740(77.92%) 6450(18.12%) 946(2.66%) 271(0.76%) 191(0.54%)
Toys 15224(78.43%) 3214(16.55%) 552(2.84%) 217(1.12%) 205(1.06%)
Yelp 20977(68.93%) 6587(21.65%) 1558(5.12%) 631(2.07%) 678(2.23%)
LastFM 302(27.71%) 263(24.13%) 111(10.17%) 78(7.16%) 336(30.83%)

Table 8 shows the distribution of items based on their popularity (number of interactions). For Beauty, Sports, Toys, and Yelp, a large majority of items (roughly 69-78%) fall into the long-tail category ($< 10$ interactions), indicating the prevalence of long-tail items, which are difficult to recommend due to insufficient data. LastFM again stands out with a more balanced distribution, including a significant portion of very popular items ($\geq 40$ interactions, 30.83%).

The following figure (Figure 12 from the original paper) shows a subgroup sequence analysis on the Sports dataset:

Fig. 12. Subgroup sequences analysis on Sports dataset.

Figure 12 presents HR@20 performance for FICLRec and baseline methods on the Sports dataset, broken down by categories of user sequence length (derived from Table 7) and item popularity (derived from Table 8).

Analysis of Figure 12:

  • Performance on Short Sequences: FICLRec demonstrates strong performance in short sequence categories (e.g., [< 10] and [10, 20] lengths). This is crucial because short sequences represent the majority of user behaviors in many datasets (as seen in Table 7). The ability of FICLRec to capture high-frequency intents (short-term signals) is particularly beneficial in these scenarios where long-term context is limited.

  • Performance on Long-Tail Items: The figure likely shows FICLRec performing well for long-tail items (e.g., in the [< 10] and [10, 20] popularity categories). This aligns with the qualitative analysis of item embedding quality (Figures 9 and 10), suggesting that FICLRec's frequency-enhanced contrastive learning effectively learns robust representations for items with limited interactions. This is a significant advantage, as recommending long-tail items improves content diversity and caters to niche user interests.

  • Performance on Popular Items: The paper explicitly states that "for popular items (i.e., [30, 40] and [≥ 40]), FICLRec did not outperform other models." This is an important limitation to note. It implies that while FICLRec excels at capturing dynamic and niche preferences, its specialized frequency-enhanced approach might not offer the same comparative advantage when recommending very popular items, where simpler models might already achieve high performance due to abundant interaction data.

    Overall, the long-tail and short sequence analysis highlights FICLRec's strength in addressing challenging real-world scenarios, particularly where data is sparse and user intentions are dynamic. While it performs exceptionally well on the majority of users and items, there might be room for improvement or a different optimization focus for highly popular items.

6.7. Training efficiency

The authors also evaluated the training efficiency of FICLRec by comparing its Flops (floating point operations), number of epochs, time per epoch, total training time, and number of parameters with several baselines.

The following are the results from Table 9 of the original paper, showing the training efficiency:

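| Methods | Flops | Sports epoch | Sports s/epoch | Sports total time (min) | Sports # params | Toys epoch | Toys s/epoch | Toys total time (min) | Toys # params |
|---|---|---|---|---|---|---|---|---|---|
| SASRec | 1.272G | 88 | 27 | 22.0 | 1,278,208 | 246 | 9 | 36.9 | 866,496 |
| DuoRec | 1.272G | 103 | 70 | 46.4 | 1,278,208 | 151 | 25 | 62.9 | 866,496 |
| FEARec | 1.258G | 114 | 323 | 613.7 | 1,278,208 | 152 | 223 | 564.9 | 866,496 |
| ICLRec | 1.258G | 300 | 35 | 175.0 | 1,278,272 | 300 | 17 | 85.0 | 866,560 |
| ICSRec | 1.258G | 143 | 164 | 107.3 | 1,278,272 | 174 | 89 | 258.1 | 866,560 |
| FICLRec (ours) | 1.468G | 123 | 188 | 116.9 | 1,295,424 | 157 | 93 | 243.4 | 883,712 |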

Analysis of Training Efficiency:

  • Flops (Floating Point Operations): FICLRec has higher Flops (1.468G) than the baselines (SASRec and DuoRec at 1.272G; FEARec, ICLRec, and ICSRec at 1.258G). This is expected, as FICLRec integrates additional FFT/IFFT computations in its Frequency Redistribution Structure, which adds an $\mathcal{O}(N \log N)$ term to the complexity.
  • Number of Parameters: FICLRec also has a slightly larger number of parameters (e.g., 1,295,424 for Sports) compared to baselines like SASRec (1,278,208). This is due to the additional learnable weight matrices ($W_1$, $W_2$) in the Frequency Redistribution Structure.
  • Time/Epoch: FICLRec's time per epoch (188 s for Sports, 93 s for Toys) is higher than that of simpler models such as SASRec (27 s, 9 s) and ICLRec (35 s, 17 s), a direct consequence of the extra frequency-domain operations. It is nevertheless lower than FEARec's (323 s, 223 s) and close to ICSRec's (164 s, 89 s).
  • Total Time: Despite the higher time per epoch, FICLRec converges in fewer epochs (123 on Sports, 157 on Toys) than ICLRec (300 on both) and ICSRec (143 and 174), which keeps its total training time competitive. On Sports, its total time (116.9 minutes) is below that of ICLRec (175.0 minutes) and FEARec (613.7 minutes). On Toys, its total time (243.4 minutes) is below that of FEARec (564.9 minutes) and ICSRec (258.1 minutes).
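The relationship between the columns of Table 9 can be checked with simple arithmetic: total time in minutes should be close to epochs × seconds per epoch / 60. The sketch below runs that check on the Toys columns, with values copied from the table; the helper name is illustrative.

```python
# (epochs, seconds per epoch, reported total time in minutes) for Toys, from Table 9.
toys = {
    "SASRec": (246, 9, 36.9),
    "DuoRec": (151, 25, 62.9),
    "FEARec": (152, 223, 564.9),
    "ICLRec": (300, 17, 85.0),
    "ICSRec": (174, 89, 258.1),
    "FICLRec": (157, 93, 243.4),
}

def total_minutes(epochs, seconds_per_epoch):
    """Estimated total training time in minutes."""
    return epochs * seconds_per_epoch / 60.0

estimates = {m: total_minutes(e, s) for m, (e, s, _) in toys.items()}
for method, (_, _, reported) in toys.items():
    print(f"{method}: estimated {estimates[method]:.1f} min, reported {reported} min")
```

All six Toys rows agree with this estimate to within rounding. The same check on the Sports columns as transcribed above does not reproduce every reported total (for SASRec, 88 × 27 s is about 39.6 minutes against a reported 22.0), so some Sports figures may have been garbled in extraction and are best verified against the original paper.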

Conclusion on Efficiency: The analysis indicates that FICLRec introduces a reasonable increase in computational complexity (both Flops and parameters) and time per epoch compared to some baselines. However, this trade-off is often justified by its significantly superior recommendation performance. The fact that it can achieve strong results in a competitive total training time, sometimes even faster than other complex contrastive learning or frequency domain models, suggests that the added complexity is worthwhile. The model's efficiency remains within an acceptable range, especially when considering the substantial performance advantages it offers in capturing nuanced user intentions and handling data sparsity.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduced FICLRec, a novel Frequency Enhanced Intent Contrastive Learning Recommendation model designed to address key challenges in sequential recommendation: the insufficient capture of high-frequency intents and the pervasive issue of data sparsity. FICLRec innovatively leverages frequency information from user interaction sequences, using a Frequency Redistribution Encoder to disentangle low-frequency (long-term, stable) and high-frequency (short-term, dynamic) user intentions. It then employs a multi-task learning framework that includes both high-frequency intent contrastive learning (with high-frequency alignment loss) and low-frequency intent contrastive learning (with cluster-level center alignment loss) to learn robust, frequency-aware intent representations. Extensive experiments on five real-world datasets demonstrated that FICLRec consistently outperforms a wide array of state-of-the-art baselines, achieving significant improvements in HR@K and NDCG@K metrics. The ablation studies confirmed the effectiveness of each proposed component, and hyperparameter studies revealed the model's robustness and sensitivity to specific configurations. Furthermore, FICLRec proved to be highly robust to noisy and sparse data conditions, and its item embedding quality was shown to be superior, particularly for long-tail items and short sequences. While incurring a slight increase in computational complexity, the substantial performance gains validate the model's efficacy.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose directions for future research:

  • Performance on Popular Items: While FICLRec excels at recommending long-tail items and handling short sequences, it "did not always achieve the desired performance" for very popular items. This suggests that its specialized focus on disentangling intentions and addressing sparsity might not provide as much incremental benefit for items with abundant interaction data.
  • Computational Overhead of K-Means: The current approach for low-frequency intent contrastive learning involves K-Means clustering to generate intent prototypes in each training epoch (Algorithm 1, line 5). This can be computationally intensive, especially for large datasets or a high number of clusters ($K$).
  • Dynamic Nature of Intent Prototypes: The intent prototypes are re-generated via K-Means on all user representations in each epoch. While this allows for adaptation, a more dynamic or incremental clustering approach could potentially be more efficient and responsive to subtle shifts in long-term preferences.
  • Optimizing Computational Complexity: Future work could focus on designing more efficient frequency domain processing techniques or contrastive learning strategies to reduce the training overhead without compromising performance.
  • Adaptive Frequency Band Separation: The boundary frequency $p$ is a static hyperparameter. Future research could explore adaptive mechanisms to dynamically determine the optimal frequency split for different users or contexts.

7.3. Personal Insights & Critique

FICLRec presents a compelling and well-executed approach to sequential recommendation, particularly for its innovative integration of frequency domain analysis with intent contrastive learning.

Insights:

  • Power of Frequency Domain: The paper strongly highlights the untapped potential of frequency domain analysis in recommender systems. Decomposing user behavior into low-frequency (stable, long-term) and high-frequency (dynamic, short-term) components provides a natural and intuitive way to model the duality of user intentions. This concept could be broadly applicable to other sequential modeling tasks beyond recommendation, such as user behavior prediction in other domains (e.g., health, finance).
  • Targeted Contrastive Learning: The idea of applying different contrastive learning objectives tailored to the specific characteristics of high-frequency and low-frequency intents is highly effective. Rather than a generic contrastive loss, FICLRec shows that specialized alignment (e.g., HFAL for short-term, CCAL for long-term prototypes) significantly enhances the model's ability to discriminate between and learn different aspects of user preference.
  • Robustness to Real-World Challenges: The demonstrated robustness to noisy and sparse data is a critical practical advantage. Real-world recommender systems are inherently messy, and a model that can perform well under such conditions is highly valuable. The frequency domain seems to offer a level of signal robustness that helps contrastive learning shine even with limited or corrupted inputs.
  • Addressing the "Cold Start" and Long-Tail Problem: By improving item embedding quality for long-tail items and performing well on short sequences, FICLRec makes a significant step towards alleviating the cold start problem for new items and users, which is a persistent challenge in recommendation.

Critique / Areas for Improvement:

  • K-Means Computational Cost: The K-Means step in every epoch, while effective, can be a computational bottleneck, especially with growing datasets and embedding dimensions. Exploring more online or streaming clustering methods, or using a fixed set of prototypes updated via a momentum encoder (similar to MoCo in computer vision) could be more scalable.

  • Generalizability of Boundary Frequency $p$: The boundary frequency $p$ is dataset-dependent. Manually tuning this parameter for every new dataset can be cumbersome. An adaptive mechanism that learns this split (e.g., using an attention-like mechanism over frequency bins or a trainable filter) could enhance the model's automation and generalizability.

  • Interpretation of Frequency Components: While the paper intuitively links low-frequency to long-term and high-frequency to short-term, a deeper theoretical or empirical analysis of what specific types of "patterns" or "intents" are actually captured by different frequency bands would be beneficial for further understanding and model design. For instance, can mid-range frequencies represent recurring seasonal interests?

  • Trade-off with Popular Items: The observation that FICLRec does not outperform baselines for popular items is interesting. It suggests a potential trade-off where optimizing for long-tail and short-term often leads to less emphasis on the very dominant popular items. Future work could investigate multi-objective optimization or hybrid approaches that maintain long-tail performance while still excelling at popular item recommendations.

  • Dynamic Sequence Lengths: The maximum sequence length $N$ is fixed. While common, actual user sequences vary greatly. Exploring Transformer-variants that handle variable sequence lengths more naturally or padding strategies that minimize information loss could be beneficial.

    In conclusion, FICLRec offers a powerful and principled way to integrate temporal dynamics with user intentions through the lens of the frequency domain. Its contributions are significant for advancing sequential recommendation and open up exciting avenues for future research in understanding and modeling complex user behaviors.
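As an illustration of the momentum-style alternative to per-epoch K-Means raised in the critique, the sketch below nudges each prototype towards the mean of the batch representations currently assigned to it. This is not part of FICLRec; the function name, the momentum value, and the data are hypothetical.

```python
def ema_update(prototypes, batch_reps, assignments, momentum=0.99):
    """Move each prototype towards the mean of the batch representations assigned to it."""
    updated = [list(p) for p in prototypes]
    for c in range(len(prototypes)):
        members = [r for r, a in zip(batch_reps, assignments) if a == c]
        if not members:
            continue  # prototypes with no members in this batch stay unchanged
        dim = len(prototypes[c])
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        updated[c] = [momentum * p + (1.0 - momentum) * m for p, m in zip(prototypes[c], mean)]
    return updated

# Two hypothetical prototypes and one batch whose members all map to prototype 0.
prototypes = [[0.0, 0.0], [1.0, 1.0]]
batch = [[1.0, 1.0], [3.0, 3.0]]
new_prototypes = ema_update(prototypes, batch, assignments=[0, 0], momentum=0.9)
print(new_prototypes)
```

Such an update costs one pass over the batch instead of a full clustering run, at the price of prototypes that lag behind the current encoder.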
