Paper status: completed

Masked Diffusion for Generative Recommendation

Published: 11/28/2025

TL;DR Summary

This paper introduces Masked Diffusion for Generative Recommendation (MADRec), which models user interaction sequences using discrete masking noise, outperforming traditional autoregressive models in efficiency and performance, particularly in data-constrained settings and on coarse-grained recall.

Abstract

Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, capitalization on semantic information provided through language model embeddings, and inference and storage efficiency. Existing GR with SIDs works frame the probability of a sequence of SIDs corresponding to a user's interaction history using autoregressive modeling. While this has led to impressive next item prediction performances in certain settings, these autoregressive GR with SIDs models suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data and bias towards learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose to instead model and learn the probability of a user's sequence of SIDs using masked diffusion. Masked diffusion employs discrete masking noise to facilitate learning the sequence distribution, and models the probability of masked tokens as conditionally independent given the unmasked tokens, allowing for parallel decoding of the masked tokens. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling. This performance gap is especially pronounced in data-constrained settings and in terms of coarse-grained recall, consistent with our intuitions. Moreover, our approach allows the flexibility of predicting multiple SIDs in parallel during inference while maintaining superior performance to autoregressive modeling.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is the application of Masked Diffusion for Generative Recommendation systems, particularly within the framework of Semantic IDs (SIDs).

1.2. Authors

  • Kulin Shah (UT Austin)
  • Bhuvesh Kumar (Snap Inc.)
  • Neil Shah (Snap Inc.)
  • Liam Collins (Snap Inc.)

1.3. Journal/Conference

The paper was posted on 2025-11-28 (UTC). Given the provided arXiv link, it is currently a preprint: publicly available but not yet peer reviewed for a journal or conference proceedings. arXiv is a highly influential platform for rapid dissemination of research in fields like machine learning, often serving as a precursor to formal publication.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces Masked Diffusion for Generative Recommendation (MADRec), a novel approach that leverages discrete masking noise and masked diffusion techniques, inspired by recent advancements in Natural Language Processing (NLP), to model and learn the probability distribution of user interaction sequences represented by Semantic IDs (SIDs). While existing Generative Recommendation (GR) with SIDs methods predominantly use autoregressive (AR) modeling, which suffers from expensive sequential inference, inefficient data usage, and a bias towards short-context relationships, MADRec models the probability of masked tokens as conditionally independent given unmasked tokens, enabling parallel decoding. Through extensive experiments, MADRec demonstrates consistent outperformance over autoregressive modeling, especially in data-constrained settings and for coarse-grained recall. Furthermore, MADRec offers the flexibility of parallel prediction of multiple SIDs during inference while maintaining superior performance.

2. Executive Summary

2.1. Background & Motivation

2.1.1. Core Problem

The core problem the paper addresses is the limitations of current Generative Recommendation (GR) with Semantic IDs (SIDs) models, which primarily rely on autoregressive (AR) modeling. While AR models have shown promise in next item prediction, they inherently suffer from several drawbacks:

  1. Expensive Inference: AR models decode tokens sequentially (one after another), leading to high latency and computational cost, which is a significant bottleneck for real-world deployment.
  2. Inefficient Use of Training Data: They typically use next-token prediction as their training objective, which can be less efficient in extracting signal from limited training data.
  3. Bias Towards Short-Context Relationships: By predicting the next token based on all preceding ones, AR models tend to over-index on local dependencies and might struggle to capture global relationships among tokens within a sequence.

2.1.2. Importance of the Problem

GR with SIDs is a rapidly growing and promising paradigm in Recommendation Systems (RecSys). It offers significant advantages over traditional approaches by:

  • Performance Gains: Delivering improved recommendation accuracy.

  • Capitalization on Semantic Information: Effectively leveraging rich item features (e.g., text, visual) by converting them into language model embeddings and then SIDs.

  • Inference and Storage Efficiency: Using SIDs reduces the vocabulary size compared to sparse item IDs, leading to more efficient models.

    Given these benefits, overcoming the limitations of AR modeling in this promising framework is crucial for its wider adoption and performance scaling, especially in data-constrained environments and for applications requiring faster inference.

2.1.3. Paper's Entry Point / Innovative Idea

The paper's innovative idea is to adapt masked diffusion techniques, which have recently achieved breakthroughs in Natural Language Processing (NLP), to model the probability distribution of SID sequences in Generative Recommendation. Instead of sequential autoregressive decoding, masked diffusion uses discrete masking noise during training and allows parallel decoding of masked tokens during inference. This paradigm shift directly addresses the identified limitations of AR models.

2.2. Main Contributions / Findings

2.2.1. Primary Contributions

The paper's primary contributions are:

  1. Introduction of MADRec: Proposing MADRec (Masked Diffusion over SIDs for Generative Recommendation), the first application of masked diffusion to GR with SIDs. The design prioritizes simplicity and generality to establish foundational empirical observations.
  2. Superior Performance: Demonstrating that MADRec consistently outperforms standard AR modeling (like TIGER) and other GR baselines across various sequential recommendation datasets. This performance advantage is particularly pronounced in data-constrained settings and for coarse-grained recall.
  3. Enhanced Inference Efficiency and Flexibility: Showing that MADRec can decode multiple SIDs in parallel, allowing a flexible trade-off between inference performance and efficiency. It can outperform AR models even with fewer function evaluations (NFEs).
  4. Data Efficiency: Providing evidence that masked diffusion makes better use of limited training data, as the performance gap between MADRec and TIGER widens when dataset sizes are reduced.
  5. Extensibility: Illustrating MADRec's compatibility with auxiliary methods (e.g., dense retrieval), suggesting it can integrate with existing advancements in AR-based SID modeling.
  6. Reproducibility: Making the code publicly available to facilitate further research and validation.

2.2.2. Key Conclusions / Findings

The key findings of the paper are:

  • Masked diffusion is a more effective paradigm than autoregressive modeling for GR with SIDs, offering significant performance gains.
  • The benefits of masked diffusion are particularly strong in scenarios with limited training data and when evaluating coarse-grained recall, indicating its ability to learn more global relationships among items.
  • Masked diffusion inherently supports parallel decoding, leading to more efficient inference without sacrificing accuracy, and often achieving better results with fewer computational steps.
  • The semantic information encoded in SIDs is crucial for MADRec's performance, as demonstrated by experiments with random SIDs or direct item IDs.
  • The dynamic masking probability (masking at various rates from 0 to 1) used in masked diffusion is more effective than fixed masking ratios (like in BERT4Rec).
  • MADRec is a flexible framework that can be extended with other techniques, such as dense retrieval, to further enhance performance.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand Masked Diffusion for Generative Recommendation (MADRec), a beginner should grasp several fundamental concepts:

3.1.1. Recommendation Systems (RecSys)

Recommendation Systems are software tools that provide suggestions for items (e.g., movies, products, music) that are most relevant to a particular user. Their goal is to predict what a user might like based on their past interactions and behavior, or the behavior of similar users.

  • Sequential Recommendation: A sub-field of RecSys where the order of user interactions is crucial. Models predict the next item a user will interact with based on their historical sequence of interactions.

3.1.2. Generative Models

Generative Models are a class of machine learning models that are designed to learn the underlying distribution of a dataset and then generate new samples that resemble the training data. In Generative Recommendation (GR), this means learning the probability distribution of user interaction sequences and then generating (or predicting) the most likely next item in a sequence.

3.1.3. Semantic IDs (SIDs)

Traditionally, items in a RecSys are represented by unique, sparse integer IDs (e.g., Item ID 1, Item ID 2). These IDs carry no inherent meaning. Semantic IDs (SIDs) are a newer approach where items are represented not by a single sparse ID, but by a tuple (a short sequence) of tokens, where each token is derived from the item's rich semantic features (e.g., text descriptions, images, categories).

  • How they are created:
    1. Semantic Embedding: First, a pre-trained language model (for text features) or vision model (for visual features) is used to generate a dense numerical vector (an embedding) that captures the semantic meaning of each item.
    2. Clustering / Quantization: These dense embeddings are then quantized or clustered into a discrete set of tokens. Techniques like Residual K-Means (RK-means) or Residual Quantized Variational Autoencoder (RQ-VAE) are often used. This process reduces the dimensionality and converts continuous embeddings into discrete tokens.
  • Benefits: SIDs allow the recommendation model to operate on a smaller, semantically meaningful vocabulary, leveraging the power of language models and improving efficiency.
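
To make the residual quantization step described above concrete, here is a minimal, illustrative sketch of residual k-means SID assignment. It is not the paper's exact pipeline: it assumes precomputed item embeddings, and the function name `assign_sids` is hypothetical. The paper itself uses four codebook levels of size 256 over Flan-T5-XXL item embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_sids(item_embeddings: np.ndarray, num_levels: int = 4, codebook_size: int = 256) -> np.ndarray:
    """Quantize each item embedding into a tuple of `num_levels` discrete SID tokens."""
    residual = item_embeddings.astype(np.float64).copy()
    sids = np.zeros((item_embeddings.shape[0], num_levels), dtype=np.int64)
    for level in range(num_levels):
        km = KMeans(n_clusters=codebook_size, n_init=10).fit(residual)
        sids[:, level] = km.labels_                              # token chosen at this level
        residual -= km.cluster_centers_[km.labels_]              # next level quantizes the leftover residual
    return sids  # shape: (num_items, num_levels)
```

Each row of the returned array is the m-tuple of SIDs for one item; a deduplication token can then be appended to distinguish items that end up with identical tuples.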

3.1.4. Autoregressive (AR) Models

Autoregressive models are a type of generative model that predicts the next element in a sequence based on all preceding elements. In the context of language models or SID sequences:

  • Mechanism: To model a sequence $x_1, x_2, \ldots, x_N$, an AR model factorizes the probability of the entire sequence into a product of conditional probabilities: $p(x_1, \ldots, x_N) = p(x_1) \, p(x_2 \mid x_1) \, p(x_3 \mid x_1, x_2) \cdots p(x_N \mid x_1, \ldots, x_{N-1})$.
  • Inference: This factorization implies that to generate a new sequence, tokens must be generated one by one in a strict sequential (left-to-right) order. This is known as sequential decoding.
  • Examples: GPT models are prominent AR models. In RecSys, models like SASRec and TIGER use an AR approach.

3.1.5. Transformer Architecture

The Transformer is a neural network architecture introduced in Attention Is All You Need [VSP+17]. It revolutionized NLP and is now widely used in many fields.

  • Self-Attention: The core mechanism is self-attention, which allows the model to weigh the importance of different parts of the input sequence when processing each element. This captures long-range dependencies effectively. The self-attention mechanism computes a weighted sum of value vectors, where the weight assigned to each value is determined by a compatibility function of the query with the corresponding key. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where:
    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
    • $d_k$ is the dimension of the key vectors, used for scaling to prevent very large dot products from pushing the softmax into regions with extremely small gradients.
    • $\mathrm{softmax}$ is applied row-wise to ensure the weights sum to 1.
  • Encoder-Decoder vs. Encoder-only/Decoder-only:
    • Encoder-Decoder: Used for sequence-to-sequence tasks (e.g., machine translation), where an encoder processes the input and a decoder generates the output.
    • Encoder-only: Used for tasks like sequence classification or masked language modeling (e.g., BERT), where the goal is to understand the input.
    • Decoder-only: Used for autoregressive generation (e.g., GPT), where the model generates new tokens one by one.
  • MADRec uses an encoder-only transformer as its backbone.
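
As a concrete reference for the attention formula above, here is a minimal single-head sketch (no masking and no multi-head projections; it assumes Q, K, and V have already been computed from the input embeddings).

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # query-key compatibility, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)             # row-wise normalization so each row sums to 1
    return weights @ V                              # weighted sum of value vectors
```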

3.1.6. Diffusion Models (Discrete Masking Noise)

Diffusion models are a class of generative models that have achieved state-of-the-art results, especially in image generation. They work by gradually adding noise to data (forward process) and then learning to reverse this process (reverse process) to generate new data from noise.

  • Continuous vs. Discrete Diffusion:
    • Continuous Diffusion: Typically uses Gaussian noise and operates in a continuous latent space (e.g., for images).
    • Discrete Diffusion (Masked Diffusion): Relevant for discrete data like text tokens or SIDs. Instead of adding Gaussian noise, the forward process involves masking tokens (replacing them with a special [M] token) with a certain probability. The reverse process then involves learning to predict the original tokens for these masked positions.
  • Key Advantage: Unlike AR models, masked diffusion can predict multiple masked tokens in parallel, because it assumes masked tokens are conditionally independent given the unmasked ones. This allows for faster inference.

3.1.7. Cross-Entropy Loss

Cross-entropy loss is a common loss function used in classification tasks. It measures the performance of a classification model whose output is a probability value between 0 and 1. For a multi-class classification problem (like predicting which SID token should be at a masked position), it quantifies the difference between the true probability distribution (one-hot encoding of the true SID) and the predicted probability distribution (output of the model). The goal during training is to minimize this loss.

3.2. Previous Works

The paper contextualizes MADRec by discussing existing Generative Recommendation (GR) paradigms and their limitations.

3.2.1. Traditional Item ID-based Sequential Recommendation

  • SASRec [KM18]: A foundational sequential recommender that uses a self-attention mechanism (a Transformer encoder) to capture sequential patterns in user interactions, directly operating on item IDs. It predicts the next item autoregressively.
  • BERT4Rec [SLW+19]: Inspired by BERT from NLP, BERT4Rec uses an encoder-only Transformer and a masked language modeling objective. It randomly masks a fixed proportion of item IDs in a sequence and learns to predict them. Unlike autoregressive models, it can process the sequence bidirectionally during training. The key difference from MADRec is that BERT4Rec operates on item IDs and uses a fixed masking ratio, whereas MADRec uses SIDs and a dynamic, continuous masking probability.

3.2.2. Continuous Diffusion-based Recommendation

These models apply continuous diffusion processes, typically with Gaussian noise, to item embeddings or user interaction vectors.

  • DreamRec [YWW+23]: Applies continuous diffusion on item IDs (or their embeddings) for sequential recommendation.
  • CaDiRec [CWH+24]: Another model using continuous diffusion for recommendation, often operating on semantic embeddings.
  • Limitation: The paper notes that continuous diffusion models applied to discrete domains (like item IDs or SIDs) have historically struggled to match the performance of AR models and often require a very large number of inference steps (e.g., 1000 steps for DreamRec and CaDiRec).

3.2.3. Autoregressive Generative Recommendation with Semantic IDs (AR GR with SIDs)

This is the most direct predecessor to MADRec and the primary baseline.

  • TIGER [RMS+23]: This is the seminal work for GR with SIDs. It represents items as SIDs (derived from semantic embeddings via quantization) and uses an autoregressive Transformer to model the probability distribution of these SID sequences. It predicts the next SID in a token-by-token manner.
  • LIGER [YPH+24]: An extension of TIGER that integrates dense retrieval with SID generation. It aims to combine the benefits of explicit SID generation with the strong performance of dense retrieval models.
  • RPG [HLS+25]: Proposes using product-quantized SIDs to enable parallel decoding of SID tokens for a single item, but still requires decoding items sequentially.

3.2.4. Masked Diffusion in NLP

The masked diffusion framework that MADRec draws inspiration from has seen recent breakthroughs in NLP.

  • Key References: [LME24, SAS+25, SHW+24, ONX+24]
  • Innovation: These works demonstrated that using discrete masking noise (instead of Gaussian noise) in diffusion models can make them competitive with AR language models on standard benchmarks, providing the foundation for MADRec's approach to discrete SID sequences.

3.3. Technological Evolution

The field of Recommendation Systems has evolved from simple collaborative filtering and matrix factorization to more sophisticated models:

  1. Early Sequential Models: GRU4Rec [JL17] and SASRec [KM18] introduced neural networks and self-attention to capture sequential user behavior, primarily using sparse item IDs.
  2. BERT-inspired Models: BERT4Rec [SLW+19] adapted masked language modeling for sequential recommendation, allowing for bidirectional context during training.
  3. Rise of Generative Models: The success of Large Language Models (LLMs) led to Generative Recommendation (GR) paradigms, where models generate recommendations directly.
  4. Semantic Information Integration (LLM-as-Enhancer): The GR with SIDs framework (e.g., TIGER [RMS+23]) marked a significant step by leveraging LLMs to generate semantic embeddings which are then quantized into SIDs. This allowed RecSys to tap into rich item metadata and overcome the limitations of sparse item IDs. Most GR with SIDs models continued to use autoregressive (AR) mechanisms.
  5. Diffusion Models for RecSys: Inspired by image generation, diffusion models began to be explored for RecSys, initially with continuous diffusion over item embeddings or user interaction vectors.
  6. MADRec's Position: MADRec represents the next evolutionary step by applying the more recent and effective masked diffusion paradigm (developed for discrete data in NLP) to the discrete SID sequences in GR with SIDs. This aims to address the inherent inference efficiency and data utilization challenges of AR-based SID models.

3.4. Differentiation Analysis

MADRec differentiates itself from previous Generative Recommendation methods, especially those using SIDs, in several key ways:

  • Autoregressive (AR) Models (e.g., TIGER):

    • Core Difference: MADRec uses masked diffusion for parallel generation, while AR models generate tokens sequentially.
    • Inference: AR models have fixed sequential inference, whereas MADRec can perform parallel decoding, leading to significant inference speedups and flexibility (trading off NFEs for performance).
    • Training Data Utilization: MADRec uses a dynamic masking probability (any masking rate from 0 to 1), effectively generating exponentially more training samples per sequence, which leads to better data efficiency than AR models' next-token prediction.
    • Relationship Learning: AR models tend to focus on local, short-context relationships. MADRec's masked diffusion (trained to reconstruct masked tokens throughout the sequence) is hypothesized to capture global relationships better, which is supported by its superior coarse-grained recall.
  • Continuous Diffusion Models (e.g., DreamRec, CaDiRec):

    • Core Difference: MADRec employs discrete masking noise, while these models use continuous Gaussian noise often applied to item embeddings.
    • Effectiveness for Discrete Data: Prior research and MADRec's results indicate that discrete masking noise is more effective for modeling discrete token sequences (like SIDs) compared to Gaussian noise. MADRec achieves better performance with significantly fewer inference steps (5 vs. 1000).
  • BERT4Rec:

    • Core Difference: MADRec operates on Semantic IDs (SIDs), leveraging rich semantic information, whereas BERT4Rec works directly on sparse item IDs.
    • Masking Strategy: MADRec uses a dynamic masking probability (uniformly sampling masking rates from [0,1]), which is mathematically grounded in discrete diffusion theory. BERT4Rec uses a fixed masking ratio (e.g., 15%). The paper shows MADRec's masking strategy is superior even when applied to item IDs.
  • RPG [HLS+25]:

    • Core Difference: RPG allows parallel decoding of SIDs within a single item by using product quantization, but still generates items sequentially. MADRec enables parallel decoding of SIDs for any SIDs within the sequence, including across items if desired, and demonstrates performance gains even with standard residual-quantized SIDs.

      In essence, MADRec innovates by bringing the strengths of masked diffusion from NLP to GR with SIDs, offering a more efficient, flexible, and robust alternative to existing autoregressive and continuous diffusion approaches, especially for discrete sequential data.

4. Methodology

The proposed method, MADRec (Masked Diffusion over SIDs for Generative Recommendation), frames the problem of modeling SID sequences as a discrete diffusion process with masking noise. This section details the training and inference procedures of MADRec, followed by an extension incorporating dense retrieval.

4.1. Principles

The core idea behind MADRec is to model the probability distribution of a user's SID sequence using masked diffusion. This approach consists of two main processes:

  1. Forward Process: Gradually corrupts the original SID sequence by randomly masking (replacing with a special [M] token) individual SIDs with a probability determined by a noise level $t$.

  2. Reverse Process (Denoising): Learns to reverse this corruption by predicting the original values of the masked SIDs based on the partially corrupted sequence. This denoising step is typically performed by a neural network, such as an encoder-only transformer.

    Unlike autoregressive models that generate tokens one by one, masked diffusion allows for parallel decoding of multiple masked SIDs, as it models the probability of masked tokens as conditionally independent given the unmasked tokens. This parallel nature is key to its inference efficiency and ability to capture global relationships.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. SID Sequence Representation

Similar to other GR with SIDs methods, MADRec first transforms a user's interaction history of items $(i_1, \ldots, i_{n_u})$ into a sequence of Semantic IDs (SIDs). Each item $i_j$ is represented by an $m$-tuple of SIDs $(s_j^1, \ldots, s_j^m)$. The full user interaction sequence is then flattened into a single sequence of SIDs: $ S^u := (s_1^1, \ldots, s_1^m, s_2^1, \ldots, s_2^m, \ldots, s_{n_u}^1, \ldots, s_{n_u}^m) $ For simplicity during modeling, the paper assumes $n_u = n$ for all users $u \in \mathcal{M}$, meaning all user sequences are padded or truncated to a fixed length $n$. The total length of the flattened SID sequence is $mn$.

4.2.2. MADRec Training

Training follows the discrete diffusion framework with masking noise. Let [M] denote the special mask token, let $S_t^u$ denote the corrupted SID sequence at noise level $t \in [0, 1]$, and let $S_t^u(i)$ be the $i^{\mathrm{th}}$ element of $S_t^u$.

4.2.2.1. Forward Process

The forward process introduces noise by independently masking each SID token in the original (clean) sequence $S_0^u = S^u$. Each token in $S_0^u$ is replaced with the [M] token with probability $t$.

The transition probability from the clean sequence $S_0^u$ to the noisy sequence $S_t^u$ factorizes over positions: $ p(S_t^u \mid S_0^u) = \prod_{i=1}^{mn} p(S_t^u(i) \mid S_0^u(i)) $ where $mn$ is the total length of the flattened SID sequence. The conditional probability for each individual token is a categorical distribution: $ p(S_t^u(i) \mid S_0^u(i)) = \mathrm{Cat}\bigl(t \cdot e_{[\mathrm{M}]} + (1-t) \cdot e_{S_0^u(i)}\bigr) $ Here:

  • $\mathrm{Cat}(\cdot)$ denotes a categorical distribution over $mc+1$ possible tokens (where $m$ is the number of SIDs per item, $c$ is the codebook size of each SID level, and the $+1$ accounts for the [M] token).
  • $e_{[\mathrm{M}]}$ is the one-hot vector corresponding to the mask token.
  • $e_{S_0^u(i)}$ is the one-hot vector corresponding to the original SID token $S_0^u(i)$. In other words, with probability $t$ the original token $S_0^u(i)$ is replaced by [M], and with probability $1-t$ it remains $S_0^u(i)$.
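
A minimal sketch of this forward (masking) process, assuming SID sequences are stored as 1-D LongTensors and that a reserved id `MASK_ID` plays the role of the [M] token (both names are illustrative):

```python
import torch

MASK_ID = 0  # assumed reserved id for the [M] token

def forward_mask(clean_seq: torch.Tensor, t: float) -> torch.Tensor:
    """Independently replace each token of `clean_seq` (shape (mn,)) with [M] with probability t."""
    mask = torch.rand_like(clean_seq, dtype=torch.float) < t
    noisy_seq = clean_seq.clone()
    noisy_seq[mask] = MASK_ID
    return noisy_seq
```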

4.2.2.2. Reverse Process

The reverse process aims to denoise the corrupted sequence. To guide it, consider the posterior probability $p(S_\ell^u \mid S_t^u, S_0^u)$ for a smaller noise scale $\ell < t$. Due to the coordinate-wise independence of the forward process, this posterior also decomposes coordinate-wise: $ p(S_\ell^u \mid S_t^u, S_0^u) = \prod_{i=1}^{mn} p(S_\ell^u(i) \mid S_t^u, S_0^u) $ The posterior for each coordinate $i$ is piecewise: $ p(S_\ell^u(i) \mid S_t^u, S_0^u) = \left\{ \begin{array}{ll} \mathrm{Cat}(e_{S_t^u(i)}) & \mathrm{if~} S_t^u(i) \ne [\mathrm{M}] \\ \mathrm{Cat}\bigl(\frac{\ell}{t} e_{[\mathrm{M}]} + (1 - \frac{\ell}{t}) e_{S_0^u(i)}\bigr) & \mathrm{if~} S_t^u(i) = [\mathrm{M}] \end{array} \right. $ This formula indicates that:

  • If the token at position $i$ in $S_t^u$ is not masked, it remains unchanged in $S_\ell^u$.

  • If the token at position $i$ in $S_t^u$ is masked, then in $S_\ell^u$ it stays [M] with probability $\ell/t$ or is restored to its original clean value $S_0^u(i)$ with probability $1 - \ell/t$.

    The model approximates $e_{S_0^u(i)}$ (the one-hot vector of the true original token) with $f_\theta(\cdot \mid S_t^u)$, a neural network (typically a transformer encoder) with parameters $\theta$. This network takes the corrupted sequence $S_t^u$ as input and outputs a predicted probability distribution over the $mc$ possible SID tokens for each masked position.

4.2.2.3. Training Objective

The model parameters $\theta$ are trained by maximizing an evidence lower bound (ELBO) on the likelihood, which simplifies to a cross-entropy loss applied only to the masked tokens. The training objective is: $ L = \underset{t \sim \mathrm{Unif}[0,1],\, S_0^u \sim p_{SID},\, S_t^u \sim p(S_t^u \mid S_0^u)}{\mathbb{E}} \left[ - \frac{1}{t} \sum_{i=1}^{mn} \mathbb{I}[S_t^u(i) = [\mathrm{M}]] \log p_\theta(S_0^u(i) \mid S_t^u) \right] $ Here:

  • $\mathbb{E}$ denotes the expectation.
  • $t \sim \mathrm{Unif}[0,1]$: the noise level $t$ is sampled uniformly from $[0,1]$ at each training step.
  • $S_0^u \sim p_{SID}$: the original SID sequence $S_0^u$ is sampled from the true distribution of SID sequences.
  • $S_t^u \sim p(S_t^u \mid S_0^u)$: the corrupted sequence $S_t^u$ is generated from $S_0^u$ via the forward process at noise level $t$.
  • $\mathbb{I}[S_t^u(i) = [\mathrm{M}]]$ is an indicator that equals 1 if the token at position $i$ in $S_t^u$ is the mask token and 0 otherwise, so the loss is computed only at positions that were actually masked.
  • $p_\theta(S_0^u(i) \mid S_t^u)$ is the probability the model $f_\theta$ assigns to the original SID token $S_0^u(i)$ given the corrupted sequence $S_t^u$. The factor $1/t$ weights the loss, giving more importance to denoising from higher noise levels.

This objective trains MADRec to predict the original values of the masked SIDs, learning the conditional distribution of each masked SID given the rest of the sequence.
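
A minimal sketch of a training step implementing the objective above, assuming `model` is any encoder (e.g., the encoder-only transformer) that maps a batch of token ids of shape (batch, mn) to per-position logits of shape (batch, mn, vocab); the helper name `madrec_loss` is illustrative:

```python
import torch
import torch.nn.functional as F

def madrec_loss(model, clean_seq: torch.Tensor, mask_id: int) -> torch.Tensor:
    """clean_seq: (batch, mn) LongTensor of SID token ids."""
    t = torch.rand(clean_seq.size(0), 1, device=clean_seq.device).clamp_min(1e-3)  # t ~ Unif[0,1]
    mask = torch.rand_like(clean_seq, dtype=torch.float) < t        # mask each token with probability t
    noisy_seq = clean_seq.masked_fill(mask, mask_id)
    logits = model(noisy_seq)                                        # (batch, mn, vocab)
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), clean_seq, reduction="none")         # per-token cross entropy, (batch, mn)
    weighted = (token_loss * mask.float() / t).sum(dim=1)            # keep only masked positions, weight by 1/t
    return weighted.mean()
```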

4.2.3. MADRec Inference

The goal of inference is to predict the next item $i_n$ given a user's interaction history $(i_1, \ldots, i_{n-1})$, which corresponds to the SID sequence $Q = (s_1^1, \ldots, s_{n-1}^m)$.

4.2.3.1. Iterative Unmasking

To generate the SID tuple for the $n^{\mathrm{th}}$ item, MADRec begins by masking all SIDs of this target item. Let the initial partially masked sequence be $\tilde{S}_1 = (Q, A_1)$, where $A_1 = ([\mathrm{M}], \ldots, [\mathrm{M}])$ is an $m$-tuple of mask tokens. The model iteratively unmasks SIDs by transitioning from a noisy state $\tilde{S}_t = (Q, A_t)$ to a less noisy state $\tilde{S}_r = (Q, A_r)$ with $r < t$. In each step, the model predicts values for all masked positions using $p_\theta(S_0^u(i) \mid S_t^u)$ and then selectively unmasks a subset of these positions by retaining their predicted values. This process continues until all SIDs of the $n^{\mathrm{th}}$ item are unmasked.
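
A sketch of this iterative unmasking loop using the uncertainty-based greedy order described later in the paper (unmask the masked position with the largest gap between its top-1 and top-2 probabilities); `model`, `mask_id`, and the function name are illustrative, and beam search is omitted for brevity:

```python
import torch

def predict_next_item(model, context_seq: torch.Tensor, m: int, mask_id: int) -> torch.Tensor:
    """context_seq: 1-D LongTensor of the history SIDs Q; returns the m predicted SIDs of the next item."""
    seq = torch.cat([context_seq, torch.full((m,), mask_id, dtype=torch.long)])
    target = torch.arange(len(context_seq), len(seq))            # positions of the next item's SIDs
    for _ in range(m):                                           # one function evaluation per unmasked SID
        probs = model(seq.unsqueeze(0)).softmax(-1).squeeze(0)   # (len(seq), vocab)
        masked = target[seq[target] == mask_id]
        top2 = probs[masked].topk(2, dim=-1).values
        confidence = top2[:, 0] - top2[:, 1]                     # gap between best and runner-up probability
        pick = masked[confidence.argmax()]                       # most confident masked position
        seq[pick] = probs[pick].argmax()                         # keep its predicted value
    return seq[target]
```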

4.2.3.2. Beam Search in MADRec

Unlike AR models with a fixed left-to-right generation order, MADRec can generate SIDs in any order. Beam search is used to find the most probable items. If the SIDs of the next item are unmasked in a specific order $(k_1, \ldots, k_m)$, the probability of the generated item is approximated as: $ p_\theta(s_n^1, \ldots, s_n^m \mid Q) = \prod_{i=1}^m p_\theta\bigl(s_n^{k_i} \mid Q, s_n^{k_1}, \ldots, s_n^{k_{i-1}}\bigr) $ Here:

  • $s_n^j$ is the $j^{\mathrm{th}}$ SID of the $n^{\mathrm{th}}$ item.
  • $Q$ is the SID sequence of the previous items.
  • $k_i$ denotes the $i^{\mathrm{th}}$ SID to be unmasked in the chosen order. The factorization is sequential, but MADRec is free to choose the order; the paper explores random, uncertainty-based greedy, and left-to-right orders.

4.2.3.3. Beam Search in MADRec with Multi-Token Prediction

A key advantage of MADRec is its ability to predict multiple SIDs simultaneously, leading to fewer sequential function evaluations (NFEs). When $T < m$ model evaluations are used, and in the $j^{\mathrm{th}}$ step $\alpha_j$ SIDs at positions $(k_1^j, \ldots, k_{\alpha_j}^j)$ are unmasked, the probability of the item is approximated as: $ p_\theta(s_n^1, \ldots, s_n^m \mid Q) = \prod_{j=1}^T \prod_{i=1}^{\alpha_j} p_\theta\bigl(s_n^{k_i^j} \mid Q, s_n^{k_1^1}, \ldots, s_n^{k_{\alpha_{j-1}}^{j-1}}\bigr) $ Here:

  • $T$ is the total number of function evaluations (inference steps).
  • $\alpha_j$ is the number of SIDs unmasked at step $j$.
  • $k_i^j$ is the $i^{\mathrm{th}}$ SID unmasked during step $j$. The approximation assumes that the SIDs unmasked within a single step are conditionally independent given the previously unmasked SIDs and the context $Q$. Despite this assumption, multi-token prediction empirically outperforms AR models.
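
A sketch of the corresponding parallel variant with $T < m$ steps, unmasking the most confident remaining positions in batches (again with illustrative names and no beam search):

```python
import math
import torch

def predict_next_item_parallel(model, context_seq: torch.Tensor, m: int, mask_id: int, T: int) -> torch.Tensor:
    """Like the greedy loop above, but unmasks several positions per step so only T NFEs are needed."""
    seq = torch.cat([context_seq, torch.full((m,), mask_id, dtype=torch.long)])
    target = torch.arange(len(context_seq), len(seq))
    for step in range(T):
        probs = model(seq.unsqueeze(0)).softmax(-1).squeeze(0)
        masked = target[seq[target] == mask_id]
        alpha = math.ceil(len(masked) / (T - step))          # number of SIDs to unmask at this step
        top2 = probs[masked].topk(2, dim=-1).values
        order = (top2[:, 0] - top2[:, 1]).argsort(descending=True)
        for pos in masked[order[:alpha]]:
            seq[pos] = probs[pos].argmax()                   # unmask alpha positions in parallel
    return seq[target]
```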

4.2.4. Extending MADRec with Dense Retrieval

To enhance MADRec by combining SID generation with dense retrieval, the framework is modified as follows:

4.2.4.1. Input Representation

Instead of using only SID embeddings, the input to the encoder-only transformer combines SID embeddings with the text representation of each item. For an item $i$ with SID sequence $(\sigma_i^1, \ldots, \sigma_i^m)$, the input embedding for its $j^{\mathrm{th}}$ SID is defined as: $ H_{\sigma_i^j} = h_{\sigma_i^j} + A_j h_i^{\mathrm{text}} $ Where:

  • $H_{\sigma_i^j}$ is the combined input embedding for the $j^{\mathrm{th}}$ SID.
  • $h_{\sigma_i^j}$ is the embedding of the SID token $\sigma_i^j$.
  • $h_i^{\mathrm{text}}$ is the embedding of item $i$'s text features, extracted from a language model (e.g., Flan-T5-XXL).
  • $A_j$ is a learnable linear transformation that projects the text embedding $h_i^{\mathrm{text}}$ to the same dimension as the SID embeddings.
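
A minimal sketch of this input construction, assuming one learnable projection per SID position within an item (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class SIDTextInput(nn.Module):
    """Combine a SID token embedding with a projected item text embedding: H = h_sid + A_j * h_text."""
    def __init__(self, vocab_size: int, d_model: int, d_text: int, m: int):
        super().__init__()
        self.sid_emb = nn.Embedding(vocab_size, d_model)
        self.text_proj = nn.ModuleList([nn.Linear(d_text, d_model) for _ in range(m)])  # one A_j per position

    def forward(self, sid_token: torch.Tensor, j: int, text_emb: torch.Tensor) -> torch.Tensor:
        return self.sid_emb(sid_token) + self.text_proj[j](text_emb)
```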

4.2.4.2. Masking Strategy

To encourage learning at the item level (since dense embeddings represent an item as a whole), a modified masking strategy is introduced. With a fixed probability $\beta$, all SIDs of an item are masked jointly using the same noise scale $t$. With probability $1-\beta$, the original MADRec masking strategy (independent masking of each SID) is used.

4.2.4.3. Prediction Mechanism and Loss Function

After a fixed number of layers $\eta$ in the network $f_\theta$, the hidden states are used to form a predicted dense embedding. To predict the dense embedding for an item, all of its SIDs are masked and the full sequence is passed through $f_\theta$. Let $\bar{h}^j$ be the embedding of the $j^{\mathrm{th}}$ SID after $\eta$ layers; these are concatenated to form the predicted dense embedding $\tilde{H} = \{\bar{h}^1, \ldots, \bar{h}^m\}$, which is then projected to the dimension of the text item embedding by a linear layer or small multi-layer perceptron (MLP) $g_\theta$. The dense retrieval objective aligns this predicted embedding with the ground-truth text embedding $h_i^{\mathrm{text}}$ of the corresponding item via a contrastive loss (akin to negative sampling or in-batch negatives): $ \mathcal{L}_{\mathrm{dense}} = - \log \frac{\exp \bigl( g_\theta(\tilde{H})^\top h_i^{\mathrm{text}} \bigr)}{\sum_{j \in \mathcal{T}} \exp \bigl( g_\theta(\tilde{H})^\top h_j^{\mathrm{text}} \bigr)} $ Where:

  • $\mathcal{L}_{\mathrm{dense}}$ is the dense retrieval loss.
  • $g_\theta(\tilde{H})$ is the projected predicted dense embedding.
  • $h_i^{\mathrm{text}}$ is the ground-truth text embedding of item $i$.
  • $\mathcal{T}$ is the set of candidate items; the denominator sums the similarity of the predicted embedding to all of their text embeddings (in practice, a batch or sampled set), which serve as negative samples. This loss is applied only to items whose SIDs were fully masked. The joint objective allows MADRec to perform both SID generation and dense retrieval.
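
An illustrative in-batch version of this contrastive loss, where row i of the text-embedding matrix is the positive for row i of the predictions and all other rows act as negatives (function name is hypothetical):

```python
import torch
import torch.nn.functional as F

def dense_retrieval_loss(pred_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """pred_emb: (batch, d) projected predictions g(H~); text_emb: (batch, d) item text embeddings."""
    logits = pred_emb @ text_emb.t()                                   # (batch, batch) similarity matrix
    targets = torch.arange(pred_emb.size(0), device=pred_emb.device)   # diagonal entries are the positives
    return F.cross_entropy(logits, targets)                            # softmax over in-batch candidates
```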

4.2.5. Overview of Training and Inference

The overall process is summarized in Figure 1 from the original paper.

The following figure (Figure 1 from the original paper) shows the overview of training and inference of MADRec:

Figure 1: Overview of training and inference of MADRec. During training, MADRec randomly masks each SID in the SID sequence with a probability $t \sim \mathrm{Unif}[0, 1]$ and feeds the masked sequence into an encoder-only transformer. The model is then optimized to reconstruct the original values of the masked SIDs using a cross-entropy loss applied at the masked positions (see Eq. (3)). During inference, MADRec begins with all SIDs of the last item replaced by masks. At each inference step, the partially masked sequence is passed through the network to predict values for all masked positions. The model then selectively unmasks a subset of these positions by retaining their predicted values while keeping the remaining positions masked. This iterative process continues until all SIDs are unmasked.

Figure 1 Description:

  • Training: An SID sequence $S_0^u$ is chosen. A noise level $t$ is sampled from $\mathrm{Unif}[0,1]$. Each SID in $S_0^u$ is masked with probability $t$ to create $S_t^u$. This masked sequence is fed into an encoder-only transformer ($f_\theta$). The model is trained to predict the original values of the masked SIDs using a cross-entropy loss.
  • Inference: To predict the SIDs of the next item, they are initially all masked. The partially masked sequence is passed through $f_\theta$, which outputs predictions for all masked positions. A subset of these positions is unmasked (their predicted values are kept), while the others remain masked. This iterative unmasking continues until all SIDs of the target item are predicted.

5. Experimental Setup

This section details the datasets, evaluation metrics, and baselines used to evaluate MADRec.

5.1. Datasets

The experiments were conducted on four public benchmark datasets:

  • Amazon Review dataset: Three categories were used: Beauty, Sports, and Toys.
    • Preprocessing: Standard 5-core filtering was applied to remove users and items with fewer than 5 interactions.
    • Item Text Features: title, category, description, and price were used as features for extracting semantic embeddings, following [RMS+23].
  • MovieLens-1M (ML-1M) dataset [HKA16]: A well-known dataset for movie ratings.
    • Preprocessing: 5-core filtering was also applied.

    • Item Text Features: title, year, and genres were used as features.

      The following are the results from [Table 6] of the original paper:

      Dataset Beauty Toys Sports ML-1M
      # Users 22,363 19,412 35,598 6,040
      # Items 12,101 11,924 18,357 3,416
      # Interactions 198,502 167,597 296,337 999,611
      # Avg. Length 8.88 8.63 8.32 165.50
      Sparsity 99.93% 99.93% 99.95% 95.16%

Table 6 Description: The table provides statistics for the processed datasets, including the number of users, items, total interactions, average sequence length per user, and dataset sparsity (percentage of zero entries in the user-item interaction matrix). ML-1M stands out with a much longer average sequence length.

5.2. Evaluation Metrics

The performance of all methods is evaluated using standard sequential recommendation metrics: Recall@K and NDCG@K, with $K \in \{5, 10\}$. The leave-one-out evaluation protocol is used: the last item of each user's sequence is held out for testing, the second-to-last for validation, and the rest are used for training.

5.2.1. Recall@K

Recall@K measures the proportion of relevant items that are successfully retrieved among the top $K$ recommendations. In the context of sequential recommendation, a relevant item is typically the next item the user actually interacted with.

Conceptual Definition: For a given user, if the actual next item is present in the top $K$ items recommended by the system, it counts as a "hit." Recall@K averages these hits over all users, normalized by the number of relevant items per user. It indicates how well the system can find any of the relevant items.

Mathematical Formula: $ \mathrm{Recall}@K = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|}{|\mathrm{Rel}_u|} $ Symbol Explanation:

  • $U$: the set of all users.
  • $|U|$: the total number of users.
  • $\mathrm{Rel}_u$: the set of actual relevant items for user $u$ (in leave-one-out, this is just the single held-out next item $i_{n_u+1}$).
  • $\mathrm{Rec}_u(K)$: the set of top $K$ items recommended by the system for user $u$.
  • $|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|$: the number of relevant items found within the top $K$ recommendations.
  • $|\mathrm{Rel}_u|$: the total number of relevant items for user $u$ (which is 1 in leave-one-out).
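
A small leave-one-out implementation for reference: with a single relevant item per user, Recall@K reduces to the fraction of users whose held-out item appears in their top-K list (function and argument names are illustrative).

```python
def recall_at_k(top_k_items: list[list[int]], true_items: list[int]) -> float:
    """top_k_items[u] is user u's top-K recommended item ids; true_items[u] is the held-out item."""
    hits = sum(1 for recs, true in zip(top_k_items, true_items) if true in recs)
    return hits / len(true_items)
```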

5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)

NDCG@K is a widely used metric that evaluates the ranking quality of recommendations. It considers not only whether relevant items are present in the top $K$ list but also their positions, giving higher scores to relevant items that appear at higher (earlier) ranks.

Conceptual Definition: NDCG@K is derived from Discounted Cumulative Gain (DCG). DCG sums the relevance scores of items in a ranked list, with a logarithmic penalty for items appearing lower in the list. NDCG normalizes DCG by dividing it by the Ideal DCG (IDCG), which is the DCG of a perfectly ordered list where all relevant items are ranked as high as possible. This normalization ensures that NDCG values are between 0 and 1.

Mathematical Formula: $ \mathrm{DCG}@K = \sum_{j=1}^{K} \frac{\mathrm{rel}_j}{\log_2(j+1)} $ $ \mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K} $ Symbol Explanation:

  • $\mathrm{rel}_j$: the relevance score of the item at position $j$ in the recommended list. In sequential recommendation, this is usually 1 if the item at position $j$ is the true next item and 0 otherwise.
  • $j$: the rank of the item in the recommended list (from 1 to $K$).
  • $\log_2(j+1)$: the logarithmic discount factor; items at lower ranks contribute less because the discount grows with rank.
  • $\mathrm{DCG}@K$: Discounted Cumulative Gain up to rank $K$.
  • $\mathrm{IDCG}@K$: Ideal Discounted Cumulative Gain up to rank $K$, i.e., the DCG of a perfect ranking with all relevant items placed at the top in decreasing order of relevance. For leave-one-out, placing the single relevant item at rank 1 gives $\mathrm{IDCG}@K = \mathrm{rel}_1 / \log_2(1+1) = 1/1 = 1$.
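
The corresponding leave-one-out NDCG@K: since IDCG@K = 1, the metric reduces to the average discounted value of the single hit (names are illustrative).

```python
import math

def ndcg_at_k(top_k_items: list[list[int]], true_items: list[int]) -> float:
    total = 0.0
    for recs, true in zip(top_k_items, true_items):
        if true in recs:
            rank = recs.index(true) + 1          # 1-indexed rank of the hit
            total += 1.0 / math.log2(rank + 1)   # discounted gain of the single relevant item
    return total / len(true_items)
```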

5.3. Baselines

MADRec's performance is compared against six representative baselines:

5.3.1. Item ID-based Methods

These models operate directly on sparse item IDs.

  • SASRec [KM18]: Self-attentive sequential recommendation. A Transformer encoder-based model that captures sequential dependencies among item IDs.
  • BERT4Rec [SLW+19]: A BERT-style encoder model that uses masked language modeling to predict masked item IDs in the interaction history.

5.3.2. Diffusion-based Methods

These apply continuous diffusion processes to item IDs or their embeddings.

  • DreamRec [YWW+23]: A continuous diffusion model designed for sequential recommendation.
  • CaDiRec [CWH+24]: Another continuous diffusion method that enhances sequential recommendation with context-aware diffusion-based contrastive learning. Both DreamRec and CaDiRec used 1,000 inference steps in their original evaluations.

5.3.3. Generative Recommendation with SIDs

These models use Semantic IDs but primarily autoregressive modeling.

  • TIGER [RMS+23]: The foundational autoregressive (AR) model for Generative Recommendation with SIDs. It models the probability distribution over the SID sequence. This is the main AR-based SID baseline.
  • LIGER [YPH+24]: An extension of TIGER that integrates dense retrieval capabilities with autoregressive SID generation.

5.4. Implementation Details

  • Model Architecture: An 8-layer encoder-only transformer model.
    • Embedding Dimension: 128 dimensions.
    • Positional Embedding: Uses Rotary Position Embedding.
    • Attention Heads: 8 attention heads.
    • MLP Hidden Layer Size: 3072.
    • Total Parameters: 7 million.
  • SID Assignment:
    • Item text embeddings: 4096-dimensional embeddings extracted from Flan-T5-XXL [CHL+24] based on item metadata.
    • Clustering: Residual K-Means with four layers, each having a codebook size of 256.
    • Deduplication Token: Appended to distinguish items with identical SID tuples, as in [RMS+23].
  • Inference:
    • Number of Inference Steps (NFEs): By default, equal to the number of SIDs per item (e.g., 4 SIDs means 4 NFEs).
    • Inference Strategy: Greedy inference, where SIDs are chosen to be unmasked based on their prediction uncertainty. Uncertainty is measured as the difference between the probabilities of the first and second most likely assignments at each masked SID position.
  • Training:
    • Optimizer: AdamW.
    • Learning Rate: 0.005.
    • Weight Decay: 0.001.
    • Batch Size: 8192.
    • Early Stopping: Based on validation Recall@10.
  • Hardware: Experiments run on nodes with four 16-GB NVIDIA V100 GPUs or four 40-GB NVIDIA A100 GPUs.
  • Baselines Implementation: MADRec's implementation is based on the GRID codebase [JCN+25]. TIGER's implementation is from GRID. LIGER's implementation uses the public repository by its authors. Both TIGER and LIGER use 4 SIDs per item (including the deduplication token).
  • Other Baselines: SASRec results are taken from [RMS+23]. BERT4Rec, DreamRec, and CaDiRec results are from [CWH+24].

6. Results & Analysis

This section presents the experimental findings, comparing MADRec against baselines and analyzing its performance characteristics.

6.1. Core Results Analysis

6.1.1. Q1. Overall Performance

The initial experiments compare MADRec's performance with other Generative Recommendation (GR) methods across four datasets.

The following are the results from [Table 1] of the original paper:

Method Beauty Sports Toys ML-1M
R@5 N@5 R@5 N@5 R@5 N@5 R@5 N@5
SASRec 3.87 2.49 2.33 1.54 4.63 3.06 9.38 5.31
BERT4Rec 3.60 2.16 2.17 1.43 4.61 3.11 13.63 8.89
DreamRec 4.40 2.74 2.48 1.51 4.97 3.16 13.04 8.58
CaDiRec 4.95 3.14 2.76 1.83 5.22 3.56 15.04 10.01
TIGER 4.29 2.88 2.45 1.64 4.42 2.91 12.83 8.85
LIGER 4.62 3.17 2.61 1.67 4.66 3.01 13.73 9.12
MADRec (+ Improv. %) 5.38 (+8.7%) 3.51 (+10.7%) 3.02 (+9.4%) 1.91 (+4.4%) 5.48 (+5.0%) 3.75 (+5.3%) 16.72 (+11.2%) 11.12 (+11.1%)

Table 1 Analysis:

  • Comparison with AR modeling (TIGER): MADRec consistently outperforms TIGER (the autoregressive SID model) across all datasets and metrics. The average NDCG@5 increase over TIGER is 21.9%. This strongly supports the hypothesis that masked diffusion is more effective than AR models in sequential recommendation settings.

  • Comparison with continuous diffusion (DreamRec, CaDiRec): MADRec also achieves higher performance than DreamRec and CaDiRec. Notably, MADRec achieves this with only 5 inference steps, whereas DreamRec and CaDiRec require 1000 steps, highlighting the efficiency and effectiveness of discrete masking noise for discrete data.

  • Overall: MADRec achieves the best Recall@5 and NDCG@5 scores across all datasets, with LIGER (an extension of TIGER with dense retrieval) often being the second-best, demonstrating the state-of-the-art performance of MADRec.

    The following are the results from [Table 7] of the original paper:

    Method Beauty Sports Toys ML-1m
    R@10 N@10 R@10 N@10 R@10 N@10 R@10 N@10
    SASRec 6.05 3.18 3.50 1.92 7.12 4.32 16.89 7.72
    BERT4Rec 6.01 3.08 3.59 1.81 6.65 3.68 20.56 11.12
    DreamRec 6.87 3.52 3.74 1.91 6.43 4.02 20.29 10.47
    CaDiRec 7.18 3.86 4.25 2.33 7.85 4.41 22.82 12.51
    TIGER 6.33 3.54 3.61 2.03 6.63 3.61 19.97 10.13
    LIGER 7.52 4.14 4.27 2.30 6.25 3.52 20.58 10.81
    MADRec (+ Improv. %) 8.15 (+8.7%) 4.41 (+6.5%) 4.54 (+6.3%) 2.49 (+6.9%) 8.46 (+7.8%) 4.45 (+0.9%) 23.96 (+5.0%) 13.45

Table 7 Analysis: This table confirms the superior performance of MADRec at $K=10$ as well. The percentage improvements over the second-best method (often CaDiRec or LIGER) are substantial for Recall@10 and NDCG@10 across most datasets, reinforcing the conclusions drawn from the $K=5$ metrics.

The following figure (Figure 2 from the original paper) shows the improved performance gap for coarse-grained retrieval on the Beauty and Sports datasets:

Figure 2: Improved performance gap for coarse-grained retrieval on the Beauty and Sports datasets. The gap in Recall@K between TIGER and MADRec increases as K increases. (Figure description: bar charts of Recall@K for MADRec vs. TIGER at several values of K; the annotations show MADRec's gain over TIGER, e.g., +3.8 Recall@40 on Beauty and +1.6 Recall@40 on Sports.)

Figure 2 Analysis (Coarse-grained Retrieval): This figure illustrates a crucial finding: the performance gap (in Recall@K) between MADRec and TIGER widens as $K$ grows (5, 10, 20, 40). This is particularly evident on the Beauty and Sports datasets. For example, on Beauty, the Recall difference at $K=40$ is significantly larger than at $K=5$. This suggests that MADRec is better at capturing global relationships among tokens (items) than TIGER's autoregressive modeling, which tends to over-index on local dependencies. Modeling global relationships yields higher-quality depth in its rankings, benefiting coarse-grained retrieval where a broader set of relevant items is sought.

6.1.2. Q2. Data-constrained Performance

This experiment investigates MADRec's robustness in data-constrained settings by evaluating its performance against TIGER on sparsified versions of the Beauty dataset.

The following figure (Figure 3 from the original paper) shows the comparison of data efficiency of MADRec and TIGER:

Figure 3: Comparison of data efficiency of MADRec and TIGER by dropping 25%, 37.5%, 50%, 62.5%, and 75% of items from each sequence in the training set, while maintaining at least three items in each sequence. (Figure description: bar charts of the percentage of NDCG@5 and Recall@5 performance retained at each drop ratio; MADRec (orange) retains more performance than TIGER (blue) at every ratio, e.g., roughly 10% more NDCG@5 and 9% more Recall@5 at the 25% drop level.)

Figure 3 Analysis (Data Efficiency): The figure plots the percentage of retained performance (relative to a model trained on the full dataset) for Recall@5 and NDCG@5 as an increasing percentage of items are dropped from the training sequences.

  • Observation: As the percentage of dropped items increases (i.e., data becomes scarcer), TIGER's performance drops significantly faster than MADRec's. For instance, at 25% item drop, MADRec retains considerably more performance than TIGER.
  • Conclusion: This supports the hypothesis that masked diffusion (used by MADRec) makes more efficient use of limited training data. This is likely due to its dynamic masking probability, which generates a larger variety of training samples from each raw sequence, allowing it to extract more signal from sparse interaction histories. At very high drop percentages (62.5% and 75%), the performance drop for both models converges, which is expected as data becomes extremely sparse.

6.1.3. Q3. Inference Performance-Efficiency Trade-off

This analysis explores MADRec's ability to balance inference performance and efficiency, a key advantage over AR models.

The following figure (Figure 4 from the original paper) shows the next-k item prediction performance vs number of function evaluations (NFEs):

Figure 4: Next-k item prediction performance vs. number of function evaluations (NFEs) during inference for (Left) k = 1 on Beauty and (Right) k = 2 on MovieLens-1M. The AR methods (TIGER and LIGER) must decode tokens sequentially, so they always execute k × (# SIDs/item) NFEs. MADRec can decode multiple items in parallel, allowing it to trade off performance and efficiency by tuning the NFEs. Moreover, it already outperforms the AR methods with fewer NFEs.

Figure 4 Analysis (Inference Trade-off):

  • Single-item Prediction (Left Panel, k=1 on Beauty):
    • AR models (TIGER, LIGER) require $m$ NFEs to predict one item (where $m$ is the number of SIDs per item, typically 4).
    • MADRec can achieve comparable or superior performance with fewer NFEs. Even with only 3 NFEs, MADRec surpasses TIGER by 13% and LIGER by 4.7% in NDCG@5. As NFEs increase, MADRec's performance further improves. This demonstrates MADRec's ability to flexibly trade off between computation and accuracy, and its inherent efficiency due to parallel decoding.
  • Multi-item Prediction (Right Panel, k=2 on MovieLens-1M):
    • This experiment uses a leave-two-out protocol to predict the two next items. AR models would require $k \times m = 2 \times 4 = 8$ NFEs.
    • MADRec achieves the same average session Recall@10 as TIGER with only 4 NFEs, representing a 50% reduction in NFEs.
    • Conclusion: MADRec's multi-token prediction capabilities translate into significant inference efficiency gains, especially as the number of items to recommend increases, outperforming AR models even with its conditional independence assumption during parallel decoding.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Q4. Component-wise importance

6.2.1.1. Importance of Semantic IDs

To understand the role of Semantic IDs, two experiments are conducted on the Beauty dataset.

The following are the results from [Table 2] of the original paper:

Method R@5 R@10 N@5 N@10
BERT4Rec 3.60 6.01 2.16 3.08
MADRec w/ Item IDs 4.69 6.71 3.12 3.77
MADRec w/ Random SIDs 3.78 5.53 2.61 3.05
MADRec 5.38 8.15 3.51 4.41

Table 2 Analysis (Semantic IDs):

  • MADRec w/ Random SIDs vs. MADRec: Replacing actual SIDs (derived from semantic item embeddings) with randomly assigned tokens leads to a drastic performance reduction (e.g., Recall@10 drops from 8.15 to 5.53). This clearly demonstrates that MADRec effectively leverages the semantic information embedded within SIDs.
  • MADRec w/ Item IDs vs. MADRec: Applying MADRec directly to item IDs (instead of SIDs) also results in lower performance (e.g., Recall@10 drops from 8.15 to 6.71). This further confirms that SIDs provide valuable semantic context that improves recommendation quality.

6.2.1.2. Importance of Dynamic Masking Probability

Comparing MADRec w/ Item IDs with BERT4Rec in Table 2:

  • MADRec w/ Item IDs still outperforms BERT4Rec (e.g., Recall@10: 6.71 vs. 6.01).
  • Reason: BERT4Rec uses a fixed masking ratio (e.g., 15%), while MADRec (even when applied to item IDs) uses a dynamic masking probability (sampling $t$ from $[0,1]$). This allows MADRec to learn from a more diverse set of corruption levels, leading to a more robust and effective training regime, even when the underlying tokens are just item IDs instead of SIDs.

6.2.1.3. Dependence on the Number of SIDs per Item

This ablation studies how MADRec's performance changes with the number of SIDs used to represent each item.

The following are the results from [Table 3] of the original paper:

Number of SIDs R@5 R@10 N@5 N@10
3 4.96 7.93 3.24 4.20
4 5.38 8.15 3.51 4.41
5 4.86 7.53 3.26 4.11

Table 3 Analysis (Number of SIDs):

  • Performance generally improves from 3 SIDs to 4 SIDs per item (e.g., Recall@10 from 7.93 to 8.15). This suggests that 4 SIDs might capture enough semantic information to be optimal for this dataset.
  • However, performance decreases when increasing the number of SIDs from 4 to 5 (e.g., Recall@10 drops from 8.15 to 7.53).
  • Plausible Reason: The paper suggests that an increased number of SIDs per item might lead to a higher potential for the model to predict invalid SID combinations (sequences of SIDs that do not correspond to any real item in the catalog). This highlights a practical challenge for GR with SIDs and suggests future work on constrained beam search to ensure valid item generation.

6.2.1.4. Role of Inference Strategy

MADRec's training does not impose a fixed token order, allowing flexibility in inference strategy. Three strategies are compared.

The following are the results from [Table 4] of the original paper:

Inference Method R@5 R@10 N@5 N@10
MADRec + Random 5.01 7.54 3.27 4.09
MADRec + Left-to-right 5.31 8.09 3.46 4.37
MADRec (Greedy) 5.38 8.15 3.51 4.41

Table 4 Analysis (Inference Strategy):

  • Random inference: Randomly selecting SIDs to unmask performs the worst among the three strategies.
  • Left-to-right inference: Sequentially unmasking SIDs in the order they are assigned by residual k-means (left-to-right) significantly outperforms random inference.
  • Greedy inference: This strategy, which prioritizes unmasking SIDs with lower prediction uncertainty (measured by the difference between the top two predicted probabilities), achieves the best performance.
  • Conclusion: The choice of inference strategy has a substantial impact on performance, and dynamically prioritizing SIDs based on model uncertainty is the most effective approach.

6.2.2. Q5. Extension via Dense Retrieval

This experiment evaluates MADRec extended with dense retrieval capabilities, as described in Section 3.3.

The following are the results from [Table 5] of the original paper:

| Method | R@5 | R@10 | N@5 | N@10 |
| --- | --- | --- | --- | --- |
| MADRec | 5.38 | 8.15 | 3.51 | 4.41 |
| MADRec + Dense Retrieval | 5.41 | 8.50 | 3.53 | 4.45 |
| MADRec + Unified Retrieval | 5.43 | 8.59 | 3.54 | 4.47 |

Table 5 Analysis (Dense Retrieval Extension):

  • MADRec + Dense Retrieval: Adding a dense retrieval objective and predicting dense item embeddings (as a secondary task during training) results in a slight improvement over base MADRec (e.g., Recall@10 from 8.15 to 8.50). At inference time, this variant retrieves items using the dense scores alone.
  • MADRec + Unified Retrieval: This strategy first uses MADRec to generate 20 candidate beams via beam search (generative retrieval) and then re-ranks these candidates based on their dense retrieval scores. This unified approach yields the best performance (e.g., Recall@10 of 8.59), slightly outperforming pure dense retrieval; a re-ranking sketch follows this list.
  • Conclusion: These results demonstrate that MADRec is general and flexible enough to be combined with auxiliary techniques like dense retrieval (originally developed for AR-based SID modeling), leading to further performance gains. This indicates its potential for integration with a broader ecosystem of GR advancements.
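
A minimal sketch of the unified re-ranking step, assuming the dense score is an inner product between a predicted user embedding and candidate item embeddings. The names, shapes, and scoring rule are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def unified_retrieval_rerank(beam_items: list[int],
                             user_embedding: torch.Tensor,
                             item_embeddings: torch.Tensor) -> list[int]:
    """Re-rank generatively retrieved candidates by their dense-retrieval scores.

    beam_items:      item indices proposed by MADRec's beam search (e.g., 20 beams).
    user_embedding:  (dim,) predicted user embedding from the dense head.
    item_embeddings: (num_items, dim) table of dense item embeddings.
    """
    dense_scores = item_embeddings[beam_items] @ user_embedding   # (num_candidates,)
    order = torch.argsort(dense_scores, descending=True)
    return [beam_items[i] for i in order.tolist()]
```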

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces MADRec (Masked Diffusion for Generative Recommendation), a novel framework for Generative Recommendation with Semantic IDs (SIDs) based on masked diffusion modeling. It effectively addresses several key limitations of traditional autoregressive (AR) approaches, such as expensive sequential inference, inefficient use of training data, and a bias towards local dependencies. MADRec demonstrates superior performance across various sequential recommendation datasets, particularly excelling in data-constrained settings and for coarse-grained recall. A significant advantage is its ability to perform parallel decoding during inference, offering a flexible trade-off between speed and accuracy while maintaining state-of-the-art results. Furthermore, MADRec proves to be extensible, allowing seamless integration with auxiliary techniques like dense retrieval.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose avenues for future research:

  • Invalid SID Predictions: The observation that increasing the number of SIDs per item beyond a certain point can lead to decreased performance (Table 3) due to the potential for generating invalid SID combinations is a limitation.
    • Future Work: This suggests exploring more sophisticated training and inference guidance strategies, such as constrained beam search, to prevent the generation of SIDs that do not correspond to valid items.
  • Sophisticated Guidance: The current masked diffusion framework could be further enhanced.
    • Future Work: Investigate the application of advanced diffusion model guidance techniques, such as classifier-free guidance or classifier-based guidance [SSP+24], which can improve sample quality without requiring additional training, or error correction mechanisms like remasking [WSSK25, vRFD+25], which can refine generated sequences.
  • Integration with Other Auxiliary Techniques: While dense retrieval was demonstrated, AR SID modeling has many other extensions.
    • Future Work: Explore combining MADRec with other auxiliary methods developed for AR SID modeling, such as incorporating user preferences [PYL+24] or specialized methods for long SID generation [HLS+25].

7.3. Personal Insights & Critique

7.3.1. Innovations and Strengths

  • Cross-Domain Transfer: The paper demonstrates a highly effective transfer of masked diffusion techniques, a powerful paradigm from NLP for discrete sequence generation, to the recommendation domain. This is a significant innovation that opens new avenues for RecSys research.
  • Addressing Key AR Limitations: MADRec directly tackles the known drawbacks of autoregressive models in GR with SIDs. Its parallel decoding mechanism is a compelling solution to inference latency, a critical concern for real-world deployments. The improved data efficiency is also crucial for RecSys datasets, which are often sparse.
  • Global Relationship Capture: The finding that MADRec excels in coarse-grained recall, and that its performance gap over AR models widens at larger recall cutoffs K, strongly supports the claim of better global relationship modeling. This is a valuable quality for recommendation, where understanding broader item preferences is important.
  • Flexibility: The ability to trade off NFEs for performance and its extensibility with dense retrieval showcases MADRec as a robust and adaptable framework. The performance gains even with fewer NFEs are particularly impressive.

7.3.2. Potential Issues and Areas for Improvement

  • Invalid SID Combinations: The observed performance drop when moving from 4 to 5 SIDs due to invalid SID combinations is a practical challenge. While constrained beam search is a suggested future direction, it highlights that the model, by itself, doesn't inherently learn to only produce valid items, which could be a hurdle in real-world deployment if not carefully managed. This could be exacerbated in domains with very large item catalogs and complex SID structures.
  • Conditional Independence Assumption: While MADRec empirically performs well with multi-token prediction, it relies on the conditional independence assumption of masked tokens given unmasked tokens. For certain intricate SID relationships or highly correlated item attributes, this assumption might be suboptimal. Future work could investigate methods to account for dependencies among simultaneously unmasked tokens without sacrificing too much parallelism.
  • Complexity of SID Assignment: The quality of MADRec's performance is heavily reliant on the initial item SID assignment (semantic embeddings + clustering). While the paper uses Flan-T5-XXL and Residual K-Means, exploring more advanced or adaptive SID generation methods could further boost MADRec's potential.
  • Hyperparameter Sensitivity: The optimal number of SIDs per item (4 in this case) seems dataset-dependent. Automatically determining this parameter or making the model less sensitive to it would be beneficial for generalization.

7.3.3. Broader Applicability

The methods and conclusions of this paper could be transferred or applied to other domains involving discrete sequence modeling where autoregressive models face similar challenges:

  • Code Generation: Generating code snippets or programs, where parallel token prediction and capturing global structure (e.g., syntax, function calls) could be advantageous.

  • Molecular Design: Generating novel molecular structures, where molecules are represented as sequences of discrete components.

  • Time Series Forecasting (Discrete Events): Modeling and predicting sequences of discrete events in areas like finance, healthcare, or logistics.

  • Dialogue Generation: Generating responses in a dialogue system, especially if the responses can be broken down into semantic tokens, allowing for faster response generation.

Overall, MADRec is a compelling advancement that successfully brings powerful diffusion modeling techniques to Generative Recommendation, setting a new benchmark and opening exciting avenues for future research in discrete sequence generation.
