The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation
TL;DR Summary
The H$^2$Rec framework harmonizes Semantic IDs and Hash IDs to tackle long-tail issues in sequential recommendation systems, utilizing a dual-branch architecture for capturing multi-granular semantics and a dual-level alignment strategy for knowledge transfer.
Abstract
Conventional Sequential Recommender Systems (SRS) typically assign unique Hash IDs (HID) to construct item embeddings. These HID embeddings learn collaborative information solely from historical user-item interactions, making them vulnerable to situations where most items are rarely consumed (the long-tail problem). Recent methods that incorporate auxiliary information often suffer from noisy collaborative sharing caused by co-occurrence signals or semantic homogeneity caused by flat dense embeddings. Semantic IDs (SIDs), with their capability of code sharing and multi-granular semantic modeling, provide a promising alternative. However, the collaborative overwhelming phenomenon hinders the further development of SID-based methods: the quantization mechanisms commonly compromise the uniqueness of identifiers required for modeling head items, creating a performance seesaw between head and tail items. To address this dilemma, we propose H$^2$Rec, a novel framework that harmonizes SID and HID. Specifically, we devise a dual-branch modeling architecture that enables the model to capture the multi-granular semantics within SID while preserving the unique collaborative identity of HID. Furthermore, we introduce a dual-level alignment strategy that bridges the two representations, facilitating knowledge transfer and supporting robust preference modeling. Extensive experiments on three real-world datasets show that H$^2$Rec effectively balances recommendation quality for both head and tail items while surpassing existing baselines. The implementation code is available at https://github.com/ziwliu8/H2Rec.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation." It focuses on addressing the limitations of traditional sequential recommender systems by integrating two types of item identifiers.
1.2. Authors
The authors of this paper are:
- Ziwei Liu (City University of Hong Kong, Hong Kong, China)
- Yejing Wang (City University of Hong Kong, Hong Kong, China)
- Qidong Liu (Xi'an Jiaotong University & City University of Hong Kong, Xi'an, China)
- Zijian Zhang (City University of Hong Kong, Hong Kong, China)
- Wei Huang (Independent Researcher, Beijing, China)
- Chong Chen (Tsinghua University, Beijing, China)
- Xiangyu Zhao (City University of Hong Kong, Hong Kong, China)
The authors are primarily affiliated with City University of Hong Kong, with contributions from Xi'an Jiaotong University, Tsinghua University, and an independent researcher. Their research backgrounds appear to be in computer science, particularly in areas related to recommender systems, information retrieval, and potentially natural language processing given the mention of large language models.
1.3. Journal/Conference
The venue field in the paper reads "In ." without naming a specific journal or conference, so the exact venue is not identified in the bibliographic information. However, the paper is formatted as an ACM publication, suggesting it is intended for a prominent ACM conference or journal. ACM venues are highly reputable in computer science, indicating a rigorous peer-review process and influence in relevant fields such as recommender systems and information retrieval.
1.4. Publication Year
The publication year is 2026.
1.5. Abstract
The paper addresses the limitations of conventional Sequential Recommender Systems (SRS) that use Hash IDs (HID) for item embeddings. While HID effectively capture collaborative information, they struggle with the long-tail problem where rare items have sparse interactions. Existing methods that incorporate auxiliary information suffer from noisy collaborative sharing (due to unreliable co-occurrence signals) or semantic homogeneity (due to flat dense embeddings). Semantic IDs (SIDs) offer a promising alternative with their code sharing and multi-granular semantic modeling capabilities. However, SID-based methods face the collaborative overwhelming phenomenon where quantization compromises identifier uniqueness for head items, leading to a performance trade-off between head and tail items.
To resolve this, the authors propose H2Rec, a novel framework that harmonizes SID and HID. It features a dual-branch modeling architecture to capture multi-granular semantics from SID while preserving the unique collaborative identity of HID. Additionally, a dual-level alignment strategy bridges the two representations, facilitating knowledge transfer and robust preference modeling. Extensive experiments on three real-world datasets demonstrate that H2Rec effectively balances recommendation quality for both head and tail items, outperforming existing baselines.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2512.10388v1. This indicates it is a preprint available on arXiv.
1.7. PDF Link
The PDF link is https://arxiv.org/pdf/2512.10388v1.pdf. This is also a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
2.1.1. Core Problem
The core problem the paper aims to solve is the inherent trade-off and limitations faced by existing Sequential Recommender Systems (SRS) when dealing with item identification and representation, particularly regarding the long-tail problem and the collaborative overwhelming phenomenon.
Traditional SRS rely on Hash IDs (HID) to create item embeddings. While HID are excellent at capturing collaborative information from user-item interactions, they suffer significantly for tail items (items with sparse interactions). This is because HID embeddings for these items lack sufficient data to learn reliable representations, leading to poor recommendation quality for the vast majority of items.
Recent attempts to incorporate auxiliary information, often using large language models (LLMs) to generate semantic features, introduce new issues:
- Noisy Collaborative Sharing: Some methods exploit co-occurrence patterns, but these signals can be unreliable. Accidental clicks can cause tail items to inherit misleading collaborative information from semantically unrelated popular items.
- Semantic Homogeneity: LLM-based approaches often compress all textual information into a single, dense vector. This "flat" representation creates a single-granularity bottleneck, making it difficult to distinguish subtle semantic differences among similar items.

Conversely, Semantic IDs (SIDs), generated by decomposing dense semantic embeddings into discrete code sequences through vector quantization, offer multi-granular semantic modeling and code sharing. They can aggregate collaborative signals among semantically related items and provide finer semantic distinctions. However, SID-based methods introduce their own problem: the collaborative overwhelming phenomenon. The quantization process inevitably causes code collisions (multiple items sharing the same SID), which compromises the uniqueness of item identifiers. This particularly harms the modeling of head items (popular items with abundant interactions) by confusing the model with inflated user-item connections, leading to performance degradation for these crucial items.
2.1.2. Importance of the Problem
This problem is critical because a vast majority of items in real-world recommendation scenarios belong to the long tail. If a recommender system cannot effectively recommend these items, it leads to:
- Reduced item discoverability: Users are repeatedly shown popular items, limiting their exposure to the wider catalog.
- Suboptimal user experience: Users may miss items that truly match their niche preferences.
- Economic inefficiency: Less popular items remain undiscovered and unpurchased, affecting businesses.
- Biased recommendations: Popularity bias is reinforced at the expense of diverse interests.

The trade-off between HID and SID highlights a fundamental dilemma: identifier uniqueness for popular items versus semantic generalization for rare items. Resolving it means achieving a more balanced and effective recommendation system that serves both popular and niche interests, a significant step towards more robust and fair recommendation.
2.1.3. Paper's Entry Point / Innovative Idea
The paper's innovative idea is to propose H2Rec, a novel framework that harmonizes Semantic IDs (SID) and Hash IDs (HID) within a single system, aiming to leverage the strengths of both while mitigating their individual weaknesses. Instead of replacing one with the other or simply concatenating them, H2Rec employs a sophisticated dual-branch architecture with a dual-level alignment strategy. This allows the model to simultaneously capture multi-granular semantics from SID (benefiting tail items) and preserve the unique collaborative identity of HID (benefiting head items). The key insight is that a nuanced integration and alignment, rather than simple fusion, is necessary to achieve the "best of both worlds."
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Identification of the Collaborative Overwhelming Phenomenon: The paper formally identifies and characterizes the collaborative overwhelming phenomenon in SID-based methods. This phenomenon highlights the fundamental trade-off between identifier uniqueness (crucial for head items) and semantic generalization (beneficial for tail items) caused by quantization-induced code collisions, clarifying a significant limitation of existing SID-based approaches.
- Proposed H2Rec Framework: The authors introduce H2Rec, a novel dual-branch framework designed to harmonize HID and SID. It combines multi-granularity modules (a Multi-granularity Fusion Network for the SID branch and a Multi-granularity Cross Attention Network for the HID branch) with a Dual-level Alignment Strategy, directly addressing the semantic homogeneity and collaborative overwhelming problems.
- Dual-level Alignment Strategy: H2Rec incorporates a Code-guided Alignment Loss at the item level and a Masked Sequence Granularity Loss at the user level. These losses facilitate knowledge transfer between SID and HID representations, allowing tail items to "borrow" high-quality collaborative signals from semantically similar head items without noisy collaborative sharing, and supporting robust user preference modeling.
- Empirical Validation and Superior Performance: Extensive experiments on three real-world datasets demonstrate that H2Rec significantly outperforms state-of-the-art baselines. Crucially, it achieves a robust balance, improving recommendations for tail items without sacrificing performance on head items, thereby breaking the traditional performance seesaw. The framework is also shown to be quantization-agnostic and model-agnostic.

These findings collectively address the challenge of robust recommendation across the entire item popularity spectrum by intelligently integrating the complementary strengths of semantic and hash item representations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following foundational concepts:
- Recommender Systems (RS): Software systems that provide suggestions for items to users. The goal is to predict what a user might like, based on their past behavior or preferences, and the behavior of other users.
- Sequential Recommender Systems (SRS): A specialized type of RS that focuses on the order of user interactions. Instead of just predicting general preferences, SRS predict the next item a user will interact with, based on their historical sequence of interactions. Examples include predicting the next movie to watch or the next product to buy.
- Item Embeddings: In RS, items (e.g., movies, products) are typically represented as numerical vectors in a high-dimensional space. These vectors, called embeddings, capture the characteristics and relationships between items. Similar items have embeddings that are close to each other in this space.
- Hash IDs (HID): Unique identifiers assigned to each item in a database. In traditional SRS, these HIDs are directly mapped to learnable embeddings; each HID corresponds to a unique embedding vector.
- Long-tail Problem: In many domains, a small number of items are very popular ("head items"), while a large number of items are consumed very rarely ("tail items"). The long-tail problem refers to the difficulty of effectively recommending these tail items due to their sparse interaction data, which makes it hard to learn good embeddings or collaborative signals for them.
- Collaborative Filtering (CF): A common technique in RS that makes predictions about a user's interests by collecting preferences from many users. The underlying assumption is that users with similar tastes will like similar items. HID-based methods primarily learn collaborative signals.
- Semantic IDs (SID): Unlike HID, SID are derived from the semantic content (e.g., text descriptions, images) of items. They represent items using discrete codes that capture multi-granular semantic information. SID generation typically involves vector quantization.
- Vector Quantization (VQ): A technique used to convert continuous vector data into discrete codes. It maps an input vector to the closest "codeword" in a predefined codebook, compressing the information and grouping similar vectors together.
- Residual Quantized Variational Autoencoder (RQ-VAE): A specific vector quantization method that quantizes an embedding in multiple stages (or "levels"). At each stage it quantizes the residual (the part of the embedding not yet explained by previous quantizations), allowing for a fine-grained, hierarchical semantic representation in which each level captures a different granularity of semantics.
- Large Language Models (LLM): Powerful AI models trained on vast amounts of text data, capable of understanding and generating human-like text. They can encode textual descriptions of items into dense semantic embeddings, which can then be used in recommender systems. An example mentioned is text-embedding-ada-002.
- Self-Attention / Transformer: A neural network architecture component that allows a model to weigh the importance of different parts of the input sequence relative to a given element. It is fundamental to models like SASRec and BERT4Rec. The core attention mechanism is calculated as (a minimal code sketch of this computation follows this list):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing different projections of the input sequence.
  - $QK^\top$ calculates the similarity between queries and keys.
  - $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax function into regions with tiny gradients.
  - $\mathrm{softmax}$ normalizes the scores into attention weights.
  - The result is a weighted sum of the Value vectors.
- Cross-Attention: A variant of the attention mechanism where the Query, Key, and Value come from different input sequences; for example, queries come from one sequence and keys/values from another, allowing one sequence to attend to relevant parts of the other.
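To make the attention formulas above concrete, here is a minimal PyTorch sketch of scaled dot-product attention used for both self-attention and cross-attention; the tensor shapes and variable names are illustrative only and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k): query-key similarity
    weights = F.softmax(scores, dim=-1)             # normalize scores into attention weights
    return weights @ V                              # weighted sum of the value vectors

# Self-attention: queries, keys, and values all come from the same sequence.
x = torch.randn(2, 10, 64)                          # (batch, seq_len, dim)
self_out = scaled_dot_product_attention(x, x, x)    # (2, 10, 64)

# Cross-attention: queries come from one sequence, keys/values from another.
y = torch.randn(2, 20, 64)
cross_out = scaled_dot_product_attention(x, y, y)   # x attends over y -> (2, 10, 64)
```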
3.2. Previous Works
The paper discusses several categories of previous works:
- Conventional HID-based SRS:
  - BERT4Rec [24], GRU4Rec [10], SASRec [12]: Foundational sequential recommender systems that assign unique Hash IDs (HID) to items and map them to high-dimensional embeddings to capture collaborative information. GRU4Rec uses Gated Recurrent Units (GRUs) to model sequential patterns, while SASRec and BERT4Rec leverage self-attention mechanisms (from the Transformer architecture) to capture long-range dependencies in user interaction sequences; SASRec uses left-to-right self-attention, whereas BERT4Rec uses masked self-attention.
  - MELT [13]: Attempts to address the long-tail problem by exploiting co-occurrence patterns to enhance tail items with information from popular items. However, the paper argues that co-occurrence signals can be unreliable, leading to noisy collaborative sharing.
  - LLM-ESR [18]: A recent method that enriches HID embeddings with semantic information derived from Large Language Models (LLMs). While it improves performance, the paper points out that it compresses all textual information into a single dense vector, leading to the semantic homogeneity problem and a single-granularity bottleneck.
- SID-based Methods:
  - PG-SID [37], SPM-SID [23], CCFRec [16], PSRQ+MCCA (denoted as PSRQ) [30]: These methods use Semantic IDs (SID) generated by decomposing dense semantic embeddings (e.g., from LLMs) into discrete code sequences via vector quantization (RQ-VAE). They are theoretically appealing because shared codes aggregate collaborative signals among semantically related items (mitigating noisy collaborative sharing) and multi-granular codes offer finer semantic distinctions (alleviating semantic homogeneity).
  - The paper notes that existing SID-based methods often replace HID with SID or fuse them via simple concatenation or contrastive learning. These strategies are hindered by the collaborative overwhelming phenomenon, in which quantization-induced code collisions compromise the uniqueness required for head items.
- Hybrid Embedding Methods:
  - URL4DR [15], MME-SID [32], PCR-CA [25]: These methods attempt to combine HID and SID representations. The paper argues that such fusions are often simplistic (e.g., plain concatenation or contrastive learning) and do not fully resolve the fundamental trade-offs identified above.
3.3. Technological Evolution
The field of sequential recommendation has evolved significantly:
- Early Methods (Statistical/Matrix Factorization): Focused on basic item similarities or user-item matrices.
- Recurrent Neural Networks (RNNs): GRU4Rec brought sequential modeling capabilities, capturing dependencies in user sessions.
- Convolutional Neural Networks (CNNs): Caser used CNNs to capture local sequential patterns.
- Self-Attention/Transformers: SASRec and BERT4Rec revolutionized SRS with self-attention, enabling better capture of long-range dependencies and global patterns within sequences.
- Addressing Data Sparsity (Co-occurrence/Auxiliary Information): Methods like MELT leveraged co-occurrence; in parallel, integrating external auxiliary information (e.g., item attributes) became common.
- Rise of LLMs: LLM-ESR and similar works leverage the powerful semantic understanding of LLMs to generate rich item representations, moving beyond simple categorical attributes.
- Semantic IDs and Vector Quantization: Recognizing the single-granularity bottleneck of dense LLM embeddings, SID-based methods (PG-SID, CCFRec) emerged, aiming for multi-granular semantic representations through vector quantization to handle both semantic homogeneity and noisy collaborative sharing.

This paper's work (H2Rec) fits into the latest stage of this evolution, attempting to overcome the limitations of both pure HID and pure SID methods with a harmonization framework. It aims to resolve the collaborative overwhelming phenomenon inherent in SID and the long-tail problem of HID, building on the strengths of LLM-derived semantic embeddings and Transformer-based sequence modeling.
3.4. Differentiation Analysis
Compared to the main methods in related work, H2Rec offers several core differences and innovations:
- Harmonization vs. Substitution/Simple Fusion:
  - Difference: Unlike pure HID-based or SID-based methods (e.g., SASRec and LLM-ESR vs. PG-SID and CCFRec) that focus on one type of ID, or hybrid methods (e.g., URL4DR, MME-SID, PCR-CA) that often use simple concatenation or contrastive learning, H2Rec proposes a deep, harmonized integration.
  - Innovation: It does not merely combine the two; it designs a dual-branch architecture where each branch explicitly addresses the weaknesses of the other. The SID branch handles semantic homogeneity with multi-granularity fusion, while the HID branch selectively injects semantic signals via multi-granularity cross-attention to avoid collaborative overwhelming.
- Addressing Collaborative Overwhelming:
  - Difference: H2Rec explicitly identifies and tackles the collaborative overwhelming phenomenon, a critical limitation of SID-based methods where quantization-induced code collisions compromise uniqueness for head items.
  - Innovation: The Multi-granularity Cross Attention Network in the HID branch is specifically designed to let HID act as the query, selectively drawing semantic information from SID without losing its unique identity, thereby protecting head-item performance.
- Mitigating Noisy Collaborative Sharing with Targeted Alignment:
  - Difference: While methods like MELT and SID-based approaches try to mitigate noisy collaborative sharing, H2Rec introduces a more refined Code-guided Alignment Loss.
  - Innovation: This loss is not a generic contrastive alignment; it leverages multi-granular SID to identify truly semantically similar items, ensuring that tail items "borrow" high-quality collaborative signals from relevant head items rather than inheriting noise from loosely co-occurring items.
- Robust Preference Modeling through Dual-level Alignment:
  - Innovation: The Dual-level Alignment Strategy (combining the Code-guided Alignment Loss and the Masked Sequence Granularity Loss) operates at both the item and user levels. The Masked Sequence Granularity Loss at the user level enhances internal correlations among multi-granular semantics, leading to more robust user representations, a feature not prominently found in other hybrid approaches.
- Generality and Modularity:
  - Innovation: H2Rec is quantization-agnostic (compatible with different RQ-VAE variants) and model-agnostic (it can integrate with various SRS backbones such as GRU4Rec or BERT4Rec), making it a flexible, plug-and-play enhancement module.

In essence, H2Rec moves beyond simple integration by architecturally designing how HID and SID complement each other, using attention and alignment mechanisms to explicitly resolve the inherent tension between identifier uniqueness and semantic generalization across all item popularity groups.
4. Methodology
4.1. Principles
The core idea behind H2Rec is to leverage the complementary strengths of Hash IDs (HID) and Semantic IDs (SID) while mitigating their individual weaknesses.
- HID's Strength: HIDs are excellent at capturing unique collaborative information, especially for head items with abundant interaction data; however, they struggle with tail items due to data sparsity.
- SID's Strength: SIDs, derived from LLM-based semantic embeddings via vector quantization, can model multi-granular semantics and enable code sharing among semantically related items, which benefits tail items and cold-start scenarios; however, quantization-induced code collisions can compromise identifier uniqueness, leading to collaborative overwhelming for head items.

The theoretical intuition is to build a dual-branch architecture: one branch primarily handles the semantic aspects via SID, while the other focuses on the collaborative and unique-identity aspects via HID. The two branches are harmonized through dual-level alignment strategies that facilitate knowledge transfer in a controlled, beneficial manner, ensuring that:
- Tail items gain semantic and collaborative signals from related head items without introducing noise.
- Head items retain their unique collaborative identity and are not "overwhelmed" by shared semantic codes.
- The system learns robust, fine-grained representations for both.
4.2. Core Methodology In-depth (Layer by Layer)
The H2Rec framework is structured around a dual-branch modeling architecture and a dual-level alignment strategy.
The following figure (Figure 2 from the original paper) shows the framework overview for H2Rec:
The figure is a schematic overview of the proposed framework. The SID branch is shown on the left and the HID branch on the right; the two are connected through the Multi-granularity Fusion Network and the cross-attention mechanism. A legend indicates the different types of embeddings and how they are encoded. The framework is designed to balance recommendation quality between head and tail items.
4.2.1. Dual-branch Modeling
This architecture consists of an SID Branch and an HID Branch, designed to leverage the complementary benefits of semantic and hash IDs.
4.2.1.1. SID Branch
The SID branch focuses on constructing fine-grained semantic representations from Semantic IDs.
- Semantic Code Generation:
  - First, semantic embeddings are derived from item textual attributes using Large Language Models (LLM). Specifically, item attributes are converted into textual instructions, and an LLM encoder (e.g., text-embedding-ada-002) produces the item representation matrix $E_{LLM}$.
    - $E_{LLM}$: Matrix of LLM-derived semantic embeddings for all items, with one row per unique item and as many columns as the dimensionality of the LLM embeddings.
  - Subsequently, a Residual Quantized Variational Autoencoder (RQ-VAE) framework (as mentioned in Section 2.2) is trained. This RQ-VAE quantizes each semantic embedding into $L$ discrete semantic codes.
  - An item is then represented as a tuple of codes, one per level, where the $l$-th entry is a single code at the $l$-th quantization level.
  - Corresponding codebook embedding matrices $E_C$ are learned, one codebook per level.
  - The semantic code embeddings for an item are retrieved by looking up its level-$l$ code in the level-$l$ codebook, yielding $\pmb{e}_{C_i}^l$ for item $v_i$ at level $l$.
- Multi-granularity Fusion Network:
  - The semantic code embeddings for all items in a user's sequence are grouped by their code levels, forming $L$ separate granularity sequences:
$$\mathcal{E}_g^l = [\pmb{e}_{C_1}^l, \pmb{e}_{C_2}^l, \dots, \pmb{e}_{C_N}^l]$$
    - $\mathcal{E}_g^l$: Embedding sequence of the interacted items at code level $l$.
    - $\pmb{e}_{C_n}^l$: Semantic code embedding of the $n$-th item in the sequence at code level $l$.
    - $N$: Length of the interaction sequence.
  - To adaptively aggregate these granularity sequences into a final fine-grained SID embedding sequence, an adaptive mechanism uses the user's interaction context to assign importance weights to the different code-level granularities (a sketch of this mechanism appears at the end of this subsection).
  - The embedding of the user's last interacted item, $\pmb{e}_N^{hid}$, taken from the HID embedding sequence (discussed later), serves as the query anchor for the current intent.
  - A learnable residual bias vector $\pmb{b}_{prior}$ is introduced to incorporate the prior knowledge that coarse-grained semantics are generally more robust and to ensure training stability.
  - The unnormalized importance scores are computed via a Multi-Layer Perceptron (MLP):
$$\pmb{s} = W_2 \left( \sigma \left( W_1 \left[ \pmb{e}_N^{hid} ; \pmb{b}_{prior} \right] + \pmb{b}_1 \right) \right) + \pmb{b}_2 + \pmb{b}_{prior}$$
    - $\pmb{e}_N^{hid}$: HID embedding of the last item in the sequence.
    - $\pmb{b}_{prior}$: Learnable residual bias vector, emphasizing coarse-grained semantics.
    - $[\cdot ; \cdot]$: Concatenation operation.
    - $W_1$, $W_2$: Weight matrices of the two MLP layers; their shapes are determined by the HID embedding dimension, the number of code levels $L$, and the MLP hidden dimension.
    - $\pmb{b}_1$, $\pmb{b}_2$: Bias vectors of the MLP.
    - $\sigma$: Activation function (e.g., ReLU).
  - The scores are then normalized into importance weights using a softmax function:
$$\alpha_l = \frac{\exp(s_l)}{\sum_{k=1}^L \exp(s_k)}$$
    - $\alpha_l$: Importance weight for code level $l$.
    - $s_l$: The $l$-th component of the score vector $\pmb{s}$.
  - Finally, the SID item embedding sequence is derived as a weighted sum of the granularity sequences:
$$\mathcal{E}^{sid} = \sum_{l=1}^L \alpha_l \cdot \mathcal{E}_g^l$$
    - $\mathcal{E}^{sid}$: The final fused SID embedding sequence.
- Sequence Encoder:
  - An independent sequence encoder $f_{\theta^1}$ (e.g., a SASRec or BERT4Rec backbone) processes the fused SID embedding sequence $\mathcal{E}^{sid}$ to produce the final user representation $\pmb{u}^{sid}$ of the SID branch.
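To illustrate how the adaptive granularity weighting described above could be implemented, the following is a minimal PyTorch sketch of the Multi-granularity Fusion Network; the class name, tensor shapes, and hidden size are assumptions for illustration and not the authors' code.

```python
import torch
import torch.nn as nn

class MultiGranularityFusion(nn.Module):
    """Hypothetical sketch: an MLP scores the L code-level granularities from the
    last HID embedding plus a learnable prior bias, softmax turns the scores into
    weights, and the SID sequence is the weighted sum of the per-level sequences."""
    def __init__(self, d: int, num_levels: int, hidden: int = 128):
        super().__init__()
        self.b_prior = nn.Parameter(torch.zeros(num_levels))   # residual prior bias
        self.w1 = nn.Linear(d + num_levels, hidden)
        self.w2 = nn.Linear(hidden, num_levels)

    def forward(self, e_last_hid, granularity_seqs):
        # e_last_hid: (B, d) HID embedding of the last interacted item
        # granularity_seqs: (B, L, N, d) code embeddings per level and position
        prior = self.b_prior.expand(e_last_hid.size(0), -1)
        s = self.w2(torch.relu(self.w1(torch.cat([e_last_hid, prior], dim=-1)))) + prior
        alpha = torch.softmax(s, dim=-1)                        # (B, L) level weights
        # Weighted sum over levels -> fused SID embedding sequence of shape (B, N, d)
        fused = (alpha[:, :, None, None] * granularity_seqs).sum(dim=1)
        return fused, alpha
```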
4.2.1.2. HID Branch
The HID branch focuses on preserving the unique collaborative identity of Hash IDs while selectively incorporating multi-granular semantic information.
- HID Embedding Layer:
  - A learnable item embedding layer is used to derive the initial HID item embeddings, one unique $d$-dimensional vector per item. This layer is updated during training to absorb collaborative information from historical user-item interactions.
  - The item embedding sequence $\mathcal{E}^{hid}$ is formed by looking up the embeddings of the items in the user's interaction sequence.
- Multi-granularity Cross Attention Network:
  - This module is crucial for injecting nuanced multi-granular semantic information into the HID embeddings without collaborative overwhelming (a sketch appears at the end of this subsection).
  - The HID item embedding sequence $\mathcal{E}^{hid}$ acts as the query anchor.
  - The multi-granular SID item embedding sequences $\mathcal{E}_g^l$ (from the SID branch) act as the key-value pairs for each code level $l$.
  - This setup allows the unique HID to selectively retrieve relevant semantic information from SID.
  - For each code level $l$, the embeddings are projected into distinct subspaces:
$$Q = \mathcal{E}^{hid} W^Q, \quad K_l = \mathcal{E}_g^l W^K, \quad V_l = \mathcal{E}_g^l W^V$$
    - $W^Q$, $W^K$, $W^V$: Learnable weight matrices projecting the HID and SID embeddings into the query, key, and value subspaces, respectively.
  - The final fused embedding sequence is derived by aggregating the granularity-specific attention outputs, weighted by the user's intent-aware scores $\alpha_l$ computed in the SID branch. A residual connection adds back the original HID item embedding sequence to preserve its uniqueness:
$$\mathcal{E}^f = \sum_{l=1}^L \alpha_l \cdot \left( \mathrm{softmax}\left(\frac{Q K_l^\top}{\sqrt{d}}\right) V_l \right) + \mathcal{E}^{hid}$$
    - $\mathcal{E}^f$: The fused HID embedding sequence incorporating semantic information.
    - $\alpha_l$: Importance weight for code level $l$, computed in the SID branch.
    - $Q K_l^\top$: Dot-product similarity between the queries and the level-$l$ keys.
    - $\sqrt{d}$: Scaling factor.
    - $V_l$: Value matrix for level $l$.
    - $\mathcal{E}^{hid}$: Original HID embedding sequence, added as a residual connection.
- Sequence Encoder:
  - Another independent sequence encoder $f_{\theta^2}$ processes the fused embedding sequence $\mathcal{E}^f$ to derive the comprehensive user representation $\pmb{u}^{hid}$ of the HID branch.
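The following is a minimal PyTorch sketch of the Multi-granularity Cross Attention computation described above; the module and variable names are hypothetical, and details such as multi-head splitting or dropout are omitted.

```python
import torch
import torch.nn as nn

class MultiGranularityCrossAttention(nn.Module):
    """Hypothetical sketch: HID embeddings act as queries, each SID granularity
    sequence supplies keys/values, attention outputs are aggregated with the level
    weights alpha, and a residual connection keeps the original HID identity."""
    def __init__(self, d: int):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.scale = d ** 0.5

    def forward(self, e_hid, granularity_seqs, alpha):
        # e_hid: (B, N, d); granularity_seqs: (B, L, N, d); alpha: (B, L)
        q = self.wq(e_hid)                                   # queries from HID
        fused = torch.zeros_like(e_hid)
        for l in range(granularity_seqs.size(1)):
            k = self.wk(granularity_seqs[:, l])              # keys from level-l SID codes
            v = self.wv(granularity_seqs[:, l])
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
            fused = fused + alpha[:, l, None, None] * (attn @ v)
        return fused + e_hid                                 # residual keeps HID uniqueness
```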
4.2.2. Dual-level Alignment
This strategy further enhances the representation ability of each branch by aligning the semantic and collaborative spaces at two levels.
4.2.2.1. Code-guided Alignment Loss ($\mathcal{L}_{CA}$)
This loss mitigates Noisy Collaborative Sharing by aligning SID and HID representations, allowing long-tail items to "borrow" collaborative signals from semantically similar head items.
- Standard Contrastive Learning (Base Idea): A straightforward approach would be a standard contrastive objective that pulls the SID item embedding and HID item embedding of the same item closer while pushing them away from negative samples. This 1-to-1 alignment transfers unique collaborative signals into the semantic representation. Its original form would be:
$$\mathcal{L}_{align} = - \frac{1}{B} \sum_{i=1}^B \log \frac{\exp(\cos(\pmb{e}_i^{sid}, \pmb{e}_i^{hid}) / \tau)}{\sum_{j=1}^B \mathbb{I}[i \neq j] \exp(\cos(\pmb{e}_i^{sid}, \pmb{e}_j^{hid}) / \tau)}$$
  - $B$: Batch size.
  - $\cos(\cdot, \cdot)$: Cosine similarity.
  - $\tau$: Temperature coefficient.
  - $\mathbb{I}[i \neq j]$: Indicator function, ensuring only items with $j \neq i$ serve as negative samples.
- 1-to-Many Code-guided Alignment: To let tail items borrow collaborative signals, the objective is expanded from 1-to-1 to 1-to-many by constructing a unified positive set $\mathcal{P}(i)$ for item $i$ from two additional sources:
  - The set of items sharing a required number of semantic-code levels with item $i$, which captures semantic similarity.
  - The set of items appearing within the co-occurrence context window around item $i$, which captures local collaborative proximity.
  - The unified positive set $\mathcal{P}(i)$ combines both sources.
  - The Code-guided Alignment Loss maximizes the cumulative similarity between the anchor SID item embedding and all positive HID item embeddings from $\mathcal{P}(i)$:
$$\mathcal{L}_{CA}^1 = - \frac{1}{B} \sum_{i=1}^B \log \frac{\sum_{k \in \mathcal{P}(i)} \exp(\cos(\pmb{e}_i^{sid}, \pmb{e}_k^{hid}) / \tau)}{\sum_{j=1}^B \exp(\cos(\pmb{e}_i^{sid}, \pmb{e}_j^{hid}) / \tau)}$$
    - $B$: Batch size.
    - $\mathcal{P}(i)$: Positive set for item $i$.
    - $\pmb{e}_i^{sid}$: SID embedding of the anchor item $i$.
    - $\pmb{e}_k^{hid}$: HID embedding of a positive item $k$.
    - $\pmb{e}_j^{hid}$: HID embedding of any item $j$ in the batch (including negative samples).
  - To make the alignment bidirectional, a second term $\mathcal{L}_{CA}^2$ is computed by exchanging the roles of the SID and HID embeddings.
  - The total Code-guided Alignment Loss combines the two directions, $\mathcal{L}_{CA} = \mathcal{L}_{CA}^1 + \mathcal{L}_{CA}^2$.
4.2.2.2. Masked Sequence Granularity Loss ($\mathcal{L}_{MSG}$)
This loss enhances robust user representation by enforcing internal correlations among multi-granular semantics.
- View Construction: For each user interaction sequence, two views are constructed:
  - Global View: Preserves the complete multi-granular semantic codes, yielding the user representation $\pmb{u}^{sid}$ after processing by the Multi-granularity Fusion Network and the sequence encoder.
  - Granularity-Masked View: A target granularity index is randomly sampled, and the embedding sequence at this level is replaced with a learnable mask token. The resulting masked sequence is processed through the same Multi-granularity Fusion Network and sequence encoder to derive the masked user representation $\tilde{\pmb{u}}^{sid}$.
- Mutual Information Maximization: The model is trained to implicitly infer the missing semantic information by maximizing the mutual information between the global and masked representations, using a contrastive learning objective:
$$\mathcal{L}_{MSG}^1 = - \frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\cos(\pmb{u}_i^{sid}, \tilde{\pmb{u}}_i^{sid}) / \tau)}{\sum_{j=1}^N \exp(\cos(\pmb{u}_i^{sid}, \tilde{\pmb{u}}_j^{sid}) / \tau)}$$
  - $N$: Number of users in the batch.
  - $\pmb{u}_i^{sid}$: Global user representation for user $i$.
  - $\tilde{\pmb{u}}_i^{sid}$: Masked user representation for user $i$.
  - $\cos(\cdot, \cdot)$: Cosine similarity.
  - $\tau$: Temperature coefficient.
  - The denominator includes all masked user representations in the batch as potential negative samples.
  - Similarly, a second term $\mathcal{L}_{MSG}^2$ is obtained by exchanging the roles of the global and masked representations.
  - The total Masked Sequence Granularity Loss combines the two directions, $\mathcal{L}_{MSG} = \mathcal{L}_{MSG}^1 + \mathcal{L}_{MSG}^2$.
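The following is a minimal PyTorch sketch of the two-view construction and the InfoNCE-style objective behind this loss; the sequence-encoder step between the views and the user representations is omitted, and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def build_granularity_views(granularity_seqs, mask_token):
    """Two views for the Masked Sequence Granularity Loss: the global view keeps
    every code level; the masked view replaces one randomly chosen level with a
    learnable mask token. granularity_seqs: (B, L, N, d); mask_token: (d,)."""
    B, L, _, _ = granularity_seqs.shape
    masked = granularity_seqs.clone()
    target_level = torch.randint(0, L, (B,))              # one masked level per sequence
    masked[torch.arange(B), target_level] = mask_token     # broadcast over the N positions
    return granularity_seqs, masked

def masked_sequence_granularity_loss(u_global, u_masked, tau=0.1):
    # InfoNCE between global and granularity-masked user representations,
    # applied in both directions; matching pairs sit on the diagonal.
    sim = F.cosine_similarity(u_global.unsqueeze(1), u_masked.unsqueeze(0), dim=-1) / tau
    labels = torch.arange(u_global.size(0))
    return F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)
```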
4.2.3. Training and Inference
4.2.3.1. Training
- Item and User Representation Generation:
  - As described in Section 4.2.1, the SID branch generates the item embedding sequence $\mathcal{E}^{sid}$ and the user representation $\pmb{u}^{sid}$, and the HID branch generates the fused item embedding sequence $\mathcal{E}^f$ and the user representation $\pmb{u}^{hid}$.
- Recommendation Score Calculation: The mutually enhanced HID and SID user representations ($\pmb{u}^{hid}$ and $\pmb{u}^{sid}$) are fused to derive recommendation scores. For a target item $v_j$, its SID embedding $\pmb{e}_j^{sid}$ and HID embedding $\pmb{e}_j^{hid}$ are also used. The probability of recommending item $v_j$ to user $u$ is calculated as:
$$P(v_{N+1} = v_j \mid S_u) = [\pmb{e}_j^{sid} : \pmb{e}_j^{hid}]^\top [\pmb{u}^{sid} : \pmb{u}^{hid}]$$
  - $[\cdot : \cdot]$: Concatenation operator.
  - The user representation is the concatenation of $\pmb{u}^{sid}$ and $\pmb{u}^{hid}$, and the item representation is the concatenation of $\pmb{e}_j^{sid}$ and $\pmb{e}_j^{hid}$.
  - The score is the dot product of the concatenated user and item representations.
- Ranking Loss: The model is optimized using a pairwise ranking loss:
$$\mathcal{L}_{rec} = - \sum_{u \in \mathcal{U}} \sum_{j=1}^N \log \sigma\left(P(v_{j+1} = v^+ \mid S_u) - P(v_{j+1} = v^- \mid S_u)\right)$$
  - $\mathcal{U}$: Set of all users.
  - $N$: Length of the interaction sequence.
  - $\sigma$: Sigmoid function.
  - $v^+$: Ground-truth (positive) item.
  - $v^-$: Corresponding negative item.
  - The loss maximizes the score difference between positive and negative items.
- Overall Training Objective: The total loss combines the main recommendation loss with the two auxiliary alignment losses:
$$\mathcal{L} = \mathcal{L}_{rec} + \beta \cdot \mathcal{L}_{CA} + \gamma \cdot \mathcal{L}_{MSG}$$
  - $\mathcal{L}_{rec}$: Recommendation loss.
  - $\mathcal{L}_{CA}$: Code-guided Alignment Loss.
  - $\mathcal{L}_{MSG}$: Masked Sequence Granularity Loss.
  - $\beta$, $\gamma$: Hyperparameters controlling the contributions of the auxiliary objectives.
The following is the full training and inference pipeline (Algorithm 1 from the original paper):

Algorithm 1: Train and Inference Procedures of H2Rec
Require: User set $\mathcal{U}$, Item set $\mathcal{V}$
1. Indicate the backbone SRS $f_{\theta^1}$ and $f_{\theta^2}$.
2. Indicate the weights of $\mathcal{L}_{CA}$ and $\mathcal{L}_{MSG}$.
3. Derive the semantic embedding $E_{LLM}$ by the LLM and the codebook embedding matrix $E_C$ by RQ-VAE.

Training
4. Initialize the HID item embeddings with dimension-reduced $E_{LLM}$.
5. Initialize the semantic code embeddings by looking up $E_C$.
6. For each batch of users $\mathcal{U}_B$ in $\mathcal{U}$:
7.  Get the embedding sequences of the different granularities from the code embeddings.
8.  Generate the weights $\alpha_l$ for the different granularities by Equation (2).
9.  Derive the final SID item embedding sequence $\mathcal{E}^{sid}$ by Equation (3).
10. Get the fused embedding sequence $\mathcal{E}^f$ by Equation (5).
11. Get the user representations of the SID and HID branches, i.e., $\pmb{u}^{sid}$ and $\pmb{u}^{hid}$.
12. Calculate the probability scores of the ground-truth and negative items by Equation (9).
13. Calculate the ranking loss by Equation (10).
14. Calculate $\mathcal{L}_{CA}$ by Equation (7) and $\mathcal{L}_{MSG}$ by Equation (8).
15. Sum $\mathcal{L}_{rec}$, $\mathcal{L}_{CA}$, and $\mathcal{L}_{MSG}$; then update the parameters.
16. End for.

Inference
17. Load $E_{LLM}$, $E_C$, and $\alpha_l$ for the item embedding layers and the other parameters.
18. For each batch of users $\mathcal{U}_B$ in $\mathcal{U}$:
19. Get the user representations of the SID and HID branches, i.e., $\pmb{u}^{sid}$ and $\pmb{u}^{hid}$.
20. Calculate the probability scores by Equation (9) and output the final next-item prediction.
21. End for.
4.2.3.2. Inference
During inference, the SID and their associated codebook embeddings are pre-cached. Recommendation scores are directly computed using Equation (9) (the score calculation formula), and the item with the highest score is predicted.
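The following is a minimal PyTorch sketch of the scoring and loss computations used during training and inference; the batching scheme and the placeholder values of beta and gamma are assumptions for illustration, not the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def recommendation_score(u_sid, u_hid, e_sid, e_hid):
    # Equation (9): dot product between concatenated user and item representations.
    user = torch.cat([u_sid, u_hid], dim=-1)      # (B, 2d)
    item = torch.cat([e_sid, e_hid], dim=-1)      # (B, 2d), one candidate item per user here
    return (user * item).sum(dim=-1)              # (B,) scores

def pairwise_ranking_loss(pos_score, neg_score):
    # Equation (10): push the positive item's score above the negative item's.
    return -F.logsigmoid(pos_score - neg_score).mean()

def total_loss(l_rec, l_ca, l_msg, beta=0.1, gamma=0.1):
    # Equation (11): recommendation loss plus weighted auxiliary alignment losses.
    # beta and gamma here are placeholders, not the paper's reported values.
    return l_rec + beta * l_ca + gamma * l_msg
```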
5. Experimental Setup
5.1. Datasets
The experiments used three real-world datasets:
- Yelp: A dataset commonly used for reviews and local-business recommendations.
- Amazon Beauty: An Amazon dataset of beauty-product interactions.
- Amazon Instrument: An Amazon dataset of musical-instrument interactions.
For data preprocessing and splitting, the authors followed previous SRS works [18, 21], implying standard procedures like filtering inactive users/items and chronological splitting for training/validation/testing.
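The paper only states that it follows the preprocessing of prior SRS work; the snippet below illustrates the leave-one-out chronological split commonly used in those works (last item for testing, second-to-last for validation) as one plausible instantiation, not the authors' exact pipeline.

```python
from collections import defaultdict

def leave_one_out_split(interactions):
    """Illustrative leave-one-out split often used in SRS papers.
    interactions: list of (user_id, item_id, timestamp) tuples."""
    per_user = defaultdict(list)
    for user, item, ts in interactions:
        per_user[user].append((ts, item))

    train, valid, test = {}, {}, {}
    for user, events in per_user.items():
        items = [item for _, item in sorted(events)]   # chronological order
        if len(items) < 3:                             # too short to split
            train[user] = items
            continue
        train[user] = items[:-2]
        valid[user] = items[-2]                        # second-to-last item for validation
        test[user] = items[-1]                         # last item for testing
    return train, valid, test
```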
The statistics of the datasets are summarized below (Table 1 from the original paper):
| Dataset | # Users | # Items | Sparsity | Avg.length |
| Yelp | 15,720 | 11,383 | 99.89% | 12.23 |
| Beauty | 52,204 | 57,289 | 99.92% | 7.56 |
| Instrument | 40,644 | 30,676 | 99.97% | 8.01 |
Characteristics and Domain:
- Domain: E-commerce (Amazon Beauty and Instrument) and local business reviews (Yelp).
- Scale: The datasets vary in size, with Beauty being the largest in terms of users and items.
- Sparsity: All datasets exhibit extremely high sparsity (over 99.89%), which is typical for recommendation tasks and highlights the long-tail problem the paper aims to address.
- Average Length: The average interaction sequence lengths vary, with Yelp having the longest.

These datasets are well suited for validating the method's performance, especially its ability to handle sparse data and the long-tail problem, given their high sparsity and diverse item popularities.
5.2. Evaluation Metrics
The paper evaluates performance using Hit Rate (H@10) and Normalized Discounted Cumulative Gain (NDCG@10), truncated at 10. Items are split into tail and head groups, with the top 20% most popular items considered head items and the rest tail items.
For every evaluation metric mentioned in the paper, we provide a complete explanation:
5.2.1. Hit Rate (H@K)
- Conceptual Definition: Hit Rate (or Recall) at K measures how often the ground-truth item appears within the top K recommended items. It indicates the model's ability to recall relevant items: a higher Hit Rate means more relevant items are included in the recommendation list.
- Mathematical Formula:
$$\mathrm{H@K} = \frac{\text{Number of users for whom the ground-truth item is in the top K recommendations}}{\text{Total number of users}}$$
Or, for a single user $u$:
$$\mathrm{H@K}(u) = \mathbb{I}(\text{ground-truth item } v^+ \text{ is in the top K recommended items for } u)$$
The overall H@K is typically the average over all users.
- Symbol Explanation:
  - $\mathbb{I}(\cdot)$: Indicator function, which returns 1 if the condition inside is true and 0 otherwise.
  - $v^+$: The actual next item the user interacted with (ground truth).
  - $K$: The size of the recommendation list (here, $K = 10$).
5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)
- Conceptual Definition: NDCG at K is a measure of ranking quality. It evaluates how well the recommender system places highly relevant items at the top of the recommendation list. It is "normalized" because it compares the DCG (Discounted Cumulative Gain) of the recommendation list to the ideal DCG (where all relevant items are perfectly ranked at the top), giving higher scores to relevant items that appear earlier in the list.
- Mathematical Formula:
First, Cumulative Gain (CG):
$$\mathrm{CG@K} = \sum_{i=1}^K \mathrm{rel}_i$$
Then, Discounted Cumulative Gain (DCG):
$$\mathrm{DCG@K} = \sum_{i=1}^K \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}$$
Finally, Normalized Discounted Cumulative Gain (NDCG):
$$\mathrm{IDCG@K} = \sum_{i=1}^{|\mathrm{REL}|} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$
- Symbol Explanation:
  - $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the recommendation list. In implicit-feedback scenarios like this paper's (where interactions are binary), $\mathrm{rel}_i$ is 1 if the item at position $i$ is the ground-truth item and 0 otherwise.
  - $K$: The size of the recommendation list (here, $K = 10$).
  - $\log_2(i+1)$: Logarithmic discount factor, reducing the impact of items at lower ranks.
  - $\mathrm{IDCG@K}$: Ideal DCG at K, the maximum possible DCG if all relevant items were perfectly ranked at the top.
  - $|\mathrm{REL}|$: Number of relevant items (for a single ground-truth item, this is 1).
5.3. Baselines
The paper compares H2Rec against various state-of-the-art baselines categorized into three types:
- HID Embedding Baselines (Hash ID based): These models primarily rely on Hash IDs and collaborative filtering.
  - BERT4Rec [24]: A Transformer-based sequential recommender that uses a masked self-attention mechanism, similar to BERT in NLP.
  - SASRec [12]: A Transformer-based sequential recommender that uses self-attention to capture long-range dependencies.
  - MELT [13]: A method designed to enhance long-tail items by exploiting co-occurrence patterns from popular items.
  - LLM-ESR [18]: An LLM-based method that enriches HID embeddings with semantic information derived from large language models; it represents the state of the art in LLM-enhanced HID methods.
- SID Embedding Baselines (Semantic ID based): These models use Semantic IDs derived from item content.
  - PG-SID [37]: A method using Semantic IDs for transferable sequential recommenders.
  - SPM-SID [23]: Another SID-based approach for recommendation.
  - CCFRec [16]: A method that bridges textual-collaborative gaps through semantic codes for sequential recommendation, progressively incorporating semantic embeddings from multiple attribute dimensions.
  - PSRQ+MCCA (denoted as PSRQ) [30]: Progressive Semantic Residual Quantization combined with Multimodal-Joint Interest Modeling, a recent SID-based method.
- Hybrid Embedding Baselines (combining HID and SID): These models integrate both Hash and Semantic IDs.
  - URL4DR [15]: Unified Semantic and ID Representation Learning for deep recommenders.
  - MME-SID [32]: Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs.
  - PCR-CA [25]: Parallel Codebook Representations with Contrastive Alignment for multiple-category app recommendation.

These baselines are representative because they cover the main evolutionary paths in sequential recommendation: traditional ID-based models, LLM-enhanced ID-based models, pure Semantic ID-based models, and various hybrid approaches. This comprehensive selection allows for a thorough validation of H2Rec's superiority and its ability to address the specific challenges outlined.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents extensive experimental results demonstrating the effectiveness of H2Rec across different item popularity groups on three real-world datasets.
The following are the results from Table 2 of the original paper:
| Dataset | Group | Metric | HID Emb | | | | SID Emb | | | | Hybrid Emb | | | Ours | Improv. % |
| | | | BERT4Rec | SASRec | MELT | LLM-ESR | SPM-SID | PG-SID | CCFRec | PSRQ | URL4DR | MME-SID | PCR-CA | H$^2$Rec | |
| Yelp | Overall | H@10 N@10 | 0.5314 0.3147 | 0.5940 0.3601 | 0.6101 | 0.6573 | 0.4727 | 0.4881 | 0.5947 | 0.5438 | 0.6402 | 0.6431 | 0.6447 | 0.6692* | 1.81% |
| Tail | 0.3394 | 0.4102 | 0.3148 | 0.3251 | 0.3694 | 0.3422 | 0.3776 | 0.3884 | 0.3971 | 0.4272* | 4.14% | ||||
| H@10 | 0.0177 | 0.1175 | 0.1223 | 0.1802 | 0.2441 | 0.2492 | 0.2478 | 0.2543 | 0.1957 | 0.2215 | 0.2032 | 0.2693* | 5.90% | ||
| N@10 | 0.0068 | 0.0588 | 0.0599 | 0.0676 | 0.1162 | 0.1186 | 0.1171 | 0.1210 | 0.0914 | 0.1103 | 0.0954 | 0.1306* | 7.93% | ||
| Beauty | Head | H@10 N@10 | 0.6919 0.3876 | 0.7413 0.4592 | 0.7790 0.4745 | 0.8059 0.5122 | 0.5218 0.3079 | 0.5334 0.3147 | 0.7071 0.4243 | 0.6398 0.3721 | 0.7699 0.4735 | 0.7702 0.4823 | 0.7748 0.4860 | 0.8324* 0.5483* | 3.29% 7.05% |
| Overall | H@10 | 0.3992 | 0.4401 | 0.4890 | 0.5544 | 0.3715 | 0.3836 | 0.4398 | 0.4038 | 0.5464 | 0.5509 | 0.5539 | 0.5742* | 3.57% | |
| N@10 | 0.2401 | 0.3043 | 0.3357 | 0.3702 | 0.2617 | 0.2703 | 0.3021 | 0.2845 | 0.3675 | 0.3617 | 0.3673 | 0.3957* | 6.89% |
| H@10 N@10 | 0.0123 0.0052 | 0.0921 0.0675 | 0.1536 0.0877 | 0.2198 0.1074 | 0.2208 0.1355 | 0.2254 | 0.2238 | 0.2300 | 0.1967 | 0.2177 | 0.2048 | 0.2557* | 11.17% 8.01% | ||
| Overall | Head | H@10 | 0.4988 | 0.5291 | 0.5815 | 0.1383 0.4485 | 0.1323 | 0.1411 | 0.1105 | 0.1404 | 0.1342 | 0.1524* | |||
| N@10 | 0.2971 | 0.4007 | 0.4106 | 0.6377 | 0.4388 | 0.5099 | 0.4875 | 0.6299 | 0.6393 | 0.6419 | 0.6502* | 1.29% | |||
| 0.4289 | 0.3251 | 0.3323 | 0.3887 | 0.3612 | 0.4300 | 0.4379 | 0.4401 | 0.4538* | 3.11% | ||||||
| H@10 | 0.4601 | 0.5057 | 0.5510 | 0.5881 | 0.4312 | 0.4453 | 0.5078 | 0.4687 | 0.6005 | 0.6044 | 0.6072 | 0.6184* | 1.84% | ||
| Instrument | N@10 H@10 | 0.3213 | 0.3442 0.0489 | 0.3622 | 0.3809 | 0.2908 | 0.3003 | 0.3376 | 0.3161 | 0.4024 | 0.4027 | 0.4056 | 0.4153* | 2.39% | |
| Tail | N@10 | 0.0199 | 0.0257 | 0.0766 | 0.0998 | 0.2058 | 0.2101 | 0.2099 | 0.2144 | 0.1605 | 0.2044 | 0.1827 | 0.2382* | 11.10% | |
| 0.0143 | 0.0459 | 0.0549 | 0.0990 | 0.1010 | 0.0907 | 0.1031 | 0.0828 | 0.0991 | 0.1025 | 0.1233* | 11.88% | ||||
| H@10 | 0.5028 | 0.5806 | 0.6188 | 0.6676 | 0.5569 | 0.5693 | 0.5629 | 0.6188 | 0.6643 | 0.6646 | 0.6701 | 0.6832* | 1.95% | ||
| Head | N@10 | 0.3190 | 0.3764 | 0.4237 | 0.4522 | 0.3813 | 0.3898 | 0.4192 | 0.4237 | 0.4483 | 0.4498 | 0.4543 | 0.4638* | 2.09% | |
Overall Performance:
- H2Rec consistently achieves the best overall performance across all three datasets (Yelp, Beauty, Instrument) for both H@10 and N@10.
- It shows statistically significant improvements (indicated by *) over all baselines, with relative gains ranging from 1.29% to 11.88%. This validates the core hypothesis that harmonizing SID and HID leads to superior performance.
- Comparison with LLM-ESR and Hybrid Methods: After H2Rec, Hybrid Embedding methods (e.g., PCR-CA, MME-SID) and LLM-based HID methods (e.g., LLM-ESR) generally perform next best. This suggests that incorporating LLM-derived knowledge or combining multiple embeddings is beneficial, but H2Rec's explicit and fine-grained alignment strategies provide a further significant boost.
- Comparison with SID Embedding Baselines: SID Embedding baselines show mixed performance. While some (CCFRec, PSRQ) are comparable to HID-based methods, they generally underperform H2Rec and other hybrid approaches. This indicates that merely using SID or improving semantic quality, without a dedicated recommendation-oriented architectural design that addresses collaborative overwhelming, is insufficient.

Popularity Breakdown Analysis (Head vs. Tail Items):
- HID Embedding Methods (e.g., SASRec, LLM-ESR): These methods perform strongly on head items (thanks to abundant interaction data) but suffer significant drops on tail items, confirming their reliance on dense collaborative signals and their vulnerability to the long-tail problem. For instance, SASRec's N@10 on Yelp drops from 0.4592 (Head) to 0.0588 (Tail).
- SID Embedding Methods (e.g., PSRQ, CCFRec): These generally perform better on tail items than pure HID-based methods, effectively leveraging semantic information to address sparsity; for example, PSRQ's N@10 on Yelp for tail items (0.1210) is significantly higher than SASRec's (0.0588). However, their performance on head items is limited due to the coarse-grained nature of SID and quantization-induced ID collisions (the collaborative overwhelming phenomenon).
- H2Rec:
  - Achieves the best results in the Tail group, substantially outperforming all SID-based baselines. This confirms its success in transferring semantic knowledge to cold items and mitigating noisy collaborative sharing. On Yelp, H2Rec achieves a Tail N@10 of 0.1306, notably higher than PSRQ (0.1210) and CCFRec (0.1171).
  - Simultaneously maintains state-of-the-art performance in the Head group, surpassing all HID-based methods, which demonstrates that it preserves fine-grained collaborative information for popular items and mitigates the collaborative overwhelming phenomenon. On Yelp, H2Rec achieves a Head N@10 of 0.5483, outperforming LLM-ESR (0.5122).

This analysis validates that H2Rec eliminates the traditional trade-off between head- and tail-item performance, achieving balanced and superior recommendation quality across the entire item popularity spectrum.
6.2. Ablation Studies / Parameter Analysis
To understand the contribution of each component, ablation studies were conducted on the Yelp dataset.
The following are the results from Table 3 of the original paper:
| Variants | Overall | Tail | Head | |||
| N@10 | H@10 | N@10 | H@10 | N@10 | H@10 | |
| H$^2$Rec | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 |
| w/o FN | 0.4123 | 0.6605 | 0.1013 | 0.2347 | 0.5404 | 0.8208 |
| w/o Lca | 0.4044 | 0.6455 | 0.1067 | 0.2161 | 0.4920 | 0.7719 |
| w/o MCA | 0.4072 | 0.6523 | 0.1105 | 0.2190 | 0.5218 | 0.8077 |
| w/o LMSG | 0.4202 | 0.6667 | 0.1091 | 0.2285 | 0.5398 | 0.8258 |
- Effectiveness of the Fusion Network (FN) and Masked Sequence Granularity Loss (LMSG):
  - Removing the Multi-granularity Fusion Network (w/o FN) results in a performance drop across all metrics, especially on tail items (e.g., Tail N@10 drops from 0.1306 to 0.1013). This confirms that adaptively aggregating multi-granular semantic views is crucial for constructing fine-grained SID representations, which in turn benefits tail items.
  - Removing the Masked Sequence Granularity Loss (w/o LMSG) also leads to a noticeable decline, particularly on tail items (Tail N@10 drops from 0.1306 to 0.1091), indicating that this loss improves the quality and robustness of SID embeddings by encouraging the model to learn meaningful correlations between different semantic granularities.
- Role of Cross-branch Interaction (MCA):
  - Removing the Multi-granularity Cross Attention (w/o MCA) leads to a significant degradation, particularly on head items (Head N@10 drops from 0.5483 to 0.5218). This is critical evidence that MCA allows the HID branch to selectively absorb beneficial multi-granular semantic information from SID without losing the identifier uniqueness required for head items, thereby preventing collaborative overwhelming.
- Importance of Code-guided Alignment (LCA):
  - Removing the Code-guided Alignment Loss (w/o LCA) causes the largest performance drop across both the Head and Tail groups (e.g., Overall N@10 drops from 0.4272 to 0.4044, Head N@10 from 0.5483 to 0.4920, and Tail N@10 from 0.1306 to 0.1067). This highlights the critical role of LCA in aligning the semantic and collaborative spaces, enabling accurate and noise-resistant sharing of collaborative information among semantically related items, especially for tail items borrowing signals from head items.

These ablation results confirm that all proposed modules (Fusion Network, LCA, MCA, and LMSG) are essential and contribute significantly to H2Rec's superior performance by addressing the limitations of HID and SID and harmonizing their strengths.
6.3. Hyperparameter Analysis
The paper analyzes the sensitivity of two key hyperparameters: $\beta$ (the weight of the Code-guided Alignment Loss $\mathcal{L}_{CA}$) and $\gamma$ (the weight of the Masked Sequence Granularity Loss $\mathcal{L}_{MSG}$).
The following figure (Figure 3 from the original paper) shows the hyper-parameter results on the Yelp Dataset:
The figure shows the hyper-parameter tuning results on the Yelp dataset. The left panel plots H@10, with "Masked" marked by triangles and "Alignment" by dots; the right panel plots N@10. In both panels the horizontal axis is the hyper-parameter value and the vertical axis is the corresponding metric.
- Alignment Weight $\beta$:
  - Performance (both H@10 and N@10) initially increases as $\beta$ grows, reaching an optimum at a moderate value.
  - A small $\beta$ provides insufficient code-guided alignment, so too few collaborative signals are transferred to semantically similar tail items.
  - An excessively large $\beta$ leads to a performance decline, indicating that over-aligning the semantic space to the collaborative space can introduce noise and harm generalization. A delicate balance is therefore needed to control the information flow between the two spaces.
- Granularity Weight $\gamma$:
  - Performance for the Masked Sequence Granularity Loss also peaks at a moderate value of $\gamma$.
  - As an auxiliary regularizer, $\gamma$ needs a balanced weight: low values fail to ensure the robustness of SID representations because internal consistency among multi-granular semantics is not adequately enforced.
  - Excessively large values interfere with the primary recommendation task, as the model over-prioritizes learning internal semantic consistency at the expense of overall recommendation accuracy.

These analyses confirm that proper hyperparameter tuning is needed for H2Rec to leverage its alignment strategies effectively without introducing detrimental effects.
6.4. In-depth Analysis for $\mathcal{L}_{CA}$
This section further validates the Code-guided Alignment Loss's effectiveness in mitigating noisy collaborative sharing by analyzing two key design factors: the code-matching threshold and the co-occurrence context window size.
6.4.1. Impact of Code Matching Threshold
The code-matching threshold determines how many semantic code levels items must share to be considered semantically similar and included in the positive set $\mathcal{P}(i)$.
The following are the results from Table 4 of the original paper:
| code num | Overall | Tail | Head | |||
| N@10 | H@10 | N@10 | H@10 | N@10 | H@10 | |
| Removed | 0.4095 | 0.6505 | 0.1177 | 0.2211 | 0.4970 | 0.7769 |
| 1 | 0.4059 | 0.6473 | 0.1058 | 0.2044 | 0.4955 | 0.7712 |
| 2 | 0.4199 | 0.6582 | 0.1204 | 0.2591 | 0.5385 | 0.8247 |
| 3 (Ours) | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 |
- Setting the threshold to 1 (considering only the coarsest semantic layer) leads to notable performance degradation. Overly coarse categories group many weakly related items, introducing noise into the positive set and undermining the quality of collaborative signal transfer.
- Increasing the threshold to 3 (the default setting for H2Rec) yields significant improvements across all metrics, especially for tail items. This demonstrates that deeper semantic matching is crucial for filtering out noise and ensuring that only truly similar items share collaborative signals, making the LCA more effective.
- In the "Removed" row, LCA is still active but the code-guided positives are absent. Its performance is slightly better than the threshold-1 setting, suggesting that very coarse semantic matching can be worse than no explicit semantic filtering when a shared top-level code does not imply strong similarity.
6.4.2. Impact of Context Window Size
The context window size defines the local co-occurrence window used to identify additional items for the positive set $\mathcal{P}(i)$.
The following are the results from Table 5 of the original paper:
| context | Overall N@10 | Overall H@10 | Tail N@10 | Tail H@10 | Head N@10 | Head H@10 |
| --- | --- | --- | --- | --- | --- | --- |
| Removed | 0.4148 | 0.6522 | 0.1176 | 0.2230 | 0.5072 | 0.7911 |
| 1 | 0.4190 | 0.6523 | 0.1208 | 0.2294 | 0.5205 | 0.8062 |
| 3 (Ours) | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 |
| 5 | 0.4205 | 0.6637 | 0.1244 | 0.2638 | 0.5441 | 0.8299 |
- Expanding the co-occurrence window from 1 to 3 (the default setting) improves performance across the board, confirming the utility of incorporating local, sequentially related items as additional positives.
- When the window is enlarged to 5, performance drops slightly: a too-large window starts to include less relevant items, introducing noise into the positive set and diluting the quality of the collaborative signals.
- The "Removed" row implies LCA is still active, but the co-occurrence part of the positive set is absent. Performance improves once the co-occurrence positives are introduced, underscoring the value of this component.
These results collectively demonstrate that H2Rec's design for the Code-guided Alignment Loss, combining multi-level semantic matching with a judiciously sized local collaborative context, is essential for achieving accurate and noise-robust alignment between the SID and HID representations; a small sketch of this positive-set construction follows.
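Under the same assumptions as the previous sketch (and reusing its `semantic_positives` helper), the two sources of positives could be combined along these lines; the helper names are hypothetical and the paper's exact construction may differ:

```python
def cooccurrence_positives(sequence, target_pos, window=3):
    """Items appearing within `window` positions of the target item in one interaction sequence."""
    lo = max(0, target_pos - window)
    hi = min(len(sequence), target_pos + window + 1)
    return [sequence[i] for i in range(lo, hi) if i != target_pos]


def build_positive_set(target_id, target_pos, sequence, sid_table,
                       threshold=3, window=3):
    """Union of code-matched semantic positives and local co-occurrence positives."""
    positives = set(semantic_positives(target_id, sid_table, threshold))
    positives.update(cooccurrence_positives(sequence, target_pos, window))
    positives.discard(target_id)
    return positives
```

The tables above match the intuition encoded here: a window of 3 captures genuinely related neighbours, while a window of 5 begins to admit noise.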
6.5. Group Analysis
For a more fine-grained analysis, the paper divides the item popularity groups from Table 2 into five finer segments.
The following figure (Figure 4 from the original paper) shows the detailed results in different item groups on the Yelp dataset:
(Figure placeholder) The figure shows detailed H@10 and N@10 results across item popularity groups on the Yelp dataset, comparing three recommendation methods; the proposed method outperforms LLM-ESR and PCR-CA.
- Comparison of Hybrid Methods: The figure shows that PCR-CA (a hybrid embedding method) achieves a clear performance gain over LLM-ESR (an LLM-enhanced HID method) for groups with popularity lower than 60 (the first three grouped columns). For groups with popularity greater than 60 (the last two grouped columns), PCR-CA performs almost on par with LLM-ESR. This illustrates the general advantage of hybrid embedding approaches over purely LLM-enhanced HID methods across various popularity levels.
- H2Rec's Consistent Superiority: H2Rec consistently outperforms all existing methods across all five popularity groups for both H@10 and N@10. This is the strongest evidence that H2Rec successfully leverages its dual-branch architecture and dual-level alignment strategy to achieve additional optimization, breaking the performance bottleneck and providing robust recommendations for both very popular and very niche items. This fine-grained analysis reinforces the conclusion from the Head/Tail breakdown.
6.6. Generality Validation
6.6.1. Quantization Mechanism Analysis
To demonstrate H2Rec's generality, its performance is evaluated using SID generated by three common vector quantization mechanisms: VQ (Vector Quantization), PQ (Product Quantization), and RQ (Residual Quantization).
The following are the results from Table 6 of the original paper:
| Model | Overall N@10 | Overall H@10 | Tail N@10 | Tail H@10 | Head N@10 | Head H@10 |
| --- | --- | --- | --- | --- | --- | --- |
| VQ + H2Rec | 0.4091 | 0.6531 | 0.0943 | 0.1986 | 0.5018 | 0.7869 |
| PQ + H2Rec | 0.4124 | 0.6597 | 0.1091 | 0.2285 | 0.5118 | 0.7958 |
| RQ + H2Rec | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 |
- H2Rec maintains competitive performance across all quantization mechanisms (VQ, PQ, RQ), demonstrating the robustness and general applicability of its design.
- RQ (Residual Quantization) achieves the strongest performance. This is attributed to its residual coding structure, which naturally captures fine-grained semantic granularity, helping to mitigate data sparsity and yielding better performance on Tail items (a toy illustration of residual coding follows this list).
- PQ (Product Quantization) provides moderate improvements. It partitions the embedding space into subspaces, offering a balanced trade-off between semantic precision and code compactness.
- VQ (Vector Quantization) performs the weakest. Mapping each item to a single discrete code results in severe semantic collapse, reducing item uniqueness across both head and tail groups.

Even with the least expressive VQ setting, H2Rec achieves strong results, confirming its ability to effectively exploit semantic signals regardless of the underlying quantization mechanism. This highlights its generalizability and plug-and-play nature.
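To make the contrast between the quantizers tangible, here is a minimal NumPy sketch of residual-quantization encoding under the assumption of already-trained per-level codebooks; it only illustrates why earlier RQ levels are coarser and later levels finer, and is not the paper's RQ-VAE:

```python
import numpy as np

def rq_encode(x, codebooks):
    """Residual-quantize vector x with a list of per-level codebooks.

    codebooks: list of arrays, each of shape (codebook_size, dim).
    Each level quantizes the residual left by the previous levels, so the
    first code captures coarse semantics and later codes refine it.
    """
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for codebook in codebooks:
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)
        residual = residual - codebook[idx]
    return tuple(codes)

# Plain VQ is the one-level special case: rq_encode(x, [codebook_0]),
# which is why it collapses many items onto a single shared code.
```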
6.6.2. Backbone Analysis
To verify the model-agnostic nature of H2Rec, it is evaluated with two common sequential recommendation backbones: GRU4Rec (RNN-based) and BERT4Rec (Transformer-based). Comparisons are made against PCR-CA and LLM-ESR using the same backbones.
The following are the results from Table 7 of the original paper:
| Backbone | Model | Overall N@10 | Overall H@10 | Tail N@10 | Tail H@10 | Popular N@10 | Popular H@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GRU4Rec | PCR-CA | 0.3613 | 0.5988 | 0.0745 | 0.1889 | 0.4311 | 0.7204 |
| GRU4Rec | LLM-ESR | 0.3627 | 0.6075 | 0.0482 | 0.0952 | 0.4491 | 0.7338 |
| GRU4Rec | Ours | 0.3804 | 0.6239 | 0.0983 | 0.2058 | 0.4634 | 0.7470 |
| BERT4Rec | PCR-CA | 0.4173 | 0.6604 | 0.0720 | 0.1689 | 0.5346 | 0.8175 |
| BERT4Rec | LLM-ESR | 0.4205 | 0.6635 | 0.0503 | 0.1247 | 0.5444 | 0.8223 |
| BERT4Rec | Ours | 0.4298 | 0.6724 | 0.0991 | 0.1735 | 0.5545 | 0.8344 |
- Across both RNN-based (GRU4Rec) and Transformer-based (BERT4Rec) architectures, H2Rec consistently surpasses all baselines (PCR-CA, LLM-ESR).
- This confirms that H2Rec is not restricted to a specific encoder design. It functions as a universal, plug-and-play enhancement module for various SRS backbones, demonstrating its broad applicability (a simplified sketch of this wrapping follows this list).
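The plug-and-play claim boils down to the dual-branch enhancement operating on whatever hidden states the backbone produces. A highly simplified PyTorch-style sketch of that idea (illustrative module names, with a single cross-attention standing in for the paper's Multi-granularity Cross Attention Network) might look like this:

```python
import torch.nn as nn

class DualBranchWrapper(nn.Module):
    """Wrap an arbitrary sequence backbone with HID and SID embedding branches.

    backbone: any module mapping (B, L, dim) item embeddings to hidden states,
              e.g. a GRU4Rec- or BERT4Rec-style encoder.
    """
    def __init__(self, backbone, num_items, num_codes, dim):
        super().__init__()
        self.hid_emb = nn.Embedding(num_items, dim)   # unique collaborative identity
        self.sid_emb = nn.Embedding(num_codes, dim)   # shared multi-level semantic codes
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.backbone = backbone

    def forward(self, item_seq, code_seq):
        # item_seq: (B, L) hash ids; code_seq: (B, L, levels) semantic code ids
        hid = self.hid_emb(item_seq)                  # (B, L, dim)
        sid = self.sid_emb(code_seq).mean(dim=2)      # (B, L, dim), pooled over levels
        # HID queries SID: semantic signals are injected without overwriting identity
        enriched, _ = self.cross_attn(hid, sid, sid)
        return self.backbone(hid + enriched)
```

Because the wrapper only adds embeddings and one attention layer in front of the encoder, swapping GRU4Rec for BERT4Rec (or any other sequence model) requires no change to the enhancement itself.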
6.7. Semantic Code Analysis
This section analyzes H2Rec's performance under SID with different qualities, focusing on code layers and codebook size.
The following are the results from Table 8 of the original paper:
| Settings | Overall N@10 | Overall H@10 | Tail N@10 | Tail H@10 | Head N@10 | Head H@10 | Coll.% | Util.% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3x128 | 0.4145 | 0.6510 | 0.1152 | 0.2480 | 0.5365 | 0.8150 | 33.92 | 0.54 |
| 3x256 | 0.4192 | 0.6585 | 0.1215 | 0.2565 | 0.5402 | 0.8215 | 29.92 | 0.068 |
| 3x512 | 0.4235 | 0.6640 | 0.1268 | 0.2630 | 0.5445 | 0.8270 | 25.85 | 0.008 |
| 4x128 | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 | 22.28 | 0.004 |
| 4x256 | 0.4338 | 0.6785 | 0.1385 | 0.2790 | 0.5540 | 0.8410 | 11.87 | < 0.001 |
- Impact of Codebook Capacity (e.g., 3x128 vs. 3x256 vs. 3x512): Expanding codebook capacity (e.g., from 128 to 512 codewords per layer with 3 layers) consistently yields performance improvements. This is attributed to a significant reduction in collision rates (Coll.%); lower collision rates mitigate semantic ambiguity and enhance the distinctiveness of item representations.
- Impact of Code Layers (e.g., 3x128 vs. 4x128): Increasing the number of code layers (e.g., from 3 to 4, keeping the per-layer codebook size at 128) also improves performance and reduces collision rates, suggesting that more layers capture finer semantic granularities and lead to better representations.
- Trade-off between Performance and Efficiency: While larger codebook sizes and more layers generally improve performance by reducing collision rates, this comes at the cost of codebook redundancy and potentially computational efficiency. For example, the 4x256 setting achieves the best performance and the lowest collision rate (11.87%) but suffers from an extremely low utilization rate (Util.% < 0.001%). A low utilization rate means many codewords in the codebook are never used, indicating inefficiency (a sketch of how these two rates can be computed follows this analysis).
- Optimal Setting: The authors choose 4x128 as the default setting because it strikes a balance between minimizing collision rates (22.28%) to ensure distinctiveness and maintaining a reasonable (though still low) utilization rate (0.004%) for efficiency.

This analysis provides valuable insights into how the design choices for Semantic IDs (number of layers and codebook size) impact H2Rec's performance, especially the trade-off between representation distinctiveness (lower collision) and codebook efficiency (higher utilization).
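For readers who want to reproduce the two diagnostic columns, here is a small sketch of one way the collision and utilization rates could be computed. The paper's exact definitions may differ; this version treats a collision as any item whose full SID is shared, and utilization as distinct full codes over all possible level combinations, which is consistent with the magnitudes in Table 8:

```python
from collections import Counter

def collision_rate(sid_table):
    """Percentage of items whose full semantic ID is shared with at least one other item."""
    counts = Counter(sid_table.values())
    collided = sum(1 for sid in sid_table.values() if counts[sid] > 1)
    return 100.0 * collided / len(sid_table)

def utilization_rate(sid_table, num_levels, codebook_size):
    """Percentage of possible full codes (codebook_size ** num_levels) actually assigned."""
    used = len(set(sid_table.values()))
    return 100.0 * used / (codebook_size ** num_levels)
```

Under this reading, adding a fourth level multiplies the denominator by the codebook size, which is consistent with utilization dropping by roughly two orders of magnitude between the 3x128 and 4x128 rows even as collisions fall.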
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously investigated the fundamental trade-off between identifier uniqueness (inherent in Hash IDs for head items) and semantic generalization (offered by Semantic IDs for tail items) in sequential recommendation. It formally identified the collaborative overwhelming phenomenon, where the quantization process of SID compromises uniqueness, leading to a performance seesaw between popular and niche items.
To address this dilemma, the authors proposed H2Rec, a novel dual-branch framework that harmonizes Hash IDs (HID) and Semantic IDs (SID). The framework integrates a Multi-granularity Fusion Network in the SID branch to construct fine-grained semantic representations and a Multi-granularity Cross Attention Network in the HID branch to selectively inject semantic signals while preserving unique collaborative identity. This architecture effectively alleviates the Semantic Homogeneity problem and prevents the Collaborative Overwhelming phenomenon.
Furthermore, H2Rec introduced a Dual-level Alignment Strategy, comprising a Code-guided Alignment Loss at the item level and a Masked Sequence Granularity Loss at the user level. These mechanisms facilitate knowledge transfer between the semantic and collaborative spaces, enabling long-tail items to "borrow" high-quality signals from semantically similar head items and mitigating the Noisy Collaborative Sharing problem, while enhancing robust user preference modeling.
Extensive experiments on three real-world datasets demonstrated that H2Rec consistently outperforms state-of-the-art baselines. Crucially, it successfully balances recommendation quality across head and tail items, achieving substantial gains on tail items without sacrificing performance on head items, thereby breaking a long-standing performance bottleneck. The framework's quantization-agnostic and model-agnostic nature further underscores its generalizability and practical utility.
7.2. Limitations & Future Work
The paper does not explicitly list "Limitations" as a separate section but implies some through its discussion, such as the codebook redundancy observed in Semantic Code Analysis (Table 8), where increasing codebook size for performance came at the cost of extremely low utilization rates. This suggests potential inefficiencies in SID generation or storage.
Potential future research directions, though not explicitly stated as such, could be inferred from the paper's focus:
- Dynamic Codebook Adaptation: Investigating adaptive vector quantization methods that can dynamically adjust codebook size and utilization to maintain efficiency while maximizing semantic precision, especially given the low utilization rates observed.
- Enhanced Semantic-Collaborative Interaction: Exploring more sophisticated interaction mechanisms between the SID and HID branches beyond cross-attention, potentially involving graph neural networks or more complex fusion architectures that can capture higher-order relationships.
- Explanation and Interpretability: With the increased complexity of dual-branch modeling and multi-level alignment, future work could focus on improving the interpretability of H2Rec's recommendations, explaining why a particular item was recommended based on its semantic and collaborative features.
- Real-time Adaptation: Further optimizing the framework for real-time recommendation scenarios, especially for dynamically changing item features or user preferences.
- Beyond Next-Item Prediction: Extending H2Rec to other recommendation tasks, such as list recommendation or session-based recommendation with diverse objectives.
7.3. Personal Insights & Critique
This paper provides a highly comprehensive and effective solution to a long-standing problem in sequential recommendation: the persistent trade-off between head and tail item performance. The explicit identification of the collaborative overwhelming phenomenon is a significant contribution, as it clearly articulates a key limitation of SID-based methods that was previously less formally recognized.
The dual-branch architecture of H2Rec is intuitively compelling. By maintaining separate but interacting HID and SID representations, the model can specialize in capturing different facets of item information without forcing them into a single, potentially compromising, representation. The Multi-granularity Cross Attention Network is particularly clever in its design, allowing the HID to "query" the SID for semantic enrichment while protecting its unique identity – a direct answer to the collaborative overwhelming problem. Similarly, the Code-guided Alignment Loss offers a more nuanced approach to contrastive learning, ensuring that tail items borrow high-quality, semantically relevant signals, rather than noisy co-occurrence patterns.
Transference and Application: The methods and conclusions of H2Rec can certainly be transferred to other domains facing similar long-tail challenges or the need to harmonize different types of item identifiers (e.g., multimodal features, knowledge graph embeddings). For example, in fashion recommendation, where visual semantics are critical, SID could be derived from image features, and H2Rec could harmonize these with HID. In scientific article recommendation, SID could be derived from abstract/full text and HID from citation networks. The model-agnostic nature of H2Rec makes it highly adaptable.
Potential Issues or Areas for Improvement:
- Computational Cost: A dual-branch architecture with multi-granularity fusion, cross-attention, and dual-level alignment is inherently more complex and likely more computationally expensive than single-branch models. While the paper shows performance gains, a deeper analysis of the trade-off between computational cost (training and inference time, memory footprint) and performance would be valuable. The use of RQ-VAE for SID generation also adds an initial computational overhead.
- Hyperparameter Sensitivity: As shown in the hyperparameter analysis, the weights of the two auxiliary losses are crucial. Finding optimal values for new datasets might require significant tuning, which can be a practical challenge. The choices of code-matching threshold and context window size also indicate sensitivity.
- Interpretability of Multi-Granularities: While multi-granular semantics are theoretically appealing, understanding what specific semantic aspects each quantization level represents could be challenging. More tools for interpreting these granularities could further enhance the model's transparency.
- Codebook Redundancy: The low utilization rates observed in the Semantic Code Analysis (Table 8) suggest that a significant portion of the learned codebook is unused. This represents wasted capacity and potential for optimization. Future work could explore more adaptive or dynamic codebook learning strategies to maximize utilization without sacrificing performance or increasing collision rates.
- Generalizability of LLM Embeddings: The performance relies heavily on the quality of LLM-derived semantic embeddings. While text-embedding-ada-002 is powerful, its biases or limitations could propagate through the SID branch. Exploring robustness to different LLM encoders or incorporating other forms of semantic information (e.g., knowledge graphs) could be interesting.

Overall, H2Rec presents a robust and well-thought-out solution that advances the state-of-the-art in sequential recommendation by intelligently harmonizing disparate item representations. Its critical analysis of existing problems and elegant architectural design make it a significant contribution to the field.