The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation
TL;DR Summary
The H$^2$Rec framework harmonizes Semantic IDs and Hash IDs to tackle long-tail issues in sequential recommendation systems, utilizing a dual-branch architecture for capturing multi-granular semantics and a dual-level alignment strategy for knowledge transfer.
Abstract
Conventional Sequential Recommender Systems (SRS) typically assign unique Hash IDs (HID) to construct item embeddings. These HID embeddings learn collaborative information solely from historical user-item interactions, making them vulnerable to situations where most items are rarely consumed (the long-tail problem). Recent methods that incorporate auxiliary information often suffer from noisy collaborative sharing caused by co-occurrence signals or semantic homogeneity caused by flat dense embeddings. Semantic IDs (SIDs), with their capability of code sharing and multi-granular semantic modeling, provide a promising alternative. However, the collaborative overwhelming phenomenon hinders the further development of SID-based methods: the quantization mechanisms commonly compromise the uniqueness of identifiers required for modeling head items, creating a performance seesaw between head and tail items. To address this dilemma, we propose H$^2$Rec, a novel framework that harmonizes SID and HID. Specifically, we devise a dual-branch modeling architecture that enables the model to capture the multi-granular semantics within SID while preserving the unique collaborative identity of HID. Furthermore, we introduce a dual-level alignment strategy that bridges the two representations, facilitating knowledge transfer and supporting robust preference modeling. Extensive experiments on three real-world datasets show that H$^2$Rec effectively balances recommendation quality for both head and tail items while surpassing existing baselines. The implementation code is available at https://github.com/ziwliu8/H2Rec.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation." It focuses on addressing the limitations of traditional sequential recommender systems by integrating two types of item identifiers.
1.2. Authors
The authors of this paper are:
- Ziwei Liu (City University of Hong Kong, Hong Kong, China)
- Yejing Wang (City University of Hong Kong, Hong Kong, China)
- Qidong Liu (Xi'an Jiaotong University & City University of Hong Kong, Xi'an, China)
- Zijian Zhang (City University of Hong Kong, Hong Kong, China)
- Wei Huang (Independent Researcher, Beijing, China)
- Chong Chen (Tsinghua University, Beijing, China)
- Xiangyu Zhao (City University of Hong Kong, Hong Kong, China)
The authors are primarily affiliated with City University of Hong Kong, with contributions from Xi'an Jiaotong University, Tsinghua University, and an independent researcher. Their research backgrounds appear to be in computer science, particularly in areas related to recommender systems, information retrieval, and potentially natural language processing given the mention of large language models.
1.3. Journal/Conference
The venue field in the paper reads "In ." without naming a specific journal or conference, so the exact venue is not identified in the bibliographic information. However, the paper is formatted as an ACM publication, suggesting it is intended for a prominent ACM conference or journal. ACM venues are highly reputable in computer science, indicating a rigorous peer-review process and influence in relevant fields such as recommender systems and information retrieval.
1.4. Publication Year
The publication year is 2026.
1.5. Abstract
The paper addresses the limitations of conventional Sequential Recommender Systems (SRS) that use Hash IDs (HID) for item embeddings. While HID effectively capture collaborative information, they struggle with the long-tail problem where rare items have sparse interactions. Existing methods that incorporate auxiliary information suffer from noisy collaborative sharing (due to unreliable co-occurrence signals) or semantic homogeneity (due to flat dense embeddings). Semantic IDs (SIDs) offer a promising alternative with their code sharing and multi-granular semantic modeling capabilities. However, SID-based methods face the collaborative overwhelming phenomenon where quantization compromises identifier uniqueness for head items, leading to a performance trade-off between head and tail items.
To resolve this, the authors propose H2Rec, a novel framework that harmonizes SID and HID. It features a dual-branch modeling architecture to capture multi-granular semantics from SID while preserving the unique collaborative identity of HID. Additionally, a dual-level alignment strategy bridges the two representations, facilitating knowledge transfer and robust preference modeling. Extensive experiments on three real-world datasets demonstrate that H2Rec effectively balances recommendation quality for both head and tail items, outperforming existing baselines.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2512.10388v1. This indicates it is a preprint available on arXiv.
1.7. PDF Link
The PDF link is https://arxiv.org/pdf/2512.10388v1.pdf. This is also a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
2.1.1. Core Problem
The core problem the paper aims to solve is the inherent trade-off and limitations faced by existing Sequential Recommender Systems (SRS) when dealing with item identification and representation, particularly regarding the long-tail problem and the collaborative overwhelming phenomenon.
Traditional SRS rely on Hash IDs (HID) to create item embeddings. While HID are excellent at capturing collaborative information from user-item interactions, they suffer significantly for tail items (items with sparse interactions). This is because HID embeddings for these items lack sufficient data to learn reliable representations, leading to poor recommendation quality for the vast majority of items.
Recent attempts to incorporate auxiliary information, often using large language models (LLMs) to generate semantic features, introduce new issues:
- Noisy Collaborative Sharing: Some methods exploit co-occurrence patterns, but these signals can be unreliable. Accidental clicks can cause tail items to inherit misleading collaborative information from semantically unrelated popular items.
- Semantic Homogeneity: LLM-based approaches often compress all textual information into a single, dense vector. This "flat" representation creates a single-granularity bottleneck, making it difficult to distinguish subtle semantic differences among similar items.

Conversely, Semantic IDs (SIDs), generated by decomposing dense semantic embeddings into discrete code sequences through vector quantization, offer multi-granular semantic modeling and code sharing. They can aggregate collaborative signals among semantically related items and provide finer semantic distinctions. However, SID-based methods introduce their own problem: the collaborative overwhelming phenomenon. The quantization process inevitably causes code collisions (multiple items sharing the same SID), which compromises the uniqueness of item identifiers. This particularly harms the modeling of head items (popular items with abundant interactions) by confusing the model with inflated user-item connections, leading to performance degradation for these crucial items.
2.1.2. Importance of the Problem
This problem is critical because a vast majority of items in real-world recommendation scenarios belong to the long tail. If a recommender system cannot effectively recommend these items, it leads to:
- Reduced item discoverability: Users are repeatedly shown popular items, limiting their exposure to the wider catalog.
- Suboptimal user experience: Users may miss items that truly match their niche preferences.
- Economic inefficiency: Less popular items remain undiscovered and unpurchased, affecting businesses.
- Biased recommendations: Popularity bias is reinforced at the expense of diverse interests.

The trade-off between HID and SID highlights a fundamental dilemma: identifier uniqueness for popular items versus semantic generalization for rare items. Resolving it means achieving a more balanced and effective recommendation system that serves both popular and niche interests, a significant step towards more robust and fair recommendation.
2.1.3. Paper's Entry Point / Innovative Idea
The paper's innovative idea is to propose H2Rec, a novel framework that harmonizes Semantic IDs (SID) and Hash IDs (HID) within a single system, aiming to leverage the strengths of both while mitigating their individual weaknesses. Instead of replacing one with the other or simply concatenating them, H2Rec employs a sophisticated dual-branch architecture with a dual-level alignment strategy. This allows the model to simultaneously capture multi-granular semantics from SID (benefiting tail items) and preserve the unique collaborative identity of HID (benefiting head items). The key insight is that a nuanced integration and alignment, rather than simple fusion, is necessary to achieve the "best of both worlds."
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Identification of the Collaborative Overwhelming Phenomenon: The paper formally identifies and characterizes the collaborative overwhelming phenomenon in SID-based methods. This phenomenon highlights the fundamental trade-off between identifier uniqueness (crucial for head items) and semantic generalization (beneficial for tail items) caused by quantization-induced code collisions, clarifying a significant limitation of existing SID-based approaches.
- Proposed H2Rec Framework: The authors introduce H2Rec, a novel dual-branch framework designed to harmonize HID and SID. It combines multi-granularity modules (a Multi-granularity Fusion Network for the SID branch and a Multi-granularity Cross Attention Network for the HID branch) with a Dual-level Alignment Strategy, directly addressing the semantic homogeneity and collaborative overwhelming problems.
- Dual-level Alignment Strategy: H2Rec incorporates a Code-guided Alignment Loss at the item level and a Masked Sequence Granularity Loss at the user level. These losses facilitate knowledge transfer between SID and HID representations, allowing tail items to "borrow" high-quality collaborative signals from semantically similar head items without noisy collaborative sharing, and supporting robust user preference modeling.
- Empirical Validation and Superior Performance: Extensive experiments on three real-world datasets demonstrate that H2Rec significantly outperforms state-of-the-art baselines. Crucially, it achieves a robust balance, improving recommendations for tail items without sacrificing performance on head items, thereby breaking the traditional performance seesaw. The framework is also shown to be quantization-agnostic and model-agnostic.

These findings collectively address the challenge of robust recommendation across the entire item popularity spectrum by intelligently integrating the complementary strengths of semantic and hash item representations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following foundational concepts:
- Recommender Systems (RS): Software systems that provide suggestions for items to users. The goal is to predict what a user might like, based on their past behavior or preferences, and the behavior of other users.
- Sequential Recommender Systems (SRS): A specialized type of RS that focuses on the order of user interactions. Instead of just predicting general preferences, SRS predict the next item a user will interact with, based on their historical sequence of interactions. Examples include predicting the next movie to watch or the next product to buy.
- Item Embeddings: In RS, items (e.g., movies, products) are typically represented as numerical vectors in a high-dimensional space. These vectors, called embeddings, capture the characteristics and relationships between items. Similar items have embeddings that are close to each other in this space.
- Hash IDs (HID): Unique identifiers assigned to each item in a database. In traditional SRS, these HIDs are directly mapped to learnable embeddings; each HID corresponds to a unique embedding vector.
- Long-tail Problem: In many domains, a small number of items are very popular ("head items"), while a large number of items are consumed very rarely ("tail items"). The long-tail problem refers to the difficulty of effectively recommending these tail items due to their sparse interaction data, which makes it hard to learn good embeddings or collaborative signals for them.
- Collaborative Filtering (CF): A common technique in RS that makes predictions about a user's interests by collecting preferences from many users. The underlying assumption is that users with similar tastes will like similar items. HID-based methods primarily learn collaborative signals.
- Semantic IDs (SID): Unlike HID, SID are derived from the semantic content (e.g., text descriptions, images) of items. They represent items using discrete codes that capture multi-granular semantic information. SID generation typically involves vector quantization.
- Vector Quantization (VQ): A technique used to convert continuous vector data into discrete codes. It maps an input vector to the closest "codeword" in a predefined codebook, compressing the information and grouping similar vectors together.
- Residual Quantized Variational Autoencoder (RQ-VAE): A specific vector quantization method that quantizes an embedding in multiple stages (or "levels"). At each stage it quantizes the residual (the part of the embedding not yet explained by previous quantizations), allowing for a fine-grained, hierarchical semantic representation in which each level captures a different granularity of semantics.
- Large Language Models (LLM): Powerful AI models trained on vast amounts of text data, capable of understanding and generating human-like text. They can encode textual descriptions of items into dense semantic embeddings, which can then be used in recommender systems. An example mentioned is text-embedding-ada-002.
- Self-Attention / Transformer: A neural network architecture component that allows a model to weigh the importance of different parts of the input sequence relative to a given element. It is fundamental to models like SASRec and BERT4Rec. The core attention mechanism is calculated as (a minimal code sketch of this computation follows this list):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing different projections of the input sequence.
  - $QK^\top$ calculates the similarity between queries and keys.
  - $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax function into regions with tiny gradients.
  - $\mathrm{softmax}$ normalizes the scores into attention weights.
  - The result is a weighted sum of the Value vectors.
- Cross-Attention: A variant of the attention mechanism where the Query, Key, and Value come from different input sequences; for example, queries come from one sequence and keys/values from another, allowing one sequence to attend to relevant parts of the other.
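To make the attention formulas above concrete, here is a minimal PyTorch sketch of scaled dot-product attention used for both self-attention and cross-attention; the tensor shapes and variable names are illustrative only and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k): query-key similarity
    weights = F.softmax(scores, dim=-1)             # normalize scores into attention weights
    return weights @ V                              # weighted sum of the value vectors

# Self-attention: queries, keys, and values all come from the same sequence.
x = torch.randn(2, 10, 64)                          # (batch, seq_len, dim)
self_out = scaled_dot_product_attention(x, x, x)    # (2, 10, 64)

# Cross-attention: queries come from one sequence, keys/values from another.
y = torch.randn(2, 20, 64)
cross_out = scaled_dot_product_attention(x, y, y)   # x attends over y -> (2, 10, 64)
```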
3.2. Previous Works
The paper discusses several categories of previous works:
- Conventional HID-based SRS:
  - BERT4Rec [24], GRU4Rec [10], SASRec [12]: Foundational sequential recommender systems that assign unique Hash IDs (HID) to items and map them to high-dimensional embeddings to capture collaborative information. GRU4Rec uses Gated Recurrent Units (GRUs) to model sequential patterns, while SASRec and BERT4Rec leverage self-attention mechanisms (from the Transformer architecture) to capture long-range dependencies in user interaction sequences; SASRec uses left-to-right self-attention, whereas BERT4Rec uses masked self-attention.
  - MELT [13]: Attempts to address the long-tail problem by exploiting co-occurrence patterns to enhance tail items with information from popular items. However, the paper argues that co-occurrence signals can be unreliable, leading to noisy collaborative sharing.
  - LLM-ESR [18]: A recent method that enriches HID embeddings with semantic information derived from Large Language Models (LLMs). While it improves performance, the paper points out that it compresses all textual information into a single dense vector, leading to the semantic homogeneity problem and a single-granularity bottleneck.
- SID-based Methods:
  - PG-SID [37], SPM-SID [23], CCFRec [16], PSRQ+MCCA (denoted as PSRQ) [30]: These methods use Semantic IDs (SID) generated by decomposing dense semantic embeddings (e.g., from LLMs) into discrete code sequences via vector quantization (RQ-VAE). They are theoretically appealing because shared codes aggregate collaborative signals among semantically related items (mitigating noisy collaborative sharing) and multi-granular codes offer finer semantic distinctions (alleviating semantic homogeneity).
  - The paper notes that existing SID-based methods often replace HID with SID or fuse them via simple concatenation or contrastive learning. These strategies are hindered by the collaborative overwhelming phenomenon, in which quantization-induced code collisions compromise the uniqueness required for head items.
- Hybrid Embedding Methods:
  - URL4DR [15], MME-SID [32], PCR-CA [25]: These methods attempt to combine HID and SID representations. The paper argues that such fusions are often simplistic (e.g., plain concatenation or contrastive learning) and do not fully resolve the fundamental trade-offs identified above.
3.3. Technological Evolution
The field of sequential recommendation has evolved significantly:
- Early Methods (Statistical/Matrix Factorization): Focused on basic item similarities or user-item matrices.
- Recurrent Neural Networks (RNNs): GRU4Rec brought sequential modeling capabilities, capturing dependencies in user sessions.
- Convolutional Neural Networks (CNNs): Caser used CNNs to capture local sequential patterns.
- Self-Attention/Transformers: SASRec and BERT4Rec revolutionized SRS with self-attention, enabling better capture of long-range dependencies and global patterns within sequences.
- Addressing Data Sparsity (Co-occurrence/Auxiliary Information): Methods like MELT leveraged co-occurrence; in parallel, integrating external auxiliary information (e.g., item attributes) became common.
- Rise of LLMs: LLM-ESR and similar works leverage the powerful semantic understanding of LLMs to generate rich item representations, moving beyond simple categorical attributes.
- Semantic IDs and Vector Quantization: Recognizing the single-granularity bottleneck of dense LLM embeddings, SID-based methods (PG-SID, CCFRec) emerged, aiming for multi-granular semantic representations through vector quantization to handle both semantic homogeneity and noisy collaborative sharing.

This paper's work (H2Rec) fits into the latest stage of this evolution, attempting to overcome the limitations of both pure HID and pure SID methods with a harmonization framework. It aims to resolve the collaborative overwhelming phenomenon inherent in SID and the long-tail problem of HID, building on the strengths of LLM-derived semantic embeddings and Transformer-based sequence modeling.
3.4. Differentiation Analysis
Compared to the main methods in related work, H2Rec offers several core differences and innovations:
- Harmonization vs. Substitution/Simple Fusion:
  - Difference: Unlike pure HID-based or SID-based methods (e.g., SASRec and LLM-ESR vs. PG-SID and CCFRec) that focus on one type of ID, or hybrid methods (e.g., URL4DR, MME-SID, PCR-CA) that often use simple concatenation or contrastive learning, H2Rec proposes a deep, harmonized integration.
  - Innovation: It does not merely combine the two; it designs a dual-branch architecture where each branch explicitly addresses the weaknesses of the other. The SID branch handles semantic homogeneity with multi-granularity fusion, while the HID branch selectively injects semantic signals via multi-granularity cross-attention to avoid collaborative overwhelming.
- Addressing Collaborative Overwhelming:
  - Difference: H2Rec explicitly identifies and tackles the collaborative overwhelming phenomenon, a critical limitation of SID-based methods where quantization-induced code collisions compromise uniqueness for head items.
  - Innovation: The Multi-granularity Cross Attention Network in the HID branch is specifically designed to let HID act as the query, selectively drawing semantic information from SID without losing its unique identity, thereby protecting head-item performance.
- Mitigating Noisy Collaborative Sharing with Targeted Alignment:
  - Difference: While methods like MELT and SID-based approaches try to mitigate noisy collaborative sharing, H2Rec introduces a more refined Code-guided Alignment Loss.
  - Innovation: This loss is not a generic contrastive alignment; it leverages multi-granular SID to identify truly semantically similar items, ensuring that tail items "borrow" high-quality collaborative signals from relevant head items rather than inheriting noise from loosely co-occurring items.
- Robust Preference Modeling through Dual-level Alignment:
  - Innovation: The Dual-level Alignment Strategy (combining the Code-guided Alignment Loss and the Masked Sequence Granularity Loss) operates at both the item and user levels. The Masked Sequence Granularity Loss at the user level enhances internal correlations among multi-granular semantics, leading to more robust user representations, a feature not prominently found in other hybrid approaches.
- Generality and Modularity:
  - Innovation: H2Rec is quantization-agnostic (compatible with different RQ-VAE variants) and model-agnostic (it can integrate with various SRS backbones such as GRU4Rec or BERT4Rec), making it a flexible, plug-and-play enhancement module.

In essence, H2Rec moves beyond simple integration by architecturally designing how HID and SID complement each other, using attention and alignment mechanisms to explicitly resolve the inherent tension between identifier uniqueness and semantic generalization across all item popularity groups.
4. Methodology
4.1. Principles
The core idea behind H2Rec is to leverage the complementary strengths of Hash IDs (HID) and Semantic IDs (SID) while mitigating their individual weaknesses.
- HID's Strength: HIDs are excellent at capturing unique collaborative information, especially for head items with abundant interaction data; however, they struggle with tail items due to data sparsity.
- SID's Strength: SIDs, derived from LLM-based semantic embeddings via vector quantization, can model multi-granular semantics and enable code sharing among semantically related items, which benefits tail items and cold-start scenarios; however, quantization-induced code collisions can compromise identifier uniqueness, leading to collaborative overwhelming for head items.

The theoretical intuition is to build a dual-branch architecture: one branch primarily handles the semantic aspects via SID, while the other focuses on the collaborative and unique-identity aspects via HID. The two branches are harmonized through dual-level alignment strategies that facilitate knowledge transfer in a controlled, beneficial manner, ensuring that:
- Tail items gain semantic and collaborative signals from related head items without introducing noise.
- Head items retain their unique collaborative identity and are not "overwhelmed" by shared semantic codes.
- The system learns robust, fine-grained representations for both.
4.2. Core Methodology In-depth (Layer by Layer)
The H2Rec framework is structured around a dual-branch modeling architecture and a dual-level alignment strategy.
The following figure (Figure 2 from the original paper) shows the framework overview for H2Rec:
The figure is a schematic overview of the proposed framework. The SID branch is shown on the left and the HID branch on the right; the two are connected through the Multi-granularity Fusion Network and the cross-attention mechanism. A legend indicates the different types of embeddings and how they are encoded. The framework is designed to balance recommendation quality between head and tail items.
4.2.1. Dual-branch Modeling
This architecture consists of an SID Branch and an HID Branch, designed to leverage the complementary benefits of semantic and hash IDs.
4.2.1.1. SID Branch
The SID branch focuses on constructing fine-grained semantic representations from Semantic IDs.
- Semantic Code Generation:
  - First, semantic embeddings are derived from item textual attributes using Large Language Models (LLM). Specifically, item attributes are converted into textual instructions, and an LLM encoder (e.g., text-embedding-ada-002) produces the item representation matrix $E_{LLM}$.
    - $E_{LLM}$: Matrix of LLM-derived semantic embeddings for all items, with one row per unique item and as many columns as the dimensionality of the LLM embeddings.
  - Subsequently, a Residual Quantized Variational Autoencoder (RQ-VAE) framework (as mentioned in Section 2.2) is trained. This RQ-VAE quantizes each semantic embedding into $L$ discrete semantic codes.
  - An item is then represented as a tuple of codes, one per level, where the $l$-th entry is a single code at the $l$-th quantization level.
  - Corresponding codebook embedding matrices $E_C$ are learned, one codebook per level.
  - The semantic code embeddings for an item are retrieved by looking up its level-$l$ code in the level-$l$ codebook, yielding $\pmb{e}_{C_i}^l$ for item $v_i$ at level $l$.
- Multi-granularity Fusion Network:
  - The semantic code embeddings for all items in a user's sequence are grouped by their code levels, forming $L$ separate granularity sequences:
$$\mathcal{E}_g^l = [\pmb{e}_{C_1}^l, \pmb{e}_{C_2}^l, \dots, \pmb{e}_{C_N}^l]$$
    - $\mathcal{E}_g^l$: Embedding sequence of the interacted items at code level $l$.
    - $\pmb{e}_{C_n}^l$: Semantic code embedding of the $n$-th item in the sequence at code level $l$.
    - $N$: Length of the interaction sequence.
  - To adaptively aggregate these granularity sequences into a final fine-grained SID embedding sequence, an adaptive mechanism uses the user's interaction context to assign importance weights to the different code-level granularities (a sketch of this mechanism appears at the end of this subsection).
  - The embedding of the user's last interacted item, $\pmb{e}_N^{hid}$, taken from the HID embedding sequence (discussed later), serves as the query anchor for the current intent.
  - A learnable residual bias vector $\pmb{b}_{prior}$ is introduced to incorporate the prior knowledge that coarse-grained semantics are generally more robust and to ensure training stability.
  - The unnormalized importance scores are computed via a Multi-Layer Perceptron (MLP):
$$\pmb{s} = W_2 \left( \sigma \left( W_1 \left[ \pmb{e}_N^{hid} ; \pmb{b}_{prior} \right] + \pmb{b}_1 \right) \right) + \pmb{b}_2 + \pmb{b}_{prior}$$
    - $\pmb{e}_N^{hid}$: HID embedding of the last item in the sequence.
    - $\pmb{b}_{prior}$: Learnable residual bias vector, emphasizing coarse-grained semantics.
    - $[\cdot ; \cdot]$: Concatenation operation.
    - $W_1$, $W_2$: Weight matrices of the two MLP layers; their shapes are determined by the HID embedding dimension, the number of code levels $L$, and the MLP hidden dimension.
    - $\pmb{b}_1$, $\pmb{b}_2$: Bias vectors of the MLP.
    - $\sigma$: Activation function (e.g., ReLU).
  - The scores are then normalized into importance weights using a softmax function:
$$\alpha_l = \frac{\exp(s_l)}{\sum_{k=1}^L \exp(s_k)}$$
    - $\alpha_l$: Importance weight for code level $l$.
    - $s_l$: The $l$-th component of the score vector $\pmb{s}$.
  - Finally, the SID item embedding sequence is derived as a weighted sum of the granularity sequences:
$$\mathcal{E}^{sid} = \sum_{l=1}^L \alpha_l \cdot \mathcal{E}_g^l$$
    - $\mathcal{E}^{sid}$: The final fused SID embedding sequence.
- Sequence Encoder:
  - An independent sequence encoder $f_{\theta^1}$ (e.g., a SASRec or BERT4Rec backbone) processes the fused SID embedding sequence $\mathcal{E}^{sid}$ to produce the final user representation $\pmb{u}^{sid}$ of the SID branch.
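To illustrate how the adaptive granularity weighting described above could be implemented, the following is a minimal PyTorch sketch of the Multi-granularity Fusion Network; the class name, tensor shapes, and hidden size are assumptions for illustration and not the authors' code.

```python
import torch
import torch.nn as nn

class MultiGranularityFusion(nn.Module):
    """Hypothetical sketch: an MLP scores the L code-level granularities from the
    last HID embedding plus a learnable prior bias, softmax turns the scores into
    weights, and the SID sequence is the weighted sum of the per-level sequences."""
    def __init__(self, d: int, num_levels: int, hidden: int = 128):
        super().__init__()
        self.b_prior = nn.Parameter(torch.zeros(num_levels))   # residual prior bias
        self.w1 = nn.Linear(d + num_levels, hidden)
        self.w2 = nn.Linear(hidden, num_levels)

    def forward(self, e_last_hid, granularity_seqs):
        # e_last_hid: (B, d) HID embedding of the last interacted item
        # granularity_seqs: (B, L, N, d) code embeddings per level and position
        prior = self.b_prior.expand(e_last_hid.size(0), -1)
        s = self.w2(torch.relu(self.w1(torch.cat([e_last_hid, prior], dim=-1)))) + prior
        alpha = torch.softmax(s, dim=-1)                        # (B, L) level weights
        # Weighted sum over levels -> fused SID embedding sequence of shape (B, N, d)
        fused = (alpha[:, :, None, None] * granularity_seqs).sum(dim=1)
        return fused, alpha
```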
4.2.1.2. HID Branch
The HID branch focuses on preserving the unique collaborative identity of Hash IDs while selectively incorporating multi-granular semantic information.
- HID Embedding Layer:
  - A learnable item embedding layer is used to derive the initial HID item embeddings, one unique $d$-dimensional vector per item. This layer is updated during training to absorb collaborative information from historical user-item interactions.
  - The item embedding sequence $\mathcal{E}^{hid}$ is formed by looking up the embeddings of the items in the user's interaction sequence.
- Multi-granularity Cross Attention Network:
  - This module is crucial for injecting nuanced multi-granular semantic information into the HID embeddings without collaborative overwhelming (a sketch appears at the end of this subsection).
  - The HID item embedding sequence $\mathcal{E}^{hid}$ acts as the query anchor.
  - The multi-granular SID item embedding sequences $\mathcal{E}_g^l$ (from the SID branch) act as the key-value pairs for each code level $l$.
  - This setup allows the unique HID to selectively retrieve relevant semantic information from SID.
  - For each code level $l$, the embeddings are projected into distinct subspaces:
$$Q = \mathcal{E}^{hid} W^Q, \quad K_l = \mathcal{E}_g^l W^K, \quad V_l = \mathcal{E}_g^l W^V$$
    - $W^Q$, $W^K$, $W^V$: Learnable weight matrices projecting the HID and SID embeddings into the query, key, and value subspaces, respectively.
  - The final fused embedding sequence is derived by aggregating the granularity-specific attention outputs, weighted by the user's intent-aware scores $\alpha_l$ computed in the SID branch. A residual connection adds back the original HID item embedding sequence to preserve its uniqueness:
$$\mathcal{E}^f = \sum_{l=1}^L \alpha_l \cdot \left( \mathrm{softmax}\left(\frac{Q K_l^\top}{\sqrt{d}}\right) V_l \right) + \mathcal{E}^{hid}$$
    - $\mathcal{E}^f$: The fused HID embedding sequence incorporating semantic information.
    - $\alpha_l$: Importance weight for code level $l$, computed in the SID branch.
    - $Q K_l^\top$: Dot-product similarity between the queries and the level-$l$ keys.
    - $\sqrt{d}$: Scaling factor.
    - $V_l$: Value matrix for level $l$.
    - $\mathcal{E}^{hid}$: Original HID embedding sequence, added as a residual connection.
- Sequence Encoder:
  - Another independent sequence encoder $f_{\theta^2}$ processes the fused embedding sequence $\mathcal{E}^f$ to derive the comprehensive user representation $\pmb{u}^{hid}$ of the HID branch.
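The following is a minimal PyTorch sketch of the Multi-granularity Cross Attention computation described above; the module and variable names are hypothetical, and details such as multi-head splitting or dropout are omitted.

```python
import torch
import torch.nn as nn

class MultiGranularityCrossAttention(nn.Module):
    """Hypothetical sketch: HID embeddings act as queries, each SID granularity
    sequence supplies keys/values, attention outputs are aggregated with the level
    weights alpha, and a residual connection keeps the original HID identity."""
    def __init__(self, d: int):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.scale = d ** 0.5

    def forward(self, e_hid, granularity_seqs, alpha):
        # e_hid: (B, N, d); granularity_seqs: (B, L, N, d); alpha: (B, L)
        q = self.wq(e_hid)                                   # queries from HID
        fused = torch.zeros_like(e_hid)
        for l in range(granularity_seqs.size(1)):
            k = self.wk(granularity_seqs[:, l])              # keys from level-l SID codes
            v = self.wv(granularity_seqs[:, l])
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
            fused = fused + alpha[:, l, None, None] * (attn @ v)
        return fused + e_hid                                 # residual keeps HID uniqueness
```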
4.2.2. Dual-level Alignment
This strategy further enhances the representation ability of each branch by aligning the semantic and collaborative spaces at two levels.
4.2.2.1. Code-guided Alignment Loss ($\mathcal{L}_{CA}$)
This loss mitigates Noisy Collaborative Sharing by aligning SID and HID representations, allowing long-tail items to "borrow" collaborative signals from semantically similar head items.
- Standard Contrastive Learning (Base Idea): A straightforward approach would be a standard contrastive objective that pulls the SID item embedding and HID item embedding of the same item closer while pushing them away from negative samples. This 1-to-1 alignment transfers unique collaborative signals into the semantic representation. Its original form would be:
$$\mathcal{L}_{align} = - \frac{1}{B} \sum_{i=1}^B \log \frac{\exp(\cos(\pmb{e}_i^{sid}, \pmb{e}_i^{hid}) / \tau)}{\sum_{j=1}^B \mathbb{I}[i \neq j] \exp(\cos(\pmb{e}_i^{sid}, \pmb{e}_j^{hid}) / \tau)}$$
  - $B$: Batch size.
  - $\cos(\cdot, \cdot)$: Cosine similarity.
  - $\tau$: Temperature coefficient.
  - $\mathbb{I}[i \neq j]$: Indicator function, ensuring only items with $j \neq i$ serve as negative samples.
- 1-to-Many Code-guided Alignment: To let tail items borrow collaborative signals, the objective is expanded from 1-to-1 to 1-to-many by constructing a unified positive set $\mathcal{P}(i)$ for item $i$ from two additional sources:
  - The set of items sharing a required number of semantic-code levels with item $i$, which captures semantic similarity.
  - The set of items appearing within the co-occurrence context window around item $i$, which captures local collaborative proximity.
  - The unified positive set $\mathcal{P}(i)$ combines both sources.
  - The Code-guided Alignment Loss maximizes the cumulative similarity between the anchor SID item embedding and all positive HID item embeddings from $\mathcal{P}(i)$:
$$\mathcal{L}_{CA}^1 = - \frac{1}{B} \sum_{i=1}^B \log \frac{\sum_{k \in \mathcal{P}(i)} \exp(\cos(\pmb{e}_i^{sid}, \pmb{e}_k^{hid}) / \tau)}{\sum_{j=1}^B \exp(\cos(\pmb{e}_i^{sid}, \pmb{e}_j^{hid}) / \tau)}$$
    - $B$: Batch size.
    - $\mathcal{P}(i)$: Positive set for item $i$.
    - $\pmb{e}_i^{sid}$: SID embedding of the anchor item $i$.
    - $\pmb{e}_k^{hid}$: HID embedding of a positive item $k$.
    - $\pmb{e}_j^{hid}$: HID embedding of any item $j$ in the batch (including negative samples).
  - To make the alignment bidirectional, a second term $\mathcal{L}_{CA}^2$ is computed by exchanging the roles of the SID and HID embeddings.
  - The total Code-guided Alignment Loss combines the two directions, $\mathcal{L}_{CA} = \mathcal{L}_{CA}^1 + \mathcal{L}_{CA}^2$.
4.2.2.2. Masked Sequence Granularity Loss ($\mathcal{L}_{MSG}$)
This loss enhances robust user representation by enforcing internal correlations among multi-granular semantics.
- View Construction: For each user interaction sequence, two views are constructed:
  - Global View: Preserves the complete multi-granular semantic codes, yielding the user representation $\pmb{u}^{sid}$ after processing by the Multi-granularity Fusion Network and the sequence encoder.
  - Granularity-Masked View: A target granularity index is randomly sampled, and the embedding sequence at this level is replaced with a learnable mask token. The resulting masked sequence is processed through the same Multi-granularity Fusion Network and sequence encoder to derive the masked user representation $\tilde{\pmb{u}}^{sid}$.
- Mutual Information Maximization: The model is trained to implicitly infer the missing semantic information by maximizing the mutual information between the global and masked representations, using a contrastive learning objective:
$$\mathcal{L}_{MSG}^1 = - \frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\cos(\pmb{u}_i^{sid}, \tilde{\pmb{u}}_i^{sid}) / \tau)}{\sum_{j=1}^N \exp(\cos(\pmb{u}_i^{sid}, \tilde{\pmb{u}}_j^{sid}) / \tau)}$$
  - $N$: Number of users in the batch.
  - $\pmb{u}_i^{sid}$: Global user representation for user $i$.
  - $\tilde{\pmb{u}}_i^{sid}$: Masked user representation for user $i$.
  - $\cos(\cdot, \cdot)$: Cosine similarity.
  - $\tau$: Temperature coefficient.
  - The denominator includes all masked user representations in the batch as potential negative samples.
  - Similarly, a second term $\mathcal{L}_{MSG}^2$ is obtained by exchanging the roles of the global and masked representations.
  - The total Masked Sequence Granularity Loss combines the two directions, $\mathcal{L}_{MSG} = \mathcal{L}_{MSG}^1 + \mathcal{L}_{MSG}^2$.
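The following is a minimal PyTorch sketch of the two-view construction and the InfoNCE-style objective behind this loss; the sequence-encoder step between the views and the user representations is omitted, and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def build_granularity_views(granularity_seqs, mask_token):
    """Two views for the Masked Sequence Granularity Loss: the global view keeps
    every code level; the masked view replaces one randomly chosen level with a
    learnable mask token. granularity_seqs: (B, L, N, d); mask_token: (d,)."""
    B, L, _, _ = granularity_seqs.shape
    masked = granularity_seqs.clone()
    target_level = torch.randint(0, L, (B,))              # one masked level per sequence
    masked[torch.arange(B), target_level] = mask_token     # broadcast over the N positions
    return granularity_seqs, masked

def masked_sequence_granularity_loss(u_global, u_masked, tau=0.1):
    # InfoNCE between global and granularity-masked user representations,
    # applied in both directions; matching pairs sit on the diagonal.
    sim = F.cosine_similarity(u_global.unsqueeze(1), u_masked.unsqueeze(0), dim=-1) / tau
    labels = torch.arange(u_global.size(0))
    return F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)
```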
4.2.3. Training and Inference
4.2.3.1. Training
- Item and User Representation Generation:
  - As described in Section 4.2.1, the SID branch generates the item embedding sequence $\mathcal{E}^{sid}$ and the user representation $\pmb{u}^{sid}$, and the HID branch generates the fused item embedding sequence $\mathcal{E}^f$ and the user representation $\pmb{u}^{hid}$.
- Recommendation Score Calculation: The mutually enhanced HID and SID user representations ($\pmb{u}^{hid}$ and $\pmb{u}^{sid}$) are fused to derive recommendation scores. For a target item $v_j$, its SID embedding $\pmb{e}_j^{sid}$ and HID embedding $\pmb{e}_j^{hid}$ are also used. The probability of recommending item $v_j$ to user $u$ is calculated as:
$$P(v_{N+1} = v_j \mid S_u) = [\pmb{e}_j^{sid} : \pmb{e}_j^{hid}]^\top [\pmb{u}^{sid} : \pmb{u}^{hid}]$$
  - $[\cdot : \cdot]$: Concatenation operator.
  - The user representation is the concatenation of $\pmb{u}^{sid}$ and $\pmb{u}^{hid}$, and the item representation is the concatenation of $\pmb{e}_j^{sid}$ and $\pmb{e}_j^{hid}$.
  - The score is the dot product of the concatenated user and item representations.
- Ranking Loss: The model is optimized using a pairwise ranking loss:
$$\mathcal{L}_{rec} = - \sum_{u \in \mathcal{U}} \sum_{j=1}^N \log \sigma\left(P(v_{j+1} = v^+ \mid S_u) - P(v_{j+1} = v^- \mid S_u)\right)$$
  - $\mathcal{U}$: Set of all users.
  - $N$: Length of the interaction sequence.
  - $\sigma$: Sigmoid function.
  - $v^+$: Ground-truth (positive) item.
  - $v^-$: Corresponding negative item.
  - The loss maximizes the score difference between positive and negative items.
- Overall Training Objective: The total loss combines the main recommendation loss with the two auxiliary alignment losses:
$$\mathcal{L} = \mathcal{L}_{rec} + \beta \cdot \mathcal{L}_{CA} + \gamma \cdot \mathcal{L}_{MSG}$$
  - $\mathcal{L}_{rec}$: Recommendation loss.
  - $\mathcal{L}_{CA}$: Code-guided Alignment Loss.
  - $\mathcal{L}_{MSG}$: Masked Sequence Granularity Loss.
  - $\beta$, $\gamma$: Hyperparameters controlling the contributions of the auxiliary objectives.
The following is the full training and inference pipeline (Algorithm 1 from the original paper):

Algorithm 1: Train and Inference Procedures of H2Rec
Require: User set $\mathcal{U}$, Item set $\mathcal{V}$
1. Indicate the backbone SRS $f_{\theta^1}$ and $f_{\theta^2}$.
2. Indicate the weights of $\mathcal{L}_{CA}$ and $\mathcal{L}_{MSG}$.
3. Derive the semantic embedding $E_{LLM}$ by the LLM and the codebook embedding matrix $E_C$ by RQ-VAE.

Training
4. Initialize the HID item embeddings with dimension-reduced $E_{LLM}$.
5. Initialize the semantic code embeddings by looking up $E_C$.
6. For each batch of users $\mathcal{U}_B$ in $\mathcal{U}$:
7.  Get the embedding sequences of the different granularities from the code embeddings.
8.  Generate the weights $\alpha_l$ for the different granularities by Equation (2).
9.  Derive the final SID item embedding sequence $\mathcal{E}^{sid}$ by Equation (3).
10. Get the fused embedding sequence $\mathcal{E}^f$ by Equation (5).
11. Get the user representations of the SID and HID branches, i.e., $\pmb{u}^{sid}$ and $\pmb{u}^{hid}$.
12. Calculate the probability scores of the ground-truth and negative items by Equation (9).
13. Calculate the ranking loss by Equation (10).
14. Calculate $\mathcal{L}_{CA}$ by Equation (7) and $\mathcal{L}_{MSG}$ by Equation (8).
15. Sum $\mathcal{L}_{rec}$, $\mathcal{L}_{CA}$, and $\mathcal{L}_{MSG}$; then update the parameters.
16. End for.

Inference
17. Load $E_{LLM}$, $E_C$, and $\alpha_l$ for the item embedding layers and the other parameters.
18. For each batch of users $\mathcal{U}_B$ in $\mathcal{U}$:
19. Get the user representations of the SID and HID branches, i.e., $\pmb{u}^{sid}$ and $\pmb{u}^{hid}$.
20. Calculate the probability scores by Equation (9) and output the final next-item prediction.
21. End for.
4.2.3.2. Inference
During inference, the SID and their associated codebook embeddings are pre-cached. Recommendation scores are directly computed using Equation (9) (the score calculation formula), and the item with the highest score is predicted.
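The following is a minimal PyTorch sketch of the scoring and loss computations used during training and inference; the batching scheme and the placeholder values of beta and gamma are assumptions for illustration, not the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def recommendation_score(u_sid, u_hid, e_sid, e_hid):
    # Equation (9): dot product between concatenated user and item representations.
    user = torch.cat([u_sid, u_hid], dim=-1)      # (B, 2d)
    item = torch.cat([e_sid, e_hid], dim=-1)      # (B, 2d), one candidate item per user here
    return (user * item).sum(dim=-1)              # (B,) scores

def pairwise_ranking_loss(pos_score, neg_score):
    # Equation (10): push the positive item's score above the negative item's.
    return -F.logsigmoid(pos_score - neg_score).mean()

def total_loss(l_rec, l_ca, l_msg, beta=0.1, gamma=0.1):
    # Equation (11): recommendation loss plus weighted auxiliary alignment losses.
    # beta and gamma here are placeholders, not the paper's reported values.
    return l_rec + beta * l_ca + gamma * l_msg
```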
5. Experimental Setup
5.1. Datasets
The experiments used three real-world datasets:
- Yelp: A dataset commonly used for reviews and local-business recommendations.
- Amazon Beauty: An Amazon dataset of beauty-product interactions.
- Amazon Instrument: An Amazon dataset of musical-instrument interactions.
For data preprocessing and splitting, the authors followed previous SRS works [18, 21], implying standard procedures like filtering inactive users/items and chronological splitting for training/validation/testing.
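The paper only states that it follows the preprocessing of prior SRS work; the snippet below illustrates the leave-one-out chronological split commonly used in those works (last item for testing, second-to-last for validation) as one plausible instantiation, not the authors' exact pipeline.

```python
from collections import defaultdict

def leave_one_out_split(interactions):
    """Illustrative leave-one-out split often used in SRS papers.
    interactions: list of (user_id, item_id, timestamp) tuples."""
    per_user = defaultdict(list)
    for user, item, ts in interactions:
        per_user[user].append((ts, item))

    train, valid, test = {}, {}, {}
    for user, events in per_user.items():
        items = [item for _, item in sorted(events)]   # chronological order
        if len(items) < 3:                             # too short to split
            train[user] = items
            continue
        train[user] = items[:-2]
        valid[user] = items[-2]                        # second-to-last item for validation
        test[user] = items[-1]                         # last item for testing
    return train, valid, test
```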
The statistics of the datasets are summarized below (Table 1 from the original paper):
| Dataset | # Users | # Items | Sparsity | Avg.length |
| Yelp | 15,720 | 11,383 | 99.89% | 12.23 |
| Beauty | 52,204 | 57,289 | 99.92% | 7.56 |
| Instrument | 40,644 | 30,676 | 99.97% | 8.01 |
Characteristics and Domain:
- Domain: E-commerce (Amazon Beauty and Instrument) and local business reviews (Yelp).
- Scale: The datasets vary in size, with Beauty being the largest in terms of users and items.
- Sparsity: All datasets exhibit extremely high sparsity (over 99.89%), which is typical for recommendation tasks and highlights the long-tail problem the paper aims to address.
- Average Length: The average interaction sequence lengths vary, with Yelp having the longest.

These datasets are well suited for validating the method's performance, especially its ability to handle sparse data and the long-tail problem, given their high sparsity and diverse item popularities.
5.2. Evaluation Metrics
The paper evaluates performance using Hit Rate (H@10) and Normalized Discounted Cumulative Gain (NDCG@10), truncated at 10. Items are split into tail and head groups, with the top 20% most popular items considered head items and the rest tail items.
For every evaluation metric mentioned in the paper, we provide a complete explanation:
5.2.1. Hit Rate (H@K)
- Conceptual Definition: Hit Rate (or Recall) at K measures how often the ground-truth item appears within the top K recommended items. It indicates the model's ability to recall relevant items: a higher Hit Rate means more relevant items are included in the recommendation list.
- Mathematical Formula:
$$\mathrm{H@K} = \frac{\text{Number of users for whom the ground-truth item is in the top K recommendations}}{\text{Total number of users}}$$
Or, for a single user $u$:
$$\mathrm{H@K}(u) = \mathbb{I}(\text{ground-truth item } v^+ \text{ is in the top K recommended items for } u)$$
The overall H@K is typically the average over all users.
- Symbol Explanation:
  - $\mathbb{I}(\cdot)$: Indicator function, which returns 1 if the condition inside is true and 0 otherwise.
  - $v^+$: The actual next item the user interacted with (ground truth).
  - $K$: The size of the recommendation list (here, $K = 10$).
5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)
- Conceptual Definition: NDCG at K is a measure of ranking quality. It evaluates how well the recommender system places highly relevant items at the top of the recommendation list. It is "normalized" because it compares the DCG (Discounted Cumulative Gain) of the recommendation list to the ideal DCG (where all relevant items are perfectly ranked at the top), giving higher scores to relevant items that appear earlier in the list.
- Mathematical Formula:
First, Cumulative Gain (CG):
$$\mathrm{CG@K} = \sum_{i=1}^K \mathrm{rel}_i$$
Then, Discounted Cumulative Gain (DCG):
$$\mathrm{DCG@K} = \sum_{i=1}^K \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}$$
Finally, Normalized Discounted Cumulative Gain (NDCG):
$$\mathrm{IDCG@K} = \sum_{i=1}^{|\mathrm{REL}|} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$
- Symbol Explanation:
  - $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the recommendation list. In implicit-feedback scenarios like this paper's (where interactions are binary), $\mathrm{rel}_i$ is 1 if the item at position $i$ is the ground-truth item and 0 otherwise.
  - $K$: The size of the recommendation list (here, $K = 10$).
  - $\log_2(i+1)$: Logarithmic discount factor, reducing the impact of items at lower ranks.
  - $\mathrm{IDCG@K}$: Ideal DCG at K, the maximum possible DCG if all relevant items were perfectly ranked at the top.
  - $|\mathrm{REL}|$: Number of relevant items (for a single ground-truth item, this is 1).
5.3. Baselines
The paper compares H2Rec against various state-of-the-art baselines categorized into three types:
- HID Embedding Baselines (Hash ID based): These models primarily rely on Hash IDs and collaborative filtering.
  - BERT4Rec [24]: A Transformer-based sequential recommender that uses a masked self-attention mechanism, similar to BERT in NLP.
  - SASRec [12]: A Transformer-based sequential recommender that uses self-attention to capture long-range dependencies.
  - MELT [13]: A method designed to enhance long-tail items by exploiting co-occurrence patterns from popular items.
  - LLM-ESR [18]: An LLM-based method that enriches HID embeddings with semantic information derived from large language models; it represents the state of the art in LLM-enhanced HID methods.
- SID Embedding Baselines (Semantic ID based): These models use Semantic IDs derived from item content.
  - PG-SID [37]: A method using Semantic IDs for transferable sequential recommenders.
  - SPM-SID [23]: Another SID-based approach for recommendation.
  - CCFRec [16]: A method that bridges textual-collaborative gaps through semantic codes for sequential recommendation, progressively incorporating semantic embeddings from multiple attribute dimensions.
  - PSRQ+MCCA (denoted as PSRQ) [30]: Progressive Semantic Residual Quantization combined with Multimodal-Joint Interest Modeling, a recent SID-based method.
- Hybrid Embedding Baselines (combining HID and SID): These models integrate both Hash and Semantic IDs.
  - URL4DR [15]: Unified Semantic and ID Representation Learning for deep recommenders.
  - MME-SID [32]: Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs.
  - PCR-CA [25]: Parallel Codebook Representations with Contrastive Alignment for multiple-category app recommendation.

These baselines are representative because they cover the main evolutionary paths in sequential recommendation: traditional ID-based models, LLM-enhanced ID-based models, pure Semantic ID-based models, and various hybrid approaches. This comprehensive selection allows for a thorough validation of H2Rec's superiority and its ability to address the specific challenges outlined.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents extensive experimental results demonstrating the effectiveness of H2Rec across different item popularity groups on three real-world datasets.
The following are the results from Table 2 of the original paper:
| Dataset | Group | Metric | HID Emb | | | | SID Emb | | | | Hybrid Emb | | | Ours | Improv. % |
| | | | BERT4Rec | SASRec | MELT | LLM-ESR | SPM-SID | PG-SID | CCFRec | PSRQ | URL4DR | MME-SID | PCR-CA | H$^2$Rec | |
| Yelp | Overall | H@10 N@10 | 0.5314 0.3147 | 0.5940 0.3601 | 0.6101 | 0.6573 | 0.4727 | 0.4881 | 0.5947 | 0.5438 | 0.6402 | 0.6431 | 0.6447 | 0.6692* | 1.81% |
| Tail | 0.3394 | 0.4102 | 0.3148 | 0.3251 | 0.3694 | 0.3422 | 0.3776 | 0.3884 | 0.3971 | 0.4272* | 4.14% | ||||
| H@10 | 0.0177 | 0.1175 | 0.1223 | 0.1802 | 0.2441 | 0.2492 | 0.2478 | 0.2543 | 0.1957 | 0.2215 | 0.2032 | 0.2693* | 5.90% | ||
| N@10 | 0.0068 | 0.0588 | 0.0599 | 0.0676 | 0.1162 | 0.1186 | 0.1171 | 0.1210 | 0.0914 | 0.1103 | 0.0954 | 0.1306* | 7.93% | ||
| Beauty | Head | H@10 N@10 | 0.6919 0.3876 | 0.7413 0.4592 | 0.7790 0.4745 | 0.8059 0.5122 | 0.5218 0.3079 | 0.5334 0.3147 | 0.7071 0.4243 | 0.6398 0.3721 | 0.7699 0.4735 | 0.7702 0.4823 | 0.7748 0.4860 | 0.8324* 0.5483* | 3.29% 7.05% |
| Overall | H@10 | 0.3992 | 0.4401 | 0.4890 | 0.5544 | 0.3715 | 0.3836 | 0.4398 | 0.4038 | 0.5464 | 0.5509 | 0.5539 | 0.5742* | 3.57% | |
| N@10 | 0.2401 | 0.3043 | 0.3357 | 0.3702 | 0.2617 | 0.2703 | 0.3021 | 0.2845 | 0.3675 | 0.3617 | 0.3673 | 0.3957* | 6.89% |
| H@10 N@10 | 0.0123 0.0052 | 0.0921 0.0675 | 0.1536 0.0877 | 0.2198 0.1074 | 0.2208 0.1355 | 0.2254 | 0.2238 | 0.2300 | 0.1967 | 0.2177 | 0.2048 | 0.2557* | 11.17% 8.01% | ||
| Overall | Head | H@10 | 0.4988 | 0.5291 | 0.5815 | 0.1383 0.4485 | 0.1323 | 0.1411 | 0.1105 | 0.1404 | 0.1342 | 0.1524* | |||
| N@10 | 0.2971 | 0.4007 | 0.4106 | 0.6377 | 0.4388 | 0.5099 | 0.4875 | 0.6299 | 0.6393 | 0.6419 | 0.6502* | 1.29% | |||
| 0.4289 | 0.3251 | 0.3323 | 0.3887 | 0.3612 | 0.4300 | 0.4379 | 0.4401 | 0.4538* | 3.11% | ||||||
| H@10 | 0.4601 | 0.5057 | 0.5510 | 0.5881 | 0.4312 | 0.4453 | 0.5078 | 0.4687 | 0.6005 | 0.6044 | 0.6072 | 0.6184* | 1.84% | ||
| Instrument | N@10 H@10 | 0.3213 | 0.3442 0.0489 | 0.3622 | 0.3809 | 0.2908 | 0.3003 | 0.3376 | 0.3161 | 0.4024 | 0.4027 | 0.4056 | 0.4153* | 2.39% | |
| Tail | N@10 | 0.0199 | 0.0257 | 0.0766 | 0.0998 | 0.2058 | 0.2101 | 0.2099 | 0.2144 | 0.1605 | 0.2044 | 0.1827 | 0.2382* | 11.10% | |
| 0.0143 | 0.0459 | 0.0549 | 0.0990 | 0.1010 | 0.0907 | 0.1031 | 0.0828 | 0.0991 | 0.1025 | 0.1233* | 11.88% | ||||
| H@10 | 0.5028 | 0.5806 | 0.6188 | 0.6676 | 0.5569 | 0.5693 | 0.5629 | 0.6188 | 0.6643 | 0.6646 | 0.6701 | 0.6832* | 1.95% | ||
| Head | N@10 | 0.3190 | 0.3764 | 0.4237 | 0.4522 | 0.3813 | 0.3898 | 0.4192 | 0.4237 | 0.4483 | 0.4498 | 0.4543 | 0.4638* | 2.09% | |
Overall Performance:
- H2Rec consistently achieves the best overall performance across all three datasets (Yelp, Beauty, Instrument) for both H@10 and N@10.
- It shows statistically significant improvements (indicated by *) over all baselines, with relative gains ranging from 1.29% to 11.88%. This validates the core hypothesis that harmonizing SID and HID leads to superior performance.
- Comparison with LLM-ESR and Hybrid Methods: After H2Rec, Hybrid Embedding methods (e.g., PCR-CA, MME-SID) and LLM-based HID methods (e.g., LLM-ESR) generally perform next best. This suggests that incorporating LLM-derived knowledge or combining multiple embeddings is beneficial, but H2Rec's explicit and fine-grained alignment strategies provide a further significant boost.
- Comparison with SID Embedding Baselines: SID Embedding baselines show mixed performance. While some (CCFRec, PSRQ) are comparable to HID-based methods, they generally underperform H2Rec and other hybrid approaches. This indicates that merely using SID or improving semantic quality, without a dedicated recommendation-oriented architectural design that addresses collaborative overwhelming, is insufficient.

Popularity Breakdown Analysis (Head vs. Tail Items):
- HID Embedding Methods (e.g., SASRec, LLM-ESR): These methods perform strongly on head items (thanks to abundant interaction data) but suffer significant drops on tail items, confirming their reliance on dense collaborative signals and their vulnerability to the long-tail problem. For instance, SASRec's N@10 on Yelp drops from 0.4592 (Head) to 0.0588 (Tail).
- SID Embedding Methods (e.g., PSRQ, CCFRec): These generally perform better on tail items than pure HID-based methods, effectively leveraging semantic information to address sparsity; for example, PSRQ's N@10 on Yelp for tail items (0.1210) is significantly higher than SASRec's (0.0588). However, their performance on head items is limited due to the coarse-grained nature of SID and quantization-induced ID collisions (the collaborative overwhelming phenomenon).
- H2Rec:
  - Achieves the best results in the Tail group, substantially outperforming all SID-based baselines. This confirms its success in transferring semantic knowledge to cold items and mitigating noisy collaborative sharing. On Yelp, H2Rec achieves a Tail N@10 of 0.1306, notably higher than PSRQ (0.1210) and CCFRec (0.1171).
  - Simultaneously maintains state-of-the-art performance in the Head group, surpassing all HID-based methods, which demonstrates that it preserves fine-grained collaborative information for popular items and mitigates the collaborative overwhelming phenomenon. On Yelp, H2Rec achieves a Head N@10 of 0.5483, outperforming LLM-ESR (0.5122).

This analysis validates that H2Rec eliminates the traditional trade-off between head- and tail-item performance, achieving balanced and superior recommendation quality across the entire item popularity spectrum.
6.2. Ablation Studies / Parameter Analysis
To understand the contribution of each component, ablation studies were conducted on the Yelp dataset.
The following are the results from Table 3 of the original paper:
| Variants | Overall | Tail | Head | |||
| N@10 | H@10 | N@10 | H@10 | N@10 | H@10 | |
| H$^2$Rec | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 |
| w/o FN | 0.4123 | 0.6605 | 0.1013 | 0.2347 | 0.5404 | 0.8208 |
| w/o Lca | 0.4044 | 0.6455 | 0.1067 | 0.2161 | 0.4920 | 0.7719 |
| w/o MCA | 0.4072 | 0.6523 | 0.1105 | 0.2190 | 0.5218 | 0.8077 |
| w/o LMSG | 0.4202 | 0.6667 | 0.1091 | 0.2285 | 0.5398 | 0.8258 |
- Effectiveness of the Fusion Network (FN) and Masked Sequence Granularity Loss (LMSG):
  - Removing the Multi-granularity Fusion Network (w/o FN) results in a performance drop across all metrics, especially on tail items (e.g., Tail N@10 drops from 0.1306 to 0.1013). This confirms that adaptively aggregating multi-granular semantic views is crucial for constructing fine-grained SID representations, which in turn benefits tail items.
  - Removing the Masked Sequence Granularity Loss (w/o LMSG) also leads to a noticeable decline, particularly on tail items (Tail N@10 drops from 0.1306 to 0.1091), indicating that this loss improves the quality and robustness of SID embeddings by encouraging the model to learn meaningful correlations between different semantic granularities.
- Role of Cross-branch Interaction (MCA):
  - Removing the Multi-granularity Cross Attention (w/o MCA) leads to a significant degradation, particularly on head items (Head N@10 drops from 0.5483 to 0.5218). This is critical evidence that MCA allows the HID branch to selectively absorb beneficial multi-granular semantic information from SID without losing the identifier uniqueness required for head items, thereby preventing collaborative overwhelming.
- Importance of Code-guided Alignment (LCA):
  - Removing the Code-guided Alignment Loss (w/o LCA) causes the largest performance drop across both the Head and Tail groups (e.g., Overall N@10 drops from 0.4272 to 0.4044, Head N@10 from 0.5483 to 0.4920, and Tail N@10 from 0.1306 to 0.1067). This highlights the critical role of LCA in aligning the semantic and collaborative spaces, enabling accurate and noise-resistant sharing of collaborative information among semantically related items, especially for tail items borrowing signals from head items.

These ablation results confirm that all proposed modules (Fusion Network, LCA, MCA, and LMSG) are essential and contribute significantly to H2Rec's superior performance by addressing the limitations of HID and SID and harmonizing their strengths.
6.3. Hyperparameter Analysis
The paper analyzes the sensitivity of two key hyperparameters: $\beta$ (the weight of the Code-guided Alignment Loss $\mathcal{L}_{CA}$) and $\gamma$ (the weight of the Masked Sequence Granularity Loss $\mathcal{L}_{MSG}$).
The following figure (Figure 3 from the original paper) shows the hyper-parameter results on the Yelp Dataset:
The figure shows the hyper-parameter tuning results on the Yelp dataset. The left panel plots H@10, with "Masked" marked by triangles and "Alignment" by dots; the right panel plots N@10. In both panels the horizontal axis is the hyper-parameter value and the vertical axis is the corresponding metric.
- Alignment Weight $\beta$:
  - Performance (both H@10 and N@10) initially increases as $\beta$ grows, reaching an optimum at a moderate value.
  - A small $\beta$ provides insufficient code-guided alignment, so too few collaborative signals are transferred to semantically similar tail items.
  - An excessively large $\beta$ leads to a performance decline, indicating that over-aligning the semantic space to the collaborative space can introduce noise and harm generalization. A delicate balance is therefore needed to control the information flow between the two spaces.
- Granularity Weight $\gamma$:
  - Performance for the Masked Sequence Granularity Loss also peaks at a moderate value of $\gamma$.
  - As an auxiliary regularizer, $\gamma$ needs a balanced weight: low values fail to ensure the robustness of SID representations because internal consistency among multi-granular semantics is not adequately enforced.
  - Excessively large values interfere with the primary recommendation task, as the model over-prioritizes learning internal semantic consistency at the expense of overall recommendation accuracy.

These analyses confirm that proper hyperparameter tuning is needed for H2Rec to leverage its alignment strategies effectively without introducing detrimental effects.
6.4. In-depth Analysis for $\mathcal{L}_{CA}$
This section further validates the Code-guided Alignment Loss's effectiveness in mitigating noisy collaborative sharing by analyzing two key design factors: the code-matching threshold and the co-occurrence context window size.
6.4.1. Impact of Code Matching Threshold
The code-matching threshold determines how many semantic code levels items must share to be considered semantically similar and included in the positive set $\mathcal{P}(i)$.
The following are the results from Table 4 of the original paper:
| code num | Overall | Tail | Head | |||
| N@10 | H@10 | N@10 | H@10 | N@10 | H@10 | |
| Removed | 0.4095 | 0.6505 | 0.1177 | 0.2211 | 0.4970 | 0.7769 |
| 1 | 0.4059 | 0.6473 | 0.1058 | 0.2044 | 0.4955 | 0.7712 |
| 2 | 0.4199 | 0.6582 | 0.1204 | 0.2591 | 0.5385 | 0.8247 |
| 3 (Ours) | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 |
- Setting the threshold to 1 (considering only the coarsest semantic layer) leads to notable performance degradation. Overly coarse categories group many weakly related items, introducing noise into the positive set and undermining the quality of collaborative signal transfer.
- Increasing the threshold to 3 (the default setting for H2Rec) yields significant improvements across all metrics, especially for tail items. This demonstrates that deeper semantic matching is crucial for filtering out noise and ensuring that only truly similar items share collaborative signals, making the LCA more effective.
- In the "Removed" row, LCA is still active but the code-guided positives are absent. Its performance is slightly better than the threshold-1 setting, suggesting that very coarse semantic matching can be worse than no explicit semantic filtering when a shared top-level code does not imply strong similarity.
6.4.2. Impact of Context Window Size
The context window size defines the local co-occurrence window used to identify additional items for the positive set $\mathcal{P}(i)$.
The following are the results from Table 5 of the original paper:
| context | Overall N@10 | Overall H@10 | Tail N@10 | Tail H@10 | Head N@10 | Head H@10 |
| --- | --- | --- | --- | --- | --- | --- |
| Removed | 0.4148 | 0.6522 | 0.1176 | 0.2230 | 0.5072 | 0.7911 |
| 1 | 0.4190 | 0.6523 | 0.1208 | 0.2294 | 0.5205 | 0.8062 |
| 3 (Ours) | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 |
| 5 | 0.4205 | 0.6637 | 0.1244 | 0.2638 | 0.5441 | 0.8299 |
- Expanding the co-occurrence window from 1 to 3 (the default setting) improves performance across the board, confirming the utility of incorporating local, sequentially related items as additional positives.
- When the window is enlarged to 5, performance drops slightly: a too-large window starts to include less relevant items, introducing noise into the positive set and diluting the quality of the collaborative signals.
- The "Removed" row implies LCA is still active, but the co-occurrence part of the positive set is absent. Performance improves once the co-occurrence positives are introduced, underscoring the value of this component.
These results collectively demonstrate that H2Rec's design for the Code-guided Alignment Loss, combining multi-level semantic matching with a judiciously sized local collaborative context, is essential for achieving accurate and noise-robust alignment between the SID and HID representations; a small sketch of this positive-set construction follows.
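Under the same assumptions as the previous sketch (and reusing its `semantic_positives` helper), the two sources of positives could be combined along these lines; the helper names are hypothetical and the paper's exact construction may differ:

```python
def cooccurrence_positives(sequence, target_pos, window=3):
    """Items appearing within `window` positions of the target item in one interaction sequence."""
    lo = max(0, target_pos - window)
    hi = min(len(sequence), target_pos + window + 1)
    return [sequence[i] for i in range(lo, hi) if i != target_pos]


def build_positive_set(target_id, target_pos, sequence, sid_table,
                       threshold=3, window=3):
    """Union of code-matched semantic positives and local co-occurrence positives."""
    positives = set(semantic_positives(target_id, sid_table, threshold))
    positives.update(cooccurrence_positives(sequence, target_pos, window))
    positives.discard(target_id)
    return positives
```

The tables above match the intuition encoded here: a window of 3 captures genuinely related neighbours, while a window of 5 begins to admit noise.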
6.5. Group Analysis
For a more fine-grained analysis, the paper divides the item popularity groups from Table 2 into five finer segments.
The following figure (Figure 4 from the original paper) shows the detailed results in different item groups on the Yelp dataset:
(Figure placeholder) The figure shows detailed H@10 and N@10 results across item popularity groups on the Yelp dataset, comparing three recommendation methods; the proposed method outperforms LLM-ESR and PCR-CA.
- Comparison of Hybrid Methods: The figure shows that PCR-CA (a hybrid embedding method) achieves a clear performance gain over LLM-ESR (an LLM-enhanced HID method) for groups with popularity lower than 60 (the first three grouped columns). For groups with popularity greater than 60 (the last two grouped columns), PCR-CA performs almost on par with LLM-ESR. This illustrates the general advantage of hybrid embedding approaches over purely LLM-enhanced HID methods across various popularity levels.
- H2Rec's Consistent Superiority: H2Rec consistently outperforms all existing methods across all five popularity groups for both H@10 and N@10. This is the strongest evidence that H2Rec successfully leverages its dual-branch architecture and dual-level alignment strategy to achieve additional optimization, breaking the performance bottleneck and providing robust recommendations for both very popular and very niche items. This fine-grained analysis reinforces the conclusion from the Head/Tail breakdown.
6.6. Generality Validation
6.6.1. Quantization Mechanism Analysis
To demonstrate H2Rec's generality, its performance is evaluated using SID generated by three common vector quantization mechanisms: VQ (Vector Quantization), PQ (Product Quantization), and RQ (Residual Quantization).
The following are the results from Table 6 of the original paper:
| Model | Overall N@10 | Overall H@10 | Tail N@10 | Tail H@10 | Head N@10 | Head H@10 |
| --- | --- | --- | --- | --- | --- | --- |
| VQ + H2Rec | 0.4091 | 0.6531 | 0.0943 | 0.1986 | 0.5018 | 0.7869 |
| PQ + H2Rec | 0.4124 | 0.6597 | 0.1091 | 0.2285 | 0.5118 | 0.7958 |
| RQ + H2Rec | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 |
- H2Rec maintains competitive performance across all quantization mechanisms (VQ, PQ, RQ), demonstrating the robustness and general applicability of its design.
- RQ (Residual Quantization) achieves the strongest performance. This is attributed to its residual coding structure, which naturally captures fine-grained semantic granularity, helping to mitigate data sparsity and yielding better performance on Tail items (a toy illustration of residual coding follows this list).
- PQ (Product Quantization) provides moderate improvements. It partitions the embedding space into subspaces, offering a balanced trade-off between semantic precision and code compactness.
- VQ (Vector Quantization) performs the weakest. Mapping each item to a single discrete code results in severe semantic collapse, reducing item uniqueness across both head and tail groups.

Even with the least expressive VQ setting, H2Rec achieves strong results, confirming its ability to effectively exploit semantic signals regardless of the underlying quantization mechanism. This highlights its generalizability and plug-and-play nature.
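To make the contrast between the quantizers tangible, here is a minimal NumPy sketch of residual-quantization encoding under the assumption of already-trained per-level codebooks; it only illustrates why earlier RQ levels are coarser and later levels finer, and is not the paper's RQ-VAE:

```python
import numpy as np

def rq_encode(x, codebooks):
    """Residual-quantize vector x with a list of per-level codebooks.

    codebooks: list of arrays, each of shape (codebook_size, dim).
    Each level quantizes the residual left by the previous levels, so the
    first code captures coarse semantics and later codes refine it.
    """
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for codebook in codebooks:
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)
        residual = residual - codebook[idx]
    return tuple(codes)

# Plain VQ is the one-level special case: rq_encode(x, [codebook_0]),
# which is why it collapses many items onto a single shared code.
```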
6.6.2. Backbone Analysis
To verify the model-agnostic nature of H2Rec, it is evaluated with two common sequential recommendation backbones: GRU4Rec (RNN-based) and BERT4Rec (Transformer-based). Comparisons are made against PCR-CA and LLM-ESR using the same backbones.
The following are the results from Table 7 of the original paper:
| Backbone | Model | Overall N@10 | Overall H@10 | Tail N@10 | Tail H@10 | Popular N@10 | Popular H@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GRU4Rec | PCR-CA | 0.3613 | 0.5988 | 0.0745 | 0.1889 | 0.4311 | 0.7204 |
| GRU4Rec | LLM-ESR | 0.3627 | 0.6075 | 0.0482 | 0.0952 | 0.4491 | 0.7338 |
| GRU4Rec | Ours | 0.3804 | 0.6239 | 0.0983 | 0.2058 | 0.4634 | 0.7470 |
| BERT4Rec | PCR-CA | 0.4173 | 0.6604 | 0.0720 | 0.1689 | 0.5346 | 0.8175 |
| BERT4Rec | LLM-ESR | 0.4205 | 0.6635 | 0.0503 | 0.1247 | 0.5444 | 0.8223 |
| BERT4Rec | Ours | 0.4298 | 0.6724 | 0.0991 | 0.1735 | 0.5545 | 0.8344 |
- Across both RNN-based (GRU4Rec) and Transformer-based (BERT4Rec) architectures, H2Rec consistently surpasses all baselines (PCR-CA, LLM-ESR).
- This confirms that H2Rec is not restricted to a specific encoder design. It functions as a universal, plug-and-play enhancement module for various SRS backbones, demonstrating its broad applicability (a simplified sketch of this wrapping follows this list).
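The plug-and-play claim boils down to the dual-branch enhancement operating on whatever hidden states the backbone produces. A highly simplified PyTorch-style sketch of that idea (illustrative module names, with a single cross-attention standing in for the paper's Multi-granularity Cross Attention Network) might look like this:

```python
import torch.nn as nn

class DualBranchWrapper(nn.Module):
    """Wrap an arbitrary sequence backbone with HID and SID embedding branches.

    backbone: any module mapping (B, L, dim) item embeddings to hidden states,
              e.g. a GRU4Rec- or BERT4Rec-style encoder.
    """
    def __init__(self, backbone, num_items, num_codes, dim):
        super().__init__()
        self.hid_emb = nn.Embedding(num_items, dim)   # unique collaborative identity
        self.sid_emb = nn.Embedding(num_codes, dim)   # shared multi-level semantic codes
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.backbone = backbone

    def forward(self, item_seq, code_seq):
        # item_seq: (B, L) hash ids; code_seq: (B, L, levels) semantic code ids
        hid = self.hid_emb(item_seq)                  # (B, L, dim)
        sid = self.sid_emb(code_seq).mean(dim=2)      # (B, L, dim), pooled over levels
        # HID queries SID: semantic signals are injected without overwriting identity
        enriched, _ = self.cross_attn(hid, sid, sid)
        return self.backbone(hid + enriched)
```

Because the wrapper only adds embeddings and one attention layer in front of the encoder, swapping GRU4Rec for BERT4Rec (or any other sequence model) requires no change to the enhancement itself.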
6.7. Semantic Code Analysis
This section analyzes H2Rec's performance under SID with different qualities, focusing on code layers and codebook size.
The following are the results from Table 8 of the original paper:
| Settings | Overall N@10 | Overall H@10 | Tail N@10 | Tail H@10 | Head N@10 | Head H@10 | Coll.% | Util.% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3x128 | 0.4145 | 0.6510 | 0.1152 | 0.2480 | 0.5365 | 0.8150 | 33.92 | 0.54 |
| 3x256 | 0.4192 | 0.6585 | 0.1215 | 0.2565 | 0.5402 | 0.8215 | 29.92 | 0.068 |
| 3x512 | 0.4235 | 0.6640 | 0.1268 | 0.2630 | 0.5445 | 0.8270 | 25.85 | 0.008 |
| 4x128 | 0.4272 | 0.6692 | 0.1306 | 0.2693 | 0.5483 | 0.8324 | 22.28 | 0.004 |
| 4x256 | 0.4338 | 0.6785 | 0.1385 | 0.2790 | 0.5540 | 0.8410 | 11.87 | < 0.001 |
- Impact of Codebook Capacity (e.g., 3x128 vs. 3x256 vs. 3x512): Expanding codebook capacity (e.g., from 128 to 512 codewords per layer with 3 layers) consistently yields performance improvements. This is attributed to a significant reduction in collision rates (Coll.%); lower collision rates mitigate semantic ambiguity and enhance the distinctiveness of item representations.
- Impact of Code Layers (e.g., 3x128 vs. 4x128): Increasing the number of code layers (e.g., from 3 to 4, keeping the per-layer codebook size at 128) also improves performance and reduces collision rates, suggesting that more layers capture finer semantic granularities and lead to better representations.
- Trade-off between Performance and Efficiency: While larger codebook sizes and more layers generally improve performance by reducing collision rates, this comes at the cost of codebook redundancy and potentially computational efficiency. For example, the 4x256 setting achieves the best performance and the lowest collision rate (11.87%) but suffers from an extremely low utilization rate (Util.% < 0.001%). A low utilization rate means many codewords in the codebook are never used, indicating inefficiency (a sketch of how these two rates can be computed follows this analysis).
- Optimal Setting: The authors choose 4x128 as the default setting because it strikes a balance between minimizing collision rates (22.28%) to ensure distinctiveness and maintaining a reasonable (though still low) utilization rate (0.004%) for efficiency.

This analysis provides valuable insights into how the design choices for Semantic IDs (number of layers and codebook size) impact H2Rec's performance, especially the trade-off between representation distinctiveness (lower collision) and codebook efficiency (higher utilization).
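For readers who want to reproduce the two diagnostic columns, here is a small sketch of one way the collision and utilization rates could be computed. The paper's exact definitions may differ; this version treats a collision as any item whose full SID is shared, and utilization as distinct full codes over all possible level combinations, which is consistent with the magnitudes in Table 8:

```python
from collections import Counter

def collision_rate(sid_table):
    """Percentage of items whose full semantic ID is shared with at least one other item."""
    counts = Counter(sid_table.values())
    collided = sum(1 for sid in sid_table.values() if counts[sid] > 1)
    return 100.0 * collided / len(sid_table)

def utilization_rate(sid_table, num_levels, codebook_size):
    """Percentage of possible full codes (codebook_size ** num_levels) actually assigned."""
    used = len(set(sid_table.values()))
    return 100.0 * used / (codebook_size ** num_levels)
```

Under this reading, adding a fourth level multiplies the denominator by the codebook size, which is consistent with utilization dropping by roughly two orders of magnitude between the 3x128 and 4x128 rows even as collisions fall.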
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously investigated the fundamental trade-off between identifier uniqueness (inherent in Hash IDs for head items) and semantic generalization (offered by Semantic IDs for tail items) in sequential recommendation. It formally identified the collaborative overwhelming phenomenon, where the quantization process of SID compromises uniqueness, leading to a performance seesaw between popular and niche items.
To address this dilemma, the authors proposed H2Rec, a novel dual-branch framework that harmonizes Hash IDs (HID) and Semantic IDs (SID). The framework integrates a Multi-granularity Fusion Network in the SID branch to construct fine-grained semantic representations and a Multi-granularity Cross Attention Network in the HID branch to selectively inject semantic signals while preserving unique collaborative identity. This architecture effectively alleviates the Semantic Homogeneity problem and prevents the Collaborative Overwhelming phenomenon.
Furthermore, H2Rec introduced a Dual-level Alignment Strategy, comprising a Code-guided Alignment Loss at the item level and a Masked Sequence Granularity Loss at the user level. These mechanisms facilitate knowledge transfer between the semantic and collaborative spaces, enabling long-tail items to "borrow" high-quality signals from semantically similar head items and mitigating the Noisy Collaborative Sharing problem, while enhancing robust user preference modeling.
Extensive experiments on three real-world datasets demonstrated that H2Rec consistently outperforms state-of-the-art baselines. Crucially, it successfully balances recommendation quality across head and tail items, achieving substantial gains on tail items without sacrificing performance on head items, thereby breaking a long-standing performance bottleneck. The framework's quantization-agnostic and model-agnostic nature further underscores its generalizability and practical utility.
7.2. Limitations & Future Work
The paper does not explicitly list "Limitations" as a separate section but implies some through its discussion, such as the codebook redundancy observed in Semantic Code Analysis (Table 8), where increasing codebook size for performance came at the cost of extremely low utilization rates. This suggests potential inefficiencies in SID generation or storage.
Potential future research directions, though not explicitly stated as such, could be inferred from the paper's focus:
- Dynamic Codebook Adaptation: Investigating adaptive vector quantization methods that can dynamically adjust codebook size and utilization to maintain efficiency while maximizing semantic precision, especially given the low utilization rates observed.
- Enhanced Semantic-Collaborative Interaction: Exploring more sophisticated interaction mechanisms between the SID and HID branches beyond cross-attention, potentially involving graph neural networks or more complex fusion architectures that can capture higher-order relationships.
- Explanation and Interpretability: With the increased complexity of dual-branch modeling and multi-level alignment, future work could focus on improving the interpretability of H2Rec's recommendations, explaining why a particular item was recommended based on its semantic and collaborative features.
- Real-time Adaptation: Further optimizing the framework for real-time recommendation scenarios, especially for dynamically changing item features or user preferences.
- Beyond Next-Item Prediction: Extending H2Rec to other recommendation tasks, such as list recommendation or session-based recommendation with diverse objectives.
7.3. Personal Insights & Critique
This paper provides a highly comprehensive and effective solution to a long-standing problem in sequential recommendation: the persistent trade-off between head and tail item performance. The explicit identification of the collaborative overwhelming phenomenon is a significant contribution, as it clearly articulates a key limitation of SID-based methods that was previously less formally recognized.
The dual-branch architecture of H2Rec is intuitively compelling. By maintaining separate but interacting HID and SID representations, the model can specialize in capturing different facets of item information without forcing them into a single, potentially compromising, representation. The Multi-granularity Cross Attention Network is particularly clever in its design, allowing the HID to "query" the SID for semantic enrichment while protecting its unique identity – a direct answer to the collaborative overwhelming problem. Similarly, the Code-guided Alignment Loss offers a more nuanced approach to contrastive learning, ensuring that tail items borrow high-quality, semantically relevant signals, rather than noisy co-occurrence patterns.
Transference and Application: The methods and conclusions of H2Rec can certainly be transferred to other domains facing similar long-tail challenges or the need to harmonize different types of item identifiers (e.g., multimodal features, knowledge graph embeddings). For example, in fashion recommendation, where visual semantics are critical, SID could be derived from image features, and H2Rec could harmonize these with HID. In scientific article recommendation, SID could be derived from abstract/full text and HID from citation networks. The model-agnostic nature of H2Rec makes it highly adaptable.
Potential Issues or Areas for Improvement:
- Computational Cost: A dual-branch architecture with multi-granularity fusion, cross-attention, and dual-level alignment is inherently more complex and likely more computationally expensive than single-branch models. While the paper shows performance gains, a deeper analysis of the trade-off between computational cost (training and inference time, memory footprint) and performance would be valuable. The use of RQ-VAE for SID generation also adds an initial computational overhead.
- Hyperparameter Sensitivity: As shown in the hyperparameter analysis, the weights of the two auxiliary losses are crucial. Finding optimal values for new datasets might require significant tuning, which can be a practical challenge. The choices of code-matching threshold and context window size also indicate sensitivity.
- Interpretability of Multi-Granularities: While multi-granular semantics are theoretically appealing, understanding what specific semantic aspects each quantization level represents could be challenging. More tools for interpreting these granularities could further enhance the model's transparency.
- Codebook Redundancy: The low utilization rates observed in the Semantic Code Analysis (Table 8) suggest that a significant portion of the learned codebook is unused. This represents wasted capacity and potential for optimization. Future work could explore more adaptive or dynamic codebook learning strategies to maximize utilization without sacrificing performance or increasing collision rates.
- Generalizability of LLM Embeddings: The performance relies heavily on the quality of LLM-derived semantic embeddings. While text-embedding-ada-002 is powerful, its biases or limitations could propagate through the SID branch. Exploring robustness to different LLM encoders or incorporating other forms of semantic information (e.g., knowledge graphs) could be interesting.

Overall, H2Rec presents a robust and well-thought-out solution that advances the state-of-the-art in sequential recommendation by intelligently harmonizing disparate item representations. Its critical analysis of existing problems and elegant architectural design make it a significant contribution to the field.