Knowledge Graph Convolutional Networks for Recommender Systems
TL;DR Summary
KGCN uses neighbor sampling to capture high-order structural and semantic information in knowledge graphs, addressing sparsity and cold-start in recommender systems. It performs well on large-scale datasets in movie, book, and music recommendations.
Abstract
To alleviate sparsity and cold start problem of collaborative filtering based recommender systems, researchers and engineers usually collect attributes of users and items, and design delicate algorithms to exploit these additional information. In general, the attributes are not isolated but connected with each other, which forms a knowledge graph (KG). In this paper, we propose Knowledge Graph Convolutional Networks (KGCN), an end-to-end framework that captures inter-item relatedness effectively by mining their associated attributes on the KG. To automatically discover both high-order structure information and semantic information of the KG, we sample from the neighbors for each entity in the KG as their receptive field, then combine neighborhood information with bias when calculating the representation of a given entity. The receptive field can be extended to multiple hops away to model high-order proximity information and capture users' potential long-distance interests. Moreover, we implement the proposed KGCN in a minibatch fashion, which enables our model to operate on large datasets and KGs. We apply the proposed model to three datasets about movie, book, and music recommendation, and experiment results demonstrate that our approach outperforms strong recommender baselines.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "Knowledge Graph Convolutional Networks for Recommender Systems". It clearly indicates the central topic, which is the application of Knowledge Graph Convolutional Networks (KGCN) to enhance recommender systems.
1.2. Authors
The authors of this paper are:
- Hongwei Wang (Shanghai Jiao Tong University, Shanghai, China)
- Miao Zhao (The Hong Kong Polytechnic University, Hong Kong, China)
- Xing Xie (Microsoft Research Asia, Beijing, China)
- Wenjie Li (The Hong Kong Polytechnic University, Hong Kong, China)
- Minyi Guo (Shanghai Jiao Tong University, Shanghai, China)
Their affiliations suggest a collaborative effort between academic institutions and industrial research (Microsoft Research Asia), which often leads to both theoretically sound and practically relevant research.
1.3. Journal/Conference
This paper was published at the 2019 World Wide Web Conference (WWW '19), held from May 13-17, 2019, in San Francisco, CA, USA. The World Wide Web Conference (WWW) is a highly reputable and influential academic conference in the fields of web science, information retrieval, and data mining. Publication at WWW signifies that the research has undergone a rigorous peer-review process and is considered a significant contribution to the field.
1.4. Publication Year
The paper was published in 2019. The arXiv timestamp (2019-03-18T17:17:34 UTC) indicates that it was made available online as a preprint around March 2019, ahead of the May conference.
1.5. Abstract
The paper addresses the sparsity and cold start problems inherent in collaborative filtering (CF)-based recommender systems (RS). To mitigate these issues, researchers often leverage auxiliary information such as user and item attributes. Recognizing that these attributes are interconnected and form a knowledge graph (KG), the authors propose Knowledge Graph Convolutional Networks (KGCN).
KGCN is presented as an end-to-end framework designed to effectively capture inter-item relatedness by mining associated attributes within the KG. Its core mechanism involves sampling neighbors for each entity in the KG to define their receptive field, and then combining this neighborhood information with a bias when computing entity representations. This approach aims to automatically discover both high-order structural information and semantic information from the KG. The receptive field can be extended to multiple hops to model high-order proximity and capture users' long-distance interests.
Furthermore, the KGCN is implemented in a minibatch fashion to handle large datasets and KGs efficiently. The model's effectiveness is demonstrated through experiments on three real-world datasets for movie, book, and music recommendation, where it consistently outperforms strong recommender baselines.
1.6. Original Source Link
The original source link is: https://arxiv.org/abs/1904.12575v1
This is a link to arXiv, which is a preprint server. The v1 suffix indicates that it is the first version of the paper uploaded to arXiv.
1.7. PDF Link
The PDF link is: https://arxiv.org/pdf/1904.12575v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the limitations of traditional collaborative filtering (CF) in recommender systems (RS), specifically the sparsity of user-item interactions and the cold start problem.
- Sparsity: In real-world recommender systems, users typically interact with only a tiny fraction of available items. This leads to a sparse user-item interaction matrix, making it difficult for CF algorithms to find enough similar users or items to make accurate recommendations.
- Cold Start: This problem arises when new users or new items are introduced to the system. For new users, there is no interaction history on which to base recommendations. For new items, there is no interaction data with which to recommend them to existing users.
These problems are important because they severely limit the accuracy and coverage of recommendations, leading to poor user experience and underutilized items.
Previous research often addresses these issues by collecting attributes of users and items. However, the authors point out a critical gap: these attributes are generally not isolated but are interconnected, forming a knowledge graph (KG). While incorporating KGs into RS has shown promise (e.g., in exploring latent connections, extending user interests, and providing explainability), directly utilizing them is challenging due to their high dimensionality and heterogeneity. Existing KG-aware methods often rely on knowledge graph embedding (KGE) (which might not be optimal for recommendation tasks as they focus on rigorous semantic relatedness within the graph) or meta-path/meta-graph-based approaches (which require manual design and may not be optimal). RippleNet, a memory-network-like model, addresses multi-hop propagation but weakly characterizes the importance of relations and can incur heavy computational overhead.
The paper's entry point and innovative idea lie in proposing a novel graph convolutional network (GCN)-based approach, KGCN, that directly operates on the knowledge graph to automatically capture both high-order structural information and semantic information in a personalized manner, without relying on manually designed paths or fixed embedding strategies.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Novel Framework (KGCN): They propose Knowledge Graph Convolutional Networks (KGCN), an end-to-end framework specifically designed for recommender systems. This framework effectively explores users' preferences on the knowledge graph by leveraging its high-order structure and semantic information.
- Personalized, Biased Neighborhood Aggregation: KGCN introduces a mechanism to aggregate neighborhood information with bias, where neighbors are weighted by scores that depend on the connecting relation and the specific user. This allows the model to capture both the semantic information of the KG and users' personalized interests in relations, producing a more nuanced item representation.
- High-Order Proximity Modeling: By extending the receptive field of each entity to multiple hops away in the KG, KGCN can model high-order proximity information and capture users' potential long-distance interests, which is crucial for discovering complex relationships between items.
- Minibatch Implementation: KGCN is designed to run in a minibatch fashion, enabling it to operate efficiently on large-scale datasets and knowledge graphs and making it practical for real-world recommender systems.
- Empirical Validation: Through extensive experiments on three real-world datasets (MovieLens-20M, Book-Crossing, and Last.FM) covering movie, book, and music recommendation, KGCN demonstrates superior performance. It achieves average AUC gains of 4.4%, 8.1%, and 6.2% on the respective datasets compared to state-of-the-art baselines, particularly excelling in sparse scenarios.
- Open-Source Release: The authors have released the code for KGCN and the knowledge graph datasets used, fostering reproducibility and further research in the community.

The key findings demonstrate that KGCN effectively addresses the sparsity and cold start problems by deeply integrating knowledge graph information, leading to significantly improved recommendation accuracy. The model's ability to capture personalized interests through biased aggregation and high-order connections proves crucial to its success.
3. Prerequisite Knowledge & Related Work
This section provides the necessary background to understand the KGCN paper, explaining foundational concepts and summarizing prior research discussed by the authors.
3.1. Foundational Concepts
3.1.1. Recommender Systems (RS)
Recommender Systems (RS) are information filtering systems that aim to predict user preferences or ratings for items and recommend items that users might like. They are widely used in various online platforms like e-commerce (e.g., Amazon), streaming services (e.g., Netflix, Spotify), and social media (e.g., Facebook, Twitter) to alleviate information overload by providing personalized suggestions.
3.1.2. Collaborative Filtering (CF)
Collaborative Filtering (CF) is a common technique in recommender systems. The core idea behind CF is that users who agreed in the past on some items will agree again in the future on other items, or that items liked by similar users are likely to be liked by the current user.
- User-based CF: Recommends items to a user based on the preferences of other "similar" users.
- Item-based CF: Recommends items that are similar to the ones a user has liked in the past.
CF often represents users and items as ID-based representation vectors (also known as embeddings) and models their interactions (e.g., by inner product or neural networks).
3.1.3. Sparsity Problem
The sparsity problem refers to the issue in recommender systems where the user-item interaction matrix is very sparse, meaning most users have interacted with only a small fraction of the total available items. This lack of data makes it difficult for CF algorithms to find reliable similarities between users or items, leading to less accurate recommendations.
3.1.4. Cold Start Problem
The cold start problem occurs when a recommender system cannot make accurate recommendations because it lacks sufficient information.
- New User Cold Start: When a new user joins, there's no historical interaction data to build a preference profile.
- New Item Cold Start: When a new item is introduced, it has no interaction history, making it impossible for CF to recommend it to users. Auxiliary information, like item attributes, is often used to alleviate this problem.
3.1.5. Knowledge Graph (KG)
A Knowledge Graph (KG) is a structured representation of information that describes entities and their relationships. It's essentially a large network of interconnected facts.
- Entities: Nodes in the graph, representing real-world objects or abstract concepts (e.g., "George Martin", "A Song of Ice and Fire", "book genre").
- Relations: Edges connecting entities, describing the type of relationship between them (e.g., "author of", "has genre").
- Triples: Facts are typically represented as (head, relation, tail) triples (e.g., (A Song of Ice and Fire, book.book.author, George Martin)). KGs provide rich semantic relatedness and contextual information that can be beneficial for recommender systems.
3.1.6. Graph Convolutional Networks (GCN)
Graph Convolutional Networks (GCNs) are a type of neural network designed to operate directly on graph-structured data. They generalize the concept of convolutional operations (common in image processing for grids) to arbitrary graph topologies. The core idea is to compute the representation of a node by aggregating information from its neighbors.
- Convolution on Graphs: Unlike images with regular grid structures, graphs have irregular structures (a varying number of neighbors per node). GCNs define convolutional filters that can handle this irregularity.
- Node Representation Learning: GCNs learn node embeddings (low-dimensional vector representations) that capture both a node's own features and its structural role within the graph by iteratively aggregating features from its neighbors.
- Spectral vs. Non-spectral Methods:
  - Spectral Methods: Define graph convolutions in the spectral domain by transforming the graph signal using the eigenbasis of the graph Laplacian. Examples include Bruna et al. [2], Defferrard et al. [4] (Chebyshev expansion), and Kipf et al. [10] (first-order approximation). These methods often require processing the full graph.
  - Non-spectral Methods: Directly perform convolution on the graph in the spatial domain (the original graph structure), defining aggregation functions over nodes and their neighbors. KGCN falls into this category, similar to approaches like GraphSAGE [7]. These methods are often more flexible for large graphs and varying neighborhood sizes. (A small sketch of this aggregation idea follows.)
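To make the non-spectral idea concrete, here is a minimal NumPy sketch of a GraphSAGE-style aggregation step: a node's new representation comes from its own features combined with the mean of its neighbors' features. The toy graph, dimensions, weights, and the ReLU choice are illustrative assumptions, not taken from any specific paper.

```python
import numpy as np

# Toy homogeneous graph: node id -> list of neighbor ids (illustrative)
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
emb = np.random.rand(4, 8)   # initial 8-dimensional node embeddings
W = np.random.rand(8, 16)    # learnable weight applied to [self || mean(neighbors)]

def gcn_like_layer(node):
    """One non-spectral convolution step: aggregate neighbor features, then transform."""
    neigh_mean = emb[adj[node]].mean(axis=0)      # aggregate the local neighborhood
    h = np.concatenate([emb[node], neigh_mean])   # combine self and neighborhood info
    return np.maximum(0.0, W @ h)                 # ReLU non-linearity

print(gcn_like_layer(0).shape)  # (8,)
```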
3.2. Previous Works
The paper discusses several existing recommender system approaches, categorizing them into KG-free and KG-aware methods.
3.2.1. KG-Free Methods
These methods do not explicitly use knowledge graph information.
- SVD [11]: Singular Value Decomposition (SVD) is a classic CF-based matrix factorization technique. It decomposes the user-item interaction matrix into lower-rank matrices representing latent features of users and items. The predicted rating for user $u$ and item $v$ is typically modeled as the inner product of their latent vectors.
  - Formula (conceptual for CF models): $\hat{y}_{uv} = \mathbf{u}^\top \mathbf{v}$, where $\mathbf{u}$ is the latent vector of user $u$ and $\mathbf{v}$ is the latent vector of item $v$.
- LibFM [14]: Factorization Machines (FM) are a generic predictor working with any real-valued feature vector, and LibFM is an implementation of FM. It models all variable interactions using factorized parameters, making it capable of estimating interactions even in highly sparse settings. In recommender systems, it can use the user ID, item ID, and other features.
  - Formula (FM, up to second-order interactions): $\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$. Here, $w_0$ is the global bias, $w_i$ is the weight of the $i$-th feature, $x_i$ is the $i$-th feature value, and $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ is the dot product of two latent vectors representing the interaction between features $i$ and $j$. (A short sketch of this prediction follows.)
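As a concrete illustration of the FM formula above, the following NumPy sketch computes the second-order FM prediction using the standard O(nk) reformulation of the pairwise term. The feature layout (a one-hot user ID concatenated with a one-hot item ID) and all values are assumptions for illustration.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction for one feature vector x.
    Uses the identity: sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]."""
    linear = w0 + w @ x
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

n, k = 6, 4                                 # 6 features, factor dimension 4 (illustrative)
x = np.array([1., 0., 1., 0., 0., 1.])      # e.g., one-hot user id + one-hot item id
w0, w, V = 0.1, np.random.rand(n), np.random.rand(n, k)
print(fm_predict(x, w0, w, V))
```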
3.2.2. KG-Aware Methods
These methods incorporate knowledge graph information.
- Knowledge Graph Embedding (KGE) Methods [20]: These methods map entities and relations from a knowledge graph into low-dimensional vector spaces (embeddings).
  - TransE [1]: A prominent KGE method. TransE models relations as "translations" in the embedding space. For a triple (head, relation, tail), it aims to ensure that the embedding of the head entity plus the embedding of the relation is approximately equal to the embedding of the tail entity: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. (A small scoring sketch appears after this list.)
    - Score function (conceptual for TransE): $f_r(h, t) = \| \mathbf{h} + \mathbf{r} - \mathbf{t} \|_p$, where $\| \cdot \|_p$ is the $p$-norm (typically $L_1$ or $L_2$).
  - TransR [12]: An extension of TransE that maps entities and relations into different vector spaces. It projects entities into a relation-specific space before performing the translation operation.
  - Limitation for RS: The paper argues that KGE methods like TransE and TransR are primarily designed for in-graph applications (e.g., KG completion, link prediction) by modeling rigorous semantic relatedness. This objective might not directly align with recommender system goals, where user preferences and personalized item representations are key.
- LibFM + TransE: An approach that extends LibFM by incorporating TransE embeddings. It attaches the entity representation learned by TransE to each user-item pair as additional features for LibFM. This is a hybrid approach combining factorization machines with knowledge graph embeddings.
- PER [22]: Personalized Entity Recommendation (PER) treats the KG as a heterogeneous information network. It extracts meta-path-based features to represent the connectivity between users and items.
  - Meta-path: A sequence of relation types connecting two entity types (e.g., User-Movie-Director-Movie for movie recommendations).
  - Limitation: Relies heavily on manually designed meta-paths or meta-graphs, which are difficult to optimize and may not capture all relevant connections.
- CKE [23]: Collaborative Knowledge Embedding (CKE) combines collaborative filtering with structural, textual, and visual knowledge from the KG within a unified framework. The paper specifically mentions implementing CKE as CF plus a structural knowledge module, implying that it regularizes CF embeddings using KGE objectives (e.g., TransR-like relations).
- RippleNet [18]: A memory-network-like model that propagates users' potential preferences on the knowledge graph to explore their hierarchical interests. It effectively performs multi-hop message passing over the KG.
  - Mechanism: It learns a user's ripple set, which represents their interests by traversing the KG.
  - Limitation: The paper notes that RippleNet weakly characterizes the importance of relations (via a quadratic form $\mathbf{v}^\top \mathbf{R} \mathbf{h}$), and the size of the ripple set can grow unpredictably, leading to heavy computation and storage overhead for large KGs.
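The following is a small NumPy sketch of the TransE translation principle referenced in the list above: a triple is scored by how closely the head embedding plus the relation embedding lands on the tail embedding. The dimension and random vectors are illustrative.

```python
import numpy as np

def transe_score(h, r, t, p=1):
    """TransE plausibility score: smaller ||h + r - t||_p means a more plausible triple."""
    return np.linalg.norm(h + r - t, ord=p)

d = 8                                   # embedding dimension (illustrative)
h, r, t = (np.random.rand(d) for _ in range(3))
print(transe_score(h, r, t, p=2))       # L2 variant
```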
3.2.3. Graph Neural Networks (GNN) for Homogeneous Graphs
The paper also contextualizes KGCN with general GCN methods and other Graph Neural Networks (GNNs) designed for homogeneous graphs (graphs where all nodes and edges are of the same type).
- GraphSAGE [7]: A non-spectral GCN method that learns inductive representations by sampling and aggregating features from a node's local neighborhood. It operates in a minibatch fashion. KGCN shares conceptual similarities with GraphSAGE in its sampling and aggregation strategy.
- PinSage [21]: A GCN-based web-scale recommender system developed by Pinterest. It learns node embeddings for items in a large item-item graph (homogeneous) by aggregating information from sampled neighbors using GraphSAGE-like principles.
- GAT [15]: Graph Attention Networks (GATs) are another non-spectral GNN that uses an attention mechanism to assign different weights to different neighbors during a node's aggregation. This allows the model to learn a variable importance for each neighbor.
- Differentiation from KGCN: The paper highlights that both PinSage and GAT are primarily designed for homogeneous graphs. KGCN specifically addresses the challenges of heterogeneous knowledge graphs in recommender systems, where different relation types and their personalized importance are crucial.
3.3. Technological Evolution
The field of recommender systems has evolved through several stages:
1. Early Collaborative Filtering: Focused on ID-based representations and matrix factorization (SVD, FunkSVD). These methods struggle with sparsity and cold start.
2. Feature-Rich Recommenders: Incorporated auxiliary user and item attributes to enrich representations and alleviate sparsity (LibFM, Wide & Deep Learning [3]). This provided more context but often treated attributes as isolated features.
3. Knowledge Graph Integration (Early Stage): Recognized the interconnectedness of attributes and started using Knowledge Graphs. Initial approaches included KGE (TransE, TransR) to pre-embed KG information, or meta-path mining (PER, FMG [24]) to extract connectivity patterns. These methods faced challenges with KGE objectives not aligning perfectly with RS tasks, or with the manual effort required for meta-path design.
4. Memory Networks and KG Propagation: Models like RippleNet introduced memory-network-like mechanisms to propagate user preferences directly on the KG, allowing a more dynamic exploration of multi-hop interests.
5. Graph Neural Networks for RS: The emergence of Graph Neural Networks (GCNs, GraphSAGE, GAT) provided powerful tools for learning on arbitrary graph structures. Initially applied to homogeneous graphs (PinSage), the logical next step is to adapt them to the complexities of heterogeneous knowledge graphs in recommender systems.

This paper's work, KGCN, fits into the fifth stage. It leverages the strengths of GCNs but specifically adapts them to the heterogeneous nature of knowledge graphs and the personalized aspect of recommender systems, offering a robust, end-to-end framework.
3.4. Differentiation Analysis
Compared to the main methods in related work, KGCN offers several core differences and innovations:
- Direct KG Exploitation vs. KGE: Unlike KGE-based methods (e.g., LibFM + TransE, CKE) that pre-embed the KG or use KGE as a regularization term, KGCN operates directly on the KG structure through graph convolution. This avoids the potential mismatch between KGE objectives (e.g., semantic relatedness) and RS objectives (e.g., user preference prediction).
- Automated Feature Learning vs. Meta-paths: KGCN overcomes the limitations of meta-path-based methods (PER, FMG) by automatically discovering high-order structure and semantic information through an iterative aggregation process. It does not require manual design of meta-paths or meta-graphs, which are often sub-optimal and labor-intensive.
- Personalized and Biased Aggregation: A key innovation is the user-specific bias applied during neighborhood aggregation. Unlike general GCNs (GraphSAGE, GAT, PinSage), which typically use a fixed aggregation scheme or attention weights learned universally across the graph, KGCN calculates user-relation scores ($\pi_r^u$) to dynamically weight the importance of different relations for a specific user. This allows KGCN to capture personalized interests that previous GNN models, not designed for recommender systems, often miss.
- Handling Heterogeneity: While GAT uses attention to weight neighbors, KGCN explicitly handles the heterogeneous nature of the KG by making the aggregation dependent on the relation type connecting the entities and on the user. This is a crucial distinction from GNNs designed for homogeneous graphs.
- Computational Efficiency: KGCN employs fixed-size neighbor sampling and minibatch training, addressing the computational and storage overhead that models like RippleNet can face due to unpredictably growing ripple sets.
- End-to-End Learning: KGCN is an end-to-end framework, meaning the KG-derived representations and user preferences are learned jointly during training, leading to more coherent and effective representations for recommendation.
4. Methodology
4.1. Principles
The core idea behind Knowledge Graph Convolutional Networks (KGCN) is to adapt Graph Convolutional Networks (GCNs) to heterogeneous knowledge graphs for recommender systems. The method aims to automatically discover both high-order structural information and semantic information within the KG by iteratively aggregating information from an entity's neighbors. Crucially, this aggregation is not uniform; it's biased and personalized based on the specific user and the relation type connecting entities. This allows KGCN to model how a user's interests might traverse the KG and influence their preferences for an item. The multi-hop aggregation extends the receptive field to capture long-distance interests, while fixed-size neighbor sampling and minibatch training ensure scalability.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation
The problem is defined within a standard recommender system context augmented with a knowledge graph.
- We have a set of $M$ users, denoted as $\mathcal{U} = \{u_1, u_2, \ldots, u_M\}$.
- We have a set of $N$ items, denoted as $\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$.
- The user-item interaction matrix is $\mathbf{Y} \in \mathbb{R}^{M \times N}$. An entry $y_{uv} = 1$ indicates that user $u$ has interacted with item $v$ (e.g., clicked, browsed, purchased), while $y_{uv} = 0$ indicates no interaction. This represents implicit feedback.
- A knowledge graph $\mathcal{G}$ is provided, composed of entity-relation-entity triples (h, r, t).
  - $h \in \mathcal{E}$ is the head entity.
  - $r \in \mathcal{R}$ is the relation.
  - $t \in \mathcal{E}$ is the tail entity.
  - $\mathcal{E}$ is the set of all entities in the knowledge graph.
  - $\mathcal{R}$ is the set of all relations in the knowledge graph.
- In recommender scenarios, an item $v \in \mathcal{V}$ corresponds to one or more entities in the knowledge graph.

The goal is to predict the probability $\hat{y}_{uv}$ that user $u$ will engage with item $v$, especially for items with which the user has no prior interaction. This is achieved by learning a prediction function $\mathcal{F}$ with parameters $\Theta$: $ \hat{y}_{uv} = \mathcal{F}(u, v \mid \Theta, \mathbf{Y}, \mathcal{G}) $ Here, $\Theta$ denotes the learnable model parameters (including the user, entity, and relation embeddings), $\mathbf{Y}$ is the user-item interaction matrix, and $\mathcal{G}$ is the knowledge graph.
4.2.2. KGCN Layer
The KGCN is built upon aggregating information from neighbors. A single KGCN layer computes an updated representation for an entity based on its immediate neighbors. Let's consider a candidate pair of user $u$ and item (entity) $v$.
- User-Relation Scoring: The first step is to quantify how important a particular relation $r$ is to a specific user $u$. This is done with a function $g$ that takes the representation (embedding) of the user and the representation of the relation as input:
$ \pi_r^u = g(\mathbf{u}, \mathbf{r}) $
  where:
  - $\mathbf{u} \in \mathbb{R}^d$ is the representation vector (embedding) of user $u$.
  - $\mathbf{r} \in \mathbb{R}^d$ is the representation vector (embedding) of relation $r$.
  - $d$ is the dimension of the representation vectors.
  - $g: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a function (e.g., inner product) that computes a score. This score $\pi_r^u$ characterizes the importance of relation $r$ to user $u$. For example, one user might find the "director" relation important for movies, while another values the "genre" relation.
- Neighborhood Representation (Biased Aggregation): To characterize the topological proximity structure of an item $v$, KGCN computes a linear combination of its neighbors. This combination is user-specific and relation-aware. First, $N(v)$ denotes the set of entities directly connected to $v$, and the relation between $v$ and a neighbor $e$ is denoted $r_{v,e}$. The neighborhood representation of item $v$ with respect to user $u$, denoted $\mathbf{v}_{N(v)}^u$, is calculated as:
$ \mathbf{v}_{N(v)}^u = \sum_{e \in N(v)} \tilde{\pi}_{r_{v,e}}^u \mathbf{e} $
  where:
  - $\mathbf{e}$ is the representation vector of the neighbor entity $e$.
  - $\tilde{\pi}_{r_{v,e}}^u$ is a normalized user-relation score for the relation that connects $v$ and $e$. This score acts as a personalized filter.
  The normalized user-relation score is obtained by applying a softmax function to the raw scores computed earlier, ensuring that the weights sum to 1 over a given item's neighborhood:
$ \tilde{\pi}_{r_{v,e}}^u = \frac{\exp(\pi_{r_{v,e}}^u)}{\sum_{e' \in N(v)} \exp(\pi_{r_{v,e'}}^u)} $
  This softmax normalization ensures that the contributions of neighbors are weighted according to the personalized relevance of their connecting relations.
- Fixed-Size Neighbor Sampling: In real-world knowledge graphs, the number of neighbors $N(e)$ of an entity can vary significantly and may be very large. To ensure computational efficiency and a fixed pattern for minibatch processing, KGCN uniformly samples a fixed-size set of neighbors for each entity.
  - $S(v)$ denotes the sampled set of neighbors of entity $v$, with $|S(v)| = K$.
  - $K$ is a configurable constant representing the fixed number of sampled neighbors. The sampled set $S(v)$ is also referred to as the (single-layer) receptive field of entity $v$. The neighborhood representation over the sampled set, $\mathbf{v}_{S(v)}^u$, is computed by summing over $S(v)$ instead of $N(v)$.
- Aggregator Functions: After computing the neighborhood representation, the final step in a KGCN layer is to aggregate it with the entity's own representation (or its previous layer's representation) to form a new, updated representation. The paper implements three types of aggregators, denoted as $agg: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$:
  - Sum Aggregator (agg_sum): Sums the entity's own representation with its neighborhood representation, followed by a nonlinear transformation.
$ agg_{sum}(\mathbf{v}, \mathbf{v}_{S(v)}^u) = \sigma \left( \mathbf{W} \cdot (\mathbf{v} + \mathbf{v}_{S(v)}^u) + \mathbf{b} \right) $
    where:
    - $\mathbf{W}$ is a learnable weight matrix.
    - $\mathbf{b}$ is a learnable bias vector.
    - $\sigma$ is a nonlinear activation function (e.g., ReLU).
  - Concat Aggregator (agg_concat): Concatenates the entity's own representation with its neighborhood representation before applying a nonlinear transformation. This allows the model to differentiate the entity's own features from its neighborhood features more explicitly.
$ agg_{concat}(\mathbf{v}, \mathbf{v}_{S(v)}^u) = \sigma \left( \mathbf{W} \cdot \mathrm{concat}(\mathbf{v}, \mathbf{v}_{S(v)}^u) + \mathbf{b} \right) $
    Here, the weight matrix has a dimension compatible with the concatenated vector (i.e., $\mathbf{W} \in \mathbb{R}^{d \times 2d}$ if $\mathbf{v}$ and $\mathbf{v}_{S(v)}^u$ are $d$-dimensional).
  - Neighbor Aggregator (agg_neighbor): Directly uses the neighborhood representation as the output, applying only a nonlinear transformation to it. The entity's own representation is not explicitly included in the aggregation.
$ agg_{neighbor}(\mathbf{v}, \mathbf{v}_{S(v)}^u) = \sigma \left( \mathbf{W} \cdot \mathbf{v}_{S(v)}^u + \mathbf{b} \right) $

The choice of aggregator influences how the entity's own information is combined with its contextual neighborhood information. (A minimal sketch of one layer appears after this list.)
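Putting the pieces of one layer together, the following NumPy sketch computes the user-relation scores with an inner product, normalizes them with a softmax over the sampled neighbors, forms the biased neighborhood representation, and applies the sum aggregator. The shapes, the random toy inputs, and the tanh activation are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 2                                  # embedding dim and neighbor sample size (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kgcn_layer(u, v, neigh_emb, rel_emb, W, b):
    """One KGCN layer with the sum aggregator.
    u: user embedding (d,); v: entity's current representation (d,)
    neigh_emb: (K, d) sampled neighbor representations
    rel_emb:   (K, d) embeddings of the relations connecting v to those neighbors."""
    scores = rel_emb @ u                      # pi^u_r = <u, r> for each sampled neighbor
    weights = softmax(scores)                 # normalized, user-personalized relation weights
    v_neigh = weights @ neigh_emb             # biased neighborhood representation v_S(v)^u
    return np.tanh(W @ (v + v_neigh) + b)     # sum aggregator with a non-linearity

u, v = rng.random(d), rng.random(d)
neigh_emb, rel_emb = rng.random((K, d)), rng.random((K, d))
W, b = rng.random((d, d)), rng.random(d)
print(kgcn_layer(u, v, neigh_emb, rel_emb, W, b).shape)   # (d,)
```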
4.2.3. Learning Algorithm
The KGCN algorithm extends a single layer to multiple layers, allowing it to capture high-order proximity information and long-distance interests. This is achieved by iteratively propagating and aggregating representations.
Multi-Layer Extension:
- A 0-order representation is the initial embedding of an entity.
- A 1-order representation is obtained after one KGCN layer by aggregating the 0-order representations of its immediate neighbors.
- An h-order representation is a mixture of the initial representations of the entity itself and of its neighbors up to $h$ hops away.

The overall KGCN algorithm is presented in Algorithm 1 and illustrated in Figure 1b.
Algorithm 1: KGCN algorithm
Input: Interaction matrix Y; knowledge graph G(E, R);
Initial embeddings: {u}_u∈U, {e}_e∈E, {r}_r∈R;
Neighbor sampling function: S: e → 2^E;
Learnable parameters: {W_i, b_i}_i=1^H;
Hyperparameters: H, d, g(·), f(·), σ(·), agg(·);
Output: Prediction function F(u, v | Θ, Υ, G)
1 while KGCN not converge do
2 for (u, v) in Y do
3 {M[i]}_i=0^H ← Get-Receptive-Field(v);
4 e^u[0] ← e, ∀e ∈ M[0]; // Initialize 0-order representations
5 for h = 1, ..., H do
6 for e ∈ M[h] do
7 e_S(e)^u[h-1] ← ∑_(e'∈S(e))_ (normalized_user_relation_score(u, r_(e,e')) * e'^u[h-1]);
// Calculate neighborhood representation for entity e at layer h-1,
// weighted by personalized user-relation scores and using representations from the previous layer.
8 e^u[h] ← agg(e^u[h-1], e_S(e)^u[h-1]);
// Aggregate entity's own representation from previous layer with its neighborhood representation
// to get its h-order representation.
9 v^u ← e^u[H]; // The final H-order representation for item v, specific to user u
10 Calculate predicted probability ŷ_uv = f(u, v^u);
11 Update parameters by gradient descent;
12 return F;
13 Function Get-Receptive-Field(v)
14 M[H] ← {v}; // The H-th layer receptive field contains only the target entity v
15 for h = H - 1, ..., 0 do
16 M[h] ← M[h+1]; // Initialize current layer's receptive field with entities from the next layer
17 for e ∈ M[h+1] do
18 M[h] ← M[h] ∪ S(e); // Add sampled neighbors of entities in M[h+1] to M[h]
19 return {M[i]}_i=0^H;
Explanation of Algorithm 1:
- Input: The algorithm takes the interaction matrix, the knowledge graph, initial embeddings for users, entities, and relations, the neighbor sampling function, learnable weight and bias parameters for each layer, and the hyperparameters ($H$ for depth, $d$ for dimension, $g(\cdot)$ for scoring, $f(\cdot)$ for prediction, $\sigma(\cdot)$ for activation, and agg for the aggregation type).
- Get-Receptive-Field(v) Function (Lines 13-19): This function determines the set of entities whose initial embeddings will contribute to the final representation of item $v$. It works backward from the target entity $v$.
  - It starts by placing the target entity $v$ in the receptive field of the outermost layer M[H].
  - Then, for each layer $h$ from H-1 down to 0, it initializes M[h] with the entities from M[h+1] (i.e., the entities that will be processed in the next aggregation step).
  - For every entity in M[h+1], it samples its neighbors using S(e) and adds them to M[h]. This effectively builds the multi-hop receptive field. (A small sampling sketch appears after this list.)
- Initialization (Line 4): For all entities in the 0-th layer receptive field M[0] (which contains $v$ and all entities reachable within $H$ hops), their 0-order representations $\mathbf{e}^u[0]$ are initialized with their original embeddings $\mathbf{e}$.
- Aggregation Loop (Lines 5-8): This is the core graph convolution process, repeated $H$ times (for layers $h = 1, \ldots, H$).
  - In each iteration $h$, it computes the neighborhood representation for every entity in M[h] (the entities updated in this layer), using the personalized user-relation scores and the (h-1)-order representations of its sampled neighbors.
  - It then aggregates the entity's own (h-1)-order representation with its newly computed neighborhood representation to produce its h-order representation.
- Final Item Representation (Line 9): After $H$ iterations, the H-order representation of the target item $v$ is obtained. This representation, denoted $\mathbf{v}^u$, is user-specific because the aggregation process was biased by user $u$'s preferences.
- Prediction (Line 10): The final H-order representation of item $v$ is fed into a prediction function $f$ along with the user's representation $\mathbf{u}$ to calculate the predicted probability: $ \hat{y}_{uv} = f(\mathbf{u}, \mathbf{v}^u) $ The paper sets $f$ as the inner product.
- Parameter Update (Line 11): The model parameters are updated using gradient descent based on a defined loss function.
- Minibatch Implementation: The algorithm is designed to operate in a minibatch fashion. Instead of iterating over every user-item pair (which can be very large, as suggested by line 2 iterating for (u, v) in Y), a negative sampling strategy is used during training, and mini-batches of positive and negative samples are processed.

The framework of KGCN is illustrated in Figure 1b.
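A small Python sketch of the Get-Receptive-Field routine explained above is shown here. Sampling with replacement when an entity has fewer than K neighbors is a common convention and is assumed for illustration, as is the toy KG.

```python
import random

def sample_neighbors(kg, entity, K):
    """Uniformly sample a fixed-size neighbor set S(e); sample with replacement
    when the entity has fewer than K neighbors (an assumed convention)."""
    neighbors = [t for _, t in kg.get(entity, [])]
    if not neighbors:
        return []
    return random.choices(neighbors, k=K)

def get_receptive_field(kg, v, H, K):
    """Mirrors Get-Receptive-Field in Algorithm 1: M[H] = {v}; each shallower layer
    adds the sampled neighbors of the entities one layer above it."""
    M = {H: {v}}
    for h in range(H - 1, -1, -1):
        M[h] = set(M[h + 1])
        for e in M[h + 1]:
            M[h].update(sample_neighbors(kg, e, K))
    return M

kg = {"v": [("r1", "a"), ("r2", "b"), ("r1", "c")], "a": [("r3", "d")]}  # toy KG
print(get_receptive_field(kg, "v", H=2, K=2))
```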
Figure 1: KGCN Framework
The following figure illustrates the two key components of KGCN: (a) the receptive field concept and (b) the overall framework for entity representation calculation.
Figure 1: (a) A two-layer receptive field (green entities) of the blue entity in a KG. (b) The framework of KGCN.
Figure 1a (Receptive Field): This panel visually explains how the receptive field is constructed. For a blue entity (the target item), its 1-hop neighbors are shown, and then the neighbors of those 1-hop neighbors form the 2-hop receptive field. The green entities represent the entities whose initial embeddings will eventually contribute to the final representation of the blue entity. The parameter $K$ (neighbor sampling size) limits the number of neighbors sampled at each hop, ensuring a manageable and fixed computation graph.
Figure 1b (KGCN Framework): This panel depicts the iterative aggregation process within KGCN for one layer (from hop $h$ to hop $h+1$).
- Initial Representation: An entity starts with its h-order representation (e.g., the blue node $v$).
- Neighbor Sampling: It samples a fixed number of neighbors (e.g., two neighbors of $v$ in the illustration).
- User-Relation Scoring & Weighting: For each sampled neighbor, its h-order representation is considered. The edges connecting the neighbors to $v$ carry relations, and the normalized user-relation scores $\tilde{\pi}_{r_{v,e}}^u$ weight the neighbors' contributions.
- Neighborhood Aggregation: The h-order representations of the sampled neighbors are summed, weighted by the normalized user-relation scores, to form the neighborhood representation (the green combined vector).
- Aggregator Function: This neighborhood representation is then combined with the entity's own h-order representation (the blue vector) using an aggregator function (sum, concat, or neighbor), followed by a linear transformation ($\mathbf{W}$, $\mathbf{b}$) and a non-linear activation ($\sigma$), producing the (h+1)-order representation. This process repeats for $H$ layers, culminating in the H-order representation of the item, which is then used for prediction.
Loss Function
To train the model, a loss function is defined. The paper uses a binary cross-entropy loss with negative sampling.
$
\mathcal{L} = \sum_{u \in \mathcal{U}} \bigg( \sum_{v : y_{uv}=1} \mathcal{T}(y_{uv}, \hat{y}_{uv}) - \sum_{i=1}^{T^u} \mathbb{E}_{v_i \sim P(v_i)} \mathcal{T}(y_{uv_i}, \hat{y}_{uv_i}) \bigg) + \lambda \| \mathcal{F} \|_2^2
$
where:
- $\mathcal{L}$ is the total loss.
- $\mathcal{U}$ is the set of all users.
- $v : y_{uv} = 1$ indexes the positive samples (items user $u$ interacted with).
- $y_{uv}$ is the true label (1 for a positive interaction, 0 for a negative one).
- $\hat{y}_{uv}$ is the predicted probability of interaction.
- $\mathcal{T}$ is the cross-entropy loss function for a single prediction. For binary classification, it is typically: $ \mathcal{T}(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $
- The first sum accumulates the cross-entropy loss over all positive interactions of user $u$.
- The second sum accounts for the negative samples.
  - $T^u$ is the number of negative samples for user $u$. The paper sets $T^u$ equal to the number of positive samples of user $u$, i.e., an equal number of positive and negative samples.
  - $P(v_i)$ is the negative sampling distribution. In this paper, it follows a uniform distribution, meaning negative items are sampled randomly from the non-interacted items.
  - $\mathbb{E}_{v_i \sim P(v_i)}$ denotes the expectation over sampled negative items.
- $\lambda \| \mathcal{F} \|_2^2$ is an L2-regularization term to prevent overfitting.
  - $\lambda$ is the L2-regularizer weight.
  - $\| \mathcal{F} \|_2^2$ represents the squared L2-norm of the model's parameters (e.g., $\mathbf{W}$, $\mathbf{b}$, and all embedding vectors).

The model is optimized using the Adam algorithm. (A small sketch of this objective appears after this list.)
5. Experimental Setup
5.1. Datasets
The authors evaluate KGCN on three real-world datasets from different domains: movie, book, and music recommendation.
- MovieLens-20M:
  - Domain: Movie recommendation.
  - Characteristics: Contains approximately 20 million explicit ratings (from 1 to 5) from the MovieLens website.
  - Usage: Explicit ratings are transformed into implicit feedback. Ratings of 4 or higher are considered positive ($y_{uv} = 1$), while unrated items are sampled as negative ($y_{uv} = 0$).
- Book-Crossing:
  - Domain: Book recommendation.
  - Characteristics: Contains 1 million ratings (from 0 to 10) from the Book-Crossing community.
  - Usage: Explicit ratings are transformed into implicit feedback. No rating threshold is set due to the dataset's sparsity, so all rated items are treated as positive interactions.
- Last.FM:
  - Domain: Music recommendation.
  - Characteristics: Contains musician listening information from about 2 thousand users of the Last.fm online music system.
  - Usage: Similar to Book-Crossing, no threshold is set when transforming the feedback into implicit feedback, due to sparsity.
Knowledge Graph Construction:
For each dataset, a knowledge graph is constructed using Microsoft Satori.
- A subset of triples from Satori with a confidence level greater than 0.9 is initially selected.
- Item IDs (movies, books, musicians) are matched with entities in the sub-KG by matching their names with the tail of specific triples (e.g., (head, film.film.name, tail) for movies, (head, book.book.title, tail) for books, (head, type.object.name, tail) for musicians).
- Items with multiple matches or no matches are excluded for simplicity.
- Finally, the matched item IDs are used to select all relevant triples from the sub-KG in which the item is the head entity.

The following are the results from [Table 1] of the original paper:

| | MovieLens-20M | Book-Crossing | Last.FM |
|---|---|---|---|
| # users | 138,159 | 19,676 | 1,872 |
| # items | 16,954 | 20,003 | 3,846 |
| # interactions | 13,501,622 | 172,576 | 42,346 |
| # entities | 102,569 | 25,787 | 9,366 |
| # relations | 32 | 18 | 60 |
| # KG triples | 499,474 | 60,787 | 15,518 |
| K | 4 | 8 | 8 |
| d | 32 | 64 | 16 |
| H | 2 | 1 | 1 |
| λ | $10^{-7}$ | $2 \times 10^{-5}$ | $10^{-4}$ |
| η | $2 \times 10^{-2}$ | $2 \times 10^{-4}$ | $5 \times 10^{-4}$ |
| batch size | 65,536 | 256 | 128 |
Table 1: Basic statistics and hyper-parameter settings for the three datasets (K: neighbor sampling size, d: dimension of embeddings, H: depth of receptive field, λ: L2 regularizer weight, η: learning rate).
Dataset Characteristics from Table 1:
- MovieLens-20M is the largest dataset in terms of users, interactions, and KG triples, making it the least sparse.
- Book-Crossing and Last.FM are significantly sparser, with fewer users and interactions. This makes them good candidates for evaluating how well KGCN addresses sparsity.
- The number of entities and relations in the KGs varies, reflecting the complexity and richness of the knowledge available for each domain.

Data Splitting: For each dataset, the ratio of training, evaluation (validation), and test sets is fixed at 6:2:2.
5.2. Evaluation Metrics
The model is evaluated in two scenarios: click-through rate (CTR) prediction and top-K recommendation.
5.2.1. Click-Through Rate (CTR) Prediction Metrics
For CTR prediction, where the model predicts the probability of a user interacting with an item, the following metrics are used:
- Area Under the Receiver Operating Characteristic Curve (AUC):
  - Conceptual Definition: AUC measures the performance of a binary classifier across all possible classification thresholds. It represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 1 indicates a perfect classifier, while 0.5 indicates a classifier no better than random guessing. It is robust to imbalanced datasets.
  - Mathematical Formula: $ \text{AUC} = \frac{\sum_{i=1}^{P} \sum_{j=1}^{N} \mathbb{I}(\text{score}_i > \text{score}_j)}{P \times N} $
  - Symbol Explanation:
    - $P$: Number of positive samples.
    - $N$: Number of negative samples.
    - $\text{score}_i$: The predicted probability or score for the $i$-th positive sample.
    - $\text{score}_j$: The predicted probability or score for the $j$-th negative sample.
    - $\mathbb{I}(\cdot)$: The indicator function, which is 1 if its argument is true and 0 otherwise.
- F1 Score:
  - Conceptual Definition: The F1 score is the harmonic mean of precision and recall. It measures a model's accuracy on a dataset and is particularly useful when the class distribution is imbalanced. It balances precision (minimizing false positives) and recall (minimizing false negatives).
  - Mathematical Formula: $ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
    where:
    - Precision: the fraction of relevant instances among the retrieved instances, $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $
    - Recall: the fraction of relevant instances that were retrieved, $ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $
  - Symbol Explanation:
    - True Positives (TP): Number of correctly predicted positive instances.
    - False Positives (FP): Number of negative instances incorrectly predicted as positive.
    - False Negatives (FN): Number of positive instances incorrectly predicted as negative.

(A small sketch computing both metrics appears after this list.)
5.2.2. Top-K Recommendation Metric
For top-K recommendation, where the model selects items with the highest predicted click probability for each user, the following metric is used:
- Recall@K:
  - Conceptual Definition: Recall@K measures the proportion of relevant items (items the user actually interacted with) that are successfully retrieved within the top $K$ recommendations. It focuses on how many of the truly relevant items are present in the recommended list of size $K$.
  - Mathematical Formula: $ \text{Recall@K} = \frac{1}{|\mathcal{U}_{\text{test}}|} \sum_{u \in \mathcal{U}_{\text{test}}} \frac{|\text{RecommendedItems}_u(K) \cap \text{RelevantItems}_u|}{|\text{RelevantItems}_u|} $ This formula represents the average recall over all users in the test set. For a single user $u$: $ \text{Recall@K}_u = \frac{|\text{RecommendedItems}_u(K) \cap \text{RelevantItems}_u|}{|\text{RelevantItems}_u|} $
  - Symbol Explanation:
    - $\mathcal{U}_{\text{test}}$: Set of users in the test set.
    - $\text{RecommendedItems}_u(K)$: The set of the top-$K$ items recommended to user $u$.
    - $\text{RelevantItems}_u$: The set of items user $u$ actually interacted with in the test set (positive interactions).
    - $\cap$: Set intersection operator.
    - $|\cdot|$: Cardinality of a set.
5.3. Baselines
The KGCN is compared against six baselines, categorized into KG-free and KG-aware methods.
5.3.1. KG-Free Baselines
- SVD [11]:
  - Description: A classic collaborative filtering model using matrix factorization. It models user-item interactions through the inner product of their latent vectors. The paper uses the unbiased version.
  - Configuration: the latent dimension and learning rate are set per dataset (one setting for MovieLens-20M and Book-Crossing, another for Last.FM), following the original paper.
- LibFM [14]:
  - Description: A feature-based factorization model commonly used in click-through rate (CTR) prediction. It models interactions between features. For this comparison, the user ID and item ID are concatenated as input features.
  - Configuration: 50 training epochs; the factorization dimension follows the original paper's setting.
5.3.2. KG-Aware Baselines
- LibFM + TransE:
  - Description: An extension of LibFM that incorporates knowledge graph embeddings. The entity representation learned by TransE [1] is attached to each user-item pair as additional features for LibFM.
  - Configuration: the TransE embedding dimension follows the original paper's setting.
- PER [22]:
  - Description: Personalized Entity Recommendation (PER) treats the knowledge graph as a heterogeneous information network. It extracts meta-path-based features to represent connectivity between users and items.
  - Configuration: manually designed meta-paths are used:
    - MovieLens-20M: "user-movie-director-movie", "user-movie-genre-movie", "user-movie-star-movie".
    - Book-Crossing: "user-book-author-book", "user-book-genre-book".
    - Last.FM: "user-musician-date_of_birth-musician" (date of birth discretized), "user-musician-country-musician", "user-musician-genre-musician".
- CKE [23]:
  - Description: Collaborative Knowledge Embedding (CKE) combines collaborative filtering with structural knowledge from the KG. The paper implements it as CF plus a structural knowledge module (likely TransR-like regularization).
  - Configuration: the embedding dimensions are set per dataset as in the original paper; learning rates are the same as for SVD.
- RippleNet [18]:
  - Description: A memory-network-like approach that propagates users' preferences on the knowledge graph through multi-hop ripple sets for recommendation.
  - Configuration: the embedding dimension, hop number, and regularization weights for MovieLens-20M and Last.FM follow the original RippleNet paper; other hyperparameters are set as in its paper or default code.
5.4. KGCN Hyperparameters
For KGCN specifically:
- Functions $g$ and $f$: both are set as the inner product.
- Activation $\sigma$: ReLU for non-last-layer aggregators, tanh for the last-layer aggregator.
- Other settings: detailed in Table 1 (repeated above).
  - $K$ (neighbor sampling size): 4 for MovieLens-20M, 8 for Book-Crossing and Last.FM.
  - $d$ (dimension of embeddings): 32 for MovieLens-20M, 64 for Book-Crossing, 16 for Last.FM.
  - $H$ (depth of receptive field): 2 for MovieLens-20M, 1 for Book-Crossing and Last.FM.
  - $\lambda$ (L2 regularizer weight): $10^{-7}$ for MovieLens-20M, $2 \times 10^{-5}$ for Book-Crossing, $10^{-4}$ for Last.FM.
  - $\eta$ (learning rate): $2 \times 10^{-2}$ for MovieLens-20M, $2 \times 10^{-4}$ for Book-Crossing, $5 \times 10^{-4}$ for Last.FM.
  - Batch size: 65,536 for MovieLens-20M, 256 for Book-Crossing, 128 for Last.FM.

These hyperparameters were determined by optimizing AUC on a validation set. Each experiment was repeated 3 times, and the average performance is reported. The models are optimized using the Adam algorithm. The implementation uses Python 3.6, TensorFlow 1.12.0, and NumPy 1.14.3.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that KGCN consistently outperforms all baselines across three diverse real-world datasets, validating its efficacy in knowledge graph-aware recommender systems.
CTR Prediction Results (AUC and F1)
The following are the results from [Table 2] of the original paper:
| Model | MovieLens-20M AUC | MovieLens-20M F1 | Book-Crossing AUC | Book-Crossing F1 | Last.FM AUC | Last.FM F1 |
|---|---|---|---|---|---|---|
| SVD | 0.963 (-1.5%) | 0.919 (-1.4%) | 0.672 (-8.9%) | 0.635 (-7.7%) | 0.769 (-3.4%) | 0.696 (-3.5%) |
| LibFM | 0.959 (-1.9%) | 0.906 (-2.8%) | 0.691 (-6.4%) | 0.618 (-10.2%) | 0.778 (-2.3%) | 0.710 (-1.5%) |
| LibFM + TransE | 0.966 (-1.2%) | 0.917 (-1.6%) | 0.698 (-5.4%) | 0.622 (-9.6%) | 0.777 (-2.4%) | 0.709 (-1.7%) |
| PER | 0.832 (-14.9%) | 0.788 (-15.5%) | 0.617 (-16.4%) | 0.562 (-18.3%) | 0.633 (-20.5%) | 0.596 (-17.3%) |
| CKE | 0.924 (-5.5%) | 0.871 (-6.5%) | 0.677 (-8.3%) | 0.611 (-11.2%) | 0.744 (-6.5%) | 0.673 (-6.7%) |
| RippleNet | 0.968 (-1.0%) | 0.912 (-2.1%) | 0.715 (-3.1%) | 0.650 (-5.5%) | 0.780 (-2.0%) | 0.702 (-2.6%) |
| KGCN-sum | 0.978 | 0.932* | 0.738 | 0.688* | 0.794 (-0.3%) | 0.719 (-0.3%) |
| KGCN-concat | 0.977 (-0.1%) | 0.931 (-0.1%) | 0.734 (-0.5%) | 0.681 (-1.0%) | 0.796* | 0.721* |
| KGCN-neighbor | 0.977 (-0.1%) | 0.932* | 0.728 (-1.4%) | 0.679 (-1.3%) | 0.781 (-1.9%) | 0.699 (-3.1%) |
| KGCN-avg | 0.975 (-0.3%) | 0.929 (-0.3%) | 0.722 (-2.2%) | 0.682 (-0.9%) | 0.774 (-2.8%) | 0.692 (-4.0%) |
Table 2: The results of AUC and F1 in CTR prediction. (Statistically significant improvement by unpaired two-sample t-test with p = 0.1, indicated by *)
Observations from Table 2:
- KGCN Superiority: KGCN variants consistently achieve the best AUC and F1 scores across all three datasets. KGCN-sum and KGCN-concat generally perform the best, with KGCN-sum showing the leading AUC on MovieLens-20M and Book-Crossing, and KGCN-concat leading on Last.FM. The average AUC gains over the baselines are substantial: 4.4% on MovieLens-20M, 8.1% on Book-Crossing, and 6.2% on Last.FM. The percentages in parentheses indicate the relative difference from the best KGCN variant for each metric.
- Effectiveness in Sparse Scenarios: KGCN shows particularly high improvements on Book-Crossing and Last.FM. These datasets are much sparser than MovieLens-20M (as seen in Table 1). This confirms that KGCN effectively addresses the sparsity problem by leveraging the rich knowledge graph information.
- KG-aware vs. KG-free Baselines: Surprisingly, KG-free baselines like SVD and LibFM sometimes outperform KG-aware baselines like PER and CKE. This suggests that simply introducing a KG is not enough; the method of integrating it matters significantly. PER performs the worst, likely due to the difficulty of manually designing optimal meta-paths. CKE also lags behind, indicating that its TransR-like regularization might not be as effective for recommendation as KGCN's direct graph convolution.
- RippleNet's Strong Performance: RippleNet is a strong baseline, performing better than most other KG-aware methods. This validates the importance of multi-hop neighborhood structure and proximity information in the KG for recommendation, a core idea also adopted by KGCN.
- Aggregator Comparison (KGCN variants): KGCN-sum generally performs very well, leading in AUC on MovieLens-20M and Book-Crossing. KGCN-concat is also highly competitive, even surpassing KGCN-sum on Last.FM AUC. KGCN-neighbor shows a noticeable performance gap on the sparser datasets (Book-Crossing, Last.FM) compared to KGCN-sum and KGCN-concat, suggesting that excluding the entity's own representation from aggregation loses crucial information, especially when KG context is limited. KGCN-avg performs worse than KGCN-sum across all datasets; this variant is a reduced form of KGCN-sum without the user-relation scores (i.e., uniform weighting of neighbors), and its poorer performance highlights the critical role of the personalized attention mechanism in capturing user-specific preferences and semantic information from the KG. This is particularly evident in the sparse scenarios (Book-Crossing, Last.FM), where the personalized weighting provides more meaningful context.
Top-K Recommendation Results (Recall@K)
The following figure illustrates the performance of various models in top-K recommendation using the Recall@K metric.
The figure is a chart comparing Recall@K on the three datasets (MovieLens-20M, Book-Crossing, and Last.FM) for five methods (LibFM+TransE, PER, CKE, RippleNet, and KGCN-sum); the x-axis is K and the y-axis is Recall@K.
Figure 2: The results of Recall@K in top-K recommendation.
Observations from Figure 2:
- Consistent KGCN Outperformance: Similar to CTR prediction, KGCN-sum consistently achieves the highest Recall@K across all values of K and all three datasets. This reinforces its effectiveness in providing relevant recommendations.
- Benefits of KG-aware Methods: RippleNet also shows strong performance and generally outperforms the KG-free and other KG-aware baselines, again indicating the value of leveraging multi-hop KG structure.
- Struggles of Meta-path Methods: PER continues to perform poorly, demonstrating the limitations of manually defined meta-paths.
- KGCN's Robustness: The curves for KGCN-sum are consistently above those of the other baselines, indicating its robustness and superior ability to surface relevant items for users across different recommendation list sizes. The improvement is particularly pronounced on Book-Crossing and Last.FM, consistent with its strong CTR prediction performance on sparse data.
6.2. Ablation Studies / Parameter Analysis
The paper conducts ablation studies and parameter analyses to understand the impact of various KGCN components and hyperparameters.
6.2.1. Impact of Neighbor Sampling Size (K)
The following are the results from [Table 3] of the original paper:
| K | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| MovieLens-20M | 0.978 | 0.979 | 0.978 | 0.978 | 0.977 | 0.978 |
| Book-Crossing | 0.680 | 0.727 | 0.736 | 0.725 | 0.711 | 0.723 |
| Last.FM | 0.791 | 0.794 | 0.795 | 0.793 | 0.794 | 0.792 |
Table 3: AUC result of KGCN with different neighbor sampling size K.
Observations:
- Optimal K: KGCN achieves optimal performance when K (the neighbor sampling size) is 4 or 8. For MovieLens-20M, K = 4 yields the best AUC (0.979). For Book-Crossing, K = 8 is best (0.736), and for Last.FM, K = 8 is best (0.795).
- Too Small K: A value of K that is too small (e.g., 2 for Book-Crossing and Last.FM) results in lower performance. This is because a limited number of sampled neighbors (the receptive field) might not provide enough contextual information to effectively enrich the entity's representation.
- Too Large K: A value of K that is too large (e.g., 16, 32, 64) also tends to degrade performance or offer no significant improvement. This is attributed to the model being misled by noise from irrelevant or less important neighbors, or simply to added redundancy without valuable, distinct information. The computational cost also increases with larger K.
6.2.2. Impact of Depth of Receptive Field (H)
The following are the results from [Table 4] of the original paper:
| H | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| MovieLens-20M | 0.972 | 0.976 | 0.974 | 0.514 |
| Book-Crossing | 0.738 | 0.731 | 0.684 | 0.547 |
| Last.FM | 0.794 | 0.723 | 0.545 | 0.534 |
Table 4: AUC result of KGCN with different depth of receptive field H.
Observations:
- Sensitivity to H: KGCN is more sensitive to H (the depth of the receptive field, i.e., the number of aggregation layers) than to K.
- Optimal H: An H of 1 or 2 is optimal or near-optimal. MovieLens-20M performs best at H = 2 (0.976), while Book-Crossing and Last.FM perform best at H = 1 (0.738 and 0.794, respectively).
- Model Collapse with Large H: A critical finding is the serious model collapse observed when H reaches 3 or 4. The AUC drops dramatically (e.g., to 0.514 for MovieLens-20M at H = 4).
  - This is consistent with the intuition that overly long relation chains make little sense when inferring item similarities or user interests for recommendation. Excessively deep layers can propagate too much noise, over-smooth entity representations, or lead to over-squashing of information, where distinguishing features from distant nodes become diluted or lost.
  - This suggests that for recommender systems using knowledge graphs, shallow aggregation (1 or 2 hops) is generally sufficient and more effective than very deep aggregation.
6.2.3. Impact of Dimension of Embedding (d)
The following are the results from [Table 5] of the original paper:
| d | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|
| MovieLens-20M | 0.968 | 0.970 | 0.975 | 0.977 | 0.973 | 0.972 |
| Book-Crossing | 0.709 | 0.732 | 0.733 | 0.735 | 0.739 | 0.736 |
| Last.FM | 0.789 | 0.793 | 0.797 | 0.793 | 0.790 | 0.789 |
Table 5: AUC result of KGCN with different dimension of embedding.
Observations:
- Initial Boost: Increasing the dimension of embedding (d) generally boosts performance at first, since a larger d allows the model to encode richer information about users and entities. For instance, MovieLens-20M improves from d = 4 (0.968) to d = 32 (0.977), Book-Crossing improves fairly consistently up to d = 64 (0.739), and Last.FM peaks at d = 16 (0.797).
- Overfitting with Large d: However, a d that is too large can adversely affect performance due to overfitting. If the embedding dimension is too high relative to the amount of available data, the model may learn spurious patterns specific to the training set and fail to generalize. This is observed when MovieLens-20M's AUC slightly drops at d = 64 and d = 128, and when Last.FM's AUC decreases after d = 16. Book-Crossing is more robust to larger d, possibly because its greater sparsity requires more expressive power, or because of the specific nature of its KG.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Knowledge Graph Convolutional Networks (KGCN), an end-to-end framework designed to enhance recommender systems by effectively integrating knowledge graph (KG) information. KGCN addresses the critical sparsity and cold start problems of traditional collaborative filtering by leveraging KG entities and relations. Its core innovation lies in its user-specific and relation-aware neighborhood aggregation mechanism, which computes a biased weighted sum of neighbor representations. This allows KGCN to automatically capture both high-order structural information and rich semantic information from the KG, tailored to individual user preferences. The model's ability to extend its receptive field over multiple hops enables it to discover long-distance interests. Furthermore, the minibatch implementation ensures scalability for large datasets and KGs. Extensive experiments on movie, book, and music recommendation datasets demonstrate that KGCN consistently outperforms state-of-the-art baselines in CTR prediction and top-K recommendation, especially in sparse scenarios, validating its effectiveness and robustness.
7.2. Limitations & Future Work
The authors identify several avenues for future research:
- Non-uniform Neighbor Sampling: Currently, KGCN uses uniform sampling from an entity's neighbors to construct its receptive field. A limitation is that not all neighbors are equally important. Future work could explore non-uniform sampling strategies (e.g., importance sampling based on relation type, entity popularity, or learned weights) to select more informative neighbors, potentially leading to more effective and efficient aggregation.
- Leveraging User-End KGs: The current KGCN (and most of the existing literature) primarily models item-end KGs (knowledge about items and their attributes). The paper suggests that an interesting direction would be to investigate whether incorporating user-end KGs (knowledge about users, their demographics, social connections, and preferences) could further improve recommendation performance.
- Combining User-End and Item-End KGs: A more advanced future direction is to design an algorithm that can effectively combine and jointly utilize KGs at both the user end and the item end. This could create a holistic view of both users and items, potentially leading to more nuanced and accurate recommendations.
7.3. Personal Insights & Critique
The KGCN paper presents a compelling solution for integrating knowledge graphs into recommender systems by intelligently adapting Graph Convolutional Networks. Its primary strength lies in the personalized and biased aggregation mechanism, which explicitly models how a user's preferences interact with different types of relations in the KG. This is a crucial distinction from generic GCNs and previous KG-aware methods, making the item representations truly user-specific.
One notable insight is the observation that shallow aggregation (1 or 2 layers, i.e., $H \le 2$) performs best, with deeper layers leading to model collapse. This is an important practical finding, suggesting that for recommender systems, multi-hop information is valuable but long-distance information quickly becomes noisy or redundant. This could be due to over-smoothing (where node representations become indistinguishable after many layers) or to the inherent nature of user interests, which may not meaningfully extend beyond a few hops in a KG. This finding could guide future GNN designs in RS toward more sophisticated, rather than simply deeper, aggregation strategies.
The choice of the inner product for both $g$ (user-relation scoring) and $f$ (final prediction), while simple and effective, could be an area for improvement. More expressive neural network layers for these functions might capture more intricate non-linear interactions. Similarly, the uniform negative sampling could be enhanced with more advanced strategies (e.g., popularity-biased sampling or adversarial sampling) to better reflect real-world implicit feedback scenarios.
The methods and conclusions of KGCN could be broadly applied to other domains where graph-structured data (especially heterogeneous graphs) and personalized predictions are important. For instance:
- Information Retrieval: Enhancing search results by considering the personalized relevance of entities and their relations in a knowledge base.
- Drug Discovery: Predicting drug-target interactions or drug side effects by leveraging knowledge graphs of chemical compounds, proteins, and biological pathways, with personalized models accounting for patient-specific factors.
- Fraud Detection: Identifying fraudulent activities in complex transaction networks by modeling entity relationships and detecting unusual patterns based on specific user profiles.

A potential unverified assumption is the quality and completeness of the knowledge graphs derived from Microsoft Satori. While Satori is a powerful resource, any inaccuracies or incompleteness in the KG could propagate errors through the GCN layers. The impact of KG noise on KGCN's performance could be an area for further investigation. Overall, KGCN represents a significant step forward in building more intelligent and personalized recommender systems by effectively harnessing the power of knowledge graphs through graph neural networks.