
Knowledge Graph Convolutional Networks for Recommender Systems

Published: 03/19/2019

TL;DR Summary

KGCN uses neighbor sampling to capture high-order structural and semantic information in knowledge graphs, addressing sparsity and cold-start in recommender systems. It performs well on large-scale datasets in movie, book, and music recommendations.

Abstract

To alleviate sparsity and cold start problem of collaborative filtering based recommender systems, researchers and engineers usually collect attributes of users and items, and design delicate algorithms to exploit these additional information. In general, the attributes are not isolated but connected with each other, which forms a knowledge graph (KG). In this paper, we propose Knowledge Graph Convolutional Networks (KGCN), an end-to-end framework that captures inter-item relatedness effectively by mining their associated attributes on the KG. To automatically discover both high-order structure information and semantic information of the KG, we sample from the neighbors for each entity in the KG as their receptive field, then combine neighborhood information with bias when calculating the representation of a given entity. The receptive field can be extended to multiple hops away to model high-order proximity information and capture users' potential long-distance interests. Moreover, we implement the proposed KGCN in a minibatch fashion, which enables our model to operate on large datasets and KGs. We apply the proposed model to three datasets about movie, book, and music recommendation, and experiment results demonstrate that our approach outperforms strong recommender baselines.


1. Bibliographic Information

1.1. Title

The title of the paper is "Knowledge Graph Convolutional Networks for Recommender Systems". It clearly indicates the central topic, which is the application of Knowledge Graph Convolutional Networks (KGCN) to enhance recommender systems.

1.2. Authors

The authors of this paper are:

  • Hongwei Wang (Shanghai Jiao Tong University, Shanghai, China)

  • Miao Zhao (The Hong Kong Polytechnic University, Hong Kong, China)

  • Xing Xie (Microsoft Research Asia, Beijing, China)

  • Wenjie Li (The Hong Kong Polytechnic University, Hong Kong, China)

  • Minyi Guo (Shanghai Jiao Tong University, Shanghai, China)

    Their affiliations suggest a collaborative effort between academic institutions and industrial research (Microsoft Research Asia), which often leads to both theoretically sound and practically relevant research.

1.3. Journal/Conference

This paper was published at the 2019 World Wide Web Conference (WWW '19), held from May 13-17, 2019, in San Francisco, CA, USA. The World Wide Web Conference (WWW) is a highly reputable and influential academic conference in the fields of web science, information retrieval, and data mining. Publication at WWW signifies that the research has undergone a rigorous peer-review process and is considered a significant contribution to the field.

1.4. Publication Year

The paper was published in 2019. The listed timestamp (2019-03-18T17:17:34 UTC) indicates that it was likely made available online as a preprint or accepted version around March 2019, ahead of the May conference.

1.5. Abstract

The paper addresses the sparsity and cold start problems inherent in collaborative filtering (CF)-based recommender systems (RS). To mitigate these issues, researchers often leverage auxiliary information such as user and item attributes. Recognizing that these attributes are interconnected and form a knowledge graph (KG), the authors propose Knowledge Graph Convolutional Networks (KGCN).

KGCN is presented as an end-to-end framework designed to effectively capture inter-item relatedness by mining associated attributes within the KG. Its core mechanism involves sampling neighbors for each entity in the KG to define their receptive field, and then combining this neighborhood information with a bias when computing entity representations. This approach aims to automatically discover both high-order structural information and semantic information from the KG. The receptive field can be extended to multiple hops to model high-order proximity and capture users' long-distance interests.

Furthermore, the KGCN is implemented in a minibatch fashion to handle large datasets and KGs efficiently. The model's effectiveness is demonstrated through experiments on three real-world datasets for movie, book, and music recommendation, where it consistently outperforms strong recommender baselines.

The original source link is https://arxiv.org/abs/1904.12575v1. This is a link to arXiv, which is a preprint server; the v1 suffix indicates that it is the first version of the paper uploaded to arXiv.

The PDF link is: https://arxiv.org/pdf/1904.12575v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the limitations of traditional collaborative filtering (CF) in recommender systems (RS), specifically the sparsity of user-item interactions and the cold start problem.

  • Sparsity: In real-world recommender systems, users typically interact with only a tiny fraction of available items. This leads to a sparse user-item interaction matrix, making it difficult for CF algorithms to find enough similar users or items to make accurate recommendations.

  • Cold Start: This problem arises when new users or new items are introduced to the system. For new users, there's no interaction history to base recommendations on. For new items, there's no interaction data to recommend them to existing users.

    These problems are important because they severely limit the accuracy and coverage of recommendations, leading to poor user experience and underutilized items.

Previous research often addresses these issues by collecting attributes of users and items. However, the authors point out a critical gap: these attributes are generally not isolated but are interconnected, forming a knowledge graph (KG). While incorporating KGs into RS has shown promise (e.g., in exploring latent connections, extending user interests, and providing explainability), directly utilizing them is challenging due to their high dimensionality and heterogeneity. Existing KG-aware methods often rely on knowledge graph embedding (KGE) (which might not be optimal for recommendation tasks as they focus on rigorous semantic relatedness within the graph) or meta-path/meta-graph-based approaches (which require manual design and may not be optimal). RippleNet, a memory-network-like model, addresses multi-hop propagation but weakly characterizes the importance of relations and can incur heavy computational overhead.

The paper's entry point and innovative idea lie in proposing a novel graph convolutional network (GCN)-based approach, KGCN, that directly operates on the knowledge graph to automatically capture both high-order structural information and semantic information in a personalized manner, without relying on manually designed paths or fixed embedding strategies.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • Novel Framework (KGCN): They propose Knowledge Graph Convolutional Networks (KGCN), an end-to-end framework specifically designed for recommender systems. This framework effectively explores users' preferences on the knowledge graph by leveraging its high-order structure and semantic information.

  • Personalized, Biased Neighborhood Aggregation: The KGCN introduces a mechanism to aggregate neighborhood information with bias, where neighbors are weighted by scores dependent on the connecting relation and the specific user. This allows the model to capture both the semantic information of the KG and users' personalized interests in relations, providing a more nuanced item representation.

  • High-Order Proximity Modeling: By extending the receptive field of each entity to multiple hops away in the KG, KGCN can model high-order proximity information and capture users' potential long-distance interests, which is crucial for discovering complex relationships between items.

  • Minibatch Implementation: The KGCN is designed to be implemented in a minibatch fashion, enabling it to operate efficiently on large-scale datasets and knowledge graphs, making it practical for real-world recommender systems.

  • Empirical Validation: Through extensive experiments on three real-world datasets (MovieLens-20M, Book-Crossing, and Last.FM) covering movie, book, and music recommendations, the KGCN demonstrates superior performance. It achieves average AUC gains of 4.4%, 8.1%, and 6.2% across the datasets compared to state-of-the-art baselines, particularly excelling in sparse scenarios.

  • Open-Source Release: The authors have released the code for KGCN and the used knowledge graph datasets, fostering reproducibility and further research in the community.

    The key findings demonstrate that KGCN effectively addresses the sparsity and cold start problems by deeply integrating knowledge graph information, leading to significantly improved recommendation accuracy. The model's ability to capture personalized interests through biased aggregation and high-order connections proves crucial for its success.

3. Prerequisite Knowledge & Related Work

This section provides the necessary background to understand the KGCN paper, explaining foundational concepts and summarizing prior research discussed by the authors.

3.1. Foundational Concepts

3.1.1. Recommender Systems (RS)

Recommender Systems (RS) are information filtering systems that aim to predict user preferences or ratings for items and recommend items that users might like. They are widely used in various online platforms like e-commerce (e.g., Amazon), streaming services (e.g., Netflix, Spotify), and social media (e.g., Facebook, Twitter) to alleviate information overload by providing personalized suggestions.

3.1.2. Collaborative Filtering (CF)

Collaborative Filtering (CF) is a common technique in recommender systems. The core idea behind CF is that users who agreed in the past on some items will agree again in the future on other items, or that items liked by similar users are likely to be liked by the current user.

  • User-based CF: Recommends items to a user based on the preferences of other "similar" users.
  • Item-based CF: Recommends items that are similar to the ones a user has liked in the past. CF often represents users and items as ID-based representation vectors (also known as embeddings) and models their interactions (e.g., by inner product or neural networks).

3.1.3. Sparsity Problem

The sparsity problem refers to the issue in recommender systems where the user-item interaction matrix is very sparse, meaning most users have interacted with only a small fraction of the total available items. This lack of data makes it difficult for CF algorithms to find reliable similarities between users or items, leading to less accurate recommendations.

3.1.4. Cold Start Problem

The cold start problem occurs when a recommender system cannot make accurate recommendations because it lacks sufficient information.

  • New User Cold Start: When a new user joins, there's no historical interaction data to build a preference profile.
  • New Item Cold Start: When a new item is introduced, it has no interaction history, making it impossible for CF to recommend it to users. Auxiliary information, like item attributes, is often used to alleviate this problem.

3.1.5. Knowledge Graph (KG)

A Knowledge Graph (KG) is a structured representation of information that describes entities and their relationships. It's essentially a large network of interconnected facts.

  • Entities: Nodes in the graph, representing real-world objects or abstract concepts (e.g., "George Martin", "A Song of Ice and Fire", "book genre").
  • Relations: Edges connecting entities, describing the type of relationship between them (e.g., "author of", "has genre").
  • Triples: Facts are typically represented as (head, relation, tail) triples (e.g., (A Song of Ice and Fire, book.book.author, George Martin)). KGs provide rich semantic relatedness and contextual information that can be beneficial for recommender systems.

3.1.6. Graph Convolutional Networks (GCN)

Graph Convolutional Networks (GCNs) are a type of neural network designed to operate directly on graph-structured data. They generalize the concept of convolutional operations (common in image processing for grids) to arbitrary graph topologies. The core idea is to compute the representation of a node by aggregating information from its neighbors.

  • Convolution on Graphs: Unlike images with regular grid structures, graphs have irregular structures (varying number of neighbors per node). GCNs define convolutional filters that can handle this irregularity.
  • Node Representation Learning: GCNs learn node embeddings (low-dimensional vector representations) that capture both the node's own features and its structural role within the graph by iteratively aggregating features from its neighbors.
  • Spectral vs. Non-spectral Methods:
    • Spectral Methods: Define graph convolutions in the spectral domain by transforming the graph signal using the eigenbasis of the graph Laplacian. Examples include Bruna et al. [2], Defferrard et al. [4] (Chebyshev expansion), and Kipf et al. [10] (first-order approximation). These methods often require full graph processing.
    • Non-spectral Methods: Directly perform convolution on the graph in the spatial domain (original graph structure). They define aggregation functions for nodes and their neighbors. KGCN falls into this category, similar to approaches like GraphSAGE [7]. These are often more flexible for large graphs and varying neighborhood sizes.
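To make the non-spectral view concrete, the following is a minimal NumPy sketch (illustrative only, not from the paper) of one round of neighbor aggregation on a toy graph; the adjacency list, feature matrix, and weight matrix are made-up placeholders.

```python
import numpy as np

# Toy graph: adjacency list (node -> neighbor ids) and random node features.
adj = {0: [1, 2], 1: [0], 2: [0, 1]}
X = np.random.rand(3, 4)   # 3 nodes, 4-dimensional input features
W = np.random.rand(4, 4)   # weight matrix (would be learned; random here)

def gcn_layer(X, adj, W):
    """One non-spectral convolution: aggregate neighbor features, then transform."""
    H = np.zeros_like(X)
    for node, neighbors in adj.items():
        neigh_mean = X[neighbors].mean(axis=0)               # aggregate the neighborhood
        H[node] = np.maximum(0, (X[node] + neigh_mean) @ W)  # ReLU(self + neighborhood)
    return H

H1 = gcn_layer(X, adj, W)    # 1-hop (first-layer) representations
H2 = gcn_layer(H1, adj, W)   # a second layer extends the receptive field to 2 hops
```

Stacking such layers widens each node's receptive field by one hop per layer, which is the same intuition KGCN builds on for heterogeneous knowledge graphs.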

3.2. Previous Works

The paper discusses several existing recommender system approaches, categorizing them into KG-free and KG-aware methods.

3.2.1. KG-Free Methods

These methods do not explicitly use knowledge graph information.

  • SVD [11]: Singular Value Decomposition (SVD) is a classic CF-based matrix factorization technique. It decomposes the user-item interaction matrix into lower-rank matrices representing latent features of users and items. The predicted rating for user $u$ and item $v$ is typically modeled as the inner product of their latent vectors.
    • Formula (conceptual for CF models): $\hat{r}_{uv} = \mathbf{p}_u^\top \mathbf{q}_v$, where $\mathbf{p}_u$ is the latent vector for user $u$ and $\mathbf{q}_v$ is the latent vector for item $v$.
  • LibFM [14]: Factorization Machines (FM) are a generic predictor working with any real-valued feature vector. LibFM is an implementation of FM. It models all variable interactions using factorized parameters, making it capable of estimating interactions even in highly sparse settings. In recommender systems, it can use user ID, item ID, and other features.
    • Formula (FM conceptual, restricted to second-order interactions): $\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$. Here, $w_0$ is the global bias, $w_i$ is the weight of the $i$-th feature, $x_i$ is the $i$-th feature value, and $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ is the dot product of the latent vectors $\mathbf{v}_i$ and $\mathbf{v}_j$, modeling the interaction between features $i$ and $j$ (a small sketch follows below).
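As an illustration of the FM scoring equation above, here is a small NumPy sketch (toy values, not the paper's configuration) that evaluates the second-order prediction for one sparse feature vector using the standard O(nk) reformulation.

```python
import numpy as np

n, k = 6, 4                       # number of features, latent dimension
w0 = 0.1                          # global bias
w = np.random.rand(n)             # first-order weights
V = np.random.rand(n, k)          # latent factor matrix, one row per feature
x = np.array([1., 0., 1., 0., 0., 1.])   # sparse feature vector (e.g., one-hot user ID + item ID)

def fm_predict(x, w0, w, V):
    """y_hat = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j."""
    linear = w0 + w @ x
    # O(n*k) identity: sum_{i<j} <v_i,v_j> x_i x_j = 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ]
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + pairwise

print(fm_predict(x, w0, w, V))
```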

3.2.2. KG-Aware Methods

These methods incorporate knowledge graph information.

  • Knowledge Graph Embedding (KGE) Methods [20]: These methods map entities and relations from a knowledge graph into low-dimensional vector spaces (embeddings).
    • TransE [1]: A prominent KGE method. TransE models relations as "translations" in the embedding space. For a triple (head, relation, tail), it aims to ensure that the embedding of the head entity plus the embedding of the relation is approximately equal to the embedding of the tail entity: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$.
      • Score function (conceptual for TransE): $f(h, r, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_p$, where $\|\cdot\|_p$ is the $L_p$-norm (typically $L_1$ or $L_2$); a small numeric sketch follows at the end of this list.
    • TransR [12]: An extension of TransE that maps entities and relations into different vector spaces. It projects entities to a relation-specific space before performing the translation operation.
    • Limitation for RS: The paper argues that KGE methods like TransE and TransR are primarily designed for in-graph applications (e.g., KG completion, link prediction) by modeling rigorous semantic relatedness. This objective might not directly align with recommender system goals, where user preferences and personalized item representations are key.
  • LibFM + TransE: An approach that extends LibFM by incorporating TransE embeddings. It attaches an entity representation learned by TransE to each user-item pair as additional features for LibFM. This is a hybrid approach combining factorization machines with knowledge graph embeddings.
  • PER [22]: Personalized Entity Recommendation (PER) treats the KG as a heterogeneous information network. It extracts meta-path-based features to represent the connectivity between users and items.
    • Meta-path: A sequence of relation types connecting two entity types (e.g., User-Movie-Director-Movie for movie recommendations).
    • Limitation: Relies heavily on manually designed meta-paths or meta-graphs, which are difficult to optimize and may not capture all relevant connections.
  • CKE [23]: Collaborative Knowledge Embedding (CKE) combines collaborative filtering with structural, textual, and visual knowledge from the KG within a unified framework. The paper specifically mentions implementing CKE as CF plus a structural knowledge module, implying that it regularizes CF embeddings using KGE objectives (e.g., TransR-like relations).
  • RippleNet [18]: A memory-network-like model that propagates users' potential preferences on the knowledge graph to explore their hierarchical interests. It effectively performs multi-hop message passing over the KG.
    • Mechanism: It learns a user's ripple set, which represents their interests by traversing the KG.
    • Limitation: The paper notes that RippleNet weakly characterizes the importance of relations (via a quadratic form $\mathbf{v}^\top \mathbf{R} \mathbf{h}$), and that the size of the ripple set can grow unpredictably, leading to heavy computation and storage overhead for large KGs.
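To make the TransE translation objective mentioned above concrete, the following is a minimal sketch (with made-up embeddings) that scores a triple by the distance between h + r and t; lower scores indicate more plausible triples.

```python
import numpy as np

dim = 8
# Hypothetical embeddings for one (head, relation, tail) triple.
h = np.random.rand(dim)
r = np.random.rand(dim)
t = np.random.rand(dim)

def transe_score(h, r, t, p=2):
    """f(h, r, t) = ||h + r - t||_p; lower means the triple is more plausible."""
    return np.linalg.norm(h + r - t, ord=p)

print(transe_score(h, r, t))
```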

3.2.3. Graph Neural Networks (GNN) for Homogeneous Graphs

The paper also contextualizes KGCN with general GCN methods and other Graph Neural Networks (GNNs) designed for homogeneous graphs (graphs where all nodes and edges are of the same type).

  • GraphSAGE [7]: A non-spectral GCN method that learns inductive representations by sampling and aggregating features from a node's local neighborhood. It operates in a minibatch fashion. KGCN shares conceptual similarities with GraphSAGE in its sampling and aggregation strategy.
  • PinSage [21]: A GCN-based web-scale recommender system developed by Pinterest. It learns node embeddings for items in a large item-item graph (homogeneous) by aggregating information from sampled neighbors using GraphSAGE-like principles.
  • GAT [15]: Graph Attention Networks (GATs) are another non-spectral GNN that uses an attention mechanism to assign different weights to different neighbors in a node's aggregation process. This allows it to learn variable importance for neighbors.
  • Differentiation from KGCN: The paper highlights that both PinSage and GAT are primarily designed for homogeneous graphs. KGCN specifically addresses the challenges of heterogeneous knowledge graphs in recommender systems, where different relation types and their personalized importance are crucial.

3.3. Technological Evolution

The field of recommender systems has evolved through several stages:

  1. Early Collaborative Filtering: Focused on ID-based representations and matrix factorization (SVD, FunkSVD). These methods struggle with sparsity and cold start.

  2. Feature-Rich Recommenders: Incorporated auxiliary user and item attributes to enrich representations and alleviate sparsity (LibFM, Wide & Deep Learning [3]). This provided more context but often treated attributes as isolated features.

  3. Knowledge Graph Integration (Early Stage): Recognized the interconnectedness of attributes and started using Knowledge Graphs. Initial approaches included KGE (TransE, TransR) to pre-embed KG information, or meta-path mining (PER, FMG [24]) to extract connectivity patterns. These methods faced challenges with KGE objectives not aligning perfectly with RS tasks or the manual effort required for meta-path design.

  4. Memory Networks and KG Propagation: Models like RippleNet introduced memory network-like mechanisms to propagate user preferences directly on the KG, allowing for more dynamic exploration of multi-hop interests.

  5. Graph Neural Networks for RS: The emergence of Graph Neural Networks (GCNs, GraphSAGE, GAT) provided powerful tools for learning on arbitrary graph structures. Initially applied to homogeneous graphs (PinSage), the logical next step is to adapt them for the complexities of heterogeneous knowledge graphs in recommender systems.

    This paper's work, KGCN, fits into the fifth stage. It leverages the strengths of GCNs but specifically adapts them to the heterogeneous nature of knowledge graphs and the personalized aspect of recommender systems, offering a robust, end-to-end framework.

3.4. Differentiation Analysis

Compared to the main methods in related work, KGCN offers several core differences and innovations:

  • Direct KG Exploitation vs. KGE: Unlike KGE-based methods (TransE, LibFM + TransE, CKE) that pre-embed the KG or use KGE as a regularization term, KGCN directly operates on the KG structure through graph convolution. This avoids the potential mismatch between KGE objectives (e.g., semantic relatedness) and RS objectives (e.g., user preference prediction).
  • Automated Feature Learning vs. Meta-paths: KGCN overcomes the limitations of meta-path-based methods (PER, FMG) by automatically discovering high-order structure and semantic information through an iterative aggregation process. It does not require manual design of meta-paths or meta-graphs, which are often sub-optimal and labor-intensive.
  • Personalized and Biased Aggregation: A key innovation is the user-specific bias applied during neighborhood aggregation. Unlike general GCNs (GraphSAGE, GAT, PinSage), which typically use a fixed aggregation scheme or attention weights learned universally across the graph, KGCN calculates user-relation scores ($\pi_r^u$) to dynamically weight the importance of different relations for a specific user. This allows KGCN to capture personalized interests that previous GNN models, not designed for recommender systems, often miss.
  • Handling Heterogeneity: While GAT uses attention to weight neighbors, KGCN explicitly handles the heterogeneous nature of the KG by making the aggregation dependent on the relation type connecting the entities and the user. This is a crucial distinction from GNNs designed for homogeneous graphs.
  • Computational Efficiency: KGCN employs fixed-size neighbor sampling and minibatch training, addressing the computational and storage overhead issues that models like RippleNet might face due to unpredictably growing ripple sets.
  • End-to-End Learning: KGCN is an end-to-end framework, meaning the KG structure and user preferences are learned jointly during the training process, leading to more coherent and effective representations for recommendation.

4. Methodology

4.1. Principles

The core idea behind Knowledge Graph Convolutional Networks (KGCN) is to adapt Graph Convolutional Networks (GCNs) to heterogeneous knowledge graphs for recommender systems. The method aims to automatically discover both high-order structural information and semantic information within the KG by iteratively aggregating information from an entity's neighbors. Crucially, this aggregation is not uniform; it's biased and personalized based on the specific user and the relation type connecting entities. This allows KGCN to model how a user's interests might traverse the KG and influence their preferences for an item. The multi-hop aggregation extends the receptive field to capture long-distance interests, while fixed-size neighbor sampling and minibatch training ensure scalability.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

The problem is defined within a standard recommender system context augmented with a knowledge graph.

  • We have a set of $M$ users, denoted as $\mathcal{U} = \{u_1, u_2, \dots, u_M\}$.

  • We have a set of $N$ items, denoted as $\mathcal{V} = \{v_1, v_2, \dots, v_N\}$.

  • The user-item interaction matrix is $\mathbf{Y} \in \mathbb{R}^{M \times N}$. An entry $y_{uv} = 1$ indicates that user $u$ has interacted with item $v$ (e.g., clicked, browsed, purchased), while $y_{uv} = 0$ indicates no interaction. This represents implicit feedback.

  • A knowledge graph $\mathcal{G}$ is provided, composed of entity-relation-entity triples (h, r, t).

    • $h \in \mathcal{E}$ is the head entity.
    • $r \in \mathcal{R}$ is the relation.
    • $t \in \mathcal{E}$ is the tail entity.
    • $\mathcal{E}$ is the set of all entities in the knowledge graph.
    • $\mathcal{R}$ is the set of all relations in the knowledge graph.
  • In recommender scenarios, an item $v \in \mathcal{V}$ corresponds to one or more entities $e \in \mathcal{E}$ in the knowledge graph.

    The goal is to predict the probability $\hat{y}_{uv}$ that user $u$ will engage with item $v$, especially for items with which the user has no prior interaction. This is achieved by learning a prediction function $\mathcal{F}$ with parameters $\Theta$ and $\Upsilon$: $ \hat{y}_{uv} = \mathcal{F}(u, v | \Theta, \Upsilon, \mathcal{G}) $. Here, $\Theta$ denotes the model parameters of the function $\mathcal{F}$, and $\Upsilon$ might represent initial entity embeddings or other learnable parameters related to the KG.
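Before walking through the KGCN layer, the following sketch shows one plausible in-memory layout for the inputs defined above (implicit-feedback triples and the KG stored as an adjacency list keyed by head entity); the numbers and field names are illustrative assumptions, not the authors' data format.

```python
import numpy as np

# Implicit feedback: rows of (user, item, label) with y_uv in {0, 1}.
interactions = np.array([
    [0, 12, 1],   # user 0 interacted with item 12
    [0, 7,  0],   # a sampled negative for user 0
    [1, 12, 1],
])

# Knowledge graph as an adjacency list: head entity -> list of (relation, tail entity).
# Item 12 is also entity 12 here; 101-103 are non-item attribute entities.
kg = {
    12:  [(0, 101), (1, 102)],   # e.g., item 12 --relation 0--> entity 101, --relation 1--> entity 102
    101: [(2, 103)],
    102: [],
    103: [],
}
```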

4.2.2. KGCN Layer

The KGCN is built upon aggregating information from neighbors. A single KGCN layer computes an updated representation for an entity based on its immediate neighbors. Let's consider a candidate pair of user uu and item (entity) vv.

  1. User-Relation Scoring: The first step is to quantify how important a particular relation $r$ is to a specific user $u$. This is done using a function $g$ that takes the representation (embedding) of the user and the representation of the relation as input: $ \pi_r^u = g(\mathbf{u}, \mathbf{r}) $ where:

    • $\mathbf{u} \in \mathbb{R}^d$ is the representation vector (embedding) of user $u$.
    • $\mathbf{r} \in \mathbb{R}^d$ is the representation vector (embedding) of relation $r$.
    • $d$ is the dimension of the representation vectors.
    • $g: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a function (e.g., inner product) that computes a score. The score $\pi_r^u$ characterizes the importance of relation $r$ to user $u$. For example, one user might find the "director" relation important for movies, while another values the "genre" relation.
  2. Neighborhood Representation (Biased Aggregation): To characterize the topological proximity structure of an item $v$, KGCN computes a linear combination of its neighbors. This combination is user-specific and relation-aware. First, $N(v)$ denotes the set of entities directly connected to $v$, and the relation between $v$ and a neighbor $e$ is denoted $r_{v,e}$. The neighborhood representation for item $v$ with respect to user $u$, denoted $\mathbf{v}_{N(v)}^u$, is calculated as: $ \mathbf{v}_{N(v)}^u = \sum_{e \in N(v)} \tilde{\pi}_{r_{v,e}}^u \mathbf{e} $ where:

    • $\mathbf{e}$ is the representation vector of the neighbor entity $e$.

    • $\tilde{\pi}_{r_{v,e}}^u$ is a normalized user-relation score for the relation $r_{v,e}$ that connects $v$ and $e$. This score acts as a personalized filter.

      The normalized user-relation score $\tilde{\pi}_{r_{v,e}}^u$ is obtained by applying a softmax function to the raw scores $\pi_r^u$ computed earlier, ensuring that the weights sum to 1 over a given item's neighborhood: $ \tilde{\pi}_{r_{v,e}}^u = \frac{\exp(\pi_{r_{v,e}}^u)}{\sum_{e' \in N(v)} \exp(\pi_{r_{v,e'}}^u)} $ This softmax normalization ensures that the contributions of neighbors are weighted according to the personalized relevance of their connecting relations.

  3. Fixed-Size Neighbor Sampling: In real-world knowledge graphs, the number of neighbors $N(e)$ of an entity $e$ can vary significantly and may be very large. To ensure computational efficiency and a fixed computation pattern for minibatch processing, KGCN uniformly samples a fixed-size set of neighbors for each entity.

    • $S(v) \stackrel{\Delta}{=} \{e \mid e \sim N(v)\}$ denotes the sampled set of neighbors for entity $v$.
    • $|S(v)| = K$ is a configurable constant representing the fixed number of sampled neighbors. The sampled set $S(v)$ is also referred to as the (single-layer) receptive field of entity $v$. The neighborhood representation over the sampled set, $\mathbf{v}_{S(v)}^u$, is computed exactly as $\mathbf{v}_{N(v)}^u$ above, but summing over $S(v)$ instead of $N(v)$.
  4. Aggregator Functions: After computing the neighborhood representation $\mathbf{v}_{S(v)}^u$, the final step in a KGCN layer is to aggregate it with the entity's own representation $\mathbf{v}$ (or its previous-layer representation) to form a new, updated representation. The paper implements three types of aggregators, denoted as $agg: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$:

    • Sum Aggregator (agg_sum): Sums the entity's own representation with its neighborhood representation, followed by a nonlinear transformation: $ agg_{sum}(\mathbf{v}, \mathbf{v}_{S(v)}^u) = \sigma \left( \mathbf{W} \cdot (\mathbf{v} + \mathbf{v}_{S(v)}^u) + \mathbf{b} \right) $ where:

      • $\mathbf{W}$ is a learnable weight matrix.
      • $\mathbf{b}$ is a learnable bias vector.
      • $\sigma$ is a nonlinear activation function (e.g., ReLU).
    • Concat Aggregator (agg_concat): Concatenates the entity's own representation with its neighborhood representation before applying the nonlinear transformation, which lets the model distinguish the entity's own features from its neighborhood features more explicitly: $ agg_{concat}(\mathbf{v}, \mathbf{v}_{S(v)}^u) = \sigma \left( \mathbf{W} \cdot concat(\mathbf{v}, \mathbf{v}_{S(v)}^u) + \mathbf{b} \right) $ Here, the weight matrix $\mathbf{W}$ has a dimension compatible with the concatenated vector (i.e., $2d \times d$ if $\mathbf{v}$ and $\mathbf{v}_{S(v)}^u$ are $d$-dimensional).

    • Neighbor Aggregator (agg_neighbor): Directly uses the neighborhood representation as the output, applying only the nonlinear transformation; the entity's own representation is not explicitly included: $ agg_{neighbor}(\mathbf{v}, \mathbf{v}_{S(v)}^u) = \sigma \left( \mathbf{W} \cdot \mathbf{v}_{S(v)}^u + \mathbf{b} \right) $

The choice of aggregator influences how the entity's own information is combined with its contextual neighborhood information.
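Putting these pieces together, here is a hedged NumPy sketch of a single KGCN layer, assuming that $g$ is the inner product and the activation is ReLU; the embedding values, dimensions, and variable names are illustrative, and the concat variant would require a $(2d \times d)$ weight matrix as noted in the comment.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kgcn_layer(u_emb, v_emb, neigh_embs, rel_embs, W, b, agg="sum"):
    """One KGCN layer for a (user, entity) pair.

    u_emb:      (d,)   user embedding
    v_emb:      (d,)   entity's own representation from the previous layer
    neigh_embs: (K, d) previous-layer representations of the K sampled neighbors
    rel_embs:   (K, d) embeddings of the relations linking the entity to each neighbor
    """
    scores = rel_embs @ u_emb          # pi_r^u = g(u, r), here the inner product
    weights = softmax(scores)          # normalized user-relation scores (sum to 1)
    v_neigh = weights @ neigh_embs     # biased neighborhood representation, shape (d,)

    if agg == "sum":
        out = (v_emb + v_neigh) @ W + b
    elif agg == "concat":              # W would need shape (2d, d) for this variant
        out = np.concatenate([v_emb, v_neigh]) @ W + b
    else:                              # "neighbor": ignore the entity's own vector
        out = v_neigh @ W + b
    return np.maximum(0, out)          # ReLU (the last layer would use tanh instead)

# Tiny usage example with random embeddings (d = 4, K = 2 sampled neighbors).
d, K = 4, 2
rng = np.random.default_rng(0)
u, v = rng.normal(size=d), rng.normal(size=d)
neighbors, relations = rng.normal(size=(K, d)), rng.normal(size=(K, d))
W, b = rng.normal(size=(d, d)), np.zeros(d)
print(kgcn_layer(u, v, neighbors, relations, W, b, agg="sum"))
```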

4.2.3. Learning Algorithm

The KGCN algorithm extends a single layer to multiple layers, allowing it to capture high-order proximity information and long-distance interests. This is achieved by iteratively propagating and aggregating representations.

Multi-Layer Extension:

  • A 0-order representation is the initial embedding of an entity.

  • A 1-order representation is obtained after one KGCN layer by aggregating the 0-order representations of its immediate neighbors.

  • An h-order representation is a mixture of initial representations of the entity itself and its neighbors up to $h$ hops away.

    The overall KGCN algorithm is presented in Algorithm 1 and illustrated in Figure 1b.

Algorithm 1: KGCN algorithm

Input: Interaction matrix Y; knowledge graph G(E, R);
       Initial embeddings: {u}_u∈U, {e}_e∈E, {r}_r∈R;
       Neighbor sampling function: S: e → 2^E;
       Learnable parameters: {W_i, b_i}_i=1^H;
       Hyperparameters: H, d, g(·), f(·), σ(·), agg(·);
Output: Prediction function F(u, v | Θ, Υ, G)

1 while KGCN not converge do
2   for (u, v) in Y do
3     {M[i]}_i=0^H ← Get-Receptive-Field(v);
4     e^u[0] ← e, ∀e ∈ M[0];  // Initialize 0-order representations
5     for h = 1, ..., H do
6       for e ∈ M[h] do
7         e_S(e)^u[h-1] ← ∑_(e'∈S(e))_ (normalized_user_relation_score(u, r_(e,e')) * e'^u[h-1]);
        // Calculate neighborhood representation for entity e at layer h-1,
        // weighted by personalized user-relation scores and using representations from the previous layer.
8         e^u[h] ← agg(e^u[h-1], e_S(e)^u[h-1]);
        // Aggregate entity's own representation from previous layer with its neighborhood representation
        // to get its h-order representation.
9     v^u ← e^u[H]; // The final H-order representation for item v, specific to user u
10    Calculate predicted probability ŷ_uv = f(u, v^u);
11    Update parameters by gradient descent;
12 return F;

13 Function Get-Receptive-Field(v)
14   M[H] ← {v}; // The H-th layer receptive field contains only the target entity v
15   for h = H - 1, ..., 0 do
16     M[h] ← M[h+1]; // Initialize current layer's receptive field with entities from the next layer
17     for e ∈ M[h+1] do
18       M[h] ← M[h] ∪ S(e); // Add sampled neighbors of entities in M[h+1] to M[h]
19   return {M[i]}_i=0^H;

Explanation of Algorithm 1:

  • Input: The algorithm takes the interaction matrix $\mathbf{Y}$, the knowledge graph $\mathcal{G}$, initial embeddings for users, entities, and relations, the neighbor sampling function $S$, learnable weight and bias parameters for each layer, and various hyperparameters ($H$ for depth, $d$ for dimension, $g$ for scoring, $f$ for prediction, $\sigma$ for activation, agg for aggregation type).

  • Get-Receptive-Field(v) Function (Lines 13-19): This function determines the set of entities whose initial embeddings will contribute to the final representation of item $v$. It works backward from the target entity $v$.

    • It starts by placing the target entity $v$ in the receptive field of the outermost layer M[H].
    • Then, for each layer $h$ from H-1 down to 0, it initializes M[h] with the entities from M[h+1] (i.e., the entities that will be processed in the next aggregation step).
    • For every entity $e$ in M[h+1], it samples its neighbors using S(e) and adds them to M[h]. This effectively builds the multi-hop receptive field.
  • Initialization (Line 4): For all entities $e$ in the 0-th layer receptive field M[0] (which contains $v$ and all entities reachable within $H$ hops), their 0-order representations $\mathbf{e}^u[0]$ are initialized with their original embeddings $\mathbf{e}$.

  • Aggregation Loop (Lines 5-8): This is the core graph convolution process, repeated $H$ times (for $H$ layers).

    • In each iteration $h$, it computes the neighborhood representation for every entity $e$ in M[h] (the entities updated in this layer). This uses the personalized user-relation scores and the (h-1)-order representations of its sampled neighbors.
    • Then, it aggregates the entity's own (h-1)-order representation with its newly computed neighborhood representation to produce its h-order representation $\mathbf{e}^u[h]$.
  • Final Item Representation (Line 9): After $H$ iterations, the H-order representation of the target item $v$ (i.e., $\mathbf{v}^u = \mathbf{e}^u[H]$) is obtained. This representation is user-specific ($\mathbf{v}^u$) because the aggregation process was biased by user $u$'s preferences.

  • Prediction (Line 10): The final H-order representation of item $v$ ($\mathbf{v}^u$) is fed into a prediction function $f$ along with the user's representation $\mathbf{u}$ to calculate the predicted probability $\hat{y}_{uv}$: $ \hat{y}_{uv} = f(\mathbf{u}, \mathbf{v}^u) $ The paper sets $f$ as the inner product.

  • Parameter Update (Line 11): The model parameters are updated using gradient descent based on a defined loss function.

  • Minibatch Implementation: The algorithm is designed to operate in a minibatch fashion. This means instead of iterating over all user-item pairs (which can be very large, as suggested by line 2 iterating for (u, v) in Y), a negative sampling strategy is used during training, where mini-batches of positive and negative samples are processed.

    The framework of KGCN is illustrated in Figure 1b.
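As a reading aid for Algorithm 1, the following illustrative Python sketch (not the authors' released implementation) mirrors the Get-Receptive-Field routine with uniform fixed-size neighbor sampling; `sample_neighbors` and the toy KG are assumptions for demonstration.

```python
import numpy as np

def sample_neighbors(kg, entity, K, rng):
    """Uniformly sample K (relation, neighbor) pairs for an entity, with replacement
    when the entity has fewer than K neighbors."""
    candidates = kg.get(entity, [])
    if not candidates:
        return [(None, entity)] * K    # fallback: treat an isolated entity as its own neighbor
    idx = rng.integers(0, len(candidates), size=K)
    return [candidates[i] for i in idx]

def get_receptive_field(kg, v, H, K, rng):
    """Mirror of Get-Receptive-Field(v): M[H] = {v}; each M[h] adds the sampled
    neighbors of every entity in M[h+1]."""
    M = {H: {v}}
    for h in range(H - 1, -1, -1):
        M[h] = set(M[h + 1])
        for e in M[h + 1]:
            M[h] |= {tail for (_, tail) in sample_neighbors(kg, e, K, rng)}
    return M

# Toy KG: entity -> list of (relation, tail entity).
kg = {0: [(0, 1), (1, 2)], 1: [(2, 3)], 2: [(0, 4)], 3: [], 4: []}
rng = np.random.default_rng(0)
print(get_receptive_field(kg, v=0, H=2, K=2, rng=rng))
# A full forward pass would then run the biased aggregation of Section 4.2.2
# for h = 1..H over the entities in each M[h], as in lines 5-9 of Algorithm 1.
```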

Figure 1: KGCN Framework

The following figure illustrates the two key components of KGCN: (a) the receptive field concept and (b) the overall framework for entity representation calculation.

Figure 1: (a) A two-layer receptive field (green entities) of the blue entity in a KG. (b) The framework of KGCN, showing how an entity's representation is computed and used to predict the interaction probability.

Figure 1a (Receptive Field): This panel visually explains how the receptive field is constructed. For a blue entity (the target item), its 1-hop neighbors are shown, and the neighbors of those 1-hop neighbors form the 2-hop receptive field. The green entities represent the entities whose initial embeddings will eventually contribute to the final representation of the blue entity. The parameter $K$ (neighbor sampling size) limits the number of neighbors sampled at each hop, ensuring a manageable and fixed computation graph.

Figure 1b (KGCN Framework): This panel depicts the iterative aggregation process within KGCN for one layer ($h$ to $h+1$).

  1. Initial Representation: An entity starts with its h-order representation (e.g., $\mathbf{e}_1^u[h]$ for the blue node).
  2. Neighbor Sampling: It samples a fixed number of neighbors (e.g., two neighbors for $\mathbf{e}_1^u[h]$ in the illustration).
  3. User-Relation Scoring & Weighting: For each sampled neighbor (e.g., $\mathbf{e}_2^u[h]$, $\mathbf{e}_3^u[h]$), their h-order representations are considered. The edges connecting them to $\mathbf{e}_1^u[h]$ carry relations ($r_1$, $r_2$), and the user-relation scores ($\pi_{r_1}^u$, $\pi_{r_2}^u$) are normalized to obtain the weights ($\tilde{\pi}_{r_1}^u$, $\tilde{\pi}_{r_2}^u$).
  4. Neighborhood Aggregation: The h-order representations of the sampled neighbors are summed, weighted by the normalized user-relation scores, to form the neighborhood representation (the green combined vector).
  5. Aggregator Function: This neighborhood representation is then combined with the entity's own h-order representation (the blue vector) using an aggregator function (sum, concat, or neighbor) and passed through a nonlinear activation ($\sigma$) and linear transformation ($\mathbf{W}$, $\mathbf{b}$) to produce the (h+1)-order representation $\mathbf{e}_1^u[h+1]$. This process repeats for $H$ layers, culminating in the H-order representation of the item, which is then used for prediction.

Loss Function

To train the model, a loss function is defined. The paper uses a binary cross-entropy loss with negative sampling. $ \mathcal{L} = \sum_{u \in \mathcal{U}} \bigg( \sum_{v : y_{uv}=1} \mathcal{T}(y_{uv}, \hat{y}_{uv}) - \sum_{i=1}^{T^u} \mathbb{E}_{v_i \sim P(v_i)} \mathcal{T}(y_{uv_i}, \hat{y}_{uv_i}) \bigg) + \lambda \| \mathcal{F} \|_2^2 $ where:

  • $\mathcal{L}$ is the total loss.
  • $\mathcal{U}$ is the set of all users.
  • $v : y_{uv}=1$ ranges over positive samples (items user $u$ interacted with).
  • $y_{uv}$ is the true label (1 for positive interaction, 0 for negative).
  • $\hat{y}_{uv}$ is the predicted probability of interaction.
  • $\mathcal{T}$ is the cross-entropy loss function for a single prediction. For binary classification, it is typically: $ \mathcal{T}(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $
  • The first sum $\sum_{v : y_{uv}=1} \mathcal{T}(y_{uv}, \hat{y}_{uv})$ accumulates the cross-entropy loss over all positive interactions of user $u$.
  • The second sum $\sum_{i=1}^{T^u} \mathbb{E}_{v_i \sim P(v_i)} \mathcal{T}(y_{uv_i}, \hat{y}_{uv_i})$ accumulates the cross-entropy loss over negative samples.
    • $T^u$ is the number of negative samples for user $u$. The paper sets $T^u = |\{v : y_{uv}=1\}|$, i.e., an equal number of positive and negative samples.
    • $P(v_i)$ is the negative sampling distribution; in this paper it is a uniform distribution, so negative items are sampled randomly from the non-interacted items.
    • $\mathbb{E}$ denotes the expectation over sampled negative items.
  • $\lambda \| \mathcal{F} \|_2^2$ is an L2-regularization term to prevent overfitting.
    • $\lambda$ is the L2-regularizer weight.

    • $\| \mathcal{F} \|_2^2$ represents the squared L2-norm of the model's parameters (e.g., $\mathbf{W}$, $\mathbf{b}$, and all embedding vectors).

      The model is optimized using the Adam algorithm.
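The following is a minimal sketch of this training objective (assumed setup, not the released code): each positive interaction of a user is paired with uniformly sampled non-interacted items, per-example cross-entropy terms are accumulated, and the L2 penalty would simply add $\lambda$ times the squared norm of all trainable parameters.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-8):
    """Per-example term T(y, y_hat)."""
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def user_loss(user, pos_items, all_items, predict, rng):
    """Cross-entropy over a user's positives plus T^u = |positives| uniform negatives."""
    unseen = list(set(all_items) - set(pos_items))
    negatives = rng.choice(unseen, size=min(len(pos_items), len(unseen)), replace=False)
    loss = sum(cross_entropy(1.0, predict(user, v)) for v in pos_items)
    loss += sum(cross_entropy(0.0, predict(user, v)) for v in negatives)
    return loss

# Toy usage: a stand-in predictor that returns a random probability.
rng = np.random.default_rng(0)
predict = lambda u, v: 1.0 / (1.0 + np.exp(-rng.normal()))
print(user_loss(user=0, pos_items=[1, 5], all_items=range(10), predict=predict, rng=rng))
```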

5. Experimental Setup

5.1. Datasets

The authors evaluate KGCN on three real-world datasets from different domains: movie, book, and music recommendation.

  1. MovieLens-20M:

    • Domain: Movie recommendation.
    • Characteristics: Contains approximately 20 million explicit ratings (from 1 to 5) on the MovieLens website.
    • Usage: Explicit ratings are transformed into implicit feedback. Ratings $\ge 4$ are considered positive ($y_{uv}=1$), while unrated items are sampled as negatives ($y_{uv}=0$).
  2. Book-Crossing:

    • Domain: Book recommendation.
    • Characteristics: Contains 1 million ratings (from 0 to 10) from the Book-Crossing community.
    • Usage: Explicit ratings are transformed into implicit feedback. Because the data is sparse, no positive-rating threshold is set, so every observed rating is treated as a positive interaction.
  3. Last.FM:

    • Domain: Music recommendation.
    • Characteristics: Contains musician listening information from about 2 thousand users from the Last.fm online music system.
    • Usage: Similar to Book-Crossing, no threshold is set for transforming explicit feedback into implicit feedback due to sparsity.
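A short sketch of the implicit-feedback conversion described above (toy ratings, hypothetical helper): for MovieLens-20M, ratings of at least 4 become positives and a matching number of unrated items is sampled as negatives; for Book-Crossing and Last.FM every observed rating would count as a positive.

```python
import numpy as np

def to_implicit(ratings, num_items, threshold, rng):
    """ratings: array of (user, item, rating) rows. Returns (user, item, label) rows."""
    out = []
    for user in np.unique(ratings[:, 0]):
        rows = ratings[ratings[:, 0] == user]
        positives = rows[rows[:, 2] >= threshold][:, 1].astype(int)
        unseen = np.setdiff1d(np.arange(num_items), rows[:, 1].astype(int))
        negatives = rng.choice(unseen, size=min(len(positives), len(unseen)), replace=False)
        out += [(int(user), int(i), 1) for i in positives]
        out += [(int(user), int(i), 0) for i in negatives]
    return np.array(out)

rng = np.random.default_rng(0)
toy_ratings = np.array([[0, 1, 5.0], [0, 2, 3.0], [1, 1, 4.0]])   # (user, item, rating)
print(to_implicit(toy_ratings, num_items=6, threshold=4.0, rng=rng))
```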

Knowledge Graph Construction: For each dataset, a knowledge graph is constructed using Microsoft Satori.

  • A subset of triples from Satori with a confidence level greater than 0.9 is initially selected.

  • Item IDs (movies, books, musicians) are matched with entities in the sub-KG by matching their names with the tail of specific triples (e.g., (head, film.film.name, tail) for movies, (head, book.book.title, tail) for books, (head, type.object.name, tail) for musicians).

  • Items with multiple matches or no matches are excluded for simplicity.

  • Finally, the matched item IDs are used to select all relevant triples from the sub-KG where the item is the head entity.

    The following are the results from [Table 1] of the original paper:

    MovieLens-20M Book-Crossing Last.FM
    # users 138,159 19,676 1,872
    # items 16,954 20,003 3,846
    # interactions 13,501,622 172,576 42,346
    # entities 102,569 25,787 9,366
    # relations 32 18 60
    # KG triples 499,474 60,787 15,518
    K 4 8 8
    d 32 64 16
    H 2 1 1
    λ 10^-7 2 × 10^-5 10^-4
    η 2 × 10^-2 2 × 10^-4 5 × 10^-4
    batch size 65,536 256 128

Table 1: Basic statistics and hyper-parameter settings for the three datasets (K: neighbor sampling size, d: dimension of embeddings, H: depth of receptive field, λ: L2 regularizer weight, η: learning rate).

Dataset Characteristics from Table 1:

  • MovieLens-20M is the largest dataset in terms of users, interactions, and KG triples, making it less sparse.

  • Book-Crossing and Last.FM are significantly sparser, with fewer users and interactions. This makes them good candidates for evaluating how well KGCN addresses sparsity.

  • The number of entities and relations in the KGs varies, reflecting the complexity and richness of the knowledge available for each domain.

    Data Splitting: For each dataset, the ratio of training, evaluation (validation), and test set is fixed at 6:2:2.

5.2. Evaluation Metrics

The model is evaluated in two scenarios: click-through rate (CTR) prediction and top-K recommendation.

5.2.1. Click-Through Rate (CTR) Prediction Metrics

For CTR prediction, where the model predicts the probability of a user interacting with an item, the following metrics are used:

  1. Area Under the Receiver Operating Characteristic Curve (AUC):

    • Conceptual Definition: AUC measures the performance of a binary classifier across all possible classification thresholds. It represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 1 indicates a perfect classifier, while 0.5 indicates a classifier no better than random guessing. It is robust to imbalanced datasets.
    • Mathematical Formula: $ \text{AUC} = \frac{\sum_{i=1}^{P} \sum_{j=1}^{N} \mathbb{I}(\text{score}_i > \text{score}_j)}{P \times N} $
    • Symbol Explanation:
      • $P$: Number of positive samples.
      • $N$: Number of negative samples.
      • $\text{score}_i$: The predicted probability or score for the $i$-th positive sample.
      • $\text{score}_j$: The predicted probability or score for the $j$-th negative sample.
      • $\mathbb{I}(\cdot)$: The indicator function, which is 1 if its argument is true, and 0 otherwise.
  2. F1 Score:

    • Conceptual Definition: The F1 score is the harmonic mean of precision and recall. It is a measure of a model's accuracy on a dataset, particularly useful when there is an uneven class distribution (imbalance). It balances both precision (minimizing false positives) and recall (minimizing false negatives).
    • Mathematical Formula: $ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $ where:
      • Precision: The fraction of relevant instances among the retrieved instances. $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $
      • Recall: The fraction of relevant instances that were retrieved. $ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
    • Symbol Explanation:
      • True Positives (TP): Number of correctly predicted positive instances.
      • False Positives (FP): Number of incorrectly predicted positive instances (negative instances predicted as positive).
      • False Negatives (FN): Number of incorrectly predicted negative instances (positive instances predicted as negative).
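As a sanity check on the two CTR metrics above, a short NumPy sketch (toy labels and scores) computes AUC directly from the pairwise definition and F1 at a 0.5 threshold; in practice, library implementations (e.g., scikit-learn) would normally be used.

```python
import numpy as np

def auc(labels, scores):
    """Probability that a randomly chosen positive is scored above a randomly chosen negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float(np.mean(pos[:, None] > neg[None, :]))

def f1(labels, scores, threshold=0.5):
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    precision = tp / max(np.sum(pred == 1), 1)
    recall = tp / max(np.sum(labels == 1), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

labels = np.array([1, 0, 1, 0, 1])
scores = np.array([0.9, 0.4, 0.7, 0.6, 0.3])
print(auc(labels, scores), f1(labels, scores))
```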

5.2.2. Top-K Recommendation Metric

For top-K recommendation, where the model selects KK items with the highest predicted click probability for each user, the following metric is used:

  1. Recall@K:
    • Conceptual Definition: Recall@K measures the proportion of relevant items (items the user actually interacted with) that are successfully retrieved within the top KK recommendations. It focuses on how many of the truly relevant items are present in the recommended list of size KK.
    • Mathematical Formula: For a single user $u$: $ \text{Recall@K}_u = \frac{|\text{RecommendedItems}_u(K) \cap \text{RelevantItems}_u|}{|\text{RelevantItems}_u|} $ The reported metric is the average of this per-user recall over all users in the test set: $ \text{Recall@K} = \frac{1}{|\mathcal{U}_{\text{test}}|} \sum_{u \in \mathcal{U}_{\text{test}}} \text{Recall@K}_u $
    • Symbol Explanation:
      • $\mathcal{U}_{\text{test}}$: Set of users in the test set.
      • $\text{RecommendedItems}_u(K)$: The set of $K$ items recommended to user $u$.
      • $\text{RelevantItems}_u$: The set of items user $u$ has actually interacted with in the test set (positive interactions).
      • $\cap$: Set intersection operator.
      • $|\cdot|$: Cardinality of a set.
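A matching sketch for Recall@K (toy data and hypothetical variable names): take each test user's K highest-scored items, intersect them with the user's held-out positives, and average the per-user recall.

```python
import numpy as np

def recall_at_k(scores_per_user, relevant_per_user, k):
    """scores_per_user: {user: array of scores over all items};
    relevant_per_user: {user: set of held-out relevant item ids}."""
    recalls = []
    for user, scores in scores_per_user.items():
        top_k = {int(i) for i in np.argsort(-scores)[:k]}   # indices of the K highest scores
        relevant = relevant_per_user[user]
        if relevant:
            recalls.append(len(top_k & relevant) / len(relevant))
    return float(np.mean(recalls))

scores = {0: np.array([0.1, 0.9, 0.4, 0.8]), 1: np.array([0.7, 0.2, 0.6, 0.1])}
relevant = {0: {1, 2}, 1: {0}}
print(recall_at_k(scores, relevant, k=2))
```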

5.3. Baselines

The KGCN is compared against six baselines, categorized into KG-free and KG-aware methods.

5.3.1. KG-Free Baselines

  1. SVD [11]:
    • Description: A classic collaborative filtering model using matrix factorization. It models user-item interactions through the inner product of their latent vectors. The paper uses the unbiased version.
    • Configuration: $d=8$, $\eta=0.5$ for MovieLens-20M and Book-Crossing; $d=8$, $\eta=0.1$ for Last.FM.
  2. LibFM [14]:
    • Description: A feature-based factorization model commonly used in click-through rate (CTR) prediction. It models interactions between features. For this comparison, user ID and item ID are concatenated as input features.
    • Configuration: dimension = {1, 1, 8}, 50 training epochs.

5.3.2. KG-Aware Baselines

  1. LibFM + TransE:
    • Description: An extension of LibFM that incorporates knowledge graph embeddings. The entity representation learned by TransE [1] is attached to each user-item pair as additional features for LibFM.
    • Configuration: TransE dimension = 32.
  2. PER [22]:
    • Description: Personalized Entity Recommendation (PER) treats the knowledge graph as a heterogeneous information network. It extracts meta-path-based features to represent connectivity between users and items.
    • Configuration: Manually designed meta-paths are used:
      • MovieLens-20M: "user-movie-director-movie", "user-movie-genre-movie", "user-movie-star-movie".
      • Book-Crossing: "user-book-author-book", "user-book-genre-book".
      • Last.FM: "user-musician-date_of_birth-musician" (date of birth discretized), "user-musician-country-musician", "user-musician-genre-musician".
  3. CKE [23]:
    • Description: Collaborative Knowledge Embedding (CKE) combines collaborative filtering with structural knowledge from the KG. The paper implements it as CF plus a structural knowledge module (likely using TransR-like regularization).
    • Configuration: $d=64$ for MovieLens-20M, $d=128$ for Book-Crossing, $d=64$ for Last.FM. The KG part training weight is 0.1 for all datasets. Learning rates are the same as for SVD.
  4. RippleNet [18]:
    • Description: A memory-network-like approach that propagates users' preferences on the knowledge graph through multi-hop ripple sets for recommendation.
    • Configuration:
      • MovieLens-20M: $d=8$, $H=2$, $\lambda_1=10^{-6}$, $\lambda_2=0.01$, $\eta=0.01$.
      • Last.FM: $d=16$, $H=3$, $\lambda_1=10^{-5}$, $\lambda_2=0.02$, $\eta=0.005$.
      • Other hyperparameters are set as in their original paper or default code.

5.4. KGCN Hyperparameters

For KGCN specifically:

  • Functions $g$ and $f$: Both are set as inner product.
  • Activation $\sigma$: ReLU for non-last-layer aggregators, tanh for the last-layer aggregator.
  • Other settings: Detailed in Table 1 (shown above).
    • $K$ (neighbor sampling size): 4 for MovieLens-20M, 8 for Book-Crossing and Last.FM.
    • $d$ (dimension of embeddings): 32 for MovieLens-20M, 64 for Book-Crossing, 16 for Last.FM.
    • $H$ (depth of receptive field): 2 for MovieLens-20M, 1 for Book-Crossing and Last.FM.
    • $\lambda$ (L2 regularizer weight): $10^{-7}$ for MovieLens-20M, $2 \times 10^{-5}$ for Book-Crossing, $10^{-4}$ for Last.FM.
    • $\eta$ (learning rate): $2 \times 10^{-2}$ for MovieLens-20M, $2 \times 10^{-4}$ for Book-Crossing, $5 \times 10^{-4}$ for Last.FM.
    • batch size: 65,536 for MovieLens-20M, 256 for Book-Crossing, 128 for Last.FM. These hyperparameters were determined by optimizing AUC on a validation set. Each experiment was repeated 3 times, and the average performance is reported. The models are optimized using the Adam algorithm. The implementation uses Python 3.6, TensorFlow 1.12.0, and NumPy 1.14.3.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that KGCN consistently outperforms all baselines across three diverse real-world datasets, validating its efficacy in knowledge graph-aware recommender systems.

CTR Prediction Results (AUC and F1)

The following are the results from [Table 2] of the original paper:

Model MovieLens-20M Book-Crossing Last.FM
AUC F1 AUC F1 AUC F1
SVD 0.963 (-1.5%) 0.919 (-1.4%) 0.672 (-8.9%) 0.635 (-7.7%) 0.769 (-3.4%) 0.696 (-3.5%)
LibFM 0.959 (-1.9%) 0.906 (-2.8%) 0.691 (-6.4%) 0.618 (-10.2%) 0.778 (-2.3%) 0.710 (-1.5%)
LibFM + TransE 0.966 (-1.2%) 0.917 (-1.6%) 0.698 (-5.4%) 0.622 (-9.6%) 0.777 (-2.4%) 0.709 (-1.7%)
PER 0.832 (-14.9%) 0.788 (-15.5%) 0.617 (-16.4%) 0.562 (-18.3%) 0.633 (-20.5%) 0.596 (-17.3%)
CKE 0.924 (-5.5%) 0.871 (-6.5%) 0.677 (-8.3%) 0.611 (-11.2%) 0.744 (-6.5%) 0.673 (-6.7%)
RippleNet 0.968 (-1.0%) 0.912 (-2.1%) 0.715 (-3.1%) 0.650 (-5.5%) 0.780 (-2.0%) 0.702 (-2.6%)
KGCN-sum 0.978 0.932* 0.738 0.688* 0.794 (-0.3%) 0.719 (-0.3%)
KGCN-concat 0.977 (-0.1%) 0.931 (-0.1%) 0.734 (-0.5%) 0.681 (-1.0%) 0.796* 0.721*
KGCN-neighbor 0.977 (-0.1%) 0.932* 0.728 (-1.4%) 0.679 (-1.3%) 0.781 (-1.9%) 0.699 (-3.1%)
KGCN-avg 0.975 (-0.3%) 0.929 (-0.3%) 0.722 (-2.2%) 0.682 (-0.9%) 0.774 (-2.8%) 0.692 (-4.0%)

Table 2: The results of AUC and F1 in CTR prediction. (Statistically significant improvement by unpaired two-sample t-test with p = 0.1, indicated by *)

Observations from Table 2:

  • KGCN Superiority: KGCN variants consistently achieve the best AUC and F1 scores across all three datasets. KGCN-sum and KGCN-concat generally perform the best, with KGCN-sum leading in AUC on MovieLens-20M and Book-Crossing, and KGCN-concat leading on Last.FM. The AUC gains are substantial: averaged over the baselines, 4.4% on MovieLens-20M, 8.1% on Book-Crossing, and 6.2% on Last.FM. The percentages in parentheses indicate each method's relative difference from the best KGCN variant for that metric.
  • Effectiveness in Sparse Scenarios: KGCN shows particularly high improvements on Book-Crossing and Last.FM. These datasets are much sparser than MovieLens-20M (as seen in Table 1). This confirms that KGCN effectively addresses the sparsity problem by leveraging the rich knowledge graph information.
  • KG-aware vs. KG-free Baselines: Surprisingly, KG-free baselines like SVD and LibFM sometimes outperform KG-aware baselines like PER and CKE. This suggests that simply introducing a KG is not enough; the method of integrating it matters significantly. PER performs the worst, likely due to the difficulty of manually designing optimal meta-paths. CKE also lags behind, indicating that its TransR-like regularization might not be as effective for recommendation as KGCN's direct graph convolution.
  • RippleNet's Strong Performance: RippleNet is a strong baseline, performing better than most other KG-aware methods. This validates the importance of multi-hop neighborhood structure and proximity information in the KG for recommendation, a core idea also adopted by KGCN.
  • Aggregator Comparison (KGCN variants):
    • KGCN-sum generally performs very well, especially leading in AUC for MovieLens-20M and Book-Crossing.
    • KGCN-concat is also highly competitive, even surpassing KGCN-sum on Last.FM AUC.
    • KGCN-neighbor shows a noticeable performance gap on sparser datasets (Book-Crossing, Last.FM) compared to KGCN-sum and KGCN-concat. This suggests that excluding the entity's own representation in aggregation might lead to loss of crucial information, especially when KG context is limited.
    • KGCN-avg performs worse than KGCN-sum across all datasets. This variant is a reduced form of KGCN-sum without the user-relation scores (i.e., uniform weighting of neighbors). Its poorer performance highlights the critical role of the personalized attention mechanism (user-relation scores) in capturing user-specific preferences and semantic information from the KG. This is particularly evident in sparse scenarios (Book-Crossing, Last.FM), where this personalized weighting provides more meaningful context.

Top-K Recommendation Results (Recall@K)

The following figure illustrates the performance of various models in top-K recommendation using the Recall@K metric.

Figure 2: The results of Recall@K in top-K recommendation on MovieLens-20M, Book-Crossing, and Last.FM (curves for LibFM + TransE, PER, CKE, RippleNet, and KGCN-sum; x-axis K, y-axis Recall@K).

Observations from Figure 2:

  • Consistent KGCN Outperformance: Similar to CTR prediction, KGCN-sum consistently achieves the highest Recall@K across all values of K and all three datasets. This reinforces its effectiveness in providing relevant recommendations.
  • Benefits of KG-aware Methods: RippleNet also shows strong performance and generally outperforms KG-free and other KG-aware baselines, again indicating the value of leveraging multi-hop KG structure.
  • Struggles of Meta-path Methods: PER continues to perform poorly, demonstrating the limitations of manually defined meta-paths.
  • KGCN's Robustness: The curves for KGCN-sum are consistently above the other baselines, indicating its robustness and superior ability to capture relevant items for users over different recommendation list sizes. The improvement is particularly pronounced on Book-Crossing and Last.FM, consistent with its strong CTR prediction performance on sparse data.
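
As a point of reference, the per-user Recall@K that Figure 2 averages over users can be computed as in the short sketch below; the function and variable names are illustrative, not taken from the paper.

```python
# Recall@K for a single user: the fraction of held-out positive items that
# appear in the top-k positions of the ranked recommendation list.
def recall_at_k(ranked_items, relevant_items, k):
    if not relevant_items:
        return 0.0
    top_k = set(ranked_items[:k])
    return len(top_k & set(relevant_items)) / len(relevant_items)

# Example: 2 of the user's 4 held-out positives appear in the top 5
ranked = [10, 42, 7, 99, 3, 55, 21]
positives = [42, 3, 77, 88]
print(recall_at_k(ranked, positives, k=5))   # -> 0.5
```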

6.2. Ablation Studies / Parameter Analysis

The paper conducts ablation studies and parameter analyses to understand the impact of various KGCN components and hyperparameters.

6.2.1. Impact of Neighbor Sampling Size (K)

The following are the results from [Table 3] of the original paper:

K 2 4 8 16 32 64
MovieLens-20M 0.978 0.979 0.978 0.978 0.977 0.978
Book-Crossing 0.680 0.727 0.736 0.725 0.711 0.723
Last.FM 0.791 0.794 0.795 0.793 0.794 0.792

Table 3: AUC result of KGCN with different neighbor sampling size K.

Observations:

  • Optimal K: KGCN achieves optimal performance when K (the neighbor sampling size) is 4 or 8. For MovieLens-20M, K = 4 yields the best AUC (0.979). For Book-Crossing, K = 8 is best (0.736), and for Last.FM, K = 8 is best (0.795).
  • Too Small K: A K value that is too small (e.g., 2 for Book-Crossing and Last.FM) results in lower performance. This is because a limited number of sampled neighbors (the receptive field) might not provide enough contextual information to effectively enrich the entity's representation.
  • Too Large K: A K value that is too large (e.g., 16, 32, or 64) tends to degrade performance or offer no significant improvement. This is attributed to the model being misled by noise from irrelevant or less important neighbors, or simply gaining redundancy without adding valuable, distinct information. The computational cost also increases with larger K.
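
The fixed-size uniform sampling that K controls can be sketched as follows. The adjacency layout and function names are illustrative rather than the authors' implementation; sampling falls back to drawing with replacement when an entity has fewer than K neighbors, so every entity contributes exactly K neighbors and minibatch tensors keep a fixed shape.

```python
import numpy as np

def sample_neighbors(adjacency, entity, k, rng=None):
    """Uniformly sample exactly k (neighbor, relation) pairs for an entity."""
    rng = rng or np.random.default_rng(0)
    nbrs = adjacency[entity]
    idx = rng.choice(len(nbrs), size=k, replace=len(nbrs) < k)
    return [nbrs[i] for i in idx]

# adjacency: {entity_id: [(neighbor_id, relation_id), ...]} (illustrative layout)
adjacency = {0: [(1, 0), (2, 1), (3, 0)], 1: [(0, 0)]}
print(sample_neighbors(adjacency, 0, k=4))   # only 3 neighbors exist, so duplicates appear
```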

6.2.2. Impact of Depth of Receptive Field (H)

The following are the results from [Table 4] of the original paper:

H 1 2 3 4
MovieLens-20M 0.972 0.976 0.974 0.514
Book-Crossing 0.738 0.731 0.684 0.547
Last.FM 0.794 0.723 0.545 0.534

Table 4: AUC result of KGCN with different depth of receptive field H.

Observations:

  • Sensitivity to H: KGCN is more sensitive to H (the depth of the receptive field, i.e., the number of aggregation layers) than to K.
  • Optimal H: An H of 1 or 2 appears to be optimal or near-optimal. MovieLens-20M performs best at H = 2 (0.976), while Book-Crossing and Last.FM perform best at H = 1 (0.738 and 0.794, respectively).
  • Model Collapse with Large H: A critical finding is the serious model collapse observed when H reaches 3 or 4. The AUC drops dramatically (e.g., to 0.514 for MovieLens-20M at H = 4).
    • This is consistent with the intuition that overly long relation chains carry little signal for inferring item similarity or user interest. Excessively deep layers can propagate too much noise, over-smooth entity representations, or over-squash information, so that distinguishing features from distant nodes become diluted or lost.
    • This suggests that for recommender systems using knowledge graphs, shallow aggregation (1 or 2 hops) is generally sufficient and more effective than very deep aggregation.
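
A quick back-of-the-envelope calculation makes this depth sensitivity plausible: with K sampled neighbors per entity, hop h of the receptive field contains K^h entities, so both the computation and the amount of distant (and potentially noisy) context grow exponentially with H.

```python
# Receptive-field size per item as a function of depth H, assuming K = 4 neighbors.
K = 4
for H in range(1, 5):
    total = sum(K**h for h in range(H + 1))    # 1 + K + K^2 + ... + K^H
    print(f"H={H}: {total} entities touched")  # 5, 21, 85, 341
```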

6.2.3. Impact of Dimension of Embedding (d)

The following are the results from [Table 5] of the original paper:

d 4 8 16 32 64 128
MovieLens-20M 0.968 0.970 0.975 0.977 0.973 0.972
Book-Crossing 0.709 0.732 0.733 0.735 0.739 0.736
Last.FM 0.789 0.793 0.797 0.793 0.790 0.789

Table 5: AUC result of KGCN with different dimension of embedding.

Observations:

  • Initial Boost: Increasing the dimension of embedding (d) generally boosts performance at first, since a larger d allows the model to encode more intricate and richer information about users and entities. For instance, MovieLens-20M improves from d = 4 up to d = 32, Book-Crossing improves consistently up to d = 64, and Last.FM peaks at d = 16.
  • Overfitting with Large d: However, a d that is too large can adversely affect performance due to overfitting. If the embedding dimension is too high relative to the amount of available data, the model might learn spurious patterns specific to the training set and fail to generalize. This is observed when MovieLens-20M's AUC slightly drops at d = 64 and d = 128, and Last.FM's AUC decreases after d = 16. Book-Crossing remains more robust at larger d, possibly because its greater sparsity requires more expressive power or because of the specific nature of its KG.
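
The overfitting argument can be made concrete with a rough parameter count: the user, entity, and relation embedding tables all scale linearly with d, so model capacity grows quickly even before the aggregator weights are counted. The magnitudes below are illustrative, not the exact dataset statistics.

```python
# Rough embedding parameter count as d grows (illustrative magnitudes only).
n_users, n_entities, n_relations = 100_000, 100_000, 30
for d in (16, 32, 64, 128):
    params = (n_users + n_entities + n_relations) * d
    print(f"d={d}: ~{params:,} embedding parameters")
```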

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Knowledge Graph Convolutional Networks (KGCN), an end-to-end framework designed to enhance recommender systems by effectively integrating knowledge graph (KG) information. KGCN addresses the critical sparsity and cold start problems of traditional collaborative filtering by leveraging KG entities and relations. Its core innovation lies in its user-specific and relation-aware neighborhood aggregation mechanism, which computes a biased weighted sum of neighbor representations. This allows KGCN to automatically capture both high-order structural information and rich semantic information from the KG, tailored to individual user preferences. The model's ability to extend its receptive field over multiple hops enables it to discover long-distance interests. Furthermore, the minibatch implementation ensures scalability for large datasets and KGs. Extensive experiments on movie, book, and music recommendation datasets demonstrate that KGCN consistently outperforms state-of-the-art baselines in CTR prediction and top-K recommendation, especially in sparse scenarios, validating its effectiveness and robustness.

7.2. Limitations & Future Work

The authors identify several avenues for future research:

  1. Non-uniform Neighbor Sampling: Currently, KGCN uses uniform sampling from an entity's neighbors to construct its receptive field, even though not all neighbors are equally important. Future work could explore non-uniform sampling strategies (e.g., importance sampling based on relation type, entity popularity, or learned weights) to select more informative neighbors, potentially leading to more effective and efficient aggregation; a possible form of such sampling is sketched after this list.
  2. Leveraging User-End KGs: The current KGCN (and most existing literature) primarily focuses on modeling item-end KGs (knowledge about items and their attributes). The paper suggests an interesting direction would be to investigate whether incorporating user-end KGs (knowledge about users, their demographics, social connections, preferences, etc.) could further improve recommendation performance.
  3. Combining User-End and Item-End KGs: A more advanced future direction is to design an algorithm that can effectively combine and jointly utilize KGs at both the user-end and item-end. This could create a holistic view of both users and items, potentially leading to more nuanced and accurate recommendations.
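
To illustrate item 1 above, one possible form of non-uniform sampling is importance sampling proportional to a per-neighbor weight, such as a relation-type prior or a learned attention score. This is a speculative sketch of the future-work idea, not part of KGCN.

```python
import numpy as np

def importance_sample(neighbors, weights, k, rng=None):
    """Sample k neighbors with probability proportional to their weights."""
    rng = rng or np.random.default_rng(0)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    idx = rng.choice(len(neighbors), size=k, replace=len(neighbors) < k, p=p)
    return [neighbors[i] for i in idx]

# Example: neighbor "n2" is three times as likely to be kept as "n0" or "n1"
print(importance_sample(["n0", "n1", "n2"], [1.0, 1.0, 3.0], k=2))
```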

7.3. Personal Insights & Critique

The KGCN paper presents a compelling solution for integrating knowledge graphs into recommender systems by intelligently adapting Graph Convolutional Networks. Its primary strength lies in the personalized and biased aggregation mechanism, which explicitly models how a user's preferences interact with different types of relations in the KG. This is a crucial distinction from generic GCNs and previous KG-aware methods, making the item representations truly user-specific.

One notable insight is the observation that shallow aggregation (1 or 2 layers, H = 1 or H = 2) performs best, with deeper layers leading to model collapse. This is an important practical finding, suggesting that for recommender systems, multi-hop information is valuable but long-distance information quickly becomes noisy or redundant. This could be due to over-smoothing (where node representations become indistinguishable after many layers) or the inherent nature of user interests, which might not meaningfully extend beyond a few hops in a KG. This finding could guide future GNN designs in RS to focus on more sophisticated, rather than just deeper, aggregation strategies.

The choice of inner product for both g (user-relation scoring) and f (final prediction), while simple and effective, could be an area for improvement. More complex neural network layers for these functions might capture more intricate non-linear interactions. Similarly, the uniform negative sampling could be enhanced with more advanced strategies (e.g., popularity-biased sampling or adversarial sampling) to better reflect real-world implicit feedback scenarios.
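
As an example of the kind of change suggested here, the final inner-product scorer f(u, v) could be replaced by a small MLP over the concatenated user and item representations. This is a speculative sketch with illustrative names and shapes, not the paper's method.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_score(u, v, W1, b1, w2, b2):
    """Predict a click probability from a (user, item) embedding pair via one hidden layer."""
    h = relu(W1 @ np.concatenate([u, v]) + b1)
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # sigmoid output

d, hidden = 16, 32
rng = np.random.default_rng(0)
u, v = rng.normal(size=d), rng.normal(size=d)
W1, b1 = rng.normal(size=(hidden, 2 * d)), np.zeros(hidden)
w2, b2 = rng.normal(size=hidden), 0.0
print(mlp_score(u, v, W1, b1, w2, b2))
```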

The methods and conclusions of KGCN could be broadly applied to other domains where graph-structured data (especially heterogeneous graphs) and personalized predictions are important. For instance:

  • Information Retrieval: Enhancing search results by considering personalized relevance of entities and their relations in a knowledge base.

  • Drug Discovery: Predicting drug-target interactions or drug side effects by leveraging knowledge graphs of chemical compounds, proteins, and biological pathways, with personalized models accounting for patient-specific factors.

  • Fraud Detection: Identifying fraudulent activities in complex transaction networks by modeling entity relationships and detecting unusual patterns based on specific user profiles.

A potential unverified assumption might be the quality and completeness of the knowledge graphs derived from Microsoft Satori. While Satori is a powerful resource, any inaccuracies or incompleteness in the KG could propagate errors through the GCN layers. The impact of KG noise on KGCN's performance could be an area for further investigation. Overall, KGCN represents a significant step forward in building more intelligent and personalized recommender systems by effectively harnessing the power of knowledge graphs through graph neural networks.
