A Multi-modal Large Language Model with Graph-of-Thought for Effective Recommendation
TL;DR Summary
The GollaRec model integrates a Multi-modal Large Language Model with Graph-of-Thought prompting to exploit user-item interaction graphs for effective recommendation, combining visual and textual data. It utilizes text-graph alignment and graph instruction tuning, and outperforms 12 existing models across multi-modal recommendation tasks.
Abstract
Chain-of-Thought (CoT) prompting has been shown to be effective in guiding Large Language Models (LLMs) to decompose complex tasks into multiple intermediate steps, and constructing a rational reasoning chain for inferring answers. However, the linear nature of CoT falls short of enabling LLMs to effectively handle graph structures, which are essential for personalised recommendation tasks that rely on user-item interaction graphs. To bridge this gap, we propose GollaRec, which leverages a new Graph-of-Thought (GoT) prompting technique in a Multi-modal LLM, namely LLaVA, to effectively exploit the complex structure of the interaction graphs. GollaRec enhances the recommendation effectiveness by integrating both visual and textual "thoughts" into a graph-structured prompt, using both item images and descriptions to produce richer multi-modal user/item representations. In our proposed approach, GollaRec leverages text-graph alignment and graph instruction tuning to allow the Multi-modal LLM to capture complex graph structures. In addition, GollaRec leverages a graph adaptor to integrate user-item interactions into the resulting user/item embeddings, therefore effectively adapting the model to the recommendation task. Our extensive experiments on 6 benchmark datasets demonstrate the superiority of our proposed GollaRec model over 12 existing state-of-the-art models in various multi-modal recommendation tasks, including general and multi-domain recommendation tasks.
In-depth Reading
1. Bibliographic Information
1.1. Title
A Multi-modal Large Language Model with Graph-of-Thought for Effective Recommendation
The title clearly states the paper's core contributions:
- It proposes a new model for recommendation.
- The model is based on a Multi-modal Large Language Model (MLLM), meaning it processes information from multiple sources, like text and images.
- Its key innovation is a technique called Graph-of-Thought (GoT), which is designed to handle graph-structured data.
1.2. Authors
- Zixuan Yi (University of Glasgow)
- Iadh Ounis (University of Glasgow)
Both authors are affiliated with the University of Glasgow. A quick search reveals that Prof. Iadh Ounis is a well-established professor in the field of Information Retrieval. Zixuan Yi appears to be a Ph.D. student under his supervision. Their research focuses on recommender systems, information retrieval, and the application of large language models.
1.3. Journal/Conference
The paper does not specify the publication venue. The provided publication date is 2025-01-01T00:00:00.000Z, which suggests this is a preprint or a paper submitted for future publication. Given the quality and scope of the work, it is likely intended for a top-tier conference in information retrieval (like SIGIR), data mining (like KDD, WSDM), or artificial intelligence (like AAAI, IJCAI).
1.4. Publication Year
2025 (as specified in the metadata).
1.5. Abstract
The abstract introduces the limitations of the linear Chain-of-Thought (CoT) prompting technique when applied to tasks requiring graph-structured data, such as recommender systems that rely on user-item interaction graphs. To overcome this, the authors propose GollaRec, a model that uses a novel Graph-of-Thought (GoT) prompting method within a Multi-modal Large Language Model (MLLM), specifically LLaVA. GollaRec is designed to exploit the complex structures of interaction graphs by integrating both visual (item images) and textual (item descriptions) information into a graph-structured prompt. The model employs text-graph alignment and graph instruction tuning to help the MLLM understand graph structures. It also uses a graph adaptor to integrate user-item interactions into the final embeddings. The authors demonstrate the superiority of GollaRec over 12 state-of-the-art models on 6 benchmark datasets for both general and multi-domain recommendation tasks.
1.6. Original Source Link
The provided link is /files/papers/6911d5d6b150195a0db74993/paper.pdf. This appears to be a local file path, indicating the paper is likely a preprint or an unpublished manuscript. Its official publication status is unknown.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: Standard Large Language Models (LLMs), while powerful, struggle with structured data like graphs. Their pre-training on unstructured text makes it difficult for them to interpret the complex, non-linear relationships inherent in graph data, which is fundamental to modern recommender systems (e.g., user-item interaction graphs).
- Existing Gaps: The Chain-of-Thought (CoT) prompting technique, which enhances LLM reasoning by guiding them through a series of intermediate linear steps, is ill-suited for graph-based tasks. The linear nature of CoT cannot capture the rich, relational structure of a graph. The authors term this limitation "insufficient graph mining." Furthermore, many LLM-based recommenders face constraints on input token length, which limits the amount of user history (interaction data) they can process.
- Innovative Idea: The paper's central idea is to create a new prompting technique, Graph-of-Thought (GoT), that explicitly structures the prompt to reflect the topology of a user-item graph. Instead of a linear chain of text, GoT presents reasoning steps that traverse the graph, considering item relationships and multi-modal features (text and images). This approach is designed to teach a Multi-modal Large Language Model (MLLM) to "think" in terms of graphs, thereby overcoming the "insufficient graph mining" problem.
2.2. Main Contributions / Findings
The paper presents the following key contributions:
- GollaRec Model: A novel recommendation framework, Graph-of-thought LLaVA for Recommendation (GollaRec), which integrates an MLLM (LLaVA) with a graph-based reasoning process.
- Graph-of-Thought (GoT) Prompting: The introduction of GoT, a new prompting technique tailored for recommendation. It structures the prompt to include reasoning steps over the user-item interaction graph, incorporating both visual and textual "thoughts" to generate richer representations. This is the first work to apply GoT to multi-modal recommendation.
- Graph Understanding Mechanisms: To enable the MLLM to comprehend graph structures, the paper proposes a text-graph alignment method to map graph node embeddings and text embeddings into a shared space, and a graph instruction tuning phase to teach the MLLM to associate graph tokens with their textual descriptions.
- Adaptive Graph Truncation: A technique to address the limited input token length of LLMs. It uses a pre-trained recommender to select the most relevant items from a user's history, ensuring that the GoT prompt is both informative and fits within the token limit.
- Superior Performance: Extensive experiments on six benchmark datasets show that GollaRec significantly outperforms 12 state-of-the-art baselines in both general and multi-domain recommendation tasks, validating the effectiveness of the GoT approach.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Large Language Models (LLMs): LLMs are deep learning models, typically based on the Transformer architecture, trained on massive amounts of text data. They excel at understanding and generating human-like text. Examples include GPT-3, LLaMA, and Vicuna. In this paper, an LLM serves as the core reasoning engine.
- Multi-modal Large Language Models (MLLMs): MLLMs are an extension of LLMs that can process and understand information from multiple modalities, not just text. They typically combine a pre-trained LLM with encoders for other modalities, such as a vision encoder for images. LLaVA (Large Language and Vision Assistant), used in this paper, is a prominent MLLM. It connects a vision encoder (like CLIP's) to an LLM, allowing it to understand and reason about images in a conversational context.
- Recommender Systems: These are systems designed to predict a user's interest in an item and suggest relevant items. Traditional methods often use collaborative filtering (finding users with similar tastes) or content-based filtering (recommending items similar to what a user liked before). Modern systems often use deep learning and model the problem using a user-item interaction graph, where users and items are nodes, and an interaction (like a click or purchase) is an edge.
- Graph Neural Networks (GNNs): GNNs are a class of neural networks designed to work directly with graph-structured data. They operate on the principle of message passing, where nodes iteratively aggregate information from their neighbors. This allows them to learn "embeddings" (vector representations) of nodes that capture both their own features and their structural role in the graph.
LightGCN, used as a component and baseline in this paper, is a popular and efficient GNN model for recommendation.
- Prompting: In the context of LLMs, a prompt is the input text given to the model to elicit a desired output. Prompt engineering is the art of designing effective prompts.
- Chain-of-Thought (CoT) Prompting: CoT is an advanced prompting technique where the LLM is given examples that include not just a question and an answer, but also the intermediate reasoning steps to get to the answer. By seeing these "thought processes," the LLM learns to break down complex problems into a sequence of simpler, logical steps, significantly improving its reasoning ability. The key characteristic of CoT is its linear, step-by-step nature.
- Contrastive Learning: A machine learning paradigm where the goal is to learn an embedding space in which similar (positive) data samples are pulled closer together and dissimilar (negative) samples are pushed farther apart. This is often achieved using a contrastive loss function. This paper uses it for
text-image alignment (aligning an item's description with its image) and text-graph alignment (aligning an item's description with its node representation in the graph).
3.2. Previous Works
- LightGCN (He et al., 2020): This is a simplified GNN model for recommendation. Its core idea is that the most important part of a GNN for collaborative filtering is the neighborhood aggregation, not the feature transformations or non-linearities. It learns user and item embeddings by linearly propagating them on the user-item interaction graph. The propagation rule for a user $u$ at layer $k$ is: $ \mathbf{e}_u^{(k+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|} \sqrt{|\mathcal{N}_i|}} \mathbf{e}_i^{(k)} $ and for an item $i$: $ \mathbf{e}_i^{(k+1)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|} \sqrt{|\mathcal{N}_u|}} \mathbf{e}_u^{(k)} $ where $\mathbf{e}_u^{(k)}$ and $\mathbf{e}_i^{(k)}$ are the embeddings of user $u$ and item $i$ at layer $k$, and $\mathcal{N}_u$ and $\mathcal{N}_i$ are the sets of neighbors for user $u$ and item $i$, respectively. The final embedding is a combination of embeddings from all layers. GollaRec uses LightGCN as a graph adaptor; a minimal propagation sketch is given after this list.
- LLaVA (Liu et al., 2024): A pioneering MLLM that connects a pre-trained vision encoder (CLIP) and a large language model (Vicuna). A simple projection matrix is trained to align the visual features with the word embedding space of the LLM. This allows the LLM to process and reason about both text and images seamlessly. GollaRec is built on top of LLaVA.
- Chain-of-Thought (CoT) (Wei et al., 2022): As described above, CoT enhances LLM reasoning by providing step-by-step demonstrations in the prompt. For example, to solve a math problem, the prompt would show the LLM how to solve a similar problem by breaking it down into logical steps. GollaRec argues that CoT's linear structure is a limitation for graph tasks.
- P5 (Geng et al., 2022): An LLM-based recommender that unifies various recommendation tasks (e.g., rating prediction, sequential recommendation, explanation generation) into a single text-to-text format using personalized prompts. It leverages the T5 pre-trained model. This represents a paradigm where recommendation is treated as a natural language processing problem.
- GraphGPT (Tang et al., 2024): A model that aims to teach an LLM to understand graphs. It introduces "graph tokens" that represent nodes and their connections. It uses graph instruction tuning to align these graph tokens with their corresponding textual descriptions, enabling the LLM to process graph-structured queries. GollaRec adapts this idea for its text-graph alignment phase.
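To make the LightGCN propagation above concrete, here is a minimal NumPy sketch of its linear neighborhood aggregation with layer averaging. The function and variable names (`lightgcn_propagate`, `adj_norm`) are illustrative assumptions, not taken from the paper or any released code.

```python
import numpy as np

def lightgcn_propagate(adj_norm, user_emb, item_emb, num_layers=3):
    """Sketch of LightGCN propagation: linear neighborhood aggregation,
    no feature transformation or non-linearity, with layer averaging.

    adj_norm: symmetrically normalized (n_users + n_items) x (n_users + n_items)
              adjacency matrix of the user-item bipartite graph, D^{-1/2} A D^{-1/2}.
    """
    n_users = user_emb.shape[0]
    emb = np.concatenate([user_emb, item_emb], axis=0)  # e^(0)
    layers = [emb]
    for _ in range(num_layers):
        emb = adj_norm @ emb          # e^(k+1): normalized sum over neighbors of e^(k)
        layers.append(emb)
    final = np.mean(layers, axis=0)   # layer combination with alpha_k = 1 / (K + 1)
    return final[:n_users], final[n_users:]
```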
3.3. Technological Evolution
- Early Recommenders: Primarily used matrix factorization and collaborative filtering.
- Deep Learning Era: Introduced neural networks to model complex user-item interactions (e.g., NeuMF).
- GNN-based Recommenders: Began using GNNs like LightGCN to explicitly model the graph structure of user-item interactions, achieving state-of-the-art performance.
- LLM-based Recommenders: Started leveraging the world knowledge and reasoning capabilities of LLMs. Models like P5 and LMRecSys framed recommendation as a language task, using prompts to guide the LLM. However, they often struggled to incorporate the full interaction graph.
- MLLM-based Recommenders: Extended the LLM approach to include multi-modal data (images, etc.), but often without deep integration of graph structures.
- GollaRec (This Paper): This work sits at the intersection of MLLMs and GNNs. It pushes the boundary by not just using an MLLM, but teaching it to reason directly on the graph structure through the novel GoT prompting technique, combining the strengths of graph-based models and large language models.
3.4. Differentiation Analysis
- vs. CoT-based methods: The key difference is the structure of the "thought" process. CoT is linear and sequential, suitable for textual reasoning. GoT is non-linear and graph-structured, designed to mimic reasoning over a graph. It allows the model to consider multiple paths and relationships simultaneously, which is more natural for recommendation (e.g., a user's interest might be influenced by multiple distinct clusters of items they've interacted with).
- vs. Other LLM-based Recommenders (e.g., P5): While models like P5 convert user history into a flattened text sequence, they lose the explicit graph topology. GollaRec, through GoT and text-graph alignment, aims to make the MLLM directly aware of the graph's structure, not just a linearized version of it.
- vs. GNN-based Recommenders (e.g., LightGCN): GNNs are powerful at capturing collaborative signals from the graph but lack the rich semantic understanding and world knowledge of LLMs. GollaRec combines the best of both worlds: it uses the MLLM for semantic understanding of multi-modal item content and high-level reasoning, and a graph adaptor (LightGCN) to refine embeddings with collaborative filtering signals from the interaction graph.
4. Methodology
4.1. Principles
The core principle of GollaRec is to empower a Multi-modal Large Language Model (MLLM) to perform recommendation by teaching it to "think" in a graph-like manner. Traditional LLM prompting (like CoT) is linear, which fails to capture the complex, non-linear relationships in a user-item interaction graph. GollaRec's proposed Graph-of-Thought (GoT) technique addresses this by structuring the prompt to reflect a reasoning process that explores the user's historical interactions and their connections to candidate items on the graph. This is augmented with multi-modal information (item images and descriptions) to create rich, semantically-aware representations. The model's architecture is a synergistic combination of an MLLM for high-level reasoning and a graph adaptor for low-level collaborative signal propagation.
The overall architecture of the GollaRec model is shown in the figure below.
Figure: A schematic overview of the GollaRec workflow for multi-modal recommendation, showing the graph-structured GoT prompts, the visual encoder (CLIP), and the graph adaptor that combines user and item embeddings to produce the top-k recommended items for the target user, illustrating the user-item interactions.
4.2. Core Methodology In-depth (Layer by Layer)
GollaRec is designed to tackle two top-$k$ multi-modal recommendation tasks: a general recommendation task and a multi-domain recommendation task. The methodology is broken down into the GoT prompting technique and the overall model architecture.
4.2.1. Graph-of-Thought (GoT)
The GoT prompt is the central innovation. It aims to provide a structured rationale for the MLLM to follow, addressing three key challenges (C1, C2, C3). The figure below provides a conceptual comparison between CoT and the proposed GoT.
Figure: A schematic comparison of the Chain-of-Thought (CoT) and Graph-of-Thought (GoT) prompting techniques for product recommendation. The left side depicts the CoT process, which considers the attributes of historical items and outputs the user's preferences; the right side depicts the GoT process, which additionally incorporates the user-item graph and item images when analysing the recommendation, further enhancing its effectiveness. Both feed into the final recommendation ranking.
C1. Adaptive Graph Truncation (Challenge: Limited Token Length)
LLMs have a fixed input token limit (e.g., 2048 for Vicuna-7B). In an MLLM like LLaVA, a significant portion of this is reserved for visual tokens (576 tokens), leaving even less space for text. It's impossible to fit a user's entire interaction history into the prompt.
To solve this, GollaRec uses an adaptive graph truncation strategy:
- A pre-trained recommender model (e.g., LightGCN) is used to generate an initial ranked list of candidate items for the target user. This list prioritizes items that the user is most likely to be interested in.
- The model then iteratively adds the textual descriptions of these high-potential items to the prompt.
- It keeps a running count of the total tokens used (including the reserved visual tokens) and stops adding items once the total count approaches the MLLM's maximum token limit.

This ensures the prompt is populated with the most informative items from the user's history/potential future interactions, maximizing the utility of the limited context window. The process is formally described in Algorithm 1; a simplified sketch is given below.
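The following is a minimal sketch of that truncation loop, assuming a hypothetical `count_tokens` tokenizer helper and a candidate list already ranked by the pre-trained recommender. The token budget mirrors the numbers quoted above (2048-token context, 576 visual tokens), while the reserve for the prompt template itself is an assumption.

```python
def truncate_history(candidate_items, item_texts, count_tokens,
                     max_tokens=2048, visual_tokens=576, template_reserve=200):
    """Sketch of adaptive graph truncation (in the spirit of Algorithm 1).

    candidate_items: item ids ranked by a pre-trained recommender (e.g. LightGCN),
                     most relevant first.
    item_texts:      mapping from item id to its textual description.
    count_tokens:    tokenizer-dependent function returning the token count of a string.
    """
    budget = max_tokens - visual_tokens - template_reserve  # room left for item texts
    selected, used = [], 0
    for item in candidate_items:
        cost = count_tokens(item_texts[item])
        if used + cost > budget:
            break                                            # stop once the context window is full
        selected.append(item)
        used += cost
    return selected
```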
The following figure illustrates this adaptive truncation and how the selected items are integrated into the GoT prompt.
Figure: A schematic of the adaptive graph truncation and text-graph alignment process for a target user. The left side shows the user's preferred items, and the right side shows the generation of node embeddings and text embeddings. Through the GoT prompt, the model effectively integrates user-item interaction information.
C2. Text-image Alignment (Challenge: Relating Images and Text)
To ensure the MLLM can form a coherent understanding of an item from its image and description, GollaRec employs a contrastive pre-training step.
- The MLLM (LLaVA) is fine-tuned on all available item image-text pairs from the dataset.
- The goal is to learn a joint embedding space where the visual embedding of an item is close to its textual embedding, and far from the embeddings of other items.
- This is achieved by minimizing the Image-Text Contrastive (ITC) loss. In its standard InfoNCE-style form, with each item's image-description pair as the positive pair and the other descriptions in the batch acting as negatives, it can be written as:
  $ \mathcal{L}_{ITC} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(\mathbf{v}_i, \mathbf{t}_i)/\tau)}{\sum_{j=1}^{B} \exp(\mathrm{sim}(\mathbf{v}_i, \mathbf{t}_j)/\tau)} $
  - $B$: The batch size.
  - $N$: The number of positive pairs in the batch.
  - $\mathbf{v}_i$: The visual embedding of the $i$-th item (the positive sample).
  - $\mathbf{t}_i$: The textual embedding of the $i$-th item (the positive sample).
  - $\mathbf{t}_j$ ($j \neq i$): The textual embedding of a different item, which serves as a negative sample for $\mathbf{v}_i$.
  - $\mathrm{sim}(\cdot, \cdot)$: A similarity function (e.g., cosine similarity), which the paper notes is used with cross-entropy.
  - $\tau$: A temperature parameter that controls the sharpness of the distribution. A smaller $\tau$ makes the model push negative samples further away.

This pre-training aligns the two modalities, enriching the GoT prompt with cross-modal contextual information.
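As a concrete reference, here is a minimal NumPy sketch of such an in-batch image-text contrastive loss; it is a generic InfoNCE formulation under the assumptions above, not the paper's exact implementation.

```python
import numpy as np

def itc_loss(visual_emb, text_emb, tau=0.07):
    """In-batch image-text contrastive (InfoNCE-style) loss sketch.

    visual_emb, text_emb: (B, d) arrays; row i of each is the matched
    (positive) pair for item i, and all other rows act as negatives.
    """
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / tau                              # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy with the diagonal as targets
```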
C3. Text-graph Alignment (Challenge: Understanding Graph Structure)
To make the LLM understand graph patterns, a two-stage process inspired by prior work is used:
- Text-graph grounding: This stage aims to align the embedding space of graph structures with the language space.
  - An item's description is fed into two separate encoders: a text encoder (BERT) to get a textual embedding $\mathbf{t}_i$, and a graph encoder (Graph Transformer) to get a graph node embedding $\mathbf{h}_i$.
  - A contrastive loss, similar in principle to the ITC loss, is used to pull the matching pair $(\mathbf{t}_i, \mathbf{h}_i)$ together and push non-matching pairs apart. This forces the graph encoder to produce embeddings that are semantically aligned with the natural language descriptions of the nodes.
- Graph instruction tuning: After grounding, the pre-trained graph encoder is used to "translate" graph structures into a format the MLLM can process.
  - The node embeddings $\mathbf{h}_i$ from the graph encoder are projected into graph tokens using a Multi-Layer Perceptron (MLP): $\mathbf{g}_i = \mathrm{MLP}(\mathbf{h}_i)$. These tokens are special embeddings that represent graph structure.
  - The MLLM is then trained on a graph matching task. It is given an instruction containing a sequence of graph tokens (e.g., a central node and its neighbors: <graph_start>, <graph_token>1, ..., <graph_token>l, <graph_end>) and a shuffled list of the corresponding textual descriptions.
  - The MLLM's task is to correctly match the graph tokens to their language descriptions. The training objective is to maximize the probability of generating the correct sequence of text tokens, optimized via a standard cross-entropy loss. The probability of generating the target output $X_O$ is given by:
  $ p(X_O \mid X_G, X_I) = \prod_{t=1}^{L} p_{\theta}(x_t \mid X_G, X_I, x_{<t}) $
  - $X_G$: The sequence of input graph tokens.
  - $X_I$: The embeddings of the instruction's text.
  - $X_O$: The target output sequence of language tokens, of length $L$.
  - $x_t$: The $t$-th token in the target sequence.
  - $p_{\theta}(x_t \mid X_G, X_I, x_{<t})$: The probability of generating token $x_t$ given the inputs and previously generated tokens, parameterized by the learnable MLLM parameters $\theta$.

This tuning phase explicitly teaches the MLLM to interpret the meaning of the graph tokens, enabling it to understand the graph patterns presented in the GoT prompt. A small sketch of the graph-token projection and the matching instruction is given below.
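The sketch below shows a two-layer MLP projection from graph-encoder node embeddings to "graph tokens" in the LLM's embedding space, plus a text-only mock-up of a graph-matching instruction. All names (`make_graph_tokens`, the weight arguments, the instruction wording) are illustrative assumptions, not the paper's code or prompt text.

```python
import random
import numpy as np

def make_graph_tokens(node_emb, w1, b1, w2, b2):
    """Project graph-encoder node embeddings (n_nodes, d_graph) into graph
    tokens (n_nodes, d_llm) via a two-layer MLP with a ReLU in between."""
    hidden = np.maximum(node_emb @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def graph_matching_instruction(node_ids, descriptions):
    """Mock graph-matching instruction: graph-token placeholders for a center
    node and its neighbors, plus shuffled textual descriptions to pair up."""
    tokens = " ".join(f"<graph_token>{i}" for i in node_ids)
    shuffled = list(descriptions)
    random.shuffle(shuffled)
    return (f"<graph_start> {tokens} <graph_end>\n"
            f"Shuffled descriptions: {shuffled}\n"
            "Task: match each graph token to its description.")
```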
4.2.2. GollaRec Model Architecture
The full GollaRec model combines the GoT-prompted MLLM with a graph adaptor.
- Input: The model takes textual inputs (item descriptions for the GoT prompt) and visual inputs (item images).
- GoT-Prompted MLLM: During inference, the constructed GoT prompt (containing the task description, step-by-step reasoning examples, and the adaptively truncated list of relevant items) is fed into the LLaVA model. The MLLM processes this multi-modal prompt and outputs embeddings from its final layer. These embeddings are semantically rich, capturing both content information and the high-level reasoning from the GoT structure.
- Graph Adaptor: The embeddings from the MLLM are used as initial node embeddings for a graph adaptor, which is a LightGCN model. This adaptor then performs neighborhood aggregation on the user-item interaction graph. This step refines the initial semantic embeddings by propagating collaborative filtering signals, effectively integrating the user's specific interaction patterns. The paper presents a garbled formula for this step. The correct LightGCN rule for obtaining the final user and item embeddings (as a combination of embeddings from all layers) is: $ \mathbf{e}_u = \sum_{k=0}^{K} \alpha_k \mathbf{e}_u^{(k)} \quad \text{and} \quad \mathbf{e}_i = \sum_{k=0}^{K} \alpha_k \mathbf{e}_i^{(k)} $ where $\mathbf{e}_u^{(k)}$ and $\mathbf{e}_i^{(k)}$ are the embeddings at the $k$-th propagation layer and $\alpha_k$ are layer-combination weights (often set to $\frac{1}{K+1}$). The propagation from layer $k$ to $k+1$ is what was described in Section 3.2. This step is crucial for personalization.
- Output: The final user and item embeddings from the graph adaptor are used to compute recommendation scores (e.g., via dot product), and the top-$k$ items are ranked for the user. A minimal scoring sketch is shown below.
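This sketch assumes dense user/item embedding matrices from the adaptor and an optional boolean mask of already-seen interactions; the function and argument names are illustrative.

```python
import numpy as np

def recommend_top_k(user_emb, item_emb, k=20, seen_mask=None):
    """Score every item for every user by dot product of the final adaptor
    embeddings and return the indices of the top-k items per user."""
    scores = user_emb @ item_emb.T                        # (n_users, n_items)
    if seen_mask is not None:
        scores = np.where(seen_mask, -np.inf, scores)     # exclude training interactions
    return np.argsort(-scores, axis=1)[:, :k]             # highest-scoring k items per user
```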
5. Experimental Setup
5.1. Datasets
The experiments were conducted on six benchmark datasets to evaluate performance in two scenarios.
- General Recommendation:
  - Amazon Review (Clothing, Baby): Large-scale datasets from Amazon, containing user reviews, product metadata (descriptions, categories), and images. They are standard benchmarks for multi-modal recommendation.
  - HM: A fashion recommendation dataset with user transactions, detailed article descriptions, and images.
- Multi-domain Recommendation:
  - This task uses seven domains from the Amazon Review dataset.
  - Source Domains: Food, Home, Clothing, Office. The model is trained on data from these domains.
  - Target Domains: Pantry, Electronics, Sports. The model's performance is evaluated on these unseen domains to test its transferability.

These datasets were chosen because they are large, widely used, and contain the rich multi-modal (text and image) and interaction data necessary to validate GollaRec.
5.2. Evaluation Metrics
The top-$k$ recommendation performance was evaluated using two standard metrics, with $k$ set to 20.
- Recall@k:
  - Conceptual Definition: Recall@k measures the proportion of relevant items (that the user actually interacted with in the test set) that are found within the top-$k$ recommended items. It answers the question: "Out of all the items the user likes, how many did we manage to recommend in the top-k list?" It is a measure of coverage.
  - Mathematical Formula: $ \text{Recall}@k = \frac{|\text{RecommendedItems}@k \cap \text{GroundTruthItems}|}{|\text{GroundTruthItems}|} $
  - Symbol Explanation:
    - RecommendedItems@k: The set of top-$k$ items recommended by the model.
    - GroundTruthItems: The set of items in the test set that the user actually interacted with.
- Normalized Discounted Cumulative Gain (NDCG@k):
  - Conceptual Definition: NDCG@k is a measure of ranking quality. It evaluates how good the ordering of the recommended items is. It gives higher scores for relevant items that appear earlier in the top-$k$ list. It is "normalized" so that the perfect ranking has a score of 1.0, making it comparable across different users.
  - Mathematical Formula: $ \text{NDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k} $ where $ \text{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)} $
  - Symbol Explanation:
    - $rel_i$: A binary value indicating if the item at rank $i$ is relevant (1) or not (0).
    - $\text{DCG}@k$: Discounted Cumulative Gain, which accumulates the relevance scores, discounting them by their position.
    - $\text{IDCG}@k$: Ideal Discounted Cumulative Gain, which is the DCG score of a perfect ranking (i.e., all relevant items ranked at the top).

A minimal sketch of both metrics is given below.
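For reference, here is a small sketch of how both metrics can be computed for a single user with binary relevance; the function names are illustrative.

```python
import numpy as np

def recall_at_k(ranked_items, ground_truth, k=20):
    """Fraction of a user's test items that appear in the top-k list."""
    hits = len(set(ranked_items[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def ndcg_at_k(ranked_items, ground_truth, k=20):
    """Binary-relevance NDCG@k: DCG of the predicted ranking divided by the
    DCG of an ideal ranking that places all relevant items first."""
    gt = set(ground_truth)
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(ranked_items[:k]) if item in gt)
    ideal_hits = min(len(gt), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```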
5.3. Baselines
GollaRec was compared against 12 representative baselines from different categories:
- General Recommender: LightGCN (a strong GNN-based model).
- Multi-modal Recommenders: VBPR (classic matrix factorization with visual features), MMGCL (multi-modal graph contrastive learning), BM3 (a strong multi-modal model).
- MLLM Methods: CLIP (uses pre-trained image-text embeddings), BEiT-3 (a multi-modal foundation model), LLaVA (the base MLLM used by GollaRec, but without the GoT framework).
- Language-based Recommenders: P5 (frames recommendation as a text-to-text task), LMRecSys (uses item indices as text), TALLRec.
- Multi-domain Recommenders: MOME, PLE, MGFN (specialized models for cross-domain recommendation).

This diverse set of baselines ensures a comprehensive comparison, testing GollaRec against both traditional and modern LLM-based approaches.
6. Results & Analysis
6.1. Core Results Analysis (RQ1)
The main experiments compare GollaRec with the baselines on general and multi-domain recommendation tasks.
General Recommendation Task
The following are the results from Table 1 of the original paper:
| Methods | HM Recall@20 | HM NDCG@20 | Clothing Recall@20 | Clothing NDCG@20 | Baby Recall@20 | Baby NDCG@20 |
|---|---|---|---|---|---|---|
| LightGCN | 0.1254* | 0.0743* | 0.0553* | 0.0246* | 0.0714* | 0.0319* |
| VBPR | 0.1108* | 0.0717* | 0.0611* | 0.0277* | 0.0740* | 0.0329* |
| MMGCL | 0.1633* | 0.0964* | 0.0607* | 0.0277* | 0.0790* | 0.0352* |
| BM3 | 0.1711* | 0.0981* | 0.0797* | 0.0358* | 0.0863* | 0.0380* |
| CLIP | 0.0956* | 0.0687* | 0.0631* | 0.02δ * | 0.0664* | 0.0304* |
| BEiT-3 | 0.0874* | 0.0661* | 0.0617* | 0.0265* | 0.0688* | 0.0311* |
| LLaVA | 0.1346* | 0.0910* | 0.0702* | 0.0315* | 0.0674* | 0.0316* |
| P5 | 0.1417* | 0.0872* | 0.0766* | 0.0360* | 0.0825 | 0.0356* |
| LMRecSys | 0.1269* | 0.0801* | 0.0623* | 0.0322* | 0.0778* | 0.0322* |
| TALLREC | 0.1145* | 0.0782* | 0.0632* | 0.0335* | 0.0752* | 0.0313* |
| GollaRec-CoT | 0.1807 | 0.1039 | 0.0911* | 0.0404* | 0.0939* | 0.0410 |
| GollaRec | 0.1880 | 0.1064 | 0.0932 | 0.0423 | 0.0958 | 0.0425 |
- GollaRec's Superiority: GollaRec consistently achieves the best performance across all three datasets and both metrics. It shows an average improvement of 12.7% over the strongest baseline (BM3).
- Weakness of Vanilla MLLMs: CLIP, BEiT-3, and LLaVA perform poorly. This shows that simply using an MLLM without a proper recommendation-oriented framework (like GoT and a graph adaptor) is ineffective. They fail to capture the crucial collaborative filtering signals from user interactions.
- GoT vs. CoT: GollaRec significantly outperforms GollaRec-CoT, a variant that uses linear Chain-of-Thought instead of Graph-of-Thought. This is strong evidence that the graph-structured GoT prompt is more effective for mining user-item graphs.
Multi-domain Recommendation Task
The following are the results from Table 2 of the original paper:
| Methods | Pantry Recall@20 | Pantry NDCG@20 | Electronics Recall@20 | Electronics NDCG@20 | Sports Recall@20 | Sports NDCG@20 |
|---|---|---|---|---|---|---|
| VBPR | 0.0723* | 0.0326* | 0.0442* | 0.0196* | 0.0771* | 0.0349* |
| MMGCL | 0.0907* | 0.0377* | 0.0627* | 0.0304* | 0.0913* | 0.0428* |
| BM3 | 0.0932* | 0.0417* | 0.0638* | 0.0310* | 0.0970* | 0.0438* |
| CLIP | 0.0683* | 0.031δ* | 0.0461* | 0.0235* | 0.0727 | 0.0310* |
| BEiT-3 | 0.0596* | 0.0289* | 0.0481* | 0.0240* | 0.0748* | 0.0341* |
| LLaVA | 0.0659* | 0.0313* | 0.0604* | 0.0288* | 0.0709* | 0.0303* |
| MOME | 0.0797 | 0.0352* | 0.0573** | 0.0261* | 0.0749* | 0.0318* |
| PLE | 0.0862* | 0.0384* | 0.0595* | 0.0278* | 0.0866* | 0.0367* |
| MGFN | 0.0891* | 0.0413* | 0.0623* | 0.0305* | 0.0894* | 0.0383* |
| GollaRec (CoT) | 0.1183 | 0.0469* | 0.0655* | 0.0323* | 0.1046* | 0.0456 |
| GollaRec | 0.1213 | 0.0495 | 0.0681 | 0.0350 | 0.1112 | 0.0502 |
- GollaRec's Transferability: GollaRec again significantly outperforms all baselines, including specialized multi-domain models (MOME, PLE, MGFN). This demonstrates its ability to transfer knowledge learned from source domains to new, unseen target domains effectively.
- Value of Multi-modality: The multi-modal models (including GollaRec) generally outperform the multi-domain baselines. This suggests that the rich semantic knowledge from item images and text is a powerful asset for bridging the gap between different domains.
6.2. Ablation Studies / Parameter Analysis (RQ2)
An ablation study was conducted to dissect the contribution of each key component of GollaRec. The results are presented in Table 3.
The following are the results from Table 3 of the original paper:
| Variants | Clothing (General Rec) Recall@20 | Clothing (General Rec) NDCG@20 | Sports (Multi-domain Rec) Recall@20 | Sports (Multi-domain Rec) NDCG@20 |
|---|---|---|---|---|
| w/o GoT | 0.0885* | 0.0402* | 0.0982* | 0.0442* |
| w/o Adapter | 0.0821* | 0.0358* | 0.0848* | 0.0363* |
| w/o Text-image Alignment | 0.0868* | 0.0389* | 0.0941* | 0.0436* |
| w/o Text-graph Alignment | 0.0901* | 0.0402* | 0.1068 | 0.0468* |
| GollaRec | 0.0932 | 0.0423 | 0.1112 | 0.0502 |

(Note: The original paper's table reports 0.423 for GollaRec's NDCG@20 on Clothing, which is almost certainly a typo for 0.0423; the corrected value is shown above, consistent with Tables 1 and 4.)
- w/o GoT: Removing the GoT prompt and using a simpler task description leads to a significant performance drop. This confirms that the structured, graph-based reasoning provided by GoT is a critical contributor to the model's success.
- w/o Adapter: Removing the LightGCN adaptor causes a severe performance degradation. This highlights the necessity of incorporating collaborative filtering signals. The MLLM alone, even with GoT, cannot capture the fine-grained, personalized interaction patterns as effectively as a dedicated graph-based model.
- w/o Text-image Alignment: Performance drops, underscoring the importance of aligning visual and textual modalities. A coherent, unified understanding of items is crucial.
- w/o Text-graph Alignment: Removing this component also hurts performance. This validates the effectiveness of the grounding and instruction tuning phases in teaching the MLLM to interpret graph structures.
6.3. Embedding Visualization Analysis (RQ3)
The paper visualizes the item embeddings from GollaRec and the strongest baseline (BM3) using t-SNE to assess how well they integrate visual and textual information (Challenge C2).
The figure below (Figure 4 from the paper) shows these visualizations.
Figure: t-SNE visualizations of the visual and textual embedding distributions under different recommendation models. Top-left: BM3 on the Clothing dataset (mean squared error 12.2367); top-right: GollaRec on Clothing (MSE 1.6686); bottom-left: BM3 on the Sports dataset (MSE 5.8711); bottom-right: GollaRec on Sports (MSE 0.0120).
- BM3 Embeddings (Left Panels): For the BM3 model, the visual embeddings (stars) and textual embeddings (pentagons) for the same items are scattered and far apart. This is reflected in the high Mean Squared Error (MSE) values (12.23 for Clothing, 5.87 for Sports). This indicates that the model struggles to create a unified semantic space for the two modalities.
- GollaRec Embeddings (Right Panels): In contrast, GollaRec's embeddings for the visual and textual modalities of the same items are tightly clustered. The very low MSE values (1.66 for Clothing, 0.12 for Sports) quantitatively confirm this visual observation.
- Conclusion: GollaRec is far more effective at aligning visual and textual features into a coherent, unified semantic space. This superior multi-modal understanding contributes to its stronger recommendation performance.
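The per-model MSE figures quoted above can, in principle, be reproduced with something as simple as the following sketch, assuming the visual and textual embedding matrices are row-aligned per item (this is an assumed reading of how the reported MSE is computed).

```python
import numpy as np

def modality_alignment_mse(visual_emb, text_emb):
    """Mean squared distance between each item's visual and textual embedding.
    Lower values indicate tighter cross-modal alignment in the shared space."""
    return float(np.mean((visual_emb - text_emb) ** 2))
```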
6.4. GoT Structure Analysis (RQ4)
This analysis investigates the sensitivity of GollaRec to the structure of the GoT prompt, such as prompt length and the position of elements.
The following are the results from Table 4 of the original paper:
| Variants | Clothing (General Rec) Recall@20 | Clothing (General Rec) NDCG@20 | Sports (Multi-domain Rec) Recall@20 | Sports (Multi-domain Rec) NDCG@20 |
|---|---|---|---|---|
| - RandomDemonstrationPos | 0.0920† | 0.0424† | 0.1027 | 0.0465† |
| - RandomImagePos | 0.0941† | 0.0426† | 0.1061† | 0.0508† |
| - RandomLenTruncation (80%) | 0.0834 | 0.0366 | 0.0983 | 0.0440 |
| - RandomLenTruncation (60%) | 0.0807 | 0.0334 | 0.0960 | 0.0425 |
| GollaRec | 0.0932 | 0.0423 | 0.1112 | 0.0502 |
- Positional Robustness: Randomly shuffling the position of the demonstration steps (- RandomDemonstrationPos) or the image prompt (- RandomImagePos) within GoT has a negligible impact on performance. This suggests that the MLLM is robust to the exact ordering of these elements, as long as the core content is present.
- Impact of Length: Reducing the maximum token length available for the GoT prompt (to 80% and 60%) leads to a noticeable drop in performance. This indicates that the adaptive graph truncation method is effective and that providing a richer, more detailed GoT prompt with more high-potential items is beneficial. It validates the need to maximize the use of the context window.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces GollaRec, a novel MLLM-based framework for recommendation that addresses the critical limitation of existing LLMs in handling graph-structured data. The core innovation, Graph-of-Thought (GoT) prompting, enables an MLLM to reason over user-item interaction graphs by structuring the prompt to mimic graph traversal. GollaRec synergistically combines this high-level reasoning with multi-modal understanding (via text-image alignment), graph-structural awareness (via text-graph alignment), and collaborative signals (via a graph adaptor). Extensive experiments demonstrate that GollaRec significantly outperforms a wide array of state-of-the-art models in both general and multi-domain recommendation tasks. The work confirms that explicitly teaching LLMs to understand and reason with domain-specific structures like graphs is a highly effective strategy.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future directions:
- Manual Prompt Design: The demonstration steps within the GoT prompt were manually designed. Although they found no significant difference with auto-generated prompts from LLaMA3, they suggest that exploring more advanced and deterministic prompt generation strategies could yield further improvements.
- Initial Candidate Selection: The adaptive graph truncation relies on a pre-trained LightGCN to generate the initial item list. The authors suggest that exploring more advanced models or using similarity measures in the MLLM's hidden space could refine this process and potentially reduce bias from the initial model.
- Scope of Application: The paper focuses on top-$k$ recommendation. The authors propose extending GollaRec's application to conversational recommender systems, where its multi-modal and reasoning capabilities could be highly valuable, as a direction for future work.
7.3. Personal Insights & Critique
- Key Inspiration: The most inspiring aspect of this paper is the conceptual leap from Chain-of-Thought to Graph-of-Thought. It represents a powerful generalization of the "show, don't just tell" principle of prompting, adapting it from linear sequences to complex, non-linear data structures. This idea is highly transferable and could be applied to other domains where LLMs need to interact with structured knowledge, such as knowledge graphs, molecular structures, or social networks.
- Methodological Soundness: The paper is methodologically rigorous. The authors identify clear challenges (C1, C2, C3) and propose specific, well-motivated solutions for each. The comprehensive experimental setup, including strong baselines and detailed ablation studies, provides convincing evidence for their claims. The combination of a reasoning-focused MLLM and a collaboration-focused GNN adaptor is a clever way to get the best of both worlds.
- Potential Issues & Areas for Improvement:
  - Complexity and Efficiency: The GollaRec pipeline is quite complex, involving multiple stages: a pre-trained recommender for truncation, contrastive pre-training for text-image alignment, a two-step process for text-graph alignment, and finally the main model training. This complexity could be a significant barrier to practical implementation and scalability. The inference time, which involves a forward pass through a large MLLM, is also likely to be much higher than for traditional GNN models.
  - Dependency on Pre-trained Recommender: The reliance on LightGCN for candidate selection in the truncation step is a double-edged sword. While it makes the prompt more relevant, it also means that the MLLM's view of the graph is filtered and potentially biased by the LightGCN model. If LightGCN misses a potentially interesting but niche item, the MLLM will never have a chance to reason about it. A more end-to-end selection mechanism could be a valuable improvement.
  - GoT as a "Natural" Thought Process: The paper frames GoT as a more natural way for LLMs to reason about graphs. While the results are strong, it is an open question whether the MLLM is truly "thinking" in a graph-like manner or if it is simply learning to pattern-match the structured text in the GoT prompt. Further probing and interpretability studies would be needed to fully understand the internal mechanisms.

Overall, this paper presents a significant and well-executed contribution to the field of recommender systems, effectively bridging the gap between the structured world of graphs and the unstructured world of large language models. The Graph-of-Thought concept is a standout idea with broad potential.