MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation

Published:08/05/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

MultiRAG is a knowledge-guided framework designed to mitigate hallucination in multi-source retrieval-augmented generation. By constructing logical relationships with multi-source line graphs and a multi-level confidence mechanism, it effectively reduces challenges related to inf

Abstract

Retrieval Augmented Generation (RAG) has emerged as a promising solution to address hallucination issues in Large Language Models (LLMs). However, the integration of multiple retrieval sources, while potentially more informative, introduces new challenges that can paradoxically exacerbate hallucination problems. These challenges manifest primarily in two aspects: the sparse distribution of multi-source data that hinders the capture of logical relationships and the inherent inconsistencies among different sources that lead to information conflicts. To address these challenges, we propose MultiRAG, a novel framework designed to mitigate hallucination in multi-source retrieval-augmented generation through knowledge-guided approaches. Our framework introduces two key innovations: (1) a knowledge construction module that employs multi-source line graphs to efficiently aggregate logical relationships across different knowledge sources, effectively addressing the sparse data distribution issue; and (2) a sophisticated retrieval module that implements a multi-level confidence calculation mechanism, performing both graph-level and node-level assessments to identify and eliminate unreliable information nodes, thereby reducing hallucinations caused by inter-source inconsistencies. Extensive experiments on four multi-domain query datasets and two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the reliability and efficiency of knowledge retrieval in complex multi-source scenarios. Our code is available at https://github.com/wuwenlong123/MultiRAG.

1. Bibliographic Information

1.1. Title

MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation

1.2. Authors

Wenlong Wu (College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education)
Haofen Wang (College of Design & Innovation, Tongji University)
Bohan Li (College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education; Key Laboratory of Intelligent Decision and Digital Operation, Ministry of Industry and Information Technology; Collaborative Innovation Center of Novel Software Technology and Industrialization)
Peixuan Huang (College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education)
Xinzhe Zhao (College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education)
Lei Liang (Ant Group Knowledge Graph Team)

1.3. Journal/Conference

This paper is a preprint published on arXiv. The listing gives a publication timestamp of 2025-08-05T15:20:52.000Z (UTC), so it is a recent submission that is likely awaiting peer review or publication in a conference or journal. arXiv is a widely used platform for sharing research preprints in many scientific fields, including computer science and artificial intelligence.

1.4. Publication Year

2025 (Based on the publication timestamp: 2025-08-05T15:20:52.000Z)

1.5. Abstract

Retrieval Augmented Generation (RAG) is a promising method to reduce hallucination in Large Language Models (LLMs). However, using multiple retrieval sources introduces new challenges that can worsen hallucinations due to sparse data distribution hindering logical relationship capture and inconsistencies leading to information conflicts. To tackle these, the authors propose MultiRAG, a knowledge-guided framework. MultiRAG features a knowledge construction module that uses multi-source line graphs to aggregate logical relationships from diverse sources, addressing sparse data. It also includes a retrieval module with a multi-level confidence calculation mechanism that performs both graph-level and node-level assessments to identify and filter out unreliable information, thereby mitigating hallucinations from inter-source inconsistencies. Experiments on four multi-domain query datasets and two multi-hop QA datasets demonstrate that MultiRAG significantly improves the reliability and efficiency of knowledge retrieval in complex multi-source scenarios.

1.6. Original Source Link

https://arxiv.org/abs/2508.03553v1 This is the official arXiv preprint link.

2. Executive Summary

2.1. Background & Motivation

2.1.1. Core Problem

The core problem MultiRAG aims to solve is the exacerbation of hallucination issues in Retrieval Augmented Generation (RAG) systems when integrating knowledge from multiple retrieval sources. While multi-source retrieval theoretically offers more comprehensive information, it introduces specific challenges that can paradoxically increase the likelihood of LLMs generating incorrect or misleading responses.

2.1.2. Importance of the Problem

Large Language Models (LLMs) have shown remarkable capabilities in various Natural Language Processing (NLP) tasks. RAG has emerged as a standard solution to address the inherent knowledge limitations and hallucinations (generation of factually incorrect or nonsensical information) in LLMs by grounding their responses in external knowledge bases. However, real-world data is often diverse, fragmented, and stored across multiple heterogeneous sources. When RAG systems attempt to leverage this multi-source knowledge, they face significant hurdles:

Sparse Distribution of Multi-source Data: Data from different domains often uses varied storage formats (e.g., structured SQL tables, semi-structured JSON logs, unstructured text reports). This variability and sparsity make it difficult for RAG systems to effectively capture comprehensive logical relationships between knowledge elements residing in different sources, impacting retrieval recall and quality.
Inter-source Data Inconsistency: Diverse knowledge representations across multiple sources frequently lead to conflicting information. These discrepancies can introduce information conflicts during retrieval, compromising the accuracy of the LLM's generated response. This is particularly critical in complex reasoning tasks like multi-hop question answering and domain-specific applications where factual accuracy is paramount (e.g., finance, law).

Existing RAG frameworks and LLM-KG collaborative methods have improved knowledge retrieval and reduced hallucinations originating from the LLM's internal knowledge. However, they often fail to account for the complexities that arise when multiple, potentially inconsistent, and sparsely connected external data sources interact. Studies indicate that a significant portion of retrieved content may not directly answer a query but instead provide only indirectly related information, which can misguide LLMs. Addressing these multi-source data challenges is crucial for developing more robust, reliable, and trustworthy RAG systems.

2.1.3. Paper's Entry Point and Innovative Idea

The paper's entry point is to directly tackle the hallucination problem specific to multi-source data retrieval by introducing a knowledge-guided framework. The core innovative idea of MultiRAG is to leverage knowledge graph structures and confidence mechanisms to systematically aggregate disparate information and filter out unreliable data before it reaches the LLM. This is achieved through two main innovations:

Knowledge Construction Module with Multi-source Line Graphs: To overcome sparse data distribution, MultiRAG proposes building multi-source line graphs. This data structure efficiently aggregates logical relationships across different knowledge sources, providing a unified and denser representation of fragmented multi-source knowledge.
Multi-level Confidence Calculation Mechanism: To combat inter-source inconsistencies, a sophisticated retrieval module is introduced. This module performs graph-level and node-level confidence assessments. This hierarchical filtering process identifies and eliminates low-quality subgraphs and unreliable information nodes, ensuring that only trustworthy information is used for context augmentation, thereby reducing hallucinations.

2.2. Main Contributions / Findings

2.2.1. Primary Contributions

The paper's primary contributions are:

Multi-source Knowledge Aggregation (MKA): Introduction of multi-source line graphs in the knowledge construction module. This innovation allows for the rapid aggregation and reconstruction of knowledge structures from various query-relevant data sources, effectively capturing inter-source data dependencies within text chunks. This provides a unified and centralized representation of multi-source knowledge, addressing the issue of sparse data distribution.
Multi-level Confidence Calculation (MCC): Implementation of a sophisticated multi-level confidence calculation method in the retrieval module. This mechanism performs both graph-level and node-level confidence calculations on extracted knowledge subgraphs. Its purpose is to filter out and eliminate low-quality subgraphs and inconsistent retrieval nodes, thereby enhancing the quality of the context provided to the LLM and alleviating retrieval hallucinations.
Experimental Validation: Extensive experiments on four multi-domain query datasets (Movies, Books, Flights, Stocks) and two multi-hop QA datasets (HotpotQA, 2WikiMultiHopQA) demonstrate the robustness and accuracy of the proposed MultiRAG method.

2.2.2. Key Conclusions and Findings

The key conclusions and findings reached by the paper are:

MultiRAG significantly enhances the reliability and efficiency of knowledge retrieval in complex multi-source scenarios.
The MCC module (Multi-level Confidence Calculation) outperforms existing data fusion models and state-of-the-art (SOTA) data retrieval models in terms of F1 score on multi-source datasets, showing an improvement of over 10% in some cases, particularly on sparser datasets like Books and Stocks.
MultiRAG demonstrates strong robustness to both data sparsity (maintaining performance even with up to 70% relationship masking) and data inconsistency (showing minimal F1 score drops even with significant triple increments and shuffled relationship edges).
Ablation studies confirm the effectiveness of both the MKA module and the MCC module. MKA provides significant query acceleration (e.g., from computational infeasibility to 29.8s on the Flights dataset) and consistent accuracy improvements (e.g., 7.3-9.6% F1 increase). MCC is crucial for hallucination control, with its removal leading to drastic F1 degradation (20.1-33.2%).
The hierarchical nature of MCC (graph-level and node-level confidence) plays complementary roles: graph-level filtering ensures global consistency, while node-level verification addresses local credibility issues.
On multi-hop QA datasets (HotpotQA and 2WikiMultiHopQA), MultiRAG achieves higher Precision and Recall@5 scores, demonstrating its effectiveness in reducing hallucinations and enhancing the credibility of Q&A systems in complex settings.
The time costs for MultiRAG are acceptable, with the overhead mainly concentrated on the Multi-source Line Graph (MLG) construction, which is balanced by significant query acceleration.

3.1. Foundational Concepts

To understand MultiRAG, a reader needs to be familiar with the following core concepts:

3.1.1. Large Language Models (LLMs)

Large Language Models (LLMs) are advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. They are typically based on the transformer architecture and can perform a wide range of tasks, including text generation, summarization, translation, and question answering. Their ability to learn complex patterns in language allows them to exhibit impressive language understanding and generation capabilities. However, a known limitation is their tendency to hallucinate.

3.1.2. Hallucination in LLMs/RAG

Hallucination in the context of LLMs refers to the phenomenon where the model generates information that is factually incorrect, nonsensical, or unfaithful to the provided source content, despite being presented in a confident and fluent manner. In Retrieval Augmented Generation (RAG) systems, hallucinations can stem from:

Internal Knowledge Hallucination: The LLM generates false information based on its pre-trained internal knowledge.
Retrieval-Induced Hallucination: The retrieved documents contain irrelevant, misleading, or conflicting information, causing the LLM to generate incorrect responses. MultiRAG specifically targets this second type, particularly when multiple sources are involved.

3.1.3. Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a paradigm designed to enhance the factual accuracy and reduce hallucinations in LLMs. It combines a retriever component with a generator (LLM). When a user asks a query:

The retriever searches an external knowledge base (e.g., a database, a collection of documents) for relevant information.
The retrieved information (often called context or evidence) is then fed to the generator (LLM) along with the original query.
The LLM uses this context to generate a more informed and factually grounded response, reducing reliance on its potentially outdated or limited internal knowledge.
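
The three steps above can be sketched as a toy pipeline. This is only an illustration of the general retrieve-augment-generate pattern, not MultiRAG's implementation: the word-overlap retriever, the prompt layout, the `generate` callable (standing in for any LLM call), and all function names are assumptions made for this example.

```python
from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Combine the retrieved evidence and the query into one prompt."""
    evidence = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{evidence}\n\nQuestion: {query}\nAnswer:"


def rag_answer(query: str, corpus: List[str], generate: Callable[[str], str]) -> str:
    """Retrieve, augment, then generate. `generate` is a hypothetical LLM stub."""
    return generate(build_prompt(query, retrieve(query, corpus)))


corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
]
query = "What is the capital of France?"
top_docs = retrieve(query, corpus)
prompt = build_prompt(query, top_docs)
```

A real system would replace the overlap score with a dense or sparse retriever; the point here is only the order of operations: retrieve first, then pass the evidence to the generator together with the query.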

3.1.4. Knowledge Graphs (KGs)

Knowledge Graphs (KGs) are structured representations of knowledge that model entities (real-world objects, concepts, or events) and their relationships as a graph. In a KG:

Nodes (or Vertices): Represent entities (e.g., "Paris," "Eiffel Tower," "France").
Edges (or Relationships/Predicates): Represent the connections or relationships between entities (e.g., "Paris is the capital of France," "Eiffel Tower is located in Paris").
Triples: The fundamental unit of a KG is a triple in the format (subject, predicate, object) (e.g., (Paris, is the capital of, France)). KGs provide a rich, structured, and interpretable way to store and query factual knowledge, which is highly beneficial for RAG systems.
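
As a minimal sketch of the triple representation described above, the example below stores the two example facts as (subject, predicate, object) tuples and follows two consecutive edges. The index structure and the `two_hop` helper are illustrative assumptions, not part of MultiRAG.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

triples: List[Triple] = [
    ("Paris", "is the capital of", "France"),
    ("Eiffel Tower", "is located in", "Paris"),
]

# Index outgoing edges by subject so neighbours can be looked up directly.
outgoing: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
for s, p, o in triples:
    outgoing[s].append((p, o))


def two_hop(start: str) -> List[Tuple[str, str, str, str]]:
    """Follow two consecutive edges from `start`: (pred1, middle, pred2, end)."""
    paths = []
    for p1, mid in outgoing.get(start, []):
        for p2, end in outgoing.get(mid, []):
            paths.append((p1, mid, p2, end))
    return paths


paths = two_hop("Eiffel Tower")
```

Chaining edges this way is the basic operation behind multi-hop question answering over a KG: "In which country is the Eiffel Tower?" is answered by composing two triples.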

3.1.5. Multi-source Data

Multi-source data refers to information collected from various disparate origins or channels. In real-world scenarios, data might come from:

Structured Data: E.g., relational databases, CSV files (tabular format with well-defined schemas).
Semi-structured Data: E.g., JSON, XML files (hierarchical data with flexible schemas).
Unstructured Data: E.g., plain text documents, reports, web pages (free-form text without a strict schema). Integrating and reconciling multi-source data presents challenges like varying formats, inconsistencies, and redundancy.

3.1.6. JSON-LD

JSON-LD (JSON for Linking Data) is a lightweight Linked Data format that represents JSON data in a way that is understandable by machines. It allows JSON objects to be interpreted as RDF (Resource Description Framework) graphs, making it easier to publish and consume Linked Data. It uses @context to define terms and map them to IRIs (Internationalized Resource Identifiers), effectively providing a schema for the data and enabling interoperability. MultiRAG uses JSON-LD to store parsed data content.
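
To make the role of `@context` concrete, here is a small hand-written JSON-LD document built and round-tripped with Python's standard `json` module. The `urn:example:` identifiers and the choice of schema.org terms are assumptions for illustration; the source does not specify the schema MultiRAG actually uses.

```python
import json

# "@context" maps short keys to IRIs; "@id" gives the record a unique identifier.
doc = {
    "@context": {
        "name": "https://schema.org/name",
        "director": {"@id": "https://schema.org/director", "@type": "@id"},
    },
    "@id": "urn:example:movie:inception",          # hypothetical identifier
    "@type": "https://schema.org/Movie",
    "name": "Inception",
    "director": "urn:example:person:christopher-nolan",  # hypothetical identifier
}

serialized = json.dumps(doc, indent=2)
restored = json.loads(serialized)
```

Because every key resolves to an IRI, two sources that both use `https://schema.org/name` can be merged without guessing whether their "name" fields mean the same thing.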

3.1.7. Mutual Information Entropy

Mutual Information Entropy is a concept from information theory that quantifies the amount of information obtained about one random variable by observing another random variable. In simpler terms, it measures the statistical dependence or shared information between two variables.

Entropy ($H(X)$): Measures the uncertainty or randomness of a single random variable $X$: $ H(X) = - \sum_{x \in X} p(x) \log p(x) $, where $p(x)$ is the probability of outcome $x$ for variable $X$.
Mutual Information ($I(X;Y)$): Measures the reduction in uncertainty about $X$ given $Y$, or vice versa: $ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \left(\frac{p(x,y)}{p(x)p(y)}\right) $, where $p(x,y)$ is the joint probability distribution of $X$ and $Y$, and $p(x)$ and $p(y)$ are their marginal probability distributions. In MultiRAG, it is used to calculate the similarity between nodes in a graph, assessing how consistently their attributes align.
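
The two formulas can be checked numerically on small samples. The sketch below estimates probabilities from empirical counts and uses base-2 logarithms (so results are in bits); both choices are assumptions for this example, since the text above does not fix a logarithm base or an estimation method.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(xs: Sequence[Hashable]) -> float:
    """H(X) = -sum_x p(x) log2 p(x), with p estimated from counts."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


x = [0, 0, 1, 1]
y_same = [0, 0, 1, 1]   # fully determined by x
y_indep = [0, 1, 0, 1]  # statistically independent of x in this sample

h_x = entropy(x)
mi_same = mutual_information(x, y_same)
mi_indep = mutual_information(x, y_indep)
```

When `y_same` copies `x`, the mutual information equals the entropy of `x` (observing one removes all uncertainty about the other); when the two are independent, it is zero.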

3.2. Previous Works

MultiRAG builds upon and differentiates itself from several categories of previous work, primarily in Knowledge-Guided RAG, Heterogeneous Graph Fusion, and Hallucination Benchmarking.

3.2.1. Knowledge-Guided RAG and General RAG Baselines

Standard RAG [2]: The foundational approach that combines a retriever and a generator. Given a query, it retrieves relevant documents and feeds them to an LLM for generation.
- Relevance: MultiRAG is an enhancement of Standard RAG specifically for multi-source scenarios.
CoT [43]: Chain-of-Thought (CoT) prompting involves step-by-step reasoning by an LLM to arrive at a conclusion. It improves complex reasoning by breaking down problems.
- Relevance: Used as a baseline for comparison, often in conjunction with RAG. IRCoT further develops this.
IRCoT [44]: Interleaving Retrieval with Chain-of-Thought Reasoning. Refines the reasoning process through iterative retrieval, meaning the model might retrieve more information during its reasoning steps.
- Relevance: A more advanced RAG baseline that uses reasoning for better retrieval.
ChatKBQA [45]: A conversational interface-based method for knowledge base question answering. It often involves querying structured KBs.
- Relevance: A strong baseline for KB-driven QA. MultiRAG compares its robustness against ChatKBQA under data perturbation.
MDQA [46]: Multi-Document Question Answering methods are designed to extract answers from multiple documents effectively.
- Relevance: Relevant baseline as MultiRAG deals with multi-source (and thus implicitly multi-document) data.
RQ-RAG [47]: Refine Queries for Retrieval Augmented Generation. Integrates external documents and optimizes the query process to handle complex queries, often by iteratively refining the query based on initial retrieval results.
- Relevance: Addresses query optimization, a component of efficient RAG.
MetaRAG [9]: Employs meta-cognitive strategies to enhance the retrieval process, often by establishing knowledge association verification through meta-cognitive graph reasoning paths, improving self-correction in multi-hop QA.
- Relevance: Focuses on multi-hop QA and self-correction, similar to MultiRAG's goal of reducing hallucination in complex QA.
GraphCoT [48]: Leverages Graph Neural Networks (GNNs) to establish bidirectional connections between KGs and the latent space of LLMs. It aims to reduce factual inconsistencies.
- Relevance: Another graph-structured approach for hallucination mitigation, similar to MultiRAG's use of KGs.
HippoRAG [23]: Inspired by neurobiology, constructs offline memory graphs with a neural indexing mechanism to decrease retrieval latency.
- Relevance: Addresses efficiency in RAG, a goal also considered by MultiRAG.
ToG 2.0 [25]: Graph-Context Co-Retrieval Framework. Dynamically balances structured and unstructured evidence using knowledge-guided retrieval augmented generation.
- Relevance: Directly relevant as it deals with combining different types of evidence and aims to reduce hallucinations.
KAG [26]: Knowledge Augmented Generation. A framework for boosting LLMs in professional domains via knowledge-guided retrieval.
- Relevance: MultiRAG explicitly uses the OpenSPG framework which KAG is built upon for knowledge extraction.

3.2.2. Heterogeneous Graph Fusion

TruthFinder (TF) [37]: A classic iterative data fusion method that identifies trustworthy facts by considering source credibility and data agreement.
- Relevance: A fundamental baseline for multi-source data fusion.
LTM [42]: Probabilistic data fusion method that uses a Bayesian approach to discover truth from conflicting sources.
- Relevance: Another baseline for multi-source data fusion.
Triple Line Graph [31]: A transformation of a knowledge graph into a new graph whose nodes are triples, with an edge between two nodes whenever the corresponding triples share a common node. This addresses knowledge fragmentation by aggregating cross-domain relationships.
- Relevance: This work directly inspires MultiRAG's Multi-source Line Graph concept for knowledge aggregation.
FusionQuery [34]: An efficient on-demand fusion query framework over multi-source heterogeneous data. Enhances cross-domain retrieval precision by integrating heterogeneous graphs and computing dynamic credibility evaluations.
- Relevance: A strong SOTA baseline for multi-source heterogeneous data handling, which MultiRAG directly compares against.

3.2.3. Hallucination Benchmark and Confidence-Aware Computing

HaluEval [49]: A large-scale hallucination evaluation benchmark for LLMs, offering annotated samples across various error categories.
- Relevance: Provides a framework for understanding and classifying LLM hallucinations.
RefChecker [50]: Implements triple decomposition for fine-grained hallucination detection, improving precision over sentence-level methods.
- Relevance: Focuses on precise detection of factual errors, aligning with MultiRAG's goal of reducing them.
RAGTruth [51]: A hallucination corpus with detailed manual annotations, including word-level hallucination intensities, for developing trustworthy RAG models.
- Relevance: Provides resources for evaluating and improving RAG trustworthiness.

3.3. Technological Evolution

The field has evolved from basic LLMs to Retrieval Augmented Generation (RAG) to address hallucinations and knowledge limitations. Initially, RAG focused on single-source document retrieval. The complexity of real-world data quickly pushed research towards integrating Knowledge Graphs (KGs) with LLMs to leverage structured knowledge for better reasoning, credibility, and interpretability, especially for multi-hop QA. This led to methods like GraphCoT, MetaRAG, and ToG. Concurrently, the need to handle diverse multi-source heterogeneous data (structured, semi-structured, unstructured) emerged, giving rise to data fusion methods like TruthFinder and FusionQuery. The challenge then became how to manage the inconsistencies and sparsity inherent in such multi-source data to prevent retrieval-induced hallucinations.

MultiRAG fits into this timeline by bridging knowledge-guided RAG with advanced heterogeneous graph fusion techniques specifically tailored for multi-source hallucination mitigation. It addresses the dual problems of sparse data distribution and inter-source inconsistency that are particularly prevalent in complex, real-world multi-source RAG scenarios.

3.4. Differentiation Analysis

Compared to the main methods in related work, MultiRAG's core differences and innovations lie in its integrated, knowledge-guided approach to multi-source hallucination mitigation:

Unified Multi-source Knowledge Aggregation via Line Graphs:
- Differentiation: Unlike prior RAG systems that treat multiple sources as separate document pools (e.g., Standard RAG, MDQA) or rely on implicit connections, MultiRAG explicitly constructs multi-source line graphs (MLG). This MLG structure (inspired by Triple Line Graphs) provides a dense, unified representation by aggregating logical relationships across different knowledge sources. This directly tackles the sparse data distribution issue by making inter-source dependencies explicit.
- Innovation: This is a step beyond simple data fusion methods (like TruthFinder or LTM) which might focus on merging factual claims, or heterogeneous graph fusion (like FusionQuery) which aims for on-demand queries. MultiRAG's MLG proactively re-structures knowledge to enhance connectivity and enable more efficient and robust retrieval before confidence assessment.
Multi-level Confidence Calculation (MCC) for Inconsistency Resolution:
- Differentiation: While some RAG methods (e.g., MetaRAG, ToG 2.0) perform confidence checks or leverage meta-cognitive strategies for self-correction, MultiRAG introduces a hierarchical multi-level confidence calculation (graph-level and node-level). This dual-layered approach is specifically designed to handle the complexities of inter-source inconsistencies. Graph-level confidence assesses the overall credibility of an entire homologous subgraph, ensuring global consistency. Node-level confidence then scrutinizes individual nodes based on their consistency and their authority (both LLM-assessed and historical), filtering out local conflicts.
- Innovation: This systematic, two-stage filtering process is more granular and robust than single-tier confidence mechanisms. It adaptively filters conflicting subgraphs (via Graph-level Confidence Computing (GCC)) and unreliable individual data points (via Node-level Confidence Computing (NCC)), which is crucial for mitigating hallucinations specifically caused by contradictory information from diverse sources. This goes beyond general LLM-KG collaboration methods like ChatKBQA or GraphCoT which might not have such explicit, multi-layered mechanisms for evaluating multi-source data reliability.
  
In summary, MultiRAG's distinctiveness lies in its proactive knowledge re-structuring through MLGs to tackle sparsity and its refined, hierarchical confidence-aware filtering to address inter-source inconsistencies, making it particularly effective in complex, real-world multi-source RAG environments where hallucination mitigation is paramount.

4. Methodology

4.1. Principles

The core principle behind MultiRAG is to mitigate hallucinations in multi-source Retrieval Augmented Generation (RAG) by proactively organizing and vetting knowledge. The framework operates on two main intuitions:

Aggregating Disparate Knowledge: In a multi-source environment, knowledge is often fragmented and sparsely connected across different data formats and domains. By transforming this disparate information into a unified, denser graph structure (specifically, multi-source line graphs), the system can more effectively capture logical relationships and dependencies that would otherwise be missed. This aggregation enhances the completeness and coherence of the retrieved context.
Hierarchical Confidence-based Filtering: Multi-source data inherently carries the risk of inconsistencies and conflicts. Instead of simply retrieving all seemingly relevant information, MultiRAG introduces a multi-level confidence calculation mechanism. This mechanism acts like a two-stage filter: first, it assesses the overall reliability of a group of related knowledge (graph-level); then, it scrutinizes the credibility of individual data points within that group (node-level). This hierarchical approach ensures that only trustworthy and consistent information is passed to the Large Language Model (LLM), significantly reducing the chances of hallucination induced by unreliable or conflicting retrieval results.
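
The two-stage filter described above can be sketched as follows. This example only shows the control flow: confidence scores are taken as precomputed inputs, and the data classes, the thresholds of 0.5, and the function name are all hypothetical. How MultiRAG actually computes graph-level and node-level confidence is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    value: str
    confidence: float  # node-level score, assumed to be precomputed


@dataclass
class Subgraph:
    nodes: List[Node]
    confidence: float  # graph-level score, assumed to be precomputed


def two_stage_filter(
    subgraphs: List[Subgraph],
    graph_threshold: float = 0.5,  # hypothetical threshold
    node_threshold: float = 0.5,   # hypothetical threshold
) -> List[str]:
    """Stage 1 drops whole low-confidence subgraphs; stage 2 drops weak nodes."""
    kept: List[str] = []
    for sg in subgraphs:
        if sg.confidence < graph_threshold:
            continue  # stage 1: the entire subgraph is discarded
        kept.extend(n.value for n in sg.nodes if n.confidence >= node_threshold)
    return kept


candidates = [
    Subgraph([Node("a", 0.9), Node("b", 0.2)], confidence=0.8),
    Subgraph([Node("c", 0.95)], confidence=0.3),
]
trusted = two_stage_filter(candidates)
```

Note that node "c" is rejected despite its high node-level score because its subgraph fails the first stage, which is the sense in which the two levels play different roles.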

The theoretical basis combines knowledge graph representation for structuring heterogeneous data with information theory (mutual information entropy) for similarity and consistency assessment, along with probabilistic credibility models for authority evaluation. This allows MultiRAG to perform knowledge-guided retrieval that is both efficient and reliable.

4.2. Core Methodology In-depth (Layer by Layer)

The MultiRAG framework consists of three main modules: Multi-source Data Extraction, Multi-source Knowledge Aggregation, and Multi-level Confidence Calculation. These modules work synergistically to process a user query and generate a trustworthy answer. The overall framework is depicted in Fig. 3.

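Fig. 3: Framework of MultiRAG, including three modules. The schematic shows the MultiRAG workflow with its three main modules: multi-source data extraction, multi-source knowledge aggregation, and multi-level confidence calculation. It illustrates how different data chunks are extracted and converted into a knowledge graph. The confidence calculation part involves the evaluation of node consistency and historical confidence, and the formulas shown include the mutual information entropy calculation.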

The above figure (Fig. 3 from the original paper) illustrates the MultiRAG framework, detailing its three core modules and their interactions.

4.2.1. Multi-source Data Extraction

The initial step in MultiRAG involves acquiring and standardizing data from various heterogeneous sources.

4.2.1.1. Data Integration and Standardization

MultiRAG employs an adapter structure to integrate diverse multi-source data and transform it into a unified, normalized storage format. This is crucial for handling practical scenarios where data originates from various non-homologous formats.

Parsing Process: Unique adapters are designed for each distinct data format.
- Structured Data: Tabular information is parsed and stored in JSON format. Attribute variables are managed using a Decomposition Storage Model (DSM), allowing for the extraction of column indices for consistency checks and rapid retrieval.
- Semi-structured Data: Tree-shaped data with multi-layer nested structures (e.g., JSON, XML) is parsed and stored in JSON format. Since this data typically lacks column indices, tree or graph traversal algorithms (e.g., DFS) are used for efficient searching.
- Unstructured Data: Currently, the focus is on textual information, which is stored directly. Subsequent steps involve leveraging LLMs for entity and relationship extraction.
Normalization: File names, metadata, and data domains are parsed and categorized. The data content itself is stored in JSON-LD format, transforming it into linked data. Unique identifiers are assigned to the normalized data.
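
For the semi-structured case, a depth-first traversal over nested JSON might look like the sketch below. The function name, the path representation, and the sample record (including the flight id and gate values) are made up for illustration and are not taken from the paper.

```python
from typing import Any, Iterator, Tuple


def dfs_find(data: Any, key: str, path: Tuple = ()) -> Iterator[Tuple[Tuple, Any]]:
    """Depth-first search yielding (path, value) for every occurrence of `key`."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                yield path + (k,), v
            # Recurse into the value so nested occurrences are also found.
            yield from dfs_find(v, key, path + (k,))
    elif isinstance(data, list):
        for i, item in enumerate(data):
            yield from dfs_find(item, key, path + (i,))


# Hypothetical nested record with no column index to look up.
record = {"flight": {"id": "XX100", "legs": [{"gate": "B12"}, {"gate": "C3"}]}}
hits = list(dfs_find(record, "gate"))
```

Returning the full path alongside each value keeps enough provenance to tell the two `gate` entries apart, which matters once values from different sources are compared.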

The final integration of multi-source data can be expressed by the following formula: $ D_{\mathrm{Fusion}} = \bigcup_{i=1}^{n} A_i(D_i) $ Where:
$D_{\mathrm{Fusion}}$ represents the unified, normalized dataset after fusion.
$\bigcup$ denotes the union operation, combining data from all sources.
$n$ is the total number of distinct data sources.
$A_i$ represents the adapter parsing function for the $i$-th data source. The set of possible adapter functions is $A_i \in \{ Ada_{\mathrm{stru}}, Ada_{\mathrm{semi-s}}, Ada_{\mathrm{unstru}} \}$ for structured, semi-structured, and unstructured data, respectively.
$D_i$ represents the original dataset from the $i$-th source, where $D_i \in \{ D_{\mathrm{stru}}, D_{\mathrm{semi-s}}, D_{\mathrm{unstru}} \}$ corresponds to the original structured, semi-structured, or unstructured data.
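
The fusion formula can be read as "apply the matching adapter to each source, then take the union". A minimal sketch under that reading is shown below; the adapter functions, the `kind` labels, the record layout, and the use of the source index as an identifier are assumptions for illustration (the paper's adapters additionally emit JSON-LD and column indices, which are omitted here).

```python
import csv
import io
import json
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Adapter = Callable[[str], List[Record]]


def adapt_structured(raw: str) -> List[Record]:
    """Parse CSV text into one record per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(raw))]


def adapt_semi_structured(raw: str) -> List[Record]:
    """Parse a JSON document; wrap a single object in a list."""
    parsed = json.loads(raw)
    return parsed if isinstance(parsed, list) else [parsed]


def adapt_unstructured(raw: str) -> List[Record]:
    """Store free text as-is; entity extraction happens in a later step."""
    return [{"text": raw.strip()}]


ADAPTERS: Dict[str, Adapter] = {
    "structured": adapt_structured,
    "semi-structured": adapt_semi_structured,
    "unstructured": adapt_unstructured,
}


def fuse(sources: List[Tuple[str, str]]) -> List[Record]:
    """Union over i of A_i(D_i), tagging each record with its source index."""
    fused: List[Record] = []
    for i, (kind, raw) in enumerate(sources):
        for rec in ADAPTERS[kind](raw):
            fused.append({"source": i, "kind": kind, "data": rec})
    return fused


sources = [
    ("structured", "title,year\nInception,2010\n"),
    ("semi-structured", '{"title": "Inception", "director": "Christopher Nolan"}'),
    ("unstructured", "Inception is a 2010 film."),
]
fused = fuse(sources)
```

Keeping the source index on every record preserves provenance, which later confidence checks need in order to tell which source made which claim.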

4.2.1.2. Knowledge Extraction and Graph Construction

From the parsed data $D_{\mathrm{Fusion}}$, key information (entities and relationships) is extracted and linked to form a knowledge graph. This process utilizes the OpenSPG framework [26], [32], specifically its Custom Prompt module to integrate LLM-based knowledge extraction.

The knowledge construction involves three phases:

Entity Recognition: ner.py prompts are used. Relevant entity types are defined in a schema. The example.input and example.output within ner.py prompts are adjusted to guide the LLM-based SchemaFreeExtractor to accurately identify entities.
Relationship Extraction: triple.py prompts are crucial. Relationships are defined in the schema, and triple_prompt in the SchemaFreeExtractor is used. Instructions in triple.py ensure that extracted Subject-Predicate-Object (SPO) triples are related to entities in an entity_list, facilitating effective relationship extraction.
Attribute Extraction: std.py prompts (entity standardization) are employed. After entity recognition, std_prompt in the SchemaFreeExtractor standardizes entities and helps extract their attributes. example.input, example.named_entities, and example.output in std.py are modified to optimize attribute extraction for specific data characteristics.

The extracted knowledge base (KB) can be described as: $ KB = \sum_{D_i} \left( \{ e_1, e_2, \ldots, e_m \} \cup \{ r_1, r_2, \ldots, r_n \} \right) $ Where:
$KB$ represents the constructed knowledge base, comprising entities and relationships.
$D_i$ denotes an individual data source (or chunk thereof).
$\{ e _ { 1 } , e _ { 2 } , . . . , e _ { m } \}$ is the set of $m$ entities extracted from $D_i$ .
$\{ r _ { 1 } , r _ { 2 } , . . . , r _ { n } \}$ is the set of $n$ relationships extracted from $D_i$ .
$\sum$ signifies the aggregation of all extracted entities and relationships across all data sources $D_i$ .

4.2.2. Multi-source Knowledge Aggregation (MKA)

After preliminary knowledge extraction, the system proceeds to organize this knowledge into a Multi-source Line Graph (MLG) for efficient aggregation and subsequent processing.

4.2.2.1. Multi-source Line Graph (MLG)

The Multi-source Line Graph (MLG) is a key innovation derived from the concept of a triple line graph [31]. It is defined as follows:

Definition 2. Multi-source line graph [31]. Given a multi-source knowledge graph $\mathcal { G }$ and a transformed knowledge graph $\mathcal { G } ^ { \prime }$ (multi-source line graph, MLG), the MLG satisfies the following characteristics:

A node in $\mathcal{G}'$ represents a triple of $\mathcal{G}$.
There is an edge between two nodes in $\mathcal{G}'$ if and only if the triples represented by these two nodes share a common node (entity) in $\mathcal{G}$.

This definition implies that the MLG achieves high aggregation of related triples, thereby improving the efficiency of data retrieval and accelerating query algorithms. Fig. 4 provides a visual example of this transformation.

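Fig. 4: Example of multi-source line graph transformation. Schematic of the conversion from the original knowledge graph, which holds information from two different sources (left), to the multi-source line graph obtained by the triple line graph transformation (right).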

The above figure (Fig. 4 from the original paper) illustrates the transformation from an original knowledge graph to a multi-source line graph. On the left, two sources contain related triples. On the right, the multi-source line graph represents these triples as nodes, with edges connecting them if they share a common entity, effectively aggregating related information.
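
Definition 2 can be sketched in a few lines of Python. This is an illustrative construction under simple assumptions (triples as `(subject, predicate, object)` tuples, a pairwise comparison that is quadratic in the number of triples), not the authors' implementation; the function name `build_line_graph` and the toy triples are hypothetical.

```python
from itertools import combinations

Triple = tuple[str, str, str]  # (subject, predicate, object)

def build_line_graph(triples: list[Triple]) -> dict[Triple, set[Triple]]:
    """Each triple becomes a node; two nodes are adjacent iff they share an entity."""
    adjacency: dict[Triple, set[Triple]] = {t: set() for t in triples}
    for a, b in combinations(adjacency, 2):
        # Compare the entity sets {subject, object} of the two triples.
        if {a[0], a[2]} & {b[0], b[2]}:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

t1 = ("Inception", "directed_by", "Nolan")   # toy triple from source A
t2 = ("Nolan", "born_in", "London")          # toy triple from source B
t3 = ("Heat", "directed_by", "Mann")         # unrelated triple
line_graph = build_line_graph([t1, t2, t3])
```

Here `t1` and `t2` become neighbours because both mention the entity `Nolan`, while `t3` stays isolated. Related facts from different sources therefore end up one hop apart.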

4.2.2.2. Homologous Subgraph Matching

The next step is to identify multi-source homologous data groups and isolated points within the constructed knowledge graph $\mathcal{G}$ .

Definition 3. Multi-source homologous data. Any two nodes $v_1$ and $v_2$ in $\mathcal{G}$ are defined as multi-source homologous if and only if they belong to the same retrieval candidate set in a single search.
Definition 4. Homologous node and homologous subgraph. Given a set of multi-domain homologous data $SV = \{ v_i \}_{i=1}^{n}$ in the knowledge graph $\mathcal{G}$, the authors define:
- homologous center node as $snode = \{ name, meta, num, C(v) \}$, where $name$ is the common attribute name, $meta$ is identical file metadata, $num$ is the number of homologous data instances, and $C(v)$ is the data confidence.
- set of homologous nodes as $U_{snode}$ .
- set of homologous edges as $E_{snode}$ .
- The association edge between snode and $v_i$ is $e_i = \{ w_i \} _ { i = 1 } ^ { n }$ , where $w_i$ is the weight of node $v_i$ in the data confidence calculation.
- The homologous center node and $S\mathcal{G}$ together form the homologous subgraph $subS\mathcal{G}$ .
  
  The process of homologous subgraph matching involves:

Initializing an unvisited node set $\mathcal{U}_{\mathrm{unvisited}} = \mathcal{V}$ (all nodes in $\mathcal{G}$), an empty homologous data group set $\mathcal{SV}s = \emptyset$, and an empty isolated point set $\mathcal{LV}s = \emptyset$.
Traversing all nodes: For each node, it retrieves information from various domains.
Matching: If homologous data is matched, a homologous node $sg_i$ and its associated edge $e_i$ are constructed and added to the homologous node set $\mathcal{U}_{sg}$ and edge set $\mathcal{E}_{sg}$.
No Match: If no homologous data is found after a round of traversal for a node, that node is added to the isolated point set $\mathcal{LV}s$.
Aggregation: After traversal, $(\mathcal{U}_{sg}, \mathcal{E}_{sg})$ is added to $\mathcal{SV}s$. The node is then removed from $\mathcal{U}_{\mathrm{unvisited}}$.

The time complexity of this matching process is $O(n \log n)$, where $n$ is the number of nodes in the knowledge graph $\mathcal{G}$.
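
The matching procedure can be approximated by a simple grouping pass. In the sketch below, "homologous" is approximated as "same entity and same attribute name, reported by more than one node"; this grouping key, the `Node` class, and the dictionary layout of the center node are assumptions made for illustration, since the paper defines homology through the retrieval candidate set of a search. A hash-based grouping like this runs in roughly linear time, which is not necessarily how the paper arrives at its $O(n \log n)$ bound.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    entity: str
    attribute: str
    value: str
    source: str

def match_homologous(nodes: list[Node]) -> tuple[list[dict], list[Node]]:
    """Split nodes into homologous groups (with a center node) and isolated points."""
    buckets: dict[tuple[str, str], list[Node]] = defaultdict(list)
    for node in nodes:
        buckets[(node.entity, node.attribute)].append(node)

    groups, isolated = [], []
    for (entity, attribute), members in buckets.items():
        if len(members) > 1:
            # Center node in the spirit of snode = {name, meta, num, C(v)};
            # the confidence is left unset until it is computed later.
            groups.append({"name": f"{entity}.{attribute}",
                           "num": len(members),
                           "confidence": None,
                           "members": members})
        else:
            isolated.extend(members)
    return groups, isolated

nodes = [
    Node("CA981", "departure", "13:00", "s1"),
    Node("CA981", "departure", "13:05", "s2"),
    Node("CA981", "gate", "E12", "s1"),
]
groups, isolated = match_homologous(nodes)
```

The two conflicting departure times land in one group, which is exactly the situation the confidence computation is later asked to resolve.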

4.2.2.3. Homologous Triple Line Graph Construction

Definition 5. Homologous triple line graph. All homologous subgraphs within the knowledge graph $\mathcal{G}$ collectively constitute the homologous knowledge graph $S\mathcal{G}$. Applying the line graph transformation to the homologous knowledge graph yields the homologous triple line graph $S\mathcal{G}'$.

For each homologous subgraph in $\mathcal{SV}s$, a homologous linear knowledge subgraph $subS\mathcal{G}'_i$ is constructed using the homologous node set $\mathcal{U}_{sg}$ and homologous edge set $\mathcal{E}_{sg}$. All these $subS\mathcal{G}'_i$ and the isolated point set $\mathcal{LV}s$ are then aggregated to form the final homologous linear knowledge graph $S\mathcal{G}'$. Note that $S\mathcal{G}'$ is primarily used for consistency checks and retrieval queries of homologous data. Other types of queries still operate on the original knowledge graph $\mathcal{G}$.

4.2.3. Multi-level Confidence Computing (MCC)

This module addresses the problem of inter-source data inconsistency by calculating confidence scores at both the graph and node levels. This helps filter out unreliable information before it reaches the LLM.

4.2.3.1. Definition of Confidence

Definition 6. Candidate graph confidence and candidate node confidence. For a query $Q(q, \mathcal{G})$ on the knowledge graph $\mathcal{G}$, the corresponding homologous line graph $S\mathcal{G}'$ is obtained. The candidate graph confidence estimates the confidence in a candidate homologous subgraph, assessing the overall credibility of the candidate graph; the candidate node confidence assesses the confidence in an individual node, determining the credibility of a single attribute node.

4.2.3.2. Graph-Level Confidence Computing (GCC)

In the first stage, graph-level confidence $C ( \mathcal G )$ is calculated for each homologous line graph using a method based on mutual information entropy. The rationale is that if nodes with the same attributes within a homologous line graph have highly similar content, their confidence should be high, and vice versa.

The mutual information entropy $I(v_i, v_j)$ between two nodes $v_i, v_j \in \mathcal{N}(\mathcal{G})$ (where $\mathcal{N}(\mathcal{G})$ is the set of nodes in graph $\mathcal{G}$) with the same attributes measures the interdependence of their attribute content: $ I(v_i, v_j) = \sum_{x \in V_i} \sum_{y \in V_j} p(x, y) \log \left( \frac{p(x, y)}{p(x) p(y)} \right) $ Where:

$V_i$ and $V_j$ are the sets of attribute values for nodes $v_i$ and $v_j$, respectively.
$p(x, y)$ is the joint probability distribution of node $v_i$ having attribute value $x$ and node $v_j$ having attribute value $y$.
$p(x)$ and $p(y)$ are the marginal probability distributions of $x$ and $y$, respectively.

The similarity $S(v_i, v_j)$ is then defined as the normalized form of mutual information entropy, ensuring its value is between 0 and 1: $ S(v_i, v_j) = \frac{I(v_i, v_j)}{H(V_i) + H(V_j)} $ Where:
$H(V_i)$ and $H(V_j)$ are the entropies of the attribute value sets of nodes $v_i$ and $v_j$, calculated as: $ H(V) = -\sum_{x \in V} p(x) \log p(x) $ Where $p(x)$ is the probability of attribute value $x$ in the set $V$.

Finally, the confidence $C ( \mathcal G )$ of the homologous line graph $\mathcal { G }$ is the average similarity of all node pairs in the graph: $ C ( \mathcal { G } ) = \frac { 1 } { | \mathcal { N } ( \mathcal { G } ) | ^ { 2 } - | \mathcal { N } ( \mathcal { G } ) | } \sum _ { \substack { v _ { i } \in \mathcal { N } ( \mathcal { G } ) } } \sum _ { \substack { v _ { j } \in \mathcal { N } ( \mathcal { G } ) } } S ( v _ { i } , v _ { j } ) $ Where:

$| { \mathcal { N } } ( { \mathcal { G } } ) |$ denotes the number of nodes in the graph. A high $C(\mathcal{G})$ indicates strong attribute-level consistency among its constituent nodes.
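
The three formulas above can be traced end to end with a small Python sketch. The summary does not say how $p(x, y)$ is estimated, so the sketch assumes each node is given as a list of attribute values aligned position by position with the other nodes' lists, and estimates probabilities from co-occurrence counts. It also assumes the double sum runs over ordered pairs with $i \neq j$ (which matches the $|\mathcal{N}|^2 - |\mathcal{N}|$ normalizer) and returns 0 when both entropies are zero. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def entropy(values: list[str]) -> float:
    """H(V) = -sum p(x) log p(x), with p estimated from value counts."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mutual_information(xs: list[str], ys: list[str]) -> float:
    """I = sum p(x,y) log(p(x,y) / (p(x) p(y))) over observed value pairs."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def similarity(xs: list[str], ys: list[str]) -> float:
    """S = I / (H(V_i) + H(V_j)); defined as 0 when both entropies are 0 (assumption)."""
    denom = entropy(xs) + entropy(ys)
    return mutual_information(xs, ys) / denom if denom > 0 else 0.0

def graph_confidence(nodes: list[list[str]]) -> float:
    """C(G): average similarity over all ordered pairs of distinct nodes."""
    n = len(nodes)
    if n < 2:
        return 0.0
    total = sum(similarity(a, b) for a, b in permutations(nodes, 2))
    return total / (n * n - n)

agree_1 = ["a", "b", "a", "b"]
agree_2 = ["a", "b", "a", "b"]      # fully dependent on agree_1
unrelated = ["c", "c", "d", "d"]    # statistically independent of agree_1
c_graph = graph_confidence([agree_1, agree_2, unrelated])
```

One property worth noticing: because $I \leq \min(H(V_i), H(V_j))$, this normalization yields at most 0.5, reached when two nodes carry identical value sequences. The score stays within the stated 0 to 1 range but never uses its upper half.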

4.2.3.3. Node-Level Confidence Computing (NCC)

In the second stage, the confidence $C(v)$ of each individual node is calculated. It takes into account the node's consistency and its authority, the latter combining an LLM-assessed score with a historical score.

Node Consistency Score ( $S_n(v)$ ): This score reflects the consistency of the node across different data sources. It is based on the average similarity between node $v$ and other nodes $u$ that share the same attributes within its local neighborhood: $ S _ { n } ( v ) = { \frac { 1 } { | N ( v ) | } } \sum _ { u \in N ( v ) } S ( v , u ) $ Where:
- $N(v)$ is the set of nodes with the same attributes as node $v$.
- $S(v, u)$ is the similarity between nodes $v$ and $u$, as defined by the normalized mutual information entropy (Equation 5 in the paper, which corresponds to $S(v_i, v_j) = \frac{I(v_i, v_j)}{H(V_i) + H(V_j)}$).
Node Authority Score ( $A(v)$ ): This score reflects the importance and authenticity of the node, combining an LLM-assessed authority and a historical authority. It is calculated as a weighted sum: $ A(v) = \alpha \cdot Auth_{LLM}(v) + (1 - \alpha) \cdot Auth_{hist}(v) $ Where:
- $\alpha$ is a weight coefficient ( $0 \leq \alpha \leq 1$ ) that balances the contributions of LLM-assessed authority and historical authority.
- $Auth_{LLM}(v)$ is the LLM-assessed authority score.
- $Auth_{hist}(v)$ is the historical authority score.
- LLM-assessed Authority ( $Auth_{LLM}(v)$ ): Inspired by PTCA [33], this is assessed by an expert LLM based on the node's global influence and local connection strength. The LLM integrates entity association strength, entity type information, and multi-step path information to calculate knowledge credibility. The raw score is then passed through a sigmoid (logistic) function: $ Auth_{LLM}(v) = \frac{1}{1 + e^{-\beta \cdot C_{LLM}(v)}} $ Where:
  - $C_{LLM}(v)$ is the authority score provided by the LLM for node $v$.
  - $\beta$ is a parameter that controls the steepness of the scoring curve. (The paper describes $C_{LLM}(v)$ as "the average value of all nodes' $C_{LLM}(v)$", which seems to be a typo; it should presumably be the LLM's raw score for node $v$.)
- Historical Authority ( $Auth_{hist}(v)$ ): Inspired by Zhu's work [34], this is an authority score based on the node's historical data, incrementally estimated using the credibility of historical data sources and current query-related data: $ A u t h _ { h i s t } ( v ) = \frac { \mathcal { H } \cdot P r ^ { h } ( D ) + \sum _ { v _ { p } \in D _ { v } [ q ] } P r ( v _ { p } ) } { \mathcal { H } + | D a t a ( q , s u b S \mathcal { G } ^ { \prime } _ { i } ) | } $ Where:
  - $\mathcal { H }$ is the total number of entities provided by data source $D$ for all historical queries.
  - $P r ^ { h } ( D )$ is the historical credibility of data source $D$ .
  - $D_v[q]$ is the set of correct answers for query $q$ known from reliable sources.
  - $P r ( v _ { p } )$ is the probability or credibility of a specific correct answer $v_p$ from $D_v[q]$ .
  - $| D a t a ( q , s u b S \mathcal { G } _ { i } ^ { \prime } ) |$ is the number of query-related data points obtained from the multi-source line subgraph $subS\mathcal{G}'_i$ .
    
The final node confidence $C(v)$ is calculated as $C(v) = S_n(v) + A(v)$. This means the node's consistency score is added to its overall authority score.
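
The node-level formulas combine as in the following Python sketch. It is a direct transcription of the equations above, with hypothetical function names and made-up input values; the defaults $\alpha = 0.5$ and $\beta = 0.5$ are the settings reported in Section 5.3, and the raw LLM score `c_llm` is treated as an externally supplied number.

```python
import math

def node_consistency(similarities: list[float]) -> float:
    """S_n(v): mean similarity between v and the nodes in N(v)."""
    return sum(similarities) / len(similarities) if similarities else 0.0

def auth_llm(c_llm: float, beta: float = 0.5) -> float:
    """Auth_LLM(v): logistic squashing of the raw LLM score."""
    return 1.0 / (1.0 + math.exp(-beta * c_llm))

def auth_hist(h: float, pr_hist: float,
              answer_probs: list[float], n_query_data: int) -> float:
    """Auth_hist(v) = (H * Pr^h(D) + sum Pr(v_p)) / (H + |Data(q, subSG'_i)|)."""
    return (h * pr_hist + sum(answer_probs)) / (h + n_query_data)

def node_confidence(similarities: list[float], c_llm: float, h: float,
                    pr_hist: float, answer_probs: list[float],
                    n_query_data: int, alpha: float = 0.5,
                    beta: float = 0.5) -> float:
    """C(v) = S_n(v) + A(v), with A(v) = alpha*Auth_LLM + (1-alpha)*Auth_hist."""
    authority = (alpha * auth_llm(c_llm, beta)
                 + (1 - alpha) * auth_hist(h, pr_hist, answer_probs, n_query_data))
    return node_consistency(similarities) + authority

# Illustrative values only: two neighbours, neutral LLM score, H = 50.
c_v = node_confidence(similarities=[0.4, 0.2], c_llm=0.0, h=50,
                      pr_hist=0.8, answer_probs=[1.0, 1.0], n_query_data=2)
```

With these inputs the consistency term is 0.3, the LLM authority is 0.5, the historical authority is 42/52, and the resulting confidence is about 0.954. Because $C(v)$ is a sum rather than an average, it is not confined to the interval from 0 to 1.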

4.2.3.4. Multi-level Confidence Computing (MCC) Algorithm

The MCC algorithm (Algorithm 1 in the paper) orchestrates the calculation of credibility for data sources within the homologous subgraph, ensuring the quality of the knowledge graph embedded into the LLM.

1: procedure CONFIDENCE_COMPUTING($v$, $D$)
2:   $S_n(v) \leftarrow$ Equation (8) // Node Consistency Score
3:   $Auth_{LLM}(v) \leftarrow$ Equation (10) // LLM-assessed Authority
4:   $Auth_{hist}(v) \leftarrow$ Equation (11) // Historical Authority
5:   $A(v) \leftarrow$ Equation (9) // Overall Node Authority
6:   $C(v) \leftarrow S_n(v) + A(v)$ // Total Node Confidence
7:   return $C(v)$
8: end procedure
9: procedure MCC($\mathcal{G}$, $Q$, $D$)
10:   $\mathcal{SV}s \gets \emptyset$, $\mathcal{LV}s \gets \emptyset$ // Initialize homologous data group set and isolated point set
11:   $\mathcal{U}_{unvisited} \gets V$ // Set of all nodes in the graph
12:   while $\mathcal{U}_{unvisited} \neq \emptyset$ do
13:     $v \gets$ pop a node from $\mathcal{U}_{unvisited}$
14:     for all $D \in D$ do // Iterate through all data sources
15:       if $v \in Data(Q, subSG_i)$ then // If node v is part of a candidate subgraph for query Q
16:         $C(v) \gets$ Confidence_Computing($v$, $D$) // Calculate node confidence
17:         if $C(v) > \theta$ then // If confidence exceeds threshold $\theta$
18:           $\mathcal{U}_{sg} \gets \mathcal{U}_{sg} \cup \{v\}$ // Add to homologous node set
19:           $\mathcal{E}_{sg} \gets \mathcal{E}_{sg} \cup \{e_i\}$ // Add to homologous edge set
20:         else
21:           $\mathcal{LV}s \gets \mathcal{LV}s \cup \{v\}$ // Add to isolated point set (unreliable)
22:         end if
23:       end if
24:     end for
25:     if $\mathcal{U}_{sg} \ne \emptyset$ then
26:       $\mathcal{SV}s \gets \mathcal{SV}s \cup (\mathcal{U}_{sg}, \mathcal{E}_{sg})$ // Aggregate homologous subgraph
27:       $\mathcal{U}_{sg} \gets \emptyset$, $\mathcal{E}_{sg} \gets \emptyset$ // Reset for next subgraph
28:     end if
29:   end while
30:   return $\mathcal{SV}s$, $\mathcal{LV}s$
31: end procedure

The MCC algorithm does not directly output the final graph and node confidence values but processes nodes based on these calculations to build credible subgraphs and identify isolated (unreliable) points. The final confidence values are then obtained through prompting the LLM.
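
The control flow of the MCC procedure can be reduced to a short Python sketch. It is a simplification under stated assumptions: the per-source loop and the edge bookkeeping are omitted, nodes are plain strings, and the `confidence` callable stands in for the node confidence computation. The strict comparison against $\theta$ follows the pseudocode, and the default of 0.7 is the value reported in Section 5.3.

```python
from typing import Callable

def mcc(nodes: list[str], query_related: set[str],
        confidence: Callable[[str], float],
        theta: float = 0.7) -> tuple[list[str], list[str]]:
    """Split query-related nodes into trusted (C(v) > theta) and isolated ones."""
    unvisited = list(nodes)
    trusted, isolated = [], []
    while unvisited:
        v = unvisited.pop()
        if v not in query_related:
            continue                      # node plays no role for this query
        if confidence(v) > theta:
            trusted.append(v)             # kept in the homologous node set
        else:
            isolated.append(v)            # treated as unreliable
    return trusted, isolated

# Hypothetical precomputed confidences.
scores = {"a": 0.9, "b": 0.4, "c": 0.95, "d": 0.7}
trusted, isolated = mcc(list(scores), {"a", "b", "d"}, scores.__getitem__)
```

Node `c` is ignored because it is unrelated to the query, and node `d` is rejected because its score equals the threshold rather than exceeding it.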

4.2.4. Multi-source Knowledge Line Graph Prompting (MKLGP) Algorithm

The overall Multi-source Knowledge Line Graph Prompting (MKLGP) algorithm (Algorithm 2 in the paper) integrates all modules for multi-source data retrieval.

1: procedure MKLGP($q$)
2:   $Eq, Rq \leftarrow$ Logic Form Generation($q$) // Extract entities and relationships from query
3:   $Dq \leftarrow$ Multi Document Extraction($Vq$) // Filter documents/chunks relevant to query
4:   $SG' \leftarrow$ Prompt($Dq$) // Construct Multi-source Line Graph (MLG) using extracted data and LLM prompting
5:   $SVs, LVs \leftarrow$ MCC($SG'$, $q$, $Dq$) // Apply Multi-level Confidence Computing to get credible subgraphs (SVs) and isolated points (LVs)
6:   $Cnodes, GA \leftarrow$ Prompt($SVs$, $LVs$) // Obtain final graph confidence (GA) and node confidence (Cnodes) via prompting
7:   Answer $\leftarrow$ Generating Trustworthy Answers($Cnodes$, $GA$) // Generate answer using trusted nodes and graph context
8:   return Answer
9: end procedure

Given a user query $q$ :

Logic Form Generation: An LLM is used to extract the intent, entities (Eq), and relationships (Rq) from $q$ , generating corresponding logical relationships.
Multi Document Extraction: The dataset undergoes multi-document filtering (Vq is likely an error and should refer to Dq or KB) to derive relevant text chunks.
MLG Construction: The Multi-source Line Graph (MLG) (SG') is constructed from these text chunks (Dq) for knowledge aggregation, likely guided by LLM prompting.
MCC Application: The MCC algorithm is applied to SG', the query $q$ , and the document chunks Dq to obtain a set of credible query nodes within homologous subgraphs (SVs) and isolated points (LVs).
Confidence Refinement: Leveraging an LLM with prompting, the final graph confidence (GA) and node confidence (Cnodes) are obtained. This step enhances the credibility by having the LLM contextualize the confidence scores.
Answer Generation: The Cnodes (credible nodes) and GA (graph confidence) are embedded into the context of the LLM to generate a trustworthy retrieval answer.
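
The six steps above form a linear pipeline, which the following Python sketch makes explicit. Every stage is passed in as a callable, and all parameter names are hypothetical; the stubs in the usage example exist only to show how data flows between stages and do not reflect how the paper implements any of them.

```python
from typing import Any, Callable

def mklgp(q: str,
          logic_form: Callable[[str], tuple[Any, Any]],
          extract_docs: Callable[[Any, Any], Any],
          build_mlg: Callable[[Any], Any],
          mcc: Callable[[Any, str, Any], tuple[Any, Any]],
          refine_confidence: Callable[[Any, Any], tuple[Any, Any]],
          generate: Callable[[Any, Any], str]) -> str:
    """Wire the MKLGP stages together in the order given by Algorithm 2."""
    eq, rq = logic_form(q)                  # entities and relations of the query
    dq = extract_docs(eq, rq)               # query-relevant chunks
    sg = build_mlg(dq)                      # multi-source line graph
    svs, lvs = mcc(sg, q, dq)               # credible subgraphs, isolated points
    c_nodes, ga = refine_confidence(svs, lvs)
    return generate(c_nodes, ga)

# Stub stages, for illustration only.
answer = mklgp(
    "When does CA981 depart?",
    logic_form=lambda q: (["CA981"], ["departure"]),
    extract_docs=lambda eq, rq: ["CA981 departs 13:00", "CA981 departs 13:05"],
    build_mlg=lambda dq: {"nodes": dq},
    mcc=lambda sg, q, dq: (sg["nodes"][:1], sg["nodes"][1:]),
    refine_confidence=lambda svs, lvs: (svs, 0.8),
    generate=lambda c_nodes, ga: f"{c_nodes[0]} (graph confidence {ga})",
)
```

Passing the stages as arguments keeps the skeleton independent of any particular LLM client or graph store.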

5. Experimental Setup

5.1. Datasets

The authors used a combination of multi-source query datasets and multi-hop QA datasets to validate MultiRAG.

5.1.1. Multi-source Data Fusion Datasets

To evaluate the efficiency of multi-source line graph construction and its impact on retrieval performance, experiments were conducted on four real-world benchmark datasets [35]-[37]:

Movies Dataset: Comprises movie data collected from 13 sources.
Books Dataset: Includes book data from 10 sources.
Flights Dataset: Gathers information on over 1200 flights from 20 sources.
Stocks Dataset: Collects transaction data for 1000 stock symbols from 20 sources.

For these datasets, 100 queries were issued for each to verify retrieval efficiency. It is noted that Movies and Flights are relatively denser, while Books and Stocks are relatively sparser, which impacts model performance.

5.1.1.1. Dataset Preprocessing

To align datasets with real-world applications and demonstrate MultiRAG's applicability to multi-source data, the four datasets were split and reconstructed into three categories of data formats:

Tabular Data (Structured Data): Stored in .csv files.
Nested JSON Data (Semi-structured Data): Stored in .json files.
XML Data (Semi-structured Data): Stored in .xml files. Some data directly stored in Knowledge Graph (KG) format was also retained.

The following are the statistics of the preprocessed datasets (Table I from the original paper):

Datasets	Data source	Sources	Entities	Relations	Queries
Movies	JSON(J)	4	19701	45790	100
	KG(K)	5	100229	264709	
	CSV(C)	4	70276	184657	
Books	JSON(J)	3	3392	2824	100
	CSV(C)	3	2547	1812	
	XML(X)	4	2054	1509	
Flights	CSV(C)	10	48672	100835	100
	JSON(J)	10	41939	89339	
Stocks	CSV(C)	10	7799	11169	100
	JSON(J)	10	7759	10619	

5.1.2. Multi-hop Question Answering (QA) Datasets

To validate the robustness of MultiRAG on complex Q&A datasets, two multi-hop question answering datasets were selected:

HotpotQA [38]: This dataset requires finding and reasoning over multiple supporting facts to answer a question.
2WikiMultiHopQA [39]: This dataset is designed for comprehensive evaluation of reasoning steps, also requiring information from multiple documents or hops.

Both datasets are built on Wikipedia documents, allowing for the use of a consistent document corpus and retriever for external references. Due to experimental cost constraints, the evaluation used a subsample of 300 questions from the validation set of each dataset.

5.2. Evaluation Metrics

To assess the effectiveness, the following metrics were used:

5.2.1. F1 Score

The F1 score is used as the evaluation metric for data fusion results. It is the harmonic mean of precision (P) and recall (R).

Conceptual Definition:
- Precision (P) measures the proportion of correctly retrieved items among all retrieved items. It answers: "Of all the items I retrieved, how many are actually relevant?"
- Recall (R) measures the proportion of correctly retrieved items among all relevant items in the dataset. It answers: "Of all the relevant items, how many did I actually retrieve?"
- The F1 score provides a single metric that balances both precision and recall, being particularly useful when there is an uneven class distribution or when false positives and false negatives are equally costly.
Mathematical Formula: $ F 1 = 2 \times \frac { P \times R } { P + R } $ Where:
- $P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$
- $R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
Symbol Explanation:
- $P$ : Precision.
- $R$ : Recall.
- $\text{True Positives}$ : Correctly identified relevant items.
- $\text{False Positives}$ : Irrelevant items incorrectly identified as relevant.
- $\text{False Negatives}$ : Relevant items that were not identified.
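
As a worked example of these definitions, the following Python sketch computes precision, recall, and F1 from a set of retrieved items and a set of relevant items. The set-based interface and the toy item names are assumptions for illustration; the paper does not describe its scoring code.

```python
def precision_recall_f1(retrieved: set[str],
                        relevant: set[str]) -> tuple[float, float, float]:
    """Return (P, R, F1); each value is 0.0 when its denominator would be zero."""
    true_positives = len(retrieved & relevant)
    p = true_positives / len(retrieved) if retrieved else 0.0
    r = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# 4 items retrieved, 3 relevant overall, 2 in common.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here $P = 2/4$, $R = 2/3$, and $F1 = 4/7 \approx 0.571$, which is lower than the arithmetic mean of the two because the harmonic mean is pulled toward the smaller value.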

5.2.2. Recall@K

Recall@K is used to evaluate the retrieval credibility of the MKLGP Algorithm at three stages: before subgraph filtering, before node filtering, and after node filtering.

Conceptual Definition: Recall@K measures whether at least one correct answer is present within the top $K$ retrieved items. It's a common metric in information retrieval and recommendation systems to assess how well a system retrieves relevant items, particularly when users might only look at a limited number of results. A higher Recall@K indicates that the system is more likely to find a correct answer within the initial set of results.
Mathematical Formula: $ \text{Recall@K} = \frac{\text{Number of queries where at least one correct answer is in top K retrieved items}}{\text{Total number of queries}} $
Symbol Explanation:
- $K$ : The number of top retrieved items considered.
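
The hit-based definition above translates into a few lines of Python. The list-of-lists input format and the toy queries are assumptions made for illustration.

```python
def recall_at_k(ranked_results: list[list[str]],
                gold_answers: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one gold answer among the top-k results."""
    if not ranked_results:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_results, gold_answers)
               if set(ranked[:k]) & gold)
    return hits / len(ranked_results)

ranked = [["x", "a", "y"],   # gold answer at rank 2
          ["b", "z", "w"],   # gold answer at rank 1
          ["u", "v", "c"]]   # gold answer at rank 3
gold = [{"a"}, {"b"}, {"c"}]
r_at_1 = recall_at_k(ranked, gold, 1)
r_at_3 = recall_at_k(ranked, gold, 3)
```

With these toy rankings, one of three queries is a hit at $K = 1$ and all three are hits at $K = 3$, so the metric can only stay flat or grow as $K$ increases.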

5.2.3. Query Response Time ( $T$ )

Query Response Time ( $T$ ) is used as an evaluative metric to verify the efficiency of knowledge aggregation and retrieval.

Conceptual Definition: It measures the time taken (in seconds) from when a query is submitted to when the system returns a response. A lower query response time indicates higher efficiency.
Symbol Explanation:
- $T$ : Time, measured in seconds (s).
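
One common way to measure such a response time in Python is to wrap the query call with a monotonic clock, as sketched below. The helper name `timed_query` is hypothetical, and the paper does not state how its timings were taken.

```python
import time
from typing import Any, Callable

def timed_query(run_query: Callable[[str], Any], query: str) -> tuple[Any, float]:
    """Run a query and return (answer, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    answer = run_query(query)
    elapsed = time.perf_counter() - start
    return answer, elapsed

# A trivial stand-in for a retrieval system.
answer, seconds = timed_query(str.upper, "when does ca981 depart?")
```

`time.perf_counter` is preferred over wall-clock time here because it is not affected by system clock adjustments during the measurement.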

5.3. Hyper-parameter Settings

Weight Coefficient $\alpha$ : Set to 0.5 for balancing LLM-assessed authority and historical authority.
Temperature Parameter $\beta$ : Set to 0.5. (Likely for the sigmoid function in $Auth_{LLM}(v)$ ).
Number of Entities in Historical Queries $\mathcal{H}$ : Initialized to 50.
Initial Node Confidence Threshold $\theta$ : Defined as 0.7. (Used in MCC Algorithm 1, line 17).
Graph Confidence Threshold: Set to 0.5.
Base LLM: Llama3-8B-Instruct was used for most experiments, except for CoT experiments, which utilized GPT-3.5-Turbo.
Data Storage: After slicing into Chunks, slice numbers, data source locations, and transformed triple nodes were stored in the multi-source line graph using JSON-LD format for cross-indexing.
Hardware: Experiments were conducted on a device equipped with an Intel(R) Core(TM) Ultra 9 185H 2.30GHz processor and 512GB of memory.

5.4. Baselines

MultiRAG was compared against basic data fusion methods and state-of-the-art (SOTA) methods, including multi-document question-answering and knowledge base question-answering methods.

5.4.1. Data Fusion Methods (Baselines)

TruthFinder (TF) [37]: A classic iterative data fusion method that estimates the trustworthiness of sources and facts simultaneously.
LTM [42]: A probabilistic data fusion method that uses a Bayesian approach to discover truth from conflicting sources.

5.4.2. General RAG and Question Answering Baselines

CoT [43]: Chain-of-Thought is a foundational approach that enables LLMs to perform complex reasoning by breaking down problems into intermediate steps. The experiments used GPT-3.5-Turbo as the base model for CoT.
Standard RAG [2]: A method that combines the strengths of retrieval and generation models. It retrieves relevant documents and uses them to augment the LLM's generation.
IRCoT [44]: Interleaving Retrieval with Chain-of-Thought reasoning. An advanced method that refines the reasoning process through iterative retrieval steps.
ChatKBQA [45]: A conversational interface-based method for knowledge base question answering, often involving querying structured knowledge bases.
MDQA [46]: Multi-Document Question Answering is a method designed to effectively extract answers from multiple documents.
FusionQuery [34]: A state-of-the-art (SOTA) method based on an efficient on-demand fusion query framework over multi-source heterogeneous data.
RQ-RAG [47]: Refine Queries for Retrieval Augmented Generation. A method that integrates external documents and optimizes the query process to handle complex queries, often by refining the query based on initial retrieval results.
MetaRAG [9]: A method that employs meta-cognitive strategies to enhance the retrieval process, particularly for multi-hop QA, by establishing knowledge association verification through graph reasoning paths.

6. Results & Analysis

This section presents and analyzes the experimental results, addressing the research questions posed in the paper.

6.1. Core Results Analysis

Q1: How does the retrieval recall performance of MultiRAG compare with other data fusion models and SOTA data retrieval models?

To answer Q1, the authors evaluated the Multi-source Knowledge Aggregation (MKA) module using F1 scores and query times across four multi-source query datasets. Different baseline and SOTA models were substituted as the fusion query algorithm. The MCC module (Multi-level Confidence Calculation) is the core of MultiRAG that enables this performance.

The following are the results from Table II of the original paper:

Datasets	Data source	Data Fusion Methods (Baseline)				SOTA Methods												Our Method
		TF		LTM		IR-CoT		MDQA		ChatKBQA		FusionQuery		MultiRAG		MCC
		F1/%	Time/s	F1/%	Time/s	F1/%	Time/s	F1/%	Time/s	F1/%	Time/s	F1/%	Time/s	F1/%	Time/s	F1/%	Time/s
Movies	J/K	37.1	9717	41.4	1995	43.2	1567	46.2	1588	45.1	3809	53.2	122.4	52.6	98.3	52.6	98.3
	J/C	41.9	7214	42.9	1884	45.0	1399	44.5	1360	42.7	3246	52.7	183.1	54.3	75.1	54.3	75.1
	K/C	37.8	2199	41.2	1576	37.6	1014	45.2	987	40.4	2027	42.5	141.0	49.1	86.0	49.1	86.0
	J/K/C	36.6	11225	40.8	2346	41.5	2551	49.8	2264	44.7	5151	53.6	137.8	54.8	157	54.8	157
Books	J/C	40.2	1017	42.4	195.3	35.2	147.6	55.7	124.2	56.1	165.0	58.5	22.7	63.5	13.66	63.5	13.66
	J/X	35.5	1070	35.6	277.7	36.1	178.7	55.1	115.6	54.7	200.1	57.9	20.6	63.1	13.78	63.1	13.78
	C/X	43.0	1033	44.1	232.6	42.6	184.5	57.2	115.6	55.6	201.4	60.3	21.5	64.2	13.54	64.2	13.54
	J/C/X	37.3	2304	41.0	413.2	40.4	342.6	56.4	222.6	57.1	394.1	59.1	47.0	66.8	27.4	66.8	27.4
Flights	C/J	27.3	6049	79.1	14786	58.3	214.0	76.5	360	76.8	376	74.2	20.2	74.9	80	74.9	80
Stocks	C/J	68.4	2.30	19.2	1337	64.8	53.3	65.2	78.4	64.0	88.9	68.0	0.33	78.6	12.1	78.6	12.1

Analysis:

Overall Superiority: MultiRAG (specifically the MCC module, the confidence calculation part of MultiRAG) achieves the highest F1 score in most configurations of Table II. It leads on all Books combinations, on Stocks, and on three of the four Movies combinations; on Movies (J/K), FusionQuery is marginally ahead (53.2% vs. 52.6%). The exception is Flights, where LTM (79.1%), ChatKBQA (76.8%), and MDQA (76.5%) all exceed MultiRAG's 74.9%. Since F1 balances precision and recall, these results indicate that MultiRAG retrieves relevant information both accurately and comprehensively in most settings.
Significant Improvement over Baselines: On Movies, Books, and Stocks, MultiRAG's F1 score exceeds the better of the two data fusion baselines (TF and LTM) by roughly 8 to 28 percentage points (for example, 54.8% vs. 40.8% on Movies J/K/C and 63.1% vs. 35.6% on Books J/X). On Flights, LTM remains ahead. MultiRAG also matches or exceeds the SOTA methods IR-CoT, MDQA, ChatKBQA, and FusionQuery in most configurations.
Performance on Dense vs. Sparse Datasets:
- Dense Datasets (Movies, Flights): On these datasets, which are inherently denser in terms of knowledge connectivity, MultiRAG is competitive but its margin is small. For example, on Movies (J/K/C), MultiRAG achieves 54.8% F1, compared to FusionQuery's 53.6%. On Flights (C/J), MultiRAG reaches 74.9% F1, slightly below MDQA (76.5%) and ChatKBQA (76.8%) and comparable to FusionQuery (74.2%). This agrees with the paper's statement that MDQA and ChatKBQA can "match or outperform our approach in situations where knowledge is abundant".
- Sparse Datasets (Books, Stocks): MultiRAG shows a much larger advantage on sparser datasets. On the Books (J/C/X) dataset, MultiRAG achieves 66.8% F1, 7.7 percentage points above the best SOTA method (FusionQuery at 59.1%). On Stocks (C/J), MultiRAG reaches 78.6% F1, 10.6 points above FusionQuery's 68.0%. This highlights MultiRAG's effectiveness in scenarios where knowledge is fragmented, which is a core problem it aims to address.
Query Time (Efficiency): MultiRAG is far faster than the data fusion baselines and the LLM-based SOTA methods (IR-CoT, MDQA, ChatKBQA) on Movies and Books. For example, on Movies (J/K/C), MultiRAG takes 157s, while TF takes 11225s and LTM takes 2346s. Against FusionQuery the picture is mixed: MultiRAG is faster on Books (e.g., J/C: 13.66s vs. 22.7s) and on three Movies combinations (e.g., J/C: 75.1s vs. 183.1s), but slower on Movies J/K/C (157s vs. 137.8s), Flights (80s vs. 20.2s), and Stocks (12.1s vs. 0.33s). Where MultiRAG is faster, the gain is attributed to the Multi-source Line Graph construction, which aggregates homologous data efficiently.

Q2: What are the respective impacts of data sparsity and data inconsistency on the quality of retrieval recall?

MultiRAG's robustness to data sparsity and data inconsistency was evaluated through controlled perturbation experiments.

6.1.2.1. Impact of Data Sparsity

Methodology: The authors applied 30%, 50%, and 70% random relationship masking to four pre-processed datasets, making connections sparser while ensuring query answers were still retrievable. MultiRAG (Ours) and ChatKBQA (SOTA) were then tested.
Analysis (based on the text description, as Fig. 5 is not clearly segmented into a, b, c, d for these specific experiments):
- MultiRAG Robustness: On the Books dataset, MultiRAG's F1 score dropped from 66.8% to 60.0% (a 6.8-point drop) after 70% relationship masking. On the Stocks dataset, its F1 score decreased from 78.6% to 71.0% (a 7.6-point drop). These are moderate decreases, indicating that MultiRAG largely maintains its performance even under significant knowledge fragmentation.
- ChatKBQA Sensitivity: On the Books dataset, ChatKBQA's F1 score dropped from 59.1% to 53.0% (a 6.1-point drop), and on the Stocks dataset from 68.0% to 62.0% (a 6.0-point drop). The absolute drops are therefore similar to MultiRAG's, so these figures do not show a faster decline for ChatKBQA. The difference is that MultiRAG starts from a much higher F1 and still leads by about 7 to 9 points after 70% masking. Note that the starting values quoted here (59.1% and 68.0%) coincide with FusionQuery's entries in Table II rather than ChatKBQA's (57.1% and 64.0%).

6.1.2.2. Impact of Data Inconsistency

Methodology: The Books and Stocks datasets were perturbed by adding 30%, 50%, and 70% triple increments (copies of original triples) with completely shuffled relationship edges, disrupting data consistency. MultiRAG and ChatKBQA were then tested.
Analysis (based on text description):
- MultiRAG Robustness: On the Movies dataset (Fig. 5a), with 30%, 50%, and 70% triple increments, MultiRAG's F1 score slightly decreased from 54.8% to 52.1%, 51.5%, and 49.9%, respectively. This represents relatively small decreases (up to ~4.9% at 70% perturbation). On the Flights dataset (Fig. 5c), its F1 score decreased from 74.9% to 73.4%, 72.9%, and 71.4%, respectively (up to ~3.5% at 70% perturbation). This demonstrates MultiRAG's strong stability even when facing severe data inconsistency.
- ChatKBQA Sensitivity: ChatKBQA showed a more rapid decline. On the Movies dataset (Fig. 5a), its F1 score dropped from 53.6% to 51.6%, 47.2%, and 40.8% (a substantial ~12.8% drop at 70% perturbation). On the Flights dataset (Fig. 5c), its F1 score dropped from 74.2% to 69.7%, 64.3%, and 55.8% (a significant ~18.4% drop at 70% perturbation).
Conclusion: MultiRAG maintains a high level of performance stability under conditions of disrupted data consistency, while ChatKBQA is notably more sensitive to such perturbations.
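The drops quoted above are absolute differences in percentage points; the relative drop (as a share of the unperturbed score) tells the same story. A minimal Python sketch, using only the F1 values reported in this subsection; the helper name `f1_drop` is illustrative and not from the paper:

```python
# F1 scores (%) at 0%, 30%, 50%, and 70% triple increments, as quoted above.
f1_series = {
    ("Movies", "MultiRAG"): [54.8, 52.1, 51.5, 49.9],
    ("Movies", "ChatKBQA"): [53.6, 51.6, 47.2, 40.8],
    ("Flights", "MultiRAG"): [74.9, 73.4, 72.9, 71.4],
    ("Flights", "ChatKBQA"): [74.2, 69.7, 64.3, 55.8],
}


def f1_drop(series):
    """Return (absolute drop in points, relative drop in %) from first to last value."""
    base, final = series[0], series[-1]
    absolute = base - final
    relative = 100.0 * absolute / base
    return round(absolute, 1), round(relative, 1)


for (dataset, method), series in f1_series.items():
    absolute, relative = f1_drop(series)
    print(f"{dataset:8s} {method:9s} -{absolute} points ({relative}% relative)")
```

On these figures, ChatKBQA loses roughly a quarter of its unperturbed F1 on both datasets at 70% perturbation, while MultiRAG loses less than a tenth.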

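[Image: F1 score comparison on four datasets (Movies, Books, Flights, and Stocks), showing how the F1 scores of MultiRAG and other methods change across corruption levels, with MultiRAG performing best on each dataset.]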

The above figure (Fig. 5 from the original paper) presents a comparison of F1 scores across various corruption levels for MultiRAG and ChatKBQA on four different datasets, illustrating MultiRAG's superior robustness against data sparsity and inconsistency.

6.2. Ablation Studies / Parameter Analysis

Q3: How effective are the two modules of MultiRAG individually?

The effectiveness of MultiRAG's individual components (MKA and MCC) and their sub-components (graph-level and node-level confidence) was assessed through ablation studies.

The following are the results from Table III of the original paper:

Datasets	Source	MultiRAG			w/o MKA			w/o Graph Level			w/o Node Level			w/o MCC
Datasets	Source	F1/%	QT/s	PT/s	F1/%	QT/s	PT/s	F1/%	QT/s	PT/s	F1/%	QT/s	PT/s	F1/%	QT/s	PT/s
Movies	J/K	52.6	25.7	62.64	48.2	2783	62.64	45.3	50.1	58.2	38.7	21.3	0.31	31.6	25.7	0.28
	J/C	54.3	12.7	61.36	49.1	1882	61.36	46.8	28.9	57.4	40.2	10.5	0.29	30.5	12.7	0.29
	K/C	49.1	31.6	64.40	45.5	4233	64.40	42.7	65.3	61.8	35.9	28.4	-0.27	33.1	31.6	-0.29
	J/K/C	54.8	39.2	60.8	47.5	4437	60.8	48.1	75.6	56.2	41.5	35.8	0.30	34.7	39.2	0.32
Books	J/C	63.5	1.19	2.47	57.1	11.9	2.47	55.2	4.7	2.12	49.8	0.92	0.18	43.4	1.19	0.22
	J/X	63.1	1.22	2.56	59.3	11.7	2.62	54.7	5.1	2.24	48.3	0.89	0.19	42.6	1.22	0.22
	C/X	64.2	1.16	2.38	55.3	8.39	2.38	53.9	3.9	2.05	47.1	0.85	0.16	41.0	1.16	0.17
	J/C/X	66.8	1.31	3.07	57.2	15.8	3.08	59.4	6.3	2.89	52.7	1.12	0.21	36.4	1.31	0.20
Flights	C/J	74.9	29.8	109.9	72.2	NAN	109.9	68.3	142.7	98.5	61.4	25.3	0.85	52.1	29.8	1.07
Stocks	C/J	78.6	2.72	5.36	69.6	450.8	5.36	72.1	8.9	4.12	65.3	1.98	0.15	45.4	2.72	0.17

6.2.1. Ablation Study on Component Effectiveness

Impact of Multi-source Knowledge Aggregation (MKA):
- Efficiency: The MKA module, through its MLG architecture, significantly improves query efficiency. For instance, on the Flights dataset (C/J), removing MKA results in a Query Time (QT) of "NAN" (computationally infeasible), while MultiRAG with MKA achieves 29.8s. For Movies (J/K), QT drops from 2783s (w/o MKA) to 25.7s (MultiRAG), a speedup of roughly 108x. On Books (J/C), QT drops from 11.9s to 1.19s (a 10x speedup). This validates MKA's role in accelerating queries by providing a compact and connected knowledge structure.
- Accuracy: Removing MKA leads to a notable decrease in F1 score. For Movies (J/K/C), F1 drops from 54.8% to 47.5% (a 7.3% drop). For Books (J/C/X), F1 drops from 66.8% to 57.2% (a 9.6% drop). This demonstrates MKA's effectiveness in aggregating fragmented knowledge across sources, leading to better retrieval quality.
- Preprocessing Time (PT): MKA introduces a preprocessing time (PT) (e.g., 60.8s for Movies J/K/C, 3.07s for Books J/C/X), which is an initial overhead for building the MLG, but this is amortized by the significant query time (QT) reductions.
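The speedup figures above follow directly from the QT columns of Table III. A short Python sketch of that arithmetic, using a subset of the table's rows; the `speedup` helper and the use of `None` for the "NAN" entry are illustrative choices, not part of the paper:

```python
# Query time in seconds from Table III: (full MultiRAG, w/o MKA).
# None stands in for the "NAN" entry, read here as computationally infeasible.
query_times = {
    ("Movies", "J/K"): (25.7, 2783.0),
    ("Movies", "J/K/C"): (39.2, 4437.0),
    ("Books", "J/C"): (1.19, 11.9),
    ("Books", "J/C/X"): (1.31, 15.8),
    ("Flights", "C/J"): (29.8, None),
    ("Stocks", "C/J"): (2.72, 450.8),
}


def speedup(with_mka, without_mka):
    """Ratio of query time without MKA to query time with MKA, or None if unavailable."""
    if without_mka is None:
        return None
    return without_mka / with_mka


for (dataset, source), (with_mka, without_mka) in query_times.items():
    ratio = speedup(with_mka, without_mka)
    label = "n/a (no QT without MKA)" if ratio is None else f"{ratio:.1f}x"
    print(f"{dataset:8s} {source:6s} {label}")
```

The ratio is about two orders of magnitude on Movies and Stocks and about one order of magnitude on Books, which is the pattern the efficiency bullet describes.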
Impact of Multi-level Confidence Computing (MCC):
- Accuracy & Hallucination Control: Disabling the entire MCC module (w/o MCC) drastically degrades the F1 score. For Movies (J/K/C), F1 drops from 54.8% to 34.7% (20.1 points). For Stocks (C/J), F1 drops from 78.6% to 45.4% (33.2 points). The paper also states that the PT values in this setting indicate increased hallucination risk; since PT otherwise denotes Preprocessing Time, how that column supports the claim is not clear from the table alone. Overall, the result strongly supports MCC's critical role in eliminating unreliable information and controlling hallucinations through its hierarchical confidence computation.

6.2.2. Hierarchical Analysis of MCC

The ablation study further dissects the roles of graph-level and node-level confidence calculations within MCC.

w/o Graph Level (Removing Graph-Level Confidence Computing):
- Accuracy: Removing graph-level filtering leads to F1 drops. For Movies (J/K/C), F1 reduces to 48.1% (an improvement of 13.4% compared to w/o MCC, but still 6.7% lower than full MultiRAG). For Books (J/C/X), F1 drops to 59.4% (7.4% lower than full MultiRAG).
- Efficiency: Query Time (QT) increases significantly. For Movies (J/K/C), QT increases to 75.6s (a 93% increase compared to full MultiRAG's 39.2s). This suggests that graph-level filtering is important for pruning large, unreliable subgraphs early, thus improving overall query efficiency.
- Error Analysis: 38.7% of errors under graph-level removal (e.g., Movies J/K) stem from cross-source inconsistencies. This confirms that graph-level confidence ensures global consistency by filtering out subgraphs that are broadly inconsistent.
w/o Node Level (Removing Node-Level Confidence Computing):
- Accuracy: Disabling node-level computation also causes F1 drops. For Movies (J/K/C), F1 drops to 41.5%. For Books (J/C/X), F1 drops to 52.7%. While still better than w/o MCC, these results indicate that graph-level filtering alone cannot resolve all local conflicts.
- Efficiency: Interestingly, Query Time (QT) generally decreases (e.g., Movies J/K/C from 39.2s to 35.8s, Books J/C/X from 1.31s to 1.12s), because node-level confidence calculation itself adds computational overhead. However, the decrease in F1 shows that this efficiency gain comes at the cost of accuracy and increased hallucination risk. Some PT entries in this setting are negative in Table III, which is unusual for a time measurement; the column may carry a different meaning here, or the values may be extraction errors.
- Error Analysis: 52.7% of failures with node-level removal (e.g., Books J/C/X) originate from local authority issues. This confirms that node-level confidence is crucial for verifying the credibility of individual data points.
Conclusion: The complete MCC framework, by synergistically combining both graph-level and node-level layers, achieves the best F1 score (e.g., 54.8% for Movies J/K/C). This confirms the functional specialization: graph-level ensures global consistency, while node-level verifies local credibility.

6.2.3. Influence of Hyperparameter $\alpha$

The paper investigated the influence of the hyperparameter $\alpha$ (which balances LLM-assessed authority and historical authority in Node Authority Score) on multi-source retrieval.

The following figure (Fig. 7 from the original paper) displays the influence of hyperparameter $\alpha$ on F1 score and average query time for different data sources (Movies, Books, Flights, Stocks).

Fig. 7: Influence of hyperparameter $\alpha$ on multi-source retrieval. The chart plots the F1 score against the mixing weight $\alpha$, together with the average query time, for the different data sources.

The above figure (Fig. 7 from the original paper) shows the influence of hyperparameter $\alpha$ on F1 score and average query time for multi-source retrieval, across different datasets.

Analysis:

Optimal Balance: The F1 score curve peaks at $\alpha = 0.5$ , achieving 67.7%. This indicates that an optimal balance between LLM-assessed authority ( $Auth_{LLM}$ ) and historical authority ( $Auth_{hist}$ ) is achieved when they contribute equally. Relying too heavily on either (e.g., $\alpha \to 0$ or $\alpha \to 1$ ) leads to a decline in F1 score.
Efficiency-Accuracy Trade-off:
- Increasing $\alpha$ towards 1.0 (more emphasis on LLM-assessed authority) tends to reduce query time (from 83.2 seconds at $\alpha = 0.0$ to 51.8 seconds at $\alpha = 1.0$ ). This is because minimizing historical data validation (which can be computationally intensive) reduces the overall processing time.
- However, solely relying on the LLM ( $\alpha = 1.0$ ) or solely on historical data ( $\alpha = 0.0$ ) results in lower F1 scores. The peak at $\alpha = 0.5$ suggests that leveraging the LLM's contextual adaptability while grounding it with the stability of expert systems (historical patterns) is most effective.
Robustness: The study highlights that this equilibrium enhances robustness against data sparsity and noise, particularly benefiting datasets like Books and Stocks. Ablation studies showed a 62.4% reduction in errors when both components are utilized, confirming the benefit of combining these authority sources.
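The role of $\alpha$ described above can be sketched as a weighted blend of the two authority signals. The convex combination below is an assumed form that matches the description (larger $\alpha$ puts more weight on the LLM assessment); the function name and the example scores are hypothetical and are not taken from the paper:

```python
def node_authority(auth_llm: float, auth_hist: float, alpha: float = 0.5) -> float:
    """Blend LLM-assessed and historical authority.

    Assumed form: alpha * Auth_LLM + (1 - alpha) * Auth_hist.
    alpha = 1.0 relies only on the LLM; alpha = 0.0 relies only on history.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * auth_llm + (1.0 - alpha) * auth_hist


# Hypothetical node: the LLM rates it highly, its historical record is mediocre.
for alpha in (0.0, 0.5, 1.0):
    score = node_authority(auth_llm=0.9, auth_hist=0.5, alpha=alpha)
    print(f"alpha={alpha:.1f} -> authority={score:.2f}")
```

With $\alpha = 0.5$ neither signal can dominate, which is consistent with the reported F1 peak at the midpoint.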

6.3. Multi-hop QA Performance

Q4: How is the performance of MultiRAG in multi-hop Q&A datasets after incorporating multi-level confidence calculation?

To evaluate the multi-level confidence computing method in reducing hallucinations and enhancing Q&A system credibility, MultiRAG's Precision and Recall@5 scores were compared with other methods on HotpotQA and 2WikiMultiHopQA datasets.

The following are the results from Table IV of the original paper:

Method	HotpotQA		2WikiMultiHopQA
Method	Precision	Recall@5	Precision	Recall@5
Standard RAG	34.1	33.5	25.6	26.2
GPT-3.5-Turbo+CoT	33.9	47.2	35.0	45.1
IRCoT	41.6	41.2	42.3	40.9
ChatKBQA	47.8	42.1	46.5	43.7
MDQA	48.6	52.5	44.1	45.8
RQ-RAG	51.6	49.3	45.3	44.6
MetaRAG	51.1	49.9	50.7	52.2
MultiRAG	59.3	62.7	55.7	61.2

Analysis:

Superior Performance: MultiRAG significantly outperforms all other methods on both HotpotQA and 2WikiMultiHopQA datasets across both Precision and Recall@5 metrics.
- On HotpotQA: MultiRAG achieves 59.3% Precision and 62.7% Recall@5. This is a substantial improvement over the next best (RQ-RAG at 51.6% Precision and MDQA at 52.5% Recall@5).
- On 2WikiMultiHopQA: MultiRAG achieves 55.7% Precision and 61.2% Recall@5, again outperforming other SOTA methods like MetaRAG (50.7% Precision, 52.2% Recall@5).
Reduced Hallucinations and Increased Credibility: The consistently higher Recall@5 scores indicate that MultiRAG is more effective at retrieving correct answers within the top 5 results, directly addressing the problem of retrieval-induced hallucinations. The paper states that multi-level confidence computing not only yields higher average Recall@5 but also maintains a lower standard deviation (though standard deviation values are not explicitly shown in Table IV). This implies more consistent performance and fewer hallucinations across different queries.
Error Analysis (Qualitative): A detailed error analysis indicated that the multi-level confidence computing method significantly reduced the frequency of hallucinations, especially in cases of ambiguous context or unavailable information in the knowledge base. This suggests that the filtering mechanism successfully identifies and suppresses unreliable data that would otherwise lead the LLM astray.
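To make the size of the improvement concrete, the margin over the strongest baseline can be computed per column of Table IV. A small Python sketch using only the table's values; the helper name is illustrative:

```python
# Table IV scores (%): method -> (HotpotQA Precision, HotpotQA Recall@5,
#                                 2WikiMultiHopQA Precision, 2WikiMultiHopQA Recall@5)
table_iv = {
    "Standard RAG": (34.1, 33.5, 25.6, 26.2),
    "GPT-3.5-Turbo+CoT": (33.9, 47.2, 35.0, 45.1),
    "IRCoT": (41.6, 41.2, 42.3, 40.9),
    "ChatKBQA": (47.8, 42.1, 46.5, 43.7),
    "MDQA": (48.6, 52.5, 44.1, 45.8),
    "RQ-RAG": (51.6, 49.3, 45.3, 44.6),
    "MetaRAG": (51.1, 49.9, 50.7, 52.2),
    "MultiRAG": (59.3, 62.7, 55.7, 61.2),
}
columns = [
    "HotpotQA Precision",
    "HotpotQA Recall@5",
    "2WikiMultiHopQA Precision",
    "2WikiMultiHopQA Recall@5",
]


def margin_over_best_baseline(col_index):
    """Return (best baseline name, MultiRAG's lead over it in points) for one column."""
    baselines = {m: s[col_index] for m, s in table_iv.items() if m != "MultiRAG"}
    best = max(baselines, key=baselines.get)
    lead = table_iv["MultiRAG"][col_index] - baselines[best]
    return best, round(lead, 1)


for i, name in enumerate(columns):
    best, lead = margin_over_best_baseline(i)
    print(f"{name}: +{lead} points over {best}")
```

The lead is larger on Recall@5 than on Precision for both datasets, which fits the emphasis the analysis places on Recall@5.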

6.4. Time Costs

Q5: What are the time costs of the two modules in MultiRAG?

The time costs of MultiRAG, particularly its Multi-source Knowledge Aggregation (MKA) and Multi-level Confidence Computing (MCC) modules, are crucial for evaluating its practical applicability.

MKA Module (MLG Construction) Overhead:
- Table III shows that the MKA module incurs a preprocessing time (PT). For Movies datasets, PT ranges from 60.8s to 64.40s. For Books datasets, it ranges from 2.38s to 3.07s. For Flights, it's 109.9s, and for Stocks, it's 5.36s. This PT represents the initial cost of building the Multi-source Line Graph (MLG).
- Efficiency Rationale: The intuition is that MLG aggregates homologous data, creating denser retrieval subgraphs. This reduces the need to traverse and store excessive invalid nodes, thereby significantly cutting down the query time (QT) associated with traditional knowledge graph traversal and querying. For example, on Flights (C/J), w/o MKA shows QT as "NAN" (computationally infeasible), while MultiRAG achieves 29.8s. This demonstrates that the preprocessing cost of MKA is justified by the massive gains in query efficiency.
MCC Module Overhead:
- The MCC module itself also contributes to the processing time. However, its primary goal is to enhance accuracy and reduce hallucinations.
- The ablation study (w/o Node Level vs. MultiRAG) shows that removing node-level confidence calculation can reduce Query Time (e.g., Movies J/K/C: 35.8s vs 39.2s). This implies that the confidence calculation adds some computational overhead. However, the accompanying significant drop in F1 score (from 54.8% to 41.5% for Movies J/K/C) indicates that this overhead is a necessary trade-off for improved accuracy and reliability.
Comparison with SOTA Methods:
- SOTA methods like MDQA and ChatKBQA employ LLM-based data retrieval, with their main temporal overhead focusing on token consumption and LLM-based searching.
- MultiRAG's approach concentrates its overhead on MLG construction. While MLG construction is generally efficient (often in seconds), the introduction of an LLM for knowledge extraction (as part of Multi-source Data Extraction and Multi-level Confidence Computing) does incur additional temporal costs due to text generation, which the authors deem "acceptable."
Overall Efficiency: Despite the LLM-related overheads, MultiRAG demonstrates satisfactory query performance. For example, in Table II, MultiRAG's Query Times are generally competitive with or superior to many SOTA methods, especially compared to the higher query times of classic data fusion methods like TF and LTM. The gains in retrieval accuracy and hallucination mitigation (as shown by F1 and Recall@5 scores) outweigh the computational costs.

6.5. Case Study

MultiRAG's effectiveness in multi-source integration is illustrated through a real-world flight status query for "CA981 from Beijing to New York." This case study highlights MultiRAG's unique ability to transform fragmented, conflicting inputs into trustworthy answers.

The following are the results from Table V of the original paper:


Data Sources
Structured	CA981, PEK, JFK, Delayed, 2024-10-01 14:30
Semi-structured	{"flight": "CA981", "delay_reason": "Weather", "source": "AirChina"}
Unstructured	"Typhoon Haikui impacts PEK departures after 14:00."
MKA Module	Structured parsing: Flight attributes mapping
	LLM extraction: (CA981, DelayReason, Typhoon) @0.87
	[MLG diagram: CA981 (Flight) with Status Delayed, Departure PEK (Origin), Destination JFK (Destination), Source AirChina APP (Source); Typhoon (Cause), effective after 14:00; conflicting user claim "On-time" from ForumUser123 (Source)]
MCC Module	With GCC: Graph confidence=0.71 (Threshold=0.5), Filtered: ForumUser123 (0.47)
	Without GCC: Unfiltered conflict=2 subgraphs
LLM Context	Trusted: CA981.Status=Delayed (0.89), DelayReason=Typhoon (0.85)
	Conflicts: ForumUser123:On-time (0.47), WeatherAPI:Clear (0.52)
Final Answer	Correct: "CA981 delayed until after 14:30 due to typhoon"
	Hallucinated: "CA981 on-time with possible delay after 14:30"

Analysis:

Multi-source Data Integration: MultiRAG successfully integrated three data formats:
- Structured: Flight schedule (CA981, PEK, JFK, Delayed, 2024-10-01 14:30).
- Semi-structured: Airline delay code ({"flight": "CA981", "delay_reason": "Weather", "source": "AirChina"}).
- Unstructured: Weather alert ("Typhoon Haikui impacts PEK departures after 14:00.").
MKA Module: The MKA module processed these inputs. It performed structured parsing for flight attributes and used LLM extraction to identify key relationships like (CA981, DelayReason, Typhoon) with a confidence score of 0.87. It also identified a conflicting user claim from "ForumUser123" that stated "On-time."
MCC Module (Hierarchical Verification): This module was critical in resolving conflicts:
- Graph-level Confidence Computing (GCC): The graph confidence was calculated as 0.71 (above the threshold of 0.5). This indicated that the overall subgraph was largely credible.
- Node-level Confidence Computing (NCC): Individual nodes were assessed. The system filtered out low-reliability sources, such as the "ForumUser123" claim (confidence score of 0.47), which was below the typical node confidence threshold. It prioritized data from reliable sources like "AirChina" (confidence 0.89 for CA981.Status=Delayed) and weather reports (confidence 0.85 for DelayReason=Typhoon). It also identified another conflict from "WeatherAPI:Clear" (confidence 0.52).
Final Answer Generation: Through this dual-layer validation, MultiRAG precisely reconciled contradictory departure time claims. It suppressed the inconsistent "on-time" report and the "WeatherAPI:Clear" (which might conflict with the typhoon reason) and generated the verified conclusion: "CA981 delayed until after 14:30 due to typhoon," while successfully avoiding the hallucinated answer like "CA981 on-time with possible delay after 14:30."

This case study demonstrates MultiRAG's ability to navigate complex multi-source information, identify and resolve conflicts, and produce accurate, hallucination-free responses by systematically weighting sources and modeling consensus.
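The two-stage check in this case study can be sketched as a graph-level gate followed by a node-level filter. The Python below is a simplified illustration, not the paper's MCC algorithm: the confidence values and the graph threshold of 0.5 come from Table V, while the node threshold of 0.7, the `Claim` class, and the `filter_claims` function are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    source: str
    statement: str
    confidence: float


GRAPH_THRESHOLD = 0.5  # from Table V
NODE_THRESHOLD = 0.7   # assumption: not stated in Table V


def filter_claims(graph_confidence, claims,
                  graph_threshold=GRAPH_THRESHOLD, node_threshold=NODE_THRESHOLD):
    """Return (trusted, rejected) claims after graph-level then node-level checks."""
    # Graph level: discard the whole subgraph if it is not credible overall.
    if graph_confidence < graph_threshold:
        return [], list(claims)
    # Node level: keep only claims whose own confidence clears the threshold.
    trusted = [c for c in claims if c.confidence >= node_threshold]
    rejected = [c for c in claims if c.confidence < node_threshold]
    return trusted, rejected


claims = [
    Claim("AirChina", "CA981.Status=Delayed", 0.89),
    Claim("weather alert", "DelayReason=Typhoon", 0.85),
    Claim("ForumUser123", "On-time", 0.47),
    Claim("WeatherAPI", "Clear", 0.52),
]

trusted, rejected = filter_claims(graph_confidence=0.71, claims=claims)
print("Trusted:  ", [c.statement for c in trusted])
print("Conflicts:", [f"{c.source}:{c.statement}" for c in rejected])
```

Under these assumptions the split reproduces the "Trusted" and "Conflicts" rows of Table V: the delay status and typhoon cause pass, while the forum claim and the "Clear" weather report are held back from the LLM context.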

6.6. Restrictive Analysis

The authors acknowledge several limitations of the current MultiRAG framework:

Lack of optimization of text chunk segmentation: The current framework might not be optimally segmenting text into chunks, which could impact the quality of extracted knowledge and subsequent graph construction.
Reliance on LLM-based expert evaluation, which may introduce potential security vulnerabilities: The Node Authority Score ( $Auth_{LLM}(v)$ ) relies on an LLM for evaluating node authority. This introduces a dependency on the LLM's inherent biases or potential for generating incorrect evaluations, which could be a security risk in sensitive domains.
Focuses on eliminating factual hallucinations but lacks handling of symbolic hallucinations: MultiRAG primarily addresses factual hallucinations (incorrect information). It does not explicitly handle symbolic hallucinations, which refer to errors in logical reasoning, mathematical calculations, or symbolic manipulation by the LLM.

7. Conclusion & Reflections

7.1. Conclusion Summary

In this work, the authors introduced MultiRAG, a novel framework designed to mitigate hallucinations specifically arising in multi-source knowledge-augmented generation scenarios. The framework addresses two critical challenges: data sparsity (hindering logical relationships) and inter-source inconsistency (leading to information conflicts).

MultiRAG's core contributions are:

Multi-source Knowledge Aggregation (MKA): By employing multi-source line graphs, MultiRAG efficiently aggregates cross-domain data, significantly enhancing knowledge connectivity and retrieval performance. This effectively tackles the sparse data distribution problem.
Multi-level Confidence Calculation (MCC): This sophisticated module performs both graph-level and node-level confidence assessments. It adaptively filters out low-quality subgraphs and unreliable nodes, thereby reducing hallucinations caused by inter-source inconsistencies.

Extensive experiments across various multi-domain query datasets and multi-hop QA datasets demonstrated that MultiRAG consistently and significantly enhances the reliability and efficiency of knowledge retrieval, especially in complex multi-source environments. The F1 scores and Recall@5 metrics showed marked improvements over existing SOTA methods, coupled with strong robustness against data sparsity and inconsistency.

7.2. Limitations & Future Work

The authors identified the following limitations and suggest future research directions:

Text Chunk Segmentation Optimization: Future work should explore better ways to segment text chunks to improve knowledge extraction.
LLM-based Evaluation Security: The reliance on LLM-based expert evaluation for node authority introduces potential security vulnerabilities, which need to be addressed.
Symbolic Hallucination: The current framework focuses on factual hallucinations; future work will aim to handle symbolic hallucinations as well.
Multimodal Retrieval and Ultra-long Text Reasoning: The authors plan to extend MultiRAG to more challenging aspects of hallucination mitigation, including multimodal retrieval (integrating different data modalities like text, images, video) and ultra-long text reasoning, to better adapt generative retrieval systems to real-world, open multi-source environments.

7.3. Personal Insights & Critique

MultiRAG presents a compelling and well-structured approach to a critical problem in the evolving landscape of RAG systems. The explicit focus on multi-source data challenges—namely sparsity and inconsistency—is highly relevant, as real-world knowledge is inherently fragmented and often contradictory. The dual innovations of multi-source line graphs for aggregation and multi-level confidence calculation for filtering are logically sound and empirically validated.

Inspirations and Applications:

The concept of transforming a knowledge graph into a line graph (where nodes are triples) is particularly elegant for consolidating relationships and making implicit connections explicit. This could be highly valuable in other domains where knowledge fragmentation across heterogeneous databases is common, such as bioinformatics (integrating data from various biological databases) or supply chain management (linking disparate data on suppliers, logistics, and inventory).
The multi-level confidence computing mechanism, with its graph-level and node-level assessments, provides a robust blueprint for developing more trustworthy AI systems in high-stakes domains like finance, legal tech, and medical diagnostics, where hallucinations can have severe consequences. The idea of balancing LLM-assessed authority with historical authority (hyperparameter $\alpha$ ) is a practical way to leverage cutting-edge LLM capabilities while maintaining the stability and reliability of established data.
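The line graph idea mentioned above can be illustrated with the standard graph-theoretic construction: each triple (an edge of the knowledge graph) becomes a node, and two nodes are linked when their triples share an entity. The Python sketch below shows only this basic transformation and does not model the multi-source aggregation that the paper's MLG adds; the `line_graph` function is illustrative, and the last triple is hypothetical.

```python
from itertools import combinations


def line_graph(triples):
    """Map each triple to the set of other triples sharing a head or tail entity."""
    adjacency = {t: set() for t in triples}
    for a, b in combinations(triples, 2):
        entities_a = {a[0], a[2]}
        entities_b = {b[0], b[2]}
        if entities_a & entities_b:  # the two triples touch a common entity
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


triples = [
    ("CA981", "Status", "Delayed"),
    ("CA981", "DelayReason", "Typhoon"),
    ("CA981", "Departure", "PEK"),
    ("Typhoon", "Impacts", "PEK"),  # hypothetical triple, for illustration only
]

for triple, neighbours in line_graph(triples).items():
    print(triple, "->", sorted(neighbours))
```

In the resulting structure, facts about the same flight sit next to each other, and the typhoon fact is adjacent to both the delay reason and the departure airport, so related evidence can be retrieved as one connected neighbourhood.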

Potential Issues, Unverified Assumptions, and Areas for Improvement:

Complexity of MLG Construction: While MLG construction significantly reduces query time, the preprocessing time (PT) can still be substantial, especially for very large and highly heterogeneous datasets. The scalability of MLG construction itself, particularly for dynamically updating knowledge bases, could be a challenge. The complexity of managing node and edge representations when translating diverse structured, semi-structured, and unstructured data into a single MLG needs careful consideration.
LLM Dependence: The reliance on LLM-based expert evaluation for $Auth_{LLM}(v)$ is a double-edged sword. While it offers adaptability, it introduces the LLM's own potential for bias, non-determinism, and hallucination into the confidence calculation itself. The paper acknowledges this as a security vulnerability, and mitigating this "hallucination in hallucination mitigation" is crucial. Further research into explainable AI for these LLM-based authority assessments could enhance trust.
Generalizability of Confidence Thresholds: The fixed confidence thresholds (e.g., node threshold 0.7, graph threshold 0.5) might not generalize perfectly across all domains or query types. An adaptive, context-aware mechanism for setting these thresholds could improve robustness.
Semantic Nuance in Inconsistency: While mutual information entropy is good for content similarity, detecting subtle semantic inconsistencies (e.g., two statements being technically correct but contradicting in implication) might require more advanced semantic reasoning capabilities.
Real-time Updates: For highly dynamic multi-source environments (e.g., real-time stock data), the overhead of MLG reconstruction or incremental updates might be a bottleneck. The paper briefly mentions incremental estimation for historical authority, but a more detailed exploration of dynamic MLG maintenance would be valuable.

Overall, MultiRAG represents a significant step forward in making RAG systems more reliable in complex multi-source scenarios. Its strength lies in its thoughtful integration of knowledge graph structures with a pragmatic, multi-level confidence mechanism, offering a robust framework for combating retrieval-induced hallucinations.
