
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Published: 10/08/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

HiPRAG introduces a novel hierarchical process-reward method to tackle the common over-search and under-search issues in agentic retrieval-augmented generation, significantly improving search efficiency and accuracy across multiple QA benchmarks and demonstrating the importance of optimizing the reasoning process itself rather than only the final outcome.

Abstract

Agentic Retrieval-Augmented Generation (RAG) is a powerful technique for incorporating external information that Large Language Models (LLMs) lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a Reinforcement Learning (RL) framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a novel training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B), outperforming strong agentic RAG baselines. This is accomplished while dramatically improving search efficiency, reducing the over-search rate from over 27% in baselines to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents. We will release our code upon acceptance.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

1.2. Authors

The paper lists "Anonymous authors" as it is currently under double-blind review. Therefore, specific research backgrounds and affiliations are not disclosed.

1.3. Journal/Conference

The paper is posted on OpenReview, as indicated by the provided link (https://openreview.net/pdf?id=Gt4v9WBPzm). OpenReview is a platform for disseminating research papers, particularly those under review for conferences, that allows public peer review and discussion. The venue's reputation and influence depend on the associated conference, which is not explicitly stated here but is implied to be a major machine learning or NLP venue given the topic and quality.

1.4. Publication Year

The publication date is listed as 2025-10-08 (UTC), indicating the preprint was posted in late 2025.

1.5. Abstract

Agentic Retrieval-Augmented Generation (RAG) empowers Large Language Models (LLMs) to use external information for problem-solving. However, these systems often suffer from inefficient search behaviors like over-search (retrieving redundant information) and under-search (failing to search when needed), leading to unnecessary computational overhead and unreliable outputs. Existing training methods, primarily outcome-based Reinforcement Learning (RL), lack the fine-grained control to address these issues. This paper introduces HiPRAG (Hierarchical Process Rewards for Efficient agentic RAG), a novel RL training methodology that incorporates a fine-grained, knowledge-grounded process reward. HiPRAG evaluates each search decision's necessity by decomposing the agent's reasoning trajectory into discrete, parsable steps. It applies a hierarchical reward function that grants an additional bonus for optimal search and non-search steps, in addition to standard outcome and format rewards. Experiments with Qwen2.5 and Llama-3.2 models across seven QA benchmarks demonstrate that HiPRAG achieves higher average accuracies (65.4% for 3B, 67.2% for 7B) than strong agentic RAG baselines. Crucially, it dramatically improves search efficiency, reducing the over-search rate from over 27% to 2.3% and also lowering the under-search rate. The results highlight the efficacy of optimizing the reasoning process itself, rather than just the final outcome, and show strong generalizability across various RL algorithms, model families, sizes, and types.

https://openreview.net/pdf?id=Gt4v9WBPzm This is a preprint link on OpenReview, indicating the paper is currently under double-blind review. The authors state, "We will release our code upon acceptance."

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the prevalence of suboptimal search behaviors in Agentic Retrieval-Augmented Generation (RAG) systems. While agentic RAG, where Large Language Models (LLMs) can autonomously decide when and what to retrieve, is a powerful technique for accessing external information, its practical application is hampered by two main issues:

  1. Over-search: The agent retrieves information it already possesses or performs redundant searches, leading to unnecessary computational overhead, increased latency, and wasted resources.

  2. Under-search: The agent fails to perform a search when external knowledge is actually required, resulting in factual errors, hallucinations, and unreliable outputs.

    These inefficiencies undermine both the accuracy and efficiency of agentic RAG systems. Current training methods, often based on Reinforcement Learning (RL) with outcome-based rewards (e.g., rewarding only the correctness of the final answer), lack the granularity to effectively address these step-level inefficiencies. Existing efforts to introduce process-level rewards often rely on imperfect proxies like confidence scores or separately trained reward models, which can introduce bias or misjudge the true quality of a reasoning step. The fundamental challenge is the lack of explicit, step-specific feedback on individual retrieval decisions.

The paper's innovative idea is to provide fine-grained, knowledge-grounded process rewards to directly optimize the agent's search behavior at each step of its reasoning trajectory. Instead of coarse, outcome-only feedback, HiPRAG aims to give precise signals about when a search was necessary, when it was redundant, and when it was missed.

2.2. Main Contributions / Findings

The primary contributions of the HiPRAG paper are:

  • Novel RL Training Methodology with Hierarchical Process Rewards: The paper proposes HiPRAG, a new training framework for agentic RAG systems that uses a hierarchical and knowledge-aware process reward. This mechanism provides fine-grained supervision on the agent's search behavior, moving beyond traditional outcome-based rewards.

  • Efficient On-the-Fly Detection of Suboptimal Searches: HiPRAG introduces robust and direct methods for detecting over-search and under-search behaviors during the RL training phase. This is enabled by a redesigned, structured output format that allows rule-based parsing of reasoning steps and utilizes external LLMs for efficient semantic equivalence and factual accuracy checks on-the-fly.

  • Demonstrated Superior Performance and Efficiency: Experiments on seven diverse QA benchmarks using Qwen2.5 and Llama-3.2 models show significant improvements:

    • Accuracy: Achieved average accuracies of 65.4% (3B models) and 67.2% (7B models) in Cover Exact Match (CEM), outperforming strong agentic RAG baselines.
    • Efficiency: Dramatically improved search efficiency, reducing the over-search rate (OSR) from over 27% in baselines to just 2.3%, and concurrently lowering the under-search rate (USR).
  • Strong Generalizability: The method demonstrates good generalizability across different RL algorithms (PPO, GRPO), model families (Qwen2.5, Llama-3.2), model sizes (3B, 7B), and model types (base, instruct).

    These findings collectively demonstrate that HiPRAG effectively solves the problem of suboptimal search behaviors by optimizing the reasoning process itself, leading to more accurate, efficient, and reliable agentic RAG systems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand HiPRAG, a grasp of several core concepts in natural language processing (NLP) and artificial intelligence (AI) is essential:

  • Large Language Models (LLMs): These are advanced deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They can perform various tasks like question answering, summarization, and translation. Examples include GPT series, Llama, and Qwen.
  • Retrieval-Augmented Generation (RAG): RAG is a technique that enhances LLMs by enabling them to retrieve relevant information from an external knowledge base (like a database or a collection of documents) before generating a response. This helps LLMs access up-to-date or domain-specific information that wasn't included in their training data, thereby reducing hallucinations and improving factual accuracy. A basic RAG system typically performs retrieval once based on the initial query.
  • Agentic RAG / LLM Agents: This is an evolution of RAG where the LLM acts as an "agent" capable of multi-step reasoning and autonomous decision-making. An agentic RAG system can dynamically decide when to search, what to search for (by generating a query), how to use the retrieved information, and when to stop searching to formulate a final answer. This often involves an iterative process of Think-Act-Observe.
  • Reinforcement Learning (RL): RL is a paradigm of machine learning where an "agent" learns to make decisions by interacting with an "environment." The agent receives rewards or penalties for its actions, aiming to maximize cumulative reward over time.
    • Policy ($\pi$): The agent's strategy, mapping observed states to actions. In LLM agents, this is often the LLM itself, which generates text (actions) given a prompt (state).
    • Reward Function ($R$): Defines the goal of the RL problem by assigning numerical values to the agent's actions and outcomes. A positive reward encourages desirable behavior, while a negative reward (penalty) discourages undesirable behavior.
    • Trajectory: A sequence of states, actions, and rewards from the beginning to the end of an episode (e.g., answering a question).
  • Proximal Policy Optimization (PPO): A widely used Reinforcement Learning algorithm that balances exploration and exploitation while ensuring that policy updates are not too drastic, which helps maintain training stability. PPO achieves this by optimizing a "clipped" objective function, preventing large policy updates that could destabilize training.
  • Group Relative Policy Optimization (GRPO): Another Reinforcement Learning algorithm, often considered a critic-free approach, which can be more sample-efficient for LLM training compared to PPO, although sometimes at the cost of less training stability. It estimates policy gradients based on relative returns within groups of trajectories.

3.2. Previous Works

The paper builds upon a rich body of prior research in Agentic RAG and Reinforcement Learning for tool use. Key related works include:

  • ReAct (Yao et al., 2023): This foundational framework demonstrated that LLMs can synergize reasoning (generating thoughts or plans) and acting (using tools like search engines). It introduced the Thought-Action-Observation loop, which is central to many agentic systems.
  • Chain-of-Retrieval (Wang et al., 2025b) and DeepRAG (Guan et al., 2025): These works further refined the structured retrieval process itself, allowing for sequential or multi-step retrieval to handle complex queries, building upon the chain-of-thought reasoning paradigm.
  • RL for Tool Use:
    • ToolRL (Qian et al., 2025a) and ToRL (Li et al., 2025c): These studies showed that Reinforcement Learning from rewards based on task success can significantly scale and improve an LLM's ability to integrate and use various tools, including search engines.
    • RAG-RL (Huang et al., 2025a): Uses RL and curriculum learning to enhance RAG performance, demonstrating the potential of RL in this domain.
  • RL-Based Agentic RAG Systems:
    • Search-R1 (Jin et al., 2025b): A key baseline in the paper, it trains LLMs to reason and leverage search engines using Reinforcement Learning. It introduced the concept of outcome + format reward, providing a reward if the final answer is correct AND the output follows a specified format. This system often generates reasoning within <think> XML blocks, which HiPRAG identifies as a limitation due to parsing difficulties.
    • R1-Searcher (Song et al., 2025a) and R1-Searcher++ (Song et al., 2025b): These works further incentivize search capabilities and dynamic knowledge acquisition in LLMs using RL, focusing on improving overall performance.
    • β-GRPO (Wu et al., 2025): Another strong RL baseline that aims to mitigate suboptimal search behaviors by reducing uncertainty, and is compared against in HiPRAG experiments.
  • Efficient Agentic RAG & Tool Use (Addressing Inefficiency):
    • Uncertainty Detection (Mallen et al., 2023; Dhole, 2025): Early approaches to adaptive retrieval relied on heuristics or classifiers to detect model uncertainty and trigger retrieval.
    • Self-Awareness/Information Needs (Su et al., 2024; Zubkova et al., 2025; Yao et al., 2025; Huanshuo et al., 2025): More advanced methods learn to assess an LLM's real-time information needs or self-awareness from its internal states for dynamic retrieval decisions.
    • Process-Level Reward Shaping:
      • R1-Searcher++ (Song et al., 2025b) and Synergistic Reasoning Agents (Huang et al., 2025b): Use RL to incentivize necessary knowledge acquisition and dynamic search.
      • ReARTeR (Sun et al., 2025): Introduces a framework with a trustworthy process reward model to score and refine each step in a RAG pipeline. This is conceptually similar to HiPRAG but relies on a separate reward model, which HiPRAG avoids for direct, on-the-fly evaluation.
      • SMART (Qian et al., 2025b), SMARTCAL (Shen et al., 2024), and OTC (Wang et al., 2025a): These train agents to be self-aware and make optimal tool calls, often using RL to mitigate tool overuse. Some incorporate length or retrieval time-based penalties, which HiPRAG argues can lead to under-search by oversimplifying the problem.
      • Verifiable Stepwise Rewards (Yue et al., 2025; Ye et al., 2025): These methods also use stepwise rewards to promote more efficient general reasoning paths, aligning with HiPRAG's philosophy.

3.3. Technological Evolution

The field of retrieval-augmented generation has evolved through several stages:

  1. Early RAG (e.g., DPR, RAG models by Lewis et al., 2020): Initial approaches focused on performing a single retrieval at the beginning of the generation process, based on the input query. The retrieved documents were then concatenated with the query and fed into a generative LLM. This was a significant improvement over pure parametric LLMs but lacked dynamic interaction.

  2. Prompt-Based Agentic RAG (e.g., ReAct, IRCoT, Search-o1): The next evolution introduced the concept of agents, where LLMs could dynamically interact with external tools. Through sophisticated prompting strategies (e.g., chain-of-thought prompting), LLMs were instructed to interleave reasoning steps with tool calls (like search), creating multi-step reasoning trajectories. These systems relied heavily on the LLM's ability to follow instructions and generate appropriate tool calls based on prompt engineering. However, their behavior could be inconsistent and suboptimal without explicit training.

  3. RL-Tuned Agentic RAG (e.g., Search-R1, R1-Searcher, β-GRPO): Recognizing the limitations of purely prompt-based methods, researchers began applying Reinforcement Learning to explicitly train LLM agents. In this stage, the LLM's actions (e.g., generating a search query, deciding to stop searching) are rewarded or penalized based on the outcome of a task (e.g., correct answer). This allowed for more robust and optimized tool-use policies. However, most RL approaches still focused on outcome-based rewards, meaning they only gave feedback on the final answer, not the quality of individual reasoning or search steps. This often led to over-search or under-search as the agent didn't receive fine-grained feedback on its process.

    HiPRAG fits into this timeline as a significant advancement in the RL-tuned Agentic RAG stage. It addresses the critical gap of fine-grained process-level optimization. While previous RL methods focused on task success, HiPRAG explicitly targets the efficiency and optimality of each step in the reasoning trajectory by introducing a hierarchical, knowledge-aware process reward.

3.4. Differentiation Analysis

Compared to the main methods in related work, HiPRAG introduces several core differences and innovations:

  • Fine-Grained Process Reward vs. Outcome-Based/Coarse Proxies:

    • Prior Methods: Many existing RL-based agentic RAG systems (e.g., Search-R1, R1-Searcher) primarily use outcome-based rewards (correctness of the final answer) or coarse process proxies (e.g., length penalties, confidence scores, separately trained reward models like in ReARTeR). Length penalties can inadvertently discourage necessary searches (under-search), while confidence scores or learned reward models can be imperfect or introduce bias.
    • HiPRAG's Innovation: HiPRAG provides explicit, step-specific feedback on each retrieval decision. It directly evaluates whether a specific search was superfluous (over-search) or missing (under-search) for that particular step. This knowledge-grounded process reward offers a much more accurate and direct training signal for efficiency.
  • On-the-Fly Direct Detection vs. Complex Regeneration/Proxies:

    • Prior Methods: Some attempts at over-search detection involved complex regeneration pipelines (e.g., removing context and re-prompting, as mentioned by Wu et al., 2025), which were computationally expensive and prone to producing unnatural outputs.
    • HiPRAG's Innovation: HiPRAG proposes a direct and efficient on-the-fly detection mechanism. For over-search, it isolates the search query, regenerates its answer, and uses an external LLM to check semantic equivalence with the original step's conclusion. For under-search, it verifies the factual accuracy and internal logic of non-search steps with an external verifier LLM. This direct approach is faster, more robust, and provides a clearer signal.
  • Structured, Parsable Output Format:

    • Prior Methods: Frameworks like Search-R1 often use <think> blocks that mix reasoning, planning, and conclusions, making it difficult to programmatically isolate discrete reasoning steps for fine-grained evaluation without expensive post-hoc interpretation by another LLM.
    • HiPRAG's Innovation: HiPRAG enforces a strict, machine-parsable output format with explicit <step> tags, clearly delineating search and non-search steps. This structured format is crucial for enabling the on-the-fly detection and application of process rewards.
  • Hierarchical Reward Function:

    • Prior Methods: Reward functions might prioritize one aspect (e.g., outcome correctness or efficiency) over another, potentially leading to undesired trade-offs.

    • HiPRAG's Innovation: HiPRAG uses a hierarchical reward that first incentivizes basic format adherence and final answer correctness. Only once these primary goals are met does it provide an additional bonus for process optimality (optimal search and non-search steps). This dynamic focus prevents over-suppression of search capabilities during early training while ensuring efficiency later.

      In essence, HiPRAG differentiates itself by moving from a reactive, outcome-centric optimization to a proactive, process-centric optimization, enabled by structured outputs and direct, real-time evaluation of each reasoning step.

4. Methodology

4.1. Principles

The core idea behind HiPRAG is to directly address the suboptimal search behaviors (over-search and under-search) in agentic RAG systems by optimizing the agent's reasoning process itself, rather than solely relying on the final outcome. This is achieved by integrating a fine-grained, knowledge-grounded process reward into the Reinforcement Learning (RL) training loop. The theoretical basis and intuition are that for an LLM agent to be truly efficient and accurate, it needs specific, step-level feedback on whether its decision to search or not search was optimal. If a model is only rewarded for the final answer, it might achieve correctness through inefficient means (e.g., excessive searching) or fail due to insufficient searching, without ever learning why its intermediate steps were good or bad. HiPRAG provides this missing link by:

  1. Making Reasoning Parsable: Structuring the agent's output so each reasoning step is clearly identifiable.

  2. Detecting Suboptimalities Directly: Implementing efficient, on-the-fly mechanisms to determine if each search action was necessary or if a crucial search was missed.

  3. Hierarchically Rewarding Process Quality: Incorporating these detections into a reward function that guides the agent to first achieve basic correctness and format, and then to optimize for search efficiency.

    This approach aims to cultivate a more nuanced understanding in the agent of its own knowledge boundaries and information needs, leading to more robust, accurate, and efficient decision-making when interacting with external tools.

4.2. Core Methodology In-depth

The HiPRAG methodology is composed of three interconnected components: a structured output format, efficient detection mechanisms for suboptimal searches, and a hierarchical reward function. The overall training workflow is summarized in Figure 1.

4.2.1. Decomposing Reasoning Trajectory into Parsable Steps

A significant challenge for implementing process rewards is the difficulty in programmatically parsing an agent's reasoning trajectory. Traditional agentic RAG frameworks, such as Search-R1, often generate reasoning within multiple <think> XML blocks, interleaved with search queries and retrieved information. This format presents two key issues:

  1. Ambiguous Step Boundaries: A single <think> block often contains a mix of conclusions from previous actions and planning for current ones, making it hard to isolate a discrete, self-contained reasoning step.

  2. Implicit Internal Reasoning: Steps where the agent relies on its parametric knowledge (i.e., its internal knowledge without searching) are embedded as prose within <think> blocks, making them indistinguishable from analytical text without complex natural language understanding, which would be slow and error-prone during RL training.

    To overcome these issues, HiPRAG enforces a structured, machine-parsable output format during RL training. The agent's prompt and rollout logic are modified to generate its entire reasoning trajectory within a single <think> block, which then contains a sequence of discrete <step> blocks. Each <step> block represents a single logical reasoning action.

Each step can be one of two types:

  • Search Step: Distinguished by the presence of a <search> block (containing the query) and a <context> block (containing retrieved information).

  • Non-Search Step: Relies solely on the agent's internal knowledge and contains only <reasoning> and <conclusion> tags, without <search> or <context>.

    Formally, a complete reasoning trajectory $T$ for a given question is defined as a sequence of $n$ steps with a final answer $a$: $T = \{ s_1, s_2, \ldots, s_n, a \}$

Each step $s_i$ (for $i \in [1, n]$) can be:

  • A search step ($s_i^R$): represented as a tuple $(r_i, q_i, c_i, o_i)$, where:
    • $r_i$: The reasoning block for this step (planning and analysis).
    • $q_i$: The search query generated.
    • $c_i$: The retrieved context from the search.
    • $o_i$: The conclusion or summary of knowledge gained.
  • A non-search step ($s_i^{NR}$): represented as a tuple $(r_i, o_i)$, where:
    • $r_i$: The reasoning block for this step.
    • $o_i$: The conclusion or summary of knowledge gained.

      This structured format ensures that each reasoning step, along with its type (search or non-search) and its components (reasoning, query, context, conclusion), can be deterministically identified and extracted.

Adherence to this schema is enforced through two parallel approaches:

  1. System Prompt: The agent's system prompt is updated with explicit instructions and few-shot examples that demonstrate the correct usage of all XML tags (e.g., <think>, <step>, <reasoning>, <search>, <context>, <conclusion>, <answer>). The prompt for this parsable output format is shown in Figure 4 in the Appendix.

  2. Positive Reward for Correct Format: As detailed in Section 4.2.3, the RL framework applies a positive reward for correct output formats, incentivizing the model to consistently produce parsable trajectories.

    The inference process for this format is outlined in Algorithm 1, and the mechanism for checking format correctness is detailed in Algorithm 2 and Algorithm 3 in the Appendix.

The following figure (Figure 3 from the original paper) illustrates the difference between the original Search-R1 format and HiPRAG's parsable format for the same multi-hop question.

Figure 3: Comparison of reasoning trajectory formats for the same multi-hop question. Each logical step is highlighted in a consistent color across both formats to show the correspondence. The actual retrieved documents here are replaced by their summarization to improve readability. (The left side shows the original Search-R1 format; the right side shows HiPRAG's parsable format.)

The structured output format is a critical enabler for the on-the-fly detection of suboptimal searches, as it allows for straightforward extraction of step-level information.
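
To make this concrete, below is a minimal sketch of how such a trajectory could be parsed with simple rules. The tag names follow the paper's schema, but the `Step` data class, helper names, and regular expressions are illustrative choices rather than the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Step:
    reasoning: str
    conclusion: str
    query: Optional[str] = None    # present only in search steps
    context: Optional[str] = None  # present only in search steps

    @property
    def is_search(self) -> bool:
        return self.query is not None

def _tag(block: str, name: str) -> Optional[str]:
    """Return the content of the first <name>...</name> block, if any."""
    m = re.search(rf"<{name}>(.*?)</{name}>", block, re.DOTALL)
    return m.group(1).strip() if m else None

def parse_trajectory(text: str) -> Tuple[List[Step], Optional[str]]:
    """Split a rollout into discrete steps plus the final answer."""
    think = _tag(text, "think") or ""
    steps = [
        Step(
            reasoning=_tag(block, "reasoning") or "",
            conclusion=_tag(block, "conclusion") or "",
            query=_tag(block, "search"),
            context=_tag(block, "context"),
        )
        for block in re.findall(r"<step>(.*?)</step>", think, re.DOTALL)
    ]
    return steps, _tag(text, "answer")
```

Because each step is a self-contained block, its type follows directly from whether a <search> tag is present, which is what makes the on-the-fly checks in the next section cheap to run.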

4.2.2. On-the-Fly Detection of Suboptimal Searches

With the reasoning trajectories segmented into discrete, parsable steps, HiPRAG can implement efficient and robust checks for over-search and under-search during the RL training phase.

4.2.2.1. Over-search Detection

Previous methods for over-search detection were often complex and computationally expensive. They involved removing the search context and regenerating an answer with a fixed instruction to rely on internal knowledge. This approach was brittle, as appended instructions could conflict with the agent's original reasoning and lead to unnatural outputs.

HiPRAG proposes a more direct and robust method:

  1. For each search step $s_i^R = (r_i, q_i, c_i, o_i)$, the search query $q_i$ is taken as a standalone question.

  2. An answer $o_i'$ is then regenerated by prompting the agent to answer only $q_i$ without performing a search.

  3. An external LLM judge (specifically, gpt-4.1-mini) is used to assess the semantic equivalence of the original step's conclusion $o_i$ and the re-generated answer $o_i'$.

  4. If $o_i$ and $o_i'$ are semantically equivalent, it implies the information retrieved by $q_i$ was already known to the model or could have been inferred internally, making the search redundant. In this case, the step is flagged as an over-search by the $\mathrm{Over}(\cdot)$ detector function.

    This method is faster because it isolates the core knowledge required by the query and provides a more reliable signal. The prompt used for the external LLM judge for over-search detection is shown in Figure 5 in the Appendix.
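
As an illustration, the over-search check for a single parsed search step could look like the sketch below. The two callables `agent_answer_fn` (the policy model answering without retrieval) and `judge_fn` (an external judge such as gpt-4.1-mini) are assumed wrappers around LLM calls, and the prompt wording is ours, not the paper's.

```python
def is_over_search(agent_answer_fn, judge_fn, step) -> bool:
    """Flag a search step whose query the agent could already answer without retrieval."""
    # 1) Treat the search query q_i as a standalone question and answer it
    #    from parametric knowledge only (no retrieval allowed).
    regenerated = agent_answer_fn(
        f"Answer the following question without searching:\n{step.query}"
    )
    # 2) Ask the external judge whether the regenerated answer o_i' is
    #    semantically equivalent to the step's original conclusion o_i.
    verdict = judge_fn(
        "Are the two answers below semantically equivalent? Reply YES or NO.\n"
        f"Question: {step.query}\n"
        f"Answer A: {step.conclusion}\n"
        f"Answer B: {regenerated}"
    )
    # Equivalence implies the retrieved information was already known,
    # so the search was redundant (over-search).
    return verdict.strip().upper().startswith("YES")
```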

4.2.2.2. Under-search Detection

Under-search occurs when the agent fails to retrieve necessary external knowledge, leading to factual errors or hallucinations.

HiPRAG detects under-search for non-search steps:

  1. For each non-search step $s_i^{NR} = (r_i, o_i)$, an external verifier model (specifically, gpt-5-mini) is prompted to assess the factual accuracy of its reasoning $r_i$ and conclusion $o_i$.

  2. The verifier model checks two aspects:

    • Factual Accuracy: Is the statement in $r_i$ and $o_i$ factually correct?
    • Internal Logic: Does $o_i$ logically follow from $r_i$ within this same step?
  3. If the content is found to be incorrect or illogical, the step is flagged as an under-search by the $\mathrm{Under}(\cdot)$ detector function, indicating that the agent should have used the search tool but failed to do so.

    The prompt used for the external verifier model for under-search detection is shown in Figure 6 in the Appendix.

In actual implementation, both detection methods can work concurrently to improve speed. For batch over-search detection during the RL rollout phase, the re-generation step can be executed separately through batch generation before using the external LLM judge.
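
The under-search check admits a similar sketch for a single non-search step; `verify_fn` is an assumed wrapper around the external verifier (gpt-5-mini in the paper), and the prompt text is illustrative.

```python
def is_under_search(verify_fn, step) -> bool:
    """Flag a non-search step whose reasoning or conclusion does not hold up."""
    verdict = verify_fn(
        "Check the step below on two criteria: (1) the reasoning and conclusion "
        "are factually correct, and (2) the conclusion follows logically from "
        "the reasoning. Reply VALID if both hold, otherwise INVALID.\n"
        f"Reasoning: {step.reasoning}\n"
        f"Conclusion: {step.conclusion}"
    )
    # An invalid internal-knowledge step means the agent should have searched.
    return verdict.strip().upper().startswith("INVALID")
```

In a training loop, these checks would typically be batched per rollout, mirroring the note above that the regeneration needed for over-search detection can be executed through batch generation before the judge is invoked.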

4.2.3. Hierarchical Process Reward Calculation

A naive reward function that simply penalizes search (e.g., based on length or confidence) can harm the agent's ability to retrieve necessary information, especially for knowledge-intensive tasks. HiPRAG aims to incentivize optimal search behavior to improve both performance and efficiency, while preserving the agent's fundamental reasoning and search capabilities. To achieve this, the rewards are dynamically focused: initially prioritizing format and final answer correctness, and then shifting focus towards reasoning efficiency and process optimality once basic capabilities are established. This is accomplished through a hierarchical reward function.

Let's define the components of the reward function:

  • $A(a) \in \{0, 1\}$ is an indicator function that signals the correctness of the final answer $a$ of the trajectory $T$. Here, Cover Exact Match (CEM) (introduced in Section 5.2) is used. $A(a) = 1$ if the answer is correct, 0 otherwise.

  • $F(T) \in \{0, 1\}$ is an indicator function that signals whether the trajectory $T$ follows the required output format (as described in Section 4.2.1 and Algorithm 2 and 3). $F(T) = 1$ if the format is correct, 0 otherwise.

  • $N(T)$ is the total number of steps in the trajectory $T$.

  • $N_{\mathrm{corr}}(T)$ is the number of optimal (neither over-search nor under-search) steps in trajectory $T$. This is calculated as: $ N_{\mathrm{corr}}(T) = \left| \{ s^R \in T : \neg \mathrm{Over}(s^R) \} \right| + \left| \{ s^{NR} \in T : \neg \mathrm{Under}(s^{NR}) \} \right| $ where:

    • $s^R \in T$: A search step in trajectory $T$.

    • $\neg \mathrm{Over}(s^R)$: The logical NOT of the $\mathrm{Over}(\cdot)$ detector, meaning the search step $s^R$ is not an over-search.

    • $s^{NR} \in T$: A non-search step in trajectory $T$.

    • $\neg \mathrm{Under}(s^{NR})$: The logical NOT of the $\mathrm{Under}(\cdot)$ detector, meaning the non-search step $s^{NR}$ is not an under-search.

      The reward function also incorporates two hyperparameters:

  • $\lambda_f \in [0, 1]$: A format weight that controls the importance of format adherence.

  • $\lambda_p \geq 0$: A process bonus coefficient that controls the weight of the process optimality reward.

    The single merged hierarchical reward for a trajectory $T$ is defined as: $ R(T) = A(T)\bigl(1 - \lambda_f\bigr) + \lambda_f F(T) + \lambda_p A(T) F(T) \frac{N_{\mathrm{corr}}(T)}{N(T)}. $ Let's break down this formula:

  • The first two terms, $A(T)\bigl(1 - \lambda_f\bigr) + \lambda_f F(T)$, represent a weighted sum of the final answer correctness and format adherence. If $\lambda_p = 0$, this part is algebraically equivalent to the standard outcome + format reward used in prior work (e.g., Search-R1).

  • The third term, $\lambda_p A(T) F(T) \frac{N_{\mathrm{corr}}(T)}{N(T)}$, provides a gated process bonus. This bonus is only applied when both the final answer is correct ($A(T) = 1$) and the trajectory format is correct ($F(T) = 1$). If both are correct, the reward becomes $1 + \lambda_p \frac{N_{\mathrm{corr}}(T)}{N(T)}$.

    • The fraction $\frac{N_{\mathrm{corr}}(T)}{N(T)}$ represents the proportion of optimal steps in the trajectory, effectively measuring the efficiency and optimality of the agent's reasoning path.

      This hierarchical structure ensures a balanced training signal:

  1. Initial Focus: The agent is first incentivized to produce well-formed reasoning trajectories and correct answers. Without $A(T) = 1$ and $F(T) = 1$, the process bonus term is zero.

  2. Subsequent Optimization: Once these basic requirements are met, the agent receives an additional bonus for the efficiency and validity of its reasoning path (i.e., making fewer over-search or under-search errors).

    This approach avoids over-suppression of search capabilities by not punishing suboptimal steps too heavily if the final outcome and format are correct, while still directly encouraging the model to develop a more nuanced understanding of its knowledge boundaries.
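
Under these definitions, the reward computation itself is only a few lines. The sketch below assumes the detectors have already produced one suboptimality flag per parsed step; the default values $\lambda_f = 0.2$ and $\lambda_p = 0.4$ are the paper's main-experiment settings, and the function signature is ours.

```python
def hierarchical_reward(answer_correct: bool, format_correct: bool,
                        suboptimal_flags, lambda_f: float = 0.2,
                        lambda_p: float = 0.4) -> float:
    """Compute R(T). suboptimal_flags[i] is True if step i was flagged as
    over-search (search step) or under-search (non-search step)."""
    A = 1.0 if answer_correct else 0.0   # A(T): CEM on the final answer
    F = 1.0 if format_correct else 0.0   # F(T): format adherence

    # Outcome + format terms: A(T)(1 - lambda_f) + lambda_f * F(T)
    reward = A * (1.0 - lambda_f) + lambda_f * F

    # Gated process bonus, applied only when answer AND format are correct.
    if answer_correct and format_correct and suboptimal_flags:
        n_corr = sum(1 for flag in suboptimal_flags if not flag)  # N_corr(T)
        reward += lambda_p * n_corr / len(suboptimal_flags)       # divided by N(T)
    return reward
```

For example, a correct, well-formatted four-step trajectory with one suboptimal step receives 0.8 + 0.2 + 0.4 · (3/4) = 1.3, while the same trajectory with an incorrect answer receives only the format term, 0.2.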

The following table (Table 3 from the original paper) provides a description of all symbols and notations used in the methodology.

| Symbol | Description |
| --- | --- |
| $T$ | A complete reasoning trajectory for a given question. |
| $N(T)$ | The total number of steps in a trajectory $T$. |
| $a$ | The final answer within a trajectory $T$. |
| $s_i$ | The $i$-th step in a reasoning trajectory. |
| $s^R$ | A search step, represented as a tuple $(r_i, q_i, c_i, o_i)$. |
| $s^{NR}$ | A non-search (internal reasoning) step, represented as a tuple $(r_i, o_i)$. |
| $r_i$ | The reasoning block of a step, containing planning and analysis. |
| $q_i$ | The search query generated in a search step. |
| $c_i$ | The retrieved context returned from a search query. |
| $o_i$ | The conclusion or summary of knowledge from a step. |
| $o_i'$ | The re-generated answer used for over-search detection. |
| $A(a)$ | A function indicating if the final answer $a$ is correct (1) or not (0). |
| $F(T)$ | A function indicating if the trajectory $T$ follows the required format (1) or not (0). |
| $N_{\mathrm{corr}}(T)$ | The number of optimal (correct) steps in trajectory $T$. |
| $\mathrm{Over}(\cdot)$ | The detector function that identifies an over-search step. |
| $\mathrm{Under}(\cdot)$ | The detector function that identifies an under-search step. |
| $\lambda_f$ | A hyperparameter representing the format weight. |
| $\lambda_p$ | A hyperparameter representing the process bonus coefficient. |
| $R(T)$ | The final hierarchical reward calculated for a trajectory $T$. |

The following figure (Figure 1 from the original paper) provides a general overview of the HiPRAG training workflow, illustrating the components discussed above.

Figure 1: A general overview of the HiPRAG training workflow. The policy model generates a multi-step reasoning trajectory, and each step is evaluated on-the-fly to detect suboptimal search behaviors. A final hierarchical reward is then computed by combining a process bonus for step optimality with rewards for the final answer's correctness and proper formatting.

5. Experimental Setup

This section details the framework used to evaluate HiPRAG, including datasets, evaluation metrics, baselines, and training procedures.

5.1. Datasets

The experiments used a comprehensive set of question-answering (QA) datasets, covering both single-hop and multi-hop queries.

  • Training Set: A combination of the official training sets from:

    • NQ (Natural Questions) (Kwiatkowski et al., 2019): A large-scale dataset for open-domain question answering. Questions are real Google queries, and answers are short text spans from Wikipedia. This dataset is crucial for teaching single-fact retrieval.

    • HotpotQA (Yang et al., 2018): A multi-hop QA dataset where questions require reasoning over multiple documents to find the answer, often involving compositional reasoning. This dataset is essential for training agentic RAG models in complex, multi-step reasoning.

      This combined training set provides a diverse corpus for teaching agents both single-fact retrieval and multi-hop reasoning, critical for learning efficient reasoning.

  • Test Set: To assess both in-domain and out-of-domain generalization, models were evaluated on the development or test sets of seven diverse QA datasets:

    • NQ (Natural Questions): (Described above)

    • PopQA (Mallen et al., 2023): A knowledge-intensive QA dataset.

    • TriviaQA (Joshi et al., 2017): A large-scale dataset for reading comprehension and question answering, featuring questions from various trivia sources.

    • 2WikiMultiHopQA (Ho et al., 2020): A multi-hop QA dataset designed to evaluate reasoning steps over multiple Wikipedia articles.

    • Bamboogle (Press et al., 2023): A dataset designed to measure and narrow the compositionality gap in language models, often requiring complex reasoning.

    • HotpotQA: (Described above)

    • Musique (Trivedi et al., 2022): A multi-hop QA dataset where questions are formed by composing single-hop questions, requiring reasoning across multiple steps.

      These datasets are selected to provide a wide coverage of question types and complexity, allowing for a thorough evaluation of the agent's accuracy and reasoning capabilities in diverse scenarios.

5.2. Evaluation Metrics

The primary metrics used to evaluate HiPRAG focus on both correctness and efficiency.

  • Cover Exact Match (CEM) (Song et al., 2025a):

    • Conceptual Definition: This metric determines the correctness of a generated answer by checking if the ground-truth answer string is present anywhere within the model's generated response. It is a more lenient metric than strict Exact Match, as modern LLMs often generate verbose or explanatory responses that might contain the correct answer amidst additional text. CEM ensures that models are not unfairly penalized for valid answers embedded in longer, more conversational outputs.
    • Mathematical Formula: The paper describes CEM conceptually rather than providing a formal mathematical formula, implying a string-matching operation. If $G$ is the ground-truth answer string and $P$ is the model's generated answer string, then: $ \mathrm{CEM}(P, G) = \begin{cases} 1 & \text{if } G \text{ is a substring of } P \\ 0 & \text{otherwise} \end{cases} $
    • Symbol Explanation:
      • $P$: The answer string generated by the model.
      • $G$: The ground-truth answer string.
      • 1: Indicates the answer is correct according to CEM.
      • 0: Indicates the answer is incorrect according to CEM.
  • Over-search Rate (OSR):

    • Conceptual Definition: OSR measures the proportion of search steps that were deemed over-searches (i.e., redundant or unnecessary) relative to all identifiable search steps within a set of test samples. A lower OSR indicates better search efficiency, as the agent avoids querying for already known or easily inferable information.
    • Mathematical Formula: Given a set of test samples $\mathcal{D}_{\mathrm{test}}$, with each sample having a reasoning trajectory $T$: $ \mathrm{OSR} = \frac{\sum_{T \in \mathcal{D}_{\mathrm{test}}} \left| \{ s^R \in T : \mathrm{Over}(s^R) \} \right|}{\sum_{T \in \mathcal{D}_{\mathrm{test}}} \left| \{ s^R \in T \} \right|} $
    • Symbol Explanation:
      • $\mathcal{D}_{\mathrm{test}}$: The set of all test samples.
      • $T$: A single reasoning trajectory from the test set.
      • $s^R \in T$: A search step within the trajectory $T$.
      • $\mathrm{Over}(s^R)$: A boolean function (or detector) that returns True if the search step $s^R$ is an over-search, and False otherwise.
      • $|\{\dots\}|$: Denotes the count of elements in the set.
      • The numerator sums the count of all over-search steps across all trajectories in the test set.
      • The denominator sums the total count of all search steps across all trajectories in the test set.
  • Under-search Rate (USR):

    • Conceptual Definition: USR measures the proportion of non-search steps where the agent failed to retrieve necessary information (leading to factual errors or logical inaccuracies) relative to all identifiable non-search steps within a set of test samples. A lower USR indicates better reasoning robustness, as the agent correctly identifies when external knowledge is required and avoids hallucinations.
    • Mathematical Formula: Given a set of test samples $\mathcal{D}_{\mathrm{test}}$, with each sample having a reasoning trajectory $T$: $ \mathrm{USR} = \frac{\sum_{T \in \mathcal{D}_{\mathrm{test}}} \left| \{ s^{NR} \in T : \mathrm{Under}(s^{NR}) \} \right|}{\sum_{T \in \mathcal{D}_{\mathrm{test}}} \left| \{ s^{NR} \in T \} \right|} $
    • Symbol Explanation:
      • $\mathcal{D}_{\mathrm{test}}$: The set of all test samples.
      • $T$: A single reasoning trajectory from the test set.
      • $s^{NR} \in T$: A non-search step within the trajectory $T$.
      • $\mathrm{Under}(s^{NR})$: A boolean function (or detector) that returns True if the non-search step $s^{NR}$ is an under-search, and False otherwise.
      • $|\{\dots\}|$: Denotes the count of elements in the set.
      • The numerator sums the count of all under-search steps across all trajectories in the test set.
      • The denominator sums the total count of all non-search steps across all trajectories in the test set.
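
For concreteness, the three metrics above can be computed over parsed test trajectories roughly as follows. Each step object is assumed to carry `is_search`, `over_search`, and `under_search` flags produced by the detectors from Section 4.2.2, and the case-insensitive substring check in `cover_exact_match` is a simplifying normalization choice not spelled out in the paper.

```python
def cover_exact_match(prediction: str, gold: str) -> int:
    """CEM(P, G): 1 if the gold answer appears as a substring of the prediction."""
    return int(gold.strip().lower() in prediction.strip().lower())

def over_search_rate(trajectories) -> float:
    """OSR: flagged search steps / all search steps, pooled over the test set."""
    search_steps = [s for steps in trajectories for s in steps if s.is_search]
    if not search_steps:
        return 0.0
    return sum(s.over_search for s in search_steps) / len(search_steps)

def under_search_rate(trajectories) -> float:
    """USR: flagged non-search steps / all non-search steps, pooled over the test set."""
    non_search = [s for steps in trajectories for s in steps if not s.is_search]
    if not non_search:
        return 0.0
    return sum(s.under_search for s in non_search) / len(non_search)
```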

5.3. Baselines

HiPRAG was compared against a comprehensive set of baselines representing different paradigms in retrieval-augmented generation:

  1. Direct Inference:
    • Description: The LLM generates an answer directly without any retrieval mechanism. This serves as a lower bound, showcasing the LLM's inherent parametric knowledge.
  2. Standard RAG (Lewis et al., 2020):
    • Description: A conventional RAG setup where retrieval is performed once based on the initial query. The retrieved context is then provided to the LLM for a single-pass generation. This represents the basic form of RAG.
  3. Prompt-Based Agentic RAG:
    • Description: These methods rely on sophisticated prompt engineering (e.g., chain-of-thought) to achieve multi-step reasoning and dynamic search without explicit RL training.
    • IRCoT (Trivedi et al., 2023): Interleaving Retrieval with Chain-of-Thought reasoning for knowledge-intensive multi-step questions.
    • Search-o1 (Li et al., 2025b): Agentic Search-Enhanced Large Reasoning Models, another prompt-based agentic approach.
  4. RL-Based Agentic RAG:
    • Description: State-of-the-art methods that use Reinforcement Learning to train search agents, optimizing their policies for invoking tools and making search decisions.
    • Search-R1 (Jin et al., 2025b): Trains LLMs to reason and leverage search engines with Reinforcement Learning. It uses an outcome + format reward.
    • R1-Searcher (Song et al., 2025a): Incentivizes the search capability in LLMs via Reinforcement Learning.
    • R1-Searcher++ (Song et al., 2025b): An enhancement of R1-Searcher, further incentivizing dynamic knowledge acquisition.
    • β-GRPO (Wu et al., 2025): A method that uses Group Relative Policy Optimization to mitigate suboptimal agentic searches by reducing uncertainty.
    • Search-R1-step*: A variant of Search-R1 (v0.3) that uses its outcome + format reward but is adapted to HiPRAG's structured output format to isolate the impact of the format change.
    • β-GRPO-step*: A variant of β-GRPO that uses its reward but is adapted to HiPRAG's structured output format.

      These baselines were chosen to represent a spectrum of RAG evolution, from simple inference to advanced RL-trained agents, allowing for a fair comparison of HiPRAG's innovation in process-level optimization.

5.4. Training Details

  • Hardware: All RL-based models were trained using four NVIDIA A100 80GB GPUs.

  • Training Schedule: The training process ran for a total of 400 steps, with checkpoints saved every 50 steps.

  • Checkpoint Strategy: For evaluation, if training completed stably, the final checkpoint was used. If the training reward collapsed (indicating instability), the last stable checkpoint before the collapse was used to ensure a fair evaluation of the model's best-learned state.

  • Models:

    • Main Experiments: Qwen2.5-(3B/7B)-Instruct models (Qwen et al., 2025).
    • Generalizability Analysis: Llama3.2-3B-Instruct (Grattafiori et al., 2024) and Qwen2.5-3B (base model, not instruct-tuned).
  • Suboptimal Search Detectors:

    • Over-search detection: Performed by gpt-4.1-mini (OpenAI, 2025a).
    • Under-search detection: Relied on gpt-5-mini (OpenAI, 2025b).

      The use of these smaller, proprietary models was chosen for their fast inference speed and sufficient performance.

  • Reinforcement Learning Algorithms:

    • Core RL Algorithm: Proximal Policy Optimization (PPO) (Schulman et al., 2017) was used for main experiments due to its demonstrated training stability in complex LLM finetuning.
    • Comparison Algorithm: Group Relative Policy Optimization (GRPO) (Shao et al., 2024) was also used, with a group size of 5 and the same training parameters as PPO, to assess the impact of the RL algorithm choice.
  • Retrieval Environment:

    • Followed the Search-R1 setup.
    • Knowledge Source: The 2018 Wikipedia dump (Karpukhin et al., 2020).
    • Retriever: E5-base (Wang et al., 2024).
    • Retrieval Output: In each search step, the top-3 relevant passages were returned.
  • Inference Parameters:

    • RL Rollout Stage: Temperature and top p were set to 1 to encourage a high possibility of generating diverse reasoning trajectories.
    • Testing Phase: Temperature and top p were set to the models' default values for evaluation.
  • Reward Function Hyperparameters:

    • Main experiments: format weight $\lambda_f = 0.2$ and process bonus coefficient $\lambda_p = 0.4$.
    • Exploration of different $\lambda_p$ values was also conducted (see Section 6.3).

6. Results & Analysis

This section presents a comprehensive analysis of HiPRAG's performance, comparing it against state-of-the-art baselines and exploring the influence of various parameters and design choices.

6.1. Core Results Analysis

The main experimental results, presented in Table 1, demonstrate that HiPRAG significantly outperforms all baseline methods across a diverse set of seven question-answering benchmarks in terms of Cover Exact Match (CEM).
The following are the results from Table 1 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <td rowspan="2">Method</td> <td colspan="3">General QA</td> <td colspan="3">Multi-Hop QA</td> <td rowspan="2">Bamboogle</td> <td rowspan="2">Avg.</td> </tr>
<tr> <td>NQ</td> <td>TriviaQA</td> <td>PopQA</td> <td>HotpotQA</td> <td>2Wiki</td> <td>Musique</td> </tr>
</thead>
<tbody>
<tr> <td>Direct Inference</td> <td>27.0</td> <td>26.8</td> <td>40.1</td> <td>58.7</td> <td>16.0</td> <td>7.9</td> <td>15.9</td> <td>31.8</td> </tr>
<tr> <td>Standard RAG</td> <td>51.2</td> <td>54.7</td> <td>65.7</td> <td>56.9</td> <td>21.6</td> <td>18.5</td> <td>18.6</td> <td>45.3</td> </tr>
<tr> <td>IRCoT</td> <td>27.5</td> <td>36.0</td> <td>42.5</td> <td>51.4</td> <td>37.6</td> <td>19.4</td> <td>20.6</td> <td>36.4</td> </tr>
<tr> <td>Search-o1</td> <td>40.2</td> <td>42.2</td> <td>58.5</td> <td>56.1</td> <td>45.6</td> <td>15.1</td> <td>19.3</td> <td>43.9</td> </tr>
<tr> <td>R1-Searcher</td> <td>60.0</td> <td>73.0</td> <td>58.2</td> <td>60.4</td> <td>60.3</td> <td>32.9</td> <td>55.8</td> <td>60.6</td> </tr>
<tr> <td>R1-Searcher++</td> <td>61.0</td> <td>73.5</td> <td>59.0</td> <td>64.2</td> <td>63.2</td> <td>32.3</td> <td>58.7</td> <td>62.1</td> </tr>
<tr> <td>Search-R1</td> <td>61.2</td> <td>73.6</td> <td>56.5</td> <td>54.0</td> <td>63.6</td> <td>24.8</td> <td>48.4</td> <td>60.3</td> </tr>
<tr> <td>Search-R1-step*</td> <td>62.4</td> <td>74.4</td> <td>57.3</td> <td>54.8</td> <td>64.2</td> <td>25.3</td> <td>49.6</td> <td>61.2</td> </tr>
<tr> <td>β-GRPO</td> <td>65.0</td> <td>75.0</td> <td>60.0</td> <td>53.0</td> <td>66.0</td> <td>24.0</td> <td>52.0</td> <td>62.5</td> </tr>
<tr> <td>β-GRPO-step*</td> <td>62.4</td> <td>73.9</td> <td>61.3</td> <td>52.1</td> <td>66.0</td> <td>22.8</td> <td>54.4</td> <td>62.1</td> </tr>
<tr> <td>HiPRAG-3B</td> <td>68.7</td> <td>75.5</td> <td>66.3</td> <td>57.4</td> <td>67.4</td> <td>24.1</td> <td>41.6</td> <td>65.4</td> </tr>
<tr> <td>HiPRAG-7B</td> <td>71.2</td> <td>76.3</td> <td>63.2</td> <td>62.4</td> <td>71.7</td> <td>34.1</td> <td>52.8</td> <td>67.2</td> </tr>
</tbody>
</table></div>

Key observations from Table 1:

  • Overall Performance: HiPRAG-7B achieves the highest average CEM score of 67.2%, significantly outperforming the next-best baselines, β-GRPO (62.5%) and R1-Searcher++ (62.1%). HiPRAG-3B also shows a strong average CEM of 65.4%, surpassing all baselines, including 7B models like R1-Searcher++. This indicates that HiPRAG's fine-grained, process-based reward mechanism effectively guides agents to develop more robust and accurate reasoning trajectories.

  • Performance across Datasets:

    • HiPRAG models show consistent gains across most datasets. For General QA datasets like NQ, TriviaQA, and PopQA, HiPRAG achieves top scores.
    • For Multi-Hop QA datasets, HiPRAG-7B shows impressive performance on HotpotQA (62.4%), 2Wiki (71.7%), and Musique (34.1%), often outperforming baselines by a large margin on these complex reasoning tasks.
  • Efficiency Gains (from Abstract): Beyond accuracy, HiPRAG delivers unprecedented gains in efficiency. The over-search rate (OSR) is dramatically reduced from over 27% in baselines to just 2.3%, while concurrently lowering the under-search rate (USR). This demonstrates the direct efficacy of optimizing the reasoning process.
The superior performance of HiPRAG models, especially the 3B variant surpassing larger baseline models, suggests that the proposed training methodology is a more effective path to performance gains than merely scaling model size with conventional rewards.

6.2. Data Presentation (Tables)

The following are the results from Table 2 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <th>Base Model</th> <th>RL Algo.</th> <th>Method</th> <th>Avg. CEM</th> <th>Avg. OSR</th> <th>Avg. USR</th> </tr>
</thead>
<tbody>
<tr> <td>Llama-3.2-3B-Instruct</td> <td>PPO</td> <td>baseline</td> <td>56.4</td> <td>7.3</td> <td>57.6</td> </tr>
<tr> <td>Llama-3.2-3B-Instruct</td> <td>PPO</td> <td>HiPRAG</td> <td>64.8</td> <td>6.0</td> <td>49.7</td> </tr>
<tr> <td>Qwen2.5-3B-Instruct</td> <td>GRPO</td> <td>baseline</td> <td>58.5</td> <td>8.4</td> <td>52.1</td> </tr>
<tr> <td>Qwen2.5-3B-Instruct</td> <td>GRPO</td> <td>HiPRAG</td> <td>64.4</td> <td>4.1</td> <td>33.2</td> </tr>
<tr> <td>Qwen2.5-3B</td> <td>PPO</td> <td>baseline</td> <td>60.3</td> <td>3.8</td> <td>44.0</td> </tr>
<tr> <td>Qwen2.5-3B</td> <td>PPO</td> <td>HiPRAG</td> <td>65.4</td> <td>3.2</td> <td>41.9</td> </tr>
<tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO</td> <td>baseline</td> <td>59.3</td> <td>6.1</td> <td>47.5</td> </tr>
<tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO</td> <td>HiPRAG</td> <td>64.1</td> <td>4.9</td> <td>38.1</td> </tr>
<tr> <td>Qwen2.5-7B-Instruct</td> <td>GRPO</td> <td>baseline</td> <td>61.2</td> <td>5.2</td> <td>43.3</td> </tr>
<tr> <td>Qwen2.5-7B-Instruct</td> <td>GRPO</td> <td>HiPRAG</td> <td>67.2</td> <td>4.3</td> <td>32.6</td> </tr>
<tr> <td>Qwen2.5-7B-Instruct</td> <td>PPO</td> <td>baseline</td> <td>53.3</td> <td>7.6</td> <td>29.0</td> </tr>
<tr> <td>Qwen2.5-7B-Instruct</td> <td>PPO</td> <td>HiPRAG</td> <td>64.5</td> <td>6.2</td> <td>33.9</td> </tr>
<tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO</td> <td>HiPRAG (over-search only)</td> <td>58.8</td> <td>4.9</td> <td>52.7</td> </tr>
<tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO</td> <td>HiPRAG (under-search only)</td> <td>63.3</td> <td>6.6</td> <td>16.9</td> </tr>
<tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO</td> <td>HiPRAG (λp = 0.2)</td> <td>59.6</td> <td>5.5</td> <td>44.5</td> </tr>
<tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO</td> <td>HiPRAG (λp = 0.6)</td> <td>62.5</td> <td>5.2</td> <td>39.0</td> </tr>
</tbody>
</table></div>

The following are the results from Table 4 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <td rowspan="2">Base Model</td> <td rowspan="2">RL Algo.
+ Method</td> <td colspan="3">General QA</td> <td colspan="3">Multi-Hop QA</td> <td rowspan="2">Bamboogle</td> <td rowspan="2">Avg.</td> </tr> <tr> <td>NQ</td> <td>TriviaQA</td> <td>PopQA</td> <td>HotpotQA</td> <td>2Wiki</td> <td>Musique</td> </tr> </thead> <tbody> <tr> <td>Llama-3.2-3B-Instruct</td> <td>PPO + baseline</td> <td>65.2</td> <td>74.5</td> <td>55.1</td> <td>47.0</td> <td>52.3</td> <td>18.7</td> <td>36.0</td> <td>56.4</td> </tr> <tr> <td>Llama-3.2-3B-Instruct</td> <td>PPO + HiPRAG</td> <td>71.6</td> <td>77.2</td> <td>61.0</td> <td>57.7</td> <td>67.9</td> <td>25.7</td> <td>43.2</td> <td>64.8</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>GRPO + baseline</td> <td>59.6</td> <td>69.1</td> <td>57.3</td> <td>52.4</td> <td>61.4</td> <td>20.6</td> <td>24.8</td> <td>58.5</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>GRPO + HiPRAG</td> <td>68.5</td> <td>74.2</td> <td>60.6</td> <td>59.2</td> <td>69.1</td> <td>27.9</td> <td>38.4</td> <td>64.4</td> </tr> <tr> <td>Qwen2.5-3B</td> <td>PPO + baseline</td> <td>60.6</td> <td>71.7</td> <td>55.8</td> <td>54.3</td> <td>65.7</td> <td>24.1</td> <td>40.8</td> <td>60.3</td> </tr> <tr> <td>Qwen2.5-3B</td> <td>PPO + HiPRAG</td> <td>68.7</td> <td>75.5</td> <td>66.3</td> <td>57.4</td> <td>67.4</td> <td>24.1</td> <td>41.6</td> <td>65.4</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + baseline</td> <td>60.9</td> <td>70.1</td> <td>57.0</td> <td>52.0</td> <td>63.0</td> <td>24.3</td> <td>37.6</td> <td>59.3</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG</td> <td>65.6</td> <td>73.9</td> <td>62.1</td> <td>55.6</td> <td>69.6</td> <td>26.0</td> <td>32.8</td> <td>64.1</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>GRPO + baseline</td> <td>62.4</td> <td>74.4</td> <td>57.3</td> <td>54.8</td> <td>64.2</td> <td>25.3</td> <td>49.6</td> <td>61.2</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>GRPO + HiPRAG</td> <td>71.2</td> <td>76.3</td> <td>63.2</td> <td>62.4</td> <td>71.7</td> <td>34.1</td> <td>52.8</td> <td>67.2</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>PPO + baseline</td> <td>55.6</td> <td>67.5</td> <td>43.5</td> <td>49.4</td> <td>58.5</td> <td>26.6</td> <td>44.0</td> <td>53.3</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>PPO + HiPRAG</td> <td>66.2</td> <td>75.7</td> <td>58.4</td> <td>59.9</td> <td>66.2</td> <td>34.3</td> <td>52.0</td> <td>64.5</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG (over-search only)</td> <td>61.9</td> <td>66.9</td> <td>54.9</td> <td>52.2</td> <td>65.4</td> <td>25.5</td> <td>39.2</td> <td>58.8</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG (under-search only)</td> <td>63.7</td> <td>74.1</td> <td>60.6</td> <td>55.9</td> <td>67.9</td> <td>28.4</td> <td>40.8</td> <td>63.3</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG ($\lambda_p$ = 0.2)</td> <td>61.9</td> <td>71.2</td> <td>56.8</td> <td>53.7</td> <td>62.2</td> <td>25.4</td> <td>31.2</td> <td>59.6</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG ($\lambda_p$ = 0.6)</td> <td>66.6</td> <td>74.4</td> <td>60.5</td> <td>55.5</td> <td>64.4</td> <td>25.6</td> <td>38.4</td> <td>62.5</td> </tr> </tbody> </table></div> The following are the results from Table 5 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td rowspan="2">Base Model</td> <td rowspan="2">RL Algo. 
+ Method</td> <td colspan="3">General QA</td> <td colspan="4">Multi-Hop QA</td> <td rowspan="2">Avg.</td> </tr> <tr> <td>NQ</td> <td>TriviaQA</td> <td>PopQA</td> <td>HotpotQA</td> <td>2Wiki</td> <td>Musique</td> <td>Bamboogle</td> </tr> </thead> <tbody> <tr> <td>Llama-3.2-3B-Instruct</td> <td>PPO + baseline</td> <td>12.5</td> <td>15.4</td> <td>5.0</td> <td>4.8</td> <td>3.7</td> <td>2.7</td> <td>8.7</td> <td>7.3</td> </tr> <tr> <td>Llama-3.2-3B-Instruct</td> <td>PPO + HiPRAG</td> <td>11.9</td> <td>13.3</td> <td>4.5</td> <td>4.6</td> <td>1.8</td> <td>3.1</td> <td>5.0</td> <td>6.0</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>GRPO + baseline</td> <td>8.4</td> <td>17.0</td> <td>5.6</td> <td>7.2</td> <td>4.3</td> <td>5.0</td> <td>10.3</td> <td>8.4</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>GRPO + HiPRAG</td> <td>4.4</td> <td>9.8</td> <td>2.2</td> <td>3.0</td> <td>2.9</td> <td>1.4</td> <td>3.9</td> <td>4.1</td> </tr> <tr> <td>Qwen2.5-3B</td> <td>PPO + baseline</td> <td>6.4</td> <td>9.0</td> <td>2.6</td> <td>2.9</td> <td>1.5</td> <td>1.7</td> <td>4.3</td> <td>3.8</td> </tr> <tr> <td>Qwen2.5-3B</td> <td>PPO + HiPRAG</td> <td>5.1</td> <td>6.9</td> <td>2.2</td> <td>2.3</td> <td>1.4</td> <td>1.2</td> <td>3.4</td> <td>3.2</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + baseline</td> <td>8.6</td> <td>13.5</td> <td>5.6</td> <td>4.2</td> <td>1.8</td> <td>3.9</td> <td>12.8</td> <td>6.1</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG</td> <td>6.0</td> <td>11.0</td> <td>3.9</td> <td>4.5</td> <td>2.5</td> <td>2.8</td> <td>11.5</td> <td>4.9</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>GRPO + baseline</td> <td>5.3</td> <td>7.4</td> <td>2.0</td> <td>3.5</td> <td>0.9</td> <td>3.6</td> <td>8.7</td> <td>5.2</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>GRPO + HiPRAG</td> <td>4.1</td> <td>5.4</td> <td>1.3</td> <td>1.8</td> <td>0.3</td> <td>1.5</td> <td>4.8</td> <td>4.3</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>PPO + baseline</td> <td>11.6</td> <td>19.8</td> <td>6.1</td> <td>7.4</td> <td>2.7</td> <td>8.5</td> <td>19.6</td> <td>7.6</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>PPO + HiPRAG</td> <td>10.4</td> <td>14.5</td> <td>4.6</td> <td>5.6</td> <td>2.1</td> <td>5.9</td> <td>13.4</td> <td>6.2</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG (over-search only)</td> <td>6.0</td> <td>11.0</td> <td>3.9</td> <td>4.5</td> <td>2.5</td> <td>2.8</td> <td>11.5</td> <td>4.9</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG (under-search only)</td> <td>8.2</td> <td>15.6</td> <td>5.9</td> <td>5.3</td> <td>2.7</td> <td>3.1</td> <td>6.4</td> <td>6.6</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG ($\lambda_p$ = 0.2)</td> <td>7.4</td> <td>11.4</td> <td>3.7</td> <td>4.4</td> <td>2.3</td> <td>3.3</td> <td>12.1</td> <td>5.5</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG ($\lambda_p$ = 0.6)</td> <td>9.3</td> <td>13.1</td> <td>4.1</td> <td>3.3</td> <td>1.3</td> <td>2.2</td> <td>6.3</td> <td>5.2</td> </tr> </tbody> </table></div> The following are the results from Table 6 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td rowspan="2">Base Model</td> <td rowspan="2">RL Algo. 
+ Method</td> <td colspan="3">General QA</td> <td colspan="3">Multi-Hop QA</td> <td rowspan="2">Bamboogle</td> <td rowspan="2">Avg.</td> </tr> <tr> <td>NQ</td> <td>TriviaQA</td> <td>PopQA</td> <td>HotpotQA</td> <td>2Wiki</td> <td>Musique</td> </tr> </thead> <tbody> <tr> <td>Llama-3.2-3B-Instruct</td> <td>PPO + baseline</td> <td>67.1</td> <td>75.0</td> <td>66.7</td> <td>52.6</td> <td>59.3</td> <td>50.0</td> <td>20.0</td> <td>57.6</td> </tr> <tr> <td>Llama-3.2-3B-Instruct</td> <td>PPO + HiPRAG</td> <td>35.3</td> <td>48.4</td> <td>31.7</td> <td>50.8</td> <td>55.3</td> <td>64.3</td> <td>10.3</td> <td>49.7</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>GRPO + baseline</td> <td>61.9</td> <td>63.9</td> <td>59.6</td> <td>46.1</td> <td>49.1</td> <td>61.9</td> <td>22.3</td> <td>52.1</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>GRPO + HiPRAG</td> <td>52.9</td> <td>34.9</td> <td>35.2</td> <td>29.2</td> <td>25.0</td> <td>45.5</td> <td>21.2</td> <td>33.2</td> </tr> <tr> <td>Qwen2.5-3B</td> <td>PPO + baseline</td> <td>33.3</td> <td>66.7</td> <td>30.8</td> <td>38.5</td> <td>47.5</td> <td>66.7</td> <td>0.0</td> <td>44.0</td> </tr> <tr> <td>Qwen2.5-3B</td> <td>PPO + HiPRAG</td> <td>43.9</td> <td>36.4</td> <td>42.3</td> <td>41.9</td> <td>42.6</td> <td>56.8</td> <td>16.7</td> <td>41.9</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + baseline</td> <td>47.1</td> <td>33.2</td> <td>48.8</td> <td>39.0</td> <td>52.9</td> <td>70.0</td> <td>32.2</td> <td>47.5</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG</td> <td>11.1</td> <td>44.4</td> <td>61.9</td> <td>25.0</td> <td>32.0</td> <td>10.1</td> <td>8.7</td> <td>38.1</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>GRPO + baseline</td> <td>40.5</td> <td>34.3</td> <td>43.8</td> <td>40.9</td> <td>45.0</td> <td>56.2</td> <td>20.0</td> <td>43.3</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>GRPO + HiPRAG</td> <td>30.2</td> <td>34.9</td> <td>34.9</td> <td>40.5</td> <td>24.4</td> <td>37.3</td> <td>41.7</td> <td>32.6</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>PPO + baseline</td> <td>33.4</td> <td>13.9</td> <td>17.5</td> <td>40.3</td> <td>33.4</td> <td>50.0</td> <td>13.2</td> <td>29.0</td> </tr> <tr> <td>Qwen2.5-7B-Instruct</td> <td>PPO + HiPRAG</td> <td>57.1</td> <td>44.9</td> <td>25.5</td> <td>20.0</td> <td>34.6</td> <td>57.1</td> <td>1.6</td> <td>33.9</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG (over-search only)</td> <td>54.5</td> <td>55.2</td> <td>48.9</td> <td>44.7</td> <td>53.7</td> <td>78.3</td> <td>20.0</td> <td>52.7</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG (under-search only)</td> <td>14.0</td> <td>20.4</td> <td>13.6</td> <td>25.6</td> <td>13.2</td> <td>30.8</td> <td>16.9</td> <td>16.9</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG ($\lambda_p$ = 0.2)</td> <td>24.0</td> <td>39.5</td> <td>55.8</td> <td>41.3</td> <td>45.5</td> <td>80.1</td> <td>30.1</td> <td>44.5</td> </tr> <tr> <td>Qwen2.5-3B-Instruct</td> <td>PPO + HiPRAG ($\lambda_p$ = 0.6)</td> <td>27.2</td> <td>34.1</td> <td>33.6</td> <td>60.6</td> <td>51.2</td> <td>53.3</td> <td>1.6</td> <td>39.0</td> </tr> </tbody> </table></div> ## 6.3. Ablation Studies / Parameter Analysis ### 6.3.1. Influence of Model Size As expected, larger models generally demonstrate stronger reasoning capabilities and higher accuracy. `HiPRAG` experiments with 7B models consistently outperform their 3B counterparts in terms of `Avg. CEM` (e.g., `HiPRAG-7B` with GRPO achieves 67.2% vs. `HiPRAG-3B` with GRPO achieving 64.4% as seen in Table 2). 
However, `HiPRAG`'s process-based reward approach allows smaller models to achieve remarkable performance, narrowing the gap with larger models. For instance, the `HiPRAG` model trained from `Qwen2.5-3B-Instruct` with `GRPO` achieves an `Avg. CEM` of 64.4%, which not only surpasses strong external 7B baselines such as `R1-Searcher++` (62.2% Avg. CEM in Table 1) but also outperforms its own 7B counterpart trained with a baseline reward (61.2% Avg. CEM in Table 2). This highlights that the `HiPRAG` training methodology is a more effective pathway to performance gains than simply scaling model size with conventional rewards. Larger models also tend to make more efficient search decisions: with `GRPO`, the 7B model (4.3% Avg. OSR, 32.6% Avg. USR) attains a lower under-search rate than the 3B model (4.1% Avg. OSR, 33.2% Avg. USR) at a comparable over-search rate, in addition to its higher accuracy (Table 2).

### 6.3.2. Influence of Model Family

To assess the generalizability of `HiPRAG`, experiments were conducted with both `Qwen2.5-3B-Instruct` and `Llama-3.2-3B-Instruct` models. While both models achieve comparable peak accuracy after training with `HiPRAG` (Qwen-3B at 64.1% Avg. CEM, Llama-3B at 64.8% Avg. CEM), their underlying search behaviors and efficiency differ.

The following figure (Figure 2b from the original paper) shows the curves of the ratio of search steps among all reasoning steps for the two model families.

![Figure 2b: ratio of search steps among all reasoning steps during training, by model family.](/files/papers/692eb0ceab04788a90065fde/images/3.jpg)

*This line chart shows how the proportion of search steps among all reasoning steps evolves over training for Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct. The ratio for Llama-3.2-3B-Instruct stays at a high level and eventually approaches 0.9, while the ratio for Qwen2.5-3B-Instruct is comparatively lower, fluctuating between 0.4 and 0.9.*

As seen in Figure 2b, the `Llama-3B` model initially shows a higher tendency to rely on its `parametric knowledge` with more non-search steps, resulting in a higher `under-search rate`. After training, the `Qwen-3B` model achieves a high `Avg. CEM` of 64.1% with lower suboptimal search rates (4.9% Avg. OSR, 38.1% Avg. USR) than the `Llama-3B` model's 6.0% Avg. OSR and 49.7% Avg. USR (Table 2). This suggests that although `HiPRAG` is effective across different model families, the base model's inherent tendencies can influence the final search efficiency.

### 6.3.3. Influence of RL Algorithm

The impact of different `RL algorithms` on `HiPRAG` was explored using `PPO` and `GRPO` on `Qwen2.5-3B/7B-Instruct` models. `PPO` demonstrated better training stability, often completing the full training run without reward collapse, while `GRPO` consistently showed the potential for higher final performance and faster convergence.

The following figure (Figure 2a from the original paper) shows the reward curves for the different RL algorithms.

![Figure 2a: average training reward of the Qwen2.5-3B model over training steps under different setups.](/files/papers/692eb0ceab04788a90065fde/images/2.jpg)

*This chart shows how the average reward of the Qwen2.5-3B model changes over training steps under different training setups. It includes the curves for the HiPRAG and GRPO runs, and the reward values diverge noticeably at around 200 steps.*

From Table 2, `GRPO` yields a higher `Avg. CEM` for both the 3B (64.4% vs. 64.1% with PPO) and 7B (67.2% vs. 64.5% with PPO) models. Moreover, `GRPO` results in more efficient search behavior: 4.3% Avg. OSR for the 7B `GRPO` model vs. 6.2% for its `PPO` counterpart (Table 2). (The 2.3% over-search rate quoted in the abstract is not directly shown in Table 2; for the 7B `GRPO` `HiPRAG` configuration, the lowest per-dataset OSR is 1.3% on PopQA in Table 5, and the per-dataset values average to the 4.3% reported in Table 2.)
This aligns with findings in related literature where `GRPO`'s `critic-free approach` can be more sample-efficient for `LLM training`, often with a trade-off in training stability.

### 6.3.4. Influence of Instruction Tuning on Base Model

To understand the impact of `instruction-tuning` on the base model before applying `HiPRAG`, a comparison was made between a base model (`Qwen2.5-3B`) and its instruction-tuned counterpart (`Qwen2.5-3B-Instruct`). Table 2 shows that the instruction-tuned model generally exhibited a higher initial reward, as its pre-training makes it more adept at following the structured output format required by `HiPRAG`. The `hierarchical reward`, which gates the `process bonus` until both the answer and format are correct, favors models that quickly learn this structure. However, the base model (`Qwen2.5-3B`) eventually caught up, converging to a similar reward level. Interestingly, once fully trained, the base model achieves a slightly higher `Avg. CEM` score (65.4% vs. 64.1% for the instruction-tuned model) and a lower `Avg. OSR` (3.2% vs. 4.9%). This suggests that the base model may learn the reasoning and search behaviors from the `RL objective` more purely, without potential biases introduced during the `instruction-tuning` phase.

### 6.3.5. Influence on Output Format (Ablation)

To isolate the effect of the new structured output format from the reward change, `Search-R1-step*` and `β-GRPO-step*` variants were trained. These models used the same reward functions as their original counterparts (`Search-R1` v0.3 and `β-GRPO`, respectively) but were required to use `HiPRAG`'s parsable output format. The results in Table 1 show that this structured format, by itself, maintains or slightly increases performance. This confirms that the parsable output format provides a robust foundation, and that the significant performance gains of the full `HiPRAG` method are primarily due to the process-based reward mechanism it enables rather than an artifact of the format change.

### 6.3.6. Influence of Process Bonus Coefficient ($\lambda_p$) (Ablation)

The process bonus coefficient $\lambda_p$ determines the weight of the step-correctness ratio in the hierarchical reward. Experiments with different $\lambda_p$ values revealed an optimal balance (a minimal sketch of how this bonus enters the reward follows the list below):

  • $\lambda_p = 0.4$ (Optimal): Achieved the highest performance (64.1% Avg. CEM for Qwen2.5-3B-Instruct with PPO). It also resulted in a good balance of efficiency (4.9% Avg. OSR, 38.1% Avg. USR).

  • $\lambda_p = 0.2$ (Lower): Behaved similarly to an outcome-only reward, failing to sufficiently incentivize efficiency (59.6% Avg. CEM). This was reflected in higher inefficiency rates (5.5% Avg. OSR and 44.5% Avg. USR).

  • $\lambda_p = 0.6$ (Higher): Over-prioritized step purity at the expense of final answer correctness, leading to a slight performance degradation (62.5% Avg. CEM). This configuration still achieved reasonably low suboptimal-search rates (5.2% Avg. OSR, 39.0% Avg. USR), but the efficiency gain did not compensate for the loss in accuracy.

    This ablation study confirms that an optimally tuned process bonus is crucial for achieving the best trade-off between accuracy and efficiency.
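
As referenced above, the sketch below illustrates one plausible way the hierarchical reward could combine the outcome, format, and process terms. The function name, constants, and exact combination rule are assumptions for illustration; this section only states that the process bonus is weighted by $\lambda_p$ and is gated on format adherence and final answer correctness.

```python
def hierarchical_reward(answer_correct: bool, format_ok: bool,
                        n_optimal_steps: int, n_steps: int,
                        lambda_p: float = 0.4) -> float:
    """Illustrative hierarchical reward: outcome first, process bonus second.

    The process bonus is the fraction of reasoning steps judged optimal
    (necessary searches plus justified non-search steps), weighted by
    lambda_p, and is granted only once both the output format and the
    final answer are correct.
    """
    if not format_ok:
        return 0.0                      # unparsable output earns nothing
    outcome = 1.0 if answer_correct else 0.0
    if not answer_correct:
        return outcome                  # no process bonus without a correct answer
    step_ratio = n_optimal_steps / max(n_steps, 1)
    return outcome + lambda_p * step_ratio   # assumed additive combination
```

Under this reading, a small $\lambda_p$ leaves the bonus negligible next to the outcome term, which is consistent with the near-outcome-only behavior reported for $\lambda_p = 0.2$.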

### 6.3.7. Training with Over-search or Under-search Only (Ablation)

To understand the individual contributions of the over-search and under-search penalties, models were trained with only one type of penalty (a rough sketch of these single-penalty variants follows the list below):

  • HiPRAG (over-search only): Training to reduce only over-search proved insufficient, yielding a low Avg. CEM of 58.8%. While it successfully reduced the Avg. OSR to 4.9%, it caused the model to become too hesitant to search, resulting in a very high Avg. USR of 52.7%. This indicates that solely penalizing over-search can lead to under-searching.
  • HiPRAG (under-search only): Targeting only under-search was more effective (63.3% Avg. CEM), underscoring that preventing hallucination is often more critical than solely improving efficiency. This method dramatically lowered the Avg. USR to just 16.9% but made the agent overly reliant on its search tool, slightly increasing the Avg. OSR to 6.6%.
  • Combined Approach (full HiPRAG): The best performance was achieved only when penalizing both suboptimal behaviors simultaneously (64.1% Avg. CEM for the corresponding configuration), confirming that a holistic approach to search optimization is necessary.
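
As referenced above, one way to picture the single-penalty ablations is as restricting which steps the step-correctness ratio is computed over. The step schema, field names, and filtering rule below are assumptions for illustration, not HiPRAG's actual implementation.

```python
def step_ratio(steps: list[dict], mode: str = "both") -> float:
    """Illustrative step-correctness ratio for the ablation variants.

    Each step is assumed to look like {"is_search": bool, "is_optimal": bool}.
    'over_search_only' scores only search steps (penalizing unnecessary
    searches), 'under_search_only' scores only non-search steps (penalizing
    skipped searches), and 'both' corresponds to the full HiPRAG bonus.
    """
    if mode == "over_search_only":
        scored = [s for s in steps if s["is_search"]]
    elif mode == "under_search_only":
        scored = [s for s in steps if not s["is_search"]]
    else:
        scored = steps
    if not scored:
        return 1.0  # nothing of the scored kind in this trajectory
    return sum(s["is_optimal"] for s in scored) / len(scored)
```

Under this reading, the over-search-only variant leaves non-search steps unscored, so the agent can avoid penalties simply by searching less, which is consistent with the very high Avg. USR observed above.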

## 6.4. Qualitative Case Study

To illustrate the practical benefits of HiPRAG, a specific case study (Appendix G) compares the reasoning trajectory of a baseline model (Qwen2.5-3B-Instruct) with a HiPRAG-trained agent for the question: "What is the place of birth of the performer of song Slow Down (Lacy J. Dalton Song)?".

Figure 7 in the original paper shows the baseline model's reasoning trajectory; its key steps are summarized below.


Baseline Model (Inefficient Reasoning): The baseline model overlooks the crucial parenthetical qualifier, "(Lacy J. Dalton Song)," and initiates a broad search for the generic song title "Slow Down." This leads to an unnecessarily long, five-step process:

  1. It identifies three different artists who have a song by that title.

  2. It then proceeds to search for the birthplace of each one individually.

  3. Even when it finds a correct birthplace for one of these artists (e.g., Selena Gomez: "Grand Prairie, Texas"), the search is irrelevant to the specific question being asked.

    This is a classic example of over-searching, where the agent performs redundant and irrelevant lookups, ultimately failing to provide a single, correct answer relevant to the specific question.

Figure 8 in the original paper shows the reasoning trajectory of the HiPRAG-trained agent; its key steps are summarized below.


HiPRAG-trained Model (Efficient Reasoning): In contrast, the HiPRAG-trained agent demonstrates a highly efficient and accurate reasoning path:

  1. Step 1 (Non-search): It correctly parses the entire question in its first, non-search step, identifying "Lacy J. Dalton" as the specified performer and inferring that the core entity for the search is Lacy J. Dalton, not just "Slow Down".

  2. Step 2 (Search): It then executes a single, targeted search query for "Lacy J. Dalton place of birth."

  3. Final Answer: This directly leads to the correct answer ("Lacy J. Dalton was born in Salem, Missouri").

    This case clearly demonstrates how HiPRAG's process-oriented rewards cultivate a more nuanced and efficient reasoning strategy, significantly improving both accuracy and search economy by curbing over-searching while ensuring necessary searches are made. A minimal sketch of decomposing such a parsable trajectory into steps follows.
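
To make the notion of discrete, parsable steps concrete, the sketch below decomposes a hypothetical trajectory for this question into steps and labels each as search or non-search. The tag names and trajectory text are assumptions chosen to mirror the case study, not HiPRAG's actual output format.

```python
import re

# Hypothetical trajectory for the case-study question; tag names are assumed
# for illustration and may differ from the paper's actual output format.
trajectory = """
<step><think>The question specifies the performer: Lacy J. Dalton.</think></step>
<step><search>Lacy J. Dalton place of birth</search></step>
<answer>Lacy J. Dalton was born in Salem, Missouri.</answer>
"""

# Decompose the trajectory into discrete steps and label each one.
steps = re.findall(r"<step>(.*?)</step>", trajectory, flags=re.DOTALL)
for i, body in enumerate(steps, 1):
    kind = "search" if "<search>" in body else "non-search"
    print(f"step {i}: {kind}")   # step 1: non-search, step 2: search
```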

## 6.5. Additional Analysis on Efficacy (Appendix F)

  • Format Correctness Percentage: Analysis of HiPRAG-trained models showed that 96.3% of all generated trajectories successfully adhered to the required structured output format. This high percentage validates the effectiveness of the prompting and reward strategy in enforcing a machine-parsable output, which is crucial for process reward application.
  • Efficacy of Over-search & Under-search Detection: A manual audit of 200 randomly selected reasoning trajectories revealed a 98.3% accuracy rate for over-search detection and a 95.6% accuracy rate for under-search detection. These high accuracy figures confirm that the on-the-fly LLM-based judges provide a reliable and effective signal for identifying suboptimal search behaviors during RL training.
  • Efficacy of CEM Metric: A manual inspection of Cover Exact Match (CEM) judgments on 100 randomly sampled question-answer pairs found that CEM's assessment of correctness aligned with human judgment in 98% of cases. This confirms CEM as a robust metric for the task, capable of appropriately handling valid answers embedded within longer, explanatory LLM responses. A minimal sketch of such a containment-style check follows this list.
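
As referenced above, the snippet below sketches a containment-style CEM check. The normalization details (lowercasing, punctuation and article removal) are assumptions, since this section only states that CEM credits valid answers embedded in longer responses.

```python
import re
import string

def cover_exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Assumed CEM-style check: correct if any normalized gold answer
    appears as a substring of the normalized prediction."""
    def normalize(text: str) -> str:
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
        return " ".join(text.split())
    pred = normalize(prediction)
    return any(normalize(g) in pred for g in gold_answers)

# Example from the case study above:
# cover_exact_match("Lacy J. Dalton was born in Salem, Missouri.", ["Salem, Missouri"]) -> True
```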

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary

This work introduces HiPRAG (Hierarchical Process Rewards for Efficient agentic RAG), a novel Reinforcement Learning training methodology that significantly addresses the pervasive inefficiencies of over-searching and under-searching in agentic RAG systems. By decomposing agent trajectories into discrete, parsable steps and employing on-the-fly detection of suboptimal search actions, HiPRAG provides fine-grained, knowledge-aware process rewards. This hierarchical reward function first ensures format adherence and final answer correctness before adding a bonus for process optimality. Experiments demonstrate that HiPRAG achieves state-of-the-art Cover Exact Match (CEM) accuracies (65.4% for 3B and 67.2% for 7B models) while dramatically improving search efficiency, reducing the over-search rate from over 27% to 2.3% and concurrently lowering the under-search rate. The method shows strong generalizability across various model families, sizes, RL algorithms, and types. The core finding is that optimizing the reasoning process itself, rather than solely the final outcome, is crucial for developing powerful and efficient LLM search agents.

## 7.2. Limitations & Future Work

The authors acknowledge several ethical considerations that also hint at potential limitations and future work:

  • Bias Inheritance: The system may inherit biases from its underlying training data (e.g., Wikipedia) and foundation models. This is a common challenge for LLMs and suggests a need for ongoing research into bias mitigation in RAG systems and RL training.

  • Potential Misuse: While the method aims to foster more reliable AI by penalizing factual errors and reducing computational waste, any capable AI system presents potential for misuse. This implies continuous ethical scrutiny and responsible deployment are necessary.

  • Reliance on Proprietary LLM Judges: The on-the-fly over-search and under-search detection relies on proprietary models (gpt-4.1-mini, gpt-5-mini). While effective for research, this introduces a dependency on external services and potential costs, and it limits full reproducibility for researchers without API access. Future work could explore developing open-source or smaller, fine-tuned detectors to overcome this; a hedged, provider-agnostic sketch of such a judge interface appears after this list.

    The paper's concluding sentence, "This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents," implicitly suggests a future direction of further exploring and refining such fine-grained RL control mechanisms for broader applications in LLM agent reasoning.
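
As referenced in the limitations above, an on-the-fly judge can be decoupled from any particular provider by hiding the model behind a generic text-in/text-out callable. The prompt wording, function name, and yes/no protocol below are assumptions for illustration rather than the paper's actual implementation.

```python
from typing import Callable

def is_over_search(llm_judge: Callable[[str], str],
                   reasoning_so_far: str,
                   search_query: str) -> bool:
    """Sketch of an on-the-fly over-search check (prompt wording assumed).

    `llm_judge` is any text-in/text-out callable (e.g. a wrapper around a
    hosted model or a local fine-tuned detector); it is asked whether the
    agent could have answered this step from its prior reasoning alone,
    i.e. whether the search was unnecessary.
    """
    prompt = (
        "An agent produced the reasoning below and then issued a search.\n"
        f"Reasoning so far:\n{reasoning_so_far}\n\n"
        f"Search query: {search_query}\n\n"
        "Was this search unnecessary because the needed information was "
        "already known or derivable? Answer 'yes' or 'no'."
    )
    return llm_judge(prompt).strip().lower().startswith("yes")
```

A locally fine-tuned detector could be dropped in as `llm_judge` without changing the training loop, which is the kind of open-source alternative the critique below also suggests.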

## 7.3. Personal Insights & Critique

HiPRAG presents a highly insightful and effective approach to a critical problem in agentic RAG. The shift from outcome-based to process-based rewards is conceptually sound and rigorously implemented.

My personal insights and critiques include:

  • Innovation of Structured Output: The emphasis on a machine-parsable output format is a clever and foundational innovation. It transforms the traditionally unstructured and opaque LLM reasoning process into something amenable to programmatic evaluation, directly enabling the fine-grained process rewards. This design choice, while seemingly simple, is crucial for the entire system's efficacy.
  • Direct Detection is Superior: The on-the-fly, LLM-based detection of over-search and under-search is a significant improvement over previous, more indirect methods. It provides a more accurate and immediate signal, which is vital for effective RL training. The impressive accuracy rates (98.3% for over-search, 95.6% for under-search) are strong validation.
  • Hierarchical Reward Function Design: The hierarchical reward is well-designed to prevent over-suppression of search capabilities, a common pitfall when overly penalizing intermediate steps. By first ensuring basic correctness and format, it allows the agent to build foundational skills before optimizing for efficiency, making the learning process more robust.
  • Strong Empirical Results: The dramatic reduction in over-search rate (from >27% to 2.3%) while maintaining or improving accuracy is a standout achievement, demonstrating the practical impact of HiPRAG. The generalizability across different model sizes, families, and RL algorithms further strengthens the findings.
  • Critique: External LLM Dependency: As noted in the limitations, the reliance on gpt-4.1-mini and gpt-5-mini for detection is a practical bottleneck. While these proprietary models are efficient judges, an open-source alternative or a strategy to train a smaller, dedicated on-the-fly detector would enhance reproducibility and democratize the approach. This could involve using synthetic data generated by these powerful LLMs to train a smaller, local verifier model.
  • Potential for Broader Application: The principles of structured reasoning trajectories and fine-grained process rewards could be highly transferable beyond RAG to other LLM agent tasks involving complex tool use, planning, or multi-step reasoning. For example, in code generation, a process reward could evaluate the efficiency of intermediate coding steps or API calls, not just the final runnable code.
  • Beyond QA: While evaluated on QA benchmarks, the methodology has implications for any domain where LLMs need to reason and interact with external systems efficiently and reliably. The ability to control and optimize the reasoning process itself opens doors for more robust and trustworthy AI agents in diverse applications.
