HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
TL;DR Summary
HiPRAG introduces a hierarchical process-reward method that tackles the common over-search and under-search issues in agentic retrieval-augmented generation, significantly improving search efficiency and accuracy across multiple QA benchmarks and demonstrating the importance of optimizing the reasoning process itself, not just the final outcome.
Abstract
Agentic Retrieval-Augmented Generation (RAG) is a powerful technique for incorporating external information that Large Language Models (LLMs) lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a Reinforcement Learning (RL) framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient Agentic RAG (HiPRAG), a novel training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B), outperforming strong agentic RAG baselines. This is accomplished while dramatically improving search efficiency, reducing the over-search rate from over 27% in baselines to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL for improving the efficiency and optimality of reasoning for search agents. We will release our code upon acceptance.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
1.2. Authors
The paper lists "Anonymous authors" as it is currently under double-blind review. Therefore, specific research backgrounds and affiliations are not disclosed.
1.3. Journal/Conference
The paper is hosted on OpenReview (https://openreview.net/pdf?id=Gt4v9WBPzm). OpenReview is a platform for disseminating research papers, particularly those under review for conferences, allowing for public peer review and discussion. Its reputation and influence depend on the associated conference, which is not stated here but is implied to be a major machine learning or NLP venue given the topic and quality.
1.4. Publication Year
The listed date is 2025-10-08T00:00:00.000Z (UTC), indicating that the preprint was posted in October 2025.
1.5. Abstract
Agentic Retrieval-Augmented Generation (RAG) empowers Large Language Models (LLMs) to use external information for problem-solving. However, these systems often suffer from inefficient search behaviors like over-search (retrieving redundant information) and under-search (failing to search when needed), leading to unnecessary computational overhead and unreliable outputs. Existing training methods, primarily outcome-based Reinforcement Learning (RL), lack the fine-grained control to address these issues. This paper introduces HiPRAG (Hierarchical Process Rewards for Efficient agentic RAG), a novel RL training methodology that incorporates a fine-grained, knowledge-grounded process reward. HiPRAG evaluates each search decision's necessity by decomposing the agent's reasoning trajectory into discrete, parsable steps. It applies a hierarchical reward function that bonuses optimal search and non-search steps, in addition to standard outcome and format rewards. Experiments with Qwen2.5 and Llama-3.2 models across seven QA benchmarks demonstrate that HiPRAG achieves higher average accuracies (65.4% for 3B, 67.2% for 7B) than strong agentic RAG baselines. Crucially, it dramatically improves search efficiency, reducing the over-search rate from over 27% to 2.3% and also lowering the under-search rate. The results highlight the efficacy of optimizing the reasoning process itself, rather than just the final outcome, and show strong generalizability across various RL algorithms, model families, sizes, and types.
1.6. Original Source Link
https://openreview.net/pdf?id=Gt4v9WBPzm This is a preprint link on OpenReview, indicating the paper is currently under double-blind review. The authors state, "We will release our code upon acceptance."
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the prevalence of suboptimal search behaviors in Agentic Retrieval-Augmented Generation (RAG) systems. While agentic RAG, where Large Language Models (LLMs) can autonomously decide when and what to retrieve, is a powerful technique for accessing external information, its practical application is hampered by two main issues:
- Over-search: The agent retrieves information it already possesses or performs redundant searches, leading to unnecessary computational overhead, increased latency, and wasted resources.
- Under-search: The agent fails to perform a search when external knowledge is actually required, resulting in factual errors, hallucinations, and unreliable outputs.

These inefficiencies undermine both the accuracy and efficiency of agentic RAG systems. Current training methods, often based on Reinforcement Learning (RL) with outcome-based rewards (e.g., rewarding only the correctness of the final answer), lack the granularity to effectively address these step-level inefficiencies. Existing efforts to introduce process-level rewards often rely on imperfect proxies such as confidence scores or separately trained reward models, which can introduce bias or misjudge the true quality of a reasoning step. The fundamental challenge is the lack of explicit, step-specific feedback on individual retrieval decisions.
The paper's innovative idea is to provide fine-grained, knowledge-grounded process rewards to directly optimize the agent's search behavior at each step of its reasoning trajectory. Instead of coarse, outcome-only feedback, HiPRAG aims to give precise signals about when a search was necessary, when it was redundant, and when it was missed.
2.2. Main Contributions / Findings
The primary contributions of the HiPRAG paper are:
- Novel RL Training Methodology with Hierarchical Process Rewards: The paper proposes HiPRAG, a new training framework for agentic RAG systems that uses a hierarchical and knowledge-aware process reward. This mechanism provides fine-grained supervision on the agent's search behavior, moving beyond traditional outcome-based rewards.
- Efficient On-the-Fly Detection of Suboptimal Searches: HiPRAG introduces robust and direct methods for detecting over-search and under-search behaviors during the RL training phase. This is enabled by a redesigned, structured output format that allows rule-based parsing of reasoning steps and utilizes external LLMs for efficient semantic equivalence and factual accuracy checks on-the-fly.
- Demonstrated Superior Performance and Efficiency: Experiments on seven diverse QA benchmarks using Qwen2.5 and Llama-3.2 models show significant improvements:
  - Accuracy: Achieved average accuracies of 65.4% (3B models) and 67.2% (7B models) in Cover Exact Match (CEM), outperforming strong agentic RAG baselines.
  - Efficiency: Dramatically improved search efficiency, reducing the over-search rate (OSR) from over 27% in baselines to just 2.3%, and concurrently lowering the under-search rate (USR).
- Strong Generalizability: The method demonstrates good generalizability across different RL algorithms (PPO, GRPO), model families (Qwen2.5, Llama-3.2), model sizes (3B, 7B), and model types (base, instruct).

These findings collectively demonstrate that HiPRAG effectively solves the problem of suboptimal search behaviors by optimizing the reasoning process itself, leading to more accurate, efficient, and reliable agentic RAG systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand HiPRAG, a grasp of several core concepts in natural language processing (NLP) and artificial intelligence (AI) is essential:
- Large Language Models (LLMs): These are advanced deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They can perform various tasks like question answering, summarization, and translation. Examples include GPT series, Llama, and Qwen.
- Retrieval-Augmented Generation (RAG): RAG is a technique that enhances LLMs by enabling them to retrieve relevant information from an external knowledge base (like a database or a collection of documents) before generating a response. This helps LLMs access up-to-date or domain-specific information that wasn't included in their training data, thereby reducing hallucinations and improving factual accuracy. A basic RAG system typically performs retrieval once based on the initial query.
- Agentic RAG / LLM Agents: This is an evolution of RAG where the LLM acts as an "agent" capable of multi-step reasoning and autonomous decision-making. An agentic RAG system can dynamically decide when to search, what to search for (by generating a query), how to use the retrieved information, and when to stop searching to formulate a final answer. This often involves an iterative Think-Act-Observe process.
- Reinforcement Learning (RL): RL is a paradigm of machine learning where an "agent" learns to make decisions by interacting with an "environment." The agent receives rewards or penalties for its actions, aiming to maximize cumulative reward over time.
  - Policy (π): The agent's strategy, mapping observed states to actions. In LLM agents, this is often the LLM itself, which generates text (actions) given a prompt (state).
  - Reward Function (R): Defines the goal of the RL problem by assigning numerical values to the agent's actions and outcomes. A positive reward encourages desirable behavior, while a negative reward (penalty) discourages undesirable behavior.
  - Trajectory: A sequence of states, actions, and rewards from the beginning to the end of an episode (e.g., answering a question).
- Proximal Policy Optimization (PPO): A widely used Reinforcement Learning algorithm that balances exploration and exploitation while ensuring that policy updates are not too drastic, which helps maintain training stability. PPO achieves this by optimizing a "clipped" objective function, preventing large policy updates that could destabilize training.
- Group Relative Policy Optimization (GRPO): Another Reinforcement Learning algorithm, often considered a critic-free approach, which can be more sample-efficient for LLM training compared to PPO, although sometimes at the cost of less training stability. It estimates policy gradients based on relative returns within groups of trajectories.
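As general background (this is the standard PPO formulation, not something introduced in this paper), the "clipped" objective mentioned above can be written as:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range. GRPO keeps the same clipped-ratio idea but replaces the learned critic with group-relative advantages computed from the returns of several trajectories sampled for the same prompt.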
3.2. Previous Works
The paper builds upon a rich body of prior research in Agentic RAG and Reinforcement Learning for tool use. Key related works include:
- ReAct (Yao et al., 2023): This foundational framework demonstrated that LLMs can synergize reasoning (generating thoughts or plans) and acting (using tools like search engines). It introduced the Thought-Action-Observation loop, which is central to many agentic systems.
- Chain-of-Retrieval (Wang et al., 2025b) and DeepRAG (Guan et al., 2025): These works further refined the structured retrieval process itself, allowing for sequential or multi-step retrieval to handle complex queries, building upon the chain-of-thought reasoning paradigm.
- RL for Tool Use:
  - ToolRL (Qian et al., 2025a) and ToRL (Li et al., 2025c): These studies showed that Reinforcement Learning from rewards based on task success can significantly scale and improve an LLM's ability to integrate and use various tools, including search engines.
  - RAG-RL (Huang et al., 2025a): Uses RL and curriculum learning to enhance RAG performance, demonstrating the potential of RL in this domain.
- RL-Based Agentic RAG Systems:
  - Search-R1 (Jin et al., 2025b): A key baseline in the paper, it trains LLMs to reason and leverage search engines using Reinforcement Learning. It introduced the concept of an outcome + format reward, providing a reward if the final answer is correct AND the output follows a specified format. This system often generates reasoning within XML blocks, which HiPRAG identifies as a limitation due to parsing difficulties.
  - R1-Searcher (Song et al., 2025a) and R1-Searcher++ (Song et al., 2025b): These works further incentivize search capabilities and dynamic knowledge acquisition in LLMs using RL, focusing on improving overall performance.
  - β-GRPO (Wu et al., 2025): Another strong RL baseline that aims to mitigate suboptimal search behaviors by reducing uncertainty, and is compared against in HiPRAG experiments.
- Efficient Agentic RAG & Tool Use (Addressing Inefficiency):
  - Uncertainty Detection (Mallen et al., 2023; Dhole, 2025): Early approaches to adaptive retrieval relied on heuristics or classifiers to detect model uncertainty and trigger retrieval.
  - Self-Awareness/Information Needs (Su et al., 2024; Zubkova et al., 2025; Yao et al., 2025; Huanshuo et al., 2025): More advanced methods learn to assess an LLM's real-time information needs or self-awareness from its internal states for dynamic retrieval decisions.
  - Process-Level Reward Shaping:
    - R1-Searcher++ (Song et al., 2025b) and Synergistic Reasoning Agents (Huang et al., 2025b): Use RL to incentivize necessary knowledge acquisition and dynamic search.
    - ReARTeR (Sun et al., 2025): Introduces a framework with a trustworthy process reward model to score and refine each step in a RAG pipeline. This is conceptually similar to HiPRAG, but it relies on a separate reward model, which HiPRAG avoids in favor of direct, on-the-fly evaluation.
    - SMART (Qian et al., 2025b), SMARTCAL (Shen et al., 2024), and OTC (Wang et al., 2025a): These train agents to be self-aware and make optimal tool calls, often using RL to mitigate tool overuse. Some incorporate length- or retrieval-time-based penalties, which HiPRAG argues can lead to under-search by oversimplifying the problem.
    - Verifiable Stepwise Rewards (Yue et al., 2025; Ye et al., 2025): These methods also use stepwise rewards to promote more efficient general reasoning paths, aligning with HiPRAG's philosophy.
3.3. Technological Evolution
The field of retrieval-augmented generation has evolved through several stages:
- Early RAG (e.g., DPR, RAG models by Lewis et al., 2020): Initial approaches focused on performing a single retrieval at the beginning of the generation process, based on the input query. The retrieved documents were then concatenated with the query and fed into a generative LLM. This was a significant improvement over pure parametric LLMs but lacked dynamic interaction.
- Prompt-Based Agentic RAG (e.g., ReAct, IRCoT, Search-o1): The next evolution introduced the concept of agents, where LLMs could dynamically interact with external tools. Through sophisticated prompting strategies (e.g., chain-of-thought prompting), LLMs were instructed to interleave reasoning steps with tool calls (like search), creating multi-step reasoning trajectories. These systems relied heavily on the LLM's ability to follow instructions and generate appropriate tool calls based on prompt engineering. However, their behavior could be inconsistent and suboptimal without explicit training.
- RL-Tuned Agentic RAG (e.g., Search-R1, R1-Searcher, β-GRPO): Recognizing the limitations of purely prompt-based methods, researchers began applying Reinforcement Learning to explicitly train LLM agents. In this stage, the LLM's actions (e.g., generating a search query, deciding to stop searching) are rewarded or penalized based on the outcome of a task (e.g., a correct answer). This allowed for more robust and optimized tool-use policies. However, most RL approaches still focused on outcome-based rewards, meaning they only gave feedback on the final answer, not the quality of individual reasoning or search steps. This often led to over-search or under-search, as the agent did not receive fine-grained feedback on its process.

HiPRAG fits into this timeline as a significant advancement in the RL-tuned agentic RAG stage. It addresses the critical gap of fine-grained, process-level optimization: while previous RL methods focused on task success, HiPRAG explicitly targets the efficiency and optimality of each step in the reasoning trajectory by introducing a hierarchical, knowledge-aware process reward.
3.4. Differentiation Analysis
Compared to the main methods in related work, HiPRAG introduces several core differences and innovations:
- Fine-Grained Process Reward vs. Outcome-Based/Coarse Proxies:
  - Prior Methods: Many existing RL-based agentic RAG systems (e.g., Search-R1, R1-Searcher) primarily use outcome-based rewards (correctness of the final answer) or coarse process proxies (e.g., length penalties, confidence scores, or separately trained reward models, as in ReARTeR). Length penalties can inadvertently discourage necessary searches (under-search), while confidence scores or learned reward models can be imperfect or introduce bias.
  - HiPRAG's Innovation: HiPRAG provides explicit, step-specific feedback on each retrieval decision. It directly evaluates whether a specific search was superfluous (over-search) or missing (under-search) for that particular step. This knowledge-grounded process reward offers a much more accurate and direct training signal for efficiency.
- On-the-Fly Direct Detection vs. Complex Regeneration/Proxies:
  - Prior Methods: Some attempts at over-search detection involved complex regeneration pipelines (e.g., removing context and re-prompting, as mentioned by Wu et al., 2025), which were computationally expensive and prone to producing unnatural outputs.
  - HiPRAG's Innovation: HiPRAG proposes a direct and efficient on-the-fly detection mechanism. For over-search, it isolates the search query, regenerates its answer, and uses an external LLM to check semantic equivalence with the original step's conclusion. For under-search, it verifies the factual accuracy and internal logic of non-search steps with an external verifier LLM. This direct approach is faster, more robust, and provides a clearer signal.
- Structured, Parsable Output Format:
  - Prior Methods: Frameworks like Search-R1 often use monolithic reasoning blocks that mix reasoning, planning, and conclusions, making it difficult to programmatically isolate discrete reasoning steps for fine-grained evaluation without expensive post-hoc interpretation by another LLM.
  - HiPRAG's Innovation: HiPRAG enforces a strict, machine-parsable output format with explicit tags, clearly delineating search and non-search steps. This structured format is crucial for enabling the on-the-fly detection and application of process rewards.
- Hierarchical Reward Function:
  - Prior Methods: Reward functions might prioritize one aspect (e.g., outcome correctness or efficiency) over another, potentially leading to undesired trade-offs.
  - HiPRAG's Innovation: HiPRAG uses a hierarchical reward that first incentivizes basic format adherence and final answer correctness. Only once these primary goals are met does it provide an additional bonus for process optimality (optimal search and non-search steps). This dynamic focus prevents over-suppression of search capabilities during early training while ensuring efficiency later.

In essence, HiPRAG differentiates itself by moving from a reactive, outcome-centric optimization to a proactive, process-centric optimization, enabled by structured outputs and direct, real-time evaluation of each reasoning step.
4. Methodology
4.1. Principles
The core idea behind HiPRAG is to directly address the suboptimal search behaviors (over-search and under-search) in agentic RAG systems by optimizing the agent's reasoning process itself, rather than solely relying on the final outcome. This is achieved by integrating a fine-grained, knowledge-grounded process reward into the Reinforcement Learning (RL) training loop. The theoretical basis and intuition are that for an LLM agent to be truly efficient and accurate, it needs specific, step-level feedback on whether its decision to search or not search was optimal. If a model is only rewarded for the final answer, it might achieve correctness through inefficient means (e.g., excessive searching) or fail due to insufficient searching, without ever learning why its intermediate steps were good or bad. HiPRAG provides this missing link by:
- Making Reasoning Parsable: Structuring the agent's output so each reasoning step is clearly identifiable.
- Detecting Suboptimalities Directly: Implementing efficient, on-the-fly mechanisms to determine if each search action was necessary or if a crucial search was missed.
- Hierarchically Rewarding Process Quality: Incorporating these detections into a reward function that guides the agent to first achieve basic correctness and format, and then to optimize for search efficiency.

This approach aims to cultivate a more nuanced understanding in the agent of its own knowledge boundaries and information needs, leading to more robust, accurate, and efficient decision-making when interacting with external tools.
4.2. Core Methodology In-depth
The HiPRAG methodology is composed of three interconnected components: a structured output format, efficient detection mechanisms for suboptimal searches, and a hierarchical reward function. The overall training workflow is summarized in Figure 1.
4.2.1. Decomposing Reasoning Trajectory into Parsable Steps
A significant challenge for implementing process rewards is the difficulty in programmatically parsing an agent's reasoning trajectory. Traditional agentic RAG frameworks, such as Search-R1, often generate reasoning within multiple XML blocks, interleaved with search queries and retrieved information. This format presents two key issues:
- Ambiguous Step Boundaries: A single reasoning block often contains a mix of conclusions from previous actions and planning for current ones, making it hard to isolate a discrete, self-contained reasoning step.
- Implicit Internal Reasoning: Steps where the agent relies on its parametric knowledge (i.e., its internal knowledge without searching) are embedded as prose within these blocks, making them indistinguishable from analytical text without complex natural language understanding, which would be slow and error-prone during RL training.

To overcome these issues, HiPRAG enforces a structured, machine-parsable output format during RL training. The agent's prompt and rollout logic are modified to generate its entire reasoning trajectory within a single top-level reasoning block, which then contains a sequence of discrete step blocks. Each step block represents a single logical reasoning action.
Each step can be one of two types:
- Search Step: Distinguished by the presence of a search-query block (containing the query) and a retrieved-information block (containing the retrieved passages).
- Non-Search Step: Relies solely on the agent's internal knowledge and contains only reasoning and conclusion tags, without a search query or retrieved context.

Formally, a complete reasoning trajectory $T$ for a given question is defined as a sequence of steps together with a final answer $a$:
$$
T = \left( s_1, s_2, \ldots, s_{N(T)}, a \right).
$$
Each step $s_i$ (for $1 \le i \le N(T)$) can be:
- A search step ($s_i^R$): represented as a tuple $(r_i, q_i, c_i, o_i)$, where:
  - $r_i$: the reasoning block for this step (planning and analysis).
  - $q_i$: the search query generated.
  - $c_i$: the retrieved context from the search.
  - $o_i$: the conclusion or summary of knowledge gained.
- A non-search step ($s_i^{NR}$): represented as a tuple $(r_i, o_i)$, where:
  - $r_i$: the reasoning block for this step.
  - $o_i$: the conclusion or summary of knowledge gained.

This structured format ensures that each reasoning step, along with its type (search or non-search) and its components (reasoning, query, context, conclusion), can be deterministically identified and extracted.
Adherence to this schema is enforced through two parallel approaches:
- System Prompt: The agent's system prompt is updated with explicit instructions and few-shot examples that demonstrate the correct usage of all required XML tags. The prompt for this parsable output format is shown in Figure 4 in the Appendix.
- Positive Reward for Correct Format: As detailed in Section 4.2.3, the RL framework applies a positive reward for correct output formats, incentivizing the model to consistently produce parsable trajectories.

The inference process for this format is outlined in Algorithm 1, and the mechanism for checking format correctness is detailed in Algorithm 2 and Algorithm 3 in the Appendix.
The following figure (Figure 3 from the original paper) illustrates the difference between the original Search-R1 format and HiPRAG's parsable format for the same multi-hop question.
The figure compares two reasoning-trajectory formats for the same multi-hop question: the original Search-R1 format on the left and HiPRAG's parsable format on the right. Each logical step is highlighted in a consistent color to show the correspondence between the formats. For readability, the actually retrieved documents are replaced with their summarized text.
The structured output format is a critical enabler for the on-the-fly detection of suboptimal searches, as it allows for straightforward extraction of step-level information.
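As an illustration of how such a format enables rule-based extraction, here is a minimal parsing sketch. The tag names used below (`<step>`, `<reasoning>`, `<search>`, `<context>`, `<conclusion>`) are placeholders, since the paper's exact XML schema is specified in its Appendix prompt (Figure 4) and may differ.

```python
import re
from dataclasses import dataclass

@dataclass
class Step:
    reasoning: str              # r_i
    conclusion: str             # o_i
    query: str | None = None    # q_i (search steps only)
    context: str | None = None  # c_i (search steps only)

    @property
    def is_search(self) -> bool:
        # A step counts as a search step if it carries a query block.
        return self.query is not None

def _tag(block: str, name: str) -> str | None:
    """Extract the inner text of the first <name>...</name> pair, if present."""
    m = re.search(rf"<{name}>(.*?)</{name}>", block, re.DOTALL)
    return m.group(1).strip() if m else None

def parse_trajectory(text: str) -> list[Step]:
    """Split a rolled-out trajectory into discrete, typed steps."""
    steps = []
    for block in re.findall(r"<step>(.*?)</step>", text, re.DOTALL):
        steps.append(Step(
            reasoning=_tag(block, "reasoning") or "",
            conclusion=_tag(block, "conclusion") or "",
            query=_tag(block, "search"),
            context=_tag(block, "context"),
        ))
    return steps
```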
4.2.2. On-the-Fly Detection of Suboptimal Searches
With the reasoning trajectories segmented into discrete, parsable steps, HiPRAG can implement efficient and robust checks for over-search and under-search during the RL training phase.
4.2.2.1. Over-search Detection
Previous methods for over-search detection were often complex and computationally expensive. They involved removing the search context and regenerating an answer with a fixed instruction to rely on internal knowledge. This approach was brittle, as appended instructions could conflict with the agent's original reasoning and lead to unnatural outputs.
HiPRAG proposes a more direct and robust method:
- For each search step, the search query $q_i$ is taken as a standalone question.
- An answer is then regenerated by prompting the agent to answer $q_i$ only, without performing a search.
- An external LLM judge (specifically, gpt-4.1-mini) is used to assess the semantic equivalence of the original step's conclusion $o_i$ and the re-generated answer.
- If the two are semantically equivalent, it implies the information retrieved by $q_i$ was already known to the model or could have been inferred internally, making the search redundant. In this case, the step is flagged as an over-search by the Over(·) detector function.

This method is faster because it isolates the core knowledge required by the query and provides a more reliable signal. The prompt used for the external LLM judge for over-search detection is shown in Figure 5 in the Appendix.
4.2.2.2. Under-search Detection
Under-search occurs when the agent fails to retrieve necessary external knowledge, leading to factual errors or hallucinations.
HiPRAG detects under-search for non-search steps:
- For each non-search step, an external verifier model (specifically, gpt-5-mini) is prompted to assess the factual accuracy of its reasoning $r_i$ and conclusion $o_i$.
- The verifier model checks two aspects:
  - Factual Accuracy: Are the statements in $r_i$ and $o_i$ factually correct?
  - Internal Logic: Does $o_i$ logically follow from $r_i$ within this same step?
- If the content is found to be incorrect or illogical, the step is flagged as an under-search by the Under(·) detector function, indicating that the agent should have used the search tool but failed to do so.

The prompt used for the external verifier model for under-search detection is shown in Figure 6 in the Appendix.
In actual implementation, both detection methods can work concurrently to improve speed. For batch over-search detection during the RL rollout phase, the re-generation step can be executed separately through batch generation before using the external LLM judge.
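Analogously, here is a minimal sketch of the under-search check for non-search steps; `verify_step` stands in for the external verifier call (gpt-5-mini, prompt in Figure 6), and the batching described above is omitted for brevity.

```python
from typing import Callable

def is_under_search(reasoning: str,
                    conclusion: str,
                    verify_step: Callable[[str, str], bool]) -> bool:
    """Under(s^NR): flag a non-search step whose content does not hold up.

    verify_step: asks an external verifier LLM whether (a) the reasoning and
    conclusion are factually correct and (b) the conclusion logically follows
    from the reasoning within this step.
    """
    # The step is an under-search if the verifier rejects it: external knowledge
    # should have been retrieved instead of relying on parametric memory.
    return not verify_step(reasoning, conclusion)
```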
4.2.3. Hierarchical Process Reward Calculation
A naive reward function that simply penalizes search (e.g., based on length or confidence) can harm the agent's ability to retrieve necessary information, especially for knowledge-intensive tasks. HiPRAG aims to incentivize optimal search behavior to improve both performance and efficiency, while preserving the agent's fundamental reasoning and search capabilities. To achieve this, the rewards are dynamically focused: initially prioritizing format and final answer correctness, and then shifting focus towards reasoning efficiency and process optimality once basic capabilities are established. This is accomplished through a hierarchical reward function.
Let's define the components of the reward function:
- $A(T)$: an indicator function that signals the correctness of the final answer $a$ of the trajectory $T$. Here, Cover Exact Match (CEM) (introduced in Section 5.2) is used. $A(T) = 1$ if the answer is correct, 0 otherwise.
- $F(T)$: an indicator function that signals whether the trajectory follows the required output format (as described in Section 4.2.1 and Algorithms 2 and 3). $F(T) = 1$ if the format is correct, 0 otherwise.
- $N(T)$: the total number of steps in the trajectory $T$.
- $N_{\mathrm{corr}}(T)$: the number of optimal (neither over-search nor under-search) steps in trajectory $T$. This is calculated as:
  $$
  N_{\mathrm{corr}}(T) = \left| \left\{ s^{R} \in T : \neg\,\mathrm{Over}(s^{R}) \right\} \right| + \left| \left\{ s^{NR} \in T : \neg\,\mathrm{Under}(s^{NR}) \right\} \right|
  $$
  where:
  - $s^{R}$: a search step in trajectory $T$.
  - $\neg\,\mathrm{Over}(s^{R})$: the logical NOT of the Over(·) detector, meaning the search step is not an over-search.
  - $s^{NR}$: a non-search step in trajectory $T$.
  - $\neg\,\mathrm{Under}(s^{NR})$: the logical NOT of the Under(·) detector, meaning the non-search step is not an under-search.

The reward function also incorporates two hyperparameters:
- $\lambda_f$: a format weight that controls the importance of format adherence.
- $\lambda_p$: a process bonus coefficient that controls the weight of the process optimality reward.

The single merged hierarchical reward for a trajectory $T$ is defined as:
$$
R(T) = A(T)\bigl(1 - \lambda_f\bigr) + \lambda_f F(T) + \lambda_p\, A(T)\, F(T)\, \frac{N_{\mathrm{corr}}(T)}{N(T)}.
$$
Let's break down this formula:
- The first two terms, $A(T)(1 - \lambda_f) + \lambda_f F(T)$, represent a weighted sum of the final answer correctness and format adherence. For an appropriate choice of $\lambda_f$, this part is algebraically equivalent to the standard outcome + format reward used in prior work (e.g., Search-R1).
- The third term, $\lambda_p A(T) F(T) \frac{N_{\mathrm{corr}}(T)}{N(T)}$, provides a gated process bonus. This bonus is only applied when both the final answer is correct ($A(T) = 1$) and the trajectory format is correct ($F(T) = 1$). If both hold, the reward becomes $1 + \lambda_p \frac{N_{\mathrm{corr}}(T)}{N(T)}$.
  - The fraction $\frac{N_{\mathrm{corr}}(T)}{N(T)}$ represents the proportion of optimal steps in the trajectory, effectively measuring the efficiency and optimality of the agent's reasoning path.

This hierarchical structure ensures a balanced training signal:
- Initial Focus: The agent is first incentivized to produce well-formed reasoning trajectories and correct answers. Without $A(T) = 1$ and $F(T) = 1$, the process bonus term is zero.
- Subsequent Optimization: Once these basic requirements are met, the agent receives an additional bonus for the efficiency and validity of its reasoning path (i.e., making fewer over-search or under-search errors).

This approach avoids over-suppression of search capabilities by not punishing suboptimal steps too heavily if the final outcome and format are correct, while still directly encouraging the model to develop a more nuanced understanding of its knowledge boundaries.
The following table (Table 3 from the original paper) provides a description of all symbols and notations used in the methodology.
| Symbol | Description |
|---|---|
| $T$ | A complete reasoning trajectory for a given question. |
| $N(T)$ | The total number of steps in a trajectory $T$. |
| $a$ | The final answer within a trajectory $T$. |
| $s_i$ | The $i$-th step in a reasoning trajectory. |
| $s^{R}$ | A search step, represented as a tuple $(r_i, q_i, c_i, o_i)$. |
| $s^{NR}$ | A non-search (internal reasoning) step, represented as a tuple $(r_i, o_i)$. |
| $r_i$ | The reasoning block of a step, containing planning and analysis. |
| $q_i$ | The search query generated in a search step. |
| $c_i$ | The retrieved context returned from a search query. |
| $o_i$ | The conclusion or summary of knowledge from a step. |
| | The re-generated answer used for over-search detection. |
| $A(a)$ | A function indicating if the final answer $a$ is correct (1) or not (0). |
| $F(T)$ | A function indicating if the trajectory $T$ follows the required format (1) or not (0). |
| $N_{\mathrm{corr}}(T)$ | The number of optimal (correct) steps in trajectory $T$. |
| $\mathrm{Over}(\cdot)$ | The detector function that identifies an over-search step. |
| $\mathrm{Under}(\cdot)$ | The detector function that identifies an under-search step. |
| $\lambda_f$ | A hyperparameter representing the format weight. |
| $\lambda_p$ | A hyperparameter representing the process bonus coefficient. |
| $R(T)$ | The final hierarchical reward calculated for a trajectory $T$. |
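Putting the pieces together, here is a minimal sketch of the hierarchical reward computation. It assumes the answer-correctness flag, format flag, and per-step detector outcomes have already been obtained; the variable names and the default λ values (0.2 and 0.4, the paper's main settings) are illustrative rather than taken from released code.

```python
from dataclasses import dataclass

@dataclass
class StepJudgement:
    is_search: bool      # search step (s^R) vs. non-search step (s^NR)
    is_suboptimal: bool  # Over(s^R) for search steps, Under(s^NR) for non-search steps

def hierarchical_reward(answer_correct: bool,       # A(T) from the CEM check
                        format_correct: bool,       # F(T) from the format checker
                        steps: list[StepJudgement],
                        lambda_f: float = 0.2,      # format weight
                        lambda_p: float = 0.4) -> float:  # process bonus coefficient
    a, f = float(answer_correct), float(format_correct)
    n = len(steps)
    # N_corr(T): steps that are neither over-search nor under-search.
    n_corr = sum(1 for s in steps if not s.is_suboptimal)
    base = a * (1.0 - lambda_f) + lambda_f * f               # outcome + format terms
    bonus = lambda_p * a * f * (n_corr / n if n else 0.0)    # gated process bonus
    return base + bonus
```

For a fully correct, fully optimal trajectory this yields $1 + \lambda_p$, matching the breakdown above.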
The following figure (Figure 1 from the original paper) provides a general overview of the HiPRAG training workflow, illustrating the components discussed above.
The figure is a schematic of the HiPRAG training workflow. The policy model generates a multi-step reasoning trajectory, each step is evaluated on the fly to detect potentially suboptimal search behavior, and the final hierarchical reward is then computed by combining the step-optimality process reward with the final-answer correctness and format rewards.
5. Experimental Setup
This section details the framework used to evaluate HiPRAG, including datasets, evaluation metrics, baselines, and training procedures.
5.1. Datasets
The experiments used a comprehensive set of question-answering (QA) datasets, covering both single-hop and multi-hop queries.
- Training Set: A combination of the official training sets from:
  - NQ (Natural Questions) (Kwiatkowski et al., 2019): A large-scale dataset for open-domain question answering. Questions are real Google queries, and answers are short text spans from Wikipedia. This dataset is crucial for teaching single-fact retrieval.
  - HotpotQA (Yang et al., 2018): A multi-hop QA dataset where questions require reasoning over multiple documents to find the answer, often involving compositional reasoning. This dataset is essential for training agentic RAG models in complex, multi-step reasoning.

  This combined training set provides a diverse corpus for teaching agents both single-fact retrieval and multi-hop reasoning, critical for learning efficient reasoning.
- Test Set: To assess both in-domain and out-of-domain generalization, models were evaluated on the development or test sets of seven diverse QA datasets:
  - NQ (Natural Questions): (Described above)
  - PopQA (Mallen et al., 2023): A knowledge-intensive QA dataset.
  - TriviaQA (Joshi et al., 2017): A large-scale dataset for reading comprehension and question answering, featuring questions from various trivia sources.
  - 2WikiMultiHopQA (Ho et al., 2020): A multi-hop QA dataset designed to evaluate reasoning steps over multiple Wikipedia articles.
  - Bamboogle (Press et al., 2023): A dataset designed to measure and narrow the compositionality gap in language models, often requiring complex reasoning.
  - HotpotQA: (Described above)
  - Musique (Trivedi et al., 2022): A multi-hop QA dataset where questions are formed by composing single-hop questions, requiring reasoning across multiple steps.

  These datasets are selected to provide wide coverage of question types and complexity, allowing for a thorough evaluation of the agent's accuracy and reasoning capabilities in diverse scenarios.
5.2. Evaluation Metrics
The primary metrics used to evaluate HiPRAG focus on both correctness and efficiency.
- Cover Exact Match (CEM) (Song et al., 2025a):
  - Conceptual Definition: This metric determines the correctness of a generated answer by checking if the ground-truth answer string is present anywhere within the model's generated response. It is a more lenient metric than strict Exact Match, as modern LLMs often generate verbose or explanatory responses that might contain the correct answer amidst additional text. CEM ensures that models are not unfairly penalized for valid answers embedded in longer, more conversational outputs.
  - Mathematical Formula: The paper describes CEM conceptually rather than providing a formal mathematical formula, implying a string-matching operation. If $G$ is the ground-truth answer string and $P$ is the model's generated answer string, then:
    $$
    \mathrm{CEM}(P, G) = \begin{cases} 1 & \text{if } G \text{ is a substring of } P \\ 0 & \text{otherwise} \end{cases}
    $$
  - Symbol Explanation:
    - $P$: The answer string generated by the model.
    - $G$: The ground-truth answer string.
    - 1 indicates the answer is correct according to CEM; 0 indicates the answer is incorrect according to CEM.
- Over-search Rate (OSR):
  - Conceptual Definition: OSR measures the proportion of search steps that were deemed over-searches (i.e., redundant or unnecessary) relative to all identifiable search steps within a set of test samples. A lower OSR indicates better search efficiency, as the agent avoids querying for already known or easily inferable information.
  - Mathematical Formula: Given a set of test samples $\mathcal{D}_{\mathrm{test}}$, with each sample having a reasoning trajectory $T$:
    $$
    \mathrm{OSR} = \frac{\sum_{T \in \mathcal{D}_{\mathrm{test}}} \left| \left\{ s^{R} \in T : \mathrm{Over}(s^{R}) \right\} \right|}{\sum_{T \in \mathcal{D}_{\mathrm{test}}} \left| \left\{ s^{R} \in T \right\} \right|}
    $$
  - Symbol Explanation:
    - $\mathcal{D}_{\mathrm{test}}$: The set of all test samples.
    - $T$: A single reasoning trajectory from the test set.
    - $s^{R}$: A search step within the trajectory $T$.
    - $\mathrm{Over}(s^{R})$: A boolean function (or detector) that returns True if the search step is an over-search, and False otherwise.
    - $|\cdot|$: Denotes the count of elements in the set.
    - The numerator sums the count of all over-search steps across all trajectories in the test set; the denominator sums the total count of all search steps across all trajectories in the test set.
- Under-search Rate (USR):
  - Conceptual Definition: USR measures the proportion of non-search steps where the agent failed to retrieve necessary information (leading to factual errors or logical inaccuracies) relative to all identifiable non-search steps within a set of test samples. A lower USR indicates better reasoning robustness, as the agent correctly identifies when external knowledge is required and avoids hallucinations.
  - Mathematical Formula: Given a set of test samples $\mathcal{D}_{\mathrm{test}}$, with each sample having a reasoning trajectory $T$:
    $$
    \mathrm{USR} = \frac{\sum_{T \in \mathcal{D}_{\mathrm{test}}} \left| \left\{ s^{NR} \in T : \mathrm{Under}(s^{NR}) \right\} \right|}{\sum_{T \in \mathcal{D}_{\mathrm{test}}} \left| \left\{ s^{NR} \in T \right\} \right|}
    $$
  - Symbol Explanation:
    - $\mathcal{D}_{\mathrm{test}}$: The set of all test samples.
    - $T$: A single reasoning trajectory from the test set.
    - $s^{NR}$: A non-search step within the trajectory $T$.
    - $\mathrm{Under}(s^{NR})$: A boolean function (or detector) that returns True if the non-search step is an under-search, and False otherwise.
    - $|\cdot|$: Denotes the count of elements in the set.
    - The numerator sums the count of all under-search steps across all trajectories in the test set; the denominator sums the total count of all non-search steps across all trajectories in the test set.
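To make these metrics concrete, a minimal sketch is shown below; the per-step detector outputs are assumed to be available as booleans, and the normalization inside the CEM check is an assumption rather than the paper's exact implementation.

```python
import re

def cover_exact_match(prediction: str, ground_truth: str) -> int:
    """CEM: 1 if the ground-truth string appears anywhere in the model output."""
    norm = lambda s: re.sub(r"\s+", " ", s.lower()).strip()  # assumed normalization
    return int(norm(ground_truth) in norm(prediction))

def over_search_rate(trajectories: list[list[dict]]) -> float:
    """OSR: flagged search steps / all search steps, pooled over the test set.

    Each step dict is assumed to look like {"is_search": bool, "flagged": bool},
    where "flagged" holds Over(s^R) for search steps and Under(s^NR) otherwise.
    """
    search_steps = [s for t in trajectories for s in t if s["is_search"]]
    return sum(s["flagged"] for s in search_steps) / max(len(search_steps), 1)

def under_search_rate(trajectories: list[list[dict]]) -> float:
    """USR: flagged non-search steps / all non-search steps, pooled over the test set."""
    non_search = [s for t in trajectories for s in t if not s["is_search"]]
    return sum(s["flagged"] for s in non_search) / max(len(non_search), 1)
```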
5.3. Baselines
HiPRAG was compared against a comprehensive set of baselines representing different paradigms in retrieval-augmented generation:
- Direct Inference: The LLM generates an answer directly without any retrieval mechanism. This serves as a lower bound, showcasing the LLM's inherent parametric knowledge.
- Standard RAG (Lewis et al., 2020): A conventional RAG setup where retrieval is performed once based on the initial query. The retrieved context is then provided to the LLM for a single-pass generation. This represents the basic form of RAG.
- Prompt-Based Agentic RAG: These methods rely on sophisticated prompt engineering (e.g., chain-of-thought) to achieve multi-step reasoning and dynamic search without explicit RL training.
  - IRCoT (Trivedi et al., 2023): Interleaves retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions.
  - Search-o1 (Li et al., 2025b): Agentic search-enhanced large reasoning models, another prompt-based agentic approach.
- RL-Based Agentic RAG: State-of-the-art methods that use Reinforcement Learning to train search agents, optimizing their policies for invoking tools and making search decisions.
  - Search-R1 (Jin et al., 2025b): Trains LLMs to reason and leverage search engines with Reinforcement Learning. It uses an outcome + format reward.
  - R1-Searcher (Song et al., 2025a): Incentivizes the search capability in LLMs via Reinforcement Learning.
  - R1-Searcher++ (Song et al., 2025b): An enhancement of R1-Searcher, further incentivizing dynamic knowledge acquisition.
  - β-GRPO (Wu et al., 2025): A method that uses Group Relative Policy Optimization to mitigate suboptimal agentic searches by reducing uncertainty.
  - Search-R1-step*: A variant of Search-R1 (v0.3) that uses its outcome + format reward but is adapted to HiPRAG's structured output format, to isolate the impact of the format change.
  - β-GRPO-step*: A variant of β-GRPO that uses its reward but is adapted to HiPRAG's structured output format.
These baselines were chosen to represent a spectrum of `RAG` evolution, from simple inference to advanced `RL-trained agents`, allowing for a fair comparison of `HiPRAG`'s innovation in process-level optimization.
## 5.4. Training Details
* **Hardware:** All `RL-based models` were trained using four `NVIDIA A100 80GB GPUs`.
* **Training Schedule:** The training process ran for a total of `400 steps`, with checkpoints saved every `50 steps`.
* **Checkpoint Strategy:** For evaluation, if training completed stably, the final checkpoint was used. If the training reward `collapsed` (indicating instability), the last stable checkpoint before the collapse was used to ensure a fair evaluation of the model's best-learned state.
* **Models:**
* **Main Experiments:** `Qwen2.5-(3B/7B)-Instruct models` (Qwen et al., 2025).
* **Generalizability Analysis:** `Llama3.2-3B-Instruct` (Grattafiori et al., 2024) and `Qwen2.5-3B` (base model, not instruct-tuned).
* **Suboptimal Search Detectors:**
* `Over-search detection`: Performed by `gpt-4.1-mini` (OpenAI, 2025a).
* `Under-search detection`: Relied on `gpt-5-mini` (OpenAI, 2025b). The use of these smaller, proprietary models was chosen for their fast inference speed and sufficient performance.
* **Reinforcement Learning Algorithms:**
* **Core RL Algorithm:** `Proximal Policy Optimization (PPO)` (Schulman et al., 2017) was used for main experiments due to its demonstrated training stability in complex LLM finetuning.
* **Comparison Algorithm:** `Group Relative Policy Optimization (GRPO)` (Shao et al., 2024) was also used, with a group size of 5 and the same training parameters as PPO, to assess the impact of the RL algorithm choice.
* **Retrieval Environment:**
* Followed the `Search-R1` setup.
* `Knowledge Source`: The 2018 Wikipedia dump (Karpukhin et al., 2020).
* `Retriever`: `E5-base` (Wang et al., 2024).
* `Retrieval Output`: In each search step, the `top-3 relevant passages` were returned.
* **Inference Parameters:**
* **RL Rollout Stage:** `Temperature` and `top p` were set to `1` to encourage a high possibility of generating diverse reasoning trajectories.
* **Testing Phase:** `Temperature` and `top p` were set to the models' default values for evaluation.
* **Reward Function Hyperparameters:**
* Main experiments: `format weight` $\lambda _ { f } = 0.2$ and `process bonus coefficient` $\lambda _ { p } = 0.4$.
* Exploration of different $\lambda _ { p }$ values was also conducted (see Section 6.3).
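For reference, the training setup above can be summarized in a single configuration sketch; the values are taken from this section, while the key names are illustrative and not from the released code.

```python
hiprag_training_config = {
    "models": ["Qwen2.5-3B-Instruct", "Qwen2.5-7B-Instruct",
               "Llama-3.2-3B-Instruct", "Qwen2.5-3B"],          # main + generalizability runs
    "rl_algorithms": {"main": "PPO", "comparison": "GRPO", "grpo_group_size": 5},
    "total_steps": 400,
    "checkpoint_every": 50,
    "detectors": {"over_search_judge": "gpt-4.1-mini",
                  "under_search_verifier": "gpt-5-mini"},
    "retrieval": {"corpus": "Wikipedia 2018 dump", "retriever": "E5-base", "top_k": 3},
    "rollout_sampling": {"temperature": 1.0, "top_p": 1.0},      # defaults used at test time
    "reward": {"lambda_f": 0.2, "lambda_p": 0.4},
    "hardware": "4x NVIDIA A100 80GB",
}
```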
# 6. Results & Analysis
This section presents a comprehensive analysis of `HiPRAG`'s performance, comparing it against state-of-the-art baselines and exploring the influence of various parameters and design choices.
## 6.1. Core Results Analysis
The main experimental results, presented in Table 1, demonstrate that `HiPRAG` significantly outperforms all baseline methods across a diverse set of seven question-answering benchmarks in terms of `Cover Exact Match (CEM)`.
The following are the results from Table 1 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">General QA</td>
<td colspan="3">Multi-Hop QA</td>
<td rowspan="2">Bamboogle</td>
<td rowspan="2">Avg.</td>
</tr>
<tr>
<td>NQ</td>
<td>TriviaQA</td>
<td>PopQA</td>
<td>HotpotQA</td>
<td>2Wiki</td>
<td>Musique</td>
</tr>
</thead>
<tbody>
<tr>
<td>Direct Inference</td>
<td>27.0</td>
<td>26.8</td>
<td>40.1</td>
<td>58.7</td>
<td>16.0</td>
<td>7.9</td>
<td>15.9</td>
<td>31.8</td>
</tr>
<tr>
<td>Standard RAG</td>
<td>51.2</td>
<td>54.7</td>
<td>65.7</td>
<td>56.9</td>
<td>21.6</td>
<td>18.5</td>
<td>18.6</td>
<td>45.3</td>
</tr>
<tr>
<td>IRCoT</td>
<td>27.5</td>
<td>36.0</td>
<td>42.5</td>
<td>51.4</td>
<td>37.6</td>
<td>19.4</td>
<td>20.6</td>
<td>36.4</td>
</tr>
<tr>
<td>Search-o1</td>
<td>40.2</td>
<td>42.2</td>
<td>58.5</td>
<td>56.1</td>
<td>45.6</td>
<td>15.1</td>
<td>19.3</td>
<td>43.9</td>
</tr>
<tr>
<td>R1-Searcher</td>
<td>60.0</td>
<td>73.0</td>
<td>58.2</td>
<td>60.4</td>
<td>60.3</td>
<td>32.9</td>
<td>55.8</td>
<td>60.6</td>
</tr>
<tr>
<td>R1-Searcher++</td>
<td>61.0</td>
<td>73.5</td>
<td>59.0</td>
<td>64.2</td>
<td>63.2</td>
<td>32.3</td>
<td>58.7</td>
<td>62.1</td>
</tr>
<tr>
<td>Search-R1</td>
<td>61.2</td>
<td>73.6</td>
<td>56.5</td>
<td>54.0</td>
<td>63.6</td>
<td>24.8</td>
<td>48.4</td>
<td>60.3</td>
</tr>
<tr>
<td>Search-R1-step*</td>
<td>62.4</td>
<td>74.4</td>
<td>57.3</td>
<td>54.8</td>
<td>64.2</td>
<td>25.3</td>
<td>49.6</td>
<td>61.2</td>
</tr>
<tr>
<td>β-GRPO</td>
<td>65.0</td>
<td>75.0</td>
<td>60.0</td>
<td>53.0</td>
<td>66.0</td>
<td>24.0</td>
<td>52.0</td>
<td>62.5</td>
</tr>
<tr>
<td>β-GRPO-step*</td>
<td>62.4</td>
<td>73.9</td>
<td>61.3</td>
<td>52.1</td>
<td>66.0</td>
<td>22.8</td>
<td>54.4</td>
<td>62.1</td>
</tr>
<tr>
<td>HiPRAG-3B</td>
<td>68.7</td>
<td>75.5</td>
<td>66.3</td>
<td>57.4</td>
<td>67.4</td>
<td>24.1</td>
<td>41.6</td>
<td>65.4</td>
</tr>
<tr>
<td>HiPRAG-7B</td>
<td>71.2</td>
<td>76.3</td>
<td>63.2</td>
<td>62.4</td>
<td>71.7</td>
<td>34.1</td>
<td>52.8</td>
<td>67.2</td>
</tr>
</tbody>
</table></div>
**Key observations from Table 1:**
* **Overall Performance:** `HiPRAG-7B` achieves the highest average `CEM` score of 67.2%, significantly outperforming the next-best baseline, `β-GRPO` (62.5%) and `R1-Searcher++` (62.1%). `HiPRAG-3B` also shows a strong average `CEM` of 65.4%, surpassing all baselines including 7B models like `R1-Searcher++`. This indicates that `HiPRAG`'s fine-grained, process-based reward mechanism effectively guides agents to develop more robust and accurate reasoning trajectories.
* **Performance across Datasets:**
* `HiPRAG` models show consistent gains across most datasets. For `General QA` datasets like NQ, TriviaQA, and PopQA, `HiPRAG` achieves top scores.
* For `Multi-Hop QA` datasets, `HiPRAG-7B` shows impressive performance on HotpotQA (62.4%), 2Wiki (71.7%), and Musique (34.1%), often outperforming baselines by a large margin on these complex reasoning tasks.
* **Efficiency Gains (from Abstract):** Beyond accuracy, `HiPRAG` delivers substantial gains in efficiency. The `over-search rate (OSR)` is dramatically reduced from over 27% in baselines to just 2.3%, while concurrently lowering the `under-search rate (USR)`. This demonstrates the direct efficacy of optimizing the reasoning process.
The superior performance of `HiPRAG` models, especially the 3B variant surpassing larger baseline models, suggests that the proposed training methodology is a more effective path to performance gains than merely scaling model size with conventional rewards.
## 6.2. Data Presentation (Tables)
The following are the results from Table 2 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<th>Base Model</th>
<th>RL Algo.</th>
<th>Method</th>
<th>Avg. CEM</th>
<th>Avg. OSR</th>
<th>Avg. USR</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>PPO</td>
<td>baseline</td>
<td>56.4</td>
<td>7.3</td>
<td>57.6</td>
</tr>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>PPO</td>
<td>HiPRAG</td>
<td>64.8</td>
<td>6.0</td>
<td>49.7</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>GRPO</td>
<td>baseline</td>
<td>58.5</td>
<td>8.4</td>
<td>52.1</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>GRPO</td>
<td>HiPRAG</td>
<td>64.4</td>
<td>4.1</td>
<td>33.2</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>PPO</td>
<td>baseline</td>
<td>60.3</td>
<td>3.8</td>
<td>44.0</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>PPO</td>
<td>HiPRAG</td>
<td>65.4</td>
<td>3.2</td>
<td>41.9</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO</td>
<td>baseline</td>
<td>59.3</td>
<td>6.1</td>
<td>47.5</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO</td>
<td>HiPRAG</td>
<td>64.1</td>
<td>4.9</td>
<td>38.1</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>GRPO</td>
<td>baseline</td>
<td>61.2</td>
<td>5.2</td>
<td>43.3</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>GRPO</td>
<td>HiPRAG</td>
<td>67.2</td>
<td>4.3</td>
<td>32.6</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>PPO</td>
<td>baseline</td>
<td>53.3</td>
<td>7.6</td>
<td>29.0</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>PPO</td>
<td>HiPRAG</td>
<td>64.5</td>
<td>6.2</td>
<td>33.9</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO</td>
<td>HiPRAG (over-search only)</td>
<td>58.8</td>
<td>4.9</td>
<td>52.7</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO</td>
<td>HiPRAG (under-search only)</td>
<td>63.3</td>
<td>6.6</td>
<td>16.9</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO</td>
<td>HiPRAG (λp = 0.2)</td>
<td>59.6</td>
<td>5.5</td>
<td>44.5</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO</td>
<td>HiPRAG (λp = 0.6)</td>
<td>62.5</td>
<td>5.2</td>
<td>39.0</td>
</tr>
</tbody>
</table></div>
The following are the results from Table 4 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td rowspan="2">Base Model</td>
<td rowspan="2">RL Algo. + Method</td>
<td colspan="3">General QA</td>
<td colspan="3">Multi-Hop QA</td>
<td rowspan="2">Bamboogle</td>
<td rowspan="2">Avg.</td>
</tr>
<tr>
<td>NQ</td>
<td>TriviaQA</td>
<td>PopQA</td>
<td>HotpotQA</td>
<td>2Wiki</td>
<td>Musique</td>
</tr>
</thead>
<tbody>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>PPO + baseline</td>
<td>65.2</td>
<td>74.5</td>
<td>55.1</td>
<td>47.0</td>
<td>52.3</td>
<td>18.7</td>
<td>36.0</td>
<td>56.4</td>
</tr>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>71.6</td>
<td>77.2</td>
<td>61.0</td>
<td>57.7</td>
<td>67.9</td>
<td>25.7</td>
<td>43.2</td>
<td>64.8</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>GRPO + baseline</td>
<td>59.6</td>
<td>69.1</td>
<td>57.3</td>
<td>52.4</td>
<td>61.4</td>
<td>20.6</td>
<td>24.8</td>
<td>58.5</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>GRPO + HiPRAG</td>
<td>68.5</td>
<td>74.2</td>
<td>60.6</td>
<td>59.2</td>
<td>69.1</td>
<td>27.9</td>
<td>38.4</td>
<td>64.4</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>PPO + baseline</td>
<td>60.6</td>
<td>71.7</td>
<td>55.8</td>
<td>54.3</td>
<td>65.7</td>
<td>24.1</td>
<td>40.8</td>
<td>60.3</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>PPO + HiPRAG</td>
<td>68.7</td>
<td>75.5</td>
<td>66.3</td>
<td>57.4</td>
<td>67.4</td>
<td>24.1</td>
<td>41.6</td>
<td>65.4</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + baseline</td>
<td>60.9</td>
<td>70.1</td>
<td>57.0</td>
<td>52.0</td>
<td>63.0</td>
<td>24.3</td>
<td>37.6</td>
<td>59.3</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>65.6</td>
<td>73.9</td>
<td>62.1</td>
<td>55.6</td>
<td>69.6</td>
<td>26.0</td>
<td>32.8</td>
<td>64.1</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>GRPO + baseline</td>
<td>62.4</td>
<td>74.4</td>
<td>57.3</td>
<td>54.8</td>
<td>64.2</td>
<td>25.3</td>
<td>49.6</td>
<td>61.2</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>GRPO + HiPRAG</td>
<td>71.2</td>
<td>76.3</td>
<td>63.2</td>
<td>62.4</td>
<td>71.7</td>
<td>34.1</td>
<td>52.8</td>
<td>67.2</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>PPO + baseline</td>
<td>55.6</td>
<td>67.5</td>
<td>43.5</td>
<td>49.4</td>
<td>58.5</td>
<td>26.6</td>
<td>44.0</td>
<td>53.3</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>66.2</td>
<td>75.7</td>
<td>58.4</td>
<td>59.9</td>
<td>66.2</td>
<td>34.3</td>
<td>52.0</td>
<td>64.5</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG (over-search only)</td>
<td>61.9</td>
<td>66.9</td>
<td>54.9</td>
<td>52.2</td>
<td>65.4</td>
<td>25.5</td>
<td>39.2</td>
<td>58.8</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG (under-search only)</td>
<td>63.7</td>
<td>74.1</td>
<td>60.6</td>
<td>55.9</td>
<td>67.9</td>
<td>28.4</td>
<td>40.8</td>
<td>63.3</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG ($\lambda_p$ = 0.2)</td>
<td>61.9</td>
<td>71.2</td>
<td>56.8</td>
<td>53.7</td>
<td>62.2</td>
<td>25.4</td>
<td>31.2</td>
<td>59.6</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG ($\lambda_p$ = 0.6)</td>
<td>66.6</td>
<td>74.4</td>
<td>60.5</td>
<td>55.5</td>
<td>64.4</td>
<td>25.6</td>
<td>38.4</td>
<td>62.5</td>
</tr>
</tbody>
</table></div>
The following are the results from Table 5 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td rowspan="2">Base Model</td>
<td rowspan="2">RL Algo. + Method</td>
<td colspan="3">General QA</td>
<td colspan="4">Multi-Hop QA</td>
<td rowspan="2">Avg.</td>
</tr>
<tr>
<td>NQ</td>
<td>TriviaQA</td>
<td>PopQA</td>
<td>HotpotQA</td>
<td>2Wiki</td>
<td>Musique</td>
<td>Bamboogle</td>
</tr>
</thead>
<tbody>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>PPO + baseline</td>
<td>12.5</td>
<td>15.4</td>
<td>5.0</td>
<td>4.8</td>
<td>3.7</td>
<td>2.7</td>
<td>8.7</td>
<td>7.3</td>
</tr>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>11.9</td>
<td>13.3</td>
<td>4.5</td>
<td>4.6</td>
<td>1.8</td>
<td>3.1</td>
<td>5.0</td>
<td>6.0</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>GRPO + baseline</td>
<td>8.4</td>
<td>17.0</td>
<td>5.6</td>
<td>7.2</td>
<td>4.3</td>
<td>5.0</td>
<td>10.3</td>
<td>8.4</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>GRPO + HiPRAG</td>
<td>4.4</td>
<td>9.8</td>
<td>2.2</td>
<td>3.0</td>
<td>2.9</td>
<td>1.4</td>
<td>3.9</td>
<td>4.1</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>PPO + baseline</td>
<td>6.4</td>
<td>9.0</td>
<td>2.6</td>
<td>2.9</td>
<td>1.5</td>
<td>1.7</td>
<td>4.3</td>
<td>3.8</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>PPO + HiPRAG</td>
<td>5.1</td>
<td>6.9</td>
<td>2.2</td>
<td>2.3</td>
<td>1.4</td>
<td>1.2</td>
<td>3.4</td>
<td>3.2</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + baseline</td>
<td>8.6</td>
<td>13.5</td>
<td>5.6</td>
<td>4.2</td>
<td>1.8</td>
<td>3.9</td>
<td>12.8</td>
<td>6.1</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>6.0</td>
<td>11.0</td>
<td>3.9</td>
<td>4.5</td>
<td>2.5</td>
<td>2.8</td>
<td>11.5</td>
<td>4.9</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>GRPO + baseline</td>
<td>5.3</td>
<td>7.4</td>
<td>2.0</td>
<td>3.5</td>
<td>0.9</td>
<td>3.6</td>
<td>8.7</td>
<td>5.2</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>GRPO + HiPRAG</td>
<td>4.1</td>
<td>5.4</td>
<td>1.3</td>
<td>1.8</td>
<td>0.3</td>
<td>1.5</td>
<td>4.8</td>
<td>4.3</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>PPO + baseline</td>
<td>11.6</td>
<td>19.8</td>
<td>6.1</td>
<td>7.4</td>
<td>2.7</td>
<td>8.5</td>
<td>19.6</td>
<td>7.6</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>10.4</td>
<td>14.5</td>
<td>4.6</td>
<td>5.6</td>
<td>2.1</td>
<td>5.9</td>
<td>13.4</td>
<td>6.2</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG (over-search only)</td>
<td>6.0</td>
<td>11.0</td>
<td>3.9</td>
<td>4.5</td>
<td>2.5</td>
<td>2.8</td>
<td>11.5</td>
<td>4.9</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG (under-search only)</td>
<td>8.2</td>
<td>15.6</td>
<td>5.9</td>
<td>5.3</td>
<td>2.7</td>
<td>3.1</td>
<td>6.4</td>
<td>6.6</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG ($\lambda_p$ = 0.2)</td>
<td>7.4</td>
<td>11.4</td>
<td>3.7</td>
<td>4.4</td>
<td>2.3</td>
<td>3.3</td>
<td>12.1</td>
<td>5.5</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG ($\lambda_p$ = 0.6)</td>
<td>9.3</td>
<td>13.1</td>
<td>4.1</td>
<td>3.3</td>
<td>1.3</td>
<td>2.2</td>
<td>6.3</td>
<td>5.2</td>
</tr>
</tbody>
</table></div>
The following are the results from Table 6 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td rowspan="2">Base Model</td>
<td rowspan="2">RL Algo. + Method</td>
<td colspan="3">General QA</td>
<td colspan="3">Multi-Hop QA</td>
<td rowspan="2">Bamboogle</td>
<td rowspan="2">Avg.</td>
</tr>
<tr>
<td>NQ</td>
<td>TriviaQA</td>
<td>PopQA</td>
<td>HotpotQA</td>
<td>2Wiki</td>
<td>Musique</td>
</tr>
</thead>
<tbody>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>PPO + baseline</td>
<td>67.1</td>
<td>75.0</td>
<td>66.7</td>
<td>52.6</td>
<td>59.3</td>
<td>50.0</td>
<td>20.0</td>
<td>57.6</td>
</tr>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>35.3</td>
<td>48.4</td>
<td>31.7</td>
<td>50.8</td>
<td>55.3</td>
<td>64.3</td>
<td>10.3</td>
<td>49.7</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>GRPO + baseline</td>
<td>61.9</td>
<td>63.9</td>
<td>59.6</td>
<td>46.1</td>
<td>49.1</td>
<td>61.9</td>
<td>22.3</td>
<td>52.1</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>GRPO + HiPRAG</td>
<td>52.9</td>
<td>34.9</td>
<td>35.2</td>
<td>29.2</td>
<td>25.0</td>
<td>45.5</td>
<td>21.2</td>
<td>33.2</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>PPO + baseline</td>
<td>33.3</td>
<td>66.7</td>
<td>30.8</td>
<td>38.5</td>
<td>47.5</td>
<td>66.7</td>
<td>0.0</td>
<td>44.0</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>PPO + HiPRAG</td>
<td>43.9</td>
<td>36.4</td>
<td>42.3</td>
<td>41.9</td>
<td>42.6</td>
<td>56.8</td>
<td>16.7</td>
<td>41.9</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + baseline</td>
<td>47.1</td>
<td>33.2</td>
<td>48.8</td>
<td>39.0</td>
<td>52.9</td>
<td>70.0</td>
<td>32.2</td>
<td>47.5</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>11.1</td>
<td>44.4</td>
<td>61.9</td>
<td>25.0</td>
<td>32.0</td>
<td>10.1</td>
<td>8.7</td>
<td>38.1</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>GRPO + baseline</td>
<td>40.5</td>
<td>34.3</td>
<td>43.8</td>
<td>40.9</td>
<td>45.0</td>
<td>56.2</td>
<td>20.0</td>
<td>43.3</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>GRPO + HiPRAG</td>
<td>30.2</td>
<td>34.9</td>
<td>34.9</td>
<td>40.5</td>
<td>24.4</td>
<td>37.3</td>
<td>41.7</td>
<td>32.6</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>PPO + baseline</td>
<td>33.4</td>
<td>13.9</td>
<td>17.5</td>
<td>40.3</td>
<td>33.4</td>
<td>50.0</td>
<td>13.2</td>
<td>29.0</td>
</tr>
<tr>
<td>Qwen2.5-7B-Instruct</td>
<td>PPO + HiPRAG</td>
<td>57.1</td>
<td>44.9</td>
<td>25.5</td>
<td>20.0</td>
<td>34.6</td>
<td>57.1</td>
<td>1.6</td>
<td>33.9</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG (over-search only)</td>
<td>54.5</td>
<td>55.2</td>
<td>48.9</td>
<td>44.7</td>
<td>53.7</td>
<td>78.3</td>
<td>20.0</td>
<td>52.7</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG (under-search only)</td>
<td>14.0</td>
<td>20.4</td>
<td>13.6</td>
<td>25.6</td>
<td>13.2</td>
<td>30.8</td>
<td>16.9</td>
<td>16.9</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG ($\lambda_p$ = 0.2)</td>
<td>24.0</td>
<td>39.5</td>
<td>55.8</td>
<td>41.3</td>
<td>45.5</td>
<td>80.1</td>
<td>30.1</td>
<td>44.5</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>PPO + HiPRAG ($\lambda_p$ = 0.6)</td>
<td>27.2</td>
<td>34.1</td>
<td>33.6</td>
<td>60.6</td>
<td>51.2</td>
<td>53.3</td>
<td>1.6</td>
<td>39.0</td>
</tr>
</tbody>
</table></div>
## 6.3. Ablation Studies / Parameter Analysis
### 6.3.1. Influence of Model Size
As expected, larger models generally demonstrate stronger reasoning capabilities and higher accuracy. `HiPRAG` experiments with 7B models consistently outperform their 3B counterparts in terms of `Avg. CEM` (e.g., `HiPRAG-7B` with GRPO achieves 67.2% vs. `HiPRAG-3B` with GRPO achieving 64.4% as seen in Table 2). However, `HiPRAG`'s process-based reward approach allows smaller models to achieve remarkable performance, narrowing the gap with larger models. For instance, the `HiPRAG` model trained based on `Qwen2.5-3B-Instruct` + `GRPO` achieves an `Avg. CEM` of 64.4%, which not only surpasses strong external 7B baselines like `R1-Searcher++` (62.2% Avg. CEM in Table 1) but also outperforms its own 7B counterpart trained with a baseline reward (61.2% Avg. CEM in Table 2). This highlights that the `HiPRAG` training methodology is a more effective pathway to performance gains than simply scaling model size with conventional rewards.
Furthermore, larger models also tend to make efficient search decisions. With `GRPO`, the 7B model records 4.3% Avg. OSR and 32.6% Avg. USR versus the 3B model's 4.1% Avg. OSR and 33.2% Avg. USR, i.e., a slightly lower under-search rate at a comparable over-search rate, in addition to its higher accuracy (Table 2).
### 6.3.2. Influence of Model Family
To assess the generalizability of `HiPRAG`, experiments were conducted with both `Qwen2.5-3B-Instruct` and `Llama-3.2-3B-Instruct` models. While both models achieve comparable peak accuracy after training with `HiPRAG` (Qwen-3B at 64.1% Avg. CEM, Llama-3B at 64.8% Avg. CEM), their underlying search behaviors and efficiency differ.
The following figure (Figure 2b from the original paper) shows the curves of the ratio of searches among all reasoning steps for different model families.

*This figure is a line chart showing how the proportion of search steps among all reasoning steps evolves over training steps for the Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct models. The Llama-3.2-3B-Instruct ratio stays at a relatively high level and eventually approaches 0.9, while the Qwen2.5-3B-Instruct ratio is comparatively lower, fluctuating between 0.4 and 0.9.*
As seen in Figure 2b, the `Llama-3B` model initially shows a higher tendency to rely on its `parametric knowledge` with more non-search steps, resulting in a higher `under-search rate`. After training, the `Qwen-3B` model achieves a high `Avg. CEM` of 64.1% with lower suboptimal search rates (4.9% Avg. OSR, 38.1% Avg. USR) compared to the `Llama-3B` model's 6.0% Avg. OSR and 49.7% Avg. USR (Table 2). This suggests that although `HiPRAG` is effective across different model families, the base model's inherent tendencies can influence the final search efficiency.
### 6.3.3. Influence of RL Algorithm
The impact of different `RL algorithms` on `HiPRAG` was explored using `PPO` and `GRPO` on `Qwen2.5-3B/7B-Instruct` models. `PPO` demonstrated better training stability, often completing the full training run without reward collapse, while `GRPO` consistently showed the potential for higher final performance and faster convergence.
The following figure (Figure 2a from the original paper) shows the reward curves for different RL algorithms.

*This figure shows how the average reward of the Qwen2.5-3B model changes over training steps under different training setups; it compares the HiPRAG and GRPO runs, whose reward curves diverge noticeably around step 200.*
From Table 2, `GRPO` yields a higher `Avg. CEM` for both the 3B (64.4% vs. 64.1% with PPO) and 7B (67.2% vs. 64.5% with PPO) models. Moreover, `GRPO` results in more efficient search behavior: the 7B GRPO model reaches a 4.3% Avg. OSR in Table 2, compared to 6.2% for 7B PPO (the 2.3% over-search rate quoted in the abstract is not shown directly in Table 2 for this configuration, whose per-dataset values average to 4.3%). This aligns with findings in related literature where `GRPO`'s `critic-free approach` can be more sample-efficient for `LLM training`, often with a trade-off in training stability.
### 6.3.4. Influence of Instruction Tuning on Base Model
To understand the impact of `instruction-tuning` on the base model before applying `HiPRAG`, a comparison was made between a base model (`Qwen2.5-3B`) and its `instruction-tuned counterpart` (`Qwen2.5-3B-Instruct`). Table 2 shows that the instruct-tuned model generally exhibited a higher initial reward, as its pre-training makes it more adept at following the structured output format required by `HiPRAG`. The `hierarchical reward`, which gates the `process bonus` until both the answer and format are correct, favors models that quickly learn this structure.
However, the base model (`Qwen2.5-3B`) eventually caught up, converging to a similar reward level. Interestingly, once fully trained, the base model achieves a slightly higher `Avg. CEM` score (65.4% vs. 64.1% for the instruct-tuned model) and a lower `Avg. OSR` (3.2% vs. 4.9%). This suggests that the base model may learn the reasoning and search behaviors from the `RL objective` more purely, without potential biases introduced during the `instruction-tuning` phase.
### 6.3.5. Influence on Output Format (Ablation)
To isolate the effect of the new structured output format from the reward change, `Search-R1-step*` and `β-GRPO-step*` variants were trained. These models used the same reward functions as their original counterparts (`Search-R1 v0.3` and `β-GRPO`, respectively) but were required to use `HiPRAG`'s parsable output format. The results in Table 1 show that this structured format, by itself, maintains or slightly increases performance. This confirms that the new parsable output format provides a robust foundation and that the significant performance gains of the full `HiPRAG` method are primarily due to the process-based reward mechanism it enables, rather than being merely an artifact of the format change.
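To make the role of this parsable format concrete, below is a minimal sketch of how a structured trajectory could be split into discrete, checkable steps. The tag names (`<step>`, `<search>`, `<conclusion>`) and the `Step` fields are illustrative assumptions, not the paper's exact schema.

```python
import re
from dataclasses import dataclass

@dataclass
class Step:
    index: int
    is_search: bool      # True if this step issued a retrieval call
    query: str | None    # the search query, if any
    conclusion: str      # the step's intermediate conclusion

def parse_trajectory(text: str) -> list[Step]:
    """Decompose a structured trajectory into discrete steps (hypothetical tags)."""
    steps = []
    for i, block in enumerate(re.findall(r"<step>(.*?)</step>", text, re.DOTALL)):
        search = re.search(r"<search>(.*?)</search>", block, re.DOTALL)
        conclusion = re.search(r"<conclusion>(.*?)</conclusion>", block, re.DOTALL)
        steps.append(Step(
            index=i,
            is_search=search is not None,
            query=search.group(1).strip() if search else None,
            conclusion=conclusion.group(1).strip() if conclusion else "",
        ))
    return steps

# Tiny usage example based on the case study discussed later in this analysis.
steps = parse_trajectory(
    "<step><conclusion>The performer is Lacy J. Dalton.</conclusion></step>"
    "<step><search>Lacy J. Dalton place of birth</search>"
    "<conclusion>Lacy J. Dalton was born in Salem, Missouri.</conclusion></step>"
)
print([(s.index, s.is_search) for s in steps])  # [(0, False), (1, True)]
```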
### 6.3.6. Influence of Process Bonus Coefficient ($\lambda_p$) (Ablation)
The process bonus coefficient $\lambda_p$ determines the weight of the step-correctness ratio in the hierarchical reward. Experiments with different values revealed an optimal balance:
- Default $\lambda_p$ (Optimal): Achieved the highest performance (64.1% Avg. CEM for `Qwen2.5-3B-Instruct` with `PPO`) together with a good balance of efficiency (4.9% Avg. OSR, 38.1% Avg. USR).
- $\lambda_p$ = 0.2 (Lower): Behaved similarly to an outcome-only reward and failed to sufficiently incentivize efficiency (59.6% Avg. CEM), which was reflected in higher inefficiency rates (5.5% Avg. OSR, 44.5% Avg. USR).
- $\lambda_p$ = 0.6 (Higher): Over-prioritized step purity at the expense of final answer correctness, leading to a slight performance degradation (62.5% Avg. CEM). This configuration also showed relatively low Avg. OSR (5.2%) and Avg. USR (39.0%), but the trade-off in accuracy was not beneficial.

This ablation study confirms that an optimally tuned process bonus is crucial for achieving the best trade-off between accuracy and efficiency. A sketch of how $\lambda_p$ enters the reward is given below.
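Here is the promised sketch: a hedged, minimal illustration of a hierarchical reward in which the process bonus is gated on format validity and answer correctness and weighted by $\lambda_p$. The function signature and gating details are assumptions for illustration; the paper's exact reward shaping may differ.

```python
def hierarchical_reward(
    answer_correct: bool,
    format_valid: bool,
    step_optimality: list[bool],  # one judge verdict per reasoning step
    lambda_p: float,              # process bonus coefficient (0.2 and 0.6 are the ablated values)
    outcome_reward: float = 1.0,
) -> float:
    """Sketch of a hierarchical outcome + format + process reward.

    The process bonus is only added once the output format is valid and the
    final answer is correct; it is proportional to the fraction of steps
    judged optimal (necessary searches and justified non-search steps).
    """
    if not format_valid:
        return 0.0                # no reward without a parsable trajectory
    if not answer_correct:
        return 0.0                # gate: no process bonus for wrong answers
    if not step_optimality:
        return outcome_reward
    bonus = lambda_p * (sum(step_optimality) / len(step_optimality))
    return outcome_reward + bonus
```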
### 6.3.7. Training with Over-search or Under-search Only (Ablation)
To understand the individual contributions of the over-search and under-search penalties, models were trained with only one type of penalty (see the sketch after this list for how such toggles map onto the process bonus):
- `HiPRAG` (over-search only): Training to reduce only over-search proved insufficient, yielding a low Avg. CEM of 58.8%. While it successfully reduced the Avg. OSR to 4.9%, it made the model too hesitant to search, resulting in a very high Avg. USR of 52.7%. This indicates that solely penalizing over-search can push the model toward under-searching.
- `HiPRAG` (under-search only): Targeting only under-search was more effective (63.3% Avg. CEM), underscoring that preventing hallucination is often more critical than improving efficiency alone. This variant dramatically lowered the Avg. USR to just 16.9% but made the agent overly reliant on its search tool, slightly increasing the Avg. OSR to 6.6%.
- Combined approach (full `HiPRAG`): The best performance was achieved only when both suboptimal behaviors were penalized simultaneously (64.1% Avg. CEM for the corresponding configuration), confirming that a holistic approach to search optimization is necessary.
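As a rough companion to the reward sketch above, the following shows how the single-penalty ablations could be expressed as toggles on which suboptimal step types count against the process bonus. The flag and field names are hypothetical.

```python
def step_is_optimal(
    is_search: bool,
    judged_over_search: bool,
    judged_under_search: bool,
    penalize_over_search: bool = True,
    penalize_under_search: bool = True,
) -> bool:
    """Return whether a step counts as optimal for the process bonus.

    Disabling one flag reproduces the corresponding single-penalty ablation:
    e.g., penalize_under_search=False gives the "over-search only" variant.
    """
    if is_search:
        return not (penalize_over_search and judged_over_search)
    return not (penalize_under_search and judged_under_search)
```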
## 6.4. Qualitative Case Study
To illustrate the practical benefits of HiPRAG, a specific case study (Appendix G) compares the reasoning trajectory of a baseline model (Qwen2.5-3B-Instruct) with a HiPRAG-trained agent for the question: "What is the place of birth of the performer of song Slow Down (Lacy J. Dalton Song)?".
The following figure (Figure 7 from the original paper) shows the baseline reasoning trajectory.

Baseline Model (Inefficient Reasoning): The baseline model overlooks the crucial parenthetical information, "(Lacy J. Dalton Song)," and initiates a broad search for the generic song "Slow Down." This leads to an unnecessary five-step process:
- It identifies three different artists who have a song by that title.
- It then proceeds to search for the birthplace of each one individually.
- Even when it finds a correct birthplace for one of them (e.g., Selena Gomez: "Grand Prairie, Texas"), that search is irrelevant in the context of the specific question.

This is a classic example of over-searching, where the agent performs redundant and irrelevant lookups, ultimately failing to provide a single, correct answer relevant to the specific question.
The following figure (Figure 8 from the original paper) shows the HiPRAG-trained reasoning trajectory.

HiPRAG-trained Model (Efficient Reasoning): In contrast, the HiPRAG-trained agent demonstrates a highly efficient and accurate reasoning path:
- Step 1 (Non-search): It correctly parses the entire question in its first, non-search step, identifying "Lacy J. Dalton" as the specified performer and inferring that the core entity for the search is Lacy J. Dalton, not just "Slow Down".
- Step 2 (Search): It then executes a single, targeted search query for "Lacy J. Dalton place of birth."
- Final Answer: This directly leads to the correct answer ("Lacy J. Dalton was born in Salem, Missouri").

This case clearly demonstrates how HiPRAG's process-oriented rewards cultivate a more nuanced and efficient reasoning strategy, significantly improving both accuracy and search economy by curbing over-searching and ensuring that necessary searches are made.
## 6.5. Additional Analysis on Efficacy (Appendix F)
- Format Correctness Percentage: Analysis of `HiPRAG`-trained models showed that 96.3% of all generated trajectories successfully adhered to the required structured output format. This high percentage validates the effectiveness of the prompting and reward strategy in enforcing a machine-parsable output, which is crucial for applying the process reward.
- Efficacy of Over-search & Under-search Detection: A manual audit of 200 randomly selected reasoning trajectories revealed a 98.3% accuracy rate for over-search detection and a 95.6% accuracy rate for under-search detection. These high accuracy figures confirm that the on-the-fly LLM-based judges provide a reliable and effective signal for identifying suboptimal search behaviors during RL training.
- Efficacy of CEM Metric: A manual inspection of Cover Exact Match (CEM) judgments on 100 randomly sampled question-answer pairs found that CEM's assessment of correctness aligned with human judgment in 98% of cases. This confirms CEM as a robust metric for the task, capable of appropriately handling valid answers embedded within longer, explanatory LLM responses. A minimal sketch of such a check is given after this list.
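As promised above, here is a minimal sketch of a cover-style exact-match check, assuming CEM counts a prediction as correct when a normalized gold answer appears as a substring of the normalized model response; the paper's exact normalization rules may differ.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, and collapse whitespace."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def cover_exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if any normalized gold answer is contained in the normalized prediction."""
    pred = normalize(prediction)
    return any(normalize(g) in pred for g in gold_answers)

# Example: a correct answer embedded in a longer, explanatory response.
print(cover_exact_match(
    "Based on the retrieved passage, Lacy J. Dalton was born in Salem, Missouri.",
    ["Salem, Missouri"],
))  # True
```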
# 7. Conclusion & Reflections
## 7.1. Conclusion Summary
This work introduces HiPRAG (Hierarchical Process Rewards for Efficient agentic RAG), a novel Reinforcement Learning training methodology that significantly addresses the pervasive inefficiencies of over-searching and under-searching in agentic RAG systems. By decomposing agent trajectories into discrete, parsable steps and employing on-the-fly detection of suboptimal search actions, HiPRAG provides fine-grained, knowledge-aware process rewards. This hierarchical reward function first ensures format adherence and final answer correctness before adding a bonus for process optimality. Experiments demonstrate that HiPRAG achieves state-of-the-art Cover Exact Match (CEM) accuracies (65.4% for 3B and 67.2% for 7B models) while dramatically improving search efficiency, reducing the over-search rate from over 27% to 2.3% and concurrently lowering the under-search rate. The method shows strong generalizability across various model families, sizes, RL algorithms, and types. The core finding is that optimizing the reasoning process itself, rather than solely the final outcome, is crucial for developing powerful and efficient LLM search agents.
## 7.2. Limitations & Future Work
The authors acknowledge several ethical considerations that also hint at potential limitations and future work:
- Bias Inheritance: The system may inherit biases from its underlying training data (e.g., Wikipedia) and foundation models. This is a common challenge for LLMs and suggests a need for ongoing research into bias mitigation in `RAG` systems and `RL` training.
- Potential Misuse: While the method aims to foster more reliable AI by penalizing factual errors and reducing computational waste, any capable AI system presents potential for misuse. This implies that continuous ethical scrutiny and responsible deployment are necessary.
- Reliance on Proprietary LLM Judges: The on-the-fly over-search and under-search detection relies on proprietary models (`gpt-4.1-mini`, `gpt-5-mini`). While effective for research, this introduces a dependency on external services and their costs, and it limits full reproducibility for researchers without API access. Future work could explore open-source or smaller, fine-tuned detectors to overcome this.

The paper's concluding sentence, "This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents," implicitly suggests a future direction of further exploring and refining such fine-grained RL control mechanisms for broader applications in LLM agent reasoning.
## 7.3. Personal Insights & Critique
HiPRAG presents a highly insightful and effective approach to a critical problem in agentic RAG. The shift from outcome-based to process-based rewards is conceptually sound and rigorously implemented.
My personal insights and critiques include:
- Innovation of Structured Output: The emphasis on a machine-parsable output format is a clever and foundational innovation. It transforms the traditionally unstructured and opaque LLM reasoning process into something amenable to programmatic evaluation, directly enabling the fine-grained process rewards. This design choice, while seemingly simple, is crucial for the entire system's efficacy.
- Direct Detection is Superior: The on-the-fly, LLM-based detection of over-search and under-search is a significant improvement over previous, more indirect methods. It provides a more accurate and immediate signal, which is vital for effective RL training. The impressive accuracy rates (98.3% for over-search, 95.6% for under-search) are strong validation.
- Hierarchical Reward Function Design: The hierarchical reward is well-designed to prevent over-suppression of search capabilities, a common pitfall when intermediate steps are penalized too heavily. By first ensuring basic correctness and format adherence, it allows the agent to build foundational skills before optimizing for efficiency, making the learning process more robust.
- Strong Empirical Results: The dramatic reduction in over-search rate (from over 27% to 2.3%) while maintaining or improving accuracy is a standout achievement, demonstrating the practical impact of `HiPRAG`. The generalizability across different model sizes, families, and RL algorithms further strengthens the findings.
- Critique: External LLM Dependency: As noted in the limitations, the reliance on `gpt-4.1-mini` and `gpt-5-mini` for detection is a practical bottleneck. While these proprietary models are effective, an open-source alternative or a strategy to train a smaller, dedicated on-the-fly detector would enhance reproducibility and democratize the approach. This could involve using synthetic data generated by these powerful LLMs to train a smaller, local verifier model.
- Potential for Broader Application: The principles of structured reasoning trajectories and fine-grained process rewards could be highly transferable beyond RAG to other LLM agent tasks involving complex tool use, planning, or multi-step reasoning. For example, in code generation, a process reward could evaluate the efficiency of intermediate coding steps or API calls, not just the final runnable code.
- Beyond QA: While evaluated on QA benchmarks, the methodology has implications for any domain where LLMs need to reason and interact with external systems efficiently and reliably. The ability to control and optimize the reasoning process itself opens doors for more robust and trustworthy AI agents in diverse applications.