DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
TL;DR Summary
DeepSeekMath-V2 targets the limitations of large language models in mathematical reasoning. By training an LLM-based verifier for theorem proving and using it as a reward model, the system enables self-verification, producing more rigorous proofs and achieving gold-level results in top competitions, demonstrating the potential of self-verifiable mathematical reasoning.
Abstract
Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn’t address a key issue: correct answers don’t guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute. While much work remains, these results suggest that self-verifiable mathematical reasoning is a feasible research direction that may help develop more capable mathematical AI systems.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning." This title indicates a focus on advancing large language models (LLMs) to perform mathematical reasoning with an emphasis on the ability to verify their own outputs.
1.2. Authors
The authors are Zhihong Shao, Yuxiang Luo, Chengda Lu, Z.Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. They are affiliated with DeepSeek-AI. Zhihong Shao is also listed as the contact author.
1.3. Journal/Conference
The paper is released as a preprint. Its references use CoRR/arXiv identifiers (e.g., CoRR, abs/2507.23726, 2025) alongside venues such as Nature, EMNLP 2025, and ICLR 2023, but the paper itself does not state a journal or conference of publication, suggesting it is a preprint (e.g., on arXiv) at the time of its release. It is therefore likely undergoing peer review or not yet formally published, but it is available to the research community.
1.4. Publication Year
The publication year is 2025.
1.5. Abstract
This paper addresses the limitations of current large language models (LLMs) in mathematical reasoning, particularly their reliance on final-answer-based reinforcement learning (RL) which doesn't guarantee correct reasoning and is inapplicable to tasks like theorem proving. The research objective is to develop self-verifiable mathematical reasoning capabilities in LLMs.
The core methodology involves training an accurate and faithful LLM-based verifier for theorem proving. This verifier then serves as a reward model to train a proof generator. The generator is incentivized to identify and resolve issues in its own proofs through a process called self-verification. To improve the verifier further, a mechanism is proposed to scale verification compute to automatically label new, hard-to-verify proofs, creating training data.
The main results demonstrate that their model, DeepSeekMath-V2, achieves strong theorem-proving capabilities. It scored gold-level scores on IMO 2025 and CMO 2024, and a near-perfect 118/120 on Putnam 2024, utilizing scaled test-time compute. The key conclusion is that self-verifiable mathematical reasoning is a feasible research direction that can lead to more capable mathematical AI systems.
1.6. Original Source Link
The original source link is /files/papers/69284d609b40dbd7a3a8aeda/paper.pdf. This appears to be a local or internal file path rather than a public URL. Based on the references, it is likely intended to be a link to a preprint repository like arXiv, where similar papers are often found (e.g., https://arxiv.org/abs/2507.23726). The publication status is preprint or unknown.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a critical challenge in the field of artificial intelligence (AI) concerning the mathematical reasoning capabilities of Large Language Models (LLMs). While LLMs have shown significant progress in quantitative reasoning, often saturating (reaching near-perfect performance on) benchmarks like AIME and HMMT, the conventional approach of rewarding models solely based on correct final answers suffers from two fundamental limitations:
- Unreliable Proxy for Reasoning Correctness: A model might arrive at the correct answer through flawed logic, fortunate errors, or superficial understanding, meaning a correct answer does not guarantee correct reasoning. This creates a generation-verification gap where the model can generate an answer but cannot reliably verify its own reasoning process.
- Inapplicability to Theorem Proving: Many mathematical tasks, especially theorem proving, do not involve numerical answers. Instead, they require rigorous, step-by-step derivation and proof of a statement. For such tasks, a final answer reward is fundamentally inapplicable. This also means LLMs trained this way often produce mathematically invalid or logically inconsistent natural-language proofs.

The core problem the paper aims to solve is the lack of comprehensiveness and rigor in LLM-generated mathematical reasoning, particularly in theorem proving, and the models' inability to reliably verify the validity of their own proofs. This problem is important because mathematical reasoning is a crucial testbed for AI and has the potential to significantly impact scientific research if advanced further. The ability to self-verify is especially critical for scaling test-time compute on open problems where no known solutions exist for comparison.
The paper's entry point is the observation that humans can identify issues in proofs even without reference solutions, and that a proof's quality can be inferred by the effort required to identify issues. This motivates the development of proof verification capabilities in LLMs, which can then be used to iteratively improve proof generation.
2.2. Main Contributions / Findings
The paper makes several primary contributions towards self-verifiable mathematical reasoning:
- Accurate and Faithful LLM-based Verifier: The authors develop a method to train an LLM-based verifier that can accurately identify issues in mathematical proofs and assign scores based on mathematical rigor and completeness.
- Meta-Verification for Verifier Faithfulness: They introduce meta-verification, a secondary evaluation process, to assess the quality of the verifier's analyses. This ensures that the verifier identifies genuine issues rather than hallucinating non-existent ones, thereby improving the faithfulness of issue identification.
- Self-Verification Incentivization for Generator: They train a proof generator using the verifier as a reward model. Crucially, the generator is incentivized to perform self-verification: it is prompted to analyze its own generated proofs, identify issues, and iteratively resolve them before finalizing its output. This makes the model explicitly aware of its reward function.
- Automated Labeling via Scaled Verification Compute: To maintain the generation-verification gap and continuously improve the verifier as the generator becomes stronger, they propose and implement a method to scale verification compute. This automatically labels new, hard-to-verify proofs, creating training data for the verifier without requiring extensive human annotation.
- High Performance on Mathematical Competitions: The resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024, especially with scaled test-time compute. These findings suggest that self-verifiable mathematical reasoning is a feasible and promising research direction for developing more capable mathematical AI systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DeepSeekMath-V2, a reader should be familiar with the following foundational concepts:
- Large Language Models (LLMs): Advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data. They can understand, generate, and process human language for a wide range of tasks, including translation, summarization, and reasoning. LLMs learn to predict the next token in a sequence, thereby developing a deep grasp of grammar, facts, and context.
- Reinforcement Learning (RL): A machine learning paradigm in which an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties for its actions, aiming to maximize cumulative reward over time. For LLMs, RL is often used to fine-tune models to align their outputs with specific objectives or human preferences, such as generating correct mathematical answers.
- Supervised Fine-tuning (SFT): A common natural language processing (NLP) technique in which a pre-trained LLM is further trained on a smaller, task-specific dataset with human-labeled examples. This adapts the model's general knowledge to a particular downstream task, such as mathematical reasoning or code generation.
- Mathematical Reasoning: The ability to understand, analyze, and solve mathematical problems. This includes quantitative reasoning (solving problems that result in numerical answers) and theorem proving (rigorously deriving logical conclusions from axioms and definitions, often through multi-step deductions without producing a numerical answer).
- Theorem Proving: Constructing a formal or informal derivation (a proof) to establish the truth of a mathematical statement (theorem) based on established axioms, definitions, and previously proven theorems. This often requires complex, multi-step logical deductions.
- Natural Language Reasoning: The ability of an AI system to process, understand, and generate human-like textual explanations and deductions, including mathematical proofs. This contrasts with formal reasoning, which uses symbolic logic and formal systems.
- Formal Proof Assistants: Software tools (e.g., Lean, Isabelle) that help mathematicians and computer scientists construct and verify mathematical proofs. These systems require proofs written in a highly structured formal language; once a proof compiles, its correctness is mathematically guaranteed by the system's logic.
3.2. Previous Works
The paper situates its work in the context of several important prior studies and developments:
- LLMs for Quantitative Reasoning (e.g., AIME, HMMT): Conventional RL approaches that reward LLMs for correct final answers have enabled models (like those described by Guo et al., 2025, and OpenAI, 2024) to saturate (achieve near-perfect performance on) quantitative reasoning benchmarks such as the American Invitational Mathematics Examination (AIME) and the Harvard-MIT Mathematics Tournament (HMMT). These benchmarks primarily test problem-solving skills that yield numerical answers.
- Gemini 2.5 Pro and Self-Verification: Some frontier LLMs, such as Gemini 2.5 Pro (DeepMind, 2025), already demonstrate a degree of self-verification, allowing them to refine their own solutions (Huang and Yang, 2025). This indicates a growing trend toward models that can evaluate and improve their outputs.
- DeepMind's DeepThink (IMO Gold): DeepMind's internal DeepThink variant (Luong and Lockhart, 2025) achieved gold medal performance at IMO 2025 using pure natural language reasoning. This serves as an existence proof that LLM-based verification of complex proofs is achievable, paving the way for the current work's focus on self-verifiable reasoning. Related research by Luong et al. (2025) explores whether reasoning models can evaluate proofs, with or without reference solutions.
- Formal Proof Search Systems (e.g., AlphaProof, DeepSeek-Prover-V2, Seed-Prover):
  - AlphaProof (AlphaProof and teams, 2024; Trinh et al., 2024; Chervonyi et al., 2025): A system specialized for formal proof search that achieved silver-level performance at IMO 2024. It relies on formal languages and often requires intensive computation. AlphaGeometry2 (Chervonyi et al., 2025) is a variant specifically for geometry problems.
  - DeepSeek-Prover-V2 (Ren et al., 2025) and Seed-Prover (Chen et al., 2025): Advances in formal mathematical reasoning capable of producing substantially more valid formal proofs; Seed-Prover notably solved 5 of 6 problems at IMO 2025. These systems often leverage informal reasoning from LLMs to guide formal proof generation, making the quality of informal reasoning highly impactful.
3.3. Technological Evolution
The field of AI in mathematical reasoning has evolved significantly:
- Early Symbolic AI and Expert Systems: Initial attempts relied on symbolic logic and rule-based systems, requiring explicit programming of mathematical rules and inference steps. These systems struggled with the vastness and ambiguity of real-world mathematical problems.
- Rise of Deep Learning and Quantitative Reasoning: With the advent of deep learning, particularly transformer models, LLMs began to show impressive quantitative reasoning capabilities. Early successes involved training models to produce correct numerical answers, often using reinforcement learning with final answer rewards, which led to models saturating benchmarks like AIME.
- Limitations of Final Answer Rewards: These limitations became apparent when models struggled with theorem proving or produced correct answers via incorrect reasoning, highlighting the need for models to understand and evaluate the reasoning process itself.
- Emergence of Informal Proof Generation and Self-Correction: Recent advancements, exemplified by Gemini 2.5 Pro and DeepThink, demonstrated that LLMs could generate natural-language proofs and even perform some self-correction, proving the feasibility of LLM-based verification for complex, informal proofs.
- Integration with Formal Proof Systems: Concurrently, there has been significant progress in integrating LLMs with formal proof assistants. LLMs generate informal proofs or proof sketches, which then guide formal provers to construct verifiable proofs in formal languages. Systems like AlphaProof and Seed-Prover represent this hybrid approach.

DeepSeekMath-V2 fits within this timeline by pushing the boundaries of natural language theorem proving. It moves beyond simple final-answer correctness to focus on the rigor and comprehensiveness of the reasoning process itself, through novel self-verification and meta-verification mechanisms. It aims to develop LLMs that can not only generate proofs but also critically evaluate and improve them, bridging the generation-verification gap in the informal reasoning domain.
3.4. Differentiation Analysis
Compared to the main methods in related work, DeepSeekMath-V2 presents several core differences and innovations:
- Beyond Final Answer Accuracy: Unlike earlier LLMs that primarily focused on final answer accuracy for quantitative reasoning, DeepSeekMath-V2 explicitly targets theorem proving, where rigorous step-by-step derivation is paramount, not just a numerical answer. Its reward mechanism moves beyond simple correctness to the comprehensiveness and rigor of the proof.
- Dedicated LLM-based Proof Verifier: A key innovation is training a specific LLM-based verifier ($\pi_\varphi$) designed to evaluate natural-language proofs according to detailed rubrics, identifying issues and assigning scores. This is a more nuanced and powerful form of verification than checking a final numerical output. While DeepThink demonstrated that LLM-based verification is achievable, DeepSeekMath-V2 open-sources its training methodology for this capability.
- Meta-Verification for Faithfulness: The paper introduces meta-verification to address a critical vulnerability: the verifier might hallucinate non-existent issues while still predicting the correct score. By training a meta-verifier to assess the quality and justification of the verifier's analyses, DeepSeekMath-V2 enhances the faithfulness and trustworthiness of its issue identification, a novel approach to ensuring verification quality.
- Incentivized Self-Verification in the Generator: While some models like Gemini 2.5 Pro show self-verification capabilities, DeepSeekMath-V2 explicitly trains its proof generator ($\pi_\theta$) to perform rigorous self-analysis using the same rubrics as the external verifier. The reward function directly incentivizes the generator to identify and resolve issues in its own proofs, promoting deliberate, iterative refinement rather than blind trial-and-error. This closes the generation-verification gap within the model itself.
- Automated Labeling for Verifier Improvement: To keep improving the verifier as the generator grows stronger and produces more complex, hard-to-verify proofs, DeepSeekMath-V2 proposes an innovative automated labeling process. It scales verification compute (multiple verifier samples, meta-verification of those samples) to generate new training data for the verifier without human annotation, creating a sustainable improvement cycle.
- Focus on Natural Language vs. Formal Proofs: While AlphaProof, DeepSeek-Prover-V2, and Seed-Prover generate formal proofs (which offer strong guarantees but require specialized languages and high computational cost), DeepSeekMath-V2 focuses on natural language theorem proving. It demonstrates that LLMs can achieve gold-level performance on competitions using informal reasoning, suggesting that advancing informal reasoning can also significantly benefit formal reasoning systems.

In essence, DeepSeekMath-V2 differentiates itself by building a tightly coupled, self-improving ecosystem of LLM-based verifiers and generators for natural language mathematical proofs, with a strong emphasis on faithful issue identification and iterative self-refinement.
4. Methodology
The methodology of DeepSeekMath-V2 is centered around an iterative reinforcement learning (RL) process that synergistically improves both proof verification and proof generation capabilities. It involves training a dedicated verifier to identify issues and score proofs, a meta-verifier to ensure the verifier's faithfulness, and a generator that uses the verifier as a reward model and is incentivized to self-verify and refine its own proofs.
4.1. Principles
The core idea is to move beyond final-answer-based rewards for mathematical reasoning, which are unreliable for assessing correctness of reasoning and inapplicable to theorem proving. The fundamental principle is that for complex reasoning tasks like theorem proving, it is crucial to verify the comprehensiveness and rigor of the step-by-step derivation.
The theoretical basis is that an LLM can be trained to emulate human mathematical experts in identifying flaws and evaluating proof quality without reference solutions. This verifier can then provide a reward signal for an LLM proof generator. Furthermore, by making the generator explicitly aware of the verifier's criteria, it can learn to self-verify and iteratively improve its own outputs, mimicking how humans review and refine their work. The synergy between verification and generation is exploited to create a continuous improvement loop, where stronger generators produce challenging proofs that, in turn, help improve the verifier.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology consists of three main components: Proof Verification, Proof Generation, and Synergy Between Proof Verification and Generation.
4.2.1. Proof Verification
This component focuses on training an LLM to act as a mathematical verifier.
4.2.1.1. Training a Verifier to Identify Issues and Score Proofs
The goal here is to train a verifier model, denoted $\pi_\varphi$, to evaluate proofs according to high-level rubrics (detailed in Appendix A.2). The verifier takes a problem $X_i$ and a proof $Y_i$ as input and produces a proof analysis $V'_i$ that summarizes identified issues (if any) and assigns an overall score $s'_i$. The score levels are:
- 1: For complete and rigorous proofs with all logical steps clearly justified.
- 0.5: For proofs with sound overall logic but minor errors or omitted details.
- 0: For fundamentally flawed proofs containing fatal logical errors or critical gaps.
Data Collection Process:
- Problem Collection: Problems are crawled from Art of Problem Solving (AoPS) contests, prioritizing math olympiads, team selection tests, and post-2010 problems explicitly requiring proofs. This yields 17,503 problems, denoted $\mathcal{D}_p$.
- Candidate Proof Generation: An initial variant of DeepSeek-V3.2-Exp-Thinking is used to generate candidate proofs. This model is prompted to iteratively refine its proofs to improve comprehensiveness and rigor, as it was not initially optimized for theorem proving and tended to produce concise but error-prone outputs.
- Expert Annotation: Proofs are randomly sampled across diverse problem types (e.g., algebra, number theory), and mathematical experts score each proof according to the defined evaluation rubrics. This process yields an initial RL dataset $\mathcal{D}_v = \{(X_i, Y_i, s_i)\}$, where $X_i$ is the problem, $Y_i$ is the proof, and $s_i$ is the expert-annotated proof score.
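To make the dataset structure concrete, the sketch below shows what one record of $\mathcal{D}_v$ might look like; the field names and example content are illustrative, not the paper's schema.

```python
# A hypothetical record in the verifier RL dataset D_v = {(X_i, Y_i, s_i)};
# field names and contents are illustrative only.
example_record = {
    "problem": "Prove that a/b + b/a >= 2 for all positive reals a, b.",  # X_i
    "proof": "By AM-GM, a/b + b/a >= 2*sqrt((a/b)*(b/a)) = 2, "
             "with equality iff a = b.",                                  # Y_i
    "expert_score": 1.0,  # s_i in {0, 0.5, 1}, assigned by a human expert
}
```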
RL Objective for Verifier Training:
The verifier is built upon a version of DeepSeek-V3.2-Exp-SFT (Supervised Fine-Tuning), which was already fine-tuned on reasoning data related to mathematics and code. The model is trained with reinforcement learning using two reward components to produce proof analyses:
- Format Reward ($R_{\mathrm{format}}$): An indicator function that ensures the model generates its response in the specified format, which includes both a summary of identified issues and a proof score. It checks for the presence of key phrases like "Here is my evaluation of the solution:" and a delimited score following "Based on my evaluation, the final overall score should be:". Here $V'_i$ denotes the verifier's final response for problem $X_i$ and proof $Y_i$.
- Score Reward ($R_{\mathrm{score}}$): Rewards the verifier based on how close its predicted score $s'_i$ is to the expert-annotated ground-truth score $s_i$, where $s'_i$ is the proof score extracted from the verifier's response $V'_i$. The reward is 1 if the predicted score exactly matches the annotated score, 0.5 if it is off by 0.5 (e.g., predicting 0.5 for a 0 or 1 proof), and 0 if it is completely off (e.g., predicting 0 for a 1 proof or vice versa).
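A minimal Python sketch of these two components follows; the phrase checks approximate the described template and are assumptions, not the paper's verbatim parser.

```python
def format_reward(response: str) -> float:
    """Indicator reward R_format: 1.0 only if the verifier's response contains
    both required components described above. The exact string checks are
    assumed approximations of the response template."""
    has_evaluation = "Here is my evaluation of the solution:" in response
    has_final_score = (
        "Based on my evaluation, the final overall score should be:" in response
    )
    return 1.0 if (has_evaluation and has_final_score) else 0.0


def score_reward(predicted: float, annotated: float) -> float:
    """R_score on the {0, 0.5, 1} scale: exact match -> 1.0,
    off by 0.5 -> 0.5, off by 1.0 -> 0.0."""
    diff = abs(predicted - annotated)
    if diff == 0.0:
        return 1.0
    if diff == 0.5:
        return 0.5
    return 0.0
```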
The overall RL objective for training the verifier is to maximize the expected product of these two reward components:
$
\operatorname*{max}_{\pi_\varphi} \; \mathbb{E}_{(X_i, Y_i, s_i) \sim \mathcal{D}_v,\; (V'_i, s'_i) \sim \pi_\varphi(\cdot \mid X_i, Y_i)} \left[ R_{\mathrm{format}}(V'_i) \cdot R_{\mathrm{score}}(s'_i, s_i) \right]
$
This objective trains the verifier $\pi_\varphi$ to generate responses $V'_i$ that adhere to the specified format and predict scores $s'_i$ that are accurate with respect to the expert annotations $s_i$.
4.2.1.2. Introducing Meta-Verification to Review Proof Analyses
The initial verifier training ensures accurate score prediction but doesn't directly supervise the identified issues. This creates a risk of hallucination: the verifier could predict the correct score for a flawed proof while still inventing non-existent issues, undermining its trustworthiness.
To address this, meta-verification is introduced. This is a secondary evaluation process that assesses:
- Whether the issues identified by the verifier truly exist in the proof.
- Whether these identified issues logically justify the verifier's predicted proof score, according to the evaluation rubrics.

The complete meta-verification rubrics are detailed in Appendix A.3.
Meta-Verifier Training Process:
- Initial Verifier: An initial verifier is obtained from the process described in Section 4.2.1.1.
- Expert Meta-Annotation: Mathematical experts score the quality of the verifier's responses (proof analyses) according to the meta-verification rubrics. This creates a dataset in which each verifier analysis of a proof is paired with an expert-annotated quality score for that analysis.
- Meta-Verifier Training: A dedicated meta-verifier is trained with RL to analyze the verifier's proof analysis. The meta-verifier outputs a summary of issues found in the analysis itself, followed by a quality score measuring the accuracy and justification of the verifier's analysis. The RL objective for meta-verifier training follows the same structure as verifier training, using format and score rewards to align with expert meta-annotations.
Enhanced Verifier Training with Meta-Verification Feedback:
Once the meta-verifier is trained, its feedback is integrated into the verifier's reward function, combining the three components below (the product form extends the original format-times-score objective):
$
R = R_{\mathrm{format}}(V'_i) \cdot R_{\mathrm{score}}(s'_i, s_i) \cdot R_{\mathrm{meta}}(V'_i)
$
Here:

- $R_{\mathrm{format}}(V'_i)$ is the format reward for the verifier's response.
- $R_{\mathrm{score}}(s'_i, s_i)$ is the score reward for the verifier's predicted score accuracy.
- $R_{\mathrm{meta}}(V'_i)$ is the quality score from the meta-verifier. This term explicitly rewards the verifier for producing faithful and justified issue identifications, as evaluated by the meta-verifier.

The enhanced verifier is then trained on both the verification dataset $\mathcal{D}_v$ and the meta-verification dataset, using the same reward mechanism as for training the meta-verifier. The resulting model can perform both proof verification and meta-verification tasks. The paper reports that, on a validation split of $\mathcal{D}_v$, the average quality score of the verifier's proof analyses (as evaluated by the meta-verifier) improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction.
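A minimal sketch of the enhanced reward, reusing the helpers above; the multiplicative combination mirrors the product structure of the original objective and is an assumption about the exact composition.

```python
def enhanced_verifier_reward(response: str, predicted_score: float,
                             annotated_score: float, meta_quality: float) -> float:
    """Enhanced verifier reward: the original format and score terms, gated by
    the meta-verifier's quality score R_meta in [0, 1] for the analysis."""
    return (
        format_reward(response)
        * score_reward(predicted_score, annotated_score)
        * meta_quality
    )
```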
4.2.2. Proof Generation
This component describes how the proof generator is trained and enhanced using the verifier.
4.2.2.1. Training a Generator for Theorem Proving
With the verifier now capable of accurately assessing proof quality, it is used as a generative reward model to train a proof generator $\pi_\theta$. The RL objective for the generator is to maximize the expected proof score produced by the verifier:
$
\operatorname*{max}_{\pi_\theta} \mathbb{E}_{X_i \sim \mathcal{D}_p,\; Y_i \sim \pi_\theta(\cdot \mid X_i)} \left[ R_Y \right]
$
Here:
- $\pi_\theta$ is the proof generator model.
- $X_i \sim \mathcal{D}_p$ means problem $X_i$ is sampled from the problem dataset.
- $Y_i \sim \pi_\theta(\cdot \mid X_i)$ means proof $Y_i$ is generated by $\pi_\theta$ for problem $X_i$.
- $R_Y$ is the proof score produced by the verifier $\pi_\varphi$. The generator is thus rewarded for producing proofs that the verifier deems high quality.
4.2.2.2. Enhancing Reasoning via Self-Verification
For challenging problems, a single-shot proof generation is often insufficient. While iterative refinement based on external verifier feedback is possible, the authors observed that when a generator is prompted to both generate and analyze its own proof in one shot, it often hallucinates correctness even when flaws are easily identifiable by an external verifier. This means the generator lacks genuine self-verification capabilities.
To address this, the proof generator is endowed with genuine verification capabilities. During training, the generator is prompted to produce not only a proof $Y$ but also a self-analysis $Z$. This self-analysis must follow the same format and rubrics as the verifier's output (see Appendix A.1 for the prompt template). The proof score predicted within this self-analysis is denoted $s'$.
To ensure faithful self-evaluation, the external verifier is used to assess both components:

- The generated proof receives a score $R_Y = s$, where $s$ is the score assigned by the external verifier.
- The self-analysis receives a meta-verification score, which assesses the quality and faithfulness of the generator's own evaluation.

The reward function for training the generator with self-verification combines these assessments (a code sketch appears after the incentives list below):
$
R = R_{\mathrm{format}}(Y, Z) \cdot (\alpha \cdot R_Y + \beta \cdot R_Z), \qquad R_Z = R_{\mathrm{score}}(s', s) \cdot R_{\mathrm{meta}}(Z)
$
Here:

- $R_{\mathrm{format}}(Y, Z)$ is an indicator function that verifies both the proof and the self-analysis adhere to their specified formats (as outlined in Appendix A.1).
- $R_Y$ is the score of the proof as judged by the external verifier.
- $R_Z$ is a reward component for the quality of the self-analysis.
- $R_{\mathrm{score}}(s', s)$ rewards the generator for accurately self-assessing its proof, by comparing its self-assigned score $s'$ with the external verifier's score $s$.
- $R_{\mathrm{meta}}(Z)$ is the meta-verification score of the self-analysis $Z$, ensuring the faithfulness of the generator's issue identification.
- $\alpha$ and $\beta$ are weighting coefficients with $\alpha > \beta$, indicating a stronger emphasis on the actual proof quality while still giving significant weight to the self-evaluation.

This reward structure creates specific incentives for the generator:

- Faithful Acknowledgment of Errors: The model is rewarded more for truthfully acknowledging errors in its proof (even if it can't fix them) than for falsely claiming correctness.
- Highest Rewards for Rigorous Proofs and Accurate Self-Assessment: The maximum reward is achieved when the generator produces a completely correct proof ($R_Y = 1$) and accurately recognizes its rigor ($s' = s = 1$).
- Iterative Issue Resolution: A good strategy for the proof generator to obtain high rewards is to iteratively identify and resolve as many issues as possible in its own proofs before finalizing them, guided by its self-verification capabilities.
4.2.3. Synergy Between Proof Verification and Generation
The proof verifier and generator are designed to create a synergistic, self-improving cycle. As the generator improves, it produces new proofs that challenge the verifier's current capabilities, particularly hard-to-verify proofs where the verifier might struggle to identify subtle issues in a single attempt. These challenging cases become valuable training data for enhancing the verifier itself.
To address the need for labeled correctness data for these newly generated proofs, the authors developed an automated labeling process, which eventually replaced human annotation entirely in later training iterations.
Automated Labeling Process:
- Multiple Verification Analyses: For each newly generated proof, multiple independent verification analyses are generated by the verifier. This increases the probability of catching real issues in flawed proofs, as one verifier instance might miss something that another catches.
- Meta-Verification of Issue-Reporting Analyses: For analyses (from step 1) that report issues (scores of 0 or 0.5), meta-verification assessments are generated by the meta-verifier to validate whether the identified problems genuinely exist. An analysis is deemed valid if the majority of these meta-assessments confirm its findings. This step leverages the observation that meta-verification is easier and more sample-efficient for LLMs to master than identifying issues from scratch.
- Proof Scoring and Routing (see the sketch after this list):
  - For each proof, the analyses that assign the lowest score are examined. If enough such analyses are deemed valid by the meta-verifier, the proof is labeled with that lowest score.
  - If no legitimate issues are identified across all verification attempts (i.e., all analyses assign a score of 1), the proof is labeled with a score of 1.
  - Otherwise (e.g., conflicting analyses, or insufficient valid low-score analyses), the proof is either discarded or routed to human experts for labeling.

This fully automated pipeline significantly boosts annotation efficiency and allows continuous improvement of the verifier using data generated by the generator, without relying on manual annotation in later stages. Quality checks confirmed that the automated labels aligned well with expert judgments.
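The following Python sketch captures the routing logic under stated assumptions: `verify` and `meta_verify` are callables standing in for the verifier and meta-verifier, and the counts `n_analyses`, `min_valid`, and `n_meta` are illustrative placeholders rather than the paper's settings.

```python
def auto_label(problem, proof, verify, meta_verify,
               n_analyses=8, min_valid=2, n_meta=5):
    """Automated labeling sketch: sample several verification analyses,
    meta-verify the issue-reporting ones, and label or route the proof."""
    analyses = [verify(problem, proof) for _ in range(n_analyses)]
    lowest = min(a.score for a in analyses)  # each analysis carries a .score

    if lowest == 1.0:
        return 1.0  # no analysis reports an issue: label the proof correct

    def confirmed(analysis):
        # Majority vote over n_meta meta-assessments of this analysis.
        votes = [meta_verify(problem, proof, analysis) for _ in range(n_meta)]
        return sum(votes) > n_meta / 2

    valid_lowest = [a for a in analyses if a.score == lowest and confirmed(a)]
    if len(valid_lowest) >= min_valid:
        return lowest  # label with the validated lowest score
    return None        # conflicting evidence: discard or route to human experts
```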
5. Experimental Setup
5.1. Datasets
The paper uses a set of problems for training and evaluation.
- Training Data (Problem Set $\mathcal{D}_p$): Problems were crawled from Art of Problem Solving (AoPS) contests, including math olympiads, team selection tests, and post-2010 problems specifically requiring proofs, for a total of 17,503 problems. They span mathematical domains such as algebra, number theory, geometry, combinatorics, and inequality, and are designed to test deep mathematical understanding and rigorous proof construction.
- Initial RL Dataset ($\mathcal{D}_v$): Created by sampling proofs generated by a variant of DeepSeek-V3.2-Exp-Thinking on problems from $\mathcal{D}_p$, then scored by mathematical experts (0, 0.5, or 1).
- Meta-Verification Dataset: Generated by having mathematical experts score the quality of verifier responses (proof analyses) according to the meta-verification rubrics.
Evaluation Benchmarks:
The final proof generator was evaluated on several benchmarks to assess its theorem-proving capabilities:
- In-House CNML-Level Problems: A set of 91 theorem-proving problems.
  - Source: In-house, comparable in difficulty to problems from the Chinese National High School Mathematics League (CNML).
  - Characteristics: Spanning algebra (13 problems), geometry (24), number theory (19), combinatorics (24), and inequality (11). These are challenging problems requiring rigorous proofs.
- Competition Problems:
  - IMO 2025: 6 problems from the International Mathematical Olympiad, a premier global competition for pre-university students.
  - CMO 2024: 6 problems from the China Mathematical Olympiad, China's national championship.
  - Putnam 2024: 12 problems from the William Lowell Putnam Competition, a preeminent mathematics competition for undergraduate students in North America.
  - ISL 2024 (IMO Shortlist): 31 problems from the IMO Shortlist, a collection of problems proposed by participating countries for potential inclusion in IMO 2024. These represent a wide range of difficulty levels.
- IMO-ProofBench: A benchmark developed by the DeepMind team.
  - Source: Luong et al. (2025).
  - Characteristics: Divided into a basic set (30 problems, pre-IMO to IMO-Medium difficulty) and an advanced set (30 challenging problems simulating complete IMO examinations, up to IMO-Hard level).

These datasets were chosen because they are standard and highly respected benchmarks in competitive mathematics, suitable for evaluating the rigor and correctness of mathematical proofs. The mix of high-school and undergraduate problems, as well as varied mathematical domains, provides a comprehensive assessment of the model's generalized theorem-proving abilities.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the performance of the models, primarily focusing on proof correctness and quality.
- Proof Score ($s$):
  - Conceptual Definition: Quantifies the overall correctness, completeness, and rigor of a mathematical proof. It is the primary output of the verifier and the target for expert annotation and self-assessment.
  - Mathematical Formula: Not a formula, but a discrete scale $s \in \{0, 0.5, 1\}$:
    - $s = 1$: Completely correct, all steps executed properly and clearly demonstrated.
    - $s = 0.5$: Generally correct, but with some details omitted or minor errors.
    - $s = 0$: Fundamentally flawed, containing fatal logical errors or critical gaps, or not addressing the required problem.
- Pass@1:
  - Conceptual Definition: Measures the average quality of the final proof generated by a model in a multi-attempt scenario (e.g., sequential refinement). It assesses the typical performance of a single generation attempt after refinement.
  - Mathematical Formula:
    $
    \text{Pass@1} = \frac{1}{N_{\text{problems}} \cdot N_{\text{threads}}} \sum_{i=1}^{N_{\text{problems}}} \sum_{j=1}^{N_{\text{threads}}} s_{i,j}^{\text{final}}
    $
  - Symbol Explanation:
    - $N_{\text{problems}}$: The total number of problems evaluated.
    - $N_{\text{threads}}$: The number of independent refinement threads launched per problem.
    - $s_{i,j}^{\text{final}}$: The score (0, 0.5, or 1) of the final proof from the $j$-th refinement thread for the $i$-th problem, as evaluated by the final verifier.
    - The metric is the average score across all final proofs from all threads.
- Best@32 (or Best@N):
  - Conceptual Definition: Measures the quality of the single best proof found for a given problem across multiple independent generation/refinement attempts (threads). The "best" proof is selected by the model's own self-assigned score, so the metric evaluates the model's ability not only to generate good proofs but also to accurately identify its best work.
  - Mathematical Formula:
    $
    \text{Best@N} = \frac{1}{N_{\text{problems}}} \sum_{i=1}^{N_{\text{problems}}} s_{i, j_i^*}^{\text{verifier}}, \qquad j_i^* = \operatorname*{arg\,max}_{j \in \{1, \dots, N_{\text{threads}}\}} s_{i,j}^{\text{self}}
    $
  - Symbol Explanation:
    - $N_{\text{problems}}$: The total number of problems evaluated.
    - $N_{\text{threads}}$: The number of independent refinement threads (here, 32).
    - $s_{i,j}^{\text{self}}$: The self-assigned score of the final proof from the $j$-th thread for the $i$-th problem.
    - $s_{i,j}^{\text{verifier}}$: The score of the final proof from the $j$-th thread for the $i$-th problem, as evaluated by the final verifier.
    - For each problem $i$, the model first identifies the thread $j_i^*$ with the highest self-assigned score; the verifier score of that self-selected proof is then taken. Best@N is the average of these verifier scores across all problems.
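Both metrics are easy to express over a matrix of per-thread scores. The sketch below assumes NumPy arrays of shape (problems, threads); it is an illustration of the definitions above, not the authors' evaluation code.

```python
import numpy as np

def pass_at_1(final_scores: np.ndarray) -> float:
    """final_scores[i, j] is the verifier score of thread j's final proof on
    problem i; Pass@1 is the grand mean over problems and threads."""
    return float(final_scores.mean())

def best_at_n(verifier_scores: np.ndarray, self_scores: np.ndarray) -> float:
    """Pick, per problem, the thread with the highest self-assigned score and
    average that thread's verifier score (ties broken by the first such
    thread, an implementation choice)."""
    chosen = self_scores.argmax(axis=1)
    picked = verifier_scores[np.arange(verifier_scores.shape[0]), chosen]
    return float(picked.mean())
```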
5.3. Baselines
The paper compares DeepSeekMath-V2 against several state-of-the-art LLMs:
- GPT-5-Thinking-High (OpenAI, 2025): A high-performance variant of OpenAI's GPT-5 model, likely optimized for reasoning tasks, as implied by "Thinking-High." It represents a cutting-edge proprietary model from a leading AI research organization.
- Gemini 2.5-Pro (DeepMind, 2025): DeepMind's advanced Gemini model, known for strong multimodal capabilities and for demonstrating some self-verification abilities, as noted in the related work.
- DeepMind's DeepThink (IMO Gold) (Luong and Lockhart, 2025): A specific variant of DeepMind's DeepThink system, which achieved gold medal performance at IMO 2025 using natural language reasoning. It is a strong baseline representing the current pinnacle of LLM-based informal mathematical reasoning.

These baselines are representative because they are leading LLMs from major AI labs, and some (like Gemini 2.5-Pro and DeepThink) have already demonstrated advanced mathematical reasoning and even self-verification capabilities. Comparing against them allows the authors to position DeepSeekMath-V2's performance relative to the current frontier.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate DeepSeekMath-V2's strong theorem-proving capabilities, especially when leveraging self-verification and scaled test-time compute.
6.1.1. One-Shot Generation
For one-shot generation, where the model generates a proof without iterative refinement, DeepSeekMath-V2 consistently outperforms competitor models. On the in-house CNML-level problems, DeepSeekMath-V2 shows superior performance across all categories: algebra, geometry, number theory, combinatorics, and inequality.
The following figure (Figure 1 from the original paper) illustrates the average proof scores:
[Figure 1: Average proof scores of DeepSeekMath-V2, GPT-5-Thinking-High, and Gemini 2.5-Pro on in-house CNML-level problems, by category.]
Analysis of Figure 1:
The bar chart clearly shows DeepSeekMath-V2 achieving higher average proof scores (as evaluated by its own final verifier) than GPT-5-Thinking-High and Gemini 2.5-Pro in every mathematical category. For instance, in Algebra, DeepSeekMath-V2 is significantly higher, and the difference is maintained across Geometry, Number Theory, Combinatorics, and Inequality. This indicates that DeepSeekMath-V2 has developed a more robust and generalized ability to generate correct and rigorous proofs across diverse mathematical domains in a single attempt. The consistency of this outperformance suggests that the training methodology has effectively instilled strong foundational theorem-proving skills.
6.1.2. Sequential Refinement with Self-Verification
For more challenging problems, DeepSeekMath-V2's sequential refinement process, guided by self-verification, significantly improves proof quality. The model is prompted to generate a proof with self-analysis, and then iteratively refines it by addressing identified issues.
The following figure (Figure 2 from the original paper) demonstrates proof quality improvements on IMO Shortlist 2024 problems:
[Figure 2: Pass@1 and Best@32 versus maximum sequential refinement iterations on IMO Shortlist 2024 problems.]
Analysis of Figure 2:
The line graph shows two metrics, Pass@1 and Best@32, as the maximum sequential iterations increase from 1 (no refinement) to 8.
- Pass@1 (Average Score): The Pass@1 score, representing the average quality of the final proof from each refinement thread, improves substantially as the number of sequential attempts increases. It starts relatively low at iteration 1 (no refinement) and steadily rises, indicating that self-verification effectively guides the iterative improvement process and that repeated self-correction leads to higher average proof quality.
- Best@32 (Self-Selected Best Proof): The Best@32 score, the verifier score of the best proof per problem (selected by the model's own self-assigned scores across 32 threads), is consistently and significantly higher than Pass@1 at all iteration counts. This highlights DeepSeekMath-V2's ability to accurately assess its own proof quality and reliably differentiate high-quality from flawed proofs. The gap between Best@32 and Pass@1 also grows, suggesting that while average quality improves, the model's capacity to find and select exceptionally good proofs also benefits from more iterations.

These results confirm that the self-verification mechanism is not just a theoretical concept but an effective strategy for systematically improving mathematical reasoning, particularly for complex problems where a single-shot solution is insufficient.
6.1.3. High-Compute Search
To tackle the most challenging problems from prestigious competitions, DeepSeekMath-V2 employs a high-compute search strategy that scales both verification and generation compute. This involves maintaining a pool of candidate proofs, iteratively refining them based on verifier feedback, and exploring diverse proof strategies in parallel. The process continues until a proof passes numerous verification attempts (64 in this case), indicating high confidence in its correctness.
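A minimal sketch of this search loop, assuming hypothetical `generator.prove`, `generator.refine`, and `verifier.analyze` interfaces; the 64-proof pool, 16 refinement iterations, and 64-pass acceptance threshold echo the budgets quoted in this review.

```python
def high_compute_search(problem, generator, verifier, pool_size=64,
                        max_refinements=16, required_passes=64):
    """Maintain a pool of candidate proofs, refine each under verifier
    feedback, and accept a proof only once no independent verification
    attempt finds an issue."""
    pool = [generator.prove(problem) for _ in range(pool_size)]
    for _ in range(max_refinements):
        for i, proof in enumerate(pool):
            analyses = [verifier.analyze(problem, proof)
                        for _ in range(required_passes)]
            if all(a.score == 1.0 for a in analyses):
                return proof  # high confidence: passed every verification attempt
            # Otherwise, refine against the issues from the strictest analysis.
            worst = min(analyses, key=lambda a: a.score)
            pool[i] = generator.refine(problem, proof, worst.issues)
    return None  # budget exhausted without a fully verified proof
```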
The following are the results from Table 1 of the original paper:
| Contest | Problems (full or partial credit) | Score |
|---|---|---|
| IMO 2025 | P1, P2, P3, P4, P5 | 83.3% |
| CMO 2024 | P1, P2, P4, P5, P6 | 73.8% |
| Putnam 2024 | A1~B4, B5, B6 | 98.3% |
Analysis of Table 1:
This table presents expert evaluation results for DeepSeekMath-V2 on three major mathematical competitions.
- IMO 2025 (International Mathematical Olympiad): The model solved 5 out of 6 problems (P1, P2, P3, P4, P5), achieving a score of 83.3%. The original paper notes that "Problems in gray are fully solved, while underlined problems received partial credit." With all listed problems (P1-P5) fully solved, this is a gold-medal performance.
- CMO 2024 (China Mathematical Olympiad): The model solved 4 problems (P1, P2, P4, P5) completely and received partial credit on P6, accumulating 73.8% of the total points. This also translates to a gold-medal performance in a pinnacle high-school competition.
- Putnam 2024 (William Lowell Putnam Competition): The model achieved an outstanding score of 118/120 (98.3%), solving 11 of 12 problems completely (A1~B4, B5) and the remaining problem (B6) with minor errors. This score significantly surpasses the highest human score of 90, demonstrating near-perfect performance on a preeminent undergraduate competition.

These results are a strong testament to the effectiveness of DeepSeekMath-V2's approach, particularly the high-compute search guided by its verifier. The ability to solve such a high percentage of problems and achieve gold-level scores, especially on Putnam, suggests the model can handle complex, multi-step reasoning and rigorous derivation at a level comparable to or exceeding top human competitors.
The following figure (Figure 3 from the original paper) shows results on IMO-ProofBench:
[Figure 3: Scores of DeepSeekMath-V2 and baselines on the basic and advanced subsets of IMO-ProofBench.]
Analysis of Figure 3:
This bar chart compares DeepSeekMath-V2 with various baselines on the Basic and Advanced subsets of IMO-ProofBench.
- Basic Set: DeepSeekMath-V2 substantially outperforms DeepMind's DeepThink (IMO Gold) and all other baselines on the basic set, indicating strong command of foundational to medium-difficulty IMO-level problems.
- Advanced Set: On the advanced set, which consists of the hardest IMO-level problems, DeepSeekMath-V2 remains competitive with DeepThink, outperforming GPT-5-Thinking-High, Gemini 2.5-Pro, and others by a significant margin. Being competitive with DeepThink on the advanced set while outperforming it on the basic set highlights its overall strength.

The paper notes that even with this advanced approach, "the hardest IMO-level problems remain challenging for our model." However, for problems not fully solved, the generator typically identifies the genuine issues in its proofs, and fully solved problems pass all 64 verification attempts. This confirms the reliability of the LLM-based verifiers in assessing complex proofs and highlights the power of scaled test-time compute under verifier guidance for solving problems that would require hours of human effort.
6.2. Ablation Studies / Parameter Analysis
The paper does not explicitly present a dedicated "Ablation Studies" section with separate experiments comparing different components of the model. However, the methodology section implicitly discusses the impact of certain components:
- Meta-Verification: The paper reports that, on a validation split of $\mathcal{D}_v$, the average quality score of the verifier's proof analyses, as evaluated by the meta-verifier, improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction. This demonstrates the effectiveness of meta-verification in improving the faithfulness of the verifier's issue identification and preventing hallucination of non-existent issues.
- Sequential Refinement with Self-Verification: Figure 2 visually demonstrates the impact of sequential refinement on proof quality. The Pass@1 metric, showing average proof quality, improves substantially as the number of sequential attempts increases. This implicitly acts as an ablation on the number of refinement steps, showing that more self-verification iterations directly lead to better outcomes.
- Automated Labeling Process: The paper mentions that "In our last two training iterations, this fully automated pipeline replaced human annotation entirely. Quality checks confirmed that the automated labels aligned well with expert judgments." This confirms the efficacy and reliability of the automated labeling process, a critical component for scaling verifier improvement without human intervention.
- Reward Function Parameters ($\alpha$, $\beta$): The paper specifies weighting coefficients for the generator's combined self-verification reward. While no explicit ablation is shown for these values, setting $\alpha > \beta$ reflects the authors' empirical decision to prioritize actual proof quality ($R_Y$) while still giving significant weight to self-analysis quality ($R_Z$).

While not presented as formal ablation tables, these observations confirm the positive impact of the key technical contributions on the model's overall performance and capabilities.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully presents DeepSeekMath-V2, a novel Large Language Model (LLM) capable of both generating and verifying mathematical proofs in natural language. The core innovation lies in moving beyond the limitations of final-answer-based rewards by introducing self-verifiable mathematical reasoning. Key contributions include:
- Training an accurate and faithful LLM-based verifier for mathematical proofs.
- Employing meta-verification to significantly reduce hallucinated issues and ensure the quality and faithfulness of the verifier's analyses.
- Incentivizing the proof generator to maximize proof quality through rigorous self-verification, identifying and resolving issues in its own reasoning before finalizing outputs.
- Developing an automated labeling process that leverages scaled verification compute to improve the verifier by automatically labeling increasingly hard-to-verify proofs, creating a sustainable, self-improving cycle without constant human annotation.

Empirically, DeepSeekMath-V2 demonstrates exceptional performance on challenging competitive mathematics problems. With scaled test-time compute, it achieved gold-medal scores in high-school competitions like IMO 2025 and CMO 2024, and a near-perfect score of 118/120 on the undergraduate Putnam 2024 competition, surpassing top human performance. This work establishes that LLMs can develop meaningful self-evaluation abilities for complex reasoning tasks, marking a significant step towards more reliable and autonomous AI systems in mathematics.
7.2. Limitations & Future Work
The authors acknowledge several limitations and areas for future work:
- Remaining Challenges in Hardest Problems: Despite its strong performance, the paper explicitly states that "the hardest IMO-level problems remain challenging for our model." This indicates that there is still a gap to close for the most complex mathematical proofs, requiring further advancements in reasoning capabilities.
- Efficiency of High-Compute Search: While scaled test-time compute enabled the model to solve highly challenging problems, the computational cost of this extensive search (e.g., 64 proof samples with 64 verification analyses each, up to 16 refinement iterations) can be substantial. Future work could focus on making this search process more efficient.
- Integration with Formal Reasoning: The paper highlights the potential synergy between natural language theorem proving and formal reasoning. It notes that existing formal provers (like DeepSeek-Prover-V2 and Seed-Prover) achieved impressive results without specifically optimizing their informal reasoning components for theorem proving. This suggests a future direction of leveraging advanced natural language theorem proving to further benefit and guide formal reasoning systems, aiming for reliable mathematical AI that combines informal insights with formal guarantees.
- Generalizability of Self-Verification: While DeepSeekMath-V2 demonstrates self-verifiability in mathematics, the generalizability of this approach to other complex reasoning domains (e.g., scientific discovery, engineering design) is an important avenue for future research.
7.3. Personal Insights & Critique
This paper presents a highly impactful advancement in LLM capabilities for mathematical reasoning. The shift from final-answer rewards to self-verifiable reasoning is a critical conceptual leap, addressing a fundamental limitation of previous RL approaches. The ability of an LLM to critically evaluate its own logical steps and iterative refinements, rather than just matching a final output, mimics human expert reasoning and problem-solving much more closely.
Inspirations:
- Synergistic Learning Loop: The iterative generation-verification cycle is particularly inspiring. It demonstrates how two LLM components (generator and verifier) can be trained to mutually improve, creating a self-sustaining learning ecosystem. This paradigm could transfer to other domains requiring complex, verifiable outputs, such as scientific hypothesis generation, code debugging, or legal argument construction.
- Automated Labeling: The automated labeling process is a practical and crucial innovation. Data annotation is often a bottleneck in RL training for complex tasks. Automating the creation of hard-to-verify training examples for the verifier is a clever way to scale data generation beyond human expert capacity, suggesting a path toward more autonomous AI development in specialized fields.
- Meta-Verification for Trustworthiness: Introducing meta-verification to address hallucination in the verifier is a powerful idea. It highlights the importance of not just accuracy but also faithfulness and explainability in AI systems, especially when they are meant to provide critical feedback. This concept could help ensure the reliability of AI-generated explanations or rationales in other sensitive applications.
Potential Issues or Areas for Improvement:
- Computational Cost: While effective, the high-compute search strategy is computationally intensive. The paper doesn't detail the exact compute resources (GPUs, time) required for training and inference, which would help assess the practicality of deploying such systems widely. Future work could explore more sample-efficient RL algorithms or more intelligent search strategies to reduce this cost.
- Interpretability of Verifier's Issues: While the verifier identifies issues, understanding why certain issues are flagged or why certain proofs are flawed may still require human expertise. Enhancing the explainability of the verifier's analysis, beyond summarizing issues, could further empower human users and help diagnose model failures more effectively.
- Generalization to Novel Mathematical Concepts: The training relies heavily on competitive math problems, which, while diverse, might not cover entirely novel mathematical concepts or structures that a human researcher would encounter. Scaling DeepSeekMath-V2 to truly research-level mathematics (discovering new theorems and frameworks, not just solving known problems) remains an open challenge; the current benchmarks test problem-solving within established mathematical fields.
- Formalization Gap: While the paper argues that advancing natural language theorem proving will benefit formal reasoning, the direct path from a natural language proof (even a self-verified one) to a mechanically verifiable formal proof is still a complex research area. DeepSeekMath-V2 provides excellent informal insights, but bridging this formalization gap remains an open problem.