DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
TL;DR Summary
DeepSeekMath-V2 targets the limitations of large language models in mathematical reasoning. By training an LLM-based verifier for theorem proving and using it as a reward model, the system enables self-verification, producing more rigorous proofs and achieving gold-level results in top competitions, demonstrating the potential of self-verifiable mathematical reasoning.
Abstract
Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn’t address a key issue: correct answers don’t guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute. While much work remains, these results suggest that self-verifiable mathematical reasoning is a feasible research direction that may help develop more capable mathematical AI systems.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning." This title indicates a focus on advancing large language models (LLMs) to perform mathematical reasoning with an emphasis on the ability to verify their own outputs.
1.2. Authors
The authors are Zhihong Shao, Yuxiang Luo, Chengda Lu, Z.Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. They are affiliated with DeepSeek-AI. Zhihong Shao is also listed as the contact author.
1.3. Journal/Conference
The paper is released as a preprint. Its references use CoRR/arXiv identifiers (e.g., CoRR, abs/2507.23726, 2025) alongside venues such as Nature, EMNLP 2025, and ICLR 2023, but the paper itself does not state a journal or conference of publication, suggesting it is a preprint (e.g., on arXiv) at the time of its release. It is therefore likely undergoing peer review or not yet formally published, but it is available to the research community.
1.4. Publication Year
The publication year is 2025.
1.5. Abstract
This paper addresses the limitations of current large language models (LLMs) in mathematical reasoning, particularly their reliance on final-answer-based reinforcement learning (RL) which doesn't guarantee correct reasoning and is inapplicable to tasks like theorem proving. The research objective is to develop self-verifiable mathematical reasoning capabilities in LLMs.
The core methodology involves training an accurate and faithful LLM-based verifier for theorem proving. This verifier then serves as a reward model to train a proof generator. The generator is incentivized to identify and resolve issues in its own proofs through a process called self-verification. To improve the verifier further, a mechanism is proposed to scale verification compute to automatically label new, hard-to-verify proofs, creating training data.
The main results demonstrate that their model, DeepSeekMath-V2, achieves strong theorem-proving capabilities. It scored gold-level scores on IMO 2025 and CMO 2024, and a near-perfect 118/120 on Putnam 2024, utilizing scaled test-time compute. The key conclusion is that self-verifiable mathematical reasoning is a feasible research direction that can lead to more capable mathematical AI systems.
1.6. Original Source Link
The original source link is /files/papers/69284d609b40dbd7a3a8aeda/paper.pdf. This appears to be a local or internal file path rather than a public URL. Based on the references, it is likely intended to be a link to a preprint repository like arXiv, where similar papers are often found (e.g., https://arxiv.org/abs/2507.23726). The publication status is preprint or unknown.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a critical challenge in the field of artificial intelligence (AI) concerning the mathematical reasoning capabilities of Large Language Models (LLMs). While LLMs have shown significant progress in quantitative reasoning, often saturating (reaching near-perfect performance on) benchmarks like AIME and HMMT, the conventional approach of rewarding models solely based on correct final answers suffers from two fundamental limitations:
- Unreliable Proxy for Reasoning Correctness: A model might arrive at the correct answer through flawed logic, fortunate errors, or superficial understanding, meaning a correct answer does not guarantee correct reasoning. This creates a generation-verification gap where the model can generate an answer but cannot reliably verify its own reasoning process.
- Inapplicability to Theorem Proving: Many mathematical tasks, especially theorem proving, do not involve numerical answers. Instead, they require rigorous, step-by-step derivation and proof of a statement. For such tasks, a final answer reward is fundamentally inapplicable. This also means LLMs trained this way often produce mathematically invalid or logically inconsistent natural-language proofs.

The core problem the paper aims to solve is the lack of comprehensiveness and rigor in LLM-generated mathematical reasoning, particularly in theorem proving, and the models' inability to reliably verify the validity of their own proofs. This problem is important because mathematical reasoning is a crucial testbed for AI and has the potential to significantly impact scientific research if advanced further. The ability to self-verify is especially critical for scaling test-time compute on open problems where no known solutions exist for comparison.
The paper's entry point is the observation that humans can identify issues in proofs even without reference solutions, and that a proof's quality can be inferred by the effort required to identify issues. This motivates the development of proof verification capabilities in LLMs, which can then be used to iteratively improve proof generation.
2.2. Main Contributions / Findings
The paper makes several primary contributions towards self-verifiable mathematical reasoning:
- Accurate and Faithful LLM-based Verifier: The authors develop a method to train an LLM-based verifier that can accurately identify issues in mathematical proofs and assign scores based on mathematical rigor and completeness.
- Meta-Verification for Verifier Faithfulness: They introduce meta-verification, a secondary evaluation process, to assess the quality of the verifier's analyses. This ensures that the verifier identifies genuine issues rather than hallucinating non-existent ones, thereby improving the faithfulness of issue identification.
- Self-Verification Incentivization for Generator: They train a proof generator using the verifier as a reward model. Crucially, the generator is incentivized to perform self-verification: it is prompted to analyze its own generated proofs, identify issues, and iteratively resolve them before finalizing its output. This makes the model explicitly aware of its reward function.
- Automated Labeling via Scaled Verification Compute: To maintain the generation-verification gap and continuously improve the verifier as the generator becomes stronger, they propose and implement a method to scale verification compute. This automatically labels new, hard-to-verify proofs, creating training data for the verifier without requiring extensive human annotation.
- High Performance on Mathematical Competitions: The resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024, especially with scaled test-time compute. These findings suggest that self-verifiable mathematical reasoning is a feasible and promising research direction for developing more capable mathematical AI systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DeepSeekMath-V2, a reader should be familiar with the following foundational concepts:
- Large Language Models (LLMs): Advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data. They can understand, generate, and process human language for a wide range of tasks, including translation, summarization, and reasoning. LLMs learn to predict the next token in a sequence, thereby developing a deep grasp of grammar, facts, and context.
- Reinforcement Learning (RL): A machine learning paradigm in which an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties for its actions, aiming to maximize cumulative reward over time. For LLMs, RL is often used to fine-tune models to align their outputs with specific objectives or human preferences, such as generating correct mathematical answers.
- Supervised Fine-tuning (SFT): A common natural language processing (NLP) technique in which a pre-trained LLM is further trained on a smaller, task-specific dataset with human-labeled examples. This adapts the model's general knowledge to a particular downstream task, such as mathematical reasoning or code generation.
- Mathematical Reasoning: The ability to understand, analyze, and solve mathematical problems. This includes quantitative reasoning (solving problems that result in numerical answers) and theorem proving (rigorously deriving logical conclusions from axioms and definitions, often through multi-step deductions without producing a numerical answer).
- Theorem Proving: Constructing a formal or informal derivation (a proof) to establish the truth of a mathematical statement (theorem) based on established axioms, definitions, and previously proven theorems. This often requires complex, multi-step logical deductions.
- Natural Language Reasoning: The ability of an AI system to process, understand, and generate human-like textual explanations and deductions, including mathematical proofs. This contrasts with formal reasoning, which uses symbolic logic and formal systems.
- Formal Proof Assistants: Software tools (e.g., Lean, Isabelle) that help mathematicians and computer scientists construct and verify mathematical proofs. These systems require proofs written in a highly structured formal language; once a proof compiles, its correctness is mathematically guaranteed by the system's logic.
3.2. Previous Works
The paper situates its work in the context of several important prior studies and developments:
- LLMs for Quantitative Reasoning (e.g., AIME, HMMT): Conventional RL approaches that reward LLMs for correct final answers have enabled models (like those described by Guo et al., 2025, and OpenAI, 2024) to saturate (achieve near-perfect performance on) quantitative reasoning benchmarks such as the American Invitational Mathematics Examination (AIME) and the Harvard-MIT Mathematics Tournament (HMMT). These benchmarks primarily test problem-solving skills that yield numerical answers.
- Gemini 2.5 Pro and Self-Verification: Some frontier LLMs, such as Gemini 2.5 Pro (DeepMind, 2025), already demonstrate a degree of self-verification, allowing them to refine their own solutions (Huang and Yang, 2025). This indicates a growing trend toward models that can evaluate and improve their outputs.
- DeepMind's DeepThink (IMO Gold): DeepMind's internal DeepThink variant (Luong and Lockhart, 2025) achieved gold medal performance at IMO 2025 using pure natural language reasoning. This serves as an existence proof that LLM-based verification of complex proofs is achievable, paving the way for the current work's focus on self-verifiable reasoning. Related research by Luong et al. (2025) explores whether reasoning models can evaluate proofs, with or without reference solutions.
- Formal Proof Search Systems (e.g., AlphaProof, DeepSeek-Prover-V2, Seed-Prover):
  - AlphaProof (AlphaProof and teams, 2024; Trinh et al., 2024; Chervonyi et al., 2025): A system specialized for formal proof search that achieved silver-level performance at IMO 2024. It relies on formal languages and often requires intensive computation. AlphaGeometry2 (Chervonyi et al., 2025) is a variant specifically for geometry problems.
  - DeepSeek-Prover-V2 (Ren et al., 2025) and Seed-Prover (Chen et al., 2025): Advances in formal mathematical reasoning capable of producing substantially more valid formal proofs; Seed-Prover notably solved 5 of 6 problems at IMO 2025. These systems often leverage informal reasoning from LLMs to guide formal proof generation, making the quality of informal reasoning highly impactful.
3.3. Technological Evolution
The field of AI in mathematical reasoning has evolved significantly:
- Early Symbolic AI and Expert Systems: Initial attempts relied on symbolic logic and rule-based systems, requiring explicit programming of mathematical rules and inference steps. These systems struggled with the vastness and ambiguity of real-world mathematical problems.
- Rise of Deep Learning and Quantitative Reasoning: With the advent of deep learning, particularly transformer models, LLMs began to show impressive quantitative reasoning capabilities. Early successes involved training models to produce correct numerical answers, often using reinforcement learning with final answer rewards, which led to models saturating benchmarks like AIME.
- Limitations of Final Answer Rewards: These limitations became apparent when models struggled with theorem proving or produced correct answers via incorrect reasoning, highlighting the need for models to understand and evaluate the reasoning process itself.
- Emergence of Informal Proof Generation and Self-Correction: Recent advancements, exemplified by Gemini 2.5 Pro and DeepThink, demonstrated that LLMs could generate natural-language proofs and even perform some self-correction, proving the feasibility of LLM-based verification for complex, informal proofs.
- Integration with Formal Proof Systems: Concurrently, there has been significant progress in integrating LLMs with formal proof assistants. LLMs generate informal proofs or proof sketches, which then guide formal provers to construct verifiable proofs in formal languages. Systems like AlphaProof and Seed-Prover represent this hybrid approach.

DeepSeekMath-V2 fits within this timeline by pushing the boundaries of natural language theorem proving. It moves beyond simple final-answer correctness to focus on the rigor and comprehensiveness of the reasoning process itself, through novel self-verification and meta-verification mechanisms. It aims to develop LLMs that can not only generate proofs but also critically evaluate and improve them, bridging the generation-verification gap in the informal reasoning domain.
3.4. Differentiation Analysis
Compared to the main methods in related work, DeepSeekMath-V2 presents several core differences and innovations:
- Beyond Final Answer Accuracy: Unlike earlier LLMs that primarily focused on final answer accuracy for quantitative reasoning, DeepSeekMath-V2 explicitly targets theorem proving, where rigorous step-by-step derivation is paramount, not just a numerical answer. Its reward mechanism moves beyond simple correctness to the comprehensiveness and rigor of the proof.
- Dedicated LLM-based Proof Verifier: A key innovation is training a specific LLM-based verifier ($\pi_\varphi$) designed to evaluate natural-language proofs according to detailed rubrics, identifying issues and assigning scores. This is a more nuanced and powerful form of verification than checking a final numerical output. While DeepThink demonstrated that LLM-based verification is achievable, DeepSeekMath-V2 open-sources its training methodology for this capability.
- Meta-Verification for Faithfulness: The paper introduces meta-verification to address a critical vulnerability: the verifier might hallucinate non-existent issues while still predicting the correct score. By training a meta-verifier to assess the quality and justification of the verifier's analyses, DeepSeekMath-V2 enhances the faithfulness and trustworthiness of its issue identification, a novel approach to ensuring verification quality.
- Incentivized Self-Verification in the Generator: While some models like Gemini 2.5 Pro show self-verification capabilities, DeepSeekMath-V2 explicitly trains its proof generator ($\pi_\theta$) to perform rigorous self-analysis using the same rubrics as the external verifier. The reward function directly incentivizes the generator to identify and resolve issues in its own proofs, promoting deliberate, iterative refinement rather than blind trial-and-error. This closes the generation-verification gap within the model itself.
- Automated Labeling for Verifier Improvement: To keep improving the verifier as the generator grows stronger and produces more complex, hard-to-verify proofs, DeepSeekMath-V2 proposes an innovative automated labeling process. It scales verification compute (multiple verifier samples, meta-verification of those samples) to generate new training data for the verifier without human annotation, creating a sustainable improvement cycle.
- Focus on Natural Language vs. Formal Proofs: While AlphaProof, DeepSeek-Prover-V2, and Seed-Prover generate formal proofs (which offer strong guarantees but require specialized languages and high computational cost), DeepSeekMath-V2 focuses on natural language theorem proving. It demonstrates that LLMs can achieve gold-level performance on competitions using informal reasoning, suggesting that advancing informal reasoning can also significantly benefit formal reasoning systems.

In essence, DeepSeekMath-V2 differentiates itself by building a tightly coupled, self-improving ecosystem of LLM-based verifiers and generators for natural language mathematical proofs, with a strong emphasis on faithful issue identification and iterative self-refinement.
4. Methodology
The methodology of DeepSeekMath-V2 is centered around an iterative reinforcement learning (RL) process that synergistically improves both proof verification and proof generation capabilities. It involves training a dedicated verifier to identify issues and score proofs, a meta-verifier to ensure the verifier's faithfulness, and a generator that uses the verifier as a reward model and is incentivized to self-verify and refine its own proofs.
4.1. Principles
The core idea is to move beyond final-answer-based rewards for mathematical reasoning, which are unreliable for assessing correctness of reasoning and inapplicable to theorem proving. The fundamental principle is that for complex reasoning tasks like theorem proving, it is crucial to verify the comprehensiveness and rigor of the step-by-step derivation.
The theoretical basis is that an LLM can be trained to emulate human mathematical experts in identifying flaws and evaluating proof quality without reference solutions. This verifier can then provide a reward signal for an LLM proof generator. Furthermore, by making the generator explicitly aware of the verifier's criteria, it can learn to self-verify and iteratively improve its own outputs, mimicking how humans review and refine their work. The synergy between verification and generation is exploited to create a continuous improvement loop, where stronger generators produce challenging proofs that, in turn, help improve the verifier.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology consists of three main components: Proof Verification, Proof Generation, and Synergy Between Proof Verification and Generation.
4.2.1. Proof Verification
This component focuses on training an LLM to act as a mathematical verifier.
4.2.1.1. Training a Verifier to Identify Issues and Score Proofs
The goal here is to train a verifier model, denoted $\pi_\varphi$, to evaluate proofs according to high-level rubrics (detailed in Appendix A.2). The verifier takes a problem $X_i$ and a proof $Y_i$ as input and produces a proof analysis $V'_i$ that summarizes identified issues (if any) and assigns an overall score $s'_i$. The score levels are:
- 1: For complete and rigorous proofs with all logical steps clearly justified.
- 0.5: For proofs with sound overall logic but minor errors or omitted details.
- 0: For fundamentally flawed proofs containing fatal logical errors or critical gaps.
Data Collection Process:
- Problem Collection: Problems are crawled from Art of Problem Solving (AoPS) contests, prioritizing math olympiads, team selection tests, and post-2010 problems explicitly requiring proofs. This yields 17,503 problems, denoted $\mathcal{D}_p$.
- Candidate Proof Generation: An initial variant of DeepSeek-V3.2-Exp-Thinking is used to generate candidate proofs. This model is prompted to iteratively refine its proofs to improve comprehensiveness and rigor, as it was not initially optimized for theorem proving and tended to produce concise but error-prone outputs.
- Expert Annotation: Proofs are randomly sampled across diverse problem types (e.g., algebra, number theory), and mathematical experts score each proof according to the defined evaluation rubrics. This process yields an initial RL dataset $\mathcal{D}_v = \{(X_i, Y_i, s_i)\}$, where $X_i$ is the problem, $Y_i$ is the proof, and $s_i$ is the expert-annotated proof score.
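To make the dataset structure concrete, the sketch below shows what one record of $\mathcal{D}_v$ might look like; the field names and example content are illustrative, not the paper's schema.

```python
# A hypothetical record in the verifier RL dataset D_v = {(X_i, Y_i, s_i)};
# field names and contents are illustrative only.
example_record = {
    "problem": "Prove that a/b + b/a >= 2 for all positive reals a, b.",  # X_i
    "proof": "By AM-GM, a/b + b/a >= 2*sqrt((a/b)*(b/a)) = 2, "
             "with equality iff a = b.",                                  # Y_i
    "expert_score": 1.0,  # s_i in {0, 0.5, 1}, assigned by a human expert
}
```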
RL Objective for Verifier Training:
The verifier is built upon a version of DeepSeek-V3.2-Exp-SFT (Supervised Fine-Tuning), which was already fine-tuned on reasoning data related to mathematics and code. The model is trained with reinforcement learning using two reward components to produce proof analyses:
- Format Reward ($R_{\mathrm{format}}$): An indicator function that ensures the model generates its response in the specified format, which includes both a summary of identified issues and a proof score. It checks for the presence of key phrases like "Here is my evaluation of the solution:" and a delimited score following "Based on my evaluation, the final overall score should be:". Here $V'_i$ denotes the verifier's final response for problem $X_i$ and proof $Y_i$.
- Score Reward ($R_{\mathrm{score}}$): Rewards the verifier based on how close its predicted score $s'_i$ is to the expert-annotated ground-truth score $s_i$, where $s'_i$ is the proof score extracted from the verifier's response $V'_i$. The reward is 1 if the predicted score exactly matches the annotated score, 0.5 if it is off by 0.5 (e.g., predicting 0.5 for a 0 or 1 proof), and 0 if it is completely off (e.g., predicting 0 for a 1 proof or vice versa).
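A minimal Python sketch of these two components follows; the phrase checks approximate the described template and are assumptions, not the paper's verbatim parser.

```python
def format_reward(response: str) -> float:
    """Indicator reward R_format: 1.0 only if the verifier's response contains
    both required components described above. The exact string checks are
    assumed approximations of the response template."""
    has_evaluation = "Here is my evaluation of the solution:" in response
    has_final_score = (
        "Based on my evaluation, the final overall score should be:" in response
    )
    return 1.0 if (has_evaluation and has_final_score) else 0.0


def score_reward(predicted: float, annotated: float) -> float:
    """R_score on the {0, 0.5, 1} scale: exact match -> 1.0,
    off by 0.5 -> 0.5, off by 1.0 -> 0.0."""
    diff = abs(predicted - annotated)
    if diff == 0.0:
        return 1.0
    if diff == 0.5:
        return 0.5
    return 0.0
```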
The overall RL objective for training the verifier is to maximize the expected product of these two reward components:
$
\operatorname*{max}_{\pi_\varphi} \; \mathbb{E}_{(X_i, Y_i, s_i) \sim \mathcal{D}_v,\; (V'_i, s'_i) \sim \pi_\varphi(\cdot \mid X_i, Y_i)} \left[ R_{\mathrm{format}}(V'_i) \cdot R_{\mathrm{score}}(s'_i, s_i) \right]
$
This objective trains the verifier $\pi_\varphi$ to generate responses $V'_i$ that adhere to the specified format and predict scores $s'_i$ that are accurate with respect to the expert annotations $s_i$.
4.2.1.2. Introducing Meta-Verification to Review Proof Analyses
The initial verifier training ensures accurate score prediction but doesn't directly supervise the identified issues. This creates a risk of hallucination: the verifier could predict the correct score for a flawed proof while still inventing non-existent issues, undermining its trustworthiness.
To address this, meta-verification is introduced. This is a secondary evaluation process that assesses:
- Whether the issues identified by the verifier truly exist in the proof.
- Whether these identified issues logically justify the verifier's predicted proof score, according to the evaluation rubrics.

The complete meta-verification rubrics are detailed in Appendix A.3.
Meta-Verifier Training Process:
- Initial Verifier: An initial verifier is obtained from the process described in Section 4.2.1.1.
- Expert Meta-Annotation: Mathematical experts score the quality of the verifier's responses (proof analyses) according to the meta-verification rubrics. This creates a dataset in which each verifier analysis of a proof is paired with an expert-annotated quality score for that analysis.
- Meta-Verifier Training: A dedicated meta-verifier is trained with RL to analyze the verifier's proof analysis. The meta-verifier outputs a summary of issues found in the analysis itself, followed by a quality score measuring the accuracy and justification of the verifier's analysis. The RL objective for meta-verifier training follows the same structure as verifier training, using format and score rewards to align with expert meta-annotations.
Enhanced Verifier Training with Meta-Verification Feedback:
Once the meta-verifier is trained, its feedback is integrated into the verifier's reward function, combining the three components below (the product form extends the original format-times-score objective):
$
R = R_{\mathrm{format}}(V'_i) \cdot R_{\mathrm{score}}(s'_i, s_i) \cdot R_{\mathrm{meta}}(V'_i)
$
Here:

- $R_{\mathrm{format}}(V'_i)$ is the format reward for the verifier's response.
- $R_{\mathrm{score}}(s'_i, s_i)$ is the score reward for the verifier's predicted score accuracy.
- $R_{\mathrm{meta}}(V'_i)$ is the quality score from the meta-verifier. This term explicitly rewards the verifier for producing faithful and justified issue identifications, as evaluated by the meta-verifier.

The enhanced verifier is then trained on both the verification dataset $\mathcal{D}_v$ and the meta-verification dataset, using the same reward mechanism as for training the meta-verifier. The resulting model can perform both proof verification and meta-verification tasks. The paper reports that, on a validation split of $\mathcal{D}_v$, the average quality score of the verifier's proof analyses (as evaluated by the meta-verifier) improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction.
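A minimal sketch of the enhanced reward, reusing the helpers above; the multiplicative combination mirrors the product structure of the original objective and is an assumption about the exact composition.

```python
def enhanced_verifier_reward(response: str, predicted_score: float,
                             annotated_score: float, meta_quality: float) -> float:
    """Enhanced verifier reward: the original format and score terms, gated by
    the meta-verifier's quality score R_meta in [0, 1] for the analysis."""
    return (
        format_reward(response)
        * score_reward(predicted_score, annotated_score)
        * meta_quality
    )
```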
4.2.2. Proof Generation
This component describes how the proof generator is trained and enhanced using the verifier.
4.2.2.1. Training a Generator for Theorem Proving
With the verifier now capable of accurately assessing proof quality, it is used as a generative reward model to train a proof generator $\pi_\theta$. The RL objective for the generator is to maximize the expected proof score produced by the verifier:
$
\operatorname*{max}_{\pi_\theta} \mathbb{E}_{X_i \sim \mathcal{D}_p,\; Y_i \sim \pi_\theta(\cdot \mid X_i)} \left[ R_Y \right]
$
Here:
- $\pi_\theta$ is the proof generator model.
- $X_i \sim \mathcal{D}_p$ means problem $X_i$ is sampled from the problem dataset.
- $Y_i \sim \pi_\theta(\cdot \mid X_i)$ means proof $Y_i$ is generated by $\pi_\theta$ for problem $X_i$.
- $R_Y$ is the proof score produced by the verifier $\pi_\varphi$. The generator is thus rewarded for producing proofs that the verifier deems high quality.
4.2.2.2. Enhancing Reasoning via Self-Verification
For challenging problems, a single-shot proof generation is often insufficient. While iterative refinement based on external verifier feedback is possible, the authors observed that when a generator is prompted to both generate and analyze its own proof in one shot, it often hallucinates correctness even when flaws are easily identifiable by an external verifier. This means the generator lacks genuine self-verification capabilities.
To address this, the proof generator is endowed with genuine verification capabilities. During training, the generator is prompted to produce not only a proof $Y$ but also a self-analysis $Z$. This self-analysis must follow the same format and rubrics as the verifier's output (see Appendix A.1 for the prompt template). The proof score predicted within this self-analysis is denoted $s'$.
To ensure faithful self-evaluation, the external verifier is used to assess both components:

- The generated proof receives a score $R_Y = s$, where $s$ is the score assigned by the external verifier.
- The self-analysis receives a meta-verification score, which assesses the quality and faithfulness of the generator's own evaluation.

The reward function for training the generator with self-verification combines these assessments (a code sketch appears after the incentives list below):
$
R = R_{\mathrm{format}}(Y, Z) \cdot (\alpha \cdot R_Y + \beta \cdot R_Z), \qquad R_Z = R_{\mathrm{score}}(s', s) \cdot R_{\mathrm{meta}}(Z)
$
Here:

- $R_{\mathrm{format}}(Y, Z)$ is an indicator function that verifies both the proof and the self-analysis adhere to their specified formats (as outlined in Appendix A.1).
- $R_Y$ is the score of the proof as judged by the external verifier.
- $R_Z$ is a reward component for the quality of the self-analysis.
- $R_{\mathrm{score}}(s', s)$ rewards the generator for accurately self-assessing its proof, by comparing its self-assigned score $s'$ with the external verifier's score $s$.
- $R_{\mathrm{meta}}(Z)$ is the meta-verification score of the self-analysis $Z$, ensuring the faithfulness of the generator's issue identification.
- $\alpha$ and $\beta$ are weighting coefficients with $\alpha > \beta$, indicating a stronger emphasis on the actual proof quality while still giving significant weight to the self-evaluation.

This reward structure creates specific incentives for the generator:

- Faithful Acknowledgment of Errors: The model is rewarded more for truthfully acknowledging errors in its proof (even if it can't fix them) than for falsely claiming correctness.
- Highest Rewards for Rigorous Proofs and Accurate Self-Assessment: The maximum reward is achieved when the generator produces a completely correct proof ($R_Y = 1$) and accurately recognizes its rigor ($s' = s = 1$).
- Iterative Issue Resolution: A good strategy for the proof generator to obtain high rewards is to iteratively identify and resolve as many issues as possible in its own proofs before finalizing them, guided by its self-verification capabilities.
4.2.3. Synergy Between Proof Verification and Generation
The proof verifier and generator are designed to create a synergistic, self-improving cycle. As the generator improves, it produces new proofs that challenge the verifier's current capabilities, particularly hard-to-verify proofs where the verifier might struggle to identify subtle issues in a single attempt. These challenging cases become valuable training data for enhancing the verifier itself.
To address the need for labeled correctness data for these newly generated proofs, the authors developed an automated labeling process, which eventually replaced human annotation entirely in later training iterations.
Automated Labeling Process:
- Multiple Verification Analyses: For each newly generated proof, multiple independent verification analyses are generated by the verifier. This increases the probability of catching real issues in flawed proofs, as one verifier instance might miss something that another catches.
- Meta-Verification of Issue-Reporting Analyses: For analyses (from step 1) that report issues (scores of 0 or 0.5), meta-verification assessments are generated by the meta-verifier to validate whether the identified problems genuinely exist. An analysis is deemed valid if the majority of these meta-assessments confirm its findings. This step leverages the observation that meta-verification is easier and more sample-efficient for LLMs to master than identifying issues from scratch.
- Proof Scoring and Routing (see the sketch after this list):
  - For each proof, the analyses that assign the lowest score are examined. If enough such analyses are deemed valid by the meta-verifier, the proof is labeled with that lowest score.
  - If no legitimate issues are identified across all verification attempts (i.e., all analyses assign a score of 1), the proof is labeled with a score of 1.
  - Otherwise (e.g., conflicting analyses, or insufficient valid low-score analyses), the proof is either discarded or routed to human experts for labeling.

This fully automated pipeline significantly boosts annotation efficiency and allows continuous improvement of the verifier using data generated by the generator, without relying on manual annotation in later stages. Quality checks confirmed that the automated labels aligned well with expert judgments.
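The following Python sketch captures the routing logic under stated assumptions: `verify` and `meta_verify` are callables standing in for the verifier and meta-verifier, and the counts `n_analyses`, `min_valid`, and `n_meta` are illustrative placeholders rather than the paper's settings.

```python
def auto_label(problem, proof, verify, meta_verify,
               n_analyses=8, min_valid=2, n_meta=5):
    """Automated labeling sketch: sample several verification analyses,
    meta-verify the issue-reporting ones, and label or route the proof."""
    analyses = [verify(problem, proof) for _ in range(n_analyses)]
    lowest = min(a.score for a in analyses)  # each analysis carries a .score

    if lowest == 1.0:
        return 1.0  # no analysis reports an issue: label the proof correct

    def confirmed(analysis):
        # Majority vote over n_meta meta-assessments of this analysis.
        votes = [meta_verify(problem, proof, analysis) for _ in range(n_meta)]
        return sum(votes) > n_meta / 2

    valid_lowest = [a for a in analyses if a.score == lowest and confirmed(a)]
    if len(valid_lowest) >= min_valid:
        return lowest  # label with the validated lowest score
    return None        # conflicting evidence: discard or route to human experts
```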
5. Experimental Setup
5.1. Datasets
The paper uses a set of problems for training and evaluation.
- Training Data (Problem Set $\mathcal{D}_p$): Problems were crawled from Art of Problem Solving (AoPS) contests, including math olympiads, team selection tests, and post-2010 problems specifically requiring proofs, for a total of 17,503 problems. They span mathematical domains such as algebra, number theory, geometry, combinatorics, and inequality, and are designed to test deep mathematical understanding and rigorous proof construction.
- Initial RL Dataset ($\mathcal{D}_v$): Created by sampling proofs generated by a variant of DeepSeek-V3.2-Exp-Thinking on problems from $\mathcal{D}_p$, then scored by mathematical experts (0, 0.5, or 1).
- Meta-Verification Dataset: Generated by having mathematical experts score the quality of verifier responses (proof analyses) according to the meta-verification rubrics.
Evaluation Benchmarks:
The final proof generator was evaluated on several benchmarks to assess its theorem-proving capabilities:
- In-House CNML-Level Problems: A set of 91 theorem-proving problems.
  - Source: In-house, comparable in difficulty to problems from the Chinese National High School Mathematics League (CNML).
  - Characteristics: Spanning algebra (13 problems), geometry (24), number theory (19), combinatorics (24), and inequality (11). These are challenging problems requiring rigorous proofs.
- Competition Problems:
  - IMO 2025: 6 problems from the International Mathematical Olympiad, a premier global competition for pre-university students.
  - CMO 2024: 6 problems from the China Mathematical Olympiad, China's national championship.
  - Putnam 2024: 12 problems from the William Lowell Putnam Competition, a preeminent mathematics competition for undergraduate students in North America.
  - ISL 2024 (IMO Shortlist): 31 problems from the IMO Shortlist, a collection of problems proposed by participating countries for potential inclusion in IMO 2024. These represent a wide range of difficulty levels.
- IMO-ProofBench: A benchmark developed by the DeepMind team.
  - Source: Luong et al. (2025).
  - Characteristics: Divided into a basic set (30 problems, pre-IMO to IMO-Medium difficulty) and an advanced set (30 challenging problems simulating complete IMO examinations, up to IMO-Hard level).

These datasets were chosen because they are standard and highly respected benchmarks in competitive mathematics, suitable for evaluating the rigor and correctness of mathematical proofs. The mix of high-school and undergraduate problems, as well as varied mathematical domains, provides a comprehensive assessment of the model's generalized theorem-proving abilities.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the performance of the models, primarily focusing on proof correctness and quality.
- Proof Score ($s$):
  - Conceptual Definition: Quantifies the overall correctness, completeness, and rigor of a mathematical proof. It is the primary output of the verifier and the target for expert annotation and self-assessment.
  - Mathematical Formula: Not a formula, but a discrete scale $s \in \{0, 0.5, 1\}$:
    - $s = 1$: Completely correct, all steps executed properly and clearly demonstrated.
    - $s = 0.5$: Generally correct, but with some details omitted or minor errors.
    - $s = 0$: Fundamentally flawed, containing fatal logical errors or critical gaps, or not addressing the required problem.
- Pass@1:
  - Conceptual Definition: Measures the average quality of the final proof generated by a model in a multi-attempt scenario (e.g., sequential refinement). It assesses the typical performance of a single generation attempt after refinement.
  - Mathematical Formula:
    $
    \text{Pass@1} = \frac{1}{N_{\text{problems}} \cdot N_{\text{threads}}} \sum_{i=1}^{N_{\text{problems}}} \sum_{j=1}^{N_{\text{threads}}} s_{i,j}^{\text{final}}
    $
  - Symbol Explanation:
    - $N_{\text{problems}}$: The total number of problems evaluated.
    - $N_{\text{threads}}$: The number of independent refinement threads launched per problem.
    - $s_{i,j}^{\text{final}}$: The score (0, 0.5, or 1) of the final proof from the $j$-th refinement thread for the $i$-th problem, as evaluated by the final verifier.
    - The metric is the average score across all final proofs from all threads.
- Best@32 (or Best@N):
  - Conceptual Definition: Measures the quality of the single best proof found for a given problem across multiple independent generation/refinement attempts (threads). The "best" proof is selected by the model's own self-assigned score, so the metric evaluates the model's ability not only to generate good proofs but also to accurately identify its best work.
  - Mathematical Formula:
    $
    \text{Best@N} = \frac{1}{N_{\text{problems}}} \sum_{i=1}^{N_{\text{problems}}} s_{i, j_i^*}^{\text{verifier}}, \qquad j_i^* = \operatorname*{arg\,max}_{j \in \{1, \dots, N_{\text{threads}}\}} s_{i,j}^{\text{self}}
    $
  - Symbol Explanation:
    - $N_{\text{problems}}$: The total number of problems evaluated.
    - $N_{\text{threads}}$: The number of independent refinement threads (here, 32).
    - $s_{i,j}^{\text{self}}$: The self-assigned score of the final proof from the $j$-th thread for the $i$-th problem.
    - $s_{i,j}^{\text{verifier}}$: The score of the final proof from the $j$-th thread for the $i$-th problem, as evaluated by the final verifier.
    - For each problem $i$, the model first identifies the thread $j_i^*$ with the highest self-assigned score; the verifier score of that self-selected proof is then taken. Best@N is the average of these verifier scores across all problems.
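Both metrics are easy to express over a matrix of per-thread scores. The sketch below assumes NumPy arrays of shape (problems, threads); it is an illustration of the definitions above, not the authors' evaluation code.

```python
import numpy as np

def pass_at_1(final_scores: np.ndarray) -> float:
    """final_scores[i, j] is the verifier score of thread j's final proof on
    problem i; Pass@1 is the grand mean over problems and threads."""
    return float(final_scores.mean())

def best_at_n(verifier_scores: np.ndarray, self_scores: np.ndarray) -> float:
    """Pick, per problem, the thread with the highest self-assigned score and
    average that thread's verifier score (ties broken by the first such
    thread, an implementation choice)."""
    chosen = self_scores.argmax(axis=1)
    picked = verifier_scores[np.arange(verifier_scores.shape[0]), chosen]
    return float(picked.mean())
```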
5.3. Baselines
The paper compares DeepSeekMath-V2 against several state-of-the-art LLMs:
- GPT-5-Thinking-High (OpenAI, 2025): A high-performance variant of OpenAI's GPT-5 model, likely optimized for reasoning tasks, as implied by "Thinking-High." It represents a cutting-edge proprietary model from a leading AI research organization.
- Gemini 2.5-Pro (DeepMind, 2025): DeepMind's advanced Gemini model, known for strong multimodal capabilities and for demonstrating some self-verification abilities, as noted in the related work.
- DeepMind's DeepThink (IMO Gold) (Luong and Lockhart, 2025): A specific variant of DeepMind's DeepThink system, which achieved gold medal performance at IMO 2025 using natural language reasoning. It is a strong baseline representing the current pinnacle of LLM-based informal mathematical reasoning.

These baselines are representative because they are leading LLMs from major AI labs, and some (like Gemini 2.5-Pro and DeepThink) have already demonstrated advanced mathematical reasoning and even self-verification capabilities. Comparing against them allows the authors to position DeepSeekMath-V2's performance relative to the current frontier.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate DeepSeekMath-V2's strong theorem-proving capabilities, especially when leveraging self-verification and scaled test-time compute.
6.1.1. One-Shot Generation
For one-shot generation, where the model generates a proof without iterative refinement, DeepSeekMath-V2 consistently outperforms competitor models. On the in-house CNML-level problems, DeepSeekMath-V2 shows superior performance across all categories: algebra, geometry, number theory, combinatorics, and inequality.
The following figure (Figure 1 from the original paper) illustrates the average proof scores:
[Figure 1: Average proof scores of DeepSeekMath-V2, GPT-5-Thinking-High, and Gemini 2.5-Pro on in-house CNML-level problems, by category.]
Analysis of Figure 1:
The bar chart clearly shows DeepSeekMath-V2 achieving higher average proof scores (as evaluated by its own final verifier) than GPT-5-Thinking-High and Gemini 2.5-Pro in every mathematical category. For instance, in Algebra, DeepSeekMath-V2 is significantly higher, and the difference is maintained across Geometry, Number Theory, Combinatorics, and Inequality. This indicates that DeepSeekMath-V2 has developed a more robust and generalized ability to generate correct and rigorous proofs across diverse mathematical domains in a single attempt. The consistency of this outperformance suggests that the training methodology has effectively instilled strong foundational theorem-proving skills.
6.1.2. Sequential Refinement with Self-Verification
For more challenging problems, DeepSeekMath-V2's sequential refinement process, guided by self-verification, significantly improves proof quality. The model is prompted to generate a proof with self-analysis, and then iteratively refines it by addressing identified issues.
The following figure (Figure 2 from the original paper) demonstrates proof quality improvements on IMO Shortlist 2024 problems:
[Figure 2: Pass@1 and Best@32 versus maximum sequential refinement iterations on IMO Shortlist 2024 problems.]
Analysis of Figure 2:
The line graph shows two metrics, Pass@1 and Best@32, as the maximum sequential iterations increase from 1 (no refinement) to 8.
- Pass@1 (Average Score): The Pass@1 score, representing the average quality of the final proof from each refinement thread, improves substantially as the number of sequential attempts increases. It starts relatively low at iteration 1 (no refinement) and steadily rises, indicating that self-verification effectively guides the iterative improvement process and that repeated self-correction leads to higher average proof quality.
- Best@32 (Self-Selected Best Proof): The Best@32 score, the verifier score of the best proof per problem (selected by the model's own self-assigned scores across 32 threads), is consistently and significantly higher than Pass@1 at all iteration counts. This highlights DeepSeekMath-V2's ability to accurately assess its own proof quality and reliably differentiate high-quality from flawed proofs. The gap between Best@32 and Pass@1 also grows, suggesting that while average quality improves, the model's capacity to find and select exceptionally good proofs also benefits from more iterations.

These results confirm that the self-verification mechanism is not just a theoretical concept but an effective strategy for systematically improving mathematical reasoning, particularly for complex problems where a single-shot solution is insufficient.
6.1.3. High-Compute Search
To tackle the most challenging problems from prestigious competitions, DeepSeekMath-V2 employs a high-compute search strategy that scales both verification and generation compute. This involves maintaining a pool of candidate proofs, iteratively refining them based on verifier feedback, and exploring diverse proof strategies in parallel. The process continues until a proof passes numerous verification attempts (64 in this case), indicating high confidence in its correctness.
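A minimal sketch of this search loop, assuming hypothetical `generator.prove`, `generator.refine`, and `verifier.analyze` interfaces; the 64-proof pool, 16 refinement iterations, and 64-pass acceptance threshold echo the budgets quoted in this review.

```python
def high_compute_search(problem, generator, verifier, pool_size=64,
                        max_refinements=16, required_passes=64):
    """Maintain a pool of candidate proofs, refine each under verifier
    feedback, and accept a proof only once no independent verification
    attempt finds an issue."""
    pool = [generator.prove(problem) for _ in range(pool_size)]
    for _ in range(max_refinements):
        for i, proof in enumerate(pool):
            analyses = [verifier.analyze(problem, proof)
                        for _ in range(required_passes)]
            if all(a.score == 1.0 for a in analyses):
                return proof  # high confidence: passed every verification attempt
            # Otherwise, refine against the issues from the strictest analysis.
            worst = min(analyses, key=lambda a: a.score)
            pool[i] = generator.refine(problem, proof, worst.issues)
    return None  # budget exhausted without a fully verified proof
```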
The following are the results from Table 1 of the original paper:
| Contest | Problems (full or partial credit) | Score |
|---|---|---|
| IMO 2025 | P1, P2, P3, P4, P5 | 83.3% |
| CMO 2024 | P1, P2, P4, P5, P6 | 73.8% |
| Putnam 2024 | A1~B4, B5, B6 | 98.3% |
Analysis of Table 1:
This table presents expert evaluation results for DeepSeekMath-V2 on three major mathematical competitions.
- IMO 2025 (International Mathematical Olympiad): The model solved 5 out of 6 problems (P1, P2, P3, P4, P5), achieving a score of 83.3%. The original paper notes that "Problems in gray are fully solved, while underlined problems received partial credit." With all listed problems (P1-P5) fully solved, this is a gold-medal performance.
- CMO 2024 (China Mathematical Olympiad): The model solved 4 problems (P1, P2, P4, P5) completely and received partial credit on P6, accumulating 73.8% of the total points. This also translates to a gold-medal performance in a pinnacle high-school competition.
- Putnam 2024 (William Lowell Putnam Competition): The model achieved an outstanding score of 118/120 (98.3%), solving 11 of 12 problems completely (A1~B4, B5) and the remaining problem (B6) with minor errors. This score significantly surpasses the highest human score of 90, demonstrating near-perfect performance on a preeminent undergraduate competition.

These results are a strong testament to the effectiveness of DeepSeekMath-V2's approach, particularly the high-compute search guided by its verifier. The ability to solve such a high percentage of problems and achieve gold-level scores, especially on Putnam, suggests the model can handle complex, multi-step reasoning and rigorous derivation at a level comparable to or exceeding top human competitors.
The following figure (Figure 3 from the original paper) shows results on IMO-ProofBench:
[Figure 3: Scores of DeepSeekMath-V2 and baselines on the basic and advanced subsets of IMO-ProofBench.]
Analysis of Figure 3:
This bar chart compares DeepSeekMath-V2 with various baselines on the Basic and Advanced subsets of IMO-ProofBench.
- Basic Set: DeepSeekMath-V2 substantially outperforms DeepMind's DeepThink (IMO Gold) and all other baselines on the basic set, indicating strong command of foundational to medium-difficulty IMO-level problems.
- Advanced Set: On the advanced set, which consists of the hardest IMO-level problems, DeepSeekMath-V2 remains competitive with DeepThink, outperforming GPT-5-Thinking-High, Gemini 2.5-Pro, and others by a significant margin. Being competitive with DeepThink on the advanced set while outperforming it on the basic set highlights its overall strength.

The paper notes that even with this advanced approach, "the hardest IMO-level problems remain challenging for our model." However, for problems not fully solved, the generator typically identifies the genuine issues in its proofs, and fully solved problems pass all 64 verification attempts. This confirms the reliability of the LLM-based verifiers in assessing complex proofs and highlights the power of scaled test-time compute under verifier guidance for solving problems that would require hours of human effort.
6.2. Ablation Studies / Parameter Analysis
The paper does not explicitly present a dedicated "Ablation Studies" section with separate experiments comparing different components of the model. However, the methodology section implicitly discusses the impact of certain components:
- Meta-Verification: The paper reports that, on a validation split of $\mathcal{D}_v$, the average quality score of the verifier's proof analyses, as evaluated by the meta-verifier, improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction. This demonstrates the effectiveness of meta-verification in improving the faithfulness of the verifier's issue identification and preventing hallucination of non-existent issues.
- Sequential Refinement with Self-Verification: Figure 2 visually demonstrates the impact of sequential refinement on proof quality. The Pass@1 metric, showing average proof quality, improves substantially as the number of sequential attempts increases. This implicitly acts as an ablation on the number of refinement steps, showing that more self-verification iterations directly lead to better outcomes.
- Automated Labeling Process: The paper mentions that "In our last two training iterations, this fully automated pipeline replaced human annotation entirely. Quality checks confirmed that the automated labels aligned well with expert judgments." This confirms the efficacy and reliability of the automated labeling process, a critical component for scaling verifier improvement without human intervention.
- Reward Function Parameters ($\alpha$, $\beta$): The paper specifies weighting coefficients for the generator's combined self-verification reward. While no explicit ablation is shown for these values, setting $\alpha > \beta$ reflects the authors' empirical decision to prioritize actual proof quality ($R_Y$) while still giving significant weight to self-analysis quality ($R_Z$).

While not presented as formal ablation tables, these observations confirm the positive impact of the key technical contributions on the model's overall performance and capabilities.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully presents DeepSeekMath-V2, a novel Large Language Model (LLM) capable of both generating and verifying mathematical proofs in natural language. The core innovation lies in moving beyond the limitations of final-answer-based rewards by introducing self-verifiable mathematical reasoning. Key contributions include:
- Training an accurate and faithful LLM-based verifier for mathematical proofs.
- Employing meta-verification to significantly reduce hallucinated issues and ensure the quality and faithfulness of the verifier's analyses.
- Incentivizing the proof generator to maximize proof quality through rigorous self-verification, identifying and resolving issues in its own reasoning before finalizing outputs.
- Developing an automated labeling process that leverages scaled verification compute to improve the verifier by automatically labeling increasingly hard-to-verify proofs, creating a sustainable, self-improving cycle without constant human annotation.

Empirically, DeepSeekMath-V2 demonstrates exceptional performance on challenging competitive mathematics problems. With scaled test-time compute, it achieved gold-medal scores in high-school competitions like IMO 2025 and CMO 2024, and a near-perfect score of 118/120 on the undergraduate Putnam 2024 competition, surpassing top human performance. This work establishes that LLMs can develop meaningful self-evaluation abilities for complex reasoning tasks, marking a significant step towards more reliable and autonomous AI systems in mathematics.
7.2. Limitations & Future Work
The authors acknowledge several limitations and areas for future work:
- Remaining Challenges in Hardest Problems: Despite its strong performance, the paper explicitly states that "the hardest IMO-level problems remain challenging for our model." This indicates that there is still a gap to close for the most complex mathematical proofs, requiring further advancements in reasoning capabilities.
- Efficiency of High-Compute Search: While scaled test-time compute enabled the model to solve highly challenging problems, the computational cost of this extensive search (e.g., 64 proof samples with 64 verification analyses each, up to 16 refinement iterations) can be substantial. Future work could focus on making this search process more efficient.
- Integration with Formal Reasoning: The paper highlights the potential synergy between natural language theorem proving and formal reasoning. It notes that existing formal provers (like DeepSeek-Prover-V2 and Seed-Prover) achieved impressive results without specifically optimizing their informal reasoning components for theorem proving. This suggests a future direction of leveraging advanced natural language theorem proving to further benefit and guide formal reasoning systems, aiming for reliable mathematical AI that combines informal insights with formal guarantees.
- Generalizability of Self-Verification: While DeepSeekMath-V2 demonstrates self-verifiability in mathematics, the generalizability of this approach to other complex reasoning domains (e.g., scientific discovery, engineering design) is an important avenue for future research.
7.3. Personal Insights & Critique
This paper presents a highly impactful advancement in LLM capabilities for mathematical reasoning. The shift from final-answer rewards to self-verifiable reasoning is a critical conceptual leap, addressing a fundamental limitation of previous RL approaches. The ability of an LLM to critically evaluate its own logical steps and iterative refinements, rather than just matching a final output, mimics human expert reasoning and problem-solving much more closely.
Inspirations:
- Synergistic Learning Loop: The iterative generation-verification cycle is particularly inspiring. It demonstrates how two LLM components (generator and verifier) can be trained to mutually improve, creating a self-sustaining learning ecosystem. This paradigm could transfer to other domains requiring complex, verifiable outputs, such as scientific hypothesis generation, code debugging, or legal argument construction.
- Automated Labeling: The automated labeling process is a practical and crucial innovation. Data annotation is often a bottleneck in RL training for complex tasks. Automating the creation of hard-to-verify training examples for the verifier is a clever way to scale data generation beyond human expert capacity, suggesting a path toward more autonomous AI development in specialized fields.
- Meta-Verification for Trustworthiness: Introducing meta-verification to address hallucination in the verifier is a powerful idea. It highlights the importance of not just accuracy but also faithfulness and explainability in AI systems, especially when they are meant to provide critical feedback. This concept could help ensure the reliability of AI-generated explanations or rationales in other sensitive applications.
Potential Issues or Areas for Improvement:
- Computational Cost: While effective, the high-compute search strategy is computationally intensive. The paper doesn't detail the exact compute resources (GPUs, time) required for training and inference, which would help assess the practicality of deploying such systems widely. Future work could explore more sample-efficient RL algorithms or more intelligent search strategies to reduce this cost.
- Interpretability of Verifier's Issues: While the verifier identifies issues, understanding why certain issues are flagged or why certain proofs are flawed may still require human expertise. Enhancing the explainability of the verifier's analysis, beyond summarizing issues, could further empower human users and help diagnose model failures more effectively.
- Generalization to Novel Mathematical Concepts: The training relies heavily on competitive math problems, which, while diverse, might not cover entirely novel mathematical concepts or structures that a human researcher would encounter. Scaling DeepSeekMath-V2 to truly research-level mathematics (discovering new theorems and frameworks, not just solving known problems) remains an open challenge; the current benchmarks test problem-solving within established mathematical fields.
- Formalization Gap: While the paper argues that advancing natural language theorem proving will benefit formal reasoning, the direct path from a natural language proof (even a self-verified one) to a mechanically verifiable formal proof is still a complex research area. DeepSeekMath-V2 provides excellent informal insights, but bridging this formalization gap remains an open problem.