
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Published: 06/02/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study examines reinforcement learning with verifiable rewards (RLVR), breaking down the learning signal into Positive Sample Reinforcement (PSR) and Negative Sample Reinforcement (NSR). It finds that training solely with negative samples enhances model performance and diversity across the full Pass@k spectrum, often matching or surpassing PPO and GRPO.

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B, Qwen3-4B and Llama-3.1-8B-Instruct on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective: it consistently improves performance over the base model across the entire Pass@k spectrum (k up to 256), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@1 but degrades performance at higher k, due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines the model's existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@k performance on MATH, AIME 2025, and AMC23. Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.


In-depth Reading


1. Bibliographic Information

1.1. Title

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

1.2. Authors

Xinyu Zhu (University of Virginia), Mengzhou Xia (Princeton University), Zhepei Wei (University of Virginia), Wei-Lin Chen (University of Virginia), Danqi Chen (Princeton University), and Yu Meng (University of Virginia).

1.3. Journal/Conference

This paper was published as a preprint on arXiv (version 2) on June 2, 2025. It comes from prominent research groups at the University of Virginia and Princeton Language and Intelligence (PLI), which are highly influential in the field of natural language processing (NLP).

1.4. Publication Year

2025

1.5. Abstract

The paper investigates reinforcement learning with verifiable rewards (RLVR), a popular method for training large language models (LLMs) on reasoning tasks. RLVR updates models using both correct and incorrect samples. The authors decompose this signal into Positive Sample Reinforcement (PSR)—reinforcing correct answers—and Negative Sample Reinforcement (NSR)—penalizing incorrect ones. Surprisingly, they find that training solely with negative samples (NSR) consistently improves performance across various metrics and often matches or beats established algorithms like PPO and GRPO. Conversely, PSR improves immediate accuracy (Pass@1) but kills output diversity, hurting performance when the model is allowed multiple attempts (Pass@k). The authors provide a token-level gradient analysis to explain why NSR preserves the model's prior knowledge and propose Weighted-REINFORCE, a method that upweights negative signals to achieve better reasoning performance.

2. Executive Summary

2.1. Background & Motivation

The current state-of-the-art in LLM reasoning (like OpenAI's o1 or DeepSeek-R1) relies heavily on reinforcement learning with verifiable rewards (RLVR). In domains like math or coding, we can verify whether an answer is correct using a script. Unlike human feedback, which is subjective and expensive, verifiable rewards provide a clear binary signal: +1 for correct and -1 for incorrect.

While RLVR is powerful, researchers didn't fully understand why it works so well. Specifically, it wasn't clear how the model balances learning from its successes versus learning from its failures. The core challenge is that training often leads to "diversity collapse," where the model becomes overconfident in one specific way of solving a problem and loses the ability to explore alternative paths.

2.2. Main Contributions / Findings

  1. Decomposition of RLVR: The authors separated the standard RL objective into two parts: PSR (positive) and NSR (negative).

  2. Surprising Power of Negative Reinforcement: They discovered that training a model only on its mistakes (NSR) is remarkably effective. It improves the model's accuracy without destroying its ability to think in diverse ways.

  3. The "Diversity vs. Precision" Trade-off: They showed that reinforcing correct answers (PSR) makes the model "sharper" (better at a single guess) but "narrower" (worse at finding any correct answer in 256 attempts).

  4. Gradient Analysis: They provided a mathematical explanation showing that NSR "pushes away" from wrong answers and redistributes that probability to other plausible candidates based on the model's existing knowledge.

  5. Weighted-REINFORCE: They proposed a simple tweak—down-weighting the positive rewards—that outperforms complex state-of-the-art algorithms like PPO and GRPO.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice needs to be familiar with several core concepts:

  • Reinforcement Learning (RL): A machine learning paradigm where an agent (the LLM) learns to make decisions by performing actions (generating text) and receiving rewards.
  • Verifiable Rewards: In math, we don't need a human to grade the model; we can just check if the final number matches the answer key. This is a "verifiable reward."
  • Chain-of-Thought (CoT): A technique where the model explains its reasoning step-by-step before giving a final answer. RLVR is often used to encourage longer, more rigorous CoT.
  • Policy (π): In this context, the "policy" is the LLM itself: the strategy the model uses to pick the next token.
  • Logits and Softmax: LLMs output "logits" (raw scores) for every possible word (token). The softmax function turns these scores into probabilities that sum to 1 (see the short snippet after this list).
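To make the last bullet concrete, here is a minimal sketch (not from the paper) of how a softmax turns a toy vector of logits into a next-token distribution; the numbers are invented for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw logits into a probability distribution over tokens."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Toy logits for four candidate next tokens
logits = np.array([2.0, 1.0, 0.5, -1.0])
probs = softmax(logits)
print(probs.round(3), probs.sum())      # probabilities sum to 1
```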

3.2. Previous Works

The authors build on several established RL algorithms:

  • REINFORCE: The classic RL algorithm. It increases the probability of actions that lead to high rewards.
  • PPO (Proximal Policy Optimization): A more stable version of RL that ensures the model doesn't change too much in a single update.
  • GRPO (Group Relative Policy Optimization): A newer algorithm (popularized by DeepSeek) that computes advantages by comparing each output's reward against a group of outputs for the same prompt, removing the need for a separate "critic" model (a simplified sketch follows this list).
  • Entropy: In information theory, entropy measures the randomness or diversity of a distribution. High entropy means the model is considering many options; low entropy means it is focused on just one.
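As a rough illustration of the GRPO bullet above, group-relative advantages can be computed by normalizing the verifiable rewards within a group of responses to the same prompt. This is a simplified sketch under our own assumptions; it omits GRPO's clipping and KL-penalty terms.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Simplified GRPO-style advantages: center and scale each reward by the
    group's mean and standard deviation, so no learned critic is required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Eight sampled responses to one prompt with verifiable rewards (+1 correct, -1 wrong)
print(group_relative_advantages([1, -1, -1, 1, -1, -1, -1, 1]).round(2))
```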

3.3. Technological Evolution

Early LLMs were trained mostly via Supervised Fine-Tuning (SFT), where they were given correct examples to imitate. Then came RLHF (Reinforcement Learning from Human Feedback), which aligned models with human preferences. Recently, the field shifted toward RLVR, where models "self-improve" by solving problems and checking their own answers. This paper represents a move toward understanding the internal dynamics of this self-improvement.

3.4. Differentiation Analysis

Unlike previous work that treats the RL reward as a single signal, this paper decomposes the signal. The core innovation is identifying that the negative component of the reward is actually the primary driver of performance in high-compute (high-diversity) scenarios.


4. Methodology

4.1. Principles

The core idea is that the standard RLVR objective is actually the sum of two different behaviors: "do more of the good stuff" (PSR) and "do less of the bad stuff" (NSR). By splitting them, the authors can study their individual effects on the model's "brain" (the weights/parameters).

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. The RLVR Objective and Decomposition

The standard RLVR objective aims to minimize the negative expected reward. Given a model with parameters θ, a prompt x, and a response y, the loss function L_RLVR(θ) is defined as:

$ \mathcal{L}_{RLVR}(\theta) = - \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot|x)} \left[ r(x, y) \right], \quad r(x, y) \in \{-1, +1\} $

Where:

  • D is the dataset of prompts.

  • π_θ(y|x) is the probability the model generates y given x.

  • r(x, y) is the reward (+1 if correct, -1 if wrong).

The authors decompose this into two sub-objectives:

  1. Positive Sample Reinforcement (PSR): $ \mathcal{L}_{PSR}(\theta) = - \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{y: r(x, y) = 1} \pi_{\theta}(y|x) \right] $ This term only looks at correct answers and tries to make them more likely.
  2. Negative Sample Reinforcement (NSR): $ \mathcal{L}_{NSR}(\theta) = - \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{y: r(x, y) = -1} -\pi_{\theta}(y|x) \right] $ This term only looks at wrong answers and tries to make them less likely.

The total loss is simply: $ \mathcal{L}_{RLVR}(\theta) = \mathcal{L}_{PSR}(\theta) + \mathcal{L}_{NSR}(\theta) $.
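The decomposition can also be written as a REINFORCE-style surrogate loss over sampled responses. The sketch below is an illustration under our own assumptions (sequence-level log-probabilities, no baseline, no KL or clipping terms), not the authors' training code.

```python
import torch

def rlvr_loss_decomposed(seq_logprobs: torch.Tensor, rewards: torch.Tensor):
    """PSR/NSR decomposition of the RLVR policy-gradient surrogate.

    seq_logprobs: (batch,) log pi_theta(y|x), i.e. summed token log-probs per response
    rewards:      (batch,) verifiable rewards in {+1, -1}
    """
    pos = (rewards > 0).float()                 # correct responses
    neg = (rewards < 0).float()                 # incorrect responses

    loss_psr = -(pos * seq_logprobs).mean()     # PSR: raise likelihood of correct samples
    loss_nsr = (neg * seq_logprobs).mean()      # NSR: lower likelihood of incorrect samples
    return loss_psr, loss_nsr                   # standard RLVR optimizes loss_psr + loss_nsr
```

Training with only loss_psr or only loss_nsr corresponds to the PSR-only and NSR-only settings studied in the paper.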

The following figure (Figure 1 from the paper) illustrates this decomposition:

Figure 1: Decomposing learning signals in RLVR into positive and negative reward components. Positive Sample Reinforcement (PSR) increases the likelihood of correct responses and improves Pass@1, but reduces output diversity and hurts Pass@k for large k. Negative Sample Reinforcement (NSR) discourages wrong responses and redistributes probability mass according to the model's prior knowledge, improving the full Pass@k spectrum.

4.2.2. Token-Level Gradient Analysis

To understand how these updates change the model's behavior, the authors look at the gradient, i.e., the direction in which each internal score ("logit") should move. Let z_v be the logit for token v.

For PSR, the update direction for the logit z_v, given that token y_t was sampled, is:

$ -\frac{\partial \mathcal{L}_{PSR}}{\partial z_v} \propto \begin{cases} \pi_v \cdot (1 - \pi_v) & \text{if } v = y_t \text{ (sampled token)} \\ -\pi_{y_t} \cdot \pi_v & \text{if } v \neq y_t \text{ (unsampled tokens)} \end{cases} $

This tells the model: "Make the correct token much more likely and push everything else down." This leads to sharpening, which reduces diversity.

For NSR, the update direction is:

$ -\frac{\partial \mathcal{L}_{NSR}}{\partial z_v} \propto \begin{cases} -\pi_v \cdot (1 - \pi_v) & \text{if } v = y_t \text{ (sampled token)} \\ \pi_{y_t} \cdot \pi_v & \text{if } v \neq y_t \text{ (unsampled tokens)} \end{cases} $

This tells the model: "Make the wrong token less likely, and redistribute that probability to all other tokens based on their current likelihood." This preserves the model's prior knowledge because it doesn't pick a new favorite; it just removes a bad option.

This mechanism is visualized in Figure 6:

Figure 6: Gradient dynamics of PSR and NSR under a math word problem example. Bars indicate token logits before the update, and arrows indicate gradient direction and magnitude. Left (PSR): The model generates the correct response ("7") and receives a +1 reward. Gradients increase the logit of the sampled token while suppressing all others, including potentially correct alternatives like "2", resulting in a sharpened, overconfident distribution. Right (NSR): The model generates an incorrect response ("6") and receives a -1 reward. Gradients demote the sampled incorrect token and proportionally reallocate logits to other tokens (e.g., "7") based on their current probabilities, thereby promoting exploration of alternative correct paths and preserving diversity.
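The NSR case of the gradient formula above can be checked numerically with autograd on a toy distribution; the logits below are invented for illustration.

```python
import torch

# Toy next-token distribution over five candidate tokens
logits = torch.tensor([2.0, 1.5, 0.5, 0.0, -1.0], requires_grad=True)
probs = torch.softmax(logits, dim=-1)

wrong = 0                        # suppose token 0 was sampled in an incorrect response
loss_nsr = probs[wrong]          # NSR minimizes the probability of the sampled wrong token
loss_nsr.backward()

update = -logits.grad            # gradient-descent direction on the logits
print("probs before:", probs.detach().numpy().round(3))
print("logit update:", update.numpy().round(3))
# Token 0's logit drops by pi_0 * (1 - pi_0); every other logit rises by pi_0 * pi_v,
# i.e. in proportion to its current probability, matching the NSR case above.
```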

4.2.3. Weighted-REINFORCE

Based on their findings, the authors propose a new objective that balances these two signals by scaling down the positive signal with a coefficient λ\lambda:

$ \mathcal{L}_{W\text{-}REINFORCE}(\theta) = \lambda \cdot \mathcal{L}_{PSR}(\theta) + \mathcal{L}_{NSR}(\theta) $

They found that a small λ (such as 0.1) works best, as it prevents the model from becoming overconfident and collapsing its output diversity.
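In code, the objective is a small change to the decomposition sketched earlier; again this is an illustrative sketch under our own assumptions rather than the released implementation.

```python
import torch

def weighted_reinforce_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor, lam: float = 0.1):
    """Weighted-REINFORCE surrogate: keep NSR at full strength, scale PSR by lam.
    lam = 1.0 recovers vanilla REINFORCE; the paper reports lam around 0.1 works well."""
    pos = (rewards > 0).float()
    neg = (rewards < 0).float()
    loss_psr = -(pos * seq_logprobs).mean()   # reinforce correct responses (down-weighted)
    loss_nsr = (neg * seq_logprobs).mean()    # penalize incorrect responses (full weight)
    return lam * loss_psr + loss_nsr
```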


5. Experimental Setup

5.1. Datasets

The authors used three main mathematical reasoning benchmarks:

  1. MATH: 7,500 problems ranging from algebra to calculus.
  2. AIME 2025: Problems from the American Invitational Mathematics Examination, known for being extremely difficult.
  3. AMC23: American Mathematics Competitions (2023 version).

Example of data:

  • Prompt: "Alice has 3 apples. She buys 2 more and gives 1 to her friend. How many does she have now?"
  • Response (CoT): "Alice starts with 3. She gets 2 more, so 3+2=5. She gives 1 away, so 5-1=4. The answer is \boxed{4}." (A sketch of the corresponding reward check follows below.)
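The reward check referenced above can be sketched as a simple string comparison against the boxed final answer. Real graders also normalize equivalent expressions (fractions, units, formatting), so treat this as a hypothetical minimal version, not the paper's actual verifier.

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> int:
    """Return +1 if the boxed final answer in the response matches the reference, else -1."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    predicted = match.group(1).strip() if match else ""
    return 1 if predicted == gold_answer.strip() else -1

response = r"She gets 2 more, so 3+2=5. She gives 1 away, so 5-1=4. The answer is \boxed{4}."
print(verifiable_reward(response, "4"))   # +1
```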

5.2. Evaluation Metrics

The primary metric is Pass@k.

  1. Conceptual Definition: Pass@k measures the probability that a model will produce at least one correct answer when it is allowed to generate k independent attempts for the same problem.
  2. Mathematical Formula (implemented in the short sketch after this list): $ \mathrm{Pass@k} = \mathbb{E}_{x \sim \mathcal{D}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] $
  3. Symbol Explanation:
    • n: The total number of samples generated per problem (usually 256 in this paper).
    • c: The number of correct samples among those n samples.
    • k: The number of attempts being evaluated (e.g., if k=1, it is the accuracy of a single guess).
    • \binom{n}{k}: The binomial coefficient ("n choose k").
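A common way to implement this estimator is shown below; this is a sketch following the formula above, not necessarily the authors' evaluation script, and the example counts are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k attempts, drawn from
    n generations containing c correct ones, is correct."""
    if n - c < k:
        return 1.0                      # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 256 generations per problem, 40 of them correct
print(round(pass_at_k(256, 40, 1), 3))    # 0.156, the single-guess accuracy
print(round(pass_at_k(256, 40, 16), 3))   # roughly 0.94 with 16 attempts
```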

5.3. Baselines

The authors compared their methods against:

  • Base Models: Qwen2.5-Math-7B, Qwen3-4B, and Llama-3.1-8B-Instruct.

  • PPO: Standard Proximal Policy Optimization.

  • GRPO: Group Relative Policy Optimization.

  • Vanilla REINFORCE: Standard RL without weighting.


6. Results & Analysis

6.1. Core Results Analysis

The most striking result is found in Figures 2 and 3. While PSR (positive only) improves Pass@1 (greedy accuracy), its performance drops significantly as k increases, eventually falling below the base model. This means PSR makes the model "stuck" in a few patterns.

In contrast, NSR (negative only) improves performance across the entire spectrum. It makes the model smarter at a single guess and better at finding answers when given many tries.

The following are the results from Table 1 of the original paper, showing Pass@k on various benchmarks:

Method           Pass@1   @2     @4     @8     @16    @32    @64    @128   @256

MATH
Base Model       63.2     76.0   83.7   88.4   91.6   93.7   95.2   96.2   96.9
PPO              76.6     82.6   86.7   89.6   91.7   93.4   94.7   95.6   96.3
GRPO             76.3     81.7   85.6   88.4   90.6   92.3   93.6   94.7   95.5
NSR              75.7     82.4   86.9   90.1   92.4   94.1   95.3   96.2   96.9
W-REINFORCE      76.6     82.8   87.1   90.2   92.4   94.1   95.3   96.1   96.7

AIME 2025
Base Model       6.1      9.7    13.8   17.9   22.2   26.5   30.8   36.6   46.7
NSR              10.0     14.6   19.2   24.1   29.3   34.6   40.2   46.0   53.3
W-REINFORCE      10.6     15.3   20.0   24.7   29.7   34.6   40.5   47.8   56.7

6.2. Ablation Studies / Parameter Analysis

The authors tested the sensitivity of the λ parameter in Weighted-REINFORCE.

The following are the results from Table 4 of the original paper:

Pass@k (MATH)
λ                @1       @2     @4     @8     @16    @32    @64    @128   @256
0 (NSR)          75.7     82.4   86.9   90.1   92.4   94.1   95.3   96.2   96.9
0.05             75.6     82.2   86.7   89.9   92.2   94.0   95.3   96.3   97.1
0.1              76.6     82.8   87.1   90.2   92.4   94.1   95.3   96.1   96.7
1 (REINFORCE)    74.8     79.1   82.4   84.9   86.9   88.5   89.9   91.1   92.0

Key takeaway: When λ = 1 (standard RL), the model's Pass@256 (92.0) is actually worse than the base model's (96.9). Using a small λ (0.1) preserves the base model's diversity while gaining the accuracy of RL.


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper reveals that negative reinforcement is a hidden hero in training reasoning models. By simply penalizing mistakes and letting the model's natural intelligence fill the gaps, we can achieve high accuracy without the "diversity collapse" that plagues standard RL. Their proposed Weighted-REINFORCE is a simple, robust alternative to more complex algorithms.

7.2. Limitations & Future Work

  • Stability: NSR can become unstable if trained for too many steps without any positive reinforcement.
  • Sparse Rewards: This method relies on binary (+1/-1) rewards. It’s unclear if it works for "dense" rewards (where you give partial credit for steps).
  • Latent Capabilities: The paper suggests RL doesn't necessarily teach new reasoning but helps the model access what it already knows more effectively.

7.3. Personal Insights & Critique

This paper is a refreshing "back to basics" look at RL. While the industry is rushing toward increasingly complex algorithms like PPO with many moving parts (critics, reference models, KL-penalties), this paper shows that a properly weighted version of the 30-year-old REINFORCE algorithm can actually perform better.

The insight that "telling a model what NOT to do" is more effective than "telling it exactly what to do" mirrors human learning in some ways. When a student is already smart (a strong "base model"), they often just need to be corrected on their errors rather than being forced to repeat a single "correct" solution. This allows the student to maintain their own creative problem-solving styles.
