
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models


TL;DR Summary

This study examines the relationship between knowledge localization and model editing in language models. It shows that the best place to edit a fact can differ from where Causal Tracing says the fact is stored, challenging assumptions behind prior editing methods. Ultimately, the choice of editing layer predicts edit success far better than any localization signal.

Abstract

Language models learn a great quantity of factual information during pretraining, and recent work localizes this information to specific model weights like mid-layer MLP weights. In this paper, we find that we can change how a fact is stored in a model by editing weights that are in a different location than where existing methods suggest that the fact is stored. This is surprising because we would expect that localizing facts to specific model parameters would tell us where to manipulate knowledge in models, and this assumption has motivated past work on model editing methods. Specifically, we show that localization conclusions from representation denoising (also known as Causal Tracing) do not provide any insight into which model MLP layer would be best to edit in order to override an existing stored fact with a new one. This finding raises questions about how past work relies on Causal Tracing to select which model layers to edit. Next, we consider several variants of the editing problem, including erasing and amplifying facts. For one of our editing problems, editing performance does relate to localization results from representation denoising, but we find that which layer we edit is a far better predictor of performance. Our results suggest, counterintuitively, that better mechanistic understanding of how pretrained language models work may not always translate to insights about how to best change their behavior. Our code is available at https://github.com/google/belief-localization

In-depth Reading

1. Bibliographic Information

1.1. Title

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

1.2. Authors

  • Peter Hase (UNC Chapel Hill, Google Research)
  • Mohit Bansal (UNC Chapel Hill)
  • Been Kim (Google Research)
  • Asma Ghandeharioun (Google Research)

1.3. Journal/Conference

The paper first appeared on arXiv in January 2023 and was subsequently published at NeurIPS 2023 (Conference on Neural Information Processing Systems), one of the top-tier, highly prestigious venues in machine learning, known for publishing cutting-edge research in representation learning.

1.4. Publication Year

2023

1.5. Abstract

This paper investigates the relationship between mechanistic localization (identifying where specific knowledge is stored in a model) and model editing (modifying model weights to update or inject knowledge). The authors challenge the prevailing assumption that the best place to edit a model is where the knowledge is "localized." Using a method called Causal Tracing, they find that localization results are statistically uncorrelated with the success of editing methods like ROME. They introduce several variants of the editing problem (e.g., Fact Forcing, Fact Erasure) to find a connection, but consistently find that the choice of layer to edit is a far better predictor of success than localization signals. The conclusion is a counter-intuitive finding: understanding where a model stores a fact does not necessarily tell us where to edit it to change the model's behavior.

2. Executive Summary

2.1. Background & Motivation

Large Language Models (LLMs) like GPT-J store vast amounts of factual knowledge (e.g., "Paris is located in France") within their parameters. As the world changes or errors are discovered, we need efficient ways to update this knowledge without retraining the entire model. This has led to the development of Model Editing techniques.

A logical assumption driving this field has been: "To change a fact, we should first find where it is stored, and then edit those specific weights." This process of finding the storage location is called Localization. Recent influential methods, such as ROME (Rank-One Model Editing), rely on a localization technique called Causal Tracing to identify "early-to-mid-layer MLP weights" as the storage site for factual associations and target these layers for editing.

However, the authors of this paper noticed a potential gap: While these editing methods work, does their success actually depend on the localization being correct? Or is the connection between "where knowledge is stored" and "where to edit" illusory?

2.2. Main Contributions & Findings

  1. Disconnection between Localization and Editing: The authors provide rigorous empirical evidence that edit success is essentially unrelated to localization results. Ideally, editing a layer identified by Causal Tracing as "important" should yield better results. In reality, the correlation is near zero (and sometimes negative).
  2. Robustness of Findings: This lack of correlation holds true across different models (GPT-J, GPT2-XL), datasets (CounterFact, ZSRE), and editing metrics.
  3. New Task Variants: To try and rescue the connection, the authors introduce four new editing task variants: Tracing Reversal, Fact Erasure, Fact Amplification, and Fact Forcing.
  4. The "Layer" Predictor: They find that simply picking a specific layer (e.g., always layer 5) is a much stronger predictor of editing success than looking at the Causal Tracing score for a specific data point.
  5. Implications: The results suggest that mechanistic interpretability (understanding how the model works) and model control (changing how the model behaves) might be answering fundamentally different questions.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Autoregressive Language Models & MLPs

Models like GPT-J are autoregressive Transformers. They predict the next token in a sequence. Internally, they consist of layers. Each layer has two main sub-components:

  • Self-Attention: Helps tokens look at other tokens for context.
  • Multi-Layer Perceptron (MLP): A feed-forward network within the transformer block. Current research suggests these MLPs act as "Key-Value Memories" where factual knowledge (like "Paris" -> "France") is stored (a minimal sketch follows below).
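To make the "Key-Value Memories" picture concrete, here is a minimal, illustrative sketch of a GPT-style MLP block. It is our own example, not code from the paper; the class name and toy dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

class TransformerMLP(nn.Module):
    """Minimal GPT-style MLP block, annotated with the 'key-value memory' reading."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_hidden)   # each row acts as a "key" matched against the hidden state
        self.fc_out = nn.Linear(d_hidden, d_model)  # each column acts as a "value" written back to the residual stream
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # A hidden unit fires to the extent its key matches h, and its activation
        # scales how strongly the corresponding value vector is added to the output.
        return self.fc_out(self.act(self.fc_in(h)))

mlp = TransformerMLP(d_model=64, d_hidden=256)   # toy sizes; GPT-J uses 4096 / 16384
out = mlp(torch.randn(1, 5, 64))                 # (batch, tokens, d_model)
```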

3.1.2. Causal Tracing (Localization)

Causal Tracing is a technique to find where information is located during a model's inference process.

  • Concept: It runs the model with a corrupted input (e.g., adding noise to the subject "Paris") so the model can't predict "France". Then, it systematically restores the internal states (activations) of specific layers from a "clean" run to see if restoring that specific state recovers the correct output "France".

  • Intuition: If restoring Layer 5's activation fixes the output, then Layer 5 contains crucial information about the fact.

    The following figure (Figure 2 from the original paper) visualizes Causal Tracing. It shows how restoring specific hidden states (y-axis layers) at specific token positions (x-axis) recovers the probability of the correct answer.

    Figure 2 (from the original paper): Causal Tracing results over MLP layers with window size 10. Tokens marked with an asterisk are the noised subject tokens; the y-axis shows token positions, the x-axis shows the center of the restored MLP-layer window, and color gives the recovered probability of the correct answer ("Seattle"). Here $p_\theta(o_{true} \mid s, r) = 0.923$ and $p_\theta(o_{true} \mid s_{noise}, r) = 0.001$.
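To make the corrupt-and-restore step concrete, below is a minimal, illustrative re-implementation using Hugging Face transformers and forward hooks. It is not the authors' code: the model choice (gpt2, for speed), the noise scale, and the hard-coded subject token positions are assumptions for the example; the module paths (model.transformer.wte, model.transformer.h[i]) follow HF GPT-2/GPT-J.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The Space Needle is located in the city of"
target_id = tok(" Seattle")["input_ids"][0]   # first token of the true object
subject_pos = [0, 1, 2]                       # assumed token indices of the subject; inspect tok(prompt) to confirm
layer, token_idx = 5, subject_pos[-1]         # which hidden state to restore (layer, token position)

# 1) Clean run: cache the hidden state at (layer, token_idx).
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    clean_states = model(**inputs, output_hidden_states=True).hidden_states
clean_state = clean_states[layer + 1][:, token_idx, :]   # hidden_states[0] is the embedding output

# 2) Corrupted run: noise the subject embeddings, but patch the one clean state back in.
def noise_hook(module, inp, out):
    out[:, subject_pos, :] += 0.1 * torch.randn_like(out[:, subject_pos, :])  # noise scale is an assumption
    return out

def restore_hook(module, inp, out):
    out[0][:, token_idx, :] = clean_state     # block outputs are tuples; [0] holds the hidden states
    return out

handles = [model.transformer.wte.register_forward_hook(noise_hook),
           model.transformer.h[layer].register_forward_hook(restore_hook)]
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
for h in handles:
    h.remove()

p_restored = torch.softmax(logits, dim=-1)[target_id].item()
# Compare p_restored to the clean-run and fully-noised probabilities to obtain the tracing effect.
```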

3.1.3. Model Editing (ROME)

Rank-One Model Editing (ROME) is a specific algorithm designed to update a specific fact in a model.

  • Goal: Change the model so that for the prompt built from $(s, r)$ (subject, relation), it outputs $o_{new}$ instead of $o_{old}$.
  • Mechanism: It treats the MLP weights as a linear memory. It computes a rank-one update to the weight matrix of a specific MLP layer to map the subject representation to the new object representation (a simplified sketch follows below).
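As a toy illustration of the rank-one idea, a single rank-one update can remap a chosen key vector $k_*$ (the subject representation at the MLP input) to a desired value vector $v_*$. This is a simplification, not ROME itself: ROME additionally uses a pre-computed covariance statistic $C$ so that other stored associations are minimally disturbed.

```python
import torch

def rank_one_edit(W: torch.Tensor, k_star: torch.Tensor, v_star: torch.Tensor) -> torch.Tensor:
    """Return W' such that W' @ k_star == v_star, changing W by a rank-one matrix.

    Simplified illustration only: ROME instead uses W' = W + Lambda (C^{-1} k_star)^T,
    where C estimates E[k k^T], to better preserve other stored associations.
    """
    residual = v_star - W @ k_star                          # what the current output is missing
    delta = torch.outer(residual, k_star) / (k_star @ k_star)
    return W + delta

W = torch.randn(8, 16)      # toy MLP output weight (d_out x d_in)
k = torch.randn(16)         # "key": subject representation at the MLP input
v = torch.randn(8)          # "value": representation that promotes the new object
W_new = rank_one_edit(W, k, v)
assert torch.allclose(W_new @ k, v, atol=1e-5)
```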

3.2. Previous Works

  • Meng et al. (2022) - ROME: This is the pivotal work this paper analyzes. Meng et al. proposed Causal Tracing to localize facts to early/mid MLP layers and then designed ROME to edit those specific layers. This paper challenges the link between the two.
  • Dai et al. (2022) - Knowledge Neurons: Another localization method focusing on individual neurons.
  • Zhu et al. (2020) - Constrained Finetuning: A baseline editing method that uses standard optimization (like Adam) to update weights while constraining them to stay close to the original weights to prevent breaking the model.

3.3. Differentiation Analysis

  • Unlike ROME/MEMIT papers: This paper is not proposing a new editing method. It is a scientific critique and analysis paper.
  • Focus: While previous work used editing to validate localization (i.e., "We found this neuron, we changed it, and the output changed, so our localization is correct"), this paper asks a broader quantitative question: "Is the strength of the localization signal predictive of the quality of the edit across a large dataset?"

4. Methodology

The authors' methodology revolves around a rigorous statistical comparison between localization signals (where Causal Tracing says a fact is stored) and editing performance (how well an edit at that location works).

4.1. Data Notation and Definitions

The authors define a fact as a tuple (s, r, o).

  • $s$: Subject (e.g., "Paris").

  • $r$: Relation (e.g., "is located in").

  • $o$: Object (e.g., "France").

    They define several variations for experimental purposes:

  • $s_{noise}$: The subject token embeddings with Gaussian noise added.

  • $o_{true}$: The original correct object (e.g., "France").

  • $o_{false}$: The target object for the edit (e.g., "Spain", creating a counterfactual).

  • $P$: The prompt constructed from $(s, r)$.

    The following figure (Figure 3 from the original paper) shows an example datapoint from the CounterFact dataset used in the study:

    Figure 3 (from the original paper): An example CounterFact datapoint, showing the input prompt ("The Autonomous University of Madrid, which is located in"), the requested edit (replace "Spain" with "Sweden"), a paraphrase of the prompt, and a neighborhood prompt about a nearby subject ("El Ripollès, located in").
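For concreteness, one such record can be pictured roughly as the dictionary below; the field names are illustrative rather than the dataset's exact schema.

```python
# Rough shape of one CounterFact-style record (field names are approximate, not the exact schema).
example_datapoint = {
    "subject": "Autonomous University of Madrid",
    "prompt": "The Autonomous University of Madrid, which is located in",
    "target_true": "Spain",                                    # o_true
    "target_new": "Sweden",                                    # o_false, the requested edit
    "paraphrase_prompts": [
        "The Autonomous University of Madrid is located in",   # same fact, different surface form
    ],
    "neighborhood_prompts": [
        "El Ripollès, located in",                             # nearby subject whose answer must not change
    ],
}
```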

4.2. Quantifying Localization: Causal Tracing Effect

To compare localization with editing, they first need a numerical score representing "how much information is in layer $\ell$." They use the Tracing Effect.

4.2.1. The Tracing Formula

The tracing effect estimates the information content of a representation vector $v_{(t,\ell)}$ (at token index $t$ and layer $\ell$) by measuring how much it restores the probability of the true object when the input is noisy.

The formula for the Tracing Effect at a specific layer is:

$$\text{Tracing Effect} = \frac{p_\theta(o_{true} \mid s_{noise}, r, v_{(t,\ell)}) - p_\theta(o_{true} \mid s_{noise}, r)}{p_\theta(o_{true} \mid s, r) - p_\theta(o_{true} \mid s_{noise}, r)}$$

Symbol Explanation:

  • $p_\theta$: The probability output by the model with parameters $\theta$.
  • $o_{true}$: The correct output token (e.g., "France").
  • $s_{noise}$: The subject tokens with added noise.
  • $v_{(t,\ell)}$: The clean activation vector from layer $\ell$ at token $t$, copied from a run with the original un-noised subject $s$.
  • $p_\theta(o_{true} \mid s_{noise}, r, v_{(t,\ell)})$: The probability of the true answer when the input is noisy, but the hidden state at $(t, \ell)$ has been manually restored to its clean value.
  • $p_\theta(o_{true} \mid s_{noise}, r)$: The baseline probability of the true answer on the noisy input (usually very low).
  • $p_\theta(o_{true} \mid s, r)$: The probability of the true answer on the clean input (usually high).

Interpretation:

  • The numerator measures the recovery: how much probability is gained by restoring this specific layer.

  • The denominator normalizes this by the maximum possible recovery (the difference between clean and noisy performance).

  • A value of 1 means restoring this layer fully recovers the factual knowledge. A value of 0 means it does nothing.

    For the analysis, since Causal Tracing produces a grid (layers $\times$ tokens), the authors take the maximum value across all tokens $t$ for a given layer $\ell$ to get a single scalar score per layer, as in the sketch below.
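A minimal sketch of collapsing a Causal Tracing grid into these per-layer scalars, assuming the restored-run probabilities have already been collected (variable names are ours):

```python
import numpy as np

def tracing_effect_per_layer(p_restored: np.ndarray,  # shape (n_tokens, n_layers): p(o_true | s_noise, r, v_(t,l))
                             p_noised: float,          # p(o_true | s_noise, r)
                             p_clean: float) -> np.ndarray:
    """Normalized tracing effect, then max over token positions -> one scalar per layer."""
    effect_grid = (p_restored - p_noised) / (p_clean - p_noised)
    return effect_grid.max(axis=0)                     # as in the paper: max over tokens t for each layer l

# toy example with 2 token positions and 3 layers
grid = np.array([[0.01, 0.30, 0.05],
                 [0.02, 0.80, 0.10]])
print(tracing_effect_per_layer(grid, p_noised=0.001, p_clean=0.923))
```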

4.3. Quantifying Editing Success: Normalized Metrics

To measure how well an editing method works, they define metrics that are also normalized to a [0, 1] scale. Let $\theta^*$ be the model weights after editing.

4.3.1. Rewrite Score

This measures success in injecting the new fact $(s, r, o_{false})$.

$$\text{Rewrite Score} = \frac{p_{\theta^*}(o_{false} \mid s, r) - p_{\theta}(o_{false} \mid s, r)}{1 - p_{\theta}(o_{false} \mid s, r)}$$

Symbol Explanation:

  • $p_{\theta^*}(o_{false} \mid s, r)$: The probability of the target counterfactual (e.g., "Spain") after editing.
  • $p_{\theta}(o_{false} \mid s, r)$: The probability of the target counterfactual before editing (usually near 0).
  • Goal: We want the numerator to be large (probability increased). The denominator scales it by how much room there was to improve.

4.3.2. Paraphrase Score

This checks generalization. If we edit "Paris is located in...", the model should also answer "Paris is situated in..." with the new fact.

$$\text{Paraphrase Score} = \frac{p_{\theta^*}(o_{false} \mid s, r^*) - p_{\theta}(o_{false} \mid s, r^*)}{1 - p_{\theta}(o_{false} \mid s, r^*)}$$

  • $r^*$: A paraphrase of the relation $r$.

4.3.3. Neighborhood Score

This checks specificity (safety). Editing "Paris" should not change facts about "Marseille".

$$\text{Neighborhood Score} = 1 - \frac{\left| p_{\theta^*}(o_{false} \mid s^*, r) - p_{\theta}(o_{false} \mid s^*, r) \right|}{0.5 + \left| p_{\theta}(o_{false} \mid s^*, r) - 0.5 \right|}$$

  • $s^*$: A neighboring subject (e.g., "Marseille").
  • This metric is designed so that 1 is perfect (no change in probability for the neighbor) and 0 is bad (maximum disturbance). A sketch of all three metrics follows below.
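A sketch of the three normalized scores as plain functions, assuming the relevant pre-edit and post-edit probabilities have already been computed (function and argument names are ours):

```python
def rewrite_score(p_new_after: float, p_new_before: float) -> float:
    """How much of the available headroom for p(o_false | s, r) the edit captured."""
    return (p_new_after - p_new_before) / (1.0 - p_new_before)

def paraphrase_score(p_new_after: float, p_new_before: float) -> float:
    """Same formula as the rewrite score, evaluated on a paraphrased relation r*."""
    return (p_new_after - p_new_before) / (1.0 - p_new_before)

def neighborhood_score(p_after: float, p_before: float) -> float:
    """1 = neighboring fact undisturbed, 0 = maximally disturbed."""
    return 1.0 - abs(p_after - p_before) / (0.5 + abs(p_before - 0.5))

print(rewrite_score(p_new_after=0.85, p_new_before=0.02))   # ~0.85: strong edit
print(neighborhood_score(p_after=0.03, p_before=0.02))      # ~0.99: neighbor barely affected
```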

4.4. Experimental Variants

To exhaustively test the relationship, the authors define 5 tasks (visualized in Figure 5 below):

  1. Error Injection (Standard): Inject a false fact (e.g., Paris -> Spain). Maximize $p(o_{false})$.

  2. Tracing Reversal: Try to make the clean input $s$ output the garbage result $o_{noise}$ that usually comes from the noisy input $s_{noise}$. Maximize $p(o_{noise})$.

  3. Fact Erasure: Minimize the probability of the true fact. Minimize $p(o_{true})$.

  4. Fact Amplification: Increase the probability of the already known true fact. Maximize $p(o_{true})$.

  5. Fact Forcing: This is the closest match to Causal Tracing. The input is the noisy subject $s_{noise}$, and the goal is to force the model to output the correct answer $o_{true}$. Maximize $p(o_{true} \mid s_{noise})$.

    The following figure (Figure 5 from the original paper) summarizes these variants:

    Figure 5 (from the original paper): Depiction of editing problem variants. Rather than inject a new false fact into a model (Error Injection), we consider injecting the output obtained from noising the subject entity (Tracing Reversal), erasing a stored fact (Fact Erasure), amplifying a stored fact (Fact Amplification), or forcing a known fact onto the same kind of noisy input as used in Causal Tracing (Fact Forcing). Each variant is specified by an input prompt and an optimization target such as $\arg\max p_\theta(\text{Sweden} \mid \text{Input})$.
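Schematically, the five variants differ only in which prompt is fed to the model and which output probability the edit optimizes; the shorthand below is ours, not the authors' code.

```python
# Shorthand summary of the editing problem variants (our notation).
EDITING_VARIANTS = {
    "error_injection":    {"input": "clean prompt (s, r)",        "objective": "maximize p(o_false)"},
    "tracing_reversal":   {"input": "clean prompt (s, r)",        "objective": "maximize p(o_noise)"},
    "fact_erasure":       {"input": "clean prompt (s, r)",        "objective": "minimize p(o_true)"},
    "fact_amplification": {"input": "clean prompt (s, r)",        "objective": "maximize p(o_true)"},
    "fact_forcing":       {"input": "noised prompt (s_noise, r)", "objective": "maximize p(o_true)"},
}
```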

5. Experimental Setup

5.1. Datasets

  • CounterFact:
    • Scale: The authors filter it to a subset of 652 facts that the GPT-J model already knows correctly (to ensure there is something to localize); a sketch of this filtering step follows after this list.
    • Structure: Prompts designed for cloze completion (e.g., "The mother tongue of Danielle Darrieux is [French]").
    • Characteristics: Contains paraphrases and neighborhood subjects for robust evaluation.
  • ZSRE (Zero-Shot Relation Extraction):
    • Used for robustness checks. Contains question-answering style prompts (e.g., "Who is the CEO of Apple?").
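A sketch of the known-fact filtering applied to CounterFact above, keeping only datapoints whose true object the model already predicts. Model loading uses Hugging Face transformers; the authors' exact filtering criterion may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).eval()

def knows_fact(prompt: str, target: str) -> bool:
    """Keep a datapoint only if greedy next-token prediction begins the true object."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    predicted = tokenizer.decode(int(logits.argmax())).strip()
    return len(predicted) > 0 and target.strip().startswith(predicted)

# e.g. keep the fact only if this returns True:
# knows_fact("The mother tongue of Danielle Darrieux is", "French")
```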

5.2. Models

  • GPT-J: A 6 Billion parameter autoregressive model. This is the primary model for analysis.
    • It has 28 layers. The authors test editing at layers {1, 5, 9, 13, 17, 21, 25, 28} and specifically Layer 6 (ROME default).
  • GPT2-XL: A smaller (1.5B parameter) model with 48 layers, used for validation.

5.3. Evaluation Strategy: Linear Regression Analysis

Instead of just looking at averages, the authors use linear regression to measure how much variance in Edit Success ($Y$) can be explained by the Tracing Effect ($X$).

They compare $R^2$ values (coefficient of determination) for two models:

  1. Layer-Only Model: $Y \sim \text{Layer}$ (categorical). Does just knowing "I edited Layer 5" predict success?

  2. Interaction Model: $Y \sim \text{Layer} + \text{Layer} \times \text{TracingEffect}$. Does adding the specific tracing score for that datapoint improve prediction?

    If localization informs editing, Model 2 should have a significantly higher $R^2$ than Model 1; a sketch of this comparison follows below.
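A sketch of this comparison with synthetic data: the real analysis uses the paper's per-datapoint edit results, while the DataFrame below is random noise just to show the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# One row per (fact, edited layer): synthetic data standing in for real edit results.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "layer": rng.choice([1, 5, 9, 13, 17, 21, 25, 28], size=500),
    "tracing_effect": rng.random(500),
    "edit_success": rng.random(500),
})

layer_dummies = pd.get_dummies(df["layer"]).to_numpy(dtype=float)   # categorical layer indicators
y = df["edit_success"].to_numpy()

# Model 1: edit success ~ layer
r2_layer = LinearRegression().fit(layer_dummies, y).score(layer_dummies, y)

# Model 2: edit success ~ layer + layer x tracing effect
X_full = np.hstack([layer_dummies, layer_dummies * df["tracing_effect"].to_numpy()[:, None]])
r2_full = LinearRegression().fit(X_full, y).score(X_full, y)

print(f"R^2 layer-only: {r2_layer:.3f}  |  layer + tracing interaction: {r2_full:.3f}")
```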

6. Results & Analysis

6.1. The Surprising Disconnect (Error Injection)

The primary result is that for the standard task of injecting a new fact (Error Injection), localization strength is not correlated with edit success.

The following figure (Figure 4 from the original paper) shows a scatter plot of Edit Success (Rewrite Score) vs. Tracing Effect at Layer 6.

  • X-axis: How important Layer 6 was deemed by Causal Tracing.

  • Y-axis: How well ROME worked when editing Layer 6.

  • Observation: The red dashed line shows what we expect (positive correlation). The data points show a flat or even slightly negative correlation ($\rho = -0.13$).

    Figure 4 (from the original paper): The correlation between ROME edit success and the tracing effect at layer 6 in GPT-J is not positive but in fact slightly negative ($\rho = -0.13$, $p < 10^{-3}$). The dashed red line shows a hypothetical perfect relationship.

This means that for a specific fact, even if Causal Tracing says "Layer 6 is NOT important for this fact," editing Layer 6 still works perfectly fine. Conversely, even if it says "Layer 6 IS crucial," editing it might not be better than average.

6.2. Variance Explained Analysis

The authors formalized this observation using the regression analysis described in Section 5.3.

The following are the results from Table 1 of the original paper:

| Method | Layer | Tracing Effect | Both |
| --- | --- | --- | --- |
| ROME | 0.947 | 0.016 | 0.948 |

(Values are $R^2$ from regressions using Layer only, Tracing Effect only, or both as predictors.)

Analysis:

  • Layer (0.947): Knowing which layer you chose (e.g., Layer 5 vs Layer 28) explains 94.7% of the variance in success.

  • Tracing Effect (0.016): Knowing the localization score explains almost nothing (1.6%).

  • Both (0.948): Adding the Tracing Effect to the Layer model improves the $R^2$ by only 0.001. This is negligible.

    Conclusion: The choice of layer matters immensely (early-mid layers are generally better), but the datapoint-specific localization signal provides no useful information for selecting the edit location.

6.3. Results Across Variants (Reconciling Localization)

The authors tested if this disconnect persisted across the other task variants (Fact Erasure, Fact Forcing, etc.).

The following figure (Figure 6 from the original paper) summarizes the $R^2$ improvement when adding tracing information.

  • Blue Bar: $R^2$ from knowing the Layer.

  • Orange Bar: $R^2$ from knowing Layer + Tracing Effect.

  • Key Insight: The orange bars are barely higher than the blue bars. The bold numbers above the bars show the improvement.

    Figure 6 (from the original paper): $R^2$ (%) across editing methods (FT-1, FT-5, ROME, MEMIT) for the Tracing Reversal, Fact Amplification, Fact Erasure, and Fact Forcing tasks, comparing a Layer-only regression against a Layer + Tracing Effect regression.

    Detailed Analysis:

  • Fact Forcing: This task was designed to mimic Causal Tracing exactly (using noisy input). Here, we see the largest improvement, but it is still tiny ($\approx 3\%$).

  • Other Tasks: For Erasure, Amplification, and Reversal, the added value of localization is practically zero ($< 1\%$).

    The following are the detailed results from Table 3 of the original paper, showing the $R^2$ values for predicting the rewrite score:

| Editing Problem | Method | Layer | Trace | Both | Diff | p-value |
| --- | --- | --- | --- | --- | --- | --- |
| Error Injection | FT (1 layer) | 0.756 | 0.062 | 0.758 | 0.002 | <1e-4 |
| Error Injection | FT (5 layers) | 0.775 | 0.055 | 0.777 | 0.002 | <1e-4 |
| Error Injection | ROME (1 layer) | 0.947 | 0.016 | 0.948 | 0.001 | <1e-4 |
| Error Injection | MEMIT (5 layers) | 0.677 | 0.024 | 0.678 | 0.001 | 0.199 |
| Fact Forcing | FT (1 layer) | 0.697 | 0.104 | 0.724 | 0.027 | <1e-4 |
| Fact Forcing | FT (5 layers) | 0.634 | 0.100 | 0.666 | 0.032 | <1e-4 |
| Fact Forcing | ROME (1 layer) | 0.422 | 0.004 | 0.425 | 0.003 | <1e-4 |
| Fact Forcing | MEMIT (5 layers) | 0.345 | 0.041 | 0.354 | 0.009 | <1e-4 |

(Layer, Trace, and Both columns are $R^2$ values; Diff is the $R^2$ gained by adding the tracing term.)

Key Takeaway from Table 3: Even in the "best case" scenario (Fact Forcing with Finetuning), the Tracing Effect only adds roughly 0.03 (3%) to the explained variance. The Layer variable consistently dominates.

6.4. Where are facts actually stored?

The authors present an intriguing visualization (Figure 1 from the original paper) showing where Causal Tracing claims facts are stored.

  • Observation: While ROME/MEMIT edit layers 4-9 (the "peak" region), many individual facts have their tracing peaks in layers 1-3 or 16-20.

  • Implication: Even for facts "stored" in layer 20, editing layer 5 works best. This implies ROME might be "overriding" the computation rather than "editing the memory in place."

    Figure 1 (from the original paper): A histogram of where 652 facts known by GPT-J are stored within the model, as localized by Causal Tracing (x-axis: GPT-J layer of the peak tracing effect; y-axis: number of facts peaking there; a red line marks ROME's edit layer and a shaded band marks MEMIT's edit layers). Model editing methods like ROME and MEMIT can successfully change knowledge in LMs by editing layers 4-9. But many facts appear to be stored outside of this range, e.g. at layers 1-3 and 16-20. What about these facts?

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper fundamentally challenges the intuition that "Localization = Site for Editing."

  1. No Correlation: The effectiveness of editing a layer is statistically unrelated to the amount of factual information localized in that layer by Causal Tracing.
  2. Layer Selection Dominates: The best strategy for editing is to pick a generally good layer (like Layer 5-6 in GPT-J) and ignore datapoint-specific localization results.
  3. Different Questions: The authors conclude that Causal Tracing answers "Where is information carried in the forward pass?" while Editing answers "Where is the best intervention point to change the output?" These are distinct questions.

7.2. Limitations & Future Work

  • Dataset Scope: The study is limited to simple subject-relation-object facts (CounterFact, ZSRE). Complex reasoning chains might show different behaviors.
  • Model Size: Experiments focused on GPT-J (6B) and GPT2-XL (1.5B). Larger models (e.g., 175B+) might exhibit different "phase change" behaviors.
  • Hypothesis on Overriding: The authors hypothesize that ROME works by overriding information flow rather than editing storage. Future work is needed to verify this mechanism—how can one layer override information accumulated over many layers?

7.3. Personal Insights & Critique

  • The "Interpretability Illusion": This paper serves as a crucial warning for the Interpretability field. Just because we can highlight a part of the network that "lights up" for a concept (Localization) doesn't mean we have functional control over it via that part. It exposes a gap between observation and manipulation.
  • Practical Utility: For practitioners, this simplifies things. You don't need to run expensive Causal Tracing for every new fact you want to edit. Just stick to the known "good layers."
  • Mechanism of ROME: The finding that editing works even where the fact isn't stored suggests ROME might be functioning more like a "patch" or a "redirect" instruction inserted into the stream, rather than a database update. It's like changing a signpost on the road (the stream) rather than moving the destination (the stored weight).

