Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
TL;DR Summary
This paper introduces Mixture of Attention Spans (MoA), which optimizes inference efficiency for large language models (LLMs) by tailoring sliding-window lengths to different attention heads and layers, significantly improving effective context length and retrieval accuracy.
Abstract
Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context scenarios. Existing methods typically employ a single window length across all attention heads and input sizes. However, this uniform approach fails to capture the heterogeneous attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose Mixture of Attention Spans (MoA), which automatically tailors distinct sliding-window length configurations to different heads and layers. MoA constructs and navigates a search space of various window lengths and their scaling rules relative to input sizes. It profiles the model, evaluates potential configurations, and pinpoints the optimal length configurations for each head. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer inputs, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over the uniform-window baseline across Vicuna-{7B, 13B} and Llama3-{8B, 70B} models. Moreover, MoA narrows the performance gap with full attention, reducing the maximum relative performance drop from 9%-36% to within 5% across three long-context understanding benchmarks. MoA achieves a 1.2-1.4x GPU memory reduction, boosting decode throughput by 6.6-8.2x and 1.7-1.9x over FlashAttention2 and vLLM, with minimal performance impact. Our code is available at: https://github.com/thu-nics/MoA
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
1.2. Authors
The paper is authored by a collaborative team from multiple institutions:
- Tianyu Fu, Haofeng Huang, Xuefei Ning, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Huazhong Yang, Yu Wang (Tsinghua University and Infinigence-AI)
- Genghan Zhang (Stanford University)
- Guohao Dai (Infinigence-AI and Shanghai Jiao Tong University)
The affiliations suggest a strong background in computer science, artificial intelligence, and potentially hardware architecture, given the focus on efficient inference.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server. While arXiv is not a peer-reviewed journal or conference in itself, it is a highly influential platform for rapid dissemination of research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers published on arXiv are often later submitted to and accepted by top-tier conferences or journals. Given the publication date (2024-06-21), it is likely a recent work that may undergo peer review in the future.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses the efficiency challenges of Large Language Models (LLMs) in processing long contexts, particularly concerning sliding-window attention methods. Existing sliding-window attention approaches typically apply a uniform window length across all attention heads and input sizes, which fails to account for the diverse attention patterns within LLMs and their varied accuracy-latency trade-offs. To overcome this, the authors propose Mixture of Attention Spans (MoA), a novel training-free method that automatically configures distinct sliding-window lengths for individual attention heads and layers. MoA constructs a search space of various window lengths and their scaling rules (how window length changes with input size), profiles the model to understand attention influences, evaluates configurations, and pinpoints optimal settings for each head. This adaptive approach reveals that some heads expand their focus for longer inputs, while others maintain a fixed local context. Experiments demonstrate that MoA significantly increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over uniform-window baselines across Vicuna and Llama3 models. It also narrows the performance gap with full attention (reducing the maximum relative performance drop from 9-36% to within 5%) on long-context understanding benchmarks. Furthermore, MoA achieves 1.2-1.4x GPU memory reduction, leading to 6.6-8.2x and 1.7-1.9x decode throughput improvements over FlashAttention2 and vLLM, respectively, with minimal performance impact.
1.6. Original Source Link
https://arxiv.org/abs/2406.14909
Publication Status: Preprint on arXiv.
PDF Link: https://arxiv.org/pdf/2406.14909v3.pdf
2. Executive Summary
2.1. Background & Motivation
Core Problem
The core problem the paper addresses is the computational and memory inefficiency of Large Language Models (LLMs) when processing very long input sequences, particularly in the context of their attention mechanism. While attention is crucial for context understanding, its computational cost grows quadratically with sequence length, and its memory footprint (especially the Key-Value Cache, or KV-Cache) also grows with it, making long-context inference prohibitive.
Importance of the Problem
The ability to handle long input contexts is vital for enhancing LLM capabilities in many real-world applications such as fact retrieval, summarization, few-shot learning, and question answering. Current sliding-window attention methods, which aim to bound this quadratic growth by limiting attention to local spans, often fail to achieve a long effective context length. This is because they typically employ a uniform, fixed-span sliding-window mask across all attention heads and input lengths.
Challenges and Gaps in Prior Research
- Heterogeneous Attention Patterns: LLMs exhibit diverse attention patterns. Some attention heads naturally focus on local contexts, while others require a broader, even global, view of the input. A uniform window length constrains global-context heads unnecessarily and over-allocates resources to local-context heads.
- Lack of Elasticity: As input length increases, different attention heads require varying degrees of attention-span expansion to maintain performance. Uniform approaches lack the heterogeneous elastic rules needed to scale attention spans differently for various heads.
- Suboptimal Performance: Existing uniform sliding-window methods, such as StreamingLLM, often struggle to extend the effective context length significantly beyond their predefined span, leading to reduced retrieval accuracy and overall performance degradation in long-context tasks.
- Calibration Dataset Misalignment: Traditional model compression methods often use general language-modeling datasets for configuration, which may not accurately reflect the impact on long-context tasks or align with the LLM's own response generation patterns.
Paper's Entry Point / Innovative Idea
The paper's innovative idea is Mixture of Attention Spans (MoA), a training-free heterogeneous sliding-window attention method. Instead of a uniform window length, MoA proposes to automatically tailor distinct sliding-window length configurations to different attention heads and layers, and to allow these lengths to elastically adapt to varying input sizes. This aims to capture the intrinsic heterogeneous attention patterns of LLMs more effectively, optimizing the accuracy-latency trade-off.
2.2. Main Contributions / Findings
The primary contributions and key findings of the Mixture of Attention Spans (MoA) paper are:
- Heterogeneous Elastic Rules for Attention Spans:
  - Contribution: MoA introduces and formulates a search space of heterogeneous elastic rules that define how each attention head's sliding-window length scales with the input length, moving beyond the uniform-window paradigm.
  - Finding: These rules significantly improve content retrieval accuracy, boosting it from 25% to 98% compared to uniform counterparts at similar densities.
- Calibration Dataset Construction for Accurate Profiling:
  - Contribution: The paper emphasizes and demonstrates the importance of a specialized calibration dataset design. MoA uses data with long-range dependencies (MultiNews summaries) and aligns the supervision with the original LLM's own responses rather than human-written summaries.
  - Finding: This data engineering is crucial for accurately profiling attention influences and the impact of compression. Using model-aligned summaries improves retrieval accuracy from 87% to 95% and significantly reduces perplexity.
- Automatic Configuration Search Pipeline:
  - Contribution: MoA provides an automatic pipeline that efficiently searches for the optimal heterogeneous elastic rules for all attention heads. It combines gradient-based profiling, which quantifies attention influence, with multi-objective optimization that minimizes prediction loss under density constraints.
  - Finding: The pipeline finds optimal configurations within hours (e.g., 2 hours for Vicuna-13B and 8.5 hours for Llama3-70B), making the method practical.
- Significant Performance Improvements:
  - Finding: MoA increases the effective context length by 3.9x at the same average sliding-window length.
  - Finding: It boosts retrieval accuracy by 1.5-7.1x over uniform-window baselines across Vicuna-{7B, 13B} and Llama3-{8B, 70B}.
  - Finding: MoA reduces the maximum relative performance drop on long-context understanding benchmarks (LV-Eval, LongBench) from 9-36% (baselines) to within 5% of full attention. At 25% density, MoA achieves over 90% retrieval accuracy, outperforming baselines that require 75-100% density.
- Substantial Efficiency Gains:
  - Finding: MoA achieves a 1.2-1.4x GPU memory reduction.
  - Finding: It boosts decode throughput by 6.6-8.2x over FlashAttention2 and by 1.7-1.9x over vLLM at 50% density, with minimal performance impact. These gains come from a static KV-Cache size, reduced attention computation, larger feasible batch sizes, and a specialized CUDA kernel implementation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the Mixture of Attention Spans (MoA) paper, it's essential to grasp several foundational concepts related to Large Language Models (LLMs) and their core attention mechanism.
3.1.1. Large Language Models (LLMs)
Large Language Models (LLMs) are a type of artificial intelligence model designed to understand, generate, and process human language. They are typically based on the Transformer architecture and trained on vast amounts of text data, allowing them to learn complex patterns and relationships within language. LLMs have demonstrated remarkable capabilities in various tasks, including text generation, summarization, translation, and question answering.
3.1.2. Attention Mechanism
The attention mechanism is a core component of Transformer models (Vaswani et al., 2017). It allows the model to weigh the importance of different parts of the input sequence when processing each token. Instead of processing input sequentially or relying on fixed-size context windows, attention enables the model to "attend" to relevant information across the entire input, regardless of its position. This is crucial for capturing long-range dependencies in text.
3.1.3. Multi-Head Self-Attention (MHA)
Multi-Head Self-Attention (MHA) is an extension of the basic attention mechanism. In MHA, the attention operation is performed multiple times in parallel, each with different learned linear projections (attention heads). The idea is that each "head" can learn to focus on different types of relationships or aspects of the input sequence, thereby enriching the model's ability to capture diverse information. The outputs from all heads are then concatenated and linearly transformed to produce the final output.
The computation for a single attention head involves Query (Q), Key (K), and Value (V) matrices.
Given an input sequence, it is transformed into Query (Q), Key (K), and Value (V) matrices through linear projections. For a sequence of length $N$ and embedding dimension $d$, these matrices are of shape $N \times d_k$, $N \times d_k$, and $N \times d_v$ respectively (where $d_k$ and $d_v$ are typically $d / H$ for a model with $H$ heads).
The attention is computed as:
$ \mathbf{S} = \mathbf{Q}\mathbf{K}^T $
This calculates the attention scores (or logits) between each query token and all key tokens.
$ \mathbf{A} = \mathrm{softmax}(\mathbf{S} + \mathbf{M}) $
The softmax function normalizes the scores, turning them into probabilities, and a causal mask ($\mathbf{M}$) is applied. The causal mask ensures that when predicting the next token, a token can only attend to previous tokens and itself, not future tokens. This is essential for autoregressive models (like LLMs) that generate text token by token.
$ \mathbf{O} = \mathbf{A}\mathbf{V} $
Finally, the attention probabilities ($\mathbf{A}$) are multiplied by the Value matrix ($\mathbf{V}$) to produce the output ($\mathbf{O}$), which is a weighted sum of the value vectors.
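To make the three equations concrete, here is a minimal single-head sketch in PyTorch (omitting the usual $1/\sqrt{d_k}$ scaling, mirroring the formulas above); the tensor shapes and toy sizes are illustrative only:

```python
import torch

def causal_attention(Q, K, V):
    """S = Q K^T, A = softmax(S + M) with a causal mask M, O = A V (single head)."""
    N = Q.shape[0]                                                  # sequence length
    S = Q @ K.T                                                     # attention scores, shape (N, N)
    M = torch.triu(torch.full((N, N), float("-inf")), diagonal=1)   # causal mask: block future tokens
    A = torch.softmax(S + M, dim=-1)                                # row-wise attention probabilities
    return A @ V                                                    # weighted sum of value vectors

# Toy example: 8 tokens with head dimension 16.
Q, K, V = (torch.randn(8, 16) for _ in range(3))
print(causal_attention(Q, K, V).shape)  # torch.Size([8, 16])
```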
3.1.4. Key-Value Cache (KV-Cache)
During autoregressive inference (i.e., generating text token by token), the Key and Value matrices for previously processed tokens remain constant. Instead of recomputing them at each generation step, they can be stored and reused. This stored information is called the KV-Cache. The KV-Cache significantly speeds up the decoding process but can consume a large amount of GPU memory, especially for long contexts and large batch sizes. Its memory consumption scales linearly with the sequence length and the batch size, as well as with the number of layers, the number of heads, and the head dimension of the model.
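The sketch below (hypothetical, single head, no scaling) illustrates why the cache helps: at each decode step only the new token's key and value are computed and appended, while all previously cached keys and values are reused:

```python
import torch

def decode_step(q_new, k_new, v_new, kv_cache):
    """One decode step: append the new token's K/V to the cache and attend over the full cache."""
    kv_cache["K"] = torch.cat([kv_cache["K"], k_new], dim=0)   # (T+1, d)
    kv_cache["V"] = torch.cat([kv_cache["V"], v_new], dim=0)   # (T+1, d)
    scores = q_new @ kv_cache["K"].T                           # (1, T+1): new query vs. all cached keys
    probs = torch.softmax(scores, dim=-1)
    return probs @ kv_cache["V"]                               # (1, d) output for the new token

d = 16
kv_cache = {"K": torch.randn(5, d), "V": torch.randn(5, d)}    # cache state after a 5-token prefill
out = decode_step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d), kv_cache)
print(kv_cache["K"].shape)  # torch.Size([6, 16]): the cache grows by one entry per generated token
```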
3.1.5. Autoregressive Inference (Prefill and Decode Stages)
Autoregressive inference in LLMs typically involves two stages:
- Prefill Stage: The model processes the entire input prompt (e.g., a query or an incomplete sentence) in parallel to generate the initial output token. In this stage, attention usually computes interactions between all prompt tokens.
- Decode Stage: After the first token is generated, the model iteratively generates subsequent tokens one by one. For each new token, it uses the newly generated token and the KV-Cache from all preceding tokens (including the prompt and previously generated tokens) to predict the next token. This iterative process is where the KV-Cache is most critical but also where its memory demands become significant.
3.1.6. Sliding-Window Attention
Sliding-window attention is a technique used to mitigate the quadratic memory and computational cost of full attention for long sequences. Instead of allowing each token to attend to all previous tokens, it restricts attention to a fixed-size window of recent tokens. This means a token can only "see" and interact with tokens within its immediate historical window, plus often a few initial "global" tokens (known as attention sinks) that are always visible to all tokens. This approach makes the memory and computation costs scale linearly with sequence length (within the window), making it more hardware-efficient.
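A minimal sketch of such a mask (boolean, True = attend), combining a causal constraint, a local window, and a few always-visible sink tokens; the window and sink sizes below are arbitrary illustrations:

```python
import torch

def sliding_window_mask(seq_len, window, num_sink):
    """Each query attends to the `num_sink` initial tokens plus its `window` most recent
    tokens (itself included), and never to future tokens."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    causal = k <= q
    local = (q - k) < window                 # inside the sliding window
    sink = k < num_sink                      # always-visible initial tokens (attention sinks)
    return causal & (local | sink)

print(sliding_window_mask(seq_len=10, window=3, num_sink=2).int())
```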
3.2. Previous Works
The paper discusses several categories of efficient attention methods and LLM acceleration frameworks, highlighting their strengths and limitations compared to MoA.
3.2.1. Dynamic Sparse Attention
This branch of work attempts to dynamically skip attention computations or prune the KV-Cache based on the input sequence content.
- Examples: Pagliardini et al., 2023; Qu et al., 2022; Roy et al., 2021; Wang et al., 2021; Lu et al., 2021; Kitaev et al., 2020 for prefill; Anagnostidis et al., 2023; Zhang et al., 2023 (H2O); Ge et al., 2023; Sheng et al., 2023; Liu et al., 2023; Li et al., 2024c (SnapKV); Cai et al., 2024 (PyramidKV); Tang et al., 2024 for decode.
- Characteristics: These methods adapt sparsity patterns to the input.
- Limitations: They often require specific hardware for wall-time speedups (dynamic prefill), or extensive retraining, additional pruning-score computation, or memory swapping for KV-Cache retrieval (dynamic decode).
- H2O (Heavy-hitter Oracle): A dynamic method that identifies and preserves "heavy hitter" (most important) tokens and recent tokens in the KV-Cache. It uses dense prefill to calculate token importance.
- InfLLM: A dynamic KV-Cache compression method that unveils the intrinsic capacity of LLMs for long sequences with training-free memory. It uses sparse prefill.
- SnapKV & PyramidKV: More recent dynamic KV-Cache compression methods.
3.2.2. Static Sparse Attention
These methods apply predefined attention masks consistently across all input sequences, leading to fixed computation flows that are generally more efficient and GPU-friendly.
- Examples: Zaheer et al., 2020 (BigBird); Beltagy et al., 2020 (Longformer); Child et al., 2019; Zhou et al., 2024.
- For Generative LLMs: The predominant method is the uniform sliding-window mask with global attention on a few initial tokens (attention sinks).
  - StreamingLLM (Xiao et al., 2024c): A prominent example of a uniform sliding-window attention method that uses attention sinks and fixed window sizes. This is a primary baseline for MoA.
- Limitations: Uniform sliding-window masks are model- and data-agnostic. This can severely compromise LLMs' effective context length and lead to suboptimal performance in long-sequence scenarios, as they fail to account for heterogeneous attention patterns across heads and layers.
3.2.3. LLM Acceleration Frameworks and Kernel Optimizations
These works focus on system-level and low-level optimizations.
- Frameworks: DeepSpeed-Inference (Aminabadi et al., 2022), Accelerate (Gugger et al., 2022), FlexGen (Sheng et al., 2023), vLLM (Kwon et al., 2023).
  - vLLM: An efficient inference framework for LLMs that uses PagedAttention for memory management.
- Kernel Optimizations: FlashAttention (Dao et al., 2022), FlashAttention-2 (Dao, 2024), and FlashAttention-3 (Shah et al., 2024). These provide highly optimized CUDA kernels for attention computation.
  - FlashAttention2: A faster and more memory-efficient exact attention algorithm.
- Relationship to MoA: These optimizations are often orthogonal to MoA's approach and can be integrated to further enhance efficiency. MoA leverages a custom CUDA kernel, similar in spirit to FlashAttention, but designed for heterogeneous sliding-window attention.
3.3. Technological Evolution
The field of attention mechanisms in LLMs has evolved from computationally expensive full attention to various efficient approximations, driven by the need to handle increasingly long contexts.
- Full Attention (Vaswani et al., 2017): The original Transformer model introduced self-attention, which allows each token to attend to all other tokens. While powerful, its quadratic computational and memory cost for the attention scores, together with a KV-Cache that grows with sequence length, makes it impractical for very long sequences.
- Early Sparse Attention (Child et al., 2019; Kitaev et al., 2020): Researchers started exploring ways to sparsify the attention matrix to reduce costs. Methods like Sparse Transformers and Reformer introduced fixed patterns (e.g., local windows, strided attention, LSH attention) to limit attention to a subset of tokens.
- Context-Extending Sparse Attention (Beltagy et al., 2020; Zaheer et al., 2020): Longformer and BigBird further developed sparse attention by combining local sliding windows with a few global attention tokens (often at the beginning or at specific positions) to ensure some long-range information flow while maintaining linear complexity. This laid the groundwork for methods like StreamingLLM.
- Dynamic Sparse Attention (Roy et al., 2021; Anagnostidis et al., 2023; Zhang et al., 2023): This trend focused on making attention patterns adaptive to the input, identifying and attending to the most important tokens dynamically. While potentially more accurate, these methods often introduce complex control flow or additional overhead for importance scoring.
- Hardware-Efficient Kernel Optimizations (Dao et al., 2022; Kwon et al., 2023): In parallel to algorithmic innovations, significant work has gone into optimizing the underlying GPU kernels (FlashAttention, vLLM's PagedAttention) to make existing attention computations faster and more memory-efficient, especially during autoregressive inference.
- MoA's Position: MoA builds upon the foundation of static sliding-window attention (like StreamingLLM) but introduces heterogeneity and elasticity to address its limitations. It recognizes that a one-size-fits-all window length is suboptimal due to diverse attention patterns across attention heads and layers. By profiling these patterns and automatically tailoring window lengths, MoA aims to combine the efficiency of static methods with the nuanced, performance-aware adaptation typically sought by dynamic methods, but without their runtime overheads. It can also integrate with kernel optimizations.
3.4. Differentiation Analysis
Compared to the main methods in related work, MoA introduces several core differences and innovations:
- From Uniform to Heterogeneous and Elastic Attention Spans:
  - Prior Static Methods (e.g., StreamingLLM, Longformer, BigBird): These methods typically employ a single, uniform sliding-window length across all attention heads and layers. While efficient, this rigid approach fails to capture the distinct needs of different heads.
  - MoA's Innovation: MoA is the first to automatically tailor distinct sliding-window length configurations to each individual attention head and layer. Furthermore, it introduces elastic rules that allow these attention spans to scale differently with varying input lengths. This is a fundamental shift from uniform to heterogeneous and adaptive attention-span management.
- Training-Free Optimization Based on Attention Influence:
  - Prior Dynamic Methods (e.g., H2O, InfLLM, SnapKV, PyramidKV): These methods dynamically prune the KV-Cache or skip computations during inference based on token importance, often requiring additional runtime computation for scoring or extensive retraining.
  - MoA's Innovation: MoA's configuration search is entirely training-free. It uses gradient-based profiling to quantify the influence of each attention value on the model's prediction loss. This allows a principled, data-driven optimization of attention spans without altering model weights or incurring dynamic runtime overhead. Once configured, MoA runs with static, hardware-friendly attention masks.
- Principled Calibration Dataset Design:
  - Prior Compression Methods: Often rely on general language modeling datasets and human responses for calibration, which may not align well with long-context tasks or the model's inherent attention patterns.
  - MoA's Innovation: MoA explicitly highlights and addresses the importance of data engineering. It uses calibration datasets with long-range dependencies and, crucially, leverages the original LLM's responses as supervision targets. This ensures that the attention influence profiling accurately reflects what matters for the original model's performance on long-context tasks.
- Enhanced Performance-Efficiency Pareto Front:
  - Prior Efficient Methods: Often face a trade-off in which significant efficiency gains come with noticeable performance degradation (e.g., reduced effective context length, lower retrieval accuracy).
  - MoA's Innovation: MoA significantly improves the accuracy-throughput Pareto front. It achieves throughput and memory-efficiency gains comparable to or exceeding highly optimized baselines (FlashAttention2, vLLM) while maintaining performance very close to full attention (e.g., under a 5% drop on understanding benchmarks and much higher retrieval accuracy at lower densities). This demonstrates a superior accuracy-latency trade-off.

In essence, MoA marries the hardware efficiency of static sliding-window attention with a sophisticated, data-driven understanding of heterogeneous attention patterns to achieve both high performance and high efficiency in long-context LLM inference.
4. Methodology
4.1. Principles
The core idea behind Mixture of Attention Spans (MoA) stems from the observation that not all attention heads within an LLM behave uniformly. Instead, they exhibit heterogeneous attention patterns – some heads focus on very local contexts, while others require a broader, even global, view of the input sequence. Furthermore, these patterns can be elastic, meaning their optimal attention span might change as the overall input length varies.
The theoretical basis and intuition behind MoA are as follows:
- Heterogeneity is Key: Applying a single, uniform sliding-window length across all attention heads is suboptimal. It either excessively restricts global-context heads (leading to performance degradation) or over-allocates resources to local-context heads (leading to inefficiency). By tailoring attention spans to individual heads, MoA preserves crucial long-range dependencies where needed while aggressively pruning redundant attention elsewhere.
- Elasticity for Generalization: A fixed attention span for a given head might work for one input length but fail for another. Recognizing that different heads "stretch" their focus at different rates as input length increases (elastic rules) allows MoA to generalize better to unseen or longer contexts without re-profiling.
- Data-Driven Optimization: Instead of arbitrary heuristic choices for attention spans, MoA uses a principled, gradient-based profiling approach. It quantifies the influence of each attention value on the model's prediction loss, providing an objective measure of which attention connections are critical and should be preserved (assigned a longer window) and which are less important and can be masked (assigned a shorter window).
- Hardware-Friendly Static Masks: While the configuration search is adaptive, the resulting attention masks for inference are static. MoA therefore avoids the runtime overhead and complex control flow of dynamic sparse attention methods, allowing an efficient CUDA kernel implementation and predictable performance.
- Optimized Trade-off: The goal is to find an optimal MoA configuration (a set of heterogeneous elastic rules) that minimizes the prediction loss (maintaining accuracy) while adhering to density constraints (improving efficiency) across various input lengths.

By implementing these principles, MoA aims to achieve a superior accuracy-latency trade-off for LLM inference in long-context scenarios.
4.2. Core Methodology In-depth (Layer by Layer)
The Mixture of Attention Spans (MoA) methodology is structured around identifying and applying heterogeneous elastic rules for attention spans across different attention heads and layers. This involves defining a search space, profiling attention influence, and then automatically optimizing the configuration.
4.2.1. Mixture of Attention Patterns and Elastic Rules
The foundation of MoA is the empirical observation of heterogeneous attention patterns and elastic rules within pre-trained LLMs.
- Heterogeneous Attention Patterns: The paper highlights that different attention heads within an LLM do not attend in the same way. For instance, as shown in Figure 2, one attention head (e.g., layer 2, head 5) might primarily focus on local contexts (a narrow-span sliding window), while another (e.g., layer 0, head 21) might cover nearly the entire input sequence (global attention). These attention spans tend to be consistent across different tasks and datasets (as shown in Appendix E.1, Figure 14).
  - Implication: A uniform sliding-window attention mask applied to all heads will either unduly restrict global-context heads or waste computation and memory on local-context heads.
- Heterogeneous Elastic Rules: Beyond static heterogeneity, attention heads also exhibit elastic behavior, meaning their optimal attention span can change differently as the input length varies. For example, some heads that show global attention for shorter inputs might maintain a medium-span local focus for longer inputs, while others continue to expand their focus globally (Figure 2). Table 1 further demonstrates that applying the same sliding-window mask across layers can lead to a large variance in retrieval accuracy depending on the input length.
  - Implication: A single fixed scaling rule for the attention span across all heads is insufficient. MoA needs to identify heterogeneous elastic rules that govern how each head's attention span adapts to changing input lengths.
4.2.2. Heterogeneous Elastic Rule Search Space
MoA constructs a search space for these heterogeneous elastic rules.
- Base Attention Mask: MoA adopts a hardware-friendly sliding-window mask (similar to Beltagy et al., 2020) as its base. This mask includes:
  - A sliding-window span: a contiguous diagonal region that indicates the recent tokens each head can attend to.
  - A fixed prefix of unmasked tokens (attention sink): initial tokens (64 tokens for MoA) that are always visible to all attention heads, regardless of the sliding-window length.
  - The total attention span ($S$) of a head is the sliding-window span plus the number of initially unmasked tokens.
- Defining Elastic Rules: MoA defines the attention span of a specific attention head $h$ at input length $N$ using a simple linear function (a small code sketch follows Figure 3 below): $ S_h = \alpha_h + \beta_h \cdot N $ Here,
  - $\alpha_h$: a hyperparameter that controls the base span of attention head $h$.
  - $\beta_h$: a hyperparameter that controls how fast the attention span of head $h$ expands as the input length increases.
  - Each attention head (across all layers) gets its own $(\alpha_h, \beta_h)$ pair, chosen from multiple discrete options. By default, MoA uses six options for $\alpha$ and nine options for $\beta$.
- Search Space Size: For an LLM with many attention heads and layers (e.g., a 7B model with 32 heads and 32 layers, totaling 1,024 attention heads), the combination of $(\alpha, \beta)$ pairs leads to an enormous search space (e.g., $54^{1024}$ possible configurations). This necessitates an automatic pipeline to efficiently find the optimal $\alpha_h$ and $\beta_h$ for each attention head.

The following figure (Figure 3 from the original paper) provides an overview of MoA, including its search space and automatic rule search pipeline:
Figure: Overview of MoA. (a) The search space covers diverse elastic rules and their corresponding sliding-window masks; (b) the automatic rule search pipeline starts from a calibration dataset with long-range dependencies and supervision text generated by the original dense LLM, profiles the influence of each attention value on the model's prediction, and finally optimizes each attention head's elastic rule to minimize the prediction loss.
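As referenced above, here is a minimal sketch of the elastic rule itself; the concrete $(\alpha, \beta)$ values are invented for illustration and are not the searched values from the paper:

```python
def attention_span(alpha, beta, input_len):
    """Elastic rule S_h = alpha_h + beta_h * N, clipped to [0, N]."""
    return int(max(0, min(alpha + beta * input_len, input_len)))

# Two hypothetical heads: a local head (beta = 0) and an expanding head (beta > 0).
rules = {"local_head": (256, 0.0), "expanding_head": (64, 0.5)}
for name, (alpha, beta) in rules.items():
    print(name, {N: attention_span(alpha, beta, N) for N in (2048, 4096, 8192)})
# local_head keeps a fixed 256-token span; expanding_head grows with the input length.
```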
4.2.3. Automatic MoA Configuration Search Pipeline
MoA employs an automatic configuration search pipeline (Figure 3(b)) to find the optimal heterogeneous elastic rules. The pipeline consists of two main steps, Attention Influence Profiling and Automatic Optimization, and relies on a carefully designed Calibration Dataset.
4.2.3.1. Calibration Dataset Construction (Section 5)
Before profiling, MoA emphasizes the design of a specialized calibration dataset and supervision objectives.
- Problem with Current Approach: General language modeling datasets (e.g., RedPajama) and human-written summaries often lack long-context dependencies and introduce misalignment between the model's internal representation and human responses. This can lead to inaccurate attention influence profiling.
- MoA's Approach:
  - Long-Range Dependencies: MoA uses the MultiNews dataset (Fabbri et al., 2019), whose summaries require a deep understanding of long-range content.
  - Model Alignment: Crucially, the summaries used for supervision are generated by the original dense model itself, rather than written by humans. This ensures higher alignment between the model's internal attention patterns and the supervision signal, leading to more accurate attention influence quantification.
  - Loss Calculation: The cross-entropy loss is calculated using these model-generated summaries as the reference, facilitating accurate profiling for the MoA configuration search.
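A minimal sketch of how such model-aligned supervision could be produced with Hugging Face transformers; the model name, prompt template, and generation settings are illustrative assumptions, not the paper's exact recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"   # any dense LLM to be compressed
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def build_calibration_item(document):
    """Pair a long MultiNews-style document with a summary written by the dense model itself,
    so the supervision is aligned with the model's own behavior."""
    prompt = document + "\n\nSummarize the articles above:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    summary = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # During profiling, the cross-entropy loss is computed only on the summary tokens.
    return {"context": prompt, "supervision": summary}
```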
4.2.3.2. Attention Influence Profiling
This step quantifies the impact of individual attention values on the LLM's final prediction loss. It is crucial for understanding which attention connections are critical and should be preserved.
- Influence Derivation: The influence of each attention value $A_{i,j}$ is derived from the attention matrix and its gradient, computed over the calibration dataset.
- Approximation of Loss Change: When applying sliding-window attention masks, MoA approximates the change in the model's prediction loss ($\Delta L$) using a first-order Taylor expansion, an approach also used by other LLM compression methods: $ \Delta L = \sum_h \sum_i \sum_j \frac{\partial L}{\partial A_{h,i,j}} \cdot \Delta A_{h,i,j} $ Here, $h$ indexes attention heads across all layers, and `i, j` are the row and column indices within each attention matrix.
- Attention Influence Matrix ($E$): The attention influence matrix (for a single head, omitting $h$ for simplicity) is defined as the estimated change in loss if the attention value $A_{i,j}$ is masked (set to zero). This accounts for two effects:
  - Direct Effect: Setting $A_{i,j}$ to zero directly removes its contribution: $\Delta A_{i,j} = -A_{i,j}$.
  - Indirect Effect (Softmax Normalization): The softmax function normalizes each row of the attention matrix to sum to one. Thus, setting $A_{i,j}$ to zero causes a proportional increase in the other attention values ($A_{i,n}$ for $n \neq j$) within the same row.

  These two effects are integrated into the following formulation (the derivation is provided in Appendix E.2 of the original paper): $ E_{i,j} = \sum_n \frac{\partial L}{\partial A_{i,n}} \cdot \Delta A_{i,n|j} = \frac{\partial L}{\partial A_{i,j}} \cdot (-A_{i,j}) + \sum_{n \neq j} \frac{\partial L}{\partial A_{i,n}} \cdot A_{i,n} \cdot \frac{A_{i,j}}{1 - A_{i,j}} $ Where:
  - $E_{i,j}$: the attention influence of the attention value at row $i$, column $j$; it quantifies how much the prediction loss would change if this specific attention value were masked.
  - $\frac{\partial L}{\partial A_{i,j}}$: the gradient of the total prediction loss with respect to the attention value $A_{i,j}$, indicating how sensitive the loss is to changes in $A_{i,j}$.
  - $\Delta A_{i,n|j}$: the change in attention value $A_{i,n}$ if $A_{i,j}$ is masked.
  - $A_{i,j}$: the attention value at row $i$, column $j$.
  - The first term captures the direct loss change from zeroing out $A_{i,j}$.
  - The second term captures the indirect loss change due to the redistribution of attention weights to the other tokens in the same row after $A_{i,j}$ is masked.
- Practical Calculation: In practice, backpropagation on the calibration dataset is used to calculate the average attention influence $\bar{E}_{h,i,j}$ for each head, across multiple data items and different input lengths. The gradient is computed with a standard deep learning framework such as PyTorch (see the sketch at the end of this subsection).
- Accuracy Loss for a Rule: After obtaining $\bar{E}$, MoA calculates the accuracy loss ($\Delta L$) of applying a candidate elastic rule $r_h$ at a specific input length. This loss is the sum of the masked average attention influences according to the rule: $ \Delta L = \sum_h \Delta L_{h,r_h} = \sum_h \sum_i \sum_j M_{r_h,i,j} \cdot \bar{E}_{h,i,j} $ Where:
  - $\Delta L$: the total estimated accuracy loss across all heads.
  - $\Delta L_{h,r_h}$: the accuracy loss of head $h$ under rule $r_h$.
  - $M_{r_h,i,j}$: a binary mask for head $h$ corresponding to rule $r_h$; masked positions are 1, unmasked positions are 0.
  - $\bar{E}_{h,i,j}$: the average attention influence of head $h$ at position `(i, j)`.

This profiling step generates a map of how different elastic rules impact accuracy across attention heads and input lengths.
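A minimal PyTorch sketch of the influence formula above for one head; `A` is the attention matrix kept from a calibration forward pass and `dL_dA` its gradient from backpropagation (the `eps` term is added here purely for numerical stability):

```python
import torch

def attention_influence(A, dL_dA, eps=1e-6):
    """E[i, j] = dL/dA[i, j] * (-A[i, j])
                 + sum_{n != j} dL/dA[i, n] * A[i, n] * A[i, j] / (1 - A[i, j])."""
    direct = -dL_dA * A                                          # direct effect of zeroing A[i, j]
    row_sum = (dL_dA * A).sum(dim=-1, keepdim=True)              # sum_n dL/dA[i, n] * A[i, n]
    redistribution = (row_sum - dL_dA * A) * A / (1 - A + eps)   # exclude n == j, softmax re-normalization
    return direct + redistribution

# Usage sketch: retain A's gradient (A.retain_grad()) during the calibration forward pass,
# call loss.backward(), then average attention_influence(A.detach(), A.grad) over data items.
```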
4.2.3.3. Automatic Optimization
This stage automatically selects the optimal elastic rule for each attention head.
- Multi-objective Optimization: MoA formulates the problem as a multi-objective optimization to find a set of Pareto-optimal configurations across all profiled lengths. The objective is to minimize the total accuracy loss while adhering to user-defined density constraints: $ \underset{r_h \in \mathbb{R}}{\arg \min} \ \Delta L^{(N_i)}, \ N_i \in \mathbb{N}_{\mathrm{profile}} \quad \mathrm{s.t.} \ \frac{1}{H} \sum_{h=1}^{H} d_{r_h}^{(N_i)} \leq d_{\mathrm{constr}}^{(N_i)}, \ \forall N_i \in \mathbb{N}_{\mathrm{constr}} $ Where:
  - $r_h$: the elastic rule of head $h$, chosen from the set of candidate rules $\mathbb{R}$.
  - $\Delta L^{(N_i)}$: the accuracy loss (from Equation 4) at input length $N_i$.
  - $\mathbb{N}_{\mathrm{profile}}$: the set of input lengths at which profiling was performed.
  - $d_{r_h}^{(N_i)}$: the density of rule $r_h$ for head $h$ at input length $N_i$; density is the ratio of active attention connections to all possible connections.
  - $d_{\mathrm{constr}}^{(N_i)}$: the user-defined density constraint at input length $N_i$.
  - $H$: the total number of attention heads.
  - $\mathbb{N}_{\mathrm{constr}}$: the set of lengths subject to density constraints.
- Mixed-Integer Programming (MIP) Formulation: This problem corresponds to a multi-objective mixed-integer programming problem, which can be solved with standard solvers such as Gurobi (a small open-source sketch is given at the end of this subsection). The detailed formulation (from Appendix E.3.1) introduces a binary variable $X_{h,r_h}$ indicating whether rule $r_h$ is selected for head $h$. The single-length problem $ \underset{\phi}{\mathrm{argmin}} \ \Delta L = \sum_h \Delta L_{h,r_h} \quad \mathrm{s.t.} \ \frac{1}{H} \sum_h d_{r_h} \leq d_{\mathrm{constr}} $ is transformed into: $ \underset{X_{h,r_h}}{\mathrm{argmin}} \ \frac{1}{H} \sum_{h=0}^{H-1} \sum_{r_h=0}^{R_h-1} \Delta L_{h,r_h} X_{h,r_h} \quad \mathrm{s.t.} $ $ \sum_{r_h=0}^{R_h-1} X_{h,r_h} = 1, \quad h \in \{0, \cdots, H-1\} $ $ \frac{1}{H} \sum_{h=0}^{H-1} \sum_{r_h=0}^{R_h-1} d_{r_h} X_{h,r_h} \leq d_{\mathrm{constr}} $ $ X_{h,r_h} \in \{0, 1\}, \quad \forall h \in \{0, \cdots, H-1\}, \ \forall r_h \in \mathbb{R} $ Where:
  - $X_{h,r_h}$: a binary variable that is 1 if rule $r_h$ is selected for head $h$, and 0 otherwise.
  - $\Delta L_{h,r_h}$: the accuracy loss of head $h$ with rule $r_h$.
  - $d_{r_h}$: the density of rule $r_h$.
  - $R_h$: the number of candidate rules for head $h$.
  - The first constraint ensures that exactly one rule is selected per head.
  - The second constraint enforces the average density budget.
  - Additional constraints limit the number of distinct rules per model layer for inference-time efficiency.
- Generalization to Multiple Lengths (Appendix E.3.2, Figure 15): To handle the multi-objective MIP across various lengths, MoA uses an epsilon-constraint method (Yv et al., 1971). This involves:
  - Performing single-objective optimization for each profiled length as the primary objective and recording the outcomes of all other objectives, which establishes the range of possible losses for each objective.
  - Iterating through each objective again, treating the other objectives as constraints by partitioning their loss ranges into uniform intervals. This generates a comprehensive set of candidate configurations.
  - Finally, removing configurations that do not satisfy the Pareto-front requirements, yielding the final set of Pareto-optimal solutions.
- Final Configuration Selection: From the set of Pareto-optimal MoA configurations, the one that yields the minimum loss at an unseen validation length (12k in the experiments) is chosen as the optimal solution. This step helps ensure good generalization to lengths beyond those used for profiling.

This pipeline effectively automates the discovery of complex, heterogeneous elastic rules that optimize both performance and efficiency for LLM inference in long-context scenarios.
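A minimal sketch of the single-length rule-selection MIP, using the open-source PuLP solver as a stand-in for the Gurobi setup mentioned above; the per-layer rule-count constraint and the multi-length epsilon-constraint procedure are omitted, and the toy numbers are invented:

```python
import pulp

def select_rules(loss, density, d_constr):
    """loss[h][r], density[h][r]: profiled accuracy loss and density of rule r for head h.
    Selects exactly one rule per head, minimizing total loss under an average-density budget."""
    H, R = len(loss), len(loss[0])
    prob = pulp.LpProblem("moa_rule_selection", pulp.LpMinimize)
    x = [[pulp.LpVariable(f"x_{h}_{r}", cat="Binary") for r in range(R)] for h in range(H)]
    prob += pulp.lpSum(loss[h][r] * x[h][r] for h in range(H) for r in range(R))      # objective
    for h in range(H):
        prob += pulp.lpSum(x[h][r] for r in range(R)) == 1                            # one rule per head
    prob += pulp.lpSum(density[h][r] * x[h][r]
                       for h in range(H) for r in range(R)) <= d_constr * H           # density budget
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [next(r for r in range(R) if pulp.value(x[h][r]) > 0.5) for h in range(H)]

# Toy example: 2 heads, 3 candidate rules each, 50% average-density budget.
print(select_rules(loss=[[0.9, 0.4, 0.1], [0.8, 0.3, 0.05]],
                   density=[[0.25, 0.5, 1.0], [0.25, 0.5, 1.0]], d_constr=0.5))
```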
5. Experimental Setup
5.1. Datasets
The authors use a comprehensive set of datasets to evaluate MoA's performance across various long-context capabilities, including retrieval, understanding, and coherence. A specialized calibration dataset is also used for MoA's internal profiling.
5.1.1. Long-Context Retrieval
- LongEval (Li et al., 2023a): This benchmark tests key-value retrieval accuracy. The model is given a lengthy context containing many key-value pairs (e.g., line <index>: REGISTER_CONTENT is <value>) and is then asked to retrieve the value for a specific key.
  - Scale: 100 data items per length level.
  - Purpose: To quantify how well the model can recall specific facts embedded deep within long input sequences, and thus measure the effective context length.
- Needle-In-A-Haystack (NIAH) (Kamradt, 2024): Another common retrieval benchmark.
  - Characteristics: A single "needle" sentence (the target information) is hidden within a large, irrelevant "haystack" context. The model is then asked a question that can only be answered by retrieving the "needle."
  - Purpose: To test the model's ability to find and utilize a single critical piece of information in extremely long and noisy contexts.
5.1.2. Long-Context Understanding
- LV-Eval (Yuan et al., 2024): A benchmark for long-context understanding.
  - Characteristics: Includes 11 sub-datasets. For Vicuna models (16k context limit), the 16k split is used, truncating inputs to 15,500 tokens.
  - Purpose: To assess a model's ability to comprehend and answer questions based on complex, multi-faceted long documents.
- LongBench (Bai et al., 2023): Another benchmark for long-context understanding.
  - Characteristics: Includes 13 sub-datasets. The LongBench-E split is used for balanced data items across length levels. It is segmented into the ranges 0-4k, 4-8k, and 8k+ tokens, with truncation thresholds of 3,500, 7,500, and 15,500 tokens, respectively.
  - Purpose: Provides a comprehensive evaluation of LLMs on diverse long-context tasks, including summarization, question answering, and code completion.
5.1.3. Perplexity Evaluation
A custom test set is constructed for perplexity evaluation:
- Datasets: Qasper (Dasigi et al., 2021) for question answering, MultiNews (Fabbri et al., 2019) for summarization, TREC (Li & Roth, 2002; Hovy et al., 2001) for few-shot learning, and LCC (Mohler et al., 2016) for code completion.
- Scale: 50 data items per length level from the test split of each of the four datasets (200 items per length level in total).
- Characteristics: Data items are organized as question-answer pairs, with human-written questions and answers.
- Purpose: To measure the model's coherence and fluency in generating responses; perplexity is calculated on the answer part of each data item.
5.1.4. Validation Dataset
Used to select the optimal MoA configuration from the Pareto-front solutions.
- Construction: Similar to the perplexity test dataset, it samples data items from the validation splits (or test splits for datasets without a validation split) of Qasper, MultiNews, TREC, and LCC.
- Additional Data: 50 data items from LongEval are added to validate retrieval ability.
- Purpose: To ensure the selected configuration generalizes well to unseen lengths and tasks.
5.1.5. Calibration Dataset (for MoA's Internal Profiling)
This is crucial for MoA's attention influence profiling (Section 4.2.3.1).
- Dataset: The MultiNews dataset (Fabbri et al., 2019).
- Characteristics: Contains long news documents and their summaries, highlighting long-range dependencies.
- Supervision: Summaries are generated by the original dense model, not written by humans, ensuring model alignment.
- Scale: Data items at 2k, 4k, and 8k lengths; each item is padded to its corresponding length.
- Purpose: To accurately profile the attention influence of individual attention values on the model's prediction loss for long-context tasks.
5.1.6. Data Sample Examples
The following are concrete examples of data samples used in the experiments:
Format 1. LongEval
This format is used for the LongEval (Li et al., 2023a) retrieval benchmark. The instruction for retrieval is given after a lengthy context.
Below is a record of lines I want you to remember. Each line begins with 'line <line index>' and contains a ' <REGISTER_CONTENT>' at the end of the line as a numerical value. For each line index, memorize its corresponding <REGISTER_CONTENT>. At the end of the record, I will ask you to retrieve the corresponding <REGISTER_CONTENT> of a certain line index. Now the record start:
line delightful-incandescence: REGISTER_CONTENT is <19147>
line cloistered-presence: REGISTER_CONTENT is <8862>
... (massive lines of key-value pairs) ...
Now the record is over. Tell me what is the <REGISTER_CONTENT> in line cloistered-presence? I need the number.
Format 2. Needle-In-A-Haystack (NIAH)
This format is used for the Needle-In-A-Haystack (NIAH) (Kamradt, 2024) retrieval benchmark. A single "needle" sentence is embedded within irrelevant context.
People who are powerful but uncharismatic will tend to be disliked. Their power makes them a target for criticism that they don't have the charisma to disarm. That was Hillary Clinton's problem.
The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.
It also tends to be a problem for any CEO who is more of a builder than a schmoozer.
What is the best thing to do in San Francisco?
Format 3. MultiNews Calibration Dataset
This format is used for MoA's calibration dataset. The context includes multiple news passages, and the supervision is a summarization generated by the original dense model.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided below.
5.2.1. Retrieval Accuracy
- Conceptual Definition: Retrieval accuracy measures the ability of a model to correctly extract or identify specific pieces of information (e.g., a key-value pair or a "needle" sentence) from a given long document or context. It directly assesses how well the model can pinpoint and utilize relevant facts amidst potentially irrelevant or lengthy surrounding text. For the LongEval benchmark, it specifically checks whether the model returns the correct numerical value associated with a queried key.
- Mathematical Formula: For a set of retrieval tasks: $ \text{Retrieval Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\text{model\_output}_i = \text{ground\_truth}_i) $
- Symbol Explanation:
  - $N$: the total number of retrieval tasks or data items.
  - $\text{model\_output}_i$: the value retrieved by the model for the $i$-th task.
  - $\text{ground\_truth}_i$: the correct value to be retrieved for the $i$-th task.
  - $\mathbb{I}(\cdot)$: the indicator function, which equals 1 if the condition inside the parentheses is true, and 0 otherwise.
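A minimal sketch of the metric (exact string match after stripping whitespace; the real benchmark scripts may normalize answers differently):

```python
def retrieval_accuracy(model_outputs, ground_truths):
    """Fraction of queries whose retrieved value exactly matches the reference."""
    correct = sum(out.strip() == gt.strip() for out, gt in zip(model_outputs, ground_truths))
    return correct / len(ground_truths)

print(retrieval_accuracy(["19147", "8862", "1234"], ["19147", "8862", "4321"]))  # 0.666...
```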
5.2.2. LV-Eval Score
- Conceptual Definition: LV-Eval is a benchmark that provides an aggregate score for an LLM's long-context understanding capabilities across various sub-datasets. It quantifies how well a model can perform tasks like question answering, summarization, and conditional generation over extended inputs. The score is typically an average or weighted average of performance across its constituent tasks.
- Mathematical Formula: The paper reports LV-Eval scores as a single aggregate value. A specific formula is not provided in the paper's main text or appendices, as the score is computed by the benchmark's official script based on its internal evaluation methodology for the 11 sub-datasets. Conceptually, it is a composite metric reflecting overall long-context understanding.
- Symbol Explanation: Not applicable for a general formula, as it is an aggregate benchmark score quantifying overall performance across different long-context understanding tasks.
5.2.3. LongBench Score
- Conceptual Definition: Similar to LV-Eval, LongBench is a comprehensive benchmark for evaluating LLMs on various long-context understanding tasks. It provides a score reflecting the model's performance across its 13 diverse sub-datasets, which cover tasks such as summarization, code completion, few-shot learning, and question answering. The score is typically an average over the sub-tasks, computed by the official benchmark.
- Mathematical Formula: The paper reports LongBench scores as an aggregate value; a specific formula is not provided, as the score is computed by the benchmark's official script. Conceptually, it is a composite metric reflecting overall long-context understanding across a wider range of tasks than LV-Eval.
- Symbol Explanation: Not applicable for a general formula, as it is an aggregate benchmark score quantifying overall performance across different long-context understanding tasks.
5.2.4. Perplexity (PPL)
- Conceptual Definition: Perplexity is a common metric for evaluating how well a language model predicts a sample of text. It measures the uncertainty of the model when predicting the next token in a sequence. A lower perplexity indicates that the model is more confident and accurate in its predictions, implying better coherence and fluency in language generation. It is the exponentiated average negative log-likelihood of the tokens.
- Mathematical Formula: $ \text{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1})\right) $
- Symbol Explanation:
  - $W$: a sequence of tokens, $w_1, w_2, \dots, w_N$.
  - $N$: the total number of tokens in the sequence.
  - $P(w_i | w_1, \dots, w_{i-1})$: the probability assigned by the language model to the $i$-th token $w_i$, given all preceding tokens.
  - $\log$: the natural logarithm.
  - $\exp$: the exponential function.
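A minimal sketch computing perplexity from per-token log-probabilities (which a framework like PyTorch would produce as the negative cross-entropy over the answer tokens):

```python
import math

def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum_i log P(w_i | w_<i))."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(perplexity([-0.5, -1.2, -0.3]))  # ~1.95 for these three example log-probabilities
```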
5.2.5. Decode Throughput (tokens/s)
- Conceptual Definition: Decode throughput measures the rate at which an LLM generates output tokens during the decode stage of autoregressive inference, typically expressed in tokens per second (tokens/s). Higher throughput indicates greater efficiency and faster text generation, which is critical for real-time applications and for serving many users simultaneously.
- Mathematical Formula: $ \text{Decode Throughput} = \frac{\text{Total number of tokens generated}}{\text{Total time taken for generation}} $
- Symbol Explanation:
  - Total number of tokens generated: the sum of all tokens produced by the model across all requests in a given period.
  - Total time taken for generation: the wall-clock time required to generate these tokens.
5.2.6. GPU Memory Reduction
- Conceptual Definition: GPU memory reduction quantifies the decrease in GPU memory consumption achieved by an optimization method compared to a baseline. For LLMs, this often refers to the memory saved by efficiently managing the KV-Cache or reducing the size of attention matrices. Reducing GPU memory allows larger batch sizes or longer context lengths on the same hardware.
- Mathematical Formula: $ \text{GPU Memory Reduction} = \frac{\text{Baseline GPU Memory Usage}}{\text{Optimized GPU Memory Usage}} $ Alternatively, as a percentage: $ \text{Percentage Reduction} = \left(1 - \frac{\text{Optimized GPU Memory Usage}}{\text{Baseline GPU Memory Usage}}\right) \times 100\% $
- Symbol Explanation:
  - Baseline GPU Memory Usage: the GPU memory consumed by the original or unoptimized method.
  - Optimized GPU Memory Usage: the GPU memory consumed by the proposed (MoA) method.
5.2.7. Length-controlled Win Rate
- Conceptual Definition: Length-controlled win rate is a metric used in benchmarks like AlpacaEval 2.0 to evaluate the quality of a model's instruction-following responses. It compares the output of a target model against a reference model (e.g., gpt4_turbo) and determines which response is preferred by an automated evaluator. "Length-controlled" means the evaluation accounts for potential biases related to response length. A higher win rate indicates stronger instruction-following capability.
- Mathematical Formula: The calculation in AlpacaEval 2.0 relies on a sophisticated automated evaluator (such as gpt-4-1106-preview) performing pairwise comparisons, so there is no simple closed-form formula. Conceptually, it is the percentage of times the target model's response is preferred over the reference model's response by the evaluator.
- Symbol Explanation: Not applicable; the metric is the percentage of instances where the target model's output is judged superior by the automated evaluator.
5.3. Baselines
The paper compares MoA against a comprehensive set of state-of-the-art efficient attention methods and LLM frameworks, encompassing both static and dynamic approaches.
- StreamingLLM (Xiao et al., 2024c): A prominent static efficient attention method that employs a uniform sliding-window mask with a fixed number of attention-sink tokens (initial tokens that remain always visible). It uses efficient prefill.
  - Configuration: The initial four tokens are unmasked (attention sink), except for the 70B model and super-long settings, where 64 tokens are used.
- InfLLM (Xiao et al., 2024a): A dynamic KV-Cache compression method that uses training-free memory for long sequences. It uses sparse prefill.
  - Configuration: Adheres to the original setup with the same local window size, selected memory size, and 128 initial tokens as attention sinks.
- H2O (Heavy-hitter Oracle) (Zhang et al., 2023): A dynamic KV-Cache pruning method that identifies and keeps "heavy hitter" (most important) tokens and recent tokens.
  - Configuration: Uses the same number of heavy-hitter tokens and recent tokens as specified in its original paper.
  - Note: H2O uses the original dense prefill because it needs the column sums of the attention matrix to determine token importance for KV-Cache eviction, making it incompatible with sparse-prefill kernel optimizations like FlashAttention during prefill. This can lead to Out-Of-Memory (OOM) issues at large prefill lengths and batch sizes.
- FlashAttention2 (Dao, 2024): A highly optimized CUDA kernel for exact attention computation, significantly improving the speed and memory efficiency of dense attention. It serves as a strong dense-attention baseline.
- vLLM (Kwon et al., 2023): A high-throughput LLM serving framework that uses PagedAttention for efficient KV-Cache memory management. It is known for its strong decode throughput.
- BigBird (Zaheer et al., 2020): An early static sparse attention model that combines local sliding windows with global tokens. Included in the accuracy-throughput trade-off analysis.
- SnapKV (Li et al., 2024c): A dynamic KV-Cache compression method. Included for comparison in the longer-context generalization results.
- PyramidKV (Cai et al., 2024): Another recent dynamic KV-Cache compression method. Included for comparison in the longer-context generalization results.
5.4. MoA Settings
The specific configurations used for MoA in the experiments are:
- Attention Pattern: Block sliding-window attention with a block size of 64 tokens. The first block (64 tokens) is always unmasked and serves as the attention sink.
- Profiling Dataset: MultiNews (Fabbri et al., 2019) with model-generated summaries as supervision (as detailed in Section 5 of the paper).
- Profiling Lengths: Calibration data items at 2k, 4k, and 8k input lengths, padded to their respective lengths to ensure uniform attention-influence tensor shapes.
- Profiling Granularity: Block granularity; the average attention influence is calculated within each block to represent its overall influence.
- Hyperparameter Search Space:
  - $\alpha$ (base span): 6 discrete values.
  - $\beta$ (expansion rate): 9 discrete values.
  - This creates a search space of 54 $(\alpha, \beta)$ pairs for each attention head.
  - $\alpha$ values are uniformly sampled from a predefined range and $\beta$ values from [0, 1]. The resulting attention-span lengths are clipped to lie between 0 and the current input length.
- Optimization: Multi-objective optimization is performed at the same set of lengths (2k, 4k, 8k).
- Rule Restriction: To ensure inference-time efficiency, the number of distinct elastic rules is restricted to at most two per model layer.
- Validation Length: The optimal MoA configuration is selected from the Pareto-front solutions based on the lowest perplexity on a validation dataset of length 12k.
- Fine-tuning: The models are not fine-tuned after applying MoA.
- Consistency: Each model uses the same MoA configuration across all benchmarks and lengths (a hypothetical serialization of such a configuration is sketched below).
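As referenced in the last bullet, here is a hypothetical serialization of a searched MoA configuration; the layout and the $(\alpha, \beta)$ values are invented for illustration, and the released code may store configurations differently:

```python
moa_config = {
    "block_size": 64,        # block sliding-window granularity
    "sink_blocks": 1,        # first 64-token block is always unmasked (attention sink)
    "rules": {               # layer -> head -> (alpha, beta); at most two distinct rules per layer
        0: {0: (128, 0.0), 1: (64, 0.5)},
        1: {0: (256, 0.0), 1: (64, 1.0)},
    },
}

def head_span(layer, head, input_len, cfg=moa_config):
    """Attention span of one head at a given input length: alpha + beta * N, capped at N."""
    alpha, beta = cfg["rules"][layer][head]
    return min(int(alpha + beta * input_len), input_len)

print(head_span(1, 1, 8192))  # 8192: this hypothetical head attends (almost) globally at 8k
```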
6. Results & Analysis
6.1. Core Results Analysis
MoA demonstrates superior performance and efficiency across various LLM models and long-context benchmarks, significantly outperforming state-of-the-art baselines.
6.1.1. Accuracy-Throughput Trade-off
The following figure (Figure 4 from the original paper) shows the accuracy-throughput trade-offs of seven attention methods at different densities:
The figure is a chart showing the trade-off between decode throughput and retrieval accuracy for seven attention methods on the Vicuna-7B model at different densities. The methods include InfLLM, BigBird, H2O, SnapKV, PyramidKV, StreamingLLM, and MoA; the x-axis shows decode throughput (tokens/s) and the y-axis shows retrieval accuracy.
As shown in Figure 4, MoA effectively advances the Pareto front in terms of both retrieval accuracy and decode throughput compared to six baselines (InfLLM, BigBird, H2O, SnapKV, PyramidKV, StreamingLLM).
- Throughput Advantage: At the same densities, MoA achieves notable throughput improvements of 1.6-18.1x over H2O, InfLLM, BigBird, SnapKV, and PyramidKV, attributed to its efficient static attention design and customized GPU kernel.
- Accuracy Maintenance: MoA consistently maintains high retrieval accuracy across varying densities, indicating that it sacrifices less performance for efficiency than other methods.
6.1.2. Performance
MoA consistently outperforms existing efficient attention methods and achieves performance comparable to the original full-attention model at 50% density across various tasks and model sizes.
6.1.2.1. Long-Context Retrieval
The following are the results from Table 4 of the original paper:
| Model | Attention | Retrieve Acc. ↑ (4k) | (8k) | (16k) | LV-Eval ↑ (16k) | LongBench ↑ (0-16k) | PPL ↓ (8-12k) |
|---|---|---|---|---|---|---|---|
| Vicuna-7B | Original | 1.00 | 0.98 | 0.62 | 5.93 | 34.76 | 3.79 |
| | H2O | 0.86 | 0.68 | 0.35 | 5.42 | 33.59 | 3.94 |
| | InfLLM | 0.67 | 0.57 | 0.26 | 5.13 | 32.97 | 4.07 |
| | StreamingLLM | 0.43 | 0.16 | 0.08 | 4.72 | 31.84 | 4.48 |
| | MoA | 1.00 | 0.97 | 0.57 | 5.61 | 33.96 | 3.75 |
| Vicuna-13B | Original | 0.99 | 0.98 | 0.44 | 5.83 | 39.23 | 3.62 |
| | H2O | 0.88 | 0.76 | 0.28 | 5.66 | 38.13 | 3.80 |
| | InfLLM | 0.70 | 0.53 | 0.27 | 6.80 | 37.13 | 4.07 |
| | StreamingLLM | 0.65 | 0.49 | 0.33 | 5.43 | 32.13 | 4.10 |
| | MoA | 0.99 | 0.93 | 0.49 | 7.16 | 38.77 | 3.62 |
| Llama3-8B | Original | 0.99 | 0.99 | 0.97 | 17.49 | 43.69 | 4.52 |
| | H2O | 0.94 | 0.88 | 0.37 | 16.03 | 42.99 | 4.63 |
| | InfLLM | 0.65 | 0.59 | 0.52 | 14.44 | 42.43 | 4.68 |
| | StreamingLLM | 0.68 | 0.55 | 0.52 | 11.16 | 38.22 | 4.79 |
| | MoA | 0.99 | 1.00 | 1.00 | 17.46 | 42.97 | 4.49 |
| Llama3-70B | Original | 1.00 | 0.99 | 0.93 | 24.51 | 49.10 | 3.67 |
| | H2O | 0.93 | 0.91 | OOM | OOM | OOM | OOM |
| | StreamingLLM | 0.20 | 0.15 | 0.04 | 17.45 | 42.53 | 4.26 |
| | MoA | 1.00 | 1.00 | 0.94 | 23.65 | 47.79 | 3.75 |
-
Superior Accuracy: As shown in Table 4, MoA consistently achieves very high retrieval accuracy across all tested models (Vicuna-{7B, 13B}, Llama3-{8B, 70B}) and input lengths (4k, 8k, 16k) at 50% density.
  - Its maximum relative accuracy drop (compared to the Original full-attention model) is only 8%, significantly lower than StreamingLLM (87%), InfLLM (58%), and H2O (44%). On average, MoA's relative accuracy drop is below 1%.
-
Extended Effective Context Length: The following figure (Figure 5(a) from the original paper) shows retrieval accuracy and effective context length:
The figure is a chart showing the retrieval accuracy of the Vicuna-7B model at different attention spans and input lengths. The x-axis shows attention span and the y-axis shows input length; the heatmap indicates retrieval accuracy under each configuration.
Figure 5(a) demonstrates that MoA retains over 90% retrieval accuracy up to 60k lengths, effectively matching the dense model's effective context length. This is achieved with only 8k profiling and 12k validation. In stark contrast, H2O, InfLLM, and StreamingLLM have much shorter effective context lengths (8k, <4k, and <4k respectively). MoA extends its effective context length to approximately 3.9x the average KV-Cache length.
-
Density Efficiency: The following figure (Figure 6 from the original paper) shows retrieval accuracy across attention spans and input lengths:
The figure is a chart showing retrieval-accuracy tests on LongEval with Vicuna-7B. Panel (a) shows the effective length at different attention spans, and panel (b) shows how retrieval accuracy changes at specific densities.
Figure 6 (and Figure 7(b) in the appendix) further illustrates that MoA reaches over 0.9 retrieval accuracy at an 8k input length with just 25% density, whereas StreamingLLM and H2O require 100% and 75% density, respectively, for similar performance. This highlights MoA's efficiency in retaining critical information.
-
Needle-In-A-Haystack Performance: The following figure (Figure 8 from the original paper) shows Needle-In-A-Haystack retrieval accuracy for Llama3-8B at input lengths from 8k to 256k:
The figure is a chart showing Needle-In-A-Haystack (NIAH) retrieval accuracy of different attention methods on the Llama3-8B model for input lengths from 8k to 256k. It shows the depth-wise performance of MoA, H2O, StreamingLLM, and InfLLM under different token limits, with color intensity indicating the accuracy score.
Figure 8 (Appendix C.1.2) shows that MoA achieves perfect retrieval accuracy across input lengths from 8k to 256k on the NIAH dataset, while baselines like StreamingLLM have limited effective context length and InfLLM shows reduced accuracy. H2O and InfLLM often fail at extreme lengths due to OOM/OOT errors.
6.1.2.2. Long-Context Understanding
The following are the results from Table 7 of the original paper:
| Model | Attention | LongBench ↑ (0-4k) | LongBench ↑ (4-8k) | LongBench ↑ (8-16k) |
|---|---|---|---|---|
| Vicuna-7B | Original | 37.91 | 33.82 | 32.54 |
| | H2O | 36.23 | 32.74 | 31.81 |
| | InfLLM | 35.23 | 33.54 | 30.15 |
| | StreamingLLM | 30.53 | 33.28 | 31.70 |
| | MoA | 37.04 | 32.90 | 31.94 |
| Vicuna-13B | Original | 42.25 | 39.52 | 35.93 |
| | H2O | 41.63 | 38.02 | 34.75 |
| | InfLLM | 39.36 | 37.66 | 34.36 |
| | StreamingLLM | 30.65 | 33.07 | 32.68 |
| | MoA | 41.73 | 38.88 | 35.69 |
| Llama3-8B | Original | 44.27 | 43.53 | 43.26 |
| | H2O | 43.46 | 43.01 | 42.50 |
| | InfLLM | 42.78 | 42.69 | 41.81 |
| | StreamingLLM | 37.20 | 38.02 | 39.43 |
| | MoA | 43.07 | 42.75 | 43.09 |
| Llama3-70B | Original | 50.70 | 48.05 | 48.55 |
| | H2O | 50.16 | 47.77 | OOM |
| | StreamingLLM | 45.14 | 42.40 | 40.04 |
| | MoA | 49.74 | 46.80 | 46.84 |
The following are the results from Table 8 of the original paper:
| Model | Attention | Single-QA w/o. Conf (2) | Single-QA w. Conf (2) | Multi-QA w/o. Conf (3) | Multi-QA w. Conf (2) | Retrieval w. Conf (2) |
|---|---|---|---|---|---|---|
| Vicuna-7B | Original | 10.49 | 6.29 | 6.83 | 5.60 | 0.00 |
| | H2O | 9.16 | 6.20 | 6.44 | 4.80 | 0.00 |
| | InfLLM | 7.11 | 6.70 | 6.07 | 4.80 | 0.00 |
| | StreamingLLM | 7.54 | 5.90 | 5.98 | 3.56 | 0.00 |
| | MoA | 9.98 | 6.27 | 6.16 | 5.31 | 0.09 |
| Vicuna-13B | Original | 10.64 | 7.28 | 5.32 | 5.07 | 1.08 |
| | H2O | 9.53 | 6.54 | 5.25 | 5.36 | 1.83 |
| | InfLLM | 10.21 | 9.35 | 6.03 | 3.19 | 2.08 |
| | StreamingLLM | 9.05 | 5.86 | 5.37 | 3.19 | 3.70 |
| | MoA | 11.04 | 6.93 | 5.79 | 5.84 | 6.88 |
| Llama3-8B | Original | 34.05 | 19.51 | 11.41 | 17.70 | 7.84 |
| | H2O | 28.52 | 17.05 | 11.11 | 15.98 | 9.95 |
| | InfLLM | 24.94 | 17.75 | 10.61 | 14.80 | 6.04 |
| | StreamingLLM | 20.21 | 9.57 | 8.14 | 9.36 | 10.03 |
| | MoA | 32.98 | 20.53 | 10.65 | 17.57 | 8.98 |
| Llama3-70B | Original | 44.44 | 25.02 | 16.71 | 22.86 | 17.43 |
| | StreamingLLM | 26.63 | 14.22 | 14.04 | 14.70 | 19.38 |
| | MoA | 42.44 | 23.58 | 15.75 | 21.27 | 19.19 |
-
Reduced Performance Gap: Table 4, Table 7, and Table 8 show that MoA narrows the maximum relative performance drop on the LV-Eval and LongBench benchmarks to 5% and 3%, respectively, significantly better than StreamingLLM's 36% and 18%. H2O and InfLLM incur larger efficiency costs while achieving similar or worse performance.
-
Consistent Performance: The following figure (Figure 9 from the original paper) compares normalized scores of MoA, H2O, and StreamingLLM across subtasks:
The figure is a chart comparing the performance of MoA, H2O, and StreamingLLM on different models (7B, 13B, 70B). It shows the scores of these attention methods on summarization, code completion, and multi-document QA tasks, with all metrics normalized.
Figure 9 and Figure 10 (Appendix C.1.3) highlight MoA's consistent performance across subtasks and various densities, contrasting with the inconsistent performance of baselines.
- Perplexity: MoA keeps perplexity within a 1% relative increase, whereas other methods see increases of 4-13%. This indicates that MoA better preserves the model's language-generation coherence.
6.1.2.3. Longer-Context Generalization
The following are the results from Table 9 of the original paper:
| Attention | Retrieve Acc. ↑ (32k) | (64k) | (128k) | (256k) |
|---|---|---|---|---|
| SnapKV | 1.00 | 0.88 | 0.71 | 0.33 |
| PyramidKV | 1.00 | 0.85 | 0.62 | 0.37 |
| MoA | 1.00 | 0.92 | 0.83 | 0.46 |
- Strong Generalizability: Even though its configuration is searched using contexts of at most 12k tokens, MoA generalizes exceptionally well to much longer contexts (32k to 256k), as shown in Figure 5(b) (from the original paper) and Table 9.
- It outperforms InfLLM and StreamingLLM by 1.9-3.3x in retrieval accuracy and 1.2-1.4x in LV-Eval scores at these extended lengths, performing comparably to the original dense model.
- MoA also consistently outperforms more recent baselines such as SnapKV and PyramidKV in retrieval accuracy at longer contexts (Table 9).
6.1.2.4. Instruction-following Generation
The following are the results from Table 10 of the original paper:
| Attention | Length-controlled Win Rate ↑ | Standard Error |
|---|---|---|
| Original | 8.84 | 0.53 |
| H2O | 9.66 | 0.55 |
| InfLLM | 5.76 | 0.42 |
| StreamingLLM | 7.96 | 0.49 |
| MoA | 9.83 | 0.57 |
- Highest Win Rate: MoA achieves the highest length-controlled win rate on the AlpacaEval 2.0 benchmark (Table 10), outperforming both the efficient-attention baselines and even the original model. This suggests that MoA's elastic design not only improves efficiency but also enhances general instruction-following capability.
6.1.3. Ablation Studies / Parameter Analysis
6.1.3.1. Impact of Heterogeneity and Elastic Rules
The following are the results from Table 3 of the original paper:
| Mask Design | Retrieval Acc. ↑ (8k) | Retrieval Acc. ↑ (16k) | PPL ↓ (8k) | PPL ↓ (12k) |
|---|---|---|---|---|
| Uniform | 0.25 | 0.15 | 4.89 | 5.19 |
| +Hetero. Layers | 0.31 | 0.26 | 4.55 | 4.85 |
| +Hetero. Heads | 0.95 | 0.41 | 3.96 | 4.30 |
| +Elastic Rules | 0.98 | 0.43 | 3.96 | 4.29 |
Table 3 presents an ablation study on the search space design, demonstrating the progressive benefits of heterogeneity and elasticity at a consistent 25% density.
- Uniform Baseline: A basic uniform sliding-window method performs poorly (Retrieval Acc. 0.25 at 8k, 0.15 at 16k; high PPL).
- Heterogeneity in Layers: Introducing distinct attention spans per layer (+Hetero. Layers) improves retrieval accuracy (0.31 at 8k) and reduces PPL.
- Heterogeneity in Heads: Additionally assigning distinct attention spans to each individual head within a layer (+Hetero. Heads) brings a significant boost in retrieval accuracy (0.95 at 8k), showing the strong impact of fine-grained heterogeneity.
- Elastic Rules: Finally, incorporating elastic rules that allow attention spans to scale with input length (+Elastic Rules) further enhances retrieval accuracy (0.98 at 8k, 0.43 at 16k) and maintains low PPL, validating the importance of adaptive attention spans (see the mask sketch after this list).
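As a concrete illustration of the progression in Table 3, the sketch below builds block-granular sliding-window masks with an attention sink. Using one shared window for every head corresponds to the uniform baseline, while giving each head its own window corresponds to the heterogeneous rows; the helper name and the per-head window values are illustrative assumptions, not the paper's kernel.

```python
import torch

def block_sliding_window_mask(seq_len: int, window_len: int,
                              block_size: int = 64, sink_blocks: int = 1) -> torch.Tensor:
    """Boolean causal mask at block granularity: each query block attends to the
    most recent key blocks inside its window plus the initial sink block(s)."""
    n_blocks = (seq_len + block_size - 1) // block_size
    w_blocks = max(1, window_len // block_size)
    q = torch.arange(n_blocks).unsqueeze(1)  # query block index
    k = torch.arange(n_blocks).unsqueeze(0)  # key block index
    causal = k <= q
    local = (q - k) < w_blocks               # recent blocks within this head's window
    sink = k < sink_blocks                   # always-visible initial blocks (attention sink)
    return causal & (local | sink)

# Uniform baseline: every head shares one window length.
uniform_masks = [block_sliding_window_mask(4096, 1024) for _ in range(32)]
# Heterogeneous (MoA-style) setup: each head gets its own window length.
hetero_windows = [512, 1024, 2048, 4096] * 8  # illustrative per-head spans
hetero_masks = [block_sliding_window_mask(4096, w) for w in hetero_windows]
```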
6.1.3.2. Calibration Dataset Design Choices
The following are the results from Table 2 of the original paper:
| Dataset | Supervision | Reference | Long Dep. | Align Model | Retrieval Acc. ↑ | PPL ↓ |
|---|---|---|---|---|---|---|
| RedPajama | Context | - | ✗ | ✗ | 0.25 | 4.95 |
| MultiNews | Context & Summary | Human | ✗/✓ | ✗ | 0.27 | 4.62 |
| MultiNews | Summary | Human | ✓ | ✗ | 0.87 | 3.97 |
| MultiNews | Summary | Model | ✓ | ✓ | 0.95 | 3.96 |
Table 2 validates MoA's calibration dataset design principles at 25% average density and 8k input length.
- Long-Range Dependencies: Moving from RedPajama (general language modeling, no long-range dependency) to MultiNews (Summary supervision, Human reference, with long-range dependency) significantly improves retrieval accuracy from 0.27 to 0.87 and reduces perplexity from 4.62 to 3.97. This highlights the importance of profiling on data that naturally features long-range dependencies.
- Model Alignment: Using summaries generated by the original dense model as supervision (MultiNews, Summary supervision, Model reference, with model alignment) further boosts retrieval accuracy from 0.87 to 0.95 while maintaining low perplexity. This confirms that aligning the supervision with the model's own response patterns is crucial for accurate attention-influence profiling (see the sketch after this list).
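The following is a minimal sketch of the calibration-data principle validated in Table 2, assuming a Hugging Face-style tokenizer; the function name is hypothetical and the paper's actual profiling pipeline may differ. The point it illustrates is that only the model-generated summary tokens carry the supervision signal, so the profiling loss exercises long-range dependencies while staying aligned with the model's own responses.

```python
import torch

def build_calibration_item(tokenizer, context: str, model_summary: str, max_len: int):
    """Concatenate a long context with a summary generated by the original dense
    model, supervising only the summary tokens (context labels are set to -100,
    the ignore index of the usual cross-entropy loss)."""
    ctx_ids = tokenizer(context, add_special_tokens=False).input_ids
    sum_ids = tokenizer(model_summary, add_special_tokens=False).input_ids
    input_ids = (ctx_ids + sum_ids)[:max_len]
    labels = ([-100] * len(ctx_ids) + sum_ids)[:max_len]
    # Pad to max_len so attention-influence tensors share a uniform shape.
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
    pad = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * pad
    labels = labels + [-100] * pad
    return torch.tensor(input_ids), torch.tensor(labels)
```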
6.1.4. MoA Configuration Analysis
6.1.4.1. Overview of Mask Density Across Layers
The following figure (Figure 11 from the original paper) shows the MoA mask density across layers for Vicuna-7B and Vicuna-13B:
The figure is a chart showing the MoA mask density across layers for the Vicuna-7B and Vicuna-13B models. Different colors indicate window lengths (2k, 4k, 8k, 16k), and the plot shows how density varies across layers.
Figure 11 (from original paper) visualizes the MoA mask density across layers for different LLMs.
- Density Distribution: Initial and middle layers generally exhibit high mask density (larger attention spans), aligning with findings on LLMs' intrinsic dimensions and layer sensitivities. These layers might be crucial for initial feature extraction and context aggregation.
- Later Layers: In the final layers, most attention heads require low density (shorter attention spans), while a few outliers still demand high density. This suggests specialization: some late-stage heads may still need broad context for final predictions, while others focus on refining local information.
6.1.4.2. Diversity within Layers
The following figure (Figure 12 from the original paper) plots the average mask density and the density range for each layer:
The figure is a scatter plot showing the relationship between average density and density range in the Vicuna-7B and Vicuna-13B models. Each point represents a different configuration, and the plot includes a linear-regression trend line with a shaded region reflecting the density trend.
Figure 12 (from original paper) plots the average mask density and the density range for each layer.
- Confirmation of Heterogeneity: Layers with a lower average density tend to display a wider density range (wider shaded areas in the plot), confirming the necessity of heterogeneous attention rules within the same layer. This reinforces MoA's approach of tailoring attention spans at the individual-head level.
6.1.4.3. Connection to Semantic Interpretation
MoA preserves the learned semantics of the dense model by statistically analyzing and optimizing attention patterns.
-
Position vs. Content-Based Attention:
Attention heads can be primarily position-based (e.g., attending to the first token or the previous token) or content-based (attending to semantically similar tokens regardless of position). Figure 2 illustrates these types.
- Quantitative Analysis (SoE): The paper quantifies position-based attention using the Standard deviation of Expectation (SoE). SoE measures how uneven the mean attention matrix is, reflecting the influence of position. The derivation uses a linear approximation of the attention value as a combination of positional influence and token influence:
$ A_{h,i,j} = f_h(P_h(i,j), T_h(t_i, t_j)) $
Linear approximation:
$ A_{h,i,j} = \alpha_h P_h(i,j) + (1 - \alpha_h) T_h(t_i, t_j) $
where $\alpha_h$ is the weight factor for position influence. The expectation of the attention value over different sentences is:
$ \mathbb{E}_t[A_{h,i,j}] = \alpha_h P_h(i,j) + (1 - \alpha_h) \mu_T, \quad \forall i > j $
where $\mu_T$ is the average token influence. The standard deviation of this expectation over different positions of the attention matrix is:
$ \sigma_p(\mathbb{E}_t[A_{h,i,j}]) = \alpha_h \delta_p $
where $\delta_p$ is the standard deviation of $P_h(i,j)$ over different positions, which is constant for all heads. This implies that the position impact $\alpha_h$ is proportional to the SoE of different heads (a minimal sketch of this computation appears after this list).
- Correlation: The following figure (Figure 13 from the original paper) shows the correlation between MoA's mask sparsity and each head's positional dependency (SoE):
The figure is a chart showing the positive correlation between MoA's mask sparsity and each head's dependency on position (SoE). The horizontal axis shows SoE (1e-2) and the vertical axis shows sparsity (%), with marginal distributions on both sides showing how sparsity varies with SoE.
Figure 13 (from original paper) shows a positive correlation between MoA's mask sparsity and a head's dependency on position (SoE). This suggests that MoA successfully captures the semantic information of the dense language model, assigning shorter windows (higher sparsity) to heads that are more position-based and longer windows (lower sparsity) to those that rely more on content.
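A minimal sketch of how SoE could be computed from attention matrices averaged over sentences follows; the tensor layout and function name are assumptions for illustration rather than the paper's code.

```python
import torch

def standard_deviation_of_expectation(attn: torch.Tensor) -> torch.Tensor:
    """SoE per head: the standard deviation, over positions (i, j) with i > j, of the
    attention matrix averaged over sentences.
    attn: [num_sentences, num_heads, seq_len, seq_len]."""
    mean_attn = attn.mean(dim=0)                 # E_t[A_{h,i,j}], shape [H, N, N]
    n = mean_attn.shape[-1]
    lower = torch.tril(torch.ones(n, n, dtype=torch.bool), diagonal=-1)  # i > j
    per_head = mean_attn[:, lower]               # [H, number of lower-triangular positions]
    return per_head.std(dim=-1)                  # one SoE value per head

# Heads with larger SoE are more position-based and, per Figure 13, tend to be
# assigned sparser (shorter) windows by MoA.
```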
6.1.4.4. Cross-Dataset Generalizability
The following figure (Figure 14 from the original paper) shows the attention matrices of the same attention head across different tasks:
The figure is an illustration of the attention matrices of attention head 29 in layer 17 of the Vicuna-7B model under different tasks (few-shot learning, chatting, and coding). Each attention matrix is averaged over 256 data items and reflects the attention pattern for that task. The same attention head exhibits a similar attention span across tasks, illustrating the cross-dataset generalizability of the method.
Figure 14 (Appendix E.1) visualizes the attention matrices of the same attention heads across different tasks (few-shot learning, chatting, coding). The observation that the same head shows a similar attention span across diverse tasks explains the robust cross-dataset generalizability of MoA. The heterogeneous elastic rules learned are intrinsic to the model's architecture and pre-training, rather than task-specific.
6.2. Efficiency
MoA demonstrates high runtime efficiency and manageable one-time search overhead.
6.2.1. Memory and Throughput Breakdown
The following are the results from Table 5 of the original paper:
| Model | Framework | Attention | Batch (4k) | Throughput (4k) | Batch (8k) | Throughput (8k) | Batch (16k) | Throughput (16k) |
|---|---|---|---|---|---|---|---|---|
| 7B | vLLM | PagedAttention | 30 | 628.8 | 15 | 323.0 | 8 | 145.5 |
| 7B | FlexGen | H2O | 20 | 754.9 | 6 | 296.3 | 1 | 51.7 |
| 7B | HuggingFace | InfLLM | 15 | 62.0 | 10 | 37.5 | 6 | 19.2 |
| 7B | HuggingFace | StreamingLLM | 50 | 945.1 | 25 | 467.3 | 12 | 232.0 |
| 7B | HuggingFace | FlashAttention2 | 30 | 134.6 | 15 | 66.9 | 8 | 32.9 |
| 7B | HuggingFace | +Static KV-Cache | 30 | 496.1 | 15 | 219.5 | 8 | 91.6 |
| 7B | HuggingFace | +Reduced Attention | 30 | 722.5 | 15 | 369.9 | 8 | 178.3 |
| 7B | HuggingFace | +Increased Batch | 50 | 897.7 | 25 | 436.7 | 12 | 206.4 |
| 7B | HuggingFace | +Kernel (=MoA) | 50 | 1099.0 | 25 | 535.7 | 12 | 257.3 |
| 13B | vLLM | PagedAttention | 16 | 314.8 | 8 | 160.5 | 4 | 71.1 |
| 13B | FlexGen | H2O | 12 | 330.2 | 4 | 138.2 | 1 | 37.4 |
| 13B | HuggingFace | InfLLM | 8 | 30.3 | 5 | 17.63 | 3 | 11.3 |
| 13B | HuggingFace | StreamingLLM | 28 | 478.4 | 14 | 241.2 | 7 | 116.5 |
| 13B | HuggingFace | FlashAttention2 | 16 | 81.3 | 8 | 40.8 | 4 | 19.8 |
| 13B | HuggingFace | +Static KV-Cache | 16 | 264.6 | 8 | 111.3 | 4 | 62.2 |
| 13B | HuggingFace | +Reduced Attention | 16 | 329.6 | 8 | 156.4 | 4 | 87.3 |
| 13B | HuggingFace | +Increased Batch | 28 | 471.5 | 14 | 222.6 | 7 | 108.3 |
| 13B | HuggingFace | +Kernel (=MoA) | 28 | 550.9 | 14 | 267.6 | 7 | 132.3 |
-
Decode Throughput: Table 5 shows that MoA achieves significant decode throughput improvements. For Vicuna-7B, throughput is 6.6-8.2x higher than FlashAttention2 and 1.7-1.9x higher than vLLM (which uses PagedAttention). Similar gains are observed for Vicuna-13B.
GPU Memory Reduction: The following are the results from Table 11 of the original paper:
| Size | Attention | Memory (GB), 4k | 8k | 16k |
|---|---|---|---|---|
| 7B | FlashAttn2 | 28.5 | 44.4 | 76.3 |
| 7B | H2O | 36.9 | OOM | OOM |
| 7B | MoA | 22.7 | 32.9 | 53.5 |
| 13B | FlashAttn2 | 36.8 | 49.2 | 74.0 |
| 13B | H2O | 40.4 | 77.9 | OOM |
| 13B | MoA | 32.0 | 39.6 | 55.0 |
Table 11 indicates that MoA reduces GPU memory usage by 1.2-1.4x compared to FlashAttention2. H2O frequently encounters Out-Of-Memory (OOM) issues at longer input lengths.
- Ablation of Efficiency Factors (Table 5 breakdown): The paper ablates MoA's efficiency improvements, building upon a HuggingFace baseline with FlashAttention2.
  - Static KV-Cache: By keeping KV-Cache tokens only within each head's defined span, MoA prevents KV-Cache growth, eliminating reallocation overhead and significantly reducing memory (a sketch of this per-head cache idea follows this list).
  - Reduced Attention Computation: The sparse attention (reduced density) decreases the computation and memory access required for the attention operation itself.
  - Increased Batch Size: The memory reduction from a smaller KV-Cache allows a larger batch size, which improves GPU utilization and overall throughput.
  - GPU Kernel Design: MoA uses a customized CUDA GPU kernel designed for heterogeneous attention patterns, providing further low-level optimization.
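To make the static KV-Cache factor above concrete, here is a minimal per-head sketch: it keeps the attention-sink tokens plus a fixed-size ring buffer of the most recent tokens inside that head's span, so cache memory never grows with sequence length. The class layout is an assumed illustration and is unrelated to the paper's CUDA kernel.

```python
from collections import deque

class HeadKVCache:
    """Fixed-budget KV-Cache for one attention head: sink tokens are kept forever,
    while window tokens beyond the head's span are evicted automatically."""
    def __init__(self, span: int, sink: int = 64):
        self.sink = sink
        self.sink_k, self.sink_v = [], []
        window = max(1, span - sink)
        self.window_k = deque(maxlen=window)
        self.window_v = deque(maxlen=window)

    def append(self, k, v):
        if len(self.sink_k) < self.sink:      # the first tokens form the attention sink
            self.sink_k.append(k)
            self.sink_v.append(v)
        else:                                  # older window tokens fall out of the deque
            self.window_k.append(k)
            self.window_v.append(v)

    def keys(self):
        return self.sink_k + list(self.window_k)

    def values(self):
        return self.sink_v + list(self.window_v)
```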
6.2.2. Efficiency Results for Longer Input
The following are the results from Table 12 of the original paper:
| Model Size | Framework | Attention | Min. #GPU | Total Throughput | Total Memory (GB) | Throughput per GPU |
|---|---|---|---|---|---|---|
| 7B | vLLM | PagedAttention | 2 | 30.2 | 142.0 | 15.1 |
| 7B | FlexGen | H2O | >8 | - | OOM | - |
| 7B | HuggingFace | InfLLM | 1 | 6.1 | 47.7 | 6.1 |
| 7B | HuggingFace | StreamingLLM | 1 | 19.8 | 43.9 | 19.8 |
| 7B | HuggingFace | FlashAttention2 | 2 | 4.3 | 85.6 | 2.2 |
| 7B | HuggingFace | MoA | 1 | 20.3 | 44.0 | 20.3 |
| 13B | vLLM | PagedAttention | 2 | 21.5 | 142.0 | 10.8 |
| 13B | FlexGen | H2O | >8 | - | OOM | - |
| 13B | HuggingFace | InfLLM | 1 | 4.3 | 78.6 | 4.3 |
| 13B | HuggingFace | StreamingLLM | 1 | 14.0 | 64.6 | 14.0 |
| 13B | HuggingFace | FlashAttention2 | 2 | 3.0 | 130.6 | 1.5 |
| 13B | HuggingFace | MoA | 1 | 14.7 | 63.4 | 14.7 |
At a 128k input length, MoA processes inputs using only one A100 GPU, whereas FlashAttention2 and vLLM require at least two GPUs for a single request.
- Speedup: MoA achieves a 4.7-4.9x decode speedup over FlashAttention2 (while using half the GPUs) and 1.3-1.4x higher throughput per GPU than vLLM (Table 12).
- Memory Savings: It also demonstrates a 1.9-2.1x reduction in GPU memory usage.
6.2.3. Energy Efficiency Result
The following are the results from Table 13 of the original paper:
| Framework | Attention | Energy per Token (J), 4k | 8k | 16k | Power (W), 4k | 8k | 16k |
|---|---|---|---|---|---|---|---|
| HuggingFace | FlashAttention2 | 2.98 | 5.93 | 12.1 | 350 | 354 | 359 |
| HuggingFace | MoA | 0.34 | 0.62 | 1.21 | 330 | 322 | 315 |
Table 13 shows that MoA achieves an 8.7-10x reduction in energy per output token compared to FlashAttention2, the combined effect of its much higher decode throughput and its slightly lower GPU power draw. This makes MoA not only faster but also more environmentally friendly.
6.2.4. Automatic Search Pipeline Overhead
The following are the results from Table 14 of the original paper:
| Stage | 7B LLM | 13B LLM | 70B LLM |
|---|---|---|---|
| Calibration Data Gen. | 10min | 15min | 2 × 60min |
| Profile | 20min | 2 × 25min | 8 × 210min |
| Optimize (CPU) | 30min | 25min | 100min |
| Validate | 35min | 40min | 2 × 140min |
| Total Latency | 1h 35min | 1h 45min | 8h 30min |
| Total GPU Time | 1h 5min | 1h 45min | 34h 40min |
The automatic configuration search is a one-time overhead.
- For Vicuna-7B and 13B models, the entire search pipeline completes within two hours.
- For the larger Llama3-70B model, it takes 8.5 hours of wall time and 34.7 hours of GPU time.
- Profiling is the most resource-demanding stage, requiring two A100 GPUs for a 13B model at 8k profile length. Other stages typically need a single GPU.
- The optimization phase (on CPU) is relatively quick, taking about 25 minutes for Vicuna-13B. Validation adds about 40 minutes. This overhead is manageable given the significant and lasting efficiency benefits.
The following are the results from Table 15 of the original paper:
| Stage | Complexity w.r.t. parameter size | Complexity w.r.t. dataset size |
|---|---|---|
| Calibration Dataset Gen. | Linear | Linear |
| Profile | Linear | Linear |
| Optimize | Polynomial to exponential in #heads | Irrelevant |
| Validate | Linear | Linear |
| Empirical Latency | Almost linear | Linear |
Table 15 summarizes the computational complexity of each stage with respect to model parameter size and calibration dataset size. Most stages are linear, except for optimization which is polynomial to exponential with respect to the number of heads in the theoretical worst case, but empirically efficient.
6.3. Additional Performance Analysis (Appendix C.1.1)
The following are the results from Table 6 of the original paper:
| Model | Attention | Retrieve Acc. ↑ (4k) | (8k) | (16k) | LV-Eval ↑ (16k) | LongBench ↑ (0-4k) | (4-8k) | (8-16k) | PPL ↓ (8-12k) |
|---|---|---|---|---|---|---|---|---|---|
| Vicuna-7B | StreamingLLM | 0.91 | 0.35 | 0.09 | 4.30 | 36.39 | 32.44 | 31.04 | 3.92 |
| | MoA | 1.00 | 0.97 | 0.58 | 5.67 | 38.07 | 33.80 | 31.75 | 3.78 |
| Vicuna-13B | StreamingLLM | 0.73 | 0.81 | 0.37 | 5.65 | 36.77 | 34.65 | 33.43 | 3.70 |
| | MoA | 0.99 | 0.97 | 0.42 | 5.57 | 41.85 | 39.76 | 36.06 | 3.62 |
| Llama3-8B | StreamingLLM | 1.00 | 0.83 | 0.76 | 14.89 | 42.45 | 40.62 | 43.16 | 4.51 |
| | MoA | 0.99 | 1.00 | 0.93 | 15.61 | 43.51 | 42.51 | 43.58 | 4.53 |
Table 6 (Appendix C.1.1) shows MoA's overall performance at a higher density of 75%. Even at this higher density, MoA demonstrates improved performance over the uniform attention baseline StreamingLLM across all metrics. This reinforces the effectiveness of heterogeneous elastic rules in MoA.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces Mixture of Attention Spans (MoA), a novel, training-free method designed to optimize the efficiency of Large Language Model (LLM) inference in long-context scenarios. MoA's core innovation lies in automatically tailoring heterogeneous elastic rules for sliding-window lengths to individual attention heads and layers. By precisely profiling the attention influence on model predictions using a specially designed calibration dataset (featuring long-range dependencies and model-aligned supervision), MoA intelligently allocates attention spans to preserve crucial information while aggressively pruning redundant connections.
The key findings and contributions are:
-
Enhanced Context Length and Accuracy: With the same average sliding-window length, MoA extends the effective context length of LLMs by 3.9x, leading to a 1.5-7.1x boost in retrieval accuracy over uniform-window baselines. It significantly narrows the performance gap with full attention, reducing the maximum relative performance drop on long-context understanding benchmarks from 9-36% to within 5%.
-
Superior Efficiency: MoA achieves a 1.2-1.4x GPU memory reduction and substantially improves decode throughput, by 6.6-8.2x over FlashAttention2 and 1.7-1.9x over vLLM, with minimal impact on performance. These gains stem from static KV-Cache management, reduced attention computation, increased batch sizes, and a specialized CUDA kernel.
-
Automatic and Generalizable: The proposed automatic configuration search pipeline finds optimal heterogeneous elastic rules within a few hours. The discovered configurations generalize strongly to unseen and longer contexts (up to 256k), suggesting that the learned attention patterns are intrinsic to the model.

In essence, MoA offers a practical and powerful solution for deploying LLMs efficiently in diverse long-context applications by intelligently adapting the attention mechanism to its heterogeneous nature.
7.2. Limitations & Future Work
The authors acknowledge the following limitations and suggest future research directions:
- Extremely Low-Density Budgets: MoA's current performance degrades under extremely low density budgets. This implies there is a limit to how much attention can be pruned before critical information is lost, even with heterogeneous rules.
- Dynamic MoA Method: To address the limitation at very low density budgets, the authors suggest designing a dynamic MoA method. Such a method could adapt attention spans more flexibly at runtime based on the specific content, offering finer-grained control and potentially better performance in extreme compression scenarios.
- Non-linear Elastic Rules: The current elastic rules use a simple linear function of the input length. Exploring non-linear elastic rules with bounded attention spans could offer more nuanced and potentially more optimal ways for attention spans to scale with input length.
- Adaptation to Other Compression Methods: The profiling method developed in MoA to evaluate attention influence could be adapted to evaluate the influence of weights and other activations. This could facilitate the development of other LLM compression methods, such as quantization or weight pruning, by providing a principled way to identify critical components.
7.3. Personal Insights & Critique
The Mixture of Attention Spans (MoA) paper presents a compelling and practical approach to a significant challenge in LLM deployment.
Inspirations:
- The Power of Heterogeneity: The paper brilliantly leverages the insight that attention heads are not monolithic. This heterogeneous perspective is a powerful paradigm that could be extended to other components of neural networks beyond attention mechanisms, where different parts may serve specialized functions. Recognizing and optimizing for this diversity is more effective than uniform compression.
- Data Engineering for Profiling: The emphasis on calibration dataset design, particularly using model-aligned supervision, is a crucial and often overlooked aspect of compression research. It highlights that the "data" used to analyze compression impact must be carefully constructed to reflect real-world usage and the model's inherent patterns, rather than generic datasets or human annotations that may not fully capture the model's internal logic. This concept could be transformative for other model compression or interpretability tasks.
- Interpretability of Rules: The correlation between mask sparsity and positional dependency (SoE) of attention heads provides valuable interpretability into how LLMs function and how MoA preserves their semantic structure. This kind of analysis helps build trust in compressed models and guides further research into attention patterns.
- Bridging Static Efficiency and Dynamic Adaptability: MoA cleverly combines the best of both worlds: the predictable, hardware-efficient nature of static sparse attention for inference with the performance benefits typically sought by more complex dynamic methods. The one-time search overhead is a small price for significant and lasting runtime gains.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
-
Linear Elastic Rules Simplification: While effective, the linear scaling rule might be a simplification. Real attention-span requirements might follow more complex, non-linear, or piecewise functions, especially at very long contexts or for models with diverse architectures. The current method might miss some optimal configurations if the true elasticity is non-linear.
-
Search Space for the Elastic-Rule Hyperparameters: The discrete options for the base span and expansion rate are chosen empirically. The sensitivity of the final configuration to the range and granularity of these discrete options could be explored further. A more adaptive or continuous search might yield better results but would increase search complexity.
-
Generalization to New Models/Architectures: While shown to work for Vicuna and Llama3, the robustness of the discovered heterogeneous elastic rules and the profiling methodology for vastly different LLM architectures (e.g., Mixture-of-Experts models) or pre-training objectives would be interesting to verify.
-
Overhead Scaling for Larger Models/Search Spaces: While the current search pipeline is efficient, for models with even more layers/heads or a much larger hyperparameter search space (e.g., if non-linear rules were introduced), the polynomial-to-exponential complexity of the optimization stage could become a bottleneck. Further algorithmic innovations in the search itself might be necessary.
-
Single-GPU vs. Multi-GPU Inference: The efficiency gains are primarily shown on a single A100 GPU or a small number of GPUs. While memory reduction helps multi-GPU setups, the full impact on large-scale distributed inference (e.g., with complex tensor parallelism or pipeline parallelism) might introduce new bottlenecks that need to be addressed.
-
"Attention Sink" Heuristic: The concept of attention sinks (a fixed number of initial unmasked tokens) is a heuristic. While effective, the optimal number of attention sinks might also be a parameter that could be optimized or made elastic by the MoA framework itself, rather than being a fixed constant.

Overall, MoA is a significant step forward in efficient LLM inference, demonstrating that a nuanced, data-driven approach to sparse attention can yield substantial benefits without compromising performance. Its principled methodology and impressive results make it a valuable contribution to the field.