Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
TL;DR Summary
This paper introduces Mixture of Attention Spans (MoA), which optimizes inference efficiency for large language models (LLMs) by tailoring sliding-window lengths to different attention heads and layers, significantly improving effective context length and retrieval accuracy.
Abstract
Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context scenarios. Existing methods typically employ a single window length across all attention heads and input sizes. However, this uniform approach fails to capture the heterogeneous attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose Mixture of Attention Spans (MoA), which automatically tailors distinct sliding-window length configurations to different heads and layers. MoA constructs and navigates a search space of various window lengths and their scaling rules relative to input sizes. It profiles the model, evaluates potential configurations, and pinpoints the optimal length configurations for each head. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer inputs, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over the uniform-window baseline across Vicuna-{7B, 13B} and Llama3-{8B, 70B} models. Moreover, MoA narrows the performance gap with full attention, reducing the maximum relative performance drop from 9%-36% to within 5% across three long-context understanding benchmarks. MoA achieves a 1.2-1.4x GPU memory reduction, boosting decode throughput by 6.6-8.2x and 1.7-1.9x over FlashAttention2 and vLLM, with minimal performance impact. Our code is available at: https://github.com/thu-nics/MoA
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
1.2. Authors
The paper is authored by a collaborative team from multiple institutions:
- Tianyu Fu, Haofeng Huang, Xuefei Ning, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Huazhong Yang, Yu Wang (Tsinghua University and Infinigence-AI)
- Genghan Zhang (Stanford University)
- Guohao Dai (Infinigence-AI and Shanghai Jiao Tong University)
The affiliations suggest a strong background in computer science, artificial intelligence, and potentially hardware architecture, given the focus on efficient inference.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server. While arXiv is not a peer-reviewed journal or conference in itself, it is a highly influential platform for rapid dissemination of research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers published on arXiv are often later submitted to and accepted by top-tier conferences or journals. Given the publication date (2024-06-21), it is likely a recent work that may undergo peer review in the future.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses the efficiency challenges of Large Language Models (LLMs) in processing long contexts, particularly concerning sliding-window attention methods. Existing sliding-window attention approaches typically apply a uniform window length across all attention heads and input sizes, which fails to account for the diverse attention patterns within LLMs and their varied accuracy-latency trade-offs. To overcome this, the authors propose Mixture of Attention Spans (MoA), a novel training-free method that automatically configures distinct sliding-window lengths for individual attention heads and layers. MoA constructs a search space of various window lengths and their scaling rules (how window length changes with input size), profiles the model to understand attention influences, evaluates configurations, and pinpoints optimal settings for each head. This adaptive approach reveals that some heads expand their focus for longer inputs, while others maintain a fixed local context. Experiments demonstrate that MoA significantly increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over uniform-window baselines across Vicuna and Llama3 models. It also narrows the performance gap with full attention (reducing the maximum relative performance drop from 9-36% to within 5%) on long-context understanding benchmarks. Furthermore, MoA achieves 1.2-1.4x GPU memory reduction, leading to 6.6-8.2x and 1.7-1.9x decode throughput improvements over FlashAttention2 and vLLM, respectively, with minimal performance impact.
1.6. Original Source Link
https://arxiv.org/abs/2406.14909
Publication Status: Preprint on arXiv.
PDF Link: https://arxiv.org/pdf/2406.14909v3.pdf
2. Executive Summary
2.1. Background & Motivation
Core Problem
The core problem the paper addresses is the computational and memory inefficiency of Large Language Models (LLMs) when processing very long input sequences, particularly in the context of their attention mechanism. While attention is crucial for context understanding, its computational cost grows quadratically with sequence length, and its memory footprint (especially the Key-Value Cache, or KV-Cache) also grows with it, making long-context inference prohibitive.
Importance of the Problem
The ability to handle long input contexts is vital for enhancing LLM capabilities in many real-world applications such as fact retrieval, summarization, few-shot learning, and question answering. Current sliding-window attention methods, which aim to bound this quadratic growth by limiting attention to local spans, often fail to achieve a long effective context length. This is because they typically employ a uniform, fixed-span sliding-window mask across all attention heads and input lengths.
Challenges and Gaps in Prior Research
- Heterogeneous Attention Patterns: LLMs exhibit diverse attention patterns. Some attention heads naturally focus on local contexts, while others require a broader, even global, view of the input. A uniform window length constrains global-context heads unnecessarily and over-allocates resources to local-context heads.
- Lack of Elasticity: As input length increases, different attention heads require varying degrees of attention-span expansion to maintain performance. Uniform approaches lack the heterogeneous elastic rules needed to scale attention spans differently for various heads.
- Suboptimal Performance: Existing uniform sliding-window methods, such as StreamingLLM, often struggle to extend the effective context length significantly beyond their predefined span, leading to reduced retrieval accuracy and overall performance degradation in long-context tasks.
- Calibration Dataset Misalignment: Traditional model compression methods often use general language-modeling datasets for configuration, which may not accurately reflect the impact on long-context tasks or align with the LLM's own response generation patterns.
Paper's Entry Point / Innovative Idea
The paper's innovative idea is Mixture of Attention Spans (MoA), a training-free heterogeneous sliding-window attention method. Instead of a uniform window length, MoA proposes to automatically tailor distinct sliding-window length configurations to different attention heads and layers, and to allow these lengths to elastically adapt to varying input sizes. This aims to capture the intrinsic heterogeneous attention patterns of LLMs more effectively, optimizing the accuracy-latency trade-off.
2.2. Main Contributions / Findings
The primary contributions and key findings of the Mixture of Attention Spans (MoA) paper are:
- Heterogeneous Elastic Rules for Attention Spans:
  - Contribution: MoA introduces and formulates a search space of heterogeneous elastic rules that define how each attention head's sliding-window length scales with the input length, moving beyond the uniform-window paradigm.
  - Finding: These rules significantly improve content retrieval accuracy, boosting it from 25% to 98% compared to uniform counterparts at similar densities.
- Calibration Dataset Construction for Accurate Profiling:
  - Contribution: The paper emphasizes and demonstrates the importance of a specialized calibration dataset design. MoA uses data with long-range dependencies (MultiNews summaries) and aligns the supervision with the original LLM's own responses rather than human-written summaries.
  - Finding: This data engineering is crucial for accurately profiling attention influences and the impact of compression. Using model-aligned summaries improves retrieval accuracy from 87% to 95% and significantly reduces perplexity.
- Automatic Configuration Search Pipeline:
  - Contribution: MoA provides an automatic pipeline that efficiently searches for the optimal heterogeneous elastic rules for all attention heads. It combines gradient-based profiling, which quantifies attention influence, with multi-objective optimization that minimizes prediction loss under density constraints.
  - Finding: The pipeline finds optimal configurations within hours (e.g., 2 hours for Vicuna-13B and 8.5 hours for Llama3-70B), making the method practical.
- Significant Performance Improvements:
  - Finding: MoA increases the effective context length by 3.9x at the same average sliding-window length.
  - Finding: It boosts retrieval accuracy by 1.5-7.1x over uniform-window baselines across Vicuna-{7B, 13B} and Llama3-{8B, 70B}.
  - Finding: MoA reduces the maximum relative performance drop on long-context understanding benchmarks (LV-Eval, LongBench) from 9-36% (baselines) to within 5% of full attention. At 25% density, MoA achieves over 90% retrieval accuracy, outperforming baselines that require 75-100% density.
- Substantial Efficiency Gains:
  - Finding: MoA achieves a 1.2-1.4x GPU memory reduction.
  - Finding: It boosts decode throughput by 6.6-8.2x over FlashAttention2 and by 1.7-1.9x over vLLM at 50% density, with minimal performance impact. These gains come from a static KV-Cache size, reduced attention computation, larger feasible batch sizes, and a specialized CUDA kernel implementation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the Mixture of Attention Spans (MoA) paper, it's essential to grasp several foundational concepts related to Large Language Models (LLMs) and their core attention mechanism.
3.1.1. Large Language Models (LLMs)
Large Language Models (LLMs) are a type of artificial intelligence model designed to understand, generate, and process human language. They are typically based on the Transformer architecture and trained on vast amounts of text data, allowing them to learn complex patterns and relationships within language. LLMs have demonstrated remarkable capabilities in various tasks, including text generation, summarization, translation, and question answering.
3.1.2. Attention Mechanism
The attention mechanism is a core component of Transformer models (Vaswani et al., 2017). It allows the model to weigh the importance of different parts of the input sequence when processing each token. Instead of processing input sequentially or relying on fixed-size context windows, attention enables the model to "attend" to relevant information across the entire input, regardless of its position. This is crucial for capturing long-range dependencies in text.
3.1.3. Multi-Head Self-Attention (MHA)
Multi-Head Self-Attention (MHA) is an extension of the basic attention mechanism. In MHA, the attention operation is performed multiple times in parallel, each with different learned linear projections (attention heads). The idea is that each "head" can learn to focus on different types of relationships or aspects of the input sequence, thereby enriching the model's ability to capture diverse information. The outputs from all heads are then concatenated and linearly transformed to produce the final output.
The computation for a single attention head involves Query (Q), Key (K), and Value (V) matrices.
Given an input sequence, it is transformed into Query (Q), Key (K), and Value (V) matrices through linear projections. For a sequence of length $N$ and embedding dimension $d$, these matrices are of shape $N \times d_k$, $N \times d_k$, and $N \times d_v$ respectively (where $d_k$ and $d_v$ are typically $d / H$ for a model with $H$ heads).
The attention is computed as:
$ \mathbf{S} = \mathbf{Q}\mathbf{K}^T $
This calculates the attention scores (or logits) between each query token and all key tokens.
$ \mathbf{A} = \mathrm{softmax}(\mathbf{S} + \mathbf{M}) $
The softmax function normalizes the scores, turning them into probabilities, and a causal mask ($\mathbf{M}$) is applied. The causal mask ensures that when predicting the next token, a token can only attend to previous tokens and itself, not future tokens. This is essential for autoregressive models (like LLMs) that generate text token by token.
$ \mathbf{O} = \mathbf{A}\mathbf{V} $
Finally, the attention probabilities ($\mathbf{A}$) are multiplied by the Value matrix ($\mathbf{V}$) to produce the output ($\mathbf{O}$), which is a weighted sum of the value vectors.
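To make the three equations concrete, here is a minimal single-head sketch in PyTorch (omitting the usual $1/\sqrt{d_k}$ scaling, mirroring the formulas above); the tensor shapes and toy sizes are illustrative only:

```python
import torch

def causal_attention(Q, K, V):
    """S = Q K^T, A = softmax(S + M) with a causal mask M, O = A V (single head)."""
    N = Q.shape[0]                                                  # sequence length
    S = Q @ K.T                                                     # attention scores, shape (N, N)
    M = torch.triu(torch.full((N, N), float("-inf")), diagonal=1)   # causal mask: block future tokens
    A = torch.softmax(S + M, dim=-1)                                # row-wise attention probabilities
    return A @ V                                                    # weighted sum of value vectors

# Toy example: 8 tokens with head dimension 16.
Q, K, V = (torch.randn(8, 16) for _ in range(3))
print(causal_attention(Q, K, V).shape)  # torch.Size([8, 16])
```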
3.1.4. Key-Value Cache (KV-Cache)
During autoregressive inference (i.e., generating text token by token), the Key and Value matrices for previously processed tokens remain constant. Instead of recomputing them at each generation step, they can be stored and reused. This stored information is called the KV-Cache. The KV-Cache significantly speeds up the decoding process but can consume a large amount of GPU memory, especially for long contexts and large batch sizes. Its memory consumption scales linearly with the sequence length and the batch size, as well as with the number of layers, the number of heads, and the head dimension of the model.
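The sketch below (hypothetical, single head, no scaling) illustrates why the cache helps: at each decode step only the new token's key and value are computed and appended, while all previously cached keys and values are reused:

```python
import torch

def decode_step(q_new, k_new, v_new, kv_cache):
    """One decode step: append the new token's K/V to the cache and attend over the full cache."""
    kv_cache["K"] = torch.cat([kv_cache["K"], k_new], dim=0)   # (T+1, d)
    kv_cache["V"] = torch.cat([kv_cache["V"], v_new], dim=0)   # (T+1, d)
    scores = q_new @ kv_cache["K"].T                           # (1, T+1): new query vs. all cached keys
    probs = torch.softmax(scores, dim=-1)
    return probs @ kv_cache["V"]                               # (1, d) output for the new token

d = 16
kv_cache = {"K": torch.randn(5, d), "V": torch.randn(5, d)}    # cache state after a 5-token prefill
out = decode_step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d), kv_cache)
print(kv_cache["K"].shape)  # torch.Size([6, 16]): the cache grows by one entry per generated token
```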
3.1.5. Autoregressive Inference (Prefill and Decode Stages)
Autoregressive inference in LLMs typically involves two stages:
- Prefill Stage: The model processes the entire input prompt (e.g., a query or an incomplete sentence) in parallel to generate the initial output token. In this stage, attention usually computes interactions between all prompt tokens.
- Decode Stage: After the first token is generated, the model iteratively generates subsequent tokens one by one. For each new token, it uses the newly generated token and the KV-Cache from all preceding tokens (including the prompt and previously generated tokens) to predict the next token. This iterative process is where the KV-Cache is most critical but also where its memory demands become significant.
3.1.6. Sliding-Window Attention
Sliding-window attention is a technique used to mitigate the quadratic memory and computational cost of full attention for long sequences. Instead of allowing each token to attend to all previous tokens, it restricts attention to a fixed-size window of recent tokens. This means a token can only "see" and interact with tokens within its immediate historical window, plus often a few initial "global" tokens (known as attention sinks) that are always visible to all tokens. This approach makes the memory and computation costs scale linearly with sequence length (within the window), making it more hardware-efficient.
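A minimal sketch of such a mask (boolean, True = attend), combining a causal constraint, a local window, and a few always-visible sink tokens; the window and sink sizes below are arbitrary illustrations:

```python
import torch

def sliding_window_mask(seq_len, window, num_sink):
    """Each query attends to the `num_sink` initial tokens plus its `window` most recent
    tokens (itself included), and never to future tokens."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    causal = k <= q
    local = (q - k) < window                 # inside the sliding window
    sink = k < num_sink                      # always-visible initial tokens (attention sinks)
    return causal & (local | sink)

print(sliding_window_mask(seq_len=10, window=3, num_sink=2).int())
```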
3.2. Previous Works
The paper discusses several categories of efficient attention methods and LLM acceleration frameworks, highlighting their strengths and limitations compared to MoA.
3.2.1. Dynamic Sparse Attention
This branch of work attempts to dynamically skip attention computations or prune the KV-Cache based on the input sequence content.
- Examples: Pagliardini et al., 2023; Qu et al., 2022; Roy et al., 2021; Wang et al., 2021; Lu et al., 2021; Kitaev et al., 2020 for prefill; Anagnostidis et al., 2023; Zhang et al., 2023 (H2O); Ge et al., 2023; Sheng et al., 2023; Liu et al., 2023; Li et al., 2024c (SnapKV); Cai et al., 2024 (PyramidKV); Tang et al., 2024 for decode.
- Characteristics: These methods adapt sparsity patterns to the input.
- Limitations: They often require specific hardware for wall-time speedups (dynamic prefill), or extensive retraining, additional pruning-score computation, or memory swapping for KV-Cache retrieval (dynamic decode).
- H2O (Heavy-hitter Oracle): A dynamic method that identifies and preserves "heavy hitter" (most important) tokens and recent tokens in the KV-Cache. It uses dense prefill to calculate token importance.
- InfLLM: A dynamic KV-Cache compression method that unveils the intrinsic capacity of LLMs for long sequences with training-free memory. It uses sparse prefill.
- SnapKV & PyramidKV: More recent dynamic KV-Cache compression methods.
3.2.2. Static Sparse Attention
These methods apply predefined attention masks consistently across all input sequences, leading to fixed computation flows that are generally more efficient and GPU-friendly.
- Examples: Zaheer et al., 2020 (BigBird); Beltagy et al., 2020 (Longformer); Child et al., 2019; Zhou et al., 2024.
- For Generative LLMs: The predominant method is the uniform sliding-window mask with global attention on a few initial tokens (attention sinks).
  - StreamingLLM (Xiao et al., 2024c): A prominent example of a uniform sliding-window attention method that uses attention sinks and fixed window sizes. This is a primary baseline for MoA.
- Limitations: Uniform sliding-window masks are model- and data-agnostic. This can severely compromise LLMs' effective context length and lead to suboptimal performance in long-sequence scenarios, as they fail to account for heterogeneous attention patterns across heads and layers.
3.2.3. LLM Acceleration Frameworks and Kernel Optimizations
These works focus on system-level and low-level optimizations.
- Frameworks: DeepSpeed-Inference (Aminabadi et al., 2022), Accelerate (Gugger et al., 2022), FlexGen (Sheng et al., 2023), vLLM (Kwon et al., 2023).
  - vLLM: An efficient inference framework for LLMs that uses PagedAttention for memory management.
- Kernel Optimizations: FlashAttention (Dao et al., 2022), FlashAttention-2 (Dao, 2024), and FlashAttention-3 (Shah et al., 2024). These provide highly optimized CUDA kernels for attention computation.
  - FlashAttention2: A faster and more memory-efficient exact attention algorithm.
- Relationship to MoA: These optimizations are often orthogonal to MoA's approach and can be integrated to further enhance efficiency. MoA leverages a custom CUDA kernel, similar in spirit to FlashAttention, but designed for heterogeneous sliding-window attention.
3.3. Technological Evolution
The field of attention mechanisms in LLMs has evolved from computationally expensive full attention to various efficient approximations, driven by the need to handle increasingly long contexts.
- Full Attention (Vaswani et al., 2017): The original Transformer model introduced self-attention, which allows each token to attend to all other tokens. While powerful, its quadratic computational and memory cost for the attention scores, together with a KV-Cache that grows with sequence length, makes it impractical for very long sequences.
- Early Sparse Attention (Child et al., 2019; Kitaev et al., 2020): Researchers started exploring ways to sparsify the attention matrix to reduce costs. Methods like Sparse Transformers and Reformer introduced fixed patterns (e.g., local windows, strided attention, LSH attention) to limit attention to a subset of tokens.
- Context-Extending Sparse Attention (Beltagy et al., 2020; Zaheer et al., 2020): Longformer and BigBird further developed sparse attention by combining local sliding windows with a few global attention tokens (often at the beginning or at specific positions) to ensure some long-range information flow while maintaining linear complexity. This laid the groundwork for methods like StreamingLLM.
- Dynamic Sparse Attention (Roy et al., 2021; Anagnostidis et al., 2023; Zhang et al., 2023): This trend focused on making attention patterns adaptive to the input, identifying and attending to the most important tokens dynamically. While potentially more accurate, these methods often introduce complex control flow or additional overhead for importance scoring.
- Hardware-Efficient Kernel Optimizations (Dao et al., 2022; Kwon et al., 2023): In parallel to algorithmic innovations, significant work has gone into optimizing the underlying GPU kernels (FlashAttention, vLLM's PagedAttention) to make existing attention computations faster and more memory-efficient, especially during autoregressive inference.
- MoA's Position: MoA builds upon the foundation of static sliding-window attention (like StreamingLLM) but introduces heterogeneity and elasticity to address its limitations. It recognizes that a one-size-fits-all window length is suboptimal due to diverse attention patterns across attention heads and layers. By profiling these patterns and automatically tailoring window lengths, MoA aims to combine the efficiency of static methods with the nuanced, performance-aware adaptation typically sought by dynamic methods, but without their runtime overheads. It can also integrate with kernel optimizations.
3.4. Differentiation Analysis
Compared to the main methods in related work, MoA introduces several core differences and innovations:
- From Uniform to Heterogeneous and Elastic Attention Spans:
  - Prior Static Methods (e.g., StreamingLLM, Longformer, BigBird): These methods typically employ a single, uniform sliding-window length across all attention heads and layers. While efficient, this rigid approach fails to capture the distinct needs of different heads.
  - MoA's Innovation: MoA is the first to automatically tailor distinct sliding-window length configurations to each individual attention head and layer. Furthermore, it introduces elastic rules that allow these attention spans to scale differently with varying input lengths. This is a fundamental shift from uniform to heterogeneous and adaptive attention-span management.
- Training-Free Optimization Based on Attention Influence:
  - Prior Dynamic Methods (e.g., H2O, InfLLM, SnapKV, PyramidKV): These methods dynamically prune the KV-Cache or skip computations during inference based on token importance, often requiring additional runtime computation for scoring or extensive retraining.
  - MoA's Innovation: MoA's configuration search is entirely training-free. It uses gradient-based profiling to quantify the influence of each attention value on the model's prediction loss. This allows a principled, data-driven optimization of attention spans without altering model weights or incurring dynamic runtime overhead. Once configured, MoA runs with static, hardware-friendly attention masks.
- Principled Calibration Dataset Design:
  - Prior Compression Methods: Often rely on general language modeling datasets and human responses for calibration, which may not align well with long-context tasks or the model's inherent attention patterns.
  - MoA's Innovation: MoA explicitly highlights and addresses the importance of data engineering. It uses calibration datasets with long-range dependencies and, crucially, leverages the original LLM's responses as supervision targets. This ensures that the attention influence profiling accurately reflects what matters for the original model's performance on long-context tasks.
- Enhanced Performance-Efficiency Pareto Front:
  - Prior Efficient Methods: Often face a trade-off in which significant efficiency gains come with noticeable performance degradation (e.g., reduced effective context length, lower retrieval accuracy).
  - MoA's Innovation: MoA significantly improves the accuracy-throughput Pareto front. It achieves throughput and memory-efficiency gains comparable to or exceeding highly optimized baselines (FlashAttention2, vLLM) while maintaining performance very close to full attention (e.g., under a 5% drop on understanding benchmarks and much higher retrieval accuracy at lower densities). This demonstrates a superior accuracy-latency trade-off.

In essence, MoA marries the hardware efficiency of static sliding-window attention with a sophisticated, data-driven understanding of heterogeneous attention patterns to achieve both high performance and high efficiency in long-context LLM inference.
4. Methodology
4.1. Principles
The core idea behind Mixture of Attention Spans (MoA) stems from the observation that not all attention heads within an LLM behave uniformly. Instead, they exhibit heterogeneous attention patterns – some heads focus on very local contexts, while others require a broader, even global, view of the input sequence. Furthermore, these patterns can be elastic, meaning their optimal attention span might change as the overall input length varies.
The theoretical basis and intuition behind MoA are as follows:
- Heterogeneity is Key: Applying a single, uniform sliding-window length across all attention heads is suboptimal. It either excessively restricts global-context heads (leading to performance degradation) or over-allocates resources to local-context heads (leading to inefficiency). By tailoring attention spans to individual heads, MoA preserves crucial long-range dependencies where needed while aggressively pruning redundant attention elsewhere.
- Elasticity for Generalization: A fixed attention span for a given head might work for one input length but fail for another. Recognizing that different heads "stretch" their focus at different rates as input length increases (elastic rules) allows MoA to generalize better to unseen or longer contexts without re-profiling.
- Data-Driven Optimization: Instead of arbitrary heuristic choices for attention spans, MoA uses a principled, gradient-based profiling approach. It quantifies the influence of each attention value on the model's prediction loss, providing an objective measure of which attention connections are critical and should be preserved (assigned a longer window) and which are less important and can be masked (assigned a shorter window).
- Hardware-Friendly Static Masks: While the configuration search is adaptive, the resulting attention masks for inference are static. MoA therefore avoids the runtime overhead and complex control flow of dynamic sparse attention methods, allowing an efficient CUDA kernel implementation and predictable performance.
- Optimized Trade-off: The goal is to find an optimal MoA configuration (a set of heterogeneous elastic rules) that minimizes the prediction loss (maintaining accuracy) while adhering to density constraints (improving efficiency) across various input lengths.

By implementing these principles, MoA aims to achieve a superior accuracy-latency trade-off for LLM inference in long-context scenarios.
4.2. Core Methodology In-depth (Layer by Layer)
The Mixture of Attention Spans (MoA) methodology is structured around identifying and applying heterogeneous elastic rules for attention spans across different attention heads and layers. This involves defining a search space, profiling attention influence, and then automatically optimizing the configuration.
4.2.1. Mixture of Attention Patterns and Elastic Rules
The foundation of MoA is the empirical observation of heterogeneous attention patterns and elastic rules within pre-trained LLMs.
- Heterogeneous Attention Patterns: The paper highlights that different attention heads within an LLM do not attend in the same way. For instance, as shown in Figure 2, one attention head (e.g., layer 2, head 5) might primarily focus on local contexts (a narrow-span sliding window), while another (e.g., layer 0, head 21) might cover nearly the entire input sequence (global attention). These attention spans tend to be consistent across different tasks and datasets (as shown in Appendix E.1, Figure 14).
  - Implication: A uniform sliding-window attention mask applied to all heads will either unduly restrict global-context heads or waste computation and memory on local-context heads.
- Heterogeneous Elastic Rules: Beyond static heterogeneity, attention heads also exhibit elastic behavior, meaning their optimal attention span can change differently as the input length varies. For example, some heads that show global attention for shorter inputs might maintain a medium-span local focus for longer inputs, while others continue to expand their focus globally (Figure 2). Table 1 further demonstrates that applying the same sliding-window mask across layers can lead to a large variance in retrieval accuracy depending on the input length.
  - Implication: A single fixed scaling rule for the attention span across all heads is insufficient. MoA needs to identify heterogeneous elastic rules that govern how each head's attention span adapts to changing input lengths.
4.2.2. Heterogeneous Elastic Rule Search Space
MoA constructs a search space for these heterogeneous elastic rules.
- Base Attention Mask: MoA adopts a hardware-friendly sliding-window mask (similar to Beltagy et al., 2020) as its base. This mask includes:
  - A sliding-window span: a contiguous diagonal region that indicates the recent tokens each head can attend to.
  - A fixed prefix of unmasked tokens (attention sink): initial tokens (64 tokens for MoA) that are always visible to all attention heads, regardless of the sliding-window length.
  - The total attention span ($S$) of a head is the sliding-window span plus the number of initially unmasked tokens.
- Defining Elastic Rules: MoA defines the attention span of a specific attention head $h$ at input length $N$ using a simple linear function (a small code sketch follows Figure 3 below): $ S_h = \alpha_h + \beta_h \cdot N $ Here,
  - $\alpha_h$: a hyperparameter that controls the base span of attention head $h$.
  - $\beta_h$: a hyperparameter that controls how fast the attention span of head $h$ expands as the input length increases.
  - Each attention head (across all layers) gets its own $(\alpha_h, \beta_h)$ pair, chosen from multiple discrete options. By default, MoA uses six options for $\alpha$ and nine options for $\beta$.
- Search Space Size: For an LLM with many attention heads and layers (e.g., a 7B model with 32 heads and 32 layers, totaling 1,024 attention heads), the combination of $(\alpha, \beta)$ pairs leads to an enormous search space (e.g., $54^{1024}$ possible configurations). This necessitates an automatic pipeline to efficiently find the optimal $\alpha_h$ and $\beta_h$ for each attention head.

The following figure (Figure 3 from the original paper) provides an overview of MoA, including its search space and automatic rule search pipeline:
Figure: Overview of MoA. (a) The search space covers diverse elastic rules and their corresponding sliding-window masks; (b) the automatic rule search pipeline starts from a calibration dataset with long-range dependencies and supervision text generated by the original dense LLM, profiles the influence of each attention value on the model's prediction, and finally optimizes each attention head's elastic rule to minimize the prediction loss.
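As referenced above, here is a minimal sketch of the elastic rule itself; the concrete $(\alpha, \beta)$ values are invented for illustration and are not the searched values from the paper:

```python
def attention_span(alpha, beta, input_len):
    """Elastic rule S_h = alpha_h + beta_h * N, clipped to [0, N]."""
    return int(max(0, min(alpha + beta * input_len, input_len)))

# Two hypothetical heads: a local head (beta = 0) and an expanding head (beta > 0).
rules = {"local_head": (256, 0.0), "expanding_head": (64, 0.5)}
for name, (alpha, beta) in rules.items():
    print(name, {N: attention_span(alpha, beta, N) for N in (2048, 4096, 8192)})
# local_head keeps a fixed 256-token span; expanding_head grows with the input length.
```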
4.2.3. Automatic MoA Configuration Search Pipeline
MoA employs an automatic configuration search pipeline (Figure 3(b)) to find the optimal heterogeneous elastic rules. The pipeline consists of two main steps, Attention Influence Profiling and Automatic Optimization, and relies on a carefully designed Calibration Dataset.
4.2.3.1. Calibration Dataset Construction (Section 5)
Before profiling, MoA emphasizes the design of a specialized calibration dataset and supervision objectives.
- Problem with Current Approach: General language modeling datasets (e.g., RedPajama) and human-written summaries often lack long-context dependencies and introduce misalignment between the model's internal representation and human responses. This can lead to inaccurate attention influence profiling.
- MoA's Approach:
  - Long-Range Dependencies: MoA uses the MultiNews dataset (Fabbri et al., 2019), whose summaries require a deep understanding of long-range content.
  - Model Alignment: Crucially, the summaries used for supervision are generated by the original dense model itself, rather than written by humans. This ensures higher alignment between the model's internal attention patterns and the supervision signal, leading to more accurate attention influence quantification.
  - Loss Calculation: The cross-entropy loss is calculated using these model-generated summaries as the reference, facilitating accurate profiling for the MoA configuration search.
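A minimal sketch of how such model-aligned supervision could be produced with Hugging Face transformers; the model name, prompt template, and generation settings are illustrative assumptions, not the paper's exact recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"   # any dense LLM to be compressed
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def build_calibration_item(document):
    """Pair a long MultiNews-style document with a summary written by the dense model itself,
    so the supervision is aligned with the model's own behavior."""
    prompt = document + "\n\nSummarize the articles above:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    summary = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # During profiling, the cross-entropy loss is computed only on the summary tokens.
    return {"context": prompt, "supervision": summary}
```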
4.2.3.2. Attention Influence Profiling
This step quantifies the impact of individual attention values on the LLM's final prediction loss. It is crucial for understanding which attention connections are critical and should be preserved.
- Influence Derivation: The influence of each attention value $A_{i,j}$ is derived from the attention matrix and its gradient, computed over the calibration dataset.
- Approximation of Loss Change: When applying sliding-window attention masks, MoA approximates the change in the model's prediction loss ($\Delta L$) using a first-order Taylor expansion, an approach also used by other LLM compression methods: $ \Delta L = \sum_h \sum_i \sum_j \frac{\partial L}{\partial A_{h,i,j}} \cdot \Delta A_{h,i,j} $ Here, $h$ indexes attention heads across all layers, and `i, j` are the row and column indices within each attention matrix.
- Attention Influence Matrix ($E$): The attention influence matrix (for a single head, omitting $h$ for simplicity) is defined as the estimated change in loss if the attention value $A_{i,j}$ is masked (set to zero). This accounts for two effects:
  - Direct Effect: Setting $A_{i,j}$ to zero directly removes its contribution: $\Delta A_{i,j} = -A_{i,j}$.
  - Indirect Effect (Softmax Normalization): The softmax function normalizes each row of the attention matrix to sum to one. Thus, setting $A_{i,j}$ to zero causes a proportional increase in the other attention values ($A_{i,n}$ for $n \neq j$) within the same row.

  These two effects are integrated into the following formulation (the derivation is provided in Appendix E.2 of the original paper): $ E_{i,j} = \sum_n \frac{\partial L}{\partial A_{i,n}} \cdot \Delta A_{i,n|j} = \frac{\partial L}{\partial A_{i,j}} \cdot (-A_{i,j}) + \sum_{n \neq j} \frac{\partial L}{\partial A_{i,n}} \cdot A_{i,n} \cdot \frac{A_{i,j}}{1 - A_{i,j}} $ Where:
  - $E_{i,j}$: the attention influence of the attention value at row $i$, column $j$; it quantifies how much the prediction loss would change if this specific attention value were masked.
  - $\frac{\partial L}{\partial A_{i,j}}$: the gradient of the total prediction loss with respect to the attention value $A_{i,j}$, indicating how sensitive the loss is to changes in $A_{i,j}$.
  - $\Delta A_{i,n|j}$: the change in attention value $A_{i,n}$ if $A_{i,j}$ is masked.
  - $A_{i,j}$: the attention value at row $i$, column $j$.
  - The first term captures the direct loss change from zeroing out $A_{i,j}$.
  - The second term captures the indirect loss change due to the redistribution of attention weights to the other tokens in the same row after $A_{i,j}$ is masked.
- Practical Calculation: In practice, backpropagation on the calibration dataset is used to calculate the average attention influence $\bar{E}_{h,i,j}$ for each head, across multiple data items and different input lengths. The gradient is computed with a standard deep learning framework such as PyTorch (see the sketch at the end of this subsection).
- Accuracy Loss for a Rule: After obtaining $\bar{E}$, MoA calculates the accuracy loss ($\Delta L$) of applying a candidate elastic rule $r_h$ at a specific input length. This loss is the sum of the masked average attention influences according to the rule: $ \Delta L = \sum_h \Delta L_{h,r_h} = \sum_h \sum_i \sum_j M_{r_h,i,j} \cdot \bar{E}_{h,i,j} $ Where:
  - $\Delta L$: the total estimated accuracy loss across all heads.
  - $\Delta L_{h,r_h}$: the accuracy loss of head $h$ under rule $r_h$.
  - $M_{r_h,i,j}$: a binary mask for head $h$ corresponding to rule $r_h$; masked positions are 1, unmasked positions are 0.
  - $\bar{E}_{h,i,j}$: the average attention influence of head $h$ at position `(i, j)`.

This profiling step generates a map of how different elastic rules impact accuracy across attention heads and input lengths.
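A minimal PyTorch sketch of the influence formula above for one head; `A` is the attention matrix kept from a calibration forward pass and `dL_dA` its gradient from backpropagation (the `eps` term is added here purely for numerical stability):

```python
import torch

def attention_influence(A, dL_dA, eps=1e-6):
    """E[i, j] = dL/dA[i, j] * (-A[i, j])
                 + sum_{n != j} dL/dA[i, n] * A[i, n] * A[i, j] / (1 - A[i, j])."""
    direct = -dL_dA * A                                          # direct effect of zeroing A[i, j]
    row_sum = (dL_dA * A).sum(dim=-1, keepdim=True)              # sum_n dL/dA[i, n] * A[i, n]
    redistribution = (row_sum - dL_dA * A) * A / (1 - A + eps)   # exclude n == j, softmax re-normalization
    return direct + redistribution

# Usage sketch: retain A's gradient (A.retain_grad()) during the calibration forward pass,
# call loss.backward(), then average attention_influence(A.detach(), A.grad) over data items.
```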
4.2.3.3. Automatic Optimization
This stage automatically selects the optimal elastic rule for each attention head.
- Multi-objective Optimization: MoA formulates the problem as a multi-objective optimization to find a set of Pareto-optimal configurations across all profiled lengths. The objective is to minimize the total accuracy loss while adhering to user-defined density constraints: $ \underset{r_h \in \mathbb{R}}{\arg \min} \ \Delta L^{(N_i)}, \ N_i \in \mathbb{N}_{\mathrm{profile}} \quad \mathrm{s.t.} \ \frac{1}{H} \sum_{h=1}^{H} d_{r_h}^{(N_i)} \leq d_{\mathrm{constr}}^{(N_i)}, \ \forall N_i \in \mathbb{N}_{\mathrm{constr}} $ Where:
  - $r_h$: the elastic rule of head $h$, chosen from the set of candidate rules $\mathbb{R}$.
  - $\Delta L^{(N_i)}$: the accuracy loss (from Equation 4) at input length $N_i$.
  - $\mathbb{N}_{\mathrm{profile}}$: the set of input lengths at which profiling was performed.
  - $d_{r_h}^{(N_i)}$: the density of rule $r_h$ for head $h$ at input length $N_i$; density is the ratio of active attention connections to all possible connections.
  - $d_{\mathrm{constr}}^{(N_i)}$: the user-defined density constraint at input length $N_i$.
  - $H$: the total number of attention heads.
  - $\mathbb{N}_{\mathrm{constr}}$: the set of lengths subject to density constraints.
- Mixed-Integer Programming (MIP) Formulation: This problem corresponds to a multi-objective mixed-integer programming problem, which can be solved with standard solvers such as Gurobi (a small open-source sketch is given at the end of this subsection). The detailed formulation (from Appendix E.3.1) introduces a binary variable $X_{h,r_h}$ indicating whether rule $r_h$ is selected for head $h$. The single-length problem $ \underset{\phi}{\mathrm{argmin}} \ \Delta L = \sum_h \Delta L_{h,r_h} \quad \mathrm{s.t.} \ \frac{1}{H} \sum_h d_{r_h} \leq d_{\mathrm{constr}} $ is transformed into: $ \underset{X_{h,r_h}}{\mathrm{argmin}} \ \frac{1}{H} \sum_{h=0}^{H-1} \sum_{r_h=0}^{R_h-1} \Delta L_{h,r_h} X_{h,r_h} \quad \mathrm{s.t.} $ $ \sum_{r_h=0}^{R_h-1} X_{h,r_h} = 1, \quad h \in \{0, \cdots, H-1\} $ $ \frac{1}{H} \sum_{h=0}^{H-1} \sum_{r_h=0}^{R_h-1} d_{r_h} X_{h,r_h} \leq d_{\mathrm{constr}} $ $ X_{h,r_h} \in \{0, 1\}, \quad \forall h \in \{0, \cdots, H-1\}, \ \forall r_h \in \mathbb{R} $ Where:
  - $X_{h,r_h}$: a binary variable that is 1 if rule $r_h$ is selected for head $h$, and 0 otherwise.
  - $\Delta L_{h,r_h}$: the accuracy loss of head $h$ with rule $r_h$.
  - $d_{r_h}$: the density of rule $r_h$.
  - $R_h$: the number of candidate rules for head $h$.
  - The first constraint ensures that exactly one rule is selected per head.
  - The second constraint enforces the average density budget.
  - Additional constraints limit the number of distinct rules per model layer for inference-time efficiency.
- Generalization to Multiple Lengths (Appendix E.3.2, Figure 15): To handle the multi-objective MIP across various lengths, MoA uses an epsilon-constraint method (Yv et al., 1971). This involves:
  - Performing single-objective optimization for each profiled length as the primary objective and recording the outcomes of all other objectives, which establishes the range of possible losses for each objective.
  - Iterating through each objective again, treating the other objectives as constraints by partitioning their loss ranges into uniform intervals. This generates a comprehensive set of candidate configurations.
  - Finally, removing configurations that do not satisfy the Pareto-front requirements, yielding the final set of Pareto-optimal solutions.
- Final Configuration Selection: From the set of Pareto-optimal MoA configurations, the one that yields the minimum loss at an unseen validation length (12k in the experiments) is chosen as the optimal solution. This step helps ensure good generalization to lengths beyond those used for profiling.

This pipeline effectively automates the discovery of complex, heterogeneous elastic rules that optimize both performance and efficiency for LLM inference in long-context scenarios.
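A minimal sketch of the single-length rule-selection MIP, using the open-source PuLP solver as a stand-in for the Gurobi setup mentioned above; the per-layer rule-count constraint and the multi-length epsilon-constraint procedure are omitted, and the toy numbers are invented:

```python
import pulp

def select_rules(loss, density, d_constr):
    """loss[h][r], density[h][r]: profiled accuracy loss and density of rule r for head h.
    Selects exactly one rule per head, minimizing total loss under an average-density budget."""
    H, R = len(loss), len(loss[0])
    prob = pulp.LpProblem("moa_rule_selection", pulp.LpMinimize)
    x = [[pulp.LpVariable(f"x_{h}_{r}", cat="Binary") for r in range(R)] for h in range(H)]
    prob += pulp.lpSum(loss[h][r] * x[h][r] for h in range(H) for r in range(R))      # objective
    for h in range(H):
        prob += pulp.lpSum(x[h][r] for r in range(R)) == 1                            # one rule per head
    prob += pulp.lpSum(density[h][r] * x[h][r]
                       for h in range(H) for r in range(R)) <= d_constr * H           # density budget
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [next(r for r in range(R) if pulp.value(x[h][r]) > 0.5) for h in range(H)]

# Toy example: 2 heads, 3 candidate rules each, 50% average-density budget.
print(select_rules(loss=[[0.9, 0.4, 0.1], [0.8, 0.3, 0.05]],
                   density=[[0.25, 0.5, 1.0], [0.25, 0.5, 1.0]], d_constr=0.5))
```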
5. Experimental Setup
5.1. Datasets
The authors use a comprehensive set of datasets to evaluate MoA's performance across various long-context capabilities, including retrieval, understanding, and coherence. A specialized calibration dataset is also used for MoA's internal profiling.
5.1.1. Long-Context Retrieval
- LongEval (Li et al., 2023a): This benchmark tests key-value retrieval accuracy. The model is given a lengthy context containing many key-value pairs (e.g., line <index>: REGISTER_CONTENT is <value>) and is then asked to retrieve the value for a specific key.
  - Scale: 100 data items per length level.
  - Purpose: To quantify how well the model can recall specific facts embedded deep within long input sequences, and thus measure the effective context length.
- Needle-In-A-Haystack (NIAH) (Kamradt, 2024): Another common retrieval benchmark.
  - Characteristics: A single "needle" sentence (the target information) is hidden within a large, irrelevant "haystack" context. The model is then asked a question that can only be answered by retrieving the "needle."
  - Purpose: To test the model's ability to find and utilize a single critical piece of information in extremely long and noisy contexts.
5.1.2. Long-Context Understanding
- LV-Eval (Yuan et al., 2024): A benchmark for long-context understanding.
  - Characteristics: Includes 11 sub-datasets. For Vicuna models (16k context limit), the 16k split is used, truncating inputs to 15,500 tokens.
  - Purpose: To assess a model's ability to comprehend and answer questions based on complex, multi-faceted long documents.
- LongBench (Bai et al., 2023): Another benchmark for long-context understanding.
  - Characteristics: Includes 13 sub-datasets. The LongBench-E split is used for balanced data items across length levels. It is segmented into the ranges 0-4k, 4-8k, and 8k+ tokens, with truncation thresholds of 3,500, 7,500, and 15,500 tokens, respectively.
  - Purpose: Provides a comprehensive evaluation of LLMs on diverse long-context tasks, including summarization, question answering, and code completion.
5.1.3. Perplexity Evaluation
A custom test set is constructed for perplexity evaluation:
- Datasets: Qasper (Dasigi et al., 2021) for question answering, MultiNews (Fabbri et al., 2019) for summarization, TREC (Li & Roth, 2002; Hovy et al., 2001) for few-shot learning, and LCC (Mohler et al., 2016) for code completion.
- Scale: 50 data items per length level from the test split of each of the four datasets (200 items per length level in total).
- Characteristics: Data items are organized as question-answer pairs, with human-written questions and answers.
- Purpose: To measure the model's coherence and fluency in generating responses; perplexity is calculated on the answer part of each data item.
5.1.4. Validation Dataset
Used to select the optimal MoA configuration from the Pareto-front solutions.
- Construction: Similar to the perplexity test dataset, it samples data items from the validation splits (or test splits for datasets without a validation split) of Qasper, MultiNews, TREC, and LCC.
- Additional Data: 50 data items from LongEval are added to validate retrieval ability.
- Purpose: To ensure the selected configuration generalizes well to unseen lengths and tasks.
5.1.5. Calibration Dataset (for MoA's Internal Profiling)
This is crucial for MoA's attention influence profiling (Section 4.2.3.1).
- Dataset: The MultiNews dataset (Fabbri et al., 2019).
- Characteristics: Contains long news documents and their summaries, highlighting long-range dependencies.
- Supervision: Summaries are generated by the original dense model, not written by humans, ensuring model alignment.
- Scale: Data items at 2k, 4k, and 8k lengths; each item is padded to its corresponding length.
- Purpose: To accurately profile the attention influence of individual attention values on the model's prediction loss for long-context tasks.
5.1.6. Data Sample Examples
The following are concrete examples of data samples used in the experiments:
Format 1. LongEval
This format is used for the LongEval (Li et al., 2023a) retrieval benchmark. The instruction for retrieval is given after a lengthy context.
Below is a record of lines I want you to remember. Each line begins with 'line <line index>' and contains a ' <REGISTER_CONTENT>' at the end of the line as a numerical value. For each line index, memorize its corresponding <REGISTER_CONTENT>. At the end of the record, I will ask you to retrieve the corresponding <REGISTER_CONTENT> of a certain line index. Now the record start:
line delightful-incandescence: REGISTER_CONTENT is <19147>
line cloistered-presence: REGISTER_CONTENT is <8862>
... (massive lines of key-value pairs) ...
Now the record is over. Tell me what is the <REGISTER_CONTENT> in line cloistered-presence? I need the number.
Format 2. Needle-In-A-Haystack (NIAH)
This format is used for the Needle-In-A-Haystack (NIAH) (Kamradt, 2024) retrieval benchmark. A single "needle" sentence is embedded within irrelevant context.
People who are powerful but uncharismatic will tend to be disliked. Their power makes them a target for criticism that they don't have the charisma to disarm. That was Hillary Clinton's problem.
The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.
It also tends to be a problem for any CEO who is more of a builder than a schmoozer.
What is the best thing to do in San Francisco?
Format 3. MultiNews Calibration Dataset
This format is used for MoA's calibration dataset. The context includes multiple news passages, and the supervision is a summarization generated by the original dense model.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided below.
5.2.1. Retrieval Accuracy
- Conceptual Definition: Retrieval accuracy measures the ability of a model to correctly extract or identify specific pieces of information (e.g., a key-value pair or a "needle" sentence) from a given long document or context. It directly assesses how well the model can pinpoint and utilize relevant facts amidst potentially irrelevant or lengthy surrounding text. For the LongEval benchmark, it specifically checks whether the model returns the correct numerical value associated with a queried key.
- Mathematical Formula: For a set of retrieval tasks: $ \text{Retrieval Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\text{model\_output}_i = \text{ground\_truth}_i) $
- Symbol Explanation:
  - $N$: the total number of retrieval tasks or data items.
  - $\text{model\_output}_i$: the value retrieved by the model for the $i$-th task.
  - $\text{ground\_truth}_i$: the correct value to be retrieved for the $i$-th task.
  - $\mathbb{I}(\cdot)$: the indicator function, which equals 1 if the condition inside the parentheses is true, and 0 otherwise.
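A minimal sketch of the metric (exact string match after stripping whitespace; the real benchmark scripts may normalize answers differently):

```python
def retrieval_accuracy(model_outputs, ground_truths):
    """Fraction of queries whose retrieved value exactly matches the reference."""
    correct = sum(out.strip() == gt.strip() for out, gt in zip(model_outputs, ground_truths))
    return correct / len(ground_truths)

print(retrieval_accuracy(["19147", "8862", "1234"], ["19147", "8862", "4321"]))  # 0.666...
```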
5.2.2. LV-Eval Score
- Conceptual Definition: LV-Eval is a benchmark that provides an aggregate score for an LLM's long-context understanding capabilities across various sub-datasets. It quantifies how well a model can perform tasks like question answering, summarization, and conditional generation over extended inputs. The score is typically an average or weighted average of performance across its constituent tasks.
- Mathematical Formula: The paper reports LV-Eval scores as a single aggregate value. A specific formula is not provided in the paper's main text or appendices, as the score is computed by the benchmark's official script based on its internal evaluation methodology for the 11 sub-datasets. Conceptually, it is a composite metric reflecting overall long-context understanding.
- Symbol Explanation: Not applicable for a general formula, as it is an aggregate benchmark score quantifying overall performance across different long-context understanding tasks.
5.2.3. LongBench Score
- Conceptual Definition: Similar to LV-Eval, LongBench is a comprehensive benchmark for evaluating LLMs on various long-context understanding tasks. It provides a score reflecting the model's performance across its 13 diverse sub-datasets, which cover tasks such as summarization, code completion, few-shot learning, and question answering. The score is typically an average over the sub-tasks, computed by the official benchmark.
- Mathematical Formula: The paper reports LongBench scores as an aggregate value; a specific formula is not provided, as the score is computed by the benchmark's official script. Conceptually, it is a composite metric reflecting overall long-context understanding across a wider range of tasks than LV-Eval.
- Symbol Explanation: Not applicable for a general formula, as it is an aggregate benchmark score quantifying overall performance across different long-context understanding tasks.
5.2.4. Perplexity (PPL)
- Conceptual Definition: Perplexity is a common metric for evaluating how well a language model predicts a sample of text. It measures the uncertainty of the model when predicting the next token in a sequence. A lower perplexity indicates that the model is more confident and accurate in its predictions, implying better coherence and fluency in language generation. It is the exponentiated average negative log-likelihood of the tokens.
- Mathematical Formula: $ \text{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1})\right) $
- Symbol Explanation:
  - $W$: a sequence of tokens, $w_1, w_2, \dots, w_N$.
  - $N$: the total number of tokens in the sequence.
  - $P(w_i | w_1, \dots, w_{i-1})$: the probability assigned by the language model to the $i$-th token $w_i$, given all preceding tokens.
  - $\log$: the natural logarithm.
  - $\exp$: the exponential function.
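A minimal sketch computing perplexity from per-token log-probabilities (which a framework like PyTorch would produce as the negative cross-entropy over the answer tokens):

```python
import math

def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum_i log P(w_i | w_<i))."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(perplexity([-0.5, -1.2, -0.3]))  # ~1.95 for these three example log-probabilities
```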
5.2.5. Decode Throughput (tokens/s)
- Conceptual Definition: Decode throughput measures the rate at which an LLM generates output tokens during the decode stage of autoregressive inference, typically expressed in tokens per second (tokens/s). Higher throughput indicates greater efficiency and faster text generation, which is critical for real-time applications and for serving many users simultaneously.
- Mathematical Formula: $ \text{Decode Throughput} = \frac{\text{Total number of tokens generated}}{\text{Total time taken for generation}} $
- Symbol Explanation:
  - Total number of tokens generated: the sum of all tokens produced by the model across all requests in a given period.
  - Total time taken for generation: the wall-clock time required to generate these tokens.
5.2.6. GPU Memory Reduction
- Conceptual Definition: GPU memory reduction quantifies the decrease in GPU memory consumption achieved by an optimization method compared to a baseline. For LLMs, this often refers to the memory saved by efficiently managing the KV-Cache or reducing the size of attention matrices. Reducing GPU memory allows larger batch sizes or longer context lengths on the same hardware.
- Mathematical Formula: $ \text{GPU Memory Reduction} = \frac{\text{Baseline GPU Memory Usage}}{\text{Optimized GPU Memory Usage}} $ Alternatively, as a percentage: $ \text{Percentage Reduction} = \left(1 - \frac{\text{Optimized GPU Memory Usage}}{\text{Baseline GPU Memory Usage}}\right) \times 100\% $
- Symbol Explanation:
  - Baseline GPU Memory Usage: the GPU memory consumed by the original or unoptimized method.
  - Optimized GPU Memory Usage: the GPU memory consumed by the proposed (MoA) method.
5.2.7. Length-controlled Win Rate
- Conceptual Definition: Length-controlled win rate is a metric used in benchmarks like AlpacaEval 2.0 to evaluate the quality of a model's instruction-following responses. It compares the output of a target model against a reference model (e.g., gpt4_turbo) and determines which response is preferred by an automated evaluator. "Length-controlled" means the evaluation accounts for potential biases related to response length. A higher win rate indicates stronger instruction-following capability.
- Mathematical Formula: The calculation in AlpacaEval 2.0 relies on a sophisticated automated evaluator (such as gpt-4-1106-preview) performing pairwise comparisons, so there is no simple closed-form formula. Conceptually, it is the percentage of times the target model's response is preferred over the reference model's response by the evaluator.
- Symbol Explanation: Not applicable; the metric is the percentage of instances where the target model's output is judged superior by the automated evaluator.
5.3. Baselines
The paper compares MoA against a comprehensive set of state-of-the-art efficient attention methods and LLM frameworks, encompassing both static and dynamic approaches.
- StreamingLLM (Xiao et al., 2024c): A prominent static efficient attention method that employs a uniform sliding-window mask with a fixed number of attention-sink tokens (initial tokens that remain always visible). It uses efficient prefill.
  - Configuration: The initial four tokens are unmasked (attention sink), except for the 70B model and super-long settings, where 64 tokens are used.
- InfLLM (Xiao et al., 2024a): A dynamic KV-Cache compression method that uses training-free memory for long sequences. It uses sparse prefill.
  - Configuration: Adheres to the original setup with the same local window size, selected memory size, and 128 initial tokens as attention sinks.
- H2O (Heavy-hitter Oracle) (Zhang et al., 2023): A dynamic KV-Cache pruning method that identifies and keeps "heavy hitter" (most important) tokens and recent tokens.
  - Configuration: Uses the same number of heavy-hitter tokens and recent tokens as specified in its original paper.
  - Note: H2O uses the original dense prefill because it needs the column sums of the attention matrix to determine token importance for KV-Cache eviction, making it incompatible with sparse-prefill kernel optimizations like FlashAttention during prefill. This can lead to Out-Of-Memory (OOM) issues at large prefill lengths and batch sizes.
- FlashAttention2 (Dao, 2024): A highly optimized CUDA kernel for exact attention computation, significantly improving the speed and memory efficiency of dense attention. It serves as a strong dense-attention baseline.
- vLLM (Kwon et al., 2023): A high-throughput LLM serving framework that uses PagedAttention for efficient KV-Cache memory management. It is known for its strong decode throughput.
- BigBird (Zaheer et al., 2020): An early static sparse attention model that combines local sliding windows with global tokens. Included in the accuracy-throughput trade-off analysis.
- SnapKV (Li et al., 2024c): A dynamic KV-Cache compression method. Included for comparison in the longer-context generalization results.
- PyramidKV (Cai et al., 2024): Another recent dynamic KV-Cache compression method. Included for comparison in the longer-context generalization results.
5.4. MoA Settings
The specific configurations used for MoA in the experiments are:
- Attention Pattern: Block sliding-window attention with a block size of 64 tokens. The first block (64 tokens) is always unmasked and serves as the attention sink.
- Profiling Dataset: MultiNews (Fabbri et al., 2019) with model-generated summaries as supervision (as detailed in Section 5 of the paper).
- Profiling Lengths: Calibration data items at 2k, 4k, and 8k input lengths, padded to their respective lengths to ensure uniform attention-influence tensor shapes.
- Profiling Granularity: Block granularity; the average attention influence is calculated within each block to represent its overall influence.
- Hyperparameter Search Space:
  - $\alpha$ (base span): 6 discrete values.
  - $\beta$ (expansion rate): 9 discrete values.
  - This creates a search space of 54 $(\alpha, \beta)$ pairs for each attention head.
  - $\alpha$ values are uniformly sampled from a predefined range and $\beta$ values from [0, 1]. The resulting attention-span lengths are clipped to lie between 0 and the current input length.
- Optimization: Multi-objective optimization is performed at the same set of lengths (2k, 4k, 8k).
- Rule Restriction: To ensure inference-time efficiency, the number of distinct elastic rules is restricted to at most two per model layer.
- Validation Length: The optimal MoA configuration is selected from the Pareto-front solutions based on the lowest perplexity on a validation dataset of length 12k.
- Fine-tuning: The models are not fine-tuned after applying MoA.
- Consistency: Each model uses the same MoA configuration across all benchmarks and lengths (a hypothetical serialization of such a configuration is sketched below).
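As referenced in the last bullet, here is a hypothetical serialization of a searched MoA configuration; the layout and the $(\alpha, \beta)$ values are invented for illustration, and the released code may store configurations differently:

```python
moa_config = {
    "block_size": 64,        # block sliding-window granularity
    "sink_blocks": 1,        # first 64-token block is always unmasked (attention sink)
    "rules": {               # layer -> head -> (alpha, beta); at most two distinct rules per layer
        0: {0: (128, 0.0), 1: (64, 0.5)},
        1: {0: (256, 0.0), 1: (64, 1.0)},
    },
}

def head_span(layer, head, input_len, cfg=moa_config):
    """Attention span of one head at a given input length: alpha + beta * N, capped at N."""
    alpha, beta = cfg["rules"][layer][head]
    return min(int(alpha + beta * input_len), input_len)

print(head_span(1, 1, 8192))  # 8192: this hypothetical head attends (almost) globally at 8k
```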
6. Results & Analysis
6.1. Core Results Analysis
MoA demonstrates superior performance and efficiency across various LLM models and long-context benchmarks, significantly outperforming state-of-the-art baselines.
6.1.1. Accuracy-Throughput Trade-off
The following figure (Figure 4 from the original paper) shows the accuracy-throughput trade-offs of seven attention methods at different densities:
The figure is a chart showing the trade-off between decode throughput and retrieval accuracy for seven attention methods on the Vicuna-7B model at different densities. The methods include InfLLM, BigBird, H2O, SnapKV, PyramidKV, StreamingLLM, and MoA; the x-axis shows decode throughput (tokens/s) and the y-axis shows retrieval accuracy.
As shown in Figure 4, MoA effectively advances the Pareto front in terms of both retrieval accuracy and decode throughput compared to six baselines (InfLLM, BigBird, H2O, SnapKV, PyramidKV, StreamingLLM).
- Throughput Advantage: At the same densities, MoA achieves notable throughput improvements of 1.6-18.1x over H2O, InfLLM, BigBird, SnapKV, and PyramidKV, attributed to its efficient static attention design and customized GPU kernel.
- Accuracy Maintenance: MoA consistently maintains high retrieval accuracy across varying densities, indicating that it sacrifices less performance for efficiency than other methods.
6.1.2. Performance
MoA consistently outperforms existing efficient attention methods and achieves performance comparable to the original full-attention model at 50% density across various tasks and model sizes.
6.1.2.1. Long-Context Retrieval
The following are the results from Table 4 of the original paper:
| Model | Attention | Retrieve Acc. ↑ (4k) | (8k) | (16k) | LV-Eval ↑ (16k) | LongBench ↑ (0-16k) | PPL ↓ (8-12k) |
|---|---|---|---|---|---|---|---|
| Vicuna-7B | Original | 1.00 | 0.98 | 0.62 | 5.93 | 34.76 | 3.79 |
| | H2O | 0.86 | 0.68 | 0.35 | 5.42 | 33.59 | 3.94 |
| | InfLLM | 0.67 | 0.57 | 0.26 | 5.13 | 32.97 | 4.07 |
| | StreamingLLM | 0.43 | 0.16 | 0.08 | 4.72 | 31.84 | 4.48 |
| | MoA | 1.00 | 0.97 | 0.57 | 5.61 | 33.96 | 3.75 |
| Vicuna-13B | Original | 0.99 | 0.98 | 0.44 | 5.83 | 39.23 | 3.62 |
| | H2O | 0.88 | 0.76 | 0.28 | 5.66 | 38.13 | 3.80 |
| | InfLLM | 0.70 | 0.53 | 0.27 | 6.80 | 37.13 | 4.07 |
| | StreamingLLM | 0.65 | 0.49 | 0.33 | 5.43 | 32.13 | 4.10 |
| | MoA | 0.99 | 0.93 | 0.49 | 7.16 | 38.77 | 3.62 |
| Llama3-8B | Original | 0.99 | 0.99 | 0.97 | 17.49 | 43.69 | 4.52 |
| | H2O | 0.94 | 0.88 | 0.37 | 16.03 | 42.99 | 4.63 |
| | InfLLM | 0.65 | 0.59 | 0.52 | 14.44 | 42.43 | 4.68 |
| | StreamingLLM | 0.68 | 0.55 | 0.52 | 11.16 | 38.22 | 4.79 |
| | MoA | 0.99 | 1.00 | 1.00 | 17.46 | 42.97 | 4.49 |
| Llama3-70B | Original | 1.00 | 0.99 | 0.93 | 24.51 | 49.10 | 3.67 |
| | H2O | 0.93 | 0.91 | OOM | OOM | OOM | OOM |
| | StreamingLLM | 0.20 | 0.15 | 0.04 | 17.45 | 42.53 | 4.26 |
| | MoA | 1.00 | 1.00 | 0.94 | 23.65 | 47.79 | 3.75 |
-
Superior Accuracy: As shown in Table 4, MoA consistently achieves very high retrieval accuracy across all tested models (Vicuna-{7B, 13B}, Llama3-{8B, 70B}) and input lengths (4k, 8k, 16k) at 50% density.
  - Its maximum relative accuracy drop (compared to the Original full-attention model) is only 8%, significantly lower than StreamingLLM (87%), InfLLM (58%), and H2O (44%). On average, MoA's relative accuracy drop is below 1%.
-
Extended Effective Context Length: The following figure (Figure 5(a) from the original paper) shows retrieval accuracy and effective context length:
The figure is a chart showing the retrieval accuracy of the Vicuna-7B model at different attention spans and input lengths. The x-axis shows attention span and the y-axis shows input length; the heatmap indicates retrieval accuracy under each configuration.
Figure 5(a) demonstrates that MoA retains over 90% retrieval accuracy up to 60k lengths, effectively matching the dense model's effective context length. This is achieved with only 8k profiling and 12k validation. In stark contrast, H2O, InfLLM, and StreamingLLM have much shorter effective context lengths (8k, <4k, and <4k respectively). MoA extends its effective context length to approximately 3.9x the average KV-Cache length.
-
Density Efficiency: The following figure (Figure 6 from the original paper) shows retrieval accuracy across attention spans and input lengths:
The figure is a chart showing retrieval-accuracy tests on LongEval with Vicuna-7B. Panel (a) shows the effective length at different attention spans, and panel (b) shows how retrieval accuracy changes at specific densities.
Figure 6 (and Figure 7(b) in the appendix) further illustrates that MoA reaches over 0.9 retrieval accuracy at an 8k input length with just 25% density, whereas StreamingLLM and H2O require 100% and 75% density, respectively, for similar performance. This highlights MoA's efficiency in retaining critical information.
-
Needle-In-A-Haystack Performance: The following figure (Figure 8 from the original paper) shows Needle-In-A-Haystack retrieval accuracy for Llama3-8B at input lengths from 8k to 256k:
The figure is a chart showing Needle-In-A-Haystack (NIAH) retrieval accuracy of different attention methods on the Llama3-8B model for input lengths from 8k to 256k. It shows the depth-wise performance of MoA, H2O, StreamingLLM, and InfLLM under different token limits, with color intensity indicating the accuracy score.
Figure 8 (Appendix C.1.2) shows that MoA achieves perfect retrieval accuracy across input lengths from 8k to 256k on the NIAH dataset, while baselines like StreamingLLM have limited effective context length and InfLLM shows reduced accuracy. H2O and InfLLM often fail at extreme lengths due to OOM/OOT errors.
6.1.2.2. Long-Context Understanding
The following are the results from Table 7 of the original paper:
| Model | Attention | LongBench ↑ (0-4k) | LongBench ↑ (4-8k) | LongBench ↑ (8-16k) |
|---|---|---|---|---|
| Vicuna-7B | Original | 37.91 | 33.82 | 32.54 |
| | H2O | 36.23 | 32.74 | 31.81 |
| | InfLLM | 35.23 | 33.54 | 30.15 |
| | StreamingLLM | 30.53 | 33.28 | 31.70 |
| | MoA | 37.04 | 32.90 | 31.94 |
| Vicuna-13B | Original | 42.25 | 39.52 | 35.93 |
| | H2O | 41.63 | 38.02 | 34.75 |
| | InfLLM | 39.36 | 37.66 | 34.36 |
| | StreamingLLM | 30.65 | 33.07 | 32.68 |
| | MoA | 41.73 | 38.88 | 35.69 |
| Llama3-8B | Original | 44.27 | 43.53 | 43.26 |
| | H2O | 43.46 | 43.01 | 42.50 |
| | InfLLM | 42.78 | 42.69 | 41.81 |
| | StreamingLLM | 37.20 | 38.02 | 39.43 |
| | MoA | 43.07 | 42.75 | 43.09 |
| Llama3-70B | Original | 50.70 | 48.05 | 48.55 |
| | H2O | 50.16 | 47.77 | OOM |
| | StreamingLLM | 45.14 | 42.40 | 40.04 |
| | MoA | 49.74 | 46.80 | 46.84 |
The following are the results from Table 8 of the original paper:
| Model | Attention | Single-QA w/o. Conf (2) | Single-QA w. Conf (2) | Multi-QA w/o. Conf (3) | Multi-QA w. Conf (2) | Retrieval w. Conf (2) |
|---|---|---|---|---|---|---|
| Vicuna-7B | Original | 10.49 | 6.29 | 6.83 | 5.60 | 0.00 |
| | H2O | 9.16 | 6.20 | 6.44 | 4.80 | 0.00 |
| | InfLLM | 7.11 | 6.70 | 6.07 | 4.80 | 0.00 |
| | StreamingLLM | 7.54 | 5.90 | 5.98 | 3.56 | 0.00 |
| | MoA | 9.98 | 6.27 | 6.16 | 5.31 | 0.09 |
| Vicuna-13B | Original | 10.64 | 7.28 | 5.32 | 5.07 | 1.08 |
| | H2O | 9.53 | 6.54 | 5.25 | 5.36 | 1.83 |
| | InfLLM | 10.21 | 9.35 | 6.03 | 3.19 | 2.08 |
| | StreamingLLM | 9.05 | 5.86 | 5.37 | 3.19 | 3.70 |
| | MoA | 11.04 | 6.93 | 5.79 | 5.84 | 6.88 |
| Llama3-8B | Original | 34.05 | 19.51 | 11.41 | 17.70 | 7.84 |
| | H2O | 28.52 | 17.05 | 11.11 | 15.98 | 9.95 |
| | InfLLM | 24.94 | 17.75 | 10.61 | 14.80 | 6.04 |
| | StreamingLLM | 20.21 | 9.57 | 8.14 | 9.36 | 10.03 |
| | MoA | 32.98 | 20.53 | 10.65 | 17.57 | 8.98 |
| Llama3-70B | Original | 44.44 | 25.02 | 16.71 | 22.86 | 17.43 |
| | StreamingLLM | 26.63 | 14.22 | 14.04 | 14.70 | 19.38 |
| | MoA | 42.44 | 23.58 | 15.75 | 21.27 | 19.19 |
-
Reduced Performance Gap: Table 4, Table 7, and Table 8 show that MoA narrows the maximum relative performance drop on the LV-Eval and LongBench benchmarks to 5% and 3%, respectively, significantly better than StreamingLLM's 36% and 18%. H2O and InfLLM incur larger efficiency costs while achieving similar or worse performance.
-
Consistent Performance: The following figure (Figure 9 from the original paper) compares normalized scores of MoA, H2O, and StreamingLLM across subtasks:
The figure is a chart comparing the performance of MoA, H2O, and StreamingLLM on different models (7B, 13B, 70B). It shows the scores of these attention methods on summarization, code completion, and multi-document QA tasks, with all metrics normalized.
Figure 9 and Figure 10 (Appendix C.1.3) highlight MoA's consistent performance across subtasks and various densities, contrasting with the inconsistent performance of baselines.
- Perplexity: MoA keeps perplexity within a 1% relative increase, whereas other methods see increases of 4-13%. This indicates that MoA better preserves the model's language-generation coherence.
6.1.2.3. Longer-Context Generalization
The following are the results from Table 9 of the original paper:
| Attention | Retrieve Acc. ↑ (32k) | (64k) | (128k) | (256k) |
|---|---|---|---|---|
| SnapKV | 1.00 | 0.88 | 0.71 | 0.33 |
| PyramidKV | 1.00 | 0.85 | 0.62 | 0.37 |
| MoA | 1.00 | 0.92 | 0.83 | 0.46 |
- Strong Generalizability: Even though its configuration is searched using contexts of at most 12k tokens, MoA generalizes exceptionally well to much longer contexts (32k to 256k), as shown in Figure 5(b) (from the original paper) and Table 9.
- It outperforms InfLLM and StreamingLLM by 1.9-3.3x in retrieval accuracy and 1.2-1.4x in LV-Eval scores at these extended lengths, performing comparably to the original dense model.
- MoA also consistently outperforms more recent baselines such as SnapKV and PyramidKV in retrieval accuracy at longer contexts (Table 9).
6.1.2.4. Instruction-following Generation
The following are the results from Table 10 of the original paper:
| Attention | Length-controlled Win Rate ↑ | Standard Error |
|---|---|---|
| Original | 8.84 | 0.53 |
| H2O | 9.66 | 0.55 |
| InfLLM | 5.76 | 0.42 |
| StreamingLLM | 7.96 | 0.49 |
| MoA | 9.83 | 0.57 |
- Highest Win Rate: MoA achieves the highest length-controlled win rate on the AlpacaEval 2.0 benchmark (Table 10), outperforming both the efficient-attention baselines and even the original model. This suggests that MoA's elastic design not only improves efficiency but also enhances general instruction-following capability.
6.1.3. Ablation Studies / Parameter Analysis
6.1.3.1. Impact of Heterogeneity and Elastic Rules
The following are the results from Table 3 of the original paper:
| Mask Design | Retrieval Acc. ↑ (8k) | Retrieval Acc. ↑ (16k) | PPL ↓ (8k) | PPL ↓ (12k) |
|---|---|---|---|---|
| Uniform | 0.25 | 0.15 | 4.89 | 5.19 |
| +Hetero. Layers | 0.31 | 0.26 | 4.55 | 4.85 |
| +Hetero. Heads | 0.95 | 0.41 | 3.96 | 4.30 |
| +Elastic Rules | 0.98 | 0.43 | 3.96 | 4.29 |
Table 3 presents an ablation study on the search space design, demonstrating the progressive benefits of heterogeneity and elasticity at a consistent 25% density.
- Uniform Baseline: A basic uniform sliding-window method performs poorly (Retrieval Acc. 0.25 at 8k, 0.15 at 16k; high PPL).
- Heterogeneity in Layers: Introducing distinct attention spans per layer (+Hetero. Layers) improves retrieval accuracy (0.31 at 8k) and reduces PPL.
- Heterogeneity in Heads: Additionally assigning distinct attention spans to each individual head within a layer (+Hetero. Heads) brings a significant boost in retrieval accuracy (0.95 at 8k), showing the strong impact of fine-grained heterogeneity.
- Elastic Rules: Finally, incorporating elastic rules that allow attention spans to scale with input length (+Elastic Rules) further enhances retrieval accuracy (0.98 at 8k, 0.43 at 16k) and maintains low PPL, validating the importance of adaptive attention spans (see the mask sketch after this list).
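As a concrete illustration of the progression in Table 3, the sketch below builds block-granular sliding-window masks with an attention sink. Using one shared window for every head corresponds to the uniform baseline, while giving each head its own window corresponds to the heterogeneous rows; the helper name and the per-head window values are illustrative assumptions, not the paper's kernel.

```python
import torch

def block_sliding_window_mask(seq_len: int, window_len: int,
                              block_size: int = 64, sink_blocks: int = 1) -> torch.Tensor:
    """Boolean causal mask at block granularity: each query block attends to the
    most recent key blocks inside its window plus the initial sink block(s)."""
    n_blocks = (seq_len + block_size - 1) // block_size
    w_blocks = max(1, window_len // block_size)
    q = torch.arange(n_blocks).unsqueeze(1)  # query block index
    k = torch.arange(n_blocks).unsqueeze(0)  # key block index
    causal = k <= q
    local = (q - k) < w_blocks               # recent blocks within this head's window
    sink = k < sink_blocks                   # always-visible initial blocks (attention sink)
    return causal & (local | sink)

# Uniform baseline: every head shares one window length.
uniform_masks = [block_sliding_window_mask(4096, 1024) for _ in range(32)]
# Heterogeneous (MoA-style) setup: each head gets its own window length.
hetero_windows = [512, 1024, 2048, 4096] * 8  # illustrative per-head spans
hetero_masks = [block_sliding_window_mask(4096, w) for w in hetero_windows]
```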
6.1.3.2. Calibration Dataset Design Choices
The following are the results from Table 2 of the original paper:
| Dataset | Supervision | Reference | Long Dep. | Align Model | Retrieval Acc. ↑ | PPL ↓ |
|---|---|---|---|---|---|---|
| RedPajama | Context | - | ✗ | ✗ | 0.25 | 4.95 |
| MultiNews | Context & Summary | Human | ✗/✓ | ✗ | 0.27 | 4.62 |
| MultiNews | Summary | Human | ✓ | ✗ | 0.87 | 3.97 |
| MultiNews | Summary | Model | ✓ | ✓ | 0.95 | 3.96 |
Table 2 validates MoA's calibration dataset design principles at 25% average density and 8k input length.
- Long-Range Dependencies: Moving from RedPajama (general language modeling, no long-range dependency) to MultiNews (Summary supervision, Human reference, with long-range dependency) significantly improves retrieval accuracy from 0.27 to 0.87 and reduces perplexity from 4.62 to 3.97. This highlights the importance of profiling on data that naturally features long-range dependencies.
- Model Alignment: Using summaries generated by the original dense model as supervision (MultiNews, Summary supervision, Model reference, with model alignment) further boosts retrieval accuracy from 0.87 to 0.95 while maintaining low perplexity. This confirms that aligning the supervision with the model's own response patterns is crucial for accurate attention-influence profiling (see the sketch after this list).
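The following is a minimal sketch of the calibration-data principle validated in Table 2, assuming a Hugging Face-style tokenizer; the function name is hypothetical and the paper's actual profiling pipeline may differ. The point it illustrates is that only the model-generated summary tokens carry the supervision signal, so the profiling loss exercises long-range dependencies while staying aligned with the model's own responses.

```python
import torch

def build_calibration_item(tokenizer, context: str, model_summary: str, max_len: int):
    """Concatenate a long context with a summary generated by the original dense
    model, supervising only the summary tokens (context labels are set to -100,
    the ignore index of the usual cross-entropy loss)."""
    ctx_ids = tokenizer(context, add_special_tokens=False).input_ids
    sum_ids = tokenizer(model_summary, add_special_tokens=False).input_ids
    input_ids = (ctx_ids + sum_ids)[:max_len]
    labels = ([-100] * len(ctx_ids) + sum_ids)[:max_len]
    # Pad to max_len so attention-influence tensors share a uniform shape.
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
    pad = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * pad
    labels = labels + [-100] * pad
    return torch.tensor(input_ids), torch.tensor(labels)
```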
6.1.4. MoA Configuration Analysis
6.1.4.1. Overview of Mask Density Across Layers
The following figure (Figure 11 from the original paper) shows the MoA mask density across layers for Vicuna-7B and Vicuna-13B:
The figure is a chart showing the MoA mask density across layers for the Vicuna-7B and Vicuna-13B models. Different colors indicate window lengths (2k, 4k, 8k, 16k), and the plot shows how density varies across layers.
Figure 11 (from original paper) visualizes the MoA mask density across layers for different LLMs.
- Density Distribution: Initial and middle layers generally exhibit high mask density (larger attention spans), aligning with findings on LLMs' intrinsic dimensions and layer sensitivities. These layers might be crucial for initial feature extraction and context aggregation.
- Later Layers: In the final layers, most attention heads require low density (shorter attention spans), while a few outliers still demand high density. This suggests specialization: some late-stage heads may still need broad context for final predictions, while others focus on refining local information.
6.1.4.2. Diversity within Layers
The following figure (Figure 12 from the original paper) plots the average mask density and the density range for each layer:
The figure is a scatter plot showing the relationship between average density and density range in the Vicuna-7B and Vicuna-13B models. Each point represents a different configuration, and the plot includes a linear-regression trend line with a shaded region reflecting the density trend.
Figure 12 (from original paper) plots the average mask density and the density range for each layer.
- Confirmation of Heterogeneity: Layers with a lower average density tend to display a wider density range (wider shaded areas in the plot), confirming the necessity of heterogeneous attention rules within the same layer. This reinforces MoA's approach of tailoring attention spans at the individual-head level.
6.1.4.3. Connection to Semantic Interpretation
MoA preserves the learned semantics of the dense model by statistically analyzing and optimizing attention patterns.
-
Position vs. Content-Based Attention:
Attention heads can be primarily position-based (e.g., attending to the first token or the previous token) or content-based (attending to semantically similar tokens regardless of position). Figure 2 illustrates these types.
- Quantitative Analysis (SoE): The paper quantifies position-based attention using the Standard deviation of Expectation (SoE). SoE measures how uneven the mean attention matrix is, reflecting the influence of position. The derivation uses a linear approximation of the attention value as a combination of positional influence and token influence:
$ A_{h,i,j} = f_h(P_h(i,j), T_h(t_i, t_j)) $
Linear approximation:
$ A_{h,i,j} = \alpha_h P_h(i,j) + (1 - \alpha_h) T_h(t_i, t_j) $
where $\alpha_h$ is the weight factor for position influence. The expectation of the attention value over different sentences is:
$ \mathbb{E}_t[A_{h,i,j}] = \alpha_h P_h(i,j) + (1 - \alpha_h) \mu_T, \quad \forall i > j $
where $\mu_T$ is the average token influence. The standard deviation of this expectation over different positions of the attention matrix is:
$ \sigma_p(\mathbb{E}_t[A_{h,i,j}]) = \alpha_h \delta_p $
where $\delta_p$ is the standard deviation of $P_h(i,j)$ over different positions, which is constant for all heads. This implies that the position impact $\alpha_h$ is proportional to the SoE of different heads (a minimal sketch of this computation appears after this list).
- Correlation: The following figure (Figure 13 from the original paper) shows the correlation between MoA's mask sparsity and each head's positional dependency (SoE):
The figure is a chart showing the positive correlation between MoA's mask sparsity and each head's dependency on position (SoE). The horizontal axis shows SoE (1e-2) and the vertical axis shows sparsity (%), with marginal distributions on both sides showing how sparsity varies with SoE.
Figure 13 (from original paper) shows a positive correlation between MoA's mask sparsity and a head's dependency on position (SoE). This suggests that MoA successfully captures the semantic information of the dense language model, assigning shorter windows (higher sparsity) to heads that are more position-based and longer windows (lower sparsity) to those that rely more on content.
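A minimal sketch of how SoE could be computed from attention matrices averaged over sentences follows; the tensor layout and function name are assumptions for illustration rather than the paper's code.

```python
import torch

def standard_deviation_of_expectation(attn: torch.Tensor) -> torch.Tensor:
    """SoE per head: the standard deviation, over positions (i, j) with i > j, of the
    attention matrix averaged over sentences.
    attn: [num_sentences, num_heads, seq_len, seq_len]."""
    mean_attn = attn.mean(dim=0)                 # E_t[A_{h,i,j}], shape [H, N, N]
    n = mean_attn.shape[-1]
    lower = torch.tril(torch.ones(n, n, dtype=torch.bool), diagonal=-1)  # i > j
    per_head = mean_attn[:, lower]               # [H, number of lower-triangular positions]
    return per_head.std(dim=-1)                  # one SoE value per head

# Heads with larger SoE are more position-based and, per Figure 13, tend to be
# assigned sparser (shorter) windows by MoA.
```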
6.1.4.4. Cross-Dataset Generalizability
The following figure (Figure 14 from the original paper) shows the attention matrices of the same attention head across different tasks:
The figure is an illustration of the attention matrices of attention head 29 in layer 17 of the Vicuna-7B model under different tasks (few-shot learning, chatting, and coding). Each attention matrix is averaged over 256 data items and reflects the attention pattern for that task. The same attention head exhibits a similar attention span across tasks, illustrating the cross-dataset generalizability of the method.
Figure 14 (Appendix E.1) visualizes the attention matrices of the same attention heads across different tasks (few-shot learning, chatting, coding). The observation that the same head shows a similar attention span across diverse tasks explains the robust cross-dataset generalizability of MoA. The heterogeneous elastic rules learned are intrinsic to the model's architecture and pre-training, rather than task-specific.
6.2. Efficiency
MoA demonstrates high runtime efficiency and manageable one-time search overhead.
6.2.1. Memory and Throughput Breakdown
The following are the results from Table 5 of the original paper:
| Model | Framework | Attention | Batch (4k) | Throughput (4k) | Batch (8k) | Throughput (8k) | Batch (16k) | Throughput (16k) |
|---|---|---|---|---|---|---|---|---|
| 7B | vLLM | PagedAttention | 30 | 628.8 | 15 | 323.0 | 8 | 145.5 |
| 7B | FlexGen | H2O | 20 | 754.9 | 6 | 296.3 | 1 | 51.7 |
| 7B | HuggingFace | InfLLM | 15 | 62.0 | 10 | 37.5 | 6 | 19.2 |
| 7B | HuggingFace | StreamingLLM | 50 | 945.1 | 25 | 467.3 | 12 | 232.0 |
| 7B | HuggingFace | FlashAttention2 | 30 | 134.6 | 15 | 66.9 | 8 | 32.9 |
| 7B | HuggingFace | +Static KV-Cache | 30 | 496.1 | 15 | 219.5 | 8 | 91.6 |
| 7B | HuggingFace | +Reduced Attention | 30 | 722.5 | 15 | 369.9 | 8 | 178.3 |
| 7B | HuggingFace | +Increased Batch | 50 | 897.7 | 25 | 436.7 | 12 | 206.4 |
| 7B | HuggingFace | +Kernel (=MoA) | 50 | 1099.0 | 25 | 535.7 | 12 | 257.3 |
| 13B | vLLM | PagedAttention | 16 | 314.8 | 8 | 160.5 | 4 | 71.1 |
| 13B | FlexGen | H2O | 12 | 330.2 | 4 | 138.2 | 1 | 37.4 |
| 13B | HuggingFace | InfLLM | 8 | 30.3 | 5 | 17.63 | 3 | 11.3 |
| 13B | HuggingFace | StreamingLLM | 28 | 478.4 | 14 | 241.2 | 7 | 116.5 |
| 13B | HuggingFace | FlashAttention2 | 16 | 81.3 | 8 | 40.8 | 4 | 19.8 |
| 13B | HuggingFace | +Static KV-Cache | 16 | 264.6 | 8 | 111.3 | 4 | 62.2 |
| 13B | HuggingFace | +Reduced Attention | 16 | 329.6 | 8 | 156.4 | 4 | 87.3 |
| 13B | HuggingFace | +Increased Batch | 28 | 471.5 | 14 | 222.6 | 7 | 108.3 |
| 13B | HuggingFace | +Kernel (=MoA) | 28 | 550.9 | 14 | 267.6 | 7 | 132.3 |
-
Decode Throughput: Table 5 shows that MoA achieves significant decode throughput improvements. For Vicuna-7B, throughput is 6.6-8.2x higher than FlashAttention2 and 1.7-1.9x higher than vLLM (which uses PagedAttention). Similar gains are observed for Vicuna-13B.
GPU Memory Reduction: The following are the results from Table 11 of the original paper:
| Size | Attention | Memory (GB), 4k | 8k | 16k |
|---|---|---|---|---|
| 7B | FlashAttn2 | 28.5 | 44.4 | 76.3 |
| 7B | H2O | 36.9 | OOM | OOM |
| 7B | MoA | 22.7 | 32.9 | 53.5 |
| 13B | FlashAttn2 | 36.8 | 49.2 | 74.0 |
| 13B | H2O | 40.4 | 77.9 | OOM |
| 13B | MoA | 32.0 | 39.6 | 55.0 |
Table 11 indicates that MoA reduces GPU memory usage by 1.2-1.4x compared to FlashAttention2. H2O frequently encounters Out-Of-Memory (OOM) issues at longer input lengths.
- Ablation of Efficiency Factors (Table 5 breakdown): The paper ablates MoA's efficiency improvements, building upon a HuggingFace baseline with FlashAttention2.
  - Static KV-Cache: By keeping KV-Cache tokens only within each head's defined span, MoA prevents KV-Cache growth, eliminating reallocation overhead and significantly reducing memory (a sketch of this per-head cache idea follows this list).
  - Reduced Attention Computation: The sparse attention (reduced density) decreases the computation and memory access required for the attention operation itself.
  - Increased Batch Size: The memory reduction from a smaller KV-Cache allows a larger batch size, which improves GPU utilization and overall throughput.
  - GPU Kernel Design: MoA uses a customized CUDA GPU kernel designed for heterogeneous attention patterns, providing further low-level optimization.
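To make the static KV-Cache factor above concrete, here is a minimal per-head sketch: it keeps the attention-sink tokens plus a fixed-size ring buffer of the most recent tokens inside that head's span, so cache memory never grows with sequence length. The class layout is an assumed illustration and is unrelated to the paper's CUDA kernel.

```python
from collections import deque

class HeadKVCache:
    """Fixed-budget KV-Cache for one attention head: sink tokens are kept forever,
    while window tokens beyond the head's span are evicted automatically."""
    def __init__(self, span: int, sink: int = 64):
        self.sink = sink
        self.sink_k, self.sink_v = [], []
        window = max(1, span - sink)
        self.window_k = deque(maxlen=window)
        self.window_v = deque(maxlen=window)

    def append(self, k, v):
        if len(self.sink_k) < self.sink:      # the first tokens form the attention sink
            self.sink_k.append(k)
            self.sink_v.append(v)
        else:                                  # older window tokens fall out of the deque
            self.window_k.append(k)
            self.window_v.append(v)

    def keys(self):
        return self.sink_k + list(self.window_k)

    def values(self):
        return self.sink_v + list(self.window_v)
```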
6.2.2. Efficiency Results for Longer Input
The following are the results from Table 12 of the original paper:
| Model Size | Framework | Attention | Min. #GPU | Total Throughput | Total Memory (GB) | Throughput per GPU |
|---|---|---|---|---|---|---|
| 7B | vLLM | PagedAttention | 2 | 30.2 | 142.0 | 15.1 |
| 7B | FlexGen | H2O | >8 | - | OOM | - |
| 7B | HuggingFace | InfLLM | 1 | 6.1 | 47.7 | 6.1 |
| 7B | HuggingFace | StreamingLLM | 1 | 19.8 | 43.9 | 19.8 |
| 7B | HuggingFace | FlashAttention2 | 2 | 4.3 | 85.6 | 2.2 |
| 7B | HuggingFace | MoA | 1 | 20.3 | 44.0 | 20.3 |
| 13B | vLLM | PagedAttention | 2 | 21.5 | 142.0 | 10.8 |
| 13B | FlexGen | H2O | >8 | - | OOM | - |
| 13B | HuggingFace | InfLLM | 1 | 4.3 | 78.6 | 4.3 |
| 13B | HuggingFace | StreamingLLM | 1 | 14.0 | 64.6 | 14.0 |
| 13B | HuggingFace | FlashAttention2 | 2 | 3.0 | 130.6 | 1.5 |
| 13B | HuggingFace | MoA | 1 | 14.7 | 63.4 | 14.7 |
At a 128k input length, MoA processes inputs using only one A100 GPU, whereas FlashAttention2 and vLLM require at least two GPUs for a single request.
- Speedup: MoA achieves a 4.7-4.9x decode speedup over FlashAttention2 (while using half the GPUs) and 1.3-1.4x higher throughput per GPU than vLLM (Table 12).
- Memory Savings: It also demonstrates a 1.9-2.1x reduction in GPU memory usage.
6.2.3. Energy Efficiency Result
The following are the results from Table 13 of the original paper:
| Framework | Attention | Energy per Token (J), 4k | 8k | 16k | Power (W), 4k | 8k | 16k |
|---|---|---|---|---|---|---|---|
| HuggingFace | FlashAttention2 | 2.98 | 5.93 | 12.1 | 350 | 354 | 359 |
| HuggingFace | MoA | 0.34 | 0.62 | 1.21 | 330 | 322 | 315 |
Table 13 shows that MoA achieves an 8.7-10x reduction in energy per output token compared to FlashAttention2, the combined effect of its much higher decode throughput and its slightly lower GPU power draw. This makes MoA not only faster but also more environmentally friendly.
6.2.4. Automatic Search Pipeline Overhead
The following are the results from Table 14 of the original paper:
| Stage | 7B LLM | 13B LLM | 70B LLM |
|---|---|---|---|
| Calibration Data Gen. | 10min | 15min | 2 × 60min |
| Profile | 20min | 2 × 25min | 8 × 210min |
| Optimize (CPU) | 30min | 25min | 100min |
| Validate | 35min | 40min | 2 × 140min |
| Total Latency | 1h 35min | 1h 45min | 8h 30min |
| Total GPU Time | 1h 5min | 1h 45min | 34h 40min |
The automatic configuration search is a one-time overhead.
- For Vicuna-7B and 13B models, the entire search pipeline completes within two hours.
- For the larger Llama3-70B model, it takes 8.5 hours of wall time and 34.7 hours of GPU time.
- Profiling is the most resource-demanding stage, requiring two A100 GPUs for a 13B model at 8k profile length. Other stages typically need a single GPU.
- The optimization phase (on CPU) is relatively quick, taking about 25 minutes for Vicuna-13B. Validation adds about 40 minutes. This overhead is manageable given the significant and lasting efficiency benefits.
The following are the results from Table 15 of the original paper:
| Stage | Complexity w.r.t. parameter size | Complexity w.r.t. dataset size |
|---|---|---|
| Calibration Dataset Gen. | Linear | Linear |
| Profile | Linear | Linear |
| Optimize | Polynomial to exponential in #heads | Irrelevant |
| Validate | Linear | Linear |
| Empirical Latency | Almost linear | Linear |
Table 15 summarizes the computational complexity of each stage with respect to model parameter size and calibration dataset size. Most stages are linear, except for optimization which is polynomial to exponential with respect to the number of heads in the theoretical worst case, but empirically efficient.
6.3. Additional Performance Analysis (Appendix C.1.1)
The following are the results from Table 6 of the original paper:
| Model | Attention | Retrieve Acc. ↑ (4k) | (8k) | (16k) | LV-Eval ↑ (16k) | LongBench ↑ (0-4k) | (4-8k) | (8-16k) | PPL ↓ (8-12k) |
|---|---|---|---|---|---|---|---|---|---|
| Vicuna-7B | StreamingLLM | 0.91 | 0.35 | 0.09 | 4.30 | 36.39 | 32.44 | 31.04 | 3.92 |
| | MoA | 1.00 | 0.97 | 0.58 | 5.67 | 38.07 | 33.80 | 31.75 | 3.78 |
| Vicuna-13B | StreamingLLM | 0.73 | 0.81 | 0.37 | 5.65 | 36.77 | 34.65 | 33.43 | 3.70 |
| | MoA | 0.99 | 0.97 | 0.42 | 5.57 | 41.85 | 39.76 | 36.06 | 3.62 |
| Llama3-8B | StreamingLLM | 1.00 | 0.83 | 0.76 | 14.89 | 42.45 | 40.62 | 43.16 | 4.51 |
| | MoA | 0.99 | 1.00 | 0.93 | 15.61 | 43.51 | 42.51 | 43.58 | 4.53 |
Table 6 (Appendix C.1.1) shows MoA's overall performance at a higher density of 75%. Even at this higher density, MoA demonstrates improved performance over the uniform attention baseline StreamingLLM across all metrics. This reinforces the effectiveness of heterogeneous elastic rules in MoA.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces Mixture of Attention Spans (MoA), a novel, training-free method designed to optimize the efficiency of Large Language Model (LLM) inference in long-context scenarios. MoA's core innovation lies in automatically tailoring heterogeneous elastic rules for sliding-window lengths to individual attention heads and layers. By precisely profiling the attention influence on model predictions using a specially designed calibration dataset (featuring long-range dependencies and model-aligned supervision), MoA intelligently allocates attention spans to preserve crucial information while aggressively pruning redundant connections.
The key findings and contributions are:
-
Enhanced Context Length and Accuracy: With the same average sliding-window length, MoA extends the effective context length of LLMs by 3.9x, leading to a 1.5-7.1x boost in retrieval accuracy over uniform-window baselines. It significantly narrows the performance gap with full attention, reducing the maximum relative performance drop on long-context understanding benchmarks from 9-36% to within 5%.
-
Superior Efficiency: MoA achieves a 1.2-1.4x GPU memory reduction and substantially improves decode throughput, by 6.6-8.2x over FlashAttention2 and 1.7-1.9x over vLLM, with minimal impact on performance. These gains stem from static KV-Cache management, reduced attention computation, increased batch sizes, and a specialized CUDA kernel.
-
Automatic and Generalizable: The proposed automatic configuration search pipeline finds optimal heterogeneous elastic rules within a few hours. The discovered configurations generalize strongly to unseen and longer contexts (up to 256k), suggesting that the learned attention patterns are intrinsic to the model.

In essence, MoA offers a practical and powerful solution for deploying LLMs efficiently in diverse long-context applications by intelligently adapting the attention mechanism to its heterogeneous nature.
7.2. Limitations & Future Work
The authors acknowledge the following limitations and suggest future research directions:
- Extremely Low-Density Budgets: MoA's current performance degrades under extremely low density budgets. This implies there is a limit to how much attention can be pruned before critical information is lost, even with heterogeneous rules.
- Dynamic MoA Method: To address the limitation at very low density budgets, the authors suggest designing a dynamic MoA method. Such a method could adapt attention spans more flexibly at runtime based on the specific content, offering finer-grained control and potentially better performance in extreme compression scenarios.
- Non-linear Elastic Rules: The current elastic rules use a simple linear function of the input length. Exploring non-linear elastic rules with bounded attention spans could offer more nuanced and potentially more optimal ways for attention spans to scale with input length.
- Adaptation to Other Compression Methods: The profiling method developed in MoA to evaluate attention influence could be adapted to evaluate the influence of weights and other activations. This could facilitate the development of other LLM compression methods, such as quantization or weight pruning, by providing a principled way to identify critical components.
7.3. Personal Insights & Critique
The Mixture of Attention Spans (MoA) paper presents a compelling and practical approach to a significant challenge in LLM deployment.
Inspirations:
- The Power of Heterogeneity: The paper brilliantly leverages the insight that attention heads are not monolithic. This heterogeneous perspective is a powerful paradigm that could be extended to other components of neural networks beyond attention mechanisms, where different parts may serve specialized functions. Recognizing and optimizing for this diversity is more effective than uniform compression.
- Data Engineering for Profiling: The emphasis on calibration dataset design, particularly using model-aligned supervision, is a crucial and often overlooked aspect of compression research. It highlights that the "data" used to analyze compression impact must be carefully constructed to reflect real-world usage and the model's inherent patterns, rather than generic datasets or human annotations that may not fully capture the model's internal logic. This concept could be transformative for other model compression or interpretability tasks.
- Interpretability of Rules: The correlation between mask sparsity and positional dependency (SoE) of attention heads provides valuable interpretability into how LLMs function and how MoA preserves their semantic structure. This kind of analysis helps build trust in compressed models and guides further research into attention patterns.
- Bridging Static Efficiency and Dynamic Adaptability: MoA cleverly combines the best of both worlds: the predictable, hardware-efficient nature of static sparse attention for inference with the performance benefits typically sought by more complex dynamic methods. The one-time search overhead is a small price for significant and lasting runtime gains.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
-
Linear Elastic Rules Simplification: While effective, the linear scaling rule might be a simplification. Real attention-span requirements might follow more complex, non-linear, or piecewise functions, especially at very long contexts or for models with diverse architectures. The current method might miss some optimal configurations if the true elasticity is non-linear.
-
Search Space for the Elastic-Rule Hyperparameters: The discrete options for the base span and expansion rate are chosen empirically. The sensitivity of the final configuration to the range and granularity of these discrete options could be explored further. A more adaptive or continuous search might yield better results but would increase search complexity.
-
Generalization to New Models/Architectures: While shown to work for Vicuna and Llama3, the robustness of the discovered heterogeneous elastic rules and the profiling methodology for vastly different LLM architectures (e.g., Mixture-of-Experts models) or pre-training objectives would be interesting to verify.
-
Overhead Scaling for Larger Models/Search Spaces: While the current search pipeline is efficient, for models with even more layers/heads or a much larger hyperparameter search space (e.g., if non-linear rules were introduced), the polynomial-to-exponential complexity of the optimization stage could become a bottleneck. Further algorithmic innovations in the search itself might be necessary.
-
Single-GPU vs. Multi-GPU Inference: The efficiency gains are primarily shown on a single A100 GPU or a small number of GPUs. While memory reduction helps multi-GPU setups, the full impact on large-scale distributed inference (e.g., with complex tensor parallelism or pipeline parallelism) might introduce new bottlenecks that need to be addressed.
-
"Attention Sink" Heuristic: The concept of attention sinks (a fixed number of initial unmasked tokens) is a heuristic. While effective, the optimal number of attention sinks might also be a parameter that could be optimized or made elastic by the MoA framework itself, rather than being a fixed constant.

Overall, MoA is a significant step forward in efficient LLM inference, demonstrating that a nuanced, data-driven approach to sparse attention can yield substantial benefits without compromising performance. Its principled methodology and impressive results make it a valuable contribution to the field.