
MoBA: Mixture of Block Attention for Long-Context LLMs

Published: 02/18/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study presents MoBA, a novel mixture-of-block-attention mechanism for large language models that addresses the computational complexity of long-context tasks. It enables seamless transitions between full and sparse attention while maintaining performance, significantly improving efficiency.

Abstract

Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the ``less structure'' principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of this paper is "MoBA: Mixture of Block Attention for Long-Context LLMs," which proposes a novel attention mechanism designed to efficiently scale large language models (LLMs) to handle extremely long input sequences.

1.2. Authors

The paper lists numerous authors, primarily affiliated with Moonshot AI, with additional contributions from Tsinghua University and Zhejiang Lab/Zhejiang University.

  • Moonshot AI: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou.
  • Tsinghua University: Mingxing Zhang.
  • Zhejiang Lab/Zhejiang University: Jiezhong Qiu.

1.3. Journal/Conference

This paper is presented as a "Technical Report," which typically implies it is a pre-print not yet formally peer-reviewed and published in a journal or conference. It is available on arXiv, a popular repository for preprints in physics, mathematics, computer science, and other fields. arXiv serves as a crucial platform for disseminating research rapidly, allowing early sharing of findings before or during the formal peer-review process.

1.4. Publication Year

The publication date is 2025-02-18.

1.5. Abstract

The abstract introduces the critical challenge of scaling context length in large language models (LLMs) due to the quadratic computational complexity of traditional attention mechanisms. It notes that existing solutions either introduce strong biases (e.g., sink or window attention) or use linear approximations whose performance on complex reasoning tasks is not well-understood. The authors propose Mixture of Block Attention (MoBA), an innovative approach that applies Mixture of Experts (MoE) principles to the attention mechanism. MoBA adheres to a "less structure" principle, allowing the model to autonomously determine where to attend. This novel architecture demonstrates superior performance on long-context tasks, offers the key advantage of seamlessly transitioning between full and sparse attention, and enhances efficiency without compromising performance. MoBA has been deployed to support Kimi's long-context requests, demonstrating significant advancements in efficient attention computation for LLMs.

2. Executive Summary

2.1. Background & Motivation

The advancement of large language models (LLMs) towards Artificial General Intelligence (AGI) critically depends on their ability to process and understand longer context lengths. However, a fundamental bottleneck is the attention mechanism within Transformers, which exhibits a quadratic computational complexity with respect to the input sequence length. This quadratic growth ($O(N^2)$, where $N$ is the sequence length) makes processing very long contexts computationally prohibitive and memory-intensive.

Prior research has attempted to address this problem through two main avenues:

  1. Imposing biased structures: Approaches like sink attention or window attention introduce predefined patterns that restrict the tokens a query can attend to. While efficient, these methods are often task-specific and may limit the model's ability to capture complex, non-local dependencies.

  2. Radically modifying attention into linear approximations: Models such as Mamba, RWKV, or RetNet replace the standard attention mechanism with linear approximations to reduce complexity. However, their effectiveness in demanding complex reasoning tasks remains largely unexplored and unproven.

    The paper identifies a crucial gap: a need for a solution that provides efficiency for long contexts while adhering to a "less structure" principle, allowing the model to determine relevant information autonomously without strong predefined biases. Such a solution should ideally be able to seamlessly switch between efficient sparse attention and comprehensive full attention without performance compromises.

2.2. Main Contributions / Findings

The paper proposes Mixture of Block Attention (MoBA) as an innovative solution to the long-context problem in LLMs, offering several key contributions:

  • Novel Architecture (MoBA): MoBA introduces a new attention mechanism that adapts the principles of Mixture of Experts (MoE) to the attention layer of Transformer models. Instead of routing entire network layers, MoBA partitions the input context into blocks and uses a gating network to dynamically select the most relevant blocks for each query token. This allows for sparse attention while maintaining the model's ability to attend where needed.

  • Adherence to "Less Structure" Principle: Unlike sliding window or sink attention, MoBA avoids imposing strong, predefined biases on the attention pattern. The gating mechanism allows the model to learn and autonomously determine which KV (key-value) blocks are most relevant to a given query, enhancing its flexibility and generality across tasks.

  • Seamless Transition between Full and Sparse Attention: MoBA is designed to easily switch between sparse attention and full attention, offering a crucial advantage for balancing efficiency and performance. This hybrid capability ensures the model can leverage efficiency when possible without risking performance degradation when full attention is truly necessary.

  • Superior Performance on Long-Context Tasks with Efficiency Gains: Experimental results demonstrate that MoBA achieves performance comparable to full attention models in terms of language model (LM) loss and various downstream tasks, even with significant sparsity. Critically, it offers substantial computational efficiency improvements, with speedups of up to 6.5x during prefilling 1 million tokens and effective scaling up to 10 million tokens.

  • Real-World Deployment: MoBA has already been deployed to support Kimi's long-context requests, validating its practical utility and robustness in a real-world production environment.

  • Open-Sourced Code: The authors provide their code, encouraging further research and development within the community.

    These contributions collectively represent a significant advancement in developing efficient and scalable attention mechanisms for LLMs, paving the way for models capable of handling much longer contexts required for AGI.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand MoBA, a foundational understanding of Large Language Models (LLMs), the Transformer architecture, its core attention mechanism, and the concept of Mixture of Experts (MoE) is essential.

  • Large Language Models (LLMs): LLMs are a class of deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to perform a wide range of natural language processing (NLP) tasks. They excel at understanding, generating, and manipulating human language, demonstrating capabilities in tasks like text generation, translation, summarization, and complex reasoning. Their ability to process and generate long sequences of text is crucial for their performance, and extending their context length (the amount of preceding information they can consider) is a major research goal.

  • Transformer Architecture: Introduced by Vaswani et al. (2017), the Transformer is a neural network architecture that has become the backbone of most modern LLMs. It eschews traditional recurrent or convolutional layers in favor of attention mechanisms. A Transformer typically consists of an encoder and a decoder stack (or decoder-only for generative LLMs). Each stack is composed of multiple identical layers, with each layer containing a multi-head self-attention sub-layer and a position-wise feed-forward network (FFN). A key innovation of Transformers is their ability to process input sequences in parallel, unlike Recurrent Neural Networks (RNNs), making them highly efficient for training on modern hardware.

  • Attention Mechanism: The attention mechanism is the core innovation of the Transformer. It allows the model to dynamically weigh the importance of different parts of the input sequence when processing each token. Instead of processing tokens sequentially, attention enables the model to look at all other tokens in the sequence simultaneously and decide which ones are most relevant to the current token. This mechanism is crucial for capturing long-range dependencies in text. The self-attention variant allows a token in a sequence to attend to all other tokens in the same sequence. Standard self-attention uses three learnable projections: Query (Q), Key (K), and Value (V). For each token, its query vector is compared against all key vectors to compute attention scores, which are then used to weight the value vectors. The output is a weighted sum of the value vectors, representing the context-aware representation of the token. The computational cost of standard self-attention is quadratic with respect to the sequence length $N$ and linear in the head dimension $d$, i.e., $O(N^2 \cdot d)$, primarily due to the $QK^\top$ matrix multiplication. This is the "prohibitive overhead" the paper aims to address. A minimal code sketch of causal attention follows this list.

  • Mixture of Experts (MoE): Mixture of Experts (MoE) is a technique used to scale models by conditionally executing only a subset of the network's parameters for each input. Instead of having a single large network, an MoE layer consists of a gating network (or router) and several expert networks (e.g., Feed-Forward Networks). For a given input, the gating network learns to select and route the input to one or more expert networks. Only the selected experts are computed, reducing the computational cost per token while increasing the total number of parameters. MoE models have primarily been applied to the Feed-Forward Network (FFN) layers in Transformers to improve scalability and training efficiency. MoBA innovatively applies this principle to the attention mechanism itself.

  • Flash Attention: Flash Attention (Dao et al., 2022) is a highly optimized implementation of the attention mechanism that significantly reduces memory usage and speeds up computation, particularly for long sequences. It achieves this by reorganizing the computation of attention to minimize reads and writes to high-bandwidth memory (HBM), leveraging SRAM (on-chip memory). While Flash Attention makes standard attention more efficient, it does not change its quadratic computational complexity in terms of FLOPs, but rather reduces the wall-clock time and memory footprint for a given $N$. MoBA's implementation incorporates Flash Attention to further optimize its block-wise computations.

  • Causal Masking: In generative LLMs and autoregressive tasks, causal masking is applied to the attention mechanism to prevent a token from attending to future tokens in the sequence. This ensures that the model only uses information available up to the current token to make predictions, preserving the autoregressive property. It's typically implemented by setting the attention scores for future positions to negative infinity before the softmax operation, effectively making them zero after softmax. MoBA explicitly incorporates causal masking into its block selection to maintain this property.
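
To ground the attention and causal-masking concepts above in code, the following is a minimal PyTorch sketch of single-head scaled dot-product attention with a causal mask. It is illustrative only (the function name and shapes are chosen here, not taken from the paper), and it deliberately materializes the full $N \times N$ score matrix to make the quadratic cost visible.

```python
import torch
import torch.nn.functional as F

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: [N, d] tensors for a sequence of N tokens and head dimension d.
    """
    n, d = q.shape
    # Similarity of every query against every key: the O(N^2) step.
    scores = q @ k.transpose(-2, -1) / d ** 0.5               # [N, N]
    # Causal mask: position i may only attend to positions j <= i.
    future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)                        # rows sum to 1
    return weights @ v                                         # [N, d]

# Example: 8 tokens, head dimension 16.
q, k, v = (torch.randn(8, 16) for _ in range(3))
out = causal_attention(q, k, v)                                # shape [8, 16]
```

Flash Attention computes the same result without materializing the full score matrix; MoBA instead shrinks the set of keys each query scores against.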

3.2. Previous Works

The paper contextualizes MoBA by discussing various approaches to address the quadratic complexity of Transformer attention for long contexts.

  • Standard Attention (from Vaswani et al., 2017): The core mechanism that MoBA seeks to make more efficient. For a single query vector $q \in \mathbb{R}^{1 \times d}$ attending to $N$ key vectors $K \in \mathbb{R}^{N \times d}$ and value vectors $V \in \mathbb{R}^{N \times d}$, the output is calculated as $ \mathrm{Attn}(q, K, V) = \mathrm{Softmax}(q K^{\top}) V $. Here, $q$ is the query vector, $K$ is the matrix of key vectors, $V$ is the matrix of value vectors, and $d$ is the dimension of a single attention head. The softmax function normalizes the attention scores. MoBA modifies this by only considering a subset of $K$ and $V$.

  • Static Sparse Attention: These methods introduce predefined, fixed patterns of attention to reduce computation.

    • Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020), LongNet (J. Ding et al., 2023): These models use a combination of local windowed attention (attending to nearby tokens) and global attention (attending to a few special tokens or a fixed pattern) to achieve linear or near-linear complexity.
    • Sliding Window Attention (SWA): A common form where each query token only attends to a fixed-size window of preceding tokens. The paper states that SWA (e.g., Child et al., 2019) can be viewed as a special case of MoBA where the gating network consistently selects the most recent $k$ blocks.
    • Attention Sink (Xiao et al., 2023): Each query token attends to a few initial "sink" tokens and a fixed window of recent tokens. The paper views this as a variant of MoBA where the gating network is designed to always select both the initial sink blocks and recent blocks.
  • Dynamic Sparse Attention: These methods adaptively select which tokens to attend to during computation.

    • Reformer (Kitaev et al., 2020) and Routing Transformer (Roy et al., 2021): Use locality-sensitive hashing (LSH) or k-means clustering to group similar queries and keys, allowing attention within clusters.
    • Memorizing Transformers (Y. Wu et al., 2022) and Unlimiformer (Bertsch et al., 2024): Dynamically attend to tokens selected by k-nearest neighbor (kNN) algorithms.
    • CoLT5 (Ainslie, Lei, et al., 2023): Designs a routing multi-head attention mechanism.
    • Permute Attention (Tay, Dehghani et al., 2020): Permutes blocks from the input sequence for dynamic block sparse attention.
    • Quest (Tang et al., 2024): A query-aware sparsity method for efficient long-context LLM inference, which the paper notes can be seen as MoBA with a block size of 1 and a specialized block representation function combining min and max pooling.
    • Longheads (Lu et al., 2024): Can be viewed as MoBA with a top-1 gating network, meaning each query selects only the single most relevant KV block.
  • Linear Attention Models (Beyond Traditional Attention): These approaches entirely replace the canonical softmax-based attention with linear approximations to achieve linear computational complexity ($O(N \cdot d)$).

    • Mamba (Gu et al., 2023; Dao and Gu, 2024): A state-space model (SSM) that offers linear-time sequence modeling with selective state space mechanisms.
    • RWKV (Peng, Alcaide, et al., 2023; Peng, Goldstein, et al., 2024): Combines RNN and Transformer ideas, offering linear scaling with a recurrent mechanism.
    • RetNet (Sun et al., 2023): A retentive network designed as a successor to Transformers, offering $O(N)$ complexity during inference.
    • The paper notes that while these models reduce computational overhead, evidence of their effectiveness in complex reasoning tasks remains limited.

3.3. Technological Evolution

The field of attention mechanisms for LLMs has evolved from computationally intensive full attention to a spectrum of more efficient approaches.

  1. Full Attention (2017 onwards): The initial Transformer model introduced self-attention, which, despite its quadratic complexity, revolutionized NLP due to its ability to capture long-range dependencies. Its parallelizability for training on GPUs made it dominant.

  2. Static Sparse Attention (Late 2010s/Early 2020s): As context lengths grew, the limitations of full attention became apparent. Researchers began introducing fixed sparsity patterns (e.g., windowed, dilated, global tokens) to reduce computation, leading to models like Longformer, BigBird, ETC, and LongNet. These methods improve efficiency but often impose strong inductive biases.

  3. Dynamic Sparse Attention (Early 2020s onwards): The next step was to make sparsity adaptive. Instead of fixed patterns, models started learning where to attend, leveraging mechanisms like LSH, kNN, or routing networks. Quest and Longheads are examples of this trend, moving towards more flexible, data-driven sparsity. MoBA fits into this category, pushing the adaptive sparsity further by applying MoE principles.

  4. Beyond Traditional Attention (Mid-2020s): A more radical shift involves replacing the softmax-based attention altogether with architectures inspired by CNNs, RNNs, or State Space Models (SSMs), leading to linear attention models like Mamba, RWKV, and RetNet. While highly efficient, their performance on complex tasks is still under investigation.

    MoBA positions itself as an evolution within the dynamic sparse attention paradigm. It seeks to improve efficiency while largely preserving the Transformer framework's strengths, particularly its expressive power and less structure philosophy, which contrasts with the radical architectural changes of linear attention models.

3.4. Differentiation Analysis

MoBA differentiates itself from previous works primarily through its novel application of Mixture of Experts (MoE) principles to the attention mechanism itself, along with its emphasis on "less structure" and hybrid capabilities.

  • Vs. Full Attention:

    • Core Difference: MoBA significantly reduces computational complexity from quadratic to sub-quadratic by selectively attending to only a subset of KV blocks, whereas full attention computes interactions between every query and every key.
    • Innovation: Achieves efficiency without sacrificing performance, demonstrated by comparable LM loss and benchmark scores.
  • Vs. Static Sparse Attention (e.g., Sliding Window, Attention Sink):

    • Core Difference: Static sparse attention methods impose predefined, fixed patterns (e.g., fixed window size, always attending to initial sink tokens). MoBA, leveraging a gating network, dynamically and adaptively selects relevant blocks for each query.
    • Innovation: MoBA has stronger expressive power. The paper explicitly demonstrates that Sliding Window Attention and Attention Sink can be viewed as special cases of MoBA with specific, hard-coded gating networks. MoBA's learned gating mechanism allows it to adapt to diverse contextual needs, adhering to the "less structure" principle.
  • Vs. Other Dynamic Sparse Attention Methods (e.g., Quest, Longheads):

    • Core Difference: While Quest and Longheads also use dynamic sparsity, MoBA formalizes block selection with a Mixture of Experts framework. Longheads is effectively a top-1 gating MoBA, and Quest uses a specialized block representation.
    • Innovation: MoBA's top-k gating allows for a tunable balance of sparsity and context, and its block pooling for affinity scores is a direct application of MoE principles, offering a generalized and robust framework for adaptive block selection.
  • Vs. Linear Attention Models (e.g., Mamba, RWKV, RetNet):

    • Core Difference: Linear attention models radically alter the fundamental attention mechanism to achieve linear complexity, often requiring training from scratch and potentially exhibiting limitations in complex reasoning tasks. MoBA retains the core Transformer architecture and softmax-based attention formulation (albeit in a sparse form).
    • Innovation: MoBA avoids modifying the attention mechanism in ways that might compromise its expressive power for complex reasoning. It aims to achieve efficiency within the Transformer framework, allowing for easier integration with pre-trained Transformer-based models and potentially better performance on tasks requiring the full expressive power of softmax attention. The paper specifically highlights that the performance of linear approximations in complex reasoning tasks remains "inadequately explored," implying MoBA aims to provide a robust alternative.
  • Hybrid Capability: A unique differentiation is MoBA's ability to seamlessly transition between full attention and sparse attention, which is explicitly designed to balance efficiency and performance and is not a common feature in other sparse attention methods.

4. Methodology

4.1. Principles

The core principle behind Mixture of Block Attention (MoBA) is to address the quadratic computational complexity of Transformer attention by applying the Mixture of Experts (MoE) paradigm to the attention mechanism itself. Instead of having every query token attend to every key-value pair in the entire context, MoBA partitions the full context into discrete blocks. For each query token, a gating network dynamically selects a small, fixed number of the most relevant blocks to attend to. This allows for a sparse attention pattern, significantly reducing computational cost while maintaining the model's ability to focus on pertinent information.

The approach adheres to a "less structure" principle, meaning that unlike predefined sparse attention patterns (e.g., fixed windows), MoBA empowers the model to autonomously determine where to attend. This dynamic selection mechanism aims to preserve the expressive power of full attention while gaining efficiency, and crucially, it allows for a seamless transition between sparse and full attention modes.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Preliminaries: Standard Attention in Transformer

Before detailing MoBA, the paper first outlines the standard attention mechanism as used in Transformers. This provides a baseline against which MoBA's modifications can be understood.

For a single query vector $q \in \mathbb{R}^{1 \times d}$ to attend to $N$ key vectors $K \in \mathbb{R}^{N \times d}$ and value vectors $V \in \mathbb{R}^{N \times d}$, the attention output is computed as: $ \mathrm{Attn}(q, K, V) = \mathrm{Softmax}\big(q K^{\top}\big) V $ In this formula:

  • $q$: Represents the query vector for the current token. It is a row vector of dimension $d$.

  • $K$: Is the matrix of key vectors, where each row corresponds to a key vector for a token in the context. It has dimensions $N \times d$, where $N$ is the total sequence length (number of tokens in the context) and $d$ is the head dimension.

  • $V$: Is the matrix of value vectors, similar to $K$, with dimensions $N \times d$.

  • $q K^{\top}$: Computes the dot product (similarity score) between the query vector and all key vectors. This results in a row vector of length $N$, where each element indicates how relevant the corresponding key is to the query.

  • $\mathrm{Softmax}(\cdot)$: Applies the softmax function to the dot-product scores, normalizing them into a probability distribution. This ensures that the attention weights sum to 1, indicating the relative importance of each key.

  • The entire expression $\mathrm{Softmax}\big(q K^{\top}\big) V$: Multiplies the normalized attention weights by the value vectors. This produces a weighted sum of the value vectors, where values from more relevant tokens (higher attention scores) contribute more to the output.

    The paper notes that this explanation focuses on a single-head scenario for clarity; extension to multi-head attention concatenates the outputs from multiple attention heads. The key challenge with standard attention is its quadratic complexity ($O(N^2)$), because the $q K^{\top}$ operation computes scores against all $N$ keys for every query.

4.2.2. MoBA Architecture

MoBA modifies the standard attention by enabling each query token to attend to only a subset of keys and values from the entire context. This is achieved through block partitioning and a dynamic selection strategy.

The modified MoBA attention formula is: $ \mathrm{MoBA}(q, K, V) = \mathrm{Softmax}\Big(q K[I]^{\top}\Big) V[I] $ Here:

  • $q$, $K$, $V$: Are the query vector, key matrix, and value matrix as in standard attention.

  • $I \subseteq [N]$: Represents the set of selected key and value indices. This is the crucial difference: instead of using all $N$ keys and values, MoBA uses only those specified by the index set $I$.

    The key innovation of MoBA lies in how this set $I$ is determined. The full context of length $N$ is first divided into $n$ blocks, where each block contains $B$ consecutive tokens.

  • $B = \frac{N}{n}$: Denotes the block size.

  • $I_i = [(i-1) \times B + 1, \; i \times B]$: Represents the range of indices of the $i$-th block.

    The MoBA model then employs a gating mechanism to select the most relevant blocks for each query token. The total set of selected indices $I$ is the union of the indices of all blocks $I_i$ whose gate value $g_i$ is positive (i.e., selected): $ I = \bigcup_{g_i > 0} I_i $
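
A toy example of this index bookkeeping (made-up numbers, 1-indexed to match the paper's notation; the variable names are illustrative):

```python
# Illustrative MoBA block partitioning and index-set union.
N, B = 32, 8                        # context length and block size (example values)
n = N // B                          # number of blocks

# I_i = [(i - 1) * B + 1, i * B] for i = 1..n
block_ranges = {i: range((i - 1) * B + 1, i * B + 1) for i in range(1, n + 1)}

g = {1: 1, 2: 0, 3: 1, 4: 0}        # example gate values g_i (1 = block selected)
I = sorted(idx for i, r in block_ranges.items() if g[i] > 0 for idx in r)
# -> the query attends only to key/value positions in blocks 1 and 3 (indices 1-8 and 17-24).
```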

4.2.2.1. MoBA Gating Mechanism

The MoBA gate determines the relevance of each block to a given query. It involves two steps:

  1. Affinity Score Computation: For each query $q$, an affinity score $s_i$ is computed for every block $i$. This score measures the relevance between the query and the $i$-th block, and is computed as the inner product between the query vector $q$ and the mean pooling of the key vectors within the $i$-th block: $ s_i = \langle q, \, \mathrm{mean\_pool}(K[I_i]) \rangle $ Here:

    • $\langle \cdot, \cdot \rangle$: Denotes the inner product (dot product) of two vectors.
    • $\mathrm{mean\_pool}(K[I_i])$: Computes the mean of all key vectors belonging to the $i$-th block along the sequence dimension, creating a single representative vector for the entire block.
    • The inner product then measures the similarity between the query and this block representative.
  2. Top-k Gating: After computing affinity scores for all blocks, a top-k gating mechanism is applied. It selects the $k$ blocks with the highest affinity scores for the current query. The gate value $g_i$ for the $i$-th block is 1 if its score is among the top $k$, and 0 otherwise: $ g_i = \begin{cases} 1 & s_i \in \mathrm{Topk}(\{s_j \mid j \in [n]\}, k) \\ 0 & \text{otherwise} \end{cases} $ Here:

    • $\mathrm{Topk}(\{s_j \mid j \in [n]\}, k)$: Represents the set containing the $k$ highest scores among all affinity scores $\{s_j\}$ over blocks $j \in [n]$.

      The following figure (Figure 1 from the original paper) provides a visual illustration of the MoBA mechanism:

Figure 1: Illustration of mixture of block attention (MoBA). (a) A running example of MoBA; (b) Integration of MoBA into Flash Attention.

Figure 1(a) shows a running example:

  • We have two query tokens (q1, q2) and four KV blocks (B1, B2, B3, B4).

  • A router (the gating network) dynamically selects two blocks (top-k = 2) for each query.

  • q1 is routed to B1 and B2.

  • q2 is routed to B3 and B4.

  • This illustrates how different queries can focus on different subsets of the context blocks.

    Figure 1(b) illustrates the integration into Flash Attention:

  • The KV cache is first chunked into blocks.

  • Mean pooling is applied to Keys within each block to create block representations.

  • Query (Q) is multiplied by these block representations ($Q \cdot \bar{K}_{\mathrm{block}}^{\top}$) to get affinity scores.

  • A TopK selection, combined with causal masking, identifies the relevant blocks.

  • Finally, Flash Attention is performed on the selected $Q$, $K$, $V$ blocks.

4.2.2.2. Causality in MoBA

To preserve the autoregressive property of LLMs, MoBA incorporates causal relationship preservation through specific designs:

  • No Attention to Future Blocks: A query token cannot be routed to any future block. This is enforced by setting the affinity score $s_i$ to $-\infty$ (and thus $g_i = 0$) for blocks that lie beyond the query token's position. More formally, if $\mathrm{pos}(q)$ is the position index of query $q$, then $s_i = -\infty$ and $g_i = 0$ for any block $i$ such that $\mathrm{pos}(q) < i \times B$. This ensures no information leakage from future parts of the sequence.
  • Current Block Attention and Causal Masking: The current block, i.e., the block that contains the query token, is always attended to. Causal masking is applied locally within that block to avoid information leakage from subsequent tokens, while also encouraging attention to the local context. More formally, $g_i = 1$ is explicitly set for the block $i$ whose range $I_i$ contains $\mathrm{pos}(q)$. A code sketch combining the gate with these causal constraints follows below.
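
Sections 4.2.2.1 and 4.2.2.2 together define the gate. Below is a hedged PyTorch sketch of that logic for a single head: mean-pool the keys per block, score each block against each query, mask out future blocks, force-select the current block, and keep the top-k. Names such as `moba_gate` are illustrative and this is not the paper's released implementation; in particular, forcing the current block into the top-k budget is one possible reading of the rule above.

```python
import torch

def moba_gate(q: torch.Tensor, k: torch.Tensor, block_size: int, top_k: int) -> torch.Tensor:
    """Return a [N, n_blocks] boolean gate: which KV blocks each query may attend to.

    q, k: [N, d] single-head query/key tensors; N is assumed to be a multiple of block_size.
    """
    n, d = q.shape
    n_blocks = n // block_size
    # Block representation: mean pooling of the keys inside each block.
    k_mean = k.reshape(n_blocks, block_size, d).mean(dim=1)          # [n_blocks, d]
    scores = q @ k_mean.T                                            # affinity s_i, [N, n_blocks]

    q_block = torch.arange(n) // block_size                          # block containing each query
    block_id = torch.arange(n_blocks)
    # Causality: never route a query to a future block (score -> -inf).
    scores = scores.masked_fill(block_id[None, :] > q_block[:, None], float("-inf"))
    # Always attend to the current block (forced into the top-k here; local causal
    # masking inside that block is applied later, at attention time).
    scores = scores.masked_fill(block_id[None, :] == q_block[:, None], float("inf"))

    top = scores.topk(min(top_k, n_blocks), dim=-1).indices          # [N, top_k]
    gate = torch.zeros(n, n_blocks, dtype=torch.bool)
    gate.scatter_(1, top, torch.ones_like(top, dtype=torch.bool))
    gate &= scores > float("-inf")                                   # drop any future block picked by top-k
    return gate

# Example: 32 tokens, block size 8 (4 blocks), top-k = 2.
gate = moba_gate(torch.randn(32, 16), torch.randn(32, 16), block_size=8, top_k=2)
```

Attention is then computed only over the keys and values of the gated blocks (with standard causal masking inside the current block), which is where the computational savings come from.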

4.2.2.3. Fine-Grained Block Segmentation

Inspired by MoE literature (which shows benefits of fine-grained expert segmentation), MoBA investigates the impact of partitioning the context into blocks with a finer grain. While MoE typically segments along the FFN intermediate hidden dimension, MoBA operates segmentation along the context-length dimension. The experiments aim to determine if smaller block sizes (more blocks, each covering fewer tokens) can enhance performance.

4.2.2.4. Hybrid of MoBA and Full Attention

MoBA is designed as a direct substitute for full attention, maintaining the same number of parameters. This design enables a hybrid strategy where the model can seamlessly switch between MoBA (sparse attention) and full attention. This choice can be dynamically made during training or inference, providing flexibility to balance efficiency and performance without significant overhead. The paper explores this for both training and layer-wise supervised fine-tuning (SFT).

4.2.2.5. Comparison to Sliding Window Attention and Attention Sink

The paper highlights that MoBA has stronger expressive power than widely used sparse attention architectures like Sliding Window Attention (SWA) and Attention Sink.

  • Sliding Window Attention (SWA): Each query token attends only to a fixed window of preceding tokens. MoBA can emulate SWA by having a gating network that consistently selects the most recent $k$ blocks.

  • Attention Sink: Each query token attends to a set of initial tokens (sink tokens) and a fixed window of recent tokens. MoBA can emulate Attention Sink by having a gating network that always selects both the initial sink blocks and the recent blocks.

    This demonstrates that MoBA is a more general framework, capable of subsuming these specialized sparse attention methods by incorporating specific (learned or hardcoded) gating networks. MoBA's adaptive and dynamic block selection allows for more nuanced and efficient information processing, moving beyond the fixed and potentially suboptimal patterns of SWA and Attention Sink. Both emulations are sketched in code below.
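
To make the "special case" claim concrete, here is a small sketch of hard-coded gates that reproduce the two patterns within a block-selection framing. The helper names are hypothetical and the gates are fixed rather than learned; they use the same [N, n_blocks] boolean gate convention as the sketch above.

```python
import torch

def sliding_window_gate(n_tokens: int, block_size: int, k: int) -> torch.Tensor:
    """Gate that always selects the k most recent blocks (including the current one)."""
    n_blocks = n_tokens // block_size
    q_block = torch.arange(n_tokens) // block_size               # block of each query token
    block_id = torch.arange(n_blocks)
    dist = q_block[:, None] - block_id[None, :]                  # how far in the past each block is
    return (dist >= 0) & (dist < k)                              # [n_tokens, n_blocks]

def attention_sink_gate(n_tokens: int, block_size: int, k: int,
                        n_sink_blocks: int = 1) -> torch.Tensor:
    """Gate that selects the initial sink block(s) plus the k most recent blocks."""
    gate = sliding_window_gate(n_tokens, block_size, k)
    n_blocks = n_tokens // block_size
    q_block = torch.arange(n_tokens) // block_size
    sink = torch.zeros_like(gate)
    sink[:, :n_sink_blocks] = True
    # Keep causality: only sink blocks that are not in the query's future.
    sink &= torch.arange(n_blocks)[None, :] <= q_block[:, None]
    return gate | sink
```

MoBA replaces such fixed rules with the learned top-k gate over query-block affinity scores, which is why both patterns are subsumed as special cases.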

4.2.3. Implementation

The authors provide a high-performance implementation of MoBA by incorporating optimization techniques from Flash Attention (Dao et al., 2022) and Mixture of Experts (MoE) (Rajeswaran et al., 2022). The algorithmic workflow is formalized in Algorithm 1 and visualized in Figure 1(b).

The following figure (Figure 2 from the original paper) shows the efficiency evaluation:

Figure 2: Efficiency of MoBA vs. full attention (implemented with Flash Attention). (a) 1M model speedup evaluation: computation time scaling of MoBA versus Flash Attention on a 1M model with increasing sequence lengths (8K-1M). (b) Fixed sparsity ratio scaling: computation time comparison between MoBA and Flash Attention across increasing sequence lengths (8K-10M), maintaining a constant sparsity ratio of 95.31% (fixed 64 MoBA blocks with varying block size and fixed top-k = 3).
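
As a quick sanity check on the quoted sparsity ratio (our arithmetic from the stated configuration, not a derivation given in the figure): with 64 blocks and top-k = 3, each query attends to 3 of 64 blocks, so

$ \text{attended fraction} = \frac{k}{n} = \frac{3}{64} \approx 4.69\%, \qquad \text{sparsity} = 1 - \frac{3}{64} \approx 95.31\%. $

The same ratio recurs in the 1M-token setting reported later: block size 4096 and top-k = 12 give $12/256 = 3/64$, assuming 1M $= 2^{20}$ tokens.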

The MoBA implementation consists of five major steps:

  1. Determine assignment: Identify which query tokens will attend to which KV blocks based on the gating network and causal mask.

  2. Arrange ordering: Organize the query tokens based on their assigned KV blocks for efficient processing.

  3. Compute attention outputs: Calculate attention outputs for the current block and the selected historical KV blocks. This step is optimized using Flash Attention for varying lengths.

  4. Compute self-attention for the current block: This is typically a standard causal self-attention.

  5. Combine results: Integrate the outputs from current block attention and historical block attention using online softmax (Milakov and Gimelshein, 2018).

    The following is the Algorithm 1 from the original paper, detailing the MoBA implementation:

Algorithm 1 MoBA (Mixture of Block Attention) Implementation

Require: Query, key and value matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times h \times d}$; MoBA hyperparameters (block size $B$ and top-$k$); $h$ and $d$ denote the number of attention heads and the head dimension. Also denote $n = N/B$ the number of blocks.
1: // Split KV into blocks
2: $\{\tilde{\mathbf{K}}_i, \tilde{\mathbf{V}}_i\} = \mathrm{split\_blocks}(\mathbf{K}, \mathbf{V}, B)$, where $\tilde{\mathbf{K}}_i, \tilde{\mathbf{V}}_i \in \mathbb{R}^{B \times h \times d}$, $i \in [n]$
3: // Compute gating scores for dynamic block selection
4: $\bar{\mathbf{K}} = \mathrm{mean\_pool}(\mathbf{K}, B) \in \mathbb{R}^{n \times h \times d}$
5: $\mathbf{S} = \mathbf{Q}\bar{\mathbf{K}}^{\top} \in \mathbb{R}^{N \times h \times n}$
6: // Select blocks with causal constraint (no attention to future blocks)
7: $\mathbf{M} = \mathrm{create\_causal\_mask}(N, n)$
8: $\mathbf{G} = \mathrm{topk}(\mathbf{S} + \mathbf{M}, k)$
9: // Organize attention patterns for computation efficiency
10: $\mathbf{Q}^{s}, \tilde{\mathbf{K}}^{s}, \tilde{\mathbf{V}}^{s} = \mathrm{get\_self\_attn\_block}(\mathbf{Q}, \tilde{\mathbf{K}}, \tilde{\mathbf{V}})$
11: $\mathbf{Q}^{m}, \tilde{\mathbf{K}}^{m}, \tilde{\mathbf{V}}^{m} = \mathrm{get\_moba\_attn\_block}(\mathbf{Q}, \tilde{\mathbf{K}}, \tilde{\mathbf{V}}, \mathbf{G})$
12: // Compute attentions separately
13: $\mathbf{O}^{s} = \mathrm{flash\_attention\_varlen}(\mathbf{Q}^{s}, \tilde{\mathbf{K}}^{s}, \tilde{\mathbf{V}}^{s}, \text{causal=True})$
14: $\mathbf{O}^{m} = \mathrm{flash\_attention\_varlen}(\mathbf{Q}^{m}, \tilde{\mathbf{K}}^{m}, \tilde{\mathbf{V}}^{m}, \text{causal=False})$
15: // Combine results with online softmax
16: $\mathbf{O} = \mathrm{combine\_with\_online\_softmax}(\mathbf{O}^{s}, \mathbf{O}^{m})$
17: return $\mathbf{O}$

Explanation of Algorithm 1:

  • Require:

    • $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times h \times d}$: Input Query, Key, and Value matrices. $N$ is the sequence length, $h$ is the number of attention heads, and $d$ is the head dimension.
    • MoBA hyperparameters: block size $B$ and top-k value $k$.
    • $n = N/B$: The total number of blocks.
  • Line 1-2: Split KV into blocks

    • $\mathrm{split\_blocks}(\mathbf{K}, \mathbf{V}, B)$: This function partitions the full Key and Value matrices ($\mathbf{K}$, $\mathbf{V}$) into $n$ smaller blocks.
    • $\{\tilde{\mathbf{K}}_i, \tilde{\mathbf{V}}_i\}$: Represents the $i$-th Key and Value block. Each block $\tilde{\mathbf{K}}_i, \tilde{\mathbf{V}}_i$ has dimensions $B \times h \times d$.
  • Line 3-5: Compute gating scores for dynamic block selection

    • $\bar{\mathbf{K}} = \mathrm{mean\_pool}(\mathbf{K}, B) \in \mathbb{R}^{n \times h \times d}$: Computes the mean of the Key vectors within each block, producing a matrix $\bar{\mathbf{K}}$ whose rows are the mean-pooled block representations.
    • $\mathbf{S} = \mathbf{Q}\bar{\mathbf{K}}^{\top} \in \mathbb{R}^{N \times h \times n}$: Calculates the affinity scores between each query token and each mean-pooled block key. This is the $\langle q, \mathrm{mean\_pool}(K[I_i]) \rangle$ operation from the gating mechanism, extended to all queries and heads.
  • Line 6-8: Select blocks with causal constraint

    • $\mathbf{M} = \mathrm{create\_causal\_mask}(N, n)$: Generates a causal mask matrix. This mask ensures that query tokens can only attend to historical blocks (blocks that precede or contain the query's position), preventing attention to future blocks; the scores of future blocks are set to $-\infty$ here.
    • $\mathbf{G} = \mathrm{topk}(\mathbf{S} + \mathbf{M}, k)$: Applies the top-k selection to the affinity scores $\mathbf{S}$ after adding the causal mask $\mathbf{M}$. The resulting gating matrix $\mathbf{G}$ (binary: 1 for selected, 0 otherwise) indicates which $k$ blocks each query will attend to, respecting causality.
  • Line 9-11: Organize attention patterns for computation efficiency

    • $\mathbf{Q}^{s}, \tilde{\mathbf{K}}^{s}, \tilde{\mathbf{V}}^{s} = \mathrm{get\_self\_attn\_block}(\mathbf{Q}, \tilde{\mathbf{K}}, \tilde{\mathbf{V}})$: Prepares the Query, Key, and Value tensors for self-attention within the current block. This typically involves extracting the per-block $Q$, $K$, $V$ and aligning them for local causal attention.
    • $\mathbf{Q}^{m}, \tilde{\mathbf{K}}^{m}, \tilde{\mathbf{V}}^{m} = \mathrm{get\_moba\_attn\_block}(\mathbf{Q}, \tilde{\mathbf{K}}, \tilde{\mathbf{V}}, \mathbf{G})$: Prepares the Query, Key, and Value tensors for MoBA attention over historical blocks. This function uses the gating matrix $\mathbf{G}$ to select and arrange the KV blocks that each query will attend to.
  • Line 12-14: Compute attentions separately

    • $\mathbf{O}^{s} = \mathrm{flash\_attention\_varlen}(\mathbf{Q}^{s}, \tilde{\mathbf{K}}^{s}, \tilde{\mathbf{V}}^{s}, \text{causal=True})$: Computes self-attention within the current block using an optimized Flash Attention implementation that supports varying sequence lengths. causal=True ensures that tokens within the current block only attend to preceding tokens in that block.
    • $\mathbf{O}^{m} = \mathrm{flash\_attention\_varlen}(\mathbf{Q}^{m}, \tilde{\mathbf{K}}^{m}, \tilde{\mathbf{V}}^{m}, \text{causal=False})$: Computes attention from query tokens to the selected historical KV blocks using Flash Attention. causal=False here because queries may attend to all tokens within the selected historical blocks, which already lie entirely in the past.
  • Line 15-16: Combine results with online softmax

    • $\mathbf{O} = \mathrm{combine\_with\_online\_softmax}(\mathbf{O}^{s}, \mathbf{O}^{m})$: Combines the outputs from the current-block attention ($\mathbf{O}^{s}$) and the historical-block attention ($\mathbf{O}^{m}$). This combination uses an online softmax (Milakov and Gimelshein, 2018) to correctly normalize the attention scores across all attended tokens (both local and historical) while managing memory efficiently.
  • Line 17: Return Output

    • Returns the final attention output $\mathbf{O}$.

      This algorithm efficiently computes sparse attention by leveraging block partitioning, dynamic top-k gating, causal masking, and Flash Attention for optimized computation on the selected blocks.
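
The least obvious step above is the final merge (lines 15-16): $\mathbf{O}^{s}$ and $\mathbf{O}^{m}$ are each normalized over a different subset of keys, so they must be renormalized jointly before being summed. The sketch below shows the log-sum-exp ("online softmax") merge on its own; the function names are illustrative, and a real kernel tracks the normalizers inside the Flash Attention computation rather than recomputing them as done here.

```python
import torch

def partial_attention(q, k, v):
    """Softmax attention of q over a *subset* of keys; also return the log-normalizer."""
    scores = q @ k.T / q.shape[-1] ** 0.5                 # [Nq, Nk]
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)   # log of the softmax denominator
    return torch.softmax(scores, dim=-1) @ v, lse

def combine_with_online_softmax(out_a, lse_a, out_b, lse_b):
    """Merge attention outputs computed over two disjoint key sets.

    Mathematically equal to softmax attention over the union of the key sets.
    """
    lse = torch.logaddexp(lse_a, lse_b)                   # combined denominator (in log space)
    return (lse_a - lse).exp() * out_a + (lse_b - lse).exp() * out_b

# Sanity check: splitting the keys and recombining matches attention over all keys.
q, k, v = torch.randn(4, 16), torch.randn(10, 16), torch.randn(10, 16)
out_a, lse_a = partial_attention(q, k[:6], v[:6])
out_b, lse_b = partial_attention(q, k[6:], v[6:])
full, _ = partial_attention(q, k, v)
assert torch.allclose(combine_with_online_softmax(out_a, lse_a, out_b, lse_b), full, atol=1e-5)
```

In MoBA this merge is applied per query to the current-block (causal) output and the selected-historical-block output.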

5. Experimental Setup

5.1. Datasets

The experiments in the paper utilize a large-scale training corpus for scaling law experiments and various benchmark datasets for large language modeling evaluation.

  • Training Corpus for Scaling Law Experiments:

    • The models are trained with a substantial number of tokens, ranging from 10.8 Billion for the 568M parameter model up to 36.9 Billion for the 2.1B parameter model (Table 1).
    • The specific source or characteristics of this training corpus are not detailed in the paper, but it is implied to be a general-purpose text dataset suitable for LLM pre-training.
    • Models are initially trained with a sequence length of 8K for initial LM loss evaluation, and then extended to 32K for long-context scalability assessments.
    • For continual pre-training, the Llama3.1 8B Base Model is used as a starting point. This model is continually pre-trained on contexts up to 1M tokens.
  • Evaluation Benchmarks for Large Language Modeling: The paper evaluates MoBA against full attention models across a diverse set of real-world downstream tasks, covering various capabilities of LLMs. These include:

    • AGIEval: A comprehensive benchmark for AGI capabilities.

    • BBH (Big-Bench Hard): A challenging subset of Big-Bench, designed to test LLMs on tasks requiring multi-step reasoning.

    • CEval: A Chinese-language benchmark for LLMs, covering various subjects and task types.

    • GSM8K: A dataset of grade school math word problems requiring multi-step reasoning.

    • HellaSWAG: A commonsense reasoning benchmark, requiring LLMs to select the most plausible ending to a given premise.

    • LooGLE: A long-context benchmark ("Long Context Generic Language Evaluation") testing understanding and question answering over long documents.

    • Competition Math: A benchmark for advanced mathematical problem-solving.

    • MBPP (Mostly Basic Python Programs): A dataset for code generation tasks, requiring models to write Python programs from natural language prompts.

    • MBPP Sanitized: A cleaned version of MBPP.

    • MMLU (Massive Multitask Language Understanding): A benchmark covering 57 subjects across STEM, humanities, social sciences, and more, testing LLMs' world knowledge and problem-solving abilities.

    • MMLU Pro: A more challenging version of MMLU.

    • OpenAI HumanEval: A dataset for evaluating code generation models on Python programming problems.

    • SimpleQA: A dataset for simple question answering.

    • TriviaQA: A challenging question answering dataset based on trivia.

    • LongBench: A benchmark specifically designed for evaluating LLMs on long-context understanding and reasoning tasks, with context lengths up to 32K.

    • RULER: A synthetic benchmark for assessing the effective context length of long-context LLMs, covering retrieval, multi-hop tracing, aggregation, and question-answering tasks; evaluated here at a 128K context length.

    • Needle in a Haystack: A specialized benchmark that evaluates a model's ability to retrieve a specific piece of information (needle) hidden within a very long document (haystack). This is a direct test of long-context retrieval and attention span. The evaluation is conducted up to 1M context length.

      These datasets are chosen to provide a comprehensive evaluation across general language modeling performance, reasoning, code generation, knowledge retrieval, and crucially, long-context understanding capabilities, directly addressing the paper's primary objective.

5.2. Evaluation Metrics

The paper uses several standard metrics to evaluate MoBA's performance, particularly focusing on language model loss and task-specific scores for downstream benchmarks.

  • Language Model (LM) Loss:

    • Conceptual Definition: Language Model Loss, typically cross-entropy loss, quantifies how well a language model predicts the next token in a sequence given the preceding tokens. A lower LM loss indicates better predictive accuracy and stronger language modeling capabilities. It is a fundamental metric for evaluating the quality of generative language models during pre-training and fine-tuning.
    • Mathematical Formula: For a given sequence of NN tokens (t1,t2,,tN)(t_1, t_2, \ldots, t_N) and a vocabulary size VV, the cross-entropy loss is typically calculated as: $ L = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{V} y_{i,j} \log(\hat{y}_{i,j}) $
    • Symbol Explanation:
      • $L$: The cross-entropy loss.
      • $N$: The sequence length.
      • $V$: The vocabulary size.
      • $y_{i,j}$: A binary indicator (0 or 1) that is 1 if the $j$-th token in the vocabulary is the true next token at position $i$, and 0 otherwise.
      • $\hat{y}_{i,j}$: The predicted probability that the $j$-th token in the vocabulary is the next token at position $i$, according to the model.
  • Trailing LM Loss:

    • Conceptual Definition: Trailing LM loss is a variant of LM loss that specifically measures the prediction accuracy on the last few tokens of a long sequence (e.g., the last 1K or 2K tokens). This metric is particularly insightful for long-context models because LLMs often perform better on initial tokens and struggle with maintaining context and coherence towards the end of very long inputs. A good trailing LM loss indicates the model's ability to effectively utilize and retain information over long distances.
    • Mathematical Formula: The formula is the same as the LM loss, but the summation $\sum_{i=1}^{N}$ is restricted to the trailing positions, typically $i = N - K + 1$ to $N$, where $K$ is the number of trailing tokens considered.
    • Symbol Explanation: Same as LM loss, with $K$ representing the number of trailing tokens; a short computation sketch follows this list.
  • Position-wise LM Loss:

    • Conceptual Definition: Position-wise LM loss evaluates the LM loss at each specific token position within the sequence. This metric helps to visualize how the model's predictive performance changes as a function of its position in the context. It can reveal patterns such as degradation of performance with increasing distance from the beginning of the sequence, or the impact of sparse attention on different parts of the context.
    • Mathematical Formula: Not explicitly provided in the paper as a single formula, but conceptually it is the LM loss calculated for each position $i$ individually: $ L_i = - \sum_{j=1}^{V} y_{i,j} \log(\hat{y}_{i,j}) $
    • Symbol Explanation: $L_i$ is the loss at position $i$; other symbols are as defined for the LM loss.
  • Benchmark-Specific Metrics (e.g., Accuracy, F1 Score, Pass@1):

    • For benchmarks like AGIEval, BBH, GSM8K, MMLU, HellaSWAG, OpenAI HumanEval, etc., the paper reports scores that are standard for those respective benchmarks. These typically include:
      • Accuracy: For classification or multiple-choice tasks (e.g., MMLU, HellaSWAG).
      • F1 Score: For tasks involving information extraction or generation where both precision and recall are important.
      • Pass@1: For code generation tasks (e.g., OpenAI HumanEval, MBPP), indicating the percentage of problems for which the first generated solution passes all test cases.
    • While the paper doesn't explicitly define the formulas for these standard metrics, they are well-established in the NLP and ML communities. The specific metrics used for each benchmark are assumed to be their standard evaluation protocols.
  • Needle in a Haystack Evaluation Score:

    • Conceptual Definition: This is a specialized test for long-context retrieval. A specific, unique "needle" (e.g., a sentence or fact) is embedded at various positions within a very long "haystack" (a long document). The model is then prompted to find and extract this needle. The score (often a percentage) indicates how accurately the model can locate and reproduce the embedded information. A high score signifies strong long-context retrieval and attention capabilities.
    • Mathematical Formula: Typically, this is a binary success/failure for each trial, averaged over multiple trials and positions. $ \text{Score} = \frac{\text{Number of correctly retrieved needles}}{\text{Total number of needles}} \times 100\% $
    • Symbol Explanation: Number of correctly retrieved needles is the count of instances where the model successfully extracts the target information. Total number of needles is the total number of test cases.
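
For concreteness, here is a minimal sketch relating LM loss, trailing LM loss, and position-wise LM loss via per-token cross-entropy; the helper name and signature are our own, not from the paper.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, targets: torch.Tensor,
            trailing: Optional[int] = None) -> torch.Tensor:
    """Cross-entropy LM loss.

    logits:   [N, V] next-token logits for each position.
    targets:  [N]    ground-truth next-token ids.
    trailing: if set, average only over the last `trailing` positions
              (e.g., the last 1K or 2K tokens of a 32K sequence).
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")  # position-wise LM loss, [N]
    if trailing is not None:
        per_token = per_token[-trailing:]                           # trailing LM loss
    return per_token.mean()

# Example with random data: full vs. trailing loss on a 4096-token sequence.
logits, targets = torch.randn(4096, 32000), torch.randint(0, 32000, (4096,))
full_loss = lm_loss(logits, targets)
trail_loss = lm_loss(logits, targets, trailing=1024)
```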

5.3. Baselines

The primary baseline model against which MoBA is compared is full attention implemented with Flash Attention.

  • Full Attention (with Flash Attention):
    • This serves as the main strong baseline for evaluating both performance and efficiency. Full attention is the standard, most expressive attention mechanism in Transformers, where every query token attends to every key token in the context.

    • The implementation uses Flash Attention (Dao et al., 2022), which is an optimized version of full attention that significantly improves speed and memory efficiency on modern GPUs, without altering the fundamental quadratic computational complexity. By comparing against Flash Attention, the authors ensure that MoBA's efficiency gains are due to its sparse nature, not merely better implementation details.

    • The full attention models are trained under equivalent conditions to MoBA models (e.g., Llama-8B-1M-Full vs. Llama-8B-1M-MoBA) to allow for a direct comparison of their respective performance and efficiency profiles.

    • For continual pre-training experiments, the Llama3.1 8B Base Model is used as a common starting point for both MoBA and full attention variants, ensuring a fair comparison from a consistent foundation.

      The comparison aims to show that MoBA can match or exceed the performance of full attention while achieving substantial efficiency gains, particularly for very long contexts. The paper also implicitly contrasts MoBA with the limitations of other sparse attention methods (like sliding window and attention sink) and linear attention models (like Mamba or RetNet) by highlighting MoBA's "less structure" principle and its ability to perform well on complex reasoning tasks.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that MoBA achieves performance comparable to full attention while significantly improving computational efficiency and scalability for long contexts.

6.1.1. Scaling Law Experiments and Ablation Studies

Scalability w.r.t. LM Loss:

  • LM Loss (seqlen = 8K): When trained with an 8K sequence length (standard for many LLMs), MoBA models, despite achieving 81.25% sparsity, show LM loss curves that are almost indistinguishable from full attention models. The LM loss values for both mechanisms remain consistent within a range of 1e-3. This suggests that MoBA can scale similarly to full attention in terms of language modeling performance, even with substantial sparsity. The following figure (Figure 3 from the original paper) illustrates the scaling law comparison:

Figure 3: Scaling law comparison between MoBA and full attention. (a) LM loss on the validation set (seqlen = 8K); (b) trailing LM loss on the validation set (seqlen = 32K, last 1K tokens); (c) fitted scaling law curves. Figure 3(a) visually confirms the comparable LM loss for MoBA and full attention at 8K context length.

  • Long Context Scalability (Trailing LM Loss, seqlen = 32K): To assess long-context scalability, models were evaluated at a 32K maximum sequence length, using trailing LM loss (loss on the last 1K tokens) to avoid biases from short sequences. This setting results in an even higher sparsity of 95.31% for MoBA. Figure 3(b) shows that while MoBA initially has slightly higher trailing LM loss than full attention at lower compute levels, the gap progressively narrows as compute increases. This indicates MoBA's strong potential for long-context scalability.

Ablation Study: Fine-Grained Block Segmentation:

  • The paper investigates the impact of block granularity on MoBA's performance. Using a 1.5B parameter model with a 32K context length, the block size and top-k hyperparameters were adjusted to maintain a consistent sparsity of 75%.

  • The results, shown in Figure 4, demonstrate that MoBA's performance is significantly affected by block granularity. Finer-grained block segmentation (smaller block sizes, more blocks) generally leads to better performance (lower LM loss). This confirms that fine-grained expert segmentation is a beneficial technique for MoBA, similar to its positive impact in the broader MoE literature. The following figure (Figure 4 from the original paper) shows the effect of block segmentation:

Figure 4: Fine-grained block segmentation. LM loss on the validation set for MoBA with different block granularity settings, compared against the full attention baseline.

6.1.2. Hybrid of MoBA and Full Attention

The flexibility of MoBA allows for hybrid training and layer-wise hybrid strategies.

MoBA/Full Hybrid Training:

  • Three 1.5B parameter models were trained on 30B tokens with a 32K context length: MoBA/full hybrid, full attention, and MoBA.

    • MoBA/full hybrid: Trained with MoBA for 90% of tokens, then switched to full attention for the remaining 10%.
    • Full attention: Trained entirely with full attention.
    • MoBA: Trained exclusively with MoBA.
  • Figure 5(a) presents the position-wise LM loss. The MoBA/full hybrid model achieves the lowest position-wise LM loss across most positions. This suggests that a blend of MoBA for efficiency and full attention for a final boost can yield superior long-context performance. Interestingly, MoBA initially has higher loss but catches up to full attention as context length increases. The following figure (Figure 5 from the original paper) illustrates the hybrid strategies:

    Figure 5: Hybrid of MoBA and full attention. (a) Position-wise LM loss for MoBA, full attention, and MoBA/full hybrid; (b) and (c) SFT LM loss (seqlen = 32K, last 2K tokens) w.r.t. the number of full attention layers in the layer-wise hybrid.

Layer-wise Hybrid:

  • MoBA can sometimes lead to suboptimal performance during supervised fine-tuning (SFT). A likely contributor is that SFT typically masks prompt tokens out of the loss, so the supervision signal comes only from the task-specific (answer) tokens and the resulting gradients are sparse.
  • To address this, a layer-wise hybrid strategy was proposed: using full attention for the last few Transformer layers (which typically handle task-specific token predictions) and MoBA for the preceding layers.
  • Figures 5(b) and 5(c) (the latter from Appendix) show that this strategy significantly reduces SFT loss, confirming that combining full attention in critical final layers with MoBA in earlier layers can improve performance while maintaining efficiency.
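As a rough illustration of this layer-wise hybrid, the sketch below assigns an attention variant to each Transformer layer. This is a sketch under assumed names: the 20-layer count mirrors the 1.5B configuration in Table 1, and the choice of 3 full-attention layers is a placeholder (the paper ablates this count in Figures 5(b) and 5(c)); it is not the authors' implementation.

```python
NUM_LAYERS = 20                 # e.g., the 1.5B model configuration in Table 1
NUM_FULL_ATTENTION_LAYERS = 3   # hypothetical; swept in the layer-wise hybrid ablation

def attention_type_for_layer(layer_idx: int) -> str:
    """Return which attention variant a given Transformer layer uses."""
    if layer_idx >= NUM_LAYERS - NUM_FULL_ATTENTION_LAYERS:
        return "full"   # top layers: dense attention for SFT-sensitive predictions
    return "moba"       # lower layers: block-sparse MoBA attention

layer_plan = [attention_type_for_layer(i) for i in range(NUM_LAYERS)]
print(layer_plan)  # ['moba', ..., 'moba', 'full', 'full', 'full']
```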

6.1.3. Large Language Modeling Evaluation

  • The Llama 3.1 8B Base model was used as a starting point for continual pre-training up to a 1M context length. MoBA was activated for the last 100K tokens of this process. The block size was 4096 and top-k was 12, yielding 95.31% sparsity at 1M context.

  • Benchmark Performance: Table 2 (reproduced in Section 6.2 below) compares Llama-8B-1M-MoBA with Llama-8B-1M-Full across 16 benchmarks.

The results show that Llama-8B-1M-MoBA achieves performance highly comparable to Llama-8B-1M-Full across most benchmarks. For instance, on AGIEval, MMLU, HellaSWAG, and RULER, the scores are very close. Notably, MoBA even slightly outperforms full attention on CEval, GSM8K, Loogle, MBPP, and MBPP Sanitized. This is particularly impressive for RULER at 128K context length, where MoBA operates with 62.5% sparsity but nearly matches full attention's score.

  • Needle in a Haystack Evaluation: Figure 7 demonstrates Llama-8B-1M-MoBA's performance on the Needle in a Haystack benchmark up to 1M context length. The following figure (Figure 7 from the original paper) shows the Needle in a Haystack performance:

    Figure 7: Performance of Llama-8B-1M-MoBA on the Needle in a Haystack benchmark (up to 1M context length).

The model exhibits satisfactory performance, indicated by the green regions, even at 1 million tokens, affirming its capability for long-context retrieval.

6.1.4. Efficiency and Scalability

  • Speedup Evaluation (1M Model): Figure 2(a) compares the computation time scaling of MoBA vs. Flash Attention on a 1M model with increasing sequence lengths. MoBA significantly improves efficiency, achieving a speedup ratio of up to 6.5x when prefilling 1 million tokens. This confirms its practical advantage in terms of wall-clock time.
  • Fixed Sparsity Ratio Scaling (8K-10M): Figure 2(b) extends the comparison to sequence lengths up to 10 million tokens, maintaining a constant sparsity ratio of 95.31%. MoBA demonstrates superior efficiency compared to Flash Attention as sequence lengths grow. While both methods perform comparably at shorter sequences (e.g., 32K-128K), MoBA's computational advantage becomes increasingly evident and substantial for extremely long sequences, highlighting its strength in processing contexts of millions of tokens.
  • Underlying Factors: The efficiency of MoBA stems from three main factors: (1) dynamic block selection, (2) sparse attention across blocks, and (3) an optimized implementation combining Mixture of Experts and Flash Attention techniques. These factors reduce the computational complexity to a more economical sub-quadratic scale.
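To make these factors concrete, below is a minimal, non-optimized PyTorch sketch of the gating-plus-block-attention idea: single head, no batching, and it deliberately omits the causal constraints and the FlashAttention/MoE-style kernel optimizations described in the paper. The function and variable names are ours, not the authors' official implementation from the MoonshotAI/MoBA repository.

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=512, top_k=3):
    """Simplified single-head MoBA sketch. q, k, v: [seq_len, dim].
    Each query attends only to its top-k blocks, chosen by scoring the query
    against the mean-pooled keys of each block. Assumes seq_len % block_size == 0."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    k_blocks = k.view(n_blocks, block_size, dim)
    v_blocks = v.view(n_blocks, block_size, dim)

    # Gating: affinity between each query and the mean-pooled key of each block.
    block_repr = k_blocks.mean(dim=1)              # [n_blocks, dim]
    scores = q @ block_repr.T                      # [seq_len, n_blocks]
    top_idx = scores.topk(top_k, dim=-1).indices   # [seq_len, top_k]

    out = torch.zeros_like(q)
    for i in range(seq_len):                       # explicit loop for clarity, not speed
        sel_k = k_blocks[top_idx[i]].reshape(-1, dim)   # [top_k * block_size, dim]
        sel_v = v_blocks[top_idx[i]].reshape(-1, dim)
        attn = F.softmax((q[i] @ sel_k.T) / dim ** 0.5, dim=-1)
        out[i] = attn @ sel_v
    return out

# Tiny usage example (sizes chosen only to keep the demo cheap).
q = torch.randn(1024, 64); k = torch.randn(1024, 64); v = torch.randn(1024, 64)
print(moba_attention(q, k, v, block_size=128, top_k=3).shape)  # torch.Size([1024, 64])
```

Because each query computes softmax attention over only top_k * block_size keys rather than the full sequence, the attention cost becomes sub-quadratic once the block size grows with context length while top-k stays fixed.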

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Model Params | Heads | Layers | Hidden | Training Tokens | Block Size | Top-k |
|---|---|---|---|---|---|---|
| 568M | 14 | 14 | 1792 | 10.8B | 512 | 3 |
| 822M | 16 | 16 | 2048 | 15.3B | 512 | 3 |
| 1.1B | 18 | 18 | 2304 | 20.6B | 512 | 3 |
| 1.5B | 20 | 20 | 2560 | 27.4B | 512 | 3 |
| 2.1B | 22 | 22 | 2816 | 36.9B | 512 | 3 |

Table 1: Configuration of Scaling Law Experiments

This table details the configurations used for the scaling law experiments, showing models of varying sizes (from 568M to 2.1B parameters) with their respective head count, layer count, hidden dimension, total training tokens, and MoBA-specific hyperparameters (block size and top-k). For these experiments, a constant block size of 512 and top-k of 3 were used across all model sizes.

The following are the results from Table 2 of the original paper:

| Benchmark | Llama-8B-1M-MoBA | Llama-8B-1M-Full |
|---|---|---|
| AGIEval [0-shot] | 0.5144 | 0.5146 |
| BBH [3-shot] | 0.6573 | 0.6589 |
| CEval [5-shot] | 0.6273 | 0.6165 |
| GSM8K [5-shot] | 0.7278 | 0.7142 |
| HellaSWAG [0-shot] | 0.8262 | 0.8279 |
| Loogle [0-shot] | 0.4209 | 0.4016 |
| Competition Math [0-shot] | 0.4254 | 0.4324 |
| MBPP [3-shot] | 0.5380 | 0.5320 |
| MBPP Sanitized [0-shot] | 0.6926 | 0.6615 |
| MMLU [0-shot] | 0.4903 | 0.4904 |
| MMLU Pro [5-shot][CoT] | 0.4295 | 0.4328 |
| OpenAI HumanEval [0-shot][pass@1] | 0.6951 | 0.7012 |
| SimpleQA [0-shot] | 0.0465 | 0.0492 |
| TriviaQA [0-shot] | 0.5673 | 0.5667 |
| LongBench @32K [0-shot] | 0.4828 | 0.4821 |
| RULER @128K [0-shot] | 0.7818 | 0.7849 |

Table 2: Performance comparison between MoBA and full Attention across different evaluation benchmarks.

This table presents a comprehensive performance comparison between Llama-8B-1M-MoBA and Llama-8B-1M-Full across various evaluation benchmarks. It clearly shows that MoBA maintains highly comparable performance to full attention across a wide range of tasks, including reasoning, QA, and code generation, even at extended context lengths. In some cases, MoBA even slightly surpasses the full attention baseline.

The following are the results from Table 3 of the original paper:

| LM Loss Position Range | MoBA | Full |
|---|---|---|
| 0K - 2K | $3.075 \times C^{-0.078}$ | $3.068 \times C^{-0.078}$ |
| 2K - 4K | $2.415 \times C^{-0.084}$ | $2.411 \times C^{-0.083}$ |
| 4K - 6K | $2.085 \times C^{-0.081}$ | $2.077 \times C^{-0.081}$ |
| 6K - 8K | $1.899 \times C^{-0.092}$ | $1.894 \times C^{-0.092}$ |
| 8K - 10K | $1.789 \times C^{-0.091}$ | $1.774 \times C^{-0.089}$ |
| 10K - 12K | $1.721 \times C^{-0.092}$ | $1.697 \times C^{-0.087}$ |
| 12K - 14K | $1.670 \times C^{-0.089}$ | $1.645 \times C^{-0.088}$ |
| 14K - 16K | $1.630 \times C^{-0.089}$ | $1.600 \times C^{-0.087}$ |
| 16K - 18K | $1.607 \times C^{-0.090}$ | $1.567 \times C^{-0.087}$ |
| 18K - 20K | $1.586 \times C^{-0.091}$ | $1.542 \times C^{-0.087}$ |
| 20K - 22K | $1.571 \times C^{-0.093}$ | $1.519 \times C^{-0.086}$ |
| 22K - 24K | $1.566 \times C^{-0.089}$ | $1.513 \times C^{-0.085}$ |
| 24K - 26K | $1.565 \times C^{-0.091}$ | $1.502 \times C^{-0.085}$ |
| 26K - 28K | $1.562 \times C^{-0.095}$ | $1.493 \times C^{-0.088}$ |
| 28K - 30K | $1.547 \times C^{-0.097}$ | $1.471 \times C^{-0.091}$ |
| 30K - 32K | $1.546 \times C^{-0.108}$ | $1.464 \times C^{-0.097}$ |

Table 3: Loss scaling with different positions

This table presents LM loss scaling for MoBA and full attention across different position ranges within a 32K context length. In each cell the loss is fitted to a power-law curve of the form $a \times C^{-b}$, where $C$ denotes training compute, $a$ is a constant, and $b$ is the scaling exponent. It shows that MoBA's loss scaling behavior is very similar to full attention across all positions, although full attention generally achieves slightly lower loss, particularly at very long positions. The differences in the scaling exponents $b$ are small, further indicating comparable scaling properties.
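For illustration only (this is not the paper's fitting procedure, and the synthetic data points are assumptions), such a power law can be fitted by ordinary least squares in log-log space:

```python
import numpy as np

def fit_power_law(compute: np.ndarray, loss: np.ndarray) -> tuple[float, float]:
    """Fit loss ~ a * compute**(-b) via least squares on log(loss) = log(a) - b*log(compute)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)   # (a, b)

# Synthetic example shaped like the 0K-2K row of Table 3 (a = 3.075, b = 0.078);
# the compute values are placeholders chosen only to exercise the fit.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = 3.075 * compute ** -0.078
a, b = fit_power_law(compute, loss)
print(f"fitted: loss ~ {a:.3f} * C^(-{b:.3f})")  # recovers a ~ 3.075, b ~ 0.078
```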

6.3. Ablation Studies / Parameter Analysis

The paper includes several analyses to understand the impact of MoBA's design choices and hyperparameters.

  • Fine-Grained Block Segmentation: As discussed in Section 6.1.1, Figure 4 shows that LM loss significantly decreases as block granularity becomes finer (smaller block size), even when sparsity is kept constant. This suggests that allowing MoBA to select from a larger number of smaller KV blocks provides a more effective and nuanced way for the model to attend, leading to better language modeling performance. The trade-off is that more blocks mean a larger routing overhead.

  • Hybrid Training Strategy: Figure 5(a) analyzes the position-wise LM loss for MoBA, full attention, and a MoBA/full hybrid model. The hybrid approach, which uses MoBA for most training and switches to full attention for a smaller portion, yields the best overall performance, especially in later positions of the sequence. This indicates that the hybrid approach effectively balances the efficiency of MoBA with the potentially higher expressive power of full attention during critical training phases.

  • Layer-wise Hybrid for SFT: Figures 5(b) and 5(c) (Appendix) investigate a layer-wise hybrid strategy to optimize supervised fine-tuning (SFT) performance. The results show that using full attention in the top layers of the Transformer (which are often more task-specific and sensitive to sparse gradients) while keeping MoBA in the lower layers (base layers) significantly reduces SFT loss. This practical finding suggests a method to overcome potential performance dips of sparse attention during fine-tuning on specific tasks.

  • Sparsity Ratio and Context Length: Across multiple experiments, the sparsity ratio is a key parameter that is either kept constant or dynamically adjusted (by varying block size and top-k) to explore MoBA's behavior. The results consistently demonstrate that MoBA can maintain high performance even at very high sparsity levels (e.g., 81.25% at 8K context, 95.31% at 32K and 1M context), validating the effectiveness of its dynamic block selection in focusing on critical information while discarding irrelevant parts of the context. The ability to scale block size with context length (while keeping top-k fixed) is crucial for maintaining a constant sparsity ratio and achieving efficiency at extreme scales (up to 10M tokens).
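The sparsity figures quoted throughout this analysis follow directly from the block-size/top-k settings. As a quick arithmetic check (the helper name is ours; the sequence lengths and settings are those reported above):

```python
def moba_sparsity(seq_len: int, block_size: int, top_k: int) -> float:
    """Fraction of the key/value context each query does NOT attend to."""
    return 1.0 - (top_k * block_size) / seq_len

print(moba_sparsity(8 * 1024, 512, 3))        # 0.8125   -> 81.25% at 8K context
print(moba_sparsity(32 * 1024, 512, 3))       # 0.953125 -> 95.31% at 32K context
print(moba_sparsity(1024 * 1024, 4096, 12))   # 0.953125 -> 95.31% at 1M context
print(moba_sparsity(128 * 1024, 4096, 12))    # 0.625    -> 62.5% at 128K (RULER)
```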

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Mixture of Block Attention (MoBA), a novel attention mechanism designed to enable large language models (LLMs) to effectively handle long-context tasks with enhanced computational efficiency. By applying Mixture of Experts (MoE) principles to the attention mechanism, MoBA partitions the input context into blocks and uses a gating network to dynamically select the most relevant blocks for each query token. This approach adheres to a "less structure" philosophy, allowing the model to autonomously determine attention patterns rather than relying on predefined biases.

The key findings and contributions include:

  • Comparable Performance to Full Attention: Extensive experiments demonstrate that MoBA achieves language model loss and downstream task performance highly comparable to full attention models, even at significant levels of sparsity (e.g., 95.31%).

  • Significant Efficiency Gains: MoBA provides substantial improvements in computational efficiency, achieving speedups of up to 6.5x during prefilling 1 million tokens and effectively scaling to 10 million tokens.

  • Long-Context Scalability: The architecture proves robust for long-context scenarios, as evidenced by competitive trailing LM loss at 32K tokens and satisfactory performance on the Needle in a Haystack benchmark up to 1 million tokens.

  • Flexibility and Hybrid Strategies: MoBA offers the ability to seamlessly transition between full and sparse attention, enabling effective hybrid training and layer-wise hybrid strategies to optimize performance, particularly during supervised fine-tuning (SFT).

  • Real-World Deployment: The successful deployment of MoBA to support Kimi's long-context requests underscores its practical utility and robustness.

    In summary, MoBA represents a significant advancement in efficient attention computation, providing a scalable and high-performing solution for extending the context capabilities of LLMs without compromising their expressive power.

7.2. Limitations & Future Work

While the paper primarily highlights the advancements and successful deployment of MoBA, it implicitly points to areas for further research and investigation by contrasting its approach with the limitations of existing methods. The paper does not have a dedicated "Limitations" or "Future Work" section, but the following can be inferred:

  • Complex Reasoning Tasks with Sparse Attention: The introduction notes that the performance of linear approximations of attention in complex reasoning tasks "remains inadequately explored." While MoBA aims to overcome this by retaining softmax-based attention's expressive power, a more rigorous and specific evaluation of MoBA's performance on highly complex, multi-hop reasoning tasks at extreme context lengths could be a direction for future work.
  • Optimizing Gating Network Overhead: While MoBA reduces attention computation, the gating network itself introduces some overhead (e.g., mean pooling and affinity score computation). For extremely fine-grained block segmentation or very large numbers of blocks, the overhead of the gating network might become a factor. Future work could explore more efficient gating mechanisms or dynamic router architectures.
  • Adaptive Sparsity Scheduling: The paper shows the benefits of hybrid training and layer-wise hybrid strategies. Further research could investigate more advanced, dynamic scheduling mechanisms for switching between sparse and full attention modes, or for adjusting top-k values, based on task, layer, or input characteristics.
  • Theoretical Guarantees for "Less Structure": While MoBA adheres to a "less structure" principle compared to fixed sparse patterns, the block partitioning and mean-pooling for gate computation still introduce some structural assumptions. Future work could explore the theoretical implications of these structures and investigate even more flexible or learned ways of defining attention "experts."
  • Generalization Beyond Text: The current deployment is for LLMs. Exploring MoBA's applicability and performance in other domains that utilize Transformer architectures, such as multimodal models (e.g., vision Transformers) or time series analysis, could be an interesting future direction.

7.3. Personal Insights & Critique

MoBA presents an elegant and effective solution to the long-standing challenge of quadratic complexity in Transformer attention. The core insight of applying Mixture of Experts (MoE) principles, previously successful in FFN layers, to the attention mechanism is genuinely innovative. This approach allows LLMs to dynamically and intelligently select relevant parts of an extended context, mimicking a more human-like selective attention process.

One of the most compelling aspects is the "less structure" principle. By letting the model learn where to attend via a gating network, MoBA avoids the rigid biases of sliding window or sink attention, which can be limiting for tasks requiring non-local dependencies. The empirical results strongly support this, showing MoBA performing on par with full attention across diverse benchmarks while delivering substantial speedups. The 6.5x speedup for prefilling 1M tokens and scalability to 10M tokens are impressive practical achievements.

The ability to seamlessly transition between full and sparse attention is also a powerful design choice. This hybrid capability provides a practical pathway for practitioners to fine-tune the balance between efficiency and performance, potentially using MoBA for general pre-training and switching to full attention for critical, task-specific layers during SFT. This flexibility makes MoBA highly adaptable to different use cases and computational constraints. The deployment in Kimi validates its real-world applicability, moving it beyond a theoretical concept.

Potential areas for deeper investigation or critique include:

  • Gating Network Complexity vs. Attention Gain: While the gating network (mean pooling and top-k selection) is relatively lightweight, its computational overhead, especially for a very large number of blocks (fine-grained segmentation), could warrant further analysis. How does the complexity of the gating network scale, and at what point does it start to diminish the overall efficiency gains?

  • Finer-grained Top-K Analysis: The paper uses a fixed top-k (e.g., 3 or 12). A more detailed ablation of how dynamically adjusting $k$ (e.g., per-layer, per-head, or per-query) based on context or task difficulty could impact performance and efficiency would be valuable. The assumption of a fixed $k$ for all queries might still introduce some implicit bias.

  • Representational Capacity of Block Pooling: Using mean pooling for block representation is a simple and effective choice. Exploring more sophisticated block representation methods (e.g., learnable aggregations, CNNs, or smaller attention mechanisms within blocks) could potentially improve the gate's decision-making, though at the cost of increased gating network complexity.

  • Robustness to Adversarial Contexts: How does MoBA perform when the "needle" is disguised or placed in particularly distractor-rich blocks that might not be selected by the gating network? While Needle in a Haystack results are good, more adversarial settings might reveal limitations in the gating mechanism's ability to consistently identify critical information.

    Overall, MoBA is a significant contribution that offers a pragmatic yet powerful approach to scaling LLMs to unprecedented context lengths, maintaining the expressive power of Transformers while drastically improving efficiency. Its principled design and demonstrated effectiveness make it a highly promising direction for future LLM development.
