MoBA: Mixture of Block Attention for Long-Context LLMs
TL;DR Summary
The study presents MoBA, a novel mixture-of-block-attention mechanism for large language models that addresses the computational complexity of long-context tasks. It enables efficient transitions between full and sparse attention while maintaining performance, significantly improving efficiency.
Abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the ``less structure'' principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "MoBA: Mixture of Block Attention for Long-Context LLMs," which proposes a novel attention mechanism designed to efficiently scale large language models (LLMs) to handle extremely long input sequences.
1.2. Authors
The paper lists numerous authors, primarily affiliated with Moonshot AI, with additional contributions from Tsinghua University and Zhejiang Lab/Zhejiang University.
- Moonshot AI: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou.
- Tsinghua University: Mingxing Zhang.
- Zhejiang Lab/Zhejiang University: Jiezhong Qiu.
1.3. Journal/Conference
This paper is presented as a "Technical Report," which typically implies it is a pre-print not yet formally peer-reviewed and published in a journal or conference. It is available on arXiv, a popular repository for preprints in physics, mathematics, computer science, and other fields. arXiv serves as a crucial platform for disseminating research rapidly, allowing early sharing of findings before or during the formal peer-review process.
1.4. Publication Year
The publication date is 2025-02-18.
1.5. Abstract
The abstract introduces the critical challenge of scaling context length in large language models (LLMs) due to the quadratic computational complexity of traditional attention mechanisms. It notes that existing solutions either introduce strong biases (e.g., sink or window attention) or use linear approximations whose performance on complex reasoning tasks is not well-understood. The authors propose Mixture of Block Attention (MoBA), an innovative approach that applies Mixture of Experts (MoE) principles to the attention mechanism. MoBA adheres to a "less structure" principle, allowing the model to autonomously determine where to attend. This novel architecture demonstrates superior performance on long-context tasks, offers the key advantage of seamlessly transitioning between full and sparse attention, and enhances efficiency without compromising performance. MoBA has been deployed to support Kimi's long-context requests, demonstrating significant advancements in efficient attention computation for LLMs.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2502.13189
- PDF Link: https://arxiv.org/pdf/2502.13189v1.pdf

This paper is published as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The advancement of large language models (LLMs) towards Artificial General Intelligence (AGI) critically depends on their ability to process and understand longer context lengths. However, a fundamental bottleneck is the attention mechanism within Transformers, which exhibits a quadratic computational complexity with respect to the input sequence length. This quadratic growth ($O(N^2)$, where $N$ is the sequence length) makes processing very long contexts computationally prohibitive and memory-intensive.
Prior research has attempted to address this problem through two main avenues:
- Imposing biased structures: Approaches like sink attention or window attention introduce predefined patterns that restrict the tokens a query can attend to. While efficient, these methods are often task-specific and may limit the model's ability to capture complex, non-local dependencies.
- Radically modifying attention into linear approximations: Models such as Mamba, RWKV, or RetNet replace the standard attention mechanism with linear approximations to reduce complexity. However, their effectiveness in demanding complex reasoning tasks remains largely unexplored and unproven.

The paper identifies a crucial gap: a need for a solution that provides efficiency for long contexts while adhering to a "less structure" principle, allowing the model to determine relevant information autonomously without strong predefined biases. Such a solution should ideally be able to seamlessly switch between efficient sparse attention and comprehensive full attention without performance compromises.
2.2. Main Contributions / Findings
The paper proposes Mixture of Block Attention (MoBA) as an innovative solution to the long-context problem in LLMs, offering several key contributions:
- Novel Architecture (MoBA): MoBA introduces a new attention mechanism that adapts the principles of Mixture of Experts (MoE) to the attention layer of Transformer models. Instead of routing entire network layers, MoBA partitions the input context into blocks and uses a gating network to dynamically select the most relevant blocks for each query token. This allows for sparse attention while maintaining the model's ability to attend where needed.
- Adherence to the "Less Structure" Principle: Unlike sliding window or sink attention, MoBA avoids imposing strong, predefined biases on the attention pattern. The gating mechanism allows the model to learn and autonomously determine which KV (key-value) blocks are most relevant to a given query, enhancing its flexibility and generality across tasks.
- Seamless Transition between Full and Sparse Attention: MoBA is designed to easily switch between sparse attention and full attention, offering a crucial advantage for balancing efficiency and performance. This hybrid capability ensures the model can leverage efficiency when possible without risking performance degradation when full attention is truly necessary.
- Superior Performance on Long-Context Tasks with Efficiency Gains: Experimental results demonstrate that MoBA achieves performance comparable to full attention models in terms of language model (LM) loss and various downstream tasks, even with significant sparsity. Critically, it offers substantial computational efficiency improvements, with speedups of up to 6.5x when prefilling 1 million tokens and effective scaling up to 10 million tokens.
- Real-World Deployment: MoBA has already been deployed to support Kimi's long-context requests, validating its practical utility and robustness in a real-world production environment.
- Open-Sourced Code: The authors provide their code, encouraging further research and development within the community.
These contributions collectively represent a significant advancement in developing efficient and scalable attention mechanisms for LLMs, paving the way for models capable of handling much longer contexts required for AGI.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand MoBA, a foundational understanding of Large Language Models (LLMs), the Transformer architecture, its core attention mechanism, and the concept of Mixture of Experts (MoE) is essential.
- Large Language Models (LLMs): LLMs are a class of deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to perform a wide range of natural language processing (NLP) tasks. They excel at understanding, generating, and manipulating human language, demonstrating capabilities in tasks like text generation, translation, summarization, and complex reasoning. Their ability to process and generate long sequences of text is crucial for their performance, and extending their context length (the amount of preceding information they can consider) is a major research goal.
- Transformer Architecture: Introduced by Vaswani et al. (2017), the Transformer is a neural network architecture that has become the backbone of most modern LLMs. It eschews traditional recurrent or convolutional layers in favor of attention mechanisms. A Transformer typically consists of an encoder and a decoder stack (or decoder-only for generative LLMs). Each stack is composed of multiple identical layers, with each layer containing a multi-head self-attention sub-layer and a position-wise feed-forward network (FFN). A key innovation of Transformers is their ability to process input sequences in parallel, unlike Recurrent Neural Networks (RNNs), making them highly efficient for training on modern hardware.
- Attention Mechanism: The attention mechanism is the core innovation of the Transformer. It allows the model to dynamically weigh the importance of different parts of the input sequence when processing each token. Instead of processing tokens sequentially, attention enables the model to look at all other tokens in the sequence simultaneously and decide which ones are most relevant to the current token. This mechanism is crucial for capturing long-range dependencies in text. The self-attention variant allows a token in a sequence to attend to all other tokens in the same sequence. The standard self-attention calculation involves three learnable weight matrices: Query (Q), Key (K), and Value (V). For each token, a query vector is compared against all key vectors to compute attention scores, which are then used to weight the value vectors. The output is a weighted sum of the value vectors, representing the context-aware representation of the token. The computational cost of standard self-attention is quadratic ($O(N^2)$) with respect to the sequence length $N$ (and linear with respect to the head dimension), primarily due to the $QK^\top$ matrix multiplication. This is the "prohibitive overhead" the paper aims to address.
- Mixture of Experts (MoE): Mixture of Experts (MoE) is a technique used to scale models by conditionally executing only a subset of the network's parameters for each input. Instead of having a single large network, an MoE layer consists of a gating network (or router) and several expert networks (e.g., feed-forward networks). For a given input, the gating network learns to select and route the input to one or more expert networks. Only the selected experts are computed, reducing the computational cost per token while increasing the total number of parameters. MoE models have primarily been applied to the feed-forward network (FFN) layers in Transformers to improve scalability and training efficiency; a minimal routing sketch follows this list. MoBA innovatively applies this principle to the attention mechanism itself.
- Flash Attention: Flash Attention (Dao et al., 2022) is a highly optimized implementation of the attention mechanism that significantly reduces memory usage and speeds up computation, particularly for long sequences. It achieves this by reorganizing the computation of attention to minimize reads and writes to high-bandwidth memory (HBM), leveraging SRAM (on-chip memory). While Flash Attention makes standard attention more efficient, it does not change its quadratic computational complexity in terms of FLOPs; rather, it reduces the wall-clock time and memory footprint for a given sequence length. MoBA's implementation incorporates Flash Attention to further optimize its block-wise computations.
- Causal Masking: In generative LLMs and autoregressive tasks, causal masking is applied to the attention mechanism to prevent a token from attending to future tokens in the sequence. This ensures that the model only uses information available up to the current token to make predictions, preserving the autoregressive property. It is typically implemented by setting the attention scores for future positions to negative infinity before the softmax operation, effectively making them zero after softmax. MoBA explicitly incorporates causal masking into its block selection to maintain this property.
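To make the MoE routing idea concrete, the following is a minimal, illustrative PyTorch sketch of a top-k gated MoE layer over FFN experts. The class name `TopKMoE` and all hyperparameters are hypothetical; this is not the implementation used by any model discussed in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k gated mixture-of-experts over FFN experts (illustrative only)."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # router / gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: [tokens, d_model]
        scores = self.gate(x)                                    # [tokens, n_experts]
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)      # route each token to k experts
        weights = F.softmax(topk_scores, dim=-1)                 # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                               # only the selected experts run
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(8, 16)
print(TopKMoE(d_model=16, d_ff=32)(x).shape)   # torch.Size([8, 16])
```

The key point carried over to MoBA is the interface: a learned gate scores a set of candidate units and only the top-k of them are computed for each input.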
3.2. Previous Works
The paper contextualizes MoBA by discussing various approaches to address the quadratic complexity of Transformer attention for long contexts.
- Standard Attention (from Vaswani et al., 2017): The core mechanism that MoBA seeks to make more efficient. For a single query vector $q$ attending to key vectors $K$ and value vectors $V$, the output is calculated as $\mathrm{Attn}(q, K, V) = \mathrm{Softmax}\big(qK^{\top}\big)V$. Here, $q$ is the query vector, $K$ is the matrix of key vectors, $V$ is the matrix of value vectors, and $d$ is the dimension of a single attention head. The softmax function normalizes the attention scores. MoBA modifies this by only considering a subset of $K$ and $V$.
- Static Sparse Attention: These methods introduce predefined, fixed patterns of attention to reduce computation.
  - Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020), LongNet (J. Ding et al., 2023): These models use a combination of local windowed attention (attending to nearby tokens) and global attention (attending to a few special tokens or a fixed pattern) to achieve linear or near-linear complexity.
  - Sliding Window Attention (SWA): A common form where each query token only attends to a fixed-size window of preceding tokens. The paper states that SWA (e.g., Child et al., 2019) can be viewed as a special case of MoBA where the gating network consistently selects the most recent blocks.
  - Attention Sink (Xiao et al., 2023): Each query token attends to a few initial "sink" tokens and a fixed window of recent tokens. The paper views this as a variant of MoBA where the gating network is designed to always select both the initial sink blocks and recent blocks.
- Dynamic Sparse Attention: These methods adaptively select which tokens to attend to during computation.
  - Reformer (Kitaev et al., 2020) and Routing Transformer (Roy et al., 2021): Use locality-sensitive hashing (LSH) or k-means clustering to group similar queries and keys, allowing attention within clusters.
  - Memorizing Transformers (Y. Wu et al., 2022) and Unlimiformer (Bertsch et al., 2024): Dynamically attend to tokens selected by k-nearest-neighbor (kNN) algorithms.
  - CoLT5 (Ainslie, Lei, et al., 2023): Designs a routing multi-head attention mechanism.
  - Permute Attention (Tay, Dehghani, et al., 2020): Permutes blocks from the input sequence for dynamic block-sparse attention.
  - Quest (Tang et al., 2024): A query-aware sparsity method for efficient long-context LLM inference, which the paper notes can be seen as MoBA with a block size of 1 and a specialized block representation function combining min and max pooling.
  - Longheads (Lu et al., 2024): Can be viewed as MoBA with a top-1 gating network, meaning each query selects only the single most relevant KV block.
- Linear Attention Models (Beyond Traditional Attention): These approaches entirely replace the canonical softmax-based attention with linear approximations to achieve linear computational complexity ($O(N)$).
  - Mamba (Gu et al., 2023; Dao and Gu, 2024): A state-space model (SSM) that offers linear-time sequence modeling with selective state-space mechanisms.
  - RWKV (Peng, Alcaide, et al., 2023; Peng, Goldstein, et al., 2024): Combines RNN and Transformer ideas, offering linear scaling with a recurrent mechanism.
  - RetNet (Sun et al., 2023): A retentive network designed as a successor to Transformers, offering $O(1)$ per-token complexity during inference.
  - The paper notes that while these models reduce computational overhead, evidence of their effectiveness in complex reasoning tasks remains limited.
3.3. Technological Evolution
The field of attention mechanisms for LLMs has evolved from computationally intensive full attention to a spectrum of more efficient approaches.
- Full Attention (2017 onwards): The initial Transformer model introduced self-attention, which, despite its quadratic complexity, revolutionized NLP due to its ability to capture long-range dependencies. Its parallelizability for training on GPUs made it dominant.
- Static Sparse Attention (late 2010s / early 2020s): As context lengths grew, the limitations of full attention became apparent. Researchers began introducing fixed sparsity patterns (e.g., windowed, dilated, global tokens) to reduce computation, leading to models like Longformer, BigBird, ETC, and LongNet. These methods improve efficiency but often impose strong inductive biases.
- Dynamic Sparse Attention (early 2020s onwards): The next step was to make sparsity adaptive. Instead of fixed patterns, models started learning where to attend, leveraging mechanisms like LSH, kNN, or routing networks. Quest and Longheads are examples of this trend, moving towards more flexible, data-driven sparsity. MoBA fits into this category, pushing adaptive sparsity further by applying MoE principles.
- Beyond Traditional Attention (mid-2020s): A more radical shift involves replacing softmax-based attention altogether with architectures inspired by CNNs, RNNs, or state-space models (SSMs), leading to linear attention models like Mamba, RWKV, and RetNet. While highly efficient, their performance on complex tasks is still under investigation.

MoBA positions itself as an evolution within the dynamic sparse attention paradigm. It seeks to improve efficiency while largely preserving the Transformer framework's strengths, particularly its expressive power and "less structure" philosophy, which contrasts with the radical architectural changes of linear attention models.
3.4. Differentiation Analysis
MoBA differentiates itself from previous works primarily through its novel application of Mixture of Experts (MoE) principles to the attention mechanism itself, along with its emphasis on "less structure" and hybrid capabilities.
- Vs. Full Attention:
  - Core Difference: MoBA significantly reduces computational complexity from quadratic to sub-quadratic by selectively attending to only a subset of KV blocks, whereas full attention computes interactions between every query and every key.
  - Innovation: Achieves efficiency without sacrificing performance, demonstrated by comparable LM loss and benchmark scores.
- Vs. Static Sparse Attention (e.g., Sliding Window, Attention Sink):
  - Core Difference: Static sparse attention methods impose predefined, fixed patterns (e.g., a fixed window size, or always attending to the initial sink tokens). MoBA, leveraging a gating network, dynamically and adaptively selects relevant blocks for each query.
  - Innovation: MoBA has stronger expressive power. The paper explicitly demonstrates that Sliding Window Attention and Attention Sink can be viewed as special cases of MoBA with specific, hard-coded gating networks. MoBA's learned gating mechanism allows it to adapt to diverse contextual needs, adhering to the "less structure" principle.
- Vs. Other Dynamic Sparse Attention Methods (e.g., Quest, Longheads):
  - Core Difference: While Quest and Longheads also use dynamic sparsity, MoBA formalizes block selection within a Mixture of Experts framework. Longheads is effectively a top-1-gating MoBA, and Quest uses a specialized block representation.
  - Innovation: MoBA's top-k gating allows for a tunable balance of sparsity and context, and its block pooling for affinity scores is a direct application of MoE principles, offering a generalized and robust framework for adaptive block selection.
- Vs. Linear Attention Models (e.g., Mamba, RWKV, RetNet):
  - Core Difference: Linear attention models radically alter the fundamental attention mechanism to achieve linear complexity, often requiring training from scratch and potentially exhibiting limitations in complex reasoning tasks. MoBA retains the core Transformer architecture and softmax-based attention formulation (albeit in a sparse form).
  - Innovation: MoBA avoids modifying the attention mechanism in ways that might compromise its expressive power for complex reasoning. It aims to achieve efficiency within the Transformer framework, allowing for easier integration with pre-trained Transformer-based models and potentially better performance on tasks requiring the full expressive power of softmax attention. The paper specifically highlights that the performance of linear approximations on complex reasoning tasks remains "inadequately explored," implying MoBA aims to provide a robust alternative.
- Hybrid Capability: A unique differentiator is MoBA's ability to seamlessly transition between full attention and sparse attention, which is explicitly designed to balance efficiency and performance and is not a common feature in other sparse attention methods.
4. Methodology
4.1. Principles
The core principle behind Mixture of Block Attention (MoBA) is to address the quadratic computational complexity of Transformer attention by applying the Mixture of Experts (MoE) paradigm to the attention mechanism itself. Instead of having every query token attend to every key-value pair in the entire context, MoBA partitions the full context into discrete blocks. For each query token, a gating network dynamically selects a small, fixed number of the most relevant blocks to attend to. This allows for a sparse attention pattern, significantly reducing computational cost while maintaining the model's ability to focus on pertinent information.
The approach adheres to a "less structure" principle, meaning that unlike predefined sparse attention patterns (e.g., fixed windows), MoBA empowers the model to autonomously determine where to attend. This dynamic selection mechanism aims to preserve the expressive power of full attention while gaining efficiency, and crucially, it allows for a seamless transition between sparse and full attention modes.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Preliminaries: Standard Attention in Transformer
Before detailing MoBA, the paper first outlines the standard attention mechanism as used in Transformers. This provides a baseline against which MoBA's modifications can be understood.
For a single query vector $q$ to attend to key vectors $K$ and value vectors $V$, the attention output is computed as:
$
\mathrm{Attn}(q, K, V) = \mathrm{Softmax}\big(qK^{\top}\big)V
$
In this formula:
- $q \in \mathbb{R}^{1 \times d}$: Represents the query vector for the current token. It is a row vector of dimension $d$.
- $K \in \mathbb{R}^{N \times d}$: Is the matrix of key vectors, where each row corresponds to a key vector for a token in the context. Here $N$ is the total sequence length (number of tokens in the context) and $d$ is the head dimension.
- $V \in \mathbb{R}^{N \times d}$: Is the matrix of value vectors, analogous to $K$, with dimensions $N \times d$.
- $qK^{\top}$: Computes the dot product (similarity score) between the query vector and all key vectors. This results in a row vector of length $N$, where each element indicates how relevant the corresponding key is to the query.
- $\mathrm{Softmax}(\cdot)$: Applies the softmax function to the dot-product scores, normalizing them into a probability distribution. This ensures that the attention weights sum to 1, indicating the relative importance of each key.
- The entire expression $\mathrm{Softmax}(qK^{\top})V$: Multiplies the normalized attention weights by the value vectors. This produces a weighted sum of the value vectors, where values from more relevant tokens (higher attention scores) contribute more to the output.

The paper notes that this explanation focuses on a single-head scenario for clarity; the extension to multi-head attention involves concatenating the outputs from multiple attention heads. The key challenge with this standard attention is the quadratic complexity ($O(N^2)$) of the $qK^{\top}$ operation, which computes scores against all $N$ keys for each query. A minimal code sketch of this computation follows.
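As a concrete reference for the formula above, here is a minimal single-head causal attention sketch in PyTorch (illustrative only, not the paper's code). It materializes the full $N \times N$ score matrix, which is exactly the quadratic cost MoBA avoids, and it includes the usual $1/\sqrt{d}$ scaling even though the simplified formula above omits it.

```python
import math
import torch

def single_head_attention(q, k, v, causal: bool = True):
    """q, k, v: [N, d] tensors -> [N, d]. Materializes the full N x N score matrix."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)           # [N, N]: O(N^2 * d) work
    if causal:
        future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))    # no attention to future tokens
    return torch.softmax(scores, dim=-1) @ v                  # weighted sum of values

N, d = 128, 64
q, k, v = (torch.randn(N, d) for _ in range(3))
print(single_head_attention(q, k, v).shape)                   # torch.Size([128, 64])
```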
4.2.2. MoBA Architecture
MoBA modifies the standard attention by enabling each query token to attend to only a subset of keys and values from the entire context. This is achieved through block partitioning and a dynamic selection strategy.
The modified MoBA attention formula is:
$
\mathrm{MoBA}(q, K, V) = \mathrm{Softmax}\big(qK[I]^{\top}\big)V[I]
$
Here:
- $q$, $K$, $V$: Are the query vector, key matrix, and value matrix as in standard attention.
- $I \subseteq [N]$: Represents the set of selected key and value indices. This is the crucial difference: instead of using all keys and values, MoBA uses only those specified by the index set $I$.

The key innovation of MoBA lies in how this set $I$ is determined. The full context of length $N$ is first divided into $n$ blocks, where each block contains $B$ subsequent tokens.

- $B = N / n$: Denotes the block size.
- $I_i = [(i-1) \times B + 1, \ i \times B]$: Represents the range of indices for the $i$-th block.

The MoBA model then employs a gating mechanism to select the most relevant blocks for each query token. The total set of selected indices $I$ is the union of the indices of all blocks whose gate value $g_i$ is positive (i.e., selected): $ I = \bigcup_{g_i > 0} I_i $
4.2.2.1. MoBA Gating Mechanism
The MoBA gate determines the relevance of each block to a given query. It involves two steps:
- Affinity Score Computation: For each query $q$, an affinity score $s_i$ is computed for every block $i$. This score measures the relevance between the query and the $i$-th block, and is computed as the inner product between the query vector and the mean pooling of the key vectors within the $i$-th block: $ s_i = \big\langle q, \ \mathrm{mean\_pool}(K[I_i]) \big\rangle $ Here:
  - $\langle \cdot, \cdot \rangle$: Denotes the inner product (dot product) of two vectors.
  - $\mathrm{mean\_pool}(K[I_i])$: Computes the mean of all key vectors belonging to the $i$-th block along the sequence dimension. This creates a single representative vector for the entire block.
  - The inner product then measures the similarity between the query and this block representative.
- Top-k Gating: After computing affinity scores for all blocks, a top-k gating mechanism is applied. This selects the $k$ blocks with the highest affinity scores for the current query. The gate value $g_i$ for the $i$-th block is set to 1 if its score is among the top $k$, and 0 otherwise: $ g_i = \begin{cases} 1 & s_i \in \mathrm{Topk}\big(\{ s_j \mid j \in [n] \}, k\big) \\ 0 & \text{otherwise} \end{cases} $ Here:
  - $\mathrm{Topk}\big(\{ s_j \mid j \in [n] \}, k\big)$: Represents the set containing the $k$ highest scores among the affinity scores of all blocks $j \in [n]$. A minimal sketch of this gating computation is given below.
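The following is a minimal sketch of the gating computation just described, for a single query and a single head. The function name `moba_gate` and the tensor layout are assumptions, and edge cases such as sequence lengths not divisible by the block size are ignored.

```python
import torch

def moba_gate(q, K, block_size: int, top_k: int):
    """q: [d] single query, K: [N, d] keys with N divisible by block_size.
    Returns the indices of the selected blocks and the binary gate vector."""
    N, d = K.shape
    n_blocks = N // block_size
    block_keys = K.view(n_blocks, block_size, d).mean(dim=1)   # mean-pool keys per block: [n, d]
    scores = block_keys @ q                                    # affinity s_i = <q, mean_pool(K[I_i])>
    top_idx = scores.topk(min(top_k, n_blocks)).indices        # keep the k highest-affinity blocks
    gate = torch.zeros(n_blocks, dtype=torch.bool)
    gate[top_idx] = True                                       # g_i = 1 for selected blocks, 0 otherwise
    return top_idx, gate

q = torch.randn(64)
K = torch.randn(1024, 64)
idx, gate = moba_gate(q, K, block_size=128, top_k=3)
print(sorted(idx.tolist()), int(gate.sum()))                   # e.g. [0, 3, 5] 3
```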
The following figure (Figure 1 from the original paper) provides a visual illustration of the MoBA mechanism:
This figure is a schematic diagram showing the structure and principle of Mixture of Block Attention (MoBA). The left part (a) shows how queries are connected through a router to the corresponding keys and values to produce attention scores. The right part (b) walks through the MoBA gating mechanism: block partitioning, mean pooling, matrix multiplication, and top-k selection, which together produce the final attention output.
Figure 1: Illustration of Mixture of Block Attention (MoBA). (a) A running example of MoBA; (b) integration of MoBA into Flash Attention.
Figure 1(a) shows a running example:
- There are two query tokens (q1, q2) and four KV blocks (B1, B2, B3, B4).
- A router (the gating network) dynamically selects two blocks ($k = 2$) for each query.
- q1 is routed to B1 and B2.
- q2 is routed to B3 and B4.
- This illustrates how different queries can focus on different subsets of the context blocks.

Figure 1(b) illustrates the integration into Flash Attention:
- The KV cache is first chunked into blocks.
- Mean pooling is applied to the keys within each block to create block representations.
- The query (Q) is multiplied by these block representations to get affinity scores.
- A top-k selection, combined with causal masking, identifies the relevant blocks.
- Finally, Flash Attention is performed on the selected Q, K, V blocks.
4.2.2.2. Causality in MoBA
To preserve the autoregressive property of LLMs, MoBA incorporates causal relationship preservation through specific designs:
- No Attention to Future Blocks: A query token cannot be routed to any future block. This is enforced by setting the affinity score to $-\infty$ (and thus the gate value $g_i$ to 0) for any block that starts after the query token's position. More formally, denoting the position index of query $q$ as $\mathrm{pos}(q)$, we set $s_i = -\infty$ and $g_i = 0$ for any block $i$ such that $\mathrm{pos}(q) < i \times B$. This ensures no information leakage from future parts of the sequence.
- Current Block Attention and Causal Masking: The current block (the block that contains the query token) is always attended to. Causal masking is applied locally within this block to avoid information leakage from subsequent tokens, while still encouraging attention to the local context. More formally, $g_i = 1$ is explicitly set for the block $i$ where $\mathrm{pos}(q)$ falls within the interval $[(i-1) \times B + 1, \ i \times B]$. A minimal sketch of these two rules follows.
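Below is a minimal sketch of the two causality rules applied to one query's block scores before top-k selection (illustrative only; for simplicity, forcing the current block here may add one block beyond the top-k, and the real implementation handles this inside the masked top-k).

```python
import torch

def causal_block_gate(scores, query_pos: int, block_size: int, top_k: int):
    """scores: [n_blocks] affinities for one query at 0-based position query_pos."""
    n_blocks = scores.numel()
    cur_block = query_pos // block_size
    scores = scores.clone()
    scores[cur_block + 1:] = float("-inf")         # rule 1: never route to future blocks
    k = min(top_k, cur_block + 1)                   # at most the blocks that exist so far
    gate = torch.zeros(n_blocks, dtype=torch.bool)
    gate[scores.topk(k).indices] = True
    gate[cur_block] = True                          # rule 2: the current block is always attended
    return gate

gate = causal_block_gate(torch.randn(8), query_pos=300, block_size=128, top_k=3)
print(gate.tolist())                                # blocks after the current one stay False
```

Token-level causal masking inside the current block is then applied by the attention kernel itself, as shown in Algorithm 1 below.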
4.2.2.3. Fine-Grained Block Segmentation
Inspired by MoE literature (which shows benefits of fine-grained expert segmentation), MoBA investigates the impact of partitioning the context into blocks with a finer grain. While MoE typically segments along the FFN intermediate hidden dimension, MoBA operates segmentation along the context-length dimension. The experiments aim to determine if smaller block sizes (more blocks, each covering fewer tokens) can enhance performance.
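For intuition, here is a short worked example of the sparsity arithmetic behind this ablation: with context length $N$, block size $B$, and top-k $k$, the fraction of KV blocks a query skips is roughly $1 - k / (N/B)$, so several (block size, top-k) pairs can hold sparsity at 75% while varying granularity. The specific pairs below are illustrative, not necessarily the paper's exact settings.

```python
ctx = 32 * 1024                        # 32K context length, as used in the ablation
for block_size, top_k in [(4096, 2), (2048, 4), (1024, 8), (512, 16)]:
    n_blocks = ctx // block_size
    sparsity = 1 - top_k / n_blocks    # fraction of KV blocks a query does NOT attend to
    print(f"B={block_size:5d}  n={n_blocks:3d}  k={top_k:2d}  sparsity={sparsity:.0%}")
# All four settings keep sparsity at 75% while making the routing progressively finer-grained.
```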
4.2.2.4. Hybrid of MoBA and Full Attention
MoBA is designed as a direct substitute for full attention, maintaining the same number of parameters. This design enables a hybrid strategy where the model can seamlessly switch between MoBA (sparse attention) and full attention. This choice can be dynamically made during training or inference, providing flexibility to balance efficiency and performance without significant overhead. The paper explores this for both training and layer-wise supervised fine-tuning (SFT).
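A minimal sketch of how such a layer-wise hybrid could be configured, with the last few Transformer layers kept on full attention and all earlier layers on MoBA, is shown below. The config fields and the helper function are hypothetical, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class AttnConfig:
    num_layers: int = 32
    full_attn_last_n: int = 3      # last few layers keep full attention (e.g., for SFT)
    block_size: int = 4096
    top_k: int = 12

def attention_type(layer_idx: int, cfg: AttnConfig) -> str:
    """Return which attention variant a given layer should use."""
    if layer_idx >= cfg.num_layers - cfg.full_attn_last_n:
        return "full"              # task-specific top layers: dense attention
    return "moba"                  # all earlier layers: sparse block attention

cfg = AttnConfig()
print([attention_type(i, cfg) for i in range(cfg.num_layers)][-5:])
# ['moba', 'moba', 'full', 'full', 'full']
```

Because MoBA keeps the same parameters as full attention, this switch is purely a runtime choice and can also be flipped globally during training, as the hybrid experiments in Section 6 do.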
4.2.2.5. Comparison to Sliding Window Attention and Attention Sink
The paper highlights that MoBA has stronger expressive power than widely used sparse attention architectures like Sliding Window Attention (SWA) and Attention Sink.
- Sliding Window Attention (SWA): Each query token attends only to a fixed window of preceding tokens. MoBA can emulate SWA with a gating network that consistently selects the most recent blocks.
- Attention Sink: Each query token attends to a set of initial tokens (sink tokens) and a fixed window of recent tokens. MoBA can emulate Attention Sink with a gating network that always selects both the initial sink blocks and the recent blocks.

This demonstrates that MoBA is a more general framework, capable of subsuming these specialized sparse attention methods by incorporating specific (learned or hard-coded) gating networks; a small sketch of such hard-coded gates follows. MoBA's adaptive and dynamic block selection allows for more nuanced and efficient information processing, moving beyond the fixed and potentially suboptimal patterns of SWA and Attention Sink.
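To illustrate the subsumption argument, here is a small sketch of hard-coded gate functions that reproduce the sliding-window and attention-sink patterns within a MoBA-style block-gating interface. The function names and parameters are assumptions, not the paper's code.

```python
import torch

def swa_gate(n_blocks: int, cur_block: int, window_blocks: int) -> torch.Tensor:
    """Sliding window attention: always select the most recent blocks."""
    gate = torch.zeros(n_blocks, dtype=torch.bool)
    start = max(0, cur_block - window_blocks + 1)
    gate[start:cur_block + 1] = True
    return gate

def sink_gate(n_blocks: int, cur_block: int, sink_blocks: int, window_blocks: int) -> torch.Tensor:
    """Attention sink: initial 'sink' blocks plus a window of recent blocks."""
    gate = swa_gate(n_blocks, cur_block, window_blocks)
    gate[:min(sink_blocks, cur_block + 1)] = True
    return gate

print(swa_gate(8, cur_block=6, window_blocks=2).tolist())
print(sink_gate(8, cur_block=6, sink_blocks=1, window_blocks=2).tolist())
```

In MoBA the gate is learned from query-block affinities rather than fixed like this, which is exactly what the "less structure" principle refers to.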
4.2.3. Implementation
The authors provide a high-performance implementation of MoBA by incorporating optimization techniques from Flash Attention (Dao et al., 2022) and Mixture of Experts (MoE) (Rajeswaran et al., 2022). The algorithmic workflow is formalized in Algorithm 1 and visualized in Figure 1(b).
The following figure (Figure 2 from the original paper) shows the efficiency evaluation:
This figure is a chart comparing the computation time of MoBA and Flash Attention at different sequence lengths. Panel (a) shows how computation time grows with increasing sequence length, and panel (b) covers even longer sequences. MoBA shows better efficiency on long-context workloads, especially at longer sequence lengths.
Figure 2: Efficiency of MoBA vs. full attention (implemented with Flash Attention). (a) 1M model speedup evaluation: computation time scaling of MoBA versus Flash Attention on the 1M model with increasing sequence lengths (8K-1M). (b) Fixed sparsity ratio scaling: computation time scaling comparison between MoBA and Flash Attention across increasing sequence lengths (8K-10M), maintaining a constant sparsity ratio of 95.31% (fixed 64 MoBA blocks with varying block size and fixed top-k).
The MoBA implementation consists of five major steps:
- Determine assignment: Identify which query tokens will attend to which KV blocks based on the gating network and the causal mask.
- Arrange ordering: Organize the query tokens based on their assigned KV blocks for efficient processing.
- Compute attention outputs: Calculate attention outputs for the current block and the selected historical KV blocks. This step is optimized using Flash Attention for varying lengths.
- Compute self-attention for the current block: This is typically a standard causal self-attention.
- Combine results: Integrate the outputs from current-block attention and historical-block attention using online softmax (Milakov and Gimelshein, 2018).

The following is Algorithm 1 from the original paper, detailing the MoBA implementation:
Algorithm 1 MoBA (Mixture of Block Attention) Implementation
Require: Query, key and value matrices $Q, K, V \in \mathbb{R}^{N \times h \times d}$; MoBA hyperparameters: block size $B$ and top-k $k$;
$h$ and $d$ denote the number of attention heads and the head dimension. Also denote $n = N / B$ to be the number of
blocks.
1: // Split KV into blocks
2: $\{(K_i, V_i)\}$ = split_blocks(K, V, B), where $K_i, V_i \in \mathbb{R}^{B \times h \times d}$
3: // Compute gating scores for dynamic block selection
4: $\bar{K}$ = mean_pool(K, B), with $\bar{K} \in \mathbb{R}^{n \times h \times d}$
5: $S = Q \bar{K}^{\top}$, with $S \in \mathbb{R}^{N \times h \times n}$
6: // Select blocks with causal constraint (no attention to future blocks)
7: $M$ = create_causal_mask(N, n)
8: $G$ = topk(S + M, k)
9: // Organize attention patterns for computation efficiency
10: $(Q_s, K_s, V_s)$ = arrange_self_attn_block(Q, K, V)
11: $(Q_m, K_m, V_m)$ = arrange_moba_attn_block(Q, K, V, G)
12: // Compute attentions separately
13: $O_s$ = flash_attention_varlen($Q_s, K_s, V_s$, causal=True)
14: $O_m$ = flash_attention_varlen($Q_m, K_m, V_m$, causal=False)
15: // Combine results with online softmax
16: $O$ = combine_with_online_softmax($O_s, O_m$)
17: return $O$
Explanation of Algorithm 1:
- Require:
  - $Q, K, V \in \mathbb{R}^{N \times h \times d}$: Input query, key, and value matrices, where $N$ is the sequence length, $h$ is the number of attention heads, and $d$ is the head dimension.
  - MoBA hyperparameters: block size $B$ and top-k value $k$.
  - $n = N / B$: The total number of blocks.
- Lines 1-2: Split KV into blocks
  - split_blocks(K, V, B): This function partitions the full key and value matrices ($K$, $V$) into smaller blocks.
  - $K_i, V_i$: Represent the $i$-th key and value block. Each block has dimensions $B \times h \times d$.
- Lines 3-5: Compute gating scores for dynamic block selection
  - $\bar{K}$ = mean_pool(K, B): Computes the mean of the key vectors within each block. This results in a matrix where each row is the mean-pooled representation of a block's keys.
  - $S = Q\bar{K}^{\top}$: Calculates the affinity scores between each query token and each mean-pooled block key. This is the $\langle q, \mathrm{mean\_pool}(K[I_i]) \rangle$ operation from the gating mechanism, extended to all queries and heads.
- Lines 6-8: Select blocks with causal constraint
  - create_causal_mask(N, n): Generates a causal mask matrix $M$. This mask ensures that query tokens can only attend to historical blocks (blocks that precede or contain the query's position), preventing attention to future blocks. Future-block scores are set to $-\infty$ here.
  - $G$ = topk(S + M, k): Applies the top-k selection to the affinity scores after adding the causal mask. The resulting gating matrix $G$ (binary: 1 for selected, 0 for not selected) indicates which blocks each query will attend to, respecting causality.
- Lines 9-11: Organize attention patterns for computation efficiency
  - $(Q_s, K_s, V_s)$: Prepares the query, key, and value tensors for self-attention within the current block. This typically involves extracting the per-block $Q$, $K$, $V$ and aligning them for local causal attention.
  - $(Q_m, K_m, V_m)$: Prepares the query, key, and value tensors for MoBA attention to the selected historical blocks. This step uses the gating matrix $G$ to select and arrange the KV blocks that each query will attend to.
- Lines 12-14: Compute attentions separately
  - $O_s$: Computes self-attention within the current block using an optimized Flash Attention implementation that supports varying sequence lengths. Setting causal=True ensures that tokens within the current block only attend to preceding tokens in that block.
  - $O_m$: Computes attention from query tokens to the selected historical KV blocks using Flash Attention. Here causal=False indicates that queries can attend to all tokens within the selected historical blocks, since those blocks lie entirely in the past.
- Lines 15-16: Combine results with online softmax
  - $O$: Combines the outputs from the current-block attention ($O_s$) and the historical-block attention ($O_m$). This combination is done using an online softmax approach (Milakov and Gimelshein, 2018) to correctly normalize the attention scores across all attended tokens (both local and historical) while managing memory efficiently.
- Line 17: Return Output
  - Returns the final attention output $O$.

This algorithm efficiently computes sparse attention by leveraging block partitioning, dynamic top-k gating, causal masking, and Flash Attention for optimized computation on the selected blocks. A naive reference sketch of this workflow follows.
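Putting the pieces together, below is a deliberately naive, per-query reference sketch of the MoBA forward pass for a single head, without Flash Attention or variable-length batching. It trades efficiency for readability and is not the authors' optimized implementation; shapes and function names are assumptions.

```python
import math
import torch

def moba_attention_naive(Q, K, V, block_size: int, top_k: int):
    """Q, K, V: [N, d] single-head tensors, N divisible by block_size. Returns [N, d]."""
    N, d = Q.shape
    n_blocks = N // block_size
    block_keys = K.view(n_blocks, block_size, d).mean(dim=1)          # block representations
    out = torch.empty_like(Q)
    for t in range(N):                                                # one query token at a time
        q = Q[t]
        cur = t // block_size
        scores = block_keys @ q
        scores[cur + 1:] = float("-inf")                              # causal: no future blocks
        sel = set(scores.topk(min(top_k, cur + 1)).indices.tolist())
        sel.add(cur)                                                  # current block always attended
        idx = []
        for b in sorted(sel):
            end = t + 1 if b == cur else (b + 1) * block_size         # causal mask inside current block
            idx.extend(range(b * block_size, end))
        idx = torch.tensor(idx)
        attn = torch.softmax(q @ K[idx].T / math.sqrt(d), dim=-1)     # softmax over selected keys only
        out[t] = attn @ V[idx]
    return out

Q, K, V = (torch.randn(512, 32) for _ in range(3))
print(moba_attention_naive(Q, K, V, block_size=128, top_k=2).shape)   # torch.Size([512, 32])
```

The optimized implementation batches all queries sharing the same selected blocks and fuses the two attention passes with Flash Attention and online softmax, but the selection logic is the same as in this sketch.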
5. Experimental Setup
5.1. Datasets
The experiments in the paper utilize a large-scale training corpus for scaling law experiments and various benchmark datasets for large language modeling evaluation.
- Training Corpus for Scaling Law Experiments:
  - The models are trained with a substantial number of tokens, ranging from 10.8 billion for the 568M-parameter model up to 36.9 billion for the 2.1B-parameter model (Table 1).
  - The specific source or characteristics of this training corpus are not detailed in the paper, but it is implied to be a general-purpose text dataset suitable for LLM pre-training.
  - Models are initially trained with a sequence length of 8K for the initial LM loss evaluation, and then extended to 32K for long-context scalability assessments.
  - For continual pre-training, the Llama3.1 8B Base Model is used as a starting point. This model is continually pre-trained on contexts up to 1M tokens.
- Evaluation Benchmarks for Large Language Modeling: The paper evaluates MoBA against full attention models across a diverse set of real-world downstream tasks, covering various capabilities of LLMs. These include:
  - AGIEval: A comprehensive benchmark for AGI capabilities.
  - BBH (Big-Bench Hard): A challenging subset of Big-Bench, designed to test LLMs on tasks requiring multi-step reasoning.
  - CEval: A Chinese-language benchmark for LLMs, covering various subjects and task types.
  - GSM8K: A dataset of grade-school math word problems requiring multi-step reasoning.
  - HellaSWAG: A commonsense reasoning benchmark, requiring LLMs to select the most plausible ending to a given premise.
  - Loogle (LooGLE): A benchmark for long-context understanding and question answering over long documents.
  - Competition Math: A benchmark for advanced mathematical problem-solving.
  - MBPP (Mostly Basic Python Programs): A dataset for code generation tasks, requiring models to write Python programs from natural language prompts.
  - MBPP Sanitized: A cleaned version of MBPP.
  - MMLU (Massive Multitask Language Understanding): A benchmark covering 57 subjects across STEM, humanities, social sciences, and more, testing LLMs' world knowledge and problem-solving abilities.
  - MMLU Pro: A more challenging version of MMLU.
  - OpenAI HumanEval: A dataset for evaluating code generation models on Python programming problems.
  - SimpleQA: A dataset for simple question answering.
  - TriviaQA: A challenging question answering dataset based on trivia.
  - LongBench: A benchmark specifically designed for evaluating LLMs on long-context understanding and reasoning tasks, with context lengths up to 32K.
  - RULER: A synthetic benchmark for long-context capabilities (retrieval, multi-hop tracing, aggregation, and question answering), evaluated here at a 128K context length.
  - Needle in a Haystack: A specialized benchmark that evaluates a model's ability to retrieve a specific piece of information (the "needle") hidden within a very long document (the "haystack"). This is a direct test of long-context retrieval and attention span. The evaluation is conducted up to a 1M context length.

These datasets are chosen to provide a comprehensive evaluation across general language modeling performance, reasoning, code generation, knowledge retrieval, and, crucially, long-context understanding capabilities, directly addressing the paper's primary objective.
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate MoBA's performance, particularly focusing on language model loss and task-specific scores for downstream benchmarks.
- Language Model (LM) Loss:
  - Conceptual Definition: Language model loss, typically cross-entropy loss, quantifies how well a language model predicts the next token in a sequence given the preceding tokens. A lower LM loss indicates better predictive accuracy and stronger language modeling capabilities. It is a fundamental metric for evaluating the quality of generative language models during pre-training and fine-tuning.
  - Mathematical Formula: For a sequence of $N$ tokens and a vocabulary of size $V$, the cross-entropy loss is typically calculated as:
$
L = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{V} y_{i,j} \log(\hat{y}_{i,j})
$
  - Symbol Explanation:
    - $L$: The cross-entropy loss.
    - $N$: The sequence length.
    - $V$: The vocabulary size.
    - $y_{i,j}$: A binary indicator (0 or 1) that is 1 if the $j$-th token in the vocabulary is the true next token at position $i$, and 0 otherwise.
    - $\hat{y}_{i,j}$: The predicted probability, according to the model, that the $j$-th token in the vocabulary is the next token at position $i$.
- Trailing LM Loss:
  - Conceptual Definition: Trailing LM loss is a variant of LM loss that specifically measures the prediction accuracy on the last few tokens of a long sequence (e.g., the last 1K or 2K tokens). This metric is particularly insightful for long-context models, because LLMs often perform better on initial tokens and struggle with maintaining context and coherence towards the end of very long inputs. A good trailing LM loss indicates the model's ability to effectively utilize and retain information over long distances.
  - Mathematical Formula: The formula is the same as for LM loss, but the summation is restricted to the last positions, i.e., $i = N - T + 1$ to $N$.
  - Symbol Explanation: Same as for LM loss, with $T$ denoting the number of trailing tokens considered.
- Position-wise LM Loss:
  - Conceptual Definition: Position-wise LM loss evaluates the LM loss at each specific token position within the sequence. This metric helps to visualize how the model's predictive performance changes as a function of its position in the context. It can reveal patterns such as degradation of performance with increasing distance from the beginning of the sequence, or the impact of sparse attention on different parts of the context.
  - Mathematical Formula: Not explicitly provided in the paper as a single formula, but conceptually it is the LM loss calculated for each position $i$ individually:
$
L_i = - \sum_{j=1}^{V} y_{i,j} \log(\hat{y}_{i,j})
$
  - Symbol Explanation: $L_i$ is the loss at position $i$; other symbols are as defined for LM loss. (A small code sketch computing these three loss variants is given at the end of this section.)
- Benchmark-Specific Metrics (e.g., Accuracy, F1 Score, Pass@1):
  - For benchmarks like AGIEval, BBH, GSM8K, MMLU, HellaSWAG, OpenAI HumanEval, etc., the paper reports scores that are standard for those respective benchmarks. These typically include:
    - Accuracy: For classification or multiple-choice tasks (e.g., MMLU, HellaSWAG).
    - F1 Score: For tasks involving information extraction or generation where both precision and recall are important.
    - Pass@1: For code generation tasks (e.g., OpenAI HumanEval, MBPP), indicating the percentage of problems for which the first generated solution passes all test cases.
  - While the paper does not explicitly define the formulas for these standard metrics, they are well established in the NLP and ML communities; the specific metric used for each benchmark is assumed to follow its standard evaluation protocol.
- Needle in a Haystack Evaluation Score:
  - Conceptual Definition: This is a specialized test for long-context retrieval. A specific, unique "needle" (e.g., a sentence or fact) is embedded at various positions within a very long "haystack" (a long document). The model is then prompted to find and extract this needle. The score (often a percentage) indicates how accurately the model can locate and reproduce the embedded information. A high score signifies strong long-context retrieval and attention capabilities.
  - Mathematical Formula: Typically, this is a binary success/failure for each trial, averaged over multiple trials and positions:
$
\text{Score} = \frac{\text{Number of correctly retrieved needles}}{\text{Total number of needles}} \times 100\%
$
  - Symbol Explanation: "Number of correctly retrieved needles" is the count of instances where the model successfully extracts the target information; "Total number of needles" is the total number of test cases.
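A small sketch of how the three loss variants above can be computed from next-token logits with standard cross-entropy follows; the helper name and tensor shapes are assumptions, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def lm_losses(logits, targets, trailing: int = 1024):
    """logits: [N, V] next-token logits, targets: [N] gold token ids.
    Returns (mean LM loss, trailing LM loss over the last `trailing` tokens, per-position loss [N])."""
    per_pos = F.cross_entropy(logits, targets, reduction="none")   # position-wise LM loss L_i
    return per_pos.mean(), per_pos[-trailing:].mean(), per_pos

N, V = 4096, 32000
logits, targets = torch.randn(N, V), torch.randint(0, V, (N,))
mean_loss, trail_loss, per_pos = lm_losses(logits, targets, trailing=1024)
print(f"LM loss {mean_loss:.3f} | trailing-1K loss {trail_loss:.3f} | positions {tuple(per_pos.shape)}")
```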
5.3. Baselines
The primary baseline model against which MoBA is compared is full attention implemented with Flash Attention.
- Full Attention (with Flash Attention):
  - This serves as the main strong baseline for evaluating both performance and efficiency. Full attention is the standard, most expressive attention mechanism in Transformers, where every query token attends to every key token in the context.
  - The implementation uses Flash Attention (Dao et al., 2022), an optimized version of full attention that significantly improves speed and memory efficiency on modern GPUs without altering the fundamental quadratic computational complexity. By comparing against Flash Attention, the authors ensure that MoBA's efficiency gains are due to its sparse nature, not merely better implementation details.
  - The full attention models are trained under conditions equivalent to the MoBA models (e.g., Llama-8B-1M-Full vs. Llama-8B-1M-MoBA) to allow for a direct comparison of their respective performance and efficiency profiles.
  - For the continual pre-training experiments, the Llama3.1 8B Base Model is used as a common starting point for both the MoBA and full attention variants, ensuring a fair comparison from a consistent foundation.

The comparison aims to show that MoBA can match or exceed the performance of full attention while achieving substantial efficiency gains, particularly for very long contexts. The paper also implicitly contrasts MoBA with the limitations of other sparse attention methods (like sliding window and attention sink) and linear attention models (like Mamba or RetNet) by highlighting MoBA's "less structure" principle and its ability to perform well on complex reasoning tasks.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that MoBA achieves performance comparable to full attention while significantly improving computational efficiency and scalability for long contexts.
6.1.1. Scaling Law Experiments and Ablation Studies
Scalability w.r.t. LM Loss:
- LM Loss (seqlen = 8K): When trained with an 8K sequence length (standard for many LLMs), MoBA models, despite achieving 81.25% sparsity, show LM loss curves that are almost indistinguishable from full attention models. The LM loss values for both mechanisms remain consistent within a range of 1e-3. This suggests that MoBA can scale similarly to full attention in terms of language modeling performance, even with substantial sparsity. The following figure (Figure 3 from the original paper) illustrates the scaling law comparison:
This figure is a chart comparing the scaling behavior of MoBA and full attention under different conditions: panel (a) shows the language model loss at a sequence length of 8K, panel (b) shows the trailing LM loss at a sequence length of 32K, and panel (c) provides the fitted scaling-law curves. Overall, it contrasts the projected scaling of MoBA with that of full attention in terms of compute and performance.
Figure 3: Scaling law comparison between MoBA and full attention. (a) LM loss on the validation set (seqlen = 8K); (b) trailing LM loss on the validation set (seqlen = 32K, last 1K tokens); (c) fitted scaling law curves.
Figure 3(a) visually confirms the comparable LM loss for both MoBA and full attention at 8K context length.
- Long Context Scalability (Trailing LM Loss, seqlen = 32K): To assess long-context scalability, models were evaluated at a 32K maximum sequence length, using trailing LM loss (loss on the last 1K tokens) to avoid biases from short sequences. This setting results in an even higher sparsity of 95.31% for MoBA. Figure 3(b) shows that while MoBA initially has slightly higher trailing LM loss than full attention at lower compute levels, the gap progressively narrows as compute increases. This indicates MoBA's strong potential for long-context scalability.
Ablation Study: Fine-Grained Block Segmentation:
- The paper investigates the impact of block granularity on MoBA's performance. Using a 1.5B-parameter model with a 32K context length, the block size and top-k hyperparameters were adjusted to maintain a consistent sparsity of 75%.
The results, shown in Figure 4, demonstrate that
MoBA's performance is significantly affected byblock granularity. Finer-grained block segmentation (smaller block sizes, more blocks) generally leads to better performance (lowerLM loss). This confirms thatfine-grained expert segmentationis a beneficial technique forMoBA, similar to its positive impact in the broaderMoEliterature. The following figure (Figure 4 from the original paper) shows the effect of block segmentation:
This figure shows the LM loss on the validation set under different MoBA block-segmentation settings, compared against the full attention baseline.
Figure 4: Fine-grained block segmentation. The LM loss on the validation set vs. MoBA with different block granularities.
6.1.2. Hybrid of MoBA and Full Attention
The flexibility of MoBA allows for hybrid training and layer-wise hybrid strategies.
MoBA/Full Hybrid Training:
- Three 1.5B-parameter models were trained on 30B tokens with a 32K context length: MoBA/full hybrid, full attention, and MoBA.
  - MoBA/full hybrid: Trained with MoBA for 90% of the tokens, then switched to full attention for the remaining 10%.
  - Full attention: Trained entirely with full attention.
  - MoBA: Trained exclusively with MoBA.
- Figure 5(a) presents the position-wise LM loss. The MoBA/full hybrid model achieves the lowest position-wise LM loss across most positions. This suggests that a blend of MoBA for efficiency and full attention for a final boost can yield superior long-context performance. Interestingly, MoBA initially has higher loss but catches up to full attention as context length increases. The following figure (Figure 5 from the original paper) illustrates the hybrid strategies:
This figure contains three charts showing the LM loss of MoBA, MoBA/Full Hybrid, and Full Attention: panel (a) plots LM loss against token position, while panels (b) and (c) compare the methods with respect to the number of full attention layers.
Figure 5: Hybrid of MoBA and full attention. (a) Position-wise LM loss for MoBA, full attention, and MoBA/full hybrid training; (b), (c) trailing LM loss (seqlen = 32K, last 2K tokens) w.r.t. the number of full attention layers in the layer-wise hybrid.
Layer-wise Hybrid:
- MoBA can sometimes lead to suboptimal performance during supervised fine-tuning (SFT), especially when dealing with sparse gradients from task-specific tokens.
- To address this, a layer-wise hybrid strategy was proposed: using full attention for the last few Transformer layers (which typically handle task-specific token predictions) and MoBA for the preceding layers.
- Figures 5(b) and 5(c) (the latter from the appendix) show that this strategy significantly reduces SFT loss, confirming that combining full attention in the critical final layers with MoBA in earlier layers can improve performance while maintaining efficiency.
6.1.3. Large Language Modeling Evaluation
- The Llama3.1 8B Base Model was used as a starting point for continual pre-training up to a 1M context length. MoBA was activated for the last 100K tokens of this process. The block size was 4096 and top-k was 12, leading to 95.31% sparsity at 1M context.
Benchmark Performance: Table 2 compares
Llama-8B-1M-MoBAwithLlama-8B-1M-Fullacross 15 benchmarks. The following are the results from Table 2 of the original paper:Benchmark Llama-8B-1M-MoBA Llama-8B-1M-Full AGIEval [0-shot] 0.5144 0.5146 BBH [3-shot] 0.6573 0.6589 CEval [5-shot] 0.6273 0.6165 GSM8K [5-shot] 0.7278 0.7142 HellaSWAG [0-shot] 0.8262 0.8279 Loogle [0-shot] 0.4209 0.4016 Competition Math [0-shot] 0.4254 0.4324 MBPP [3-shot] 0.5380 0.5320 MBPP Sanitized [0-shot] 0.6926 0.6615 MMLU [0-shot] 0.4903 0.4904 MMLU Pro [5-shot][CoT] 0.4295 0.4328 OpenAI HumanEval [0-shot][pass@1] 0.6951 0.7012 SimpleQA [0-shot] 0.0465 0.0492 TriviaQA [0-shot] 0.5673 0.5667 LongBench @32K [0-shot] 0.4828 0.4821 RULER @128K [0-shot] 0.7818 0.7849
The results show that Llama-8B-1M-MoBA achieves performance highly comparable to Llama-8B-1M-Full across most benchmarks. For instance, on AGIEval, MMLU, HellaSWAG, and RULER, the scores are very close. Notably, MoBA even slightly outperforms full attention on CEval, GSM8K, Loogle, MBPP, and MBPP Sanitized. This is particularly impressive for RULER at 128K context length, where MoBA operates with 62.5% sparsity but nearly matches full attention's score.
- Needle in a Haystack Evaluation: Figure 7 demonstrates Llama-8B-1M-MoBA's performance on the Needle in a Haystack benchmark up to a 1M context length. The following figure (Figure 7 from the original paper) shows the Needle in a Haystack performance:
This figure is a chart showing the performance of Llama-8B-1M-MoBA on the "Needle in the Haystack" benchmark at different context lengths, with scores ranging from 0 to 100. The green regions indicate that the model performs well on the long-context task.
Figure 7: Performance of Llama-8B-1M-MoBA on the Needle in the Haystack benchmark (up to 1M context length).
The model exhibits satisfactory performance, indicated by the green regions, even at 1 million tokens, affirming its capability for long-context retrieval.
6.1.4. Efficiency and Scalability
- Speedup Evaluation (1M Model): Figure 2(a) compares the computation time scaling of MoBA vs. Flash Attention on the 1M model with increasing sequence lengths. MoBA significantly improves efficiency, achieving a speedup ratio of up to 6.5x when prefilling 1 million tokens. This confirms its practical advantage in terms of wall-clock time.
- Fixed Sparsity Ratio Scaling (8K-10M): Figure 2(b) extends the comparison to sequence lengths up to 10 million tokens, maintaining a constant sparsity ratio of 95.31%. MoBA demonstrates superior efficiency compared to Flash Attention as sequence lengths grow. While both methods perform comparably at shorter sequences (e.g., 32K-128K), MoBA's computational advantage becomes increasingly evident and substantial for extremely long sequences, highlighting its strength in processing contexts of millions of tokens.
- Underlying Factors: The efficiency of MoBA stems from three main factors: (1) dynamic block selection, (2) sparse attention across blocks, and (3) an optimized implementation combining Mixture of Experts and Flash Attention techniques. These factors reduce the computational complexity to a more economical sub-quadratic scale (a back-of-the-envelope sketch follows).
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Model Param | Head | Layer | Hidden | Training Tokens | Block Size | Top-k |
|---|---|---|---|---|---|---|
| 568M | 14 | 14 | 1792 | 10.8B | 512 | 3 |
| 822M | 16 | 16 | 2048 | 15.3B | 512 | 3 |
| 1.1B | 18 | 18 | 2304 | 20.6B | 512 | 3 |
| 1.5B | 20 | 20 | 2560 | 27.4B | 512 | 3 |
| 2.1B | 22 | 22 | 2816 | 36.9B | 512 | 3 |
Table 1: Configuration of Scaling Law Experiments
This table details the configurations used for the scaling law experiments, showing models of varying sizes (from 568M to 2.1B parameters) with their respective head count, layer count, hidden dimension, total training tokens, and MoBA-specific hyperparameters (block size and top-k). For these experiments, a constant block size of 512 and top-k of 3 were used across all model sizes.
The following are the results from Table 2 of the original paper:
| Benchmark | Llama-8B-1M-MoBA | Llama-8B-1M-Full |
|---|---|---|
| AGIEval [0-shot] | 0.5144 | 0.5146 |
| BBH [3-shot] | 0.6573 | 0.6589 |
| CEval [5-shot] | 0.6273 | 0.6165 |
| GSM8K [5-shot] | 0.7278 | 0.7142 |
| HellaSWAG [0-shot] | 0.8262 | 0.8279 |
| Loogle [0-shot] | 0.4209 | 0.4016 |
| Competition Math [0-shot] | 0.4254 | 0.4324 |
| MBPP [3-shot] | 0.5380 | 0.5320 |
| MBPP Sanitized [0-shot] | 0.6926 | 0.6615 |
| MMLU [0-shot] | 0.4903 | 0.4904 |
| MMLU Pro [5-shot][CoT] | 0.4295 | 0.4328 |
| OpenAI HumanEval [0-shot][pass@1] | 0.6951 | 0.7012 |
| SimpleQA [0-shot] | 0.0465 | 0.0492 |
| TriviaQA [0-shot] | 0.5673 | 0.5667 |
| LongBench @32K [0-shot] | 0.4828 | 0.4821 |
| RULER @128K [0-shot] | 0.7818 | 0.7849 |
Table 2: Performance comparison between MoBA and full Attention across different evaluation benchmarks.
This table presents a comprehensive performance comparison between Llama-8B-1M-MoBA and Llama-8B-1M-Full across various evaluation benchmarks. It clearly shows that MoBA maintains highly comparable performance to full attention across a wide range of tasks, including reasoning, QA, and code generation, even at extended context lengths. In some cases, MoBA even slightly surpasses the full attention baseline.
The following are the results from Table 3 of the original paper:
| LM Loss Position Range | MoBA | Full |
|---|---|---|
| 0K - 2K | 3.075 × C-0.078 | 3.068 × C-0.078 |
| 2K - 4K | 2.415× C-0.084 | 2.411 × C-0.083 |
| 4K - 6K | 2.085 × C-0.081 | 2.077 × C-0.081 |
| 6K - 8K | 1.899 × C-0.092 | 1.894 × C-0.092 |
| 8K - 10K | 1.789 × C-0.091 | 1.774 × C-0.089 |
| 10K - 12K | 1.721 × C-0.092 | 1.697 × C-0.087 |
| 12K - 14K | 1.670 × C-0.089 | 1.645 × C-0.088 |
| 14K - 16K | 1.630 × C-0.089 | 1.600 × C-0.087 |
| 16K - 18K | 1.607 × C-0.090 | 1.567 × C-0.087 |
| 18K - 20K | 1.586 × C-0.091 | 1.542 × C-0.087 |
| 20K - 22K | 1.571 × C-0.093 | 1.519 × C-0.086 |
| 22K - 24K | 1.566 × C-0.089 | 1.513 × C-0.085 |
| 24K - 26K | 1.565 × C-0.091 | 1.502 × C-0.085 |
| 26K - 28K | 1.562 × C-0.095 | 1.493 × C-0.088 |
| 28K - 30K | 1.547 × C-0.097 | 1.471 × C-0.091 |
| 30K - 32K | 1.546 × C-0.108 | 1.464 × C-0.097 |
Table 3: Loss scaling with different positions
This table presents LM loss scaling for MoBA and full attention across different position ranges within a 32K context length. The loss in each range is fitted to a power-law curve of the form loss = a × C^b, where a is a fitted constant, C is the training compute, and b is the (negative) scaling exponent. It shows that MoBA's loss scaling behavior closely tracks that of full attention across all positions, although full attention generally achieves slightly lower loss, particularly at the longest positions. The difference in scaling exponents (the exponent of C) is small, further indicating comparable scaling properties.
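For readers who want to reproduce this kind of fit, the snippet below shows one standard way to estimate the power-law coefficients by linear regression in log-log space. The data points are synthetic placeholders, only the fitting procedure is illustrated, and the reading of C as training compute follows the scaling-law setup of Table 1.

```python
# Minimal sketch of fitting a power law  loss = a * C**b  in log-log space,
# mirroring the form of the curves in Table 3. The data points are synthetic
# placeholders, not values from the paper.
import numpy as np

def fit_power_law(compute: np.ndarray, loss: np.ndarray) -> tuple[float, float]:
    """Return (a, b) such that loss ~= a * compute**b."""
    b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(log_a)), float(b)

if __name__ == "__main__":
    np.random.seed(0)
    compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])              # hypothetical FLOP budgets
    loss = 3.0 * compute ** -0.08 * (1 + 0.01 * np.random.randn(5))  # synthetic losses
    a, b = fit_power_law(compute, loss)
    print(f"loss ≈ {a:.3f} * C^{b:.3f}")
```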
6.3. Ablation Studies / Parameter Analysis
The paper includes several analyses to understand the impact of MoBA's design choices and hyperparameters.
- Fine-Grained Block Segmentation: As discussed in Section 6.1.1, Figure 4 shows that LM loss significantly decreases as block granularity becomes finer (smaller block size), even when sparsity is kept constant. This suggests that allowing MoBA to select from a larger number of smaller KV blocks provides a more effective and nuanced way for the model to attend, leading to better language modeling performance. The trade-off is that more blocks mean a larger routing overhead.
- Hybrid Training Strategy: Figure 5(a) analyzes the position-wise LM loss for MoBA, full attention, and a MoBA/full hybrid model. The hybrid approach, which uses MoBA for most of training and switches to full attention for a smaller portion, yields the best overall performance, especially at later positions of the sequence. This indicates that the hybrid approach effectively balances the efficiency of MoBA with the potentially higher expressive power of full attention during critical training phases.
- Layer-wise Hybrid for SFT: Figures 5(b) and 5(c) (Appendix) investigate a layer-wise hybrid strategy to optimize supervised fine-tuning (SFT) performance. The results show that using full attention in the top layers of the Transformer (which are often more task-specific and sensitive to sparse gradients) while keeping MoBA in the lower (base) layers significantly reduces SFT loss. This practical finding suggests a method to overcome potential performance dips of sparse attention during fine-tuning on specific tasks.
- Sparsity Ratio and Context Length: Across multiple experiments, the sparsity ratio is a key parameter that is either kept constant or adjusted (by varying block size and top-k) to explore MoBA's behavior. The results consistently demonstrate that MoBA can maintain high performance even at very high sparsity levels (e.g., 81.25% at 8K context, 95.31% at 32K and 1M context), validating the effectiveness of its dynamic block selection in focusing on critical information while discarding irrelevant parts of the context. The ability to scale block size with context length (while keeping top-k fixed) is crucial for maintaining a constant sparsity ratio and achieving efficiency at extreme scales (up to 10M tokens); see the arithmetic sketch after this list.
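As a quick sanity check on the quoted sparsity figures, the arithmetic below assumes each query attends to top-k blocks of block_size tokens (causal and boundary effects ignored), so sparsity ≈ 1 − top_k · block_size / seq_len. The 1M-token block size used here is an assumed value consistent with the 95.31% ratio, not a setting confirmed by the paper.

```python
# Approximate sparsity under the assumption that each query attends to
# top_k blocks of block_size tokens (causal/boundary effects ignored).

def sparsity(seq_len: int, block_size: int, top_k: int) -> float:
    attended = min(top_k * block_size, seq_len)
    return 1.0 - attended / seq_len

print(f"{sparsity(8_192, 512, 3):.4f}")        # 0.8125 -> 81.25% at 8K context
print(f"{sparsity(32_768, 512, 3):.4f}")       # 0.9531 -> 95.31% at 32K context
# Scaling block size with context length (top-k fixed) preserves the ratio;
# block_size = 16_384 is an assumed value consistent with 95.31% at 1M tokens.
print(f"{sparsity(1_048_576, 16_384, 3):.4f}")  # 0.9531
```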
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Mixture of Block Attention (MoBA), a novel attention mechanism designed to enable large language models (LLMs) to effectively handle long-context tasks with enhanced computational efficiency. By applying Mixture of Experts (MoE) principles to the attention mechanism, MoBA partitions the input context into blocks and uses a gating network to dynamically select the most relevant blocks for each query token. This approach adheres to a "less structure" philosophy, allowing the model to autonomously determine attention patterns rather than relying on predefined biases.
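To ground this description, the following is a minimal, unoptimized single-head sketch of the block-gating idea as summarized above: mean-pool the keys of each block, score every block against each query, keep the top-k blocks, and run softmax attention only over tokens in those blocks. Causal masking, the always-selected current block, multi-head batching, and the optimized MoE/FlashAttention kernels are all omitted, and the dense score matrix is used purely for clarity, so this illustrates the mechanism rather than reproducing the authors' implementation.

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size: int, top_k: int):
    """Single-head MoBA-style attention sketch. q, k, v: [seq_len, head_dim]."""
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size  # assumes seq_len % block_size == 0
    # Mean-pooled key representation per block: [n_blocks, head_dim]
    block_repr = k.view(n_blocks, block_size, head_dim).mean(dim=1)

    # Affinity of each query to each block, then top-k block selection.
    affinity = q @ block_repr.T                                    # [seq_len, n_blocks]
    top_idx = affinity.topk(min(top_k, n_blocks), dim=-1).indices  # [seq_len, top_k]

    # Token-level mask: keep key j for query i only if j's block is among i's top-k.
    # (A real implementation gathers only the selected blocks instead of masking.)
    block_of_token = torch.arange(seq_len, device=q.device) // block_size
    keep = (block_of_token[None, None, :] == top_idx[..., None]).any(dim=1)

    scores = (q @ k.T) / head_dim ** 0.5
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(1024, 64) for _ in range(3))
    print(moba_attention(q, k, v, block_size=128, top_k=3).shape)  # torch.Size([1024, 64])
```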
The key findings and contributions include:
- Comparable Performance to Full Attention: Extensive experiments demonstrate that MoBA achieves language model loss and downstream task performance highly comparable to full attention models, even at significant levels of sparsity (e.g., 95.31%).
- Significant Efficiency Gains: MoBA provides substantial improvements in computational efficiency, achieving speedups of up to 6.5x when prefilling 1 million tokens and effectively scaling to 10 million tokens.
- Long-Context Scalability: The architecture proves robust in long-context scenarios, as evidenced by competitive trailing LM loss at 32K tokens and satisfactory performance on the Needle in a Haystack benchmark up to 1 million tokens.
- Flexibility and Hybrid Strategies: MoBA offers the ability to seamlessly transition between full and sparse attention, enabling effective hybrid training and layer-wise hybrid strategies to optimize performance, particularly during supervised fine-tuning (SFT).
- Real-World Deployment: The successful deployment of MoBA to support Kimi's long-context requests underscores its practical utility and robustness.

In summary, MoBA represents a significant advancement in efficient attention computation, providing a scalable and high-performing solution for extending the context capabilities of LLMs without compromising their expressive power.
7.2. Limitations & Future Work
While the paper primarily highlights the advancements and successful deployment of MoBA, it implicitly points to areas for further research and investigation by contrasting its approach with the limitations of existing methods. The paper does not have a dedicated "Limitations" or "Future Work" section, but the following can be inferred:
- Complex Reasoning Tasks with Sparse Attention: The introduction notes that the performance of linear approximations of attention in complex reasoning tasks "remains inadequately explored." While MoBA aims to overcome this by retaining softmax-based attention's expressive power, a more rigorous and specific evaluation of MoBA's performance on highly complex, multi-hop reasoning tasks at extreme context lengths could be a direction for future work.
- Optimizing Gating Network Overhead: While MoBA reduces attention computation, the gating network itself introduces some overhead (e.g., mean pooling and affinity score computation). For extremely fine-grained block segmentation or very large numbers of blocks, the overhead of the gating network might become a factor (a rough cost comparison follows this list). Future work could explore more efficient gating mechanisms or dynamic router architectures.
- Adaptive Sparsity Scheduling: The paper shows the benefits of hybrid training and layer-wise hybrid strategies. Further research could investigate more advanced, dynamic scheduling mechanisms for switching between sparse and full attention modes, or for adjusting top-k values, based on task, layer, or input characteristics.
- Theoretical Guarantees for "Less Structure": While MoBA adheres to a "less structure" principle compared to fixed sparse patterns, the block partitioning and mean-pooling for gate computation still introduce some structural assumptions. Future work could explore the theoretical implications of these structures and investigate even more flexible or learned ways of defining attention "experts."
- Generalization Beyond Text: The current deployment is for LLMs. Exploring MoBA's applicability and performance in other domains that utilize Transformer architectures, such as multimodal models (e.g., vision Transformers) or time series analysis, could be an interesting future direction.
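Referring back to the gating-overhead item above, a simple per-query accounting (an assumption-laden model, not a measurement from the paper) suggests when the router stops being negligible: if the router scores every block once per query while sparse attention touches top_k · block_size keys, the cost ratio grows quickly as blocks become finer.

```python
# Rough per-query cost ratio (in units of head_dim multiply-adds), under the
# assumed model that the router scores every block once per query while
# sparse attention touches top_k * block_size keys per query.

def gating_to_attention_ratio(seq_len: int, block_size: int, top_k: int) -> float:
    n_blocks = seq_len // block_size
    return n_blocks / (top_k * block_size)

# Coarse blocks: router cost is a small fraction of attention cost.
print(gating_to_attention_ratio(32_768, block_size=512, top_k=3))  # ~0.04
# Very fine blocks: the router would dominate under this simple model.
print(gating_to_attention_ratio(32_768, block_size=32, top_k=3))   # ~10.7
```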
7.3. Personal Insights & Critique
MoBA presents an elegant and effective solution to the long-standing challenge of quadratic complexity in Transformer attention. The core insight of applying Mixture of Experts (MoE) principles, previously successful in FFN layers, to the attention mechanism is genuinely innovative. This approach allows LLMs to dynamically and intelligently select relevant parts of an extended context, mimicking a more human-like selective attention process.
One of the most compelling aspects is the "less structure" principle. By letting the model learn where to attend via a gating network, MoBA avoids the rigid biases of sliding window or sink attention, which can be limiting for tasks requiring non-local dependencies. The empirical results strongly support this, showing MoBA performing on par with full attention across diverse benchmarks while delivering substantial speedups. The 6.5x speedup for prefilling 1M tokens and scalability to 10M tokens are impressive practical achievements.
The ability to seamlessly transition between full and sparse attention is also a powerful design choice. This hybrid capability provides a practical pathway for practitioners to fine-tune the balance between efficiency and performance, potentially using MoBA for general pre-training and switching to full attention for critical, task-specific layers during SFT. This flexibility makes MoBA highly adaptable to different use cases and computational constraints. The deployment in Kimi validates its real-world applicability, moving it beyond a theoretical concept.
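As a concrete illustration of that hybrid pathway, a layer-wise schedule can be expressed very compactly; the sketch below is hypothetical, and the choice of how many top layers switch to full attention is an assumed value for illustration rather than the paper's prescribed recipe.

```python
# Hypothetical layer-wise hybrid schedule: MoBA in the lower layers, full
# attention in the last few layers. The cutoff of 3 full-attention layers
# is an assumption for this example.

def attention_schedule(num_layers: int, full_top_layers: int = 3) -> list[str]:
    return ["full" if i >= num_layers - full_top_layers else "moba"
            for i in range(num_layers)]

print(attention_schedule(8))
# ['moba', 'moba', 'moba', 'moba', 'moba', 'full', 'full', 'full']
```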
Potential areas for deeper investigation or critique include:
- Gating Network Complexity vs. Attention Gain: While the gating network (mean pooling and top-k selection) is relatively lightweight, its computational overhead, especially for a very large number of blocks (fine-grained segmentation), could warrant further analysis. How does the complexity of the gating network scale, and at what point does it start to diminish the overall efficiency gains?
- Finer-Grained Top-K Analysis: The paper uses a fixed top-k (e.g., 3 or 12). A more detailed ablation of how dynamically adjusting top-k (e.g., per-layer, per-head, or per-query) based on context or task difficulty could impact performance and efficiency would be valuable. The assumption of a fixed top-k for all queries might still introduce some implicit bias.
- Representational Capacity of Block Pooling: Using mean pooling for block representation is a simple and effective choice. Exploring more sophisticated block representation methods (e.g., learnable aggregations, CNNs, or smaller attention mechanisms within blocks) could potentially improve the gate's decision-making, though at the cost of increased gating network complexity.
- Robustness to Adversarial Contexts: How does MoBA perform when the "needle" is disguised or placed in particularly distractor-rich blocks that might not be selected by the gating network? While Needle in a Haystack results are good, more adversarial settings might reveal limitations in the gating mechanism's ability to consistently identify critical information.

Overall, MoBA is a significant contribution that offers a pragmatic yet powerful approach to scaling LLMs to unprecedented context lengths, maintaining the expressive power of Transformers while drastically improving efficiency. Its principled design and demonstrated effectiveness make it a highly promising direction for future LLM development.