LLM Inference Acceleration
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Published: 7/3/2024
Tags: Long-Context Modeling, Sparse Attention Mechanism, LLM Inference Acceleration, GPU Optimization for Computation, Dynamic Sparse Computation
The paper presents MInference, a dynamic sparse computation method that accelerates the pre-filling stage of long-context LLMs. By identifying three distinct patterns in attention matrices and dynamically building sparse indices, MInference significantly reduces latency while maintaining accuracy.
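As a rough illustration of the dynamic-sparse-index idea described above, the sketch below builds a block-level sparse index on the fly and attends only within the selected key blocks. The block size, the top-k selection, and the mean-pooled estimator are illustrative assumptions, not MInference's actual attention patterns or kernels.

```python
# Minimal sketch of dynamic block-sparse attention: estimate block-pair scores
# cheaply, build a sparse index per query block, then attend only to the
# selected key blocks. block_size / top_k_blocks and the mean-pooled estimator
# are illustrative assumptions, not MInference's actual kernels.
import torch
import torch.nn.functional as F

def dynamic_block_sparse_attention(q, k, v, block_size=64, top_k_blocks=4):
    """q, k, v: [seq_len, head_dim]; seq_len assumed divisible by block_size."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    scale = dim ** -0.5

    # 1) Cheap estimate: mean-pool queries and keys per block, score block pairs.
    q_blk = q.view(n_blocks, block_size, dim).mean(dim=1)
    k_blk = k.view(n_blocks, block_size, dim).mean(dim=1)
    blk_scores = q_blk @ k_blk.T                                  # [n_blocks, n_blocks]

    # 2) Dynamic sparse index: each query block keeps only its top-k key blocks.
    keep = blk_scores.topk(min(top_k_blocks, n_blocks), dim=-1).indices

    # 3) Attend only within the selected blocks.
    out = torch.zeros_like(q)
    for qi in range(n_blocks):
        rows = slice(qi * block_size, (qi + 1) * block_size)
        cols = torch.cat([torch.arange(b * block_size, (b + 1) * block_size)
                          for b in keep[qi].tolist()])
        attn = F.softmax(q[rows] @ k[cols].T * scale, dim=-1)
        out[rows] = attn @ v[cols]
    return out
```

With q = k = v = torch.randn(4096, 64), each query block touches only 4 of 64 key blocks, which is the kind of saving that makes pre-filling cheaper in approaches of this family.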
PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization
Published: 10/26/2025
Tags: DRAM-PIM Deep Learning Acceleration, LLM Inference Acceleration, Mixed-Precision Quantization Algorithm, High Bandwidth Memory Optimization, PIM-Based Computation Scheduling
PLAIN is a software/hardware co-design framework for accelerating large language model inference through mixed-precision quantization. It optimizes parameter quantization and leverages PIM characteristics, achieving up to 5.03x and 1.69x performance improvements with negligible accuracy loss.
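As a rough sketch of the mixed-precision idea (independent of PLAIN's PIM mapping and scheduling), the snippet below assigns more bits to the most sensitive weight channels and simulates the resulting quantization error. The sensitivity proxy and the 8-bit/4-bit split are assumptions for illustration only, not PLAIN's actual bit-allocation algorithm.

```python
# Minimal sketch of per-channel mixed-precision weight quantization. The
# sensitivity proxy (per-channel dynamic range) and the 8-bit/4-bit split are
# illustrative assumptions; PLAIN's actual bit allocation and PIM mapping differ.
import numpy as np

def quantize_mixed_precision(w, high_bits=8, low_bits=4, high_frac=0.1):
    """w: [out_channels, in_features]. Returns simulated dequantized weights and a bit map."""
    # Sensitivity proxy: channels with the widest dynamic range keep more bits.
    sensitivity = np.abs(w).max(axis=1)
    n_high = max(1, int(high_frac * w.shape[0]))
    high_idx = np.argsort(-sensitivity)[:n_high]

    bits = np.full(w.shape[0], low_bits)
    bits[high_idx] = high_bits

    w_hat = np.empty_like(w, dtype=np.float32)
    for c in range(w.shape[0]):
        qmax = 2 ** (int(bits[c]) - 1) - 1                  # symmetric signed range
        max_abs = float(np.abs(w[c]).max())
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = np.clip(np.round(w[c] / scale), -qmax, qmax)    # project onto integer grid
        w_hat[c] = q * scale                                 # dequantize to simulate error
    return w_hat, bits
```

Comparing np.abs(w - w_hat).mean() against a uniform 4-bit baseline shows why spending a few extra bits on sensitive channels recovers most of the accuracy at a modest bandwidth cost, which is broadly the trade-off mixed-precision schemes of this kind target.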
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Published: 6/17/2024
Tags: Adaptive Structured Sparse Attention, LLM Inference Acceleration, Long-Context Modeling, Near-Lossless Sparse Attention
SampleAttention is introduced as an adaptive, near-lossless sparse attention method for long-context LLMs, significantly reducing Time-to-First-Token (TTFT) latency while maintaining accuracy and achieving up to a 2.42x TTFT reduction compared to FlashAttention.
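The sketch below illustrates the sampling-based flavor of adaptive sparse attention: a small random sample of queries scores the keys, and full attention is then restricted to the key set covering most of the sampled attention mass. The sample size and coverage threshold are illustrative assumptions; causal masking and SampleAttention's structured (e.g. local-window) components are omitted for brevity.

```python
# Minimal sketch of sampling-based adaptive sparse attention: a random query
# sample scores the keys, then all queries attend only to the key set that
# covers `coverage` of the sampled attention mass. n_sample / coverage are
# illustrative assumptions; causal masking and structured patterns are omitted.
import torch
import torch.nn.functional as F

def sampled_sparse_attention(q, k, v, n_sample=64, coverage=0.95):
    """q, k, v: [seq_len, head_dim] (non-causal for simplicity)."""
    seq_len, dim = q.shape
    scale = dim ** -0.5

    # 1) Cheap key-importance estimate from a small sample of queries.
    idx = torch.randperm(seq_len)[:n_sample]
    probs = F.softmax(q[idx] @ k.T * scale, dim=-1)      # [n_sample, seq_len]
    key_score = probs.mean(dim=0)                        # attention mass per key

    # 2) Smallest key set covering the requested fraction of that mass.
    order = key_score.argsort(descending=True)
    n_keep = int((key_score[order].cumsum(dim=0) < coverage).sum().item()) + 1
    keep = order[:n_keep]

    # 3) Every query attends only to the kept keys.
    attn = F.softmax(q @ k[keep].T * scale, dim=-1)      # [seq_len, n_keep]
    return attn @ v[keep]
```

When attention mass is concentrated on a small set of keys, as it often is in long contexts, n_keep stays far below seq_len, which is where the TTFT savings in this style of method come from.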