LLM Inference Acceleration
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Published: 7/3/2024
Tags: Long-Context Modeling, Sparse Attention Mechanism, LLM Inference Acceleration, GPU Optimization for Computation, Dynamic Sparse Computation
The paper presents MInference, a dynamic sparse computation method that accelerates the pre-filling stage of long-context LLMs. By identifying three distinct patterns in attention matrices and dynamically building sparse indices, MInference significantly reduces latency while maintaining accuracy.
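As a rough illustration of the dynamic-sparse-index idea described above, the sketch below builds a block-level sparse index on the fly and attends only within the selected key blocks. The block size, the top-k selection, and the mean-pooled estimator are illustrative assumptions, not MInference's actual attention patterns or kernels.

```python
# Minimal sketch of dynamic block-sparse attention: estimate block-pair scores
# cheaply, build a sparse index per query block, then attend only to the
# selected key blocks. block_size / top_k_blocks and the mean-pooled estimator
# are illustrative assumptions, not MInference's actual kernels.
import torch
import torch.nn.functional as F

def dynamic_block_sparse_attention(q, k, v, block_size=64, top_k_blocks=4):
    """q, k, v: [seq_len, head_dim]; seq_len assumed divisible by block_size."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    scale = dim ** -0.5

    # 1) Cheap estimate: mean-pool queries and keys per block, score block pairs.
    q_blk = q.view(n_blocks, block_size, dim).mean(dim=1)
    k_blk = k.view(n_blocks, block_size, dim).mean(dim=1)
    blk_scores = q_blk @ k_blk.T                                  # [n_blocks, n_blocks]

    # 2) Dynamic sparse index: each query block keeps only its top-k key blocks.
    keep = blk_scores.topk(min(top_k_blocks, n_blocks), dim=-1).indices

    # 3) Attend only within the selected blocks.
    out = torch.zeros_like(q)
    for qi in range(n_blocks):
        rows = slice(qi * block_size, (qi + 1) * block_size)
        cols = torch.cat([torch.arange(b * block_size, (b + 1) * block_size)
                          for b in keep[qi].tolist()])
        attn = F.softmax(q[rows] @ k[cols].T * scale, dim=-1)
        out[rows] = attn @ v[cols]
    return out
```

With q = k = v = torch.randn(4096, 64), each query block touches only 4 of 64 key blocks, which is the kind of saving that makes pre-filling cheaper in approaches of this family.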
PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization
Published: 10/26/2025
Tags: DRAM-PIM Deep Learning Acceleration, LLM Inference Acceleration, Mixed-Precision Quantization Algorithm, High Bandwidth Memory Optimization, PIM-Based Computation Scheduling
PLAIN is a software/hardware co-design framework for accelerating large language model inference through mixed-precision quantization. It optimizes parameter quantization and leverages PIM characteristics, achieving up to 5.03x and 1.69x performance improvements with negligible accuracy loss.
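As a rough sketch of the mixed-precision idea (independent of PLAIN's PIM mapping and scheduling), the snippet below assigns more bits to the most sensitive weight channels and simulates the resulting quantization error. The sensitivity proxy and the 8-bit/4-bit split are assumptions for illustration only, not PLAIN's actual bit-allocation algorithm.

```python
# Minimal sketch of per-channel mixed-precision weight quantization. The
# sensitivity proxy (per-channel dynamic range) and the 8-bit/4-bit split are
# illustrative assumptions; PLAIN's actual bit allocation and PIM mapping differ.
import numpy as np

def quantize_mixed_precision(w, high_bits=8, low_bits=4, high_frac=0.1):
    """w: [out_channels, in_features]. Returns simulated dequantized weights and a bit map."""
    # Sensitivity proxy: channels with the widest dynamic range keep more bits.
    sensitivity = np.abs(w).max(axis=1)
    n_high = max(1, int(high_frac * w.shape[0]))
    high_idx = np.argsort(-sensitivity)[:n_high]

    bits = np.full(w.shape[0], low_bits)
    bits[high_idx] = high_bits

    w_hat = np.empty_like(w, dtype=np.float32)
    for c in range(w.shape[0]):
        qmax = 2 ** (int(bits[c]) - 1) - 1                  # symmetric signed range
        max_abs = float(np.abs(w[c]).max())
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = np.clip(np.round(w[c] / scale), -qmax, qmax)    # project onto integer grid
        w_hat[c] = q * scale                                 # dequantize to simulate error
    return w_hat, bits
```

Comparing np.abs(w - w_hat).mean() against a uniform 4-bit baseline shows why spending a few extra bits on sensitive channels recovers most of the accuracy at a modest bandwidth cost, which is broadly the trade-off mixed-precision schemes of this kind target.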
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Published: 6/17/2024
Tags: Adaptive Structured Sparse Attention, LLM Inference Acceleration, Long-Context Modeling, Near-Lossless Sparse Attention
SampleAttention is introduced as an adaptive, near-lossless sparse attention method for long-context LLMs, significantly reducing Time-to-First-Token (TTFT) latency while maintaining accuracy and achieving up to a 2.42x TTFT reduction compared to FlashAttention.
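The sketch below illustrates the sampling-based flavor of adaptive sparse attention: a small random sample of queries scores the keys, and full attention is then restricted to the key set covering most of the sampled attention mass. The sample size and coverage threshold are illustrative assumptions; causal masking and SampleAttention's structured (e.g. local-window) components are omitted for brevity.

```python
# Minimal sketch of sampling-based adaptive sparse attention: a random query
# sample scores the keys, then all queries attend only to the key set that
# covers `coverage` of the sampled attention mass. n_sample / coverage are
# illustrative assumptions; causal masking and structured patterns are omitted.
import torch
import torch.nn.functional as F

def sampled_sparse_attention(q, k, v, n_sample=64, coverage=0.95):
    """q, k, v: [seq_len, head_dim] (non-causal for simplicity)."""
    seq_len, dim = q.shape
    scale = dim ** -0.5

    # 1) Cheap key-importance estimate from a small sample of queries.
    idx = torch.randperm(seq_len)[:n_sample]
    probs = F.softmax(q[idx] @ k.T * scale, dim=-1)      # [n_sample, seq_len]
    key_score = probs.mean(dim=0)                        # attention mass per key

    # 2) Smallest key set covering the requested fraction of that mass.
    order = key_score.argsort(descending=True)
    n_keep = int((key_score[order].cumsum(dim=0) < coverage).sum().item()) + 1
    keep = order[:n_keep]

    # 3) Every query attends only to the kept keys.
    attn = F.softmax(q @ k[keep].T * scale, dim=-1)      # [seq_len, n_keep]
    return attn @ v[keep]
```

When attention mass is concentrated on a small set of keys, as it often is in long contexts, n_keep stays far below seq_len, which is where the TTFT savings in this style of method come from.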