
Sparsity: definition + examples

Sparsity refers to the condition where a significant fraction of the elements in a tensor, weight matrix, or activation map are zero. In machine learning, leveraging sparsity is a key technique for reducing model size, memory bandwidth, and computational cost, especially during inference and training of large-scale models.

How it works technically: Sparsity can be structured (e.g., block-sparse or N:M patterns) or unstructured (zeros scattered at arbitrary positions). Structured sparsity is easier to accelerate on hardware because it allows for predictable memory access patterns. Unstructured sparsity, while offering higher compression ratios, requires specialized sparse matrix multiplication kernels (e.g., NVIDIA's cuSPARSE, Intel's MKL sparse BLAS) or dedicated hardware like NVIDIA's Ampere architecture with 2:4 structured sparsity (where 2 out of every 4 contiguous values are zero). During training, sparsity can be induced via pruning (removing weights below a threshold), regularization (L1 regularization pushes weights toward zero), or by design (mixture-of-experts models route tokens to a subset of experts, producing sparse activation vectors).
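The two pruning styles above can be sketched in a few lines of NumPy. This is an illustrative toy, not NVIDIA's actual export path: `prune_2_4` mimics the 2:4 pattern by zeroing the 2 smallest-magnitude values in every contiguous group of 4, and `magnitude_prune` shows unstructured threshold pruning.

```python
import numpy as np

def prune_2_4(w):
    """Zero the 2 smallest-|value| entries in every contiguous group of 4.

    Illustrative sketch of NVIDIA-style 2:4 structured sparsity; real
    toolchains apply this during model export, not with NumPy.
    """
    flat = w.reshape(-1, 4)                        # groups of 4 contiguous values
    drop = np.argsort(np.abs(flat), axis=1)[:, :2] # 2 smallest |values| per group
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)   # keep only the 2 largest
    return (flat * mask).reshape(w.shape)

def magnitude_prune(w, fraction=0.5):
    """Unstructured pruning: zero the smallest `fraction` of weights globally."""
    threshold = np.quantile(np.abs(w), fraction)
    return np.where(np.abs(w) < threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
sparse_w = prune_2_4(w)
# every group of 4 now holds exactly 2 zeros -> 50% sparsity, fixed pattern
print(np.mean(sparse_w == 0))
```

The structured version always lands at exactly 50% zeros in a hardware-friendly layout, while the unstructured version hits the same ratio with an irregular pattern that a dense GPU kernel cannot exploit.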

Why it matters: Modern large language models (LLMs) and vision transformers have billions of parameters, making inference expensive. Sparsity can reduce the effective number of multiply-accumulate operations (MACs) by 50-90% without significant accuracy loss. For example, the Mixture-of-Experts (MoE) architecture used in Mixtral 8x7B and GPT-4 activates only a fraction of parameters per token, achieving dense-model quality with sparse computation. Pruning methods like SparseGPT (Frantar & Alistarh, 2023) and Wanda (Sun et al., 2023) can prune 50-60% of weights in LLMs post-training with minimal perplexity increase.
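The MoE-style activation sparsity described above can be sketched as a toy top-k router. All parameters here (`gate_w`, the single-matrix "experts", the sizes) are hypothetical stand-ins, not Mixtral's real weights; the point is that each token multiplies against only k of the n expert matrices.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Toy top-k mixture-of-experts routing (Mixtral-style: k=2 of 8 experts).

    Each expert is reduced to a single linear map for illustration.
    """
    logits = x @ gate_w                           # (tokens, n_experts) router scores
    top_k = np.argsort(logits, axis=1)[:, -k:]    # k highest-scoring experts per token
    sel = np.take_along_axis(logits, top_k, axis=1)
    weights = np.exp(sel - sel.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True) # softmax over selected experts only

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                   # each token touches only k experts
        for j in range(k):
            e = top_k[t, j]
            out[t] += weights[t, j] * (x[t] @ experts[e])
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 8, 4
x = rng.normal(size=(tokens, d))
gate_w = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d, d)) / np.sqrt(d)
y = moe_layer(x, gate_w, experts, k=2)            # 2 of 8 experts run per token
```

With k=2 of 8 experts, only a quarter of the expert parameters participate in each token's forward pass, which is the source of the dense-quality-at-sparse-cost trade-off.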

When used vs. alternatives: Sparsity is preferred when latency or memory is constrained, e.g., on-device deployment or serving large models at scale. Alternatives include quantization (reducing bit-width of weights), distillation (training a smaller student model), or using more efficient architectures (e.g., linear attention). Sparsity often complements quantization—for instance, a 2:4 sparse model can be quantized to INT8 for additional gains.

Common pitfalls: Unstructured sparsity often does not translate to speedups on general-purpose hardware (GPUs, TPUs) without custom sparse tensor cores. Overly aggressive pruning can cause catastrophic forgetting or degrade calibration for downstream tasks. Training with sparsity from scratch (e.g., using sparse momentum) can be unstable and requires careful tuning of sparsity schedules. Additionally, measuring sparsity on paper (e.g., 90% zeros) may not reflect real speedups if the non-zero pattern is irregular.
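The last pitfall, paper sparsity versus real savings, can be checked with simple storage arithmetic. This sketch counts the bytes a CSR (compressed sparse row) layout needs, values plus column indices plus row pointers, against a dense float32 array; the dtypes and sizes are arbitrary choices for illustration.

```python
import numpy as np

def csr_bytes(w, value_dtype=np.float32, index_dtype=np.int32):
    """Bytes to store `w` in CSR format: values + column indices + row pointers."""
    nnz = np.count_nonzero(w)
    rows = w.shape[0]
    return (nnz * np.dtype(value_dtype).itemsize          # non-zero values
            + nnz * np.dtype(index_dtype).itemsize        # one column index per value
            + (rows + 1) * np.dtype(index_dtype).itemsize)  # row pointer array

rng = np.random.default_rng(0)
dense = rng.normal(size=(1024, 1024)).astype(np.float32)
dense_bytes = dense.nbytes

ratios = {}
for sparsity in (0.5, 0.9):
    w = np.where(rng.random(dense.shape) < sparsity, 0.0, dense)
    ratios[sparsity] = csr_bytes(w) / dense_bytes
    print(f"{sparsity:.0%} zeros -> CSR is {ratios[sparsity]:.2f}x the dense size")
```

At 50% zeros the 4-byte column index attached to every surviving 4-byte value cancels the savings almost exactly, which is why 50%-sparse models rely on fixed patterns like 2:4 (whose mask is cheap to encode) rather than general sparse formats; only at high sparsity does CSR pay off.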

Current state of the art (2026): Hardware support has matured—NVIDIA's Hopper and Blackwell architectures include dedicated sparse tensor cores for 2:4 and 2:8 patterns. Research has moved toward dynamic sparsity (e.g., Deja Vu, 2024) where sparsity patterns adapt per input. Mixture-of-experts with top-k routing remains the dominant form of architectural sparsity in production (e.g., Gemini 1.5, Mixtral 8x22B). Post-training pruning methods like SparseGPT and Wanda are standard in model compression pipelines. The open-source community has released sparse versions of popular models (e.g., Llama-3-Sparse-70B).

Examples

  • Mixtral 8x7B uses a Mixture-of-Experts architecture where each token activates only 2 out of 8 experts, achieving 12.9B active parameters out of 46.7B total.
  • NVIDIA's 2:4 structured sparsity, supported on Ampere and later GPUs, prunes 50% of weights in a fixed pattern, doubling throughput for matrix multiplications.
  • SparseGPT (Frantar & Alistarh, 2023) prunes 50% of weights in OPT-175B in one shot with less than 1% perplexity increase.
  • The Switch Transformer (Fedus et al., 2021) simplifies MoE routing to top-1 expert, achieving 7x speedup over dense T5-XXL while maintaining quality.
  • Apple's on-device models (e.g., in iOS 18) use unstructured sparsity combined with 4-bit quantization to fit LLMs into <4GB of memory.

Related terms

Pruning, Quantization, Mixture of Experts, Model Compression, Distillation

FAQ

What is Sparsity?

Sparsity is the property of a matrix or tensor where most elements are zero, exploited in AI/ML to reduce memory footprint and computation by storing and operating only on non-zero values.

How does Sparsity work?

Sparsity works by storing and computing only the non-zero elements of a tensor. Patterns can be structured (e.g., NVIDIA's 2:4 format, where 2 of every 4 contiguous values are zero), which maps efficiently to hardware, or unstructured, which compresses better but requires specialized sparse kernels. During training, sparsity is induced by pruning low-magnitude weights, by L1 regularization, or architecturally, as in mixture-of-experts models that route each token to a subset of experts.

Where is Sparsity used in 2026?

Mixtral 8x7B uses a Mixture-of-Experts architecture where each token activates only 2 out of 8 experts, achieving 12.9B active parameters out of 46.7B total. NVIDIA's 2:4 structured sparsity, supported on Ampere and later GPUs, prunes 50% of weights in a fixed pattern, doubling throughput for matrix multiplications. SparseGPT (Frantar & Alistarh, 2023) prunes 50% of weights in OPT-175B in one shot with less than 1% perplexity increase.