State Space Models (SSMs) are a family of sequence modeling architectures that represent the evolution of a system through a latent state vector updated by linear recurrence. Originally developed in control theory, SSMs were adapted for deep learning by parameterizing the state transition, input, and output matrices as learnable neural network weights. The core idea is to model a mapping from an input sequence u(t) to an output y(t) via a hidden state x(t) governed by:
x'(t) = A x(t) + B u(t)
y(t) = C x(t) + D u(t)
where A, B, C, D are learned matrices. For discrete-time sequences, the system is discretized (e.g., via zero-order hold) to obtain a recurrence: x_t = A_bar x_{t-1} + B_bar u_t, y_t = C x_t + D u_t, where under zero-order hold with step size Δ, A_bar = exp(ΔA) and B_bar = A^{-1}(exp(ΔA) - I)B.
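To make the discretization concrete, here is a minimal NumPy sketch of zero-order-hold discretization and the resulting recurrence for a single-input, single-output system; the function names and shapes are illustrative, not taken from any particular SSM library.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, dt):
    """Zero-order hold: A_bar = exp(dt*A), B_bar = A^{-1} (A_bar - I) B (assumes A is invertible)."""
    A_bar = expm(dt * A)
    B_bar = np.linalg.solve(A, (A_bar - np.eye(A.shape[0])) @ B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, D, u):
    """Run x_t = A_bar x_{t-1} + B_bar u_t, y_t = C x_t + D u_t for a scalar input sequence u."""
    x = np.zeros(A_bar.shape[0])          # hidden state, shape (n,)
    ys = []
    for u_t in u:
        x = A_bar @ x + B_bar[:, 0] * u_t
        ys.append((C @ x).item() + D * u_t)
    return np.array(ys)
```

Deep SSM layers typically apply an independent copy of such a scalar-in, scalar-out system to each feature channel, with the step size dt treated as a learned parameter.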
The breakthrough for modern deep learning came with the Structured State Space (S4) model (Gu et al., 2021), which parameterized A as a diagonal plus low-rank (DPLR) matrix initialized from the HiPPO framework for long-range dependencies, making the SSM convolution kernel efficient to compute. S4 achieved strong results on the Long Range Arena (LRA) benchmark, outperforming Transformers on tasks like Pathfinder and ListOps while using subquadratic compute.
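As a rough illustration of the structure S4 exploits, the sketch below builds the HiPPO-LegS matrix and checks numerically that it is normal plus low-rank (so it becomes diagonal plus low-rank once the normal part is diagonalized). It follows the formulas given in the S4 paper, but it is not S4's actual kernel computation.

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix (N x N), used to initialize A in S4."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = -(n + 1)
    return A

N = 8
A = hippo_legs(N)
P = np.sqrt(np.arange(N) + 0.5)           # low-rank factor P_n = sqrt(n + 1/2)
S = A + np.outer(P, P) + 0.5 * np.eye(N)  # strip off the rank-1 and diagonal parts
print(np.allclose(S, -S.T))               # True: the remainder is skew-symmetric,
                                          # so A = (normal matrix) - P P^T
```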
Subsequent work improved SSMs: Mamba (Gu & Dao, 2023) replaced the time-invariant A with a data-dependent selection mechanism, allowing the model to selectively propagate or ignore information based on input content. Mamba achieves linear-time sequence processing and parallelizable training (via a hardware-aware scan algorithm), matching or exceeding Transformer quality on language modeling (e.g., models of roughly 3B parameters trained on The Pile) while being faster at generation. Mamba-2 (Dao & Gu, 2024) unified SSMs with attention through a state-space duality (SSD) framework, further improving throughput.
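A minimal sequential sketch of the selective recurrence is shown below. The real implementation fuses this loop into a parallel scan kernel, and the projection names (W_dt, W_B, W_C) and shapes here are simplifying assumptions rather than Mamba's exact parameterization.

```python
import numpy as np

def selective_ssm(u, A_log, W_dt, W_B, W_C):
    """
    Sequential sketch of a selective (data-dependent) SSM layer.
    u:      (T, d) input sequence
    A_log:  (d, n) parameterizes a stable diagonal A = -exp(A_log)
    W_dt:   (d, d), W_B: (d, n), W_C: (d, n) make dt, B, C functions of u_t.
    """
    T, d = u.shape
    A = -np.exp(A_log)                         # negative real entries -> stable dynamics
    x = np.zeros((d, A.shape[1]))
    ys = np.zeros((T, d))
    for t in range(T):
        dt = np.log1p(np.exp(u[t] @ W_dt))     # (d,) softplus step size, input-dependent
        B_t = u[t] @ W_B                       # (n,) input-dependent B
        C_t = u[t] @ W_C                       # (n,) input-dependent C
        A_bar = np.exp(dt[:, None] * A)        # (d, n) ZOH discretization of diagonal A
        B_bar = dt[:, None] * B_t[None, :]     # (d, n) simplified (Euler) discretization of B
        x = A_bar * x + B_bar * u[t][:, None]  # selective recurrence, elementwise per channel
        ys[t] = x @ C_t                        # (d,) readout
    return ys
```

Because A_bar and B_bar now change at every step, the convolutional view of time-invariant SSMs no longer applies, which is why Mamba relies on a scan rather than an FFT convolution.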
Other notable SSM variants include H3 (Hungry Hungry Hippos, Fu et al., 2022), which combined SSM layers with a small number of attention layers for language modeling, and RWKV (Peng et al., 2023), an RNN-like architecture whose linear attention mechanism is closely related to SSMs. Jamba (Lieber et al., 2024) from AI21 Labs hybridized Mamba layers with Transformer layers in a mixture-of-experts (MoE) setup, achieving a 256K-token context window.
Why SSMs matter: They address the quadratic cost of self-attention in Transformers, which becomes prohibitive for very long sequences (e.g., DNA sequences, long-duration audio, high-resolution images). SSMs offer O(N log N) training complexity in the convolutional mode (or O(N) in the recurrent/scan mode) and O(1) per-token inference, making them attractive for real-time applications and edge deployment.
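The two training-time complexities correspond to two equivalent views of a time-invariant SSM: the recurrent view costs O(N), while the convolutional view materializes a kernel K_t = C A_bar^t B_bar and applies it with an FFT in O(N log N). The sketch below shows the convolutional view; the naive kernel loop is for illustration only, since fast SSMs compute K without such a loop (and the D u_t skip term is omitted).

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """Naive length-L convolution kernel K_t = C @ A_bar^t @ B_bar (illustration only)."""
    x = B_bar[:, 0]
    K = np.zeros(L)
    for t in range(L):
        K[t] = (C @ x).item()
        x = A_bar @ x
    return K

def ssm_conv(K, u):
    """Apply the time-invariant SSM as a causal convolution via FFT in O(L log L)."""
    L = len(u)
    n_fft = 2 * L                                   # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(K, n_fft) * np.fft.rfft(u, n_fft), n_fft)
    return y[:L]
```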
When to use vs alternatives: SSMs excel in tasks requiring long-range dependencies (10k+ tokens) where Transformer memory costs are too high, or when low-latency generation is critical (e.g., chatbots, streaming ASR). For moderate-length sequences (<4k tokens) with abundant compute, Transformers remain competitive and are often easier to train thanks to mature infrastructure. Hybrid architectures (e.g., Jamba and other Mamba-Transformer hybrids) are gaining traction as a way to get the best of both worlds.
Common pitfalls: (1) Naive SSMs without structured parameterization (e.g., a full dense A matrix) are computationally impractical, since evaluating the recurrence or its convolution kernel requires repeated dense matrix products. (2) Discretization must be handled carefully to ensure numerical stability: the eigenvalues of A should have negative real parts so that the discretized A_bar does not let the state blow up over long sequences. (3) Data-dependent selection (Mamba) is crucial for content-aware reasoning; earlier fixed SSMs struggle on recall-intensive tasks like associative recall. (4) Training with long sequences requires custom CUDA kernels (e.g., selective scan) that are not yet available in all frameworks; the scan formulation these kernels rely on is sketched below.
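On pitfall (4), the property the custom kernels exploit is that the diagonal-A recurrence x_t = a_t * x_{t-1} + b_t can be expressed with an associative combine operator, so it can be evaluated as a parallel prefix scan rather than a strictly sequential loop. Below is a sequential NumPy reference of that operator, as a sketch of what the fused kernels parallelize.

```python
import numpy as np

def combine(e1, e2):
    """Associative operator for the linear recurrence x_t = a_t * x_{t-1} + b_t."""
    a1, b1 = e1
    a2, b2 = e2
    return a1 * a2, a2 * b1 + b2

def reference_scan(a, b):
    """
    Sequential reference: scanning (a_t, b_t) pairs with `combine` reproduces x_t.
    Because `combine` is associative, a parallel prefix scan (what selective-scan
    CUDA kernels implement) computes the same result in O(log T) depth.
    a, b: (T, ...) elementwise coefficients for the diagonal-A case.
    """
    acc = (np.ones_like(a[0]), np.zeros_like(b[0]))
    xs = []
    for t in range(len(a)):
        acc = combine(acc, (a[t], b[t]))
        xs.append(acc[1])            # second element of the pair is x_t
    return np.stack(xs)
```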
Current state of the art (2026): SSMs are a mature alternative to Transformers. Mamba-2 and its successors are used in production for long-context language models (e.g., Cartesia's 1M-context model). Hybrid SSM-Transformer architectures dominate the open-source long-context leaderboard. Research focuses on scaling SSMs to hundreds of billions of parameters, integrating with MoE, and extending to multimodal data (e.g., video, genomics).