Hyena is a deep learning architecture introduced in the 2023 paper "Hyena Hierarchy: Towards Larger Convolutional Language Models" by Poli et al. (Stanford University and collaborators). It was designed to address the fundamental computational bottleneck of the Transformer's self-attention mechanism, which scales quadratically, O(N²), with sequence length N, making long-context modeling expensive. Hyena replaces self-attention with a combination of two key operations: (1) implicit long convolutions, whose filters are not stored explicitly but generated by a small learned function of position (an idea inspired by state-space models such as S4), and (2) element-wise gating (multiplicative interactions); the two are alternated in a hierarchical, depth-dependent pattern (see the sketch below). The core building block, the Hyena operator, convolves the input with a filter produced on the fly by a small neural network (e.g., an MLP), enabling subquadratic O(N log N) time complexity and avoiding the quadratic memory cost of materializing an attention matrix. The architecture stacks multiple such operators with normalization and residual connections, optionally interleaving standard MLPs.
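As a concrete (and heavily simplified) illustration of how convolution and gating alternate, the NumPy sketch below implements a single-channel, order-2 version of the recurrence; the variable names and the random stand-ins for the learned projections (v, x1, x2) and filters (h1, h2) are illustrative only, not the reference implementation.

    import numpy as np

    def causal_fft_conv(signal, filt):
        # causal 1-D convolution via zero-padded FFT: O(N log N) instead of O(N^2)
        n = len(signal)
        size = 2 * n                          # pad so the FFT product equals linear convolution
        out = np.fft.irfft(np.fft.rfft(signal, size) * np.fft.rfft(filt, size), size)
        return out[:n]                        # keep only the first N (causal) outputs

    def hyena_order2(v, x1, x2, h1, h2):
        # order-2 Hyena-style recurrence: long convolution followed by element-wise gating, twice
        z = causal_fft_conv(v, h1) * x1       # convolve the value branch, then gate
        z = causal_fft_conv(z, h2) * x2       # repeat at the next level of the hierarchy
        return z

    # toy usage: in the real model v, x1, x2 are learned projections of the input
    # and h1, h2 are implicitly parameterized filters (see the next paragraph)
    N = 1024
    rng = np.random.default_rng(0)
    v, x1, x2 = rng.standard_normal((3, N))
    h1, h2 = rng.standard_normal((2, N))
    y = hyena_order2(v, x1, x2, h1, h2)       # shape (N,)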
Technically, the Hyena operator works as follows: given an input sequence X of length N, it generates a long convolution filter implicitly, as a learned function of position (a small MLP maps positional encodings to filter values), and applies that filter to a projection of the input using a Fast Fourier Transform (FFT)-based convolution. Data control comes from gating: further learned projections of the input modulate the convolution output element-wise, and convolution and gating alternate once per order of the operator. This design avoids the pairwise dot products of attention, replacing them with global convolutions that capture long-range dependencies efficiently. The original paper reported that Hyena could match Transformer quality on standard language-modeling benchmarks (WikiText-103 and The Pile) with roughly 20% less training compute at 2k context, and that the operator outpaces highly optimized attention at long sequence lengths (about 2× faster at 8k tokens and far more at 64k). A follow-up, HyenaDNA (2023), applied the architecture to genomics, reaching state-of-the-art results on a majority of nucleotide-level benchmarks while handling contexts of up to 1 million tokens at single-nucleotide resolution, far beyond the reach of standard Transformers.
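The PyTorch sketch below puts these pieces together in a minimal, single-block form. It is a sketch under simplifying assumptions, not the reference implementation: the class and parameter names (HyenaOperatorSketch, filter_mlp, pos_dim) are made up for illustration, the operator is fixed at order 2, and the sequence length is assumed constant. The filter is produced by a small MLP over sinusoidal position features, the gates come from learned projections of the input, and the long convolution is applied with a zero-padded FFT.

    import torch
    import torch.nn as nn

    class HyenaOperatorSketch(nn.Module):
        # Minimal order-2, fixed-length sketch of a Hyena-style operator (illustrative only).
        def __init__(self, dim, seq_len, pos_dim=8):
            super().__init__()
            self.seq_len = seq_len
            # learned projections of the input: one value branch and two gate branches
            self.in_proj = nn.Linear(dim, 3 * dim)
            self.out_proj = nn.Linear(dim, dim)
            # implicit filter: small MLP mapping position features to filter values for both levels
            self.filter_mlp = nn.Sequential(
                nn.Linear(pos_dim, 32), nn.GELU(), nn.Linear(32, 2 * dim)
            )
            # simple sinusoidal position features, shape (seq_len, pos_dim)
            t = torch.linspace(0, 1, seq_len).unsqueeze(-1)
            freqs = 2 ** torch.arange(pos_dim // 2)
            self.register_buffer(
                "pos", torch.cat([torch.sin(t * freqs), torch.cos(t * freqs)], dim=-1)
            )

        def fft_conv(self, u, h):
            # causal convolution of u (B, L, D) with filter h (L, D) via zero-padded FFT
            L = u.shape[1]
            u_f = torch.fft.rfft(u, n=2 * L, dim=1)
            h_f = torch.fft.rfft(h, n=2 * L, dim=0)
            return torch.fft.irfft(u_f * h_f, n=2 * L, dim=1)[:, :L]

        def forward(self, x):                                       # x: (B, L, D)
            assert x.shape[1] == self.seq_len, "sketch assumes a fixed sequence length"
            v, g1, g2 = self.in_proj(x).chunk(3, dim=-1)            # value and gate projections
            h1, h2 = self.filter_mlp(self.pos).chunk(2, dim=-1)     # implicit filters, each (L, D)
            z = g1 * self.fft_conv(v, h1)                           # convolve, then gate
            z = g2 * self.fft_conv(z, h2)                           # second level of the hierarchy
            return self.out_proj(z)

    # toy usage
    layer = HyenaOperatorSketch(dim=64, seq_len=1024)
    y = layer(torch.randn(2, 1024, 64))                             # -> (2, 1024, 64)

A full implementation adds details such as a decay window on the implicit filters and short explicit convolutions on the projections, but the overall data flow is as above.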
Why it matters: Hyena is part of a broader movement toward subquadratic architectures (alongside Mamba, RWKV, and linear attention) that aim to make long-context AI broadly affordable. It offers a principled alternative to attention for tasks where sequence length is a primary bottleneck, such as whole-genome analysis, long-document understanding, code generation, and high-resolution audio or video. Its key advantage is efficiency at scale: the Transformer's O(N²) attention matrix becomes prohibitive beyond roughly 100k tokens, whereas Hyena's memory grows linearly in N and its compute as O(N log N), allowing sequences of millions of tokens on a single GPU (see the estimate below). However, Hyena is not a universal replacement; it may underperform on tasks requiring precise token-to-token retrieval (e.g., copying, certain reasoning benchmarks) where attention's explicit pairwise comparisons excel.
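To make the scale of that gap concrete, the back-of-the-envelope estimate below compares the memory of one materialized attention score matrix with the buffers needed for one FFT convolution at N = 100k tokens (illustrative numbers; optimized attention kernels avoid materializing the full matrix, but the asymptotic gap remains).

    # rough memory comparison at N = 100,000 tokens, fp16, single head / single channel
    N = 100_000
    bytes_fp16 = 2

    attn_matrix = N * N * bytes_fp16     # one N x N attention score matrix
    fft_buffer = 2 * N * bytes_fp16      # zero-padded length-2N signal for the FFT convolution

    print(f"attention matrix: {attn_matrix / 1e9:.1f} GB")   # ~20.0 GB
    print(f"FFT conv buffer:  {fft_buffer / 1e6:.1f} MB")    # ~0.4 MB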
When it is used vs. alternatives: Hyena is most competitive in regimes where sequence length dominates compute (e.g., >8k tokens) and where global context matters more than exact token-to-token interactions. For short sequences (<2k tokens), Transformers remain simpler and often better. Compared to Mamba (a state-space model with input-dependent selection), Hyena uses long convolutions whose filters depend only on position, with data dependence supplied by gating; this can make training simpler and more stable, but it is less expressive for selective state updates. Compared to linear attention (e.g., Performer), Hyena does not approximate attention but replaces it entirely, avoiding kernel-approximation and variance issues.
Common pitfalls: (1) Assuming Hyena matches Transformers on all tasks; it does not, particularly on retrieval- or copying-style tasks that need precise token-to-token comparisons. (2) Overlooking FFT overhead for very short sequences; constant factors can make the O(N log N) path slower than O(N²) attention at small N. (3) Ignoring the need for careful filter initialization and normalization to avoid training instability. (4) Treating it as a drop-in replacement without re-tuning learning rates or depth.
Current state of the art (2026): Hyena has been integrated into commercial and research systems, notably the StripedHyena models (Together AI, 2023), which combine Hyena layers with attention in a hybrid architecture and report competitive performance on coding and math benchmarks (e.g., HumanEval, GSM8K). HyenaDNA remains a standard baseline in genomics. The broader trend has shifted toward hybrid models (e.g., Jamba) that blend subquadratic and attention layers, as pure Hyena models have not dethroned Transformers in general-purpose language modeling. Research continues on improving Hyena's expressivity via data-dependent filters and on hardware-efficient implementations (e.g., FlashFFTConv).