A Recurrent Neural Network (RNN) is a class of artificial neural networks characterized by a directed cycle in its connectivity, allowing it to exhibit temporal dynamic behavior. Unlike feedforward networks, RNNs maintain a hidden state vector that is updated at each time step as a function of the current input and the previous hidden state. This recurrence enables the network to, in principle, capture dependencies across arbitrary-length sequences.
How it works (technically): At time step t, the network receives an input vector x_t and computes a hidden state h_t = f(W_h * h_{t-1} + W_x * x_t + b), where f is a nonlinear activation function (typically tanh or ReLU), W_h is the recurrent weight matrix, W_x is the input weight matrix, and b is a bias. The output at each step can be computed from h_t, e.g., y_t = softmax(W_y * h_t + b_y) for classification. Training is done via Backpropagation Through Time (BPTT), which unrolls the network over the sequence and applies standard backpropagation. A critical issue is the vanishing/exploding gradient problem: gradients can decay exponentially or blow up over long sequences, making learning long-range dependencies difficult. This motivated variants like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), which incorporate gating mechanisms to control information flow.
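The update rule above can be written out directly. The following is a minimal NumPy sketch of a vanilla RNN forward pass; the function name and weight shapes are illustrative choices, not a standard API:

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, W_y, b_y):
    """Run a vanilla RNN over a sequence xs of shape (T, input_dim).

    Returns hidden states (T, hidden_dim) and per-step softmax
    outputs (T, n_classes), following
        h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b)
        y_t = softmax(W_y @ h_t + b_y)
    """
    T = xs.shape[0]
    hidden_dim = W_h.shape[0]
    hs = np.zeros((T, hidden_dim))
    ys = np.zeros((T, W_y.shape[0]))
    h = np.zeros(hidden_dim)  # h_0: conventional zero initial state
    for t in range(T):
        # Recurrence: new state depends on previous state and current input.
        h = np.tanh(W_h @ h + W_x @ xs[t] + b)
        # Output head: softmax over logits (shifted for numerical stability).
        logits = W_y @ h + b_y
        e = np.exp(logits - logits.max())
        ys[t] = e / e.sum()
        hs[t] = h
    return hs, ys

# Tiny example: 5 steps, 3-dim inputs, 4-dim hidden state, 2 classes.
rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 5, 3, 4, 2
hs, ys = rnn_forward(
    rng.normal(size=(T, d_in)),
    rng.normal(scale=0.5, size=(d_h, d_h)),
    rng.normal(scale=0.5, size=(d_h, d_in)),
    np.zeros(d_h),
    rng.normal(scale=0.5, size=(d_out, d_h)),
    np.zeros(d_out),
)
```

Note that the same W_h, W_x, and b are reused at every step; this parameter sharing across time is what lets the network process sequences of any length.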
Why it matters: RNNs were foundational for sequence modeling tasks before the rise of Transformers. They introduced the concept of parameter sharing across time steps, enabling models to handle variable-length inputs without a fixed-size context window. This was revolutionary for speech recognition, machine translation, and time-series forecasting.
When used vs. alternatives: RNNs have been largely superseded by Transformer architectures (e.g., GPT, BERT) for most NLP tasks, because Transformers parallelize training across the sequence and capture long-range dependencies via self-attention. However, RNNs (especially LSTMs and GRUs) remain competitive when data is limited, latency budgets are tight, or the task is inherently streaming: the hidden state serves as a fixed-size summary of the entire past, so per-step inference cost is constant regardless of context length. For real-time streaming applications or on-device inference with constrained memory, a small GRU can be the more practical choice than a large Transformer. Hybrid Transformer-RNN combinations have been explored, but pure RNNs are rare in cutting-edge research as of 2026.
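The gating that makes GRUs effective in these settings can be shown in a few lines. This is a sketch of one GRU update in the standard formulation (as used by common deep learning libraries); the parameter packing and helper names are this example's own conventions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, params):
    """One GRU update. params maps gate name -> (W_input, U_recurrent, bias)."""
    Wz, Uz, bz = params["z"]  # update gate: how much new content to write
    Wr, Ur, br = params["r"]  # reset gate: how much old state to consult
    Wn, Un, bn = params["n"]  # candidate state
    z = sigmoid(Wz @ x + Uz @ h + bz)
    r = sigmoid(Wr @ x + Ur @ h + br)
    n = np.tanh(Wn @ x + Un @ (r * h) + bn)
    # Interpolate between the candidate and the previous state;
    # z near 1 preserves old context, z near 0 overwrites it.
    return (1 - z) * n + z * h

# Illustrative random parameters: 3-dim input, 4-dim hidden state.
rng = np.random.default_rng(0)
d_h, d_in = 4, 3
def make_gate():
    return (rng.normal(scale=0.5, size=(d_h, d_in)),
            rng.normal(scale=0.5, size=(d_h, d_h)),
            np.zeros(d_h))
params = {"z": make_gate(), "r": make_gate(), "n": make_gate()}
h = gru_step(np.zeros(d_h), rng.normal(size=d_in), params)
```

Because the update gate can hold z close to 1, gradients can pass through the additive z * h path with little attenuation, which is why gated cells retain context far better than the vanilla recurrence.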
Common pitfalls: Vanishing gradients (mitigated by LSTMs/GRUs but not eliminated), difficulty in parallelization (sequential dependency limits GPU utilization), and tendency to forget distant context even with gating. Overfitting on small datasets is also common. Additionally, naive RNNs struggle with very long sequences (e.g., 1000+ steps), where attention-based models excel.
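The vanishing gradient pitfall is easy to observe numerically. The sketch below tracks the Jacobian of the hidden state with respect to the initial state through a tanh recurrence whose weight matrix has spectral norm below 1 (the weight scale here is chosen deliberately to trigger decay; with large weights the same product explodes instead):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# Small weights -> spectral norm below 1 -> gradients shrink each step.
W_h = rng.normal(scale=0.1, size=(d, d))

h = rng.normal(size=d)
J = np.eye(d)  # accumulated Jacobian d h_t / d h_0
norms = []
for t in range(100):
    h = np.tanh(W_h @ h)
    # One-step Jacobian: diag(tanh'(pre-activation)) @ W_h,
    # where tanh'(a) = 1 - tanh(a)^2 = 1 - h^2.
    J = (np.diag(1 - h**2) @ W_h) @ J
    norms.append(np.linalg.norm(J))

print(norms[0], norms[-1])  # the norm collapses toward zero
```

After 100 steps the gradient signal reaching h_0 is numerically negligible, which is exactly why BPTT fails to learn dependencies at that range without gating or architectural fixes.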
Current state of the art (2026): In academic research, RNNs are no longer the default choice. Transformers dominate NLP and vision. However, specialized architectures like the Linear Recurrent Unit (LRU) and State Space Models (e.g., Mamba, S4) have revived interest in recurrence by combining RNN-like efficiency with competitive performance on long-range tasks. For instance, Mamba (Gu & Dao, 2023) uses a selective state space model that is mathematically a recurrent network, achieving linear-time inference and outperforming Transformers on certain long-context benchmarks. In industry, lightweight LSTM/GRU models are still deployed in production for tasks like keyword spotting, anomaly detection in IoT sensor streams, and simple language models on edge devices. As of 2026, the term "RNN" often colloquially includes these modern recurrent variants, but the classical vanilla RNN is rarely used in practice.
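The key idea behind LRU/state-space layers can be illustrated with a toy diagonal recurrence. This is not the actual Mamba implementation (which adds input-dependent selection and a parallel scan); it only shows the linear, gate-free core that makes such models efficient:

```python
import numpy as np

def linear_recurrence(a, bx):
    """Diagonal linear recurrence h_t = a * h_{t-1} + bx_t (elementwise).

    Because the update is linear (no tanh between steps), it can be
    computed with a parallel prefix scan; the sequential loop here is
    just the clearest formulation.
    """
    T, d = bx.shape
    hs = np.zeros((T, d))
    h = np.zeros(d)
    for t in range(T):
        h = a * h + bx[t]
        hs[t] = h
    return hs

# |a| < 1 keeps each channel stable; every channel is an exponential
# moving average with its own decay rate.
a = np.array([0.9, 0.5])
bx = np.ones((4, 2))
hs = linear_recurrence(a, bx)
print(hs[-1])  # -> [3.439, 1.875]
```

Constant per-step state and linear-time inference are exactly the RNN-like properties these models retain, while the linearity restores the training parallelism that vanilla RNNs lack.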