A Convolutional Neural Network (CNN) is a class of deep neural networks designed to process data with a known grid-like topology, most commonly 2D images but also 1D signals (time series, audio) and 3D volumes (video, medical scans). CNNs exploit spatial locality and translation equivariance through three core operations: convolution, pooling, and non-linear activation.
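The translation-equivariance property can be verified directly: convolving a shifted signal yields the shifted output. A minimal sketch, assuming NumPy (the kernel and signal values are illustrative):

```python
import numpy as np

def conv1d_valid(x, k):
    """'Valid' 1D convolution (cross-correlation): slide k across x,
    taking a dot product at each position."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

x = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])
k = np.array([1.0, -1.0])          # a simple difference (edge-detecting) kernel

y = conv1d_valid(x, k)
x_shifted = np.roll(x, 1)          # translate the input by one step
y_shifted = conv1d_valid(x_shifted, k)

# Equivariance: the response to the shifted input is the shifted response
# (away from the boundary).
print(np.allclose(y_shifted[1:], y[:-1]))  # True
```

The same argument extends to 2D: a pattern detected at one image location is detected identically at any other, which is why one small kernel can be shared across the whole input.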
Technically, a CNN replaces the fully connected matrix multiplications of a standard neural network with convolution operations. Each convolutional layer applies a set of learnable filters (kernels) that slide (convolve) across the input, computing dot products at each position. This produces feature maps that encode the presence of specific patterns. Early layers detect low-level features (edges, corners, color blobs); deeper layers compose these into mid-level features (textures, parts of objects) and high-level features (object classes, faces). Pooling layers (max or average) downsample feature maps, reducing spatial dimensions and providing limited translation invariance. Non-linear activation functions (ReLU, GELU, Swish) are applied after each convolution. The final layers are typically fully connected (dense) to produce class scores or regression outputs. Modern CNNs also incorporate batch normalization, residual connections (ResNet, 2015), and depthwise separable convolutions (MobileNet, Xception) to improve training stability and efficiency.
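The convolution → activation → pooling pipeline described above can be sketched in a few lines of NumPy. This is a toy illustration with a hand-picked edge-detecting kernel, not a trained network; note that, like most deep-learning frameworks, it actually computes cross-correlation, which the field conventionally calls convolution:

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2D convolution: dot product of the kernel with each patch."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Non-linear activation applied elementwise after the convolution."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Downsample by taking the max over non-overlapping size x size windows."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A 6x6 "image" with a vertical dark-to-bright edge at column 3,
# and a kernel that responds to exactly that transition.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

feature = relu(conv2d(img, kernel))  # 5x5 feature map, peaking along the edge
pooled = max_pool(feature)           # 2x2 map: smaller, edge location coarser
```

The feature map is non-zero only where the edge pattern occurs, and pooling halves each spatial dimension while preserving the strongest responses, which is the source of the limited translation invariance mentioned above.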
CNNs became dominant after AlexNet (Krizhevsky et al., 2012) won the ImageNet Large Scale Visual Recognition Challenge by a large margin, reducing top-5 error from 26% to 15.3%. Subsequent architectures advanced the state: VGGNet (2014) showed depth matters; GoogLeNet/Inception (2014) introduced parallel multi-scale convolutions; ResNet (2015) enabled 152-layer networks via skip connections; EfficientNet (2019) systematically scaled depth, width, and resolution; ConvNeXt (2022) modernized CNNs with Transformer-inspired design choices (layer normalization, GELU, larger kernels). As of 2026, CNNs remain the workhorse for production computer vision systems due to their speed and hardware efficiency, though Vision Transformers (ViT, 2020) and hybrid CNN-Transformer models (ConvNeXt V2, MaxViT) have matched or exceeded CNN accuracy on large-scale benchmarks (ImageNet top-1 accuracy >90%).
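The skip connection that let ResNet reach 152 layers is structurally simple: the block computes a residual F(x) and adds the input back before the final activation. A minimal sketch, assuming NumPy and using dense layers in place of convolutions purely for brevity (the weights here are illustrative, not from any real model):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): the block learns a residual F on top of identity."""
    h = relu(x @ w1)       # first transformation of the residual branch
    fx = h @ w2            # second transformation of the residual branch
    return relu(fx + x)    # skip connection: add the unmodified input back

rng = np.random.default_rng(0)
x = relu(rng.standard_normal(4))

# If the residual branch contributes nothing (zero weights), the block
# reduces to the identity, so stacking many blocks cannot make the
# network worse than a shallower one -- the key to training very deep nets.
w_zero = np.zeros((4, 4))
print(np.allclose(residual_block(x, w_zero, w_zero), x))  # True
```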
CNNs are used when input data has strong local structure: image classification, object detection (YOLO, Faster R-CNN), semantic segmentation (U-Net, DeepLab), face recognition (FaceNet), medical imaging (CheXNet for chest X-rays), and self-driving car perception. Alternatives include Vision Transformers (better with massive data and long-range dependencies), MLP-Mixers (simpler, competitive on mid-sized datasets), and graph neural networks (for non-grid data such as point clouds or molecular structures). Common pitfalls include a need for large labeled datasets (mitigated by transfer learning and data augmentation), sensitivity to adversarial perturbations (e.g., one-pixel attacks), and poor generalization to rotated or scaled objects unless the training data is augmented accordingly. Current state-of-the-art (2026) includes efficient CNN architectures for edge deployment (MobileNetV4, EfficientNet-Lite), CNNs augmented with attention mechanisms (CBAM, SE-Net), and self-supervised pre-training of CNNs (SimCLR, BYOL) enabling label-efficient fine-tuning.
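Data augmentation, mentioned above as a mitigation for both limited labels and pose sensitivity, amounts to generating randomized views of each training image. A minimal sketch, assuming NumPy and using two standard transforms (random horizontal flip and random crop); real pipelines typically add color jitter, rotation, and scaling:

```python
import numpy as np

def augment(img, rng, crop=24):
    """Return a randomly flipped and cropped view of an HxWxC image."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # random horizontal flip
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)         # random crop position
    left = rng.integers(0, w - crop + 1)
    return img[top:top + crop, left:left + crop]

rng = np.random.default_rng(42)
img = rng.standard_normal((32, 32, 3))          # a dummy 32x32 RGB image
views = [augment(img, rng) for _ in range(4)]   # distinct training views
print(all(v.shape == (24, 24, 3) for v in views))  # True
```

Each epoch the network sees different crops and flips of the same labeled image, which effectively multiplies the dataset and pushes the learned features toward the invariances the raw architecture lacks.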