ROCm: definition + examples

ROCm (Radeon Open Compute) is AMD's open-source software stack for GPU-accelerated computing, primarily targeting machine learning, deep learning, and high-performance computing (HPC). Launched in 2016, ROCm aims to provide a CUDA-competitive ecosystem for AMD GPUs, centered around the HIP (Heterogeneous-Compute Interface for Portability) programming model. HIP is a C++ runtime API and kernel language that allows developers to write code portable between AMD and NVIDIA GPUs, with tools like hipify-perl to automatically convert CUDA code to HIP.
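To make the HIP model concrete, here is a minimal vector-add program. It is a sketch using only standard HIP APIs (hipMalloc, hipMemcpy, hipLaunchKernelGGL), and is close to what hipify-perl would emit from the equivalent CUDA source.

```cpp
// vecadd.hip.cpp: minimal HIP example; compile with `hipcc vecadd.hip.cpp`.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Same __global__/threadIdx syntax as CUDA; hipify maps cudaMalloc -> hipMalloc, etc.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    hipMalloc(&da, n * sizeof(float));
    hipMalloc(&db, n * sizeof(float));
    hipMalloc(&dc, n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Canonical HIP launch macro; the CUDA-style <<<grid, block>>> syntax
    // also works under hipcc.
    dim3 block(256), grid((n + 255) / 256);
    hipLaunchKernelGGL(vec_add, grid, block, 0, 0, da, db, dc, n);
    hipDeviceSynchronize();

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0
    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```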

Technically, ROCm includes several key components: the ROCk kernel driver (Linux only), the ROCr runtime (user-space API), and ROCclr (the common language runtime for OpenCL and HIP), plus a suite of math and communication libraries including rocBLAS (BLAS), rocFFT (FFT), rocRAND (random number generation), rocSPARSE (sparse linear algebra), MIOpen (deep learning primitives such as convolutions and pooling), and RCCL (collective communications, analogous to NVIDIA's NCCL). The HIP compiler is built on LLVM and targets AMD's GCN and CDNA instruction set architectures (e.g., MI250X, MI300X).
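As a sketch of that library layer, the snippet below calls rocBLAS for a single-precision GEMM. The handle-based, column-major API mirrors cuBLAS; the header lives at rocblas/rocblas.h on recent releases (plain rocblas.h on older ones), and buffer initialization is omitted for brevity.

```cpp
// sgemm.hip.cpp: rocBLAS SGEMM sketch, C = alpha*A*B + beta*C, column-major.
// Compile with hipcc and link with -lrocblas.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

int main() {
    const rocblas_int m = 512, n = 512, k = 512;

    // Device buffers; filling them with real data is omitted in this sketch.
    float *dA, *dB, *dC;
    hipMalloc(&dA, sizeof(float) * m * k);
    hipMalloc(&dB, sizeof(float) * k * n);
    hipMalloc(&dC, sizeof(float) * m * n);

    rocblas_handle handle;
    rocblas_create_handle(&handle);  // handle-based API, analogous to cuBLAS

    const float alpha = 1.0f, beta = 0.0f;
    // Column-major leading dimensions: A is m x k (lda = m),
    // B is k x n (ldb = k), C is m x n (ldc = m).
    rocblas_sgemm(handle,
                  rocblas_operation_none, rocblas_operation_none,
                  m, n, k,
                  &alpha, dA, m, dB, k, &beta, dC, m);
    hipDeviceSynchronize();

    rocblas_destroy_handle(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```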

Why ROCm matters: It is the primary means of running large-scale AI workloads on AMD hardware, which has gained traction in HPC (e.g., Frontier, the first exascale supercomputer, uses AMD MI250X GPUs) and cloud instances (AWS EC2 DL2a instances with AMD MI100). ROCm enables training and inference for popular frameworks such as PyTorch (official support since v1.8), TensorFlow, JAX (partial), and ONNX Runtime. As of 2026, ROCm 6.x provides full support for the CDNA 3 architecture (MI300X) with 192 GB of HBM3 memory and 5.3 TB/s of bandwidth, competitive with NVIDIA H100 for large model training.

When used vs alternatives: ROCm is chosen when deploying on AMD GPUs for cost efficiency, open-source preference, or procurement mandates (e.g., EU projects promoting hardware diversity). It is not used on NVIDIA GPUs, which run CUDA natively. A major pitfall: ROCm has historically had limited support for consumer Radeon GPUs (e.g., the RX 7900 XTX lacks full FP64 and matrix-core support) and requires Linux (no official Windows support for training). Another common pitfall is operator coverage gaps: some PyTorch ops (e.g., certain index operations or custom CUDA kernels) have no HIP equivalents and require manual porting or fallback to CPU. Performance can also lag CUDA at the same nominal FLOPs due to less mature kernel tuning and compiler optimizations. A quick way to see what ROCm actually recognizes is to query the device architecture at runtime, as sketched below.
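The following is a minimal sketch using hipGetDeviceProperties; AMD's support matrices are keyed on gfx architecture strings such as gfx90a (MI250X) and gfx1100 (RX 7900 XTX), so checking them before assuming full library coverage is a cheap sanity test.

```cpp
// probe.hip.cpp: enumerate ROCm-visible GPUs and their gfx architecture strings.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        printf("no ROCm-capable GPU visible\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        // gcnArchName is e.g. "gfx90a" (MI250X) or "gfx1100" (RX 7900 XTX);
        // support matrices for libraries like MIOpen key on these strings.
        printf("device %d: %s (%s), %.1f GiB\n", i, prop.name, prop.gcnArchName,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```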

Current state of the art (2026): ROCm 6.3+ offers near-parity with CUDA for standard workloads (LLM training, vision transformers). AMD's open-source strategy for ROCm includes contributions to Triton (a language and compiler for GPU kernels) and to PyTorch's native Dynamo backend. The MI400 series, expected in 2026 with CDNA 4, should narrow the gap further. Key papers include AMD's work on FlashAttention-2 for MI250X (achieving ~220 TFLOPS in FP16) and optimized all-reduce with RCCL for 8x MI300X nodes. Training Llama 3 70B on 256 MI300X GPUs achieves ~55% MFU (model FLOPs utilization) versus ~60% on H100, a gap that is closing with each ROCm release.
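For readers unfamiliar with the metric, MFU can be estimated from first principles; the roughly 6 FLOPs per parameter per token figure below is a common approximation for dense transformer training, not a ROCm-specific formula.

```latex
\mathrm{MFU}
  = \frac{\text{achieved FLOPs/s}}{\text{peak FLOPs/s}}
  \approx \frac{6 \, N_{\mathrm{params}} \times \text{tokens/s per GPU}}
               {\text{peak FLOPs/s per GPU}}
```

The denominator depends on the datatype (FP16/BF16 vs FP8) and on whether sparsity is counted, so MFU comparisons across GPUs are only meaningful when those are held fixed.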

Examples

  • Frontier supercomputer (ORNL) uses 37,632 AMD MI250X GPUs, running ROCm 5.x for HPC and AI workloads.
  • Llama 3 70B fine-tuned on 8xMI300X GPUs using PyTorch with ROCm 6.2, achieving ~150 TFLOPS per GPU.
  • Stable Diffusion XL inference on a single AMD MI250X via ONNX Runtime with ROCm execution provider.
  • AMD's open-source FlashAttention-2 port for MI250X, integrated into ROCm 6.0, enabling 2x speedup over naive attention.
  • AWS EC2 DL2a instances (4xMI100 GPUs) used for training BERT-large with TensorFlow-ROCm 2.10.

Related terms

CUDA · HIP · MIOpen · RCCL · Triton

FAQ

What is ROCm?

ROCm (Radeon Open Compute) is AMD's open-source software platform for GPU-accelerated machine learning and HPC, providing a CUDA-like runtime, compiler (HIP), and libraries targeting AMD Instinct GPUs.

How does ROCm work?

ROCm layers the ROCr user-space runtime and the LLVM-based HIP compiler on top of the ROCk kernel driver, and ships GPU libraries (rocBLAS, MIOpen, RCCL, and others) that frameworks such as PyTorch and TensorFlow call into. Developers write HIP C++ kernels directly or convert existing CUDA code with the hipify tools; the same HIP source can target AMD or NVIDIA GPUs.

Where is ROCm used in 2026?

The Frontier supercomputer at ORNL runs 37,632 AMD MI250X GPUs on ROCm 5.x for HPC and AI workloads. Llama 3 70B has been fine-tuned on 8x MI300X GPUs using PyTorch with ROCm 6.2, achieving ~150 TFLOPS per GPU, and Stable Diffusion XL inference runs on a single MI250X via ONNX Runtime's ROCm execution provider.