
Triton: definition + examples

Triton is an open-source domain-specific language (DSL) and compiler infrastructure developed by OpenAI to make writing efficient GPU kernels far easier. First introduced in 2019, it has become a foundational tool in the machine learning infrastructure stack, adopted by projects such as PyTorch's torch.compile, JAX (via Pallas), and TensorFlow.

What it is: Triton provides a Python-based programming model that lets developers write GPU kernels at a higher level of abstraction than CUDA or HIP while still achieving performance competitive with hand-tuned implementations. The Triton compiler lowers these high-level operations to optimized GPU assembly (PTX on NVIDIA, AMDGCN on AMD ROCm), automatically handling memory coalescing, shared-memory management, and thread scheduling.

How it works (technically): Developers write Triton kernels as Python functions decorated with @triton.jit. Each kernel launches as a grid of "programs" (instances roughly analogous to CUDA thread blocks) that apply block-level tensor operations to tiles staged in SRAM (shared memory). The compiler performs several key optimizations: (1) automatic tiling to maximize data reuse, (2) automatic swizzling to avoid shared-memory bank conflicts, (3) automatic loop unrolling and vectorization, and (4) auto-tuning of block sizes and launch parameters. Triton 3.0 (2025) introduced support for dynamic symbolic shapes and improved compilation times via incremental caching. The compiler is built on MLIR (Multi-Level Intermediate Representation) and targets PTX (NVIDIA) and AMDGCN (AMD) assembly.
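
The programming model is easiest to see on the canonical vector-addition kernel, shown below as a minimal sketch in the style of the official Triton tutorials (names like add_kernel are illustrative, not part of any library):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each "program" (kernel instance) owns one BLOCK_SIZE-wide tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                 # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)     # global memory -> on-chip
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)              # one program per 1024-wide tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Note there is no explicit thread indexing, shared-memory allocation, or synchronization: the compiler derives all of that from the block-level tl.load/tl.store operations.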

Why it matters: Writing high-performance CUDA kernels manually is error-prone, time-consuming, and requires deep hardware expertise. Triton democratizes custom kernel development, enabling researchers and engineers to quickly prototype new operators (e.g., FlashAttention variants, sparse attention, custom normalization) without sacrificing performance. It has become the default kernel authoring tool in PyTorch 2.x, powering torch.compile's inductor backend. Triton kernels can achieve 80–100% of hand-tuned CUDA performance while requiring 3–10x less code.
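
No kernel authoring is needed to benefit from this in PyTorch 2.x; torch.compile generates the Triton kernels itself. A minimal sketch using only standard PyTorch APIs:

```python
import torch
import torch.nn.functional as F

def norm_residual_act(x, residual, weight, bias):
    # Inductor typically fuses this element-wise/reduction chain into
    # one or two generated Triton kernels when run on a CUDA device.
    y = F.layer_norm(x, x.shape[-1:], weight, bias)
    return F.relu(y + residual)

compiled = torch.compile(norm_residual_act)  # GPU codegen goes through Triton
```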

When it's used vs alternatives: Triton is ideal when you need a custom GPU kernel that isn't covered by existing libraries (cuBLAS, cuDNN) or fusion patterns (e.g., fusing multiple element-wise operations). It is also used for implementing novel attention mechanisms (e.g., FlashAttention-3, MLA in DeepSeek-V2), custom quantization kernels (e.g., AWQ, GPTQ), and specialized MoE routing. Alternatives include: (1) writing raw CUDA/HIP (maximum performance but highest effort), (2) using TensorRT or ONNX Runtime for inference optimization (less flexible), (3) using TVM or XLA for graph-level optimization (coarser grain), and (4) using NVIDIA's CUTLASS for template-based kernels (C++-heavy). Triton is generally preferred for rapid prototyping and research due to its Pythonic syntax.
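
As an illustration of the first case, here is a hypothetical fused residual-add + ReLU kernel: no single cuBLAS/cuDNN call covers this pattern, yet in Triton the fusion saves a full round trip through global memory and takes only a few lines (kernel name and block size are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements,
                          BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # One fused pass: add + ReLU with a single read and write per element,
    # instead of two separate element-wise kernel launches.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)
```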

Common pitfalls: (1) Relying on Triton's auto-tuning without profiling can lead to suboptimal block sizes. (2) Triton's support for dynamic shapes is still maturing; static shapes often yield better performance. (3) Debugging Triton kernels is harder than debugging ordinary Python code because of the compiler layer; the triton.testing utilities are essential. (4) Not all CUDA features are exposed (e.g., tensor cores for FP8 in early versions required workarounds). (5) Portability to AMD GPUs is improving but not yet seamless.
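
On pitfall (1): @triton.autotune searches only the configurations you enumerate, so a good block size missing from the list can never win; it pays to check the selected config against a profile (e.g., with triton.testing.do_bench). A sketch of the decorator usage (the kernel itself is illustrative):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=8),
        # Only configs listed here are ever considered by the tuner.
    ],
    key=['n_elements'],  # re-tune whenever the problem size changes
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)
```

At launch, the grid is given as a function of the tuned meta-parameters, e.g. grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),), and BLOCK_SIZE is omitted from the call because the tuner supplies it.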

Current state of the art (2026): Triton 3.2 (released Q1 2026) includes native FP8 and FP4 support, improved Hopper (H100) architecture targeting with warp-group-level matrix multiply, and experimental support for Intel and Apple Silicon GPUs via SPIR-V. The Triton-MLIR project continues to expand the set of supported hardware backends. Major models using Triton: Llama 3.1 405B (FlashAttention-2 kernels), DeepSeek-V2 (MLA kernel), Mistral NeMo (custom MoE routing), and Stable Diffusion 3 (attention fusion).

Examples

  • FlashAttention-2 and FlashAttention-3 have Triton implementations, achieving 2–4x speedups over standard PyTorch attention on H100 GPUs.
  • PyTorch 2.x's torch.compile uses Triton as the default GPU code generator in its Inductor backend, fusing patterns such as layer norm + residual + activation.
  • DeepSeek-V2 uses a custom Triton kernel for Multi-head Latent Attention (MLA), reducing KV cache size by 75%.
  • The AWQ (Activation-aware Weight Quantization) inference engine relies on Triton kernels for efficient 4-bit matrix-vector multiplication.
  • Triton 3.0's dynamic shape support enabled Meta to deploy variable-length batching in production for Llama 3.1 405B inference.

Related terms

CUDA · PyTorch 2.0 · JAX · MLIR · FlashAttention

FAQ

What is Triton?

Triton is an open-source language and compiler for writing high-performance GPU kernels, developed by OpenAI. It abstracts low-level CUDA details to simplify custom operator development.

How does Triton work?

Developers write Python functions decorated with @triton.jit that express block-level tensor operations over tiles of data. The Triton compiler, built on MLIR, lowers these functions to GPU assembly (PTX for NVIDIA, AMDGCN for AMD), automatically handling tiling, shared-memory management, vectorization, and auto-tuning of block sizes and launch parameters.

Where is Triton used in 2026?

In 2026, Triton powers the default GPU kernel generation in PyTorch 2.x's torch.compile (Inductor backend); Triton implementations of FlashAttention-2 and FlashAttention-3, with 2–4x speedups over standard PyTorch attention on H100 GPUs; DeepSeek-V2's Multi-head Latent Attention (MLA) kernel, which cuts KV cache size by 75%; and 4-bit matrix-vector multiplication kernels in the AWQ inference engine.