NVIDIA's SM12x architecture in RTX Pro 6000 Blackwell GPUs drops tcgen05.mma and TMEM. Instead, it reverts to mma.sync instructions with register-based accumulation, breaking compatibility with B200 kernels.
Key facts
- SM12x uses mma.sync, not tcgen05.mma
- No TMEM; accumulators in register memory
- SM10x kernels incompatible with SM12x
- SM8x kernels run well on SM12x
- NVFP4 blockscaled GEMM via CuTe DSL
Colfax International published a detailed tutorial on implementing NVFP4 blockscaled GEMM for NVIDIA's SM12x GPUs, including the RTX Pro 6000 Blackwell Server Edition (SM120) and DGX Spark (SM121). The post reveals a sharp architectural divergence within the Blackwell family: SM12x does not use the tcgen05.mma instruction or Tensor Memory (TMEM) found in SM10x (B200/B300). Instead, like SM8x (Ampere/Ada), it relies on warp-level mma.sync instructions and accumulates directly in register memory (RMEM).
Key Takeaways
- NVIDIA's SM12x architecture drops tcgen05.mma for mma.sync, breaking B200 kernel compatibility.
- SM8x kernels port easily; developers must maintain separate codebases.
Why the Programming Model Split Matters
The choice of mma.sync over tcgen05.mma has deep implications. On SM10x, tcgen05.mma launches asynchronously from a single thread, sourcing operands from shared memory (SMEM) and accumulators from TMEM, locked to one CTA per SM. This enables a warp-specialized paradigm where one warp issues MMA while others load data. According to Colfax, SM10x GEMM kernels are completely incompatible with SM12x and will not run on those devices. Conversely, SM12x kernels may run on SM10x but will perform poorly.
SM12x's mma.sync is synchronous and warp-collective, with fixed register tile partitions. Because register fragments are much smaller than TMEM tiles, multiple warps must handle MMA to saturate throughput. The upside: SM8x kernels port directly to SM12x and often perform reasonably, preserving years of optimization work around scheduling and pipelining.
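The register pressure behind that point can be checked with simple arithmetic. The sketch below is illustrative only: the 128x128 CTA tile and the warp counts are assumptions chosen for the example, not figures from the Colfax post, and the helper name is hypothetical.

```python
# Back-of-the-envelope accumulator budget for register-based MMA.
# Assumptions (not from the source): FP32 accumulators, a 128x128 CTA tile,
# and an even split of the tile across all threads in the CTA.

THREADS_PER_WARP = 32


def accum_regs_per_thread(tile_m: int, tile_n: int, num_warps: int) -> int:
    """Return the 32-bit accumulator registers each thread must hold."""
    total_elements = tile_m * tile_n
    threads = num_warps * THREADS_PER_WARP
    if total_elements % threads:
        raise ValueError("tile does not divide evenly across threads")
    return total_elements // threads


# A single warp-level m16n8 output fragment: 128 values over 32 threads.
fragment_regs = accum_regs_per_thread(16, 8, 1)

# The same calculation for the assumed 128x128 CTA tile at several warp counts.
budget = {warps: accum_regs_per_thread(128, 128, warps) for warps in (1, 4, 8)}

if __name__ == "__main__":
    print("m16n8 fragment:", fragment_regs, "registers per thread")
    for warps, regs in budget.items():
        print(f"{warps} warp(s): {regs} accumulator registers per thread")
```

Under these assumptions a single warp would need 512 accumulator registers per thread, well above the 255-register-per-thread limit CUDA documents for current architectures, while 4 or 8 warps bring the figure down to 128 or 64. That is one way to see why the accumulator tile has to be spread across several MMA warps when it lives in registers rather than TMEM.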
NVFP4 Blockscaled GEMM Implementation
The tutorial covers sub-byte blockscaled GEMM on SM12x, detailing the necessary PTX background and scale-factor layouts before walking through a CuTe DSL kernel. Colfax promises benchmarks in a follow-up optimization post; the current article focuses on correctness and the programming model shift rather than raw throughput numbers.
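To make "blockscaled" concrete, here is a minimal pure-Python reference of the arithmetic, not the CuTe DSL kernel from the tutorial. It assumes the usual NVFP4 convention of 4-bit E2M1 values with one scale factor per block of 16 elements along K. For simplicity the scale factors are passed as already-decoded Python floats (the real format stores them as FP8), and the byte packing and scale-factor layouts the tutorial describes are not modeled. All function names are hypothetical.

```python
# Reference semantics of a blockscaled GEMM on 4-bit E2M1 codes.
# Assumptions: block size 16 along K; scales supplied as plain floats;
# B is given as N rows of K codes (that is, K-major), like A.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit code: bit 3 is the sign, bits 0-2 index the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_row(codes, scales):
    """Multiply each decoded element by the scale of its 16-element block."""
    if len(codes) != len(scales) * BLOCK:
        raise ValueError("need exactly one scale per block of 16 codes")
    return [decode_e2m1(c) * scales[i // BLOCK] for i, c in enumerate(codes)]


def blockscaled_gemm(a_codes, a_scales, b_codes, b_scales):
    """Return the M x N product of the dequantized A (M x K) and B (N x K)."""
    a = [dequantize_row(r, s) for r, s in zip(a_codes, a_scales)]
    b = [dequantize_row(r, s) for r, s in zip(b_codes, b_scales)]
    return [[sum(x * y for x, y in zip(ar, br)) for br in b] for ar in a]


if __name__ == "__main__":
    a_codes = [[0x2] * 16]   # sixteen values of +1.0
    b_codes = [[0xC] * 16]   # sixteen values of -2.0
    print(blockscaled_gemm(a_codes, [[0.5]], b_codes, [[2.0]]))
```

A slow reference like this is mainly useful as a correctness oracle: a GPU kernel's output for small shapes can be compared against it before any tuning.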

SM12x does bring advances over SM8x: sub-byte support (including NVFP4), larger shared memory, and improved tensor core throughput. But the architectural split means developers targeting both Blackwell variants must maintain separate kernel codebases, a cost NVIDIA has not publicly quantified.
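In practice, maintaining separate codebases usually implies some form of dispatch on compute capability. The following is a minimal sketch of that idea, assuming the mapping described in this article (SM8x and SM12x on mma.sync, SM10x on tcgen05.mma). The function names and return strings are hypothetical, and architectures the article does not discuss are reported as unknown rather than guessed.

```python
# Hypothetical dispatch helper: pick an MMA code path from compute capability.
# The mapping reflects only what this article states; it is not an NVIDIA API.

MMA_FAMILY_BY_MAJOR = {
    8: "mma.sync",      # SM8x (Ampere/Ada): warp-level, register accumulators
    10: "tcgen05.mma",  # SM10x (B200/B300): asynchronous, TMEM accumulators
    12: "mma.sync",     # SM12x (RTX Pro 6000 Blackwell, DGX Spark)
}


def mma_family(major: int) -> str:
    """Return the MMA instruction family for a compute-capability major version."""
    return MMA_FAMILY_BY_MAJOR.get(major, "unknown")


def kernel_can_target(kernel_major: int, device_major: int) -> bool:
    """Conservative check: only treat a kernel as usable on a device when both
    sides use the same known MMA family. It says nothing about performance."""
    family = mma_family(kernel_major)
    return family != "unknown" and family == mma_family(device_major)


if __name__ == "__main__":
    print(mma_family(12))             # family for SM120 / SM121
    print(kernel_can_target(10, 12))  # SM10x kernel on an SM12x device
    print(kernel_can_target(8, 12))   # SM8x kernel on an SM12x device
```

The (major, minor) pair would come from whatever device query the host framework exposes, for example `torch.cuda.get_device_capability()` in PyTorch. The check is deliberately conservative: the article notes that SM12x kernels may run on SM10x with poor performance, and this sketch simply treats that pairing as a mismatch.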
What to watch
Watch for Colfax's upcoming optimization post with NVFP4 GEMM benchmark numbers on SM12x. Also track whether NVIDIA documents a unified kernel strategy across Blackwell variants in CUDA 13 or introduces compiler pragmas to abstract the mma.sync vs. tcgen05.mma difference.

Source: research.colfax-intl.com








