NVFP4 GEMM on RTX Pro Blackwell: SM12x Breaks from B200 Programming Model

NVIDIA's SM12x architecture drops tcgen05.mma for mma.sync, breaking B200 kernel compatibility. SM8x kernels port easily; developers must maintain separate codebases.

Jun 21, 2026 · 3 min read · AI-generated
Source: research.colfax-intl.com, via hn_ai_infra and @_akhaliq (corroborated)
How does the SM12x programming model in RTX Pro Blackwell GPUs differ from B200?

NVIDIA's RTX Pro Blackwell GPUs (SM12x) use mma.sync instructions and register-based accumulation, diverging from B200's tcgen05.mma with TMEM. SM10x kernels won't run; SM8x kernels port easily.

TL;DR

SM12x uses mma.sync not tcgen05.mma · No TMEM; accumulators live in registers · SM8x kernels run well; SM10x are incompatible

NVIDIA's SM12x architecture in RTX Pro 6000 Blackwell GPUs drops tcgen05.mma and TMEM. Instead, it reverts to mma.sync instructions with register-based accumulation, breaking compatibility with B200 kernels.

Key facts

  • SM12x uses mma.sync, not tcgen05.mma
  • No TMEM; accumulators in register memory
  • SM10x kernels incompatible with SM12x
  • SM8x kernels run well on SM12x
  • NVFP4 blockscaled GEMM via CuTe DSL

Colfax International published a detailed tutorial on implementing NVFP4 blockscaled GEMM for NVIDIA's SM12x GPUs, including the RTX Pro 6000 Blackwell Server Edition (SM120) and DGX Spark (SM121). The post reveals a sharp architectural divergence within the Blackwell family: SM12x does not use the tcgen05.mma instruction or Tensor Memory (TMEM) found in SM10x (B200/B300). Instead, like SM8x (Ampere/Ada), it relies on warp-level mma.sync instructions and accumulates directly in register memory (RMEM).


Why the Programming Model Split Matters

The choice of mma.sync over tcgen05.mma has deep implications. On SM10x, tcgen05.mma launches asynchronously from a single thread, reading operands from shared memory (SMEM) and keeping accumulators in TMEM, locked to one CTA per SM. This enables a warp-specialized paradigm where one warp issues MMA while others load data. According to Colfax, SM10x GEMM kernels are completely incompatible with SM12x and will not run on those devices. Conversely, SM12x kernels may run on SM10x but will perform poorly.

SM12x's mma.sync is synchronous and warp-collective, with fixed register tile partitions. Because register fragments are much smaller than TMEM tiles, multiple warps must handle MMA to saturate throughput. The upside: SM8x kernels port directly to SM12x and often perform reasonably, preserving years of optimization work around scheduling and pipelining.
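To see why register fragments limit how much output a single warp can own, here is a rough budget calculation. It is a sketch under stated assumptions, not something taken from the Colfax post: FP32 accumulators occupying one 32-bit register each, an m16n8 output shape per mma.sync instruction (the shape named in the Figure 3 caption), and CUDA's limit of 255 registers per thread. The function names are hypothetical.

```python
# Sketch: accumulator register budget for a warp-level mma.sync tile.
# Assumptions (not from the article): FP32 accumulators, one 32-bit
# register per element, and a 255-registers-per-thread limit.

WARP_SIZE = 32
MAX_REGS_PER_THREAD = 255  # per-thread register ceiling in CUDA


def accumulators_per_thread(tile_m: int, tile_n: int) -> int:
    """Registers each thread needs to hold its share of a tile_m x tile_n
    FP32 accumulator tile that is spread evenly across one warp."""
    elements = tile_m * tile_n
    if elements % WARP_SIZE != 0:
        raise ValueError("tile must divide evenly across 32 threads")
    return elements // WARP_SIZE


def mma_issues_per_k_slice(tile_m: int, tile_n: int,
                           mma_m: int = 16, mma_n: int = 8) -> int:
    """Number of m16n8 mma.sync instructions a warp issues to update the
    whole tile once for a single k-slice."""
    if tile_m % mma_m or tile_n % mma_n:
        raise ValueError("tile must be a multiple of the MMA shape")
    return (tile_m // mma_m) * (tile_n // mma_n)


if __name__ == "__main__":
    for tile_m, tile_n in [(16, 8), (64, 64), (128, 128)]:
        regs = accumulators_per_thread(tile_m, tile_n)
        fits = regs <= MAX_REGS_PER_THREAD
        print(f"{tile_m}x{tile_n}: {regs} accumulator regs/thread, "
              f"{mma_issues_per_k_slice(tile_m, tile_n)} MMAs per k-slice, "
              f"fits in one warp: {fits}")
```

Under these assumptions a 128x128 output tile would need 512 accumulator registers per thread if one warp owned it, which exceeds the per-thread limit before operands are even counted. That is one way to read the article's point that several warps must share the MMA work.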

NVFP4 Blockscaled GEMM Implementation

The tutorial covers sub-byte blockscaled GEMM on SM12x, detailing the necessary PTX background and scale-factor layouts before walking through a CuTe DSL kernel. Colfax promises follow-up benchmarks in a subsequent optimization post; the current article focuses on correctness and the programming model shift rather than raw throughput numbers.
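For readers unfamiliar with blockscaling, the following is a minimal reference sketch of the arithmetic in plain Python. It is not the CuTe DSL kernel from the tutorial. It relies on background about the NVFP4 format rather than on the article: elements are e2m1 values, and each block of 16 elements shares one scale factor. As a labeled simplification, the scale is kept as an ordinary Python float instead of being rounded to the 8-bit scale type the hardware uses, and all function names are hypothetical.

```python
# Reference sketch of blockscaled 4-bit quantization and dot product.
# Simplification: scale factors stay as Python floats; real NVFP4 stores
# them in an 8-bit floating-point format.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements sharing one scale factor


def quantize_block(xs):
    """Map one block to (scale, codes), where codes are signed e2m1 values
    and scale * code approximates the original element."""
    amax = max(abs(x) for x in xs)
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in xs:
        target = abs(x) / scale
        mag = min(E2M1_MAGNITUDES, key=lambda v: abs(target - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def quantize(xs):
    """Split a vector into blocks of 16 and quantize each block."""
    if len(xs) % BLOCK != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = [xs[i:i + BLOCK] for i in range(0, len(xs), BLOCK)]
    pairs = [quantize_block(b) for b in blocks]
    return [s for s, _ in pairs], [c for _, c in pairs]


def blockscaled_dot(a_scales, a_codes, b_scales, b_codes):
    """Dot product over quantized blocks: each block's partial sum is
    multiplied by the product of the two blocks' scale factors."""
    total = 0.0
    for sa, ca, sb, cb in zip(a_scales, a_codes, b_scales, b_codes):
        total += sa * sb * sum(x * y for x, y in zip(ca, cb))
    return total


if __name__ == "__main__":
    a = [6.0] * 16 + [3.0] * 16
    a_scales, a_codes = quantize(a)
    print(a_scales, blockscaled_dot(a_scales, a_codes, a_scales, a_codes))
```

A blockscaled GEMM applies the same per-block scaling to every output element along the K dimension. The scale-factor layouts the tutorial discusses determine where those per-block scales sit in memory and registers, which this sketch does not model.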

Figure 3. MMA .m16n8k64 fragment layout for matrix A with e2m1 type. Taken from the PTX documentation.

SM12x does bring advances over SM8x: sub-byte support (including NVFP4), larger shared memory, and improved tensor core throughput. But the architectural split means developers targeting both Blackwell variants must maintain separate kernel codebases — a cost NVIDIA has not publicly quantified.
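In practice, maintaining separate codebases usually means choosing a kernel family at runtime from the device's compute capability. The sketch below is one hypothetical way to express that choice; the enum and function names are illustrative and do not come from NVIDIA or Colfax. The (major, minor) pair is passed in directly so the example has no dependencies. In a real project it might come from a runtime query such as PyTorch's torch.cuda.get_device_capability().

```python
# Hypothetical dispatch sketch: pick a GEMM kernel family from the
# compute capability, following the split described in the article.
from enum import Enum


class MmaPath(Enum):
    TCGEN05_TMEM = "tcgen05.mma with TMEM accumulators (SM10x)"
    MMA_SYNC_NVFP4 = "mma.sync with register accumulators, NVFP4 (SM12x)"
    MMA_SYNC_LEGACY = "mma.sync with register accumulators (SM8x)"


def select_mma_path(major: int, minor: int) -> MmaPath:
    """Return the kernel family for a device with compute capability
    major.minor. Architectures the article does not discuss are rejected
    rather than guessed at."""
    if major == 10:
        return MmaPath.TCGEN05_TMEM
    if major == 12:
        return MmaPath.MMA_SYNC_NVFP4
    if major == 8:
        return MmaPath.MMA_SYNC_LEGACY
    raise ValueError(f"no kernel family defined for SM{major}{minor}")


if __name__ == "__main__":
    for cc in [(8, 9), (10, 0), (12, 0), (12, 1)]:
        print(cc, "->", select_mma_path(*cc).value)
```

The SM8x and SM12x branches are kept distinct even though both use mma.sync, because only SM12x adds the sub-byte formats; a project that does not use NVFP4 could plausibly share one kernel between them.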

What to watch

Watch for Colfax's upcoming optimization post with NVFP4 GEMM benchmark numbers on SM12x. Also track whether NVIDIA documents a unified kernel strategy across Blackwell variants in CUDA 13 or introduces compiler pragmas to abstract the mma.sync vs. tcgen05.mma difference.

Figure 2. Blockscaled GEMM. Adapted from the PTX documentation.



AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

The SM12x programming model split is reminiscent of NVIDIA's earlier fragmentation between Volta (SM7.0) and Turing (SM7.5), where tensor core instructions changed. The key difference is that SM12x's divergence occurs within the same Blackwell family name, which makes the branding confusing. Developers buying RTX Pro Blackwell for inference workloads should expect to recompile kernels, but may be able to reuse SM8x kernels without code changes.

NVFP4 blockscaled GEMM gives NVIDIA's workstation-class GPUs, not just datacenter parts, hardware support for a 4-bit floating-point format. This could democratize 4-bit inference for local deployment, but until benchmark numbers are published, throughput comparisons are not possible. Colfax's follow-up will be critical.

The tutorial's focus on CuTe DSL suggests NVIDIA is pushing developers toward DSL-based kernel authoring rather than raw CUDA. This aligns with the broader trend of using hardware abstraction layers (CuTe, Triton) to manage architectural diversity, but SM12x shows that abstraction has not fully arrived.