Can I run B200 GEMM kernels on RTX Pro Blackwell?

No. SM10x kernels using tcgen05.mma and TMEM are completely incompatible with SM12x and will not run.

Will my Ampere/Ada GEMM kernels work on SM12x?

Yes. SM8x kernels port directly to SM12x and often achieve reasonable performance.

NVFP4 GEMM on RTX Pro Blackwell: SM12x Breaks from B200 Programming Model

NVIDIA's SM12x architecture drops tcgen05.mma for mma.sync, breaking B200 kernel compatibility. SM8x kernels port easily; developers must maintain separate codebases.

Ala SMITH & AI Research Desk · Jun 21, 2026 · 3 min read · AI-Generated

Source: research.colfax-intl.com via hn_ai_infra, @_akhaliq (corroborated)

How does the SM12x programming model in RTX Pro Blackwell GPUs differ from B200?

NVIDIA's RTX Pro Blackwell GPUs (SM12x) use mma.sync instructions and register-based accumulation, diverging from B200's tcgen05.mma with TMEM. SM10x kernels won't run; SM8x kernels port easily.

TL;DR

SM12x uses mma.sync not tcgen05.mma · No TMEM; accumulators live in registers · SM8x kernels run well; SM10x are incompatible

NVIDIA's SM12x architecture in RTX Pro 6000 Blackwell GPUs drops tcgen05.mma and TMEM. Instead, it reverts to mma.sync instructions with register-based accumulation, breaking compatibility with B200 kernels.

Key facts

SM12x uses mma.sync, not tcgen05.mma
No TMEM; accumulators in register memory
SM10x kernels incompatible with SM12x
SM8x kernels run well on SM12x
NVFP4 blockscaled GEMM via CuTe DSL

Colfax International published a detailed tutorial on implementing NVFP4 blockscaled GEMM for NVIDIA's SM12x GPUs, including the RTX Pro 6000 Blackwell Server Edition (SM120) and DGX Spark (SM121). The post reveals a sharp architectural divergence within the Blackwell family: SM12x does not use the tcgen05.mma instruction or Tensor Memory (TMEM) found in SM10x (B200/B300). Instead, like SM8x (Ampere/Ada), it relies on warp-level mma.sync instructions and accumulates directly in register memory (RMEM).
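The split described above can be written down as a small lookup keyed on the compute capability major version. This is a minimal sketch, not anything from the Colfax tutorial: the `MmaModel` class and `mma_model` function are hypothetical names, and only the three families the article discusses are covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MmaModel:
    instruction: str  # PTX MMA instruction family
    accumulator: str  # where accumulators live
    issue: str        # how the MMA is issued


# Only the families the article discusses; anything else is left unknown.
_MODELS = {
    8: MmaModel("mma.sync", "RMEM", "warp-collective, synchronous"),      # SM8x
    10: MmaModel("tcgen05.mma", "TMEM", "single-thread, asynchronous"),  # SM10x
    12: MmaModel("mma.sync", "RMEM", "warp-collective, synchronous"),     # SM12x
}


def mma_model(major: int) -> Optional[MmaModel]:
    """Return the programming model for a compute capability major version,
    or None if the article does not cover that family."""
    return _MODELS.get(major)


if __name__ == "__main__":
    for major in (8, 10, 12):
        print(major, mma_model(major))
```

If PyTorch is available, the major version could be taken from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple. SM120 and SM121 both map to the `12` entry here.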

Why the Programming Model Split Matters

The choice of mma.sync over tcgen05.mma has deep implications. On SM10x, tcgen05.mma launches asynchronously from a single thread, sourcing operands from shared memory (SMEM) and accumulators from TMEM, locked to one CTA per SM. This enables a warp-specialized paradigm where one warp issues MMA while others load data. According to the source, SM10x GEMM kernels are completely incompatible with SM12x and will not run on those devices. Conversely, SM12x kernels may run on SM10x but will perform poorly.
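The portability claims reported in this article can be collected into one table, which may be useful as a guard in a kernel dispatcher. This is a sketch only: the `PORTABILITY` table and `portability` function are hypothetical, the outcome strings are shorthand for the article's wording, and pairs the article does not mention are reported as not covered rather than guessed.

```python
# (kernel family, device family) -> outcome, as reported in the article.
PORTABILITY = {
    ("SM10x", "SM12x"): "incompatible",    # tcgen05.mma / TMEM kernels will not run
    ("SM12x", "SM10x"): "runs, slow",      # may run but performs poorly
    ("SM8x", "SM12x"): "ports directly",   # often reasonable performance
}


def portability(kernel_family: str, device_family: str) -> str:
    """Look up what the article says about running a kernel written for one
    family on a device of another family."""
    if kernel_family == device_family:
        return "native"
    return PORTABILITY.get((kernel_family, device_family), "not covered")


if __name__ == "__main__":
    for pair in [("SM10x", "SM12x"), ("SM12x", "SM10x"), ("SM8x", "SM12x")]:
        print(pair, "->", portability(*pair))
```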

SM12x's mma.sync is synchronous and warp-collective, with fixed register tile partitions. Because register fragments are much smaller than TMEM tiles, multiple warps must handle MMA to saturate throughput. The upside: SM8x kernels port directly to SM12x and often perform reasonably, preserving years of optimization work around scheduling and pipelining.

NVFP4 Blockscaled GEMM Implementation

The tutorial covers sub-byte blockscaled GEMM on SM12x, detailing the necessary PTX background and scale-factor layouts before walking through a CuTe DSL kernel. Colfax promises follow-up benchmarks in a subsequent optimization post; the current article focuses on correctness and the programming model shift rather than raw throughput numbers.
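To make "blockscaled" concrete, here is a slow pure-Python reference of the arithmetic, not the CuTe DSL kernel from the tutorial. It rests on assumptions the article does not state: e2m1 codes use the standard sign/exponent/mantissa layout, one scale factor covers 16 consecutive elements along K (the NVFP4 block size), and plain Python floats stand in for the hardware's FP8 scale factors. All function names are hypothetical.

```python
# Magnitudes of the eight non-negative e2m1 codes (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumption: elements along K that share one scale factor


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit e2m1 code: bit 3 is the sign, bits 2..0 index the magnitude."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def dequantize(codes, scales):
    """Expand one K-long vector of codes using one scale per BLOCK elements."""
    assert len(codes) == BLOCK * len(scales)
    return [decode_e2m1(c) * scales[i // BLOCK] for i, c in enumerate(codes)]


def blockscaled_gemm(a_codes, a_scales, b_codes, b_scales):
    """Reference D = A @ B^T. A is M x K codes and B is N x K codes, both
    K-major, with scales shaped M x (K/BLOCK) and N x (K/BLOCK)."""
    a = [dequantize(row, s) for row, s in zip(a_codes, a_scales)]
    b = [dequantize(col, s) for col, s in zip(b_codes, b_scales)]
    return [[sum(x * y for x, y in zip(row, col)) for col in b] for row in a]


if __name__ == "__main__":
    a_codes = [[0b0010] * 16 + [0b1011] * 16]  # 1.0 x16, then -1.5 x16
    a_scales = [[0.5, 2.0]]
    b_codes = [[0b0100] * 32]                  # 2.0 x32
    b_scales = [[1.0, 0.25]]
    print(blockscaled_gemm(a_codes, a_scales, b_codes, b_scales))
```

A reference like this is mainly useful for checking a real kernel's output on small shapes. It says nothing about the scale-factor memory layouts the tutorial covers.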

Figure 3. MMA .m16n8k64 fragment layout for matrix A with e2m1 type. Taken from the PTX documentation.
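The size of the fragments for the .m16n8k64 shape named in the caption can be estimated with simple arithmetic. The sketch below assumes 4-bit e2m1 operands packed without padding, FP32 accumulators, and an even split of each tile across the 32 lanes of a warp. It counts registers only and does not reproduce which elements each lane holds; for that, consult the PTX documentation.

```python
WARP_SIZE = 32  # threads that cooperate on one mma.sync
REG_BITS = 32   # width of one general-purpose register


def regs_per_thread(rows: int, cols: int, bits_per_element: int) -> int:
    """32-bit registers each thread needs if a rows x cols tile is split
    evenly across the warp."""
    total_bits = rows * cols * bits_per_element
    per_thread_bits, remainder = divmod(total_bits, WARP_SIZE)
    assert remainder == 0, "tile does not divide evenly across the warp"
    return -(-per_thread_bits // REG_BITS)  # ceiling division


M, N, K = 16, 8, 64
a_regs = regs_per_thread(M, K, 4)     # A tile: 16 x 64 e2m1 values
b_regs = regs_per_thread(K, N, 4)     # B tile: 64 x 8 e2m1 values
acc_regs = regs_per_thread(M, N, 32)  # accumulator tile: 16 x 8 FP32 values

if __name__ == "__main__":
    print(f"A: {a_regs} regs, B: {b_regs} regs, accumulators: {acc_regs} regs per thread")
```

Under these assumptions each thread holds only a handful of registers per instruction, which is consistent with the article's point that register fragments are small and several warps must issue MMAs to keep the tensor cores busy.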

SM12x does bring advances over SM8x: sub-byte support (including NVFP4), larger shared memory, and improved tensor core throughput. But the architectural split means developers targeting both Blackwell variants must maintain separate kernel codebases — a cost NVIDIA has not publicly quantified.

What to watch

Watch for Colfax's upcoming optimization post with NVFP4 GEMM benchmark numbers on SM12x. Also track whether NVIDIA documents a unified kernel strategy across Blackwell variants in CUDA 13 or introduces compiler pragmas to abstract the mma.sync vs. tcgen05.mma difference.

Figure 2. Blockscaled GEMM. Adapted from the PTX documentation.

Source: research.colfax-intl.com


Source: gentic.news · Jun 21, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The SM12x programming model split is structurally reminiscent of NVIDIA's earlier fragmentation between Volta (SM7.0) and Turing (SM7.5), where tensor core instructions changed. The key difference: SM12x's divergence occurs within the same Blackwell family name, creating a confusing branding situation. Developers buying RTX Pro Blackwell for inference workloads should expect to recompile kernels, but existing SM8x kernels may carry over without code changes.

NVFP4 brings a hardware-supported 4-bit floating-point format to NVIDIA's workstation GPUs, not just datacenter parts. This could democratize 4-bit inference for local deployment, but no benchmark numbers have been published yet, so throughput comparisons are not possible. Colfax's follow-up will be critical.

The tutorial's focus on CuTe DSL suggests NVIDIA is pushing developers toward DSL-based kernel authoring rather than raw CUDA. This aligns with the broader trend of using hardware abstraction layers (CuTe, Triton) to manage architectural diversity, but the SM12x split shows that abstraction has not fully arrived.
