AMD's ROCm software stack has improved performance by more than 75x in the 14 days since DeepSeek v4 launched, according to @SemiAnalysis_. The gains come from fused mHC and RoPE operations that reduce CPU overhead and improve HBM utilization.
Key facts
- ROCm performance improved 75x in 14 days post-DeepSeek v4
- Fused mHC and RoPE operations cut CPU overhead
- Kernels rewritten in TileLang and Triton for speed
- 5x more needed to match B200 single-node performance
- 1.5x more needed for PD disaggregated B200 performance
The 75x improvement is not a single benchmark but an aggregate across key inference kernels. The gains come from fusing mHC operations and fusing RoPE Hadamard transformations, which reduces CPU overhead and improves HBM utilization [per @SemiAnalysis_]. Other kernels, such as the attention indexer and the KV cache compressor, have been rewritten in TileLang and Triton to speed up development.
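To see why fusion helps on both counts, here is a minimal sketch of a cost model for a chain of elementwise operations. It is not AMD's code and says nothing about the actual mHC or RoPE kernels; the function name, the tensor size, and the assumption that each unfused kernel reads and writes the full tensor in HBM exactly once are all illustrative simplifications.

```python
def elementwise_chain_cost(n_elements, bytes_per_element, n_ops, fused):
    """Return (kernel_launches, hbm_bytes_moved) for a chain of elementwise ops.

    Simplified, assumed model: every kernel reads its input from HBM once
    and writes its output to HBM once. A fused kernel keeps intermediates
    on-chip, so only the first read and the last write touch HBM.
    """
    if n_ops < 1:
        raise ValueError("n_ops must be at least 1")
    tensor_bytes = n_elements * bytes_per_element
    # Each launch is one round of host-side dispatch work (CPU overhead).
    launches = 1 if fused else n_ops
    # One read plus one write of the whole tensor per launch.
    hbm_bytes = 2 * tensor_bytes * launches
    return launches, hbm_bytes


# Hypothetical workload: 3 chained ops over 1M fp16 values (2 bytes each).
unfused = elementwise_chain_cost(1_000_000, 2, n_ops=3, fused=False)
fused = elementwise_chain_cost(1_000_000, 2, n_ops=3, fused=True)
print("unfused (launches, bytes):", unfused)
print("fused   (launches, bytes):", fused)
```

In this toy model, fusing three operations cuts both the launch count and the HBM traffic by a factor of three. Real kernels depend on occupancy, caching, and register pressure, so treat the model as intuition rather than a prediction.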
The structural take: AMD did not achieve this through hardware changes; the MI300X silicon is unchanged. This is purely a software stack optimization, which means the gains should carry over to existing AMD Instinct deployments. The speed of improvement (75x in 14 days) suggests that, before DeepSeek v4's launch exposed the gap, the software stack was severely under-optimized relative to NVIDIA's CUDA ecosystem.
What's needed to catch NVIDIA: Another 5x improvement is needed to match single-node aggregated B200 performance, and a further 1.5x on top of that (about 7.5x in total) to match PD (prefill-decode) disaggregated B200 performance. @SemiAnalysis_ claims both are "within the realm of possibility for AMD within the next couple of weeks." That timeline is aggressive: NVIDIA's B200 is shipping now, and Blackwell's architecture includes hardware-level innovations (e.g., FP4 tensor cores) whose advantage is harder to close with software alone.
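The arithmetic behind those two multipliers can be sketched as follows. The 5x, 1.5x, 75x, and 14-day figures come from the claims above; the helper names are hypothetical, and the assumption that the past rate of improvement keeps compounding at a constant daily rate is purely illustrative, not something the source states.

```python
import math


def remaining_gap(multipliers):
    """Multiply successive speedup requirements into one cumulative gap."""
    return math.prod(multipliers)


def implied_daily_rate(total_speedup, days):
    """Constant per-day multiplier that compounds to total_speedup."""
    return total_speedup ** (1.0 / days)


def days_to_close(gap, daily_rate):
    """Days needed to reach `gap` if `daily_rate` kept compounding (assumed)."""
    return math.log(gap) / math.log(daily_rate)


# Figures quoted in the text: 5x to single-node B200, then 1.5x more.
total_gap = remaining_gap([5.0, 1.5])
# Assumption: the reported 75x arrived at a steady daily rate over 14 days.
rate = implied_daily_rate(75.0, 14)

print(f"cumulative gap to PD disaggregated B200: {total_gap:.1f}x")
print(f"implied daily multiplier so far: {rate:.2f}x")
print(f"days to close 5x at that rate: {days_to_close(5.0, rate):.1f}")
print(f"days to close {total_gap:.1f}x at that rate: {days_to_close(total_gap, rate):.1f}")
```

Under that constant-rate assumption, the "couple of weeks" estimate is arithmetically consistent with the pace claimed so far. The assumption is optimistic, though: if the early gains came from fixing the most obvious inefficiencies, later ones are unlikely to arrive as quickly.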
Who did the work: The rapid improvement is credited to HaiShaw, Thomas, @roaner, and @AnushElangovan [per @SemiAnalysis_]. AMD has not confirmed the performance figures or timeline publicly.
Risk factors: The source is a single analyst tweet, not an AMD announcement. The 75x figure lacks a defined baseline: 75x over what starting performance? If the initial ROCm stack was extremely slow (e.g., running unoptimized PyTorch without any kernel fusion), 75x is less impressive than if it was already competitive. Neither the figure nor the methodology comes from AMD.
Key Takeaways
- AMD ROCm stack improved 75x in 14 days post-DeepSeek v4 via fused operations.
- Still needs 5x more to match single-node B200 performance.
What to watch

Watch for AMD's upcoming official ROCm release notes and for whether the company publishes benchmarks that corroborate the 75x claim. Also track whether NVIDIA responds with Blackwell software optimizations that widen the gap again. The next 30 days should show whether AMD can close the 5x gap to B200.