gentic.news — AI News Intelligence Platform

[Figure: Inference throughput vs Hopper. Before: 2x (Nvidia baseline). After: 3x (Perplexity), a +50.0% delta. Auto-generated diagram from article data.]

Perplexity Claims 3x Blackwell Inference Throughput for 70B Models

Perplexity AI claims 3x inference throughput for 70B models on Nvidia Blackwell GPUs via FP4 and custom scheduling. The gain exceeds Nvidia's own 2x marketing claim.

Source: news.google.com via gn_gpu_cluster (single source)
What inference gains did Perplexity achieve on Nvidia Blackwell hardware?

Perplexity AI reported 3x inference throughput for 70B-parameter models on Nvidia Blackwell GPUs, using FP4 quantization and custom memory scheduling. The claim targets latency-sensitive search workloads.

TL;DR

3x throughput gain on Blackwell for 70B models. · Uses FP4 and custom memory scheduling. · Perplexity positions the result as an infrastructure differentiator.

Perplexity AI reported a 3x throughput improvement for 70B-parameter large language models running on Nvidia Blackwell GPUs. The gain, achieved through FP4 quantization and custom memory scheduling, targets latency-sensitive search workloads where every millisecond matters.
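For readers unfamiliar with FP4, the following is a minimal sketch of what rounding values to a 4-bit floating-point grid looks like. It assumes the common E2M1 layout (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a simple per-block scale; the block length, the function names, and the scaling rule are illustrative assumptions, not details of Perplexity's or Nvidia's implementation.

```python
# Illustrative sketch only: block-scaled rounding to the FP4 (E2M1) value grid.
# BLOCK_SIZE and the plain float scale are assumptions made for this example.

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # hypothetical number of values sharing one scale


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from the FP4 grid."""
    peak = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = peak / FP4_MAGNITUDES[-1] if peak > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a scale and FP4 grid codes."""
    return [scale * c for c in codes]


def quantize(values, block_size=BLOCK_SIZE):
    """Split a flat list into blocks and quantize each block independently."""
    return [
        quantize_block(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]
```

Each stored value needs only 4 bits plus a share of the per-block scale, which is where the memory and bandwidth savings come from. The price is rounding error: 0.4 in a block whose peak is 6.0 comes back as 0.5.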

Key facts

  • 3x throughput gain claimed for 70B models on Blackwell.
  • FP4 quantization and custom memory scheduling cited as methods.
  • Perplexity operates its own inference stack.
  • Nvidia markets Blackwell at 2x over Hopper for transformers.
  • Perplexity did not disclose GPU count or capex.

Perplexity AI claims a 3x inference throughput improvement for 70B-parameter models on Nvidia Blackwell GPUs, according to a company announcement highlighted by TipRanks. The gains come from a combination of FP4 quantization and custom memory scheduling that reduces KV-cache pressure, enabling higher batch sizes without degrading latency.
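The link between precision, KV-cache pressure, and batch size can be illustrated with a back-of-envelope memory budget. The sketch below is not Perplexity's scheduler: it only shows how shrinking weights frees memory for more concurrent sequences. The model shape (80 layers, 8 KV heads, head dimension 128), the 180 GiB memory budget, the 4,096-token context, and the single-GPU framing are all hypothetical assumptions, and activations and runtime overhead are ignored.

```python
# Back-of-envelope sketch: how weight precision affects room for KV cache.
# Every constant except the 70B parameter count is a hypothetical assumption.

GIB = 1024 ** 3

PARAMS = 70e9            # 70B parameters (from the article)
HBM_BYTES = 180 * GIB    # assumed memory budget on one accelerator
N_LAYERS = 80            # assumed model depth
N_KV_HEADS = 8           # assumed grouped-query attention
HEAD_DIM = 128           # assumed per-head dimension
CONTEXT_TOKENS = 4096    # assumed tokens cached per sequence


def kv_bytes_per_token(kv_bytes_per_value):
    """Bytes of KV cache for one token: keys and values for every layer and KV head."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * kv_bytes_per_value


def max_concurrent_sequences(weight_bytes_per_param, kv_bytes_per_value):
    """Sequences whose KV cache fits in the memory left over after the weights."""
    weights = PARAMS * weight_bytes_per_param
    free = HBM_BYTES - weights
    if free <= 0:
        return 0
    per_sequence = kv_bytes_per_token(kv_bytes_per_value) * CONTEXT_TOKENS
    return int(free // per_sequence)


fp16_weights = max_concurrent_sequences(2.0, 2.0)   # 16-bit weights, 16-bit KV cache
fp4_weights = max_concurrent_sequences(0.5, 2.0)    # 4-bit weights, 16-bit KV cache
fp4_weights_fp8_kv = max_concurrent_sequences(0.5, 1.0)  # also halve the KV cache
```

Under these assumptions, moving weights from 16-bit to 4-bit roughly triples the number of sequences that fit, and halving the KV-cache precision doubles it again. Whether Perplexity quantizes the KV cache at all is not stated in the announcement.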

Why this matters more than the press release suggests

Perplexity operates its own inference stack rather than relying solely on third-party APIs from providers such as OpenAI or Anthropic. This infrastructure bet is central to its competitive positioning: the company has long argued that vertical integration from search index to model inference lets it optimize across the stack. The Blackwell numbers are the first concrete evidence it has offered that the thesis is paying off.

The 3x figure is notable because it exceeds typical Blackwell uplift claims. Nvidia itself has marketed Blackwell as delivering 2x inference performance over Hopper for transformer models [per Nvidia's GTC 2026 presentations]. If both figures are measured against the same Hopper baseline, Perplexity's result is 1.5x Nvidia's, a 50% gain on top of the marketed uplift. That gap suggests software-level optimizations (likely the use of FP4 and custom scheduling) that go beyond what Nvidia's reference stack provides.
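The comparison between the two figures is a ratio of speedups, and can be checked in a few lines. This sketch assumes both multiples are measured against the same Hopper baseline, which the announcement does not confirm; the function and variable names are illustrative.

```python
def relative_uplift(claimed_multiple, baseline_multiple):
    """Ratio of two speedups that share the same reference hardware."""
    if baseline_multiple <= 0:
        raise ValueError("baseline_multiple must be positive")
    return claimed_multiple / baseline_multiple


perplexity_vs_hopper = 3.0  # multiple claimed by Perplexity
nvidia_vs_hopper = 2.0      # multiple from Nvidia's marketing

uplift = relative_uplift(perplexity_vs_hopper, nvidia_vs_hopper)
percent_over_nvidia = (uplift - 1.0) * 100.0
```

If the two baselines differ (for example, different Hopper SKUs, batch sizes, or precision settings), the ratio is not meaningful.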

Perplexity did not disclose the specific number of Blackwell GPUs deployed or the total capital expenditure for the rollout. The company has previously stated it runs its own GPU clusters, but has not published cluster sizes or utilization rates.

Context and comparisons

The announcement comes amid a broader infrastructure push by AI-native search companies. Google Cloud, which competes with Perplexity through its Vertex AI and Gemini APIs, has also highlighted Blackwell-based gains for its own models [per Google Cloud's May 2026 blog posts]. But Google's advantage is scale and proprietary TPU hardware; Perplexity's is lean optimization for a single use case.

Perplexity's approach mirrors what companies like Groq and Cerebras have done with custom inference hardware, but on Nvidia's general-purpose GPUs. The question is whether the 3x gain holds at production scale across diverse query patterns, not just benchmark workloads.

What to watch

Watch for independent benchmarks from MLPerf Inference or similar third-party evaluations, which would provide outside validation; Perplexity has not submitted its Blackwell deployment to MLPerf. Also watch for whether Perplexity discloses GPU cluster size and utilization rates in its next public update. Those numbers would show whether the throughput gain translates into real cost savings or is a peak-performance lab result. Competitors like Groq and Cerebras may respond with benchmarks of their own.


Sources cited in this article

  1. Nvidia's GTC
  2. Perplexity AI

AI-assisted reporting. Generated by gentic.news from 3 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

Perplexity's 3x throughput claim on Blackwell is strategically timed. The company has positioned itself as an infrastructure-first AI search player, differentiating from API-dependent competitors like You.com or the search features built into ChatGPT. The claim exceeds Nvidia's own 2x marketing for Blackwell over Hopper, which suggests either genuine software innovation or benchmark cherry-picking.

What's missing is transparency. Perplexity did not disclose GPU count, model variants tested, or whether the 3x applies to median latency, P99, or throughput under specific batch sizes. Without those details, the claim is a marketing signal, not an engineering data point.

The structural angle: Perplexity is betting that vertical integration (owning the search index, the retrieval pipeline, and the inference stack) lets it win on cost-per-query. Blackwell's FP4 support is a natural fit for search workloads where precision trade-offs are acceptable. If Perplexity can deliver sub-100ms responses at 3x lower cost than competitors renting API access, it creates a moat. But the company needs to show unit economics, not just throughput multipliers.
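To see why a throughput multiplier alone does not settle unit economics, consider a rough cost-per-query calculation. This is a sketch under stated assumptions: the hourly prices, query rates, and utilization values are hypothetical placeholders, since Perplexity has disclosed none of them.

```python
def cost_per_query(gpu_hour_usd, peak_qps, utilization=1.0):
    """USD per query for one GPU, given its peak rate and the fraction of time it is busy."""
    if peak_qps <= 0:
        raise ValueError("peak_qps must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    queries_per_hour = peak_qps * utilization * 3600
    return gpu_hour_usd / queries_per_hour


# Hypothetical inputs: the newer GPU is 3x faster but assumed to cost 2x per hour.
older = cost_per_query(gpu_hour_usd=2.0, peak_qps=10.0)
newer = cost_per_query(gpu_hour_usd=4.0, peak_qps=30.0)
cost_ratio = newer / older  # below 1.0 means cheaper per query

# The same newer GPU at 50% utilization loses part of that advantage.
newer_half_busy = cost_per_query(gpu_hour_usd=4.0, peak_qps=30.0, utilization=0.5)
```

With these placeholder numbers, 3x throughput at 2x the hourly price cuts cost per query by about a third, not by two thirds, and low utilization can erase the gain entirely. That is why GPU counts and utilization rates matter as much as the headline multiplier.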