Perplexity serves Qwen3 235B models on Nvidia GB200 racks, showing major inference gains

Perplexity AI is now running massive language models on Nvidia’s newest hardware, and the performance jump is hard to ignore. The company has published technical research detailing its deployment of post-trained Qwen3 235B mixture-of-experts (MoE) models on Nvidia’s Blackwell-generation GB200 NVL72 racks, showing substantial improvements in both speed and cost over the previous Hopper-generation systems.

What Perplexity actually built

The setup involves GB200 NVL72 racks, each packing 72 GPUs with 180 GB of high-bandwidth memory apiece. Those GPUs are wired together via 72-way NVLink, delivering 1,800 GB/s of bandwidth between them.

Here’s where the numbers get interesting. Latency for NVLink all-reduce operations dropped from 586.1 microseconds on the H200 (Hopper) to 313.3 microseconds on the GB200. That’s a 46% reduction. MoE prefill combine time fell from 730.1 microseconds to 438.5 microseconds, roughly a 40% improvement.
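Those percentages follow directly from the reported microsecond figures — a quick sanity check (using only the numbers quoted above):

```python
# Latency figures quoted in the article, in microseconds.
h200_allreduce_us = 586.1   # NVLink all-reduce on H200 (Hopper)
gb200_allreduce_us = 313.3  # same operation on GB200 (Blackwell)

h200_combine_us = 730.1     # MoE prefill combine on H200
gb200_combine_us = 438.5    # same operation on GB200

def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100

print(f"All-reduce:  {reduction_pct(h200_allreduce_us, gb200_allreduce_us):.1f}% lower")
print(f"MoE combine: {reduction_pct(h200_combine_us, gb200_combine_us):.1f}% lower")
```

This prints roughly 46.5% and 39.9%, consistent with the 46% and ~40% figures above.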

Perplexity also reports achieving up to 30x real-time inference capability compared to H100 baselines for certain configurations.

The engineering under the hood

Perplexity’s research highlights several software-level optimizations that squeeze more performance out of the Blackwell architecture. These include Blackwell-native quantization, which reduces the precision of model weights to speed up computation without meaningfully degrading output quality. There’s also prefill/decode disaggregation, a technique that separates the initial processing of a prompt from the token-by-token generation phase. Custom kernels round out the optimization stack, with Perplexity writing specialized code tuned for the specific demands of serving a 235-billion-parameter MoE model on this particular hardware topology.
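The core idea behind weight quantization can be sketched in a few lines. This is a generic, illustrative per-tensor int8 scheme — not Perplexity's actual Blackwell-native implementation, which targets the low-precision formats the hardware accelerates natively:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: int8 values plus one float scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Each weight now costs 1 byte instead of 4, and the worst-case
# rounding error per element is half the quantization step.
print(float(np.abs(w - w_hat).max()))
```

The storage and bandwidth savings are what matter at serving time: smaller weights mean more of the model fits in each GPU's high-bandwidth memory and less data moves per token generated.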

The combination of hardware and software improvements means the GB200 NVL72 setup significantly lowers inference costs while improving output quality compared to Hopper-based systems.

Why this matters for the broader AI hardware race

This deployment strengthens Nvidia’s position against alternatives like AMD’s MI300X and AWS’s custom Trainium chips. The 72-GPU NVLink topology delivering 1,800 GB/s bandwidth is particularly significant, as competing solutions often rely on slower interconnects between chips, which creates bottlenecks when serving models that need to coordinate across many GPUs simultaneously.



