Perplexity serves Qwen3 235B models on Nvidia GB200 racks, showing major inference gains

Perplexity AI is now running massive language models on Nvidia’s newest hardware, and the performance jump is hard to ignore. The company has published technical research detailing its deployment of post-trained Qwen3 235B mixture-of-experts (MoE) models on Nvidia’s Blackwell-generation GB200 NVL72 racks, showing substantial improvements in both speed and cost over the previous Hopper-generation systems.

What Perplexity actually built

The setup involves GB200 NVL72 racks, each packing 72 GPUs with 180 GB of high-bandwidth memory apiece. Those GPUs are wired together via 72-way NVLink, delivering 1,800 GB/s of bandwidth between them.

Here’s where the numbers get interesting. Latency for NVLink all-reduce operations dropped from 586.1 microseconds on the H200 (Hopper) to 313.3 microseconds on the GB200. That’s a 46% reduction. MoE prefill combine time fell from 730.1 microseconds to 438.5 microseconds, roughly a 40% improvement.
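Those percentages follow directly from the reported microsecond figures — a quick sanity check (using only the numbers quoted above):

```python
# Latency figures quoted in the article, in microseconds.
h200_allreduce_us = 586.1   # NVLink all-reduce on H200 (Hopper)
gb200_allreduce_us = 313.3  # same operation on GB200 (Blackwell)

h200_combine_us = 730.1     # MoE prefill combine on H200
gb200_combine_us = 438.5    # same operation on GB200

def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100

print(f"All-reduce:  {reduction_pct(h200_allreduce_us, gb200_allreduce_us):.1f}% lower")
print(f"MoE combine: {reduction_pct(h200_combine_us, gb200_combine_us):.1f}% lower")
```

This prints roughly 46.5% and 39.9%, consistent with the 46% and ~40% figures above.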

Perplexity also reports achieving up to 30x real-time inference capability compared to H100 baselines for certain configurations.

The engineering under the hood

Perplexity’s research highlights several software-level optimizations that squeeze more performance out of the Blackwell architecture. These include Blackwell-native quantization, which reduces the precision of model weights to speed up computation without meaningfully degrading output quality. There’s also prefill/decode disaggregation, a technique that separates the initial processing of a prompt from the token-by-token generation phase. Custom kernels round out the optimization stack, with Perplexity writing specialized code tuned for the specific demands of serving a 235-billion-parameter MoE model on this particular hardware topology.
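The core idea behind weight quantization can be sketched in a few lines. This is a generic, illustrative per-tensor int8 scheme — not Perplexity's actual Blackwell-native implementation, which targets the low-precision formats the hardware accelerates natively:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: int8 values plus one float scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Each weight now costs 1 byte instead of 4, and the worst-case
# rounding error per element is half the quantization step.
print(float(np.abs(w - w_hat).max()))
```

The storage and bandwidth savings are what matter at serving time: smaller weights mean more of the model fits in each GPU's high-bandwidth memory and less data moves per token generated.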

The combination of hardware and software improvements means the GB200 NVL72 setup significantly lowers inference costs while improving output quality compared to Hopper-based systems.

Why this matters for the broader AI hardware race

This deployment strengthens Nvidia’s position against alternatives like AMD’s MI300X and AWS’s custom Trainium chips. The 72-GPU NVLink topology delivering 1,800 GB/s bandwidth is particularly significant, as competing solutions often rely on slower interconnects between chips, which creates bottlenecks when serving models that need to coordinate across many GPUs simultaneously.



