This week’s MLPerf training results from MLCommons are a mix of old and new. As is usual, Nvidia is the dominant merchant supplier, and this round includes the first Nvidia Blackwell and Google Trillium results.
In past MLPerf rounds, Intel supplied Habana Gaudi 2 scores, albeit only for select tests (GPT-3 and Llama 2). Although the Intel Gaudi 3 is the new hotness, optimizing setups and running benchmarks consumes resources better allocated elsewhere given how few customers buy Gaudi for training. Likewise, while AMD has emerged as Nvidia’s best merchant-market competitor, inference customers dominate its roster. Its absence from this MLPerf round indicates training isn’t the company’s focus.
The only real Nvidia alternative this round was Google, which posted TPU v5p scores and previewed Trillium. The hyperscaler previously supplied only GPT-3 performance for the TPU v5p. This round, Google showed a 5.7% speedup on that test for a 6,144-accelerator TPU v5p system and reported Stable Diffusion training times. For Trillium, the company reported GPT-3 results that are only 8% faster than the v5p’s in a 2,048-accelerator configuration. That’s a disappointing speedup considering Google claimed Trillium would deliver a 4.7× speedup over the TPU v5e (which is roughly half as powerful as the v5p). Systems based on both NPUs show good scaling, reducing training time roughly in line with the number of AI chips employed.
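To put that Trillium number in context, a rough back-of-the-envelope sketch helps; it assumes the v5e is about half as fast as the v5p (per the comparison above), which is an approximation rather than a measured figure:

```python
# Illustrative check of the Trillium GPT-3 result.
# The 0.5 v5e-to-v5p ratio is an assumption based on "roughly half as powerful."
v5e_vs_v5p = 0.5                  # assumed relative performance of TPU v5e vs. v5p
claimed_trillium_vs_v5e = 4.7     # Google's claimed Trillium speedup over the v5e

# If the 4.7x claim held for GPT-3 training, the implied gain over the v5p would be:
implied_vs_v5p = claimed_trillium_vs_v5e * v5e_vs_v5p   # ~2.35x
measured_vs_v5p = 1.08                                   # 8% faster, per the MLPerf result

print(f"Implied speedup over v5p:  {implied_vs_v5p:.2f}x")
print(f"Measured speedup over v5p: {measured_vs_v5p:.2f}x")
```

By that sketch, the measured gain is well under half of what the marketing claim implies, hence our disappointment.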
Nvidia Previews Blackwell
Nvidia previewed Blackwell, showing an eight-GPU system with 180 GB of memory per device delivering a 1.5× speedup over a system with eight H200s and 141 GB per GPU (and 1.7× over an H100 system with 80 GB per GPU). We had expected Blackwell to deliver a 3× gain. That said, on the Llama 2 fine-tuning test, Blackwell was more than twice as fast as the H100, almost in line with its raw performance advantage. However, Nvidia continually improves its submissions each MLPerf round, and we expect Blackwell to score better in future rounds.
Regarding scaling, Nvidia posted GPT-3 results for an 11,616-GPU system based on the H100. It’s only 14× faster than a 512-GPU system despite having nearly 23× as many GPUs, showing the diminishing returns from scaling up. Similarly, a 1,024-GPU system wasn’t even four times faster than a 64-GPU one on the Llama 2 test, despite a 16× increase in GPU count.
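A quick scaling-efficiency calculation, a sketch using only the MLPerf figures cited above, makes the falloff concrete:

```python
# Scaling efficiency = measured speedup / ideal (linear) speedup,
# using the GPT-3 and Llama 2 figures cited above.
def scaling_efficiency(small_gpus: int, large_gpus: int, measured_speedup: float) -> float:
    ideal = large_gpus / small_gpus
    return measured_speedup / ideal

# GPT-3: 11,616 GPUs vs. 512 GPUs, measured 14x faster
print(f"GPT-3:   {scaling_efficiency(512, 11_616, 14):.0%} of linear scaling")   # ~62%

# Llama 2: 1,024 GPUs vs. 64 GPUs, measured under 4x faster (4x used as an upper bound)
print(f"Llama 2: <{scaling_efficiency(64, 1_024, 4):.0%} of linear scaling")     # <25%
```

In other words, the big GPT-3 run retains roughly 60% of ideal scaling, while the Llama 2 run falls below a quarter.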
Side Notes
As an aside, one company did submit scores for an AMD GPU, but it was an RX 7900 XTX desktop GPU. Certainly, AI developers employ consumer graphics cards, but the real battle is for data-center dominance. Compared with the same submitter’s BERT results using the comparable Nvidia RTX 4090 desktop card, the AMD hardware was 10% slower, which isn’t bad.
Also, comparing Google’s and Nvidia’s accelerators borders on meaningless; each company has different objectives, and data for similar system configurations is sparse. With those caveats, the H200 is twice as fast as the TPU v5p.
Bottom Line
Developers training large language models will employ either Nvidia-based systems or systems based on hyperscalers’ own chips, such as a Google TPU or Amazon Trainium. Nvidia is the only company to consistently show it can train a variety of models, steadily improve performance, and scale up. Heretofore, Google has mainly reserved its TPUs for internal workloads, but it’s likely to draw more external customers to Trillium. Its success won’t depend so much on single-chip performance as on software support, cost, and the ability to train models. Amazon, like Nvidia’s merchant competitors, undermines its credibility by opting out of MLPerf showdowns.