RGAT diagram from MLCommons

Nvidia Blackwell Shines, AMD MI325X Debuts in Latest MLPerf


MLCommons has released a new round of MLPerf data-center inference results (version 5.0). Also known as The Nvidia Show, the semiannual benchmark report adds new tests in this edition along with results from several other companies.

Nvidia Dominance

Many submitters employ Nvidia GPUs, but we’ll focus on the data Nvidia directly submitted. The company reported scores for the Hopper H100, the H200, the Grace Hopper GH200, the Blackwell B200, the Grace Blackwell NVL72 system, and the Grace Blackwell GB200. Nvidia generally runs more tests than other companies but only ran a subset this round.

  • The H100 update covers only the Llama 2-70B test. Nvidia raised its server-mode score by 44% compared with the previous MLPerf 4.1 round; offline-mode gains were smaller. The company likely bothered to up its score because it anticipated AMD submitting MI325X results (see below). The H100 outpaced the MI325X in server mode but fell short in the offline configuration, an excellent showing for an older chip with less memory and lower peak compute throughput.
  • Integrating more HBM behind a faster interface than the H100, the H200 delivered an average 21% speedup on four tests relative to the H100’s MLPerf 4.1 scores, though only a 6% gain over the new H100 Llama 2-70B result described above. The double-digit gain speaks to the bottleneck memory poses to AI computation.
  • On the same four tests, the eight-GPU H200 board (with an Intel Xeon host) scored 7.8× higher than the GH200, which pairs a single H200 with the Arm-based Grace host. At this scale, speedups are typically linear with GPU count, so an 8.0× gain would be reasonable. Grace Hopper’s fast GPU-CPU interface and other optimizations, however, boost the single-chip system’s performance, partially offsetting the eight-GPU board’s advantage from its bigger combined memory pool.
  • Exhibiting slightly better-than-linear scaling, eight B200s were 8.15× as fast as a single B200 on Llama 2-70B, the only test with comparable data. The superlinear speedup could indicate that a single GPU can’t hold all the required data locally.
  • The B200 proved 2.4× faster than the H200 on a six-test blend, consistent with the speedup we predicted in our initial Nvidia Blackwell coverage.
  • Continuing the scaling theme and evidencing Nvidia’s system-design skills, the 72-B200 NVL72 rack delivered 10.5× the throughput of a single eight-B200 DGX system on Llama 3.1-405B in server mode and 9× in offline mode, matching the 9:1 GPU ratio. A big model like this is well suited to the larger system, which keeps its GPUs in a single domain. Nvidia also supplied unofficial results for Llama 2-70B, where it achieved almost perfect 8.8× scaling. The sketch after this list makes the scaling arithmetic explicit.
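
To make the scaling claims above concrete, here is a minimal sketch of the arithmetic, using only the speedup ratios quoted in this article; the function name is ours, and no raw MLPerf throughput figures appear.

```python
def scaling_efficiency(measured_speedup: float, gpu_count_ratio: float) -> float:
    """Fraction of ideal (linear) scaling achieved.

    measured_speedup: throughput of the larger system divided by that of
                      the smaller system.
    gpu_count_ratio:  GPU count of the larger system divided by that of
                      the smaller system (the ideal speedup).
    """
    return measured_speedup / gpu_count_ratio

# NVL72 rack (72 GPUs) versus an eight-B200 DGX, Llama 3.1-405B:
print(scaling_efficiency(10.5, 72 / 8))  # ~1.17: superlinear, server mode
print(scaling_efficiency(9.0, 72 / 8))   # 1.00: linear, offline mode
print(scaling_efficiency(8.8, 72 / 8))   # ~0.98: unofficial Llama 2-70B result

# Eight B200s versus one B200, Llama 2-70B:
print(scaling_efficiency(8.15, 8 / 1))   # ~1.02: slightly superlinear
```

Values near 1.0 reflect the linear scaling one expects at these GPU counts; values above 1.0 suggest the smaller configuration was constrained, for example by per-GPU memory.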

AMD Shows Off the MI325X

  • The MI325X debuted in this MLPerf edition; it increases HBM capacity and interface speed compared with its predecessor, the MI300X. AMD supplied results only for Llama 2-70B, roughly tying the H100 and falling short of the H200. We believe Meta uses the MI325X (or at least the MI300X) to serve that model to its customers. The chip company did not, however, supply other scores; optimizing MLPerf submissions consumes application-engineering resources, and we infer AMD favors serving existing customers, such as Meta, over attracting new ones with competitive results.
  • MangoBoost evaluated a four-node system with eight MI300X GPUs per node. Dividing its results by four and comparing with AMD’s eight-MI325X results shows the newer GPU to be 10% faster (see the sketch following this list).
  • The MLPerf results indicate the host processor employed, and AMD Epyc is popular. Among the twelve companies with more than one submission, seven employed both Intel and AMD processors. Unsurprisingly, those favoring one x86 supplier or the other include Nvidia (which competes with AMD on the GPU front), Nvidia-aligned CoreWeave, and AMD.
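
A short sketch shows the normalization behind the MangoBoost comparison; the throughput numbers below are placeholders chosen only to illustrate the arithmetic, not actual MLPerf scores.

```python
# Hypothetical throughputs (tokens/s); placeholders, not MLPerf data.
mangoboost_4node_mi300x = 40_000.0  # four nodes, eight MI300X GPUs each
amd_1node_mi325x = 11_000.0         # one node, eight MI325X GPUs

# Normalize the four-node result to a single eight-GPU node.
mi300x_per_node = mangoboost_4node_mi300x / 4

# Per-node advantage of the newer GPU.
advantage = amd_1node_mi325x / mi300x_per_node - 1
print(f"MI325X advantage per eight-GPU node: {advantage:.0%}")
```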

Other MLPerf 5.0 Notables

  • Google updated its TPU v6e (Trillium) Stable Diffusion score, which it previously previewed. Per our comments then, it’s impossible to draw conclusions beyond disappointment in Google’s benchmarking thoroughness.
  • Intel provided an official Granite Rapids submission covering a variety of tests but not the newer ones. It’s the only server-processor company to integrate matrix units into its CPUs, so its chips will deliver the best performance for customers executing AI models on a host instead of an accelerator.
  • Broadcom and Dell posted scores for virtualized H100 GPUs. Performance matched that of a standard H100 setup. It’s a promising result, but firm conclusions require specific tests exercising virtualization.
  • Results on the CNN tests (ResNet, RetinaNet, and 3D U-Net) are getting sparse as CNNs are a solved problem in the data center and don’t motivate XPU scaling.
  • Fujitsu supplied power results for the Primergy CDI two-Xeon plus eight-H100 system. With no other power data this round, our sole conclusion is that at 5 kW, AI systems need a lot of power.
  • MLCommons added four benchmarks to the MLPerf suite for data-center inference: Llama 3.1 405B (to assess performance on a big model with a large context window), Llama 2 70B Interactive (to assess latency-constrained performance), the RGAT graph neural network (to introduce a new workload type), and the automotive-centric Point Painting 3D object detector (also to introduce a new workload type).

Takeaways

  • A comprehensive benchmark, MLPerf is valuable to customers evaluating their AI-acceleration options.
  • Nvidia is firmly in the lead, not just in performance but also in test coverage. Nobody else comes close. Incomplete submissions or total absence casts doubt on competitors.
  • Each round suppliers raise performance through software improvements, lending truth to our maxim that software is never finished.
  • Power efficiency is an underexplored differentiator.
  • MLCommons has done a good job adding tests but should consider pruning the CNNs from the data-center category, leaving them for other categories to cover. Similarly (but not alluded to above), some tests have 99% and 99.9% accuracy thresholds. The former are no longer useful.
