
MLPerf Reveals Blackwell Gains, Untether’s Power Efficiency

Easy lies the head.

Recent MLPerf 4.1 data-center inference results include new tests, vendors, and AI accelerators (NPUs/GPUs). Organized by MLCommons members, MLPerf is a set of benchmarks covering different scenarios such as data-center, edge, and mobile inference and data-center training. The benchmarks include a closed class that specifies models, metrics, data sets, and accuracy.

MLPerf 4.1 Notables

  • MLCommons updated the benchmark to include a mixture-of-experts (MoE) test and changed the metric for large language model (LLM) tests GPT-J and Llama2-70B to tokens per second.
  • AMD finally debuted the MI300X but reported only Llama scores, including a preview of a system employing the upcoming Epyc Turin. The latter boosted throughput by less than 5%. In latency-constrained server mode, the MI300X system performed well below the Nvidia Grace Hopper H200 system. Moreover, the incomplete submission leads us to draw a generally negative inference about performance.
  • Google previewed TPU v6 Trillium performance alongside the now-available TPU v5. The new chip is about 3× faster than its predecessor on the only test for which Google submitted results, Stable Diffusion. Considering LLMs and recommendation models are probably the TPUs’ dominant workloads, it’s disappointing not to see results for them.
    • Comparisons with other NPUs aren’t entirely fair because Google favors TCO (energy use) more than merchant-market suppliers, which seek to maximize performance. Nonetheless, the Nvidia H200 is about 1.9× faster than Trillium at Stable Diffusion.
  • Untether previewed SpeedAI240, its newest NPU. Among other changes, it adds LPDDR interfaces and supports a meager 64 GB of DRAM. The original Untether architecture didn’t support any external memory, limiting model size. Although the chip’s product brief includes estimates for ResNet-50 and Bert, the company only reported ResNet-50 results. As with AMD, the incomplete submission leads us to a negative inference. Employing software from Krai, the SpeedAI240’s ResNet performance is good, about 88% of the Nvidia H200’s.
    • Power efficiency, however, is excellent. Only Nvidia and Untether reported results on the MLPerf power tests, which consider power at the wall for the whole system. Employing “Slim” 75 W cards, the Untether submission provided 60–65% of the performance of the Nvidia H200 contender at about 20% of the power.
  • Nvidia submitted the most comprehensive MLPerf data. Moreover, other companies, including Google, provide scores for Nvidia-powered systems. We hope for a full set of Blackwell results when it qualifies for production. For now, Nvidia previewed its Llama performance, showing Blackwell to be 2.7× faster than the standard H200. (The 1,000 W CTS H200 version narrows the gap slightly.) These gains are consistent with our expectations.
    • Integrating more memory, the Nvidia H200 is an update to the H100. Its speedup ranges from nothing on GPT-J to 35% on Llama2-70B. Results for the GH200 Grace Hopper module, which combines the H200 and Nvidia’s Arm-compatible Grace processor, show the GH200 delivering inference performance similar to an x86 system with H200-based SXM cards.
  • Intel is unusual in integrating matrix-multiplication offloads into its server processors’ CPUs. The MLPerf 4.1 data shows the forthcoming 128-core Granite Rapids Xeon scoring 85% higher than a 64-core current-generation Emerald Rapids Xeon. Without knowing Granite’s clock rate and other details, it’s hard to say anything beyond that more cores mean more throughput.
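The Untether efficiency claim above reduces to simple arithmetic: relative throughput divided by relative wall power. A minimal sketch using the approximate figures reported here (the helper name is our own, for illustration):

```python
# Illustrative perf-per-watt arithmetic using the figures quoted above:
# the SpeedAI240 Slim cards delivered roughly 60-65% of the H200 system's
# ResNet-50 throughput at about 20% of its measured wall power.
def perf_per_watt_ratio(relative_perf: float, relative_power: float) -> float:
    """Efficiency advantage = relative throughput / relative power."""
    return relative_perf / relative_power

low = perf_per_watt_ratio(0.60, 0.20)
high = perf_per_watt_ratio(0.65, 0.20)
print(f"SpeedAI240 perf/W advantage: {low:.2f}x to {high:.2f}x")
```

By this rough measure, the Untether submission delivers about 3.0–3.25× the performance per watt of the H200 system, consistent with the “vastly better power efficiency” conclusion below.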

Additional MLPerf 4.1 Takeaways

  • MLCommons continues to be a valuable resource for comparing data-center NPUs.
  • Every supplier continues to improve its performance.
  • Incomplete submissions may indicate that suppliers either lack the resources to tune models or that their hardware isn’t broadly competitive.
  • No company has shown results that challenge Nvidia’s performance crown.
  • Despite its products’ performance, Nvidia’s GPU architecture has important shortcomings—a point highlighted by Untether’s vastly better power efficiency.
