Hoping the Third Time is the Charm, Intel Samples Gaudi 3

Tomorrow we will do beautiful things.

May 2, 2024

Note: an abbreviated version of this post appeared on April 11.

Intel has begun sampling Gaudi 3, an AI engine (NPU) for hyperscaler and enterprise data centers. In rough terms, it doubles most Gaudi 2 resources but quadruples peak BF16 matrix throughput. According to Intel, Gaudi 3 is 50% faster than the Nvidia H100 on select inference and training workloads.

Notables

Raw performance—Although Intel Gaudi 3 delivers much greater peak BF16 throughput than its predecessor, it falls 28% short of Nvidia Blackwell. It does, however, top the already-shipping AMD MI300X, offering 1.4× the peak throughput. Relative performance for other data types and actual models will differ, but BF16 is representative.
MLPerf—Intel and Nvidia are the only companies to submit results for all tests on the MLCommons MLPerf data-center benchmarks, proving their chip and software capabilities. Other companies either cherry-pick tests or don’t submit results. We look forward to seeing results for Gaudi 3 and Blackwell.
Architecture—Differing from Nvidia’s architecture and bearing a similarity to Google’s TPU, Gaudi employs dozens of VLIW cores with SIMD units and several large matrix engines.
End of the road—Intel’s next data-center NPU, Falcon Shores, will adopt a different architecture, employing the vector-engine-based computing engines from the company’s data-center GPUs in place of Gaudi’s VLIW cores. The company will have a single project instead of separate data-center NPU and GPU efforts.
Codecs—Gaudi integrates codecs, an advantage in systems analyzing images or video.
Interconnect—Gaudi employs Ethernet for chip-to-chip and board-to-board interconnect, whereas other designs have proprietary high-speed local interconnect and only use Ethernet (or InfiniBand) to link clusters. Ethernet chip-to-chip links have other proponents; for example, Tenstorrent’s Jim Keller has voiced his support. Roughly proportional to its relative computational performance, Gaudi 3’s networking bandwidth falls short of that of Nvidia Blackwell, which employs the proprietary NVLink. Intel has reported MLPerf results for a 384-accelerator Gaudi 2 cluster, leaving open the question of whether the architecture scales to the thousands of nodes typical of large training systems. Addressing this concern, Intel promises clusters can scale to 1,024 nodes with eight Gaudi 3s each, or 8,192 accelerators in total.
PCIe—Like competing products, Gaudi accelerators sport a PCIe interface to a host processor. The PCIe interface facilitates single-Gaudi add-in-card designs, which Intel plans to offer for the first time with Gaudi 3. This should be sufficient for enterprise customers and developers requiring less than 2 petaflops.
Enterprise—Further to its strategy of targeting corporate customers, Intel announced Dell, Hewlett Packard Enterprise, and Lenovo will join Supermicro in offering Gaudi 3 systems. Because many hyperscalers deploy Nvidia and proprietary AI accelerators, leaving little opportunity for Intel, it makes sense for the company to address a different segment.
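The ratios quoted in the notes above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it normalizes peak BF16 throughput to Gaudi 3 = 1.0 and uses nothing but the figures stated in this post (roughly 4× Gaudi 2, 1.4× the MI300X, 28% short of Blackwell, and 1,024 nodes of eight accelerators). The variable and function names are hypothetical, and no absolute TFLOPS values are assumed.

```python
# Relative peak BF16 throughput, normalized to Gaudi 3 = 1.0.
# Only ratios quoted in the post are used; absolute TFLOPS are not assumed.
GAUDI3 = 1.0

relative_peak = {
    "Intel Gaudi 2": GAUDI3 / 4,              # Gaudi 3 roughly quadruples Gaudi 2
    "AMD MI300X": GAUDI3 / 1.4,               # Gaudi 3 has 1.4x the MI300X peak
    "Intel Gaudi 3": GAUDI3,
    "Nvidia Blackwell": GAUDI3 / (1 - 0.28),  # Gaudi 3 falls 28% short of Blackwell
}


def cluster_accelerators(nodes: int, per_node: int) -> int:
    """Total accelerators in a cluster of identical nodes."""
    return nodes * per_node


# Print the chips from slowest to fastest relative peak throughput.
for name, value in sorted(relative_peak.items(), key=lambda kv: kv[1]):
    print(f"{name:<18}{value:5.2f}x Gaudi 3")

max_cluster = cluster_accelerators(nodes=1024, per_node=8)
print(f"Largest promised cluster: {max_cluster:,} accelerators")
print(f"Versus the 384-accelerator MLPerf cluster: {max_cluster / 384:.1f}x")
```

Read this way, a 28% shortfall means Blackwell's peak is about 1.39 times that of Gaudi 3, and the promised maximum cluster holds 8,192 accelerators, roughly 21 times the size of the Gaudi 2 cluster for which MLPerf results exist.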

Competition

Intel faces a Red Queen problem, trying to catch up to Nvidia while Team Green speeds ahead. Meanwhile, AMD and other competitors are also racing. Given Nvidia’s lead, the real contest is for second place, at least if the race is to win sockets in hyperscalers’ training clusters. Intel’s position doesn’t look good. But in the long run, enterprise customers may prove equally lucrative, and Intel is ahead in targeting them.

Customers

For data scientists operating at a high level and therefore agnostic to chip architecture, the Intel Gaudi 3 is a viable alternative to NPUs/GPUs from Nvidia and other suppliers. However, customers tuning software to the underlying hardware will avoid Gaudi 3 because Intel will develop no further chips based on its architecture. The company is already talking about Falcon Shores 2 before having delivered the first version, taking a page from the Osborne playbook to show that it has a long-term roadmap.

Bottom Line

The Gaudi architecture has proven itself technically capable but has seen relatively little adoption. The Intel Gaudi 3 is unlikely to improve the situation. Although it boosts performance, competing chips are doing the same, and Intel will move to a new architecture in the next generation.

Posted May 2, 2024 in New Product Analysis by Joseph Byrne

Tags: data center, Gaudi, Intel, NPU (AI accelerator)