Last week, organizations presented their newest processor technologies at Hot Chips. Below is an observation from each Day 2 presentation other than the keynotes. Highlights:
- Meta MTIA has a clear purpose and publicly disclosed technical details, two things Microsoft's Maia lacks.
- Enfabrica makes the case for rethinking AI-cluster networking.
- Open-source, high-performance RISC-V CPUs dim financial prospects for startups targeting this space.
- We remain skeptical of Ampere Computing.
AMD Versal Gen 2
Years ago, when Xilinx introduced the first Versal products, it accelerated the trend of FPGAs becoming much more than arrays of uncommitted logic gates. Like its predecessor, the second Versal generation integrates a CPU subsystem, video codecs, controllers for Ethernet, PCIe, and DRAM, and other functions in addition to the usual FPGA blocks. By integrating so much, a Versal device should reduce system power, board area, and complexity compared with a design employing separate ASICs and embedded processors, a reversal of traditional FPGA characteristics.
While we can’t help but think the average FPGA customer mostly wants a pile of gates, advanced driver assistance (ADAS) drives Versal volumes to a degree the average FPGA customer cannot. AMD’s presentation highlighted the company’s automated-parking solution and Versal’s ability to multiplex AI models in space (i.e., by dividing chip resources among models running concurrently) and in time (i.e., by periodically switching among models). Although AMD presented nothing new, ADAS is a refreshingly practical application of computer vision and AI at a time when parlor tricks garner a lot of attention.
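To make the space/time distinction concrete, here is a toy Python sketch of the two policies for a hypothetical pool of compute tiles; the tile count, model names, and scheduling logic are our invention, not AMD's scheduler or API.

```python
# Toy illustration of spatial vs. temporal multiplexing of AI models
# (our sketch, not AMD's implementation). TOTAL_TILES and the model names
# are invented for the example.

TOTAL_TILES = 32
MODELS = ["parking_net", "surround_view_net"]

def space_multiplex(models, total_tiles):
    """Every model runs concurrently on its own slice of the chip."""
    share = total_tiles // len(models)
    return {m: share for m in models}

def time_multiplex(models, total_tiles, slot):
    """One model at a time gets the whole chip; slots rotate periodically."""
    active = models[slot % len(models)]
    return {active: total_tiles}

print(space_multiplex(MODELS, TOTAL_TILES))     # {'parking_net': 16, 'surround_view_net': 16}
print(time_multiplex(MODELS, TOTAL_TILES, 0))   # {'parking_net': 32}
print(time_multiplex(MODELS, TOTAL_TILES, 1))   # {'surround_view_net': 32}
```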
Stanford Onyx
In HPC and AI, matrices and vectors are often sparse, meaning mostly filled with zeros. Techniques for skipping zeros can reduce computation, power, and memory storage. Stanford presented the fibertree data structure for efficiently storing sparse tensors, along with Onyx, its newest coarse-grain reconfigurable-architecture (CGRA) processor for operating on fibertree data. It’s important research, and Stanford’s newest efforts show improvements over its previous CGRA processor. However, the research’s applicability to commercial products is unclear.
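For readers unfamiliar with the structure, here is a minimal Python sketch of the fibertree idea as we understand it from the sparse-tensor literature: each tensor rank becomes a fiber of (coordinate, payload) pairs, and only nonzeros (and nonempty fibers) are stored. It illustrates the concept, not Stanford's code.

```python
# Minimal fibertree sketch for a 2-D sparse tensor (our illustration).
# Each rank is a "fiber": a list of (coordinate, payload) pairs. At the leaf
# rank the payload is a value; at upper ranks it is the next rank's fiber.

dense = [
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 7.0],
]

def to_fibertree(matrix):
    """Convert a dense 2-D list into a rank-2 fibertree, keeping only nonzeros."""
    root = []
    for i, row in enumerate(matrix):
        fiber = [(j, v) for j, v in enumerate(row) if v != 0.0]
        if fiber:                      # empty rows are not stored at all
            root.append((i, fiber))
    return root

print(to_fibertree(dense))
# [(0, [(1, 3.0)]), (2, [(0, 5.0), (3, 7.0)])]
```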
Meta Next-Gen MTIA
Meta’s presentation stands out by disclosing technical details of an NPU developed with a clear purpose: recommendation models. The new Meta MTIA is similar to the first-generation chip, which was previously described in a paper but never deployed. The new 90 W NPU has an array of processing elements (PEs), each integrating two RISC-V cores and fixed-function units, as Figure 1 shows. Changes from the previous version include more PE-local memory, a higher-bandwidth network-on-chip (NoC) port, and better support for PyTorch eager execution. Meta reports 80% PE utilization, which sounds high, but it’s unclear how this translates to math-unit utilization. An unusual feature is hardware for quantizing on the fly. In deployment, second-gen MTIA chips sit in pairs on a PCIe card, and 12 cards attach to a dual-socket server. A single rack holds 72 NPUs (three such servers).
Figure 1. Meta MTIA block diagram. (Source: Meta via HotChips 2024.)
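As a point of reference for the on-the-fly quantization feature, the sketch below shows the general software equivalent, symmetric int8 quantization with a per-tensor scale; the format and granularity are our assumptions, not Meta's disclosed design.

```python
import numpy as np

# Hedged sketch of symmetric int8 quantization with a per-tensor scale, the
# general operation a hardware quantizer accelerates. Format and granularity
# are assumptions for illustration only.

def quantize_int8(x: np.ndarray):
    """q = round(x / scale), clipped to the int8 range."""
    scale = float(np.max(np.abs(x))) / 127.0
    scale = scale if scale > 0 else 1.0          # guard against an all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

acts = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(acts)
print(float(np.max(np.abs(acts - dequantize(q, s)))))   # worst-case rounding error
```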
Tesla Dojo Protocol
Surprising no one, Tesla thinks it has a better way of doing something. What’s new is that Tesla wants others to do things the same way. Specifically, the company has developed a Layer 4 transport protocol for the Tesla Dojo AI system. Riding atop Ethernet (Layer 2) and optionally coexisting with Internet Protocol (Layer 3), it’s faster and simpler than TCP, lending itself to hardware implementation. Like TCP and unlike UDP, it retransmits to recover lost packets. Congestion management is simple and distributed. Tesla is contributing the protocol to the Ultra Ethernet Consortium, which is developing a multilayer network stack in addition to pushing Ethernet to faster speeds for data-center operators. These operators prefer open standards supported by multiple vendors, inclining them toward Ethernet-based networks and away from Nvidia’s InfiniBand and other technologies.
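To illustrate the TCP-like reliability the protocol retains while shedding much of TCP's machinery, here is a toy Python sketch of sequence-numbered sends, acknowledgments, and timeout-driven retransmission; it shows the general mechanism only and is not Tesla's protocol.

```python
import time

# Toy sketch of the reliability idea behind a TCP-like Layer 4 protocol:
# number every packet, hold it until acknowledged, retransmit on timeout.
# Conceptual illustration only; not Tesla's actual protocol.

class ReliableSender:
    def __init__(self, link_send, timeout=0.05):
        self.link_send = link_send          # callable that puts bytes on the wire
        self.timeout = timeout
        self.next_seq = 0
        self.unacked = {}                   # seq -> (payload, last_send_time)

    def send(self, payload: bytes):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (payload, time.monotonic())
        self.link_send(seq, payload)

    def on_ack(self, seq: int):
        self.unacked.pop(seq, None)         # receiver confirmed delivery

    def tick(self):
        now = time.monotonic()
        for seq, (payload, sent) in list(self.unacked.items()):
            if now - sent > self.timeout:   # assume loss; retransmit
                self.unacked[seq] = (payload, now)
                self.link_send(seq, payload)

sent = []
tx = ReliableSender(lambda seq, payload: sent.append((seq, payload)))
tx.send(b"gradient chunk 0")
tx.tick()            # would retransmit anything unacknowledged past the timeout
tx.on_ack(0)         # receiver's ACK clears the retry state
```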
Enfabrica SuperNIC
Enfabrica discussed its 8 Tbps SuperNIC for AI clusters. It disclosed nothing new but made the case for rethinking how these systems are networked. An Ethernet controller in name, the SuperNIC is equally a switch. As Figure 2 shows, it has multiple Ethernet ports on one side. Internally, these connect to a switch that maps packets to parallel NIC pipelines. On the other side, multiple PCIe ports attach to NPUs/GPUs. Another internal fabric links these ports and maps them to memory, a way station to the Ethernet functions. On the whole, Enfabrica showed how to get more I/O with fewer serdes, how to flatten the cluster topology while improving resiliency, and how the approach could join a half-million accelerators in just two switching layers.
Figure 2. Enfabrica SuperNIC block diagram. (Source: Enfabrica via HotChips 2024.)
Intel Optical Compute Chiplet
A day after Broadcom updated the audience on its copackaged-optics (CPO) approach, Intel updated us on its developments, which build on a 2022 Intel Labs disclosure. The company now reports success connecting Xeons with a single CPO link of eight fibers, each carrying eight wavelengths (64 optical channels in total). Intel’s technology for implementing eight lasers in the CPO component sets it apart; Broadcom, by contrast, relies on the typical external laser module. We sense Intel is further from mass production, however, and we don’t see Xeon-to-Xeon connectivity pressuring the company as much as data-center networking pushes Broadcom. Intel has unique technology, but the industry would benefit if it moved faster toward commercialization.
Cerebras Waferscale AI
Cerebras recapitulated previous disclosures and marketing literature, with the addition that it’s now also targeting inference. The company made the case that big gen-AI models are slow owing to the memory wall. It claims its technology can output tokens to a single user many times faster than hyperscalers can, improving user engagement and enabling AI agents. The waferscale Cerebras WSE-3 is faster because it has 880× the on-chip memory and 7,000× the memory bandwidth of an Nvidia H100, and because multi-GPU/NPU setups scale poorly. It’s an argument similar to the one SambaNova made in its presentation.
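A back-of-the-envelope calculation shows why bandwidth dominates single-user token generation: every new token must stream the full weight set from memory. The H100 bandwidth figure and 16-bit weights below are our assumptions, not numbers from the presentation.

```python
# Memory-wall arithmetic for single-user token generation: each new token
# reads every weight, so memory bandwidth caps tokens per second. The H100
# bandwidth (~3.3 TB/s) and FP16 weights are our rough assumptions.

params = 70e9                                # Llama3.1-70B parameters
weight_bytes = params * 2                    # ~140 GB read per token at FP16

hbm_bandwidth = 3.3e12                       # bytes/s, single H100 (assumed)
print(hbm_bandwidth / weight_bytes)          # ~24 tokens/s upper bound

wse3_bandwidth = 7000 * hbm_bandwidth        # applying Cerebras's 7,000x claim
print(wse3_bandwidth / weight_bytes)         # the ceiling rises proportionally
```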
Cerebras used Llama3.1-8B and Llama3.1-70B as examples. The former fits on one WSE-3, whereas the bigger 70B model had to be divided among four. Both models are available online, but the company has yet to release a Llama-405B implementation, indicating that mapping big neural networks to its hardware is nontrivial. While the presentation showed vastly superior single-user speed and throughput on the 70B model, the result covered only 128 prompt tokens and 20 output tokens. A broader test set would instill more confidence.
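The one-versus-four split is consistent with simple weight-footprint arithmetic, assuming 16-bit weights and roughly 44 GB of on-chip SRAM per WSE-3 (our figures, not ones quoted in this talk):

```python
# Rough footprint arithmetic behind the one-vs-four wafer split, assuming
# FP16 weights and ~44 GB of on-chip SRAM per WSE-3 (our assumptions);
# activations and KV cache are ignored for simplicity.

sram_per_wafer = 44e9                        # bytes, approximate

for params in (8e9, 70e9):
    weights = params * 2                     # bytes at FP16
    wafers = -(-weights // sram_per_wafer)   # ceiling division
    print(f"{params/1e9:.0f}B model: {weights/1e9:.0f} GB -> {int(wafers)} wafer(s)")
# 8B  -> 16 GB  -> 1 wafer
# 70B -> 140 GB -> 4 wafers
```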
XiangShan Open-Source RISC-V
RISC-V startups are in trouble. The Chinese Academy of Sciences has developed two RISC-V CPUs. They compare the Nanhu with the Arm Cortex-A76 and the Kunminghu (shown in Figure 3) with the Arm Neoverse N2 (Cortex-A710). They taped out the 14 nm Nanhu V2 in 2022 and claim a 2.0 GHz clock rate. They’re freezing the 7 nm Kunminghu V1 RTL and targeting 3 GHz. We’re skeptical of the claimed clock rates, and the six-wide Kunminghu has an oddly long branch-mispredict penalty because the developers inserted registers along a critical path, a shortcoming a commercial supplier would fix before RTL freeze. Nonetheless, it reportedly achieves about 16 points per GHz on SPECint2006, a respectable value and twice the throughput of the Nanhu.
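Combining the two claims gives a rough projection; both inputs are claims rather than measured silicon, so treat the result accordingly:

```python
# Projected SPECint2006 score implied by the claims, assuming the quoted
# per-GHz throughput holds at the targeted clock (not yet demonstrated).

points_per_ghz = 16
target_ghz = 3.0
print(points_per_ghz * target_ghz)   # ~48 projected SPECint2006 for Kunminghu
```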
RISC-V companies must adapt to the availability of a free, open-source RISC-V core promising high-end performance. They should target even higher performance or focus on commercializing and supporting open-source technologies, following a business model like the one Red Hat pioneered for Linux.
Figure 3. Kunminghu RISC-V microarchitecture. (Source: Chinese Academy of Sciences via HotChips 2024.)
Ampere AmpereOne
Now that AmpereOne systems are available, Ampere is discussing details of the Arm-compatible server processor and its custom CPU. The processor employs chiplets, putting DRAM and I/O controllers and interfaces on separate dice from the CPUs. This enables the company to employ a different process technology for the disparate functions and to scale interfacing by adding interface dice to the package. The compute chiplet is an impressive grid of 192 CPUs. In such a design, the on-chip interconnect is critical, and Ampere highlighted how latency is low and relatively constant as load increases.
The CPU’s complexity is similar to that of the Arm Neoverse N2 (Cortex-A710), and it should deliver much less per-cycle throughput (IPC) than a Neoverse V3, AMD Zen, or Intel P-core. It’s much smaller than these latter cores, enabling Ampere to pack so many on a single die. At 3.0 GHz (plus or minus a few hundred megahertz), its clock rate is competitive. In its presentation, Ampere emphasized how it strove to reduce the latency (in cycles) of recovering from branch mispredictions, retrieving data from the L1 caches, transferring data from the L2, and other operations. The company claimed this focus on latency led it to employ a 16 KB L1 instruction cache but didn’t show any data on how saving a few cycles could offset the higher miss rate of such a small cache. We’re skeptical it’s a worthwhile tradeoff.
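A simple average-memory-access-time (AMAT) comparison shows what that missing data would need to demonstrate; every number below is an illustrative assumption, not Ampere data.

```python
# Illustrative AMAT comparison for a small vs. larger L1 instruction cache.
# Hit latencies, miss rates, and the L2 penalty are assumptions, not Ampere data.

def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time in cycles."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Suppose the 16 KB cache hits in 2 cycles, a 64 KB alternative in 3,
# and an L1 miss costs 12 cycles to fetch from L2.
print(amat(2, 0.04, 12))   # small cache, modest code footprint: 2.48 cycles
print(amat(3, 0.02, 12))   # larger cache, same workload:        3.24 cycles
print(amat(2, 0.12, 12))   # small cache, big server footprint:  3.44 cycles
# The 1-cycle hit saving pays off only while the extra misses stay below
# roughly (3 - 2) / 12, about 8 percentage points; that is the data Ampere
# didn't show.
```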
We’re also skeptical of the viability of a merchant-market Arm-compatible server processor and of wimpy cores in servers generally. It’s also unclear why Ampere would invest in a custom core when it could license the N2 from Arm. The two cores aren’t identical but should perform similarly and occupy similar silicon area (we’re guessing about 1.3 mm² in TSMC N5).
Microsoft Maia 100
The Maia 100 is Microsoft’s first homegrown NPU. Whereas Meta has a clear purpose for its MTIA AI accelerator (executing recommendation models better than commercial offerings), Microsoft’s stated intent is simply to run OpenAI models. We suspect that the real goal is to gain leverage over Nvidia, but this aim could’ve been achieved by collaborating with one of the many NPU startups, such as Tenstorrent or SambaNova. Microsoft disclosed few details or clues to its design philosophy, resulting in a weak presentation.
Other
- AMD presented the Zen 5 CPU, touching on a couple of PC processors integrating it and briefly discussing those chips’ GPU and NPU updates. As with the company’s Day One MI300X presentation, it disclosed nothing new.
- Preferred Networks presented the MN-Core 2 accelerator targeting AI and HPC workloads. For AI, it targets training instead of the easier-to-break-into inference market. For HPC, it supports double-precision math, delivering up to 32.8 Tflops. The company compares the chip with the Nvidia A100, claiming greater performance at lower power. Although it keeps a low profile, Preferred has appeared on the Green500 list, where its first-generation MN-Core system ranks 21st. Despite the company’s interesting technology, the real takeaway is that conference organizers shouldn’t schedule someone with a Japanese accent at the end of the final day. As soon as the presenter began to speak, the remaining audience stampeded to the door.