
Hot Chips 2024 Roundup: Day 1 Takeaways


Last week, various organizations presented their newest processors at Hot Chips. Typical of recent years, most focused on AI processing, but several talks covered general-purpose CPUs. Below is an observation from each Day 1 presentation.

Qualcomm Oryon CPU

Coming to Qualcomm by way of startup Nuvia, key Oryon team members hail from Apple, where they established a new standard for CPU execution width. Qualcomm SVP and CPU architect Gerard Williams made the point that structures beyond decoders and execution units must also be enlarged to maximize back-end utilization. At the same time, bigger must not come at the expense of clock speed. Remarkably large structures include the 8K-entry second-level TLB, logic to juggle more than 10 table walks at a time, and the capability to keep more than 200 load/store operations in flight. Charts and graphs of tradeoffs would've strengthened his talk, but it was nonetheless impossible to walk away without concluding the Qualcomm Oryon team knows how to make fast CPUs.

Intel Lunar Lake Processor

Presenter Arik Gihon hails from Israel, the location of the design center Intel has turned to when it's painted into a corner, such as when Pentium 4 proved to be a dead end or Alder Lake needed a quick but meaningful upgrade. Gihon demonstrated that Intel has finally figured out a use for E-cores beyond goosing benchmarks. Figure 1 shows Lunar Lake can run Microsoft Teams mostly on E-cores, reducing dependence on the relatively power-hungry P-cores. Moreover, Lunar's Skymont E-cores are much more power efficient than Meteor Lake's low-power E-cores. (They're presumably also better than Meteor's regular E-cores.) Teams, however, is an unusual application because extrinsic factors (e.g., network speeds) constrain CPU-performance requirements. To fend off Arm-based PC processors claiming better throughput and battery life, Intel must take advantage of E-cores' efficiency in a variety of applications.


Figure 1. Lunar Lake E-core vs. P-core scheduling. (Source: Intel via HotChips 2024.)

IBM Telum II CPU and Spyre NPU

For those requiring eight-nines availability and massive transaction-processing throughput, IBM offers mainframes. As with the original Telum, the company's new mainframe processor integrates an AI accelerator (NPU). Customers can employ it to evaluate transactions, marking them, for example, as fraudulent or valid (hotdog or not hotdog, if you prefer).

Classification models also provide a confidence score along with their output, and IBM now enables punting low-confidence evaluations to a separate Spyre NPU running a bigger neural network. Spyre scales up the on-chip NPU’s architecture, aiding compatibility. Due next year, Spyre will come on a 75 W PCIe card and promises more than 300 TOPS of raw number crunching and an excellent 55% utilization owing to 2 MB of on-chip SRAM for each of Spyre’s 32 cores. IBM says Spyre can perform generative AI. It’s unclear how mainframe customers would employ such a capability, but in 2024 every AI-chip vendor must mention transformers and gen AI.
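IBM didn't show code, but the confidence-gated fallback it describes is a familiar pattern; here's a minimal sketch of it in Python, with hypothetical model objects and an arbitrary 0.9 threshold rather than anything from IBM's stack.

```python
# Minimal sketch of confidence-gated inference: score a transaction with the small model
# on Telum II's on-chip NPU and punt low-confidence cases to the larger model on the
# Spyre card. Model objects and the threshold are illustrative, not IBM's API.
CONFIDENCE_THRESHOLD = 0.9  # hypothetical cutoff

def classify_transaction(txn, on_chip_model, spyre_model):
    label, confidence = on_chip_model.predict(txn)   # fast path, runs alongside the transaction
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                                  # accept the small model's verdict
    label, _ = spyre_model.predict(txn)               # slow path: bigger network on Spyre
    return label
```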

In addition to reliability, massive I/O characterizes mainframes. The Telum II processor also integrates a DPU (I/O accelerator) comprising four clusters of eight small cores, a 36 MB cache, and other hardware, as Figure 2 shows. A 32-processor system can have 192 PCIe cards divided among 12 I/O drawers.

The bottom line is that IBM continues to update the deathless mainframe even as the category’s share of the computing market has descended to a tiny but nonzero asymptote. Customers, such as banks, that still require mainframes’ unique capabilities will find the new IBM technology delivers more of what they need.


Figure 2. IBM Telum II microprocessor. (Source: IBM via HotChips 2024.)

Tenstorrent Blackhole and TT-Metalium

Tenstorrent discussed details of its forthcoming Blackhole NPU and associated software. The new chip scales out Wormhole's architecture and adds 16 big, high-performance RISC-V CPUs to obviate an external host processor. Like Wormhole, Blackhole uses Ethernet for interchip connectivity, albeit upgraded to the 400 Gbps variety, and normalish DRAM (GDDR) provides external memory instead of the expensive HBM favored by competitors.

The architecture’s unusual aspects include dividing processing among 140 Tensix+ cores, each of which integrates not only the usual matrix and vector engines and network-on-chip (NoC) circuitry but also five baby RISC-V cores. A simplified perspective is that one core manages NoC input and another NoC output, two cores similarly handle moving data into and out of the compute block, and the final core manages compute-logic execution. Tenstorrent also employs baby CPUs adjacent to DRAM and Ethernet controllers. A baby core’s code should be small enough to fit in its 4 KB instruction cache, providing flexibility and future proofing that a finite-state machine can’t match. Although the chip has 752 of them, the little CPUs occupy a low single-digit percentage of total die area.

More important than Tenstorrent’s architecture is the company’s developer-engagement philosophy. Whereas many NPU suppliers targeting AI inference assume customers will operate the chips as black boxes consuming ONNX-format models, Tenstorrent is betting that customers will program them. Releasing its software on GitHub, the company seeks to build a community and software ecosystem (see Figure 3). It’s an unusual approach that will take years to bear fruit, but developer engagement and software have proven essential to the success of other processors, including Nvidia GPUs.


Figure 3. Tenstorrent software stack. (Source: Tenstorrent via HotChips 2024.)

SK Hynix Computing Memory

While NPU companies have been adding memory to their AI accelerators, Hynix has added NPU functions to a memory chip. These functions account for 20% of the chip's area but can reduce system power and improve performance, particularly for the batch-size-one operations typical of on-device inference, where memory-access latency is the bottleneck. Hynix's challenge is to find customers willing to pay a premium for a nonstandard memory; the company is going against the industry's historical practices of per-bit pricing and standards-based governance.
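On the batch-size-one point, a rough back-of-envelope calculation (mine, not Hynix's) shows why such workloads are memory bound: a matrix-vector product re-reads every weight for only about two operations of work, so throughput tracks memory bandwidth, which is exactly what compute-in-memory attacks.

```python
# Back-of-envelope arithmetic intensity for a matrix-vector product (batch size 1) vs.
# batched inference; illustrative numbers, not Hynix data. FP16 weights assumed.
def arithmetic_intensity(rows, cols, batch, bytes_per_weight=2):
    flops = 2 * rows * cols * batch               # one multiply-accumulate per weight per input
    bytes_moved = rows * cols * bytes_per_weight  # weight traffic dominates at small batch sizes
    return flops / bytes_moved

for batch in (1, 16, 256):
    print(f"batch {batch:>3}: {arithmetic_intensity(4096, 4096, batch):.0f} FLOPs/byte")
# batch 1 lands at ~1 FLOP/byte, so performance is capped by memory bandwidth,
# which is the bottleneck that moving compute into the DRAM die addresses.
```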

Intel Xeon 6 SoC (Granite Rapids Xeon D)

Employing both Intel 3 and Intel 4 chiplets, Granite Rapids will be available as an SoC for communications and other applications. Versions will have either four or eight DRAM channels, integrate eight Ethernet ports supporting two 100 Gbps interfaces, and include offload engines for encryption, compression, and other operations. The Intel 3 die houses the CPUs and DRAM controllers, whereas the Intel 4 one has I/O controllers and accelerators.

The company showed various benchmarks for a Granite Rapids configuration with 42 P-cores. This large core count combined with the DRAM configuration underscores that this processor is a bruiser among communications-targeted chips, suited only to designs that can accommodate its prodigious 77.5 × 50 mm package size and attendant power.

Nvidia Blackwell

Nvidia's presentation recapitulated much of what the company has already disclosed about Blackwell. The company highlighted that FP4 can produce good results and showed off its newest networking technology. The new NVLink 5 switch employs 200 Gbps PAM4 serdes, the fastest copper interfaces available. Unfortunately, PCBs can't carry traces this fast, forcing Nvidia's new switch tray to employ flyover cables. As Figure 4 shows, the cables run to both the back and front panels. Although Nvidia's NVL72 rack would use only back-panel ports, it exceeds many data centers' maximum areal power. Affected customers, therefore, must deploy two smaller racks (e.g., employing 72 single-die Blackwells, which haven't been announced, or 36 standard dual-die chips) to halve footprint power (h/t Bob Wheeler).


Figure 4. Nvidia NVLink 5 switch chassis. (Source: Nvidia.)
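For a sense of how coarse FP4 is, the snippet below enumerates its representable values, assuming the common e2m1 encoding (1 sign, 2 exponent, 1 mantissa bit); the format details here are my assumption, not an Nvidia disclosure, and in practice per-block scale factors map real tensors onto this grid.

```python
# Enumerate FP4 values assuming the e2m1 layout (1 sign, 2 exponent, 1 mantissa bit).
# This encoding is an assumption for illustration, not an Nvidia-specific definition.
import itertools

def e2m1_value(sign, exp, mant):
    mag = mant * 0.5 if exp == 0 else (1 + mant * 0.5) * 2 ** (exp - 1)  # subnormal vs. normal
    return -mag if sign else mag

values = sorted({e2m1_value(s, e, m)
                 for s, e, m in itertools.product((0, 1), range(4), (0, 1))})
print(values)  # 15 distinct values from -6.0 to 6.0; everything else must round to these
```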

SambaNova SN40L

SambaNova described how its NPU architecture simplifies programming, enabling, for example, a single call for all decoders, reducing overhead and improving data locality. However, the SambaNova SN40L’s main advantage is its memory hierarchy. The NPU has 8 GB on board, plus 1 TB of HBM in package (see Figure 5), and can address 24 TB of external DRAM. These memories mitigate data-access bottlenecks and facilitate running large models in a single chassis instead of requiring multiple systems dependent on in-package HBM.


Figure 5. SambaNova SN40L chip.
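Some quick arithmetic on the capacities stated above (my estimate, assuming 16-bit weights) shows why the DDR tier matters: it's what lets one system hold models that would otherwise be sharded across many HBM-limited accelerators.

```python
# Rough capacity-to-parameter conversion using the figures above; assumes 16-bit weights.
BYTES_PER_PARAM = 2  # FP16/BF16

tiers_tb = {"on-chip SRAM": 0.008, "in-package HBM": 1, "external DDR": 24}

for name, capacity_tb in tiers_tb.items():
    params_billions = capacity_tb * 1e12 / BYTES_PER_PARAM / 1e9
    print(f"{name}: ~{params_billions:,.0f}B parameters")
# The 24 TB DDR tier works out to roughly 12,000B (12 trillion) 16-bit parameters,
# far beyond what any in-package HBM configuration holds today.
```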

Broadcom AI ASIC with Optics

Broadcom updated the audience on its progress implementing copackaged optics (CPO). Previous development vehicles added CPO to the company's Tomahawk Ethernet switch ICs, and the company is now trialing CPO with an NPU. Broadcom has evaluated CPO systems built by an ODM, demonstrating their mass-manufacturability. At the conference, the company also showed how it has reduced costs since the first CPO generation. Important CPO advantages over copper include reduced system power, simpler chassis assembly (cf. the NVLink switch above), better bandwidth scaling, and easier scaling to 512 or more NPUs. Nonetheless, the key takeaway is that CPO-technology development is progressing but more must be done before we'll see large-scale deployments.

Furiosa RNGD

Founded in 2017 to develop an NPU for vision applications, Furiosa is now demonstrating large language models (LLMs) on its newest silicon, the RNGD. The NPU requires only 185 W (150 W TDP), and Furiosa positions it as a more power-efficient alternative to Nvidia's L40S for server add-in boards. Architecturally, Furiosa has replaced fixed-size matrix-multiplication primitives with more general tensor contraction primitives. The company expects tensor contraction, combined with its hardware, to improve resource utilization, power efficiency, and flexibility in batch processing. Furiosa is unusual in staking out territory between the PCIe-card 75 W limit and kilowatt-class NPUs but nonetheless faces the same challenges luring customers from Nvidia as other NPU companies.
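As a rough illustration of the generality Furiosa is after (a NumPy sketch, not Furiosa's compiler or APIs), a single tensor-contraction primitive can express a plain matmul, a batched matmul, and a pure reduction with one mechanism, giving the scheduler room to optimize across shapes and batch sizes.

```python
# NumPy sketch of tensor contraction as a generalization of fixed matmul primitives;
# not Furiosa's APIs. One einsum mechanism covers several common shapes.
import numpy as np

A = np.random.rand(8, 64, 128)   # e.g., [batch, rows, reduction dim]
B = np.random.rand(8, 128, 32)   # e.g., [batch, reduction dim, cols]

gemm     = np.einsum("mk,kn->mn", A[0], B[0])    # plain matrix multiply
batched  = np.einsum("bmk,bkn->bmn", A, B)       # batched matrix multiply
row_sums = np.einsum("bmk->bm", A)               # reduction-only contraction

print(gemm.shape, batched.shape, row_sums.shape)  # (64, 32) (8, 64, 32) (8, 64)
```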

Other

Intel presented Gaudi 3, disclosing little (if anything) new. Although Gaudi chips have advantages, the company had to rationalize their development and that of its data-center GPUs. Therefore, Gaudi 3 is the end of the road for the Habana-originated architecture, and there's no simple path from its software environment to the oneAPI environment employed elsewhere at Intel.

AMD presented the Instinct MI300X, its newest data-center GPU, launched at the end of last year. Available in OCP-compliant systems from multiple OEMs, the MI300X is the strongest challenger to the Nvidia H100 but sells in far smaller volumes.
