The IBM Power11 processor embodies technologies that will appear in other companies’ designs. Big Blue’s approach to CPU-based AI acceleration for servers has been followed by Arm for smartphones and, we predict, will soon be followed by x86 suppliers for PC processors. The Power11’s outboard memory buffers solve a problem that other server processors will encounter as they scale up. The IBM chip’s simultaneous multithreading differs from that of mainstream processors but exemplifies the technology’s benefits. We expect other processors to adopt all three technologies.
Instead of sourcing third-party processors, IBM bases its computers on in-house designs. The Z series mainframes, which execute bank transactions and other throughput- and reliability-sensitive workloads, are the company’s biggest systems. The Power systems don’t scale as large or promise the same reliability (merely eight nines of uptime), but they are larger, faster, and more robust than most other servers. Thus, they tend to implement new technologies before x86 and Arm designs do.
Presented at Hot Chips in August, Power11 refines the Power10’s design, employing a similar semiconductor process and architecture, despite arriving five years later. Nonetheless, it speeds up IBM’s reference application basket by 14–50%, depending on the system configuration. For example, a 2U, two-socket Power11 server scales to 50% larger core counts, raising performance commensurately. A new 16-processor rack-scale system can have 256 cores maxing out at 4.2 GHz, compared with 240 at 4.0 GHz for its Power10 counterpart.
External Memory Buffers
In addition to the core-count and clock-rate upgrades, Power11 also updates the Open Memory Interface (OMI) that links the processor to external memory buffers, raising per-port throughput from 51.2 GB/s to 76.8 GB/s. Whereas most processors attach directly to DRAM, Power11 connects indirectly using OMI, as Figure 1 shows.

The approach helps memory capacity and bandwidth scale. Power11 has 16 OMI ports, and each buffer supports two DDR5 memory modules. Modules can hold up to 512 GB of DRAM, providing 8 TB per Power11 chip. At 1.2 TB/s, aggregate DRAM bandwidth matches the OMI total. For comparison, an AMD Epyc processor can have 12 DRAM interfaces, delivering 614 GB/s peak bandwidth and 6 TB maximum capacity. Maximizing Epyc’s memory size, however, requires two standard DIMMs per channel, which necessitates reducing interface speed. Epyc, on the other hand, supports CXL for additional capacity.
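As a sanity check on these figures, a quick back-of-the-envelope calculation (ours, built from the numbers above and assuming DDR5-6400 for the Epyc comparison) shows how the per-port OMI rate aggregates:

```python
# Back-of-the-envelope check of the aggregate bandwidth figures cited above.
# Port and channel counts come from the article; the DDR5-6400 assumption
# for the Epyc comparison is ours.

OMI_PORTS = 16              # OMI ports per Power11 chip
OMI_GBPS_PER_PORT = 76.8    # GB/s per OMI port (up from 51.2 on Power10)

EPYC_CHANNELS = 12          # DDR5 channels on a current Epyc
DDR5_6400_GBPS = 6.4 * 8    # 6,400 MT/s x 8-byte channel = 51.2 GB/s

power11_bw = OMI_PORTS * OMI_GBPS_PER_PORT   # ~1,228.8 GB/s, i.e., ~1.2 TB/s
epyc_bw = EPYC_CHANNELS * DDR5_6400_GBPS     # ~614.4 GB/s

print(f"Power11 aggregate OMI bandwidth: {power11_bw:.1f} GB/s")
print(f"Epyc aggregate DDR5 bandwidth:   {epyc_bw:.1f} GB/s")
print(f"Ratio: {power11_bw / epyc_bw:.1f}x")
```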
As processors’ core counts increase, memory requirements rise commensurately. Direct DRAM attachment presents a topological problem: there is too little area around a processor for DRAM sticks. Moreover, adding DRAM interfaces ratchets up processors’ pin counts, and crowded boards challenge signal integrity. Therefore, there’s a case for other processors to follow IBM’s approach. For Intel, it would be a return to past practice: a decade ago, the top-end Xeon E7 employed separate memory buffers.
Accelerating AI Matrix Math
To execute AI workloads without offloading them, Power11 implements AI-acceleration instructions. Introduced by Power10, they speed up outer-product operations on various integer and floating-point data types. The challenges of supporting matrix instructions in a general-purpose processor are the die area and architectural state they can require. Mitigating these issues, the Power Architecture maps matrices onto the existing vector registers.
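To illustrate the operation these instructions accelerate, the sketch below expresses a matrix multiply as a series of accumulated outer products, the rank-1-update pattern that maps naturally onto data held in vector registers. It is plain NumPy for illustration, not IBM’s MMA intrinsics:

```python
import numpy as np

# Illustrative only: a matrix multiply expressed as accumulated outer
# products, the pattern that matrix (outer-product) instructions accelerate.

def matmul_by_outer_products(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(k):
        # Rank-1 update: column i of A times row i of B, accumulated into C.
        C += np.outer(A[:, i], B[i, :])
    return C

A = np.random.rand(4, 8).astype(np.float32)
B = np.random.rand(8, 4).astype(np.float32)
assert np.allclose(matmul_by_outer_products(A, B), A @ B, atol=1e-5)
```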
We expect other architectures to add matrix operations along with area mitigations, either by reusing vector/SIMD resources as Power does or by sharing matrix units among CPUs. IBM takes the latter approach with its Z series CPUs (e.g., Telum II), and the Arm community is doing the same. Intel’s server CPUs have a dedicated unit to support x86’s matrix instructions, which the company has called AMX but is standardizing with AMD as “Ace.” We expect future AMD and Intel PC processors to adopt the same sharing method as Arm and the IBM Z series, a practical compromise that favors area savings over peak throughput. The RISC-V community, including SiFive, is taking an all-of-the-above (and then some) approach, reflecting the diverse workloads it targets.
AI functions have become relevant to servers, PCs, smartphones, and embedded systems. Different use cases are best handled in different places: in the cloud, on a CPU-attached accelerator, or on the CPU itself. The last option delivers the least throughput but offers better latency and improves compatibility among systems of the same architecture.
Extreme Simultaneous Multithreading
Because Power11 provides prodigious memory capacity and bandwidth, we infer that the typical Power11 workload (e.g., SAP Hana) accesses memory more often than other server workloads. Each access takes hundreds of cycles to complete and can stall a thread. Rather than leave execution resources idle, simultaneous multithreading (SMT) keeps function units loaded during otherwise unused cycles.
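A toy utilization model (our simplification, not IBM data) shows why adding hardware threads pays off when each thread spends most of its time waiting on DRAM:

```python
# Toy model of SMT latency hiding (our simplification, not IBM data).
# Assume each thread alternates between `compute` busy cycles and a
# `stall` of memory-latency cycles. With n threads sharing one core's
# pipelines, utilization rises until the pipelines saturate.

def core_utilization(compute: float, stall: float, threads: int) -> float:
    single = compute / (compute + stall)   # fraction of cycles one thread keeps the core busy
    return min(1.0, threads * single)      # extra threads fill idle cycles, up to saturation

COMPUTE, STALL = 50, 300                   # e.g., 50 busy cycles, then a ~300-cycle DRAM access
for n in (1, 2, 4, 8):                     # SMT1 through SMT8
    print(f"SMT{n}: ~{core_utilization(COMPUTE, STALL, n):.0%} utilization")
```

In this simple model, eight threads are enough to keep the core busy even when each thread computes for only 50 cycles between 300-cycle memory accesses.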
As XPU.pub has discussed, SMT improves aggregate throughput for many workloads, albeit typically at the expense of per-thread performance. We expect Intel to return SMT to its server processors after a planned removal from the next Xeon; Nvidia’s upcoming Arm-compatible Vera processor will support the technology, a first for a high-performance Arm processor. None will match Power11’s eight-way SMT (SMT8) and double-stuffed CPU resources, however.
Looking Forward
The big change for the next-generation Power processor will be IBM’s use of chiplets, a technology that is already de rigueur for x86 processors and likely to be employed in hyperscalers’ proprietary Arm designs. A chiplet design could entail a larger package, adding room for direct DRAM interfaces, but we expect IBM to continue using OMI and off-chip memory buffers. Features like matrix extensions and SMT will return.
Bottom Line
Power11 is a smaller step forward than previous Power generations, but it points the way for other processors to scale. External memory buffers, matrix extensions, and SMT are high-end server features, but we predict they will become commonplace in other systems. Core counts continue to rise, pressuring memory subsystems to keep up. AI is finding widespread use, but not all models need to be offloaded. Simultaneous multithreading will find additional uses, foremost for performance elasticity and secondarily to squeeze more throughput from a CPU. Motorsports aficionados say that racing improves the breed: technology for rarefied performance trickles down to ordinary cars. IBM develops computing’s race technology, which other processor vendors later adopt.

