Intel CEO Lip-Bu Tan lamented simultaneous multithreading’s absence in the upcoming Diamond Rapids Xeon processor. Simultaneous multithreading (SMT) has long been a feature of Intel’s high-end CPUs (P-cores), and cloud providers have employed it in their virtual CPU (vCPU) offerings. To SMT or not to SMT is a longstanding question, with some CPU vendors adopting the technology while others shun it. Its value, however, isn’t its potential performance uplift but rather its performance elasticity: the ability to let software choose between single-thread execution speed and aggregate throughput.
What Processors Support SMT?
Intel popularized SMT, branding it Hyper-Threading in 2002, and AMD followed its lead. IBM has employed the technology in Power7 and subsequent processors, and Sun developed multithreaded SPARC processors. Raza Microelectronics (RMI) employed it in two generations of multicore embedded processors before being acquired by NetLogic, which Broadcom subsequently bought; the group completed a third generation there before bouncing to Marvell, which released the ThunderX2 processor. The forthcoming Nvidia Vera processor will also support SMT.
As for licensable CPU designs (IP), few companies offer SMT products. Arm has largely eschewed the technology, although it added the SMT-capable Cortex-A65AE to its portfolio in 2018. Among RISC-V offerings, the MIPS I8500 supports 4-way SMT. It’s a midrange three-way superscalar core that should deliver sizable SMT gains because it’s an in-order machine, leaving more execution slots open for other threads to fill. Founded by engineers whose SMT experience traces back to RMI, startup Akeana supports SMT in its 1000 and 5000 series RISC-V cores, which range from single-issue, in-order CPUs to 10-wide, out-of-order designs.
SMT in a Nutshell
One of several multithreading approaches, SMT enables a CPU to keep multiple independent instruction streams (threads) in flight simultaneously. Because a single thread rarely uses all of a CPU’s resources, the idle ones are available to other threads. For example, a cache miss can stall the pipeline for hundreds of cycles; SMT can fill this time by executing another thread. Alternatively, a CPU may have two ALUs but need only one in a given cycle. Even a core with 10 parallel execution units rarely achieves an average throughput greater than two instructions per cycle (IPC), suggesting that up to a 5× throughput increase is theoretically available.
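To make the stall-filling idea concrete, here’s a minimal Linux sketch that pairs a latency-bound thread (a pointer chase that misses the caches) with an ALU-bound companion on the two logical CPUs of one physical core. It’s an illustration, not a rigorous benchmark: it assumes logical CPUs 0 and 1 are SMT siblings (verify via /sys/devices/system/cpu/cpu0/topology/thread_siblings_list), and the buffer size and iteration counts are placeholders to tune for the target machine. On a typical SMT core, the pointer-chaser’s runtime should barely change when the companion is enabled, because the companion executes in issue slots the chaser leaves idle.

```c
/* Build: gcc -O2 -pthread smt_demo.c
 * Run with any argument to add the ALU-bound companion thread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES (1u << 24)            /* ~128 MB chain, far beyond L3 */
#define HOPS  50000000L

static size_t *chain;               /* randomized pointer chain */

static void pin(int cpu)            /* pin calling thread to one logical CPU */
{
    cpu_set_t s;
    CPU_ZERO(&s);
    CPU_SET(cpu, &s);
    pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
}

static void *latency_bound(void *arg)   /* mostly waits on memory */
{
    pin(0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t i = 0;
    for (long n = 0; n < HOPS; n++)
        i = chain[i];                   /* dependent loads: ~1 miss each */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("latency thread: %.2f s (i=%zu)\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9, i);
    return NULL;
}

static void *alu_bound(void *arg)       /* never touches memory */
{
    pin(1);                             /* assumed SMT sibling of CPU 0 */
    volatile unsigned long x = 1;
    for (;;)                            /* runs until the process exits */
        x = x * 2654435761UL + 12345;   /* pure integer work */
    return NULL;
}

int main(int argc, char **argv)
{
    chain = malloc(NODES * sizeof(*chain));
    for (size_t i = 0; i < NODES; i++)  /* identity permutation... */
        chain[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = rand() % i;          /* ...shuffled into one big cycle */
        size_t t = chain[i];            /* (Sattolo's algorithm), so the */
        chain[i] = chain[j];            /* prefetcher can't hide misses  */
        chain[j] = t;
    }

    pthread_t lat, alu;
    if (argc > 1)                       /* optional SMT companion */
        pthread_create(&alu, NULL, alu_bound, NULL);
    pthread_create(&lat, NULL, latency_bound, NULL);
    pthread_join(lat, NULL);            /* exiting kills the companion */
    return 0;
}
```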
SMT Speedup
The actual speedup SMT affords varies. Although a processor may have an abundance of some resources, others can bottleneck execution. For example, HPC threads often fully occupy a CPU’s floating-point unit; executing two such threads at once on a single core doesn’t raise performance at all. By contrast, a second thread of networking code may deliver a 50% gain on small packets, as one thread proceeds while the other waits for a lookup-table access to complete. There may be no increase when processing large packets, however, because copying them to memory is limited by throughput, not latency.
Benchmarks typically compare SMT performance with every thread running the same code. Even when the code is identical, executing two independent copies of a program differs from a single workload employing parallel processing. In the latter case, the interdependent threads incur synchronization delays and are likely to contend for the same resources, preventing SMT from being effective. HPC users therefore typically disable SMT.
Despite the varying benefits, a conservative rule of thumb is that SMT raises throughput by 15% on average. Factoring out cases where the gains are small or nonexistent, a 30% boost is a reasonable baseline. The CPU’s cost (die area) and complexity grow by roughly 10%, although this, too, depends on many factors.
SMT Tradeoffs
In practice, systems often run different code in each thread and needn’t tightly synchronize them, affording more speedup opportunities. One thread may preprocess data and transfer it to an accelerator while another manages the system, for example. In another case, a thread rarely runs but must activate quickly to service an event; because an SMT-capable core keeps multiple contexts resident simultaneously, starting that thread doesn’t require swapping CPU state, as the sketch below illustrates.
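The following hypothetical sketch shows that rarely-running service thread parked on a condition variable while pinned to the SMT sibling (assumed here to be logical CPU 1) of a busy worker on CPU 0. The pinning scheme and the placeholder work are assumptions; the point is that the handler occupies its own hardware context, so waking it disturbs none of the worker’s register state.

```c
/* Build: gcc -O2 -pthread event_sketch.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int pending, done;

static void pin(int cpu)
{
    cpu_set_t s;
    CPU_ZERO(&s);
    CPU_SET(cpu, &s);
    pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
}

static void *handler(void *arg)         /* mostly idle, wakes quickly */
{
    pin(1);                             /* assumed SMT sibling of CPU 0 */
    pthread_mutex_lock(&lock);
    while (!done || pending) {
        while (!pending && !done)
            pthread_cond_wait(&cond, &lock);
        while (pending) {
            pending--;
            printf("event serviced\n"); /* real work would go here */
        }
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t h;
    pthread_create(&h, NULL, handler, NULL);
    pin(0);                             /* worker owns this context */

    for (int i = 0; i < 3; i++) {
        /* ... bulk work runs here ... */
        pthread_mutex_lock(&lock);
        pending++;                      /* post an event */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    pthread_mutex_lock(&lock);          /* tell the handler to exit */
    done = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(h, NULL);
    return 0;
}
```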
Employing SMT to raise throughput, however, slows each thread’s execution compared with granting it sole access to the CPU. If aggregate throughput with two identical threads per core is 1.3× that of a single thread, each thread runs at only 0.65× its solo speed. In many cases this isn’t a problem. But if per-thread performance isn’t critical, a processor could alternatively employ many smaller, lower-performance CPUs. Marginal speedup techniques disproportionately increase area; conversely, scaling back performance by 35% can reduce area by 50% or more.
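The arithmetic is simple but worth tabulating: for a 2-way SMT core, per-thread speed is just the aggregate speedup divided by two. The toy program below runs it for a few assumed aggregate figures, including the 1.3× example above.

```c
/* Per-thread speed implied by an aggregate 2-way SMT speedup. */
#include <stdio.h>

int main(void)
{
    const double aggregate[] = { 1.00, 1.15, 1.30, 1.50 };  /* assumed values */
    for (int i = 0; i < 4; i++)
        printf("aggregate %.2fx -> %.3fx per thread\n",
               aggregate[i], aggregate[i] / 2.0);
    return 0;   /* 1.30x aggregate -> 0.650x per thread, as above */
}
```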
Indeed, Arm-based server processors have integrated more but slower CPUs than their x86 counterparts to achieve similar per-chip throughput. That is changing as Arm-compatible CPUs make performance gains and as AMD and Intel develop alternatives to their big-core processors that pack more, smaller CPUs per chip. Moreover, cloud providers have begun mapping each vCPU to a physical core for both Arm and x86 instead of using SMT to map two vCPUs to each x86 core.
Bottom Line
In the x86 case, cloud providers may let customers enable SMT. This ability to choose between maximizing total per-core throughput and maximizing single-thread performance is the technology’s unique advantage. It applies not only to server processors but also to CPUs integrated into automotive processors and networking chips (e.g., DPUs). Without competition, Diamond Rapids would suffer little from SMT’s absence. Against AMD’s contemporaneous offering, however, it will be a conspicuous shortcoming.
Links
- The Hyper-Threading Wikipedia page discusses SMT and associated issues.
- Akeana hosted a recent SMT webinar.
- Image credit: https://commons.wikimedia.org/wiki/File:Hyper-threaded_CPU.png