Cerebras plans to have systems based on its recently announced WSE-3 AI accelerator available by midyear. Built in a 5 nm process, the WSE-3 (the wafer-scale chip at the heart of the company's CS-3 system) doubles the FP16 hardware of its 7 nm predecessor but keeps most other parameters the same. It also supports 2,048-node clusters, up from 256. Notably, Cerebras did not disclose adding FP8 support, which Intel and Nvidia have employed to boost performance on certain workloads. Nor did the company say anything about FP64, which many HPC applications require. The company withheld power figures; we expect power to increase much less than raw FP16 performance.
Notables
- Physical design—It’s impossible to overlook the Cerebras WSE-3’s unique physical design. Wafer-scale integration (WSI) was a hot topic around 1990, but apart from an EPROM company that took the name, the approach went nowhere. Because AI processing and HPC are embarrassingly parallel workloads, they’re a good fit for Cerebras’s homogeneous wafer-scale engines. Any physical design this unusual, however, presents business-model challenges, such as product cost and customer reluctance to adopt something so different.
- Business model—Semiconductor suppliers commonly sell boards when the design amounts to a simple card dominated by one major IC (the supplier’s chip). Some NPU/GPU suppliers have integrated further forward, offering systems and even complete clusters. Cerebras principally offers systems, but it also offers computing as a service through its affiliated partner G42. Such a service can benefit both suppliers and buyers, but it can also obscure a supplier’s health: a company can claim as customers those that paid only a small fee to try out the product.
- Architecture—As described in the Cerebras WSE-2 Hot Chips presentation, the company’s architecture employs a mesh of small vector engines that it can chain together for matrix operations. By contrast, most competitors employ large matrix units, which increase raw FLOPS per watt and per mm². The Cerebras approach, however, improves flexibility and FPU utilization. Because data storage and movement bottleneck training workloads, the mesh provides prodigious north-south-east-west (NSEW) bandwidth, and each vector engine has its own local memory, giving the architecture more memory per FLOP than competing designs. Because the wafer presents to software as a single device, the architecture also eases programming (a toy sketch of this computation style follows this list).
- Big models—The huge aggregate memory in the Cerebras WSE-3 and its predecessors holds data close to the processing cores, keeping them fed and enabling a single WSE-3 to operate on models with more than 100 billion parameters (a back-of-the-envelope sizing follows this list). Customers requiring more memory can connect WSE-3 systems to Cerebras’s MemoryX servers, whose massive DRAM capacity keeps additional data available on short notice. Exemplifying its memory-capacity and ease-of-programming advantages, Cerebras can train a model bigger than GPT-2 in 565 lines of code.
- Caveats—When Cerebras claims a particular FLOPS rate for the WSE-3, it’s FLOPS of “AI compute,” meaning FP16 FLOPS and including a 10× performance multiple for sparsity (quantified in a simple example below). Note also that the company hasn’t published MLPerf results, not even for the GPT-3 test alone.
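To make the mesh idea concrete, the toy sketch below models a matrix-vector multiply spread across a grid of small vector engines, each holding a block of weights in its own local memory and handing partial sums to a neighbor. The class and function names are invented for illustration; this is not Cerebras’s programming model or SDK.

```python
# Toy illustration only: a 2D grid of small vector engines, each with local
# memory, chained to compute y = W @ x. Names and structure are hypothetical,
# not Cerebras's actual programming interface.
import numpy as np

class Tile:
    """One mesh element: a small vector engine plus resident local memory."""
    def __init__(self, weight_block):
        self.local_mem = weight_block  # weights stay local; only activations move

    def fma(self, x_block, acc):
        """Vector multiply-accumulate on one block (the engine's inner loop)."""
        return acc + self.local_mem @ x_block

def mesh_matvec(W, x, tile=4):
    """Stream x across tile columns and accumulate partial sums along rows,
    mimicking NSEW neighbor-to-neighbor hand-offs."""
    rows, cols = W.shape
    grid = [[Tile(W[r:r + tile, c:c + tile]) for c in range(0, cols, tile)]
            for r in range(0, rows, tile)]
    y = np.zeros(rows)
    for ri, row in enumerate(grid):
        acc = np.zeros(tile)
        for ci, t in enumerate(row):
            acc = t.fma(x[ci * tile:(ci + 1) * tile], acc)
        y[ri * tile:(ri + 1) * tile] = acc
    return y

W = np.arange(64, dtype=float).reshape(8, 8)
x = np.ones(8)
assert np.allclose(mesh_matvec(W, x), W @ x)
```

The point of the sketch is that weights never leave the tile that owns them; only activations and partial sums traverse the mesh, which is why per-tile memory and neighbor bandwidth matter more than a single large matrix unit would.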
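For a sense of why memory capacity matters at this scale, the calculation below sizes the training state of a 100-billion-parameter model, assuming a conventional mixed-precision Adam recipe (FP16 weights and gradients plus FP32 master weights and two optimizer moments). The recipe is our assumption for illustration, not Cerebras’s disclosed implementation.

```python
# Rough training-state footprint for a 100B-parameter model under a standard
# mixed-precision Adam recipe (an assumption for illustration).
params = 100e9
bytes_per_param = (
    2 +  # FP16 weights
    2 +  # FP16 gradients
    4 +  # FP32 master copy of the weights
    4 +  # FP32 Adam first moment
    4    # FP32 Adam second moment
)
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:,.0f} GB of training state, before activations")  # ~1,600 GB
```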
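The sparsity caveat is easy to quantify: whatever headline “AI compute” number is claimed, the dense FP16 rate is one-tenth of it. The figure below is a placeholder, not Cerebras’s published specification.

```python
# Decomposing a headline "AI compute" rate that includes a 10x sparsity multiple.
claimed_ai_pflops = 100.0   # hypothetical placeholder, not a published spec
sparsity_multiple = 10
dense_fp16_pflops = claimed_ai_pflops / sparsity_multiple
print(f"dense FP16: {dense_fp16_pflops:.0f} PFLOPS")  # 10
```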
Competition
- AI—Like its predecessor, the Cerebras WSE-3 is best for training large models, whereas conventional accelerators handle a range of model sizes and perform comparatively better on smaller ones. Moreover, the infrastructure and tools surrounding conventional accelerators, especially Nvidia’s, are extensive, familiar, and mature.
- HPC—Cerebras offers an alternative to pairing an AMD, Intel, or Nvidia accelerator with an Epyc or Xeon processor, and its AI-training advantages should apply to HPC. However, its architecture is built around FP16. Processing FP32 data incurs, we estimate, a 4× penalty, and with the WSE-2, FP64 wasn’t practical; the additional FP16 data paths in the WSE-3 could change that. Native FP32 and FP64 with smaller penalties (2× and 4×, respectively) would broaden the WSE-3’s appeal and could let Cerebras and its partners take the Number One position on the Top500 list with a cluster smaller than the WSE-3’s 2,048-node maximum, but for now such support lies outside the company’s AI focus (the sketch below tallies the effective rates).
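The sketch below turns those penalty factors into effective throughput relative to an FP16 baseline. The baseline figure is a placeholder, and the “native” penalties are the hypothetical ones discussed above, not announced features.

```python
# Effective throughput at higher precisions relative to a dense FP16 baseline.
# The baseline is a hypothetical placeholder; penalty factors follow the
# estimates above (4x for FP32 today; 2x/4x for hypothetical native FP32/FP64).
fp16_pflops = 100.0
rates = {
    "FP32 today (~4x penalty)":         fp16_pflops / 4,
    "FP32 native, hypothetical (~2x)":  fp16_pflops / 2,
    "FP64 native, hypothetical (~4x)":  fp16_pflops / 4,
}
for label, pflops in rates.items():
    print(f"{label}: {pflops:.0f} PFLOPS")
```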
Customers
Cerebras is unusual in naming specific customers. To protect their brands and avoid tipping off competitors, companies usually aren’t keen on disclosing their suppliers, so Cerebras deserves credit for getting so many to agree to publicize the relationship. However, because some buy computer time rather than computer systems, it’s unclear how committed they are. Existing customers will find that the WSE-3 promises a big increase in raw performance for compute-bound workloads; new customers will find that it offers advantages when training the largest AI models.
Bottom Line
Competitors have taken various architectural approaches to challenging Nvidia, usually promising better performance per watt or per dollar, implicitly across a range of AI workloads: big or small, training or inference. Cerebras is unusual in focusing on training the largest models, where its wafer-scale design and memory capacity give it a big advantage. The Cerebras WSE-3 doubles raw floating-point throughput compared with the WSE-2, keeps other per-chip (per-wafer) specifications largely the same, and raises the cluster-scaling ceiling.
The company is attacking only a slice of the market, but that slice is rising in prominence and capital spending owing to interest in generative AI. The company has secured at least a dozen customers, either for systems or for its computing service, though none is yet large enough to propel it past the startup phase. The larger the leap in technology (wafer-scale integration in this case), the greater the faith required of prospective customers; Cerebras seeks to bridge that chasm by offering simpler programming of big models and by selling computing services through G42.