Speeding up its development cadence, Google has disclosed initial details of its sixth-generation TPU. The search giant’s cloud services arm plans to make the new data-center AI accelerator (NPU) publicly available by the end of the year. Replacing its predecessors’ alphanumeric designations, the Trillium moniker indicates Google will do more to promote the chip to external users. The company claims peak per-chip performance is 4.7× greater than the TPU v5e and energy efficiency is 67% better.
One Core Is Sufficient If It’s Big Enough
Since the fourth TPU generation, Google has developed two data-center NPUs per generation, a smaller one emphasizing energy efficiency and a larger one for big training clusters. The TPU v4i and v5e had a single TensorCore, whereas the larger v4 and v5p integrated two. Trillium is a smaller design; although Google didn’t disclose its TensorCore count, we infer it has only one.
Google Trillium Updates the TPU Architecture
In Google’s architecture, a TensorCore comprises a vector processing unit (VPU), associated memory (VMEM), and matrix units (MXUs). (By contrast, Nvidia’s similarly named Tensor Cores are only matrix units.) To achieve the 4.7× gain, Trillium both enlarges the MXUs and raises their clock speed. Changing their size is a departure from previous generations, which kept the MXU a 128×128 array to preserve compatibility and constrain wire delay; Google previously increased performance by adding MXUs, not by enlarging them. Although the company hasn’t tried to maintain binary compatibility across generations, it has sought to carry forward hardware-dependent compiler optimizations.
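To see why the MXU’s 128×128 shape matters to software, consider how a framework such as JAX hands a matrix multiplication to the XLA compiler, which tiles it onto the systolic array; dimensions that aren’t multiples of the MXU size get padded, leaving part of the array idle. The sketch below is our illustration under those assumptions, not disclosed Trillium behavior.

```python
# Minimal JAX sketch (our illustration, not disclosed Trillium behavior):
# XLA tiles this matmul onto the TPU's MXUs. Dimensions that are multiples
# of the 128x128 MXU tile use the systolic array fully; other shapes are
# padded, leaving part of the array idle.
import jax
import jax.numpy as jnp

@jax.jit
def dense_layer(x, w):
    # bf16 inputs with f32 accumulation is the TPU's usual matmul mode.
    return jnp.dot(x, w, preferred_element_type=jnp.float32)

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(k1, (1024, 4096), dtype=jnp.bfloat16)  # activations
w = jax.random.normal(k2, (4096, 8192), dtype=jnp.bfloat16)  # weights
y = dense_layer(x, w)  # compiled by XLA; runs on the MXUs when a TPU is present
```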
Supplementing the TensorCore, a TPU integrates a sea of tiny SparseCores to process embeddings, a structure employed in training recommendation models and a workload Meta’s MTIA also targets. Trillium revises the SparseCore design, but Google has withheld details. Embeddings map poorly to TensorCores, and offloading them to the host processor fares no better: the host has insufficient memory bandwidth and only a thin PCIe link to the accelerator.
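Embeddings illustrate why dedicated hardware helps: a lookup is a sparse, bandwidth-bound gather over a huge table, with almost no matrix math to feed an MXU. A minimal sketch of the access pattern follows; the table and batch sizes are invented for illustration.

```python
# Minimal sketch of an embedding lookup: a sparse, bandwidth-bound gather
# with almost no arithmetic to keep a matrix unit busy -- the access pattern
# SparseCores target. Table and batch sizes are invented for illustration.
import jax.numpy as jnp

vocab, dim = 1_000_000, 128                    # production tables reach tens of GB
table = jnp.zeros((vocab, dim), dtype=jnp.bfloat16)

def lookup(ids):
    # Irregular gather across the table; memory bandwidth, not FLOPS, is the limit.
    return jnp.take(table, ids, axis=0)

ids = jnp.array([17, 42_001, 999_999])         # sparse feature IDs for one sample
vectors = lookup(ids)                          # shape (3, 128)
```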
Since v4, the TPU architecture has integrated large common memory (CMEM) pools. Akin to software-managed caches, these SRAM blocks are faster and more energy efficient than DRAM. Unlike some data-center NPUs that rely solely on big on-chip memories, the TPU also has external DRAM (HBM). Trillium doubles its predecessor’s HBM capacity and bandwidth, but Google has withheld CMEM information.
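Google hasn’t given absolute memory numbers for Trillium, but applying the doubling claim to the v5e’s published Cloud specification (16 GB of HBM at roughly 819 GB/s) yields a rough estimate. The sketch below is our back-of-envelope figure, not a disclosed spec, and assumes the v5e spec is the right baseline.

```python
# Back-of-envelope Trillium HBM estimate from Google's doubling claim.
# Baseline numbers are the TPU v5e's published Cloud specs (our assumption
# that they are the right baseline); outputs are estimates, not disclosures.
v5e_hbm_gb = 16        # GB of HBM per v5e chip
v5e_bw_gbps = 819      # GB/s of HBM bandwidth per v5e chip

trillium_hbm_gb = 2 * v5e_hbm_gb      # ~32 GB
trillium_bw_gbps = 2 * v5e_bw_gbps    # ~1,638 GB/s, roughly 1.6 TB/s

print(f"Estimated Trillium HBM: {trillium_hbm_gb} GB at ~{trillium_bw_gbps} GB/s")
```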
256 TPUs in a Pod
Google assembles Trillium into 256-TPU pods as it did with the v5e, connecting adjacent NPUs with interchip interconnects (ICIs). Trillium doubles the ICI bandwidth. Logical pod slices enable multiple jobs to run concurrently; the smallest slice is a single TPU. Going in the other direction, to scale beyond 256 chips, pods connect via Google’s data-center interconnect, a proprietary Ethernet-based stack implemented on the company’s infrastructure processing units (IPUs), which others call network processing units (the other NPUs) or data-processing units (DPUs). To train its biggest models, Google uses its larger TPUs, building thousand-node clusters linked by ICIs and a novel all-optical interconnect.
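From the developer’s side, a pod slice appears as a set of JAX devices that can be arranged into a mesh for data and model parallelism, with the ICIs carrying the resulting collectives. A minimal sketch follows, assuming a hypothetical 16-chip slice; the axis names and array shapes are our illustrative choices.

```python
# Minimal JAX sketch of treating a TPU pod slice as a 2-D device mesh.
# Assumes a hypothetical 16-chip slice; axis names and shapes are illustrative.
import numpy as np
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((4, 4))      # 16 TPU chips in the slice
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard a weight matrix along the "model" axis and replicate it over "data";
# XLA lowers the resulting collectives onto the interchip interconnect (ICI).
w = jax.device_put(
    np.ones((8192, 8192), dtype=np.float32),
    NamedSharding(mesh, P(None, "model")),
)
print(w.sharding)
```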
With Great Efficiency Comes Great Power
One criticism of American restaurants is that portions are oversized. The price per forkful is low, however (the value is high), because heaps of cheap food compensate for expensive rent and labor. The same applies to merchant NPUs and GPUs: they’re expensive, but you get a lot. Yet just as you size portions differently when cooking at home, Google sizes the TPU to optimize total cost of ownership (TCO). For example, it integrates CMEM instead of adding more TensorCores to boost TOPS, and it designs smaller NPUs to reduce the cost of provisioning power and cooling. Nonetheless, the few figures Google has disclosed (4.7× the peak performance but only 67% better energy efficiency) indicate Trillium draws considerably more power per chip than the v5e, as the arithmetic below shows.
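The inference is simple arithmetic, assuming the 67% figure refers to performance per watt (our interpretation of Google’s claim):

```python
# Back-of-envelope power estimate, assuming "67% more efficient" means perf/W.
perf_ratio = 4.7             # Trillium vs. v5e peak performance (Google's claim)
perf_per_watt_ratio = 1.67   # 67% better energy efficiency (our interpretation)
power_ratio = perf_ratio / perf_per_watt_ratio
print(f"Implied per-chip power vs. v5e: ~{power_ratio:.1f}x")   # roughly 2.8x
```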
Competition: TCO Is the KSP
Google embarked on TPU development because it needed an AI accelerator better than merchant-market offerings. Whereas better once meant faster, it now means cheaper to operate. Nvidia’s ever-increasing raw performance improves training times and inference throughput (see Nvidia Blackwell) but comes with a big price tag and rising power consumption, whose associated costs dominate the economics of a whole data center, Google’s chief concern. As Google races to catch up in large language models, its in-house technology should help it do so more economically than relying exclusively on a merchant design.
Customers Can’t Buy Trillium but Can Rent It
Developers will experience Trillium through Google’s Cloud TPU service. We expect the new chip to provide more throughput per dollar. Where the service already has cost and performance advantages over GPU-based instances, those advantages should widen. Where it falls short, such as for projects requiring a specific PyTorch function that Google hasn’t implemented, a faster TPU won’t help. Where flexibility and the developer ecosystem dominate, Nvidia keeps the advantage.
Bottom Line
Designed originally for internal use, the TPU family is now available as a service. Google disclosed Trillium only a year after the TPU v5e, approximating the cadence of merchant-market chips. Meanwhile, a larger v6 may be used internally to train Gemini and other Google models. The TPU architecture’s large matrix units give it strong matrix throughput, its SparseCores enable it to excel at embeddings, and the 256-TPU pods and slicing options should suit small to medium workloads. Delivering higher peak throughput, added HBM capacity, and greater bandwidth per chip, Trillium extends the TPU v5e’s advantages. While we await more Trillium (or even v5e) technical details from Google, software engineers and data scientists are waiting to see how the company improves the developer experience relative to Nvidia’s.
Image credit: By СССР – Own work, CC BY-SA 2.5 ca, https://commons.wikimedia.org/w/index.php?curid=79426316
https://en.wikipedia.org/wiki/Trillium_grandiflorum#/media/File:Trillium_grandiflorum_at_Backus_Woods.jpg