Tenstorrent Wormhole n300

Tenstorrent Connects to Developers Through Wormhole


Tenstorrent has released a new AI accelerator (NPU). An upgrade to the earlier Tenstorrent Grayskull, Wormhole is available on PCIe cards and development workstations. It integrates upgraded Tensix cores and adds 16 Ethernet ports. First alluded to when Grayskull was announced in 2020, Wormhole is quickly followed by Blackhole on Tenstorrent’s roadmap.

Wormhole Draws Developers into Blackhole

Tenstorrent has no illusions about the competitiveness of Grayskull and Wormhole in 2024. Although some customers will deploy Wormhole, most will wait for Blackhole. Tenstorrent has released Grayskull and Wormhole primarily for developers to gain expertise with its architecture, begin creating models, and contribute to the NPUs’ open-source tools.

Many Cores, Small Matrix Units Characterize Tenstorrent’s Architecture

Tenstorrent’s architecture divides processing among many cores that each include CPUs, memory, and a compute engine. The compute engine comprises SIMD and matrix units. Integrating 80 Tensix+ cores, Wormhole offers 292 FP8 TOPS. Employing many cores with relatively small matrix units contrasts with most NPUs, including Nvidia’s GPUs. Those designs employ only a few large units, sacrificing utilization and flexibility to achieve greater peak throughput.
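One common reason large matrix units lose utilization is padding: an operand whose dimensions aren't a multiple of the unit's tile size must be rounded up, and the padded slots do no useful work. The sketch below illustrates that effect only. The function name and the tile sizes (32 and 256) are hypothetical values chosen for illustration, not Tenstorrent or Nvidia specifications.

```python
import math

def padded_utilization(rows: int, cols: int, tile: int) -> float:
    """Fraction of MAC slots doing useful work when a rows x cols operand
    is padded up to a whole number of tile x tile blocks."""
    padded_rows = math.ceil(rows / tile) * tile
    padded_cols = math.ceil(cols / tile) * tile
    return (rows * cols) / (padded_rows * padded_cols)

# Hypothetical tile sizes; neither is a published hardware parameter.
SMALL_TILE = 32
LARGE_TILE = 256

small_unit = padded_utilization(100, 100, SMALL_TILE)  # pads to 128 x 128
large_unit = padded_utilization(100, 100, LARGE_TILE)  # pads to 256 x 256

print(f"small-tile utilization: {small_unit:.1%}")
print(f"large-tile utilization: {large_unit:.1%}")
```

On an awkwardly sized 100 x 100 operand, the smaller tile wastes far fewer slots. Real workloads and schedulers are more complicated, so treat this as intuition rather than a performance model.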

Compared with Grayskull, Wormhole doubles the number of multiply-accumulate (MAC) units, adds two bits to each multiplicand, and doubles accumulator size to 32 bits. The added precision should improve accuracy, especially when chaining sequential operations. Consistent with improving accuracy, Wormhole also adds FP32 support. The updated Tensix+ cores also reduce SIMD-operation stalls to improve throughput. Tenstorrent says common activation functions run up to 3× faster. We expect the next-generation Blackhole will increase FP8 TOPS by integrating more cores and raising their clock rate instead of by adding units to each core.
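To see why a wider accumulator matters when chaining operations, consider a long running sum. The sketch below is a rough illustration that uses IEEE half and single precision as stand-ins for 16-bit and 32-bit accumulators. That choice is an assumption: the actual accumulator formats in Grayskull and Wormhole may differ, and the helper names are ours.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(values, fmt: str) -> float:
    """Sum values, rounding the running total to the accumulator
    width after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to(fmt, acc + v)
    return acc

# 10,000 small addends whose exact sum is about 100.
values = [0.01] * 10_000

acc16 = accumulate(values, "e")  # stand-in for a 16-bit accumulator
acc32 = accumulate(values, "f")  # stand-in for a 32-bit accumulator

print("16-bit accumulator:", acc16)
print("32-bit accumulator:", acc32)
```

The narrow accumulator stops growing once each addend falls below half the spacing between representable values, so the total lands far from 100, whereas the 32-bit total stays close to it.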

Tenstorrent’s architecture also employs a different memory hierarchy from that of GPUs, which have large (e.g., 16K-entry) register files, a little local memory and cache, and off-chip HBM. Instead, Tenstorrent’s design has a normal-size register file, a large local memory, and standard LPDDR (Grayskull) or GDDR (Wormhole) DRAM. This approach should facilitate programming and avoid HBM’s cost and limited availability.

One Token Ring to Rule Them All?

An unusual Wormhole feature is its 16 100 Gbps Ethernet ports (200 GB/s total), the interconnect technology also employed by Intel Gaudi. Nvidia, by contrast, endows its data-center GPUs with 18 NVLink interfaces, providing a total of 900 GB/s of bidirectional bandwidth. Proprietary interfaces can deliver more throughput at lower latency and support coherency but add cost. Tenstorrent CEO Jim Keller argues that it’s better to make big computers from small ones and Ethernet.
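The aggregate Ethernet figure follows from the port count and per-port rate. The short calculation below converts bits to bytes and compares the result with the NVLink total quoted above. It takes both figures at face value and ignores protocol overhead and differences in how each vendor counts directions, so read the ratio as approximate.

```python
ETH_PORTS = 16                 # Wormhole Ethernet ports
ETH_GBPS_PER_PORT = 100        # gigabits per second per port
NVLINK_TOTAL_GBYTES_PER_S = 900  # Nvidia's quoted bidirectional total

BITS_PER_BYTE = 8

# Aggregate Ethernet bandwidth in gigabits, then gigabytes, per second.
eth_total_gbps = ETH_PORTS * ETH_GBPS_PER_PORT
eth_total_gbytes_per_s = eth_total_gbps / BITS_PER_BYTE

# How many times larger the quoted NVLink total is.
nvlink_advantage = NVLINK_TOTAL_GBYTES_PER_S / eth_total_gbytes_per_s

print(f"Ethernet total: {eth_total_gbps} Gbps = {eth_total_gbytes_per_s:.0f} GB/s")
print(f"NVLink / Ethernet: {nvlink_advantage:.1f}x")
```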

Competition and Customers

Late to market and reduced to only a development vehicle, Wormhole doesn’t threaten competing data-center NPUs. However, seeding the developer community indicates that Tenstorrent is seeking to prove its technology, nurture an ecosystem, and build its sales funnel in advance of launching a competitive product.

Bottom Line

Tenstorrent has consumed a lot of time and cash to build only two prototypes. Customers seeking an Nvidia alternative can use Wormhole to evaluate Tenstorrent’s architecture. If these appraisals are positive, the company should be able to quickly ramp Blackhole sales, provided that Blackhole systems prove much more powerful and economical than alternatives and the software is sufficient.

