Tensordyne Napier Slashes Costs and Boosts Token Rates

A nine-year-old startup has rebooted and taped out an AI processor that targets two challenges: cost and speed. Tensordyne promises its new Napier NPU enables rack-scale systems that deliver more throughput while requiring less power than an Nvidia NVL72. Tensordyne offers Napier in 72-NPU systems and four-system racks; it has booked preorders from neocloud companies and received interest from other potential customers.

Per-user token throughput has become an important metric for AI companies. Users who engage with AI interactively or run agentic workloads sensitive to run times are proving willing to pay a premium for faster execution. Meanwhile, data-center space and power limit AI deployments, and cost is emerging as a concern because AI companies’ capex has been outpacing revenue growth. Tensordyne’s principal innovation, computing in the logarithmic domain, and its unusual approach to scale-up networking are the cornerstone technologies that enable Napier to raise performance and lower power simultaneously.

Tensordyne’s Path from Cars to the Data Center

Originally named Recogni, Tensordyne rebranded last year. Founded in 2017, the company initially focused on AI processing for computer vision, such as object recognition for assisted driving. In that era, convolutional neural networks (CNNs) dominated AI workloads. That same year, however, Google researchers published the seminal paper on transformer models, which are the basis of large language models (LLMs), marking the dusk of the CNN era.

In 2022, Recogni hired a new CEO, Marc Bolitho, a veteran executive of automotive engineering companies. He took over from cofounder RK Anand, who remains as Tensordyne’s product chief. Anand has a networking background that includes leading HPE Juniper’s data-center business. Later in 2022, ChatGPT launched. Transformer models hosted in data centers, not self-driving cars, would propel AI-chip startups. Recogni possessed differentiating technology but had applied it to CNNs and positioned it for the auto market, so the company needed to adapt.

Napier the Rapier

Napier is the result. For a data-center NPU, it’s unusually small: a monolithic 300 W design with four HBM3E stacks. By comparison, the dual-die Nvidia Rubin GPU requires six times the power and has eight stacks. Tensordyne rates Napier at 2.1 PFLOPS for eight-bit floating-point operations, placing its peak raw throughput well below Rubin’s 17.5 PFLOPS and at less than half of Blackwell’s 5 PFLOPS.

Scaled up, a four-system rack integrates 288 Napiers compared with 72 GPUs in an Nvidia-based rack, reducing the peak-performance differential. Moreover, Tensordyne rates its rack at 13× the DeepSeek-R1 token throughput of an Nvidia NVL72 GB300. On a GPT mixture-of-experts model with 2 trillion parameters, Tensordyne claims per-user token rates exceeding those of a combined Nvidia GPU and Groq 3 (Nvidia LPX) cluster and of a similarly heterogeneous Amazon Trainium 3 and Cerebras CS-3 cluster. By Tensordyne’s account, it does so with a single 120 kW rack instead of a 1,500 kW (Nvidia) or 800 kW (Amazon) row. Such performance density indicates Napier has architectural advantages.
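The effect of chip count on the peak-performance gap can be checked with simple arithmetic. The sketch below uses only the per-chip FP8 ratings and rack sizes quoted above; it assumes a 72-GPU rack for each Nvidia part, and the variable and function names are illustrative. These are peak ratings, not delivered token throughput.

```python
# Peak FP8 ratings quoted in the article (PFLOPS per chip).
NAPIER_PFLOPS = 2.1
RUBIN_PFLOPS = 17.5
BLACKWELL_PFLOPS = 5.0

NAPIERS_PER_RACK = 4 * 72   # four 72-NPU systems per rack
GPUS_PER_RACK = 72          # assumed for both Nvidia parts


def rack_peak(chips, pflops_per_chip):
    """Peak rack throughput, ignoring utilization and interconnect limits."""
    return chips * pflops_per_chip


napier_rack = rack_peak(NAPIERS_PER_RACK, NAPIER_PFLOPS)
rubin_rack = rack_peak(GPUS_PER_RACK, RUBIN_PFLOPS)
blackwell_rack = rack_peak(GPUS_PER_RACK, BLACKWELL_PFLOPS)

chip_gap = RUBIN_PFLOPS / NAPIER_PFLOPS   # per-chip ratio, Rubin over Napier
rack_gap = rubin_rack / napier_rack       # per-rack ratio, Rubin over Napier

print(f"Napier rack:    {napier_rack:.1f} PFLOPS")
print(f"Rubin rack:     {rubin_rack:.1f} PFLOPS")
print(f"Blackwell rack: {blackwell_rack:.1f} PFLOPS")
print(f"Rubin/Napier gap: {chip_gap:.1f}x per chip, {rack_gap:.1f}x per rack")
```

On these figures, the Rubin advantage shrinks from roughly 8× per chip to roughly 2× per rack, and a Napier rack's peak exceeds that of a 72-GPU Blackwell rack.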

Napier’s Architecture Unlocks Performance

Like the Google TPU, Napier employs systolic arrays and vector processors. Provided they are fully loaded, systolic arrays require less power and better utilize their compute resources than the many small cores of a GPU. Napier also has capacious on-chip memory, helping to keep the compute resources fed and reduce power compared with accessing HBM. The 256 MB SRAM compares with 230 MB for the first Groq chips, although the next-generation Groq NPU will have 512 MB. Groq’s design, however, has no other memory, a critical limitation. By contrast, Napier packs 144 GB of HBM. That’s half of Rubin’s capacity but more memory per FLOP.
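The memory-per-FLOP comparison follows from the figures already given. This short sketch derives Rubin's HBM capacity from the statement that Napier has half as much, and divides each capacity by the peak FP8 rating. The names are illustrative.

```python
# Capacities in GB and peak FP8 ratings in PFLOPS, as quoted in the article.
NAPIER_HBM_GB = 144
RUBIN_HBM_GB = 2 * NAPIER_HBM_GB   # Napier has half of Rubin's capacity
NAPIER_PFLOPS = 2.1
RUBIN_PFLOPS = 17.5

napier_gb_per_pflop = NAPIER_HBM_GB / NAPIER_PFLOPS
rubin_gb_per_pflop = RUBIN_HBM_GB / RUBIN_PFLOPS
advantage = napier_gb_per_pflop / rubin_gb_per_pflop

print(f"Napier: {napier_gb_per_pflop:.1f} GB per PFLOPS")
print(f"Rubin:  {rubin_gb_per_pflop:.1f} GB per PFLOPS")
print(f"Napier advantage: {advantage:.1f}x")
```

By this measure, Napier carries about four times as much HBM per unit of peak compute as Rubin.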

Napier affords this large SRAM because Tensordyne’s signature feature is computing in the base-2 logarithmic domain. Employing logarithms converts multiplication in the linear domain to addition. Adders are more compact than multipliers, freeing up die area for memory. They also use less energy, contributing to Napier’s power advantage. However, linear addition doesn’t map to a simpler operation with logarithmic values, necessitating exponentiation to return to the linear domain. Fortunately, base-2 exponentiation maps to a shift operation. Even better, further simplification is possible by only approximating exponentiation, and the resulting error is inconsequential.
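A minimal sketch can make the trade-off concrete. It multiplies by adding base-2 logarithms, then returns to the linear domain to accumulate. The approximation shown, which shifts by the integer part of the exponent and treats 2^f as 1 + f for the fractional part, is an assumption chosen for illustration; Tensordyne has not said which approximation Napier uses, and this simple one overestimates each term by up to about 6%. Function names are hypothetical, and the sketch handles only positive values.

```python
import math


def to_log2(x):
    """Encode a positive value as its base-2 logarithm (signs omitted here)."""
    return math.log2(x)


def log_mul(a_log, b_log):
    """A linear-domain multiply becomes a log-domain add."""
    return a_log + b_log


def exp2_exact(v):
    """Reference conversion back to the linear domain."""
    return 2.0 ** v


def exp2_approx(v):
    """Approximate 2**v: shift by the integer part, use 1 + f for 2**f."""
    i = math.floor(v)               # integer part selects the shift amount
    f = v - i                       # fractional part, 0 <= f < 1
    return math.ldexp(1.0 + f, i)   # (1 + f) * 2**i


def log_dot(xs_log, ws_log, exp2=exp2_approx):
    """Dot product: log-domain multiplies, linear-domain accumulation."""
    return sum(exp2(log_mul(x, w)) for x, w in zip(xs_log, ws_log))


xs = [1.5, 2.0, 0.75, 3.0]
ws = [0.5, 1.25, 2.0, 0.1]
xs_log = [to_log2(x) for x in xs]
ws_log = [to_log2(w) for w in ws]

exact = log_dot(xs_log, ws_log, exp2=exp2_exact)
approx = log_dot(xs_log, ws_log, exp2=exp2_approx)
print(f"exact={exact:.4f} approx={approx:.4f} ratio={approx / exact:.4f}")
```

The approximation is exact whenever the exponent is an integer, since that case reduces to a pure shift. A production design would presumably refine the fractional term or rely on errors averaging out across a long accumulation.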

Logarithmic-based computing requires adapting models, analogous to quantizing a 16-bit model to use four- or eight-bit data. Users quantize model weights to the logarithmic domain in advance, while Napier dynamically quantizes activations and scaling factors.
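Offline weight conversion might look something like the following sketch. It stores each weight as a sign plus a fixed-point base-2 logarithm of its magnitude. The encoding, the `frac_bits` parameter, and the handling of zero are all hypothetical: Tensordyne has not disclosed Napier's number format, and the sketch models neither a bit-width limit nor scaling factors.

```python
import math


def quantize_log2(w, frac_bits=3):
    """Return (sign, code), where code = round(log2|w| * 2**frac_bits)."""
    if w == 0.0:
        return (0, 0)   # zero has no logarithm; a sign of 0 flags it here
    sign = 1 if w > 0 else -1
    code = round(math.log2(abs(w)) * (1 << frac_bits))
    return (sign, code)


def dequantize_log2(q, frac_bits=3):
    """Recover the linear-domain value that a (sign, code) pair represents."""
    sign, code = q
    if sign == 0:
        return 0.0
    return sign * 2.0 ** (code / (1 << frac_bits))


weights = [0.8, -0.03, 1.7, 0.0, -0.5]
for w in weights:
    q = quantize_log2(w)
    print(f"{w:+.4f} -> {q} -> {dequantize_log2(q):+.4f}")
```

Because rounding happens in the log domain, the error is relative rather than absolute. With three fractional bits, every nonzero weight lands within a factor of 2^(1/16), or about 4.4%, of its original value, regardless of magnitude.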

Scaling Up with a Cell-Based Fabric

Achieving high token throughput in a quarter-rack system, particularly with small (by NPU standards) chips, requires a low-latency, high-throughput interconnect. Tensordyne describes its scale-up interconnect as allowing 72 Napiers to operate as one, even when spreading a single workload among all chips. The company particularly focused on avoiding stalls during the LLM decode phase, which sequentially generates output tokens.

To accomplish these goals, Tensordyne employs a cell-based switch fabric. This fabric type has been a staple of networking and telecommunications, so much so that instead of crafting its own, Tensordyne sources its switch from HPE Juniper. Chopping data transfers into small cells doesn’t move data faster but reduces variation. It facilitates interleaving transactions, reducing worst-case latencies, which is essential for performance: when multiple Napiers’ systolic arrays are linked to collaboratively process a single layer of an AI model, forward progress stalls while the system waits for the slowest transaction.
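A toy model illustrates why interleaving helps even though it adds no bandwidth. The sketch below assumes a single link that carries one cell per time unit and compares two policies: sending each transfer whole in arrival order, and sending one cell per transfer in round-robin turns. It is an illustration of the general principle, not a model of the HPE Juniper switch, and the function names are hypothetical.

```python
from collections import deque


def fifo_completion(sizes):
    """Send each transfer whole, in arrival order; return finish times."""
    t, done = 0, []
    for size in sizes:
        t += size
        done.append(t)
    return done


def cell_round_robin_completion(sizes):
    """Send one cell per transfer per turn; return finish times."""
    remaining = list(sizes)
    done = [0] * len(sizes)
    queue = deque(i for i, s in enumerate(sizes) if s > 0)
    t = 0
    while queue:
        i = queue.popleft()
        t += 1                      # one cell occupies the link for one unit
        remaining[i] -= 1
        if remaining[i] == 0:
            done[i] = t
        else:
            queue.append(i)         # more cells left; rejoin the rotation
    return done


sizes = [16, 1, 1, 1]               # one long transfer ahead of three short ones
fifo = fifo_completion(sizes)
cells = cell_round_robin_completion(sizes)
print("whole transfers:", fifo)     # [16, 17, 18, 19]
print("interleaved:    ", cells)    # [19, 2, 3, 4]
```

The link finishes all the work at time 19 under both policies, so the data moves no faster. With interleaving, however, the short transfers complete by time 4 instead of waiting as long as 19 behind the long one, which is the kind of worst-case reduction that keeps tightly coupled chips from stalling.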

Bottom Line

In return for asking customers to adapt models for logarithmic-based processing, Tensordyne promises better performance per watt and per data-center square foot on LLMs than competing solutions. This translates to a 10× cost advantage, which should compensate for the need to port models. The biggest caveat is that Napier has only just taped out. Tensordyne has proven key aspects of its technology with earlier chips but must show Napier-based systems and racks can achieve their ambitious targets.