
Fractile Computes in Memory for AI Inference


Startup Fractile has landed seed funding to build a better data-center AI accelerator (NPU), claiming its design will be 100× faster than an Nvidia H100 at 1/10th the cost. The company withholds details but indicates its technology avoids the bottleneck between processing and memory. This should improve processing-element utilization, boosting performance and reducing power. At the same time, Fractile is mindful that its approach must remain economical when scaled out to run 400-billion-parameter and larger neural networks and flexible enough to address models other than transformers.

Fractile is developing in-memory computation, building custom multiply-accumulate (MAC) circuits that also store state. Various other companies have designed AI accelerators for the data center and the edge that bring computation and memory closer together. For example, the D1 chip employed in the Tesla Dojo places a memory block next to each of its 354 NPU cores. To scale out, each D1 can attach to four adjacent D1s. However, the D1’s on-chip memory proved insufficient, forcing Tesla to add HBM to Dojo in an unusual way. By contrast, another at-memory NPU, the Tenstorrent Grayskull, supports external memory but is difficult to scale because it can’t directly interface with peers; Tenstorrent’s follow-on Wormhole chip addresses this shortcoming with Ethernet.
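
As an illustration of the general data-movement idea (not of Fractile’s undisclosed circuit), the following NumPy sketch models a weight-stationary tile: the weight slice is written into the tile once and stays resident, so only activations and results cross the tile boundary instead of the full weight matrix being re-fetched from external memory on every pass.

```python
import numpy as np

class WeightStationaryTile:
    """Toy model of an at-memory MAC tile: the weight slice lives inside the
    tile, so only activations and results move. Illustrative only; not a
    description of Fractile's actual design."""

    def __init__(self, weight_slice: np.ndarray):
        # Weights are written once and stay resident for every inference pass.
        self.weight_slice = weight_slice

    def mac(self, activations: np.ndarray) -> np.ndarray:
        # Multiply-accumulate against the locally stored weights.
        return self.weight_slice @ activations


def layer_forward(tiles, activations):
    # Stream the same activations to every tile and gather the partial outputs,
    # rather than re-reading the full weight matrix from external memory.
    return np.concatenate([t.mac(activations) for t in tiles])


# Example: a 1024x1024 weight matrix split row-wise across four tiles.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
tiles = [WeightStationaryTile(W[i * 256:(i + 1) * 256]) for i in range(4)]
x = rng.standard_normal(1024).astype(np.float32)
assert np.allclose(layer_forward(tiles, x), W @ x, atol=1e-3)
```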

Having witnessed these approaches, Fractile is likely to implement some kind of chip-to-chip interface. Power-efficiency gains from its in-memory architecture should enable it to cram more memory and computational resources into each chassis, further improving efficiency by amortizing host-processor and system-level overheads over more AI throughput.

Beyond merging MAC operations and memory, Fractile may gain efficiency by rethinking the MAC arrays NPUs employ. Replacing a large systolic array optimized for general matrix-matrix multiplication (GEMM) with arrays organized around general matrix-vector multiplication (GEMV) could enable finer-grained runtime allocation of computing resources. This would improve utilization and reduce latency, particularly compared with packing operations from different simultaneous queries into the array Tetris-style. Moreover, GEMV operations dominate the autoregressive token decoding at the heart of large language models (LLMs), as the sketch below illustrates.
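
To make the GEMM/GEMV distinction concrete, here is a minimal NumPy sketch with arbitrary toy layer sizes: processing a whole prompt in one pass is a matrix-matrix product, whereas each decode step for a single query multiplies the same weight matrix by one activation vector. In the GEMV case each weight is fetched for just one multiply, which is why a GEMM-optimized array sits underutilized unless work from many queries is packed into it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 1024, 4096                 # toy layer dimensions
W = rng.standard_normal((d_ff, d_model)).astype(np.float32)

# Prefill: 512 prompt tokens processed at once -> GEMM, high weight reuse.
prompt = rng.standard_normal((d_model, 512)).astype(np.float32)
prefill_out = W @ prompt                   # shape (d_ff, 512)

# Decode: one new token per query per step -> GEMV, each weight used once.
token = rng.standard_normal(d_model).astype(np.float32)
decode_out = W @ token                     # shape (d_ff,)
```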

Bottom Line

Fractile has made big claims but hasn’t disclosed supporting details. It’s one of many companies addressing data-center inference with an architecture that should outperform the dominant GPU-based approach. A novel in-memory circuit design could improve Fractile’s performance and power efficiency over these rivals. Hype around LLMs remains frenetic, but executing them must become more economical. To this end, innovations such as those Fractile is working on are essential.

