Ubitium UPA block diagram

Can the Ubitium UB410 be the First Universal Processor?


Startup Ubitium is about to tape out its first chip, which it calls a universal processor. The forthcoming UB410 is RISC-V compatible and aims to replace CPUs, GPUs, NPUs, DSPs, and FPGAs. Initially focusing on consumer applications, the company seeks to provide a unified environment for general-purpose, AI, signal-processing, and other code.

Ubitium’s founders have experience at various companies, and all three served stints at Pact, which had developed the Extreme Processor Platform (XPP). The XPP integrated 128 32-bit processors—a huge number for 2000, when it was introduced. Meanwhile, Stanford and other organizations researched coarse-grained reconfigurable arrays (CGRA). Later, two of Ubitium’s founders started Hyperion Core, which worked on a universal RISC-V chip before morphing into Ubitium.

A CPU with 256 Execution Units

Ubitium’s approach resembles these efforts but has important differences. Instead of tiling cores, the company combines a CPU front end with a processing-element (PE) array to provide both a single-thread programming model and parallel processing. The UB410 can boot an off-the-shelf RISC-V Linux binary and provides multithreading and data-flow processing for high throughput. To reduce power compared with a high-performance CPU, it eliminates the instruction scheduling and other complex logic that such CPUs rely on to raise throughput.

The UB410 integrates 16 cores. A core comprises four universal processing arrays (UPAs). Each UPA comprises 256 PEs and a front end. We estimate the chip runs at about 900 MHz. A UPA can operate in out-of-order (OoO) or loop-acceleration mode. In OoO mode, it behaves like a CPU that implements simultaneous multithreading (SMT), dispatching up to eight instructions per cycle to the PEs (e.g., four instructions from each of two threads or one instruction from each of eight threads). Each dispatched instruction maps to a single PE.
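Multiplying out the hierarchy gives the chip-wide PE count. The short Python sketch below does that arithmetic and lists the ways eight dispatch slots could be divided evenly among threads. The names are illustrative, and the even-split enumeration is an assumption: the two examples quoted in the text are the only combinations confirmed here.

```python
# Figures quoted in the article; all names are illustrative, not Ubitium's.
CORES = 16
UPAS_PER_CORE = 4
PES_PER_UPA = 256
DISPATCH_WIDTH = 8  # max instructions dispatched per UPA per cycle in OoO mode


def total_pes() -> int:
    """Chip-wide PE count: cores x UPAs per core x PEs per UPA."""
    return CORES * UPAS_PER_CORE * PES_PER_UPA


def even_dispatch_splits(width: int = DISPATCH_WIDTH) -> list:
    """(threads, instructions per thread) pairs that exactly fill the width.

    Assumption: only even splits are considered. The article confirms
    just the (2, 4) and (8, 1) cases.
    """
    return [(t, width // t) for t in range(1, width + 1) if width % t == 0]


if __name__ == "__main__":
    print("Total PEs:", total_pes())
    print("Even dispatch splits:", even_dispatch_splits())
```

The split list is only a way to reason about the dispatch budget; real SMT hardware need not divide slots evenly.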

Different Modes for Different Problems

Loop-acceleration mode organizes PEs for data-flow processing, each executing a single operation on data flowing through the array. Whereas OoO mode is for control code and similar software, loop acceleration is for AI and signal processing. In this mode, the array behaves as a CGRA or fine-grained (e.g., FPGA) machine but is programmed (like a CPU) instead of configured (like a CGRA or FPGA). Ubitium offers a compiler that maps code to the array and also lets source code provide parallelism hints (#pragma) and call hand-optimized routines.
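As a rough mental model of data-flow execution, the Python sketch below treats each PE as a stage that applies one fixed operation to every value streaming past it. The loop body, the three-stage mapping, and all names are hypothetical; this is a software analogy built on generators, not Ubitium's compiler output or programming interface.

```python
from typing import Callable, Iterable, Iterator, List


def pe(op: Callable[[int], int], stream: Iterable[int]) -> Iterator[int]:
    """Model one PE: apply a single fixed operation to each value flowing through."""
    for value in stream:
        yield op(value)


def run_loop_dataflow(inputs: Iterable[int]) -> List[int]:
    """Map the loop body y = ((x * 3) + 1) >> 1 onto three chained PEs."""
    multiplied = pe(lambda x: x * 3, inputs)      # hypothetical PE 0: multiply
    incremented = pe(lambda x: x + 1, multiplied)  # hypothetical PE 1: add
    shifted = pe(lambda x: x >> 1, incremented)    # hypothetical PE 2: shift
    return list(shifted)


def run_loop_sequential(inputs: Iterable[int]) -> List[int]:
    """Reference version: the same loop body as ordinary sequential code."""
    return [((x * 3) + 1) >> 1 for x in inputs]


if __name__ == "__main__":
    data = range(5)
    print(run_loop_dataflow(data))
    print(run_loop_sequential(data))
```

Both functions compute the same result. The difference the analogy is meant to highlight is that, in the data-flow form, each operation stays fixed in place and the data moves, so in hardware every stage could work on a different loop iteration at once.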

To enable GPU-like single-instruction, multiple-thread (SIMT) programming, Ubitium is developing a thread-acceleration mode that groups PEs to form units similar to Nvidia’s streaming multiprocessors. Developers can program the UPAs using a CUDA-like language. The UB410 supports a pair of 16-thread (16-PE) warps, but future designs may allow bigger configurations. Remaining PEs in a UPA can operate in OoO mode. In general, software can dynamically partition the UB410, operating each part in a different mode.

Magic Bus

On-chip interconnect is critical to array-based processors. In OoO mode, operations must forward results to subsequent dependent instructions running on different PEs, and loads/stores must transfer data from/to memory. Loop acceleration simplifies matters because data movement isn’t arbitrary in a data-flow machine, but data must still move in and out of external memory. Thread acceleration requires moving data to local storage that functions like a GPU register file. Ubitium withholds details of how it keeps the UPA system bus fast, yet small and low power. For external memory, the UB410 supports up to 64 GB of LPDDR5 DRAM.

Ubitium rates the UB410 at 5 W while delivering a theoretical maximum of 30 INT8 TOPS and 12 BF16 TFLOPS. Initial target applications include consumer audio. The UB410 enables a single chip, programmed with a single toolchain, to handle all channels, performing signal processing, AI functions, and general code execution. However, the UB410 can address other applications, and future designs could scale down to reduce cost and power or scale up, even to HPC performance levels.
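These ratings can be sanity-checked with simple arithmetic. The Python sketch below derives peak efficiency per watt and the implied operations per PE per cycle, assuming the PE count given earlier (16 cores, four UPAs per core, 256 PEs per UPA) and our 900 MHz clock estimate. The function names are illustrative, and the clock is an estimate rather than a vendor figure.

```python
# Vendor ratings quoted in the article.
POWER_W = 5.0
INT8_TOPS = 30.0
BF16_TFLOPS = 12.0

# Derived from the article's hierarchy: 16 cores x 4 UPAs x 256 PEs.
TOTAL_PES = 16 * 4 * 256
# Assumption: the article's own clock estimate, not a confirmed spec.
CLOCK_GHZ = 0.9


def per_watt(throughput_tera: float, power_w: float = POWER_W) -> float:
    """Peak tera-operations per second per watt."""
    return throughput_tera / power_w


def ops_per_pe_per_cycle(throughput_tera: float) -> float:
    """Operations each PE must complete per cycle to reach the rated peak."""
    return (throughput_tera * 1e12) / (TOTAL_PES * CLOCK_GHZ * 1e9)


if __name__ == "__main__":
    print(f"INT8: {per_watt(INT8_TOPS):.1f} TOPS/W")
    print(f"BF16: {per_watt(BF16_TFLOPS):.1f} TFLOPS/W")
    print(f"INT8 ops per PE per cycle: {ops_per_pe_per_cycle(INT8_TOPS):.2f}")
    print(f"BF16 ops per PE per cycle: {ops_per_pe_per_cycle(BF16_TFLOPS):.2f}")
```

Under these assumptions, the INT8 rating works out to about two operations per PE per cycle, which would be consistent with one multiply-accumulate per PE per cycle if a multiply-accumulate is counted as two operations. That reading is an inference from the published numbers, not something Ubitium has stated.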

Bottom Line

Ubitium is the most recent company to recognize that packing a chip with cores yields high theoretical performance. Forerunners, such as the InspireSemi Thunderbird, have struggled to deliver practical performance. Those that did deliver it have been difficult to use. Ultimately, they have faded away, unable to get customers to give up CPUs, DSPs, and FPGAs. Possessing a team proficient in CGRA-like designs, Ubitium has developed a RISC-V chip that provides compatibility with existing software and may overcome the technical shortcomings of earlier array processors. When the UB410 returns from the fab, Ubitium will finally be able to prove to customers the value of its architecture.

