
Flow PPU Promises to Accelerate Threaded Code 100×

Adjust to the object, and you shall find a way.

Newly minted startup Flow Computing is developing technology comprising a parallel-processing unit (PPU) and a compiler to speed up CPU code one hundredfold. Based on research begun at the University of Eastern Finland (Joensuu) and continued at the state-owned VTT Technical Research Centre of Finland, the technology strives to counter the slowing rate of CPU improvements without breaking compatibility.

Notables

  • Flow doesn’t speed up everything. Whereas others have offered elixirs to magically speed up computation, Flow specifically targets threaded CPU-based math code, making its claims credible—albeit of limited applicability. Moreover, SIMD instructions, such as Arm SVE and x86 AVX, and GPU/NPU accelerators address similar problems.
  • Flow focuses on speeding up parallel-processing primitives. The company targets the problems of CPU code getting bogged down synchronizing threads, accessing memory concurrently, and context switching (the code sketch after this list illustrates these overheads). It supports propagating data through the PPU like a systolic array, enabling the PPU to handle dependent operations. Flow’s technology also promises to hide latency, which should improve scaling and make the PPU insensitive to data placement.
  • Your mileage may vary. According to the company, a developer who runs unmodified source code through the Flow compiler should expect a 2× speedup on routines employing POSIX/Linux pthreads. Lightly modifying the software, perhaps with assistance from Flow’s AI-based (of course!) tool, can yield a 10× gain. Getting the full 100× requires a rewrite, although the new code is much simpler because the PPU natively understands vector/matrix operations. Even so, these gains depend on the code and apply only to the inner loop. The company states that a 64-core PPU achieves 38–107× (100× in marketing speak). Sequential code isn’t sped up (see Amdahl’s Law and the worked example after this list). The company hasn’t clarified how well a PPU handles diverging execution paths among threads, such as when if-then-else statements go different directions.
  • Everybody should use a PPU! Flow targets everything from smartwatches to servers by supporting PPU configurations from several to 256 cores. Realistically, it will be used in processors dedicated to performance-sensitive threaded workloads. A processor carved into dual-core virtual servers and rented to enterprise customers is unlikely to integrate a PPU. Although even spreadsheets and web browsers are now threaded, they aren’t performance sensitive and wouldn’t benefit from Flow’s promised speedup. However, neural networks, video transcoders, and networking software are among the many performance-sensitive threaded workloads Flow could target.
  • A PPU doesn’t require area or power! To the extent that a PPU obviates CPU cores, the resulting core-count reduction can more than offset the PPU’s power and area. The company estimates a 64-core PPU will occupy 22 mm² in a 3 nm process and require 43 W.
  • Flow supports all instruction sets, depending on how one defines “support” and “all.” To apply Flow’s technology to a proprietary instruction set such as Arm or x86, a company must obtain an architecture license from Flow and implement a PPU itself. The company will offer its own RISC-V design (IP). We expect Flow to disclose a target delivery date by the end of the year.
  • The PPU doesn’t replace the CPU. A PPU-endowed processor still requires a standard CPU for sequential code. The PPU is just a coprocessor for parallelized code.
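
To make the synchronization overheads concrete, the sketch below shows a minimal POSIX-threads dot product in C, the sort of unmodified threaded math routine Flow says its compiler can speed up about 2×. The thread creation, the mutex-guarded accumulation, and the final joins are exactly the costs the PPU is claimed to reduce. The code is purely illustrative; nothing in it comes from Flow’s toolchain.

/* Illustrative pthreads kernel: a parallel dot product. Thread startup,
 * the mutex-serialized accumulation, and the joins are the overheads
 * Flow claims its PPU handles in hardware. Build with: cc dot.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)   /* elements per vector */
#define NTHREADS 4

static double a[N], b[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct slice { size_t begin, end; };

static void *partial_dot(void *arg)
{
    struct slice *s = arg;
    double sum = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        sum += a[i] * b[i];

    pthread_mutex_lock(&lock);    /* serialized accumulation: a sync cost */
    total += sum;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];

    for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    size_t chunk = N / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        s[t].begin = t * chunk;
        s[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, partial_dot, &s[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);   /* join: another sync cost */

    printf("dot = %.1f\n", total);    /* expect 2.0 * N = 2097152.0 */
    return 0;
}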
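
Amdahl’s Law quantifies why sequential code caps these gains. In the standard formula below, p is the fraction of runtime the PPU can accelerate and s is the speedup on that fraction; s = 107 is Flow’s best published figure, while the values of p are our illustrative assumptions.

\[ S_{\text{overall}} = \frac{1}{(1-p) + p/s} \]
\[ p = 0.99,\; s = 107: \quad S = \frac{1}{0.01 + 0.99/107} \approx 52 \]
\[ p = 0.999,\; s = 107: \quad S = \frac{1}{0.001 + 0.999/107} \approx 97 \]

Even with 99% of runtime accelerated, the whole program speeds up only about 52×, which is why the 100× figure can describe inner loops but not complete applications.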

Customers

To win designs, Flow must show how its technology speeds up specific cases compared with alternatives, including SIMD and GPU/NPU approaches, not just compared with scalar CPU execution. Otherwise, it’s up to potential licensees to convince themselves to employ it. This approach might succeed with companies capable of developing their own processor and targeting a specific use case.

For example, a streaming-media company must transcode videos to various formats to serve customers with diverse hardware and networking bandwidth. An ASIC codec could handle the coding but would lack flexibility. Traditional CPUs are flexible, but their throughput per watt and per dollar is poor. A Super CPU—Flow’s name for a CPU-PPU combination—could offer both flexibility and performance. However, only the largest and most sophisticated of these companies could roll their own processors.

If Flow could show how it speeds up a consumer IoT, smartphone, or PC application, it could not only win a high-volume design but also showcase its technology to other prospective licensees.

Bottom Line

Flow Computing has developed a novel technology for speeding up threaded CPU code but has yet to show actual silicon solving real-world problems. The company’s immediate challenge is to complete its initial RISC-V design so that it can demonstrate silicon running applications. Until then, the technology sounds intriguing, but aspirations of going into everything from smartwatches to servers are premature.

This article was updated 13 June 2024.

