
Ceva Boosts NeuPro-M NPU Throughput and Efficiency


Ceva has revised its NeuPro-M AI accelerator (NPU), introducing a configuration with more multiply-accumulate units (MACs) and updating the architecture to enhance real-world throughput and power efficiency. The NeuPro-M scales from a single-engine design integrating 4,096 eight-bit MACs to an eight-engine configuration with 64K eight-bit MACs. For even greater performance, licensees of the NeuPro-M design (IP) can employ multiple cores.

Target applications include computer vision for cameras, AR/VR glasses, ADAS, and drones. The NeuPro-M supports convolutional neural networks (CNNs) and transformer models. The former predominantly perform computer vision functions such as object detection and classification. The latter include large language models (LLMs), which power chatbots and can be used for interactive service manuals, as well as vision transformers, which address the same applications as CNNs and perform better in certain scenarios.

Early NPUs focused on CNNs but weren’t effective on transformer models. A flexible architecture, NeuPro-M targets both and can adapt to unforeseen AI developments. It can also execute code developed for Ceva DSPs, delivering greater capabilities than competing NPUs.

Ceva has released the final RTL for its fourth-generation NeuPro-M design (IP). Later this year, the company plans to deliver a fifth generation that supports new configurations, such as a smaller version with 2,048 MACs, and incorporates improvements, including support for the BF16 and FP8 formats.

Ceva NeuPro-M Overview

Customers pay special attention to power, performance, and area (PPA) when evaluating IP. An IP developer can easily scale the three factors proportionally, since cutting a design down reduces all three; it's trickier to improve power and area efficiency, that is, to raise performance while maintaining power and area.

With each NeuPro-M generation, Ceva implements additional techniques to extract ever more performance without a corresponding power or area increase. In an AI accelerator (NPU), the multiply-accumulate (MAC) array is among the most power- and area-hungry function units. Ceva applies various techniques to squeeze the most from the array.

Just as a gasoline-powered car gets infinitely bad mileage when idling, a stalled MAC array does no useful work but occupies silicon and uses energy owing to leakage current. An NPU design that depends on a separate CPU or DSP, for example, may stall the array when those cores execute operations the array can’t handle. The NeuPro-M doesn’t offload any computation, handling many of these operations in its streaming unit and integrating a programmable vector unit (VPU) for added flexibility. Moreover, the Ceva design distributes processing among the MAC array, the streaming logic, and the VPU, establishing a pipeline that mitigates stalling and raises function-unit utilization.
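To illustrate why that pipelining matters, the sketch below overlaps data movement with compute using double buffering, so the MAC array never waits for the next tile. It is a minimal conceptual example, not Ceva's API; the function names are hypothetical stand-ins for the hardware units, and on the real NPU this sequencing is handled by the NeuPro-M's schedulers rather than user code.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-ins for the units described above.
void load_tile(std::vector<float>& buf, int tile) { (void)buf; (void)tile; } // DMA fill from L2
void tpu_matmul(std::vector<float>& buf)          { (void)buf; }             // MAC-array work
void stream_activation(std::vector<float>& buf)   { (void)buf; }             // activation, pooling

int main() {
    const int num_tiles = 8;
    std::vector<float> buf[2] = {std::vector<float>(1024), std::vector<float>(1024)};

    load_tile(buf[0], 0);                        // prime the pipeline
    for (int t = 0; t < num_tiles; ++t) {
        if (t + 1 < num_tiles)
            load_tile(buf[(t + 1) % 2], t + 1);  // fetch the next tile while computing
        tpu_matmul(buf[t % 2]);                  // keep the MAC array busy
        stream_activation(buf[t % 2]);           // nonlinear ops stay on-chip, not on a host CPU
    }
    std::puts("pipeline complete");
    return 0;
}
```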

Sparsity and Quantization

Another technique to improve MAC utilization and efficiency is to ensure the array is doing useful work. In AI, many matrices are sparse, holding nonzero values in only a fraction of their elements. Skipping zero-times-anything operations ensures a MAC isn't needlessly drawing power, freeing it to do useful work. The NeuPro-M handles structured sparsity, the flagging of zero-valued weights before loading a neural network into an NPU. (Developers can also prune neural networks, clamping small values to zero to increase sparsity.) The Ceva design is also unusual in handling unstructured sparsity, dynamically detecting zeroes generated during AI processing. By skipping operations with zero-value multiplicands, the NeuPro-M effectively raises its throughput; some models run four times faster.
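The toy dot product below shows the principle in software: any pair with a zero weight or a zero activation is skipped, so only nonzero pairs consume a multiply-accumulate. This is only a functional illustration of zero-skipping, not how the NeuPro-M hardware implements it.

```cpp
#include <cstdio>
#include <vector>

// Dot product that skips zero-valued weights or activations, so only
// nonzero pairs consume a multiply-accumulate.
float sparse_dot(const std::vector<float>& w, const std::vector<float>& x, int& macs_used) {
    float acc = 0.0f;
    macs_used = 0;
    for (size_t i = 0; i < w.size(); ++i) {
        if (w[i] == 0.0f || x[i] == 0.0f) continue; // zero times anything: skip
        acc += w[i] * x[i];
        ++macs_used;
    }
    return acc;
}

int main() {
    // Pruned weights (structured sparsity) multiplied by activations that
    // also contain zeroes found only at run time (unstructured sparsity).
    std::vector<float> w = {0.5f, 0.0f, 0.0f, 1.25f, 0.0f, -2.0f, 0.0f, 0.0f};
    std::vector<float> x = {1.0f, 3.0f, 0.0f, 0.0f,  2.0f,  4.0f, 0.0f, 5.0f};
    int macs = 0;
    float y = sparse_dot(w, x, macs);
    std::printf("result=%.2f, MACs used=%d of %zu\n", y, macs, w.size());
    return 0;
}
```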

Quantization further increases MAC-array throughput. To enable standard AI models to run natively, the NeuPro-M can process 16-bit floating-point (FP16) data. For greater performance, it supports 8-bit integers (INT8) at twice the FP16 peak rate. Peak operations per second quadruple when employing INT4. The Ceva NPU also supports mixed precision, such as INT4 × INT8 operations. Quantization to INT4 can occur per data block, with each weight expressed as four bits scaled by a single FP16 value for the whole block. As with handling sparsity, the NeuPro-M accepts models quantized offline and is unusual in also dynamically quantizing during inference.
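The following sketch shows what per-block quantization looks like numerically: each weight in a block is stored as a signed 4-bit integer scaled by one value shared across the block. It is illustrative only; a float stands in for the FP16 scale, and the block size, rounding, and clamping choices are assumptions rather than Ceva's exact scheme.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Per-block INT4 quantization: 4-bit values (-8..7) plus one shared scale.
struct QuantBlock {
    float scale;              // shared scale for the whole block (stands in for FP16)
    std::vector<int8_t> q;    // each value fits in 4 bits
};

QuantBlock quantize_block(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    QuantBlock b;
    b.scale = (max_abs > 0.0f) ? max_abs / 7.0f : 1.0f; // map largest magnitude to +/-7
    for (float v : w) {
        int q = static_cast<int>(std::lround(v / b.scale));
        q = std::min(7, std::max(-8, q));                // clamp to the 4-bit range
        b.q.push_back(static_cast<int8_t>(q));
    }
    return b;
}

float dequantize(const QuantBlock& b, size_t i) { return b.q[i] * b.scale; }

int main() {
    std::vector<float> w = {0.10f, -0.42f, 0.35f, 0.07f, -0.21f, 0.49f, 0.0f, -0.33f};
    QuantBlock b = quantize_block(w);
    for (size_t i = 0; i < w.size(); ++i)
        std::printf("w=%+.2f  q=%+d  w'=%+.3f\n", w[i], b.q[i], dequantize(b, i));
    return 0;
}
```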

Ceva NeuPro-M Efficiency

Combining these design choices, the NeuPro-M efficiently executes CNN and transformer models, such as ResNet-50 and Llama 2, respectively. Table 1 summarizes the NPU's efficiency on these models based on a 3 nm NPM16K implementation, showing images per second per watt for the CNN and tokens per second per watt for the LLM. The ResNet-50 test is for 224×224-pixel images, and the Llama 2 case is for the 7B model with a 2K sequence length. As the table shows, performance per watt is greater for instantiations optimized for slower clocks; a 37% speed reduction increases it by 36%.

Model         Slower-clock instantiation    Faster-clock instantiation
ResNet-50     24,282 IPS/watt               15,541 IPS/watt
Llama 2 7B     8,022 TPS/watt                5,135 TPS/watt

Table 1. NeuPro-M power efficiency. (Source: Ceva.)

Ceva NeuPro-M Architecture

Tensor Processing Unit

Each NeuPro-M core comprises multiple engines, as Figure 1 shows. The tensor processing unit (TPU) within each NPM engine houses the MAC array. It’s a flexible design, supporting 4-bit by 4-bit (4×4), 8×8, and 16×16 integer MACs and optionally 16-bit floating-point formats. Mixed-precision operations, such as INT4 × INT8, are also supported. Within the array, mechanisms enable models to reuse data and perform other operations.

The NPM4K model has a single engine containing 4,096 eight-bit MAC blocks. The NPM8K is also a single-engine design, but its TPU is twice as large. Other configurations have multiple engines employing this larger TPU. Ceva has updated the NeuPro-M design to support up to eight engines, yielding the NPM64K model. For greater scaling, a customer can instantiate multiple cores.
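A quick back-of-the-envelope calculation shows how the MAC counts translate into peak throughput: each MAC contributes two operations (a multiply and an add) per cycle. The 1.5 GHz clock below is an assumed figure for illustration, not a Ceva specification; actual clocks depend on the process node and implementation.

```cpp
#include <cstdio>

// Peak-throughput estimate from MAC count: ops = MACs * 2 * clock.
int main() {
    const double clock_hz = 1.5e9;              // assumed clock, for illustration only
    const int    macs[]   = {4096, 8192, 65536}; // NPM4K, NPM8K, NPM64K (8-bit MACs)
    const char*  names[]  = {"NPM4K", "NPM8K", "NPM64K"};
    for (int i = 0; i < 3; ++i) {
        double int8_tops = macs[i] * 2.0 * clock_hz / 1e12;
        std::printf("%-7s INT8 ~%.0f TOPS, INT4 ~%.0f TOPS\n",
                    names[i], int8_tops, int8_tops * 2.0);
    }
    return 0;
}
```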

Figure 1. Ceva NeuPro-M block diagram. (Source: Ceva.)

Streaming Unit

The streaming unit complements the TPU, performing nonlinear functions, such as activation, pooling, and data manipulation. It operates on individual elements, unlike the TPU, which handles tensors. Supported activations include everything from softmax to developer-defined functions. Data-manipulation operations include scaling values and reshaping tensors.

The streaming unit also accelerates dynamic quantization. For example, it can map FP16 values to INT8 on the fly, enabling applications to take advantage of the NeuPro-M’s greater INT8 throughput. The unit additionally supports block formats, such as a data group in which INT4 elements share an FP16 scaling factor to conserve memory compared with INT8 or FP16. When feeding input data to the TPU, the streaming unit applies the scaling factor to expand the INT4 data to FP16 values. When accepting TPU output, it computes the factor and maps the individual values to INT4. Developers can configure group size and other quantization parameters.

Storing data in a low-precision format and converting it to high precision doesn’t take advantage of the TPU’s greater low-precision throughput, but it offers other benefits. The greater computational precision helps maintain model accuracy, while reducing precision for storage enables memory to hold more parameters. Likewise, it effectively increases memory bandwidth and energy efficiency by enabling more parameters to move per bus transaction. A model too large to fit in local memory can stall an NPU while shuttling data between it and main memory, a situation Ceva mitigates with its flexible quantization technology. Similarly, power restricts designs, particularly at the edge, and improving energy efficiency can boost real-world, power-constrained performance.
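The storage and bandwidth benefit is easy to quantify. The short calculation below compares a group of weights held in FP16, in INT8, and in the block format (INT4 elements plus one shared FP16 scale). The group size of 32 is an assumed value for illustration; as noted above, developers can configure the group size.

```cpp
#include <cstdio>

// Rough storage comparison for one group of weights.
int main() {
    const int group      = 32;                 // assumed group size
    const int fp16_bits  = group * 16;         // plain FP16 storage
    const int int8_bits  = group * 8;          // plain INT8 storage
    const int block_bits = group * 4 + 16;     // INT4 elements + shared FP16 scale
    std::printf("FP16: %d bits, INT8: %d bits, INT4 block: %d bits (%.1fx smaller than FP16)\n",
                fp16_bits, int8_bits, block_bits,
                static_cast<double>(fp16_bits) / block_bits);
    return 0;
}
```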

Vector Unit

Each NPM engine includes a vector processing unit (VPU). Based on Ceva's SensPro, a DSP core targeting low-cost AI applications, the VPU is open to developers. By allowing them to write custom kernels and hook them into the processing pipeline, it helps future-proof the NeuPro-M. SensPro is available in various models differing in parameters such as vector throughput, and customers can select whichever meets their requirements.

Sparsity Unit

A fourth unit accelerates sparsity-related operations, leveraging zero-valued data to conserve memory bandwidth and power. While Ceva’s software tools can prune networks, clamping a percentage of values to zero to accelerate computation by a set factor, the sparsity unit dynamically tackles sparsity during model execution. It handles sparseness as it occurs instead of enforcing a fixed structure of zero and nonzero values. Moreover, it finds sparsity in both network weights and data, achieving greater gains than NPUs that only handle weights that are zeroed during offline training and pruning.

Level One Memory Fabric

The first-level (L1) memory fabric holds model weights and activations within an NPM engine. More than an L1 data store, it helps the TPU stay occupied with computation instead of stalling or moving data. AI tensor operations are multidimensional, can span multiple matrix ranks (e.g., columns), or require copying output to multiple ranks. The L1 memory therefore integrates a DMA engine capable of reshaping data (e.g., transposing a matrix), performing scatter-gather transfers, broadcasting data, and handling other manipulations, such as interleaving tensors within a single data structure to improve processing efficiency.
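In plain software terms, a reshaping transfer is a copy whose source and destination addressing patterns differ, as in the transpose below. On the NeuPro-M this work is done by the L1 DMA engine in hardware; the loop here is only a functional illustration of what such a transfer accomplishes.

```cpp
#include <cstdio>
#include <vector>

// Copy a row-major matrix into its transpose while moving it, so the
// consumer receives data in the layout it wants.
void transpose_copy(const std::vector<float>& src, std::vector<float>& dst,
                    int rows, int cols) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];   // strided gather pattern
}

int main() {
    const int rows = 2, cols = 3;
    std::vector<float> a = {1, 2, 3, 4, 5, 6};       // 2x3 row-major
    std::vector<float> at(rows * cols);
    transpose_copy(a, at, rows, cols);
    for (int c = 0; c < cols; ++c) {
        for (int r = 0; r < rows; ++r) std::printf("%.0f ", at[c * rows + r]);
        std::printf("\n");
    }
    return 0;
}
```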

Level Two Subsystem

The second-level (L2) subsystem includes memory shared among NPM engines and performs DMA and reshaping operations like its first-level counterpart. It also expands model weights stored in main memory in a compressed format. Ceva provides an abstraction layer for the DMA engines at both levels so that developers don’t have to configure individual descriptors and manage buffers, data fragmentation, and individual transfers. Instead, they deal with queues, and the NeuPro-M manages dispatching DMA operations.
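The sketch below conveys the queue-style abstraction described above: the developer enqueues whole-tensor transfer requests, and the runtime (here, a trivial loop standing in for the NPU) breaks them into individual transfers. All names and sizes are hypothetical, not Ceva's API; the 128-byte burst matches the per-cycle interface width described in the next paragraph.

```cpp
#include <cstddef>
#include <cstdio>
#include <queue>

// A transfer request the developer deals with, instead of DMA descriptors.
struct TransferRequest {
    const char* tensor;   // which buffer to move
    size_t bytes;         // total size
};

int main() {
    std::queue<TransferRequest> dma_queue;
    dma_queue.push({"layer1.weights", 262144});
    dma_queue.push({"layer1.activations", 65536});

    const size_t burst = 128;                 // bytes moved per cycle (L2-to-L1 width)
    while (!dma_queue.empty()) {
        TransferRequest req = dma_queue.front();
        dma_queue.pop();
        size_t transfers = (req.bytes + burst - 1) / burst;
        std::printf("%s: %zu bytes in %zu transfers of %zu bytes\n",
                    req.tensor, req.bytes, transfers, burst);
    }
    return 0;
}
```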

Ceva employs wide interfaces among the memories and the NPM function blocks. Transfers from the L2 to the L1 memory move 128 bytes per cycle, as do those between the L1 memory and the TPU, VPU, sparsity unit, and streaming unit. The fifth-generation NeuPro-M will support configurations transferring 512 bytes per cycle.

Schedulers

Both levels have schedulers, avoiding the need to interrupt a host CPU to manage model execution. Instead, the NPU controls its own data flow, managing buffers and function units to improve utilization and throughput. The L2 (system) scheduler coordinates work among engines, while the first-level (engine) scheduler coordinates the units within an NPM engine. Developers can build custom data flows among the L1 units, adapting the NeuPro-M to new workloads.

Controller

Although Ceva architected the NeuPro-M to avoid the performance-sapping overhead of relying on a separate controller, it still integrates one. However, the controller does not participate in routine model execution. Instead, it interfaces with the host processor, pre- and postprocesses data, and manages power (e.g., slowing down or turning off NPM engines to conserve energy). Always on, it can, for example, listen for a wake word and activate the whole NPU for subsequent processing. Residing at the NPU's top level, the controller can be a Ceva SensPro DSP or a NeuPro-Nano; the latter is a small, low-power NPU based on a DSP enhanced for AI processing.

Ceva-NeuPro Studio Software

Ceva’s NeuPro Studio software enables developers to deploy neural networks on any Ceva NPU, including the NeuPro-M. Studio provides access to a zoo of pretrained models optimized for the NeuPro-M and enables developers to import other networks from popular frameworks. By contrast, other NPU licensors support only ONNX-format models, requiring customers to independently convert their models to that format. Studio integrates Apache TVM to compile neural-network graphs, generating C source code to facilitate integrating AI models with applications.

Building on its history of supporting DSP development, Ceva provides additional capabilities. For example, Studio can integrate Ceva audio codecs or customer software with an AI model. Studio’s Arch Planner tool can profile code execution and NeuPro-M resource utilization, helping the developer partition work among NPM engines and, if performance dictates, multiple NeuPro-M cores.

Bottom Line

The Ceva NeuPro-M is a scalable NPU for edge inferencing, delivering 1 to 400 TOPS and capable of running generative AI and other transformer-based models. It is self-contained, offloading all AI functions from a host processor to improve performance and efficiency. Moreover, the NeuPro-M can execute C/C++ code, facilitating application integration and algorithm customization. By comparison, some competing NPUs are black boxes, providing developers with no access to their inner workings.

The NeuPro-M employs several engines and automates the buffer management and data handling required by AI models, helping to boost the NPU's utilization. Consequently, the NeuPro-M performs well on customer workloads, delivering up to three times as much throughput per watt as competing NPUs. Quantizing all data, not just static weights, and handling sparsity on the fly enhance the NeuPro-M's PPA. The NeuPro-M also stands out for its programmability and flexibility, which should enable it to support future AI techniques.

Ceva sponsored this post. For more information about the company and its products, go to www.ceva-ip.com.

