[Photo: MemryX MX3 M.2 module]

MemryX MX3 Accelerates Both Developers and Their CNNs


MemryX opened its developer hub this past week, following the general availability of the company’s MX3 edge-AI accelerator (NPU). The hub provides documentation, tool downloads, examples, and other developer resources. Requiring less than 2 W, the MX3 is a small (9×9 mm) chip that accelerates convolutional neural networks (CNNs) for tasks such as vision processing.

In the Wild Versus in the Zoo

The hub exemplifies MemryX’s strategy to differentiate through ease of use. The best evidence of this strategy is the absence of a model zoo. A zoo is a collection of models optimized for a company’s hardware. Developers can use the models as examples or in their own projects.

To obviate a zoo, MemryX aims for models, such as those in Hugging Face repositories, to work without fine-tuning, retraining, or manual optimization. Like its competitors, MemryX has tested many models, claiming to directly support 1,500, but it doesn’t tune them for its hardware. Testing validates that they operate correctly and tells the company which activation functions are in the wild and must be supported.

Models work out of the box because the MX3 employs a combination of integer/fixed-point and floating-point data. Developers typically quantize weights to eight bits (INT8) and keep activations in 16-bit brain floating point (BF16). The latter is the default format for most neural networks; preserving it instead of forcing all computation into eight-bit values helps keep models from breaking (and thus requiring retraining or fine-tuning) when a developer quantizes the weights.
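The arithmetic behind this mixed scheme is straightforward. The toy sketch below (an illustration, not MemryX code) quantizes weights to INT8 with a per-tensor scale while activations stay in floating point; because the scale lets the integer weights be dequantized nearly exactly, the result tracks the full-precision computation closely:

```python
# Toy sketch (not MemryX code): symmetric INT8 weight quantization
# combined with floating-point activations, as described above.

def quantize_int8(weights):
    """Map float weights to INT8 values plus a per-tensor scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dot_mixed(q_weights, scale, activations):
    """Integer-weight dot product; activations remain in floating point."""
    return sum(qw * a for qw, a in zip(q_weights, activations)) * scale

weights = [0.42, -1.27, 0.08, 0.91]
acts = [1.5, -0.25, 3.0, 0.5]

q, s = quantize_int8(weights)
approx = dot_mixed(q, s, acts)
exact = sum(w * a for w, a in zip(weights, acts))
```

Forcing the activations to INT8 as well would require choosing a second scale that covers their (often much wider and data-dependent) dynamic range, which is where accuracy typically breaks and retraining becomes necessary.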

MemryX’s Beating Heart

The MX3’s architecture resembles that of several other NPUs. A near-memory dataflow design, it organizes processing engines in a grid, each with two processing units and local memory. One unit performs matrix multiplications; the other handles activation functions and other tasks. Network layers map to engines, and data flows from one engine to the next, mirroring the flow between layers.

Near-memory systolic arrays can have scaling problems: models that don’t fit in the NPU don’t run well, if at all. However, designers can gluelessly chain multiple MX3s to tackle bigger models. Indeed, MemryX offers a four-chip 2280 M.2 card, the form factor that solid-state drives use. The MX3 integrates USB and PCIe interfaces to connect to a host processor.

Although they’re well suited to CNNs, systolic arrays like this one often struggle with transformer models. A CNN has copious data locality, but fundamental to a transformer are attention mechanisms that combine nonadjacent data. Although MemryX withholds details of the MX3’s architecture, we expect the current version lacks the global-interconnect throughput that executing transformers requires.

Get Going Quickly

The hub’s quick-start guides illustrate how easy it is to use the MX3. Standard Linux and Python commands install the requisite software. A single command-line statement compiles a model, making it ready to use. Acceptable compiler input includes ONNX-format and TensorFlow/Keras models. Within an application, a single Python or C statement sends data to a model, and another retrieves the results.
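The overall shape of that workflow can be sketched as follows; the command and function names here are placeholders for illustration, not verified MemryX SDK syntax:

```
# Install the requisite software via standard Linux/Python tooling
pip install <memryx-sdk-package>

# One command-line statement compiles an ONNX or TensorFlow/Keras model
compile-model model.onnx        # emits an MX3-ready binary

# In the application: one statement sends data, another retrieves results
accelerator.send(input_frame)
results = accelerator.receive()
```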

Numerous companies compete with MemryX, including processor companies such as NXP and Texas Instruments that integrate NPUs and accelerator suppliers such as Hailo. MemryX promises better AI performance than the former and greater ease of use than the latter.

Meaningful performance figures are elusive. MemryX claims the MX3 executes at 5 TOPS (peak) and that utilization exceeds 50%. By contrast, the rival Hailo-8 has a 26 TOPS peak and delivers 1,038 frames per second (FPS) on MobileNet SSD (300 × 300 px images). Despite its lower TOPS rating, the MX3 achieves 1,400 FPS on the same model but with slightly smaller (224 × 224 px) images. Customers employing M.2 cards will find that the four-NPU MemryX module delivers peak TOPS similar to the Hailo-8 module’s. We expect customers will compare these M.2 solutions instead of weighing a single MX3 against its discrete rival.

Our research finds that while a few developers are ready to employ transformers for computer vision, most are only curious about this newer neural-network type. Nonetheless, transformers are essential to language models and have image-processing potential. Therefore, MemryX must add transformer capability to its roadmap. In the meantime, companies developing image-processing systems, such as for machine vision or video management, will find MemryX offers a small, low-power CNN accelerator that’s easy to use.
