With the recent rollout of the Cortex-A320, Arm has unlocked a new realm of possibilities for developers focused on IoT edge AI workloads. This smallest embodiment of the Armv9-A architecture offers a fresh perspective on how to select the ideal processor tailored for your unique AI applications. The challenge lies in navigating your options: Cortex-A, Cortex-M, and Ethos-U NPU-based devices, alongside potential hybrid solutions. Let’s dive into the specifics beyond mere cost, exploring how each processor influences AI capabilities and the software development flows that can streamline your project.
AI Computation Efficiency in Embedded Devices
In recent years, the efficiency of AI computation in embedded devices has taken significant leaps forward. Arm’s advancements in both the M- and A-profile architectures have delivered substantial increases in machine learning (ML) inference per unit of energy consumed. For instance, Arm’s Cortex-M processors, like the Cortex-M52, Cortex-M55, and Cortex-M85, leverage the programmable Helium vector extension to enable ambitious AI use cases on microcontroller-class devices. Meanwhile, the newly released Cortex-A320 processor enhances AI performance by incorporating the Scalable Vector Extension 2 (SVE2), while the Ethos-U suite of neural processing units (NPUs), particularly the Ethos-U85, excels in processing efficiency, especially for transformer networks.
Choosing the Right Processor
With varied architectures come differing advantages, making the choice of hardware quite complex. When selecting the best fit, it’s essential to weigh raw performance against design flexibility, taking into account your software development flow and CI/CD requirements.
Performance Metrics
Undoubtedly, achieving the necessary AI processing performance is crucial. Cortex-A processors serve as programmable units that can cater to a broad spectrum of uses. Equipped with the Neon/SVE2 vector engine, they expedite neural network and vectorized code processing. Similarly, Cortex-M processors, enhanced with the Helium vector engine, are tailored for energy-efficient applications. On the other hand, Ethos-U NPUs (up to Ethos-U85) are designed specifically for neural network operations, making them highly efficient with quantized 8-bit integer weights.
The latest Cortex-A generation, built on the Armv9 architecture, broadens the range of supported data types and introduces new matrix-multiply instructions that significantly boost neural network processing performance. The Cortex-M55 was the first Cortex-M processor to integrate Helium vector technology, later joined by the Cortex-M85; both can execute multiple 8-bit integer multiply-accumulate (MAC) operations per clock cycle.
To put these differences in perspective, the table below shows the theoretical MAC operations per clock cycle:
| Processor | INT8 | INT16 | INT32 | BF16 | FP16 | FP32 |
|---|---|---|---|---|---|---|
| Cortex-M55 & Cortex-M85 | 8 | 4 | 2 | N/A | 4 | 2 |
| Ethos-U85 (128 MACs) | 128 | 64 | N/A | N/A | N/A | N/A |
| Ethos-U85 (2048 MACs) | 2048 | 1024 | N/A | N/A | N/A | N/A |
| Cortex-A320 | 32 | 8 | 4 | 8 | 8 | 4 |
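Per-cycle MAC counts only become meaningful throughput once multiplied by clock frequency. A minimal sketch of that calculation, using INT8 figures from the table; the clock frequencies in the example are illustrative assumptions, not vendor specifications:

```python
# Theoretical INT8 throughput = MACs/cycle * clock frequency.
# MACs-per-cycle values come from the table above; clock speeds
# used below are illustrative assumptions, not vendor figures.
INT8_MACS_PER_CYCLE = {
    "Cortex-M55": 8,
    "Cortex-A320": 32,
    "Ethos-U85 (2048 MACs)": 2048,
}

def peak_gmacs(processor: str, clock_mhz: float) -> float:
    """Peak theoretical throughput in giga-MACs per second."""
    return INT8_MACS_PER_CYCLE[processor] * clock_mhz * 1e6 / 1e9

# Example: an Ethos-U85 with 2048 MACs at an assumed 1 GHz clock
# reaches 2048 GMAC/s (~4 TOPS, counting each MAC as two ops).
print(peak_gmacs("Ethos-U85 (2048 MACs)", 1000))  # 2048.0
```

Real-world throughput will be lower, since these peak numbers assume the memory system can keep every MAC unit fed on every cycle.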
Navigating Software Support
Aside from hardware capabilities, software support is another critical factor to consider. Arm offers a robust suite of open-source runtime support software across all its AI hardware solutions, including Cortex-A, Cortex-M, and Ethos-U. This support extends to various ML frameworks and runtimes, such as PyTorch, TensorFlow, and LiteRT, with optimizations that fully harness Arm AI features through acceleration libraries like CMSIS-NN and the Arm Compute Library. The Vela compiler completes the picture for Ethos-U, compiling models ahead of time into an optimized form for the NPU.
Utilizing Ethos-U NPU
For specific edge AI applications with clearly defined workloads, leveraging a dedicated NPU for neural network processing can significantly alleviate the computational burden on the host processor. The Ethos-U NPU shines in cases that require efficient handling of quantized 8-bit integer weights, particularly with transformer networks poised to benefit from the capabilities of Ethos-U85.
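The efficiency of 8-bit weights comes from symmetric integer quantization: values are stored as int8 alongside a scale factor, and MACs accumulate in a wide integer register. A simplified sketch of that arithmetic, purely illustrative rather than a description of Ethos-U internals:

```python
# Simplified symmetric int8 quantization: real_value ≈ int_value * scale.
# This illustrates the arithmetic NPUs accelerate; it is not the
# Ethos-U implementation.

def quantize(values, num_bits=8):
    """Map floats to symmetric signed integers plus one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def int_dot(q_a, q_b):
    """Integer dot product, accumulated in a wide (int32-style) sum."""
    return sum(a * b for a, b in zip(q_a, q_b))

weights = [0.5, -1.27, 0.9]
activations = [1.0, 0.25, -0.4]
qw, sw = quantize(weights)
qa, sa = quantize(activations)

# Dequantize the integer accumulator back to the real scale:
approx = int_dot(qw, qa) * sw * sa
exact = sum(w * a for w, a in zip(weights, activations))
```

The integer result closely tracks the floating-point one, which is why int8 inference preserves accuracy for most well-quantized networks while slashing memory and energy cost.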
Configurations combining host processors with Ethos-U can vary. Ethos-U can be driven by Cortex-M processors, particularly those equipped with Helium, like the Cortex-M55, and some commercially available system-on-chip solutions already use this architecture. With the recent surge of interest in running generative AI workloads on small language models (SLMs), the Cortex-M-plus-Ethos-U combination is well suited to exactly these scenarios.
Additionally, some SoCs integrate Cortex-A processors with an ML island consisting of Cortex-M and Ethos-U. These setups are typically designed for richer operating systems like Linux, offering a flexible memory system. While Cortex-M CPUs support a 32-bit addressable memory space, Cortex-A processors, including the Cortex-A320, can access a broader 40-bit memory space, bolstered by a memory management unit (MMU) for virtual addressing.
As large language models (LLMs) continue to advance, a more versatile memory system will be key for accommodating models exceeding 1 billion parameters. While Cortex-M remains suitable for smaller language models, Cortex-A’s larger address space may soon become essential as model sizes escalate.
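The arithmetic behind that limit is straightforward: 32-bit addressing tops out at 4 GiB, while 40-bit addressing reaches 1 TiB. A rough weights-only footprint estimate (ignoring activations, KV cache, and runtime overhead) shows where the crossover lies:

```python
# Rough weight-storage footprint for quantized language models,
# compared against 32-bit and 40-bit addressable limits.
ADDR_32_BIT = 2 ** 32   # 4 GiB (Cortex-M class)
ADDR_40_BIT = 2 ** 40   # 1 TiB (Cortex-A320 class)

def weights_bytes(params: int, bits_per_weight: int) -> int:
    """Storage needed for the model weights alone."""
    return params * bits_per_weight // 8

# A 1-billion-parameter model with 8-bit weights: 1 GB.
one_b_int8 = weights_bytes(1_000_000_000, 8)
fits_32 = one_b_int8 < ADDR_32_BIT        # True, with modest headroom

# An 8-billion-parameter model at 8-bit exceeds 32-bit addressing:
eight_b_int8 = weights_bytes(8_000_000_000, 8)
fits_32_large = eight_b_int8 < ADDR_32_BIT   # False
fits_40_large = eight_b_int8 < ADDR_40_BIT   # True
```

In practice the headroom shrinks further once activations, the OS, and application memory share the same address space, which is why larger models push designs toward Cortex-A.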
Recently introduced, the ‘direct drive’ configuration allows the Cortex-A processor to connect directly with the Ethos-U NPU, streamlining the architecture by removing the need for a dedicated Cortex-M ‘driver’ processor. A Linux driver for the Ethos-U85 is now available for use on host Cortex-A systems.
Cortex-A320: Meeting Generative AI Demands
Edge AI system developers now have expanded options to optimize the final step of AI in IoT. Whether opting for Cortex-M, Cortex-A, or Ethos-U-accelerated systems, each caters to distinct needs. The Cortex-A320 processor’s capability to directly interface with Ethos-U85 brings added flexibility. As Arm’s most compact and efficient Cortex-A processor under the Armv9-A architecture, the Cortex-A320 is designed to enhance edge AI efficiency while adapting to the evolving landscape of generative AI within embedded systems.
Join us in exploring how Arm is shaping the future of IoT with transformative edge AI solutions.