AI Accelerators — Part IV: The Very Rich Landscape

A Rising Tide Lifts All Boats

Recent years seem to have been a golden age for AI hardware companies: NVIDIA’s stock skyrocketed by roughly 500% over the past three years, dethroning Intel as the world’s most valuable chip company. The startup scene is just as hot: over the past few years, billions of dollars of funding have poured into AI hardware startups looking to challenge NVIDIA’s AI leadership.

AI Hardware Startups — Total Funding as of 4/2021 (source: AnandTech; FWIW, Nuvia is not in the AI business)

101 Ways to Cook AI Accelerators

NVIDIA: Started with GPUs + CUDA, Aims for Full Control

ImageNet Challenge: Winning Error and Percentage of Teams using GPUs (source: NVIDIA)
SIMT Execution Model: Code Divergence and Synchronization Example (source: NVIDIA)
NVIDIA’s Roadmap of GPUs, CPUs, and DPUs (source: NVIDIA)

Cerebras

Andrew Feldman, Cerebras CEO, with the WSE-2 (source: IEEE Spectrum)
Cerebras WSE-2 vs. NVIDIA A100 Spec Comparison (source: BusinessWire)

GraphCore

2nd-Generation IPU High-Level Chip Diagram (source: GraphCore)
Example of a Bulk Synchronous Parallel Execution Timeline (source: GraphCore)

Reconfigurable Dataflow — Wave Computing, SambaNova, SimpleMachines

Wave Computing, SambaNova, and SimpleMachines are three startups whose accelerator chips build on a combination of two concepts (both overviewed in the previous chapter). (i) Reconfigurability: from a processor taxonomy’s point of view, their accelerators are classified as Coarse-Grained Reconfigurable Arrays (CGRAs), an idea originally suggested in 1996. CGRAs embody a software-defined hardware approach in which the compiler determines the structure of the computational datapaths and the behavior of the on-chip memories. (ii) Dataflow: their designs rely on graph-oriented hardware that follows the dataflow graph laid out by the AI application. Conceptually, since an AI application can be defined by a computational graph, it can be naturally expressed as a dataflow program.
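
To make the dataflow idea concrete, below is a minimal sketch, in plain Python with illustrative names (not any of these vendors’ actual APIs), of executing a computational graph dataflow-style: a node fires as soon as all of its inputs are available, rather than following a sequential program counter.

```python
import numpy as np

class Node:
    """One operation in the graph: a name, a function, and its input names."""
    def __init__(self, name, fn, inputs):
        self.name, self.fn, self.inputs = name, fn, inputs

def run_dataflow(nodes, feeds):
    # Assumes the graph is acyclic and every input is eventually produced.
    ready = dict(feeds)                # values ("tokens") produced so far
    pending = list(nodes)
    while pending:
        for node in list(pending):
            if all(i in ready for i in node.inputs):   # all inputs arrived
                args = [ready[i] for i in node.inputs]
                ready[node.name] = node.fn(*args)      # the node "fires"
                pending.remove(node)
    return ready

# A tiny two-layer network as a graph: matmul -> relu -> matmul.
graph = [
    Node("y", np.matmul, ["a", "w2"]),               # listed out of order
    Node("a", lambda t: np.maximum(t, 0.0), ["h"]),  # on purpose: execution
    Node("h", np.matmul, ["x", "w1"]),               # order follows the data
]
feeds = {"x": np.ones((1, 4)), "w1": np.ones((4, 8)), "w2": np.ones((8, 2))}
print(run_dataflow(graph, feeds)["y"])               # -> [[32. 32.]]
```

Note that the nodes are deliberately listed out of order: in a dataflow program, execution order is dictated by data availability, not by where an operation appears in the code.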

Time-Based Mapping of DPU Kernels (source: Wave Computing)
SambaNova’s RDU Block Diagram (source: SambaNova)
SambaNova’s Key Use Cases (source: HPCWire)
SimpleMachines’ Mozart Chip Block Diagram (source: SimpleMachines)
Hailo

Hailo Hardware-Software Stack (source: Hailo)
Possible Mapping of an 8-Layer Model to a Pool of Compute+Control+Memory Resources (source: Hailo)
Hailo-8 Based PCBs (source: Hailo)

Systolic Arrays + VLIW: TPUv1, Groq, and Habana

TPUv1

Block Diagram of The First-Generation TPU Architecture (source: arXiv)
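
To give a feel for the systolic-array principle these designs share, here is a small self-contained Python sketch of a weight-stationary array, a simplified model of my own and not the TPU’s actual implementation. Weights stay pinned in the grid of processing elements (PEs) while activations stream through it and partial sums accumulate along the columns.

```python
import numpy as np

def systolic_matmul(x, w):
    """Weight-stationary systolic matmul sketch: PE (k, n) holds w[k, n];
    activations stream in from the left, partial sums flow down the columns,
    and finished dot products drain out of the bottom row. The cycle-level
    skewing of real arrays is abstracted away here."""
    m, k = x.shape
    _, n = w.shape
    out = np.zeros((m, n))
    for i in range(m):                 # one wavefront of activations
        psum = np.zeros(n)             # partial sums entering the column tops
        for kk in range(k):            # sums trickle down the K rows of PEs
            psum += x[i, kk] * w[kk]   # each PE does one multiply-accumulate
        out[i] = psum                  # results emerge at the array bottom
    return out

x = np.arange(6, dtype=float).reshape(2, 3)
w = np.ones((3, 4))
assert np.allclose(systolic_matmul(x, w), x @ w)  # agrees with a plain matmul
```

Keeping the weights stationary means each weight is reused across many activations without being re-fetched from memory, which is where systolic designs get much of their efficiency.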

Groq

Groq TSP Execution Block Diagram (source: Groq)
Groq VLIW Instruction Set and Description (source: Groq)

Habana

Goya and Gaudi High-Level Block Diagrams (sources: Habana 1, 2)
Habana’s Goya MLPerf Inference Results (source: Habana)

RISC-Based AI Accelerators: Esperanto and TensTorrent

Esperanto

Esperanto was founded back in 2014 and remained in stealth mode for quite some time, until announcing its first product, the ET-SoC-1 chip, at the end of 2020. The ET-SoC-1 is a RISC-V based heterogeneous many-core chip with 1,088 low-power, low-voltage “Minion” cores for vectorized computation and four out-of-order, high-performance, general-purpose “Maxion” cores. The Maxion cores allow the ET-SoC-1 to serve as a host processor (meaning the processor that runs the operating system) rather than only as a standalone accelerator connected to a host via PCIe. The ET-SoC-1 targets datacenter inference workloads, and has so far been demonstrated on large recommendation models.
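
As a rough illustration of why a sea of little cores suits this workload, here is a toy Python sketch (a hypothetical mapping of my own, not Esperanto’s actual software stack) that shards an embedding table, the dominant data structure in recommendation models, so that each Minion-style core owns one slice and can serve lookups independently.

```python
import numpy as np

NUM_CORES = 1088                           # the ET-SoC-1's Minion core count
table = np.random.rand(10_000, 16)         # toy embedding table (rows x dim)
shards = np.array_split(table, NUM_CORES)  # one contiguous slice per core
bounds = np.cumsum([len(s) for s in shards])  # global row id -> owning core

def lookup(row_ids):
    # Each row id lands on exactly one core's shard, so every core can
    # answer its lookups from private memory with no coordination.
    out = np.empty((len(row_ids), table.shape[1]))
    for j, rid in enumerate(row_ids):
        core = np.searchsorted(bounds, rid, side="right")    # owning core
        local = rid - (bounds[core - 1] if core > 0 else 0)  # local row index
        out[j] = shards[core][local]
    return out

ids = np.array([0, 5_000, 9_999])
assert np.allclose(lookup(ids), table[ids])  # matches a direct table lookup
```

Because each lookup touches exactly one shard, the little cores need no synchronization with each other; scaling out is mostly a matter of adding more slices.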

Full-Chip Block Diagram of Esperanto’s ET-SoC-1 with a Handful of Big “Maxion” Cores and Many Little “Minion” Cores (source: Esperanto/HotChips)
Physical Properties of an x86 CPU vs. Esperanto’s ET Minion Many-Core Chip (source: Esperanto/HotChips)

TensTorrent

TensTorrent Approach — Graph Parallelism and Tensor-Slicing (source: YouTube/TensTorrent)
TensTorrent Core (source: YouTube/TensTorrent)

Mythic

Difference Between Weights and Input/Output Data in a Matrix-Multiply Operation (source: Mythic)
Mythic Analog Compute Engine — Flow Diagram and Analog Computation (source: Mythic)

LightMatter

Photonics vs. Electronics Calculation Properties (source: HotChips/LightMatter)
Envise-Based Server Block Diagram (source: LightMatter)

NeuReality

NeuReality NR1-P Prototype (source: ZDNet)

Conclusions

Many companies are developing their own AI accelerators. I have highlighted some of their implementations here, though by no means all of them; my writing could not keep up with the influx of new announcements.

  • NVIDIA is betting on new lines of CPUs and DPUs.
  • NeuReality is building system-centric AI servers.
  • TensTorrent decided to design its own RISC-V processors, which will connect to its accelerators via its proprietary NoC.
  • Esperanto’s chip combines many little cores for AI with a few big cores capable of running the operating system.