Custom Designed AI Chips

There are several established leaders in the AI computing infrastructure landscape, along with emerging players building custom-designed AI compute chips, each with its own strengths and hardware preferences. Here’s a breakdown of some leaders and their chip focus:

Cloud-based AI platforms

  • Google Cloud: Google has its own Tensor Processing Unit (TPU) specifically designed for AI workloads, offering high performance and efficiency. They also use Nvidia GPUs and custom CPUs in their cloud offerings.
  • Amazon Web Services (AWS): AWS offers a variety of AI hardware options, including Nvidia GPUs, custom AWS Inferentia chips for inference tasks, and FPGAs. They also leverage Intel and AMD CPUs for general-purpose AI workloads.
  • Microsoft Azure: Azure utilizes Nvidia GPUs and its own Field Programmable Gate Arrays (FPGAs) for AI training and inference. Additionally, they have their own line of chips for edge-focused machine learning applications.
  • IBM Cloud: IBM relies on Nvidia GPUs and its own Power architecture CPUs for AI computations. Additionally, they develop and utilize specialized neuromorphic chips like TrueNorth for specific applications.

Hardware and chip manufacturers

  • Nvidia: Nvidia is a dominant player in the AI hardware space, with their powerful and versatile GPUs being the go-to choice for many AI tasks. They also offer integrated AI systems such as the DGX A100.
  • Intel: Intel offers a range of CPUs and AI accelerators, including Xeon CPUs with built-in AI features and dedicated deep learning processors from its Habana Labs acquisition (Gaudi for training, Goya for inference).
  • AMD: Similar to Intel, AMD provides CPUs with AI-optimized features and dedicated AI accelerators such as the Alveo FPGA line gained through its Xilinx acquisition.
  • Arm: Arm processors are increasingly popular for edge computing applications due to their lower power consumption. Many chipmakers create Arm-based SoCs with AI capabilities for devices like smartphones and IoT devices.

Noteworthy New Players To Watch

  • Tesla: Tesla’s custom AI chips, developed for its Autopilot and Full Self-Driving (FSD) systems, are also being incorporated into various AI applications.
  • Xilinx: Xilinx (now part of AMD) provides FPGAs for AI acceleration, offering flexibility and customization for specific tasks.
  • Cerebras Systems: Cerebras has developed its own custom wafer-scale AI chip, covered in more detail below.

The landscape of AI computing infrastructure is constantly evolving, with new players and technologies emerging. Ultimately, the choice of leaders and chip types depends on specific needs and applications. Some considerations include:

  • Performance: Different chips offer varying levels of performance for different AI tasks.
  • Cost: Costs can vary greatly depending on the type of hardware and cloud services chosen.
  • Development ecosystem: Some platforms and chips offer more mature and user-friendly development environments.
  • Sustainability: Considerations like energy efficiency and power consumption are becoming increasingly important.

By understanding the strengths and offerings of different leaders and chip types, you can make informed decisions about your AI computing infrastructure needs.
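One informal way to weigh the considerations above (performance, cost, development ecosystem, sustainability) is a simple weighted scoring matrix. The sketch below is purely illustrative: the option names, weights, and scores are hypothetical, not vendor data.

```python
# Hypothetical weighted-scoring sketch for comparing AI infrastructure options.
# All weights and scores are made-up illustrations, not measured vendor data.

weights = {"performance": 0.4, "cost": 0.3, "ecosystem": 0.2, "sustainability": 0.1}

# Scores on a 1-10 scale (hypothetical).
options = {
    "cloud_gpu": {"performance": 8, "cost": 5, "ecosystem": 9, "sustainability": 6},
    "custom_asic": {"performance": 9, "cost": 4, "ecosystem": 5, "sustainability": 8},
    "edge_soc": {"performance": 4, "cost": 9, "ecosystem": 6, "sustainability": 9},
}

def weighted_score(scores: dict) -> float:
    """Sum of criterion scores multiplied by their weights."""
    return sum(weights[k] * v for k, v in scores.items())

# Rank options from highest to lowest weighted score.
for name, scores in sorted(options.items(), key=lambda kv: -weighted_score(kv[1])):
    print(f"{name}: {weighted_score(scores):.2f}")
```

In practice the weights would come from your own workload priorities; the value of the exercise is forcing the trade-offs to be stated explicitly.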

Custom-Designed AI Chip from Cerebras

The custom chip from Cerebras, the Wafer Scale Engine (WSE), has the following characteristics:

  • Full wafer size (not chopped into individual dies)
  • Built at TSMC (7 nm for the WSE-2 generation these figures describe; the newer WSE-3 moves to 5 nm)
  • 850,000 cores
  • Every core has single-clock-cycle access to fast on-chip memory at extremely high bandwidth (20 PB/s). This is 1,000x more capacity and 9,800x greater bandwidth than the leading GPU.
  • 220 Pb/s memory-to-memory fabric bandwidth
From Cerebras:

|                  | Cerebras WSE    | Leading GPU      | Cerebras Advantage |
|------------------|-----------------|------------------|--------------------|
| Chip size        | 46,225 mm²      | 826 mm²          | 56×                |
| Cores            | 850,000         | 6,912 + 432      | 123×               |
| On-chip memory   | 40 gigabytes    | 40 megabytes     | 1,000×             |
| Memory bandwidth | 20 petabytes/s  | 1.6 terabytes/s  | 12,733×            |
| Fabric bandwidth | 220 petabits/s  | 4.8 terabits/s   | 45,833×            |
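The quoted ratios can be sanity-checked from the raw figures. A quick sketch (note that the 123× core advantage matches the GPU's 6,912 shader cores alone, and the memory-bandwidth ratio computed from the round numbers lands near 12,500×, slightly below the quoted 12,733×):

```python
# Sanity-check the Cerebras vs. leading-GPU ratios from the table above.
ratios = {
    "chip size": 46_225 / 826,            # mm^2 / mm^2 -> ~56x
    "cores": 850_000 / 6_912,             # shader cores only -> ~123x
    "on-chip memory": 40e9 / 40e6,        # bytes -> 1,000x
    "memory bandwidth": 20e15 / 1.6e12,   # bytes/s -> ~12,500x (quoted: 12,733x)
    "fabric bandwidth": 220e15 / 4.8e12,  # bits/s -> ~45,833x
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}x")
```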

Tesla’s Dojo – Another Custom Design

The architectural uniqueness of Dojo is evident in its building block, the D1 chip, manufactured by TSMC on a 7 nm node with a large die (645 mm², 50 billion transistors) and built around a RISC-V approach with custom instructions. Tesla claims a die with 354 Dojo cores can hit 362 BF16 TFLOPS at 2 GHz, which implies each core performs about 512 BF16 FLOPS per cycle.
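The per-core figure can be checked with a quick back-of-envelope calculation from the claimed die-level numbers:

```python
# Back-of-envelope check of the Dojo D1 claim: 354 cores at 2 GHz delivering
# 362 BF16 TFLOPS implies roughly 512 FLOPS per core per cycle.
cores = 354
clock_hz = 2.0e9
total_flops = 362e12  # claimed BF16 throughput per die

per_core_per_cycle = total_flops / (cores * clock_hz)
print(f"{per_core_per_cycle:.1f} BF16 FLOPS per core per cycle")  # ~511.3
```

The result of about 511 rounds to the 512 FLOPS/cycle figure, consistent with a power-of-two-wide vector unit per core.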

The 2×512 interface connects directly to other Dojo chips within a single cabinet.

Comparison with Other AI Chips (Including IBM’s Cell SPE)

The D1 is designed for AI-training compute efficiency, delivering 362 BF16 TFLOPS per die, which Tesla claims is roughly 6 to 10 times more than comparable AI hardware.

|                        | Tesla Dojo D1 | Fujitsu A64FX | AMD RX 6900 XT | IBM Cell |
|------------------------|---------------|---------------|----------------|----------|
| Die area               | 645 mm² | ~400 mm² | 520 mm² | ~235 mm² |
| Process node           | TSMC 7 nm | TSMC 7 nm | TSMC 7 nm | IBM 90 nm SOI, later shrunk to 65 nm |
| Core area (approx.)    | 1.1 mm² | 3.08 mm² | — | 14.8 mm² |
| Core count             | 354 | 48 + 4 | 40 WGPs | 8 SPE + 1 PPE |
| Core clock speed       | 2 GHz | 2 to 2.2 GHz | >2.5 GHz boost | 3.2 to 4 GHz |
| Management cores       | Separate host systems connected via interface processor (DIP) | Identical uarch, one per 12-core cluster (CMG) | CPU, connected via PCIe | 1 PPE, on-die |
| Power draw, 1 die      | <600 W | <200 W? | <300 W | 60-80 W (65 nm) |
| Network on chip        | Mesh, 2×64 B links in each direction | Ring between CMGs | Large buses and crossbars | Ring, 4×16 B |
| Memory                 | HBM connected via DIP (800 GB/s per DIP, 5 DIPs per tile) | Directly connected HBM2, 1024 GB/s | Directly connected GDDR6 | Directly connected Rambus XDR, 25.6 GB/s |
| Vector FP32 throughput | 22 TFLOPS | 6.758 TFLOPS (excluding management cores) | 25.6 TFLOPS | 0.256 TFLOPS (SPEs only, 4 GHz) |
| FP16/BF16 throughput   | 362 TFLOPS | 13.516 TFLOPS | 51.2 TFLOPS | N/A |
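One useful derived metric from the table above is compute density: FP16/BF16 throughput per mm² of die area. The sketch below uses only the table's figures (IBM Cell is omitted since it has no FP16 mode):

```python
# Compute density derived from the comparison table: FP16/BF16 TFLOPS per
# mm^2 of die area. Figures taken directly from the table above.
chips = {
    "Tesla Dojo D1": (362.0, 645),    # (TFLOPS, die area mm^2)
    "Fujitsu A64FX": (13.516, 400),   # die area approximate
    "AMD RX 6900 XT": (51.2, 520),
}

for name, (tflops, area) in chips.items():
    print(f"{name}: {tflops / area:.3f} TFLOPS/mm^2")
```

By this crude measure the D1 comes out several times denser than the GPU on the same 7 nm process, broadly in line with the efficiency claim above, though die area alone ignores off-chip memory and power.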
[Image: Tesla ExaPOD, via Electrek]

New Blackwell Chip from Nvidia

At a company event in California, CEO Jensen Huang announced Nvidia’s new Blackwell chips, with 208 billion transistors. Blackwell succeeds the H100, Nvidia’s flagship chip, and will be the basis of new computers and other products being deployed by the world’s largest data center operators, including Inc., Microsoft Corp., Alphabet Inc.’s Google and Oracle Corp. Blackwell-based products will be available later this year. Features of the chips include:

  • 208 billion transistors
  • A choice of new networking chips to pair with: one that uses the InfiniBand standard and another that relies on the more common Ethernet protocol
  • Two dies married to each other through a connection that ensures they act seamlessly as one
  • A custom-built TSMC 4NP process, with two reticle-limit GPU dies connected by a 10 TB/second chip-to-chip link into a single, unified GPU
  • NVIDIA NVLink® delivering 1.8 TB/s bidirectional throughput per GPU
  • High-speed communication among up to 576 GPUs for the most complex LLMs
  • Available end of 2024
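The figures above can be combined for some quick arithmetic. Note that the per-die transistor split is an inference from the two-die description, so treat it as an approximation:

```python
# Quick arithmetic on the Blackwell figures listed above. The per-die split
# is inferred from the two-die package description (an approximation).
total_transistors = 208e9
dies = 2
per_die = total_transistors / dies / 1e9
print(f"{per_die:.0f} billion transistors per die")  # 104

nvlink_tb_s = 1.8     # NVLink bidirectional throughput per GPU (TB/s)
die_link_tb_s = 10.0  # chip-to-chip link between the two dies (TB/s)
print(f"die-to-die link is ~{die_link_tb_s / nvlink_tb_s:.1f}x the NVLink figure")
```

The much faster internal die-to-die link relative to external NVLink is what lets the two dies behave as a single unified GPU.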

More Reading

  1. Nvidia Unveils Successor to Its All-Conquering AI Processor 2024
  2. NVIDIA Blackwell Platform Arrives to Power a New Era of Computing March 2024