NVIDIA Launches New GPU for AI Systems

Exploded view of the NVIDIA DGX A100. (1) 8 NVIDIA A100 GPUs; (2) 6 NVSwitches; (3) Mellanox ConnectX-6 network interfaces; (4) Dual AMD 64-core EPYC CPUs/1TB system memory; and (5) 15TB Gen4 NVME SSD. (Image courtesy of NVIDIA.)

NVIDIA’s new DGX A100 departs sharply from its DGX-1 and DGX-2 predecessors: its two-socket node is built on AMD’s latest EPYC CPUs, where the earlier systems used Intel CPUs.

What is behind the sudden departure?

The main reason for the departure from Intel is bandwidth: keeping NVIDIA’s recently released A100 GPUs, with their hugely increased performance, fed with data demands more I/O than Intel’s current platform provides.

NVIDIA’s recently unveiled A100 Tensor Core GPU. (Image courtesy of NVIDIA.)

AMD’s second-generation EPYC 7742 processors have 64 cores each and support PCIe 4.0, which doubles per-lane bandwidth over the PCIe 3.0 that Intel’s second-generation Xeon Scalable processors are limited to. That extra CPU-to-GPU bandwidth helps keep the DGX A100’s NVIDIA A100 GPUs (up to eight) humming with data; between GPUs, the six NVSwitches give each A100 a total of 600 gigabytes per second of NVLink bandwidth.
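For reference, NVIDIA’s published A100 spec reaches that 600 GB/s aggregate with twelve third-generation NVLink links per GPU at 50 GB/s each. A quick check of the arithmetic:

```python
# Back-of-envelope check of the A100's per-GPU NVLink bandwidth,
# using NVIDIA's published link count and per-link rate.
NVLINK_LINKS_PER_GPU = 12   # third-generation NVLink links on each A100
GB_PER_S_PER_LINK = 50      # bidirectional bandwidth per link

total_gb_per_s = NVLINK_LINKS_PER_GPU * GB_PER_S_PER_LINK
print(total_gb_per_s)  # 600
```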

The DGX A100 is designed for AI applications and leverages NVIDIA’s new TensorFloat-32 (TF32) format; the system delivers 5 petaFLOPS of AI performance and 10 petaOPS of peak INT8 throughput.
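TF32 keeps FP32’s 8-bit exponent (so it covers the same numeric range) but trims the mantissa to FP16’s 10 bits, giving Tensor Cores a cheaper 19-bit operand. A sketch of the published bit layouts:

```python
# (sign, exponent, mantissa) bit budgets of the formats the A100's
# Tensor Cores handle -- standard published layouts.
formats = {
    "FP32": (1, 8, 23),
    "TF32": (1, 8, 10),  # FP32's range, FP16's precision
    "FP16": (1, 5, 10),
}

for name, (sign, exp, mant) in formats.items():
    print(f"{name}: {sign + exp + mant} bits total "
          f"({exp}-bit exponent, {mant}-bit mantissa)")
```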

NVIDIA’s SuperPOD: Selling Segments of Supercomputing

NVIDIA is developing a SuperPOD configuration that links together many of its data center GPU systems. For example, the DGX A100 SuperPOD connects 140 DGX A100 systems through 170 HDR InfiniBand switches, yielding 700 petaFLOPS of computing power for AI and 280TB per second of network bandwidth. NVIDIA says it can assemble such a cluster in about three weeks.
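Dividing the quoted aggregates back out per system is a useful sanity check (assuming the totals are simple sums across the 140 nodes):

```python
# Back out per-system figures from NVIDIA's quoted SuperPOD aggregates.
SYSTEMS = 140
TOTAL_PFLOPS = 700     # quoted aggregate AI performance
TOTAL_TB_PER_S = 280   # quoted aggregate network bandwidth

pflops_per_system = TOTAL_PFLOPS / SYSTEMS      # 5.0 -- matches the
                                                # DGX A100's AI spec
tb_per_s_per_system = TOTAL_TB_PER_S / SYSTEMS  # 2.0 TB/s per system
print(pflops_per_system, tb_per_s_per_system)
```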

Bottom Line

A single DGX A100 system costs $199,000. A full-capacity 140-system SuperPOD runs about $27.9 million in DGX systems alone, and north of $28 million once the InfiniBand switches and cabling are counted.
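That figure is easy to reconstruct from the list price (systems only; switches, cabling, and facilities are extra):

```python
# DGX-systems-only cost of a full-capacity SuperPOD at list price.
UNIT_PRICE = 199_000  # USD per DGX A100
SYSTEMS = 140

systems_only = UNIT_PRICE * SYSTEMS
print(f"${systems_only:,}")  # $27,860,000 -- switches push it past $28M
```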

It’s good news for AI researchers with deep pockets, though.