The Great Debate of AI Architecture

An Intel Stratix 10 FPGA used for Microsoft’s Project Brainwave. (Image courtesy of Microsoft.)

Machines are getting smarter. How? They’re borrowing the signature activity of their human creators: learning.

Machine learning may not be 100 percent isomorphic to human learning, but the main principles are the same. By being exposed to enough data and being incentivized to interpret that data correctly, both machines and humans can learn.

It turns out that, in order to learn anything useful, machines need to perform a lot of computation. Fortunately, that’s just the kind of thing computers are built for—the only tricky part is getting them to do it quickly and efficiently.

Naturally, there are different opinions on the best way to implement machine learning at the hardware level. Several major players have each opted for a different approach: NVIDIA is betting on GPUs, Microsoft on FPGAs and Google on its own TPUs, to name a few. This article will explore these alternative computational approaches to machine learning.

Architecture and Instantiation

Figure 1. Deployment alternatives for deep neural networks (DNNs) and examples of their implementations. (Image courtesy of Microsoft.)

According to Doug Burger, a Distinguished Engineer at Microsoft Research NExT, there are really two questions to ask in the machine learning hardware discussion:

  • What is the right architecture for accelerating or running machine learning and deep neural networks?
  • How do you deploy or instantiate that architecture?

The first question—about the architecture—concerns the blueprint for how your system works, the plan for your inputs and outputs and everything in between. How you realize that system in terms of hardware is the second question—you could run it on a CPU, make a custom chip, or even use ants if you feel so inclined.

In Figure 1, we can see some of the options available to engineers building machine learning solutions. From left to right, the figure shows different ways to instantiate an architecture: CPUs, GPUs, FPGAs and ASICs. These options present a trade-off between high flexibility and high efficiency. Underneath each approach, the figure lists examples of specific machine learning architectures: Microsoft’s Brainwave, Google’s TPU and more. First off, let’s break down some of these acronyms:

  • CPU: central processing unit. A very general-purpose processor. You have at least one of these in your computer right now.
  • GPU: graphics processing unit. A processor specially designed for the types of calculations needed for computer graphics.
  • DNN: deep neural network. Neural networks are a common approach to machine learning, and the “deep” essentially refers to the level of complexity (specifically, DNNs include many hidden layers; see the sketch after this list).
  • DPU: deep neural network (DNN) processing unit.
  • FPGA: field programmable gate array. This is a general-purpose device that can be reprogrammed at the logic gate level.
  • Hard DPU: “hard” refers to the fact that the DPU cannot be reprogrammed, unlike the “soft” FPGA.
  • ASIC: application-specific integrated circuit, designed to be very effective for one application only.
  • TPU: tensor processing unit. The name of Google’s architecture for machine learning.
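
To make the “deep” in DNN concrete, here is a minimal forward-pass sketch in Python with NumPy. The layer sizes and ReLU activation are illustrative assumptions, not any particular production network; the point is that each hidden layer boils down to a matrix multiply plus a simple nonlinearity, and “deep” just means stacking several of them.

```python
import numpy as np

def relu(x):
    # Simple elementwise nonlinearity applied between layers
    return np.maximum(x, 0.0)

def forward(x, weights, biases):
    """Forward pass through a small fully connected DNN.

    Each hidden layer is a matrix multiply plus a bias, followed by a
    nonlinearity; the "deep" part is simply stacking several such layers.
    """
    activation = x
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)     # hidden layers
    return activation @ weights[-1] + biases[-1]  # linear output layer

# Illustrative sizes only: 784 inputs, two hidden layers, 10 outputs
rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.standard_normal((1, 784))          # a single input sample
print(forward(x, weights, biases).shape)   # -> (1, 10)
```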

GPUs, TPUs and FPGAs

Figure 2. NVIDIA’s Volta GPU architecture is specially designed for AI. (Image courtesy of NVIDIA.)

Besides CPUs (which can implement machine learning but are not very efficient at it), the first machine learning deployment option is to use GPUs. The biggest advocate of the GPU approach is—surprise, surprise—GPU giant NVIDIA.

As it turns out, neural networks and computer graphics require much the same types of calculations. In each case, the brunt of the work is simply matrix multiplication—lots and lots of matrix multiplication. Since GPUs are already optimized for these types of calculations, they’re a good match for machine learning applications. NVIDIA’s Volta GPU architecture (see Figure 2) is specially designed for machine learning, and it offers 100 TFLOPS of deep learning performance, according to the company.
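
As a rough illustration of what that throughput means for matrix multiplication, the back-of-envelope sketch below counts the floating-point operations in one matrix multiply (about two per output term, one multiply and one add) and divides the claimed 100 TFLOPS figure by that count. The 4,096-square matrix sizes are arbitrary assumptions chosen only to make the arithmetic concrete.

```python
# Back-of-envelope only: how many large matrix multiplies per second
# does 100 TFLOPS correspond to? The matrix sizes are assumptions.
m, n, k = 4096, 4096, 4096              # (m x k) matrix times (k x n) matrix
flops_per_matmul = 2 * m * n * k        # ~one multiply and one add per output term
claimed_flops_per_second = 100e12       # the 100 TFLOPS figure cited above

matmuls_per_second = claimed_flops_per_second / flops_per_matmul
print(f"{flops_per_matmul / 1e9:.0f} GFLOPs per matmul")             # ~137 GFLOPs
print(f"~{matmuls_per_second:.0f} such matmuls per second at peak")  # ~728
```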

Figure 3. Google’s first TPU on a printed circuit board. (Image courtesy of Google.)

Google’s hardware approach to machine learning involves its tensor processing unit (TPU) architecture, instantiated on an ASIC (see Figure 3). TPUs are the power behind many of Google’s most popular services, including Search, Street View, Translate and more.

At the core of the TPU is a style of architecture called a systolic array. This consists of a grid of identical computing cells, each of which takes input from its neighbors in one direction and passes results along in another. In this way, data propagates through a systolic array in a pulse-like fashion (hence the name’s nod to the heart’s systolic rhythm). According to Google engineers, this design provides excellent performance for matrix multiplication; it allows the TPU to achieve up to 128,000 operations per cycle, an order of magnitude higher than GPUs and several orders of magnitude higher than CPUs (see Figure 4).
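
To see how that data flow actually produces a matrix product, here is a small cycle-by-cycle simulation of a systolic array in Python with NumPy. It is a sketch of the general technique under assumed conventions (an output-stationary array with skewed inputs), not a model of the TPU’s actual Matrix Multiply Unit, and the tiny matrix sizes exist only to keep the demo readable.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    Each cell multiplies the value arriving from its left neighbor by the
    value arriving from above, adds the product to a local accumulator,
    and passes both values on to the next cell. Rows of A are fed in from
    the left and columns of B from the top, each skewed by one cycle, so
    data pulses through the grid one step per cycle.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))   # one accumulator per cell (the output stays put)
    h = np.zeros((n, m))     # values flowing left -> right
    v = np.zeros((n, m))     # values flowing top -> bottom

    for t in range(n + m + k - 2):      # enough cycles to drain the array
        h = np.roll(h, 1, axis=1)       # shift right by one cell
        v = np.roll(v, 1, axis=0)       # shift down by one cell
        for i in range(n):              # inject skewed rows of A at the left edge
            s = t - i
            h[i, 0] = A[i, s] if 0 <= s < k else 0.0
        for j in range(m):              # inject skewed columns of B at the top edge
            s = t - j
            v[0, j] = B[s, j] if 0 <= s < k else 0.0
        acc += h * v                    # every cell does one multiply-accumulate
    return acc

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 5))
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
```

In real hardware, every cell is a multiply-accumulate unit working in parallel, so the body of the loop above corresponds to a single clock cycle rather than a pass over the whole grid.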

Figure 4. Representation of systolic data flow in the TPU’s Matrix Multiply Unit. (Image courtesy of Google.)
Figure 5. A Stratix 10 FPGA board implementing Microsoft’s Project Brainwave. (Image courtesy of Microsoft.)

Because of their field programmability, FPGAs are used for many applications that require flexibility. However, what they gain in flexibility, they trade off in efficiency, which is why many applications are better served by an ASIC. Whether or not machine learning is one such application is still an open question, but Microsoft is putting its weight behind the soft approach—FPGAs (see Figure 5).

“We’ve taken a pretty radically different point of view than most, and I think it’s paying off for us,” said Microsoft’s Doug Burger.

Microsoft’s machine learning architecture, called Project Brainwave, is instantiated using FPGAs. Brainwave is a single-threaded architecture, which means it processes only one instruction at a time. However, as Burger points out, this doesn’t limit the number of operations Brainwave can perform.

“The example that we gave at Hot Chips [a technical symposium] was a single instruction. We called this MegaSIMD. SIMD is a term for single instruction multiple data that the GPU people use. In MegaSIMD, you take a single instruction and that is actually turned into 1.3 million computations. And in the architecture in the demo we showed, that instruction runs for 10 cycles of the machine clock. And we do about 130,000 of those operations per cycle for 10 cycles.”
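
To unpack the numbers in that quote, the short sketch below just reproduces the arithmetic: 1.3 million operations issued by one instruction, spread over 10 cycles, works out to 130,000 operations per cycle. The matrix shape at the end is purely hypothetical, included only to show how a single instruction could fan out into that many multiply-accumulates; it is not a published Brainwave parameter.

```python
# Arithmetic from the MegaSIMD example: one instruction expands into
# 1.3 million computations executed over 10 machine cycles.
ops_per_instruction = 1_300_000
cycles_per_instruction = 10

ops_per_cycle = ops_per_instruction // cycles_per_instruction
print(f"{ops_per_cycle:,} operations per cycle")    # 130,000 -- matches the quote

# Hypothetical illustration only: a matrix-vector product with a weight
# matrix of ~1.3 million entries needs about that many multiply-accumulates.
rows, cols = 1300, 1000
print(f"{rows * cols:,} multiply-accumulates")      # 1,300,000
```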

The advantage of the single-threaded approach, according to Burger, is that Brainwave is very easy to program and integrate with other tools. Not only that, but it offers easier upgradability to future hardware—no need to rewrite code for each generation of chips.

An Ongoing Debate

With so many approaches covering both the architecture and instantiation axes, it’s clear that there’s still a lack of consensus on the best hardware approach to machine learning.

“What’s happening now is this great debate, kind of mirroring the debates that happened about CPUs in the ’80s and ’90s,” said Burger. “About what is the right Architecture with a capital A for machine learning.”

But that’s not the only question mark in this domain—an even bigger question looms on the horizon. According to Burger, even our best efforts at machine learning are approaching a limit that we don’t yet know how to move past.

“We’re going to need some huge disruptive breakthroughs to either figure out a digital nonbiological scaling path, or we’re going to have to vector over to a more biologically based approach, which is a completely different paradigm from the architectures that we’re building now.”