By Ian Griffiths, Technical Fellow I
After the AI Storm: Modern Compute

AI is near the peak of its current hype cycle. The next AI winter is coming, and we'd best prepare. But it doesn't have to be all doom and gloom.

At endjin, we see the recent massive investments in AI as something like the space race. America and the USSR (as it was then) each invested phenomenal amounts of money and effort to enable humans to travel to and from the moon. As I write this, over half a century has passed since a human last stepped on the moon (just 12 people ever did), so it might be tempting to write off the whole exercise as a folly—an incredible effort for a technology we almost immediately abandoned!

But that misses something important: the moon landings produced many enduring benefits. True, the ability to visit the moon isn't one of them, but that doesn't mean the effort was wasted. It just means that the primary goal of the space race turned out not to be its most useful result. Space programmes have generated thousands of commercial spinoffs and numerous advances in basic science and engineering research.

What if the current massive investment in AI is like that? Whether or not the current crop of AI technology turns out to deliver on its promise, the investment will produce some enduring benefits. The key to enjoying the forthcoming winter in comfort will be to make effective use of these.

This is the first in a series of posts on Modern Compute in which we look at how the computational landscape has changed in the past decade or so (thanks not just to AI, but also to the industry's transition to cloud computing). How should we adjust our assumptions and habits if we are to take full advantage of current technology?

We must begin with the basics: why do we need computational capabilities at all, and what kinds are available?

Why Compute?

What is computation and why do we want it? I'm using the word "compute" to describe the ability to perform computation, rather than, say, "cores" or "processing", because computation occurs in many forms, and more specific terminology can prevent us from seeing the full range of possibilities. Perhaps the most complete definition of computation is in Alan Turing's seminal paper "On Computable Numbers, with an Application to the Entscheidungsproblem". If you are prepared to work through 35 pages of dense academic text, I would highly recommend Charles Petzold's The Annotated Turing, an excellent guide to that paper. But otherwise, a brief definition will suffice:

Compute (mass noun): the capacity to process inputs arithmetically or symbolically, and to produce decisions and other outputs from that processing.

That's a pretty flexible definition, encompassing far more than just microprocessors. It would include, for example, any implementation of TLS (which provides the security characteristics of HTTPS that we rely on every day). TLS entails performing specific arithmetic procedures on incoming network messages, and deciding whether to accept or reject a particular connection based on the results of that arithmetic. But this definition of compute would also cover the application of a discrete Fourier transform to an image captured by a camera as part of the process of digitally encoding and compressing a video signal. The processing that network routers perform when deciding whether and where to forward network messages also falls under this description, as does automatically sending an email to a customer when a delivery has been dispatched. So would the aggregated processing of terabytes of data to discover useful signals, or the evaluation of some input text against a pretrained neural network.
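
To make that definition concrete, here is a deliberately toy C# sketch: it processes an incoming message arithmetically (a simple byte checksum) and produces an accept-or-reject decision from the result. It is not how TLS actually works (real protocols use cryptographic message authentication), just a minimal illustration of "arithmetic in, decision out".

```csharp
using System;

// A toy illustration of "compute": arithmetic over an input, producing a
// decision as output. (This is NOT how TLS works; real protocols use
// cryptographic message authentication, not a simple checksum.)
byte[] payload = { 0x48, 0x69 };     // an example incoming message
byte expectedChecksum = 0xB1;        // 0x48 + 0x69 = 0xB1

bool accepted = AcceptMessage(payload, expectedChecksum);
Console.WriteLine(accepted ? "accept" : "reject");

static bool AcceptMessage(byte[] message, byte expected)
{
    int sum = 0;
    foreach (byte b in message)
    {
        sum = (sum + b) & 0xFF;      // arithmetic over the input
    }
    return sum == expected;          // a decision derived from that arithmetic
}
```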

I've chosen that particular list of examples because each of them commonly occurs in completely different kinds of hardware. Servers that accept incoming HTTPS connections often have specialized circuitry that handles the particular cryptographic requirements of TLS. Fourier transforms are typically processed either using the SIMD hardware built into most modern CPUs, or the general-purpose highly parallelized computational features offered by modern graphics cards. Network routers typically use a mix of hardware dedicated to the job of processing network messages, and a general-purpose CPU that provides flexibility when required. The automated email use case would typically be handled by writing code in a conventional programming language such as C# or TypeScript, and running it on a service providing access to a general-purpose CPU, such as Azure Functions or AWS Lambda. We might also use commodity CPUs for very high volume data processing, but lots of them, spread across many physical servers and orchestrated using a system such as Spark. The evaluation of a pretrained neural network might well use specialized hardware developed for that purpose, such as Google's Tensor Processing Units (TPUs), or their logical successors, sometimes known as NPUs or Neural Engines.

From the theoretical computer science perspective, every one of these is just an application of computation. From an engineering perspective, this is a wide range of techniques.

Any of these problems could be solved using general-purpose CPUs. (In fact, it's possible to find examples for every one of these scenarios where a general-purpose CPU has been used successfully.) One of the most important results of Turing's paper was that it is possible to create a universal computing machine that can perform any possible computation; modern CPUs embody this, and it is why they are so amazingly flexible. However, for many applications it has proven beneficial to use more specialized hardware.

So how do we know when specialized computational hardware might help us? We can begin by categorizing the kinds of computational work we need to do. Very broadly speaking, the need for computation falls into two categories, which I'm calling low-demand and high-demand.

Low-Demand Computation

Sometimes applications make fairly gentle computational demands. For example, we might need to determine whether a particular customer has opted into email notifications before we begin the process of sending an email. This is not a mathematically advanced scenario. Nonetheless, even very simple code needs somewhere to execute.

We have several options. Perhaps we can use a technology such as Microsoft's Power Automate, in which we essentially have no idea which computer is actually running our code. This is a Platform as a Service offering in which such details are managed for us. There's a price, of course: if you were to work out how much you are paying for each individual instruction executed by whichever CPU actually does the work, this would look ruinously expensive compared to a typical virtual machine. But in practice the VM would probably be the more expensive way to do this, because the number of CPU instructions required to perform the relevant decision is likely to be tiny. If you paid for an entire VM, most of its capacity would go unused in this application because the computational demands are so low. (That said, the overall scale of your service can blur the lines. There might come a point where provisioning a general-purpose CPU would be more cost effective. That will likely only be true for particularly high loads in this example, but there is a grey area in which we have low-demand work being done in high volumes.)

For basic decision making, orchestration, or simple automation of repeated processes, clicky-clicky-draggy-droppy systems such as Power Automate are often a reasonable solution. But for some problems, that style of programming can become unmanageable, especially since it often doesn't fit very well with revision control systems. So people often turn to systems such as AWS Lambda or Azure Functions that can host code written in conventional programming languages but which manage the host OS and hardware for you. These offer a range of pricing models, and the cheapest of these are well-suited to applications with very modest computational demands.
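
To give a sense of just how modest the computation in this scenario is, here is a minimal C# sketch of the opt-in check described above. The Customer type and the notification step are hypothetical stand-ins; in practice, logic like this would sit inside an Azure Functions or Lambda handler rather than a console application.

```csharp
using System;

// A minimal sketch of the low-demand scenario above. The Customer type and the
// notification step are hypothetical; the point is how little computation the
// decision itself actually requires.
var customer = new Customer("cust-42", "someone@example.com", OptedIntoEmail: true);

if (customer.OptedIntoEmail)
{
    // In a real system this would call an email service (and would live inside
    // an Azure Functions or Lambda handler); here we just write to the console.
    Console.WriteLine($"Sending dispatch notification to {customer.Email}");
}
else
{
    Console.WriteLine($"Customer {customer.Id} has not opted in; no email sent");
}

record Customer(string Id, string Email, bool OptedIntoEmail);
```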

High-Demand Computation

Some kinds of work make much higher demands on the computational machinery. Sometimes you just have to crunch a lot of numbers: fluid dynamics and other physical modeling tend to fall into this category, as do certain machine learning training processes. Sometimes the demands on computation arise from the need to deal with many variations of a problem. That can happen with problem spaces where combinatorial explosion is endemic, such as timetable planning or other resource allocation problems. It can also arise in situations where it is useful to run multiple 'what if' scenarios, each with small differences, as in the Monte Carlo simulations common in the insurance industry.

With these kinds of problems, we will typically want to think carefully about how we will bring enough computational power to bear on the problem.
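
As a rough illustration of the shape of this kind of work, here is a minimal C# Monte Carlo sketch that estimates π by random sampling, spreading the trials across CPU cores with Parallel.For. The numbers are arbitrary, and real insurance simulations are vastly more elaborate, but they share this pattern of enormous quantities of independent trials.

```csharp
using System;
using System.Threading.Tasks;

// A minimal Monte Carlo sketch: estimate pi by sampling random points and
// counting how many fall inside the unit circle. The structure is what matters:
// a huge number of independent trials, spread here across CPU cores with
// Parallel.For. Real 'what if' simulations are far more elaborate, but share
// this shape, which is why they are natural candidates for high-demand compute.
const int taskCount = 8;
const long samplesPerTask = 10_000_000;
long[] hits = new long[taskCount];

Parallel.For(0, taskCount, t =>
{
    var rng = new Random(unchecked(Environment.TickCount * (t + 1)));
    long inside = 0;
    for (long i = 0; i < samplesPerTask; i++)
    {
        double x = rng.NextDouble();
        double y = rng.NextDouble();
        if ((x * x) + (y * y) <= 1.0)
        {
            inside++;
        }
    }
    hits[t] = inside;
});

long totalInside = 0;
foreach (long h in hits)
{
    totalInside += h;
}

double piEstimate = 4.0 * totalInside / (taskCount * samplesPerTask);
Console.WriteLine($"Estimated pi ~= {piEstimate}");
```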

What Kind of Compute?

Several kinds of computational hardware are available. I'll explore all of these in more detail later in this series, but I'll give a quick outline here.

Why do we even have several options when a general-purpose CPU can do it all? It's because there are a few different characteristics we might care about, and different hardware solutions offer different tradeoffs. These include:

  • Cost of hardware
  • Performance (by whichever metrics matter to you, e.g., time taken to produce a result, or the rate at which data can be processed)
  • Power consumption
  • Initial engineering effort
  • Ongoing engineering costs (maintenance)
  • Speed of response to changing requirements

That's a lot of factors to consider, but to make matters worse, we can't always say in general how well a particular hardware solution will do for any of these dimensions: it can vary a lot from one workload to another. This diagram shows roughly where the various options I'm about to describe fall on the trade-off between runtime performance and initial development effort:

[Figure: three graphs showing the trade-off of runtime performance vs. initial development effort for three different workloads, with the various technological options appearing in completely different places in each.]

We've got three graphs here, all ostensibly showing the same thing, but apparently contradicting one another! But that's just because each makes different assumptions about the workloads. (I'll get into more detail later, but I'll explain briefly why we might see such different outcomes. Workload 1 is a pretty good match for the 'tensor' processing in NPUs and some modern GPUs, so although an ASIC—a specialised custom-designed chip—might do a bit better, it's a small win. Conversely, for workload 2 GPUs and NPUs can offer some help, but apparently in this case the fit is not so good, so custom hardware, either on a dynamically configured FPGA, or a custom-fabricated ASIC, could do a lot better. The inverted arrangement in workload 3 is typical of cases in which the volumes of data being processed are relatively small, and the costs involved in moving data to where acceleration hardware could reach it completely wipe out any potential performance benefit.)

Note that in some cases, two or more of the compute options described in the following sections might be integrated onto a single device. For example, NPUs are currently typically contained in the same physical package as the CPU. In System On a Chip (SOC) systems, everything (CPU, chipset, specialised acceleration hardware, and maybe even RAM) is integrated into a single chip. It's still worth considering the types below separately, though, because regardless of how they are physically packaged, their usage models are still very distinctive.

General purpose CPU

The CPU (central processing unit) is the component that is able to execute conventional code. The most important feature of a CPU is that it can perform any computation that will fit in the available time and memory. General purpose CPUs are extremely widely available, and thanks to modern cloud infrastructure, we can adjust the number provisioned for an application very easily, and at short notice.

Most modern CPUs have some sort of SIMD (Single Instruction, Multiple Data) capability, enabling a particular kind of parallel computation. This was once the preserve of supercomputers, and it can enable CPUs to perform certain computationally demanding tasks that would otherwise be beyond their reach.
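
As a rough sketch of what this looks like in practice, the following C# uses System.Numerics.Vector<T> to sum an array: each vector addition processes several elements at once, with the exact width depending on the SIMD capabilities of the CPU it runs on.

```csharp
using System;
using System.Numerics;

// A sketch of CPU SIMD using System.Numerics.Vector<T>: each vector addition
// processes Vector<float>.Count elements at once (e.g. 8 floats on an AVX2
// capable CPU), so the loop below takes far fewer iterations than a scalar one.
float[] data = new float[1000];
for (int i = 0; i < data.Length; i++)
{
    data[i] = i;
}

int lanes = Vector<float>.Count;
Vector<float> partialSums = Vector<float>.Zero;

int j = 0;
for (; j <= data.Length - lanes; j += lanes)
{
    partialSums += new Vector<float>(data, j);  // one SIMD add, 'lanes' elements
}

float total = 0;
for (int k = 0; k < lanes; k++)
{
    total += partialSums[k];                    // combine the SIMD lanes
}
for (; j < data.Length; j++)
{
    total += data[j];                           // any leftover elements
}

Console.WriteLine($"Sum = {total} using {lanes}-wide vectors");
```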

The downside of a general purpose CPU, compared with everything else in the list that follows, is that it is likely to perform less well for some specific tasks than more specialized technologies.

ASICs

An Application-Specific Integrated Circuit (ASIC) is a custom-designed silicon chip hard-wired to perform some very particular task with maximum efficiency.

Custom-designed silicon has two advantages over a general purpose CPU. First, an ASIC can be fully optimized for the task it has to do. Second, whereas a general-purpose CPU needs to dedicate a non-trivial amount of its capacity to working out what it has been asked to do, and working out how best to do that, an ASIC does just one thing. It already knows what it's going to do, and how it's going to do it, so all of its capacity can be dedicated to getting on with the work. One upshot of these advantages is that an ASIC might be able to perform its job far more quickly than a CPU could do. (Or perhaps an ASIC can be manufactured more cheaply, because it can achieve its goals with a less advanced silicon fabrication process.) Another common advantage is that ASICs are often able to consume significantly less power for a particular workload.

High-performance networking hardware typically includes ASICs, because they are the only practical option for achieving the kinds of throughput required. Some Bitcoin miners developed ASICs in an attempt to gain an edge. Another example of ASICs in computation are the 'tensor' processors Google introduced to lower the costs of using certain kinds of machine learning models to perform inference.

GPU

As the name (Graphics Processing Unit) suggests, a GPU's primary purpose is to accelerate various graphical operations. Many years ago, GPUs offered a fixed set of highly specialized capabilities that couldn't do anything other than render graphical images. A GPU was really just one particular kind of ASIC. However, the increasing sophistication of rendering techniques led to graphics card vendors making their GPUs programmable. This made it possible, in principle, to perform any computation on them, just as we can with a CPU.

GPUs continue to be heavily oriented towards computer graphics and video, the most important upshot being that their approach to parallelism is very different from a typical general-purpose CPU. This can make them fairly inefficient at handling some of the work we routinely get CPUs to do, but they excel at certain kinds of highly repetitive work. This has led to GPUs being used for certain high-demand applications well outside of their original domain of computer graphics, such as fluid dynamics and machine learning.

In recent years, it has become common for GPUs also to incorporate so-called 'tensor' features, making certain operations involving large numbers of multiplications and additions very efficient. (In particular, they are good at matrix multiplication and convolution.) This is essentially the same functionality that Google introduced with their tensor processor ASICs. This is one of the developments driven by the wave of AI investment.
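
To see why this matters, here is a naive, purely illustrative C# version of matrix multiplication on a CPU: for n-by-n matrices it performs on the order of n³ multiply-accumulate steps, and it is exactly this repetitive multiply-and-add pattern that 'tensor' hardware is built to churn through in bulk.

```csharp
using System;

// A naive, purely illustrative sketch of the operation 'tensor' hardware
// accelerates: matrix multiplication is nothing but vast numbers of
// multiply-accumulate steps (roughly n*n*n of them for n-by-n matrices).
float[,] a = { { 1, 2 }, { 3, 4 } };
float[,] b = { { 5, 6 }, { 7, 8 } };
float[,] c = Multiply(a, b);
Console.WriteLine($"c[0,0] = {c[0, 0]}, c[1,1] = {c[1, 1]}");  // 19 and 50

static float[,] Multiply(float[,] left, float[,] right)
{
    int n = left.GetLength(0);
    int k = left.GetLength(1);
    int m = right.GetLength(1);
    var result = new float[n, m];
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < m; j++)
        {
            float acc = 0;
            for (int x = 0; x < k; x++)
            {
                acc += left[i, x] * right[x, j];  // one multiply-accumulate step
            }
            result[i, j] = acc;
        }
    }
    return result;
}
```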

NPU or Neural Engine

NPUs (Neural Processing Units) are a fairly recent addition to the realm of broadly available computation services. At the time of writing this they are mostly found in CPUs designed for use in laptops, although AMD offers one desktop processor with an NPU. Doubtless Apple would want to highlight some differences to convince you that the Neural Engine in their ARM-based CPUs ('Apple Silicon') has some additional secret sauce, but it exists to serve the same basic purpose.

The heart of an NPU is much like the 'tensor' features in some GPUs: an array of arithmetic units that can perform multiplication and addition at an extremely high rate. They are wired together so that they can perform matrix multiplication and convolution in a way that is much less constrained by memory bandwidth than in a conventional CPU (or a non-'tensor'-capable GPU). NPUs also typically include acceleration for a few other operations often used in tasks such as neural network evaluation or image processing. This is another development directly driven by the current wave of investment in AI.
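
For a sense of the other operation mentioned above, here is a simple 1D convolution sketch in C#. The signal and kernel values are arbitrary; the point is the multiply-accumulate inner loop, which an NPU performs across many arithmetic units at once rather than one step at a time.

```csharp
using System;

// A simple 1D convolution sketch: the same multiply-and-accumulate pattern that
// an NPU implements across a large array of arithmetic units. Here it runs as a
// plain sequential loop; an NPU performs many of these steps concurrently, with
// data flowing directly between units rather than round-tripping through memory.
float[] signal = { 1, 2, 3, 4, 5, 6 };
float[] kernel = { 0.25f, 0.5f, 0.25f };   // an arbitrary small smoothing kernel

float[] output = new float[signal.Length - kernel.Length + 1];

for (int i = 0; i < output.Length; i++)
{
    float acc = 0;
    for (int k = 0; k < kernel.Length; k++)
    {
        acc += signal[i + k] * kernel[k];  // multiply-accumulate
    }
    output[i] = acc;
}

Console.WriteLine(string.Join(", ", output));  // 2, 3, 4, 5
```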

FPGAs

A Field-Programmable Gate Array (FPGA) is a chip that can be rewired at runtime. It is similar to an ASIC in that you can create a completely specialized hardware design, but whereas with an ASIC, that design is etched permanently into the chip during fabrication, with an FPGA you upload your design to the chip at runtime.

The fact that an FPGA is completely rewireable at runtime comes with a few disadvantages compared to an ASIC:

  • the maximum number of transistors available on a single device is lower
  • the maximum possible speed is lower
  • the power consumption is typically higher
  • you have less flexibility in the physical layout because all the transistors are fixed in place and you can only change some of the wiring, so some design optimizations may be unavailable
  • per-unit cost is much higher for high production volumes

The huge advantages are:

  • you can iterate hardware designs very quickly (potentially several times a day; it typically takes months for a single iteration of an ASIC)
  • you can completely change the functionality of the device without needing to physically replace any hardware
  • cost is much lower for small production volumes

In essence, FPGAs combine the flexibility we associate with software development with most of the performance benefits of dedicated hardware. They can't quite get to the performance extremes possible with a well-designed ASIC, but they reduce development iteration cycles from months to hours. FPGAs are often used to prototype hardware in advance of designing an ASIC.

Quantum Computing

Quantum computing promises to incorporate quantum effects in the programming model. For example, whereas all of the other mechanisms just discussed deal with discrete states (discrete here meaning that there are no in-between states: a bool, for example, is either true or false), quantum computing offers a more complex model in which parts of the machinery may be in a quantum superposition of states. Instead of a variable whose type is a simple binary true/false, we might have a qubit.

In theory, this means that quantum computers can tackle some problems that are computationally unfeasible for conventional hardware. But for now, the practical results fall a long way short of this promise.

For example, Shor's algorithm uses quantum computation to find the prime factors of an integer. Many modern security mechanisms (including HTTPS, which is critical to security in the internet today) rely on the fact that it is essentially impossible for ordinary computers to perform this task except for certain special easy cases. So in theory, Shor's algorithm destroys the security of the internet. In practice, the most impressive use of Shor's algorithm in the scientific literature at the time of writing this was to factor the number 21 into its prime factors 3 and 7. Attempts with larger numbers have thus far failed. Any primary school child who has learned their multiplication tables will be able to do better. Conventional computers can easily reproduce these results, and indeed they can factor significantly larger numbers by brute force.
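
To emphasise quite how low that bar is, here is a brute-force trial-division sketch in C#; it factors 21 instantly, and the same few lines handle much larger numbers. The catch, as the next paragraph explains, is that the work grows with the square root of the number being factored, which becomes hopeless at cryptographic sizes.

```csharp
using System;

// A brute-force trial-division sketch: a conventional computer can trivially
// reproduce (and beat) results such as factoring 21 into 3 and 7. The catch is
// that the work grows roughly with the square root of the number being factored,
// which is utterly impractical at the sizes used in real cryptography.
long n = 21;

for (long candidate = 2; candidate * candidate <= n; candidate++)
{
    while (n % candidate == 0)
    {
        Console.WriteLine($"factor: {candidate}");
        n /= candidate;
    }
}

if (n > 1)
{
    Console.WriteLine($"factor: {n}");  // whatever remains must itself be prime
}
// Output for 21: "factor: 3" then "factor: 7".
```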

The security of HTTPS and similar systems relies on the fact that things get exponentially harder as the prime numbers involved get bigger. Beyond a certain size, although a classical computer can in theory still work out the result, the practical limitations around speed of execution mean that with numbers as large as are used in HTTPS, it would take longer than the remaining lifetime of the sun to produce the answer.

So far, quantum computers also turn out to run into practical limitations as the numbers get bigger. The reasons behind those practical limitations are different, but they kick in for much, much smaller numbers. Those working in quantum computing are of course hopeful that these limitations can be lifted, although some scientific papers have argued that unless you can absolutely eliminate certain issues (i.e., unless you can achieve what amounts to perfection in certain respects) these limitations will always kick in.

In practice, no real-world demonstration has yet shown quantum computers exceeding the capabilities of conventional computers on a problem of practical value, even in areas where quantum computers theoretically have a crushing advantage. So this is a technology whose time has not yet come.

I am mentioning quantum computing here because hardware of this kind is available today, so a discussion of computational capabilities is incomplete without it. For example Microsoft Azure offers various quantum computing services. However, I will not be talking about it in any detail in this series because quantum computing is at an early stage. Its applications are not yet well understood, and seem likely to be extremely limited with the currently available technology. (One of Azure's quantum offerings boasts a whole 11 qubits of capacity. While this might be a technical tour de force relative to the state of the art, it's many orders of magnitude from the billions or trillions of ordinary binary digits we are accustomed to having in conventional computers.) Perhaps exciting advances are just around the corner, but for now quantum computing remains in the realm of research.

Conclusion

Recent years have seen massive investment in AI. The ultimate value of this is as yet unclear, but various new technologies have become available as a direct result, such as 'tensor' processing facilities built into GPUs, new acceleration device types such as NPUs, not to mention various libraries and programming systems designed to exploit these and other computation facilities. In the next entry in this series, I will look at some of the basic truths of high performance computation that have driven much of the development in computational hardware.

Ian Griffiths

Technical Fellow I

Ian has worked in various aspects of computing, including computer networking, embedded real-time systems, broadcast television systems, medical imaging, and all forms of cloud computing. Ian is a Technical Fellow at endjin, and a 17-time Microsoft MVP in Developer Technologies. He is the author of O'Reilly's Programming C# 12.0, and has written Pluralsight courses on WPF (fundamentals and advanced topics) and the TPL. He's a maintainer of Reactive Extensions for .NET, Reaqtor, and endjin's 50+ open source projects. Ian has given over 20 talks while at endjin. Technology brings him joy.