Modern Compute: Compute-Intensive Workloads
In this series on modern compute capabilities I've talked about the kinds of capabilities available, some of which have emerged as a result of recent industry investments in AI. I've also looked at the practical constraints that always apply. Now it's time to ask: when is this stuff useful? What sort of work do you have to be doing to benefit from using something more than vanilla code running on ordinary hardware?
As I described in the introduction, some workloads are more compute-intensive than others. Workloads that make light demands typically don't stand to gain much, so I'll be looking at some more demanding scenarios. These come in many forms, with a variety of characteristics and requirements, so I'll discuss which computational capabilities might usefully be brought to bear on a few different popular kinds of workload. Before I get into the specifics, there is some common ground.
Techniques, computation, and IO
Demanding workloads might make our tools fail because there is too much data, or they might simply run too slowly. There are three ways we can tackle this:
- More cleverness
- More computers
- More power per computer
Option 1 might entail finding a different algorithm that performs better.
Spark is a popular example of 2. It enables us to distribute processing across a cluster of computers. For example, it offers a dataframe abstraction similar to the `pandas` library in Python, but enables us to work with collections of data that are far too large to fit on a single machine. Or you can write your application code to take advantage of distributed processing directly, as we did when using 16,000 machines via Azure Batch. Sometimes, we can improve throughput simply by adding more computers to the cluster.
The obvious example of 3 would be to buy a more powerful computer. Back in the days when a new computer would run twice as fast as its predecessor without any modifications to the code, that was an option. But these days, taking advantage of Moore's law means working out how to exploit the ever increasing potential for parallelism. We may need to change how we tackle a problem so that it can make better use of the power we already have.
In practice we often need to apply a bit of 1 to realise either 2 or 3 above. Simply having more machines won't necessarily solve problems faster: you need some system for distributing your computation. (That's why systems such as Spark were invented.) Better exploiting the available power often requires a change in approach.
With that in mind, let's look at some specific workloads.
Data analytics
Extracting useful insights from large volumes of data is a popular use of computing power. There are many practical applications of this.
For example, epidemiologists ask questions about health and disease at the population level, enabling them to see connections that might be completely invisible at the level of individual patients. (Famously, it took population-level studies to establish conclusively the link between smoking and lung cancer.)
Online retailers track the way customers use web sites with the hope of identifying tactics for increasing the chances of making a sale. Large retailers with conventional shops are always tinkering with how they use their shelf space, and they perform sophisticated analysis with the goal of understanding how such changes influence customer behaviour.
Organizations that run complex systems, such as utilities or telecoms networks, often have vast numbers of monitoring devices available. It is often possible to get a deeper understanding of the behaviour of such systems through collective analysis of a large number of monitoring inputs than by looking at any device in isolation. For example, if a broadband provider knows that a router is reporting problems, that's useful information, but if all the routers in the same street are reporting similar errors, that's likely to require a different kind of intervention than a single failure.
The common characteristic here is that there tend to be very large numbers of records, and finding useful information requires bulk statistical analysis of that data.
What constitutes 'large'?
As computers have become more powerful, the threshold for "large" volumes has changed. Even though we can no longer rely on a simple doubling of CPU speed every 18 months, Moore's law has continued to deliver larger and larger memory capacities.
There was a time when working with a gigabyte of data would require exotic hardware and a large budget. Today, an individual with a laptop and a Jupyter notebook can tackle problems of that size. Tools such as pandas and numpy can get you a long way: if your whole dataset can fit in memory, they work brilliantly. However, if you're operating, say, a broadband network where your monitoring systems can produce tens of gigabytes of telemetry an hour, and you want to analyze years of data, you're going to run into the limitations of these libraries.
So what should you do if your dataset outgrows your analytics tools?
It depends on exactly why you've hit limitations. Maybe a slight shift in techniques will work—perhaps chunking your data, or choosing more memory-efficient representations will give you the headroom you need. The pandas documentation shows a variety of techniques you can use to handle larger datasets, but it's notable that the final suggestion on that page is essentially: use something else.
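For instance, two of the cheaper tricks are picking smaller numeric types and using categorical columns for repeated strings. Here's a minimal sketch of that idea, assuming a hypothetical telemetry CSV (the file and column names are invented for illustration):

```python
import pandas as pd

# Hypothetical router telemetry file; the column names are illustrative only.
df = pd.read_csv(
    "router_telemetry.csv",
    dtype={
        "router_id": "category",       # repeated strings are stored once each
        "error_code": "category",
        "signal_strength": "float32",  # half the memory of the float64 default
        "packet_loss": "float32",
    },
    parse_dates=["timestamp"],
)

# deep=True counts the real cost of string/category columns, not just pointers.
print(f"{df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```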
If you do hit the buffers, it might be because you just can't fit your data in memory, in which case option 2 above (e.g., use Spark) is likely to be necessary. But in some cases, it won't be the quantity of data—it might just be the amount of processing you're trying to do. And if that's the case, some combination of options 1 and 3 might work. The next couple of sections outline ways to increase performance for certain kinds of scenarios.
Exploit your CPU's number crunching capabilities
If the problem is speed—you can get the results you want, but it just takes too long—then you might be able to make better use of the computational power your CPU has to offer. For example, modern CPUs all have some form of SIMD (single instruction, multiple data, aka vector processing) enabling each CPU instruction to perform arithmetic on multiple value streams simultaneously.
To show the difference SIMD can make, I ran a very simple task: I took an array of 1 billion numbers (stored as 32-bit floating point values) and calculated their sum. I did this in four different ways:
Language | Mechanism | Time (seconds)
---|---|---
Python | Ordinary for loop | 235
C# | Ordinary for loop | 2
Python | numpy.sum | 0.3
C# | TensorPrimitives.Sum | 0.3
This, incidentally, illustrates how large the difference between interpreted code and compiled code can be. A completely bog standard `for` loop in C# goes about 100x faster than the equivalent in Python! If you looked at just the first two rows in that table, you might wonder how on earth Python became a popular choice for data processing. But the answer is on the third row: arguably Python's most important attribute is the wide range of libraries on offer. The numpy library has effectively made Python's lacklustre performance irrelevant here. Who cares that a `for` loop is 100x slower than the C# equivalent when the `numpy` version is actually faster?
Part of what makes numpy so fast here is that its `sum` implementation exploits SIMD. We can do that in C# too, of course. For the final row in that table, I used a .NET library method that performs a SIMD-accelerated sum. Python and C# have essentially identical performance, because they're both using a technique that runs as fast as the CPU is able to do the work.
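To give a feel for what the two Python rows in that table involved, here's a simplified sketch of the comparison (not my exact benchmark harness, and scaled down to 100 million values so it fits comfortably in memory; timings will obviously vary by machine):

```python
import time
import numpy as np

# 100 million float32 values; the table above used 1 billion (about 4 GB).
values = np.random.default_rng(0).random(100_000_000, dtype=np.float32)

# Plain Python loop: every element is boxed and handled by the interpreter.
start = time.perf_counter()
total = 0.0
for v in values:
    total += v
print(f"for loop:  {time.perf_counter() - start:.2f} s")

# numpy.sum: one call into compiled, SIMD-accelerated code.
start = time.perf_counter()
total = values.sum()
print(f"numpy.sum: {time.perf_counter() - start:.3f} s")
```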
The practical lesson here is: know your libraries. It's very simple to write a loop that calculates a sum. If you're unfamiliar with the `numpy` or `TensorPrimitives` libraries I used, there will be a bit more of a learning curve. (The code required is actually pretty simple: it's just a single method call in either case, so you end up writing slightly less code than a conventional `for` loop. But you need to know that these libraries are available, and to learn how to use them, so it might take you longer to discover which single line of `numpy`-based code solves your problem than it would to write your own code.) But if you need more speed, the benefits of climbing that learning curve are clear: an order-of-magnitude increase in C#, and a three-order-of-magnitude increase in Python.
As problems become more complex, it may be worth investing some time in working out whether more advanced library features can be brought to bear on your problem. Decomposing a problem into lots of small steps might be the easiest way to get things working, and often also results in code that is easy to maintain, but there are some situations in which more sophisticated library features might be able to perform the work you need in a smaller number of steps. This can sometimes enable libraries such as `numpy` to make much more effective use of your hardware.
In summary: libraries that can exploit SIMD capabilities might boost your analytics capacity to the level you need.
Exploit your CPU cores
All modern computers have multiple cores—they are able to execute multiple streams of instructions concurrently. Whereas the preceding section got a single stream of CPU instructions to do more work, we might look to do more work by executing multiple streams of instructions simultaneously. (This is not mutually exclusive with the preceding technique, by the way. You could have multiple cores all executing SIMD instructions.)
Naive code typically won't take advantage of this. And some languages (notably Python) weren't really designed to facilitate writing multithreaded code, in which case, this will be a variation of the "know your libraries" rule in the preceding section.
For example, `numpy` is able to perform matrix multiplication with a multithreaded algorithm. Matrices are mathematical objects that can be represented as rectangular arrays of numbers, and if you're not familiar with them, the procedure for multiplying one matrix by another can seem a little idiosyncratic, but it turns out to be very useful for combining transformations of certain kinds. It's expensive, though: if you wish to multiply a 2000x2000 matrix by another 2000x2000 matrix, this requires 2000x2000x2000 multiplication operations (and nearly as many additions). That's 8 billion mathematical operations! (If you're wondering why it's not 16 billion, most modern CPUs can combine "multiply and add" into a single operation.) The table in the previous section showed the ability to add about 3 billion numbers per second, which suggests that this matrix multiplication would take over two seconds. My desktop computer is a few years old now, so newer computers will be able to go faster, but this is still enough work that a single CPU will take long enough for the delay to be obvious. And some kinds of processing involve many, many matrix operations.
What if we got all of our CPU cores working on the problem?
It turns out that `numpy` is able to use multiple cores when performing matrix multiplication. This shows how long it took my 8-core, 2-way hyperthreaded (16 hardware threads) Core i9-9900K CPU to multiply two 2000x2000 matrices (again using 32-bit floating point values):
Language | Mechanism | Time (ms)
---|---|---
Python | numpy matmul (1 thread) | 126
Python | numpy matmul (2 threads) | 67
Python | numpy matmul (4 threads) | 39
Python | numpy matmul (8 threads) | 37
Python | numpy matmul (16 threads) | 32
I used the `OPENBLAS_NUM_THREADS` environment variable to tell `numpy` how many threads it was allowed to use. (The version of `numpy` installed on my system is using the OpenBLAS library to perform matrix multiplication.) As you can see, the more hardware threads it is able to use, the faster it goes.
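The measurement itself was along these lines (a sketch rather than my exact harness). Note that the environment variable only does anything if your numpy build is using OpenBLAS, and it has to be set before numpy is imported:

```python
import os
os.environ["OPENBLAS_NUM_THREADS"] = "4"  # must be set before importing numpy

import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2000, 2000), dtype=np.float32)
b = rng.random((2000, 2000), dtype=np.float32)

_ = a @ b  # warm-up run, so one-off setup costs don't distort the measurement

start = time.perf_counter()
c = a @ b
print(f"matmul: {(time.perf_counter() - start) * 1000:.1f} ms")
```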
The difference between 1 and 4 threads is substantial: we see a 3.2x speedup. After this, the improvements get much smaller. Throwing all 16 hardware threads at the problem is still better than 4: we get a 3.9x speedup then (only about 20% better than 4 hardware threads). Interestingly, if I run the test again with 64-bit floating point values, multiple threads are slightly more effective, yielding a total 4.4x speedup. That suggests that this might be reaching a point where raw number crunching power is not the limiting factor, and we may be running into memory related problems, possibly to do with how the various cores' caches interact when they're all working on the same data.
If you were hoping for 16 hardware threads to make things 16 times faster, it's important to be aware of the distinction between hardware threads and cores: although this CPU is able to maintain 16 parallel hardware threads of execution, it has only 8 cores, meaning that pairs of hardware threads have to share some computational machinery. So if both the threads on a core are attempting to perform SIMD-accelerated multiply/add operations as fast as they possibly can, having two threads doesn't speed things up much because there's only one vector arithmetic unit per core. A single thread can very nearly saturate the core's capacity for that kind of work, so we can get close to maximum performance for this task when the number of threads matches the core count. Sometimes, trying to use all the hardware threads (16 threads instead of 8 for this CPU) actually slows you down, because the threads start contending for resources, but in this case, there was a (very) slight improvement when using all the hardware threads.
Even bearing that in mind, the total speedup here—maximum parallelism runs about 4x faster than a single thread for 32-bit values and 4.4x faster for 64-bit ones—is less than you might hope for. An 8-core CPU sounds like it should be able to work 8x faster. In practice, considerations such as the rate at which data can be loaded from main memory into the CPU come into play. Even so, running about 4 times faster is a worthwhile improvement.
The calculation rate here is pretty high: remember this requires 8 billion multiplications, and nearly as many additions, and `numpy` (well, OpenBLAS really, being driven by `numpy`) managed that in 32ms, which is approximately 250 billion operations per second, a notable step up on the 3 billion per second I measured with the simple SIMD summation.
Actually that's a lot more than can be explained by using multiple cores: even the 1-thread entry in the table above represents a processing rate roughly 21x faster than the summation. That's possible because with the SIMD summation, the limiting factor wasn't the CPU's processing speed, it was the speed at which data could be loaded from memory. OpenBLAS does clever tricks with how it orders the work to take advantage of the fact that each single number in these matrices gets involved in many calculations. It orders the work to take best possible advantage of the CPU cache, meaning that it performs far, far fewer accesses to main memory than calculations. (These two matrices contain a total of only 8 million numbers, so in a way it's obvious that although there's a lot more arithmetic to do here, the volume of data being fetched from memory is much smaller than in the summation example, where we have a billion-entry array. But OpenBLAS goes to some lengths to maximize the advantage this offers. As I'll show in just a moment, if you write a naive implementation by hand, you'll find that for matrices like these, which are too large to fit entirely inside the CPU's L1 cache, the performance is far less good than OpenBLAS achieves even in single-threaded mode.)
OpenBLAS uses multiple optimization tricks, meaning it's fast even with just a single thread. But you can see from the table here that exploiting multiple cores has made things 4x faster. Again, this might be enough to bring your analysis times down to acceptable levels.
Incidentally, this illustrates the importance of item 1 in my original list: more cleverness. Just throwing more CPUs at the problem is often not enough. To illustrate this, I wrote my own matrix multiplication code in C#. This operation is not built into .NET today. (If you're wondering why I'm not using TensorPrimitives.Multiply, it's because that's a completely different, and much simpler, operation: elementwise multiplication. It would perform just 4 million operations for this input, not the 8 billion required for conventional matrix multiplication.) This table shows the results for multiplying a pair of 2000x2000 matrices (the same operation as the preceding table showed for OpenBLAS via numpy on Python).
Language | Mechanism | Threads | Time (ms)
---|---|---|---
C# | Naive loop | 1 | 46,651
C# | Loop with flipped matrix | 1 | 10,008
C# | Loop with flipped matrix and SIMD | 1 | 2,085
C# | Loop with flipped matrix and SIMD | 2 | 1,094
C# | Loop with flipped matrix and SIMD | 4 | 553
C# | Loop with flipped matrix and SIMD | 8 | 265
C# | Loop with flipped matrix and SIMD | 16 | 237
(Just in case you were about to draw the conclusion that Python is faster than C#, you should check out the next table, later on in this section. This table has nothing to do with C# vs Python. My goal is to show that sometimes, clever techniques are the most important tool. If I had written Python exactly equivalent to this C# code, Python would have been orders of magnitude slower.)
The naive loop is dreadful—over 46 seconds, a mere quarter of a billion operations per second! (An equivalent naive loop in Python is even slower, taking several minutes to complete.) The second row applies the simple trick I described in the previous blog in this series of using row-major order in one matrix and column-major in the other. The third row shows what happens if we use the `TensorPrimitives.Dot` method: although there's no whole-matrix SIMD support, this `Dot` method is a SIMD operation we can use to calculate a single output value (which in this case performs 2000 multiplications and 1999 additions). As you can see, these steps help a lot: being slightly smarter about memory layout boosts speed 4.7x, and then using SIMD gives us a further 4.8x boost on top of that. And then adding more hardware threads speeds things up further: applying all 16 hardware threads gives us an 8.8x speedup here.
The cumulative effect of all these changes is roughly a 200x speedup, which sounds impressive until you notice that our final (16-threaded) result here is slower than OpenBLAS managed with just a single thread at its disposal. Throwing more CPU cores at the problem certainly helped here, but because my C# implementation does not employ the more advanced cache-optimized techniques that OpenBLAS uses, it ends up being about 16x slower than numpy in the single-threaded case, and 7.4x slower when using all 16 hardware threads.
I think the main reason the difference gets smaller as this uses more cores is that distributing the work happens to spread some of the data across those CPU cores in a way that makes more efficient use of the available cache memory. Because each individual core accesses less of the memory, the importance of OpenBLAS's clever caching-oriented optimization is reduced. Even so, that cleverness still makes a big difference, because even in the comparison that's most favourable to C#, OpenBLAS still runs over 7x faster. So if you want to do this sort of thing in C#, it's worth looking for a library that wraps OpenBLAS or another similarly optimized BLAS library.
Of course, Python isn't the only language that can use OpenBLAS. C# wrappers for OpenBLAS are also available (although none of these wrappers appears to be very widely used). Here are some timings using the `OpenBlasSharp` library.
Language | Mechanism | Threads | Time (ms)
---|---|---|---
C# | OpenBLAS | 1 | 124
C# | OpenBLAS | 16 | 34
These numbers are (give or take experimental error) the same as the Python numbers. That's exactly what you would expect, because it's the same OpenBLAS library doing all the hard work under the covers.
Although this shows that C# and Python have, as you'd expect, the same performance when they are both using OpenBLAS, Python does have an advantage here: with Python I fell into a pit of success. Simply by choosing to use `numpy`, a very popular library and very much the default choice for this sort of thing, I got OpenBLAS without even asking for it. But in C#, if I hadn't known that OpenBLAS was out there, I might have been pleased with the 200x speedup my optimization attempts delivered, and not realised that OpenBLAS could offer a further 7.4x speedup. And even knowing that OpenBLAS was a better option, it took a while to work out how to use it. Math.NET has an OpenBLAS provider, but it's still in 'beta' status, and as far as I can tell, it doesn't work in 64-bit processes. To use `OpenBlasSharp` I had to write some unsafe code. `OpenBlasSharp` seems to be written by an individual developer, and is less than a year old, so I would not be confident deploying it into a production system. I had to journey into the wilderness to get C# to perform as well as the default choice in Python. So although there's no doubt that C# can match Python here (and in many scenarios, it can easily beat Python's performance), you can see why Python is popular for this sort of work.
Option 1—more cleverness—can often get you further than simply throwing more power at the problem. The OpenBLAS results are, unsurprisingly, a lot better than the C# code I wrote quickly while writing this post. It's really hard to beat a mature, well-developed library such as OpenBLAS for this kind of work.
Exploit your GPU or NPU
I've chosen matrix multiplication as an example here for a couple of reasons. First, it's a fundamental component of linear algebra, a powerful and flexible mathematical toolkit with applications in many computationally-demanding scenarios including engineering simulations, computer graphics, and AI. But second, it's a mechanism that computers have become much better at in recent years thanks to developments in AI. (In other words, this is exactly what I'm trying to highlight in this series: it's a generally useful new capability that emerged as a dividend of AI-focused investments.)
For several years now, it has become common for high-performance graphics processors to offer what they call tensor processing capabilities. In the computing world, a tensor is effectively just a collection of matrices. (Mathematicians and physicists might raise an eyebrow at this slightly reductive use of the term tensor.) The tensor cores available in many Nvidia GPUs (and AMD's equivalent AI accelerators) are just hardware that is really good at matrix multiplication. NPUs are also mostly about doing matrix multiplication. It turns out that if you are building hardware that only knows how to perform matrix multiplication (as opposed to the general-purpose arithmetic hardware in CPUs), it's possible to wire it up in a way that makes it extremely efficient with its memory access patterns—much more so than the tricks OpenBLAS performs to try to optimize memory bandwidth use. This typically raises the calculation rate up into the trillions of operations per second.
Apple's M1 CPU (released in 2020) has a 'neural engine' that can perform over 10 trillion operations per second (TOPS). Their M4 goes up to 38 TOPS. Current Snapdragon X processors offer over 45 TOPS. Intel's Core Ultra 200V processors offer 48 TOPS. (Note that these headline figures are only available with very low-precision calculations. NPUs tend to use `INT8`, a slightly misleading name for what is not in fact an integer format—it's a fixed-point 8-bit format. Most NPUs also support 16-bit floating point, which is still pretty low-fi compared to conventional floating point, but even this typically halves the processing rate compared to `INT8`. Even so, 24 trillion operations per second isn't too shabby.)
That's a lot of power. You can wield it only if you can find a way to express your problem in terms of matrix multiplication, and if you can live with relatively low precision. But there's so much power (and it consumes so much less electricity than the alternatives) that even if the problems you have don't look obviously like matrix multiplication problems, it might be worth finding a way to make them fit that mould. For example, it has become common to implement an image processing technique known as convolution by twisting it in such a way that you can run it on tensor processing hardware. It's not actually the most theoretically efficient way to execute that particular algorithm, but NPUs are so fast that even an inefficient solution on an NPU can often perform much better and make much lower power supply and cooling demands than any other available implementation.
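To make that slightly more concrete, here's a minimal numpy sketch of the standard 'im2col' trick, which recasts a 2D convolution as matrix multiplication. (A real implementation would handle padding, strides, and channels, and would hand the final multiplication to the GPU or NPU rather than to numpy; with a whole bank of filters, one column per filter, that final step becomes exactly the kind of large matrix-matrix multiply this hardware is built for.)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_via_matmul(image, kernel):
    """'Valid' 2D convolution (strictly, cross-correlation) as one matrix multiply."""
    kh, kw = kernel.shape
    # Every kh x kw patch of the image: shape (H-kh+1, W-kw+1, kh, kw).
    patches = sliding_window_view(image, (kh, kw))
    out_h, out_w = patches.shape[:2]
    # Flatten each patch into a row, so the convolution becomes patches @ kernel.
    patch_matrix = patches.reshape(out_h * out_w, kh * kw)
    return (patch_matrix @ kernel.reshape(-1)).reshape(out_h, out_w)

image = np.random.default_rng(0).random((480, 640), dtype=np.float32)
kernel = np.ones((3, 3), dtype=np.float32) / 9  # a simple blur filter
blurred = conv2d_via_matmul(image, kernel)
```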
The three techniques I've just described—exploiting SIMD, multiple cores, or matrix multiplication hardware—can improve our throughput, but once datasets become too large to fit in your computer's memory, it gets more challenging. For some kinds of processing, that doesn't need to be a problem: many statistical aggregates can be calculated by processing each element in turn, meaning that the memory requirements remain fairly small no matter what the volume of data is. (With libraries such as `pandas` or `numpy` it can take more work to do this in practice, because you have to write code to split the data into chunks, but it is possible.)
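As a small illustration, a running aggregate over a file far bigger than RAM might look something like this sketch (the file and column names are invented):

```python
import pandas as pd

total = 0.0
count = 0

# Read the file in 1-million-row chunks; only one chunk is in memory at a time.
for chunk in pd.read_csv("telemetry.csv", usecols=["latency_ms"],
                         chunksize=1_000_000):
    total += chunk["latency_ms"].sum()
    count += len(chunk)

print("mean latency:", total / count)
```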
However, some kinds of processing are harder to do if you can't fit your whole data set in RAM. Joining data sources—finding items in multiple different sets of data that share some value—can be done even with datasets far larger than will fit in memory, but this requires more specialized techniques.
Distributed analytics
Sometimes, one computer won't be enough. To process multi-petabyte datasets in a reasonable length of time, neither multiple cores, SIMD, nor high-speed matrix arithmetic can get past the basic fact that there's a limit to the speed at which data can flow into and out of any single computer.
This is the kind of scenario that Spark was invented to deal with. And this is really a combination of options 2 (more computers) and 1 (more cleverness). Typically, analysis will not be looking at each data point in isolation—very often we're looking for relationships in the data. So one of the main things systems like Spark provide is clever algorithms for common problems such as joining related data that are able to work efficiently when data is distributed across many computers.
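In case you've not seen it, this is roughly what that looks like with Spark's Python API. Continuing the earlier broadband example; the paths and column names here are invented for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("router-faults").getOrCreate()

# Two datasets, each potentially far too large for any single machine.
errors = spark.read.parquet("s3://telemetry/router-errors/")
routers = spark.read.parquet("s3://inventory/routers/")

# Spark plans and executes the join and aggregation across the whole cluster.
faults_by_street = (
    errors.join(routers, on="router_id")
          .groupBy("street")
          .agg(F.count("*").alias("error_count"))
)
faults_by_street.write.parquet("s3://reports/faults-by-street/")
```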
I've included this section because a lot of the most demanding analytics applications need distributed computation. But when it comes to the overall theme of this series, the flow of value is mostly in reverse here: AI has much to gain from these techniques, but not much to offer them. If you want to build effective AI-driven systems, high-quality data is the key to getting LLMs to do anything useful. Distributed computing techniques are also vital to the more rarefied work of building AI models from scratch. However, there's not much to say about windfalls that AI research can offer in return. There has been some systems research into distributed scheduling of workloads that may yet prove to have broader applicability, but so far, this kind of work doesn't seem to have fed back into non-AI areas of distributed computing. So for now, the main thing to bear in mind is that you may be able to apply the techniques described above in this context: to maximise throughput in a Spark cluster, you'll want to make full use of each individual computer's capabilities. Spark is already able to exploit multiple cores because of how it works, so this may come down to using algorithms where SIMD can be brought to bear.
Note that for now, CPUs with integrated NPUs are mostly aimed at laptops and phones, because they can radically improve battery life for certain workloads, notably video calls, so they offer a clear value proposition there. Although the first tensor processing hardware optimized for inference appeared in servers, it is currently relatively unusual for the kinds of computers typically used in Spark clusters to have this kind of hardware. (If it's present at all, it's usually as part of a GPU. Intel's first non-mobile CPUs to include an NPU are the Arrow Lake desktop range, and at the time of writing this, they have no server-oriented NPU offerings.) So for now, NPU-based optimizations may not be applicable for this particular scenario. However, this could well change over the next few years.
Monte Carlo-style simulations
The preceding section described scenarios where we have large data sets and want to analyze them in some way. But what if we want to analyze possible outcomes, and not just past performance? Sometimes we want answers to what-if questions. If we change the price of this product, how will that affect sales? If we change the excess on this insurance policy, how will that affect profitability?
One common way to tackle these kinds of questions is to run simulations. In essence, you generate some fake data that you hope is representative of how real world data would look if you went ahead and made the change, and then you analyze that data.
Naturally this is risky, because you can't know for certain whether your fake data is truly representative. One way to mitigate this is to generate lots of fake datasets, each with small random changes. This might give you a more complete picture of how changes to some variable under your control affect outcomes.
This kind of random variation can also be useful in situations where your data is approximate or incomplete. Weather forecasts are a good example: they attempt to predict future weather based on the current state of the earth's atmosphere, but we can't know the exact position and velocity of every air molecule on the planet. Meteorologists rely on observations from a fairly small number of weather stations. Weather models have a troublesome characteristic when it comes to predictions: sometimes very small inaccuracies in measurements can radically change the predictions. (This is known as the 'butterfly effect' because the difference in air velocity caused by whether or not one butterfly flaps its wings can end up determining whether your model predicts light wind or a hurricane.)
Running multiple simulations in which you deliberately jiggle the measurements around a little bit can help tease out these problems. Sometimes the conditions will be such that the inaccurate and incomplete data doesn't matter much: if you run 1,000 variations and they all predict sunshine, then it's probably going to be sunny. But if half your simulations predict rain, and half predict the sun, then at least you know that you don't know what the weather is going to do. Rigidly defined areas of doubt and uncertainty are better than confident but wrong predictions.
This technique is known as Monte Carlo simulation. (It is so called because Monte Carlo is famous for its casino, amongst other things.)
Obviously, multiple cores help with this sort of simulation: if you want to run lots and lots of variations of the same basic work, the ability to execute multiple concurrent threads is useful.
Less obviously, this might be exactly the kind of problem that could benefit from matrix multiplication hardware of the kind offered by NPUs and most modern GPUs. If you can find a way to express some of the work done by the simulation in terms of linear algebra, it might be possible to combine multiple runs of the simulation into a single stream of work running through an NPU.
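Here's a toy numpy sketch of that batching idea: a whole ensemble of slightly perturbed runs of a simple linear model, where advancing every run by one time step is a single matrix multiplication. (The model and the numbers are invented purely for illustration; a real simulation would be far more elaborate.)

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy linear model: each time step applies x_next = A @ x.
state_size, n_runs, n_steps = 50, 10_000, 100
A = np.eye(state_size, dtype=np.float32) + 0.01 * rng.standard_normal(
    (state_size, state_size), dtype=np.float32)

# One baseline state, plus a small random jitter for every ensemble member.
baseline = rng.random(state_size, dtype=np.float32)
states = baseline[:, None] + 0.01 * rng.standard_normal(
    (state_size, n_runs), dtype=np.float32)

# Advancing all 10,000 runs by one step is a single 50x50 @ 50x10,000 matmul,
# exactly the shape of work that matrix multiplication hardware is built for.
for _ in range(n_steps):
    states = A @ states

# The spread across ensemble members shows how sensitive the outcome is.
print("mean outcome:", states[0].mean(), " spread:", states[0].std())
```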
Combinatorially exploding problems
Certain kinds of optimization problems grow exponentially in complexity. Whereas in data analytics, high computational demand is mostly associated with large volumes of data, there are some problems where even relatively small data sets can require a great deal of work. For example, resource allocation problems such as timetable planning tend to have this characteristic. As problems of this kind grow, they will always eventually reach a point where they simply can't be solved in a reasonable amount of time. However, there are two ways we can sometimes mitigate this.
First, we might be able to avoid reaching the point of impracticality. For example, boolean satisfiability is a problem for which all known algorithms require exponentially increasing work as problem sizes grow (in fact, it's the original NP-complete problem). Nonetheless, solvers such as Z3 are available that can produce results in a reasonable length of time for many practical problem sizes.
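To give a flavour of what using such a solver looks like, here's a tiny sketch using Z3's Python bindings (the z3-solver package); a realistic timetabling problem would involve thousands of variables and constraints rather than three:

```python
from z3 import Bools, Solver, Or, Not, sat

# Three toy boolean decisions, e.g. "the task runs in slot A/B/C".
a, b, c = Bools("a b c")

s = Solver()
s.add(Or(a, b))        # the task must use slot A or slot B
s.add(Or(Not(a), c))   # if it uses slot A, it must also use slot C
s.add(Not(c))          # ...but slot C is unavailable

if s.check() == sat:
    print("satisfiable:", s.model())   # e.g. a = False, b = True, c = False
else:
    print("no assignment satisfies all the constraints")
```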
Second, we might be able to reduce the amount of work required if we don't demand the best possible solution. A heuristic that produces a good solution can often run in far less time than an algorithm that guarantees an optimal one. Or we might be able to reduce the effective size of the problem space by simplifying the data. In either case we might not get the perfect answer, but we might get one that's good enough to be useful in an amount of time that is practical.
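As a toy example of that trade-off, here's a sketch comparing an exhaustive search with a simple greedy heuristic on a tiny knapsack-style problem; the heuristic misses the optimum slightly, but it examines a handful of items rather than every one of the exponentially many subsets:

```python
from itertools import combinations

# Toy knapsack: choose items to maximise value without exceeding the weight limit.
items = [(9, 6), (6, 5), (6, 5), (1, 4)]  # (value, weight) pairs
limit = 10

# Exact answer: try every subset. That's 2^n combinations, which is fine for
# 4 items and hopeless for 60.
best = max(
    (subset for r in range(len(items) + 1) for subset in combinations(items, r)
     if sum(w for _, w in subset) <= limit),
    key=lambda subset: sum(v for v, _ in subset),
)

# Heuristic: greedily take the items with the best value-per-weight that still fit.
remaining, chosen = limit, []
for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
    if weight <= remaining:
        chosen.append((value, weight))
        remaining -= weight

print("optimal value:  ", sum(v for v, _ in best))    # 12
print("heuristic value:", sum(v for v, _ in chosen))  # 10
```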
So instead of throwing our hands up in despair when encountering a problem with exponentially expanding complexity, we might instead just ask: how far can we get in practice? Modern computational capabilities might enable us to get further than we otherwise could.
Complex models
Building computerised models of complex systems has become a critical tool in advanced engineering. A realistic model of a structure such as a bridge can enable designs to be tested extensively in a way that is not practical with physical models. Although a model will never represent the real thing with complete accuracy, the ability to run tests in a computer makes it possible to test a huge number of different scenarios. You can test a physical model to destruction just once, but you can destroy digital models over and over again, enabling a much better understanding of the strengths and weaknesses of a design.
One popular modelling technique is to represent some continuous substance (such as metal, or air) as a large number of small elements, and to calculate how these elements interact. (This is sometimes called Finite Element Analysis, or the Finite Element Method.) This involves performing large numbers of very similar calculations, and as such it's exactly the sort of thing that can benefit from highly parallel hardware.
This kind of work has been making extensive use of GPU acceleration for years, because it's such a natural fit. So if you're using a well-established toolkit, it probably already works that way. That said, a lot of these tools were built on the kinds of GPU capabilities that predate tensor processing, so it's possible that there is scope to further exploit the available hardware as NPUs become more widespread.
Video and other content production
Video processing (such as applying effects, but also even basic work such as editing) is highly parallelisable, because it tends to involve applying common processing to large numbers of pixels in many frames of video. In fact, this kind of workload has been taking advantage of GPU hardware for a very long time. And any kind of video production that involves 3D rendering is an obvious candidate, because that's exactly the problem GPUs were trying to solve back before they got into the business of general computation. So any kind of video image processing or CGI is going to be a good fit for a GPU.
Obviously, video game rendering is also a very good fit, it being exactly the problem a lot of GPU hardware was originally designed to solve. But this traditional workload can also take advantage of the newer, AI-oriented GPUs. Nvidia's CEO has talked about how AI techniques can dramatically improve image quality. For example, AI-based upscaling can enable a rendering engine to produce a relatively low-resolution image, which the AI model then transforms into a high-resolution one. The results can look as good as you'd get by simply throwing far more GPU horsepower at creating a more detailed image in a more conventional way, which can be particularly important in mobile scenarios thanks to better power efficiency.
AI-driven content production features are increasingly widespread, and have even made their way into some mainstream conventional tools, and AI-specific functionality may even displace some common production practices because it can produce results at least as good with less work. However, directly useful AI-based workloads are not the main focus of this series.
Obviously, in these graphics-heavy scenarios, the GPU is likely to be front and centre, but even in this very long established space, the never-ending demand for improved performance is likely to mean that as NPUs become more widespread, they will also take up part of the workload.
Conclusion
Modern computational capabilities may enable us to tackle problems that were once out of reach. Or they might mean that problems that would once have required exotic hardware can now be handled on an ordinary laptop computer. Analytics, what-if simulations, complex modeling, and graphical applications are all particularly well suited to taking advantage of these newer capabilities.