Increasing performance via low memory allocation in C#
Over the past few months I've worked on multiple projects involving reactive processing of large amounts of data. In a world where the volume of data is increasing at an almost inconceivable rate, being able to process this data efficiently and cheaply is vital.
A large part of the work in these projects went into reducing the memory allocated as the data was being processed. There are three main considerations when thinking about memory allocation:
- How much memory is in productive use?
- How much memory is being held on to?
- How much work is the garbage collector having to do?
The considerations around garbage collection are particularly important when thinking about performance. This is because garbage collection takes up CPU time, reducing the time spent on the actual data processing. Not only this, but each time a garbage collection is triggered, much of the work is suspended so that the remaining references can be evaluated. This can drastically affect the amount of time taken to process the data.
In the first of our projects, this was exactly what was occurring. The processing was happening very slowly, and scaling out was not having much effect on performance, because during a garbage collection all of those cores were just sitting idle. In going from our original solution to a low-allocation one, we increased the number of events we could process by a factor of 10.
Moving onto the second project, we were creating a system for processing large amounts of data using Azure Functions. The benefits of decreasing memory usage here are two-fold: firstly the reduction in allocation meant that the processing happened faster, but secondly Functions' consumption plan pricing model is based on memory used × time, so both the time taken and the memory usage itself have a direct impact on cost.
So, with all this in mind I thought I'd run through some of the techniques we used for identifying and decreasing memory usage.
List pre-allocation
When I first started trying to reduce the memory for the functions it was because I was running into the limit imposed for running functions locally (which is set at around 1.5GB, in line with the limit for functions on the consumption plan). This memory overload seemed to be happening at a specific point in the application, so at first I assumed it was being caused by a specific piece of data.
As it turned out, I was actually running into a detail of how lists work in C#. Lists are built on arrays: each list has an internal array which holds the objects you add to it. Arrays in .NET have a fixed size, so the list has to create new, larger arrays as you add objects. When you create a list without specifying a capacity, the internal array defaults to a size of 4 when you add the first element. Each time the array reaches capacity, the list creates a new array with double the capacity, copies the elements over, and discards the old array. This not only makes for quite a lot of extra work copying elements, but also means you end up with a lot of old arrays which need to be cleaned up by the garbage collector.
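You can watch this growth happen via the list's Capacity property. A minimal sketch (the values printed reflect the documented List&lt;T&gt; growth behaviour on current .NET runtimes):

```csharp
using System;
using System.Collections.Generic;

var list = new List<int>();
Console.WriteLine(list.Capacity); // 0 - no internal array allocated yet

list.Add(1);
Console.WriteLine(list.Capacity); // 4 - the first internal array

for (int i = 0; i < 4; i++) list.Add(i);
Console.WriteLine(list.Capacity); // 8 - doubled; the old array is now garbage
```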
So, in my large (>1M object) lists, it was this memory doubling that was pushing me over the memory limits. When an already fairly large list reached capacity, it allocated a new internal array double that size, causing a sudden, rapid increase in memory consumption. Not only this, but the old array was probably on the large object heap (as it was already using quite a lot of memory), so it would not be cleaned up until a Gen 2 garbage collection occurred.
It is worth mentioning that I was already operating fairly close to these memory limits. If we rely on the default behaviour, then during a resize the list temporarily consumes 3x the memory it did before (the old array plus the new array of double the size). This means that if you are currently using anywhere near 1/3 of the total memory available (which is not an ideal situation to start with), this can cause real problems. The default behaviour also means that, on average, lists will be taking up 33% more space than strictly necessary, and in the worst case be almost double the needed size.
These factors are particularly important to consider if most of the memory usage in the application is due to one list, as this behaviour will then have a huge impact on the total memory consumption of the application.
This goes to the second of our considerations - *how much memory is being held on to* - because we could end up with arrays which are almost double the size they need to be.
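The fix is straightforward when you know (or can reasonably estimate) the final size up front: pass a capacity to the list's constructor so the internal array is allocated once. A minimal sketch (the element type and count here are illustrative, not from the original project):

```csharp
using System.Collections.Generic;

const int expectedCount = 1_000_000;

// The internal array is allocated once at the required size, avoiding
// every intermediate grow-and-copy and the garbage arrays they leave behind.
var events = new List<long>(expectedCount);

for (int i = 0; i < expectedCount; i++)
{
    events.Add(i); // no resizes occur while Count stays within the capacity
}
```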
IEnumerable
However, even with the savings from pre-allocation, the memory usage of the application was still relatively high. One of the main ways in which we reduced this memory consumption was in moving away from lists and using IEnumerables instead. This reduces memory usage because all of the processing and filtering of the data can happen without ever needing to collect all of the data at once: each piece of data can flow through the processing without the full list ever being stored. Of course, this only works if you don't need to perform operations over the list as a whole. In our use case we were filtering the data as we went, and didn't need to enumerate the results until we'd reduced down to the relevant data. Because the data never needs to be stored in memory all at once, this reduces the maximum memory used during the processing.
However, in the case where you need to perform any whole-list operation, it can be much more expensive to use an enumerable rather than a list. It also relies on the enumerable actually being able to produce items on demand. As with all of these examples, the specific situation must be considered before optimizing, and even once considered it is worth carrying out benchmarking to test the effects of any change.
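As an illustrative sketch of this pattern (the generator and filter here are made up, not our actual pipeline), an iterator method lets each item flow through the filtering one at a time, and only the reduced result is ever materialised:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<int> ReadValues()
{
    // yield return produces one item at a time; the full sequence
    // of a million values is never held in memory.
    for (int i = 0; i < 1_000_000; i++)
    {
        yield return i; // in a real system this might read from a stream
    }
}

// Where() is also lazy, so each value is generated, tested, and
// (usually) discarded before the next one is produced.
var relevant = ReadValues().Where(v => v % 1000 == 0);

// Only the filtered result (1,000 items here) is ever collected.
var results = relevant.ToList();
Console.WriteLine(results.Count);
```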
Reference types
Another thing I tried when I was first looking at reducing the memory usage was to convert some of the classes to structs. On the surface this makes sense, as structs do not incur the overhead of storing the object on the heap (16 bytes in a 64-bit process), plus the 8 bytes for storing each reference to the object. So, in converting the classes to structs you save 24 bytes per object (assuming one reference), and with large data volumes this can make a large difference to the memory allocated. However, what I didn't consider was the fact that I actually needed multiple copies of many of the objects.
Because I needed multiple copies of the same objects, when I converted them to structs, I actually ended up having to store multiple of each object rather than one single object and multiple references to it.
This meant that for my 56-byte object:
- As a class on the heap: 56 + 16 + (8 × 2) = 88 bytes
- As structs (two copies): 56 + 56 = 112 bytes
So, using structs actually increased the memory usage. On top of this, I had more than two copies in some cases. It obviously depends on the size of your objects, but given enough copies eventually using structs will become less memory efficient.
This all goes to our first consideration - *how much memory is in productive use* - as literally duplicating objects will increase the total memory being used by the application. However, it is also worth considering that holding references to many individual objects within a large array will make the GC work a lot harder than having large collections of value types. This could also have a large impact on performance, and these cases should be benchmarked to verify which delivers better performance.
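The trade-off is easy to see in a sketch (the 56-byte payload below is hypothetical, chosen to match the sizes above): assigning a struct copies all of its data, while assigning a class copies only an 8-byte reference.

```csharp
// Hypothetical 56-byte payload: seven 8-byte fields.
struct EventStruct { public long A, B, C, D, E, F, G; }

class EventClass { public long A, B, C, D, E, F, G; }

class Demo
{
    static void Main()
    {
        var s1 = new EventStruct();
        var s2 = s1;  // full 56-byte copy: 112 bytes of data in total

        var c1 = new EventClass();
        var c2 = c1;  // copies an 8-byte reference: one 72-byte object
                      // (56 + 16-byte header) plus two 8-byte references
                      // = 88 bytes in total
    }
}
```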
Span<T>
With C# 7.2 we saw the addition of the Span&lt;T&gt; type. This provides a way of accessing contiguous memory. The Span&lt;T&gt; itself is always allocated on the stack (it is a ref struct), but the memory it lets you access can be located anywhere: spans can be used to access arrays, strings (read-only, via ReadOnlySpan&lt;char&gt;), pointers or stackalloc memory. In our case we were using them to access arrays of bytes, as our data arrived as byte streams.
When you use spans, no additional memory is allocated on the heap. You can slice the spans in order to access certain parts of the total memory without causing extra allocations. Therefore if we split the incoming data into spans representing each piece of data, this data can then be sliced to retrieve the individual properties for that data chunk, and all that needs to be stored in order to access those properties is the indexes within the span at which they reside.
This doesn't technically allow us to do anything we couldn't before, but it can greatly simplify zero allocation memory access:
- It wraps everything (data, offset, length) in a single object, making it much easier to keep track of.
- A span can be used to point to various different memory types, so one piece of code can be agnostic to the actual memory being accessed.
- You can make spans read-only, meaning that memory access can be controlled in a way it can't be with arrays.
For example, say we were processing events which had a timestamp, a count, and a location, with these properties all encoded into a byte stream. If we knew that for each event the timestamp was stored in the first 8 bytes (as a long), the count in the next 2 bytes, and the location in the remaining bytes, then we could create a span over that event's data. To access the properties for that event we would just slice the first 8 bytes of the span for the timestamp, the next 2 for the count, and take the remainder for the location. We can then pass this data around without the need to allocate more memory. If we had instead stored this as an array of bytes, then in order to slice that array and retrieve the properties, we would need to allocate new arrays for each section.
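Here is a minimal sketch of that layout (the exact field layout and the big-endian encoding are assumptions for illustration, not the real wire format):

```csharp
using System;
using System.Buffers.Binary;

// An example event: 8-byte timestamp, 2-byte count, 4-byte location.
byte[] raw = { 0, 0, 0, 0, 0, 0, 0, 42, 0, 3, 1, 2, 3, 4 };
ProcessEvent(raw); // byte[] converts implicitly to ReadOnlySpan<byte>

static void ProcessEvent(ReadOnlySpan<byte> eventData)
{
    // Slice() returns a new span over the same underlying memory,
    // so none of this allocates on the heap.
    long timestamp = BinaryPrimitives.ReadInt64BigEndian(eventData.Slice(0, 8));
    ushort count = BinaryPrimitives.ReadUInt16BigEndian(eventData.Slice(8, 2));
    ReadOnlySpan<byte> location = eventData.Slice(10);

    // The slices can be passed around freely without copying the bytes;
    // with plain arrays, each of these would have been a new byte[].
    Console.WriteLine($"{timestamp}: {count} event(s), {location.Length}-byte location");
}
```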
This is a really powerful tool for processing data from IoT devices, as large volumes of data can be processed relatively cheaply. For example, we have been working on an OSS project to create a high performance AIS parser - AIS.NET - a parser for a standard maritime shipping data format which can be used to process large streams of shipping data from around the world.