C# 11.0 new features: UTF-8 string literals
C# 11.0 became available with .NET 7.0 in November 2022. It has made a few improvements for string literals. In this post, I'll show how the new UTF-8 string literals feature can help us optimize performance when working with data that tends to be in UTF-8 format, such as JSON, without sacrificing readability.
Text encoding
The C# string type is a sequence of 16-bit code units, typically interpreted as a UTF-16 string. This was a design decision that made sense back when .NET was invented: at the time it ran only on Windows, which used the same encoding.
When the first 32-bit versions of Windows appeared in the early 1990s, they used an older 16-bit string encoding called UCS-2. At the time, Unicode defined well under 65536 code points, so each 16-bit UCS-2 value could represent absolutely any Unicode code point. This fixed-length representation is simpler to work with than variable-length encodings such as UTF-8, so UCS-2 made sense: although it meant strings using the Latin alphabet took twice as much space as most widely used encodings at the time, the payoff was that programs could deal with international character sets without the complications of a variable-length encoding.
(Another significant advantage that UCS-2 had at the time over UTF-8 is that UCS-2 existed. UTF-8 was invented several years after Windows adopted UCS-2.)
With hindsight, that design decision didn't work out so well. It was fine for about a decade, but in 2001, the Unicode 3.1 specification was released, which nearly doubled the number of characters, at which point it was no longer possible for a simple 16-bit encoding to represent all code points. Windows effectively retconned its Unicode character encoding to UTF-16, making the full repertoire of Unicode code points available, but at the price of abandoning the simple fixed-length approach. To represent code points outside of the original 16-bit Basic Multilingual Plane, UTF-16 uses pairs of 16-bit code units to represent a single code point. (For example, the smiling face with open mouth emoji (😃), Unicode code point U+1F603, is represented in UTF-16 as D83D followed by DE03.)
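The surrogate pair quoted above can be checked by hand. The following is a minimal sketch, assuming a .NET 6 or later console project with top-level statements; the variable names are purely illustrative.

```csharp
using System;

// U+1F603 lies outside the Basic Multilingual Plane, so UTF-16 needs two code units.
int codePoint = 0x1F603;

// Manual calculation: subtract 0x10000, then split the remaining 20 bits in half.
int offset = codePoint - 0x10000;
char high = (char)(0xD800 + (offset >> 10));   // top 10 bits go in the high surrogate
char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits go in the low surrogate

// The runtime's own conversion should produce the same two code units.
string viaRuntime = char.ConvertFromUtf32(codePoint);

Console.WriteLine($"{(int)high:X4} {(int)low:X4}");                // D83D DE03
Console.WriteLine(viaRuntime.Length);                              // 2
Console.WriteLine(viaRuntime[0] == high && viaRuntime[1] == low);  // True
```

Note that the string's Length is 2 even though it holds a single code point, which is exactly the loss of the fixed-length property described above.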
Microsoft would most likely not pick UTF-16 if starting from scratch today. It is inefficient for text that predominantly uses the Latin alphabet, because it typically requires twice as many bytes. Here's the text "Hello, world" encoded as ASCII, a popular 7-bit encoding:
{ 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x77, 0x6F, 0x72, 0x6C, 0x64 }
That same text has exactly the same encoding in a variety of widely-used 8-bit encodings including windows-1252 (which at one point was the de facto standard encoding for web pages, although web standards fanatics are typically extremely reluctant to admit this), ISO-8859-1 and, significantly, UTF-8.
Here's the same string in UTF-16 (and since this text doesn't use any characters outside of Unicode's Basic Multilingual Plane, this is also how it looks as UCS-2):
{ 0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00, 0x6F, 0x00, 0x2C, 0x00, 0x20, 0x00, 0x77, 0x00, 0x6F, 0x00, 0x72, 0x00, 0x6C, 0x00, 0x64, 0x00 }
That's twice the size. And if you look closely you'll see that every second byte is zero. In fact, if you strip out all the zero bytes, you end up with the first string.
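If you'd like to reproduce those byte sequences rather than take them on trust, here is a small sketch using the standard System.Text.Encoding classes. It assumes a console project with top-level statements, and the variable names are my own.

```csharp
using System;
using System.Linq;
using System.Text;

string text = "Hello, world";

byte[] ascii = Encoding.ASCII.GetBytes(text);
byte[] utf8 = Encoding.UTF8.GetBytes(text);
byte[] utf16 = Encoding.Unicode.GetBytes(text); // little-endian UTF-16, no byte order mark

Console.WriteLine(utf8.Length);   // 12
Console.WriteLine(utf16.Length);  // 24

// For pure ASCII text, the ASCII and UTF-8 encodings are byte-for-byte identical.
Console.WriteLine(ascii.SequenceEqual(utf8)); // True

// Dropping the zero bytes from the UTF-16 form recovers the 8-bit form.
byte[] stripped = utf16.Where(b => b != 0).ToArray();
Console.WriteLine(stripped.SequenceEqual(utf8)); // True
```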
The main upside of UCS-2 was that it simplified text handling code: if you wanted to read, say, the 6th code point in a string, it was always at byte offset 10, because every code point was 2 bytes long. But once Unicode 3.1 rendered UCS-2 obsolete, that benefit was lost, because in UTF-16, any code points outside of the Basic Multilingual Plane require two code units (4 bytes). And the existence of combining characters means that a single character might consist of multiple code points, so a fixed-length encoding was arguably of limited benefit in any case.
The only situation in which UTF-16 has any advantage over UTF-8 is in text where a high proportion of the code points are in the range 0800-FFFF: these fit in 2 bytes of UTF-16 but require 3 bytes in UTF-8. This includes most Chinese, Japanese, and Korean characters, so for content that is purely written in those languages, UTF-16 may be more space-efficient. But with technical formats such as JSON, HTML, or XML, the structural elements are typically from the ASCII range, so although, say, Chinese text embedded in a JSON document might be more efficiently represented as UTF-16 than UTF-8, UTF-8 encoding will be more efficient for all the {, }, ", and : symbols, and for any property names that use the Latin alphabet, so the JSON document as a whole might still be smaller as UTF-8.
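To make that trade-off concrete, here is a rough sketch that measures a tiny JSON document with a Chinese value. The document itself is a made-up example, not something from the benchmarks in this post, and the source file is assumed to be saved as UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical document: ASCII structure and property name, Chinese value.
string value = "你好世界";
string json = "{\"greeting\":\"" + value + "\"}";

// The Chinese text on its own is smaller in UTF-16...
int valueUtf8 = Encoding.UTF8.GetByteCount(value);      // 4 characters x 3 bytes
int valueUtf16 = Encoding.Unicode.GetByteCount(value);  // 4 characters x 2 bytes

// ...but the document as a whole is smaller in UTF-8, thanks to the ASCII parts.
int jsonUtf8 = Encoding.UTF8.GetByteCount(json);
int jsonUtf16 = Encoding.Unicode.GetByteCount(json);

Console.WriteLine($"value: {valueUtf8} vs {valueUtf16}"); // value: 12 vs 8
Console.WriteLine($"json:  {jsonUtf8} vs {jsonUtf16}");   // json:  27 vs 38
```

A document with much longer runs of Chinese text and little structure could tip the balance the other way, so this is an illustration rather than a rule.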
To summarize, the UTF-16 representation used by .NET is rarely optimal. It has all the downsides of a variable-length encoding, and for the parts of the world that use the Latin alphabet (or code that needs to conform with document types that were designed by that part of the world) strings take twice as much space or network bandwidth as they need to.
That's probably not a design you'd pick if you were starting from scratch today.
UTF-8 has won
In 2023, when text needs to be sent over a network, the most common encoding choice is UTF-8. It can represent any Unicode code point. With content types where a high proportion of the characters will be in the ASCII range (which is true for a lot of JSON) it is a space-efficient encoding. For Latin-based languages and also a host of other scripts including Greek, Cyrillic, Hebrew, and Arabic, it is at least as space efficient as UTF-16. It is admittedly suboptimal for the primary languages of a significant chunk of the world, but it is able to deal with them. It has become the de facto standard for text in most situations.
This puts .NET at a disadvantage. Most of the string processing features of the .NET runtime libraries expect to work with UTF-16 (typically wrapped in a string), but if data arriving via HTTP or off disk is in UTF-8 format, we need to convert it into UTF-16 to be able to use string-based library features. This is not hard, but it comes at a cost: there is extra computational work, and it also often entails allocation of more memory, causing the GC to work harder.
This is why the System.Text.Json library introduced in .NET Core 3.0 works directly with UTF-8. It is able to parse JSON documents in the encoding that they are most likely to be using. (If you have JSON in string form, it has to be converted to UTF-8 before these APIs can process it, so it's best to avoid ever working with string-typed JSON if possible.)
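As a sketch of what that conversion looks like, the following takes JSON held in a string, encodes it as UTF-8, and feeds it to Utf8JsonReader, the low-level reader in System.Text.Json that accepts only UTF-8 input. The sample document and variable names are illustrative assumptions.

```csharp
using System;
using System.Text;
using System.Text.Json;

// JSON that has (unfortunately) ended up in a UTF-16 .NET string.
string jsonText = "{\"prop42\":42}";

// This extra step, and the extra array it allocates, is the cost of starting from a string.
byte[] jsonUtf8 = Encoding.UTF8.GetBytes(jsonText);

// Utf8JsonReader works over a ReadOnlySpan<byte> of UTF-8 data.
var reader = new Utf8JsonReader(jsonUtf8);
int value = 0;
while (reader.Read())
{
    if (reader.TokenType == JsonTokenType.Number)
    {
        value = reader.GetInt32();
    }
}

Console.WriteLine(value); // 42
```

If the data had arrived as UTF-8 bytes in the first place (from a socket or a file, say), the GetBytes call and its allocation would not be needed.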
UTF-8 friction
A problem arises when using data that is natively UTF-8. When working with JSON, it might seem natural to write this sort of thing:
foreach (JsonProperty prop in elem.EnumerateObject())
{
    if (prop.Name == "prop42")
    {
        ...
This works, but it causes a subtle problem: to be able to compare prop.Name with the string "prop42" we need prop.Name to return a .NET string. The JsonProperty type here is part of the System.Text.Json API which, as just described, works directly with UTF-8 representations. So the code above is essentially asking for a string conversion: it's asking the library to decode the UTF-8 property name into a UTF-16 string.
This is why JsonProperty offers a different way to achieve the same effect. We could replace that if with this:
if (prop.NameEquals("prop42"))
This looks a bit odd—why would we use this instead of the much more idiomatic-looking comparison? A benchmark shows why. Here's the first example in a test that we'll execute via BenchmarkDotNet:
[Benchmark(Baseline = true)]
public int ComparePropertyNameAsString()
{
    int total = 0;
    using JsonDocument doc = JsonDocument.Parse(jsonUtf8);
    foreach (JsonElement elem in doc.RootElement.EnumerateArray())
    {
        foreach (JsonProperty prop in elem.EnumerateObject())
        {
            if (prop.Name == "prop42")
            {
                total += 1;
            }
        }
    }
    return total;
}
(The benchmark setup creates a byte[] array containing a JSON document structured as an array with 100,000 elements, each being a single JSON object with a single property. The first is {"prop0":0}, the next is {"prop1":1} and so on. The property names go up to prop99, then repeat from prop0. The benchmark shown above then scans the entire array, counting the number of objects with a property called prop42. This is an artificial example, but representative of real work: checking to see whether a property has some particular name is a pretty common operation.)
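The setup code isn't shown in this post, so what follows is only one plausible way to build such a document, not the code actually used for the measurements. It uses Utf8JsonWriter so that the data is UTF-8 from the start; the names ElementCount, stream, and writer are illustrative, and the property values are assumed to keep counting up rather than repeating.

```csharp
using System;
using System.IO;
using System.Text.Json;

const int ElementCount = 100_000;

using var stream = new MemoryStream();
using (var writer = new Utf8JsonWriter(stream))
{
    writer.WriteStartArray();
    for (int i = 0; i < ElementCount; i++)
    {
        // {"prop0":0}, {"prop1":1}, ... with names wrapping round after prop99.
        writer.WriteStartObject();
        writer.WriteNumber($"prop{i % 100}", i);
        writer.WriteEndObject();
    }
    writer.WriteEndArray();
} // Disposing the writer flushes everything into the stream.

byte[] jsonUtf8 = stream.ToArray();

// Sanity check: one name in every hundred is prop42.
int total = 0;
using JsonDocument doc = JsonDocument.Parse(jsonUtf8);
foreach (JsonElement elem in doc.RootElement.EnumerateArray())
{
    foreach (JsonProperty prop in elem.EnumerateObject())
    {
        if (prop.NameEquals("prop42"))
        {
            total += 1;
        }
    }
}

Console.WriteLine(total); // 1000
```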
I compared this against a near-identical benchmark, where the only difference is that we use the NameEquals method shown above instead of the more idiomatic == operator. Here are the results:
Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|
ComparePropertyNameAsString | 14.13 ms | 0.122 ms | 0.108 ms | 1.00 | 453.1250 | 3920103 B | 1.000 |
ComparePropertyNameWithNameEqualsString | 13.12 ms | 0.097 ms | 0.091 ms | 0.93 | - | 103 B | 0.000 |
Changing the comparison to use NameEquals has delivered a roughly 7% performance improvement. And that's from changing just one line in a benchmark that iterates over the structure of the JSON. The name comparison is not the only work being done in that benchmark, so you might expect any improvement in the text comparison to be lost in the other parsing work being done here. And yet, even though this benchmark is also doing the relatively complex parsing work of dealing with arrays and objects, we can still measure an effect from this one change.
Perhaps more significantly, changing from == to NameEquals has had a dramatic effect on memory usage: the example that used prop.Name == "prop42" allocated almost 4 MB of data. That's because prop.Name has to create a .NET string object, and in this benchmark it creates 100,000 of them. Allocating a lot of unnecessary objects is harmful to overall performance in .NET: it tends to slow everything down, and significant performance gains can often be had from avoiding allocations. The NameEquals version allocated only 103 bytes, despite processing all 100,000 elements in the JSON.
So there was a significant improvement to be had by not asking System.Text.Json to give us the property name in string form. However, we're still doing unnecessary work. The property name will be stored in UTF-8 format in the JSON, but we're passing NameEquals a .NET string. It's having to perform a string comparison across two different string formats: we're effectively asking it to convert between UTF-8 and UTF-16 as it performs the comparison.
Speaking UTF-8
If we add this field to our code:
private static readonly byte[] Prop42Utf8 = Encoding.UTF8.GetBytes("prop42");
we can write a third version of the benchmark that uses a different overload of NameEquals. Again, we change just one line of the code:
if (prop.NameEquals(Prop42Utf8))
We're passing the byte[] array referred to by the static field called Prop42Utf8. I've initialized that field by asking .NET's UTF-8 encoder to convert the string "prop42" into UTF-8 for me.
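The three comparison styles seen so far should always agree on the answer; they differ only in how much work it takes to get there. Here is a small sketch that runs all three over a made-up two-element document (the counter names and the document are my own, not part of the benchmark):

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical two-element document, already in UTF-8 form.
byte[] jsonUtf8 = Encoding.UTF8.GetBytes("[{\"prop41\":41},{\"prop42\":42}]");
byte[] prop42Utf8 = Encoding.UTF8.GetBytes("prop42");

int viaName = 0, viaNameEqualsString = 0, viaNameEqualsUtf8 = 0;

using JsonDocument doc = JsonDocument.Parse(jsonUtf8);
foreach (JsonElement elem in doc.RootElement.EnumerateArray())
{
    foreach (JsonProperty prop in elem.EnumerateObject())
    {
        // Decodes the UTF-8 name into a new .NET string, then compares.
        if (prop.Name == "prop42") viaName++;

        // No string allocation, but compares UTF-8 data against UTF-16 text.
        if (prop.NameEquals("prop42")) viaNameEqualsString++;

        // byte[] converts implicitly to ReadOnlySpan<byte>: UTF-8 against UTF-8.
        if (prop.NameEquals(prop42Utf8)) viaNameEqualsUtf8++;
    }
}

Console.WriteLine($"{viaName} {viaNameEqualsString} {viaNameEqualsUtf8}"); // 1 1 1
```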
Here's what BenchmarkDotNet reports:
Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|
ComparePropertyNameAsString | 14.13 ms | 0.122 ms | 0.108 ms | 1.00 | 453.1250 | 3920103 B | 1.000 |
ComparePropertyNameWithNameEqualsString | 13.12 ms | 0.097 ms | 0.091 ms | 0.93 | - | 103 B | 0.000 |
ComparePropertyNameWithNameEqualsUtf8Array | 12.22 ms | 0.194 ms | 0.181 ms | 0.86 | - | 103 B | 0.000 |
This has the same memory efficiency as the other NameEquals example, but with significant further performance improvement. It is 14% faster than the original, or 7% faster than using NameEquals with a string. That's because we are no longer asking System.Text.Json to compare a UTF-8 string (in the JSON) with a UTF-16 string (in a .NET string). We're just asking it to compare UTF-8 with UTF-8. That's much simpler, and correspondingly faster.
(Yes, I know that in the preceding paragraph it looks like I made the mistake of adding two relative performance measures together: 7 + 7 is 14, but that's not how cumulative performance improvements work. To be strictly accurate, two consecutive 7% performance increases yield an aggregate improvement of (1-(0.93*0.93)), which is 13.51%. However, that would be spurious precision, given the experimental error here. So 14% is the right number, and you can see that from the Ratio column. But if anyone is as pedantic as me (a big if, I know), that last paragraph is going to sound like it's wrong even though it isn't.)
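For fellow pedants, here is that arithmetic as a few lines of code, alongside the ratio you get from the mean timings in the table. It's just a sketch for checking the numbers; the variable names are mine.

```csharp
using System;
using System.Globalization;

// Two consecutive improvements, each leaving 93% of the previous run time.
double stepRatio = 0.93;
double combinedRatio = stepRatio * stepRatio;
double improvementPercent = Math.Round((1 - combinedRatio) * 100, 2);

// The ratio measured directly: UTF-8 array version vs. the string baseline.
double measuredRatio = Math.Round(12.22 / 14.13, 2);

Console.WriteLine(improvementPercent.ToString(CultureInfo.InvariantCulture)); // 13.51
Console.WriteLine(measuredRatio.ToString(CultureInfo.InvariantCulture));      // 0.86
```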
There's one slightly unsatisfactory thing about this: the field initializer has to run to generate the UTF-8. This will impose a startup cost. It doesn't show up in this benchmark because the default BenchmarkDotNet behaviour is to run warmup tests to eliminate transient behaviour—it is deliberately ignoring cold start behaviour. (You can configure it to run differently, but in general, cold start performance is best assessed holistically at the process level.)
We could reduce this a little by providing the data directly in UTF-8 form:
private static readonly byte[] Prop42Utf8 = { 0x70, 0x72, 0x6F, 0x70, 0x34, 0x32 };
This has no effect on the benchmark because the benchmark is ignoring time taken by field initializers, but this might provide a slight improvement, particularly if you have a lot of these sorts of things, because we're no longer getting the Encoding to do the conversion at runtime.
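The obvious risk with hand-written bytes is getting one of them wrong. A quick sanity check along these lines (a sketch you might put in a unit test, with illustrative names) confirms that the literal values really do spell out the intended text:

```csharp
using System;
using System.Linq;
using System.Text;

byte[] handWritten = { 0x70, 0x72, 0x6F, 0x70, 0x34, 0x32 };
byte[] encoded = Encoding.UTF8.GetBytes("prop42");

Console.WriteLine(handWritten.SequenceEqual(encoded));    // True
Console.WriteLine(Encoding.UTF8.GetString(handWritten));  // prop42
```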
Unfortunately, this hasn't completely resolved the startup performance issue with this technique. It will still cause a static initializer to be created, so code will run at startup to initialize that array. (The only improvement was to get rid of the call to Encoding.UTF8.GetBytes.) We can in fact avoid this with a small change, although it's not very clear what's going on:
// This, believe it or not, is dramatically more efficient than the preceding example.
public static ReadOnlySpan<byte> Prop42Utf8 => new byte[] { 0x70, 0x72, 0x6F, 0x70, 0x34, 0x32 };
(The JsonProperty.NameEquals method these benchmarks use provides an overload accepting a ReadOnlySpan<byte>, so this will be an acceptable replacement.) Instead of a static field we now have a read-only static property. This looks like a terrible idea: the code looks like it is going to allocate a new array every single time the property is fetched. However, it turns out that although this is the syntax for constructing a new byte[] array, no array gets created. If you use the array initializer syntax in an expression whose target type is a ReadOnlySpan<T>, the compiler knows that the array itself will never be directly accessible, so it doesn't actually need to create one. As long as it produces a ReadOnlySpan<byte> with the specified contents, code won't be able to tell if there happens not to be a real .NET array behind it.
The way it avoids creating an array is very old-school. It's very much like what happens with string literals in languages like C: the compiler puts the bytes of the string into the DLL, and at runtime, we effectively get a direct pointer to that.
An important feature of spans is that they can be backed by a few different kinds of things. They can refer to ordinary .NET arrays of course, but you can also get a ReadOnlySpan<char> that is effectively a view into a string. If you initialize a Span<T> or ReadOnlySpan<T> with stackalloc, it refers to memory on the stack frame. And it's also possible for these things to refer to memory that is entirely outside of the .NET runtime heap or stack. One of the main use cases for that is to enable programs such as web servers to work directly with operating system buffers: ASP.NET Core is able to wrap memory owned by IIS in a span, for example, enabling .NET code to access that data without first copying it onto the .NET heap. In the example above, the compiler exploits this to enable the UTF-8 data to be embedded in the compiled DLL, and accessed directly from our C# code without ever needing to copy it onto the managed heap.
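Here's a brief sketch of the first three kinds of backing store (array, string, and stack). It doesn't attempt the unmanaged-memory case, and the variable names are illustrative only.

```csharp
using System;

// Backed by an ordinary heap-allocated array.
byte[] array = { 0x70, 0x72, 0x6F, 0x70, 0x34, 0x32 };
ReadOnlySpan<byte> overArray = array;

// A view into part of a string; no new string is allocated for the slice.
string text = "prop42";
ReadOnlySpan<char> overString = text.AsSpan(0, 4);

// Backed by memory in the current stack frame.
Span<byte> onStack = stackalloc byte[6];
overArray.CopyTo(onStack);

Console.WriteLine(overArray.Length);                  // 6
Console.WriteLine(overString.ToString());             // prop
Console.WriteLine(onStack.SequenceEqual(overArray));  // True
```

Code that consumes a span (such as NameEquals) neither knows nor cares which of these it has been given, which is what gives the compiler the freedom described above.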
Unmanaged languages such as C++ do this routinely. In those languages, it's common to compile data directly into the executable, and for the code to use that data where it lies. This used to be unusual in C# because access to such data used to require unmanaged code. But spans make it more practical.
The IL for that property accessor looks something like this:
ldsflda valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=6' '<PrivateImplementationDetails>'::'24A2F13D4F90A8C4D435142C77BB3180A1A157B0F9AC991E13B4CBF2FA0F9DA5'
ldc.i4.6
newobj instance void valuetype [System.Runtime]System.ReadOnlySpan`1<uint8>::.ctor(void*, int32)
ret
That strange-looking field referred to by the first instruction is nominally an instance of a value type representing a fixed-size 6-byte unmanaged array. But this is just how .NET components represent raw binary data that has been compiled into the DLL. The field itself looks like this:
.field static assembly initonly valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=6' '24A2F13D4F90A8C4D435142C77BB3180A1A157B0F9AC991E13B4CBF2FA0F9DA5' at I_000037C8
That last part, at I_000037C8, indicates the address in memory at which the string's bytes will be found (relative to wherever in memory the DLL has been loaded). So the accessor's first instruction, ldsflda, loads a pointer to (i.e., the address of) this raw UTF-8 string data. Since ReadOnlySpan<T> is a value type, that newobj doesn't actually construct anything on the heap. It just creates a value-typed wrapper around the pointer.
By the way, you might be wondering why we need this obscure trick of using a property accessor. Why not just write this?
// A lovely idea, but it won't compile
public static readonly ReadOnlySpan<byte> Prop42Utf8 = new byte[] { 0x70, 0x72, 0x6F, 0x70, 0x34, 0x32 };
This won't compile, because ReadOnlySpan<T> cannot be used as a field type, unless it's an instance field of a ref struct type. You can't have a static field that holds a span. (This has to do with the subtle lifetime semantics of spans, but I won't get into that; this blog is long enough already.)
So that gives us a pretty efficient way to get hold of the UTF-8 data: we've now eliminated the static constructor code, and have a representation that doesn't incur any heap overhead at all. But a couple of problems remain. The use of the array initializer syntax with numeric values makes this hard to read and modify; it's not at all obvious that this is the text "prop42". And the trick of writing a static property where the accessor appears to construct an array (but doesn't really) is somewhat non-obvious, particularly since the near-identical static field approach does construct an array.
If only there were a less obscure way to achieve all these benefits...
Speaking UTF-8 like a native in C# 11.0
Finally we get to the topic of this blog!
With C# 11.0, we can write this:
if (prop.NameEquals("prop42"u8))
Notice the u8 suffix on the string literal. This tells the compiler that we want a UTF-8 string.
The rest of the benchmark remains the same, and this produces identical results to the benchmark shown above that uses the UTF-8 array (and it's also identical to the one using a span). So we've got all of the performance, but with simpler code—we didn't need to define a field to hold the UTF-8 representation.
When we add that u8 suffix to a string literal, the resulting type is a ReadOnlySpan<byte>. And the compiler emits the string data into the DLL in exactly the same way as it did in the example above with the static property that returns a ReadOnlySpan<byte>. So we get all the performance benefits of that approach, but with none of the hassle.
And that, in a nutshell, is the benefit of UTF-8 string literals. When you're working with data that is natively UTF-8 and you want to get a ReadOnlySpan<byte> representing a particular string constant, UTF-8 string literals provide a very easy but high-performance way to do that.
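To round things off, here is a short sketch showing that a u8 literal holds exactly the same bytes as the hand-written and encoder-produced versions from earlier. It assumes a project with the language version set to C# 11 or later, and the variable names are illustrative.

```csharp
using System;
using System.Text;

// The u8 suffix yields a ReadOnlySpan<byte> directly.
ReadOnlySpan<byte> literal = "prop42"u8;

// The two earlier ways of obtaining the same bytes.
ReadOnlySpan<byte> handWritten = new byte[] { 0x70, 0x72, 0x6F, 0x70, 0x34, 0x32 };
ReadOnlySpan<byte> encoded = Encoding.UTF8.GetBytes("prop42");

Console.WriteLine(literal.Length);                      // 6
Console.WriteLine(literal.SequenceEqual(handWritten));  // True
Console.WriteLine(literal.SequenceEqual(encoded));      // True

// Non-ASCII text works too: é (U+00E9) takes two bytes in UTF-8.
ReadOnlySpan<byte> accented = "caf\u00E9"u8;
Console.WriteLine(accented.Length);                     // 5
```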