C# 11.0 new features: Span<char> pattern matching
This fifth post in my series on new features in C# 11.0 is the second of two posts on pattern matching.
String constant pattern recap
We've been able to use string constants as patterns for a long time in C#, e.g.:
if (name is "Lobby Lud")
{
    Console.WriteLine("I claim my five pounds");
}
Back in C# 1.0, switch statements provided special support for using string constants as case values. When C# 7.0 enhanced switch statements to allow the use of patterns, the special string handling was retconned into being treated as a pattern just like all the other forms of case. So when you write this sort of thing:
switch (name)
{
    case "Lobby Lud":
        Console.WriteLine("I claim my five pounds");
        break;
    case "What?":
        Console.WriteLine("Who?");
        break;
}
C# would once have processed this as relying on the intrinsic support for the string constant form of case. But nowadays, the case keyword can be followed by any pattern, so the compiler sees these as constant patterns.
Span as input to string constant pattern
In the examples above, I've not shown the declaration of the name variable, making it hard to tell what its type is. You might be wondering if, once again, I'm taking the opportunity to explain what I dislike about var, but on this occasion I have a different motive.
Prior to C# 11.0, name would need to be a string for those examples to compile correctly. But now, it would also work if name were a ReadOnlySpan<char> or Span<char>.
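As a minimal sketch of what that means in practice, the class below uses both span types as pattern inputs. The class and method names (SpanPatternDemo, Classify, IsLobbyLud) are hypothetical, invented only for illustration, and it assumes a C# 11.0 or later compiler.

```csharp
using System;

// Hypothetical demo class; the names here are illustrative, not from any library.
public static class SpanPatternDemo
{
    // The same switch statement as above, but the input is a ReadOnlySpan<char>.
    public static string Classify(ReadOnlySpan<char> name)
    {
        switch (name)
        {
            case "Lobby Lud":
                return "I claim my five pounds";
            case "What?":
                return "Who?";
            default:
                return "No match";
        }
    }

    // A writable Span<char> is also accepted as the input to a string constant pattern.
    public static bool IsLobbyLud(Span<char> name) => name is "Lobby Lud";

    // Copies the text into a stack-allocated buffer so that the pattern input
    // really is a Span<char> and not a string in disguise.
    public static bool IsLobbyLudViaStackBuffer(string text)
    {
        Span<char> buffer = stackalloc char[text.Length];
        text.AsSpan().CopyTo(buffer);
        return IsLobbyLud(buffer);
    }
}
```

Because string converts implicitly to ReadOnlySpan<char>, a call such as SpanPatternDemo.Classify("What?") also compiles, although the interesting case is when the span refers to part of some larger buffer.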
If you're not familiar with these types, I wrote years ago about how we used them to implement high performance AIS.NET parsing. But to quickly recap, a Span<T> or ReadOnlySpan<T> can represent any contiguous sequence of values in memory. They are conceptually like arrays, but they are more flexible in that they can refer to different kinds of memory: they don't necessarily have to refer to an array, or even to memory that lives on the GC heap. A string is a sequence of char values, but it's not an array. So we couldn't use it via char[], but we can refer to it as a ReadOnlySpan<char>.
Why would we? An important feature of Span<T> and ReadOnlySpan<T> is that they provide a highly efficient way to represent subsections of the data. If you slice out a subsection of a string using someString[4..10], C# will generate code that creates a brand new string object containing a copy of just the characters you asked for. But if you do exactly the same thing with a ReadOnlySpan<char>, no new copies of the data will be made. You end up with a new ReadOnlySpan<char> (which, being a value type, doesn't require a new heap object) that points to the same underlying data; it just points to slightly less of it.
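One way to see the difference is to ask whether a slice still lies inside the original string's memory. The sketch below does that with the MemoryExtensions.Overlaps extension method. SliceDemo and its method names are hypothetical, and the comparison is only meaningful for a non-empty range that covers less than the whole string (a range covering the entire string may hand back the original string object without copying).

```csharp
using System;

// Hypothetical demo class; the names here are illustrative, not from any library.
public static class SliceDemo
{
    // Slicing the span copies nothing, so the slice lies inside the
    // memory occupied by the source string's characters.
    public static bool SpanSliceSharesMemory(string source, Range range)
    {
        ReadOnlySpan<char> slice = source.AsSpan()[range];
        return slice.Overlaps(source.AsSpan());
    }

    // Slicing the string itself produces a separate string object holding
    // a copy of the characters (assuming a proper, non-empty sub-range).
    public static bool StringSliceSharesMemory(string source, Range range)
    {
        string copy = source[range];
        return copy.AsSpan().Overlaps(source.AsSpan());
    }
}
```

With a source such as "Hello, Lobby Lud!" and the range 7..16, the first method should report that the memory is shared and the second that it is not.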
So imagine the string you wish to inspect is a substring of some larger string. Before C# 11.0, to use either of the examples above you'd have had to write something like:
string name = doc[nameStartIndex..nameEndIndex]; // Allocates a new string on the GC heap
But in C# 11.0, you can instead write this:
ReadOnlySpan<char> name = doc.AsSpan()[nameStartIndex..nameEndIndex]; // No new string is allocated
The resulting span doesn't have its own copy of the data. It just refers to the part of the doc string that contains the data of interest. And thanks to the new C# 11.0 feature I'm discussing, you can use that ReadOnlySpan<char> name with patterns as shown in the earlier examples.
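Putting those pieces together, here is a small sketch that slices a name out of a larger string and matches it with a switch expression, without allocating a substring. The doc, nameStartIndex and nameEndIndex names mirror the snippet above; the DocNameDemo class and its Respond method are hypothetical.

```csharp
using System;

// Hypothetical demo class; the names here are illustrative, not from any library.
public static class DocNameDemo
{
    public static string Respond(string doc, int nameStartIndex, int nameEndIndex)
    {
        // Refers to the characters inside doc; no substring is allocated.
        ReadOnlySpan<char> name = doc.AsSpan()[nameStartIndex..nameEndIndex];

        // String constant patterns applied to a span input (C# 11.0 or later).
        return name switch
        {
            "Lobby Lud" => "I claim my five pounds",
            "What?" => "Who?",
            _ => "No match",
        };
    }
}
```

For instance, DocNameDemo.Respond("Hello, Lobby Lud!", 7, 16) inspects only the nine characters of the name and leaves the rest of the string alone.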
What it compiles to
The compiler generates code that performs the comparison using the SequenceEqual extension method defined by the MemoryExtensions class. So the first example is equivalent to this:
if (name.SequenceEqual("Lobby Lud"))
{
    Console.WriteLine("I claim my five pounds");
}
No UTF-8 support
If you've been reading this whole series, you may recall C# 11.0's new UTF-8 string literals feature, and you might be wondering whether this new support for span-based pattern matching also works with UTF-8 text. Could you write this, for example?
ReadOnlySpan<byte> textUtf8 = "Hello"u8;
if (textUtf8 is "Hello"u8) // Won't compile
{
    Console.WriteLine("Match");
}
Let's ignore the entirely unnecessary nature of the comparison: obviously by inspection we'd expect this test always to succeed. But in fact it won't compile. The compiler does not recognize UTF-8 string constants as a type of constant pattern. (As far as I know, there isn't some fundamental reason that it couldn't. But it would require additional language support, because the expression "Hello"u8 is of type ReadOnlySpan<byte>, and that's not in the list of things that can be used as a constant pattern. And if you're thinking that this sounds like an odd choice because ReadOnlySpan<char> isn't exactly a million miles from ReadOnlySpan<byte>, remember that the new language feature I'm discussing changes only what is allowed as the input to a pattern; it does not add any new ways of defining a pattern.)
In this specific example, we could get the behaviour we want by just calling the same extension the compiler uses with strings:
ReadOnlySpan<byte> textUtf8 = "Hello"u8;
if (textUtf8.SequenceEqual("Hello"u8))
{
    Console.WriteLine("Match");
}
but that's not a general solution: we've replaced a pattern with a method invocation, meaning we can't do this in all places where a pattern is expected. So we can't use a UTF-8 string constant as a pattern in a switch statement or expression, for example.
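If you need something switch-like over UTF-8 data today, the closest stand-in I know of is a chain of explicit SequenceEqual calls. The sketch below is one way that might look, not a recommended idiom: Utf8MatchDemo and Respond are hypothetical names, and you lose the compiler's pattern analysis because these are ordinary method calls.

```csharp
using System;

// Hypothetical demo class; the names here are illustrative, not from any library.
public static class Utf8MatchDemo
{
    // Stand-in for the switch we can't write: each "case" becomes an explicit
    // byte-by-byte comparison against a UTF-8 string literal.
    public static string Respond(ReadOnlySpan<byte> textUtf8)
    {
        if (textUtf8.SequenceEqual("Lobby Lud"u8))
        {
            return "I claim my five pounds";
        }

        if (textUtf8.SequenceEqual("What?"u8))
        {
            return "Who?";
        }

        return "No match";
    }
}
```

This keeps the data as UTF-8 bytes throughout, at the cost of writing out by hand the comparisons that a pattern would otherwise express.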
Summary
ReadOnlySpan<char> can provide a memory efficient way to work with substrings. You can obtain a ReadOnlySpan<char> that refers to some subsection of a string (or of any sequence of char values) without needing to allocate a new object. C# 11.0 enables us to use the resulting ReadOnlySpan<char> as the input to a string constant pattern.