An Overview of the Corvus.Globbing Library
TLDR; In this post we look at Corvus.Globbing, a C# library for working with globs. We show how to get started with the library, some examples demonstrating how to use it, and performance benchmarks illustrating its low allocation properties and superior performance over other similar libraries.
Introduction
The Corvus GitHub Organisation contains a series of open-source repositories that provide tools and support for common requirements such as tenancy storage, authentication and more. Corvus encapsulate endjin's recommended practices for building .NET applications. These practices are founded in the knowledge gained through over a decade of real world experience building .NET applications for our customers.
Each blog post in this series will focus on a single Corvus repository, outlining the repository's intended use and how to get started.
Overview of Corvus.Globbing
In this blog post Corvus.Globbing is under the spotlight. Corvus.Globbing is a library for working with globs, it was built to be zero allocating with performance comparable to (or better than) DotNet.Glob and raw Regular Expressions, when running under dotnet 6.0.
About Globbing
Globbing is the process of using wildcard characters for defining patterns to match file names and directories against partial names. For example, in a bash shell, if you were to run ls *.txt
, it would return all files inside the current working directory with a .txt
file extension. The text *.txt
is the glob. The wildcard character, *
, essentially says "match against anything here" – so, find every file inside this directory whose name ends with the sequence of characters: '.txt'.
As another example, If you were to run ls hello*.txt
, it would match all files beginning with 'hello' and ending with '.txt', meaning this glob would match against a file named 'hello-world.txt', for example. Alternatively, ls *hello*.txt
, matches all files with names that contain the word 'hello' and end with '.txt'. In all of these cases the wildcard character, *
, says "match against anything here". The *
is the most common, but not the only, wildcard character; Corvus.Globbing supports others, as you'll see in the repo's README.
Internally, the bash shell has to parse the glob, then do some searching and comparing to find the file names that match the pattern provided. This is what Corvus.Globbing offers – the ability to match a glob pattern against file paths and names. Specifically, it allows you to say "does this glob match this file name/path" through the Match()
method, as you'll see shortly.
Use Cases
The particular use case we have optimized for is a case-sensitive glob that will be matched against a large number of paths (100+), but will not necessarily be cachable.
We want a very high-performance parse of that glob, and then a performant application of the glob to a large number of candidate paths.
Our motivation for this came when "link stripping" documents to be returned from an HTTP request, to remove links that the requesting identity is not permitted to see (perhaps for security, feature enablement, or just local context). To make this process high performance we needed to minimise allocations on the hot-path of the request handler, the library has therefore been designed to be low-allocating.
Corvus.Globbing is built for net6.0
.
Performance
We offer better raw matching performance (~10-30%) against a pre-tokenized glob pattern, for the StringComparison.Ordinal
(case sensitive) default, than Dotnet.Glob.
Our tokenization is also ~10x faster than Dotnet.Glob so it is significantly faster in the single use/throwaway case.
See detailed performance benchmarks at the end of this post.
How to get started
Corvus.Globbing is available on NuGet. To get started with Corvus.Globbing, add a reference to the Corvus.Globbing NuGet package in your project, by running the following command:
dotnet add package Corvus.Globbing
How to use
Within the Corvus.Globbing repository, there is a Polyglot Notebook (under the folder Corvus.Globbing.Documentation.Examples) that provides examples on how to use the library. You can pull down the repo and run the notebooks locally yourself.
Examples
Basic use
The simplest use case just requires you to pass the glob and candidate path to the Match()
method to determine if the glob and path match.
bool isMatch = Glob.Match("path/*atstand", "path/fooatstand"); // Returns: True
You can provide all the standard string-comparison types to the comparisonType
parameter to customise the comparison behavior. The default is StringComparison.Ordinal
(i.e. case sensitive), but you could do a case insensitive match:
bool isMatch = Glob.Match("path/*atstand", "PATH/fooatstand", StringComparison.OrdinalIgnoreCase); // returns: True
Efficient matching of a glob against multiple paths
If you want to match a glob against a number of paths, you want to avoid re-tokenizing the glob for each comparison.
For low memory allocation requirements and increased performance you can also stack allocate a tokenized glob array and reuse it:
string pattern = "path/*atstand";
// There can't be more tokens than there are characters in the glob pattern, so we allocate an array at least that long.
Span<GlobToken> tokenizedGlob = stackalloc GlobToken[pattern.Length];
int tokenCount = Corvus.Globbing.GlobTokenizer.Tokenize(pattern, ref tokenizedGlob);
// And then slice off the number of tokens we actually used
ReadOnlySpan<GlobToken> glob = tokenizedGlob[..tokenCount];
bool firstMatch = Glob.Match(pattern, glob, "path/fooatstand"); // Returns: True
bool secondMatch = Glob.Match(pattern, glob, "badpath/fooatstand"); // Returns: False
Efficient matching of a large glob against multiple paths
For very long potential globs you may not want to allocate them on the stack, instead you could fall back to the ArrayPool
allocation technique.
The following example is useful if you don't necessarily know how long the glob is (e.g. if it's a command line argument, or an input to a web API). The code decides to use stackalloc
if the provided pattern is less than a specified length, and the ArrayPool
allocation technique when it is not.
int MaxGlobTokenArrayLength = 1024;
// There can't be more tokens than there are characters in the glob.
GlobToken[]? globTokenArray = null;
Span<GlobToken> globTokens = pattern.Length > MaxGlobTokenArrayLength ? stackalloc GlobToken[0] : stackalloc GlobToken[pattern.Length];
if (pattern.Length > MaxGlobTokenArrayLength)
{
globTokenArray = ArrayPool<GlobToken>.Shared.Rent(pattern.Length);
globTokens = globTokenArray.AsSpan();
}
try
{
int tokenCount = GlobTokenizer.Tokenize(pattern, ref globTokens);
ReadOnlySpan<GlobToken> tokenizedGlob = globTokens[..tokenCount];
// Do your matching here...
bool firstMatch = Glob.Match(pattern, tokenizedGlob, "path/fooatstand");
Console.WriteLine($"Result of first match:\t{firstMatch}");
bool secondMatch = Glob.Match(pattern, tokenizedGlob, "badpath/fooatstand");
Console.WriteLine($"Result of second match:\t{secondMatch}");
}
finally
{
if (pattern.Length > MaxGlobTokenArrayLength)
{
ArrayPool<GlobToken>.Shared.Return(globTokenArray);
}
}
Benchmarks
We have used Benchmark Dotnet to compare the performance with raw RegEx and DotNet.Glob
. This represents a current example run.
The numbers in square brackets in the cells under the Pattern column in the tables below are conveying the number of characters in the glob pattern used in that test run.
Compile
Here we're comparing the time and memory performance for compiling code that tokenizes a glob pattern.
Method | Pattern | Mean (ns) | Error (ns) | StdDev (ns) | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|---|---|---|
New_Compiled_Regex_Glob | p?th/(...)].txt [50] | 24,518.93 | 446.904 | 418.035 | 145.92 | 2.71 | 5.0659 | - | 21,272 B |
New_DotNet_Glob | p?th/(...)].txt [50] | 1,614.43 | 24.121 | 22.563 | 9.61 | 0.18 | 0.6733 | - | 2,816 B |
New_Corvus_Glob | p?th/(...)].txt [50] | 168.04 | 1.866 | 1.746 | 1.00 | 0.00 | - | - | - |
New_Compiled_Regex_Glob | p?th/(...)].txt [21] | 13,920.81 | 150.097 | 133.057 | 159.84 | 2.16 | 3.4332 | - | 14,392 B |
New_DotNet_Glob | p?th/(...)].txt [21] | 900.49 | 12.604 | 11.790 | 10.34 | 0.17 | 0.4549 | - | 1,904 B |
New_Corvus_Glob | p?th/(...)].txt [21] | 87.10 | 0.790 | 0.700 | 1.00 | 0.00 | - | - | - |
New_Compiled_Regex_Glob | p?th/(...)].txt [46] | 17,286.88 | 183.742 | 171.873 | 112.94 | 1.72 | 4.1504 | - | 17,432 B |
New_DotNet_Glob | p?th/(...)].txt [46] | 1,375.73 | 10.866 | 10.165 | 8.99 | 0.11 | 0.5665 | - | 2,376 B |
New_Corvus_Glob | p?th/(...)].txt [46] | 153.08 | 1.539 | 1.439 | 1.00 | 0.00 | - | - | - |
New_Compiled_Regex_Glob | p?th/a[e-g].txt | 12,350.67 | 206.552 | 193.209 | 188.56 | 6.84 | 3.1433 | 0.0153 | 13,168 B |
New_DotNet_Glob | p?th/a[e-g].txt | 726.14 | 8.640 | 7.659 | 11.08 | 0.27 | 0.3328 | - | 1,392 B |
New_Corvus_Glob | p?th/a[e-g].txt | 65.32 | 1.307 | 1.605 | 1.00 | 0.00 | - | - | - |
Compile and match false
Comparing the time and memory performance for compiling and running Corvus.Glob.Match
against the regex and DotNet.Glob
equivalents, for the case where the string and glob pattern do not match.
Method | # Of Matches | Pattern | Mean (ns) | Error (ns) | StdDev (ns) | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|
Compiled_Regex_IsMatch | 1 | p?th/(...)].txt [21] | 102,434.95 | 1,962.576 | 3,734.006 | 1,129.38 | 51.15 | 5.3711 | 2.6855 | 22,656 B |
DotNetGlob_IsMatch | 1 | p?th/(...)].txt [21] | 954.59 | 17.737 | 16.591 | 10.41 | 0.27 | 0.4549 | - | 1,904 B |
CorvusGlob_IsMatch | 1 | p?th/(...)].txt [21] | 91.73 | 1.822 | 1.521 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 1 | p?th/(...)].txt [46] | 111,423.68 | 447.546 | 396.738 | 703.52 | 13.52 | 6.9580 | 3.4180 | 29,591 B |
DotNetGlob_IsMatch | 1 | p?th/(...)].txt [46] | 1,404.32 | 21.916 | 20.500 | 8.86 | 0.24 | 0.5665 | - | 2,376 B |
CorvusGlob_IsMatch | 1 | p?th/(...)].txt [46] | 158.77 | 3.049 | 2.994 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 1 | p?th/a[e-g].txt | 95,250.25 | 1,534.617 | 1,360.398 | 1,310.39 | 24.75 | 4.6387 | 2.3193 | 19,799 B |
DotNetGlob_IsMatch | 1 | p?th/a[e-g].txt | 733.96 | 9.342 | 8.738 | 10.11 | 0.12 | 0.3328 | - | 1,392 B |
CorvusGlob_IsMatch | 1 | p?th/a[e-g].txt | 72.70 | 0.659 | 0.584 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [21] | 294,937.84 | 3,379.605 | 2,995.932 | 3.30 | 0.04 | 5.3711 | 2.9297 | 22,648 B |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [21] | 115,916.75 | 1,218.873 | 1,080.499 | 1.30 | 0.02 | 0.3662 | - | 1,904 B |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [21] | 89,329.52 | 829.171 | 775.607 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [46] | 306,961.22 | 4,678.470 | 4,376.244 | 3.34 | 0.07 | 6.8359 | 3.4180 | 29,582 B |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [46] | 119,904.90 | 1,252.195 | 1,171.304 | 1.30 | 0.02 | 0.4883 | - | 2,376 B |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [46] | 91,902.09 | 1,427.773 | 1,335.540 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 10000 | p?th/a[e-g].txt | 289,016.17 | 4,305.232 | 4,027.117 | 3.16 | 0.04 | 4.3945 | 1.9531 | 19,788 B |
DotNetGlob_IsMatch | 10000 | p?th/a[e-g].txt | 120,199.71 | 1,357.718 | 1,270.010 | 1.31 | 0.02 | 0.2441 | - | 1,392 B |
CorvusGlob_IsMatch | 10000 | p?th/a[e-g].txt | 91,516.00 | 768.001 | 641.316 | 1.00 | 0.00 | - | - | - |
Compile and match true
Comparing the time and memory performance for compiling and running Corvus.Glob.Match
against the regex and DotNet.Glob
equivalents, for the case where the string and glob pattern do match.
Method | # Of Matches | Pattern | Mean (ns) | Error (ns) | StdDev (ns) | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|
Compiled_Regex_IsMatch | 1 | p?th/(...)].txt [21] | 101,632.25 | 1,067.540 | 891.444 | 1,132.45 | 11.15 | 5.3711 | 2.6855 | 22,656 B |
DotNetGlob_IsMatch | 1 | p?th/(...)].txt [21] | 945.00 | 8.850 | 7.846 | 10.53 | 0.09 | 0.4539 | - | 1,904 B |
CorvusGlob_IsMatch | 1 | p?th/(...)].txt [21] | 89.75 | 0.389 | 0.325 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 1 | p?th/(...)].txt [46] | 110,443.78 | 847.213 | 792.483 | 710.63 | 7.04 | 6.9580 | 3.4180 | 29,591 B |
DotNetGlob_IsMatch | 1 | p?th/(...)].txt [46] | 1,371.05 | 12.836 | 12.007 | 8.84 | 0.08 | 0.5665 | - | 2,376 B |
CorvusGlob_IsMatch | 1 | p?th/(...)].txt [46] | 155.31 | 0.988 | 0.825 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 1 | p?th/a[e-g].txt | 95,855.67 | 525.585 | 465.917 | 1,310.64 | 11.98 | 4.6387 | 2.3193 | 19,799 B |
DotNetGlob_IsMatch | 1 | p?th/a[e-g].txt | 713.67 | 7.240 | 6.773 | 9.76 | 0.13 | 0.3328 | - | 1,392 B |
CorvusGlob_IsMatch | 1 | p?th/a[e-g].txt | 73.10 | 0.600 | 0.561 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [21] | 292,953.87 | 3,174.962 | 2,969.862 | 3.30 | 0.04 | 5.3711 | 2.9297 | 22,648 B |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [21] | 117,235.79 | 1,106.554 | 924.023 | 1.32 | 0.02 | 0.3662 | - | 1,904 B |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [21] | 88,815.76 | 957.988 | 896.103 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [46] | 305,527.82 | 4,855.610 | 3,790.941 | 3.34 | 0.06 | 6.8359 | 3.4180 | 29,582 B |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [46] | 118,614.04 | 1,087.238 | 1,017.003 | 1.29 | 0.01 | 0.4883 | - | 2,376 B |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [46] | 91,627.21 | 884.869 | 827.707 | 1.00 | 0.00 | - | - | - |
Compiled_Regex_IsMatch | 10000 | p?th/a[e-g].txt | 291,808.94 | 4,918.313 | 3,839.895 | 3.15 | 0.06 | 4.3945 | 1.9531 | 19,788 B |
DotNetGlob_IsMatch | 10000 | p?th/a[e-g].txt | 121,114.24 | 1,232.934 | 1,092.964 | 1.31 | 0.03 | 0.2441 | - | 1,392 B |
CorvusGlob_IsMatch | 10000 | p?th/a[e-g].txt | 92,326.72 | 1,832.623 | 1,714.237 | 1.00 | 0.00 | - | - | - |
Match false
Comparing the time performance for running Corvus.Glob.Match
against the regex and DotNet.Glob
equivalents, for the case where the string and glob pattern do not match.
Method | # Of Matches | Pattern | Mean (μs) | Error (μs) | StdDev (μs) | Ratio | RatioSD | Allocated |
---|---|---|---|---|---|---|---|---|
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [50] | 446.86 | 4.783 | 3.994 | 4.74 | 0.06 | - |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [50] | 101.13 | 1.021 | 0.955 | 1.07 | 0.01 | - |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [50] | 94.26 | 1.052 | 0.932 | 1.00 | 0.00 | - |
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [21] | 357.73 | 1.617 | 1.433 | 3.74 | 0.03 | - |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [21] | 120.99 | 1.675 | 1.566 | 1.27 | 0.02 | - |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [21] | 95.53 | 0.648 | 0.606 | 1.00 | 0.00 | - |
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [46] | 386.72 | 4.710 | 4.175 | 3.92 | 0.04 | - |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [46] | 121.05 | 1.141 | 0.952 | 1.23 | 0.01 | - |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [46] | 98.54 | 0.614 | 0.574 | 1.00 | 0.00 | - |
Compiled_Regex_IsMatch | 10000 | p?th/a[e-g].txt | 369.37 | 1.664 | 1.475 | 3.88 | 0.02 | - |
DotNetGlob_IsMatch | 10000 | p?th/a[e-g].txt | 116.95 | 1.689 | 1.659 | 1.23 | 0.02 | - |
CorvusGlob_IsMatch | 10000 | p?th/a[e-g].txt | 95.02 | 0.457 | 0.357 | 1.00 | 0.00 | - |
Match true
Comparing the time performance for running Corvus.Glob.Match
against the regex and DotNet.Glob
equivalents, for the case where the string and glob pattern do match.
Method | # Of Matches | Pattern | Mean (μs) | Error (μs) | StdDev (μs) | Ratio | RatioSD | Allocated |
---|---|---|---|---|---|---|---|---|
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [50] | 450.13 | 2.709 | 2.402 | 4.88 | 0.08 | - |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [50] | 99.80 | 1.215 | 1.136 | 1.08 | 0.02 | - |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [50] | 92.14 | 1.372 | 1.283 | 1.00 | 0.00 | - |
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [21] | 345.40 | 1.602 | 1.499 | 3.78 | 0.03 | - |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [21] | 115.90 | 0.647 | 0.606 | 1.27 | 0.01 | - |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [21] | 91.29 | 0.616 | 0.546 | 1.00 | 0.00 | - |
Compiled_Regex_IsMatch | 10000 | p?th/(...)].txt [46] | 374.15 | 6.377 | 5.965 | 4.06 | 0.09 | - |
DotNetGlob_IsMatch | 10000 | p?th/(...)].txt [46] | 121.76 | 1.353 | 1.200 | 1.32 | 0.02 | - |
CorvusGlob_IsMatch | 10000 | p?th/(...)].txt [46] | 92.13 | 1.152 | 1.078 | 1.00 | 0.00 | - |
Compiled_Regex_IsMatch | 10000 | p?th/a[e-g].txt | 354.97 | 2.833 | 2.511 | 3.88 | 0.04 | - |
DotNetGlob_IsMatch | 10000 | p?th/a[e-g].txt | 115.02 | 1.082 | 1.012 | 1.26 | 0.01 | - |
CorvusGlob_IsMatch | 10000 | p?th/a[e-g].txt | 91.42 | 0.766 | 0.679 | 1.00 | 0.00 | - |
The ratio column in the benchmark results above show that, for all patterns and operations tested, Corvus
was faster than the DotNet.Glob
and regex equivalents.
There's something else worth noting from these results. In the "compile and match" examples (both true and false), Corvus.Globbing is only 3.3x faster than RegEx (and 1.3x faster than DotNetGlob) when matching against 1000 strings, but Corvus.Globbing is over 1000x faster than RegEx (and ~10x faster than DotNetGlob) if we compile and match against a single string. This implies that with RegEx (and, to a lesser extent, DotNetGlob) the costs of compilation/parsing are very high compared to the cost of doing a match against a single string. Therefore, for it to be worth paying the up-front cost of compiling/parsing, you need to be matching against many more strings when using the other libraries compared to Corvus.Globbing.
As you can see, Corvus.Globbing is a useful library, with strong performance characteristics, for working with globs in C#.
If you have any thoughts on better ways to approach particular problems, or related tools you'd like to see, then we'd like to hear them – please comment below, or in the Issues section of the project.