Skip to content
Liam Mooney By Liam Mooney Apprentice Engineer II
An Overview of the Corvus.Globbing Library

TLDR; In this post we look at Corvus.Globbing, a C# library for working with globs. We show how to get started with the library, some examples demonstrating how to use it, and performance benchmarks illustrating its low allocation properties and superior performance over other similar libraries.

Introduction

The Corvus GitHub Organisation contains a series of open-source repositories that provide tools and support for common requirements such as tenancy storage, authentication and more. Corvus encapsulate endjin's recommended practices for building .NET applications. These practices are founded in the knowledge gained through over a decade of real world experience building .NET applications for our customers.

Each blog post in this series will focus on a single Corvus repository, outlining the repository's intended use and how to get started.

Overview of Corvus.Globbing

In this blog post Corvus.Globbing is under the spotlight. Corvus.Globbing is a library for working with globs, it was built to be zero allocating with performance comparable to (or better than) DotNet.Glob and raw Regular Expressions, when running under dotnet 6.0.

About Globbing

Globbing is the process of using wildcard characters for defining patterns to match file names and directories against partial names. For example, in a bash shell, if you were to run ls *.txt, it would return all files inside the current working directory with a .txt file extension. The text *.txt is the glob. The wildcard character, *, essentially says "match against anything here" – so, find every file inside this directory whose name ends with the sequence of characters: '.txt'.

As another example, If you were to run ls hello*.txt, it would match all files beginning with 'hello' and ending with '.txt', meaning this glob would match against a file named 'hello-world.txt', for example. Alternatively, ls *hello*.txt, matches all files with names that contain the word 'hello' and end with '.txt'. In all of these cases the wildcard character, *, says "match against anything here". The * is the most common, but not the only, wildcard character; Corvus.Globbing supports others, as you'll see in the repo's README.

Internally, the bash shell has to parse the glob, then do some searching and comparing to find the file names that match the pattern provided. This is what Corvus.Globbing offers – the ability to match a glob pattern against file paths and names. Specifically, it allows you to say "does this glob match this file name/path" through the Match() method, as you'll see shortly.

Use Cases

The particular use case we have optimized for is a case-sensitive glob that will be matched against a large number of paths (100+), but will not necessarily be cachable.

We want a very high-performance parse of that glob, and then a performant application of the glob to a large number of candidate paths.

Our motivation for this came when "link stripping" documents to be returned from an HTTP request, to remove links that the requesting identity is not permitted to see (perhaps for security, feature enablement, or just local context). To make this process high performance we needed to minimise allocations on the hot-path of the request handler, the library has therefore been designed to be low-allocating.

Corvus.Globbing is built for net6.0.

Performance

We offer better raw matching performance (~10-30%) against a pre-tokenized glob pattern, for the StringComparison.Ordinal (case sensitive) default, than Dotnet.Glob.

Our tokenization is also ~10x faster than Dotnet.Glob so it is significantly faster in the single use/throwaway case.

See detailed performance benchmarks at the end of this post.

How to get started

Corvus.Globbing is available on NuGet. To get started with Corvus.Globbing, add a reference to the Corvus.Globbing NuGet package in your project, by running the following command:

dotnet add package Corvus.Globbing

How to use

Within the Corvus.Globbing repository, there is a Polyglot Notebook (under the folder Corvus.Globbing.Documentation.Examples) that provides examples on how to use the library. You can pull down the repo and run the notebooks locally yourself.

Examples

Basic use

The simplest use case just requires you to pass the glob and candidate path to the Match() method to determine if the glob and path match.

bool isMatch = Glob.Match("path/*atstand", "path/fooatstand"); // Returns: True

You can provide all the standard string-comparison types to the comparisonType parameter to customise the comparison behavior. The default is StringComparison.Ordinal (i.e. case sensitive), but you could do a case insensitive match:

bool isMatch = Glob.Match("path/*atstand", "PATH/fooatstand", StringComparison.OrdinalIgnoreCase); // returns: True

Efficient matching of a glob against multiple paths

If you want to match a glob against a number of paths, you want to avoid re-tokenizing the glob for each comparison.

For low memory allocation requirements and increased performance you can also stack allocate a tokenized glob array and reuse it:

string pattern = "path/*atstand";

// There can't be more tokens than there are characters in the glob pattern, so we allocate an array at least that long.
Span<GlobToken> tokenizedGlob = stackalloc GlobToken[pattern.Length];
int tokenCount = Corvus.Globbing.GlobTokenizer.Tokenize(pattern, ref tokenizedGlob);
// And then slice off the number of tokens we actually used
ReadOnlySpan<GlobToken> glob = tokenizedGlob[..tokenCount];

bool firstMatch = Glob.Match(pattern, glob, "path/fooatstand"); // Returns: True
bool secondMatch = Glob.Match(pattern, glob, "badpath/fooatstand"); // Returns: False

Efficient matching of a large glob against multiple paths

For very long potential globs you may not want to allocate them on the stack, instead you could fall back to the ArrayPool allocation technique.

The following example is useful if you don't necessarily know how long the glob is (e.g. if it's a command line argument, or an input to a web API). The code decides to use stackalloc if the provided pattern is less than a specified length, and the ArrayPool allocation technique when it is not.

int MaxGlobTokenArrayLength = 1024;

// There can't be more tokens than there are characters in the glob.
GlobToken[]? globTokenArray = null;
Span<GlobToken> globTokens = pattern.Length > MaxGlobTokenArrayLength ? stackalloc GlobToken[0] : stackalloc GlobToken[pattern.Length];

if (pattern.Length > MaxGlobTokenArrayLength)
{
    globTokenArray = ArrayPool<GlobToken>.Shared.Rent(pattern.Length);
    globTokens = globTokenArray.AsSpan();
}

try
{
    int tokenCount = GlobTokenizer.Tokenize(pattern, ref globTokens);
    ReadOnlySpan<GlobToken> tokenizedGlob = globTokens[..tokenCount];

    // Do your matching here...
    bool firstMatch = Glob.Match(pattern, tokenizedGlob, "path/fooatstand");
    Console.WriteLine($"Result of first match:\t{firstMatch}");

    bool secondMatch = Glob.Match(pattern, tokenizedGlob, "badpath/fooatstand");
    Console.WriteLine($"Result of second match:\t{secondMatch}");
}
finally
{
    if (pattern.Length > MaxGlobTokenArrayLength)
    {
        ArrayPool<GlobToken>.Shared.Return(globTokenArray);
    }
}

Benchmarks

We have used Benchmark Dotnet to compare the performance with raw RegEx and DotNet.Glob. This represents a current example run.

The numbers in square brackets in the cells under the Pattern column in the tables below are conveying the number of characters in the glob pattern used in that test run.

Compile

Here we're comparing the time and memory performance for compiling code that tokenizes a glob pattern.

Method Pattern Mean (ns) Error (ns) StdDev (ns) Ratio RatioSD Gen 0 Gen 1 Allocated
New_Compiled_Regex_Glob p?th/(...)].txt [50] 24,518.93 446.904 418.035 145.92 2.71 5.0659 - 21,272 B
New_DotNet_Glob p?th/(...)].txt [50] 1,614.43 24.121 22.563 9.61 0.18 0.6733 - 2,816 B
New_Corvus_Glob p?th/(...)].txt [50] 168.04 1.866 1.746 1.00 0.00 - - -
New_Compiled_Regex_Glob p?th/(...)].txt [21] 13,920.81 150.097 133.057 159.84 2.16 3.4332 - 14,392 B
New_DotNet_Glob p?th/(...)].txt [21] 900.49 12.604 11.790 10.34 0.17 0.4549 - 1,904 B
New_Corvus_Glob p?th/(...)].txt [21] 87.10 0.790 0.700 1.00 0.00 - - -
New_Compiled_Regex_Glob p?th/(...)].txt [46] 17,286.88 183.742 171.873 112.94 1.72 4.1504 - 17,432 B
New_DotNet_Glob p?th/(...)].txt [46] 1,375.73 10.866 10.165 8.99 0.11 0.5665 - 2,376 B
New_Corvus_Glob p?th/(...)].txt [46] 153.08 1.539 1.439 1.00 0.00 - - -
New_Compiled_Regex_Glob p?th/a[e-g].txt 12,350.67 206.552 193.209 188.56 6.84 3.1433 0.0153 13,168 B
New_DotNet_Glob p?th/a[e-g].txt 726.14 8.640 7.659 11.08 0.27 0.3328 - 1,392 B
New_Corvus_Glob p?th/a[e-g].txt 65.32 1.307 1.605 1.00 0.00 - - -

Compile and match false

Comparing the time and memory performance for compiling and running Corvus.Glob.Match against the regex and DotNet.Glob equivalents, for the case where the string and glob pattern do not match.

Method # Of Matches Pattern Mean (ns) Error (ns) StdDev (ns) Ratio RatioSD Gen 0 Gen 1 Allocated
Compiled_Regex_IsMatch 1 p?th/(...)].txt [21] 102,434.95 1,962.576 3,734.006 1,129.38 51.15 5.3711 2.6855 22,656 B
DotNetGlob_IsMatch 1 p?th/(...)].txt [21] 954.59 17.737 16.591 10.41 0.27 0.4549 - 1,904 B
CorvusGlob_IsMatch 1 p?th/(...)].txt [21] 91.73 1.822 1.521 1.00 0.00 - - -
Compiled_Regex_IsMatch 1 p?th/(...)].txt [46] 111,423.68 447.546 396.738 703.52 13.52 6.9580 3.4180 29,591 B
DotNetGlob_IsMatch 1 p?th/(...)].txt [46] 1,404.32 21.916 20.500 8.86 0.24 0.5665 - 2,376 B
CorvusGlob_IsMatch 1 p?th/(...)].txt [46] 158.77 3.049 2.994 1.00 0.00 - - -
Compiled_Regex_IsMatch 1 p?th/a[e-g].txt 95,250.25 1,534.617 1,360.398 1,310.39 24.75 4.6387 2.3193 19,799 B
DotNetGlob_IsMatch 1 p?th/a[e-g].txt 733.96 9.342 8.738 10.11 0.12 0.3328 - 1,392 B
CorvusGlob_IsMatch 1 p?th/a[e-g].txt 72.70 0.659 0.584 1.00 0.00 - - -
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [21] 294,937.84 3,379.605 2,995.932 3.30 0.04 5.3711 2.9297 22,648 B
DotNetGlob_IsMatch 10000 p?th/(...)].txt [21] 115,916.75 1,218.873 1,080.499 1.30 0.02 0.3662 - 1,904 B
CorvusGlob_IsMatch 10000 p?th/(...)].txt [21] 89,329.52 829.171 775.607 1.00 0.00 - - -
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [46] 306,961.22 4,678.470 4,376.244 3.34 0.07 6.8359 3.4180 29,582 B
DotNetGlob_IsMatch 10000 p?th/(...)].txt [46] 119,904.90 1,252.195 1,171.304 1.30 0.02 0.4883 - 2,376 B
CorvusGlob_IsMatch 10000 p?th/(...)].txt [46] 91,902.09 1,427.773 1,335.540 1.00 0.00 - - -
Compiled_Regex_IsMatch 10000 p?th/a[e-g].txt 289,016.17 4,305.232 4,027.117 3.16 0.04 4.3945 1.9531 19,788 B
DotNetGlob_IsMatch 10000 p?th/a[e-g].txt 120,199.71 1,357.718 1,270.010 1.31 0.02 0.2441 - 1,392 B
CorvusGlob_IsMatch 10000 p?th/a[e-g].txt 91,516.00 768.001 641.316 1.00 0.00 - - -

Compile and match true

Comparing the time and memory performance for compiling and running Corvus.Glob.Match against the regex and DotNet.Glob equivalents, for the case where the string and glob pattern do match.

Method # Of Matches Pattern Mean (ns) Error (ns) StdDev (ns) Ratio RatioSD Gen 0 Gen 1 Allocated
Compiled_Regex_IsMatch 1 p?th/(...)].txt [21] 101,632.25 1,067.540 891.444 1,132.45 11.15 5.3711 2.6855 22,656 B
DotNetGlob_IsMatch 1 p?th/(...)].txt [21] 945.00 8.850 7.846 10.53 0.09 0.4539 - 1,904 B
CorvusGlob_IsMatch 1 p?th/(...)].txt [21] 89.75 0.389 0.325 1.00 0.00 - - -
Compiled_Regex_IsMatch 1 p?th/(...)].txt [46] 110,443.78 847.213 792.483 710.63 7.04 6.9580 3.4180 29,591 B
DotNetGlob_IsMatch 1 p?th/(...)].txt [46] 1,371.05 12.836 12.007 8.84 0.08 0.5665 - 2,376 B
CorvusGlob_IsMatch 1 p?th/(...)].txt [46] 155.31 0.988 0.825 1.00 0.00 - - -
Compiled_Regex_IsMatch 1 p?th/a[e-g].txt 95,855.67 525.585 465.917 1,310.64 11.98 4.6387 2.3193 19,799 B
DotNetGlob_IsMatch 1 p?th/a[e-g].txt 713.67 7.240 6.773 9.76 0.13 0.3328 - 1,392 B
CorvusGlob_IsMatch 1 p?th/a[e-g].txt 73.10 0.600 0.561 1.00 0.00 - - -
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [21] 292,953.87 3,174.962 2,969.862 3.30 0.04 5.3711 2.9297 22,648 B
DotNetGlob_IsMatch 10000 p?th/(...)].txt [21] 117,235.79 1,106.554 924.023 1.32 0.02 0.3662 - 1,904 B
CorvusGlob_IsMatch 10000 p?th/(...)].txt [21] 88,815.76 957.988 896.103 1.00 0.00 - - -
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [46] 305,527.82 4,855.610 3,790.941 3.34 0.06 6.8359 3.4180 29,582 B
DotNetGlob_IsMatch 10000 p?th/(...)].txt [46] 118,614.04 1,087.238 1,017.003 1.29 0.01 0.4883 - 2,376 B
CorvusGlob_IsMatch 10000 p?th/(...)].txt [46] 91,627.21 884.869 827.707 1.00 0.00 - - -
Compiled_Regex_IsMatch 10000 p?th/a[e-g].txt 291,808.94 4,918.313 3,839.895 3.15 0.06 4.3945 1.9531 19,788 B
DotNetGlob_IsMatch 10000 p?th/a[e-g].txt 121,114.24 1,232.934 1,092.964 1.31 0.03 0.2441 - 1,392 B
CorvusGlob_IsMatch 10000 p?th/a[e-g].txt 92,326.72 1,832.623 1,714.237 1.00 0.00 - - -

Match false

Comparing the time performance for running Corvus.Glob.Match against the regex and DotNet.Glob equivalents, for the case where the string and glob pattern do not match.

Method # Of Matches Pattern Mean (μs) Error (μs) StdDev (μs) Ratio RatioSD Allocated
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [50] 446.86 4.783 3.994 4.74 0.06 -
DotNetGlob_IsMatch 10000 p?th/(...)].txt [50] 101.13 1.021 0.955 1.07 0.01 -
CorvusGlob_IsMatch 10000 p?th/(...)].txt [50] 94.26 1.052 0.932 1.00 0.00 -
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [21] 357.73 1.617 1.433 3.74 0.03 -
DotNetGlob_IsMatch 10000 p?th/(...)].txt [21] 120.99 1.675 1.566 1.27 0.02 -
CorvusGlob_IsMatch 10000 p?th/(...)].txt [21] 95.53 0.648 0.606 1.00 0.00 -
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [46] 386.72 4.710 4.175 3.92 0.04 -
DotNetGlob_IsMatch 10000 p?th/(...)].txt [46] 121.05 1.141 0.952 1.23 0.01 -
CorvusGlob_IsMatch 10000 p?th/(...)].txt [46] 98.54 0.614 0.574 1.00 0.00 -
Compiled_Regex_IsMatch 10000 p?th/a[e-g].txt 369.37 1.664 1.475 3.88 0.02 -
DotNetGlob_IsMatch 10000 p?th/a[e-g].txt 116.95 1.689 1.659 1.23 0.02 -
CorvusGlob_IsMatch 10000 p?th/a[e-g].txt 95.02 0.457 0.357 1.00 0.00 -

Match true

Comparing the time performance for running Corvus.Glob.Match against the regex and DotNet.Glob equivalents, for the case where the string and glob pattern do match.

Method # Of Matches Pattern Mean (μs) Error (μs) StdDev (μs) Ratio RatioSD Allocated
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [50] 450.13 2.709 2.402 4.88 0.08 -
DotNetGlob_IsMatch 10000 p?th/(...)].txt [50] 99.80 1.215 1.136 1.08 0.02 -
CorvusGlob_IsMatch 10000 p?th/(...)].txt [50] 92.14 1.372 1.283 1.00 0.00 -
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [21] 345.40 1.602 1.499 3.78 0.03 -
DotNetGlob_IsMatch 10000 p?th/(...)].txt [21] 115.90 0.647 0.606 1.27 0.01 -
CorvusGlob_IsMatch 10000 p?th/(...)].txt [21] 91.29 0.616 0.546 1.00 0.00 -
Compiled_Regex_IsMatch 10000 p?th/(...)].txt [46] 374.15 6.377 5.965 4.06 0.09 -
DotNetGlob_IsMatch 10000 p?th/(...)].txt [46] 121.76 1.353 1.200 1.32 0.02 -
CorvusGlob_IsMatch 10000 p?th/(...)].txt [46] 92.13 1.152 1.078 1.00 0.00 -
Compiled_Regex_IsMatch 10000 p?th/a[e-g].txt 354.97 2.833 2.511 3.88 0.04 -
DotNetGlob_IsMatch 10000 p?th/a[e-g].txt 115.02 1.082 1.012 1.26 0.01 -
CorvusGlob_IsMatch 10000 p?th/a[e-g].txt 91.42 0.766 0.679 1.00 0.00 -

The ratio column in the benchmark results above show that, for all patterns and operations tested, Corvus was faster than the DotNet.Glob and regex equivalents.

There's something else worth noting from these results. In the "compile and match" examples (both true and false), Corvus.Globbing is only 3.3x faster than RegEx (and 1.3x faster than DotNetGlob) when matching against 1000 strings, but Corvus.Globbing is over 1000x faster than RegEx (and ~10x faster than DotNetGlob) if we compile and match against a single string. This implies that with RegEx (and, to a lesser extent, DotNetGlob) the costs of compilation/parsing are very high compared to the cost of doing a match against a single string. Therefore, for it to be worth paying the up-front cost of compiling/parsing, you need to be matching against many more strings when using the other libraries compared to Corvus.Globbing.

As you can see, Corvus.Globbing is a useful library, with strong performance characteristics, for working with globs in C#.

If you have any thoughts on better ways to approach particular problems, or related tools you'd like to see, then we'd like to hear them – please comment below, or in the Issues section of the project.

Liam Mooney

Apprentice Engineer II

Liam Mooney

Liam studied an MSci in Physics at University College London, which included modules on Statistical Data Analysis, High Performance Computing, Practical Physics and Computing. This led to his dissertation exploring the use of machine learning techniques for analysing LHC particle collision data.

Before joining endjin, Liam had a keen interest in data science and engineering, and did a number of related internships. However, since joining endjin he has developed a much broader set of interest, including DevOps and more general software engineering. He is currently exploring those interests and finding his feet in the tech space.