A Regular Expression Engine in C# - CodeProject
A Regular Expression Engine in C# - CodeProject
Introduction
Why even make a regular expression engine when .NET includes a rich regular expressions library?
First of all, there are two major classes of regular expression engine - backtracking and non-backtracking. Most modern regular
expression engines do backtracking. Backtracking engines have greater expressive power than non-backtracking engines, but at the
cost of performance.
Well, for starters they out-perform the backtracking engines on short strings over and over. Secondly, they are more suitable for
operating directly on a streaming source, whereas backtracking parsers generally require the text to be preloaded into memory.
More generally, the style of regular expression engine provided by .NET is geared toward matching patterns in long strings. It's not
as suitable for lexing tokens out of a stream. Lexing is of primary importance to parsers, and so the regular expressions used for
parsing terminals are often implemented using a non-backtracking engine that is geared for lexing.
Background
Lexing is the process of breaking an input stream of characters into "tokens" - strings of characters that have a "symbol" associated
with them. The symbol indicates what type of string it is. For example, the string "124.35" might be reported as the symbol
"NUMBER" whereas the string "foo" might be reported as the symbol "IDENTIFIER"
Parsers typically use lexers underneath, and then compose the incoming symbols into a syntax tree. Because lexers are called in core
parsing code, the lexing operations must be reasonably efficient. The .NET regular expression isn't really suitable here, and while it
can work, it actually increases code complexity while diminishing performance.
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 1/9
27/03/2019 A Regular Expression Engine in C# - CodeProject
Included in this project is a file called "FA.cs" that contains the core code for a regular expression engine using finite automata which
resolves to the humble yet ubiquitous state machine.
Finite state machines are composed of graphs of states. Each state can reference another state using either an "input transition" or
an "epsilon transition". An input transition consumes the specified input before moving to the destination. An epsilon transition
moves to a destination state without consuming or examining any input. A machine with one or more epsilon transitions is
considered "non-deterministic" (NFA) and may be in more than one state at any given time. Non-deterministic state machines are
more complicated and less efficient, but fortunately there is a deterministic state machine (DFA) for any non-deterministic state
machine, and an algorithm to turn an NFA into a DFA using a technique called powerset construction - math that is out of the scope
here to explore.
The closure of a state is the state itself and the unique set of all states reachable from that state, on either an epsilon or input
transition. The closure represents the state machine indicated by the root state. These graphs can be recursive.
The closure of q0 is { q0, q1, q2, q3 } because each of those states are connected to q0 either directly or indirectly.
Each time a black arrow is crossed, the input must match the character(s) above the arrow to proceed along that path. Once a
double circle (q3) is encountered, the machine "accepts" the input as valid/matching and returns the token under the state name, in
this case "foo" is under q3 so it is what will be returned once the input is matched.
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 2/9
27/03/2019 A Regular Expression Engine in C# - CodeProject
Now let's get useful. Lexers (aka tokenizers) distinguish between one of many different patterns of input. Let's build one that
represents an identifier, integer or a space
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 3/9
27/03/2019 A Regular Expression Engine in C# - CodeProject
You'll note that there are two ways to match an int (q2, q4), but they each return the same symbol "int" as a result.
Each of the machines presented was deterministic (each is a DFA). A DFA machine will only ever be in one state at a time. There are
also non-deterministic machines, or NFAs, which can be in several states at the same time!
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 4/9
27/03/2019 A Regular Expression Engine in C# - CodeProject
You'll note this machine has gray dashed lines in it. These are epsilon transitions or transitions on epsilon. Epsilon simply means "no
input" in this context. Every time a machine encounters a dashed arrow it is automatically in the state pointed to by that arrow as
well as itself. It can be said that it transitions on no input.
Starting out in q0 also puts you in q1, q4, q5, q7, q8 and q11 as well! This set of states, that is, the set of states reachable from this
state on epsilon transitions is called the epsilon closure
These NFA machines are easier to compose/build out of several other machines, but harder to run because they're more complex.
As mentioned, there's a DFA machine for every NFA machine, and an algorithm for converting an NFA to a DFA.
This code allows you to painlessly create these machines, to make pretty graphs like the above, (with the help of Graphviz), to
generate super fast code to run a given state machine, to run a state machine at runtime without generating the code first, to
translate regex expressions to machines and back again, and to otherwise examine and compose the machines.
//
// See the included project for a full sample
//
var lexer = FA.Lexer(
// our id from the article
FA.Parse(@"[A-Z_a-z][0-9A-Z_a-z]*", "id"),
// our int from the article
FA.Parse(@"0|([\-]?[1-9][0-9]*)", "int"),
// our space from the article
FA.Parse(@" ", "space")
);
// make sure there are no duplicate states left after the transformation.
// this can be an expensive op and isn't strictly necessary, so it's a
// separate step. It *should* be done before code generation.
// If code generation is done on an NFA, it is automatically transformed to
// a DFA and duplicates are removed during that process
dfaLexer.TrimDuplicates();
Console.WriteLine("Rendering graphs");
try
{
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 5/9
27/03/2019 A Regular Expression Engine in C# - CodeProject
// try to render the graphs (requires GraphViz - specifically dot.exe)
// to change the format, simply use a different extension like PNG or SVG
lexer.RenderToFile(@"..\..\..\lexer_nfa.jpg");
dfaLexer.RenderToFile(@"..\..\..\lexer_dfa.jpg");
}
catch
{
Console.WriteLine("GraphViz is not installed.");
}
// generate the code (can generate from either an NFA or DFA, DFA is slightly faster)
Console.WriteLine("Runtime Lexing:");
// it's easy enough to add code to generate lexers that do not require this class
// but we would lose error handling and position tracking, so it's not implemented.
// to use other interfaces, wrap them with ParseContext.Create()
var pc = ParseContext.Create(test);
try
{
// ParseContext.Current works like a TextReader.Peek() does, returning an int
// without advancing the cursor. Unline Peek() it is not problematic.
while (-1 != pc.Current) // loop until end of input.
{
// runtime lexing (can be NFA or DFA - DFA is intrinsically more efficient)
// if we pass it a StringBuilder, it will reuse it - perf is better,
// but we don't bother here.
var tok = dfaLexer.Lex(pc); // lex the next token
// symbol is in tok.Key, value is in tok.Value
Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
}
}
catch(ExpectingException eex)
{
// We got a lexing error, so report it. The message might be long, but you can always
// build your own using ExpectingException members.
Console.WriteLine("Error: {0}", eex.Message);
}
Console.WriteLine();
Console.WriteLine("Compiled Lexing:");
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 6/9
27/03/2019 A Regular Expression Engine in C# - CodeProject
pc = ParseContext.Create(test); // restart the cursor
pc.EnsureStarted();
try
{
// reusing a stringbuilder allows the lex routine to run a little faster
// don't use it for anything else if you're passing it to lex.
var sb = new StringBuilder(); // don't have to use this, can pass null
As you can see, this is state q1, which corresponds to the last part of our "id" token in the graph above.
You can see the output transitions from this state encoded into the if.
You can see the arrow pointing back to itself using the goto.
You can see that if the if condition isn't satisfied it ends with a succesful result that returns a token.
All of these properties are reflected in the graph above if you look for them.
Points of Interest
Generating regular expressions - textual representation - from an FA is non-trivial. Parsing them from a textual representation is
trivial. This is almost exactly the opposite of how syntax trees usually are in terms of coding them.
The generated DFAs are an order of magnitude faster in release builds than debug builds. I still don't know why. Regardless, they're
still the fastest option in any build configuration.
It took way too many iterations of this design over the years to get one I was happy with that handled general use cases well,
including error handling. I finally got it fast, consistent, and easy in the right combinations.
There are improvements to be made, most notably including the ^ and . operators. Doing that efficiently almost requires it's own
article.
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 7/9
27/03/2019 A Regular Expression Engine in C# - CodeProject
History
3-26-2019 Initial release
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
Share
Combined Regular Expression Search DEELX - Regular Expression Engine for C++
Search Comments
random feedback
Super Lloyd 4hrs 6mins ago
Refresh 1
General News Suggestion Question Bug Answer Joke Praise Rant Admin
Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 9/9