Chapter 24
Strings and Regular expressions
1. Introduction
In this chapter we focus on character strings. We build further on the content of the section on strings
in Chapter 10 and you must be fully familiar with that content before we can proceed. We discuss the
StringBuilder class as a container for mutable strings and then touch on the basics of regular
expressions.
2. Strings
A character string is a data structure that contains characters. The most important thing to remember
about character strings is that a string can be treated as a read-only array of characters. This means
that we can index and read individual characters in a string in the same way that we can access
elements in any other array such as int[], although we cannot assign a new value to a character
through its index. You should be fully conversant with the content of the section on strings in Chapter
10. Make sure that you understand how to use the Compare method and that you are familiar with the
methods listed on page 180. Note also that the static Join method can be used to create a
comma-separated string from an array or list of string objects.
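A minimal sketch of Join follows. The sample data and variable names are made up for illustration, and the statements are written at top level (C# 9 or later); in an older project they belong inside Main.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample data; any array or list of strings will do.
string[] colours = { "red", "green", "blue" };
List<string> names = new List<string> { "Ann", "Ben", "Cat" };

// Join places the separator between the elements, not after the last one.
string colourList = string.Join(", ", colours);
string nameList = string.Join(", ", names);

Console.WriteLine(colourList);   // red, green, blue
Console.WriteLine(nameList);     // Ann, Ben, Cat
```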
The example on csharp.pl3.co.za contains several examples of the usage of character strings. Study
and make sure that you understand everything. Make sure that you understand how to use
IntelliSense to determine the parameter and return types of the available built-in methods of the
string class.
String is a reference type and should be treated the same as other reference types with the
exception that we do not have to use new to instantiate it.
There is a fundamental difference between a null string and an empty string. A null string refers
to no object at all, while an empty string is a string object that contains no characters.
Strings are immutable. That means that we cannot change the contents of a string object once it
has been created. A sequence of statements such as
string s = "a";
s += "b";
creates a new string object with the value "ab" and lets s refer to it; the original "a" is not
overwritten. You should use StringBuilder if you want a mutable string. If you foresee many
changes to a string, StringBuilder will be much more efficient.
As with other reference types, when we pass a string parameter, the reference is passed by value.
This means that the method receives its own copy of the reference. Because strings are immutable,
a change to the parameter inside the method creates a new string object and leaves the caller's
variable untouched, so the two variables behave independently of each other.
See also
• https://stackoverflow.com/questions/1096449/c-sharp-string-reference-type
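The effect of passing a string to a method can be seen in the sketch below, which contrasts it with a StringBuilder. The two helper methods are hypothetical and are written as local functions at top level (C# 9 or later); in an older project they would be ordinary static methods called from Main.

```csharp
using System;
using System.Text;

// Hypothetical helper: tries to change the string that it receives.
void ChangeString(string text)
{
    // The parameter now refers to a new string object;
    // the caller's variable still refers to the old one.
    text += " world";
}

// Hypothetical helper: changes the StringBuilder that it receives.
void ChangeBuilder(StringBuilder builder)
{
    // The copied reference still points to the caller's object,
    // so the change is visible to the caller.
    builder.Append(" world");
}

string s = "hello";
StringBuilder sb = new StringBuilder("hello");

ChangeString(s);
ChangeBuilder(sb);

Console.WriteLine(s);               // hello
Console.WriteLine(sb.ToString());   // hello world
```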
Strings in the string class are immutable. This means that the contents of a memory cell are fixed and
cannot be changed. A statement such as
string s = "abc";
s += "d";
effectively means that the variable name s points to a new memory cell with the contents "abcd".
The old s is left in memory until the garbage collector cleans it up. Because every change creates a
new string object, repeatedly appending to or changing a string is slow.
The StringBuilder class, on the other hand, is capable of amending the contents of a memory cell
in-place. This means that a call such as sb.Append("d") on a StringBuilder object sb
changes the contents of sb without creating a new memory cell. Note that although the
Append method returns an instance of StringBuilder, it is not necessary to do this:
sb = sb.Append("d");
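As a small illustration, the sketch below appends to a StringBuilder and checks that Append hands back the very same object. The initial value "abc" and the variable names are assumptions; the statements are written at top level (C# 9 or later) and belong inside Main in an older project.

```csharp
using System;
using System.Text;

StringBuilder sb = new StringBuilder("abc");

// Append changes sb in place and returns the same instance.
StringBuilder same = sb.Append("d");

Console.WriteLine(sb.ToString());                      // abcd
Console.WriteLine(object.ReferenceEquals(sb, same));   // True

// Because Append returns the instance, calls can be chained.
sb.Append("e").Append("f");
Console.WriteLine(sb.ToString());                      // abcdef
```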
A StringBuilder object maintains a buffer to accommodate expansions to the string. New data is
appended to the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the
original buffer is copied to the new buffer, and the new data is then appended to the new buffer.
For small or infrequent changes to a character string, the overheads that are involved with
StringBuilder do not make it worthwhile. However, where frequent changes to the same
character string are expected, a StringBuilder object can save a substantial amount of CPU
time.
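A rough way to see the difference is to time both approaches with a Stopwatch. The sketch below is not the Timing method from the course examples; the number of repetitions is an assumption and the measured times will differ from machine to machine. The statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Diagnostics;
using System.Text;

const int Repeats = 20000;   // assumed number of appends; adjust to taste

// Repeated concatenation: every += creates a new string object.
Stopwatch watch = Stopwatch.StartNew();
string s = "";
for (int i = 0; i < Repeats; i++)
    s += "x";
watch.Stop();
long stringMs = watch.ElapsedMilliseconds;

// StringBuilder: the characters are appended to an internal buffer.
watch.Restart();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < Repeats; i++)
    sb.Append("x");
string built = sb.ToString();
watch.Stop();
long builderMs = watch.ElapsedMilliseconds;

Console.WriteLine("string:        " + stringMs + " ms");
Console.WriteLine("StringBuilder: " + builderMs + " ms");
```

Both loops produce the same text; only the time taken to build it differs.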
Study the examples on csharp.pl3.co.za to become familiar with the StringBuilder versions of
Length, Insert, Remove, Replace, etc. Also look at the method Timing to see the difference in time
between the append procedure of the two classes.
Also read this and pay specific attention to the Capacity property:
https://docs.microsoft.com/en-us/dotnet/api/system.text.stringbuilder
3. Regular expressions
Mostly, we can get away with the methods and properties that the string and StringBuilder classes
provide, but there are cases when we need more power. It is, for example, easy to find the domain
name in an email address with the IndexOf and Substring methods of the string class, but it is not
so easy to determine if a given email address is valid.
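Finding the domain name with IndexOf and Substring could look like the sketch below. The address is a made-up example; the statements are written at top level (C# 9 or later) and belong inside Main in an older project.

```csharp
using System;

string email = "student@example.com";   // hypothetical address

int at = email.IndexOf('@');             // position of the @ sign, or -1 if absent
string domain = at >= 0 ? email.Substring(at + 1) : "";

Console.WriteLine(domain);   // example.com
```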
The Regex class in C# provides an interface for the Regular Expression (RE) language for character
strings. Regex is huge and we are just going to touch on some of the basic elements to make you
aware of the existence and power of regular expressions. The examples below are also available in
Listing 24.3.
There is nothing here that cannot be done with IndexOf in the string class, but just to get the ball
rolling, consider the following code fragment. We instantiate a Regex object and then use its Match
method to return a Match object. This object has properties to indicate whether a match was found
(Success), its Index and Value. Use IntelliSense to discover the other possible methods and
properties. The same can be achieved through static methods of the Regex class.
string str1 = "the quick brown fox jumped over the lazy dog";
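A minimal sketch of the steps described above follows. The search word "fox" and the variable names other than str1 are assumptions, since only the first line of the fragment is shown. The statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string str1 = "the quick brown fox jumped over the lazy dog";

// Instance approach: create a Regex object and ask it for the first match.
Regex regex = new Regex("fox");
Match match = regex.Match(str1);
if (match.Success)
    Console.WriteLine("Found '" + match.Value + "' at index " + match.Index);   // Found 'fox' at index 16

// Static approach: the same search without creating a Regex object.
Match staticMatch = Regex.Match(str1, "fox");
Console.WriteLine(staticMatch.Index == str1.IndexOf("fox"));   // True
```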
Finding all occurrences of a word is not impossible, but it is quite tricky with the existing string
methods and properties. With Regex, it is easy to find all occurrences of the word "the" in a given string.
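One way to do this is with the static Matches method, which returns a collection with one Match object per occurrence. This is a sketch using the sentence from the earlier fragment; the statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string str1 = "the quick brown fox jumped over the lazy dog";

// Matches returns every occurrence, not only the first one.
MatchCollection matches = Regex.Matches(str1, "the");
foreach (Match m in matches)
    Console.WriteLine("'" + m.Value + "' at index " + m.Index);
// 'the' at index 0
// 'the' at index 32
```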
Replacing one substring with another is easy with the existing Replace method in the string class:
string s = "the quick brown fox jumped over the brown dog";
s = s.Replace("brown", "black");
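For comparison, the Regex class has a static Replace method that takes the input string, a pattern and the replacement text. The sketch below shows that, for a plain word, it gives the same result as the string version; the variable names are assumptions and the statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string s = "the quick brown fox jumped over the brown dog";

string viaString = s.Replace("brown", "black");
string viaRegex = Regex.Replace(s, "brown", "black");

Console.WriteLine(viaRegex);                // the quick black fox jumped over the black dog
Console.WriteLine(viaString == viaRegex);   // True
```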
All the examples below are contained in the Quantifiers method in Listing 24.3. I advise you
strongly to follow the discussion below while running the example. Comment out all lines of code
except for the one that you are inspecting at that moment and make sure that you understand why you
get the specific output.
Consider the following array of words that will serve as basis for the examples:
string[] words = new string[] {"abdomen", "bad", "baad", "baaad", "life", "lobby",
"boy", "bear", "bend", "bobby", "lend", "death"};
In Regex terminology, we refer to a substring in a larger string as a pattern. If we want to find all
words in the above list that contain the pattern "bd", we can do this:
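A minimal sketch of such a search, using the static IsMatch method in a loop, is given here. The list named found is an assumption; the statements are written at top level (C# 9 or later) and belong inside a method in an older project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string[] words = new string[] {"abdomen", "bad", "baad", "baaad", "life", "lobby",
                               "boy", "bear", "bend", "bobby", "lend", "death"};

List<string> found = new List<string>();
foreach (string word in words)
    if (Regex.IsMatch(word, "bd"))   // true if the pattern occurs anywhere in the word
        found.Add(word);

Console.WriteLine(string.Join(", ", found));   // abdomen
```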
If we want to find all words with the same pattern, but with any character between the "b" and the
"d", we can write "b.d" and do this
We refer to the period (".") as a wildcard that stands in the place of any single character. If the pattern is
"b..d", we will find "baad" and "bend". If the pattern is "b." (nothing following the period), we
will find "abdomen", "bad", "baad", "baaad", "lobby", "boy", "bear", "bend" and
"bobby".
Instead of writing "bb", we can write "b{2}". The number between curly braces is a quantifier. This
will find "lobby" and "bobby". What will we get if the pattern is "b.{2}d"? If we write "ba{1,2}d",
we find "bad" and "baad", but not "baaad". In general, {m,n} refers to the minimum and maximum
number of occurrences of the previous character.
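The wildcard and quantifier patterns above can be tried out with a small helper. Find is a hypothetical local function, not part of Listing 24.3, and it uses LINQ to keep the sketch short; the statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] words = new string[] {"abdomen", "bad", "baad", "baaad", "life", "lobby",
                               "boy", "bear", "bend", "bobby", "lend", "death"};

// Hypothetical helper: lists the words that contain a match for the pattern.
string Find(string pattern) =>
    string.Join(", ", words.Where(word => Regex.IsMatch(word, pattern)));

Console.WriteLine(Find("b.d"));        // bad
Console.WriteLine(Find("b..d"));       // baad, bend
Console.WriteLine(Find("b{2}"));       // lobby, bobby
Console.WriteLine(Find("ba{1,2}d"));   // bad, baad
```

Calling Find("b.{2}d") is a quick way to check your answer to the question above.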
The "+" symbol applies to the character preceding it and finds matches that contain one or
more occurrences of that character. The "*" symbol will find matches that contain zero or more
occurrences of the preceding character. For example, "b+d" will find "abdomen", while "b*d" will find
all words containing "d". The "?" symbol will find matches that contain zero or one occurrence of
the preceding character.
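The three symbols can be compared on the same word list. In the sketch below, Find is a hypothetical helper (a local function using LINQ), and the patterns with the letter "a" are extra examples that are not taken from the text; the statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] words = new string[] {"abdomen", "bad", "baad", "baaad", "life", "lobby",
                               "boy", "bear", "bend", "bobby", "lend", "death"};

// Hypothetical helper: lists the words that contain a match for the pattern.
string Find(string pattern) =>
    string.Join(", ", words.Where(word => Regex.IsMatch(word, pattern)));

Console.WriteLine(Find("b+d"));    // abdomen
Console.WriteLine(Find("b*d"));    // abdomen, bad, baad, baaad, bend, lend, death
Console.WriteLine(Find("ba+d"));   // bad, baad, baaad         (one or more "a")
Console.WriteLine(Find("ba*d"));   // abdomen, bad, baad, baaad (zero or more "a")
Console.WriteLine(Find("ba?d"));   // abdomen, bad              (zero or one "a")
```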
If the pattern is "<.>", it means that we want to find "<" followed by any single character followed
by ">". This will then find "<b>" in the above string.
The search pattern "<.+>" means that we want to find "<" followed by any character one or more
times followed by ">". There are two possibilities, namely "<b>" and "<b>string</b>". By default
a search is greedy, which means that the regular expression will match as many characters as
possible, thus "<b>string</b>". A lazy search will stop at the first fulfilment of the pattern, thus
"<b>". A lazy search is defined by the symbol "?" after the quantifier and thus the pattern must be "<.+?>".
If we write the code as below, there is a loop that will find all matches of a lazy search, resulting in
both "<b>" and "</b>".
string pattern = "<.+?>"; // '<' followed by any character one or more times lazy
MatchCollection matches = Regex.Matches(words, pattern);
for (int i = 0; i < matches.Count; i++)
Console.Write(matches[i].Value + ", ");
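To see the single-character, greedy and lazy patterns side by side, the sketch below applies them to an assumed sample string. The variable text and its contents are hypothetical, chosen only so that it contains "<b>string</b>"; the statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string text = "This is a <b>string</b> with a tag";   // assumed sample text

Console.WriteLine(Regex.Match(text, "<.>").Value);     // <b>
Console.WriteLine(Regex.Match(text, "<.+>").Value);    // <b>string</b>   (greedy)
Console.WriteLine(Regex.Match(text, "<.+?>").Value);   // <b>             (lazy)

// Matches continues after each lazy match, so both tags are found.
MatchCollection tags = Regex.Matches(text, "<.+?>");
Console.WriteLine(tags.Count);                         // 2
```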
Copyright: PJ Blignaut, 2020
You would have noticed that we have special characters in the expressions, namely ".", "+", "*", "?",
"(", ")", "{", "}", "[", "]". If we want to include those characters in the search string, we have to
precede them with "\". If we want to include "\" in the pattern, we have to write "\\". For example,
if we want to find a period in the expression, we have to write "\.". Since a "\" has a special meaning
in a C# string, we have to write @"\." (or "\\.") to find periods.
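The difference between the escaped and the unescaped period can be sketched as follows. The sample text is an assumption, and Regex.Escape is an additional helper of the Regex class that is not discussed in the text; the statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string text = "www.example.com";   // assumed sample text

// Unescaped, the period is a wildcard and matches every character.
Console.WriteLine(Regex.Matches(text, ".").Count);     // 15

// Escaped, it matches literal periods only.
Console.WriteLine(Regex.Matches(text, @"\.").Count);   // 2

// Without the @ prefix, the backslash must be doubled in the C# string.
Console.WriteLine(Regex.Matches(text, "\\.").Count);   // 2

// Regex.Escape adds the backslashes to special characters for us.
string escaped = Regex.Escape("1+1=2?");
Console.WriteLine(escaped);                            // 1\+1=2\?
```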
In the examples above, we had to specify specific characters in the search pattern. If we want to find
a specific category (or class) of characters, we need a character class. Character classes are written
between [ ]. Don't confuse the meaning of the word "class" in this context with its normal meaning
in object-oriented programming.
As an example, consider the pattern "[\w]" which will find all letters, digits and underscores.
Characters between "[" and "]" are treated as special characters. "\" precedes a character class and
should not be confused with the meaning of "\" outside of [ ] (cf Section 4.6 above).
Consider the following string on which the subsequent examples are based. For easy reference to
indexes within the string, two indexing strings are also provided:
"          1         2         3         4         5         6         7         8         9  "
"012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012"
"Note: The -8- years old quick brown_fox, with 4 toes, jumped over the (12 year) old lazy dog."
The table below lists some character classes along with typical patterns and the matches in the above
example string. You should run the CharacterClasses method in Listing 24.3 in conjunction with
the table entries. Comment out all the re variables except for the one that you want to inspect.
For the sake of readability, the @ and quotes are omitted in the table, but they should always be there,
for example [\w] must be written as @"[\w]".
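A few character classes can be tried on the example string with the sketch below. It is not the CharacterClasses method from Listing 24.3: the helper All is hypothetical, and the patterns are examples chosen here that may differ from the table entries. The statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string text = "Note: The -8- years old quick brown_fox, with 4 toes, jumped over the (12 year) old lazy dog.";

// Hypothetical helper: returns all matches of the pattern, separated by spaces.
string All(string pattern)
{
    List<string> values = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
        values.Add(m.Value);
    return string.Join(" ", values);
}

Console.WriteLine(All(@"\d"));        // 8 4 1 2           (single digits)
Console.WriteLine(All(@"\d+"));       // 8 4 12            (runs of digits)
Console.WriteLine(All(@"[A-Z]"));     // N T               (capital letters)
Console.WriteLine(All(@"\w{6,}"));    // brown_fox jumped  (the underscore counts as a word character)
Console.WriteLine(All(@"[^\w\s]"));   // : - - , , ( ) .   (neither word character nor white space)
```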
3.8 Anchors
Regular expressions can be modified with special characters to mark word or sentence boundaries.
The table below shows the matches of some example patterns in the following string.
"          1         2         3         4         5         6         7         8"
"012345678901234567890123456789012345678901234567890123456789012345678901234567890"
"The 8 years old quick brown fox with 4 toes jumped over the 12 year old lazy dog."
The wildcards, quantifiers, character classes and anchors that we referred to above are by no means
all that the regular expression language has to offer. See the following websites for many more
possibilities.
http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
http://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
We can use a regular expression to test the validity of email addresses, postal codes, telephone
numbers, dates and many more. The regular expression in the code below can be used to test the
validity of an email address. See if you can analyse it into its parts.
It can then be used to prompt a user to enter an email address until it is valid:
string re = @"^((([\w]+\.[\w]+)+)|([\w]+))@(([\w]+\.)+)([A-Za-z]{1,3})$";
bool isValid;
do {
    string sInput = Console.ReadLine();   // read the next attempt from the user
    isValid = Regex.Match(sInput, re).Success;
} while (!isValid);
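To experiment with the expression without typing at the console, it can be wrapped in a small helper and applied to a few fixed addresses. IsValidEmail and the sample addresses are hypothetical; the statements are written at top level (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string re = @"^((([\w]+\.[\w]+)+)|([\w]+))@(([\w]+\.)+)([A-Za-z]{1,3})$";

// Hypothetical helper: true if the whole input satisfies the expression.
bool IsValidEmail(string input) => Regex.Match(input, re).Success;

string[] samples = { "john.smith@example.com",    // True
                     "jane@mail.example.co",      // True
                     "john@example",              // False: no period after the @
                     "john smith@example.com",    // False: space in the name
                     "user@example.info" };       // False: the last part allows 1 to 3 letters only

foreach (string sample in samples)
    Console.WriteLine(sample + " -> " + IsValidEmail(sample));
```

The last sample shows that this particular expression is stricter than real email addresses require, which is worth keeping in mind when a given RE is used in an application.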
You can rest assured that we will not expect you to develop a complex regular expression as above.
You can, however, be expected to use a given RE in an application as above.
4. Summary
We discussed the following key concepts in this chapter: