Tech Republic



Getting familiar with Regular Expressions
Dan Frumin

A short while ago, I was working on a simple Web site project when I ran into an interesting problem. The application had to accept HTML input in a text box but support only a subset of HTML tags. That is to say, simple formatting tags would be supported, while complex tags, such as scripting and hyperlinks, would not.

At first glance, a simple string replacement might seem in order. Unfortunately, some of the complex tags take on numerous forms. For example, a hyperlink tag starts with <a but doesn't end until the closing >, so simple string replacement was not enough. Luckily for me, I was able to count on an old standby tool: Regular Expressions.

About Regular Expressions
Long considered one of the most powerful and most arcane of languages, the Regular Expression language (nicknamed RegEx) saw its humble beginnings in the world of UNIX. In the 1970s, access to computing resources shifted from punch cards to line terminals, and all input and output was handled one line at a time in the form of text.

A number of tools arose to help programmers deal with those text files. Popular among them were grep, which allowed users to find substrings in a file; sed, which allowed users to replace substrings in files; and ed, which allowed users to edit files one line at a time. Many of the operations were too complex to be described by position alone (e.g., change the 20th through 24th characters), and so regular expressions were born.

The language of Regular Expressions is largely mathematical. The full syntax itself can be daunting, but it's incredibly powerful in its ability to describe very complex substrings and variations. The good news is that the basic syntax is actually easy to master and offers a significant tool to developers. Using Regular Expressions, you can match, capture, replace, and split substrings, all using the same syntax notation and a few lines of code.

One problem with Regular Expressions is that there are several implementations, some of which use small variants on the syntax. But practically all implementations support the same syntactic elements, even if the actual character representations are slightly different.

Putting Regular Expressions to work
Regular Expressions are powerful, but what can you do with them? Within the .NET Framework, you can use Regular Expressions in validation controls to quickly and easily validate text input. For example, you can validate that the user entered a valid zip code, vehicle license plate number, social security number, and so on. RegEx is also useful for matching and extracting substrings out of a bigger string or file contents. For instance, you could write a Regular Expression to extract every URL from an HTML file or every e-mail from an SMTP standard mail header.

Finally, you can use RegEx to transform one string into another using the replacement constructs. For example, you could take a comma-delimited (CSV) file and invert the order of the input fields to result in a new output file.

Basic Regular Expressions
Let's start by picking up some basic Regular Expressions syntax you can use in your toolset.

Literal Strings and Anchors
The basis of any string expression is the literal matching of any one character or set of characters. With the exception of special syntax elements, RegEx assumes that an expression of the form john will match the literal substring john. In addition, RegEx offers two position-based constructs (called anchors). The special character ^ signifies the beginning of a line, while $ signifies the end of a line. For example, ^foo will match an occurrence of foo at the beginning of a line. In the same fashion, ^this is a line$ will match this is a line only if it exists as a whole line of text, from beginning to end.

Character escapes
You'll often need to match some common special characters within an expression. RegEx supports most of the same character escapes as the rest of your code. These include \r and \n (carriage return and new line), \t (tab), \x## (an ASCII character given as exactly two hexadecimal digits), and \u#### (a Unicode character given as exactly four hexadecimal digits).

In addition, you can match any special character literally by escaping it with a backslash. For example, \^ will match a literal caret (^) and \$ will match a literal dollar sign ($).

Character classes
RegEx enables you to designate certain types of characters as a class. The simplest character class is represented by the period (.), and it designates any character in the string. This is our generic reusable character, and it comes in very handy in writing expressions.

RegEx also offers a way to designate other special character classes. For example, you can use \w to designate any word character, generally considered the alpha characters (a-z and A-Z), the numeric characters (0-9), and an underscore (_). The \d sequence specifies any numeric character in the range of zero through nine (0-9). You can use \s to designate any white-space character, including spaces, tabs, and new lines.

In some cases, you may want to use the character classes in exclusionary form. In other words, you might need to specify any non-numeric character or any non-white-space character. RegEx meets this need by capitalizing the designator within the sequence. So, for example, \S refers to any non-white-space character.

Character sets
The last construct we'll introduce in this article is the character set. Let's assume that you need a validation expression for a phone number. The rules for phone numbers require that the first digit in an area code not be a zero (0) or a one (1), because those digits are reserved for dialing prefixes such as operator and long-distance access. So clearly, you can't use \d to designate the first digit of the phone number.

To solve this, RegEx lets you use square brackets to designate a character set. For the example above, you might use [23456789] to designate any character in the set. The RegEx language also allows the use of a hyphen to designate a range, as in [2-9]. You should note that you can combine multiple set definitions as well. For instance, for ASCII text the set [A-Za-z0-9_] is equivalent to the character class \w.

Just as it does with character classes, RegEx offers a way to invert the meaning of a set. If the first character in the set is a caret (^), the set takes on the inverse specification and in turn refers to any character that is not in the set. As an example, you might create the set [^0-9] to refer to any non-numeric character, which is the equivalent of \D.
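Here's a minimal sketch of these sets in C#, using the .NET Regex class from the System.Text.RegularExpressions namespace. It's written as top-level statements, and the sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// [2-9] accepts only the digits two through nine; \d also accepts 0 and 1.
bool setAcceptsFive = Regex.IsMatch("5", "^[2-9]$");
bool setAcceptsOne = Regex.IsMatch("1", "^[2-9]$");
bool digitAcceptsOne = Regex.IsMatch("1", @"^\d$");

// [^0-9] and \D both find the first non-digit character, here the hyphen.
string viaSet = Regex.Match("555-1212", "[^0-9]").Value;
string viaClass = Regex.Match("555-1212", @"\D").Value;

Console.WriteLine("{0} {1} {2}", setAcceptsFive, setAcceptsOne, digitAcceptsOne);
Console.WriteLine("'{0}' '{1}'", viaSet, viaClass);
```

The anchors around [2-9] keep the test to a single character, so the difference between the set and \d is easy to see.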

Here's a tip
Regular Expressions are incredibly powerful, but they take a while to explain thoroughly. The above syntax elements lay a strong groundwork for accomplishing tasks with the RegEx language but may not be enough for general use.

Here's a tip that will help you craft more powerful expressions for use in your code: The character sequence .* represents a "match zero or more of any character" operation. For example, john.*doe will match a string that begins with john and continues through the last instance of doe on the same line, because .* matches as many characters as it can.
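To see how far .* reaches, here's a small C# sketch that uses the .NET Regex class. The input string is an invented example with two occurrences of doe.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "john doe and jane doe";

// .* takes as many characters as it can, so the match runs to the second doe.
string matched = Regex.Match(input, "john.*doe").Value;

Console.WriteLine(matched);
```

In this case the match is the entire input, not just the first two words.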

Regular Expressions in validation controls
One of the many improvements in ASP.NET is its offering of both server- and client-side validation controls. Responding to the many developers who have to write code for validating input, the ASP.NET team created a series of controls to test for common cases, such as required text or comparison (matching a single string). However, this is not enough; text input can range from names to zip codes and phone numbers to order numbers. To address this, the .NET Framework offers the RegularExpressionValidator control.

This control has a property named ValidationExpression, which designates the RegEx string to test the value of the text box against. If the input string fails to match the expression, the validation control is tripped and an error message is displayed to the user. It's worth noting that this control implicitly tests the whole string, as if the expression were actually written within a set of anchors of the form of ^…$. We don't need to delve too deeply into this control, since you'll find it easy to use.

Here are a few examples of validation strings based on the syntax you learned above. A five-digit ZIP code, for instance, can be validated with \d\d\d\d\d.
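If you'd like to try validation strings outside of a page, the sketch below approximates the control's whole-string test in plain C#. The IsValid helper is hypothetical, not part of the control, and simply wraps the expression in ^ and $; the phone format follows the area-code rule described earlier.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: imitates the implicit ^...$ test by anchoring the expression.
bool IsValid(string candidate, string expression)
{
    return Regex.IsMatch(candidate, "^" + expression + "$");
}

string zipCode = @"\d\d\d\d\d";
string phoneNumber = @"[2-9]\d\d-\d\d\d-\d\d\d\d";

Console.WriteLine(IsValid("90210", zipCode));             // five digits
Console.WriteLine(IsValid("902101", zipCode));            // one digit too many
Console.WriteLine(IsValid("555-123-4567", phoneNumber));  // area code starts with 5
Console.WriteLine(IsValid("155-123-4567", phoneNumber));  // area code starts with 1
```

Simple concatenation is enough here because none of the expressions use alternation.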

The Regular Expression Workbench
Eric Gunnerson of the MSDN team has written a useful tool for experimenting with Regular Expressions. It's called the Regular Expression Workbench and is available through the gotdotnet Web site. You can use the site's search feature to locate the tool within the User Samples section. I recommend it as a great way to learn Regular Expressions, as well as to develop and test complex expressions later on in your career.

RegEx Syntax reference so far
Throughout this series of articles, we'll offer a growing syntax reference to the RegEx language. Table A shows all the sequences we've covered so far.
Table A

Sequence                  Description

Literal strings and anchors
(nonspecial character)    Any character that is not a special character matches itself ((, ^, $, and \ are examples of special characters).
^                         Signifies the beginning of a line in the string.
$                         Signifies the end of a line in the string.

Common character escapes
\ + special character     The character being escaped, for example \$.
\\                        The backslash character.
\r and \n                 Carriage return and new line, respectively.
\t                        The tab character.
\x##                      Matches any ASCII character given in hexadecimal form as exactly two digits.
\u####                    Matches any Unicode character given in hexadecimal form as exactly four digits.

Character classes
.                         Matches any character, except the new line ('\n').
\w                        Matches any word character. In standard ASCII that is any alpha character (a-z and A-Z), any numeric character (0-9), and an underscore (_).
\W                        Matches any non-word character.
\d                        Matches any numeric digit (0-9).
\D                        Matches any character that is not a numeric digit.
\s                        Matches any white-space character, including spaces, tabs, carriage returns, and new lines.
\S                        Matches any non-white-space character.

Character sets
[abcd]                    Matches any character designated within the set.
[^abcd]                   Matches any character not in the set.
[0-9a-z]                  You can use the hyphen (-) character to specify a range of characters within a set.

Special tip
.*                        Matches zero or more of any character (except the new line), as many as possible.


Regular Expressions: Understanding sequence repetition and grouping

Quantifiers
In my last article, I showed you how certain sequences were regularly repeated. For example, to specify a ZIP code, I had to provide the sequence \d\d\d\d\d. You might expect that there is a way to provide quantitative guidelines to a RegEx expression. And you would be right.

RegEx allows you to specify that a particular sequence must show up exactly five times by appending {5} to its syntax. For example, the expression \d{5} specifies exactly five numeric digits. You can also specify a series of at least four and no more than seven characters by appending {4,7} to the sequence.

Similarly, the expression [A-Z]{3,6} specifies three to six instances of the character set consisting of uppercase letters. The expression can leave out the second designator to imply no upper limit, so a word that is at least four characters long can be expressed as \w{4,}. The first designator cannot be left out in .NET, where {,6} is read as literal text rather than as a quantifier. If you’re looking for a number up to six digits long, you would use \d{0,6}.

In addition to the generic syntax above, RegEx offers shortcuts for designating quantifiers. The question mark character (?) is used to designate zero or one matches (equivalent to {0,1}). The asterisk character (*) is used to designate zero or more matches (equivalent to {0,}). Lastly, the plus character (+) is used to designate one or more matches (equivalent to {1,}). Using these sequences can make your expressions faster to write and easier to read.

Here are examples of how you might use the above constructs:
  • A simple ZIP code: \d{5}
  • A phone number with or without hyphens: [2-9]\d{2}-?\d{3}-?\d{4}
  • Any two words separated by a space: \w+ \w+
  • One or two words separated by a space: \w* ?\w+
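The quantifier examples above can be tried with a few lines of C#. This is only a sketch: the sample inputs are invented, and I've added ^ and $ to the validation-style tests so that the whole input has to match.

```csharp
using System;
using System.Text.RegularExpressions;

string phone = @"^[2-9]\d{2}-?\d{3}-?\d{4}$";

bool zipOk = Regex.IsMatch("90210", @"^\d{5}$");           // exactly five digits
bool hyphenated = Regex.IsMatch("555-123-4567", phone);    // hyphens present
bool unhyphenated = Regex.IsMatch("5551234567", phone);    // hyphens omitted
bool sevenDigits = Regex.IsMatch("1234567", @"^\d{0,6}$"); // more than six digits

// Without anchors, \w+ \w+ finds the first pair of words in a longer string.
string twoWords = Regex.Match("say hello world", @"\w+ \w+").Value;

Console.WriteLine("{0} {1} {2} {3}", zipOk, hyphenated, unhyphenated, sevenDigits);
Console.WriteLine(twoWords);
```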

Grouping
So far, you’ve seen how to quantify sequences of single characters within a string. You also know that a sequence of literals (for example, joe) designates the substring itself. But what if you want to quantify the literal substring of characters? The RegEx language offers the grouping construct for this purpose. To designate a group, you enclose it in parentheses.

For example, (abc) is the sequence abc within the string. By itself, it’s not different than the literal abc. However, when you apply some of the quantifiers, this construct becomes very powerful, especially when you consider that a group can contain complete RegEx sequences.

I previously wrote the expression for a simple ZIP code as \d{5}. However, ZIP codes also have an optional section that appends a hyphen and four more digits. The optional section is easily defined as -\d{4}. But how do you tell the RegEx engine that it’s optional? You might remember that the question mark is used to match zero or one occurrences of a pattern.

A complex ZIP code is then expressed as \d{5}(-\d{4})?. To understand it, see that the group construct was applied to the optional section and then designated as matching zero or one times using the question mark quantifier. You can use the other quantifiers to control the matches within the expression. For example, (abc){3} designates the sequence abcabcabc.
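Here's a short C# sketch of grouping with quantifiers. The test strings are assumptions chosen for illustration, and the anchors are added so that partial matches don't count.

```csharp
using System;
using System.Text.RegularExpressions;

string zip = @"^\d{5}(-\d{4})?$";

bool fiveDigit = Regex.IsMatch("90210", zip);        // optional group absent
bool nineDigit = Regex.IsMatch("90210-1234", zip);   // optional group present
bool cutShort = Regex.IsMatch("90210-12", zip);      // group started but incomplete

// With parentheses the quantifier repeats abc; without them it repeats only c.
bool groupRepeated = Regex.IsMatch("abcabcabc", "^(abc){3}$");
bool letterRepeated = Regex.IsMatch("abccc", "^abc{3}$");

Console.WriteLine("{0} {1} {2}", fiveDigit, nineDigit, cutShort);
Console.WriteLine("{0} {1}", groupRepeated, letterRepeated);
```

The last two tests show why the parentheses matter: a quantifier applies only to the element immediately before it.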


Capturing

The grouping construct also carries a secondary meaning within the RegEx language. It creates a mechanism to capture a matching substring for future use such as extraction or replacement.

By default, any group you designate within an expression is a capturing group. Groups are numbered from left to right in order of opening parentheses, even if the groups are nested. The group with index 0 is a special group that contains the full match, as if the whole expression were wrapped in a set of parentheses.

Let’s break down a nested grouped expression for a phone number: (([2-9]\d{2})-)?(\d{3})-(\d{4}). Notice that this expression contains a number of groups, some nested and some not.

The first set of parentheses captures the first three digits of the area code as well as the hyphen that follows. We need to put this group here because the area code is optional, as designated by the question mark following its definition. We nested a second group to allow us to extract just the area code itself. The next two sets of parentheses are more obvious in their capture. At this point, you can refer to a number of sections within this substring using the group notation.

There’s no question that tracking groups by number is a rather tedious task. It’s further complicated by the fact that you were forced to designate a group (the first one) that you really didn’t care much about, just so you could specify that the area code is optional. RegEx addresses both of these issues quite elegantly using named groups. A named group is designated using the syntax (?<name> … ).

There is a special case designating a noncapturing group if the group begins with ?:. In the above example, you could have used (?:([2-9]\d{2})-)? to designate that the area code and hyphen group is a noncapturing group. This helps eliminate some of the groups that are there purely for expression reasons and not for reusability reasons.

If you apply both of these techniques to the ZIP code expression, you end up with (?<full>(?<base>\d{5})(?:-(?<ext>\d{4}))?). Now you're able to refer to the full ZIP code, the base part, and the extended part individually by name. The expression also uses a noncapturing group to ignore the hyphen in the extended part. One interesting side effect of naming groups is that all named groups are numbered after all nonnamed groups, throwing off the order of opening parentheses.
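As a quick check of both points, the C# sketch below reads the named ZIP code groups and then looks at how a named group is numbered next to an unnamed one. The input text and the second pattern, (?<first>a)(b), are invented for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

Regex zip = new Regex(@"(?<full>(?<base>\d{5})(?:-(?<ext>\d{4}))?)");
Match m = zip.Match("Redmond, WA 98052-6399");

string full = m.Groups["full"].Value;
string basePart = m.Groups["base"].Value;
string ext = m.Groups["ext"].Value;

// The named group opens first, yet the unnamed group (b) is numbered 1.
Regex mixed = new Regex(@"(?<first>a)(b)");
int firstNumber = mixed.GroupNumberFromName("first");
string groupOne = mixed.Match("ab").Groups[1].Value;

Console.WriteLine("{0} / {1} / {2}", full, basePart, ext);
Console.WriteLine("first is group {0}; group 1 holds {1}", firstNumber, groupOne);
```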

The .NET Framework Regex class
There is still much more to the Regular Expression language, but it’s time to shift focus to some code examples. To work with Regular Expressions, the .NET Framework offers the Regex class in the System.Text.RegularExpressions namespace. (In this article, I’ll cover only two simple methods of the class, saving some of the more complex uses of this class for the next article in this series.)

Regex.IsMatch
The .NET Framework String class offers the IndexOf method to determine whether one string contains another. However, its use is intrinsically limited to a literal string. What if you want to determine whether a string contains another string that is defined with a Regular Expression? The .NET Framework offers a static method of the Regex class for just this reason.

Let’s say that you want to determine whether a particular input string contains a ZIP code. You already know how to write the regular expression for a ZIP code. All you have to do is use the IsMatch method to apply the test:
bool hasMatch = Regex.IsMatch(inputString, @"\d{5}(-\d{4})?");

It’s worth noting that the IsMatch method will return a true value if the match exists anywhere within the input string. In general, you know enough about the input string to not worry about this. However, if you need to specify that the whole string should match the expression, you can use the ^ and $ anchors:
bool hasMatch = Regex.IsMatch(inputString, @"^\d{5}(-\d{4})?$");
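The difference between the two calls shows up as soon as the ZIP code sits inside a longer string. Here's a sketch with made-up inputs:

```csharp
using System;
using System.Text.RegularExpressions;

string zip = @"\d{5}(-\d{4})?";
string wholeZip = "^" + zip + "$";

bool found = Regex.IsMatch("Ship to 90210 today", zip);       // match anywhere
bool whole = Regex.IsMatch("Ship to 90210 today", wholeZip);  // whole string required
bool exact = Regex.IsMatch("90210-1234", wholeZip);           // whole string is a ZIP code
bool tooLong = Regex.IsMatch("1234567", zip);                 // five of the seven digits match

Console.WriteLine("{0} {1} {2} {3}", found, whole, exact, tooLong);
```

The last test is the one that tends to surprise people: without anchors, a seven-digit number still contains a five-digit match.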

The astute reader will wonder how to determine whether multiple matches exist and where they show up within the string. Both of these features, and more, are available in the Regex class and will be covered in the next article in this series.

Regex.Replace
Just as IsMatch is the Regex analog of the String class's IndexOf method, the Regex class also offers a Replace method for replacing substrings defined as Regular Expressions. Let’s assume that you're writing a simple HTML interpreter. One of HTML’s features is that it collapses any white-space sequence in the input to a single space in the output. You can use the Regex class to achieve the same thing using the following code:
string result = Regex.Replace(inputString, @"\s+", " ");

This is a very simple example, but it illustrates how you can do some very neat things with the RegEx engine. The Replace method is actually far more powerful in that it allows you to refer to capture groups defined in the expression.

There are two simple ways to refer to capture groups in the replacement string. A dollar sign followed by a number refers to a capture group by number. The sequence $0 refers to group zero, which is the special group for the whole match. The sequence $2 then refers to the second group. In addition, you can specify a named capture group using the syntax ${name}.

Let’s look at two examples, both of which assume they are called using this syntax:
string result = Regex.Replace(inputString, pattern, replace);
// ensure that a phone number is hyphenated
pattern = @"([2-9]\d{2})-?(\d{3})-?(\d{4})";
replace = "$1-$2-$3";

// invert the order of the first and last names, ignoring the middle
pattern = @"(?<first>\w+) (?:\w+ )*(?<last>\w+)";
replace = "*** ${last}, ${first} ***";


Note that only the group reference characters are special in the replacement string. In the above example, the asterisks are interpreted as literal characters within the string. However, if you would like to place a literal dollar sign in the replacement string, you can specify it using $$.
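Put together as runnable C#, the two replacements look like the sketch below, with a third call added to show $$. The input strings are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Ensure that a phone number is hyphenated.
string phone = Regex.Replace("Call 5551234567 now",
    @"([2-9]\d{2})-?(\d{3})-?(\d{4})", "$1-$2-$3");

// Invert the first and last names, ignoring the middle name.
string name = Regex.Replace("John Quincy Adams",
    @"(?<first>\w+) (?:\w+ )*(?<last>\w+)", "*** ${last}, ${first} ***");

// $$ produces a literal dollar sign, and $1 then inserts the captured digits.
string price = Regex.Replace("cost: 5", @"(\d+)", "$$$1");

Console.WriteLine(phone);
Console.WriteLine(name);
Console.WriteLine(price);
```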

Increase your knowledge of Regular Expression syntax

"Lazy" matching
By now, I hope you've had a chance to use regular expressions in your code. If you have, you might have run into a somewhat common and unfortunately complex problem.

Often, you will use the asterisk or plus quantifiers to designate that a particular sequence or pattern repeats. As it happens, the RegEx engine will match that pattern in what is termed a "greedy" behavior. That means it will match as much of the string as possible before testing the next sequence. In some cases, the greedy behavior is not what you want, and instead you need to use "lazy" matching.

This is probably best illustrated with an example. Assume that you are trying to extract anchor tags within an HTML file. Since the anchor tag starts with <a and ends with >, your first instinct might be to use <a.*>. The problem with this is that the .* sequence is applied in greedy fashion to match any character, including the greater-than sign, as many times as possible. If your HTML input string consists of "<a href=foo>bar</a>", the expression will match the whole string. Essentially, the .* sequence runs to the end of the string and then gives back only as many characters as it takes to find the last >. This is clearly not what you want.

One alternative is <a[^>]*>, which translates to a string that begins with <a and greedily matches as many non-greater-than characters as possible, followed by a greater-than. This is a viable solution but limited in its applicability. A more generic solution is to use lazy matching, which asks RegEx to match as few characters as possible while still successfully applying the expression. A lazy match is defined by appending the question mark (?) to any quantifier, as in <a.*?>.
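Running all three expressions against the sample input makes the difference concrete. This is a minimal C# sketch:

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<a href=foo>bar</a>";

string greedy = Regex.Match(html, "<a.*>").Value;      // runs to the last >
string negated = Regex.Match(html, "<a[^>]*>").Value;  // stops at the first >
string lazy = Regex.Match(html, "<a.*?>").Value;       // stops at the first >

Console.WriteLine(greedy);
Console.WriteLine(negated);
Console.WriteLine(lazy);
```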

Regular Expression options
The RegEx engine supports a number of options that can be set either in code or within an expression. The most popular options are described below along with each option's programmatic name and its inline character for use within an expression.

IgnoreCase (i)
The IgnoreCase option specifies that searching and matching should be done in case-insensitive fashion.

ExplicitCapture (n)
The ExplicitCapture option specifies that groups should default to noncapturing mode, such that only named groups—e.g., (?<name> … )—are captured. This is useful if you have an expression that contains a lot of noncapturing groups and don't want to specify them using the (?: … ) syntax.

Multiline (m)
The Multiline option specifies that the string should be treated as a series of lines. It changes the caret and dollar anchors so that they match at the beginning and end of each line, not just at the beginning and end of the whole string. It does not change the period character (.), which matches any character except the newline (\n) whether or not this option is set.

Singleline (s)
The Singleline option specifies that the input should be treated as one long string as far as the period is concerned: the period matches every character, including the newline (\n). It does not change the caret and dollar anchors, so despite the names, Multiline and Singleline are independent and can be turned on together.

To set options within the expression, you create a noncapturing group and add the option modifiers to the group definition. For example, (?s-in: … ) turns on the Singleline option and turns off the IgnoreCase and ExplicitCapture options. To set these options programmatically, you can use the RegexOptions enumeration, which is often accepted as a parameter to Regex methods.
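The sketch below sets options both ways in C# and counts what each one changes. The two-line input is invented; the comments describe the behavior I'd expect from the .NET engine.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "one\nTWO";

// IgnoreCase, set inline and through the RegexOptions enumeration.
bool inlineOption = Regex.IsMatch(text, "(?i:two)");
bool enumOption = Regex.IsMatch(text, "two", RegexOptions.IgnoreCase);
bool noOption = Regex.IsMatch(text, "two");

// Multiline lets ^ match at the start of every line.
int lineStarts = Regex.Matches(text, @"^\w+", RegexOptions.Multiline).Count;
int stringStarts = Regex.Matches(text, @"^\w+").Count;

// Singleline lets the period match the newline.
string dotDefault = Regex.Match(text, "one.*").Value;
string dotSingleline = Regex.Match(text, "one.*", RegexOptions.Singleline).Value;

Console.WriteLine("{0} {1} {2}", inlineOption, enumOption, noOption);
Console.WriteLine("{0} {1}", lineStarts, stringStarts);
Console.WriteLine("{0} {1}", dotDefault.Length, dotSingleline.Length);
```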


More anchors
You've already learned about the ^ and $ anchors, but the Regular Expression language offers a few other options for anchoring matches to extend your ability to define expressions. The first is \b, which specifies that a match must happen at a word boundary. A word boundary is the transition from a word character (\w) to a nonword character (\W), or vice versa; the beginning and end of the string also count when a word character sits there. That means that white space, punctuation, and symbols all define word boundaries.

As an example, \b[aA]\w* can be used to define any word that starts with the letter A (in either uppercase or lowercase). Another example is \w*ing\b, which defines any word that ends with the sequence ing. The sequence \B designates that a match should not happen at a word boundary. Thus, \w*\Bing\B\w* will match words that contain the sequence ing somewhere in the middle.
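Here's a C# sketch that runs the three word-boundary expressions over one sentence. The sentence and the FindAll helper are hypothetical; the helper just joins every match with commas.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every match of expr in source, comma separated.
string FindAll(string source, string expr)
{
    var values = new List<string>();
    foreach (Match hit in Regex.Matches(source, expr))
    {
        values.Add(hit.Value);
    }
    return string.Join(",", values);
}

string sentence = "An apple fell on the singer in the ring";

string startsWithA = FindAll(sentence, @"\b[aA]\w*");      // An,apple
string endsWithIng = FindAll(sentence, @"\w*ing\b");       // ring
string ingInMiddle = FindAll(sentence, @"\w*\Bing\B\w*");  // singer

Console.WriteLine(startsWithA);
Console.WriteLine(endsWithIng);
Console.WriteLine(ingInMiddle);
```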

The RegEx engine also offers three other interesting anchors. The \A anchor defines the absolute beginning of the string, independent of the Multiline option. By extension, the \Z anchor defines the end of the string, or the position just before a single terminating newline, also independent of the Multiline option. These two anchors take on the same meaning that ^ and $ have when the Multiline option is off. And the \z anchor defines the absolute end of the string, after any terminating newline, independent of the Multiline option.

Here are a few examples to demonstrate the use of these anchors:
inputString = "AAA\nBBB\nCCC\n";
// (?m:\w+$)—matches three times, one for each line
// (?m:\A\w+)—matches only AAA
// (?m:\w+\Z)—matches only CCC
// (?m:\w+\z)—returns no matches, since there's a terminating newline

More about the Regex class: compiled expressions
The Regex class can actually be used in two ways. The simplest is to use a set of static methods that allow you to pass the expression as a literal string. This is the most common method of using this class in one-off situations.

However, this approach can be less efficient when the same expression is used repeatedly, because the static methods have to look up, or rebuild, the expression's internal representation on every call. An alternative is to create an instance of the Regex class, which parses the expression once in its constructor. This instance object can then be used again and again without incurring the cost of the expression compilation. For example:
Regex re = new Regex("<a.*?>");
foreach (string s in listOfStrings)
{
  if (re.IsMatch(s))
  {
    // we have a match
  }
}

Regex.Match and the Match class
You saw earlier that you can use the Regex object to determine whether there is a match within a string. But what if you want to extract the value of the match from the string? The Match method of the Regex class returns a Match object for the first match in the string:
Match m;
m = Regex.Match(inputString, @"(?<base>\d{5})(?:-(?<ext>\d{4}))?");

The first thing you can do is check whether the match was successful:
if (m.Success)
{
  // the match succeeded, so the members below are meaningful
}

Now you have a Match object that you can use. Below are some simple members of this object:
// get the starting point and length of the match
int start = m.Index;
int length = m.Length;
 
// get the fully matched string of the whole match
string full = m.Value;
 
// use them all
Console.WriteLine("{0} at {1} is {2} chars long", full, start, length);


Match.Groups Collection
You might notice that the expression above uses two named capturing groups. The Match object allows you to access these using the Groups collection. A single item in the collection returns a Group object that also supports the Success, Index, Length, and Value members. To get at a single Group object, you can access the collection by name or by index. Here are a few examples:
// display all the groups, including 0 (the all match group)
for (int i = 0; i < m.Groups.Count; i++)
  Console.WriteLine("Group {0}={1}", i, m.Groups[i].Value);
 
// output the zip code
string zipBase = m.Groups["base"].Value;
string zipExt = m.Groups["ext"].Value;
Console.WriteLine("Base {0}, Extended {1}", zipBase, zipExt);

Using the Groups collection, you can extract a lot of information out of the match. As I mentioned in the previous article on sequence repetition, named groups are always numbered after nonnamed groups, so the code above can be useful in experimenting with group numbers.

Match.NextMatch and Regex.Matches
In many cases, the input string will contain multiple matches to a particular Regular Expression. You have several ways to access all the possible matches.

The first is to use the NextMatch method of the Match object. You can use this method as follows:
while (m.Success)
{
  // use the match
  m = m.NextMatch();
}

This is a simple way to iterate through the matches. Another method is to use the Matches method of the Regex class, which returns a collection of Match objects. You can then iterate the collection in typical fashion. For example:
MatchCollection mc = Regex.Matches(inputString, expression);
foreach (Match m in mc)
{
  // use the match
}
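Both approaches should visit the same matches in the same order. The sketch below checks that in C# with an invented input string and the ZIP code expression from earlier in the series.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "Offices in 98052, 10001-1234 and 94043.";
string zipPattern = @"\d{5}(?:-\d{4})?";

// Walk the matches one at a time with NextMatch.
var viaNextMatch = new List<string>();
for (Match m = Regex.Match(input, zipPattern); m.Success; m = m.NextMatch())
{
    viaNextMatch.Add(m.Value);
}

// Collect the same matches in one call with Matches.
var viaMatches = new List<string>();
foreach (Match found in Regex.Matches(input, zipPattern))
{
    viaMatches.Add(found.Value);
}

Console.WriteLine(string.Join(" | ", viaNextMatch));
Console.WriteLine(string.Join(" | ", viaMatches));
```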


Regular Expressions syntax and advanced string replacement

Backreferences
In some cases, you're going to want to create an expression that references a previously defined group as part of a match. An expression can refer to previously defined groups either by number or by name. The expression \# refers to a previously defined group by number (for example, \1 is the first group). The expression \k<name> refers to a previously defined group by name. A simple example is to find any word which starts and ends with the same character. To do this, I would use the expression \b(\w)\w*?\1\b, which captures the first letter of a word in a capture group, then allows for any number of letters in a lazy match, followed by the contents of the first capture group (the first letter).
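Here's the same-letter expression as a C# sketch, once with a numbered backreference and once with a named one. The word list, the group name c, and the FindAll helper are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every match of expr in source, comma separated.
string FindAll(string source, string expr)
{
    var values = new List<string>();
    foreach (Match hit in Regex.Matches(source, expr))
    {
        values.Add(hit.Value);
    }
    return string.Join(",", values);
}

string words = "level area test cat";

string byNumber = FindAll(words, @"\b(\w)\w*?\1\b");          // \1 repeats group 1
string byName = FindAll(words, @"\b(?<c>\w)\w*?\k<c>\b");     // \k<c> repeats group c

Console.WriteLine(byNumber);
Console.WriteLine(byName);
```

The comparison is case-sensitive here, so a word like Anna would need the IgnoreCase option to qualify.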

Alternation constructs
So far, you've seen a way that you can specify a match for one of a set of characters (for example, [AEIOU]). But what if you want to match one of a set of terms? The RegEx language offers the vertical bar (|) character to allow this type of matching within a group (either capturing or not). This character will match any one of a number of terms, opting for the left-most match first. For example, the expression (?i:Th(?:is|at)) will match either This or That in a case-insensitive search. You'll notice that I turned the nested group into a non-capturing group since I don't care about its results in particular. Alternatively, I might have specified the ExplicitCapture option to denote that groups should be non-capturing by default.

In some cases, you'll want to change the expression you're looking for based on a previously defined group. For example, let's assume you have a company with two distinct formats for order IDs. An order starting with the letter A is five digits long, while an order starting with the letter B is eight digits long. RegEx allows you to specify this using this syntax: (?(name)yes|no), where name defines the group name, and yes and no define the search expressions. In the case of this company, your expression would end up as:
((?<AThere>A)|B)(?(AThere)\d{5}|\d{8})

You may choose to leave out the no expression to specify that a search expression is only necessary if a group is defined. Essentially, the expression (?(name)yes) is syntactically equivalent to (?(name)yes|), which is to say that the no expression can match the empty string.
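To try the order ID expression, the C# sketch below wraps it in ^ and $ (my addition, so that extra digits are rejected) and tests four invented IDs.

```csharp
using System;
using System.Text.RegularExpressions;

string orderId = @"^((?<AThere>A)|B)(?(AThere)\d{5}|\d{8})$";

bool shortA = Regex.IsMatch("A12345", orderId);     // A with five digits
bool longB = Regex.IsMatch("B12345678", orderId);   // B with eight digits
bool longA = Regex.IsMatch("A12345678", orderId);   // A with eight digits
bool shortB = Regex.IsMatch("B12345", orderId);     // B with five digits

Console.WriteLine("{0} {1} {2} {3}", shortA, longB, longA, shortB);
```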

A more interesting application of this syntactic element is the extraction of URLs from an HTML input string. The HTML syntax allows attributes, such as the href attribute of the anchor tag to exist in undelimited form or delimited by either single or double quotes. If the href is delimited, the URL is represented by everything between the two delimiters. If it's undelimited, the URL is represented by everything up to the first white-space character or closing >. Let's analyze the following expression which can be used to extract the href attributes:
// inside a verbatim string, a literal double quote is written as ""
string pattern = @"href=(?<d>[""'])?.*?(?(d)\k<d>|[\s>])";

The first thing you need to do is figure out if the attribute is delimited by single or double quotes. You use the character set ["'] and denote it in a group as optional using the question-mark quantifier. You know that if the attribute is delimited at the beginning, it must be delimited at the end, so you name the group in order to use it in a test alternation later. You also know that if the attribute is not delimited, it must end either in a white space (if there is another attribute following) or in a greater than (>) at the end of the tag. To achieve this, use the test and alternation construct to test for the named group d to be defined. If the group is defined, look for its value (the delimiter); otherwise, look for either a white space or a greater than.
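The expression as written matches the delimiters along with the URL. In the C# sketch below I've added a named group, url, around the lazy .*? so the URL can be read on its own; that group and the sample HTML are assumptions, not part of the original expression.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string html = "<a href=\"one.html\">1</a> <a href='two.html' class=x>2</a> <a href=three.html>3</a>";

// Same expression as in the text, plus the url group (added for this sketch).
string pattern = @"href=(?<d>[""'])?(?<url>.*?)(?(d)\k<d>|[\s>])";

var urls = new List<string>();
foreach (Match m in Regex.Matches(html, pattern))
{
    urls.Add(m.Groups["url"].Value);
}

Console.WriteLine(string.Join(", ", urls));
```

The three links cover the double-quoted, single-quoted, and undelimited forms.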

While this may seem complex, you'll find that with a bit of practice you'll be able to generate these types of expressions more quickly.


Subexpressions
The last major syntactic element I'll introduce is the subexpression. Subexpressions allow you to perform tests on the space before and after a match, without capturing the results of the test into the match itself. The contents of the test can be any regular expression, allowing for great flexibility. Subexpression tests are considered to fall into two directions: lookahead and lookbehind. Lookahead tests continue only if they succeed to the right, while lookbehind tests continue only if they succeed to the left. RegEx offers both positive (the expression must exist) and negative (the expression must not exist) tests.

Specific examples should help explain:
  • Positive lookahead (?= … ). The match will continue only if the subexpression is found to the right. For example, \w+(?=ing\b) will match only those sequences of characters followed by ing and a word boundary (\b). In other words, it finds all words ending with ing, although the ing itself is not part of the match.
  • Positive lookbehind (?<= …). The match will continue only if the subexpression is found to the left. For example, (?<=\ba)\w+ will match the remainder of those words that start with an a; the a itself is not part of the match.
  • Negative lookahead (?! … ). The match will continue only if the subexpression is not found to the right. For example, \b\w+\b(?!,) will match those words that are not followed by a comma.
  • Negative lookbehind (?<! … ). The match will continue only if the subexpression is not found to the left. For example, (?<!\$)\d+\.\d+ will match those numbers with decimals that do not follow a literal dollar sign. Be aware that with more than one digit before the decimal point, as in $12.50, the expression can still match 2.50 starting at the second digit; (?<![\$\d])\d+\.\d+ rules that out.
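The four tests above can be tried out with a short sketch like the following. The input strings and the FindAll helper are made up for illustration; the patterns are the ones from the list.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Returns the text of every match of pattern in input.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        // Positive lookahead: "ing" must follow but is not captured.
        Console.WriteLine(string.Join("|", FindAll("walking and talked", @"\w+(?=ing\b)")));   // walk

        // Positive lookbehind: a leading "a" must precede but is not captured.
        Console.WriteLine(string.Join("|", FindAll("an apple or banana", @"(?<=\ba)\w+")));    // n|pple

        // Negative lookahead: words not followed by a comma.
        Console.WriteLine(string.Join("|", FindAll("red, green, blue", @"\b\w+\b(?!,)")));     // blue

        // Negative lookbehind: decimals not preceded by a dollar sign.
        Console.WriteLine(string.Join("|", FindAll("3.14 and $2.50", @"(?<!\$)\d+\.\d+")));    // 3.14
    }
}
```

In each case the text inspected by the test stays out of the match value, which is why the first two results are word fragments rather than whole words.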

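You might find it interesting that the examples did not include words that do not start or do not end with a particular character sequence, such as words that do not end with ing. You might expect that expressions of the form \b(?<!a)\w+\b and \b\w+(?!ing)\b would work, but they do not. In the second expression, the greedy \w+ consumes the entire word, including any trailing ing, before the negative lookahead runs. The test is then applied at the end of the word, where ing can no longer follow, so it always succeeds. Switching to a lazy match does not help, because the trailing \b still forces the match to extend to the end of the word, where the test succeeds again. In the first expression, the lookbehind is evaluated at the word boundary itself, so it inspects the character before the word rather than the word's first letter.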

So, how do you test for words that do not start or end with a sequence? You must get very creative with subexpressions to achieve this goal. Two conditions are satisfied whenever a word does not end in a sequence (like ing). First, there is a sequence of characters that does not lead to ing. You could test for this, but as I discussed above, the greedy match will overtake the ing characters. Second, the sequence of characters ing does not exist to the left of a word boundary. You can use this second condition in a subexpression test of the form \b\w+(?<!ing)\b. This expression uses a lookbehind (to the left) subexpression to achieve what a lookahead cannot. You can use the analogous form to test for words that do not start with a sequence by testing that the sequence does not exist to the right of a word boundary, as in \b(?!a)\w+\b, which tests for words that do not start with the letter a.
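The difference between the forms that fail and the forms that work is easy to check. The sketch below uses made-up input strings and a hypothetical FindAll helper; the four patterns are the ones discussed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedAffixDemo
{
    // Returns the text of every match of pattern in input.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string endings = "singing a song loudly";
        // Lookahead placed after a greedy \w+ filters nothing: every word matches.
        Console.WriteLine(string.Join("|", FindAll(endings, @"\b\w+(?!ing)\b")));   // singing|a|song|loudly
        // Lookbehind placed at the end of the word does the job.
        Console.WriteLine(string.Join("|", FindAll(endings, @"\b\w+(?<!ing)\b")));  // a|song|loudly

        string starts = "an apple or banana";
        // Lookbehind at the start of the word looks at the wrong character.
        Console.WriteLine(string.Join("|", FindAll(starts, @"\b(?<!a)\w+\b")));     // an|apple|or|banana
        // Lookahead at the start of the word does the job.
        Console.WriteLine(string.Join("|", FindAll(starts, @"\b(?!a)\w+\b")));      // or|banana
    }
}
```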

Miscellaneous
I previously covered one miscellaneous construct in the language. At this point, it's worth adding another. To embed a comment inside a regular expression, you may use (?# …). The most important thing to remember about this construct is that it does not allow for nested parentheses, since the comment lasts only until the first closing parenthesis.
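As a small illustration, the sketch below annotates a made-up pattern for a seven-digit phone number. The pattern, class name, and input are assumptions chosen for the example; the comments have no effect on what is matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentDemo
{
    // Each (?# ...) group is skipped by the engine. The comment text must not
    // contain a closing parenthesis, because the first one ends the comment.
    private const string Pattern =
        @"\b\d{3}(?# exchange: three digits)-\d{4}(?# line number: four digits)\b";

    public static string FindPhone(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(FindPhone("Call 555-1234 today")); // 555-1234
    }
}
```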

More about the Regex class
In this section, I'll cover a few more members of the Regex class, which can be very useful in your code.

Regex.Split
You've already seen how the Regex class offers analogs to several of the String class members. It so happens that the Regex class offers an analogous Split member, which allows you to split a string based on a regular expression:
string[] manyStrings = Regex.Split(inputString, @"\s+");

The simple line of code above will split up a string at the points in which one or more white space characters exist. So an input string of the form "aa bb cc dd" will result in four strings ("aa", "bb", "cc", "dd"), even though the actual type and number of white space characters between the substrings is variable.

You may notice that the delimiter is omitted from the resulting array of strings. In some cases, you may want to have the delimiting expression show up in the results as well. In the example above, you may also want to do something with the white space between the strings. If you would like the delimiting match to be included in the results, all you have to do is enclose it in a capturing group, as in @"(\s+)".
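The two behaviors can be compared side by side. This is a minimal sketch with a made-up input string that mixes spaces, a tab, and a newline; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Delimiters are dropped from the result.
    public static string[] SplitWords(string input)
    {
        return Regex.Split(input, @"\s+");
    }

    // Wrapping the delimiter in a capturing group keeps it in the result.
    public static string[] SplitKeepingDelimiters(string input)
    {
        return Regex.Split(input, @"(\s+)");
    }

    public static void Main()
    {
        string input = "aa bb \t cc\ndd";
        Console.WriteLine(SplitWords(input).Length);             // 4
        Console.WriteLine(SplitKeepingDelimiters(input).Length); // 7
    }
}
```

With the capturing group, the result alternates between substrings and the white space that separated them, so the original string can be rebuilt by concatenating the array.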


Regex.Replace using a MatchEvaluator function
In a previous article, I discussed how the Replace member of the class can be used to manipulate a string using replacement references ($). In some cases, replacement references won't be enough and you'll want to call a function. For example, let's say that you have an input file consisting of HTML source code and you would like to append the date each URL was updated to the title of the link. Clearly, replacement references aren't enough since the update date is different for each URL. The System.Text.RegularExpressions namespace offers a delegate called MatchEvaluator for just this reason. As with all delegates, the method you supply must follow a standard prototype: it takes a Match and returns the replacement string. Here is sample code, which assumes that URLs are encapsulated in single quotes:
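// Requires: using System.Text.RegularExpressions;
// The closing tag uses \s* rather than .* so that a greedy match
// cannot run on to a later > on the same line.
string pattern = @"(?<open><a.*?href='(?<url>.*?)'.*?>)(?<title>.*?)(?<close></a\s*>)";
// get the resulting replacement string
string result = Regex.Replace(inputString, pattern,
    new MatchEvaluator(replaceFunction));
// do something with it…

// Called once per match; the returned string replaces the matched text.
public string replaceFunction(Match m)
{
    string openingTag = m.Groups["open"].Value;
    string title = m.Groups["title"].Value;
    string closingTag = m.Groups["close"].Value;
    string url = m.Groups["url"].Value;
    // dateUpdated is assumed to be defined elsewhere and to return a DateTime
    string updated = dateUpdated(url).ToShortDateString();

    string result = String.Format("{0}{1} ({2}){3}",
        openingTag, title, updated, closingTag);

    return result;
}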


You'll notice that the replacement function references several named groups in the match pattern. This is a useful technique, especially when crafting complex results.
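For readers who want to run the technique end to end, here is a self-contained sketch under a few stated assumptions. The dictionary stands in for whatever lookup supplies the update date and is purely hypothetical, as are the class and method names. The date is written in a fixed yyyy-MM-dd format so that the output does not depend on the machine's culture settings, and the closing tag is matched with </a\s*> so that two links on one line stay separate matches.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LinkDater
{
    // Hypothetical stand-in for a real "date updated" lookup.
    private static readonly Dictionary<string, DateTime> Updated =
        new Dictionary<string, DateTime>
        {
            { "a.html", new DateTime(2003, 1, 15) },
            { "b.html", new DateTime(2003, 2, 20) }
        };

    private const string Pattern =
        @"(?<open><a.*?href='(?<url>.*?)'.*?>)(?<title>.*?)(?<close></a\s*>)";

    // Matches the MatchEvaluator prototype: takes a Match, returns a string.
    private static string AppendDate(Match m)
    {
        DateTime when;
        if (!Updated.TryGetValue(m.Groups["url"].Value, out when))
        {
            return m.Value; // unknown URL: leave the link unchanged
        }
        string date = when.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
        return String.Format("{0}{1} ({2}){3}",
            m.Groups["open"].Value, m.Groups["title"].Value, date, m.Groups["close"].Value);
    }

    public static string Annotate(string html)
    {
        return Regex.Replace(html, Pattern, new MatchEvaluator(AppendDate));
    }

    public static void Main()
    {
        Console.WriteLine(Annotate("<a href='a.html'>First</a> and <a href='b.html'>Second</a>"));
        // <a href='a.html'>First (2003-01-15)</a> and <a href='b.html'>Second (2003-02-20)</a>
    }
}
```

Returning m.Value for an unrecognized URL is a design choice of this sketch: the evaluator must always return something, and returning the original match text leaves that part of the input untouched.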

Your training is now complete
By now you've seen almost everything there is to see regarding the Regular Expression language and code tools provided by the .NET Framework. The remaining elements of each of those will be easy for you to comprehend given the background provided in these articles. As I mentioned in the first article in the series, RegEx represents one of the most valuable tools in my tool belt. It is my sincere hope that you've gained a strong understanding of RegEx and that it will be as valuable a tool for you and your projects as it is for me.

RegEx syntax
Throughout this series of articles, I've included a growing syntax reference to the RegEx language. The following is the summary of all the sequences covered to date.

Literal Strings and Anchors
Literal characters: any character that is not itself a special character matches itself. Special characters include, but are not limited to, ., ^, $, and \.
^ Signifies the beginning of the string or a line in the string, depending on the options.
$ Signifies the end of the string or a line in the string, depending on the options.
\A Signifies the absolute beginning of the string, independent of the multiline option.
\Z Signifies the end of the string or the position just before a final newline, independent of the multiline option.
\z Signifies the very end of the string, after any final newline, independent of the multiline option.
\b Signifies a word boundary, defined as any place where a word character (\w) transitions to or from a non-word character (\W).
\B Signifies a position that is not a word boundary.
Character Escapes (More than these exist, but these are the most common.)
\ followed by any special character The character being escaped, for example \$
\\ The backslash character
\r and \n Carriage return and new line, respectively
\t The tab character
\x## Matches any ASCII character in the hexadecimal form of exactly two digits.
\u#### Matches any Unicode character in the hexadecimal form of exactly four digits.
Character Classes
. Matches any character, except the newline (\n).
\w Matches any word character. In standard ASCII, that is any alpha character (a-z and A-Z), any numeric character (0-9), and an underscore (_).
\W Matches any non-word character.
\d Matches any numeric digit (0-9).
\D Matches any character that is not a numeric digit.
\s Matches any white-space character, including tabs, carriage returns, and newlines.
\S Matches any non-white-space character.
Character Sets
[abcd] Matches any character designated within the set.
[^abcd] Matches any character not in the set.
[0-9a-z] The hyphen (-) character specifies a range of characters within a set.
Quantifiers
{n} Match exactly n instances of the preceding sequence.
{n,m} Match at least n and no more than m instances of the preceding sequence. The upper bound may be left blank, as in {n,}, to designate no upper limit.
* Matches zero or more instances of the preceding sequence. Functionally equivalent to {0,}.
? Matches zero or one instances of the preceding sequence. Functionally equivalent to {0,1}.
+ Matches one or more instances of the preceding sequence. Functionally equivalent to {1,}.
*?, +?, ??, {n,m}? Lazy match; appending a question mark to any quantifier specifies that the match should go only as far as needed for the rest of the expression to match.
Grouping and Capture
( … ) Place the contained sequence in a capture group.
(?: … ) Place the contained sequence in a non-capture group.
(?<name> … ) Place the contained sequence in a named capture group. You may also use single quotes to designate the name, as in (?'name' … )
Alternation Constructs
expr1|expr2 Matches any of the expressions separated by the vertical bar character. The leftmost expression wins.
(?(Name)yes|no) Matches the expression in the "yes" position if the named group has a successful capture, otherwise matches the expression in the "no" position. The "no" expression can be omitted.
Backreferences
\# Refers to the numbered group identified by the number following the backslash, as in \1.
\k<name> Refers to the named group identified in the sequence.
Subexpressions
(?= … ) Positive lookahead; continues to match only if the subexpression exists to the right. Does not capture the contents of the subexpression.
(?<= … ) Positive lookbehind; continues to match only if the subexpression exists to the left. Does not capture the contents of the subexpression.
(?! … ) Negative lookahead; continues to match only if the subexpression does not exist to the right. Does not capture the contents of the subexpression.
(?<! … ) Negative lookbehind; continues to match only if the subexpression does not exist to the left. Does not capture the contents of the subexpression.
Miscellaneous
(?ixms: …) Set the appropriate option for the contents of the group.
(?-ixms: …) Unset the appropriate option for the contents of the group.
(?# … ) Embed a comment in the expression starting with the # and ending at the first closing parenthesis.


Copyright ©1995-2003 CNET Networks, Inc. All Rights Reserved.

Dan Frumin Articles on TechRepublic

RegEx Book 
Mastering Regular Expressions by Jeffrey Friedl. (O'Reilly)

This is an excellent cheat sheet type reference.
http://www.evolt.org/article/rating/20/22700/

A Regular Expression library
http://www.regxlib.com/

An online RegEx tester
http://www.regxlib.com/RETester.aspx

A .Net regular expression builder and tester
http://www.codeproject.com/dotnet/expresso.asp?target=expresso

A tutorial with examples
http://www.3leaf.com/default/NetRegExpRepository.aspx#Tutorial