Tech Republic
Getting familiar with Regular Expressions
Dan Frumin
A short while ago, I
was working on a simple Web site project when I
ran into an interesting problem. The application had to accept HTML
input in a text box but support only a subset of HTML tags. That is to
say, simple formatting tags would be supported, while complex tags,
such as scripting and hyperlinks, would not.
At first glance, a simple string replacement might seem in order.
Unfortunately, some of the complex tags take on numerous forms. For
example, a hyperlink tag starts with <a but doesn't end until the
closing >, so simple string replacement was not enough. Luckily for
me, I was able to count on an old standby tool: Regular Expressions.
About Regular Expressions
Long considered one of the most powerful and most arcane of
languages,
the Regular Expression language (nicknamed RegEx) saw its humble
beginnings in the world of UNIX. In the 1970s, access to computing
resources shifted from punch cards to line terminals, and all input and
output was handled one line at a time in the form of text.
A number of tools arose to help programmers deal with those text files.
Popular among them was grep, which
allowed users to find substrings in a file, sed, which allowed
users to replace substrings in files, and ed, which allowed
users to edit files one line at a time. A lot of the operations were
too complex to be described by positional definition (e.g., change the
20th through 24th character), and so regular expressions were born.
The language of Regular Expressions is largely mathematical. The full
syntax itself can be daunting, but it's incredibly powerful in its
ability to describe very complex substrings and variations. The good
news is that the basic syntax is actually easy to master and offers a
significant tool to developers. Using Regular Expressions, you can
match, capture, replace, and split substrings, all using the same
syntax notation and a few lines of code.
One problem with Regular Expressions is that there are several
implementations, some of which use small variants on the syntax. But
practically all implementations support the same syntactic elements,
even if the actual character representations are slightly different.
Putting Regular Expressions to work
Regular Expressions are powerful, but what can you do with them? Within
the .NET Framework, you can use Regular Expressions in validation
controls to quickly and easily validate text input. For example, you
can validate that the user entered a valid zip code, vehicle license
plate number, social security number, and so on. RegEx is also useful
for matching and extracting substrings out of a bigger string or file
contents. For instance, you could write a Regular Expression to extract
every URL from an HTML file or every e-mail address from a standard
SMTP mail header.
Finally, you can use RegEx to transform one string into another using
the replacement constructs. For example, you could take a
comma-delimited (CSV) file and invert the order of the input fields to
result in a new output file.
Basic Regular Expressions
Let's start by picking up some basic Regular Expressions syntax you can
use in your toolset.
Literal Strings and Anchors
The basis of any string expression is the literal matching of any one
character or set of characters. With the exception of special syntax
elements, RegEx assumes that an expression of the form john
will match the literal substring john. In addition, RegEx
offers two position-based constructs (called anchors). The special
character ^ signifies the beginning of a line, while $ signifies the
end of a line. For example, ^foo will match an occurrence of foo
at the beginning of a line. In the same fashion, ^this is a line$
will match this is a line only if it exists as a whole line of
text, from beginning to end.
Character escapes
You'll often need to match some common special characters within an
expression. RegEx supports most of the same character escapes as the
rest of your code. These include \r (carriage return), \n (new line),
and \t (tab).
In addition, you can escape any special character within RegEx. You can
always match a special character literally by preceding it with
a backslash. For example, \^ will match a literal caret (^) and \$ will
match a literal dollar sign ($).
Character classes
RegEx enables you to designate certain types of characters as a
class. The simplest character class is represented by the period (.),
and it designates any character in the string except a new line. This
is our generic reusable character and comes in very handy in writing
expressions.
RegEx also offers a way to designate other special character classes.
For example, you can use \w to designate any word character, generally
considered the alpha characters (a-z and A-Z), the numeric characters
(0-9), and an underscore (_). The \d sequence specifies any numeric
character in the range of zero through nine (0-9). You can use \s to
designate any white-space character, including spaces, tabs, and new
lines.
In some cases, you may want to use the character classes in
exclusionary form. In other words, you might need to specify any
non-numeric character or any non-white-space character. RegEx meets
this need by capitalizing the designator within the sequence. So, for
example, \S refers to any non-white-space character.
Character sets
The last construct we'll introduce in this article is the character
set. Let's assume that you need a validation expression for a phone
number. The rules for phone numbers require that the first digit in an
area code not be a zero (0) or a one (1), because those numbers are
used to designate country codes. So clearly, you can't use \d to
designate the first digit of the phone number.
To solve this, RegEx lets you use square brackets to designate a
character set. For the example above, you might use [23456789] to
designate any character in the set. The RegEx language also allows the
use of a hyphen to designate a range, as in [2-9]. You should note that
you can combine multiple set definitions as well. For instance, the set
[A-Za-z0-9_] is equivalent to the character class \w.
Just as it does with character classes, RegEx offers a way to invert the
meaning of a character set. If the first character in the set is a
caret (^), the set takes on the inverse specification and in turn
refers to any character that is not in the set. As an example, you
might create the set [^0-9] to refer to any non-numeric character, the
equivalent of \D.
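To see character sets in action, here is a minimal C# sketch that uses the .NET Regex class (covered later in this series). The class name CharacterSetSketch, its helper methods, and the sample inputs are made up for illustration and are not part of the original article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; names and sample inputs are illustrative only.
public static class CharacterSetSketch
{
    // True when the first character of the input is in the set [2-9].
    public static bool StartsWithTwoThroughNine(string input)
    {
        return Regex.IsMatch(input, "^[2-9]");
    }

    // True when the input contains at least one character outside 0-9.
    public static bool HasNonDigit(string input)
    {
        return Regex.IsMatch(input, "[^0-9]");
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithTwoThroughNine("415"));  // True
        Console.WriteLine(StartsWithTwoThroughNine("115"));  // False
        Console.WriteLine(HasNonDigit("12a45"));             // True
        Console.WriteLine(HasNonDigit("12345"));             // False
    }
}
```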
Here's a tip
Regular Expressions are incredibly powerful, but they take a while to
explain thoroughly. The above syntax elements lay a strong groundwork
for accomplishing tasks with the RegEx language but may not be enough
for general use.
Here's a tip that will help you craft more powerful expressions for use
in your code: The character sequence .* represents a "match zero or
more characters" operation. For example, john.*doe will match a
string that begins with john and continues through the last
instance of doe on the same line, because .* matches as many characters
as it can.
Regular Expressions in validation controls
One of the many improvements in ASP.NET is its offering of both server-
and client-side validation controls. Responding to the many developers
who have to write code for validating input, the ASP.NET team created a
series of controls to test for common cases, such as required text or
comparison (matching a single string). However, this is not enough;
text input can range from names to zip codes and phone numbers to order
numbers. To address this, the .NET Framework offers the
RegularExpressionValidator control.
This control has a property named ValidationExpression, which
designates the RegEx string to test the value of the text box against.
If the input string fails to match the expression, the validation
control is tripped and an error message is displayed to the user. It's
worth noting that this control implicitly tests the whole string, as if
the expression were actually written within a set of anchors of the
form of ^…$. We don't need to delve too deeply into this control, since
you'll find it easy to use.
Here are a few examples of validation strings based on the syntax you
learned above:
- Basic zip code consisting of five numbers: \d\d\d\d\d
- Phone number: [2-9]\d\d-\d\d\d-\d\d\d\d
- License plate consisting of three letters and three numbers:
[a-zA-Z][a-zA-Z][a-zA-Z]\d\d\d
- Any string of five characters starting and ending with a hyphen:
-...-
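If you want to try validation strings like these outside of a validation control, a small C# sketch along the following lines may help. Because the validator implicitly tests the whole string, the sketch adds the ^ and $ anchors itself. The ValidationSketch class, the IsWholeMatch helper, and the sample inputs are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper that mimics the validator's whole-string test.
public static class ValidationSketch
{
    // Adds ^ and $ so the entire input must match the pattern.
    public static bool IsWholeMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, "^" + pattern + "$");
    }

    public static void Main()
    {
        string zip = @"\d\d\d\d\d";
        string phone = @"[2-9]\d\d-\d\d\d-\d\d\d\d";
        string plate = @"[a-zA-Z][a-zA-Z][a-zA-Z]\d\d\d";

        Console.WriteLine(IsWholeMatch("90210", zip));           // True
        Console.WriteLine(IsWholeMatch("9021", zip));            // False
        Console.WriteLine(IsWholeMatch("555-867-5309", phone));  // True
        Console.WriteLine(IsWholeMatch("155-867-5309", phone));  // False
        Console.WriteLine(IsWholeMatch("ABC123", plate));        // True
        Console.WriteLine(IsWholeMatch("-abc-", "-...-"));       // True
    }
}
```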
The Regular Expression Workbench
Eric Gunnerson of the MSDN team has written a useful tool for
experimenting with Regular Expressions. It's called the Regular
Expression Workbench and is available through the gotdotnet Web site. You can use
the site's search feature to locate the tool within the User Samples
section. I recommend it as a great way to learn Regular Expressions, as
well as to develop and test complex expressions later on in your career.
RegEx Syntax reference so far
Throughout this series of articles, we'll offer a growing syntax
reference to the RegEx language. Table A shows all the
sequences we've covered so far.
Table A

Literal strings and anchors
Sequence | Meaning
Any character that is not a special character (^, $, and \ are examples of special characters) | Itself.
^ | Signifies the beginning of a line in the string.
$ | Signifies the end of a line in the string.

Common character escapes
Sequence | Meaning
\ followed by any special character | The character being escaped, for example \$.
\\ | The backslash character.
\r and \n | Carriage return and new line, respectively.
\t | The tab character.
\x## | Matches the ASCII character whose code is given as exactly two hexadecimal digits.
\u#### | Matches the Unicode character whose code is given as exactly four hexadecimal digits.

Character classes
Sequence | Meaning
. | Matches any character, except the new line ('\n').
\w | Matches any word character. In standard ASCII that is any alpha character (a-z and A-Z), any numeric character (0-9) and an underscore (_).
\W | Matches any non-word character.
\d | Matches any numeric digit (0-9).
\D | Matches any non-digit character.
\s | Matches any white-space character including tabs, carriage returns, and new lines.
\S | Matches any non-white-space character.

Character sets
Sequence | Meaning
[abcd] | Matches any character designated within the set.
[^abcd] | Matches any character not in the set.
[0-9a-z] | You can use the hyphen (-) character to specify a range of characters within a set.

Special tip
Sequence | Meaning
.* | Matches zero or more characters, as many as possible.
Regular Expressions: Understanding sequence repetition and grouping
Quantifiers
In
my last article,
I showed you how certain sequences were regularly repeated. For
example, to specify a ZIP code, I had to provide the sequence
\d\d\d\d\d. You might expect that there is a way to provide
quantitative guidelines to a RegEx expression. And you would be right.
RegEx
allows you to specify that a particular sequence must show up exactly
five times by appending {5} to its syntax. For example, the expression
\d{5} specifies exactly five numeric digits. You can also specify a
series of at least four and no more than seven characters by appending
{4,7} to the sequence.
Similarly, the expression [A-Z]{3,6}
specifies three to six instances of the character set consisting of
uppercase letters. The expression can leave out the second
designator, implying an unlimited number of repetitions. The first
designator cannot be left out in .NET, which treats {,6} as literal
text, so if you’re looking for a number up to six digits
long, you would use \d{0,6}. Similarly, a word that is at least four
characters long can be expressed as \w{4,}.
In addition to the
generic syntax above, RegEx offers shortcuts for designating
quantifiers. The question mark character (?) is used to designate zero
or one matches (equivalent to {0,1}). The asterisk character (*) is
used to designate zero or more matches (equivalent to {0,}). Lastly,
the plus character (+) is used to designate one or more matches
(equivalent to {1,}). Using these sequences can make your expressions
faster to write and easier to read.
Here are examples of how you might use the above constructs:
- A simple ZIP code: \d{5}
- A phone number with or without hyphens:
[2-9]\d{2}-?\d{3}-?\d{4}
- Any two words separated by a space: \w+ \w+
- One or two words separated by a space:
\w* ?\w+
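The quantifier examples above can be tried with a short C# sketch such as the one below. It anchors each pattern with ^ and $ so that the whole input has to match. The QuantifierSketch class, its helper, and the sample inputs are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for trying out quantifiers.
public static class QuantifierSketch
{
    // Anchors the pattern so the whole input must match.
    public static bool IsWholeMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, "^" + pattern + "$");
    }

    public static void Main()
    {
        string phone = @"[2-9]\d{2}-?\d{3}-?\d{4}";

        Console.WriteLine(IsWholeMatch("12345", @"\d{5}"));           // True
        Console.WriteLine(IsWholeMatch("1234", @"\d{5}"));            // False
        Console.WriteLine(IsWholeMatch("555-867-5309", phone));       // True
        Console.WriteLine(IsWholeMatch("5558675309", phone));         // True
        Console.WriteLine(IsWholeMatch("hello world", @"\w+ \w+"));   // True
        Console.WriteLine(IsWholeMatch("abc", @"\w{4,}"));            // False
    }
}
```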
Grouping
So
far, you’ve seen how to quantify sequences of single characters within
a string. You also know that a sequence of literals (for example, joe)
designates the substring itself. But what if you want to quantify the
literal substring of characters? The RegEx language offers the grouping
construct for this purpose. To designate a group, you enclose it in
parentheses.
For example, (abc) is the sequence abc within the string. By
itself, it’s not different than the literal abc.
However, when you apply some of the quantifiers, this construct becomes
very powerful, especially when you consider that a group can contain
complete RegEx sequences.
I previously wrote the expression for
a simple ZIP code as \d{5}. However, ZIP codes also have an optional
section that appends a hyphen and four more digits. The optional
section is easily defined as -\d{4}. But how do you tell the RegEx
engine that it’s optional? You might remember the question mark is used
to match zero or one patterns.
A complex ZIP code is then
expressed as \d{5}(-\d{4})?. To understand it, see that the group
construct was applied to the optional section and then designated as
matching zero or one times using the question mark quantifier. You can
use the other quantifiers to control the matches within the expression.
For example, (abc){3} designates the sequence abcabcabc.
Capturing
The
grouping construct also carries a secondary meaning within the RegEx
language. It creates a mechanism to capture a matching substring for
future use such as extraction or replacement.
By
default, any group you designate within an expression is a capturing
group. Groups are numbered from left to right in order of opening
parentheses, even if the groups are nested. The group with index 0 is a
special group that contains the full match, as if the whole expression
were wrapped in a set of parentheses.
Let’s break down a nested
grouped expression for a phone number: (([2-9]\d{2})-)?(\d{3})-(\d{4}).
Notice that this expression contains a number of groups, some nested
and some not.
The first set of parentheses captures the first
three digits of the area code as well as the hyphen that follows. We
need to put this group here because the area code is optional, as
designated by the question mark following its definition. We nested a
second group to allow us to extract just the area code itself. The next
two sets of parentheses are more obvious in their capture. At this
point, you can refer to a number of sections within this substring
using the group notation.
There’s no question that tracking
groups by number is a rather tedious task. It’s further complicated by
the fact that you were forced to designate a group (the first one) that
you really didn’t care much about, just so you could specify that the
area code is optional. RegEx addresses both of these issues quite
elegantly using named groups and noncapturing groups. A named group is
designated using the syntax (?<name> … ).
As a special case, a group that begins with ?: is a noncapturing group.
In the
above example, you could have used (?:([2-9]\d{2})-)? to designate that
the area code and hyphen group is a noncapturing group. This helps
eliminate some of the groups that are there purely for expression
reasons and not for reusability reasons.
If you apply both of
these techniques to the ZIP code expression, you end up with
(?<full>(?<base>\d{5})(?:-(?<ext>\d{4}))?). Now
you're able to refer to the full ZIP code, the base part, and the
extended part individually by name. The expression also uses a
noncapturing group to ignore the hyphen in the extended part. One
interesting side effect of naming groups is that all named groups are
numbered after all nonnamed groups, throwing off the order of opening
parentheses.
The .NET Framework Regex class
There
is still much more to the Regular Expression language, but it’s time to
shift focus to some code examples. To work with Regular Expressions,
the .NET Framework offers the Regex class in the
System.Text.RegularExpressions namespace. (In this article, I’ll cover
only two simple methods of the class, saving some of the more complex
uses of this class for the next article in this series.)
Regex.IsMatch
The
.NET Framework String class offers the IndexOf method to determine
whether one string contains another. However, its use is intrinsically
limited to a literal string. What if you want to determine whether a
string contains another string that is defined with a Regular
Expression? The .NET Framework offers a static method of the Regex
class for just this reason.
Let’s say that you want to determine
whether a particular input string contains a ZIP code. You already know
how to write the regular expression for a ZIP code. All you have to do
is use the IsMatch method to apply the test:
bool hasMatch = Regex.IsMatch(inputString, @"\d{5}(-\d{4})?");
It’s worth noting that the IsMatch method will return a true value if the
match exists anywhere within the string. In general, you know enough
about the input string to not worry about this. However, if you need to
specify that the whole string should match the expression, you can use
the ^ and $ anchors:
bool hasMatch = Regex.IsMatch(inputString, @"^\d{5}(-\d{4})?$");
The
astute reader will wonder how to determine whether multiple matches
exist and where they show up within the string. Both of these features,
and more, are available in the Regex class and will be covered in the
next article in this series.
Regex.Replace
Similar to the Replace method of the String class, the Regex class also
offers a way to replace substrings defined as Regular Expressions. Let’s
assume that you're writing a simple HTML interpreter. One of HTML’s
features is that it collapses any white-space sequence in the input to
a single space in the output. You can use the Regex class to achieve
the same thing using the following code:
string result = Regex.Replace(inputString, @"\s+", " ");
This
is a very simple example, but it illustrates how you can do some very
neat things with the RegEx engine. The Replace method is actually far
more powerful in that it allows you to refer to capture groups defined
in the expression.
There are two simple ways to refer to capture
groups in the replacement string. A dollar sign followed by any number
refers to a capture group by number. The sequence $0 refers to group
zero, which is the special group for the whole match. The
sequence $2 then refers to the second group. In addition, you can
specify a named capture group using the syntax ${name}.
Let’s look at two examples, both of which assume they are called using
this syntax:
string result = Regex.Replace(inputString, pattern, replace);
// ensure that a phone number is hyphenated
pattern = @"([2-9]\d{2})-?(\d{3})-?(\d{4})";
replace = "$1-$2-$3";
// invert the order of the first and last names, ignoring the middle
pattern = @"(?<first>\w+) (?:\w+ )*(?<last>\w+)";
replace = "*** ${last}, ${first} ***";
Note
that only the group reference characters are special in the replacement
string. In the above example, the asterisks are interpreted as literal
characters within the string. However, if you would like to place a
literal dollar sign in the replacement string, you can specify it using
$$.
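Here is one way the replacement examples above might be put together into a complete program. Treat it as a sketch: the ReplaceSketch class, its method names, and the sample input strings are assumptions added for illustration, while the patterns and replacement strings come from the examples above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the replacement examples.
public static class ReplaceSketch
{
    // Rewrites a phone number so that it is always hyphenated.
    public static string HyphenatePhone(string input)
    {
        return Regex.Replace(input, @"([2-9]\d{2})-?(\d{3})-?(\d{4})", "$1-$2-$3");
    }

    // Inverts first and last names, ignoring any middle names.
    public static string LastNameFirst(string input)
    {
        return Regex.Replace(input, @"(?<first>\w+) (?:\w+ )*(?<last>\w+)",
                             "*** ${last}, ${first} ***");
    }

    // "$$" emits a literal dollar sign; "$0" is the whole match.
    public static string PrefixDollar(string input)
    {
        return Regex.Replace(input, @"\d+", "$$$0");
    }

    public static void Main()
    {
        Console.WriteLine(HyphenatePhone("Call 5558675309 today")); // Call 555-867-5309 today
        Console.WriteLine(LastNameFirst("John Quincy Adams"));      // *** Adams, John ***
        Console.WriteLine(PrefixDollar("Total: 25"));               // Total: $25
    }
}
```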
Increase your knowledge of Regular Expression syntax
"Lazy" matching
By
now, I hope you've had a chance to use regular expressions in your
code. If you have, you might have run into a somewhat common and
unfortunately complex problem.
Often, you will use the asterisk
or plus quantifiers to designate that a particular sequence or pattern
repeats. As it happens, the RegEx engine will match that pattern in
what is termed a "greedy" behavior. That means it will match as much of
the string as possible before testing the next sequence. In some cases,
the greedy behavior is not what you want, and instead you need to use
"lazy" matching.
This is probably best illustrated with an
example. Assume that you are trying to extract anchor tags within an
HTML file. Since the anchor tag starts with <a and ends with >,
your first instinct might be to use <a.*>. The problem with this
is that the .* sequence is applied in greedy fashion to match any
character, including the greater-than sign, as many times as possible.
If your HTML input string consists of "<a
href=foo>bar</a>", the expression will match the whole string.
Essentially, the .* sequence continues to match any character until the
whole string is tested. This is clearly not what you want.
One
alternative is <a[^>]*>, which translates to a string that
begins with <a and greedily matches as many non-greater-than
characters as possible, followed by a greater-than. This is a viable
solution but limited in its applicability. A more generic solution is
to use lazy matching, which asks RegEx to match as few characters as
possible while still successfully applying the expression. A lazy match
is defined by appending the question mark (?) to any quantifier, as in
<a.*?>.
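To compare the three anchor-tag expressions side by side, you could run a sketch like the following against the sample HTML string used above. The LazySketch class and the FirstMatch helper are hypothetical names introduced for this example. Regex.Match returns the first match found in the input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing greedy and lazy matching.
public static class LazySketch
{
    // Returns the text of the first match, or an empty string if none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string html = "<a href=foo>bar</a>";

        Console.WriteLine(FirstMatch(html, "<a.*>"));     // <a href=foo>bar</a>
        Console.WriteLine(FirstMatch(html, "<a[^>]*>"));  // <a href=foo>
        Console.WriteLine(FirstMatch(html, "<a.*?>"));    // <a href=foo>
    }
}
```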
Regular Expression options
The
RegEx engine supports a number of options that can be set either in
code or within an expression. The most popular options are described
below along with each option's programmatic name and its inline
character for use within an expression.
IgnoreCase (i)
The IgnoreCase option specifies that searching and matching should be
done in case-insensitive fashion.
ExplicitCapture (n)
The
ExplicitCapture option specifies that groups should default to
noncapturing mode, such that only named groups—e.g., (?<name> …
)—are captured. This is useful if you have an expression that contains
a lot of noncapturing groups and don't want to specify them using the
(?: … ) syntax.
Multiline (m)
The Multiline option specifies that the string should be treated as a
series of lines. It changes the caret and dollar anchors so that they
match the beginning and end of each line, not just the beginning and
end of the whole string. In .NET, it does not change the period
character (.), which never matches the newline character (\n) unless
the Singleline option is set.
Singleline (s)
The Singleline option specifies that the input should be treated as one
long string, taking away the special meaning of the newline character
with regard to the period, so that the period matches every character,
including \n. In .NET, it does not affect the caret and dollar anchors,
so it can be combined with the Multiline option.
To set options within the expression, you
create a noncapturing group and add the option modifiers to the group
definition. For example, (?s-in: … ) turns on the Singleline option and
turns off the IgnoreCase and ExplicitCapture options. To set these
options programmatically, you can use the RegexOptions enumeration,
which is often accepted as a parameter to Regex methods.
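The sketch below is one way to observe these options in .NET, both through the RegexOptions enumeration and through an inline group. The OptionsSketch class, the CountMatches helper, and the sample strings are assumptions made for this example. Note that in .NET the Multiline option affects only ^ and $, while the Singleline option affects only the period.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for experimenting with RegexOptions.
public static class OptionsSketch
{
    // Counts all matches of the pattern under the given options.
    public static int CountMatches(string input, string pattern, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    public static void Main()
    {
        // IgnoreCase, set programmatically and inline.
        Console.WriteLine(Regex.IsMatch("HELLO", "hello"));                            // False
        Console.WriteLine(Regex.IsMatch("HELLO", "hello", RegexOptions.IgnoreCase));   // True
        Console.WriteLine(Regex.IsMatch("HELLO", "(?i:hello)"));                       // True

        // Multiline lets ^ and $ match at each line.
        Console.WriteLine(CountMatches("AAA\nBBB", @"^\w+$", RegexOptions.None));      // 0
        Console.WriteLine(CountMatches("AAA\nBBB", @"^\w+$", RegexOptions.Multiline)); // 2

        // Singleline lets the period match a newline.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));                               // False
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline));      // True
    }
}
```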
More anchors
You've already learned about the ^ and $ anchors, but the Regular
Expression language offers a few other options for anchoring matches to
extend your ability to define expressions. The first is \b, which
specifies that a match must happen at a word boundary. A word
boundary is defined as the transition from a word character (\w) to a
nonword character (\W), or vice versa. That means that
white space, punctuation, and symbols all define word boundaries.
As
an example, \b[aA]\w* can be used to define any word that starts with
the letter A (in either uppercase or lowercase). Another example is
\w*ing\b, which defines any word that ends with the sequence ing.
The sequence \B designates that a match should not happen at a word
boundary. Thus, \w*\Bing\B\w* will match words that contain the
sequence ing somewhere in the middle.
The RegEx engine
also offers three other interesting anchors. The \A anchor defines the
absolute beginning of the string, independent of the Multiline option.
By extension, the \Z anchor defines the absolute end of the string, not
including a terminating newline character. These two anchors take on
the same meaning that ^ and $ have when the Multiline option is not
set. And the \z anchor defines the very end of the string, inclusive
of any terminating newline character and independent of the Multiline
option.
Here are a few examples to demonstrate the use of these anchors:
inputString = "AAA\nBBB\nCCC\n";
// (?m:\w+$) matches three times, once for each line
// (?m:\A\w+) matches only AAA
// (?m:\w+\Z) matches only CCC
// (?m:\w+\z) returns no matches, since there's a terminating newline
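If you would like to confirm the anchor behavior listed above for yourself, a sketch such as this one counts the matches for each expression. The AnchorSketch class and its CountMatches helper are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class that counts matches for each anchor example.
public static class AnchorSketch
{
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string inputString = "AAA\nBBB\nCCC\n";

        Console.WriteLine(CountMatches(inputString, @"(?m:\w+$)"));   // 3
        Console.WriteLine(CountMatches(inputString, @"(?m:\A\w+)"));  // 1
        Console.WriteLine(CountMatches(inputString, @"(?m:\w+\Z)"));  // 1
        Console.WriteLine(CountMatches(inputString, @"(?m:\w+\z)"));  // 0
    }
}
```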
More about the Regex class: compiled expressions
The
Regex class can actually be used in two ways. The simplest is to use a
set of static methods that allow you to pass the expression as a
literal string. This is the most common method of using this class in
one-off situations.
However, this particular method is also less
efficient because the expression must be interpreted and compiled into
an internal representation with each call. An alternative is to use an
instance method of the object with a compiled expression. This instance
object can then be used again and again without incurring the cost of
the expression compilation. For example:
Regex re = new Regex("<a.*?>");
foreach (string s in listOfStrings)
{
    if (re.IsMatch(s))
    {
        // we have a match
    }
}
Regex.Match and the Match class
You
saw earlier that you can use the Regex object to determine whether
there is a match within a string. But what if you want to extract the
value of the match from the string? The Match method of the Regex class
returns a Match object for the first match in the string:
Match m = Regex.Match(inputString, @"(?<base>\d{5})(?:-(?<ext>\d{4}))?");
The first thing you can do is check whether the match was successful:
if (m.Success)
{
    // use the match
}
Now you have a Match object that you can use. Below are some simple
members of this object:
// get the starting point and length of the match
int start = m.Index;
int length = m.Length;
// get the fully matched string of the whole match
string full = m.Value;
// use them all
Console.WriteLine("{0} at {1} is {2} chars long", full, start, length);
Match.Groups Collection
You
might notice that the expression above uses two named capturing groups.
The Match object allows you to access these using the Groups
collection. A single item in the collection returns a Group object that
also supports the Success, Index, Length, and Value members. To get at
a single Group object, you can access the collection by name or by
index. Here are a few examples:
// display all the groups, including 0 (the whole-match group)
for (int i = 0; i < m.Groups.Count; i++)
    Console.WriteLine("Group {0}={1}", i, m.Groups[i].Value);
// output the zip code
string zipBase = m.Groups["base"].Value;
string zipExt = m.Groups["ext"].Value;
Console.WriteLine("Base {0}, Extended {1}", zipBase, zipExt);
Using the Groups collection, you can extract a lot of information out
of the match. As I mentioned in the previous article on sequence
repetition, named groups are always numbered after nonnamed groups,
so the code above can be useful in experimenting with group numbers.
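As a small experiment with group numbering, the following sketch mixes a named group with an unnamed one. The GroupNumberSketch class, the Words field, and the sample input are assumptions made for this example. The pattern is chosen so that the named group appears first in the expression but receives the higher number.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class showing how .NET numbers groups.
public static class GroupNumberSketch
{
    // A named group (first) followed by an unnamed group.
    public static readonly Regex Words = new Regex(@"(?<first>\w+) (\w+)");

    public static void Main()
    {
        Match m = Words.Match("hello world");

        // The unnamed group is numbered 1; the named group comes after it.
        for (int i = 0; i < m.Groups.Count; i++)
            Console.WriteLine("Group {0}={1}", i, m.Groups[i].Value);
        // Group 0=hello world
        // Group 1=world
        // Group 2=hello

        Console.WriteLine(Words.GroupNumberFromName("first")); // 2
    }
}
```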
Match.NextMatch and Regex.Matches
In
many cases, the input string will contain multiple matches to a
particular Regular Expression. You have several ways to access all the
possible matches.
The first is to use the NextMatch method of the Match object. You can
use this method as follows:
while (m.Success)
{
// use the match
m = m.NextMatch();
}
This
is a simple way to iterate through the matches. Another method is to
use the Matches method of the Regex class, which returns a collection
of Match objects. You can then iterate the collection in typical
fashion. For example:
MatchCollection mc = Regex.Matches(inputString, expression);
foreach (Match m in mc)
{
    // use the match
}
Regular Expressions syntax and advanced string replacement
Backreferences
In
some cases, you're going to want to create an expression that
references a previously defined group as part of a match. An expression
can refer to previously defined groups either by number or by name. The
expression \# refers to a previously defined group by number (for
example, \1 is the first group). The expression \k<name> refers
to a previously defined group by name. A simple example is to find any
word which starts and ends with the same character. To do this, I would
use the expression \b(\w)\w*?\1\b, which captures the first letter of a
word in a capture group, then allows for any number of letters in a
lazy match, followed by the contents of the first capture group (the
first letter).
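The backreference expression above can be tried with a sketch like this one, which also shows the same test written with a named group and \k<name>. The BackreferenceSketch class, the AllMatches helper, the group name c, and the sample sentence are hypothetical choices made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for backreferences.
public static class BackreferenceSketch
{
    // Collects the text of every match into an array.
    public static string[] AllMatches(string input, string pattern)
    {
        MatchCollection mc = Regex.Matches(input, pattern);
        string[] values = new string[mc.Count];
        for (int i = 0; i < mc.Count; i++)
            values[i] = mc[i].Value;
        return values;
    }

    public static void Main()
    {
        string text = "level kayak hello area";

        // Backreference by number.
        Console.WriteLine(string.Join(", ", AllMatches(text, @"\b(\w)\w*?\1\b")));
        // level, kayak, area

        // The same test with a named group and \k<name>.
        Console.WriteLine(string.Join(", ", AllMatches(text, @"\b(?<c>\w)\w*?\k<c>\b")));
        // level, kayak, area
    }
}
```

Note that a one-letter word does not match, because the backreference needs a second occurrence of the captured letter.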
Alternation constructs
So
far, you've seen a way that you can specify a match for one of a set of
characters (for example, [AEIOU]). But what if you want to match one of
a set of terms? The RegEx language offers the vertical bar (|)
character to allow this type of matching within a group (either
capturing or not). This character will match any one of a number of
terms, opting for the left-most match first. For example, the
expression (?i:Th(?:is|at)) will match either This or That in a
case-insensitive search. You'll notice that I turned the nested group
into a non-capturing group since I don't care about its results in
particular. Alternatively, I might have specified the ExplicitCapture
option to denote that groups should be non-capturing by default.
In
some cases, you'll want to change the expression you're looking for
based on a previously defined group. For example, let's assume you have
a company with two distinct formats for order IDs. An order starting
with the letter A is five digits long, while an order starting with the
letter B is eight digits long. RegEx allows you to specify this using
this syntax: (?(name)yes|no), where name defines the group name, and
yes and no define the search expressions. In the case of this company,
your expression would end up as:
((?<AThere>A)|B)(?(AThere)\d{5}|\d{8})
You
may choose to leave out the no expression to specify that a search
expression is only necessary if a group is defined. Essentially, the
expression (?(name)yes) is syntactically equivalent to (?(name)yes|),
which is to say that the no expression can match the empty string.
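To check the order ID expression, you might wrap it in ^ and $ and test a few values, as in the sketch below. The ConditionalSketch class, the IsOrderId method, and the sample order IDs are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for the (?(name)yes|no) construct.
public static class ConditionalSketch
{
    // A-orders need five digits; B-orders need eight.
    public static bool IsOrderId(string input)
    {
        return Regex.IsMatch(input, @"^((?<AThere>A)|B)(?(AThere)\d{5}|\d{8})$");
    }

    public static void Main()
    {
        Console.WriteLine(IsOrderId("A12345"));     // True
        Console.WriteLine(IsOrderId("B12345678"));  // True
        Console.WriteLine(IsOrderId("B12345"));     // False
        Console.WriteLine(IsOrderId("A12345678"));  // False
    }
}
```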
A
more interesting application of this syntactic element is the
extraction of URLs from an HTML input string. The HTML syntax allows
attributes, such as the href attribute of the anchor tag to exist in
undelimited form or delimited by either single or double quotes. If the
href is delimited, the URL is represented by everything between the two
delimiters. If it's undelimited, the URL is represented by everything
up to the first white-space character or closing >. Let's analyze
the following expression which can be used to extract the href
attributes:
// in a verbatim string, a double quote is written as two double quotes
string pattern = @"href=(?<d>[""'])?.*?(?(d)\k<d>|[\s>])";
The
first thing you need to do is figure out if the attribute is delimited
by single or double quotes. You use the character set ["'] and denote
it in a group as optional using the question-mark quantifier. You know
that if the attribute is delimited at the beginning, it must be
delimited at the end, so you name the group in order to use it in a
test alternation later. You also know that if the attribute is not
delimited, it must end either in a white space (if there is another
attribute following) or in a greater than (>) at the end of the tag.
To achieve this, use the test and alternation construct to test for the
named group d to be defined. If the group is defined, look for
its value (the delimiter); otherwise, look for either a white space or
a greater than.
While this may seem complex, you'll find that
with a bit of practice you'll be able to generate these types of
expressions more quickly.
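As a sketch of how the href expression might be used, the following code adds a named group around the lazy .*? section so that the URL can be read without its delimiters. The url group, the HrefSketch class, the ExtractUrl method, and the sample tags are assumptions added for this example and are not part of the original expression.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for extracting href values.
public static class HrefSketch
{
    // Same expression as above, with an added (?<url> ... ) group
    // around the lazy section (an assumption made for this sketch).
    private static readonly Regex Href =
        new Regex(@"href=(?<d>[""'])?(?<url>.*?)(?(d)\k<d>|[\s>])");

    // Returns the URL of the first href attribute, or an empty string.
    public static string ExtractUrl(string html)
    {
        return Href.Match(html).Groups["url"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractUrl("<a href=\"page.html\" class=x>")); // page.html
        Console.WriteLine(ExtractUrl("<a href='page.html'>"));           // page.html
        Console.WriteLine(ExtractUrl("<a href=page.html>"));             // page.html
    }
}
```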
Subexpressions
The
last major syntactic element I'll introduce is the subexpression.
Subexpressions allow you to perform tests on the space before and after
a match, without capturing the results of the test into the match
itself. The contents of the test can be any regular expression,
allowing for great flexibility. Subexpression tests are considered to
fall into two directions: lookahead and lookbehind. Lookahead tests
continue only if they succeed to the right, while lookbehind tests
continue only if they succeed to the left. RegEx offers both positive
(the expression must exist) and negative (the expression must not
exist) tests.
Specific examples should help explain:
- Positive lookahead (?= … ). The match will continue only if the
subexpression is found to the right. For example, \w+(?=ing\b) will
match only those sequences of characters followed by ing and a word
boundary (\b). In other words, it finds words ending with ing, although
the match itself stops short of the ing because the lookahead is not
part of the match.
- Positive lookbehind (?<= … ). The match will continue only if the
subexpression is found to the left. For example, (?<=\ba)\w+ will match
the remainder of those words that start with an a; the a itself is
tested but is not included in the match.
- Negative lookahead (?! … ). The match will continue only if the
subexpression is not found to the right. For example, \b\w+\b(?!,) will
match those words that are not followed by a comma.
- Negative lookbehind (?<! … ). The match will continue only if the
subexpression is not found to the left. For example,
(?<![\$\d])\d+\.\d+ will match those numbers with decimals that are not
following a literal dollar sign. The \d in the lookbehind is needed
because a test for \$ alone would let the match begin one digit into a
dollar amount, finding 2.50 inside $12.50.
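The four tests can be compared side by side with a small helper. This is a sketch with made-up sample strings; FindAll is a hypothetical convenience method, not part of the Regex class. For the negative lookbehind, both the plain (?<!\$) form and a form widened to [\$\d] are run, to show that the plain form can start a match one digit into a dollar amount.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: returns the text of every match as an array.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Value;
        }
        return results;
    }

    private static void Show(string input, string pattern)
    {
        Console.WriteLine("{0,-22} => {1}", pattern,
            string.Join(" | ", FindAll(input, pattern)));
    }

    public static void Main()
    {
        Show("singing a song", @"\w+(?=ing\b)");         // sing
        Show("an apple a day", @"(?<=\ba)\w+");          // n | pple
        Show("red, green, blue", @"\b\w+\b(?!,)");       // blue
        Show("$12.50 and 3.75", @"(?<!\$)\d+\.\d+");     // 2.50 | 3.75
        Show("$12.50 and 3.75", @"(?<![\$\d])\d+\.\d+"); // 3.75
    }
}
```

In each case the text inspected by the test stays out of the match, which is why the first two results are fragments of words rather than whole words.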
You might find it interesting that the examples did not include words
that do not start or do not end with a particular character sequence,
such as words that do not end with ing. You might expect that
expressions of the form \b(?<!a)\w+\b and \b\w+(?!ing)\b would work,
but they do not. A subexpression test is applied at the position where
it appears in the pattern. In \b\w+(?!ing)\b, the greedy \w+ has
already consumed the entire word, ing included, so the negative
lookahead inspects the text after the word, finds no ing there, and
succeeds. Lazy matching doesn't help, because the trailing \b forces
\w+ to run to the end of the word anyway. Likewise, in \b(?<!a)\w+\b
the lookbehind inspects the character before the word boundary rather
than the first letter of the word.
So, how do you test for words that do not start or do not end with a
sequence? You must get very creative with subexpressions
to achieve this goal. Two conditions are satisfied whenever a word does
not end in a sequence (like ing). First, there is a sequence of
characters that does not lead to ing. You could test for this, but as I
discussed above, the greedy match will overtake the ing characters.
Second, the sequence of characters ing does not exist to the left of a
word boundary. You can use this second condition in a subexpression
test of the form \b\w+(?<!ing)\b. This expression uses a lookbehind
(to the left) subexpression to achieve what a lookahead cannot. You can
use the analogous form to test for words that do not start with a
sequence by testing that the sequence does not exist to the right of a
word boundary, as in \b(?!a)\w+\b, which tests for words that do not
start with the letter a.
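A quick way to convince yourself of the difference is to run the tempting form and the working form against the same input. The sketch below uses invented sample sentences and a hypothetical FindAll helper.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegativeWordDemo
{
    // Hypothetical helper: returns the text of every match as an array.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Value;
        }
        return results;
    }

    public static void Main()
    {
        string words = "singing bird flying high";

        // Lookahead after a greedy \w+: every word passes the test.
        Console.WriteLine(string.Join(" ",
            FindAll(words, @"\b\w+(?!ing)\b")));   // singing bird flying high

        // Lookbehind at the end of the word: only words not ending in ing.
        Console.WriteLine(string.Join(" ",
            FindAll(words, @"\b\w+(?<!ing)\b")));  // bird high

        // Lookahead at the start of the word: words not starting with a.
        Console.WriteLine(string.Join(" ",
            FindAll("an apple bears fruit", @"\b(?!a)\w+\b"))); // bears fruit
    }
}
```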
Miscellaneous
I previously covered one miscellaneous construct in the language. At
this point, it's worth adding another. To embed a comment inside a
regular expression, you may use (?# … ). The most important thing to
remember about this construct is that it does not allow for nested
parentheses, since the comment will last only until the first closing
parenthesis.
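Here is a small sketch of both points, using an invented seven-digit phone number pattern. The first expression carries two comments and behaves as if they were not there; the second puts parentheses inside a comment, so the comment ends early and the leftover closing parenthesis makes the pattern invalid. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentDemo
{
    // Each (?# ...) comment is skipped by the engine.
    public const string Annotated =
        @"\d{3}(?# exchange)-\d{4}(?# line number)";

    // The comment ends at the first ), so " here)" is read as pattern
    // text and the stray ) is unbalanced.
    public const string Broken = @"(?# see (note) here)\d+";

    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // The Regex constructor rejects patterns it cannot parse.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("555-1234", Annotated)); // True
        Console.WriteLine(IsValidPattern(Annotated));            // True
        Console.WriteLine(IsValidPattern(Broken));               // False
    }
}
```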
More about the
Regex class
In this section, I'll cover a few more members of the Regex class,
which can be very useful in your code.
Regex.Split
You've
already seen how the Regex class offers analogs to several of the
String class members. It so happens that the Regex class offers an
analogous Split member, which allows you to split a string based on a
regular expression:
string[] manyStrings = Regex.Split(inputString,
@"\s+");
The
simple line of code above will split up a string at the points in which
one or more white space characters exist. So an input string of the
form "aa bb cc dd" will result in four strings ("aa", "bb", "cc",
"dd"), even though the actual type and number of white space characters
between the substrings is variable.
You may notice that the delimiter is removed from the resulting array
of strings. In some cases, you may want to have the delimiting
expression show up in the results as well. In the example above, you
may also want to do something with the white space between the strings.
If you would like the delimiting match to be included in the results,
all you have to do is encapsulate it in a capturing group, as in
@"(\s+)".
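The two forms of the split can be compared directly. The sketch below assumes a sample input with mixed spaces, a tab, and a newline; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Sample input: the runs of white space differ in type and length.
    public const string Input = "aa bb \t cc\n dd";

    public static string[] SplitDroppingDelimiters(string input)
    {
        return Regex.Split(input, @"\s+");
    }

    public static string[] SplitKeepingDelimiters(string input)
    {
        // The capturing group makes Split return the delimiters too.
        return Regex.Split(input, @"(\s+)");
    }

    public static void Main()
    {
        // Four pieces: aa, bb, cc, dd
        Console.WriteLine(SplitDroppingDelimiters(Input).Length);

        // Seven pieces: the four words with the three runs of
        // white space in between them.
        Console.WriteLine(SplitKeepingDelimiters(Input).Length);
    }
}
```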
Regex.Replace
using a MatchEvaluator function
In a previous article, I discussed how the Replace member of the class
can be used to manipulate a string using replacement references ($). In
some cases, replacement references won't be enough and you'll want to
call a function. For example, let's say that you have an input file
consisting of HTML source code and you would like to append the date
each URL was updated to the title of the link. Clearly, replacement
references aren't enough since the update date is different for each
URL. The System.Text.RegularExpressions namespace offers a delegate
type called MatchEvaluator for just this reason. As with all delegates,
the method you supply must follow a standard prototype: it accepts a
Match and returns the replacement string. Here is sample code, which
assumes that URLs are encapsulated in single quotes:
// Named groups: the opening tag, its URL, the link title, and the closing tag.
// The closing tag is matched as </a\s*> so it cannot run past the first >.
string pattern =
    @"(?<open><a.*?href='(?<url>.*?)'.*?>)(?<title>.*?)(?<close></a\s*>)";

// get the resulting replacement string
string result = Regex.Replace(inputString, pattern,
    new MatchEvaluator(replaceFunction));
// do something with it…

// Called once per match; the return value replaces the matched text.
public string replaceFunction(Match m)
{
    string openingTag = m.Groups["open"].Value;
    string title = m.Groups["title"].Value;
    string closingTag = m.Groups["close"].Value;
    string url = m.Groups["url"].Value;
    // dateUpdated is the application's own lookup, assumed to return a DateTime.
    string updated = dateUpdated(url).ToShortDateString();
    return String.Format("{0}{1} ({2}){3}",
        openingTag, title, updated, closingTag);
}
You'll
notice that the replacement function references several named groups in
the match pattern. This is a useful technique, especially when crafting
complex results.
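For experimenting, here is a self-contained sketch of the same idea. Everything specific in it is an assumption: the in-memory table stands in for the real update-date lookup, the file names and dates are invented, and the pattern is tightened with [^>] and [^'] so that a match cannot spill into a neighboring tag.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LinkDater
{
    // Hypothetical stand-in for a real "date updated" lookup.
    private static readonly Dictionary<string, DateTime> Updated =
        new Dictionary<string, DateTime>
        {
            { "a.html", new DateTime(2003, 1, 15) },
            { "b.html", new DateTime(2003, 2, 20) }
        };

    private const string Pattern =
        @"(?<open><a[^>]*?href='(?<url>[^']*)'[^>]*>)" +
        @"(?<title>.*?)(?<close></a\s*>)";

    // Matches the MatchEvaluator prototype: takes a Match, returns a string.
    private static string AppendDate(Match m)
    {
        DateTime when;
        if (!Updated.TryGetValue(m.Groups["url"].Value, out when))
        {
            return m.Value; // unknown URL: leave the link untouched
        }
        return String.Format("{0}{1} ({2}){3}",
            m.Groups["open"].Value,
            m.Groups["title"].Value,
            when.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture),
            m.Groups["close"].Value);
    }

    public static string AddDates(string html)
    {
        return Regex.Replace(html, Pattern, new MatchEvaluator(AppendDate));
    }

    public static void Main()
    {
        Console.WriteLine(AddDates(
            "<a href='a.html'>Home</a> and <a href='c.html'>Other</a>"));
        // <a href='a.html'>Home (2003-01-15)</a> and <a href='c.html'>Other</a>
    }
}
```

Returning m.Value for an unknown URL leaves that link unchanged, which is usually safer than substituting a placeholder date.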
Your training is
now complete
By
now you've seen almost everything there is to see regarding the Regular
Expression language and code tools provided by the .NET Framework. The
remaining elements of each of those will be easy for you to comprehend
given the background provided in these articles. As I mentioned in the
first article in the series, RegEx represents one of the most valuable
tools in my tool belt. It is my sincere hope that you've gained a
strong understanding of RegEx and that it will be as valuable a tool
for you and your projects as it is for me.
RegEx syntax
Throughout
this series of articles, I've included a growing syntax reference to
the RegEx language. The following is the summary of all the sequences
covered to date.
Literal Strings and Anchors
Any character that is not a special character itself | Special characters include, but are not limited to, ., ^, $, and \.
^ | Signifies the beginning of the string or a line in the string, depending on the options.
$ | Signifies the end of the string or a line in the string, depending on the options.
\A | Signifies the absolute beginning of the string, independent of the multiline option.
\Z | Signifies the absolute end of the string, up to but not including a trailing newline character, independent of the multiline option.
\z | Signifies the absolute end of the string, including any trailing newline character, independent of the multiline option.
\b | Signifies a word boundary, defined as any place where a word character (\w) transitions to or from a non-word character (\W).
\B | Signifies a position that is not a word boundary.

Character Escapes (More than these exist, but these are the most common.)
\ followed by any special character | The character being escaped, for example \$
\\ | The backslash character
\r and \n | Carriage return and new line, respectively
\t | The tab character
\x## | Matches any ASCII character in the hexadecimal form of exactly two digits.
\u#### | Matches any Unicode character in the hexadecimal form of exactly four digits.

Character Classes
. | Matches any character, except the newline (\n).
\w | Matches any word character. In standard ASCII, that is any alpha character (a-z and A-Z), any numeric character (0-9), and an underscore (_).
\W | Matches any non-word character.
\d | Matches any numeric digit (0-9).
\D | Matches any character that is not a numeric digit.
\s | Matches any white-space character, including tabs, carriage returns, and newlines.
\S | Matches any non-white-space character.

Character Sets
[abcd] | Matches any character designated within the set.
[^abcd] | Matches any character not in the set.
[0-9a-z] | The hyphen (-) character specifies a range of characters within a set.

Quantifiers
{n} | Match exactly n instances of the preceding sequence.
{n,m} | Match at least n and no more than m instances of the preceding sequence. One of the two parameters may be left blank to designate a default value of zero for n and unlimited for m.
* | Matches zero or more instances of the preceding sequence. Functionally equivalent to {0,}.
? | Matches zero or one instance of the preceding sequence. Functionally equivalent to {0,1}.
+ | Matches one or more instances of the preceding sequence. Functionally equivalent to {1,}.
QQQ? | Lazy match; appending a question mark to any quantifier will specify that a match should go only as far as needed before hitting the next match.

Grouping and Capture
( … ) | Place the contained sequence in a capture group.
(?: … ) | Place the contained sequence in a non-capture group.
(?<name> … ) | Place the contained sequence in a named capture group. You may also use single quotes to designate the name, as in (?'name' … )

Alternation Constructs
expr1|expr2 | Matches any of the expressions separated by the vertical bar character. The leftmost expression wins.
(?(Name)yes|no) | Matches the expression in the "yes" position if the named group has a successful capture, otherwise matches the expression in the "no" position. The "no" expression can be omitted.

Backreferences
\# | Refers to the numbered group identified by the number following the backslash.
\k<name> | Refers to the named group identified in the sequence.

Subexpressions
(?= … ) | Positive lookahead; continues to match only if the subexpression exists to the right. Does not capture the contents of the subexpression.
(?<= … ) | Positive lookbehind; continues to match only if the subexpression exists to the left. Does not capture the contents of the subexpression.
(?! … ) | Negative lookahead; continues to match only if the subexpression does not exist to the right. Does not capture the contents of the subexpression.
(?<! … ) | Negative lookbehind; continues to match only if the subexpression does not exist to the left. Does not capture the contents of the subexpression.

Miscellaneous
(?ixms: … ) | Set the appropriate option for the contents of the group.
(?-ixms: … ) | Unset the appropriate option for the contents of the group.
(?# … ) | Embed a comment in the expression starting with the # and ending in the first closing parenthesis.
Copyright ©1995-2003 CNET Networks, Inc. All Rights Reserved.
Visit us at www.TechRepublic.com
Dan Frumin Articles on TechRepublic
RegEx Book: Mastering Regular Expressions by Jeffrey Friedl. (O'Reilly)
This is an excellent cheat sheet type reference: http://www.evolt.org/article/rating/20/22700/
A Regular Expression library: http://www.regxlib.com/
An online RegEx tester: http://www.regxlib.com/RETester.aspx
A .Net regular expression builder and tester: http://www.codeproject.com/dotnet/expresso.asp?target=expresso
A tutorial with examples: http://www.3leaf.com/default/NetRegExpRepository.aspx#Tutorial