Summary of Regular Expressions

Mastering Regular Expressions

Summary by Jared Robinson
September 2000
References Updated July 2004

INTRODUCTION

I started casually using regular expressions years ago. They are an excellent method of matching text. As time wore on, I came across many complex regular expressions in code, in books and on the internet. I wanted to understand how to better use them, even though they were quite cryptic looking. A coworker introduced me to Mastering Regular Expressions, and after reading a few pages, I was hooked. I bought my own copy.

Reading and summarizing Mastering Regular Expressions (first edition) was a bit overwhelming. While the first few chapters were very enlightening and useful (I highly recommend reading them), I didn't feel a strong need for the additional, in-depth knowledge of regular expressions that the later chapters provide. In spite of this, I forged ahead. Most of my focus while reading was on Perl regular expressions. I start my summary with basic regular expression knowledge that should apply to nearly any regular expression engine. From there I discuss speed and gradually move into the more perl-centric world. Finally, I will end with some examples that are not perl-specific.

BASICS

To match any character, use the dot ".". Example: "the." matches "they", "them", etc.

Quantifiers are useful for determining how many times to match something. They include "*", "+", "?", and "{min,max}".

To match something zero or more times, use "*". Example: "a*" would match "", "a", "aa" and "aaa".

If you want one or more matches, use "+". Example: "a+" would match "a", "aa" and "aaa", but not "".

If you want one optional match (zero or one), use "?". Example: "j?am" matches "am" and "jam".

Some regular expression engines allow you to specify how many matches using "{min,max}". Example: "a{5,10}".
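
The quantifiers above can be tried out in a few lines of Python. This is only a sketch: the helper name matches() is made up for illustration, and "a{2,3}" is used instead of "a{5,10}" to keep the test strings short.

```python
import re

def matches(pattern, text):
    """Return True only if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# "*" allows zero or more repetitions, so the empty string matches.
star = [matches(r"a*", s) for s in ["", "a", "aaa"]]

# "+" requires at least one repetition, so the empty string fails.
plus = [matches(r"a+", s) for s in ["", "a", "aaa"]]

# "?" makes the "j" optional, but allows it only once.
optional = [matches(r"j?am", s) for s in ["am", "jam", "jjam"]]

# "{min,max}" bounds the repetition count (here: 2 or 3 times).
counted = [matches(r"a{2,3}", s) for s in ["a", "aa", "aaa", "aaaa"]]

print(star, plus, optional, counted)
```

re.fullmatch() is used here so that each quantifier is tested against the whole string; a plain search would also succeed on substrings.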

If you want to match any one of a group of characters, use character classes, which are denoted by "[]". Example: "[abc]at" would match "aat", "bat" and "cat", but not "rat", "pat", etc.

Ranges are allowed in character classes such as "[a-zA-Z0-9]", which matches alpha-numeric characters.

Character classes can be negated. Example: "[^a-zA-Z0-9]" matches non-alpha-numeric characters.
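
Here is a small Python sketch of character classes, ranges and negation. The sample strings are invented for illustration.

```python
import re

words = ["aat", "bat", "cat", "rat", "pat"]

# "[abc]at" accepts exactly one of a, b or c before "at".
class_hits = [w for w in words if re.fullmatch(r"[abc]at", w)]

sample = "foo_bar-42!"

# A range-based class: runs of alpha-numeric characters.
alnum_runs = re.findall(r"[a-zA-Z0-9]+", sample)

# The negated class: every single character that is NOT alpha-numeric.
non_alnum = re.findall(r"[^a-zA-Z0-9]", sample)

print(class_hits, alnum_runs, non_alnum)
```

Note that the underscore is not part of "[a-zA-Z0-9]", so it shows up in the negated result.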

If you want your match to be case-insensitive, there are various methods for turning this feature on. For grep, it is the "-i" option. For perl, it is the "/i" modifier on the end of the match operator: "/text/i". GNU sed accepts a similar "I" flag on its substitute command, and python takes the re.IGNORECASE flag instead.

Alternation is what you use when you want to match any of a given set of choices. Example: "(Jared|Mike|Stan) Robinson" would match any of the following: "Jared Robinson", "Mike Robinson", "Stan Robinson".
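
As a rough Python illustration of alternation and case-insensitive matching (the name "Paul Robinson" and the sample sentence are made up for the example):

```python
import re

pattern = re.compile(r"(Jared|Mike|Stan) Robinson")
names = ["Jared Robinson", "Mike Robinson", "Stan Robinson", "Paul Robinson"]

# Only names whose first word is one of the three alternatives match.
matched = [n for n in names if pattern.search(n)]

# In Python, case-insensitivity is requested with the re.IGNORECASE flag.
insensitive = re.search(r"text", "Some TEXT here", re.IGNORECASE)
sensitive = re.search(r"text", "Some TEXT here")

print(matched, insensitive.group(0), sensitive)
```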

Use parentheses to capture text. This is useful for substitutions. For example, "s/oldvar.([a-zA-Z0-9_]*)/newvar.\1/" would substitute oldvar.jared with newvar.jared. The "\1" is a back reference to the captured text. You can use "$1" instead of "\1" in Perl. Depending on the tool, you must use either "\(" and "\)" or "(" and ")" to capture text. Perl and python use unescaped parentheses to capture text, and escaped parentheses to match literal text. Sed, awk, emacs, grep and vi use escaped parentheses to capture text.
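
The same substitution might look like this in Python. This is a sketch, not a drop-in replacement for the sed/perl command: the input line is invented, and the dot after "oldvar" is escaped here (an assumption about the intent) so that it only matches a literal period.

```python
import re

line = "x = oldvar.jared + oldvar.mike_2;"
pattern = r"oldvar\.([a-zA-Z0-9_]*)"

# Group 1 holds whatever the parentheses captured.
first = re.search(pattern, line)
captured = first.group(1)

# "\1" in the replacement refers back to the captured text.
renamed = re.sub(pattern, r"newvar.\1", line)

print(captured)
print(renamed)
```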

Use the anchors "^" to match the beginning of a line, and "$" to match the end. Anchors don't actually match characters, but rather, positions in the matched text. Other common anchor meta-characters vary depending on the regular expression (regex) engine, but may include "\<", "\>" to match beginning and end of words (for vi, emacs, sed, grep), "\b" and "\B" to match word and non-word boundaries (perl, python).

What you match is often as important as what you don't match. Perhaps you want to match correct times of day. The following does not work: "[0-9]?[0-9]:[0-5][0-9]" because it would match not only lines containing "12:30", but also lines containing "99:30" and "112:3451". Why? A regex matches anywhere it can on a line of text unless otherwise constrained. We want to constrain the match to a word boundary, so we can use "\b". For egrep or perl, I would choose something like "\b(1[012]|[1-9]):[0-5][0-9]\b". See page 23 of "Mastering Regular Expressions" for details.
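
To see the difference between the two time-of-day expressions, here is a short Python comparison. The samples "9:05" and "0:15" are extra test cases added for illustration.

```python
import re

loose = re.compile(r"[0-9]?[0-9]:[0-5][0-9]")
strict = re.compile(r"\b(1[012]|[1-9]):[0-5][0-9]\b")

samples = ["12:30", "99:30", "112:3451", "9:05", "0:15"]

# The loose pattern finds a "time" somewhere inside every sample.
loose_hits = [s for s in samples if loose.search(s)]

# The strict pattern needs word boundaries on both sides and an hour of 1-12.
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)
print(strict_hits)
```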

SPEED

Understanding how to increase the efficiency of a regex-match often requires intimate knowledge of the regex engine you are using. For most mortals, using Mastering Regular Expressions as a reference should be sufficient. It can be difficult to remember the details of each different flavor of regex engine. Once you apply speed optimizations to a regular expression, it is a good idea to benchmark multiple equivalent regular expressions to see which one is fastest.

Here are some general optimization guidelines:

Great speed improvements usually come when you use beginning-of-line and end-of-line anchors ("^" and "$") as the first or last thing in a regex. Example: "^Subject: "
Character classes are always faster than alternation. Example: use [abc] instead of (a|b|c).
Use + instead of *. It is frequently what you want anyway, and it is usually faster. Keep in mind that some utilities do not support +, like some versions of grep. Solaris' grep is brain damaged. You can use egrep instead.
Use non-capturing parentheses when you need grouping but don't need the captured text, if applicable to your regex flavor. In Perl and Python, the syntax is "(?:text)".
Using a Posix NFA engine is almost always slower than traditional NFA, and many of the regex optimization rules don't apply to a Posix NFA. Fortunately, most regex flavors are traditional NFA.
Most optimizations that apply to an NFA do not apply to a DFA regex engine. DFAs are only useful for telling you whether or not a match occurred on a line of text. They do not tell you where the match was. DFA regex engines are usually faster than NFA regex engines.
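
Since the advice is to benchmark equivalent expressions, here is one way that might be done in Python with the standard timeit module. Treat it as a sketch: the test text and repetition count are arbitrary, and the timings depend entirely on the engine. Some engines rewrite simple alternations internally, so the two forms may come out the same.

```python
import re
import timeit

# A long string with a single match at the very end (arbitrary test data).
text = "xyz" * 2000 + "c"

char_class = re.compile(r"[abc]")
alternation = re.compile(r"(?:a|b|c)")

# Before timing, confirm that both expressions really are equivalent here.
same_result = char_class.findall(text) == alternation.findall(text)

class_time = timeit.timeit(lambda: char_class.search(text), number=200)
alt_time = timeit.timeit(lambda: alternation.search(text), number=200)

print(f"character class: {class_time:.4f}s  alternation: {alt_time:.4f}s")
```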

LANGUAGE COMPARISON

Perl's most attractive feature is its integrated and powerful regular expressions. They are a part of the language, and don't require the use of any additional library. This is why Perl is commonly used as a "glue" language, and also used for text manipulation.

Python now has regular expressions that are compatible with Perl, and just as powerful. However, the syntax is not an integral part of the language. It requires importing a library, and writing more code. While this is good for readability and maintenance, it also means that it isn't as good for quick, one-time text processing as Perl is.

Perl is great because it can be used in place of sed, awk, and other utilities on the command line. Perl is famous for its powerful one-line code snippets. Python does not have the same ability.

PERL REGULAR EXPRESSIONS

What I learned about Perl regular expressions that I didn't know before:

Quantifiers are greedy. They match as much text as possible, and then backtracking occurs in order for the entire regex to match. This may not be what you want. To get non-greedy (lazy) quantifiers, use "*?", "+?", "??", "{min,max}?". Non-greedy quantifiers can speed up a regex match.
Double-quotes are not string delimiters. They are operators. In other words, text surrounded by double-quotes does not represent a string constant like it does in C. Double-quoted strings are interpolated in Perl. This means that if you have a string like "Hello $variable\n", it is the same thing as the concatenation "Hello " . $variable . "\n";
$+ refers to the last captured portion of a regex.
Regex side-effect variables such as $1 and $+ are dynamically scoped. This allows regexes in sub-scopes not to fiddle with the $1 from outer-scoped-levels.
Surround strings within regexes with \Q and \E to get the literal contents, instead of having them interpreted as if they were regex meta-characters. For example, m/$myvar/ where $myvar is "\wJared\w" would match word-character-"Jared"-word-character. If you wanted to match a literal backslash-"wJared"-backslash-"w", then you would write m/\Q$myvar\E/.
Variable Interpolation Confusion. Perl supports variable interpolation in strings. This is useful most of the time, but can cause problems and create confusion. See page 222, Phase B, in Mastering Regular Expressions. Python doesn't support variable interpolation, so you don't have this problem. This is both a blessing and a curse, since variable interpolation can make a programmer's life much easier and much more difficult.
You can use whitespace to comment your regex if you use the /x modifier. This feature is underused. It allows complicated regexes to be documented and thus be better maintained.
Non-capturing parentheses are used as follows: (?:your text)
Perl supports positive lookahead. Lookahead doesn't consume the characters that it matches, whereas normal matches do. Often it is used for efficiency, but it can also make your life easier. Lookahead constructs consist of (?=your text), which is positive, and (?!your text), which is negative. Pages 228-230. Apparently, Perl 5.6 supports negative lookahead as well.
Case insensitive matching can be turned on anywhere in a regex as follows: (?i:your text).
The anchors \A and \Z match the beginning and end of string, respectively, regardless of newlines within the string.
The anchor \G is similar to \A except that it matches beginning at where the regex last matched when used with the /g modifier. (Page 236).
Octal escapes are supported. Use them as follows: "\033". Always put the zero in front for clarity so that it doesn't get confused with a back reference.
There is a performance penalty for using $`, $&, and $', which represent the portion of a string before the regex match, the regex match, and the portion remaining after the match, respectively. You can mimic the behavior of these variables using the pos() function.
The /e modifier executes the contents of replacement text before it is substituted into an expression: "s/(\b\d+\b)/&MyFunc($1)/e" would replace numbers with the return value of MyFunc(number).
There is a study() function that can speed up regex matches, but only under some special circumstances. It is best used when you have a large string that you want to match many times before the string is modified. If this is the case, then read pages 287-289.
There are many more details of Perl regexes that you can learn about such as the split() function, regex side-effects, multiple uses of the /e modifier, etc.
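
Several of the items above have close counterparts in Python's re module, which makes them easy to experiment with. The sketch below is an analogy, not Perl: re.escape() stands in for \Q and \E, re.VERBOSE for /x, and a replacement function for /e. The sample strings and the function name double() are made up for illustration.

```python
import re

# Greedy vs. lazy: ".*" grabs as much as it can, ".*?" as little as it can.
html = "<b>one</b> and <b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)

# re.escape() is the rough equivalent of wrapping a variable in \Q...\E.
myvar = r"\wJared\w"
as_regex = re.search(myvar, "xJaredy") is not None
as_literal = re.search(re.escape(myvar), "xJaredy") is not None
literal_hit = re.search(re.escape(myvar), r"see \wJared\w here") is not None

# re.VERBOSE is the rough equivalent of /x: whitespace and comments are ignored.
time_re = re.compile(r"""
    \b
    (1[012]|[1-9])   # hour, 1 through 12
    :
    [0-5][0-9]       # minutes, 00 through 59
    \b
""", re.VERBOSE)
found_time = time_re.search("meet at 9:05 sharp").group(0)

# Lookahead checks what follows without consuming it.
dollars = re.findall(r"\d+(?= dollars)", "10 dollars, 20 euros, 30 dollars")
others = re.findall(r"\b\d+\b(?! dollars)", "10 dollars, 20 euros, 30 dollars")

# A replacement function plays the role of the /e modifier.
def double(match):
    return str(int(match.group(1)) * 2)

doubled = re.sub(r"(\b\d+\b)", double, "3 apples and 10 pears")

print(greedy, lazy, found_time, dollars, others, doubled)
```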

Now that I have covered more regular expression syntax possibilities than the average person will ever use, it seems appropriate to show some examples of basic and complex regular expressions.

EXAMPLES

I have created these examples in an attempt to make them useful to programmers using four tools: Perl, MS Dev, Visual SlickEdit and VI. Hopefully it will be easy to extend the ideas to python, sed, and emacs.

Replace an entire word, not part of a word

Perhaps you have the keyword "int" scattered throughout your C++ code, and you want to make sure that the data type is the same size across all platforms. So, you want to replace it with your own predefined data type, "IU32".

You might be tempted to search for "int" and replace it with "IU32". However, that will also replace words like "international" with "IU32ernational" and "interesting" with "IU32eresting". That's not what you want! A better approach would be to search for the word "int" all by itself. There are various methods of doing this:

Perl:: perl -pi.bak -e 's/\bint\b/IU32/g' *.cpp
MS Dev:: Search for "int" and select "Match whole word only", "Match Case" and "Regular Expression"
SlickEdit:: Do the same thing as with MS Dev.
Vi:: In command mode, type ":%s/\<int\>/IU32/g"

Keep in mind that you may have occurrences of "int" in your code that should not be replaced, so doing an automated non-interactive search-and-replace may not be the best way to go about it.
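
Extending the idea to python, a whole-word replacement could be sketched like this. The source line is invented for the example.

```python
import re

source = "int i = 0; // international interest, print"

# A naive replacement also mangles words that merely contain "int".
naive = source.replace("int", "IU32")

# "\b" on both sides restricts the match to the standalone word "int".
whole_word = re.sub(r"\bint\b", "IU32", source)

print(naive)
print(whole_word)
```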

Replace a function call using capturing

In my experience with C and C++ code, programmers often use unsafe string handling functions such as sprintf(buf, "%s", sourcestring). I like the more crash-proof and security-minded snprintf(buf, sizeof(buf), "%s", sourcestring). Unfortunately, snprintf() requires an extra argument, and so it is not a simple search-and-replace. You need to match the sprintf(), grab the first argument, and use it for the second argument as well.

Perl:: perl -pi.bak -e 's/\bsprintf\s*\(\s*([^,]+),/snprintf\(\1, sizeof\(\1\), /g' *.cpp
MS Dev:: Search for "sprintf[ ]*([ ]*\([^,]*\)," and replace with "snprintf(\1, sizeof(\1),"
SlickEdit:: Search for "sprintf[ ]*\([ ]*([^,]*)," and replace with "snprintf(\1, sizeof(\1),"
Vi:: In command mode, type ":%s/\<sprintf\>[ ]*([ ]*\([^,]*\),/snprintf(\1, sizeof(\1),/g"

You wouldn't want to use these search-and-replace expressions non-interactively on a file. The destination string, the first argument to sprintf(), might not be a character array, but a pointer instead. In this case, you would not want to use "sizeof(buf)" as the second argument to the snprintf() function. Therefore, a search and replace like this must be on a case-by-case basis. So, in SlickEdit and MS Dev, you wouldn't want to select "Replace All". In vim (vi improved), you would add a "c" to the end of the string, right after the final "g" to have vim prompt you to confirm each replace.

Let's break down the search-and-replace expressions so they are more understandable. I will single out one portion of the perl expression at a time and explain it.

's/\bsprintf\s*\(\s*([^,]+),/snprintf\(\1, sizeof\(\1\), /g'

The first portion is "\s*", which appears twice in the search pattern. The "\s*" matches zero or more whitespace characters. I used "[ ]*" for MS Dev, SlickEdit and vi. I used the "\s*" because programmers sometimes leave spaces between the function name and the opening parenthesis. They also sometimes leave space between the opening parenthesis and the first argument.

The next portion is the "\(" that sits between the two "\s*". Here I want to match a literal parenthesis. In perl and SlickEdit, if you want to match literal parentheses, you must escape them. Unescaped parentheses are used to capture text. You may notice that MS Dev and vi use "\(" and "\)" to capture text, which is exactly opposite of perl and SlickEdit.

The next portion is "([^,]+),". This portion captures the first argument to the sprintf() function. It says "capture the character class one or more times that includes anything except for a comma, then match a comma".

The final portion is "snprintf\(\1, sizeof\(\1\), ". This is the replacement text. It contains the backreference "\1" twice, which substitutes the captured text from "([^,]+)" into the replacement text.
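
For readers who want to extend this to python, the same search-and-replace might be sketched as follows. The sample line of C code is invented, and the same caution applies: review each replacement rather than applying it blindly.

```python
import re

# Same structure as the perl pattern: optional whitespace, a literal "(",
# optional whitespace, then the first argument captured up to the comma.
pattern = re.compile(r"\bsprintf\s*\(\s*([^,]+),")

code = 'sprintf (buf, "%s", sourcestring);'

# "\1" is used twice: once as the destination, once inside sizeof().
rewritten = pattern.sub(r"snprintf(\1, sizeof(\1),", code)

print(rewritten)
```

In a Python replacement string, parentheses have no special meaning, so they do not need to be escaped there.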

CONCLUSION

I hope that you found this summary to be useful. Regular expressions are a concept that will be around longer than most programming languages. It pays off to spend the time to learn to use them.

If you find anything in this document that needs correction, please let me know.

LINKS

Publisher's homepage (O'Reilly): Second Edition, First Edition.
Author's homepage (Jeffrey Friedl): Second Edition, First Edition.
Python regular expressions (compatible with perl): http://www.python.org/doc/lib/module-re.html. This is a great reference. Also see the python regex howto.