Mastering Regular Expressions

Summary by Jared Robinson
September 2000
References Updated July 2004

INTRODUCTION

I started casually using regular expressions years ago. They are an excellent method of matching text. As time wore on, I came across many complex regular expressions in code, in books and on the internet. I wanted to understand how to use them better, even though they looked quite cryptic. A coworker introduced me to Mastering Regular Expressions, and after reading a few pages, I was hooked. I bought my own copy.

Reading and summarizing Mastering Regular Expressions (first edition) was a bit overwhelming. While the first few chapters were very enlightening and useful (I highly recommend reading them), I didn't feel a strong need for the additional, in-depth knowledge of regular expressions that the later chapters provide. In spite of this, I forged ahead. Most of my focus while reading was on Perl regular expressions. I start my summary with basic regular expression knowledge that should apply to nearly any regular expression engine. From there I discuss speed and gradually move into the more perl-centric world. Finally, I will end with some examples that are not perl-specific.

BASICS

To match any single character, use the dot ".". Example: "the." matches "they", "them", etc.

Quantifiers are useful for determining how many times to match something. They include "*", "+", "?", and "{min,max}".

To match something zero or more times, use "*". Example: "a*" would match "", "a", "aa" and "aaa".

If you want one or more matches, use "+". Example: "a+" would match "a", "aa" and "aaa", but not "".

If you want one optional match (zero or one), use "?". Example: "j?am" matches "am" and "jam".

Some regular expression engines allow you to specify how many matches using "{min,max}". Example: "a{5,10}" matches between five and ten consecutive "a" characters.
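The quantifiers above can be tried out in a few lines of Python, chosen here only because its "re" module follows Perl-style syntax. This is a minimal sketch: the helper name "matches" and the sample strings are made up for illustration, and "re.fullmatch" is used so that the entire string has to match the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# "*": zero or more, so even the empty string matches.
star = [matches(r"a*", s) for s in ["", "a", "aa", "aaa"]]

# "+": one or more, so the empty string fails.
plus = [matches(r"a+", s) for s in ["", "a", "aa"]]

# "?": the "j" is optional, but at most one is allowed.
optional = [matches(r"j?am", s) for s in ["am", "jam", "jjam"]]

# "{min,max}": between 5 and 10 repetitions.
counted = [matches(r"a{5,10}", "a" * n) for n in (4, 5, 10, 11)]

print(star, plus, optional, counted)
```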

If you want to match any one of a group of characters, use character classes, which are denoted by "[]". Example: "[abc]at" would match "aat", "bat" and "cat", but not "rat", "pat", etc.

Ranges are allowed in character classes such as "[a-zA-Z0-9]", which matches alpha-numeric characters.

Character classes can be negated. Example: "[^a-zA-Z0-9]" matches non-alpha-numeric characters.
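As a small illustration of character classes, ranges and negation, here is a sketch in Python. The word list and the sample string are invented for the example.

```python
import re

# [abc] matches exactly one of the listed characters.
words = ["aat", "bat", "cat", "rat", "pat"]
class_hits = [w for w in words if re.fullmatch(r"[abc]at", w)]

# A negated class followed by "+" picks out runs of
# non-alphanumeric characters (note that "_" counts as one here).
separators = re.findall(r"[^a-zA-Z0-9]+", "foo_bar-42 baz")

print(class_hits)   # ['aat', 'bat', 'cat']
print(separators)   # ['_', '-', ' ']
```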

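If you want your match to be case-insensitive, there are various methods for turning this feature on. For grep, it is the "-i" option. For perl, it is the "/i" option on the end of the match operator: "/text/i". GNU sed accepts an "I" flag after the closing delimiter of its "s" command. Python has no match operator; there you pass the re.IGNORECASE flag or put "(?i)" at the start of the pattern.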
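In Python's "re" module, case-insensitive matching is requested with the "re.IGNORECASE" flag or with an inline "(?i)" at the start of the pattern. A short sketch, with a made-up sample line:

```python
import re

line = "Some TEXT here"

# Without any flag the match is case-sensitive and fails.
plain = re.search(r"text", line)

# Two equivalent ways to ignore case.
flagged = re.search(r"text", line, re.IGNORECASE)
inline = re.search(r"(?i)text", line)

print(plain)             # None
print(flagged.group())   # TEXT
print(inline.group())    # TEXT
```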

Alternation is what you use when you want to match any of a given set of choices. Example: "(Jared|Mike|Stan) Robinson" would match any of the following: "Jared Robinson", "Mike Robinson" and "Stan Robinson".
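The alternation example can be checked with a short Python sketch. The extra name "Bob Robinson" and the second, unparenthesized pattern are additions for illustration; the latter shows why the parentheses matter.

```python
import re

grouped = re.compile(r"(Jared|Mike|Stan) Robinson")
names = ["Jared Robinson", "Mike Robinson", "Stan Robinson", "Bob Robinson"]
matched = [n for n in names if grouped.fullmatch(n)]

# Without parentheses, "|" splits the whole pattern into three
# alternatives: "Jared", "Mike", or "Stan Robinson".
loose = re.compile(r"Jared|Mike|Stan Robinson")
loose_matches_bare_name = loose.fullmatch("Jared") is not None
grouped_matches_bare_name = grouped.fullmatch("Jared") is not None

print(matched)
print(loose_matches_bare_name, grouped_matches_bare_name)
```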

Use parentheses to capture text. This is useful for substitutions. For example, "s/oldvar\.([a-zA-Z0-9_]*)/newvar.\1/" would substitute oldvar.jared with newvar.jared. (The dot in the search pattern is escaped so that it matches only a literal period.) The "\1" is a back reference to the captured text. You can use "$1" instead of "\1" in Perl. Depending on the tool, you must use either "\(" and "\)" or "(" and ")" to capture text. Perl and python use unescaped parentheses to capture text, and escaped parentheses to match literal text; egrep and awk also group with unescaped parentheses. Sed, emacs, grep and vi use escaped parentheses to capture text by default.
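Here is how the same capture-and-substitute idea might look in Python, as a sketch rather than a drop-in replacement for the sed-style command. The sample line is invented, and the dot after "oldvar" is escaped so that it matches only a literal period.

```python
import re

line = "x = oldvar.jared + oldvar.count_2;"

# The parentheses capture the member name; \1 in the replacement
# inserts whatever was captured.
result = re.sub(r"oldvar\.([a-zA-Z0-9_]*)", r"newvar.\1", line)

print(result)  # x = newvar.jared + newvar.count_2;
```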

Use the anchors "^" to match the beginning of a line, and "$" to match the end. Anchors don't actually match characters, but rather, positions in the matched text. Other common anchor meta-characters vary depending on the regular expression (regex) engine, but may include "\<", "\>" to match beginning and end of words (for vi, emacs, sed, grep), "\b" and "\B" to match word and non-word boundaries (perl, python).
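A brief Python sketch of anchors and word boundaries follows. The two-line sample text is made up, and "re.MULTILINE" is needed in Python so that "^" and "$" apply to every line instead of only to the whole string.

```python
import re

text = "cat concatenate\nthe cat"

# First and last word of each line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# \b: "cat" standing alone as a word.
whole_words = re.findall(r"\bcat\b", text)

# \B: "cat" buried inside a longer word ("concatenate").
buried = re.findall(r"\Bcat\B", text)

print(line_starts)       # ['cat', 'the']
print(line_ends)         # ['concatenate', 'cat']
print(len(whole_words))  # 2
print(len(buried))       # 1
```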

What you match is often as important as what you don't match. Perhaps you want to match correct times of day. The following does not work: "[0-9]?[0-9]:[0-5][0-9]" because it would match not only lines containing "12:30", but also lines containing "99:30" and "112:3451". Why? A regex matches anywhere it can on a line of text unless otherwise constrained. We want to constrain the match to a word boundary, so we can use "\b". For egrep or perl, I would choose something like "\b(1[012]|[1-9]):[0-5][0-9]\b". See page 23 of "Mastering Regular Expressions" for details.
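The difference between the two time-of-day patterns can be seen in a short Python sketch. The sample list extends the strings mentioned above with "9:05" and "0:15", which are additions for illustration.

```python
import re

naive = re.compile(r"[0-9]?[0-9]:[0-5][0-9]")
strict = re.compile(r"\b(1[012]|[1-9]):[0-5][0-9]\b")

samples = ["12:30", "9:05", "99:30", "112:3451", "0:15"]

# search() looks for a match anywhere in the string, like grep does on a line.
naive_hits = [s for s in samples if naive.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(naive_hits)   # every sample contains something that looks like a time
print(strict_hits)  # ['12:30', '9:05']
```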

SPEED

Understanding how to increase the efficiency of a regex match often requires intimate knowledge of the regex engine you are using. For most mortals, using Mastering Regular Expressions as a reference should be sufficient. It can be difficult to remember the details of each different flavor of regex engine. Once you apply speed optimizations to a regular expression, it is a good idea to benchmark multiple equivalent regular expressions to see which one is fastest.
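One way to benchmark equivalent expressions is sketched below in Python with the standard "timeit" module. The two patterns, the test string and the helper name "time_pattern" are assumptions chosen for illustration. Which pattern wins depends on the engine and the input, so the timings are printed but no result is claimed.

```python
import re
import timeit

text = "abcabcabc" * 200 + "!"

# Two patterns intended to match the same thing.
alternation = re.compile(r"(?:a|b|c)+")
char_class = re.compile(r"[abc]+")

# Check that the patterns agree before comparing their speed.
same_result = alternation.match(text).group() == char_class.match(text).group()

def time_pattern(pattern, repeat=200):
    """Total seconds for `repeat` match attempts against text."""
    return timeit.timeit(lambda: pattern.match(text), number=repeat)

timings = {
    "alternation": time_pattern(alternation),
    "char_class": time_pattern(char_class),
}

print(same_result)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```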

Here are some general optimization guidelines:

LANGUAGE COMPARISON

Perl's most attractive feature is its integrated and powerful regular expressions. They are a part of the language, and don't require the use of any additional library. This is why Perl is commonly used as a "glue" language, and also for text manipulation.

Python now has regular expressions that are compatible with Perl, and just as powerful. However, the syntax is not an integral part of the language. It requires importing a library, and writing more code. While this is good for readability and maintenance, it also means that it isn't as good for quick, one-time text processing as Perl is.

Perl is great because it can be used in place of sed, awk, and other utilities on the command line. Perl is famous for its powerful one-line code snippets. Python does not have the same ability.

PERL REGULAR EXPRESSIONS

What I learned about Perl regular expressions that I didn't know before:

Now that I have covered more regular expression syntax possibilities than the average person will ever use, it seems appropriate to show some examples of basic and complex regular expressions.

EXAMPLES

I have created these examples in an attempt to make them useful to programmers using four tools: Perl, MS Dev, Visual SlickEdit and vi. Hopefully it will be easy to extend the ideas to python, sed, and emacs.

Replace an entire word, not part of a word

Perhaps you have the keyword "int" scattered throughout your C++ code, and you want to make sure that the data type is the same size across all platforms. So, you want to replace it with your own predefined data type, "IU32".

You might be tempted to search for "int" and replace it with "IU32". However, that will also replace words like "international" with "IU32ernational" and "interesting" with "IU32eresting". That's not what you want! A better approach would be to search for the word "int" all by itself. There are various methods of doing this:

Perl:
perl -pi.bak -e 's/\bint\b/IU32/g' *.cpp
MS Dev:
Search for "int" and select "Match whole word only", "Match Case" and "Regular Expression"
SlickEdit:
Do the same thing as with MS Dev.
Vi:
In command mode, type ":%s/\<int\>/IU32/g"

Keep in mind that you may have occurrences of "int" in your code that should not be replaced, so doing an automated non-interactive search-and-replace may not be the best way to go about it.
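For readers extending this to Python, the whole-word replacement might be sketched as follows. The sample source line is invented; it also shows the caveat above, because the "int" inside the comment is replaced along with the real keyword.

```python
import re

source = "int count = 0; // international interest in int"

# Plain text replacement damages longer words.
naive = source.replace("int", "IU32")

# \b restricts the match to "int" as a whole word.
converted = re.sub(r"\bint\b", "IU32", source)

print(naive)      # IU32 count = 0; // IU32ernational IU32erest in IU32
print(converted)  # IU32 count = 0; // international interest in IU32
```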

Replace a function call using capturing

In my experience with C and C++ code, programmers often use unsafe string handling functions such as sprintf(buf, "%s", sourcestring). I like the more crash-proof and security-minded snprintf(buf, sizeof(buf), "%s", sourcestring). Unfortunately, snprintf() requires an extra argument, and so it is not a simple search-and-replace. You need to match the sprintf(), grab the first argument, and use it for the second argument as well.

Perl:
perl -pi.bak -e 's/\bsprintf\s*\(\s*([^,]+),/snprintf\(\1, sizeof\(\1\), /g' *.cpp
MS Dev:
Search for "sprintf[ ]*([ ]*\([^,]*\)," and replace with "snprintf(\1, sizeof(\1),"
SlickEdit:
Search for "sprintf[ ]*\([ ]*([^,]*)," and replace with "snprintf(\1, sizeof(\1),"
Vi:
In command mode, type ":%s/\<sprintf\>[ ]*([ ]*\([^,]*\),/snprintf(\1, sizeof(\1),/g"

You wouldn't want to use these search-and-replace expressions non-interactively on a file. The destination string, the first argument to sprintf(), might not be a character array, but a pointer instead. In this case, you would not want to use "sizeof(buf)" as the second argument to the snprintf() function. Therefore, a search and replace like this must be on a case-by-case basis. So, in SlickEdit and MS Dev, you wouldn't want to select "Replace All". In vim (vi improved), you would add a "c" to the end of the string, right after the final "g" to have vim prompt you to confirm each replace.

Let's break down the search-and-replace expressions so they are more understandable. I will underline a portion of the perl expression and explain it.

's/\bsprintf\s*\(\s*([^,]+),/snprintf\(\1, sizeof\(\1\), /g'
            ^^^  ^^^

The "\s*" matches zero or more whitespace characters. I used "[ ]*" for MS Dev, SlickEdit and vi. I used the "\s*" because programmers sometimes leave spaces between the function name and the opening parenthesis. They also sometimes leave space between the opening parenthesis and the first argument.

's/\bsprintf\s*\(\s*([^,]+),/snprintf\(\1, sizeof\(\1\), /g'
               ^^

Here I want to match a literal parenthesis. In perl and SlickEdit, if you want to match literal parentheses, you must escape them. Unescaped parentheses are used to capture text. You may notice that MS Dev and vi use "\(" and "\)" to capture text, which is exactly the opposite of perl and SlickEdit.

's/\bsprintf\s*\(\s*([^,]+),/snprintf\(\1, sizeof\(\1\), /g'
                    ^^^^^^^^

This portion captures the first argument to the sprintf() function. It says "capture the character class one or more times that includes anything except for a comma, then match a comma".

's/\bsprintf\s*\(\s*([^,]+),/snprintf\(\1, sizeof\(\1\), /g'
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the replacement text. It contains the backreference "\1" twice, which substitutes the captured text from "([^,]+)" into the replacement text.
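The same substitution could be sketched in Python as shown below. The function name "to_snprintf" and the sample lines are hypothetical. As with the editor versions, this is only safe when each replacement is reviewed, because "sizeof" gives the wrong size when the first argument is a pointer.

```python
import re

# Same pattern as the perl one-liner: word boundary, optional spaces,
# a literal "(", then capture everything up to the first comma.
SPRINTF = re.compile(r"\bsprintf\s*\(\s*([^,]+),")

def to_snprintf(line):
    """Rewrite sprintf(dest, ...) as snprintf(dest, sizeof(dest), ...)."""
    return SPRINTF.sub(r"snprintf(\1, sizeof(\1),", line)

before = 'sprintf (buf, "%s", sourcestring);'
after = to_snprintf(before)

# \b keeps other functions that merely end in "sprintf" untouched.
untouched = to_snprintf("vsprintf(buf, fmt, ap);")

print(after)      # snprintf(buf, sizeof(buf), "%s", sourcestring);
print(untouched)  # vsprintf(buf, fmt, ap);
```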

CONCLUSION

I hope that you found this summary to be useful. Regular expressions are a concept that will be around longer than most programming languages. It pays off to spend the time to learn to use them.

If you find anything in this document that needs correction, please let me know.
