Back to Basics: Regular Expressions and Formal Languages

March 6, 2014

4 comments

Regular expressions are a very powerful tool to have on your toolbelt. They have an arcane syntax and often end up looking like a random stream of characters, but they can save you a lot of time parsing and interpreting text. Here are some problems that you can solve with regular expressions:

  • Find a list of phone numbers in a large text file
  • Check that a user-provided email address is valid
  • Verify that a password meets custom strength requirements
  • Locate all outgoing links in an HTML document
  • Modify all <img src=”…”> tags to refer to an HTTPS address

In this post I’d like to take a look at how to construct regular expressions, and at the same time consider which formal languages can be decided using regular expressions. Although regular expressions are often confused with regular languages, the regex flavor available in modern programming languages (C#, Perl, Python, JavaScript, etc.) is capable of recognizing much more than just regular languages. PCRE (Perl Compatible Regular Expressions) is a fairly common regex flavor that we will be using throughout this post.

Formal Languages

Now, what are those formal languages I’m talking about? A formal language is a set of finite strings. A string is simply a sequence of characters (over an agreed-upon alphabet). All of the following are formal languages:

  • The set of all strings
  • The empty set
  • The set of all strings that have an even number of characters
  • The set of all strings that begin with the letter ‘q’
  • The set of all strings that describe prime numbers in decimal notation
  • The set of all strings that are valid Perl scripts (aren’t these just all the strings? :-) )
  • The set of all strings that are C# programs that will not throw an exception when executed

We say that a regular expression R recognizes (or decides) a formal language L if the following two statements are equivalent for each string s: (1) s is in the language L; (2) s is accepted by R. In other words, R is equivalent to L — they both describe the same set of strings.

It turns out that there are formal languages that are extremely hard to recognize, with or without regular expressions. There are even formal languages that cannot be recognized by a computer. We will not expand on this point further, and instead focus on the expressive power or regular expressions as compared to regular languages.

I said that regular expressions are often mixed with regular languages. What are those? Well, a regular language is defined recursively as follows:

  • The empty language ø is a regular language
  • For each character c in the alphabet, {c} is a regular language

And, if A and B are regular languages, then:

  • The union A U B is a regular language
  • The concatenation AB is a regular language (all strings ab such that a is in A and b is in B)
  • The language A* is a regular language (all strings that consist of any number of occurrences of a word a that is in A)

Basic Operators

The most basic of all regular expressions is a regular expression that recognizes a single string. For example, if you’re simply looking for the string “phone” in a body of text, the following regular expression will do:

phone

Now, suppose you want to mix things up a little bit, and look for either <b> or <i> tags in HTML text. The following regular expression will do it:

<b>|<i>

At this point regular expressions are able to recognize any finite language (i.e., a language that has a finite number of words). It’s easy — if the words w1, w2, …, wn are in the language, then the corresponding regular expression is simply:

w1|w2|...|wn

However, finite languages are a very small subset of all languages. Many languages we’re interested in are not finite. For example, suppose we want to find any sequence of ‘a’s in a body of text. The language is clearly not finite. The following regular expression will do this:

a*

Here, the star operator (*) indicates any number of occurrences of the preceding string. This expands our expressive power to all the regular languages. Hopefully, it’s easy to see why, by following the recursive construction of any regular language:

  • If the regular language is empty, then the empty regular expression recognizes it.
  • If the regular language consists of a single character c, then the regular expression c recognizes it.
  • If the regular language is a union of two regular languages A U B, then we take the regular expression a that recognizes A and the regular expression b that recognizes B, and produce the regular expression a|b that recognizes A U B.
  • If the regular language is a concatenation of two regular languages AB, then we take the regular expression a that recognizes A and the regular expression b that recognizes B, and produce the regular expression ab which recognizes AB.
  • If the regular language is produced by applying the star operator to a regular language A (i.e. it’s the language A*), then we take the regular expression a that recognizes A and produce the regular expression a* that recognizes A*.

At this point, with just a couple of operators, our regular expressions can already match all regular languages. But it gets much more powerful from here. Before we go on though, let’s take a look at a few additional useful operators that aren’t strictly necessary to recognize regular languages, but make it easier to write regular expressions.

First, the plus operator (+) helps produce regular expressions that match any number of repetitions of a given string that is greater than 0. For example, if you’re looking for a sequence of one or more ‘a’s, the following regular expression will do:

a+

This doesn’t add any expressive power in terms of formal languages, because the regular expression a+ is exactly equivalent to the regular expression aa*. But it’s sure a bit shorter to write. Similarly, a? is the regular expression that matches 0 or 1 occurrences of ‘a’, and a{n,m} is the regular expression that matches between n and m occurrences of ‘a’.

Another super-useful construct is the character set (square brackets). It helps clearly express a range or a set of characters that you’re matching without specifying each one individually. For example, suppose we’re matching any number of digits. We could use the following regex:

(0|1|2|3|4|5|6|7|8|9)*

But the character set makes it easier to write:

[0-9]*

Character sets may also include multiple ranges. E.g., when looking for a word that consists of at least one alphanumeric character, we can use the following regex:

[A-Za-z0-9]+

And finally, character sets can be negated by adding a caret (^) at the beginning of the set. For example, to match anything that is not a digit, the following regex can be used:

[^0-9]

There are also some predefined ranges that you can use in most regex flavors. For example, \d is shorthand for a digit, and \w is shorthand for non-whitespace. Lastly, the dot (.) specifies any character. With that in mind, the following regex will match (some) valid email addresses, and also some invalid ones:

^[A-z0-9]+@[A-z0-9]+\.(\.com|\.org\|\.net)$

The escaping around the dots is required so that they are not interpreted as the dot character class, which stands for any character. The newly introduced caret (^) and dollar ($) signs at the beginning and end of the expression are start-of-string and end-of-string indicators. They mean that the regular expression should match the entire input string, and not just a substring. With that in mind, the preceding regex will match email addresses such as “mike@example.com” or “123@foo.org”, but will not match an address like “dave@example.co.il” or “miles.matheson@example.net”.

All of the above operators do not expand the expressive power of regular expressions beyond regular languages. It’s a tedious and technical exercise to translate each operator into the corresponding pipe and star operators, but it’s absolutely possible. Go ahead and try a couple.

Capture Groups

When dealing with regular expressions, it’s often useful to refer to captured substrings later in the program. Capture groups help do so by demarcating parts of the regex for later use.

For example, suppose we want to parse a URL query string with two named parameters, a and b, such as the string “http://example.org?a=1&b=1″. A rough attempt at matching the string would go something along the lines of:

^http://example.org?a=([A-z0-9]+)&b=([A-z0-9]+)$

To extract the parameter values, we place the interesting parts in parentheses, which makes them available to the application after matching the regex. In Perl, the matched subgroups would be placed in the $1 and $2 variables:

my $url = "http://example.org?a=1&b=2";
if ($url =~ /^http:\/\/example.org\?a=([A-z0-9]+)&b=([A-z0-9]+)$/g) {
	print "a=$1, b=$2\n";
}
# prints a=1, b=2

In other languages, the regex matching facility often exposes an array of captured subgroups that you can access (e.g. the Match.Captures property in .NET).

At this point, capture groups do not add any power of expressiveness — they are merely a utility feature for the rest of the program to use. Our regular expressions are still limited to recognizing regular languages. In the next section, we will move beyond regular languages. For that, we need an example of a non-regular language. One classic example is the language L = {waw : w is any string} — e.g., the strings “a”, “fooafoo”, “bab”, and so on. There is a cute proof showing that this language isn’t regular. If you trust me on this, you’ll also agree that the regular expressions flavor we have developed so far can’t recognize this language either.

Now, given, this example isn’t super-realistic. But what if you’re searching HTML text for links whose href attribute is the same as as the link text, e.g. <a href=”http://google.com”>http://google.com</a>? Or what if you want to match text inside a pair of matching tags, <b>foo</b> or <i>foo</i> or any other HTML tag for that matter?

Backreferences, Recursion, and Lookahead

Backreferences extend the regular expression syntax by letting you refer to a segment of the string that was captured by the regex in another location. The following regular expression will recognize the language L from the preceding section:

(.*)a\1

The \1 indicates that this part of the expression should match whatever was matched by the capture group #1. This is exactly how we defined our language, L, and goes to show that we now have more expressive power than regular languages.

Recursion in regular expressions is an even more powerful feature. You can use recursion to repeat the regular expression (or a subset thereof) in the middle of the matching string. Let’s start with a silly example — suppose we want to match all strings of ‘a’s, which we can already do with the regular expression a*. The recursive structure of the strings we’re matching is silly but obvious — a string that’s part of our language is either the empty string, or it’s of the form ax where x is another string in our language. The following recursive regex is equivalent to a*:

(a(?0)?)?

How does this work? Well, first of all, we clearly accept the empty string, because the whole thing we’re matching is optional. Now, if we don’t have an empty string, then it has to start with ‘a’. Following the ‘a’, we either have nothing, or we have a recursive application of the whole regular expression, which is indicated by ?0. If we wanted to match the string exactly, we would need begin- and end-line markers, which would require a sub-capture group for the recursive application:

^(a(?1)?)?)$

Here, ?1 is a recursive application of the first capture group, which is the whole regex except for the begin- and end-line markers. This is required because if we apply the recursive regex with the markers, we won’t be able to match them after the first application.

A cool example of using recursion helps recognize another non-regular language, L = {anbn : n is a natural number}. It’s easy to use recursion here once we realize that strings in this language have a recursive nature — if a string is in the language, then it’s of the form axb where x is another string that’s in the language. Here’s what the regular expression looks like:

^(a(?1)?b)$

Some explanation is in order. The whole regex (except for the begin- and end-line markers) is in a capture group, group 1. Then, we look for “a” followed by an optional recursive repetition of group 1, followed by “b”. This exactly follows our train of thought when we decomposed this language recursively, and is very natural to write.

Another classic example of using recursion is recognizing palindromes (strings that read the same when reversed, such as “racecar”). Every palindrome has a simple recursive structure with symmetry around the middle of the string. Therefore, the following recursive regex will match palindromes:

^((\w)(?1)\2|\w?)$

In this regex, the first capture group is the whole thing without the begin- and end-line markers. The second capture group is the first character matched by the regex. A palindrome requires that the first and last characters of the string are the same, which is what the (\w) and \2 parts ensure. In the recursive step, we recursively apply the regex from the first capture group. Finally, we need a termination step for our recursion. The termination step is when we have 0 or 1 characters left unclaimed in the middle of the string, represented by the \w? alternative.

The next smallest subset of formal languages that is strictly larger than regular languages is context-free languages (CFLs). The palindrome example we have seen in this section is an example of a context-free language. We won’t go into the details of what exactly a CFL is, but regular expressions with backreferences/recursion can recognize any CFL. In fact, they can recognize even languages that are not context-free!

Context-Sensitive Languages and Conclusion

For example, the super-simple language L = {ww : w is any string} is not context-free, even though it’s extremely easy to describe it with a regular expression that uses backreferences:

(.*)\1

Wow. It looks like the power of regular expressions is almost unlimited, when we have backreferences and recursion on our toolbelt. There’s no free lunch, however. It turns out that regular expression matching with all the bells and whistles on top (backreferences, recursion, etc.) is an NP-hard problem. There are numerous examples of regexes for various implementations that exhibit horrible (exponential) running times. Don’t let this deter you, though — most regexes you’ll write won’t hit these edge cases, and even if they do, for small inputs even exponential running time is not so bad.

For more details on how regular expressions with recursion can match all context-free languages, and for a proof that regexes with backreferences are NP-hard, see Nikita Popov’s article. For a super-quick refreshment on regular expression (that also covers look-ahead and look-behind), check out regular-expressions.info. Finally, the ultimate resource for mastering regular expressions is Jeffrey Friedl’s “Mastering Regular Expressions”.


I am posting short links and updates on Twitter as well as on this blog. You can follow me: @goldshtn

Add comment
facebook linkedin twitter email

Leave a Reply

Your email address will not be published.

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

4 comments

  1. Marc ShermanMarch 7, 2014 ב 4:50 PM

    Hi Sasha, thanks for the interesting article. Quick question: The site “regular-expressions.info” says that .NET does not support recursion but it supports balancing groups which can be used instead of recursion.

    So if I modify your statement to be, “but regular expressions with backreferences/balancing groups can recognize any CFL.” then is it still true?

    I ask because I’m interested in parsing source files with regular expressions vs. something like lex and yacc.

    thanks,
    Marc

    Reply
    1. Sasha Goldshtein
      Sasha GoldshteinMarch 9, 2014 ב 10:09 AM

      I believe this is true, but I haven’t proven the equivalence.
      Regardless, note that most programming languages aren’t really CFLs.

      Reply
  2. SvickMarch 19, 2014 ב 11:03 PM

    All of the following are formal languages:

    snip

    * The set of all strings that are C# programs that will not throw an exception when executed

    I don’t think that’s a formal language, because C# programs don’t have to be deterministic.

    Does

    if (new Random().Next() == 0) throw new Exception();

    belong to that set?

    What about

    if (Console.ReadLine() == "") throw new Exception();

    ?

    (In both cases, imagine I did include the required Program and Main boilerplate.)

    Reply
    1. Sasha Goldshtein
      Sasha GoldshteinMarch 20, 2014 ב 5:39 PM

      The problem really is that “will not throw an exception when executed” is not well-defined, because as you noted it depends on the input. So let’s say “for any input, will not throw an exception”.

      Reply