Regular Expressions

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager.

First steps

A regular expression (in our case) will always look like this: /a*bcd/
The regex always starts and ends with a forward slash (/).

The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash.

Metacharacters are symbols like ^, *, ?, +, ., -. So in order to match a string or word containing the letter a, the regex /a/ will do the job. However, if you are looking for a+b (as in a formula), the expression would not be /a+b/, because the plus-symbol (+) has a different meaning in a regular expression, as we will see later. To match exactly a+b, the plus has to be escaped:

/a\+b/ will match the letter a followed by a plus symbol (+) and the letter b.
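
As a rough illustration, the difference between the escaped and the unescaped plus can be tried out in Python. This sketch assumes Python's standard re module, which is not necessarily the engine behind the queries in this tutorial. Python writes the pattern as a string without the surrounding slashes, and the helper name contains is made up for this example.

```python
import re

def contains(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# Escaped plus: a literal "+" between a and b.
print(contains(r"a\+b", "a+b"))   # True
print(contains(r"a\+b", "aab"))   # False

# Unescaped plus: "one or more a, directly followed by b",
# so the formula a+b is not found.
print(contains(r"a+b", "a+b"))    # False
print(contains(r"a+b", "aab"))    # True

# re.escape inserts the backslashes for you.
print(re.escape("a+b"))           # a\+b
```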

Exercise

Which of the following strings will the regular expression /a*b/ NOT match?

ab
aaab
b
a*b

Describing a word or part of a word (caret and dollar symbol)

A regular expression that consists only of literal characters will match any word that contains this expression. E.g. /love/ will find love and glove, /is/ will not only find is but also his, this, disambiguation and so on. In order to describe a word as such, you will have to use the metacharacters ^ and $. ^ marks the beginning of a word and $ its end. Strictly speaking, neither of them matches a character: they match the position at the start and at the end of the string being searched. Do not worry about these technical details; all you need to know is that if you use a regular expression like /^word$/, it will match exactly the word word, not words or sword. You can also use the caret to find words that begin with a certain string (e.g. un to find words like unbearable and unintelligible) or use the dollar symbol to find words with a certain ending (e.g. less if you are looking for careless, tasteless or nonetheless).

Note: When used within a bracket expression, the caret has a different function.
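
The effect of the two anchors can be sketched in Python. This assumes Python's re module (patterns are plain strings without the slashes) and treats every word as a separate string. The word list and the helper name find are invented for illustration.

```python
import re

# Made-up word list; each word is searched as a string of its own.
words = ["word", "words", "sword", "careless", "lesson", "scare"]

def find(pattern: str) -> list:
    """Hypothetical helper: the words in which the pattern matches somewhere."""
    return [w for w in words if re.search(pattern, w)]

print(find(r"word"))     # ['word', 'words', 'sword']
print(find(r"^word$"))   # ['word']
print(find(r"^care"))    # ['careless']   (scare does not begin with care)
print(find(r"less$"))    # ['careless']   (lesson does not end in less)
```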

Exercise

Which of the following regular expressions will find words that begin with un (but not words like gun or fun)?

/$un/
/^un/
/un/
/un$/

Alternation (pipe symbol)

Sometimes you will be searching for several possible strings at once. A good example might be that you are looking for instances of to be in the Penn Treebank. In order to find all the possible variations, your regular expression has to match be, is, am, are, was, were, been and being. This can be achieved by using the pipe symbol (|). As part of a regular expression the pipe symbol stands for or. Thus the regular expression /a|b/ matches all strings that contain a or b. A correct expression for instances of to be is /^(be|is|am|are|was|were|been|being)$/.

To have an alternation within another regular expression, the round brackets are used. So if you are looking for man (singular) and men (plural), you can use the expression /^m(a|e)n$/. Of course you could also use the expression /^man$|^men$/, but especially when you are formulating more complex queries, it will be easier to debug your expressions if you use brackets.
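
A small Python sketch of the two expressions above, assuming Python's re module (no surrounding slashes) and a made-up token list. The last two lines show one reason the brackets matter: without them, the anchors attach only to the first and last alternative.

```python
import re

# The expression for forms of "to be" from the text.
to_be = re.compile(r"^(be|is|am|are|was|were|been|being)$")
tokens = ["He", "is", "being", "his", "bee", "was"]   # invented sample
print([t for t in tokens if to_be.search(t)])         # ['is', 'being', 'was']

# Alternation inside a larger expression.
man_men = re.compile(r"^m(a|e)n$")
print([t for t in ["man", "men", "min", "mean"] if man_men.search(t)])
# ['man', 'men']

# Without brackets, ^be|is$ means "starts with be" OR "ends with is".
print(bool(re.search(r"^be|is$", "this")))     # True
print(bool(re.search(r"^(be|is)$", "this")))   # False
```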

Bracket expressions

A bracket expression is written with square brackets and matches a single character out of those listed between the brackets. For example, the regular expression [0123456789] matches any single digit. A bracket expression can either be an explicit list of all the possible characters or a range expression. Typical examples of range expressions are [0-9], which matches any single digit, just like [0123456789], and [a-z], which matches any lowercase letter of the alphabet. We will not discuss the difference between lowercase and uppercase any further here, as we will not need to know that for our queries.

When used within a bracket expression, the ^-symbol does not mark the beginning of a string, but negates the bracket expression. /s[^uo]n/ matches sin, but not sun or son.
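
Bracket expressions and their negation can be tried out as follows. This is only a sketch under the assumption that Python's re module behaves like the engine used for the queries. The sample strings are invented.

```python
import re

# A range expression matches one character at a time.
print(re.findall(r"[0-9]", "a1b22"))    # ['1', '2', '2']

# Negated bracket expression: any single character except u and o.
words = ["sin", "sun", "son", "san"]
print([w for w in words if re.search(r"s[^uo]n", w)])   # ['sin', 'san']

# In Python, [a-z] covers lowercase letters only.
print(bool(re.search(r"[a-z]", "A")))   # False
```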

Exercise

Which of these regular expressions matches both sun and son?

/^sun$|^son$/
/^s(u|o)n$/
/^s[uo]n$/
All of the above

Repetition operators

So far, you have encountered a few special characters like ^ and $. Another group of metacharacters is the repetition operators, or quantifiers. They are used to match optional strings or strings that may be repeated once or several times. These special characters are ?, + and *.

  • ?: The preceding item will be matched zero or one time.
  • +: The preceding item will be matched at least one time.
  • *: The preceding item will be matched zero or more times.

Examples for these quantifiers:

  • /a?b/ matches b and ab.
  • /a+b/ matches ab, aab, aaab etc.
  • /a*b/ matches b, ab, aab, aaab etc.

Remember, if you want to match these symbols explicitly, you have to escape them using the backslash (\).
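
The three quantifiers can be compared side by side in Python. This sketch assumes Python's re module and uses re.fullmatch, so that the whole string has to match, which mirrors the example lists above. The candidate list and the helper name full_matches are made up.

```python
import re

candidates = ["b", "ab", "aab", "aaab"]   # invented test strings

def full_matches(pattern: str) -> list:
    """Hypothetical helper: candidates that the pattern matches completely."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(full_matches(r"a?b"))   # ['b', 'ab']
print(full_matches(r"a+b"))   # ['ab', 'aab', 'aaab']
print(full_matches(r"a*b"))   # ['b', 'ab', 'aab', 'aaab']
```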

Exercise

What will this regular expression match: /\??/?

One question mark or anything
Exactly one question mark
At least one question mark
All of the above
None of the above

If you are looking for an explicit number of repetitions, you can also use the curly brackets in the following ways:

  • {n}: The preceding item will be matched exactly n times, e.g. /a{3}/ matches aaa.
  • {n,}: The preceding item will be matched at least n times.
  • {n,m}: The preceding item will be matched at least n, but no more than m times.

Examples:

  • /a{2}b/ matches aab.
  • /a{2,}b/ matches aab, aaab, aaaab, etc.
  • /a{2,3}b/ matches aab and aaab.
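
The curly-bracket examples can be checked in the same way. Again this is a sketch that assumes Python's re module. re.fullmatch requires the whole string to match, and the candidate list and helper name are invented for illustration.

```python
import re

candidates = ["ab", "aab", "aaab", "aaaab"]   # invented test strings

def full_matches(pattern: str) -> list:
    """Hypothetical helper: candidates that the pattern matches completely."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(full_matches(r"a{2}b"))     # ['aab']
print(full_matches(r"a{2,}b"))    # ['aab', 'aaab', 'aaaab']
print(full_matches(r"a{2,3}b"))   # ['aab', 'aaab']
```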

Exercise

Which of these regular expressions will match a three-digit number?

/[0-9]{2,4}/
/[0-9]{3,}/
/[0-9]{3}/

More special characters

Two more special characters are \w and \W (notice the backslashes). \w matches all letters and digits (but not characters like ?), much like the expression /[0-9A-Za-z]/; in most regex engines it also matches the underscore (_). \W is its opposite: it matches every character that \w does not match, such as spaces and punctuation marks.

One metacharacter you might actually find useful is the dot (.). It stands for any single character, so you can use it as a wildcard. And do not forget, if you are looking for a literal dot, you will have to escape it.
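
A short Python sketch of \w, \W and the dot. It assumes Python's re module, where \w also accepts the underscore and, for ordinary text strings, letters outside A-Z. Other engines may differ in these details. The sample text is invented.

```python
import re

text = "Is it 42?"   # invented sample

print(re.findall(r"\w", text))   # ['I', 's', 'i', 't', '4', '2']
print(re.findall(r"\W", text))   # [' ', ' ', '?']

# In Python, the underscore counts as a \w character.
print(bool(re.fullmatch(r"\w", "_")))   # True

# The dot stands for any single character; escaped, it is a literal dot.
print(re.findall(r"b.t", "bat bit b.t bt"))    # ['bat', 'bit', 'b.t']
print(re.findall(r"b\.t", "bat bit b.t bt"))   # ['b.t']
```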

There are more special characters than you have seen so far, but you will probably not need any of them when you are formulating queries for the treebanks. Among them are the POSIX named character classes, which are written inside a bracket expression (so [[:digit:]] matches any single digit, like [0-9]):
[:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].

Links

The Internet is packed with information on regular expressions. We suggest these two tutorials: