The Personal Web Pages of Chris X. Edwards

Regular Expression Tutorial

--------------------------

Ambiguities in Character Class Definitions

When defining character classes using brackets, it is possible to run into problems caused by the character class metacharacters themselves. These problems fall into three categories.

Brackets

The first problem involves a character class that is to contain bracket characters themselves. For example, say you were trying to find all of the grouping symbols in a program. You might create a character class with [()[]{}]. The problem with this is that one of the characters that you want to include in the character class is the metacharacter that closes the class definition. The class ends at the first "]", so it contains only "(", ")", and "[". The rest of the expression, "{}]", is treated as separate characters outside the class.

There are a couple of ways to cure this problem. The first is to use a \ (backslash) to let the interpreter know that you mean a literal right bracket.

[()[\]{}]

The other solution is to reorder things a bit so that the class is less ambiguous. If you have a character class of []], the first closing bracket comes immediately after the opening bracket, before any member of the class has been listed. The interpreter is smart enough to realize that you probably meant for that bracket itself to be included. It then continues to look for another closing bracket, which it treats as the metacharacter. So here are the grouping characters put in a correct character class.

[]()[{}]

Note that including another left bracket, [, in the class creates no special confusion. It is impossible and illogical to have nested character classes. Once you're in the class definition, left brackets are treated literally.
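Both fixes can be checked with a short script. This is only a sketch that assumes Python's re module as the interpreter; other regex flavors may treat brackets inside a class differently. The sample string is made up for illustration.

```python
import re

# The naive class ends at the first "]", so this pattern really means:
# one of "(", ")" or "[" followed by the literal text "{}]".
broken = re.compile(r"[()[]{}]")

# Fix 1: escape the right bracket. Fix 2: list the right bracket first.
escaped = re.compile(r"[()[\]{}]")
reordered = re.compile(r"[]()[{}]")

sample = "f(a[0]) { return x; }"  # made-up input

print(broken.findall(sample))     # finds no grouping symbols at all
print(escaped.findall(sample))    # every grouping symbol, one per match
print(reordered.findall(sample))  # same result as the escaped form
```

The escaped and reordered forms should return the same list, while the naive form only matches an odd string such as "({}]".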

What if your regular expression just tries to find a single left bracket? If you specify a regular expression such as this:

re-NnoteBAD

You will not find "Note [3]" and "note [6]" as you might expect and hope. It will, however, find "Note 4]" if it can. Again there are two ways to handle this:

re-NnoteGOOD1

re-NnoteGOOD2

Note that in both cases, the last right bracket, "]", is a literal because no character class is open at that point.
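The expressions in this example were images on the original page and are not reproduced here. As a sketch only, the three patterns below are guesses that behave the way the text describes; they are assumptions, not the original expressions. Python's re module is assumed as the interpreter, and the sample text is invented.

```python
import re
import warnings

# Python warns that a class starting with "[[" may change meaning in a
# future version; the warning is silenced here to keep the output readable.
warnings.simplefilter("ignore", FutureWarning)

# Hypothetical patterns (assumptions, not the original page's expressions).
bad = re.compile(r"[Nn]ote [[0-9]]")      # class "[[0-9]" then a literal "]"
good1 = re.compile(r"[Nn]ote \[[0-9]]")   # escaped "[" outside any class
good2 = re.compile(r"[Nn]ote [[][0-9]]")  # "[" alone in its own class

text = "See Note [3], note [6] and Note 4] for details."  # made-up input

print(bad.findall(text))    # only the malformed reference
print(good1.findall(text))  # the two bracketed references
print(good2.findall(text))  # same as good1
```

In the assumed bad pattern, the left bracket is swallowed into the digit class, so the class matches either "[" or a digit and the final "]" must follow immediately.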

Carets

The caret (a.k.a. the circumflex or hat), "^", can also spell trouble in character class definitions. Because "^" means negate in character classes, one must be careful when searching for a literal "^" character. Because the negate syntax requires the caret to immediately follow the opening square bracket ( [^negchars] ), it is possible to specify a literal "^" by putting it somewhere else.

[hat+^]

This character class is not negated. It is any of the specified characters including the "^". If you don't like to rearrange things, you can also make the "^" a literal with a backslash.

[\^+hat]
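A quick way to see the difference is to compare the two literal forms against a class that really is negated. This is a minimal sketch assuming Python's re module; the sample string is made up.

```python
import re

positional = re.compile(r"[hat+^]")  # "^" is not first, so it is literal
escaped = re.compile(r"[\^+hat]")    # "^" is first but escaped, so literal
negated = re.compile(r"[^+hat]")     # "^" first and unescaped: negation

sample = "a^b+c"  # made-up input

print(positional.findall(sample))  # characters that are in the class
print(escaped.findall(sample))     # same result as the positional form
print(negated.findall(sample))     # characters that are NOT +, h, a or t
```

The first two classes match the same characters, including the caret itself. The third matches everything except "+", "h", "a", and "t", so it picks up the caret only because the caret is not one of the excluded characters.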

Dashes

Literal dashes in a character class can be more troublesome since metacharacter dashes can appear almost anywhere in the definition. The key is almost. Metacharacter dashes specify a range from something to something. This means that there must be something in front and something behind. To get a literal dash, simply put it at the beginning or end of your character class. A backslash in front of the dash also works.

[+-*/]

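This is meant to match any single math operator, but the "+-*" in the middle is read as a range from "+" to "*". Because "+" comes after "*" in ASCII order, that range is invalid. Here are two ways that work.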

[-+*/]

[+\-*/]
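The failure and both fixes can be sketched as follows, assuming Python's re module, which rejects a backwards range with an error. Other interpreters may report the problem differently. The arithmetic string is made up.

```python
import re

# "[+-*/]" asks for a range from "+" to "*", which runs backwards in
# ASCII, so Python's re module refuses to compile it.
try:
    re.compile(r"[+-*/]")
    bad_range_rejected = False
except re.error:
    bad_range_rejected = True

leading = re.compile(r"[-+*/]")   # dash first: nothing in front, so literal
escaped = re.compile(r"[+\-*/]")  # dash escaped with a backslash

sample = "3-4*5+6/2"  # made-up input

print(bad_range_rejected)
print(leading.findall(sample))
print(escaped.findall(sample))
```

Both working classes should pick out the four operators in the order they appear in the sample.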

What if you want a range that starts at the backslash? Suppose you wanted the characters "\", "]", "^", "_", and "`". In theory, because they are consecutive in ASCII order, you could specify a range from the "\" character to the "`". But "[\-`]" would not work that way, because the backslash escapes the dash and the class then contains only a literal "-" and a "`". If you really, really wanted to do something stupid like this, "[\\-`]" would work: the doubled backslash is a literal backslash, which leaves the dash free to act as the range metacharacter.

--------------------------
Chris X. Edwards ~ December 2003