Regular Expressions


Regular expressions are a pattern-matching notation used in text processing for a variety of purposes. Users of Unix and Unix-compatible systems in particular are likely to know them from utilities such as grep and sed, or from the Perl programming language.

 

How can you use regular expressions in FollowUpExpert?

 

One example is setting up your autoresponder to reply only to messages whose subjects look like "Order #1245 for ACME", where the order number varies. To match this, you can define a regular expression that matches any subject that starts with "Order #" followed by a number (one or more digits), a single space and "for ACME".
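
The subject filter described above can be sketched as follows. Python's re module is used here only for illustration; FollowUpExpert's own engine may differ in details, and the function name is_order_subject is hypothetical.

```python
import re

# "Order #", one or more digits, a single space, then "for ACME".
# ^ anchors the match to the start of the subject.
ORDER_SUBJECT = re.compile(r"^Order #[0-9]+ for ACME")

def is_order_subject(subject):
    """Return True if the subject starts like 'Order #1245 for ACME'."""
    return ORDER_SUBJECT.search(subject) is not None

print(is_order_subject("Order #1245 for ACME"))      # True
print(is_order_subject("Re: Order #1245 for ACME"))  # False
print(is_order_subject("Order # for ACME"))          # False
```

The third call fails because + requires at least one digit after the "#" character.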

 

Another example is text extraction from incoming emails. Here's an example:

 

Email:        joe@somedomain.com
Name:        Joe Public

 

You may often want to reply to the email address contained in the "Email:" field and not to the one the message was sent from. To do this, you extract the address from the message text by matching "Email:" followed by any number of space characters followed by the email address you want to extract. This can be done fairly easily using regular expressions.
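
A minimal sketch of this extraction, again in Python for illustration only. How FollowUpExpert hands back the extracted part is not described here, so the use of a parenthesized group to pick out the address, and the name extract_reply_address, are assumptions.

```python
import re

MESSAGE = "Email:        joe@somedomain.com\nName:        Joe Public\n"

# "Email:" at the start of a line, any number of spaces or tabs,
# then a run of non-whitespace characters captured as group 1.
EMAIL_FIELD = re.compile(r"^Email:[ \t]*([^ \t\r\n]+)", re.MULTILINE)

def extract_reply_address(body):
    """Return the address from the 'Email:' field, or None if absent."""
    match = EMAIL_FIELD.search(body)
    return match.group(1) if match else None

print(extract_reply_address(MESSAGE))  # joe@somedomain.com
```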

 

Regular expressions consist of two types of characters: literals (for example: a, b, 1, 2) and operators (for example: |, *).

 

Literal characters match their equivalents in the text being processed (for example, a in a regular expression matches a in the text). All characters are treated as literals except for ., |, *, ?, +, (, ), {, }, [, ], ^, $ and \. These exceptions are treated as operators and used for special purposes. To have an operator work as a literal (for instance, to match ? in the text being processed), prefix it with \ (for example, to match a question mark, use the regular expression \?).

 

For example (in all following examples, the regular expression is the text that precedes "will match"):

 

Acme GmbH will match "Acme GmbH" anywhere within the text being processed.

 
How are you\? will match "How are you?" anywhere.
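
The effect of escaping can be checked with a short experiment. This is an illustrative sketch using Python's re module, which treats these particular patterns the same way; the sample text is made up.

```python
import re

text = "Hello from Acme GmbH. How are you?"

# Plain literals match themselves anywhere in the text.
print(re.search(r"Acme GmbH", text) is not None)   # True

# The escaped question mark matches a literal "?".
print(re.search(r"How are you\?", text).group())   # How are you?

# Without the backslash, ? makes the preceding "u" optional, so the
# match stops before the question mark.
print(re.search(r"How are you?", text).group())    # How are you

# re.escape produces the escaped form of arbitrary literal text.
print(re.escape("you?"))                           # you\?
```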

 

The dot operator . acts as a wildcard character, i.e. it matches any single character.

 

For example:

 
A.me GmbH will match "Acme GmbH", "ACme GmbH", "A$me GmbH" and so on with any character in the position of the dot character.
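
As a quick illustration (Python, with made-up candidate strings), the dot accepts any single character but still requires that one character to be present:

```python
import re

pattern = re.compile(r"A.me GmbH")
candidates = ["Acme GmbH", "ACme GmbH", "A$me GmbH", "Ame GmbH"]

# Keep the candidates in which the pattern is found.
matched = [c for c in candidates if pattern.search(c)]
print(matched)  # ['Acme GmbH', 'ACme GmbH', 'A$me GmbH']
```

"Ame GmbH" is rejected because nothing stands in the position of the dot.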

 

The operators *, + and ? are placed immediately after a character or expression to control how many times it may be repeated. The asterisk * allows the expression to occur any number of times, including zero; the plus operator + requires it to occur at least once; and the question mark ? makes it optional (it occurs once or not at all).

 

For example:

 

Ac*me will match "Ame", "Acme", "Accme" and so on.
 

Ac+me will match "Acme", "Accme", "Acccme" and so on but not "Ame".
 

Ac?me will match "Ame" or "Acme".
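
The three repetition operators can be compared side by side. The sketch below uses Python's re.fullmatch so that each candidate has to match in its entirety; the helper name accepted is hypothetical.

```python
import re

words = ["Ame", "Acme", "Accme", "Acccme"]

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

print(accepted(r"Ac*me", words))  # ['Ame', 'Acme', 'Accme', 'Acccme']
print(accepted(r"Ac+me", words))  # ['Acme', 'Accme', 'Acccme']
print(accepted(r"Ac?me", words))  # ['Ame', 'Acme']
```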

 

You can enclose one or more characters in parentheses, ( and ), so that a repetition operator applies to the whole subexpression.

 

For example:

 
Ac(me)* will match "Ac", "Acme", "Acmeme", "Acmememe" and so on.

 
Ac(me)+ will match "Acme", "Acmeme", "Acmememe" and so on but not "Ac".

 
Ac(me)? will match "Ac" or "Acme".
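
The grouped examples above can be verified in the same way. This is an illustrative Python sketch; group_hits is a hypothetical helper, and each candidate must match in full.

```python
import re

candidates = ["Ac", "Acme", "Acmeme", "Acmememe"]

def group_hits(pattern):
    """Return the candidates that consist of nothing but a match."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(group_hits(r"Ac(me)*"))  # ['Ac', 'Acme', 'Acmeme', 'Acmememe']
print(group_hits(r"Ac(me)+"))  # ['Acme', 'Acmeme', 'Acmememe']
print(group_hits(r"Ac(me)?"))  # ['Ac', 'Acme']
```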

 

To explicitly specify the minimum and maximum number of repeats, you can use the bounds operators { and }. For instance, {2} means that the character or expression occurs exactly twice, {3,5} means three to five times, and {3,} means at least three times with no upper limit.

 

For example:

 
Ac{2,3}me will match "Accme" or "Acccme".

 
Ac(me){2,} will match "Acmeme", "Acmememe" and so on.
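
A small check of the bounds operators, sketched in Python with made-up candidates (full_matches is a hypothetical helper):

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

c_runs = ["Acme", "Accme", "Acccme", "Accccme"]
print(full_matches(r"Ac{2}me", c_runs))    # ['Accme']
print(full_matches(r"Ac{2,3}me", c_runs))  # ['Accme', 'Acccme']

me_runs = ["Ac", "Acme", "Acmeme", "Acmememe"]
print(full_matches(r"Ac(me){2,}", me_runs))  # ['Acmeme', 'Acmememe']
```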

 

To form an alternative, i.e. to have either one subexpression or the other matched, use the | operator.

 

For example:

 
(Acme)|(Our Company) will match either "Acme" or "Our Company".

 
A(c|C)me will match "Acme" and "ACme".
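
Both alternation examples can be tried on a sample sentence. The sketch below is in Python and the sentence is invented for the purpose:

```python
import re

text = "Acme merged with Our Company; some write ACme."

# finditer yields every non-overlapping match; group() is the matched text.
either = [m.group() for m in re.finditer(r"(Acme)|(Our Company)", text)]
print(either)  # ['Acme', 'Our Company']

# The alternative can also be limited to one position inside a word.
spellings = [m.group() for m in re.finditer(r"A(c|C)me", text)]
print(spellings)  # ['Acme', 'ACme']
```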

 

To match a single character that is a member of a given set, use the square bracket operators [ and ].

 

For example:

 

A[cC$]me will match "Acme", "ACme" and "A$me".

 
[^abc] will match any character other than "a","b" and "c".

 
[a-d] will match any character in range "a" to "d".

 
[^a-d] will match any character outside the range.
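
The set examples above behave as follows when tried in Python (an illustrative sketch; the sample strings are made up):

```python
import re

# A set matches exactly one character out of those listed.
brands = re.findall(r"A[cC$]me", "Acme ACme A$me Axme")
print(brands)  # ['Acme', 'ACme', 'A$me']

# A leading ^ inverts the set.
not_abc = re.findall(r"[^abc]", "abcde")
print(not_abc)  # ['d', 'e']

# A hyphen between two characters denotes a range.
in_range = re.findall(r"[a-d]", "badge")
print(in_range)  # ['b', 'a', 'd']

out_of_range = re.findall(r"[^a-d]", "badge")
print(out_of_range)  # ['g', 'e']
```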

 

Sets can also contain character classes, written as [:classname:] inside the set. For instance, [[:space:]] is a set containing all whitespace characters.

 

Available character classes:

 

alnum        Any alphanumeric character
alpha        Any alphabetical character in range a to z and A to Z plus optionally other alphabetical (national) characters depending on the locale settings
blank        Any blank character (either a space or a tab character)
cntrl        Any control character
digit        Any digit (0-9)
graph        Any graphical character
lower        Any lowercase character a-z plus optionally other (national) characters depending on the locale
print        Any printable character
punct        Any punctuation character
space        Any whitespace character (a space, a tab or a newline)
upper        Any uppercase character A-Z plus optionally other (national) characters depending on the locale
xdigit       Any hexadecimal digit
word         Any alphanumeric character or an underscore
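
Not every regular expression engine understands the [:classname:] syntax; Python's re module, for one, does not. For testing such patterns outside the program, the classes can be approximated by rewriting them as explicit ranges. The sketch below is an assumption-laden approximation: the table POSIX_CLASSES and the helper expand_classes are hypothetical, and the ranges cover ASCII only, ignoring the locale-dependent national characters mentioned above.

```python
import re

# ASCII-only approximations, written as the contents of a Python set.
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": " \\t",
    "cntrl": "\\x00-\\x1f\\x7f",
    "digit": "0-9",
    "graph": "\\x21-\\x7e",
    "lower": "a-z",
    "print": "\\x20-\\x7e",
    "punct": "!-/:-@\\[-`{-~",
    "space": " \\t\\n\\r\\f\\v",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
    "word": "a-zA-Z0-9_",
}

def expand_classes(pattern):
    """Replace each [:name:] in the pattern with its ASCII equivalent."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)], pattern)

# "Email:", optional whitespace, then a run of non-whitespace characters.
email_field = expand_classes(r"Email:[[:space:]]*[^[:space:]]+")
found = re.search(email_field, "Email:        joe@somedomain.com").group()
print(found)  # Email:        joe@somedomain.com
```

An unknown class name raises KeyError, which seems preferable to silently producing a pattern that matches something else.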

 

You can use ^ and $ to match the start and end of a line respectively, for example:
 
^Acme$ will match "Acme" only when it is the only text on the line; for instance, "Acme GmbH" will not be matched.

 
^Email will match "Email" that begins a line, i.e. without any characters preceding the text.
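
The anchors can be illustrated with a multi-line sample. One caveat when experimenting in Python: there, ^ and $ apply to individual lines only when the re.MULTILINE flag is given, which is an aspect of Python and not necessarily of FollowUpExpert. The sample text is made up.

```python
import re

text = "Acme\nAcme GmbH\nEmail: joe@somedomain.com\nSend Email later"

# Only the first line consists of "Acme" and nothing else.
whole_lines = re.findall(r"^Acme$", text, re.MULTILINE)
print(whole_lines)  # ['Acme']

# "Email" occurs twice, but only once at the beginning of a line.
print(len(re.findall(r"Email", text)))                  # 2
print(len(re.findall(r"^Email", text, re.MULTILINE)))   # 1
```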

 

Note: The program chooses the match that starts earliest in the text. If more than one match is possible from that position, the longest one is chosen. Where several matches start at the same location and have the same length, the match with the longest first sub-expression is chosen. If that is also the same for two or more matches, the second sub-expression is taken into account, and so on.
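
This rule matters when testing patterns in other tools. Python's re module, for example, prefers the first alternative that succeeds instead of the longest match. The sketch below emulates the "earliest start, then longest" part of the rule on top of Python's re by brute force; leftmost_longest is a hypothetical helper, it does not reproduce the sub-expression tie-breaking, and it is meant for short test strings only.

```python
import re

def leftmost_longest(pattern, text):
    """Return the earliest match, preferring the longest at that start."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        # Try the longest candidate end first and shrink from there.
        for end in range(len(text), start - 1, -1):
            match = compiled.fullmatch(text, start, end)
            if match:
                return match.group()
    return None

# Python alone stops at the first alternative that succeeds.
print(re.search(r"Ac|Acme", "Acme GmbH").group())  # Ac

# The emulated rule picks the longer match at the same position.
print(leftmost_longest(r"Ac|Acme", "Acme GmbH"))   # Acme
```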

 

 

A nice tutorial about regular expressions:

 

http://www.ilovejackdaniels.com/cheat-sheets/regular-expressions-cheat-sheet/

 

(It is probably better than the one below plus it comes with a handy PDF cheatsheet. :)