Regular Expressions

A Regular Expression (regexp for short) is a sequence of characters that specifies a search pattern. Such patterns are used in the Filter Transformations to search for similar string values.

Every character in a Regular Expression is either a metacharacter, which has a special meaning, or a regular character, which has a literal meaning. For example, the regexp .*gr[ae]y.* contains metacharacters (., *, []) and literal characters (g, r, a, e, y). It matches only strings that contain gray or grey (e.g. small gray boxes, grey doors or stingrays).
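A minimal sketch of how this pattern behaves, using Python's re module. This page does not say whether a filter tests the whole value or looks for the pattern anywhere inside it; the sketch assumes whole-value matching with re.fullmatch, and the sample values are only illustrative.

```python
import re

pattern = re.compile(r".*gr[ae]y.*")

values = ["small gray boxes", "grey doors", "stingrays", "green doors"]

# Keep only the values that the pattern matches from start to end.
kept = [v for v in values if pattern.fullmatch(v)]
print(kept)
```

Here green doors is dropped because it contains neither gray nor grey.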

Metacharacters #

Below you can see what all metacharacters mean with some examples. If you want to check an in-depth tutorial on Regular Expressions, you can visit the Python re library documentation.

If you want to test your regexp, visit this free online tool

Dot #

. (Dot) matches any character except a newline.

regexp = p.t

matches: pat, pbt, pAt, p4t, p$t, etc.

Asterisk #

* (Asterisk) causes the preceding regexp or character to match 0 or more repetitions of that regexp or character.

regexp = mo*re

matches: mre, more, moore, mooore, etc.

regexp = .*gray

matches: gray, stingray, $7!Ngray, dark gray, etc.

Caret #

^ (Caret) matches the start of the string.

regexp = ^gray.*

matches: gray or gray door

does not match: stingray or small gray box

Dollar #

$ (Dollar) matches the end of the string.

regexp = gray$

matches: gray, stingray or dark gray

does not match: small gray box or gray door
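The two anchors can be tried out in Python as sketched below. The sketch uses re.search, which looks for the pattern anywhere in the string, so that the anchors alone do the restricting; the sample values are taken from the examples above.

```python
import re

values = ["gray", "gray door", "stingray", "dark gray", "small gray box"]

# ^ pins the match to the start of the string, $ pins it to the end.
starts_with_gray = [v for v in values if re.search(r"^gray", v)]
ends_with_gray = [v for v in values if re.search(r"gray$", v)]

print(starts_with_gray)
print(ends_with_gray)
```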

Plus #

+ (Plus) causes the preceding regexp or character to match at least 1 repetition of that regexp or character.

regexp = mo+re

matches: more, moore, mooore, etc.

does not match: mre

regexp = .+gray

matches: stingray, $7!Ngray, dark gray, etc.

does not match: gray
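The difference between the Asterisk and the Plus shows up only when there are zero repetitions. A short sketch in Python, assuming whole-string matching with re.fullmatch:

```python
import re

words = ["mre", "more", "moore", "mooore"]

# * accepts zero or more "o" characters, + requires at least one.
zero_or_more = [w for w in words if re.fullmatch(r"mo*re", w)]
one_or_more = [w for w in words if re.fullmatch(r"mo+re", w)]

# The same holds when the repeated element is the Dot.
star_gray = re.fullmatch(r".*gray", "gray") is not None
plus_gray = re.fullmatch(r".+gray", "gray") is not None

print(zero_or_more, one_or_more, star_gray, plus_gray)
```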

Question mark #

? (Question mark) causes the preceding regexp or character to match 0 or 1 repetition of that regexp or character.

regexp = mo?re

matches: mre, more

does not match: moore, mooore, etc.

regexp = .?gray

matches: gray, Ngray, _gray, 4gray, %gray, etc.

does not match: stingray, small gray box, dark gray, etc.

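The ? metacharacter, placed directly after another quantifier (*, +, ? or {m,n}), also makes that quantifier lazy: it limits the number of repetitions of the preceding regexp to the least possible number.

regexp = gr.*y

matches: grey, gray, groovy, gr^y, gray stingray, grey stingray, etc. When searching inside gray stingray, the match extends to the last y and covers the whole string.

regexp = gr.*?y

matches: the same strings, but when searching inside a longer string the match stops at the first y that completes the pattern: in gray stingray only gray is matched, and in grey stingray only grey.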
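The two roles of the Question mark can be compared side by side in Python. This is a sketch: re.fullmatch is assumed for the optional case, and re.search is used for the lazy case so that the matched portion is visible.

```python
import re

# Optional element: the "o" may appear once or not at all.
optional = [w for w in ["mre", "more", "moore"] if re.fullmatch(r"mo?re", w)]

text = "gray stingray"

# Greedy: .* takes as much as it can, so the match runs to the last "y".
greedy = re.search(r"gr.*y", text).group(0)

# Lazy: .*? takes as little as it can, so the match stops at the first "y".
lazy = re.search(r"gr.*?y", text).group(0)

print(optional, repr(greedy), repr(lazy))
```

Note that laziness changes which part of the string is matched, not whether a match can be found: when the whole string must match, a lazy quantifier still expands as far as needed.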

Braces #

{m} (Braces or Curly brackets) causes the preceding regexp or character to match exactly m repetitions.

regexp = .{2}vy

matches: any 4 character long string ending with vy (e.g. navy, levy, envy, wavy, bevy, cavy, davy, jivy, tivy, 12vy, etc.)

does not match: strings ending with vy that are longer or shorter than 4 characters (e.g. groovy, ivy, heavy, gravy, anchovy, scurvy, etc.)

You can also specify a second number inside the Braces ({m,n}), causing the preceding regexp or character to match from m to n repetitions.

regexp = .{1,3}vy

matches: any 3 to 5 character long string ending with vy (e.g. navy, ivy, gravy, etc.)

does not match: strings ending with vy that are shorter than 3 or longer than 5 characters (e.g. vy, groovy, anchovy, scurvy, etc.)
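Both forms of the Braces can be checked with a few of the words above. The sketch assumes whole-string matching with re.fullmatch, which is what makes the total length matter.

```python
import re

words = ["navy", "ivy", "gravy", "groovy", "vy"]

# Exactly two characters before "vy".
exactly_two = [w for w in words if re.fullmatch(r".{2}vy", w)]

# Between one and three characters before "vy".
one_to_three = [w for w in words if re.fullmatch(r".{1,3}vy", w)]

print(exactly_two)
print(one_to_three)
```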

Square brackets #

[] (Square brackets) are used to indicate a set of characters. In a set:

  • characters can be listed individually,

    regexp = [chf]at

    matches: cat, hat, fat

    does not match: pat, mat, bat, sat, rat, etc.

  • ranges of characters can be indicated by giving two characters and separating them by a ‘-’, e.g. [a-z],

    regexp = [b-f]at

    matches: bat, cat, dat, eat, fat

    does not match: aat, mat, sat, rat, etc.

  • special characters lose their special meaning inside sets and are treated as literal characters,

    regexp = [.*{2}]

    matches: ., *, {, } and 2

  • to match a literal ] inside a set, precede it with a backslash \, or place it at the beginning of the set,

    regexp = [()[\]{}] or []()[{}]

    matches: [, ], {, }, ( and )

  • a Caret ^ placed as the first character of a set negates it, so the set matches any character that is not listed.

    regexp = [^chf]at

    matches: pat, mat, bat, sat, rat, etc.

    does not match: cat, hat, fat
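The set rules listed above can be sketched in Python as follows. re.fullmatch is assumed for the word filters, and re.findall is used to list the individual characters a set picks out; the sample strings are illustrative.

```python
import re

words = ["cat", "hat", "fat", "bat", "eat", "pat"]

listed = [w for w in words if re.fullmatch(r"[chf]at", w)]    # individual characters
ranged = [w for w in words if re.fullmatch(r"[b-f]at", w)]    # a range of characters
negated = [w for w in words if re.fullmatch(r"[^chf]at", w)]  # everything but c, h, f

# Inside a set, ., *, { and } are ordinary characters.
literals = re.findall(r"[.*{2}]", "a.b*c{2}")

# A literal ] is either escaped or placed first in the set.
escaped_bracket = re.findall(r"[()[\]{}]", "f(x)[0]{y}")
leading_bracket = re.findall(r"[]()[{}]", "f(x)[0]{y}")

print(listed, ranged, negated)
print(literals, escaped_bracket, leading_bracket)
```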

Pipe #

| (Pipe) creates a regexp that matches either the regexp on its left or the regexp on its right.

regexp = gray|grey

matches: gray or grey
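Because the Pipe separates whole regexps rather than single characters, everything on each side belongs to one alternative. A small sketch in Python, assuming whole-string matching; the second pattern is an illustrative addition that does not appear above.

```python
import re

simple = [w for w in ["gray", "grey", "groy"] if re.fullmatch(r"gray|grey", w)]

# The alternatives here are "gray" and "grey door", not "gray door".
values = ["gray", "grey", "grey door", "gray door"]
whole_sides = [v for v in values if re.fullmatch(r"gray|grey door", v)]

print(simple)
print(whole_sides)
```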

Brackets #

() (Brackets, also called parentheses) match whatever regexp is inside them and indicate the start and end of a group. The contents of a group can be retrieved after a match has been performed, and can be matched again later in the string with the \number special sequence. A question mark ? placed right after the opening bracket creates an extension notation; the first character after the ? determines the meaning and further syntax of the construct. The most common uses are shown below.

Ignore character case (?i)

regexp = (?i)Newton

matches: Newton, newton, NeWtOn

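Lookahead assertion (?=…)

regexp = Isaac (?=Newton)

matches: Isaac (with the trailing space), but only when it is followed by Newton, as in Isaac Newton. Newton itself is not part of the match.

does not match: Isaac Hanson, Isaac Asimov, Isaac

Negative Lookahead assertion (?!…)

regexp = Isaac (?!Newton)

matches: Isaac (with the trailing space), but only when it is not followed by Newton, as in Isaac Hanson or Isaac Asimov

does not match: Isaac Newton, or Isaac without a trailing space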
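The extension notations can be explored in Python as sketched below. Lookaheads check what follows without consuming it, so the sketch uses re.search and prints the matched text to show that it ends after the space; the name lists are illustrative.

```python
import re

# (?i) makes the whole pattern case-insensitive.
candidates = ["Newton", "newton", "NeWtOn", "Norton"]
case_insensitive = [n for n in candidates if re.fullmatch(r"(?i)Newton", n)]

names = ["Isaac Newton", "Isaac Asimov", "Isaac"]

# Positive lookahead: "Isaac " must be followed by "Newton".
followed = [n for n in names if re.search(r"Isaac (?=Newton)", n)]

# Negative lookahead: "Isaac " must not be followed by "Newton".
not_followed = [n for n in names if re.search(r"Isaac (?!Newton)", n)]

# The lookahead itself is not part of the matched text.
matched_text = re.search(r"Isaac (?=Newton)", "Isaac Newton").group(0)

print(case_insensitive, followed, not_followed, repr(matched_text))
```

The bare Isaac appears in neither list because both patterns require a space after the name.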

More advanced usage of the () metacharacter can be found here.

Backslash #

\ (Backslash) either escapes special characters (permitting you to match characters like *, ?, etc.), or signals a special sequence.

regexp = 2\*2=4

matches: 2*2=4

does not match: 22*2=44

regexp = 2*2=4

matches: 22=4, 2=4, 2222222=4

does not match: 2*2=4, 22*2=44
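The effect of escaping can be verified with a short Python sketch, assuming whole-string matching. It also uses re.escape, a helper from Python's re module that adds the backslashes needed to match a string literally; whether a similar helper exists in the tool itself is not stated here.

```python
import re

escaped = re.fullmatch(r"2\*2=4", "2*2=4") is not None     # * is a literal asterisk
unescaped = re.fullmatch(r"2*2=4", "2*2=4") is not None    # * repeats the first 2
repeated = re.fullmatch(r"2*2=4", "2222=4") is not None

# Build the escaped pattern automatically from the literal text.
auto_pattern = re.escape("2*2=4")
auto = re.fullmatch(auto_pattern, "2*2=4") is not None

print(escaped, unescaped, repeated, auto)
```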

Special Sequences #

You can use the Backslash to create a special sequence.

Backslash Number #

\number (Backslash Number) matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches the the or 55 55, but not thethe (note the space after the group). This special sequence can only be used to match one of the first 99 groups.
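A minimal sketch of the backreference example in Python, assuming whole-string matching. It also shows how the contents of the group can be retrieved after the match.

```python
import re

pattern = re.compile(r"(.+) \1")

doubled = pattern.fullmatch("the the")
digits = pattern.fullmatch("55 55")
no_space = pattern.fullmatch("thethe")

# group(1) returns the text captured by the first group.
first_word = doubled.group(1)
first_number = digits.group(1)

print(first_word, first_number, no_space)
```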

Backslash capital A #

\A (Backslash capital A) matches the start of the string (similar to Caret).

regexp = \Agray.*

matches: gray or gray door

does not match: stingray or small gray box

Backslash small b #

\b (Backslash small b) matches the empty string, but only at the beginning or end of a word.

regexp = \bgray\b

matches: gray between words or brackets like small gray box

does not match: stingray, grayish, 3gray3

Backslash capital B #

\B (Backslash capital B) matches the empty string, but only when it is not at the beginning or end of a word.

regexp = \Bray\B

matches: ray that is inside a word (like in portraying, hairsprays, arrays, grayish, etc.)

does not match: stingray, disarray, rayon, etc.
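Word boundaries are positions rather than characters, so they are easiest to observe with re.search. The sketch below uses Python's re module; the word lists are illustrative.

```python
import re

texts = ["small gray box", "stingray", "grayish", "3gray3"]

# \b requires a word edge on both sides of "gray".
whole_word = [t for t in texts if re.search(r"\bgray\b", t)]

words = ["portraying", "arrays", "grayish", "stingray", "rayon"]

# \B requires that "ray" touches a word character on both sides.
inside_word = [w for w in words if re.search(r"\Bray\B", w)]

print(whole_word)
print(inside_word)
```

Digits count as word characters, which is why 3gray3 has no boundary around gray.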

Backslash small d #

\d (Backslash small d) matches any character that is a decimal digit (similar to [0-9]).

regexp = \d.*

matches: any string that starts with a digit (e.g. 2 rays, 3 lemons, 99 problems, etc.)

Backslash capital D #

\D (Backslash capital D) matches any character that is not a decimal digit (similar to [^0-9]).

regexp = \D*

matches: any string that does not contain digits.

does not match: 2 rays, 3 lemons, 99 problems, etc.
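The digit sequences can be sketched in Python as follows, assuming whole-string matching for the filters. The pattern \d.* (one digit, then anything) is used to require a leading digit, and re.findall with \d+ is an illustrative extra that pulls out each run of digits.

```python
import re

values = ["2 rays", "99 problems", "no digits here", "box 7"]

# One digit at the start, followed by anything.
starts_with_digit = [v for v in values if re.fullmatch(r"\d.*", v)]

# Only non-digit characters from start to end.
no_digits = [v for v in values if re.fullmatch(r"\D*", v)]

# Every run of one or more digits in a longer text.
numbers = re.findall(r"\d+", "3 lemons, 99 problems")

print(starts_with_digit, no_digits, numbers)
```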

Backslash small s #

\s (Backslash small s) matches Unicode whitespace characters.

regexp = \s*

matches: strings made up only of whitespace, such as spaces, tabs (\t) and newlines (\n)

Backslash capital S #

\S (Backslash capital S) matches any character that is not a Unicode whitespace character.

regexp = \S*

matches: gray, stingray, grayish

does not match: small gray door

Backslash small w #

\w (Backslash small w) matches word characters: alphanumeric characters and the underscore _.

regexp = \w*

matches: gray, stingray, grayish

does not match: small gray door

Backslash capital W #

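\W (Backslash capital W) matches any character that is not a word character, i.e. neither alphanumeric nor an underscore.

regexp = \W*

matches: strings made up only of non-word characters, such as !?, $%& or a run of spaces

does not match: gray, stingray, grayish, small gray door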
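The four character classes \s, \S, \w and \W can be compared on the same input. The sketch below uses re.findall from Python's re module to list every run that each class picks out; the sample strings are illustrative.

```python
import re

text = "small gray\tdoor\n"

words = re.findall(r"\w+", text)   # runs of word characters
gaps = re.findall(r"\s+", text)    # runs of whitespace

sample = "a_b c-d"
chunks = re.findall(r"\S+", sample)    # runs without whitespace
non_word = re.findall(r"\W+", sample)  # runs of non-word characters

print(words, gaps)
print(chunks, non_word)
```

The underscore stays inside a_b for both \S and \w, while the hyphen is picked up by \W.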

Backslash capital Z #

\Z (Backslash capital Z) matches the end of the string.

regexp = gray\Z

matches: gray, stingray or dark gray

does not match: small gray box or gray door
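\A and \Z behave like the Caret and the Dollar on single-line values, but they differ in a few cases. The sketch below shows two such cases as they work in Python's re module; other regexp engines may treat them differently, so this is an illustration rather than a statement about every tool.

```python
import re

text = "dark gray\ngray door"

# In multi-line mode ^ also matches at the start of each line, \A never does.
caret_hits = re.findall(r"^gray", text, flags=re.MULTILINE)
a_hits = re.findall(r"\Agray", text, flags=re.MULTILINE)

# $ tolerates one trailing newline, \Z matches only at the very end.
dollar = re.search(r"gray$", "dark gray\n") is not None
z = re.search(r"gray\Z", "dark gray\n") is not None

print(caret_hits, a_hits, dollar, z)
```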