Regular Expressions #
A Regular Expression (regexp for short) is a sequence of characters that specifies a search
pattern. Such patterns are used in the
Filter Transformations to search for
similar string values.
Every character in Regular Expressions is either a metacharacter, having a special meaning,
or a regular character that has a literal meaning. For example, a regexp in the form of
.*gr[ae]y.* has metacharacters (., *, []) and literal characters (g, r, a,
e, y). This regexp matches only strings that contain gray or grey (e.g.
small gray boxes, grey doors or stingrays).
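As a rough illustration of the example above, the sketch below runs the same pattern with Python's re module. It assumes the filter behaves like re.fullmatch, which may differ from the tool's actual matching mode. The helper name contains_gray and the sample list are made up for this example.

```python
import re

# Pattern from the text: any characters, then "gray" or "grey", then any characters.
PATTERN = re.compile(r".*gr[ae]y.*")


def contains_gray(value):
    """Return True when the whole string matches the pattern (hypothetical helper)."""
    return PATTERN.fullmatch(value) is not None


samples = ["small gray boxes", "grey doors", "stingrays", "green doors"]
matching = [s for s in samples if contains_gray(s)]
print(matching)  # ['small gray boxes', 'grey doors', 'stingrays']
```

"green doors" is rejected because gr is followed by ee rather than ay or ey.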
Metacharacters #
The metacharacters are described below, each with some examples. For an in-depth
tutorial on Regular Expressions, you can visit the
Python re library documentation.
If you want to test your regexp, visit this free online tool.
Dot #
. (Dot) matches any character.
regexp = p.t
matches: pat, pbt, pAt, p4t, p$t, etc.
Asterisk #
* (Asterisk) causes the preceding regexp or character to match 0 or more repetitions of
that regexp or character.
regexp = mo*re
matches: mre, more, moore, mooore, etc.
regexp = .*gray
matches: gray, stingray, $7!Ngray, dark gray, etc.
Caret #
^ (Caret) matches the start of the string.
regexp = ^gray.*
matches: gray or gray door
does not match: stingray or small gray box
Dollar #
$ (Dollar) matches the end of the string.
regexp = gray$
matches: gray, stingray or dark gray
does not match: small gray box or gray door
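The two anchors can be compared side by side. This is a minimal sketch using Python's re.search, on the assumption that the tool's regexps behave like Python's. The values list is illustrative only.

```python
import re

values = ["gray", "gray door", "stingray", "small gray box", "dark gray"]

# ^gray anchors the match to the start of the string,
# gray$ anchors it to the end.
starts_with_gray = [v for v in values if re.search(r"^gray", v)]
ends_with_gray = [v for v in values if re.search(r"gray$", v)]

print(starts_with_gray)  # ['gray', 'gray door']
print(ends_with_gray)    # ['gray', 'stingray', 'dark gray']
```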
Plus #
+ (Plus) causes the preceding regexp or character to match at least 1 repetition of
that regexp or character.
regexp = mo+re
matches: more, moore, mooore, etc.
does not match: mre
regexp = .+gray
matches: stingray, $7!Ngray, dark gray, etc.
does not match: gray
Question mark #
? (Question mark) causes the preceding regexp or character to match 0 or 1 repetition of
that regexp or character.
regexp = mo?re
matches: mre, more
does not match: moore, mooore, etc.
regexp = .?gray
matches: gray, Ngray, _gray, 4gray, %gray, etc.
does not match: stingray, small gray box, dark gray, etc.
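To see how *, + and ? differ on the same input, here is a small sketch. It assumes whole-string matching as done by Python's re.fullmatch. The helper full_matches is a hypothetical name introduced for this example.

```python
import re

words = ["mre", "more", "moore", "mooore"]


def full_matches(pattern):
    """Return the words that match the pattern in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]


star = full_matches(r"mo*re")      # 0 or more "o"
plus = full_matches(r"mo+re")      # at least 1 "o"
optional = full_matches(r"mo?re")  # 0 or 1 "o"

print(star)      # ['mre', 'more', 'moore', 'mooore']
print(plus)      # ['more', 'moore', 'mooore']
print(optional)  # ['mre', 'more']
```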
When placed after another repetition metacharacter (*, +, ? or {m,n}), the ? metacharacter
also limits the number of repetitions of the preceding regexp to the least possible number.
regexp = gr.*y
matches: grey, gray, groovy, gr^y, gray stingray, grey stingray, etc.
regexp = gr.*?y
matches: grey, gray, groovy, gr^y, etc.
does not match: gray stingray, grey stingray
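The difference between the two forms is easiest to see by looking at the text each one consumes. The sketch below uses Python's re.match, which is an assumption about the underlying engine. The sample string is illustrative.

```python
import re

text = "gray stingray"

# Greedy: .* takes as many characters as possible before the final "y".
greedy = re.match(r"gr.*y", text).group()

# Non-greedy: .*? takes as few characters as possible before the first "y".
lazy = re.match(r"gr.*?y", text).group()

print(greedy)  # gray stingray
print(lazy)    # gray
```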
Braces #
{m} (Braces or Curly brackets) causes the preceding regexp or character to match exactly
m repetitions.
regexp = .{2}vy
matches: any 4 character long string ending with vy (e.g. navy, levy, envy, wavy, bevy, cavy, davy, jivy, tivy, 12vy, etc.)
does not match: strings ending with vy that are longer or shorter than 4 characters (e.g. groovy, ivy, heavy, gravy, anchovy, scurvy, etc.)
You can also specify a second number inside the Braces ({m,n}), causing the preceding regexp
or character to match from m to n repetitions.
regexp = .{1,3}vy
matches: any 3 to 5 character long string ending with vy (e.g. navy, ivy, gravy, etc.)
does not match: strings ending with vy that are longer than 5 characters (e.g. groovy, anchovy, scurvy, etc.)
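A short sketch of both brace forms, assuming whole-string matching with Python's re.fullmatch. The word list is only an illustration.

```python
import re

words = ["ivy", "navy", "gravy", "groovy", "anchovy"]

# Exactly two characters before "vy".
exactly_two = [w for w in words if re.fullmatch(r".{2}vy", w)]

# Between one and three characters before "vy".
one_to_three = [w for w in words if re.fullmatch(r".{1,3}vy", w)]

print(exactly_two)   # ['navy']
print(one_to_three)  # ['ivy', 'navy', 'gravy']
```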
Square brackets #
[] (Square brackets) indicate a set of characters. In a set:
- characters can be listed individually,
  regexp = [chf]at
  matches: cat, hat, fat
  does not match: pat, mat, bat, sat, rat, etc.
- ranges of characters can be indicated by giving two characters and separating them by a ‘-’, e.g. [a-z],
  regexp = [b-f]at
  matches: bat, cat, dat, eat, fat
  does not match: aat, mat, sat, rat, etc.
- special characters lose their special meaning inside sets and are treated as literal characters,
  regexp = [.*{2}]
  matches: ., *, {, } and 2
- to match a literal ] inside a set, precede it with a backslash \, or place it at the beginning of the set,
  regexp = [()[\]{}] or []()[{}]
  matches: [, ], {, }, ( and )
- a Caret ^ placed as the first character of a set negates it, so the set matches any character that is not listed.
  regexp = [^chf]at
  matches: pat, mat, bat, sat, rat, etc.
  does not match: cat, hat, fat
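The set rules above can be checked with a few lines of Python. This is a sketch that assumes the tool's sets behave like those of Python's re module. The sample words and the string "f(x)[0]{}" are made up for the example.

```python
import re

words = ["cat", "hat", "fat", "bat", "mat"]

listed = [w for w in words if re.fullmatch(r"[chf]at", w)]    # listed characters
ranged = [w for w in words if re.fullmatch(r"[b-f]at", w)]    # character range
negated = [w for w in words if re.fullmatch(r"[^chf]at", w)]  # negated set

# A literal ] inside a set is escaped with a backslash.
brackets = re.findall(r"[()[\]{}]", "f(x)[0]{}")

print(listed)    # ['cat', 'hat', 'fat']
print(ranged)    # ['cat', 'fat', 'bat']
print(negated)   # ['bat', 'mat']
print(brackets)  # ['(', ')', '[', ']', '{', '}']
```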
Pipe #
| (Pipe) creates a regexp that matches either of the regexps the | is between.
regexp = gray|grey
matches: gray or grey
Brackets #
() (Brackets) match whatever regexp is inside them and indicate the
start and end of a group. The contents of a group can be retrieved after a match has
been performed, and can be matched later in the string with the
\number special sequence. Inserting a question mark ? at
the beginning of a group creates an extension notation. The first character after the
? determines the meaning and further syntax of the construct. Below you
can see the most common usage.
Ignore character case (?i)
regexp = (?i)Newton
matches: Newton, newton, NeWtOn
Lookahead assertion (?=…)
regexp = Isaac (?=Newton)
matches: Isaac followed by a space, but only in Isaac Newton (Newton itself is not part of the match)
does not match: Isaac Hanson, Isaac Asimov, Isaac
Negative Lookahead assertion (?!…)
regexp = Isaac (?!Newton)
matches: Isaac followed by a space, when Newton does not come next (e.g. in Isaac Hanson or Isaac Asimov)
does not match: Isaac Newton
More advanced usage of the () metacharacter
can be found here.
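The sketch below shows how these extension notations behave in Python's re module when the pattern is searched for inside a string with re.search. This is an assumption: the results can differ if the tool requires the whole string to match. The names list is illustrative.

```python
import re

names = ["Isaac Newton", "Isaac Asimov", "isaac newton"]

# Lookahead: "Isaac " must be followed by "Newton".
followed = [n for n in names if re.search(r"Isaac (?=Newton)", n)]

# Negative lookahead: "Isaac " must not be followed by "Newton".
not_followed = [n for n in names if re.search(r"Isaac (?!Newton)", n)]

# (?i) turns on case-insensitive matching for the whole pattern.
ignore_case = [n for n in names if re.search(r"(?i)isaac newton", n)]

# A lookahead checks the following text without consuming it.
consumed = re.search(r"Isaac (?=Newton)", "Isaac Newton").group()

print(followed)        # ['Isaac Newton']
print(not_followed)    # ['Isaac Asimov']
print(ignore_case)     # ['Isaac Newton', 'isaac newton']
print(repr(consumed))  # 'Isaac '
```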
Backslash #
\ (Backslash) either escapes special characters (permitting you to match characters like *,
?, etc.), or signals a special sequence.
regexp = 2\*2=4
matches: 2*2=4
does not match: 22*2=44
regexp = 2*2=4
matches: 22=4, 2=4, 2222222=4
does not match: 2*2=4, 22*2=44
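The effect of escaping can be sketched in Python as follows, assuming whole-string matching with re.fullmatch. When a literal string has to be turned into a regexp, Python's re.escape adds the needed backslashes. Whether the tool offers anything similar is not stated here.

```python
import re

# Escaped: the asterisk is a literal character.
escaped = re.fullmatch(r"2\*2=4", "2*2=4") is not None

# Unescaped: "2*" means "zero or more 2s", so the literal text no longer matches.
unescaped = re.fullmatch(r"2*2=4", "2*2=4") is not None
repeated = re.fullmatch(r"2*2=4", "2222222=4") is not None

# re.escape builds a pattern that matches the given text literally.
literal_pattern = re.escape("2*2=4")
literal = re.fullmatch(literal_pattern, "2*2=4") is not None

print(escaped, unescaped, repeated, literal)  # True False True True
```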
Special Sequences #
You can use the Backslash to create a special sequence.
Backslash Number #
\number (Backslash Number) matches the contents of the group of the same number. Groups
are numbered starting from 1. For example, (.+) \1 matches the the or 55 55, but not
thethe (note the space after the group). This special sequence can only be used to match
one of the first 99 groups.
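The (.+) \1 example can be tried out with Python's re module. This is a minimal sketch that assumes whole-string matching with re.fullmatch.

```python
import re

# Group 1 captures one or more characters; \1 must repeat them after a space.
pattern = re.compile(r"(.+) \1")

repeated_word = pattern.fullmatch("the the")
repeated_number = pattern.fullmatch("55 55")
no_space = pattern.fullmatch("thethe")

print(repeated_word.group(1))    # the
print(repeated_number.group(1))  # 55
print(no_space)                  # None
```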
Backslash capital A #
\A (Backslash capital A) matches the start of the string (similar to Caret).
regexp = \Agray.*
matches: gray or gray door
does not match: stingray or small gray box
Backslash small b #
\b (Backslash small b) matches the empty string, but only at the beginning or end of a word.
regexp = \bgray\b
matches: gray between words or brackets, like small gray box
does not match: stingray, grayish, 3gray3
Backslash capital B #
\B (Backslash capital B) matches the empty string, but only when it is not at the beginning
or end of a word.
regexp = \Bray\B
matches: ray that is inside a word (like in portraying, hairsprays, arrays, etc.)
does not match: stingray, disarray, etc.
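Word boundaries are easier to follow with a quick check. The sketch below uses Python's re.search, assuming the tool follows Python's definition of a word (letters, digits and the underscore). The sample strings are illustrative.

```python
import re

phrases = ["small gray box", "(gray)", "stingray", "grayish", "3gray3"]
words = ["portraying", "arrays", "stingray", "disarray", "ray"]

# \b requires a word boundary on both sides of "gray".
whole_word = [p for p in phrases if re.search(r"\bgray\b", p)]

# \B requires that "ray" has word characters on both sides.
inner = [w for w in words if re.search(r"\Bray\B", w)]

print(whole_word)  # ['small gray box', '(gray)']
print(inner)       # ['portraying', 'arrays']
```

Digits count as word characters, which is why 3gray3 is not matched by \bgray\b.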
Backslash small d #
\d (Backslash small d) matches a character that is a decimal digit (similar to [0-9]).
regexp = \d.*
matches: any string that starts with a digit (e.g. 2 rays, 3 lemons, 99 problems, etc.)
Backslash capital D #
\D (Backslash capital D) matches a character that is not a decimal digit (similar to [^0-9]).
regexp = \D*
matches: any string that does not contain digits
does not match: 2 rays, 3 lemons, 99 problems, etc.
Backslash small s #
\s (Backslash small s) matches Unicode whitespace characters.
regexp = \s*
matches: strings made up only of whitespace, such as a space, a tab, \n, etc.
Backslash capital S #
\S (Backslash capital S) matches only characters that are not Unicode whitespace characters.
regexp = \S*
matches: gray, stingray, grayish
does not match: small gray door
Backslash small w #
\w (Backslash small w) matches only alphanumeric characters and the underscore.
regexp = \w*
matches: gray, stingray, grayish
does not match: small gray door
Backslash capital W #
\W (Backslash capital W) matches characters that are not alphanumeric characters.
regexp = \W*
matches: strings made up only of non-alphanumeric characters (e.g. !?, $%&)
does not match: gray, stingray, grayish, small gray door
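The character-class sequences (\d, \D, \s, \w, \W) can be compared on one input. This is a sketch with Python's re module, and it assumes the tool uses the same definitions. The sample text is borrowed from the examples above.

```python
import re

text = "2 rays, 3 lemons, 99 problems"

digits = re.findall(r"\d+", text)  # runs of decimal digits
words = re.findall(r"\w+", text)   # runs of alphanumeric characters

# \D* matches in full only when the string contains no digits.
no_digits = re.fullmatch(r"\D*", "rays and lemons") is not None
has_digits = re.fullmatch(r"\D*", text) is not None

# \s+ splits on any run of whitespace (spaces, tabs, newlines).
tokens = re.split(r"\s+", "small  gray\tdoor")

# \W removes everything that is not alphanumeric or an underscore.
stripped = re.sub(r"\W", "", "99 problems!")

print(digits)                 # ['2', '3', '99']
print(words)                  # ['2', 'rays', '3', 'lemons', '99', 'problems']
print(no_digits, has_digits)  # True False
print(tokens)                 # ['small', 'gray', 'door']
print(stripped)               # 99problems
```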
Backslash capital Z #
\Z (Backslash capital Z) matches the end of the string.
regexp = gray\Z
matches: gray, stingray or dark gray
does not match: small gray box or gray door
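In Python's re module, \Z and $ differ in one detail: $ also matches just before a trailing newline, while \Z matches only at the very end of the string. The sketch below illustrates this under the assumption that the tool follows Python's behaviour. The sample values are made up.

```python
import re

plain = "dark gray"
with_newline = "dark gray\n"

# Both anchors agree when nothing follows "gray".
dollar_plain = re.search(r"gray$", plain) is not None
z_plain = re.search(r"gray\Z", plain) is not None

# With a trailing newline, only $ still matches.
dollar_newline = re.search(r"gray$", with_newline) is not None
z_newline = re.search(r"gray\Z", with_newline) is not None

print(dollar_plain, z_plain)      # True True
print(dollar_newline, z_newline)  # True False
```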