Regular Expressions #
A Regular Expression (regexp for short) is a sequence of characters that specifies a search pattern. Such patterns are used in the Filter Transformations to search for similar string values.
Every character in a Regular Expression is either a metacharacter, which has a special meaning, or a regular character, which has a literal meaning. For example, the regexp
.*gr[ae]y.*
contains metacharacters (., *, []) and literal characters (g, r, a, e, y). This regexp matches only strings that contain gray or grey (e.g. small gray boxes, grey doors or stingrays).
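The example above can be tried in Python. This is only a sketch: it assumes the Filter Transformations evaluate patterns the way Python's re module does, which this page suggests but does not state. The values green door and gravy are made up for illustration.

```python
import re

# Pattern from the text: anything, then "gray" or "grey", then anything.
pattern = re.compile(r".*gr[ae]y.*")

values = ["small gray boxes", "grey doors", "stingrays", "green door", "gravy"]

# fullmatch() succeeds only if the whole string fits the pattern.
matching = [v for v in values if pattern.fullmatch(v)]
print(matching)  # ['small gray boxes', 'grey doors', 'stingrays']
```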
Metacharacters #
Below, each metacharacter is explained with some examples. For an in-depth tutorial on Regular Expressions, see the Python re library documentation.
If you want to test your regexp, visit this free online tool.
Dot #
. (Dot) matches any single character (in Python's re, any character except a newline by default).
regexp = p.t
matches: pat, pbt, pAt, p4t, p$t, etc.
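A short Python sketch of the Dot, assuming the engine behaves like Python's re module. The candidates pt, part and the one containing a newline are illustrative additions. By default the Dot does not match a newline unless the re.DOTALL flag is set.

```python
import re

dot = re.compile(r"p.t")
candidates = ["pat", "pbt", "pAt", "p4t", "p$t", "pt", "part", "p\nt"]

# Exactly one character must sit between "p" and "t".
dot_matches = [c for c in candidates if dot.fullmatch(c)]
print(dot_matches)  # ['pat', 'pbt', 'pAt', 'p4t', 'p$t']

# With re.DOTALL the Dot also accepts a newline.
with_dotall = re.fullmatch(r"p.t", "p\nt", flags=re.DOTALL)
print(with_dotall is not None)  # True
```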
Asterisk #
* (Asterisk) causes the preceding regexp or character to match 0 or more repetitions of that regexp or character.
regexp = mo*re
matches: mre, more, moore, mooore, etc.
regexp = .*gray
matches: gray, stingray, $7!Ngray, dark gray, etc.
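The two Asterisk examples can be checked with a small helper. This is a sketch that assumes Python's re module; full_matches is a hypothetical helper name, and mare and gray door are extra values added to show non-matches.

```python
import re

def full_matches(regexp, values):
    """Return the values that the regexp matches from start to end."""
    return [v for v in values if re.fullmatch(regexp, v)]

mo_star = full_matches(r"mo*re", ["mre", "more", "moore", "mooore", "mare"])
print(mo_star)  # ['mre', 'more', 'moore', 'mooore']

any_gray = full_matches(
    r".*gray", ["gray", "stingray", "$7!Ngray", "dark gray", "gray door"]
)
print(any_gray)  # ['gray', 'stingray', '$7!Ngray', 'dark gray']
```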
Caret #
^ (Caret) matches the start of the string.
regexp = ^gray.*
matches: gray or gray door
does not match: stingray or small gray box
Dollar #
$ (Dollar) matches the end of the string.
regexp = gray$
matches: gray, stingray or dark gray
does not match: small gray box or gray door
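The two anchors, ^ and $, can be tried side by side. The sketch below assumes Python's re module and uses re.search, which looks for the pattern anywhere in the string, so the anchors alone decide where a match may sit.

```python
import re

values = ["gray", "gray door", "stingray", "dark gray", "small gray box"]

# ^ pins the match to the start of the string.
starts_with_gray = [v for v in values if re.search(r"^gray", v)]
# $ pins the match to the end of the string.
ends_with_gray = [v for v in values if re.search(r"gray$", v)]

print(starts_with_gray)  # ['gray', 'gray door']
print(ends_with_gray)    # ['gray', 'stingray', 'dark gray']
```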
Plus #
+ (Plus) causes the preceding regexp or character to match at least 1 repetition of that regexp or character.
regexp = mo+re
matches: more, moore, mooore, etc.
does not match: mre
regexp = .+gray
matches: stingray, $7!Ngray, dark gray, etc.
does not match: gray
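A quick check of the Plus examples, again as a sketch that assumes Python's re module. Unlike the Asterisk, the Plus rejects the case with zero repetitions.

```python
import re

mo_values = ["mre", "more", "moore", "mooore"]
mo_plus = [v for v in mo_values if re.fullmatch(r"mo+re", v)]
print(mo_plus)  # ['more', 'moore', 'mooore']

gray_values = ["gray", "stingray", "$7!Ngray", "dark gray"]
dot_plus = [v for v in gray_values if re.fullmatch(r".+gray", v)]
print(dot_plus)  # ['stingray', '$7!Ngray', 'dark gray']
```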
Question mark #
? (Question mark) causes the preceding regexp or character to match 0 or 1 repetition of that regexp or character.
regexp = mo?re
matches: mre, more
does not match: moore, mooore, etc.
regexp = .?gray
matches: gray, Ngray, _gray, 4gray, %gray, etc.
does not match: stingray, small gray box, dark gray, etc.
The ? metacharacter, when placed right after another quantifier (*, +, ? or {m,n}), also limits the number of repetitions of the preceding regexp to the fewest possible.
regexp = gr.*y
matches: grey, gray, groovy, gr^y, gray stingray, grey stingray, etc.
regexp = gr.*?y
matches: grey, gray, groovy, gr^y, etc.
in gray stingray or grey stingray it stops at the first y, so it matches only the leading gray or grey rather than the whole string
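Both roles of the Question mark can be seen in a short sketch, assuming Python's re module. The second half uses re.search and prints the matched text, because the difference between the greedy and the minimal form lies in how much of the string is consumed.

```python
import re

# Role 1: make the preceding character optional.
optional_values = ["mre", "more", "moore"]
optional = [v for v in optional_values if re.fullmatch(r"mo?re", v)]
print(optional)  # ['mre', 'more']

# Role 2: after another quantifier, take as few repetitions as possible.
text = "gray stingray"
greedy = re.search(r"gr.*y", text).group()
lazy = re.search(r"gr.*?y", text).group()
print(greedy)  # gray stingray
print(lazy)    # gray
```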
Braces #
{m} (Braces or Curly brackets) causes the preceding regexp or character to match exactly m repetitions.
regexp = .{2}vy
matches: any 4-character string ending with vy (e.g. navy, levy, envy, wavy, bevy, cavy, davy, jivy, tivy, 12vy, etc.)
does not match: strings ending with vy that are longer or shorter than 4 characters (e.g. groovy, ivy, heavy, gravy, anchovy, scurvy, etc.)
You can also specify a second number inside the Braces ({m,n}), causing the preceding regexp or character to match from m to n repetitions.
regexp = .{1,3}vy
matches: any string of 3 to 5 characters ending with vy (e.g. navy, ivy, gravy, etc.)
does not match: strings ending with vy that are longer than 5 characters (e.g. groovy, anchovy, scurvy, etc.)
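The Braces examples can be reproduced as follows. This is a sketch assuming Python's re module; the bare string vy is added to show that at least one leading character is required by {1,3}.

```python
import re

words = ["navy", "12vy", "ivy", "gravy", "groovy", "anchovy", "vy"]

# Exactly two characters before "vy".
exactly_two = [w for w in words if re.fullmatch(r".{2}vy", w)]
# Between one and three characters before "vy".
one_to_three = [w for w in words if re.fullmatch(r".{1,3}vy", w)]

print(exactly_two)   # ['navy', '12vy']
print(one_to_three)  # ['navy', '12vy', 'ivy', 'gravy']
```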
Square brackets #
[] (Square brackets) is used to indicate a set of characters. In a set:
- characters can be listed individually,
  regexp = [chf]at
  matches: cat, hat, fat
  does not match: pat, mat, bat, sat, rat, etc.
- ranges of characters can be indicated by giving two characters and separating them by a '-', e.g. [a-z],
  regexp = [b-f]at
  matches: bat, cat, dat, eat, fat
  does not match: aat, mat, sat, rat, etc.
- special characters lose their special meaning inside sets and are treated as literal characters,
  regexp = [.*{2}]
  matches: ., *, {, } and 2
- to match a literal ] inside a set, precede it with a backslash \, or place it at the beginning of the set,
  regexp = [()[\]{}] or []()[{}]
  matches: [, ], {, }, ( and )
- the Caret ^ placed at the beginning of a set excludes the listed characters.
  regexp = [^chf]at
  matches: pat, mat, bat, sat, rat, etc.
  does not match: cat, hat, fat
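The set rules listed above can be sketched in Python, assuming the engine follows Python's re module. The sample text a.b*c{2}d is made up to show that metacharacters are literal inside a set.

```python
import re

words = ["cat", "hat", "fat", "bat", "eat", "pat", "rat"]

listed = [w for w in words if re.fullmatch(r"[chf]at", w)]    # listed characters
ranged = [w for w in words if re.fullmatch(r"[b-f]at", w)]    # a range b..f
negated = [w for w in words if re.fullmatch(r"[^chf]at", w)]  # anything but c, h, f

print(listed)   # ['cat', 'hat', 'fat']
print(ranged)   # ['cat', 'fat', 'bat', 'eat']
print(negated)  # ['bat', 'eat', 'pat', 'rat']

# Inside a set, metacharacters such as . * { } are plain characters.
literal = re.findall(r"[.*{2}]", "a.b*c{2}d")
print(literal)  # ['.', '*', '{', '2', '}']
```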
Pipe #
| (Pipe) creates a regexp that matches either of the regexps the | is between.
regexp = gray|grey
matches: gray or grey
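A small sketch of the Pipe, assuming Python's re module. The values groy and green are illustrative non-matches. The second pattern shows a common variation that wraps the alternatives in a group so the alternation applies to one character only.

```python
import re

colors = ["gray", "grey", "groy", "green"]

# The Pipe separates two complete alternatives.
either = [c for c in colors if re.fullmatch(r"gray|grey", c)]
print(either)  # ['gray', 'grey']

# A group limits the alternation to a part of the pattern.
grouped = [c for c in colors if re.fullmatch(r"gr(a|e)y", c)]
print(grouped)  # ['gray', 'grey']
```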
Brackets #
() (Brackets) matches whatever regexp is inside the parentheses, and indicates the start and end of a group. The contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence. Inserting a question mark ? at the beginning of a group creates an extension notation. The first character after the ? determines the meaning and further syntax of the construct. The most common usages are shown below.
Ignore character case (?i)
regexp = (?i)Newton
matches: Newton, newton, NeWtOn
Lookahead assertion (?=…)
regexp = Isaac (?=Newton)
matches: Isaac (with its trailing space) only when it is followed by Newton, as in Isaac Newton
does not match: Isaac Hanson, Isaac Asimov, Isaac
Negative Lookahead assertion (?!…)
regexp = Isaac (?!Newton)
matches: Isaac (with its trailing space) only when it is not followed by Newton, as in Isaac Hanson or Isaac Asimov
does not match: Isaac Newton
More advanced usage of the () metacharacter can be found here.
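The group constructs described in this section can be tried in Python. This is a sketch assuming Python's re module. Neutron is an illustrative non-match, and the last line uses a capturing group that is not taken from this page.

```python
import re

# (?i) makes the whole pattern case-insensitive.
names = ["Newton", "newton", "NeWtOn", "Neutron"]
case_insensitive = [n for n in names if re.fullmatch(r"(?i)Newton", n)]
print(case_insensitive)  # ['Newton', 'newton', 'NeWtOn']

people = ["Isaac Newton", "Isaac Hanson", "Isaac Asimov"]

# (?=...) succeeds only if "Newton" follows, without consuming it.
followed = [p for p in people if re.search(r"Isaac (?=Newton)", p)]
# (?!...) succeeds only if "Newton" does not follow.
not_followed = [p for p in people if re.search(r"Isaac (?!Newton)", p)]
print(followed)      # ['Isaac Newton']
print(not_followed)  # ['Isaac Hanson', 'Isaac Asimov']

# The text inside the lookahead is not part of the match itself.
matched_text = re.search(r"Isaac (?=Newton)", "Isaac Newton").group()
print(repr(matched_text))  # 'Isaac '

# A plain group captures text that can be read back after the match.
first_name = re.search(r"(\w+) Newton", "Isaac Newton").group(1)
print(first_name)  # Isaac
```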
Backslash #
\
(Backslash) either escapes special characters (permitting you to match characters like *, ?, etc.), or signals a special sequence.
regexp = 2\*2=4
matches: 2*2=4
does not match: 22*2=44
regexp = 2*2=4
matches: 22=4, 2=4, 2222222=4
does not match: 2*2=4, 22*2=44
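The effect of escaping can be checked with the sketch below, which assumes Python's re module. It also uses re.escape, a standard-library helper that adds the needed backslashes to a literal string. Whether the Filter Transformations offer anything similar is not stated on this page.

```python
import re

equations = ["2*2=4", "22=4", "2=4", "2222222=4", "22*2=44"]

# Escaped: the asterisk is a literal character.
escaped = [e for e in equations if re.fullmatch(r"2\*2=4", e)]
# Unescaped: the asterisk repeats the preceding "2".
unescaped = [e for e in equations if re.fullmatch(r"2*2=4", e)]

print(escaped)    # ['2*2=4']
print(unescaped)  # ['22=4', '2=4', '2222222=4']

# re.escape() builds a pattern that matches the given text literally.
literal = re.compile(re.escape("2*2=4"))
print(literal.fullmatch("2*2=4") is not None)  # True
```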
Special Sequences #
You can use the Backslash to create a special sequence.
Backslash Number #
\number (Backslash Number) matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches the the or 55 55, but not thethe (note the space after the group). This special sequence can only be used to match one of the first 99 groups.
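The backreference example can be verified with a short sketch, assuming Python's re module. The value the cat is added as an extra non-match.

```python
import re

# Group 1 captures some text; \1 requires the same text again after a space.
repeated = re.compile(r"(.+) \1")

samples = ["the the", "55 55", "thethe", "the cat"]
doubled = [s for s in samples if repeated.fullmatch(s)]
print(doubled)  # ['the the', '55 55']
```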
Backslash capital A #
\A (Backslash capital A) matches the start of the string (similar to Caret).
regexp = \Agray.*
matches: gray or gray door
does not match: stingray or small gray box
Backslash small b #
\b (Backslash small b) matches the empty string, but only at the beginning or end of a word.
regexp = \bgray\b
matches: gray as a separate word, between spaces or brackets, as in small gray box
does not match: stingray, grayish, 3gray3
Backslash capital B #
\B (Backslash capital B) matches the empty string, but only when it is not at the beginning or end of a word.
regexp = \Bray\B
matches: ray that is inside a word (as in portraying, hairsprays, arrays, etc.)
does not match: stingray, disarray, rayon, etc.
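Word boundaries are easiest to understand by experiment. The sketch below assumes Python's re module. The patterns are written as raw strings (r"..."), because in an ordinary Python string \b means a backspace character. The values (gray) and ray are illustrative additions. Note that grayish does contain ray strictly inside a word, between g and i, so \Bray\B finds it there.

```python
import re

texts = ["small gray box", "(gray)", "stingray", "grayish", "3gray3"]

# \b requires a word boundary on both sides of "gray".
whole_word = [t for t in texts if re.search(r"\bgray\b", t)]
print(whole_word)  # ['small gray box', '(gray)']

words = ["portraying", "arrays", "grayish", "stingray", "disarray", "ray"]

# \B requires that "ray" touches word characters on both sides.
inside_word = [w for w in words if re.search(r"\Bray\B", w)]
print(inside_word)  # ['portraying', 'arrays', 'grayish']
```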
Backslash small d #
\d (Backslash small d) matches any character that is a decimal digit (similar to [0-9]).
regexp = \d+.*
matches: any string that starts with a digit (e.g. 2 rays, 3 lemons, 99 problems, etc.)
Backslash capital D #
\D (Backslash capital D) matches any character that is not a decimal digit (similar to [^0-9]).
regexp = \D*
matches: any string that does not contain digits.
does not match: 2 rays, 3 lemons, 99 problems, etc.
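The digit classes can be checked as follows. This is a sketch assuming Python's re module; no digits here and room 101 are made-up values. It also shows why a pattern meant to require a leading digit needs \d+ (or a bare \d) rather than \d*, since \d* is satisfied by zero digits.

```python
import re

values = ["2 rays", "3 lemons", "99 problems", "no digits here", "room 101"]

# At least one digit at the start, then anything.
starts_with_digit = [v for v in values if re.fullmatch(r"\d+.*", v)]
# Only non-digit characters from start to end.
no_digits = [v for v in values if re.fullmatch(r"\D*", v)]

print(starts_with_digit)  # ['2 rays', '3 lemons', '99 problems']
print(no_digits)          # ['no digits here']

# \d* may match zero digits, so \d*.* accepts every value.
accepts_all = all(re.fullmatch(r"\d*.*", v) for v in values)
print(accepts_all)  # True
```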
Backslash small s #
\s (Backslash small s) matches Unicode whitespace characters.
regexp = \s*
matches: strings made up only of whitespace, such as a single space, \n, etc.
Backslash capital S #
\S (Backslash capital S) matches only characters that are not Unicode whitespace characters.
regexp = \S*
matches: gray, stingray, grayish
does not match: small gray door (it contains spaces), a single space
Backslash small w #
\w (Backslash small w) matches only alphanumeric characters and the underscore.
regexp = \w*
matches: gray, stingray, grayish
does not match: small gray door (it contains spaces), a single space
Backslash capital W #
\W (Backslash capital W) matches characters that are not alphanumeric characters or the underscore.
regexp = \W*
matches: strings made up only of such characters (e.g. !?, $%&, a run of spaces)
does not match: gray, stingray, grayish, small gray door
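The four character classes \s, \S, \w and \W can be compared on one list of values. This is a sketch assuming Python's re module; gray_2, the whitespace string and !? are illustrative values chosen to show the underscore, digits and punctuation.

```python
import re

values = ["gray", "stingray", "small gray door", "gray_2", " \n", "!?"]

only_space = [v for v in values if re.fullmatch(r"\s*", v)]  # whitespace only
no_space = [v for v in values if re.fullmatch(r"\S*", v)]    # no whitespace
only_word = [v for v in values if re.fullmatch(r"\w*", v)]   # letters, digits, _
no_word = [v for v in values if re.fullmatch(r"\W*", v)]     # none of those

print(only_space)  # [' \n']
print(no_space)    # ['gray', 'stingray', 'gray_2', '!?']
print(only_word)   # ['gray', 'stingray', 'gray_2']
print(no_word)     # [' \n', '!?']
```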
Backslash capital Z #
\Z (Backslash capital Z) matches the end of the string.
regexp = gray\Z
matches: gray, stingray or dark gray
does not match: small gray box or gray door
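The \A and \Z sequences can be tried next to their Caret and Dollar counterparts. This is a sketch assuming Python's re module. In that module there is one visible difference: $ also matches just before a newline at the very end of the string, while \Z matches only at the true end. Whether this matters for the Filter Transformations depends on the engine they use.

```python
import re

values = ["gray", "gray door", "stingray", "dark gray", "small gray box"]

at_start = [v for v in values if re.search(r"\Agray", v)]
at_end = [v for v in values if re.search(r"gray\Z", v)]
print(at_start)  # ['gray', 'gray door']
print(at_end)    # ['gray', 'stingray', 'dark gray']

# A trailing newline separates $ from \Z.
dollar_hit = re.search(r"gray$", "dark gray\n") is not None
z_hit = re.search(r"gray\Z", "dark gray\n") is not None
print(dollar_hit)  # True
print(z_hit)       # False
```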