Regex cheatsheet

Understand regular expressions with examples based on PHP.

Regular expressions, also known as regex or regexp, are useful tools to segment, search for and rework a string of characters.

Each regex engine has its own specifics: some features are missing or behave differently. The following regexes are compatible with the PCRE (Perl Compatible Regular Expressions) extension of PHP. The slash "/" is used as the delimiter.

The basics

^    Beginning of the string.    /^foo/ : starts with foo
$    End of the string.    /bar$/ : ends with bar
[ ]    A "class": several options.    /[abc]/ : either "a", "b" or "c"
[^ ]    Forbidden characters.    /[^abc]/ : neither "a", "b" nor "c"
?    0 or 1 time.    /as?/ : "a" or "as" because "s" is optional
*    0 or more times.    /as*/ : "a", "as", "asssss"
+    1 or more times.    /as+/ : "as", "asssss"
{ }    Specific number of repetitions.    /as{2}/ : "ass". /as{2,4}/ : "ass", "asss", "assss". /as{2,}/ : "ass", "assssssss". Write the bounds without spaces: older PCRE versions treat {2, 4} as literal text.
[0-9] or \d    A digit.    /\d+/ : 0, 1, 2, 3
[A-Za-z0-9_] or \w    Digit, lowercase letter, uppercase letter or underscore. "w" : word.    /\w+/ : Tuto4Dev
\s or [ \t\r\n\f]    Space, tab, end of line, end of page. "s" : space.    /a\sb/ : a b
\D, \W and \S    An uppercase class is generally the opposite of the lowercase version. \D = [^\d] = everything which is not a digit.    /[\s\S]/ : all characters. Everything which is a space or not a space.
.    Any character except a newline. To match really everything, use [\s\S] or add the "s" (single-line) modifier after the closing delimiter.    /.og/ : "dog", "fog"
( )    A capturing group.    preg_match('/a(.+)e/', 'abcde', $output) : $output[1] = 'bcd'
[:alnum:], [:digit:], [:alpha:], [:lower:], [:upper:]    POSIX classes.    /([[:alpha:]]+)/ = /([a-zA-Z]+)/
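
To see a few of these basics in action, here is a minimal sketch using preg_match. The variable names are only illustrative, and the "s" modifier shown at the end is the single-line mode mentioned for the dot.

```php
<?php
// Anchors: /^foo/ only matches at the start, /bar$/ only at the end.
$startsWithFoo = preg_match('/^foo/', 'foobar');   // 1
$endsWithBar   = preg_match('/bar$/', 'foobar');   // 1

// Bounded quantifier: "a" followed by two to four "s", nothing else.
$twoToFour = preg_match('/^as{2,4}$/', 'asss');    // 1
$tooMany   = preg_match('/^as{2,4}$/', 'asssss');  // 0

// Capturing group: $output[0] is the whole match, $output[1] the first group.
preg_match('/a(.+)e/', 'abcde', $output);

// The dot stops at a newline unless the "s" modifier is set.
$dotDefault = preg_match('/a.b/', "a\nb");   // 0
$dotAll     = preg_match('/a.b/s', "a\nb");  // 1

echo $output[1], "\n"; // bcd
```

preg_match returns 1 when the pattern matches and 0 when it does not, which is why the results can be compared directly to integers.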

The lazy quantifiers

By default, "*" and "+" are greedy, meaning they will try to capture the largest possible number of characters. A lazy quantifier, on the contrary, captures as few characters as possible. Adding "?" after "*" or "+" turns the greedy quantifier into a lazy one. An example trying to capture the name of an XML tag:

<(.+)>    <b>Tuto4Dev</b>    Greedy capture : b>Tuto4Dev</b. Not what we want.
<(.+?)>    <b>Tuto4Dev</b>    Lazy capture : b
<([^>]+)>    <b>Tuto4Dev</b>    The engine is faster with this alternative solution than with lazy quantifiers.
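
The three patterns above can be compared side by side. This is a small sketch, and the array names ($greedy, $lazy, $negated) are made up for the example.

```php
<?php
$subject = '<b>Tuto4Dev</b>';

// Greedy: ".+" runs to the last ">" it can reach.
preg_match('/<(.+)>/', $subject, $greedy);

// Lazy: ".+?" stops at the first ">".
preg_match('/<(.+?)>/', $subject, $lazy);

// Negated class: "[^>]+" can never run past a ">".
preg_match('/<([^>]+)>/', $subject, $negated);

echo $greedy[1], "\n";  // b>Tuto4Dev</b
echo $lazy[1], "\n";    // b
echo $negated[1], "\n"; // b
```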

Searching for a word preceding or following another

foo(?=bar)    Positive lookahead : find foo followed by bar.
foo(?!bar)    Negative lookahead : find foo not followed by bar.
(?<=foo)bar    Positive lookbehind : find bar preceded by foo.
(?<!foo)bar    Negative lookbehind : find bar not preceded by foo.

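Careful with lookbehind assertions: engines restrict what they may contain. In PCRE, the content of a lookbehind generally has to match a fixed number of characters, so quantifiers such as "*" or "+" are rejected inside it. Good to know: an assertion is not part of the match, so in order to capture its content you need to put parentheses inside it. For instance, /(foo(?=bar))/ will only capture "foo", while /(foo(?=(bar)))/ will capture "foo" and "bar".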
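
A short sketch of the four assertions and of the capture trick, run against illustrative subjects ("foobar", "bazbar") chosen for this example:

```php
<?php
// Lookahead: "bar" is checked but is not part of the match.
preg_match('/foo(?=bar)/', 'foobar', $ahead);            // $ahead[0] = 'foo'
$notFollowed = preg_match('/foo(?!bar)/', 'foobar');     // 0: foo IS followed by bar

// Lookbehind: "foo" is checked but is not part of the match.
preg_match('/(?<=foo)bar/', 'foobar', $behind);          // $behind[0] = 'bar'
$notPreceded  = preg_match('/(?<!foo)bar/', 'foobar');   // 0: bar IS preceded by foo
$otherPrefix  = preg_match('/(?<!foo)bar/', 'bazbar');   // 1

// Capturing the content of an assertion needs inner parentheses.
preg_match('/(foo(?=bar))/', 'foobar', $outer);          // groups: 'foo'
preg_match('/(foo(?=(bar)))/', 'foobar', $inner);        // groups: 'foo', 'bar'

print_r($inner);
```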

A more advanced example to find "script" tags without a "src" attribute: /(<script(?!.*?src=(['"]).*?\2)[^>]*>)/.

This regex finds:

  • "<script"
  • Not followed by {nothing or several characters} and then src='{nothing or several characters}' : (?!.*?src=(['"]).*?\2). The backreference \2 requires the closing quote to be the same as the opening one.
  • Followed by {nothing or everything which is not ">"} : [^>]*
  • Followed by ">"
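
Here is how this regex might be tried out in PHP. The two sample strings are invented for the illustration, and the quote inside the single-quoted PHP string is escaped as \'.

```php
<?php
// Same regex as above; only the single quote is escaped for PHP.
$pattern = '/(<script(?!.*?src=([\'"]).*?\2)[^>]*>)/';

$inline   = '<script type="text/javascript">alert(1)</script>';
$external = '<script src="app.js"></script>';

$inlineFound   = preg_match($pattern, $inline, $m); // 1: no src attribute
$externalFound = preg_match($pattern, $external);   // 0: src attribute present

echo $m[1], "\n"; // <script type="text/javascript">
```

Note that the lookahead is not limited to the opening tag: ".*?" may scan past ">" up to the end of the line, so an inline script followed on the same line by src="..." would also be rejected.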

Groups and recursion

We already know how to make a capturing group. Now we will learn how to name it, how to make it non-capturing and how to call it back in the regex. As you will see, there are several syntaxes doing the same thing.

Description    Explanation / Regex
Non-capturing group.    (?:.+)
Name a group, so it becomes callable in the regex or in the return of preg_match.    (?'groupName'.*), (?<groupName>.*) or (?P<groupName>.*).
Calling back the exact captured value of a group: backreference.    (?P=groupName), \k'groupName', \k{groupName} or if unnamed \1, \g1. Example: /<(?'markup'[bs])[^>]*>.*<\/\k'markup'>/ matches <b>tuto4dev</b> but not <b>tuto4dev</s>.
Create a subroutine: calling back the regex of a group.    (?P>groupName), (?&groupName), \g'groupName' or if unnamed \g'1', \g<1>, (?1). Example: /(?'test'[ab])(?&test)c/ matches aac, abc, bbc, bac because it is equivalent to /(?'test'[ab])[ab]c/.
DEFINE a subroutine at the beginning of the regex.    /(?(DEFINE)(?'age'\d{1,3} years old))^Age: \g'age'$/ : we defined the subroutine "age" to use it later. It matches "Age: 25 years old".
Recursion: calling back the entire regex.    (?R), (?0), \g<0>. Example: /a-(?R)?z/ matches a-z, a-a-zz, a-a-a-zzz...
Reset the start of the global match.    \K. Example: /(ab\Kc)/ applied to "abc" gives $output[0] = "c" and $output[1] = "abc".
Branch reset groups: capturing alternatives.    Example: we want to parse JSON {"foo": "bar"} or {"foo": 42}, with the key in $output[1] and the value in $output[2]. With /{"([^"]+)": (?|(\d+)|"([^"]+)")}/ we have 3 capturing groups, but thanks to the syntax (?|(a)|(b)) the last 2 groups are both linked to $output[2]. If you name the groups, all alternatives, even with a different number of groups, need to use the same sequence of names.
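
The examples of this table can be checked with a short script. This is only a sketch: it uses the (?P<name>...) spelling to avoid escaping quotes in PHP strings, and the braces of the JSON pattern are escaped as a precaution (an assumption of this example, not a requirement stated above).

```php
<?php
// Backreference: the closing tag must repeat the captured opening tag.
$tag = '/<(?P<markup>[bs])[^>]*>.*<\/(?P=markup)>/';
$sameTag  = preg_match($tag, '<b>tuto4dev</b>', $t); // 1, $t['markup'] = 'b'
$otherTag = preg_match($tag, '<b>tuto4dev</s>');     // 0

// Subroutine: (?&test) reuses the pattern [ab], not the captured text.
$subroutine = preg_match('/(?P<test>[ab])(?&test)c/', 'abc');  // 1
// A backreference, on the contrary, demands the same character twice.
$backref    = preg_match('/(?P<test>[ab])(?P=test)c/', 'abc'); // 0

// DEFINE: the group is declared first and only used through (?&age).
$age = '/(?(DEFINE)(?P<age>\d{1,3} years old))^Age: (?&age)$/';
$ageFound = preg_match($age, 'Age: 25 years old');             // 1

// Recursion: (?R) calls the whole pattern again.
preg_match('/a-(?R)?z/', 'a-a-zz', $r);    // $r[0] = 'a-a-zz'

// \K: the reported match starts after "ab", the group is untouched.
preg_match('/(ab\Kc)/', 'abc', $k);        // $k[0] = 'c', $k[1] = 'abc'

// Branch reset: both alternatives fill group 2.
$json = '/\{"([^"]+)": (?|(\d+)|"([^"]+)")\}/';
preg_match($json, '{"foo": 42}', $num);    // $num[1] = 'foo', $num[2] = '42'
preg_match($json, '{"foo": "bar"}', $str); // $str[1] = 'foo', $str[2] = 'bar'

echo $num[2], ' ', $str[2], "\n"; // 42 bar
```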

Atomic groups and possessive quantifiers

Basic quantifiers "*", "+", "?", "{2,3}" can be costly because of backtracking. Let's take a basic example: /<.+>/ on "<img>". The engine directly finds "<". Then ".+", being greedy, matches "img>". When the engine looks for ">", nothing is left to match, so it gives back one character (backtrack). Because ".+" is now equal to "img", the engine can finally find ">".

Now let's try the same exercise with /<[^>]+>/ and "<img". The engine will match "[^>]+" to "img" before realizing it can't find ">". So it tries to backtrack: "[^>]+" is equal to "im" and "g" does not match ">". And so on and so forth... If the engine were smarter, it would have noticed that "[^>]+" can never swallow a ">": backtracking was useless.

Possessive quantifiers disallow backtracking and lighten the process. They are written by adding a "+" to the basic quantifiers: ".++" for instance.

Atomic groups are used for performance gains too. Possessive quantifiers are actually a short syntax to write atomic groups: ".++" is an alias of the atomic group "(?>.+)". In an atomic group, once the pattern has matched, the engine jumps to the next part and never comes back, meaning it will not try the remaining alternatives either. For instance /(?>foobar|foobarbaz)\b/ applied to "foobarbaz" fails. Indeed, the group matches "foobar", but then "\b" does not match before "baz"; the engine does not backtrack to try "foobarbaz" and fails.
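
The difference between the three forms can be observed with preg_match. A minimal sketch, with result variables named only for this example:

```php
<?php
// Greedy: ".+" first swallows "img>", then gives back ">" by backtracking.
$greedy     = preg_match('/<.+>/', '<img>');      // 1

// Possessive: ".++" swallows "img>" and never gives anything back.
$possessive = preg_match('/<.++>/', '<img>');     // 0

// Atomic group: same behaviour as the possessive quantifier.
$atomic     = preg_match('/<(?>.+)>/', '<img>');  // 0

// Safe possessive use: "[^>]" can never consume the closing ">".
$safe       = preg_match('/<[^>]++>/', '<img>');  // 1

// Alternatives: the atomic group keeps "foobar" and never tries "foobarbaz".
$atomicAlt  = preg_match('/(?>foobar|foobarbaz)\b/', 'foobarbaz'); // 0
$plainAlt   = preg_match('/(?:foobar|foobarbaz)\b/', 'foobarbaz'); // 1

var_dump($possessive, $safe);
```

As the "<.++>" case shows, a possessive quantifier changes what the regex matches, not only its speed, so it is best applied where backtracking could never have helped anyway.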


Description    Explanation / Regex
Word boundaries, to find an exact word.    \b is a zero-width assertion, true at the positions described by (^\w|\w$|\W\w|\w\W), which does not consume any character. Example: preg_replace('/\bart\b/', 'REPLACEMENT', 'This article is art') : "This article is REPLACEMENT".
Remove the need to escape special characters.    Everything between \Q and \E will be interpreted as plain text, not as part of the regex. Example: \Q*\d+*\E matches literally *\d+* and not a number.
A condition.    General syntax: (?(condition)then|else). Example testing the existence of a group: /alpha(num)?:(?(1)[[:alnum:]]|[[:alpha:]])+/. It matches "alpha:a" and "alphanum:a" but not "alpha:1". You can replace the "1" with the name of a group.
Deal with Unicode.    Add the "u" modifier so that the pattern and the subject are treated as UTF-8. \X matches a whole Unicode grapheme cluster (for instance a letter together with its combining accents); it is the closest Unicode equivalent of a dot. \x{1234} corresponds to Unicode U+1234. Unicode has its own classes, such as \p{Ll} (Lowercase_Letter) or \p{Arabic}.
Make regex case insensitive.    /(?i)insensitive(?-i)sensitive/ : (?i) switches case insensitivity on and (?-i) switches it off.
Ignore whitespace in the pattern.    /(?x) \d +/ = /\d+/
Add comments.    (?x) also ignores everything after an unescaped # until the end of the line. You can use that or (?#your comment) to add comments.
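
To round off this last table, here is a hedged sketch exercising several rows at once. The anchors "^" and "$" added to the condition example and the "42px" subject are choices made for this illustration only.

```php
<?php
// Word boundary: "art" inside "article" is left alone.
$replaced = preg_replace('/\bart\b/', 'REPLACEMENT', 'This article is art');

// \Q...\E: the content is matched literally.
$literal   = preg_match('/\Q*\d+*\E/', 'a *\d+* b'); // 1
$notNumber = preg_match('/\Q*\d+*\E/', '*42*');      // 0

// Condition: if group 1 ("num") matched, accept alphanumerics, else letters only.
$cond = '/^alpha(num)?:(?(1)[[:alnum:]]|[[:alpha:]])+$/';
$alphaLetter = preg_match($cond, 'alpha:a');     // 1
$numLetter   = preg_match($cond, 'alphanum:a');  // 1
$alphaDigit  = preg_match($cond, 'alpha:1');     // 0

// Unicode: the "u" modifier makes \p{Ll} work on UTF-8 input ("été" here).
$unicode = preg_match('/^\p{Ll}+$/u', "\u{00E9}t\u{00E9}"); // 1

// Inline flags: only the first word is case-insensitive.
$flags = '/(?i)insensitive(?-i)sensitive/';
$mixedOk  = preg_match($flags, 'INSENSITIVEsensitive'); // 1
$mixedBad = preg_match($flags, 'insensitiveSENSITIVE'); // 0

// Extended mode: whitespace and comments are ignored inside the pattern.
$extended = '/(?x)
    \d +    # one or more digits
    px      (?# then the unit)
/';
preg_match($extended, '42px', $e); // $e[0] = '42px'

echo $replaced, "\n"; // This article is REPLACEMENT
```

When writing comments in extended mode, avoid using the delimiter character inside them, since PHP would take it for the end of the pattern.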