Regex cheatsheet

Understand regular expressions with examples based on PHP. 2022-11-18

Regular expressions, also known as regex or regexp, are a tool for splitting, searching and transforming strings of characters.

Each regex engine has its own quirks: some features are missing or behave differently. The regexes below are compatible with PHP's PCRE (Perl Compatible Regular Expressions) extension. The slash "/" is used as the delimiter.

The basics

Regex	Explanation	Example
^	Beginning of the string.	/^foo/ : starts with foo
$	End of the string.	/bar$/ : ends with bar
[ ]	A "class": several options.	/[abc]/ : either "a", "b" or "c"
[^ ]	A negated class: forbidden characters.	/[^abc]/ : any character that is not "a", "b" or "c"
?	0 or 1 time.	/as?/ : "a" or "as" because "s" is optional
*	0 or more times.	/as*/ : "a", "as", "asssss".
+	1 or more times.	/as+/ : "as", "asssss".
{ }	Specific number of repetitions.	/as{2}/ : "ass". /as{2,4}/ : "ass", "asss", "assss". /as{2,}/ : "ass", "assssssss".
[0-9] or \d	A digit.	/\d+/ : 0, 1, 2, 3
[A-Za-z0-9_] or \w	Letter (lowercase or uppercase), digit or underscore. "w" : word.	/\w+/ : Tuto4Dev
\s or [ \t\r\n\f]	Space, tab, end of line, end of page. "s" : space.	/a\sb/ : a b
\D, \W and \S	An uppercase class is the negation of its lowercase version. \D = [^\d] = everything that is not a digit.	/[\s\S]/ : any character, since everything is either whitespace or not whitespace.
.	Any character except a line break. To match really everything, use [\s\S] or the "s" pattern modifier (single-line mode, PCRE_DOTALL), as in /.og/s.	/.og/ : "dog", "fog"
( )	A capturing group.	`preg_match('/a(.+)e/', 'abcde', $output)` : $output[1] = 'bcd'
[:alnum:], [:digit:], [:alpha:], [:lower:], [:upper:]	POSIX classes.	/([[:alpha:]]+)/ = /([a-zA-Z]+)/
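
The table above can be tried out with a short script. This is a minimal sketch using preg_match and preg_match_all; the variable names and the sample subjects are only illustrative.

```php
<?php
// Anchors: "^" pins the match to the start, "$" to the end.
$startsWithFoo = preg_match('/^foo/', 'foobar'); // 1
$endsWithBar   = preg_match('/bar$/', 'foobar'); // 1

// Bounded quantifier: "a" followed by 2 to 4 "s" (no space inside the braces).
$threeS = preg_match('/^as{2,4}$/', 'asss'); // 1
$oneS   = preg_match('/^as{2,4}$/', 'as');   // 0

// Capturing group: $output[0] is the whole match, $output[1] the first group.
preg_match('/a(.+)e/', 'abcde', $output);

// \d+ finds every run of digits in the subject.
preg_match_all('/\d+/', 'Tuto4Dev 2022-11-18', $numbers);

var_dump($startsWithFoo, $endsWithBar, $threeS, $oneS);
echo $output[1], PHP_EOL;                // bcd
echo implode(',', $numbers[0]), PHP_EOL; // 4,2022,11,18
```

preg_match returns 1 on a match and 0 otherwise, which is why the results are compared to integers rather than booleans.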

The lazy quantifiers

By default, "*" and "+" are greedy, meaning they try to match as many characters as possible. A lazy quantifier, on the contrary, matches as few characters as possible. Adding "?" after "*" or "+" turns the greedy quantifier into a lazy one. An example trying to capture the name of an XML tag:

Regex	Subject	Explanation
<(.+)>	<b>Tuto4Dev</b>	Greedy capture : b>Tuto4Dev</b. Not what we want.
<(.+?)>	<b>Tuto4Dev</b>	Lazy capture : b
<([^>]+)>	<b>Tuto4Dev</b>	Negated class : b. The engine is usually faster with this alternative than with a lazy quantifier.
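
The three rows above can be compared side by side. A minimal sketch, where the variable names are only illustrative:

```php
<?php
$subject = '<b>Tuto4Dev</b>';

// Greedy: ".+" runs to the last ">" of the subject.
preg_match('/<(.+)>/', $subject, $greedy);

// Lazy: ".+?" stops at the first ">" it can reach.
preg_match('/<(.+?)>/', $subject, $lazy);

// Negated class: "[^>]+" cannot cross a ">" in the first place.
preg_match('/<([^>]+)>/', $subject, $negated);

echo $greedy[1], PHP_EOL;  // b>Tuto4Dev</b
echo $lazy[1], PHP_EOL;    // b
echo $negated[1], PHP_EOL; // b
```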

Searching for a word preceding or following another

Regex	Explanation
foo(?=bar)	Positive lookahead : find foo followed by bar.
foo(?!bar)	Negative lookahead : find foo not followed by bar.
(?<=foo)bar	Positive lookbehind : find bar preceded by foo.
(?<!foo)bar	Negative lookbehind : find bar not preceded by foo.

Be careful with lookbehind assertions: most PCRE versions require the lookbehind pattern to have a fixed length, so an unbounded quantifier, as in /(?<=fo+)bar/, is rejected. Good to know: an assertion does not capture anything by itself, so to capture its content you need to put parentheses inside it. For instance, /(foo(?=bar))/ only captures "foo", while /(foo(?=(bar)))/ captures "foo" and "bar".
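
The four assertions and the capture rule can be checked on a single subject. A minimal sketch with illustrative variable names:

```php
<?php
$subject = 'foobar';

// Lookarounds test the surrounding text without consuming it.
$ahead     = preg_match('/foo(?=bar)/', $subject, $m1);  // 1, $m1[0] is "foo"
$notAhead  = preg_match('/foo(?!bar)/', $subject);       // 0
$behind    = preg_match('/(?<=foo)bar/', $subject, $m2); // 1, $m2[0] is "bar"
$notBehind = preg_match('/(?<!foo)bar/', $subject);      // 0

// The assertion itself captures nothing: parentheses are needed inside it.
preg_match('/(foo(?=bar))/', $subject, $outer);
preg_match('/(foo(?=(bar)))/', $subject, $both);

echo $m1[0], ' ', $m2[0], PHP_EOL; // foo bar
echo count($outer), PHP_EOL;       // 2 : whole match + "foo"
echo $both[1], ' ', $both[2], PHP_EOL; // foo bar
```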

A more advanced example that finds "script" tags without a "src" attribute: /(<script(?!.*?src=(['"]).*?\2)[^>]*>)/.

This regex finds:

"<script"
Not followed by (negative lookahead) {nothing or several characters}, then src= with an opening quote, {nothing or several characters} and the same closing quote : (?!.*?src=(['"]).*?\2)
Followed by {nothing or everything which is not ">"} : [^>]*
Followed by ">"
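
A quick way to check this regex is to run it against one inline and one external script tag. This is a minimal sketch; the two sample tags and the file name app.js are made up for the test.

```php
<?php
// Same regex as above; the single quote is escaped for the PHP string only.
$pattern = '/(<script(?!.*?src=([\'"]).*?\2)[^>]*>)/';

$inline   = '<script type="text/javascript">';
$external = '<script src="app.js">';

$inlineFound   = preg_match($pattern, $inline, $match); // 1
$externalFound = preg_match($pattern, $external);       // 0

echo $match[1], PHP_EOL; // <script type="text/javascript">
var_dump($externalFound);
```

Note that ".*?" inside the lookahead is not limited to the current tag: a "src=" attribute appearing later on the same line also makes the lookahead reject the tag.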

Groups and recursion

We already know how to make a capturing group. Now we will learn how to name it, how to make it non-capturing and how to call it back in the regex. As you will see, there are several syntaxes doing the same thing.

Description	Explanation / Regex
Non-capturing group.	(?:.+)
Name a group, so it becomes callable in the regex or in the return of preg_match.	(?'groupName'.) or (?P<groupName>.).
Calling back the exact captured value of a group: backreference.	(?P=groupName), \k'groupName', \k{groupName} or if unnamed \1, \g1. Example: /<(?'markup'[bs])[^>]*>.*<\/\k'markup'>/ matches <b>tuto4dev</b> but not <b>tuto4dev</s>.
Create a subroutine: calling back the regex of a group.	(?P>groupName), (?&groupName), \g'groupName' or if unnamed \g'1', \g<1>, (?1). Example: /(?'test'[ab])(?&test)c/ matches aac, abc, bbc, bac because it is equivalent to /(?'test'[ab])[ab]c/.
DEFINE a subroutine at the beginning of the regex.	/(?(DEFINE)(?'age'\d{1,3} years old))^Age: \g'age'$/ we defined the subroutine "age" to use it. It matches "Age: 25 years old".
Recursion: calling back the entire regex.	(?R), (?0), \g<0>. Example : /a-(?R)?z/ matches a-z, a-a-zz, a-a-a-zzz...
Reset the global capture.	\K. Example : /(ab\Kc)/, $output[0] = "c" and $output[1] = "abc".
Branch reset groups: capturing alternatives. Example: we want to parse JSON {"foo": "bar"} or {"foo": 42}, with the key in $output[1] and the value in $output[2].	/{"([^"]+)": (?|(\d+)|"([^"]+)")}/ we have 3 capturing groups but thanks to the syntax (?|(a)|(b)) the last 2 groups are both linked to $output[2]. If you are naming the groups, all alternatives, even with a different number of groups, need to have the same sequence of names.
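
Several rows of this table can be verified in one script. This is a minimal sketch: it uses the (?P<name>) syntaxes listed above because they need no quote escaping in a PHP string, and the variable names are only illustrative.

```php
<?php
// Backreference: the closing tag must repeat what group "markup" captured.
$backref  = '/<(?P<markup>[bs])[^>]*>.*<\/(?P=markup)>/';
$sameTag  = preg_match($backref, '<b>tuto4dev</b>'); // 1
$otherTag = preg_match($backref, '<b>tuto4dev</s>'); // 0

// Subroutine: (?P>test) re-runs the pattern [ab], not the captured text.
$subroutine = preg_match('/^(?P<test>[ab])(?P>test)c$/', 'abc'); // 1

// Recursion: (?R) calls the whole pattern again.
preg_match('/a-(?R)?z/', 'a-a-zz', $nested);

// \K drops what was matched so far from $reset[0], but not from the groups.
preg_match('/(ab\Kc)/', 'abc', $reset);

// Branch reset: both alternatives store their value in group 2.
// The braces are escaped here only to make the literal intent explicit.
$json = '/\{"([^"]+)": (?|(\d+)|"([^"]+)")\}/';
preg_match($json, '{"foo": "bar"}', $string);
preg_match($json, '{"foo": 42}', $number);

echo $nested[0], PHP_EOL;                  // a-a-zz
echo $reset[0], ' ', $reset[1], PHP_EOL;   // c abc
echo $string[1], '=', $string[2], PHP_EOL; // foo=bar
echo $number[1], '=', $number[2], PHP_EOL; // foo=42
```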

Atomic groups and possessive quantifiers

The basic quantifiers "*", "+", "?" and "{2,3}" can be expensive. Let's take a simple example: /<.+>/ applied to "<img>". The engine first matches "<". Then the greedy ".+" consumes "img>", up to the end of the string. The engine now needs ">", but nothing is left, so it goes back one character (it backtracks). Because ".+" is now equal to "img", the engine can finally match ">".

Now let's try the same exercise with /<[^>]+>/ and "<img". The engine matches "[^>]+" against "img" before realizing there is no ">" left. So it backtracks: "[^>]+" becomes "im", and "g" does not match ">". And so on and so forth... A smarter engine would have noticed that "[^>]+" can never skip over a ">": backtracking is useless here.

Possessive quantifiers disallow backtracking and lighten the process. They are written by adding a "+" to a basic quantifier: ".++" for instance.
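
The effect of a possessive quantifier is easiest to see on the two patterns discussed above. A minimal sketch with illustrative variable names:

```php
<?php
$subject = '<img>';

// Greedy ".+" first eats "img>", then gives back ">" by backtracking.
$greedy = preg_match('/<.+>/', $subject); // 1

// Possessive ".++" eats "img>" and never gives anything back: no match.
$possessiveDot = preg_match('/<.++>/', $subject); // 0

// "[^>]++" stops before ">" by itself, so forbidding backtracking is harmless.
$possessiveClass = preg_match('/<[^>]++>/', $subject); // 1

// On "<img" (no closing ">") the pattern fails without retrying shorter runs.
$unclosed = preg_match('/<[^>]++>/', '<img'); // 0

var_dump($greedy, $possessiveDot, $possessiveClass, $unclosed);
```

A possessive quantifier is therefore only safe when what follows it can never be matched by the quantified part, as with "[^>]" followed by ">".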

Atomic groups are also used for performance gains. Possessive quantifiers are actually a shorthand for atomic groups: ".++" is equivalent to the atomic group "(?>.+)". Once the pattern inside an atomic group has matched, the engine moves on to the next part and never comes back, so it will not try the remaining alternatives either. For instance /(?>foobar|foobarbaz)\b/ applied to "foobarbaz" fails: the group matches "foobar", then "\b" fails because "b" follows, and the engine cannot backtrack into the group to try "foobarbaz".
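
The "foobarbaz" example can be reproduced as follows. This is a minimal sketch; the non-atomic and reordered variants are added here only for comparison.

```php
<?php
$subject = 'foobarbaz';

// Atomic: "foobar" matches, "\b" fails, and the group is never re-entered.
$atomic = preg_match('/(?>foobar|foobarbaz)\b/', $subject); // 0

// Non-atomic: the engine backtracks and tries the second alternative.
$plain = preg_match('/(?:foobar|foobarbaz)\b/', $subject, $match); // 1

// Putting the longest alternative first makes the atomic version succeed.
$reordered = preg_match('/(?>foobarbaz|foobar)\b/', $subject); // 1

var_dump($atomic, $plain, $reordered);
echo $match[0], PHP_EOL; // foobarbaz
```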

Miscellaneous

Description	Explanation / Regex
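Word boundaries, to find an exact word.	\b is a zero-width assertion for the position between a word character and a non-word character. It is roughly (^\w|\w$|\W\w|\w\W), without consuming any character. Example: `preg_replace('/\bart\b/', 'REPLACEMENT', 'This article is art')` : "This article is REPLACEMENT".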
Remove the need to escape special characters.	Everything between \Q and \E will be interpreted as plain text, not as part of the regex. Example: \Q\d+\E matches literally \d+ and not a number.
A condition.	General syntax: (?(condition)then|else). Example testing the existence of a group: /alpha(num)?:(?(1)[[:alnum:]]|[[:alpha:]])+/. It matches "alpha:a" and "alphanum:a" but not "alpha:1". You can replace the "1" with the name of a group.
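Deal with Unicode.	\X matches one extended grapheme cluster (a character together with its combining marks), a Unicode-aware counterpart of the dot. \x{1234} corresponds to the code point U+1234. Unicode has its own classes, such as \p{Ll} (lowercase letter) or \p{Arabic}. In PHP, add the "u" modifier so that the pattern and the subject are treated as UTF-8.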
Make a regex case insensitive.	/(?i)insensitive(?-i)sensitive/
Ignore whitespace in the pattern.	/(?x) \d +/ = /\d+/
Add comments.	(?x) also ignores everything after the character # on every line. You can use that or (?#your comment) to add comments.
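
Most rows of this table can be exercised in a few lines. This is a minimal sketch: the sample subjects are made up, and the Unicode check assumes a PCRE build with Unicode support, which is the usual case in PHP.

```php
<?php
// \b : only the standalone word "art" is replaced, not "article".
$replaced = preg_replace('/\bart\b/', 'REPLACEMENT', 'This article is art');

// \Q...\E : "\d+" is searched literally, so a number does not match.
$literal = preg_match('/\Q\d+\E/', 'the regex \d+ finds digits'); // 1
$digits  = preg_match('/\Q\d+\E/', '123');                        // 0

// Condition: digits are accepted only if the optional group "num" matched.
$cond     = '/^alpha(num)?:(?(1)[[:alnum:]]|[[:alpha:]])+$/';
$withNum  = preg_match($cond, 'alphanum:a1'); // 1
$alphaOk  = preg_match($cond, 'alpha:a');     // 1
$alphaBad = preg_match($cond, 'alpha:1');     // 0

// (?i) ... (?-i) : only the first half ignores case.
$mixed = preg_match('/(?i)insensitive(?-i)sensitive/', 'INSENSITIVEsensitive'); // 1
$wrong = preg_match('/(?i)insensitive(?-i)sensitive/', 'insensitiveSENSITIVE'); // 0

// (?x) : whitespace and "#" comments inside the pattern are ignored.
preg_match('/(?x) \d + # one or more digits/', 'abc 42', $number);

// "u" modifier with a Unicode class: "\u{e9}" is the letter e with acute accent.
$lower = preg_match('/^\p{Ll}+$/u', "\u{e9}t\u{e9}"); // 1

echo $replaced, PHP_EOL;  // This article is REPLACEMENT
echo $number[0], PHP_EOL; // 42
```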