Regex cheatsheet
Regular expressions, also known as regex or regexp, are tools for searching, splitting and rewriting strings of
characters.
Each regex engine has its own particularities: some features are missing, others behave differently. The patterns below are
compatible with the PCRE (Perl Compatible Regular Expressions) extension of PHP. The slash "/" is used as the
delimiter.
The basics
Regex | Explanation | Example |
---|---|---|
^ | Beginning of the string. | /^foo/ : starts with foo |
$ | End of the string. | /bar$/ : ends with bar |
[ ] | A "class": several options. | /[abc]/ : either "a", "b" or "c" |
[^ ] | A negated class: forbidden characters. | /[^abc]/ : any character other than "a", "b" or "c" |
? | 0 or 1 time. | /as?/ : "a" or "as" because "s" is optional |
* | 0 or more times. | /as*/ : "a", "as", "asssss". |
+ | 1 or more times. | /as+/ : "as", "asssss". |
{ } | Specific number of repetitions. | /as{2}/ : "ass". /as{2,4}/ : "ass", "asss", "assss". /as{2,}/ : "ass", "assssssss" (2 or more). |
[0-9] or \d | A digit. | /\d+/ : 0, 1, 2, 3 |
[A-Za-z0-9_] or \w | Digit, lowercase letter, uppercase letter or underscore. "w" : word. | /\w+/ : Tuto4Dev |
\s or [ \t\r\n\f] | Space, tab, carriage return, line feed, form feed. "s" : space. | /a\sb/ : a b |
\D, \W and \S | An uppercase class is generally the opposite of its lowercase version. \D = [^\d] = everything which is not a digit. | /[\s\S]/ : all characters. Everything which is a space or not a space. |
. | Any character except a line break. To match truly everything, use [\s\S] or add the "s" (dot-all, also called single-line) modifier after the closing delimiter, as in /.og/s. | /.og/ : "dog", "fog" |
( ) | A capturing group. | preg_match('/a(.+)e/', 'abcde', $output) : $output[1] = 'bcd' |
[:alnum:], [:digit:], [:alpha:], [:lower:], [:upper:] | POSIX classes. They are only valid inside a bracket expression. | /([[:alpha:]]+)/ = /([a-zA-Z]+)/ |
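As a quick way to try a few rows of the table, here is a minimal sketch using Python's standard `re` module. Python is an assumption made for convenience: the cheatsheet targets PCRE in PHP, and the two engines differ in places, although the basic syntax used here is common to both. The helper `matches` and the sample strings are made up for the illustration.

```python
import re

def matches(pattern: str, subject: str) -> bool:
    """Return True when the pattern is found anywhere in the subject."""
    return re.search(pattern, subject) is not None

# Anchors: ^ for the beginning of the string, $ for the end.
starts_with_foo = matches(r"^foo", "foobar")
ends_with_bar = matches(r"bar$", "foobar")

# Bounded repetition: "a" followed by 2 to 4 "s".
two_to_four_s = matches(r"^as{2,4}$", "asss")
too_many_s = matches(r"^as{2,4}$", "asssss")

# Negated class: at least one character other than "a", "b" or "c".
has_other_char = matches(r"[^abc]", "abcd")

# Capturing group: what sits between the first "a" and the last "e".
captured = re.search(r"a(.+)e", "abcde").group(1)

# \d+ finds each run of digits.
numbers = re.findall(r"\d+", "room 12, floor 3")

print(starts_with_foo, ends_with_bar, two_to_four_s, too_many_s)
print(has_other_char, captured, numbers)
```

Note that Python has no delimiters such as "/"; the pattern is passed as a plain (raw) string.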
The lazy quantifiers
By default, quantifiers such as "*" and "+" are greedy: they capture as many characters as possible. A lazy quantifier, on the contrary, captures as few characters as possible. Appending "?" to a quantifier ("*?", "+?") makes it lazy. An example that tries to capture the name of an XML tag:
Regex | Subject | Explanation |
---|---|---|
<(.+)> | <b>Tuto4Dev</b> | Greedy capture : b>Tuto4Dev</b. Not what we want. |
<(.+?)> | <b>Tuto4Dev</b> | Lazy capture : b |
<([^>]+)> | <b>Tuto4Dev</b> | Negated class : b. This alternative is usually faster than a lazy quantifier because the engine has less backtracking to do. |
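The three rows above can be compared side by side. This is a minimal sketch in Python's `re` module (an assumption for illustration; greedy and lazy quantifiers behave the same way in PCRE for these patterns).

```python
import re

subject = "<b>Tuto4Dev</b>"

# Greedy: .+ runs to the end, then backtracks to the LAST ">".
greedy = re.search(r"<(.+)>", subject).group(1)

# Lazy: .+? grows one character at a time and stops at the FIRST ">".
lazy = re.search(r"<(.+?)>", subject).group(1)

# Negated class: cannot run past a ">", so no lazy quantifier is needed.
negated = re.search(r"<([^>]+)>", subject).group(1)

# The same pattern applied repeatedly returns every tag name.
all_tags = re.findall(r"<([^>]+)>", subject)

print(greedy)    # b>Tuto4Dev</b
print(lazy)      # b
print(negated)   # b
print(all_tags)  # ['b', '/b']
```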
Searching for a word preceding or following another
Regex | Explanation |
---|---|
foo(?=bar) | Positive lookahead : find foo followed by bar. |
foo(?!bar) | Negative lookahead : find foo not followed by bar. |
(?<=foo)bar | Positive lookbehind : find bar preceded by foo. |
(?<!foo)bar | Negative lookbehind : find bar not preceded by foo. |
Careful with lookbehind assertions: PCRE has long required the content of a lookbehind to have a fixed length (each alternative may have its own), so quantifiers such as "*" or "+" are generally rejected inside it.
Good to know: to capture the content of an assertion, you need to put parentheses inside it.
For instance: /(foo(?=bar))/ will only capture "foo", while /(foo(?=(bar)))/ will capture "foo" and "bar".
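To see that assertions check the surrounding text without consuming it, here is a small sketch in Python's `re` module (an assumption for illustration; the four lookaround syntaxes are the same as in PCRE). The sample text is made up.

```python
import re

text = "foobar foobaz bazbar"

def positions(pattern: str) -> list:
    """Start offset of every match of the pattern in the sample text."""
    return [m.start() for m in re.finditer(pattern, text)]

followed = positions(r"foo(?=bar)")       # foo followed by bar
not_followed = positions(r"foo(?!bar)")   # foo not followed by bar
preceded = positions(r"(?<=foo)bar")      # bar preceded by foo
not_preceded = positions(r"(?<!foo)bar")  # bar not preceded by foo

# The assertion itself is not captured unless it contains its own parentheses.
outer_only = re.search(r"(foo(?=bar))", text).groups()
outer_and_inner = re.search(r"(foo(?=(bar)))", text).groups()

print(followed, not_followed, preceded, not_preceded)  # [0] [7] [3] [17]
print(outer_only)       # ('foo',)
print(outer_and_inner)  # ('foo', 'bar')
```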
A more advanced example that finds "script" tags without a "src" attribute:
/(<script(?!.*?src=(['"]).*?\2)[^>]*>)/
This regex finds:
- "<script"
- Not followed by {nothing or several characters}, then src=, a quote, {nothing or several characters} and the same quote again : (?!.*?src=(['"]).*?\2)
- Followed by {nothing or everything which is not ">"} : [^>]*
- Followed by ">"
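The breakdown above can be checked with a short sketch in Python's `re` module (an assumption for illustration; this pattern uses no PCRE-only syntax). The sample HTML strings and the `STRICT` variant are hypothetical additions, not part of the original regex.

```python
import re

# The pattern from the text, unchanged apart from Python's raw-string quoting.
NO_SRC = re.compile(r"""(<script(?!.*?src=(['"]).*?\2)[^>]*>)""")

inline = '<script type="text/javascript">var a = 1;</script>'
external = '<script src="app.js"></script>'

inline_match = NO_SRC.search(inline)      # matches the opening tag
external_match = NO_SRC.search(external)  # None: a src attribute is present

# Caveat: ".*?" in the lookahead may run past the closing ">", so a src="..."
# found later on the same line also blocks the match.
tricky = '<script>img.src="x.png";</script>'
tricky_match = NO_SRC.search(tricky)      # None, although the tag has no src

# Hypothetical stricter variant: the lookahead may not leave the tag.
STRICT = re.compile(r"<script(?![^>]*\ssrc=)[^>]*>")
strict_match = STRICT.search(tricky)

print(inline_match.group(1))   # <script type="text/javascript">
print(external_match, tricky_match)
print(strict_match.group(0))   # <script>
```

Whether the stricter variant is preferable depends on the input: it ignores attribute quoting entirely, which the original pattern checks with the backreference \2.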
Groups and recursion
We already know how to make a capturing group. Now we will learn how to name it, how to make it non-capturing and how to call it back in the regex. As you will see, there are several syntaxes doing the same thing.
Description | Explanation / Regex |
---|---|
Non-capturing group. | (?:.+) |
Name a group, so it becomes callable in the regex or in the return of preg_match. | (?'groupName'.*) or (?P<groupName>.*). |
Calling back the exact captured value of a group: backreference. | (?P=groupName), \k'groupName', \k{groupName} or if unnamed \1, \g1. Example: /<(?'markup'[bs])[^>]*>.*<\/\k'markup'>/ matches <b>tuto4dev</b> but not <b>tuto4dev</s>. |
Create a subroutine: calling back the regex of a group. | (?P>groupName), (?&groupName), \g'groupName' or if unnamed \g'1', \g<1>, (?1). Example: /(?'test'[ab])(?&test)c/ matches aac, abc, bbc, bac because it is equivalent to /(?'test'[ab])[ab]c/. |
DEFINE a subroutine at the beginning of the regex. | /(?(DEFINE)(?'age'\d{1,3} years old))^Age: \g'age'$/ defines the subroutine "age" without matching anything, then calls it. It matches "Age: 25 years old". |
Recursion: calling back the entire regex. | (?R), (?0), \g<0>. Example : /a-(?R)?z/ matches a-z, a-a-zz, a-a-a-zzz... |
Reset the start of the reported match. | \K. Example : /(ab\Kc)/, $output[0] = "c" and $output[1] = "abc". |
Branch reset groups: capturing alternatives. Example: we want to parse JSON {"foo": "bar"} or {"foo": 42}, with the key in $output[1] and the value in $output[2]. | /{"([^"]+)": (?|(\d+)|"([^"]+)")}/ contains 3 pairs of capturing parentheses, but thanks to the syntax (?|(a)|(b)) the last 2 share the same number and both fill $output[2]. If you name the groups, all alternatives, even with a different number of groups, need to use the same sequence of names. |
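A few of these constructs can be tried in Python's `re` module, used here as an assumption for illustration. It accepts the (?P<name>...) and (?P=name) spellings but not (?'name'...) or \k'name'. It does not implement subroutines, (?(DEFINE)...), recursion, \K or branch reset groups, so those rows are not shown. The sample strings are made up.

```python
import re

# Named group + backreference: the closing tag must repeat the captured name.
tag = re.compile(r"<(?P<markup>[bs])[^>]*>.*</(?P=markup)>")
same_tag = tag.search("<b>tuto4dev</b>")
mixed_tag = tag.search("<b>tuto4dev</s>")  # None: "s" is not the captured "b"
markup_name = same_tag.group("markup")

# Non-capturing group: groups the alternation without creating a numbered group.
m = re.search(r"(?:foo|bar)-(\d+)", "bar-42")
only_group = m.groups()

# Numbered backreference: find a word immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

print(markup_name)  # b
print(mixed_tag)    # None
print(only_group)   # ('42',)
print(doubled)      # is
```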
Atomic groups and possessive quantifiers
Basic quantifiers such as "*", "+", "?" and "{2,3}" can be expensive. Let's take a basic example: /<.+>/ on "<img>".
The engine first matches "<". Then ".+", being greedy, matches "img>".
The engine now needs ">" but has reached the end of the string, so it gives back one character (it backtracks).
Because ".+" is now equal to "img", the engine can finally match ">".
Now let's try the same exercise with /<[^>]+>/ on "<img".
The engine matches "[^>]+" against "img" before realizing there is no ">" left to find.
So it backtracks: "[^>]+" is reduced to "im", but "g" does not match ">".
And so on, until every possibility has been tried.
A smarter engine would have noticed that "[^>]+" cannot possibly have skipped over
a ">": backtracking was useless.
Possessive quantifiers forbid this backtracking and lighten the process. They are written by adding a "+" to a basic quantifier: ".++" for instance.
Atomic groups are used for performance gains too.
Possessive quantifiers are actually a shorthand for atomic groups:
".++" is equivalent to the atomic group "(?>.+)". In an atomic group, once the pattern has matched, the engine moves on
to the next part and never comes back. This means it will not try the remaining alternatives either.
For instance /(?>foobar|foobarbaz)\b/ applied to "foobarbaz" fails.
Indeed, the group matches "foobar", then "\b" fails because the next character is "b". The engine cannot go back into the group to try "foobarbaz", so the whole match fails.
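The behaviour described above can be observed with a small sketch in Python's `re` module. This rests on an assumption: atomic groups and possessive quantifiers were only added to `re` in Python 3.11, so the sketch checks the interpreter version and skips the demonstration on older versions. The helper `first_match` is made up for the illustration.

```python
import re
import sys

# Atomic groups and possessive quantifiers need Python 3.11 or newer.
HAS_ATOMIC = sys.version_info >= (3, 11)

def first_match(pattern: str, subject: str):
    """Return the text of the first match, or None when nothing matches."""
    m = re.search(pattern, subject)
    return m.group(0) if m else None

if HAS_ATOMIC:
    # A normal group backtracks into the second alternative and succeeds.
    print(first_match(r"(?:foobar|foobarbaz)\b", "foobarbaz"))  # foobarbaz
    # The atomic group keeps "foobar", \b fails, and nothing is retried.
    print(first_match(r"(?>foobar|foobarbaz)\b", "foobarbaz"))  # None
    # Putting the longer alternative first makes the atomic version match.
    print(first_match(r"(?>foobarbaz|foobar)\b", "foobarbaz"))  # foobarbaz

    # Possessive quantifier: [^>]++ never gives characters back.
    print(first_match(r"<[^>]++>", "<img>"))  # <img>
    print(first_match(r"<[^>]++>", "<img"))   # None, without backtracking
    # .++ swallows the final ">" and cannot release it, so this fails.
    print(first_match(r"<.++>", "<img>"))     # None
```

The last line shows why a possessive quantifier is best combined with a negated class: with ".", it also consumes the delimiter the rest of the pattern needs.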
Miscellaneous
Description | Explanation / Regex |
---|---|
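Word boundaries, to find an exact word. | \b is a zero-width assertion: it matches the position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string. Example: preg_replace('/\bart\b/', 'REPLACEMENT', 'This article is art') : "This article is REPLACEMENT". |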
Remove the need to escape special characters. | Everything between \Q and \E will be interpreted as plain text, not as part of the regex. Example: \Q*\d+*\E matches literally *\d+* and not a number. |
A condition. | General syntax: (?(condition) then|else). Example testing the existence of a group: /alpha(num)?:(?(1)[[:alnum:]]|[[:alpha:]])+/. It matches "alpha:a" and "alphanum:a" but not "alpha:1". You can replace the "1" with the name of a group. |
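Deal with Unicode. | \X matches one extended grapheme cluster (a base character together with its combining marks), a rough Unicode counterpart of the dot. \x{1234} matches the code point U+1234. Unicode has its own classes, such as \p{Ll} (lowercase letter) or \p{Arabic}. In PHP, add the "u" modifier so that the pattern and the subject are treated as UTF-8. |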
Make a regex case-insensitive. | /(?i)insensitive(?-i)sensitive/ |
Ignore whitespace in the pattern. | /(?x) \d +/ = /\d+/ |
Add comments. | (?x) also ignores everything from the character # to the end of the line. You can use that or (?#your comment) to add comments. |
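Some of these rows have close counterparts in Python's `re` module, sketched below as an assumption for illustration. Three differences from PCRE are worth flagging: `re` does not support \Q...\E, so `re.escape` plays that role. It also expects the scoped form (?i:...) rather than switching a flag off with (?-i) in the middle of a pattern. And extended mode is enabled here with the `re.VERBOSE` flag.

```python
import re

# Word boundaries: only the standalone word "art" is replaced.
replaced = re.sub(r"\bart\b", "REPLACEMENT", "This article is art")

# Literal text: re.escape() stands in for \Q...\E.
literal = re.compile(re.escape(r"*\d+*"))
finds_literal = literal.search(r"a *\d+* b") is not None
finds_number = literal.search("a *42* b") is not None

# Case-insensitive for the first half of the pattern only.
mixed = re.compile(r"(?i:insensitive)sensitive")
first_half_upper = mixed.fullmatch("INSENSITIVEsensitive") is not None
second_half_upper = mixed.fullmatch("insensitiveSENSITIVE") is not None

# Extended mode: whitespace and "#" comments in the pattern are ignored.
digits = re.compile(r"""
    \d +    # one or more digits
""", re.VERBOSE)
all_digits = digits.fullmatch("2024") is not None

print(replaced)                             # This article is REPLACEMENT
print(finds_literal, finds_number)          # True False
print(first_half_upper, second_half_upper)  # True False
print(all_digits)                           # True
```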