Regex cheatsheet
  Regular expressions, also known as regex or regexp, are a tool for searching, splitting and rewriting strings of
  characters.
  
  Each regex engine has its own quirks: some features are missing and others behave differently. The regexes below are
  compatible with the PCRE (Perl Compatible Regular Expressions) extension of PHP. The slash "/" is used as the
  delimiter.
The basics
| Regex | Explanation | Example | 
|---|---|---|
| ^ | Beginning of the string. | /^foo/ : starts with foo | 
| $ | End of the string. | /bar$/ : ends with bar | 
| [ ] | A "class": several options. | /[abc]/ : either "a", "b" or "c" | 
| [^ ] | A negated class: any character except the listed ones. | /[^abc]/ : neither "a", "b" nor "c" | 
| ? | 0 or 1 time. | /as?/ : "a" or "as" because "s" is optional | 
| * | 0 or more times. | /as*/ : "a", "as", "asssss". | 
| + | 1 or more times. | /as+/ : "as", "asssss". | 
| { } | Specific number of repetitions. | /as{2}/ : "ass". /as{2,4}/ : "ass", "asss", "assss". /as{2,}/ : "ass", "assssssss" (2 or more). | 
| [0-9] or \d | A digit. | /\d+/ : 0, 1, 2, 3 | 
| [A-Za-z0-9_] or \w | A word character: letter, digit or underscore. "w" : word. | /\w+/ : Tuto4Dev | 
| \s or [ \t\r\n\f] | Whitespace: space, tab, carriage return, line feed, form feed. "s" : space. | /a\sb/ : a b | 
| \D, \W and \S | An uppercase class is generally the opposite of the lowercase version. \D = [^\d] = everything which is not a digit. | /[\s\S]/ : all characters. Everything which is a space or not a space. | 
| . | Any character except a newline. To match really everything, use [\s\S] or add the "s" pattern modifier (single-line mode): /./s | /.og/ : "dog", "fog" | 
| ( ) | A capturing group. | preg_match('/a(.+)e/', 'abcde', $output): $output[1] = 'bcd' | 
| [:alnum:], [:digit:], [:alpha:], [:lower:], [:upper:] | POSIX classes. | /([[:alpha:]]+)/ = /([a-zA-Z]+)/ | 
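
The rows above can be tried directly with preg_match. The snippet below is a minimal sketch, not a reference implementation: the variable names and the sample subjects ("2024", "asss", and so on) are assumptions chosen for illustration.

```php
<?php
// Anchors plus a character class: the whole subject must be digits.
$isNumber  = preg_match('/^[0-9]+$/', '2024');   // 1 (match)
$notNumber = preg_match('/^[0-9]+$/', '20a24');  // 0 (no match)

// Bounded repetition: "a" followed by 2 to 4 "s".
$bounded = preg_match('/^as{2,4}$/', 'asss');    // 1

// Capturing group, same example as in the table.
preg_match('/a(.+)e/', 'abcde', $output);
// $output[0] holds the whole match, $output[1] the first group.

// "." stops at a newline unless the "s" modifier is set.
$dotDefault = preg_match('/a.b/', "a\nb");       // 0
$dotAll     = preg_match('/a.b/s', "a\nb");      // 1
```

preg_match returns 1 when the pattern matches and 0 when it does not, which is why the results are compared to integers rather than booleans.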
The lazy quantifiers
By default, "*" and "+" are greedy, meaning they try to capture as many characters as possible. A lazy quantifier, on the contrary, captures as few characters as possible. Adding "?" after "*" or "+" turns the greedy quantifier into a lazy one. An example trying to capture an XML tag:
| Regex | Subject | Explanation | 
|---|---|---|
| <(.+)> | <b>Tuto4Dev</b> | Greedy capture : b>Tuto4Dev</b. Not what we want. | 
| <(.+?)> | <b>Tuto4Dev</b> | Lazy capture : b | 
| <([^>]+)> | <b>Tuto4Dev</b> | Also captures b. This alternative is usually faster than a lazy quantifier, and it can never run past the first ">". | 
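
The three rows of the table can be compared side by side. This is a small sketch using the same subject as the table; the variable names are assumptions for illustration.

```php
<?php
$subject = '<b>Tuto4Dev</b>';

// Greedy: ".+" runs to the end, then backtracks to the LAST ">".
preg_match('/<(.+)>/', $subject, $greedy);

// Lazy: ".+?" stops at the FIRST ">".
preg_match('/<(.+?)>/', $subject, $lazy);

// Negated class: "[^>]+" cannot cross a ">" at all.
preg_match('/<([^>]+)>/', $subject, $negated);

// With preg_match_all, the negated class finds each tag separately.
preg_match_all('/<([^>]+)>/', $subject, $all);

echo $greedy[1], "\n";   // b>Tuto4Dev</b
echo $lazy[1], "\n";     // b
echo $negated[1], "\n";  // b
print_r($all[1]);        // b and /b
```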
Searching for a word preceding or following another
| Regex | Explanation | 
|---|---|
| foo(?=bar) | Positive lookahead : find foo followed by bar. | 
| foo(?!bar) | Negative lookahead : find foo not followed by bar. | 
| (?<=foo)bar | Positive lookbehind : find bar preceded by foo. | 
| (?<!foo)bar | Negative lookbehind : find bar not preceded by foo. | 
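  Be careful with lookbehind assertions: most engines restrict what they may contain. In PCRE, unbounded quantifiers such as "*" or "+" are not allowed inside a lookbehind, so (?<=fo+)bar is rejected when the pattern is compiled.
  Good to know: an assertion does not capture anything by itself. To capture its content, put parentheses inside it.
  For instance: /(foo(?=bar))/ will only capture "foo", while /(foo(?=(bar)))/ will capture "foo" and "bar".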
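
The four assertions and the capture trick can be checked with a few calls. This is a minimal sketch; the subjects "foobar" and "foobaz" and the variable names are assumptions for illustration.

```php
<?php
// Positive lookahead: "bar" must follow but is not part of the match.
preg_match('/foo(?=bar)/', 'foobar', $ahead);

// Negative lookahead: fails on "foobar", succeeds on "foobaz".
$notBar1 = preg_match('/foo(?!bar)/', 'foobar');
$notBar2 = preg_match('/foo(?!bar)/', 'foobaz');

// Positive and negative lookbehind.
preg_match('/(?<=foo)bar/', 'foobar', $behind);
$notAfterFoo = preg_match('/(?<!foo)bar/', 'foobar');

// Parentheses inside the assertion capture its content.
preg_match('/(foo(?=(bar)))/', 'foobar', $captured);

echo $ahead[0], "\n";                        // foo
echo $behind[0], "\n";                       // bar
echo $captured[1], ' ', $captured[2], "\n";  // foo bar
```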
  
  A more advanced example to find "script" tags without a "src" attribute:
  /(<script(?!.*?src=(['"]).*?\2)[^>]*>)/.
  
  This regex finds:
- "<script"
- Not followed by the sequence described in the negative lookahead (?!.*?src=(['"]).*?\2), that is: {nothing or several characters}, then src= and an opening quote, then {nothing or several characters}, then the same quote (\2)
- Followed by {nothing or everything which is not ">"} : [^>]*
- Followed by ">"
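
The pattern above can be tested against an inline and an external script tag. This is a sketch under stated assumptions: the two sample tags and the variable names are made up for illustration, and the pattern is the one from the text, with the single quote escaped for the PHP string.

```php
<?php
// Same regex as above; \' is only the PHP escape for a single quote.
$pattern = '/(<script(?!.*?src=([\'"]).*?\2)[^>]*>)/';

$inline   = '<script type="text/javascript">alert(1)</script>';
$external = '<script src="app.js"></script>';

// No src attribute: the negative lookahead passes and the tag is captured.
$hasInline = preg_match($pattern, $inline, $m);

// src="app.js" satisfies the lookahead content, so the match is refused.
$hasExternal = preg_match($pattern, $external);

echo $m[1], "\n";  // <script type="text/javascript">
```

Note that ".*?" inside the lookahead is not limited to the current tag: it scans up to the end of the line. An inline script followed by an external one on the same line would therefore be skipped as well.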
Groups and recursion
We already know how to make a capturing group. Now we will learn how to name it, how to make it non-capturing and how to call it back in the regex. As you will see, there are several syntaxes doing the same thing.
| Description | Explanation / Regex | 
|---|---|
| Non-capturing group. | (?:.+) | 
| Name a group, so it becomes callable in the regex or in the return of preg_match. | (?'groupName'.*) or (?P<groupName>.*). | 
| Calling back the exact captured value of a group: backreference. | (?P=groupName), \k'groupName', \k{groupName} or if unnamed \1, \g1. Example: /<(?'markup'[bs])[^>]*>.*<\/\k'markup'>/ matches <b>tuto4dev</b> but not <b>tuto4dev</s>. | 
| Create a subroutine: calling back the regex of a group, not the text it captured. | (?P>groupName), (?&groupName), \g'groupName' or if unnamed \g'1', \g<1>, (?1). Example: /(?'test'[ab])(?&test)c/ matches aac, abc, bbc, bac because it is equivalent to /(?'test'[ab])[ab]c/. | 
| DEFINE a subroutine at the beginning of the regex. | /(?(DEFINE)(?'age'\d{1,3} years old))^Age: \g'age'$/ we defined the subroutine "age" to use it. It matches "Age: 25 years old". | 
| Recursion: calling back the entire regex. | (?R), (?0), \g<0>. Example : /a-(?R)?z/ matches a-z, a-a-zz, a-a-a-zzz... | 
| Reset the start of the global match. | \K. Example : /(ab\Kc)/ applied to "abc": $output[0] = "c" and $output[1] = "abc". | 
| Branch reset groups: capturing alternatives. Example: we want to parse JSON {"foo": "bar"} or {"foo": 42}, with the key in $output[1] and the value in $output[2]. | /{"([^"]+)": (?|(\d+)|"([^"]+)")}/ we have 3 capturing groups but thanks to the syntax (?|(a)|(b)) the last 2 groups are both linked to $output[2]. If you are naming the groups, all alternatives, even with a different number of groups, need to have the same sequence of names. | 
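
The table rows can be exercised as follows. This is a hedged sketch rather than a definitive reference: it picks one syntax per feature among the equivalent ones listed above, the variable names are assumptions, and the braces of the JSON pattern are escaped here only for readability.

```php
<?php
// Named group + backreference: the closing tag must repeat the captured text.
$tag  = '/<(?P<markup>[bs])[^>]*>.*<\/(?P=markup)>/';
$same = preg_match($tag, '<b>tuto4dev</b>', $m);
$diff = preg_match($tag, '<b>tuto4dev</s>');

// Subroutine: (?&test) re-runs the pattern [ab], so "ab" is accepted.
$sub = preg_match('/^(?P<test>[ab])(?&test)c$/', 'abc');

// DEFINE: declare the "age" subroutine first, use it afterwards.
$age = preg_match('/(?(DEFINE)(?P<age>\d{1,3} years old))^Age: (?&age)$/', 'Age: 25 years old');

// Recursion: (?R) calls the whole regex again.
preg_match('/a-(?R)?z/', 'a-a-zz', $rec);

// \K: everything matched before it is dropped from the global match.
preg_match('/ab\Kc/', 'abc', $k);

// Branch reset: both alternatives store the value in group 2.
$json = '/\{"([^"]+)": (?|(\d+)|"([^"]+)")\}/';
preg_match($json, '{"foo": "bar"}', $str);
preg_match($json, '{"foo": 42}', $num);

echo $m['markup'], ' ', $rec[0], ' ', $k[0], "\n";  // b a-a-zz c
echo $str[2], ' ', $num[2], "\n";                   // bar 42
```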
Atomic groups and possessive quantifiers
  The basic quantifiers "*", "+", "?" and "{2,3}" can be costly because of backtracking. Let's take a basic example: /<.+>/ on "<img>".
  The engine directly finds "<". Then ".+", being greedy, matches "img>".
  When it looks for ">", the engine cannot find it because the end of the string is reached, so it gives back the last character (backtrack).
  Because ".+" is now equal to "img", the engine can finally find ">".
  
  Now let's try the same exercise with /<[^>]+>/ and "<img".
  The engine matches "[^>]+" against "img" before realizing it cannot find ">".
  So it backtracks: "[^>]+" is now equal to "im", and "g" does not match ">".
  And so on and so forth...
  If the engine were smarter, it would have noticed that "[^>]+" could not have skipped
  any ">": backtracking was useless.
  
  Possessive quantifiers disallow backtracking and lighten the process. They are written by adding a "+" to a basic quantifier: ".++" or "[^>]++" for instance.
  
  Atomic groups are also used for performance gains.
  Possessive quantifiers are actually a short syntax for atomic groups:
  ".++" is an alias of the atomic group "(?>.+)". In an atomic group, once the pattern has matched, the engine moves on
  to the next part and never comes back. This means it will not try the remaining alternatives either.
  For instance /(?>foobar|foobarbaz)\b/ applied to "foobarbaz" fails.
  Indeed, the group matches "foobar", but "\b" does not match between "r" and "b". The engine cannot go back into the group to try "foobarbaz", so it fails.
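
The difference between greedy, possessive and atomic matching shows up in the return value of preg_match. This is a minimal sketch reusing the examples from the text; the variable names and the non-atomic comparison pattern are assumptions added for illustration.

```php
<?php
// Greedy: ".+" takes "img>" then gives ">" back, so the match succeeds.
$greedy = preg_match('/<.+>/', '<img>');

// Possessive: ".++" takes "img>" and never gives anything back, so ">" is never found.
$possessive = preg_match('/<.++>/', '<img>');

// Possessive on a negated class: nothing needs to be given back, the match succeeds.
$safe = preg_match('/<[^>]++>/', '<img>');

// Atomic group: once "foobar" has matched, "foobarbaz" is never tried.
$atomic = preg_match('/(?>foobar|foobarbaz)\b/', 'foobarbaz');

// Same alternatives in a normal non-capturing group: backtracking finds "foobarbaz".
$plain = preg_match('/(?:foobar|foobarbaz)\b/', 'foobarbaz');

var_dump($greedy, $possessive, $safe, $atomic, $plain);  // int(1) int(0) int(1) int(0) int(1)
```

The ".++" case shows why possessive quantifiers are best paired with a class that cannot swallow the following delimiter, such as "[^>]".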
Miscellaneous
| Description | Explanation / Regex | 
|---|---|
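| Word boundaries, to find an exact word. | \b is a zero-width assertion: it matches the position between a word character (\w) and a non-word character (\W) or an edge of the string, without consuming any character. Example: preg_replace('/\bart\b/', 'REPLACEMENT', 'This article is art'): "This article is REPLACEMENT". | 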
| Remove the need to escape special characters. | Everything between \Q and \E will be interpreted as plain text, not as part of the regex. Example: \Q*\d+*\E matches literally *\d+* and not a number. | 
| A condition. | General syntax: (?(condition) then|else). Example testing the existence of a group: /alpha(num)?:(?(1)[[:alnum:]]|[[:alpha:]])+/. It matches "alpha:a" and "alphanum:a" but not "alpha:1". You can replace the "1" with the name of a group. | 
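| Deal with Unicode. | \X matches one extended grapheme cluster (a base character with its combining marks): it is the closest Unicode equivalent of a dot. \x{1234} corresponds to the code point U+1234 (in PHP, use it with the "u" modifier). Unicode has its own classes, such as: \p{Ll} (lowercase letter) or \p{Arabic}. | 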
| Make regex case insensitive | /(?i)insensitive(?-i)sensitive/ | 
| Ignore whitespace in the pattern (extended mode). | /(?x) \d +/ = /\d+/ | 
| Add comments. | (?x) also ignores everything from an unescaped # to the end of the line. You can use that or (?#your comment) to add comments. | 
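
A few of the rows above, tried in PHP. This is a sketch under stated assumptions: the subjects and variable names are made up for illustration, and the conditional pattern is anchored with ^ and $ here so that the whole subject is checked.

```php
<?php
// Word boundary: "art" inside "article" is left untouched.
$replaced = preg_replace('/\bart\b/', 'REPLACEMENT', 'This article is art');

// \Q...\E: the content is matched literally.
$literal    = preg_match('/\Q*\d+*\E/', 'a *\d+* b');
$notLiteral = preg_match('/\Q*\d+*\E/', '*123*');

// Condition: digits are allowed only when group 1 ("num") took part in the match.
$cond        = '/^alpha(num)?:(?(1)[[:alnum:]]|[[:alpha:]])+$/';
$alphaOk     = preg_match($cond, 'alpha:a');
$alphaDigit  = preg_match($cond, 'alpha:1');
$alnumDigits = preg_match($cond, 'alphanum:a1');

// Inline case flags: the first word is case insensitive, the second is not.
$caseOk  = preg_match('/(?i)insensitive(?-i)sensitive/', 'INSENSITIVEsensitive');
$caseBad = preg_match('/(?i)insensitive(?-i)sensitive/', 'insensitiveSENSITIVE');

// Extended mode: whitespace and # comments are ignored.
$extended = '/(?x)
    \d+   # one or more digits
    -     # a literal dash
    \d+   # one or more digits
/';
$range = preg_match($extended, '12-34');

echo $replaced, "\n";  // This article is REPLACEMENT
```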
