2017-07-24 09:12:30 +01:00
2017-07-22 13:31:31 +01:00

What is a Regular Expression?

A regular expression is a group of characters or symbols used to find a specific pattern in a text.

A regular expression is a pattern that is matched against a subject string from left to right. "Regular expression" is a mouthful, so you will usually find the term abbreviated as "regex" or "regexp". Regular expressions are used for replacing text within a string, validating forms, extracting a substring from a string based on a pattern match, and much more.

Imagine you are writing an application and you want to set the rules for when users choose their username. We want to allow the username to contain letters, numbers, underscores and hyphens. We also want to limit the number of characters in the username so it does not look ugly. We use the following regular expression to validate a username:

Regular expression

The regular expression above accepts the strings "john_doe", "jo-hn_doe" and "john12_as". It does not match "Jo" because that string contains an uppercase letter and is also too short.
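The username rules described above can be sketched in Python with the standard re module. The exact pattern and the length bounds (3 to 15 characters) are assumptions made for illustration; the text only says that the length is limited and that "Jo" is too short.

```python
import re

# Hypothetical pattern: lowercase letters, digits, underscore and hyphen.
# The length bounds 3 and 15 are assumed, not taken from the text.
USERNAME_RE = re.compile(r"[a-z0-9_-]{3,15}")


def is_valid_username(name: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    return USERNAME_RE.fullmatch(name) is not None


for candidate in ["john_doe", "jo-hn_doe", "john12_as", "Jo"]:
    print(candidate, is_valid_username(candidate))
```

fullmatch is used here so that the pattern has to cover the entire string, which plays the same role as wrapping the pattern in ^ and $.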

1. Basic Matchers

A regular expression is just a pattern of letters and digits that we use to search a text. For example, the regular expression cat means: the letter c, followed by the letter a, followed by the letter t.

"cat" => The cat sat on the mat

The regular expression 123 matches the string "123". The regular expression is matched against an input string by comparing each character in the regular expression to each character in the input string, one after another. Regular expressions are normally case-sensitive, so the regular expression Cat would not match the string "cat".

"Cat" => The cat sat on the Cat
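As a minimal sketch of literal matching and case sensitivity, here is how this could look in Python's re module (the choice of Python is an assumption; the tutorial itself is engine-neutral):

```python
import re

text = "The cat sat on the mat"

# A literal pattern matches the same characters in the same order.
cat_hits = re.findall(r"cat", text)

# Matching is case-sensitive by default, so "Cat" finds nothing here.
capital_hits = re.findall(r"Cat", text)

# re.IGNORECASE is an opt-in flag that relaxes case sensitivity.
relaxed_hits = re.findall(r"Cat", text, flags=re.IGNORECASE)

print(cat_hits)      # ['cat']
print(capital_hits)  # []
print(relaxed_hits)  # ['cat']
```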

2. Meta Characters

Meta characters are the building blocks of regular expressions. Meta characters do not stand for themselves but instead are interpreted in some special way. Some meta characters have a special meaning when written inside square brackets. The meta characters are as follows:

Meta character    Description
.                 Period. Matches any single character except a line break.
[ ]               Character class. Matches any character contained between the square brackets.
[^ ]              Negated character class. Matches any character that is not contained between the square brackets.
*                 Matches 0 or more repetitions of the preceding symbol.
+                 Matches 1 or more repetitions of the preceding symbol.
?                 Makes the preceding symbol optional.
{n,m}             Braces. Matches at least "n" but not more than "m" repetitions of the preceding symbol.
(xyz)             Character group. Matches the characters xyz in that exact order.
|                 Alternation. Matches either the characters before or the characters after the symbol.
\                 Escapes the next character. This allows you to match reserved characters [ ] ( ) { } . * + ? ^ $ \ |
^                 Matches the beginning of the input.
$                 Matches the end of the input.
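To illustrate the escape character from the table, here is a small sketch in Python's re module. The sample strings are made up for illustration.

```python
import re

sample = "1.5 125 1x5"

# Unescaped, the period is a meta character and matches any character.
unescaped = re.findall(r"1.5", sample)

# Escaped with a backslash, it matches only a literal period.
escaped = re.findall(r"1\.5", sample)

# re.escape adds the backslashes for you when the text to match is data.
literal_plus = re.findall(re.escape("a+b"), "a+b aab")

print(unescaped)     # ['1.5', '125', '1x5']
print(escaped)       # ['1.5']
print(literal_plus)  # ['a+b']
```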

2.1 Full stop

The full stop . is the simplest example of a meta character. The meta character . matches any single character. By default it will not match line break characters. For example, the regular expression .ar means: any character, followed by the letter a, followed by the letter r.

".ar" => The car parked in the garage.
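A quick way to check the .ar example is to run it, here sketched with Python's re module. The re.DOTALL flag shown at the end is Python-specific; other engines spell this option differently.

```python
import re

text = "The car parked in the garage."

# "." stands for any single character, so three words contribute a match.
dot_hits = re.findall(r".ar", text)
print(dot_hits)  # ['car', 'par', 'gar']

# By default "." does not match a newline.
newline_hits = re.findall(r".ar", "\nar")
print(newline_hits)  # []

# With re.DOTALL the newline is matched as well.
dotall_hits = re.findall(r".ar", "\nar", flags=re.DOTALL)
print(dotall_hits)  # ['\nar']
```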

2.2 Character set

Character sets are also called character classes. Square brackets are used to specify character sets. Use a hyphen inside a character set to specify a range of characters. The order of the characters and ranges listed inside the square brackets doesn't matter. For example, the regular expression [Tt]he means: an uppercase T or lowercase t, followed by the letter h, followed by the letter e.

"[Tt]he" => The car parked in the garage.

A period inside a character set, however, means a literal period. The regular expression ar[.] means: a lowercase character a, followed by the letter r, followed by a period character.

"ar[.]" => A garage is a good place to park a car.
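The behaviour of character sets can be checked with a short sketch in Python's re module. The second sample sentence is chosen for illustration so that it contains an "ar" directly followed by a period.

```python
import re

text = "The car parked in the garage."

# Either case of the first letter is accepted.
the_hits = re.findall(r"[Tt]he", text)
print(the_hits)  # ['The', 'the']

# The order of the characters inside the brackets makes no difference.
same_hits = re.findall(r"[tT]he", text)
print(same_hits == the_hits)  # True

sample = "A garage is a good place to park a car."

# Inside a character set the period loses its special meaning.
literal_period = re.findall(r"ar[.]", sample)
print(literal_period)  # ['ar.']

# Outside a set it matches any character.
any_character = re.findall(r"ar.", sample)
print(any_character)  # ['ara', 'ark', 'ar.']
```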

2.2.1 Negated character set

In general, the caret symbol represents the start of the string, but when it is typed directly after the opening square bracket it negates the character set. For example, the regular expression [^c]ar means: any character except c, followed by the letter a, followed by the letter r.

"[^c]ar" => The car parked in the garage.
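A minimal sketch of the two roles of the caret, again assuming Python's re module:

```python
import re

text = "The car parked in the garage."

# Inside the brackets, a leading caret negates the set: "car" is skipped.
negated_hits = re.findall(r"[^c]ar", text)
print(negated_hits)  # ['par', 'gar']

# Outside the brackets, the caret anchors the match to the start of the input.
start_hits = re.findall(r"^The", text)
print(start_hits)  # ['The']
```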

2.3 Repetitions

The meta characters +, * and ? are used to specify how many times a subpattern can occur. These meta characters act differently in different situations.

2.3.1 The Star

The symbol * matches zero or more repetitions of the preceding matcher. The regular expression a* means: zero or more repetitions of the preceding lowercase character a. But if it appears after a character set or class, it finds repetitions of the whole character set. For example, the regular expression [a-z]* means: any number of lowercase letters in a row.

"[a-z]*" => The car parked in the garage #21.

The * symbol can be used with the meta character . to match any string of characters: .*. The * symbol can also be used with the whitespace character \s to match a string of whitespace characters. For example, the expression \s*cat\s* means: zero or more whitespace characters, followed by lowercase character c, followed by lowercase character a, followed by lowercase character t, followed by zero or more whitespace characters.

"\s*cat\s*" => The fat cat sat on the cat.
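The star examples can be tried out with the sketch below, assuming Python's re module. Because [a-z]* is allowed to match zero characters, findall also reports empty matches, so they are filtered out here to keep the output readable.

```python
import re

text = "The car parked in the garage #21."

# Keep only the non-empty runs of lowercase letters.
lowercase_runs = [m for m in re.findall(r"[a-z]*", text) if m]
print(lowercase_runs)  # ['he', 'car', 'parked', 'in', 'the', 'garage']

# "Zero or more" means the empty string is a valid match for a*.
matches_empty = re.fullmatch(r"a*", "") is not None
print(matches_empty)  # True

# Optional whitespace on both sides of "cat" is included in the match.
cat_hits = re.findall(r"\s*cat\s*", "The fat cat sat on the cat.")
print(cat_hits)  # [' cat ', ' cat']
```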

2.3.2 The Plus

The symbol + matches one or more repetitions of the preceding character. For example, the regular expression c.+t means: lowercase letter c, followed by at least one character of any kind, followed by the lowercase character t.

"c.+t" => The fat cat sat on the mat.
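As a sketch in Python's re module, the c.+t example behaves as follows. Note that + is greedy, so the match runs to the last t in the sentence rather than stopping at the first one.

```python
import re

text = "The fat cat sat on the mat."

# Greedy matching: from the first "c" to the last "t".
plus_hits = re.findall(r"c.+t", text)
print(plus_hits)  # ['cat sat on the mat']

# "+" needs at least one character between c and t; "*" does not.
plus_on_ct = re.search(r"c.+t", "ct") is not None
star_on_ct = re.search(r"c.*t", "ct") is not None
print(plus_on_ct, star_on_ct)  # False True
```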

2.3.3 The Question Mark

In a regular expression, the meta character ? makes the preceding character optional. This symbol matches zero or one instance of the preceding character. For example, the regular expression [T]?he means: an optional uppercase letter T, followed by the lowercase character h, followed by the lowercase character e.

"[T]he" => The car is parked in the garage.
"[T]?he" => The car is parked in the garage.
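The difference between the two patterns above can be seen by running them, sketched here with Python's re module:

```python
import re

text = "The car is parked in the garage."

# Without "?", the uppercase T is required.
required_hits = re.findall(r"[T]he", text)
print(required_hits)  # ['The']

# With "?", the T may be absent, so the "he" inside "the" matches too.
optional_hits = re.findall(r"[T]?he", text)
print(optional_hits)  # ['The', 'he']

# "?" allows at most one occurrence, so a doubled T is rejected.
double_t = re.fullmatch(r"[T]?he", "TThe") is not None
print(double_t)  # False
```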

2.4 Braces

In a regular expression, braces (also called quantifiers) are used to specify the number of times that a character or a group of characters can be repeated. For example, the regular expression [0-9]{2,3} means: match at least 2 digits but not more than 3 (characters in the range of 0 to 9).

"[0-9]{2,3}" => The number was 9.9997 but we rounded it off to 10.0.

We can leave out the second number. For example, the regular expression [0-9]{2,} means: match 2 or more digits. If we also remove the comma, the regular expression [0-9]{2} means: match exactly 2 digits.

"[0-9]{2,}" => The number was 9.9997 but we rounded it off to 10.0.
"[0-9]{2}" => The number was 9.9997 but we rounded it off to 10.0.
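The three brace forms can be compared side by side with a short sketch, assuming Python's re module:

```python
import re

text = "The number was 9.9997 but we rounded it off to 10.0."

# Between 2 and 3 digits: "9997" yields "999", and the leftover "7" is too short.
two_to_three = re.findall(r"[0-9]{2,3}", text)
print(two_to_three)  # ['999', '10']

# 2 or more digits: the whole run "9997" is taken.
two_or_more = re.findall(r"[0-9]{2,}", text)
print(two_or_more)  # ['9997', '10']

# Exactly 2 digits: "9997" is split into two matches.
exactly_two = re.findall(r"[0-9]{2}", text)
print(exactly_two)  # ['99', '97', '10']
```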

2.5 Character Group

A character group is a group of sub-patterns written inside parentheses (...). As discussed before, if we put a quantifier after a character, it repeats the preceding character. But if we put a quantifier after a character group, it repeats the whole character group. For example, the regular expression (ab)* matches zero or more repetitions of the characters "ab". We can also use the alternation meta character | inside a character group. For example, the regular expression (c|g|p)ar means: lowercase character c, g or p, followed by the character a, followed by the character r.

"(c|g|p)ar" => The car is parked in the garage.
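A minimal sketch of groups in Python's re module follows. One Python-specific detail: when a pattern contains a group, re.findall returns the group's content rather than the whole match, so finditer is used to get the full text.

```python
import re

text = "The car is parked in the garage."

# group(0) is the whole match; findall alone would return only the group.
whole_matches = [m.group(0) for m in re.finditer(r"(c|g|p)ar", text)]
captured = re.findall(r"(c|g|p)ar", text)
print(whole_matches)  # ['car', 'par', 'gar']
print(captured)       # ['c', 'p', 'g']

# A quantifier after a group repeats the whole group, not just one character.
repeats_group = re.fullmatch(r"(ab)*", "ababab") is not None
partial_group = re.fullmatch(r"(ab)*", "aba") is not None
repeats_char = re.fullmatch(r"ab*", "abbb") is not None
print(repeats_group, partial_group, repeats_char)  # True False True
```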

2.6 Alternation

In a regular expression, the vertical bar | is used to define alternation. Alternation is like a condition between multiple expressions. Now, you may be thinking that character sets and alternation work the same way. But the big difference between them is that a character set works at the character level, whereas alternation works at the expression level. For example, the regular expression [T|t]he|car means: either an uppercase character T or a lowercase t, followed by lowercase h, followed by lowercase e; or lowercase c, followed by a, followed by r. Note that inside the square brackets the | is not an alternation operator but a literal character, so [T|t] would also accept a | in that position; [Tt] expresses the intent more precisely.

"[T|t]he|car" => The car is parked in the garage.
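The contrast between the character level and the expression level can be sketched in Python's re module. The string "|he" is a made-up input used only to show how the vertical bar behaves inside square brackets.

```python
import re

text = "The car is parked in the garage."

# Top-level alternation: the whole left side or the whole right side.
alt_hits = re.findall(r"[T|t]he|car", text)
print(alt_hits)  # ['The', 'car', 'the']

# Inside square brackets "|" is an ordinary character, so "|he" matches.
pipe_in_set = re.findall(r"[T|t]he", "|he")
print(pipe_in_set)  # ['|he']

# Inside a (non-capturing) group "|" alternates, so "|he" does not match.
pipe_in_group = re.findall(r"(?:T|t)he", "|he")
print(pipe_in_group)  # []
```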