A regular expression (shortened as 'regex' or 'regexp') is a sequence of characters that defines a specific search pattern in text. Most general-purpose programming languages support regex either natively or through libraries, including JavaScript, Python, C, C++, and Java.
In this Gist, let's break down a regex for matching a URL and take a look at each of its components.
- Matching a URL :
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
- Anchors
- Quantifiers
- Grouping Constructs
- Bracket Expressions
- Character Classes
- Character Escapes
- Flags
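
As a quick check of the pattern in the summary, here is a minimal sketch that runs it against a few strings with `test()`. The names `urlPattern` and `samples` and the sample strings are assumptions made for illustration; they do not come from the original Gist.

```javascript
// The URL pattern from the summary, stored as a RegExp literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  "https://github.com/wonjong2", // true
  "example.org",                 // true: the scheme is optional
  "http://localhost",            // false: no dot followed by a domain suffix
  "not a url",                   // false
];

for (const sample of samples) {
  console.log(sample, "->", urlPattern.test(sample));
}
```

Because the pattern has no `i` flag, an input containing uppercase letters, such as `HTTPS://GitHub.com`, would not match.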
Anchors (^ and $) in the URL regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Anchors match a position before or after characters rather than the characters themselves.
- ^ : The caret anchor matches the beginning of the text
- $ : The dollar anchor matches the end of the text
Examples (JavaScript)
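
A minimal sketch of how the two anchors change what matches. The patterns `startsWithScheme` and `endsWithCom` and the input strings are hypothetical and used only for illustration.

```javascript
// ^ pins the match to the start of the text, $ pins it to the end.
const startsWithScheme = /^https?:\/\//;
const endsWithCom = /\.com$/;

console.log(startsWithScheme.test("https://example.com"));     // true
console.log(startsWithScheme.test("see https://example.com")); // false: text starts with "see"
console.log(endsWithCom.test("https://example.com"));          // true
console.log(endsWithCom.test("example.com/path"));             // false: text ends with "/path"

// Without anchors the same pattern may match anywhere in the text.
console.log(/https?:\/\//.test("see https://example.com"));    // true
```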
Quantifiers (?, +, {2,6} and *) in the URL regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Quantifiers match a number of instances of a character, group, or character class in a string.
Quantifier | Description |
---|---|
* | Match zero or more times - same as {0,} |
+ | Match one or more times - same as {1,} |
? | Match zero or one time - same as {0,1} |
{n} | Match exactly n times |
{n,} | Match at least n times |
{n,m} | Match from n to m times |
Examples (JavaScript)
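
The quantifiers in the table can be tried out with a few small patterns. This is a sketch under assumed inputs; the names `scheme` and `tld` are hypothetical.

```javascript
// ? makes the preceding "s" optional (zero or one time).
const scheme = /^https?$/;
console.log(scheme.test("http"));   // true
console.log(scheme.test("https"));  // true
console.log(scheme.test("httpss")); // false

// {2,6} accepts between two and six lowercase letters.
const tld = /^[a-z]{2,6}$/;
console.log(tld.test("io"));      // true
console.log(tld.test("museum"));  // true
console.log(tld.test("a"));       // false: too short
console.log(tld.test("website")); // false: seven letters

// * accepts an empty string, + requires at least one character.
console.log(/^\d*$/.test("")); // true
console.log(/^\d+$/.test("")); // false
```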
"/^(
https?://)
?(
[\da-z.-]+)
.(
[a-z.]{2,6})(
[/\w .-]*)
*/?$/"
Groups use the ( ) symbols. They are useful for creating blocks of patterns, so you can apply repetitions or other modifiers to them as a whole.
Example (JavaScript)
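
A sketch of two things groups make possible: reading back the text each group matched, and repeating a whole block. The variable names (`scheme`, `host`, `tld`, `path`) are labels chosen here for illustration, not names defined by the regex.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// match() returns the full match at index 0 and one entry per group after it.
const match = "https://github.com/wonjong2".match(urlPattern);
const [whole, scheme, host, tld, path] = match;

console.log(scheme); // "https://"
console.log(host);   // "github"
console.log(tld);    // "com"
console.log(path);   // "/wonjong2"

// A quantifier placed after a group applies to the group as a whole.
console.log(/^(ab)+$/.test("ababab")); // true: "ab" repeated three times
console.log(/^ab+$/.test("ababab"));   // false: only "b" is repeated here
```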
"/^(https?://)?([
\da-z.-]
+).([
a-z.]
{2,6})([
/\w .-]
*)*/?$/"
A bracket expression matches one character out of a set of characters. The square brackets can contain character ranges such as [a-z], [0-9], or [a-zA-Z0-9].
Examples (JavaScript)
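
A short sketch of bracket expressions in isolation. The names `hostChars` and `alphanumeric` and the test strings are hypothetical.

```javascript
// Same set as the host part of the URL regex: digits, a-z, dot, and hyphen.
// Inside brackets the dot is literal, and a hyphen placed last is literal too.
const hostChars = /^[\da-z.-]+$/;
console.log(hostChars.test("my-site.example1")); // true
console.log(hostChars.test("My_Site"));          // false: "M", "_" and "S" are outside the set

// Several ranges can be combined inside one pair of brackets.
// Without a quantifier the expression matches exactly one character.
const alphanumeric = /^[a-zA-Z0-9]$/;
console.log(alphanumeric.test("Q"));  // true
console.log(alphanumeric.test("_"));  // false
console.log(alphanumeric.test("QQ")); // false: two characters
```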
"/^(https?://)?([\d
a-z.-]+).([a-z.]{2,6})([/\w
.-]*)*/?$/"
A character class matches any one symbol from a predefined character set. A character class is also called a character set.
Characters | Meaning |
---|---|
\d | Matches any digit (Arabic numeral), same as [0-9] |
\D | Matches any character that is not a digit (Arabic numeral), same as [^0-9] |
\w | Matches any alphanumeric character from the basic Latin alphabet, including the underscore, same as [A-Za-z0-9_] |
\W | Matches any character that is not a word character from the basic Latin alphabet, same as [^A-Za-z0-9_] |
\s | Matches a single white space character, including space, tab, form feed, line feed, and other Unicode spaces, same as [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] |
\S | Matches a single character other than white space, same as [^ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] |
\t | Matches a horizontal tab |
\r | Matches a carriage return |
\n | Matches a linefeed |
Examples (JavaScript)
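
The classes from the table can be checked with a few one-line patterns. This is a sketch; all input strings are made up for illustration.

```javascript
// \d matches digits only.
console.log(/^\d+$/.test("2024")); // true
console.log(/^\d+$/.test("20a4")); // false

// \w matches letters, digits, and the underscore, but not the hyphen.
console.log(/^\w+$/.test("user_name42")); // true
console.log(/^\w+$/.test("user-name"));   // false

// \s matches spaces, tabs, and line breaks, so it can split on any of them.
console.log("a b\tc\nd".split(/\s+/)); // [ 'a', 'b', 'c', 'd' ]

// \D and \S are the negated forms of \d and \s.
console.log(/^\D+$/.test("abc")); // true
console.log(/\S/.test(" \t\n"));  // false: only white space
```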
"/^(https?:\
/\
/)?([\
da-z\
.-]+)\
.([a-z\
.]{2,6})([\
/\w \
.-]*)*\
/?$/"
Some characters have a special meaning in a regular expression, such as []{}()\^$.|?*+. To use a special character as a regular one, prepend it with a backslash: \
Examples (JavaScript)
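
A sketch of the difference an escape makes. The helper `escapeRegExp` is a hypothetical utility written for this example (it is not a built-in function); it prepends a backslash to every special character in a plain string.

```javascript
// An unescaped dot matches any character; an escaped dot matches only ".".
console.log(/^a.c$/.test("abc"));  // true
console.log(/^a\.c$/.test("abc")); // false
console.log(/^a\.c$/.test("a.c")); // true

// Hypothetical helper: escape every special character in a plain string
// so the string can be embedded in a pattern and matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literal = new RegExp("^" + escapeRegExp("1+1=2?") + "$");
console.log(escapeRegExp("1+1=2?")); // 1\+1=2\?
console.log(literal.test("1+1=2?")); // true
console.log(literal.test("11=2"));   // false: + and ? are no longer quantifiers
```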
A flag changes the default search behavior of a regular expression.
Flag | Name | Modification |
---|---|---|
i | Ignore Casing | With this flag the search is case-insensitive: no difference between A and a |
g | Global | With this flag the search looks for all matches; without it, only the first match is returned |
s | Dot All | Enables 'dotall' mode, which allows a dot . to match the newline character \n |
m | Multiline | Makes the boundary characters ^ and $ match the beginning and ending of every single line instead of the beginning and ending of the whole string |
u | Unicode | Enables full Unicode support |
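
Examples (JavaScript)

Each flag in the table can be seen by running the same pattern with and without it. This is a minimal sketch with made-up inputs, assuming a JavaScript engine that supports the `s` and `u` flags.

```javascript
// i: case-insensitive search.
console.log(/github/.test("GitHub"));  // false
console.log(/github/i.test("GitHub")); // true

// g: replace every match instead of only the first one.
console.log("a-b-c".replace(/-/, "+"));  // a+b-c
console.log("a-b-c".replace(/-/g, "+")); // a+b+c

// m: ^ and $ also match at the start and end of each line.
console.log(/^two$/.test("one\ntwo"));  // false
console.log(/^two$/m.test("one\ntwo")); // true

// s: the dot also matches a newline.
console.log(/a.b/.test("a\nb"));  // false
console.log(/a.b/s.test("a\nb")); // true

// u: the dot matches a full code point instead of one UTF-16 code unit.
console.log(/^.$/.test("\u{1F600}"));  // false
console.log(/^.$/u.test("\u{1F600}")); // true
```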
Wonjong Park : https://github.com/wonjong2