Regular expressions (regex) are a powerful way to manipulate text using set patterns. In this regex tutorial, we will be discussing how to match URLs with regex.
The regex we will be looking at for matching URLs is:
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
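To see the pattern at work, here is a minimal Python sketch. It assumes the pattern is written with both the ^ and $ anchors that this tutorial describes. The helper name is_url and the sample strings are illustrative choices, not part of the original regex.

```python
import re

# The tutorial's pattern, written with both the ^ and $ anchors.
# Python's re module accepts the escaped slashes (\/) unchanged.
URL_PATTERN = re.compile(
    r"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$"
)


def is_url(text: str) -> bool:
    """Return True when the whole string matches the URL pattern."""
    return URL_PATTERN.match(text) is not None


# Hypothetical sample inputs, chosen only for illustration.
for sample in ["https://example.com", "example.com/docs/", "not a url"]:
    print(sample, is_url(sample))
```

The first two samples are accepted and the last one is rejected, because it contains no dot-separated domain.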
Anchors are special characters that specify a position rather than a value itself. In this case, we have two of them:
^: The start of a string/line anchor. This signifies that the pattern that follows must begin at the start of the string.
$: The end of string/line anchor. This signifies that the pattern that precedes it must finish at the end of the string.
Example: a regex like ^abc matches any string that starts with "abc", and a regex like abc$ matches any string that ends with "abc".
In terms of the URL-matching regex, the ^ and $ anchors ensure that the entire string must match the URL pattern. In other words, a string is only considered a match if it is a URL from start to finish, without any extra characters.
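The two anchor examples can be checked with a short Python sketch. The helper names starts_with_abc and ends_with_abc are made up for this illustration.

```python
import re


def starts_with_abc(text: str) -> bool:
    """^abc matches only when "abc" sits at the very start."""
    return re.search(r"^abc", text) is not None


def ends_with_abc(text: str) -> bool:
    """abc$ matches only when "abc" sits at the very end."""
    return re.search(r"abc$", text) is not None


print(starts_with_abc("abcdef"), starts_with_abc("xabc"))
print(ends_with_abc("xyzabc"), ends_with_abc("abcx"))
```

Each line prints True followed by False: the unanchored side of the string is free to vary, while the anchored side is not.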
Quantifiers specify how many instances of a character, group, or character class must be present in the input to be a match.
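In this case, it uses four quantifiers:
- ?: This matches 0 or 1 instance of the preceding character/group. It makes the preceding character/group optional. It's used after the s in https, so that both "http" and "https" match, and after (https?:\/\/), making the "http://" or "https://" part of the URL optional. It also follows the final \/, allowing an optional trailing slash.
- +: This matches 1 or more instances of the preceding character/group. It's used inside ([\da-z\.-]+), requiring the domain name to contain at least one character.
- *: This matches 0 or more instances of the preceding character/group. It's used inside and after ([\/\w \.-]*), allowing for any number of path segments in the URL.
- {2,6}: This matches between 2 and 6 instances of the preceding character/group. It's used after [a-z\.] inside ([a-z\.]{2,6}), indicating that the top-level domain of the URL (like com or co.uk) should be 2-6 characters long.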
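The quantifiers can be tried in isolation. The sketch below uses a hypothetical helper, fully_matches, and small stand-in patterns rather than the full URL regex.

```python
import re


def fully_matches(pattern: str, text: str) -> bool:
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# ? makes the preceding "s" optional.
print(fully_matches(r"https?", "http"), fully_matches(r"https?", "https"))

# * accepts zero or more repetitions, including none at all.
print(fully_matches(r"a*", ""), fully_matches(r"a*", "aaa"))

# {2,6} accepts 2 to 6 repetitions: "io" and "museum" fit, "a" and "museums" do not.
for candidate in ["a", "io", "museum", "museums"]:
    print(candidate, fully_matches(r"[a-z]{2,6}", candidate))
```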
Grouping constructs treat multiple characters as a single unit. They are typically used to apply a quantifier to multiple characters. They can also establish a backreference to a pattern of characters rather than just a single character.
In this case, it uses four grouping constructs:
- (https?:\/\/): This matches "http://" or "https://". The ? quantifier that follows the group makes the protocol part of the URL optional.
- ([\da-z\.-]+): This matches one or more digits, lowercase letters, periods, or hyphens. It's used to match the domain name of the URL.
- ([a-z\.]{2,6}): This matches a sequence of 2 to 6 lowercase letters or periods. It's used to match the top-level domain of the URL.
- ([\/\w \.-]*): This matches any number of slashes, word characters, spaces, periods, or hyphens. It's used to match the path of the URL. Because ?, =, & and # are not in the set, a URL with a query string or fragment does not match.
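Groups also capture the text they matched, which can be read back after a match. The following Python sketch assumes the anchored form of the pattern. The function name split_url and the dictionary keys are illustrative. The fourth group is repeated by *, so it only retains its last repetition and is left out of the sketch.

```python
import re

URL_PATTERN = re.compile(
    r"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$"
)


def split_url(text: str):
    """Return the first three captured groups, or None when there is no match."""
    match = URL_PATTERN.match(text)
    if match is None:
        return None
    return {
        "protocol": match.group(1),  # None when the optional group is absent
        "domain": match.group(2),
        "tld": match.group(3),
    }


print(split_url("https://www.example.com/docs/page"))
print(split_url("example.org"))
```

For the first input the domain group holds "www.example" and the top-level domain group holds "com". For the second, the protocol is None because the optional group did not take part in the match.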
Bracket expressions define a character class. Characters within a bracket expression are treated as a single unit where any one character can match.
In this case, we have three bracket expressions:
- [a-z\.]: This matches any lowercase letter or a period. It's used to match the top-level domain of the URL, like "com".
- [\da-z\.-]: This matches any digit (0-9), lowercase letter, period, or hyphen. It's used to match the domain name of the URL.
- [\/\w \.-]: This matches a slash, word character (which includes letters, digits, and underscores), space, period, or hyphen. It's used to match the path of the URL. It does not include ?, =, & or #, so query strings and fragments fall outside it.
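One way to check what each bracket expression accepts is to test single characters against it. In this sketch the constant names, the accepted helper, and the candidate characters are all illustrative.

```python
import re

TLD_CHARS = re.compile(r"[a-z\.]")
DOMAIN_CHARS = re.compile(r"[\da-z\.-]")
PATH_CHARS = re.compile(r"[\/\w \.-]")

# A hand-picked set of characters to probe each class with.
CANDIDATES = ["a", "Z", "7", ".", "-", "/", "_", " ", "?", "#"]


def accepted(char_class, candidates=CANDIDATES):
    """List the candidate characters that the bracket expression matches."""
    return [c for c in candidates if char_class.fullmatch(c)]


print(accepted(TLD_CHARS))
print(accepted(DOMAIN_CHARS))
print(accepted(PATH_CHARS))
```

None of the three classes accepts ? or #, and only the path class accepts the uppercase Z, the underscore, the slash, and the space.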
Character classes are built-in shortcuts in regex for certain common groups of characters. They start with a backslash \ followed by a letter.
In this case, we use two character classes:
- \w: This stands for "word" character and matches any alphanumeric character (letters or digits) or an underscore. It's equivalent to [a-zA-Z0-9_]. In the URL-matching regex, it's used within the bracket expression that matches the path of the URL.
- \d: This stands for "digit" and matches any digit. It's equivalent to [0-9]. In the URL-matching regex, it's used within the bracket expression that matches the domain name.
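The shortcut classes can be compared with their spelled-out equivalents. One caveat for this Python sketch: for text patterns, Python 3 lets \w and \d match Unicode letters and digits as well, so the re.ASCII flag is used here to get the narrower behaviour described above. The agree helper is a made-up name.

```python
import re

# re.ASCII limits \w and \d to ASCII characters only.
WORD = re.compile(r"\w", re.ASCII)
WORD_SPELLED_OUT = re.compile(r"[a-zA-Z0-9_]")
DIGIT = re.compile(r"\d", re.ASCII)
DIGIT_SPELLED_OUT = re.compile(r"[0-9]")


def agree(shortcut, spelled_out) -> bool:
    """True when both patterns give the same verdict on every ASCII character."""
    return all(
        bool(shortcut.fullmatch(chr(code))) == bool(spelled_out.fullmatch(chr(code)))
        for code in range(128)
    )


print(agree(WORD, WORD_SPELLED_OUT))
print(agree(DIGIT, DIGIT_SPELLED_OUT))
```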
The OR Operator is represented by the pipe symbol |. It allows the expression to match either the pattern before or the pattern after the operator.
The URL-matching regex does not contain the | operator. It gets the same either/or effect from a quantifier:
(https?:\/\/): The ? quantifier after "s" makes the "s" optional, so this group matches either "http://" or "https://", just as (http|https):\/\/ would. The ? quantifier after the whole group signifies that the protocol part of the URL is optional.
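For comparison, the protocol part can be written with an explicit | and checked against the optional-s form. Both patterns below are stand-ins for the protocol portion only, and the helper name same_result is hypothetical.

```python
import re

WITH_OPTIONAL_S = re.compile(r"https?:\/\/")
WITH_PIPE = re.compile(r"(https|http):\/\/")


def same_result(text: str) -> bool:
    """True when both spellings agree on whether the text is a protocol prefix."""
    return bool(WITH_OPTIONAL_S.fullmatch(text)) == bool(WITH_PIPE.fullmatch(text))


for prefix in ["http://", "https://", "ftp://"]:
    print(prefix, bool(WITH_PIPE.fullmatch(prefix)), same_result(prefix))
```

Both spellings accept "http://" and "https://" and reject "ftp://". The quantifier form is shorter when the alternatives differ by a single optional character.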
I am a full-stack web developer constantly looking forward to expanding my coding knowledge and skills. This tutorial was written by Neel Chakravartty.