# Introduction to Regular Expressions
A regular expression (regex) is a powerful tool for matching and processing text. It describes a search pattern as a string written in a specialized syntax.
For example, checking character by character whether an input email address is well formed is tedious. Instead, a regular expression such as `^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$` can validate the overall format in a single step.
```python
import re

# Validate email format
email_pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
# "user@example.com" is a placeholder address used for illustration
if re.match(email_pattern, "user@example.com"):
    print("Valid email")
```
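
To see which inputs the pattern accepts and which it rejects, it can be wrapped in a small helper and run against a few candidates. This is a minimal sketch: the name `looks_like_email` and the sample addresses are hypothetical, chosen only for illustration.

```python
import re

# Same pattern as above, compiled once so it can be reused.
EMAIL_PATTERN = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')


def looks_like_email(text: str) -> bool:
    """Return True if text has the shape local@domain.tld (hypothetical helper)."""
    return EMAIL_PATTERN.match(text) is not None


# Illustrative inputs: the first two are well formed, the last two are not.
candidates = [
    "user@example.com",
    "first.last+tag@mail.example.org",
    "user@example",             # no dot after the domain name
    "no-at-sign.example.com",   # missing "@"
]

for candidate in candidates:
    print(candidate, looks_like_email(candidate))
```

A pattern like this only checks the general shape of an address; it does not confirm that the address exists. Note also that `$` in Python matches just before a trailing newline, so `re.fullmatch` may be the stricter choice when that matters.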
# Metacharacters
| Metacharacter | Meaning | Example |
|---|---|---|
| `.` | Matches any single character (except newline) | `a.c` → `abc`, `a1c` |
| `^` | Matches the beginning of the string | `^abc` → matches `abcxxxx` |
| `$` | Matches the end of the string | `abc$` → matches `xxxxabc` |
| `*` | Matches 0 or more repetitions of the preceding element | `a*` → `""`, `a`, `aa` |
| `+` | Matches 1 or more repetitions of the preceding element | `a+` → `a`, `aa` |
| `?` | Matches 0 or 1 repetition of the preceding element | `a?` → `""`, `a` |
| `{n}` | Matches exactly n repetitions | `a{2}` → `aa` |
| `{min,}` | Matches at least min repetitions | `a{2,}` → `aa`, `aaa`, `aaaa` |
| `{min,max}` | Matches between min and max repetitions | `a{2,3}` → `aa`, `aaa` |
| `[]` | Matches any one character inside the brackets | `[abc]` → `a`, `b`, `c` |
| `[^]` | Matches any one character not in the brackets | `[^abc]` → `d`, `e`, `f` |
| `[-]` | Indicates a range inside brackets | `[a-z]` → `a`, `b`, ..., `z` |
| `()` | Groups expressions | `(abc)+` → `abc`, `abcabc` |
| `\|` | OR operator (alternation) | `abc\|xyz` → `abc` or `xyz` |
| `\d` | Matches any digit; same as `[0-9]` for ASCII text | `\d` → `1`, `2`, `3` |
| `\D` | Matches any non-digit; same as `[^0-9]` for ASCII text | `\D` → `a`, `@`, `_` |
| `\w` | Matches a letter, digit, or underscore; same as `[a-zA-Z0-9_]` for ASCII text | `\w` → `a`, `1`, `_` |
| `\W` | Matches any non-word character; same as `[^a-zA-Z0-9_]` for ASCII text | `\W` → `@`, `#` |
| `\s` | Matches any whitespace character | `\s` → space, `\t`, `\n`, etc. |
| `\S` | Matches any non-whitespace character | `\S` → `a`, `1`, `@` |
| `\b` | Matches at a word boundary | `\bcat\b` → matches `cat` as a whole word in a sentence |
| `\B` | Matches where there is no word boundary | `\Bcat\B` → matches `cat` in `scatter` |
| `\r` | Carriage return | - |
| `\n` | Newline | - |
| `\f` | Form feed | - |
| `\t` | Tab | - |
| `\v` | Vertical tab | - |
| `\` | Escape character; treats the following special character literally | `\+` → `+` |
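
A few of the metacharacters from the table can be tried out with Python's `re` module. The sketch below is illustrative only: the sample strings and variable names are made up for the demonstration and are not part of the table.

```python
import re

# Quantifier {min,max}: two or three consecutive "a" characters.
quantified = re.findall(r"a{2,3}", "a aa aaa aaaa")
print(quantified)   # ['aa', 'aaa', 'aaa']

# Negated character class: every character that is not a, b, or c.
negated = re.findall(r"[^abc]", "abcdef")
print(negated)      # ['d', 'e', 'f']

# \d+ : runs of one or more digits.
digits = re.findall(r"\d+", "Order 66 shipped in 3 days")
print(digits)       # ['66', '3']

# \b matches "cat" only as a whole word; \B only inside a longer word.
sentence = "cat scatter cat."
whole_words = re.findall(r"\bcat\b", sentence)
inside_words = re.findall(r"\Bcat\B", sentence)
print(whole_words)  # ['cat', 'cat']
print(inside_words) # ['cat']

# Grouping with a quantifier, and alternation with |.
grouped = re.fullmatch(r"(abc)+", "abcabc") is not None
either = re.findall(r"abc|xyz", "xyz then abc")
print(grouped)      # True
print(either)       # ['xyz', 'abc']

# Escaping: \+ matches a literal plus sign.
literal_plus = re.findall(r"\+", "1+1=2")
print(literal_plus) # ['+']
```

Raw strings (`r"..."`) are used so that backslashes reach the regex engine unchanged.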
# Greedy vs Lazy Matching
By default, quantifiers are greedy: they match as much text as possible while still allowing the overall pattern to match. Appending `?` to a quantifier makes it lazy (non-greedy), so it matches as little text as possible.
| Greedy Pattern | Description | Lazy Pattern | Description |
|---|---|---|---|
| `.*` | Match 0 or more, longest possible | `.*?` | Match 0 or more, shortest possible |
| `.+` | Match 1 or more, longest possible | `.+?` | Match 1 or more, shortest possible |
| `.?` | Match 0 or 1, longest possible | `.??` | Match 0 or 1, shortest possible |
| `.{n,m}` | Match n to m times, longest possible | `.{n,m}?` | Match n to m times, shortest possible |
| `.{n,}` | Match at least n times, longest possible | `.{n,}?` | Match at least n times, shortest possible |
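
The difference between greedy and lazy quantifiers shows up clearly when a pattern has delimiters that occur more than once in the input. The following is a small sketch using Python's `re` module; the HTML-like sample text is an assumption made up for the demonstration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string, so there is one long match.
greedy = re.findall(r"<.+>", text)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the first ">" it can, so each tag matches separately.
lazy = re.findall(r"<.+?>", text)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# The same applies to bounded quantifiers such as {n,m}.
greedy_bounded = re.match(r"a{2,4}", "aaaaa").group()
lazy_bounded = re.match(r"a{2,4}?", "aaaaa").group()
print(greedy_bounded)  # aaaa
print(lazy_bounded)    # aa
```

Lazy matching changes which match is preferred, not whether a match exists: the engine still extends a lazy quantifier when that is the only way for the rest of the pattern to succeed.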