Regular Expressions for Python
Harnessing the Power of Regex
Mastering Regular Expressions: A Comprehensive Guide
Introduction
Regular expressions (regex) are a powerful tool for pattern matching and text manipulation. They let you search for, extract, and replace specific patterns within strings, which makes them invaluable for tasks like:
- Data validation
- Parsing
- Text mining
- Log analysis
- Search-and-replace operations
This guide will introduce you to the fundamental concepts, syntax, and real-world applications of regular expressions.
Table of Contents
- Basic Syntax
- Special Characters
- Grouping and Capturing
- Lookaheads and Lookbehinds
- Common Use Cases
- Regex in Python
- Performance Considerations
- Practical Examples
- Debugging and Testing
- Conclusion
Character Classes
Match sets of characters:
- `[abc]`: matches a, b, or c.
- `[a-z]`: matches any lowercase letter.
- `[A-Z]`: matches any uppercase letter.
- `[0-9]`: matches any digit.
- `[a-zA-Z0-9]`: matches any alphanumeric character.
- `[^a-z]`: the `^` inside brackets negates the set, matching any character except a lowercase letter.
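
As a quick illustration, the classes above can be tried with `re.findall` from Python's standard `re` module. This is only a sketch; the sample string is made up for the example.

```python
import re

sample = "Room 42B, floor 3"  # made-up sample text for this sketch

uppercase = re.findall(r"[A-Z]", sample)   # uppercase letters only
digits = re.findall(r"[0-9]", sample)      # digits only
not_lower = re.findall(r"[^a-z]", sample)  # anything that is not a lowercase letter

print(uppercase)  # ['R', 'B']
print(digits)     # ['4', '2', '3']
print(not_lower)  # ['R', ' ', '4', '2', 'B', ',', ' ', ' ', '3']
```

Note that the negated class also picks up spaces and punctuation, since it matches every character outside the listed range.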
Anchors
Match a position within the string, not a character:
- `^`: matches the beginning of the string. `^Hello` matches "Hello world".
- `$`: matches the end of the string. `world$` matches "Hello world".
- `\b`: matches a word boundary (between a word and non-word character). `\bcat\b` matches "cat" but not "caterpillar".
- `\B`: matches a non-word boundary. `\Bcat\B` matches the "cat" inside "concatenate" but not the standalone word "cat".
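
The anchor behaviour described above can be checked with a few calls to `re.search`. The sketch below uses a small helper, `found`, which is a hypothetical name introduced only for this example.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"^Hello", "Hello world"))   # True: "Hello" is at the start
print(found(r"^world", "Hello world"))   # False: "world" is not at the start
print(found(r"world$", "Hello world"))   # True: "world" is at the end
print(found(r"\bcat\b", "the cat sat"))  # True: "cat" stands alone
print(found(r"\bcat\b", "caterpillar"))  # False: no boundary after "cat"
print(found(r"\Bcat\B", "concatenate"))  # True: letters on both sides
print(found(r"\Bcat\B", "cat"))          # False: boundaries on both sides
```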
Special Character Metacharacters
These are shorthand for common character classes.

| Syntax | Description | Example |
|--------|-------------|---------|
| `.` | Matches any single character (except newline). | `a.c` matches "abc", "a1c" |
| `\d` | Matches any digit. Equivalent to `[0-9]`. | `\d` matches 1, 9 |
| `\D` | Matches any non-digit. Equivalent to `[^0-9]`. | `\D` matches a, @ |
| `\w` | Matches any word character (alphanumeric + underscore). | `\w` matches a, _, 1 |
| `\W` | Matches any non-word character. | `\W` matches #, ! |
| `\s` | Matches any whitespace (space, tab, newline). | `\s` matches " ", `\t` |
| `\S` | Matches any non-whitespace. | `\S` matches a, 1 |
| `[a-z]` | Matches any lowercase letter. | a, b, z |
| `[A-Z]` | Matches any uppercase letter. | A, B, Z |
| `[0-9]` | Matches any digit. | 0, 1, 9 |
| `[a-zA-Z0-9]` | Matches any alphanumeric character. | a, B, 1 |
| `[^a-z]` | Matches any character except lowercase letters. | A, 1, @ |

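
To see the shorthand classes in action, here is a minimal sketch using `re.findall` and `re.split`. The sample text is invented for illustration.

```python
import re

entry = "user_1 paid $25 on 2024-03-07"  # made-up sample text

digits = re.findall(r"\d+", entry)    # runs of digits
words = re.findall(r"\w+", entry)     # runs of letters, digits, underscores
fields = re.split(r"\s+", entry)      # split on runs of whitespace
dot_hits = re.findall(r"a.c", "abc a1c ac")  # "a", any one character, "c"

print(digits)    # ['1', '25', '2024', '03', '07']
print(words)     # ['user_1', 'paid', '25', 'on', '2024', '03', '07']
print(fields)    # ['user_1', 'paid', '$25', 'on', '2024-03-07']
print(dot_hits)  # ['abc', 'a1c']
```

The last line shows that `.` must consume exactly one character, so "ac" is not a match.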
Anchors
Match the beginning or end of a string:

| Syntax | Description | Example |
|--------|-------------|---------|
| `^` | Matches the beginning of the string. | `^hello` matches "hello world" |
| `$` | Matches the end of the string. | `world$` matches "hello world" |

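
One detail worth knowing when the input has several lines: by default `^` matches only at the start of the whole string, while the standard `re.MULTILINE` flag extends `^` and `$` to every line. The following is a small sketch with an invented two-line string.

```python
import re

notes = "hello world\nhello again"  # made-up two-line sample

# Default behaviour: ^ matches only at the very start of the string.
first_only = re.findall(r"^hello", notes)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
each_line = re.findall(r"^hello", notes, re.MULTILINE)
line_ends = re.findall(r"\w+$", notes, re.MULTILINE)

print(first_only)  # ['hello']
print(each_line)   # ['hello', 'hello']
print(line_ends)   # ['world', 'again']
```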
Quantifiers
Specify how many times a character or group should be repeated:

| Syntax | Description | Example |
|--------|-------------|---------|
| `*` | Matches zero or more occurrences. | `a*` matches "", "a", "aa" |
| `+` | Matches one or more occurrences. | `a+` matches "a", "aa" |
| `?` | Matches zero or one occurrence. | `a?` matches "", "a" |
| `{n}` | Matches exactly n occurrences. | `a{3}` matches "aaa" |
| `{n,}` | Matches n or more occurrences. | `a{2,}` matches "aa", "aaa" |
| `{n,m}` | Matches between n and m occurrences. | `a{2,4}` matches "aa", "aaa", "aaaa" |

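
The quantifiers in the table can be compared side by side. In this sketch the `\b` anchors are added so that each run of "a" is tested as a whole word; the sample string is made up.

```python
import re

text = "a aa aaa aaaa aaaaa"  # made-up sample: runs of 1 to 5 letters

exactly_three = re.findall(r"\ba{3}\b", text)    # exactly 3
two_or_more = re.findall(r"\ba{2,}\b", text)     # 2 or more
two_to_four = re.findall(r"\ba{2,4}\b", text)    # between 2 and 4

print(exactly_three)  # ['aaa']
print(two_or_more)    # ['aa', 'aaa', 'aaaa', 'aaaaa']
print(two_to_four)    # ['aa', 'aaa', 'aaaa']

# ?, * and + applied to a single preceding character
optional_u = re.fullmatch(r"colou?r", "color") is not None  # "u" may be absent
star_ok = re.fullmatch(r"ab*", "a") is not None             # zero "b" is allowed
plus_ok = re.fullmatch(r"ab+", "a") is not None             # at least one "b" needed

print(optional_u, star_ok, plus_ok)  # True True False
```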
Special Characters

| Syntax | Description | Example |
|--------|-------------|---------|
| `.` | Matches any character (except newline). | `a.c` matches "abc", "a1c" |
| `\d` | Matches a digit. | `\d` matches 1, 2 |
| `\w` | Matches a word character. | `\w` matches a, _, 1 |
| `\s` | Matches whitespace. | `\s` matches " ", `\t` |
| `\D` | Matches a non-digit character. | `\D` matches a, @ |
| `\W` | Matches a non-word character. | `\W` matches @, # |
| `\S` | Matches a non-whitespace character. | `\S` matches a, 1 |

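
The "except newline" caveat for `.` is easy to verify. The sketch below also uses the standard `re.DOTALL` flag, which is not covered in the table but lifts that restriction, and shows the negated shorthands `\S` and `\D`.

```python
import re

# By default "." does not match a newline, so this search fails.
no_match = re.search(r"a.c", "a\nc")

# With re.DOTALL, "." matches newlines as well.
with_dotall = re.search(r"a.c", "a\nc", re.DOTALL)

# \S+ collects runs of non-whitespace, \D+ runs of non-digits.
tokens = re.findall(r"\S+", "tab\tseparated values")
letters = re.findall(r"\D+", "12ab34cd")

print(no_match)                 # None
print(with_dotall is not None)  # True
print(tokens)                   # ['tab', 'separated', 'values']
print(letters)                  # ['ab', 'cd']
```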
Grouping and Capturing
Parentheses `()` are used to group parts of a regex and capture matched text for extraction or backreferencing.

| Syntax | Description | Example |
|--------|-------------|---------|
| `(pattern)` | Groups the pattern. | `(abc)` |
| `\1`, `\2` | Refer to captured groups (backreferences). | `(a).\1` matches "aba" |

Example:
To extract the area code and phone number from a string like "(123) 456-7890":

    \((\d{3})\) (\d{3}-\d{4})

Python Example:

    import re

    text = "(123) 456-7890"
    # \( and \) match the literal parentheses around the area code
    pattern = r"\((\d{3})\) (\d{3}-\d{4})"
    match = re.search(pattern, text)
    if match:
        area_code = match.group(1)     # "123"
        phone_number = match.group(2)  # "456-7890"
        print(f"Area Code: {area_code}, Phone: {phone_number}")

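
Two related features may be useful here: named groups, which label each capture, and backreferences inside a pattern. The sketch below is one possible way to use them; the group names `area` and `number` and the doubled-word sample are assumptions made for illustration.

```python
import re

text = "(123) 456-7890"

# Named groups, written (?P<name>...), make the captured parts self-describing.
pattern = re.compile(r"\((?P<area>\d{3})\) (?P<number>\d{3}-\d{4})")
match = pattern.search(text)
parts = match.groupdict() if match else {}
print(parts)  # {'area': '123', 'number': '456-7890'}

# A backreference (\1) must match the same text the first group captured,
# which makes it a simple way to spot accidentally doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")
print(doubled)  # ['is', 'test']
```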
Common Use Cases

| Type | Syntax |
|------|--------|
| Email Validation | `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$` |
| Extracting URLs | `https?://[^\s]+` |
| Finding Dates | `\d{2}-\d{2}-\d{4}` |
| Password Strength Check | `^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$` |

- Email Validation: validates email addresses (e.g., "user@example.com").
- Extracting URLs: matches HTTP/HTTPS URLs in text.
- Finding Dates: matches dates in DD-MM-YYYY format.
- Password Strength Check: ensures passwords have at least one uppercase letter, one lowercase letter, one digit, one special character, and are at least 8 characters long.
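
A short sketch of how these four patterns might be exercised in Python follows. It assumes the email pattern is written with an escaped dot (`\.`) before the top-level domain and the password lookaheads with `.*`; the constant names and test strings are invented for this example.

```python
import re

EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
URL = re.compile(r"https?://[^\s]+")
DATE = re.compile(r"\d{2}-\d{2}-\d{4}")
PASSWORD = re.compile(
    r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
)

good_email = EMAIL.match("user@example.com") is not None
bad_email = EMAIL.match("user@example") is not None  # no top-level domain
urls = URL.findall("Docs at https://example.com/a and http://example.org")
dates = DATE.findall("Due 25-12-2024, filed 01-01-2025")
strong = PASSWORD.match("Str0ng!pass") is not None
weak = PASSWORD.match("weakpass") is not None        # no uppercase, digit, or symbol

print(good_email, bad_email)  # True False
print(urls)                   # ['https://example.com/a', 'http://example.org']
print(dates)                  # ['25-12-2024', '01-01-2025']
print(strong, weak)           # True False
```

Patterns like these check shape only: the date pattern would accept "99-99-9999", and the email pattern is far looser than the full address grammar.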
Regex in Python
Python's built-in `re` module provides support for regular expressions:

    import re

    # Search for a pattern
    text = "The quick brown fox jumps over the lazy dog."
    match = re.search(r"brown \w+", text)
    print(match.group())  # "brown fox"

    # Find all occurrences
    matches = re.findall(r"\b\w{3}\b", text)
    print(matches)  # ['The', 'fox', 'the', 'dog']

    # Replace text
    new_text = re.sub(r"fox", "cat", text)
    print(new_text)  # "The quick brown cat jumps over the lazy dog."

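
Beyond `search`, `findall`, and `sub`, the `re` module also offers `re.match` and `re.fullmatch`, which differ in where the pattern is allowed to match. The sketch below contrasts them and shows a group reference in a `re.sub` replacement; the variable names are chosen for this example only.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# re.match only tries at the start of the string; re.search scans forward.
at_start = re.match(r"quick", text)
anywhere = re.search(r"quick", text)
print(at_start)          # None
print(anywhere.start())  # 4

# re.fullmatch succeeds only if the entire string fits the pattern.
whole = re.fullmatch(r"\d{4}", "2025")
partial = re.fullmatch(r"\d{4}", "2025-01")
print(whole is not None, partial is not None)  # True False

# In the replacement string, \1 and \2 refer to the captured groups.
swapped = re.sub(r"(quick) (brown)", r"\2 \1", text)
print(swapped)  # The brown quick fox jumps over the lazy dog.
```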
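Performance Considerations
- Be careful with broad greedy quantifiers such as `.*`: they consume as much text as possible and then backtrack, which can over-match and slow matching on long inputs. A non-greedy quantifier (e.g., `.*?`) or a more specific character class often stops the match sooner.
- Pre-compile regex patterns for repeated use, e.g. `pattern = re.compile(r"\d{3}-\d{4}")`.
- Use specific patterns instead of generic ones (e.g., `\d` instead of `.`).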
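
The difference between greedy, non-greedy, and specific patterns is easiest to see on a concrete input. This is a minimal sketch with a made-up HTML fragment; `HREF` is a hypothetical name for the pre-compiled pattern.

```python
import re

html = '<a href="one.html">1</a> <a href="two.html">2</a>'  # made-up input

# Greedy .* runs to the last quote on the line, swallowing both links.
greedy = re.findall(r'href="(.*)"', html)
print(greedy)  # ['one.html">1</a> <a href="two.html']

# Non-greedy .*? stops at the first closing quote.
lazy = re.findall(r'href="(.*?)"', html)
print(lazy)    # ['one.html', 'two.html']

# A negated class states the intent directly: anything up to the next quote.
# Compiling once lets the same pattern object be reused across calls.
HREF = re.compile(r'href="([^"]*)"')
print(HREF.findall(html))  # ['one.html', 'two.html']
```

Whether any of these is measurably faster depends on the input, so it is worth timing candidate patterns on representative data before optimizing.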
Practical Examples
1. Extracting Hashtags

    import re

    text = "Love #regex! It's #awesome for #text processing."
    hashtags = re.findall(r"#\w+", text)
    print(hashtags)  # ['#regex', '#awesome', '#text']

2. Parsing Log Files

    import re

    log_entry = '127.0.0.1 - james [01/Jan/2025:12:34:56 +0000] "GET /index.html" 200 1234'
    # \[ and \] match the literal brackets around the timestamp
    pattern = r'(\S+) - (\S+) \[(.*?)\] "(\S+ \S+)" (\d+) (\d+)'
    match = re.search(pattern, log_entry)
    if match:
        ip, user, date, request, status, size = match.groups()
        print(f"IP: {ip}, User: {user}, Request: {request}")

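
When a log pattern has this many groups, named groups can make the result easier to work with. The following is one possible sketch, not a complete log parser: `LOG_LINE` and `parse_log_line` are hypothetical names, and the field layout is assumed to be exactly that of the sample entry.

```python
import re

# Same structure as the pattern above, with each group given a name.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<date>.*?)\] '
    r'"(?P<request>\S+ \S+)" (?P<status>\d+) (?P<size>\d+)'
)

def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])  # convert numeric fields
    fields["size"] = int(fields["size"])
    return fields

entry = '127.0.0.1 - james [01/Jan/2025:12:34:56 +0000] "GET /index.html" 200 1234'
parsed = parse_log_line(entry)
print(parsed["ip"], parsed["status"])    # 127.0.0.1 200
print(parse_log_line("not a log line"))  # None
```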
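3. Validating Phone Numbers

    import re

    # Optional country code (+, 1-3 digits, optional separator), then exactly 10 digits
    phone_pattern = r'^(\+\d{1,3}[- ]?)?\d{10}$'
    print(re.match(phone_pattern, "+1-1234567890"))  # Match object: valid
    print(re.match(phone_pattern, "12345"))          # None: invalid
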
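
For validation it can be convenient to wrap the check in a function that returns a boolean. The sketch below uses `re.fullmatch` instead of `^...$`; `PHONE` and `is_valid_phone` are hypothetical names introduced for this example.

```python
import re

# Optional country code, then exactly ten digits. No ^ or $ is needed
# because fullmatch requires the whole string to match.
PHONE = re.compile(r"(\+\d{1,3}[- ]?)?\d{10}")

def is_valid_phone(number):
    """True if the whole string is an optional country code plus ten digits."""
    return PHONE.fullmatch(number) is not None

print(is_valid_phone("+1-1234567890"))  # True
print(is_valid_phone("1234567890"))     # True
print(is_valid_phone("12345"))          # False
print(is_valid_phone("1234567890\n"))   # False
```

The last case is the reason for choosing `fullmatch` here: in Python, `$` also matches just before a trailing newline, so a `^...$` pattern used with `re.match` would accept that input.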
Debugging and Testing
- Use online tools like Regex101 to test and debug regex patterns.
- Break complex patterns into smaller, manageable parts.
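
One way to break a pattern into manageable parts inside Python itself is the standard `re.VERBOSE` flag, which ignores whitespace in the pattern and allows inline comments. The sketch below applies it to the DD-MM-YYYY date pattern; the group names and sample text are assumptions for illustration.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a comment,
# so each piece can be documented on its own line.
DATE = re.compile(
    r"""
    (?P<day>\d{2})    # two-digit day
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<year>\d{4})   # four-digit year
    """,
    re.VERBOSE,
)

match = DATE.search("Invoice dated 25-12-2024")
print(match.groupdict())  # {'day': '25', 'month': '12', 'year': '2024'}
```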
Regex Cheat Sheet
Credit: https://i.imgur.com/OQStwMn.png
Conclusion
Regular expressions are a versatile and powerful tool for text processing. By mastering the syntax and applying best practices, you can efficiently solve a wide range of string manipulation tasks. Start with simple patterns, gradually build complexity, and always test your regex against real-world data.
Next Steps:
- Practice with real-world datasets.
- Explore regex in other programming languages (e.g., JavaScript, Perl).
- Learn advanced techniques like recursive patterns and conditional matching.