
Hi! Let’s use Python regex to match groups in a body of text. Let’s go! 🔥✨⚡
The term "regex" is short for regular expression. A regular expression is a sequence of characters that defines a search pattern in a body of text.
In Python, the re module lets us match regular expression patterns and perform other regular expression operations. In this tutorial we will be dealing specifically with the group() method of match objects.
The group() method returns one or more subgroups of a match. Let’s write some code:
import re
s = "Email address is test@xyz.com"
match = re.search(r"([\w]+)@([\w]+)", s)
if match:
    print(match.group())
    print(match.group(1))
    print(match.group(2))
Let us explain what is happening here:
- We import the re module.
- We define a string s. This is the body of text against which we will match the pattern defined in our regular expression.
- The variable match is a match object, returned by re.search(). The re.search() function scans through a string looking for the first location where the regular expression pattern matches, and returns None if there is no such location. We supply two arguments: the pattern and the string.
- The pattern is r"([\w]+)@([\w]+)", which we hope will match any standard email address:
  - The r"" prefix is raw string notation. Python leaves backslashes in a raw string alone, so \w reaches the regex engine unchanged.
  - Parentheses (…) match the regular expression inside them and mark the start and end of a group. Each group is defined inside its own pair of parentheses.
  - Square brackets [] indicate a set of characters we are looking to match. The set can contain whole character classes or individual characters.
  - The special sequence \w matches any word character: letters and digits in any language, plus the underscore. This is the only character class we are looking to match in this case, so it is all we put inside the square brackets.
  - The + (plus sign) matches one or more repetitions of the preceding element, here the set in square brackets.
  - The character @ is not enclosed in parentheses, so it is not part of either group, but the pattern only matches if a literal "@" sits between the two groups.
  - We have two pairs of parentheses, each with an identical set of characters to match, which means we will have 2 groups in the final match object.
  - The entire pattern is supposed to match most standard email addresses.
- The string is s, and this is what we run our pattern against. We get a match if our pattern is found in s.
- If a match is found, the match object holds the entire matching substring as well as the groups within it:
  - match.group() called without an argument returns the entire matching substring (the same as match.group(0)), not the individual groups. In our pattern we defined two groups, each enclosed in parentheses, and we reach those by number.
  - match.group(1) returns the first matching subgroup as a string.
  - match.group(2) returns the second matching subgroup as a string.
- In this case, we expect to see 3 lines printed.
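To make the different ways of calling group() concrete, here is a minimal sketch. It assumes the sample address test@xyz.com, and the variable names (whole, first, both, all_groups) are only illustrative. It also shows groups(), a separate method that returns every captured group at once.

```python
import re

s = "Email address is test@xyz.com"  # assumed sample address
match = re.search(r"(\w+)@(\w+)", s)

whole = match.group()        # the entire match, same as match.group(0)
first = match.group(1)       # text captured by the first pair of parentheses
both = match.group(1, 2)     # several group numbers at once give a tuple
all_groups = match.groups()  # tuple of every captured group

print(whole)       # test@xyz
print(first)       # test
print(both)        # ('test', 'xyz')
print(all_groups)  # ('test', 'xyz')
```

Note that group() with no argument never returns the list of groups; use groups() or pass the group numbers when you want those.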
When the above code executes, we will get the following output:
#output
#test@xyz
#test
#xyz
As you can see, this is close but not exactly the result we expect. We get matching substrings, but the .com part of the email address is missing, because \w does not match a period and the match stops there. Fortunately this is a simple fix: all we have to do is allow the . (period) in the sets of characters our pattern matches. So our final code will be:
import re
s = "Email address is test@xyz.com"
#we changed the below statement
match = re.search(r"([\w.-]+)@([\w.-]+)", s)
if match:
    print(match.group())
    print(match.group(1))
    print(match.group(2))
#output
#test@xyz.com
#test
#xyz.com
Good! Now our groups cover the entire email address. Every email address has two components, the username and the domain. match.group(1) matched the username and match.group(2) matched the domain name, as expected. All we did was add the period to the set of characters each group is allowed to match.
Email addresses sometimes contain hyphens (-) as well, and periods can appear on both sides of the @. That is why the set is [\w.-]: it accepts word characters, periods and hyphens in both the username and the domain.
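A quick way to see what the extra characters in the set buy us is to compare the two sets on the same text. This is only a sketch; the sample text abc-xyz.com and the variable names are chosen for illustration.

```python
import re

text = "abc-xyz.com"  # illustrative sample text

# \w alone stops at the first period or hyphen, so the text is split up.
words_only = re.findall(r"\w+", text)

# Inside [], '.' is a literal period, and a '-' placed last is a literal hyphen.
with_punct = re.findall(r"[\w.-]+", text)

print(words_only)  # ['abc', 'xyz', 'com']
print(with_punct)  # ['abc-xyz.com']
```

Keeping the hyphen at the end of the set matters: between two characters it would be read as a range, as in [a-z].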
So let’s rewrite our code to handle several possibilities:
import re
adds = ["Email address is test@xyz.com", "Email address is test.me@xyz.com", "Email address is test.me@abc-xyz.com"]
for s in adds:
    match = re.search(r"([\w.-]+)@([\w.-]+)", s)
    if match:
        print(match.group())
        print(match.group(1))
        print(match.group(2))
#output
#test@xyz.com
#test
#xyz.com
#test.me@xyz.com
#test.me
#xyz.com
#test.me@abc-xyz.com
#test.me
#abc-xyz.com
See how we get all of the matching groups we expect? All we did was add a loop that searches for the regex pattern in every string in our list. Awesome!
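If you find yourself repeating the search-then-check steps, one option is to wrap them in a small helper. The sketch below is just one way to do it: split_email and EMAIL_PATTERN are hypothetical names, not part of the re module, and the helper returns None when re finds no match, which is the case the "if match:" check guards against.

```python
import re

# Hypothetical helper names; the pattern is the one used in this tutorial.
EMAIL_PATTERN = re.compile(r"([\w.-]+)@([\w.-]+)")

def split_email(text):
    """Return (username, domain) for the first address in text, or None."""
    match = EMAIL_PATTERN.search(text)
    if match is None:
        return None        # no address found, so there are no groups to read
    return match.groups()  # tuple holding group 1 and group 2

found = split_email("Email address is test.me@abc-xyz.com")
missing = split_email("No address here")

print(found)    # ('test.me', 'abc-xyz.com')
print(missing)  # None
```

Keep in mind that this pattern is deliberately simple. It would also capture a period that ends a sentence right after an address, so treat it as a learning example rather than a full email validator.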
Thanks for reading. The full match group documentation is in the official Python docs for the re module. Good luck and happy coding. 👌👌👌