Regular Expressions in Python


Mastering Regular Expressions in Python

I. Introduction

Most of us regularly search for keywords in our favourite articles and use the "find and replace" feature of a text editor to correct mistakes. Features like these are commonly powered by regular expressions, which are the topic of this article. A regular expression (RegEx) is a sequence of characters that specifies a search pattern for strings. Regular expressions are central to text processing in editors and to validating user input. In this article, we discuss the fundamentals of regular expressions and practise them using the Python ‘re’ module.

II. Understanding Regular Expressions

A regular expression can be defined as a sequence of characters that forms a pattern to be searched for in strings. The characters in a search pattern can be digits, letters, or non-alphanumeric characters. We may also want to match characters at certain positions, such as the start or end of a string, or at word boundaries. There may also be conditions on how many times a character in the pattern is repeated. All such conditions help us define our regular expressions.

Basic syntax and components of regex patterns:

import re
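# ^[cd]: starts with c or d; o{2}: exactly two o's; .: any one character; ing: literal; \d: one digit
print(re.search(r"^[cd]o{2}.ing\d", "cooding2"))  # matches the whole of "cooding2"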

Regular expressions comprise the following components:
1) Literals: The characters which match themselves in the string are known as Literals. Suppose we want to match “coding”, then our pattern could simply be “coding”. Here all the characters are literals, with no special meaning attached to any one of them.

2) Metacharacters: Some characters have special meanings in a regular expression. They are known as Metacharacters.
Examples:

  • . "period" is a special character which represents any character except a newline. Pattern .at matches “cat”, “bat”, “mat” etc.
  • * "asterisk" represents zero or more occurrences of the character preceding it. Pattern ra*t matches “rt”, “rat”, “raat”, “raaat”
  • + "plus" represents one or more occurrences of the preceding element. Pattern ra+t matches “rat”, “raat”, “raaat” but not “rt”
  • ? "question mark" represents zero or one occurrence of the preceding element. Pattern ra?t matches “rat” and “rt” but not “raat”
  • ^ "caret" represents the start of the string. So, the pattern “^a” will match the a in “air” but not in “fair”
  • $ "dollar sign" represents the end of the string. So, the pattern d$ will match the d in “sad” but not in “fade”.
  • \ "backslash" helps us escape a metacharacter. Suppose we want to search for the price of an item, written as 200$. In regular expressions, $ is a metacharacter, so the pattern “200$” searches for 200 at the end of the string. To treat $ as a literal, we escape it with a “\” symbol, which gives the regular expression “200\$”.
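
The metacharacters listed above can be tried out with a short sketch. The sample strings are made up for illustration, and re.fullmatch is used for the first four patterns so that each candidate has to match the pattern in its entirety.

```python
import re

# Hypothetical candidates for each pattern; fullmatch requires the whole string to match.
checks = {
    r".at": ["cat", "bat", "at"],
    r"ra*t": ["rt", "rat", "raaat"],
    r"ra+t": ["rt", "rat", "raat"],
    r"ra?t": ["rt", "rat", "raat"],
}
results = {
    pattern: [re.fullmatch(pattern, s) is not None for s in candidates]
    for pattern, candidates in checks.items()
}
for pattern, outcome in results.items():
    print(pattern, outcome)

# Anchors are checked with re.search, which scans the whole string.
starts_with_a = [re.search(r"^a", s) is not None for s in ("air", "fair")]
ends_with_d = [re.search(r"d$", s) is not None for s in ("sad", "fade")]

# Escaping $ makes it a literal character instead of an end-of-string anchor.
price = re.search(r"200\$", "The item costs 200$ today")
print(starts_with_a, ends_with_d, price.group())
```

Note that “at” fails against .at because the period must consume exactly one character.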

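Special sequences: The symbol \ in combination with certain characters indicates a special sequence.
\d matches a digit (0-9).
\w matches a word character (0-9, a-z, A-Z, and underscore “_”). For str patterns, \d and \w also match digits and letters from other scripts unless the re.ASCII flag is set.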
\s matches any whitespace character (space, tab, newline).
\b matches a word boundary (position between a word character and a non-word character).
\D matches any character that is not a digit.
\W matches any character that is not a word character.
\S matches any character that is not a whitespace character.
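
As a rough illustration of these special sequences, the sketch below runs a few of them against a made-up sample string (the text and variable names are assumptions chosen for the example).

```python
import re

# Hypothetical sample text containing digits, an underscore, a tab and punctuation.
sample = "Order 66 shipped_on 2024-11-04\tOK"

digits = re.findall(r"\d+", sample)    # runs of digits
words = re.findall(r"\w+", sample)     # runs of word characters (underscore included)
chunks = re.findall(r"\S+", sample)    # runs of non-whitespace characters
only_digits = re.sub(r"\D", "", "2024-11-04")       # drop every non-digit
whole_word = re.findall(r"\bOK\b", "OK, not OKAY")  # \b rejects the OK inside OKAY

print(digits, words, chunks, only_digits, whole_word, sep="\n")
```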

3) Character Classes: Suppose we want the character in our search pattern to be any one of a set of characters. Then we may use a character class. For example, we may want our word to be “bat” or “mat” then our regular expression will look like: [bm]at. Here [] will represent the character class, which includes the characters b and m and any one of these will match our pattern.

4) Quantifiers: Suppose we want to specify that our pattern must contain exactly "n" consecutive occurrences of a sub-pattern or character. We may use quantifiers in such cases. The quantity is specified inside curly braces {}. For example, to match exactly 3 consecutive occurrences of “a”, the pattern would be a{3}.
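
A small sketch of character classes and quantifiers working together follows. The input strings are invented for the example; the {m,n} range form in the last pattern is standard re syntax, although only the exact {n} form is described above.

```python
import re

# Character class: b or m, followed by "at".
class_hits = re.findall(r"[bm]at", "bat mat cat")

# Exact quantifier: three consecutive a's, no more and no fewer.
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None
too_few = re.fullmatch(r"a{3}", "aa") is not None

# Range quantifier {m,n}: whole numbers that are 2 to 4 digits long.
ranged = re.findall(r"\b\d{2,4}\b", "7 42 123 2024 99999")

print(class_hits, exactly_three, too_few, ranged)
```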

5) Anchors: These specify the position at which a pattern must match. We have already met the following anchors:

  • ^ start of string
  • $ end of string
  • \b word boundary

III. Getting Started with re Module

Python offers a module by the name of ‘re’, which provides regular expression matching operations. Both Unicode strings (str) as well as 8-bit strings (bytes) can be searched using this module.

Importing the re module:

The Python ‘re’ module is part of the Python Standard Library, where it is listed under “Text Processing Services”. So, we do not need to install it separately with pip like third-party packages. We can simply import the module and use it.

import re 

Overview of key functions and methods in the re module:

The complete documentation of the re module is available in the official Python documentation. Here we discuss its commonly used functions and methods.
re.search: It searches a string for the presence of a pattern. Only the first match is returned.
re.findall: It finds all non-overlapping occurrences of a pattern in a string and returns them as a list.
re.finditer: It returns an iterator yielding a Match object for each match.
re.compile: It compiles a pattern into a Pattern object. Reusing a Pattern object can be more efficient when the same regular expression is used again and again.
re.match: It matches a regular expression pattern only at the beginning of a string.
Let us see a demonstration of these functions.

import re

my_text = "do you code daily, code daily to become an expert. I do code a lot"

print(re.search(r'code', my_text)) # only the first occurrence is matched

print(re.findall(r'code', my_text)) # all 3 occurrences are matched

m = re.finditer(r'code', my_text) # an iterator object is created
print(m)
for individual_match in m: # loop over the iterator object to see all matches
    print(individual_match)

my_pattern = re.compile(r'do')
my_pattern1 = re.compile(r'you')
print(my_pattern.match(my_text)) # 'do' is at the beginning of the string, so it matches
print(my_pattern1.match(my_text)) # 'you' is not at the beginning, so this prints None

Regular Expression Objects:
Pattern object: It is a compiled regular expression object returned by re.compile(). It offers methods like Pattern.match and Pattern.search.
Match object: It is returned by successful matches and searches. It offers methods like Match.group.
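
To make the Match object a little more concrete, here is a minimal sketch using a made-up sentence. It shows Match.group alongside Match.span, Match.start and Match.end, and the None returned by an unsuccessful search.

```python
import re

sentence = "do you code daily"  # hypothetical input

m = re.search(r"code", sentence)
if m:
    print(m.group())           # the matched text
    print(m.span())            # (start, end) indices of the match
    print(m.start(), m.end())  # the same indices, individually

# An unsuccessful search returns None rather than a Match object.
missing = re.search(r"java", sentence)
print(missing)
```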

Note: Use Python's raw string notation r"your_pattern" for regular expressions to avoid unexpected behaviour. Raw strings let you write backslashes \ without escaping them. In a regular string, a backslash starts an escape sequence, such as \n for a new line; within a raw string, \ retains its literal value.
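
The effect of the raw string prefix can be seen in a short sketch. The sample text is an assumption; the point is that "\b" in a regular string is the backspace character, so the word-boundary pattern silently stops working without the r prefix.

```python
import re

# "\n" is one character (a newline); r"\n" is two characters (backslash and n).
lengths = (len("\n"), len(r"\n"))

text = "a cat sat"  # hypothetical input
without_raw = re.findall("\bcat\b", text)  # "\b" is a backspace here, so nothing matches
with_raw = re.findall(r"\bcat\b", text)    # r"\b" reaches re as a word boundary

print(lengths, without_raw, with_raw)
```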

IV. Basic Regex Patterns and Operations

Let us now put into practice some of the concepts we have learnt so far.

import re

my_text = """when I start writing, I understand that when
cat and bat are two different words / 1 and 9 
are different digits in maths!"""

# Matching literal characters
print(re.findall(r'cat', my_text)) # ['cat']

# Matching character classes without range
print(re.findall(r'[cb]at', my_text)) # ['cat', 'bat']

# Matching character classes with range
print(re.findall(r'[b-m]at', my_text)) # ['hat', 'cat', 'bat', 'mat']

# Character class inversion using caret ^ inside square brackets
print(re.findall(r'[^\w\s]', my_text)) # [',', '/', '!']

# Quantifiers for matching repetitions
print(re.findall(r'f{2}', my_text)) # ['ff', 'ff']

# Anchors for boundary matching
print(re.findall(r'\b[b-m]at\b', my_text)) # word boundary: ['cat', 'bat']
print(re.findall(r'^when', my_text)) # string boundary: ['when']

A. Matching literal characters:

Here we start by trying to find the simple literal keyword “cat” in a string called “my_text”.

B. Using character classes and ranges:

• Next, we try to match either cat or bat by using character class where we enclose c and b inside the square bracket [cb]. That means if either c or b is present it will be a match.
• Taking this a step further, we specify a range of characters: [b-m] matches any single character from b to m, inclusive.
• We can even match any character except those listed in the character class by using inversion, with a caret ^ as the first character inside the square brackets. Here [^\w\s] matches every character that is neither a word character nor whitespace.

C. Quantifiers for matching repetitions:

We can use quantifiers, which specify inside curly brackets {} the exact number of times a character must repeat. Here we search for “f” repeated twice.

D. Anchors for boundary matching:

Here we search for the pattern with word boundary specified with \b . Note that now “mat” in “maths” and “hat” in “that” do not match as “mat” will not have a word boundary after it and “hat” will not have a word boundary before it.
Similarly, ^ outside of a character class represents a string beginning, so only one “when” at the beginning of the string will be matched.

V. Advanced Regex Patterns

A. Grouping and capturing in regular expressions:

When we find patterns using regular expressions, we can group a set of characters by enclosing them in parentheses (). These groups can then be accessed and processed separately. We may specify a group as non-capturing using the notation (?:pattern). We can also name a group using the notation (?P<name>pattern) and then access it by that name. Let’s look at an example:

Suppose we have a key in our data which looks like “CODE-NAME-DEPT-SALARY-EXPERIENCE”. We can create a regular expression to match this pattern. If we want to process the salary and the experience further, to find the average salary and average experience level, we create capturing groups for them. The name and the department can be specified as non-capturing groups, since we do not need them.

import re
salary_list = []
experience_list = []
# CODE-NAME-DEPT-SALARY-EXPERIENCE is the key
my_data = "CODE-Sam-SALES-2300-2 CODE-Tina-TESTING-2600-4 666 CODE-Divya-MARKETING-10000-12 hello"
# pattern1 has 2 non-capturing groups and 2 named groups
pattern1 = r'CODE-(?:\w+)-(?:\w+)-(?P<salary>\d+)-(?P<experience>\d+)'
# pattern2 captures all four fields by position; it is shown for comparison and not used below
pattern2 = r'CODE-(\w+)-(\w+)-(\d+)-(\d+)'
m1 = re.finditer(pattern1, my_data)
m2 = re.finditer(pattern2, my_data)
for m in m1:
    print(f"group 1 is for salary: {m.group(1)}")
    print(f"group 2 is for experience: {m.group(2)}")
    salary_list.append(int(m.group('salary')))
    experience_list.append(int(m.group('experience')))
print(f"Average salary is : {sum(salary_list)/len(salary_list)}")
print(f"Average experience level is : {sum(experience_list)/len(experience_list)}")
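
To compare the two styles of grouping on a single record, the following sketch uses Match.groupdict and Match.groups. The record string is taken from the data above; the variable names are chosen only for this example.

```python
import re

record = "CODE-Sam-SALES-2300-2"

named = re.search(
    r"CODE-(?:\w+)-(?:\w+)-(?P<salary>\d+)-(?P<experience>\d+)", record
)
positional = re.search(r"CODE-(\w+)-(\w+)-(\d+)-(\d+)", record)

# Named groups come back as a dictionary; non-capturing groups are left out.
named_fields = named.groupdict()

# With four plain capturing groups, the salary is group 3 instead of group 1.
positional_fields = positional.groups()

print(named_fields)
print(positional_fields)
```

Because non-capturing groups are not numbered, the named pattern keeps salary and experience as groups 1 and 2, which is why m.group(1) works in the example above.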

B. Alternation and conditional matching:

Regular expressions can also be used to match alternative patterns. Depending on which alternative was captured, we can then conditionally match further patterns. The | operator is used for alternation, and the (?(id/name)yes|no) syntax is used for conditional matching: it tries the “yes” pattern if the group with the given id or name matched, and the “no” pattern otherwise. Suppose we want to capture the word for “one” in two different languages: if the text is English we capture “one”, and if it is French we capture “un”.

import re
my_text = "ENGLISH:one"
my_text1 = "FRENCH:un"
my_text2 = "FRENCH:one"
pattern = r'((ENGLISH)|(FRENCH)):(?(2)(one)|(un))'
m = re.search(pattern, my_text)
m1 = re.search(pattern, my_text1)
m2 = re.search(pattern, my_text2)
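print(m) # matches 'ENGLISH:one'
print(m1) # matches 'FRENCH:un'
print(m2) # None: group 2 (ENGLISH) did not match, so 'un' was required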

C. Lookahead and lookbehind assertions:

We can check whether the pattern we are searching for is, or is not, preceded or followed by a certain expression. Let the pattern to be searched be called X, and the expression to be checked before or after it be called Y. The expression Y is not included in the result. The syntax is:
Positive Lookahead : X(?=Y)
Negative Lookahead: X(?!Y)
Positive Lookbehind : (?<=Y)X
Negative Lookbehind: (?<!Y)X
Let us look at an example:

import re
my_text = "onehello1 zerohello2 hello3one hello4zero"
pos_lookahead = r'(hello\d)(?=one)' # r'hello\d' followed by one
neg_lookahead = r'(hello\d)(?!one)' # r'hello\d' not followed by one
pos_lookbehind = r'(?<=one)(hello\d)' # r'hello\d' preceded by one
neg_lookbehind = r'(?<!one)(hello\d)' # r'hello\d' not preceded by one
print(re.findall(pos_lookahead, my_text)) # ['hello3']
print(re.findall(neg_lookahead, my_text)) # ['hello1', 'hello2', 'hello4']
print(re.findall(pos_lookbehind, my_text)) # ['hello1']
print(re.findall(neg_lookbehind, my_text)) # ['hello2', 'hello3', 'hello4']

D. Backreferences:

Backreferences in a pattern allow us to specify that the contents found in an earlier capturing group must also be found at the current location in the string. Let us suppose we want to find the words which have the same starting and ending letter. We could then use backreference to capture such words.
Previously captured groups can be referred to by using a backslash followed by the group number (e.g., \1, \2, \3, etc.).

import re
my_text = "In this harsh world , when the going gets tough the tough gets going"
pattern = r'\b(\w)\w*\1\b' # will match words which begin and end in the same letter
all_matches = re.finditer(pattern, my_text) # here: harsh, going, going
for match in all_matches:
    print(match)
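
Another common use of backreferences is spotting a word that has been typed twice in a row. The sketch below is one possible way to do this; the sentence is invented for the example.

```python
import re

draft = "this is is a test test case"  # hypothetical text with repeated words

# (\w+) captures a word; \s+\1 requires the same word again after whitespace.
repeated = re.findall(r"\b(\w+)\s+\1\b", draft)
print(repeated)

# The backreference can also be used in the replacement to keep one copy.
fixed = re.sub(r"\b(\w+)\s+\1\b", r"\1", draft)
print(fixed)
```

The leading \b stops the “is” inside “this” from being treated as the first half of a repeated pair.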

VI. Practical Examples and Use Cases

A. Validating input data

Suppose we wish to validate a user input form that contains a PIN code. We can create a regular expression and use it to validate that particular input. Let us validate an Indian PIN code, keeping the validation rules simple:
A PIN code is exactly six digits long.
The first digit is 1-9, and all other digits can be 0-9. (This check is only structural, as some codes that pass it might not exist.)
The regular expression pattern will be “^[1-9]\d{5}$”, where the anchors ^ and $ ensure that the whole input is matched.
Now we can validate the PIN code entered by the user.

import re
entered_pincode = input("please input PINCODE\n")
pattern = r'^[1-9]\d{5}$'
if re.match(pattern, entered_pincode):
    print("valid PINCODE")
else: 
    print("Invalid PINCODE")
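
For reuse and testing, the same check could be wrapped in a function. The sketch below is one possible variant, not the only way to do it: the function name is_valid_pincode is hypothetical, and it uses re.fullmatch, which requires the entire string to match and so needs no anchors.

```python
import re

# Compiled once, since the same pattern is reused for every input.
PIN_PATTERN = re.compile(r"[1-9]\d{5}")

def is_valid_pincode(value: str) -> bool:
    """Return True if value is six digits long and does not start with 0."""
    return PIN_PATTERN.fullmatch(value) is not None

for candidate in ["110001", "011000", "1100011", "11000a", "110001\n"]:
    print(repr(candidate), is_valid_pincode(candidate))
```

One difference worth knowing: $ also matches just before a trailing newline, so re.match with “^[1-9]\d{5}$” would accept "110001\n", whereas fullmatch rejects it. This does not matter for input(), which strips the newline, but it can matter for data read from files.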

B. Extracting information from text (e.g., parsing log files)

Regular expressions can be used to extract information from text, such as log files.
Suppose a log file contains information about users who have logged into a system. Let us extract the name of the user who logged in, as well as the date.

import re

my_login_info = "INFO:UserA logged into system on date: 04-11-2024"
pattern = r'INFO:(\b.+\b) logged into system on date: (\b\d{2}-\d{2}-\d{4}\b)'
match = re.search(pattern, my_login_info)
print(match)
if match:
    print(f"The following user logged into the system: {match.group(1)}")
    print(f"Date of Login: {match.group(2)}")

The extracted information can then be used for further interpretation and analysis.
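
A real log usually has many lines, only some of which are login records. As a rough sketch under that assumption, the following loops over a few made-up lines, uses named groups for readability, and skips lines that do not match. The log format is assumed to be the same as in the example above.

```python
import re

LOG_PATTERN = re.compile(
    r"INFO:(?P<user>\w+) logged into system on date: (?P<date>\d{2}-\d{2}-\d{4})"
)

# Hypothetical log lines; the second one is not a login record.
log_lines = [
    "INFO:UserA logged into system on date: 04-11-2024",
    "WARNING: disk space low",
    "INFO:UserB logged into system on date: 05-11-2024",
]

logins = []
for line in log_lines:
    match = LOG_PATTERN.search(line)
    if match:  # lines that do not match are skipped
        logins.append((match.group("user"), match.group("date")))

print(logins)
```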

C. Search and replace operations in text processing tasks

Suppose we used an incorrect word in a paragraph. We could find every occurrence of that word and replace it with the correct one.
Let’s say we used weak instead of week and our word processor could not spot this mistake, since it cannot understand the context. Let’s correct this mistake with a regular expression.

import re

my_paragraph = "Decorators in weak 32.What do you suppose I should be studying in weak 33? Any tweak?"
misspelled_word = r'\bweak\b'
my_paragraph = re.sub(misspelled_word, "week", my_paragraph)
print(my_paragraph)


Note that here tweak was not changed, as we had used word boundaries in our regular expression pattern.
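
If we also want to know how many replacements were made, re.subn returns the new string together with a count. A minimal sketch, reusing the paragraph from the example above:

```python
import re

paragraph = "Decorators in weak 32.What do you suppose I should be studying in weak 33? Any tweak?"

# subn returns a tuple: (new_string, number_of_substitutions_made)
corrected, count = re.subn(r"\bweak\b", "week", paragraph)

print(corrected)
print(f"{count} replacements made")
```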

D. Filtering and cleaning textual data using regular expressions

We often need to filter and clean our data before further processing can be performed on it.
For example, suppose we want to find the most frequently used words in a book. The first step would be to clean the text of the book by removing all special characters and replacing runs of whitespace, such as new line characters “\n” and tabs “\t”, with a single space. This can be done using regular expressions.
Let’s see an example.

import re

my_book = "This happens    to be a very **small** book written in <English>, by a renowned author!!!"
my_book = re.sub(r"[^\w\s]", "", my_book)
print(my_book)
my_book = re.sub(r"\s+", " ", my_book)
print(my_book)
my_word_list = my_book.split(" ")
print(my_word_list)


So, here we have created a list of words in the book after cleaning the text of the book.
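
The word-frequency step mentioned earlier could then be sketched with collections.Counter from the standard library. The short text below stands in for a book and is invented for the example; lowercasing and stripping are extra assumptions so that “The” and “the” are counted together and no empty strings appear in the list.

```python
import re
from collections import Counter

# Hypothetical stand-in for the text of a book.
book = "The cat sat. The cat ran!\nThe dog sat, too."

cleaned = re.sub(r"[^\w\s]", "", book)                  # remove special characters
cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()  # collapse whitespace, normalise case

counts = Counter(cleaned.split(" "))
print(counts.most_common(1))
```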

VII. Tips and Best Practices

  • Compile your regex patterns using re.compile if the same pattern is to be used multiple times.
  • Use character classes whenever possible; for example, [cb]at is simpler than the alternation cat|bat.
  • Use non-capturing groups for groups that you do not need to process further.
  • Do not forget to escape special characters if you want to match them as literals.
  • Use flags like re.IGNORECASE if you want your search to be case-insensitive.
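
A few of these tips can be combined in a short sketch. The sample strings are made up; re.escape is a standard re function, not mentioned above, that escapes the special characters in a string so it can be matched literally.

```python
import re

# Compile once, reuse many times; IGNORECASE makes the search case-insensitive.
word = re.compile(r"\bpython\b", re.IGNORECASE)
samples = ["Python is fun", "I like PYTHON", "pythonic code"]  # hypothetical inputs
hits = [word.search(s) is not None for s in samples]
print(hits)

# re.escape builds a pattern that matches the given text literally.
literal = re.escape("3.14$")
matches_exact = re.search(literal, "pi is roughly 3.14$") is not None
matches_lookalike = re.search(literal, "3x14") is not None
print(matches_exact, matches_lookalike)
```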

VIII. Conclusion

Knowledge of regular expressions is indispensable for text processing, data manipulation, and pattern matching tasks in real-world applications. The Python re module offers a wide range of functionality for working with regular expressions. I hope this tutorial provides you with enough information to make you curious enough to explore the power of regular expressions in depth. I would encourage you to read the official documentation of the re module and explore more functions, methods and flags related to regular expressions. Only practice can help you master regular expressions!
