Introduction to Regular Expressions in Python

In this tutorial we are going to learn about using regular expressions in Python, including their syntax, and how to construct them using built-in Python modules. To do this we’ll cover the different operations in Python’s re module, and how to use it in your Python applications.

What are Regular Expressions?

A regular expression is a sequence of characters that defines a search pattern for finding text. Python, like many other languages, ships with a regular expression engine, which is made available through the re module.

To use regular expressions (or “regex” for short) you usually specify the rules for the set of possible strings that you want to match and then ask yourself questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”.
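
The two questions above map onto different calls in the re module. The following is a minimal sketch, assuming a hypothetical four-digit pattern chosen only for illustration: re.fullmatch() asks whether the whole string fits the pattern, while re.search() asks whether the pattern occurs anywhere in it.

```python
import re

pattern = r"\d{4}"  # hypothetical pattern: exactly four digits

# "Does this string match the pattern?" -> the whole string must fit
whole_yes = re.fullmatch(pattern, "2024") is not None
whole_no = re.fullmatch(pattern, "year 2024") is not None

# "Is there a match for the pattern anywhere in this string?"
anywhere = re.search(pattern, "year 2024") is not None

print(whole_yes, whole_no, anywhere)  # True False True
```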

You can also use regexes to modify a string or to split it apart in various ways. These “higher order” operations all start by first matching text with the regex string, and then the string can be manipulated (like being split) once the match is found. All this is made possible by the re module available in Python, which we’ll look at further in some later sections.

Regular Expression Syntax

A regular expression specifies a pattern that aims to match the input string. In this section we’ll show some of the special characters and patterns you can use to match strings.

Matching Characters

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like ‘A’, ‘a’, or ‘0’, are the simplest regular expressions; they simply match themselves. There are also special characters (metacharacters) which do not match themselves: ., ^, $, *, +, ?, {, }, [, ], \, |, (, and ). They are used for higher-order matching functionality, and some of them are described further in this table:

Metacharacter Description
* Matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc. [xyz]* matches “”, “x”, “y”, “z”, “zx”, “zyx”, “xyzzy”, and so on. (ab)* matches “”, “ab”, “abab”, “ababab”, and so on.
+ Matches the preceding element one or more times. For example, ab+c matches “abc”, “abbc”, “abbbc”, and so on, but not “ac”.
? Matches the preceding element zero or one time. For example, ab?c matches only “ac” or “abc”.
| The choice (also known as alternation or set union) operator matches either the expression before or the expression after this operator. For example, abc|def can match either “abc” or “def”.
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches “abc”, etc., but [a.c] matches only “a”, “.”, or “c”.
^ Matches the starting position in the string, like the startsWith() function. In line-based tools, it matches the starting position of any line.
$ Matches the ending position of the string or the position just before a string-ending newline, like the endsWith() function. In line-based tools, it matches the ending position of any line.

Credit to Wikipedia for some of the regex descriptions.
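
To see these metacharacters in action, here is a small sketch that checks a handful of the examples from the table with re.fullmatch(), which succeeds only if the entire string fits the pattern. The list of cases and its variable names are assumptions made up for this illustration.

```python
import re

# (pattern, text, expected) triples based on the descriptions in the table
cases = [
    (r"ab*c", "ac", True),      # * allows zero occurrences of "b"
    (r"ab*c", "abbbc", True),
    (r"ab+c", "ac", False),     # + needs at least one "b"
    (r"ab+c", "abbc", True),
    (r"ab?c", "abbc", False),   # ? allows at most one "b"
    (r"abc|def", "def", True),  # | picks either alternative
    (r"a.c", "abc", True),      # . matches any single character
    (r"[a.c]", "b", False),     # inside brackets, . is a literal dot
    (r"^abc$", "abc", True),    # ^ and $ anchor start and end
]

results = [re.fullmatch(p, t) is not None for p, t, _ in cases]

for (p, t, expected), got in zip(cases, results):
    print(f"{p!r:12} {t!r:10} -> {got}")
```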

Regular Expressions Methods in Python

There are several methods available to use regular expressions. Here we are going to discuss some of the most commonly used methods and also give a few examples of how they are used. These methods include:

  1. re.match()
  2. re.search()
  3. re.findall()
  4. re.split()
  5. re.sub()
  6. re.compile()

re.match(pattern, string, flags=0)

This function is used to match a character or set of characters at the beginning of a string. It’s also important to note that it will only match at the beginning of the string and not at the beginning of each line if the given string has multiple lines.

The expression below will return None because Python does not appear at the beginning of the string.

# match.py

import re
result = re.match(r'Python', "It's easy to learn Python. Python also has elegant syntax")

print(result)
$ python match.py
None
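
The point about multi-line strings can be checked directly. In this sketch (the sample text is made up for illustration), re.match() still returns None when the pattern starts the second line, even with the re.MULTILINE flag; to match at the start of any line, one option is re.search() with ^ and re.MULTILINE.

```python
import re

text = "It's easy to learn.\nPython also has elegant syntax"

# match() only ever looks at the very start of the string
at_start = re.match(r"Python", text)

# re.MULTILINE does not change where match() starts looking
with_flag = re.match(r"Python", text, flags=re.MULTILINE)

# search() with ^ and re.MULTILINE finds a match at the start of line two
via_search = re.search(r"^Python", text, flags=re.MULTILINE)

print(at_start)            # None
print(with_flag)           # None
print(via_search.group())  # Python
```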

re.search(pattern, string)

This function checks for a match anywhere in the given string and returns a match object if one is found, or None otherwise.

In the following code we are simply trying to find if the word “puppy” appears in the string “Daisy found a puppy”.

# search.py

import re

if re.search("puppy", "Daisy found a puppy."):
    print("Puppy found")
else:
    print("No puppy")

Here we first import the re module and use it to search for an occurrence of the substring “puppy” in the string “Daisy found a puppy.”. If it exists in the string, a match object is returned, which is considered “truthy” when evaluated in an if-statement.

$ python search.py 
Puppy found
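
The returned match object carries more than a yes/no answer. As a short sketch, the same search can be used to read back the matched text and its position in the string:

```python
import re

match = re.search("puppy", "Daisy found a puppy.")

if match:
    matched_text = match.group()  # the text that matched
    position = match.span()       # (start, end) indexes in the string
    print(matched_text, position)  # puppy (14, 19)
else:
    matched_text, position = None, None
    print("No puppy")
```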

re.compile(pattern, flags=0)

This method is used to compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, which we have discussed above. This can also save time since parsing/handling regex strings can be computationally expensive to run.

# compile.py

import re

pattern = re.compile('Python')
result = pattern.findall('Pythonistas are programmers that use Python, which is an easy-to-learn and powerful language.')

print(result)

find = pattern.findall('Python is easy to learn')

print(find)
$ python compile.py 
['Python', 'Python']
['Python']

Notice that only the matched string is returned, as opposed to the entire word in the case of “Pythonistas”. This is more useful when using a regex string that has special match characters in it.
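
As an illustration of that last point, here is a sketch where the compiled pattern contains a special sequence. The pattern Python\w* is an assumption chosen for this example: \w* extends the match over any word characters that follow, so the whole word is returned.

```python
import re

# \w* extends the match over any word characters following "Python"
word_pattern = re.compile(r"Python\w*")

words = word_pattern.findall(
    "Pythonistas are programmers that use Python, which is an "
    "easy-to-learn and powerful language."
)

print(words)  # ['Pythonistas', 'Python']
```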

re.sub(pattern, repl, string)

As the name suggests, this function searches for the pattern and substitutes the replacement string repl wherever the pattern occurs.

# sub.py

import re
result = re.sub(r'python', 'ruby', 'python is a very easy language')

print(result)
$ python sub.py 
ruby is a very easy language
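
re.sub() also accepts optional count and flags arguments, and the replacement may refer back to captured groups. The sketch below uses made-up sample strings to show all three.

```python
import re

text = "Python is easy. python is fun."

# flags=re.IGNORECASE replaces both spellings
replace_all = re.sub(r"python", "ruby", text, flags=re.IGNORECASE)

# count=1 stops after the first replacement
replace_first = re.sub(r"python", "ruby", text, count=1, flags=re.IGNORECASE)

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(replace_all)    # ruby is easy. ruby is fun.
print(replace_first)  # ruby is easy. python is fun.
print(swapped)        # world hello
```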

re.findall(pattern, string)

As you’ve seen prior to this section, this function finds all non-overlapping occurrences of the pattern in the given string and returns them as a list. Like re.search(), it scans the whole string rather than only the beginning. The following example will retrieve all the occurrences of “Python” from the string.

# findall.py

import re

result = re.findall(r'Python', 'Python is an easy to learn, powerful programming language. Python also has elegant syntax')
print(result)
$ python findall.py 
['Python', 'Python']

Again, using an exact match string like this (“Python”) is really only useful for finding if the regex string occurs in the given string, or how many times it occurs.
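
re.findall() becomes more interesting once the pattern contains groups. In this sketch (the key=value sample text is an assumption for illustration), a pattern with two capturing groups returns a list of tuples, one per match, while len() on the result answers the "how many times" question.

```python
import re

settings = "width=80, height=24"

# With two capturing groups, each match is returned as a (key, value) tuple
pairs = re.findall(r"(\w+)=(\d+)", settings)

# Counting occurrences of an exact string
occurrences = len(re.findall(r"Python", "Python is easy. Python is elegant."))

print(pairs)        # [('width', '80'), ('height', '24')]
print(occurrences)  # 2
```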

re.split(pattern, string, maxsplit=0, flags=0)

This function splits a string at each location where the specified pattern occurs. If capturing parentheses are used in the pattern, the text of all groups in the pattern is also returned as part of the resulting list.

# split.py

import re

result = re.split(r"y", "Daisy found a puppy")

if result:
    print(result)
else:
    print("No puppy")

As the output below shows, the pattern “y” occurs twice and the string has been split at both occurrences. The trailing empty string appears because the final “y” is the last character of the input.

$ python split.py 
['Dais', ' found a pupp', '']
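
The effect of capturing parentheses and of the maxsplit argument can be seen by varying the same call. This is a small sketch using the string from the example above:

```python
import re

text = "Daisy found a puppy"

# Capturing parentheses keep the separators in the result
with_separators = re.split(r"(y)", text)

# maxsplit=1 stops after the first split
first_only = re.split(r"y", text, maxsplit=1)

print(with_separators)  # ['Dais', 'y', ' found a pupp', 'y', '']
print(first_only)       # ['Dais', ' found a puppy']
```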

Practical uses of Regular Expressions

Whether you know it or not, we use regular expressions almost daily in our applications. Since regular expressions are available in just about every programming language, it’s not easy to escape their usage. Let’s look at some of the ways regular expressions can be used in your applications.

Constructing URLs

Every web page has a URL. Now imagine you have a Django website with an address like “http://www.example.com/products/27/”, where 27 is the ID of a product. It would be very cumbersome to write separate views to match every single product.

However, with regular expressions, we can create a pattern that will match the URL and extract the ID for us:

An expression that will match and extract any numerical ID could be ^products/(\d+)/$.

  • ^products/ tells Django to match a string that has “products/” at the beginning of the URL (where “beginning” of the string is specified by ^)
  • (\d+) means that there will be a number (specified by \d+) and we want it captured and extracted (specified by the parentheses)
  • / tells Django that another “/” character should follow
  • $ indicates the end of the URL, meaning that only strings ending with the / will match this pattern
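
The breakdown above can be tried out with the re module alone, without any Django code. This is a minimal sketch: the helper name product_id is hypothetical, and it assumes the path is passed without its leading slash, as the pattern implies.

```python
import re

PRODUCT_URL = re.compile(r"^products/(\d+)/$")

def product_id(path):
    """Return the captured product ID as an int, or None if the path does not match."""
    match = PRODUCT_URL.match(path)
    return int(match.group(1)) if match else None

print(product_id("products/27/"))   # 27
print(product_id("products/abc/"))  # None (not a number)
print(product_id("products/27"))    # None (missing trailing slash)
```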

Validating Email Addresses

Most authentication systems require users to sign up and log in before they can be allowed access to the system. We can use a regular expression to check if an email address supplied is in a valid format.

# validate_email.py

import re

email = "user@example.com"  # placeholder address for illustration

if not re.match(re.compile(r'^.+@[^.].*\.[a-z]{2,10}$', flags=re.IGNORECASE), email):
    print("Enter a valid email address")
else:
    print("Email address is valid")

As you can see, this is a pretty complicated regex string. Let’s break it down a bit using the example email address in the code above. It basically means the following:

  • ^.+@: Match one or more characters from the beginning of the string up until the ‘@’ character
  • [^.].*: Match a first character that is not “.”, followed by any further characters
  • \.[a-z]{2,10}$: Match a literal “.” and then the domain TLD characters (2 to 10 letters) at the end of the string

So, as you’d expect, the code matches our example address:

$ python validate_email.py 
Email address is valid
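
To get a feel for what such a pattern accepts and rejects, it helps to run it over a few inputs. The sketch below assumes the pattern ^.+@[^.].*\.[a-z]{2,10}$ and uses made-up placeholder addresses. It is a loose format check only and does not prove that an address exists.

```python
import re

# Assumed pattern: something, "@", a non-dot character, anything, ".", 2-10 letters
EMAIL = re.compile(r"^.+@[^.].*\.[a-z]{2,10}$", flags=re.IGNORECASE)

# Placeholder addresses made up for this example
candidates = [
    "user@example.com",
    "USER@EXAMPLE.ORG",
    "user@.example.com",  # domain starts with a dot
    "user.example.com",   # no "@"
    "user@example",       # no TLD
]

verdicts = {address: EMAIL.match(address) is not None for address in candidates}

for address, ok in verdicts.items():
    print(f"{address:20} {'valid' if ok else 'invalid'}")
```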

Validating Phone Numbers

The following example is used to validate a list of prefixed Canadian numbers:

# validate_numbers.py

import re

numbers = ["+18009592809", "=18009592809"]

for number in numbers:
    if not re.match(re.compile(r"^(\+1?[-. ]?(\d+))$"), number):
        print("Number is not valid")
    else:
        print("Number is valid")
$ python validate_numbers.py 
Number is valid
Number is not valid

As you can see, because the second number uses a “=” character instead of “+”, it is deemed invalid.
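
Since the same pattern is applied to every number, it can be compiled once outside the loop, and the inner group can be used to pull out the digits. This is a sketch under the assumption that the pattern is ^(\+1?[-. ]?(\d+))$; the third sample number is made up for illustration.

```python
import re

# Compile once, reuse for every number
PHONE = re.compile(r"^(\+1?[-. ]?(\d+))$")

numbers = ["+18009592809", "+1-8009592809", "=18009592809"]

digits = {}
for number in numbers:
    match = PHONE.match(number)
    # group(2) is the inner (\d+) group: the digits after the "+1" prefix
    digits[number] = match.group(2) if match else None

print(digits)
```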

Filtering Unwanted Content

Regular expressions can also be used to filter certain words out of post comments, which is particularly useful in blog posts and social media. The following example shows how you can filter out pre-selected words that users should not use in their comments.

# filter.py

import re

curse_words = ["foo", "bar", "baz"]
comment = "This string contains a foo word."
curse_count = 0

for word in curse_words:
    if re.search(word, comment):
        curse_count += 1

print("Comment has " + str(curse_count) + " curse word(s).")
$ python filter.py 
Comment has 1 curse word(s).
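
One caveat with a plain re.search(word, comment) is that “foo” would also be found inside a longer word such as “food”. A possible refinement, sketched here with a made-up comment string, combines the words into one pattern, escapes them with re.escape(), and wraps them in \b word boundaries. The same pattern can then be used to mask the matches.

```python
import re

curse_words = ["foo", "bar", "baz"]
comment = "This string contains a foo word, but food and barn are fine."

# One pattern for all words; \b ensures only whole words match
CURSES = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in curse_words) + r")\b",
    flags=re.IGNORECASE,
)

found = CURSES.findall(comment)
cleaned = CURSES.sub("***", comment)

print(found)    # ['foo']
print(cleaned)  # This string contains a *** word, but food and barn are fine.
```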

Conclusion

This tutorial has covered the basics needed to start using regular expressions in your applications. Feel free to consult the documentation for the re module, which has a ton of resources to help you accomplish your application’s goals.
