Using Regular Expressions in Python

Every programming language has built-in functions for working with strings. In Python, strings have methods for searching and replacing: index(), find(), split(), count(), replace(), and so on. But these methods are limited to the simplest cases. For example, the index() method searches only for a fixed substring, and the search is always case sensitive. To perform a case-insensitive search on a string s, you must first call s.lower() or s.upper() so that the string and the search term have the same case. The replace() and split() methods have the same restrictions.
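As a minimal sketch of that limitation (the sample string is made up for illustration), a case-insensitive search with plain string methods means normalizing the case first:

```python
s = 'Regular Expressions in Python'

# find() is case sensitive: the lowercase word is not found
case_sensitive = s.find('python')
print(case_sensitive)    # -1

# Lowercasing the string first makes the search case insensitive
case_insensitive = s.lower().find('python')
print(case_insensitive)  # 23
```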

If your task can be solved with these methods, it’s better to use them, because they are simple and fast, and also easy to read. But if you see that you are using a lot of string functions with if conditions to handle special cases, or a large number of successive split() and join() calls to cut your strings into pieces, then you need regular expressions.

Regular expressions are a powerful and (for the most part) standardized way to search, replace, and parse text using patterns. Although regular expression syntax is quite complex and looks unlike normal code, the end result is often more readable than a chain of string function calls. There is even a way to put comments inside regular expressions, so you can include a little documentation in the pattern itself.
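As a rough illustration of such in-pattern comments, Python's re.VERBOSE flag can be used as sketched below. The phone-number pattern and sample text are borrowed from the example later in this article; the variable names are only illustrative.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and everything
# after '#' on a line is treated as a comment.
phone_pattern = re.compile(r'''
    \d{3}   # first three digits
    -       # separator
    \d{3}   # next three digits
    -       # separator
    \d{4}   # last four digits
''', re.VERBOSE)

found = phone_pattern.search('My number is 415-730-0000.')
print(found.group())  # 415-730-0000
```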

In this article, we'll look at the basics of working with regular expressions in Python, as well as examples of how much more cumbersome the code would become without using them.

Why do we need regular expressions?

Let's look at a simple example: you receive a text message that contains an important piece of information, for example, a phone number. If there are only a few such messages, you can pick out the necessary information by hand without much difficulty. But when the number of messages grows, the messages get longer, and the phone number can appear anywhere in the text, it is far more reasonable to automate the search.

Let's say that the text of the message from which you need to pull out the phone number is as follows:

text = 'My number is 415-730-0000. Call me as soon as possible.'

And you know that all phone numbers have the same structure - 3 digits, a hyphen, 3 digits, a hyphen, 4 digits.

You can use the following code to find substrings with this structure:

def is_phone_number(text):
    if len(text) != 12:
        return False
    for i in range(3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

for i, _ in enumerate(text):
    if is_phone_number(text[i:i+12]):
        print(text[i:i+12])

As a result, we get:

415-730-0000

However, the same result can be obtained with fewer lines of code that are also easier to read:

import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phone_number = phone_regex.search(text)
print(phone_number.group())

In fact, everything described in the is_phone_number function fits into a single line: phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d').

Let's look at how these few lines of code work.

First of all, the module for working with regular expressions is imported (import re). The rest of the code calls the functions and methods it provides (in this example, compile, search, and group).

Then the compile function converts the regular expression pattern (r'\d\d\d-\d\d\d-\d\d\d\d') into a pattern object, which looks for matches using the match, search, and other methods that we’ll discuss further down.

Each \d in the pattern stands for a single digit (0 to 9; for Unicode strings it also covers decimal digits from other scripts), and the '-' symbol matches a literal hyphen. Thus, this pattern exactly mirrors the structure of the phone number described above.

The letter 'r' before the opening quote marks a raw string, which makes the expression easier to write: without it you’d have to use two backslashes instead of one, and the pattern would look like this: '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d', which, you have to admit, is harder to read.
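A small sketch of the difference between the two spellings (the variable names are illustrative only):

```python
raw_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
escaped_pattern = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'

# Both literals produce exactly the same string
print(raw_pattern == escaped_pattern)  # True

# In a regular literal '\n' is a single newline character;
# in a raw literal it stays a backslash followed by 'n'
print(len('\n'), len(r'\n'))  # 1 2
```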

After that, the search method is applied to the text being analyzed, and phone_number.group() returns the matched text as a string.

If you don’t use the group() method, then instead of 415-730-0000 you get something like this (the exact class name depends on the Python version):

<_sre.SRE_Match object; span=(13, 25), match='415-730-0000'>

which is not quite what we wanted.
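The match object is still useful on its own. The sketch below reuses the pattern and text from above and also shows what happens when nothing is found; the second sample string is invented for the example.

```python
import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
text = 'My number is 415-730-0000. Call me as soon as possible.'

match = phone_regex.search(text)
print(match.group())  # 415-730-0000
print(match.span())   # (13, 25)

# search() returns None when nothing is found,
# so check the result before calling group()
no_match = phone_regex.search('No phone number here.')
if no_match is None:
    print('not found')
```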

As you can see from the example above, regular expressions can greatly simplify working with text.

"But they can work with more than just numbers, can’t they?" you ask, and you’ll be absolutely right.

Try to solve the "House Password" and "Three Words" missions using only regular expressions and you’ll see how much more comfortable it is in comparison to writing many test conditions with if/else.

What else can I look for using regular expressions?

The following symbols and their combinations can also be used in regular expressions:

\w - any letter, digit, or underscore ('_');

\s - any "whitespace" character, such as a space, tab, or line feed;

\W - any character except those matched by \w;

\S - any character except those matched by \s;

\D - any character other than a digit;

. - a dot matches any character except the line feed character;

^ - matches the beginning of the string (or of each line in re.MULTILINE mode);

$ - matches the end of the string (or of each line in re.MULTILINE mode);

* - matches 0 or more occurrences of the preceding expression. For example, ab* matches 'a', 'ab', 'abb' and so on: 'a' followed by any number of 'b';

+ - almost equivalent to *, but matches 1 or more occurrences of the preceding expression;

? - matches 0 or 1 occurrence of the preceding expression;

{x} - matches exactly x occurrences of the preceding expression. For example, 6{3} matches '666' as a whole string, but not '66' or '6666';

{x,y} - matches from x to y occurrences of the preceding expression (written without a space after the comma). The second number can be omitted: for example, a{10,} matches 'a' repeated 10 or more times, with no upper limit.
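The quantifiers listed above can be tried out with a short sketch. The fits helper is hypothetical and exists only for this example; it uses re.fullmatch so that the whole string has to fit the pattern.

```python
import re

def fits(pattern: str, s: str) -> bool:
    """Hypothetical helper: True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(fits(r'ab*', 'a'), fits(r'ab*', 'abbb'))      # True True
print(fits(r'ab+', 'a'), fits(r'ab+', 'ab'))        # False True
print(fits(r'ab?', 'ab'), fits(r'ab?', 'abb'))      # True False
print(fits(r'6{3}', '666'), fits(r'6{3}', '6666'))  # True False
print(fits(r'a{2,3}', 'aaa'), fits(r'a{2,}', 'a'))  # True False

# Character shorthands with findall(), which lists every match
words = re.findall(r'\w+', 'snake_case, 42!')
chunks = re.findall(r'\S+', 'a b\tc')
print(words)   # ['snake_case', '42']
print(chunks)  # ['a', 'b', 'c']
```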

You can also specify a set of characters, any one of which may appear at the given position. This is done by putting the characters in square brackets: [].

For example, [abc] means any character from these three - "a", "b" or "c".

If the range of values is wide, there is no need to list every character; you can describe it as a range:

[a-zA-Z] - any letter of the Latin alphabet.

[0-6] - any digit from 0 to 6 inclusive.

Note that inside square brackets most special characters lose their special meaning and are treated as ordinary ones. For example, [+*?] matches any one of the symbols "+", "*", or "?".

If instead you need to match any character except certain ones, put the ^ symbol at the beginning of the set. For example, [^abcxyz] means any character other than "a", "b", "c", "x", "y", "z".
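A brief sketch of these character sets in action (the sample strings are invented for the example):

```python
import re

letters = re.findall(r'[abc]', 'aardvark cab')
low_digits = re.findall(r'[0-6]', '2019-08')
specials = re.findall(r'[+*?]', '2+2*2=?')
others = re.findall(r'[^abcxyz]', 'abc-xyz!')
latin_words = re.findall(r'[a-zA-Z]+', 'Hello, World 42')

print(letters)      # ['a', 'a', 'a', 'c', 'a', 'b']
print(low_digits)   # ['2', '0', '1', '0']
print(specials)     # ['+', '*', '?']
print(others)       # ['-', '!']
print(latin_words)  # ['Hello', 'World']
```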

What methods exist for working with regular expressions?

The most commonly used methods are as follows:

re.search(pattern, string) - finds the first occurrence of the pattern in the string and returns the corresponding match object. If no match is found, None is returned.

re.match(pattern, string) - works like the previous method, but looks only at the beginning of the string rather than along all of it. Thus, if the pattern is 'aaa' and the text is 'baaa', search will find a match and match won’t.

re.fullmatch(pattern, string) - returns a match object only if the entire string matches the specified pattern.

re.split(pattern, string) - splits the string using the pattern as a delimiter and returns the resulting pieces as a list.

re.findall(pattern, string) - works like search, but returns all the occurrences it finds in the string rather than only the first one. The result is a list of the matched substrings (or of the captured groups, if the pattern contains groups).

re.sub(pattern, repl, string, count=0) - returns a string in which all substrings matching the pattern are replaced by the specified substring. By default, all occurrences are replaced, but you can use count to limit the replacement to the first n occurrences.
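The following sketch runs each of these functions once. The 'aaa' / 'baaa' pair comes from the description above; the other sample strings are made up for the example.

```python
import re

text = 'baaa'
found = re.search('aaa', text)     # match object: 'aaa' starts at index 1
at_start = re.match('aaa', text)   # None: the string does not start with 'aaa'
whole = re.fullmatch('ba+', text)  # match object: the whole string fits
print(found.span(), at_start, whole.group())  # (1, 4) None baaa

parts = re.split(r'[,;]\s*', 'one, two;three')
numbers = re.findall(r'\d+', 'room 12, floor 3')
masked = re.sub(r'\d', '#', 'pin 1234')
masked_two = re.sub(r'\d', '#', 'pin 1234', count=2)

print(parts)       # ['one', 'two', 'three']
print(numbers)     # ['12', '3']
print(masked)      # pin ####
print(masked_two)  # pin ##34
```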

What are regular expressions used for in practice?

The most popular things that are easy to find or check by using regular expressions are:

- dates (when you know exactly the format, for example dd.mm.yyyy or yy-mm-dd);

- email addresses;

- zip codes;

- geographical coordinates;

- and much more.
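As a minimal sketch of the first item, here is a pattern for dates in the dd.mm.yyyy format. It checks only the shape of the text, not whether the date actually exists, and the sample sentence is invented for the example.

```python
import re

# Shape check only: two digits, a dot, two digits, a dot, four digits.
# It does not verify that the date exists (99.99.2018 also passes).
date_regex = re.compile(r'\d{2}\.\d{2}\.\d{4}')

text = 'Created on 13.09.2018, updated on 01.10.2018.'
dates = date_regex.findall(text)
print(dates)  # ['13.09.2018', '01.10.2018']
```

The dots are escaped as \. because an unescaped dot would match any character.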

Let's look at other cases where regular expressions were used as an alternative to the usual methods of working with strings.

Here are several solutions to the "First Word" mission.

Without using regular expressions, it looks like this:

def first_word(text: str) -> str:
    """
       returns the first word in a given text.
    """
    new_str = text.replace('.', ' ').replace(',', ' ').split()
    return new_str[0]

or even like this:

def first_word(text: str) -> str:
    import itertools as it
  
    unwanted_chars = '., '
  
    is_unwanted = lambda char: char in unwanted_chars
    is_wanted = lambda char: char not in unwanted_chars
  
    return ''.join(it.takewhile(is_wanted, it.dropwhile(is_unwanted, text)))

And when using regular expressions it looks like this:

import re

def first_word(text: str) -> str:
    return re.search("[A-Za-z']+", text).group()

As you can see, the solutions using standard string methods have to be adjusted every time new non-letter symbols (for example, '!', '#', '@', '&', etc.) appear in the text. The regular expression solution works in these situations without changes, since it searches strictly for runs of Latin letters (and apostrophes) and ignores any other symbols.
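To make the difference concrete, here is a sketch that runs the first string-method solution and the regular expression solution on the same input. The function names with suffixes and the sample string are assumptions made for this comparison.

```python
import re

def first_word_split(text: str) -> str:
    # String-method version from above: it knows only about '.' and ','
    return text.replace('.', ' ').replace(',', ' ').split()[0]

def first_word_regex(text: str) -> str:
    # Regex version from above: the first run of letters and apostrophes
    return re.search("[A-Za-z']+", text).group()

sample = '!!! Hello, world'
print(first_word_split(sample))  # !!!
print(first_word_regex(sample))  # Hello
```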

However, remember that regular expressions aren’t an ideal tool for every situation. Whenever you ask yourself, "Should I use regular expressions for this task or limit myself to standard methods?", weigh both options carefully, so as to avoid what happened in the following example.

Here are a couple of solutions to the "Between Markers" mission that illustrate this idea well.

When using regular expressions:

def between_markers(text: str, begin: str, end: str) -> str:
    """
       returns substring between two given markers
    """
    import re
    tb = text.find(begin)
    te = text.find(end)
    # escape square brackets so they are treated as literal characters
    begin = begin.replace('[', r'\[')
    begin = begin.replace(']', r'\]')
    end = end.replace('[', r'\[')
    end = end.replace(']', r'\]')
    p = ''
    # the named group "res" captures the text between the markers
    if(tb>=0 and te>=0):
        p = r'('+begin+r')(?P<res>[\d\w\W]*)('+end+r')'
    elif(tb == -1 and te==-1):
        return text
    elif(tb==-1):
        p = r'(\s*)(?P<res>[\d\w\W]*)('+end+r')'
    elif(te==-1):
        p = r'('+begin+r')(?P<res>[\d\w\W]*)(\s*)'
    else:
        return ''
      
    match = re.search(p,text)
    if match:
        return match.group("res")
    return ''

and without:

def between_markers(txt, begin, end):
    a, b, c = txt.find(begin), txt.find(end), len(begin)
    return [txt[a+c:b], txt[a+c:], txt[:b], txt][2*(a<0)+(b<0)]
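The compact version packs four cases into one indexing expression. The sketch below repeats it and checks each case on inputs invented for the example, assuming the behaviour visible in the code: a missing marker means "from the start" or "to the end" of the text.

```python
def between_markers(txt, begin, end):
    a, b, c = txt.find(begin), txt.find(end), len(begin)
    # index 0: both markers found, 1: no end marker,
    # index 2: no begin marker, 3: neither marker found
    return [txt[a+c:b], txt[a+c:], txt[:b], txt][2*(a<0)+(b<0)]

print(between_markers('What is >apple<', '>', '<'))  # apple
print(between_markers('No [b]hi', '[b]', '[/b]'))    # hi
print(between_markers('No hi[/b]', '[b]', '[/b]'))   # No hi
print(between_markers('No markers', '[b]', '[/b]'))  # No markers
```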

We also recommend reading the official documentation, where you can learn more about the powerful capabilities (and limitations) of regular expressions.

Conclusion

As you can see, when working with strings and searching for information that falls under a certain pattern, regular expressions can be extremely useful. On the other hand, sometimes it’s very difficult to work with them when the structure of the sought text is very confusing or consists of many elements that need to be taken into account.

Please tell us what tasks you have used regular expressions for and how often you have to deal with them.

Created: Sept. 13, 2018, 6:48 a.m.
likewind