Introduction to Regular Expression with Python
There are many ways to define regular expressions, but I don't want to take something that can be fun to learn and make it boring with bookish definitions. In simple words, a regular expression, or RegEx, is a pattern language for finding, replacing, and processing text in a programmatic way. It brings logic to working with plain text, or even binary data. Most of the time regular expressions are used on plain text, e.g. text files, HTML files, XML files and so on, but it should be no surprise that they are also used for processing non-text data.
Rather than just talking about theory, let's clarify the concepts with an example. Take this simple one: you have got a lot of music files on your hard drive. Some of them are .mp3 files, some are .ogg files, and some are not music files at all, like .txt. Now you need to sort all of them out. Some of them are xxx.mp3, some are xxx.mP3, some are xxx.MP3 (the xxx stands for any arbitrary filename). The work is very simple, but you need to sort out 1 million such files. What do you do now?
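You can sort them all out with a simple Python function like the following:
def sort_mp3(list_of_files):
    # Collect every filename that ends in .mp3, ignoring case.
    new_list = []
    for X in list_of_files:
        x = X.lower()  # compare in lowercase so .mP3 and .MP3 also count
        if x.endswith(".mp3"):
            new_list.append(X)  # keep the original spelling
    return new_list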
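As a quick check of the idea, here is a minimal sketch that runs the same case-insensitive test over a handful of names. The helper name filter_mp3_simple and the sample filenames are made up for illustration.

```python
def filter_mp3_simple(list_of_files):
    """Keep the names that end in .mp3, whatever the letter case."""
    return [name for name in list_of_files if name.lower().endswith(".mp3")]

# Made-up sample names covering the cases described above.
files = ["song.mp3", "Track.MP3", "live.mP3", "notes.txt", "tune.ogg"]

print(filter_mp3_simple(files))  # ['song.mp3', 'Track.MP3', 'live.mP3']
```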
But life is not a bed of roses; in fact, sometimes it is full of thorns. You never know when life will turn against you. You have a very naughty girl in the house, and she changed some of your filenames to make your life miserable. I don't know why she did that, but she did it to take some revenge on you. She renamed some files to xxx.mmmmppp3, some to xxx.mppPpPppppppp3, and some to xxx.mP3333. Years ago something similar happened to me, and it made me solve the problem with Python and RegEx.
What do you do now? You find the pattern of the crime to save yourself! Let's play detective! Do you see any pattern in there? Maybe, maybe not, but I can see it. There are x occurrences of m, followed by y occurrences of p and z occurrences of 3, where x, y, and z can each be any positive number. On top of that, every m and p can be either an uppercase or a lowercase letter. How do you define these rules in pure Python code? You can define them, but again, life is not a bed of roses: rules can change anytime, anywhere, anyhow. Python was created to make programmers more productive with fewer lines of code, but if it makes your life more miserable then there is no point in using it. So, should we leave Python? No, we use Python together with another, special-purpose language that we call Regular Expression.
There is an old programmer's joke: you have a problem, you decide to solve it with regular expressions, and now you have two problems. Jokes aside, regular expressions look hard at first glance, but they are fun to use once you master them.
Without further ado, let's break our problem down into parts and play "divide and conquer." We are not going to learn the theory first and then implement it, but rather will learn the theory through coding and examples.
Let's express the rules in plain English:
1) A full filename has two parts: the name and the file extension. The two are separated by a dot. There may be any number of dots in a filename, but the last dot is the one that separates the name from the extension.
2) The name part can contain any character.
3) The extension can also contain any character. But in our special case it has a limited set of characters in a defined sequence.
4) The extension starts with any non-zero number of m or M.
5) The run of m or M is followed by any non-zero number of p or P.
6) The run of p or P is followed by any non-zero number of 3.
7) The extension must be at the end of the filename.
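To see why plain Python gets tedious here, the rules above can be checked by hand roughly as follows. This is only a sketch under my own assumptions: the function name looks_like_mangled_mp3 is hypothetical, and it reads rule 1 as "there must be a dot with at least one character before it".

```python
def looks_like_mangled_mp3(filename):
    """Hand-written check of rules 1 to 7, without regular expressions."""
    # Rule 1: split on the last dot.
    name, dot, ext = filename.rpartition(".")
    if not dot or not name:
        return False  # no dot at all, or nothing before it
    ext = ext.lower()  # treat m/M and p/P alike
    pos = 0
    # Rules 4 to 6: a run of m, then a run of p, then a run of 3.
    for letter in "mp3":
        start = pos
        while pos < len(ext) and ext[pos] == letter:
            pos += 1
        if pos == start:
            return False  # each run needs at least one character
    # Rule 7: nothing may follow the run of 3s.
    return pos == len(ext)

print(looks_like_mangled_mp3("xxx.mppPpPppppppp3"))  # True
print(looks_like_mangled_mp3("xxx.mp3.txt"))         # False
```

It works, but every change to the rules means rewriting the loop, which is exactly the pain a pattern language is meant to remove.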
There is more than one way to do things in regular expression but as this is a short introductory article we will stick to only one way when possible.
Let's implement the rules with a regular expression. Bear in mind that Regular Expression is a language in itself, so it has its own escaping rules, just as Python does.
1) Any character in a regular expression is denoted by a single dot . with one exception: unless the DOTALL flag is used, the dot does not match a newline. (DOTALL is a flag passed when compiling a regular expression that tells the RegEx compiler to let . match any character, including the newline character.) But we want more: we want 'any number of any character', where 'any number' means a non-zero positive number. How do we express that with a regular expression? With a plus sign +, which means 'one or more of the preceding item'. So, what we get is .+
2) Already covered by .+ above.
3) Already covered in principle; rules 4 to 6 narrow it down.
4, 5, 6) A non-zero number of m, p, and 3 can be expressed with the great and powerful + again. Now the result is m+p+3+. What is still undone is case: the pattern as written matches only lowercase m and p. There is a flag for that (IGNORECASE), so you do not need to worry about it in the pattern itself.
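The building blocks so far can be tried out one at a time. The sketch below uses a small helper of my own, matches (not part of the re module), that wraps re.fullmatch so each line prints a plain True or False.

```python
import re

def matches(pattern, text, flags=0):
    """Hypothetical helper: True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text, flags) is not None

# The dot matches any single character except a newline...
print(matches(r".", "a"))               # True
print(matches(r".", "\n"))              # False
# ...unless the DOTALL flag is given.
print(matches(r".", "\n", re.DOTALL))   # True

# The plus sign means "one or more of the preceding item".
print(matches(r".+", "any name at all"))  # True
print(matches(r".+", ""))                 # False: zero characters is not enough
print(matches(r"m+p+3+", "mmmppp333"))    # True
print(matches(r"m+p+3+", "p3"))           # False: no m at all

# Without a flag the letters are case sensitive.
print(matches(r"m+p+3+", "mPPp3"))                 # False
print(matches(r"m+p+3+", "mPPp3", re.IGNORECASE))  # True
```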
Now we combine the two parts of a filename with a literal dot. But wait! Dot is used for another special thing in RegEx - what do we do now? We need to escape that. We escape that with a backslash. So, as a result we get .+\.m+p+3+
But wait, there is a problem once again: we need to specify that the file extension must be at the end of the filename. In other words, the run of 3s must be the last thing in the filename. To say that something must be at the end, we place a dollar sign $ at the end of the expression. So, we get: .+\.m+p+3+$
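To see what the dollar sign buys us, here is a small sketch comparing the pattern with and without it. The variable names and the sample filename song.mp3.txt are made up for the demonstration.

```python
import re

unanchored = re.compile(r".+\.m+p+3+")   # no $ at the end
anchored = re.compile(r".+\.m+p+3+$")    # extension must end the string

name = "song.mp3.txt"  # not an mp3 file, but it contains ".mp3"

# Without $, the pattern is happy with just the front part of the name.
print(unanchored.match(name).group())        # song.mp3
# With $, the same name is rejected because it does not end in 3.
print(anchored.match(name))                  # None
print(bool(anchored.match("song.mmpp33")))   # True
```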
This looks like some kind of alien language called Regular Expression. Now, how do we express it in Python code? We use a string. But again, whenever you feel that you are going to live the rest of your life in peace, there comes a villain. We cannot just put two quotation marks around what we have constructed so far: every character, backslash included, has to reach the RegEx compiler intact.
If we write ".+\.m+p+3+$" as an ordinary Python string, the backslash first passes through Python's own string escaping before the RegEx compiler ever sees it. Python happens to leave \. alone because it is not one of its escape sequences, but recent Python versions warn about such invalid escapes, and a sequence like \b would silently lose its mind and soul and turn into a backspace character. So we need to escape the backslash, and to escape a backslash we have to use another backslash. Thus the result becomes ".+\\.m+p+3+$". We can make life easier by using a raw Python string, where backslashes are left as they are: r".+\.m+p+3+$"
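The two spellings can be compared directly. This short sketch (the variable names are mine) shows that the doubled backslash and the raw string produce the same pattern text, and why raw strings are the safer habit.

```python
escaped = ".+\\.m+p+3+$"   # ordinary string, backslash doubled for Python
raw = r".+\.m+p+3+$"       # raw string, backslash written once

print(escaped == raw)  # True: both hold the same 11 characters
print(len(raw))        # 11

# A sequence Python does recognise shows the danger of ordinary strings:
# "\b" is one character (backspace), r"\b" is two (backslash and b).
print(len("\b"), len(r"\b"))  # 1 2
```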
But we are not done yet, my friend. We have just constructed a valid sentence in the Regular Expression language and made our place in the land. To use it we need some Python code, and that code lives in the standard library's re module.
The first step is to compile the regular expression with the re.compile() function, which gives you a compiled pattern object. You can also pass the pattern as a plain string to functions such as re.match(); the re module caches recently compiled patterns, so that works too, but compiling once and reusing the pattern object is the tidier choice when the same pattern is applied to many filenames.
import re

my_pat = re.compile(r".+\.m+p+3+$", re.IGNORECASE)

# Now, let's rewrite the previous function.
def sort_mp3(list_of_files):
    new_list = []
    for X in list_of_files:
        # match() starts at the beginning of X; the $ pins the end.
        if my_pat.match(X):
            new_list.append(X)  # keep the original filename
    return new_list
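To convince yourself that it works, you can run the compiled pattern over the naughty filenames from earlier. This is a minimal sketch: the names MP3_PATTERN and filter_mp3_regex and the sample list are my own, not part of the re module.

```python
import re

# Same pattern as above, compiled once and reused for every filename.
MP3_PATTERN = re.compile(r".+\.m+p+3+$", re.IGNORECASE)

def filter_mp3_regex(list_of_files):
    """Keep the names whose extension is some run of m, then p, then 3."""
    return [name for name in list_of_files if MP3_PATTERN.match(name)]

# Made-up sample: normal names, mangled names, and a few non-mp3 files.
files = [
    "a.mp3", "b.MP3", "c.mmmmppp3", "d.mppPpPppppppp3", "e.mP3333",
    "f.ogg", "g.txt", "h.mp3.bak",
]

print(filter_mp3_regex(files))
# ['a.mp3', 'b.MP3', 'c.mmmmppp3', 'd.mppPpPppppppp3', 'e.mP3333']
```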
So, we are done with our simple problem.
What about the other functions of the regular expression library re? Not everything can be covered in a single article. I will write more articles to explain regular expressions further. For now I am drawing the finish line by giving you a little homework: go study the Python re module.