In this Python regex tutorial we will learn how to use regular expressions for data cleaning. When you clean data, regular expressions (regex) play an important role: they are one of the most useful tools for extracting, manipulating and validating text. Python's built-in re module provides powerful tools for working with regular expressions in Python code.
We will cover the basics of Python regex for data cleaning and walk through some examples of how it can be used to clean up messy data.
What is a Regular Expression?
A regular expression is a sequence of characters that defines a search pattern. It is used to match and manipulate strings, which makes it one of the most important tools for data cleaning. For example, suppose we have a dataset that contains phone numbers in different formats. We can use a regular expression to extract only the digits and ignore any other characters, such as parentheses or dashes.
Basic Regular Expression Syntax
At first sight regular expression syntax can be hard to read, but once you understand the basics, it becomes much easier to grasp what a regular expression means.
These are some of the most common elements of regular expression syntax:
Literal characters | Any character that is not a special character matches itself. For example, the regular expression abc matches the text “abc”. |
Character classes | A character class is a set of characters, any one of which can match at that position. For example, the regular expression [aeiou] matches any single lowercase vowel. |
Quantifiers | A quantifier specifies how many times a character or group of characters should appear in the match. For example, the regular expression a{3} matches the text “aaa”. |
Anchors | Anchors specify the position of the match within the string. For example, the regular expression ^abc matches any string that starts with “abc”, and the regular expression abc$ matches any string that ends with “abc”. |
Alternation | Alternation allows you to match one of several patterns. For example, the regular expression dog|cat matches either “dog” or “cat”. |
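The elements listed above can be tried out directly with Python's re module. The following is only a small sketch, and the sample strings are made up for illustration:

```python
import re

# Literal characters: the pattern abc matches the text "abc"
literal = re.search(r'abc', 'xyzabc').group()

# Character class: [aeiou] matches any single lowercase vowel
vowels = re.findall(r'[aeiou]', 'regex tutorial')

# Quantifier: a{3} matches exactly three consecutive "a" characters
triple_a = re.search(r'a{3}', 'baaad').group()

# Anchors: ^ ties the match to the start, $ ties it to the end
starts_with_abc = bool(re.search(r'^abc', 'abcdef'))
ends_with_abc = bool(re.search(r'abc$', 'xyzabc'))

# Alternation: dog|cat matches either word
pets = re.findall(r'dog|cat', 'one dog and one cat')

print(literal)          # abc
print(vowels)           # ['e', 'e', 'u', 'o', 'i', 'a']
print(triple_a)         # aaa
print(starts_with_abc)  # True
print(ends_with_abc)    # True
print(pets)             # ['dog', 'cat']
```

Note that re.search() looks for the pattern anywhere in the string, which is why the anchors ^ and $ are needed when the position of the match matters.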
Python Regex for Data Cleaning
Now let’s see how we can use Python regex for data cleaning, starting with our first example.
Suppose we have a dataset that contains phone numbers in different formats, and we want to extract only the digits and ignore any other characters such as parentheses or dashes.
    import re

    phone_numbers = ['(888) 888-8888', '888-888-8888', '888.888.8888']

    pattern = re.compile(r'\d+')

    for number in phone_numbers:
        cleaned_number = ''.join(pattern.findall(number))
        print(cleaned_number)
In the above example we created a regular expression pattern that matches any sequence of one or more digits (\d+). After that we used the findall() method to extract all substrings of the phone number that match this pattern. Lastly, we joined the resulting list of strings to create a single string containing only the digits.
This will be the result:

    8888888888
    8888888888
    8888888888
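Extracting the digits is often followed by a validation step. The sketch below is one possible way to do that: normalize_phone() is a hypothetical helper, and the rule that a valid number has exactly ten digits is an assumption made only for this illustration.

```python
import re

def normalize_phone(raw):
    """Return the number as 888-888-8888, or None if it is not ten digits.

    The ten-digit rule is an assumption for this sketch, not a general
    rule for phone numbers.
    """
    digits = re.sub(r'\D', '', raw)          # \D matches any non-digit character
    if not re.fullmatch(r'\d{10}', digits):  # validate: exactly ten digits
        return None
    return f'{digits[:3]}-{digits[3:6]}-{digits[6:]}'

samples = ['(888) 888-8888', '888.888.8888', '888-8888']
normalized = [normalize_phone(s) for s in samples]
print(normalized)  # ['888-888-8888', '888-888-8888', None]
```

Here re.sub() with \D removes everything that is not a digit in one step, which gives the same digits as joining the findall() results, and re.fullmatch() only succeeds when the whole string fits the pattern.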
Sometimes a dataset contains email addresses with variations in the domain name. In this example we want to standardize all domain names to codeloop.org.
    import re

    email_addresses = ['user@codeloop.com', 'user@codeloop.xyz', 'user@codeloop.net']

    pattern = re.compile(r'@[\w.-]+')

    for email in email_addresses:
        cleaned_email = pattern.sub('@codeloop.org', email)
        print(cleaned_email)
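In the above example the pattern @[\w.-]+ matches the @ sign together with the domain name that follows it, and the sub() method replaces that match with @codeloop.org. This will be the result:

    user@codeloop.org
    user@codeloop.org
    user@codeloop.org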
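Real data is usually messier than the list above, for example with stray spaces around an address or values that are not email addresses at all. The following sketch shows one way to handle that. The standardize_domain() helper and the sample values are hypothetical, and the $ anchor is added here so that only a domain at the end of the string is replaced.

```python
import re

# Match the @ sign and the domain, but only at the end of the string
DOMAIN_PATTERN = re.compile(r'@[\w.-]+$')

def standardize_domain(email, domain='codeloop.org'):
    """Strip surrounding whitespace, then swap the trailing @domain part."""
    return DOMAIN_PATTERN.sub('@' + domain, email.strip())

raw_emails = ['  user@codeloop.com ', 'admin@codeloop.xyz', 'no-domain-here']
cleaned = [standardize_domain(e) for e in raw_emails]
print(cleaned)  # ['user@codeloop.org', 'admin@codeloop.org', 'no-domain-here']
```

A value without an @ sign is returned unchanged (apart from the stripped whitespace), because sub() leaves the string as it is when the pattern does not match.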
Now let’s learn how to remove HTML tags. Suppose we have a dataset that contains text with HTML tags, and we want to remove all of the tags.
    import re

    text = '<p>Welcome <b>to</b> codeloop <i>.org</i></p>'

    pattern = re.compile(r'<[^>]+>')

    cleaned_text = pattern.sub('', text)

    print(cleaned_text)
In the above example we created a regular expression pattern (<[^>]+>) that matches an opening angle bracket, then one or more characters that are not a closing angle bracket, and then the closing angle bracket. After that we used the sub() method to replace every match with an empty string.
This will be the result:

    Welcome to codeloop .org
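Text taken from HTML often also contains entities such as &amp;amp; and uneven whitespace. The sketch below extends the same idea under a few assumptions: strip_html() is a hypothetical helper, each tag is replaced with a space instead of an empty string so that words separated only by a tag do not get glued together, and the input is assumed to be simple markup. For complex or malformed HTML, a real HTML parser is usually a safer choice than a regex.

```python
import re
from html import unescape

TAG_PATTERN = re.compile(r'<[^>]+>')
SPACE_PATTERN = re.compile(r'\s+')

def strip_html(text):
    """Remove tags, decode HTML entities and collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(' ', text)   # replace each tag with a space
    decoded = unescape(without_tags)            # e.g. turn &amp; into &
    return SPACE_PATTERN.sub(' ', decoded).strip()

sample = '<p>Tom &amp; Jerry<br>are   <b>friends</b></p>'
result = strip_html(sample)
print(result)  # Tom & Jerry are friends
```

The trade-off of inserting a space is that a tag in the middle of a word would split the word, so the replacement string should be chosen to fit the data being cleaned.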