Python Regex Tutorial – Data Cleaning

In this Python Regex Tutorial we will learn about Python regex for data cleaning. When we clean data, regular expressions (regex) play an important role: they are one of the most important tools for extracting, manipulating and validating data. Python’s built-in re module provides powerful tools for working with regular expressions in Python code.

 

In this tutorial we will cover the basics of Python regex for data cleaning and walk through some examples of how it can be used to clean up messy data.

 

 

What is a Regular Expression?

A regular expression is a sequence of characters that defines a search pattern. It is used to match and manipulate strings, which makes it one of the most important tools for data cleaning. For example, suppose we have a dataset that contains phone numbers in different formats. We can use a regular expression to extract only the digits and ignore any other characters, such as parentheses or dashes.

 

 

Basic Regular Expression Syntax

At first, regular expression syntax can look hard to understand, but once you understand the basics, it becomes much easier to grasp the meaning of a regular expression.

 

These are some of the most common elements of regular expression syntax:

Literal characters: Any character that is not a special character matches itself. For example, the regular expression abc matches the string “abc”.
Character classes: A character class is a set of characters, any one of which can match at that position. For example, the regular expression [aeiou] matches any single lowercase vowel.
Quantifiers: A quantifier specifies how many times a character or group of characters should appear in the match. For example, the regular expression a{3} matches the string “aaa”.
Anchors: Anchors specify the position of the match within the string. For example, the regular expression ^abc matches any string that starts with “abc”, and the regular expression abc$ matches any string that ends with “abc”.
Alternation: Alternation allows you to match one of several patterns. For example, the regular expression dog|cat matches either “dog” or “cat”.
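The elements listed above can be tried out with Python’s re module. This is a minimal sketch; the sample strings and variable names are made up for illustration only.

```python
import re

# Literal characters: "abc" matches itself anywhere in the string.
literal = re.search(r"abc", "xxabcxx")

# Character class: [aeiou] matches any single lowercase vowel.
vowels = re.findall(r"[aeiou]", "regex")

# Quantifier: a{3} matches exactly three consecutive "a" characters.
triple_a = re.search(r"a{3}", "baaad")

# Anchors: ^ ties the match to the start of the string, $ to the end.
starts_with_abc = bool(re.search(r"^abc", "abcdef"))
ends_with_abc = bool(re.search(r"abc$", "xyzabc"))

# Alternation: dog|cat matches either word.
pets = re.findall(r"dog|cat", "a dog chased a cat")

print(literal.group())                 # abc
print(vowels)                          # ['e', 'e']
print(triple_a.group())                # aaa
print(starts_with_abc, ends_with_abc)  # True True
print(pets)                            # ['dog', 'cat']
```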

 

 

 

Python Regex for Data Cleaning

Now let’s see how we can use Python regex for data cleaning, starting with our first example.

 

 

Suppose we have a dataset that contains phone numbers in different formats, and we want to extract only the digits and ignore any other characters, such as parentheses or dashes.

To do this, we create a regular expression pattern that matches any sequence of one or more digits (\d+). After that, we use findall() to extract all substrings in the phone number that match this pattern. Lastly, we join the resulting list of strings to create a single string containing only the digits.
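The steps described above can be sketched roughly as follows. The sample phone numbers and the helper name digits_only are assumptions made for illustration, since the original dataset is not shown.

```python
import re

# Hypothetical sample data in a few different formats.
phone_numbers = ["(555) 123-4567", "555-123-4567", "555.123.4567"]


def digits_only(phone):
    # findall() returns every run of one or more digits as a list of strings.
    parts = re.findall(r"\d+", phone)
    # Join the pieces into a single string containing only the digits.
    return "".join(parts)


cleaned = [digits_only(p) for p in phone_numbers]
print(cleaned)  # ['5551234567', '5551234567', '5551234567']
```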

 

 

 


 

 

 

Sometimes a dataset contains email addresses with variations in the domain name, and we want to standardize all of the domain names to codeloop.org.
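One possible way to do this is with re.sub(), replacing everything from the @ sign to the end of the address. This is only a sketch: the sample addresses and the helper name standardize_domain are hypothetical, and it assumes every address in the dataset really should end up on codeloop.org.

```python
import re

# Hypothetical sample data with inconsistent domain names.
emails = ["alice@codeloop.com", "bob@code-loop.org", "carol@CODELOOP.net"]


def standardize_domain(email):
    # Match "@" followed by everything up to the end of the string
    # (no further "@" or whitespace) and replace it with the standard domain.
    return re.sub(r"@[^@\s]+$", "@codeloop.org", email.strip())


standardized = [standardize_domain(e) for e in emails]
print(standardized)
# ['alice@codeloop.org', 'bob@codeloop.org', 'carol@codeloop.org']
```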

 

 

 


 

 

 

Now let’s see how we can remove HTML tags. Suppose we have a dataset that contains text with HTML tags, and we want to remove all of the tags.

To do this, we create a regular expression pattern that matches any sequence of characters enclosed in angle brackets (< >). After that, we use sub() to replace every match with an empty string.
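A minimal sketch of this approach is shown below. The sample text is made up, and the pattern <[^>]+> is one common way to express “anything enclosed in angle brackets”; the original pattern is not shown, so it may have differed.

```python
import re

# Hypothetical sample text containing HTML tags.
html_text = "<p>Hello, <b>world</b>!</p>"

# "<", then one or more characters that are not ">", then ">".
tag_pattern = re.compile(r"<[^>]+>")

# sub() replaces every match with an empty string.
clean_text = tag_pattern.sub("", html_text)
print(clean_text)  # Hello, world!
```

This works well for simple markup. For complex or malformed HTML, a dedicated HTML parser is usually more reliable than a regular expression.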

 

 


 

 
