Python Machine Learning Preprocessing the Data

In the real world we deal with a lot of raw data. Machine learning algorithms in Python expect data to be formatted in a certain way before training starts. To prepare the data for ingestion by a machine learning algorithm, we have to preprocess it and convert it into the right format.

 

 

You can read my previous article about Python Machine Learning.

 

 

Preprocessing Data Types

This article covers four different kinds of data preprocessing:

  • Binarization
  • Mean Removal
  • Scaling
  • Normalization

 

 

Now let's look at an example. First, we need to import the required libraries.
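A minimal sketch of the imports, assuming the examples use NumPy for the data and the preprocessing module of scikit-learn (both are assumptions about the setup, and both packages must be installed):

```python
# NumPy holds the sample data as an array.
import numpy as np

# scikit-learn's preprocessing module provides binarize(), scale(),
# MinMaxScaler and normalize(), which the sections below rely on.
from sklearn import preprocessing
```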

 

 

After that, we define our data.
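As a stand-in, here is a small hypothetical dataset. The values are made up for illustration: each row is one sample and each column is one feature.

```python
import numpy as np

# Hypothetical sample data: 3 samples (rows) with 4 features (columns).
data = np.array([
    [3.0, -1.5,  2.0, -5.4],
    [0.0,  4.0, -0.3,  2.1],
    [1.0,  3.3, -1.9, -4.3],
])

print(data.shape)  # (3, 4)
```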

 

 

 

1: Binarization: Binarization converts numerical values to Boolean values. Scikit-learn has a built-in function for this called binarize(), which takes the input values and a threshold. Values greater than the threshold become 1 and all other values become 0.

 

 

 

Binarization Preprocessing Example
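A short sketch of binarization, assuming the hypothetical sample array below and an arbitrarily chosen threshold of 1.4:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical sample data (3 samples, 4 features).
data = np.array([
    [3.0, -1.5,  2.0, -5.4],
    [0.0,  4.0, -0.3,  2.1],
    [1.0,  3.3, -1.9, -4.3],
])

# Values strictly greater than the threshold become 1, all others become 0.
data_binarized = preprocessing.binarize(data, threshold=1.4)

print("Binarized data:\n", data_binarized)
# The rows become [1, 0, 1, 0], [0, 1, 0, 1] and [0, 1, 0, 0].
```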

 

 

 

The result will be

[Image: Python Machine Learning Preprocessing the Data]

 

 

2: Mean Removal: Removing the mean is a common preprocessing technique in machine learning. It is usually useful to subtract the mean from each feature so that the feature is centered on zero. We do this to remove bias from the features in our feature vector.

 

 

So first, let's determine the mean and standard deviation of the data.
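One way to compute both statistics per feature, sketched with NumPy on the hypothetical sample array (axis=0 means "down each column"):

```python
import numpy as np

# Hypothetical sample data (3 samples, 4 features).
data = np.array([
    [3.0, -1.5,  2.0, -5.4],
    [0.0,  4.0, -0.3,  2.1],
    [1.0,  3.3, -1.9, -4.3],
])

# Compute the statistics column by column, i.e. once per feature.
mean_before = data.mean(axis=0)
std_before = data.std(axis=0)

print("Mean =", mean_before)
print("Std deviation =", std_before)
```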

 

 

 

This will be the result

[Image: Python Mean Removal]

 

 

The lines of code above display the mean and standard deviation of the input data. Let's remove the mean:
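A sketch using scikit-learn's scale() function on the hypothetical sample array. Note that, with its default settings, scale() also divides each feature by its standard deviation, so the result has unit variance as well as zero mean:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical sample data (3 samples, 4 features).
data = np.array([
    [3.0, -1.5,  2.0, -5.4],
    [0.0,  4.0, -0.3,  2.1],
    [1.0,  3.3, -1.9, -4.3],
])

# Center every feature on zero and scale it to unit standard deviation.
data_scaled = preprocessing.scale(data)

# The means are now (numerically) zero and the standard deviations are one.
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))
```

If only centering is wanted, scale() accepts with_std=False.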

 

 

This is the result

[Image: Python Machine Learning Scaled Data]

 

 

 

3: Scaling: The features in our feature vector can take values over very different ranges. So it is important to scale the features so that the machine learning algorithm trains on a level playing field. We don't want any feature to be artificially large or small just because of the nature of the measurements.
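A sketch of Min-Max scaling with scikit-learn's MinMaxScaler, assuming the hypothetical sample array and a target range of 0 to 1 (the range is a choice made for this example):

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical sample data (3 samples, 4 features).
data = np.array([
    [3.0, -1.5,  2.0, -5.4],
    [0.0,  4.0, -0.3,  2.1],
    [1.0,  3.3, -1.9, -4.3],
])

# Rescale each feature so its minimum maps to 0 and its maximum maps to 1.
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_minmax = scaler.fit_transform(data)

print("Min-max scaled data:\n", data_minmax)
```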

 

 

 

This is the result

[Image: Python Machine Learning]

 

 

 

4: Normalization: We use normalization to modify the values in the feature vector so that we can measure them on a common scale. In machine learning, we use many different forms of normalization. Some of the most common forms modify the values so that they sum up to 1. L1 normalization, which refers to Least Absolute Deviations, makes sure that the sum of absolute values is 1 in each row. L2 normalization, which refers to least squares, makes sure that the sum of squares is 1 in each row.

In general, L1 normalization is considered more robust than L2 normalization because it is more resistant to outliers in the data. Data often contains outliers that we cannot do anything about, so we want techniques that can safely and effectively ignore them during the calculations. If we are solving a problem where outliers are important, then L2 normalization may be a better choice.

 

Here is an example of normalization:
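A sketch using scikit-learn's normalize() function on the hypothetical sample array, showing both the L1 and the L2 variant:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical sample data (3 samples, 4 features).
data = np.array([
    [3.0, -1.5,  2.0, -5.4],
    [0.0,  4.0, -0.3,  2.1],
    [1.0,  3.3, -1.9, -4.3],
])

# L1: the absolute values in each row sum to 1.
data_l1 = preprocessing.normalize(data, norm="l1")

# L2: the squared values in each row sum to 1.
data_l2 = preprocessing.normalize(data, norm="l2")

print("L1 normalized data:\n", data_l1)
print("L2 normalized data:\n", data_l2)
```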

 

 

 

 

This is the result

[Image: Python Machine Learning Data Normalized]

 

 

 

FAQs:

 

What is preprocessing data in machine learning?

Preprocessing data in machine learning refers to the process of preparing and cleaning the raw data before feeding it into a machine learning algorithm. It involves transforming the data into a format that is suitable for analysis, removing noise, handling missing values, scaling features, encoding categorical variables and many more.

 

 

What are the 5 major steps of data preprocessing?

The five major steps of data preprocessing are:

    1. Data Cleaning: Handling missing values, removing duplicates, and dealing with outliers.
    2. Data Transformation: Scaling features, encoding categorical variables and transforming data types.
    3. Data Reduction: Reducing dimensionality through techniques like PCA (Principal Component Analysis) or feature selection.
    4. Data Integration: Combining data from multiple sources into a single dataset.
    5. Data Normalization: Normalizing or standardizing numerical features to a common scale.
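The five steps above can be sketched on a toy example with pandas and scikit-learn. The tables, column names and values here are all hypothetical, and the order of the steps is adapted to this small case (the data is integrated first so there is a single table to clean):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two hypothetical data sources that share an "id" column.
measurements = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "height": [1.70, 1.82, 1.82, np.nan, 1.65],
    "weight": [68.0, 90.0, 90.0, 75.0, 55.0],
})
profiles = pd.DataFrame({"id": [1, 2, 3, 4], "group": ["a", "b", "a", "b"]})

# Data integration: combine both sources into a single dataset.
df = measurements.merge(profiles, on="id")

# Data cleaning: remove duplicate rows and fill the missing height with the mean.
df = df.drop_duplicates().reset_index(drop=True)
df["height"] = df["height"].fillna(df["height"].mean())

# Data transformation: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["group"])

# Data normalization: standardize the numerical features (zero mean, unit variance).
features = StandardScaler().fit_transform(df[["height", "weight"]])

# Data reduction: project the two numerical features onto one principal component.
reduced = PCA(n_components=1).fit_transform(features)

print(df)
print(reduced)
```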

 

 

How do I preprocess data in pandas Python?

You can preprocess data in pandas Python using different built-in functions and methods. Some common preprocessing tasks include:

  • Handling missing values: Using methods like fillna() or dropna() to handle missing data.
  • Encoding categorical variables: Using pd.get_dummies() for one-hot encoding or LabelEncoder from sklearn.preprocessing for label encoding.
  • Scaling numerical features: Using methods like Min-Max scaling or Z-score normalization.
  • Removing outliers: Filtering data based on statistical measures or using techniques like winsorization.
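The tasks in this list can be sketched with plain pandas. The DataFrame below is hypothetical, and the 5th and 95th percentile limits used for winsorization are an arbitrary choice for this example:

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions made for this sketch.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0, 29.0, 52.0],
    "city": ["Kabul", "Herat", "Kabul", None, "Herat", "Kabul"],
    "income": [3000.0, 4200.0, 100000.0, 3900.0, 3500.0, 4800.0],
})

# Handling missing values: fill numeric gaps with the column mean,
# then drop the rows that still lack a category.
df["age"] = df["age"].fillna(df["age"].mean())
df = df.dropna(subset=["city"]).copy()

# Treating outliers: winsorize income by clipping it to the 5th-95th percentile range.
low, high = df["income"].quantile([0.05, 0.95])
df["income"] = df["income"].clip(lower=low, upper=high)

# Scaling numerical features: Min-Max scaling to the range [0, 1].
income_range = df["income"].max() - df["income"].min()
df["income_scaled"] = (df["income"] - df["income"].min()) / income_range

# Encoding categorical variables: one-hot encode the city column.
df = pd.get_dummies(df, columns=["city"])

print(df)
```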

 

 

 

Which library is used for data preprocessing in Python?

The primary library used for data preprocessing in Python is pandas. Pandas provides powerful data manipulation and analysis tools, which makes it ideal for handling and preprocessing structured data. Libraries like NumPy, scikit-learn and TensorFlow also offer functionality for specific preprocessing tasks such as scaling, encoding, and handling missing values.
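As a small illustration of the scikit-learn side, the sketch below fills a missing value, standardizes a feature and label-encodes a categorical column. The input values are hypothetical:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical numeric feature with one missing value.
ages = np.array([[25.0], [np.nan], [40.0], [31.0]])

# Handling missing values: replace NaN with the mean of the column.
ages_filled = SimpleImputer(strategy="mean").fit_transform(ages)

# Scaling: standardize the feature to zero mean and unit variance.
ages_scaled = StandardScaler().fit_transform(ages_filled)

# Encoding: map each category to an integer (classes are sorted alphabetically).
colors = ["red", "green", "red", "blue"]
color_codes = LabelEncoder().fit_transform(colors)

print(ages_filled.ravel())
print(ages_scaled.ravel())
print(color_codes)  # [2 1 2 0]
```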
