Data Pre-Processing Steps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
•Data.csv has four column headers: Country, Age, Salary, and Purchased.
•The dataset contains information about some customers: their country, age, salary, and
whether the customer purchased the company's products.
•The first three columns, Country, Age, and Salary, are the independent variables.
•The dependent variable is the Purchased variable.
•It is the last column. In any machine learning model, we use some independent
variables to predict a dependent variable.
•So in this case, we are going to use the first 3 variables to predict whether the customer purchased
a product or not.
2. Importing the Dataset
Set Dependent Variables and
Independent variables (iloc examples)
• After importing the CSV, split the dataset into dependent
and independent variables
– iloc[row, column]
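The split described above can be sketched as follows. This is a minimal example, not the slides' exact code: the rows in `csv_text` are hypothetical stand-ins for Data.csv (only the four headers come from the slides), and the names `dataset`, `X`, and `y` are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for Data.csv (headers match the slides).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,,No
Germany,,61000,Yes
"""

# With the real file this would be: dataset = pd.read_csv("Data.csv")
dataset = pd.read_csv(io.StringIO(csv_text))

# iloc[row, column]: ":" selects every row, ":-1" every column except the last.
X = dataset.iloc[:, :-1].values  # independent variables: Country, Age, Salary
y = dataset.iloc[:, -1].values   # dependent variable: Purchased

print(X.shape, y.shape)
```

Using `.values` returns NumPy arrays rather than DataFrames, which is the form most scikit-learn examples work with.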
• Another idea is to take the mean of the columns, and replace the
missing data with the mean
# Taking care of missing Data
• from sklearn.impute import SimpleImputer
scikit-learn (sklearn) contains important tools for preprocessing any dataset
The transform method replaces the missing data with the mean of the column
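A minimal sketch of mean imputation with SimpleImputer, assuming a small hypothetical array of Age and Salary values (the numbers and the names `X_num`, `imputer`, and `X_filled` are illustrative and not taken from Data.csv):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Age and Salary columns with one missing value each.
X_num = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 60000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_num)                   # learns each column's mean, ignoring NaN
X_filled = imputer.transform(X_num)  # replaces each NaN with its column mean

print(imputer.statistics_)  # the learned column means
print(X_filled)
```

Only numeric columns should be passed to the imputer when the strategy is the mean; a text column such as Country would need a different strategy.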
In Case Of Transformers
• Transformers are for pre-processing the data
before modelling.
– fit(): This method goes through the training data,
calculates the parameters (like mean (μ) and standard
deviation (σ) in the StandardScaler class) and saves them
as internal attributes.
– transform(): The parameters generated by the
fit() method are now applied to the training
data to update it.
– fit_transform(): This method may be more
convenient and efficient, since it fits and
transforms the training data in one step.
https://medium.com/nerd-for-tech/difference-fit-transform-and-fit-transform-method-in-scikit-learn-b0a4efcab804
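The three methods can be compared on a small example. This sketch uses StandardScaler, the class named above; the training values and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training data.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()
scaler.fit(X_train)                   # computes and stores the mean and std
X_scaled = scaler.transform(X_train)  # applies (x - mean) / std

# fit_transform() performs both steps in a single call.
X_scaled_one_step = StandardScaler().fit_transform(X_train)

print(scaler.mean_, scaler.scale_)  # stored parameters
print(np.allclose(X_scaled, X_scaled_one_step))
```

The fitted parameters are kept on the object (`mean_` and `scale_` here), so the same `scaler` can later call transform() on new data without being fitted again.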
Handling Categorical Values
Outlier Detection
HOW TO ENCODE CATEGORICAL DATA
• In Data.csv, we see that we have two categorical variables, which are Country and Purchased
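One common way to encode these two variables is sketched below. The choice of encoders is an assumption rather than something the slides specify: one-hot encoding for Country (no implied order between countries) and label encoding for the Purchased target. The sample values and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical values for the two categorical columns.
country = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])  # 2-D: one feature
purchased = np.array(["No", "Yes", "No", "Yes"])                     # 1-D: target

# One-hot encoding: one 0/1 column per country, ordered alphabetically.
country_encoded = OneHotEncoder().fit_transform(country).toarray()

# Label encoding: classes are sorted, so "No" becomes 0 and "Yes" becomes 1.
purchased_encoded = LabelEncoder().fit_transform(purchased)

print(country_encoded)
print(purchased_encoded)
```

OneHotEncoder returns a sparse matrix by default, which is why the sketch calls `.toarray()` before printing.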