
Data Pre-processing Steps

What is Data Pre-processing?


• Data pre-processing is the process of converting
raw data into a form suitable for analysis and modelling.
1. Python Libraries for Machine
Learning
• NumPy & SciPy: scientific computation
• Pandas: data handling/import
• Matplotlib: creating graphs
• Scikit-learn (sklearn): machine learning algorithms and pre-processing utilities

import numpy as np                 # numerical arrays and maths
import pandas as pd                # loading and handling the dataset
import matplotlib.pyplot as plt   # plotting
• When you open Data.csv, the headers are Country, Age, Salary, and Purchased.

• The dataset contains information about some customers: the country, the age, the salary, and
whether the customer purchased the company's products.

• The dataset contains independent and dependent variables.

• The first three columns, Country, Age, and Salary, are the independent variables.
• The dependent variable is Purchased, the last column.

• In any machine learning model we use the independent
variables to predict a dependent variable.

• So in this case, we are going to use the first three variables to predict whether the customer
purchased a product or not (see the loading sketch below).
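
As a quick sanity check, the file can be loaded and inspected with Pandas. A minimal sketch, assuming Data.csv is in the working directory:

import pandas as pd

dataset = pd.read_csv('Data.csv')   # load the raw CSV into a DataFrame
print(dataset.head())               # first rows: Country, Age, Salary, Purchased
print(dataset.isnull().sum())       # count of missing values per column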
2. Importing the Dataset
Set Dependent and
Independent Variables (iloc examples)
• After importing the CSV, split the dataset into dependent
and independent variables
– iloc[row, column]

dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:, :-1].values   # creating the independent variable matrix

y = dataset.iloc[:, -1].values    # creating the dependent variable vector


The ':' before the comma stands for the rows we want to include, and the part after the
comma stands for the columns we want to include.
When only ':' (colon) is used, it means that all the rows/columns are included.
Here we need to include all the rows (:) and all the columns but the last one (:-1).
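
To make the [row, column] indexing concrete, here is an illustrative sketch on a hypothetical toy DataFrame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

print(df.iloc[0, 1])           # row 0, column 1 -> 4
print(df.iloc[:, :-1].values)  # all rows, all columns but the last -> [[1 4] [2 5] [3 6]]
print(df.iloc[:, -1].values)   # all rows, last column only -> [7 8 9]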
3. Handling Missing Values
• Drop the rows containing missing values,
or
• Fill the missing values with the mean, median, or mode (both options are shown in the sketch below).
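
In plain Pandas, both strategies are one-liners. A minimal sketch, reusing the dataset loaded above:

# Option 1: drop every row that has at least one missing value
cleaned = dataset.dropna()

# Option 2: fill missing numeric values with the column mean
filled = dataset.fillna({'Age': dataset['Age'].mean(),
                         'Salary': dataset['Salary'].mean()})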
Taking Care of Missing Data

• In Data.csv, there are two missing values.


• There is one missing value in Age for Spain, and another missing
value in Salary for Germany.
• The first idea is to remove the rows of the observations where
some data is missing.
• But if this dataset contains crucial information, it would be
quite dangerous to remove an observation.

• Another idea is to take the mean of each column and replace the
missing data with that mean.
# Taking care of missing data
• from sklearn.impute import SimpleImputer
sklearn contains important libraries to preprocess any dataset.

SimpleImputer will allow us to take care of any missing data.

• imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

For the first parameter, we pass missing_values=np.nan, the placeholder for missing entries.


For the second parameter, the strategy is to replace the missing value with the mean.

• imputer = imputer.fit(X[:, 1:3])


1 is the lower-bound column index that is included, and 3 is the upper-bound column index that is excluded, so columns 1 and 2 (Age and Salary) are selected.

• X[:, 1:3] = imputer.transform(X[:, 1:3])

The transform method replaces the missing data with the mean of each column.
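
Since we fit and transform the same columns, the two calls can also be combined into one with fit_transform:

X[:, 1:3] = imputer.fit_transform(X[:, 1:3])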
In Case of Transformers
• Transformers are for pre-processing the data
before modelling.
– fit(): This method goes through the training data,
calculates the parameters (like the mean (μ) and standard
deviation (σ) in the StandardScaler class) and saves them
as internal state.
– transform(): The parameters learned by the
fit() method are now applied to transform the
data.
– fit_transform(): This method is more
convenient and efficient, fitting and
transforming the training data in one step.

https://medium.com/nerd-for-tech/difference-fit-transform-and-fit-transform-method-in-scikit-learn-b0a4efcab804
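
A minimal sketch of the three methods with StandardScaler (the arrays are hypothetical):

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test  = np.array([[2.0], [4.0]])

sc = StandardScaler()
sc.fit(train)                       # learns mu = 2.0 and sigma ≈ 0.816 from the training data
train_scaled = sc.transform(train)  # applies the learned mu and sigma
test_scaled  = sc.transform(test)   # same parameters, so nothing leaks from the test set

train_scaled = sc.fit_transform(train)  # fit + transform in one call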
Handling Categorical Values
Outlier Detection
How to Encode Categorical Data

• In Data.csv, we have two categorical variables: Country and Purchased.

• Their values contain categories rather than numbers.

• So we need to encode this text into numbers.

# Encoding categorical data:


# Encoding the independent variable..

from sklearn.compose import ColumnTransformer


from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))   # one-hot encode column 0 (Country), keep the rest
print(X)

1st parameter: a list of (name, transformer, columns) tuples; here the OneHotEncoder is applied to column 0 (Country).


2nd parameter: by specifying remainder='passthrough', all remaining
columns that were not specified in transformers are kept as they are
(an illustrative sketch of the encoder's output follows below).
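
To see what the encoder produces, here is an illustrative sketch on a hypothetical Country column:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([['France'], ['Spain'], ['Germany'], ['France']])
enc = OneHotEncoder()
print(enc.fit_transform(countries).toarray())
# Categories are sorted alphabetically (France, Germany, Spain):
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]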
# Encoding the dependent variable..

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
y = le.fit_transform(y)   # e.g. 'No' -> 0, 'Yes' -> 1
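
A tiny illustrative example of what LabelEncoder does (hypothetical values):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['No', 'Yes', 'No', 'Yes']))  # [0 1 0 1] -- classes are sorted alphabetically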
# Splitting the dataset..

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=20)
print(X_train)
# print(X_test)
print(y_train)
print(y_test)
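
test_size=0.2 sends 20% of the observations to the test set, and random_state fixes the shuffle so the same split is reproducible across runs. A quick sanity check, reusing the arrays built above:

print(X_train.shape, X_test.shape)  # 80% vs 20% of the rows
print(y_train.shape, y_test.shape)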
Feature Scaling
• A technique to bring the independent
features in the data onto a comparable scale.
• For example, scale each feature into a fixed range such as [0, 1] or [-1, 1], or to zero mean and unit variance.
Why Feature Scaling Is Important
• Features measured on very different scales (e.g. Age in tens versus Salary in tens of thousands) can dominate distance-based algorithms and slow down gradient descent, so we bring them to a comparable range.
Types of Feature Scaling
• Standardization, also called z-score
normalization.
• Normalization
– Min-max scaler
– Robust scaler
Normalization

• Values are scaled between 0 and 1.


• Types of normalization (the sketch below works out the min-max and z-score formulas):
– Min-max scaling
– Mean normalization
– Max absolute scaling
– Robust scaling
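
The two most common formulas, written out in NumPy (an illustrative sketch on a hypothetical array):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization (z-score): x' = (x - mean) / std  -> mean 0, variance 1
z = (x - x.mean()) / x.std()

# Min-max scaling: x' = (x - min) / (max - min)  -> values in [0, 1]
mm = (x - x.min()) / (x.max() - x.min())

print(z)   # ≈ [-1.342 -0.447  0.447  1.342]
print(mm)  # [0.    0.333 0.667 1.   ]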
# Feature scaling: apply StandardScaler to the numeric columns of X_train and X_test..

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
# Columns 0-2 hold the one-hot encoded Country dummies, so scale only Age and Salary
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])   # fit on the training set only
X_test[:, 3:] = sc.transform(X_test[:, 3:])         # reuse the training mean and std
print(X_train)
• https://www.youtube.com/watch?v=x08AN87G0mg
