4 Data Preprocessing

Uploaded by

umadataengg

Machine Learning

Data Preprocessing

• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and a crucial step in creating a machine learning model.
• Before performing any operation on data, the data must be cleaned and put into a consistent format.
Why do we need Data Preprocessing?

• Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.
• Data preprocessing cleans the data and makes it suitable for a machine learning model, which also improves the accuracy and efficiency of the model.
Why do we need Data Preprocessing?
Data preprocessing involves the following steps:
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
1) Get the Dataset

• To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data.
• Data collected for a particular problem and stored in a proper format is known as a dataset.
• A dataset may come in different formats, for example:
• CSV file
• HTML file
• xlsx file
2) Importing Libraries

• To perform data preprocessing in Python, we need to import some predefined Python libraries.
• Each library is used for a specific job.
• Numpy: numerical arrays and mathematical operations.
• Matplotlib: plotting charts.
• Pandas: importing and managing datasets.
• import numpy as np
• import pandas as pd
3) Importing the Datasets

• Now we need to import the dataset that we have collected for our machine learning project.
• df = pd.read_csv('Dataset.csv')
Extracting dependent and independent variables:

• X = df.iloc[:, :-1].values
• y = df.iloc[:, -1].values
• print(X)
• print(y)
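The import and extraction steps above can be tried without a file on disk. The sketch below is only an illustration: the CSV text is a made-up stand-in for 'Dataset.csv', and the column names are assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'Dataset.csv'; the values are invented for illustration.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

# read_csv accepts any file-like object, so a string buffer works like a file.
df = pd.read_csv(io.StringIO(csv_text))

# Every column except the last holds the independent variables (features).
X = df.iloc[:, :-1].values
# The last column holds the dependent variable (target).
y = df.iloc[:, -1].values

print(X.shape)  # (3, 3)
print(y)
```

Here X has three rows and three feature columns, while y holds the three Purchased labels.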
4) Handling Missing data

• The next step of data preprocessing is to handle missing data in the dataset.
• Missing data in the dataset can cause serious problems for a machine learning model.
• Hence, the missing values present in the dataset must be handled.
Ways to handle missing data:

• There are mainly two ways to handle missing data:
• By deleting the particular row: This is the most common way of dealing with null values.
• We simply delete the row or column that contains null values.
• However, this approach is not very effective: removing data loses information, which can make the output less accurate.
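A minimal sketch of the deletion approach, assuming the data is held in a pandas DataFrame. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing Salary.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Drop every row that contains at least one missing value.
rows_kept = df.dropna(axis=0)

# Drop every column that contains at least one missing value.
cols_kept = df.dropna(axis=1)

print(len(df), "rows before,", len(rows_kept), "rows after")
print(list(cols_kept.columns))
```

In this small example, dropping rows discards half of the data and dropping columns discards all of it, which is the loss of information described above.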
Ways to handle missing data:

• By calculating the mean: We calculate the mean of the column (or row) that contains a missing value and put it in place of the missing value.
• This strategy is useful for features that hold numeric data, such as age, salary, or year. Here, we will use this approach.
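To see what mean replacement does to a single column, here is a small sketch done by hand with pandas. The Age values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical Age column with one missing value.
df = pd.DataFrame({"Age": [40.0, 30.0, np.nan, 50.0]})

# mean() skips missing values: (40 + 30 + 50) / 3 = 40.0
age_mean = df["Age"].mean()

# Put the column mean in place of the missing value.
df["Age"] = df["Age"].fillna(age_mean)

print(df["Age"].tolist())  # [40.0, 30.0, 40.0, 50.0]
```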
4) Handling Missing data:

• To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for building machine learning models.
• Here we will use the SimpleImputer class from the sklearn.impute module.
• Below is the code for it:
4) Handling Missing data:

• from sklearn.impute import SimpleImputer
• imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
• imputer.fit(X[:, 1:3])
• X[:, 1:3] = imputer.transform(X[:, 1:3])
• print(X)
5) Encoding Categorical data:
• Categorical data is data that takes a limited set of categories.
• In our dataset there are two categorical variables: Country and Purchased.
• A machine learning model works entirely on mathematics and numbers, so a categorical variable left as text may cause trouble while building the model.
• It is therefore necessary to encode these categorical variables as numbers.
5) Encoding Categorical data

• For Country variable:
• First, we convert the country names into numeric codes.
• To do this, we will use the LabelEncoder class from the sklearn.preprocessing module.
5) Encoding Categorical data

• # Encoding the Country column (the first column of X)
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• X[:, 0] = le.fit_transform(X[:, 0])
• print(X)
5) Encoding Categorical data
• Explanation:
• In the above code, we imported the LabelEncoder class from the sklearn library.
• This class encodes the values as integers.
• In our case there are three countries, and they are encoded as 0, 1, and 2.
• From these values, the machine learning model may assume that the countries have some order or numeric relationship, which would produce wrong output.
• To avoid this issue, we will use dummy encoding.
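The integer codes described above can be reproduced with a short sketch. The list of countries is a hypothetical sample; LabelEncoder assigns codes in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Country column.
countries = ["France", "Spain", "Germany", "Spain", "France"]

le = LabelEncoder()
codes = le.fit_transform(countries)

# Classes are stored in sorted order, and each code is an index into that list.
print(le.classes_.tolist())  # ['France', 'Germany', 'Spain']
print(codes.tolist())        # [0, 2, 1, 2, 0]
```

The codes suggest that Spain (2) is somehow "greater" than France (0), which is the misleading relationship that dummy encoding avoids.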
5) Encoding Categorical data:

• Dummy Variables:
• Dummy variables are variables that take only the values 0 or 1.
• A value of 1 marks the presence of a category in its column, and the remaining columns are 0.
• With dummy encoding, the number of columns equals the number of categories.
5) Encoding Categorical data:

• In our dataset, Country has 3 categories, so dummy encoding produces three columns containing 0 and 1 values.
• For dummy encoding, we will use the OneHotEncoder class from the sklearn.preprocessing module.
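A minimal sketch of dummy encoding on its own, assuming a single hypothetical Country column. OneHotEncoder returns a sparse matrix by default, so toarray() is called to view it as a regular array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical Country column; OneHotEncoder expects a 2D input.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
dummies = encoder.fit_transform(countries).toarray()

# One column per category, in sorted order.
print(encoder.categories_[0].tolist())  # ['France', 'Germany', 'Spain']
print(dummies)
```

Each row contains exactly one 1, in the column of its country, so three categories give three columns.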
5) Encoding Categorical data:

• # Encoding the Independent Variable


• from sklearn.compose import ColumnTransformer
• from sklearn.preprocessing import OneHotEncoder
• ct = ColumnTransformer(transformers=[('encoder',
OneHotEncoder(), [0])], remainder='passthrough')
• X = np.array(ct.fit_transform(X))
• print(X)
5) Encoding Categorical data:

• For Purchased Variable:


• labelencoder_y = LabelEncoder()
• y = labelencoder_y.fit_transform(y)
• For the second categorical variable, we only use a labelencoder object of the LabelEncoder class.
• Here we do not use the OneHotEncoder class because the Purchased variable has only two categories, yes and no, which are encoded directly as 0 and 1.
5) Encoding Categorical data
• # Encoding the Dependent Variable
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• y = le.fit_transform(y)
• print(y)
6) Splitting the dataset into the Training set and Test set

• from sklearn.model_selection import train_test_split


• X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.2, random_state = 1)
• print(X_train)
• print(X_test)
• print(y_train)
• print(y_test)
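To make the effect of test_size concrete, here is a small sketch on hypothetical data with 10 rows. With test_size = 0.2, two rows go to the test set and eight to the training set. A fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```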
7) Feature Scaling

• Feature scaling is a technique for bringing the independent variables of a dataset into a common range.
• In feature scaling, we put our variables in the same range and on the same scale, so that no variable dominates the others.
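One common form of feature scaling is standardization: subtract the column mean and divide by the column standard deviation. The sketch below does this by hand with NumPy on hypothetical age and salary values, to show how two columns on very different scales end up comparable.

```python
import numpy as np

# Hypothetical features: age (tens) and salary (tens of thousands).
data = np.array([
    [25.0, 40000.0],
    [35.0, 60000.0],
    [45.0, 80000.0],
])

# Standardize each column: z = (x - mean) / standard deviation.
mean = data.mean(axis=0)
std = data.std(axis=0)
scaled = (data - mean) / std

print(scaled)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns hold the same values, so salary no longer dominates age just because its raw numbers are larger.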
7) Feature Scaling

• For feature scaling, we import the StandardScaler class from the sklearn.preprocessing module.
• We then create a StandardScaler object for the independent variables (features).
• Next, we fit the scaler on the training set and transform it.
• For the test set, we call transform() only, not fit_transform(), so that the test data is scaled with the mean and standard deviation already learned from the training set.
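The difference between fit_transform() and transform() can be checked on a tiny example. The numbers below are hypothetical. The point of the sketch is that the test set is scaled with the statistics of the training set, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and test sets.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[20.0], [40.0]])

sc = StandardScaler()
# fit_transform learns the mean and standard deviation from the training set.
X_train_scaled = sc.fit_transform(X_train)
# transform reuses those training statistics on the test set.
X_test_scaled = sc.transform(X_test)

print(sc.mean_)                 # mean learned from X_train
print(X_test_scaled.ravel())

# The same result computed by hand from the training statistics.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_scaled, manual))  # True
```

The test value 20.0 maps to 0 because it equals the training mean, even though the mean of the test set itself is 30.0.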
7) Feature Scaling

• from sklearn.preprocessing import StandardScaler


• sc = StandardScaler()
• X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
• X_test[:, 3:] = sc.transform(X_test[:, 3:])
• print(X_train)
• print(X_test)
