4 Data Preprocessing

Uploaded by

umadataengg

Machine Learning

Data Preprocessing

• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first, and a crucial, step in creating a machine learning model.
• Before doing any operation on data, the data must be cleaned and put into a consistent format.
Why do we need Data Preprocessing?

• Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.
• Data preprocessing cleans the data and makes it suitable for a machine learning model.
• This can also improve the accuracy and efficiency of the model.
Steps of Data Preprocessing
Data preprocessing involves the following steps:
• Getting the dataset
• Importing libraries
• Importing datasets
• Handling missing data
• Encoding categorical data
• Splitting the dataset into training and test sets
• Feature scaling
1) Get the Dataset

• To create a machine learning model, the first thing we require is a dataset, because a machine learning model works entirely on data.
• Data collected for a particular problem and stored in a proper format is known as the dataset.
• A dataset may come in different file formats, such as:
• a CSV file,
• an HTML file, or
• an xlsx file.
2) Importing Libraries

• To perform data preprocessing in Python, we need to import some predefined Python libraries.
• Each library is used for a specific job.
• Numpy: numerical arrays and mathematical operations.
• Matplotlib: plotting charts.
• Pandas: importing and managing datasets.
• import numpy as np
• import pandas as pd
3) Importing the Datasets

• Now we need to import the dataset that we have collected for our machine learning project.
• df = pd.read_csv('Dataset.csv')
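As a minimal sketch of this step, the snippet below reads a small CSV from an in-memory string instead of a file, so it runs without 'Dataset.csv' on disk. The column names (Country, Age, Salary, Purchased) and the three rows are illustrative assumptions, not the contents of the actual dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'Dataset.csv' (assumed columns).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,,Yes
Germany,,54000,No
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (number of rows, number of columns)
print(df.isna().sum())  # count of missing values in each column
```

Empty fields in the CSV are read as NaN, which is how missing values show up in the later steps.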
Extracting dependent and independent variables:

• X = df.iloc[:, :-1].values
• y = df.iloc[:, -1].values
• print(X)
• print(y)
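To make the slicing above concrete, here is a small self-contained sketch. It assumes, as the code above does, that the last column is the dependent variable and all other columns are independent variables. The DataFrame contents are hypothetical.

```python
import pandas as pd

# Hypothetical dataset: the last column is the target, the rest are features.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

X = df.iloc[:, :-1].values  # all rows, every column except the last
y = df.iloc[:, -1].values   # all rows, only the last column

print(X.shape)  # (3, 3)
print(y.shape)  # (3,)
```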
4) Handling Missing data

• The next step of data preprocessing is to handle missing data in the dataset.
• Missing data can cause serious problems for a machine learning model; many algorithms cannot work with missing values at all.
• Hence it is necessary to handle the missing values present in the dataset.
Ways to handle missing data:

• There are mainly two ways to handle missing data:
• By deleting the particular row: this is a common way to deal with null values.
• Here we simply delete the specific row or column that contains null values.
• However, removing data may lead to a loss of information, which can make the model's output less accurate.
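A minimal sketch of the deletion approach, using pandas `dropna` on a hypothetical two-column table. It also shows the information loss mentioned above: dropping rows halves the data, and dropping columns leaves nothing because every column has a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

rows_dropped = df.dropna(axis=0)  # drop every row that has a missing value
cols_dropped = df.dropna(axis=1)  # drop every column that has a missing value

print(len(df), "->", len(rows_dropped), "rows")
print(list(cols_dropped.columns))
```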
Ways to handle missing data:

• By calculating the mean: here we calculate the mean of the column or row that contains a missing value and put it in place of the missing value.
• This strategy is useful for features that hold numeric data, such as age, salary, or year.
• We will use this approach here.
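The idea can be sketched by hand with NumPy before using a library class. The `age` values are hypothetical; the mean is taken over the non-missing entries only and then written into the gap.

```python
import numpy as np

# Hypothetical numeric column with one missing entry.
age = np.array([44.0, 27.0, np.nan, 38.0])

# Mean of the non-missing values only (NaN entries are ignored).
col_mean = np.nanmean(age)

# Replace each NaN with the column mean; keep all other values unchanged.
filled = np.where(np.isnan(age), col_mean, age)

print(col_mean)
print(filled)
```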
4) Handling Missing data:

• To handle missing values, we will use the Scikit-learn library, which contains various tools for building machine learning models.
• Here we will use the SimpleImputer class from the sklearn.impute module.
• Below is the code for it:
4) Handling Missing data:

• from sklearn.impute import SimpleImputer
• imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
• imputer.fit(X[:, 1:3])
• X[:, 1:3] = imputer.transform(X[:, 1:3])
• print(X)
5) Encoding Categorical data:
• Categorical data is data that takes one of a fixed set of categories.
• In our dataset there are two categorical variables: Country and Purchased.
• A machine learning model works entirely on mathematics and numbers, so a categorical variable in the dataset may cause trouble while building the model.
• It is therefore necessary to encode these categorical variables into numbers.
5) Encoding Categorical data

• For the Country variable:
• First, we will convert the country names into numeric codes.
• To do this, we will use the LabelEncoder class from the sklearn.preprocessing module.
5) Encoding Categorical data

• # Label-encoding the Country column (column 0 of X)
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• country_codes = le.fit_transform(X[:, 0])
• print(country_codes)
5) Encoding Categorical data
• Explanation:
• In the above code, we imported the LabelEncoder class from the sklearn library.
• This class encodes the categories as integers.
• In our case there are three countries, and they are encoded as 0, 1, and 2.
• From these values, the machine learning model may assume that there is some order or numeric relationship between the countries, which can produce wrong output.
• To remove this issue, we will use dummy encoding.
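The following sketch illustrates the issue on a hypothetical list of countries. LabelEncoder assigns integers according to the sorted order of the category names, so the codes carry an ordering that has no real meaning.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical country column.
countries = np.array(["France", "Spain", "Germany", "Spain"])

le = LabelEncoder()
codes = le.fit_transform(countries)

print(le.classes_)  # categories, in sorted order
print(codes)        # integer code assigned to each entry
```

Here Spain receives a larger code than Germany only because of alphabetical order, which is the kind of artificial relationship a model could pick up.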
5) Encoding Categorical data:

• Dummy Variables:
• Dummy variables are variables that take only the values 0 or 1.
• A value of 1 marks the presence of a category in its column, and the columns for the other categories are 0.
• With dummy encoding, the number of columns equals the number of categories.
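Dummy encoding can be sketched by hand with NumPy to show what the resulting columns look like. This is an illustration on hypothetical values, not the library routine itself: each row gets a single 1 in the column of its category.

```python
import numpy as np

# Hypothetical country column with three distinct categories.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# categories: sorted unique values; codes: index of each entry in categories.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the dummy vector for category i.
dummies = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(dummies)
```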
5) Encoding Categorical data:

• In our dataset we have 3 categories, so dummy encoding will produce three columns holding 0 and 1 values.
• For dummy encoding, we will use the OneHotEncoder class from the sklearn.preprocessing module.
5) Encoding Categorical data:

• # Encoding the Independent Variable

• from sklearn.compose import ColumnTransformer
• from sklearn.preprocessing import OneHotEncoder
• ct = ColumnTransformer(transformers=[('encoder',
OneHotEncoder(), [0])], remainder='passthrough')
• X = np.array(ct.fit_transform(X))
• print(X)
5) Encoding Categorical data:

• For the Purchased variable:
• For the second categorical variable, we will only use a LabelEncoder object.
• We do not use the OneHotEncoder class here because the Purchased variable has only two categories, yes and no, which are encoded directly as 0 and 1.
• # Encoding the Dependent Variable
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• y = le.fit_transform(y)
• print(y)
6) Splitting the dataset into the Training set and Test set

• from sklearn.model_selection import train_test_split

• X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.2, random_state = 1)
• print(X_train)
• print(X_test)
• print(y_train)
• print(y_test)
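A small self-contained sketch of the split above, on hypothetical arrays of 10 samples. With test_size = 0.2, 20% of the samples go to the test set, and a fixed random_state makes the shuffle repeatable between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 hypothetical samples with 2 features each; row i is [2*i, 2*i + 1].
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows
print(y_train.shape, y_test.shape)
```

Features and labels are shuffled together, so each row of X_train still lines up with its entry in y_train.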
7) Feature Scaling

• Feature scaling is a technique for bringing the independent variables of the dataset onto a common scale.
• In feature scaling, we put our variables in the same range and on the same scale, so that no variable dominates the others simply because its values are numerically larger.
7) Feature Scaling

• For feature scaling, we will use the StandardScaler class from the sklearn.preprocessing module.
• First, we create a StandardScaler object for the independent variables (features).
• Then we fit it on the training dataset and transform the training dataset.
• For the test dataset, we apply the transform() function directly instead of fit_transform(), because the scaler has already been fitted on the training set and the test set should be scaled with the same parameters.
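The fit-on-train, transform-on-test pattern can be sketched by hand with NumPy. Standardization subtracts the mean and divides by the standard deviation; both are computed from the training data only and then reused for the test data. The arrays below are hypothetical age and salary values.

```python
import numpy as np

# Hypothetical numeric features: [age, salary].
X_train = np.array([[20.0, 30000.0], [40.0, 50000.0], [60.0, 70000.0]])
X_test = np.array([[50.0, 40000.0]])

# "Fit": learn the per-column mean and standard deviation from training data.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": apply the same training statistics to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled)
print(X_test_scaled)
```

After scaling, both columns of the training set have mean 0 and standard deviation 1, so the salary column no longer outweighs the age column.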
7) Feature Scaling

• from sklearn.preprocessing import StandardScaler

• sc = StandardScaler()
• X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
• X_test[:, 3:] = sc.transform(X_test[:, 3:])
• print(X_train)
• print(X_test)
