
Data Pre-processing Steps

What is Data Pre-processing?


• Data pre-processing is the process of converting
raw data into a form suitable for analysis and modelling.
1. Python Libraries for Machine
Learning
• NumPy & SciPy: scientific computation
• Pandas: data handling/import
• Matplotlib: creating graphs
• Scikit-learn (sklearn): machine learning algorithms and pre-processing utilities

import numpy as np                 # numerical arrays and maths
import pandas as pd                # loading and handling the dataset
import matplotlib.pyplot as plt   # plotting
• When you open Data.csv, the headers are Country, Age, Salary, and Purchased.

• The dataset contains information about some customers: the country, the age, the salary, and
whether the customer purchased the company's products.

• The dataset contains independent and dependent variables.

• The first three columns, Country, Age, and Salary, are the independent variables.
• The dependent variable is Purchased, the last column.

• In any machine learning model we use the independent
variables to predict a dependent variable.

• So in this case, we are going to use the first three variables to predict whether the customer
purchased a product or not (see the loading sketch below).
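
As a quick sanity check, the file can be loaded and inspected with Pandas. A minimal sketch, assuming Data.csv is in the working directory:

import pandas as pd

dataset = pd.read_csv('Data.csv')   # load the raw CSV into a DataFrame
print(dataset.head())               # first rows: Country, Age, Salary, Purchased
print(dataset.isnull().sum())       # count of missing values per column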
2. Importing the Dataset
Set Dependent and
Independent Variables (iloc examples)
• After importing the CSV, split the dataset into dependent
and independent variables
– iloc[row, column]

dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:, :-1].values   # creating the independent variable matrix

y = dataset.iloc[:, -1].values    # creating the dependent variable vector


The ':' before the comma stands for the rows we want to include, and the part after the
comma stands for the columns we want to include.
When only ':' (colon) is used, it means that all the rows/columns are included.
Here we need to include all the rows (:) and all the columns but the last one (:-1).
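
To make the [row, column] indexing concrete, here is an illustrative sketch on a hypothetical toy DataFrame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

print(df.iloc[0, 1])           # row 0, column 1 -> 4
print(df.iloc[:, :-1].values)  # all rows, all columns but the last -> [[1 4] [2 5] [3 6]]
print(df.iloc[:, -1].values)   # all rows, last column only -> [7 8 9]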
3. Handling Missing Values
• Drop the rows containing missing values,
or
• Fill the missing values with the mean, median, or mode (both options are shown in the sketch below).
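
In plain Pandas, both strategies are one-liners. A minimal sketch, reusing the dataset loaded above:

# Option 1: drop every row that has at least one missing value
cleaned = dataset.dropna()

# Option 2: fill missing numeric values with the column mean
filled = dataset.fillna({'Age': dataset['Age'].mean(),
                         'Salary': dataset['Salary'].mean()})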
Taking Care of Missing Data

• In Data.csv, there are two missing values.


• There is one missing value in Age for Spain, and another missing
value in Salary for Germany.
• The first idea is to remove the rows of the observations where
some data is missing.
• But if this dataset contains crucial information, it would be
quite dangerous to remove an observation.

• Another idea is to take the mean of each column and replace the
missing data with that mean.
# Taking care of missing data
• from sklearn.impute import SimpleImputer
sklearn contains important libraries to preprocess any dataset.

SimpleImputer will allow us to take care of any missing data.

• imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

For the first parameter, we pass missing_values=np.nan, the placeholder for missing entries.


For the second parameter, the strategy is to replace the missing value with the mean.

• imputer = imputer.fit(X[:, 1:3])


1 is the lower-bound column index that is included, and 3 is the upper-bound column index that is excluded, so columns 1 and 2 (Age and Salary) are selected.

• X[:, 1:3] = imputer.transform(X[:, 1:3])

The transform method replaces the missing data with the mean of each column.
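
Since we fit and transform the same columns, the two calls can also be combined into one with fit_transform:

X[:, 1:3] = imputer.fit_transform(X[:, 1:3])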
In Case of Transformers
• Transformers are for pre-processing the data
before modelling.
– fit(): This method goes through the training data,
calculates the parameters (like the mean (μ) and standard
deviation (σ) in the StandardScaler class) and saves them
as internal state.
– transform(): The parameters learned by the
fit() method are now applied to transform the
data.
– fit_transform(): This method is more
convenient and efficient, fitting and
transforming the training data in one step.

https://medium.com/nerd-for-tech/difference-fit-transform-and-fit-transform-method-in-scikit-learn-b0a4efcab804
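
A minimal sketch of the three methods with StandardScaler (the arrays are hypothetical):

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test  = np.array([[2.0], [4.0]])

sc = StandardScaler()
sc.fit(train)                       # learns mu = 2.0 and sigma ≈ 0.816 from the training data
train_scaled = sc.transform(train)  # applies the learned mu and sigma
test_scaled  = sc.transform(test)   # same parameters, so nothing leaks from the test set

train_scaled = sc.fit_transform(train)  # fit + transform in one call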
Handling Categorical Values
Outlier Detection
How to Encode Categorical Data

• In Data.csv, we have two categorical variables: Country and Purchased.

• Their values contain categories rather than numbers.

• So we need to encode this text into numbers.

# Encoding categorical data:


# Encoding the independent variable..

from sklearn.compose import ColumnTransformer


from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))   # one-hot encode column 0 (Country), keep the rest
print(X)

1st parameter: a list of (name, transformer, columns) tuples; here the OneHotEncoder is applied to column 0 (Country).


2nd parameter: by specifying remainder='passthrough', all remaining
columns that were not specified in transformers are kept as they are
(an illustrative sketch of the encoder's output follows below).
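
To see what the encoder produces, here is an illustrative sketch on a hypothetical Country column:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([['France'], ['Spain'], ['Germany'], ['France']])
enc = OneHotEncoder()
print(enc.fit_transform(countries).toarray())
# Categories are sorted alphabetically (France, Germany, Spain):
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]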
# Encoding the dependent variable..

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
y = le.fit_transform(y)   # e.g. 'No' -> 0, 'Yes' -> 1
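
A tiny illustrative example of what LabelEncoder does (hypothetical values):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['No', 'Yes', 'No', 'Yes']))  # [0 1 0 1] -- classes are sorted alphabetically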
# Splitting the dataset..

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=20)
print(X_train)
# print(X_test)
print(y_train)
print(y_test)
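
test_size=0.2 sends 20% of the observations to the test set, and random_state fixes the shuffle so the same split is reproducible across runs. A quick sanity check, reusing the arrays built above:

print(X_train.shape, X_test.shape)  # 80% vs 20% of the rows
print(y_train.shape, y_test.shape)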
Feature Scaling
• A technique to bring the independent
features in the data onto a comparable scale.
• For example, scale each feature into a fixed range such as [0, 1] or [-1, 1], or to zero mean and unit variance.
Why Feature Scaling Is Important
• Features measured on very different scales (e.g. Age in tens versus Salary in tens of thousands) can dominate distance-based algorithms and slow down gradient descent, so we bring them to a comparable range.
Types of Feature Scaling
• Standardization, also called z-score
normalization.
• Normalization
– Min-max scaler
– Robust scaler
Normalization

• Values are scaled between 0 and 1.


• Types of normalization (the sketch below works out the min-max and z-score formulas):
– Min-max scaling
– Mean normalization
– Max absolute scaling
– Robust scaling
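
The two most common formulas, written out in NumPy (an illustrative sketch on a hypothetical array):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization (z-score): x' = (x - mean) / std  -> mean 0, variance 1
z = (x - x.mean()) / x.std()

# Min-max scaling: x' = (x - min) / (max - min)  -> values in [0, 1]
mm = (x - x.min()) / (x.max() - x.min())

print(z)   # ≈ [-1.342 -0.447  0.447  1.342]
print(mm)  # [0.    0.333 0.667 1.   ]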
# Feature scaling: apply StandardScaler to the numeric columns of X_train and X_test..

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
# Columns 0-2 hold the one-hot encoded Country dummies, so scale only Age and Salary
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])   # fit on the training set only
X_test[:, 3:] = sc.transform(X_test[:, 3:])         # reuse the training mean and std
print(X_train)
• https://www.youtube.com/watch?v=x08AN87G0mg
