Data Preprocessing

1) Loading the dataset


Let's take a sample dataset for this exercise. The dataset, "Data.csv", records whether or not each user purchased the product, along with the user's age, salary, and country.

In [19]: import numpy as np
import pandas as pd
import os

In [2]: # Read 'Data.csv' and store the data in the variable dataset.
dataset = pd.read_csv("C:/Users/amaly/Jupyter Work/DSML/Data.csv")
print('Load the datasets...')

Load the datasets...
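If you don't have Data.csv on disk, the same sample data can be rebuilt in memory; the values below are copied from the table printed in the next cell, and the variable name data_dict is just for illustration:

In [ ]: # Rebuild the sample data without the CSV file
data_dict = {
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    'Salary': [72000.0, 48000.0, 54000.0, 61000.0, np.nan, 58000.0,
               52000.0, 79000.0, 83000.0, 67000.0],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
}
dataset = pd.DataFrame(data_dict)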

In [3]: # print the dataset
dataset

Out[3]:   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes

2) Summarizing the dataset


In [4]: # Print the shape of the dataset
print (dataset.shape)

(10, 4)

The dataset contains 10 rows and 4 columns.

In [5]: # Preview the data: print the first 5 rows to understand its structure
print(dataset.head(5))

  Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes

In [6]: # Statistical summary: view summary statistics of each numeric attribute,
# including the count, mean, standard deviation, min, quartiles and max,
# by using the following command
print(dataset.describe())

Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000
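A complementary summary, not part of the original run, is dataset.info(), which lists each column's dtype and non-null count; with 10 rows, the counts of 9 for Age and Salary reveal one missing value in each:

In [ ]: # Column dtypes and non-null counts; exposes the missing entries
dataset.info()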

3) Replacing Missing Values


In [7]: # Separate the independent variables (features) from the dependent variable

# Independent variables
# iloc[rows, columns]
# Take all rows
# Take every column except the last (:-1)
X = dataset.iloc[:,:-1].values
print('X:',X)

X: [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
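Before imputing, it helps to count the missing values per column with pandas (an added check, not part of the original run):

In [ ]: # NaN count per column: Age and Salary each have one missing value
print(dataset.isnull().sum())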

In [20]: from sklearn.impute import SimpleImputer

# SimpleImputer takes the following parameters:
# missing_values : the placeholder for missing entries. Our dataset marks them
#                  as NaN (Not a Number), which is the default.
# strategy : replace missing values with the mean/median/most_frequent/constant.
#            Default is mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the columns of X that contain missing values (columns 1 and 2).
# Note: indexing starts at 0, and the upper bound (3) is not included.
imputer = imputer.fit(X[:,1:3])

# Replace the missing data with the mean of each column
X[:,1:3] = imputer.transform(X[:,1:3])

In [9]: print ('X: %s'%(str(X)))

X: [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

Note that each mean is computed over the 9 observed values; the NaN is excluded:

Mean Age = sum of the 9 available age values / 9 = 38.77777777777778

Mean Salary = sum of the 9 available salary values / 9 = 63777.77777777778
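You can verify these means directly with pandas, whose mean() skips NaN by default (an added sanity check):

In [ ]: # Each mean is computed over the 9 observed values, not all 10 rows
print('Mean Age    =', dataset['Age'].mean())      # 38.7777...
print('Mean Salary =', dataset['Salary'].mean())   # 63777.7777...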

4) Mean Removal
Mean removal (standardization) subtracts each feature's mean so the feature is centered on zero, and rescales it to unit standard deviation. Centering removes any constant offset (bias) from the features.

In [21]: data = np.array([[10,20],[2,4],[4,9]])
print(data)

[[10 20]
 [ 2  4]
 [ 4  9]]

In [11]: from sklearn import preprocessing

data_standardized = preprocessing.scale(data)
print("\nMean = ", data_standardized.mean(axis = 0))
print("Std deviation = ", data_standardized.std(axis = 0))

Mean =  [ 7.40148683e-17 -5.55111512e-17]
Std deviation =  [1. 1.]

Observe that in the output the mean is almost 0 (zero up to floating-point error) and the standard deviation is exactly 1.
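The same standardization is available through the StandardScaler estimator, which is preferable when the learned mean and standard deviation must be reapplied to new data later; a minimal sketch:

In [ ]: # StandardScaler stores the column means/stds on fit and reuses them on transform
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
print(scaler.fit_transform(data))   # same values as preprocessing.scale(data)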

5) Scaling
Different features can span very different ranges of values, so it is often important to rescale them to a common range, such as [0, 1], so that no feature dominates simply because of its units.

In [12]: print(data)
data_scaler = preprocessing.MinMaxScaler(feature_range = (0, 1))
data_scaled = data_scaler.fit_transform(data)
print("Min max scaled data = ", data_scaled)

[[10 20]
[ 2 4]
[ 4 9]]
Min max scaled data = [[1. 1. ]
[0. 0. ]
[0.25 0.3125]]
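Min-max scaling applies (x - min) / (max - min) column by column; the result can be reproduced with plain NumPy as an added sanity check:

In [ ]: # Manual min-max scaling, column-wise; matches MinMaxScaler's output above
manual_scaled = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(manual_scaled)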

6) Normalization
Normalization adjusts the values in each feature vector so they can be measured on a common scale. With L1 normalization, each row is divided by the sum of the absolute values of its entries, so those absolute values sum to 1.

In [13]: data_normalized = preprocessing.normalize(data, norm = 'l1')
print("\nL1 normalized data = ", data_normalized)

L1 normalized data =  [[0.33333333 0.66666667]
 [0.33333333 0.66666667]
 [0.30769231 0.69230769]]
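For comparison, norm='l2' divides each row by its Euclidean length instead, so the squared values of each row sum to 1 (an added variant):

In [ ]: # L2 normalization: each row divided by its Euclidean norm
data_l2 = preprocessing.normalize(data, norm='l2')
print("L2 normalized data =", data_l2)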

7) Binarization
Binarization converts a numerical feature vector into a Boolean vector: values strictly greater than a threshold become 1, and all other values become 0.

In [14]: print('data:',data)
data_binarized = preprocessing.Binarizer(threshold=5).transform(data)
print("\nBinarized data =", data_binarized)

data: [[10 20]
 [ 2  4]
 [ 4  9]]

Binarized data = [[1 1]
 [0 0]
 [0 1]]
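The same result follows from an element-wise NumPy comparison against the threshold (an added equivalent):

In [ ]: # Values strictly greater than 5 become 1, all others 0
print((data > 5).astype(int))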

8) One Hot Encoding

Categorical features are often represented by a small set of scattered numerical values, and storing those raw values directly is rarely useful to a model. In such situations you can use the one-hot encoding technique.

If a feature takes k distinct values, one-hot encoding transforms it into a k-dimensional binary vector in which exactly one entry is 1 and all other entries are 0.

In [15]: encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12],
             [1, 3, 5, 3],
             [2, 3, 2, 12],
             [1, 2, 4, 3]])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print("\nEncoded vector =", encoded_vector)

Encoded vector = [[0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]

In the example above, consider the third feature in each feature vector. Its values across the four training rows are 1, 5, 2, and 4.

There are four distinct values here, so the one-hot encoding of this feature is a vector of length 4. The encoder sorts the categories (1, 2, 4, 5), so the value 5 is encoded as the vector [0, 0, 0, 1]: only one element can be 1, and here it is the fourth, which corresponds to 5. This group appears as the 0. 0. 0. 1. segment of the full encoded vector printed above.
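You can confirm the categories the encoder learned, per feature, through its categories_ attribute (an added check; the categories are sorted, which is why 5 maps to the last slot of its group):

In [ ]: # Sorted category values learned for each of the four input features
for i, cats in enumerate(encoder.categories_):
    print('Feature', i, 'categories:', cats)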

9) Label Encoding
Label encoding converts word labels into integers so that algorithms can operate on them.

In [16]: label_encoder = preprocessing.LabelEncoder()
input_classes = ['suzuki', 'ford', 'suzuki', 'toyota', 'ford', 'bmw']
label_encoder.fit(input_classes)
print("\nClass mapping:")
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)

Class mapping:
bmw --> 0
ford --> 1
suzuki --> 2
toyota --> 3

As shown in the output above, the words have been mapped to 0-indexed numbers.

In [17]: labels = ['toyota', 'ford', 'suzuki']
encoded_labels = label_encoder.transform(labels)
print("\nLabels =", labels)
print("Encoded labels =", list(encoded_labels))

Labels = ['toyota', 'ford', 'suzuki']
Encoded labels = [3, 1, 2]

This is more efficient than manually maintaining a mapping between words and numbers. You can check this by transforming the numbers back into word labels, as shown in the code below:

In [18]: encoded_labels = [3, 2, 0, 2, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("\nEncoded labels =", encoded_labels)
print("Decoded labels =", list(decoded_labels))

Encoded labels = [3, 2, 0, 2, 1]
Decoded labels = ['toyota', 'suzuki', 'bmw', 'suzuki', 'ford']
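One caveat worth knowing: transform raises a ValueError for any label that was not seen during fit, so unseen categories need explicit handling (a sketch with a hypothetical unseen label 'honda'):

In [ ]: # 'honda' was not in input_classes, so encoding it fails
try:
    label_encoder.transform(['honda'])
except ValueError as e:
    print('Error:', e)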

