Data Preprocessing

1) Loading the dataset


Let's take a sample dataset for this exercise. The dataset, "Data.csv", records whether or not each user purchased the product, along with the user's age, salary, and country.

In [19]: import numpy as np
import pandas as pd
import os

In [2]: # Read 'Data.csv' and store the data in the variable dataset.
dataset = pd.read_csv("C:/Users/amaly/Jupyter Work/DSML/Data.csv")
print('Load the datasets...')

Load the datasets...
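If you don't have Data.csv on disk, the same sample data can be rebuilt in memory; the values below are copied from the table printed in the next cell, and the variable name data_dict is just for illustration:

In [ ]: # Rebuild the sample data without the CSV file
data_dict = {
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    'Salary': [72000.0, 48000.0, 54000.0, 61000.0, np.nan, 58000.0,
               52000.0, 79000.0, 83000.0, 67000.0],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
}
dataset = pd.DataFrame(data_dict)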

In [3]: # print the dataset
dataset

Out[3]:   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes

2) Summarizing the dataset


In [4]: # Print the shape of the dataset
print (dataset.shape)

(10, 4)

The dataset contains 10 rows and 4 columns.

In [5]: # Preview the data: print the first 5 rows to understand its structure
print(dataset.head(5))

  Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes

In [6]: # Statistical summary: view summary statistics of each numeric attribute,
# including the count, mean, standard deviation, min, quartiles and max,
# by using the following command
print(dataset.describe())

Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000
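A complementary summary, not part of the original run, is dataset.info(), which lists each column's dtype and non-null count; with 10 rows, the counts of 9 for Age and Salary reveal one missing value in each:

In [ ]: # Column dtypes and non-null counts; exposes the missing entries
dataset.info()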

3) Replacing Missing Values


In [7]: # Separate the independent variables (features) from the dependent variable

# Independent variables
# iloc[rows, columns]
# Take all rows
# Take every column except the last (:-1)
X = dataset.iloc[:,:-1].values
print('X:',X)

X: [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
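Before imputing, it helps to count the missing values per column with pandas (an added check, not part of the original run):

In [ ]: # NaN count per column: Age and Salary each have one missing value
print(dataset.isnull().sum())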

In [20]: from sklearn.impute import SimpleImputer

# SimpleImputer takes the following parameters:
# missing_values : the placeholder for missing entries. Our dataset marks them
#                  as NaN (Not a Number), which is the default.
# strategy : replace missing values with the mean/median/most_frequent/constant.
#            Default is mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the columns of X that contain missing values (columns 1 and 2).
# Note: indexing starts at 0, and the upper bound (3) is not included.
imputer = imputer.fit(X[:,1:3])

# Replace the missing data with the mean of each column
X[:,1:3] = imputer.transform(X[:,1:3])

In [9]: print ('X: %s'%(str(X)))

X: [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

Note that each mean is computed over the 9 observed values; the NaN is excluded:

Mean Age = sum of the 9 available age values / 9 = 38.77777777777778

Mean Salary = sum of the 9 available salary values / 9 = 63777.77777777778
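You can verify these means directly with pandas, whose mean() skips NaN by default (an added sanity check):

In [ ]: # Each mean is computed over the 9 observed values, not all 10 rows
print('Mean Age    =', dataset['Age'].mean())      # 38.7777...
print('Mean Salary =', dataset['Salary'].mean())   # 63777.7777...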

4) Mean Removal
Mean removal (standardization) subtracts each feature's mean so the feature is centered on zero, and rescales it to unit standard deviation. Centering removes any constant offset (bias) from the features.

In [21]: data = np.array([[10,20],[2,4],[4,9]])
print(data)

[[10 20]
 [ 2  4]
 [ 4  9]]

In [11]: from sklearn import preprocessing

data_standardized = preprocessing.scale(data)
print("\nMean = ", data_standardized.mean(axis = 0))
print("Std deviation = ", data_standardized.std(axis = 0))

Mean =  [ 7.40148683e-17 -5.55111512e-17]
Std deviation =  [1. 1.]

Observe that in the output the mean is almost 0 (zero up to floating-point error) and the standard deviation is exactly 1.
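The same standardization is available through the StandardScaler estimator, which is preferable when the learned mean and standard deviation must be reapplied to new data later; a minimal sketch:

In [ ]: # StandardScaler stores the column means/stds on fit and reuses them on transform
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
print(scaler.fit_transform(data))   # same values as preprocessing.scale(data)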

5) Scaling
Different features can span very different ranges of values, so it is often important to rescale them to a common range, such as [0, 1], so that no feature dominates simply because of its units.

In [12]: print(data)
data_scaler = preprocessing.MinMaxScaler(feature_range = (0, 1))
data_scaled = data_scaler.fit_transform(data)
print("Min max scaled data = ", data_scaled)

[[10 20]
[ 2 4]
[ 4 9]]
Min max scaled data = [[1. 1. ]
[0. 0. ]
[0.25 0.3125]]
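Min-max scaling applies (x - min) / (max - min) column by column; the result can be reproduced with plain NumPy as an added sanity check:

In [ ]: # Manual min-max scaling, column-wise; matches MinMaxScaler's output above
manual_scaled = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(manual_scaled)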

6) Normalization
Normalization adjusts the values in each feature vector so they can be measured on a common scale. With L1 normalization, each row is divided by the sum of the absolute values of its entries, so those absolute values sum to 1.

In [13]: data_normalized = preprocessing.normalize(data, norm = 'l1')
print("\nL1 normalized data = ", data_normalized)

L1 normalized data =  [[0.33333333 0.66666667]
 [0.33333333 0.66666667]
 [0.30769231 0.69230769]]
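For comparison, norm='l2' divides each row by its Euclidean length instead, so the squared values of each row sum to 1 (an added variant):

In [ ]: # L2 normalization: each row divided by its Euclidean norm
data_l2 = preprocessing.normalize(data, norm='l2')
print("L2 normalized data =", data_l2)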

7) Binarization
Binarization converts a numerical feature vector into a Boolean vector: values strictly greater than a threshold become 1, and all other values become 0.

In [14]: print('data:',data)
data_binarized = preprocessing.Binarizer(threshold=5).transform(data)
print("\nBinarized data =", data_binarized)

data: [[10 20]
 [ 2  4]
 [ 4  9]]

Binarized data = [[1 1]
 [0 0]
 [0 1]]
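The same result follows from an element-wise NumPy comparison against the threshold (an added equivalent):

In [ ]: # Values strictly greater than 5 become 1, all others 0
print((data > 5).astype(int))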

8) One Hot Encoding

Categorical features are often represented by a small set of scattered numerical values, and storing those raw values directly is rarely useful to a model. In such situations you can use the one-hot encoding technique.

If a feature takes k distinct values, one-hot encoding transforms it into a k-dimensional binary vector in which exactly one entry is 1 and all other entries are 0.

In [15]: encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12],
             [1, 3, 5, 3],
             [2, 3, 2, 12],
             [1, 2, 4, 3]])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print("\nEncoded vector =", encoded_vector)

Encoded vector = [[0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]

In the example above, consider the third feature in each feature vector. Its values across the four training rows are 1, 5, 2, and 4.

There are four distinct values here, so the one-hot encoding of this feature is a vector of length 4. The encoder sorts the categories (1, 2, 4, 5), so the value 5 is encoded as the vector [0, 0, 0, 1]: only one element can be 1, and here it is the fourth, which corresponds to 5. This group appears as the 0. 0. 0. 1. segment of the full encoded vector printed above.
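You can confirm the categories the encoder learned, per feature, through its categories_ attribute (an added check; the categories are sorted, which is why 5 maps to the last slot of its group):

In [ ]: # Sorted category values learned for each of the four input features
for i, cats in enumerate(encoder.categories_):
    print('Feature', i, 'categories:', cats)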

9) Label Encoding
Label encoding converts word labels into integers so that algorithms can operate on them.

In [16]: label_encoder = preprocessing.LabelEncoder()
input_classes = ['suzuki', 'ford', 'suzuki', 'toyota', 'ford', 'bmw']
label_encoder.fit(input_classes)
print("\nClass mapping:")
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)

Class mapping:
bmw --> 0
ford --> 1
suzuki --> 2
toyota --> 3

As shown in the output above, the words have been mapped to 0-indexed numbers.

In [17]: labels = ['toyota', 'ford', 'suzuki']
encoded_labels = label_encoder.transform(labels)
print("\nLabels =", labels)
print("Encoded labels =", list(encoded_labels))

Labels = ['toyota', 'ford', 'suzuki']
Encoded labels = [3, 1, 2]

This is more efficient than manually maintaining a mapping between words and numbers. You can check this by transforming the numbers back into word labels, as shown in the code below:

In [18]: encoded_labels = [3, 2, 0, 2, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("\nEncoded labels =", encoded_labels)
print("Decoded labels =", list(decoded_labels))

Encoded labels = [3, 2, 0, 2, 1]
Decoded labels = ['toyota', 'suzuki', 'bmw', 'suzuki', 'ford']
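One caveat worth knowing: transform raises a ValueError for any label that was not seen during fit, so unseen categories need explicit handling (a sketch with a hypothetical unseen label 'honda'):

In [ ]: # 'honda' was not in input_classes, so encoding it fails
try:
    label_encoder.transform(['honda'])
except ValueError as e:
    print('Error:', e)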

