Data Preprocessing
In [2]: # Read 'Data.csv' and store the data in the variable dataset.
import pandas as pd

dataset = pd.read_csv("C:/Users/amaly/Jupyter Work/DSML/Data.csv")
print('Loading the dataset...')
print(dataset.shape)
Loading the dataset...
(10, 4)
In [5]: # Preview the data: print the first 5 rows to get a feel for its structure.
print(dataset.head(5))
In [6]: # Statistical summary: view the summary of each numeric attribute, which includes
# the count, mean, standard deviation, min, quartiles and max, by using the following command.
print(dataset.describe())
Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000
# Independent variables
# iloc[rows, columns]
# Take all rows
# Take every column except the last (:-1)
X = dataset.iloc[:, :-1].values
print('X:', X)
4) Mean Removal
It involves removing the mean from each feature so that it is centered on zero. Mean removal helps in removing any bias from the features.
[[10 20]
[ 2 4]
[ 4 9]]
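The code cell for this step is missing from the transcript. A minimal sketch, assuming the 3x2 array shown above and scikit-learn's preprocessing module:
In [ ]: import numpy as np
from sklearn import preprocessing

# Hypothetical data matching the array printed above.
data = np.array([[10, 20],
                 [ 2,  4],
                 [ 4,  9]])

# Remove the mean and scale each column to unit variance.
data_standardized = preprocessing.scale(data)
print("Mean =", data_standardized.mean(axis=0))
print("Std deviation =", data_standardized.std(axis=0))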
Observe that in the output, mean is almost 0 and the standard deviation is 1.
5) Scaling
The values of different features can vary over widely different ranges, so it is important to scale all features to a common, specified range.
In [12]: print(data)
# Scale each feature to the given range (here [0, 1]).
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)
print("Min max scaled data = ", data_scaled)
[[10 20]
[ 2 4]
[ 4 9]]
Min max scaled data = [[1. 1. ]
[0. 0. ]
[0.25 0.3125]]
6) Normalization
Normalization involves adjusting the values in the feature vector so that they can be measured on a common scale. Here, with L1 normalization, the values of each feature vector are adjusted so that their absolute values sum to 1.
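No code accompanies this step in the transcript; a minimal sketch using scikit-learn's preprocessing.normalize, reusing the same hypothetical array as the earlier steps:
In [ ]: import numpy as np
from sklearn import preprocessing

# Same hypothetical data as in the previous steps.
data = np.array([[10, 20],
                 [ 2,  4],
                 [ 4,  9]])

# L1 normalization: the absolute values in each row sum to 1.
data_normalized_l1 = preprocessing.normalize(data, norm='l1')
print("L1 normalized data =", data_normalized_l1)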
7) Binarization
Binarization is used to convert a numerical feature vector into a Boolean vector.
In [14]: print('data:', data)
# Values strictly greater than the threshold become 1; all others become 0.
data_binarized = preprocessing.Binarizer(threshold=5).transform(data)
print("\nBinarized data =", data_binarized)
8) One Hot Encoding
One-hot encoding transforms a feature that takes k distinct values into a k-dimensional vector in which exactly one element is 1 and all the others are 0.
As an example, consider the third feature in each feature vector of a small dataset, with the values 1, 5, 2 and 4 (a sketch follows below).
There are four distinct values here, so the one-hot encoded vector will be of length 4. Sorted, the distinct values are 1, 2, 4 and 5, so the value 5 maps to the last position and is encoded as [0, 0, 0, 1]. Only one element can be 1 in this vector, and its position indicates which value is present.
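The encoder code itself is not included in the transcript. A minimal sketch with scikit-learn's OneHotEncoder, assuming a hypothetical 4x4 data matrix whose third column carries the values 1, 5, 2 and 4:
In [ ]: import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data; the third column holds the values 1, 5, 2, 4.
data = np.array([[0, 2, 1, 12],
                 [1, 3, 5, 13],
                 [2, 3, 2, 12],
                 [1, 2, 4, 12]])

encoder = OneHotEncoder()
encoder.fit(data)

# Encode one data point; .toarray() densifies the sparse result.
encoded = encoder.transform([[1, 3, 5, 12]]).toarray()
print("Encoded vector =", encoded)
In the encoded vector, the block for the third feature is [0, 0, 0, 1], matching the mapping described above.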
9) Label Encoding
Label encoding refers to converting word labels into numbers so that algorithms can operate on them.
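The cell that produced the mapping below is missing from the transcript. A minimal sketch with scikit-learn's LabelEncoder, assuming a hypothetical list of car-brand labels, would reproduce it:
In [ ]: from sklearn import preprocessing

# Hypothetical word labels; any list containing these four brands
# yields the mapping shown below.
input_labels = ['suzuki', 'ford', 'bmw', 'toyota', 'ford', 'suzuki']

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

# classes_ holds the sorted labels; the index is the encoded value.
print('Class mapping:')
for i, item in enumerate(encoder.classes_):
    print(item, '-->', i)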
Class mapping:
bmw --> 0
ford --> 1
suzuki --> 2
toyota --> 3
As shown in the above output, the words have been mapped to 0-indexed numbers.
This is more efficient than manually maintaining the mapping between words and numbers. You can verify the encoding by transforming the numbers back into word labels, as sketched below −
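The referenced code cell is empty in the transcript; a sketch, reusing the encoder fitted in the previous cell:
In [ ]: # Encode a few labels ...
test_labels = ['toyota', 'ford', 'suzuki']
encoded_values = encoder.transform(test_labels)
print("Labels =", test_labels)
print("Encoded values =", list(encoded_values))

# ... and map numbers back to their word labels.
decoded_labels = encoder.inverse_transform([3, 0, 2, 1])
print("Decoded labels =", list(decoded_labels))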