Practical 1
1) Preprocessing
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import pandas as pd
dataset = pd.read_csv('/content/drive/MyDrive/ML/ML_Lab/Data.csv')
dataset
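The later cells operate on a feature matrix X and a dependent vector y split off from the dataset. A minimal sketch, assuming the usual Data.csv layout (Country, Age, Salary, Purchased):
X = dataset.iloc[:, :-1].values   # all columns except the last: Country, Age, Salary
y = dataset.iloc[:, -1].values    # the last column: Purchased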
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')    # missing_values defaults to np.nan
imputer.fit(X[:, 1:3])                      #---connect the imputer object with the data: only the numerical columns are fitted
X[:, 1:3] = imputer.transform(X[:, 1:3])    #---replace the missing values with the column means
(output truncated by the page break; the imputed X ends with the rows below, the missing Age and Salary entries replaced by the column means 39.875 and 61375.0)
 ['Germany' 50.0 61375.0]
 ['France' 37.0 67000.0]]
----Arguments----
copy --> If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.
fill_value --> When strategy == 'constant', fill_value is used to replace all occurrences of missing_values.
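As a quick illustration of these arguments, a hedged sketch with a made-up toy column (not part of the practical):
import numpy as np
from sklearn.impute import SimpleImputer
col = np.array([[1.0], [np.nan], [3.0]])                  # toy column with one missing entry
imp = SimpleImputer(strategy='constant', fill_value=0.0)  # fill_value replaces every np.nan
print(imp.fit_transform(col).ravel())                     # -> [1. 0. 3.]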
Suppose a column in the dataset is categorical and we need to convert it into numeric format.
'Country' has 3 values: France, Spain and Germany. If we assign the numbers 0, 1 and 2 respectively to these values, the model may read them as an ordering, as if some
priority were given to each value (0, 1 & 2).
To avoid this, we use one-hot encoding (create a separate binary column for each unique value).
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# (reconstructed) one-hot encode column 0 (Country) and pass the remaining columns through
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
#-----Connect the data with the 'ct' object; 'fit_transform' fits and transforms in one step
X = ct.fit_transform(X)   #----update the encoded columns in the same feature matrix X
print(type(X))
X
<class 'numpy.ndarray'>
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
[0.0, 0.0, 1.0, 27.0, 48000.0],
[0.0, 1.0, 0.0, 39.875, 54000.0],
[0.0, 0.0, 1.0, 38.0, 61000.0],
[0.0, 1.0, 0.0, 40.0, 61375.0],
[1.0, 0.0, 0.0, 35.0, 58000.0],
[0.0, 0.0, 1.0, 39.875, 52000.0],
[1.0, 0.0, 0.0, 48.0, 79000.0],
[0.0, 1.0, 0.0, 50.0, 61375.0],
[1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
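The dependent variable y ('Purchased') holds text labels, so it is label-encoded to 0/1 before splitting; a minimal sketch, assuming y was taken from the last column:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)   # e.g. 'No' -> 0, 'Yes' -> 1
print(y)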
[0 1 0 0 1 1 0 1 0 1]
Splitting Dataset
from sklearn.model_selection import train_test_split
#---returns 4 matrices: 2 for training and 2 for testing (an 80/20 split, judging by the outputs below)
# random_state is fixed here for reproducibility; the value used originally is not shown
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train)
[[0.0 0.0 1.0 39.875 52000.0]
[0.0 1.0 0.0 40.0 61375.0]
[1.0 0.0 0.0 44.0 72000.0]
[0.0 0.0 1.0 38.0 61000.0]
[0.0 0.0 1.0 27.0 48000.0]
[1.0 0.0 0.0 48.0 79000.0]
[0.0 1.0 0.0 50.0 61375.0]
[1.0 0.0 0.0 35.0 58000.0]]
print(X_test)
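[[0.0 1.0 0.0 39.875 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]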
print(y_train)
[0 1 0 0 1 1 0 1]
print(y_test)
[0 1]
Feature Scaling
Feature scaling is done so that some features do not dominate the others purely because of their larger magnitudes, even when they are not more important.
x_stand = (x - mean(x)) / standard_deviation(x)
x_norm = (x - min(x)) / (max(x) - min(x))
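For reference, the normalization formula corresponds to sklearn's MinMaxScaler; a minimal sketch with made-up ages (this practical uses standardization instead):
import numpy as np
from sklearn.preprocessing import MinMaxScaler
demo = np.array([[27.0], [40.0], [50.0]])          # hypothetical age column
print(MinMaxScaler().fit_transform(demo).ravel())  # -> [0. 0.5652 1.] approximately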
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
#---'fit' computes the mean and standard deviation of each feature; 'transform' then applies the formula
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])   #---all rows, columns from index 3 up to the last
#---we apply the same scaler to the test set, that's why we use the 'transform' method only
X_test[:, 3:] = sc.transform(X_test[:, 3:])
print(X_train)
print(X_test)
Note: We need to apply the same scaler fitted on the training data to the test data; otherwise different values of mean and standard
deviation would be computed for the test set and the results would be affected. Using 'transform' only gives us the same transformation on the test set.
Do we have to apply the scaling to the dummy variables / one-hot encoded columns? No: standardization takes values roughly between -3 and +3, and the dummy
variables already lie between 0 and 1, which falls within this range.