Exp2-Dm - KS
When dealing with large, real-world datasets, categorical data is almost inevitable. Categorical variables
represent types of data that can be divided into groups; examples include race, sex, age group, and
educational level. These variables often have letters or words as their values. Since machine learning
models work with numbers and calculations, categorical variables need to be encoded into numbers.
However, simply coding a categorical variable into numbers may not be enough.
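As a rough illustration, a categorical column can first be mapped to integer codes with scikit-learn's LabelEncoder. The snippet below is only a minimal sketch; the COUNTRY values are hypothetical sample data chosen to resemble the dataset used later in this experiment.
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Hypothetical sample resembling the COUNTRY column used later
sample = pd.DataFrame({'COUNTRY': ['INDIA', 'FRANCE', 'JAPAN', 'INDIA']})
# LabelEncoder maps each distinct category to an integer code
le = LabelEncoder()
sample['COUNTRY_CODE'] = le.fit_transform(sample['COUNTRY'])
print(sample)
Because such integer codes imply an ordering that usually does not exist, one-hot encoding, as used in the steps below, is generally preferred for nominal categories.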
DATASET EXP1:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import dtale
# Load the dataset used in this experiment
datas = pd.read_csv('C:\\Users\\AIDS\\Desktop\\EXP1DATASET.csv')
datas.head(20)
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the SALARY column and join the indicator columns back onto the dataset
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_datas = pd.DataFrame(encoder.fit_transform(datas[['SALARY']]).toarray())
df2 = datas.join(encoder_datas)
df2.drop('SALARY', axis=1, inplace=True)
df2.head()
0 0 INDIA 270.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1 FRANCE 301.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 JAPAN 50.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 3 SPAIN 35.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 4 CHINA 390.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
As the output shows, the SALARY column has been successfully transformed. Now that all of the
categorical data in our dataset has been encoded, we can move on to the next step (an equivalent
pandas-only approach is sketched below for reference).
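For reference, the same indicator columns can also be produced directly in pandas. The snippet below is only an illustrative sketch and assumes the same datas DataFrame loaded above.
# Sketch: one-hot encoding with pandas instead of scikit-learn
df2_alt = pd.get_dummies(datas, columns=['SALARY'])
df2_alt.head()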
All machine learning models require a training set so that the model can learn the relations between the
features and make predictions for new observations. When we are given a single large dataset with many
observations, it is a good idea to split it into two parts, a training set and a test set, so that we can
evaluate the model on the test set after it has been trained on the training set.
X = datas.iloc[:, :-1].values
y = datas.iloc[:, -1].values
# .values converts the selected columns into NumPy arrays
print("Independent Variable\n")
print(X)
print("\nDependent Variable\n")
print(y)
Independent Variable
[[0 'INDIA' 270.0]
[1 'FRANCE' 301.0]
[2 'JAPAN' 50.0]
[3 'SPAIN' 35.0]
[4 'CHINA' 390.0]
[5 'GERMANY' 35.0]
[6 'INDIA' nan]
[7 'AUSTRALIA' 422.0]
[8 'RUSSIA' 50.0]
[9 'USA' 534.0]
[10 'KORIA' 21.0]
[11 'USA' nan]
[12 'INDIA' 63.0]
[13 'LONDON' 200.0]
[14 'INDIA' 105.0]
[15 'JAPAN' 321.0]]
Dependent Variable
[ 3. 4. 6. 12. 8. 7. 5. 10. 11. 12. 0. 8. 9. 2. 1. 9.]
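The split step itself does not appear in the listing; a minimal sketch using scikit-learn's train_test_split is given below. The test_size and random_state values are assumptions, chosen so that 4 of the 16 rows are held out for testing, as in the outputs that follow.
from sklearn.model_selection import train_test_split
# Hold out 25% of the rows (4 of 16) as the test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)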
x_train
array([[13, 'LONDON', 200.0],
[4, 'CHINA', 390.0],
[2, 'JAPAN', 50.0],
[14, 'INDIA', 105.0],
[10, 'KORIA', 21.0],
[7, 'AUSTRALIA', 422.0],
[15, 'JAPAN', 321.0],
[11, 'USA', nan],
[3, 'SPAIN', 35.0],
[0, 'INDIA', 270.0],
[5, 'GERMANY', 35.0],
[12, 'INDIA', 63.0]], dtype=object)
x_test
array([[1, 'FRANCE', 301.0],
[6, 'INDIA', nan],
[8, 'RUSSIA', 50.0],
[9, 'USA', 534.0]], dtype=object)
y_train
array([ 2., 8., 6., 1., 0., 10., 9., 8., 12., 3., 7., 9.])
y_test
array([ 4., 5., 11., 12.])
B) Scaling the features:
Since machine learning models rely on numerical relationships between features, it is important for the data in
a dataset to be on similar scales. Scaling ensures that all features fall within a comparable range; unscaled
data can lead to inaccurate or misleading predictions. Some machine learning algorithms can handle feature
scaling on their own, but most expect the features to be scaled beforehand.
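The StandardScaler used below standardizes each feature by subtracting its mean and dividing by its standard deviation. A minimal sketch of the same computation done by hand, on a hypothetical column, looks like this:
import pandas as pd
# Hypothetical values, for illustration only
col = pd.Series([270.0, 301.0, 50.0, 35.0, 390.0])
# Standardization: subtract the mean, divide by the population standard deviation
z = (col - col.mean()) / col.std(ddof=0)
print(z)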
import pandas as pd
import numpy as np
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')
np.random.seed(1)
df = pd.DataFrame({
})
from sklearn.preprocessing import StandardScaler
df.head()
# Fit a StandardScaler on all columns and transform them
col_names = df.columns
features = df[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
# Put the scaled values back into a DataFrame so head() and describe() can be used
scaled_features = pd.DataFrame(features, columns=col_names)
scaled_features.head()
scaled_features.describe()
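After standardization, scaled_features.describe() should report a mean close to 0 and a standard deviation close to 1 for every column, confirming that all features are now on a comparable scale.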