Name:Fedrick Samuel W Reg No: 19MIS1112 Course: Machine Learning (SWE4012) Slot: L11 + L12 Faculty: Dr.M. Premalatha
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
data = 'adult.csv'
df = pd.read_csv(data, header=None, skipinitialspace=True)
df.shape
(32561, 15)
df.head()
0 1 2 3 4 5 \
0 39 State-gov 77516 Bachelors 13 Never-married
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
2 38 Private 215646 HS-grad 9 Divorced
3 53 Private 234721 11th 7 Married-civ-spouse
4 28 Private 338409 Bachelors 13 Married-civ-spouse
6 7 8 9 10 11 12 \
0 Adm-clerical Not-in-family White Male 2174 0 40
1 Exec-managerial Husband White Male 0 0 13
2 Handlers-cleaners Not-in-family White Male 0 0 40
3 Handlers-cleaners Husband Black Male 0 0 40
4 Prof-specialty Wife Black Female 0 0 40
13 14
0 United-States <=50K
1 United-States <=50K
2 United-States <=50K
3 United-States <=50K
4 Cuba <=50K
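col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation',
             'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
             'native_country', 'income']
df.columns = col_names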
df.columns
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education_num 32561 non-null int64
marital_status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital_gain 32561 non-null int64
capital_loss 32561 non-null int64
hours_per_week 32561 non-null int64
native_country 32561 non-null object
income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
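# categorical variables are the columns with dtype object
categorical = [var for var in df.columns if df[var].dtype == 'O']
df[categorical].head()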
df[categorical].isnull().sum()
workclass 0
education 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
native_country 0
income 0
dtype: int64
for var in categorical:
    print(df[var].value_counts())
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
HS-grad 10501
Some-college 7291
Bachelors 5355
Masters 1723
Assoc-voc 1382
11th 1175
Assoc-acdm 1067
10th 933
7th-8th 646
Prof-school 576
9th 514
12th 433
Doctorate 413
5th-6th 333
1st-4th 168
Preschool 51
Name: education, dtype: int64
Married-civ-spouse 14976
Never-married 10683
Divorced 4443
Separated 1025
Widowed 993
Married-spouse-absent 418
Married-AF-spouse 23
Name: marital_status, dtype: int64
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
? 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64
Husband 13193
Not-in-family 8305
Own-child 5068
Unmarried 3446
Wife 1568
Other-relative 981
Name: relationship, dtype: int64
White 27816
Black 3124
Asian-Pac-Islander 1039
Amer-Indian-Eskimo 311
Other 271
Name: race, dtype: int64
Male 21790
Female 10771
Name: sex, dtype: int64
United-States 29170
Mexico 643
? 583
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Trinadad&Tobago 19
Cambodia 19
Thailand 18
Laos 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Hungary 13
Honduras 13
Scotland 12
Holand-Netherlands 1
Name: native_country, dtype: int64
<=50K 24720
>50K 7841
Name: income, dtype: int64
# view relative frequency distribution of categorical variables
for var in categorical:
    print(df[var].value_counts() / len(df))
Private 0.697030
Self-emp-not-inc 0.078038
Local-gov 0.064279
? 0.056386
State-gov 0.039864
Self-emp-inc 0.034274
Federal-gov 0.029483
Without-pay 0.000430
Never-worked 0.000215
Name: workclass, dtype: float64
HS-grad 0.322502
Some-college 0.223918
Bachelors 0.164461
Masters 0.052916
Assoc-voc 0.042443
11th 0.036086
Assoc-acdm 0.032769
10th 0.028654
7th-8th 0.019840
Prof-school 0.017690
9th 0.015786
12th 0.013298
Doctorate 0.012684
5th-6th 0.010227
1st-4th 0.005160
Preschool 0.001566
Name: education, dtype: float64
Married-civ-spouse 0.459937
Never-married 0.328092
Divorced 0.136452
Separated 0.031479
Widowed 0.030497
Married-spouse-absent 0.012837
Married-AF-spouse 0.000706
Name: marital_status, dtype: float64
Prof-specialty 0.127146
Craft-repair 0.125887
Exec-managerial 0.124873
Adm-clerical 0.115783
Sales 0.112097
Other-service 0.101195
Machine-op-inspct 0.061485
? 0.056601
Transport-moving 0.049046
Handlers-cleaners 0.042075
Farming-fishing 0.030527
Tech-support 0.028500
Protective-serv 0.019932
Priv-house-serv 0.004576
Armed-Forces 0.000276
Name: occupation, dtype: float64
Husband 0.405178
Not-in-family 0.255060
Own-child 0.155646
Unmarried 0.105832
Wife 0.048156
Other-relative 0.030128
Name: relationship, dtype: float64
White 0.854274
Black 0.095943
Asian-Pac-Islander 0.031909
Amer-Indian-Eskimo 0.009551
Other 0.008323
Name: race, dtype: float64
Male 0.669205
Female 0.330795
Name: sex, dtype: float64
United-States 0.895857
Mexico 0.019748
? 0.017905
Philippines 0.006081
Germany 0.004207
Canada 0.003716
Puerto-Rico 0.003501
El-Salvador 0.003255
India 0.003071
Cuba 0.002918
England 0.002764
Jamaica 0.002488
South 0.002457
China 0.002303
Italy 0.002242
Dominican-Republic 0.002150
Vietnam 0.002058
Guatemala 0.001966
Japan 0.001904
Poland 0.001843
Columbia 0.001812
Taiwan 0.001566
Haiti 0.001351
Iran 0.001321
Portugal 0.001136
Nicaragua 0.001044
Peru 0.000952
France 0.000891
Greece 0.000891
Ecuador 0.000860
Ireland 0.000737
Hong 0.000614
Trinadad&Tobago 0.000584
Cambodia 0.000584
Thailand 0.000553
Laos 0.000553
Yugoslavia 0.000491
Outlying-US(Guam-USVI-etc) 0.000430
Hungary 0.000399
Honduras 0.000399
Scotland 0.000369
Holand-Netherlands 0.000031
Name: native_country, dtype: float64
<=50K 0.75919
>50K 0.24081
Name: income, dtype: float64
df.workclass.unique()
df.workclass.value_counts()
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
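# replace the '?' placeholder in workclass with NaN
df['workclass'] = df['workclass'].replace('?', np.nan)
df.workclass.value_counts()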
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
df.occupation.unique()
df.occupation.value_counts()
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
? 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64
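# replace the '?' placeholder in occupation with NaN
df['occupation'] = df['occupation'].replace('?', np.nan)
df.occupation.value_counts()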
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64
df.native_country.unique()
df.native_country.value_counts()
United-States 29170
Mexico 643
? 583
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Trinadad&Tobago 19
Cambodia 19
Thailand 18
Laos 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Hungary 13
Honduras 13
Scotland 12
Holand-Netherlands 1
Name: native_country, dtype: int64
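# replace the '?' placeholder in native_country with NaN
df['native_country'] = df['native_country'].replace('?', np.nan)
df.native_country.value_counts()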
United-States 29170
Mexico 643
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Trinadad&Tobago 19
Cambodia 19
Thailand 18
Laos 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Hungary 13
Honduras 13
Scotland 12
Holand-Netherlands 1
Name: native_country, dtype: int64
df[categorical].isnull().sum()
workclass 1836
education 0
marital_status 0
occupation 1843
relationship 0
race 0
sex 0
native_country 583
income 0
dtype: int64
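# numerical variables are the columns whose dtype is not object
numerical = [var for var in df.columns if df[var].dtype != 'O']
df[numerical].head()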
df[numerical].isnull().sum()
age 0
fnlwgt 0
education_num 0
capital_gain 0
capital_loss 0
hours_per_week 0
dtype: int64
X = df.drop(['income'], axis=1)
y = df['income']
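from sklearn.model_selection import train_test_split
# test_size=0.3 matches the 22792 / 9769 row split reported below; random_state=0 is an assumption
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape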
X_train.dtypes
age int64
workclass object
fnlwgt int64
education object
education_num int64
marital_status object
occupation object
relationship object
race object
sex object
capital_gain int64
capital_loss int64
hours_per_week int64
native_country object
dtype: object
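# recompute on X_train so that the target column income is excluded
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']
categorical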
['workclass',
'education',
'marital_status',
'occupation',
'relationship',
'race',
'sex',
'native_country']
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']
numerical
['age',
'fnlwgt',
'education_num',
'capital_gain',
'capital_loss',
'hours_per_week']
X_train[categorical].isnull().mean()
workclass 0.055985
education 0.000000
marital_status 0.000000
occupation 0.056072
relationship 0.000000
race 0.000000
sex 0.000000
native_country 0.018164
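dtype: float64
# print only the categorical variables that contain missing data
for col in categorical:
    if X_train[col].isnull().mean() > 0:
        print(col, X_train[col].isnull().mean())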
workclass 0.055984555984555984
occupation 0.05607230607230607
native_country 0.018164268164268166
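The zero counts reported next imply that the missing entries in workclass, occupation and native_country were filled in, but the notebook does not show how. One common option, sketched here as an assumption rather than the author's confirmed method, is to fill each column with the most frequent value learned from the training set only, so that no information leaks from the test set. The frames `X_train_demo` and `X_test_demo` and the helper `impute_with_train_mode` are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / X_test (hypothetical data, not the adult dataset)
X_train_demo = pd.DataFrame({
    'workclass': ['Private', np.nan, 'Private', 'State-gov'],
    'occupation': ['Sales', 'Sales', np.nan, 'Tech-support'],
})
X_test_demo = pd.DataFrame({
    'workclass': [np.nan, 'Local-gov'],
    'occupation': ['Craft-repair', np.nan],
})


def impute_with_train_mode(train, test, columns):
    """Fill NaN in both frames with the most frequent value of the training column."""
    train, test = train.copy(), test.copy()
    for col in columns:
        mode_value = train[col].mode()[0]  # learned from the training data only
        train[col] = train[col].fillna(mode_value)
        test[col] = test[col].fillna(mode_value)
    return train, test


X_train_filled, X_test_filled = impute_with_train_mode(
    X_train_demo, X_test_demo, ['workclass', 'occupation'])

print(X_train_filled.isnull().sum())
print(X_test_filled.isnull().sum())
```

On the real data the same helper would be called with X_train, X_test and the three affected columns.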
X_train[categorical].isnull().sum()
workclass 0
education 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
native_country 0
dtype: int64
X_test[categorical].isnull().sum()
workclass 0
education 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
native_country 0
dtype: int64
As a final check, I will look for any remaining missing values in X_train and X_test.
# check missing values in X_train
X_train.isnull().sum()
age 0
workclass 0
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 0
dtype: int64
X_test.isnull().sum()
age 0
workclass 0
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 0
dtype: int64
categorical
['workclass',
'education',
'marital_status',
'occupation',
'relationship',
'race',
'sex',
'native_country']
X_train[categorical].head()
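import category_encoders as ce
# one-hot encode the categorical columns; fit on the training set only
encoder = ce.OneHotEncoder(cols=categorical)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)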
X_train.head()
X_train.shape
(22792, 105)
X_test.head()
X_test.shape
(9769, 105)
cols = X_train.columns
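from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# the scaler returns NumPy arrays, so rebuild DataFrames before calling head()
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)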
X_train.head()
native_country_41
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
GaussianNB(priors=None, var_smoothing=1e-09)
y_pred = gnb.predict(X_test)
y_pred
y_pred_train = gnb.predict(X_train)
y_pred_train
y_test.value_counts()
<=50K 7407
>50K 2362
Name: income, dtype: int64
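# check null accuracy score: accuracy of always predicting the majority class <=50K
null_accuracy = (7407/(7407+2362))
print('Null accuracy score: {0:0.4f}'.format(null_accuracy))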
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
Confusion matrix
[[5999 1408]
[ 465 1897]]
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Classification accuracy
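# sklearn puts actual classes in rows and predicted classes in columns;
# <=50K (row 0) is taken as the positive class
TP = cm[0,0]
TN = cm[1,1]
FP = cm[1,0]
FN = cm[0,1]
classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)
print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))
precision = TP / float(TP + FP)
print('Precision : {0:0.4f}'.format(precision))
Precision : 0.9281
specificity = TN / float(TN + FP)
print('Specificity : {0:0.4f}'.format(specificity))
Specificity : 0.8031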
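# predicted probabilities of the two classes (<=50K, >50K) for the first 10 test rows
y_pred_prob = gnb.predict_proba(X_test)[0:10]
y_pred_prob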
array([[9.99999426e-01, 5.74152436e-07],
[9.99687907e-01, 3.12093456e-04],
[1.54405602e-01, 8.45594398e-01],
[1.73624321e-04, 9.99826376e-01],
[8.20121011e-09, 9.99999992e-01],
[8.76844580e-01, 1.23155420e-01],
[9.99999927e-01, 7.32876705e-08],
[9.99993460e-01, 6.53998797e-06],
[9.87738143e-01, 1.22618575e-02],
[9.99999996e-01, 4.01886317e-09]])
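# store the probabilities in a DataFrame; the column labels are illustrative
y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - <=50K', 'Prob of - >50K'])
y_pred_prob_df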
gnb.predict_proba(X_test)[0:10, 1]
y_pred1 = gnb.predict_proba(X_test)[:, 1]
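plt.rcParams['font.size'] = 12
# histogram of the predicted probabilities of the >50K class
plt.hist(y_pred1, bins=10)
plt.xlabel('Predicted probability of >50K')
plt.show()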
Interpretation
• ROC AUC is a single number summary of classifier performance. The higher the
value, the better the classifier.
• The ROC AUC of our model is close to 1, so we can conclude that the classifier
does a good job of predicting whether a person's income exceeds 50K.
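The ROC AUC discussed in the interpretation is computed from the predicted probability of the positive class rather than from the hard class labels. The following is a minimal sketch on synthetic data, assuming scikit-learn's `roc_curve` and `roc_auc_score`; the names `X_demo`, `y_demo`, `model` and `pos_prob` are hypothetical and not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem standing in for the adult data
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)

# Probability of the positive class (column 1 of predict_proba)
pos_prob = model.predict_proba(X_te)[:, 1]

# Points of the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_te, pos_prob)
roc_auc = roc_auc_score(y_te, pos_prob)

print('ROC AUC : {:.4f}'.format(roc_auc))
```

With the string labels used in this notebook, `roc_curve` would additionally need `pos_label='>50K'` so that it knows which class the probabilities refer to.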
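# calculate cross-validated ROC AUC (cv=10 folds is an assumption)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='roc_auc')
print('Cross-validation scores:{}'.format(scores))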