Name:Fedrick Samuel W Reg No: 19MIS1112 Course: Machine Learning (SWE4012) Slot: L11 + L12 Faculty: Dr.M. Premalatha

This document contains information about a machine learning course project on income classification using the adult census dataset. It includes the student's name, registration number, course details, and faculty name. The code preprocesses the dataset, renames columns, handles missing values, and explores categorical variables through counts and frequencies. It identifies 9 categorical variables and analyzes their distributions.

Name:Fedrick Samuel W

Reg No: 19MIS1112

Course: Machine learning (SWE4012)
Slot: L11 + L12
Faculty: Dr.M. Premalatha

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings

warnings.filterwarnings('ignore')

data = 'adult.csv'

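# a regex separator needs the Python parser engine; the raw string keeps '\s' intact
df = pd.read_csv(data, header=None, sep=r',\s', engine='python')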

df.shape

(32561, 15)

df.head()

0 1 2 3 4 5 \
0 39 State-gov 77516 Bachelors 13 Never-married
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
2 38 Private 215646 HS-grad 9 Divorced
3 53 Private 234721 11th 7 Married-civ-spouse
4 28 Private 338409 Bachelors 13 Married-civ-spouse

6 7 8 9 10 11 12 \
0 Adm-clerical Not-in-family White Male 2174 0 40
1 Exec-managerial Husband White Male 0 0 13
2 Handlers-cleaners Not-in-family White Male 0 0 40
3 Handlers-cleaners Husband Black Male 0 0 40
4 Prof-specialty Wife Black Female 0 0 40

13 14
0 United-States <=50K
1 United-States <=50K
2 United-States <=50K
3 United-States <=50K
4 Cuba <=50K

col_names = ['age', 'workclass', 'fnlwgt', 'education',
             'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss',
             'hours_per_week', 'native_country', 'income']

df.columns = col_names

df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',

'marital_status', 'occupation', 'relationship', 'race', 'sex',
'capital_gain', 'capital_loss', 'hours_per_week',
'native_country',
'income'],
dtype='object')

df.head()

   age         workclass  fnlwgt  education  education_num  \
0   39         State-gov   77516  Bachelors             13
1   50  Self-emp-not-inc   83311  Bachelors             13
2   38           Private  215646    HS-grad              9
3   53           Private  234721       11th              7
4   28           Private  338409  Bachelors             13

       marital_status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male
1  Married-civ-spouse    Exec-managerial        Husband  White    Male
2            Divorced  Handlers-cleaners  Not-in-family  White    Male
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female

   capital_gain  capital_loss  hours_per_week native_country income
0          2174             0              40  United-States  <=50K
1             0             0              13  United-States  <=50K
2             0             0              40  United-States  <=50K
3             0             0              40  United-States  <=50K
4             0             0              40           Cuba  <=50K

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education_num 32561 non-null int64
marital_status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital_gain 32561 non-null int64
capital_loss 32561 non-null int64
hours_per_week 32561 non-null int64
native_country 32561 non-null object
income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

Explore categorical variables

# find categorical variables

categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :\n\n', categorical)

There are 9 categorical variables

The categorical variables are :

['workclass', 'education', 'marital_status', 'occupation',

'relationship', 'race', 'sex', 'native_country', 'income']

df[categorical].head()

          workclass  education      marital_status         occupation  \
0         State-gov  Bachelors       Never-married       Adm-clerical
1  Self-emp-not-inc  Bachelors  Married-civ-spouse    Exec-managerial
2           Private    HS-grad            Divorced  Handlers-cleaners
3           Private       11th  Married-civ-spouse  Handlers-cleaners
4           Private  Bachelors  Married-civ-spouse     Prof-specialty

    relationship   race     sex native_country income
0  Not-in-family  White    Male  United-States  <=50K
1        Husband  White    Male  United-States  <=50K
2  Not-in-family  White    Male  United-States  <=50K
3        Husband  Black    Male  United-States  <=50K
4           Wife  Black  Female           Cuba  <=50K

# check missing values in categorical variables

df[categorical].isnull().sum()

workclass 0
education 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
native_country 0
income 0
dtype: int64

# view frequency counts of values in categorical variables

for var in categorical:
    print(df[var].value_counts())

Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
HS-grad 10501
Some-college 7291
Bachelors 5355
Masters 1723
Assoc-voc 1382
11th 1175
Assoc-acdm 1067
10th 933
7th-8th 646
Prof-school 576
9th 514
12th 433
Doctorate 413
5th-6th 333
1st-4th 168
Preschool 51
Name: education, dtype: int64
Married-civ-spouse 14976
Never-married 10683
Divorced 4443
Separated 1025
Widowed 993
Married-spouse-absent 418
Married-AF-spouse 23
Name: marital_status, dtype: int64
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
? 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64
Husband 13193
Not-in-family 8305
Own-child 5068
Unmarried 3446
Wife 1568
Other-relative 981
Name: relationship, dtype: int64
White 27816
Black 3124
Asian-Pac-Islander 1039
Amer-Indian-Eskimo 311
Other 271
Name: race, dtype: int64
Male 21790
Female 10771
Name: sex, dtype: int64
United-States 29170
Mexico 643
? 583
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Trinadad&Tobago 19
Cambodia 19
Thailand 18
Laos 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Hungary 13
Honduras 13
Scotland 12
Holand-Netherlands 1
Name: native_country, dtype: int64
<=50K 24720
>50K 7841
Name: income, dtype: int64
# view frequency distribution of categorical variables

for var in categorical:
    # np.float was removed from NumPy; plain division already yields floats
    print(df[var].value_counts() / len(df))

Private 0.697030
Self-emp-not-inc 0.078038
Local-gov 0.064279
? 0.056386
State-gov 0.039864
Self-emp-inc 0.034274
Federal-gov 0.029483
Without-pay 0.000430
Never-worked 0.000215
Name: workclass, dtype: float64
HS-grad 0.322502
Some-college 0.223918
Bachelors 0.164461
Masters 0.052916
Assoc-voc 0.042443
11th 0.036086
Assoc-acdm 0.032769
10th 0.028654
7th-8th 0.019840
Prof-school 0.017690
9th 0.015786
12th 0.013298
Doctorate 0.012684
5th-6th 0.010227
1st-4th 0.005160
Preschool 0.001566
Name: education, dtype: float64
Married-civ-spouse 0.459937
Never-married 0.328092
Divorced 0.136452
Separated 0.031479
Widowed 0.030497
Married-spouse-absent 0.012837
Married-AF-spouse 0.000706
Name: marital_status, dtype: float64
Prof-specialty 0.127146
Craft-repair 0.125887
Exec-managerial 0.124873
Adm-clerical 0.115783
Sales 0.112097
Other-service 0.101195
Machine-op-inspct 0.061485
? 0.056601
Transport-moving 0.049046
Handlers-cleaners 0.042075
Farming-fishing 0.030527
Tech-support 0.028500
Protective-serv 0.019932
Priv-house-serv 0.004576
Armed-Forces 0.000276
Name: occupation, dtype: float64
Husband 0.405178
Not-in-family 0.255060
Own-child 0.155646
Unmarried 0.105832
Wife 0.048156
Other-relative 0.030128
Name: relationship, dtype: float64
White 0.854274
Black 0.095943
Asian-Pac-Islander 0.031909
Amer-Indian-Eskimo 0.009551
Other 0.008323
Name: race, dtype: float64
Male 0.669205
Female 0.330795
Name: sex, dtype: float64
United-States 0.895857
Mexico 0.019748
? 0.017905
Philippines 0.006081
Germany 0.004207
Canada 0.003716
Puerto-Rico 0.003501
El-Salvador 0.003255
India 0.003071
Cuba 0.002918
England 0.002764
Jamaica 0.002488
South 0.002457
China 0.002303
Italy 0.002242
Dominican-Republic 0.002150
Vietnam 0.002058
Guatemala 0.001966
Japan 0.001904
Poland 0.001843
Columbia 0.001812
Taiwan 0.001566
Haiti 0.001351
Iran 0.001321
Portugal 0.001136
Nicaragua 0.001044
Peru 0.000952
France 0.000891
Greece 0.000891
Ecuador 0.000860
Ireland 0.000737
Hong 0.000614
Trinadad&Tobago 0.000584
Cambodia 0.000584
Thailand 0.000553
Laos 0.000553
Yugoslavia 0.000491
Outlying-US(Guam-USVI-etc) 0.000430
Hungary 0.000399
Honduras 0.000399
Scotland 0.000369
Holand-Netherlands 0.000031
Name: native_country, dtype: float64
<=50K 0.75919
>50K 0.24081
Name: income, dtype: float64
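
The division by the row count above can also be written with the `normalize` argument of `value_counts`. A minimal sketch on a toy Series; the four sample values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for one categorical column (values are illustrative only)
income = pd.Series(['<=50K', '<=50K', '<=50K', '>50K'])

# normalize=True returns relative frequencies instead of raw counts
freq = income.value_counts(normalize=True)
print(freq)
```

Both forms should give the same proportions; `normalize=True` simply avoids the explicit division.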

Explore workclass variable

# check labels in workclass variable

df.workclass.unique()

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

# check frequency distribution of values in workclass variable

df.workclass.value_counts()

Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64

# replace '?' values in workclass variable with `NaN`

df['workclass'] = df['workclass'].replace('?', np.nan)

# again check the frequency distribution of values in workclass variable

df.workclass.value_counts()

Private 22696
Self-emp-not-inc 2541
Local-gov 2093
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64

# check labels in occupation variable

df.occupation.unique()

array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',

'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',
'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',
'Tech-support', '?', 'Protective-serv', 'Armed-Forces',
'Priv-house-serv'], dtype=object)

df.occupation.value_counts()

Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
? 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64

df['occupation'] = df['occupation'].replace('?', np.nan)

df.occupation.value_counts()
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64

# check labels in native_country variable

df.native_country.unique()

array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',

'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada',
'Germany',
'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia',
'Cambodia',
'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
'China', 'Japan', 'Yugoslavia', 'Peru',
'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
'Holand-Netherlands'], dtype=object)

# check frequency distribution of values in native_country variable

df.native_country.value_counts()

United-States 29170
Mexico 643
? 583
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Trinadad&Tobago 19
Cambodia 19
Thailand 18
Laos 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Hungary 13
Honduras 13
Scotland 12
Holand-Netherlands 1
Name: native_country, dtype: int64

df['native_country'] = df['native_country'].replace('?', np.nan)

df.native_country.value_counts()

United-States 29170
Mexico 643
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Trinadad&Tobago 19
Cambodia 19
Thailand 18
Laos 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Hungary 13
Honduras 13
Scotland 12
Holand-Netherlands 1
Name: native_country, dtype: int64

df[categorical].isnull().sum()

workclass 1836
education 0
marital_status 0
occupation 1843
relationship 0
race 0
sex 0
native_country 583
income 0
dtype: int64
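
As an alternative to replacing '?' one column at a time, the placeholder can be declared as a missing-value marker when the file is read. A minimal sketch, assuming the same comma-plus-space layout as adult.csv; the three-row sample and the name `toy` are hypothetical:

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same ", "-separated layout
sample = io.StringIO(
    "39, State-gov\n"
    "54, ?\n"
    "28, Private\n"
)

# skipinitialspace drops the blank after each comma;
# na_values turns every '?' field into NaN while parsing
toy = pd.read_csv(sample, header=None, names=['age', 'workclass'],
                  skipinitialspace=True, na_values='?')

print(toy['workclass'].isnull().sum())
```

This is only a different route to the same result as the `replace` calls; it is not what the notebook itself does.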

for var in categorical:
    print(var, ' contains ', len(df[var].unique()), ' labels')

workclass contains 9 labels

education contains 16 labels
marital_status contains 7 labels
occupation contains 15 labels
relationship contains 6 labels
race contains 5 labels
sex contains 2 labels
native_country contains 42 labels
income contains 2 labels

numerical = [var for var in df.columns if df[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :', numerical)

There are 6 numerical variables

The numerical variables are : ['age', 'fnlwgt', 'education_num',

'capital_gain', 'capital_loss', 'hours_per_week']

df[numerical].head()

   age  fnlwgt  education_num  capital_gain  capital_loss  hours_per_week
0   39   77516             13          2174             0              40
1   50   83311             13             0             0              13
2   38  215646              9             0             0              40
3   53  234721              7             0             0              40
4   28  338409             13             0             0              40

Explore problems within numerical variables

Now, I will explore the numerical variables.

Missing values in numerical variables

df[numerical].isnull().sum()

age 0
fnlwgt 0
education_num 0
capital_gain 0
capital_loss 0
hours_per_week 0
dtype: int64

X = df.drop(['income'], axis=1)

y = df['income']

# split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# check the shape of X_train and X_test

X_train.shape, X_test.shape

((22792, 14), (9769, 14))

# check data types in X_train

X_train.dtypes

age int64
workclass object
fnlwgt int64
education object
education_num int64
marital_status object
occupation object
relationship object
race object
sex object
capital_gain int64
capital_loss int64
hours_per_week int64
native_country object
dtype: object

# display categorical variables

categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

['workclass',
'education',
'marital_status',
'occupation',
'relationship',
'race',
'sex',
'native_country']

# display numerical variables

numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

['age',
'fnlwgt',
'education_num',
'capital_gain',
'capital_loss',
'hours_per_week']

# print percentage of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()

workclass 0.055985
education 0.000000
marital_status 0.000000
occupation 0.056072
relationship 0.000000
race 0.000000
sex 0.000000
native_country 0.018164
dtype: float64

# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean() > 0:
        print(col, (X_train[col].isnull().mean()))

workclass 0.055984555984555984
occupation 0.05607230607230607
native_country 0.018164268164268166

# impute missing categorical variables with most frequent value

for df2 in [X_train, X_test]:
    # use the training-set mode for both frames so no test-set information is used;
    # assigning the result back avoids relying on in-place edits of a column copy
    for col in ['workclass', 'occupation', 'native_country']:
        df2[col] = df2[col].fillna(X_train[col].mode()[0])

# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()

workclass 0
education 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
native_country 0
dtype: int64

# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()

workclass 0
education 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
native_country 0
dtype: int64

As a final check, I will check for missing values in X_train and X_test.
# check missing values in X_train

X_train.isnull().sum()

age 0
workclass 0
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 0
dtype: int64

# check missing values in X_test

X_test.isnull().sum()

age 0
workclass 0
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 0
dtype: int64
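
The same "fit on the training set, apply to both sets" imputation can also be expressed with scikit-learn's `SimpleImputer`. A minimal sketch on toy frames; the data and the names `train`, `test`, `train_filled` and `test_filled` are made up for illustration, and this is an alternative rather than the notebook's own method:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_test (values are illustrative)
train = pd.DataFrame({'workclass': ['Private', 'Private', np.nan, 'State-gov']})
test = pd.DataFrame({'workclass': [np.nan, 'Local-gov']})

# Learn the most frequent label from the training data only
imputer = SimpleImputer(strategy='most_frequent')
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training mode on the test data, so no test information leaks in
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(test_filled)
```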

# print categorical variables

categorical

['workclass',
'education',
'marital_status',
'occupation',
'relationship',
'race',
'sex',
'native_country']

X_train[categorical].head()

workclass education marital_status occupation \

32098 Private HS-grad Married-civ-spouse Craft-repair
25206 State-gov HS-grad Divorced Adm-clerical
23491 Private Some-college Married-civ-spouse Sales
12367 Private HS-grad Never-married Craft-repair
7054 Private 7th-8th Never-married Craft-repair

relationship race sex native_country

32098 Husband White Male United-States
25206 Unmarried White Female United-States
23491 Husband White Male United-States
12367 Not-in-family White Male Guatemala
7054 Not-in-family White Male Germany
# import category encoders

import category_encoders as ce

# encode remaining variables with one-hot encoding

encoder = ce.OneHotEncoder(cols=['workclass', 'education',

'marital_status', 'occupation', 'relationship',
'race', 'sex', 'native_country'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

X_train.head()

       age  workclass_1  workclass_2  workclass_3  workclass_4  workclass_5  \
32098   45            1            0            0            0            0
25206   47            0            1            0            0            0
23491   48            1            0            0            0            0
12367   29            1            0            0            0            0
7054    23            1            0            0            0            0

       workclass_6  workclass_7  workclass_8  fnlwgt  ...  native_country_32  \
32098            0            0            0  170871  ...                  0
25206            0            0            0  108890  ...                  0
23491            0            0            0  187505  ...                  0
12367            0            0            0  145592  ...                  0
7054             0            0            0  203003  ...                  0

       native_country_33  native_country_34  native_country_35  \
32098                  0                  0                  0
25206                  0                  0                  0
23491                  0                  0                  0
12367                  0                  0                  0
7054                   0                  0                  0

       native_country_36  native_country_37  native_country_38  \
32098                  0                  0                  0
25206                  0                  0                  0
23491                  0                  0                  0
12367                  0                  0                  0
7054                   0                  0                  0

       native_country_39  native_country_40  native_country_41
32098                  0                  0                  0
25206                  0                  0                  0
23491                  0                  0                  0
12367                  0                  0                  0
7054                   0                  0                  0

[5 rows x 105 columns]

X_train.shape

(22792, 105)

X_test.head()

       age  workclass_1  workclass_2  workclass_3  workclass_4  workclass_5  \
22278   27            1            0            0            0            0
8950    27            1            0            0            0            0
7838    25            1            0            0            0            0
16505   46            1            0            0            0            0
19140   45            1            0            0            0            0

       workclass_6  workclass_7  workclass_8  fnlwgt  ...  native_country_32  \
22278            0            0            0  177119  ...                  0
8950             0            0            0  216481  ...                  0
7838             0            0            0  256263  ...                  0
16505            0            0            0  147640  ...                  0
19140            0            0            0  172822  ...                  0

       native_country_33  native_country_34  native_country_35  \
22278                  0                  0                  0
8950                   0                  0                  0
7838                   0                  0                  0
16505                  0                  0                  0
19140                  0                  0                  0

       native_country_36  native_country_37  native_country_38  \
22278                  0                  0                  0
8950                   0                  0                  0
7838                   0                  0                  0
16505                  0                  0                  0
19140                  0                  0                  0

       native_country_39  native_country_40  native_country_41
22278                  0                  0                  0
8950                   0                  0                  0
7838                   0                  0                  0
16505                  0                  0                  0
19140                  0                  0                  0

[5 rows x 105 columns]

X_test.shape

(9769, 105)
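
The encoder above is fitted on the training set and only applied to the test set, which matters when the test set contains a label that never occurred in training. A minimal sketch of that situation using scikit-learn's own `OneHotEncoder` instead of `category_encoders`; the toy frames and the 'Unknown' label are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames standing in for X_train / X_test (values are illustrative)
train = pd.DataFrame({'sex': ['Male', 'Female', 'Male']})
test = pd.DataFrame({'sex': ['Female', 'Unknown']})  # 'Unknown' never seen in training

# handle_unknown='ignore' encodes unseen labels as an all-zero row
enc = OneHotEncoder(handle_unknown='ignore')
train_encoded = enc.fit_transform(train).toarray()
test_encoded = enc.transform(test).toarray()

print(enc.categories_)
print(test_encoded)
```

Because the column layout is learned once from the training data, both encoded sets end up with the same number of columns, just as `X_train` and `X_test` both have 105 columns above.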

cols = X_train.columns

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

# pass cols directly; wrapping it in a list would create MultiIndex columns
X_train = pd.DataFrame(X_train, columns=cols)

X_test = pd.DataFrame(X_test, columns=cols)

X_train.head()

    age  workclass_1  workclass_2  workclass_3  workclass_4  workclass_5  \
0  0.40          0.0          0.0          0.0          0.0          0.0
1  0.50         -1.0          1.0          0.0          0.0          0.0
2  0.55          0.0          0.0          0.0          0.0          0.0
3 -0.40          0.0          0.0          0.0          0.0          0.0
4 -0.70          0.0          0.0          0.0          0.0          0.0

   workclass_6  workclass_7  workclass_8    fnlwgt  ...  native_country_32  \
0          0.0          0.0          0.0 -0.058906  ...                0.0
1          0.0          0.0          0.0 -0.578076  ...                0.0
2          0.0          0.0          0.0  0.080425  ...                0.0
3          0.0          0.0          0.0 -0.270650  ...                0.0
4          0.0          0.0          0.0  0.210240  ...                0.0

   native_country_33  native_country_34  native_country_35  native_country_36  \
0                0.0                0.0                0.0                0.0
1                0.0                0.0                0.0                0.0
2                0.0                0.0                0.0                0.0
3                0.0                0.0                0.0                0.0
4                0.0                0.0                0.0                0.0

   native_country_37  native_country_38  native_country_39  native_country_40  \
0                0.0                0.0                0.0                0.0
1                0.0                0.0                0.0                0.0
2                0.0                0.0                0.0                0.0
3                0.0                0.0                0.0                0.0
4                0.0                0.0                0.0                0.0

   native_country_41
0                0.0
1                0.0
2                0.0
3                0.0
4                0.0

[5 rows x 105 columns]
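
With its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range. A minimal NumPy sketch of that calculation on made-up numbers; the helper name `robust_scale` is hypothetical and the sketch ignores edge cases such as a zero interquartile range:

```python
import numpy as np

def robust_scale(column):
    """Centre on the median and divide by the interquartile range."""
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return (column - median) / (q3 - q1)

# Made-up values with one extreme outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
scaled = robust_scale(values)
print(scaled)
```

The scaled `age` values shown above (0.40 for age 45, -0.70 for age 23) are consistent with a training-set median of 37 and an interquartile range of 20, although those two statistics are inferred here and are not printed in the notebook.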

# train a Gaussian Naive Bayes classifier on the training set

from sklearn.naive_bayes import GaussianNB

# instantiate the model

gnb = GaussianNB()
# fit the model
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
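
To make clearer what fitting a Gaussian Naive Bayes model involves, here is a from-scratch sketch under simplifying assumptions: it stores a prior, a mean and a variance per class and feature, then picks the class with the highest log posterior. The function names and toy data are hypothetical, and the fixed `1e-9` variance term is a simplification (scikit-learn scales its `var_smoothing` by the largest feature variance):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = (len(rows) / len(X),        # class prior
                         rows.mean(axis=0),         # per-feature mean
                         rows.var(axis=0) + 1e-9)   # per-feature variance, smoothed
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log posterior for each row."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        # log prior + sum of independent Gaussian log densities
        log_density = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
        scores.append(np.log(prior) + log_density.sum(axis=1))
    return np.array(labels)[np.argmax(np.array(scores), axis=0)]

# Made-up two-feature data with two well-separated classes
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array(['<=50K', '<=50K', '<=50K', '>50K', '>50K', '>50K'])

model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(model, np.array([[1.1, 2.1], [5.1, 5.9]])))
```

The "naive" part is the sum over features: each feature contributes its own one-dimensional Gaussian term, as if the features were independent given the class.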

y_pred = gnb.predict(X_test)

y_pred

array(['<=50K', '<=50K', '>50K', ..., '>50K', '<=50K', '<=50K'],

dtype='<U5')

from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

Model accuracy score: 0.8083

y_pred_train = gnb.predict(X_train)

y_pred_train

array(['>50K', '<=50K', '>50K', ..., '<=50K', '>50K', '<=50K'],

dtype='<U5')

print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))

Training-set accuracy score: 0.8067

# print the scores on training and test set

print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

Training set score: 0.8067

Test set score: 0.8083

# check class distribution in test set

y_test.value_counts()

<=50K 7407
>50K 2362
Name: income, dtype: int64
# check null accuracy score

null_accuracy = (7407/(7407+2362))

print('Null accuracy score: {0:0.4f}'. format(null_accuracy))

Null accuracy score: 0.7582
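
The null accuracy can also be derived from the class counts rather than typed in by hand. A minimal sketch in plain Python, using the two counts from the `y_test.value_counts()` output above; the variable names are illustrative:

```python
from collections import Counter

# Class counts taken from the y_test.value_counts() output above
test_counts = Counter({'<=50K': 7407, '>50K': 2362})

# Null accuracy: always predict the most frequent class
majority_class, majority_count = test_counts.most_common(1)[0]
null_accuracy = majority_count / sum(test_counts.values())

print(majority_class, round(null_accuracy, 4))
```

A model is only useful if it beats this baseline; here the test accuracy of 0.8083 is above the null accuracy of 0.7582.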

# Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)

# scikit-learn puts actual classes in rows and predicted classes in columns;
# '<=50K' (index 0) is treated as the positive class here

print('\nTrue Positives(TP) = ', cm[0,0])

print('\nTrue Negatives(TN) = ', cm[1,1])

print('\nFalse Positives(FP) = ', cm[1,0])

print('\nFalse Negatives(FN) = ', cm[0,1])

Confusion matrix

[[5999 1408]
[ 465 1897]]

True Positives(TP) = 5999

True Negatives(TN) = 1897

False Positives(FP) = 465

False Negatives(FN) = 1408

# visualize confusion matrix with seaborn heatmap

# rows of cm are actual classes, columns are predicted classes
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Positive:1', 'Predict Negative:0'],
                         index=['Actual Positive:1', 'Actual Negative:0'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

<matplotlib.axes._subplots.AxesSubplot at 0x7fd6899b6a58>
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

precision recall f1-score support

<=50K 0.93 0.81 0.86 7407

>50K 0.57 0.80 0.67 2362

accuracy 0.81 9769

macro avg 0.75 0.81 0.77 9769
weighted avg 0.84 0.81 0.82 9769
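
The per-class figures in this report can be reproduced by hand from the confusion matrix printed earlier. A minimal sketch in plain Python, relying on scikit-learn's convention that rows are actual classes and columns are predicted classes; the helper name `precision_recall` is hypothetical:

```python
# Confusion matrix printed above; in scikit-learn, rows are the actual
# classes and columns are the predicted classes, ordered '<=50K', '>50K'
cm = [[5999, 1408],
      [465, 1897]]

def precision_recall(cm, k):
    """Precision and recall for the class at index k of a square matrix."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column total
    actual_k = sum(cm[k])                     # row total
    return true_pos / predicted_k, true_pos / actual_k

for k, label in enumerate(['<=50K', '>50K']):
    precision, recall = precision_recall(cm, k)
    f1 = 2 * precision * recall / (precision + recall)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
```

Precision divides by a column total (everything predicted as the class), while recall divides by a row total (everything that actually belongs to the class).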

Classification accuracy
# rows of cm are actual classes, columns are predicted classes;
# '<=50K' (index 0) is treated as the positive class
TP = cm[0,0]
TN = cm[1,1]
FP = cm[1,0]
FN = cm[0,1]

# print classification accuracy

classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)

print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))

Classification accuracy : 0.8083

Classification error
# print classification error

classification_error = (FP + FN) / float(TP + TN + FP + FN)

print('Classification error : {0:0.4f}'.format(classification_error))

Classification error : 0.1917

# print precision score

precision = TP / float(TP + FP)

print('Precision : {0:0.4f}'.format(precision))

Precision : 0.9281

recall = TP / float(TP + FN)

print('Recall or Sensitivity : {0:0.4f}'.format(recall))

Recall or Sensitivity : 0.8099

true_positive_rate = TP / float(TP + FN)

print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))

True Positive Rate : 0.8099

false_positive_rate = FP / float(FP + TN)

print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))

False Positive Rate : 0.1969

specificity = TN / (TN + FP)

print('Specificity : {0:0.4f}'.format(specificity))

Specificity : 0.8031

Calculate class probabilities
# print the first 10 predicted probabilities of two classes- 0 and 1
y_pred_prob = gnb.predict_proba(X_test)[0:10]

y_pred_prob

array([[9.99999426e-01, 5.74152436e-07],
[9.99687907e-01, 3.12093456e-04],
[1.54405602e-01, 8.45594398e-01],
[1.73624321e-04, 9.99826376e-01],
[8.20121011e-09, 9.99999992e-01],
[8.76844580e-01, 1.23155420e-01],
[9.99999927e-01, 7.32876705e-08],
[9.99993460e-01, 6.53998797e-06],
[9.87738143e-01, 1.22618575e-02],
[9.99999996e-01, 4.01886317e-09]])

# store the probabilities in dataframe

y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - <=50K', 'Prob of - >50K'])

y_pred_prob_df

Prob of - <=50K Prob of - >50K

0 9.999994e-01 5.741524e-07
1 9.996879e-01 3.120935e-04
2 1.544056e-01 8.455944e-01
3 1.736243e-04 9.998264e-01
4 8.201210e-09 1.000000e+00
5 8.768446e-01 1.231554e-01
6 9.999999e-01 7.328767e-08
7 9.999935e-01 6.539988e-06
8 9.877381e-01 1.226186e-02
9 1.000000e+00 4.018863e-09

# print the first 10 predicted probabilities for class 1 - Probability of >50K

gnb.predict_proba(X_test)[0:10, 1]

array([5.74152436e-07, 3.12093456e-04, 8.45594398e-01, 9.99826376e-01,

9.99999992e-01, 1.23155420e-01, 7.32876705e-08, 6.53998797e-06,
1.22618575e-02, 4.01886317e-09])
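
For a two-class model, the default prediction effectively amounts to thresholding the probability of '>50K' at 0.5. A minimal sketch in plain Python, using the ten probabilities printed above; the helper name `to_labels` and the alternative threshold of 0.9 are illustrative choices:

```python
# First 10 predicted probabilities of '>50K', copied from the output above
prob_gt_50k = [5.74152436e-07, 3.12093456e-04, 8.45594398e-01, 9.99826376e-01,
               9.99999992e-01, 1.23155420e-01, 7.32876705e-08, 6.53998797e-06,
               1.22618575e-02, 4.01886317e-09]

def to_labels(probabilities, threshold=0.5):
    """Map each probability of '>50K' to a class label."""
    return ['>50K' if p > threshold else '<=50K' for p in probabilities]

print(to_labels(prob_gt_50k))
# A stricter threshold yields fewer '>50K' predictions
print(to_labels(prob_gt_50k, threshold=0.9))
```

Raising the threshold trades recall on '>50K' for precision, which is the trade-off an ROC curve traces out across all thresholds.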

# store the predicted probabilities for class 1 - Probability of >50K

y_pred1 = gnb.predict_proba(X_test)[:, 1]

# plot histogram of predicted probabilities

# adjust the font size

plt.rcParams['font.size'] = 12

# plot histogram with 10 bins

plt.hist(y_pred1, bins = 10)

# set the title of predicted probabilities

plt.title('Histogram of predicted probabilities of salaries >50K')

# set the x-axis limit

plt.xlim(0,1)

# set the axis labels

plt.xlabel('Predicted probabilities of salaries >50K')
plt.ylabel('Frequency')

Text(0, 0.5, 'Frequency')

# plot ROC Curve

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred1, pos_label = '>50K')

plt.figure(figsize=(6,4))

plt.plot(fpr, tpr, linewidth=2)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12

plt.title('ROC curve for Gaussian Naive Bayes Classifier for Predicting Salaries')

plt.xlabel('False Positive Rate (1 - Specificity)')

plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

# compute ROC AUC

from sklearn.metrics import roc_auc_score

ROC_AUC = roc_auc_score(y_test, y_pred1)

print('ROC AUC : {:.4f}'.format(ROC_AUC))

ROC AUC : 0.8941

Interpretation
• ROC AUC is a single-number summary of classifier performance. The higher the
value, the better the classifier.
• The ROC AUC of our model is 0.8941, which is close to 1. So, we can conclude that our classifier
does a good job of predicting whether a person earns more than 50K a year.
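
ROC AUC also has a ranking interpretation: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. A minimal sketch in plain Python on made-up labels and scores; the function name `roc_auc` is hypothetical and the pairwise loop is only practical for small inputs:

```python
def roc_auc(labels, scores, positive):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for lab, s in zip(labels, scores) if lab == positive]
    neg = [s for lab, s in zip(labels, scores) if lab != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, for illustration only
toy_labels = ['<=50K', '<=50K', '>50K', '<=50K', '>50K']
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80]

print(roc_auc(toy_labels, toy_scores, positive='>50K'))
```

In the toy data, five of the six pairs are ordered correctly, so the value is 5/6. A value of 0.5 would correspond to random ranking.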
# calculate cross-validated ROC AUC

from sklearn.model_selection import cross_val_score

Cross_validated_ROC_AUC = cross_val_score(gnb, X_train, y_train, cv=5, scoring='roc_auc').mean()

print('Cross validated ROC AUC : {:.4f}'.format(Cross_validated_ROC_AUC))

Cross validated ROC AUC : 0.8938

# Applying 10-Fold Cross Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

Cross-validation scores:[0.81359649 0.80438596 0.81184211 0.80517771 0.79640193 0.79684072
 0.81044318 0.81175954 0.80210619 0.81035996]

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

Average cross-validation score: 0.8063
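
In the cross-validation above, the scaler was fitted once on the whole training set before the folds were formed. One possible refinement is to wrap the scaler and the classifier in a pipeline so that scaling is re-fitted inside each fold. A minimal sketch on synthetic data; `make_classification` and the names `X_demo`, `y_demo`, `pipeline` and `demo_scores` are stand-ins, not part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data; the real X_train / y_train are not used here
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# The scaler is re-fitted inside every fold, on that fold's training part only
pipeline = make_pipeline(RobustScaler(), GaussianNB())
demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')

print(demo_scores.mean(), demo_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the score varies from fold to fold.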

Abb Iec61850
No ratings yet
Abb Iec61850
20 pages
Design Guidelines - Spot Welding Chapter
No ratings yet
Design Guidelines - Spot Welding Chapter
11 pages
XF195 Manual
No ratings yet
XF195 Manual
72 pages
Ensayo Sobre El Calentamiento Global en Inglés
100% (1)
Ensayo Sobre El Calentamiento Global en Inglés
5 pages
A Prepaid Water Control
50% (2)
A Prepaid Water Control
41 pages
Journey Mapping
No ratings yet
Journey Mapping
6 pages
1st Decision Drawing of Generator Room
No ratings yet
1st Decision Drawing of Generator Room
1 page
Temporary Revision 29 0012: Illustrated Parts Catalog
No ratings yet
Temporary Revision 29 0012: Illustrated Parts Catalog
10 pages
Proactive Gradient Conflict Mitigation in Multi-Task Learning: A Sparse Training Perspective
No ratings yet
Proactive Gradient Conflict Mitigation in Multi-Task Learning: A Sparse Training Perspective
23 pages
Gs-Assessing The Curriculum
No ratings yet
Gs-Assessing The Curriculum
36 pages
How To Use The Canon Rebel SL2 - Tips, Tricks and Picture Settings - Tom's Guide PDF
No ratings yet
How To Use The Canon Rebel SL2 - Tips, Tricks and Picture Settings - Tom's Guide PDF
8 pages
3VA27125AB030AA0 Datasheet en
No ratings yet
3VA27125AB030AA0 Datasheet en
3 pages
SQL Assignements
No ratings yet
SQL Assignements
5 pages
Norms For PoP - AICTE - Draft Addendum
No ratings yet
Norms For PoP - AICTE - Draft Addendum
3 pages
Intellectual Capital
No ratings yet
Intellectual Capital
16 pages
Fully-Differential Isolation Amplifier: Features Description
No ratings yet
Fully-Differential Isolation Amplifier: Features Description
23 pages
Structure Based Testing Techniques - MCQs
No ratings yet
Structure Based Testing Techniques - MCQs
6 pages
Share Prices
No ratings yet
Share Prices
186 pages
User Guide: Minimal Information Model For Patient Safety Incident Reporting and Learning Systems
No ratings yet
User Guide: Minimal Information Model For Patient Safety Incident Reporting and Learning Systems
20 pages
286 Idbi Statement PDF
No ratings yet
286 Idbi Statement PDF
1 page
Precision t1600 Spec Sheet
No ratings yet
Precision t1600 Spec Sheet
2 pages
Plastic Pollution Lesson Plan
No ratings yet
Plastic Pollution Lesson Plan
3 pages
Datasheet Resun 545W
No ratings yet
Datasheet Resun 545W
2 pages
Curriculum Vitae: Career Objective
No ratings yet
Curriculum Vitae: Career Objective
3 pages
Section - 1 - Analyze
No ratings yet
Section - 1 - Analyze
10 pages
Advantage of Batch and Continous Culture
No ratings yet
Advantage of Batch and Continous Culture
3 pages