Predictive Modelling - Logistic Regression - Student Version New2.3 - Colaboratory
WHO is a specialized agency of the UN concerned with the health of the world's population.
Based on various parameters, WHO allocates budgets to different areas to conduct
campaigns/initiatives that improve healthcare. Annual salary is an important variable
considered when deciding the budget to be allocated to an area.
We have data containing 32,561 samples and 15 continuous and categorical variables,
extracted from the 1994 Census dataset.
The goal here is to build a binary classification model to predict whether salary is >50K or <=50K.
Data Dictionary
1. age: age
2. workclass: workclass
3. education: highest education
4. marrital status: marital status
5. occupation: occupation
6. sex: sex
7. capital gain: income from investment sources other than salary/wages
8. capital loss: losses from investment sources other than salary/wages
9. working hours: number of working hours per week
10. salary: salary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix
adult_data=pd.read_csv("adult.data-1.csv")
EDA
adult_data.head()
adult_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   age                     32561 non-null  int64
 1   workclass               32561 non-null  object
 2   education               32561 non-null  object
 3   marrital status         32561 non-null  object
 4   occupation              32561 non-null  object
 5   sex                     32561 non-null  object
 6   capital gain            32561 non-null  int64
 7   capital loss            32561 non-null  int64
 8   working hours per week  32561 non-null  int64
 9   salary                  32561 non-null  object
dtypes: int64(4), object(6)
memory usage: 2.5+ MB
There are no missing values. 4 variables are numeric and the remaining 6 are categorical.
The categorical variables are not yet encoded.
Do we need to remove duplicate rows here? We remove them below, but in which cases
should duplicate records actually be removed?
# Count duplicate rows before dropping them, then check the resulting shape
dups = adult_data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
adult_data.drop_duplicates(inplace=True)
print(adult_data.shape)
# Print the category frequencies for every categorical ('object') column
for feature in adult_data.columns:
    if adult_data[feature].dtype == 'object':
        print(feature)
        print(adult_data[feature].value_counts())
        print('\n')
workclass
Private 17474
Self-emp-not-inc 2447
Local-gov 1980
? 1519
State-gov 1246
Self-emp-inc 1089
Federal-gov 921
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
education
HS-grad 7815
Some-college 5692
Bachelors 4461
Masters 1606
Assoc-voc 1281
Assoc-acdm 1036
11th 987
10th 820
7th-8th 611
Prof-school 562
9th 502
Doctorate 399
12th 397
5th-6th 315
1st-4th 164
Preschool 49
Name: education, dtype: int64
marrital status
Married-civ-spouse 12679
Never-married 7698
Divorced 3930
Separated 978
Widowed 971
Married-spouse-absent 418
Married-AF-spouse 23
Name: marrital status, dtype: int64
occupation
Prof-specialty 3703
Exec-managerial 3531
Sales 3009
Craft-repair 2970
Adm-clerical 2884
Other-service 2626
? 1526
Machine-op-inspct 1483
Transport-moving 1372
Handlers-cleaners 1033
Farming-fishing 951
Tech-support 841
Protective-serv 614
...
# Replace the '?' entries with a new 'Unk' category
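This cell was left for students to complete; a minimal sketch follows, assuming the
replacement label 'Unk' from the comment above (a later grouping cell references a
lowercase 'unknown', so whichever label is chosen must be used consistently). Per the
value_counts output above, '?' appears in the workclass and occupation columns.

for col in ['workclass', 'occupation']:
    # strip stray whitespace, then replace the '?' placeholder with 'Unk'
    adult_data[col] = adult_data[col].str.strip().replace('?', 'Unk')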
adult_data.describe()
Checking the spread of the data using boxplots for the continuous variables; a sketch
follows the column list below.
cols = ['age','capital gain','capital loss','working hours per week']
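A minimal boxplot sketch for the columns listed above:

fig, axes = plt.subplots(1, len(cols), figsize=(16, 4))
for ax, col in zip(axes, cols):
    # one boxplot per continuous variable to inspect spread and outliers
    sns.boxplot(y=adult_data[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()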
def remove_outlier(col):
    # IQR fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
## This is a loop to treat (cap) outliers for all the non-'object' type variables
# for column in adult_data.columns:
#     if adult_data[column].dtype != 'object':
#         lr, ur = remove_outlier(adult_data[column])
#         adult_data[column] = np.where(adult_data[column] > ur, ur, adult_data[column])
#         adult_data[column] = np.where(adult_data[column] < lr, lr, adult_data[column])
cols = ['age','capital gain','capital loss','working hours per week']
adult_data.corr()
adult_data.describe()
# Pairplot using sns
sns.pairplot(adult_data ,diag_kind='hist' ,hue='salary');
## We are coding up the 'education' variable in an ordinal manner
adult_data['education']=np.where(adult_data['education'] =='Preschool', '1', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='1st-4th', '2', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='5th-6th', '3', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='7th-8th', '4', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='9th', '5', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='10th', '6', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='11th', '7', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='12th', '8', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='HS-grad', '9', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Prof-school', '9', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Assoc-acdm', '10', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Assoc-voc', '11', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Some-college', '12', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Bachelors', '13', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Masters', '14', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Doctorate', '15', adult_data['education'])
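The same ordinal encoding can be written more compactly with a mapping dict; an
equivalent sketch (run instead of the np.where chain above, not after it — note the
original maps both 'HS-grad' and 'Prof-school' to '9'):

education_order = {
    'Preschool': '1', '1st-4th': '2', '5th-6th': '3', '7th-8th': '4',
    '9th': '5', '10th': '6', '11th': '7', '12th': '8',
    'HS-grad': '9', 'Prof-school': '9', 'Assoc-acdm': '10', 'Assoc-voc': '11',
    'Some-college': '12', 'Bachelors': '13', 'Masters': '14', 'Doctorate': '15',
}
adult_data['education'] = adult_data['education'].map(education_order)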
## We are grouping certain types of 'workclass' under different categories
adult_data['workclass']=np.where(adult_data['workclass'] =='Federal-gov', 'Government', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Local-gov', 'Government', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='State-gov', 'Government', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Self-emp-inc', 'Others', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Self-emp-not-inc', 'Others', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='unknown', 'Others', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Without-pay', 'Others', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Never-worked', 'Others', adult_data['workclass'])
## We are grouping certain types of 'marrital status' under different categories
# NOTE: the grouped label names were truncated in this export; 'Currently-Not-Married'
# and 'Currently-Married' are reconstructions consistent with the visible prefixes.
# ('Married-AF-absent' does not occur in the data per the value_counts above, so no
# mapping is needed for it.)
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Divorced', 'Currently-Not-Married', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Separated', 'Currently-Not-Married', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Never-married', 'Currently-Not-Married', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Widowed', 'Currently-Not-Married', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-civ-spouse', 'Currently-Married', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-spouse-absent', 'Currently-Married', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-AF-spouse', 'Currently-Married', adult_data['marrital status'])
## We are grouping certain types of 'occupation' under different categories
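The occupation grouping itself was left for students, and the original groups are not
visible in this export; the buckets below are purely illustrative:

# Hypothetical occupation buckets -- illustrative only, not the original notebook's grouping
occupation_groups = {
    'Prof-specialty': 'Professional', 'Exec-managerial': 'Professional', 'Tech-support': 'Professional',
    'Sales': 'Sales-Clerical', 'Adm-clerical': 'Sales-Clerical',
    'Craft-repair': 'Blue-Collar', 'Machine-op-inspct': 'Blue-Collar', 'Transport-moving': 'Blue-Collar',
    'Handlers-cleaners': 'Blue-Collar', 'Farming-fishing': 'Blue-Collar',
    'Other-service': 'Service', 'Protective-serv': 'Service', 'Priv-house-serv': 'Service',
    'Armed-Forces': 'Service',
}
# categories missing from the dict (e.g. 'Unk') keep their original value
adult_data['occupation'] = adult_data['occupation'].map(occupation_groups).fillna(adult_data['occupation'])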
adult_data.head()
adult_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26697 entries, 0 to 32560
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   age                     26697 non-null  int64
 1   workclass               26697 non-null  object
 2   education               26697 non-null  object
 3   marrital status         26697 non-null  object
 4   occupation              26697 non-null  object
 5   sex                     26697 non-null  object
 6   capital gain            26697 non-null  int64
 7   capital loss            26697 non-null  int64
 8   working hours per week  26697 non-null  float64
 9   salary                  26697 non-null  object
dtypes: float64(1), int64(3), object(6)
memory usage: 2.2+ MB
## Converting the education variable to numeric
adult_data['education'] = adult_data['education'].astype('int64')
adult_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26697 entries, 0 to 32560
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   age                     26697 non-null  int64
 1   workclass               26697 non-null  object
 2   education               26697 non-null  int64
 3   marrital status         26697 non-null  object
 4   occupation              26697 non-null  object
 5   sex                     26697 non-null  object
 6   capital gain            26697 non-null  int64
 7   capital loss            26697 non-null  int64
 8   working hours per week  26697 non-null  float64
 9   salary                  26697 non-null  object
dtypes: float64(1), int64(4), object(5)
memory usage: 2.2+ MB
adult_data['salary'].value_counts()
## Converting the 'salary' variable into numeric by using the LabelEncoder functionality in sklearn.preprocessing
## Defining a Label Encoder object instance
## Applying the created Label Encoder object to the target class
## Assigning 0 to <=50K and 1 to >50K
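A minimal sketch of the label-encoding step described above (LabelEncoder orders classes
alphabetically, which gives exactly the 0/1 assignment stated):

from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
# '<=50K' sorts before '>50K', so fit_transform assigns them 0 and 1 respectively
adult_data['salary'] = LE.fit_transform(adult_data['salary'])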
## Converting the other 'object' type variables into dummy variables
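A sketch using pd.get_dummies; drop_first=True is one common choice to avoid redundant
columns (the original cell's exact arguments are not shown in this export):

adult_data = pd.get_dummies(adult_data,
                            columns=['workclass', 'marrital status', 'occupation', 'sex'],
                            drop_first=True)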
Train Test Split
# Copy all the predictor variables into X dataframe
# Copy target into the y dataframe.
# Split X and y into training and test set in 70:30 ratio
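A sketch of the split described by the comments above (random_state is an arbitrary
choice for reproducibility; the original seed is not shown):

# Copy all the predictor variables into X and the target into y
X = adult_data.drop('salary', axis=1)
y = adult_data['salary']

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)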
From the scikit-learn documentation on LogisticRegression solvers:
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster
for large ones.
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle
multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with
approximately the same scale. You can preprocess the data with a scaler from
sklearn.preprocessing.
Article on Solvers
# Fit the Logistic Regression model
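A minimal fit sketch; 'liblinear' suits a dataset of this size per the solver notes
above, though the original hyperparameters are not shown:

model = LogisticRegression(solver='liblinear', random_state=1)
model.fit(X_train, y_train)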
Model Evaluation
# Accuracy - Training Data
0.8265104083052389
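The value above is the model's mean accuracy on the training data; a sketch of the call
that produces it:

model.score(X_train, y_train)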
An AUC value closer to 1 indicates good separability between the predicted classes, and
thus a model that is good for prediction.
The ROC curve visually represents this concept: the curve should be as far from the
diagonal as possible.
# predict probabilities
# keep probabilities for the positive outcome only
# calculate AUC
# calculate roc curve
# plot the roc curve for the model
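A sketch implementing the commented steps above as a reusable helper:

def plot_roc(model, X, y, label):
    # predict probabilities and keep them for the positive outcome only
    probs = model.predict_proba(X)[:, 1]
    # calculate AUC
    print('%s AUC: %.3f' % (label, roc_auc_score(y, probs)))
    # calculate the roc curve and plot it against the chance diagonal
    fpr, tpr, thresholds = roc_curve(y, probs)
    plt.plot(fpr, tpr, marker='.', label=label)
    plt.plot([0, 1], [0, 1], linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.show()

plot_roc(model, X_train, y_train, 'Train')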
# Accuracy - Test Data
0.8213483146067416
# predict probabilities
# keep probabilities for the positive outcome only
# calculate AUC
# calculate roc curve
# plot the roc curve for the model
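The same steps for the test set, reusing the helper sketched above, plus the confusion
matrices whose outputs appear below:

plot_roc(model, X_test, y_test, 'Test')

# Confusion matrix and classification report, training data then test data
print(confusion_matrix(y_train, model.predict(X_train)))
print(classification_report(y_train, model.predict(X_train)))
print(confusion_matrix(y_test, model.predict(X_test)))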
Confusion matrix - training data:
array([[12674,  1096],
       [ 2146,  2771]], dtype=int64)

precision recall f1-score support
(classification report body not preserved in this export)

Confusion matrix - test data:
array([[5412,  491],
       [ 940, 1167]], dtype=int64)
To open this notebook in Google Colaboratory:
1. Log in to Google
2. Go to drive.google.com
3. Upload the Jupyter notebook file into the drive
4. Double-click it, or right-click -> Open with -> Google Colaboratory
Alternatively:
5. Log in to Google
6. Go to https://colab.research.google.com/notebooks/intro.ipynb#recent=true
7. Upload the Jupyter notebook