Predictive+Modelling+-+Logistic+Regression+-+Student+Version-New2.3.ipynb - Colaboratory

Problem Statement

WHO is a specialized agency of the UN concerned with world population health. Based on
various parameters, WHO allocates budgets to different areas to conduct campaigns and
initiatives that improve healthcare. Annual salary is an important variable considered when
deciding the budget to be allocated to an area.

We have data containing 32,561 samples and 15 continuous and categorical variables,
extracted from the 1994 US Census dataset (the student version of the file used here retains
10 of these variables, as listed in the data dictionary below).

The goal here is to build a binary classification model to predict whether salary is >50K or <=50K.

Data Dictionary

1. age: age of the individual, in years
2. workclass: type of employment (e.g. Private, Government, Self-employed)
3. education: highest education attained
4. marrital status: marital status (the column name keeps the dataset's spelling)
5. occupation: occupation of the individual
6. sex: sex of the individual
7. capital gain: income from investment sources other than salary/wages
8. capital loss: losses from investment sources other than salary/wages
9. working hours: number of working hours per week
10. salary: the target variable; whether salary is <=50K or >50K

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   
import seaborn as sns
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix

adult_data=pd.read_csv("adult.data-1.csv")

EDA

adult_data.head()
adult_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 education 32561 non-null object
3 marrital status 32561 non-null object
4 occupation 32561 non-null object
5 sex 32561 non-null object
6 capital gain 32561 non-null int64
7 capital loss 32561 non-null int64
8 working hours per week 32561 non-null int64
9 salary 32561 non-null object
dtypes: int64(4), object(6)
memory usage: 2.5+ MB

There are no missing values. Four variables are numeric (int64) and the remaining six are
categorical (object). The categorical variables are not yet in an encoded format.

Check for duplicate data
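
A minimal sketch of the check that likely produced the counts below (it is the same check repeated after deduplication further down):

dups = adult_data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
print(adult_data.shape)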

Number of duplicate rows = 5864


(32561, 10)

Do we need to remove the duplicate data here? We have removed it below, but in which cases
should duplicate data be removed?

adult_data.drop_duplicates(inplace=True) 

dups = adult_data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
print(adult_data.shape)

Number of duplicate rows = 0


(26697, 10)

Getting unique counts of all object columns

for feature in adult_data.columns: 
    if adult_data[feature].dtype == 'object': 
        print(feature)
        print(adult_data[feature].value_counts())
        print('\n')
        

workclass
Private 17474
Self-emp-not-inc 2447
Local-gov 1980
? 1519
State-gov 1246
Self-emp-inc 1089
Federal-gov 921
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64

education
HS-grad 7815
Some-college 5692
Bachelors 4461
Masters 1606
Assoc-voc 1281
Assoc-acdm 1036
11th 987
10th 820
7th-8th 611
Prof-school 562
9th 502
Doctorate 399
12th 397
5th-6th 315
1st-4th 164
Preschool 49
Name: education, dtype: int64

marrital status
Married-civ-spouse 12679
Never-married 7698
Divorced 3930
Separated 978
Widowed 971
Married-spouse-absent 418
Married-AF-spouse 23
Name: marrital status, dtype: int64

occupation
Prof-specialty 3703
Exec-managerial 3531
Sales 3009
Craft-repair 2970
Adm-clerical 2884
Other-service 2626
? 1526
Machine-op-inspct 1483
Transport-moving 1372
Handlers-cleaners 1033
Farming-fishing 951
Tech-support 841
Protective-serv 614
...
Name: occupation, dtype: int64

'workclass' and 'occupation' have '?' values.


Since a high number of cases have '?', we will convert them into a new level rather than
dropping those rows.

# Replace ? to new Unk category
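
The replacement cell itself is not visible in this export; a minimal sketch, assuming the new level is spelled 'unknown' to match the 'workclass' grouping code further below:

adult_data['workclass'] = adult_data['workclass'].replace('?', 'unknown')
adult_data['occupation'] = adult_data['occupation'].replace('?', 'unknown')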


adult_data.describe()

Checking the spread of the data using boxplots for the continuous variables.

cols = ['age','capital gain','capital loss','working hours per week']
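
The plotting code is not shown in this export; a sketch of a boxplot cell over these columns:

plt.figure(figsize=(12, 8))
for i, col in enumerate(cols):
    plt.subplot(2, 2, i + 1)           # one panel per continuous variable
    sns.boxplot(x=adult_data[col])
    plt.title(col)
plt.tight_layout()
plt.show()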

Treating the outliers.


We can treat outliers with the following code. We will treat the outliers for the 'age' variable only.

def remove_outlier(col):
    # IQR-based whisker limits; values outside these limits are treated as outliers
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
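
A sketch of how the function was likely applied to 'age' to produce the ranges below (the exact cell is not shown):

lr, ur = remove_outlier(adult_data['age'])
print('Lower Range :', lr)
print('Upper Range :', ur)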

Lower Range : 26.0


Upper Range : 58.0

## This is a loop to treat outliers for all the non-'object' type variables
# for column in adult_data.columns:
#     if adult_data[column].dtype != 'object': 
#         lr,ur=remove_outlier(adult_data[column])
#         adult_data[column]=np.where(adult_data[column]>ur,ur,adult_data[column])
#         adult_data[column]=np.where(adult_data[column]<lr,lr,adult_data[column])

cols = ['age','capital gain','capital loss','working hours per week']

Checking for Correlations.

adult_data.corr()

adult_data.describe()

There is hardly any correlation between the numeric variables

# Pairplot using sns
sns.pairplot(adult_data ,diag_kind='hist' ,hue='salary');

Converting all objects to categorical codes

## We are coding up the 'education' variable in an ordinal manner

adult_data['education']=np.where(adult_data['education'] =='Preschool', '1', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='1st-4th', '2', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='5th-6th', '3', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='7th-8th', '4', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='9th', '5', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='10th', '6', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='11th', '7', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='12th', '8', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='HS-grad', '9', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Prof-school', '9', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Assoc-acdm', '10', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Assoc-voc', '11', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Some-college', '12', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Bachelors', '13', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Masters', '14', adult_data['education'])
adult_data['education']=np.where(adult_data['education'] =='Doctorate', '15', adult_data['education'])
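
As an aside, the same ordinal encoding can be written more compactly with a mapping dict (a sketch, equivalent to the cell above, including the deliberate 'Prof-school' -> '9'):

edu_order = {'Preschool': '1', '1st-4th': '2', '5th-6th': '3', '7th-8th': '4',
             '9th': '5', '10th': '6', '11th': '7', '12th': '8',
             'HS-grad': '9', 'Prof-school': '9', 'Assoc-acdm': '10', 'Assoc-voc': '11',
             'Some-college': '12', 'Bachelors': '13', 'Masters': '14', 'Doctorate': '15'}
adult_data['education'] = adult_data['education'].replace(edu_order)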

## We are grouping certain types of 'workclass' under different categories

adult_data['workclass']=np.where(adult_data['workclass'] =='Federal-gov', 'Government', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Local-gov', 'Government', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='State-gov', 'Government', adult_data['workclass'])

adult_data['workclass']=np.where(adult_data['workclass'] =='Self-emp-inc', 'Others', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Self-emp-not-inc', 'Others', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='unknown', 'Others', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Without-pay', 'Others', adult_data['workclass'])
adult_data['workclass']=np.where(adult_data['workclass'] =='Never-worked', 'Others', adult_data['workclass'])

## We are grouping certain types of 'marrital status' under two categories
## (the grouped level names are truncated in this export; 'Currently-unmarried' and
## 'Currently-married' below are assumed placeholders)

adult_data['marrital status']=np.where(adult_data['marrital status'] =='Divorced', 'Currently-unmarried', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Separated', 'Currently-unmarried', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Never-married', 'Currently-unmarried', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Widowed', 'Currently-unmarried', adult_data['marrital status'])

adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-civ-spouse', 'Currently-married', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-spouse-absent', 'Currently-married', adult_data['marrital status'])
adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-AF-spouse', 'Currently-married', adult_data['marrital status'])
# The original cell also checked for 'Married-AF-absent', which does not appear in the value counts above (a no-op)

## We are grouping certain types of 'occupation' under different categories

adult_data.head()

adult_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26697 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 26697 non-null int64
1 workclass 26697 non-null object
2 education 26697 non-null object
3 marrital status 26697 non-null object
4 occupation 26697 non-null object
5 sex 26697 non-null object
6 capital gain 26697 non-null int64
7 capital loss 26697 non-null int64
8 working hours per week 26697 non-null float64
9 salary 26697 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 2.2+ MB

## Converting the education variable to numeric

adult_data['education'] = adult_data['education'].astype('int64')
adult_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26697 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 26697 non-null int64
1 workclass 26697 non-null object
2 education 26697 non-null int64
3 marrital status 26697 non-null object
4 occupation 26697 non-null object
5 sex 26697 non-null object
6 capital gain 26697 non-null int64
7 capital loss 26697 non-null int64
8 working hours per week 26697 non-null float64
9 salary 26697 non-null object
dtypes: float64(1), int64(4), object(5)
memory usage: 2.2+ MB

Assigning 0 to <=50K and 1 to >50K

adult_data['salary'].value_counts()

1 is assigned to >50K because that is the class of interest, as defined by the problem
statement.

## Converting the 'salary' variable into numeric by using the LabelEncoder functionality in sklearn

## Defining a Label Encoder object instance

## Applying the created Label Encoder object for the target class
## Assigning the 0 to <=50k and 1 to >50k
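
The encoding cell is not shown; a minimal sketch using LabelEncoder (alphabetical ordering puts '<=50K' before '>50K', so the 0/1 assignment falls out as intended):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
adult_data['salary'] = le.fit_transform(adult_data['salary'])
print(le.classes_)  # ['<=50K', '>50K'] -> encoded as 0 and 1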

## Converting the other 'object' type variables as dummy variables
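
A sketch of the dummy-encoding step (column names follow the data above; drop_first is an assumption):

adult_data = pd.get_dummies(adult_data,
                            columns=['workclass', 'marrital status', 'occupation', 'sex'],
                            drop_first=True)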
Train Test Split

# Copy all the predictor variables into X dataframe

# Copy target into the y dataframe. 

# Split X and y into training and test set in 70:30 ratio
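
A sketch of the split cell (the 70:30 ratio reproduces the support counts of 18687 and 8010 seen in the classification reports below; the random_state value is an assumption):

X = adult_data.drop('salary', axis=1)   # predictor variables
y = adult_data['salary']                # target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)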

Logistic Regression Model


We are making some adjustments to the parameters of the LogisticRegression class to get
better accuracy. Details can be found on the scikit-learn site linked below.

scikit-learn

solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’


Algorithm to use in the optimization problem.

For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster
for large ones.

For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle
multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty

‘liblinear’ and ‘saga’ also handle L1 penalty

‘saga’ also supports ‘elasticnet’ penalty

‘liblinear’ does not support setting penalty='none'

Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with
approximately the same scale. You can preprocess the data with a scaler from
sklearn.preprocessing.

New in version 0.17: Stochastic Average Gradient descent solver.

New in version 0.19: SAGA solver.


Changed in version 0.22: The default solver changed from ‘liblinear’ to ‘lbfgs’ in
0.22.

Article on Solvers

# Fit the Logistic Regression model
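
A sketch of the fitting cell, reconstructed from the estimator repr in the output below:

model = LogisticRegression(solver='newton-cg', max_iter=10000,
                           penalty='none', n_jobs=2, verbose=True)
model.fit(X_train, y_train)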

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 6.3s finished
LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', solver='newton-cg',
verbose=True)

Predicting on the Training and Test datasets

Getting the Predicted Classes and Probs
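
A sketch of the prediction cells:

# Predicted classes
ytrain_predict = model.predict(X_train)
ytest_predict = model.predict(X_test)

# Predicted probabilities for each class
ytrain_probs = model.predict_proba(X_train)
ytest_probs = model.predict_proba(X_test)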

Model Evaluation

# Accuracy - Training Data
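
Likely computed as (model.score returns mean accuracy):

model.score(X_train, y_train)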

0.8265104083052389


An AUC value closer to 1 indicates good separability between the predicted classes, and
thus that the model is good for prediction.

The ROC curve visually represents the same concept: the plot should be as far as possible
from the diagonal.

AUC and ROC for the training data

# predict probabilities
# keep probabilities for the positive outcome only

# calculate AUC

# calculate roc curve

# plot the roc curve for the model
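
A sketch assembling the commented steps above (roc_auc_score and roc_curve are imported at the top):

# keep probabilities for the positive outcome only
train_pos_probs = ytrain_probs[:, 1]

# calculate AUC
print('AUC: %.3f' % roc_auc_score(y_train, train_pos_probs))

# calculate the roc curve and plot it
fpr, tpr, thresholds = roc_curve(y_train, train_pos_probs)
plt.plot([0, 1], [0, 1], linestyle='--')  # diagonal reference line
plt.plot(fpr, tpr, marker='.')
plt.show()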

# Accuracy - Test Data
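
Analogously, a sketch for the test set:

model.score(X_test, y_test)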

0.8213483146067416

AUC and ROC for the test data

# predict probabilities

# keep probabilities for the positive outcome only

# calculate AUC

# calculate roc curve

# plot the roc curve for the model
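
The same steps apply to the test set; a sketch:

test_pos_probs = ytest_probs[:, 1]
print('AUC: %.3f' % roc_auc_score(y_test, test_pos_probs))
fpr, tpr, thresholds = roc_curve(y_test, test_pos_probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(fpr, tpr, marker='.')
plt.show()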

Confusion Matrix for the training data
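
A sketch of the cell behind the output below, using the functions imported at the top:

confusion_matrix(y_train, ytrain_predict)
print(classification_report(y_train, ytrain_predict))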

array([[12674,  1096],
       [ 2146,  2771]], dtype=int64)

              precision    recall  f1-score   support

           0       0.86      0.92      0.89     13770
           1       0.72      0.56      0.63      4917

    accuracy                           0.83     18687
   macro avg       0.79      0.74      0.76     18687
weighted avg       0.82      0.83      0.82     18687

Confusion Matrix for test data
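
And likewise for the test data:

confusion_matrix(y_test, ytest_predict)
print(classification_report(y_test, ytest_predict))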

array([[5412,  491],
       [ 940, 1167]], dtype=int64)

              precision    recall  f1-score   support

           0       0.85      0.92      0.88      5903
           1       0.70      0.55      0.62      2107

    accuracy                           0.82      8010
   macro avg       0.78      0.74      0.75      8010
weighted avg       0.81      0.82      0.81      8010

Applying GridSearchCV for Logistic Regression
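
The grid-search cell is not visible in this export; a minimal sketch of how GridSearchCV (imported at the top) might be applied here, with an illustrative, assumed parameter grid:

grid = {'penalty': ['l2', 'none'],
        'solver': ['newton-cg', 'lbfgs'],
        'tol': [0.0001, 0.00001]}

grid_search = GridSearchCV(estimator=LogisticRegression(max_iter=10000),
                           param_grid=grid, scoring='f1', cv=3)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
best_model = grid_search.best_estimator_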


Running in Google Colab
Importing a Jupyter notebook

1. Log in to Google
2. Go to drive.google.com
3. Upload the Jupyter notebook file into the drive
4. Double-click it, or right-click -> Open with -> Google Colaboratory

Alternatively:

1. Log in to Google
2. Go to https://colab.research.google.com/notebooks/intro.ipynb#recent=true
3. Upload the Jupyter notebook

Loading the dataset into Colab

Use the below code to load the dataset


from google.colab import files
import io
import pandas as pd

uploaded = files.upload()  # upload the file here from local

df2 = pd.read_csv(io.BytesIO(uploaded['Filename.csv']))  # give the filename in quotes

Go to Runtime > Change runtime type > check that it points to Python


                                                Happy Learning
