Project - 8: Finance & Risk Analytics - India Credit Risk
Contents
1 Project Objective
2 Data Analysis – Step-by-Step Approach
2.1 Exploratory Data Analysis
2.1.1 Basic data exploration
2.1.2 Look out for outliers and missing values
2.1.3 Check for multicollinearity & treat it
2.2 Build Models and Compare Them to Get to the Best One
2.2.1 KNN
2.2.2 Logistic Regression
2.2.3 Naive Bayes
2.2.4 Boosting
2.2.5 Bagging
2.2.6 Data model performance analysis
3 Conclusion
4 Appendix A – Source Code
1 PROJECT OBJECTIVE
To build predictive models on the employee mode-of-transport data ('Cars.csv') and compare the performance of all the models.
2.1.2 Missing value treatment
Missing-values ratio: Data columns with too many missing values are unlikely to carry useful information. Columns whose number of missing values exceeds a given threshold can therefore be removed.
We observe that the variable 'Deposits (accepted by commercial banks)' has all observations blank in the raw data. We repeated the same check on the test data, found the same problem column, and removed it. We also removed the column 'Num', as it is an ID variable and holds no significance for model building.
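For illustration, the missing-values-ratio rule can be sketched as follows (a minimal Python/pandas sketch; the column names and the 0.5 threshold are hypothetical, not from the project):

```python
import pandas as pd
import numpy as np

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_ratio = df.isna().mean()          # per-column fraction of NaNs
    keep = missing_ratio[missing_ratio <= threshold].index
    return df[keep]

# Toy frame: 'deposits' is entirely blank, mirroring the column removed above.
raw = pd.DataFrame({
    "net_worth": [1.2, 3.4, 2.2, 0.9],
    "deposits":  [np.nan, np.nan, np.nan, np.nan],
    "sales":     [10.0, np.nan, 7.5, 8.1],
})
clean = drop_sparse_columns(raw, threshold=0.5)
print(list(clean.columns))   # 'deposits' (100% missing) is dropped
```

Fully blank columns such as the deposits variable above fall out immediately, while columns with only occasional gaps (like 'sales') are retained for imputation.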
> dim(data_clean)
[1] 934 51
Distribution of the target (default) variable:
0 1
917 17
Other ratios created:
Income from financial services Ratio = Income from financial services / Total assets
The dimensions (rows and columns) of the raw and test data after transformation are shown below:
> dim(test_data_transformed)
[1] 183 51
> dim(data_transformed)
[1] 934 51
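The ratio construction itself is simple column arithmetic; a hedged Python sketch of the 'Income from financial services / Total assets' ratio (values and column names are hypothetical):

```python
import pandas as pd

# Hypothetical columns standing in for the report's
# "Income from financial services" and "Total assets" variables.
df = pd.DataFrame({
    "income_fin_services": [12.0, 8.0, 0.0],
    "total_assets":        [240.0, 160.0, 100.0],
})
# Derived ratio feature, analogous to the ratios created in the project.
df["income_fin_services_ratio"] = (
    df["income_fin_services"] / df["total_assets"]
)
print(df["income_fin_services_ratio"].tolist())  # [0.05, 0.05, 0.0]
```

Scaling raw financials by total assets in this way makes firms of very different sizes comparable, which is the point of building ratio features before modelling.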
Capping at the 5th and 95th percentiles means that values below the 5th-percentile value are replaced by the 5th-percentile value, and values above the 95th-percentile value are replaced by the 95th-percentile value.
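This capping rule (winsorization) can be sketched in Python (illustrative data; the 5/95 cut-offs follow the text):

```python
import numpy as np

def cap_outliers(x: np.ndarray, lower_pct: float = 5,
                 upper_pct: float = 95) -> np.ndarray:
    """Winsorize: clamp values below the lower / above the upper percentile."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

# One extreme value (100) among otherwise small observations.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
capped = cap_outliers(x)
print(capped.max())   # the extreme 100 is pulled down to the 95th percentile
```

Unlike dropping outlier rows, capping keeps every observation while limiting the leverage any single extreme value can exert on the model.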
2.1.6 Multicollinearity
Correlation
Multicollinearity has inflated the VIF values of the correlated variables. We will use a stepwise variable-reduction function based on VIF: it repeatedly removes the variable with the highest VIF until all remaining VIFs fall below a chosen threshold.
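A minimal sketch of such a stepwise VIF filter (shown in Python with NumPy for illustration; the project's R implementation and its threshold may differ):

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

def stepwise_vif_drop(X: np.ndarray, names: list, threshold: float = 5.0):
    """Repeatedly drop the predictor with the largest VIF above the threshold."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break                      # all remaining VIFs are acceptable
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Third column nearly duplicates 'a', creating severe multicollinearity.
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])
_, kept = stepwise_vif_drop(X, ["a", "b", "a_copy"], threshold=5.0)
print(kept)  # one of the two collinear columns is removed
```

Dropping one variable at a time matters: removing the single worst offender can bring several other VIFs back below the threshold, so re-computing after each removal avoids discarding more variables than necessary.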
2.2.1 KNN
3 CONCLUSION
4 APPENDIX A – SOURCE CODE
> #==============================================================================
> #
> # PROJECT - 5 : Predicting mode of Transport (ML)
> #
> #==============================================================================
> # Environment Set up
> # Setup Working Directory
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> # Install
>
> #install.packages("DMwR")
> #install.packages("class")
> #install.packages("VIF")
> #install.packages("GGally")
> #install.packages("mctest")
>
> # adding library
> library(ggplot2)
> library(DataExplorer)
> library(gower)
> library(rpart)
> #library(dplyr)
> library(plotrix)
> #library(rpart.plot)
> #library(randomForest)
> library(readxl)
> library(readr)
> #library(rattle)
> #library(ROCR)
> #library(ineq)
> #library(ROSE)
> #library(RColorBrewer)
> #library(data.table)
> #library(scales)
> library(corrplot)
> #library(caTools)
> #library(MASS)
> #library(clusterGeneration)
> library(caret)
> library(car)
> library(DMwR)
> library(class)
> library(carData)
> library(lattice)
> library(VIF)
> library(mctest)
> library(e1071)
> library(glmnet)
> library(xgboost)
> library(data.table)
> library(ipred)
>
>
> # Read Input File
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> #attach(Cars)
>
> #Exploratory Data Analysis
>
> # Find out Total Number of Rows and Columns
> dim(Cars)
[1] 444 9
>
>
> # Find out Names of the Columns (Features)
> names(Cars)
[1] "Age" "Gender" "Engineer" "MBA" "Work Exp" "Salary" "Distance" "license" "Transport"
>
>
> # Find out Class of each Feature, along with internal structure
> summary(Cars)
Age Gender Engineer MBA Work Exp Salary
Min. :18.00 Length:444 Min. :0.0000 Min. :0.0000 Min. : 0.0 Min. : 6
1st Qu.:25.00 Class :character 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 3.0 1st Qu.: 9
Median :27.00 Mode :character Median :1.0000 Median :0.0000 Median : 5.0 Median :13
Mean :27.75 Mean :0.7545 Mean :0.2528 Mean : 6.3 Mean :16
3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.0 3rd Qu.:15
Max. :43.00 Max. :1.0000 Max. :1.0000 Max. :24.0 Max. :57
NA's :1
license Transport
Min. :0.0000 Length:444
1st Qu.:0.0000 Class :character
Median :0.0000 Mode :character
Mean :0.2342
3rd Qu.:0.0000
Max. :1.0000
> str(Cars)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 444 obs. of 9 variables:
$ Age : num 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : chr "Male" "Female" "Male" "Female" ...
$ Engineer : num 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : num 0 0 0 1 0 0 0 0 0 0 ...
$ Work Exp : num 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : num 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: chr "Public Transport" "Public Transport" "Public Transport" "Public Transport"
- attr(*, "spec")=
.. cols(
.. Age = col_double(),
.. Gender = col_character(),
.. Engineer = col_double(),
.. MBA = col_double(),
.. `Work Exp` = col_double(),
.. Salary = col_double(),
.. Distance = col_double(),
.. license = col_double(),
.. Transport = col_character()
.. )
>
>
> # Check top and bottom Rows of the Dataset
> head(Cars,5)
# A tibble: 5 x 9
Age Gender Engineer MBA `Work Exp` Salary Distance license Transport
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 28 Male 0 0 4 14.3 3.2 0 Public Transport
2 23 Female 1 0 4 8.3 3.3 0 Public Transport
3 29 Male 1 0 7 13.4 4.1 0 Public Transport
4 28 Female 1 1 5 13.4 4.5 0 Public Transport
5 27 Male 1 0 4 13.4 4.6 0 Public Transport
> tail(Cars,5)
# A tibble: 5 x 9
Age Gender Engineer MBA `Work Exp` Salary Distance license Transport
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 40 Male 1 0 20 57 21.4 1 Car
2 38 Male 1 0 19 44 21.5 1 Car
3 37 Male 1 0 19 45 21.5 1 Car
4 37 Male 0 0 19 47 22.8 1 Car
5 39 Male 1 1 21 50 23.4 1 Car
>
>
> plot_str(Cars)
>
> # Histogram View
>
> plot_histogram(Cars)
>
> # Density view
>
> plot_density(Cars)
>
>
>
> ## The columns converted into factors
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
>
> summary(Cars)
Age Gender Engineer MBA Work Exp Salary Distance
Min. :18.00 Female:128 0:109 0 :331 Min. : 0.0 Min. : 6.50 Min. : 3.20
1st Qu.:25.00 Male :316 1:335 1 :112 1st Qu.: 3.0 1st Qu.: 9.80 1st Qu.: 8.80
Median :27.00 NA's: 1 Median : 5.0 Median :13.60 Median :11.00
Mean :27.75 Mean : 6.3 Mean :16.24 Mean :11.32
3rd Qu.:30.00 3rd Qu.: 8.0 3rd Qu.:15.72 3rd Qu.:13.43
Max. :43.00 Max. :24.0 Max. :57.00 Max. :23.40
Transport
2Wheeler : 83
Car : 61
Public Transport:300
> boxplot(Cars$Age ~Cars$MBA, main ="Age Vs MBA")
> boxplot(Cars$Salary ~Cars$Engineer, main = "Salary vs Eng.")
> boxplot(Cars$Salary ~Cars$MBA, main = "Salary vs MBA.")
> table(Cars$license,Cars$Transport)
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
2Wheeler Car Public Transport
0.1891026 0.1378205 0.6730769
Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.00000 2.658753
Car 35.60465 3.546558
Public Transport 26.76190 2.853769
Gender
Y Female Male
2Wheeler 0.4576271 0.5423729
Car 0.2093023 0.7906977
Public Transport 0.2571429 0.7428571
Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627
Car 0.1627907 0.8372093
Public Transport 0.2380952 0.7619048
MBA
Y 0 1
2Wheeler 0.7627119 0.2372881
Car 0.7906977 0.2093023
Public Transport 0.7238095 0.2761905
Salary
Y [,1] [,2]
2Wheeler 2.404017 0.3803774
Car 3.456899 0.4691048
Public Transport 2.522866 0.3143948
Distance
Y [,1] [,2]
2Wheeler 12.16780 3.200185
Car 15.31628 3.864301
Public Transport 10.42286 3.027442
license
Y 0 1
2Wheeler 0.7796610 0.2203390
Car 0.2790698 0.7209302
Public Transport 0.8904762 0.1095238
>
> #Prediction on the train dataset
> NB_Predictions<-predict(Naive_Bayes_Model,Cars2datatrain)
> table(NB_Predictions,Cars2datatrain$Transport)
Public Transport 37 8 194
Overall Statistics
Accuracy : 0.7981
95% CI : (0.7492, 0.8412)
No Information Rate : 0.6731
P-Value [Acc > NIR] : 6.615e-07
Kappa : 0.5485
Statistics by Class:
Overall Statistics
Accuracy : 0.8092
95% CI : (0.7313, 0.8725)
No Information Rate : 0.6794
P-Value [Acc > NIR] : 0.0006484
Kappa : 0.5748
Statistics by Class:
>
> # KNN
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
> Cars<-na.exclude (Cars)
> Cars2<- Cars[,-5]
> Cars2$Salary <- log(Cars2$Salary)
>
> #test and train data
> set.seed(44)
> carindex<-createDataPartition(Cars2$Transport, p=0.7,list = FALSE)
> Cars2datatrain<-Cars2[carindex,]
> Cars2datatest<-Cars2[-carindex,]
> prop.table(table(Cars2datatrain$Transport))
312 samples
7 predictor
3 classes: '2Wheeler', 'Car', 'Public Transport'
k Accuracy Kappa
2 0.7302419 0.4281271
3 0.7726815 0.4952761
4 0.7630040 0.4715395
5 0.7596774 0.4502415
6 0.7533266 0.4343293
7 0.7630040 0.4423694
8 0.7695565 0.4590623
9 0.7597782 0.4193741
10 0.7594758 0.4089674
11 0.7594758 0.4013071
12 0.7530242 0.3777808
13 0.7626008 0.3941709
14 0.7658266 0.3973498
15 0.7658266 0.3939163
16 0.7627016 0.3891776
17 0.7594758 0.3774473
18 0.7627016 0.3890871
19 0.7530242 0.3569895
20 0.7562500 0.3697960
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3.
> KNN_predictions <- predict(fit.knn,Cars2datatrain)
> table(KNN_predictions, Cars2datatrain$Transport)
Overall Statistics
Accuracy : 0.8686
95% CI : (0.826, 0.904)
No Information Rate : 0.6731
P-Value [Acc > NIR] : 1.579e-15
Kappa : 0.7177
Statistics by Class:
Car 3 15 3
Public Transport 13 0 80
> confusionMatrix(table(KNN_predictions, Cars2datatest$Transport))
Confusion Matrix and Statistics
Overall Statistics
Accuracy : 0.7863
95% CI : (0.7061, 0.853)
No Information Rate : 0.6794
P-Value [Acc > NIR] : 0.004613
Kappa : 0.547
Statistics by Class:
0 1
382 61
> sum(Cars2$CarUsage == 1)/nrow(Cars2)
[1] 0.1376975
> Cars2$CarUsage<-as.factor(Cars2$CarUsage)
> set.seed(44)
> carindex<-createDataPartition(Cars2$Transport, p=0.7,list = FALSE)
> Cars2datatrain<-Cars2[carindex,]
> Cars2datatest<-Cars2[-carindex,]
> prop.table(table(Cars2datatrain$Transport))
> str(Cars2datatrain)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 312 obs. of 8 variables:
$ Age : num 28 23 26 28 22 27 25 27 24 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 2 1 2 2 2 ...
$ Engineer: Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
$ MBA : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Salary : num 2.66 2.12 2.51 2.67 2.01 ...
$ Distance: num 3.2 3.3 4.8 5.1 5.1 5.2 5.2 5.3 5.4 5.5 ...
$ license : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 2 1 2 ...
$ CarUsage: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")= 'exclude' Named int 145
..- attr(*, "names")= chr "145"
> Cars2dataSMOTE<-SMOTE(CarUsage~., as.data.frame(Cars2datatrain), perc.over = 250,perc.under =
> prop.table(table(Cars2dataSMOTE$CarUsage))
0 1
0.5 0.5
> ##Create control parameter for GLM
> outcomevar<-'CarUsage'
> regressors<-c("Age","Salary","Distance","license","Engineer","MBA","Gender")
> trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
> Cars2glm<-train(Cars2dataSMOTE[,regressors],Cars2dataSMOTE[,outcomevar],method = "glm", family =
+ "binomial",trControl = trainctrl)
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(Cars2glm$finalModel)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3474 -0.0194 0.0000 0.0445 2.2655
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -58.0133 12.1697 -4.767 1.87e-06 ***
Age 1.7266 0.3616 4.775 1.80e-06 ***
Salary 1.2720 1.4542 0.875 0.3817
Distance 0.1321 0.1539 0.858 0.3908
license1 1.9338 1.0768 1.796 0.0725 .
Engineer1 -0.1014 1.2338 -0.082 0.9345
MBA1 -2.1554 0.9435 -2.285 0.0223 *
GenderMale -1.0655 0.9267 -1.150 0.2502
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Overall
Age 100.00
MBA1 46.94
license1 36.52
GenderMale 22.75
Salary 16.89
Distance 16.54
Engineer1 0.00
> plot(varImp(object = Cars2glm), main="Variable Importance for Logistic Regression")
> carusageprediction<-predict.train(object = Cars2glm,Cars2datatest[,regressors],type = "raw")
> Cars2datatest$CarUsage<-as.factor(Cars2datatest$CarUsage)
> confusionMatrix(carusageprediction,Cars2datatest$CarUsage, positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 104 0
1 9 18
Accuracy : 0.9313
95% CI : (0.8736, 0.9681)
No Information Rate : 0.8626
P-Value [Acc > NIR] : 0.010562
Kappa : 0.7605
Sensitivity : 1.0000
Specificity : 0.9204
Pos Pred Value : 0.6667
Neg Pred Value : 1.0000
Prevalence : 0.1374
Detection Rate : 0.1374
Detection Prevalence : 0.2061
Balanced Accuracy : 0.9602
'Positive' Class : 1
> #str(carusageprediction)
> #str(Cars2datatest$CarUsage)
>
> #summary(carusageprediction)
> #summary(Cars2datatest$CarUsage)
>
> carusagepreddata<-Cars2datatest
> carusagepreddata$predictusage<-carusageprediction
> trainctrlgn<-trainControl(method = 'cv',number = 10,returnResamp = 'none')
> Cars2glmnet<-train(CarUsage~Age+Salary+Distance+license, data = Cars2dataSMOTE,
+ method = 'glmnet', trControl = trainctrlgn)
> Cars2glmnet
glmnet
258 samples
4 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 232, 232, 233, 232, 232, 233, ...
Resampling results across tuning parameters:
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.55 and lambda = 0.08382152.
> varImp(object = Cars2glmnet)
glmnet variable importance
Overall
Salary 100.00
Age 26.38
license1 18.50
Distance 0.00
> plot(varImp(object = Cars2glmnet), main="Variable Importance for Logistic Regression - Post Regularization")
> carusagepredictiong<-predict.train(object = Cars2glmnet,Cars2datatest[,regressors],type = "raw")
> confusionMatrix(carusagepredictiong,Cars2datatest$CarUsage, positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 105 0
1 8 18
Accuracy : 0.9389
95% CI : (0.8832, 0.9733)
No Information Rate : 0.8626
P-Value [Acc > NIR] : 0.004479
Kappa : 0.7829
Sensitivity : 1.0000
Specificity : 0.9292
Pos Pred Value : 0.6923
Neg Pred Value : 1.0000
Prevalence : 0.1374
Detection Rate : 0.1374
Detection Prevalence : 0.1985
Balanced Accuracy : 0.9646
'Positive' Class : 1
>
>
> # Boosting
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
> Cars<-na.exclude (Cars)
> Cars2<- Cars[,-5]
> Cars2$Salary <- log(Cars2$Salary)
> set.seed(44)
> carindex<-createDataPartition(Cars2$Transport, p=0.7,list = FALSE)
> Cars2datatrain<-Cars2[carindex,]
> Cars2datatest<-Cars2[-carindex,]
> Cars2datatrain$license<-as.factor(Cars2datatrain$license)
> Cars2datatest$license<-as.factor(Cars2datatest$license)
> Cars2train.car<-Cars2datatrain[Cars2datatrain$Transport %in% c("Car", "Public Transport"),]
> Cars2train.twlr<-Cars2datatrain[Cars2datatrain$Transport %in% c("2Wheeler", "Public Transport"),]
> Cars2train.car$Transport<-as.character(Cars2train.car$Transport)
> Cars2train.car$Transport<-as.factor(Cars2train.car$Transport)
> Cars2train.twlr$Transport<-as.character(Cars2train.twlr$Transport)
> Cars2train.twlr$Transport<-as.factor(Cars2train.twlr$Transport)
> prop.table(table(Cars2train.car$Transport))
cb.print.evaluation(period = print_every_n)
# of features: 7
niter: 50
nfeatures : 7
xNames : Age GenderMale Engineer1 MBA1 Salary Distance license1
problemType : Classification
tuneValue :
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
1 50 1 0.3 0 0.6 1 1
obsLevels : 2Wheeler Public Transport Car
param :
list()
> predictions_xgb<-predict(Cars2xgb,Cars2datatest)
> confusionMatrix(predictions_xgb,Cars2datatest$Transport)
Confusion Matrix and Statistics
Reference
Prediction 2Wheeler Car Public Transport
2Wheeler 12 0 25
Car 1 18 6
Public Transport 11 0 58
Overall Statistics
Accuracy : 0.6718
95% CI : (0.5843, 0.7512)
No Information Rate : 0.6794
P-Value [Acc > NIR] : 0.614462
Kappa : 0.4182
Statistics by Class:
Transport = col_character()
)
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
> Cars<-na.exclude (Cars)
> Cars2<- Cars[,-5]
> Cars2$Salary <- log(Cars2$Salary)
> set.seed(44)
> #test and train data
> carindex <- createDataPartition(Cars2$Transport, p=0.70, list=FALSE)
> Cars2datatrain <- Cars2[ carindex,]
> Cars2datatest <- Cars2[-carindex,]
> Cars2.bagging <- bagging(Transport ~.,
+ data=Cars2datatrain,
+ control=rpart.control(maxdepth=5, minsplit=4))
> Cars2datatrain$pred.class <- predict(Cars2.bagging, Cars2datatrain)
> table(Cars2datatrain$Gender,Cars2datatrain$pred.class)
2.59525470695687 0 0 1
2.60268968544438 0 0 2
2.61006979274201 0 0 10
2.61739583283408 0 0 6
2.62466859216316 0 0 7
2.63188884013665 0 0 10
2.66025953726586 0 0 1
2.66722820658195 0 0 3
2.67414864942653 0 0 1
2.68102152871429 0 0 15
2.68784749378469 0 0 4
2.69462718077007 0 0 7
2.70136121295141 1 0 5
2.70805020110221 2 0 0
2.73436750941958 0 0 1
2.74727091425549 0 1 6
2.75366071235426 0 0 1
2.76000994003292 0 1 3
2.76631910922619 0 2 1
2.8094026953625 0 0 1
2.81540871942271 0 1 0
2.82137888640921 0 0 1
2.82731362192903 0 5 1
2.83321334405622 0 4 0
2.87919845729804 0 0 3
2.9338568698359 1 0 2
2.98061863574394 0 0 1
3.03013370027132 0 0 2
3.03495298670727 0 0 1
3.07269331469012 0 0 1
3.08190996979504 0 0 1
3.13549421592915 1 0 0
3.16968558067743 1 0 2
3.17387845893747 0 0 1
3.21486780347066 0 0 1
3.25424296870549 0 0 1
3.35689712276558 0 0 2
3.3603753871419 0 1 2
3.49650756146648 0 1 0
3.52636052461616 0 2 0
3.55248682920838 0 1 0
3.55534806148941 0 1 0
3.58351893845611 0 1 0
3.60004824040732 0 0 1
3.61091791264422 0 2 0
3.66356164612965 0 1 0
3.68637632389582 0 1 0
3.71113006304876 0 1 0
3.71357206670431 0 1 0
3.73766961828337 0 2 0
3.75887182593397 0 1 0
3.76120011569356 0 2 0
3.78418963391826 0 1 0
3.80666248977032 0 4 0
3.85014760171006 0 2 0
3.87120101090789 0 1 0
3.93182563272433 0 2 0
3.95124371858143 0 1 0
3.98898404656427 0 1 0
4.00733318523247 0 1 0
4.04305126783455 0 1 0
> Cars2.bagging <- bagging(Transport ~.,
+ data=Cars2datatest,
+ control=rpart.control(maxdepth=5, minsplit=4))
> Cars2datatest$pred.class <- predict(Cars2.bagging, Cars2datatest)
>
> table(Cars2datatest$Gender,Cars2datatest$pred.class)
>
> # Missing Value & Multicollinearity
> # Setup Working Directory
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
>
> # adding library
> library(DataExplorer)
> library(gower)
> library(rpart)
> library(plotrix)
> library(readr)
> library(car)
> library(DMwR)
> library(class)
> library(carData)
> library(lattice)
> library(corrplot)  # needed for corrplot() below
>
>
> # Read Input File
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> ## The columns converted into factors
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
>
>
> Cars$Engineer<-as.numeric(Cars$Engineer)
> Cars$MBA<-as.numeric(Cars$MBA)
> Cars$license<-as.numeric(Cars$license)
> Cars$Gender<-as.numeric(Cars$Gender)
> Cars$Transport<-as.numeric(Cars$Transport)
>
> str(Cars)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 444 obs. of 9 variables:
$ Age : num 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : num 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : num 1 2 2 2 2 2 2 2 2 2 ...
$ MBA : num 1 1 1 2 1 1 1 1 1 1 ...
$ Work Exp : num 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : num 1 1 1 1 1 2 1 1 1 1 ...
$ Transport: num 3 3 3 3 3 3 1 3 3 3 ...
- attr(*, "spec")=
.. cols(
.. Age = col_double(),
.. Gender = col_character(),
.. Engineer = col_double(),
.. MBA = col_double(),
.. `Work Exp` = col_double(),
.. Salary = col_double(),
.. Distance = col_double(),
.. license = col_double(),
.. Transport = col_character()
.. )
> #missing value
> anyNA(Cars)
[1] TRUE
> Cars[!complete.cases(Cars), ]
# A tibble: 1 x 9
Age Gender Engineer MBA `Work Exp` Salary Distance license Transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 28 1 1 NA 6 13.7 9.4 1 3
> Cars<-na.exclude (Cars)
> accounts_n<-Cars
> str(accounts_n)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 443 obs. of 9 variables:
$ Age : num 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : num 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : num 1 2 2 2 2 2 2 2 2 2 ...
$ MBA : num 1 1 1 2 1 1 1 1 1 1 ...
$ Work Exp : num 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : num 1 1 1 1 1 2 1 1 1 1 ...
$ Transport: num 3 3 3 3 3 3 1 3 3 3 ...
- attr(*, "na.action")= 'exclude' Named int 145
..- attr(*, "names")= chr "145"
> corrplot(cor(accounts_n))
> model1 <- glm(Transport ~ ., data= accounts_n)
> summary(model1)
Call:
glm(formula = Transport ~ ., data = accounts_n)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1081 -0.2535 0.2071 0.4631 1.1748
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.965013 0.529462 1.823 0.069 .
Age 0.087019 0.021419 4.063 5.76e-05 ***
Gender 0.369621 0.077090 4.795 2.24e-06 ***
Engineer -0.007704 0.079025 -0.097 0.922
MBA 0.129177 0.078727 1.641 0.102
`Work Exp` -0.039439 0.026128 -1.509 0.132
Salary -0.008249 0.009595 -0.860 0.390
Distance -0.052808 0.010545 -5.008 8.03e-07 ***
license -0.561159 0.095571 -5.872 8.59e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> vif(model1)
Age Gender Engineer MBA `Work Exp` Salary Distance license
7.892948 1.071875 1.015424 1.032638 15.735548 8.871978 1.274628 1.447236
> model2 <- glm(Transport ~ ., data= accounts_n[,-5])
> summary(model2)
Call:
glm(formula = Transport ~ ., data = accounts_n[, -5])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0607 -0.2625 0.2145 0.4680 1.0776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.488038 0.400920 3.712 0.000233 ***
Age 0.064015 0.015072 4.247 2.65e-05 ***
Gender 0.373212 0.077167 4.836 1.84e-06 ***
Engineer -0.004950 0.079120 -0.063 0.950141
MBA 0.115982 0.078355 1.480 0.139541
Salary -0.018465 0.006811 -2.711 0.006973 **
Distance -0.051120 0.010501 -4.868 1.58e-06 ***
license -0.545627 0.095155 -5.734 1.84e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
glm(formula = Transport ~ ., data = accounts_n[, -c(5, 6)])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0817 -0.2194 0.2039 0.4808 1.1676
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.259173 0.284592 7.938 1.75e-14 ***
Age 0.031048 0.008969 3.462 0.00059 ***
Gender 0.381131 0.077671 4.907 1.31e-06 ***
Engineer -0.007425 0.079688 -0.093 0.92581
MBA 0.109572 0.078888 1.389 0.16555
Distance -0.058510 0.010215 -5.728 1.90e-08 ***
license -0.605413 0.093236 -6.493 2.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> vif(model3)
Age Gender Engineer MBA Distance license
1.360182 1.069320 1.014748 1.018978 1.175392 1.353631
> model4 <- glm(Transport ~ ., data= accounts_n[,-c(5,1)])
> summary(model4)
Call:
glm(formula = Transport ~ ., data = accounts_n[, -c(5, 1)])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0259 -0.1351 0.2098 0.4732 1.1094
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.877943 0.236105 12.189 < 2e-16 ***
Gender 0.383198 0.078624 4.874 1.54e-06 ***
Engineer 0.009012 0.080581 0.112 0.911
MBA 0.100695 0.079787 1.262 0.208
Salary 0.004875 0.004102 1.189 0.235
Distance -0.053903 0.010684 -5.045 6.66e-07 ***
license -0.532477 0.096945 -5.493 6.74e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1