
PROJECT - 8

Finance & Risk Analytics - India Credit Risk

Ishan Rastogi | PGP BABI online | 2nd Feb’20

Contents
1 Project objective
2 Data Analysis – Step by step approach
2.1 Exploratory data analysis
2.1.1 Basic data exploration
2.1.2 Missing value treatment
2.1.3 New variable creation
2.1.4 Outliers treatment
2.1.5 Univariate & bivariate analysis
2.1.6 Multicollinearity
2.2 Build models and compare them to get to the best one
2.2.1 KNN
2.2.2 Logistic Regression
2.2.3 Naive Bayes
2.2.4 Boosting
2.2.5 Bagging
2.2.6 Data model performance analysis
3 Conclusion
4 Appendix A – Source Code


1 PROJECT OBJECTIVE
To build data models on the India credit risk data and analyze the performance of all the models:

1. Exploratory data analysis


2. Build Models
3. Model Performance Measures

2 DATA ANALYSIS – STEP BY STEP APPROACH


Building data models and analyzing their performance on the India credit risk data consists of the following
steps:

 Exploratory data analysis


o Basic data exploration
o Missing value treatment
o New variable creation
o Outliers treatment
o Univariate & bivariate analysis
o Multicollinearity
 Build models
o Logistic Regression Model
o Analyze coefficient & their signs
 Model performance measures
o Predict accuracy of model
o Data sorting based on probability of default
o Model performance

2.1 EXPLORATORY DATA ANALYSIS


2.1.1 Basic data exploration
 Manual overview in Excel
 dim() function - find out the total number of rows and columns
o There are 3541 observations and 53 variables in the raw data
o The target variable default is created as a new column: it is 1 if net worth <= 0, else 0
o There are 715 observations of 52 variables in the validation data
 names() function - find out the names of the columns (features)
 str() function - find out the class of each column, along with its internal structure
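As an illustrative sketch of these steps (Python with pandas; the actual analysis in this report is done in R, and the toy frame below merely stands in for the real raw data):

```python
import pandas as pd

# Toy stand-in for the raw data; the real file has 3541 rows and 53 columns.
data = pd.DataFrame({
    "Num": [1, 2, 3],
    "Net worth": [120.5, -3.2, 0.0],
})

# Equivalents of dim() and names()
n_rows, n_cols = data.shape
columns = list(data.columns)

# Create the binary target: 1 if net worth <= 0, else 0
data["default"] = (data["Net worth"] <= 0).astype(int)
print(data["default"].tolist())
```

The same boolean-mask idea carries over directly to the R expression `ifelse(data$"Net worth" <= 0, 1, 0)`.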

2.1.2 Missing value treatment
Missing values ratio: data columns with too many missing values are unlikely to carry useful information. Columns
whose fraction of missing values exceeds a given threshold can be removed.

Remove columns with missing values above the threshold limit


> n <- dim(data)[1]
> m <- dim(data)[2]
> miss_values_data <- data.frame()
> for (i in 1:m) {
+   x <- colnames(data)[i]
+   valu <- sum(is.na(data[, x])) / n  # fraction of missing values in column i
+   miss_values_data <- rbind(miss_values_data, data.frame(name = x, val = valu))
+ }
> print(miss_values_data)
name val
1 Num 0.000000000
2 Networth Next Year 0.000000000
3 Total assets 0.000000000
4 Net worth 0.000000000
5 Total income 0.055916408
6 Change in stock 0.129341994
7 Total expenses 0.039254448
8 Profit after tax 0.036995199
9 PBDITA 0.036995199
10 PBT 0.036995199
11 Cash profit 0.036995199
12 PBDITA as % of total income 0.019203615
13 PBT as % of total income 0.019203615
14 PAT as % of total income 0.019203615
15 Cash profit as % of total income 0.019203615
16 PAT as % of net worth 0.000000000
17 Sales 0.073143180
18 Income from financial services 0.264049703
19 Other income 0.365715899
20 Total capital 0.001129624
21 Reserves and funds 0.024004518
22 Deposits (accepted by commercial banks) 1.000000000
23 Borrowings 0.103360633
24 Current liabilities & provisions 0.027110986
25 Deferred tax liability 0.321942954
26 Shareholders funds 0.000000000
27 Cumulative retained profits 0.010731432
28 Capital employed 0.000000000
29 TOL/TNW 0.000000000
30 Total term liabilities / tangible net worth 0.000000000
31 Contingent liabilities / Net worth (%) 0.000000000
32 Contingent liabilities 0.335498447
33 Net fixed assets 0.033323920
34 Investments 0.405252753
35 Current assets 0.018638803
36 Net working capital 0.009036995
37 Quick ratio (times) 0.026263767
38 Current ratio (times) 0.026263767
39 Debt to equity ratio (times) 0.000000000
40 Cash to current liabilities (times) 0.026263767
41 Cash to average cost of sales per day 0.024004518
42 Creditors turnover 0.013273087
43 Debtors turnover 0.011861056
44 Finished goods turnover 0.128212369
45 WIP turnover 0.099971759
46 Raw material turnover 0.021180457
47 Shares outstanding 0.000000000
3|PAGE
48 Equity face value 0.000000000
49 EPS 0.000000000
50 Adjusted EPS 0.000000000
51 Total liabilities 0.000000000
52 PE on BSE 0.006495340
53 default 0.000000000
> threshold <- 0.50
> print(miss_values_data[miss_values_data$val > threshold, c('name','val')], row.names = FALSE)
name val
Deposits (accepted by commercial banks) 1

We observe that the variable Deposits (accepted by commercial banks) has all observations blank in the raw data.

Similarly, we repeated the process for the test data, found the same problematic column, and removed it.

We also removed the column Num, as it is an ID variable and holds no significance for model building.

Removed missing values from raw data

> data_clean <- data[,-1]

> data_clean <- na.omit(data_clean)

> dim(data_clean)

[1] 934 51

After dropping the ID column and removing rows with missing values, there are 934 observations and 51 variables in the raw data.


> table(data_clean$default)

0 1
917 17

The number of observations for Non-Defaulters is 917 and for Defaulters is 17.

The Defaulters form a minority class, comprising approximately 2% of total observations.

Similarly, we performed the same operations on the test data.

The number of observations for Non-Defaulters is 178 and for Defaulters is 5.

The Defaulters form a minority class, comprising approximately 2.7% of total observations.
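The class proportions can be verified with a short sketch (Python, using the counts from the table above):

```python
import pandas as pd

# Class counts observed in the cleaned raw data: 917 non-defaulters, 17 defaulters
default = pd.Series([0] * 917 + [1] * 17)

counts = default.value_counts()
minority_share = counts.min() / counts.sum()
print(counts.to_dict(), round(minority_share * 100, 1))
```

An imbalance this severe is why plain accuracy is a poor headline metric here; a model that always predicts "no default" is already about 98% accurate.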

2.1.3 New Variable creation


We need to convert the variables denoting amounts into ratio variables by dividing each of them by a relevant base variable, so that companies of different sizes can be compared on a homogeneous scale.

Profitability - Profit after tax / Sales

Liquidity - Net working capital / Total assets

Leverage - Total liabilities / (Total assets - Total liabilities)

Other ratios created

Total Income Ratio - Total income / Total assets

Change in stock Ratio - Change in stock / Total income

Total expenses Ratio - Total expenses / Total income

Profit after tax Ratio - Profit after tax / Total assets

PBDITA Ratio - PBDITA / Total assets

PBT Ratio - PBT / Total assets

Cash profit Ratio - Cash profit / Total assets

Sales Ratio - Sales / Total assets

Income from financial services Ratio - Income from financial services /Total assets

Other income Ratio - other income /Total assets

Total capital Ratio - Total capital /Total assets

Reserves and funds Ratio - Reserves and funds /Total assets

Borrowings Ratio - Borrowings /Total assets

Current liabilities & provisions Ratio - Current liabilities & provisions / Total assets

Deferred tax liability Ratio - Deferred tax liability /Total assets

Shareholders funds Ratio - Shareholders funds /Total assets

Cumulative retained profits Ratio - Cumulative retained profits /Total income

Capital employed Ratio - Capital employed /Total assets

Contingent liabilities Ratio - Contingent liabilities / Total assets

Net fixed assets Ratio - Net fixed assets /Total assets

Current assets Ratio - Current assets /Total assets

Net working capital Ratio - Net working capital /Total assets

After variable creation, i.e. conversion of amounts to ratios for better model analysis, the raw and test data have the dimensions below:
> dim(test_data_transformed)
[1] 183 51
> dim(data_transformed)
[1] 934 51
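A minimal sketch of the ratio construction (Python with pandas; the toy values are made up, and only the three headline ratios are shown):

```python
import pandas as pd

# Toy frame using a few of the amount columns named in the report
df = pd.DataFrame({
    "Profit after tax": [10.0, -2.0],
    "Sales": [100.0, 50.0],
    "Net working capital": [20.0, 5.0],
    "Total assets": [200.0, 80.0],
    "Total liabilities": [150.0, 60.0],
})

# The three headline ratios defined above; the other ratios follow the same pattern
df["Profitability"] = df["Profit after tax"] / df["Sales"]
df["Liquidity"] = df["Net working capital"] / df["Total assets"]
df["Leverage"] = df["Total liabilities"] / (df["Total assets"] - df["Total liabilities"])
```

Note that Leverage can blow up (or flip sign) when total liabilities approach or exceed total assets, which is exactly the region where defaults concentrate.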

2.1.4 Outliers treatment

We check for the presence of outliers, i.e. observations that lie at an abnormal distance from the other values of a continuous, numerical variable; basically, they are extreme values. We identify observations lying beyond the outer boundaries (Q1 - 1.5 IQR) and (Q3 + 1.5 IQR). Here Q3 is the third quartile (75th percentile), Q1 is the first quartile (25th percentile), and IQR, the interquartile range, is the difference between Q3 and Q1 (Q3 - Q1). Outliers need to be treated carefully: they could be mere typos or perfectly valid values, and they may contain valuable information about the data gathering and recording process. Before deciding to eliminate such points, one should try to understand why they appeared and whether similar values will continue to appear. For example, are the values of one variable recorded in different units of measure, such as weight recorded in kilograms for most observations but in pounds for a few? Outliers are often bad data points that unduly influence the analysis.

Capping at the 5th and 95th percentiles means that values less than the value at the 5th percentile are replaced by the value at the 5th percentile, and values greater than the value at the 95th percentile are replaced by the value at the 95th percentile.
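This capping rule is a one-liner in most data tools; an illustrative sketch in Python:

```python
import pandas as pd

# Toy series with one extreme value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Cap values below the 5th percentile and above the 95th percentile
lower, upper = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=lower, upper=upper)
```

The equivalent in R would use `quantile(x, c(0.05, 0.95))` together with `pmin`/`pmax`.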

2.1.5 Univariate & bivariate analysis

 boxplot() - A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:
o A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. the 50th percentile, of the distribution. These three lines give a sense of the spread of the distribution and whether or not it is symmetric about the median or skewed to one side.
o Points that display observations falling more than 1.5 times the IQR from either edge of the box. These outlying points are unusual, so they are plotted individually.
o A line (or whisker) that extends from each end of the box to the farthest non-outlier point in the distribution.
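The quantities behind these boxplot elements are easy to compute directly; a short Python sketch:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# Quartiles, IQR and the 1.5 * IQR fences that define the boxplot
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, outliers)
```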

2.1.6 Multicollinearity

Multicollinearity inflates the VIF values of correlated variables. We use a stepwise variable-reduction function based on VIF values. The function works as follows:

 Start with the full set of explanatory variables.
 Calculate the VIF for each variable.
 Remove the variable with the single highest VIF.
 Recalculate all VIF values with the reduced set of variables.
 Remove the variable with the next highest VIF, and so on, until all values are below the threshold.
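The stepwise VIF reduction can be sketched as follows (Python/NumPy, purely illustrative; the function names are our own, and the VIF of a column is computed as 1 / (1 - R²) from regressing it on the remaining columns):

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept term
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

def stepwise_vif(X: np.ndarray, names, threshold: float = 5.0):
    """Repeatedly drop the highest-VIF column until every VIF is below threshold."""
    X = X.copy()
    names = list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names

# Toy data: columns "a" and "c" are nearly collinear, "b" is independent
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])
kept = stepwise_vif(X, ["a", "b", "c"])
print(kept)
```

One of the two collinear columns is dropped, while the independent column survives; in R the same loop is usually built around `car::vif()` on a fitted model.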

2.2 BUILD MODELS

2.2.1 KNN

3 CONCLUSION

4 APPENDIX A – SOURCE CODE

> #==============================================================================
> #
> # PROJECT - 5 : Predicting mode of Transport (ML)
> #
> #==============================================================================
> # Environment Set up
> # Setup Working Directory
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> # Install
>
> #install.packages("DMwR")
> #install.packages("class")
> #install.packages("VIF")
> #install.packages("GGally")
> #install.packages("mctest")
>
> # adding library
> library(ggplot2)
> library(DataExplorer)
> library(gower)
> library(rpart)
> #library(dplyr)
> library(plotrix)
> #library(rpart.plot)
> #library(randomForest)
> library(readxl)
> library(readr)
> #library(rattle)
> #library(ROCR)
> #library(ineq)
> #library(ROSE)
> #library(RColorBrewer)
> #library(data.table)
> #library(scales)
> library(corrplot)
> #library(caTools)
> #library(MASS)
> #library(clusterGeneration)
> library(caret)
> library(car)
> library(DMwR)
> library(class)
> library(carData)
> library(lattice)
> library(VIF)
> library(mctest)
> library(e1071)
> library(glmnet)
> library(xgboost)

> library(data.table)
> library(ipred)
>
>
> # Read Input File
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> #attach(Cars)
>
> #Exploratory Data Analysis
>
> # Find out Total Number of Rows and Columns
> dim(Cars)
[1] 444 9
>
>
> # Find out Names of the Columns (Features)
> names(Cars)
[1] "Age" "Gender" "Engineer" "MBA" "Work Exp" "Salary" "Distance" "license" "Transport"
>
>
> # Find out Class of each Feature, along with internal structure
> summary(Cars)
Age Gender Engineer MBA Work Exp Salary
Min. :18.00 Length:444 Min. :0.0000 Min. :0.0000 Min. : 0.0 Min. : 6
1st Qu.:25.00 Class :character 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 3.0 1st Qu.: 9
Median :27.00 Mode :character Median :1.0000 Median :0.0000 Median : 5.0 Median :13
Mean :27.75 Mean :0.7545 Mean :0.2528 Mean : 6.3 Mean :16
3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.0 3rd Qu.:15
Max. :43.00 Max. :1.0000 Max. :1.0000 Max. :24.0 Max. :57
NA's :1
license Transport
Min. :0.0000 Length:444
1st Qu.:0.0000 Class :character
Median :0.0000 Mode :character
Mean :0.2342
3rd Qu.:0.0000
Max. :1.0000

> str(Cars)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 444 obs. of 9 variables:
$ Age : num 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : chr "Male" "Female" "Male" "Female" ...
$ Engineer : num 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : num 0 0 0 1 0 0 0 0 0 0 ...
$ Work Exp : num 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : num 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: chr "Public Transport" "Public Transport" "Public Transport" "Public Transport"
- attr(*, "spec")=
.. cols(
.. Age = col_double(),
.. Gender = col_character(),
.. Engineer = col_double(),

.. MBA = col_double(),
.. `Work Exp` = col_double(),
.. Salary = col_double(),
.. Distance = col_double(),
.. license = col_double(),
.. Transport = col_character()
.. )
>
>
> # Check top and bottom Rows of the Dataset
> head(Cars,5)
# A tibble: 5 x 9
Age Gender Engineer MBA `Work Exp` Salary Distance license Transport
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 28 Male 0 0 4 14.3 3.2 0 Public Transport
2 23 Female 1 0 4 8.3 3.3 0 Public Transport
3 29 Male 1 0 7 13.4 4.1 0 Public Transport
4 28 Female 1 1 5 13.4 4.5 0 Public Transport
5 27 Male 1 0 4 13.4 4.6 0 Public Transport
> tail(Cars,5)
# A tibble: 5 x 9
Age Gender Engineer MBA `Work Exp` Salary Distance license Transport
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 40 Male 1 0 20 57 21.4 1 Car
2 38 Male 1 0 19 44 21.5 1 Car
3 37 Male 1 0 19 45 21.5 1 Car
4 37 Male 0 0 19 47 22.8 1 Car
5 39 Male 1 1 21 50 23.4 1 Car
>
>
> plot_str(Cars)
>
> # Histogram view
>
> plot_histogram(Cars)
>
> # Density view
>
> plot_density(Cars)
>
>
>
> ## The columns converted into factors
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
>
> summary(Cars)
Age Gender Engineer MBA Work Exp Salary Distance
Min. :18.00 Female:128 0:109 0 :331 Min. : 0.0 Min. : 6.50 Min. : 3.20
1st Qu.:25.00 Male :316 1:335 1 :112 1st Qu.: 3.0 1st Qu.: 9.80 1st Qu.: 8.80
Median :27.00 NA's: 1 Median : 5.0 Median :13.60 Median :11.00
Mean :27.75 Mean : 6.3 Mean :16.24 Mean :11.32
3rd Qu.:30.00 3rd Qu.: 8.0 3rd Qu.:15.72 3rd Qu.:13.43
Max. :43.00 Max. :24.0 Max. :57.00 Max. :23.40
Transport
2Wheeler : 83
Car : 61
Public Transport:300

> boxplot(Cars$Age ~Cars$Engineer, main = "Age vs Eng.")

> boxplot(Cars$Age ~Cars$MBA, main ="Age Vs MBA")
> boxplot(Cars$Salary ~Cars$Engineer, main = "Salary vs Eng.")
> boxplot(Cars$Salary ~Cars$MBA, main = "Salary vs MBA.")
> table(Cars$license,Cars$Transport)

2Wheeler Car Public Transport


0 60 13 267
1 23 48 33
> boxplot(Cars$'Work Exp' ~ Cars$Gender)
> boxplot(Cars$Salary~Cars$Transport, main="Salary vs Transport")
> boxplot(Cars$Age~Cars$Transport, main="Age vs Transport")
> boxplot(Cars$Distance~Cars$Transport, main="Distance vs Transport")
> table(Cars$Gender,Cars$Transport)

2Wheeler Car Public Transport


Female 38 13 77
Male 45 48 223
>
>
> # Naive Bayes
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
> Cars<-na.exclude (Cars)
> Cars2<- Cars[,-5]
> Cars2$Salary <- log(Cars2$Salary)
>
> #test and train data
> set.seed(44)
> carindex<-createDataPartition(Cars2$Transport, p=0.7,list = FALSE)
> Cars2datatrain<-Cars2[carindex,]
> Cars2datatest<-Cars2[-carindex,]
> prop.table(table(Cars2datatrain$Transport))

2Wheeler Car Public Transport


0.1891026 0.1378205 0.6730769
> prop.table(table(Cars2datatest$Transport))

2Wheeler Car Public Transport


0.1832061 0.1374046 0.6793893
>
> #Fitting the Naive Bayes model
> Naive_Bayes_Model<-naiveBayes(Cars2datatrain$Transport ~., data=Cars2datatrain)
> Naive_Bayes_Model

Naive Bayes Classifier for Discrete Predictors

Call:

naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
2Wheeler Car Public Transport
0.1891026 0.1378205 0.6730769

Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.00000 2.658753
Car 35.60465 3.546558
Public Transport 26.76190 2.853769

Gender
Y Female Male
2Wheeler 0.4576271 0.5423729
Car 0.2093023 0.7906977
Public Transport 0.2571429 0.7428571

Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627
Car 0.1627907 0.8372093
Public Transport 0.2380952 0.7619048

MBA
Y 0 1
2Wheeler 0.7627119 0.2372881
Car 0.7906977 0.2093023
Public Transport 0.7238095 0.2761905

Salary
Y [,1] [,2]
2Wheeler 2.404017 0.3803774
Car 3.456899 0.4691048
Public Transport 2.522866 0.3143948

Distance
Y [,1] [,2]
2Wheeler 12.16780 3.200185
Car 15.31628 3.864301
Public Transport 10.42286 3.027442
license
Y 0 1
2Wheeler 0.7796610 0.2203390
Car 0.2790698 0.7209302
Public Transport 0.8904762 0.1095238

>
> #Prediction on the train dataset
> NB_Predictions<-predict(Naive_Bayes_Model,Cars2datatrain)
> table(NB_Predictions,Cars2datatrain$Transport)

NB_Predictions 2Wheeler Car Public Transport


2Wheeler 20 0 10
Car 2 35 6
Public Transport 37 8 194
> confusionMatrix(table(NB_Predictions,Cars2datatrain$Transport))
Confusion Matrix and Statistics

NB_Predictions 2Wheeler Car Public Transport


2Wheeler 20 0 10
Car 2 35 6

Public Transport 37 8 194

Overall Statistics

Accuracy : 0.7981
95% CI : (0.7492, 0.8412)
No Information Rate : 0.6731
P-Value [Acc > NIR] : 6.615e-07

Kappa : 0.5485

Mcnemar's Test P-Value : 0.0004845

Statistics by Class:

Class: 2Wheeler Class: Car Class: Public Transport


Sensitivity 0.33898 0.8140 0.9238
Specificity 0.96047 0.9703 0.5588
Pos Pred Value 0.66667 0.8140 0.8117
Neg Pred Value 0.86170 0.9703 0.7808
Prevalence 0.18910 0.1378 0.6731
Detection Rate 0.06410 0.1122 0.6218
Detection Prevalence 0.09615 0.1378 0.7660
Balanced Accuracy 0.64973 0.8921 0.7413
>
> #Prediction on the test dataset
> NB_Predictions<-predict(Naive_Bayes_Model,Cars2datatest)
> table(NB_Predictions,Cars2datatest$Transport)

NB_Predictions 2Wheeler Car Public Transport


2Wheeler 7 0 5
Car 1 17 2
Public Transport 16 1 82
> confusionMatrix(table(NB_Predictions,Cars2datatest$Transport))
Confusion Matrix and Statistics

NB_Predictions 2Wheeler Car Public Transport


2Wheeler 7 0 5
Car 1 17 2
Public Transport 16 1 82

Overall Statistics

Accuracy : 0.8092
95% CI : (0.7313, 0.8725)
No Information Rate : 0.6794
P-Value [Acc > NIR] : 0.0006484

Kappa : 0.5748

Mcnemar's Test P-Value : 0.0689234

Statistics by Class:

Class: 2Wheeler Class: Car Class: Public Transport


Sensitivity 0.29167 0.9444 0.9213
Specificity 0.95327 0.9735 0.5952
Pos Pred Value 0.58333 0.8500 0.8283
Neg Pred Value 0.85714 0.9910 0.7813
Prevalence 0.18321 0.1374 0.6794
Detection Rate 0.05344 0.1298 0.6260
Detection Prevalence 0.09160 0.1527 0.7557
Balanced Accuracy 0.62247 0.9589 0.7583

>
> # KNN
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
> Cars<-na.exclude (Cars)
> Cars2<- Cars[,-5]
> Cars2$Salary <- log(Cars2$Salary)
>
> #test and train data
> set.seed(44)
> carindex<-createDataPartition(Cars2$Transport, p=0.7,list = FALSE)
> Cars2datatrain<-Cars2[carindex,]
> Cars2datatest<-Cars2[-carindex,]
> prop.table(table(Cars2datatrain$Transport))

2Wheeler Car Public Transport


0.1891026 0.1378205 0.6730769
> prop.table(table(Cars2datatest$Transport))

2Wheeler Car Public Transport


0.1832061 0.1374046 0.6793893
>
>
> trControl <- trainControl(method = "cv", number = 10)
> fit.knn <- train(Transport ~ .,
+ method = "knn",
+ tuneGrid = expand.grid(k = 2:20),
+ trControl = trControl,
+ metric = "Accuracy",
+ preProcess = c("center","scale"),
+ data = Cars2datatrain)
> fit.knn
k-Nearest Neighbors

312 samples
7 predictor
3 classes: '2Wheeler', 'Car', 'Public Transport'

Pre-processing: centered (7), scaled (7)


Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 281, 281, 281, 281, 281, 281, ...
Resampling results across tuning parameters:

k Accuracy Kappa
2 0.7302419 0.4281271

3 0.7726815 0.4952761
4 0.7630040 0.4715395
5 0.7596774 0.4502415
6 0.7533266 0.4343293
7 0.7630040 0.4423694
8 0.7695565 0.4590623
9 0.7597782 0.4193741
10 0.7594758 0.4089674
11 0.7594758 0.4013071
12 0.7530242 0.3777808
13 0.7626008 0.3941709
14 0.7658266 0.3973498
15 0.7658266 0.3939163
16 0.7627016 0.3891776
17 0.7594758 0.3774473
18 0.7627016 0.3890871
19 0.7530242 0.3569895
20 0.7562500 0.3697960

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3.
> KNN_predictions <- predict(fit.knn,Cars2datatrain)
> table(KNN_predictions, Cars2datatrain$Transport)

KNN_predictions 2Wheeler Car Public Transport


2Wheeler 36 0 8
Car 2 36 3
Public Transport 21 7 199
> confusionMatrix(table(KNN_predictions, Cars2datatrain$Transport))
Confusion Matrix and Statistics

KNN_predictions 2Wheeler Car Public Transport


2Wheeler 36 0 8
Car 2 36 3
Public Transport 21 7 199

Overall Statistics

Accuracy : 0.8686
95% CI : (0.826, 0.904)
No Information Rate : 0.6731
P-Value [Acc > NIR] : 1.579e-15

Kappa : 0.7177

Mcnemar's Test P-Value : 0.02411

Statistics by Class:

Class: 2Wheeler Class: Car Class: Public Transport


Sensitivity 0.6102 0.8372 0.9476
Specificity 0.9684 0.9814 0.7255
Pos Pred Value 0.8182 0.8780 0.8767
Neg Pred Value 0.9142 0.9742 0.8706
Prevalence 0.1891 0.1378 0.6731
Detection Rate 0.1154 0.1154 0.6378
Detection Prevalence 0.1410 0.1314 0.7276
Balanced Accuracy 0.7893 0.9093 0.8366
> KNN_predictions <- predict(fit.knn,Cars2datatest)
> table(KNN_predictions, Cars2datatest$Transport)

KNN_predictions 2Wheeler Car Public Transport


2Wheeler 8 3 6

Car 3 15 3
Public Transport 13 0 80
> confusionMatrix(table(KNN_predictions, Cars2datatest$Transport))
Confusion Matrix and Statistics

KNN_predictions 2Wheeler Car Public Transport


2Wheeler 8 3 6
Car 3 15 3
Public Transport 13 0 80

Overall Statistics

Accuracy : 0.7863
95% CI : (0.7061, 0.853)
No Information Rate : 0.6794
P-Value [Acc > NIR] : 0.004613

Kappa : 0.547

Mcnemar's Test P-Value : 0.133992

Statistics by Class:

Class: 2Wheeler Class: Car Class: Public Transport


Sensitivity 0.33333 0.8333 0.8989
Specificity 0.91589 0.9469 0.6905
Pos Pred Value 0.47059 0.7143 0.8602
Neg Pred Value 0.85965 0.9727 0.7632
Prevalence 0.18321 0.1374 0.6794
Detection Rate 0.06107 0.1145 0.6107
Detection Prevalence 0.12977 0.1603 0.7099
Balanced Accuracy 0.62461 0.8901 0.7947
>
>
> #Logistic Regression
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
> Cars<-na.exclude (Cars)
> Cars2<- Cars[,-5]
> Cars2$Salary <- log(Cars2$Salary)
>
>
> Cars2$CarUsage<-ifelse(Cars2$Transport =='Car',1,0)
> table(Cars2$CarUsage)

0 1
382 61
> sum(Cars2$CarUsage == 1)/nrow(Cars2)
[1] 0.1376975
> Cars2$CarUsage<-as.factor(Cars2$CarUsage)
> set.seed(44)
> carindex<-createDataPartition(Cars2$Transport, p=0.7,list = FALSE)
> Cars2datatrain<-Cars2[carindex,]
> Cars2datatest<-Cars2[-carindex,]
> prop.table(table(Cars2datatrain$Transport))

2Wheeler Car Public Transport


0.1891026 0.1378205 0.6730769
> prop.table(table(Cars2datatest$Transport))

2Wheeler Car Public Transport


0.1832061 0.1374046 0.6793893
> str(Cars2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 443 obs. of 9 variables:
$ Age : num 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
$ MBA : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
$ Salary : num 2.66 2.12 2.6 2.6 2.6 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...
$ CarUsage : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")= 'exclude' Named int 145
..- attr(*, "names")= chr "145"
> Cars2datatrain<-Cars2datatrain[,-8]
> Cars2datatest<-Cars2datatest[,-8]
> attach(Cars2datatrain)
The following objects are masked from Cars2datatrain (pos = 3):

Age, CarUsage, Distance, Engineer, Gender, license, MBA, Salary

The following objects are masked from Cars2datatrain (pos = 4):

Age, CarUsage, Distance, Engineer, Gender, license, MBA, Salary

The following objects are masked from Cars2datatrain (pos = 5):

Age, CarUsage, Distance, Engineer, Gender, license, MBA, Salary


The following objects are masked from Cars2datatrain (pos = 6):

Age, CarUsage, Distance, Engineer, Gender, license, MBA, Salary

The following objects are masked from Cars2datatrain (pos = 12):

Age, CarUsage, Distance, Engineer, Gender, license, MBA, Salary

> str(Cars2datatrain)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 312 obs. of 8 variables:
$ Age : num 28 23 26 28 22 27 25 27 24 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 2 1 2 2 2 ...
$ Engineer: Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
$ MBA : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Salary : num 2.66 2.12 2.51 2.67 2.01 ...
$ Distance: num 3.2 3.3 4.8 5.1 5.1 5.2 5.2 5.3 5.4 5.5 ...
$ license : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 2 1 2 ...
$ CarUsage: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")= 'exclude' Named int 145
..- attr(*, "names")= chr "145"
> Cars2dataSMOTE<-SMOTE(CarUsage~., as.data.frame(Cars2datatrain), perc.over = 250,perc.under =

> prop.table(table(Cars2dataSMOTE$CarUsage))

0 1
0.5 0.5
> ##Create control parameter for GLM
> outcomevar<-'CarUsage'
> regressors<-c("Age","Salary","Distance","license","Engineer","MBA","Gender")
> trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
> Cars2glm<-train(Cars2dataSMOTE[,regressors],Cars2dataSMOTE[,outcomevar],method = "glm", family =
+ "binomial",trControl = trainctrl)
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(Cars2glm$finalModel)

Call:
NULL

Deviance Residuals:
Min 1Q Median 3Q Max
-3.3474 -0.0194 0.0000 0.0445 2.2655

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -58.0133 12.1697 -4.767 1.87e-06 ***
Age 1.7266 0.3616 4.775 1.80e-06 ***
Salary 1.2720 1.4542 0.875 0.3817
Distance 0.1321 0.1539 0.858 0.3908
license1 1.9338 1.0768 1.796 0.0725 .
Engineer1 -0.1014 1.2338 -0.082 0.9345
MBA1 -2.1554 0.9435 -2.285 0.0223 *
GenderMale -1.0655 0.9267 -1.150 0.2502
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 357.664 on 257 degrees of freedom
Residual deviance: 47.327 on 250 degrees of freedom
AIC: 63.327

Number of Fisher Scoring iterations: 9


> carglmcoeff<-exp(coef(Cars2glm$finalModel))
> write.csv(carglmcoeff,file = "Coeffs.csv")
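The exported Coeffs.csv holds `exp(coef)`, i.e. odds ratios rather than raw log-odds. The conversion is simple arithmetic, sketched here in Python with coefficients copied from the glm summary above:

```python
import math

# Log-odds coefficients copied from the glm summary above
log_odds = {"Age": 1.7266, "license1": 1.9338, "MBA1": -2.1554}

# exp(coef) converts a log-odds coefficient into an odds ratio: the
# multiplicative change in the odds of car usage per unit increase of
# the predictor, holding the other predictors fixed.
odds_ratios = {k: math.exp(v) for k, v in log_odds.items()}

print(round(odds_ratios["Age"], 2))  # → 5.62: each extra year multiplies the odds ~5.6x
```

An odds ratio above 1 (Age, license1) raises the odds of car usage; one below 1 (MBA1) lowers them.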
> varImp(object = Cars2glm)
glm variable importance

Overall
Age 100.00
MBA1 46.94
license1 36.52
GenderMale 22.75
Salary 16.89
Distance 16.54
Engineer1 0.00
> plot(varImp(object = Cars2glm), main="Variable Importance for Logistic Regression")
> carusageprediction<-predict.train(object = Cars2glm,Cars2datatest[,regressors],type = "raw")
> Cars2datatest$CarUsage<-as.factor(Cars2datatest$CarUsage)
> confusionMatrix(carusageprediction,Cars2datatest$CarUsage, positive='1')
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 104 0
1 9 18

Accuracy : 0.9313
95% CI : (0.8736, 0.9681)
No Information Rate : 0.8626
P-Value [Acc > NIR] : 0.010562

Kappa : 0.7605

Mcnemar's Test P-Value : 0.007661

Sensitivity : 1.0000
Specificity : 0.9204
Pos Pred Value : 0.6667
Neg Pred Value : 1.0000
Prevalence : 0.1374
Detection Rate : 0.1374
Detection Prevalence : 0.2061
Balanced Accuracy : 0.9602

'Positive' Class : 1
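All of the statistics reported above follow from the four cells of the confusion matrix. A quick cross-check of the headline numbers (Python, with the counts copied from the matrix above):

```python
# Cells copied from the confusion matrix above
# (rows = prediction, cols = reference; positive class = "1")
tn, fn = 104, 0   # predicted 0
fp, tp = 9, 18    # predicted 1
n = tn + fn + fp + tp

accuracy    = (tp + tn) / n
sensitivity = tp / (tp + fn)   # recall on the positive class
specificity = tn / (tn + fp)
ppv         = tp / (tp + fp)   # positive predictive value
balanced    = (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement vs. agreement expected by chance
p_obs = accuracy
p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (p_obs - p_exp) / (1 - p_exp)

print(round(accuracy, 4), round(kappa, 4))  # → 0.9313 0.7605
```

Note how kappa (0.76) is much lower than raw accuracy (0.93): with an 86% majority class, a large share of the agreement is expected by chance.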

> #str(carusageprediction)
> #str(Cars2datatest$CarUsage)
>
> #summary(carusageprediction)
> #summary(Cars2datatest$CarUsage)
>
> carusagepreddata<-Cars2datatest
> carusagepreddata$predictusage<-carusageprediction
> trainctrlgn<-trainControl(method = 'cv',number = 10,returnResamp = 'none')
> Cars2glmnet<-train(CarUsage~Age+Salary+Distance+license, data = Cars2dataSMOTE,
+ method = 'glmnet', trControl = trainctrlgn)
> Cars2glmnet
glmnet

258 samples
4 predictor
2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 232, 232, 233, 232, 232, 233, ...
Resampling results across tuning parameters:

alpha lambda Accuracy Kappa
0.10 0.0008382152 0.9572308 0.9143453
0.10 0.0083821515 0.9572308 0.9143453
0.10 0.0838215151 0.9303077 0.8604991
0.55 0.0008382152 0.9572308 0.9143453
0.55 0.0083821515 0.9610769 0.9220376
0.55 0.0838215151 0.9610769 0.9220376
1.00 0.0008382152 0.9610769 0.9220376
1.00 0.0083821515 0.9570769 0.9141012
1.00 0.0838215151 0.9610769 0.9221917

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.55 and lambda = 0.08382152.
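glmnet's elastic-net penalty blends the ridge and lasso terms, lambda * ((1 - alpha)/2 * ||b||_2^2 + alpha * ||b||_1); the grid search above settled on alpha = 0.55 and lambda ≈ 0.084. The penalty itself is easy to sketch (Python; the coefficient vector below is hypothetical, for illustration only):

```python
def elastic_net_penalty(beta, alpha, lam):
    """glmnet's penalty: lambda * ((1 - alpha)/2 * ||b||_2^2 + alpha * ||b||_1).
    alpha = 1 is pure lasso, alpha = 0 is pure ridge."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * ((1 - alpha) / 2 * l2 + alpha * l1)

# Selected tuning values from the resampling table above
alpha, lam = 0.55, 0.08382152

# Hypothetical coefficient vector, for illustration only
beta = [1.2, -0.4, 0.0, 0.7]
print(round(elastic_net_penalty(beta, alpha, lam), 4))
```

With alpha = 0.55 the fit sits between lasso (which zeroes weak coefficients, hence the short variable-importance list below) and ridge (which only shrinks them).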
> varImp(object = Cars2glmnet)
glmnet variable importance

Overall
Salary 100.00
Age 26.38
license1 18.50
Distance 0.00
> plot(varImp(object = Cars2glmnet), main="Variable Importance for Logistic Regression - Post
+ Regularization")
> carusagepredictiong<-predict.train(object = Cars2glmnet,Cars2datatest[,regressors],type = "raw")
> confusionMatrix(carusagepredictiong,Cars2datatest$CarUsage, positive='1')
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 105 0
1 8 18

Accuracy : 0.9389
95% CI : (0.8832, 0.9733)
No Information Rate : 0.8626
P-Value [Acc > NIR] : 0.004479

Kappa : 0.7829

Mcnemar's Test P-Value : 0.013328

Sensitivity : 1.0000
Specificity : 0.9292
Pos Pred Value : 0.6923
Neg Pred Value : 1.0000
Prevalence : 0.1374
Detection Rate : 0.1374
Detection Prevalence : 0.1985
Balanced Accuracy : 0.9646

'Positive' Class : 1

>
>
> # Boosting
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
> Cars<-na.exclude (Cars)
> Cars2<- Cars[,-5]
> Cars2$Salary <- log(Cars2$Salary)
> set.seed(44)
> carindex<-createDataPartition(Cars2$Transport, p=0.7,list = FALSE)
> Cars2datatrain<-Cars2[carindex,]
> Cars2datatest<-Cars2[-carindex,]

> Cars2datatrain$license<-as.factor(Cars2datatrain$license)
> Cars2datatest$license<-as.factor(Cars2datatest$license)
> Cars2train.car<-Cars2datatrain[Cars2datatrain$Transport %in% c("Car", "Public Transport"),]
> Cars2train.twlr<-Cars2datatrain[Cars2datatrain$Transport %in% c("2Wheeler", "Public Transport"),]
> Cars2train.car$Transport<-as.character(Cars2train.car$Transport)
> Cars2train.car$Transport<-as.factor(Cars2train.car$Transport)
> Cars2train.twlr$Transport<-as.character(Cars2train.twlr$Transport)
> Cars2train.twlr$Transport<-as.factor(Cars2train.twlr$Transport)
> prop.table(table(Cars2train.car$Transport))

Car Public Transport
0.1699605 0.8300395
> prop.table(table(Cars2train.twlr$Transport))

2Wheeler Public Transport
0.2193309 0.7806691
> cartwlrsm <- SMOTE(Transport~., data = as.data.frame(Cars2train.twlr), perc.over = 150, perc.
> table(cartwlrsm$Transport)

2Wheeler Public Transport
118 118
> carCars2m <- SMOTE(Transport~., data = as.data.frame(Cars2train.car), perc.over = 175, perc.u
> table(carCars2m$Transport)

Car Public Transport
86 86
> car<-carCars2m[carCars2m$Transport %in% c("Car"),]
> Cars2datatrainldasm<-rbind(cartwlrsm,car)
> str(Cars2datatrainldasm)
'data.frame': 322 obs. of 8 variables:
$ Age : num 26 27 20 24 26 28 27 29 24 24 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 1 2 2 2 ...
$ Engineer : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
$ MBA : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 2 1 1 ...
$ Salary : num 2.28 2.6 2.14 2.03 2.55 ...
$ Distance : num 12.2 5.3 7 9.5 10.8 9.3 13.3 10.2 16.8 16.8 ...
$ license : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 1 1 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Public Transport",..: 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "na.action")= 'exclude' Named int 145
..- attr(*, "names")= chr "145"
> boostcontrol <- trainControl(number=10)
> xgbGrid <- expand.grid(
+ eta = 0.3,
+ max_depth = 1,
+ nrounds = 50,
+ gamma = 0,
+ colsample_bytree = 0.6,
+ min_child_weight = 1, subsample = 1
+ )
> Cars2xgb <- train(Transport ~ .,Cars2datatrainldasm,trControl = boostcontrol,tuneGrid = xgbGrid,metric
+ = "Accuracy",method = "xgbTree")
> Cars2xgb$finalModel
##### xgb.Booster
raw: 38.5 Kb
call:
xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
gamma = param$gamma, colsample_bytree = param$colsample_bytree,
min_child_weight = param$min_child_weight, subsample = param$subsample),
data = x, nrounds = param$nrounds, num_class = length(lev),
objective = "multi:softprob")
params (as set within xgb.train):
eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", num_class =
"3", objective = "multi:softprob", silent = "1"
xgb.attributes:
niter
callbacks:

cb.print.evaluation(period = print_every_n)
# of features: 7
niter: 50
nfeatures : 7
xNames : Age GenderMale Engineer1 MBA1 Salary Distance license1
problemType : Classification
tuneValue :
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
1 50 1 0.3 0 0.6 1 1
obsLevels : 2Wheeler Public Transport Car
param :
list()
> predictions_xgb<-predict(Cars2xgb,Cars2datatest)
> confusionMatrix(predictions_xgb,Cars2datatest$Transport)
Confusion Matrix and Statistics

Reference
Prediction 2Wheeler Car Public Transport
2Wheeler 12 0 25
Car 1 18 6
Public Transport 11 0 58

Overall Statistics

Accuracy : 0.6718
95% CI : (0.5843, 0.7512)
No Information Rate : 0.6794
P-Value [Acc > NIR] : 0.614462

Kappa : 0.4182

Mcnemar's Test P-Value : 0.006006

Statistics by Class:

Class: 2Wheeler Class: Car Class: Public Transport
Sensitivity 0.5000 1.0000 0.6517
Specificity 0.7664 0.9381 0.7381
Pos Pred Value 0.3243 0.7200 0.8406
Neg Pred Value 0.8723 1.0000 0.5000
Prevalence 0.1832 0.1374 0.6794
Detection Rate 0.0916 0.1374 0.4427
Detection Prevalence 0.2824 0.1908 0.5267
Balanced Accuracy 0.6332 0.9690 0.6949
Warning message:
In confusionMatrix.default(predictions_xgb, Cars2datatest$Transport) :
Levels are not in the same order for reference and data. Refactoring data to match.
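The multiclass statistics can be cross-checked the same way from the 3×3 matrix: overall accuracy is the diagonal sum over the total, the No Information Rate is the largest class prevalence, and each class's sensitivity is its diagonal cell over its column total (Python, counts copied from the output above):

```python
# Multiclass confusion matrix copied from the xgboost output above
# (rows = prediction, cols = reference: 2Wheeler, Car, Public Transport)
cm = [
    [12, 0, 25],
    [1, 18, 6],
    [11, 0, 58],
]
n = sum(sum(row) for row in cm)

# Overall accuracy: diagonal (correct predictions) over total
accuracy = sum(cm[i][i] for i in range(3)) / n

# Column totals are the class prevalences; the No Information Rate is the
# accuracy of always predicting the most frequent class
col_totals = [sum(row[j] for row in cm) for j in range(3)]
nir = max(col_totals) / n

# Sensitivity per class: diagonal cell over its column total
sensitivity = [cm[j][j] / col_totals[j] for j in range(3)]

print(round(accuracy, 4), round(nir, 4))  # → 0.6718 0.6794
```

Accuracy (0.6718) falls below the NIR (0.6794), which is why the p-value for Acc > NIR is not significant: this boosted model does no better than always predicting Public Transport, despite its perfect sensitivity on the Car class.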
>
>
>
> #Bagging
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
> Cars<-na.exclude (Cars)
> Cars2<- Cars[,-5]
> Cars2$Salary <- log(Cars2$Salary)
> set.seed(44)
> #test and train data
> carindex <- createDataPartition(Cars2$Transport, p=0.70, list=FALSE)
> Cars2datatrain <- Cars2[ carindex,]
> Cars2datatest <- Cars2[-carindex,]
> Cars2.bagging <- bagging(Transport ~.,
+ data=Cars2datatrain,
+ control=rpart.control(maxdepth=5, minsplit=4))
> Cars2datatrain$pred.class <- predict(Cars2.bagging, Cars2datatrain)
> table(Cars2datatrain$Gender,Cars2datatrain$pred.class)

2Wheeler Car Public Transport
Female 17 10 63
Male 15 36 171
> table(Cars2datatrain$Salary,Cars2datatrain$pred.class)

2Wheeler Car Public Transport
1.88706964903238 0 0 1
1.90210752639692 1 0 0
1.91692261218206 1 0 2
1.93152141160321 6 0 0
2.01490302054226 0 0 2
2.02814824729229 0 0 3
2.04122032885964 0 0 4
2.05412373369555 0 0 2
2.06686275947298 3 0 1
2.07944154167984 1 0 1
2.11625551480255 0 0 1
2.12823170584927 0 0 2
2.14006616349627 0 0 9
2.15176220325946 0 0 8
2.16332302566054 1 0 2
2.17475172148416 4 0 3
2.18605127673809 3 0 6
2.19722457733622 3 0 0
2.2512917986065 0 0 5
2.26176309847379 0 0 4
2.28238238567653 0 0 8
2.29253475714054 1 0 5
2.30258509299405 0 0 1
2.35137525716348 0 0 2
2.36085400111802 1 0 4
2.37024374146786 0 0 2
2.37954613413017 0 0 3
2.3887627892351 0 0 2
2.4423470353692 0 0 2
2.45100509811232 1 0 2
2.45958884180371 0 0 3
2.47653840011748 0 0 1
2.50959926237837 0 0 2
2.51769647261099 0 0 1
2.52572864430826 0 0 4
2.53369681395743 0 0 2
2.54160199346455 0 0 10
2.54944517092557 0 0 9
2.55722731136763 0 0 7

2.59525470695687 0 0 1
2.60268968544438 0 0 2
2.61006979274201 0 0 10
2.61739583283408 0 0 6
2.62466859216316 0 0 7
2.63188884013665 0 0 10
2.66025953726586 0 0 1
2.66722820658195 0 0 3
2.67414864942653 0 0 1
2.68102152871429 0 0 15
2.68784749378469 0 0 4
2.69462718077007 0 0 7
2.70136121295141 1 0 5
2.70805020110221 2 0 0
2.73436750941958 0 0 1
2.74727091425549 0 1 6
2.75366071235426 0 0 1
2.76000994003292 0 1 3
2.76631910922619 0 2 1
2.8094026953625 0 0 1
2.81540871942271 0 1 0
2.82137888640921 0 0 1
2.82731362192903 0 5 1
2.83321334405622 0 4 0
2.87919845729804 0 0 3
2.9338568698359 1 0 2
2.98061863574394 0 0 1
3.03013370027132 0 0 2
3.03495298670727 0 0 1
3.07269331469012 0 0 1
3.08190996979504 0 0 1
3.13549421592915 1 0 0
3.16968558067743 1 0 2
3.17387845893747 0 0 1
3.21486780347066 0 0 1
3.25424296870549 0 0 1
3.35689712276558 0 0 2
3.3603753871419 0 1 2
3.49650756146648 0 1 0
3.52636052461616 0 2 0
3.55248682920838 0 1 0
3.55534806148941 0 1 0
3.58351893845611 0 1 0
3.60004824040732 0 0 1
3.61091791264422 0 2 0
3.66356164612965 0 1 0
3.68637632389582 0 1 0
3.71113006304876 0 1 0
3.71357206670431 0 1 0
3.73766961828337 0 2 0
3.75887182593397 0 1 0
3.76120011569356 0 2 0
3.78418963391826 0 1 0
3.80666248977032 0 4 0
3.85014760171006 0 2 0
3.87120101090789 0 1 0
3.93182563272433 0 2 0
3.95124371858143 0 1 0
3.98898404656427 0 1 0
4.00733318523247 0 1 0
4.04305126783455 0 1 0
> Cars2.bagging <- bagging(Transport ~.,
+ data=Cars2datatest,
+ control=rpart.control(maxdepth=5, minsplit=4))
> Cars2datatest$pred.class <- predict(Cars2.bagging, Cars2datatest)
>

> table(Cars2datatest$Gender,Cars2datatest$pred.class)

2Wheeler Car Public Transport
Female 6 4 27
Male 7 15 72
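ipred's `bagging()` fits an rpart tree on each bootstrap resample of the data and predicts by majority vote across the trees. A minimal sketch of that resample-and-vote mechanic (illustrative Python with a made-up one-split base learner and made-up data, not the ipred internals):

```python
import random
from collections import Counter

def bagged_predict(train, x, base_learner, n_models=25, seed=44):
    """Majority vote over base learners fit on bootstrap resamples
    (the aggregation step bagging applies to its decision trees)."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: draw a resample of the training set with replacement
        resample = [rng.choice(train) for _ in train]
        votes.append(base_learner(resample)(x))
    return Counter(votes).most_common(1)[0][0]

def fit_stump(data):
    """Stand-in base learner: one split on a single 'Distance' feature."""
    far = [d for d, lab in data if lab == "Car"]
    near = [d for d, lab in data if lab != "Car"]
    if not far or not near:  # degenerate resample: predict the majority label
        maj = Counter(lab for _, lab in data).most_common(1)[0][0]
        return lambda x: maj
    cut = (sum(far) / len(far) + sum(near) / len(near)) / 2
    return lambda x: "Car" if x > cut else "Public Transport"

# Toy data: (distance, transport mode)
train = [(5, "Public Transport")] * 10 + [(15, "Car")] * 5
print(bagged_predict(train, 18, fit_stump))  # → Car
```

Averaging many trees grown on different resamples is what reduces the variance of a single deep tree; the `maxdepth`/`minsplit` controls in the rpart call above bound how complex each individual tree may become.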
> summary(Cars2dataSMOTE)
Age Gender Engineer MBA Salary Distance license CarUsage
Min. :18.00 Female: 77 0: 52 0:175 Min. :1.902 Min. : 4.80 0:161 0:129
1st Qu.:26.00 Male :181 1:206 1: 83 1st Qu.:2.549 1st Qu.:10.00 1: 97 1:129
Median :31.02 Median :2.817 Median :12.25
Mean :30.90 Mean :2.942 Mean :12.78
3rd Qu.:34.90 3rd Qu.:3.551 3rd Qu.:15.47
Max. :43.00 Max. :4.043 Max. :22.80

>
> # Missing Value & Multicollinearity
> # Setup Working Directory
> setwd("C:/Users/Ishan/Documents")
> getwd()
[1] "C:/Users/Ishan/Documents"
>
> # adding library
> library(DataExplorer)
> library(gower)
> library(rpart)
> library(plotrix)
> library(readr)
> library(car)
> library(DMwR)
> library(class)
> library(carData)
> library(lattice)
> library(corrplot)
>
>
> # Read Input File
> Cars <- read_csv("E:/analytics program great lakes/PROJECT 5/Cars.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
Engineer = col_double(),
MBA = col_double(),
`Work Exp` = col_double(),
Salary = col_double(),
Distance = col_double(),
license = col_double(),
Transport = col_character()
)
> ## The columns converted into factors
> Cars$Engineer<-as.factor(Cars$Engineer)
> Cars$MBA<-as.factor(Cars$MBA)
> Cars$license<-as.factor(Cars$license)
> Cars$Gender<-as.factor(Cars$Gender)
> Cars$Transport<-as.factor(Cars$Transport)
>
>
> Cars$Engineer<-as.numeric(Cars$Engineer)
> Cars$MBA<-as.numeric(Cars$MBA)
> Cars$license<-as.numeric(Cars$license)
> Cars$Gender<-as.numeric(Cars$Gender)
> Cars$Transport<-as.numeric(Cars$Transport)
>
> str(Cars)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 444 obs. of 9 variables:
$ Age : num 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : num 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : num 1 2 2 2 2 2 2 2 2 2 ...
$ MBA : num 1 1 1 2 1 1 1 1 1 1 ...
$ Work Exp : num 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : num 1 1 1 1 1 2 1 1 1 1 ...
$ Transport: num 3 3 3 3 3 3 1 3 3 3 ...
- attr(*, "spec")=
.. cols(
.. Age = col_double(),
.. Gender = col_character(),
.. Engineer = col_double(),
.. MBA = col_double(),
.. `Work Exp` = col_double(),
.. Salary = col_double(),
.. Distance = col_double(),
.. license = col_double(),
.. Transport = col_character()
.. )
> #missing value
> anyNA(Cars)
[1] TRUE
> Cars[!complete.cases(Cars), ]
# A tibble: 1 x 9
Age Gender Engineer MBA `Work Exp` Salary Distance license Transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 28 1 1 NA 6 13.7 9.4 1 3
> Cars<-na.exclude (Cars)
> accounts_n<-Cars
> str(accounts_n)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 443 obs. of 9 variables:
$ Age : num 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : num 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : num 1 2 2 2 2 2 2 2 2 2 ...
$ MBA : num 1 1 1 2 1 1 1 1 1 1 ...
$ Work Exp : num 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : num 1 1 1 1 1 2 1 1 1 1 ...
$ Transport: num 3 3 3 3 3 3 1 3 3 3 ...
- attr(*, "na.action")= 'exclude' Named int 145
..- attr(*, "names")= chr "145"
> corrplot(cor(accounts_n))
> model1 <- glm(Transport ~ ., data= accounts_n)
> summary(model1)

Call:
glm(formula = Transport ~ ., data = accounts_n)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.1081 -0.2535 0.2071 0.4631 1.1748

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.965013 0.529462 1.823 0.069 .
Age 0.087019 0.021419 4.063 5.76e-05 ***
Gender 0.369621 0.077090 4.795 2.24e-06 ***
Engineer -0.007704 0.079025 -0.097 0.922
MBA 0.129177 0.078727 1.641 0.102
`Work Exp` -0.039439 0.026128 -1.509 0.132
Salary -0.008249 0.009595 -0.860 0.390
Distance -0.052808 0.010545 -5.008 8.03e-07 ***
license -0.561159 0.095571 -5.872 8.59e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.5022761)

Null deviance: 276.68 on 442 degrees of freedom
Residual deviance: 217.99 on 434 degrees of freedom
AIC: 963.03

Number of Fisher Scoring iterations: 2

> vif(model1)
Age Gender Engineer MBA `Work Exp` Salary Distance license
7.892948 1.071875 1.015424 1.032638 15.735548 8.871978 1.274628 1.447236
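Each VIF above is 1/(1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors; `Work Exp` at 15.7 is well past the usual cutoff of 10, which is why it is dropped in model2. A sketch of the computation (Python, with two deliberately collinear synthetic columns, not the Cars data):

```python
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on the remaining predictors.  With one other predictor, R^2 is just the
# squared correlation.

def r_squared(y, x):
    """R^2 of a simple least-squares regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Synthetic, deliberately collinear columns (illustration only)
age      = [22, 25, 27, 30, 33, 36, 38, 40]
work_exp = [1, 3, 5, 8, 10, 13, 15, 17]  # tracks age almost perfectly

r2 = r_squared(work_exp, age)
vif = 1 / (1 - r2)
print(vif > 10)  # → True: this predictor would be flagged for removal
```

Dropping the offending column and recomputing (as model2 through model4 do below) is the standard remedy: once `Work Exp` is removed, every remaining VIF falls under 5.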
> model2 <- glm(Transport ~ ., data= accounts_n[,-5])
> summary(model2)

Call:
glm(formula = Transport ~ ., data = accounts_n[, -5])

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0607 -0.2625 0.2145 0.4680 1.0776

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.488038 0.400920 3.712 0.000233 ***
Age 0.064015 0.015072 4.247 2.65e-05 ***
Gender 0.373212 0.077167 4.836 1.84e-06 ***
Engineer -0.004950 0.079120 -0.063 0.950141
MBA 0.115982 0.078355 1.480 0.139541
Salary -0.018465 0.006811 -2.711 0.006973 **
Distance -0.051120 0.010501 -4.868 1.58e-06 ***
license -0.545627 0.095155 -5.734 1.84e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.5037523)

Null deviance: 276.68 on 442 degrees of freedom
Residual deviance: 219.13 on 435 degrees of freedom
AIC: 963.35

Number of Fisher Scoring iterations: 2


> vif(model2)
Age Gender Engineer MBA Salary Distance license
3.896910 1.070855 1.014883 1.019907 4.457554 1.260307 1.430460
> model3 <- glm(Transport ~ ., data= accounts_n[,-c(5,6)])
> summary(model3)

Call:
glm(formula = Transport ~ ., data = accounts_n[, -c(5, 6)])

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0817 -0.2194 0.2039 0.4808 1.1676

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.259173 0.284592 7.938 1.75e-14 ***
Age 0.031048 0.008969 3.462 0.00059 ***
Gender 0.381131 0.077671 4.907 1.31e-06 ***
Engineer -0.007425 0.079688 -0.093 0.92581
MBA 0.109572 0.078888 1.389 0.16555
Distance -0.058510 0.010215 -5.728 1.90e-08 ***
license -0.605413 0.093236 -6.493 2.29e-10 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.5110887)

Null deviance: 276.68 on 442 degrees of freedom
Residual deviance: 222.83 on 436 degrees of freedom
AIC: 968.78

Number of Fisher Scoring iterations: 2

> vif(model3)
Age Gender Engineer MBA Distance license
1.360182 1.069320 1.014748 1.018978 1.175392 1.353631
> model4 <- glm(Transport ~ ., data= accounts_n[,-c(5,1)])
> summary(model4)

Call:
glm(formula = Transport ~ ., data = accounts_n[, -c(5, 1)])

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0259 -0.1351 0.2098 0.4732 1.1094

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.877943 0.236105 12.189 < 2e-16 ***
Gender 0.383198 0.078624 4.874 1.54e-06 ***
Engineer 0.009012 0.080581 0.112 0.911
MBA 0.100695 0.079787 1.262 0.208
Salary 0.004875 0.004102 1.189 0.235
Distance -0.053903 0.010684 -5.045 6.66e-07 ***
license -0.532477 0.096945 -5.493 6.74e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.5234398)

Null deviance: 276.68 on 442 degrees of freedom
Residual deviance: 228.22 on 436 degrees of freedom
AIC: 979.36

Number of Fisher Scoring iterations: 2


> vif(model4)
Gender Engineer MBA Salary Distance license
1.069861 1.013131 1.017755 1.555869 1.255401 1.428946
> #############################################################################################
> # END OF PROJECT
> #############################################################################################
