Analysis of Transport Choice of Employees - A Project On Machine Learning

This case study analyzes employee transport preference data using machine learning algorithms. Exploratory data analysis is performed on the data, including checking for outliers and missing values. Logistic regression, KNN, naive bayes, bagging, and boosting models are built and compared using various performance metrics to best identify which employees prefer cars for their commute.

This case study was prepared to understand employee transport preference. The complete case study is performed using Machine Learning algorithms.

Project on Machine Learning

SHYAM KISHORE TRIPATHI


PGP - BABI

Table of Contents

1. PROJECT OBJECTIVE
2. DEFINING BUSINESS PROBLEM
3. EXPLORATORY DATA ANALYSIS
   3.1 VARIABLE IDENTIFICATION & DATASET UNDERSTANDING
      3.1.1 Data Structure
      3.1.2 Data Summary
      3.1.3 Check/Visualize Missing Values
   3.2 UNIVARIATE & BIVARIATE ANALYSIS
   3.3 SMOTE
   3.4 CHECK FOR CORRELATION
4. LOGISTIC REGRESSION
   4.1 LOGISTIC REGRESSION MODELS
   4.2 LOGISTIC REGRESSION MODEL PERFORMANCE
      4.2.1 Confusion Matrix
      4.2.2 ROC
      4.2.3 K-S
      4.2.4 Gini
5. K-NEAREST NEIGHBOURS (K-NN) CLASSIFICATION
   5.1 K-NN MODEL WITH K=3 (3 NEAREST NEIGHBOURS)
      5.1.1 Confusion Matrix
      5.1.2 ROC
      5.1.3 K-S
      5.1.4 Gini
   5.2 K-NN MODEL WITH K=5 (5 NEAREST NEIGHBOURS)
      5.2.1 Confusion Matrix
      5.2.2 ROC
      5.2.3 K-S
      5.2.4 Gini
6. NAÏVE BAYES MODEL
   6.1 Confusion Matrix
   6.2 ROC
   6.3 K-S
7. BAGGING AND BOOSTING
   7.1 Conclusion
8. MODEL COMPARISON & CONCLUSION
9. APPENDIX

Project Objective

This case study is prepared for an organization to study its employees' transport preferences for commuting and to predict whether an employee will use a car as a mode of transport, as well as which variables are significant predictors of this decision. The objective is to build the best model, using Machine Learning techniques, that can identify the employees who prefer cars.
We will perform the steps below and analyze the data using Machine Learning modeling techniques to identify such employees:
1. EDA
   1.1 Examine the data through univariate and bivariate analysis, with plots and charts that illustrate the relationships between variables.
   1.2 Look for outliers and missing values.
   1.3 Check the distribution of the target variable in the given dataset and apply SMOTE treatment accordingly.
   1.4 Check for multicollinearity and treat it.
   1.5 Summarize the insights gained from EDA.
2. Data Preparation
   2.1 Prepare the data for analysis.
3. Build various predictive models and compare them to find the best one
   3.1 Build a Logistic Regression model and interpret it.
   3.2 Build a KNN model and interpret it.
   3.3 Build a Naive Bayes model and interpret it.
   3.4 Compare the models using model performance metrics.
   3.5 Apply both bagging and boosting procedures to create two models and compare their accuracy with the best model.
4. Actionable Insights
   4.1 Interpretation and recommendations from the best model.

The complete case study is performed on the given dataset (Cars.csv) to build a suitable predictive model using Machine Learning techniques such as Logistic Regression, KNN and Naive Bayes, applying Bagging and Boosting on top of them, and finally measuring model performance with various metrics:
Confusion Matrix (for all models)
AUC – ROC (for all models)
Gini Coefficient (only for Logistic Regression)
Kolmogorov-Smirnov (K-S) Chart (only for Logistic Regression)

Defining Business Problem
The objective is to understand the mode of transport employees prefer for commuting to their office.

Several factors predominantly play an important role in the choice of transport mode, such as:
- Monthly salary
- Expenses
- Work experience
- Distance
- Position held
- Age

In this case study we will try to understand which factors influence an employee's decision to use a car as their preferred means of transport by building Machine Learning models.

Data Dictionary
The dataset contains information on 418 employees: their mode of transport as well as personal and professional details such as age, salary and work experience.

Variables    Description
Age          Age of the employee
Gender       Gender of the employee
Engineer     Whether the employee is an engineering graduate. 1 means Engineer, 0 means not.
MBA          Whether the employee has an MBA. 1 means MBA, 0 means not.
Work Exp     Total work experience of the employee (in years)
Salary       Monthly salary of the employee
Distance     Average distance the employee travels
license      Whether the employee holds a valid driving license. 1 means yes, 0 means no.
Transport    The transport mode the employee currently prefers for commuting.

Data Summary and Exploratory Data Analysis


Structure of Data
'data.frame': 418 obs. of 9 variables:
 $ Age      : int 28 24 27 25 25 21 23 23 24 28 ...
 $ Gender   : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
 $ Engineer : int 1 1 1 0 0 0 1 0 1 1 ...
 $ MBA      : int 0 0 0 0 0 0 1 0 0 0 ...
 $ Work.Exp : int 5 6 9 1 3 3 3 0 4 6 ...
 $ Salary   : num 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
 $ Distance : num 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
 $ license  : int 0 0 0 0 0 0 0 0 0 1 ...
 $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...

Target Variable: - Transport

Summary of Data

Blank value check and treatment

Univariate and Bivariate Analysis

1. Transport

Conclusion: -
1. Transport has 3 levels: 2Wheeler, Car and Public Transport.
2. Out of 418 employees, 83 travel by 2-Wheeler, 35 by Car and 300 by Public Transport.
3. The percentages are as follows: 19.9% use two-wheelers, 8.4% use cars and 71.8% use public transport.
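The percentages above can be checked directly from the counts. The original analysis is in R; a quick Python stand-in:

```python
# Reproduce the target-variable proportions reported above (418 employees:
# 83 two-wheeler, 35 car, 300 public transport).
counts = {"2Wheeler": 83, "Car": 35, "Public Transport": 300}
total = sum(counts.values())  # 418

proportions = {mode: round(100 * n / total, 1) for mode, n in counts.items()}
print(proportions)  # {'2Wheeler': 19.9, 'Car': 8.4, 'Public Transport': 71.8}
```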

2. Mode of transport by Gender

Conclusion: -
1. Very few females use cars compared to males. Both males and females mostly use public transport.
2. No significant difference due to gender.

3. Mode of Transport by Engineer

Conclusion: -
1. No Significant difference due to Engineer/Non-Engineer

4. Mode of Transport by MBA

Conclusion: -
1. No Significant difference due to MBA/Non-MBA

5. License Analysis by Mode of Transport

Conclusion: -
1. Driving License Holders prefer 2 Wheelers & Cars over Public Transport.
2. Significant number of people without Driving License use 2 Wheelers

6. Analysis of Work Experience in Years by Transport Mode

Conclusion: -
1. The higher the experience, the greater the usage of cars over 2-wheelers and public transport.
2. Employees with 15 to 25 years of work experience prefer cars.

7. Analysis of Salary by Transport Mode

Conclusion: -
1. The higher the Salary the less the usage of 2 Wheelers and Public Transport.

8. Analysis of Distance by Transport Mode

Conclusion: -
1. Car is preferred for travelling a distance greater than 13 Miles

Our primary interest, as per the problem statement, is to understand the factors influencing car usage. Hence, we will create a new column for car usage. It takes the value 0 for Public Transport and 2-Wheeler, and 1 for car usage, so that we can understand the proportion of car users accordingly.

We can clearly see that the target class makes up less than 10% of the total dataset, so we will apply SMOTE in further steps. Before that, we convert the Engineer, MBA and license variables into factor variables with the R code below.
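The R code for this step is shown only as an image in the original report. For illustration, the pandas analogue of R's factor conversion is the "category" dtype (column names taken from the data dictionary; data values are made up):

```python
import pandas as pd

# Toy frame standing in for the Cars data; the real step converts the
# Engineer, MBA and license columns from integers to categorical (factor) type.
df = pd.DataFrame({"Engineer": [1, 1, 0], "MBA": [0, 1, 0], "license": [0, 0, 1]})
for col in ["Engineer", "MBA", "license"]:
    df[col] = df[col].astype("category")

print(df.dtypes)  # all three columns are now "category"
```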

Checking the target variable proportion in the overall dataset

First, check the proportion of the target variable in the actual dataset.

The number of records for people travelling by car is in the minority, i.e. about 10%. Hence, we need to use an appropriate sampling method.
We will use SMOTE to balance the target variable proportion, use the resulting Train and Test datasets in logistic regression to find the best-fitting model, and explore a couple of black-box models for prediction later.

Applying SMOTE for data balancing

After balancing, the proportion of the target class has risen above 10%, and we can use this balanced dataset for the models that follow.
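The report applies SMOTE via an R package (code shown as an image, not reproduced here). As a rough illustration of the idea, a minimal NumPy sketch of SMOTE-style oversampling on toy data standing in for the car/non-car classes:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_oversample(X_min, n_new, k=5, rng=rng):
    """Minimal SMOTE sketch: for each synthetic point, pick a random minority
    sample, choose one of its k nearest minority neighbours, and interpolate
    a new point somewhere on the line segment between them."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced data: 90 majority rows vs 10 minority rows.
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(3.0, 1.0, size=(10, 2))
X_new = smote_oversample(X_minor, n_new=40)
print(len(X_minor) + len(X_new))  # minority grows from 10 to 50 samples
```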

Let’s create a subset and create Train and Test dataset
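As an illustration of this step, a simple random train/test split in NumPy (the report performs this in R; the 70:30 ratio is the one stated later in the k-NN section):

```python
import numpy as np

# 70:30 train/test split over the 418 rows of the Cars dataset.
rng = np.random.default_rng(7)
n = 418
idx = rng.permutation(n)        # shuffle the row indices
cut = int(0.7 * n)              # 292 training rows
train_idx, test_idx = idx[:cut], idx[cut:]
print(len(train_idx), len(test_idx))  # 292 126
```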

Checking Correlation
Let's look at the correlation between all the variables and treat highly correlated variables accordingly to build the
regression model.

Correlation Interpretation
• Age, Work Exp and Salary are highly Correlated
• Age, Work Exp and Salary are all moderately correlated with Distance and License
• Transport is marginally correlated with Gender, but not significantly

Since correlation alone does not clearly identify the variables from which we can predict the mode of transport, we will perform a logistic regression.

Logistic Regression
We will start with logistic regression analysis, as it gives clear insight into which variables are significant for the model, so that we can achieve more precision by eliminating irrelevant variables.

Building Logistic Regression Model based upon all the given variables

Checking for Logistic Regression Model Multicollinearity

Interpretation from the logistic model using all available variables, after checking multicollinearity:

- The multicollinearity has inflated the VIF values of the correlated variables, making the model unreliable.
- The VIF values for Salary and Work Exp are 5.54 and 15.69 respectively, both above the threshold.
- Being conservative and not accepting VIF values above 5, we will remove Salary & Work Exp (highly correlated).

Steps for Variable Reduction:


- Start with the full set of explanatory variables.
- Calculate the VIF for each variable.
- Remove the variable with the highest VIF.
- Recalculate all VIF values for the logistic model built with the new set of variables.
- Again remove the variable with the highest VIF, until all values are within the threshold.
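The VIF computation behind these steps can be sketched in Python (the report uses R's vif(); this NumPy version regresses each column on the others, on made-up data where work experience tracks age, as in the dataset):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X: regress column j on the
    remaining columns (with intercept) and compute 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Toy data: work_exp is strongly tied to age, distance is independent.
rng = np.random.default_rng(0)
age = rng.normal(35, 8, 200)
work_exp = age - 22 + rng.normal(0, 1, 200)
distance = rng.normal(10, 3, 200)
vifs = vif(np.column_stack([age, work_exp, distance]))
print(vifs)  # high VIF for age and work_exp, near 1 for distance
```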

Creating Model 2 - Logistic Regression after Removing highly correlated variables

Create 2nd Model after removing correlated variables Salary & Work Exp

Engineer, Distance, Gender and MBA are insignificant, so we will remove them as well and create a new model based on the remaining variables.

Creating Model 3 - Logistic Regression built after Removing all insignificant variables

In this newly built model we can see that all variables are significant, and we can verify the same by checking multicollinearity.

Now the VIF values are within range, all variables are significant, and the results make more sense and are in line with what we obtained from EDA.
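The final model is fitted in R with glm (code shown as an image in the original). For illustration, a minimal NumPy logistic regression trained by gradient descent on toy data, with one feature standing in for a scaled predictor such as Age:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Plain gradient-descent logistic regression (sketch, not glm itself)."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))           # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)       # gradient of the log-loss
    return w

# Toy data: two mostly separable groups standing in for non-car/car users.
rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-1, 0.5, 50), rng.normal(1, 0.5, 50)]).reshape(-1, 1)
y = np.concatenate([np.zeros(50), np.ones(50)])

w = fit_logistic(X, y)
p = 1 / (1 + np.exp(-(w[0] + w[1] * X[:, 0])))
acc = float(np.mean((p > 0.5) == y))
print(acc)  # train accuracy, close to 1 on this toy data
```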

Regression Model Performance on Train and Test Data set

1. Confusion Matrix: -
We will evaluate the model on the train and test data with the code below and see how accurate the model is at identifying employees who prefer the car as their mode of transport.

Calculating the Confusion Matrix on Train Data: - We predict the classification (0 or 1) for each row, then tabulate actual against predicted values to build the confusion matrix and check the model's accuracy with the R code below.
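The table-based confusion matrix the report builds in R can be illustrated in plain Python with made-up labels (1 = car, 0 = not car):

```python
# Build a 2x2 confusion matrix and accuracy from actual vs predicted labels.
actual    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

accuracy = (tp + tn) / len(actual)
print([[tn, fp], [fn, tp]], accuracy)  # [[5, 1], [1, 3]] 0.8
```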

Calculating Confusion Matrix on Test Data: -

Confusion Matrix Output: -

From the confusion matrix we can clearly see that our model is 96.75% accurate on the train data, and the test data confirms this with 96.20% accuracy. There is a slight variation, but it is within range, so we can conclude that our model is a good model.

2. ROC
The ROC curve is the plot between sensitivity and (1- specificity).
(1- specificity) is also known as false positive rate and sensitivity is also known as True Positive rate.
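The AUC under this curve can be computed by sweeping thresholds and integrating TPR against FPR with the trapezoid rule. A Python stand-in for the R ROC call in the report, on illustrative data:

```python
def roc_auc(y_true, scores):
    """Sweep thresholds over the scores, trace the ROC curve, integrate."""
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t and y == 1 for s, y in zip(scores, y_true)) / pos
        fpr = sum(s >= t and y == 0 for s, y in zip(scores, y_true)) / neg
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2  # trapezoid rule
    return auc

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```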

Calculating ROC on Train Data

Calculating ROC on Test Data

ROC Output Analysis: -


From the plot we can see that the curve covers a large area, and the model predicts well on the true-positive side.

In the train data the true positive rate is 99.66% and in the test data it is 98.80%. There is no major variation between test and train, which shows that the model is stable.

3. K-S chart
K-S will measure the degree of separation between car users and non-car users
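The K-S statistic is the maximum gap between the cumulative score distributions of the two groups, i.e. the largest value of TPR minus FPR over all thresholds. A Python sketch with illustrative data (the report computes this in R):

```python
def ks_stat(y_true, scores):
    """Maximum vertical separation between car users' and non-users'
    cumulative score distributions: max over thresholds of (TPR - FPR)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(s >= t and y == 1 for s, y in zip(scores, y_true)) / pos
        fpr = sum(s >= t and y == 0 for s, y in zip(scores, y_true)) / neg
        best = max(best, tpr - fpr)
    return best

print(ks_stat([0, 0, 1, 1], [0.2, 0.3, 0.6, 0.9]))  # 1.0: perfect separation
```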

By executing below code on Train and Test model, we will be able to see K-S Analysis result: -

K-S Output Analysis


From the K-S analysis we can clearly see that the model distinguishes people likely to prefer a car from the rest with a K-S statistic of 95.30% on train and 93.96% on test. There is a slight variation, but it is within range, so we can conclude that the model is acceptable.

4. Gini chart
Gini is the ratio of the area between the ROC curve and the diagonal line to the area of the upper triangle.
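This definition reduces to a simple formula, Gini = 2 × AUC − 1. As a sanity check in Python (the 0.8653 input is an assumed AUC, chosen because it yields the 73.06% Gini quoted for the train data):

```python
def gini_from_auc(auc):
    """Gini coefficient from the area under the ROC curve."""
    return 2 * auc - 1

print(round(gini_from_auc(0.8653), 4))  # 0.7306
```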

Gini Output Analysis


From the Gini analysis we can clearly see that the model separates car and non-car employees well, with a Gini coefficient of 73.06% on train and 71.93% on test. There is a slight variation, but it is within range, so we can conclude that the model is acceptable.

k-NN Classification
k-NN is a supervised learning algorithm. It uses labeled input data to learn a function that produces an appropriate output for new, unlabeled data. Let's build our classification model with the following steps: -

Splitting data into Test and Train set in 70:30 ratio

Creating k-NN model
When we choose 3 neighbors
Creating k-NN model on Train and Test data set
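The k-NN prediction step can be sketched in NumPy (the report uses R's knn(); features should be scaled first, and the toy data below are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Minimal k-NN sketch: majority vote among the k nearest training rows."""
    preds = []
    for x in X_new:
        d = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances
        nearest = y_train[np.argsort(d)[:k]]        # labels of k nearest rows
        preds.append(np.bincount(nearest).argmax()) # majority vote
    return np.array(preds)

# Toy data standing in for the scaled employee features (0 = no car, 1 = car).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
preds = knn_predict(X_train, y_train, np.array([[0.5, 0.5], [5.5, 5.5]]))
print(preds)  # [0 1]
```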

Performing classification Model Performance Measures, when K=3

1. Confusion Matrix: -
We will evaluate the model on the train and test data with the code below and see how accurate the model is at identifying employees who prefer the car as their mode of transport.

Calculating the Confusion Matrix on Train Data: - We predict the classification (0 or 1) for each row, then tabulate actual against predicted values to build the confusion matrix and check the model's accuracy with the R code below.

Calculating Confusion Matrix on Test Data: -

Confusion Matrix Output: -

From the confusion matrix we can clearly see that our model is 97.83% accurate on the train data, and the test data confirms this with 94.93% accuracy. There is a slight variation, but it is within range, so we can conclude that our model is a good model.

2. ROC
The ROC curve is the plot between sensitivity and (1- specificity).
(1- specificity) is also known as false positive rate and sensitivity is also known as True Positive rate.

Calculating ROC on Train Data

Calculating ROC on Test Data

ROC Output Analysis: -


From the plot we can see that the curve covers a large area, and the model predicts well on the true-positive side.

In the train data the true positive rate is 97.22% and in the test data it is 92.58%. There is a noticeable variation between test and train, which suggests the model is less stable.

3. K-S chart
K-S will measure the degree of separation between car users and non-car users

By executing below code on Train and Test model, we will be able to see K-S Analysis result: -

K-S Output Analysis


From the K-S analysis we can clearly see that the model distinguishes people likely to prefer a car from the rest with 94.44% on train but only 85.17% on test. This variation is not within range, so we conclude that the model is not stable.

4. Gini chart
Gini is the ratio between area between the ROC curve and the diagonal line & the area of the above triangle.

Gini Output Analysis

From the Gini analysis we can see that the model separates car and non-car employees poorly, with a Gini coefficient of only 15.39% on train and 15.98% on test. The variation between train and test is slight and within range, but the low values indicate weak separation.

When we choose 5 neighbors


Creating k-NN model on Train and Test data set

Performing classification Model Performance Measures, when K=5

1. Confusion Matrix: -
We will evaluate the model on the train and test data with the code below and see how accurate the model is at identifying employees who prefer the car as their mode of transport.

Calculating the Confusion Matrix on Train Data: - We predict the classification (0 or 1) for each row, then tabulate actual against predicted values to build the confusion matrix and check the model's accuracy with the R code below.

Calculating Confusion Matrix on Test Data: -

Confusion Matrix Output: -


From the confusion matrix we can clearly see that our model is 97.56% accurate on the train data, and the test data confirms this with 95.56% accuracy. There is a slight variation, but it is within range, so we can conclude that our model is a good model.
2. ROC
The ROC curve is the plot between sensitivity and (1- specificity).
(1- specificity) is also known as false positive rate and sensitivity is also known as True Positive rate.

Calculating ROC on Train Data

Calculating ROC on Test Data

ROC Output Analysis: -


From the plot we can see that the curve covers a large area, and the model predicts well on the true-positive side.

In the train data the true positive rate is 97.02% and in the test data it is 94.04%. There is no major variation between test and train, which shows that the model is stable.

3. K-S chart
K-S will measure the degree of separation between car users and non-car users
By executing below code on Train and Test model, we will be able to see K-S Analysis result: -

K-S Output Analysis


From the K-S analysis we can clearly see that the model distinguishes people likely to prefer a car from the rest with 94.04% on train and 88.08% on test. This variation is not within range, so we conclude that the model is not stable.

4. Gini chart
Gini is the ratio between area between the ROC curve and the diagonal line & the area of the above triangle.

Gini Output Analysis
From the Gini analysis we can see that the model separates car and non-car employees poorly, with a Gini coefficient of only 15.32% on train and 15.57% on test. The variation between train and test is slight and within range, but the low values indicate weak separation.

Creating Naïve Bayes model

The Naive Bayes classifier presumes that the presence of a feature in a class is unrelated to the presence of any other feature in the same class. Let's build the model and see how good it is as a classifier.
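A Gaussian Naive Bayes classifier built on this independence assumption can be sketched in NumPy (the report uses an R package; the toy data below are illustrative):

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Per-class feature means, variances and priors (the whole 'model')."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return stats

def gaussian_nb_predict(stats, X_new):
    """Pick the class maximising log-prior + sum of per-feature log-likelihoods
    (the sum is exactly the naive independence assumption)."""
    preds = []
    for x in X_new:
        scores = {}
        for c, (mu, var, prior) in stats.items():
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            scores[c] = np.log(prior) + log_lik
        preds.append(max(scores, key=scores.get))
    return np.array(preds)

# Toy data standing in for two employee groups (0 = no car, 1 = car).
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]])
y = np.array([0, 0, 0, 1, 1, 1])
model = gaussian_nb_fit(X, y)
nb_preds = gaussian_nb_predict(model, np.array([[1.1, 2.0], [5.1, 6.0]]))
print(nb_preds)  # [0 1]
```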

Performing classification Model Performance Measures for Naïve Bayes

1. Confusion Matrix: -
Calculating Confusion Matrix on Train and Test Data: -

Calculating Confusion Matrix on Test Data: -

Confusion Matrix Output: -

From the confusion matrix we can clearly see that the model is 95.94% accurate on the train data and 93.67% accurate on the test data in predicting car usage.
2. ROC
The ROC curve is the plot between sensitivity and (1- specificity).
(1- specificity) is also known as false positive rate and sensitivity is also known as True Positive rate.

Calculating ROC on Train Data

Calculating ROC on Test Data

ROC Output Analysis: -


From the plot we can see that the curve covers a large area, and the model predicts well on the true-positive side.

In the train data the true positive rate is 72.93%, while in the test data it is 94.04%. There is a major variation between test and train, which shows that the model is not stable.

3. K-S chart
K-S will measure the degree of separation between car users and non-car users
By executing below code on Train and Test model, we will be able to see K-S Analysis result: -

K-S Output Analysis


From the K-S analysis we can see that the model distinguishes people likely to prefer a car from the rest with 45.87% on train and 50.03% on test. The variation is slight and within range, so we can conclude that the model is stable.

Applying Bagging and Boosting Technique


Bagging and Boosting: - These are ensemble techniques that train multiple models using the same algorithm and combine them to create a strong learner from weak ones.

Bagging (a.k.a. Bootstrap Aggregating) is a way to decrease the variance of predictions by generating additional training data from the original dataset, using sampling with replacement to produce multisets of the same cardinality/size as the original data.

Boosting is a way of training weak learners sequentially, with each learner focusing on the mistakes of the previous ones.
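Both ideas can be illustrated with a minimal bagging sketch in NumPy: draw bootstrap multisets of the original size, fit a weak threshold "stump" learner on each, and combine by majority vote (toy data; the report's actual models are built with R packages):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, y):
    """Weak learner: the single (feature, threshold) split with lowest error."""
    best = (np.inf, 0, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            err = np.mean((X[:, j] >= t).astype(int) != y)
            if err < best[0]:
                best = (err, j, t)
    return best[1], best[2]

def bagging_predict(X, y, X_new, n_models=25):
    """Bagging sketch: fit stumps on bootstrap multisets, majority-vote."""
    votes = np.zeros(len(X_new))
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        j, t = fit_stump(X[idx], y[idx])
        votes += (X_new[:, j] >= t).astype(int)
    return (votes > n_models / 2).astype(int)

# Toy 1-D data: small values are class 0, large values class 1.
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])
bag_preds = bagging_predict(X, y, np.array([[2.5], [9.5]]))
print(bag_preds)
```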

Applying Bagging model:

This helps in comparing the predictions with the observed values, thereby estimating the errors.

Interpretation:

Bagging here goes with the baseline approach, calling every record one class; this is an extreme that misrepresents the minority class, so it is not preferable.

Firstly, convert the dependent variable to numeric.

Next, use the balanced test data from the SMOTE analysis.

Run the Boosting model:

We use the XGBoost method, a specialized implementation of gradient-boosted decision trees designed for performance.

XGBoost works with matrices that contain only numeric variables, so first convert the data to a matrix.

Run the XGBoost model:

The parameters above are described as follows:
eta = the learning rate at which values are updated; a small value means slow learning
max_depth = how many levels the trees may expand. The larger the depth, the more complex the model and the higher the chance of overfitting. There is no standard value for max_depth; larger datasets require deeper trees to learn the rules from the data.
min_child_weight = blocks potential over-specific feature interactions to prevent overfitting
nrounds = controls the maximum number of iterations; for classification, it is similar to the number of trees to grow
nfold = used for cross-validation
verbose = suppress the printed output
early_stopping_rounds = stop if there is no improvement for 10 consecutive rounds
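For reference, a hypothetical translation of these parameters into the Python xgboost API (the report uses R's xgboost, and its code is not reproduced, so the values below are placeholders; R's nrounds/nfold map to num_boost_round/nfold in xgb.cv). This is a config sketch only:

```python
# Placeholder parameter values; the report's actual settings are not shown.
params = {
    "objective": "binary:logistic",
    "eta": 0.01,            # slow learning rate
    "max_depth": 5,         # deeper trees = more complex, more overfit risk
    "min_child_weight": 3,  # limits over-specific splits
}
# Equivalent cross-validated call (assumes the xgboost package; not run here):
# xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
#        verbose_eval=False, early_stopping_rounds=10)
```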

Confusion Matrix Output: -

It shows a prediction of 100% accuracy in identifying the employees who use cars.

Unlike bagging, this model is a proper representation of both the majority and the minority class.

Model Comparison and Conclusion


The Logistic Regression, K-NN and Naïve Bayes models are all able to predict the transport mode with very high accuracy.

However, using Bagging and Boosting, we can predict the choice of transport mode with 100% accuracy.

In this case, any of the models (Logistic Regression, K-NN, Naïve Bayes or Bagging/Boosting) can be used for high-accuracy prediction. The key step, however, is SMOTE for balancing the minority and majority classes, without which our models would not be so accurate.

Appendix
R Code

