
Machine Learning - Report

Surabhi Kulkarni

PGP-DSBA Online

Table of Contents

Executive Summary
Data Description
Sample of the dataset
Exploratory Data Analysis
Let us check the types of variables in the data frame
Check for missing values in the dataset
Pair Plot
Box Plot
Histogram

List of Figures

Fig. 1 – Pair plot
Fig. 2 – Correlation heatmap
Fig. 3 – Products
Fig. 4 – Outlier
Fig. 5 – Histplot
Executive Summary

Problem 1: You work for an office transport company. You are in
discussions with ABC Consulting company for providing transport for
their employees. For this purpose, you are tasked with understanding
how the employees of ABC Consulting presently prefer to commute
(between home and office). Based on parameters like age, salary and
work experience given in the data set ‘Transport.csv’, you are
required to predict the preferred mode of transport. The project
requires you to build several Machine Learning models and compare
them so that the model can be finalised.

Data Dictionary

Age : Age of the employee in years

Gender : Gender of the employee

Engineer : 1 if Engineer, 0 if not

MBA : 1 if MBA, 0 if not

Work Exp : Work experience in years

Salary : Salary in lakhs per annum

Distance : Distance in km from home to office

license : 1 if the employee has a driving licence, 0 if not

Transport : Mode of transport


The objective is to build various Machine Learning models on this data
set and, based on accuracy metrics, decide which model should be
finalised for predicting the mode of transport chosen by each
employee.

Importing Libraries.

Importing Data.

Checking the type of the dataset.


Checking the shape of the dataset: (444, 9)

Getting the info and data types column-wise.


dtypes: float64(2), int64(5), object(2)
memory usage: 31.3+ KB

Observation-1:

The data set contains 444 rows and 9 columns.

In the given data set there are 5 integer-type features, 2 float-type
features and 2 object-type features.

Converting the 'Gender' and 'Transport' columns from object/string
type to integer.
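A minimal sketch of this step, assuming pandas is used and letting pd.Categorical assign the integer codes (the original category labels are not shown in the report):

import pandas as pd

df = pd.read_csv('Transport.csv')

# pd.Categorical assigns integer codes 0, 1, ... to each unique label;
# which label becomes 0 depends on the (unshown) original values.
df['Gender'] = pd.Categorical(df['Gender']).codes
df['Transport'] = pd.Categorical(df['Transport']).codes

print(df.head(10))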
   Age  Gender  Engineer  MBA  Work Exp  Salary  Distance  license  Transport
0   28       1         0    0         4    14.3       3.2        0          1
1   23       0         1    0         4     8.3       3.3        0          1
2   29       1         1    0         7    13.4       4.1        0          1
3   28       0         1    1         5    13.4       4.5        0          1
4   27       1         1    0         4    13.4       4.6        0          1
5   26       1         1    0         4    12.3       4.8        1          1
6   28       1         1    0         5    14.4       5.1        0          0
7   26       0         1    0         3    10.5       5.1        0          1
8   22       1         1    0         1     7.5       5.1        0          1
9   27       1         1    0         4    13.5       5.2        0          1
Performing EDA

EDA-Step 1: Checking for duplicate records in the data

Number of duplicate rows = 0

EDA-Step 2: Checking for missing values.

Are there any missing values?

We can observe there are 0 missing values in the data set, so no
missing-value treatment is required.

Outliers: box plot
Distplot
Histplot
Pair plot
Correlation heatmap

1.2 Split the data into train and test. Is scaling necessary or not?

Let us break the X and y dataframes into a training set and a test
set. For this we will use the sklearn package's data-splitting
function, which splits at random.

Split X and y into training and test sets in a 70:30 ratio.

Scaling matters for distance- and gradient-based models such as KNN,
since features like Salary and Age are on very different scales;
tree-based models are insensitive to it.
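A minimal sketch of the split and of feature scaling, assuming X holds all predictors and y the 'Transport' target; scaling the data this way is an assumption, not a step confirmed by the report:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# All columns except the target act as predictors (names assumed).
X = df.drop('Transport', axis=1)
y = df['Transport']

# 70:30 split; a fixed random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Fit the scaler on the training split only, then apply it to both
# splits, so no test-set information leaks into the scaling.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)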


Invoke the LinearRegression function and find the best-fit model on
the training data.

LinearRegression()
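A minimal sketch of the fit and of printing the coefficients listed below; variable names such as regression_model are illustrative:

from sklearn.linear_model import LinearRegression

# Ordinary least squares on the training data.
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

# One coefficient per predictor, in X's column order, plus the intercept.
for col, coef in zip(X_train.columns, regression_model.coef_):
    print(f'The coefficient for {col} is {coef}')
print(f'The intercept for our model is {regression_model.intercept_}')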

The coefficient for Age is 0.04543297494826039
The coefficient for Gender is 0.19825715816344822
The coefficient for Engineer is -0.04983406779229654
The coefficient for MBA is 0.05963535977202542
The coefficient for Work Exp is -0.02955184857514718
The coefficient for Salary is -0.008628446413510701
The coefficient for Distance is -0.032482366443639825
The coefficient for license is -0.3954409462694837

Let us check the intercept for the model:

The intercept for our model is 0.07345136229447868

R square on training data: 0.34894130957779157
R square on testing data: 0.31501466929614674
RMSE on training data: 0.3791229977315064
RMSE on testing data: 0.3839320749182315

Since this is regression, we plot the predicted y values against the
actual y values for the test data. A good model's predictions lie
close to the actual values, leading to a high R-square.
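A hedged sketch of how the R-square and RMSE figures above and the predicted-vs-actual plot could be produced; it reuses the regression_model fitted in the earlier sketch:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# R-square on both splits via the estimator's own score() method.
print(regression_model.score(X_train, y_train))  # R square on training data
print(regression_model.score(X_test, y_test))    # R square on testing data

# RMSE is the square root of the mean squared error.
print(np.sqrt(mean_squared_error(y_train, regression_model.predict(X_train))))
print(np.sqrt(mean_squared_error(y_test, regression_model.predict(X_test))))

# Predicted vs actual on the test data; points close to the diagonal
# indicate good predictions.
y_pred = regression_model.predict(X_test)
plt.scatter(y_test, y_pred)
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.show()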

# Naive Bayes

Training accuracy: 0.7935483870967742
[[ 51  51]
 [ 13 195]]
              precision    recall  f1-score   support
           0       0.80      0.50      0.61       102
           1       0.79      0.94      0.86       208
    accuracy                           0.79       310
   macro avg       0.79      0.72      0.74       310
weighted avg       0.79      0.79      0.78       310

Test accuracy: 0.7910447761194029
[[22 20]
 [ 8 84]]
              precision    recall  f1-score   support
           0       0.73      0.52      0.61        42
           1       0.81      0.91      0.86        92
    accuracy                           0.79       134
   macro avg       0.77      0.72      0.73       134
weighted avg       0.78      0.79      0.78       134
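The same accuracy / confusion-matrix / classification-report output recurs for every classifier below, so here is a hypothetical evaluate() helper, sketched around GaussianNB, that the later sketches reuse; it is an illustration, not the report's actual code:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit, then print accuracy, confusion matrix and classification
    report for the training and the test split."""
    model.fit(X_tr, y_tr)
    for X_, y_ in ((X_tr, y_tr), (X_te, y_te)):
        pred = model.predict(X_)
        print(accuracy_score(y_, pred))
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
    return model

nb_model = evaluate(GaussianNB(), X_train, y_train, X_test, y_test)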

# KNN

Training accuracy: 0.7161290322580646
              precision    recall  f1-score   support
           0       0.62      0.36      0.46       102
           1       0.74      0.89      0.81       208
    accuracy                           0.72       310
   macro avg       0.68      0.63      0.63       310
weighted avg       0.70      0.72      0.69       310

Test accuracy: 0.5223880597014925
[[ 7 35]
 [29 63]]
              precision    recall  f1-score   support
           0       0.19      0.17      0.18        42
           1       0.64      0.68      0.66        92
    accuracy                           0.52       134
   macro avg       0.42      0.43      0.42       134
weighted avg       0.50      0.52      0.51       134
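A minimal KNN sketch reusing the evaluate() helper; feeding it the scaled features (and the default n_neighbors=5) are assumptions, since the report does not show its KNN settings:

from sklearn.neighbors import KNeighborsClassifier

# KNN is distance-based, so it gets the scaled features; n_neighbors=5
# is the sklearn default, not a value taken from the report.
knn_model = evaluate(KNeighborsClassifier(n_neighbors=5),
                     X_train_scaled, y_train, X_test_scaled, y_test)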

# Linear Discriminant Analysis

Training accuracy: 0.8
[[ 57  45]
 [ 17 191]]
              precision    recall  f1-score   support
           0       0.77      0.56      0.65       102
           1       0.81      0.92      0.86       208
    accuracy                           0.80       310
   macro avg       0.79      0.74      0.75       310
weighted avg       0.80      0.80      0.79       310

Train AUC: 0.834

Test accuracy: 0.8208955223880597
[[26 16]
 [ 8 84]]
              precision    recall  f1-score   support
           0       0.76      0.62      0.68        42
           1       0.84      0.91      0.87        92
    accuracy                           0.82       134
   macro avg       0.80      0.77      0.78       134
weighted avg       0.82      0.82      0.82       134

Test AUC: 0.810
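A sketch of the LDA fit and of the AUC figures, assuming the AUC is computed from the predicted probability of the positive class:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

lda_model = evaluate(LinearDiscriminantAnalysis(),
                     X_train, y_train, X_test, y_test)

# AUC uses the predicted probability of the positive class rather
# than the hard 0/1 labels.
print('the auc %.3f' % roc_auc_score(
    y_train, lda_model.predict_proba(X_train)[:, 1]))
print('the auc %.3f' % roc_auc_score(
    y_test, lda_model.predict_proba(X_test)[:, 1]))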

# Logistic Regression
LogisticRegression(max_iter=10000, n_jobs=2, penalty='none',
                   solver='newton-cg', verbose=True)

Training accuracy: 0.7870967741935484
[[ 58  44]
 [ 22 186]]
              precision    recall  f1-score   support
           0       0.72      0.57      0.64       102
           1       0.81      0.89      0.85       208
    accuracy                           0.79       310
   macro avg       0.77      0.73      0.74       310
weighted avg       0.78      0.79      0.78       310

Test accuracy: 0.8059701492537313
[[25 17]
 [ 9 83]]
              precision    recall  f1-score   support
           0       0.74      0.60      0.66        42
           1       0.83      0.90      0.86        92
    accuracy                           0.81       134
   macro avg       0.78      0.75      0.76       134
weighted avg       0.80      0.81      0.80       134

Test AUC: 0.816
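A sketch of the logistic regression fit via the evaluate() helper; the constructor arguments are copied verbatim from the output above:

from sklearn.linear_model import LogisticRegression

# penalty='none' disables regularisation in scikit-learn versions
# before 1.2; newer versions spell this penalty=None.
log_model = evaluate(
    LogisticRegression(max_iter=10000, n_jobs=2, penalty='none',
                       solver='newton-cg', verbose=True),
    X_train, y_train, X_test, y_test)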
# DecisionTreeClassifier

Training accuracy: 1.0
[[102   0]
 [  0 208]]
              precision    recall  f1-score   support
           0       1.00      1.00      1.00       102
           1       1.00      1.00      1.00       208
    accuracy                           1.00       310
   macro avg       1.00      1.00      1.00       310
weighted avg       1.00      1.00      1.00       310

Train AUC: 1.000

Test accuracy: 0.8134328358208955
[[29 13]
 [12 80]]
              precision    recall  f1-score   support
           0       0.71      0.69      0.70        42
           1       0.86      0.87      0.86        92
    accuracy                           0.81       134
   macro avg       0.78      0.78      0.78       134
weighted avg       0.81      0.81      0.81       134

Test AUC: 0.861
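The perfect 1.0 training accuracy against 0.81 test accuracy suggests the unconstrained tree overfits the training data. A sketch of both the unconstrained fit and an illustrative depth-limited variant (max_depth=5 is an arbitrary choice, not a value from the report):

from sklearn.tree import DecisionTreeClassifier

# The unconstrained tree reproduces the perfect training fit above.
dt_model = evaluate(DecisionTreeClassifier(random_state=1),
                    X_train, y_train, X_test, y_test)

# A depth-limited variant trades training accuracy for less overfitting.
dt_pruned = evaluate(DecisionTreeClassifier(max_depth=5, random_state=1),
                     X_train, y_train, X_test, y_test)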
# AdaBoost
AdaBoostClassifier(n_estimators=100, random_state=1)

Training accuracy: 0.8838709677419355
[[ 77  25]
 [ 11 197]]
              precision    recall  f1-score   support
           0       0.88      0.75      0.81       102
           1       0.89      0.95      0.92       208
    accuracy                           0.88       310
   macro avg       0.88      0.85      0.86       310
weighted avg       0.88      0.88      0.88       310

Train AUC: 0.959

Test accuracy: 0.7910447761194029
[[22 20]
 [ 8 84]]
              precision    recall  f1-score   support
           0       0.73      0.52      0.61        42
           1       0.81      0.91      0.86        92
    accuracy                           0.79       134
   macro avg       0.77      0.72      0.73       134
weighted avg       0.78      0.79      0.78       134

Test AUC: 0.959
# Gradient Boosting
GradientBoostingClassifier(random_state=1)

Training accuracy: 0.967741935483871
[[ 51  51]
 [ 13 195]]
              precision    recall  f1-score   support
           0       0.99      0.91      0.95       102
           1       0.96      1.00      0.98       208
    accuracy                           0.97       310
   macro avg       0.97      0.95      0.96       310
weighted avg       0.97      0.97      0.97       310

Train AUC: 0.998

Test accuracy: 0.7686567164179104
[[22 20]
 [ 8 84]]
              precision    recall  f1-score   support
           0       0.73      0.52      0.61        42
           1       0.81      0.91      0.86        92
    accuracy                           0.79       134
   macro avg       0.77      0.72      0.73       134
weighted avg       0.78      0.79      0.78       134

Test AUC: 0.815

# RandomForestClassifier
RandomForestClassifier(random_state=1)

Training accuracy: 1.0
[[ 51  51]
 [ 13 195]]
              precision    recall  f1-score   support
           0       1.00      1.00      1.00       102
           1       1.00      1.00      1.00       208
    accuracy                           1.00       310
   macro avg       1.00      1.00      1.00       310
weighted avg       1.00      1.00      1.00       310

Train AUC: 1.000

Test accuracy: 0.8059701492537313
[[22 20]
 [ 8 84]]
              precision    recall  f1-score   support
           0       0.73      0.52      0.61        42
           1       0.81      0.91      0.86        92
    accuracy                           0.79       134
   macro avg       0.77      0.72      0.73       134
weighted avg       0.78      0.79      0.78       134

Test AUC: 0.829
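A sketch that fits the three ensembles with the constructors printed above and then tabulates test accuracy for all models side by side, which is the comparison the problem statement calls for; the evaluate() helper and model variables come from the earlier sketches:

import pandas as pd
from sklearn.ensemble import (AdaBoostClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score

# Constructors as printed in the corresponding sections above.
ada_model = evaluate(AdaBoostClassifier(n_estimators=100, random_state=1),
                     X_train, y_train, X_test, y_test)
gbm_model = evaluate(GradientBoostingClassifier(random_state=1),
                     X_train, y_train, X_test, y_test)
rf_model = evaluate(RandomForestClassifier(random_state=1),
                    X_train, y_train, X_test, y_test)

# Test accuracy side by side; KNN is scored on the scaled features it
# was fitted on.
fitted = {'Naive Bayes': (nb_model, X_test),
          'KNN': (knn_model, X_test_scaled),
          'LDA': (lda_model, X_test),
          'Logistic Regression': (log_model, X_test),
          'Decision Tree': (dt_model, X_test),
          'AdaBoost': (ada_model, X_test),
          'Gradient Boosting': (gbm_model, X_test),
          'Random Forest': (rf_model, X_test)}
summary = pd.Series({name: accuracy_score(y_test, m.predict(X_))
                     for name, (m, X_) in fitted.items()},
                    name='Test accuracy')
print(summary.sort_values(ascending=False))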
Problem 2: A dataset of Shark Tank episodes is made available. It
contains 495 entrepreneurs making their pitch to the VC sharks.

Importing Libraries.

Importing Data.

Checking for missing values.

Are there any missing values?

We can observe there are 100 missing values in the data. Missing-value
treatment will be done.

2.1 Pick out the Deal (Dependent Variable) and Description columns into a
separate data frame.

      deal                                        description
1     True  Retail and wholesale pie factory with two reta...
2     True  Ava the Elephant is a godsend for frazzled par...
3    False  Organizing, packing, and moving services deliv...
4    False  Interactive media centers for healthcare waiti...
5     True  One of the first entrepreneurs to pitch on Sha...
...    ...                                                ...
490   True  Zoom Interiors is a virtual service for interi...
491   True  Spikeball started out as a casual outdoors gam...
492   True  Shark Wheel is out to literally reinvent the w...
493  False  Adriana Montano wants to open the first Cat Ca...
494   True  Sway Motorsports makes a three-wheeled, all-el...

387 rows × 2 columns
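A minimal pandas sketch of this selection; the file name, column names and the use of dropna() for the missing-value treatment are assumptions:

import pandas as pd

# Pull only the dependent variable and the pitch text into one frame.
shark = pd.read_csv('shark_tank.csv')
df2 = shark[['deal', 'description']].dropna()
print(df2)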

2.2 Create two corpora, one with those who secured a Deal, the other with
those who did not secure a deal.

Getting the info:

 0   deal         204 non-null   object
 1   description  204 non-null   object
dtypes: object(2)
memory usage: 4.8+ KB

True Corpus 50302
False Corpus 34899

 0   description  204 non-null   object
 1   chars        204 non-null   object
dtypes: object(2)
memory usage: 4.8+ KB
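A sketch of building the two corpora by joining the descriptions per outcome; reading the printed corpus sizes as character counts is an assumption:

# One corpus per outcome, built by joining the pitch descriptions.
true_corpus = ' '.join(df2.loc[df2['deal'] == True, 'description'])
false_corpus = ' '.join(df2.loc[df2['deal'] == False, 'description'])

# len() gives the corpus size in characters.
print('True Corpus', len(true_corpus))
print('False Corpus', len(false_corpus))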

We are importing the nltk library to use inaugural.fileids().

Print true words.
Print false words.
Word cloud – True (secured a deal)
Word cloud – False (did not secure a deal)
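A hedged sketch of generating the two word clouds with the wordcloud package, removing common English stop words via nltk (styling choices are arbitrary):

import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from wordcloud import WordCloud

# Requires a one-time nltk.download('stopwords').
stop_words = set(stopwords.words('english'))

for corpus, title in ((true_corpus, 'Secured a deal'),
                      (false_corpus, 'Did not secure a deal')):
    cloud = WordCloud(stopwords=stop_words, background_color='white',
                      width=800, height=400).generate(corpus)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()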
Q4: Refer to both the word clouds. What do you infer?

The 'secured a deal' word cloud contains words such as 'one',
'design', 'free', 'children', 'offer', 'easy', 'online' and 'use'.
These indicate that pitches catering to children, providing offers or
a free sample/product, easy to use, well designed and creatively
unique are more likely to secure a deal.

The 'did not secure a deal' word cloud contains words such as 'one',
'designed', 'help', 'device', 'bottle', 'premium' and 'use'. These
indicate that pitches with a mediocre design, less suited to solving a
problem, involving water bottles, carrying a premium price tag or
offering less usability are less likely to secure a deal.

It is also observed that words such as 'one', 'designed', 'system' and
'use' carry high weight in both word clouds. This indicates that these
were either not defining factors in whether a deal was made, or were
used in different contexts in the descriptions in each scenario.

Q5: Looking at the word clouds, is it true that the entrepreneurs who
introduced devices are less likely to secure a deal, based on your
analysis?

The word 'device' is not easily found in the 'secured a deal' word
cloud, while it is easily spotted in the 'did not secure a deal' word
cloud. This indicates that the word 'device' occurred frequently when
a deal was rejected, implying that the statement given in the question
is true.
