Machine Learning


Uploaded by

ARNAB CHOWDHURY.

02/28/2023

Machine Learning
Project

Contents:
Part 1: Machine Learning Models:
You work for an office transport company that is in discussions with ABC Consulting about
providing transport for its employees. You are tasked with understanding how the employees
of ABC Consulting currently prefer to commute between home and office. Using parameters such
as age, salary and work experience from the data set ‘Transport.csv’, you are required
to predict the preferred mode of transport. The project requires you to build several Machine Learning
models and compare them so that one model can be finalised.

Data Dictionary:
Age: Age of the Employee in Years
Gender: Gender of the Employee
Engineer: For Engineer =1, Non-Engineer =0
MBA: For MBA =1, Non-MBA =0
Work Exp: Experience in years
Salary: Salary in Lakhs per Annum
Distance: Distance in Kms from Home to Office
license: 1 if the Employee has a Driving Licence, 0 if not
Transport: Mode of Transport
The objective is to build various Machine Learning models on this data set and, based on the accuracy
metrics, decide which model to finalise for predicting the mode of transport chosen by the
employee.

Questions:
1. Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
2. Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
3. Build the following models on the 70% training data and check the performance of these models on
the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and
plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum
performance.:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient boost.
4. Which model performs the best?
5. What are your business insights?

Part 2: Text Mining


A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to
the VC sharks.

You will ONLY use “Description” column for the initial text mining exercise.

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

Questions:
1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
2. Create two corpora, one with those who secured a Deal, the other with those who did not secure a
deal.
3. The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
d) Plot the Word Cloud for both the corpora.
4. Refer to both the word clouds. What do you infer?
5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to
secure a deal based on your analysis?


Solutions:
Problem 1:

1.1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
Ans.:

• Data frame Top 10 rows:


• Data frame info:

#The dataset has a total of 444 rows and 9 columns. Of the 9 columns, 2 are float type, 5 are integer type
and 2 are object type.

• Data frame summary:

• Null value check:

The data frame contains no NULL values.


• Zero-value check:

The zero (0) values are valid and need no treatment: a 0 in Engineer or MBA marks an employee without
that qualification, a 0 in Work Exp marks a fresher, and a 0 in license marks an employee without a
driving licence.

• Duplicate value check:

The data frame contains no duplicate rows.

##From the above analysis I can see that:

• The dataset has 444 rows and 9 columns.
• 2 columns are categorical and 7 columns are numeric.
• There are no null values in the dataset.
• There are no duplicate rows in the dataset.
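
The checks summarised above can be sketched in pandas roughly as follows. The small data frame is a hypothetical stand-in that only mirrors the column names from the data dictionary; the real analysis would read 'Transport.csv' instead, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the data set (assumption); the report reads 'Transport.csv'.
df = pd.DataFrame({
    "Age": [28, 23, 29, 28],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Engineer": [0, 1, 1, 1],
    "MBA": [0, 0, 0, 1],
    "Work Exp": [4, 4, 7, 5],
    "Salary": [14.3, 8.3, 13.4, 13.4],
    "Distance": [3.2, 3.3, 4.1, 4.5],
    "license": [0, 0, 0, 0],
    "Transport": ["Public Transport"] * 4,
})

n_rows, n_cols = df.shape                         # basic size of the frame
null_counts = df.isnull().sum()                   # missing values per column
n_duplicates = int(df.duplicated().sum())         # fully repeated rows
zero_counts = (df.select_dtypes(include="number") == 0).sum()   # zeros per numeric column
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols, n_duplicates, categorical)
```

On the real data the same calls would be expected to report 444 rows, 9 columns, no nulls and no duplicates, as stated above.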

#Univariate analysis for numerical data (Histplot & Boxplot):


#Univariate analysis for categorical data:

#Bivariate analysis:


#From the above analysis we can say that employees in the 23 to 31 age group use public transport.

#Males use public transport more than females.

#People with less work experience use public transport more.

#People with lower salaries use public transport more, and people who travel a long distance use
private transport.


#Multi-variate analysis: Pairplot (pairwise relationships between variables):

# Multi-variate analysis: Heatmap (Check for presence of correlations):


#Both the Pairplot and the Heatmap show correlations between ‘Age’ and ‘Work Exp’, ‘Age’ and
‘Salary’, and ‘Salary’ and ‘Work Exp’.

#Unique values for categorical variables:

#Converting categorical to dummy variables:

Gender: Transport:

#Sample data set after data-encoding:

The univariate analysis above clearly shows outliers in the data set, so they need to be
treated.

Function to treat outliers: “remove_outlier”
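
The body of "remove_outlier" is not shown in the report. A minimal sketch of what such a function might look like, assuming the common 1.5 x IQR rule and assuming the outliers are capped rather than dropped (the later split still uses all 444 records), is given below. The salary values are made up for illustration.

```python
import numpy as np

def remove_outlier(values):
    """Return the (lower, upper) IQR fences for a 1-D numeric sequence.

    Assumption: fences are Q1 - 1.5*IQR and Q3 + 1.5*IQR.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical salary values with one extreme point.
salary = np.array([8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 57.0])

lower, upper = remove_outlier(salary)
capped = np.clip(salary, lower, upper)   # pull values outside the fences back to the fence

print(lower, upper, capped)
```

Capping keeps the row count unchanged, which is one reason to prefer it over deleting rows on a data set this small.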


Outliers check after treatment:

1.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?

Ans.: The 444 records were split in the ratio 70:30:

The train part has 310 entries after the split.

The test part has 134 entries after the split.
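
The split sizes above can be reproduced with scikit-learn as sketched below. The feature matrix is random placeholder data with 444 rows, and random_state=1 is an assumption. The scaling step is optional and shown only as an illustration: standardising generally matters for distance-based models such as KNN and much less for tree-based models, and the scaler should be fitted on the training part only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data with the same number of rows as the report (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(444, 3))
y = rng.integers(0, 2, size=444)

# 70:30 split; a test fraction of 0.30 of 444 rows rounds up to 134 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(len(X_train), len(X_test))
```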


Checking Multicollinearity using Variance Inflation Factor (VIF):

#Variance Inflation Factor (VIF) is one way to check whether the independent variables are
correlated with each other. Correlated predictors are not ideal for linear models such as regression
because they inflate the standard errors of the coefficients, which makes the estimated
parameters unreliable and harder to interpret.

##General rule of thumb: a VIF of 1 means there is no multicollinearity. A VIF between
about 5 and 10 indicates moderate multicollinearity. A VIF of 10 or more indicates high
multicollinearity.

##From the above I can conclude that the variables are moderately correlated.
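
For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it directly with NumPy on synthetic data; the column names and the way 'work_exp' is tied to 'age' are assumptions chosen only to mimic the correlation seen in the heatmap.

```python
import numpy as np

def vif(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])   # intercept + other predictors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic predictors (assumption): work experience tracks age, distance is independent.
rng = np.random.default_rng(0)
age = rng.normal(28, 4, 300)
work_exp = age - 22 + rng.normal(0, 1, 300)
distance = rng.normal(11, 3, 300)

scores = vif(np.column_stack([age, work_exp, distance]))
print(scores)
```

In this toy setup the two linked columns receive a large VIF while the independent column stays close to 1.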

1.3 Build the following models on the 70% training data and check the performance of these models on
the Training as well as the 30% Test data using the various inferences from the Confusion Matrix
and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for
optimum performance.:

a. Logistic Regression Model:

• Logistic Regression Model:
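
The evaluation pattern used for each model (confusion matrix and AUC on both the train and the test part) can be sketched as below. The data come from scikit-learn's make_classification as a stand-in for the encoded transport data, and the settings max_iter=1000 and random_state=1 are assumptions, not the report's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with 444 rows (assumption), split 70:30.
X, y = make_classification(n_samples=444, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

results = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)                 # hard class labels
    prob = model.predict_proba(X_part)[:, 1]     # probability of the positive class
    results[name] = {
        "confusion": confusion_matrix(y_part, pred),
        "auc": roc_auc_score(y_part, prob),
    }

print(results["test"]["confusion"], results["test"]["auc"])
```

The same loop works for the other classifiers in this section, since they expose the same fit, predict and predict_proba methods.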


• Tuned Logistic Regression Model:
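
The report does not show how the models were tuned. One common approach, given here only as an assumed sketch, is a cross-validated grid search; the parameter grid, cv=5 and the 'roc_auc' scoring choice are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), split 70:30.
X, y = make_classification(n_samples=444, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hypothetical grid over the regularisation strength C.
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="roc_auc"
)
search.fit(X_train, y_train)          # cross-validation happens on the training part only

best_model = search.best_estimator_   # refitted on the full training part
test_accuracy = best_model.score(X_test, y_test)

print(search.best_params_, test_accuracy)
```

Keeping the test part out of the search means the reported test metrics remain an honest check of the tuned model.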


b. Linear Discriminant Analysis:

Linear Discriminant Model:


Tuned Linear Discriminant Model:


c. Decision Tree Classifier – CART model:

Decision Tree Classifier – CART model :


Tuned Decision Tree Classifier – CART model :


d. Naïve Bayes Model:


e. KNN Model:

KNN Model:


Tuned KNN Model :


f. Random Forest Model:

Random Forest Model:


Tuned Random Forest Model:


g. Boosting Classifier Model using Gradient boost:


1.4 Which model performs the best?

Ans.:


• The pie chart and bar chart clearly show that the Random Forest model gave the best
results of all the models on both the train and the test data.

1.5 What are your business insights?

Ans.: Comparing all the performance metrics, the Random Forest model performs best. Some
other models perform almost as well, but the Random Forest model is very consistent
when the train and test results are compared with each other, and its Recall, AUC score
and AUC-ROC curve are also good.

The patterns behind the mode of transport chosen by the employee are:

# Employees in the 23 to 31 age group use public transport.

# Males use public transport more than females.

# People with less work experience use public transport more.

# People with lower salaries use public transport more.

# People who travel a long distance use private transport.


Problem 2:


2.1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.

Ans.:

• Data frame Top 5 rows:


• Data frame info:

• Null value check:

There are some null values, which need to be dropped before further analysis:


# The Deal (Dependent Variable) and Description columns were picked out into a separate data frame,
which looks like this:

2.2 Create two corpora, one with those who secured a Deal, the other with those who did not secure a
deal.

Ans.: Two data frames were created. One holds the rows for those who secured the deal:

The other holds the rows for those who did not secure the deal:


2.3. The following exercise is to be done for both the corpora:


a) Find the number of characters for both the corpuses.
Ans.: Total number of characters in True Corpus: 50302
Total number of characters in False Corpus: 34899
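
A minimal pandas sketch of how the two corpora could be built and their character counts obtained is shown below. The three rows are invented for illustration, and it is assumed that the Deal column holds booleans, as the 'True Corpus' and 'False Corpus' labels suggest.

```python
import pandas as pd

# Hypothetical rows standing in for the Shark Tank data (assumption).
df = pd.DataFrame({
    "Deal": [True, False, True],
    "Description": ["Easy to use", "A device", "Online gifts"],
})

# Split the descriptions into the two corpora by deal outcome.
true_corpus = df.loc[df["Deal"], "Description"]
false_corpus = df.loc[~df["Deal"], "Description"]

# Total number of characters in each corpus (spaces included).
n_true = int(true_corpus.str.len().sum())
n_false = int(false_corpus.str.len().sum())

print(n_true, n_false)
```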

b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’
and ‘company’ are to be removed)

Ans.: Before the stop words were removed, punctuation was removed, the text was converted to lower
case and stemming was performed, so that the text of both corpora was cleaned more thoroughly.
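
The cleaning steps described above can be sketched with the standard library alone. The stop-word set here is a small illustrative one: the seven words named in the question plus a handful of common English words. The real analysis presumably used a full stop-word list and a stemmer from a text-mining library, and stemming is left out of this sketch.

```python
import string

# Illustrative stop-word set (assumption): the seven words from the question
# plus a few common English words.
STOP_WORDS = {
    "also", "made", "makes", "like", "this", "even", "company",
    "a", "an", "the", "that", "and", "to", "is", "it", "for",
}

def clean(text):
    """Lower-case the text, strip ASCII punctuation and drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = clean("This company makes a bottle that is easy to use, even for kids!")
print(tokens)
```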

• Word cloud in True Corpus after removing stop words:

• Word cloud in False Corpus after removing stop words:


c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?

Ans.: Top 3 most frequently occurring words in the True corpus (secured a deal): [('products', 18),
('easy', 16), ('make', 16)]

Top 3 most frequently occurring words in the False corpus (did not secure a deal): [('make', 15),
('bottle', 14), ('products', 12)]
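
Lists in this (word, count) format are what collections.Counter produces. A small sketch on a made-up token list, not the real corpus:

```python
from collections import Counter

# Hypothetical cleaned tokens (assumption); the real input is a cleaned corpus.
tokens = ["products", "make", "products", "easy", "make", "products"]

top3 = Counter(tokens).most_common(3)   # list of (word, count), highest count first
print(top3)
```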

d) Plot the Word Cloud for both the corpora.

• Word Cloud for the True corpus (Secured a deal):


• Word Cloud for the False corpus (Did not secure a deal):

2.4 Refer to both the word clouds. What do you infer?


Ans.: The 'secured a deal' word cloud contains words such as 'one', 'make', 'children', 'offer',
'easy', 'online' and 'use'. These suggest that pitches that cater to children, provide offers
or a free sample/product, are easy to use, have a good design and are unique in their
creativity are more likely to secure a deal.

The 'did not secure a deal' word cloud contains words such as 'one', 'designed',
'help', 'device', 'bottle', 'service' and 'use'. These suggest that pitches with a mediocre design,
less suited to solving a problem, involving water bottles, carrying a higher, premium
price tag or offering less usability are less likely to secure a deal.

It is also observed that words such as 'one', 'product', 'system' and 'use' carry a high weight in
both word clouds. This indicates either that these words were not defining factors in
whether a deal was made, or that they were used in a different context in the descriptions
of each corpus.


2.5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less
likely to secure a deal based on your analysis?

Ans.: The word 'device' is not easily found in the 'secured a deal' word cloud, while it is easily spotted in
the 'not secured a deal' word cloud. This indicates that the word 'device' occurred more frequently
in pitches that were rejected, which implies that the statement given in the question is true.
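
Because the two corpora differ in size (50302 versus 34899 characters), a word's visual prominence in the word clouds is not directly comparable. One way to check the conclusion numerically, sketched here on invented descriptions, is to compare the share of descriptions in each corpus that mention the word. The function name and the sample texts are hypothetical.

```python
def share_mentioning(word, descriptions):
    """Fraction of descriptions containing `word` as a whole lower-cased token."""
    hits = sum(word in text.lower().split() for text in descriptions)
    return hits / len(descriptions)

# Hypothetical descriptions (assumption), not taken from the Shark Tank data.
deal = ["an easy online service", "a device for kids", "toys for children", "easy snacks"]
no_deal = ["a device to track sleep", "a bottle device", "a water bottle"]

share_deal = share_mentioning("device", deal)
share_no_deal = share_mentioning("device", no_deal)
print(share_deal, share_no_deal)
```

A clearly higher share in the 'did not secure a deal' corpus on the real data would support the answer above more firmly than the word clouds alone.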
