Machine Learning Project
Contents:
Part 1: Machine Learning Models:
You work for an office transport company. You are in discussions with ABC Consulting for providing transport to their employees. For this purpose, you are tasked with understanding how the employees of ABC Consulting currently commute between home and office. Based on parameters such as age, salary and work experience given in the data set ‘Transport.csv’, you are required to predict the preferred mode of transport. The project requires you to build several Machine Learning models and compare them so that a final model can be chosen.
Data Dictionary:
Age: Age of the employee in years
Gender: Gender of the employee
Engineer: 1 if the employee is an Engineer, 0 otherwise
MBA: 1 if the employee holds an MBA, 0 otherwise
Work Exp: Work experience in years
Salary: Salary in lakhs per annum
Distance: Distance from home to office in km
license: 1 if the employee has a driving licence, 0 otherwise
Transport: Mode of transport
The objective is to build various Machine Learning models on this data set and, based on the accuracy metrics, decide which model should be finalised for predicting the mode of transport chosen by the employees.
Questions:
1. Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
2. Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
3. Build the following models on the 70% training data and check the performance of these models on
the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and
plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum
performance:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient boost.
4. Which model performs the best?
5. What are your business insights?
Part 2: Text Mining:
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to the VC sharks.
You will ONLY use the “Description” column for the initial text mining exercise.
Questions:
1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
2. Create two corpora, one with those who secured a Deal, the other with those who did not secure a
deal.
3. The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
d) Plot the Word Cloud for both the corpora.
4. Refer to both the word clouds. What do you infer?
5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to
secure a deal based on your analysis?
Solutions:
Problem 1:
1. Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
2. Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
3. Build the following models on the 70% training data and check the performance of these models
on the Training as well as the 30% Test data using the various inferences from the Confusion
Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever
required for optimum performance:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient boost.
4. Which model performs the best?
5. What are your business insights?
1.1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
Ans.:
# There are a total of 444 rows and 9 columns in the dataset. Of the 9 columns, 2 are of float type, 5 are of integer type and 2 are of object type.
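A minimal sketch of the initial inspection that produces this summary (assuming pandas and the file ‘Transport.csv’):

import pandas as pd

df = pd.read_csv("Transport.csv")   # file name as given in the problem statement

print(df.shape)                     # number of rows and columns
df.info()                           # column data types and non-null counts
print(df.describe())                # basic descriptive statistics of the numeric columns
print(df.isnull().sum())            # missing values per column
print(df.duplicated().sum())        # duplicate rows, if any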
Some employees are engineers but not MBAs (or vice versa), so zero values in the Engineer and MBA columns are valid. Similarly, a Work Exp of zero is valid for freshers, and a zero in the license column simply means the employee does not hold a driving licence.
#Bivariate analysis:
# From the above analysis we can say that employees in the age group of 23 to 31 years mostly use public transport.
# Employees with lower salaries use public transport more, while employees travelling longer distances tend to use private transport.
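A sketch of the bivariate plots behind these observations (assuming seaborn and the data frame df loaded above):

import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of the key numeric variables split by the target variable Transport
for col in ["Age", "Salary", "Distance"]:       # column names as per the data dictionary
    sns.boxplot(x="Transport", y=col, data=df)
    plt.title(col + " by mode of transport")
    plt.show()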
# From the analysis of both the pairplot and the heatmap we can see correlations between ‘Age’ and ‘Work Exp’, ‘Age’ and ‘Salary’, and ‘Salary’ and ‘Work Exp’.
[Figures: distributions of Gender and Transport]
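The pairplot and heatmap referred to above could be reproduced along these lines (a sketch, assuming seaborn and the same column names):

import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, hue="Transport")                # pairwise relationships between the numeric variables
plt.show()

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")   # correlation matrix
plt.title("Correlation heatmap")
plt.show()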
From the univariate analysis done above we can clearly see that outliers are present in the dataset, so we need to treat them.
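One common way to treat these outliers is IQR-based capping; the report does not state the exact method used, so the sketch below is illustrative (column names as per the data dictionary):

# IQR-based capping of extreme values
def cap_outliers(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

for col in ["Age", "Work Exp", "Salary", "Distance"]:
    df[col] = cap_outliers(df[col])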
1.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
Ans.: The 444 records have been split into training and test sets in the ratio 70:30.
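A minimal sketch of how the 70:30 split can be done (assuming ‘Transport’ is the target column and df is the cleaned data frame from above). Scaling mainly matters for distance-based models such as KNN, whereas tree-based models are insensitive to feature scale:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

X = pd.get_dummies(df.drop("Transport", axis=1), drop_first=True)   # encode the categorical predictors
y = df["Transport"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

scaler = StandardScaler()                         # scaling matters for KNN and other distance-based models
X_train_scaled = scaler.fit_transform(X_train)    # fit only on the training data
X_test_scaled = scaler.transform(X_test)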
# Variance Inflation Factor (VIF) is one of the methods to check whether the independent variables are correlated with each other. If they are, this is not ideal for linear models, as correlated predictors inflate the standard errors, which in turn affects the estimated coefficients. As a result, the regression model becomes unreliable and hard to interpret.
## General rule of thumb: a VIF equal to 1 means there is no multicollinearity; values around 5 or above indicate moderate multicollinearity; a VIF of 10 or more indicates high multicollinearity.
## From the above we can conclude that the variables show moderate multicollinearity.
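The VIF values referred to above can be computed along these lines (a sketch, assuming statsmodels and the encoded predictor matrix X from the split step):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values.astype(float), i) for i in range(X.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))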
1.3 Build the following models on the 70% training data and check the performance of these models on
the Training as well as the 30% Test data using the various inferences from the Confusion Matrix
and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for
optimum performance:
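As a hedged illustration, the build-and-evaluate pattern for these models might look as follows (shown for Logistic Regression; LDA, CART, Naive Bayes, KNN, Random Forest and Gradient Boosting are fitted and evaluated the same way with a different estimator). The binarisation of the target as 1 = ‘Public Transport’ and the variable names from the earlier steps are assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay

# Binary target assumed for illustration: 1 = uses public transport, 0 = otherwise
y_train_bin = (y_train == "Public Transport").astype(int)
y_test_bin = (y_test == "Public Transport").astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train_bin)

for name, X_, y_ in [("Train", X_train_scaled, y_train_bin), ("Test", X_test_scaled, y_test_bin)]:
    pred = model.predict(X_)
    prob = model.predict_proba(X_)[:, 1]          # probability of the positive class
    print("---", name, "---")
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print("AUC:", roc_auc_score(y_, prob))
    RocCurveDisplay.from_predictions(y_, prob)    # AUC-ROC curve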
e. KNN Model:
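A sketch of how the KNN model could be built and tuned over the number of neighbours k (using the scaled data and binarised target assumed above):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_neighbors": range(3, 21, 2)}     # odd values of k only
knn = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="roc_auc")
knn.fit(X_train_scaled, y_train_bin)

print("Best k:", knn.best_params_)
print("Train AUC:", knn.score(X_train_scaled, y_train_bin))
print("Test AUC:", knn.score(X_test_scaled, y_test_bin))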
1.4 Which model performs the best?
Ans.:
From the pie chart and bar chart we can clearly see that, on both the training and the test data, the Random Forest model worked well and gave the best result among all the models.
Comparing all the performance metrics, the Random Forest model performs best. Some other models perform almost as well, but Random Forest is very consistent when the train and test results are compared with each other. Its other metrics, such as the recall value, AUC score and AUC-ROC curve, are also very good.
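The side-by-side comparison behind this conclusion could be assembled along these lines (a sketch; the dictionary of fitted models is assumed and should be extended with all the models built above):

import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

fitted_models = {"Logistic Regression": model, "KNN": knn}   # extend with the other fitted models

rows = []
for name, m in fitted_models.items():
    pred = m.predict(X_test_scaled)
    prob = m.predict_proba(X_test_scaled)[:, 1]
    rows.append({"Model": name,
                 "Accuracy": accuracy_score(y_test_bin, pred),
                 "Recall": recall_score(y_test_bin, pred),
                 "AUC": roc_auc_score(y_test_bin, prob)})

print(pd.DataFrame(rows).sort_values("AUC", ascending=False))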
Problem 2:
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to
the VC sharks.
You will ONLY use the “Description” column for the initial text mining exercise.
1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
2. Create two corpora, one with those who secured a Deal, the other with those who did not secure a
deal.
3. The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
d) Plot the Word Cloud for both the corpora.
4. Refer to both the word clouds. What do you infer?
5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely
to secure a deal based on your analysis?
2.1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
Ans.:
# The Deal (dependent variable) and Description columns were picked out into a separate data frame; the new data frame contains just these two columns.
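A sketch of how this selection could be done (the file name and the exact column names 'deal' and 'description' are assumptions):

import pandas as pd

shark = pd.read_csv("Shark Tank Companies.csv")   # file name assumed
df_text = shark[["deal", "description"]].copy()   # dependent variable and pitch description
print(df_text.head())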
2.2 Create two corpora, one with those who secured a Deal, the other with those who did not secure a
deal.
Ans.: Two corpora were created, one from the descriptions of the pitches that secured a deal and another from the descriptions of the pitches that did not secure a deal.
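A sketch of how the two corpora could be built from the data frame above (the boolean 'deal' column is an assumption):

deal_yes = df_text[df_text["deal"] == True]["description"]    # pitches that secured a deal
deal_no = df_text[df_text["deal"] == False]["description"]    # pitches that did not

corpus_yes = " ".join(deal_yes)
corpus_no = " ".join(deal_no)

print(len(corpus_yes), len(corpus_no))            # number of characters in each corpus (question 3a)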
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’
and ‘company’ are to be removed)
Ans.: Before removing the stop words, punctuation is removed, the text is converted to lower case, and stemming is performed to better clean the text in both corpora.
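A sketch of this cleaning pipeline using NLTK, including the extra stop words listed in the question:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stop_words.update({"also", "made", "makes", "like", "this", "even", "company"})   # extra words from the question
stemmer = PorterStemmer()

def clean(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())              # lower case, drop punctuation and digits
    tokens = [w for w in text.split() if w not in stop_words]  # remove stop words
    return [stemmer.stem(w) for w in tokens]                   # stem the remaining words

tokens_yes = clean(corpus_yes)
tokens_no = clean(corpus_no)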
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
Ans.: Top 3 most frequently occurring words in the 'secured a deal' corpus: [('products', 18), ('easy', 16), ('make', 16)]
Top 3 most frequently occurring words in the 'did not secure a deal' corpus: [('make', 15), ('bottle', 14), ('products', 12)]
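A sketch of the frequency count behind these figures (the exact counts depend on the cleaning choices above):

from collections import Counter

print(Counter(tokens_yes).most_common(3))   # secured a deal
print(Counter(tokens_no).most_common(3))    # did not secure a deal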
Word Cloud for the 'did not secure a deal' corpus:
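A minimal sketch of how such a word cloud can be generated (assuming the wordcloud package and the cleaned tokens from above):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

wc = WordCloud(background_color="white", width=800, height=400)
wc.generate(" ".join(tokens_no))             # word cloud for the 'did not secure a deal' corpus

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()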
The 'did not secure a deal' word cloud contains words such as 'one', 'designed', 'help', 'device', 'bottle', 'service' and 'use'. These indicate that pitches with a mediocre design, products that are less suited to solving a problem, products involving water bottles, a higher or premium price tag, or low usability are less likely to secure a deal.
It is also observed that words such as 'one', 'product', 'system' and 'use' carry a higher weight in both word clouds. This indicates that either these words were not the defining factors in whether a deal was made, or they were used in a different context in each scenario.
2.5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less
likely to secure a deal based on your analysis?
Ans.: The word 'device' is not easily found in the 'secured a deal' word cloud, while it is easily spotted in the 'did not secure a deal' word cloud. This indicates that the word 'device' occurred more frequently in pitches that did not secure a deal, so, based on this analysis, entrepreneurs who introduced devices do appear less likely to secure a deal.