Project - Machine Learning (E)
Anil Ulchala
3/13/2022
Contents
1. Action Required
2. Problem Statement
2.1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers
and missing values treatment (if necessary) and check the basic descriptive statistics of the
dataset.
2.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
2.3 Build the following models on the 70% training data and check the performance of these
models on the Training as well as the 30% Test data using the various inferences from the
Confusion Matrix and plotting a AUC-ROC curve along with the AUC values. Tune the models
wherever required for optimum performance.
2.4 Which model performs the best?
2.5 What are your business insights?
3. Problem Statement
3.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
3.1.2.1 Secure Deal data frame
3.1.2.2 Not Secure Deal data frame
List of Tables:
Table 1 – Comparison between various models
This Business Report is generated based on the Data set extracted from reliable sources
Business Case
1. Action Required:
The purpose of this exercise is to explore the data set through exploratory data analysis,
build Machine Learning models using the Naive Bayes and KNN techniques, and compare the
models. Insights and recommendations are then provided based on the output. In addition,
NLP techniques are applied to a second data set containing descriptions of entrepreneurs on
the Shark Tank show, and word clouds are created for secured and non-secured deals.
2. Problem Statement:
You work for an office transport company. You are in discussions with ABC Consulting company
about providing transport for their employees. For this purpose, you are tasked with understanding
how the employees of ABC Consulting currently prefer to commute (between home and
office). Based on parameters such as age, salary and work experience given in the data set
‘Transport.csv’, you are required to predict the preferred mode of transport. The project
requires you to build several Machine Learning models and compare them so that a final model
can be selected.
2.1.1. The data set is now ready for Exploratory Data Analysis. The output shape of the
data set is (444, 9), so the data set has 444 rows and 9 columns. Refer to the figure
below.
2.1.2. There are no null values in the data set. Refer to the figure below.
2.1.3. From the problem description it is clear that ‘Transport’ is the dependent (target)
variable and the remaining variables are independent variables. Going forward, the
report uses the terms dependent and independent variable to refer to the columns.
Proportions are then calculated for the categorical variables.
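The checks in 2.1.1 to 2.1.3 can be sketched roughly as follows. This is a minimal illustration on a small hypothetical data frame, not the report's actual notebook: only ‘Transport’ is named in the source, so the other column names, the category labels and all values are assumptions. With the real data, the frame would presumably be loaded from ‘Transport.csv’ with pd.read_csv.

```python
import pandas as pd

# Hypothetical stand-in for Transport.csv; column names other than
# 'Transport' and all values are assumptions for illustration.
df = pd.DataFrame(
    {
        "Age": [24, 27, 31, 36, 29],
        "Gender": ["Male", "Female", "Male", "Male", "Female"],
        "Salary": [8.5, 10.2, 14.0, 36.6, 12.1],
        "Transport": [
            "Public Transport",
            "Public Transport",
            "Public Transport",
            "Private Transport",
            "Public Transport",
        ],
    }
)

# 2.1.1: dimensions as (rows, columns); the report obtains (444, 9).
n_rows, n_cols = df.shape

# 2.1.2: total number of missing values across all columns.
total_nulls = int(df.isnull().sum().sum())

# 2.1.3: proportion of each category in the target variable.
transport_share = df["Transport"].value_counts(normalize=True)

print(n_rows, n_cols, total_nulls)
print(transport_share)
```

The same value_counts call can be repeated for each categorical column to obtain its proportions.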
Model. Also, the majority of the transport users are Male, with a count of 316. We can conclude
that the data set has no bad values.
• As observed, the statistics for the target variable show that Public Transport is the most opted mode.
• Age ranges from 18 to 43 with a mean of 27 years and a standard deviation of 4.41. The mean
and median are almost equal, which suggests a roughly symmetric, near-normal distribution.
• Males have the higher count, with a frequency of 316, in comparison with Females.
• The work experience of the transporters varies from 0 to 24. The mean value is 6.29 and the
median is 5.
• The salary of the transporters varies from 6.5 to 57 LPA. The mean salary is 10.45 LPA and
the median is 13.6 LPA.
• The distance travelled by the transporters varies from 3.2 to 23.4 km. The average distance
travelled is 11.23 km.
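The summary statistics quoted above, and the mean-versus-median comparison used to judge the shape of a distribution, can be illustrated with a small sketch. The values below are hypothetical and are not taken from the data set; the helper name summarize is also an assumption.

```python
import statistics

# Hypothetical values for illustration only (not taken from the data set).
ages = [22, 24, 25, 26, 27, 27, 28, 29, 30, 32]
salaries = [6.5, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 57.0]


def summarize(values):
    """Return min, max, mean, median and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
    }


age_summary = summarize(ages)
salary_summary = summarize(salaries)

# Mean close to median: a roughly symmetric distribution.
age_gap = age_summary["mean"] - age_summary["median"]
# Mean well above median: a long right tail (a few very high values).
salary_gap = salary_summary["mean"] - salary_summary["median"]

print(age_summary)
print(salary_summary)
```

A gap near zero, as for the hypothetical ages, is consistent with a symmetric distribution, while a large gap signals skew and possible outliers.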
• The majority of the Transporters are Male, and most of them use Public Transport.
2.1.7. Bivariate Analysis
1. Swarm plot between Transport and Age
• In the swarm plot, the 23 to 30 age group is densely concentrated in Public
Transport.
• The age group above 35 prefers Private Transport.
• Based on the data, people with less work experience opt for public transport.
• Based on the data, people with lower salaries opt for public transport.
• Public transport has the denser population of points, and the density is highest
for shorter-distance travellers.
• From the pair plot, we can infer that there is a strong relationship between work
experience and salary. The rest of the variables are only weakly correlated.
• Hence, apart from this one pair, the dataset is largely free of multicollinearity and is
suitable for modelling.
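The correlation reading taken from the pair plot can be checked numerically with the Pearson coefficient. The sketch below is dependency-free and uses hypothetical values chosen so that salary rises with work experience while distance does not; the variable names are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: salary rises with experience, while distance
# is only weakly related to it.
work_exp = [0, 2, 4, 6, 8, 10]
salary = [6.5, 8.0, 9.5, 11.0, 12.5, 14.0]
distance = [5.0, 9.0, 4.0, 9.0, 4.0, 8.0]

r_exp_salary = pearson(work_exp, salary)
r_exp_distance = pearson(work_exp, distance)
print(round(r_exp_salary, 3), round(r_exp_distance, 3))
```

A coefficient close to 1 or -1 between two independent variables is the signal for multicollinearity; values near 0 indicate a weak linear relationship.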
2.2 Split the data into train and test in the ratio 70:30. Is scaling necessary
or not?
Ans: To build a model we first need to convert the categorical variables, so label encoding is
applied to them and the encoded data set is used for model building. Scaling is not applied for
most of the models, since most of the variables are categorical codes; it is applied only for
KNN, which is distance based.
2.2.1. Data Encoding and Model building for Machine Learning Analysis
• Here the data is not scaled, since there is not much difference in scale among the
independent variables. Hence scaling is not done for any of the models except KNN.
• The train and test split is made in the ratio 70:30.
• X is an array of the independent variables and Y is an array of the target variable.
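The encoding and 70:30 split described above can be sketched without any third-party library. The rows, category labels and the helper name label_encode are hypothetical; scikit-learn's LabelEncoder and train_test_split are the commonly used equivalents, and whether the report used them is an assumption.

```python
import random

# Hypothetical rows: (gender, age, salary, transport); values are invented.
rows = [
    ("Male", 24, 8.5, "Public Transport"),
    ("Female", 27, 10.2, "Public Transport"),
    ("Male", 36, 36.6, "Private Transport"),
    ("Male", 29, 12.1, "Public Transport"),
    ("Female", 38, 42.0, "Private Transport"),
    ("Male", 23, 7.9, "Public Transport"),
    ("Female", 26, 9.4, "Public Transport"),
    ("Male", 40, 45.3, "Private Transport"),
    ("Male", 25, 8.8, "Public Transport"),
    ("Female", 30, 13.5, "Public Transport"),
]


def label_encode(values):
    """Map each category to an integer code, in sorted category order."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


gender_codes, gender_map = label_encode([r[0] for r in rows])
target_codes, target_map = label_encode([r[3] for r in rows])

# X holds the independent variables, Y the target variable.
X = [[g, r[1], r[2]] for g, r in zip(gender_codes, rows)]
Y = target_codes

# Shuffle the row indices reproducibly, then cut at 70%.
indices = list(range(len(X)))
random.Random(1).shuffle(indices)
cut = int(round(0.7 * len(indices)))
train_idx, test_idx = indices[:cut], indices[cut:]

X_train = [X[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
Y_train = [Y[i] for i in train_idx]
Y_test = [Y[i] for i in test_idx]

print(gender_map, target_map, len(X_train), len(X_test))
```

Fixing the random seed keeps the split reproducible, so train and test metrics can be compared across models on the same rows.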
2.3 Build the following models on the 70% training data and check the
performance of these models on the Training as well as the 30% Test
data using the various inferences from the Confusion Matrix and
plotting a AUC-ROC curve along with the AUC values. Tune the models
wherever required for optimum performance.
2.3.1. Model Building for Machine Learning Analysis – Split of data
• Penalty: 'elasticnet', 'l2', 'none'
• Solver: 'newton-cg', 'saga'
• Tol: 0.001, 0.00001
2.3.2. Logistic Regression model generation using the Grid search method
Finally, using the GridSearchCV technique, the best estimator is captured in best_model, and
predictions are made with this best_model.
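A grid search of this kind can be sketched with scikit-learn as follows. The data here is synthetic (generated with make_classification as a stand-in for the encoded Transport data), and max_iter, cv and the random seeds are assumptions. Only the 'l2' penalty is searched in this sketch, because support for the other listed penalties depends on the solver and on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded Transport data (assumption).
X, y = make_classification(n_samples=444, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Grid restricted to combinations that are valid for both solvers.
param_grid = {
    "penalty": ["l2"],
    "solver": ["newton-cg", "saga"],
    "tol": [0.001, 0.00001],
}

grid = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid=param_grid,
    cv=3,
)
grid.fit(X_train, y_train)

# The refitted best estimator is what the report calls best_model.
best_model = grid.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(grid.best_params_, round(test_accuracy, 3))
```

GridSearchCV refits the best parameter combination on the full training split by default, which is why best_estimator_ can be used directly for prediction.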
• For KNN the data is scaled, because the KNN model is based on Euclidean
distance measurement. So zscore is applied for scaling on the Train data set.
• The train and test split is made in the ratio 70:30.
• X is an array of the independent variables and Y is an array of the target variable.
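Why KNN needs scaling can be seen from a small dependency-free sketch of z-score scaling and Euclidean distance. The rows and helper names are hypothetical. Reusing the train means and standard deviations for the test rows is a choice made in this sketch; the report only states that zscore is applied on the Train data set.

```python
import math
import statistics

# Hypothetical (age, salary) rows; values are invented for illustration.
train = [(24, 8.5), (27, 10.2), (36, 36.6), (29, 12.1), (40, 45.3)]
test = [(26, 9.4), (38, 42.0)]


def fit_zscore(rows):
    """Column means and population standard deviations of the rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Scale each column to (value - mean) / std."""
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Fit the scaling on the train rows only, then reuse it for the test rows.
means, stds = fit_zscore(train)
train_scaled = apply_zscore(train, means, stds)
test_scaled = apply_zscore(test, means, stds)

raw_distance = euclidean(train[0], train[2])
scaled_distance = euclidean(train_scaled[0], train_scaled[2])
print(round(raw_distance, 2), round(scaled_distance, 2))
```

In the raw distance the salary difference dominates because salary has the wider range; after scaling, both columns contribute on a comparable footing.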
2.3.6.2 Creating model using Random Forest model and applying Bagging
2.3.6.3 Creating model using Random Forest model and applying Boosting
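The bagging and boosting variants named in 2.3.6.2 and 2.3.6.3 could look roughly like the sketch below. It assumes scikit-learn's BaggingClassifier and AdaBoostClassifier wrapped around a RandomForestClassifier; the report does not state which boosting algorithm was used, so AdaBoost is an assumption, as are the synthetic data and all hyperparameter values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded Transport data (assumption).
X, y = make_classification(n_samples=444, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Base learner; hyperparameter values are illustrative, not the report's.
forest = RandomForestClassifier(n_estimators=10, random_state=1)

# Bagging: each copy of the base learner is fit on a bootstrap sample.
bagged = BaggingClassifier(forest, n_estimators=5, random_state=1)
bagged.fit(X_train, y_train)

# Boosting (AdaBoost assumed): learners are fit in sequence, with
# misclassified rows given more weight at each step.
boosted = AdaBoostClassifier(forest, n_estimators=5, random_state=1)
boosted.fit(X_train, y_train)

bagging_scores = (bagged.score(X_train, y_train), bagged.score(X_test, y_test))
boosting_scores = (boosted.score(X_train, y_train), boosted.score(X_test, y_test))
print(bagging_scores, boosting_scores)
```

Comparing the train and test accuracy of each ensemble gives a first indication of overfitting before the confusion matrix and AUC are examined.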
2.7.1.3. Accuracy Score, AUC_score and RoC Curve for the train data
2.7.1.4. Accuracy Score, AUC_score and RoC Curve for the test data
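The accuracy and AUC figures reported in these sections can be understood from a small dependency-free sketch. The labels and predicted probabilities are hypothetical, and the 0.5 threshold is an assumption. scikit-learn's confusion_matrix, accuracy_score and roc_auc_score are the usual functions for this; the version below only shows what they compute for a binary target.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp


def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
auc = auc_score(y_true, scores)
print((tn, fp, fn, tp), accuracy, auc)
```

Accuracy depends on the chosen threshold, while AUC summarises the ranking quality over all thresholds, which is why both are reported for each model.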
2.7.2.3. AUC_score and RoC Curve for the train and test data
2.7.3.3. Accuracy Score, AUC_score and RoC Curve for the train data
2.7.3.4. Accuracy Score, AUC_score and RoC Curve for the test data
2.7.4.3. AUC_score and RoC Curve for the train and test data
2.7.5.3. AUC_score and RoC Curve for the train and test data
2.7.6.3. AUC_score and RoC Curve for the train and test data
2.7.7.3. AUC_score and RoC Curve for the train and test data
On the whole, based on the outcomes of the models for this data set, the following insights are observed:
• Male Transporters outnumber Female Transporters, so the transport provider should focus on
attracting Male Transporters through suitable means.
• The analysis shows that people older than 35 prefer Private transportation; they appear to
value comfort over travelling by public transport.
• Also, high-salary employees opt for Private Transportation more often than low-salary
employees.
• Based on the model comparison, Random Forest with the Bagging technique is the best model
for further predictions.
3. Problem Statement:
➢ A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs
making their pitch to the VC sharks.
3.1 Pick out the Deal (Dependent Variable) and Description columns into a
separate data frame.
Ans: To start, we first need to load the data. The figure below shows a snapshot of the
data loading step.
3.1.1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
Ans: The Deal (Dependent Variable) and Description columns are copied into a separate data frame.
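Selecting the two columns and then separating secured from non-secured deals can be sketched with pandas as follows. The data frame, its column names ('deal', 'description', 'episode') and all rows are hypothetical stand-ins for the Shark Tank data set.

```python
import pandas as pd

# Hypothetical stand-in for the Shark Tank data; the column names
# 'deal', 'description' and 'episode' and all rows are assumptions.
shark = pd.DataFrame(
    {
        "deal": [True, False, True, False],
        "description": [
            "An easy to use product designed for kids.",
            "A device that tracks your sleep.",
            "A product that makes cooking easy.",
            "A wearable device for pets.",
        ],
        "episode": [1, 1, 2, 2],
    }
)

# Keep only the dependent variable and the text column.
deal_text = shark[["deal", "description"]].copy()

# Split into the secure and not secure deal data frames.
secure = deal_text[deal_text["deal"]]
not_secure = deal_text[~deal_text["deal"]]

print(deal_text.shape, len(secure), len(not_secure))
```

Taking a copy of the selection keeps later text cleaning from modifying the original data frame.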
3.1.3.2 Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’
and ‘company’ are to be removed)
Ans: Before the words are counted, both corpora are converted to lower case and the stop
words are removed from them.
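The lower-casing and stop-word removal can be sketched as below. The base stop-word list here is a small hand-written assumption for illustration; a full list such as NLTK's English stop words is the more usual choice, and the report does not say which one was used. The extra words are the ones named in the question.

```python
import re

# Small illustrative stop-word list (an assumption); the extra words
# named in the question are added to it.
BASE_STOP_WORDS = {"a", "an", "and", "for", "is", "that", "the", "to", "your"}
EXTRA_STOP_WORDS = {"also", "made", "makes", "like", "this", "even", "company"}
STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS


def clean(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


# Hypothetical description, for illustration only.
description = "This company makes a Product that is also easy to use."
tokens = clean(description)
print(tokens)
```

Converting to lower case first ensures that ‘This’ and ‘this’ are treated as the same stop word.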
3.1.3.3 What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
Ans: Below are the top 3 most frequently occurring words in both corpuses after removing stop words.
The secure deal corpus and the non-secure deal corpus share the same top words: ‘Product’,
‘Designed’ and ‘Easy’.
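Finding the most frequent words in a cleaned corpus is a direct use of collections.Counter. The token list below is hypothetical and only mirrors the kind of result reported above.

```python
from collections import Counter

# Hypothetical, already cleaned tokens for one corpus.
tokens = [
    "product", "easy", "product", "designed", "product",
    "designed", "kids", "easy", "designed", "product", "travel",
]

# Count every word and keep the three most frequent ones.
word_counts = Counter(tokens)
top_3 = word_counts.most_common(3)
print(top_3)
```

The same call is run once on the secure deal corpus and once on the non-secure deal corpus so that the two top-3 lists can be compared.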
Ans: A word cloud is a visual representation of the words in a given text. The size of
each word indicates its frequency: the bigger the word, the higher its frequency.
We now create the word cloud for the Secure Deal corpus, using WordCloud imported from
the wordcloud package.
We now create the word cloud for the Non-Secure Deal corpus, again using WordCloud imported
from the wordcloud package.
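The principle behind both word clouds, that a word is drawn larger the more often it occurs, can be illustrated with a simplified sketch. It does not reproduce the scaling or layout of the wordcloud package; the helper name relative_sizes, the font size limits and the token list are all assumptions.

```python
from collections import Counter


def relative_sizes(tokens, max_font=100, min_font=10):
    """Map each word to a font size that grows with its frequency."""
    counts = Counter(tokens)
    top = max(counts.values())
    return {
        word: max(min_font, round(max_font * count / top))
        for word, count in counts.items()
    }


# Hypothetical cleaned tokens from a non-secure deal corpus.
tokens = ["device", "device", "device", "device", "product", "product", "easy"]
sizes = relative_sizes(tokens)
print(sizes)
```

The most frequent word gets the maximum size and every other word is scaled down in proportion, which is what makes frequent words stand out visually.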
3.1.5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less
likely to secure a deal based on your analysis?
Based on the two word clouds, we can clearly see that the word ‘Device’ appears in the Non-secure
deal corpus.
By the principles of word cloud visualization, the most frequent words are shown in a bigger size
than the others. In the non-secure deal word cloud, the word ‘Device’ appears bigger.
So many of the business ideas in this group relate to devices.
Since the word does not appear in the secure deal word cloud, we can adopt a rule of thumb that a
business idea related to the word ‘Device’ is less likely to secure a deal, based on this text
analysis.