Project - Machine Learning (E)

2022

Machine learning – Naïve Bayes, KNN, Bagging and Boosting on Predicting Transportation Mode

Anil Ulchala
3/13/2022

1 Contents
1. Action Required:
2. Problem Statement:
2.1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers
and missing values treatment (if necessary) and check the basic descriptive statistics of the
dataset.
2.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
2.3 Build the following models on the 70% training data and check the performance of these
models on the Training as well as the 30% Test data using the various inferences from the
Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models
wherever required for optimum performance.
2.4 Which model performs the best?
2.5 What are your business insights?
3. Problem Statement:
3.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
3.1.2.1 Secure Deal data frame
3.1.2.2 Not Secure Deal data frame

List of Tables:
Table 1 – Comparison between various models

This Business Report is generated based on the Data set extracted from reliable sources

List of Figures:
Figure 1 - Loading Dataset into Jupyter notebook
Figure 2 – Shape and Data type information of the data set
Figure 3 – Checking for Null Values
Figure 4 – Proportions checking for Categorical Variables
Figure 5 – Descriptive Information
Figure 6 – Distribution of the variables
Figure 7 – Count plot for Transport Variable
Figure 8 – Swarm plot Age vs Transport
Figure 9 – Swarm plot Transport vs Engineer
Figure 10 – Swarm plot Transport vs MBA
Figure 11 – Swarm plot against Transport and Work Experience
Figure 12 – Swarm plot against Transport and Salary
Figure 13 – Swarm plot against Transport and Distance Travelling
Figure 14 – Heat map of the Variables
Figure 15 – Pair Plot for the variables
Figure 16 – Boxplot before outlier Treatment
Figure 17 – Box plot after outlier Treatment
Figure 18 – Converting Target Variable to Integer type
Figure 19 – Dataset info after encoding the variables
Figure 20 – Train Data after the split
Figure 21 – Test Data after the split
Figure 22 – Grid Search Tuning for Logistic Regression
Figure 23 – Logistic Regression Model building
Figure 24 – Model for LDA
Figure 25 – Model for Gaussian NB
Figure 26 – Scaling the Data Set for model building
Figure 27 – Model building
Figure 28 – Model building
Figure 29 – Calculating Misclassification error for various values of K
Figure 30 – Plotting Misclassification error for various values of K
Figure 31 – Model building for K = 19
Figure 32 – Model using Random Forest
Figure 33 – Model using Random Forest and applying Bagging
Figure 34 – Model using Random Forest and applying Boosting
Figure 35 – Model using CART
Figure 36 – Confusion Matrix of Train Data
Figure 37 – Confusion Matrix of Test Data
Figure 38 – AUC and RoC of Train Data
Figure 39 – AUC and RoC of Test Data
Figure 40 – Confusion Matrix of Train and Test Data
Figure 41 – Classification Report of Train and Test Data
Figure 42 – AUC and RoC of Train and Test Data
Figure 43 – Confusion Matrix of Train Data
Figure 44 – Confusion Matrix of Test Data
Figure 45 – AUC and RoC of Train Data
Figure 46 – AUC and RoC of Test Data
Figure 47 – Confusion Matrix of Train Data
Figure 48 – Confusion Matrix of Test Data
Figure 49 – AUC and RoC of Train Data
Figure 50 – Confusion Matrix of Train Data
Figure 51 – Confusion Matrix of Test Data
Figure 52 – AUC and RoC of Train Data
Figure 53 – Confusion Matrix of Train Data
Figure 54 – Confusion Matrix of Test Data
Figure 55 – AUC and RoC of Train Data
Figure 56 – Confusion Matrix of Train Data
Figure 57 – Confusion Matrix of Test Data
Figure 58 – AUC and RoC of Train Data
Figure 59 - Loading Dataset into Jupyter notebook
Figure 60 – Separate Data Frame
Figure 61 – Secure Deal Data frame
Figure 62 – Not Secure Deal Data frame
Figure 63 – Number of Characters of Secure deal corpora
Figure 64 – Number of Characters of Non-Secure deal corpora
Figure 65 – Typical words in secure deal corpora
Figure 66 – Typical words in non-secure deal corpora
Figure 67 – Most frequent words of secure deal corpora
Figure 68 – Most frequent words of secure deal corpora
Figure 69 – Word cloud for Secure Deal Corpora
Figure 70 – Word cloud for Non-Secure Deal Corpora


Business Case
1. Action Required:
The purpose of this exercise is to explore the data set and perform exploratory data analysis,
then build Machine Learning models on it using techniques such as Naive Bayes and KNN and
compare the models. Insights and recommendations are to be provided based on the output.
NLP techniques also need to be applied to another data set, containing descriptions of
entrepreneurs on the Shark Tank show, to create word clouds for secured and non-secured deals.

2. Problem Statement:
You work for an office transport company. You are in discussions with ABC Consulting company
about providing transport for their employees. For this purpose, you are tasked with understanding
how the employees of ABC Consulting currently prefer to commute (between home and
office). Based on parameters such as age, salary and work experience given in the data set
‘Transport.csv’, you are required to predict the preferred mode of transport. The project
requires you to build several Machine Learning models and compare them so that a final model
can be chosen.

2.1 Basic data summary, Univariate, Bivariate analysis, graphs, checking
correlations, outliers and missing values treatment (if necessary) and
check the basic descriptive statistics of the dataset.
Ans: To start the data analysis, we first need to load the data. The figure below shows a snapshot
of the data loading step.

Figure 1 - Loading Dataset into Jupyter notebook

• Successfully loaded the data into Python.

2.1.1. The data set is now ready for Exploratory Data Analysis. As per the output, the dimension
or shape of the data set is (444, 9), so the data set has 444 rows and 9 columns. Refer to the
figure below.


Figure 2 – Shape and Data type information of the data set


As per the figure, there are 2 variables of ‘object’ type and the remaining variables are
of ‘int’ and ‘float’ type. There are also no duplicate rows in the data set.

2.1.2. There are no null values in the data set. Refer to the figure below.

Figure 3 – Checking for Null Values
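
The checks in 2.1.1 and 2.1.2 can be reproduced with a few pandas calls. The sketch below is illustrative only: it uses a tiny in-memory sample in place of ‘Transport.csv’, and the column names and values are assumptions, since the report shows its code only as figures.

```python
import io
import pandas as pd

# Tiny stand-in for 'Transport.csv'; column names and values are assumed.
sample_csv = io.StringIO(
    "Age,Gender,Salary,Transport\n"
    "28,Male,14.4,Public Transport\n"
    "24,Female,10.6,Public Transport\n"
    "36,Male,28.7,Private Transport\n"
)
df = pd.read_csv(sample_csv)  # with the real file: pd.read_csv("Transport.csv")

shape = df.shape                           # (rows, columns)
dtypes = df.dtypes                         # data type of each column
n_duplicates = int(df.duplicated().sum())  # number of fully duplicated rows
null_counts = df.isnull().sum()            # missing values per column

print(shape, n_duplicates, int(null_counts.sum()))
```

On the real data set the same calls would be expected to report a shape of (444, 9), no duplicates and no nulls, as stated in the text.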

2.1.3. From the problem description it is clear that ‘Transport’ is the dependent (target)
variable and the remaining variables are independent variables. Going forward, the
report uses the terms dependent and independent variable to refer to the columns.
Next, proportions for the categorical variables are calculated.

Figure 4 – Proportions checking for Categorical Variables


From the figure we can infer that 144 members opted for Private Transport and 300 members
chose Public Transport. The data set is therefore reasonably balanced, at roughly a 30:70
proportion (32.4% to 67.6%), and is good for creating a model. Also, the majority of the transport
users are Male, with a count of 316. We can conclude that the data set has no bad values.
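
As a quick check of the class proportions, the counts quoted above (144 and 300) can be converted to percentages. This is a plain-Python sketch; on a pandas column the same result would come from `value_counts(normalize=True)`.

```python
from collections import Counter

# Counts as reported in the text: 144 Private Transport, 300 Public Transport.
labels = ["Private Transport"] * 144 + ["Public Transport"] * 300
counts = Counter(labels)
total = sum(counts.values())

# Share of each class in percent, rounded to one decimal place.
proportions = {mode: round(100 * n / total, 1) for mode, n in counts.items()}
print(proportions)
```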

2.1.4. Descriptive Stats:

Figure 5 – Descriptive Information

• As observed from the target variable statistics, Public Transport is the most opted mode.
• Age ranges from 18 to 43 with a mean of 27 years and a standard deviation of 4.41. The mean
and median are almost equal, suggesting an approximately normal distribution.
• Males have the higher count, with a frequency of 316, in comparison with Females.
• The work experience of the commuters varies from 0 to 24. The mean is 6.29 and the median
is 5.
• The salary of the commuters varies from 6.5 to 57 LPA. The mean salary is 10.45 LPA and the
median is 13.6 LPA.
• The distance travelled by the commuters varies from 3.2 to 23.4 km. The average distance
travelled is 11.23 km.

2.1.5. Distribution and boxplot of the variables


Inferences based on the boxplots and distribution plots:
1. The ‘Distance’ and ‘Age’ variables are approximately normally distributed.
2. The rest of the features are ordinal rather than continuous variables, so their
distributions show multiple spikes.
3. Variables like ‘Salary’ and ‘Work Experience’ are right tailed.


Figure 6 – Distribution of the variables

2.1.6. Univariate Analysis


Figure 7 – Count plot for Transport Variable

• The majority of the commuters are Male, and most of them use Public Transport.
2.1.7. Bivariate Analysis
1. Swarm plot between Transport and Age

Figure 8 – Swarm plot Age vs Transport

• In the swarm plot, the 23 to 30 age group is densest for Public Transport.
• The age group above 35 prefers Private Transport.

2. Swarm plot between Transport and Engineer


Figure 9 – Swarm plot Transport vs Engineer

• Based on the data, there is no clear insight.

3. Swarm plot between Transport vs MBA

Figure 10 – Swarm plot Transport vs MBA

• Based on the data, there is no clear correlation.

4. Swarm plot between Transport vs Work Experience

• Based on the data, people with less work experience opt for public transport.

Figure 11 – Swarm plot against Transport and Work Experience

5. Swarm plot between Transport vs Salary

• Based on the data, people with lower salaries opt for public transport.

Figure 12 – Swarm plot against Transport and Salary

6. Swarm plot between Transport vs Distance

• The points are denser for public transport, and densest among those travelling
shorter distances.

Figure 13 – Swarm plot against Transport and Distance Travelling

7. Correlation between the variables:

Figure 14 – Heat map of the Variables

• Age and Work Experience have a strong correlation of 0.93.
• Age and Salary have a strong correlation of 0.86.
• Age and MBA have a negative correlation.
• Salary and MBA also have a negative correlation.
• Engineer and Salary have a very weak correlation.
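
Each cell of the heat map is a Pearson correlation coefficient. The sketch below computes one such coefficient from first principles; the sample values are made up for illustration and are not taken from the data set.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: older employees tend to have more work experience.
age = [22, 25, 27, 30, 34, 40]
work_exp = [0, 2, 4, 7, 11, 18]
r = pearson_r(age, work_exp)
print(round(r, 2))
```

A value close to +1, as reported for Age and Work Experience, means the two variables rise together almost linearly.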


8. Pairplot between the variables:

Figure 15 – Pair Plot for the variables

• From the pair plot, we can infer that there is a strong relation between work
experience and salary. The rest of the variables have very weak correlation.
• Hence, apart from the strongly correlated Age, Work Experience and Salary variables,
the dataset is largely free of multi-collinearity and is good for modelling.

2.1.8. Treating Outliers


Before label encoding and model building, we observe that outliers exist in several variables,
so we treat them first. Salary has the highest number of outliers.

Figure 16 – Boxplot before outlier Treatment


Figure 17 –Box plot after outlier Treatment
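
The text does not state which outlier treatment was applied. A common choice, assumed here purely for illustration, is to cap values at the boxplot whiskers, that is at Q1 − 1.5·IQR and Q3 + 1.5·IQR. The salary values below are made up.

```python
import statistics

def cap_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed treatment rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

# Made-up salaries in LPA with one extreme value.
salary = [6.5, 8.0, 9.5, 11.0, 13.6, 15.0, 16.5, 57.0]
capped = cap_outliers_iqr(salary)
print(capped)
```

Only the extreme value is pulled back to the upper whisker; the remaining values are unchanged.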

2.2 Split the data into train and test in the ratio 70:30. Is scaling necessary
or not?
Ans: To create a model we first need to convert the categorical variables, so we label-encode
them and use the encoded data set for model building. Scaling is not
necessary since most of the variables are categorical codes.

2.2.1. Data Encoding and Model building for Machine Learning Analysis

2.2.1.1. Converting Target Variable to Integer type using Categorical Function


• Here ‘0’ represents ‘Private Transport’ and ‘1’ represents ‘Public Transport’
• In the Gender column, ‘0’ represents ‘Female’ and ‘1’ represents ‘Male’

Figure 18 – Converting Target Variable to Integer type

2.2.1.2. Data information after conversion.

Figure 19 – Dataset info after encoding the variables
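
A minimal sketch of this encoding step, assuming it was done with `pd.Categorical(...).codes` (the exact code appears only in Figure 18) and that the columns are named ‘Gender’ and ‘Transport’. Categories are ordered alphabetically, which gives the same 0/1 mapping as described above.

```python
import pandas as pd

# Three illustrative rows; column names are assumed.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Transport": ["Public Transport", "Private Transport", "Public Transport"],
})

# Categories are sorted alphabetically, so 'Female' < 'Male' and
# 'Private Transport' < 'Public Transport' receive codes 0 and 1.
for col in ["Gender", "Transport"]:
    df[col] = pd.Categorical(df[col]).codes

print(df)
```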


Figure 20 – Train Data after the split

Figure 21 – Test Data after the split

• Here the data is not scaled, since there is not much difference in scale among the
independent variables. Hence scaling is not done for any of the models except KNN.
• The train and test split is made in the ratio 70:30.
• X is an array of the independent variables and Y is an array of the target variable.
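
The 70:30 split can be sketched with scikit-learn's `train_test_split`. The arrays below are random placeholders with the report's row count, and `random_state=1` is an assumed value, since the text does not state the seed that was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 444 rows and 8 independent variables (random stand-ins).
rng = np.random.default_rng(0)
X = rng.normal(size=(444, 8))
y = rng.integers(0, 2, size=444)

# test_size=0.30 gives the 70:30 split; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```

With 444 rows, a test fraction of 0.30 yields 310 training rows and 134 test rows.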

2.3 Build the following models on the 70% training data and check the
performance of these models on the Training as well as the 30% Test
data using the various inferences from the Confusion Matrix and
plotting an AUC-ROC curve along with the AUC values. Tune the models
wherever required for optimum performance.
2.3.1. Model Building for Machine Learning Analysis – Split of data

Figure 22 – Grid Search Tuning for Logistic Regression

• Penalty: 'elasticnet', 'l2', 'none'
• Solver: 'newton-cg', 'saga'
• Tol: 0.001, 0.00001
2.3.2. Logistic Regression model generation using the Grid search method

Figure 23 – Logistic Regression Model building

Finally, the GridSearchCV technique captures the best estimator in best_model, and
predictions are made with this best_model.
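
The grid search can be sketched as follows. This is not the report's exact code: the data is synthetic, and the grid is deliberately reduced to the solver and tolerance values listed above, leaving the penalty at its default, because whether 'elasticnet' and 'none' are accepted depends on the solver and on the scikit-learn version. The `cv=3` setting is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the report's data set.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Reduced grid: solver and tolerance values from the list above.
param_grid = {
    "solver": ["newton-cg", "saga"],
    "tol": [0.001, 0.00001],
}
grid_search = GridSearchCV(
    LogisticRegression(max_iter=10000), param_grid, cv=3, scoring="accuracy"
)
grid_search.fit(X, y)

# best_estimator_ is refit on all supplied data with the best parameters.
best_model = grid_search.best_estimator_
predictions = best_model.predict(X)
print(grid_search.best_params_)
```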

2.3.3. Creating model for LDA

Figure 24 – Model for LDA

• Here the data is not scaled, since there is not much difference in scale among the
independent variables. Hence scaling is not done for any of the models except KNN.
• The train and test split is made in the ratio 70:30.
• X is an array of the independent variables and Y is an array of the target variable.

2.3.4. Creating model using Gaussian NB

Figure 25 – Model for Gaussian NB

• Here the data is not scaled, since there is not much difference in scale among the
independent variables. Hence scaling is not done for any of the models except KNN.
• The train and test split is made in the ratio 70:30.
• X is an array of the independent variables and Y is an array of the target variable.

2.3.5. Creating model using KNN


2.3.5.1 Scaling the data set for KNN model building

Figure 26 – Scaling the Data Set for model building

• Here scaling is done, because the KNN model is based on Euclidean
distance measurement. So a z-score is applied for scaling on the train data set.
• The train and test split is made in the ratio 70:30.
• X is an array of the independent variables and Y is an array of the target variable.
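
The z-score transforms each value as z = (x − mean) / std. A minimal sketch with made-up salary values is shown below. It estimates the mean and standard deviation on the train data and reuses them for the test data; that reuse is a common practice assumed here, not something the text states.

```python
import statistics

def fit_zscore(train_column):
    """Return (mean, population std) estimated from the training values only."""
    return statistics.mean(train_column), statistics.pstdev(train_column)

def apply_zscore(column, mean, std):
    """Scale a column with previously fitted statistics."""
    return [(v - mean) / std for v in column]

train_salary = [8.0, 10.0, 12.0, 14.0, 16.0]  # made-up values
test_salary = [9.0, 20.0]

mean, std = fit_zscore(train_salary)
train_scaled = apply_zscore(train_salary, mean, std)
test_scaled = apply_zscore(test_salary, mean, std)  # reuse the train statistics
print(train_scaled, test_scaled)
```

After scaling, every variable contributes to the Euclidean distance on a comparable scale, so a wide-ranging variable such as salary no longer dominates the neighbour search.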


2.3.5.2 Building Model:

Figure 27 – Model building


2.3.5.3 Building Model (K value =7):

Figure 28 – Model building

• The K value is taken as 7.
• Model score is 83.7%
• X is an array of the independent variables and Y is an array of the target variable.
2.3.5.4 Calculating Misclassification error of KNN model for a range from 1 to 19 at steps of 2

Figure 29 – Calculating Misclassification error for various values of K

• It is observed that the misclassification error is lowest for K = 19
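
The scan over odd K values can be illustrated with a small hand-written KNN, where the misclassification error is the share of wrong predictions (1 − accuracy). The points below are toy data, not the report's, and the scan stops at K = 5 because the toy set is tiny; the report scans K from 1 to 19.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    neighbours = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

def misclassification_error(train_X, train_y, test_X, test_y, k):
    """Share of test points that the k-NN vote gets wrong."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y
                for x, y in zip(test_X, test_y))
    return wrong / len(test_y)

# Toy scaled data: class 0 near the origin, class 1 near (3, 3),
# plus one noisy class-1 point sitting inside the class-0 cluster.
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.1, 0.12),
           (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
train_y = [0, 0, 0, 1, 1, 1, 1]
test_X = [(0.1, 0.1), (3.1, 3.0)]
test_y = [0, 1]

errors = {k: misclassification_error(train_X, train_y, test_X, test_y, k)
          for k in range(1, 6, 2)}
best_k = min(errors, key=errors.get)  # smallest K with the lowest error
print(errors, best_k)
```

In this toy case K = 1 is misled by the single noisy neighbour, while larger K values outvote it; that sensitivity is why the error is compared across several K values.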

2.3.5.5 Plotting Misclassification Error for range of K values


Figure 30 – Plotting Misclassification error for various values of K


2.3.5.6 Building model for K=19

Figure 31 – Model building for K =19

• Model score is observed as 79%

2.3.6. Creating model using Random Forest Technique

Figure 32 – Model using Random Forest

• The model is created using an ensemble technique, with 100 estimators.
• Train model score is observed as 100%
• Test model score is observed as 79%, which indicates some overfitting. So we will
tune the model using methods like Bagging and Boosting.

2.3.6.2 Creating model using Random Forest model and applying Bagging


Figure 33 – Model using Random Forest and applying Bagging

• The model is created using an ensemble technique, with base_estimators as
‘RF_model’.
• Train model score is observed as 96%
• Test model score is observed as 80%, which still indicates some overfitting. So we can
check model tuning using Boosting.
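
A sketch of the Random Forest and Bagging steps on synthetic data is given below. The name RF_model and the 100 estimators follow the text; the data, the seeds and the number of bagged models (10) are assumptions. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the report's data set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Random Forest with 100 trees, as described in the text.
RF_model = RandomForestClassifier(n_estimators=100, random_state=1)
RF_model.fit(X_train, y_train)

# Bagging with RF_model as the base estimator (10 bagged copies is an assumption).
bagging_model = BaggingClassifier(RF_model, n_estimators=10, random_state=1)
bagging_model.fit(X_train, y_train)

# A large gap between the train and test score points to overfitting.
scores = {
    name: (model.score(X_train, y_train), model.score(X_test, y_test))
    for name, model in [("Random Forest", RF_model), ("Bagging", bagging_model)]
}
print(scores)
```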

2.3.6.3 Creating model using Random Forest model and applying Boosting

Figure 34 – Model using Random Forest and applying Boosting

• The model is created using an ensemble technique, with base_estimators as
‘RF_model’. The Gradient Boosting method is used in this case.
• Train model score is observed as 93.9%
• Test model score is observed as 78.3%, so the gap between the train and test scores
is smaller than for the plain Random Forest model, and the model can be considered a
candidate for the best model.
2.3.6.4 Creating model using CART

Figure 35 – Model using CART

• The model is created using CART.
• Train model score is observed as 67.26%
• Test model score is observed as 68.46%. The train and test scores are in line, so
the model does not overfit, although both scores are low.

2.4 Which model performs the best?

2.7.1. Performance Metrics of Logistic Regression

2.7.1.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 36 – Confusion Matrix of Train Data

• Accuracy of the train data is 76.77%
• Recall for class ‘1’ on the train data is 0.92
2.7.1.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 37 – Confusion Matrix of Test Data

• Accuracy of the test data is 75%
• True positives for class ‘1’ in the test data number 83, which corresponds to the
recall of 91% in Table 1
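
Accuracy, precision, recall and F1-score are all derived from the four cells of the confusion matrix. The sketch below spells out the formulas; the counts passed in are hypothetical and are not read from the figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class ('1')."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted '1', how many are truly '1'
    recall = tp / (tp + fn)      # of the true '1', how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts chosen for illustration only.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=40)
print({name: round(value, 3) for name, value in metrics.items()})
```

Recall is therefore always a ratio between 0 and 1 (or 0% and 100%), while the number of true positives is a raw count taken directly from the confusion matrix.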

2.7.1.3. Accuracy Score AUC_score and RoC Curve for the train data

Figure 38 – AUC and RoC of Train Data

• AUC score of the train data is 78.96%
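
The AUC value can be read as the probability that a randomly chosen observation of class ‘1’ receives a higher predicted score than a randomly chosen observation of class ‘0’. The sketch below computes it directly from that definition, using made-up labels and scores.

```python
def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities of class '1'.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.2, 0.6, 0.7, 0.9, 0.5, 0.3]
auc = auc_score(y_true, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.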

2.7.1.4. Accuracy Score, AUC_score and RoC Curve for the test data


Figure 39 – AUC and RoC of Test Data

• AUC score of the test data is 78.96%

2.7.2. Performance Metrics of LDA

2.7.2.1. Confusion Matrix of Train and Test data

Figure 40 – Confusion Matrix of Train and Test Data

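• True positives for class ‘1’ in the train data number 194, which corresponds to the
recall of 93% in Table 1
• True positives for class ‘1’ in the test data number 84, which corresponds to the
recall of 92% in Table 1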

2.7.2.2. Classification report of Train and Test data


Figure 41 – Classification Report of Train and Test Data

• Accuracy of the Train data is 78%


• Accuracy of the Test Data is 77%

2.7.2.3. AUC_score and RoC Curve for the train and test data

Figure 42 – AUC and RoC of Train and Test Data

• AUC score of the train data is 79.1%


• AUC score of the test data is 72.8%

2.7.3. Performance Metrics of Naïve Bayes

2.7.3.1. Confusion Matrix, accuracy and other metrics of Train data


Figure 43 – Confusion Matrix of Train Data

• Accuracy of the train data is 78.06%
• True positives for class ‘1’ in the train data number 191, which corresponds to the
recall of 91% in Table 1
2.7.3.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 44 – Confusion Matrix of Test Data

• Accuracy of the test data is 74.62%
• True positives for class ‘1’ in the test data number 81, which corresponds to the
recall of 89% in Table 1

2.7.3.3. Accuracy Score AUC_score and RoC Curve for the train data

Figure 45 – AUC and RoC of Train Data

• AUC score of the train data is 78.88%

2.7.3.4. Accuracy Score, AUC_score and RoC Curve for the test data

Figure 46 – AUC and RoC of Test Data

• AUC score of the test data is 74.03%

2.7.4. Performance Metrics of KNN Model with K value as 19

2.7.4.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 47 – Confusion Matrix of Train Data

• Accuracy of the train data is 79.27%
• True positives for class ‘1’ in the train data number 211, which corresponds to the
recall of 94% in Table 1
2.7.4.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 48 – Confusion Matrix of Test Data

• Accuracy of the test data is 70.27%
• True positives for class ‘1’ in the test data number 64, which corresponds to the
recall of 84% in Table 1

2.7.4.3. AUC_score and RoC Curve for the train and test data

Figure 49 – AUC and RoC of Train Data

• AUC score of the train data is 84.50%


• AUC score of the test data is 74%

2.7.5. Performance Metrics of Random Forest with Bagging Technique

2.7.5.1. Confusion Matrix, accuracy and other metrics of Train data


Figure 50 – Confusion Matrix of Train Data

• Accuracy of the train data is 96.39%
• Recall for class ‘1’ on the train data is 100%

2.7.5.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 51 – Confusion Matrix of Test Data

• Accuracy of the test data is 80.18%


• Recall for test data in the interest of ‘1’ is 0.92

2.7.5.3. AUC_score and RoC Curve for the train and test data


Figure 52 – AUC and RoC of Train Data

• AUC score of the train data is 99.90%


• AUC score of the test data is 82.80%

2.7.6. Performance Metrics of Random Forest with Boosting Technique

2.7.6.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 53 – Confusion Matrix of Train Data

• Here Gradient boosting technique is used


• Accuracy of the train data is 93.99%
• Recall for train data in the interest of ‘1’ is 0.99

2.7.6.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 54 – Confusion Matrix of Test Data

• Accuracy of the test data is 78.37%


• Recall for test data in the interest of ‘1’ is 0.87

2.7.6.3. AUC_score and RoC Curve for the train and test data

Figure 55 – AUC and RoC of Train Data

• AUC score of the train data is 94.80%


• AUC score of the test data is 90.80%

2.7.7. Performance Metrics of CART

2.7.7.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 56 – Confusion Matrix of Train Data


• Accuracy of the train data is 67.26%


• Recall for train data in the interest of ‘1’ is 1

2.7.7.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 57 – Confusion Matrix of Test Data

• Accuracy of the test data is 68.46%


• Recall for test data in the interest of ‘1’ is 1

2.7.7.3. AUC_score and RoC Curve for the train and test data

Figure 58 – AUC and RoC of Train Data

• AUC score is 50% for both the train and the test data
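
One pattern that yields a recall of 1 together with an AUC of 50% is a model that assigns every observation to class ‘1’. Whether that is what happened with the CART model cannot be confirmed from the text alone; the sketch below only shows that such a model produces this combination of metrics. The class counts are hypothetical.

```python
# Hypothetical labels: 224 observations of class '1' and 109 of class '0'.
y_true = [1] * 224 + [0] * 109
y_pred = [1] * len(y_true)    # a model that always predicts '1'
scores = [1.0] * len(y_true)  # and gives every row the same score

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = tp / (tp + fn)

# With identical scores every positive/negative pair is a tie, so AUC is 0.5.
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0
          for p in pos for n in neg) / (len(pos) * len(neg))

print(round(accuracy, 4), recall, auc)
```

In that situation the accuracy simply equals the share of class ‘1’ in the data, so it says little about the model's ability to separate the classes.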

2.7.8. Comparison between Models

Metric (%)    LR Model        LDA Model       NB Model        KNN Model       Bagging         Boosting        CART
              Train   Test    Train   Test    Train   Test    Train   Test    Train   Test    Train   Test    Train   Test
Accuracy      76.77   75.37   78      76      78      74.62   79.27   70      96.3    80.18   93.99   78.37   67.26   68.46
Precision(1)  78      77      79      78      79      77      79      75      95      81      93      82      67      68
AUC           78.96   78.96   79      72      78      74      84      74      99      82.80   99      76.90   50      50
Recall(1)     92      91      93      92      91      89      94      84      100     92      99      87      100     100
F1-Score(1)   84      83      85      84      85      83      86      80      97      86      96      85      80      81

Table 1 – Comparison between various models
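
The comparison can also be made programmatically. The sketch below copies the test-set values for accuracy, AUC and F1-score from Table 1 and finds the best model per metric; the dictionary layout and the short model names are illustrative choices.

```python
# Test-set values copied from Table 1 (in %).
test_metrics = {
    "LR":       {"accuracy": 75.37, "auc": 78.96, "f1": 83},
    "LDA":      {"accuracy": 76.00, "auc": 72.00, "f1": 84},
    "NB":       {"accuracy": 74.62, "auc": 74.00, "f1": 83},
    "KNN":      {"accuracy": 70.00, "auc": 74.00, "f1": 80},
    "Bagging":  {"accuracy": 80.18, "auc": 82.80, "f1": 86},
    "Boosting": {"accuracy": 78.37, "auc": 76.90, "f1": 85},
    "CART":     {"accuracy": 68.46, "auc": 50.00, "f1": 81},
}

# Best model for each metric on the test data.
best_by_metric = {
    metric: max(test_metrics, key=lambda model: test_metrics[model][metric])
    for metric in ["accuracy", "auc", "f1"]
}

# Models ordered from highest to lowest test accuracy.
ranking = sorted(test_metrics,
                 key=lambda model: test_metrics[model]["accuracy"],
                 reverse=True)
print(best_by_metric, ranking)
```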

This Business Report is generated based on the Data set extracted from reliable sources
29

• On comparing the models, the Random Forest technique with Bagging is the most consistent across the model evaluation parameters
• Train and test results for the Bagging model are reasonably consistent with each other
• Bagging also ranks first on accuracy, with 96.3% on the train data and 80.18% on the test data. Boosting reaches 93.99% on the train data but only 78.37% on the test data, a drop of about 15.6 percentage points
• Overall, the Random Forest technique with Bagging, after tuning, provides the better model
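The comparison can be checked by computing each model's train-test accuracy gap directly from Table 1. The sketch below copies the accuracy row of the table; the dictionary layout and variable names are illustrative choices, not part of the original notebook.

```python
# Accuracy (%) from Table 1 as (train, test) pairs.
accuracy = {
    "LR": (76.77, 75.37),
    "LDA": (78.0, 76.0),
    "NB": (78.0, 74.62),
    "KNN": (79.27, 70.0),
    "Bagging": (96.3, 80.18),
    "Boosting": (93.99, 78.37),
    "CART": (67.26, 68.46),
}

# Gap between train and test accuracy, in percentage points.
gaps = {model: round(train - test, 2) for model, (train, test) in accuracy.items()}

# Model with the highest accuracy on the held-out test data.
best_on_test = max(accuracy, key=lambda model: accuracy[model][1])

for model, (train, test) in accuracy.items():
    print(f"{model:<9} train={train:6.2f}  test={test:6.2f}  gap={gaps[model]:6.2f}")
print("Highest test accuracy:", best_on_test)
```

Looking at the gap alongside the test score helps separate a model that generalizes well from one that only fits the training data well.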

2.5 What are your business insights?

On the whole, the analysis and the model outcomes for this data set suggest the following insights:
• Male transporters outnumber female transporters, so interested parties should focus on attracting male transporters.
• People older than 35 tend to prefer private transportation, which suggests they value comfort over travelling by public transport.
• High-salary employees also opt for private transportation more often than low-salary employees.
• Based on the model comparison, the Random Forest model with Bagging is the best choice for further predictions.

3. Problem Statement:
➢ A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to the VC sharks.

3.1 Pick out the Deal (Dependent Variable) and Description columns into a
separate data frame.

Ans: Before anything can be counted, the data must first be loaded. The figure below shows a snapshot of the data being loaded.

Figure 59 - Loading Dataset into Jupyter notebook

• The shape of the data frame is (495,19).


• The deal column has two unique values True and False.
• Total ‘False’ count is 244 and ‘True’ count is 251.


3.1.1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
Ans: The Deal (Dependent Variable) and Description columns are picked out into another data frame.

Figure 60 – Separate Data Frame


3.1.2. Create two corpora, one with those who secured a Deal, the other with those who did not
secure a deal.

3.1.2.1 Secure Deal data frame

Figure 61 – Secure Deal Data frame

• Secure deal data frame has a shape of (251,2)

3.1.2.2 Not Secure Deal data frame

Figure 62 – Not Secure Deal Data frame

• Not secure deal data frame has a shape of (244,2)
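The column selection and the split into two corpora can be sketched with pandas as follows. The column names `deal` and `description`, the extra `episode` column and the five toy rows are assumptions for illustration; the real data frame has 495 rows and 19 columns.

```python
import pandas as pd

# Toy stand-in for the Shark Tank data; values are invented for illustration.
df = pd.DataFrame(
    {
        "deal": [True, False, True, False, True],
        "description": [
            "Reusable water bottle designed for kids",
            "A device that tracks sleep",
            "Easy to use meal kit",
            "Online tutoring service",
            "Product designed for children",
        ],
        "episode": [1, 1, 2, 2, 3],
    }
)

# Keep only the dependent variable and the text column.
deal_desc = df[["deal", "description"]].copy()

# Boolean indexing splits the rows into the two corpora.
secured = deal_desc[deal_desc["deal"]]
not_secured = deal_desc[~deal_desc["deal"]]

print(deal_desc["deal"].value_counts())
print(secured.shape, not_secured.shape)
```

The two shapes should always add up to the number of rows in the original frame, which is a quick sanity check on the split (251 + 244 = 495 in the report).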


3.1.3. The following exercise is to be done for both the corpora:


3.1.3.1: Find the number of characters for both the corpuses.
Number of characters in secure deal:

Figure 63 – Number of Characters of Secure deal corpora


➢ Total Number of characters in secure deal corpora is 64060.
Number of characters in non-secure deal:

Figure 64 – Number of Characters of Non -Secure deal corpora


➢ Total Number of characters in non-secure deal corpora is 47184.
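The exact counting code is only visible in the figures, so the following is a hedged sketch of one plausible approach: sum the length of every description, counting spaces and punctuation as characters. The function name and sample strings are hypothetical.

```python
def total_characters(descriptions):
    """Total number of characters across all descriptions (spaces included)."""
    return sum(len(text) for text in descriptions)


# Invented sample descriptions, standing in for one corpus.
sample_corpus = ["ab c", "de"]
sample_total = total_characters(sample_corpus)
print(sample_total)
```

Whether spaces are counted changes the totals noticeably, so the choice should be stated alongside the reported numbers.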

3.1.3.2 Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’
and ‘company’ are to be removed)
Ans: Both corpora are converted to lower case and the stop words are removed before the words
are counted.
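A minimal sketch of this step is given here. The extra stop words are the ones named in the question; `BASE_STOP_WORDS` is a tiny hypothetical stand-in for a full English stop-word list (the report does not say which list was used), and the tokenizer is a simple regular expression.

```python
import re

# Tiny illustrative base list; a real run would use a full English stop-word list.
BASE_STOP_WORDS = {"a", "an", "the", "and", "is", "for", "to", "of", "in", "that", "it"}

# Additional words the exercise asks to remove.
EXTRA_STOP_WORDS = {"also", "made", "makes", "like", "this", "even", "company"}

STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS


def clean_tokens(text):
    """Lower-case the text, split it into words and drop the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


sample = "This company also makes a bottle that is easy to clean"
tokens = clean_tokens(sample)
print(tokens)
```

Lower-casing before filtering matters: without it, ‘This’ at the start of a sentence would survive the filter.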

Figure 65 – Typical words in secure deal corpora


Figure 66 – Typical words in non-secure deal corpora

3.1.3.3 What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?

Ans: The top 3 most frequently occurring words in each corpus, after removing stop words, are shown below.
The secure deal and non-secure deal corpora share the same top three words: ‘Product’, ‘Designed’ and
‘Easy’.
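Counting the most frequent words can be sketched with `collections.Counter` from the standard library. The token list below is invented for illustration and is assumed to have stop words already removed.

```python
from collections import Counter


def top_words(tokens, n=3):
    """Return the n most frequent tokens with their counts, most frequent first."""
    return Counter(tokens).most_common(n)


# Invented, already-cleaned tokens standing in for one corpus.
sample_tokens = [
    "product", "easy", "designed", "product", "kids",
    "product", "designed", "easy", "designed", "product",
]
top3 = top_words(sample_tokens)
print(top3)
```

Running the same function on both corpora gives the two top-3 lists that are compared in the answer.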

Figure 67 – Most frequent words of secure deal corpora

Figure 68 – Most frequent words of non-secure deal corpora

3.1.3.4 Plot the Word Cloud for both the corpora?

Ans: A word cloud is a visual representation of the words in a text. The size of each word
indicates its frequency: the bigger the word, the more often it occurs.
The word cloud for the Secure Deal corpora is created with the WordCloud class imported from the
wordcloud package.

Figure 69 – Word cloud for Secure Deal Corpora


The word cloud for the Non-Secure Deal corpora is likewise created with the WordCloud class
imported from the wordcloud package.

Figure 70 – Word cloud for Non-Secure Deal Corpora


3.1.4. Refer to both the word clouds. What do you infer?

• Secure Deal corpora word cloud
o From the word cloud, the most frequent words are ‘Product’, ‘use’, ‘design’,
‘children’ and ‘Water’.
o Based on the word cloud, we can guess that most of the businesses are intended
for children.
o The word ‘kids’ also supports this. The word ‘designed’ suggests that most of the
businesses have designed their products to be easy to use.
• Non-Secure Deal corpora word cloud
o From the word cloud, the most frequent words are ‘Product’, ‘use’, ‘device’,
‘help’ and ‘water’.
o Based on the word cloud, we can guess that most of the businesses are related
to water and services.
o ‘Online’ is another word that occurs many times.

3.1.5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less
likely to secure a deal based on your analysis?

Based on the two word clouds, the word ‘Device’ clearly appears in the non-secure deal word cloud.
A word cloud shows the most repeated words in a bigger size than the others, and in the non-secure
deal word cloud the word ‘Device’ appears large.
So many of the business ideas in this corpus are related to devices.
Since the word does not appear in the secure deal word cloud, a rule of thumb from this text
analysis is that a business idea related to the word ‘Device’ is less likely to secure a deal.
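A word cloud only shows relative size, so the rule of thumb can be checked against the numbers by comparing the share of descriptions that mention the word in each corpus. The sketch below uses invented descriptions and a hypothetical helper; it illustrates the check rather than reproducing the report's result.

```python
import re


def share_mentioning(descriptions, word):
    """Fraction of descriptions containing the word as a whole word (case-insensitive)."""
    if not descriptions:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = sum(1 for text in descriptions if pattern.search(text))
    return hits / len(descriptions)


# Invented sample descriptions for the two corpora.
secured_sample = [
    "A water bottle for kids",
    "Easy to use kitchen device",
    "Designed for children",
    "Product for pets",
]
not_secured_sample = [
    "A device that tracks sleep",
    "Wearable device for runners",
    "Online tutoring service",
    "Water filter",
]

secured_share = share_mentioning(secured_sample, "device")
not_secured_share = share_mentioning(not_secured_sample, "device")
print(secured_share, not_secured_share)
```

Note that the whole-word match does not count the plural ‘devices’. Because the two corpora differ in size (251 and 244 descriptions), comparing shares is fairer than comparing raw counts.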
