Machine Learning
Surabhi Kulkarni
PGP-DSBA Online
Table of Contents
Executive Summary
Data Description
Sample of the dataset
Exploratory Data Analysis
Let us check the types of variables in the data frame
Check for missing values in the dataset
Pair Plot
Box Plot
Histogram
List of Figures
Data Dictionary
Importing Libraries.
Importing Data.
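A minimal sketch of these two steps, assuming the data ships as a single CSV file (the file name here is a placeholder):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset; 'dataset.csv' is a placeholder file name
df = pd.read_csv('dataset.csv')

# First look at the rows and the column dtypes
print(df.head(10))
df.info()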
Observation-1:
The given dataset has 5 integer-type features, 2 float-type features, and 2 object-type features.
0  28  1  0  0  4  14.3  3.2  0  1
1  23  0  1  0  4   8.3  3.3  0  1
2  29  1  1  0  7  13.4  4.1  0  1
3  28  0  1  1  5  13.4  4.5  0  1
4  27  1  1  0  4  13.4  4.6  0  1
5  26  1  1  0  4  12.3  4.8  1  1
6  28  1  1  0  5  14.4  5.1  0  0
7  26  0  1  0  3  10.5  5.1  0  1
8  22  1  1  0  1   7.5  5.1  0  1
9  27  1  1  0  4  13.5  5.2  0  1
Performing EDA
We can observe that there are no missing values in the dataset, so no missing-value treatment is required.
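The check itself is a one-liner in pandas:

# Count missing values in each column
print(df.isnull().sum())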
Outliers: Boxplot
Distplot
Histplot
Pairplot
Correlation Heatmap
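A sketch of how these plots can be generated with pandas and seaborn, assuming the data frame is named df as above:

import matplotlib.pyplot as plt
import seaborn as sns

# Boxplots of the numeric columns to inspect outliers
df.select_dtypes(include='number').plot(kind='box', subplots=True,
                                        layout=(3, 3), figsize=(12, 8))
plt.tight_layout()
plt.show()

# Distribution of each numeric feature
df.hist(figsize=(12, 8))
plt.show()

# Pairwise scatter plots
sns.pairplot(df)
plt.show()

# Correlation heatmap of the numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()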
1.2 Split the data into train and test. Is scaling necessary or not?
Let us split the X and y data frames into a training set and a test set. For this we will use train_test_split() from scikit-learn.
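A minimal sketch of the split, where X is the predictor frame and y the target (names assumed); the 70/30 ratio is inferred from the train and test supports of 310 and 134 reported below, and the random_state is an assumption:

from sklearn.model_selection import train_test_split

# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

On the scaling question: tree-based models (decision tree, random forest, boosting) are insensitive to feature scale, but distance-based models such as KNN benefit from standardised features, so scaling helps for those models.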
# Naive Bayes
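A sketch of the fit and of the evaluation routine reused for every classifier below; GaussianNB is an assumption for the Naive Bayes variant:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)

def evaluate(model, X, y):
    # Prints the metrics shown in the outputs below
    pred = model.predict(X)
    print(accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    print(classification_report(y, pred))
    print('AUC: %.3f' % roc_auc_score(y, model.predict_proba(X)[:, 1]))

nb = GaussianNB().fit(X_train, y_train)
evaluate(nb, X_train, y_train)  # training performance
evaluate(nb, X_test, y_test)    # test performance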
Train:

0.7935483870967742
[[ 51  51]
 [ 13 195]]

              precision    recall  f1-score   support
           0       0.80      0.50      0.61       102
           1       0.79      0.94      0.86       208
    accuracy                           0.79       310
   macro avg       0.79      0.72      0.74       310
weighted avg       0.79      0.79      0.78       310
Test:

0.7910447761194029
[[22 20]
 [ 8 84]]

              precision    recall  f1-score   support
           0       0.73      0.52      0.61        42
           1       0.81      0.91      0.86        92
    accuracy                           0.79       134
   macro avg       0.77      0.72      0.73       134
weighted avg       0.78      0.79      0.78       134
# KNN
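A sketch of the KNN fit, reusing the evaluate() helper from the Naive Bayes sketch; the default k=5 and the standardisation are assumptions (KNN is distance-based, so unscaled features would dominate the distance metric):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Standardise features so no single column dominates the distances
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)

evaluate(knn, scaler.transform(X_train), y_train)
evaluate(knn, scaler.transform(X_test), y_test)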
Train:

0.7161290322580646

              precision    recall  f1-score   support
    accuracy                           0.72       310
   macro avg       0.68      0.63      0.63       310
weighted avg       0.70      0.72      0.69       310
Test:

0.5223880597014925
[[ 7 35]
 [29 63]]

              precision    recall  f1-score   support
           0       0.19      0.17      0.18        42
           1       0.64      0.68      0.66        92
    accuracy                           0.52       134
   macro avg       0.42      0.43      0.42       134
weighted avg       0.50      0.52      0.51       134
Train:

              precision    recall  f1-score   support
    accuracy                           0.80       310
   macro avg       0.79      0.74      0.75       310
weighted avg       0.80      0.80      0.79       310

Test:

0.8208955223880597
[[26 16]
 [ 8 84]]

              precision    recall  f1-score   support
           0       0.76      0.62      0.68        42
           1       0.84      0.91      0.87        92
    accuracy                           0.82       134
   macro avg       0.80      0.77      0.78       134
weighted avg       0.82      0.82      0.82       134
# Logistic Regression
LogisticRegression(max_iter=10000, n_jobs=2, penalty='none',
                   solver='newton-cg', verbose=True)
Train:

0.7870967741935484
[[ 58  44]
 [ 22 186]]

              precision    recall  f1-score   support
           0       0.72      0.57      0.64       102
           1       0.81      0.89      0.85       208
    accuracy                           0.79       310
   macro avg       0.77      0.73      0.74       310
weighted avg       0.78      0.79      0.78       310
Test:

0.8059701492537313
[[25 17]
 [ 9 83]]

              precision    recall  f1-score   support
           0       0.74      0.60      0.66        42
           1       0.83      0.90      0.86        92
    accuracy                           0.81       134
   macro avg       0.78      0.75      0.76       134
weighted avg       0.80      0.81      0.80       134
AUC: 0.816
# DecisionTreeClassifier
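A sketch of the tree fit, again reusing the evaluate() helper; default (unpruned) settings are an assumption, consistent with the perfect training metrics that follow:

from sklearn.tree import DecisionTreeClassifier

# An unpruned tree can memorise the training data completely
dt = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
evaluate(dt, X_train, y_train)
evaluate(dt, X_test, y_test)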
Train:

1.0
[[102   0]
 [  0 208]]

              precision    recall  f1-score   support
           0       1.00      1.00      1.00       102
           1       1.00      1.00      1.00       208
    accuracy                           1.00       310
   macro avg       1.00      1.00      1.00       310
weighted avg       1.00      1.00      1.00       310

AUC: 1.000
Test:

0.8134328358208955
[[29 13]
 [12 80]]

              precision    recall  f1-score   support
           0       0.71      0.69      0.70        42
           1       0.86      0.87      0.86        92
    accuracy                           0.81       134
   macro avg       0.78      0.78      0.78       134
weighted avg       0.81      0.81      0.81       134

AUC: 0.861

The perfect training score against a test accuracy of 0.81 shows that the unpruned tree overfits the training data.
# AdaBoost
AdaBoostClassifier(n_estimators=100, random_state=1)
Train:

0.8838709677419355
[[ 77  25]
 [ 11 197]]

              precision    recall  f1-score   support
           0       0.88      0.75      0.81       102
           1       0.89      0.95      0.92       208
    accuracy                           0.88       310
   macro avg       0.88      0.85      0.86       310
weighted avg       0.88      0.88      0.88       310

AUC: 0.959
Test:

0.7910447761194029
[[22 20]
 [ 8 84]]

              precision    recall  f1-score   support
           0       0.73      0.52      0.61        42
           1       0.81      0.91      0.86        92
    accuracy                           0.79       134
   macro avg       0.77      0.72      0.73       134
weighted avg       0.78      0.79      0.78       134
AUC: 0.959
# Gradient Boosting
GradientBoostingClassifier(random_state=1)
Train:

0.967741935483871

              precision    recall  f1-score   support
    accuracy                           0.97       310
   macro avg       0.97      0.95      0.96       310
weighted avg       0.97      0.97      0.97       310
AUC: 0.998
Test:

0.7686567164179104
AUC: 0.815
# RandomForestClassifier
RandomForestClassifier(random_state=1)
Train:

1.0
[[102   0]
 [  0 208]]

              precision    recall  f1-score   support
           0       1.00      1.00      1.00       102
           1       1.00      1.00      1.00       208
    accuracy                           1.00       310
   macro avg       1.00      1.00      1.00       310
weighted avg       1.00      1.00      1.00       310
AUC: 1.000
Test:

0.8059701492537313

AUC: 0.829

As with the decision tree, the perfect training fit against a test accuracy of roughly 0.81 indicates that the random forest overfits the training data.
Problem 2: A dataset of Shark Tank episodes is made available. It
contains 495 entrepreneurs making their pitch to the VC sharks.
Importing Libraries.
Importing Data.
We can observe that there are 100 missing values in the dataset. Missing-value treatment will be done.
2.1 Pick out the Deal (Dependent Variable) and Description columns into a
separate data frame.
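A sketch of the subsetting, assuming the full data frame is named df and uses the lower-case column names visible in the output below:

# Keep only the target and the pitch text
df2 = df[['deal', 'description']]
print(df2.tail())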
      deal                                        description
493  False  Adriana Montano wants to open the first Cat Ca...
2.2 Create two corpora, one with those who secured a Deal, the other with
those who did not secure a deal.
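A sketch of the corpus construction and the word clouds discussed below; the WordCloud settings are assumptions:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# One corpus per outcome: concatenate all pitch descriptions
deal_corpus = ' '.join(df2.loc[df2['deal'] == True, 'description'].dropna())
no_deal_corpus = ' '.join(df2.loc[df2['deal'] == False, 'description'].dropna())

# Render one word cloud per corpus
for title, corpus in [('Secured a deal', deal_corpus),
                      ('Did not secure a deal', no_deal_corpus)]:
    wc = WordCloud(background_color='white', max_words=100).generate(corpus)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()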
The 'Did not secure a deal' word cloud contains words such as 'one', 'designed', 'help', 'device', 'bottle', 'premium', and 'use'. These indicate that pitches with a mediocre design, less suited to solving a problem, involving water bottles, carrying a premium price tag, or offering less usability are less likely to secure a deal.

It is also observed that words such as 'one', 'designed', 'system', and 'use' have a high weight in both word clouds. This indicates that either these were not the defining factors in whether a deal was made, or they were used in a different context in each scenario.
The word 'device' is not easily found in the 'secured a deal' word cloud, while it is easily spotted in the 'did not secure a deal' word cloud. This indicates that the word 'device' occurred frequently when a deal was rejected, implying that the statement given in the question is true.