Machine Learning Business Report


Business Report – Predictive Modelling Project

By- Shorya Goel

Problem 1- You are hired by one of the leading news channels, CNBE, which wants to analyse recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict
which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in
predicting the overall win and the seats covered by a particular party.

1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.

Read the dataset – “Election_Data.xlsx”

Exploratory Data Analysis:


Top 5 entries in the dataset.

“Unnamed: 0” is a variable that simply represents the index in the data. Hence, it should be dropped as it is
of no use in the model.

Also, some variable names contain the '.' character, which can cause problems in the model, so we will
replace the '.' with '_'.
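A minimal sketch of these two cleaning steps, assuming the data has been read into a pandas DataFrame called df (the original dotted column names such as economic.cond.national are assumptions based on the renamed versions used later in the report):

    import pandas as pd

    # Read the survey data (sheet name/engine may need adjusting)
    df = pd.read_excel("Election_Data.xlsx")

    # Drop the redundant index column
    df = df.drop(columns=["Unnamed: 0"])

    # Replace '.' with '_' in column names, e.g. economic.cond.national -> economic_cond_national
    df.columns = df.columns.str.replace(".", "_", regex=False)

    print(df.head())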

Shape of the Dataset


Number of rows: 1525
Number of columns: 9

Info of the Dataset


There are a total of 9 variables present in the dataset (after dropping "Unnamed: 0").
2 categorical variables - vote, gender.
7 numeric variables - age, economic_cond_national, economic_cond_household, Blair, Hague, Europe,
political_knowledge.

Descriptive Statistics of the Dataset


Numerical Columns-

Categorical Columns-

The above tables give information such as count, mean, standard deviation, min-max and quartiles (the
five-number summary) for the numerical variables, and count, unique values, top category and frequency
for the categorical variables.

Check for Null Values-

From the above, it is clear that there are no null values present in the dataset.
The isnull() function is used here to check for missing values.
The sum() function is used in order to get the total number of null values present in a particular variable.
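For reference, a short sketch of this null-value check (and the duplicate check described next), assuming the DataFrame from the earlier sketch is named df:

    # Missing values per column; the grand total confirms there are no nulls
    print(df.isnull().sum())
    print("Total missing values:", df.isnull().sum().sum())

    # Fully duplicated rows (8 in this dataset)
    print("Duplicate rows:", df.duplicated().sum())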
Check for Duplicates-
There are a total of 8 duplicate rows.

Since there is no unique identifier for each row, we cannot say for certain whether these are responses from
the same person or from different people. So, we will not remove the duplicates in this case.

Skewness of the Dataset

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable
about its mean.
Only two variables are positively skewed; the rest are negatively skewed, with the maximum skewness in Blair.

Coefficient of Variation Check

The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard deviation
to the mean (average).
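A small sketch of how both checks could be computed with pandas, again assuming the DataFrame df from above:

    import numpy as np

    num_cols = df.select_dtypes(include=np.number)

    # Skewness of each numeric variable (negative = left-skewed, positive = right-skewed)
    print(num_cols.skew())

    # Coefficient of variation = standard deviation / mean, per numeric variable
    print(num_cols.std() / num_cols.mean())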

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis.


Check for Outliers.

Univariate Analysis
For Continuous variables
We can see that all the numerical variables are roughly normally distributed (not perfectly normal, and
multi-modal in some instances as well).
There are outliers present in the "economic_cond_national" and "economic_cond_household" variables, which
can be seen from the boxplots on the right too.
Also, from the boxplots the min and max values of the variables are not very clear; we can obtain them
separately while checking for outliers.
Bivariate Analysis-

Pairplot-

The pairplot tells us about the interaction of each variable with every other variable present.
As such, there is no strong relationship present between the variables.
There is a mixture of positive and negative relationships, though, which is expected.

Overall, it gives a rough estimate of the interactions; a clearer picture can be obtained from the heatmap values
and also from different kinds of plots.
Analysis - Blair and Age

People above the age of 45 years generally think that Blair is doing a good job.

Analysis - Hague and Age

Hague has a slightly higher concentration of neutral ratings than Blair among people above 50 years of age.

Catplot Analysis - Blair (count) on economic_cond_household.


Catplot Analysis - Hague (count) on economic_cond_household

Blair has higher counts than Hague across the household economic condition ratings.

Catplot Analysis - Blair (count) on economic_cond_national


Catplot Analysis – Hague (count) on economic_cond_national

Blair has higher counts than Hague across the national economic condition ratings.


Catplot Analysis – Blair (count) on Europe

Catplot Analysis – Hague (count) on Europe

Looking at the distribution over the Europe variable, Blair is leading overall.


Catplot Analysis – Blair (count) on political_knowledge

Catplot Analysis – Hague (count) on political_knowledge

Across the political_knowledge categories, Blair is rated better.


Covariance Matrix-

Correlation Matrix-

Heatmap-

Multicollinearity is an important issue which can harm the model. A heatmap is a good way of identifying this
issue. It gives us a basic idea of the relationship the variables have with each other.

Observations-
• The highest positive correlation is between "economic_cond_national" and "economic_cond_household"
(35%). But the good thing is that it is not huge.
• The highest negative correlation is between "Blair" and "Europe" (30%), but this is also not huge.

Thus, multicollinearity won't be an issue in this dataset.

Outlier Check/Treatment-

Using boxplot-

There are outliers present in the "economic_cond_national" and "economic_cond_household" variables, which
can be seen from the boxplots.
We will find the upper and lower limits to get a clear picture of the outliers.
The upper and lower limits are not that distant from each other, and the outliers lie on the lower side only,
all with value 1 where the lower limit is 1.5.
So it is not advisable to treat the outliers in this case.
We will move forward without treating the outliers.
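A sketch of how the whisker limits (the usual Q1 - 1.5*IQR and Q3 + 1.5*IQR rule) could be computed to inspect these outliers, assuming df from above:

    # Boxplot whisker limits for the two variables that show outliers
    for col in ["economic_cond_national", "economic_cond_household"]:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        print(f"{col}: lower limit = {lower}, upper limit = {upper}")
        print(df[col][(df[col] < lower) | (df[col] > upper)].value_counts())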

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary
here or not? Data Split: Split the data into train and test (70:30).
Encoding the dataset

As many machine learning models cannot work with string values, we will encode the categorical variables
and convert their datatypes to integer type.
From the info of the dataset, we know there are 2 categorical variables, so we need to encode these 2
variables with a suitable technique.
Those 2 variables are 'vote' and 'gender'. Their distribution is given below.

Gender Distribution-

Vote Distribution-
From the above results we can see that both variables contain only two categories.
We can use a simple categorical conversion (pd.Categorical() or dummy encoding with drop_first=True; both
will work here). This will convert the values into 0 and 1. As there is no level or order in the sub-categories, either
encoding will give the same result.
The datatype after conversion is int8; we can convert these to int64, although the model will work even if we
don't change it to int64.
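A sketch of this encoding step (df as before); either option gives an equivalent 0/1 coding for these two binary variables:

    # Option 1: pandas categorical codes (e.g. Conservative -> 0, Labour -> 1)
    df["vote"] = pd.Categorical(df["vote"]).codes
    df["gender"] = pd.Categorical(df["gender"]).codes

    # Option 2 (equivalent here): dummy encoding with drop_first=True
    # df = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)

    # Optionally widen the int8 codes to int64
    df[["vote", "gender"]] = df[["vote", "gender"]].astype("int64")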

After encoding-

Info-

Data-

Now, the model can be built on this data.

Scaling the dataset


Scaling is done so that data belonging to a wide variety of ranges is brought into a similar relative range,
which brings out the best performance of the model.
Generally, we perform feature scaling while dealing with gradient-descent-based algorithms such as
Linear and Logistic Regression, as these are sensitive to the range of the data points. In addition, it is very
useful when checking and reducing multicollinearity in the data. VIF (Variance Inflation Factor) is a value
which indicates the presence of multicollinearity; this value can be calculated only after building the
regression model.
So, whether scaling is required or not depends on the model we are building. Usually, the distance-based
methods (e.g. KNN) require scaling, as they are sensitive to extreme differences that can cause a
bias. But the tree-based methods (e.g. Decision Trees) generally do not require scaling, as it is
unnecessary for them (they use split-based rules).

Here, we will perform scaling for both types of models and check whether there is a difference in the
performance of the models.
Also, after looking at the data, we only need to scale the 'age' variable, as the rest of the variables are in the
range 0-10 at most.
We will use z-score (standard) scaling here to scale the age variable, after which it has mean = 0 and
standard deviation = 1.
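A sketch of the z-score scaling applied only to the age column (the copy name df_scaled is an assumption):

    from sklearn.preprocessing import StandardScaler

    df_scaled = df.copy()
    # Standardise only 'age' so that it ends up with mean 0 and standard deviation 1
    df_scaled[["age"]] = StandardScaler().fit_transform(df_scaled[["age"]])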

Data Split: Splitting the data into test and train

Before splitting we need to find the target variable. Here, the target variable is “vote”.
Vote data distribution-

There is a class imbalance in the target variable as seen above, so we cannot split it in a 50:50 ratio; instead we
will split the data in a 70:30 ratio. Also, we will use the oversampling technique SMOTE to check whether it
improves the model or not.

Here, we will use 2 different train and test sets, one without scaled data and one with scaled data. This will
help us in understanding whether scaling can improve the performance or not.

Now splitting both X and y data in the ratio 70:30, where train data is 70 % and test data is 30%.
After splitting- the shape of the data

Here,
X_train - denotes 70% training dataset with 8 columns (except the target column called “vote”).
X_test- denotes 30% test dataset with 8 columns (except the target column called “vote”).
y_train- denotes the 70% training dataset with only the target column called “vote”.
y_test- denotes 30% test dataset with only the target column called “vote”.

Similarly, the data is divided for scaled data and SMOTE oversampling data.
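A sketch of the 70:30 split and the SMOTE oversampling of the training portion; SMOTE comes from the imbalanced-learn package, and details such as the random_state and the stratified split are assumptions:

    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    X = df.drop(columns=["vote"])   # 8 predictor columns
    y = df["vote"]                  # target

    # 70:30 split (stratify keeps the class proportions similar in both parts - an assumption)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=1, stratify=y)

    # SMOTE is applied only to the training data
    X_train_sm, y_train_sm = SMOTE(random_state=1).fit_resample(X_train, y_train)

    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

The scaled experiment repeats the same split on the scaled copy of the data.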
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). Interpret
the inferences of both models.
Logistic Regression Model

Before fitting the model it is important to know about the hyperparameters that are involved in model
building. Parameters:
• penalty
• solver
• max_iter
• tol, etc.
To find the best combination of these parameters we will use the "GridSearchCV" method. This
method evaluates multiple combinations of these parameters using cross-validation and provides us with the
best one.
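An illustrative sketch of how this grid search could be set up; the exact parameter grid and cross-validation settings are not shown in the report, so the values below are assumptions (X_train, y_train come from the split sketch above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "penalty": ["l2"],
        "solver": ["lbfgs", "liblinear", "newton-cg", "sag"],
        "max_iter": [100, 1000, 10000],
        "tol": [1e-4, 1e-5, 1e-6],
    }
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5,
                        scoring="accuracy", n_jobs=-1)
    grid.fit(X_train, y_train)

    print(grid.best_params_)
    best_lr = grid.best_estimator_
    print("Train accuracy:", best_lr.score(X_train, y_train))
    print("Test accuracy:", best_lr.score(X_test, y_test))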
After performing the search the best parameters came out to be-

Now the results for unscaled data-

Intercept for the model is: [2.83418594]


Feature Importance-

Train Accuracy - 0.8303655107778819


Test Accuracy - 0.8537117903930131

Probabilities on the test set (0 = prefers the Conservative Party, 1 = prefers the Labour Party)-
Now the results for scaled data-

Intercept for the model is: [2.01329492]


Feature Importance-

Train Accuracy - 0.8303655107778819


Test Accuracy - 0.8493449781659389

Probabilities on the test set-


Statsmodels can also be used here to build the logistic regression model and learn more about the underlying
statistics of the model.

Inferences
Pseudo R2 = 0.3809 shows that the model performs really well, as a value between 0.2 and 0.4 indicates a
well-performing model.
The model performs slightly better on the unscaled data.
There is no under-fitting or overfitting present, as the accuracies for the test and train data are not very different.

Also, I performed SMOTE (oversampling technique), whose output is discussed further in the performance
model comparison.

LDA (Linear Discriminant Analysis) Model

Before fitting the model, it is important to know about the hyperparameters that are involved in model
building.
Parameters:
• solver
• shrinkage
Now after performing the GridSearchCV, the best parameters obtained are-
 shrinkage = 'auto'
 solver = 'lsqr'
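A short sketch of the corresponding grid search (same assumptions as before; 'svd' is excluded because it does not support shrinkage):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import GridSearchCV

    param_grid = {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto"]}
    grid = GridSearchCV(LinearDiscriminantAnalysis(), param_grid, cv=5, scoring="accuracy")
    grid.fit(X_train, y_train)

    print(grid.best_params_)   # e.g. {'shrinkage': 'auto', 'solver': 'lsqr'}
    print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))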
Now the results for unscaled data-

Intercept for the model is: [3.72460468]


Feature Importance-

Train Accuracy- 0.8284910965323337


Test Accuracy- 0.851528384279476

Probabilities on the test set-

Now the results for scaled data-

Intercept for the model is: [2.48783541]


Feature Importance-

Train Accuracy- 0.828491096532333


Test Accuracy- 0.851528384279476

Probabilities on the test set-


Inferences
The model performed well, and the accuracies for the scaled and unscaled data are the same.

Also, I performed SMOTE (oversampling technique), whose output is discussed further in the performance
model comparison.

1.5. Apply KNN Model and Naïve Bayes Model. Interpret the inferences of each
model.

K Nearest Neighbours Model

KNN is a distance-based supervised machine learning algorithm that can be used to solve both
classification and regression problems. The main disadvantage of this model is that it becomes very slow when
there is a large volume of data, which makes it an impractical choice where inferences need to be drawn
quickly.

Before fitting the model, it is important to know about the hyperparameters that are involved in model
building.
Parameters:
• n_neighbors
• weights
• algorithm
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'n_neighbors' = 5,
• 'weights' = uniform,
• 'algorithm' = auto
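A sketch of the KNN grid search with an assumed grid (for the scaled experiment the scaled split is passed in exactly the same way):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_neighbors": [3, 5, 7, 9, 11],
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree"],
    }
    grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
    grid.fit(X_train, y_train)

    print(grid.best_params_)   # e.g. {'algorithm': 'auto', 'n_neighbors': 5, 'weights': 'uniform'}
    print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))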

Now the results for unscaled data-

Train Accuracy- 0.8369259606373008


Test Accuracy- 0.8165938864628821

Probabilities on the test set-


Now the results for scaled data-

Train Accuracy- 0.8603561387066542


Test Accuracy- 0.8384279475982532

Probabilities on the test set-

Inference-
The model performed better with the scaled data.
Also, overall the model performed well, but there may be slight overfitting as the accuracy is higher for the
train set than for the test set.

Naive Bayes Model

A Naive Bayes classifier is a model based on applying Bayes' theorem with strong (naive) independence
assumptions between the features. These assumptions, however, may not hold perfectly in real-life
scenarios.
Bayes Theorem-

Here, the method that we are going to use is GaussianNB(). A general assumption of this method is that the
(continuous) features follow a normal or Gaussian distribution.
Unlike the other models, there are no major hyperparameters to tune here, so we will simply fit the model with
default parameters.
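A sketch of the default GaussianNB fit (X_train, y_train as before):

    from sklearn.naive_bayes import GaussianNB

    nb = GaussianNB()               # default parameters
    nb.fit(X_train, y_train)

    print("Train accuracy:", nb.score(X_train, y_train))
    print("Test accuracy:", nb.score(X_test, y_test))
    print(nb.predict_proba(X_test)[:5])   # class probabilities on the test set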

Now the results for unscaled data-

Train Accuracy- 0.8219306466729147


Test Accuracy- 0.8471615720524017

Probabilities on the test set-

Now the results for scaled data-

Train Accuracy- 0.8219306466729147


Test Accuracy- 0.8471615720524017

Probabilities on the test set-


Inference-
The model performed exactly the same for both unscaled and scaled data.
This model performed well on the data; no overfitting or under-fitting is present.

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.

Model Tuning

Tuning is the process of maximizing a model's performance without overfitting or creating too high a
variance. In machine learning, this is accomplished by selecting appropriate "hyper-parameters".

Grid Search is one of the most common methods of optimizing the parameters. In this approach a set of
parameter values is defined, and the performance of each combination of these parameters is evaluated using
cross-validation. The combination with the best cross-validation score is then selected.

Models such as Bagging, Boosting, Gradient Boosting, CatBoost, etc. are prone to under- or over-fitting of
data. Overfitting means that the model works very well on the train data but performs relatively poorly on the test
data. Under-fitting means that the model is too simple to capture the patterns in the data and performs
relatively poorly on both the train and the test data.

Bagging Model (Using Random Forest Classifier)

Bagging is an ensemble technique. Ensemble techniques are machine learning techniques that
combine several base models to obtain an optimal model. Bagging is designed to improve the performance of
existing machine learning algorithms used in statistical classification or regression. It is most commonly
used with tree-based algorithms. It is a parallel method.

Each base classifier is trained in parallel on a training set which is generated by randomly drawing, with
replacement, N samples from the original training data. The training set for each base classifier is independent
of the others.

Here, we will use random forest as the base classifier. Hyper-parameters that will be used in the model are
• max_depth
• max_features
• min_samples_leaf
• min_samples_split
• n_estimators
There are other parameters as well, but we will tune only these with grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• ' max_depth ' = 5,
• ' max_features ' = 7,
• ' min_samples_leaf ' = 25,
• ' min_samples_split ' = 60,
• ' n_estimators ' = 101
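A sketch of this grid search; the candidate values in the grid are assumptions chosen around the best parameters listed above:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "max_depth": [4, 5, 6],
        "max_features": [5, 6, 7],
        "min_samples_leaf": [15, 25, 30],
        "min_samples_split": [45, 60, 75],
        "n_estimators": [101, 301],
    }
    grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid,
                        cv=5, scoring="accuracy", n_jobs=-1)
    grid.fit(X_train, y_train)

    print(grid.best_params_)
    print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))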

Now the results for unscaled data-

Train Accuracy- 0.8303655107778819


Test Accuracy- 0.834061135371179

Probabilities on the test set-


Now the results for scaled data-

Train Accuracy- 0.8303655107778819


Test Accuracy- 0.834061135371179

Probabilities on the test set-

Inference-
The model performed exactly the same for both unscaled and scaled data.
This model performed extremely well on the data; no overfitting or under-fitting is present.

Boosting Model

Boosting is also an ensemble technique. It converts weak learners into strong learners. Unlike bagging, it is a
sequential method where the result from one weak learner becomes the input for the next one, and so on, thus
improving the performance of the model.
Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an
iterative process, and the boosting algorithm combines these weak rules into a single strong prediction rule.
Misclassified input data gain a higher weight, and examples that are classified correctly lose weight.
Thus, future weak learners focus more on the examples that previous weak learners misclassified. Boosting
methods are also typically tree-based.

There are many kinds of Boosting Techniques available and for this project, the following boosting
techniques are to be used.
1. ADA Boost (Adaptive Boosting)
2. Gradient Boosting
3. Extreme Gradient Boosting
4. CAT Boost (Categorical Boosting)

ADA Boosting Model

This model was originally used to increase the efficiency of binary classifiers, but is now used to improve multiclass
classifiers as well. AdaBoost can be applied on top of any classifier method to learn from its shortcomings and
produce a more accurate model, and thus it is called the "best out-of-the-box classifier".

Before fitting the model it is important to know about the hyper-parameters that are involved in model
building.
Parameters:
• algorithm
• n_estimators
There are other parameters as well, but we will tune only these with grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• ' algorithm ' = ' SAMME',
• ' n_estimators ' = 50
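A short sketch of the AdaBoost search, following the same pattern (grid values are assumptions):

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {"algorithm": ["SAMME"], "n_estimators": [50, 100, 200]}
    grid = GridSearchCV(AdaBoostClassifier(random_state=1), param_grid, cv=5, scoring="accuracy")
    grid.fit(X_train, y_train)

    print(grid.best_params_)   # e.g. {'algorithm': 'SAMME', 'n_estimators': 50}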

Now the results for unscaled data-

Train Accuracy- 0.8369259606373008


Test Accuracy- 0.8427947598253275

Probabilities on the test set-

Now the results for scaled data-

Train Accuracy- 0.8369259606373008


Test Accuracy- 0.8427947598253275

Probabilities on the test set-


Inference-
The model performed exactly the same for both unscaled and scaled data.
This model performed extremely well on the data; no overfitting or under-fitting is present.

Gradient Boosting Model

This model is similar to the AdaBoost model. Gradient Boosting works by sequentially adding predictors to the
ensemble, ensuring that the errors identified previously are corrected. The major difference lies in what it does
with the misidentified values of the previous weak learner: this method tries to fit the new predictor to the
residual errors made by the previous one.

Before fitting the model it is important to know about the hyper-parameters that are involved in model
building.
Parameters:
• Criterion
• loss
• n_estimators
• max_features
• min_samples_split

There are other parameters as well, but we will tune only these with grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-

• 'criterion' = 'friedman_mse',
• 'loss' = 'exponential',
• 'n_estimators' = 50,
• 'max_features' = 8,
• 'min_samples_split' = 45
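A short sketch of the gradient boosting search (grid values are assumptions around the best parameters above):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "criterion": ["friedman_mse"],
        "loss": ["exponential"],
        "n_estimators": [50, 100],
        "max_features": [6, 7, 8],
        "min_samples_split": [30, 45, 60],
    }
    grid = GridSearchCV(GradientBoostingClassifier(random_state=1), param_grid,
                        cv=5, scoring="accuracy", n_jobs=-1)
    grid.fit(X_train, y_train)

    print(grid.best_params_)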

Now the results for unscaled data-

Train Accuracy- 0.865979381443299


Test Accuracy- 0.8493449781659389

Probabilities on the test set-


Now the results for scaled data-

Train Accuracy- 0.865979381443299


Test Accuracy- 0.8493449781659389

Probabilities on the test set-

Inference-
The model performed exactly the same for both unscaled and scaled data.
Also, overall the model performed well, but there may be slight overfitting as the accuracy is higher for the
train set than for the test set.

XGBoost (eXtreme Gradient Boosting) Model

This model, as the name suggests, is based on the gradient boosting framework. However, XGBoost
improves upon the base GBM framework through systems optimization and algorithmic enhancements. It
uses parallel processing and RAM optimizations that push the gradient boosting method to its peak, hence
the name "extreme".
Another advantage is that it automatically handles missing values (controlled by the "missing" parameter, which defaults to NaN).
Another difference is that XGBoost does not have the 'min_samples_split' parameter.
Before fitting the model it is important to know about the hyper-parameters that are involved in model
building.
Parameters:
• max_depth
• min_samples_leaf
• n_estimators
• learning_rate
There are other parameters as well, but we will tune only these with grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'max_depth': 4,
• 'min_samples_leaf': 15,
• 'n_estimators': 50,
• 'learning_rate': 0.1
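A short sketch of the XGBoost search; the grid below keeps to parameters XGBClassifier itself exposes, and the values are assumptions:

    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "max_depth": [3, 4, 5],
        "n_estimators": [50, 100],
        "learning_rate": [0.05, 0.1, 0.3],
    }
    grid = GridSearchCV(XGBClassifier(eval_metric="logloss", random_state=1),
                        param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    grid.fit(X_train, y_train)

    print(grid.best_params_)
    print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))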

Now the results for unscaled data-

Train Accuracy- 0.8847235238987816


Test Accuracy- 0.851528384279476

Probabilities on the test set-

Now the results for scaled data-

Train Accuracy- 0.8847235238987816


Test Accuracy- 0.851528384279476

Probabilities on the test set-


Inference-
The model performed exactly the same for both unscaled and scaled data.
Also, overall the model performed well, but there may be slight overfitting as the accuracy is higher for the
train set than for the test set.

CATBoosting Model

CatBoost (Categorical Boosting) is a machine learning algorithm that uses gradient boosting on
decision trees. It is an open-source library and is not available under the usual sklearn package; we have
to install it separately. CatBoost can handle large amounts of categorical data, which is usually a
problem for the majority of machine learning algorithms. CatBoost is easy to implement and very powerful.
It provides excellent results and is very fast in execution.

There are plenty of parameters to specify but we are going forward with the default parameters.
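A sketch of the default CatBoost fit (the package is installed separately, e.g. pip install catboost):

    from catboost import CatBoostClassifier

    cat = CatBoostClassifier(verbose=0, random_state=1)   # default parameters, silent training
    cat.fit(X_train, y_train)

    print("Train accuracy:", cat.score(X_train, y_train))
    print("Test accuracy:", cat.score(X_test, y_test))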

Now the results for unscaled data-

Train Accuracy- 0.9381443298969072


Test Accuracy- 0.851528384279476

Probabilities on the test set-

Now the results for scaled data-

Train Accuracy- 0.9381443298969072


Test Accuracy- 0.851528384279476

Probabilities on the test set-


Inference-
The model performed exactly the same for both unscaled and scaled data.
There is a large difference between the accuracy values of the train and test data.
There is overfitting here, as the train accuracy is far higher than the test accuracy.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the
models and write inference which model is best/optimized.

Performance Metrics:

Usually there are many performance metrics that are used to assess the strength of a model, to understand
how the model has performed, and to take an informed decision on whether or not to go forward with the
model in a real-time scenario.

The industrial standards are generally based on the following methods:


• Classification Accuracy.
• Confusion Matrix.
• Classification Report.
• Area Under ROC Curve (visualization) and AUC Score
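A sketch of a helper that produces all four of these metrics for one model and one data split; it reuses the best_lr estimator from the earlier logistic regression sketch purely as an example:

    import matplotlib.pyplot as plt
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 classification_report, roc_auc_score, roc_curve)

    def evaluate(model, X, y, label):
        # Accuracy, confusion matrix, classification report and ROC/AUC for one split
        pred = model.predict(X)
        proba = model.predict_proba(X)[:, 1]
        print(label, "accuracy:", accuracy_score(y, pred))
        print(confusion_matrix(y, pred))
        print(classification_report(y, pred))
        print(label, "AUC:", roc_auc_score(y, proba))
        fpr, tpr, _ = roc_curve(y, proba)
        plt.plot(fpr, tpr, label=label)

    evaluate(best_lr, X_train, y_train, "Logistic Regression (Train)")
    evaluate(best_lr, X_test, y_test, "Logistic Regression (Test)")
    plt.plot([0, 1], [0, 1], "k--")
    plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
    plt.legend(); plt.show()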

Logistic Regression

Before Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.8537117903930131

Confusion Matrix-

For Train Data For Test Data

True Negative: 212 False Positive: 111 True Negative: 94 False Positive: 45
False Negative: 70 True Positive: 674 False Negative: 22 True Positive: 297
Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Logistic Regression (Train) score: 0.877 Logistic Regression (Test) score: 0.916
After Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.8493449781659389

Confusion Matrix-

For Train Data For Test Data

True Negative: 211 False Positive: 112 True Negative: 94 False Positive: 45
False Negative: 69 True Positive: 675 False Negative: 24 True Positive: 295

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Logistic Regression (Train) score: 0.877 Logistic Regression (Test) score: 0.915
SMOTE –

Without Scaling With Scaling


Train Accuracy- 0.8245967741935484 Train Accuracy- 0.8138440860215054
Test Accuracy- 0.8427947598253275 Test Accuracy- 0.8384279475982532

----------------------------------------------------------------------------------------------------------------------------------------------

LDA (Linear Discriminant Analysis)

Before Scaling-
Train Accuracy- 0.8284910965323337
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data For Test Data

True Negative: 218 False Positive: 105 True Negative: 100 False Positive: 39
False Negative: 78 True Positive: 666 False Negative: 29 True Positive: 290

Classification Report-
For Train Set-
For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

LDA (Train) score: 0.877 LDA (Test) score: 0.915

After Scaling
Train Accuracy- 0.8284910965323337
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data For Test Data

True Negative: 218 False Positive: 105 True Negative: 100 False Positive: 39
False Negative: 78 True Positive: 666 False Negative: 29 True Positive: 290

Classification Report-
For Train Set-
For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

LDA (Train) score: 0.877 LDA (Test) score: 0.915

SMOTE –

Without Scaling With Scaling


Train Accuracy- 0.8245967741935484 Train Accuracy- 0.8125
Test Accuracy- 0.8427947598253275 Test Accuracy- 0.8296943231441049

----------------------------------------------------------------------------------------------------------------------------------------------

KNN (K Nearest Neighbours)

Before Scaling-
Train Accuracy- 0.8369259606373008
Test Accuracy- 0.8165938864628821

Confusion Matrix-
For Train Data For Test Data

True Negative: 219 False Positive: 104 True Negative: 84 False Positive: 55
False Negative: 70 True Positive: 674 False Negative: 29 True Positive: 290
Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

KNN (Train) score: 0.915 KNN (Test) score: 0.867

After Scaling-
Train Accuracy- 0.8603561387066542
Test Accuracy- 0.8384279475982532
Confusion Matrix-
For Train Data For Test Data

True Negative: 239 False Positive: 84 True Negative: 95 False Positive: 44


False Negative: 65 True Positive: 679 False Negative: 30 True Positive: 289

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

KNN (Train) score: 0.933 KNN (Test) score: 0.877


SMOTE –

Without Scaling With Scaling


Train Accuracy- 0.8830645161290323 Train Accuracy- 0.8918010752688172
Test Accuracy- 0.8144104803493449 Test Accuracy- 0.8231441048034934

Naïve Bayes

Before Scaling-
Train Accuracy- 0.8219306466729147
Test Accuracy- 0.8471615720524017

Confusion Matrix-
For Train Data For Test Data

True Negative: 223 False Positive: 100 True Negative: 101 False Positive: 38
False Negative: 90 True Positive: 654 False Negative: 32 True Positive: 287

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

NB (Train) score: 0.874 NB (Test) score: 0.910

After Scaling-
Train Accuracy- 0.8219306466729147
Test Accuracy- 0.8471615720524017

Confusion Matrix-
For Train Data For Test Data

True Negative: 223 False Positive: 100 True Negative: 101 False Positive: 38
False Negative: 90 True Positive: 654 False Negative: 32 True Positive: 287

Classification Report-
For Train Set-

For Test Set-


Area Under ROC Curve and AUC Score:
For both Training and Testing:

NB (Train) score: 0.874 NB (Test) score: 0.910

SMOTE –

Without Scaling With Scaling


Train Accuracy- 0.8205645161290323 Train Accuracy- 0.8077956989247311
Test Accuracy- 0.8362445414847162 Test Accuracy- 0.8253275109170306

----------------------------------------------------------------------------------------------------------------------------------------------
Bagging

Before Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.834061135371179

Confusion Matrix-
For Train Data For Test Data

True Negative: 201 False Positive: 122 True Negative: 83 False Positive: 56
False Negative: 59 True Positive: 685 False Negative: 20 True Positive: 299
Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Bagging (Train) score: 0.891 Bagging (Test) score: 0.900

After Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.834061135371179

Confusion Matrix-
For Train Data For Test Data

True Negative: 201 False Positive: 122 True Negative: 83 False Positive: 56
False Negative: 59 True Positive: 685 False Negative: 20 True Positive: 299
Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Bagging (Train) score: 0.891 Bagging (Test) score: 0.900

SMOTE –

Without Scaling With Scaling


Train Accuracy- 0.831989247311828 Train Accuracy- 0.8259408602150538
Test Accuracy- 0.8078602620087336 Test Accuracy- 0.8100436681222707
----------------------------------------------------------------------------------------------------------------------------------------------

ADA Boosting

Before Scaling-
Train Accuracy- 0.8369259606373008
Test Accuracy- 0.8427947598253275

Confusion Matrix-
For Train Data For Test Data

True Negative: 224 False Positive: 99 True Negative: 97 False Positive: 42


False Negative: 75 True Positive: 669 False Negative: 30 True Positive: 289

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

ADABoost (Train) score: 0.889 ADABoost (Test) score: 0.906


After Scaling-
Train Accuracy- 0.8369259606373008
Test Accuracy- 0.8427947598253275

Confusion Matrix-
For Train Data For Test Data

True Negative: 224 False Positive: 99 True Negative: 97 False Positive: 42


False Negative: 75 True Positive: 669 False Negative: 30 True Positive: 289

Classification Report-
For Train Set-

For Test Set-


Area Under ROC Curve and AUC Score:
For both Training and Testing:

ADABoost (Train) score: 0.889 ADABoost (Test) score: 0.906

SMOTE –

Without Scaling With Scaling


Train Accuracy- 0.842741935483871 Train Accuracy- 0.8185483870967742
Test Accuracy- 0.8362445414847162 Test Accuracy- 0.8013100436681223

----------------------------------------------------------------------------------------------------------------------------------------------
Gradient Boosting

Before Scaling-
Train Accuracy- 0.865979381443299
Test Accuracy- 0.8493449781659389

Confusion Matrix-
For Train Data For Test Data
True Negative: 229 False Positive: 94 True Negative: 94 False Positive: 45
False Negative: 49 True Positive: 695 False Negative: 24 True Positive: 295
Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:
Gradient Boost (Train) score: 0.933 Gradient Boost (Test) score: 0.915

After Scaling-
Train Accuracy- 0.865979381443299
Test Accuracy- 0.8493449781659389

Confusion Matrix-
For Train Data For Test Data
True Negative: 229 False Positive: 94 True Negative: 94 False Positive: 45
False Negative: 49 True Positive: 695 False Negative: 24 True Positive: 295
Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

Gradient Boost (Train) score: 0.933 Gradient Boost (Test) score: 0.915

SMOTE –
Without Scaling With Scaling
Train Accuracy- 0.8716397849462365 Train Accuracy- 0.8595430107526881
Test Accuracy- 0.8296943231441049 Test Accuracy- 0.8296943231441049

----------------------------------------------------------------------------------------------------------------------------------------------

XGBoost

Before Scaling-
Train Accuracy- 0.8847235238987816
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data For Test Data

True Negative: 242 False Positive: 81 True Negative: 96 False Positive: 43


False Negative: 42 True Positive: 702 False Negative: 25 True Positive: 294

Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:
XGBoost (Train) score: 0.941 XGBoost (Test) score: 0.912

After Scaling-
Train Accuracy- 0.8847235238987816
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data For Test Data

True Negative: 242 False Positive: 81 True Negative: 96 False Positive: 43


False Negative: 42 True Positive: 702 False Negative: 25 True Positive: 294

Classification Report-
For Train Set-

For Test Set-


Area Under ROC Curve and AUC Score:
For both Training and Testing:

XGBoost (Train) score: 0.941 XGBoost (Test) score: 0.912

SMOTE –

Without Scaling With Scaling


Train Accuracy- 0.8803763440860215 Train Accuracy- 0.875
Test Accuracy- 0.8384279475982532 Test Accuracy- 0.8362445414847162

----------------------------------------------------------------------------------------------------------------------------------------------
CATBoost

Before Scaling-
Train Accuracy- 0.9381443298969072
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data For Test Data
True Negative: 281 False Positive: 42 True Negative: 97 False Positive: 42
False Negative: 24 True Positive: 720 False Negative: 26 True Positive: 293
Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

CATBoost (Train) score: 0.978 CATBoost (Test) score: 0.914

After Scaling-
Train Accuracy- 0.9381443298969072
Test Accuracy- 0.851528384279476

Confusion Matrix-
For Train Data For Test Data

True Negative: 281 False Positive: 42 True Negative: 97 False Positive: 42


False Negative: 24 True Positive: 720 False Negative: 26 True Positive: 293
Classification Report-
For Train Set-

For Test Set-

Area Under ROC Curve and AUC Score:


For both Training and Testing:

CATBoost (Train) score: 0.978 CATBoost (Test) score: 0.914

SMOTE –
Without Scaling With Scaling
Train Accuracy- 0.9455645161290323 Train Accuracy- 0.9401881720430108
Test Accuracy- 0.834061135371179 Test Accuracy- 0.8318777292576419

----------------------------------------------------------------------------------------------------------------------------------------------

Model Comparison-
This is the process through which we compare all the models built and find the best optimised one among them.
There are a total of 9 different kinds of models, and each model was built 4 times in the following fashion -
- Without Scaling
- With Scaling
- SMOTE without Scaling
- SMOTE with Scaling
So, that makes a total of 36 models in all.

The basis on which models are evaluated are known as performance metrics. The metrics on which the
model will be evaluated are-
• Accuracy
• AUC
• Recall
• Precision
• F1-Score
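A sketch of how such a comparison table could be assembled; the entries in the dictionary are placeholders for the fitted estimators from the earlier sections:

    import pandas as pd
    from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                                 precision_score, f1_score)

    models = {"Logistic Regression": best_lr, "Naive Bayes": nb}   # ...add LDA, KNN, bagging, boosting

    rows = []
    for name, model in models.items():
        pred = model.predict(X_test)
        proba = model.predict_proba(X_test)[:, 1]
        rows.append({"Model": name,
                     "Accuracy": accuracy_score(y_test, pred),
                     "AUC": roc_auc_score(y_test, proba),
                     "Recall": recall_score(y_test, pred),
                     "Precision": precision_score(y_test, pred),
                     "F1-Score": f1_score(y_test, pred)})

    print(pd.DataFrame(rows).set_index("Model").round(3))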

Without Scaling-

From the above-


- On the basis of Accuracy – Logistic Regression performed better than the others.
- On the basis of the AUC Score – Logistic Regression performed better than the others.
- On the basis of Recall – Bagging performed slightly better than the others.
- On the basis of Precision – Naive Bayes performed slightly better than the others.
- On the basis of F1-Score – Logistic Regression, along with some others, performed well.

All the models performed well, with only slight differences (about 1-5%) between them.

With Scaling-

From the above-


- On the basis of Accuracy – LDA and XGBoost performed better than the others.
- On the basis of the AUC Score – Logistic Regression and LDA performed better than the others.
- On the basis of Recall – Bagging performed slightly better than the others.
- On the basis of Precision – Naive Bayes performed slightly better than the others.
- On the basis of F1-Score – Logistic Regression, along with some others, performed well.
Smote Performance Metrics-
Here, the comparison is based on Accuracy values only. This will help in understanding whether using
SMOTE has a positive effect or not.

Smote Without Scaling-

From the above-


- On the basis of Accuracy, Logistic Regression performed better than the others.

Smote With Scaling-

From the above-


- On the basis of Accuracy, Logistic Regression performed better than the others.

Observations-
- From the above 4 tables it can be observed that using SMOTE did not increase the performance of
the models. Overall, the models without SMOTE performed well for both scaled and unscaled data.
Thus, there is no benefit in applying SMOTE here.
- As for the scaled and unscaled data models, scaling only improved the performance of the
distance-based algorithms; for the others it slightly decreased the performance overall. Here, only
the KNN model on scaled data performed slightly better than the KNN model on unscaled data.
- Best Optimised Model – On the basis of all the comparisons and performance metrics, "Logistic
Regression" without scaling performed the best of all.

1.8) Based on your analysis and working on the business problem, detail out
appropriate insights and recommendations to help the management solve the
business objective.

Inferences
- Logistic Regression performed the best out of all the models built.
- Logistic Regression equation for the model (log-odds of preferring the Labour Party, coded as 1):
3.05008 (intercept) + (-0.01891) * age + (0.41855) * economic_cond_national + (0.06714) *
economic_cond_household + (0.62627) * Blair + (-0.83974) * Hague + (-0.21413) * Europe +
(-0.40331) * political_knowledge + (0.10881) * gender

The above equation helps in understanding the model and the feature importance, i.e. how each feature
contributes to the predicted output.

Top 5 features in Logistic Regression Model in order of decreasing importance are-


1. Hague : |-0.8181846212178241|
2. Blair : |0.5460018962250501|
3. economic_cond_national : |0.37700497490783885|
4. political_knowledge : |-0.3459485608005413|
5. Europe : |-0.19691071679312278|
Insights and Recommendations

Our main Business Objective is - “To build a model, to predict which party a voter will vote for on the basis
of the given information, to create an exit poll that will help in predicting overall win and seats covered by a
particular party.”

 Use the Logistic Regression model without scaling for predicting the outcome, as it has the best
optimised performance.
 Hyper-parameter tuning is an important aspect of model building. There are limitations to this, as
processing all these parameter combinations requires a huge amount of processing power. But if tuning
can be done with more sets of parameters, then we might get even better results.
 Gathering more data will also help in training the models and thus improve their predictive power.
 Boosting models can also perform well; for example, CATBoost performed well even without tuning. Thus, if
we perform hyper-parameter tuning on it we might get better results.
 We can also create a function in which all the models predict the outcome in sequence. This will
help in better understanding what the outcome will be and with what probability.
Problem 2- In this particular project, we are going to work on the inaugural corpora from the nltk
in Python. We will be looking at the following speeches of the Presidents of the United States of
America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Characters
Characters in Franklin D. Roosevelt’s speech: 7571
Characters in John F. Kennedy’s speech: 7618
Characters in Richard Nixon’s speech: 9991

Words
Words in Franklin D. Roosevelt’s speech: 1536
Words in John F. Kennedy’s speech: 1546
Words in Richard Nixon’s speech: 2028

Sentences
Sentences in Franklin D. Roosevelt’s speech: 68
Sentences in John F. Kennedy’s speech: 52
Sentences in Richard Nixon’s speech: 69
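A sketch of how these counts are obtained from the nltk inaugural corpus (the download calls are only needed once):

    import nltk
    from nltk.corpus import inaugural
    nltk.download("inaugural")
    nltk.download("punkt")

    for fileid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
        print(fileid,
              "characters:", len(inaugural.raw(fileid)),
              "words:", len(inaugural.words(fileid)),
              "sentences:", len(inaugural.sents(fileid)))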

2.2 Remove all the stopwords from all three speeches.


To remove the stopwords, there is a package called "stopwords" in the nltk.corpus library.
So, in order to do so, we need to import the following libraries-
- from nltk.corpus import stopwords
- from nltk.stem.porter import PorterStemmer

The stopwords library contains all the stop words like 'and', 'a', 'is', 'to', 'of', etc., which usually have little
importance in understanding the sentiment and little usefulness for machine learning algorithms. The
stopwords present in the package are universally accepted stopwords, and we can add to them using the
.extend() function or remove them as per our requirements.

Also, we need to specify the language we are working with before defining the functions, as there are many
language packages. Here, we will use English.

Stemming is a process which helps the processor in understanding words that have a similar meaning. In
this process, words are brought down to their base or root form by removing affixes. It is widely used in
search engines. For example, 'eating', 'eats', and 'eaten' are all reduced to 'eat' after stemming.
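A sketch of the stopword removal and stemming described above (the extra punctuation tokens added to the stopword set are assumptions):

    import nltk
    from nltk.corpus import inaugural, stopwords
    from nltk.stem.porter import PorterStemmer
    nltk.download("stopwords")

    stop_words = set(stopwords.words("english"))
    stop_words.update([".", ",", ";", "--", "-"])   # extend with punctuation as required

    stemmer = PorterStemmer()

    def clean(fileid):
        # Lower-case, drop stopwords/non-alphabetic tokens, then stem what is left
        words = [w.lower() for w in inaugural.words(fileid)]
        return [stemmer.stem(w) for w in words if w not in stop_words and w.isalpha()]

    roosevelt_clean = clean("1941-Roosevelt.txt")
    kennedy_clean = clean("1961-Kennedy.txt")
    nixon_clean = clean("1973-Nixon.txt")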

Some of the stop words removed are-


2.3 Which word occurs the most number of times in his inaugural address for
each president? Mention the top three words. (after removing the stopwords)

Results after removing stopwords and stemming.

 For Franklin D. Roosevelt’s speech:

Here 'peopl', 'spirit', 'life' and 'democraci' are all tied for 3rd place because they have the same number of
occurrences.
Most occurring word: Nation.

 For John F. Kennedy’s speech:

Most occurring word: Let.

 For Richard Nixon’s speech:

Most occurring word: Us.
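A sketch of the frequency count behind these results, reusing the cleaned word lists from the previous section:

    from collections import Counter

    for name, words in [("Roosevelt", roosevelt_clean),
                        ("Kennedy", kennedy_clean),
                        ("Nixon", nixon_clean)]:
        print(name, Counter(words).most_common(3))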

2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords) 
Word Cloud is a data visualization technique used for representing text data in which the size of each
word indicates its frequency or importance. For generating a word cloud we need the wordcloud package. By
default it is not installed in the kernel, so we have to install it.
After importing the package we will again remove the stopwords but will not perform stemming, since
removing the stop words already filters out the unwanted words that carry little meaning or sentiment.
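A sketch of the word cloud generation (stop_words and the inaugural corpus come from the earlier sketches; the plot settings are assumptions):

    # pip install wordcloud
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    def plot_cloud(fileid, title):
        # Remove stopwords but keep full (unstemmed) words for readability
        words = [w.lower() for w in inaugural.words(fileid)
                 if w.isalpha() and w.lower() not in stop_words]
        cloud = WordCloud(background_color="white", width=800, height=400).generate(" ".join(words))
        plt.imshow(cloud, interpolation="bilinear")
        plt.axis("off")
        plt.title(title)
        plt.show()

    plot_cloud("1941-Roosevelt.txt", "Roosevelt 1941")
    plot_cloud("1961-Kennedy.txt", "Kennedy 1961")
    plot_cloud("1973-Nixon.txt", "Nixon 1973")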

Word Cloud of Roosevelt’s Speech:

We can see some highlighted words like "nation", "know", "people", etc., which we observed as the top words in
the previous question. The bigger the word, the higher its frequency.
Word Cloud of Kennedy’s Speech:
Word Cloud of Nixon’s Speech:

Insights –
 Our objective was to look at all the 3 speeches and analyse them, to find the strength and
sentiment of the speeches.
 Based on the outputs, we can see that there are some similar words present in all
the speeches.
 These words may be the points that inspired many people and helped win these candidates the office of
President of the United States of America.
 Among all the speeches, "nation" is the word that is significantly highlighted in all three.
