Machine Learning Business Report
Problem 1- You are hired by one of the leading news channels, CNBE, which wants to analyse the recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict
which party a voter will vote for on the basis of the given information, in order to create an exit poll that will
help in predicting the overall win and the seats covered by a particular party.
1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.
“Unnamed: 0” is a variable that simply represents the row index of the data. Hence, it should be dropped as it
is of no use in the model.
Also, some variable names contain the ‘.’ character, which can cause problems later when referencing
columns, so we will replace ‘.’ with ‘_’ in the column names.
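A minimal sketch of these two cleaning steps, assuming the data has been read into a pandas DataFrame named df (the file name used below is hypothetical):

```python
import pandas as pd

# Hypothetical file name for illustration; use the actual dataset path.
df = pd.read_csv("Election_Data.csv")

# Drop the index-like column as it carries no information for the model.
df = df.drop(columns=["Unnamed: 0"])

# Replace '.' with '_' in column names, e.g. 'economic.cond.national'
# becomes 'economic_cond_national'.
df.columns = df.columns.str.replace(".", "_", regex=False)
```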
Categorical Columns-
The table above gives information such as unique values, mean, median, standard deviation, five-point
summary, minimum, maximum, count, etc. for all the variables present in the dataset.
From the above, it is clear that there are no null values present in the dataset.
The isnull() function is used here to check for missing values, and sum() is applied on top of it to get the total
number of null values in each variable.
Check for Duplicates-
There are a total of 8 duplicate rows.
Since there is no identification or unique code for each row, we cannot say with certainty whether these rows
belong to the same person or to different people. So, we will not remove the duplicates in this case.
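The checks described above can be reproduced with a few pandas calls; a sketch, assuming the DataFrame df from the previous step:

```python
# Descriptive statistics for all columns (numeric and categorical).
print(df.describe(include="all").T)

# Null-value check: total number of missing values per variable.
print(df.isnull().sum())

# Duplicate check: number of fully duplicated rows (8 reported above).
print(df.duplicated().sum())
```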
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable
about its mean.
Only two variables are positively skewed and the rest are negatively skewed, with the maximum skewness in Blair.
The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard deviation
to the mean (average).
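A short sketch of how skewness and the coefficient of variation can be computed for the numeric columns (again assuming df):

```python
import numpy as np

num_cols = df.select_dtypes(include=np.number)

# Skewness of each numeric variable (negative values = left-skewed).
print(num_cols.skew())

# Coefficient of variation = standard deviation / mean.
print(num_cols.std() / num_cols.mean())
```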
Univariate Analysis
For Continuous variables
We can see that all the numerical variables are roughly normally distributed (not perfectly normal, and
multimodal in some instances).
There are outliers present in the “economic_cond_national” and “economic_cond_household” variables, which
can also be seen from the boxplots.
Also, the minimum and maximum values of the variables are not very clear from the boxplots; we can obtain
them separately while checking for outliers.
Bivariate Analysis-
Pairplot-
Pairplot tells us about the interaction of each variable with every other variable present.
As such, there is no strong relationship present between the variables.
There is a mixture of positive and negative relationships, though, which is expected.
Overall, the pairplot gives a rough estimate of the interactions; a clearer picture can be obtained from the
heatmap values and from other kinds of plots.
Analysis - Blair and Age
People above the age of 45 years generally think that Blair is doing a good job.
Hague has a slightly higher concentration of neutral ratings than Blair among people above 50 years of age.
Correlation Matrix-
Heatmap-
Multicollinearity is an important issue that can harm the model. A heatmap is a good way of identifying this
issue, as it gives us a basic idea of the relationship the variables have with each other.
Observations-
The highest positive correlation is between “economic_cond_national” and “economic_cond_household”
(35%), but it is not large.
The highest negative correlation is between “Blair” and “Europe” (30%), which is also not large.
Outlier Check/Treatment-
Using boxplot-
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary
here or not? Data Split: Split the data into train and test (70:30).
Encoding the dataset
As many machine learning models cannot work with string values, we will encode the categorical variables
and convert their datatypes to integer type.
From the info of the dataset, we know there are 2 categorical (object-type) variables, so we need to encode
these 2 variables with a suitable technique.
Those 2 variables are ‘vote’ and ‘gender’. Their distribution is given below.
Gender Distribution-
Vote Distribution-
From the results above we can see that both variables contain only two categories each.
We can use a simple categorical conversion (pd.Categorical() or dummy encoding with drop_first=True; both
will work here). This converts the values into 0 and 1, and as there is no level or order in the subcategories,
either encoding will give the same result.
The datatype after conversion is int8; we can convert it to int64, although the model will work even if we do
not change it.
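A minimal sketch of the encoding step, assuming df contains the 'vote' and 'gender' columns; pd.Categorical(...).codes assigns 0 and 1 in alphabetical order of the categories, so Conservative becomes 0 and Labour becomes 1:

```python
import pandas as pd

# Encode the two object-type columns as 0/1 integer codes.
for col in ["vote", "gender"]:
    df[col] = pd.Categorical(df[col]).codes

# Optional: cast from int8 to int64 (the model works either way).
df[["vote", "gender"]] = df[["vote", "gender"]].astype("int64")
```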
After encoding-
Info-
Data-
Here, we will build models on both the unscaled and the scaled data and check whether there is a difference
in the performance of the model.
Also, after looking at the data, we only need to scale the ‘age’ variable, as the rest of the variables are in the
range 0-10 at most.
We will use z-score scaling here to scale the age variable.
After scaling using z-score (standard) scaling, the variable has mean = 0 and standard deviation = 1.
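A sketch of z-score scaling applied only to 'age', using sklearn's StandardScaler; a copy of the data is kept so that both the scaled and the unscaled versions can be modelled:

```python
from sklearn.preprocessing import StandardScaler

df_scaled = df.copy()

# z-score scaling: (x - mean) / std, so mean becomes 0 and std becomes 1.
scaler = StandardScaler()
df_scaled[["age"]] = scaler.fit_transform(df_scaled[["age"]])
```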
Before splitting we need to find the target variable. Here, the target variable is “vote”.
Vote data distribution-
There is a class imbalance in the target variable, as seen above, so instead of a 50:50 split we will split the
data in a 70:30 ratio. We will also use the oversampling technique SMOTE to check whether it improves the
model or not.
Here, we will use 2 different train and test sets, one without scaled data and one with scaled data. This will
help us in understanding whether scaling can improve the performance or not.
Now splitting both X and y data in the ratio 70:30, where train data is 70 % and test data is 30%.
After splitting- the shape of the data
Here,
X_train - denotes 70% training dataset with 8 columns (except the target column called “vote”).
X_test- denotes 30% test dataset with 8 columns (except the target column called “vote”).
y_train- denotes the 70% training dataset with only the target column called “vote”.
y_test- denotes 30% test dataset with only the target column called “vote”.
Similarly, the data is divided for scaled data and SMOTE oversampling data.
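A sketch of the 70:30 split and of the SMOTE oversampling of the training data (SMOTE comes from the imbalanced-learn package; the random_state value and the stratified split are choices made for illustration):

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Features and target (target column is 'vote').
X = df.drop(columns=["vote"])
y = df["vote"]

# 70% train / 30% test; stratify keeps the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Oversample only the training data to balance the two classes.
X_train_sm, y_train_sm = SMOTE(random_state=1).fit_resample(X_train, y_train)
```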
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). Interpret
the inferences of both models.
Logistic Regression Model
Before fitting the model, it is important to know about the hyper-parameters involved in model building.
Parameters:
• penalty
• solver
• max_iter
• tol, etc.
To find the best combination of these parameters we will use the “GridSearchCV” method. This method
evaluates many combinations of these parameters and provides us with the best (optimal) result.
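A sketch of such a grid search over the logistic regression hyper-parameters listed above; the particular value grids shown here are illustrative rather than the exact ones used for the report:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative grids for penalty, solver, max_iter and tol.
param_grid = {
    "penalty": ["l2"],
    "solver": ["lbfgs", "newton-cg", "liblinear", "sag", "saga"],
    "max_iter": [100, 1000, 5000],
    "tol": [1e-4, 1e-5],
}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
best_logreg = grid.best_estimator_
```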
After performing the search the best parameters came out to be-
Probabilities on the test set-(0 being preferring Conservative Party and 1 being preferring Labour Party)
Now the results for scaled data-
Inferences
Pseudo R2 = 0.3809 shows that the model performs well, as values between 0.2 and 0.4 indicate a well-fitting
model.
The model performs slightly better on the unscaled data.
There is no under-fitting or overfitting present, as the accuracies for the train and test data are not very different.
Also, I performed SMOTE (oversampling technique), whose output is discussed further in the performance
model comparison.
LDA (Linear Discriminant Analysis) Model
Before fitting the model, it is important to know about the hyper-parameters involved in model building.
Parameters:
• solver
• shrinkage
Now after performing the GridSearchCV, the best parameters obtained are-
shrinkage = 'auto'
solver = 'lsqr'
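A sketch of the corresponding LDA grid search; shrinkage is only supported by the 'lsqr' and 'eigen' solvers, which is why they are searched together:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

param_grid = {
    "solver": ["lsqr", "eigen"],
    "shrinkage": [None, "auto"],
}

grid = GridSearchCV(LinearDiscriminantAnalysis(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Best parameters reported above: {'shrinkage': 'auto', 'solver': 'lsqr'}
best_lda = grid.best_estimator_
print(grid.best_params_, grid.score(X_test, y_test))
```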
Now the results for unscaled data-
Also, I performed SMOTE (oversampling technique), whose output is discussed further in the performance
model comparison.
1.5. Apply KNN Model and Naïve Bayes Model. Interpret the inferences of each
model.
KNN Model
KNN is a distance-based supervised machine learning algorithm that can be used to solve both classification
and regression problems. Its main disadvantage is that it becomes very slow on large volumes of data, which
makes it an impractical choice where inferences need to be drawn quickly.
Before fitting the model, it is important to know about the hyper-parameters involved in model building.
Parameters:
• n_neighbors
• weights
• algorithm
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'n_neighbors' = 5,
• 'weights' = uniform,
• 'algorithm' = auto
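A sketch of fitting KNN with these best parameters; X_train_s and X_test_s are assumed here to be the scaled counterparts of the train and test features, since KNN is distance based:

```python
from sklearn.neighbors import KNeighborsClassifier

# Best parameters from the grid search above.
knn = KNeighborsClassifier(n_neighbors=5, weights="uniform", algorithm="auto")
knn.fit(X_train_s, y_train)

print("Train accuracy:", knn.score(X_train_s, y_train))
print("Test accuracy:", knn.score(X_test_s, y_test))
```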
Inference-
The model performed better with the scaled data.
Also, overall the model performed well, but there may be slight overfitting as accuracy is higher for the train
set than for the test set.
Naive Bayes Model
A Naive Bayes classifier is a model based on applying Bayes' theorem with strong (naïve) independence
assumptions between the features. These assumptions may not hold perfectly in real-life scenarios.
Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
Here the method that we are going to use is GaussianNB(). The general assumption in this method is that
each continuous feature follows a normal (Gaussian) distribution.
There are no specific hyper-parameters to tune in this model, unlike the others, so we will simply fit the model
with default parameters.
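A minimal sketch of fitting the Gaussian Naive Bayes model with default parameters on the same split:

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()          # default parameters, Gaussian likelihood per feature
nb.fit(X_train, y_train)

print("Train accuracy:", nb.score(X_train, y_train))
print("Test accuracy:", nb.score(X_test, y_test))
```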
Tuning is the process of maximizing a model’s performance without overfitting or creating too high a
variance. In machine learning, this is accomplished by selecting appropriate “hyper-parameters”.
Grid Search is one of the most common methods of optimizing the parameters. A set of parameter values is
defined, and the performance for each combination of these parameters is evaluated using cross-validation.
The best-performing combination is then selected.
Models such as Bagging, Boosting, Gradient Boosting, CatBoost, etc. are prone to under- or over-fitting of the
data. Overfitting means that the model works very well on the train data but relatively poorly on the test data.
Under-fitting means that the model fails to capture the underlying pattern and performs poorly on both the
training and the test data.
Bagging Model
Bagging is an ensemble technique. Ensemble techniques are machine learning techniques that combine
several base models to obtain an optimal model. Bagging is designed to improve the performance of existing
machine learning algorithms used in statistical classification or regression. It is most commonly used with
tree-based algorithms and is a parallel method.
Each base classifier is trained in parallel on a training set generated by randomly drawing, with replacement,
N samples from the original training data. The training set for each base classifier is independent of the others.
Here, we will use random forest as the base classifier. Hyper-parameters that will be used in the model are
• max_depth
• max_features
• min_samples_leaf
• min_samples_split
• n_estimators
There are other parameters as well, but we will tune only these with grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'max_depth' = 5,
• 'max_features' = 7,
• 'min_samples_leaf' = 25,
• 'min_samples_split' = 60,
• 'n_estimators' = 101
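One way to realise "bagging with random forest as the base classifier" is to fit a RandomForestClassifier (itself a bagged ensemble of trees) with the best parameters reported above; a sketch under that assumption:

```python
from sklearn.ensemble import RandomForestClassifier

# Best parameters reported by the grid search above.
rf = RandomForestClassifier(
    max_depth=5,
    max_features=7,
    min_samples_leaf=25,
    min_samples_split=60,
    n_estimators=101,
    random_state=1,   # arbitrary seed for reproducibility
)
rf.fit(X_train, y_train)

print("Train accuracy:", rf.score(X_train, y_train))
print("Test accuracy:", rf.score(X_test, y_test))
```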
Inference-
The model performed exactly the same on both unscaled and scaled data.
The model performed very well on the data, with no overfitting or under-fitting present.
Boosting Model
Boosting is also an ensemble technique. It converts weak learners into strong learners. Unlike bagging, it is a
sequential method, where the result from one weak learner becomes the input for the next, and so on, thus
improving the performance of the model.
Each time base learning algorithm is applied, it generates a new weak learner prediction rule. This is an
iterative process and the boosting algorithm combines these weak rules into a single strong prediction rule.
Misclassified input data gain a higher weight and examples that are classified correctly will lose weight.
Thus, future weak learners focus more on the examples that previous weak learners misclassified. They
are also tree based methods.
There are many kinds of Boosting Techniques available and for this project, the following boosting
techniques are to be used.
1. ADA Boost (Adaptive Boosting)
2. Gradient Boosting
3. Extreme Gradient Boosting
4. CAT Boost (Categorical Boosting)
ADA Boosting Model
AdaBoost was originally used to increase the efficiency of binary classifiers, but it is now used to improve
multiclass classifiers as well. AdaBoost can be applied on top of any classifier method to learn from its
mistakes and produce a more accurate model, which is why it is often called the “best out-of-the-box classifier”.
Before fitting the model, it is important to know about the hyper-parameters involved in model building.
Parameters:
• algorithm
• n_estimators
There are other parameters as well, but we will tune only these with grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'algorithm' = 'SAMME',
• 'n_estimators' = 50
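A sketch of the AdaBoost fit with these best parameters (the default base estimator, a decision stump, is assumed):

```python
from sklearn.ensemble import AdaBoostClassifier

# Best parameters reported by the grid search above.
ada = AdaBoostClassifier(algorithm="SAMME", n_estimators=50, random_state=1)
ada.fit(X_train, y_train)

print("Train accuracy:", ada.score(X_train, y_train))
print("Test accuracy:", ada.score(X_test, y_test))
```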
Gradient Boosting Model
This model is similar to AdaBoost. Gradient Boosting works by sequentially adding predictors to the
ensemble so that the errors identified previously are corrected. The major difference lies in what it does with
the misclassified results of the previous weak learner: this method fits each new predictor to the residual
errors made by the previous one.
Before fitting the model, it is important to know about the hyper-parameters involved in model building.
Parameters:
• criterion
• loss
• n_estimators
• max_features
• min_samples_split
There are other parameters as well, but we will tune only these with grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'criterion' = 'friedman_mse',
• 'loss' = 'exponential',
• 'n_estimators' = 50,
• 'max_features' = 8,
• 'min_samples_split' = 45
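A sketch of the gradient boosting fit with these best parameters:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Best parameters reported by the grid search above.
gb = GradientBoostingClassifier(
    criterion="friedman_mse",
    loss="exponential",
    n_estimators=50,
    max_features=8,
    min_samples_split=45,
    random_state=1,   # arbitrary seed for reproducibility
)
gb.fit(X_train, y_train)

print("Train accuracy:", gb.score(X_train, y_train))
print("Test accuracy:", gb.score(X_test, y_test))
```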
Inference-
The model performed exactly the same on both unscaled and scaled data.
Also, overall the model performed well, but there may be slight overfitting as accuracy is higher for the train
set than for the test set.
XGBoost Model
This model, as the name suggests, is based on the gradient boosting framework. However, XGBoost
improves upon the base GBM framework through systems optimization and algorithmic enhancements. It
uses parallel processing and memory optimizations that push the gradient boosting method to its peak,
hence the name “extreme”.
Another advantage is that it automatically handles null values when the parameter missing = NaN is passed.
Another difference is that XGBoost does not have the parameter ‘min_samples_split’.
Before fitting the model, it is important to know about the hyper-parameters involved in model building.
Parameters:
• max_depth
• min_samples_leaf
• n_estimators
• learning_rate
There are other parameters as well, but we will tune only these with grid search and keep the rest at their default values.
Now after performing the “GridSearchCV”, the best parameters obtained are-
• 'max_depth': 4,
• 'min_samples_leaf': 15,
• 'n_estimators': 50,
• 'learning_rate': 0.1
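A sketch of the XGBoost fit with these best parameters; note that min_samples_leaf is a scikit-learn style name with no direct XGBoost equivalent, so it is left out here:

```python
from xgboost import XGBClassifier

# Best parameters reported above; min_samples_leaf is omitted because it is
# not a native XGBoost parameter (min_child_weight plays a similar role).
xgb = XGBClassifier(max_depth=4, n_estimators=50, learning_rate=0.1)
xgb.fit(X_train, y_train)

print("Train accuracy:", xgb.score(X_train, y_train))
print("Test accuracy:", xgb.score(X_test, y_test))
```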
CATBoosting Model
CATBoosting (CATegorical Boosting) is a machine learning algorithm that uses gradient boosting on
decision trees. It is an open source library and it’s not available under the usual Sklearn package. We have
to separately install the package. CAT Boost can manage huge amount of categorical data that is usually a
problem for majority of the machine learning algorithm. CATBoost is easy to implement and very powerful.
It provides excellent results and is very fast in executing.
There are plenty of parameters to specify but we are going forward with the default parameters.
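A minimal sketch of fitting CatBoost with default parameters (verbose=0 only silences the per-iteration log):

```python
from catboost import CatBoostClassifier

cat = CatBoostClassifier(verbose=0)   # default parameters otherwise
cat.fit(X_train, y_train)

print("Train accuracy:", cat.score(X_train, y_train))
print("Test accuracy:", cat.score(X_test, y_test))
```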
There are many performance metrics that can be used to assess the strength of a model, to understand how
the model has performed, and to take an informed decision on whether or not to go forward with the model
in a real-world scenario.
Logistic Regression
Before Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.8537117903930131
Confusion Matrix-
For Train Data For Test Data
True Negative: 212 False Positive: 111 True Negative: 94 False Positive: 45
False Negative: 70 True Positive: 674 False Negative: 22 True Positive: 297
Classification Report-
For Train Set-
Logistic Regression (Train) score: 0.877 Logistic Regression (Test) score: 0.916
After Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.8493449781659389
Confusion Matrix-
For Train Data For Test Data
True Negative: 211 False Positive: 112 True Negative: 94 False Positive: 45
False Negative: 69 True Positive: 675 False Negative: 24 True Positive: 295
Classification Report-
For Train Set-
Logistic Regression (Train) score: 0.877 Logistic Regression (Test) score: 0.915
SMOTE –
----------------------------------------------------------------------------------------------------------------------------------------------
LDA
Before Scaling-
Train Accuracy- 0.8284910965323337
Test Accuracy- 0.851528384279476
Confusion Matrix-
For Train Data For Test Data
True Negative: 218 False Positive: 105 True Negative: 100 False Positive: 39
False Negative: 78 True Positive: 666 False Negative: 29 True Positive: 290
Classification Report-
For Train Set-
For Test Set-
After Scaling
Train Accuracy- 0.8284910965323337
Test Accuracy- 0.851528384279476
Confusion Matrix-
For Train Data For Test Data
True Negative: 218 False Positive: 105 True Negative: 100 False Positive: 39
False Negative: 78 True Positive: 666 False Negative: 29 True Positive: 290
Classification Report-
For Train Set-
For Test Set-
SMOTE –
----------------------------------------------------------------------------------------------------------------------------------------------
KNN
Before Scaling-
Train Accuracy- 0.8369259606373008
Test Accuracy- 0.8165938864628821
Confusion Matrix-
For Train Data For Test Data
True Negative: 219 False Positive: 104 True Negative: 84 False Positive: 55
False Negative: 70 True Positive: 674 False Negative: 29 True Positive: 290
Classification Report-
For Train Set-
After Scaling-
Train Accuracy- 0.8603561387066542
Test Accuracy- 0.8384279475982532
Confusion Matrix-
For Train Data For Test Data
Classification Report-
For Train Set-
Naïve Bayes
Before Scaling-
Train Accuracy- 0.8219306466729147
Test Accuracy- 0.8471615720524017
Confusion Matrix-
For Train Data For Test Data
True Negative: 223 False Positive: 100 True Negative: 101 False Positive: 38
False Negative: 90 True Positive: 654 False Negative: 32 True Positive: 287
Classification Report-
For Train Set-
After Scaling-
Train Accuracy- 0.8219306466729147
Test Accuracy- 0.8471615720524017
Confusion Matrix-
For Train Data For Test Data
True Negative: 223 False Positive: 100 True Negative: 101 False Positive: 38
False Negative: 90 True Positive: 654 False Negative: 32 True Positive: 287
Classification Report-
For Train Set-
SMOTE –
----------------------------------------------------------------------------------------------------------------------------------------------
Bagging
Before Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.834061135371179
Confusion Matrix-
For Train Data For Test Data
True Negative: 201 False Positive: 122 True Negative: 83 False Positive: 56
False Negative: 59 True Positive: 685 False Negative: 20 True Positive: 299
Classification Report-
For Train Set-
After Scaling-
Train Accuracy- 0.8303655107778819
Test Accuracy- 0.834061135371179
Confusion Matrix-
For Train Data For Test Data
True Negative: 201 False Positive: 122 True Negative: 83 False Positive: 56
False Negative: 59 True Positive: 685 False Negative: 20 True Positive: 299
Classification Report-
For Train Set-
SMOTE –
ADA Boosting
Before Scaling-
Train Accuracy- 0.8369259606373008
Test Accuracy- 0.8427947598253275
Confusion Matrix-
For Train Data For Test Data
Classification Report-
For Train Set-
After Scaling-
Confusion Matrix-
For Train Data For Test Data
Classification Report-
For Train Set-
SMOTE –
----------------------------------------------------------------------------------------------------------------------------------------------
Gradient Boosting
Before Scaling-
Train Accuracy- 0.865979381443299
Test Accuracy- 0.8493449781659389
Confusion Matrix-
For Train Data For Test Data
True Negative: 229 False Positive: 94 True Negative: 94 False Positive: 45
False Negative: 49 True Positive: 695 False Negative: 24 True Positive: 295
Classification Report-
For Train Set-
After Scaling-
Train Accuracy- 0.865979381443299
Test Accuracy- 0.8493449781659389
Confusion Matrix-
For Train Data For Test Data
True Negative: 229 False Positive: 94 True Negative: 94 False Positive: 45
False Negative: 49 True Positive: 695 False Negative: 24 True Positive: 295
Classification Report-
For Train Set-
Gradient Boost (Train) score: 0.933 Gradient Boost (Test) score: 0.915
SMOTE –
Without Scaling With Scaling
Train Accuracy- 0.8716397849462365 Train Accuracy- 0.8595430107526881
Test Accuracy- 0.8296943231441049 Test Accuracy- 0.8296943231441049
----------------------------------------------------------------------------------------------------------------------------------------------
XGBoost
Before Scaling-
Train Accuracy- 0.8847235238987816
Test Accuracy- 0.851528384279476
Confusion Matrix-
For Train Data For Test Data
Classification Report-
For Train Set-
After Scaling-
Train Accuracy- 0.8847235238987816
Test Accuracy- 0.851528384279476
Confusion Matrix-
For Train Data For Test Data
Classification Report-
For Train Set-
SMOTE –
----------------------------------------------------------------------------------------------------------------------------------------------
CATBoost
Before Scaling-
Train Accuracy- 0.9381443298969072
Test Accuracy- 0.851528384279476
Confusion Matrix-
For Train Data For Test Data
True Negative: 281 False Positive: 42 True Negative: 97 False Positive: 42
False Negative: 24 True Positive: 720 False Negative: 26 True Positive: 293
Classification Report-
For Train Set-
After Scaling-
Train Accuracy- 0.9381443298969072
Test Accuracy- 0.851528384279476
Confusion Matrix-
For Train Data For Test Data
SMOTE –
Without Scaling With Scaling
Train Accuracy- 0.9455645161290323 Train Accuracy- 0.9401881720430108
Test Accuracy- 0.834061135371179 Test Accuracy- 0.8318777292576419
----------------------------------------------------------------------------------------------------------------------------------------------
Model Comparison-
This is the process through which we compare all the models built and find the best-optimised one among
them. There are a total of 9 different kinds of models, and each model is built 4 times in the following fashion –
- Without scaling
- With scaling
- SMOTE without scaling
- SMOTE with scaling
So that makes a total of 36 models in all.
The basis on which models are evaluated is known as performance metrics. The metrics on which the
models will be evaluated are listed below (a short sketch of computing them follows the list)-
• Accuracy
• AUC
• Recall
• Precision
• F1-Score
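A sketch of how these five metrics can be collected for any one of the fitted models, so that the 36 variants can be tabulated consistently; model, X and y stand for whichever variant and split is being evaluated:

```python
from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                             precision_score, f1_score)

def evaluate(model, X, y):
    """Return the five comparison metrics for a fitted binary classifier."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]   # probability of class 1 (Labour)
    return {
        "Accuracy": accuracy_score(y, pred),
        "AUC": roc_auc_score(y, proba),
        "Recall": recall_score(y, pred),
        "Precision": precision_score(y, pred),
        "F1-Score": f1_score(y, pred),
    }
```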
Without Scaling-
All the models performed well, with only slight differences (in the range of 1-5%).
With Scaling-
Observations-
- From the above 4 tables it can be observed that using SMOTE did not increase the performance of
the models. Overall, the models without SMOTE performed well on both scaled and unscaled data.
Thus, there is no benefit to applying SMOTE here.
- As for the scaled and unscaled data models, scaling only improved the performance of the
distance-based algorithm; for the others it slightly decreased performance overall. Here, only the
KNN model on scaled data performed slightly better than the KNN model on unscaled data.
- Best Optimised Model – On the basis of all the comparisons and performance metrics, “Logistic
Regression” without scaling performed the best of all.
1.8) Based on your analysis and working on the business problem, detail out
appropriate insights and recommendations to help the management solve the
business objective.
Inferences
- Logistic Regression performed the best out of all the models built.
- Logistic Regression Equation for the model:
(3.05008) * Intercept + (-0.01891) * age + (0.41855) * economic_cond_national + (0.06714) *
economic_cond_household + (0.62627) * Blair + (-0.83974) * Hague + (-0.21413) * Europe +
(-0.40331) * political_knowledge + (0.10881) * gender
The above equation helps in understanding the model and the feature importance, i.e. how each feature
contributes to the predicted output.
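To illustrate how the equation produces a prediction, the log-odds can be passed through the sigmoid function; the voter profile below is purely hypothetical:

```python
import numpy as np

coef = {
    "Intercept": 3.05008, "age": -0.01891, "economic_cond_national": 0.41855,
    "economic_cond_household": 0.06714, "Blair": 0.62627, "Hague": -0.83974,
    "Europe": -0.21413, "political_knowledge": -0.40331, "gender": 0.10881,
}

# Hypothetical voter, used only to show the mechanics of the equation.
voter = {"Intercept": 1, "age": 45, "economic_cond_national": 3,
         "economic_cond_household": 3, "Blair": 4, "Hague": 2,
         "Europe": 6, "political_knowledge": 2, "gender": 1}

log_odds = sum(coef[k] * voter[k] for k in coef)
prob_labour = 1 / (1 + np.exp(-log_odds))   # sigmoid gives P(vote = 1, Labour)
print(round(prob_labour, 3))
```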
Our main Business Objective is - “To build a model, to predict which party a voter will vote for on the basis
of the given information, to create an exit poll that will help in predicting overall win and seats covered by a
particular party.”
Use the Logistic Regression model without scaling for predicting the outcome, as it has the best
optimised performance.
Hyper-parameter tuning is an important aspect of model building. There are limitations to it, as
processing these combinations requires a huge amount of processing power, but if tuning can be
done with more sets of parameters then we might get even better results.
Gathering more data will also help in training the models and thus improving their predictive power.
Boosting models can also perform well; for example, CatBoost performed well even without tuning, so
with hyper-parameter tuning we might get better results.
We can also create a function in which all the models predict the outcome in sequence. This will help
in better understanding the probability of what the outcome will be.
Problem 2- In this particular project, we are going to work on the inaugural corpora from the nltk
in Python. We will be looking at the following speeches of the Presidents of the United States of
America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Characters
Characters in Franklin D. Roosevelt’s speech: 7571
Characters in John F. Kennedy’s speech: 7618
Characters in Richard Nixon’s speech: 9991
Words
Words in Franklin D. Roosevelt’s speech: 1536
Words in John F. Kennedy’s speech: 1546
Words in Richard Nixon’s speech: 2028
Sentences
Sentences in Franklin D. Roosevelt’s speech: 68
Sentences in John F. Kennedy’s speech: 52
Sentences in Richard Nixon’s speech: 69
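A sketch of how these counts can be obtained from the nltk inaugural corpus (the file identifiers are the standard nltk ones; the corpus has to be downloaded first):

```python
import nltk
from nltk.corpus import inaugural

nltk.download("inaugural")
nltk.download("punkt")   # needed for sentence tokenization

for fileid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    raw = inaugural.raw(fileid)
    print(fileid,
          "characters:", len(raw),
          "words:", len(inaugural.words(fileid)),
          "sentences:", len(inaugural.sents(fileid)))
```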
The stopwords library contains common stop words like ‘and’, ‘a’, ‘is’, ‘to’, ‘of’, ‘.’, etc., that usually have no
importance in understanding sentiment and little usefulness in machine learning algorithms. The stopwords
present in the package are universally accepted stopwords, and we can add to them using the .extend()
function or remove them as per our requirement.
Also, we need to specify the language we are working with before defining the functions, as there are many
language packages. Here, we will use English.
Stemming is a process which helps the processor understand words that have a similar meaning. The words
are reduced to their base or root form by removing affixes. It is heavily used in search engines. For example,
eating, eats and eaten are all reduced to eat after stemming.
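A sketch of the stopword removal and stemming steps described above, using nltk's English stopword list and the Porter stemmer; Roosevelt's speech is used as the example:

```python
import nltk
from nltk.corpus import stopwords, inaugural
from nltk.stem import PorterStemmer
from collections import Counter

nltk.download("stopwords")
nltk.download("inaugural")

# English stopword list; extra items can be added with .extend() if needed.
stop_words = stopwords.words("english")
stop_words.extend([".", ",", ";", "--"])        # example punctuation additions

stemmer = PorterStemmer()

words = inaugural.words("1941-Roosevelt.txt")
cleaned = [stemmer.stem(w.lower()) for w in words
           if w.lower() not in stop_words and w.isalpha()]

# Most frequent stemmed words, e.g. 'nation', 'peopl', 'spirit', ...
print(Counter(cleaned).most_common(5))
```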
Here ‘peopl’, ‘spirit’, ‘life’ and ‘democraci’ are all tied for 3rd place because they have the same number of
occurrences.
Most occurring word: Nation.
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords)
A word cloud is a data visualization technique used for representing text data in which the size of each
word indicates its frequency or importance. For generating word clouds we need the wordcloud package; it
is not installed in the kernel by default, so we have to install it.
After importing the package we will again remove the stopwords, but will not perform stemming, since
removing stop words already filters out the unwanted words that carry no value for the analysis.
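A sketch of generating one of the word clouds with the wordcloud package; the figure size, dimensions and colours are arbitrary choices, and the nltk corpora from the earlier steps are assumed to be available:

```python
import matplotlib.pyplot as plt
from nltk.corpus import stopwords, inaugural
from wordcloud import WordCloud

text = inaugural.raw("1941-Roosevelt.txt")

# Build the cloud directly from the raw text, filtering English stopwords.
wc = WordCloud(stopwords=set(stopwords.words("english")),
               background_color="white", width=800, height=400).generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud of Roosevelt's Speech")
plt.show()
```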
Word Cloud of Roosevelt’s Speech:
We can see some highlighted words like “nation”, “know”, “people”, etc., which we observed as top words in
the previous question. The bigger the word, the higher its frequency.
Word Cloud of Kennedy’s Speech:
Word Cloud of Nixon’s Speech:
Insights –
Our objective was to look at all 3 speeches and analyse them, to find the strength and sentiment of the
speeches.
Based on the outputs we can see that there are some similar words present in all the speeches.
These words may be the points which inspired many people and helped these leaders win the seat of
President of the United States of America.
Among all the speeches, “nation” is the word that is significantly highlighted in all three.