Business Report - Machine Learning
TABLE OF CONTENTS
Project objective
Assumptions
Exploratory data analysis
▪ Summary of the dataset
▪ Bivariate analysis
Converting object data type into categorical
Splitting the data into train and test data
▪ Dimensions on the train and test data
Model building
Model Prediction
Model evaluation
Conclusion
Recommendation
Problem 1:
You are hired by one of the leading news channels CNBE who wants to analyze recent elections. This survey was
conducted on 1525 voters with 9 variables. You have to build a model, to predict which party a voter will vote for on the
basis of the given information, to create an exit poll that will help in predicting overall win and seats covered by a
particular party.
Data Dictionary
2. age: in years
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High
Project Objective:
The objective of the report is to explore the dataset "Election_Data.xlsx" in Python (Jupyter
Notebook) and generate insights about the dataset. This exploration report will consist of the following:
• Graphical exploration
Assumptions:
Predictive modelling is the general concept of building a model that can make predictions. Typically, such a model
includes a machine learning algorithm that learns certain properties from a training dataset to make those
predictions. Pattern classification is to assign discrete class labels to observations as outcomes of a prediction.
Machine learning model predictions allow businesses to make highly accurate guesses as to the likely outcomes
of a question based on historical data, which can be about all kinds of things. These provide the business with
insights that result in tangible business value.
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.
Dataset: Election_Data.xls
SR.NO  Unnamed: 0  vote    age  economic.cond.national  economic.cond.household  Blair  Hague  Europe  political.knowledge  gender
0      1           Labour  43   3                       3                        4      1      2       2                    female
1      2           Labour  36   4                       4                        4      4      5       2                    male
2      3           Labour  35   4                       4                        5      2      3       2                    male
3      4           Labour  24   4                       2                        2      1      4       0                    female
4      5           Labour  41   2                       2                        1      1      6       2                    male
Information on dataset:
Inference
➢ The column “Unnamed: 0” is removed from the dataset before proceeding further, as it is insignificant for the analysis.
Inference
➢ Hague: average Assessment of the Conservative leader is 2.7 whereas Blair: average Assessment of
Duplicates:
There are 8 duplicate rows in the dataframe, which were dropped. There are now 1517 rows in the dataframe.
Skewness:
Inference
➢ age has positive skewness, whereas the other variables have negative skewness.
Unique values for categorical variables vote: Party choice: Conservative or Labour
Labour 0.69677
Conservative 0.30323
Name: vote, dtype: float64
Inference
➢ Nearly 70% of voters chose the Labour party and about 30% chose the Conservative party.
Gender
female 0.53263
male 0.46737
Name: gender, dtype: float64
Inference
➢ About 53% of the voters are female and about 47% are male.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Inference
➢ Only the age variable appears roughly normally distributed; the other variables show multimodal, skewed distributions.
Bivariate Analysis:
➢ Some of the attributes look like they may have an exponential distribution
➢ The highest positive correlation is between “economic_cond_national” and “economic_cond_household” (35%), with a
nearly identical value between “Blair” and “economic_cond_national” (35%).
➢ The strongest negative correlation is between “Blair” and “Europe” (29%), followed by “Blair” and “Hague” (24%).
All of these correlations are weak, so there is little or no chance of multicollinearity.
The plot above is a strip plot with jitter set to True, showing how the assessment of the Conservative leader
“Hague” is distributed across voters of various ages. Most voters fall in the 2 and 4 groups.
Catplot Analysis - Blair(count) on economic.cond.national
The assessment of current national economic conditions against Blair shows that cluster no. 3 has very few
observations, whereas cluster no. 4 has more.
Boxplot
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data
into train and test (70:30).
Inference
➢ Codes are an array of integers which are the positions of the actual values in the categories array.
➢ Here vote and gender are categorical variables, now converted into integers using codes.
➢ Most of the time, a dataset will contain features that vary widely in magnitude, units, and range. Since many
machine learning algorithms use the Euclidean distance between two data points in their computations, this is a
problem.
➢ Differences in the scales across input variables may increase the difficulty of the problem being
modelled.
➢ Scaling means transforming your data so that it fits within a specific scale, like 0-100 or 0-1.
➢ Usually, distance-based methods (e.g. KNN) require scaling, as they are sensitive to extreme differences in
magnitude, which can cause bias.
➢ Tree-based methods (e.g. Decision Trees) split on feature thresholds and in general do not require scaling.
➢ In this dataset, age is only continuous variable and rest of the variables have 1 to 5. Age variable is
➢ The only scaling performed is Z-score scaling, applied to the ‘age’ variable.
➢ Z-score scaling is the most common form of scaling and uses the formula (x - mean) / standard deviation.
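The encoding and scaling steps described above can be sketched as follows. This is a minimal illustration on a hypothetical five-row sample, not the report's notebook code; it assumes pandas and uses the population standard deviation (ddof=0), which is what scikit-learn's StandardScaler would use.

```python
import pandas as pd

# Hypothetical five-row sample shaped like the survey data (not the real file).
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Conservative"],
    "gender": ["female", "male", "male", "female", "male"],
    "age": [43, 36, 35, 24, 41],
})

# Convert string columns to the categorical dtype, then to integer codes.
# Categories are sorted alphabetically, so Conservative -> 0 and Labour -> 1.
for col in ["vote", "gender"]:
    df[col] = df[col].astype("category").cat.codes

# Z-score scaling of age only: (x - mean) / standard deviation.
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)

print(df)
```

After this step the age column has mean 0 and standard deviation 1, while the ordinal survey columns would be left on their original scales.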
Train-Test Split
Separating the independent variables (X, the features) and the dependent variable (y, the target) for the classification models
y = dependent variable ("vote")
The training set for the independent variables: (1061, 8)
The training set for the dependent variable: (1061,)
The test set for the independent variables: (456, 8)
The test set for the dependent variable: (456,)
Inference
Splitting the dataset into train and test sets (70:30) to build the Logistic Regression and LDA models
X_train: 70% of the rows, randomly chosen, from the 8 feature columns. These are the training independent
variables
X_test: the remaining 30% of the rows from the 8 feature columns. These are the test independent
variables
y_train: the "vote" values for the same 70% of rows. This is the training
dependent variable
y_test: the "vote" values for the remaining 30% of rows. This is the test dependent
variable
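A minimal sketch of the 70:30 split, assuming scikit-learn's train_test_split. The data here is a synthetic stand-in with the report's dimensions (1517 rows, 8 features), and random_state=1 is an assumed seed, not necessarily the one used in the report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the report's dimensions: 1517 rows, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1517, 8))
y = rng.integers(0, 2, size=1517)

# test_size=0.30 gives the 70:30 split; random_state=1 is an assumed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print(X_train.shape, y_train.shape)  # (1061, 8) (1061,)
print(X_test.shape, y_test.shape)    # (456, 8) (456,)
```

The resulting shapes match the dimensions listed above, because 30% of 1517 rows is rounded up to 456 test rows.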
Logistic regression is a fundamental classification technique. It belongs to the group of linear classifiers and is somewhat
similar to polynomial and linear regression. It is the go-to method for binary classification problems (problems with two
class values).
1. scikit-learn (sklearn)
2. statsmodels
Here the sklearn library is used for the model
Applying GridSearchCV for Logistic Regression
The probabilities on the training set The probabilities on the test set
We now fit GridSearchCV for the Logistic Regression model by training it on our
independent and dependent variables.
Inference
#Using GridSearchCV, we input various parameters such as 'max_iter', 'penalty', 'solver' and 'tol', which helps us
to find the best grid for a better predictive model
#max_iter is an integer (100 by default) that defines the maximum number of iterations by the solver
during model fitting.
#solver is a string that decides what solver to use for fitting the model ('lbfgs' by default in scikit-learn 0.22
and later, 'liblinear' in earlier versions). Other options are 'newton-cg', 'sag', and 'saga'.
#penalty is a string ('l2' by default) that decides whether there is regularization and which approach to use.
Other options are 'l1', 'elasticnet', and 'none'.
#bestgrid:{'max_iter': 1000, 'penalty': 'l2', 'solver': 'saga', 'tol': 1e-05}
#Accuracy score of training data:83.5%
#Accuracy score of test data:83.5%
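The grid search described above could look roughly like the sketch below. It runs on synthetic data, and the candidate values in param_grid are assumptions, since the report names the parameters but not the full grid. The penalty is left at its default ('l2').

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the encoded election dataset.
X, y = make_classification(n_samples=400, n_features=8, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Assumed grid: the report lists the parameter names, not the candidate values.
param_grid = {
    "solver": ["saga", "lbfgs"],
    "tol": [1e-4, 1e-5],
    "max_iter": [1000],
}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)

print("best grid:", grid.best_params_)
print("train accuracy:", grid.score(X_train, y_train))
print("test accuracy:", grid.score(X_test, y_test))
```

GridSearchCV refits the best combination on the whole training set, so grid.score and grid.predict_proba use that refitted model.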
The probabilities on the training set The probabilities on the test set
Applying GridSearchCV for LDA
➢ Using GridSearchCV, we search over the 'solver' parameter of LDA to find the best grid for a better predictive model.
➢ Here 'solver': ['svd', 'lsqr', 'eigen'] is searched, with the other parameters left at their defaults.
➢ ‘svd’: Singular value decomposition (default). Does not compute the covariance matrix,
therefore this solver is recommended for data with many features.
➢ ‘lsqr’: Least squares solution. Can be combined with shrinkage or custom covariance
estimator.
➢ ‘eigen’: Eigenvalue decomposition. Can be combined with shrinkage or custom covariance
estimator.
➢ bestgrid:{'solver': 'svd'}
➢ Training Data Class Prediction with a cut-off value of 0.5
➢ Test Data Class Prediction with a cut-off value of 0.5
➢ Accuracy score of training data:83.4%
➢ Accuracy score of test data:83.3%
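A minimal sketch of the LDA solver search and the 0.5 cut-off, again on synthetic data rather than the election dataset. The cv value and the synthetic data settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic binary data; n_redundant=0 keeps the covariance matrix well conditioned.
X, y = make_classification(n_samples=400, n_features=8, n_redundant=0, random_state=0)

grid = GridSearchCV(
    LinearDiscriminantAnalysis(),
    {"solver": ["svd", "lsqr", "eigen"]},
    cv=3,
)
grid.fit(X, y)

# Class prediction with a cut-off value of 0.5 on the positive-class probability.
proba = grid.predict_proba(X)[:, 1]
pred = (proba > 0.5).astype(int)

print("best grid:", grid.best_params_)
print("accuracy:", (pred == y).mean())
```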
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model
KNN is a non-parametric, lazy learning algorithm. Non-parametric means that no assumption is made about the
underlying data distribution. In KNN, K is the number of nearest neighbors, and this number is the core
deciding factor.
KNN has the following basic steps:
➢ Calculate distance
➢ Find closest neighbors
➢ Vote for labels
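The three steps above can be sketched in plain Python. This is an illustrative toy version on hypothetical two-feature points, not the scikit-learn implementation used in the report.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    # Step 1: calculate the Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: find the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: vote for the label; the most common label among the neighbours wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per voter.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
labels = ["Labour", "Labour", "Labour", "Conservative", "Conservative"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # Labour
print(knn_predict(points, labels, (4.0, 4.0), k=3))  # Conservative
```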
We will be using the popular scikit-learn package.
➢ The k-nearest neighbor algorithm is imported from the scikit-learn package.
➢ Create feature and target variables.
➢ Split data into training and test data.
➢ Generate a k-NN model using a neighbors value.
➢ Train or fit the data into the model.
➢ Predict on new data.
➢ First, import the KNeighborsClassifier class and create a KNN classifier object by passing the
number of neighbors as an argument to KNeighborsClassifier().
➢ Then, fit your model on the train set using fit() and perform prediction on the test set using predict().
➢ Let us build a KNN classifier model for k=15.
misclassification error
[0.2149122807017544,
0.1864035087719298,
0.17763157894736847,
0.18201754385964908,
0.17763157894736847,
0.17324561403508776,
0.17324561403508776,
0.16885964912280704,
0.17105263157894735,
0.16885964912280704]
Plot misclassification error vs k (with k value on X-axis)
The number of neighbors (K) in KNN is a hyperparameter that you need to choose at the time of model building.
You can think of K as a controlling variable for the prediction model. n_neighbors = 15
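A list of misclassification errors like the one above can be produced with a loop over candidate k values. The sketch below assumes odd k from 1 to 19 (ten values, matching the length of the list) and uses synthetic data, so its numbers will differ from the report's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the scaled election dataset.
X, y = make_classification(n_samples=400, n_features=8, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Assumed candidate values: odd k from 1 to 19.
k_values = list(range(1, 20, 2))
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # Misclassification error = 1 - test accuracy.
    errors.append(1 - knn.score(X_test, y_test))

# The k with the lowest error; plotting errors against k_values gives the curve.
best_k = k_values[errors.index(min(errors))]
print(best_k, errors)
```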
The probabilities on the training set The probabilities on the test set
Accuracy score of training data:83.5%
Accuracy score of test data:82.2%
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging) and Boosting.
Model tuning
Tuning is the process of maximizing a model’s performance without overfitting or creating too high a variance.
In machine learning, this is accomplished by selecting appropriate “hyperparameters.” Hyperparameters can be
thought of as the “dials” or “knobs” of a machine learning model. Choosing an appropriate set of
hyperparameters is crucial for model accuracy, but can be computationally challenging. Hyperparameters differ
from other model parameters in that they are not learned by the model automatically through training
methods. Instead, these parameters must be set manually. Many methods exist for selecting appropriate
hyperparameters.
Grid Search
Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of
hyperparameter optimization. This method involves manually defining a subset of the hyperparameter space
and exhausting all combinations of the specified hyperparameter subsets. Each combination’s performance is
then evaluated, typically using cross-validation, and the best performing hyperparameter combination is
chosen.
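The idea of exhausting all combinations can be illustrated in a few lines of plain Python. The search space and the score function below are hypothetical placeholders; in practice the score would come from cross-validating a fitted model.

```python
from itertools import product

# Hypothetical search space: two hyperparameters with a few candidate values each.
param_grid = {"n_neighbors": [5, 10, 15], "weights": ["uniform", "distance"]}

def score(params):
    # Placeholder for a cross-validated score; a real run would fit and evaluate a model here.
    bonus = 0.5 if params["weights"] == "distance" else 0.0
    return -abs(params["n_neighbors"] - 10) + bonus

# Build every combination of the candidate values (3 x 2 = 6 combinations).
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Evaluate each combination and keep the best-scoring one.
best = max(combinations, key=score)
print(len(combinations), best)  # 6 {'n_neighbors': 10, 'weights': 'distance'}
```

The number of combinations grows multiplicatively with each added hyperparameter, which is why grid search can become computationally expensive.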
A Bagging classifier.
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset
of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to
form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box
estimator (e.g., RandomForest), by introducing randomization into its construction procedure and then
making an ensemble out of it.
Bagging and random forests are “bagging” algorithms that aim to reduce the complexity of models that
overfit the training data.
Bagging (Random Forest should be applied for Bagging)
Inference
Set the hyperparameters in the RandomForestClassifier:
n_estimators is the number of decision trees used in making the forest
(default = 100).
max_depth is an integer that sets the maximum depth of each tree. The default is None, which means the
nodes are expanded until all the leaves are pure or contain fewer than min_samples_split samples.
min_samples_split is the minimum number of samples required to split an internal node.
min_samples_leaf defines the minimum number of samples needed at each leaf. The default here is
1.
We now fit the bagging model, with the random forest classifier as its base estimator, by training it on our
independent and dependent variables.
At this point, you have the classification model defined
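A minimal sketch of a random forest wrapped in a bagging classifier, as described above. The hyperparameter values are assumptions chosen for illustration (the report does not list the ones it used), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the election dataset.
X, y = make_classification(n_samples=400, n_features=8, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Assumed hyperparameter values for the forest.
rf = RandomForestClassifier(
    n_estimators=50,
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=1,
)

# The forest is passed positionally as the base estimator of the bagging ensemble.
bagging = BaggingClassifier(rf, n_estimators=5, random_state=1)
bagging.fit(X_train, y_train)

# Class probabilities on the training and test sets.
train_proba = bagging.predict_proba(X_train)
test_proba = bagging.predict_proba(X_test)
print(bagging.score(X_train, y_train), bagging.score(X_test, y_test))
```

Limiting max_depth and raising min_samples_leaf are common ways to keep the individual trees from fitting the training data too closely.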
The probabilities on the training set The probabilities on the test set
Boosting
Boosting is an ensemble strategy that consecutively builds on weak learners in order to generate one final
strong learner. A weak learner is a model that may not be very accurate or may not take many predictors into
account. By building a weak model, drawing conclusions about the various feature importances and parameters,
and then using those conclusions to build a new, stronger model, Boosting can effectively convert weak learners
into a strong learner.
AdaBoost uses decision stumps as weak learners. A Decision Stump is a Decision Tree model that only splits
off at one level, ergo the final prediction is based off only one feature. When AdaBoost makes its first
Decision Stump, all observations are weighted evenly.
To correct previous error, when moving to the second Decision Stump, the observations that were
classified incorrectly now carry more weight than the observations that were correctly classified. AdaBoost
continues this strategy until the best classification model is built.
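The reweighting step can be made concrete with a small worked example. The sketch below uses the classic discrete AdaBoost update on a hypothetical first stump that misclassifies 2 of 10 observations; the exact update in a given library variant (such as SAMME.R) may differ in detail.

```python
import math

# Hypothetical outcome of the first decision stump on 10 observations:
# True = classified correctly, False = misclassified.
correct = [True, True, True, False, True, True, False, True, True, True]

n = len(correct)
weights = [1 / n] * n  # first stump: all observations weighted evenly

# Weighted error of the stump and its say (alpha) in the final vote.
error = sum(w for w, ok in zip(weights, correct) if not ok)
alpha = 0.5 * math.log((1 - error) / error)

# Increase the weight of misclassified observations, decrease the rest, then normalise.
updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(updated)
weights = [w / total for w in updated]

print([round(w, 4) for w in weights])
```

With an error of 0.2, each misclassified observation ends up with weight 0.25 and each correctly classified one with 0.0625, so the second stump concentrates on the two observations the first one got wrong.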
➢ GridSearchCV for AdaBoost
➢ Using GridSearchCV, we input various parameters such as {'algorithm', 'learning_rate', 'n_estimators'},
which helps us to find the best grid for a better predictive model
➢ n_estimators is the maximum number of estimators at which boosting is terminated. If a perfect fit
is reached, the algorithm stops early. The default here is 50.
➢ learning_rate is the weight applied to each weak learner at each boosting iteration; a lower value
shrinks each learner's contribution, so there is a trade-off between learning_rate and n_estimators.
➢ The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with
fewer boosting iterations.
➢ bestgrid: {'algorithm': 'SAMME.R', 'learning_rate': 0.3, 'n_estimators': 51}
The probabilities on the training set The probabilities on the test set
Gradient Boosting
The probabilities on the training set The probabilities on the test set
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models
and write inference which model is best/optimized.
Logistic Regression Model
Confusion matrix on the training and test data
Inference
Training data:
True Negative : 196 False Positive : 111
False Negative : 68 True Positive : 686
Test data:
True Negative : 113 False Positive : 40
False Negative : 35 True Positive : 268
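The accuracy, precision, recall and f1 values reported for each model follow directly from the four confusion matrix counts. A minimal sketch, using the test-data counts listed above for the Logistic Regression model (the function name is illustrative):

```python
def classification_metrics(tn, fp, fn, tp):
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp)
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Test-data confusion matrix for the Logistic Regression model.
test_metrics = classification_metrics(tn=113, fp=40, fn=35, tp=268)
print({name: round(value, 2) for name, value in test_metrics.items()})
# {'accuracy': 0.84, 'precision': 0.87, 'recall': 0.88, 'f1': 0.88}
```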
Train Data:
➢ AUC: 89%
➢ Accuracy: 83%
➢ precision : 86%
➢ recall : 91%
➢ f1 :88%
Test Data:
➢ AUC: 88.3%
➢ Accuracy: 84%
➢ precision: 87%
➢ recall : 88%
➢ f1 : 88%
➢ Training and test set results are very close, which suggests the model is neither overfitting nor underfitting.
LDA Model
Training data:
Test data:
Inference
Train Data:
➢ AUC: 88.9%
➢ Accuracy: 83%
➢ precision : 86%
➢ recall : 91%
➢ f1 :89%
Test Data:
➢ AUC: 88.8%
➢ Accuracy: 83%
➢ precision : 86%
➢ recall : 89%
➢ f1 : 88%
➢ Training and test set results are very close, which suggests the model is neither overfitting nor underfitting.
KNN Model
Training data:
Test data:
Inference
Train Data:
➢ AUC: 91%
➢ Accuracy: 85%
➢ precision : 88%
➢ recall : 92%
➢ f1 :89%
Test Data:
➢ AUC: 89.3%
➢ Accuracy: 83%
➢ precision :85%
➢ recall : 90%
➢ f1 : 88%
➢ Training and test set results are close, which suggests the model is neither overfitting nor underfitting.
Naïve Bayes Model
Training data:
Test data:
Inference
Train Data:
➢ AUC: 88.8%
➢ Accuracy: 84%
➢ precision : 88%
➢ recall : 90%
➢ f1 :89%
Test Data:
➢ AUC: 87.6%
➢ Accuracy: 82%
➢ precision :88%
➢ recall : 88%
➢ f1 : 88%
➢ Training and test set results are close, which suggests the model is neither overfitting nor underfitting.
Bagging (Random Forest)
Training data:
Test data:
Inference
Train Data:
➢ AUC: 89.7%
➢ Accuracy: 84%
➢ precision : 85%
➢ recall : 93%
➢ f1 :89%
Test Data:
➢ AUC: 88.4%
➢ Accuracy: 82%
➢ precision :82%
➢ recall : 91%
➢ f1 : 87%
➢ Training and test set results are close, which suggests the model is neither overfitting nor underfitting.
AdaBoostClassifier
Training data:
Test data:
Inference
Train Data:
➢ AUC: 90.6%
➢ Accuracy: 84%
➢ precision : 85%
➢ recall : 92%
➢ f1 :89%
Test Data:
➢ AUC: 88.9%
➢ Accuracy: 82%
➢ precision :85%
➢ recall : 89%
➢ f1 : 87%
➢ Training and test set results are close, which suggests the model is neither overfitting nor underfitting.
Gradient Boosting
Training data:
Test data:
Inference
Train Data:
➢ AUC: 93.4%
➢ Accuracy: 87%
➢ precision : 89%
➢ recall : 93%
➢ f1 :91%
Test Data:
➢ AUC: 90.1%
➢ Accuracy: 83%
➢ precision :86%
➢ recall : 90%
➢ f1 : 88%
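➢ Training results are somewhat higher than test results (accuracy 87% vs 83%, AUC 93.4% vs 90.1%), which
suggests mild overfitting, although test performance remains in line with the other models.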
➢ Almost all the models performed well, with accuracy between 82% and 84%.
➢ Comparing all the models, the Gradient Boosting model is the best model for this dataset.
➢ AUC of train and test in the Gradient Boosting model is 93% and 90% respectively.
➢ f1 score of train and test in the Gradient Boosting model is 91% and 88% respectively.
➢ Precision of train and test in the Gradient Boosting model is 89% and 86% respectively.
➢ Recall of train and test in the Gradient Boosting model is 93% and 90% respectively.
➢ Accuracy, AUC, precision and recall on the test data are broadly in line with those on the training data.
Almost all the models performed well on the scaled data, with accuracy between 82% and 84%.
Gradient Boosting is the best and most optimised model, with a training accuracy of 87% and also best
Model     Data   Accuracy  AUC   Recall  Precision  F1 Score
LR        Train  0.83      0.89  0.91    0.86       0.88
LR        Test   0.84      0.88  0.88    0.87       0.88
LDA       Train  0.83      0.9   0.91    0.86       0.89
LDA       Test   0.83      0.88  0.89    0.86       0.88
KNN       Train  0.85      0.91  0.92    0.88       0.89
KNN       Test   0.83      0.89  0.9     0.85       0.88
NB        Train  0.84      0.89  0.9     0.88       0.89
NB        Test   0.82      0.88  0.87    0.87       0.87
BAGGING   Train  0.84      0.9   0.93    0.85       0.89
BAGGING   Test   0.82      0.88  0.93    0.82       0.87
ADA       Train  0.84      0.91  0.92    0.86       0.89
ADA       Test   0.82      0.89  0.89    0.85       0.87
Gradient  Train  0.87      0.93  0.93    0.89       0.91
Gradient  Test   0.87      0.9   0.9     0.86       0.88
The main business objective of this project is to build a model to predict which party a voter will vote for
based on the given information, to create an exit poll that will help in predicting overall win and seats
covered by a particular party.
Various models were built on the scaled dataset. Of these, the Gradient Boosting model gave the
best/optimized accuracy, at 87%, in predicting which party a voter will vote for based on the given information, so
an exit poll can be built that helps in predicting the overall win and the seats covered by a particular party.
➢ 'age': 4
➢ 'economic.cond.national': 6
➢ 'economic.cond.household': 10
➢ 'Blair': 8
➢ 'Hague': 20
➢ 'Europe': 11
➢ 'political_knowledge': 7
➢ 'gender': 21
Conclusion
Based on these machine learning models, we can predict which party a voter is likely to vote for, and the predictions
should improve with more sample voters. Exit polls can be created with this model to predict which party will win or
lose and the seats covered by a particular party.
Problem 2:
Text analysis on speeches of the Presidents of the United States of America: President Franklin D. Roosevelt
in 1941, President John F. Kennedy in 1961, President Richard Nixon in 1973
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:
2.1 Find the number of characters, words, and sentences for the mentioned documents.
In natural language processing, words that carry little useful information are referred to as stop words.
Stop Words:
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine
has been programmed to ignore, both when indexing entries for searching and when
retrieving them as the result of a search query. We would not want these words to take
up space in our database or take up valuable processing time. We can remove
them easily by storing a list of words that we consider to be stop words.
Libraries used:
nltk.download('stopwords')
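A minimal sketch of the filtering step. To keep it runnable without any downloads, it uses a small hand-written stop list and a made-up sentence instead of NLTK's English stop word list and the actual speeches; both are assumptions for illustration.

```python
import string

# Small hand-written stop list for illustration; NLTK's English list is much longer.
stop_words = {"the", "a", "an", "in", "of", "to", "and", "is", "we", "our"}

# Made-up sentence standing in for a line of a speech.
text = "In the presence of our God, we renew the faith of a nation."

# Lowercase, strip punctuation, split into words, then drop the stop words.
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
tokens = [word for word in cleaned.split() if word not in stop_words]

print(tokens)  # ['presence', 'god', 'renew', 'faith', 'nation']
```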
Stemming is the process of reducing inflected words to their root forms, mapping
a group of related words to the same stem even if the stem itself is not a valid word in the
language.
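As an illustration, the sketch below stems a few words with NLTK's PorterStemmer. The choice of the Porter stemmer is an assumption (the report does not name the stemmer), and the word list is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words; the stems need not be valid English words.
words = ["people", "democracy", "government", "history", "nation"]
stems = [stemmer.stem(word) for word in words]

print(stems)
```

Stems such as 'peopl' and 'democraci' are not dictionary words, which is expected behaviour for a stemmer.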
Library used:
['nation', 'know', 'peopl', 'spirit', 'life', 'democraci', 'us', 'america', 'live', 'year',
'human', 'freedom', 'measur', 'men', 'govern', 'new', 'bodi', 'mind', 'speak', 'day',
'state', 'american', 'must', 'someth', 'faith', 'unit', 'task', 'preserv', 'within',
'histori', 'three', 'form', 'futur', 'seem', 'hope', 'understand', 'thing', 'free',
'alon', 'still', 'everi', 'contin', 'like', 'person', 'world', 'sacr', 'word', 'came',
'land', 'first']
['let', 'us', 'power', 'world', 'nation', 'side', 'new', 'pledg', 'ask', 'citizen',
'peac', 'shall', 'free', 'final', 'presid', 'fellow', 'freedom', 'begin', 'man',
'hand', 'human', 'first', 'gener', 'american', 'war', 'alway', 'know', 'support',
'unit', 'cannot', 'hope', 'help', 'weak', 'arm', 'countri', 'call', 'today', 'well',
'god', 'form', 'poverti', 'life', 'globe', 'right', 'state', 'dare', 'word', 'go',
'friend', 'bear']
eat', 'year', 'home', 'abroad', 'make', 'togeth', 'shall', 'time', 'polici', 'role',
'four', 'war', 'today', 'era', 'progress', 'other', 'build', 'act', 'challeng',
'one', 'mr', 'share', 'meet', 'promis', 'long', 'work', 'preserv', 'freedom', 'place
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)
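One way to find the top three words is collections.Counter. The token list below is a hypothetical stand-in for one speech after stop word removal, so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical token list standing in for one speech after stop word removal.
tokens = [
    "nation", "know", "spirit", "nation", "life",
    "know", "nation", "spirit", "know", "nation",
]

# most_common(3) returns the three most frequent words with their counts.
top_three = Counter(tokens).most_common(3)
print(top_three)  # [('nation', 4), ('know', 3), ('spirit', 2)]
```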
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords)
Wordcloud
Many times you might have seen a cloud filled with lots of words in different sizes, which represent the
frequency or the importance of each word. This is called Tag Cloud or WordCloud.
Library used
Roosevelt speech
Kennedy speech
Inference:
Nixon speech
Inference:
➢ Most frequent words are America, let, us, nation
➢ Less frequent words are flimsy, adopted, saw
Conclusion: From the data in '1941-Roosevelt.txt', '1961-Kennedy.txt' and '1973-Nixon.txt', we derived
some interesting insights, such as the number of characters, words, and sentences in the speeches. To identify the
strength and the sentiment of these presidential speeches, the stop words and punctuation were removed and the
text was lowercased, along with stemming. We analysed some of the common words from their speeches which
inspired m