Unit 3
Suppose you ask a complex question to thousands of random people, then aggregate their
answers. In many cases you will find that this aggregated answer is better than an expert’s
answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of
a group of predictors (such as classifiers or regressors), you will often get better predictions
than with the best individual predictor. A group of predictors is called an ensemble; thus, this
technique is called Ensemble Learning, and an Ensemble Learning algorithm is called
an Ensemble method.
For example, you can train a group of Decision Tree classifiers, each on a different random
subset of the training set. To make predictions, you just obtain the predictions of all
individual trees, then predict the class that gets the most votes. Such an ensemble of Decision
Trees is called a Random Forest, and despite its simplicity, this is one of the most
powerful Machine Learning algorithms available today.
Definition
Ensemble learning is the process by which multiple models, such as classifiers or experts,
are strategically generated and combined to solve a particular computational
intelligence problem. Ensemble learning is primarily used to improve the (classification,
prediction, function approximation, etc.) performance of a model, or reduce the likelihood of
an unfortunate selection of a poor one. Other applications of ensemble learning include
assigning a confidence to the decision made by the model, selecting optimal (or near optimal)
features, data fusion, incremental learning, nonstationary learning and error-correcting. In
learning models, noise, variance, and bias are the major sources of error. The ensemble
methods in machine learning help minimize these error-causing factors, thereby ensuring the
accuracy and stability of machine learning (ML) algorithms.
Example 1: If you are planning to buy an air-conditioner, would you enter a showroom and
buy the air-conditioner that the salesperson shows you? The answer is probably no. In this
day and age, you are likely to ask your friends, family, and colleagues for an opinion, do
research on various portals about different models, and visit a few review sites before making
a purchase decision. In a nutshell, you would not come to a conclusion directly. Instead, you
would try to make a more informed decision after considering diverse opinions and reviews.
In the case of ensemble learning, the same principle applies.
Ensemble learning helps improve machine learning results by combining several models.
This approach allows the production of better predictive performance compared to a single
model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Ensemble Methods
Boosting
Stacking Classifier
Voting Classifier
Voting Classifier:
A voting classifier is a machine learning estimator that trains various base models or estimators and predicts by aggregating the findings of each base estimator. The aggregating criterion is the combined decision of the votes cast by each estimator. The voting can be of two types:
Hard Voting: The predicted output class is the class that receives the majority of the votes from the base estimators.
Soft Voting: Voting is calculated on the predicted probabilities of the output classes, averaged across the base estimators.
The voting classifier aggregates the predicted classes or predicted probabilities on the basis of hard voting or soft voting. So if we feed a variety of base models to the voting classifier, it makes a combined prediction that is often more robust than any single base estimator.
Implementation:
For our sample classification dataset, we train four base estimators, including Logistic Regression and Random Forest, and combine them with voting aggregators. The weights parameter can be tuned by the user to emphasize or overshadow some of the predicted class labels for hard voting, or some of the class probabilities before averaging for soft voting.
We use a soft voting classifier with a weight distribution of [1, 2, 1, 1], where twice the weight is assigned to the Random Forest model. Now let's observe the benchmark results.
From the benchmark table, the voting classifier boosts the performance compared to its individual base estimators.
Bagging and pasting are techniques that are used in order to create varied subsets of the
training data. The subsets produced by these techniques are then used to train the predictors
of an ensemble.
Bagging, short for bootstrap aggregating, creates a dataset by sampling the training set with
replacement. Pasting creates a dataset by sampling the training set without replacement.
Bootstrapping
In statistics, bootstrapping refers to a resampling method that consists of repeatedly drawing samples, with replacement, from the data to form other, smaller datasets called bootstrap samples. It is as if the bootstrapping method were running a bunch of simulations on our original dataset, so that in some cases we can generalize statistics such as the mean and the standard deviation.
For example, let's say we have a set of observations: [2, 4, 32, 8, 16]. If we want each bootstrap sample to contain three observations, some possible samples are:
n = 3: [32, 4, 4], [8, 16, 2], [2, 2, 2], ...
Since we draw data with replacement, an observation can appear more than once in a single sample.
In bagging, we draw several bootstrap samples from our data and for each bootstrap sample we train one model. After that, we aggregate the models with equal weights. When replacement is not used, the method is called pasting.
Bagging (bootstrap aggregating) is an ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It decreases the variance and helps to avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of the model averaging approach.
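A minimal sketch of bagging versus pasting with scikit-learn's BaggingClassifier is shown below; X_train, y_train and X_test are assumed to exist, and the hyperparameter values are illustrative:
# Sketch: bagging an ensemble of Decision Trees; set bootstrap=False for pasting.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,    # number of trees in the ensemble
    max_samples=100,     # size of each random subset of the training set
    bootstrap=True,      # True = sampling with replacement (bagging); False = pasting
    n_jobs=-1,
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)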
Out-of-Bag Scoring
If we are using bagging, there is a chance that a given sample is never selected, while others may be selected multiple times. The probability of not selecting a specific sample in one draw is (1 - 1/n), where n is the number of samples. Therefore, the probability of never picking that sample in n draws is (1 - 1/n)^n. When n is large, this value approaches 1/e, which is approximately 0.368. This means that when the dataset is big enough, about 37% of its samples are never selected for a given predictor, and we can use them to test that predictor. This is called Out-of-Bag (OOB) evaluation.
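A hedged sketch of out-of-bag evaluation under the same assumptions (X_train and y_train assumed; oob_score=True asks scikit-learn to score each predictor on the instances it never saw):
# Sketch: out-of-bag evaluation of a bagging ensemble.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,   # evaluate each predictor on its out-of-bag instances
    n_jobs=-1,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)  # estimated accuracy on unseen data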
Random Patches and Random Subspaces
The BaggingClassifier class supports sampling the features as well. Sampling is controlled by
two hyperparameters: max_features and bootstrap_features. They work the same way as
max_samples and bootstrap, but for feature sampling instead of instance sampling. Sampling both the training instances and the features is called the Random Patches method; keeping all training instances but sampling the features is called the Random Subspaces method.
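A small sketch of the Random Patches idea (sampling instances and features together); X_train and y_train are assumed to exist, and the sampling fractions are illustrative:
# Sketch: Random Patches = sample training instances AND features for each predictor.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.7, bootstrap=True,             # instance sampling
    max_features=0.6, bootstrap_features=True,   # feature sampling
    n_jobs=-1,
)
patches_clf.fit(X_train, y_train)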
Random Forest
Random Forest is a supervised ensemble algorithm used for both Classification and Regression problems. It builds decision trees on different samples and takes their majority vote for classification and their average in the case of regression.
One of the most important features of the Random Forest Algorithm is that it can handle the
data set containing continuous variables, as in the case of regression, and categorical
variables, as in the case of classification. It performs well on both classification and regression tasks.
Before understanding the working of the random forest algorithm in machine learning, we
must look into the ensemble learning technique. Ensemble simply means combining multiple
models. Thus a collection of models is used to make predictions rather than an individual
model.
1. Bagging– It creates a different training subset from sample training data with replacement
& the final output is based on majority voting. For example, Random Forest.
2. Boosting– It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. For example, AdaBoost, XGBoost.
Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random
forest. Bagging chooses a random sample/random subset from the entire data set. Hence each
model is generated from the samples (Bootstrap Samples) provided by the Original Data with
replacement known as row sampling. This step of row sampling with replacement is
called bootstrap. Now each model is trained independently, which generates results. The
final output is based on majority voting after combining the results of all models. This step
which involves combining all the results and generating output based on majority voting, is
known as aggregation.
Now let’s look at an example by breaking it down with the help of the following figure. Here
the bootstrap sample is taken from actual data (Bootstrap sample 01, Bootstrap sample 02,
and Bootstrap sample 03) with replacement, which means there is a high possibility that each sample won't contain only unique data. The models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples are trained independently. Each model generates results as shown. Now the Happy emoji has a majority when compared to the Sad emoji. Thus, based on majority voting, the final output of the ensemble is the Happy emoji.
Steps Involved in Random Forest Algorithm
Step 1: In the Random Forest model, a subset of data points and a subset of features is selected for constructing each decision tree. Simply put, n random records and m features are taken from a data set having k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on Majority Voting for Classification or Averaging for Regression.
For example: consider the fruit basket as the data as shown in the figure below. Now n
number of samples are taken from the fruit basket, and an individual decision tree is
constructed for each sample. Each decision tree will generate an output, as shown in the
figure. The final output is considered based on majority voting. In the below figure, you can see that the majority of the decision trees give the output as an apple when compared to a banana, so the final prediction is an apple.
Important features of Random Forest:
Immune to the curse of dimensionality: Since each tree does not consider all the features, the feature space is reduced.
Parallelization: Each tree is created independently out of different data and attributes, so we can make full use of the CPU to build random forests.
Train-Test split: In a random forest, we don't have to segregate the data into train and test sets, as there will always be a portion of the data (roughly a third, the out-of-bag samples) that is not seen by a given decision tree.
Stability: Stability arises because the result is based on majority voting/ averaging.
Random Forest is an ensemble of Decision Trees, generally trained via the bagging method
(or sometimes pasting), typically with max_samples set to the size of the training set. Instead
of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use
the RandomForestClassifier class, which is more convenient and optimized for Decision
Trees. For example:
y_pred_rf = rnd_clf.predict(X_test)
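A short sketch of how such a classifier might be built before the prediction line above; X_train, y_train and X_test are assumed to exist, and the hyperparameters are illustrative:
# Sketch: Random Forest classifier, equivalent in spirit to a bagged Decision Tree ensemble.
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)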
Feature Importance
Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature.
Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use
that feature reduce impurity on average (across all trees in the forest). More precisely, it is a
weighted average, where each node's weight is equal to the number of training samples that are associated with it.
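For instance, a hedged sketch of printing these scores for a Random Forest trained on the iris dataset (the dataset choice and hyperparameters are illustrative):
# Sketch: feature importances of a Random Forest on the iris data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)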
Boosting
Boosting is one of the techniques that use the concept of ensemble learning. A boosting
algorithm combines multiple simple models (also known as weak learners or base estimators)
to generate the final output. It is done by building a model by using weak models in series.
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner.
Boosting Methods
1. AdaBoost
2. Gradient Boosting
There are several boosting algorithms; AdaBoost was the first really successful boosting algorithm developed for the purpose of binary classification. AdaBoost is an abbreviation for Adaptive Boosting and is a prevalent boosting technique that combines multiple "weak classifiers" into a single "strong classifier."
AdaBoost One way for a new predictor to correct its predecessor is to pay a bit more
attention to the training instances that the predecessor underfitted. This results in new
predictors focusing more and more on the hard cases. This is the technique used by
AdaBoost. For example, when training an AdaBoost classifier, the algorithm first trains a
base classifier (such as a Decision Tree) and uses it to make predictions on the training set.
The algorithm then increases the relative weight of misclassified training instances. Then it
trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on.
Each instance weight w(i) is initially set to 1/m, where m is the number of training instances. A first predictor is trained, and its weighted error rate r_j is computed on the training set (Equation 7-1):
r_j = (sum of the weights w(i) of the misclassified instances) / (sum of all the weights w(i))
The predictor's weight α_j is then computed using Equation 7-2, where η is the learning rate hyperparameter (defaults to 1). The more accurate the predictor is, the higher its weight will be:
α_j = η log((1 - r_j) / r_j)
Scikit-Learn uses a multiclass version of AdaBoost called SAMME (which stands for
Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are
just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class
probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of
SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities
rather than predictions and generally performs better. The following code trains an AdaBoost
classifier based on 200 Decision Stumps using Scikit-Learn’s AdaBoostClassifier class (as
you might expect, there is also an AdaBoostRegressor class). A Decision Stump is a Decision
Tree with max_depth=1—in other words, a tree composed of a single decision node plus two
leaf nodes. This is the default base estimator for the AdaBoostClassifier class:
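A sketch following the description above (X_train and y_train are assumed to exist; the book additionally sets algorithm="SAMME.R" and learning_rate=0.5, options that newer scikit-learn releases may rename or deprecate):
# Sketch: AdaBoost with 200 decision stumps; training data is assumed to exist.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stump, the default base estimator
    n_estimators=200,
    learning_rate=0.5,
)
ada_clf.fit(X_train, y_train)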
Boosting is an ensemble modeling technique that attempts to build a strong
classifier from a number of weak classifiers. It is done by building a model
by using weak models in series. Firstly, a model is built from the training
data. Then the second model is built which tries to correct the errors present
in the first model. This procedure is continued and models are added until
either the complete training data set is predicted correctly or the maximum
number of models are added.
AdaBoost was the first really successful boosting algorithm developed for
the purpose of binary classification. AdaBoost is short for Adaptive
Boosting and is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier."
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data
points.
3. Increase the weight of the wrongly classified data points.
4. If the required results have been obtained, go to step 5; otherwise go to step 2.
5. End
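The numbered steps above can be turned into a minimal from-scratch sketch; it assumes X is a NumPy feature matrix, y is a NumPy array of binary labels in {-1, +1}, and decision stumps are used as the weak learners (all of these are illustrative assumptions):
# Sketch of the AdaBoost reweighting loop described in the steps above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    m = len(y)
    w = np.full(m, 1 / m)                 # step 1: equal weight for every data point
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)  # step 2: train on the weighted data
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)           # weighted error rate
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # predictor weight
        w = w * np.exp(-alpha * y * pred)  # step 3: increase weights of misclassified points
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)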
Gradient Boosting
Gradient Boosting is a powerful boosting algorithm that combines several
weak learners into strong learners, in which each new model is trained to
minimize the loss function such as mean squared error or cross-entropy of
the previous model using gradient descent. In each iteration, the algorithm
computes the gradient of the loss function with respect to the predictions of
the current ensemble and then trains a new weak model to minimize this
gradient. The predictions of the new model are then added to the ensemble,
and the process is repeated until a stopping criterion is met.
In contrast to AdaBoost, the weights of the training instances are not
tweaked; instead, each predictor is trained using the residual errors of the
predecessor as labels. There is a technique called the Gradient Boosted
Trees whose base learner is CART (Classification and Regression Trees).
The below diagram explains how gradient-boosted trees are trained for
regression problems.
y_pred = y1 + (eta * r1) + (eta * r2) + ... + (eta * rN)
where y1 is the prediction of the first tree, r1 ... rN are the residuals predicted by the subsequent trees, and eta is the learning rate.
AdaBoost vs. Gradient Boosting:
o AdaBoost uses simple decision trees with one split, known as decision stumps, as its weak learners.
o Gradient Boosting can use a wide range of base learners, such as decision trees and linear models.
Support Vector Machine(SVM)
The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane.
Example:
Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. As the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses extreme cases (support vectors), it will see the extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data and the classifier used is called the Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to
segregate the classes in n-dimensional space, but we need to find out the
best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
As this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary;
this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin. And the goal of SVM is to maximize this
margin. The hyperplane with maximum margin is called the optimal
hyperplane.
How to choose the Correct SVM
Suppose we are given 2 hyperplanes, one with 100% accuracy (HP1) on the
left side and another with >90% accuracy (HP2) on the right side. Which
one would you think is the correct classifier?
Most of us would pick HP2, thinking it is the better choice because of its maximum margin. But that is the wrong answer.
The Support Vector Machine would choose HP1 even though it has a narrow margin, because although HP2 has the maximum margin, it goes against the constraint that each data point must lie on the correct side of the margin and there should be no misclassification. This constraint is the hard constraint.
Margins
Margins are generally defined by the closest data points (called support vectors) on either side of the hyperplane.
The core of any Machine learning algorithm is the Optimization technique that is
happening behind the scene.
For the hard margin SVM described above, it can be mathematically written as:
minimize (1/2) ||w||^2 subject to y_i (w · x_i + b) >= 1 for every training point (x_i, y_i),
i.e., maximize the margin 2/||w|| while keeping every point on the correct side of it.
4. Hard and Soft SVM
We can now clearly state that HP1 is a Hard SVM (left side) while HP2 is a Soft SVM (right side).
By default, Support Vector Machine implements Hard margin SVM. It works well
only if our data is linearly separable.
In case our data is non-separable/ nonlinear then the Hard margin SVM will not
return any hyperplane as it will not be able to separate the data. Hence this is where
Soft Margin SVM comes to the rescue.
Soft margin SVM allows some misclassification to happen by relaxing the hard
constraints of Support Vector Machine.
Soft margin SVM is implemented with the help of the Regularization parameter
(C).
Now that we know what the Regularization parameter (C) does, we need to understand its relation with the Support Vector Machine.
– If the value of C is very small, the margin is wider and more misclassifications are tolerated (Soft SVM).
– If the value of C is very large, misclassifications are penalized heavily and the margin becomes narrower (approaching a Hard SVM).
The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM model (using the LinearSVC class with C=1 and the hinge loss function) to detect Iris virginica flowers:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica
svm_clf = Pipeline([("scaler", StandardScaler()),
                    ("linear_svc", LinearSVC(C=1, loss="hinge"))])
svm_clf.fit(X, y)
Non Linear SVM Classification
Non-Linear SVM:
So to separate these data points, we need to add one more dimension. For linear data we have used two dimensions, x and y, so for non-linear data we will add a third dimension, z. It can be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space will become as in the below image:
So now, SVM will divide the datasets into classes in the following way.
Consider the below image:
Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we convert it into 2-D space with z = 1, then it will become:
Hence we get a circumference of radius 1 in the case of non-linear data.
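A hedged sketch of the same idea in scikit-learn: instead of adding the z = x^2 + y^2 feature by hand, a kernelized SVC can learn a non-linear boundary directly (the moons dataset and the hyperparameters are illustrative assumptions):
# Sketch: non-linear SVM classification with a Gaussian RBF kernel.
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
nonlinear_svm = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", gamma=5, C=0.001)),
])
nonlinear_svm.fit(X, y)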
Polynomial Kernel
Similarity Features
Read the above topics by referring to the textbook, pages 218-223 (Hands-On Machine Learning).
Support Vector Regression (SVR)
Support Vector Regression (SVR) is a regression algorithm that supports both linear and non-linear regression. This method works on the principle of the Support Vector Machine. SVR differs from SVM in that SVM is a classifier used for predicting discrete categorical labels, while SVR is a regressor used for predicting continuous values.
In simple regression the idea is to minimize the error rate, while in SVR the idea is to fit the error inside a certain threshold, which means the work of SVR is to approximate the best value within a given margin called the ε-tube.
To do that, we need a kernel function that maps the data points into a higher dimension. The main terms used in SVR are:
1. Kernel: The function used to map lower-dimensional data into higher-dimensional data.
2. Hyperplane: In SVR, this is the line (or curve) that helps us predict the continuous target value.
3. Boundary Lines: These are the two lines that are drawn around the hyperplane at a distance of ε (epsilon) from it.
4. Support Vector: It is the vector that is used to define the hyperplane
or we can say that these are the extreme data points in the dataset which
help in defining the hyperplane. These data points lie close to the
boundary.
In SVM, the support vectors were used to define the hyperplane, but in SVR they are used to define the linear regression.
Working of SVR
SVR works on the principle of SVM with a few minor differences. Given data points, it tries to find the curve; but since it is a regression algorithm, instead of using the curve as a decision boundary it uses the curve to find the match between the vectors and the position of the curve. Support vectors help in determining the closest match between the data points and the function that is used to represent them.
Consider these two red lines as the decision boundary and the green line
as the hyperplane. Our objective, when we are moving on with SVR, is
to basically consider the points that are within the decision boundary
line. Our best fit line is the hyperplane that has a maximum number of
points.
The first thing that we'll understand is what the decision boundary is (the red lines above). Consider these lines as being at some distance, say 'a', from the hyperplane. So, these are the lines that we draw at distances '+a' and '-a' from the hyperplane. This 'a' in the text is basically referred to as epsilon.
wx + b = +a
wx + b = -a
Our main aim here is to decide a decision boundary at ‘a’ distance from the
original hyperplane such that data points closest to the hyperplane or the
support vectors are within that boundary line.
Hence, we are going to take only those points that are within the decision
boundary and have the least error rate, or are within the Margin of
Tolerance. This gives us a better fitting model
Can you decide what the separating line will be? You might have come up with
this:
The line fairly separates the classes. This is what SVM essentially does – simple
class separation. Now, what if the data was like this:
Here, we don’t have a simple line separating these two classes. So we’ll extend our
dimension and introduce a new dimension along the z-axis. We can now separate
these two classes:
When we transform this line back to the original plane, it maps to the circular
boundary as I’ve shown here:
Implementing Support Vector Regression (SVR) in Python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Position_Salaries.csv is assumed to contain the Position, Level and Salary columns
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values   # Level column as a 2-D array
y = dataset.iloc[:, 2].values     # Salary column
A real-world dataset contains features that vary in magnitudes, units, and range. I
would suggest performing normalization when the scale of a feature is irrelevant or
misleading.
Feature Scaling basically helps to normalize the data within a particular range.
Feature scaling basically helps to normalize the data within a particular range. Some model classes handle feature scaling internally; however, the SVR class does not, so we should perform feature scaling ourselves using Python.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1))  # reshape y to a 2-D column for the scaler
regressor = SVR(kernel='rbf')
regressor.fit(X, y.ravel())               # SVR expects a 1-D target
Kernel is the most important feature. There are many types of kernels – linear,
Gaussian, etc. Each is used depending on the dataset.
y_pred = regressor.predict(sc_X.transform([[6.5]]))     # scale the query the same way as X
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))  # back to the original salary scale
plt.scatter(X, y, color='red')                   # scaled data points
plt.plot(X, regressor.predict(X), color='blue')  # SVR predictions
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
dataset = pd.read_csv('/content/drive/MyDrive/Position_Salaries.csv')
# our dataset in this implementation is small, and thus we can print it all instead of viewing only the end
print(dataset)
Output:
[[ 1]
[ 2]
[ 3]
[ 4]
[ 5]
[ 6]
[ 7]
[ 8]
[ 9]
[10]]
Output:
It’s seen from the output above that the y_p variable is a
vector, i.e., a 1D array.
Output:
[[ 45000]
[ 50000]
[ 60000]
[ 80000]
[ 110000]
[ 150000]
[ 200000]
[ 300000]
[ 500000]
[1000000]]
Output:
Scaled X_l:
[[-1.5666989 ]
[-1.21854359]
[-0.87038828]
[-0.52223297]
[-0.17407766]
[ 0.17407766]
[ 0.52223297]
[ 0.87038828]
[ 1.21854359]
[ 1.5666989 ]]
Scaled y_p:
[[-0.72004253]
[-0.70243757]
[-0.66722767]
[-0.59680786]
[-0.49117815]
[-0.35033854]
[-0.17428902]
[ 0.17781001]
[ 0.88200808]
[ 2.64250325]]
The plot shows a non-linear relationship between
the Levels and Salary.
Implementing SVR
Since the model is now ready, we can use it and make
predictions as shown:
A = regressor.predict(StdS_X.transform([[6.5]]))  # StdS_X is the scaler fitted on X
print(A)
Output:
array([-0.27861589])
Reshaping this 1-D prediction into a 2-D array (as required by the scaler's inverse_transform) gives:
Output:
array([[-0.27861589]])
Applying the y-scaler's inverse_transform to return to the original salary scale gives:
Output:
array([[170370.0204065]])
Naïve Bayes Classifier
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood probability: the probability of the evidence given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Steps
1. Convert the given dataset into frequency tables for each feature.
2. Generate a likelihood table by finding the probabilities of the given features for each class.
3. Use Bayes' theorem to calculate the posterior probability for each class and predict the class with the highest posterior.
Example
The feature vector X is given as:
X = (x_1, x_2, x_3, ..., x_n)
Here x_1, x_2, ..., x_n represent the features, i.e., they can be mapped to outlook, temperature, humidity and windy, and y is the class variable. By substituting for X and expanding using the chain rule, we get:
P(y | x_1, ..., x_n) = [P(x_1 | y) P(x_2 | y) ... P(x_n | y) P(y)] / [P(x_1) P(x_2) ... P(x_n)]
Now, you can obtain the values for each term by looking at the dataset and substituting them into the equation. For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality can be introduced:
P(y | x_1, ..., x_n) ∝ P(y) P(x_1 | y) P(x_2 | y) ... P(x_n | y)
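As a closing illustration, a hedged sketch of a naive Bayes classifier on a tiny made-up outlook/temperature/humidity/windy table; scikit-learn's CategoricalNB is one possible implementation, and the data values below are invented purely for illustration:
# Sketch: naive Bayes on a small categorical "play" dataset (values are made up).
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    "outlook":     ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "temperature": ["hot",   "mild",  "hot",      "mild",  "cool",  "cool"],
    "humidity":    ["high",  "high",  "normal",   "normal", "normal", "high"],
    "windy":       ["no",    "yes",   "no",       "no",    "yes",   "yes"],
    "play":        ["no",    "no",    "yes",      "yes",   "yes",   "yes"],
})

enc = OrdinalEncoder()
X = enc.fit_transform(data[["outlook", "temperature", "humidity", "windy"]]).astype(int)
y = data["play"]

nb = CategoricalNB()
nb.fit(X, y)

# predict for a new day: outlook=sunny, temperature=cool, humidity=high, windy=yes
query = enc.transform([["sunny", "cool", "high", "yes"]]).astype(int)
print(nb.predict(query), nb.predict_proba(query))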