
Unit 3

dwdm

Uploaded by

tinaktm2004
Copyright
© © All Rights Reserved

Unit-III

Ensemble Learning and Random Forests

Suppose you ask a complex question to thousands of random people, then aggregate their
answers. In many cases you will find that this aggregated answer is better than an expert’s
answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of
a group of predictors (such as classifiers or regressors), you will often get better predictions
than with the best individual predictor. A group of predictors is called an ensemble; thus, this
technique is called Ensemble Learning, and an Ensemble Learning algorithm is called
an Ensemble method.

For example, you can train a group of Decision Tree classifiers, each on a different random
subset of the training set. To make predictions, you just obtain the predictions of all
individual trees, then predict the class that gets the most votes. Such an ensemble of Decision
Trees is called a Random Forest, and despite its simplicity, this is one of the most
powerful Machine Learning algorithms available today.

Definition

Ensemble learning is the process by which multiple models, such as classifiers or experts,
are strategically generated and combined to solve a particular computational
intelligence problem. Ensemble learning is primarily used to improve the (classification,
prediction, function approximation, etc.) performance of a model, or reduce the likelihood of
an unfortunate selection of a poor one. Other applications of ensemble learning include
assigning a confidence to the decision made by the model, selecting optimal (or near optimal)
features, data fusion, incremental learning, nonstationary learning and error-correcting. In
learning models, noise, variance, and bias are the major sources of error. The ensemble
methods in machine learning help minimize these error-causing factors, thereby ensuring the
accuracy and stability of machine learning (ML) algorithms.

Example 1: If you are planning to buy an air-conditioner, would you enter a showroom and
buy the air-conditioner that the salesperson shows you? The answer is probably no. In this
day and age, you are likely to ask your friends, family, and colleagues for an opinion, do
research on various portals about different models, and visit a few review sites before making
a purchase decision. In a nutshell, you would not come to a conclusion directly. Instead, you
would try to make a more informed decision after considering diverse opinions and reviews.
In the case of ensemble learning, the same principle applies.

Ensemble learning helps improve machine learning results by combining several models.
This approach produces better predictive performance than a single
model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.

Ensemble Methods

• Bagging (Bootstrap Aggregation) and Pasting

• Boosting

• Stacking Classifier

• Voting Classifier

Voting Classifier:

A voting classifier is a machine learning estimator that trains several base models
(estimators) and makes its prediction by aggregating the predictions of each base estimator.
The aggregation rule is a vote over the estimators' outputs, and it can be of two types:

• Hard Voting: The vote is taken over the predicted output classes, and the class with the
most votes wins.

• Soft Voting: The vote is taken over the predicted class probabilities, which are averaged,
and the class with the highest average probability wins.

How can a Voting Classifier improve performance?

The voting classifier aggregates the predicted classes (hard voting) or the predicted
probabilities (soft voting). If we feed it a variety of base models, an error made by one
model can be outvoted by the others, provided the models do not all make the same mistakes.

Fig: Left: Hard Voting, Right: Soft Voting
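The difference between the two voting rules can be sketched with a few lines of NumPy. This is only an illustration: the probabilities below are made-up values for a single sample and three hypothetical classifiers, chosen so that the two rules disagree.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample from three classifiers.
# Columns are [P(class 0), P(class 1)]; the numbers are made up for illustration.
probas = np.array([
    [0.55, 0.45],   # classifier A leans towards class 0
    [0.55, 0.45],   # classifier B leans towards class 0
    [0.10, 0.90],   # classifier C is very confident in class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)                  # [0, 0, 1]
hard_prediction = np.bincount(votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)               # [0.4, 0.6]
soft_prediction = mean_proba.argmax()

print("hard vote:", hard_prediction)   # 0
print("soft vote:", soft_prediction)   # 1
```

Two lukewarm votes beat one confident vote under hard voting, while soft voting lets the confident classifier carry more influence.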

Implementation:

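Scikit-learn provides an implementation of the voting classifier:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Four different base estimators
clf1 = LogisticRegression(random_state=42)
clf2 = RandomForestClassifier(random_state=42)
clf3 = GaussianNB()
clf4 = SVC(probability=True, random_state=42)  # probability=True is required for soft voting

# Soft voting; the Random Forest gets twice the weight of the other models
eclf = VotingClassifier(
    estimators=[('LR', clf1), ('RF', clf2), ('GNB', clf3), ('SVC', clf4)],
    voting='soft', weights=[1, 2, 1, 1])
eclf.fit(X_train, y_train)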

For our sample classification dataset, we train 4 base estimators: Logistic Regression,
Random Forest, Gaussian Naive Bayes, and a Support Vector Classifier.

The parameter voting='soft' or voting='hard' lets developers switch between soft and hard
voting. The weights parameter lets users give more influence to the better-performing base
estimators: it is a sequence of weights applied to the occurrences of predicted class labels
(hard voting) or to the class probabilities before averaging (soft voting).

We are using a soft voting classifier with weights [1,2,1,1], so the Random Forest model is
assigned twice the weight of the others. Now let's compare the benchmark performance of each
base estimator with that of the voting classifier.

According to the benchmark table, the voting classifier performs better than each of its base
estimators.

Bagging & Pasting

Bagging and pasting are techniques that are used in order to create varied subsets of the
training data. The subsets produced by these techniques are then used to train the predictors
of an ensemble.

Bagging, short for bootstrap aggregating, creates a dataset by sampling the training set with
replacement. Pasting creates a dataset by sampling the training set without replacement.

Bootstrapping

In statistics, bootstrapping refers to a resampling method that consists of repeatedly drawing
samples, with replacement, from the data to form other (possibly smaller) datasets, called
bootstrap samples. It is as if the bootstrapping method runs a number of simulations on our
original dataset, so that we can estimate statistics such as the mean and the standard deviation.

For example, let's say we have a set of observations: [2, 4, 32, 8, 16]. If we want each
bootstrap sample to contain n observations, the following are valid samples:

• n=3: [32, 4, 4], [8, 16, 2], [2, 2, 2]…

• n=4: [2, 32, 4, 16], [2, 4, 2, 8], [8, 32, 4, 2]…

Since we draw data with replacement, an observation can appear more than once in a
single sample.
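As a rough sketch, bootstrap samples of the observations above can be drawn with NumPy. The helper name bootstrap_sample, the seed, and the choice of 1000 resamples are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)          # seed chosen arbitrarily
observations = np.array([2, 4, 32, 8, 16])

def bootstrap_sample(data, n, rng):
    """Draw n observations from data with replacement (hypothetical helper)."""
    return rng.choice(data, size=n, replace=True)

# Repeat the resampling many times and look at the spread of the sample mean
samples = [bootstrap_sample(observations, n=4, rng=rng) for _ in range(1000)]
boot_means = np.array([s.mean() for s in samples])

print("one bootstrap sample:", samples[0])
print("mean of bootstrap means:", boot_means.mean())
print("std of bootstrap means:", boot_means.std())
```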

Bagging & Pasting

Bagging means bootstrap + aggregating. It is an ensemble method in which we first bootstrap
our data and train one model on each bootstrap sample. After that, we aggregate the models'
predictions with equal weights. When the sampling is done without replacement, the method is
called pasting.

Bootstrap aggregating, also known as bagging, is a machine learning ensemble meta-algorithm
designed to improve the stability and accuracy of machine learning algorithms used in
statistical classification and regression. It decreases the variance and helps to avoid
overfitting. It is usually applied to decision tree methods. Bagging is a special case of
the model averaging approach.

Implementation Steps of Bagging

• Step 1: Multiple subsets, each with the same number of tuples, are created from the
original data set by selecting observations with replacement.
• Step 2: A base model is created on each of these subsets.
• Step 3: The models are trained in parallel, each on its own training subset and
independently of the others.
• Step 4: The final predictions are determined by combining the predictions from all the
models.
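The four steps can be sketched by hand as follows. This is a minimal illustration, not a replacement for scikit-learn's BaggingClassifier; the make_moons dataset, the 25 trees, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_models = 25                      # odd, so a 0/1 majority vote cannot tie
models = []
for _ in range(n_models):
    # Step 1: bootstrap subset, same size as the training set, drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2-3: train an independent base model on that subset
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    models.append(tree)

# Step 4: combine the predictions by majority vote (labels are 0/1 here)
all_preds = np.array([m.predict(X_test) for m in models])   # (n_models, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

bagged_acc = (bagged_pred == y_test).mean()
single_acc = (models[0].predict(X_test) == y_test).mean()
print(f"single tree: {single_acc:.3f}, bagged ensemble: {bagged_acc:.3f}")
```

Because the models never depend on each other, the loop body could be run in parallel, which is what n_jobs=-1 requests in scikit-learn's ensemble classes.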

Out-of-Bag Scoring

If we are using bagging, there is a chance that a given sample is never selected, while
others may be selected multiple times. The probability of not selecting a specific sample in
a single draw is (1 – 1/n), where n is the number of samples. Therefore, the probability that
it is not picked in any of the n draws is (1 – 1/n)^n. When n is large, this probability
approaches 1/e, which is approximately 0.3679. This means that when the dataset is big
enough, about 37% of its samples are never selected for a given predictor, and we can use
them to test our model. This is called Out-of-Bag scoring, or OOB scoring.
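The 37% figure can be checked numerically. The sketch below compares the formula with 1/e and with one simulated bootstrap draw; n = 10,000 and the seed are arbitrary choices.

```python
import numpy as np

n = 10_000
rng = np.random.default_rng(0)

# Analytical probability that a given instance is never drawn in n draws
p_never = (1 - 1 / n) ** n

# Empirical check: draw one bootstrap sample and count the instances left out
drawn = rng.integers(0, n, size=n)
oob_fraction = 1 - len(np.unique(drawn)) / n

print(f"(1 - 1/n)^n = {p_never:.4f}, 1/e = {np.exp(-1):.4f}, "
      f"simulated = {oob_fraction:.4f}")
```

In scikit-learn, BaggingClassifier and RandomForestClassifier accept oob_score=True and then expose the resulting estimate as oob_score_ after fitting.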

Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. Sampling is controlled by
two hyperparameters: max_features and bootstrap_features. They work the same way as
max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus,
each predictor will be trained on a random subset of the input features.

Random Patches samples both the training instances and the features. In BaggingClassifier()
this is done by sampling on both axes, for example:

bootstrap = True, max_samples = 0.6, bootstrap_features = True, max_features = 0.6

Random Subspaces keeps all the training instances but samples the features, for example:

bootstrap = False, max_samples = 1.0, bootstrap_features = True, max_features = 0.6
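A minimal sketch of both settings on the iris dataset follows. The dataset, the number of estimators, and max_features=0.5 (two of the four iris features) are assumptions chosen to keep the output easy to read.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # 150 instances, 4 features

# Random Patches: sample both the instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=True, max_samples=0.6,
    bootstrap_features=True, max_features=0.5,
    random_state=42).fit(X, y)

# Random Subspaces: keep every instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42).fit(X, y)

# Each predictor records which feature columns it was trained on
print(patches_clf.estimators_features_[0])
print(subspaces_clf.estimators_features_[0])
```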

Random Forest Algorithm

Random forest is a supervised machine learning algorithm that is widely used in
classification and regression problems. It builds decision trees on different samples and
takes their majority vote for classification and their average for regression.

One of the most important features of the Random Forest algorithm is that it can handle
data sets containing continuous variables, as in the case of regression, and categorical
variables, as in the case of classification. It performs well on both classification and
regression tasks.


Working of Random Forest Algorithm

Before understanding the working of the random forest algorithm in machine learning, we
must look into the ensemble learning technique. Ensemble simply means combining multiple
models. Thus a collection of models is used to make predictions rather than an individual
model.

Ensemble learning uses two main types of methods:

1. Bagging – It creates different training subsets from the training data by sampling with
replacement, and the final output is based on majority voting. For example, Random Forest.

2. Boosting – It combines weak learners into a strong learner by building models
sequentially, each one correcting its predecessor, so that the final model has the highest
accuracy. For example, AdaBoost and XGBoost.

Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random
forest. Bagging draws a random sample from the entire data set, so each model is built from
a sample (bootstrap sample) drawn from the original data with replacement, known as row
sampling. This step of row sampling with replacement is called bootstrap. Each model is then
trained independently and generates its own result. The final output is based on majority
voting after combining the results of all models. This step, which combines all the results
and generates the output by majority voting, is known as aggregation.

Now let's look at an example by breaking it down with the help of the following figure. Here
the bootstrap samples (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03)
are taken from the actual data with replacement, which means there is a high possibility
that a sample contains repeated data points. The models (Model 01, Model 02, and Model 03)
obtained from these bootstrap samples are trained independently. Each model generates a
result as shown. The Happy emoji has a majority over the Sad emoji, so based on majority
voting the final output is the Happy emoji.

Steps Involved in Random Forest Algorithm

Step 1: In the Random Forest model, a subset of data points and a subset of features are
selected for constructing each decision tree. Simply put, n random records and m features
are taken from a data set having k records.

Step 2: Individual decision trees are constructed for each sample.

Step 3: Each decision tree will generate an output.

Step 4: The final output is based on majority voting for classification and on averaging
for regression.

For example, consider the fruit basket as the data, as shown in the figure below. n samples
are taken from the fruit basket, and an individual decision tree is constructed for each
sample. Each decision tree generates an output, as shown in the figure. The final output is
based on majority voting. In the figure below, you can see that the majority of decision
trees output an apple rather than a banana, so the final output is taken as an apple.

Important Features of Random Forest

• Diversity: Not all attributes/variables/features are considered while making an individual
tree; each tree is different.

• Less affected by the curse of dimensionality: Since each tree does not consider all the
features, the feature space each tree works in is reduced.

• Parallelization: Each tree is created independently out of different data and attributes.
This means we can make full use of the CPU to build random forests.

• Train-Test split: In a random forest, we do not strictly have to set aside separate test
data, as roughly a third of the data (the out-of-bag samples, about 37% for a large data set)
is not seen by each decision tree.

• Stability: Stability arises because the result is based on majority voting/averaging.

Random Forest is an ensemble of Decision Trees, generally trained via the bagging method
(or sometimes pasting), typically with max_samples set to the size of the training set. Instead
of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use
the RandomForestClassifier class, which is more convenient and optimized for Decision
Trees:

from sklearn.ensemble import RandomForestClassifier

# 500 trees, each limited to 16 leaf nodes, trained on all CPU cores
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

The following BaggingClassifier is roughly equivalent to the previous
RandomForestClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

Feature Importance

Yet another great quality of Random Forests is that they make it easy to
measure the relative importance of each feature.

Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use
that feature reduce impurity on average (across all trees in the forest). More precisely, it is a
weighted average, where each node's weight is equal to the number of training samples that
are associated with it.
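As a short sketch, the importances can be read from the feature_importances_ attribute after training. The iris dataset and the hyperparameters here are assumptions for illustration; the exact scores depend on the random seed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])

# feature_importances_ is normalised so that the scores sum to 1
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```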

Boosting

Boosting is one of the techniques that use the concept of ensemble learning. A boosting
algorithm combines multiple simple models (also known as weak learners or base estimators)
to generate the final output. It does so by training the weak models in series, each one
trying to correct its predecessor.

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can
combine several weak learners into a strong learner.

Boosting Methods

1. AdaBoost- Adaptive Boosting

2. Gradient Boosting

There are several boosting algorithms; AdaBoost was the first really successful boosting
algorithm developed for the purpose of binary classification. AdaBoost is an abbreviation for
Adaptive Boosting, and it is a prevalent boosting technique that combines multiple “weak
classifiers” into a single “strong classifier.”

AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more
attention to the training instances that the predecessor underfitted. This results in new
predictors focusing more and more on the hard cases. This is the technique used by
AdaBoost. For example, when training an AdaBoost classifier, the algorithm first trains a
base classifier (such as a Decision Tree) and uses it to make predictions on the training set.
The algorithm then increases the relative weight of misclassified training instances. Then it
trains a second classifier, using the updated weights, and again makes predictions on the
training set, updates the instance weights, and so on.

Each instance weight w is initially set to 1/m, where m is the number of training instances.
A first predictor is trained, and its weighted error rate r is computed on the training set.

The predictor's weight α is then computed using Equation 7-2, where η is the learning rate
hyperparameter (defaults to 1). The more accurate the predictor is, the higher its weight will
be.
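One round of this weight update can be sketched numerically. The equations themselves are not reproduced in these notes, so the sketch assumes the usual AdaBoost formulas: r is the sum of the weights of the misclassified instances divided by the sum of all weights, α = η·log((1 − r)/r), and misclassified weights are multiplied by exp(α) before normalising. The 3-out-of-10 error pattern is made up.

```python
import numpy as np

m = 10
w = np.full(m, 1 / m)                      # initial instance weights, 1/m each
misclassified = np.zeros(m, dtype=bool)
misclassified[:3] = True                   # assume the first predictor gets 3 of 10 wrong

eta = 1.0                                  # learning rate
r = w[misclassified].sum() / w.sum()       # weighted error rate: 0.3
alpha = eta * np.log((1 - r) / r)          # predictor weight: higher when r is lower

# Boost the misclassified instances, then normalise so the weights sum to 1
w_new = np.where(misclassified, w * np.exp(alpha), w)
w_new /= w_new.sum()

print(f"r = {r:.2f}, alpha = {alpha:.3f}")
print("updated weights:", np.round(w_new, 4))
```

After the update, the three misclassified instances together carry half of the total weight, so the next predictor is pushed to get them right.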

Scikit-Learn uses a multiclass version of AdaBoost called SAMME (which stands for
Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are
just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class
probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of
SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities
rather than predictions and generally performs better.

The following code trains an AdaBoost classifier based on 200 Decision Stumps using
Scikit-Learn's AdaBoostClassifier class (as you might expect, there is also an
AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1, in other
words, a tree composed of a single decision node plus two leaf nodes. This is the default
base estimator for the AdaBoostClassifier class:
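The code listing itself is missing from these notes. A minimal sketch along the lines described might look as follows; the make_moons data, the train/test split, and learning_rate=0.5 are assumptions, and the algorithm argument is left at its default because recent scikit-learn releases have deprecated the SAMME.R option.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # a Decision Stump
    n_estimators=200, learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

test_accuracy = ada_clf.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```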

Boosting is an ensemble modeling technique that attempts to build a strong
classifier from a number of weak classifiers. It does so by building models
in series. First, a model is built from the training data. Then a second
model is built which tries to correct the errors of the first model. This
procedure is continued, and models are added until either the complete
training data set is predicted correctly or the maximum number of models
has been added.
AdaBoost was the first really successful boosting algorithm developed for
the purpose of binary classification. AdaBoost is short for Adaptive
Boosting and is a very popular boosting technique that combines multiple
“weak classifiers” into a single “strong classifier”.
Algorithm:

1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data
points.
3. Increase the weight of the wrongly classified data points.
4. if (got required results)
       Go to step 5
   else
       Go to step 2
5. End

The above diagram explains the AdaBoost algorithm in a very simple way.
Let's try to understand it in a stepwise process:

• B1 consists of 10 data points of two types, plus (+) and minus (-): 5 are
plus (+) and the other 5 are minus (-), and each one has been assigned
equal weight initially. The first model tries to classify the data points
and generates a vertical separator line, but it wrongly classifies 3
plus (+) as minus (-).
• B2 consists of the 10 data points from the previous model, in which the 3
wrongly classified pluses (+) are weighted more so that the current model
tries harder to classify these pluses (+) correctly. This model generates a
vertical separator line that correctly classifies the previously wrongly
classified pluses (+), but in this attempt it wrongly classifies three
minuses (-).
• B3 consists of the 10 data points from the previous model, in which the 3
wrongly classified minuses (-) are weighted more so that the current model
tries harder to classify these minuses (-) correctly. This model generates a
horizontal separator line that correctly classifies the previously wrongly
classified minuses (-).
• B4 combines B1, B2, and B3 in order to build a strong prediction
model which is much better than any individual model used.

Gradient Boosting
Gradient Boosting is a powerful boosting algorithm that combines several
weak learners into a strong learner. Each new model is trained to reduce
the loss function (such as mean squared error or cross-entropy) of the
current ensemble, using gradient descent. In each iteration, the algorithm
computes the gradient of the loss function with respect to the predictions of
the current ensemble and then trains a new weak model to fit the negative of
this gradient. The predictions of the new model are then added to the ensemble,
and the process is repeated until a stopping criterion is met.
In contrast to AdaBoost, the weights of the training instances are not
tweaked; instead, each predictor is trained using the residual errors of its
predecessor as labels. Gradient Boosted Trees is a technique whose base
learner is CART (Classification and Regression Trees).
The diagram below explains how gradient-boosted trees are trained for
regression problems.


Gradient Boosted Trees for Regression

The ensemble consists of M trees. Tree1 is trained using the feature
matrix X and the labels y. Its predictions, labeled y1(hat), are used to
determine the training set residual errors r1. Tree2 is then trained using the
feature matrix X and the residual errors r1 of Tree1 as labels. Its predictions
r1(hat) are then used to determine the residuals r2. The process is
repeated until all the M trees forming the ensemble are trained. There is an
important parameter used in this technique known as Shrinkage. Shrinkage
refers to the fact that the prediction of each tree in the ensemble is shrunk
by multiplying it by the learning rate (eta), which ranges between 0 and 1.
There is a trade-off between eta and the number of estimators: a lower
learning rate needs to be compensated with more estimators in order to reach
a given model performance. Once all trees are trained, predictions can be
made. Each tree predicts a label, and the final prediction is given by the
formula
y(pred) = y1 + (eta * r1) + (eta * r2) + ....... + (eta * rN)
where y1 is the prediction of Tree1 and r1, ..., rN are the residuals
predicted by Tree2 to TreeM (so N = M - 1).
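The training loop and the prediction formula can be sketched by hand with regression trees. This is an illustration under assumptions (synthetic quadratic data, depth-2 trees, eta = 0.5, five trees), not scikit-learn's GradientBoostingRegressor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

eta = 0.5        # learning rate (shrinkage)
n_trees = 5      # M in the text

# Tree1 is trained on the labels y
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
y_pred = tree1.predict(X)
mse_history = [np.mean((y - y_pred) ** 2)]

# Every later tree is trained on the residuals of the ensemble so far
residual_trees = []
for _ in range(n_trees - 1):
    residuals = y - y_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals)
    y_pred = y_pred + eta * tree.predict(X)   # y1 + eta*r1 + eta*r2 + ...
    residual_trees.append(tree)
    mse_history.append(np.mean((y - y_pred) ** 2))

print("training MSE after each tree:", np.round(mse_history, 5))
```

With a smaller eta each tree contributes less, so more trees are needed to reach the same training error, which is the trade-off described above.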

Difference between AdaBoost and Gradient Boosting

The differences between AdaBoost and Gradient Boosting are as follows:

AdaBoost | Gradient Boosting
During each iteration in AdaBoost, the weights of incorrectly classified samples are increased, so that the next weak learner focuses more on these samples. | Gradient Boosting updates the weights by computing the negative gradient of the loss function with respect to the predicted output.
AdaBoost uses simple decision trees with one split, known as decision stumps, as weak learners. | Gradient Boosting can use a wide range of base learners, such as decision trees and linear models.
AdaBoost is more susceptible to noise and outliers in the data, as it assigns high weights to misclassified samples. | Gradient Boosting is generally more robust, as it updates the weights based on the gradients, which are less sensitive to outliers.

Support Vector Machine (SVM)

Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression
problems. However, it is primarily used for Classification problems in
Machine Learning.

The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put a new data point in the correct category in the future.
This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and
hence the algorithm is termed Support Vector Machine. Consider the below
diagram in which there are two different categories that are classified
using a decision boundary or hyperplane:

Example:

Suppose we see a strange cat that also has some features of dogs, and
we want a model that can accurately identify whether it is a cat or a dog.
Such a model can be created by using the SVM algorithm. We first
train our model with lots of images of cats and dogs so that it can learn
the different features of cats and dogs, and then we test it with this
strange creature. The SVM creates a decision boundary between the two
classes (cat and dog) and chooses the extreme cases (support vectors) of
cats and dogs. On the basis of the support vectors, it will classify the
creature as a cat.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset
can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used is
called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data.
If a dataset cannot be classified by using a straight line, then such data
is termed non-linear data, and the classifier used is called a Non-linear
SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to
segregate the classes in n-dimensional space, but we need to find the
best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.

The dimension of the hyperplane depends on the number of features in the
dataset: if there are 2 features (as shown in the image), the
hyperplane is a straight line, and if there are 3 features, the
hyperplane is a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, that is,
the maximum distance between the hyperplane and the nearest data points
of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and
that affect its position are termed Support Vectors.
Since these vectors support the hyperplane, they are called support
vectors.

How does SVM work?

Suppose we have a dataset that has two tags (green and blue) and two
features, x1 and x2. We want a classifier that can classify
the pair (x1, x2) of coordinates as either green or blue. Consider the below
image:

Since this is a 2-d space, we can easily separate these two classes by just
using a straight line. But there can be multiple lines that can separate
these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary;
this best boundary or region is called a hyperplane. The SVM algorithm
finds the points of both classes that lie closest to the line. These points
are called support vectors. The distance between these vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this
margin. The hyperplane with the maximum margin is called the optimal
hyperplane.

How to choose the Correct SVM

Suppose we are given 2 hyperplanes, one with 100% accuracy (HP1) on the
left side and another with >90% accuracy (HP2) on the right side. Which
one do you think is the correct classifier?

Most of us would pick HP2 because it has the maximum margin. But that is
the wrong answer.

The Support Vector Machine would choose HP1 even though it has a narrow
margin. Although HP2 has the maximum margin, it goes against
the constraint that each data point must lie on the correct side of the
margin and there should be no misclassification. This is
the hard constraint that the Support Vector Machine follows throughout.

Margins

Margins are generally defined by the closest data points (called support vectors) on
either side of the hyperplane.

Optimization Technique used in SVM

The core of any Machine Learning algorithm is the optimization technique that is
happening behind the scenes.

SVM maximizes the margin by learning a suitable decision boundary/decision
surface/separating hyperplane.

It can be written mathematically as:
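The formula itself is missing from these notes. As a hedged reconstruction, the standard hard-margin formulation is given below; the symbols are assumptions, not taken from the source: w is the weight vector, b the bias, x_i the i-th training instance, y_i its label in {-1, +1}, and m the number of training instances.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, m
```

The width of the margin is 2 / ||w||, so minimizing ||w|| is the same as maximizing the margin, and the constraints say that every point must lie on the correct side of the margin.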

Hard and Soft SVM

Continuing with the above example, we can now clearly state that HP1 is a Hard SVM
(left side) while HP2 is a Soft SVM (right side).

In its basic form, the Support Vector Machine is a hard margin SVM. It works well
only if our data is linearly separable.

Hard margin SVM does not allow any misclassification to happen.

If our data is non-separable/nonlinear, then the hard margin SVM will not
return any hyperplane, as it will not be able to separate the data. This is where
Soft Margin SVM comes to the rescue.

Soft margin SVM allows some misclassification to happen by relaxing the hard
constraints of the Support Vector Machine.

Soft margin SVM is implemented with the help of the Regularization parameter
(C).

Regularization parameter (C): It tells us how much misclassification we want to
avoid.

– Hard margin SVM generally has large values of C.

– Soft margin SVM generally has small values of C.

Relation between Regularization parameter (C) and SVM

Now that we know what the Regularization parameter (C) does, we need
to understand its relation with the Support Vector Machine.

– As the value of C increases, the margin decreases, and the model approaches a Hard SVM.

– If the value of C is very small, the margin increases, giving a Soft SVM.

– A large value of C can cause overfitting; therefore we need to select the
correct value using Hyperparameter Tuning.
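The effect of C on the margin can be observed directly with a linear SVM, whose margin width is 2 / ||w||. The sketch below uses synthetic, overlapping clusters and two arbitrary values of C; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=42)

margins = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margins[C] = 2 / np.linalg.norm(w)       # margin width of a linear SVM
    print(f"C={C}: margin width = {margins[C]:.3f}, "
          f"support vectors = {len(clf.support_)}")
```

The small C should give the wider (softer) margin with more support vectors, and the large C the narrower one.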

The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a
linear SVM model (using the LinearSVC class with C=1 and the hinge loss function,
described shortly) to detect Iris virginica flowers:

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)

Non Linear SVM Classification

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight
line, but for non-linear data, we cannot draw a single straight line.
Consider the below image:

So to separate these data points, we need to add one more dimension. For
linear data, we have used two dimensions x and y, so for non-linear data,
we will add a third dimension z. It can be calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will become as in the below
image:

So now, SVM will divide the dataset into classes in the following way.
Consider the below image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-
axis. If we convert it to 2-d space with z = 1, it becomes:

Hence we get a circle of radius 1 as the decision boundary in the case of
non-linear data.
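The idea of adding z = x^2 + y^2 can be tried on synthetic data. The sketch below assumes two concentric rings generated with make_circles and compares a linear SVM on (x, y) with a linear SVM on (x, y, z).

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in (x, y)
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)

# Add the third dimension z = x^2 + y^2
z = (X ** 2).sum(axis=1)
X_3d = np.column_stack([X, z])

acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)
acc_3d = SVC(kernel="linear").fit(X_3d, y).score(X_3d, y)
print(f"linear SVM on (x, y):    {acc_2d:.2f}")
print(f"linear SVM on (x, y, z): {acc_3d:.2f}")
```

In the lifted space a flat plane separates the rings, which corresponds to a circle when projected back onto the (x, y) plane.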

Adding Polynomial Features

Polynomial Kernel

Similarity Features

Gaussian RBF Kernel

For the above topics, refer to the textbook, pages 218-223 (Hands on
Machine Learning).

Support Vector Regression

Support Vector Regression (SVR), as the name suggests, is a regression algorithm that supports both linear and non-linear regression. It works on the same principle as the Support Vector Machine. SVR differs from SVM in that SVM is a classifier used for predicting discrete categorical labels, while SVR is a regressor used for predicting continuous ordered variables.

In simple regression, the idea is to minimize the error, while in SVR the idea is to keep the error inside a certain threshold. In other words, SVR tries to approximate the best value within a given margin called the ε-tube.

1. Hyperplane: In classification, it is the separating boundary between two data classes, possibly in a higher dimension than the original one. In SVR, it is the line that is used to predict the target value.

2. Kernel: In SVR, the regression can be performed in a higher dimension. To do that, we need a function that maps the data points into that higher dimension. This function is termed the kernel. Kernels used in SVR include the sigmoid kernel, the polynomial kernel, the Gaussian kernel, etc.

3. Boundary Lines: These are the two lines drawn around the hyperplane at a distance of ε (epsilon). They create a margin around the hyperplane.

4. Support Vector: These are the extreme data points in the dataset that help in defining the hyperplane. These data points lie close to the boundary.

The objective of SVR is to fit as many data points as possible without violating the margin. In classification (SVM), the support vectors define the separating hyperplane, but in SVR they define the regression function.

Working of SVR

SVR works on the principle of SVM with a few minor differences. Given the data points, it tries to find a curve. But since it is a regression algorithm, instead of using the curve as a decision boundary, it uses the curve to find the match between the data points and the position of the curve. Support vectors help in determining the closest match between the data points and the function that is used to represent them.
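To see which points end up as support vectors in practice, the following sketch fits scikit-learn's SVR on a small synthetic dataset. The noisy straight line, the values of C and epsilon, and the 1e-2 tolerance are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Hypothetical 1-D data: a noisy straight line
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=50)

epsilon = 1.0
model = SVR(kernel="linear", C=10.0, epsilon=epsilon).fit(X, y)

# Absolute distance between each target and the fitted function
residuals = np.abs(y - model.predict(X))

# Training points kept as support vectors, and points clearly outside the tube
support_idx = set(model.support_.tolist())
outside_idx = set(np.flatnonzero(residuals > epsilon + 1e-2).tolist())

print("number of training points:", len(X))
print("number of support vectors:", len(support_idx))
print("points clearly outside the tube:", sorted(outside_idx))
```

Points that fall strictly inside the ε-tube do not influence the fitted function, so only the points on or outside the tube are retained as support vectors.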

The Idea Behind Support Vector Regression

The problem of regression is to find a function that approximates the mapping from an input domain to the real numbers on the basis of a training sample. So let’s now dive deep and understand how SVR actually works.

Consider the two red lines as the decision boundary and the green line as the hyperplane. Our objective in SVR is to consider the points that lie within the decision boundary lines. Our best-fit line is the hyperplane that contains the maximum number of points.

The first thing to understand is the decision boundary (the red lines above). Consider these lines as being at some distance, say ‘a’, from the hyperplane. So, these are the lines that we draw at distances ‘+a’ and ‘-a’ from the hyperplane. This ‘a’ is what the text refers to as epsilon.

Assuming that the equation of the hyperplane is as follows:

$Y = wx + b$ (equation of hyperplane)

Then the equations of the decision boundary become:

$Y = wx + b + a$

$Y = wx + b - a$

Thus, any point that is accepted by our SVR should satisfy:

$-a < Y - (wx + b) < +a$

Our main aim here is to decide a decision boundary at distance ‘a’ from the original hyperplane such that the data points closest to the hyperplane, or the support vectors, are within that boundary line.

Hence, we are going to take only those points that are within the decision boundary and have the least error rate, or are within the Margin of Tolerance. This gives us a better fitting model.
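The condition above can be written as a small check. In the sketch below, the line parameters w and b, the half-width a, and the sample points are hypothetical; the loss function is the standard ε-insensitive loss used by SVR, which the notes describe only in words.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, a):
    """Zero inside the tube, growing linearly with the distance beyond it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - a)

# Hypothetical line Y = w*x + b and tube half-width a (epsilon)
w, b, a = 2.0, 1.0, 0.5

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 3.0, 6.0, 6.4])

y_hat = w * x + b                    # values on the hyperplane
inside = np.abs(y - y_hat) < a       # -a < Y - (wx + b) < +a
loss = epsilon_insensitive_loss(y, y_hat, a)

print(inside)
print(loss)
```

Points inside the tube contribute no loss at all, which is why SVR tolerates small errors and only penalizes points that fall outside the margin of tolerance.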

What is a Support Vector Machine (SVM)?

So what exactly is a Support Vector Machine (SVM)? We’ll start by understanding SVM in simple terms. Let’s say we have a plot of two label classes as shown in the figure below:

Can you decide what the separating line will be? You might have come up with this:

The line fairly separates the classes. This is what SVM essentially does: simple class separation. Now, what if the data was like this:

Here, we don’t have a simple line separating these two classes. So we’ll extend our dimension and introduce a new dimension along the z-axis. We can now separate these two classes:

When we transform this line back to the original plane, it maps to the circular boundary as I’ve shown here:

This is exactly what SVM does! It tries to find a line/hyperplane (in multidimensional space) that separates these two classes. Then it classifies a new point depending on whether it lies on the positive or negative side of the hyperplane.
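As a minimal sketch of this last step, the code below fits a linear SVM with scikit-learn and inspects the sign of the decision function for two new points. The six training points and the two query points are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separated 2-D points for two classes
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

new_points = np.array([[1.0, 2.0], [7.0, 7.0]])
scores = clf.decision_function(new_points)  # signed score relative to the hyperplane
predicted = clf.predict(new_points)

for point, score, label in zip(new_points, scores, predicted):
    side = "positive" if score > 0 else "negative"
    print(point, "->", side, "side, class", label)
```

For a two-class problem, a positive score corresponds to the second class in `clf.classes_` and a negative score to the first.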

Implementing Support Vector Regression (SVR) in Python

Here, we have to predict the salary of an employee given a few independent variables. A classic HR analytics project!

Step 1: Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2: Reading the dataset

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values  # feature column, kept as a 2D array
y = dataset.iloc[:, 2].values    # target column, a 1D array

Step 3: Feature Scaling

A real-world dataset contains features that vary in magnitude, units, and range. I would suggest performing normalization when the scale of a feature is irrelevant or misleading.

Feature scaling basically helps to normalize the data within a particular range. Some commonly used classes handle feature scaling automatically. However, the SVR class does not, so we should perform feature scaling explicitly.

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
# StandardScaler expects a 2D array, so reshape y before scaling and flatten it afterwards
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

Step 4: Fitting SVR to the dataset

from sklearn.svm import SVR

regressor = SVR(kernel='rbf')
regressor.fit(X, y)

The kernel is the most important parameter. There are many types of kernels (linear, Gaussian, etc.), and each is used depending on the dataset.

Step 5: Predicting a new result

# scale the input the same way as the training data, then undo the scaling on the prediction
y_pred = regressor.predict(sc_X.transform([[6.5]]))
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))

So, the prediction for a position level of 6.5 will be about 170,370.

Step 6: Visualizing the SVR results (for higher resolution and smoother curve)

# a fine grid gives a smoother curve; the grid is built on the scaled axis because the data is feature scaled
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color='red')
plt.plot(X_grid, regressor.predict(X_grid), color='blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

Implementing SVR in Python

Data preprocessing

As in any other implementation, we first get the necessary libraries in place. The code below imports these libraries and loads the dataset:
# get the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('/content/drive/MyDrive/Position_Salaries.csv')
# our dataset in this implementation is small, and thus we can
# print it all instead of viewing only the end
print(dataset)

Output:
[the printed dataset table is not reproduced here]

The above dataset contains ten instances. The significant feature in this dataset is the Level column. The Position column is just a description of the Level column, and therefore, it adds no value to our analysis.

Therefore, we will separate the dataset into a set of features and a study variable.

As discussed above, we only have one feature in this dataset. Therefore, we carry out our feature-study variable separation as shown in the code below:
# split the data into features and target variable separately
X_l = dataset.iloc[:, 1:-1].values # features set
y_p = dataset.iloc[:, -1].values # set of study variable

We can look at our feature set using the print() function.
print(X_l)

Output:
[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]]

From this output, it’s clear that the X_l variable is a 2D array. Similarly, we can have a look at the y_p variable:
print(y_p)

Output:
[  45000   50000   60000   80000  110000  150000  200000  300000  500000
 1000000]

It’s seen from the output above that the y_p variable is a vector, i.e., a 1D array.

We need to note that the values of y_p are huge compared to those of X_l.

Therefore, if we implement a model on this data, the study variable will dominate the feature variable, such that the feature's contribution to the model will be neglected.

Due to this, we will have to scale both variables so that the feature and the study variable are on a comparable range.

The challenge here is that StandardScaler, the class we use to scale the data, takes in a 2D array; otherwise, it returns an error.

Due to this, we have to reshape our y_p variable from 1D to 2D. The code below does this for us:
y_p = y_p.reshape(-1,1)
print(y_p)

Output:
[[  45000]
 [  50000]
 [  60000]
 [  80000]
 [ 110000]
 [ 150000]
 [ 200000]
 [ 300000]
 [ 500000]
 [1000000]]

From the above output, y_p was successfully reshaped into a 2D array.

Now, import the StandardScaler class and scale the X_l and y_p variables separately as shown:
from sklearn.preprocessing import StandardScaler
StdS_X = StandardScaler()
StdS_y = StandardScaler()
X_l = StdS_X.fit_transform(X_l)
y_p = StdS_y.fit_transform(y_p)

Let’s print both variables and check that they were scaled.
print("Scaled X_l:")
print(X_l)
print("Scaled y_p:")
print(y_p)

Output:
Scaled X_l:
[[-1.5666989 ]
 [-1.21854359]
 [-0.87038828]
 [-0.52223297]
 [-0.17407766]
 [ 0.17407766]
 [ 0.52223297]
 [ 0.87038828]
 [ 1.21854359]
 [ 1.5666989 ]]
Scaled y_p:
[[-0.72004253]
 [-0.70243757]
 [-0.66722767]
 [-0.59680786]
 [-0.49117815]
 [-0.35033854]
 [-0.17428902]
 [ 0.17781001]
 [ 0.88200808]
 [ 2.64250325]]

As we can see from the obtained output, both variables now lie within the range -3 to +3.

Our data is now ready for our SVR model.

However, before we fit it, we will first visualize the data to learn which kind of SVR model fits it best. So, let us create a scatter plot of our two variables.
plt.scatter(X_l, y_p, color = 'red') # plotting the training set
plt.title('Scatter Plot') # adding a title to our plot
plt.xlabel('Levels') # adds a label to the x-axis
plt.ylabel('Salary') # adds a label to the y-axis
plt.show() # displays the plot

The plot shows a non-linear relationship between the Levels and Salary.

Due to this, we cannot use the linear SVR to model this data. Therefore, to capture this relationship better, we will use SVR with a kernel function.

Implementing SVR

To implement our model, we first need to import it from scikit-learn and create an object of it.

Since our data is non-linear, we will use a kernel called the Radial Basis Function (RBF) kernel.

After declaring the kernel function, we will fit the object on our data. The following program performs these steps:
# import the model
from sklearn.svm import SVR
# create the model object
regressor = SVR(kernel = 'rbf')
# fit the model on the data (ravel() passes the target as the 1D array SVR expects)
regressor.fit(X_l, y_p.ravel())

Since the model is now ready, we can use it to make predictions as shown:
A=regressor.predict(StdS_X.transform([[6.5]]))
print(A)

Output:
[-0.27861589]

As we can see, the model's prediction is on the scale of the scaled study variable. But the value required by the business is on the original, unscaled scale. So, we need to get back to the real scale of the study variable.

To do so, we take the predicted value on the scaled range and transform it back to the actual scale by applying the inverse of the transformation used on the study variable.

Recall that we reshaped our study variable from a 1D to a 2D array, since StandardScaler takes in only 2D arrays.

So, any predicted value must likewise be converted from 1D to 2D before the inverse transformation; otherwise, we will get an error.

So, let’s implement these commands and get the required value:
# Convert A to 2D
A = A.reshape(-1,1)
print(A)

Output:
[[-0.27861589]]

It is clear from the output above that A is now a 2D array. Using the inverse_transform() function, we can convert it to an unscaled value on the original scale as shown:
# Taking the inverse of the scaled value
A_pred = StdS_y.inverse_transform(A)
print(A_pred)

Output:
[[170370.0204065]]

Here is the result, and it falls within the expected range.

Finally, let us visualize our model. The following code carries out this task:
# inverse the transformation to go back to the initial scale
plt.scatter(StdS_X.inverse_transform(X_l), StdS_y.inverse_transform(y_p), color = 'red')
plt.plot(StdS_X.inverse_transform(X_l), StdS_y.inverse_transform(regressor.predict(X_l).reshape(-1,1)), color = 'blue')
# add the title to the plot
plt.title('Support Vector Regression Model')
# label x axis
plt.xlabel('Position Level')
# label y axis
plt.ylabel('Salary')
# display the plot
plt.show()

Output:
[Figure: the data points in red and the fitted SVR curve in blue]
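The manual scaling, reshaping, and inverse-transforming steps above can also be bundled together. The sketch below is an alternative arrangement, not the method used in these notes: it assumes scikit-learn's make_pipeline and TransformedTargetRegressor, and it re-enters the ten levels and salaries printed earlier instead of reading the CSV file.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Levels and salaries as printed earlier in this section
levels = np.arange(1, 11).reshape(-1, 1)
salaries = np.array([45000, 50000, 60000, 80000, 110000,
                     150000, 200000, 300000, 500000, 1000000], dtype=float)

# Scale X inside the pipeline; scale y before fitting and unscale it after predicting
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
)
model.fit(levels, salaries)

prediction = float(model.predict([[6.5]])[0])
print(round(prediction, 2))
```

Because the scalers live inside the model object, new inputs are passed in their original units and predictions come back in salary units, so the reshape and inverse_transform calls are no longer written by hand.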
Naïve Bayes Classifier

o The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification, which involves high-dimensional training datasets.
o The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as:

o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

Where,

P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence B given that hypothesis A is true.

P(A) is Prior probability: Probability of the hypothesis before observing the evidence.

P(B) is Marginal probability: Probability of the evidence.

Using Bayes' theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent. That is, the presence of one particular feature does not affect the others. Hence it is called naive.

Steps

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Now, use Bayes' theorem to calculate the posterior probability.
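The three steps can be sketched for a single feature as follows. The ten (outlook, play) observations are invented for illustration and are not the dataset used in these notes; the helper names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical observations of (outlook, play) pairs
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rainy", "Yes")]
n = len(data)

# Step 1: frequency tables
class_counts = Counter(play for _, play in data)
outlook_counts = Counter(outlook for outlook, _ in data)
pair_counts = Counter(data)

# Step 2: likelihood, prior, and evidence read off the frequency tables
def likelihood(outlook, play):
    return pair_counts[(outlook, play)] / class_counts[play]

def prior(play):
    return class_counts[play] / n

def evidence(outlook):
    return outlook_counts[outlook] / n

# Step 3: posterior P(play | outlook) from Bayes' theorem
def posterior(play, outlook):
    return likelihood(outlook, play) * prior(play) / evidence(outlook)

print("P(Yes | Sunny) =", posterior("Yes", "Sunny"))
print("P(No  | Sunny) =", posterior("No", "Sunny"))
```

In this made-up sample, three of the four sunny days are "No" days, so the posterior for "No" given a sunny outlook is the larger of the two.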

Example

According to this example, Bayes' theorem can be rewritten as:

$P(y|X) = \frac{P(X|y)\,P(y)}{P(X)}$

The variable y is the class variable (play golf), which represents whether it is suitable to play golf or not given the conditions. The variable X represents the parameters/features.

X is given as,

$X = (x_1, x_2, \ldots, x_n)$

Here x_1, x_2, ..., x_n represent the features, i.e., they can be mapped to outlook, temperature, humidity and windy. By substituting for X and expanding using the chain rule we get,

$P(y|x_1, \ldots, x_n) = \frac{P(x_1|y)\,P(x_2|y) \cdots P(x_n|y)\,P(y)}{P(x_1)\,P(x_2) \cdots P(x_n)}$

Now, you can obtain the values for each term by looking at the dataset and substituting them into the equation. For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality can be introduced.

$P(y|x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y)$

In our case, the class variable (y) has only two outcomes, yes or no. There could be cases where the classification has more than two classes. Therefore, we need to find the class y with maximum probability.

$y = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)$

Using the above function, we can obtain the class, given the predictors.
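A minimal sketch of this decision rule with two features is shown below. The eight rows, the feature values, and the function names are hypothetical, and no smoothing is applied, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (outlook, wind) features and the play label
rows = [(("Sunny", "Calm"), "Yes"), (("Sunny", "Windy"), "No"),
        (("Rainy", "Windy"), "No"), (("Rainy", "Calm"), "Yes"),
        (("Overcast", "Calm"), "Yes"), (("Overcast", "Windy"), "Yes"),
        (("Sunny", "Windy"), "No"), (("Rainy", "Calm"), "Yes")]

class_counts = Counter(label for _, label in rows)

# feature_counts[i][(value, label)]: how often feature i takes `value` within class `label`
feature_counts = defaultdict(Counter)
for features, label in rows:
    for i, value in enumerate(features):
        feature_counts[i][(value, label)] += 1

def score(features, label):
    """Unnormalised posterior: P(y) times the product of P(x_i | y)."""
    result = class_counts[label] / len(rows)
    for i, value in enumerate(features):
        result *= feature_counts[i][(value, label)] / class_counts[label]
    return result

def predict(features):
    """Return the class with the largest unnormalised posterior."""
    return max(class_counts, key=lambda label: score(features, label))

print(predict(("Sunny", "Windy")))
print(predict(("Overcast", "Calm")))
```

The denominator is never computed: because it is the same for every class, comparing the unnormalised scores is enough to pick the most probable class.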
