
Sub: Machine Learning (Unit-3)

Q) VOTING CLASSIFIERS?
A Voting Classifier is a machine learning model that trains an ensemble of
several base models and predicts an output class based on their combined votes.
It simply aggregates the predictions of each classifier passed into the Voting
Classifier and predicts the output class that receives the majority of the votes.
The idea is that, instead of creating separate dedicated models and finding the
accuracy of each of them, we create a single model that trains these base models
and predicts the output based on their combined majority vote for each output
class.
Voting Classifier supports two types of votings.
1. Hard Voting: In hard voting, the predicted output class is the class with the
highest majority of votes, i.e., each classifier predicts a class label and the
class predicted most often is chosen. Suppose three classifiers predict the
output classes (A, A, B); here the majority predicted A, hence A will be the
final prediction.
2. Soft Voting: In soft voting, the output class is the one with the highest
average predicted probability across the classifiers. Suppose, for some input,
the three models predict the probability for class A = (0.30, 0.47, 0.53) and
for class B = (0.20, 0.32, 0.40). The average for class A is 0.4333 and for B it
is 0.3067, so the winner is clearly class A, because it has the highest
probability averaged over the classifiers.
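As a rough illustration (a minimal sketch assuming scikit-learn is available; the
dataset and base models here are made up for the example), hard and soft voting
differ only in the voting parameter:

```python
# Sketch: hard vs. soft voting with scikit-learn's VotingClassifier (assumed available).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),  # probability=True is needed for soft voting
]

# Hard voting: majority vote over the predicted class labels.
hard_clf = VotingClassifier(estimators=base_models, voting="hard").fit(X, y)

# Soft voting: average the predicted class probabilities and pick the highest.
soft_clf = VotingClassifier(estimators=base_models, voting="soft").fit(X, y)

print(hard_clf.predict(X[:5]), soft_clf.predict(X[:5]))
```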

……………… end………
Q) Bagging and pasting?
When sampling is performed with replacement, this method is called bagging
(short for bootstrap aggregating). When sampling is performed without
replacement, it is called pasting.
Both bagging and pasting allow training instances to be sampled several times
across multiple predictors, but only bagging allows training instances to be
sampled several times for the same predictor.

Once all predictors are trained, the ensemble can make a prediction for a new
instance by simply aggregating the predictions of all predictors. The
aggregation function is typically the statistical mode for classification or the
average for regression.

Predictors can all be trained in parallel, via different CPU cores. Similarly,
predictions can be made in parallel. This is one of the reasons bagging and
pasting scale very well.

Pasting is an ensemble technique similar to bagging, except that in pasting the
sampling is done without replacement, i.e. an observation can be present in only
one subset. Since pasting limits the diversity of the models, its performance is
suboptimal compared to bagging, particularly on small datasets. However, pasting
is preferred over bagging for very large datasets, owing to its computational
efficiency.
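As a hedged sketch (assuming scikit-learn's BaggingClassifier; the synthetic data
is only illustrative), the difference between bagging and pasting comes down to
the bootstrap flag, which controls whether sampling is done with or without
replacement:

```python
# Sketch: bagging vs. pasting with scikit-learn's BaggingClassifier (assumed available).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# bootstrap=True -> sampling WITH replacement (bagging)
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.8, bootstrap=True, n_jobs=-1, random_state=0).fit(X, y)

# bootstrap=False -> sampling WITHOUT replacement (pasting)
pasting = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.8, bootstrap=False, n_jobs=-1, random_state=0).fit(X, y)
```

Note that n_jobs=-1 trains the predictors in parallel on all CPU cores, which is
why bagging and pasting scale well, as mentioned above.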

Bagging decreases the variance and helps to avoid overfitting. It is usually
applied to decision tree methods. Bagging is a special case of the model
averaging approach.

Implementation Steps of Bagging


• Step 1: Multiple subsets are created from the original data set with
equal tuples, selecting observations with replacement.
• Step 2: A base model is created on each of these subsets.
• Step 3: Each model is learned in parallel on its training set,
independently of the others.
• Step 4: The final predictions are determined by combining the
predictions from all the models.
Example of Bagging
• The Random Forest model uses Bagging, where the individual decision tree
models have higher variance. It uses random feature selection to
grow the trees. Several random trees make a Random Forest.

………………………… end …………..

Q) Random forests?

Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both Classification and
Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a
number of decision trees on various subsets of the given dataset and takes
the average to improve the predictive accuracy of that dataset." Instead of
relying on one decision tree, the random forest takes the prediction from each
tree and, based on the majority vote of the predictions, it predicts the final
output.
A greater number of trees in the forest leads to higher accuracy and helps
prevent the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Note: To better understand the Random Forest Algorithm, you should have
knowledge of the Decision Tree Algorithm.

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct output,
while others may not. But together, all the trees predict the correct output.
Therefore, below are two assumptions for a better Random forest classifier:

o There should be some actual values in the feature variables of the dataset,
so that the classifier can predict accurate results rather than guessed
results.
o The predictions from each tree must have very low correlations.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest
algorithm:
<="" li="">

o It takes less training time as compared to other algorithms.
o It predicts the output with high accuracy, and even for a large dataset it runs
efficiently.
o It can also maintain accuracy when a large proportion of the data is missing.

How does the Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions with each tree
created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points
(Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and
assign the new data points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So,
this dataset is given to the Random forest classifier. The dataset is divided into
subsets and given to each decision tree. During the training phase, each
decision tree produces a prediction result, and when a new data point occurs,
then based on the majority of results, the Random Forest classifier predicts
the final decision. Consider the below image:
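A minimal sketch of how this looks in code (assuming scikit-learn's
RandomForestClassifier; the dataset here is synthetic and purely illustrative):

```python
# Sketch: Random Forest = bagging of decision trees + random feature selection per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(
    n_estimators=200,      # N: the number of decision trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    n_jobs=-1,
    random_state=1,
).fit(X_train, y_train)

# Final prediction = majority vote of the individual trees.
print("test accuracy:", forest.score(X_test, y_test))
```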
Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: The banking sector mostly uses this algorithm for the
identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of
the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this
algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and
Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest

o Although random forest can be used for both classification and
regression tasks, it is less suitable for regression tasks.

………………… end ………………

Q) Boosting?

Boosting is an ensemble learning technique that uses a set of Machine
Learning algorithms to convert weak learners into strong learners in order to
increase the accuracy of the model.

Boosting = high-bias, low-variance base learners + additive combining.

-> Additively combining the learners helps us to reduce bias.

-> High bias causes a large training error.

-> We concentrate on the errors, which leads us towards the correct
classifications, so that we can reduce the error.

-> We will take decision trees as the base learners.

In a sequential ensemble, popularly known as boosting, the weak
learners are produced sequentially during the training phase. The
performance of the model is improved by assigning a higher weightage to
the samples that were previously classified incorrectly. An example of boosting
is the AdaBoost algorithm.

How does the Boosting Algorithm Work?


The basic principle behind the working of the boosting algorithm is to
generate multiple weak learners and combine their predictions to form one
strong rule. These weak rules are generated by applying base Machine
Learning algorithms on different distributions of the data set. These
algorithms generate weak rules for each iteration. After multiple iterations,
the weak learners are combined to form a strong learner that will predict a
more accurate outcome.

Here’s how the algorithm works:

Step 1: The base algorithm reads the data and assigns equal weight to each
sample observation.

Step 2: The false predictions made by the base learner are identified. In the next
iteration, these misclassified samples are passed to the next base learner, with a
higher weightage assigned to these incorrect predictions.

Step 3: Repeat step 2 until the algorithm can correctly classify the output.

Therefore, the main aim of Boosting is to focus more on misclassified
predictions.

Types Of Boosting
There are three main ways through which boosting can be carried out:

1. Adaptive Boosting or AdaBoost
2. Gradient Boosting
3. XGBoost

Adaptive Boosting

• AdaBoost is implemented by combining several weak learners into a
single strong learner.
• The weak learners in AdaBoost take into account a single input feature
and draw out a single split decision tree called the decision stump. Each
observation is weighed equally while drawing out the first decision
stump.
• The results from the first decision stump are analyzed and if any
observations are wrongfully classified, they are assigned higher weights.
• Post this, a new decision stump is drawn by considering the
observations with higher weights as more significant.
• Again if any observations are misclassified, they’re given higher weight
and this process continues until all the observations fall into the right
class.
• AdaBoost can be used for both classification and regression-based
problems; however, it is more commonly used for classification purposes.
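A minimal sketch of AdaBoost with decision stumps as the weak learners (assuming
scikit-learn's AdaBoostClassifier; the data and parameter values are illustrative):

```python
# Sketch: AdaBoost with decision stumps (single-split trees, max_depth=1) as weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=7)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a decision stump: one feature, one split
    n_estimators=100,                     # number of sequential weak learners
    learning_rate=0.5,
    random_state=7,
).fit(X, y)

print("training accuracy:", ada.score(X, y))
```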

Gradient Boosting

Gradient Boosting is also based on sequential ensemble learning. Here the
base learners are generated sequentially in such a way that the present base
learner is always more effective than the previous one, i.e. the overall model
improves sequentially with each iteration.

The difference in this type of boosting is that the weights of misclassified
outcomes are not incremented; instead, the Gradient Boosting method tries to
optimize the loss function of the previous learner by adding a new weak learner
at each step, so that the loss is reduced.

The main idea here is to overcome the errors in the previous learner’s
predictions. This type of boosting has three main components:

1. A loss function that needs to be optimized.
2. A weak learner for computing predictions and forming the strong learner.
3. An additive model that adds weak learners to minimize the loss function.

Like AdaBoost, Gradient Boosting can also be used for both classification and
regression problems.
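As an illustrative sketch (assuming scikit-learn's GradientBoostingClassifier;
the dataset and hyperparameters are made up), each new shallow tree is fitted so
that the ensemble's loss keeps decreasing:

```python
# Sketch: gradient boosting, where each new tree reduces the loss of the ensemble so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=3)

gb = GradientBoostingClassifier(
    n_estimators=200,   # number of sequential weak learners (shallow trees)
    learning_rate=0.1,  # shrinks each tree's contribution to the additive model
    max_depth=3,        # weak learners: shallow trees
    random_state=3,
).fit(X, y)

print("training accuracy:", gb.score(X, y))
```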
……………….. end………..

Q) STACKING ?

· Stacked Generalization, or "Stacking" for short, is an ensemble machine
learning algorithm.
Stacking is an ensemble technique that uses meta-learning for generating
predictions. It can harness the capabilities of well-performing as well as
weakly-performing models on a classification or regression task and make
predictions with better performance than any other single model in the
ensemble.

It is an extended form of the Model Averaging Ensemble technique, where
multiple sub-models contribute equally, or according to their performance
weights, to a combined prediction. In Stacking, an entirely new model is
trained to combine the contributions from each sub-model and produce the
best predictions. This final model is said to be stacked on top of the others,
hence the name.

The Architecture of Stacking:

• Original Data - The original data is split into n folds
• Base Models - The Level 1 individual models
• Level 1 Predictions - The predictions generated by the base models on the original data
• Level 2 Model - The Meta-Learner, the model which combines the Level 1
predictions to generate the best final predictions
Steps of Implementation

The following steps are involved in implementation:

1. The original train data is split into n folds using
RepeatedStratifiedKFold.
2. Then the base learner (Model 1) is fitted on the first n-1 folds and
predictions are made for the nth part.
3. This prediction is added to the x1_train list.
4. Steps 2 & 3 are repeated for the remaining parts, and we obtain an
x1_train array of size n,
where x1_train[i] is the prediction on the (i+1)th part when Model 1 is fitted on
parts 1, 2, ..., i-1, i+1, ..., n.
5. Now, train the model on all the n parts and make predictions for test
data. Store this prediction in y1_test.
6. Similarly, we obtain x2_train, y2_test, x3_train and y3_test by using
Models 2 and 3 for training respectively, giving their Level 1 predictions.
7. Now we train a Meta Learner on Level 1 Predictions (using these
predictions as features for the model).
8. The Meta learner is now used to make predictions on test data.
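The manual fold-by-fold procedure above is what scikit-learn's StackingClassifier
automates internally; a minimal sketch (assuming that class is available, with
illustrative base models and synthetic data) might look like this:

```python
# Sketch: stacking with scikit-learn. The base models' out-of-fold predictions
# become the features for the meta-learner (here a logistic regression).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=5)

base_models = [
    ("dt", DecisionTreeClassifier(random_state=5)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=5)),
    ("svc", SVC(probability=True, random_state=5)),
]

stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),  # the Level 2 meta-learner
    cv=5,  # Level 1 predictions are produced with 5-fold cross-validation
).fit(X, y)

print("training accuracy:", stack.score(X, y))
```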

· Advantages of a Stacked Generalization Model:

1. The benefit of stacking is that it can harness the capabilities of a range of
well-performing models on a classification or regression task and make
predictions that have better performance than any single model in the
ensemble.

2. Stacking improves the model prediction accuracy.

· Disadvantage of a Stacked Generalization Model:

1. Since the whole dataset is used to train every individual classifier, in the
case of huge datasets the computation time will be high, as each classifier
works independently on the entire dataset.
……………………. End ……………

Q) Naive Bayes?

Naive Bayes is based on Bayes' theorem, which works on conditional probability.
Conditional probability is the probability that something will happen, given that
something else has already occurred. Conditional probability can give us
the probability of an event using prior knowledge.

Conditional probability (Bayes' theorem):

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A): The probability of hypothesis A being true. This is known as the prior
probability.
P(B): The probability of the evidence.
P(A|B): The probability of the hypothesis given that the evidence is true (the
posterior probability).
P(B|A): The probability of the evidence given that the hypothesis is true (the
likelihood).
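As a tiny worked illustration (the numbers below are made up purely for the
example), the posterior can be computed directly from the formula:

```python
# Hypothetical numbers: P(spam) = 0.2, P(word "offer" | spam) = 0.6, P(word "offer") = 0.15.
p_hypothesis = 0.2          # P(A): prior probability of the hypothesis (e.g. "email is spam")
p_evidence = 0.15           # P(B): probability of the evidence (e.g. the word "offer" appears)
p_evidence_given_h = 0.6    # P(B|A): likelihood of the evidence given the hypothesis

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_h_given_evidence = p_evidence_given_h * p_hypothesis / p_evidence
print(p_h_given_evidence)   # 0.8 -> posterior probability of spam given the word
```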

Naive Bayes Classifier

A classifier is a machine learning model that segregates different objects on the
basis of certain features or variables.
The Naive Bayes classifier is a kind of classifier that works on Bayes' theorem.
Membership probabilities are predicted for every class, i.e., the probability
that a data point is associated with a particular class.

The class having the maximum probability is chosen as the most suitable
class. This is also referred to as Maximum A Posteriori (MAP).

• The MAP estimate for a hypothesis H given evidence E is:

o MAP(H) = max P(H|E)
o MAP(H) = max [P(E|H) * P(H) / P(E)]
o MAP(H) = max [P(E|H) * P(H)]
o P(E) is the evidence probability, and it is used only to normalize the
result. The result will not be affected by removing P(E).

Naive Bayes classifiers assume that all the variables or features are
independent of each other: the presence or absence of one variable does not
affect the presence or absence of any other variable.
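A minimal sketch of a Naive Bayes classifier in practice (assuming scikit-learn's
GaussianNB, one of the variants mentioned in the advantages below; the Iris
dataset is used only as an example):

```python
# Sketch: Gaussian Naive Bayes on the Iris dataset (scikit-learn assumed available).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)

# predict_proba gives the per-class membership probabilities;
# predict returns the class with the maximum posterior (the MAP class).
print(nb.predict_proba(X_test[:3]))
print(nb.predict(X_test[:3]))
print("accuracy:", nb.score(X_test, y_test))
```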

Advantages And Disadvantages Of Naive Bayes Classifier

Advantages:

• It is a highly extensible algorithm that is very fast.
• It can be used for both binary as well as multiclass classification.
• It has mainly three different variants: GaussianNB,
MultinomialNB and BernoulliNB.
• It is a famous algorithm for spam email classification.
• It can be easily trained on small datasets and can be used for large
volumes of data as well.

Disadvantages:

• The main disadvantage of Naive Bayes is that it treats all the variables
that contribute to the probability as independent, which rarely holds for
real data.

Applications of Naive Bayes Algorithms

• Real-time Prediction: Being a fast learning algorithm, it can be used to
make predictions in real time as well.
• MultiClass Classification: It can be used for multi-class classification
problems as well.
• Text Classification: It has shown good results on multi-class text
classification, so it has a higher success rate there than many other
algorithms. As a result, it is majorly used in sentiment analysis & spam
detection.
…………… end………….
Q) Support Vector Machine?

Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression
problems. However, it is primarily used for Classification problems in
Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine. Consider the below diagram, in which
there are two different categories that are classified using a decision boundary
or hyperplane:

Example: SVM can be understood with the example that we have used in the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs; if we want a model that can accurately identify whether it is a cat or a
dog, such a model can be created by using the SVM algorithm. We will first
train our model with lots of images of cats and dogs so that it can learn about
the different features of cats and dogs, and then we test it with this strange
creature. The SVM creates a decision boundary between these two classes (cat
and dog) and chooses the extreme cases (support vectors), so it will look at the
extreme cases of cats and dogs. On the basis of the support vectors, it will
classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which
means that if a dataset can be classified into two classes by using a single
straight line, then such data is termed linearly separable data, and the
classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable
data, which means that if a dataset cannot be classified by using a straight
line, then such data is termed non-linear data, and the classifier used is
called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane:

There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that
helps to classify the data points. This best boundary is known as the
hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in
the dataset, which means that if there are 2 features (as shown in the image),
then the hyperplane will be a straight line. And if there are 3 features, then the
hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the
maximum distance to the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which
affect the position of the hyperplane are termed Support Vectors. Since these
vectors support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset
has two features x1 and x2. We want a classifier that can classify the pair (x1,
x2) of coordinates as either green or blue. Consider the below image:

As it is a 2-D space, by just using a straight line we can easily separate
these two classes. But there can be multiple lines that can separate these
classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this
best boundary or region is called a hyperplane. The SVM algorithm finds the
points of the two classes that are closest to the line. These points are called
support vectors. The distance between the vectors and the hyperplane is called
the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
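A minimal sketch of a linear SVM and its support vectors (assuming scikit-learn's
SVC with a linear kernel; the two-blob dataset is synthetic and only illustrative):

```python
# Sketch: linear SVM; the fitted model exposes the support vectors that define the margin.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs -> linearly separable data.
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors per class:", svm.n_support_)
print("support vectors:\n", svm.support_vectors_)
```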
Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but
for non-linear data, we cannot draw a single straight line. Consider the below
image:

So to separate these data points, we need to add one more dimension. For
linear data, we have used two dimensions x and y, so for non-linear data, we
will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way.
Consider the below image:

Since we are in 3-D space, the decision boundary looks like a plane parallel to
the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
Hence we get a circle of radius 1 in the case of non-linear data.
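In practice the extra dimension z = x² + y² does not have to be added by hand: a
kernel does this mapping implicitly. A hedged sketch (assuming scikit-learn's SVC
with the RBF kernel, on synthetic concentric-circle data) of the idea:

```python
# Sketch: non-linear SVM. The RBF kernel implicitly maps the data to a higher-dimensional
# space (playing the role of z = x^2 + y^2 here), so circularly separable data can be split.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=4)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor on circular data
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # near perfect
```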

………………………….. end ……………………….
