ML Unit 3 (Ab22)
Q) Boosting?
1. Start with a Weak Model: We start with a weak model that makes
predictions on the data.
2. Identify Errors: We identify the errors made by the weak model and give
more weight to the data points that were misclassified.
3. Create a New Model: We create a new model that tries to correct the errors
made by the previous model.
4. Combine Models: We combine the predictions of all the models to form a
single, stronger model.
Key Concepts
1. Weak Model: A weak model is a model that makes predictions on the data,
but is not very accurate.
2. Strong Model: A strong model is the combination of many weak models, and
is much more accurate than any single weak model.
3. Weighting: We give more weight to the data points that were misclassified
by the previous model.
Types of Boosting
The most common boosting algorithms are AdaBoost, Gradient Boosting, and
XGBoost (each is covered in its own question below).
Advantages:
1. Handling Complex Data: Boosting can handle complex data sets with
multiple features and interactions between them.
2. Flexibility: Boosting allows for the use of different models and algorithms,
which can be useful for handling different types of data.
3. Handling Missing Values: Boosting can handle missing values in the data by
using surrogate models to impute the missing values.
4. Efficient Use of Data: Boosting can make efficient use of the data by using
the same data to train multiple models.
Disadvantages:
1. Not Suitable for All Problems: Boosting may not be suitable for all problems,
especially those that require a simple and interpretable model.
Boosting example:
Suppose we want to predict whether a person will buy a car based on their age
and income:
| Age | Income | Buys Car |
| --- | --- | --- |
| 25 | 50000 | Yes |
| 30 | 60000 | Yes |
| 35 | 70000 | No |
| 20 | 40000 | No |
| 40 | 80000 | Yes |
We start with a weak model that uses only the age to make predictions. This
model is not very accurate, but it's a good starting point.
| Age | Prediction |
| --- | --- |
| 25 | Yes |
| 30 | Yes |
| 35 | Yes |
| 20 | No |
| 40 | Yes |
The weak model predicts that people aged 25 and above will buy a car, and
younger people will not.
We identify the errors made by the weak model. For example, the model
predicts that a person aged 35 will buy a car, but they actually do not.
| Age | Prediction | Actual |
| --- | --- | --- |
| 25 | Yes | Yes |
| 30 | Yes | Yes |
| 35 | Yes | No (Error) |
| 20 | No | No |
| 40 | Yes | Yes |
We create a new model that uses the age and income to make predictions.
This model is more accurate than the previous model.
| Age | Income | Prediction |
| --- | --- | --- |
| 25 | 50000 | Yes |
| 30 | 60000 | Yes |
| 35 | 70000 | No |
| 20 | 40000 | No |
| 40 | 80000 | Yes |
Because the new model also uses income, it is able to correct the weak model's
error on the 35-year-old (income 70000), who it now correctly predicts will not
buy a car.
| Age | Income | Weak Model | New Model | Actual |
| --- | --- | --- | --- | --- |
| 35 | 70000 | Yes | No | No |
| 20 | 40000 | No | No | No |
The strong model uses the predictions of both models to make a final
prediction.
Step 6: Repeat
We repeat the process of creating a new model and combining it with the
previous models until we achieve the desired level of accuracy.
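The whole loop can be sketched in a few lines of Python. This is only an
illustrative sketch: it assumes scikit-learn is available, reuses the toy
age/income table above, and uses depth-1 decision trees ("stumps") as the weak
models.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data from the example: [age, income], label = buys a car (1 = Yes, 0 = No)
X = np.array([[25, 50000], [30, 60000], [35, 70000], [20, 40000], [40, 80000]])
y = np.array([1, 1, 0, 0, 1])

weights = np.ones(len(y)) / len(y)      # start with equal weights
models = []

for _ in range(2):                      # two boosting rounds, for illustration
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)      # weak model on the weighted data
    models.append(stump)

    misclassified = stump.predict(X) != y       # identify the errors
    weights[misclassified] *= 2.0               # give those points more weight
    weights /= weights.sum()                    # re-normalize

# "Strong" model: majority vote of the weak models
all_preds = np.array([m.predict(X) for m in models])
final_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
print(final_pred)
```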
............................................ END…………………
Q) Weak Learners in Boosting?
How boosting uses a weak learner:
1. Make predictions: The weak learner makes predictions on the training data,
which are then used to compute the error.
2. Compute the error: The error is computed as the difference between the
predicted values and the actual values.
3. Update the weights: The weights of the training data are updated based on
the error, so that the data points that are misclassified are given more weight.
4. Train the next weak learner: The next weak learner is trained on the
updated weighted data, and the process is repeated.
The weak learner is typically a simple model, such as a decision tree or a
linear classifier, and is trained on a subset of the data. The weak learner is not
expected to perform well on its own, but rather to contribute to the overall
performance of the boosting ensemble.
Characteristics of a weak learner:
1. Low accuracy: A weak learner is not expected to have high accuracy on its
own.
2. Diverse: A weak learner should be diverse from other weak learners, so that
they can contribute to the overall performance of the boosting ensemble.
Common choices of weak learner:
1. Decision trees: Decision trees are a popular choice for weak learners in
boosting, as they are simple, fast, and can be trained on a subset of the data.
Advantages:
1. Improved accuracy: Boosting with weak learners can improve the accuracy
of the overall model.
………………….. END………………….
Q) AdaBoost?
1. Start with a weak model: We start with a simple model that makes
predictions on the data.
2. Identify errors: We find the data points that the model misclassified and
give them more weight.
3. Create a new model: We create a new model that tries to correct the errors
made by the previous model.
4. Combine models: Each model is given a weight based on how accurate it is,
and the weighted predictions are combined into the final prediction.
AdaBoost Formula
F(x) = sign(α1·h1(x) + α2·h2(x) + ... + αM·hM(x))
where:
- F(x) is the final prediction
- hm(x) is the prediction of the m-th weak model
- αm is the weight given to the m-th model, based on its accuracy
Key concepts
1. Weighting: We give more weight to the models that are more accurate.
Example
Suppose we want to predict whether a person will buy a car based on their
age and income. We start with a weak model that uses only the age to make
predictions. We then create a new model that uses the age and income to
make predictions. We combine the predictions of both models to create a
strong model.
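A minimal scikit-learn sketch of this idea (the data values and parameters below are made-up assumptions for illustration; AdaBoostClassifier uses a depth-1 decision tree as its default weak model):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Made-up data: [age, income], label = buys a car (1 = Yes, 0 = No)
X = np.array([[25, 50000], [30, 60000], [35, 70000], [20, 40000], [40, 80000]])
y = np.array([1, 1, 0, 0, 1])

# Default weak model is a decision stump; 50 of them are combined with weights
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
model.fit(X, y)
print(model.predict([[28, 55000]]))   # prediction for a new person
```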
Advantages
1. Robust to noise: AdaBoost is robust to noise and can handle datasets with
outliers.
Disadvantages
Real-life example:
AdaBoost is used in many real-life applications, such as face detection (the
Viola-Jones detector), spam filtering, and text classification.
………………………….. END…………………
Q) Gradient Boosting?
1. Start with a weak model: We start with a simple model that makes
predictions on the data.
2. Calculate the error: We calculate the error between the predicted values
and the actual values.
3. Create a new model: We create a new model that tries to correct the errors
made by the previous model.
Key concepts
1. Gradient: We use the gradient of the loss function to decide how each new
model should correct the errors of the previous ones.
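A minimal scikit-learn sketch (the toy data and parameter values are assumptions for illustration only):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy data: [age, income], label = buys a car (1 = Yes, 0 = No)
X = np.array([[25, 50000], [30, 60000], [35, 70000], [20, 40000], [40, 80000]])
y = np.array([1, 1, 0, 0, 1])

model = GradientBoostingClassifier(
    n_estimators=100,    # number of small trees added one after another
    learning_rate=0.1,   # how strongly each new tree corrects the remaining error
    max_depth=2,         # each weak model is a shallow tree
)
model.fit(X, y)
print(model.predict([[28, 55000]]))
```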
Advantages
Disadvantages
Real-life example
.............................. END…………..
Q) XGBoost?
What is XGBoost?
XGBoost (eXtreme Gradient Boosting) is an optimized, regularized
implementation of Gradient Boosting that is designed to be fast and accurate.
1. Start with a weak model: We start with a simple model that makes
predictions on the data.
2. Calculate the error: We calculate the error between the predicted values
and the actual values.
3. Create a new model: We create a new model that tries to correct the errors
made by the previous model.
Key concepts
1. Gradient: We use the gradient of the loss function to decide how each new
model should correct the errors of the previous ones.
2. Handling missing values: XGBoost can handle missing values in the data,
which is a common problem in machine learning.
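A minimal sketch, assuming the separate xgboost package is installed (its XGBClassifier follows the familiar fit/predict style; the data and parameters are illustrative assumptions, and the np.nan value shows that missing values can be passed in directly):

```python
import numpy as np
from xgboost import XGBClassifier

# Toy data: [age, income]; the np.nan is a missing income value
X = np.array([[25, 50000], [30, np.nan], [35, 70000], [20, 40000], [40, 80000]])
y = np.array([1, 1, 0, 0, 1])   # 1 = buys a car, 0 = does not

model = XGBClassifier(
    n_estimators=100,    # number of boosted trees
    learning_rate=0.1,
    max_depth=3,
)
model.fit(X, y)          # missing values are handled natively
print(model.predict(np.array([[28, 55000]])))
```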
Advantages
1. Fast training time: XGBoost is much faster than other Gradient Boosting
algorithms.
2. Handling missing values: XGBoost can handle missing values in the data.
Disadvantages
Real-life example
……………………… END………………………
Q) SVM Regression?
1. Data: We have a dataset with input features (e.g. age, income) and a
continuous output variable (e.g. house price).
2. Find the Best Line: SVM regression tries to find the best line or curve that
fits the data, by minimizing the error between the predicted and actual values.
3. Allow a Margin of Error: SVM regression allows a small margin of tolerance
(called epsilon) around the line; errors inside this margin are not penalized.
4. Find the Support Vectors: The algorithm identifies the most important data
points, called support vectors, that help define the best line or curve.
5. Make Predictions: Once the best line or curve is found, SVM regression uses
it to make predictions on new, unseen data.
Key Concepts
1. Kernel Trick: The kernel trick is a mathematical technique that allows us to
transform the data into a higher-dimensional space, without actually having to
compute the coordinates of the data points in that space.
Types of SVM Regression:
1. Linear SVM Regression: Linear SVM regression uses a linear kernel to find
the best line that fits the data.
2. Non-linear SVM Regression: Non-linear SVM regression uses a kernel (such
as the Gaussian RBF kernel) to fit a curve to data that does not follow a
straight line.
Example
Suppose we want to predict house prices based on features like age, income,
and location. We collect a dataset of 100 houses, with their corresponding
prices. We use SVM regression to find the best line or curve that fits the data,
and then use it to make predictions on new, unseen data.
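A minimal scikit-learn sketch of SVM regression (the feature values, kernel, and parameters are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: [age of house in years, owner income]; target = house price
X = np.array([[5, 50000], [10, 60000], [2, 80000], [20, 40000]])
y = np.array([300000, 320000, 450000, 200000])

model = SVR(
    kernel="rbf",     # fit a curve rather than a straight line
    C=100.0,          # how strongly errors outside the margin are penalized
    epsilon=1000.0,   # width of the "no penalty" margin around the curve
)
model.fit(X, y)
print(model.predict([[8, 55000]]))   # predicted price of a new house
```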
………………….. END……………….
What is a kernel?
A kernel is a function that measures the similarity between two data points. It
lets an SVM behave as if the data had been mapped into a higher-dimensional
space, without actually computing that mapping. One of the most widely used
kernels is the Gaussian RBF (Radial Basis Function) kernel.
The Gaussian RBF kernel takes the original data points and calculates the
similarity between them using a Gaussian function. The similarity is based on
the distance between the points, so points that are close together will have a
high similarity, while points that are far apart will have a low similarity.
K(x, y) = exp(-γ · ||x − y||²)
where x and y are the data points, γ is a parameter that controls the width of
the Gaussian function, and exp is the exponential function.
Advantages of the Gaussian RBF kernel:
1. Non-linear separation: The Gaussian RBF kernel can separate data that is
not linearly separable, which means it can handle complex relationships
between the data points.
2. Robust to noise: The Gaussian RBF kernel is robust to noise and outliers in
the data, which means it can handle data that is not perfect.
3. Flexible: The Gaussian RBF kernel can be used with different types of data,
including numerical and categorical data.
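A tiny NumPy sketch of how the similarity in the formula above can be computed directly (γ = 0.5 is just an assumed value):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF similarity between two points."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

a = np.array([1.0, 2.0])
b = np.array([1.2, 1.9])   # close to a  -> similarity near 1
c = np.array([9.0, 9.0])   # far from a  -> similarity near 0

print(rbf_kernel(a, b))
print(rbf_kernel(a, c))
```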
…………………. END…………..
Q) Ensemble Learning?
Ensemble learning combines the predictions of multiple models to make a
better final prediction than any single model. Why use it:
1. Improve Accuracy: Combining several models usually gives more accurate
predictions than a single model.
2. Reduce Overfitting: Averaging over many models smooths out the mistakes
of any one model.
3. Increase Robustness: Ensemble learning can make the model more robust
to changes in the data or to outliers.
Types of ensemble learning:
1. Bagging: Bagging involves training multiple models on different random
subsets of the data (sampled with replacement) and combining their predictions.
2. Boosting: Boosting involves creating multiple models one after another,
where each new model gives more weight to the data points the previous
models got wrong.
3. Stacking: Stacking involves creating multiple models and then using a meta-
model to combine their predictions.
How ensemble learning works:
1. Train Multiple Models: We train multiple models on the same data, using
different algorithms or different subsets of the data.
2. Make Predictions: Each model makes its own prediction on new data.
3. Combine Predictions: The predictions are combined, for example by voting
(for classification) or averaging (for regression).
4. Make Final Prediction: The final prediction is made based on the combined
predictions of all the models.
Example
1. Model 1: A decision tree model that predicts 70% of the people will buy a
car.
2. Model 2: A logistic regression model that predicts 60% of the people will
buy a car.
3. Model 3: A neural network model that predicts 80% of the people will buy a
car.
We combine the predictions of the three models by averaging them:
(70% + 60% + 80%) / 3 = 70%. The final prediction is that about 70% of the
people will buy a car.
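The combination step of this example can be written in a couple of lines of Python (the three numbers are the made-up predictions above):

```python
# Predicted probability of "buys a car" from the three models
predictions = [0.70, 0.60, 0.80]

# Simple ensemble: average the predictions
final = sum(predictions) / len(predictions)
print(final)   # 0.70 -> about 70% of people are predicted to buy a car
```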
…………………. END………………….
What is Bagging?
1. Create multiple datasets: Take your original dataset and create multiple
subsets of it, called bootstrap samples. Each bootstrap sample is created by
randomly selecting a subset of the original data with replacement.
2. Train a model on each sample: Train a separate model on each bootstrap
sample.
3. Make predictions: Use each model to make predictions on new, unseen data.
4. Combine predictions: Average the predictions (for regression) or take a
majority vote (for classification).
What is Pasting?
Pasting works in the same way as bagging, except that the subsets are sampled
without replacement, so each data point appears at most once in a given subset.
As with bagging, a separate model is trained on each subset, and each model
then makes predictions on new, unseen data.
Key difference between bagging and pasting:
- Sampling: Bagging samples the data with replacement, while pasting samples
without replacement.
Applications:
- Time series forecasting: Bagging and pasting can be used to improve the
accuracy of time series forecasting models, such as ARIMA or LSTM.
Bagging example:
Let's say you're trying to predict the price of a house, and you have a dataset
of 100 houses with their features (number of bedrooms, square footage,
location, etc.). You create 5 bootstrap samples, each with 80 houses. You train
a decision tree model on each bootstrap sample, and then you use each model
to estimate the price of a new house. The estimates are:
- Model 1: $500,000
- Model 2: $550,000
- Model 3: $450,000
- Model 4: $520,000
- Model 5: $480,000
You take the average of all 5 estimates, which is $500,000. This is your final
prediction.
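A minimal scikit-learn sketch of this bagging example (the feature values are made up; by default BaggingRegressor trains a decision tree on each bootstrap sample):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Toy data: [bedrooms, square footage]; target = house price
X = np.array([[3, 1500], [4, 2000], [2, 900], [5, 2600], [3, 1700]])
y = np.array([500000, 550000, 450000, 520000, 480000])

model = BaggingRegressor(
    n_estimators=5,    # 5 models, one per bootstrap sample
    bootstrap=True,    # sample the data with replacement (bagging)
)
model.fit(X, y)
print(model.predict([[3, 1600]]))   # average of the 5 models' estimates
```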
Example of combining several different models on the same data (weighted
averaging):
Let's say you're trying to predict the price of a house, and you have a dataset
of 100 houses with their features (number of bedrooms, square footage,
location, etc.). You create 5 different models (for example, a decision tree, a
linear regression, a random forest, a k-nearest-neighbours model, and a neural
network).
You train each model on the same dataset, and then you use each model to
estimate the price of a new house. The estimates are:
- Model 1: $480,000
- Model 2: $520,000
- Model 3: $450,000
- Model 4: $500,000
- Model 5: $550,000
You combine the estimates from each model using a weighted average, where
the weights are based on the performance of each model on the training data.
The final prediction is $505,000.
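The weighted-average combination can be sketched like this (the weights are assumed values for illustration; in practice they would come from each model's performance on validation data):

```python
# Price estimates from the five models
estimates = [480000, 520000, 450000, 500000, 550000]

# Assumed weights reflecting how well each model performed
weights = [0.20, 0.25, 0.16, 0.15, 0.24]

final = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
print(final)   # 505000.0 -> the final prediction of $505,000
```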
Q) Voting Classifiers?
A Voting Classifier combines the predictions of several different models:
1. Train models: Several different models are trained on the same data.
2. Make predictions: Each model predicts a class for a new data point.
3. Count the votes: The votes for each class are counted.
4. Final prediction: The final prediction is the class with the most votes.
Types of voting:
1. Hard Voting:
Hard Voting Classifiers are a type of ensemble learning method that combines
the predictions of multiple models to make a final prediction. In simple words,
it's like having a team of experts voting on a decision, where each model gets
one vote, and the class with the most votes wins.
For example, let's say we have three models predicting whether a picture is
of a dog, cat, or horse. Model 1 predicts dog, Model 2 predicts cat, and Model 3
predicts dog. In this case, the final prediction would be dog because it got two
votes.
Hard voting classifiers are simple to implement and can be very effective in
improving the accuracy of your models. They work by reducing the impact of
noise and outliers in the data, and by combining the strengths of multiple
models.
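The vote counting in the dog/cat/dog example can be sketched directly:

```python
from collections import Counter

predictions = ["dog", "cat", "dog"]   # one vote per model

votes = Counter(predictions)
final = votes.most_common(1)[0][0]    # class with the most votes
print(final)                          # "dog"
```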
2. Soft Voting:
Soft Voting Classifiers are a type of ensemble learning method that combines
the predictions of multiple models to make a final prediction. In simple words,
it's like having a team of experts voting on a decision, where each model gets a
weighted vote based on its confidence in the prediction.
For example, let's say we have three models predicting whether a picture is
of a dog, cat, or horse. Model 1 predicts dog with 80% confidence, Model 2
predicts cat with 60% confidence, and Model 3 predicts dog with 70%
confidence. In this case, the final prediction would be dog because it got the
highest weighted vote.
Soft voting classifiers are useful when we have multiple models with different
strengths and weaknesses, and we want to combine their predictions to make
a more accurate final prediction. They can be used for both binary and
multiclass classification problems, and can handle nonlinear relationships
between features.
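A minimal scikit-learn sketch of a voting classifier (the toy data and the choice of base models are assumptions for illustration; changing voting to "hard" gives a hard voting classifier instead):

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy data: two features, binary label
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("tree", DecisionTreeClassifier()),
                ("nb", GaussianNB())],
    voting="soft",   # average the predicted probabilities ("hard" counts votes)
)
model.fit(X, y)
print(model.predict([[3.5, 3.5]]))
```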
Advantages
1. Improved accuracy: Combining several models usually gives better accuracy
than any single model.
2. Robustness to noise: Voting Classifiers can reduce the impact of noise in the
data.
3. Handling missing values: Voting Classifiers can handle missing values in the
data.
Disadvantages
Real-life example
1. Text classification: Voting Classifiers are used to classify text into different
categories, such as spam vs. non-spam emails.
Q) Stacking?
1. Train multiple models: We train multiple models on the same dataset, each
with a different algorithm or configuration.
2. Make predictions: Each model makes predictions on the data.
3. Train a meta-model: A meta-model is trained to take the predictions of the
individual models as input and combine them into a final prediction.
Stacking is useful when we have multiple models that are good at different
things, and we want to combine their strengths to make a more accurate final
prediction. It's like having a team of experts, where each expert is good at a
different aspect of the problem, and the meta-model combines their expertise
to make a final decision.
Advantages:
Disadvantages:
1. Difficulty in Choosing the Right Models: Stacking requires the choice of the
right models and algorithms, which can be difficult and time-consuming.
When to use stacking:
1. When dealing with complex data sets: Stacking can be useful when dealing
with complex data sets that require the use of multiple models and
algorithms.
When not to use stacking:
1. When dealing with simple data sets: Stacking may not be necessary when
dealing with simple data sets that can be handled by a single model.
Here's an example:
Let's say we're trying to predict whether a customer will buy a product or not.
We have three models (for example, a Decision Tree, a Logistic Regression
model, and a Support Vector Machine). We train each model on the same
dataset, and each one outputs a predicted probability that a given customer
will buy the product.
We then create a meta-model, let's say a Linear Regression model, that takes
the predictions from each of the individual models as input. The meta-model
predicts the final probability of a customer buying the product.
| Customer | Meta-Model Prediction |
| --- | --- |
| John | 0.85 |
| Jane | 0.55 |
| Bob | 0.65 |
In this example, the meta-model combines the predictions from the individual
models to make a final prediction. The final prediction is a weighted average
of the individual predictions, where the weights are learned by the meta-
model.
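A minimal scikit-learn sketch of stacking, with Logistic Regression as the meta-model (the toy data and the choice of base models are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy data: [age, income], label = buys the product (1 = Yes, 0 = No)
X = np.array([[25, 50000], [30, 60000], [35, 70000], [20, 40000],
              [40, 80000], [45, 30000], [50, 90000], [22, 45000]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])

model = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("svm", SVC())],
    final_estimator=LogisticRegression(),   # the meta-model
    cv=2,                                   # small cv because the toy set is tiny
)
model.fit(X, y)
print(model.predict([[28, 55000]]))
```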
……………………….. END……………
Types of kernels used in SVM:
1. Linear Kernel:
- Formula: K(x, y) = x · y
2. RBF (Radial Basis Function) Kernel:
- Formula: K(x, y) = exp(-γ · ||x − y||²)
3. Gaussian Kernel: Similar to the RBF kernel, but with a different formulation.
These kernels can be used for different types of data and problems, such as:
- RBF kernel for non-linearly separable data with a large number of features
Each kernel has its own strengths and weaknesses, and the choice of kernel
depends on the specific problem and data.
Q) Random forests?
Random Forest is an ensemble learning method that builds many decision trees
on random subsets of the data and combines their predictions (by majority
vote for classification, or by averaging for regression). A greater number of
trees in the forest generally leads to higher accuracy and helps prevent
overfitting.
Note: To better understand the Random Forest Algorithm, you should have
knowledge of the Decision Tree Algorithm.
Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct output,
while others may not. But together, all the trees predict the correct output.
Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset
so that the classifier can predict accurate results rather than a guessed
result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest
algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and even for a large dataset it runs
efficiently.
o It can maintain accuracy even when a large proportion of the data is missing.
The working process can be explained in the below steps:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points
(Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 and Step 2 until N trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and
assign the new data points to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So,
this dataset is given to the Random forest classifier. The dataset is divided into
subsets and given to each decision tree. During the training phase, each
decision tree produces a prediction result, and when a new data point occurs,
the Random Forest classifier predicts the final decision based on the majority
of the results.
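A minimal scikit-learn sketch of this idea (the numeric features standing in for the fruit images are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: two numeric features per fruit, labels: 0 = apple, 1 = banana
X = np.array([[150, 7.0], [170, 7.5], [140, 6.8],
              [120, 19.0], [130, 20.0], [110, 18.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = RandomForestClassifier(
    n_estimators=100,     # the number N of decision trees in the forest
    max_features="sqrt",  # each split considers a random subset of features
)
model.fit(X, y)
print(model.predict([[125, 19.5]]))   # majority vote of the 100 trees
```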
There are mainly four sectors where Random Forest is mostly used:
1. Banking: to identify loan risk and detect fraud.
2. Medicine: to identify disease trends and risks.
3. Land use: to identify areas of similar land use.
4. Marketing: to identify marketing trends.
Q) Naive Bayes?
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which
can be described as:
- Naïve: It assumes that the occurrence of a certain feature is independent of
the occurrence of other features.
- Bayes: It depends on the principle of Bayes' Theorem.
Bayes' Theorem:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
- P(A|B) is the posterior probability of class A given the observed data B
- P(B|A) is the likelihood of the data B given class A
- P(A) is the prior probability of class A
- P(B) is the probability of the data B
Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
Problem: If the weather is sunny, should the Player play or not?
Solution: First consider the dataset below:
| Day | Outlook | Play |
| --- | --- | --- |
| 0 | Rainy | Yes |
| 1 | Sunny | Yes |
| 2 | Overcast | Yes |
| 3 | Overcast | Yes |
| 4 | Sunny | No |
| 5 | Rainy | Yes |
| 6 | Sunny | Yes |
| 7 | Overcast | Yes |
| 8 | Rainy | No |
| 9 | Sunny | No |
| 10 | Sunny | Yes |
| 11 | Rainy | No |
| 12 | Overcast | Yes |
| 13 | Overcast | Yes |
Frequency table for the weather conditions:
| Weather | Yes | No |
| --- | --- | --- |
| Overcast | 5 | 0 |
| Rainy | 2 | 2 |
| Sunny | 3 | 2 |
| Total | 10 | 4 |
Likelihood table of the weather conditions:
| Weather | No | Yes | P(Weather) |
| --- | --- | --- | --- |
| Overcast | 0 | 5 | 5/14 = 0.35 |
| Rainy | 2 | 2 | 4/14 = 0.29 |
| Sunny | 2 | 3 | 5/14 = 0.35 |
| All | 4/14 = 0.29 | 10/14 = 0.71 | |
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) · P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Yes) = 0.71
P(Sunny) = 0.35
So P(Yes|Sunny) = 0.3 · 0.71 / 0.35 ≈ 0.60
P(No|Sunny) = P(Sunny|No) · P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 · 0.29 / 0.35 ≈ 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
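The same calculation can be checked with a few lines of Python, using the exact fractions from the frequency table (the exact values come out as 0.60 and 0.40; the 0.41 above appears because the probabilities were rounded before dividing):

```python
# Counts taken from the frequency table above
p_sunny_given_yes = 3 / 10   # 3 sunny days out of 10 "Yes" days
p_sunny_given_no = 2 / 4     # 2 sunny days out of 4 "No" days
p_yes, p_no = 10 / 14, 4 / 14
p_sunny = 5 / 14

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(round(p_yes_given_sunny, 2))   # 0.6
print(round(p_no_given_sunny, 2))    # 0.4
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")
```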
…………… end………….
Q) Support Vector Machine?
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed Support Vector Machine. The hyperplane acts as a decision boundary
that separates the two different categories.
Example: SVM can be understood with the example that we have used in the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs, so if we want a model that can accurately identify whether it is a cat or
dog, so such a model can be created by using the SVM algorithm. We will first
train our model with lots of images of cats and dogs so that it can learn about
different features of cats and dogs, and then we test it with this strange
creature. The SVM creates a decision boundary between the two classes (cat
and dog) using the extreme cases (the support vectors), and on the basis of
these support vectors it classifies the strange creature as a cat.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM:
1.Linear SVM:
The goal of a Linear SVM is to find the best hyperplane that separates
the data into different classes. The hyperplane is defined by a set of
weights and a bias term. The weights determine the direction of the
hyperplane, and the bias term determines the position of the
hyperplane.
For example, suppose we have a dataset of exam scores and hours studied, and
we want to predict whether a student will pass or fail:
| Exam Score | Hours Studied | Result |
| --- | --- | --- |
| 80 | 10 | Pass |
| 70 | 8 | Fail |
| 90 | 12 | Pass |
| 60 | 6 | Fail |
| 85 | 11 | Pass |
We can use a Linear SVM to classify the students into two classes: Pass
and Fail. The Linear SVM algorithm will find the optimal hyperplane
that separates the data into these two classes.
The resulting hyperplane separates the data into two classes: Pass and Fail. The
students who have an exam score greater than 75 and have studied for
more than 9 hours are classified as Pass, while the students who have an
exam score less than 75 or have studied for less than 9 hours are
classified as Fail.
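A minimal scikit-learn sketch of this Pass/Fail example (the kernel and C value are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Data from the table: [exam score, hours studied]
X = np.array([[80, 10], [70, 8], [90, 12], [60, 6], [85, 11]])
y = np.array(["Pass", "Fail", "Pass", "Fail", "Pass"])

model = SVC(kernel="linear", C=1.0)   # linear kernel -> straight-line boundary
model.fit(X, y)
print(model.predict([[75, 9]]))       # classify a new student
```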
2. Non-linear SVM:
Non-linear SVM is used when the data cannot be separated by a straight line.
It works as follows:
1. Choose a Kernel: A kernel function (such as the Gaussian RBF kernel) is
chosen to measure the similarity between data points.
2. Transform the Data: The kernel implicitly maps the data into a higher-
dimensional space where it becomes linearly separable.
3. Find the Optimal Hyperplane: The algorithm finds the optimal hyperplane
that separates the data into different classes. The optimal hyperplane is the
one that maximizes the margin between the classes.
4. Make Predictions: Once the optimal hyperplane is found, the algorithm can
make predictions on new, unseen data.
Suppose we have a dataset of exam scores and hours studied, and we want to
predict whether a student will pass or fail an exam based on their score and
hours studied. The dataset is as follows:
| Exam Score | Hours Studied | Result |
| --- | --- | --- |
| 80 | 10 | Pass |
| 70 | 8 | Fail |
| 90 | 12 | Pass |
| 60 | 6 | Fail |
| 85 | 11 | Pass |
We can use a Nonlinear SVM to classify the students into two classes: Pass and
Fail. The Nonlinear SVM algorithm will find the optimal hyperplane that
separates the data into these two classes.
The resulting (non-linear) decision boundary separates the data into two
classes: Pass and Fail. Students who fall on one side of the boundary are
classified as Pass, while students on the other side are classified as Fail.
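Only the kernel changes in the non-linear version; a minimal sketch (the RBF kernel and its settings are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Same data: [exam score, hours studied]
X = np.array([[80, 10], [70, 8], [90, 12], [60, 6], [85, 11]])
y = np.array(["Pass", "Fail", "Pass", "Fail", "Pass"])

model = SVC(kernel="rbf", gamma="scale")   # RBF kernel -> curved boundary
model.fit(X, y)
print(model.predict([[75, 9]]))
```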
Hyperplane:
We always create a hyperplane that has a maximum margin, which means the
maximum distance between the hyperplane and the nearest data points of each
class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect
its position are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.
…………………. END…………..