
Unit-V

Ensemble Learning:
Ensemble learning is a technique in machine learning where multiple models are
combined to solve a particular problem. The primary idea is to aggregate
predictions from several models to achieve better performance than any single
model alone.

Ensemble methods leverage the strengths of different models to reduce errors


and enhance accuracy. By combining multiple weak models (models that perform
slightly better than random guessing), we can create a strong learner that is more
accurate and robust.

➢ Improves Accuracy:

• By aggregating predictions, ensemble models often outperform individual models in terms of accuracy.

➢ Reduces Overfitting:

• Combining multiple models reduces the risk of overfitting (high variance) because it balances out the individual errors of the models.

➢ Handles Bias and Variance Trade-off:

• Ensembles can help manage the bias-variance trade-off by reducing both bias (systematic errors) and variance (sensitivity to training data).

Types of Ensemble Methods:

There are two main categories of ensemble methods:

1. Bagging (Bootstrap Aggregating)

2. Boosting
1. Bagging (Bootstrap Aggregating)

• In bagging, multiple models are trained independently in parallel on different subsets of the data created using bootstrapping (random sampling with replacement).

• The results of these models are then aggregated to form a final prediction.

• Example: Random Forest is a popular bagging technique where multiple Decision Trees are built, and their predictions are averaged or voted on.
➢ creates different subsets of data (this is called bootstrapping)
➢ trains one model per subset
➢ aggregates all predictions to get the final prediction

Key Goal: Reduce variance and prevent overfitting.

• Examples: Random Forests

2. Boosting
• Boosting is a sequential ensemble method where each new model is trained
to correct the errors made by the previous models.

• Models are added one by one, and each new model focuses on the hardest-
to-predict instances by adjusting their weights.

• Example: AdaBoost, Gradient Boosting, and XGBoost are popular boosting algorithms.

Key Goal: Reduce bias by focusing on mistakes of previous models.

• It is an iterative training process

• the subsequent model puts more focus on misclassified samples from the
previous model

• the final prediction is a weighted combination of all predictions

• Examples: XGBoost, AdaBoost, etc.
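As an illustration (not part of the original notes), the sketch below trains a bagging ensemble and a boosting ensemble side by side with scikit-learn; the synthetic dataset and the parameter values are arbitrary choices for demonstration.

# A minimal sketch comparing bagging and boosting, assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

# Synthetic binary classification data (arbitrary parameters, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: trees trained in parallel on bootstrap samples, predictions aggregated
bagging = BaggingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Boosting: models trained sequentially, each focusing on previously misclassified samples
boosting = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

print("Bagging accuracy: ", bagging.score(X_test, y_test))
print("Boosting accuracy:", boosting.score(X_test, y_test))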

Random Forest Algorithm:


Random Forest is a powerful and versatile ensemble learning algorithm used
primarily for classification and regression tasks. It is based on the concept of
combining multiple Decision Trees to form a robust model that improves accuracy
and reduces the risk of overfitting.

Key Concepts of Random Forest

1. Ensemble Learning:

➢ Ensemble methods use multiple learning models to improve the overall performance. Random Forest builds multiple Decision Trees and combines their predictions.

2. Bootstrap Aggregation (Bagging):

➢ The idea of bagging is to train each Decision Tree on a random subset of data (with replacement), known as a bootstrap sample. This helps to create diverse models.

➢ Each tree is trained on a different portion of the dataset, and the final
prediction is made by aggregating (e.g., majority voting for
classification or averaging for regression).

3. Feature Randomness:

➢ At each split in a Decision Tree, Random Forest randomly selects a subset of features. This introduces additional randomness, making the trees less correlated with each other and improving overall generalization.
How does Random Forest algorithm work?

Random Forest works in two phases: the first is to build the random forest by combining N decision trees, and the second is to make predictions by aggregating the outputs of the trees created in the first phase.

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority votes.

Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets
and given to each decision tree. During the training phase, each decision tree
produces a prediction result, and when a new data point occurs, then based on the
majority of results, the Random Forest classifier predicts the final decision.

Example:

Record   Age   Cholesterol   Blood Pressure   Heart Disease (Target)
1        45    230           140              1
2        50    210           130              0
3        55    250           150              1
4        60    240           145              0
5        48    235           138              1
6        53    225           135              0

Steps in Random Forest Algorithm

1. Dataset Sampling with Replacement (Bootstrap Sampling)

2. Feature Selection (Random Subset of Features)

3. Tree Construction for Each Sample

4. Prediction Aggregation (Majority Voting)

Step 1: Bootstrap Sampling:

Bootstrap Sample 1

Record Age Cholesterol Blood Pressure Target

2 50 210 130 0

3 55 250 150 1

5 48 235 138 1

1 45 230 140 1

Bootstrap Sample 2
Record Age Cholesterol Blood Pressure Target

4 60 240 145 0

2 50 210 130 0

6 53 225 135 0

3 55 250 150 1

Step 2: Random Feature Selection

• For each split, we randomly choose two features from the three available
(Age, Cholesterol, Blood Pressure).

Step 3: Construct Decision Trees

Tree 1 (Bootstrap Sample 1)

• Split 1: Randomly chosen features: Age and Cholesterol.

➢ Split based on Cholesterol < 230 (a strict threshold, so that Record 1, with Cholesterol exactly 230, falls in the right node):

▪ Left Node (Chol < 230): Record 2 → Target: 0.

▪ Right Node (Chol >= 230): Records {1, 3, 5} → Majority Target: 1.

Decision Rule for Tree 1:

• If Cholesterol < 230: Predict 0.

• If Cholesterol >= 230: Predict 1.

Tree 2 (Bootstrap Sample 2)

• Split 1: Randomly chosen features: Age and Blood Pressure.

o Split based on Age <= 53:

▪ Left Node (Age <= 53): Records {2, 6} → Majority Target: 0.

▪ Right Node (Age > 53): Records {3, 4} → Targets {1, 0}, a tie broken in favour of 0.

Decision Rule for Tree 2:

• If Age <= 53: Predict 0.

• If Age > 53: Predict 0 (so Tree 2 predicts 0 for every input).

Step 4: Prediction Aggregation

Let's make a prediction for a new patient:

• New Patient: Age = 52, Cholesterol = 240, Blood Pressure = 145.

Prediction from Tree 1:

• Cholesterol is 240 (which is >= 230).

o Prediction: 1 (Heart Disease).

Prediction from Tree 2:

• Age is 52 (which is <= 53).

o Prediction: 0 (No Heart Disease).

Final Prediction (Majority Voting):

• The two trees vote 1 and 0, so with only two trees the vote is tied.

• Applying a tie-breaking rule (for example, defaulting to the negative class or adding more trees), the final prediction here is taken as 0 (No Heart Disease). In practice a Random Forest uses many trees, which makes exact ties rare.
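For comparison (an illustrative sketch, not part of the hand-worked example), the same six-record dataset can be given to scikit-learn's RandomForestClassifier. Because bootstrap samples and feature subsets are drawn at random, the trees it builds, and therefore possibly the final vote, may differ from the two trees constructed above.

# Random Forest on the toy heart-disease data above (illustrative only).
from sklearn.ensemble import RandomForestClassifier

# Features: [Age, Cholesterol, Blood Pressure]; target: Heart Disease (1 = yes, 0 = no)
X = [[45, 230, 140], [50, 210, 130], [55, 250, 150],
     [60, 240, 145], [48, 235, 138], [53, 225, 135]]
y = [1, 0, 1, 0, 1, 0]

# Two trees, each splitting on a random subset of two features, mirroring the example
forest = RandomForestClassifier(n_estimators=2, max_features=2, random_state=0)
forest.fit(X, y)

# New patient: Age = 52, Cholesterol = 240, Blood Pressure = 145
print(forest.predict([[52, 240, 145]]))  # final class decided by majority vote of the trees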


Fusion techniques:

In machine learning and pattern recognition, fusion techniques are methods used to combine the outputs of multiple models or classifiers to make a final decision. Fusion techniques are broadly classified into two categories.

➢ Fixed Rule Fusion Techniques


➢ Trained Rule Fusion Techniques

Fixed Rule Fusion Techniques:

Fixed rule fusion techniques use pre-defined, straightforward mathematical rules to combine the predictions of different classifiers. These rules are "fixed" because they do not change during the learning process; they are independent of the training data.

Common Fixed Rule Fusion Methods:

Sum Rule:

Final score = w₁y₁ + w₂y₂ + … + wₙyₙ, where yᵢ is the output (e.g., probability score) from the i-th classifier and wᵢ is its weight.

The final prediction is based on the sum of the weighted outputs. The weights wᵢ can be equal or manually assigned based on the importance of each classifier.

Product Rule:
This rule multiplies the outputs of all classifiers. It is sensitive to low confidence
scores (i.e., if one classifier gives a near-zero score, it heavily affects the product).

Max Rule:

The final decision is based on the maximum output among the classifiers. It is
useful when one of the classifiers is highly confident about a particular prediction.

Min Rule:

The final decision is based on the minimum output among the classifiers. It is
rarely used because it heavily depends on the lowest score.

Majority Voting:

Each classifier gives its class prediction. The class with the majority votes is
chosen as the final prediction.

This method works well when all classifiers have equal weights and similar
accuracy.
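As a small illustration of these fixed rules (not from the original notes), the snippet below applies each rule to made-up probability scores from three hypothetical classifiers over two classes.

# Fixed rule fusion over hypothetical probability scores (illustrative numbers).
import numpy as np

# Rows = classifiers, columns = class scores (made-up values)
scores = np.array([[0.6, 0.4],
                   [0.8, 0.2],
                   [0.3, 0.7]])
weights = np.array([1.0, 1.0, 1.0])                    # equal weights for the sum rule

sum_rule  = (weights[:, None] * scores).sum(axis=0)    # weighted sum per class
prod_rule = scores.prod(axis=0)                        # product per class
max_rule  = scores.max(axis=0)                         # maximum score per class
min_rule  = scores.min(axis=0)                         # minimum score per class
votes     = np.bincount(scores.argmax(axis=1), minlength=2)  # one vote per classifier

print("Sum rule:        class", sum_rule.argmax())
print("Product rule:    class", prod_rule.argmax())
print("Max rule:        class", max_rule.argmax())
print("Min rule:        class", min_rule.argmax())
print("Majority voting: class", votes.argmax())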

Trained Rule Fusion Techniques:

Trained rule fusion techniques, also known as learned fusion techniques, involve
training an additional model (meta-classifier) to learn the optimal combination
of predictions from the base classifiers. This approach adapts based on the
training data.

Common Trained Rule Fusion Methods:

Stacking (Stacked Generalization):


• In stacking, the predictions from multiple base classifiers are used as input
features for a meta-classifier (e.g., Logistic Regression, SVM).

• The meta-classifier learns to combine these predictions to make the final decision.
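A minimal stacking sketch is shown below (illustrative settings, assuming scikit-learn 0.22 or later, which provides StackingClassifier): a Decision Tree and an SVM act as base classifiers, and a Logistic Regression meta-classifier learns how to combine their predictions.

# Stacking: base classifier predictions feed a Logistic Regression meta-classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

base_learners = [("tree", DecisionTreeClassifier(random_state=0)),
                 ("svm", SVC(probability=True, random_state=0))]

stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print("Training accuracy:", stack.score(X, y))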

Boosting:

• Boosting techniques, like AdaBoost, focus on training multiple weak classifiers in sequence. The errors of previous classifiers are used to adjust the weights of training samples in subsequent classifiers.

• The final decision is made by a weighted combination of the predictions.

Bagging (Bootstrap Aggregating):

• In bagging, multiple classifiers are trained on different bootstrap samples of the dataset. The final prediction is usually determined by majority voting.

• Random Forests are a common example of bagging.

K-Means Clustering Algorithm:

➢ K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science.
➢ It groups an unlabelled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
➢ It is an iterative algorithm that divides the unlabelled dataset into K different clusters in such a way that each data point belongs to only one group of points with similar properties.
➢ Given a data set of items, with certain features, and values for these
features, the algorithm will categorize the items into k groups or clusters of
similarity.
➢ To calculate the similarity, we can use the Euclidean distance, Manhattan
distance, Hamming distance, Cosine distance as measurement.

Here is the pseudocode for implementing a K-means algorithm.

➢ Input: Algorithm K-Means (K = number of clusters, D = list of data points)

➢ Choose K random data points as the initial centroids (cluster centers).
➢ Repeat until the cluster centers stabilize:
a. Allocate each point in D to the nearest of the K centroids.
b. Recompute the centroid of each cluster using all points currently assigned to it.

1. Choose the Number of Clusters (K)

• Before running the algorithm, you decide on the number of clusters, K. This is often based on prior knowledge or on methods like the elbow method (looking for the value of K beyond which the clustering cost stops decreasing significantly).

2. Initialize Cluster Centroids

• Randomly select K data points to serve as the initial centroids, the center
points of the clusters.

3. Assign Data Points to Nearest Centroid

• For each data point, calculate the distance (usually Euclidean) to each
centroid.

• Assign the data point to the cluster whose centroid is closest to it.

4. Update Centroids

• For each cluster, calculate the mean of all data points assigned to it.

• This mean becomes the new centroid for that cluster.

5. Repeat Steps 3 and 4

• Continue reassigning data points and updating centroids until convergence.

• Convergence is achieved when either the assignments no longer change, or the centroids stabilize.

6. Result

• The algorithm outputs K clusters, each with its centroid and assigned data
points.
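The steps above can be written directly as a short from-scratch implementation (a minimal NumPy sketch; the dataset, the convergence check, and the omission of empty-cluster handling are simplifications for illustration).

# Minimal K-means following steps 1-5 above (illustrative, NumPy only).
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled here, for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)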

Advantages of K-Means Algorithm

1. K-means algorithm is simple, easy to understand, and easy to implement.

2. It is also efficient: the time taken to cluster with K-means grows roughly linearly with the number of data points.
3. It is often competitive with other clustering algorithms in both speed and quality when the clusters are compact and well separated.

Disadvantages of K-Means Algorithm

1. The user needs to specify the number of clusters K in advance.

2. The result depends on the initial centroids, and the algorithm may converge to a poor local optimum.

3. It is not suitable for discovering clusters that are not roughly hyper-ellipsoidal or hyper-spherical in shape.

➢ Here the points are:

• A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9).
• The distance function is Euclidean distance.
• Suppose initially we assign A1, B1, and C1 as the center of each cluster,
respectively.

Initial Centroids — Cluster 1: A1 (2, 10), Cluster 2: B1 (5, 8), Cluster 3: C1 (1, 2)

Data Points   Distance to (2,10)   Distance to (5,8)   Distance to (1,2)   Cluster
A1 (2,10)     0                    3.61                8.06                1
A2 (2,5)      5                    4.24                3.16                3
A3 (8,4)      8.49                 5                   7.28                2
B1 (5,8)      3.61                 0                   7.21                2
B2 (7,5)      7.07                 3.61                6.71                2
B3 (6,4)      7.21                 4.12                5.39                2
C1 (1,2)      8.06                 7.21                0                   3
C2 (4,9)      2.24                 1.41                7.62                2
Current Centroids — Cluster 1: (2, 10), Cluster 2: (6, 6), Cluster 3: (1.5, 3.5)

Data Points   Distance to (2,10)   Distance to (6,6)   Distance to (1.5,3.5)   Previous Cluster   New Cluster
A1 (2,10)     0                    5.66                6.52                    1                  1
A2 (2,5)      5                    4.12                1.58                    3                  3
A3 (8,4)      8.49                 2.83                6.52                    2                  2
B1 (5,8)      3.61                 2.24                5.7                     2                  2
B2 (7,5)      7.07                 1.41                5.7                     2                  2
B3 (6,4)      7.21                 2                   4.53                    2                  2
C1 (1,2)      8.06                 6.4                 1.58                    3                  3
C2 (4,9)      2.24                 3.61                6.04                    2                  1

Current Centroids — Cluster 1: (3, 9.5), Cluster 2: (6.5, 5.25), Cluster 3: (1.5, 3.5)

Data Points   Distance to (3,9.5)   Distance to (6.5,5.25)   Distance to (1.5,3.5)   Previous Cluster   New Cluster
A1 (2,10)     1.12                  6.54                     6.52                    1                  1
A2 (2,5)      4.61                  4.51                     1.58                    3                  3
A3 (8,4)      7.43                  1.95                     6.52                    2                  2
B1 (5,8)      2.5                   3.13                     5.7                     2                  1
B2 (7,5)      6.02                  0.56                     5.7                     2                  2
B3 (6,4)      6.26                  1.35                     4.53                    2                  2
C1 (1,2)      7.76                  6.39                     1.58                    3                  3
C2 (4,9)      1.12                  4.51                     6.04                    1                  1
Current Centroids — Cluster 1: (3.67, 9), Cluster 2: (7, 4.33), Cluster 3: (1.5, 3.5)

Data Points   Distance to (3.67,9)   Distance to (7,4.33)   Distance to (1.5,3.5)   Previous Cluster   New Cluster
A1 (2,10)     1.94                   7.56                   6.52                    1                  1
A2 (2,5)      4.33                   5.04                   1.58                    3                  3
A3 (8,4)      6.62                   1.05                   6.52                    2                  2
B1 (5,8)      1.67                   4.18                   5.7                     1                  1
B2 (7,5)      5.21                   0.67                   5.7                     2                  2
B3 (6,4)      5.52                   1.05                   4.53                    2                  2
C1 (1,2)      7.49                   6.44                   1.58                    3                  3
C2 (4,9)      0.33                   5.55                   6.04                    1                  1

Since no cluster assignments changed in the last iteration, the algorithm has converged. The final clusters are:

A1, B1, C2 – FIRST CLUSTER

A3, B2, B3 – SECOND CLUSTER

A2, C1 – THIRD CLUSTER
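The hand-worked result can be cross-checked with scikit-learn by seeding KMeans with the same three initial centroids A1, B1 and C1 (an illustrative verification; n_init=1 forces it to use exactly this initialization).

# Verifying the worked example with scikit-learn, using the same initial centroids.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]])        # A1, A2, A3, B1, B2, B3, C1, C2
initial_centroids = np.array([[2, 10], [5, 8], [1, 2]])    # A1, B1, C1

kmeans = KMeans(n_clusters=3, init=initial_centroids, n_init=1).fit(points)
print("Labels:   ", kmeans.labels_)           # expected grouping: {A1,B1,C2}, {A3,B2,B3}, {A2,C1}
print("Centroids:", kmeans.cluster_centers_)  # expected approx. (3.67, 9), (7, 4.33), (1.5, 3.5)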

Hierarchical Clustering:

Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabelled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two methods differ in how they work. In particular, there is no requirement to predetermine the number of clusters, as there is in the K-Means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data points as single clusters and merging them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as
it is a top-down approach.

We have seen that K-means clustering has some challenges: the number of clusters must be predetermined, and it tends to create clusters of roughly similar size and shape.

To address these challenges, we can opt for the hierarchical clustering algorithm, because it does not require prior knowledge of the number of clusters.

Agglomerative Hierarchical clustering:

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data points into clusters, it follows the bottom-up approach.

This means the algorithm treats each data point as a single cluster at the beginning and then starts combining the closest pair of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points.

Process:

1. Start: Each data point is its own cluster (initially, we have n clusters for n
data points).

2. Merge: Find the pair of clusters that are closest to each other and merge
them.

3. Repeat: Continue merging the closest clusters until only one cluster
remains or the desired number of clusters is reached.
Divisive Hierarchical Clustering:

• This is a top-down approach, where all data points initially belong to a single cluster.

• The algorithm then iteratively splits the clusters into smaller clusters until
each data point is its own cluster or a stopping criterion is met.

Process:

• Start: All data points are in one single cluster.

• Split: Divide the cluster into two or more sub-clusters based on dissimilarities.

• Repeat: Continue splitting the clusters until every data point is its own
cluster or a predefined number of clusters is reached.

The dendrogram in divisive clustering starts with a single cluster at the top
and splits downward until each data point is separated.

Perform Bottom-Up Agglomerative Clustering using:

• Single Linkage: where the distance between clusters is the minimum distance between any two points in the clusters.
• Complete Linkage: where the distance between clusters is the maximum
distance between any two points in the clusters.

i. Single Linkage Clustering

1. Find the Minimum Distance: Start by merging clusters with the smallest
distance.

2. Form Clusters: At each step, find the smallest distance between clusters.

3. Draw Dendrogram: Link clusters at each merge step based on their distance.

ii. Complete Linkage Clustering

1. Maximum Distance Rule: In this approach, the distance between clusters is defined as the maximum distance between any two points.

2. Form Clusters: Use the maximum distance between clusters to merge them.

3. Draw Dendrogram: Link clusters as per the complete linkage criterion.

➢ Consider the following set of 6 one-dimensional data points: 18, 22, 25, 42, 27, 43
➢ Apply the agglomerative hierarchical clustering algorithm to build the
hierarchical clustering dendrogram.
➢ Merge the clusters using Min distance and update the proximity matrix
accordingly.
➢ Clearly show the proximity matrix corresponding to each iteration of the
algorithm.

18 22 25 27 42 43
18 0 4 7 9 24 25
22 4 0 3 5 20 21
25 7 3 0 2 17 18
27 9 5 2 0 15 16
42 24 20 17 15 0 1
43 25 21 18 16 1 0
Find the minimum distance between data points

The minimum distance in the matrix is 1, between the points 42 and 43, so they are merged first.

Cluster 1: (42,43)

18 22 25 27 42,43
18 0 4 7 9 24
22 4 0 3 5 20
25 7 3 0 2 17
27 9 5 2 0 15
42,43 24 20 17 15 0

The smallest remaining distance is 2, between the points 25 and 27, so they are merged next.

18 22 25,27 42,43
18 0 4 7 24
22 4 0 3 20
25,27 7 3 0 15
42,43 24 20 15 0

Cluster 2: ((42,43) ,(25,27))


The smallest remaining distance is 3, between 22 and the cluster (25, 27), so they are merged next.
18 22,25,27 42,43
18 0 4 24
22,25,27 4 0 15
42,43 24 15 0

Cluster 3: ((42,43),((25,27),22))
The smallest remaining distance is 4, between 18 and the cluster (22, 25, 27), so they are merged next.

18,22,25,27 42,43
18,22,25,27 0 15
42,43 15 0

Cluster 4: ((42,43),(((25,27),22),18))
Final Dendrogram: 42 and 43 merge at distance 1, 25 and 27 at distance 2, 22 joins (25, 27) at distance 3, 18 joins (22, 25, 27) at distance 4, and the two remaining clusters (18, 22, 25, 27) and (42, 43) merge at distance 15.
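The merge order and distances derived above can be reproduced with SciPy (an illustrative check, assuming scipy is available).

# Reproducing the single-linkage example with SciPy (illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[18], [22], [25], [27], [42], [43]])    # one-dimensional data

# Each row of Z describes one merge: (cluster i, cluster j, merge distance, new size)
Z = linkage(points, method="single", metric="euclidean")
print(Z)           # merge distances should be 1, 2, 3, 4 and 15, as derived above
# dendrogram(Z)    # with matplotlib installed, this draws the dendrogram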
Given a one-dimensional data set {1, 5, 8, 10, 2}, use the agglomerative clustering
algorithms with the complete link with Euclidean distance to establish a
hierarchical grouping relationship.

Euclidean Distance = √((x₂ − x₁)² + (y₂ − y₁)²)
For one-dimensional data this reduces to √((x₂ − x₁)²) = |x₂ − x₁|.

• In order to use the agglomerative algorithm,

• we need to calculate the distance matrix.

• One-dimensional data set {1, 5, 8, 10, 2}

Sno (Data Point)   1 (1)   2 (5)   3 (8)   4 (10)   5 (2)
1 (1)              0       4       7       9        1
2 (5)              4       0       3       5        3
3 (8)              7       3       0       2        6
4 (10)             9       5       2       0        8
5 (2)              1       3       6       8        0

The minimum entry in the matrix is 1, between data points 1 and 2 (Sno 1 and Sno 5), found at row 1, column 5 (equivalently row 5, column 1). These two points are therefore merged into the cluster {1, 5} (clusters are labelled by serial number). Using the complete link, the distances from the remaining points to this new cluster are:

d(2, {1,5}) = max{ d(2,1), d(2,5) } = max {4, 3} = 4

d(3, {1,5}) = max{ d(3,1), d(3,5) } = max {7, 6} = 7

d(4, {1,5}) = max{ d(4,1), d(4,5) } = max {9, 8} = 9

1,5 2 3 4
1,5 0 4 7 9
2 4 0 3 5
3 7 3 0 2
4 9 5 2 0

➢ From the above distance matrix, we can see the distance between points 3
and 4 is smallest.

➢ Hence, they merge together to form a cluster {3, 4}.


➢ Using the complete link, we have the distance between different
points/clusters as follows:

• d({1,5}, {3,4}) = max{ d({1,5}, 3), d({1,5}, 4) } = max{ 7, 9 } = 9

• d(2, {3,4}) = max{ d(2,3), d(2,4) } = max{ 3, 5 } = 5

Thus, we can update the distance matrix, where row 2 corresponds to point 2,
rows 1 and 3 correspond to clusters {1,5} and {3,4}, as follows:

1,5 2 3,4
1,5 0 4 9
2 4 0 5
3,4 9 5 0

Following the same procedure, the smallest remaining distance is 4, between point 2 and the cluster {1, 5}, so we merge them to form {1, 2, 5} and update the distance matrix as follows:

1,2,5 3,4
1,2,5 0 9
3,4 9 0

➢ After increasing the distance threshold to 9, all clusters would merge.


➢ Based on all above distance matrices, we draw the dendrogram tree as
follows.

{1,5},2 {3,4}
{1,5},2 0 9
{3,4} 9 0
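Similarly, the complete-link grouping can be checked with SciPy (illustrative sketch).

# Reproducing the complete-linkage example with SciPy (illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[1], [5], [8], [10], [2]])              # one-dimensional data

Z = linkage(points, method="complete", metric="euclidean")
print(Z)   # merge distances should be 1 ({1,2}), 2 ({8,10}), 4 (5 joins {1,2}) and 9 (all points)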
