Chapter 06
BAGGING/ADABOOST
By
YOUSUF SK
ASSISTANT PROF.
PIET - CSE
SUBTITLE: UNDERSTANDING THE APPLICATION
OF ADABOOST IN RISK MANAGEMENT
INTRODUCTION TO ADABOOST
1.BIAS REDUCTION:
BY COMBINING MULTIPLE WEAK LEARNERS, ADABOOST REDUCES BIAS, MAKING THE OVERALL
MODEL MORE FLEXIBLE AND CAPABLE OF CAPTURING COMPLEX PATTERNS IN THE DATA.
2.CONTROLLED VARIANCE:
EVEN THOUGH EACH WEAK LEARNER MAY HAVE HIGH VARIANCE, THE AGGREGATION PROCESS
HELPS TO AVERAGE OUT THEIR ERRORS, REDUCING THE OVERALL VARIANCE.
3.IMPROVING GENERALIZATION:
ENHANCED ACCURACY: ITERATIVE FOCUS ON MISCLASSIFIED INSTANCES IMPROVES
ACCURACY ON TRAINING DATA AND UNSEEN DATA.
ADAPTIVE WEIGHTING: DYNAMICALLY ADJUSTS WEIGHTS OF TRAINING INSTANCES TO
FOCUS ON CHALLENGING PARTS OF THE DATASET.
4.MODEL AGGREGATION:
WEIGHTED MAJORITY VOTING: FINAL PREDICTION IS MADE BY WEIGHTED MAJORITY VOTE
OF ALL WEAK LEARNERS, WITH WEIGHTS BASED ON THEIR ACCURACY.
CUMULATIVE LEARNING: COMBINES ALL WEAK LEARNERS TO CREATE A STRONG
CLASSIFIER, LEVERAGING COLLECTIVE KNOWLEDGE.
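The ideas above can be sketched with scikit-learn's `AdaBoostClassifier`, which by default combines depth-1 decision trees (stumps) by weighted majority voting. The synthetic dataset and parameter values below are illustrative, not from the slides:

```python
# Minimal AdaBoost sketch: 50 decision stumps combined by weighted voting.
# Dataset is synthetic (make_classification), chosen only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round reweights misclassified training instances so the
# next weak learner focuses on the challenging parts of the dataset.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

The per-learner weights used in the final vote are exposed as `clf.estimator_weights_`, reflecting each weak learner's accuracy.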
Unsupervised Machine Learning
In contrast to supervised learning, unsupervised learning doesn't require labelled data for training.
Instead, it aims to discover hidden patterns and insights within the dataset. Just like how humans
learn, unsupervised learning enables models to act on unlabelled datasets without explicit guidance.
Due to the absence of corresponding output data, unsupervised learning cannot be directly applied
to regression or classification problems. Its primary goal is to uncover a dataset's inherent structure,
group data based on similarities, and represent the dataset in a more condensed form.
Consider a dataset of images of cats and dogs (Figure 1). Without prior training on this specific
dataset, the algorithm's task is to identify each image's distinctive features independently. The
algorithm will then cluster the images into groups based on similarities.
Clustering
Clustering is an unsupervised learning technique used to group a set of objects in such a way that objects in the same
group (or cluster) are more similar to each other than to those in other groups. The goal is to identify natural groupings
within the data based on inherent similarities.
A clustering criterion function is a mathematical measure used to evaluate the quality of the clusters formed
by a clustering algorithm. It quantifies how well the objects within each cluster are similar to each other and
how distinct the clusters are from one another. Common clustering criterion functions include:
ELBOW METHOD
FOLLOWING ARE THE STEPS IN THE ELBOW METHOD TO FIND THE OPTIMAL NUMBER OF CLUSTERS:
· EXECUTE K-MEANS CLUSTERING ON A GIVEN DATASET FOR DIFFERENT K VALUES (1-10).
· FOR EACH VALUE OF K, CALCULATE THE WCSS (WITHIN-CLUSTER SUM OF SQUARES) VALUE.
· PLOT A CURVE BETWEEN THE CALCULATED WCSS VALUES AND THE NUMBER OF CLUSTERS K.
· THE SHARP POINT OF BEND, WHERE THE PLOT LOOKS LIKE AN ARM, IS CONSIDERED
THE BEST VALUE OF K.
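The elbow-method steps can be sketched as follows; in scikit-learn the WCSS for a fitted k-means model is available as `inertia_`. The blob dataset is illustrative, not from the slides:

```python
# Elbow method sketch: run k-means for k = 1..10 and record WCSS.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (illustrative assumption).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

# Plotting wcss against k (e.g. with matplotlib) would show a sharp bend
# near k = 4, the "elbow" that indicates the best number of clusters.
```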
FIGURE 5: ELBOW GRAPH
SILHOUETTE COEFFICIENT
DEFINITION: MEASURES HOW SIMILAR A POINT IS TO ITS OWN CLUSTER COMPARED TO OTHER CLUSTERS.
OBJECTIVE: MAXIMIZE THE SILHOUETTE COEFFICIENT, WHICH RANGES FROM -1 TO 1.
𝑠(𝑖) = (𝑏(𝑖) − 𝑎(𝑖)) / max(𝑎(𝑖), 𝑏(𝑖))
WHERE 𝑎(𝑖) IS THE AVERAGE DISTANCE FROM POINT 𝑖 TO OTHER POINTS IN ITS OWN CLUSTER, AND 𝑏(𝑖) IS
THE MINIMUM AVERAGE DISTANCE FROM POINT 𝑖 TO POINTS IN ANOTHER CLUSTER.
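A quick sketch of the silhouette coefficient using scikit-learn's `silhouette_score`, which averages 𝑠(𝑖) over all points. The blob dataset is an illustrative assumption:

```python
# Silhouette coefficient sketch: well-separated blobs should score high.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Mean of s(i) over all points; ranges from -1 (bad) to 1 (dense, well separated).
score = silhouette_score(X, labels)
```

Comparing `score` across different values of k is an alternative to the elbow method for choosing the number of clusters.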
K-MEANS CLUSTERING
DEFINITION: K-MEANS CLUSTERING IS A POPULAR UNSUPERVISED LEARNING ALGORITHM THAT PARTITIONS A
DATASET INTO K DISTINCT, NON-OVERLAPPING CLUSTERS. EACH CLUSTER IS REPRESENTED BY ITS CENTROID,
WHICH IS THE MEAN OF THE DATA POINTS IN THE CLUSTER.
1.INITIALIZATION STEP:
RANDOMLY SELECT K DATA POINTS FROM THE DATASET AS THE INITIAL CENTROIDS.
2.ASSIGNMENT STEP:
ASSIGN EACH DATA POINT TO THE NEAREST CENTROID. THIS FORMS K CLUSTERS BASED ON THE EUCLIDEAN
DISTANCE.
3.UPDATE STEP:
RECALCULATE THE CENTROID OF EACH CLUSTER AS THE MEAN OF ALL DATA POINTS ASSIGNED TO THAT
CLUSTER.
4.REPEAT:
REPEAT THE ASSIGNMENT AND UPDATE STEPS UNTIL THE CENTROIDS NO LONGER CHANGE SIGNIFICANTLY OR
A MAXIMUM NUMBER OF ITERATIONS IS REACHED.
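The four steps above can be sketched directly in NumPy. This is a minimal from-scratch illustration (it does not handle the edge case of a cluster becoming empty, which production implementations must):

```python
# From-scratch k-means sketch mirroring the four steps in the slides.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # 2. Assignment: each point goes to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids no longer change significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In practice scikit-learn's `KMeans` (with smarter k-means++ initialization and multiple restarts) would be used instead.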
EUCLIDEAN DISTANCE
THE EUCLIDEAN DISTANCE BETWEEN TWO POINTS 𝑝 AND 𝑞 IN N DIMENSIONS IS
𝑑(𝑝, 𝑞) = √((𝑝₁ − 𝑞₁)² + (𝑝₂ − 𝑞₂)² + … + (𝑝ₙ − 𝑞ₙ)²).
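A one-line check of the Euclidean distance with NumPy (the two example points are arbitrary):

```python
# Euclidean distance between two illustrative 2-D points.
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
d = np.linalg.norm(p - q)  # sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16)
print(d)  # → 5.0
```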
HIERARCHICAL CLUSTERING
DEFINITION: HIERARCHICAL CLUSTERING IS AN UNSUPERVISED LEARNING ALGORITHM USED TO GROUP DATA
POINTS INTO NESTED CLUSTERS IN A HIERARCHICAL STRUCTURE. IT CAN BE VISUALIZED USING A
DENDROGRAM, WHICH ILLUSTRATES THE MERGING OR SPLITTING OF CLUSTERS AT VARIOUS LEVELS OF
SIMILARITY.
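A dendrogram can be built with SciPy's hierarchical clustering utilities; the two small synthetic groups below are an illustrative assumption:

```python
# Hierarchical clustering sketch: build the merge tree and its dendrogram layout.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
# Two tight groups of 5 points each (illustrative data).
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(4, 0.3, (5, 2))])

# Each row of Z records one merge: the two clusters joined, their distance,
# and the size of the new cluster.
Z = linkage(X, method="ward")

# dendrogram(Z) would draw the tree; no_plot=True just computes the layout.
tree = dendrogram(Z, no_plot=True)
```

Cutting the dendrogram at a chosen height (e.g. with `scipy.cluster.hierarchy.fcluster`) yields a flat clustering at that level of similarity.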
AGGLOMERATIVE HIERARCHICAL CLUSTERING ALGORITHM
DEFINITION: AGGLOMERATIVE HIERARCHICAL CLUSTERING IS AN UNSUPERVISED LEARNING ALGORITHM THAT
STARTS WITH EACH DATA POINT AS AN INDIVIDUAL CLUSTER AND ITERATIVELY MERGES THE CLOSEST PAIRS OF
CLUSTERS UNTIL ALL POINTS ARE IN A SINGLE CLUSTER OR THE DESIRED NUMBER OF CLUSTERS IS ACHIEVED.
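Agglomerative clustering is available directly in scikit-learn; the blob dataset below is illustrative:

```python
# Agglomerative clustering sketch: bottom-up merging until 3 clusters remain.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

# Starts with 60 singleton clusters and merges the closest pair (Ward
# linkage) until the requested number of clusters is reached.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
```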