
Clustering and

Reinforcement Methods
Content
• Introduction to Clusters
• K-Means Clustering
• Fixing the value of K in K-Means
• Hierarchical Model
• DBSCAN Model
• Spiral Model
Introduction to Clusters
• In machine learning, clustering is a type of unsupervised
learning that involves grouping a set of objects into subsets, or
clusters, so that objects within the same cluster are more similar
to each other than to those in other clusters.
• Unlike supervised learning, clustering doesn't rely on labeled
data. Instead, it tries to uncover patterns and structure from raw,
unlabeled data.
Introduction to Clusters
Why Clustering?
• Clustering is useful in various real-world applications where the
goal is to identify natural groupings in data, such as:
• Customer Segmentation: Grouping customers based on purchasing
behavior for targeted marketing.
• Anomaly Detection: Identifying unusual patterns in network traffic for
cybersecurity.
• Document Classification: Organizing large collections of documents
into categories for easy retrieval.
• Genomics: Identifying gene groups with similar expression patterns.
Introduction to Clusters
Characteristics of Clusters
• Clusters can differ based on:
• Shape: Spherical, elongated, or arbitrary.
• Size: Uniform or varying sizes.
• Density: Tight or loose groupings.
• Distance: Based on different distance metrics like Euclidean,
Manhattan, or cosine similarity.
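To make the distance metrics above concrete, here is a quick sketch comparing the three measures on two sample points (the points themselves are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean: straight-line distance, sqrt((1-4)^2 + (2-6)^2)
euclidean = np.linalg.norm(a - b)
# Manhattan: sum of absolute coordinate differences, |1-4| + |2-6|
manhattan = np.abs(a - b).sum()
# Cosine similarity: angle-based, ignores vector magnitude
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, round(cosine_sim, 3))
```

Which metric is appropriate depends on the data: Euclidean suits compact spherical clusters, Manhattan is less sensitive to single large coordinate differences, and cosine similarity compares direction rather than magnitude (common for text vectors).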
Introduction to Clusters
Types of Clustering Algorithms
1. Partitioning Methods: Divide the data into distinct clusters.
1. Example: K-Means, K-Medoids.
2. Hierarchical Methods: Build clusters step-by-step in a hierarchy.
1. Example: Agglomerative, Divisive Clustering.
3. Density-Based Methods: Form clusters based on high-density
regions.
1. Example: DBSCAN, OPTICS.
4. Model-Based Methods: Assume a specific model for each cluster and
fit the data.
1. Example: Gaussian Mixture Models (GMM).
Introduction to Clusters
• Challenges in Clustering
• Determining the Number of Clusters: Often requires domain
knowledge or validation methods like the elbow method or silhouette
score.
• Scalability: Some algorithms struggle with large datasets.
• Interpretability: Understanding and interpreting clusters can be
subjective, depending on the application.
• Clustering is a foundational concept in machine learning,
enabling insight into complex data by identifying meaningful
patterns, relationships, and structures.
K-Means Clustering
Overview of K-Means Clustering
• K-Means is a popular unsupervised learning algorithm used
for partitioning a dataset into K distinct clusters.
• The goal is to minimize the distance between data points and
their respective cluster centroids.
K-Means Clustering
Steps of K-Means Algorithm
1. Initialization: Choose K random initial centroids.
2. Assignment: Assign each data point to the nearest centroid
based on a distance metric (usually Euclidean distance).
3. Update: Recalculate the centroids as the mean of all points
assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until the centroids no longer
change significantly or a maximum number of iterations is
reached.
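The four steps above can be sketched from scratch with NumPy. This is a minimal illustration on synthetic data, not a production implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """A minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1 (Initialization): pick K distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2 (Assignment): assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (Update): recompute each centroid as the mean of its points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4 (Repeat): stop when centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

In practice, a library implementation such as scikit-learn's `KMeans` (which adds smarter initialization and multiple restarts) would be used instead.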
K-Means Clustering
Example: Clustering Customers Based on Annual Income
and Spending Score
• Let’s say we have a dataset of customers with two features:
1. Annual Income ($)
2. Spending Score (a measure of spending habits on a scale of 1 to 100)
K-Means Clustering
Clustering Process (with K=2):
1. Initialization: Randomly select two centroids.
2. Assignment: Compute the distance from each data point to
both centroids and assign points to the nearest centroid.
3. Update: Recalculate centroids based on assigned points.
4. Convergence: Repeat until centroids stabilize.
K-Means Clustering
• In the plot above, the data points represent customers, and the two
clusters are differentiated by colors. The red X marks indicate the
centroids of the two clusters. The K-Means algorithm successfully
divides the customers into two groups based on their annual income
and spending score.
• This type of clustering can be useful for:
• Targeted Marketing: Identifying high-spending customers.
• Customer Segmentation: Creating personalized promotions for different
groups.
• You can adjust the number of clusters (K) based on specific
business needs or by using evaluation metrics like the elbow
method.
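The customer-segmentation example above can be sketched with scikit-learn. The income and spending values below are made-up illustrative data, not taken from the original slides:

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns: annual income ($k), spending score (1-100) -- hypothetical values
customers = np.array([
    [15, 20], [16, 25], [18, 22], [20, 30], [22, 28],   # lower income, lower spending
    [80, 75], [85, 80], [88, 82], [90, 78], [95, 85],   # higher income, higher spending
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # cluster assignment per customer
print(km.cluster_centers_)  # the two centroids (the red X marks in the plot)
```

Note that because K-Means uses Euclidean distance, features on very different scales should normally be standardized first so that one feature (here, income) does not dominate the distance calculation.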
Fixing the value of K in K-Means
• Choosing the optimal number of clusters K is critical for the
effectiveness of the K-Means algorithm.
• There are several methods to determine the best K:
Fixing the value of K in K-Means
1. Elbow Method
• Plot the within-cluster sum of squares (WCSS) for a range of K values.
• The point where the curve bends and starts to flatten (the "elbow")
suggests the optimal K.
2. Silhouette Method
• Compute the average silhouette score for each K; it measures how
similar a point is to its own cluster compared to other clusters.
• The K with the highest average silhouette score is preferred.
3. Gap Statistic Method
• The Gap Statistic compares the clustering result on the original
dataset with results on random reference datasets. A larger gap
indicates a more appropriate K.
Fixing the value of K in K-Means
• The Elbow Plot shows the WCSS for different values of K.
The point where the WCSS curve starts to flatten is known as
the "elbow point." This point indicates the optimal number of
clusters.
• In this plot, the elbow appears around K=2 or K=3. Depending
on the specific application and domain knowledge, you could
select one of these values for K.
• If interpretability is important, fewer clusters (like K=2) might be
preferred.
• If you want more granularity, a slightly higher K (like K=3) could
be chosen.
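As a sketch, the elbow curve can be computed with scikit-learn, whose `inertia_` attribute is exactly the WCSS. The synthetic data below has three clear blobs, so the drop in WCSS should flatten after K=3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs centred at (0,0), (8,8), (16,16)
X = np.vstack([rng.normal(c, 1, (40, 2)) for c in (0, 8, 16)])

wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

# WCSS always decreases as K grows; the "elbow" is where the drop flattens
for k, w in zip(range(1, 8), wcss):
    print(k, round(w, 1))
```

Plotting `wcss` against K (e.g. with matplotlib) produces the elbow plot described above; the elbow is found visually or by comparing successive drops in WCSS.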
Hierarchical Model
• Hierarchical clustering is a type of unsupervised learning that
builds a hierarchy of clusters by either merging or splitting them
iteratively.
• Unlike K-Means, hierarchical clustering does not require
specifying the number of clusters K in advance.
Hierarchical Model
Types of Hierarchical Clustering
1. Agglomerative (Bottom-Up):
1. Starts with each data point as its own cluster.
2. Iteratively merges the closest clusters until all points are in one cluster.
2. Divisive (Top-Down):
1. Starts with all data points in a single cluster.
2. Iteratively splits clusters until each point is its own cluster.
Hierarchical Model
Steps in Agglomerative Hierarchical Clustering
1. Calculate Distance Matrix: Compute pairwise distances
between all data points.
2. Merge Closest Clusters: Use a linkage criterion to determine
which clusters to merge.
3. Repeat: Continue merging until a single cluster remains.
4. Visualize: Use a dendrogram to visualize the cluster hierarchy.
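The four steps above can be sketched with SciPy's hierarchy module. The data is synthetic, and Ward's linkage is one illustrative choice of linkage criterion (single, complete, and average linkage are common alternatives):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated synthetic groups of points
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])

# Steps 1-3: linkage() computes pairwise distances and repeatedly merges
# the closest clusters (here using Ward's linkage criterion)
Z = linkage(X, method="ward")

# Step 4: scipy.cluster.hierarchy.dendrogram(Z) would draw the hierarchy;
# fcluster "cuts" the tree to recover a flat clustering with 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Cutting the tree with `fcluster` corresponds to drawing a horizontal line across the dendrogram, as described in the slides that follow.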
Hierarchical Model
• The dendrogram above illustrates the hierarchical clustering
process:
• X-axis: Represents individual data points (customers).
• Y-axis: Represents the Euclidean distance between clusters at the
point of merging.
• Horizontal lines: Show the merging process. The height at which two
clusters merge indicates their similarity—lower merges imply more
similar clusters.
Hierarchical Model
How to Determine the Number of Clusters:
• Look for the largest vertical distance between two horizontal
lines without intersecting another horizontal line.
• For instance, if you cut the dendrogram at a certain height, like
just before the two largest clusters merge, the number of
clusters below that cut line is your cluster count.
Hierarchical Model
• In this example, cutting around the middle might suggest two to
three clusters.
• Hierarchical clustering is particularly useful when:
• You need interpretable results with a dendrogram.
• You have small to medium-sized datasets (it can be computationally
expensive for large datasets).
DBSCAN Model
• DBSCAN is a powerful density-based clustering algorithm
that groups together points that are closely packed, marking
outliers as noise if they are in low-density regions.
• Unlike K-Means or Hierarchical clustering, DBSCAN doesn't
require the number of clusters to be specified in advance and
can handle clusters of varying shapes and sizes.
DBSCAN Model
Key Concepts of DBSCAN
1. Core Point:
A point is a core point if it has at least MinPts points (including
itself) within a given Epsilon (ε) radius.
2. Border Point:
A point that is within the ε radius of a core point but has fewer
than MinPts neighbors itself. It is part of a cluster but not a core
point.
3. Noise Point:
A point that is neither a core point nor a border point and falls
outside any cluster.
DBSCAN Model
Parameters of DBSCAN
1. Epsilon (ε):
The radius within which to search for neighboring points.
2. MinPts:
The minimum number of points required to form a dense region.
DBSCAN Model
How DBSCAN Works
1. Select an unvisited point.
2. Determine its neighbors using ε.
3. If the point is a core point:
a) Form a new cluster with the point and its neighbors.
b) Recursively add points that are directly density-reachable.
4. If the point is a border point, mark it as part of a cluster.
5. If it’s neither, mark it as noise.
6. Repeat until all points are visited.
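A minimal DBSCAN run with scikit-learn illustrates the behavior described above, using synthetic data with two dense blobs and one isolated outlier:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (30, 2)),   # dense cluster A around (0, 0)
    rng.normal(5, 0.3, (30, 2)),   # dense cluster B around (5, 5)
    [[20.0, 20.0]],                # isolated point, should become noise
])

# eps is the Epsilon radius; min_samples is MinPts
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(set(db.labels_))  # noise points get the label -1
```

No number of clusters was specified: DBSCAN discovers the two dense regions itself and flags the isolated point as noise, which is exactly the advantage over K-Means and hierarchical clustering noted above.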
DBSCAN Model
Advantages of DBSCAN
• Can find clusters of arbitrary shapes.
• Automatically detects outliers.
• No need to specify the number of clusters.
DBSCAN Model
Limitations of DBSCAN
• Struggles with datasets that have varying densities.
• Performance depends heavily on the choice of ε and MinPts.
Spiral Model
• The Spiral Model is a risk-driven process model that
combines elements of both iterative and waterfall models.
• It is ideal for large, complex, and high-risk projects.
• The model allows for continuous refinement through multiple
iterations, focusing heavily on risk management at every phase.
Spiral Model
Key Features of the Spiral Model
1. Iterative Cycles (Spirals): The project goes through several
iterations, with each spiral involving a set of activities.
2. Risk Management: Each cycle starts with identifying and
addressing potential risks.
3. Prototyping: Prototypes are often created to clarify
requirements and reduce risks.
4. Customer Feedback: Frequent involvement of stakeholders
ensures that the product meets expectations.
Spiral Model
Phases in Each Spiral
1. Planning Phase:
1. Define objectives, requirements, and constraints.
2. Identify potential risks and develop risk mitigation strategies.
2. Risk Analysis Phase:
1. Assess identified risks and prioritize them.
2. Develop prototypes or simulations if needed to reduce uncertainty.
3. Engineering Phase:
1. Design and develop the system.
2. Build, test, and refine the system or prototype.
4. Evaluation Phase:
1. Receive feedback from stakeholders.
2. Decide whether to proceed, adjust, or terminate the project.
Spiral Model
Diagram of the Spiral Model
• The spiral diagram can be read as follows:
• The spiral starts from the center and progresses outward.
• Each loop represents a phase, including planning, risk analysis,
development, and customer feedback.
• Risk evaluation is continuous at each loop.
Summary
• Introduction to Clusters
• K-Means Clustering
• Fixing the value of K in K-Means
• Hierarchical Model
• DBSCAN Model
• Spiral Model
