Unit 5
Clustering Methods:
1. Hierarchical Clustering:
○ Agglomerative: Starts with each point as its own cluster and merges
the closest pairs of clusters until only one cluster remains.
○ Divisive: Starts with all points in one cluster and recursively splits it
into smaller clusters.
2. Partitioning Methods:
○ K-means: Divides data into non-overlapping clusters where each
data point belongs to only one cluster. It aims to minimize the
variance within each cluster.
○ K-medoids (PAM): Similar to K-means but uses actual data points
(medoids) as cluster centers to handle outliers better.
3. Density-Based Methods:
○ DBSCAN (Density-Based Spatial Clustering of Applications with
Noise): Groups points in dense regions based on density
reachability and labels points in sparse regions as noise (outliers).
○ OPTICS (Ordering Points To Identify the Clustering Structure):
Produces an ordering of points with reachability distances, from
which clusterings at multiple density levels can be extracted,
rather than a single flat clustering.
4. Grid-Based Methods:
○ STING (Statistical Information Grid): Divides the data space into a
hierarchical grid of rectangular cells, stores statistical summaries
for each cell, and merges cells to form clusters.
5. Model-Based Methods:
○ Gaussian Mixture Models (GMM): Assumes that the data points are
generated from a mixture of several Gaussian distributions.
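The agglomerative procedure from item 1 above can be sketched in plain Python. This is an illustrative single-linkage implementation, not a library function; `single_linkage_cluster` and its parameters are names chosen for this sketch:

```python
from itertools import combinations

def single_linkage_cluster(points, k):
    """Agglomerative clustering with single linkage.

    Start with each point as its own cluster, then repeatedly merge
    the two clusters whose closest members are nearest, until only
    k clusters remain.
    """
    # Each cluster is a list of point indices.
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage
        # distance (distance between their closest members).
        best = None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            d = min(dist(points[p], points[q]) for p in ci for q in cj)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two well-separated groups merge back into two clusters.
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
print(single_linkage_cluster(pts, 2))  # [[0, 1], [2, 3]]
```

The divisive variant would run the same idea top-down (split the loosest cluster instead of merging the closest pair); a real implementation would use a library such as SciPy rather than this O(n³) loop.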
Applications of Clustering:
● Market and customer segmentation.
● Document and news-article organization.
● Image segmentation and compression.
● Anomaly and fraud detection.
● Bioinformatics (e.g., grouping genes with similar expression patterns).
Hierarchical Clustering
Key Points:
● Produces a dendrogram (tree) showing how clusters merge or split at each level.
● Does not require the number of clusters to be fixed in advance; a clustering is
obtained by cutting the dendrogram at a chosen level.
● The distance between clusters is defined by a linkage criterion (single,
complete, or average linkage).
● Agglomerative (bottom-up) variants are far more common in practice than
divisive (top-down) ones.
K-Means Clustering:
K-Means is a popular clustering algorithm used for partitioning a dataset into K
distinct, non-overlapping clusters. It's an iterative algorithm that aims to minimize
the variance within each cluster.
Key Steps:
1. Initialization:
○ Choose K initial cluster centroids randomly (or based on some
heuristic).
2. Assignment:
○ Assign each data point to the nearest centroid, typically based on
Euclidean distance.
3. Update Centroids:
○ Recalculate the centroids of the clusters by taking the mean of all
data points assigned to each centroid.
4. Iteration:
○ Repeat the assignment and centroid update steps until convergence
(when centroids no longer change significantly or a maximum
number of iterations is reached).
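The four steps above can be sketched directly in plain Python. This is a minimal, illustrative implementation (not a library API); `kmeans` and its parameters are names chosen for this sketch:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-Means: initialize, assign, update, iterate to convergence."""
    rng = random.Random(seed)
    # 1. Initialization: choose k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # 2. Assignment: each point goes to its nearest centroid
        #    (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # 3. Update: recompute each centroid as the mean of its points
        #    (keep the old centroid if a cluster went empty).
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # 4. Iteration: stop when centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups: centroids converge to the group means.
cents, _ = kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2)
print(sorted(cents))  # [(0.0, 0.5), (10.0, 10.5)]
```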
Properties:
● Objective: Minimize the sum of squared distances from each point to its
assigned cluster centroid.
● Initialization Sensitivity: The final clusters can depend on the initial random
choice of centroids, impacting the algorithm's performance.
● Scalability: Each iteration is roughly linear in the number of points, so
K-Means handles large datasets well, but cost still grows with K, the
dimensionality, and the number of iterations.
● Suitability: Effective when clusters are spherical and of similar size, less
effective with irregular shapes or widely varying cluster sizes.
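One common way to reduce the initialization sensitivity noted above is k-means++-style seeding: choose each new centroid with probability proportional to its squared distance from the nearest centroid already chosen, so the initial centroids are spread across the data. A sketch under that idea (illustrative names, not a library implementation):

```python
import random

def kmeanspp_seed(points, k, seed=0):
    """Pick k initial centroids spread out across the data (k-means++ idea)."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]      # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points]
        # Sample the next centroid with probability proportional to d2,
        # so far-away regions are more likely to receive a centroid.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids
```

Seeding this way typically lets K-Means converge in fewer iterations and to better local optima than purely random initialization; scikit-learn's `KMeans` uses it by default.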
Advantages:
● Simple to understand and implement.
● Fast and scales to large datasets (each iteration is linear in the number of
points).
● Guaranteed to converge (though only to a local optimum).
Disadvantages:
● Requires the number of clusters K to be specified in advance.
● Sensitive to the initial choice of centroids and to outliers.
● Assumes roughly spherical, similarly sized clusters; struggles with irregular
shapes and varying densities.
Applications:
● Customer segmentation.
● Document clustering.
● Image segmentation.
● Anomaly detection (e.g., flagging points far from every centroid, or treating
very small clusters as anomalous).
Understanding these aspects will give you a solid foundation for implementing
and applying K-Means clustering effectively in various contexts.
Streams
Streams refer to continuous flows of data, typically arriving in real-time or near real-time. Stream
processing involves handling and analyzing these data streams as they are generated, often
with the goal of extracting insights or making decisions in real-time. Examples include
processing sensor data, financial transactions, social media updates, etc.
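Because a stream cannot be stored in full and revisited, clustering it requires incremental updates: each arriving element folds into a running summary and is then discarded. A common approach is sequential (online) k-means, where each new point nudges its nearest centroid toward it; the sketch below is illustrative (`stream_update` is not a library function):

```python
def stream_update(centroids, counts, x):
    """One sequential (online) k-means step.

    Move the centroid nearest to the arriving point x toward it by a
    1/count fraction, i.e. the incremental-mean update c += (x - c) / n.
    """
    # Find the nearest centroid to the new point.
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    i = d.index(min(d))
    counts[i] += 1
    centroids[i] = tuple(c + (a - c) / counts[i]
                         for a, c in zip(x, centroids[i]))
    return centroids, counts

# Process a "stream" of points one at a time, never storing them.
cents, counts = [(0.0, 0.0), (10.0, 10.0)], [1, 1]
for x in [(2.0, 0.0), (10.0, 12.0)]:
    cents, counts = stream_update(cents, counts, x)
print(cents)  # [(1.0, 0.0), (10.0, 11.0)]
```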
Parallelism
Parallelism involves executing multiple tasks simultaneously, either across the
cores of a single processor or across multiple processors or machines. In the
context of clustering and stream processing, parallelism lets the expensive
steps (such as distance computations and per-partition aggregations) run
concurrently on disjoint chunks of the data.
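For example, the assignment step of K-Means is embarrassingly parallel: the data can be split into chunks and each chunk's nearest-centroid labels computed concurrently. A minimal sketch using Python's standard library (`parallel_assign` and its parameters are names chosen for this sketch):

```python
from concurrent.futures import ThreadPoolExecutor

def nearest_centroid(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
    return d.index(min(d))

def parallel_assign(points, centroids, workers=4):
    """Label each point with its nearest centroid, chunk by chunk, in parallel."""
    chunk = max(1, len(points) // workers)
    chunks = [points[i:i + chunk] for i in range(0, len(points), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        # Each worker labels one chunk independently; map preserves chunk order.
        labelled = list(ex.map(
            lambda ch: [nearest_centroid(p, centroids) for p in ch], chunks))
    return [lab for part in labelled for lab in part]

pts = [(0.0, 0.0), (0.5, 0.0), (9.0, 9.0), (9.5, 9.0)]
print(parallel_assign(pts, [(0.0, 0.0), (9.0, 9.0)]))  # [0, 0, 1, 1]
```

Threads only illustrate the pattern here; for CPU-bound work at scale the same partitioning would run on a process pool or a distributed framework such as Spark.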
Case Study: Recommendation Systems with Big Data Analytics
● Objective: Develop a recommendation system that uses big data analytics to improve
the accuracy and relevance of recommendations.
● Context: Typically, this involves scenarios like e-commerce platforms, streaming
services, social media, etc., where user engagement and satisfaction heavily rely on
personalized recommendations.
Key Steps:
● Data Sources: Gathering diverse datasets including user behavior (clicks, views,
purchases), item characteristics (descriptions, categories), and contextual data (time,
location).
● Data Preprocessing: Cleaning data, handling missing values, feature engineering, and
possibly scaling for large datasets.
Challenges:
● Scalability: Ensuring the recommendation system can handle large datasets and
real-time recommendations efficiently.
● Cold Start Problem: Addressing issues when there is limited data for new users or
items.
● Privacy and Ethics: Handling sensitive user data responsibly and ensuring compliance
with privacy regulations (e.g., GDPR).
Real-World Examples:
● Netflix: Utilizes collaborative filtering and machine learning to recommend movies and
TV shows based on user viewing history and ratings.
● Amazon: Incorporates both collaborative filtering and content-based filtering to suggest
products based on user browsing and purchase history.
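The collaborative-filtering idea behind these systems can be sketched in a few lines. This is not Netflix's or Amazon's actual algorithm, only a minimal user-based variant: score each item the target user has not rated by other users' ratings, weighted by how similar their rating vectors are (cosine similarity); `recommend` and its parameters are names chosen for this sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors (0 means 'unrated')."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(target, others, top_n=1):
    """Recommend the unrated items with the highest similarity-weighted scores."""
    scores = [0.0] * len(target)
    for other in others:
        sim = cosine(target, other)
        for i, r in enumerate(other):
            if target[i] == 0:        # only score items the target hasn't rated
                scores[i] += sim * r
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if target[i] == 0][:top_n]

# Rows are users, columns are items; ratings 1-5, 0 = unrated.
target = [5, 4, 0, 0]
others = [[5, 5, 4, 1],   # similar taste -> their ratings count heavily
          [1, 1, 1, 5]]   # dissimilar taste -> little influence
print(recommend(target, others))  # [2]
```

Production systems combine this with content-based features and matrix-factorization or deep-learning models, and precompute similarities offline to meet real-time latency requirements.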