
UNIT 5

Introduction to Clustering Techniques


Clustering is a fundamental technique in unsupervised learning where the goal is
to group similar objects or data points into clusters. The objective is to partition
the data in such a way that data points in the same cluster are more similar to
each other than to those in other clusters. Clustering is widely used in
applications such as customer segmentation, document grouping, and anomaly
detection.

Types of Clustering Techniques:

1. Hierarchical Clustering:
○ Agglomerative: Starts with each point as its own cluster and merges
the closest pairs of clusters until only one cluster remains.
○ Divisive: Starts with all points in one cluster and splits into smaller
clusters recursively.
2. Partitioning Methods:
○ K-means: Divides data into non-overlapping clusters where each
data point belongs to only one cluster. It aims to minimize the
variance within each cluster.
○ K-medoids (PAM): Similar to K-means but uses actual data points
(medoids) as cluster centers to handle outliers better.
3. Density-Based Methods:
○ DBSCAN (Density-Based Spatial Clustering of Applications with
Noise): Clusters dense regions of points based on density
reachability.
○ OPTICS (Ordering Points To Identify the Clustering Structure):
Produces an ordering of points that reflects the density-based
clustering structure, rather than an explicit partition into clusters.
4. Grid-Based Methods:
○ STING (Statistical Information Grid): Divides the data space into a
hierarchical grid of cells, stores statistical summaries in each cell,
and forms clusters by aggregating cells.
5. Model-Based Methods:
○ Gaussian Mixture Models (GMM): Assumes that the data points are
generated from a mixture of several Gaussian distributions.
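
The families above map directly onto standard library calls. As a minimal
sketch, the following Python snippet runs one algorithm from each of four
families on synthetic data using scikit-learn (the dataset and all parameter
values are illustrative assumptions, not recommendations):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with 3 blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)    # partitioning
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)             # hierarchical
labels_db = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)                     # density-based
labels_gmm = GaussianMixture(n_components=3, random_state=0).fit_predict(X)   # model-based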

Key Considerations in Clustering:

● Distance Metric: The choice of distance metric (Euclidean, Manhattan,
etc.) determines how similarity between data points is measured.
● Number of Clusters: For methods like K-means, determining the optimal
number of clusters (K) is crucial and can be assessed using metrics like
the silhouette score or the elbow method (see the sketch after this list).
● Scalability: Some methods scale better than others to large datasets or
high-dimensional data.
● Interpretability: Depending on the application, the interpretability of
clusters (what each cluster means) may be important.
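
To illustrate choosing the number of clusters, here is a minimal sketch using
scikit-learn on synthetic data: it prints the elbow-method inertia and the
silhouette score for a range of K values (the dataset and the range of K are
illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: look for the K where inertia stops dropping sharply.
    # Silhouette score: higher is better (range -1 to 1).
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))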

Applications of Clustering:

● Customer Segmentation: Group customers based on purchasing behavior
or demographics.
● Document Clustering: Organize documents into topics or categories based
on content similarity.
● Image Segmentation: Group pixels in images to identify objects or regions.
● Anomaly Detection: Identify unusual patterns or outliers in data.

Hierarchical Clustering

Hierarchical clustering is a method of cluster analysis that builds a
hierarchy of clusters. It can be approached in two main ways:

1. Agglomerative Hierarchical Clustering: This starts with each data point
as its own cluster and then merges the closest pairs of clusters until only
one cluster remains. The result is a tree-like structure (dendrogram) where
the height of each fusion represents the distance between the merged
clusters.
2. Divisive Hierarchical Clustering: This starts with all data points in one
cluster and recursively splits them into smaller clusters until each cluster
only contains one data point.

Key Points:

● Distance Measure: Determines how the proximity between clusters or
data points is calculated.
● Linkage Criteria: Determines the rule for computing the distance between
clusters during merging (agglomerative) or splitting (divisive).
● Dendrogram: A visual representation of the clustering process, showing
the sequence of merges or splits.
● Advantages: No need to specify the number of clusters beforehand, and
the hierarchical structure can be informative for understanding
relationships between clusters.
● Disadvantages: Computationally expensive for large datasets, and
decisions about where to cut the dendrogram to form clusters can be
subjective.
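
A minimal sketch of agglomerative clustering and its dendrogram using SciPy
(the synthetic data, Ward linkage, and the cut at two clusters are all
illustrative choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two synthetic 2-D blobs (illustrative only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(3, 0.5, (20, 2))])

Z = linkage(X, method="ward")                    # sequence of agglomerative merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters

dendrogram(Z)                                    # heights show merge distances
plt.show()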

K-Means Clustering:
K-Means is a popular clustering algorithm used for partitioning a dataset into K
distinct, non-overlapping clusters. It's an iterative algorithm that aims to minimize
the variance within each cluster.

Key Steps:

1. Initialization:
○ Choose K initial cluster centroids randomly (or based on some
heuristic).
2. Assignment:
○ Assign each data point to the nearest centroid, typically based on
Euclidean distance.
3. Update Centroids:
○ Recalculate the centroids of the clusters by taking the mean of all
data points assigned to each centroid.
4. Iteration:
○ Repeat the assignment and centroid update steps until convergence
(when centroids no longer change significantly or a maximum
number of iterations is reached).
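
These four steps translate almost line-for-line into code. Below is a
plain-NumPy sketch of the algorithm (illustrative and unoptimized; in
practice a library implementation such as scikit-learn's KMeans would be
used):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its points
        #    (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iterate until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels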

Properties:

● Objective: Minimize the sum of squared distances from each point to its
assigned cluster centroid.
● Initialization Sensitivity: The final clusters can depend on the initial random
choice of centroids, impacting the algorithm's performance.
● Scalability: Scales to large datasets, though the cost grows with the
number of points, the number of clusters K, and the dimensionality.
● Suitability: Effective when clusters are spherical and of similar size, less
effective with irregular shapes or widely varying cluster sizes.

Advantages:

● Simple and easy to implement.
● Computationally efficient for large datasets.
● Scales well to large numbers of variables.

Disadvantages:

● Requires the number of clusters (K) to be specified in advance.
● Sensitive to the initial centroids, which can lead to different results
on different runs.
● May not handle clusters of different sizes and densities well.

Applications:

● Customer segmentation.
● Document clustering.
● Image segmentation.
● Anomaly detection (by treating the smallest cluster as anomalies).

Extensions and Variants:

● K-Means++: Improved initialization to select initial centroids that are
distant from each other.
● Mini-batch K-Means: Speeds up K-Means by using mini-batches of data to
update centroids.
● Kernel K-Means: Allows non-linear separation of clusters by mapping data
into a higher-dimensional space.
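
The first two variants are available directly in scikit-learn, as in this
brief sketch (all parameter values are illustrative):

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

# K-Means++ initialization (scikit-learn's default init)
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)

# Mini-batch variant: trades a little accuracy for much faster updates
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, random_state=0).fit(X)

print(km.inertia_, mbk.inertia_)   # compare within-cluster variance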

Understanding these aspects will give you a solid foundation for implementing
and applying K-Means clustering effectively in various contexts.

CURE Clustering in Non-Euclidean Spaces


CURE (Clustering Using Representatives) is a clustering algorithm designed to
work efficiently in non-Euclidean spaces. Unlike traditional clustering algorithms
that often assume Euclidean distance measures, CURE is suitable for data
spaces where distance metrics are not straightforward or where the data may not
adhere to Euclidean geometry.

Key Concepts of CURE Clustering:

1. Clustering Using Representatives:
○ CURE selects a subset of points, called representatives, from the
dataset. These representatives are chosen to capture the overall
characteristics of clusters and serve as the initial cluster centers.
2. Hierarchical Clustering:
○ CURE employs a hierarchical clustering approach. It starts with each
point as its own cluster and gradually merges clusters based on their
proximity, using a combination of single-linkage and
complete-linkage strategies.
3. Handling Non-Euclidean Spaces:
○ CURE addresses the challenge of non-Euclidean spaces by using
an approximation strategy. It projects data points onto a line that
connects two of its representative points. This projection helps in
estimating the distance between points in a non-Euclidean space.
4. Advantages:
○ CURE is robust against outliers and can handle arbitrary shapes of
clusters.
○ It reduces the computation cost by using a representative set of
points rather than the entire dataset for clustering.
5. Steps in CURE Algorithm:
○ Selection of Representatives: Initially, select a large number of
points as representatives.
○ Hierarchical Clustering: Perform hierarchical clustering on the
representatives using an appropriate distance metric (often a
dissimilarity measure).
○ Cluster Formation: After hierarchical clustering, cut the dendrogram
at an appropriate level to form clusters.
6. Distance Measures:
○ CURE can utilize various distance measures suitable for the specific
data space, including cosine similarity, Jaccard distance, or other
domain-specific metrics.
7. Scalability:
○ The efficiency of CURE in handling large datasets is a significant
advantage, as it reduces the computational complexity compared to
traditional hierarchical clustering methods.
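
To make these steps concrete, here is a heavily simplified, illustrative
sketch of CURE's core ideas: hierarchically cluster a sample, pick
well-scattered representatives per cluster, shrink them toward the centroid,
and assign every point to its nearest representative. For readability it uses
Euclidean distance and single linkage, but any dissimilarity measure could be
substituted; all parameters are assumptions:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cure_sketch(X, k, n_reps=4, shrink=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # Work on a random sample to keep the hierarchical step cheap
    sample = X[rng.choice(len(X), size=min(200, len(X)), replace=False)]
    labels = fcluster(linkage(sample, method="single"), t=k, criterion="maxclust")

    reps, rep_labels = [], []
    for c in range(1, k + 1):
        pts = sample[labels == c]
        centroid = pts.mean(axis=0)
        # Pick well-scattered representatives (farthest-point heuristic)
        chosen = [pts[np.linalg.norm(pts - centroid, axis=1).argmax()]]
        while len(chosen) < min(n_reps, len(pts)):
            d = np.min([np.linalg.norm(pts - r, axis=1) for r in chosen], axis=0)
            chosen.append(pts[d.argmax()])
        # Shrink representatives toward the centroid to dampen outliers
        for r in chosen:
            reps.append(r + shrink * (centroid - r))
            rep_labels.append(c)

    reps, rep_labels = np.array(reps), np.array(rep_labels)
    # Assign every point to the cluster of its nearest representative
    d = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)
    return rep_labels[d.argmin(axis=1)]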

Implementation and Applications:

● Implementation: Implementing CURE requires careful consideration of
the distance metric and the selection of representative points.
● Applications: CURE is useful in domains where data do not conform to
Euclidean geometry, such as text clustering (using document similarity
metrics), biological data clustering (protein structures), and image
clustering (based on feature vectors).

Streams
Streams refer to continuous flows of data, typically arriving in real-time or near real-time. Stream
processing involves handling and analyzing these data streams as they are generated, often
with the goal of extracting insights or making decisions in real-time. Examples include
processing sensor data, financial transactions, social media updates, etc.
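
As one concrete illustration, clustering can itself run over a stream. The
sketch below (a sequential K-means update; the class and seed centroids are
illustrative assumptions) refines centroids one point at a time, so the
stream never has to be stored:

import numpy as np

class OnlineKMeans:
    # Sequential K-means: update centroids incrementally, point by point.
    def __init__(self, centroids):
        self.centroids = np.asarray(centroids, dtype=float)
        self.counts = np.zeros(len(self.centroids))

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # Nearest centroid for the newly arrived point
        j = np.linalg.norm(self.centroids - x, axis=1).argmin()
        self.counts[j] += 1
        # Move it toward x with a decaying step size of 1/count
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

# Feed points as they arrive; only the centroids and counts are retained
okm = OnlineKMeans([[0.0, 0.0], [5.0, 5.0]])
for point in ([0.2, 0.1], [4.8, 5.1], [0.1, -0.2]):
    cluster_id = okm.update(point)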

Parallelism
Parallelism involves executing multiple tasks simultaneously, either within a single processor
with multiple cores or across multiple processors or machines. In the context of clustering and
stream processing:

● Parallel Clustering: Algorithms like K-means can be parallelized to
handle large datasets more efficiently by distributing the computation
across multiple processors or nodes (see the sketch after this list).
● Parallel Stream Processing: When dealing with real-time data streams,
parallelism enables faster processing and analysis of incoming data by
dividing the workload among multiple processing units.
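
For instance, the assignment step of K-means is embarrassingly parallel: each
chunk of points can be assigned to centroids independently. A minimal sketch
using Python's standard library (the chunking scheme and worker count are
illustrative assumptions):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def assign_chunk(args):
    # K-means assignment step for one chunk of points
    chunk, centroids = args
    d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def parallel_assign(X, centroids, n_workers=4):
    # Split the data and assign each chunk in a separate process
    chunks = np.array_split(X, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        parts = ex.map(assign_chunk, [(c, centroids) for c in chunks])
    return np.concatenate(list(parts))

if __name__ == "__main__":  # guard required where worker processes are spawned
    X = np.random.default_rng(0).normal(size=(10_000, 2))
    labels = parallel_assign(X, centroids=X[:3])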

Case Study Outline: Advertising on the Web


1. Introduction
○ Overview of online advertising landscape.
○ Importance of targeted advertising and personalized user experiences.
○ Goals of the case study (e.g., improving ad relevance, increasing conversion
rates).

2. Data Collection and Preparation
○ Sources of data: ad impressions, click-through rates (CTR), user interactions.
○ Data preprocessing steps: cleaning, handling missing values, normalization.
3. Clustering Analysis
○ Objective: Identify segments or clusters of users based on their behavior and
preferences.
○ Methods: Use clustering algorithms like k-means, hierarchical clustering, or
DBSCAN.
○ Application: Group users into clusters that exhibit similar patterns in ad
engagement or website interactions.
4. Recommendation System Implementation
○ Objective: Develop a recommendation system for ads or content.
○ Approaches:
■ Collaborative Filtering: Recommend ads based on similar users'
preferences or behaviors (a toy sketch follows this outline).
■ Content-Based Filtering: Recommend ads based on the content of the
ad itself and user profiles.
■ Hybrid Approaches: Combine collaborative and content-based methods
for improved accuracy.
○ Evaluation: Measure recommendation system performance using metrics like
precision, recall, or A/B testing.
5. Results and Insights
○ Segmentation Insights: Understand the characteristics and behaviors of each
user segment.
○ Recommendation Effectiveness: Evaluate how well the recommendation
system improves ad targeting and user engagement.
○ Business Impact: Discuss any observed improvements in ad click-through
rates, conversion rates, or revenue.
6. Challenges and Considerations
○ Data Privacy: Ensure compliance with data protection regulations.
○ Scalability: Address challenges related to processing large volumes of data in
real-time.
○ Ethical Considerations: Consider implications of personalized advertising on
user privacy and experience.
7. Conclusion
○ Summary of key findings and outcomes from the case study.
○ Future directions for research or improvements in ad targeting techniques.
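
To make the collaborative-filtering approach in step 4 concrete, here is a
toy sketch of user-based filtering on a small click matrix (the data and
names are invented for illustration; a production system would operate at far
larger scale):

import numpy as np

# Toy user-by-ad click matrix (rows: users, columns: ads); illustrative only
clicks = np.array([
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
], dtype=float)

# Cosine similarity between users
unit = clicks / np.linalg.norm(clicks, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, 0)

# Score unseen ads for user 0 by similarity-weighted votes of other users
user = 0
scores = sim[user] @ clicks
scores[clicks[user] > 0] = -np.inf   # mask ads the user already clicked
print("recommend ad:", scores.argmax())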

Case Study Overview: Recommendation Systems in Big Data Mining Analytics

1. Problem Statement and Context

● Objective: Develop a recommendation system that utilizes big data analytics to improve
accuracy and relevance of recommendations.
● Context: Typically, this involves scenarios like e-commerce platforms, streaming
services, social media, etc., where user engagement and satisfaction heavily rely on
personalized recommendations.

2. Data Collection and Preparation

● Data Sources: Gathering diverse datasets including user behavior (clicks, views,
purchases), item characteristics (descriptions, categories), and contextual data (time,
location).
● Data Preprocessing: Cleaning data, handling missing values, feature engineering, and
possibly scaling for large datasets.

3. Techniques and Algorithms

● Collaborative Filtering: Utilizing user-item interaction data to identify
similarities between users or items. Techniques may include:
○ Memory-based methods: Such as user-based or item-based collaborative
filtering.
○ Model-based methods: Employing matrix factorization (like Singular Value
Decomposition or Alternating Least Squares) to predict user preferences
(see the sketch after this list).
● Content-Based Filtering: Incorporating item features and user profiles to recommend
items similar to those previously liked by the user.
● Hybrid Methods: Combining collaborative filtering and content-based filtering to
leverage their strengths and mitigate weaknesses.
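
Below is a minimal, illustrative sketch of the model-based route: factorizing
a toy ratings matrix with plain SGD on the observed entries (in practice one
would use a library such as Spark MLlib's ALS; the function, data, and
hyperparameters here are assumptions for illustration):

import numpy as np

def matrix_factorization(R, rank=2, lr=0.01, reg=0.1, epochs=500, seed=0):
    # Learn user/item latent factors from the observed (non-NaN) entries
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, rank))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, rank))   # item factors
    observed = np.argwhere(~np.isnan(R))
    for _ in range(epochs):
        for u, i in observed:
            pu = P[u].copy()
            err = R[u, i] - pu @ Q[i]
            # SGD step with L2 regularization
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Toy ratings matrix: rows = users, columns = items, np.nan = unrated
R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [np.nan, 1, 5, 4]], dtype=float)
P, Q = matrix_factorization(R)
print(np.round(P @ Q.T, 1))   # predictions fill in the missing cells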

4. Implementation and Evaluation

● System Architecture: Designing a scalable architecture using distributed
computing frameworks (like Apache Spark) to handle big data volumes
efficiently.
● Algorithm Implementation: Developing and fine-tuning recommendation algorithms
using appropriate libraries and frameworks (e.g., TensorFlow, Scikit-learn).
● Evaluation Metrics: Assessing the performance of the recommendation
system using metrics such as precision, recall, and mean average precision
(a small example follows this list). Cross-validation techniques may be
employed to validate model robustness.
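
A small illustrative helper for the ranking metrics mentioned above; it
computes precision@k and recall@k for one user's recommendation list (the
function name and example data are invented for illustration):

def precision_recall_at_k(recommended, relevant, k):
    # recommended: ranked list of item ids; relevant: items the user engaged with
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall_at_k(["a", "b", "c", "d"], ["b", "d", "e"], k=3)
print(p, r)   # 0.333..., 0.333...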

5. Challenges and Considerations

● Scalability: Ensuring the recommendation system can handle large datasets and
real-time recommendations efficiently.
● Cold Start Problem: Addressing issues when there is limited data for new users or
items.
● Privacy and Ethics: Handling sensitive user data responsibly and ensuring compliance
with privacy regulations (e.g., GDPR).

6. Real-World Applications and Impact

● Business Insights: Using insights from the recommendation system to
analyze user behavior and drive business decisions (e.g., product
placements, marketing strategies).
● User Experience: Enhancing user satisfaction and engagement by providing
personalized recommendations that align with individual preferences.

Examples and References

● Netflix: Utilizes collaborative filtering and machine learning to recommend movies and
TV shows based on user viewing history and ratings.
● Amazon: Incorporates both collaborative filtering and content-based filtering to suggest
products based on user browsing and purchase history.
