
UNIT 5

Introduction to Clustering Techniques


Clustering is a fundamental technique in unsupervised learning where the goal is
to group similar objects or data points into clusters. The objective is to partition
the data in such a way that data points in the same cluster are more similar to
each other than to those in other clusters. Clustering is widely used in
applications such as customer segmentation, document grouping, and anomaly
detection.

Types of Clustering Techniques:

1. Hierarchical Clustering:
○ Agglomerative: Starts with each point as its own cluster and merges
the closest pairs of clusters until only one cluster remains.
○ Divisive: Starts with all points in one cluster and splits into smaller
clusters recursively.
2. Partitioning Methods:
○ K-means: Divides data into non-overlapping clusters where each
data point belongs to only one cluster. It aims to minimize the
variance within each cluster.
○ K-medoids (PAM): Similar to K-means but uses actual data points
(medoids) as cluster centers to handle outliers better.
3. Density-Based Methods:
○ DBSCAN (Density-Based Spatial Clustering of Applications with
Noise): Clusters dense regions of points based on density
reachability.
○ OPTICS (Ordering Points To Identify the Clustering Structure):
Produces an ordering of points that reflects the density-based
clustering structure, rather than an explicit partition into clusters.
4. Grid-Based Methods:
○ STING (Statistical Information Grid): Divides the data space into a
hierarchical grid of cells, stores statistical summaries in each cell,
and forms clusters by aggregating cells.
5. Model-Based Methods:
○ Gaussian Mixture Models (GMM): Assumes that the data points are
generated from a mixture of several Gaussian distributions.
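
The families above map directly onto standard library calls. As a minimal
sketch, the following Python snippet runs one algorithm from each of four
families on synthetic data using scikit-learn (the dataset and all parameter
values are illustrative assumptions, not recommendations):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with 3 blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)    # partitioning
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)             # hierarchical
labels_db = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)                     # density-based
labels_gmm = GaussianMixture(n_components=3, random_state=0).fit_predict(X)   # model-based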

Key Considerations in Clustering:

● Distance Metric: The choice of distance metric (Euclidean, Manhattan,
etc.) determines how similarity between data points is measured.
● Number of Clusters: For methods like K-means, determining the optimal
number of clusters (K) is crucial and can be assessed using metrics like
the silhouette score or the elbow method (see the sketch after this list).
● Scalability: Some methods scale better than others to large datasets or
high-dimensional data.
● Interpretability: Depending on the application, the interpretability of
clusters (what each cluster means) may be important.
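
To illustrate choosing the number of clusters, here is a minimal sketch using
scikit-learn on synthetic data: it prints the elbow-method inertia and the
silhouette score for a range of K values (the dataset and the range of K are
illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: look for the K where inertia stops dropping sharply.
    # Silhouette score: higher is better (range -1 to 1).
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))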

Applications of Clustering:

● Customer Segmentation: Group customers based on purchasing behavior
or demographics.
● Document Clustering: Organize documents into topics or categories based
on content similarity.
● Image Segmentation: Group pixels in images to identify objects or regions.
● Anomaly Detection: Identify unusual patterns or outliers in data.

Hierarchical Clustering

Hierarchical clustering is a method of cluster analysis that builds a
hierarchy of clusters. It can be approached in two main ways:

1. Agglomerative Hierarchical Clustering: This starts with each data point
as its own cluster and then merges the closest pairs of clusters until only
one cluster remains. The result is a tree-like structure (dendrogram) where
the height of each fusion represents the distance between the merged
clusters.
2. Divisive Hierarchical Clustering: This starts with all data points in one
cluster and recursively splits them into smaller clusters until each cluster
only contains one data point.

Key Points:

● Distance Measure: Determines how the proximity between clusters or
data points is calculated.
● Linkage Criteria: Determines the rule for computing the distance between
clusters during merging (agglomerative) or splitting (divisive).
● Dendrogram: A visual representation of the clustering process, showing
the sequence of merges or splits.
● Advantages: No need to specify the number of clusters beforehand, and
the hierarchical structure can be informative for understanding
relationships between clusters.
● Disadvantages: Computationally expensive for large datasets, and
decisions about where to cut the dendrogram to form clusters can be
subjective.
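
A minimal sketch of agglomerative clustering and its dendrogram using SciPy
(the synthetic data, Ward linkage, and the cut at two clusters are all
illustrative choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two synthetic 2-D blobs (illustrative only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(3, 0.5, (20, 2))])

Z = linkage(X, method="ward")                    # sequence of agglomerative merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters

dendrogram(Z)                                    # heights show merge distances
plt.show()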

K-Means Clustering:
K-Means is a popular clustering algorithm used for partitioning a dataset into K
distinct, non-overlapping clusters. It's an iterative algorithm that aims to minimize
the variance within each cluster.

Key Steps:

1. Initialization:
○ Choose K initial cluster centroids randomly (or based on some
heuristic).
2. Assignment:
○ Assign each data point to the nearest centroid, typically based on
Euclidean distance.
3. Update Centroids:
○ Recalculate the centroids of the clusters by taking the mean of all
data points assigned to each centroid.
4. Iteration:
○ Repeat the assignment and centroid update steps until convergence
(when centroids no longer change significantly or a maximum
number of iterations is reached).
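
These four steps translate almost line-for-line into code. Below is a
plain-NumPy sketch of the algorithm (illustrative and unoptimized; in
practice a library implementation such as scikit-learn's KMeans would be
used):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its points
        #    (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iterate until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels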

Properties:

● Objective: Minimize the sum of squared distances from each point to its
assigned cluster centroid.
● Initialization Sensitivity: The final clusters can depend on the initial random
choice of centroids, impacting the algorithm's performance.
● Scalability: Scales to large datasets, though the cost grows with the
number of points, the number of clusters K, and the dimensionality.
● Suitability: Effective when clusters are spherical and of similar size, less
effective with irregular shapes or widely varying cluster sizes.

Advantages:

● Simple and easy to implement.
● Computationally efficient for large datasets.
● Scales well to large numbers of variables.

Disadvantages:

● Requires the number of clusters (K) to be specified in advance.
● Sensitive to the initial centroids, which can lead to different results
on different runs.
● May not handle clusters of different sizes and densities well.

Applications:

● Customer segmentation.
● Document clustering.
● Image segmentation.
● Anomaly detection (by treating the smallest cluster as anomalies).

Extensions and Variants:

● K-Means++: Improved initialization to select initial centroids that are
distant from each other.
● Mini-batch K-Means: Speeds up K-Means by using mini-batches of data to
update centroids.
● Kernel K-Means: Allows non-linear separation of clusters by mapping data
into a higher-dimensional space.
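
The first two variants are available directly in scikit-learn, as in this
brief sketch (all parameter values are illustrative):

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

# K-Means++ initialization (scikit-learn's default init)
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)

# Mini-batch variant: trades a little accuracy for much faster updates
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, random_state=0).fit(X)

print(km.inertia_, mbk.inertia_)   # compare within-cluster variance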

Understanding these aspects will give you a solid foundation for implementing
and applying K-Means clustering effectively in various contexts.

CURE Clustering in Non-Euclidean Spaces


CURE (Clustering Using Representatives) is a clustering algorithm designed to
work efficiently in non-Euclidean spaces. Unlike traditional clustering algorithms
that often assume Euclidean distance measures, CURE is suitable for data
spaces where distance metrics are not straightforward or where the data may not
adhere to Euclidean geometry.

Key Concepts of CURE Clustering:

1. Clustering Using Representatives:
○ CURE selects a subset of points, called representatives, from the
dataset. These representatives are chosen to capture the overall
characteristics of clusters and serve as the initial cluster centers.
2. Hierarchical Clustering:
○ CURE employs a hierarchical clustering approach. It starts with each
point as its own cluster and gradually merges clusters based on their
proximity, using a combination of single-linkage and
complete-linkage strategies.
3. Handling Non-Euclidean Spaces:
○ CURE addresses the challenge of non-Euclidean spaces by using
an approximation strategy. It projects data points onto a line that
connects two of its representative points. This projection helps in
estimating the distance between points in a non-Euclidean space.
4. Advantages:
○ CURE is robust against outliers and can handle arbitrary shapes of
clusters.
○ It reduces the computation cost by using a representative set of
points rather than the entire dataset for clustering.
5. Steps in CURE Algorithm:
○ Selection of Representatives: Initially, select a large number of
points as representatives.
○ Hierarchical Clustering: Perform hierarchical clustering on the
representatives using an appropriate distance metric (often a
dissimilarity measure).
○ Cluster Formation: After hierarchical clustering, cut the dendrogram
at an appropriate level to form clusters.
6. Distance Measures:
○ CURE can utilize various distance measures suitable for the specific
data space, including cosine similarity, Jaccard distance, or other
domain-specific metrics.
7. Scalability:
○ The efficiency of CURE in handling large datasets is a significant
advantage, as it reduces the computational complexity compared to
traditional hierarchical clustering methods.
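
To make these steps concrete, here is a heavily simplified, illustrative
sketch of CURE's core ideas: hierarchically cluster a sample, pick
well-scattered representatives per cluster, shrink them toward the centroid,
and assign every point to its nearest representative. For readability it uses
Euclidean distance and single linkage, but any dissimilarity measure could be
substituted; all parameters are assumptions:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cure_sketch(X, k, n_reps=4, shrink=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # Work on a random sample to keep the hierarchical step cheap
    sample = X[rng.choice(len(X), size=min(200, len(X)), replace=False)]
    labels = fcluster(linkage(sample, method="single"), t=k, criterion="maxclust")

    reps, rep_labels = [], []
    for c in range(1, k + 1):
        pts = sample[labels == c]
        centroid = pts.mean(axis=0)
        # Pick well-scattered representatives (farthest-point heuristic)
        chosen = [pts[np.linalg.norm(pts - centroid, axis=1).argmax()]]
        while len(chosen) < min(n_reps, len(pts)):
            d = np.min([np.linalg.norm(pts - r, axis=1) for r in chosen], axis=0)
            chosen.append(pts[d.argmax()])
        # Shrink representatives toward the centroid to dampen outliers
        for r in chosen:
            reps.append(r + shrink * (centroid - r))
            rep_labels.append(c)

    reps, rep_labels = np.array(reps), np.array(rep_labels)
    # Assign every point to the cluster of its nearest representative
    d = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)
    return rep_labels[d.argmin(axis=1)]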

Implementation and Applications:

● Implementation: Implementing CURE requires careful consideration of
the distance metric and the selection of representative points.
● Applications: CURE is useful in domains where data do not conform to
Euclidean geometry, such as text clustering (using document similarity
metrics), biological data clustering (protein structures), and image
clustering (based on feature vectors).

Streams
Streams refer to continuous flows of data, typically arriving in real-time or near real-time. Stream
processing involves handling and analyzing these data streams as they are generated, often
with the goal of extracting insights or making decisions in real-time. Examples include
processing sensor data, financial transactions, social media updates, etc.
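
As one concrete illustration, clustering can itself run over a stream. The
sketch below (a sequential K-means update; the class and seed centroids are
illustrative assumptions) refines centroids one point at a time, so the
stream never has to be stored:

import numpy as np

class OnlineKMeans:
    # Sequential K-means: update centroids incrementally, point by point.
    def __init__(self, centroids):
        self.centroids = np.asarray(centroids, dtype=float)
        self.counts = np.zeros(len(self.centroids))

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # Nearest centroid for the newly arrived point
        j = np.linalg.norm(self.centroids - x, axis=1).argmin()
        self.counts[j] += 1
        # Move it toward x with a decaying step size of 1/count
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

# Feed points as they arrive; only the centroids and counts are retained
okm = OnlineKMeans([[0.0, 0.0], [5.0, 5.0]])
for point in ([0.2, 0.1], [4.8, 5.1], [0.1, -0.2]):
    cluster_id = okm.update(point)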

Parallelism
Parallelism involves executing multiple tasks simultaneously, either within a single processor
with multiple cores or across multiple processors or machines. In the context of clustering and
stream processing:

● Parallel Clustering: Algorithms like K-means can be parallelized to
handle large datasets more efficiently by distributing the computation
across multiple processors or nodes (see the sketch after this list).
● Parallel Stream Processing: When dealing with real-time data streams,
parallelism enables faster processing and analysis of incoming data by
dividing the workload among multiple processing units.
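
For instance, the assignment step of K-means is embarrassingly parallel: each
chunk of points can be assigned to centroids independently. A minimal sketch
using Python's standard library (the chunking scheme and worker count are
illustrative assumptions):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def assign_chunk(args):
    # K-means assignment step for one chunk of points
    chunk, centroids = args
    d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def parallel_assign(X, centroids, n_workers=4):
    # Split the data and assign each chunk in a separate process
    chunks = np.array_split(X, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        parts = ex.map(assign_chunk, [(c, centroids) for c in chunks])
    return np.concatenate(list(parts))

if __name__ == "__main__":  # guard required where worker processes are spawned
    X = np.random.default_rng(0).normal(size=(10_000, 2))
    labels = parallel_assign(X, centroids=X[:3])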

Case Study Outline: Advertising on the Web


1. Introduction
○ Overview of online advertising landscape.
○ Importance of targeted advertising and personalized user experiences.
○ Goals of the case study (e.g., improving ad relevance, increasing conversion
rates).

2. Data Collection and Preparation
○ Sources of data: ad impressions, click-through rates (CTR), user interactions.
○ Data preprocessing steps: cleaning, handling missing values, normalization.
3. Clustering Analysis
○ Objective: Identify segments or clusters of users based on their behavior and
preferences.
○ Methods: Use clustering algorithms like k-means, hierarchical clustering, or
DBSCAN.
○ Application: Group users into clusters that exhibit similar patterns in ad
engagement or website interactions.
4. Recommendation System Implementation
○ Objective: Develop a recommendation system for ads or content.
○ Approaches:
■ Collaborative Filtering: Recommend ads based on similar users'
preferences or behaviors (a toy sketch follows this outline).
■ Content-Based Filtering: Recommend ads based on the content of the
ad itself and user profiles.
■ Hybrid Approaches: Combine collaborative and content-based methods
for improved accuracy.
○ Evaluation: Measure recommendation system performance using metrics like
precision, recall, or A/B testing.
5. Results and Insights
○ Segmentation Insights: Understand the characteristics and behaviors of each
user segment.
○ Recommendation Effectiveness: Evaluate how well the recommendation
system improves ad targeting and user engagement.
○ Business Impact: Discuss any observed improvements in ad click-through
rates, conversion rates, or revenue.
6. Challenges and Considerations
○ Data Privacy: Ensure compliance with data protection regulations.
○ Scalability: Address challenges related to processing large volumes of data in
real-time.
○ Ethical Considerations: Consider implications of personalized advertising on
user privacy and experience.
7. Conclusion
○ Summary of key findings and outcomes from the case study.
○ Future directions for research or improvements in ad targeting techniques.
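
To make the collaborative-filtering approach in step 4 concrete, here is a
toy sketch of user-based filtering on a small click matrix (the data and
names are invented for illustration; a production system would operate at far
larger scale):

import numpy as np

# Toy user-by-ad click matrix (rows: users, columns: ads); illustrative only
clicks = np.array([
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
], dtype=float)

# Cosine similarity between users
unit = clicks / np.linalg.norm(clicks, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, 0)

# Score unseen ads for user 0 by similarity-weighted votes of other users
user = 0
scores = sim[user] @ clicks
scores[clicks[user] > 0] = -np.inf   # mask ads the user already clicked
print("recommend ad:", scores.argmax())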

Case Study Overview: Recommendation Systems in Big Data Mining Analytics

1. Problem Statement and Context

● Objective: Develop a recommendation system that utilizes big data analytics to improve
accuracy and relevance of recommendations.
● Context: Typically, this involves scenarios like e-commerce platforms, streaming
services, social media, etc., where user engagement and satisfaction heavily rely on
personalized recommendations.

2. Data Collection and Preparation

● Data Sources: Gathering diverse datasets including user behavior (clicks, views,
purchases), item characteristics (descriptions, categories), and contextual data (time,
location).
● Data Preprocessing: Cleaning data, handling missing values, feature engineering, and
possibly scaling for large datasets.

3. Techniques and Algorithms

● Collaborative Filtering: Utilizing user-item interaction data to identify
similarities between users or items. Techniques may include:
○ Memory-based methods: Such as user-based or item-based collaborative
filtering.
○ Model-based methods: Employing matrix factorization (like Singular Value
Decomposition or Alternating Least Squares) to predict user preferences
(see the sketch after this list).
● Content-Based Filtering: Incorporating item features and user profiles to recommend
items similar to those previously liked by the user.
● Hybrid Methods: Combining collaborative filtering and content-based filtering to
leverage their strengths and mitigate weaknesses.
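
Below is a minimal, illustrative sketch of the model-based route: factorizing
a toy ratings matrix with plain SGD on the observed entries (in practice one
would use a library such as Spark MLlib's ALS; the function, data, and
hyperparameters here are assumptions for illustration):

import numpy as np

def matrix_factorization(R, rank=2, lr=0.01, reg=0.1, epochs=500, seed=0):
    # Learn user/item latent factors from the observed (non-NaN) entries
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, rank))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, rank))   # item factors
    observed = np.argwhere(~np.isnan(R))
    for _ in range(epochs):
        for u, i in observed:
            pu = P[u].copy()
            err = R[u, i] - pu @ Q[i]
            # SGD step with L2 regularization
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Toy ratings matrix: rows = users, columns = items, np.nan = unrated
R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [np.nan, 1, 5, 4]], dtype=float)
P, Q = matrix_factorization(R)
print(np.round(P @ Q.T, 1))   # predictions fill in the missing cells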

4. Implementation and Evaluation

● System Architecture: Designing a scalable architecture using distributed
computing frameworks (like Apache Spark) to handle big data volumes
efficiently.
● Algorithm Implementation: Developing and fine-tuning recommendation algorithms
using appropriate libraries and frameworks (e.g., TensorFlow, Scikit-learn).
● Evaluation Metrics: Assessing the performance of the recommendation
system using metrics such as precision, recall, and mean average precision
(a small example follows this list). Cross-validation techniques may be
employed to validate model robustness.
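
A small illustrative helper for the ranking metrics mentioned above; it
computes precision@k and recall@k for one user's recommendation list (the
function name and example data are invented for illustration):

def precision_recall_at_k(recommended, relevant, k):
    # recommended: ranked list of item ids; relevant: items the user engaged with
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall_at_k(["a", "b", "c", "d"], ["b", "d", "e"], k=3)
print(p, r)   # 0.333..., 0.333...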

5. Challenges and Considerations

● Scalability: Ensuring the recommendation system can handle large datasets and
real-time recommendations efficiently.
● Cold Start Problem: Addressing issues when there is limited data for new users or
items.
● Privacy and Ethics: Handling sensitive user data responsibly and ensuring compliance
with privacy regulations (e.g., GDPR).

6. Real-World Applications and Impact

● Business Insights: Using insights from the recommendation system to
analyze user behavior and drive business decisions (e.g., product
placements, marketing strategies).
● User Experience: Enhancing user satisfaction and engagement by providing
personalized recommendations that align with individual preferences.

Examples and References

● Netflix: Utilizes collaborative filtering and machine learning to recommend movies and
TV shows based on user viewing history and ratings.
● Amazon: Incorporates both collaborative filtering and content-based filtering to suggest
products based on user browsing and purchase history.
