Welcome To My Presentation
On
Clustering Analysis
Submitted By
Ruhul Amin
Department of Statistics
Pabna University of Science & Technology
Department of Statistics, Pabna University of Science & Technology
OUTLINE OF PRESENTATION
• Clustering: basic concept
• Types of clustering
• Clustering techniques
• K-means clustering
• K-means clustering algorithm
• Requirements
• Applications
• Advantages & disadvantages
• Conclusion
CLUSTERING: BASIC CONCEPT
CLUSTERING
Clustering is traditionally viewed as an unsupervised method for data analysis. Clustering is the task of dividing
the population or data points into a number of groups such that data points in the same group are more similar
to other data points in the same group than to those in other groups. In simple words, the aim is to segregate
groups with similar traits and assign them into clusters. It is a main task of exploratory data mining and a
common technique for statistical data analysis, used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
TYPES OF CLUSTERING
Broadly speaking, clustering can be divided into two subgroups:
HARD CLUSTERING:
In hard clustering, each data point either belongs to a cluster completely or does not belong to it at all.
For instance, we may want the algorithm to read all of the tweets and place each one in exactly one group, positive or negative.
SOFT CLUSTERING:
In soft clustering, a data point does not have to belong completely to one cluster. Instead, it can be a member of
more than one cluster, with a set of membership coefficients corresponding to the probability of being in each
cluster.
For instance, suppose you are attempting to forecast the rating changes for the counterparties you trade with. The
algorithm can create a cluster for each rating and indicate the likelihood that a counterparty belongs to each cluster.
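The difference between the two kinds of assignment can be sketched in a few lines of Python. This is only an illustration: the centroids and the data point are hypothetical, and turning distances into membership weights with a softmax is an assumption made here, not a method taken from the slides.

```python
import math

def distances(point, centroids):
    """Euclidean distance from one point to every centroid."""
    return [math.dist(point, c) for c in centroids]

def hard_assignment(point, centroids):
    """Hard clustering: the index of the single nearest centroid."""
    d = distances(point, centroids)
    return d.index(min(d))

def soft_assignment(point, centroids):
    """Soft clustering: one membership weight per cluster, summing to 1.

    Assumption: weights are a softmax over negative distances,
    chosen only for illustration.
    """
    d = distances(point, centroids)
    weights = [math.exp(-x) for x in d]
    total = sum(weights)
    return [w / total for w in weights]

centroids = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centers
point = (1.0, 0.0)                    # hypothetical data point

hard = hard_assignment(point, centroids)  # a single cluster index
soft = soft_assignment(point, centroids)  # a weight for every cluster
print(hard, soft)
```

The hard result commits the point to one cluster, while the soft result keeps a weight for every cluster, with the nearer cluster weighted more heavily.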
TYPES OF CLUSTERING
Is clustering typically …?
A. Supervised
B. Unsupervised
CLUSTERING TECHNIQUES
A CATEGORIZATION OF MAJOR CLUSTERING METHODS
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-based Methods
Model-based Methods
Partitional clustering decomposes a data set into a set of disjoint clusters.
Partitional (or partitioning) clustering methods classify the observations in a data set into multiple groups based on their
similarity. These algorithms require the analyst to specify the number of clusters K to
be generated, which cannot exceed the number of observations N (N ≥ K). This presentation covers one
commonly used partitional method: k-means clustering.
K-MEANS CLUSTERING
K-means clustering (MacQueen, 1967) is a method commonly used to automatically partition a data set
into k groups. It proceeds by selecting k initial cluster centers and then iteratively refining them.
K-means clustering is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined
categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the
variable K (N ≥ K, where N is the number of data points). The algorithm works iteratively to assign each data point to one of K groups based on the features that are
provided. Data points are clustered based on feature similarity. The results of the k-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data.
2. Labels for the training data (each data point is assigned to a single cluster).
K-MEANS CLUSTERING ALGORITHM
THE K-MEANS ALGORITHM IS COMPOSED OF 3 STEPS:
STEP 1: INITIALIZATION
First, k-means randomly chooses K examples (data points) from the dataset as initial
centroids, simply because it does not yet know where the center of each cluster is. (A
centroid is the center of a cluster.)
STEP 2: CLUSTER ASSIGNMENT
Then, each data point is assigned to its closest (most similar) centroid, and the points that share a centroid form a cluster. If we’re using
the Euclidean distance, the boundary between two neighboring clusters is the perpendicular bisector of the straight line joining their two
centroids.
STEP 3: MOVE THE CENTROID
Now we have new clusters that need centers. A centroid’s new value is the mean of all the
examples (data points) in its cluster.
We keep repeating steps 2 and 3 until the centroids stop moving; in other words, until the k-means algorithm has
converged.
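The three steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name kmeans, the max_iter and seed parameters, the rule that an empty cluster keeps its previous centroid, and the sample data are all assumptions made for this sketch.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: initialization - pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: cluster assignment - each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Converged when no centroid moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two loose groups, not taken from the slides.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

On this small, well-separated data set the first three points end up in one cluster and the last three in the other, whichever two points are drawn as starting centroids.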
[Figure: four scatter-plot panels labeled Step 1, Step 2, Step 3, and Step 4, each with both axes running from 0 to 10]
K-MEANS CLUSTERING ALGORITHM
CLUSTER ANALYSIS – EXAMPLE
We will work through a numerical example of the well-known k-means clustering algorithm.
We will try to find clusters in a dataset consisting of 5 points.
STEP 1: SET CLUSTER QUANTITY
The k-means algorithm requires you to set the number of clusters k beforehand. Here, we take k = 2 (the data look like there are two
clusters: one on the bottom left and one on the top right).
STEP 2: ASSIGNMENT OF DATA POINTS
In the assignment step, each data point is assigned to the nearest cluster centroid. The cluster centroids can be seen as
centers of gravity within each cluster. To start with, we choose random points as centroids. Here, we take point A(1,1)
as one of them. Instead of taking actual data points, we could have taken completely random points as well.
To find the nearest cluster centroid for each data point, you need a distance measure. Many distance
metrics are available. We will work with the ordinary Euclidean distance.
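For reference, the Euclidean distance between a data point x and a centroid c can be written as below. The symbols are notation assumed here rather than taken from the slides: m is the number of coordinates, and x_j and c_j are the j-th coordinates of the point and the centroid.

```latex
d(x, c) = \sqrt{\sum_{j=1}^{m} (x_j - c_j)^2}
```

The assignment step gives each point to the centroid with the smallest d(x, c).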
STEP 3: MOVE THE CENTROID
Now we have new clusters that need centers. A centroid’s new value is the mean of
all the examples in its cluster.
We keep repeating steps 2 and 3 until the centroids stop moving; in other words, until the k-means
algorithm has converged.
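One pass of steps 2 and 3 can be traced on a small data set as sketched below. Only the point A(1,1) comes from the example; the coordinates of B, C, D, and E, the choice of E as the second starting centroid, and the helper names assign and update are hypothetical and serve only to make the arithmetic concrete.

```python
import math

def assign(points, centroids):
    """Assignment step: map each point name to the index of its nearest centroid."""
    return {
        name: min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for name, p in points.items()
    }

def update(points, assignment, k):
    """Update step: each centroid becomes the mean of its assigned points."""
    new_centroids = []
    for j in range(k):
        members = [points[n] for n, lab in assignment.items() if lab == j]
        # Assumes every cluster received at least one point.
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

# Hypothetical coordinates: only A(1, 1) appears in the example above.
points = {"A": (1, 1), "B": (1, 2), "C": (2, 1), "D": (5, 4), "E": (5, 5)}
centroids = [points["A"], points["E"]]  # the second starting centroid is assumed

assignment = assign(points, centroids)
new_centroids = update(points, assignment, k=2)
print(assignment)
print(new_centroids)
```

With these assumed points, A, B, and C are assigned to the first centroid and D and E to the second, so the centroids move to roughly (1.33, 1.33) and exactly (5.0, 4.5). Running the assignment again from those positions changes nothing, which is the convergence condition described above.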
Requirements
Requirements of clustering in data mining:
1. Scalability - we need highly scalable clustering algorithms to deal with large databases.
2. Ability to deal with different kinds of attributes - algorithms should be applicable to any kind of data, such as
interval-based (numerical), categorical, and binary data.
3. Discovery of clusters with arbitrary shape - the clustering algorithm should be capable of detecting clusters of arbitrary
shape. It should not be limited to distance measures that tend to find spherical clusters of small size.
4. High dimensionality - the clustering algorithm should be able to handle not only low-dimensional data but also
high-dimensional spaces.
5. Ability to deal with noisy data - databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such
data and may produce poor-quality clusters.
6. Interpretability - the clustering results should be interpretable, comprehensible, and usable.
APPLICATIONS
HERE ARE 8 EXAMPLES OF CLUSTERING ALGORITHMS IN ACTION.
1. IDENTIFYING FAKE NEWS
How clustering works:
The algorithm takes in the content of news articles (the corpus), examines the words used, and then clusters them. These clusters help the algorithm
determine which pieces are genuine and which are fake news. Certain words are found more
commonly in sensationalized, click-bait articles. A high percentage of such
terms in an article indicates a higher probability that the material is fake news.
2. SPAM FILTER
How clustering works:
K-means clustering techniques have proven to be an effective way of identifying spam. They
work by looking at the different sections of the email (header, sender, and content). The
data is then grouped together.
These groups can then be classified to identify which are spam. Including clustering in the
classification process reportedly improves the accuracy of the filter to 97%. This is excellent news for
people who want to be sure they’re not missing out on their favorite newsletters and offers.
3. ASTRONOMY
Clustering helps to find groups of similar stars and galaxies.
4. GENOMICS
Clustering can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in
populations.
5. CLASSIFYING NETWORK TRAFFIC
How clustering works:
K-means clustering is used to group together characteristics of the traffic sources. Once the clusters are created, you can classify the traffic
types. The process is faster and more accurate than the previous AutoClass method. With precise information on traffic sources, you are able
to grow your site and plan capacity effectively.
6. IDENTIFYING FRAUDULENT OR CRIMINAL ACTIVITY
How clustering works:
By analysing GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups, you are then able to
classify them into those that are genuine and those that are fraudulent.
7. DOCUMENT ANALYSIS
How clustering works:
Hierarchical clustering has been used for this task. The algorithm looks at the text and groups it into different
themes. Using this technique, you can cluster and organize similar documents quickly based on the characteristics identified in the
text.
8. CALL DETAIL RECORD ANALYSIS
A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a
customer.
K-means advantages and disadvantages
Advantages of k-means
Relatively simple to implement.
Scales to large data sets.
Guarantees convergence (to a local optimum, not necessarily the global one).
Can warm-start the positions of centroids.
Easily adapts to new examples (data points).
Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages of k-means
Choosing k manually.
Being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different
initial values and picking the best result. As k increases, you need advanced versions of k-means to
pick better values of the initial centroids (called k-means seeding).
Difficulty clustering data of varying sizes and density.
Sensitivity to outliers.
Scaling poorly with the number of dimensions.
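The slides do not say which seeding scheme they have in mind. One widely used option is k-means++ seeding, sketched below as an assumption: the first centroid is drawn uniformly at random, and each later centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name, the seed parameter, and the sample data are hypothetical.

```python
import math
import random

def kmeans_plus_plus_init(points, k, seed=0):
    """Pick k starting centroids from `points`, spread apart (k-means++ style).

    Assumes `points` contains at least k distinct points.
    """
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Draw the next centroid with probability proportional to d2,
        # so points far from the current centroids are more likely to be picked.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical data, not taken from the slides.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
init = kmeans_plus_plus_init(data, k=3, seed=1)
print(init)
```

Because an already chosen point has zero distance to itself, it gets zero weight and is not drawn again, so the starting centroids are distinct data points that tend to lie in different regions of the data.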
CONCLUSION
The k-means algorithm is useful for undirected knowledge discovery and is relatively simple.
K-means has found widespread use in many fields, ranging from unsupervised learning of neural
networks to pattern recognition, classification analysis, artificial intelligence, image processing, and many others.
