Data Mining
Lecture-07
Dr. Waqas Haider Khan Bangyal
Data Mining Overview
Data Mining
Data warehouses and OLAP (Online Analytical Processing)
Clustering: Hierarchical and Partitioned approaches
Classification: Decision Trees, ANNs, Bayesian classifiers, and Genetic Algorithms
Association Rules Mining
Advanced topics: outlier detection, web mining
What is a natural grouping among these objects?
Clustering is subjective
[Figure: the same group of people clustered in different ways: Siaad's Family, School Employees, Females, Males]
What is Clustering?
Also called unsupervised learning; it is sometimes called classification by
statisticians, sorting by psychologists, and segmentation by people in marketing.
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly
from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
What is clustering?
Clustering is the classification of objects into different
groups, or more precisely, the partitioning of a data set
into subsets (clusters), so that the data in each subset
(ideally) share some common trait, often according to
some defined distance measure.
What is Clustering?
You can think of this as “unsupervised classification”.
Clustering is alternatively called “grouping”.
Intuitively, we want to assign the same label to
data points that are “close” to each other.
Thus, clustering algorithms rely on a distance metric
between data points.
It is sometimes said that for clustering, the distance
metric is more important than the clustering algorithm.
Idea and Applications
Clustering is the process of grouping a set of physical or
abstract objects into classes of similar objects.
It is also called unsupervised learning.
It is a common and important task that finds many
applications.
Applications in Search engines:
Structuring search results
Suggesting related pages
Automatic directory construction/update
Finding near-identical/duplicate pages
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Desirable Properties of a Clustering Algorithm
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine
input parameters
• Ability to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
• Interpretability and usability
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should
cluster along continental faults
Concepts in Clustering
Defining distance between points
A good clustering is one where:
(Intra-cluster distance) the sum of distances between objects
in the same cluster is minimized,
(Inter-cluster distance) while the distances between different
clusters are maximized.
Objective to minimize: F(Intra, Inter)
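One way to make this objective concrete, as a sketch (an illustrative formalization, not the only possible one; λ here is a hypothetical trade-off weight):

```latex
\mathrm{Intra} = \sum_{j=1}^{K} \sum_{x \in C_j} d(x, \mu_j), \qquad
\mathrm{Inter} = \sum_{j < l} d(\mu_j, \mu_l)
```

A good clustering makes Intra small and Inter large, e.g. by minimizing F(Intra, Inter) = Intra - λ·Inter, where μ_j denotes the centroid of cluster C_j.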
What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Types of clustering
1. Hierarchical algorithms: these find successive clusters
using previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative algorithms
begin with each element as a separate cluster and merge
them into successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin with
the whole set and proceed to divide it into successively
smaller clusters.
2. Partitional clustering: Partitional algorithms determine all
clusters at once (see the sketch after this list). They include:
K-means and derivatives
Fuzzy c-means clustering
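A rough sketch contrasting the two families on a toy data set (assuming SciPy and scikit-learn are available; the four 2-D points are arbitrary):

```python
# Sketch: hierarchical (agglomerative) vs. partitional clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

# Agglomerative ("bottom-up"): start from singletons, merge successively,
# then cut the resulting merge tree to obtain 2 clusters.
tree = linkage(X, method="average")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

# Partitional: determine all 2 clusters at once with k-means.
part_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(hier_labels, part_labels)  # both separate {A, B} from {C, D} here
```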
Classical clustering methods
Partitioning methods
k-Means (and EM), k-Medoids
Hierarchical methods
agglomerative, divisive, BIRCH
Model-based clustering methods
K-MEANS CLUSTERING
Simply speaking, k-means clustering is an algorithm to
classify or group objects, based on attributes/features,
into K groups, where K is a positive integer.
The grouping is done by minimizing the sum of squared
distances between the data points and the corresponding
cluster centroid.
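Written out, the objective k-means minimizes (the within-cluster sum of squared distances, with μ_j the centroid of cluster C_j) is:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```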
Common Distance measures:
Distance measure will determine how the similarity of two
elements is calculated and it will influence the shape of the
clusters.
They include:
1. The Euclidean distance (also called 2-norm distance) is given by:
d(a, b) = sqrt( Σ_i (a_i - b_i)^2 )
2. The Manhattan distance (also called taxicab norm or 1-norm) is
given by:
d(a, b) = Σ_i |a_i - b_i|
Common Distance measures:
3. The maximum norm is given by:
d(a, b) = max_i |a_i - b_i|
4. The Mahalanobis distance corrects data for different scales and
correlations in the variables.
5. Inner product space: The angle between two vectors can be used
as a distance measure when clustering high dimensional data
6. Hamming distance measures the minimum number of
substitutions required to change one member into another (a
special case of edit distance that allows substitutions only).
Sketches of these measures follow below.
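Minimal sketches of these measures, assuming NumPy and points given as equal-length arrays a and b (the function names are just illustrative):

```python
import numpy as np

def euclidean(a, b):      # 2-norm distance
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):      # 1-norm / taxicab distance
    return np.sum(np.abs(a - b))

def max_norm(a, b):       # maximum (infinity) norm
    return np.max(np.abs(a - b))

def angle(a, b):          # angle between vectors, via the inner product
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

def hamming(a, b):        # number of positions where a and b differ
    return int(np.sum(a != b))
```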
K-means
Works when we know k, the number of clusters we want to
find
Idea:
Randomly pick k points as the “centroids” of the k clusters
Loop:
For each point, put the point in the cluster to whose
centroid it is closest
Recompute the cluster centroids
Repeat the loop (until there is no change in clusters between
two consecutive iterations).
Iterative improvement of the objective function:
the sum of squared distances from each point to the centroid of its cluster.
How does the K-Means clustering algorithm work?
Step 1: Begin with a decision on the value of k, the
number of clusters.
Step 2:
Choose any initial partition that classifies the data into
k clusters. You may assign the training samples
randomly, or systematically as follows:
1. Take the first k training samples as single-element
clusters.
2. Assign each of the remaining (N - k) training
samples to the cluster with the nearest centroid. After
each assignment, recompute the centroid of the gaining
cluster.
How does the K-Means clustering algorithm work?
Step 3: Take each sample in sequence and
compute its distance from the centroid of
each of the clusters. If a sample is not
currently in the cluster with the closest
centroid, switch this sample to that cluster
and update the centroid of the cluster
gaining the new sample and the cluster
losing the sample.
Step 4: Repeat Step 3 until convergence is
achieved, that is, until a pass through the
training samples causes no new assignments.
(A from-scratch sketch of these steps follows.)
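A from-scratch sketch of Steps 1-4, assuming NumPy, Euclidean distance, and data X given as an (n, d) array. Note this batch variant recomputes centroids once per pass rather than after every single reassignment as in Step 3:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and pick initial centroids (here: k random points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each sample to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when a full pass changes nothing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Either variant converges to a local minimum of the sum-of-squares objective.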
K-means Overview
An unsupervised clustering algorithm
“K” stands for the number of clusters; it is typically a user
input to the algorithm, though some criteria can be used to
estimate K automatically
It is an approximation to an NP-hard combinatorial
optimization problem
K-means algorithm is iterative in nature
It converges; however, only a local minimum is guaranteed
Works only for numerical data
Easy to implement
Weaknesses of K-Mean Clustering
1. When the number of data points is small, the initial grouping
determines the clusters significantly.
2. The number of clusters, K, must be determined beforehand. The
algorithm also does not yield the same result on every run, since
the resulting clusters depend on the initial random assignments.
3. We never know the true clusters: with the same data, input in a
different order may produce different clusters when the number of
data points is small.
4. It is sensitive to initial conditions. Different initial conditions
may produce different clustering results, and the algorithm may be
trapped in a local optimum (see the sketch below).
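A small experiment illustrating weakness 4 (a sketch assuming scikit-learn; with a single random start per run, different seeds may end in different local optima, which shows up as different final objective values):

```python
# Sketch: k-means sensitivity to initialization.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose, overlapping 2-D blobs
X = np.vstack([rng.normal(c, 1.5, size=(50, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])

for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1,
                random_state=seed).fit(X)
    # inertia_ is the final within-cluster sum of squares; if it differs
    # across seeds, the runs ended in different local optima
    print(seed, round(km.inertia_, 1))
```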
Applications of K-Mean Clustering
It is relatively efficient and fast: it computes its result in O(tkn)
time, where n is the number of objects or points, k is the number of
clusters, and t is the number of iterations.
k-means clustering is applied in many machine learning and data
mining tasks.
Used on acoustic data in speech understanding to convert
waveforms into one of k categories (known as vector
quantization); the same idea applies to image segmentation.
Also used for choosing color palettes on old-fashioned graphical
display devices and for image quantization (see the sketch below).
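A sketch of the color-palette use case, assuming scikit-learn and an RGB image supplied as a NumPy array (quantize_colors is a hypothetical helper name):

```python
# Sketch: k-means color quantization (vector quantization on RGB pixels).
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, k=16):
    """Map an (H, W, 3) RGB image onto k representative colors."""
    pixels = image.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    palette = km.cluster_centers_        # the k palette colors (centroids)
    labels = km.predict(pixels)          # nearest palette entry per pixel
    return palette[labels].reshape(image.shape).astype(np.uint8)
```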
CONCLUSION
The k-means algorithm is useful for undirected knowledge
discovery and is relatively simple. K-means has found
widespread use in many fields, ranging from
unsupervised learning in neural networks to pattern
recognition, classification analysis, artificial
intelligence, image processing, machine vision, and many
others.
Real-Life Numerical Example of K-Means Clustering
We have 4 medicines as our training data points, and each
medicine has 2 attributes. Each attribute represents a coordinate of
the object. We have to determine which medicines belong to
cluster 1 and which medicines belong to the other cluster.

Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4
Step 1:
Initial value of centroids: suppose we use medicine A and
medicine B as the first centroids.
Let c1 and c2 denote the coordinates of the centroids;
then c1 = (1,1) and c2 = (2,1).
Objects-centroids distances: we calculate the distance from each cluster centroid to each object.
Let us use the Manhattan distance, ρ(a, b) = |x2 - x1| + |y2 - y1|.
The distance matrix at iteration 0 is filled in the table below:
Object   Distance to centroid 1 (1,1)   Distance to centroid 2 (2,1)   Cluster
(1,1)    0                              1                              Cluster-1
(2,1)    1                              0                              Cluster-2
(4,3)    5                              4                              Cluster-2
(5,4)    7                              6                              Cluster-2
For example, for the point (1, 1):
ρ(point, mean1) = |1 - 1| + |1 - 1| = 0 + 0 = 0
ρ(point, mean2) = |2 - 1| + |1 - 1| = 1 + 0 = 1
So, which cluster should the point (1, 1) be placed in? The one where the point has
the shortest distance to the mean, that is, mean 1 (cluster 1), since that distance is 0.
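The whole iteration-0 table can be reproduced in a few lines (a sketch using the same Manhattan metric ρ and the points from the example):

```python
# Reproducing the iteration-0 distance matrix and assignments.
points = [(1, 1), (2, 1), (4, 3), (5, 4)]   # medicines A, B, C, D
c1, c2 = (1, 1), (2, 1)                     # initial centroids

def rho(p, q):                              # Manhattan distance
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

for p in points:
    d1, d2 = rho(p, c1), rho(p, c2)
    print(p, d1, d2, "Cluster-1" if d1 <= d2 else "Cluster-2")
# Output matches the table: 0/1, 1/0, 5/4, 7/6
```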
Step 2:
Objects clustering : We
assign each object based on
the minimum distance.
Medicine A is assigned to
group 1, medicine B to
group 2, medicine C to
group 2 and medicine D to
group 2.
Each element of the group matrix G below is 1 if and only
if the object is assigned to that group (rows: cluster 1 and
cluster 2; columns: A, B, C, D):
G = [ 1 0 0 0 ; 0 1 1 1 ]
Iteration 1, objects-centroids distances: the next
step is to compute the distance of all objects to the new
centroids.
The clusters after the first assignment are:
Cluster 1: (1, 1)
Cluster 2: (2, 1), (4, 3), (5, 4)
Next, we need to re-compute the new cluster centers (means). We do so by taking the mean of all points in each cluster.
For cluster 1, we only have one point A (1, 1), which was the old mean, so the cluster center remains the same.
For cluster 2, we have ( (2+4+5)/3, (1+3+4)/3 ) = (11/3, 8/3) ≈ (3.67, 2.67).
In iteration 1 we basically repeat the process from iteration 0, this
time using the new means we computed; we then go on to iteration 2,
iteration 3, and so on, until the means do not change anymore.
The distance matrix and assignments at iteration 1 are:

Object   Distance to centroid 1 (1,1)   Distance to centroid 2 (3.67,2.67)   Cluster
(1,1)    0                              4.33                                 Cluster-1
(2,1)    1                              3.33                                 Cluster-1
(4,3)    5                              0.67                                 Cluster-2
(5,4)    7                              2.67                                 Cluster-2
We get the final grouping as the result:

Object       Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A   1                             1                   1
Medicine B   2                             1                   1
Medicine C   4                             3                   2
Medicine D   5                             4                   2
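As a final check, the same grouping can be reproduced with a library implementation (a sketch assuming scikit-learn; its KMeans uses Euclidean distance, which yields the same grouping on this data):

```python
# Sketch: verifying the final grouping of the medicine example.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])   # medicines A, B, C, D
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # A and B share one label; C and D the other
print(km.cluster_centers_)   # (1.5, 1.0) and (4.5, 3.5)
```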