Data Mining
Lecture-07
Dr. Waqas Haider Khan Bangyal
Data Mining Overview
Data Mining
Data warehouses and OLAP (Online Analytical Processing)
Clustering: Hierarchical and Partitioned approaches
Classification: Decision Trees, ANNs, Bayesian classifiers, and Genetic Algorithms
Association Rules Mining
Advanced topics: outlier detection, web mining
What is a natural grouping among these objects?
Clustering is subjective
[Figure: the same group of people clustered in different ways: Siaad's Family, School Employees, Females, Males]
What is Clustering?
Also called unsupervised learning; it is sometimes called classification by
statisticians, sorting by psychologists, and segmentation by people in marketing.
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly
from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
What is clustering?
Clustering is the classification of objects into different
groups, or more precisely, the partitioning of a data set
into subsets (clusters), so that the data in each subset
(ideally) share some common trait, often according to
some defined distance measure.
What is Clustering?
You can think of this as “unsupervised classification”.
Clustering is alternatively called “grouping”.
Intuitively, we want to assign the same label to
data points that are “close” to each other.
Thus, clustering algorithms rely on a distance metric
between data points.
It is sometimes said that for clustering, the distance
metric is more important than the clustering algorithm.
Idea and Applications
Clustering is the process of grouping a set of physical or
abstract objects into classes of similar objects.
It is also called unsupervised learning.
It is a common and important task that finds many
applications.
Applications in Search engines:
Structuring search results
Suggesting related pages
Automatic directory construction/update
Finding near-identical/duplicate pages
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Desirable Properties of a Clustering Algorithm
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine
input parameters
• Ability to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
• Interpretability and usability
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should
cluster along continental faults
Concepts in Clustering
Defining distance between points
A good clustering is one where:
(Intra-cluster distance) the sum of distances between objects
in the same cluster is minimized,
(Inter-cluster distance) while the distances between different
clusters are maximized.
Objective to minimize: F(Intra, Inter)
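One way to make this objective concrete, as a sketch (an illustrative formalization, not the only possible one; λ here is a hypothetical trade-off weight):

```latex
\mathrm{Intra} = \sum_{j=1}^{K} \sum_{x \in C_j} d(x, \mu_j), \qquad
\mathrm{Inter} = \sum_{j < l} d(\mu_j, \mu_l)
```

A good clustering makes Intra small and Inter large, e.g. by minimizing F(Intra, Inter) = Intra - λ·Inter, where μ_j denotes the centroid of cluster C_j.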
What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Types of clustering
1. Hierarchical algorithms: these find successive clusters
using previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative algorithms
begin with each element as a separate cluster and merge
them into successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin with
the whole set and proceed to divide it into successively
smaller clusters.
2. Partitional clustering: Partitional algorithms determine all
clusters at once (see the sketch after this list). They include:
K-means and derivatives
Fuzzy c-means clustering
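A rough sketch contrasting the two families on a toy data set (assuming SciPy and scikit-learn are available; the four 2-D points are arbitrary):

```python
# Sketch: hierarchical (agglomerative) vs. partitional clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

# Agglomerative ("bottom-up"): start from singletons, merge successively,
# then cut the resulting merge tree to obtain 2 clusters.
tree = linkage(X, method="average")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

# Partitional: determine all 2 clusters at once with k-means.
part_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(hier_labels, part_labels)  # both separate {A, B} from {C, D} here
```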
Classical clustering methods
Partitioning methods
k-Means (and EM), k-Medoids
Hierarchical methods
agglomerative, divisive, BIRCH
Model-based clustering methods
K-MEANS CLUSTERING
Simply speaking, k-means clustering is an algorithm to
classify or group objects, based on attributes/features,
into K groups, where K is a positive integer.
The grouping is done by minimizing the sum of squared
distances between the data points and the corresponding
cluster centroid.
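Written out, the objective k-means minimizes (the within-cluster sum of squared distances, with μ_j the centroid of cluster C_j) is:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```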
Common Distance measures:
Distance measure will determine how the similarity of two
elements is calculated and it will influence the shape of the
clusters.
They include:
1. The Euclidean distance (also called 2-norm distance) is given by:
d(a, b) = sqrt( Σ_i (a_i - b_i)^2 )
2. The Manhattan distance (also called taxicab norm or 1-norm) is
given by:
d(a, b) = Σ_i |a_i - b_i|
Common Distance measures:
3. The maximum norm is given by:
d(a, b) = max_i |a_i - b_i|
4. The Mahalanobis distance corrects data for different scales and
correlations in the variables.
5. Inner product space: The angle between two vectors can be used
as a distance measure when clustering high dimensional data
6. Hamming distance measures the minimum number of
substitutions required to change one member into another (a
special case of edit distance that allows substitutions only).
Sketches of these measures follow below.
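Minimal sketches of these measures, assuming NumPy and points given as equal-length arrays a and b (the function names are just illustrative):

```python
import numpy as np

def euclidean(a, b):      # 2-norm distance
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):      # 1-norm / taxicab distance
    return np.sum(np.abs(a - b))

def max_norm(a, b):       # maximum (infinity) norm
    return np.max(np.abs(a - b))

def angle(a, b):          # angle between vectors, via the inner product
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

def hamming(a, b):        # number of positions where a and b differ
    return int(np.sum(a != b))
```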
K-means
Works when we know k, the number of clusters we want to
find
Idea:
Randomly pick k points as the “centroids” of the k clusters
Loop:
For each point, put the point in the cluster to whose
centroid it is closest
Recompute the cluster centroids
Repeat the loop (until there is no change in clusters between
two consecutive iterations).
Iterative improvement of the objective function:
the sum of squared distances from each point to the centroid of its cluster.
How does the K-Means clustering algorithm work?
Step 1: Begin with a decision on the value of k, the
number of clusters.
Step 2:
Choose any initial partition that classifies the data into
k clusters. You may assign the training samples
randomly, or systematically as follows:
1. Take the first k training samples as single-element
clusters.
2. Assign each of the remaining (N - k) training
samples to the cluster with the nearest centroid. After
each assignment, recompute the centroid of the gaining
cluster.
How does the K-Means clustering algorithm work?
Step 3: Take each sample in sequence and
compute its distance from the centroid of
each of the clusters. If a sample is not
currently in the cluster with the closest
centroid, switch this sample to that cluster
and update the centroid of the cluster
gaining the new sample and the cluster
losing the sample.
Step 4: Repeat Step 3 until convergence is
achieved, that is, until a pass through the
training samples causes no new assignments.
(A from-scratch sketch of these steps follows.)
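A from-scratch sketch of Steps 1-4, assuming NumPy, Euclidean distance, and data X given as an (n, d) array. Note this batch variant recomputes centroids once per pass rather than after every single reassignment as in Step 3:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and pick initial centroids (here: k random points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each sample to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when a full pass changes nothing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Either variant converges to a local minimum of the sum-of-squares objective.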
K-means Overview
An unsupervised clustering algorithm
“K” stands for the number of clusters; it is typically a user
input to the algorithm, though some criteria can be used to
estimate K automatically
It is an approximation to an NP-hard combinatorial
optimization problem
K-means algorithm is iterative in nature
It converges; however, only a local minimum is guaranteed
Works only for numerical data
Easy to implement
Weaknesses of K-Mean Clustering
1. When the number of data points is small, the initial grouping
determines the clusters significantly.
2. The number of clusters, K, must be determined beforehand. The
algorithm also does not yield the same result on every run, since
the resulting clusters depend on the initial random assignments.
3. We never know the true clusters: with the same data, input in a
different order may produce different clusters when the number of
data points is small.
4. It is sensitive to initial conditions. Different initial conditions
may produce different clustering results, and the algorithm may be
trapped in a local optimum (see the sketch below).
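A small experiment illustrating weakness 4 (a sketch assuming scikit-learn; with a single random start per run, different seeds may end in different local optima, which shows up as different final objective values):

```python
# Sketch: k-means sensitivity to initialization.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose, overlapping 2-D blobs
X = np.vstack([rng.normal(c, 1.5, size=(50, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])

for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1,
                random_state=seed).fit(X)
    # inertia_ is the final within-cluster sum of squares; if it differs
    # across seeds, the runs ended in different local optima
    print(seed, round(km.inertia_, 1))
```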
Applications of K-Mean Clustering
It is relatively efficient and fast: it computes its result in O(tkn)
time, where n is the number of objects or points, k is the number of
clusters, and t is the number of iterations.
k-means clustering is applied in many machine learning and data
mining tasks.
Used on acoustic data in speech understanding to convert
waveforms into one of k categories (known as vector
quantization); the same idea applies to image segmentation.
Also used for choosing color palettes on old-fashioned graphical
display devices and for image quantization (see the sketch below).
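A sketch of the color-palette use case, assuming scikit-learn and an RGB image supplied as a NumPy array (quantize_colors is a hypothetical helper name):

```python
# Sketch: k-means color quantization (vector quantization on RGB pixels).
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, k=16):
    """Map an (H, W, 3) RGB image onto k representative colors."""
    pixels = image.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    palette = km.cluster_centers_        # the k palette colors (centroids)
    labels = km.predict(pixels)          # nearest palette entry per pixel
    return palette[labels].reshape(image.shape).astype(np.uint8)
```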
CONCLUSION
The k-means algorithm is useful for undirected knowledge
discovery and is relatively simple. K-means has found
widespread use in many fields, ranging from
unsupervised learning in neural networks to pattern
recognition, classification analysis, artificial
intelligence, image processing, machine vision, and many
others.
Real-Life Numerical Example of K-Means Clustering
We have 4 medicines as our training data points, and each
medicine has 2 attributes. Each attribute represents a coordinate of
the object. We have to determine which medicines belong to
cluster 1 and which medicines belong to the other cluster.

Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4
Step 1:
Initial value of centroids: suppose we use medicine A and
medicine B as the first centroids.
Let c1 and c2 denote the coordinates of the centroids;
then c1 = (1,1) and c2 = (2,1).
Objects-centroids distances: we calculate the distance from each cluster centroid to each object.
Let us use the Manhattan distance, ρ(a, b) = |x2 - x1| + |y2 - y1|.
The distance matrix at iteration 0 is filled in the table below:
Object   Distance to centroid 1 (1,1)   Distance to centroid 2 (2,1)   Cluster
(1,1)    0                              1                              Cluster-1
(2,1)    1                              0                              Cluster-2
(4,3)    5                              4                              Cluster-2
(5,4)    7                              6                              Cluster-2
For example, for the point (1, 1):
ρ(point, mean1) = |1 - 1| + |1 - 1| = 0 + 0 = 0
ρ(point, mean2) = |2 - 1| + |1 - 1| = 1 + 0 = 1
So, which cluster should the point (1, 1) be placed in? The one where the point has
the shortest distance to the mean, that is, mean 1 (cluster 1), since that distance is 0.
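The whole iteration-0 table can be reproduced in a few lines (a sketch using the same Manhattan metric ρ and the points from the example):

```python
# Reproducing the iteration-0 distance matrix and assignments.
points = [(1, 1), (2, 1), (4, 3), (5, 4)]   # medicines A, B, C, D
c1, c2 = (1, 1), (2, 1)                     # initial centroids

def rho(p, q):                              # Manhattan distance
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

for p in points:
    d1, d2 = rho(p, c1), rho(p, c2)
    print(p, d1, d2, "Cluster-1" if d1 <= d2 else "Cluster-2")
# Output matches the table: 0/1, 1/0, 5/4, 7/6
```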
Step 2:
Objects clustering : We
assign each object based on
the minimum distance.
Medicine A is assigned to
group 1, medicine B to
group 2, medicine C to
group 2 and medicine D to
group 2.
Each element of the group matrix G below is 1 if and only
if the object is assigned to that group (rows: cluster 1 and
cluster 2; columns: A, B, C, D):
G = [ 1 0 0 0 ; 0 1 1 1 ]
Iteration 1, objects-centroids distances: the next
step is to compute the distance of all objects to the new
centroids.
The clusters after the first assignment are:
Cluster 1: (1, 1)
Cluster 2: (2, 1), (4, 3), (5, 4)
Next, we need to re-compute the new cluster centers (means). We do so by taking the mean of all points in each cluster.
For cluster 1, we only have one point A (1, 1), which was the old mean, so the cluster center remains the same.
For cluster 2, we have ( (2+4+5)/3, (1+3+4)/3 ) = (11/3, 8/3) ≈ (3.67, 2.67).
In iteration 1 we basically repeat the process from iteration 0, this
time using the new means we computed; we then go on to iteration 2,
iteration 3, and so on, until the means do not change anymore.
The distance matrix and assignments at iteration 1 are:

Object   Distance to centroid 1 (1,1)   Distance to centroid 2 (3.67,2.67)   Cluster
(1,1)    0                              4.33                                 Cluster-1
(2,1)    1                              3.33                                 Cluster-1
(4,3)    5                              0.67                                 Cluster-2
(5,4)    7                              2.67                                 Cluster-2
We get the final grouping as the result:

Object       Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A   1                             1                   1
Medicine B   2                             1                   1
Medicine C   4                             3                   2
Medicine D   5                             4                   2
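As a final check, the same grouping can be reproduced with a library implementation (a sketch assuming scikit-learn; its KMeans uses Euclidean distance, which yields the same grouping on this data):

```python
# Sketch: verifying the final grouping of the medicine example.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])   # medicines A, B, C, D
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # A and B share one label; C and D the other
print(km.cluster_centers_)   # (1.5, 1.0) and (4.5, 3.5)
```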