DM - Topic Four - Part III
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their
geographic and lifestyle-related information.
Find clusters of similar customers.
Measure the clustering quality by observing the buying
patterns of customers in the same cluster vs. those from
different clusters.
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
Approach: Identify frequently occurring terms in each document, form
a similarity measure based on the frequencies of different terms, and
use it to cluster.
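As a rough sketch of this approach in plain Python (names are illustrative; a real system would typically use TF-IDF weights and the cosine measure defined later), the snippet below builds term-frequency vectors and scores two documents by their shared terms:

from collections import Counter

def term_frequencies(text):
    # term-frequency vector of a document (very naive tokenization)
    return Counter(text.lower().split())

def overlap_similarity(tf_a, tf_b):
    # similarity from frequencies of shared terms (cosine is another common choice)
    shared = set(tf_a) & set(tf_b)
    return sum(tf_a[t] * tf_b[t] for t in shared)

d1 = term_frequencies("data mining finds clusters in data")
d2 = term_frequencies("clustering groups similar data objects")
print(overlap_similarity(d1, d2))   # documents sharing more frequent terms score higher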
Clustering: Application 3
Outlier Detection:
Clustering can also be used for outlier detection, where outliers
(values that are “far away” from any cluster) may be more interesting
than common cases.
Applications of outlier detection include the detection of credit
card fraud.
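A minimal sketch of this idea, assuming scikit-learn is available: cluster the data, then flag points that lie unusually far from their nearest cluster centre. The threshold rule and the data are illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.3, (40, 2)),   # one dense cluster
               rng.normal((4, 4), 0.3, (40, 2)),   # another dense cluster
               [[10.0, -5.0]]])                    # an isolated point (potential outlier)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# distance of each point to its assigned cluster centre
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = dist.mean() + 3 * dist.std()           # simple rule of thumb
print(X[dist > threshold])                         # flags the far-away point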
What Is Good Clustering?
Quality
• A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity
• The quality of a clustering result depends on both the similarity
  measure used by the method and its implementation.
• Key requirement of clustering: a good measure of similarity between
  instances, covering
  Interval-scaled variables
  Binary variables
  Nominal variables
  Mixed types
Interval-valued variables and distance functions
These are variables of an object whose values are continuous
measurements, such as height, weight, and age.
d(i, j) = ( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q )^{1/q}

where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two
p-dimensional data objects, and q is a positive integer (the Minkowski distance).

If q = 1, d is the Manhattan distance:

d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is Euclidean distance:
d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }
Basic Properties
d(i,j) ≥ 0
d(i,i) =0
d(i,j) = d(j,i)
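The sketch below is a minimal Python illustration of these distance functions; the function names and sample points are illustrative, not from the lecture.

def minkowski(x, y, q):
    # Minkowski distance between two p-dimensional points x and y
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    return minkowski(x, y, 1)   # q = 1

def euclidean(x, y):
    return minkowski(x, y, 2)   # q = 2

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(manhattan(i, j))   # |1-4| + |2-6| + |3-3| = 7.0
print(euclidean(i, j))   # sqrt(9 + 16 + 0) = 5.0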
Cosine similarity
Measures the similarity between objects (say a document d_j and a
query q, or two documents) as the cosine of the angle between their
term-weight vectors:

sim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|}
            = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}

where the length of d_j is |d_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2}.
Example : Computing Cosine Similarity
• Say we have a query vector Q = (0.4, 0.8) and a document vector D1 = (0.2, 0.7).
• Compute their similarity using the cosine measure.
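A short Python sketch of this computation (function name is illustrative):

import math

def cosine_similarity(x, y):
    # cosine of the angle between two term-weight vectors
    dot = sum(a * b for a, b in zip(x, y))
    length_x = math.sqrt(sum(a * a for a in x))
    length_y = math.sqrt(sum(b * b for b in y))
    return dot / (length_x * length_y)

Q = (0.4, 0.8)
D1 = (0.2, 0.7)
# dot product = 0.4*0.2 + 0.8*0.7 = 0.64; |Q| = sqrt(0.80); |D1| = sqrt(0.53)
print(round(cosine_similarity(Q, D1), 2))   # about 0.98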
Binary Variables
If all the binary-valued attributes have the same weight, we can
construct a 2-by-2 contingency table for any two objects i and j as
shown below.

A contingency table for binary data:

                      Object j
                      1        0        sum
Object i   1          a        b        a + b
           0          c        d        c + d
           sum        a + c    b + d    p

where
a is the number of attributes with value 1 in both objects,
b is the number of attributes with value 1 in object i and 0 in object j,
c is the number of attributes with value 0 in object i and 1 in object j,
d is the number of attributes with value 0 in both objects.

Jaccard coefficient: used when the two values are not equally important
(asymmetric), e.g., smoker = yes (coded 1) is more informative than
smoker = no (coded 0):

d(i, j) = \frac{b + c}{a + b + c}
Dissimilarity between Binary Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Jack vs. Mary:
                   Mary
                   1      0      sum
Jack      1        3      1      4
          0        1      2      3
          sum      4      3      7

Jack vs. Jim:
                   Jim
                   1      0      sum
Jack      1        3      1      4
          0        1      2      3
          sum      4      3      7

Mary vs. Jim:
                   Jim
                   1      0      sum
Mary      1        2      2      4
          0        2      1      3
          sum      4      3      7

Applying d(i, j) = \frac{b + c}{a + b + c} to each pair of contingency tables:

d(jack, mary) = \frac{1 + 1}{3 + 1 + 1} = 0.4

d(jack, jim) = \frac{1 + 1}{3 + 1 + 1} = 0.4

d(jim, mary) = \frac{2 + 2}{2 + 2 + 2} \approx 0.67
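As a quick check, a minimal Python sketch that plugs the (a, b, c) counts from the tables above into the formula (the function name is illustrative):

def jaccard_dissimilarity(a, b, c):
    # asymmetric binary (Jaccard-based) dissimilarity d = (b + c) / (a + b + c)
    return (b + c) / (a + b + c)

# (a, b, c) counts taken from the contingency tables above
print(jaccard_dissimilarity(3, 1, 1))  # Jack vs. Mary -> 0.4
print(jaccard_dissimilarity(3, 1, 1))  # Jack vs. Jim  -> 0.4
print(jaccard_dissimilarity(2, 2, 2))  # Jim vs. Mary  -> 0.666...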
Nominal Variables
A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
d(i, j) = \frac{p - m}{p}
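A minimal Python sketch of simple matching, with illustrative attribute values:

def simple_matching_dissimilarity(x, y):
    # d(i, j) = (p - m) / p, where m = number of matching attributes
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
print(simple_matching_dissimilarity(obj_i, obj_j))  # (3 - 2) / 3 = 0.33...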
DIANA (Divisive Analysis)
Divisive: it is a top-down clustering technique.
Start with all sample units in a single cluster of size n.
Then, at each step of the algorithm, a cluster is partitioned into a
pair of daughter clusters, selected to maximize the distance between
the two daughters.
The algorithm stops when the sample units are partitioned into n
clusters of size 1.
Introduced in Kaufman and Rousseeuw (1990).
Thus it proceeds in the inverse order of AGNES (Agglomerative Nesting).
[Figure: three scatter plots (axes 0 to 10) showing DIANA splitting the data into progressively more clusters]
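DIANA's actual splitting rule is based on average dissimilarities within a cluster; as a simplified illustration of the top-down idea only (not the DIANA rule itself), the sketch below, assuming scikit-learn, repeatedly bisects the largest remaining cluster with 2-means:

import numpy as np
from sklearn.cluster import KMeans

def divisive_bisection(X, n_clusters):
    # top-down sketch: keep splitting the largest remaining cluster with 2-means
    clusters = [np.arange(len(X))]                 # start with one cluster of all points
    while len(clusters) < n_clusters:
        splittable = [k for k in range(len(clusters)) if len(clusters[k]) > 1]
        if not splittable:
            break
        idx = max(splittable, key=lambda k: len(clusters[k]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1], [8.8, 1.2]])
for c in divisive_bisection(X, 3):
    print(X[c])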
Comparing Clusterings Using SSE
o Given two clusterings, we can choose the one with the smallest error (SSE)
o One easy way to reduce SSE is to increase K, the number of clusters
o But do not forget that a good clustering with a smaller K can have a lower SSE than a poor
clustering with a higher K
Internal Measures: SSE
o Internal Index: Used to measure the goodness of a clustering structure
without respect to external information
o SSE (sum of squared errors): SSE = \sum_{k} \sum_{x \in C_k} dist(x, m_k)^2, where m_k is the
centroid (representative point) of cluster C_k
o SSE is good for comparing two clusterings or two clusters (average SSE).
o Can also be used to estimate the number of clusters
[Figure: SSE plotted against the number of clusters K, for K from 2 to 30]
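A short sketch of using SSE to estimate the number of clusters, assuming scikit-learn is available; KMeans reports the SSE of a fitted clustering as its inertia_ attribute, and the data here is synthetic:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# synthetic data: three well-separated blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ((0, 0), (5, 5), (9, 0))])

# sweep K and record the SSE (KMeans.inertia_) of each clustering
for k in range(2, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
# SSE drops sharply up to K = 3 (the true number of blobs) and only slowly after,
# which is the "elbow" used to estimate the number of clusters.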
Review questions
o What makes clustering challenging?
o What is good clustering?
o Explain SSE
o Describe the basic agglomerative clustering algorithm
o Explain the key concepts in clustering
Review questions
o What is the key issue in clustering, and what makes it challenging?
o How do you know that a given clustering is good?
o How does SSE work?
o What does unsupervised learning mean?
o Describe the basic agglomerative clustering algorithm
o Explain the data formats used in clustering
Thank you