Datamining Mod3
DATA MINING
MODULE-3
Introduction to Clustering:
● Imagine that a company wants to organize all of its customers into groups, so that a separate marketing strategy can be developed to target each group, based on common features shared by the customers per group.
● Here, the class label (or group ID) of each customer is unknown.
● We have to discover these groupings.
● Given a large number of customers and many attributes describing customer profiles, it can be very costly or even
infeasible to have a human study the data and manually come up with a way to partition the customers into
strategic groups. An appropriate tool must be used.
CLUSTERING:
● Clustering is the process of grouping a set of data objects into multiple groups or
clusters so that objects within a cluster have high similarity, but are very
dissimilar to objects in other clusters.
● Cluster analysis is the process of partitioning a set of data objects into subsets.
○ Each subset is a cluster, such that objects in a cluster are similar to one another,
yet dissimilar to objects in other clusters.
○ The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
● Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, outlier detection, web search, biology, and
security.
CLUSTERING PARADIGMS:
● There are many clustering approaches.
● It is difficult to provide a crisp categorization of clustering methods because these
categories may overlap so that a method may have features from several categories.
● In general, the major fundamental clustering methods can be classified into the
following categories:
Partitioning methods:
● Given a set of n objects, a partitioning method constructs k (k ≤ n) partitions of the
data, where each partition represents a cluster and each cluster contains at least one object.
● Most partitioning methods are distance-based: each object is placed in the cluster whose
representative it is closest to.
● Typical partitioning methods are k-means and k-medoids, discussed later in this module.
Hierarchical methods:
● A hierarchical method creates a hierarchical decomposition of the given set of data
objects.
● Hierarchical clustering methods can be distance-based or density- and continuity-based.
● Based on how the hierarchical decomposition is formed, a hierarchical method can be
classified as;
○ Agglomerative approaches.
○ Divisive approaches.
Agglomerative approach:
● Also called the bottom-up approach.
● Starts with each object forming a separate group.
● It successively merges the objects or groups close to one another, until all the groups
are merged into one (the topmost level of the hierarchy), or a termination condition
holds.
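The bottom-up merging described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the Euclidean metric, and the use of single linkage (distance between the closest pair of members) are my assumptions, not choices stated in the notes.

```python
# Minimal agglomerative (bottom-up) clustering sketch using single linkage.
# Assumptions: Euclidean distance, single-linkage merge criterion.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k):
    """Repeatedly merge the two closest groups until only k groups remain."""
    clusters = [[p] for p in points]  # start: each object forms its own group
    while len(clusters) > k:
        best, best_d = (0, 1), float("inf")
        # find the pair of clusters with the smallest single-linkage distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(p, q) for p in clusters[i] for q in clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair of groups
        del clusters[j]
    return clusters

clusters = agglomerative([(1, 2), (2, 2), (8, 9), (9, 9)], k=2)
```

Here the merging stops when k = 2 groups remain; a real implementation could instead stop on any termination condition, as the notes mention.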
Divisive approach:
● Also called top-down approach.
● Starts with all the objects in the same cluster.
● In each successive iteration, a cluster is split into smaller clusters, until each object
forms its own cluster, or a termination condition holds.
Drawback:
● Once a step (merge or split) is done, it can never be undone.
● This rigidity is useful in that it leads to smaller computation costs by not having to worry
about a combinatorial number of different choices.
● Cannot correct erroneous decisions.
Density-based methods:
● Most partitioning methods cluster objects based on the distance between objects.
● Such methods can find only spherical-shaped clusters and encounter difficulty in discovering
clusters of arbitrary shapes.
● Other clustering methods have been developed based on the notion of density.
● Their general idea is to continue growing a given cluster as long as the density (number
of objects or data points) in the “neighborhood” exceeds some threshold.
○ For example, for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.
○ Such a method can be used to filter out noise or outliers and discover clusters of
arbitrary shape.
● Density-based methods can divide a set of objects into multiple exclusive clusters, or a
hierarchy of clusters.
● Typically, density-based methods consider exclusive clusters only, and do not consider fuzzy
clusters.
● E.g., density-based clustering methods: DBSCAN, OPTICS, DENCLUE.
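The "grow a cluster while the neighborhood is dense enough" idea can be sketched as a minimal DBSCAN-style procedure. This is a simplified sketch, not the full DBSCAN algorithm; the parameter values, helper names, and label conventions below are my own choices.

```python
# DBSCAN-style sketch: a cluster keeps growing as long as each point's
# eps-neighborhood contains at least min_pts points (including itself).
import math

def neighbors(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1  # neighborhood not dense enough: mark as noise
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_seeds = neighbors(points, j, eps)
            if len(j_seeds) >= min_pts:  # j is itself dense: keep growing
                queue.extend(j_seeds)
        cluster_id += 1
    return labels

# two dense groups plus one isolated outlier (filtered out as noise)
labels = dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)],
                eps=1.5, min_pts=3)
```

Note how the isolated point (5, 5) receives the noise label, illustrating the claim above that density-based methods can filter out outliers.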
Grid-based methods:
● Grid-based methods quantize the object space into a finite number of cells that form a grid
structure.
● All the clustering operations are performed on the grid structure (i.e., on the quantized
space).
● The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in
each dimension in the quantized space.
● Using grids is often an efficient approach to many spatial data mining problems, including
clustering.
● Therefore, grid-based methods can be integrated with other clustering methods such as
density-based methods and hierarchical methods.
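The quantization step that grid-based methods rely on can be sketched as follows. The cell size and helper names are illustrative assumptions; the point is that after this step, clustering operations can work on the (few) cells instead of the (many) objects.

```python
# Grid-quantization sketch: map each object to the cell it falls into,
# so later clustering work only touches cells, not individual objects.
from collections import defaultdict

def quantize(points, cell_size):
    """Group objects by the grid cell containing them."""
    grid = defaultdict(list)
    for p in points:
        cell = tuple(int(coord // cell_size) for coord in p)
        grid[cell].append(p)
    return grid

grid = quantize([(0.2, 0.4), (0.3, 0.1), (2.5, 2.7)], cell_size=1.0)
# (0.2, 0.4) and (0.3, 0.1) share cell (0, 0); (2.5, 2.7) falls in cell (2, 2)
```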
Note:
● Most of the clustering algorithms integrate the ideas of several clustering methods, so that it is
sometimes difficult to classify a given algorithm as uniquely belonging to only one clustering method
category.
● Furthermore, some applications may have clustering criteria that require the integration of several
clustering techniques.
DISTANCE MEASURES IN CLUSTER ANALYSIS:
● Consider two n-dimensional data objects i = (xi1, xi2, ..., xin) and j = (xj1, xj2, ..., xjn).
● Euclidean distance between i and j is defined as:
      d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + ... + (xin − xjn)²)
● Manhattan (or city block) distance, defined as:
      d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xin − xjn|
● Minkowski distance, a generalization of both, defined as:
      d(i, j) = (|xi1 − xj1|^p + |xi2 − xj2|^p + ... + |xin − xjn|^p)^(1/p)
      (p = 2 gives the Euclidean distance; p = 1 gives the Manhattan distance.)
1. Let x1 = (1, 2) and x2 = (3, 5) represent two objects. Calculate the Euclidean and
Manhattan distance.
2. Given 5-dimensional samples A (1, 0, 2, 5, 3) and B (2, 1, 0, 3, -1), find the Euclidean,
Manhattan and Minkowski distance (given p = 3, for Minkowski distance).
Solution (1):
      Euclidean distance = √((1 − 3)² + (2 − 5)²) = √(4 + 9) = √13 = 3.61
      Manhattan distance = |1 − 3| + |2 − 5| = 2 + 3 = 5
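The worked examples above can be checked in code. The formulas follow the definitions given earlier; the function and variable names are my own.

```python
# Distance measures from the definitions above; Euclidean and Manhattan
# are the p = 2 and p = 1 special cases of the Minkowski distance.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Problem 1: x1 = (1, 2), x2 = (3, 5)
print(round(euclidean((1, 2), (3, 5)), 2))  # 3.61
print(manhattan((1, 2), (3, 5)))            # 5

# Problem 2: A = (1, 0, 2, 5, 3), B = (2, 1, 0, 3, -1)
A, B = (1, 0, 2, 5, 3), (2, 1, 0, 3, -1)
print(round(euclidean(A, B), 2))            # 5.1  (= sqrt(26))
print(manhattan(A, B))                      # 10
print(round(minkowski(A, B, 3), 2))         # 4.34 (= cube root of 82)
```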
PARTITIONING METHODS
● Given a data set of n objects, a partitioning method distributes the objects into k clusters
so that objects within a cluster are "similar" to one another, while
objects of different clusters are "dissimilar" in terms of the data set attributes.
● The most well-known and commonly used partitioning methods are:
○ k-means
○ k-medoids
● Variants of k-medoids:
■ PAM (Partitioning Around Medoids)
■ CLARA (Clustering Large Applications)
■ CLARANS (Clustering Large Applications based upon Randomized Search)
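As a sketch of how a partitioning method operates, the classic k-means loop alternates between assigning each object to its nearest center and recomputing each center as the mean of its group. The fixed initial centers and iteration count below are simplifying assumptions (real implementations choose initial centers randomly and iterate until assignments stop changing).

```python
# Minimal k-means sketch: assign objects to the nearest center, then
# recompute each center as the mean of its assigned objects; repeat.
import math

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # assignment step: each object joins its nearest center's group
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda idx: math.dist(p, centers[idx]))
            groups[nearest].append(p)
        # update step: each center moves to the mean of its group
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

centers, groups = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)],
                         centers=[(0, 0), (10, 10)])
```

On this tiny data set the centers converge to (1.0, 1.5) and (8.5, 8.0), the means of the two obvious groups.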
PAM:
● PAM was one of the first k-medoids algorithms introduced.
● It attempts to determine k partitions for n objects.
● Initially, we’ll select k representative objects randomly.
● After that, the algorithm repeatedly tries to make a better choice of cluster
representatives.
● All of the possible pairs of objects are analyzed, where one object in each pair is
considered a representative object and the other is not.
● The quality of the resulting clustering is calculated for each such combination.
● An object, oj, is replaced with the object causing the greatest reduction in error.
● The set of best objects for each cluster in one iteration forms the representative
objects for the next iteration.
● The final set of representative objects are the respective medoids of the clusters.
Drawback of PAM method:
● PAM does not scale well to large data sets: to decide whether a non-representative object,
orandom, is a good replacement for a current representative object, oj, a cost must be
computed with respect to each of the non-representative objects, p.
● Each time a reassignment occurs, a difference in absolute error, E, is contributed to the cost
function.
○ Therefore, the cost function calculates the difference in absolute-error value if a
current representative object is replaced by a nonrepresentative object.
● The total cost of swapping is the sum of costs incurred by all non-representative objects.
● If the total cost is negative, then oj is replaced or swapped with orandom, since the actual
absolute error E would be reduced.
● If the total cost is positive, the current representative object, oj, is considered acceptable,
and nothing is changed in the iteration.
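The swap test described above can be sketched directly: compute the absolute error E before and after replacing a medoid, and keep the swap only when the change (the total cost) is negative. The helper names and the toy data are my own; this is an illustration of the cost calculation, not the full PAM algorithm.

```python
# PAM swap-cost sketch: total cost of replacing medoid oj with candidate
# orandom is the change in absolute error E; a swap with negative cost
# improves the clustering and would be kept.
import math

def absolute_error(points, medoids):
    """E = sum over all objects of the distance to their nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, oj, orandom):
    """Change in E if medoid oj is replaced by non-medoid orandom."""
    new_medoids = [orandom if m == oj else m for m in medoids]
    return absolute_error(points, new_medoids) - absolute_error(points, medoids)

points = [(1, 1), (2, 2), (8, 8), (9, 9)]
medoids = [(1, 1), (9, 9)]
# Moving the second medoid to (2, 2) crowds both medoids into one group,
# so the absolute error grows and the swap is rejected (positive cost).
cost = swap_cost(points, medoids, (9, 9), (2, 2))
```

By contrast, swapping (9, 9) for (8, 8) leaves the error unchanged by symmetry, so that swap would also leave the clustering as-is.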