Data Mining Notes UNIT IV
Cluster:
A cluster is a group of objects that belong to the same class. In other words, similar objects
are grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar
objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on
data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
The following points describe the requirements of clustering in data mining −
Scalability − We need highly scalable clustering algorithms to deal with large
databases.
Ability to deal with different kinds of attributes − Algorithms should be capable
of being applied to any kind of data, such as interval-based (numerical) data,
categorical data, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be bounded to
distance measures that tend to find only spherical clusters of small size.
High dimensionality − The clustering algorithm should be able to handle not only
low-dimensional data but also high-dimensional data.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous
data. Some algorithms are sensitive to such data and may lead to poor quality
clusters.
Interpretability − The clustering results should be interpretable, comprehensible,
and usable.
Clustering Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partitions of the data. Each partition represents a cluster, and k ≤ n. That is, the method
classifies the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method creates an
initial partitioning.
It then uses an iterative relocation technique to improve the partitioning by
moving objects from one group to another, as illustrated in the sketch below.
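A minimal sketch of a partitioning method in practice, assuming scikit-learn is available; the small two-dimensional data set and the choice k = 2 below are illustrative, not part of the notes:

```python
# Partition n objects into k clusters (k <= n) with an iterative-relocation method.
import numpy as np
from sklearn.cluster import KMeans

# Six objects in two dimensions (illustrative values).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # each object belongs to exactly one group

print(labels)                    # e.g. [0 0 0 1 1 1]; every group has at least one object
print(kmeans.cluster_centers_)   # one representative (mean) per partition
```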
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters. This continues until each object forms its own cluster or the termination condition
holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-
clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the
given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for
each data point within a given cluster, the neighborhood of a given radius has to contain at
least a minimum number of points.
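As a hedged illustration, the density-based idea can be tried with scikit-learn's DBSCAN, where eps plays the role of the neighborhood radius and min_samples the minimum number of points; the data and parameter values are assumptions chosen for the example:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (illustrative data).
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [5.0, 5.1],
              [9.0, 0.0]])

# eps = neighborhood radius, min_samples = minimum number of points in that neighborhood.
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # the isolated point is labelled -1 (noise), not forced into a cluster
```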
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number
of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized
space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data to a
given model. This method locates clusters by constructing a density function that reflects
the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based
on standard statistics, taking outlier or noise into account. It therefore yields robust
clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user- or application-
oriented constraints. A constraint refers to the user expectation or the properties of the
desired clustering results.
Partitioning Methods in Detail
A partitioning method classifies the data into k groups, which together satisfy the following
requirements
Each group must contain at least one object,
Each object must belong to exactly one group.
It uses an iterative relocation technique that attempts to improve the partitioning by moving
objects from one group to another.
The general criterion of a good partitioning is that objects in the same cluster are “close” or
related to each other, whereas objects of different clusters are “far apart” or very different.
There are various kinds of other criteria for judging the quality of partitions.
Achieving global optimality in partitioning-based clustering would require the exhaustive
enumeration of all possible partitions, which is computationally prohibitive. Most
applications therefore adopt one of two popular heuristic methods:
1) The k-means algorithm, where each cluster is represented by the mean value of the
objects in the cluster.
2) The k-medoids algorithm, where each cluster is represented by one of the objects
located near the center of the cluster.
The heuristic clustering methods work well for finding spherical-shaped clusters in small
to medium databases.
To find clusters with complex shapes and for clustering very large data sets, partitioning
based methods need to be extended.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Global optimum: exhaustively enumerate all possible partitions.
Heuristic methods: the k-means and k-medoids algorithms.
k-means (MacQueen’67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster
First, it randomly selects k of the objects, each of which initially represents a cluster mean
or center.
Each of the remaining objects is assigned to the cluster to which it is most similar, based on
the distance between the object and the cluster mean.
It then computes the new mean for each cluster. This process iterates until the criterion
function converges.
K-Means Algorithm
The k-means algorithm uses the square-error criterion
E = Σ (i = 1..k) Σ (x ∈ Ci) |x − mi|²
Here, E is the sum of the square error for all objects in the data set, x is the point in space
representing a given object, and mi is the mean of cluster Ci (both x and mi are
multidimensional). In other words, for each object in each cluster, the distance from the
object to its cluster center is squared, and the distances are summed.
This criterion tries to make the resulting k clusters as compact and as separate as possible.
Suppose that there is a set of objects located in space, as depicted in the accompanying
figure.
Let k = 3; i.e., the user would like to cluster the objects into three clusters.
According to the algorithm, we arbitrarily choose three objects as the three initial cluster
centers, where the cluster centers are marked by a “+”.
Each object is distributed to the cluster whose center it is nearest to. Such a distribution
forms the silhouettes encircled by dotted curves.
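A minimal from-scratch sketch of these steps in NumPy; the convergence test on the centers and the handling of empty clusters are assumptions added for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random initial centers, assign to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k of the objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the cluster whose center is nearest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the mean of each cluster (keep the old center if a cluster is empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop once the criterion function converges (centers no longer move).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()   # E, the sum of squared errors
    return labels, centers, sse
```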
Advantages Of K-Means
It is relatively scalable and efficient when processing large data sets: the computational
complexity is O(nkt), where n is the number of objects, k is the number of clusters, and t
is the number of iterations (normally k, t << n).
It is simple to implement and understand.
Disadvantages Of K-Means
Applicable only when mean is defined, then what about categorical data?
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable to discover clusters with non-convex shapes.
K-Medoids Clustering
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference
point, a medoid can be used, which is the most centrally located object in a cluster.
The quality of the resulting partitioning is estimated using a cost function that measures
the average dissimilarity between an object and the medoid of its cluster.
Case 1: P currently belongs to medoid Oj. If Oj is replaced by Orandom as a medoid and P is
closest to one of the other medoids Oi (i ≠ j), then P is reassigned to Oi.
Case 2: P currently belongs to medoid Oj. If Oj is replaced by Orandom as a medoid and P is
closest to Orandom, then P is reassigned to Orandom.
Case 3: P currently belongs to a medoid Oi other than Oj. If Oj is replaced by Orandom as a
medoid and P is still closest to Oi, then the assignment does not change.
Case 4: P currently belongs to a medoid Oi other than Oj. If Oj is replaced by Orandom as a
medoid and P is closest to Orandom, then P is reassigned to Orandom.
Which Is More Robust -- K-Means or K-Medoids
The k-medoids method is more robust than k-means in the presence of noise and
outliers because a medoid is less influenced by outliers or other extreme values than a
mean.
However, its processing is more costly than the k-means method. Both methods
require the user to specify k, the number of clusters.
Aside from using the mean or the medoid as a measure of cluster center, other
alternative measures are also commonly used in partitioning clustering methods.
The median can be used, resulting in the k-median method, where the median or
“middle value” is taken for each ordered attribute. Alternatively, in the k-modes method,
the most frequent value for each attribute is used.
1. Agglomerative:
In Agglomerative Hierarchical clustering, we start by treating each data point as a single
cluster and, in every iteration, merge the closest pairs of clusters until all points belong to
one cluster or a termination condition holds.
2. Divisive:
We can say that Divisive Hierarchical clustering is precisely the opposite of
Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we take all of the
data points as a single cluster and, in every iteration, split off the data points that are not
similar to the rest of their cluster. In the end, we are left with N clusters (one per data
point).
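A short sketch of the hierarchical idea using SciPy's clustering utilities (assumed available); the data, linkage method and cut level are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])

# Agglomerative (bottom-up): each point starts in its own cluster and the closest groups merge.
Z = linkage(X, method='average')          # Z records the full merge history (the dendrogram)

# Cutting the tree at a chosen level yields a flat clustering, e.g. two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```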
Grid-Based Clustering
The statistical info of each cell is calculated and stored beforehand and is used to answer
queries.
The parameters of higher-level cells can be easily calculated from the parameters of lower-
level cells:
Count, mean, standard deviation, min, max
Type of distribution—normal, uniform, etc.
For each cell in the current level compute the confidence interval.
When finishing examining the current layer, proceed to the next lower level.
Advantages:
It is query-independent, easy to parallelize, and supports incremental updates.
Query processing time is O(K), where K is the number of grid cells at the lowest level.
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected.
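A tiny NumPy sketch of the grid idea: quantize the two-dimensional object space into cells, keep only per-cell counts, and flag cells above a density threshold; the grid size and threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(200, 2))   # illustrative 2-D objects

# Quantize the object space into a 10 x 10 grid and count the objects falling in each cell.
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)

# Further processing works on the cells rather than the objects, so the cost depends on the
# number of cells; cells above a threshold are treated as dense.
dense_cells = np.argwhere(counts >= 5)
print(len(dense_cells), "dense cells out of", counts.size)
```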
WaveCluster
Input parameters:
Number of grid cells for each dimension.
The wavelet, and the number of applications of the wavelet transform.
Major features:
The time complexity of this method is O(N).
It detects arbitrary shaped clusters at different scales.
It is not sensitive to noise, not sensitive to input order.
It is only applicable to low-dimensional data.
CLIQUE (Clustering In QUEst)
CLIQUE is based on automatically identifying the subspaces of a high-dimensional data
space that allow better clustering than the original space.
Partition the data space and find the number of points that lie inside each cell of the
partition.
Identify the subspaces that contain clusters using the Apriori principle.
Identify clusters:
Determine dense units in all subspaces of interests.
Determine connected dense units in all subspaces of interests.
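A small sketch of the first step only (finding dense one-dimensional units); xi (the number of intervals per dimension) and tau (the density threshold) are illustrative parameter names borrowed from the usual CLIQUE description:

```python
import numpy as np

def dense_units_1d(X, xi=10, tau=15):
    """For each dimension, partition it into xi intervals and return the indices of dense units."""
    dense = {}
    for dim in range(X.shape[1]):
        counts, _ = np.histogram(X[:, dim], bins=xi)
        dense[dim] = np.flatnonzero(counts >= tau)   # units holding at least tau points
    return dense

# Dense 1-D units would next be combined, Apriori-style, into candidate dense units of
# higher-dimensional subspaces, keeping only candidates whose lower projections are all dense.
rng = np.random.default_rng(0)
X = rng.random((300, 4))
print(dense_units_1d(X))
```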
Disadvantages
The accuracy of the clustering result may be degraded at the expense of the
simplicity of the method.
Model-Based Clustering
Model-based clustering method is an attempt to optimize the fit between the data
and some mathematical models.
It includes statistical and AI approaches.
Model-based clustering works on the intuition that gene expression data originates
from a finite mixture of underlying probability distributions (Ramoni et al. 2001).
Each cluster corresponds to a different distribution, and these distributions are
assumed to be Gaussians.
The parameters of each distribution (i.e., cluster) are estimated by maximizing the
likelihood of the expression data (Hogg and Craig 1994).
The k-means clustering method is a special case of model-based clustering,
where all the distributions are assumed to be Gaussians with equal variance.
The parameters are estimated by an EM-style iterative procedure:
1. Randomly generate the parameters (the mean and standard deviation or covariance
matrix) describing each probability distribution (i.e., cluster).
2. Repeat until the parameters of each distribution converge:
For each gene, estimate the probability that the gene's expression pattern was
generated from each of the distributions.
For each distribution, estimate the parameters of the distribution to maximize the
likelihood of the expression data, given the probability that each gene was generated from
the distribution.
3. Assign each gene to the distribution which generates the gene's expression profile
with maximum probability.
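A hedged sketch of this procedure using scikit-learn's GaussianMixture, which fits a finite mixture of Gaussians by EM; the synthetic data and the number of components are assumptions for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # illustrative "expression" vectors
               rng.normal(6.0, 1.0, size=(100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)   # parameters estimated by EM

probs = gm.predict_proba(X)   # probability that each object was generated by each distribution
labels = gm.predict(X)        # assignment to the distribution with maximum probability
print(probs[:3])
```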
Model-based clustering has the advantage of providing the probability that each gene
belongs in each cluster.
However, model-based clustering operates under the assumption that expression data
comes from particular probability distributions, which may not be a reasonable
assumption for many microarray data sets.
Conceptual clustering
Conceptual clustering is a form of clustering in machine learning.
It produces a classification scheme for a set of unlabeled objects and finds
characteristic description for each concept (class).
COBWEB (Fisher’87)
COBWEB is a popular and simple method of incremental conceptual learning.
It creates a hierarchical clustering in the form of a classification tree.
Each node refers to a concept and contains a probabilistic description of that
concept.
Classification Tree
Limitations of COBWEB
The assumption that the attributes are independent of each other is often too strong
because correlation may exist.
It is not suitable for clustering large databases: it can produce a skewed tree and relies
on expensive probability distributions.
Some other methods similar to COBWEB are:
CLASSIT
It is an extension of COBWEB for incremental clustering of continuous data.
It suffers similar problems as COBWEB.
SOM (Self-Organizing Feature Map)
Clustering is also performed by having several units compete for the current
object.
The unit whose weight vector is closest to the current object wins.
The winner and its neighbors learn by having their weights adjusted.
SOMs are believed to resemble processing that can occur in the brain.
Useful for visualizing high-dimensional data in 2-D or 3-D space.
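A minimal NumPy sketch of the competitive update a SOM performs for each input; the grid size, learning rate and neighborhood width are illustrative constants (in practice they decay over time):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 3))                 # input objects (3-dimensional here)
weights = rng.random((10, 10, 3))        # a 10 x 10 grid of units, one weight vector per unit

lr, sigma = 0.5, 2.0
for x in X:
    # The unit whose weight vector is closest to the current object wins.
    d = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(d.argmin(), d.shape)
    # The winner and its neighbors learn: their weights are pulled toward the object.
    ii, jj = np.meshgrid(np.arange(10), np.arange(10), indexing='ij')
    neigh = np.exp(-((ii - wi) ** 2 + (jj - wj) ** 2) / (2 * sigma ** 2))
    weights += lr * neigh[..., None] * (x - weights)
```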
Outliers in Data Mining
An outlier is a data object that deviates significantly from the rest of the data objects
and behaves in a different manner. Outliers can be caused by measurement or execution
errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier cannot simply be dismissed as noise or error. Instead, outliers are suspected of
not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types, namely –
1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers
1. Global Outliers
They are also known as Point Outliers. These are the simplest form of outliers. If, in a
given dataset, a data point strongly deviates from all the rest of the data points, it is known
as a global outlier. Mostly, all of the outlier detection methods are aimed at finding global
outliers.
For example, in an Intrusion Detection System, if a large number of packets are
broadcast in a very short span of time, this may be considered a global outlier, and
we can say that the particular system has potentially been hacked.
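A tiny illustration of flagging a global outlier with a z-score test; the threshold of three standard deviations is a common but arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50.0, 2.0, size=50), 120.0)   # one value far from all the rest
z = (x - x.mean()) / x.std()
print(np.flatnonzero(np.abs(z) > 3))                    # index of the global outlier (the last point)
```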
2. Collective Outliers
As the name suggests, if in a given dataset some of the data points, as a whole,
deviate significantly from the rest of the dataset, they may be termed collective outliers.
Here, the individual data objects may not be outliers, but when seen as a whole, they may
behave as outliers. To detect these types of outliers, we might need background
information about the relationship between those data objects showing the behavior of
outliers.
For example: In an Intrusion Detection System, a DOS (denial-of-service) packet sent
from one computer to another may be considered normal behavior. However, if this
happens to several computers at the same time, it may be considered abnormal behavior,
and as a whole these packets can be termed collective outliers.
3. Contextual Outliers
They are also known as Conditional Outliers. Here, a data object is an outlier if it
deviates significantly from the other data points, but only with respect to a specific context
or condition. A data point may be an outlier under a certain condition and may show normal
behavior under another condition. Therefore, a context has to be specified as part of the
problem statement in order to identify contextual outliers.
Contextual outlier analysis provides flexibility for users where one can examine
outliers in different contexts, which can be highly desirable in many applications. The
attributes of the data point are decided on the basis of both contextual and behavioral
attributes.
For example: A temperature reading of 40°C may behave as an outlier in the context
of a “winter season” but will behave like a normal data point in the context of a “summer
season”.
A low temperature value in June is a contextual outlier because the same value in December
is not an outlier.
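A small pandas sketch of the same idea: the deviation is measured within each context (month), so a value that is unremarkable globally can still be flagged in its own context; the data and the 1.5-standard-deviation threshold are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "month":  ["Dec"] * 5 + ["Jun"] * 5,
    "temp_c": [2, 3, 1, 2, 3, 28, 29, 27, 30, 3],   # 3 C is ordinary in December, odd in June
})

# z-score of each reading within its own month (the context).
grp = df.groupby("month")["temp_c"]
df["z_in_context"] = (df["temp_c"] - grp.transform("mean")) / grp.transform("std")

print(df[df["z_in_context"].abs() > 1.5])            # only the June reading of 3 C is flagged
```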