Data Mining Unit-4
Lecture Notes
---------------------------------------------------------------------------------------------------------------
Clustering and Applications: Cluster Analysis – Types of data in cluster analysis –
Categorization of Major Clustering Methods – Partitioning Methods, Hierarchical Methods-
Density based Methods, Grid based Methods, Outlier Analysis.
What is Cluster Analysis?
Clustering splits data into several subsets, each containing data objects that are similar
to one another; these subsets are called clusters. Once the data from our customer base
is divided into clusters, we can make an informed decision about who we think is best
suited for this product.
Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
In business intelligence, clustering can be used to organize a large number of
customers into groups, where customers within a group share strongly similar
characteristics.
We now look at the types of data that often occur in cluster analysis and how to
preprocess them for such analysis.
Suppose that a data set to be clustered contains n objects, which may represent persons,
houses, documents, countries, and so on.
Main memory-based clustering algorithms typically operate on either of the following two
data structures.
The two types of data structures in cluster analysis are:
Data Matrix (or object by variable structure)
Dissimilarity Matrix (or object by object structure)
Data Matrix
This represents n objects, such as persons, with p variables such as age, height, weight,
gender, race, and so on. The structure is in the form of a relational table, or n-by-p matrix
(n objects × p variables).
The data matrix is often called a two-mode matrix, since its rows and columns
represent different entities.
Dissimilarity Matrix
It is often represented by an n-by-n table, where d(i, j) is the measured difference or
dissimilarity between objects i and j. In general, d(i, j) is a non-negative number that is
close to 0 when objects i and j are highly similar or "near" each other, and becomes larger
the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, the matrix is symmetric and only
the entries below the diagonal need to be stored.
This is also called a one-mode matrix, since its rows and columns represent the same
entity.
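As a small illustrative sketch (not part of the original notes), the following Python snippet builds a dissimilarity matrix from a data matrix using Euclidean distance; the data values are made up.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Build an n-by-n dissimilarity matrix from an n-by-p data matrix X
    using Euclidean distance. Note d[i, j] = d[j, i] and d[i, i] = 0."""
    n = X.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = np.linalg.norm(X[i] - X[j])
    return d

# Example: 4 objects described by 2 variables (say, height in m and weight in kg)
X = np.array([[1.70, 65.0], [1.80, 80.0], [1.65, 60.0], [1.75, 72.0]])
print(dissimilarity_matrix(X))
```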
Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a roughly linear scale.
Typical examples include weight and height, latitude and longitude coordinates (e.g.,
when clustering houses), and weather temperature.
The measurement unit used can affect the clustering analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for
weight, may lead to a very different clustering structure.
To help avoid dependence on the choice of measurement units, the data should be
standardized. Standardizing measurements attempts to give all variables an equal
weight.
This is especially useful when given no prior knowledge of the data. However, in
some applications, users may intentionally want to give more weight to a certain set
of variables than to others.
For example, when clustering basketball player candidates, we may prefer to give
more weight to the variable height.
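One common way to standardize is sketched below (a sketch assuming the mean-absolute-deviation approach; the notes do not fix a specific formula, and a z-score variant using the standard deviation works similarly):

```python
import numpy as np

def standardize(X):
    """Standardize each column (variable) of the n-by-p data matrix X:
    z_if = (x_if - m_f) / s_f, where m_f is the mean of variable f and
    s_f is its mean absolute deviation (more robust to outliers than
    the standard deviation)."""
    m = X.mean(axis=0)              # per-variable mean m_f
    s = np.abs(X - m).mean(axis=0)  # per-variable mean absolute deviation s_f
    return (X - m) / s

heights_weights = np.array([[1.70, 65.0], [1.80, 80.0], [1.65, 60.0]])
print(standardize(heights_weights))  # unit-free values with equal weight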
Binary Variables
A binary variable is a variable that can take only two values.
For example, a gender variable can generally take the two values male and female.
Contingency Table for Binary Data
Let the two binary values be 0 and 1. For two objects i and j described by p binary
variables, the counts are summarized in the 2-by-2 contingency table:

                 object j: 1   object j: 0    sum
object i: 1          q             r         q + r
object i: 0          s             t         s + t
sum                q + s         r + t         p

Here q is the number of variables that equal 1 for both objects, t is the number that
equal 0 for both, and r and s count the variables on which the two objects disagree.
Sub-types of binary variables
Symmetric binary: both states are equally valuable and carry the same weight, e.g.,
male or female.
Asymmetric binary: the two states are not equally important, e.g., the positive and
negative outcomes of a COVID test, where the (rarer) positive outcome is the more
significant one.
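A brief sketch of how the contingency-table counts q, r, s, t translate into dissimilarities: the symmetric case counts all mismatches over all variables, while the asymmetric case ignores the negative matches t (the example symptom vectors are invented):

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity between two objects described by 0/1 variables.
    q = both 1, r = x=1 y=0, s = x=0 y=1, t = both 0."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    if symmetric:                    # symmetric: d = (r+s)/(q+r+s+t)
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)     # asymmetric: negative matches t ignored

# e.g., two patients compared over five yes/no symptoms
print(binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0], symmetric=False))
```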
Ordinal Variables
An ordinal variable has a meaningful order among its M_f states. Such variables can be
treated like interval-scaled variables by mapping the range of each variable onto
[0.0, 1.0], replacing the rank r_if of the i-th object in the f-th variable by

z_if = (r_if - 1) / (M_f - 1)
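A tiny sketch of this rank normalization (the three-state example is made up):

```python
def normalize_rank(rank, num_states):
    """Map rank r_if in {1, ..., M_f} onto [0.0, 1.0]:
    z_if = (r_if - 1) / (M_f - 1)."""
    return (rank - 1) / (num_states - 1)

# e.g., a variable with ordered states bronze < silver < gold (M_f = 3)
print([normalize_rank(r, 3) for r in (1, 2, 3)])  # [0.0, 0.5, 1.0]
```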
Partitioning Methods:
The simplest and most fundamental version of cluster analysis is partitioning, which
organizes the objects of a set into several exclusive groups or clusters. To keep the
problem specification concise, we can assume that the number of clusters is given as
background knowledge. This parameter is the starting point for partitioning methods.
Formally, given a data set, D, of n objects, and k, the number of clusters to form, a
partitioning algorithm organizes the objects into k partitions (k ≤ n), where each
partition represents a cluster.
k-Means: A Centroid-Based Technique
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods
distribute the objects in D into k clusters, C1, ..., Ck.
An objective function is used to assess the partitioning quality so that objects within a
cluster are similar to one another but dissimilar to objects in other clusters. That is, the
objective function aims for high intra-cluster similarity and low inter-cluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, Ci , to represent
that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can
be defined in various ways such as by the mean or medoid of the objects (or points)
assigned to the cluster.
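As a hedged sketch of the centroid-based idea (not the notes' own pseudocode), here is a bare-bones k-means that alternates between assigning points to the nearest centroid and recomputing centroids as cluster means, thereby reducing the within-cluster sum of squared distances; the data and k are illustrative:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Bare-bones k-means: assign each point to the nearest centroid, then
    recompute each centroid as the mean of its assigned points. Each pass
    lowers the within-cluster sum of squared distances (the objective).
    Note: this sketch does not handle the empty-cluster corner case."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(max_iter):
        # distances[i, j] = Euclidean distance from point i to centroid j
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)                     # nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):             # converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.5],   # one dense group
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.5]])  # another dense group
labels, centroids = k_means(X, k=2)
print(labels)     # cluster id per point
print(centroids)  # the two cluster centers
```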
Hierarchical Methods
Agglomerative Hierarchical Clustering
Agglomerative (bottom-up) clustering starts with every data point as its own cluster and
repeatedly merges the closest pair:
Step 1: Treat each data point as an individual cluster.
Step 2: Compute the distance (proximity) matrix between the clusters.
Step 3: Merge the two closest clusters.
Step 4: Update the distance matrix to reflect the merge.
Step 5: Repeat step 3 and step 4 until you get a single cluster.
Divisive Hierarchical Clustering
Divisive hierarchical clustering is exactly the opposite of Agglomerative Hierarchical
clustering.
In divisive hierarchical clustering, all the data points start out in one single cluster,
and in every iteration the data points that are not similar are separated from the
cluster.
The separated data points are treated as an individual cluster.
Finally, we are left with N clusters.
Example:
Now, to merge clusters, we take the least value in the distance matrix, i.e., 5, which
corresponds to objects p2 and p4.
We then need to calculate the distance between p1 and the merged cluster [p2, p4].
Again we take the least value in the updated matrix, i.e., 9, so we join p1 with the
cluster [p2, p4].
The final clustering is visualized as a dendrogram.
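To reproduce this kind of merge sequence programmatically, here is a sketch using SciPy's single-linkage clustering; the point coordinates are invented, so the merge distances will differ from the example's matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical coordinates for four objects p1..p4.
points = np.array([[0.0, 0.0], [4.0, 3.0], [9.0, 1.0], [4.5, 5.5]])

# Single linkage: at each step, merge the two clusters whose closest members
# are nearest -- exactly the "take the least value in the matrix" rule above.
Z = linkage(points, method="single")
print(Z)          # each row: (cluster a, cluster b, merge distance, new size)
# dendrogram(Z)   # with matplotlib available, draws the merge tree
```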
Advantages of Hierarchical clustering
It is simple to implement and gives the best output in some cases.
It produces a hierarchy, a structure that contains more information than a flat set of clusters.
It does not need us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
It breaks the large clusters.
It is Difficult to handle different sized clusters and convex shapes.
It is sensitive to noise and outliers.
Once a merge or split has been performed, it can never be undone or adjusted later.
Density-Based Methods
Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
They have difficulty finding clusters of arbitrary shape, such as "S"-shaped and oval
clusters. Given such data, they would likely inaccurately identify convex regions, where
noise or outliers are included in the clusters.
To find clusters of arbitrary shape, we can alternatively model clusters as dense regions
in the data space, separated by sparse regions.
This is the main strategy behind density-based clustering methods, which can discover
clusters of nonspherical shape.
We study density-based clustering through DBSCAN (Density-Based Spatial Clustering
of Applications with Noise). DBSCAN grows clusters from core objects, that is, objects
that have at least a minimum number of neighbors (MinPts) within a given radius (ε);
objects not reachable from any core object are treated as noise.
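A minimal usage sketch with scikit-learn's DBSCAN implementation (assuming scikit-learn is available; the eps and min_samples values and the sample points are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point (noise)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [4.0, 15.0]])

# eps: neighborhood radius; min_samples: minimum points within eps
# for a point to count as a "core" point of a dense region.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # cluster ids per point; -1 marks noise/outlier points
```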
Grid-Based Methods
We study grid-based clustering using several interesting methods:
STING: explores statistical information stored in the grid cells.
CLIQUE: represents a grid- and density-based approach for subspace clustering in a
high-dimensional data space.
STING (STatistical INformation Grid)
STING is a grid-based multiresolution clustering technique in which the embedding
spatial area of the input objects is divided into rectangular cells. The space can be
divided in a hierarchical and recursive way.
Several levels of such rectangular cells correspond to different levels of resolution and
form a hierarchical structure:
Each cell at a high level is partitioned to form a number of cells at the next lower
level.
Statistical information regarding the attributes in each grid cell, such as the mean,
maximum, and minimum values, is precomputed and stored as statistical parameters.
The figure shows a hierarchical structure for STING clustering. The statistical parameters of
higher-level cells can easily be computed from the parameters of the lower-level cells.
These parameters include the following: the attribute-independent parameter, count, and
the attribute-dependent parameters mean, stdev (standard deviation), min (minimum),
max (maximum), and the type of distribution that the attribute value in the cell follows,
such as normal, uniform, exponential, or none.
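A hedged sketch of how a higher-level cell's parameters can be rolled up from its child cells (the Cell fields mirror the parameters named above; the numbers are invented):

```python
from dataclasses import dataclass

@dataclass
class Cell:
    count: int       # attribute-independent parameter
    mean: float      # attribute-dependent parameters follow
    minimum: float
    maximum: float

def roll_up(children):
    """Compute a higher-level cell's parameters from its lower-level cells."""
    total = sum(c.count for c in children)
    return Cell(
        count=total,
        mean=sum(c.mean * c.count for c in children) / total,  # weighted mean
        minimum=min(c.minimum for c in children),
        maximum=max(c.maximum for c in children),
    )

kids = [Cell(10, 20.0, 5.0, 40.0), Cell(30, 24.0, 3.0, 55.0)]
print(roll_up(kids))  # parent cell aggregated from its two children
```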
“How is this statistical information useful for query answering?” The statistical parameters
can be used in a top-down, grid-based manner as follows. First, a layer within the
hierarchical structure is determined from which the query-answering process is to start. This
layer typically contains a small number of cells. For each cell in the current layer, we
compute the confidence interval (or estimated probability range) reflecting the cell’s
relevancy to the given query. The irrelevant cells are removed from further consideration.
Processing of the next lower level examines only the remaining relevant cells. This process
is repeated until the bottom layer is reached. At this time, if the query specification is met,
the regions of relevant cells that satisfy the query are returned. Otherwise, the data that fall
into the relevant cells are retrieved and further processed until they meet the query’s
requirements.
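A schematic sketch of this top-down loop (the is_relevant test stands in for the real confidence-interval computation, and a uniform-depth grid is assumed):

```python
def answer_query(top_cells, is_relevant, children_of):
    """Top-down STING query processing over a uniform-depth grid hierarchy:
    at each level, keep only the relevant cells and examine only their
    children at the next lower level."""
    layer = [c for c in top_cells if is_relevant(c)]
    # Descend while the surviving cells still have children, i.e., while
    # the bottom layer of the hierarchy has not been reached.
    while layer and children_of(layer[0]):
        layer = [child for cell in layer
                 for child in children_of(cell) if is_relevant(child)]
    return layer  # regions of relevant bottom-level cells
```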
“What advantages does STING offer over other clustering methods?” STING offers
several advantages:
The grid-based computation is query-independent, because the statistical information
stored in each cell summarizes the data in the cell independently of any query.
The grid structure facilitates parallel processing and incremental updating.
The method is efficient: STING scans the database once to compute the statistical
parameters of the cells, so generating clusters takes O(n) time, where n is the number of
objects; after the hierarchy is built, query processing takes O(g) time, where g is the
number of grid cells at the lowest level (usually g << n).
Disadvantage:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is
detected.
CLIQUE
CLIQUE (CLustering In QUEst) is a grid- and density-based method for finding
clusters in subspaces of a high-dimensional data space.
For example, consider a health informatics application where patient records contain
attributes describing personal information, numerous symptoms, conditions, and
family history. For bird flu patients, for instance, the age, gender, and job attributes may
vary dramatically within a wide range of values.
Thus, it can be difficult to find such a cluster within the entire data space. Instead, by
searching in subspaces, we may find a cluster of similar patients in a lower-dimensional
space (e.g., patients who are similar to one another with respect to symptoms like high
fever and cough but no runny nose, and aged between 3 and 16).
Outlier Analysis
An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.
Outliers are different from noisy data. Noise is a random error or variance in a measured
variable. In general, noise is not interesting in data analysis, including outlier detection. For
example, in credit card fraud detection, a customer’s purchase behavior can be modeled as a
random variable. A customer may generate some “noise transactions” that may seem like
“random errors” or “variance,” such as by buying a bigger lunch one day, or having one
more cup of coffee than usual. Such transactions should not be treated as outliers; otherwise,
the credit card company would incur heavy costs from verifying that many transactions. The
company may also lose customers by bothering them with multiple false alarms. As in many
other data analysis and data mining tasks, noise should be removed before outlier detection.
Outliers are interesting because they are suspected of not being generated by the same
mechanisms as the rest of the data. Therefore, in outlier detection, it is important to
justify why the detected outliers are generated by some other mechanisms.
Outlier detection is also related to novelty detection in evolving data sets.
For example, by monitoring a social media web site where new content is incoming,
novelty detection may identify new topics and trends in a timely manner.
Novel topics may initially appear as outliers.
To this extent, outlier detection and novelty detection share some similarity in
modeling and detection methods.
In general, outliers can be classified into three categories, namely global outliers,
contextual (or conditional) outliers, and collective outliers.
Global Outliers
In a given data set, a data object is a global outlier if it deviates significantly from
the rest of the data set.
Global outliers are sometimes called point anomalies, and are the simplest type of
outliers.
Most outlier detection methods are aimed at finding global outliers.
Global outlier detection is important in many applications.
Consider intrusion detection in computer networks, for example.
If the communication behavior of a computer is very different from the normal
patterns (e.g., a large number of packets is broadcast in a short time), this behavior
may be considered a global outlier, and the corresponding computer is a suspected
victim of hacking.
As another example, in trading transaction auditing systems, transactions that do not
follow the regulations are considered as global outliers and should be held for further
examination.
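As a toy illustration of global outlier detection (one simple rule among many, not prescribed by the notes), the sketch below flags values whose z-score magnitude exceeds a threshold:

```python
import numpy as np

def global_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# Invented network traffic: steady rates plus one extreme burst
traffic = [100 + i % 7 - 3 for i in range(20)] + [850]
print(global_outliers(traffic))  # the burst of 850 stands out globally
```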
Contextual Outliers
“The temperature today is 35 °C. Is it exceptional (i.e., an outlier)?” It depends, for
example, on the time and location! If it is winter in Hyderabad, yes, it is an
outlier.
If it is a summer day in Hyderabad, then it is normal.
Unlike global outlier detection, in this case, whether or not today’s temperature
value is an outlier depends on the context—the date, the location, and possibly some
other factors.
Contextual outliers are a generalization of local outliers, objects whose density
significantly deviates from the local area in which they occur.
In credit card fraud detection, in addition to global outliers, an analyst may consider
outliers in different contexts.
Consider customers who use more than 90% of their credit limit.
If one such customer is viewed as belonging to a group of customers with low credit
limits, then such behavior may not be considered an outlier.
However, the similar behavior of customers from a high-income group may be
considered an outlier if their balance often exceeds their credit limit.
Such outliers may lead to business opportunities: raising credit limits for such
customers can bring in new revenue.
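A hedged sketch of one way to operationalize context: compare each reading against the statistics of its own context, here the month (the temperatures and the 1.5-sigma threshold are invented for illustration):

```python
import statistics
from collections import defaultdict

# (month, temperature in °C) readings; values are made up for illustration
readings = [("Jan", 21), ("Jan", 22), ("Jan", 20), ("Jan", 35),
            ("May", 36), ("May", 38), ("May", 35), ("May", 37)]

by_month = defaultdict(list)
for month, t in readings:
    by_month[month].append(t)

# A reading is a contextual outlier if it lies far from its month's mean.
for month, t in readings:
    mean = statistics.mean(by_month[month])
    sd = statistics.pstdev(by_month[month])
    if sd and abs(t - mean) > 1.5 * sd:
        print(f"{t} °C is a contextual outlier for {month}")
# 35 °C is flagged for Jan but not for May, matching the discussion above.
```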
Collective Outliers
Suppose you are a supply-chain manager of AllElectronics. You handle thousands of
orders and shipments every day. If the shipment of an order is delayed, it may not be
considered an outlier because, statistically, delays occur from time to time.
However, you have to pay attention if 100 orders are delayed on a single day.
Those 100 orders as a whole form an outlier, although each of them may not be
regarded as an outlier if considered individually.
You may have to take a close look at those orders collectively to understand the
shipment problem.
Collective outliers: in the figure, the black objects as a whole form a collective outlier,
because the density of those objects is much higher than that of the rest of the data set.
However, no black object individually is an outlier with respect to the whole data set.
Collective outlier detection has many important applications.
For example, in intrusion detection, a denial-of-service packet sent from one computer
to another is considered normal, and not an outlier at all.
However, if several computers keep sending denial-of-service packets to each
other, they as a whole should be considered a collective outlier.
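A toy sketch of the shipment example: no single delayed order is an outlier, but a day whose delay count far exceeds the typical daily count is flagged as a collective outlier (the counts and the 10x-median rule are invented):

```python
import statistics

# Delayed-order counts per day; each individual delay is unremarkable
delays_per_day = {"Mon": 4, "Tue": 6, "Wed": 5, "Thu": 100, "Fri": 5}

median = statistics.median(delays_per_day.values())

# Flag days whose delay count is an order of magnitude above the median:
# Thursday's 100 delays together form a collective outlier, even though
# no single delayed order is an outlier on its own.
for day, n in delays_per_day.items():
    if n > 10 * median:
        print(f"{day}: {n} delays form a collective outlier")
```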
Summary: General Characteristics of Clustering Methods

Partitioning methods:
- Find mutually exclusive clusters of spherical shape
- Distance-based
- May use mean or medoid (etc.) to represent cluster center
- Effective for small- to medium-size data sets

Hierarchical methods:
- Clustering is a hierarchical decomposition (i.e., multiple levels)
- Cannot correct erroneous merges or splits
- May incorporate other techniques like microclustering or consider object "linkages"

Density-based methods:
- Can find arbitrarily shaped clusters
- Clusters are dense regions of objects in space that are separated by low-density regions
- Cluster density: each point must have a minimum number of points within its "neighborhood"
- May filter out outliers

Grid-based methods:
- Use a multiresolution grid data structure
- Fast processing time (typically independent of the number of data objects, yet dependent on grid size)