Unit 5
Cluster Analysis Introduction: Types of Data in Cluster Analysis, A Categorization of Major Clustering
Methods, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods.
5.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to house
type, value, and geographic location, as well as the identification of groups of automobile
insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection. Applications of outlier detection include the
detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.
The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one or until a termination
condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in its own cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. This rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial number of different choices.
The k-means algorithm measures cluster quality with the square-error criterion, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

where E is the sum of the squared error for all objects in the data set,
p is the point in space representing a given object, and
m_i is the mean of cluster C_i.
The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly exacerbated
due to the use of the square-error function.
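To make the square-error criterion concrete, the following is a minimal NumPy sketch of the classic (Lloyd's) k-means iteration. The function and variable names are our own, and empty clusters are not handled:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # Lloyd's k-means: alternate between assigning objects to the
        # nearest cluster mean and recomputing the means.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # k random seed points
        for _ in range(n_iter):
            # Distance of every object to every center, shape (n, k).
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # New center = mean of its cluster (assumes no cluster is empty).
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        E = ((X - centers[labels]) ** 2).sum()  # square-error criterion E
        return labels, centers, E

Because E squares every distance, a single object far from the rest can dominate the sum and pull its cluster mean toward itself, which is exactly the outlier sensitivity described above.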
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as

E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|

where E is the sum of the absolute error for all objects in the data set, and o_j is the
representative object of cluster C_j.
Case 1:
p currently belongs to representative object oj. If oj is replaced by orandom as a representative object
and p is closest to one of the other representative objects oi, i ≠ j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object oj. If oj is replaced by orandom as a representative object
and p is closest to orandom, then p is reassigned to orandom.
Case 3:
p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom as a representative
object and p is still closest to oi, then the assignment does not change.
Case 4:
p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom as a representative
object and p is closest to orandom, then p is reassigned to orandom.
Figure: Four cases of the cost function for k-medoids clustering.
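The following sketch shows how the absolute-error criterion E drives a PAM-style swap decision. It assumes a precomputed pairwise distance matrix D; the function names are illustrative, not from a library:

    import numpy as np

    def absolute_error(D, medoids):
        # E: each object contributes its distance to the nearest medoid.
        return D[:, medoids].min(axis=1).sum()

    def swap_cost(D, medoids, o_j, o_random):
        # Change in E if medoid o_j is replaced by candidate o_random.
        # This single number summarizes the net effect of the four
        # reassignment cases; a negative value means the swap improves E.
        new_medoids = [o_random if m == o_j else m for m in medoids]
        return absolute_error(D, new_medoids) - absolute_error(D, medoids)

In PAM, every (medoid, non-medoid) pair is evaluated this way, and the swap with the most negative cost is performed, repeating until no swap improves E.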
A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to
perform adjustment once a merge or split decision has been executed. That is, if a particular
merge or split decision later turns out to have been a poor choice, the method cannot backtrack
and correct it.
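As an illustration, SciPy provides a standard agglomerative (bottom-up) implementation; the data, linkage method, and cut level below are arbitrary choices for the sketch:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)            # 20 objects with 2 attributes
    Z = linkage(X, method='average')     # successively merge the closest groups
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters

Each row of Z records one irreversible merge, which is precisely why a poor early merge cannot be corrected later.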
Constraints on individual objects:
We can specify constraints on the objects to be clustered. In a real estate application, for
example, one may like to spatially cluster only those luxury mansions worth over a million
dollars. This constraint confines the set of objects to be clustered. It can easily be handled
by preprocessing, after which the problem reduces to an instance of unconstrained clustering.
Constraints on the selection of clustering parameters:
A user may like to set a desired range for each clustering parameter. Clustering parameters
are usually quite specific to the given clustering algorithm. Examples of parameters include
k, the desired number of clusters in a k-means algorithm, or ε (the radius) and the minimum
number of points in the DBSCAN algorithm. Although such user-specified parameters may
strongly influence the clustering results, they are usually confined to the algorithm itself.
Thus, their fine tuning and processing are usually not considered a form of constraint-based
clustering.
Constraints on distance or similarity functions:
We can specify different distance or similarity functions for specific attributes of the objects
to be clustered, or different distance measures for specific pairs of objects. When clustering
sportsmen, for example, we may use different weighting schemes for height, body weight,
age, and skill level. Although this will likely change the mining results, it may not alter the
clustering process per se. However, in some cases, such changes may make the evaluation of
the distance function nontrivial, especially when it is tightly intertwined with the clustering
process.
User-specified constraints on the properties of individual clusters:
A user may like to specify desired characteristics of the resulting clusters, which may strongly
influence the clustering process.
Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form
of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled
as belonging to the same or different clusters). Such a constrained clustering process is
called semi-supervised clustering.
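A small sketch of how pairwise supervision can be expressed: must-link pairs should land in the same cluster and cannot-link pairs in different ones. This is only the constraint-checking step, not a full semi-supervised algorithm; the names are our own:

    def constraint_violations(labels, must_link, cannot_link):
        # must_link / cannot_link are lists of (i, j) index pairs.
        v = sum(labels[i] != labels[j] for i, j in must_link)
        v += sum(labels[i] == labels[j] for i, j in cannot_link)
        return v

A semi-supervised clustering algorithm would either forbid assignments that raise this count or add it as a penalty term to the clustering objective.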
There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them
altogether. This, however, could result in the loss of important hidden information because one
person’s noise could be another person’s signal. In other words, the outliers may be of
particular interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services. In addition, it is useful in customized marketing for identifying
the spending behavior of customers with extremely low or extremely high incomes, or in
medical analysis for finding unusual responses to various medical treatments.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the
expected number of outliers, find the top k objects that are considerably dissimilar, exceptional,
or inconsistent with respect to the remaining data. The outlier mining problem can be viewed
as two subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
Types of outlier detection:
Statistical Distribution-Based Outlier Detection
Distance-Based Outlier Detection
Density-Based Local Outlier Detection
Deviation-Based Outlier Detection
In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k),
where k is the dimensionality of the data space. Each cell has two layers surrounding it. The
first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick.
Let M be the maximum number of objects that can exist in the dmin-neighborhood of an outlier.
An object, o, in the current cell is considered an outlier only if the cell + 1 layer count is less
than or equal to M. If this condition does not hold, then all of the objects in the cell can be
removed from further investigation, as they cannot be outliers.
If the cell + 2 layers count is less than or equal to M, then all of the objects in the cell are
considered outliers. Otherwise, if this number is more than M, then it is possible that some
of the objects in the cell may be outliers. To detect these outliers, object-by-object
processing is used where, for each object, o, in the cell, objects in the second layer of o are
examined. For objects in the cell, only those objects having no more than M points in their
dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the
object's cell, all of its first layer, and some of its second layer.
A variation of the algorithm is linear with respect to n and guarantees that no more than three
passes over the data set are required. It can be used for large disk-resident data sets, yet it does
not scale well for high dimensions.
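Setting the cell-based refinement aside, the underlying distance-based definition can be checked directly with a naive nested loop: an object is a DB(pct, dmin)-outlier if at least a fraction pct of the objects lie farther than dmin from it. A hedged O(n²) sketch, with names of our choosing:

    import numpy as np

    def db_outliers(X, pct, dmin):
        n = len(X)
        M = n * (1 - pct)   # max objects allowed inside an outlier's dmin-neighborhood
        outliers = []
        for i in range(n):
            d = np.linalg.norm(X - X[i], axis=1)
            if (d <= dmin).sum() - 1 <= M:   # -1 excludes the object itself
                outliers.append(i)
        return outliers

The cell-based method reaches the same answer while pruning whole cells at once using the cell + 1 layer and cell + 2 layers counts.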
Density-based local outlier detection relies on the notion of the k-distance of an object p: the
distance, d(p, o), between p and an object o such that at least k objects in the data set lie at
distance no greater than d(p, o) from p. That is, there are at most k−1 objects that are closer
to p than o. You may be wondering at this point how k is determined. The LOF method links
to density-based clustering in that it sets k to the parameter MinPts, which specifies the
minimum number of points for use in identifying clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_{k-distance(p)}(p), or N_k(p) for short. By
setting k to MinPts, we get N_{MinPts}(p). It contains the MinPts-nearest neighbors of p. That is, it
contains every object whose distance from p is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is within
the MinPts-nearest neighbors of p) is defined as

reach_dist_{MinPts}(p, o) = max{MinPts-distance(o), d(p, o)}.
Intuitively, if an object p is far away from o, then the reachability distance between the two is simply
their actual distance. However, if they are sufficiently close (i.e., where p is within the MinPts-
distance neighborhood of o), then the actual distance is replaced by the MinPts-distance of o.
This helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o.
The higher the value of MinPts is, the more similar the reachability distances are for objects
within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability
distance based on the MinPts-nearest neighbors of p. It is defined as

lrd_{MinPts}(p) = |N_{MinPts}(p)| / \sum_{o \in N_{MinPts}(p)} reach_dist_{MinPts}(p, o)
The local outlier factor (LOF) of p captures the degree to which we call p an outlier.
It is defined as

LOF_{MinPts}(p) = \frac{1}{|N_{MinPts}(p)|} \sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}
It is the average of the ratios of the local reachability densities of p's MinPts-nearest
neighbors to that of p itself. It is easy to see that the lower p's local reachability density
is, and the higher the local reachability densities of p's MinPts-nearest neighbors are,
the higher LOF(p) is.
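For example, scikit-learn (assuming it is available) ships an LOF implementation in which n_neighbors plays the role of MinPts; the data here are synthetic:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    X = np.r_[np.random.randn(50, 2), [[8.0, 8.0]]]  # 50 inliers plus one far point
    lof = LocalOutlierFactor(n_neighbors=10)         # n_neighbors acts as MinPts
    pred = lof.fit_predict(X)                        # -1 marks detected outliers
    scores = -lof.negative_outlier_factor_           # LOF(p); larger means more outlying

The isolated point at (8, 8) receives an LOF well above 1, while objects deep inside the cluster score close to 1.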
5.7.4 Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures to
identify exceptional objects. Instead, it identifies outliers by examining the main characteristics
of objects in a group. Objects that "deviate" from this description are considered outliers. Hence,
in this approach the term deviations is typically used to refer to outliers. In this section, we
study two techniques for deviation-based outlier detection. The first sequentially compares
objects in a set, while the second employs an OLAP data cube approach.
The sequential exception technique requires a dissimilarity function. For a set of n numbers
{x_1, ..., x_n}, a possible choice is the variance,

\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

where \bar{x} is the mean of the n numbers in the set. For character strings, the
dissimilarity function may be in the form of a pattern string (e.g., containing wildcard
characters) that is used to cover all of the patterns seen so far. The dissimilarity increases
when the pattern covering all of the strings in D_{j-1} does not cover any string in D_j that
is not in D_{j-1}.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the
dissimilarity can be reduced by removing the subset from the original set of objects.
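A minimal sketch of the sequential-exception idea with variance as the dissimilarity function; the exact way the cardinality function scales the smoothing factor varies between formulations, so this is one plausible choice with names of our own:

    import numpy as np

    def dissimilarity(values):
        # Variance: (1/n) * sum((x_i - mean)^2).
        return np.var(values)

    def smoothing_factor(values, subset_idx):
        # How much the dissimilarity drops when the subset is removed,
        # scaled here by the cardinality of the removed subset.
        rest = np.delete(values, subset_idx)
        return len(subset_idx) * (dissimilarity(values) - dissimilarity(rest))

    x = np.array([1.0, 1.1, 0.9, 1.0, 9.0])
    print(smoothing_factor(x, [4]))   # removing the deviant 9.0 yields the largest factor

The subset whose removal yields the largest smoothing factor is reported as the exception set.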
Data Mining Applications
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Some of the typical
cases are as follows −
Design and construction of data warehouses for multidimensional data analysis and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because it collects large amounts
of data on sales, customer purchasing history, goods transportation, consumption,
and services. It is natural that the quantity of data collected will continue to expand
rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends,
which leads to improved quality of customer service and good customer retention and
satisfaction. Here is a list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Customer Retention.
Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries, providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail,
web data transmission, etc. Due to the development of new computer and communication
technologies, the telecommunication industry is rapidly expanding. This is the reason
why data mining has become very important in helping to understand the business.
Biological Data Analysis
In recent times, we have seen tremendous growth in the field of biology, in areas such as
genomics, proteomics, functional genomics, and biomedical research. Biological data
mining is a very important part of bioinformatics. Following are the aspects in which
data mining contributes to biological data analysis −
Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data
sets for which statistical techniques are appropriate. Huge amounts of data have been
collected from scientific domains such as geosciences, astronomy, etc. Large data
sets are also being generated because of fast numerical simulations in various fields
such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc.
Following are the applications of data mining in the field of scientific applications −
Graph-based mining.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. The increased usage of the internet, along with the availability of tools and
tricks for intruding into and attacking networks, has prompted intrusion detection to
become a critical component of network administration. Here is a list of areas in which
data mining technology may be applied for intrusion detection −
Development of data mining algorithms for intrusion detection.
Association and correlation analysis, and aggregation to help select and build discriminating attributes.
Analysis of stream data.
Distributed data mining.
Visualization and query tools.
Data Mining System Products
There are many data mining system products and domain-specific data mining
applications. New data mining systems and applications are being added to the
previous systems. Also, efforts are being made to standardize data mining languages.
Choosing a Data Mining System
The selection of a data mining system depends on the following features −
Data Types − The data mining system may handle formatted text, record-based
data, and relational data. The data could also be in ASCII text, relational database
data or data warehouse data. Therefore, we should check what exact format the
data mining system can handle.
System Issues − We must consider the compatibility of a data mining system with
different operating systems. One data mining system may run on only one
operating system or on several. There are also data mining systems that provide
web-based user interfaces and allow XML data as input.
Data Sources − Data sources refer to the data formats in which the data mining
system will operate. Some data mining systems may work only on ASCII text files,
while others work on multiple relational sources. The data mining system should also
support ODBC connections or OLE DB for ODBC connections.
Data Mining functions and methodologies − There are some data mining systems
that provide only one data mining function, such as classification, while some
provide multiple data mining functions such as concept description, discovery-
driven OLAP analysis, association mining, linkage analysis, statistical analysis,
classification, prediction, clustering, outlier analysis, similarity search, etc.
Coupling data mining with databases or data warehouse systems − Data mining
systems need to be coupled with a database or a data warehouse system. The
coupled components are integrated into a uniform information processing
environment. Here are the types of coupling listed below −
o No coupling
o Loose coupling
o Semi-tight coupling
o Tight coupling
Trends in Data Mining
Application Exploration.
Scalable and interactive data mining methods.
Integration of data mining with database systems, data warehouse systems and
web database systems.
Standardization of data mining query language.
Visual data mining.
New methods for mining complex types of data.
Biological data mining.
Data mining and software engineering.
Web mining.
Distributed data mining.
Real time data mining.
Multi database data mining.
Privacy protection and information security in data mining.