Unit - 4 DM
Definition:
Cluster Analysis is the process of finding groups of similar objects in order to form clusters. It is an
unsupervised machine learning technique that acts on unlabelled data. A group of similar data points
comes together to form a cluster, and all the objects in a cluster belong to the same group.
Cluster:
A cluster is nothing but a collection of similar data which is grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as
cars, buses, bicycles, etc. Since this is unsupervised learning, there are no class labels such as Cars or
Bikes for the vehicles; all the data is mixed together and not organised in a structured manner.
Now our task is to convert the unlabelled data to labelled data and it can be done using clusters.
The main idea of cluster analysis is to arrange all the data points into clusters, for example a cars cluster
containing all the cars, a bikes cluster containing all the bikes, and so on.
Simply put, it is the partitioning of unlabelled data into groups of similar objects.
Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering algorithms have to deal
with huge databases. In order to handle extensive databases, the clustering algorithm should be scalable;
if it is not, the results may be inappropriate or simply wrong.
2. High Dimensionality: The algorithm should be able to handle high-dimensional spaces as well as
data sets of small size.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with algorithms
of clustering. It should be capable of dealing with different types of data like discrete, categorical and
interval-based data, binary data etc.
4. Dealing with unstructured data: There would be some databases that contain missing values, and
noisy or erroneous data. If the algorithms are sensitive to such data then it may lead to poor quality
clusters. So it should be able to handle unstructured data and give some structure to the data by
organising it into groups of similar data objects. This makes the job of the data expert easier in order to
process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
This clustering method classifies the information into multiple groups based on the characteristics and
similarity of the data. It is up to the data analyst to specify the number of clusters that has to be
generated for the clustering method.
In the partitioning method, when a database (D) contains multiple (N) objects, the partitioning
method constructs a user-specified number (K) of partitions of the data, in which each partition represents
a cluster and a particular region. Many algorithms come under the partitioning method; some of the
popular ones are K-Means, PAM (K-Medoids), CLARA (Clustering Large Applications), etc.
Below, we will see the working of the K-Means algorithm in detail.
K-Means (A Centroid-Based Technique):
The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing
N objects into K clusters so that the resulting similarity among the data objects inside a group
(intracluster similarity) is high, while the similarity between data objects from different clusters
(intercluster similarity) is low. The similarity of a cluster is determined with respect to the mean value of
the cluster.
It is a type of squared-error algorithm. At the start, K objects are chosen randomly from the dataset, each
of which initially represents a cluster mean (center). Each of the remaining data objects is assigned to
the cluster whose mean it is nearest to, based on its distance from the cluster mean. The new mean of
each cluster is then calculated from the data objects assigned to it.
Algorithm: K mean:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects
Output:
A dataset of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centers (C).
2. (Re)assign each object to the cluster whose mean it is most similar to, based on the mean values of the clusters.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated memberships.
4. Repeat Steps 2 and 3 until no change occurs.
Flowchart:
Example: Suppose we want to group the visitors to a website using just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
No assignments change between Iteration-3 and Iteration-4, so the algorithm stops with these two clusters.
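The iterations above can be reproduced with a short script. Below is a minimal sketch of K-Means on the one-dimensional age data in plain Python (no external libraries); the initial centers 16 and 22 are the ones chosen in the example.

# Minimal 1-D K-Means sketch for the website-visitor ages above.
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]  # initial cluster centers, as in the example

while True:
    # Assignment step: put each age into the cluster with the nearest centroid.
    clusters = [[] for _ in centroids]
    for age in ages:
        nearest = min(range(len(centroids)), key=lambda i: abs(age - centroids[i]))
        clusters[nearest].append(age)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # stop when the means no longer change
        break
    centroids = new_centroids

print(centroids)  # approximately [20.5, 48.89], matching Iteration-3/4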
Hierarchical Clustering
A Hierarchical clustering method works via grouping data into a tree of clusters. Hierarchical
clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the
subsequent steps:
1. Identify the two clusters that are closest to each other, and
2. Merge the two most similar (closest) clusters. These steps are repeated until all the clusters
are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram
called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically
represents this hierarchy; it is an inverted tree that describes the order in which points are merged
(bottom-up view) or clusters are broken up (top-down view).
There are two basic methods to generate a hierarchical clustering:
1. Agglomerative: Initially consider every data point as an individual cluster and, at every
step, merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is
treated as an individual entity or cluster. At every iteration, clusters merge with other
clusters until one cluster is formed.
The algorithm for agglomerative hierarchical clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar (closest) to each other.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the actual algorithm works; no calculations have been performed
below, and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, and F.
Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all
the other clusters.
Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let's
say cluster (B) and cluster (C) are very similar to each other, so we merge them in this step; similarly
for clusters (D) and (E). At last, we get the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two nearest
clusters ((DE) and (F)) together to form the new clusters [(A), (BC), (DEF)].
Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together
to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
Step-5: At last the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
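The merge sequence described above can also be produced with standard libraries. The following is a small sketch using SciPy; the six 2-D coordinates are made-up values chosen only so that B and C (and D and E) start out close together, mirroring the walk-through.

# Agglomerative (bottom-up) hierarchical clustering sketch using SciPy.
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E", "F"]
# Hypothetical 2-D coordinates: B~C and D~E are close, A and F are farther away.
points = [[0.0, 5.0], [2.0, 2.0], [2.2, 2.1], [6.0, 6.0], [6.1, 6.2], [8.5, 3.0]]

# 'single' linkage merges the two clusters with the smallest minimum distance.
Z = linkage(points, method="single")

dendrogram(Z, labels=labels)  # tree of merges, read bottom-up
plt.show()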
2. Divisive:
We can say that the Divisive Hierarchical clustering is precisely the opposite of the Agglomerative
Hierarchical clustering. In Divisive Hierarchical clustering, we take into account all of the data
points as a single cluster and in every iteration, we separate the data points from the clusters which
aren’t comparable. In the end, we are left with N clusters.
Density-Based method
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It is a popular unsupervised learning method used for model construction and machine learning
algorithms. It is a clustering method utilized for separating high-density clusters from low-density
clusters. It divides the data points into many groups so that points lying in the same group will have the
same properties. It was proposed by Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu in
1996.
DBSCAN is designed for use with databases that can accelerate region queries. It can not cluster data
sets with large differences in their densities.
Characteristics
It identifies clusters of any shape in a data set; in other words, it can detect arbitrarily shaped clusters.
It is based on intuitive notions of clusters and noise.
It is very robust in detecting outliers in a data set.
It requires only two parameters (a neighborhood radius eps and a minimum number of points MinPts) and
is largely insensitive to the order in which the points occur in the data set.
Advantages
Specification of number of clusters of data in the data set is not required.
It can find any shape cluster even if the cluster is surrounded by any other cluster.
It can easily find outliers in data set.
It is not much sensitive to noise, it means it is noise tolerant.
It is the second most used clustering method after K-means.
Disadvantages
The quality of the result depends on the distance measure used in the regionQuery function.
Border points may go in any cluster depending on the processing order so it is not completely
deterministic.
It can be expensive when the cost of computing nearest neighbors is high.
It can be slow in execution for higher dimensions.
It adapts poorly to variations in local density.
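As a concrete illustration of the points above, here is a minimal sketch using scikit-learn's DBSCAN on made-up data; eps and min_samples are the two parameters mentioned under Characteristics.

# DBSCAN sketch: two dense blobs plus a few scattered noise points (label -1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=7, size=(5, 2))
X = np.vstack([blob1, blob2, noise])

# eps: neighborhood radius; min_samples: points needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(set(db.labels_))           # e.g. {0, 1, -1}: two clusters plus noise
print(np.sum(db.labels_ == -1))  # number of points flagged as noise/outliers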
Grid-Based Method
STING (STatistical INformation Grid) was proposed by Wang, Yang, and Muntz (VLDB'97). In this
method, the spatial area is divided into rectangular cells. There are several levels of cells corresponding
to different levels of resolution. Each cell at a high level is partitioned into several smaller cells at the
next lower level. The statistical information of each cell is calculated and stored beforehand and is used
to answer queries. The parameters of higher-level cells can be easily calculated from the parameters of
the lower-level cells:
Count, mean, standard deviation (s), min, max
Type of distribution—normal, uniform, etc.
Spatial data queries are then answered using a top-down approach: start from a pre-selected
layer, typically one with a small number of cells. For each cell in the current level, compute the confidence
interval. Remove the irrelevant cells from further consideration. When the current layer has been
examined, proceed to the next lower level. Repeat this process until the bottom layer is reached.
Advantages:
The grid-based computation is query-independent, easy to parallelize, and supports incremental updating.
It is efficient: the complexity is O(K), where K is the number of grid cells at the lowest level.
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.
WaveCluster
It was proposed by Sheikholeslami, Chatterjee, and Zhang (VLDB'98). It is a multi-resolution clustering
approach which applies a wavelet transform to the feature space.
A wavelet transform is a signal processing technique that decomposes a signal into different frequency
sub-bands. WaveCluster is both a grid-based and a density-based method.
Input parameters:
The number of grid cells for each dimension
The wavelet, and the number of applications of the wavelet transform.
CLIQUE (Clustering In QUEst)
CLIQUE is a density-based and grid-based subspace clustering algorithm. The CLIQUE algorithm scales
well with the number of records and the number of dimensions in the dataset because it is
grid-based and uses the Apriori property effectively.
Advantage:
CLIQUE is a subspace clustering algorithm that outperforms K-means, DBSCAN, and Farthest
First in both execution time and accuracy.
CLIQUE can find clusters of any shape and is able to find any number of clusters in any number of
dimensions, where the number is not predetermined by a parameter.
It is one of the simplest methods and gives interpretable results.
Disadvantage:
The main disadvantage of the CLIQUE algorithm is that if the grid cell size is unsuitable for the data,
too much approximation takes place and the correct clusters cannot be found.
Clustering tendency
Before evaluating clustering performance, it is very important to make sure that the data set we are
working with has a clustering tendency and does not consist of uniformly distributed points. If the data
does not have a clustering tendency, then the clusters identified by even state-of-the-art clustering
algorithms may be meaningless.
Some clustering algorithms, such as K-means, require the number of clusters, k, as a clustering parameter.
Getting the optimal number of clusters is very significant in the analysis. If k is too high, each point will
broadly start representing a cluster, and if k is too low, data points are incorrectly clustered.
There is no definitive answer for finding the right number of clusters, as it depends upon (a) the distribution
shape, (b) the scale of the data set, and (c) the clustering resolution required by the user; finding the number
of clusters is therefore a fairly subjective problem. There are two major approaches to finding the optimal
number of clusters:
(1) Domain knowledge
(2) Data-driven approach — if domain knowledge is not available, mathematical methods help in finding
the right number of clusters.
Empirical Method:-
A simple empirical rule for the number of clusters is k ≈ √(N/2), where N is the total number of
data points, so that each cluster contains approximately √(2N) points.
Elbow method:-
Within-cluster variance is a measure of the compactness of a cluster: the lower the within-cluster
variance, the higher the compactness of the cluster formed.
The sum of within-cluster variance, W, is calculated for clustering analyses done with different values of
k. W is a cumulative measure of how well the points are clustered in the analysis. Plotting the k values
against their corresponding sums of within-cluster variance helps in finding the optimal number of clusters.
In the example elbow plot, the optimal number of clusters is 4.
Initially, the error measure (within-cluster variance) decreases as the number of clusters increases. After a
particular point, k = 4, the error measure starts flattening. The cluster number corresponding to that
point, k = 4, should be considered the optimal number of clusters.
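A sketch of how such an elbow plot is typically produced with scikit-learn is shown below; inertia_ is scikit-learn's name for the sum of within-cluster squared distances (the W above), and the sample data are generated, not real.

# Elbow method sketch: plot sum of within-cluster variance (inertia) against k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # data with 4 real groups

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Sum of within-cluster variance (W)")
plt.show()  # the curve flattens ('elbows') near k = 4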
Statistical approach:-
Gap statistic is a powerful statistical method to find the optimal number of clusters, k.
Similar to Elbow method, sum of within-cluster (intra-cluster) variance is calculated for different values of
k.
Then random data points are generated from a reference null distribution, and the sum of within-cluster
variance is calculated for clusterings of this reference data for the same values of k.
In simpler words, the sum of within-cluster variance of the original data set is compared, for each value
of k, with the sum of within-cluster variance of a reference data set (a null reference data set drawn from
a uniform distribution); the ideal k is the value where the 'deviation' or 'gap' between the two is
highest. The gap statistic quantifies this deviation: the larger the gap statistic, the larger the deviation.
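Below is a rough sketch of the gap statistic idea described above; it compares log(W) of the data with log(W) of uniform reference data and omits the standard-error correction used in the full method, so it is for illustration only.

# Gap statistic sketch: compare log(W_k) of the data against uniform reference data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def within_cluster_dispersion(X, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.inertia_  # sum of squared distances to the nearest cluster center

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
n_refs = 10
mins, maxs = X.min(axis=0), X.max(axis=0)

for k in range(1, 7):
    log_w_data = np.log(within_cluster_dispersion(X, k))
    # Reference null distribution: points drawn uniformly over the data's bounding box.
    log_w_refs = []
    for _ in range(n_refs):
        ref = np.random.uniform(mins, maxs, size=X.shape)
        log_w_refs.append(np.log(within_cluster_dispersion(ref, k)))
    gap = np.mean(log_w_refs) - log_w_data
    print(k, round(gap, 3))  # the k with the largest gap is a good candidate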
Clustering quality
Once clustering is done, how well the clustering has performed can be quantified by a number of metrics.
Ideal clustering is characterised by minimal intra cluster distance and maximal inter cluster distance.
There are majorly two types of measures to assess the clustering performance.
(i) Extrinsic measures, which require ground-truth labels. Examples are the Adjusted Rand Index, Fowlkes-
Mallows scores, mutual-information-based scores, homogeneity, completeness, and the V-measure.
(ii) Intrinsic measures, which do not require ground-truth labels. Examples are the Silhouette Coefficient,
the Calinski-Harabasz Index, the Davies-Bouldin Index, etc.
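For the intrinsic measures, scikit-learn provides ready-made functions; a small sketch on made-up data:

# Intrinsic clustering-quality sketch: no ground-truth labels are needed.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))         # closer to +1 is better
print(calinski_harabasz_score(X, labels))  # higher is better
print(davies_bouldin_score(X, labels))     # lower is better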
Clustering High-Dimensional Data
High-dimensional data is reduced to low-dimensional data to make the clustering and the search for
clusters simpler. Some applications need appropriate models of clusters, especially for high-dimensional
data. Clusters in high-dimensional data are often significantly small, and the conventional
distance measures can be ineffective. Instead, to find the hidden clusters in high-dimensional data we
need to apply sophisticated techniques that can model correlations among the objects in subspaces.
Subspace Clustering Methods:
There are 3 Subspace Clustering Methods:
Subspace search methods
Correlation-based clustering methods
Biclustering methods
Subspace clustering approaches search for clusters existing in subspaces of the given high-
dimensional data space, where a subspace is defined using a subset of attributes of the full space.
1. Subspace Search Methods: A subspace search method searches the subspaces for clusters. Here, the
cluster is a group of similar types of objects in a subspace. The similarity between the clusters is
measured by using distance or density features. CLIQUE algorithm is a subspace clustering method.
Subspace search methods search a series of subspaces. There are two approaches in subspace search
methods: the bottom-up approach starts searching from the low-dimensional subspaces; if the hidden
clusters are not found in the low-dimensional subspaces, it then searches in higher-dimensional subspaces.
The top-down approach starts searching from the high-dimensional subspaces and then searches in subsets
of the low-dimensional subspaces. Top-down approaches are effective if the subspace of a cluster can be
determined by the local neighborhood.
2. Correlation-Based Clustering: Correlation-based approaches discover hidden clusters by
developing advanced correlation models. Correlation-based models are preferred if it is not possible to
cluster the objects using subspace search methods. Correlation-based clustering includes advanced
mining techniques for correlation cluster analysis. Biclustering methods are correlation-based clustering
methods in which both the objects and the attributes are clustered.
3. Biclustering Methods:
Biclustering means clustering the data based on two factors: in some applications we can cluster both
objects and attributes at the same time. The resulting clusters are called biclusters. Biclustering has
four requirements:
Only a small set of objects participates in a cluster.
A cluster only involves a small number of attributes.
An object may participate in multiple clusters, or may not participate in any cluster at all.
An attribute may be involved in multiple clusters, or in no cluster at all.
Objects and attributes are not treated in the same way. Objects are clustered according to their attribute
values. We treat Objects and attributes as different in biclustering analysis.
Clustering with Constraints
There are various methods for clustering with constraints that can handle specific kinds of constraints:
Handling hard constraints: Hard constraints can be handled by respecting the constraints in the
cluster assignment procedure, i.e., the constraints are strictly enforced when objects are assigned to clusters.
Generating super-instances for must-link constraints: The transitive closure of the must-link
constraints can be computed; must-link constraints therefore form an equivalence relation, which
defines subsets of objects. The objects in each such subset can be replaced by their mean
(a "super-instance"), which reduces the number of objects to be clustered.
Handling soft constraints: Clustering with soft constraints is an optimization process: a penalty is
added whenever a constraint is violated, so the aim of the optimization is to minimize both the
clustering cost and the constraint-violation penalty. For example, given a data set and a set of soft
constraints, the CVQE (Constrained Vector Quantization Error) algorithm applies K-means clustering
with a penalty for constraint violations. The objective of CVQE is the total distance used in K-means,
adjusted with the following penalties:
Penalty for a must-link violation: If a must-link constraint on objects x and y is violated,
i.e., x and y are assigned to two different centers C1 and C2, then the distance between C1
and C2 is added to the objective function as the penalty.
Penalty for a cannot-link violation: This penalty differs from the must-link case: if a
cannot-link constraint on objects x and y is violated, i.e., both are assigned to the same
center C, then the distance between C and the closest other center is added to the objective
function and treated as the penalty.
Web Mining
Web mining is the application of data mining techniques to automatically discover and extract
information from Web documents and services. The main purpose of web mining is to discover useful
information from the World Wide Web and its usage patterns.
Spatial Data Mining
Spatial data mining refers to the process of extracting knowledge, spatial relationships, and interesting
patterns that are not explicitly stored in a spatial database.
The emergence of spatial data and extensive usage of spatial databases has led to spatial knowledge
discovery. Spatial data mining can be understood as a process that determines some exciting and
hypothetically valuable patterns from spatial databases.
Several tools exist that assist in extracting information from geospatial data. These tools play a vital
role for organizations like NASA, the National Imagery and Mapping Agency (NIMA), the National
Cancer Institute (NCI), and the United States Department of Transportation (USDOT), which tend to
make big decisions based on large spatial datasets.
Spatial data is the data associated with physical, real-life locations such as towns, cities, islands, etc.
Spatial data is basically of three different types, widely used in commercial sectors:
1. Map data :
Map data includes different types of spatial features of objects in map, e.g – an object’s shape and
location of object within map. The three basic types of features are points, lines, and polygons (or
areas).
Points –
Points are used to represent spatial characteristics of objects whose locations correspond to
single 2-D coordinates (x, y, or longitude/latitude) in the scale of particular application. For
examples : Buildings, cellular towers, or stationary vehicles. Moving vehicles and other moving
objects can be represented by sequence of point locations that change over time.
Lines –
Lines represent objects having length, such as roads or rivers, whose spatial characteristics can
be approximated by sequence of connected lines.
Polygons –
Polygons are used to represent characteristics of objects that have boundary, like states, lakes, or
countries.
2. Attribute data :
It is the descriptive data that Geographic Information Systems (GIS) associate with features on the map.
For example, in a map representing the districts of an Indian state (e.g., Odisha), the attributes could be
population, largest city/town, area in square miles, and so on.
3. Image data :
It includes camera created data like satellite images and aerial photographs. Objects of interest, such
as buildings and roads, can be identified and overlaid on these images. Aerial and satellite images
are typical examples of raster data.
Models of Spatial Information :
It is divided into two categories :
Field :
These models are used to model spatial data that is continuous in nature, e.g. terrain elevation, air
quality index, temperature data, and soil variation characteristics.
Object :
These models have been used for applications such as transportation networks, land parcels,
buildings, and other objects that possess both spatial and non-spatial attributes.
A spatial application is modeled using either a field-based or an object-based model, depending on the
requirements and the traditional choice of model for the application. Examples include high-traffic
analysis systems, etc.
Spatial data must have latitude or longitude, UTM easting or northing, or some other coordinates denoting
a point's location in space. Beyond that, spatial data can contain any number of attributes pertaining to a
place. You can choose the types of attributes you want to describe a place. Government websites provide
a resource by offering spatial data, but you need not be limited to what they have produced. You can
produce your own.
Say, for example, you wanted to log information about every location you've visited in the past week.
This might be useful to provide insight into your daily habits. You could capture your destination's
coordinates and list a number of attributes such as place name, the purpose of visit, duration of visit, and
more. You can then create a shapefile in Quantum GIS or similar software with this information and use
the software to query and visualize the data. For example, you could generate a heatmap of the most
visited places or select all places you've visited within a radius of 8 miles from home.
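As a small illustration of the kind of query mentioned above (all visited places within a radius of 8 miles from home), here is a sketch using the haversine formula; the coordinates, place names, and attributes are made up.

# Radius query sketch: find logged places within 8 miles of home (haversine distance).
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

home = (40.7128, -74.0060)  # hypothetical home coordinates
visits = [  # (place name, latitude, longitude, purpose) - made-up attribute data
    ("Grocery store", 40.7306, -73.9866, "shopping"),
    ("Office", 40.7580, -73.9855, "work"),
    ("Beach", 40.5749, -73.9857, "leisure"),
]

for name, lat, lon, purpose in visits:
    d = haversine_miles(home[0], home[1], lat, lon)
    if d <= 8.0:
        print(f"{name} ({purpose}): {d:.1f} miles from home")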
OUTLIER
DEFINITION
An outlier is an object that deviates significantly from the rest of the objects. They can be caused by
measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier
mining.
1. Global Outliers
They are also known as Point Outliers. These are the simplest form of outliers. If, in a given dataset, a
data point strongly deviates from all the rest of the data points, it is known as a global outlier. Mostly,
all of the outlier detection methods are aimed at finding global outliers.
For example, in an intrusion detection system, if a large number of packets are broadcast in a very short
span of time, this may be considered a global outlier, and we can say that that particular system has
potentially been hacked.
(In the accompanying figure, the red data point is a global outlier.)
2. Collective Outliers
As the name suggests, if in a given dataset, some of the data points, as a whole, deviate significantly
from the rest of the dataset, they may be termed as collective outliers. Here, the individual data objects
may not be outliers, but when seen as a whole, they may behave as outliers. To detect these types of
outliers, we might need background information about the relationship between those data objects
showing the behavior of outliers.
For example: In an intrusion detection system, a DOS (denial-of-service) packet sent from one computer
to another may be considered normal behavior. However, if this happens with several computers at
the same time, it may be considered abnormal behavior, and as a whole these events can be termed
collective outliers.
3. Contextual Outliers
They are also known as Conditional Outliers. Here, a data object is a contextual outlier if it deviates
significantly from the other data points, but only with respect to a specific context or condition. A data
point may be an outlier under a certain condition and show normal behavior under another condition.
Therefore, a context has to be specified as part of the problem statement in order to identify contextual
outliers. Contextual outlier analysis provides flexibility for users where one can examine outliers in
different contexts, which can be highly desirable in many applications. The attributes of the data point
are decided on the basis of both contextual and behavioral attributes.
For example: A temperature reading of 40°C may behave as an outlier in the context of a “winter
season” but will behave like a normal data point in the context of a “summer season”.
A low temperature value in June is a contextual outlier because the same value in December is not an
outlier.
Outlier Analysis
Outlier analysis is important because an outlier may indicate an error in the experiment or genuinely
unusual behavior. Outliers are extensively used in various areas such as detecting fraud, spotting
potential new trends in the market, and others.
Usually, outliers are confused with noise. However, outliers are different from noise data in the following
sense:
a. Noise is a random error, but outlier is an observation point that is situated away from different
observations.
b. Noise should be removed for better outlier detection.
Uses of outlier detection:
a. It is used in identifying frauds in banking sectors, such as credit card hacking or similar
frauds.
b. It is used in observing the change in trends of buying patterns of a customer.
c. It is used in identifying the typing errors and reporting errors made by humans.
d. It is used in discovering the errors or faults in machines or systems.
Applications of outlier detection:
a. Fraud Detection
b. Telecom Fraud Detection
c. Intrusion Detection in Cyber Security
d. Medical Analysis
e. Environment Monitoring such as Cyclone, Tsunami, Floods, Drought and so on
f. Noticing unforeseen entries in Databases
Detecting outliers using z-scores:
You can convert extreme data points into z-scores that tell you how many standard deviations
away they are from the mean.
If a value has a high enough or low enough z-score, it can be considered an outlier. As a rule of
thumb, values with a z-score greater than 3 or less than –3 are often considered outliers.
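A short sketch of this z-score rule on made-up data:

# Z-score outlier sketch: flag values more than 3 standard deviations from the mean.
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=170, scale=5, size=50), 300.0)  # made-up data plus one extreme value

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]  # rule of thumb: |z| > 3
print(outliers)                          # flags the extreme value (300.0)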
Detecting outliers using the interquartile range (IQR):
With this method, your outliers are any values greater than your upper fence or less than your lower
fence, which are calculated in the steps below.
Step 1: Sort your data from low to high
Unsorted data: 25 37 24 28 35 22 31 53 41 64 29
Sorted data: 22 24 25 28 29 31 35 37 41 53 64
Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of your dataset when all values are ordered from
low to high.
Since you have 11 values, the median is the 6th value. The median value is 31.
22 24 25 28 29 31 35 37 41 53 64
Next, use the exclusive method for identifying Q1 and Q3. This means we remove the median
from our calculations.
The Q1 is the value in the middle of the first half of your dataset, excluding the median. The first
quartile value is 25.
22 24 25 28 29
Your Q3 value is in the middle of the second half of your dataset, excluding the median. The
third quartile value is 41.
35 37 41 53 64
Step 3: Calculate the IQR
The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the
IQR.
Formula: IQR = Q3 – Q1
Calculation: IQR = 41 – 25 = 16
Step 4: Calculate your upper fence
Formula: Upper fence = Q3 + (1.5 × IQR)
Calculation: Upper fence = 41 + (1.5 × 16) = 41 + 24 = 65
Step 5: Calculate your lower fence
Formula: Lower fence = Q1 – (1.5 × IQR)
Calculation: Lower fence = 25 – (1.5 × 16) = 25 – 24 = 1
Step 6: Use your fences to highlight any outliers
Any value greater than the upper fence (65) or less than the lower fence (1) is an outlier. In this data
set no value falls outside the fences, so the IQR rule flags no outliers here:
22 24 25 28 29 31 35 37 41 53 64
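The same fence calculation can be scripted. Below is a small sketch on the 11 values above, computing the quartiles with the exclusive method described in Step 2:

# IQR fence sketch for the worked example above (exclusive-method quartiles).
data = sorted([25, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29])

def median(xs):
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

m = median(data)                          # 31
lower_half = data[:len(data) // 2]        # values below the median position
upper_half = data[(len(data) + 1) // 2:]  # values above the median position
q1, q3 = median(lower_half), median(upper_half)  # 25 and 41
iqr = q3 - q1                 # 16
upper_fence = q3 + 1.5 * iqr  # 65.0
lower_fence = q1 - 1.5 * iqr  # 1.0

outliers = [x for x in data if x > upper_fence or x < lower_fence]
print(m, q1, q3, iqr, lower_fence, upper_fence, outliers)  # outliers is empty here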
Proximity-Based Methods in Data Mining
Proximity-based methods are an important technique in data mining. They are employed to find
patterns in large databases by scanning documents for certain keywords and phrases. They are highly
prevalent because they do not require expensive hardware or much storage space, and they scale up
efficiently as the size of databases increases.
Advantages of Proximity-Based Methods:
1. Proximity-based methods make use of machine learning techniques, in which algorithms are trained
to respond to certain patterns.
2. Using a random sample of documents, the machine learning algorithm analyzes the keywords and
phrases used in them and makes predictions about the probability that these words appear together
across all documents.
3. Proximity can be calculated by calculating a similarity score between two collections of training
data and then comparing these scores. The algorithm then tries to compute the maximum similarity
score for two distinct sets of training items.
Disadvantages of Proximity-Based Methods:
1. Important words may not be as close in proximity as we expected.
2. Over-segmentation of documents into phrases. To counter these problems, a lexical chain-based
algorithm has been proposed.
Proximity-based methods perform very well for finding sets of documents that contain certain words
based on background knowledge. But performance is limited when the background knowledge has not
been pre-classified into categories.
To find sets of documents containing certain categories, one must assign categorical values to each
document and then run proximity-based methods on these documents as training data, hoping for
accurate representations of the categories.
One way to identify outliers is by calculating their distance from the rest of the data set; this is known as
density-based outlier detection.
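One common density-based implementation of this idea is the Local Outlier Factor (LOF); below is a minimal sketch using scikit-learn on made-up data, where the point far from the dense group is flagged.

# Density-based outlier sketch: Local Outlier Factor compares each point's local
# density with that of its neighbors; points in much sparser regions get label -1.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               [[6.0, 6.0]]])  # one isolated (made-up) point far from the cluster

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)    # -1 marks outliers, 1 marks inliers

print(X[labels == -1])         # the isolated point is flagged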
Ideally, outlier detection methods for high-dimensional data should meet the challenges that follow.
Interpretation of outliers: They should be able to not only detect outliers, but also provide an
interpretation of the outliers. Because many features (or dimensions) are involved in a high-dimensional
data set, detecting outliers without providing any interpretation as to why they are outliers is not very
useful. The interpretation of outliers may come from, for example, specific subspaces that manifest the
outliers or an assessment regarding the “outlier-ness” of the objects. Such interpretation can help users to
understand the possible meaning and significance of the outliers.
Data sparsity: The methods should be capable of handling sparsity in high-dimensional spaces. As the
dimensionality increases, the distance between objects becomes heavily dominated by noise; therefore,
data in high-dimensional spaces are often sparse.
Data subspaces: They should model outliers appropriately, for example, adaptive to the subspaces
signifying the outliers and capturing the local behavior of data. Using a fixed-distance threshold against
all subspaces to detect outliers is not a good idea because the distance between two objects monotonically
increases as the dimensionality increases.
Scalability with respect to dimensionality: As the dimensionality increases, the number of subspaces
increases exponentially. An exhaustive combinatorial exploration of the search space, which contains all
possible subspaces, is not a scalable choice.