Unit - 4 DM
Definition:
Cluster Analysis is the process of finding groups of similar objects in order to form clusters. It is an
unsupervised machine learning technique that acts on unlabelled data. A group of similar data points
comes together to form a cluster, and all the objects in a cluster belong to the same group.
Cluster:
A cluster is nothing but a collection of similar data which is grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as
cars, buses, bicycles, etc. Since this is unsupervised learning, there are no class labels such as Cars or
Bikes for the vehicles; all the data is mixed together and not organised in a structured manner.
Now our task is to convert the unlabelled data to labelled data and it can be done using clusters.
The main idea of cluster analysis is to arrange all the data points into clusters, for example a cars cluster
containing all the cars, a bikes cluster containing all the bikes, and so on.
Simply put, it is the partitioning of unlabelled data into groups of similar objects.
Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering algorithms have to deal
with huge databases. In order to handle extensive databases, the clustering algorithm should be scalable;
if it is not, the results may be inappropriate or simply wrong.
2. High Dimensionality: The algorithm should be able to handle high-dimensional spaces as well as
data sets of small size.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with algorithms
of clustering. It should be capable of dealing with different types of data like discrete, categorical and
interval-based data, binary data etc.
4. Dealing with unstructured data: There would be some databases that contain missing values, and
noisy or erroneous data. If the algorithms are sensitive to such data then it may lead to poor quality
clusters. So it should be able to handle unstructured data and give some structure to the data by
organising it into groups of similar data objects. This makes the job of the data expert easier in order to
process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
This clustering method classifies the information into multiple groups based on the characteristics and
similarity of the data. It is up to the data analyst to specify the number of clusters that has to be
generated for the clustering method.
In the partitioning method, when a database (D) contains multiple (N) objects, the partitioning
method constructs a user-specified number (K) of partitions of the data, in which each partition represents
a cluster and a particular region. Many algorithms come under the partitioning method; some of the
popular ones are K-Means, PAM (K-Medoids), CLARA (Clustering Large Applications), etc.
Below, we will see the working of the K-Means algorithm in detail.
K-Means (A Centroid-Based Technique):
The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing
N objects into K clusters so that the resulting similarity among the data objects inside a group
(intracluster similarity) is high, while the similarity between data objects from different clusters
(intercluster similarity) is low. The similarity of a cluster is determined with respect to the mean value of
the cluster.
It is a type of squared-error algorithm. At the start, K objects are chosen randomly from the dataset, each
of which initially represents a cluster mean (center). Each of the remaining data objects is assigned to
the cluster whose mean it is nearest to, based on its distance from the cluster mean. The new mean of
each cluster is then calculated from the data objects assigned to it.
Algorithm: K mean:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects
Output:
A dataset of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centers (C).
2. (Re)assign each object to the cluster whose mean it is most similar to, based on the mean values of the clusters.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated memberships.
4. Repeat Steps 2 and 3 until no change occurs.
Flowchart:
Example: Suppose we want to group the visitors to a website using just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
No assignments change between Iteration-3 and Iteration-4, so the algorithm stops with these two clusters.
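The iterations above can be reproduced with a short script. Below is a minimal sketch of K-Means on the one-dimensional age data in plain Python (no external libraries); the initial centers 16 and 22 are the ones chosen in the example.

# Minimal 1-D K-Means sketch for the website-visitor ages above.
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]  # initial cluster centers, as in the example

while True:
    # Assignment step: put each age into the cluster with the nearest centroid.
    clusters = [[] for _ in centroids]
    for age in ages:
        nearest = min(range(len(centroids)), key=lambda i: abs(age - centroids[i]))
        clusters[nearest].append(age)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # stop when the means no longer change
        break
    centroids = new_centroids

print(centroids)  # approximately [20.5, 48.89], matching Iteration-3/4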
Hierarchical Clustering
A Hierarchical clustering method works via grouping data into a tree of clusters. Hierarchical
clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the
subsequent steps:
1. Identify the two clusters that are closest to each other, and
2. Merge the two most similar (closest) clusters. These steps are repeated until all the clusters
are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram
called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically
represents this hierarchy; it is an inverted tree that describes the order in which points are merged
(bottom-up view) or clusters are broken up (top-down view).
There are two basic methods to generate a hierarchical clustering:
1. Agglomerative: Initially consider every data point as an individual cluster and, at every
step, merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is
treated as an individual entity or cluster. At every iteration, clusters merge with other
clusters until one cluster is formed.
The algorithm for agglomerative hierarchical clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar (closest) to each other.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the actual algorithm works; no calculations have been performed
below, and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, and F.
Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all
the other clusters.
Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let's
say cluster (B) and cluster (C) are very similar to each other, so we merge them in this step; similarly
for clusters (D) and (E). At last, we get the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two nearest
clusters ((DE) and (F)) together to form the new clusters [(A), (BC), (DEF)].
Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together
to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
Step-5: At last the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
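The merge sequence described above can also be produced with standard libraries. The following is a small sketch using SciPy; the six 2-D coordinates are made-up values chosen only so that B and C (and D and E) start out close together, mirroring the walk-through.

# Agglomerative (bottom-up) hierarchical clustering sketch using SciPy.
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E", "F"]
# Hypothetical 2-D coordinates: B~C and D~E are close, A and F are farther away.
points = [[0.0, 5.0], [2.0, 2.0], [2.2, 2.1], [6.0, 6.0], [6.1, 6.2], [8.5, 3.0]]

# 'single' linkage merges the two clusters with the smallest minimum distance.
Z = linkage(points, method="single")

dendrogram(Z, labels=labels)  # tree of merges, read bottom-up
plt.show()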
2. Divisive:
We can say that the Divisive Hierarchical clustering is precisely the opposite of the Agglomerative
Hierarchical clustering. In Divisive Hierarchical clustering, we take into account all of the data
points as a single cluster and in every iteration, we separate the data points from the clusters which
aren’t comparable. In the end, we are left with N clusters.
Density-Based method
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It is a popular unsupervised learning method used for model construction and machine learning
algorithms. It is a clustering method utilized for separating high-density clusters from low-density
clusters. It divides the data points into many groups so that points lying in the same group will have the
same properties. It was proposed by Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu in
1996.
DBSCAN is designed for use with databases that can accelerate region queries. It can not cluster data
sets with large differences in their densities.
Characteristics
It identifies clusters of any shape in a data set; in other words, it can detect arbitrarily shaped clusters.
It is based on intuitive notions of clusters and noise.
It is very robust in detecting outliers in a data set.
It requires only two parameters (a neighborhood radius eps and a minimum number of points MinPts) and
is largely insensitive to the order in which the points occur in the data set.
Advantages
Specification of number of clusters of data in the data set is not required.
It can find any shape cluster even if the cluster is surrounded by any other cluster.
It can easily find outliers in data set.
It is not much sensitive to noise, it means it is noise tolerant.
It is the second most used clustering method after K-means.
Disadvantages
The quality of the result depends on the distance measure used in the regionQuery function.
Border points may go in any cluster depending on the processing order so it is not completely
deterministic.
It can be expensive when the cost of computing nearest neighbors is high.
It can be slow in execution for higher dimensions.
It adapts poorly to variations in local density.
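As a concrete illustration of the points above, here is a minimal sketch using scikit-learn's DBSCAN on made-up data; eps and min_samples are the two parameters mentioned under Characteristics.

# DBSCAN sketch: two dense blobs plus a few scattered noise points (label -1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=7, size=(5, 2))
X = np.vstack([blob1, blob2, noise])

# eps: neighborhood radius; min_samples: points needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(set(db.labels_))           # e.g. {0, 1, -1}: two clusters plus noise
print(np.sum(db.labels_ == -1))  # number of points flagged as noise/outliers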
Grid-Based Method
STING (STatistical INformation Grid) was proposed by Wang, Yang, and Muntz (VLDB'97). In this
method, the spatial area is divided into rectangular cells. There are several levels of cells corresponding
to different levels of resolution. Each cell at a high level is partitioned into several smaller cells at the
next lower level. The statistical information of each cell is calculated and stored beforehand and is used
to answer queries. The parameters of higher-level cells can be easily calculated from the parameters of
the lower-level cells:
Count, mean, standard deviation (s), min, max
Type of distribution—normal, uniform, etc.
Spatial data queries are then answered using a top-down approach: start from a pre-selected
layer, typically one with a small number of cells. For each cell in the current level, compute the confidence
interval. Remove the irrelevant cells from further consideration. When the current layer has been
examined, proceed to the next lower level. Repeat this process until the bottom layer is reached.
Advantages:
The grid-based computation is query-independent, easy to parallelize, and supports incremental updating.
It is efficient: the complexity is O(K), where K is the number of grid cells at the lowest level.
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.
WaveCluster
It was proposed by Sheikholeslami, Chatterjee, and Zhang (VLDB'98). It is a multi-resolution clustering
approach which applies a wavelet transform to the feature space.
A wavelet transform is a signal processing technique that decomposes a signal into different frequency
sub-bands. WaveCluster is both a grid-based and a density-based method.
Input parameters:
The number of grid cells for each dimension
The wavelet, and the number of applications of the wavelet transform.
CLIQUE (Clustering In QUEst)
CLIQUE is a density-based and grid-based subspace clustering algorithm. The CLIQUE algorithm scales
well with the number of records and the number of dimensions in the dataset because it is
grid-based and uses the Apriori property effectively.
Advantage:
CLIQUE is a subspace clustering algorithm that outperforms K-means, DBSCAN, and Farthest
First in both execution time and accuracy.
CLIQUE can find clusters of any shape and is able to find any number of clusters in any number of
dimensions, where the number is not predetermined by a parameter.
It is one of the simplest methods and gives interpretable results.
Disadvantage:
The main disadvantage of the CLIQUE algorithm is that if the grid cell size is unsuitable for the data,
too much approximation takes place and the correct clusters cannot be found.
Clustering tendency
Before evaluating clustering performance, it is very important to make sure that the data set we are
working with has a clustering tendency and does not consist of uniformly distributed points. If the data
does not have a clustering tendency, then the clusters identified by even state-of-the-art clustering
algorithms may be meaningless.
Some clustering algorithms, such as K-means, require the number of clusters, k, as a clustering parameter.
Getting the optimal number of clusters is very significant in the analysis. If k is too high, each point will
broadly start representing a cluster, and if k is too low, data points are incorrectly clustered.
There is no definitive answer for finding the right number of clusters, as it depends upon (a) the distribution
shape, (b) the scale of the data set, and (c) the clustering resolution required by the user; finding the number
of clusters is therefore a fairly subjective problem. There are two major approaches to finding the optimal
number of clusters:
(1) Domain knowledge
(2) Data-driven approach — if domain knowledge is not available, mathematical methods help in finding
the right number of clusters.
Empirical Method:-
A simple empirical rule for the number of clusters is k ≈ √(N/2), where N is the total number of
data points, so that each cluster contains approximately √(2N) points.
Elbow method:-
Within-cluster variance is a measure of the compactness of a cluster: the lower the within-cluster
variance, the higher the compactness of the cluster formed.
The sum of within-cluster variance, W, is calculated for clustering analyses done with different values of
k. W is a cumulative measure of how well the points are clustered in the analysis. Plotting the k values
against their corresponding sums of within-cluster variance helps in finding the optimal number of clusters.
In the example elbow plot, the optimal number of clusters is 4.
Initially, the error measure (within-cluster variance) decreases as the number of clusters increases. After a
particular point, k = 4, the error measure starts flattening. The cluster number corresponding to that
point, k = 4, should be considered the optimal number of clusters.
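A sketch of how such an elbow plot is typically produced with scikit-learn is shown below; inertia_ is scikit-learn's name for the sum of within-cluster squared distances (the W above), and the sample data are generated, not real.

# Elbow method sketch: plot sum of within-cluster variance (inertia) against k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # data with 4 real groups

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Sum of within-cluster variance (W)")
plt.show()  # the curve flattens ('elbows') near k = 4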
Statistical approach:-
Gap statistic is a powerful statistical method to find the optimal number of clusters, k.
Similar to Elbow method, sum of within-cluster (intra-cluster) variance is calculated for different values of
k.
Then random data points are generated from a reference null distribution, and the sum of within-cluster
variance is calculated for clusterings of this reference data for the same values of k.
In simpler words, the sum of within-cluster variance of the original data set is compared, for each value
of k, with the sum of within-cluster variance of a reference data set (a null reference data set drawn from
a uniform distribution); the ideal k is the value where the 'deviation' or 'gap' between the two is
highest. The gap statistic quantifies this deviation: the larger the gap statistic, the larger the deviation.
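Below is a rough sketch of the gap statistic idea described above; it compares log(W) of the data with log(W) of uniform reference data and omits the standard-error correction used in the full method, so it is for illustration only.

# Gap statistic sketch: compare log(W_k) of the data against uniform reference data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def within_cluster_dispersion(X, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.inertia_  # sum of squared distances to the nearest cluster center

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
n_refs = 10
mins, maxs = X.min(axis=0), X.max(axis=0)

for k in range(1, 7):
    log_w_data = np.log(within_cluster_dispersion(X, k))
    # Reference null distribution: points drawn uniformly over the data's bounding box.
    log_w_refs = []
    for _ in range(n_refs):
        ref = np.random.uniform(mins, maxs, size=X.shape)
        log_w_refs.append(np.log(within_cluster_dispersion(ref, k)))
    gap = np.mean(log_w_refs) - log_w_data
    print(k, round(gap, 3))  # the k with the largest gap is a good candidate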
Clustering quality
Once clustering is done, how well the clustering has performed can be quantified by a number of metrics.
Ideal clustering is characterised by minimal intra cluster distance and maximal inter cluster distance.
There are majorly two types of measures to assess the clustering performance.
(i) Extrinsic measures, which require ground-truth labels. Examples are the Adjusted Rand Index, Fowlkes-
Mallows scores, mutual-information-based scores, homogeneity, completeness, and the V-measure.
(ii) Intrinsic measures, which do not require ground-truth labels. Examples are the Silhouette Coefficient,
the Calinski-Harabasz Index, the Davies-Bouldin Index, etc.
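For the intrinsic measures, scikit-learn provides ready-made functions; a small sketch on made-up data:

# Intrinsic clustering-quality sketch: no ground-truth labels are needed.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))         # closer to +1 is better
print(calinski_harabasz_score(X, labels))  # higher is better
print(davies_bouldin_score(X, labels))     # lower is better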
Clustering High-Dimensional Data
High-dimensional data is reduced to low-dimensional data to make the clustering and the search for
clusters simpler. Some applications need appropriate models of clusters, especially for high-dimensional
data. Clusters in high-dimensional data are often significantly small, and the conventional
distance measures can be ineffective. Instead, to find the hidden clusters in high-dimensional data we
need to apply sophisticated techniques that can model correlations among the objects in subspaces.
Subspace Clustering Methods:
There are 3 Subspace Clustering Methods:
Subspace search methods
Correlation-based clustering methods
Biclustering methods
Subspace clustering approaches search for clusters existing in subspaces of the given high-
dimensional data space, where a subspace is defined using a subset of attributes of the full space.
1. Subspace Search Methods: A subspace search method searches the subspaces for clusters. Here, the
cluster is a group of similar types of objects in a subspace. The similarity between the clusters is
measured by using distance or density features. CLIQUE algorithm is a subspace clustering method.
Subspace search methods search a series of subspaces. There are two approaches in subspace search
methods: the bottom-up approach starts searching from the low-dimensional subspaces; if the hidden
clusters are not found in the low-dimensional subspaces, it then searches in higher-dimensional subspaces.
The top-down approach starts searching from the high-dimensional subspaces and then searches in subsets
of the low-dimensional subspaces. Top-down approaches are effective if the subspace of a cluster can be
determined by the local neighborhood.
2. Correlation-Based Clustering: Correlation-based approaches discover hidden clusters by
developing advanced correlation models. Correlation-based models are preferred if it is not possible to
cluster the objects using subspace search methods. Correlation-based clustering includes advanced
mining techniques for correlation cluster analysis. Biclustering methods are correlation-based clustering
methods in which both the objects and the attributes are clustered.
3. Biclustering Methods:
Biclustering means clustering the data based on two factors: in some applications we can cluster both
objects and attributes at the same time. The resulting clusters are called biclusters. Biclustering has
four requirements:
Only a small set of objects participates in a cluster.
A cluster only involves a small number of attributes.
An object may participate in multiple clusters, or may not participate in any cluster at all.
An attribute may be involved in multiple clusters, or in no cluster at all.
Objects and attributes are not treated in the same way. Objects are clustered according to their attribute
values. We treat Objects and attributes as different in biclustering analysis.
Clustering with Constraints
There are various methods for clustering with constraints that can handle specific kinds of constraints:
Handling hard constraints: Hard constraints can be handled by respecting the constraints in the
cluster assignment procedure, i.e., the constraints are strictly enforced when objects are assigned to clusters.
Generating super-instances for must-link constraints: The transitive closure of the must-link
constraints can be computed; must-link constraints therefore form an equivalence relation, which
defines subsets of objects. The objects in each such subset can be replaced by their mean
(a "super-instance"), which reduces the number of objects to be clustered.
Handling soft constraints: Clustering with soft constraints is an optimization process: a penalty is
added whenever a constraint is violated, so the aim of the optimization is to minimize both the
clustering cost and the constraint-violation penalty. For example, given a data set and a set of soft
constraints, the CVQE (Constrained Vector Quantization Error) algorithm applies K-means clustering
with a penalty for constraint violations. The objective of CVQE is the total distance used in K-means,
adjusted with the following penalties:
Penalty for a must-link violation: If a must-link constraint on objects x and y is violated,
i.e., x and y are assigned to two different centers C1 and C2, then the distance between C1
and C2 is added to the objective function as the penalty.
Penalty for a cannot-link violation: This penalty differs from the must-link case: if a
cannot-link constraint on objects x and y is violated, i.e., both are assigned to the same
center C, then the distance between C and the closest other center is added to the objective
function and treated as the penalty.
Web Mining
Web mining is the application of data mining techniques to automatically discover and extract
information from Web documents and services. The main purpose of web mining is to discover useful
information from the World Wide Web and its usage patterns.
Spatial Data Mining
Spatial data mining refers to the process of extracting knowledge, spatial relationships, and interesting
patterns that are not explicitly stored in a spatial database.
The emergence of spatial data and extensive usage of spatial databases has led to spatial knowledge
discovery. Spatial data mining can be understood as a process that determines some exciting and
hypothetically valuable patterns from spatial databases.
Several tools exist that assist in extracting information from geospatial data. These tools play a vital
role for organizations like NASA, the National Imagery and Mapping Agency (NIMA), the National
Cancer Institute (NCI), and the United States Department of Transportation (USDOT), which tend to
make big decisions based on large spatial datasets.
Spatial data is the data associated with physical, real-life locations such as towns, cities, islands, etc.
Spatial data is basically of three different types, widely used in commercial sectors:
1. Map data :
Map data includes different types of spatial features of objects in map, e.g – an object’s shape and
location of object within map. The three basic types of features are points, lines, and polygons (or
areas).
Points –
Points are used to represent spatial characteristics of objects whose locations correspond to
single 2-D coordinates (x, y, or longitude/latitude) in the scale of particular application. For
examples : Buildings, cellular towers, or stationary vehicles. Moving vehicles and other moving
objects can be represented by sequence of point locations that change over time.
Lines –
Lines represent objects having length, such as roads or rivers, whose spatial characteristics can
be approximated by sequence of connected lines.
Polygons –
Polygons are used to represent characteristics of objects that have boundary, like states, lakes, or
countries.
2. Attribute data :
It is the descriptive data that Geographic Information Systems (GIS) associate with features on the map.
For example, in a map representing the districts of an Indian state (e.g., Odisha), the attributes could be
population, largest city/town, area in square miles, and so on.
3. Image data :
It includes camera created data like satellite images and aerial photographs. Objects of interest, such
as buildings and roads, can be identified and overlaid on these images. Aerial and satellite images
are typical examples of raster data.
Models of Spatial Information :
It is divided into two categories :
Field :
These models are used to model spatial data that is continuous in nature, e.g. terrain elevation, air
quality index, temperature data, and soil variation characteristics.
Object :
These models have been used for applications such as transportation networks, land parcels,
buildings, and other objects that possess both spatial and non-spatial attributes.
A spatial application is modeled using either a field-based or an object-based model, depending on the
requirements and the traditional choice of model for the application. Examples include high-traffic
analysis systems, etc.
Spatial data must have latitude or longitude, UTM easting or northing, or some other coordinates denoting
a point's location in space. Beyond that, spatial data can contain any number of attributes pertaining to a
place. You can choose the types of attributes you want to describe a place. Government websites provide
a resource by offering spatial data, but you need not be limited to what they have produced. You can
produce your own.
Say, for example, you wanted to log information about every location you've visited in the past week.
This might be useful to provide insight into your daily habits. You could capture your destination's
coordinates and list a number of attributes such as place name, the purpose of visit, duration of visit, and
more. You can then create a shapefile in Quantum GIS or similar software with this information and use
the software to query and visualize the data. For example, you could generate a heatmap of the most
visited places or select all places you've visited within a radius of 8 miles from home.
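As a small illustration of the kind of query mentioned above (all visited places within a radius of 8 miles from home), here is a sketch using the haversine formula; the coordinates, place names, and attributes are made up.

# Radius query sketch: find logged places within 8 miles of home (haversine distance).
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

home = (40.7128, -74.0060)  # hypothetical home coordinates
visits = [  # (place name, latitude, longitude, purpose) - made-up attribute data
    ("Grocery store", 40.7306, -73.9866, "shopping"),
    ("Office", 40.7580, -73.9855, "work"),
    ("Beach", 40.5749, -73.9857, "leisure"),
]

for name, lat, lon, purpose in visits:
    d = haversine_miles(home[0], home[1], lat, lon)
    if d <= 8.0:
        print(f"{name} ({purpose}): {d:.1f} miles from home")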
OUTLIER
DEFINITION
An outlier is an object that deviates significantly from the rest of the objects. They can be caused by
measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier
mining.
1. Global Outliers
They are also known as Point Outliers. These are the simplest form of outliers. If, in a given dataset, a
data point strongly deviates from all the rest of the data points, it is known as a global outlier. Mostly,
all of the outlier detection methods are aimed at finding global outliers.
For example, in an intrusion detection system, if a large number of packets are broadcast in a very short
span of time, this may be considered a global outlier, and we can say that that particular system has
potentially been hacked.
(In the accompanying figure, the red data point is a global outlier.)
2. Collective Outliers
As the name suggests, if in a given dataset, some of the data points, as a whole, deviate significantly
from the rest of the dataset, they may be termed as collective outliers. Here, the individual data objects
may not be outliers, but when seen as a whole, they may behave as outliers. To detect these types of
outliers, we might need background information about the relationship between those data objects
showing the behavior of outliers.
For example: In an intrusion detection system, a DOS (denial-of-service) packet sent from one computer
to another may be considered normal behavior. However, if this happens with several computers at
the same time, it may be considered abnormal behavior, and as a whole these events can be termed
collective outliers.
3. Contextual Outliers
They are also known as Conditional Outliers. Here, a data object is a contextual outlier if it deviates
significantly from the other data points, but only with respect to a specific context or condition. A data
point may be an outlier under a certain condition and show normal behavior under another condition.
Therefore, a context has to be specified as part of the problem statement in order to identify contextual
outliers. Contextual outlier analysis provides flexibility for users where one can examine outliers in
different contexts, which can be highly desirable in many applications. The attributes of the data point
are decided on the basis of both contextual and behavioral attributes.
For example: A temperature reading of 40°C may behave as an outlier in the context of a “winter
season” but will behave like a normal data point in the context of a “summer season”.
A low temperature value in June is a contextual outlier because the same value in December is not an
outlier.
Outlier Analysis
Outlier analysis is important because an outlier may indicate an error in the experiment or genuinely
unusual behavior. Outliers are extensively used in various areas such as detecting fraud, spotting
potential new trends in the market, and others.
Usually, outliers are confused with noise. However, outliers are different from noise data in the following
sense:
a. Noise is a random error, but outlier is an observation point that is situated away from different
observations.
b. Noise should be removed for better outlier detection.
Uses of outlier detection:
a. It is used in identifying frauds in banking sectors, such as credit card hacking or similar
frauds.
b. It is used in observing the change in trends of buying patterns of a customer.
c. It is used in identifying the typing errors and reporting errors made by humans.
d. It is used in discovering the errors or faults in machines or systems.
Applications of outlier detection:
a. Fraud Detection
b. Telecom Fraud Detection
c. Intrusion Detection in Cyber Security
d. Medical Analysis
e. Environment Monitoring such as Cyclone, Tsunami, Floods, Drought and so on
f. Noticing unforeseen entries in Databases
Detecting outliers using z-scores:
You can convert extreme data points into z-scores that tell you how many standard deviations
away they are from the mean.
If a value has a high enough or low enough z-score, it can be considered an outlier. As a rule of
thumb, values with a z-score greater than 3 or less than –3 are often considered outliers.
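A short sketch of this z-score rule on made-up data:

# Z-score outlier sketch: flag values more than 3 standard deviations from the mean.
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=170, scale=5, size=50), 300.0)  # made-up data plus one extreme value

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]  # rule of thumb: |z| > 3
print(outliers)                          # flags the extreme value (300.0)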
Detecting outliers using the interquartile range (IQR):
With this method, your outliers are any values greater than your upper fence or less than your lower
fence, which are calculated in the steps below.
Step 1: Sort your data from low to high
Unsorted data: 25 37 24 28 35 22 31 53 41 64 29
Sorted data: 22 24 25 28 29 31 35 37 41 53 64
Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of your dataset when all values are ordered from
low to high.
Since you have 11 values, the median is the 6th value. The median value is 31.
22 24 25 28 29 31 35 37 41 53 64
Next, use the exclusive method for identifying Q1 and Q3. This means we remove the median
from our calculations.
The Q1 is the value in the middle of the first half of your dataset, excluding the median. The first
quartile value is 25.
22 24 25 28 29
Your Q3 value is in the middle of the second half of your dataset, excluding the median. The
third quartile value is 41.
35 37 41 53 64
Step 3: Calculate the IQR
The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the
IQR.
Formula: IQR = Q3 – Q1
Calculation: IQR = 41 – 25 = 16
Step 4: Calculate your upper fence
Formula: Upper fence = Q3 + (1.5 × IQR)
Calculation: Upper fence = 41 + (1.5 × 16) = 41 + 24 = 65
Step 5: Calculate your lower fence
Formula: Lower fence = Q1 – (1.5 × IQR)
Calculation: Lower fence = 25 – (1.5 × 16) = 25 – 24 = 1
Step 6: Use your fences to highlight any outliers
Any value greater than the upper fence (65) or less than the lower fence (1) is an outlier. In this data
set no value falls outside the fences, so the IQR rule flags no outliers here:
22 24 25 28 29 31 35 37 41 53 64
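The same fence calculation can be scripted. Below is a small sketch on the 11 values above, computing the quartiles with the exclusive method described in Step 2:

# IQR fence sketch for the worked example above (exclusive-method quartiles).
data = sorted([25, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29])

def median(xs):
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

m = median(data)                          # 31
lower_half = data[:len(data) // 2]        # values below the median position
upper_half = data[(len(data) + 1) // 2:]  # values above the median position
q1, q3 = median(lower_half), median(upper_half)  # 25 and 41
iqr = q3 - q1                 # 16
upper_fence = q3 + 1.5 * iqr  # 65.0
lower_fence = q1 - 1.5 * iqr  # 1.0

outliers = [x for x in data if x > upper_fence or x < lower_fence]
print(m, q1, q3, iqr, lower_fence, upper_fence, outliers)  # outliers is empty here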
Proximity-Based Methods in Data Mining
Proximity-based methods are an important technique in data mining. They are employed to find
patterns in large databases by scanning documents for certain keywords and phrases. They are highly
prevalent because they do not require expensive hardware or much storage space, and they scale up
efficiently as the size of databases increases.
Advantages of Proximity-Based Methods:
1. Proximity-based methods make use of machine learning techniques, in which algorithms are trained
to respond to certain patterns.
2. Using a random sample of documents, the machine learning algorithm analyzes the keywords and
phrases used in them and makes predictions about the probability that these words appear together
across all documents.
3. Proximity can be calculated by calculating a similarity score between two collections of training
data and then comparing these scores. The algorithm then tries to compute the maximum similarity
score for two distinct sets of training items.
Disadvantages of Proximity-Based Methods:
1. Important words may not be as close in proximity as we expected.
2. Over-segmentation of documents into phrases. To counter these problems, a lexical chain-based
algorithm has been proposed.
Proximity-based methods perform very well for finding sets of documents that contain certain words
based on background knowledge. But performance is limited when the background knowledge has not
been pre-classified into categories.
To find sets of documents containing certain categories, one must assign categorical values to each
document and then run proximity-based methods on these documents as training data, hoping for
accurate representations of the categories.
One way to identify outliers is by calculating their distance from the rest of the data set; this is known as
density-based outlier detection.
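One common density-based implementation of this idea is the Local Outlier Factor (LOF); below is a minimal sketch using scikit-learn on made-up data, where the point far from the dense group is flagged.

# Density-based outlier sketch: Local Outlier Factor compares each point's local
# density with that of its neighbors; points in much sparser regions get label -1.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               [[6.0, 6.0]]])  # one isolated (made-up) point far from the cluster

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)    # -1 marks outliers, 1 marks inliers

print(X[labels == -1])         # the isolated point is flagged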
Ideally, outlier detection methods for high-dimensional data should meet the challenges that follow.
Interpretation of outliers: They should be able to not only detect outliers, but also provide an
interpretation of the outliers. Because many features (or dimensions) are involved in a high-dimensional
data set, detecting outliers without providing any interpretation as to why they are outliers is not very
useful. The interpretation of outliers may come from, for example, specific subspaces that manifest the
outliers or an assessment regarding the “outlier-ness” of the objects. Such interpretation can help users to
understand the possible meaning and significance of the outliers.
Data sparsity: The methods should be capable of handling sparsity in high-dimensional spaces. As the
dimensionality increases, the distance between objects becomes heavily dominated by noise; therefore,
data in high-dimensional spaces are often sparse.
Data subspaces: They should model outliers appropriately, for example, adaptive to the subspaces
signifying the outliers and capturing the local behavior of data. Using a fixed-distance threshold against
all subspaces to detect outliers is not a good idea because the distance between two objects monotonically
increases as the dimensionality increases.
Scalability with respect to dimensionality: As the dimensionality increases, the number of subspaces
increases exponentially. An exhaustive combinatorial exploration of the search space, which contains all
possible subspaces, is not a scalable choice.