Data Mining Unit-IV
UNIT-IV
Cluster Analysis:
Cluster analysis, also known as clustering, is the process of partitioning a set of data points (or
objects) into groups, called clusters, such that data points within a cluster are similar to one
another, yet dissimilar to data points in other clusters.
(or)
Clustering is an unsupervised learning technique that groups similar data points together based
on their features. Clustering is similar to classification, but in this case the class labels are
unknown.
The main goal of Clustering is to identify important relationships or patterns in the data.
There are many different algorithms used for cluster analysis, such as k-means, hierarchical
clustering, and density-based clustering.
The different clustering methods may generate different clusters on the same data set.
Fig: Clustering
Applications of Clustering in Data Mining
Clustering is a widely used technique in data mining and has numerous applications in various
fields. Some of the common applications of clustering in data mining include -
Customer Segmentation
Clustering techniques in data mining can be used to group customers with similar behavior,
preferences, and purchasing patterns to create more targeted marketing campaigns.
Image Segmentation
Clustering techniques in data mining can be used to segment images into different regions
based on their pixel values, which can be useful for tasks such as object recognition and
image compression.
Anomaly Detection
Clustering techniques in data mining can be used to identify outliers or anomalies in
datasets that deviate significantly from normal behavior.
Text Mining
Clustering techniques in data mining can be used to group documents or texts with similar
content, which can be useful for tasks such as document summarization and topic
modelling.
Recommender Systems
Clustering techniques in data mining can be used to group users with similar interests or
behavior to create more personalized recommendations for products or services.
Types of Data in Cluster Analysis:
First, let us look at the types of data structures that are widely used in cluster analysis.
In cluster analysis, various data structures are used to represent and organize data efficiently during
the clustering process. The choice of data structure often depends on the algorithm being employed
and the characteristics of the data.
The following two data structures are commonly used in cluster analysis:
Data Matrix:
This matrix represents n objects, such as persons, with p variables (also called attributes or
measurements), such as gender, age, height, weight, etc.
This data structure takes the form of a relational table, or n-by-p matrix (n objects x p variables).
The Data Matrix is commonly referred to as a two-mode matrix because its rows and columns
represent different entities.
Dissimilarity Matrix:
This matrix stores the proximities (dissimilarities) between all pairs of the n objects, so it takes the
form of an n-by-n table.
The Dissimilarity Matrix is commonly referred to as a one-mode matrix since its rows and columns
represent the same entity (the objects).
Types Of Data:
Cluster analysis is a technique used in data mining and statistics to categorize data points into
groups or clusters based on their similarities. Different types of data can be used in cluster analysis,
and the choice of data type often depends on the nature of the data and the goals of the analysis.
1. Continuous Data:
o Numeric Data:
- This type of data includes variables that can take any numerical value, such as
height, weight, temperature, etc.
- Algorithms like k-means clustering are commonly used for such data.
o Interval-Scaled Attributes:
- Interval-scaled attributes are measured on a scale of equal-size units. The values of
interval-scaled attributes have order and can be positive, 0, or negative.
- We can compare and quantify the difference between values of interval attributes.
Example:
A temperature attribute is an interval-scaled attribute.
- We can quantify the difference between values. For example, a temperature of
20°C is five degrees higher than a temperature of 15°C.
2. Categorical Data:
o Nominal Data:
- Nominal data represents categories without any inherent order or ranking.
Examples include colors, gender, or types of fruits.
- Algorithms like k-modes clustering can be used for nominal data.
- The nominal attribute values do not have any meaningful order about them and
they are not quantitative.
- It makes no sense to find the mean (average) value or median (middle) value for
such an attribute.
- However, we can find the attribute's most commonly occurring value (mode).
o Ordinal Data: Ordinal data represents categories with a meaningful order but with no
fixed intervals between them. Examples include educational levels or customer
satisfaction ratings.
Examples:
An ordinal attribute drink_size corresponds to the size of drinks available at a fast-food
restaurant.
– This attribute has three possible values: small, medium, and large.
– The values have a meaningful sequence (which corresponds to increasing drink
size);
– However, we cannot tell from the values how much bigger, say, a large is than a
medium.
An ordinal attribute Customer satisfaction had the following ordinal categories:
0: very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied, and 4: very
satisfied.
The central tendency of an ordinal attribute can be represented by its mode and its
median (middle value in an ordered sequence), but the mean cannot be defined.
3. Binary Data:
o Binary Variables: Data that can take only two values, such as 0 or 1, true or false.
Binary data is common in fields like genetics or market basket analysis.
o In other words, a binary attribute is a special nominal attribute with only two states: 0
or 1, where 0 typically means that the attribute is absent and 1 means that it is present.
Symmetric Binary Attribute:
A binary attribute is symmetric if both of its states are equally valuable and carry the
same weight. Example: the attribute gender having the states male and female.
Asymmetric Binary Attribute:
A binary attribute is asymmetric if the outcomes of the states are not equally important.
Example: Test results for COVID patient: Positive (1) and Negative (0).
4. Mixed Data:
o Mixed Attributes: Many real datasets contain a mixture of the attribute types described
above (numeric, nominal, ordinal, and binary), and the chosen dissimilarity measure
must handle all of them together.
5. Text Data:
o Document Data: Documents or other texts can be clustered once they are converted
into a suitable numeric representation, which is useful for tasks such as topic modelling
(see Text Mining above).
6. Temporal Data:
o Temporal Data: Time series clustering involves grouping data points based on their
temporal patterns. This is common in applications such as financial analysis, sensor
data, or climate studies.
7. Geo-Spatial Data:
o Spatial Data: Data with geographic coordinates can be clustered based on spatial
proximity. This is useful in applications like location-based services or geographic
information systems (GIS).
Categorization of Major Clustering Methods:
In general, the major fundamental clustering methods can be classified into the following
categories.
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-Based Methods
Partitioning Methods:
Suppose we have a set of n objects. A partitioning method constructs k partitions of the data, where
each partition represents a cluster and k <= n. That is, it divides the data into k groups such that
each group contains at least one object.
Given the number of partitions k to construct, a partitioning method first creates an initial
partitioning. It then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
The general criterion of a good partitioning is that objects in the same cluster are “close” or related
to each other, whereas objects in different clusters are “far apart” or very different.
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects.
A hierarchical method can be classified based on how the hierarchical decomposition is formed.
Agglomerative Approach
This approach, also called the bottom-up approach, starts with each object forming a
separate group. It successively merges the objects or groups close to one another, until all
the groups are merged into one.
Divisive Approach
This approach, also called the top-down approach, starts with all the objects in the
same cluster. In each successive iteration, a cluster is split into smaller clusters, until
eventually each object is in one cluster
Density-based Methods:
Most partitioning methods are distance-based. Such methods can find only spherical-shaped
clusters and encounter difficulty in discovering clusters of arbitrary shapes.
These methods have been developed based on the notion of density. Their general idea is to
continue growing a given cluster as long as the density (number of objects or data points) in the
“neighborhood” exceeds some threshold.
Grid-Based Methods:
In grid-based methods, the object space is divided into a finite number of cells, forming a grid
structure. All the clustering operations are performed on this grid structure (i.e., on the cells).
The main advantage of this approach is its fast processing time, which is typically independent of
the number of data objects and dependent only on the number of cells in each dimension of the
quantized space.
Partitioning Algorithms: K-Means and K-Medoids
Formally, given a data set, D, of n objects, and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster.
The most commonly used partitioning methods are k-means and k-medoids.
K-Means Clustering:
K-means is an iterative algorithm that divides an unlabelled dataset into k different clusters in such a
way that each data point belongs to only one group of data points with similar properties.
It allows us to cluster the data into different groups in the unlabelled dataset on its own without the
need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data points and their corresponding
cluster centroids.
The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters,
and repeats the process until the clusters no longer change. The value of k should be
predetermined in this algorithm.
K-Means Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids (they need not be from the input dataset).
Step-3: Assign each data point to its closest centroid according to the similarity or distance
measure (e.g., Euclidean distance).
Step-4: Recompute each centroid as the mean of the data points assigned to that cluster.
Step-5: Repeat steps 3 and 4, i.e., reassign each data point to the new closest centroid, until the
centroids no longer change.
Example: Suppose we want to group the visitors to a website using just their age as follows:
16, 17, 20, 21, 22, 23, 29, 36, 41, 43, 44, 45, 61, 62
Solution:
Step 1:
Let us take K = 2.
Step 2:
Select K random points as the initial centroids, i.e.
centroid (c1) = 17
centroid (c2) = 23
Step 3:
Now, calculate the distance from each data point to each centroid.
Let Distance 1 be the distance from each data point (xi) to centroid 1 and Distance 2 be the
distance from each data point (xi) to centroid 2:
Distance 1 = | xi - c1 |
Distance 2 = | xi - c2 |
Iteration 1:
With c1 = 17 and c2 = 23, the points 16 and 17 are assigned to cluster 1 and the remaining points
are assigned to cluster 2.
Step 4:
The updated centroids are
centroid (c1) = mean (16, 17) = 16.5
centroid (c2) = mean (20, 21, 22, 23, 29, 36, 41, 43, 44, 45, 61, 62) = 37.25
Iteration 2:
xi    c1     c2      Distance 1   Distance 2   Nearest Cluster
16    16.5   37.25   0.5          21.25        1
17    16.5   37.25   0.5          20.25        1
20    16.5   37.25   3.5          17.25        1
21    16.5   37.25   4.5          16.25        1
22    16.5   37.25   5.5          15.25        1
23    16.5   37.25   6.5          14.25        1
29    16.5   37.25   12.5         8.25         2
36    16.5   37.25   19.5         1.25         2
41    16.5   37.25   24.5         3.75         2
43    16.5   37.25   26.5         5.75         2
44    16.5   37.25   27.5         6.75         2
45    16.5   37.25   28.5         7.75         2
61    16.5   37.25   44.5         23.75        2
62    16.5   37.25   45.5         24.75        2
The updated centroids are
centroid (c1) = mean of cluster 1 = 19.83
centroid (c2) = mean of cluster 2 = 45.12
Iteration 3:
xi    c1      c2      Distance 1   Distance 2   Nearest Cluster
16    19.83   45.12   3.83         29.12        1
17    19.83   45.12   2.83         28.12        1
20    19.83   45.12   0.17         25.12        1
21    19.83   45.12   1.17         24.12        1
22    19.83   45.12   2.17         23.12        1
23    19.83   45.12   3.17         22.12        1
29    19.83   45.12   9.17         16.12        1
36    19.83   45.12   16.17        9.12         2
41    19.83   45.12   21.17        4.12         2
43    19.83   45.12   23.17        2.12         2
44    19.83   45.12   24.17        1.12         2
45    19.83   45.12   25.17        0.12         2
61    19.83   45.12   41.17        15.88        2
62    19.83   45.12   42.17        16.88        2
The Updated centroids are
centroid (c1) = 21.14
centroid (c2) = 47.42
Iteration 4:
xi    c1      c2      Distance 1   Distance 2   Nearest Cluster
16    21.14   47.42   5.14         31.42        1
17    21.14   47.42   4.14         30.42        1
20    21.14   47.42   1.14         27.42        1
21    21.14   47.42   0.14         26.42        1
22    21.14   47.42   0.86         25.42        1
23    21.14   47.42   1.86         24.42        1
29    21.14   47.42   7.86         18.42        1
36    21.14   47.42   14.86        11.42        2
41    21.14   47.42   19.86        6.42         2
43    21.14   47.42   21.86        4.42         2
44    21.14   47.42   22.86        3.42         2
45    21.14   47.42   23.86        2.42         2
61    21.14   47.42   39.86        13.58        2
62    21.14   47.42   40.86        14.58        2
The updated centroids remain
centroid (c1) = 21.14
centroid (c2) = 47.42
Since the centroids do not change between iterations 3 and 4, the algorithm stops. Two clusters
have been identified: {16, 17, 20, 21, 22, 23, 29} and {36, 41, 43, 44, 45, 61, 62}.
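The same final clusters can be reproduced with a minimal 1-D k-means sketch in Python (the initial centroids 17 and 23 are taken from the worked example; how ties for equidistant points are broken may differ from the hand calculation, but the final clusters and centroids come out the same):

```python
# Minimal 1-D k-means on the ages used above.
ages = [16, 17, 20, 21, 22, 23, 29, 36, 41, 43, 44, 45, 61, 62]
centroids = [17.0, 23.0]          # Step 2: initial centroids

while True:
    # Step 3: assign each point to its nearest centroid
    clusters = [[] for _ in centroids]
    for x in ages:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)

    # Step 4: recompute each centroid as the mean of its cluster
    new_centroids = [sum(c) / len(c) for c in clusters]

    # Step 5: stop when the centroids no longer change
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(clusters)    # [[16, 17, 20, 21, 22, 23, 29], [36, 41, 43, 44, 45, 61, 62]]
print(centroids)   # [21.14..., 47.42...]
```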
K-Medoids Clustering:
In the K-Medoids algorithm, each cluster is represented by an actual data point called a medoid,
which serves as the cluster center. The medoid is the point whose sum of distances to all other
points in the same cluster is minimum. Any suitable distance metric, such as Euclidean distance or
Manhattan distance, can be used.
K-Medoids Algorithm:
Step-1: Choose k random points from the data and consider these k points as the initial
medoids.
Step-2: For all the remaining data points, calculate the distance from each medoid and assign it to
the cluster with the nearest medoid.
Step-3: Calculate the total cost (Sum of all the distances from all the data points to the medoids)
Step-4: Select k random points as the new medoids and swap the previous medoids with them.
Repeat steps 2 and 3.
Step-5: If the total cost with the new medoids is less than that with the previous medoids, make
the new medoids permanent and repeat step 4.
Step-6: If the total cost with the new medoids is greater than that with the previous medoids, undo
the swap and repeat step 4.
Step-7: The repetitions continue until no change of medoids reduces the total cost, i.e., the
clustering no longer changes.
Example: Suppose we want to group the visitors to a website using just their age as follows:
16, 17, 21, 23, 29, 36, 41, 43
Solution:
Select K = 2 random points as the initial medoids, i.e.
medoid (m1) = 17
medoid (m2) = 23
Now, calculate the distance from each data point to each medoid.
Let Distance 1 be the distance from each data point (Xi) to medoid 1 and Distance 2 be the
distance from each data point (Xi) to medoid 2:
Distance 1 = | Xi - m1|
Distance 2 = | Xi - m2|
Iteration 1:
The Initial medoids are
medoid1 (m1) = 17
medoid2 (m2) = 23
xi    m1    m2    Distance 1   Distance 2   Nearest Cluster
16    17    23    1            7            1
17    17    23    0            6            1
21    17    23    4            2            2
23    17    23    6            0            2
29    17    23    12           6            2
36    17    23    19           13           2
41    17    23    24           18           2
43    17    23    26           20           2
Cost of cluster 1 = 1 + 0 = 1
Cost of cluster 2 = 2 + 0 + 6 + 13 + 18 + 20 = 59
Total cost = 1 + 59 = 60
Iteration 3:
Select new random medoids
medoid1 (m1) = 21
medoid2 (m2) = 41
xi    m1    m2    Distance 1   Distance 2   Nearest Cluster
16    21    41    5            25           1
17    21    41    4            24           1
21    21    41    0            20           1
23    21    41    2            18           1
29    21    41    8            12           1
36    21    41    15           5            2
41    21    41    20           0            2
43    21    41    22           2            2
Cost of cluster 1 = 5 + 4 + 0 + 2 + 8 = 19
Cost of cluster 2 = 5 + 0 + 2 = 7
Total cost = 19 + 7 = 26
Since this total cost (26) is less than the previous total cost (60), the new medoids 21 and 41 are
retained. No further swap reduces the cost, so the final clusters are {16, 17, 21, 23, 29} with
medoid 21 and {36, 41, 43} with medoid 41.
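The swap-based idea can also be sketched in a few lines of Python. Instead of picking swaps at random (Step 4 above), the sketch below evaluates every (medoid, non-medoid) swap in each pass and keeps the best one; this is a simplification, but on the same ages it reaches the same medoids and cost as the worked example.

```python
ages = [16, 17, 21, 23, 29, 36, 41, 43]

def total_cost(points, medoids):
    # Sum of distances from every point to its nearest medoid.
    return sum(min(abs(x - m) for m in medoids) for x in points)

medoids = [17, 23]                      # initial medoids from the example
cost = total_cost(ages, medoids)        # 60

while True:
    # Try every (current medoid, non-medoid) swap and remember the best one.
    best_medoids, best_cost = medoids, cost
    for m in medoids:
        for x in ages:
            if x in medoids:
                continue
            candidate = sorted(x if v == m else v for v in medoids)
            c = total_cost(ages, candidate)
            if c < best_cost:
                best_medoids, best_cost = candidate, c
    if best_cost >= cost:               # no swap lowers the cost -> stop
        break
    medoids, cost = best_medoids, best_cost

print(medoids, cost)                    # [21, 41] 26
```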
Exercise: Implement cluster analysis in the following dataset using k-medoid algorithm to find the
best clusters.
X Y
5 4
7 7
1 3
8 6
4 9
K-Means vs. K-Medoids:
K-Means                                                            | K-Medoids
Metric of similarity: Euclidean distance                           | Metric of similarity: Manhattan distance (or any suitable metric)
Clustering is done based on distance from centroids.               | Clustering is done based on distance from medoids.
A centroid can be a data point or some other point in the cluster. | A medoid is always an actual data point in the cluster.
Cannot cope well with outlier data.                                | Can manage outlier data too.
Sometimes, outlier sensitivity can turn out to be useful.          | Has a tendency to ignore meaningful clusters in outlier data.
Hierarchical Clustering:
Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical
representation of the clusters in a dataset. The method starts by treating each data point as a
separate cluster and then iteratively combines the closest clusters until a stopping criterion is
reached.
The hierarchical clustering method works by combining data objects into a tree of clusters. The
result of hierarchical clustering is a tree-like structure, and this tree-like structure is known as the
dendrogram, which illustrates the hierarchical relationships among the clusters.
We have seen that the K-means algorithm has some challenges: the number of clusters must be
predetermined, and it tends to produce clusters of similar (spherical) shape and size.
To address these challenges, we can opt for the hierarchical clustering algorithm because, in this
algorithm, we do not need prior knowledge of the number of clusters.
Types of Hierarchical Clustering
Agglomerative Hierarchical Clustering (AHC):
In this technique, each data point is initially considered an individual cluster. At each iteration,
the most similar (closest) clusters are merged, until only one cluster remains.
In other words, agglomerative clustering follows a bottom-up approach: the algorithm treats each
data point as a single cluster at the beginning and then starts combining the closest pair of clusters.
It does this until all the clusters are merged into a single cluster that contains all the data points.
The working of the AHC algorithm can be explained using the below steps:
Step-1: Create each data point as a single cluster. Let us say there are N data points, so the number
of clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there will
now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will
be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider
the below images:
Example:
Step-1: Consider each data point(alphabet) as a single cluster and calculate the distance of one
cluster from all the other clusters.
Step-2: In this step, comparable clusters are merged to form a single cluster. Let's say clusters (B)
and (C) are very similar to each other, so we merge them; similarly, clusters (D) and (E) are
merged. We are left with the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the distances according to the algorithm and merge the two nearest
clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
Step-4: Repeating the same process, the clusters (DEF) and (BC) are comparable and are merged
together to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
Step-5: At last, the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
Divisive Hierarchical clustering (DHC):
In simple words, we can say that the Divisive Hierarchical clustering is exactly the opposite of the
Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we consider all the data
points as a single cluster and in each iteration, we separate the data points from the cluster which
are not similar. Each data point which is separated is considered as an individual cluster. In the
end, we’ll be left with n clusters.
The working of the DHC algorithm can be explained using the below steps:
Step-1: Consider all the data points as a single cluster. Let us say there are N data points, so the
number of clusters is 1.
Step-2: Separate the data points which are not similar into clusters. So, there will now be 2
clusters.
Step-3: Again, Separate the data points which are not similar into clusters. So, there will now be 3
clusters.
Step-4: Repeat Step 3 until we are left with N clusters. So, we will get the following clusters.
Consider the below images:
Example:
Linkage Methods:
As we have seen, the distance between the two closest clusters is crucial for hierarchical
clustering. There are various ways to calculate the distance between two clusters, and these ways
decide the rule for clustering. These measures are called linkage methods. Some of the popular
linkage methods are given below:
1. Single Linkage: It is the Shortest Distance between the closest points of the clusters.
Consider the below image:
2. Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than single-
linkage.
3. Average Linkage: It is the linkage method in which the distances between all pairs of data
points (one point from each cluster) are added up and then divided by the total number of
pairs to calculate the average distance between two clusters. It is also one of the most
popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroids of
the two clusters is calculated. Consider the below image:
From the above-given approaches, we can apply any of them according to the type of problem or
business requirement.
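For reference, here is a brief sketch of agglomerative clustering with the different linkage methods, assuming SciPy is installed (the five 2-D points are illustrative only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 1]])

# 'single', 'complete', 'average' and 'centroid' correspond to the
# linkage methods described above.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                     # bottom-up merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```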
Density-Based Clustering (DBSCAN):
Density-based clustering is an unsupervised learning approach that groups similar data points in a
dataset based on their density, where density refers to the number of data points within a
specified radius (epsilon, ε) of a given data point.
It can discover clusters of different shapes and sizes from a large amount of data containing noise
and outliers.
The DBSCAN algorithm uses two parameters:
eps (ε): The eps parameter defines the radius of the neighbourhood around a data point x. In
other words, it is the distance measure used to locate the points in the neighbourhood of any
point.
MinPts: The MinPts parameter is the minimum number of neighbours (threshold) within the
eps radius.
In density-based clustering, the number of clusters does not need to be predetermined, but the
values of MinPts and eps do.
In the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, there
are three types of points: core points, border points, and noise points.
Core point: A core point is a point that has at least MinPts within its epsilon neighbourhood.
In other words, a core point has enough nearby neighbours to be considered as part of a
cluster.
Border point: A border point is a point that is not a core point, but it is within the epsilon
neighbourhood of a core point. Border points can be considered as part of a cluster, but they
are not as strongly connected to the cluster as core points.
Noise point: A noise point is a point that does not have any core points within its epsilon
neighbourhood. Noise points are often considered outliers and are not part of any cluster.
DBSCAN algorithm:
Step-1: Identify core points: For each data point, count the number of points within its epsilon (ε)
neighbourhood. If the count is at least MinPts, mark the point as a core point.
Step-2: Form clusters: Starting with a core point, the algorithm finds all the points within its epsilon
neighbourhood and assigns them to the same cluster. It then repeats this process for each core point
and continues to add points to the same cluster as long as they are within epsilon distance of
another point in the cluster.
Step-3: Identify border points: A border point is a point that has fewer than MinPts within its
epsilon neighbourhood, but is reachable from a core point. These points are often considered part
of a cluster but can also be considered noise.
Step-4: Output clusters: The algorithm outputs the clusters that have been identified.
The advantages of DBSCAN algorithm:
Density-based clustering algorithms can handle noisy data since they can identify noise
points as outliers that do not belong to any cluster.
Unlike some other clustering algorithms, density-based clustering can identify clusters of
different shapes and sizes, including irregularly shaped clusters.
They don’t require prior knowledge of the number of clusters, making them more flexible
and versatile.
They can efficiently process large datasets and handle high-dimensional data.
The disadvantages of the DBSCAN algorithm:
They can be computationally expensive and time-consuming, especially for large datasets
with complex structures.
These algorithms may not be suitable for datasets with low-density regions or evenly
distributed data points.
Example: Apply the DBSCAN algorithm to the given data points and create the clusters with
MinPts = 3 and epsilon = 1.9.
Data points:
P1: (3,7)
P2: (4,6)
P3: (5,5)
P4: (6,3)
P5: (7,3)
P6: (6,2)
P7: (3,3)
Solution:
The given parameters are:
MinPts = 3
epsilon (radius) = 1.9
We use the Euclidean distance to calculate the distance between the points.
Pairwise distance matrix:
           P1     P2     P3     P4     P5     P6     P7
P1 (3,7)   0
P2 (4,6)   1.41   0
P3 (5,5)   2.83   1.41   0
P4 (6,3)   5.00   2.83   3.60   0
P5 (7,3)   5.66   4.24   2.83   1.41   0
P6 (6,2)   5.83   4.47   3.16   2.00   1.00   0
P7 (3,3)   4.00   3.16   2.83   3.16   4.00   3.16   0
Next, find the neighbouring data points (points within the epsilon radius) of each data point:
P1: P2
P2: P1, P3
P3: P2
P4: P5
P5: P4, P6
P6: P5
P7: none
Once the neighbouring points have been found, we identify the core, border and noise points.
Point   Neighbouring points   Status
P1      P2                    Border
P2      P1, P3                Core
P3      P2                    Border
P4      P5                    Border
P5      P4, P6                Core
P6      P5                    Border
P7      none                  Noise
P2 and P5 have at least MinPts = 3 points (counting the point itself) within the epsilon radius, so
they are core points. P1, P3, P4 and P6 are not core points, but each lies within the epsilon
neighbourhood of a core point, so they are border points. P7 has no core point in its
neighbourhood, so it is a noise point.
From the above table, there are two core points (P2 and P5), so two clusters are formed:
Cluster 1: {P1, P2, P3} (core point P2 with border points P1 and P3)
Cluster 2: {P4, P5, P6} (core point P5 with border points P4 and P6)
P7 is a noise point (outlier) and does not belong to any cluster.
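This example can be checked with scikit-learn's DBSCAN (assumed installed); its min_samples parameter counts the point itself, which matches MinPts = 3 as used here. Because the pairwise distances are recomputed exactly, the core/border status of individual points may differ slightly from the hand-worked table, but the two clusters and the noise point come out the same.

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[3, 7], [4, 6], [5, 5], [6, 3], [7, 3], [6, 2], [3, 3]])

labels = DBSCAN(eps=1.9, min_samples=3).fit_predict(points)
print(labels)   # e.g. [ 0  0  0  1  1  1 -1]  -> two clusters, P7 labelled -1 (noise)
```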
STING (Statistical Information Grid):
STING is a grid-based clustering technique. It uses a multidimensional grid data structure that
quantizes the space into a finite number of cells. Instead of focusing on the data points themselves,
it focuses on the value space surrounding the data points.
In STING, the spatial area is divided into rectangular cells at several levels of resolution;
high-level cells are divided into several low-level cells.
In STING, statistical information about the attributes in each cell, such as the mean, maximum, and
minimum values, is precomputed and stored as statistical parameters. These statistical parameters
are useful for query processing and other data analysis tasks.
The statistical parameters of higher-level cells can easily be computed from the parameters of
their lower-level cells.
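A rough Python sketch of this idea follows (the random data, grid size, and stored statistics are illustrative assumptions): 2-D points are quantized into bottom-level grid cells, per-cell statistics are precomputed, and a higher-level cell's count is obtained from its child cells without revisiting the raw points.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 8, size=(200, 2))      # 200 points in an 8 x 8 space

cells = 4                                      # bottom layer: 4 x 4 cells
cell_ids = np.floor(points / (8 / cells)).astype(int)

# Statistical parameters stored per bottom-level cell: count, mean, min, max
stats = {}
for i in range(cells):
    for j in range(cells):
        in_cell = points[(cell_ids[:, 0] == i) & (cell_ids[:, 1] == j)]
        stats[(i, j)] = {
            "count": len(in_cell),
            "mean": in_cell.mean(axis=0) if len(in_cell) else None,
            "min": in_cell.min(axis=0) if len(in_cell) else None,
            "max": in_cell.max(axis=0) if len(in_cell) else None,
        }

# The count of a higher-level (2 x 2) parent cell is just the sum of the
# counts of its four child cells; mean, min and max combine similarly.
parent_count = sum(stats[(i, j)]["count"] for i in range(2) for j in range(2))
print(parent_count)
```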
How STING Works:
Step 1: Determine a layer (level of the hierarchy) to begin with.
Step 2: For each cell of this layer, calculate the confidence interval (or estimated range of
probability) that the cell is relevant to the query.
Step 3: From the interval calculated above, label the cell as relevant or not relevant.
Step 4: If this layer is the bottom layer, go to step 6; otherwise, go to step 5.
Step 5: Go down the hierarchy structure by one level. Go to step 2 for those cells that form the
relevant cells of the higher-level layer.
Step 6: If the specification of the query is met, go to step 8; otherwise, go to step 7.
Step 7: Retrieve the data that fall into the relevant cells and do further processing. Return the
result that meets the requirement of the query. Go to step 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the requirement of the
query. Go to step 9.
Step 9: Stop.
Advantages:
Grid-based computing is query-independent: the statistical information stored in each cell is a
summary of the data in that cell and does not depend on any particular query.
Disadvantage:
The main disadvantage of STING is that all cluster boundaries are either horizontal or vertical;
no diagonal boundaries are detected.
Outlier analysis
Outlier analysis, also known as outlier detection or anomaly detection, is the process of identifying
observations or data points that deviate significantly from the rest of the dataset. Outliers can arise
due to measurement errors, data corruption, or genuine unusual behavior in the data.
Analyzing outliers is essential in various fields such as finance, fraud detection, manufacturing, and
healthcare. Here are some common methods and approaches used in outlier analysis:
Statistical Methods:
Standard Deviation Method: Identify outliers as data points that lie beyond a certain number of
standard deviations from the mean.
Z-Score: Calculate the z-score of each data point and consider points with z-scores above a certain
threshold as outliers.
Percentiles: Identify outliers based on percentiles, such as the 95th or 99th percentile of the data
distribution.
Boxplots: Detect outliers as data points that fall outside the whiskers of the boxplot, typically
defined as 1.5 times the interquartile range above or below the upper or lower quartiles.
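A compact sketch of two of the rules above (the z-score rule and the 1.5 x IQR boxplot rule) using NumPy; the data and the z-score threshold of 3 are illustrative choices:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])                 # -> [102]

# Boxplot rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])   # -> [102]
```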
Distance-Based Methods:
k-Nearest Neighbors (kNN): Measure the distance of each data point to its k-nearest neighbors
and flag points with unusually large distances as outliers.
Distance to Centroid: Calculate the distance of each data point to the centroid of its cluster in
clustering algorithms like k-means, and identify points with large distances as outliers.
Local Outlier Factor (LOF): Evaluate the local density of each data point relative to its neighbors
and flag points with significantly lower density as outliers.
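A brief Local Outlier Factor sketch with scikit-learn (assumed installed); the number of neighbours and the synthetic data are illustrative choices:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),   # one dense cluster
               [[4.0, 4.0]]])                      # one far-away point

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)         # -1 marks points judged to be outliers
print(np.where(labels == -1)[0])    # expected to include the last index (50)
```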
Model-Based Methods:
Parametric Models: Fit a probabilistic model to the data (e.g., Gaussian distribution) and identify
data points with low likelihood as outliers.
Non-Parametric Models: Use techniques like kernel density estimation (KDE) to estimate the
underlying probability density function of the data and identify outliers based on regions of low
density.
Isolation Forest: Construct an ensemble of decision trees that isolate outliers by recursively
partitioning the data space.
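A minimal Isolation Forest sketch with scikit-learn (assumed installed); contamination=0.02 is an illustrative guess at the outlier fraction:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)         # -1 = outlier, 1 = normal
print(np.where(labels == -1)[0])    # expected to include the last index (100)
```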
Clustering-Based Methods:
Density-Based Clustering: Identify outliers as data points that do not belong to any dense cluster
in algorithms like DBSCAN or OPTICS.
Cluster-Based Outlier Detection: Detect outliers as data points that are far from the centroids of
their assigned clusters in clustering algorithms like k-means or hierarchical clustering.
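A sketch of the cluster-based idea: fit k-means, then flag the points farthest from their assigned centroid. The choice of k = 2 and the 95th-percentile cutoff are illustrative assumptions (scikit-learn assumed installed):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2)),
               [[20.0, 20.0]]])                    # an obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

cutoff = np.percentile(dist, 95)
print(np.where(dist > cutoff)[0])   # indices of candidate outliers
```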
Supervised Learning:
Classification: Train a classifier to distinguish between normal and outlier instances using labeled
data, then use the classifier to predict outliers in unseen data.
One-Class Classification: Train a model using only normal instances and classify instances that
deviate significantly from the model as outliers.
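A short one-class classification sketch with scikit-learn's OneClassSVM (assumed installed): train on normal instances only, then score new instances. nu=0.05 is an illustrative upper bound on the expected outlier fraction:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(200, 2))        # "normal" training instances

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_normal)

X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(model.predict(X_new))    # +1 = looks normal, -1 = flagged as an outlier
```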
Visualization Techniques:
Scatterplots: Visualize the data using scatterplots and highlight data points that lie far from the
main cluster.
When conducting outlier analysis, it's important to consider domain knowledge, interpretability,
and the specific characteristics of the dataset. Outliers can be valuable sources of insights or noise
depending on the context, so careful consideration is needed when deciding how to handle them.
Additionally, outlier detection is often an iterative process that may require multiple techniques and
validation steps to effectively identify and understand outliers in the data.