Unit 5 Data Science
Program Name: B.C.A                          Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2                          No. of Credits: 03
Contact Hours: 42 Hours                      Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40               Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods
CO3 Understand the basic concepts of Clustering and Classification
CO4 Know the recent trends in Data Science
Contents (42 Hrs)
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs Data Mining, DBMS vs Data Mining, DM Techniques, Problems, Issues and Challenges in DM, DM Applications. (8 Hrs)
Unit II: Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization. (8 Hrs)
Unit III: Mining Frequent Patterns: Basic Concepts, Frequent Itemset Mining Methods, Apriori and Frequent Pattern Growth (FP-Growth) Algorithms, Mining Association Rules. (8 Hrs)
Unit IV: Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k-Nearest Neighbor, Prediction, Accuracy, Precision and Recall. (10 Hrs)
Unit V: Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)
Unit 5
Topics:
Cluster Analysis
Cluster analysis or clustering is the process of grouping a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters.
Clustering is also known as unsupervised learning since groups are made without the knowledge
of class labels.
Clustering is also called data segmentation in some applications because clustering partitions large
data sets into groups according to their similarity.
Ex: grouping customers into different segments by discovering groups of customers with similar purchasing behaviour.
Clustering can also be used for outlier detection, where outliers (values that are “far away” from
any cluster) may be more interesting than common cases.
Clustering methods are broadly categorized as follows:
1. Partitioning method: Divides the data objects into k exclusive groups (clusters) so that objects within a cluster are similar to one another and dissimilar to objects in other clusters. E.g., k-means, k-medoids.
2. Hierarchical method: Groups data objects into a hierarchy ('tree') of clusters, either by merging smaller clusters into larger ones (agglomerative) or by splitting larger clusters (divisive). E.g., BIRCH, Chameleon.
3. Density-based method: Forms clusters around core points, i.e., points that have at least a minimum number of neighboring points within a specified distance (known as the epsilon radius). It expands clusters by connecting these core points to their neighboring points until the density falls below a certain threshold. Points that do not belong to any cluster are considered outliers or noise.
E.g., DBSCAN, OPTICS, DENCLUE, Mean-Shift.
4. Grid-based method: The object space is quantized into a grid structure of a finite number of cells, and the clustering operations are performed on the cells instead of on individual data points. This method is highly efficient for handling spatial data and has a fast processing time that is independent of the number of data objects.
E.g., STatistical INformation Grid (STING), CLIQUE, ENCLUS.
Partitioning Methods: The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a data set into k exclusive clusters based on their similarities and differences.
E.g., k-means and k-medoids.
K-Means Algorithm:
• First, it randomly selects k of the objects in D; each of these initially represents a cluster mean/center.
• In each iteration, every object is assigned to the cluster whose mean/center it is nearest to, based on Euclidean distance, and each cluster mean/center is then updated.
• The iterations continue until the clusters of the current iteration are the same as those of the previous iteration (i.e., the assignments are stable).
Advantages of k-means:
• It is relatively simple to implement and computationally efficient, so it scales well to large data sets.
Disadvantages of k-means:
• It is a bit difficult to predict the number of clusters i.e. the value of k.
• Output is strongly impacted by initial inputs like number of clusters (value of k).
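As an illustration of the algorithm above, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the data set, the choice of k = 3, and the parameter values are assumptions made only for the demo.

```python
# A minimal k-means sketch using scikit-learn (assumed available).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 natural groups (an assumption for the demo).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-means: pick k initial centers, assign points to the nearest center
# (Euclidean distance), recompute the means, and repeat until stable.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(km.cluster_centers_)   # final means/centers of the 3 clusters
print(km.labels_[:10])       # cluster assignment of the first 10 objects
print(km.inertia_)           # sum of squared distances to the nearest center
```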
K-Medoids Algorithm:
It is an improved version of the k-means algorithm, designed mainly to reduce sensitivity to outliers. Instead of taking the mean value to represent a cluster, it uses a medoid: a point in the cluster whose total dissimilarity to all the other points in the cluster is minimal. A representative object (Oi) is chosen randomly for each cluster, and each remaining object is assigned to the cluster whose representative object it is most similar to. The partitioning is then refined based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object.
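There is no k-medoids estimator in core scikit-learn, so the following is a simplified, PAM-style NumPy sketch of the idea just described (alternate between assigning objects to the most similar medoid and re-choosing each medoid as the object with minimal total dissimilarity); the toy data and iteration budget are assumptions.

```python
# Simplified k-medoids (PAM-style) sketch in NumPy -- illustrative only.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy data (assumption)
k = 3

medoid_idx = rng.choice(len(X), k, replace=False)   # random representative objects
for _ in range(20):                                  # fixed iteration budget (assumption)
    D = cdist(X, X[medoid_idx])                      # dissimilarity of each object to each medoid
    labels = D.argmin(axis=1)                        # assign to the most similar medoid
    new_idx = medoid_idx.copy()
    for c in range(k):
        members = np.where(labels == c)[0]
        # new medoid = member with minimal total dissimilarity to its cluster
        within = cdist(X[members], X[members]).sum(axis=1)
        new_idx[c] = members[within.argmin()]
    if np.array_equal(new_idx, medoid_idx):          # stop when the medoids stabilise
        break
    medoid_idx = new_idx

print("medoid objects:", medoid_idx)
```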
Hierarchical Methods
Partitioning methods partition objects into exclusive groups. In some situations, however, we may want the data organized into groups at different levels. A hierarchical clustering method works by grouping data objects into a hierarchy or 'tree' of clusters, which helps summarize the data.
Hierarchical clustering methods can be classified as:
a) Algorithmic methods,
b) Probabilistic methods, and
c) Bayesian methods.
Agglomerative, divisive, and multiphase methods are algorithmic, meaning they consider data objects as deterministic and compute clusters according to the deterministic distances between objects. Probabilistic methods use probabilistic models to compare clusters and measure the quality of clusters by the fitness of the models. Bayesian methods compute a distribution of possible clusterings; that is, instead of outputting a single deterministic clustering over a data set, they return a group of clustering structures and their probabilities, conditional on the given data.
Dendrogram:
A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. A dendrogram is plotted to show the results of a hierarchical clustering method graphically.
Whether using an agglomerative method or a divisive method, a core need is to measure the distance between two clusters. Four widely used measures for the distance between clusters Ci and Cj are as follows, where |p − p'| is the distance between two objects p and p', mi is the mean of cluster Ci, and ni is the number of objects in Ci:
1. Minimum distance: dist_min(Ci, Cj) = min_{p ∈ Ci, p' ∈ Cj} |p − p'|
2. Maximum distance: dist_max(Ci, Cj) = max_{p ∈ Ci, p' ∈ Cj} |p − p'|
3. Mean distance: dist_mean(Ci, Cj) = |mi − mj|
4. Average distance: dist_avg(Ci, Cj) = (1 / (ni nj)) Σ_{p ∈ Ci, p' ∈ Cj} |p − p'|
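To make the four measures concrete, here is a small NumPy sketch that computes them for two toy clusters (the cluster contents are assumptions):

```python
# Computing the four inter-cluster distance measures for two toy clusters.
import numpy as np
from scipy.spatial.distance import cdist

Ci = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # cluster Ci (assumption)
Cj = np.array([[4.0, 4.0], [5.0, 4.0]])               # cluster Cj (assumption)

D = cdist(Ci, Cj)                      # all pairwise distances |p - p'|
dist_min  = D.min()                    # minimum distance (single link)
dist_max  = D.max()                    # maximum distance (complete link)
dist_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # |mi - mj|
dist_avg  = D.mean()                   # (1/(ni*nj)) * sum of all |p - p'|

print(dist_min, dist_max, dist_mean, dist_avg)
```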
1. Single linkage: computes the minimum distance between clusters before merging them.
If the clustering process is terminated when the distance between nearest clusters exceeds
a user-defined threshold, it is called a single-linkage algorithm.
2. Complete linkage: computes the maximum distance between clusters before merging
them. If the clustering process is terminated when the maximum distance between nearest
clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm.
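Both linkage criteria, and the dendrogram mentioned earlier, can be tried with SciPy's hierarchical-clustering utilities; the synthetic data are an assumption for the demo.

```python
# Agglomerative clustering with single vs. complete linkage, plus a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),       # two loose groups (assumption)
               rng.normal(3, 0.3, (10, 2))])

Z_single   = linkage(X, method='single')     # merge on minimum inter-cluster distance
Z_complete = linkage(X, method='complete')   # merge on maximum inter-cluster distance

dendrogram(Z_complete)                       # tree of merges at increasing distances
plt.title("Complete-linkage dendrogram")
plt.show()
```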
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. It uses a clustering feature (CF) to summarize a cluster and a clustering feature tree (CF tree) to represent a cluster hierarchy. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the data points, and SS is the squared sum of the data points in the cluster. A CF tree is a height-balanced tree with two parameters: the branching factor, which limits the number of entries (children) per node, and the threshold, which limits the size of the subclusters stored in the leaf entries. The CF tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster. Every non-leaf entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of the CF entries in its child node. The tree size is a function of the threshold: the larger the threshold, the smaller the tree.
2. Global Clustering: Applies an existing clustering algorithm to the leaf entries (subclusters) of the CF tree rather than to the individual data points. Optionally, these clusters can then be refined.
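scikit-learn's Birch estimator exposes the two CF-tree parameters described above as threshold and branching_factor; the data and parameter values below are assumptions for the demo.

```python
# BIRCH: build a CF tree, then globally cluster the leaf subclusters.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=7)

birch = Birch(threshold=0.5,        # limits the size of subclusters in the leaves
              branching_factor=50,  # max entries per CF-tree node
              n_clusters=4)         # global clustering step on the leaf entries
labels = birch.fit_predict(X)

print(len(birch.subcluster_centers_), "leaf subclusters summarised in the CF tree")
```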
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters.
Chameleon uses a two-phase algorithm to find clusters in a data set:
1. First phase: uses a graph-partitioning algorithm to cluster the data items into many small subclusters.
2. Second phase: uses an agglomerative algorithm to find the genuine clusters by repeatedly combining these subclusters.
Two clusters are merged only if their interconnectivity is high and they are close together.
Probabilistic Hierarchical Clustering:
Traditional hierarchical clustering assumes that there is no uncertainty or noise in the data being clustered. However, this assumption does not hold for many real-world datasets, where there may be missing values, outliers, or measurement errors. Traditional methods also assume that all features have equal importance, which may not always be true.
Probabilistic hierarchical clustering tries to overcome some of these drawbacks by employing probabilistic models to measure the distance between clusters.
Advantages:
• Probabilistic models make the method more robust to noise, outliers, and missing values than traditional hierarchical clustering.
• Cluster quality can be measured in a principled, model-based way.
Disadvantages:
• Computationally expensive.
• Sensitive to initialization parameters.
• Assumes Gaussian distributions within clusters.
• Requires domain knowledge and expertise in statistical modeling.
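Probabilistic hierarchical clustering has no common off-the-shelf implementation, so as a rough illustration of probabilistic, model-based clustering with Gaussian clusters (a related but different technique), here is a minimal Gaussian mixture sketch with scikit-learn; the data and the choice of three components are assumptions.

```python
# Probabilistic, model-based clustering with Gaussian components (illustrative only).
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=3)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=3).fit(X)
labels = gmm.predict(X)          # hard assignment: most probable component
probs = gmm.predict_proba(X)     # soft assignment: probability per cluster

print(gmm.bic(X))                # model-fitness score usable to compare clusterings
```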
Density-Based Methods
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is based on the intuitive notion of "clusters" as dense regions of points separated by regions of low density ("noise"). The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. DBSCAN requires two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two points is less than or equal to eps, they are considered neighbors. If the eps value chosen is too small, a large part of the data will be treated as outliers; if it is chosen very large, clusters will merge and the majority of the data points will end up in the same cluster. One way to find a suitable eps value is the k-distance graph (see the sketch after this list).
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and MinPts should be at least 3.
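Here is a minimal sketch of the k-distance graph mentioned above for choosing eps, using scikit-learn's nearest-neighbour utilities; the data and the choice of k are assumptions.

```python
# k-distance graph for choosing eps: plot each point's sorted k-th NN distance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=5)

k = 4                                            # often set near MinPts (assumption)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: the query point itself is returned first
distances, _ = nn.kneighbors(X)

plt.plot(np.sort(distances[:, -1]))              # last column = distance to the k-th neighbour
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbour")
plt.show()                                       # eps ~ the y-value at the "knee" of this curve
```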
Core Point: A point is a core point if it has at least MinPts points (including itself) within eps.
Border Point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise Point: A point that is neither a core point nor a border point.
Steps of the DBSCAN algorithm:
1. Find all the neighboring points within eps of each point, and identify the core points, i.e., those with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are reachable from c within the eps distance. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a.
4. Iterate through the remaining unvisited points in the dataset; the points that do not belong to any cluster are noise.
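The procedure above corresponds to scikit-learn's DBSCAN, where noise points receive the label -1; eps, min_samples, and the data below are assumptions for the demo.

```python
# DBSCAN: core points grow clusters; unreachable points get the label -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.5, random_state=9)

db = DBSCAN(eps=0.6, min_samples=5).fit(X)   # eps radius and MinPts
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise points")
```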
DENCLUE:
DENCLUE (DENsity-based CLUstEring) employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function; observations that climb to the same local maximum are put into the same cluster.
Clearly, DENCLUE does not work on data with a uniform distribution. In high-dimensional space the data tend to look uniformly distributed because of the curse of dimensionality, so DENCLUE does not work well on high-dimensional data in general.
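DENCLUE itself is not available in the mainstream Python clustering libraries. As a rough stand-in, mean-shift (listed earlier among the density-based methods) also assigns each point to a local maximum of a kernel density estimate; the bandwidth setting and data below are assumptions.

```python
# Mean-shift: mode seeking on a kernel density estimate (a DENCLUE-like idea).
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=11)

bw = estimate_bandwidth(X, quantile=0.2)     # kernel width, i.e. the density scale (assumption)
ms = MeanShift(bandwidth=bw).fit(X)

# Points that converge to the same density maximum share a cluster label.
print(len(ms.cluster_centers_), "density maxima found")
```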
Grid-Based Methods
The grid-based clustering approach uses a grid data structure: it quantizes the object space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed. The main advantage of the approach is its fast processing time.
STING (STatistical INformation Grid) is a grid-based clustering technique. It uses a multidimensional grid data structure that quantizes space into a finite number of cells. Instead of focusing on the data points themselves, it focuses on the value space surrounding the data points.
In STING, the spatial area is divided into rectangular cells organized at several levels of resolution; each high-level cell is divided into several lower-level cells.
Statistical information about the attributes in each cell, such as the mean, maximum, and minimum values, is precomputed and stored as statistical parameters. These parameters are useful for query processing and other data analysis tasks, and the statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells.
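STING has no standard library implementation; the NumPy fragment below only sketches the precomputation idea described above, i.e., summarizing points into per-cell statistics on one resolution level. The grid size and data are assumptions.

```python
# Precomputing per-cell statistics on a 2-D grid (the STING idea, illustrative only).
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((1000, 2))                       # points in the unit square (assumption)
n_cells = 8                                     # 8 x 8 grid at this resolution level

# Map each point to a cell index along each dimension.
cell_ids = np.minimum((X * n_cells).astype(int), n_cells - 1)

counts = np.zeros((n_cells, n_cells))
sums = np.zeros((n_cells, n_cells, 2))
for (cx, cy), p in zip(cell_ids, X):
    counts[cx, cy] += 1                         # per-cell point count
    sums[cx, cy] += p                           # per-cell linear sums -> per-cell means

means = sums / np.maximum(counts[..., None], 1) # mean attribute value per cell
print(int(counts.sum()), "points summarised into", n_cells * n_cells, "cells")
```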
Working of STING:
Step 1: Determine a layer to begin with.
Step 2: For each cell of this layer, calculate the confidence interval (or estimated range of probability) that the cell is relevant to the query.
Step 3: From the interval calculated above, label the cell as relevant or not relevant.
Step 4: If this layer is the bottom layer, go to point 6, otherwise, go to point 5.
Step 5: It goes down the hierarchy structure by one level. Go to point 2 for those cells that form
the relevant cell of the high-level layer.
Step 6: If the specification of the query is met, go to point 8, otherwise go to point 7.
Step 7: Retrieve those data that fall into the relevant cells and do further processing. Return the
result that meets the requirement of the query. Go to point 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the requirement of the
query. Go to point 9.
Step 9: Stop or terminate.
Advantages:
• The grid-based computation is query-independent: the statistical information stored in each cell summarizes the data irrespective of any particular query.
• The grid structure facilitates parallel processing and incremental updating.
• Processing is fast: the time depends on the number of cells, not on the number of data objects.
Disadvantage:
• The main disadvantage of STING is that, because all cluster boundaries follow the cell boundaries, they are either horizontal or vertical; no diagonal boundaries are detected, which can lower the quality and accuracy of the clusters.
CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters in subspaces.
It is based on automatically identifying the subspaces of a high-dimensional data space that allow better clustering than the original space.
It uses a density threshold to distinguish dense cells from sparse ones: a cell is dense if the number of objects mapped to it exceeds the density threshold.
The CLIQUE algorithm is very scalable with respect to the number of records and the number of dimensions in the dataset, because it is grid-based and uses the apriori property effectively.
The apriori property states that if a k-dimensional unit is dense, then all of its projections in (k−1)-dimensional space are also dense.
This means that dense regions in a given subspace must produce dense regions when projected onto a lower-dimensional subspace.
Because of this property, CLIQUE restricts its search for dense cells in higher-dimensional subspaces to candidates formed from the dense cells found in lower-dimensional subspaces.
The CLIQUE algorithm first divides the data space into a grid by dividing each dimension into equal intervals called units. It then identifies dense units: a unit is dense if the number of data points in it exceeds the density threshold.
Once the algorithm finds the dense cells along one dimension, it tries to find dense cells along two dimensions, and so on, until dense cells across all dimensions are found.
After finding all dense cells in all subspaces, the algorithm finds the largest sets ("clusters") of connected dense cells. Finally, CLIQUE generates a minimal description of each cluster. Clusters are thus generated from all dense subspaces using the apriori approach.
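CLIQUE is also not part of the common Python clustering libraries; the fragment below sketches only its first step, finding dense one-dimensional units as histogram bins whose counts exceed a density threshold. The interval count, threshold, and data are assumptions.

```python
# CLIQUE's first step, sketched: dense 1-D units = histogram bins above a threshold.
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 0.2, 300),    # one dense region (assumption)
                    rng.normal(3, 0.2, 300),    # another dense region
                    rng.uniform(-1, 4, 50)])    # sparse background noise

n_units = 20                                    # equal-width intervals per dimension
threshold = 40                                  # density threshold (count per unit)

counts, edges = np.histogram(x, bins=n_units)
dense_units = np.where(counts > threshold)[0]
print("dense units (bin indices):", dense_units)
# Higher-dimensional candidates would be built only from these dense projections
# (the apriori property); connected dense cells are then merged into clusters.
```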
Advantage:
• CLIQUE automatically finds the subspaces of the highest dimensionality in which high-density clusters exist; it is insensitive to the order of the input records and scales well as the number of records and dimensions grows.
Disadvantage:
• The main disadvantage of the CLIQUE algorithm is that the accuracy of the result depends on the grid (cell) size and the density threshold; if these are unsuitable for the data, too much approximation takes place and the correct clusters cannot be found.
Evaluation of Clustering
Cluster evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results
generated by a clustering method.
Assessing clustering tendency: the process of determining whether a dataset has clusters at all. It helps answer the question, "Are there clusters in this dataset, given our research question?" The answer determines whether cluster analysis is necessary.
Measures to assess clustering tendency:
• Hopkins statistic: measures the probability that a dataset was generated by a uniform data distribution, i.e., it tests the spatial randomness of the data.
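A minimal sketch of one common formulation of the Hopkins statistic follows (in this formulation, values near 0.5 suggest roughly uniform data and values approaching 1 suggest a clustering tendency); the sample size and data are assumptions.

```python
# One common formulation of the Hopkins statistic (illustrative sketch).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=6)
rng = np.random.default_rng(6)
m = 50                                            # sample size (assumption)

nn = NearestNeighbors(n_neighbors=2).fit(X)

# w_i: distance from sampled real points to their nearest *other* data point.
sample = X[rng.choice(len(X), m, replace=False)]
w = nn.kneighbors(sample)[0][:, 1]                # column 0 is the point itself (distance 0)

# u_i: distance from uniformly random points (in X's bounding box) to the data.
uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
u = nn.kneighbors(uniform)[0][:, 0]

H = u.sum() / (u.sum() + w.sum())                 # ~0.5 uniform, approaching 1 when clustered
print("Hopkins statistic:", round(H, 3))
```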
Determining the number of clusters in a data set: A few algorithms, such as k-means, require the number
of clusters in a data set as the parameter. Moreover, the number of clusters can be regarded as an interesting
and important summary statistic of a data set. Therefore, it is desirable to estimate this number even before
a clustering algorithm is used to derive detailed clusters.
Measures for determining the number of clusters:
• Elbow method: run the clustering (e.g., k-means) for a range of k values and plot the distortion/inertia against k. The optimal number of clusters is the value of k at the "elbow", i.e., the point after which the distortion/inertia starts decreasing in a roughly linear fashion. For example, if the curve flattens after k = 4, we conclude that the optimal number of clusters for that data is 4.
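Here is a minimal elbow-method sketch using k-means inertia from scikit-learn; the synthetic data and the range of k values are assumptions.

```python
# Elbow method: plot k-means inertia (distortion) for a range of k values.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=8)   # assumption: 4 true groups

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=8).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()   # choose k at the "elbow", where the curve starts flattening
```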
Measuring clustering quality: After applying a clustering method on a data set, we want to assess how
good the resulting clusters are. A number of measures can be used. Some methods measure how well the
clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is
available. There are also measures that score clustering and thus can compare two sets of clustering results
on the same data set.
Methods of measuring clustering quality:
Depending on the availability of ground truth (information that is known through direct observation or measurement), the methods can be classified as extrinsic/supervised (ground truth available) or intrinsic/unsupervised (ground truth not available).
Extrinsic methods:
Cluster homogeneity: requires that the purer the clusters in a clustering are, the better the clustering; a cluster is pure if it contains only objects belonging to the same category according to the ground truth.
Cluster completeness: the counterpart of homogeneity and an essential criterion for good clustering. If any two data objects belong to the same category according to the ground truth, they should be assigned to the same cluster. Cluster completeness is high when objects of the same category end up in the same cluster.
Rag bag: In some situations there are objects that cannot be merged with the objects of the existing categories. The quality of such cluster assignments is measured by the rag bag criterion, which says that these heterogeneous (miscellaneous) objects should be put into a "rag bag" category rather than into an otherwise pure cluster.
Small cluster preservation:
If a small category is split into small pieces during clustering, those pieces can become noise to the entire clustering, making it difficult to identify that small category at all. The small cluster preservation criterion therefore states that splitting a small category into pieces is more harmful than splitting a large one, and it decreases the quality of the clustering.
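When ground truth is available, scikit-learn's entropy-based homogeneity and completeness scores capture the first two criteria above; the toy labels below are assumptions.

```python
# Extrinsic evaluation: compare predicted clusters against ground-truth categories.
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

truth = [0, 0, 0, 1, 1, 1, 2, 2]      # ground-truth categories (toy assumption)
pred  = [0, 0, 1, 1, 1, 1, 2, 2]      # cluster labels produced by some method

print("homogeneity :", homogeneity_score(truth, pred))    # does each cluster hold one category?
print("completeness:", completeness_score(truth, pred))   # does each category stay in one cluster?
print("v-measure   :", v_measure_score(truth, pred))      # harmonic mean of the two
```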
Intrinsic Methods
Intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between the objects in the data set.
Silhouette coefficient: for each object, it compares the object's average distance to the other objects in its own cluster (a) with its average distance to the objects in the nearest other cluster (b), as s = (b − a) / max(a, b). The value of the silhouette coefficient lies between −1 and 1; a positive value nearing 1 indicates a compact, well-separated cluster.
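A minimal silhouette example with scikit-learn; the data and the k-means clustering it evaluates are assumptions.

```python
# Intrinsic evaluation with the silhouette coefficient (no ground truth needed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=10)
labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(X)

# Average silhouette over all objects: values near +1 mean compact, well-separated clusters.
print("silhouette:", silhouette_score(X, labels))
```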