Unit 5 Data Science
Program Name: B.C.A                          Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2                          No. of Credits: 03
Contact Hours: 42 Hours                      Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40               Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods
CO3 Understand the basic concepts of Clustering and Classification
CO4 Know the recent trends in Data Science
Contents (42 Hrs)
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs Data Mining, DBMS vs Data Mining, DM Techniques, Problems, Issues and Challenges in DM, DM Applications. (8 Hrs)
Unit II: Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization. (8 Hrs)
Unit III: Mining Frequent Patterns: Basic Concepts, Frequent Itemset Mining Methods, Apriori and Frequent Pattern Growth (FP-Growth) Algorithms, Mining Association Rules. (8 Hrs)
Unit IV: Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k-Nearest Neighbor, Prediction, Accuracy, Precision and Recall. (10 Hrs)
Unit V: Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)
Unit 5
Topics:
Cluster Analysis
Cluster analysis or clustering is the process of grouping a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters.
Clustering is also known as unsupervised learning since groups are made without the knowledge
of class labels.
Clustering is also called data segmentation in some applications because clustering partitions large
data sets into groups according to their similarity.
Ex: grouping customers into different segments by discovering groups of customers with similar purchasing behaviour.
Clustering can also be used for outlier detection, where outliers (values that are “far away” from
any cluster) may be more interesting than common cases.
Clustering methods are broadly categorized as follows:
1. Partitioning method: Divides the data objects into k exclusive groups (clusters) so that objects within a cluster are similar to one another and dissimilar to objects in other clusters. E.g., k-means, k-medoids.
2. Hierarchical method: Groups data objects into a hierarchy ('tree') of clusters, either by merging smaller clusters into larger ones (agglomerative) or by splitting larger clusters (divisive). E.g., BIRCH, Chameleon.
3. Density-based method: Forms clusters around core points, i.e., points that have at least a minimum number of neighboring points within a specified distance (known as the epsilon radius). It expands clusters by connecting these core points to their neighboring points until the density falls below a certain threshold. Points that do not belong to any cluster are considered outliers or noise.
E.g., DBSCAN, OPTICS, DENCLUE, Mean-Shift.
4. Grid-based method: The object space is quantized into a grid structure of a finite number of cells, and the clustering operations are performed on the cells instead of on individual data points. This method is highly efficient for handling spatial data and has a fast processing time that is independent of the number of data objects.
E.g., STatistical INformation Grid (STING), CLIQUE, ENCLUS.
Partitioning Methods: The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a data set into k exclusive clusters based on their similarities and differences.
E.g., k-means and k-medoids.
K-Means Algorithm:
• First, it randomly selects k of the objects in D; each of these initially represents a cluster mean/center.
• In each iteration, every object is assigned to the cluster whose mean/center it is nearest to, based on Euclidean distance, and each cluster mean/center is then updated.
• The iterations continue until the clusters of the current iteration are the same as those of the previous iteration (i.e., the assignments are stable).
Advantages of k-means:
• It is relatively simple to implement and computationally efficient, so it scales well to large data sets.
Disadvantages of k-means:
• It is a bit difficult to predict the number of clusters i.e. the value of k.
• Output is strongly impacted by initial inputs like number of clusters (value of k).
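As an illustration of the algorithm above, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the data set, the choice of k = 3, and the parameter values are assumptions made only for the demo.

```python
# A minimal k-means sketch using scikit-learn (assumed available).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 natural groups (an assumption for the demo).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-means: pick k initial centers, assign points to the nearest center
# (Euclidean distance), recompute the means, and repeat until stable.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(km.cluster_centers_)   # final means/centers of the 3 clusters
print(km.labels_[:10])       # cluster assignment of the first 10 objects
print(km.inertia_)           # sum of squared distances to the nearest center
```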
K-Medoids Algorithm:
It is an improved version of the k-means algorithm, designed mainly to reduce sensitivity to outliers. Instead of taking the mean value to represent a cluster, it uses a medoid: a point in the cluster whose total dissimilarity to all the other points in the cluster is minimal. A representative object (Oi) is chosen randomly for each cluster, and each remaining object is assigned to the cluster whose representative object it is most similar to. The partitioning is then refined based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object.
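There is no k-medoids estimator in core scikit-learn, so the following is a simplified, PAM-style NumPy sketch of the idea just described (alternate between assigning objects to the most similar medoid and re-choosing each medoid as the object with minimal total dissimilarity); the toy data and iteration budget are assumptions.

```python
# Simplified k-medoids (PAM-style) sketch in NumPy -- illustrative only.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy data (assumption)
k = 3

medoid_idx = rng.choice(len(X), k, replace=False)   # random representative objects
for _ in range(20):                                  # fixed iteration budget (assumption)
    D = cdist(X, X[medoid_idx])                      # dissimilarity of each object to each medoid
    labels = D.argmin(axis=1)                        # assign to the most similar medoid
    new_idx = medoid_idx.copy()
    for c in range(k):
        members = np.where(labels == c)[0]
        # new medoid = member with minimal total dissimilarity to its cluster
        within = cdist(X[members], X[members]).sum(axis=1)
        new_idx[c] = members[within.argmin()]
    if np.array_equal(new_idx, medoid_idx):          # stop when the medoids stabilise
        break
    medoid_idx = new_idx

print("medoid objects:", medoid_idx)
```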
Hierarchical Methods
Partitioning methods partition objects into exclusive groups. In some situations, however, we may want the data organized into groups at different levels. A hierarchical clustering method works by grouping data objects into a hierarchy or 'tree' of clusters, which helps summarize the data.
Hierarchical clustering methods can be classified as:
a) Algorithmic methods,
b) Probabilistic methods, and
c) Bayesian methods.
Agglomerative, divisive, and multiphase methods are algorithmic, meaning they consider data objects as deterministic and compute clusters according to the deterministic distances between objects. Probabilistic methods use probabilistic models to compare clusters and measure the quality of clusters by the fitness of the models. Bayesian methods compute a distribution of possible clusterings; that is, instead of outputting a single deterministic clustering over a data set, they return a group of clustering structures and their probabilities, conditional on the given data.
Dendrogram:
A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. A dendrogram is plotted to show the results of a hierarchical clustering method graphically.
Whether using an agglomerative method or a divisive method, a core need is to measure the distance between two clusters. Four widely used measures for the distance between clusters Ci and Cj are as follows, where |p − p'| is the distance between two objects p and p', mi is the mean of cluster Ci, and ni is the number of objects in Ci:
1. Minimum distance: dist_min(Ci, Cj) = min_{p ∈ Ci, p' ∈ Cj} |p − p'|
2. Maximum distance: dist_max(Ci, Cj) = max_{p ∈ Ci, p' ∈ Cj} |p − p'|
3. Mean distance: dist_mean(Ci, Cj) = |mi − mj|
4. Average distance: dist_avg(Ci, Cj) = (1 / (ni nj)) Σ_{p ∈ Ci, p' ∈ Cj} |p − p'|
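To make the four measures concrete, here is a small NumPy sketch that computes them for two toy clusters (the cluster contents are assumptions):

```python
# Computing the four inter-cluster distance measures for two toy clusters.
import numpy as np
from scipy.spatial.distance import cdist

Ci = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # cluster Ci (assumption)
Cj = np.array([[4.0, 4.0], [5.0, 4.0]])               # cluster Cj (assumption)

D = cdist(Ci, Cj)                      # all pairwise distances |p - p'|
dist_min  = D.min()                    # minimum distance (single link)
dist_max  = D.max()                    # maximum distance (complete link)
dist_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # |mi - mj|
dist_avg  = D.mean()                   # (1/(ni*nj)) * sum of all |p - p'|

print(dist_min, dist_max, dist_mean, dist_avg)
```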
1. Single linkage: computes the minimum distance between clusters before merging them.
If the clustering process is terminated when the distance between nearest clusters exceeds
a user-defined threshold, it is called a single-linkage algorithm.
2. Complete linkage: computes the maximum distance between clusters before merging
them. If the clustering process is terminated when the maximum distance between nearest
clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm.
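Both linkage criteria, and the dendrogram mentioned earlier, can be tried with SciPy's hierarchical-clustering utilities; the synthetic data are an assumption for the demo.

```python
# Agglomerative clustering with single vs. complete linkage, plus a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),       # two loose groups (assumption)
               rng.normal(3, 0.3, (10, 2))])

Z_single   = linkage(X, method='single')     # merge on minimum inter-cluster distance
Z_complete = linkage(X, method='complete')   # merge on maximum inter-cluster distance

dendrogram(Z_complete)                       # tree of merges at increasing distances
plt.title("Complete-linkage dendrogram")
plt.show()
```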
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. It uses a clustering feature (CF) to summarize a cluster and a clustering feature tree (CF tree) to represent a cluster hierarchy. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the data points, and SS is the squared sum of the data points in the cluster. A CF tree is a height-balanced tree with two parameters: the branching factor, which limits the number of entries (children) per node, and the threshold, which limits the size of the subclusters stored in the leaf entries. The CF tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster. Every non-leaf entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of the CF entries in its child node. The tree size is a function of the threshold: the larger the threshold, the smaller the tree.
2. Global Clustering: Applies an existing clustering algorithm to the leaf entries (subclusters) of the CF tree rather than to the individual data points. Optionally, these clusters can then be refined.
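scikit-learn's Birch estimator exposes the two CF-tree parameters described above as threshold and branching_factor; the data and parameter values below are assumptions for the demo.

```python
# BIRCH: build a CF tree, then globally cluster the leaf subclusters.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=7)

birch = Birch(threshold=0.5,        # limits the size of subclusters in the leaves
              branching_factor=50,  # max entries per CF-tree node
              n_clusters=4)         # global clustering step on the leaf entries
labels = birch.fit_predict(X)

print(len(birch.subcluster_centers_), "leaf subclusters summarised in the CF tree")
```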
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters.
Chameleon uses a two-phase algorithm to find clusters in a data set:
1. First phase: uses a graph-partitioning algorithm to cluster the data items into many small subclusters.
2. Second phase: uses an agglomerative algorithm to find the genuine clusters by repeatedly combining these subclusters.
Two clusters are merged only if their interconnectivity is high and they are close together.
Probabilistic Hierarchical Clustering:
Traditional hierarchical clustering assumes that there is no uncertainty or noise in the data being clustered. However, this assumption does not hold for many real-world datasets, where there may be missing values, outliers, or measurement errors. Traditional methods also assume that all features have equal importance, which may not always be true.
Probabilistic hierarchical clustering tries to overcome some of these drawbacks by employing probabilistic models to measure the distance between clusters.
Advantages:
• Probabilistic models make the method more robust to noise, outliers, and missing values than traditional hierarchical clustering.
• Cluster quality can be measured in a principled, model-based way.
Disadvantages:
• Computationally expensive.
• Sensitive to initialization parameters.
• Assumes Gaussian distributions within clusters.
• Requires domain knowledge and expertise in statistical modeling.
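Probabilistic hierarchical clustering has no common off-the-shelf implementation, so as a rough illustration of probabilistic, model-based clustering with Gaussian clusters (a related but different technique), here is a minimal Gaussian mixture sketch with scikit-learn; the data and the choice of three components are assumptions.

```python
# Probabilistic, model-based clustering with Gaussian components (illustrative only).
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=3)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=3).fit(X)
labels = gmm.predict(X)          # hard assignment: most probable component
probs = gmm.predict_proba(X)     # soft assignment: probability per cluster

print(gmm.bic(X))                # model-fitness score usable to compare clusterings
```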
Density-Based Methods
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is based on the intuitive notion of "clusters" as dense regions of points separated by regions of low density ("noise"). The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. DBSCAN requires two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two points is less than or equal to eps, they are considered neighbors. If the eps value chosen is too small, a large part of the data will be treated as outliers; if it is chosen very large, clusters will merge and the majority of the data points will end up in the same cluster. One way to find a suitable eps value is the k-distance graph (see the sketch after this list).
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and MinPts should be at least 3.
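Here is a minimal sketch of the k-distance graph mentioned above for choosing eps, using scikit-learn's nearest-neighbour utilities; the data and the choice of k are assumptions.

```python
# k-distance graph for choosing eps: plot each point's sorted k-th NN distance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=5)

k = 4                                            # often set near MinPts (assumption)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: the query point itself is returned first
distances, _ = nn.kneighbors(X)

plt.plot(np.sort(distances[:, -1]))              # last column = distance to the k-th neighbour
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbour")
plt.show()                                       # eps ~ the y-value at the "knee" of this curve
```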
Core Point: A point is a core point if it has at least MinPts points (including itself) within eps.
Border Point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise Point: A point that is neither a core point nor a border point.
Steps of the DBSCAN algorithm:
1. Find all the neighboring points within eps of each point, and identify the core points, i.e., those with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are reachable from c within the eps distance. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a.
4. Iterate through the remaining unvisited points in the dataset; the points that do not belong to any cluster are noise.
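The procedure above corresponds to scikit-learn's DBSCAN, where noise points receive the label -1; eps, min_samples, and the data below are assumptions for the demo.

```python
# DBSCAN: core points grow clusters; unreachable points get the label -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.5, random_state=9)

db = DBSCAN(eps=0.6, min_samples=5).fit(X)   # eps radius and MinPts
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise points")
```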
DENCLUE:
DENCLUE (DENsity-based CLUstEring) employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function; observations that climb to the same local maximum are put into the same cluster.
Clearly, DENCLUE does not work on data with a uniform distribution. In high-dimensional space the data tend to look uniformly distributed because of the curse of dimensionality, so DENCLUE does not work well on high-dimensional data in general.
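DENCLUE itself is not available in the mainstream Python clustering libraries. As a rough stand-in, mean-shift (listed earlier among the density-based methods) also assigns each point to a local maximum of a kernel density estimate; the bandwidth setting and data below are assumptions.

```python
# Mean-shift: mode seeking on a kernel density estimate (a DENCLUE-like idea).
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=11)

bw = estimate_bandwidth(X, quantile=0.2)     # kernel width, i.e. the density scale (assumption)
ms = MeanShift(bandwidth=bw).fit(X)

# Points that converge to the same density maximum share a cluster label.
print(len(ms.cluster_centers_), "density maxima found")
```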
Grid-Based Methods
The grid-based clustering approach uses a grid data structure: it quantizes the object space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed. The main advantage of the approach is its fast processing time.
STING (STatistical INformation Grid) is a grid-based clustering technique. It uses a multidimensional grid data structure that quantizes space into a finite number of cells. Instead of focusing on the data points themselves, it focuses on the value space surrounding the data points.
In STING, the spatial area is divided into rectangular cells organized at several levels of resolution; each high-level cell is divided into several lower-level cells.
Statistical information about the attributes in each cell, such as the mean, maximum, and minimum values, is precomputed and stored as statistical parameters. These parameters are useful for query processing and other data analysis tasks, and the statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells.
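STING has no standard library implementation; the NumPy fragment below only sketches the precomputation idea described above, i.e., summarizing points into per-cell statistics on one resolution level. The grid size and data are assumptions.

```python
# Precomputing per-cell statistics on a 2-D grid (the STING idea, illustrative only).
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((1000, 2))                       # points in the unit square (assumption)
n_cells = 8                                     # 8 x 8 grid at this resolution level

# Map each point to a cell index along each dimension.
cell_ids = np.minimum((X * n_cells).astype(int), n_cells - 1)

counts = np.zeros((n_cells, n_cells))
sums = np.zeros((n_cells, n_cells, 2))
for (cx, cy), p in zip(cell_ids, X):
    counts[cx, cy] += 1                         # per-cell point count
    sums[cx, cy] += p                           # per-cell linear sums -> per-cell means

means = sums / np.maximum(counts[..., None], 1) # mean attribute value per cell
print(int(counts.sum()), "points summarised into", n_cells * n_cells, "cells")
```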
Working of STING:
Step 1: Determine a layer to begin with.
Step 2: For each cell of this layer, calculate the confidence interval (or estimated range of probability) that the cell is relevant to the query.
Step 3: From the interval calculated above, label the cell as relevant or not relevant.
Step 4: If this layer is the bottom layer, go to point 6, otherwise, go to point 5.
Step 5: It goes down the hierarchy structure by one level. Go to point 2 for those cells that form
the relevant cell of the high-level layer.
Step 6: If the specification of the query is met, go to point 8, otherwise go to point 7.
Step 7: Retrieve those data that fall into the relevant cells and do further processing. Return the
result that meets the requirement of the query. Go to point 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the requirement of the
query. Go to point 9.
Step 9: Stop or terminate.
Advantages:
• The grid-based computation is query-independent: the statistical information stored in each cell summarizes the data irrespective of any particular query.
• The grid structure facilitates parallel processing and incremental updating.
• Processing is fast: the time depends on the number of cells, not on the number of data objects.
Disadvantage:
• The main disadvantage of STING is that, because all cluster boundaries follow the cell boundaries, they are either horizontal or vertical; no diagonal boundaries are detected, which can lower the quality and accuracy of the clusters.
CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters in subspaces.
It is based on automatically identifying the subspaces of a high-dimensional data space that allow better clustering than the original space.
It uses a density threshold to distinguish dense cells from sparse ones: a cell is dense if the number of objects mapped to it exceeds the density threshold.
The CLIQUE algorithm is very scalable with respect to the number of records and the number of dimensions in the dataset, because it is grid-based and uses the apriori property effectively.
The apriori property states that if a k-dimensional unit is dense, then all of its projections in (k−1)-dimensional space are also dense.
This means that dense regions in a given subspace must produce dense regions when projected onto a lower-dimensional subspace.
Because of this property, CLIQUE restricts its search for dense cells in higher-dimensional subspaces to candidates formed from the dense cells found in lower-dimensional subspaces.
The CLIQUE algorithm first divides the data space into a grid by dividing each dimension into equal intervals called units. It then identifies dense units: a unit is dense if the number of data points in it exceeds the density threshold.
Once the algorithm finds the dense cells along one dimension, it tries to find dense cells along two dimensions, and so on, until dense cells across all dimensions are found.
After finding all dense cells in all subspaces, the algorithm finds the largest sets ("clusters") of connected dense cells. Finally, CLIQUE generates a minimal description of each cluster. Clusters are thus generated from all dense subspaces using the apriori approach.
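CLIQUE is also not part of the common Python clustering libraries; the fragment below sketches only its first step, finding dense one-dimensional units as histogram bins whose counts exceed a density threshold. The interval count, threshold, and data are assumptions.

```python
# CLIQUE's first step, sketched: dense 1-D units = histogram bins above a threshold.
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 0.2, 300),    # one dense region (assumption)
                    rng.normal(3, 0.2, 300),    # another dense region
                    rng.uniform(-1, 4, 50)])    # sparse background noise

n_units = 20                                    # equal-width intervals per dimension
threshold = 40                                  # density threshold (count per unit)

counts, edges = np.histogram(x, bins=n_units)
dense_units = np.where(counts > threshold)[0]
print("dense units (bin indices):", dense_units)
# Higher-dimensional candidates would be built only from these dense projections
# (the apriori property); connected dense cells are then merged into clusters.
```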
Advantage:
• CLIQUE automatically finds the subspaces of the highest dimensionality in which high-density clusters exist; it is insensitive to the order of the input records and scales well as the number of records and dimensions grows.
Disadvantage:
• The main disadvantage of the CLIQUE algorithm is that the accuracy of the result depends on the grid (cell) size and the density threshold; if these are unsuitable for the data, too much approximation takes place and the correct clusters cannot be found.
Evaluation of Clustering
Cluster evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results
generated by a clustering method.
Assessing clustering tendency: the process of determining whether a dataset has clusters at all. It helps answer the question, "Are there clusters in this dataset, given our research question?" The answer determines whether cluster analysis is necessary.
Measures to assess clustering tendency:
• Hopkins statistic: measures the probability that a dataset was generated by a uniform data distribution, i.e., it tests the spatial randomness of the data.
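A minimal sketch of one common formulation of the Hopkins statistic follows (in this formulation, values near 0.5 suggest roughly uniform data and values approaching 1 suggest a clustering tendency); the sample size and data are assumptions.

```python
# One common formulation of the Hopkins statistic (illustrative sketch).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=6)
rng = np.random.default_rng(6)
m = 50                                            # sample size (assumption)

nn = NearestNeighbors(n_neighbors=2).fit(X)

# w_i: distance from sampled real points to their nearest *other* data point.
sample = X[rng.choice(len(X), m, replace=False)]
w = nn.kneighbors(sample)[0][:, 1]                # column 0 is the point itself (distance 0)

# u_i: distance from uniformly random points (in X's bounding box) to the data.
uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
u = nn.kneighbors(uniform)[0][:, 0]

H = u.sum() / (u.sum() + w.sum())                 # ~0.5 uniform, approaching 1 when clustered
print("Hopkins statistic:", round(H, 3))
```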
Determining the number of clusters in a data set: A few algorithms, such as k-means, require the number
of clusters in a data set as the parameter. Moreover, the number of clusters can be regarded as an interesting
and important summary statistic of a data set. Therefore, it is desirable to estimate this number even before
a clustering algorithm is used to derive detailed clusters.
Measures for determining the number of clusters:
• Elbow method: run the clustering (e.g., k-means) for a range of k values and plot the distortion/inertia against k. The optimal number of clusters is the value of k at the "elbow", i.e., the point after which the distortion/inertia starts decreasing in a roughly linear fashion. For example, if the curve flattens after k = 4, we conclude that the optimal number of clusters for that data is 4.
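Here is a minimal elbow-method sketch using k-means inertia from scikit-learn; the synthetic data and the range of k values are assumptions.

```python
# Elbow method: plot k-means inertia (distortion) for a range of k values.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=8)   # assumption: 4 true groups

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=8).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()   # choose k at the "elbow", where the curve starts flattening
```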
Measuring clustering quality: After applying a clustering method on a data set, we want to assess how
good the resulting clusters are. A number of measures can be used. Some methods measure how well the
clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is
available. There are also measures that score clustering and thus can compare two sets of clustering results
on the same data set.
Methods of measuring clustering quality:
Depending on the availability of ground truth (information that is known through direct observation or measurement), the methods can be classified as extrinsic/supervised (ground truth available) or intrinsic/unsupervised (ground truth not available).
Extrinsic methods:
Cluster homogeneity: requires that the purer the clusters in a clustering are, the better the clustering; a cluster is pure if it contains only objects belonging to the same category according to the ground truth.
Cluster completeness: the counterpart of homogeneity and an essential criterion for good clustering. If any two data objects belong to the same category according to the ground truth, they should be assigned to the same cluster. Cluster completeness is high when objects of the same category end up in the same cluster.
Rag bag: In some situations there are objects that cannot be merged with the objects of the existing categories. The quality of such cluster assignments is measured by the rag bag criterion, which says that these heterogeneous (miscellaneous) objects should be put into a "rag bag" category rather than into an otherwise pure cluster.
Small cluster preservation:
If a small category is split into small pieces during clustering, those pieces can become noise to the entire clustering, making it difficult to identify that small category at all. The small cluster preservation criterion therefore states that splitting a small category into pieces is more harmful than splitting a large one, and it decreases the quality of the clustering.
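When ground truth is available, scikit-learn's entropy-based homogeneity and completeness scores capture the first two criteria above; the toy labels below are assumptions.

```python
# Extrinsic evaluation: compare predicted clusters against ground-truth categories.
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

truth = [0, 0, 0, 1, 1, 1, 2, 2]      # ground-truth categories (toy assumption)
pred  = [0, 0, 1, 1, 1, 1, 2, 2]      # cluster labels produced by some method

print("homogeneity :", homogeneity_score(truth, pred))    # does each cluster hold one category?
print("completeness:", completeness_score(truth, pred))   # does each category stay in one cluster?
print("v-measure   :", v_measure_score(truth, pred))      # harmonic mean of the two
```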
Intrinsic Methods
Intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between the objects in the data set.
Silhouette coefficient: for each object, it compares the object's average distance to the other objects in its own cluster (a) with its average distance to the objects in the nearest other cluster (b), as s = (b − a) / max(a, b). The value of the silhouette coefficient lies between −1 and 1; a positive value nearing 1 indicates a compact, well-separated cluster.
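A minimal silhouette example with scikit-learn; the data and the k-means clustering it evaluates are assumptions.

```python
# Intrinsic evaluation with the silhouette coefficient (no ground truth needed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=10)
labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(X)

# Average silhouette over all objects: values near +1 mean compact, well-separated clusters.
print("silhouette:", silhouette_score(X, labels))
```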