Unit – V
Clustering in Data Mining
Clustering is an unsupervised machine learning technique that groups data points into clusters so that objects in the same cluster are similar to one another.
Clustering splits data into several subsets, each containing data objects that are similar to one another; these subsets are called clusters. For example, once the data from a customer base has been divided into clusters, a business can make an informed decision about which customers are best suited for a particular product.
What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it.
o A connected region of a multidimensional space with a comparatively high density of objects.
What is clustering in Data Mining?
o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set and is used either as a stand-alone tool to gain better insight into the data distribution or as a pre-processing step for other algorithms.
Important points:
o Data objects of a cluster can be considered as one group.
o In cluster analysis, we first partition the data set into groups based on data similarity and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful characteristics that differentiate distinct groups.
Different types of Clustering
Clustering methods fall into several categories; the ones discussed below are partitioning (centroid-based) methods such as K-Means, grid-based methods, and model-based methods.
K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among the data objects inside a cluster (intracluster similarity) is high, while the similarity of data objects to objects outside the cluster (intercluster similarity) is low. Cluster similarity is measured with respect to the mean value of the objects in the cluster, so K-Means is a squared-error-based algorithm. At the start, K objects are chosen randomly from the dataset, each representing a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean, and the mean of each cluster is then recalculated from the objects assigned to it.
Algorithm:
K-Means:
Input:
K: the number of clusters into which the dataset has to be divided
D: a dataset containing N objects
Output:
A set of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to (i.e., the nearest cluster centre).
3. Update the cluster means, i.e., recalculate the mean of each cluster from the objects currently assigned to it.
4. Repeat steps 2 and 3 until no change occurs.
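As an illustration, here is a minimal K-Means sketch in Python with NumPy. The synthetic example data, the random initialization, and the convergence test are assumptions made for the sketch; in practice a library implementation such as scikit-learn's KMeans would typically be used.

```python
import numpy as np

def k_means(D, K, max_iter=100, seed=0):
    """Minimal K-Means sketch: D is an (N, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K objects from D as initial cluster centres.
    centres = D[rng.choice(len(D), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: (re)assign each object to the nearest cluster centre.
        distances = np.linalg.norm(D[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recalculate the mean of each cluster from its current members.
        new_centres = np.array([
            D[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
            for k in range(K)
        ])
        # Step 4: stop when no change occurs.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Example usage on synthetic data (assumed for illustration).
X = np.vstack([np.random.randn(50, 2) + [0, 0], np.random.randn(50, 2) + [5, 5]])
labels, centres = k_means(X, K=2)
```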
Grid-Based Method in Data Mining
We can use the grid-based clustering method for multi-resolution of grid-based data structure. It is used to quantize the
area of the object into a finite number of cells, which is stored in the grid system where all the operations of Clustering
are implemented. We can use this method for its quick processing time, which is generally independent of the number
of data objects, still dependent on only the multiple cells in each dimension in the quantized space.
There is an instance of a grid-based approach that involves STING, which explores statistical data stored in the grid
cells, and WaveCluster, which clusters objects using a wavelet transform approach. And CLIQUE, which defines a grid-
and density-based approach for Clustering in high-dimensional data space.
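To make the idea concrete, here is a minimal sketch of a simple grid- and density-based scheme in Python (in the spirit of CLIQUE, though not an implementation of it), assuming 2-D data; the cell size and density threshold are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict, deque

def grid_cluster(X, cell_size=1.0, min_points=3):
    """Quantize 2-D points into grid cells, keep dense cells, merge adjacent dense cells."""
    # Quantize the object space into a finite number of cells.
    cells = defaultdict(list)
    for i, point in enumerate(X):
        cells[tuple(np.floor(point / cell_size).astype(int))].append(i)
    # Keep only cells whose density (point count) meets the threshold.
    dense = {c for c, idx in cells.items() if len(idx) >= min_points}
    # Merge adjacent dense cells into clusters (connected components on the grid).
    labels = {}
    cluster_id = 0
    for cell in dense:
        if cell in labels:
            continue
        labels[cell] = cluster_id
        queue = deque([cell])
        while queue:
            cx, cy = queue.popleft()
            for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                if nb in dense and nb not in labels:
                    labels[nb] = cluster_id
                    queue.append(nb)
        cluster_id += 1
    # Map each point to its cell's cluster; points in sparse cells are noise (-1).
    point_labels = np.full(len(X), -1)
    for cell, idx in cells.items():
        if cell in labels:
            point_labels[idx] = labels[cell]
    return point_labels
```

Note that after the points are binned, all further work operates on grid cells rather than individual points, which is why the processing time depends mainly on the number of cells.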
Basics of Grid-Based Methods
When we deal with datasets that have multidimensional characteristics, such as spatial data (geographical information, image data) or datasets with many attributes, a grid-based approach is helpful. Dividing the data space into a grid brings several advantages, described below.
1. Data Partitioning - Partitioning is a clustering approach that divides the data into a number of groups based on the characteristics and similarity of the data. The user specifies the number of partitions (K) to be constructed, and each partition represents a cluster and a particular region of the space. Several well-known algorithms follow the partitioning approach, such as K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).
2. Data Reduction - Data reduction is a data mining technique used to reduce the size of a dataset while still preserving the most important information. It is applied when the dataset is too large to be processed efficiently or when it contains a large amount of irrelevant or redundant information.
3. Local Pattern Discovery - With the help of the grid-based method, we can identify local patterns or trends within the data. By analyzing the data within individual cells, patterns and relationships that remain hidden when the entire dataset is examined at once can be uncovered. This is especially valuable for finding localized phenomena within the data.
4. Scalability - Grid-based methods are known for their scalability. They can handle large datasets, making them particularly useful when dealing with high-dimensional data, since partitioning the space into a fixed number of cells simplifies the analysis.
5. Density Estimation - Density-based clustering is one of the most popular unsupervised learning methodologies used in model building and machine learning. Data points that lie in the low-density regions separating two clusters are considered noise. The neighbourhood within a radius ε of a given object is known as the ε-neighbourhood of the object. If the ε-neighbourhood of an object contains at least a minimum number of objects (MinPts), the object is called a core object (see the sketch after this list).
6. Clustering and Classification - The grid-based mining method divides the space of instances into cells. Clustering techniques are then applied using the cells of the grid, instead of individual data points, as the base units. The biggest advantage of this method is that it improves processing time.
7. Grid-Based Indexing - Grid-based indexing enables efficient access and retrieval of data. These index structures organize the data based on the grid partitions, enhancing query performance and retrieval.
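As referenced in point 5, the following is a minimal sketch of the ε-neighbourhood and core-object test, assuming Euclidean distance; the values of eps and min_pts are illustrative assumptions.

```python
import numpy as np

def core_objects(X, eps=0.5, min_pts=5):
    """Return a boolean mask marking which points are core objects.

    A point is a core object if its eps-neighbourhood (all points within
    distance eps, including itself) contains at least min_pts points.
    """
    X = np.asarray(X)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Count neighbours within radius eps for each point.
    neighbour_counts = (dists <= eps).sum(axis=1)
    return neighbour_counts >= min_pts
```

This is the same core-object definition used by density-based algorithms such as DBSCAN (available, for example, as scikit-learn's DBSCAN with the eps and min_samples parameters).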
Model-based clustering
Model-based clustering is a statistical approach to data clustering. The observed (multivariate) data is assumed to have been generated by a finite mixture of component models. Each component model is a probability distribution, generally a parametric multivariate distribution.
For instance, in a multivariate Gaussian mixture model, each component is a multivariate Gaussian distribution. The
component responsible for generating a particular observation determines the cluster to which the observation belongs.
Model-based clustering attempts to optimize the fit between the given data and some mathematical model and is based on the assumption that the data are generated by a mixture of underlying probability distributions.
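For illustration, a Gaussian mixture model can be fitted with an off-the-shelf EM implementation; the synthetic data and the choice of two components below are assumptions made for the example (this uses scikit-learn's GaussianMixture, not a method prescribed by the text).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-component data (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

# Fit a mixture of two multivariate Gaussian components using EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)             # hard cluster assignment per observation
posteriors = gmm.predict_proba(X)   # P(Ck | Xi) for each component k
```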
There are the following types of model-based clustering −
Statistical approach − Expectation Maximization (EM) is a popular iterative refinement algorithm. It can be viewed as an extension of k-means: it assigns each object to a cluster according to a weight (probability of membership), and new means are computed based on these weighted measures.
The basic idea is as follows −
Start with an initial estimate of the parameter vector.
Iteratively rescore the patterns against the mixture density produced by the parameter vector.
Use the rescored patterns to update the parameter estimates.
Patterns belonging to the same cluster are those that are placed by their scores in the same component.
Algorithm
Initially, assign k cluster centres randomly.
Iteratively refine the clusters based on the following two steps −
Expectation step − Assign each data point Xi to cluster Ck with the probability
P(Xi ∈ Ck) = P(Ck | Xi) = P(Ck) P(Xi | Ck) / P(Xi)
Maximization step − Use the probability estimates from the expectation step to re-estimate (update) the model parameters, for example the cluster means.
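A minimal sketch of this EM iteration for a one-dimensional Gaussian mixture, written with NumPy and SciPy; the number of components, the random initialization, and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(X, k=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture: returns mixing weights, means, std devs."""
    rng = np.random.default_rng(seed)
    # Initial estimate of the parameter vector (weights, means, standard deviations).
    weights = np.full(k, 1.0 / k)
    means = rng.choice(X, size=k, replace=False)
    stds = np.full(k, X.std())
    for _ in range(n_iter):
        # Expectation step: P(Ck | Xi) proportional to P(Ck) * P(Xi | Ck).
        likelihood = np.array([w * norm.pdf(X, m, s)
                               for w, m, s in zip(weights, means, stds)])  # shape (k, N)
        posteriors = likelihood / likelihood.sum(axis=0, keepdims=True)
        # Maximization step: update the parameters from the rescored patterns.
        resp = posteriors.sum(axis=1)              # effective count per component
        weights = resp / len(X)
        means = (posteriors @ X) / resp
        stds = np.sqrt((posteriors * (X - means[:, None]) ** 2).sum(axis=1) / resp)
    return weights, means, stds

# Example usage on synthetic data (assumed for illustration).
X = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
weights, means, stds = em_gmm_1d(X)
```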
Machine learning approach − Machine learning builds complex algorithms for processing huge amounts of data and delivering results to its users. It uses programs that learn from experience and make predictions. The algorithms improve themselves through repeated input of training data. The main objective of machine learning is to learn from the data and build models that can be understood and used by humans.
A well-known example is incremental conceptual clustering (as in COBWEB), which produces a hierarchical clustering in the form of a classification tree. Each node represents a concept and includes a probabilistic description of that concept.
There are some important applications of data mining, including the following:
o Market Basket Analysis: Retailers use data mining to identify the products frequently purchased in combination. This supports targeted marketing, product placement, and store design.
o Customer Segmentation: Organizations use data mining to classify customers based on shared traits or behaviours. This makes it possible to create individualized marketing plans and product recommendations.
o Recommendation Systems: Data mining is frequently applied in recommendation systems, including those used by social networks, e-commerce websites, and streaming platforms. It examines user behaviour and preferences to recommend personalized products, content, or friends.
o Financial Market Forecasting: Data mining is used in finance to forecast future stock prices, currency exchange rates, and market trends by analyzing historical market data, news sentiment, and economic indicators. This is helpful for trading and investment strategies.
o Healthcare Fraud Detection: Data mining is used in the healthcare industry to identify fraudulent billing practices, insurance claims, and unnecessary medical procedures. It aids in spotting unusual patterns that might point to fraudulent activity.
o Churn Prediction: Data mining is used by businesses in sectors like telecommunications and subscription services to forecast which customers are most likely to discontinue their subscriptions. This aids in campaigns to keep customers.
o Credit Scoring: Data mining is used by financial institutions to evaluate a person's or company's creditworthiness. It aids in deciding whether to approve loans and the applicable interest rates.
o Agriculture: To maximize crop yields and reduce resource waste, farmers use data mining to analyze crop data, weather patterns, and soil conditions.