Unit – V
Clustering in Data Mining
Clustering is an unsupervised machine learning technique that groups data points into clusters so that objects in the same cluster are similar to one another.
Clustering splits data into several subsets, each containing data objects that are similar to one another; these subsets are called clusters. For example, once the data from a customer base has been divided into clusters, a business can make an informed decision about which customers are best suited for a particular product.
What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it.
o A connected region of a multidimensional space with a comparatively high density of objects.
What is clustering in Data Mining?
o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set and is used either as a stand-alone tool to gain better insight into the data distribution or as a pre-processing step for other algorithms.
Important points:
o Data objects of a cluster can be considered as one group.
o In cluster analysis, we first partition the data set into groups based on data similarity and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful characteristics that differentiate distinct groups.
Different types of Clustering
Clustering methods fall into several categories; the ones discussed below are partitioning (centroid-based) methods such as K-Means, grid-based methods, and model-based methods.
K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among the data objects inside a cluster (intracluster similarity) is high, while the similarity of data objects to objects outside the cluster (intercluster similarity) is low. Cluster similarity is measured with respect to the mean value of the objects in the cluster, so K-Means is a squared-error-based algorithm. At the start, K objects are chosen randomly from the dataset, each representing a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean, and the mean of each cluster is then recalculated from the objects assigned to it.
Algorithm:
K-Means:
Input:
K: the number of clusters into which the dataset has to be divided
D: a dataset containing N objects
Output:
A set of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to (i.e., the nearest cluster centre).
3. Update the cluster means, i.e., recalculate the mean of each cluster from the objects currently assigned to it.
4. Repeat steps 2 and 3 until no change occurs.
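As an illustration, here is a minimal K-Means sketch in Python with NumPy. The synthetic example data, the random initialization, and the convergence test are assumptions made for the sketch; in practice a library implementation such as scikit-learn's KMeans would typically be used.

```python
import numpy as np

def k_means(D, K, max_iter=100, seed=0):
    """Minimal K-Means sketch: D is an (N, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K objects from D as initial cluster centres.
    centres = D[rng.choice(len(D), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: (re)assign each object to the nearest cluster centre.
        distances = np.linalg.norm(D[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recalculate the mean of each cluster from its current members.
        new_centres = np.array([
            D[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
            for k in range(K)
        ])
        # Step 4: stop when no change occurs.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Example usage on synthetic data (assumed for illustration).
X = np.vstack([np.random.randn(50, 2) + [0, 0], np.random.randn(50, 2) + [5, 5]])
labels, centres = k_means(X, K=2)
```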
Grid-Based Method in Data Mining
We can use the grid-based clustering method for multi-resolution of grid-based data structure. It is used to quantize the
area of the object into a finite number of cells, which is stored in the grid system where all the operations of Clustering
are implemented. We can use this method for its quick processing time, which is generally independent of the number
of data objects, still dependent on only the multiple cells in each dimension in the quantized space.
There is an instance of a grid-based approach that involves STING, which explores statistical data stored in the grid
cells, and WaveCluster, which clusters objects using a wavelet transform approach. And CLIQUE, which defines a grid-
and density-based approach for Clustering in high-dimensional data space.
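To make the idea concrete, here is a minimal sketch of a simple grid- and density-based scheme in Python (in the spirit of CLIQUE, though not an implementation of it), assuming 2-D data; the cell size and density threshold are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict, deque

def grid_cluster(X, cell_size=1.0, min_points=3):
    """Quantize 2-D points into grid cells, keep dense cells, merge adjacent dense cells."""
    # Quantize the object space into a finite number of cells.
    cells = defaultdict(list)
    for i, point in enumerate(X):
        cells[tuple(np.floor(point / cell_size).astype(int))].append(i)
    # Keep only cells whose density (point count) meets the threshold.
    dense = {c for c, idx in cells.items() if len(idx) >= min_points}
    # Merge adjacent dense cells into clusters (connected components on the grid).
    labels = {}
    cluster_id = 0
    for cell in dense:
        if cell in labels:
            continue
        labels[cell] = cluster_id
        queue = deque([cell])
        while queue:
            cx, cy = queue.popleft()
            for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                if nb in dense and nb not in labels:
                    labels[nb] = cluster_id
                    queue.append(nb)
        cluster_id += 1
    # Map each point to its cell's cluster; points in sparse cells are noise (-1).
    point_labels = np.full(len(X), -1)
    for cell, idx in cells.items():
        if cell in labels:
            point_labels[idx] = labels[cell]
    return point_labels
```

Note that after the points are binned, all further work operates on grid cells rather than individual points, which is why the processing time depends mainly on the number of cells.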
Basics of Grid-Based Methods
When we deal with datasets that have multidimensional characteristics, such as spatial data (geographical information, image data) or datasets with many attributes, a grid-based approach is helpful. Dividing the data space into a grid brings several advantages, described below.
1. Data Partitioning - Partitioning is a clustering approach that divides the data into a number of groups based on the characteristics and similarity of the data. The user specifies the number of partitions (K) to be constructed, and each partition represents a cluster and a particular region of the space. Several well-known algorithms follow the partitioning approach, such as K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).
2. Data Reduction - Data reduction is a data mining technique used to reduce the size of a dataset while still preserving the most important information. It is applied when the dataset is too large to be processed efficiently or when it contains a large amount of irrelevant or redundant information.
3. Local Pattern Discovery - With the help of the grid-based method, we can identify local patterns or trends within the data. By analyzing the data within individual cells, patterns and relationships that remain hidden when the entire dataset is examined at once can be uncovered. This is especially valuable for finding localized phenomena within the data.
4. Scalability - Grid-based methods are known for their scalability. They can handle large datasets, making them particularly useful when dealing with high-dimensional data, since partitioning the space into a fixed number of cells simplifies the analysis.
5. Density Estimation - Density-based clustering is one of the most popular unsupervised learning methodologies used in model building and machine learning. Data points that lie in the low-density regions separating two clusters are considered noise. The neighbourhood within a radius ε of a given object is known as the ε-neighbourhood of the object. If the ε-neighbourhood of an object contains at least a minimum number of objects (MinPts), the object is called a core object (see the sketch after this list).
6. Clustering and Classification - The grid-based mining method divides the space of instances into cells. Clustering techniques are then applied using the cells of the grid, instead of individual data points, as the base units. The biggest advantage of this method is that it improves processing time.
7. Grid-Based Indexing - Grid-based indexing enables efficient access and retrieval of data. These index structures organize the data based on the grid partitions, enhancing query performance and retrieval.
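As referenced in point 5, the following is a minimal sketch of the ε-neighbourhood and core-object test, assuming Euclidean distance; the values of eps and min_pts are illustrative assumptions.

```python
import numpy as np

def core_objects(X, eps=0.5, min_pts=5):
    """Return a boolean mask marking which points are core objects.

    A point is a core object if its eps-neighbourhood (all points within
    distance eps, including itself) contains at least min_pts points.
    """
    X = np.asarray(X)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Count neighbours within radius eps for each point.
    neighbour_counts = (dists <= eps).sum(axis=1)
    return neighbour_counts >= min_pts
```

This is the same core-object definition used by density-based algorithms such as DBSCAN (available, for example, as scikit-learn's DBSCAN with the eps and min_samples parameters).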
Model-based clustering
Model-based clustering is a statistical approach to data clustering. The observed (multivariate) data is assumed to have been generated by a finite mixture of component models. Each component model is a probability distribution, generally a parametric multivariate distribution.
For instance, in a multivariate Gaussian mixture model, each component is a multivariate Gaussian distribution. The
component responsible for generating a particular observation determines the cluster to which the observation belongs.
Model-based clustering attempts to optimize the fit between the given data and some mathematical model and is based on the assumption that the data are generated by a mixture of underlying probability distributions.
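For illustration, a Gaussian mixture model can be fitted with an off-the-shelf EM implementation; the synthetic data and the choice of two components below are assumptions made for the example (this uses scikit-learn's GaussianMixture, not a method prescribed by the text).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-component data (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

# Fit a mixture of two multivariate Gaussian components using EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)             # hard cluster assignment per observation
posteriors = gmm.predict_proba(X)   # P(Ck | Xi) for each component k
```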
There are the following types of model-based clustering −
Statistical approach − Expectation Maximization (EM) is a popular iterative refinement algorithm. It can be viewed as an extension of k-means: it assigns each object to a cluster according to a weight (probability of membership), and new means are computed based on these weighted measures.
The basic idea is as follows −
Start with an initial estimate of the parameter vector.
Iteratively rescore the patterns against the mixture density produced by the parameter vector.
Use the rescored patterns to update the parameter estimates.
Patterns belonging to the same cluster are those that are placed by their scores in the same component.
Algorithm
Initially, assign k cluster centres randomly.
Iteratively refine the clusters based on the following two steps −
Expectation step − Assign each data point Xi to cluster Ck with the probability
P(Xi ∈ Ck) = P(Ck | Xi) = P(Ck) P(Xi | Ck) / P(Xi)
Maximization step − Use the probability estimates from the expectation step to re-estimate (update) the model parameters, for example the cluster means.
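A minimal sketch of this EM iteration for a one-dimensional Gaussian mixture, written with NumPy and SciPy; the number of components, the random initialization, and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(X, k=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture: returns mixing weights, means, std devs."""
    rng = np.random.default_rng(seed)
    # Initial estimate of the parameter vector (weights, means, standard deviations).
    weights = np.full(k, 1.0 / k)
    means = rng.choice(X, size=k, replace=False)
    stds = np.full(k, X.std())
    for _ in range(n_iter):
        # Expectation step: P(Ck | Xi) proportional to P(Ck) * P(Xi | Ck).
        likelihood = np.array([w * norm.pdf(X, m, s)
                               for w, m, s in zip(weights, means, stds)])  # shape (k, N)
        posteriors = likelihood / likelihood.sum(axis=0, keepdims=True)
        # Maximization step: update the parameters from the rescored patterns.
        resp = posteriors.sum(axis=1)              # effective count per component
        weights = resp / len(X)
        means = (posteriors @ X) / resp
        stds = np.sqrt((posteriors * (X - means[:, None]) ** 2).sum(axis=1) / resp)
    return weights, means, stds

# Example usage on synthetic data (assumed for illustration).
X = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
weights, means, stds = em_gmm_1d(X)
```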
Machine learning approach − Machine learning builds complex algorithms for processing huge amounts of data and delivering results to its users. It uses programs that learn from experience and make predictions. The algorithms improve themselves through repeated input of training data. The main objective of machine learning is to learn from the data and build models that can be understood and used by humans.
A well-known example is incremental conceptual clustering (as in COBWEB), which produces a hierarchical clustering in the form of a classification tree. Each node represents a concept and includes a probabilistic description of that concept.
There are some important applications of data mining, including the following:
o Market Basket Analysis: Retailers use data mining to identify the products frequently purchased in combination. This supports targeted marketing, product placement, and store design.
o Customer Segmentation: Organizations use data mining to classify customers based on shared traits or behaviours. This makes it possible to create individualized marketing plans and product recommendations.
o Recommendation Systems: Data mining is frequently applied in recommendation systems, including those used by social networks, e-commerce websites, and streaming platforms. It examines user behaviour and preferences to recommend personalized products, content, or friends.
o Financial Market Forecasting: Data mining is used in finance to forecast future stock prices, currency exchange rates, and market trends by analyzing historical market data, news sentiment, and economic indicators. This is helpful for trading and investment strategies.
o Healthcare Fraud Detection: Data mining is used in the healthcare industry to identify fraudulent billing practices, insurance claims, and unnecessary medical procedures. It aids in spotting unusual patterns that might point to fraudulent activity.
o Churn Prediction: Data mining is used by businesses in sectors like telecommunications and subscription services to forecast which customers are most likely to discontinue their subscriptions. This aids in campaigns to keep customers.
o Credit Scoring: Data mining is used by financial institutions to evaluate a person's or company's creditworthiness. It aids in deciding whether to approve loans and the applicable interest rates.
o Agriculture: To maximize crop yields and reduce resource waste, farmers use data mining to analyze crop data, weather patterns, and soil conditions.