Unit 4 Mining

Cluster analysis is a machine learning technique that groups similar data points into clusters, aiming to maximize inter-cluster distance and minimize intra-cluster distance. Hierarchical clustering organizes unlabelled datasets into a dendrogram, while partitioning methods such as k-means divide data into a predefined number of clusters based on centroids. Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) are supervised learning algorithms used for classification, with SVM focusing on decision boundaries and KNN relying on the similarity of data points.



Cluster Analysis
Cluster analysis is a machine learning technique.
Cluster analysis is a method of data mining that groups similar data points
together. The dataset is unlabelled.
Goal:- The goal of cluster analysis is to divide a dataset into groups called
clusters such that the data points within each group are more similar to each
other than to data points in other groups.
Use:- Cluster analysis is used for data analysis and can help identify
patterns or relationships within the data.
Cluster:- A cluster is a subset of similar objects.
It is used by recommendation systems such as those of Amazon and Netflix to
provide recommendations based on past searches for products and movies,
respectively.
The key objective of a clustering algorithm is that inter-cluster distance is
maximized and intra-cluster distance is minimized.
Properties of Clustering
1. Clustering scalability:- Nowadays, we deal with huge amounts of data and
huge databases. To handle a huge database, the clustering algorithm
should be scalable. If it is not scalable, we cannot get an appropriate
result, and it may lead to wrong results.
2. Dealing with unstructured data:- Some databases contain errors, missing
values, and noisy data. The algorithm should be able to handle such
unstructured data by giving it some structure and organizing it into
groups of similar data objects.
3. Algorithm usability with multiple data types:- Different kinds of data
can be used with a clustering algorithm. It should be capable of dealing
with different types of data.
4. High dimensionality:- The algorithm should be able to handle
high-dimensional data.
5. Interpretability:- The outcome of clustering should be usable, and the
results must be easy to understand.


Hierarchical Clustering Methods


Hierarchical Clustering is an unsupervised machine learning algorithm.
It groups the unlabelled dataset into clusters and is also known as
Hierarchical Cluster Analysis or HCA.
These clusters are displayed in a hierarchical format called a dendrogram,
which is a tree-type structure.
Why do we use Hierarchical Clustering?
In the k-means algorithm, the number of clusters must be known in advance,
and the clusters it creates tend to be of roughly equal size. But if we
have an unlabelled dataset where we do not initially know the number of
clusters, and the clusters are not of equal size, we use Hierarchical
Clustering, which does not need knowledge of a predefined number of
clusters. To overcome these two challenges, we use Hierarchical Clustering.
Types of Hierarchical Clustering
1. Agglomerative Clustering
2. Divisive Clustering
1. Agglomerative Clustering:- It is a bottom-up approach in which the
algorithm starts by taking each data point as a single cluster and keeps
merging clusters until only one cluster is left.

2. Divisive Clustering:- It is a top-down approach, the reverse of
Agglomerative Clustering. We start with one cluster composed of all the
data points and keep dividing it until each cluster contains a single
element.
Dendrogram:- A dendrogram is a type of tree diagram showing the hierarchical
relationship between different sets of data. It records the memory of the
Hierarchical Clustering algorithm, i.e., the order in which clusters were
merged or split.
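A minimal sketch of agglomerative clustering and its dendrogram, using
scipy; the five sample points and the choice of Ward linkage are
illustrative assumptions, not from the notes.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [5, 5]])

# "ward" repeatedly merges the pair of clusters that least increases
# the total within-cluster variance (bottom-up / agglomerative).
merges = linkage(points, method="ward")

dendrogram(merges)            # leaves are points, height = merge distance
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()
```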
Applications of the Hierarchical Clustering Algorithm
1. Identifying fake news:- words associated with fake news, such as false
information, spam, etc., can be clustered together.
2. Identifying criminal activity
3. Document analysis


Partitioning Method
It is a top-down approach to clustering.
In this method, the n data items/objects are divided into k partitions,
where each partition (group) is called one cluster.
The number of clusters k should be less than or equal to n, the total
number of data items.

When we partition, we should follow two rules.


1. Each partition should have at least one object.
2. Each object should belong to only one partition.

Example of a Partitioning Method:- k-means clustering.

k means the number of clusters; clustering means grouping similar types of
data. It is a centroid-based algorithm, where each cluster has a particular
centroid.
Each data point is assigned to the centroid that has the minimum distance
to it among all the centroids.

3
Unit 4

k-means algorithm steps

Step 1. Select the number k to decide the number of clusters.
Step 2. Select k random points as centroids. They may be points other than
those in the input dataset.
Step 3. Assign each data point to its closest centroid, which forms the
predefined k clusters.
Step 4. Calculate the variance and place a new centroid in each cluster.
Step 5. Repeat step 3, i.e., reassign each data point to the new closest
centroid of each cluster.
Step 6. If any reassignment occurs, go to step 4; otherwise, finish.
Step 7. The model is ready.
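A minimal from-scratch sketch of these steps in numpy; the stopping test
and random initialization are one reasonable reading of steps 2-6, and the
sketch assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 3/5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids, axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: new centroid = mean of each cluster (assumes none is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```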

Advantages of the k-means algorithm

1. Simple to implement.
2. It is scalable to huge datasets and also fast on large datasets.
3. The time taken by k-means rises linearly with the number of data
points.

Disadvantages

1. The user needs to specify an initial value of k.
2. It is not suitable for clusters of non-globular shapes or widely
varying sizes.
3. It is sensitive to outliers.
Example:-
S.No  Age  Amount
1     20   500
2     40   1000
3     30   800
4     18   300
5     28   1200
6     35   400
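A sketch of running k-means on this table with scikit-learn; k = 2 is an
assumed choice, and the features are standardized first because "Amount"
would otherwise dominate "Age" in the distance calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[20, 500], [40, 1000], [30, 800],
              [18, 300], [28, 1200], [35, 400]])

X_scaled = StandardScaler().fit_transform(X)   # put both features on one scale
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_)          # cluster id for each of the 6 rows
print(km.cluster_centers_) # centroids in the scaled feature space
```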


Evaluation of Clustering
Clustering evaluation assesses the feasibility and quality of clustering
analysis on a dataset by examining the results produced by a clustering
method on that dataset.
The major tasks of clustering evaluation include the following:
1. assessing clustering tendency,
2. determining the number of clusters in a dataset,
3. measuring cluster quality.
1. Assessing clustering tendency:- Before evaluating or analyzing
clustering performance, it is essential to verify that the dataset under
consideration has a clustering tendency and does not consist of uniformly
distributed points. If the data has no clustering tendency, the clusters
identified by any clustering algorithm may be meaningless.
It is a non-uniform distribution of points in a data collection that makes
clustering meaningful.
2. Determining the number of clusters in a dataset:- Choosing the
appropriate number of clusters is essential for a clustering algorithm. It
needs to strike a balance between compressibility and accuracy. The right
number of clusters can be ambiguous, depending on the distribution's shape,
scale, and the desired clustering resolution.
Elbow method:- A simple heuristic sets the number of clusters to about
√(n/2) for a dataset of n points. A better heuristic computes the sum of
within-cluster variance for increasing numbers of clusters and uses the
turning point (the "elbow") in the curve of that sum with respect to the
number of clusters.
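A sketch of the elbow heuristic: plot the sum of within-cluster variance
(scikit-learn calls it inertia) against k and look for the turning point.
The three synthetic blobs are an illustrative assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs, so the elbow should appear near k = 3
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in (0, 4, 8)])

ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of within-cluster variance (inertia)")
plt.show()
```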
Cross-validation:- It is a technique, borrowed from classification, used to
determine the right number of clusters. First, divide the given dataset
into m parts. Next, use m − 1 parts to build a clustering model and use the
remaining part to test the quality of the clustering.
3. Measuring cluster quality:- After clustering, a number of metrics can be
used to determine how effectively the clustering worked. Ideally, a
clustering has minimum intra-cluster distance and maximum inter-cluster
distance.
There are generally two types of measures: extrinsic and intrinsic.
1. Extrinsic measures, which require ground-truth labels, e.g., the
adjusted Rand index, Fowlkes-Mallows score, etc.
2. Intrinsic measures, which do not require ground-truth labels, e.g.,
the silhouette coefficient, Calinski-Harabasz index, Davies-Bouldin
index, etc.
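A sketch of one intrinsic measure just listed, the silhouette coefficient,
which needs no ground-truth labels; the two synthetic blobs are an
illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(5, 0.5, size=(30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# near +1: tight, well-separated clusters; near 0 or negative: poor clustering
print(silhouette_score(X, labels))
```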


SVM (Support Vector Machine)

Support Vector Machine (SVM) is one of the most popular supervised
learning algorithms.
It is used for classification problems as well as regression problems.
It is generally used for classification problems because it produces more
accurate results when classifying data.
The goal of SVM is to create the best line, or decision boundary, between
the particular classes.
Picture two properly separated classes of points with a middle line between
them: that middle line is the decision boundary generated by the SVM, and
it is how SVM properly classifies the data.
Uses:-
1. Face detection,
2. image classification,
3. text categorization, etc.
Key concepts:-
1. Support vectors:- the data points that are closest to the hyperplane
are called support vectors.
2. Hyperplane:- it is the decision plane or boundary that divides a
particular set of objects into classes.
3. Margin:- the gap between the two lines through the closest data points
of different classes. It is calculated as the perpendicular distance from
the decision boundary to the support vectors. A large margin is considered
a good margin, and a small margin is considered a bad margin.
Types of SVM
1. Linear SVM
2. Non-linear SVM
1. Linear SVM:- Linear SVM is used for linearly separable data, which
means that if a dataset can be classified into two classes using a single
straight line, the classifier is called a linear SVM classifier.
2. Non-linear SVM:- It is used for non-linearly separable data, which
means that if a dataset cannot be classified using a single straight line,
the classifier is called a non-linear SVM classifier.
A kernel is used to classify the data in a non-linear SVM.
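A sketch contrasting a linear SVM with a kernel (non-linear) SVM in
scikit-learn; the XOR-style data, which no single straight line can
separate, is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # opposite quadrants, not linearly separable

linear = SVC(kernel="linear").fit(X, y)   # linear SVM: one straight boundary
rbf = SVC(kernel="rbf").fit(X, y)         # kernel SVM: curved boundary

# the rbf kernel should score far higher on this non-linear data
print(linear.score(X, y), rbf.score(X, y))
```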


Algorithm
1. Select two parallel hyperplanes that separate the data with no points
between them.
2. Maximize the distance (margin) between them.
3. The line midway between them will be the decision boundary.
Advantages
1. Memory efficient.
2. Effective in high-dimensional cases.
3. Offers great accuracy.
Disadvantages
1. High training time.
2. Not suitable for large datasets.
Example:- Suppose we are given the following labelled data points:
(3,1), (3,-1), (6,1), (6,-1), (1,0), (0,1), (0,-1), (-1,0)
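A sketch of fitting a linear SVM to these points, assuming (as in the
classic version of this exercise) that the first four points form the
positive class and the last four the negative class; a large C is used to
approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 1], [3, -1], [6, 1], [6, -1],
              [1, 0], [0, 1], [0, -1], [-1, 0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard-margin SVM
print(clf.support_vectors_)                  # the points closest to the boundary
print(clf.coef_, clf.intercept_)             # boundary is approximately x1 = 2
```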


Lazy Learner: KNN (K-Nearest Neighbours)

KNN is one of the simplest machine learning algorithms, based on the
supervised learning technique.
The KNN algorithm can be used for regression as well as classification
problems, but mostly it is used for classification problems.
The KNN algorithm is also called a lazy learner because it does not learn
from the training set immediately; instead, it stores the data and, at the
time of classification, performs an action on the dataset.
In supervised learning techniques, we train machines using labelled data,
i.e., data which contains both input and output.
The KNN algorithm assumes similarity between the new data and the available
data and puts the new data into the category that is most similar to the
available categories.
Suppose we have an input data point and two available data categories, a
square category and a circle category, and we want to place the input in
one of the two. Using a KNN classifier, it is placed based on similarity:
the input looks like a square, so the KNN classifier places it in the
square category.
In the example, there are two categories of data: category A, containing
the square data, and category B, containing the circle data. To place new
data in either category A or category B using the KNN algorithm, the
algorithm first finds the nearest neighbours of the input data. Suppose the
nearest neighbours of the new data are 2 squares and 1 circle. Among the
three nearest neighbours, two are squares and one is a circle, so the
majority is square, and the new data is placed in category A. Because we
place new data on the basis of its nearest neighbours, it is called the KNN
algorithm.
Working steps of the KNN algorithm:
1. Select the number k of neighbours.
2. Calculate the Euclidean distance from the new point to the data points.
3. Take the k nearest neighbours as per the Euclidean distance.
4. Among these k neighbours, count the number of data points in each
category.


5. Assign the new data point to the category for which the number of
neighbours is maximum.
6. Our model is ready.
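A from-scratch sketch of these steps in numpy: compute Euclidean distances,
take the k closest training points, and return the majority category. The
function name and interface are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # step 2: distances to all points
    nearest = np.argsort(dists)[:k]                   # step 3: k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)      # step 4: count per category
    return votes.most_common(1)[0][0]                 # step 5: majority category
```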
Advantages of KNN
1. It is very simple and easy to implement, because we need only two
parameters: the k value and the Euclidean distance.
2. A good value of k makes it robust to noise.
3. KNN learns a non-linear decision boundary.
4. No training is required, so we can add new data at any time without
having to retrain the model. As a result, it is very fast to set up,
because the training time is zero.
Disadvantages
1. Inefficient at prediction time, since distances to all stored points
must be computed for every query.
2. Does not work well with high-dimensional data.
3. It does not handle categorical features very well.
Example:- Perform the KNN classification algorithm on the following dataset
and predict the class for [maths = 6, computer = 8], where k = 3.
Maths  Computer  Result
4      3         F
6      7         P
7      8         P
5      5         F
8      8         P
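A sketch of solving this example with scikit-learn. By hand, the Euclidean
distances from (6, 8) are roughly 5.39, 1.00, 1.00, 3.16, and 2.00, so the
three nearest rows are (6,7), (7,8), and (8,8), all labelled P, and the
prediction is P.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[4, 3], [6, 7], [7, 8], [5, 5], [8, 8]])
y = np.array(["F", "P", "P", "F", "P"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[6, 8]]))   # -> ["P"], by majority of the 3 nearest rows
```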


Classification
It is a task in data mining that involves assigning a class label to each
instance in a dataset based on its features.
The goal of classification is to build a model that accurately predicts the
class label of new instances based on their features.
A classification technique is a systematic approach to building a
classification model from an input dataset; the model is then used to
identify the class label.
There are two steps in classification:
1. model construction, and
2. model usage.

Decision Tree
It is a supervised learning technique.
With the help of decision trees, we can solve both regression and
classification problems.
Decision trees let us take decisions.
Decision trees represent classification models and regression models in
the form of tree structures, where internal nodes represent the features of
a dataset, branches represent the decision rules, and leaf nodes represent
the outcomes.
Decision tree terminologies:- A decision tree contains three types of
nodes:
1. root node,
2. leaf node, and
3. internal node.
1. The root node is the topmost node in the tree; the data inside it is an
attribute. It is represented by a rectangle.
2. A leaf node is the final output node and cannot be further split. A
leaf node is represented by a circle.
3. An internal node represents a feature of the dataset. It is represented
by a rectangle.
Algorithm
1. Create a root node and assign all the training data to it.

10
Unit 4

2. Select the best splitting attribute according to certain criteria.
3. Add a branch to the root node for each value of the split.
4. Split the data into mutually exclusive subsets along the lines of the
specific split.
5. Repeat steps 2 to 4 for each leaf node until the stopping criteria are
reached.
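A minimal sketch of this loop using scikit-learn, which builds the tree by
repeatedly choosing the best splitting attribute; the iris dataset and the
depth limit are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" uses the Gini coefficient mentioned below to pick splits;
# max_depth=3 is one simple stopping criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)

print(export_text(tree))   # root, internal, and leaf nodes as indented rules
```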
Splitting criteria
Which variable to use for the first split? How should one determine the
most important variable for the first branch, and subsequently for each
subtree? Algorithms use different measures, such as least error,
information gain, and the Gini coefficient, to choose the splitting
variable that provides the most benefit.
What value to use for the split? If the variable has continuous values,
such as 4, 8, etc., what value ranges should be used?
How many branches should be allowed for each node? There could be binary
trees with just two branches at each node, or more branches may be allowed.
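A sketch of two of the splitting measures named above, Gini impurity and
entropy (the quantity behind information gain), computed for a set of class
labels; the helper names and sample labels are illustrative.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)       # 0 for a pure node; higher = more mixed

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))    # used to compute information gain

y = np.array(["Y", "Y", "N", "Y", "N"])
print(gini(y), entropy(y))            # impurity of this 3-Y / 2-N node
```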
Stopping criteria
1. When should we stop building the tree?
2. When a certain depth of branching has been reached?
3. When the error level at every node is within a predefined tolerable
level?
Pruning:- It is the act of reducing the size of a decision tree by removing
the fraction of the tree that provides little value.
Goal:- to make the decision tree more balanced, more general, and more
easily usable.
There are two approaches to avoiding overfitting:
1. Pre-pruning and
2. Post-pruning
1. Pre-pruning means halting tree construction early, when certain criteria
are met.
Disadvantage:- it is difficult to decide what criteria to use for halting
construction, because the right stopping criteria are not known in advance.
2. Post-pruning means removing branches or subtrees from a fully grown
tree. This method is more commonly used.


Advantages of a decision tree

• Simple to understand.
• Less requirement for data cleaning compared to other algorithms.
• It can be very useful for solving decision-related problems.
• It solves both classification and regression problems.
Disadvantages
a. For more class labels, the computational complexity of the decision
tree may increase.
b. Overfitting issues.
c. The decision tree can contain many layers, which makes it complex.
Day  Weather  Temperature  Humidity  Wind    Play
1    Sunny    Hot          High      Weak    N
2    Cloudy   Hot          High      Weak    Y
3    Cloudy   Mild         High      Strong  Y
4    Rainy    Mild         High      Strong  N
5    Sunny    Mild         Normal    Strong  Y
6    Rainy    Cold         Normal    Strong  N
7    Rainy    Mild         High      Weak    Y
8    Sunny    Hot          High      Strong  N
9    Cloudy   Hot          Normal    Weak    Y
10   Rainy    Mild         High      Strong  N
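A sketch of fitting a decision tree to this play/weather table. The
categorical columns are one-hot encoded first, which is one standard way to
feed them to scikit-learn; the encoding step itself is not part of the
notes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "weather":     ["Sunny", "Cloudy", "Cloudy", "Rainy", "Sunny",
                    "Rainy", "Rainy", "Sunny", "Cloudy", "Rainy"],
    "temperature": ["Hot", "Hot", "Mild", "Mild", "Mild",
                    "Cold", "Mild", "Hot", "Hot", "Mild"],
    "humidity":    ["High", "High", "High", "High", "Normal",
                    "Normal", "High", "High", "Normal", "High"],
    "wind":        ["Weak", "Weak", "Strong", "Strong", "Strong",
                    "Strong", "Weak", "Strong", "Weak", "Strong"],
})
y = ["N", "Y", "Y", "N", "Y", "N", "Y", "N", "Y", "N"]

X = pd.get_dummies(df)   # one-hot encode the categorical attributes
tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```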

